[SIG-IX] [Fwd: IXP Switching Wishlist v3.0 draft]

  • To: IX SIG <sig-ix at lists dot apnic dot net>
  • Subject: [SIG-IX] [Fwd: IXP Switching Wishlist v3.0 draft]
  • From: Philip Smith <pfs at cisco dot com>
  • Date: Wed, 04 May 2005 23:46:01 +1000
  • List-archive: <http://www.apnic.net/mailing-lists/sig-ix>
  • Organization: Cisco Systems Inc
  • User-agent: Mozilla Thunderbird 1.0.2 (Windows/20050317)
      For those of you who do not participate in the RIPE EIX Working Group,
      attached is the current draft of the IX Switch Wishlist.
      If you have any comments or input, please send them to Mike Hughes.
      ------------ Forwarded Message ------------
      Date: 04 May 2005 11:42 +0100
      From: Mike Hughes <mike at linx dot net>
      To: eix-wg at ripe dot net
      Subject: IXP Switching Wishlist v3.0 draft
      Hi all,
      Here is a current draft of the wishlist, containing all recent feedback and
      Key changes:
      * Removed sections on MAC-SPF and MPLS options - these don't seem to be
      gaining any traction, and don't seem as relevant anymore given the
      improvement in things such as RSTP.
      * Added section on VLAN tag space issues - tag rewrite or "virtual bridges"
      * Added items on 240V AC power, and Environmental monitoring
      * Added further detail to filtering, mac locking, multicast, and security.
      Let me know if you have any further comments, otherwise, hope to see you
      tomorrow between 11.00 and 12.30.
      Mike Hughes     Chief Technical Officer  London Internet Exchange
      mike at linx dot net   http://www.linx.net/
           "Only one thing in life is certain: init is Process #1"
      ---------- End Forwarded Message ----------
      RIPE European Internet Exchange (EIX) Working Group
      Internet Exchange Point Switching "Wishlist"
      Version 3.0 DRAFT
      Edited by Mike Hughes <mike at linx dot net>
      London Internet Exchange
      May 2005
      At the RIPE meeting held in Amsterdam in February 2000, a number of
      participants agreed that the group should produce a "wishlist" to guide
      equipment manufacturers when producing boxes aimed at the core switching
      market. Over the coming months, ideas were collected from the EIXP
      community to form the basis of this document.
      In Europe, most Internet Exchange Points use a shared switch fabric to
      which the participants connect. Organisations then arrange peering via
      bi-lateral peering agreements. It is not compulsory for all participants to
      peer with every other participant (called multi-lateral peering).
      Once two participants agree to peer, they will set up BGP4 sessions
      between their routers connected to the Exchange to exchange routes and
      traffic. In the majority of cases, the Exchange Point operator does not
      become involved in the routing of any traffic across the Exchange, and
      chooses to leave this to the participants.
      For this reason, switched Ethernet has become one of the most common
      choices for Exchange Point media. The main reasons behind this are:
      	* Cost effectiveness
      	* Simplicity of setup
      	* Can use standard CAT5 wiring - easy to implement and maintain
      	* Interfaces available across a wide range of platforms
      With the growth of the Internet, more and more traffic is being routed to
      Internet Exchange points, and the importance of IXPs has grown in line with
      this, especially in Europe where private peering is less common than in
      North America.
      The IXP operators feel that having the right tools and features
      implemented in the equipment they deploy will play an important part of
      scaling ethernet technology to meet the demands placed upon Exchange Points.
      This is an informational document to outline the various features which
      IXPs would like to see implemented in core Ethernet Switching products.
      a) Control of dynamic MAC learning
      Currently, switches are provided with two options, either statically
      configured or dynamically learned forwarding information.
      Exchange Points like to monitor and control how many MAC addresses are
      connected to a participant's port. The XP operator generally does not
      desire ad-hoc extensions connected to their network. The common way of
      managing this is to enforce a "router-only" or "limited MAC address" rule.
      This is currently controlled by statically configuring forwarding
      information, or not controlled, but policed by counting the number of
      MAC addresses learned on each port, and action taken against offenders.
      Static configuration of forwarding information is a somewhat inelegant
      option, as this increases configuration overhead, and decreases
      flexibility, especially in case of emergencies.
      We propose a configurable "maximum learning" limit, configurable on a
      per port basis. In this way, operators can configure participants' ports
      according to their house rules, but retain the flexibility of dynamic
      learning.
      The filtering should not impose a performance hit on those ports which are
      The lockdown should automatically flush when there is a state transition on
      the interface.
      There should be multiple levels of locking:
      * "Forwarding" limit - the maximum number of addresses you will forward for
      on this port
      * "Soft" limit - the limit at which you will record syslog events
      * "Hard" limit - the limit at which you shut down the port (drop link if
      able) and record a syslog event
      A hard limit should require manual intervention to reset the locking and
      bring the port back up.
      This feature must include a "first in-last out" mechanism in the lockdown
      facility, to avoid forwarding information for valid addresses being
      overwritten by addresses in excess of the exchange's house rules.
      All the above locking, timeout, and reset rules should be configurable by the
      network operator.
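
      As a rough illustration of the locking levels described above, the
      following Python sketch models a single port. The class name, method
      names and limit values are hypothetical and are not taken from any
      vendor implementation; it only shows how the forwarding, soft and hard
      limits and the "first in-last out" rule might interact.

```python
class PortMacLimiter:
    """Toy model of per-port MAC learning limits (illustrative only)."""

    def __init__(self, forwarding_limit, soft_limit, hard_limit):
        self.forwarding_limit = forwarding_limit
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.forwarded = []    # addresses we forward for, in the order learned
        self.seen = set()      # every distinct source address observed
        self.shutdown = False
        self.log = []          # stand-in for syslog events

    def learn(self, mac):
        """Process a source MAC; return True if we forward for it."""
        if self.shutdown:
            return False
        if mac in self.forwarded:
            return True
        self.seen.add(mac)
        # First in, last out: earlier valid entries are never overwritten.
        if len(self.forwarded) < self.forwarding_limit:
            self.forwarded.append(mac)
        if len(self.seen) >= self.hard_limit:
            self.shutdown = True
            self.log.append("hard limit reached, port shut down")
            return False
        if len(self.seen) >= self.soft_limit:
            self.log.append("soft limit reached: %s" % mac)
        return mac in self.forwarded

    def link_transition(self):
        """A link state change flushes the learned addresses."""
        self.forwarded.clear()
        self.seen.clear()

    def manual_reset(self):
        """Operator intervention is required to clear a hard-limit shutdown."""
        self.shutdown = False
        self.link_transition()
```

      In this sketch a port created with limits (1, 2, 4) keeps forwarding for
      the first address it learned, logs from the second distinct address
      onwards, and shuts down on the fourth until manual_reset() is called.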
      b) Disable acting on STP BPDU information
      Many exchange operators currently deploy Spanning Tree Protocol (STP) in
      networks which contain redundant links/full meshing.
      There is, however, a danger presented by STP information leaked from a
      participant's network. The participant may have connected a poorly
      configured switch/router product, and may be leaking their STP
      information into that of the exchange.
      We would wish to see a configurable option to allow STP information to be
      ignored, and filtered in hardware on "edge ports", on a per port basis.
      There should also be an option to generate traps or log messages based on
      transgressions of the policy.
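
      The per-port behaviour requested here can be sketched in a few lines of
      Python. This is a software model only (the wish is for filtering in
      hardware), and the function and parameter names are invented for the
      example. It relies on the fact that IEEE 802.1D BPDUs are addressed to
      the bridge group address 01:80:C2:00:00:00.

```python
STP_BRIDGE_GROUP = bytes.fromhex("0180c2000000")  # IEEE 802.1D bridge group address


def handle_frame(frame, edge_port, events):
    """Return True to forward the frame, False to drop it.

    frame     -- raw Ethernet frame as bytes (destination MAC first)
    edge_port -- True if the ingress port is configured as an edge port
    events    -- list standing in for the trap/syslog channel
    """
    dst = frame[0:6]
    if edge_port and dst == STP_BRIDGE_GROUP:
        # Policy transgression: ignore the BPDU and record the event.
        events.append("BPDU received on edge port, dropped")
        return False
    return True
```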
      c) Wire-speed ACL-type filtering based on L2/L3 header info
      The ability to look into the layer 2 or 3 header information of a packet,
      and selectively monitor, or filter, based on certain layer 2 or 3 criteria.
      This could be done using pattern matching or masking.
      A common example of an area where this is desirable is to permit frames of
      only certain ethertypes to enter the network through an edge port.
      This sort of filtering should be implemented in hardware wherever possible,
      and not have an effect on the forwarding performance of the system. Where
      this is not possible, it must be clearly documented.
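
      To make the ethertype example concrete, here is a minimal Python sketch
      of such a filter. The permitted set (IPv4, ARP, IPv6) is an assumed
      example policy, not one taken from this document, and the function
      names are hypothetical; a real switch would do this matching in
      hardware.

```python
import struct

# Assumed example policy: IPv4, ARP and IPv6 only.
ALLOWED_ETHERTYPES = {0x0800, 0x0806, 0x86DD}


def ethertype(frame):
    """Return the EtherType of an Ethernet II frame, skipping one 802.1Q tag."""
    (etype,) = struct.unpack_from("!H", frame, 12)
    if etype == 0x8100:  # 802.1Q tag present, real EtherType follows the tag
        (etype,) = struct.unpack_from("!H", frame, 16)
    return etype


def permit(frame):
    """Return True if the frame's EtherType is in the permitted set."""
    return ethertype(frame) in ALLOWED_ETHERTYPES
```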
      d) ARP/Broadcast snooping and control
      Many exchange points insist on participants using IP addresses
      assigned by the exchange operator. It is desirable for the operator to be
      able to monitor/restrict "off-net" ARP.
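
      A check for "off-net" ARP could look roughly like the Python sketch
      below. The peering LAN prefix used here is a documentation prefix chosen
      purely for illustration, and the function name is hypothetical.

```python
import ipaddress

# Assumed peering LAN; 192.0.2.0/24 is a documentation prefix.
PEERING_LAN = ipaddress.ip_network("192.0.2.0/24")


def arp_is_on_net(sender_ip, target_ip, lan=PEERING_LAN):
    """Return True if both addresses in the ARP packet are inside the LAN."""
    return (ipaddress.ip_address(sender_ip) in lan
            and ipaddress.ip_address(target_ip) in lan)
```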
      As Ethernet is a broadcast medium, broadcast storms have been known to
      bring exchanges to their knees, affecting the forwarding abilities of both
      the exchange's switches, and the participants' routers.
      Monitoring/rate limiting/control of Ethernet broadcast frames is
      Most exchanges also forbid the speaking of interior routing protocols
      across their peering network. Since these take the form of broadcast or
      multicast frames on ethernet, control would help monitor this type of
      Such control should be able to distinguish (through appropriate
      configuration) between legitimate ARP and genuine broadcast storms.
      There should be suitable configuration knobs to be able to rate limit, shut
      down, log exceptions, etc.
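
      One common way to express a rate limit with a burst allowance is a
      token bucket. The Python sketch below is one possible model of a
      per-port broadcast limiter; the class name, rate and burst values are
      assumptions for the example, and this document does not prescribe any
      particular algorithm.

```python
class BroadcastLimiter:
    """Token bucket: 'rate' frames per second, with bursts up to 'burst'."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0
        self.dropped = 0   # counter an operator could poll or log

    def allow(self, now):
        """Return True if a broadcast frame arriving at time 'now' may pass."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False
```

      A short burst of legitimate ARP passes untouched, while a sustained
      storm is clipped to the configured rate and shows up in the drop
      counter.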
      e) Support for MARP?
      We had looked at seeking support for something such as MARP - "MultiAccess
      Reachability Protocol". This was defined in an Internet Draft, which allows
      for detection of failures at L2 across multiple switch hops.
      However, this has not been published as an RFC, as Cisco has
      asserted IPR over the content of the draft, and its use would therefore
      require licensing. This seems to have somewhat killed off this promising
      idea.
      f) Policy exception logging
      In the above paragraphs, we have asked for some policy-based tools.
      Operators need to know when these policies have been breached.
      Good logging of policy exceptions needs to be implemented:
      	* SNMP-trap
      	* Configurable syslog (i.e. which syslog facility to write to)
      	* RFC3164 compliant syslogging
      	* Syslog over SSL
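
      For reference, the syslog facility is carried in the PRI field of an
      RFC3164 message, computed as facility * 8 + severity. The Python sketch
      below formats such a message; the function name, hostname and tag are
      made up for the example, and UTC is used instead of local time only to
      keep the output repeatable.

```python
import time

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def rfc3164_message(facility, severity, hostname, tag, text, when):
    """Format a BSD-syslog style message: <PRI>TIMESTAMP HOSTNAME TAG: MSG."""
    pri = facility * 8 + severity
    t = time.gmtime(when)  # RFC3164 uses local time; UTC here for repeatability
    # The day of month is padded with a space, not a zero.
    stamp = "%s %2d %02d:%02d:%02d" % (
        MONTHS[t.tm_mon - 1], t.tm_mday, t.tm_hour, t.tm_min, t.tm_sec)
    return "<%d>%s %s %s: %s" % (pri, stamp, hostname, tag, text)
```

      With facility local7 (23) and severity warning (4), the PRI value is
      188.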
      g) Access to management interfaces
      In the past, security of management interfaces on Ethernet switching
      products has often been lacking.
      CLI or web interfaces should support authentication using
      username/password pairs, to avoid the use of "password only"
      authentication which implies shared passwords.
      CLI interfaces should also support SSHv2 access, using either
      username/password pair, or public key authentication.
      Web interfaces should be HTTPS/SSL enabled, to avoid passwords being
      passed in the clear over HTTP.
      Support SCP/SFTP for config copy/upload/download, as well as existing
      methods (TFTP/FTP).
      Management interfaces should be able to perform authentication from an
      external source, such as TACACS, RADIUS or LDAP services, as well as
      providing locally held accounts (which have to be retained for
      emergencies).
      All management interfaces, CLI, web and SNMP should be able to benefit from
      access-list control. The access lists should be able to support
      variable-length subnet masks.
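
      The sketch below shows, in Python, what first-match access-list
      evaluation with variable-length masks amounts to. The prefixes are
      documentation ranges and the rule list is an invented example, not a
      recommended policy.

```python
import ipaddress

# Hypothetical management ACL, evaluated top to bottom, first match wins.
MGMT_ACL = [
    ("permit", ipaddress.ip_network("192.0.2.0/26")),
    ("deny", ipaddress.ip_network("192.0.2.0/24")),
    ("permit", ipaddress.ip_network("2001:db8:100::/48")),
]


def acl_permits(source, acl=MGMT_ACL):
    """Return True if 'source' may reach the management interface."""
    addr = ipaddress.ip_address(source)
    for action, net in acl:
        if addr.version == net.version and addr in net:
            return action == "permit"
    return False  # implicit deny when nothing matches
```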
      Ability to disable management interfaces on a per-VLAN basis. Many XP
      operators choose to configure a "management" VLAN, so that all
      management is done out-of-band of the core peering traffic. It is
      desirable to have the management interfaces to listen on the management
      networks only.
      On devices which are designed to support high bandwidth per-slot, such as
      high-density GigE or 10GigE, it is preferable to have a 10/100 Mb
      management port provided on the system, to avoid burning a fast port for
      management.
      h) Port mirroring
      It is sometimes necessary to mirror participants' ports, either because a
      participant is suspected of some inappropriate activity, or to help obtain
      information to debug a problem.
      Not all exchange points have staff on site 24x7, and port mirroring may
      need to be remotely set up, without hands-on intervention on-site.
      The ability to allow any port to mirror any other port with a similar
      or lower speed within the chassis would allow the operator to connect a
      traffic collector/analyser device to a monitoring port, and simply
      configure the switch to mirror a port as desired to the monitoring port.
      i) Statistics and Accounting
      As well as implementing de facto SNMP counters/RMON, also consider
      implementing the following:
      * Per-VLAN traffic statistics
      * sFlow export support (via management interface)
      * Counters based on common ethertypes (IPv4, IPv6, multicast, ARP, etc)
      a) Spanning Tree
      Spanning Tree is currently the only cross-platform dynamic solution
      available to operators of exchange points for dynamically managing multiple
      redundant links in their architecture.
      There are a number of problems with Spanning Tree:
      	* Slow convergence
      		- especially in cases of root bridge re-election
      	* Wasteful of resilient/redundant resources
      		- redundant links are switched off
      		- no traffic sharing
      	* Security concerns (highlighted above)
      As the routes collected at an Exchange Point can be routed all over the
      world, any routing instability can act like dropping a pebble in a pond,
      and will spread around the Internet.
      It's desirable to maintain stable routing sessions across Exchange Point
      LANs to minimise these routing flaps, because of the load they place on routers,
      and the effects of route dampening penalties.
      We believe that being able to declare ports as "end-stations" should
      avoid them being counted in the STP calculation, enable these ports to
      start forwarding more rapidly, and speed overall STP convergence time.
      Rapid spanning tree (IEEE 802.1w) should be implemented
      (http://www.ieee802.org/1/pages/802.1w.html), and results from testing RSTP
      on certain platforms show that for simple topologies with few redundant
      links, sub-second failover and reconvergence is achievable with minimal
      tuning or additional configuration.
      b) Ring Restoration Protocols
      IEEE 802.17 - This is a standards-based version of the technology currently
      used by Cisco called DPT (Dynamic Packet Transport). This consists of a
      counter-rotating ring-system, with spatial reuse and "ring wrapping"
      circuit protection.
      The Cisco version is currently implemented over SONET/SDH media,
      however, the standardised version is being designed to be more media
      agnostic, and the IEEE working group has already elected to provide
      support for Gigabit Ethernet and 10 Gigabit Ethernet.
      Proprietary Protocols - There are a number of proprietary ring protocols,
      such as Extreme's EAPS (published as informational RFC3619), or Foundry's
      They are relatively similar in operation, in that they make assumptions
      about the number of redundant links in a topology (i.e. only one), have a
      concept of master and transit nodes, use a "heartbeat" sent out by the
      master, and topology change messages are passed between the nodes to speed
      network reconvergence (by triggering FDB flushing, and backup port
      unblocking on the master node).
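
      The common pattern described above can be modelled very roughly as
      follows. This Python sketch is generic and does not implement EAPS or
      any other specific protocol; the class name, the missed-heartbeat
      threshold and the action strings are all assumptions made for the
      example.

```python
class RingMaster:
    """Toy master node: blocks its secondary port while the ring is whole."""

    def __init__(self, missed_limit=3):
        self.missed_limit = missed_limit
        self.missed = 0
        self.secondary_blocked = True
        self.actions = []   # stand-in for messages sent to transit nodes

    def heartbeat_interval(self, heartbeat_returned):
        """Call once per interval with whether the heartbeat came back."""
        if heartbeat_returned:
            self.missed = 0
            if not self.secondary_blocked:
                # Ring is whole again: re-block to avoid a loop.
                self.secondary_blocked = True
                self.actions.append("ring healed: block secondary, flush FDB")
        else:
            self.missed += 1
            if self.missed >= self.missed_limit and self.secondary_blocked:
                # Ring is broken: open the backup path.
                self.secondary_blocked = False
                self.actions.append("ring broken: unblock secondary, flush FDB")
```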
      These recovery protocols may become less important as RSTP becomes more
      c) Trunking and Link-Aggregation
      It's become increasingly common for exchange points to become multiple
      switch and multiple site based, and many need to deploy link aggregation to
      handle the volume of interswitch traffic, where it exceeds the maximum
      speed of a single link.
      Most equipment implements load-sharing using either round-robin or
      address-based algorithms.
      In exchange points, many pieces of equipment will have similar MAC
      addresses, especially the first and last bytes (corresponding to vendor and
      slot position on router).
      This causes significant problems if the load-sharing hash does not use
      enough significant bits in the frame. If the hash is only based on part of
      the address, this can result in poor efficiency of load-sharing, and
      "clutching" of traffic on a single link inside a group.
      It's preferable that load-sharing algorithms should consider the whole L2
      address, and where possible the L3 header information, when calculating the
      hash used.
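
      The effect can be demonstrated with a small Python sketch. Both hash
      functions and all MAC addresses below are invented for illustration;
      real switches use their own (often undocumented) hash functions.

```python
from functools import reduce


def narrow_hash(src, dst, links):
    """Hash on the last byte of the destination MAC only."""
    return dst[-1] % links


def wide_hash(src, dst, links):
    """XOR-fold every byte of both MAC addresses before taking the modulus."""
    folded = reduce(lambda a, b: a ^ b, src + dst, 0)
    return folded % links


# Made-up router MACs sharing the same first bytes (vendor) and last byte.
SRC = bytes.fromhex("00d0ff000100")
DSTS = [bytes.fromhex(m) for m in
        ("00d0ba112200", "00d0ba112300", "00d0ba124000", "00d0ba138100")]

narrow_links = {narrow_hash(SRC, d, 4) for d in DSTS}
wide_links = {wide_hash(SRC, d, 4) for d in DSTS}
```

      With these addresses the narrow hash places every flow on the same
      link of a four-link group, while the wider hash spreads them over more
      than one link.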
      Load-sharing of broadcasts and multicast traffic should be implemented.
      This is because behaviour such as forwarding all broadcast/multicast
      traffic out of the "primary" port in a trunk has been observed when
      load-sharing using destination MAC addresses has been implemented.
      IEEE 802.3ad link-aggregation "LACP" should be implemented.
      d) Multicast Control and Containment
      Most switches are configured with IGMP snooping for multicast control.
      However, in an exchange point, with only routers attached, there is no
      IGMP present, only PIM and MSDP, and all multicast packets are flooded out
      of all ports.
      An exchange point, however, is an ideal place for multicast peering to
      happen: inject the traffic once, and it comes out several times (as much as
      is needed, or in the current situation, as much as isn't needed!).
      Cisco developed RGMP (Router Group Management Protocol). This is a
      proprietary technology whereby the router can communicate to the switch
      which multicast groups it wishes to see.
      This remains, despite being released as an informational RFC (RFC3488), a
      vendor specific feature, and a wide range of routing and switching platforms are
      present at many exchange points - both in equipment used by the operator,
      and the participants. These are true multi-vendor environments.
      Therefore, this is not a workable solution for most exchange points,
      whose principles often include "equal treatment" of participants.
      While it may not solve all potential issues with multicast peering,
      implementing PIM-SM snooping and pruning within the switches will
      achieve the traffic containment requirements.
      Where PIM snooping is available, this should not have a negative effect on
      the overall forwarding performance of the system. Where there is a
      performance impact, this and its surrounding caveats shall be clearly
      documented.
      e) VLAN tagspace issues/overlapping
      A serious emerging issue is VLAN tag space overlapping/clashing issues.
      Most metro transport networks can solve this by using q-in-q (tag
      stacking), however, this doesn't apply to shared networks like Internet
      Exchanges.
      Current switches use a 1:1 mapping of 802.1q vlans to bridge groups,
      which is the way 802.1q was probably intended. This mapping should be
      loosened if not abandoned - nowadays there are so many ways to egress an
      ethernet frame from a switch that more and more often we have
      to resort to 'tricks' to put the right label on the right ethernet
      packet going out the right interface.
      This problem is being exacerbated by a number of issues:
      * Increased use of switch router products (e.g. Cisco 7600)
      * Use of switches as "channel-banks" - breaking out higher speed router
      * Use of metro-ethernet, lan extension or Ethernet over MPLS ("Martini")
      circuits to connect to the IXP
      We think there are two (fairly similar) approaches to solving this:
      * Basic VLAN tag rewrite
      * Separate the tag from the virtual bridge instance
      VLAN tag rewrite is, as its name suggests, being able to rewrite a dot1q
      tag on a specific interface to a VLAN ID on the switch. This would need to
      be implemented on both ingress and egress.
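
      As a minimal sketch of the idea, the Python model below keeps a
      per-interface mapping between the tag seen on the wire and the VLAN ID
      used inside the switch. The class name and the VLAN numbers are
      hypothetical.

```python
class TagRewritePort:
    """Per-interface mapping between wire tags and internal VLAN IDs."""

    def __init__(self, wire_to_internal):
        self.ingress = dict(wire_to_internal)
        # The egress table is simply the inverse of the ingress table.
        self.egress = {v: k for k, v in wire_to_internal.items()}

    def on_ingress(self, wire_vid):
        """Return the internal VLAN ID, or None if the tag is not mapped."""
        return self.ingress.get(wire_vid)

    def on_egress(self, internal_vid):
        """Return the tag to put on the wire, or None if not mapped."""
        return self.egress.get(internal_vid)
```

      Two participants can then reach the same internal VLAN even though
      each uses a different tag on its own link.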
      The other option is complete separation of VLAN ID from the virtual bridges
      inside the switch. You assemble a framework where you can place untagged
      ports, tagged ports, q-in-q tagged ports, mpls endpoints, atm vc's all
      together into the same virtual bridge. Effectively a bridge group which
      can contain any number of these sort of entities.
      f) Link failure detection
      Link failure detection should be implemented, and should look like:
      * UDLD - Uni-Directional Link Detection
      * LFN - Link Failure Notification
      This avoids the risk of an ethernet link going "one-way" and fooling the
      restoration protocols into believing that the link is working, when
      really it isn't.
      There should be reasonable environmental monitoring provided:
      * Temperature sensors
      * Fan health sensors
      * Power supply health sensors
      There should be exception logging via SNMP trap and syslog (as specified
      above) of any incidents.
      It should also be possible to shut down a malfunctioning element in the
      system (automatic, user configurable, or manual), in order to preserve
      system health.
      For example, a power supply failing in a system could cause an instability
      in the device. If the system could make a decision to shut that power
      supply down, and assuming a redundant configuration, the switch would then
      operate in a stable condition until such time that the power supply could
      be exchanged.
      IXPs are high-uptime environments. The equipment used in an IXP needs to be
      able to satisfy this requirement, in terms of redundancy, and hot-swappable
      * Hot swap of management/switch fabric cards with instantaneous failover to
      any installed redundancy (not rebooting onto the "backup")
      * Full redundancy of PSUs, and hot-swap (i.e. box should run on 50% of
      * Rapid booting and card startup (after all, much functionality is
      implemented in the ASIC hardware)
      * GBIC/SFP-optics for flexibility, easy replacement, and maximised port
      utilisation (freedom to choose SX/LX, etc)
      * "Coloured" (DWDM/CWDM) GBIC/SFP/Xenpak/XFP (etc, etc) support
      * Vendor "lockdown" of pluggable interfaces should either not be
      implemented, or be able to be switched off in configuration.
      * 220-240V AC power options. Unlike most telco-managed facilities, the
      carrier-neutral facilities common in Europe do not provide indigenous 48V
      DC power. Power distribution is done using the regular utility supply
      voltage in that country - usually ~230V AC in EU countries.
      * Cable testing functionality in copper ports, and optical power metering
      in optical ports.
      Thanks are due to the "usual suspects" in the RIPE EIX Working Group, but
      specifically Christian Panigl, Kurtis Lindqvist, Keith Mitchell, Daniele
      Arena, and Remco Van Mook, for their contributions to this document.