Network Enhancers - "Delivering Beyond Boundaries"

Saturday, February 14, 2015

Introducing “6-pack”: the first open hardware modular switch


As Facebook’s infrastructure has scaled, we’ve frequently run up against the limits of traditional networking technologies, which tend to be too closed, too monolithic, and too iterative for the scale at which we operate and the pace at which we move. Over the last few years we’ve been building our own network, breaking down traditional network components and rebuilding them into modular disaggregated systems that provide us with the flexibility, efficiency, and scale we need.

We started by designing a new top-of-rack network switch (code-named “Wedge”) and a Linux-based operating system for that switch (code-named “FBOSS”). Next, we built a data center fabric, a modular network architecture that allows us to scale faster and easier. For both of these projects, we broke apart the hardware and software layers of the stack and opened up greater visibility, automation, and control in the operation of our network.

But even with all that progress, we still had one more step to take. We had a TOR, a fabric, and the software to make it run, but we still lacked a scalable solution for all the modular switches in our fabric. So we built the first open modular switch platform. We call it “6-pack.”






The platform

The “6-pack” platform is the core of our new fabric, and it uses “Wedge” as its basic building block. It is a full mesh non-blocking two-stage switch that includes 12 independent switching elements. Each independent element can switch 1.28Tbps. We have two configurations: One configuration exposes 16x40GE ports to the front and 640G (16x40GE) to the back, and the other is used for aggregation and exposes all 1.28T to the back. Each element runs its own operating system on the local server and is completely independent, from the switching aspects to the low-level board control and cooling system. This means we can modify any part of the system with no system-level impact, software or hardware. We created a unique dual backplane solution that enabled us to create a non-blocking topology.




We run our networks in a split control configuration. Each switching element contains a full local control plane on a microserver that communicates with a centralized controller. This configuration, often called hybrid SDN, provides us with a simple and flexible way to manage and operate the network, leading to great stability and high availability.

The only common elements in the system are the sheet metal shell, the backplanes, and the power supplies, which makes it very easy for us to change the shell to create a system of any radix with the same building blocks.

Below you can see the high-level “6-pack” block diagram and the internal network data path topology we picked for the “6-pack” system.



The line card

If you’re familiar with “Wedge,” you probably recognize the central switching element used on that platform as a standalone system utilizing only 640G of the switching capacity. On the “6-pack” line card we leveraged all the “Wedge” development efforts (hardware and software) and simply added the backside 640Gbps Ethernet-based interconnect. The line card has an integrated switching ASIC, a microserver, and server support logic to make it completely independent and to make it possible for us to manage it like a server.






The fabric card

The fabric card is a combination of two line cards facing the back of the system. It creates the full mesh locally on the fabric card, which in turn enables a very simple backplane design. For convenience, the fabric card also aggregates the out-of-band management network, exposing an external interface for all line cards and fabrics.






Bringing it together

With “6-pack,” we have created an architecture that enables us to build any size switch using a simple set of common building blocks. And because the design is so open and so modular – and so agnostic when it comes to switching technology and software – we hope this is a platform that the entire industry can build on. Here's what we think separates “6-pack” from the traditional approaches to modular switches:




“6-pack” is already in production testing, alongside “Wedge” and “FBOSS.” We plan to propose the “6-pack” design as a contribution to the Open Compute Project, and we will continue working with the OCP community to develop open network technologies that are more flexible, more scalable, and more efficient.


Monday, May 27, 2013

Differences between Rapid STP (802.1w) and the legacy STP (802.1d)


The following outlines the main differences between Rapid STP (802.1w) and the legacy STP (802.1d):

BPDU generation
- STP (802.1d): In a stable topology only the root generates BPDUs, and the other bridges relay them.
- RSTP (802.1w): In a stable topology all bridges generate BPDUs every Hello interval (2 sec); the BPDUs act as a keepalive mechanism.

Port states
- STP (802.1d): Disabled, Blocking, Listening, Learning, Forwarding.
- RSTP (802.1w): Discarding (replaces Disabled, Blocking and Listening), Learning, Forwarding.
- Note: to avoid flapping, it takes 3 seconds for a port to migrate from one protocol to the other (STP/RSTP) on a mixed segment.

Port roles
- STP (802.1d): Root (Forwarding), Designated (Forwarding), Non-Designated (Blocking).
- RSTP (802.1w): Root (Forwarding), Designated (Forwarding), Alternate (Discarding), Backup (Discarding).

Edge ports
- STP (802.1d): Additional configuration is needed to make an end-node port a PortFast port (in case a BPDU is received).
- RSTP (802.1w): The edge port (end-node port) is integrated into the standard, and the link type is derived from the duplex setting: point-to-point for full duplex, shared for half duplex.

Topology changes and convergence
- STP (802.1d): Uses timers for convergence (advertised by the root): Hello (2 sec), Max Age (20 sec = 10 missed hellos), Forward Delay (15 sec).
- RSTP (802.1w): Introduces a proposal and agreement process for synchronization (< 1 sec). Hello, Max Age and Forward Delay are kept only for backward compatibility with standard STP; an RSTP port that receives STP (802.1d) messages behaves as standard STP.

Port state transition
- STP (802.1d): Slow transition (50 sec): Blocking (20s) => Listening (15s) => Learning (15s) => Forwarding.
- RSTP (802.1w): Faster transition on point-to-point and edge ports only; fewer states (no Listening state). A bridge doesn't wait to be informed by others; instead it actively looks for possible failures using a feedback mechanism (RLQ, Request Link Query).

BPDU flags
- STP (802.1d): Uses only 2 bits of the flags octet: bit 7 (Topology Change Acknowledgment) and bit 0 (Topology Change).
- RSTP (802.1w): Uses the other 6 bits of the flags octet (BPDU type 2 / version 2): bit 1 (Proposal), bits 2-3 (Port role), bit 4 (Learning), bit 5 (Forwarding), bit 6 (Agreement), with bits 0 and 7 (TC and TCA) kept for backward compatibility.

Topology change propagation
- STP (802.1d): The bridge that discovers a change informs the root, which in turn informs all other bridges by sending BPDUs with the TC bit set and instructs them to clear their DB entries after a short timer (~Forward Delay) expires.
- RSTP (802.1w): The TC is flooded through the network; every bridge generates a TC, informs its neighbors as soon as it becomes aware of a topology change, and immediately deletes its old DB entries.

Failure detection
- STP (802.1d): If a non-root bridge doesn't receive a Hello for 10 x Hello (as advertised by the root), it starts claiming the root role by generating its own Hellos.
- RSTP (802.1w): A bridge waits for only 3 missed Hellos on its root port (as advertised by the root) before deciding to act.

Database flushing
- STP (802.1d): A bridge waits until the TC reaches the root and the short timer (~Forward Delay) expires, then flushes all of its DB entries.
- RSTP (802.1w): A bridge immediately deletes its local DB entries, except the MAC addresses learned on the port that received the topology change (proposal).
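
For a quick illustration (this is not from the original comparison, and the interface is hypothetical), enabling the 802.1w-based mode on a Cisco IOS switch is a single global command, with edge ports marked explicitly:

Switch(config)# spanning-tree mode rapid-pvst
! Edge (end-node) ports transition straight to forwarding
Switch(config)# interface GigabitEthernet0/1
Switch(config-if)# spanning-tree portfast
! Full-duplex switch-to-switch links are treated as point-to-point
Switch(config-if)# spanning-tree link-type point-to-point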

 

Wednesday, April 24, 2013

EVC framework : Flexible Service Mapping



Configuring service instances using the EVC framework : Flexible Service Mapping.

Probably the biggest advantage of the EVC framework is the ability to support multiple services per physical port. This means that under a single physical port you can have any of the following mixed together :

- 802.1q trunk
- 802.1q tunnel
- Local connect
- Scalable EoMPLS (EoMPLS xconnect)
- Multipoint Bridging (L2 bridging)
- Multipoint Bridging (VPLS, SVI-based EoMPLS)
- L3 termination

Besides all of the above, the EVC framework lets you combine multiple different services from different physical ports (for example when using multipoint bridging, a.k.a. bridge-domains) in order to put them into the same virtual circuit.


 
 
 



Local Connect is a L2 point-to-point service between two service instances on the same system. The service instances can be under the same port (hair-pinning) or under different ports. In contrast with traditional L2 bridging, Local Connect doesn't use any MAC learning and it's strictly between 2 points. It also doesn't require any global VLAN resource.

In order to have the following two service instances connect to each other by a L2 point-to-point service, you need first to remove their difference, which is the outer tag (you can also remove both tags).

interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10 second-dot1q 100
  rewrite ingress tag pop 1 symmetric

interface Gi1/2
 service instance 20 ethernet
  encapsulation dot1q 20 second-dot1q 100
  rewrite ingress tag pop 1 symmetric

! EVC-LC-10-20 is just a name for this point-to-point connection
connect EVC-LC-10-20 Gi1/1 10 Gi1/2 20

Note : You can use the same service instance number under different physical ports.

In order to have the following two service instances be connected by Local Connect, you don't need any VLAN tag rewrite, because they both have the same vlans.

interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10-20

interface Gi1/2
 service instance 20 ethernet
  encapsulation dot1q 10-20

connect EVC-LC-10-20 Gi1/1 10 Gi1/2 20


In order to have the following two service instances be connected by Local Connect, you can either translate the vlan on one of them, or remove the tags on both of them.

interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10
  rewrite ingress tag translate 1-to-1 dot1q 20 symmetric

interface Gi1/2
 service instance 20 ethernet
  encapsulation dot1q 20

connect EVC-LC-10-20 Gi1/1 10 Gi1/2 20


Scalable EoMPLS or EoMPLS xconnect is a L2 point-to-point service between two service instances on different systems. Like Local Connect it doesn't use any MAC learning and it's solely between 2 points. It also doesn't require any global VLAN resource (this applies to scalable EoMPLS only; for SVI-based EoMPLS check VPLS below).

You can have any VLAN tag rewrite configuration under the service instances, as long as you keep in mind the following :

a) If both sides are EVC based, then you need to have common VLAN tag rewrite configurations on both sides
b) If one side is not EVC based, then depending on whether it's a physical interface or a subinterface, you'll probably need to remove one tag from the EVC side (subinterfaces remove tags by default)

Note : By default, VC type 5 is used for EoMPLS. In case VC type 4 is negotiated and used, an additional tag will be added after the VLAN tag rewrite configuration and before the data gets EoMPLS encapsulated.

7600-1
interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10
  xconnect 1.1.1.2 10 encapsulation mpls

7600-2
interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10
  xconnect 1.1.1.1 10 encapsulation mpls


Note : Have a look at Scalable EoMPLS for additional information.

Multipoint Bridging uses the concept of bridge-domains. A bridge-domain (BD) is like a traditional L2 broadcast domain where MAC-based forwarding is used for communication between participants (I'll try to write a new post with more details about bridge-domains). Bridge-domains use global VLAN resources.

In the following example, three service instances are put into the same bridge-domain by translating the tags where necessary.

interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10
  rewrite ingress tag translate 1-to-1 dot1q 20 symmetric
  bridge-domain 20

interface Gi1/2
 service instance 20 ethernet
  encapsulation dot1q 20
  bridge-domain 20

interface Gi1/3
 service instance 30 ethernet
  encapsulation dot1q 30
  rewrite ingress tag translate 1-to-1 dot1q 20 symmetric
  bridge-domain 20


The bridge-domain ID represents the global VLAN used in the system. Extra care needs to be taken in case of L2 trunk/tunnel ports participating in a bridge-domain :

a) L2 trunk ports remove automatically the tag on ingress and add it automatically on egress. Equivalent EVC ports need that to be done manually by using the appropriate rewrite actions.
b) L2 tunnel ports add a new tag on ingress and remove it on egress. Equivalent EVC ports do not need any similar rewrite actions, because by default bridge-domains add a new tag on top of the already existing one.

In the following example two ports (a L2 trunk port and an EVC port) are put into the same bridge-domain (Vlan 20). Tag 10 needs to be removed from the EVC port before it joins bridge-domain 20.

interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10
  rewrite ingress tag pop 1 symmetric
  bridge-domain 20

interface Gi2/1
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 20
 switchport mode trunk


In the following example two ports (a L2 tunnel port and an EVC port) are put into the same bridge-domain (Vlan 20). On the EVC port, tag 20 is added on top of tag 10 in order to have the incoming frames join bridge-domain 20.

interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10
  bridge-domain 20

interface Gi2/1
 switchport access vlan 20
 switchport mode dot1q-tunnel


VPLS or SVI-based EoMPLS can be accomplished by configuring xconnect under an SVI. This SVI is the same as the one defined by the bridge-domain ID.

interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10-20
  bridge-domain 30

interface Gi1/2
 service instance 10 ethernet
  encapsulation dot1q 10-20
  bridge-domain 30

interface Vlan 30
  xconnect 1.1.1.2 10 encapsulation mpls


By adding "split-horizon" after the bridge-domain ID in both service instances, there can be no L2 communication between them.

interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10-20
  bridge-domain 30 split-horizon

interface Gi1/2
 service instance 10 ethernet
  encapsulation dot1q 10-20
  bridge-domain 30 split-horizon

interface Vlan 30
  xconnect 1.1.1.2 10 encapsulation mpls


By adding an additional tag through a rewrite action in both service instances, you can differentiate them while they are being transferred through the same VC.

interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10-20
  rewrite ingress tag push dot1q 21 symmetric
  bridge-domain 30 split-horizon

interface Gi1/2
 service instance 10 ethernet
  encapsulation dot1q 10-20
  rewrite ingress tag push dot1q 22 symmetric
  bridge-domain 30 split-horizon

interface Vlan 30
  xconnect 1.1.1.2 10 encapsulation mpls


SVI-based EoMPLS can be considered like a VPLS, where there is only one VC pointing to one neighbor.

Note : Have a look at SVI-based EoMPLS for additional information.

For L3 termination you have the usual two options : use subinterfaces or use bridge-domains (just like switchports) and SVIs. ES/ES+ and SIP-400 cards support termination of double-tagged traffic too.

Keep in mind the following :

a) you must remove all tags before terminating L3 traffic
b) you must use matching rules based on unique single or double tags (no vlan ranges are supported, although they might be accepted)

This is an example using a bridge-domain and the equivalent SVI:
interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10
  rewrite ingress tag pop 1 symmetric
  bridge-domain 40

interface Gi1/2
 service instance 10 ethernet
  encapsulation dot1q 20 second-dot1q 30
  rewrite ingress tag pop 2 symmetric
  bridge-domain 40

interface Vlan 40
  ip address 1.1.1.1 255.255.255.0


This is an example using subinterfaces:

interface Gi1/1.10
 encapsulation dot1q 10
 ip address 1.1.1.1 255.255.255.0

interface Gi1/1.20
 encapsulation dot1q 20 second-dot1q 30
 ip address 1.1.2.1 255.255.255.0


Note : ES cards have a major limitation : single-tagged vlans configured under a subinterface are globally significant. On the other hand, double-tagged vlans are locally significant. On the ES+ and SIP-400 cards, both single-tagged and double-tagged vlans are locally significant.
 

Friday, April 12, 2013

EVC : Flexible VLAN Tag Rewrite



Following the previous post about Flexible Frame Matching, this new post describes the second major step in configuring service instances using the EVC framework : Flexible VLAN Tag Rewrite.

Each service instance can change the existing VLAN tag to be a new VLAN tag by adding, removing, or translating one or two VLAN tags. Flexible VLAN tag rewrite includes 3 main operations :

1) pop (remove an existing tag)
2) push (add a new tag)
3) translate (change one or two tags to another one or two tags) - this can be seen as a combination of pop and push operations

Theoretically, any existing combination of one or two VLAN tags can be changed to any new combination of one or two VLAN tags by just using a simple (once you get the idea) line of configuration. Practically, there are some limitations, which you'll see below.

These are the relevant CLI options under the service instance (you need first to have configured flexible frame matching for these to appear) :

7600(config-if-srv)#rewrite ingress tag ?
pop        Pop the tag
push       Rewrite Operation of push
translate  Translate Tag


Pop operation
7600(config-if-srv)#rewrite ingress tag pop ?
1  Pop the outermost tag
2  Pop two outermost tags

! remove one tag
7600(config-if-srv)#rewrite ingress tag pop 1 ?
symmetric  Tag egress packets as specified in encapsulation


! remove two tags
7600(config-if-srv)#rewrite ingress tag pop 2 ?
symmetric  Tag egress packets as specified in encapsulation



Push operation
7600(config-if-srv)#rewrite ingress tag push ?
dot1q  Push dot1q tag

! add one tag
7600(config-if-srv)#rewrite ingress tag push dot1q ?
<1-4094>  VLAN id

7600(config-if-srv)#rewrite ingress tag push dot1q 20 ?
second-dot1q  Push second dot1q tag
symmetric     Tag egress packets as specified in encapsulation


! add two tags
7600(config-if-srv)#rewrite ingress tag push dot1q 20 second-dot1q ?
<1-4094>  VLAN id

7600(config-if-srv)#rewrite ingress tag push dot1q 20 second-dot1q 30 ?
symmetric  Tag egress packets as specified in encapsulation



Translate operation
7600(config-if-srv)#rewrite ingress tag translate ?
1-to-1  Translate 1-to-1
1-to-2  Translate 1-to-2
2-to-1  Translate 2-to-1
2-to-2  Translate 2-to-2

! remove one tag and add one new tag
7600(config-if-srv)#rewrite ingress tag translate 1-to-1 dot1q 20 ?
symmetric  Tag egress packets as specified in encapsulation


! remove one tag and add two new tags
7600(config-if-srv)#rewrite ingress tag translate 1-to-2 dot1q 20 second-dot1q 30 ?
symmetric  Tag egress packets as specified in encapsulation


! remove two tags and add one new tag
7600(config-if-srv)#rewrite ingress tag translate 2-to-1 dot1q 20 ?
symmetric  Tag egress packets as specified in encapsulation


! remove two tags and add two new tags
7600(config-if-srv)#rewrite ingress tag translate 2-to-2 dot1q 20 second-dot1q 30 ?

symmetric  Tag egress packets as specified in encapsulation



Examples
interface GigabitEthernet1/2
!
service instance 10 ethernet
encapsulation dot1q 10
! remove one tag (10) on ingress
! add one tag (10) on egress
rewrite ingress tag pop 1 symmetric
!
service instance 20 ethernet
encapsulation dot1q 10 second-dot1q 20
! remove two tags (10/20) on ingress
! add two tags (10/20) on egress
rewrite ingress tag pop 2 symmetric
!
service instance 30 ethernet
encapsulation dot1q 30
! add one tag (300) on ingress
! remove one tag (300) on egress (if the resulting frame doesn't match tag 30, it's dropped)
rewrite ingress tag push dot1q 300 symmetric
!
service instance 40 ethernet
encapsulation dot1q 40
! add two tags (400/410) on ingress
! remove two tags (400/410) on egress (if the resulting frame doesn't match tag 40, it's dropped)
rewrite ingress tag push dot1q 400 second-dot1q 410 symmetric
!
service instance 50 ethernet
encapsulation dot1q 50 second-dot1q 1-4094
! remove one tag (50) and add one new tag (500) on ingress
! remove one tag (500) and add one new tag (50) on egress
! the inner tags (1-4094) remain unchanged
rewrite ingress tag translate 1-to-1 dot1q 500 symmetric
!
service instance 60 ethernet
encapsulation dot1q 60
! remove one tag (60) and add two new tags (600/610) on ingress
! remove two tags (600/610) and add one new tag (60) on egress
rewrite ingress tag translate 1-to-2 dot1q 600 second-dot1q 610 symmetric
!
service instance 70 ethernet
encapsulation dot1q 70 second-dot1q 100
! remove two tags (70/100) and add one new tag (700) on ingress
! remove one tag (700) and add two new tags (70/100) on egress
rewrite ingress tag translate 2-to-1 dot1q 700 symmetric
!
service instance 80 ethernet
encapsulation dot1q 80 second-dot1q 200
! remove two tags (80/200) and add two new tags (800/810) on ingress
! remove two tags (800/810) and add two new tags (80/200) on egress
rewrite ingress tag translate 2-to-2 dot1q 800 second-dot1q 810 symmetric


There are some important things to keep in mind when configuring Flexible VLAN Tag Rewrite.

1) You have to use the "symmetric" keyword, although the CLI might not give you this impression:

7600(config-if-srv)#rewrite ingress tag pop 1 ?
symmetric  Tag egress packets as specified in encapsulation


7600(config-if-srv)#rewrite ingress tag pop 1
Configuration not accepted by the platform
7600(config-if-srv)#rewrite ingress tag pop 1 symmetric
7600(config-if-srv)#


Generally rewrite configurations should always be symmetric. Whatever rewrites are on the ingress direction, you should have the reverse rewrites on the egress direction for the same service instance configuration. So, if you pop the outer VLAN tag on ingress direction, then you need to push the original outer VLAN tag back on the egress direction for that same service instance. All this is done automatically by the system when using the "symmetric" keyword. Have a look at the examples included above and check the comments to see what operations are happening on ingress and egress.

2) Due to the mandatory symmetry, some operations can only be applied to a unique tag matching service instance (so they are not supported for VLAN range configurations) or cannot be applied at all.

i.e.
You cannot translate a range of vlans
7600(config-if-srv)#encapsulation dot1q 10 - 20
7600(config-if-srv)#rewrite ingress tag translate 1-to-1 dot1q 30 symmetric
Encapsulation change is not logically valid.


You cannot pop a range of vlans
7600(config-if-srv)#encapsulation dot1q 10 second-dot1q 20,30
7600(config-if-srv)#rewrite ingress tag pop 2 symmetric
Encapsulation change is not logically valid.


If you could do the above, how could the opposite be done? I.e., if the system removed the tags from frames matching inner vlans 20,30 on the ingress, how would the system know on which frames to add 20 and on which to add 30 on the egress?

Of course you can push a new vlan over a range of vlans.
7600(config-if-srv)#encapsulation dot1q 10-20
7600(config-if-srv)#rewrite ingress tag push dot1q 30 symmetric


You can only push one or two tags for "encapsulation untagged" and "encapsulation default". No pop or translate operations are supported.

As a rule of thumb: you cannot pop or translate something that is not specifically defined as a single unit. Just imagine what would happen in the opposite direction and everything should become clear.

Keep in mind that some configurations might be accepted, but they won't work.

3) You cannot have more than one VLAN tag rewrite configuration under a single service instance. That means you can have either none or one. If there is no VLAN tag rewrite configuration, the existing VLAN tag(s) will be kept unchanged. If you need more than one, you might want to try to create more service instances using more specific frame matching criteria on each one. The translate operation might also seem useful in such conditions.
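
As a purely hypothetical sketch (the interface, instance numbers, and tags below are made up), splitting traffic into two more specific service instances lets you apply a different rewrite to each subset:

interface Gi1/1
 service instance 11 ethernet
  ! only outer tag 10 / inner tag 100
  encapsulation dot1q 10 second-dot1q 100
  rewrite ingress tag pop 2 symmetric
 !
 service instance 12 ethernet
  ! only outer tag 10 / inner tag 200
  encapsulation dot1q 10 second-dot1q 200
  rewrite ingress tag translate 2-to-1 dot1q 300 symmetric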

4) You need to be extra careful when using Flexible VLAN Tag Rewrite and Bridge Domains. Flooded (broadcast/multicast/unknown unicast) packets will get dropped by the service instances that do not agree on the egress tag. Although all service instances under a common bridge domain will get the flooded frame, there is an internal validation mechanism that checks whether the result of egress rewrite (based on the opposite of ingress rewrite) will allow the flooded frame to pass. The push operations under the examples show this behavior.

5) To have an EVC based port act like a L2 802.1q trunk port, you need to remove the outer tag manually and then put it under a bridge domain. On normal L2 switchports this is done automatically by the system.

So this
interface Gi1/1
 switchport
 switchport mode trunk
 switchport trunk allowed vlan 10

is equivalent to this
interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10
  rewrite ingress tag pop 1 symmetric
  bridge-domain 10

Note: The above examples were done on a 7600 with ES+ cards running 12.2(33)SRB IOS.
 

Monday, April 8, 2013

EVC : Flexible Frame Matching


EVC stands for Ethernet Virtual Connection, and on Cisco platforms it is used to refer to Cisco's software architecture for addressing Carrier Ethernet services. In MEF (Metro Ethernet Forum) terminology EVC means "Ethernet Virtual Connection/Circuit", but here EVC also represents the whole Carrier Ethernet software infrastructure developed by Cisco.

EVC has many advantages, one of them being Flexible Frame Matching. Flexible Frame Matching is a functionality that allows each service instance to match frames with either a unique single vlan or a list/range of vlans. It can also match single/double tagged frames, untagged frames, or everything else that belongs to the default category.

Flexible Frame Matching is the first major step after configuring a service instance. This is the complete idea:

1) Service Instance definition (create the service instance)
2) Flexible frame matching (configure what frames need to be matched based on vlan match criteria)
3) Flexible VLAN tag rewrite (configure the action to do on the matched frames' vlan tags)
4) Flexible Service Mapping (map the service instance to a service)
5) Extra service features (apply some extra features on the service instance, i.e. QoS)

The middle 3 most important steps can also be described as:

a) Frame matching
b) Frame rewrite
c) Frame forwarding

Example
interface Gi1/1
 ! Service Instance definition (the ID is locally significant to the port)
 service instance 10 ethernet
  ! Flexible frame matching
  encapsulation dot1q 10 second-dot1q 20
  ! Flexible VLAN tag rewrite
  rewrite ingress tag pop 1 symmetric
  ! Flexible service mapping
  xconnect 10.10.10.10 100 encapsulation mpls
  ! Extra service features
  service-policy input TEST-INPUT-POLICY

The current EVC implementation supports matching only on vlan tags, but in the future we may see matching on other L2 fields too, since the hardware is quite capable.

These are the current supported vlan matching configurations:

Single tagged frames, where match criteria can be a single vlan, a list/range of vlans, or any vlan (1-4094)
encapsulation dot1q <vlan-id>
encapsulation dot1q <vlan-id>,<vlan-id>
encapsulation dot1q <vlan-id>-<vlan-id>
encapsulation dot1q any

Double tagged frames, where the first VLAN tag can only be a single vlan (software limitation), while the second VLAN tag can be a single vlan, a list/range, or any
encapsulation dot1q <vlan-id> second-dot1q <vlan-id>
encapsulation dot1q <vlan-id> second-dot1q <vlan-id>,<vlan-id>
encapsulation dot1q <vlan-id> second-dot1q <vlan-id>-<vlan-id>
encapsulation dot1q <vlan-id> second-dot1q any

Untagged frames, where all untagged frames are matched
encapsulation untagged

Default tag frames, where all tagged/untagged frames that are not matched by other more specific service instances are matched
encapsulation default


Examples
interface Gi1/1
 !
 service instance 10 ethernet
  ! single tagged frames with a specific tag
  encapsulation dot1q 10
 !
 service instance 20 ethernet
  ! single tagged frames with multiple tags
  encapsulation dot1q 20,22,24,26-28
 !
 service instance 30 ethernet
  ! single tagged frames with any tag
  encapsulation dot1q any
 !
 service instance 40 ethernet
  ! frames with a specific single outer tag and specific single inner tag
  encapsulation dot1q 10 second-dot1q 20
 !
 service instance 50 ethernet
  ! frames with a specific single outer tag and multiple inner tags
  encapsulation dot1q 10 second-dot1q 20,22,24,26-28
 !
 service instance 60 ethernet
  ! frames with a specific single outer tag and any inner tag
  encapsulation dot1q 10 second-dot1q any
 !
 service instance 70 ethernet
  ! frames without a tag
  encapsulation untagged
 !
 service instance 80 ethernet
  ! frames that do not match under any other service instance
  encapsulation default


There are some important things to keep in mind when configuring Flexible Frame Matching.

1) When you have multiple vlan match criteria configured under different service instances of a single physical interface, the most specific one wins (it's like the longest match rule used in the routing table). So the order of service instances under an interface doesn't have the same effect as the classes in MQC. This is because frame matching is done in hardware using the linecard's TCAM table, where each frame matching configuration gets converted to 1 or more TCAM entries (vlan lists/ranges in matching criteria are the most TCAM-consuming configurations). The number of 16000 service instances per ES20 module is based on the assumption that each service instance uses a single TCAM entry.

2) When you don't get any match according to the above longest match rule, matching is done according to a looser match algorithm, where a single tag configuration matches all frames that have a common outer tag (regardless of the number of inner tags) and a double tag configuration matches all frames that have the first 2 tags in common (regardless of the number of additional inner tags; btw, I'm planning to do a triple-tag test soon).

Example
interface Gi1/1
 service instance 10 ethernet
  encapsulation dot1q 10
 service instance 20 ethernet
  encapsulation dot1q 10 second-dot1q 20
 service instance 30 ethernet
  encapsulation default

On the above configuration:

10/20 will be matched by service instance 20 (both tags matched)
10/30 will be matched by service instance 10 (outer tag matched)
20/30 will be matched by service instance 30 (no tag matched)

"encapsulation dot1q 10" matches "10", "10/20", "10/30" and so on.
"encapsulation dot1q 10 second-dot1q 20" matches "10/20", "10/20/30", "10/20/40" and so on.

Note: The above examples were done on a 7600 with ES+ cards running 12.2(33)SRB IOS.

 

Friday, April 5, 2013

Cisco Connected Industries talks about the IE2000 Industrial Ethernet Switch




Yuta Endo, Cisco Connected Industries Product Manager, discusses the newest version of the IE2000 range of products, which addresses the growing trends of industrial and enterprise network convergence, connectivity across industrial equipment, and heightened security concerns.




Yuta talks about the product's features and benefits, such as support for IEEE standards like the IEEE 1588 PTPv2 standard used for motion control.

The product is available now for customers in the Manufacturing, Oil and Gas, Mining, Transportation and Energy industries, and is already in use by many Cisco manufacturing customers.
Recently the product line added Power over Ethernet (PoE), so that both the IE2000 and IE3000 ranges have PoE in both the fixed and modular versions.
 

Tuesday, April 2, 2013

Basics: What’s the Difference Between STP BPDU Guard and Root Guard

Courtesy - Ethereal Mind


BPDU Guard and Root Guard are enhancements to the Spanning Tree Protocol (STP) that improve its resilience to unexpected events.

Why ?

Remember that the purpose of the Spanning Tree algorithm is to create a single path through the network to prevent loops, because an Ethernet frame has no loop prevention mechanism. As a result, an Ethernet network is always designed as an inverted tree, like this:
[Figure 1: an inverted-tree STP design]

There are loops in this design that are implemented for resilience, i.e. STP will block a given path in planned operation, but an alternate path can be activated if the primary path fails.

However, STP is susceptible to various failures due to poor network design [1] or certain types of operational problems. Both BPDU Guard and Root Guard are used to enforce design discipline and ensure that the STP protocol operates as designed.

BPDU Guard

BPDU Guard disables the port upon BPDU reception if PortFast is enabled on the port. This effectively prevents devices connected to these ports from participating in the designed STP, thus protecting your data centre core.

Note: When a BPDU is received, the port will typically be shut down in the "errdisable" state and will require manual re-enabling. Alternatively, you can configure the port to re-enable itself automatically with the "errdisable recovery" feature.
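
As a hedged illustration (the interface and timer values are made up, not from the original post), BPDU Guard is typically enabled together with PortFast on access ports, optionally with automatic recovery:

Switch(config)# interface GigabitEthernet0/10
Switch(config-if)# spanning-tree portfast
Switch(config-if)# spanning-tree bpduguard enable
Switch(config-if)# exit
! Optional: bring err-disabled ports back automatically after 5 minutes
Switch(config)# errdisable recovery cause bpduguard
Switch(config)# errdisable recovery interval 300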

Root Guard

Root guard allows the device to participate in STP as long as the device does not try to become the root. If root guard blocks the port, subsequent recovery is automatic. Recovery occurs as soon as the offending device ceases to send superior BPDUs.
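
A minimal sketch (the interface is hypothetical): Root Guard is applied on designated ports facing switches that should never become the root:

Switch(config)# interface GigabitEthernet0/24
Switch(config-if)# spanning-tree guard root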

Where ?

Because BPDU Guard and Root Guard are primarily there to ensure design enforcement (integrity/security), they must be configured in specific locations in the network.

[Figure 2: where to apply BPDU Guard and Root Guard in the network]

  [1] By "design" I mean that people add new switches in the wrong places, which breaks that controlled design as shown here.
 

Thursday, March 28, 2013

Understanding Spanning Tree Protocol



Spanning-tree Protocols
802.1d (Standard Spanning-tree)
So the entire goal of spanning-tree is to create a loop free layer 2 domain. There is no TTL in a layer 2 frame so if you don’t have spanning-tree, a frame can loop forever. So the original 802.1d standard set out to fix this. There are a few main pieces to the 802.1d process. They are…

1. Elect a root bridge.
This bridge is the ‘root’ of the spanning-tree. In order to elect a root bridge, all of the switches send out BPDU (Bridge Protocol Data Units). The BPDU has a bridge priority in it which the switches use to determine which switch should be the root. The lowest ID wins. The original standard specified a bridge ID as…

[Figure: original bridge ID format - a 2-byte bridge priority followed by the 6-byte MAC address]
 
As time progressed there became a need to create multiple spanning-trees for multiple VLANs (we’ll get to that later). So, the bridge ID format had to be changed. What they came up with was..
 
[Figure: updated bridge ID format - a 4-bit priority, a 12-bit extended system ID (VLAN ID), and the 6-byte MAC address]
 
So, now you know why the bridge priority has to be a multiple of 4096 (if you didn't: 4 bits give you 16 possible values, and 16 * 4096 = 65,536, which is one more than the old 16-bit maximum priority of 65,535).
 
So at this point, we have a mess of switches swarming around with BPDUs. If a switch receives a BPDU with a lower bridge priority, it knows that it isn't the root. At that point, it stops sending out its own bridge ID and starts sending out BPDUs with the better (lower) priority that it heard. In the end, all of the switches will be forwarding BPDUs with the lowest bridge ID. At that point, the switch originating the best (lowest) bridge ID knows that it is the root bridge.
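
As an illustrative aside (the VLAN and priority values are made up), on Cisco IOS you usually win this election by lowering the bridge priority, or by letting the switch work it out with the root primary macro:

Switch(config)# spanning-tree vlan 10 priority 4096
! or let IOS pick a priority lower than the current root's
Switch(config)# spanning-tree vlan 10 root primary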
 
2. Each switch selects a root port
So now that we know which switch is the root, every non-root switch needs to select its root port. That is, the port with the lowest cost to the root switch. To determine this, the root switch sends hellos out of all of its ports every 2 seconds. When a non-root switch receives a hello, it does a couple of things. First, it reads the 'cost' from the hello message and updates it by adding the ingress port's cost. So if a hello with a cost of 4 came in on a Fast Ethernet port, the switch would add 19 to it, giving a new cost of 23. After all of the hellos are received, the switch picks its root port by selecting the port which had the lowest calculated cost. Now, a bit about port costs. See the table below…

Interface Speed    Original IEEE Port Cost    New IEEE Port Cost
10 Mbps            100                        100
100 Mbps           10                         19
1000 Mbps          1                          4
10000 Mbps         1                          2

So as you can see, with the increase in speed came an upgrade to the port costs. Now that we have 40 gig interfaces, I'm wondering if they will redo that again. At any rate, if there is a tie, say two ports that have a calculated cost of 23, the switch breaks the tie in the following fashion..

1. Pick the lowest bridge ID of switch that sent the hellos
2. Pick the lowest port priority of the switch that sent the hellos
3. Use the lowest port number of the switch that sent the hellos
(We’ll talk about port priorities in a bit) Now that we have a root port we can move onto step 3.
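
Since cost and port priority drive these tiebreakers, here is a hedged per-interface sketch (the values are illustrative only):

Switch(config)# interface GigabitEthernet0/1
! Override the default port cost used in root-port selection
Switch(config-if)# spanning-tree cost 23
! A lower priority (default 128) makes this port preferred by the downstream switch
Switch(config-if)# spanning-tree port-priority 64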

3. Pick a designated port
This part is pretty easy. Basically, each segment can only have one designated port. The switch that forwards the lowest cost hello onto a particular segment becomes the designated switch, and the port that it uses to do that is the designated port. That means each port on the root bridge is a designated port. Then, ports that are neither root ports nor designated ports (non-designated ports) go into the blocking state. If a tie occurs, the same tiebreaker process occurs as in step 2.

At this point, we have a fully converged spanning-tree!

Normal Operation
Under normal operation the root sends hellos out of all its active ports. Each connected switch receives the hellos on its root port, updates them, and forwards them out of its designated ports (if it has any). Blocked ports receive the hellos, but never forward them.

Topology Changes
When a switch notices a topology change, it's responsible for telling all other connected switches about the change. The most effective way to do this is to tell the root switch so that it can tell all of the other switches. When a switch notices a topology change, it sends a TCN (topology change notification) out its root port. The switch will send the TCN every hello time until the upstream switch acknowledges it. The upstream switch acknowledges by sending a hello with a TCA (topology change acknowledgement) flag set. This process continues until the root is notified. The root will then set the TC flag in its hellos. When switches in the tree see the TC flag set in the hellos from the root, they know that there has been a topology change and that they need to age out their CAM tables. Switches aging out their CAM tables is an important part of a topology change and reconvergence.

802.1D Port States

Blocking – The port is blocking all traffic with the exception of receiving STP BPDUs. The port will not forward any frames in this state.
Listening – Same as blocking but will now begin to send BPDUs.
Learning – The switch will begin to learn MAC information in this state.
Forwarding – Normal full up and up port state. Forwarding normal traffic.

Timing
There are a couple of main timers in the STP protocol. These are..
Forward Delay Timer – Default of 15 seconds
Hello – Default of 2 seconds
MaxAge – Default of 20 seconds
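
These timers are set on the root bridge and advertised to the rest of the tree; a hypothetical IOS sketch with the default values:

Switch(config)# spanning-tree vlan 1 hello-time 2
Switch(config)# spanning-tree vlan 1 forward-time 15
Switch(config)# spanning-tree vlan 1 max-age 20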

Spanning-Tree enhancements (Cisco Proprietary)
PortFast – Immediately puts a port into forwarding mode. Essentially disables the STP process. Should only be used for connecting to end hosts.
UplinkFast – Should be used on access layer switches connecting to the distribution layer. Used to fail over the root port when the primary root port fails. CAM entries are timed out by the access layer switch generating multicast frames with the attached devices' MACs as the source addresses. This is different than the normal TCN process as described earlier. UplinkFast also causes the switch to increase its bridge priority to 49152 and to add 3000 to all of its port costs.
BackboneFast – Used to detect indirect STP failures, so the switch doesn't have to wait MaxAge to reconverge. The feature needs to be configured on all switches in order for it to work. The switch queries its upstream switches with an RLQ (Root Link Query) when it stops receiving hellos. If the upstream switch had a failure, it can reply to the local switch so that it can converge to another port without waiting for MaxAge to expire.
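
A hedged sketch of how these three features are typically enabled on an access switch (the interface is made up):

Switch(config)# spanning-tree uplinkfast
Switch(config)# spanning-tree backbonefast
Switch(config)# interface GigabitEthernet0/5
! End-host port only - skips the listening/learning wait
Switch(config-if)# spanning-tree portfast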

802.1w (Rapid Spanning-Tree)
Rapid spanning-tree takes 802.1d and makes it faster. In addition, it standardizes some of the Cisco proprietary features. Here are some of the notable changes that 802.1w makes.

-Switches only wait to miss 3 hellos on their root port prior to reconverging. This number in 802.1d was 10 (MaxAge, or 10 times hello).
-Fewer port states. 802.1w takes the number of port states from 5 (I'm counting disabled) down to 3.

The new states are discarding, learning, and forwarding.
-Concept of a backup DP when a switch has multiple ports connected to the same segment.
-Standardization of the Cisco proprietary PortFast, UplinkFast, and BackboneFast.

802.1w Link Types
Point to Point – Connects a switch to another switch in full duplex mode.
Shared – Connects a switch to a hub using half duplex
Edge – A user access port

802.1w Port roles
Root Port – The same as in 802.1d
Designated Port – The same as in 802.1d
Alternate Port – Same as the uplink fast feature, backup RP connection
Backup Port – Alternate DP port, can take over if the existing DP fails

802.1s (Multiple Spanning-Tree)
Multiple spanning-tree (MST) lets you map VLANs into a particular spanning tree. These VLANs are then considered to be part of the same MST region. MST uses the same features as RSTP for convergence, so if you are running MST, you are by default also running RSTP. Much like any other ‘group’ technology, there are several parameters that must be met before switches/vlans can become part of the same region.

-MST must be globally enabled
-The MST region name must be configured (and the same on each switch)
-Define the MST revision number (and make it the same on each switch)
-Map the same VLANs into each region (or instance)

MST can co-exist with other switches that don't talk MST. In this case, the entire MST region appears to be a single switch to the other 'external' spanning-tree. The spanning-tree that connects the region to the 'outside' is considered to be the IST, or Internal Spanning Tree.
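
A minimal MST sketch (the region name, revision, and VLAN-to-instance mappings are made up and must match on every switch in the region):

Switch(config)# spanning-tree mode mst
Switch(config)# spanning-tree mst configuration
Switch(config-mst)# name REGION1
Switch(config-mst)# revision 1
Switch(config-mst)# instance 1 vlan 10,20
Switch(config-mst)# instance 2 vlan 30,40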

Spanning-tree Protection
There are several ‘protection’ mechanisms available that can be implemented in conjunction with spanning-tree to protect the spanning-tree from failure or loops.

BPDU Guard – Should be enabled on all ports that will never connect to anything but an end user port. The configuration will err-disable a port if a BPDU is received on that port. To recover from this condition the port must be shut/no shut.

Root Guard – Protects the switch from choosing the wrong RP. If a superior BPDU is heard on this port the port is placed into root-inconsistent state until the BPDUs are no longer heard.

UDLD – Unidirectional link detection is used to detect when one side (transmit or receive) of a link is lost. States like this can cause loops and loss of connectivity. UDLD functions in two modes, aggressive and normal. Normal mode uses layer 2 messaging to determine if a switch's transmission capabilities have failed. If this is detected, the switch with the failed transmit side goes into err-disable. In aggressive mode the switch tries to reconnect with the other side 8 times. If this fails, both sides go into err-disable.

Loop Guard – When a port configured with loop guard stops hearing BPDUs it goes into loop-inconsistent state rather than transitioning into forwarding.
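
As an illustrative sketch (the interface is hypothetical), Loop Guard and UDLD are usually enabled globally and then apply to the relevant uplinks:

Switch(config)# spanning-tree loopguard default
! Aggressive UDLD is enabled on all fiber ports by the global command
Switch(config)# udld aggressive
Switch(config)# interface GigabitEthernet0/1
! Or enable them per interface instead
Switch(config-if)# spanning-tree guard loop
Switch(config-if)# udld port aggressive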
 

Thursday, March 21, 2013

6500 Architecture and evolution

 
Since I’ve recently become more interested in the actual switching and fabric architectures of Cisco devices, I decided to take a deeper look at the 6500 series switches. I’ve worked with them for years, but until recently I didn’t have a solid idea of how they actually switched packets. I had a general idea of how it worked and why DFCs were a good thing, but I wanted to know more. Based on my research, this is what I’ve come up with. I’d love to hear any feedback on the post since there is a chance that some of what I’ve read isn’t totally accurate. That being said, let’s dive right in…
 
Control vs Data Plane
All actions on a switch can be considered to be part of either the control plane or the data plane. The 6500 series switch is a hardware based switch, which implies that it performs switching in hardware rather than software. The pieces of the switch that perform switching in hardware are considered to be part of the data plane. That being said, there still needs to be a software component of the switch that tells the data plane how to function. The parts of the switch that function in software are considered to be the control plane. These components make decisions and perform advanced functions which then tell the data plane how to operate. Cisco’s implementation of forwarding in hardware is called CEF (Cisco Express Forwarding).
 
Switch Design
The 6500 series switch is a modular switch that is comprised of a few main components. Let’s take a look at each briefly.
 
The Chassis
The 6500 series switch comes in many shapes and sizes. The most common (in my opinion) is the 6509. The last number indicates the number of slots on the chassis itself. There are also 3, 4, 6, and 13 slot chassis available. The chassis is what holds all of the other components and facilitates connecting them together. The modules plug into a silicon board called the backplane.
 
The Backplane
The backplane is the most crucial component of the chassis. It has all of the connectors on it that the other modules plug into. It has a few main components that are highlighted on the diagram below.
 
[Figure: 6500 backplane diagram]
The diagram shows the backplane of a standard 9 slot chassis. Each slot has a connection to the crossbar switching fabric, the three buses (D,R,C) that compose the shared bus, and a power connection.
 
The switch fabric in the 6500 is referred to as a ‘crossbar’ fabric. It provides unique paths for each of the connected modules to send and receive data across the fabric. In initial implementations the SUP didn’t have an integrated switch fabric which required the use of a separate module referred to as the SFM (Switch Fabric Module). With the advent of the SUP720 series of SUPs the switch fabric is now integrated into the SUP itself. The cross bar switching fabric provides multiple non-blocking paths between different modules. The speed of the fabric is a function of both the chassis as well as the device providing the switch fabric.
Standard 6500 chassis
- 40Gbps per slot

Enhanced (6500-E) chassis
- 80Gbps per slot

SFM with SUP32 supervisor
- Single 8Gbps fabric connection
- 256Gbps switching fabric
- 18 fabric channels

SUP720 through SUP720-3B supervisor
- Single 20Gbps fabric connection
- 720Gbps switching fabric
- 18 fabric channels

SUP720-3C supervisor
- Dual 20Gbps fabric connections
- 720Gbps switching fabric
- 18 fabric channels

SUP2T supervisor
- Dual 40Gbps fabric connections
- 2.08Tbps switching fabric
- 26 fabric channels
So as you can see, there are quite a few combinations you can use here. The bottom line is that with the newest SUP2T and the 6500e chassis, you could have a module with eight 10Gbps ports that wasn’t oversubscribed.
 
The other bus in the 6500 is referred to as a shared bus. In the initial 6500 implementation the fabric bus wasn’t used. Rather, all communication came across the shared bus. The shared bus is actually comprised of 3 distinct buses.
DBus (Data Bus) – Is the main bus in which all data is transmitted. The speed of the DBus is 32Gbps.
RBus (Results Bus) – Used by the supervisor to forward the result of the forwarding operation to each of the attached line cards. The speed of the RBus is 4Gbps.
CBus (Control Bus) – Relays information between line cards and the supervisor. This is also sometimes referred to as Ethernet Out of Band or EOB or EOBC (Ethernet Out of Band Controller). The speed of the CBus is 100Mbps half duplex.
The Supervisor (or, as we call them, SUPs)
The switch supervisor is the brains of the operation. In the initial implementation of the 6500 the SUP handled the processing of all packets and made all of the forwarding decisions. A supervisor is made up of three main components: the switch fabric, the MSFC (Multi-Layer Switch Feature Card), and the PFC (Policy Feature Card). The image below shows a top down view of a SUP720 and the location of each component on the physical card.
 
[Figure: top-down view of a SUP720]
MSFC – The Multi-Layer Switch Feature Card is considered to be the control plane of the switch. The MSFC runs processes that help build and maintain the layer 3 forwarding table (routing table), process ACLs, run routing protocols, and other services that are not run in hardware. The MSFC is actually comprised of two distinct pieces.
SP – The SP (Switch Processor) handles booting the switch. The SP copies the SP part of an IOS image from bootflash, boots itself, and then copies the RP part of the IOS image to the RP. Once the RP is booted, the SP hands control of the switch over to the RP. From that point on the RP is what the administrator talks to in order to administer the switch. In most cases, the SP still handles layer 2 switch protocols such as STP.
RP – The RP (Route Processor) handles all layer 3 functions of the 6500, including running routing protocols and building the RIB, from which the FIB is populated. Once the FIB is built on the RP, it can be downloaded to the data plane TCAM for hardware based forwarding of packets. The RP runs in parallel with the SP, which continues to provide the layer 2 functions of the switch.
PFC – The Policy Feature Card receives a copy of CEF’s FIB from the MSFC. Since the MSFC doesn’t actually deal with forwarding any packets, it downloads the FIB into the hardware on the PFC. Basically, the PFC is used to accelerate layer 2 and layer 3 switching, and it learns how to do that from the MSFC. The PFC is considered to be part of the data plane of the switch.
 
Line Cards
The line cards of a 6500 series switch provide the port density to connect end user devices. Line cards come in different port densities and support many different interface types. Line cards connect to the SUP via the backplane.
 
The other pieces…
The 6500 also has a fan tray slot as well as two slots for redundant power supplies. I’m not going to cover these in detail since they don’t play into the switch architecture.
 
Switching modes
Now that we’ve discussed the main components of the 6500, let’s talk about the different ways in which a 6500 switches packets. There are 5 main modes in which this occurs, and the mode that is used relies heavily on what type of hardware is present in the chassis.
 
Classic mode
In classic mode the attached modules make use of the shared bus in the chassis. When a switchport receives a packet, it is first locally queued on the card. The line card then requests permission from the SUP to send the packet onto the DBUS. If the SUP says yes, the packet is sent onto the DBUS and subsequently copied to the SUP as well as all other line cards. The SUP then performs a lookup on the PFC. The result of that lookup is sent along the RBUS to all of the cards. The card containing the destination port receives information on how to forward the packet, while all other cards are told to terminate processing on the packet and delete it from their buffers. The speed of classic mode is 32Gbps half duplex, since it’s a shared medium.
 
CEF256
In CEF256 mode each module has a connection to the shared 32Gbps bus as well as an 8Gbps connection to the switch fabric. In addition, each line card has a local 16Gbps bus (LCDBUS) on the card itself. When a switchport receives a packet, it is flooded on the LCDBUS and the fabric interface receives it. The fabric interface floods the packet header onto the DBUS. The PFC receives the header and makes the forwarding decision. The result is flooded on the RBUS back to the line card and the fabric interface receives the forwarding information. At that point, the entire packet is sent across the 8Gbps fabric connection to the destination line card. The fabric interface on the egress line card floods the packet on the LCDBUS, and the egress switchport sends the packet on its way out of the switch.
 
dCEF256
In dCEF256 mode each line card has dual 8Gbps connections to the switch fabric and no connection to the shared bus. In this mode, the line card also has a DFC (Distributed Forwarding Card), which holds a local copy of the FIB as well as its own layer 2 adjacency table. Since the card doesn’t need to forward packets or packet headers to the SUP for processing, there is no need for a connection to the shared bus. Additionally, dCEF256 cards have dual 16Gbps local line card buses. The first LCDBUS handles half of the ports on the line card and the second LCDBUS handles the other half. Communication from a port on one LCDBUS to a port on the second LCDBUS goes through the switch fabric. Since the line card has all of the forwarding information that it needs, it can forward packets directly across the fabric to the egress line card without talking to the SUP.
 
CEF720
Identical operation to CEF256, but with some upgrades. The switch fabric is now integrated into the SUP rather than on a SFM, and the dual fabric connections from each line card are now 20Gbps a piece rather than 8Gbps.
 
dCEF720
Identical to dCEF256, with the addition of the same upgrades present in CEF720 (faster fabric connections and the switch fabric in the SUP).
 
Centralized vs Distributed Forwarding
I had indicated earlier that the early implementations of the switch utilized the SUP to make all switching and forwarding decisions. This would be considered to be centralized switching since the SUP is providing all of the functionality required to forward a packet or frame. Lets take a look at how a packet is forwarded using centralized forwarding.
Line cards by default (in most cases) come with a CFC or centralized forwarding card. The card has enough logic on it to know how to send frames and packets to the Supervisor when it needs an answer. In addition, most cards can accept a DFC or distributed forwarding card. DFCs are the functional equivalent to the PFC located on the SUP and hold an entire copy of CEF’s FIB and adjacency tables. With a DFC in place, a line card can perform distributed forwarding which takes the SUP out of the picture.
 
How centralized forwarding works…
1. Frame arrives at the port on a line card and is passed to the CFC on the local line card.
2. The bus interface on the CFC forwards the headers to the supervisor on the DBus. All other line cards connected to the DBus ignore the headers.
3. The PFC on the supervisor makes a forwarding decision based on the headers and floods the result on the RBus. All other line cards on the RBus ignore the result.
4. The CFC forwards the result, along with the packet, to the fabric interface of the line card. The fabric interface forwards the result and the packet onto the switch fabric towards their final destination.
5. The egress line card’s fabric ASIC receives the packet and forwards the data out towards the egress port.
 
How distributed forwarding works…
1. Frame arrives at the port on a line card and is passed to the fabric interface on the local line card.
2. The fabric interface sends just the headers to the DFC located on the local line card.
3. The DFC returns the forwarding decision of it’s lookup to the fabric interface.
4. The fabric interface transmits the packet onto the switch fabric and towards the egress line card
5. Egress line card receives the packet and forwards the packet on to the egress port.
So as you can see, distributed forwarding is much quicker than centralized forwarding just from a process perspective. In addition, it doesn’t require the use of the shared bus.
 
Conclusion
There are many pieces of the 6500 that I didn’t cover in this post, but hopefully it’s enough to get you started if you are interested in knowing how these switches work. Hopefully I’ll have time soon to do a similar post on the Nexus 7000 series switch.
 

Friday, January 4, 2013

Difference between HSRP and VRRP


HSRP stands for Hot Standby Router Protocol. VRRP stands for Virtual Router Redundancy Protocol. The differences between HSRP and VRRP are very slight, especially when looking at the basic configuration side by side, but under the covers there are some significant differences. The end result, however, is still the same.
 
If a router fails, you need a standby router to become the active gateway and forward packets to the next hop.

Here's a breakdown that compares the major differences between the two protocols.

HSRP Versus VRRP Comparison Table

HSRP                                                          | VRRP
Cisco proprietary (RFC 2281)                                  | Standards-based (RFC 3768)
Needs a separate IP address for the virtual gateway           | Can use a physical interface IP address as the virtual address, if needed, saving IP space
One Active router, one Standby, all other routers Listen      | One Master, all other routers are Backups
More familiar to most network engineers                       | Less familiar, yet very similar
Can track an interface for failover                           | Can track an interface for failover (depending on operating system and version)
Multicast hello packets to 224.0.0.2 (version 1) or 224.0.0.102 (version 2) | Uses IP protocol number 112 (vrrp) to communicate via multicast address 224.0.0.18
Virtual routers use MAC address 0000.0c07.acXX, where XX is the group ID | Virtual routers use MAC address 00-00-5E-00-01-XX, where XX is the group ID

Configuration differences between HSRP and VRRP

 
The differences between VRRP and HSRP, especially on a Cisco router, are very slight. If you're familiar with configuring HSRP, you can easily understand the VRRP commands. Configuring VRRP on Juniper and other vendors' equipment can vary significantly from device to device. Many load balancers also support VRRP, and their configuration is specific to each of those devices.
 
Here are some configuration examples as seen on a Cisco router:
 
HSRP Configuration Example

R1(config)# interface GigabitEthernet0/1
R1(config-if)# ip address 192.168.1.2 255.255.255.0
R1(config-if)# standby 1 ip 192.168.1.1
R1(config-if)# standby 1 priority 200
R1(config-if)# standby 1 preempt

R2(config)# interface GigabitEthernet0/1
R2(config-if)# ip address 192.168.1.3 255.255.255.0
R2(config-if)# standby 1 ip 192.168.1.1
R2(config-if)# standby 1 preempt
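 
As a quick, hedged follow-up sketch, the resulting HSRP state can be verified, and an uplink interface can be tracked so that the priority drops if it fails. The tracking syntax varies by IOS version, and the tracked interface and decrement value of 60 here are purely illustrative:
 
R1(config-if)# standby 1 track GigabitEthernet0/2 60
R1(config-if)# end
R1# show standby brief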


 

VRRP Configuration Example

R1(config)# interface GigabitEthernet0/1
R1(config-if)# ip address 192.168.1.2 255.255.255.0
R1(config-if)# vrrp 1 ip 192.168.1.1
R1(config-if)# vrrp 1 priority 110

R2(config)# interface GigabitEthernet0/1
R2(config-if)# ip address 192.168.1.3 255.255.255.0
R2(config-if)# vrrp 1 ip 192.168.1.1
Notice the lack of a preempt command. This isn't necessary for VRRP. It's enabled by default.
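 
A similarly hedged sketch for verifying VRRP, plus interface tracking via object tracking on IOS versions that support it (the track object number 10 and decrement of 20 are illustrative):
 
R1(config)# track 10 interface GigabitEthernet0/2 line-protocol
R1(config)# interface GigabitEthernet0/1
R1(config-if)# vrrp 1 track 10 decrement 20
R1(config-if)# end
R1# show vrrp brief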
As you can see, there isn't a big difference between the two protocols. The primary difference between HSRP and VRRP is that HSRP is proprietary to Cisco and can only be used on Cisco devices, while VRRP is a standards-based protocol and is vendor independent, allowing some flexibility when choosing network devices.
 

Friday, September 21, 2012

Cisco Nexus 3548 and Arista 7150: Duelling ultra-low-latency switches

 
 
Cisco and Arista Networks Inc. traded punches this week, both announcing new ultra-low-latency top-of-rack switches that are not only fast but also pack many more features and functions than a typical switch in this class.
 
The Arista 7150 is the first switch in the industry to use Intel Corp.'s new Fulcrum Alta FM6000 networking chip, while Cisco's Nexus 3548 uses a new Cisco custom application-specific integrated circuit (ASIC), the Algorithm Boost or Algo Boost chip. At first glance, Cisco has taken a lead in the race to near-zero latency.
 

Arista 7150 and Cisco Nexus 3548: Fast and smart

 
Both top-of-rack switches push the state of the art in low-latency forwarding, a feature that is critical to the competitive high-frequency trading market and also attractive to high-performance computing shops, particularly in genomic research and oil and gas exploration.
 
The Arista 7150 has 350-nanosecond forwarding latency, a 30% improvement on previous generations of Arista switches. The Nexus 3548 ships with 250-nanosecond latency, a significant leap over Arista. The Algo Boost ASIC on the Nexus 3548 can operate in "warp" mode to push latency down to 190 nanoseconds. It achieves this by reducing the size of the switch's address table from 64,000 to 8,000 hosts.
 
But these ultra-low-latency switches are smart as well as fast. They offer low-latency multicast and unicast routing and in-hardware network address translation (NAT).
 
The Arista 7150 now has "all the features and functions of a [Cisco] Catalyst 6500," according to Arista customer John Koehl, head of infrastructure and operations for Headland Technologies LLC, an algorithmic financial trading firm based in San Francisco and Chicago. As many financial trading firms do, Koehl collocates his switches with financial exchanges across the world so that trades can be made as close to the exchange as possible. In those environments, NAT becomes critical.
 
"If you're connecting to the exchange, [NAT] provides a little bit of security because you can mask what you're coming in as," Koehl said. "Second, you don't have to use up all the exchange IP addresses. You can use your private IP inside and just NAT to what the exchange is providing you."
 
Having features like NAT in an ultra-low-latency switch like the Nexus 3548 and the Arista 7150 means that network engineers don't need to place a firewall or another device inline to perform these functions in ultra-fast switching environments.
 
"We try to keep our switch footprint as small as possible so that we don't have a lot of switch hops," Koehl said. "So, the important thing at each data center is to be fast on the network and at the same time have enough switch port capacity and provide all the features and functionality that we need so that we ... can do everything in one box."
 
The Nexus 3548 ships with 48x10 Gigabit Ethernet ports, a 64,000-host address table, and 16,000 IP routes. The Arista 7150 ships in three models ranging from 24x10 GbE to 64x10 GbE ports. It offers a 64,000-host table and 84,000 IP routes.
 

Arista 7150: Programmable forwarding plane for SDN and network virtualization flexibility

 
The Fulcrum Alta chip on the Arista 7150 has a programmable forwarding plane. Combined with EOS, Arista's fully programmable operating system, this chip allows the Arista 7150 to be upgraded to support new protocols in hardware without a device refresh. The switch is shipping with silicon support for Virtual Extensible LAN (VXLAN), for instance, but Arista could easily add native hardware support for Microsoft's alternative Network Virtualization using Generic Routing Encapsulation (NVGRE) standard with a simple software update. Arista demonstrated the VXLAN support at VMworld last month, showing how an Arista 7150 can serve as a gateway for attaching non-VXLAN network services to a VXLAN network.
 
As other protocols emerge in the software-defined networking (SDN) industry, enterprises will be able to migrate to those new technologies without ripping out hardware, according to Martin Hull, senior product manager at Arista. "A fully programmable software stack doesn't necessarily get line-rate performance in hardware. A traditional fixed-logic switch gives performance but doesn't give you flexibility. The solution to all this is a hardware approach that has flexibility in the data plane and a programmable software stack, which is where Arista sits," he said.
 
That means that the Arista 7150 will have the flexibility to change on the fly as applications come out, especially on the forwarding plane with SDN or if network virtualization techniques are being deployed, said Rohit Mehra, director of enterprise communications infrastructure for Framingham, Mass.-based research firm IDC. "Arista will be able to leverage these technologies and make changes without rewriting and reworking the ASICs. Redoing the ASIC can take anywhere from 12 to 24 months," he said.
 

Advanced analytics in an ultra-low-latency switch

 
Both the Arista 7150 and the Nexus 3548 ship with advanced analytical capabilities that can track and analyze latency spikes and buffer utilization. These capabilities allow enterprises to tune their networks to avoid microbursts that can affect ultrafast applications.
 
Arista offers its Latency Analyzer (LANZ), which provides detailed visibility into buffer utilization and captures data held in the buffer when the switch experiences congestion. Arista also has added time-stamping to LANZ so that enterprises can know exactly when a latency microburst occurred. The Algo Boost ASIC on the Nexus 3548 has a similar analytics package that can operate in real time.
 
"We put functionality in the hardware to do fine-grained polling per port on buffer utilization," said Paul Perez, vice president and chief technology officer for Cisco's data center group. "Even down to the granularity of 10 nanoseconds, we can collect buffer utilization and extract that out into a software interface. It can be used in offline mode to do trend analysis, but also in real-time mode to be able to tune your environment."
 
Nexus 3548: Cisco's retreat from merchant silicon?
Cisco's first generation of ultra-low-latency switches, the Nexus 3000 series, used merchant silicon from Broadcom Corp. Rather than ride merchant silicon to even lower latency, Cisco elected to build the Algo Boost ASIC. Although Cisco has emphasized ASICs as a major differentiator, the company will continue to use merchant silicon where appropriate.
 
Cisco CTO Perez said his company believes in using custom ASICs "when you need it and merchant when you don't." "We have a non-religious technology strategy," he said. "We have 600 silicon designers and more than twice that in software developers to drive our custom capabilities. But we will take advantage of commercial silicon where appropriate."
 
Cisco will use much of the technology in the new Algo Boost ASIC to enhance other silicon in Cisco's product portfolio and not just in switching. "If you look in our server line, Unified Computing, we're differentiating and adding value in a highly competitive environment with custom silicon for expanded memory and also with our custom NIC [network interface card]," Perez said. "I think some synergy between this Algo Boost technology at the switching level coupled to that computing edge at the NIC is a very fertile area of exploration for my team in terms of how we can progress the engineering of the next generation of high-performance computing environments."
 
