Sunday, May 27, 2012

MTU Myth Busters

MTU – Maximum Transmission Unit – is rarely given much importance by anyone, until someone is hit by its unpredictable, never-seen-before effects that break communication. That is exactly what my team and I faced at a leading service provider in Pakistan. MTU is commonly described as the maximum amount of information that can be sent in a packet, but that is not quite right. MTU is a physical-layer characteristic, so it is better to say it is the maximum amount of information (data) that can be sent in a frame (e.g. an Ethernet frame). For a standard frame, the maximum packet size that fits in an Ethernet frame is 1500B.


But if a packet is larger than 1500B for any reason, Layer 3 must fragment the information, as it cannot fit into a single Ethernet frame. In the early days, physical media were not as stable and reliable as they are today, so the Internet architects preferred fragmentation: on an error, only the small fragment had to be retransmitted, not the complete information. But this puts a lot of load on the Layer 3 device responsible for fragmentation.


So what are the reasons our normal HTTP or application traffic suddenly fails to communicate? Where does MTU hit? Let's check it out…

Here are the per-header overheads involved in carrying Application/Presentation/Session layer information (normally termed Data); the sketches later in the post reuse these numbers:

• TCP Header = 20B
• GRE = 24B
• IPv4 Header = 20B or IPv6 Header = 40B
• MPLS Header = 4B to 16B (including L3VPN, FRR TE, AToM control word)
• Ethernet Header = 14B
• VLAN/Trunk = 4B & Q-in-Q = 8B

Here are some examples where end-to-end communication breaks for certain customers/applications, while all other services work well.

When everything goes well,






Consider a network with the default config, i.e. MTU 1500 on most FastEthernet interfaces (nowadays Gigabit Ethernet interfaces ship with jumbo frames enabled by default on some vendors).


If a PC behind Router A wants to send data and the MTU configured on the interfaces is 1500 (counting the 14B Ethernet header inside the MTU, the way Juniper does), then the maximum data coming from the A/P/S layers should be calculated as follows:

Data = 1500 – 20 (TCP) – 20 (IPv4) – 14 (Ethernet) = 1446B

This 1446B is usually considered a safe payload from customer devices: it passes all the application data without being dropped somewhere between source and destination. So if the customer sets the MTU on its CE WAN interface, the CE router will do the fragmentation (if required) and the traffic will usually not be dropped in transit. The service provider can also set the DF (Don't Fragment) bit on incoming customer traffic, so that its core routers are not overloaded with the fragmentation process.
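
To make this arithmetic easy to re-check, here is a minimal Python sketch of the same sum (frame-level MTU accounting as used throughout this post; the helper name is mine, not vendor code):

    TCP, IPV4, ETHERNET = 20, 20, 14   # bytes, from the overhead list above

    def safe_payload(frame_mtu):
        # Maximum application payload for a plain TCP/IPv4 flow over Ethernet,
        # counting the 14B L2 header inside the MTU as this post does.
        return frame_mtu - TCP - IPV4 - ETHERNET

    print(safe_payload(1500))   # -> 1446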

But there are scenarios where traffic of 1446B can be dropped. Let's discuss those.


1) Service Provider supports an MTU of 1500B and uses a VLAN trunk on an intermediate link:






In this scenario Routers B & C are connected over an Ethernet trunk link, which means another 4B of VLAN tag overhead. Now if the same 1446B of traffic comes in from customer Router A, it cannot pass over the B-C link. Here is the calculation:



1446 (Data) + 20 (TCP) + 20 (IPv4) + 14 (Ethernet) + 4 (VLAN tag) = 1504B (required MTU)


If the customer application marks the DF bit, or the SP marks it on incoming customer traffic, then Router B will not do the fragmentation and the traffic will be dropped. To resolve this issue, the B-C link should support at least 1504B. (Both this sum and the next scenario's are reproduced in a short sketch after scenario 2.)

Let’s discuss another scenario as an example:

2) Service Provider supports MPLS along with VLAN tagging:

 
In this scenario the service provider network B-C-D supports an MTU of 1504B. Routers C & D are connected over an Ethernet trunk link and also run MPLS, which means 4B of VLAN tag overhead plus 4B of MPLS label overhead. Now if the same 1446B of traffic comes in from customer Router A, it can pass over the B-C link, but not over the C-D link. Here is the calculation:


1446 (Data) + 20 (TCP) + 20 (IPv4) + 4 (MPLS label) + 14 (Ethernet) + 4 (VLAN tag) = 1508B (required MTU)

Similarly, if the customer application marks the DF bit, or the SP marks it on incoming customer traffic, then Router C will not do the fragmentation and the traffic will be dropped. To resolve this issue, the C-D link should support at least 1508B.
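
Both scenarios reduce to the same sum. Here is a minimal Python sketch (my own helper, not vendor code) that reproduces the 1504B and 1508B figures:

    BASE = {"tcp": 20, "ipv4": 20, "ethernet": 14}   # bytes, from the list above
    VLAN_TAG, MPLS_LABEL = 4, 4

    def required_mtu(payload, *extras):
        # Frame-level MTU needed to carry `payload` bytes of application data.
        return payload + sum(BASE.values()) + sum(extras)

    print(required_mtu(1446, VLAN_TAG))              # scenario 1 -> 1504
    print(required_mtu(1446, VLAN_TAG, MPLS_LABEL))  # scenario 2 -> 1508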


The case is worse when the service provider runs MPLS Traffic Engineering and customer traffic is carried over a VPN: this adds up to 12B of additional overhead. If Q-in-Q is supported, add another 4B; if IPv6 is the transport protocol, the IP header overhead grows to 40B instead of IPv4's 20B; and if the customer is using GRE tunneling, 24B of GRE overhead is added on top.
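
Purely as an illustration – using the per-header figures from the list at the top of the post, and assuming every one of those overheads stacks at once – the same style of sum gives the worst case:

    # Hypothetical worst case: every overhead from the paragraph above at once.
    # Figures are this post's per-header numbers; real stacks vary by design.
    worst_case = {
        "tcp": 20,
        "ipv6": 40,         # IPv6 transport instead of IPv4's 20B
        "gre": 24,          # GRE tunneling
        "mpls_te_vpn": 12,  # MPLS TE + VPN label stack
        "ethernet": 14,
        "q_in_q": 8,        # double VLAN tagging
    }
    print(1446 + sum(worst_case.values()))  # -> 1564B frame MTU required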


So, in a nutshell, it is the service provider's responsibility to support an MTU large enough to accommodate every kind of customer service, plus its own overheads such as MPLS. To be on the safe side, if the service provider enables jumbo MTU (e.g. 9192B) in its access & core networks, almost all possible services can run without issue.


Vendors & MTU:

Now let's look at MTU from the perspective of different vendors (Cisco, Juniper and Windows/Linux machines). The Cisco and Juniper implementations of MTU are a bit different, especially when we try to verify the supported MTU using ping.


Juniper Implementation:


Let's discuss the MTU of a Gigabit Ethernet interface (other interface types have different default/maximum MTUs – check here). By default the physical interface MTU is 1514, and if we configure a physical MTU other than the default, the underlying protocol families inherit it from the physical interface. We can also configure a different MTU at the protocol level than the inherited one – one reason to do so is to match the MTU of the remote device, especially for OSPF neighborship, which cannot be established until the IP MTU is the same on both ends. The protocol MTU cannot be larger than the physical MTU, and the header difference between IP and Layer 2 must be maintained, otherwise Junos will not allow the configuration to commit. Here is an example from my M320 router, showing a physical interface MTU of 9100 (configured) with the IP protocol MTU derived from it (9100 – 18 = 9082B). Since the interface is also configured with a VLAN tag, 4B of overhead is added on top of the 14B Layer 2 overhead; that is why we deduct 18 from the physical interface MTU to get the IP MTU.

falikhan@sydlab@M320-m2-re0> show interfaces ge-0/0/1


Physical interface: ge-0/0/1, Enabled, Physical link is Up
Link-level type: Ethernet, MTU: 9100, Speed: 1000mbps, MAC-REWRITE Error: None, Loopback:
Logical interface ge-0/0/1.621 (Index 95) (SNMP ifIndex 531)
Flags: SNMP-Traps 0x4000 VLAN-Tag [ 0x8100.621 ] Encapsulation: ENET2
Protocol inet, MTU: 9082
Protocol inet6, MTU: 9082
Protocol mpls, MTU: 9070

If the interface is also configured with the MPLS address family, another 12B (3 labels) of overhead is reserved, which is why the mpls MTU above shows 9070.
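
This derivation can be mimicked in a few lines; a sketch assuming the Gigabit Ethernet accounting described above (14B Ethernet + 4B VLAN tag for L2, 4B per reserved MPLS label; the function is mine, not a Junos API):

    def junos_protocol_mtus(physical_mtu, vlan_tagged=True, mpls_labels=3):
        # Derive per-family MTUs from the physical interface MTU,
        # the way the ge-0/0/1 output above reports them.
        l2 = 14 + (4 if vlan_tagged else 0)
        ip = physical_mtu - l2
        return {"inet": ip, "inet6": ip, "mpls": ip - 4 * mpls_labels}

    print(junos_protocol_mtus(9100))
    # -> {'inet': 9082, 'inet6': 9082, 'mpls': 9070}, matching the show output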

When we ping from the Juniper CLI with size 1000B, that 1000B is the ICMP payload. It is encapsulated in an 8B ICMP header, then in a 20B IPv4 header, and finally in 14+4B of Layer 2 Ethernet frame overhead. So the actual bytes on the wire will be 1000+8+20+14+4 = 1046B.

Now, how large a ping can we send to a remote host via interface ge-0/0/1 (over an IP network – no MPLS)? The answer is 9082 (IP MTU) – 20 (IP header) – 8 (ICMP header) = 9054B. Let's test it:


falikhan@sydlab@M320-m3-re0> ping 10.250.22.1 logical-system SD31 source 10.250.23.1 size 9054 do-not-fragment

PING 10.250.22.1 (10.250.22.1): 9054 data bytes

9062 bytes from 10.250.22.1: icmp_seq=0 ttl=64 time=8.923 ms
9062 bytes from 10.250.22.1: icmp_seq=1 ttl=64 time=8.888 ms

^C

--- 10.250.22.1 ping statistics ---

2 packets transmitted, 2 packets received, 0% packet loss

round-trip min/avg/max/stddev = 8.888/8.905/8.923/0.017 ms

Note: I have configured logical systems on the M320 to simulate a multi-router network.


falikhan@sydlab@M320-m3-re0> ping 10.250.22.1 logical-system SD31 source 10.250.23.1 size 9055 do-not-fragment

PING 10.250.22.1 (10.250.22.1): 9055 data bytes

ping: sendto: Message too long
ping: sendto: Message too long

^C

--- 10.250.22.1 ping statistics ---

2 packets transmitted, 0 packets received, 100% packet loss

This test shows that when the Juniper router tries to ping with an ICMP payload of 9055, it needs the IP MTU to be at least 9083 (9055 + 8 + 20). But since the IP MTU currently supported on the interface is 9082, the maximum ICMP payload that can pass through this interface (without fragmentation) is 9054.

Just to clarify: by default the router fragments IPv4 traffic, i.e. if I remove the do-not-fragment knob from the ping, it will happily send an ICMP packet with a payload of 9055 or more over the same interface.

falikhan@sydlab@M320-m3-re0> ping 10.250.22.1 logical-system SD31 source 10.250.23.1 size 9055

PING 10.250.22.1 (10.250.22.1): 9055 data bytes

9063 bytes from 10.250.22.1: icmp_seq=0 ttl=64 time=9.685 ms

^C

--- 10.250.22.1 ping statistics ---

1 packets transmitted, 1 packets received, 0% packet loss

round-trip min/avg/max/stddev = 9.685/9.685/9.685/0.000 ms

falikhan@sydlab@M320-m3-re0> ping 10.250.22.1 logical-system SD31 source 10.250.23.1 size 1000

PING 10.250.22.1 (10.250.22.1): 1000 data bytes

1008 bytes from 10.250.22.1: icmp_seq=0 ttl=64 time=1.569 ms
1008 bytes from 10.250.22.1: icmp_seq=1 ttl=64 time=1.552 ms

^C

--- 10.250.22.1 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss

round-trip min/avg/max/stddev = 1.552/1.560/1.569/0.008 ms

falikhan@sydlab@M320-m3-re0> ping 10.250.22.1 logical-system SD31 source 10.250.23.1 size 6000

PING 10.250.22.1 (10.250.22.1): 6000 data bytes

6008 bytes from 10.250.22.1: icmp_seq=0 ttl=64 time=6.179 ms
6008 bytes from 10.250.22.1: icmp_seq=1 ttl=64 time=6.173 ms

^C

--- 10.250.22.1 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss

round-trip min/avg/max/stddev = 6.173/6.176/6.179/0.003 ms
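
So on Junos the ping size is the ICMP payload only, and the largest unfragmented ping is simply the IP MTU minus the IP and ICMP headers. A one-line re-check of the 9054B result (helper name is mine):

    IP_HDR, ICMP_HDR = 20, 8

    def max_junos_ping_size(ip_mtu):
        # Largest `size` (ICMP payload) that still fits the interface IP MTU.
        return ip_mtu - IP_HDR - ICMP_HDR

    print(max_junos_ping_size(9082))  # -> 9054, where the DF ping stopped passing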


Cisco Implementation:

The Cisco implementation is a bit different from Juniper's. You can specify the MTU for different families, and if the IP MTU is larger than the physical interface MTU it will not give you an error the way Juniper does. If only the physical interface MTU is defined, the protocol families inherit it from the physical interface. Another difference to understand: the "show interface" command on the Cisco CLI shows only the physical interface MTU; to check the IP MTU on the same interface, we need to run "show ip interface".


Here is an example from my Cisco router, showing the physical interface MTU of 1500 (default) with the IP protocol MTU configured as 1000B (1500 by default).



Router(config)# interface f0/0
Router(config-if)# ip mtu 1000


Router# show interface f0/0

FastEthernet0/0 is up, line protocol is up
Hardware is Gt96k FE, address is c200.5867.0000 (bia c200.5867.0000)
Internet address is 10.0.0.1/24
MTU 1500 bytes, BW 10000 Kbit, DLY 1000 usec,
reliability 255/255, txload 1/255, rxload 1/255


Router# show ip interface f0/0

FastEthernet0/0 is up, line protocol is up
Internet address is 10.0.0.1/24
Broadcast address is 255.255.255.255
Address determined by setup command
MTU is 1000 bytes

When we ping from the Cisco CLI with size 1000B, that 1000B consists of the ICMP payload, the ICMP header (8B) and the IP header (20B), which is then encapsulated in 14B of Layer 2 Ethernet frame overhead. So the actual bytes on the wire will be 1000+14 = 1014B. The important point is that the actual payload transferred is only 1000B – 8B (ICMP) – 20B (IP) = 972B. NOTE: if we are testing a customer application/service via IXIA or another testing tool (not via ping), we need to account for the IP & ICMP header overheads on top of the data payload.
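
The two vendors therefore count ping sizes differently, which is easy to get wrong when comparing test results. A small sketch contrasting the two conventions (the helper names are mine, not vendor APIs):

    IP_HDR, ICMP_HDR, ETH_HDR, VLAN_TAG = 20, 8, 14, 4

    def wire_bytes_junos(size, vlan_tagged=True):
        # Junos: ping `size` is the ICMP payload only.
        return size + ICMP_HDR + IP_HDR + ETH_HDR + (VLAN_TAG if vlan_tagged else 0)

    def wire_bytes_cisco(size):
        # Cisco IOS: ping `size` is the whole IP datagram (IP + ICMP + payload).
        return size + ETH_HDR

    print(wire_bytes_junos(1000))  # -> 1046, as computed in the Juniper section
    print(wire_bytes_cisco(1000))  # -> 1014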

Now, how large a ping can we send to a remote host via interface f0/0 (over an IP network – no MPLS)? The answer is pretty simple: a packet size equal to the configured IP MTU, i.e. 1000B, because the Cisco size already contains all the IP & ICMP overheads. Let's test it:


Router#ping 10.0.0.100 size 1001 df-bit

Type escape sequence to abort.

Sending 5, 1001-byte ICMP Echos to 10.0.0.100, timeout is 2 seconds:


Packet sent with the DF bit set
...

Success rate is 0 percent (0/3)


Router#ping 10.0.0.100 size 1000 df-bit

Type escape sequence to abort.

Sending 5, 1000-byte ICMP Echos to 10.0.0.100, timeout is 2 seconds:

Packet sent with the DF bit set

!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 9/18/39 ms


Router#




