Network Enhancers - "Delivering Beyond Boundaries"

Sunday, March 24, 2013

Network and Application Root Cause Analysis

Server delay

A few years ago, “Prove it’s not the network” was all that a Network Engineer had to do to get a problem off his back. If he could simply show that throughput was good end to end, latency was low, and there was no packet loss, he could throw a performance problem over the wall to the server and application people.

Today, it’s not quite that simple.
Network Engineers have access to tools that give them visibility down to the packet level of a problem. This level of visibility often requires them to work hand in hand with application people all the way through to problem resolution. To find performance problems in applications, the network guys have had to take the TCP bull by the horns and simply take ownership of the transport layer.
What does that mean?
First, it means that “Prove it’s not the network” isn’t enough, as we have already mentioned. But it also means that analyzing TCP windows, TCP flags, and slow server responses has fallen on their shoulders. Since this is the case today, let’s look at a quick list of transport layer issues that the Network Engineer should watch for when analyzing a slow application.
1. Check out the TCP Handshake
No connection, slow connection, client-to-server round-trip time, and TCP options can all be analyzed by looking at the first three packets in the TCP conversation. It's important when analyzing an application problem to capture this connection sequence. In the handshake, note the amount of time the server takes to respond to the SYN from the client, as well as the advertised window size on both sides. Retransmissions at this stage of the game are a real killer, as the client will typically wait up to three full seconds before retransmitting the SYN.
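As a minimal sketch of this kind of handshake check, here is how the first three packets of a trace could be turned into timing and window numbers. The packet records and field names below are illustrative stand-ins for decoded trace data, not the output of any particular analyzer:

```python
# Sketch: pulling timing and window info out of the TCP three-way handshake.
# Packet records are simplified stand-ins for decoded capture data.

def analyze_handshake(packets):
    """Return handshake timing and advertised windows from the first three packets."""
    syn, synack, ack = packets[:3]
    return {
        # time the server took to answer the client's SYN
        "server_syn_response": synack["ts"] - syn["ts"],
        # full client<->server round trip, measured at the client
        "round_trip_time": ack["ts"] - syn["ts"],
        # advertised receive windows on each side
        "client_window": syn["window"],
        "server_window": synack["window"],
    }

capture = [
    {"ts": 0.000, "flags": "SYN",     "window": 65535},
    {"ts": 0.009, "flags": "SYN,ACK", "window": 14600},
    {"ts": 0.010, "flags": "ACK",     "window": 65535},
]

info = analyze_handshake(capture)
print(info["round_trip_time"])   # ~10 ms in this example
```

A real capture would come from an analyzer such as Wireshark or tcpdump; the point is simply that all four of the items above fall out of three packets.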
2. Compare server response time to connection setup time
The TCP handshake gave a good idea of the round-trip time between client and server. Now we can use that timer as a benchmark to measure the server response time. For example, if the connection setup time is 10ms and the server is taking 500ms to respond to client requests, we can estimate that the server itself accounts for around 490ms of that delay. This is not a huge deal when the number of client requests is low, but if the application is "chatty" (lots of client requests to the server) and we suffer the server delay on each call, it turns into a much bigger performance problem.
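The arithmetic above can be sketched in a couple of lines. The numbers mirror the example in the text; the request count is an assumption chosen only to show how a chatty application multiplies the delay:

```python
# Sketch: estimating server processing delay, and what it costs a
# "chatty" application. Values mirror the example in the text.

def server_delay(response_time, rtt):
    """Server processing time = observed response time minus network round trip."""
    return response_time - rtt

def chatty_overhead(requests, response_time):
    """Total time spent on serialized request/response turns."""
    return requests * response_time

rtt = 0.010        # 10 ms connection setup time, from the handshake benchmark
response = 0.500   # 500 ms observed per client request

print(server_delay(response, rtt))      # ~0.49 s of that is the server
print(chatty_overhead(200, response))   # 200 turns at 500 ms each adds up fast
```

The same 490ms that is tolerable on one call becomes minutes of wait time once an application makes hundreds of serialized round trips.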
3. Watch for TCP Resets
There are two ways a client can disconnect from the server: the TCP FIN or the TCP Reset. The FIN is basically a three- or four-packet mutual disconnect between the client and server. It can be initiated by either side, and gracefully closes the connection, freeing it up for other calls. When a TCP Reset is sent, however, it represents an immediate shutdown of the connection. If this is sent by the server, it could be that an inactivity timer expired, or worse, a problem in the application code was triggered. It's also possible that a device in the middle, such as a load balancer, sent the reset. These issues can be found in the packet trace by setting a filter for TCP Resets and closely analyzing the sender for the root cause.
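That filter-and-group step can be sketched as follows. Again the packet records and addresses are illustrative, not from a real trace; in practice a display filter in an analyzer (e.g., matching the RST flag) does the same job:

```python
# Sketch: filtering a trace for TCP Resets and grouping them by sender,
# so the device that tore down the session can be identified.

def find_resets(packets):
    """Return RST packets grouped by source address."""
    resets = {}
    for pkt in packets:
        if "RST" in pkt["flags"]:
            resets.setdefault(pkt["src"], []).append(pkt)
    return resets

trace = [
    {"src": "10.0.0.5", "flags": "SYN"},
    {"src": "10.0.0.9", "flags": "SYN,ACK"},
    {"src": "10.0.0.9", "flags": "RST,ACK"},  # server side shut the session down
]

for sender, pkts in find_resets(trace).items():
    print(sender, len(pkts))
```

Once the resets are grouped by sender, the next question is whether that address belongs to the server itself or to a middlebox such as a load balancer.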
This list is not exhaustive, but it will get you started looking for the root cause of a performance problem. In future articles on LoveMyTool, we'll share tips and tricks for solving different problems with several analyzers, showing how these issues can be detected and resolved.
