Monday, April 1, 2013

The Science of Network Troubleshooting

Courtesy - Jemery Stretch
 
Troubleshooting is not an art. Along with many other IT methodologies, it is often referred to as an art, but it's not. It's a science, if ever there was one. Granted, someone with great skill in troubleshooting can make it seem like a natural talent, the same way a professional ball player makes hitting a home run look easy, when in fact it is a learned skill. Another common misconception holds troubleshooting as a skill derived entirely from experience with the involved technologies. While experience is certainly beneficial, the ability to troubleshoot effectively arises primarily from the embrace of a systematic process, a science.
 
It's said that troubleshooting can't be taught, but I disagree. More accurately, I would argue that troubleshooting can't be taught easily, or to great detail. This is because traditional education encompasses how a technology functions; troubleshooting encompasses all the ways in which it can cease to function. Given that it's virtually impossible to identify and memorize all the potential points of failure a system or network might hold, engineers must instead learn a process for identifying and resolving malfunctions as they occur. To borrow a cliché analogy, teach a man to identify why a fish is broken, rather than expecting him to memorize all the ways a fish might break.

Troubleshooting as a Process

Essentially, troubleshooting is the correlation between cause and effect. Your proxy server experiences a hard disk failure, and you can no longer access web pages. A backhoe digs up a fiber, and you can't call a branch office. Cause, and effect. Moving forward, the correlation is obvious; the difficulty lies in transitioning from effect to cause, and this is troubleshooting at its core.
 
Consider walking into a dark room. The light is off, but you don't know why. This is the observed effect for which we need to identify a cause. Instinctively, you'll reach for the light switch. If the light switch is on, you'll search for another cause. Maybe the power is out. Maybe the breaker has been tripped. Maybe someone stole all the light bulbs (it happens). Without much thought, you investigate each of these possible causes in order of convenience or likelihood. Subconsciously, you're applying a process to resolve the problem.
 
Even though our light bulb analogy is admittedly simplistic, it serves to illustrate the fundamentals of troubleshooting. The same concepts are scalable to exponentially more complex scenarios. From a high-level view, the troubleshooting process can be reduced to a few core steps:
  • Identify the effect(s)
  • Eliminate suspect causes
  • Devise a solution
  • Test and repeat
  • Mitigate

Step 1: Identify the Effect(s)

If you've been a network engineer for more than a few hours, you've been told at least once that the Internet is down. Yes, the global information infrastructure some forty years in the making has fallen to its knees and is in a state of complete chaos. All this is, of course, confirmed by Mary in accounting. Last time it was discovered her Ethernet cable had come unplugged, but this time she's certain it's a global catastrophe.
 
Correctly identifying the effects of an outage or change is the most critical step in troubleshooting. A poor judgment at this first step will likely start you down the wrong path, wasting time and resources. Identifying an effect is not to be confused with deducing a probable cause; in this step we are focused solely on listing the ways in which network operation has deviated from the norm.
 
Identifying effects is best done without assumption or emotion. While your mind will naturally leap to possible causes at the first report of an outage, you must force yourself to adopt an objective stance and investigate the noted symptoms without bias. In the case of Mary's doomsday forecast, you would likely want to confirm the condition yourself before alerting the authorities.
Some key points to consider:

What was working and has stopped?

An important consideration is whether an absent service was ever present to begin with. A user may report an inability to reach FTP sites as an outage, not realizing FTP connections have always been blocked by the firewall as a matter of policy.

What wasn't working and has started?

This is can be a much less obvious change, but no less important. One example would be the easing of restrictions on traffic types or bandwidth, perhaps due to routing through an alternate path, or the deletion of an access control mechanism.

What has continued to work?

Has all network access been severed, or merely certain types of traffic? Or only certain destinations? Has a contingency system assumed control from a failed production system?

When was the change observed?

This critical point is very often neglected. Timing is imperative for correlation with other events, as we'll soon see. Also remember that we are often limited to noting the time a change was observed, rather than when it occurred. For example, an outage observed Monday morning could have easily occurred at any time during the preceding weekend.

Who is affected? Who isn't?

Is the change limited to a certain group of users or devices? Is it constrained to a geographical or logical area? Is any person or service immune?

Is the condition intermittent?

Does the condition disappear and reappear? Does this happen at predictable intervals, or does it appear to be random?

Has this happened before?

Is this a recurring problem? How long ago did it happen last? What was the resolution? (You do keep logs of this sort of thing, right?)

Correlation with planned maintenance and configuration changes

Was something else being changed at this time? Was a device added, removed, or replaced? Did the outage occur during a scheduled maintenance window, either locally or at another site or provider?

Step 2: Eliminate Suspect Causes

Once we have a reliable account of the effect or effects, we can attempt to deduce probable causes. I say probable because deducing all possible causes is impractical, if not impossible. One possible cause is a power failure. Another possible cause is spontaneous combustion. Only one of these possible causes is probable.
 
There is a popular mantra of "always start with layer one," suggesting that the physical connectivity of a network should be verified before working on the higher layers. I disagree, as this is misleading and often impractical. You're not going to drive out to a remote site to verify everything is plugged in if a simple ping verifies end-to-end connectivity. Similarly, it's unlikely that any cables were disturbed if you can verify with relative certainty no one has gone near the suspect devices. Perhaps this is an oversimplified argument, but verifying physical connectivity is often needlessly time consuming and superseded by alternative methods.
 
Instead, I suggest narrowing causes in order of combined probability and convenience. For example, there might be nothing to indicate DNS is returning an invalid response, but performing a manual name resolution takes roughly two seconds, so this is easily justified. Conversely, comparing a current device configuration to its week-old backup and accounting for any differences may take a considerable amount of time, but this presents a high probability of exposing a cause, so it too is justified.
 
The order in which you decide to eliminate suspect causes is ultimately dependent on your experience, your familiarity with the infrastructure, and your allowance for time. Regardless of priority, each suspect cause should undergo the same process of elimination:

Define a working condition

You can't test for a condition unless you know what condition to expect. Before performing a test, you should have in mind what outcome should be produced in the absence of an outage. For example, performing a traceroute to a distant node is meaningless if you can't compare it against a traceroute to the same destination under normal conditions.

Define a test for that condition

Ensure that the test you perform is in fact evaluating the suspect cause. For instance, pinging an E-mail server doesn't explicitly guarantee that mail services are available, only the server itself (technically, only that server's address). To verify the presence of mail services, a connection to the relevant daemon(s) must be established.

Apply the test and record the result

Once you've applied the test, record its success or failure in your notes. Even if you've eliminated the cause under suspicion, you have a reference to remind you of this and avoid wasting time repeating the same test again unnecessarily.
It is common to uncover multiple failures in the course of troubleshooting. When this happens, it is important to recognize any dependencies. For example, if you discover that E-mail, web access, and a trunk link are all down, the E-mail and web failures can likely be ignored if they depend on the trunk link to function. However, always remember to verify these supposed secondary outages after the primary outage has been resolved.

Step 3: Devise a Solution

Once we have identified a point of failure, we want to continue our systematic approach. Just as with testing for failures, we can apply a simple methodology to testing for solutions. In fact, the process very closely mirrors the steps performed to eliminate suspect causes.

Define the failure

At this point you should have a comfortable idea of the failure. Form a detailed description so you have something to test against after applying a solution. For example, you would want to refine "The Internet is down" to "Users in building 10 cannot access the Internet because their subnet was removed from the outbound ACL on the firewall."

Define the proposed solution

Describe exactly what changes are to be made, and exactly what the expected outcome is. Blanket solutions such as arbitrarily rebooting a device or rebuilding a configuration from scratch might fix the problem, but they prevent any further diagnosis and consequently impede mitigation.

Apply the solution and record the result

Once we have a defined failure and a proposed solution, it's time to pit the two against each other. Be observant in applying the solution, and record its result. Does the outcome match what you expected? Has the failure been resolved?
In addition to our defined process, some guidelines are well worth mentioning.

Maintain focus

Far too often I encounter a technician who, upon becoming frustrated with a failure or failures, opts to recklessly reboot, reconfigure, or replace a device instead of troubleshooting systematically. This is the high-tech equivalent of pounding something with a hammer until it works. Focus on one failure at a time, and one solution at a time per failure.

Watch out for hazardous changes

When developing a solution, remember to evaluate what effect it might have on systems unrelated to those being troubleshot. It's a horrible feeling to realize you've fixed one problem at the expense of causing a much larger one. The best course of action when this happens is typically to immediately reverse the change which was made. Note that this is only possible with a systematic approach.

Step 4: Test and Repeat

Upon implementing a solution and observing a positive effect, we can begin to retrace our steps back toward the original symptoms. If any conditions were overlooked because they were decided to be dependent on the recently resolved failure, test for them again. Refer to your notes from the initial report and verify that each symptom has been resolved. Ensure that the same tests which were used to identify a failure are used to confirm functionality.
 
If you notice that a failure or failures remain, pick up where you left off in the testing cycle, annotate it, and press forward.

Step 5: Mitigate

The troubleshooting process does not end when the problem has been resolved and everyone is happy. All of your hard work up until this point amounts to very little if the same problem occurs again tomorrow. In IT, problems are commonly fixed without ever being resolved. Band-aids and hasty installations are not acceptable substitutes for implementing a permanent and reliable solution. So to speak, many people will go on mopping the floor day after day without ever plugging the leak.
 
A permanent solution may be as complex as redesigning the routed backbone, or as simple as moving a power strip somewhere it won't be tripped on anymore. A permanent solution also doesn't have to be 100% effective, but it should be as effective as is practical. At the absolute minimum, ensure that you record the observed failure and the applied solution, so that if it the condition does recur you have an accurate and dated reference.

A Final Word

Everyone has his or her own preference in troubleshooting, and by no means do I consider this paper conclusive. However, if there's only one concept you take away, make it this: above all else, remain calm. You're no good to anyone in a panic. One's ability to manage stress and maintain a professional demeanor even in the face of utter chaos is what makes a good engineer great.
 
Most network outages, despite what we're led to believe by senior management, are not the end of the world. There are instances where downtime can lead to loss of life; fortunately, this isn't the case with the vast majority of networks. Money may be lost, time may be wasted, and feelings may be hurt, but when the problem is finally resolved, odds are you've learned something valuable.
 

No comments:

Post a Comment