Troubleshooting Methodology

While much has been written on the scientific method, comparatively little has been written on troubleshooting methodology. Troubleshooting draws on the scientific method as well as several other practices: team psychology, extended testing, monitoring and measurement, isolation, and root cause analysis. Given how complex systems are these days, fixing a problem can be daunting. It may even be hard to figure out what broke. That is why you have to start with a hypothesis.

Hypothesis

A hypothesis is a proposed explanation based on limited evidence. In other words, a guess. As soon as someone identifies a problem, it is human nature to try to predict its cause. This is a natural response, but in many cases the hypothesis is wrong. When you are working as a group to resolve a problem, it is very easy to form a hypothesis and want to stick with it. However, you must keep an open mind and be ready to admit to yourself and others that you are wrong, as soon as you realize it. Your team will not think any less of you if you do. Most likely they will be thankful, because it lets them get rid of lingering doubts in their own minds as well. Also remember that it is often a good idea to have a second set of eyes look at a problem. Even if you think their technical skills are not as strong as yours, they may perceive the problem better than you do. Even the process of explaining the issue to them will help. If you are troubleshooting a product, make sure you check its documentation as well. There may be some bits of knowledge that you previously overlooked or interpreted differently.

Testing

A great way to troubleshoot a problem is to test, test, test. However, there are a couple of things you need to do before you start testing. The first is to find your baseline. You must understand how your system was configured and how it functioned historically, before it started to malfunction. This takes foresight, because you need to know what to document and measure before problems start occurring. Luckily, since systems are often heavily interconnected, a measurement at one point can indicate problems arising at other points. This means that if you measure your system at key points when it is implemented, and on a regular basis afterward, you are far more likely to catch problems. You also need to keep detailed records of any changes that could have caused problems. Try to find archives of how your system was before it broke. Figure out if anything has changed in its configuration.
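As one possible illustration of checking whether a configuration has changed, here is a minimal sketch that compares an archived snapshot against the current one using Python's standard difflib module. The function name and the configuration text are hypothetical, made up for this example; real configurations would come from your own archives.

```python
import difflib


def config_changes(archived: str, current: str) -> list[str]:
    """Return the lines that were removed ("- ") or added ("+ ")."""
    diff = difflib.ndiff(archived.splitlines(), current.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; keep only changes.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical configuration snapshots, for illustration only.
archived = "mtu=1500\nduplex=full\nspeed=1000\n"
current = "mtu=1500\nduplex=half\nspeed=1000\n"

for change in config_changes(archived, current):
    print(change)
```

An empty result suggests the configuration matches the archive, which is useful to know too, because it lets you rule configuration out and look elsewhere.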

Now get your tools out. You should have a toolkit with a number of different tools that you are comfortable with. Make sure you understand how they work. That means knowing what a good, normal result looks like as opposed to a bad, abnormal one. One way to do this is to calibrate your tools. In a fully controlled environment, you should be able to consistently reproduce a good result, and you should understand what that result means. If you can, create some bad results too, so that you know what they look like as well.
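Calibration can be sketched in code as running a tool against inputs whose outcome you already know. The example below is only an illustration under stated assumptions: the tool (latency_ok), its threshold, and the sample values are all hypothetical stand-ins for whatever you actually have in your toolkit.

```python
from typing import Any, Callable, Iterable


def calibrate(tool: Callable[[Any], bool],
              known_good: Iterable[Any],
              known_bad: Iterable[Any]) -> list[str]:
    """Run a tool on samples with known outcomes and list every mismatch."""
    problems = []
    for sample in known_good:
        if not tool(sample):  # tool should report "healthy" here
            problems.append(f"false alarm on known-good sample {sample!r}")
    for sample in known_bad:
        if tool(sample):  # tool should report "unhealthy" here
            problems.append(f"missed known-bad sample {sample!r}")
    return problems


# Hypothetical tool: treats a round-trip time below the threshold as healthy.
def latency_ok(rtt_ms: float, threshold_ms: float = 100.0) -> bool:
    return rtt_ms < threshold_ms


problems = calibrate(latency_ok, known_good=[12.0, 48.5], known_bad=[250.0, 900.0])
print(problems or "tool agrees with every known result")
```

If the list is not empty, the tool itself needs attention before you trust anything it says about the broken system.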

Remember to change (test) one variable at a time. Changing multiple variables at once to resolve a problem leads to incorrect hypotheses, because you cannot tell which change had the effect. Test the system in one direction. If you have redundant systems, test your system with the redundant system down. Then test with it up. Then test from another viewpoint. Then test with the redundant system down from another viewpoint. Remember, redundant systems can cloud troubleshooting progress. Make sure you understand how redundancy is affecting your testing. You may think a system is functioning correctly because its redundancy is up, when in reality the main system is down. You will be shocked at how often and how quickly your hypotheses change after testing. Also apply the scientific method to your testing. Uniquely identify your problem. Ensure that, given a unique set of circumstances, you can positively identify your problem. Make sure you can repeat your results consistently. If you cannot repeat your results consistently by any means, you may be looking at a bug.
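The one-variable-at-a-time rule can be sketched as a small test-plan generator. This is a rough illustration, not a prescribed procedure: the variable names and values (redundant_link, viewpoint, mtu) are hypothetical, and each generated test differs from the baseline in exactly one setting.

```python
def one_at_a_time(baseline: dict, candidates: dict) -> list[tuple[str, dict]]:
    """Build test configurations that each change exactly one variable."""
    variants = []
    for name, values in candidates.items():
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline, nothing to test
            variant = dict(baseline)  # copy so the baseline stays untouched
            variant[name] = value
            variants.append((f"{name}={value}", variant))
    return variants


# Hypothetical system settings, for illustration only.
baseline = {"redundant_link": "up", "viewpoint": "inside", "mtu": 1500}
candidates = {
    "redundant_link": ["up", "down"],
    "viewpoint": ["inside", "outside"],
    "mtu": [1500, 9000],
}

for label, config in one_at_a_time(baseline, candidates):
    print(label, config)
```

A combined condition, such as the redundant system down as seen from another viewpoint, would then be a later step that starts from one of these variants once its result is understood.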

Bugs

Bugs are by their very nature wildcards. You must remember that bugs do not cause consistent failures. They cause inconsistent failures. This is because the engineer who designed the system did not foresee the unique set of circumstances around which you have built yours. A bug will cause anomalous and strange behavior that you probably cannot understand. If you suspect you are running into a bug, consult subject matter experts and escalate the problem to them. Leave the hypotheses up to the subject matter experts. They have seen problems like this before.

Isolate

To find out what is broken or causing a problem, you must first isolate the problem. To isolate a problem, you must figure out its realm. When did the problem start? How big is the problem? What other systems is it affecting? What do the affected systems have in common? For example, which users is it affecting? Cut the problem off from the outside world. Eliminate interference. You must be able to see it clearly to identify it. The less interference you have, the more clearly you will see it. Then isolate it so you can nail it down.
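The question of what the affected systems have in common can be sketched as a set intersection. The inventory below (site, switch, os) is hypothetical, and the approach assumes you can describe each system with a few simple attributes; it narrows the search rather than proving a cause.

```python
def common_factors(affected: list[dict], unaffected: list[dict]) -> dict:
    """Attributes shared by every affected system and by no unaffected one."""
    if not affected:
        return {}
    shared = set(affected[0].items())
    for system in affected[1:]:
        shared &= set(system.items())  # keep only what all affected systems share
    for system in unaffected:
        shared -= set(system.items())  # drop anything a healthy system also has
    return dict(shared)


# Hypothetical inventory, for illustration only.
affected = [
    {"site": "east", "switch": "sw2", "os": "linux"},
    {"site": "east", "switch": "sw2", "os": "windows"},
]
unaffected = [
    {"site": "east", "switch": "sw1", "os": "linux"},
]

print(common_factors(affected, unaffected))
```

Here the only attribute that separates the broken systems from the healthy one is the switch, which is where the next test should be aimed.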

Gain Perspective

A great trick to troubleshooting is to gain perspective. Remember that no matter how objective and omniscient you think you are, your perspective is always subjective. Think of a way to turn the tables on yourself. Put yourself in your customer's shoes. Put yourself in another geographic location. Take the 50,000-foot view. You will be shocked how quickly you learn more about the problem you are faced with. All your egotistical assumptions will evaporate as you gain perspective on your problem. There are many ways to gain perspective; you just have to be open-minded and creative.

Monitoring and Measurement

To resolve a problem, it really helps to know when it started. It helps even more to find out as soon as it starts, so you can fix it fast. Or at least go beat up whoever started it. You must have monitoring tools in place to gather metrics on your systems on a constant basis. These tools should then put those metrics into a graph. A graph is an invaluable troubleshooting aid. It tells you at a glance exactly when things changed. Think about it. What is easier to read, a spreadsheet of numbers or a graph? Think about the many ways in which you can measure your system, and measure them on a constant basis. Understand the results you are getting, and the metrics (descriptions) you are using. Many times, accurately measuring systems will allow you to resolve problems.
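Alongside a graph, the moment a metric changed can also be found numerically. The sketch below is one simple approach among many, not the method any particular monitoring tool uses: it treats the first few samples as the baseline and reports the first later sample that falls more than a chosen number of standard deviations from the baseline mean. The function name, the tolerance, and the latency values are assumptions for illustration.

```python
from statistics import mean, stdev


def first_deviation(samples: list[float], baseline_count: int,
                    tolerance: float = 3.0):
    """Index of the first post-baseline sample far from the baseline mean."""
    baseline = samples[:baseline_count]  # needs at least two samples
    center = mean(baseline)
    spread = stdev(baseline)
    for index in range(baseline_count, len(samples)):
        if abs(samples[index] - center) > tolerance * spread:
            return index
    return None  # nothing left the baseline range


# Hypothetical latency readings in milliseconds, one per polling interval.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 45, 47, 46]

print(first_deviation(latency_ms, baseline_count=6))
```

With timestamps stored next to each sample, the returned index translates directly into when the change happened, which you can then line up against your change records.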

Root Cause

No matter what the problem, you must find the real root cause. The root cause is not that restarting it fixed it, or that taking it apart and putting it back together fixed it. Something caused the problem in the first place. That cause must be found so it can be stopped. You may be tempted to get a system running and go home. However, that is not the answer. The answer is to change something about it, so the problem does not happen again. If you do not change anything, the problem will reappear. Finding the root cause lets you fix the real problem, not just a symptom. Sometimes the root cause can be elusive. However, you must always seek the ultimate truth.