We’ve all seen it. The engineer who almost feels how a system works and can within seconds can identify what’s going on. Ask them how they do it, and most of them will say “I just know”. Not helpful at all.
Then there are the mere mortals who struggle to get some things debugged and wander in circles trying lots of different things and most of the time stumbling on the problem.
I spent a while observing the intuitive engineers, observing myself, and observing others, and here is how I do systematic debugging. It’s very close to my understanding of the technique of differential diagnosis. Apparently I’m not alone (link, link).
Worked example: let’s consider a web server that is not returning data.
Step 0: Start with a System View
Ironically, this is one of the biggest gaps I see.
What’s the first thing the SWAT team pulls out when they are trying to work out what’s happening in a building? It’s the floor plan. Why? So that they can conceptualize where people would be, where the weak and strong spots are. Without a floorplan they are going in blind.
A lot of engineers are either unwilling or unable to communicate how they think a system works. This is the most critical part since the theory of operation that is in your head will guide the following steps. It ultimately doesn’t matter if the system view is right or wrong – it will evolve as more information is found.
A System View doesn’t have to be detailed, it can be superficial but you build up as you go down. It can be a set of interconnected black boxes, that can be simply accepted as working consistently and as intended, unless data indicates otherwise.
Step 1: Gather Information (Gather)
Gather whatever information is available. Logs, Failure Modes, Reproduction Scenarios, whatever is available. It doesn’t really matter how much you have, and sometimes too much makes it harder to diagnose. This starts you going.
Worked Example:, this could be error codes present in the response, amount of time that it takes to fail, do other requests work or not. In this case, we don’t have any return code, it just times out.
Step 2: Create Multiple Hypothesis (Guess)
So based on the available information, you can consider about what may be the cause. As you consider each of the potential causes, don’t worry too much about pruning too quickly. By nature you will automatically prune out some scenarios based on the available information. (See Other Thoughts).
Worked Example: There could be a server failure, server overload, bug in the page rendering, network issue, etc.
Step 3: Work out how to Prove or Refute each Hypothesis (Plan)
For each hypothesis, consider what you can do to either prove or disprove that hypothesis. As you consider what can prove or refute, there will be items that are either easy to test, or split the hypothesis. Keep in mind on these. Don’t worry too much if there is information that missing, or you can’t get completely prove or disprove it.
Worked Example: What error codes are returned (decides between server failure/overload), are other pages working (may split between server side code and/or static code), server load values.
Step 4: Conduct your Experiments
Now that you’ve got your experiments, you are ultimately either removing a potential hypothesis, or your have proven that it’s at least true. If you have refuted a hypothesis, then you can remove it from your list of things to debug. You’ve proven it right, you now have more data that helps provide a better picture. Just like 20 questions, you need to choose what experiments to run that confirm/refute as many of your hypothesis as possible. This mixes in a healthy dose of effort, information gained, number of hypothesis.
Step 5: Repeat Step 2-4
As you work through the debugging, you will continue to get more data. as you get more data, you will possibly come up with other hypothesis. This is more complex than a simple decision tree, since it’s dynamic and not designed. By cycling through multiple times you will be getting more data, and isolating where the problem is.
Note that if it was easy, it would be linear, directed and you would be done. Life isn’t that easy, so there will be mistakes, incorrect assumptions. Always keep a beginner’s mind where you observe, re-challenge and ensure that you don’t take hard positions on things that you may have excluded, new information may make a big difference.
There are a number of other things that are happening in the background here. I’ll explore some things here.
Intuitive Engineers: We’ve all seen them, the engineers who sit down with a problem. Tap a few keys, look at a couple of logs, and magically zero in on the root cause. In my experience, you can usually ask questions like “What do you know it is not?“, or “How did you determine that it wasn’t or was that?“. Most engineers will have immediate answers to those questions. If you ask them directly about how they get to the right answer, most won’t have a strong view on how they got there.
Systems Thinking: A strong part of this is process is to construct a theory of operation about how the system works. A lot of the software industry just plain suck at maintaining good architectural documentation. Independent of that, constructing a System Model of how the system works is a critical step. Like the Beginner’s Mind – allow your view to be fluid and learn and respond to new information.
Remember 20 Questions: As a kid, I’d play 20 questions with the family. The critical part to quick isolation is to segment the solution space as quickly as possible. Through the power of bisection, 20 questions allow you to segment into 1,048,576 different sub-domains. Well constructed hypothesis and experimentation will apply the halving in an ideal world. Real world is messy, but the rule still applies.
Depth vs Breadth First: There is no prescription on driving down a line of experiments – or knock off a different set of experiments first. The best choice comes with practice and improvements in intuition and judgement. Generally, if you keep getting hypothesis confirmed, continuing depth first can help isolate. When there is confusion, breadth helps first.
Beginner’s Mind: Shoshin is a Japanese term for beginners mind, similar in concept to growth mindset. However within that concept you drop preconception, biases and be in the moment with the data that you have.
Allow Yourself to Be Wrong: This is critical, if you don’t take risks, it will be hard to be learn. Learning is best when you make small mistakes that help shape your best path. You learn best when you are wrong.
Practice, Practice, Practice: I personally feel that intuition comes from improved judgement and experiencing patterns. You can only get better if you practice.