So, there is a weird bug going on in your system. You’ve read the code, it all looks fine, yet the results are not what you expected. Something “strange” is happening. What do you do now?
The basic debugging technique is often called divide and conquer, but we can use a slight twist on that concept. It’s not about splitting the code in half, but rather about keeping 2 different positions in the code and then closing the gap between them until you have isolated the issue.
The first step in debugging, as always, is to replicate the bug in a test environment. If you can’t replicate the bug, tracking it down in a live system is similar, but it needs to play out in a slower and more complicated fashion, which is covered at the end of this post.
Once you’ve managed to reproduce the bug, the next important step is making sure you can get adequate feedback.
There are 2 common ways of getting debugging feedback, and they are similar. You can run the code under a debugger, walk through the execution, and inspect the current state of the data at various points. Or you can put a lot of ‘print’ statements directly into the code that output to a log file. In both cases, you need to know which functions were run and what the contents of the variables were at different points in the execution. Most programmers need to understand how to debug with either approach. Sometimes one works way better than the other, especially if you are chasing down an integration bug that spans a lot of code or time.
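As a minimal sketch of the second approach (the function and values here are made up for illustration), Python’s logging module can record the inputs and outputs of a suspect function, giving much the same feedback a debugger would:

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(message)s")
log = logging.getLogger("debug-trace")

def apply_discount(order_total, discount_rate):
    # Log the inputs on entry and the result on exit, so the log file
    # records what ran and what the data looked like at each point.
    log.debug("apply_discount in: total=%r rate=%r", order_total, discount_rate)
    result = order_total * (1 - discount_rate)
    log.debug("apply_discount out: result=%r", result)
    return result

apply_discount(100.0, 0.15)
```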
Now we are ready to start, so we need to find 2 things. The first is the ‘last known’ place in the code where it was definitely working. Not “maybe” working, or “probably” working, but definitely working. You need to know this for sure. If you aren’t sure, back up through the code until you reach a place where you are. If you are wrong, you’ll end up wasting a lot of time.
The second thing you need is the ‘first’ place in the code where it was definitely broken. Again, like above, you want to be sure that it is a) really broken and b) the problem either started there or possibly a little earlier. If you know there is a line of code a couple of steps before that was also broken, that earlier line is a better choice.
So, now you have a way of replicating the behavior, a last working point, and a first broken point. In between those two points is a bunch of code, including function calls, loops, conditionals, etc. The core part of debugging is to shrink that range down to the smallest chunk of code possible by moving the last and first points closer together.
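As a toy sketch of the idea (the pipeline here is invented), the two positions are just checkpoints where you print the state, and each run tells you which part of the range the bug lives in:

```python
# 'Last known good' is after parse(); 'first known bad' is after report().
# Checkpoint prints between the two let you move the positions toward
# each other on every run.
def parse(raw):
    return [int(x) for x in raw.split(",")]

def summarize(values):
    return {"count": len(values), "total": sum(values)}

def report(summary):
    return f"{summary['count']} items, total {summary['total']}"

raw = "1,2,3"
values = parse(raw)
print("checkpoint A (known good):", values)   # last working point
summary = summarize(values)
print("checkpoint B (move me):", summary)     # narrow the gap here
text = report(summary)
print("checkpoint C (known bad):", text)      # first broken point
```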
By way of example, you might know that the code executes a few instructions correctly, then calls a complex function with good data, but the results are bad. If you have checked the inputs and they are right, you can go into that function, moving the last working position up to the start of its code. You can then move the broken position to each and every return, handler block, or other terminal point in that function.
Do keep in mind that most programming languages support many different types of exits from functions, including returns, throws, or handlers attached at exit. So, just because the problem is in a function doesn’t mean that it is returning bad data out of the final return statement at the bottom. Don’t assume the exit point, confirm it.
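As a sketch, assuming a hypothetical lookup function, logging every exit path confirms which one actually fired:

```python
import logging

log = logging.getLogger("exits")

def lookup_price(catalog, item):
    # A function with three distinct exits; logging each one confirms
    # which path actually produced the bad result, rather than assuming
    # it was the return at the bottom.
    if item not in catalog:
        log.debug("exit 1: missing item %r", item)
        return None
    try:
        price = float(catalog[item])
    except ValueError:
        log.debug("exit 2: unparseable price %r", catalog[item])
        raise
    log.debug("exit 3: normal return %r", price)
    return price
```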
At some point, after you have done this correctly a bunch of times, you will probably have narrowed the problem down to a small set of instructions or some type of library call. Now, you kinda have to read the code, figure out what it was meant to do, and watch for syntactic or semantic reasons why that didn’t happen. You’ve narrowed it down to just a few lines. Sometimes it helps to write a small example, with the same input and code, so you can fiddle with it, as sketched below.
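For instance, if the suspect lines round a currency value, a scratch file with the same input and code can expose the semantic surprise (the scenario is invented, though this particular gotcha is real Python behavior):

```python
# Minimal reproduction of a hypothetical rounding bug, isolated in a
# scratch file so it can be fiddled with safely.
value = 2.675
print(round(value, 2))  # prints 2.67, not 2.68: 2.675 is stored as a
                        # slightly smaller binary float
```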
If you’ve narrowed it down to a chunk of code that turns out to completely work, or to be completely broken, it is usually because you made a bad assumption about one of the two endpoints. Knowing that the code actually works is different from just guessing that it does. If the code compiles and/or runs, the computer is fine with it, so any confusion is coming from the person debugging.
What if the code is in a framework and the bug spans multiple entry-points in my code?
It’s similar, in that you are looking for the last entry-point that works and the first one called that is wrong. It is possible that the bug is in the framework itself, but you should avoid that conclusion until you have exhausted every other option. If the data is right coming out of one entry-point, you can check whether it is still right going into the later one but then gets invalidated there. Most bugs of this type are caused by the entry-points not syncing up correctly; corruption is highly unlikely.
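As a sketch of that check, with invented handlers standing in for framework entry-points, log the shared data as it leaves one and enters the next:

```python
import logging

log = logging.getLogger("entry-points")

def on_submit(session, form):
    # First entry-point: writes the shared data and logs it on the way out.
    session["order"] = dict(form)
    log.debug("on_submit out: order=%r", session["order"])

def on_confirm(session):
    # Later entry-point: logs the same data on the way in; if it differs
    # from what on_submit wrote, the two are not syncing up.
    log.debug("on_confirm in: order=%r", session.get("order"))
```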
What if I don’t know what code is actually executed by the framework?
This is an all too common problem in modern frameworks. You can attach a lot of code in different places, but it often isn’t clear when and why that code is executed. If you can’t find the last working place or the first failing place, you might have to put in logging statements or breakpoints for ‘anything’ that could have been called in between. This type of scaffolding (it should be removed after debugging) is a bit annoying and can use up a lot of time, but it is still faster than blindly guessing at the problem. If, while rerunning the bug, you find that some of the calls are good going in and good coming out, you can drop them. You can also drop the ones that are entirely bad, bad going in and bad coming out (though they may turn out to be useful later for assessing whether a code change is actually fixing the problem or just making it worse).
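One way to build that scaffolding, sketched here with a hypothetical callback, is a decorator that wraps ‘anything’ the framework might call and logs its arguments and results:

```python
import functools
import logging

log = logging.getLogger("scaffold")

def traced(fn):
    # Temporary scaffolding; remove after debugging. Logs every call's
    # inputs and outputs so good-in/good-out functions can be ruled out.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.debug("%s in: args=%r kwargs=%r", fn.__name__, args, kwargs)
        result = fn(*args, **kwargs)
        log.debug("%s out: result=%r", fn.__name__, result)
        return result
    return wrapper

@traced
def maybe_called_by_framework(payload):  # hypothetical callback
    return payload
```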
What if the underlying cause is asynchronous code?
The code seems to be fine, but then something else running in the background messes it up. In most debugging you can just print out the ‘change’ of state; in concurrent debugging, you always have to print out the before and after. This is one place where log files are really crucial to gaining a correct understanding. As well, you have to consider the possibility that while one thread of execution is making its way through the steps, another thread of execution bypasses it (starts later but finishes first). For any variables that ‘could’ be shared, you either have to protect them or craft the code so that their values don’t matter between instructions.
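A minimal sketch, with an invented shared counter: putting the thread name in the log format and recording the value both before and after the change makes interleavings visible in the log:

```python
import logging
import threading

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s [%(threadName)s] %(message)s",
)
log = logging.getLogger("concurrent")

counter = {"value": 0}  # a variable that 'could' be common to threads

def bump():
    # Log the state both before and after the change; if another thread
    # slipped in between, the log will show it.
    before = counter["value"]
    log.debug("before: %r", before)
    counter["value"] = before + 1
    log.debug("after: %r", counter["value"])

threads = [threading.Thread(target=bump) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```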
What if I can’t replicate the problem?
There are some issues, often caused by configuration or race conditions, that occur so infrequently, and only in production systems, that you basically have to use log files to set the first and last positions, then wait. Each time the bug triggers, you should be able to decrease the distance between the two. While waiting, you can examine the code and think up scenarios that would explain what you have seen. Thinking up lots of scenarios, and not getting too attached to any of them, works best; it also suggests a few extra log entries you can insert into the output to validate or eliminate some of them.
Configuration problems show up as the programmer assuming that X is set to ‘foo’ when it is actually set to ‘bar’. They are usually fairly easy to fix, but sometimes they are just a small side-effect of a larger process or communications problem that needs fixing too.
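A small sketch of the defensive move, using an invented setting name: log the configuration the code actually sees at startup, so the assumption can be checked against the record:

```python
import logging
import os

log = logging.getLogger("config")

# Record the effective value, not the assumed one; "I thought it was
# 'foo'" can then be verified from the log.
payment_mode = os.environ.get("PAYMENT_MODE", "foo")
log.info("effective config: PAYMENT_MODE=%r", payment_mode)
```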
Race conditions are notoriously hard to diagnose, particularly if they are very infrequent. Basically, at least 2 things are happening at once, and most of the time one finishes before the other, but on some rare occasions it is the other way around. Most fixes for this type of problem involve adding a synchronization primitive that forces one circumstance or the other, so they can no longer happen in random order. If you suspect a problem, you can ‘fix’ it even if it turns out not to be wrong, but keep in mind that serializing parallel work does come with a performance cost. Still, if people are agitated by the presence of the bug, and you find 6 potential race conditions, you can fix all 6 at once, then later maybe undo a few of them once you are sure which ones were valid.
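Continuing the counter sketch from above, the usual fix is a lock that forces the read-modify-write to happen as one unit:

```python
import threading

counter = {"value": 0}
counter_lock = threading.Lock()  # the added synchronization primitive

def bump():
    # Holding the lock serializes the update: another thread can no
    # longer interleave between the read and the write, at some cost
    # in parallelism.
    with counter_lock:
        counter["value"] += 1
```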
If the problem is neither configuration nor a race condition, then it is most likely just unexpected data. You can fix that in the code, but also use it as motivation to get that data into testing, so similar problems don’t keep recurring. It should also be noted that it is symptomatic of a larger analysis problem as well, given that the users needed to do something that the programmers were not told about.
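A sketch of getting that data into testing (the function and payload are invented; the test runs under pytest’s plain-assert style):

```python
def normalize_quantity(raw):
    # The fixed code: tolerate whitespace and empty strings.
    raw = raw.strip()
    return int(raw) if raw else 0

def test_unexpected_blank_quantity():
    # Capture the exact input that caused the bug as a regression test.
    assert normalize_quantity("  ") == 0
```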