Sunday, September 27, 2020

Laws of Software Development

 For non-technical people, it is very easy to get confused about software development. From the outside, creating software seems simple: setup some tools, bang out some source code, and *voila* you get a product. Everybody is doing it, how hard can it be?

However, over the last five decades, we’ve found that software development can be deceptive. What seems easy, ain’t, and what seems hard probably is. Here are some nearly unbreakable “laws” for development that apply:

  1. You get what you’ve paid for.

Software development is slow and expensive. If you want it fast and cheap, the resulting software will either be bad or valueless. It may not be obviously bad, but it will quickly become a time and money sink, possibly forever. If you need the software to be around for more than a week, you’ll have to spend some serious money to make that happen and have a lot of patience. If you want your product to compete against other entries, then just a couple of month’s worth of work is not going to cut it.

  1. Software development takes a ‘huge’ amount of knowledge and experience.

If you are hoping that some kids, right out of school, will produce the same quality of workmanship that a group of seasoned professionals will, it’s not going to happen. The kids might be fast to produce code, but they are clueless when it comes to all of the other necessary aspects of stability like error handling, build environment, packaging, and operational monitoring. A basic development shop these days needs dozens of different technologies, and each one takes years to learn. If you get the code but can’t keep it running, it isn’t really that much of an achievement. 

  1. If you don’t know what to build, don’t build it.

Despite whatever is written out there, throwing together code with little idea about what it’s going to do is rarely a productive means of getting to something that works. It’s far better to work through the difficulties on paper than it is to spend 100x that energy working through them in code. On top of that, code has a tendency to freeze itself into place, making any future work on a bad foundation way more difficult. If you did throw together the code, remember to throw it away afterward. That will save a lot of pain.

  1. If it were easy, it probably already exists.

A tremendous amount of code has been written, rewritten, and deployed all over the place. Most basic ideas have been explored, and people have tried the same workarounds for decades, but still failed to get traction. So, it’s not a matter of brainstorming some clever new idea out of nowhere that is trivial to implement and will be life-changing. If you are looking to build something that isn’t a copy of something else, then the core ideas need to be predicated on very deep knowledge. If they are not, it’s probably a waste of time.

  1. If it looks ugly then people won’t ‘try’ to use it.

There are lots of ugly systems out there that work really well and are great tools. But they are already existing, so there is little friction in keeping them going. Ugly is a blocker to trying stuff, not to ongoing usage. If people aren’t forced to try something new, then if it is ugly they will put up fierce resistance. 

  1. If it is unstable then people won’t keep using it.

Modern expectations for software quality are fairly low, but even still if the software is just too flaky, most people will actively look for alternatives. Any initial patience gets eroded at an exponential rate, so they might seem to be putting up with the bugs and problems right now, but as time goes by each new issue causes larger and larger amounts of damage. At some point, if the software is ‘barely’ helpful, there will be enough incentives for them to switch over to an alternative.

  1. The pretty user interface is only a tiny part of the work that needs to be done.

Software systems are like icebergs, only a small part of them, the graphical user interface, is visible to people. GUIs do take a lot of work and design and are usually the place where most bugs are noticed, but what really holds the system together is the invisible stuff in the backend. Ignoring those foundations, just like with a house, tends to beg for a catastrophe. That backend work is generally more than 80% of the effort (it gets higher as the project grows larger).

There are, no doubt, plenty of other ‘laws’ that seem unbreakable in development. This is just the basic ones that come up in conversations frequently and are unlikely to see many -- if any -- exceptions. 

Thursday, September 3, 2020


 So, there is a weird bug going on in your system. You’ve read the code, it all looks fine, yet the results are not what you expected. Something “strange” is happening. What do you do now?

The basic debugging technique is often called divide and conquer, but we can use a slight twist on that concept. It’s not about splitting the code in half, but rather about keeping 2 different positions in the code and then closing the gap between them until you have isolated the issue.

The first step, as always, in debugging is to replicate the bug in a test environment. If you can’t replicate the bug, tracking it down in a live system is similar, but it needs to play out in a slower and more complex fashion which is covered at the end of this post. 

Once you’ve managed to be able to reproduce the bug, the next important step is making sure you can get adequate feedback. 

There are 2 common ways of getting debugging feedback, they are similar. You can start up a debugger running the code, walk through the execution, and type of the current state of data at various points. The other way is you can put in a lot of ‘print’ statements directly in the code that output to the log file. In both cases, you need to know what functions were run, and what the contents of variables were at different points in the execution. Most programmers need to understand how to debug with either approach. Sometimes one works way better than the other, especially if you are chasing down an integration bug that spans a lot of code or time.

Now, we are ready to start, so we need to find 2 things. The first is the ‘last known’ place in the code where it was definitely working. Not “maybe” works, or “probably” works, but definitely working. You need to know this for sure. If you aren’t sure, you need to back up through the code, until you get to a place where you are definitely sure. If you are wrong, you’ll end up wasting a lot of time.

The second thing you need is the ‘first’ place in the code where it was definitely broken. Again, like above, you want to be sure that it is a) really broken and b) it either started there or possibly a little earlier. If you know there is a line of code a couple of steps before that was also broken since that is earlier, it is a better choice. 

So, now, you have a way of replicating the behavior, the last working point, and a first broken point. In between those two points is a bunch of code, includes function calls, loops, conditionals, etc. The core part of debugging is to bring that range down to the smallest chunk of code possible by moving either the last or first points closer together. 

By way of example, you might know that the code executes a few instructions correctly, then calls a complex function with good data, but the results are bad. If you have checked the inputs and they are right, you can go into that function, moving the last position up to the start of its code. We can move the broken position up to each and every return, handler block, or other terminal points in that function. 

Do keep in mind that most programming languages support many different types of exits from functions, including returns, throws, or handlers attached to exiting. So, just because the problem is in a function, doesn’t mean that it is returning bad data out of the final return statement at the bottom. Don’t assume the exit point, confirm it.

At some point, after you have done this correctly, a bunch of times, you have narrowed the problem down probably to a small set of instructions or some type of library call. Now, you kinda have to read the code and figure out what it was meant to do and watch for syntactic or semantic reasons why that didn’t happen. You’ve narrowed it down to just a few lines. Sometimes it helps to write a small example, with the same input and code, so you can fiddle with it.

If you’ve narrowed it down to a chunk of code that completely works or is completely broken, it is usually because you made a bad assumption somewhere. Knowing that the code actually works is different from just guessing that it does. If the code compiles and/or runs then the computer is fine with it, so any confusion is coming from the person debugging.

What if the code is in a framework and the bug spans multiple entry-points in my code?

It’s similar in that you are looking for the last entry-point that works and the first one called that is wrong. It is possible that the bug is in the framework itself, but you should avoid thinking that until you have exhausted every other option. If the data is right, coming out of one entry point, you can check that it is still right going into the latter one but then gets invalidated there. Most bugs of these types are caused by the entry points not syncing up correctly, corruption is highly unlikely. 

What if I don’t know what code is actually executed by the framework?

This is an all too common problem in modern frameworks. You can attach a lot of code into different places, but it isn’t often clear when and why that code is executed. If you can’t find the last working place or the first failing place, then you might have to put in logging statements or breakpoints for ‘anything’ that could have been called in-between. This type of scaffolding (it should be removed after debugging) is a bit annoying and can use up lots of time, but it is actually faster than just blindly guessing at the problem. If while rerunning the bug, you find that some of the calls are good, going in and good coming out, you can drop them. You can also drop the ones that are entirely bad, going in and bad going out (but they may turn out to be useful later for assessing whether a code change is actually fixing the problem or just making it worse). 

What if the underlying cause is asynchronous code?

The code seems to be fine, but then something else running in the background messes it up. In most debugging you can just print out the ‘change’ of state, in concurrent debugging, you have to always print out the before and after. This is one place where log files are really crucial to gaining correct understanding. As well, you have to consider the possibility that while one thread of execution is making its way through the steps, another thread of execution bypasses it (starts later but finishes first). For any variables that ‘could’ be common, you either have to protect them or craft the code so that their values don’t matter between instructions.

What if I can’t replicate the problem? 

There are some issues, often caused by configuration or race conditions, that occur so infrequently and only in production systems, that you basically have to use log files to set the first and last positions, then wait. Each time it triggers, you should be able to decrease the distance between the two. While waiting, you can examine the code and think up scenarios that would explain what you have seen. Thinking up lots of scenarios is best, and not getting too attached to any of them opens up the ability to insert a few extra log entries into the output that will validate or eliminate some of them. 

Configuration problems show up as the programmer assuming that X is set to ‘foo’ when it is actually set to ‘bar’. They are usually fairly easy to fix, but sometimes are just a small side-effect of a larger process or communications problem that needs fixing too.

Race conditions are notoriously hard to diagnose, particular if they are very infrequent. Basically, at least 2 things are happening at once, and most of the time one finishes before the other, but on some rare occasions, it is the other way around. Most fixes for this type of problem involve adding a synchronization primitive that forces one or the other circumstances, so basically not letting them happen randomly. If you suspect there is a problem, you can ‘fix’ it, even if it isn’t wrong, but keep in mind that serializing parallel work does come with a performance cost. Still, if people are agitated by the presence of the bug, and you find 6 potential race conditions, you can fix all 6 at once, then later maybe undo a few of them when you are sure they are valid. 

If the problem is neither configuration nor a race condition, then most likely it is probably just unexpected data. You can fix that in the code, but also use it as motivation to get that data into testing, so similar problems don’t keep reoccurring. It should also be noted that it is symptomatic of a larger analysis problem as well, given that the users needed to do something that the programmers were not told about.