Thursday, May 9, 2024

Iterative Development

Even if I know that I will end up building something extremely complex, I never start there.

When we get a complex problem to solve, we are taught to decompose that into pieces. But often that is not enough to get something sophisticated to work correctly.

I’ll start with some simplified goals. Usually the most common “special case”.

I lay out some groundwork. Bottom-up programming is always the strongest approach, but it still needs to be guided. You could whack out every common foundational part you know you will need, but you can't yet see wide enough to get it all right. So don't. Build just enough to get going.

Then take that special case and work from the top down.

If it’s a GUI, you start with a common screen like the landing one. If it’s an ETL, you just get some relatively close test data. If it's a calculation, you put in a simplish test case. Ultimately you need a fairly simple starting point.

Then you wire it up. Top to bottom. While wiring, you will find all sorts of deficiencies, but you constrain yourself to only fixing the core problems. But you also note all of the others, do not forget them, and dump them into a big and growing list of stuff to do. They are important too, just not now.

So, hopefully, now you have something crude that kinda solves the issue in very limited circumstances and a rather tiny foundation for it to sit on. It is what a friend of mine used to call an inverted T.

If it sort of works, this is not the end, it is just the beginning. You iterate now, slowly evolving the code into the full complex thing that you need.

One key trick is to pass over the unknown-unknowns as early as possible. If there is some technology you are unfamiliar with, or a difficult algorithm, touch those areas with something crude and do it as early as possible. Unknown-unknowns are notorious for turning out to be way larger than you have time or estimates to handle. They like to blow stuff up. The sooner you deal with them, the more likely you can get a reasonable sense of when the work will be completed.

For each iteration, once you’ve decided what you need to do, you first need to set the stage for the work. That comes in the form of a non-destructive refactoring. That is, if your structure is crude or wonky, you rearrange that first in a way that does not change the overall behavior of the code. If you don’t have enough reuse in the code, you fix that first.

If you did a good job with the non-destructive refactoring, the extension code should be easy. It is just work now. Sure, it takes time, but you spend the time to make sure it is neat, tidy, and super organized. You need to do this each and every time, or the iterations will converge on a mess. You don't want that, so you avoid it early too.

Once you get into this habit, each and every development is a long-running series of iterations, and each iteration is cleaning up stuff and making it better and more capable. The code will keep getting a little better each time.

This has two core benefits. First, you can level out your cognitive effort. The non-destructive refactorings are boring, but they are always followed by the expansions. Because you are tracking everything, you have a huge list of todo items, so there is never a 'hurry up and wait' scenario. If one little thing is stuck on other people, there are plenty of other useful things to do anyway.

The second benefit is that all of the people around you see that the work is getting better and better. They may have been nervous initially that the total effort will be slow, but continual improvements will dissolve that. What they don’t want is big highs and lows, and an endless number of unknown-unknown dramas. That wonky approach will keep them from trusting you.

Done well, the development is smooth. There are plenty of little issues, but they do not derail the path forward. Eventually, you will get to something really sophisticated, but you will not get lost in the complexity of it. You will not crumble under technical debt. This is when you know you have mastered your profession.

Thursday, May 2, 2024

Symmetry

“One of these things is not like the others” -- Sesame Street

Symmetry is an often ignored property of code that is incredibly helpful when editing or debugging. It is part of readability, along with naming and structuring.

If you do a quick visual scan of some suspect code and you see asymmetric parts, it quickly tells you that you should spend a little more time there.

Symmetry issues can be simple, for instance:

    if condition {
        // do thing one

    } else {
        if sub-condition {
            // do thing two

        } else {
            // do thing three

        }
    }

It doesn't seem to be a problem, but the second conditional block is broken into two sub-blocks, while the first one is not. It is unbalanced. There are 3 different things that can be triggered from this code. That is a little odd: if the two conditions are independent and applied consistently, there should really be 4 things. We may be missing one. Depending on what behavior you are investigating, this would deserve closer inspection.
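One way to restore the balance is to spell out all four permutations. A minimal sketch in Python, with hypothetical condition and result names standing in for the real logic:

```python
def dispatch(cond_a, cond_b):
    # All four permutations of the two independent conditions are
    # spelled out symmetrically, so a missing case is immediately
    # visible to anyone scanning the code.
    if cond_a and cond_b:
        return "thing one"
    elif cond_a and not cond_b:
        return "thing two"
    elif not cond_a and cond_b:
        return "thing three"
    else:
        return "thing four"
```

If the fourth case truly cannot happen, the symmetric structure still helps: you make that explicit with an error or an assertion instead of silently folding it into another branch.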

That makes symmetry an indirect property of readability. For example, you quickly scan large amounts of code and see that every function definition has exactly 3 arguments, and then you come across one with 7. Hmmmm. It stands out; maybe it is wrong.

With good code, given most bugs, you will have a strong idea about where the problem is located. Then a quick scan for asymmetry will highlight some parts around there that are off. You double-check those, and maybe find one that is wonky. That can get you from the report of the bug to the actual flaw in the code in a few minutes, with a fairly low cognitive effort. All you need to do is correct it and test that it is now behaving as expected. Symmetry problems are often second in frequency to typos. 

Symmetry is entirely missing in spaghetti code. The logic wobbles all over the place, doing unexpected work and jumping around. If you are not the author, the flow makes no sense. It takes a lot of cognitive effort to hypothesize about any sort of behavior, and then you have to narrow your focus to as little as possible, just to figure out how to alter it in a way that is hopefully better.

It’s why debugging someone else’s mess is so painful.

Symmetry is not an accidental property. It is there only because someone had the discipline to make sure it is there. They pounded out some code but went back to it later and cleaned it up. Fixed the names, made sure it only does one thing, and put in as much symmetry as they can. That took them longer initially, but it will pay off later. The worse the code, the more you will need to revisit it.

Symmetry can occur at the syntax level, the semantic level, the data level, the function level, and the structural level. Everywhere. It is a sign of organization. Where and when it could have been there, but is missing, it is always suspect. 

The most obvious example is indenting lines of code. Most languages allow you to do whatever you want, but if you do it properly that will be a form of line-by-line symmetry which helps make the flow more understandable.

Spending the time to find and enhance symmetry saves a huge amount of grief later with quality issues. The code is way easier to refactor, debug, and extend. It is worth the time.

Friday, April 26, 2024

The Origin of Data

In software development, we create a lot of variables and data structures for our code that we toss around all over the place.

Into these we put data, lots of it.

Some of that data originates from far away. It is not our data.

Some of that data is a result of people using our interface. Some of that data we have derived from the data we already have. This is our data.

It is important to understand where the data originates, how often it gets created, and how it varies. It is crucial to understand this before coding.

Ultimately the quality of any system rests more on its data than on its code. If the code is great, but the data is garbage, the system is useless right now. If the data is great, but the code is flaky, it is at least partially usable and is fixable. If all you have collected is garbage, you have collected nothing.

Common mistakes with data:
  • Allowing invalid data to percolate and persist
  • Altering someone else’s data
  • Excessive or incorrect transformations

GARBAGE IN, GARBAGE OUT

It is a mistake to let data into the running code that is garbage.

Data always comes from an "entry point", so you block any incoming garbage data as close to that point as you can. An entry point is a gateway from anywhere outside of the system, including the persistent database itself. All of these entry points should immediately reject invalid data, although there are sometimes variations on this that allow for staging data until it is corrected later.

All entry points should share the same validation code in order to save lots of time and ensure consistency. If validation lets in specific variations on the data, it is because those variations are valid in the real world or in the system itself.
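As a sketch of that idea, here is one shared validator called from two different entry points. The record fields and the rules are hypothetical, just to show the shape:

```python
def validate_customer(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    if not record.get("name", "").strip():
        problems.append("name is required")
    age = record.get("age")
    if age is not None and not (0 <= age <= 150):
        problems.append("age is out of range")
    return problems

def api_entry_point(record):
    # Same validation code as every other gateway into the system.
    problems = validate_customer(record)
    if problems:
        raise ValueError("; ".join(problems))
    return record  # accepted into the system

def batch_import_entry_point(records):
    # A batch gateway reuses the exact same rules; here it just drops
    # invalid rows, but it could stage them for later correction instead.
    return [r for r in records if not validate_customer(r)]
```

Because both gateways call the same function, a rule tightened in one place is tightened everywhere, and the two paths can never disagree about what "valid" means.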

It is a lot of work to precisely ‘model’ all of the data in any system but that work anchors the quality of the system. Skipping that effort will always force the quality to be lower.

Data that doesn’t come directly from the users of the system, comes in from the outside world. You have to respect that data.


RESPECT THE DATA

If you didn't collect the data yourself, it is more than a little rude to start changing it.

The problem comes from programmers being too opinionated about the data types, or taking questionable shortcuts. Either way, you are not saving that copy of someone’s data, you are saving a variation on it. Variations always break somewhere.

If the data is initially collected in a different system, it is up to that originating system to change it. You should just maintain a faithful copy of it, which you can use for whatever you need. But it is still the other system’s data, not yours.

Sometimes people seed their data from somewhere else and then allow their own interfaces to mess with it. That is fine, if and only if it is a one-time migration. If you ignore that and try to mix ongoing migrations with keeping your own copy, the results will be disastrous. Your copy and the original will drift apart and are not mergeable; eventually one version or the other will end up wrong, and that will cause grief.

It's worth noting that a great many of the constants that people put into their code are also other people's data. You didn't collect the constant yourself, which is why it is sitting in the code and not in a variable.

You should never hardcode any data; it should come in from persistence or configuration. In that way, good code has almost no constants in it. Not strings, or numbers, or anything really, just pure code that takes inputs and returns outputs. Any sort of hardcoded value is always suspicious. If you do hardcode something, it should be isolated into its own function, and you should remain incredibly suspicious of it. It is probably a bad idea.
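A rough sketch of what that looks like in Python, with hypothetical config keys. The values come in from configuration, the calculation is pure code, and the one unavoidable hardcoded value is quarantined in its own clearly named function:

```python
import json

def load_config(path):
    # All of the "constants" live outside the code, in configuration.
    with open(path) as f:
        return json.load(f)

def shipping_cost(weight_kg, config):
    # Pure code: inputs and outputs, no embedded magic numbers.
    return weight_kg * config["rate_per_kg"] + config["base_fee"]

def default_currency():
    # If something must be hardcoded, isolate it in its own function
    # so it stays visible, findable, and suspicious.
    return "USD"
```

With that shape, changing a rate is a configuration change, not a code change, and anyone auditing for hardcoded data has only one tiny function to inspect.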


DON’T FIDGET

You'll see a lot of data that moves around in multiple different representations. It is one data item, but it can be parsed into subpieces which also have value on their own. You often see systems that will 'split' and 'join' the same data repeatedly in layer after layer. Obviously, that is a waste of CPU.

Most of the time, if you get some incoming data, the best choice is to parse it down right away and always pass it around in that state. You know you need at least one piece of it, so why wait until the last moment to get it? You 'split' coming in, and 'join' going out. Note that 'split' is not suitable for any parsing that needs a look-ahead to tokenize properly.
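A small Python sketch of split-coming-in, join-going-out, using a hypothetical "full name" field as the data item:

```python
def parse_full_name(raw):
    # Split once, at the entry point.
    first, _, last = raw.strip().partition(" ")
    return {"first": first, "last": last}

def greeting(name):
    # Inner code works on the parsed pieces; it never re-splits.
    return f"Hello, {name['first']}!"

def format_full_name(name):
    # Join once, on the way out.
    return f"{name['first']} {name['last']}"
```

The parsed form travels through the whole system; the raw string exists only at the edges, so there is exactly one split and one join no matter how many layers the data passes through.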

There are plenty of exceptions to breaking down the data immediately. For example, the actual type may be the higher representation, while the piece is just an alias for it. In that case, parsing right away would disrespect the data. This is common when the data is effectively scoped in some way.

If you need to move around two or more pieces of data together all or most of the time, they should be in the same composite structure. You move that instead of the individual pieces. That keeps them from getting mixed up.
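A minimal sketch of that in Python, with hypothetical fields, using a dataclass as the composite structure:

```python
from dataclasses import dataclass

@dataclass
class Money:
    amount: int   # in minor units, e.g. cents
    currency: str

def add(a: Money, b: Money) -> Money:
    # Because amount and currency always travel together, they can
    # never get mixed up or separated in transit.
    if a.currency != b.currency:
        raise ValueError("currency mismatch")
    return Money(a.amount + b.amount, a.currency)
```

Passing a bare integer around invites someone, somewhere, to pair it with the wrong currency; the composite structure makes that mistake structurally impossible.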

Another way to mess up data is to apply incorrect transformations to it. Common variations are changing the data type, altering the character set, or other representation issues. A very common weakness is to use date & time variables to hold only dates, then in-band signal it with a specific time. Date, time, and date & time are three very different data types, used for three very different situations.
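In Python, for instance, the standard library already keeps the three types distinct; a small sketch with hypothetical values:

```python
from datetime import date, time, datetime

invoice_date = date(2024, 4, 26)             # a whole day, no time component
daily_cutoff = time(17, 0)                   # a wall-clock time, no particular day
received_at = datetime(2024, 4, 26, 9, 30)   # one specific instant

# A date compares naturally with other dates. There is no in-band
# signal to check, no "midnight means date-only" convention to remember.
assert invoice_date < date(2024, 5, 1)
```

Stuffing `invoice_date` into a `datetime` with a magic time of midnight works until real midnight timestamps show up, at which point the two meanings become indistinguishable.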

Ambiguous representations like combining integers and floating point values into the same type are really bad too. You always need extra information to make any sense of data so throwing some of the meta information away will hurt. Ambiguities are easy to create but pretty deadly to correct.


SUMMARY

One of the odder parts of programming culture is the perceived freedom programmers have to represent their data any way they choose. That freedom doesn’t exist when the data originated outside of the system.

There are sometimes a few different choices for implementations, but they never come for free. There are always trade-offs. You have to spend the time to understand the data you need in order to then decide how the code should deal with it. You need to understand it first. Getting that backward and just whacking out code tends to converge on rather awkward code full of icky patches trying to correct those bad initial assumptions. That type of code never gets better, only worse.

None of these points about handling data changes over time. They are not ancient or modern. People have been writing about this since the 70s, and it was as true then as it is now. It supersedes all technical stacks and applies to every technology. Software that splatters garbage data on the screen is useless, always has been, and always will be.

Thursday, April 18, 2024

Optimizations

“Premature optimization is the root of all evil” -- Donald Knuth

Code generally implements a series of steps for the computer to follow. I am using a slightly broader definition than just an ‘algorithm’ or ‘heuristic’, which are usually defined as mappings between input and output. It is widened to include any sort of code that interacts with one or more endpoints.

We’ll talk about three general possible versions of this code. The first does the steps in an obvious way. The second adds unnecessary extra steps as well. And the third does the steps in a non-intuitive way that is faster. We can call these normal, excessive, and optimized.

Most times when you see people “optimize code” they are actually just taking excessive code and replacing it with normal code. That is, they are not optimizing it, really they just aren’t doing the useless work anymore.

If you take excessive code and fix it, you are not doing premature optimization, you’re just coding it properly. The excessive version was a mistake. It was wasting resources, which is unnecessary. Not doing that anymore is not really optimizing stuff.

If you have good coding habits, for the most part, you will write normal code most of the time. But it takes a lot of practice to master. And it comes from changing how you see the code and how you construct it.

Sometimes normal code is not fast enough. You will need to optimize it. Most serious optimizations amount to dropping down a complexity class. That is, you start with O(n^2) and bring it down to O(n log n) or O(n). Sorting, for example, starts at O(n^2) with the obvious approaches and gets down to O(n log n). All of these types of optimizations involve visualizing the code from a very non-intuitive viewpoint and using that view to leverage some information that circumvents the normal, intuitive route. These are the hardcore optimizations. The ones that we are warned not to try right away.

It is easy, while trying to optimize code, to break it instead. It is also easy to make it a whole lot slower. Some optimizations, like adding in caching, seem deceptively easy, but doing them incorrectly causes all sorts of unwelcome bugs.

Making space-time tradeoffs is a sort of optimization. They may appear to alter the performance, but it can be misleading. My favorite example is matching unique elements in sets. The obvious way to code it is with two for loops: you take each member of the first set and compare it to each member of the second one. But you can swap time for space. In that case, you pass through the first set and hash it, then pass through the second set and see if each element is in the hash. If the respective sizes are m and n, the obvious algorithm is O(n*m) while the hashed version is O(n+m). The extra hash table shifts the operation from being multiplicative to additive; you can see the scale of that gain by setting m to be another n, which turns O(n*m) into O(n^2) but O(n+m) into O(n). Still, if you scale up to large enough data, the management of the hash table and its memory can eat into those gains.
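Both versions of the matching example, sketched in Python:

```python
def matches_nested(a, b):
    # Obvious version: compare every element to every element, O(n*m).
    found = []
    for x in a:
        for y in b:
            if x == y:
                found.append(x)
    return found

def matches_hashed(a, b):
    # Space-for-time version: one pass to build the hash table, one
    # pass to probe it, O(n+m) plus the cost of the table itself.
    seen = set(a)
    return [y for y in b if y in seen]
```

For unique elements both return the same matches; the hashed version pays for its speed with the memory and bookkeeping of the extra table.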

The real takeaway though is to learn to code just the instructions that are necessary to complete the work. You so often see code that is doing all sorts of unnecessary stuff, mostly because the author does not know how to structure it better or understand what happens underneath. You also see code that does and undoes various fiddling over and over again as the data moves through the system. Deciding on a canonical representation and diligently sticking to that can avoid a lot of that waste.

Debloating code is not optimizing it. Sure, it makes the code run faster and with fewer resources, but it is simply removing what should not have been there in the first place. We need to teach coding in a better way, so that programmers learn how to write stuff correctly the first time. Premature optimizations, though, are still the root of all evil. You need to get your code working first before you start messing with complexity reductions. They can be a bit mind-bending at times.

Thursday, April 11, 2024

Scope

One of the keys to getting good quality out of software development is to control the scope of each line of code carefully.

This connection isn’t particularly intuitive, but it is strong and useful.

We can loosely define the scope of any piece of code as the percentage of other lines of code in the system that ‘might’ be affected by a change to it.

In the simplest case, if you comment out the initialization of the connection to a database, all other lines of code that do things with that database will no longer work correctly. They will error out. So, the scope of the initialization is the large chunk of code that relies on or messes with the data in the database, plus any code that depends on that code. For most systems this is a huge amount of code.

Way back, in the very early days, people realized that global variables were bad. Once you declare a variable as global, any other line of code can access it, so the scope is effectively 100%. If you are debugging and the global variable changes unexpectedly, you have to go through every other line of code that could possibly have changed it at the wrong time to fully assess and understand the bug. In a sizable program, that would be a crazy amount of time. So, we came to the conclusion long ago that globals, while convenient, were also really bad. And that is a pure scope issue. We also figured out that the same was true for flow-of-control, like goto statements. As it is true for function calls too, we can pretty much assume it is true, in one way or another, for all code and data in the system.

Lots of paradigms center around reducing the scope of the code. You encapsulate variables in Object-Oriented programming; you make them immutable in Functional Programming. These are both ways of tightening down the scope. All the modifiers like public and private do that too. So do the mechanisms for including code from other files, and any sort of package or module name. Things like interfaces are also trying to put restrictions on what can be called when. The most significant scope reduction comes from strongly typed languages, as they will not let you do the wrong thing on the wrong data type at the wrong time.

So, we've known for a long time that reducing the scope of as much code as you can is very important, but why?

Oddly it has nothing to do with the initial coding. Reducing scope while coding makes coding more complicated. You have to think carefully about the reduction and remember a lot of other little related details. It will slow down the coding. It is a pain. It is friction. But doing it properly is always worth it.

The reason we want to do this is debugging and bug fixes.

If you have spent the time to tighten down the scope, and there is a bug in and around that line of code, then when you change it, you can figure out exactly what effect the change will have on the other lines of code.

Going back to the global example, if the variable is local and scoped tightly to a loop, then the only code that can be affected by a change is within the loop itself. It may change the final results of the loop computations, but if you are fixing it, that is probably desirable.

If inside of the loop you referenced a global, in a multi-threaded environment you will never really know what your change did, what other side effects happened, or whether you have really fixed the bug or just gotten lost while trying to fix it. The bug could be what you see in the code, or it could be elsewhere; the behavior is not deterministic. Unlimited scope is a bad thing.

A well-scoped program means that you can be very sure of the impact that any code change you make is going to have. Certainty is a huge plus while coding, particularly in a high-stress environment.

There is a bug, it needs to be fixed correctly right away, making a bunch of failed attempts to fix it will only diminish the trust people around you have in your abilities to get it all working. Lack of trust tends to both make the environment more stressful and also force people to discount what you are saying. It is pretty awful.

There were various movements in the past that said if you did "X" you would no longer get any bugs. I won't go into specifics: any technique that helps reduce bugs is good, but no technique will ever get rid of all bugs. It is impossible. They will always occur, we are human after all, and we will always have to deal with them.

Testing part of a big program is not the same as fully testing the entire program, and fully testing an entire program is always so much work that it is extremely rare that we even attempt to do it. In an ancient post, I said that testing was like playing a game of Battleship with a limited set of pegs: if you use them wisely, more of the bugs will be gone, but some will always remain.

This means that for every system, with all its lines of code, there will come a day when there is at least one serious bug that escaped and is now causing big problems. Always.

When you tighten the scope, while you have spent longer in coding, you will get absolutely massive reductions in the impacts of these bugs coming to light. The bug will pop up, you will be able to look at your readable code and get an idea of why it occurred, then formulate a change to it for which you absolutely are certain of the total impact of that change. You make the change, push it out, and everything goes according to plan.

But that is if and only if you tightened the scope properly. If you didn’t then any sort of change you make is entirely relying on blind luck, which as you will find, tends to fail just when you need it the most.

Cutting down on the chaos of bug fixing has a longer-term effect. If some bugs made it to production, and the handling of them was a mess, then it eats away at any time needed to continue development. This forces the programmers to take shortcuts, and these shortcuts tend to go bad and cause more bugs.

Before you know it, the code is a huge scrambled mess, everybody is angry, and the bugs just keep coming, only faster now. Getting caught in this cycle will pull the quality down into the mud like hyper-gravity. Each slip-up in handling the issues eats more and more time and causes more stress, which fuels more shortcuts, and suddenly you are caught up in it with no easy way out.

It's why coming out of the gate really fast with coding generally fails as a strategy for building stuff. You're trying to pound out as much code as quickly as you can, but you are ignoring issues like scope and readability to go faster. That seems to work initially, but once the code goes into QA or actual usage, the whole thing blows up rather badly in your face, and the hasty quality of the initial code leads it to degenerate further into an icky ball of mud.

The alternative is to come out really slowly. Put a lot of effort into readability and scope in the lowest, most fundamental parts of the system. Wire it really tightly. Everyone will be nervous that the project is not proceeding fast enough, but you need to ignore that. If the foundations are really good, and you've been careful with the coding, then as you get higher you can get a bit sloppier. Those upper-level bugs tend to have less intrinsic scope.

Having lots of code will never make a project better. Having really good code will. Getting to really good code is slow and boring, but it will mitigate a great deal of the ugliness that would have come later, so it is always worth it.

Learn to control the scope and spend the time to make it a habit. Resist the panic, and just make sure that the things you coded do what they are supposed to do in any and all circumstances. If you want to save more time, do a lot of reuse, as much as you can get in. And don't forget to keep the whole thing really readable, otherwise it is just an obfuscated mess.