Thursday, January 25, 2024

Context

When I discuss software issues, I often use the term ‘context’. I’ll see if I can define my usage a little more precisely.

In software programs we talk about state. The setting of a boolean variable is its state. There are only two states.

For variables with larger ranges, i.e. more possible settings, there can be a huge number of possible states, but they are all discrete. An integer may, for instance, be set to 42.

We usually use state to refer to a group of variables. E.g. the state of a UI is its settings, navigation, and all of the preferences.
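To make that concrete, here is a minimal sketch, in Python, of UI state as a group of variables. The field names are hypothetical, purely for illustration:

```python
from dataclasses import dataclass, field

# A minimal sketch of UI "state" as a group of variables.
# The field names here are hypothetical, just for illustration.
@dataclass
class UIState:
    dark_mode: bool = False        # a boolean: only two possible states
    font_size: int = 12            # an integer: many discrete states
    current_page: str = "home"     # navigation
    preferences: dict = field(default_factory=dict)

state = UIState(dark_mode=True, font_size=14)
```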

Context is similar, but somewhat expanded. The context is all of the variables, whether explicit or implicit; formal or informal. It is really anything at all that can vary, digitally or even in reality.

Sometimes people just restrict context to purely digital usages, but it is far more useful if you open it up to include any informal variability in the world around us. That way we can talk about the context of a UI, but we can also talk about the context of the user using that UI. The first is a proper subset of the second.

The reason we want it to be wider than, say, just a context in the backend code is that it affects our work. Software is a solution to one or more problems. Some of those problems are purely digital, such as computations, persistence, or communications, but most of our problems are actually anchored in reality.

For instance, consider a software system that inventories cogs created at a factory. The cogs themselves and the factory are physical. The software mirrors them in the computer in order to help keep track of them. So, some of the issues that affect the cogs, the factory, or the types of usage of the system, are really just ‘informal’ effects of reality. What people do with the software is heavily influenced by what happens in the real world. The point of an inventory system is to help make better real world decisions.

We may or may not map all of those physical influences onto digital proxies, but that does not mitigate their effect. They happen regardless. So if there are real events happening in the factory that affect the cogs but are not captured correctly, the digital proxies for those cogs can fall out of sync. We might, for example, have the wrong counts in the software because a bunch of cogs went missing.

As well, the mappings between reality and the software can be designed incorrectly. The factory might have twenty different types of cogs, but the software can only distinguish ten different types. The cogs themselves might relate to each other in some type of hierarchy, but the software only sees them as a flat inventory list.

In that sense the software developers are not free to model the factory and its cogs in any way they choose. The context in reality needs to properly bound the software context, so that whatever happens in the larger context can be correctly tracked in the software context.

The quality of the software is rooted in its ability to remain correct. Bad software will sometimes be wrong, so it is not trustworthy, and thus not very useful.

Now if the factory is very complex, it might be a massive amount of work to write software that precisely models everything down to each and every little detail. So we frequently apply simplifications to the solution context. That works if and only if the solution context is still a proper generalized subset of the problem context.

From our earlier example, if all twenty physical cog types map uniquely onto the ten software types, the context may be okay. But if some cogs can be mapped in different ways, or some cogs cannot be mapped at all, then the software solution will drift away from reality and people will see this as bugs. If there are manual procedures and conventions to occasionally fix the map, then at some point they'll degrade and it will still fail.
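As a rough sketch of what that uniqueness condition means, assuming purely hypothetical cog type names, the mapping has to be a total function from the physical types to the software types:

```python
# A sketch of checking that a context mapping is safe, assuming
# hypothetical cog types. Every physical type must map to exactly one
# software type; anything unmapped or ambiguous means the software
# context no longer properly bounds the problem context.

PHYSICAL_TYPES = {f"cog-{i}" for i in range(1, 21)}   # twenty real types

# One possible mapping onto the ten software types; purely illustrative.
MAPPING = {f"cog-{i}": f"sw-{(i - 1) % 10 + 1}" for i in range(1, 21)}

def check_mapping(physical, mapping):
    unmapped = physical - mapping.keys()
    if unmapped:
        # These cogs cannot be tracked at all; their proxies will drift.
        raise ValueError(f"unmapped physical types: {sorted(unmapped)}")
    # A dict is well-defined by construction (one value per key), so if
    # nothing is unmapped, each physical type maps uniquely.
    return True

check_mapping(PHYSICAL_TYPES, MAPPING)
```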

Which is one of the most common fundamental problems with software. There often isn't time to do the context mappings properly, so the shortcuts applied are invalid. The software context is shifted out from under the problem context, so it will gradually break. More software or even manual procedures will only delay the inevitable. The data, i.e. the proxies, in the computer will eventually drift away from reality.

So, if we see the context of the software as needing to be a proper subset of the context of the problem we intend to solve, it is easier to understand the consequences of simplifications.

This often plays out in interesting ways. If you build a system that keeps track of a large number of people, you obviously want to be able to uniquely identify them. Some people might incorrectly assume that a full name, as first, middle, last, is enough, but most names are not particularly unique. Age doesn't help, and duplicate birthdays are far too common. You could use a home address as well, but in some parts of the world even that is not enough.

Correctly and uniquely identifying ‘all’ individuals is extraordinarily hard. Identifying a small subset for an organization is much easier. So we cheat. But any mapping only works correctly for the restricted domain context when you don’t have to fiddle with the data. If you have to have Bob and Bob1 for example, then the mapping is broken and should be fixed before it gets even worse.
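A tiny sketch of that failure mode, with made-up records: if the chosen "unique" key collides, the honest move is to report the misfit and widen the key, not to mangle the data.

```python
# A sketch of why mangling identifiers signals a broken mapping. The
# records and the key choice are hypothetical. If the supposedly unique
# key collides, the fix is to widen the key, not to invent Bob1.

people = [
    {"first": "Bob", "last": "Smith", "dept": "machining"},
    {"first": "Bob", "last": "Smith", "dept": "shipping"},
]

def natural_key(p):
    return (p["first"], p["last"])    # assumed unique -- often wrongly

seen = {}
for p in people:
    key = natural_key(p)
    if key in seen:
        # Do not append a 1 to the name; report the misfit instead.
        raise ValueError(f"identity collision on {key}; the key is too narrow")
    seen[key] = p
```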

So, as a problem, we only want to track a tiny group of people, and we don't have to worry about the full context. Yet, if whatever we do forces fiddling with the data, that means our solution context is misfocused and should be shifted or expanded. Manual hacks are a bug. Seen this way, it ends any sort of subjective argument about modeling or conventions. It's a context misfit, and it needs to be fixed. It's not 'speculative generality' or over-engineering; it is just obviously wrong.

The same issues play out all over software development. We build solutions, but we build them to fit against one or more problem contexts, and those often get bounced around by larger organizational or industry contexts.

That is, often people narrow down the context to make an argument about why something is right or wrong, better or worse, but the argument is invalid because the context is just too narrow. The most obvious example I know is the ancient arguments about why Betamax tapes would beat out VHS, when in reality it went the other way. I think the best reference to explain it all was Geoffrey Moore in “Crossing the Chasm” when he talks about the ‘whole product’ which is an expanded context.

All of that makes understanding the various contexts that bound the system very important. Ultimately we want to build the best fitting solutions given the problems we are trying to solve. Comparing the two contexts is how we figure out if we have done a good job or not.

Thursday, January 18, 2024

Buy vs Build

When I was young, at the end of the 80s, the buy vs. build question was straightforward.

In those days, for the emerging smaller hardware, there was not a lot of software available. It was slow and expensive to build anything. Non-software companies existed as brick-and-mortar businesses. Software could help automate isolated parts of the company, but that was it.

Mostly, even over some medium horizons, buying an existing software product was far cheaper than building it. And that software was usually run in an independent silo, with minimal integrations, so it wasn’t hard to get it up and running.

If there was already an available product, it didn’t make sense to build it. Buying it was the default choice.

But so much has changed since then...

The biggest change is that many companies now exist in the digital realm way more than the physical one. All of the sales, communications, and management for most lines of business happen digitally. Many of the products and services are still physical, but overall most lines of business are a mix now.

Running computers is also more complicated now. In my youth, you would set up a server room with suitable power, cooling, and network connections. When you didn't have the space, you would lease it. But the drop in hardware prices caused an explosion, so the number of machines involved these days is astronomical (and often unnecessary).

This made operations chaotic and undesirable, so most software products are a service now. Someone else sets it up and runs it for you. It’s easier to get going, but you don’t have as much control over it.

With the increase in digital presence came a huge need for integrations. There are way more silos now, often hundreds or even thousands of them. There are specialist silos for every subproblem. When the silos were independent, integrations were rare. But now they all need each other's data; dependencies are necessary. So everything needs to be integrated into almost everything else.

When you bought software before, you could get some new hardware to host it, spin it up, and test it out. If it mostly worked as expected it went live. But these days, just running a new system at a vendor's site isn’t that useful. Being trapped in a silo cripples it. It isn’t really live until all of the major integrations are done. Silos made sense before, but they are a hazard now.

It is not in any vendor's best interest to standardize their software. It is a simple calculation. If it is standard, then it is nearly trivial for any customer to switch to someone else. If you run into any glitches and all of the customers leave, you are instantly done. So, from the vendor's perspective: don't use standards.

Integrating various non-standard SaaS silos with each other is an epic nightmare. The easiest way to do it is to copy the data everywhere. If you have a dozen silos that need a particular type of data, you make a dozen copies. Then desperately try to keep them in sync somehow.

To make it even worse, each integration team and vendor will choose to model the data moving around differently. So, it is endlessly translated into different formats, and some of those translations will lose valuable parts of the information. That wastes a lot of time and causes all sorts of grief.
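A minimal sketch of that kind of loss, assuming two hypothetical vendor schemas: one side has no slot for the hierarchy, so the round trip silently drops it.

```python
# A sketch of lossy silo-to-silo translation, with hypothetical vendor
# schemas. Vendor B's format is flat, so there is nowhere to put the
# hierarchy and the round trip silently drops information -- exactly
# how the scattered copies fall out of sync.

record_a = {"id": 42, "name": "cog-7", "parent": "assembly-3", "count": 118}

def a_to_b(rec):
    # Vendor B's schema has no parent field to copy into.
    return {"sku": rec["id"], "label": rec["name"], "qty": rec["count"]}

def b_to_a(rec):
    # Translating back cannot recover the lost hierarchy.
    return {"id": rec["sku"], "name": rec["label"], "parent": None,
            "count": rec["qty"]}

assert b_to_a(a_to_b(record_a)) != record_a   # the parent is gone
```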

So modern integration projects have become huge, expensive, and tricky.

It's counterintuitive, as you think you managed to avoid programming by buying everything, but now you end up having to do way more glue programming in order to connect it all together.

And so much of that ETL code is awful. It was rushed into existence by people with too little experience. You end up with masses of hopelessly intertwined spaghetti and endless operational alerts about warnings and errors, most of which are unfortunately ignored.

And that is the crux of the issue. If you buy everything now, then you'll get lost trying to get it all to work together properly, and that is a lot more costly than just having built it properly in the first place.

Some things you don't want to build, though. Sometimes because a system is huge, but more often because it is so complex that the devs require specific knowledge and experience to build it reasonably. You can't just hire a whole bunch of kids; they'll conjure up a mess instead of what you need, and it won't help.

For any group of programmers, there are absolute limits to what they can build. They are rarely aware of their own limits, but things won't go well if you let them stray too far past their abilities.

You can assemble a strong group of developers to build exactly what you need, but if the work dries up they will dissipate and you will run into trouble keeping it functioning later. To keep good developers they need to always have good projects to work on.

Which is to say that if you need to build software, you need to set up a stable ‘dev shop’ with enough capacity to turn out and enhance the types of software you need. The dev shop is what you need to focus on. It should be able to attract new talent, and to always have enough interesting work to keep everyone motivated. Talent attracts talent, so if you get a couple of strong devs, you can grow the affair. You just have to make sure the environment stays reasonable.

If you do that, it fundamentally changes the original buy vs build dynamics.

You want to keep building enough stuff to ensure that the dev shop stays functional. Building, if you have the capacity and ability, would now be the first choice. It is a longer-term goal though, as you don't want all of your good developers to leave.

Then you want to build up and out from a few different starting points. The guidance is to minimize integrations first; they are ultimately more costly than the vendor products themselves, so you focus there.

You figure out which categories your shop can handle, then you consolidate all of the little fragmented silos into larger systems. Generalization is the key to keeping the costs in line. Software companies leverage their code for lots of companies; in-house projects need to leverage their code for lots of different problems.

The focus is not on speed though; rather it is on doing the best engineering that you can. Move slowly and carefully. Build up as much reusable kit as you can, model the data as properly as you can, and keep expanding out from the starting point, slowly swallowing dozens of other products, while always keeping a close eye on both the dev shop and the operational capacity.

Obviously, the digital parts of existing lines of business would come first. You'd want to do this anyway, since just using the same vendors as everyone else offers no competitive advantage. But to get those advantages back, the work you do has to be better and more relevant than the vendors', which means you have to have strong technologists who really understand the business too.

Then funding isn’t by project, line of business, or budget. It is by dev shop. You set a solid foundation and build up capacity to implement better stuff and keep the funding stable as you grow it. A large organization can have a few different dev shops.

Outside of those areas of software growth, the old buy vs. build choice would remain, but as the starting points get larger they would end up eating stuff around the fringes so you’d need to factor that in as well.

The counterargument to all of this is that building software is seen as outside of the company's vertical. But the modern reality is that as most businesses got deeper into the digital realm, they drifted ever closer to being applied software companies than to their original lines of business.

The classic examples are Google, a marketing company, and Amazon, a retailer. As applied software companies they thrived while their brick-and-mortar predecessors didn’t.

The general nature, though, is that software is not a vertical for any digital line of business; it is a part of it. Core. That, and the exploding integration costs, mean that reasonable software is necessary for scaling and stabilizing. Bad software makes everything unprofitable.

As software eats the world, this same fate will play out in all sorts of other industries, and the winning strategy is to accept that if you are heavily reliant on doing business in the digital realm, then you are already heavily reliant on building software.

Then it is far more effective to build some of the core stuff yourself, instead of just integrating generic vendor products. You’ll need to recruit stronger developers and make sure you can keep them, but if you do your capacity will grow. Then software development capability itself becomes a competitive advantage.

Thursday, January 11, 2024

Lessons Learned

You learn a lot during thirty years. I have tried to write about most of it in this blog, at least from a higher-level perspective that brings lots of different things together, but some lessons are smaller and just don't fit anywhere. Each one of these is rooted in at least one epic failure.
  • Doing the screens first and persistence last is a common top-down development approach, but it is a very bad mistake. The screens are driven by the irrationality of the users, they don’t map cleanly to the demands of persistence, and they never will. If they could, then lots of the very early application generators would have worked, but they didn’t. Persist first, then gradually move it up until it gets into the screens.
  • If you have an RDBMS, use it to nearly its fullest ability to protect itself; see the sketch after this list. You really don't want to persist garbage data; that will cause all sorts of annoying bugs. You don't want to double up the stuff you are persisting either; duplication wastes space and can cause stale or inconsistent data as well. People always try to cheat on the database work, and they always pay a high price for it. It isn't particularly fun work, but it anchors everything else.
  • Don't try to break dependent things into subparts, like, say, putting the front and back ends for the same system into two different repos. People might decompose by language, for example, but really, if there is a dependency, like an API, that matters more. It's hard to explain, but if things can't stand on their own, then you shouldn't try to force them to.
  • Disorganization will always be the biggest problem. Organization is a place for everything, everything in its place, and not too many similar things in the same place. That is, if you have something new and you don’t know where to put it, then it is disorganized. It must go somewhere; if that is obvious, then you are okay.
  • If the programmers don’t know what the system holds for data, and they don’t know why people are doing things with that data, then it will have a huge number of bugs. Programmers are the frontline for quality, if they can’t see problems as they work, there will be lots and lots more.
  • Always only ever move code in one direction. It goes from Dev to Release, with a few QA stops along the way. Never, never break that chain. Breaking it will result in all sorts of problems, including things getting accidentally rolled back, which is avoidable.
  • Always clean up right after a release. Everyone is tired, and cleanup work is boring. If you do not clean up then, you will never clean up and the mess will get worse, far worse.
  • Tackle the hard parts first, not the easy ones. The hard ones are unpredictable in time; if they don't go well, you can raise an early flag on the schedule. The other way around tends to mislead people into thinking that everything is going well when it isn't.
  • Do the right thing when you start. Only take more shortcuts the closer you are to the deadline. If you take a shortcut, note it, and clean it up right away after the release.
  • Do not freeze code forever. If you freeze the code, you also freeze the bugs, which then become the foundation of everything else. Building on buggy foundations is problematic.
  • Do not let people add in onion architectures. If they are trying to avoid the main code, and just do “their thing” around the outside, that work is usually very harmful. Push them to do the work properly.
  • Don’t drink the Kool-Aid. There just isn’t an easy or right way to build stuff. The best you can do is make it readable and keep it organized. Most philosophies for coding are extreme and have worse side effects.
  • If what happens underneath matters in the system, it is not fully encapsulated. In that case, you need to learn some stuff about what happens and why it happens. You can’t just ignore it. Some components will never be fully encapsulated.
  • The ramp-up time for an experienced coder is proportional to the size of the codebase. The ramp-up time for an inexperienced coder is far longer.
  • A weird and ugly interface is a strong disincentive against usage. Useful code has a long life. The point of writing professional code is to maximize its lifespan.
  • Reduce friction, don’t tolerate it. Spending the time to mitigate or minimize it always pays off. Putting up with it always slides one downhill.
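As promised in the RDBMS item above, here is a minimal sketch of letting the database protect itself. It uses Python's sqlite3 with a purely illustrative schema; the table and column names are assumptions, and any serious RDBMS offers far more (richer types, triggers, stronger constraints).

```python
import sqlite3

# A minimal sketch of letting the database protect itself, using an
# illustrative schema. Constraints reject garbage before it persists.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite needs this enabled
conn.execute("""
    CREATE TABLE cog_type (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE            -- no blank or duplicate names
    )""")
conn.execute("""
    CREATE TABLE cog (
        id      INTEGER PRIMARY KEY,
        type_id INTEGER NOT NULL REFERENCES cog_type(id),
        count   INTEGER NOT NULL CHECK (count >= 0)   -- no negative stock
    )""")
conn.execute("INSERT INTO cog_type (id, name) VALUES (1, 'spur')")
conn.execute("INSERT INTO cog (type_id, count) VALUES (1, 118)")

# Each of these violates a constraint and is rejected by the database:
for bad in ["INSERT INTO cog (type_id, count) VALUES (99, 5)",    # no such type
            "INSERT INTO cog (type_id, count) VALUES (1, -4)"]:   # negative
    try:
        conn.execute(bad)
    except sqlite3.IntegrityError as e:
        print("rejected:", e)
```

The point is that the garbage gets rejected at the boundary, before it can poison everything built above.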
There is a lot more, but I’ll start with these. Writing code is fairly easy, but building huge reliable systems is exceptionally hard. The two are not the same.

Thursday, January 4, 2024

Time vs Risk

When I was young, software development was not in the spotlight. We had quite a bit of time to get our work done. We would carefully craft things, focusing on the key issues.

It was the dawn of the World Wide Web followed by the decadence of the Dot Com era that changed all of that. Suddenly “first mover advantage” outweighed quality, correctness, and readability.

Modern coding is a high-speed game of chicken. It starts with a request to do some work in usually less than 1/3rd of the amount of time you need to do a good job. If you balk at the lack of time, they’ll take their work elsewhere. So, you might try to stretch it out a little, but then you agree.

When time is compressed, you inevitably end up taking a lot of shortcuts. Some programmers know to avoid many of these, but the industry tends to praise taking them.

A shortcut is a tradeoff. You do something faster now, in the hopes that it will not blow up in your face later.

Some shortcuts never blow up, you get lucky.

Some are just incremental aggravations that, if they haven't built up too deeply, will only slow you down a bit later. Just friction.

Some shortcuts, however, will implode or even explode, throwing the whole affair into the trash bin or flattening it forever. It's been bad enough that I've actually seen code come out spectacularly fast, and then half a decade spent slogging through near-hopeless bugs. The wrong series of really bad shortcuts can be devastating.

So every shortcut is a risk. But the risk is hard to quantify, as there are usually aggravating factors that multiply the damage.

Given that you are inevitably pushed into having to take some shortcuts, it’s best to take the least destructive ones. Those tend to be higher up.

If you build code in a rational manner, you lay out the foundations first and then carefully stack a lot of reusable components on top. That is the minimum amount of work you need to do.

Bad low-level code propagates trouble upward; the stuff built on top needs to counteract the awful behavior below. That tells us that the lower the shortcut, the more risky it is, the more it affects, and the worse the consequences of losing by taking it.

We see that all of the time.

Consider those systems, for example, where they did crazy fast things with saving the data, then wrote far too much hacky code above to try and hide the mess. If they had just modeled the data cleanly, the tragically nested conditional nightmare piled on top, which ate huge amounts of time and spread a lot of pain, would not have been necessary. It is a super common example of a small set of shortcuts going rather horribly wrong.
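A small sketch of the difference, with made-up field names: once the persisted data is inconsistent, every reader pays the defensive tax forever, whereas clean modeling makes the reads trivial.

```python
# A sketch contrasting the two approaches; the record shapes and field
# names are hypothetical.

# When the saved data is messy, every reader pays for it, forever:
def get_count_messy(raw):
    # hacky guards piled on top of inconsistent persistence
    if raw is None:
        return 0
    if isinstance(raw.get("count"), str):
        return int(raw["count"] or 0)
    if "cnt" in raw:                  # an older record shape
        return raw["cnt"]
    return raw.get("count", 0)

# If the data had been modeled cleanly, the reader is one line:
def get_count_clean(record):
    return record["count"]   # always an int, guaranteed at write time
```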

You see exceptionally bad persistence all over the place causing problems. It’s likely that at least half the code ever written is totally unnecessary.

What’s always true is that if you take too many risks and lose enough of them, the time saved by the shortcuts will be massively overwhelmed by the time lost dealing with them. Coming out of the gate far too fast will always cause a project to stumble and will often cause it to lose the race.

If you are forced to take risks then it is worth learning how to evaluate them correctly. If you pick the right ones, you’ll lose a few, but keep on going. It’s not how it should be, but it is pretty much how it is these days.