Thursday, December 11, 2025

The Value of Data

According to Commodore Grace Hopper, back in 1985, the flow is: data -> information -> knowledge.

https://www.youtube.com/watch?v=ZR0ujwlvbkQ

I really like this perspective.

Working through that, data is the raw bits and bytes that we are collecting, in various ‘data types’ (formats, encodings, representations). Data also has a structure, which is very important.

Information is really what we are presenting to people. Mostly these days via GUIs, but there are other, older media, like print. The data might be an encoded Julian date, and the information is a readable printed string in one of the nicer date formats.

Knowledge, then, is when someone absorbs this information, and it leads them to a specific understanding. They use this knowledge to make decisions. The decisions are the result of the collection of data as it relates to the physical world.

A part of what she is saying is that collecting data that is wrong or useless has no value. It is a waste of resources. But we did not know back then how to value data, and 40 years later, we still do not know how to do this.

I think the central problem with this is ambiguity. If we collect data on something, and some part of it is missing, it is ambiguous as to what happened. We just don’t know.

We could, for instance, get a list of all of the employees for a company, but without some type of higher structure, like a tree or a DAG, we do not know who reports to whom. We can flatten that structure and embed it directly into the list, as say a column called ‘boss’, which would allow us to reconstruct the hierarchy later.

So, this falls into the difference between data and derived data. The column boss is a relative reference to the structural reporting organization. If we use it to rebuild the whole structure, then we could see all of the employees below a given person. The information may then allow someone to be able to see the current corporate hierarchy, and the knowledge might be that it is inconsistent and needs to be reorganized somehow. So, the decision is to move around different employees to fix the internal inconsistencies and hopefully strengthen the organization.
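To make that concrete, here is a minimal sketch of rebuilding the reporting hierarchy from the flat list; the names and fields are hypothetical, and a real system would also need to guard against cycles and dangling references.

    from collections import defaultdict

    employees = [
        {"id": 1, "name": "Avery", "boss": None},   # top of the hierarchy
        {"id": 2, "name": "Blake", "boss": 1},
        {"id": 3, "name": "Casey", "boss": 1},
        {"id": 4, "name": "Drew",  "boss": 2},
    ]

    # Derived data: an index of direct reports, keyed by the 'boss' reference.
    reports = defaultdict(list)
    for e in employees:
        reports[e["boss"]].append(e)

    def everyone_below(boss_id):
        # Walk the rebuilt tree and yield every employee under the given person.
        for e in reports.get(boss_id, []):
            yield e
            yield from everyone_below(e["id"])

    print([e["name"] for e in everyone_below(1)])   # ['Blake', 'Drew', 'Casey']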

In that sense, this does set the value somewhat. You can make the correct decision if you have all of the employees, none are missing, none of them are incorrect in an overall harmful way, and you have a reference to their boss. The list is full, complete, and up-to-date, and the structural references are correct.

So, what you need to collect is not only the current list of employees and who they report to, but also any sort of changes that happen later when people are hired, or they leave, or change bosses. A snapshot and a stream of deltas that is kept up-to-date. That is all you need to persist in order to make decisions based on the organization of the employees.
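A rough sketch of what persisting a snapshot plus a stream of deltas might look like; the event names and fields here are hypothetical.

    # An initial snapshot of the organization, keyed by employee id.
    snapshot = {1: {"name": "Avery", "boss": None},
                2: {"name": "Blake", "boss": 1}}

    # The stream of changes that happened afterwards.
    deltas = [
        {"op": "hire",        "id": 3, "name": "Casey", "boss": 2},
        {"op": "change_boss", "id": 2, "boss": None},
        {"op": "leave",       "id": 1},
    ]

    def replay(snapshot, deltas):
        # Rebuild the current state without mutating the original snapshot.
        state = {k: dict(v) for k, v in snapshot.items()}
        for d in deltas:
            if d["op"] == "hire":
                state[d["id"]] = {"name": d["name"], "boss": d["boss"]}
            elif d["op"] == "change_boss":
                state[d["id"]]["boss"] = d["boss"]
            elif d["op"] == "leave":
                del state[d["id"]]
        return state

    print(replay(snapshot, deltas))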

Pulling back a bit, if we work backwards, we can see that there are possibly millions of little decisions that need to be made, and we need to collect and persist all of the relevant individual pieces of data, and any related structural relationships as well.

We have done this correctly if and only if we can present the necessary information without any sort of ambiguity. That is, if we don't have a needed date and time for an event, we at least have other time markers such that we can correctly calculate the needed date and time.

But that is a common, often subtle bug in a lot of modern systems. They might know when something starts, for instance, and then only keep track of the number of days until another event occurred. That is enough to derive the correct date, but any sort of calculated time of day is nonsense. The information you present should be only what the data supports, yet if you look at a lot of systems out there, you see bad data, like fake times on the screens. Incorrect derived information, caused by an ambiguity from not collecting a required piece of data, or at the very least, from not presenting the collected and derived data on the screen correctly. It’s an overly simple example, but it is way too common for interfaces to lie about some of the information that they show people.
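A tiny illustration of that ambiguity, assuming a hypothetical system that only collected a start timestamp and a day count:

    from datetime import datetime, timedelta

    start = datetime(2025, 3, 1, 14, 37)   # collected: full date and time
    days_until_event = 12                  # collected: a day count only

    derived = start + timedelta(days=days_until_event)

    print(derived.date())                      # 2025-03-13 -- supported by the data
    print(derived.strftime("%Y-%m-%d %H:%M"))  # 2025-03-13 14:37 -- the time is fake,
                                               # it is just the start time echoed back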

The corollary to all of this is that it seems unwise to blindly collect as much data as possible and just throw it into a data swamp, so that you can sort it out later. That never made any real sense to me.

Modelling the data correctly, so it can be used to present information, is far cheaper if you do it closer to when you collect it. But people don’t want to put in the effort to figure out how to model the data, and they are also worried about missing data they think they should have collected, so they collect it all and insist that they’ll sort it out later. Maybe later comes, sometimes, but rarely, so it doesn’t seem like a good use of resources. The data in the swamp has almost no real value, and is far more likely to never have any.

But all of that tells us that we need to think in terms of: decision -> knowledge -> information -> data.

Tell me what decisions you need to make, and I can tell you what data we have to collect.

If you don’t know, you can at least express it in general terms.

The business may need to react to changes in customer spending, for example. So, we need a system that shows, from a high level all the way down, how the customers are spending on the products and services. And we need it to be historic, so that we can look at changes over time, say last year or five years ago. It can be more specific if the line of business is mature and you have someone whose expertise in that line is incredibly deep, but otherwise, it is general.

It works outwardly as well. You decide to put up a commercial product to help users with a very specific problem. You figure out what decisions they need to make while navigating through that problem, then you know what data you need to collect, and what structure you need to understand.

Say they are shopping for the best deals. You’d want to collect all of the things they have seen so far and rank them somehow. The overall list of all possible deals might get them going, but the actual problem is enabling them to make a decision based on what they’ve seen, not overwhelming them with too much information.

A corollary to this is what bugs me about a lot of the lesser web apps out there. They claim to solve a problem for the users, but then they just push great swaths of the problem back onto the users instead. They’re too busy throwing widgets up onto the screen to care about whether the information in the widgets is useful or not, and they’ve organized the web app around their own convenience, not the users’ need to make a decision. The users end up bouncing all over the place and copying and pasting the information elsewhere to refine the knowledge. It’s not solving the problem, it’s just getting in the way. A bad gateway that slows down access to the necessary information.

I’ve blogged about data modelling a lot, but Grace Hopper’s take on this helps me refine the first part. I’ve always known that you have to carefully and correctly model the data before you waste a lot of time building code on top.

I’ve often said that if you have made mistakes in the modelling, you go down as low as you can to fix them as early as you can. Waiting just compounds the mistake.

I’ve intuitively known when building stuff to figure out the major entities first, then fill in the secondary ones as the system grows. But the notion that you can figure out all of the data for your solution by examining the decisions that get made as people work through their problems really helps in scoping the work.

Take any sort of system, write out all of the decisions you expect people to make as a result of using it, and then you have your schema for the database. You can prioritize the decisions based on how you are justifying, funding, or growing the system.

Following that, first you decide on the problem you want to solve. You figure out which major decisions the users would need to make using your solution, then you craft a schema. From there, you can start adding features, implementing the functionality they need to make it happen. You still have some sense of which decisions you can’t deal with right away, so you get a roadmap as well.

Software essentially grows from a starting point in a problem space; if we envision that as being fields of related decisions, then it helps shape how the whole thing will evolve.

For example, if you want to help the users decide what’s for dinner tonight, you need data about what’s in the fridge, which recipe books they have, what kitchen equipment, and what stores are accessible to them. You let them add to that context, then you can provide an ordered list of the best options, shopping lists, and recipes. If you do that, you have solved their ‘dinner problem’; if you only do a little bit of that, the app is useless. Starting with the decision that they need help making clarifies the rest of it.
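As a sketch only, working backwards from that decision, the context data might look something like this; every name and field here is hypothetical, and the ranking is deliberately crude.

    from dataclasses import dataclass, field

    @dataclass
    class DinnerContext:
        fridge: set[str]           # ingredients on hand
        recipe_books: list[str]    # which books or recipe sources they own
        equipment: set[str]        # stove, wok, slow cooker, ...
        nearby_stores: list[str]   # where missing ingredients can be bought

    @dataclass
    class Recipe:
        name: str
        ingredients: set[str]
        needs: set[str] = field(default_factory=set)  # required equipment

    def rank_options(ctx, recipes):
        # Order recipes by how little shopping they require, given the context.
        usable = [r for r in recipes if r.needs <= ctx.equipment]
        scored = [(r, r.ingredients - ctx.fridge) for r in usable]  # missing = shopping list
        return sorted(scored, key=lambda pair: len(pair[1]))

    ctx = DinnerContext(
        fridge={"pasta", "tomatoes", "garlic"},
        recipe_books=["Weeknight Basics"],
        equipment={"stove", "pot"},
        nearby_stores=["Corner Market"],
    )
    options = rank_options(ctx, [
        Recipe("Pasta al pomodoro", {"pasta", "tomatoes", "garlic", "basil"}, {"stove", "pot"}),
    ])
    for recipe, shopping in options:
        print(recipe.name, "-> buy:", shopping or "nothing")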

As I have often said, software is all about data; code is just the way you move it around. If you want to build sophisticated systems, you need to collect the right data and present it in the right way. Garbage data interferes with that. Minimizing other resource usage, like CPU, is a plus, but it is secondary.

Thursday, December 4, 2025

Expressive Power

You can think about code as just being a means to take different inputs and then deliver a range of related outputs.

In a relative sense, we can look at the size of that code (as the number of lines) and the range of its outputs. We can do this from a higher system perspective.

So, say we have a basic inventory system. It collects data about some physical stuff, lets people explore it a bit, then exports the data downstream to other systems. Without worrying about the specific features or functionality, let's say we were able to get this built with 100K lines of code.

If someone could come along and write the exact same system with 50K lines of the same type of code, it is clear that their code has more ‘expressive power’ than our codebase. Both are doing the same thing, take the same inputs, generate the same range of outputs, use the same technologies, but one is half the amount of code.

We want to amplify expressive power because, ultimately, it is less work to initially build it, usually a lot less work to test it, and it is far easier to extend it over its lifetime.

The code is half the size, so half of the typing work. Bugs loosely correlate to code size, so there are roughly half the number of bugs. If the code reductions were not just cute tricks and syntactic sugar, the code would require a bit more cognitive effort to write, and bug fixing would be a little harder, but not twice as hard, so there are still significant savings. It’s just less brute-force code.

Usually, the strongest way to kick up expressive power is code reuse with a touch of generalization.

Most systems have reams of redundant code; it’s all pretty much the same type of work. Get data from the database, put it on a screen, and put it back into the database again. With a few pipes in and a couple out, that is the bulk of the underlying mechanics.

If you can shrink the database interaction code and screen widget layout code, you can often get orders of magnitude code reductions.
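A sketch of the sort of generalization meant here, using only the standard library; the entity map, tables, and columns are hypothetical, and a real system would still need proper layout, validation, and security.

    import sqlite3

    # One table-driven map instead of a hand-written fetch/display routine per entity.
    ENTITIES = {
        "customer": ["id", "name", "email"],
        "product":  ["id", "title", "price"],
    }

    def fetch_all(conn, entity):
        # Generic 'get it from the database' step, driven by the entity map.
        # 'entity' comes from the fixed map above, never from user input.
        cols = ENTITIES[entity]
        rows = conn.execute(f"SELECT {', '.join(cols)} FROM {entity}").fetchall()
        return [dict(zip(cols, row)) for row in rows]

    def render(entity, records):
        # Generic 'put it on a screen' step: one layout routine for every entity.
        cols = ENTITIES[entity]
        print(" | ".join(cols))
        for rec in records:
            print(" | ".join(str(rec[c]) for c in cols))

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, email TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'Avery', 'avery@example.com')")
    render("customer", fetch_all(conn, "customer"))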

But the other way to kick up expressive power is to produce a much larger range of outputs from the inputs. That tends to come from adding lots of different entities into the application model, some abstraction, and leveraging polymorphism everywhere. More stuff handled more generally.

For instance, instead of hard-coding a few different special sets of users, you put in the ability to group any of them for any reason. One generic group mechanism lets you track as many sets as you need, so there are fewer screens and fewer specific entities, but a wider range of capabilities. A bump up in expressive power.

The biggest point about understanding and paying attention to expressive power comes from the amount of time it saves. We’re often asked to grind out medium-sized systems super quickly, but a side effect of that is that the specifications are highly reactive, so they change all of the time. If you build in strong expressive power early, then any of those arbitrary changes later become way less work, sometimes trivial.

If, from the above example, you had hardcoded sets, adding a new one would be a major pain. With arbitrary groups, it would be trivial.
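A small sketch of that contrast, with hypothetical group names: the hard-coded sets need new code for every new set, while the generic mechanism turns new sets into plain data.

    # Brute force: every special set is its own variable, its own screens, its own code.
    ADMINS = {"avery", "blake"}
    BETA_TESTERS = {"casey"}
    # Adding "auditors" means new code, new screens, a new deployment.

    # Generic: one structure tracks any number of sets, for any reason.
    groups: dict[str, set[str]] = {}

    def add_to_group(group: str, user: str) -> None:
        groups.setdefault(group, set()).add(user)

    def members(group: str) -> set[str]:
        return groups.get(group, set())

    # A new kind of set is now just data, not a code change.
    add_to_group("admins", "avery")
    add_to_group("auditors", "drew")
    print(members("auditors"))   # {'drew'}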

Brute force code is too rigid and fragile, so over time, it counts as dead weight. It keeps you from getting ahead of the game, which keeps you from having enough time to do a good job. You’re scrambling too hard to catch up.

We see that more dramatically if we write 1M lines of code when we just needed 50K. 1M lines of code is a beast, so any sort of change or extension to it goes at a snail's pace. And adding new subsystems into a codebase that was brute-forced is the same work as doing them from scratch, so there is no real ability to leverage any of the earlier efforts. The code becomes a trap that kills almost all momentum. Development grinds to a halt.

But if you have some solid code with strong expressive power, you can use it over and over again. Sometimes you’ll have to ratchet it up to a new level of expressiveness, but it is a fraction of the work of coding it from scratch. Redeploying your own battle-hardened code a whole bunch of times is far superior to writing it from scratch. Less work, less learning, and far fewer bugs.

Since time is often the biggest development constraint and the source of most problems, anything that saves lots of time will always make projects go a whole lot smoother. To keep from getting swamped, we always need to get way more out of any work we do. That is the only way to keep it sane.

Thursday, November 27, 2025

Software Failures

Way back, maybe 15ish years ago, when I was writing that software projects failed just as often then as they did back in the Waterfall era, lots and lots of people said I was wrong.

They insisted that the newer lightweight reactive methodologies had “fixed” all of the issues. But if you understand what was going wrong with these projects, you’d know that was impossible.

So it’s nice to see a modern perspective confirm what I said then:

https://spectrum.ieee.org/it-management-software-failures

The only part of that article I could disagree with is that it was a bit too positive towards Agile and DevOps. It did qualify itself at the end of the paragraph, but the writing still has an overly positive marketing vibe. “Proved successfully” should have been toned down to “claimed successfully”, which is a bit different and a lot more realistic.

If you lumped in all of the software that doesn’t even pay for itself, you’d see that the situation is much worse in most enterprises. Millions of lines of fragile code that kinda do just enough to convolute everything else. It’s pretty ugly out there. A big digital data blender.

From my perspective, the chief problem of our industry is expectations. When non-technical people moved in to take control of development projects, they misprioritized things so badly that the work commonly spins out of control.

If a development project is out of control, the best case is that it will produce a lame system; the worst case is that it will be a total outright failure. It is hard to come back from either consequence.

If we want to fix this, we have to change the way we are approaching the work.

First, we have to accept that software development is extremely, extremely slow. There are no silver bullets, no shortcuts, no vibing, no easy or cheap ways out. It is a lot of work, it is tedious work, and it needs to be done carefully and slowly.

Over the 35 years of my career, it just keeps getting faster, but with every jump up, the quality keeps getting worse. Since you need a minimal level of quality for it to not be lame or a failure, you need a minimal amount of time to get there. You try to skimp on that, it falls apart.

Hacking might be a performance art, but programming is never that. It is a slow, intense slog bordering on engineering. It takes time.

Time to design, time to ramp up, time to train, time to learn, time to code, time to test. Time, time, time.

If the problem is that you are trying to race through the work in order to keep the budget under control, the problem is that you are racing through the work. So, slow it down. Simple fix.

For any serious system, it takes years and years for it to reach maturity. Trying to slam part of it out in six months, then, is more than a bit crazy. Libraries and frameworks don’t save you. SAAS products don’t save you. Being overly reactive and loose with lightweight methodologies doesn’t save you either, and can actually fuel the problems, making it worse, not better.

If you want your software to work properly, you have to put in the effort to make it work properly. That takes time.

The other big issue is that the group of people you assemble to build a big system matters a whole lot. A huge amount.

Programmers are not easily replaceable cogs. An all-junior team that is barely functional may be far less expensive, but it is also a massive risk. The resulting system is already in big trouble before any code is written.

The people you put in charge of the development work really matter. They need to be skilled, experienced, and understand how to navigate some pretty complicated and difficult tradeoffs. Without that type of background, the work gets lost and then spins out of control.

It’s very common to see too much focus dumped on trivial visible interface issues while the underlying mechanics are hopelessly broken. It’s like worrying about the cup holder in your car when the engine block is cracked. You need someone who knows this, has lived this, and can avoid this.

As well, enough of the team needs to have significant experience too. Experience is what keeps us from making a mess, and a mess is the easiest thing to create with software. So, a gifted team of developers, mixed in experience with both juniors and seniors, led by experience, is pretty much a prerequisite to keeping the risks under control.

Software development has always been about people, what they know, and what they can build. They are the most important resource in building and running large software systems. If you don’t have enough skilled people, nothing else matters. No methodology, process, or paperwork can save you. You lack the talent to get it done. Simple.

That’s mostly the roots of our problems. Not enough time, and not taking the staffing issues seriously enough. Fix those two, and most development gets back on the rails. Nurture them, and most things built by a strong shop are pretty good. From there, you can decide how high the quality should be, or how to streamline the work, or strategize about direction, but without that concrete foundation, you are lucky if any of it runs at all, and if it does, lucky that the crashes aren’t too epic.

Thursday, November 20, 2025

Integrations

There are two primary ways to integrate independent software components; we’ll call them ‘on-top’ and ‘underneath’.

On top means that the output from the first piece of software goes in from above to trigger the functionality of the second piece of software.

This works really well if there is a generalized third piece of software that acts as a medium.

This is the strength of the Unix philosophy. There is a ‘shell’ on top, which is used to call a lot of smaller, well-refined commands underneath. The output of any one command goes up to the shell, and then down again to any other command. The format is unstructured or at least semi-structured text. Each command CLI takes its input from stdin and command line arguments, then puts its output to stdout, and splits off any errors into stderr. The ‘integration’ between these commands is ‘on top’ of the CLI in the shell. They can all pipe data to each other.
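As a sketch of that convention in a single hypothetical command (shown in Python purely for brevity): it reads stdin, writes results to stdout, and splits errors off to stderr, so the shell above can pipe it together with anything else.

    # upcase.py -- a minimal command written to the convention described above.
    # The integration happens on top, in the shell, for example:
    #   cat names.txt | python3 upcase.py | sort
    import sys

    for line in sys.stdin:
        text = line.rstrip("\n")
        if not text:
            print("skipping blank line", file=sys.stderr)  # errors split off to stderr
            continue
        print(text.upper())                                # results go to stdout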

This proved to be an extremely powerful and relatively consistent way of integrating all of these small parts together in a flexible way.

Underneath integrations are the opposite. The first piece of software keeps its own configuration data for the second piece and calls it directly. There may be no third party, although some implementations of this ironically spin up a shell underneath, which then spins up the other command. Sockets are also commonly used to communicate, but they depend on the second command already being up and running and listening on the necessary port, so they are less deterministic.

A lot of modern software prefers the second type of integration, mostly because it is easier for programmers to implement it. They just keep an arbitrary collection of data in the configuration, and then start or call the other software with that configuration.

The problem is that even if the configuration itself is flexible, this is still a ‘hardwired’ integration. The first software must include enough specific code in order to call the second one. The second one might have a generic API or CLI. If it needs the output, the first software needs to parse the output it gets back.
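A sketch of an underneath integration, with a hypothetical config and a hypothetical ‘report_tool.py’ as the second program: the first program hardwires enough knowledge to call the second directly and to parse whatever comes back.

    import json
    import subprocess

    config = {
        "command": ["/usr/bin/python3", "report_tool.py"],  # hardwired knowledge of the other program
        "args": ["--format", "json"],
        "timeout_seconds": 30,
    }

    def run_report():
        # Call the second program directly, underneath, using our own config.
        result = subprocess.run(
            config["command"] + config["args"],
            capture_output=True, text=True,
            timeout=config["timeout_seconds"],
        )
        if result.returncode != 0:
            raise RuntimeError(f"report_tool failed: {result.stderr.strip()}")
        return json.loads(result.stdout)   # the caller must know how to parse the output

If the second program changes its flags or output format, this caller quietly breaks, which is exactly the fragility described further down.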

If the interaction is bi-directional and a long-running protocol, this makes a lot of sense. Two programs can establish a connection, get agreement on the specifics, and then communicate back and forth as needed. The downside is that both programs need to be modified, and they need to stay in sync. Communication protocols can be a little tricky to write, but are very well understood.

But this makes a lot less sense if the first program just needs to occasionally trigger some ‘functionality’ in the second one. It’s a lot of work for an infrequent and often time-insensitive handoff. It is better to get the results out of the program and back up into some other medium, where it can be viewed and tracked.

The top-down approach is considerably more flexible, and because everything passes through the third party, it is far easier to diagnose problems. You can get a high-level log of the interaction, instead of having to stitch together parts of a bunch of scattered logs. Identifying where a problem originated in a bunch of underneath integrations is a real nightmare.

Messaging backbones act as third parties as well. If they are transaction-oriented and bi-directional, then they are a powerful medium for different software to integrate. They usually define a standard format for communication and data. Unfortunately, they are often vendor-specific, very expensive, locked in, and can have short lifespans.

On-top integrations can be a little more expensive in resources. They are slower, use more CPU, and it costs more to format everything into a common format and then parse it back into the specifics than to pass the specifics directly. So they are not preferred for large-scale, high-performance systems. But they are better for low-volume or infrequent interactions.

However, on-top integrations also require a lot more cognitive effort and pre-planning. You have to carefully craft the mechanics to fit well into the medium. You essentially need a ‘philosophy’, then a bunch of implementations. You don’t just randomly evolve your way into them.

Underneath integrations can be quite fragile. When there are a lot of them, they are heavily fragmented; the configurations are scattered all over the place. If there are more than 3 of them chained together, it can get quite hairy to set them up and keep them running. Without some intensive tracking, unnoticed tiny changes in one place can manifest later as larger mysterious issues. It is also quite a bit harder to reason about how the entire thing works, which causes unhappy surprises. Equally problematic is that each integration is very different, and all of these inconsistencies and different idioms increase the likelihood of bugs.

As an industry, we should generally prefer on-top integrations. They proved to be powerful and reliable for decades for Unix systems. It’s just that we need more effort in finding expressive generalized data passing mechanisms. Most of the existing data formats are far too optimized for limited sub-cases or are too awkward to implement correctly. There are hundreds of failed attempts. If we are going to continue to build tiny independent, distributed pieces, we have to work really hard to avoid fragmentation if we want them to be reliable. Otherwise, they are just complexity bombs waiting to go off.

We’ll still need underneath integrations too, but really only for bi-directional, extensive, high-speed, or very specific communication. These should be the exception -- an optimization -- rather than the rule. They are easier to implement, but they are also less effective and a dangerous complexity multiplier.

Thursday, November 13, 2025

Unknown unknowns

If I decided to build a house all on my own, I am pretty sure I would face lots of unexpected problems.

I am comfortable building something like a fence or a deck, but those skills and the knowledge I gained using them are nowhere close to what it takes to build a house.

What does it take to build a house? I have no clue. I can look at already built houses, and I can watch videos of people doing some of the work, but that isn’t even close to enough information to empower me to just go off and do it on my own.

If I tried, I would surely mess up.

That might be fine if I were building a little shed out back to store gardening tools. Whatever mess I created would probably not result in injuries to people; the possibility is very slim.

But knowing that there are a huge number of unknown unknowns out there, I would be more than foolish to start advertising myself as a house builder, and even sillier to take contracts to build houses.

If a building came tumbling down late one night, it could very likely kill its occupants. That is a lot of unnecessary death and mayhem.

Fortunately, for my part of the world, there are plenty of regulators out there with building codes that would prevent me from making such a dangerous mistake.

The building codes are usually specific in how to do things, but they were initially derived from real issues that explain why they are necessary.

If I were to carefully go through the codes, I am sure that their existence -- if I pondered hard enough -- would shed light on some of those unknown unknowns that I am missing.

There might be something specific about roof construction that was driven by roofs needing to withstand a crazy amount of weight from snow. The code mandates the fix, but the reason for seemingly going overboard on the tolerances could be inferred from the existence of the code itself. “Sometimes there is a lot of extra weight on roofs”.

Rain, wind, snow, earthquakes, tsunamis, etc. There are a series of low-frequency events that need to be factored into any construction. They don’t occur often, but the roof needs to survive if and when they manifest themselves.

Obviously, it took a long time and a lot of effort over decades, if not centuries, to build up these building codes. But their existence is important. In a sense, they separate out the novices from the experts.

If I tried to build a house without reading or understanding them, it would be obvious to anyone with a deeper understanding that I was just not paying attention to the right areas of work. The foundations are floppy or missing, the walls can’t hold up even a basic roof, and the roof will cave in under the lightest of loads. The nails are too thin; they’ll snap when the building is sheared. It would be endless, really, and since I don’t know how to build a house, I certainly don’t know all of the forces and situations that would cause my work to fail.

I’ve always thought that it was pretty obvious that software needs building codes as well.

I can’t count the number of times that I dipped into some existing software project only to find that problems I consider very obvious, given my experiences, were completely and totally ignored. And then, once the impending disasters manifested themselves, everybody around me just said “Hey, that is unexpected”, when it was totally expected. I’ve been around the block; I knew it was coming.

Worse is that whenever I tried to forewarn them, they usually didn’t want to listen. Treated me as some old paranoid dude, and went happily right over the cliff.

It gets so boring having to say “I told you so” that, at some point in my career, I just stopped doing it. I stuck with “you can lead a horse to water, but you cannot make it drink” instead.

And that is where building codes for software come in. As a new developer on an existing project, I often don’t carry much weight, but if there were an official reference of building codes that covered the exact same thing, the problem would be easy to prevent. “You’ve violated code 4.3.2; it will cause a severe outage one day” carries far more weight than me trying to explain why the novice blog posts they read, the ones that said it was a good idea, are so horribly wrong.

Software development is choked with so many myths and inaccuracies that wherever you turn, you bump into something false, like trying to run quickly through a papier-mâché maze without destroying it.

We kinda did this in the past with “best practices”, but it was informal and often got co-opted by dubious people with questionable agendas. I think we need to try again. This time, it is a bunch of “specific building codes” that are tightly versioned. They start by listing out strict ‘must’ rules, then maybe some situationally optional ones, and an appendix with the justifications.

It’s oddly hard to write, and harder to keep it stack-, vendor-, and paradigm-neutral. We should probably start by being very specific, then gradually consolidate those codes into broader, more general ones.

It would look kinda like:

1.1.1 All variable names must be self-describing and synchronized with any relevant outside domain or technical terminology. They must not include any cryptic or encoded components.

1.2.1 All function names must be self-describing and must clearly indicate the intent and usability of the code that they encapsulate. They must not include any cryptic or encoded components, unless mandated by the language or usage paradigm.

That way, if you came across a function called FooHandler12Ptr, you could easily say it was an obvious violation of 1.2.1 and add it as a bug or a code review fix.
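A toy sketch of how such a code could be checked mechanically; the heuristics below are hypothetical and nowhere near a full 1.2.1 check, but they show that the rule is objective enough to automate.

    import ast
    import re

    # Crude, hypothetical markers of 'cryptic or encoded components' in a name.
    SUSPECT = re.compile(r"(\d|ptr|tmp|foo)", re.IGNORECASE)

    def check_function_names(source, filename="<memory>"):
        # Flag every function definition whose name trips the heuristic.
        violations = []
        for node in ast.walk(ast.parse(source, filename)):
            if isinstance(node, ast.FunctionDef) and SUSPECT.search(node.name):
                violations.append((filename, node.lineno, node.name, "violates 1.2.1"))
        return violations

    sample = "def FooHandler12Ptr(x):\n    return x\n"
    for v in check_function_names(sample):
        print(v)   # ('<memory>', 1, 'FooHandler12Ptr', 'violates 1.2.1')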

In the past, I have worked for a few organizations that tried to do this. Some were successful, some failed miserably. But I think that in all cases, there was too much personality and opinion buried in their efforts. So, the key part here is that each and every code is truly objective. Almost in a mathematical sense, they are all ‘obviously true’ and don’t need to be broken down any further.

I do know that, given the nature of humanity, there is at least one programmer out there in this wide world who currently believes that ‘FooHandler12Ptr’ isn’t just a good name, it should also be considered a best practice. For each code, I think there needs to be an appendix, and that is where the arguments and justifications should rest. It is for those people adventurous enough to want to argue against the rules. There are plenty of romanticized opinions and variations on goals; our technical discussions quickly get lost in very non-objective rationales. That should be expected, and the remedy is for people with esoteric views to simply produce their own esoteric building codes. The more, the merrier.

Of course, if we do this, it will eat up some time, both to write up the codes and to enforce them. The one ever-present truth of most programming is that there isn’t even close to enough time to spare, and most managements are chronically impatient. So, we sell adherence to the codes as a ‘plus’, particularly for commercial products or services. Being able to say “We are ‘XXX 3.2.1’ compliant” is a means of really asserting that the software is actually good enough for its intended usage. In an age where most software isn’t good enough, at some point this will become a competitive advantage, and a bit later a necessity. We just need a few products to go there first, and the rest will have to follow.