https://www.youtube.com/watch?v=ZR0ujwlvbkQ
I really like this perspective.
Working through that, data is the raw bits and bytes that we are collecting, in various 'data types' (formats, encodings, representations). Data also has a structure, which is very important.
Information is really what we are presenting to people. Mostly these days via GUIs, but there are other, older media, like print. The data might be an encoded Julian date, and the information is a readable printed string in one of the nicer date formats.
Knowledge, then, is when someone absorbs this information and it leads them to a specific understanding. They use this knowledge to make decisions. The decisions, ultimately, are the result of collecting data about the physical world.
Part of what she (Grace Hopper) is saying is that collecting data that is wrong or useless has no value. It is a waste of resources. But we did not know back then how to value data, and 40 years later, we still do not know how to do this.
I think the central problem with this is ambiguity. If we collect data on something, and some part of it is missing, it is ambiguous as to what happened. We just don't know.
We could, for instance, get a list of all of the employees for a company, but without some type of higher structure, like a tree or a DAG, we do not know who reported to whom. We can flatten that structure and embed it directly into the list, as say a column called 'boss', which would allow us to reconstruct the hierarchy later.
So, this falls into the difference between data and derived data. The 'boss' column is a relative reference to the reporting structure of the organization. If we use it to rebuild the whole structure, then we could see all of the employees below a given person. The information may then allow someone to see the current corporate hierarchy, and the knowledge might be that it is inconsistent and needs to be reorganized somehow. So, the decision is to move around different employees to fix the internal inconsistencies and hopefully strengthen the organization.
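As a rough sketch of that reconstruction (the people and the field names here are purely illustrative, not from any real system), rebuilding the hierarchy from the flattened list might look something like this:

```python
# Minimal sketch: rebuild a reporting hierarchy from a flat employee list
# where each row only carries a relative 'boss' reference.
from collections import defaultdict

employees = [
    {"name": "avery", "boss": None},      # no boss: top of the hierarchy
    {"name": "blake", "boss": "avery"},
    {"name": "casey", "boss": "avery"},
    {"name": "drew",  "boss": "blake"},
]

# Invert the relative 'boss' reference into a boss -> direct reports map.
reports = defaultdict(list)
for e in employees:
    reports[e["boss"]].append(e["name"])

def everyone_below(boss):
    """All employees below a given person, derived from the flat list."""
    below = []
    for name in reports.get(boss, []):
        below.append(name)
        below.extend(everyone_below(name))
    return below

print(everyone_below("avery"))  # ['blake', 'drew', 'casey']
```

The derived structure is only as good as the references; a missing or wrong 'boss' value is exactly the kind of ambiguity that makes the rebuilt hierarchy untrustworthy.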
In that sense, this does set the value somewhat. You can make the correct decision if you have all of the employees, none missing, none of them incorrect in an overall harmful way, and each with a reference to their boss. The list is complete and up-to-date, and the structural references are correct.
So, what you need to collect is not only the current list of employees and who they report to, but also any sort of changes that happen later when people are hired, or they leave, or change bosses. A snapshot and a stream of deltas that is kept up-to-date. That is all you need to persist in order to make decisions based on the organization of the employees.
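A minimal sketch of that idea, with made-up event names and people: a snapshot of who reports to whom, plus a list of deltas replayed on top of it, is enough to reconstruct the current organization.

```python
# Minimal sketch: snapshot plus a stream of deltas, kept up-to-date.
snapshot = {"blake": "avery", "casey": "avery"}   # employee -> boss

deltas = [
    ("hire",        "drew",  "blake"),   # drew joins, reporting to blake
    ("change_boss", "casey", "blake"),   # casey now reports to blake
    ("leave",       "drew",  None),      # drew leaves the company
]

def current_org(snapshot, deltas):
    """Replay the deltas over the snapshot to get the current org."""
    org = dict(snapshot)
    for event, who, boss in deltas:
        if event in ("hire", "change_boss"):
            org[who] = boss
        elif event == "leave":
            org.pop(who, None)
    return org

print(current_org(snapshot, deltas))  # {'blake': 'avery', 'casey': 'blake'}
```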
Pulling back a bit, if we work backwards, we can see that there are possibly millions of little decisions that need to be made, and we need to collect and persist all of the relevant individual pieces of data, and any related structural relationships as well.
We have done this correctly if and only if we can present the necessary information without any sort of ambiguity. That is, if we don't have a needed date and time for an event, we at least have other time markers such that we can correctly calculate the needed date and time.
But that is a common, often subtle bug in a lot of modern systems. They might know when something starts, for instance, and then keep track of the number of days since the start when another event occurred. That's enough to derive the correct date for the later event, but any sort of calculated time is nonsense. If you did that, the information you present should be the date only, yet if you look at a lot of systems out there, you see bad data, like fake times on the screens. Incorrect derived information caused by an ambiguity, caused by not collecting a required piece of data, or at the very least, by not presenting the actual collected and derived data on the screen correctly. It's an overly simple example, but it is way too common for interfaces to lie about some of the information that they show people.
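Here is a tiny sketch of that ambiguity, with made-up values: the start has a real time, the later event was only recorded as a count of days, so the derived date is honest but any derived time-of-day is fabricated.

```python
# Minimal sketch of the ambiguity: we know the start and the number of
# days until a later event, but nothing about that event's time of day.
from datetime import datetime, timedelta

start = datetime(2024, 3, 1, 9, 30)   # collected: full date and time
days_until_event = 4                  # collected: only a count of days

derived = start + timedelta(days=days_until_event)
print(derived.date())   # 2024-03-05 -- the date is genuinely derivable
print(derived.time())   # 09:30:00   -- a fake time; it was never collected
```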
The corollary to all of this is that it seems unwise to blindly collect as much data as possible and just throw it into a data swamp, so that you can sort it out later. That never made any real sense to me.
The costs of modelling the data correctly so it can be used to present information are far lower if you do it close to when you collect it. But people don't want to put in the effort to figure out how to model the data, and they are also worried about missing data that they think they should have collected, so they collect it all and insist that they'll sort it out later. Maybe later comes, sometimes, but rarely, so it doesn't seem like a good use of resources. The data in the swamp has almost no real value now, and more likely than not it never will.
But all of that tells us that we need to think in terms of: decision -> knowledge -> information -> data.
Tell me what decisions you need to make, and I can tell you what data we have to collect.
If you don’t know, you can at least express it in general terms.
The business may need to react to changes in customer spending, for example. So, we need a system that shows, from a high level all the way down to the details, how the customers are spending on the products and services. And we need it to be historic, so that we can look at changes over time, say last year or five years ago. It can be more specific if the line of business is mature and you have someone whose expertise in that line is incredibly deep, but otherwise, it is general.
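As a rough sketch, with illustrative field names: the data underneath that only needs to be spending records with a time dimension, so they can be rolled up by period and by product to show the changes.

```python
# Minimal sketch: customer spending rolled up by year and product,
# so changes over time become visible. Fields are illustrative only.
from collections import defaultdict

purchases = [
    {"year": 2023, "product": "widgets", "amount": 120.0},
    {"year": 2024, "product": "widgets", "amount":  95.0},
    {"year": 2024, "product": "support", "amount": 300.0},
]

spend = defaultdict(float)
for p in purchases:
    spend[(p["year"], p["product"])] += p["amount"]

for (year, product), amount in sorted(spend.items()):
    print(year, product, amount)
```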
It works outwardly as well. You decide to put up a commercial product to help users with a very specific problem. You figure out what decisions they need to make while navigating through that problem, then you know what data you need to collect, and what structure you need to understand.
They are shopping for the best deals. You’d want to collect all of the things they have seen so far and rank them somehow. The overall list of all deals possible might get them going, but the actual problem is enabling them to make a decision based on what they’ve seen, not to overwhelm them with too much information.
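A tiny sketch of that, with made-up fields and a deliberately simple ranking (rating per dollar), just to show that the value is in ordering what they have already seen, not in listing everything:

```python
# Minimal sketch: rank only what the shopper has already seen, rather
# than dumping every possible deal on them. Fields are illustrative.
seen_deals = [
    {"item": "laptop A", "price": 899, "rating": 4.2},
    {"item": "laptop B", "price": 749, "rating": 3.9},
    {"item": "laptop C", "price": 812, "rating": 4.6},
]

# One simple ranking: best rating per dollar, top few results only.
ranked = sorted(seen_deals, key=lambda d: d["rating"] / d["price"], reverse=True)
for deal in ranked[:3]:
    print(deal["item"], deal["price"], deal["rating"])
```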
The corollary to this is what really bugs me about a lot of the lesser web apps out there. They claim to solve a problem for the users, but then they just push great swaths of the problem back onto the users instead. They're too busy throwing widgets up onto the screen to care about whether the information in the widgets is useful or not, and they've organized the web app around their own convenience, not the users' need to make a decision. That forces the users to bounce all over the place and copy and paste the information elsewhere to refine the knowledge. It's not solving the problem, but just getting in the way. A bad gateway that slows down access to the necessary information.
I’ve blogged about data modelling a lot, but Grace Hopper’s take on this helps me refine the first part. I’ve always known that you have to carefully and correctly model the data before you waste a lot of time building code on top.
I’ve often said that if you have made mistakes in the modelling, you go down as low as you can to fix them as early as you can. Waiting just compounds the mistake.
I’ve intuitively known when building stuff to figure out the major entities first, then fill in the secondary ones as the system grows. But the notion that you can figure out all of the data for your solution by examining the decisions that get made as people work through their problems really helps in scoping the work.
Take any sort of system, write out all of the decisions you expect people to make as a result of using it, and then you have your schema for the database. You can prioritize the decisions based on how you are justifying, funding, or growing the system.
Following that, first you decide on the problem you want to solve. You figure out which major decisions the users would need to make using your solution, then you craft a schema. From there, you can start adding features, implementing the functionality they need to make it happen. You also have a sense of which decisions you can't deal with right away, which gives you a roadmap as well.
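Here is a rough sketch of working backwards from decisions to a schema; the decisions, priorities, and entity names are all hypothetical placeholders, the point is only the direction of the derivation.

```python
# Minimal sketch: derive a first-cut schema, in priority order, from the
# decisions the users need to make. All names here are placeholders.
decisions = [
    {"decision": "reorganize a department", "priority": 1,
     "needs": ["employee", "reporting_link", "org_change_event"]},
    {"decision": "adjust product pricing",  "priority": 2,
     "needs": ["product", "customer_spend", "time_period"]},
]

# The union of 'needs', ordered by priority, is both a first cut at the
# schema and a rough roadmap: what to model now, what can wait.
schema = []
for d in sorted(decisions, key=lambda d: d["priority"]):
    for entity in d["needs"]:
        if entity not in schema:
            schema.append(entity)

print(schema)
```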
Software essentially grows from a starting point in a problem space; if we envision that as being fields of related decisions, then it helps shape how the whole thing will evolve.
For example, if you want to help the users decide what’s for dinner tonight, you need data about what’s in the fridge, which recipe books they have, what kitchen equipment, and what stores are accessible to them. You let them add to that context, then you can provide an ordered list of the best options, shopping lists, and recipes. If you do that, you have solved their ‘dinner problem’; if you only do a little bit of that, the app is useless. Starting with the decision that they need help making clarifies the rest of it.
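A minimal sketch of that 'dinner problem', with made-up fridge contents and recipes: rank the options by how much of each recipe is already on hand, and the gaps become the shopping list.

```python
# Minimal sketch: order dinner options by ingredient coverage and derive
# a shopping list for what's missing. All data here is illustrative.
fridge = {"eggs", "cheese", "spinach", "rice"}

recipes = {
    "omelette":   {"eggs", "cheese", "spinach"},
    "fried rice": {"rice", "eggs", "peas", "soy sauce"},
    "lasagna":    {"pasta", "cheese", "tomato sauce", "beef"},
}

def rank(recipes, fridge):
    """Score each recipe by the fraction of ingredients already on hand."""
    scored = []
    for name, ingredients in recipes.items():
        coverage = len(ingredients & fridge) / len(ingredients)
        missing = sorted(ingredients - fridge)
        scored.append((coverage, name, missing))
    return sorted(scored, reverse=True)

for coverage, name, shopping_list in rank(recipes, fridge):
    print(f"{name}: {coverage:.0%} of ingredients on hand, still need {shopping_list}")
```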
As I have often said, software is all about data; code is just the way you move it around. If you want to build sophisticated systems, you need to collect the right data and present it in the right way. Garbage data interferes with that. Minimizing other resource usage, like CPU, is a plus, but it is secondary.