Friday, April 26, 2024

The Origin of Data

In software development, we create a lot of variables and data structures for our code, and we toss them around all over the place.

Into these we put data, lots of it.

Some of that data originates from far away. It is not our data.

Some of that data is a result of people using our interface. Some of that data we have derived from the data we already have. This is our data.

It is crucial to understand, before coding, where the data originates, how often it gets created, and how it varies.

Ultimately the quality of any system rests more on its data than on its code. If the code is great, but the data is garbage, the system is useless right now. If the data is great, but the code is flaky, it is at least partially usable and is fixable. If all you collected is garbage, you have collected nothing.

Common mistakes with data:
  • Allowing invalid data to percolate and persist
  • Altering someone else’s data
  • Excessive or incorrect transformations

GARBAGE IN, GARBAGE OUT

It is a mistake to let garbage data into the running code.

Data always arrives through an “entry point”, so you block any incoming garbage as close to that point as you can. An entry point is a gateway from anywhere outside the system, including the persistent database itself. All of these entry points should immediately reject invalid data, although there are sometimes variations on this that allow staging the data until it is corrected later.

All entry points should share the same validation code; that saves a lot of time and ensures consistency. If the validation lets in specific variations of the data, it is because those variations are valid in the real world or in the system itself.
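
A minimal sketch of what shared validation might look like, assuming hypothetical customer records that arrive from both an API and the persisted store; the fields and entry points are made up for the example:

```python
# A sketch of shared entry-point validation. The record shape
# (name, email) and the two entry points are hypothetical.

def validate_customer(record: dict) -> list[str]:
    """Return a list of problems; an empty list means valid."""
    problems = []
    if not record.get("name", "").strip():
        problems.append("name is missing or blank")
    if "@" not in record.get("email", ""):
        problems.append("email is not an address")
    return problems

def accept_from_api(record: dict) -> dict:
    """Entry point for user input: reject garbage immediately."""
    problems = validate_customer(record)
    if problems:
        raise ValueError("; ".join(problems))
    return record

def load_from_database(record: dict) -> dict:
    """Entry point for persisted data: the exact same validation."""
    problems = validate_customer(record)
    if problems:
        raise ValueError("; ".join(problems))
    return record
```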

It is a lot of work to precisely ‘model’ all of the data in any system, but that work anchors the quality of the system. Skipping that effort will always force the quality to be lower.

Data that doesn’t come directly from the users of the system comes in from the outside world. You have to respect that data.


RESPECT THE DATA

If you didn’t collect the data yourself, it is more than a little rude to start changing it.

The problem comes from programmers being too opinionated about the data types, or taking questionable shortcuts. Either way, you are not saving a copy of someone’s data; you are saving a variation on it. Variations always break somewhere.

If the data is initially collected in a different system, it is up to that originating system to change it. You should just maintain a faithful copy of it, which you can use for whatever you need. But it is still the other system’s data, not yours.

Sometimes people seed their data from somewhere else and then allow their own interfaces to mess with it. That is fine if, and only if, it is a one-time migration. If you ignore that and try to mix repeated migrations with keeping your own copy, the results will be disastrous. Your copy and the original will drift apart and are not mergeable, so eventually one version or the other will end up wrong. That will cause grief.

It’s worth noting that a great many of the constants that people put into their code are also other people’s data. You didn’t collect the constant, which is why it ended up in the code instead of in a variable.

You should never hardcode any data; it should come in from persistence or configuration. In that way, good code has almost no constants in it. Not strings, or numbers, or anything really, just pure code that takes inputs and returns outputs. Any sort of hardcoded value is always suspicious. If you do hardcode something, isolate it into its own function and be incredibly suspicious of it. It is probably a bad idea.
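
As a small sketch of keeping values out of the code, assuming a made-up config.json file and a hypothetical ‘retry_limit’ setting; the one remaining hardcoded fallback is isolated in its own function, where it stays visible and suspicious:

```python
import json

# A sketch of pulling data from configuration instead of the code.
# The file name and the 'retry_limit' key are hypothetical.

def load_config(path: str = "config.json") -> dict:
    with open(path) as f:
        return json.load(f)

def default_retry_limit() -> int:
    # The one isolated hardcoded value, kept visible and suspicious.
    return 3

def retry_limit(config: dict) -> int:
    return config.get("retry_limit", default_retry_limit())
```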


DON’T FIDGET

You’ll see a lot of data that moves around in multiple different representations. It is one data item, but it can be parsed into subpieces that also have value on their own. You often see systems that ‘split’ and ‘join’ the same data repeatedly, layer after layer. Obviously, that is a waste of CPU.

Most of the time, if you get some incoming data, the best choice is to parse it down right away and pass it around in that state. You know you need at least one piece of it, so why wait until the last moment to get it? You ‘split’ coming in and ‘join’ going out. Note that ‘split’ is not suitable for any parsing that needs look-ahead to tokenize properly.
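
A small sketch of that, assuming a hypothetical ‘Last, First’ wire format: the name is split once at the entry point, the pieces travel through the layers, and they are joined once on the way out:

```python
# A sketch of splitting on entry and joining on exit. The
# "Last, First" wire format is a hypothetical example.

def parse_name(raw: str) -> tuple[str, str]:
    """Split the incoming data once, at the boundary."""
    last, _, first = raw.partition(",")
    return first.strip(), last.strip()

def format_name(first: str, last: str) -> str:
    """Join once, at the output boundary."""
    return f"{last}, {first}"

first, last = parse_name("Hopper, Grace")  # 'split' coming in
# ... every layer in between works with the parsed pieces ...
print(format_name(first, last))            # 'join' going out
```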

There are plenty of exceptions to breaking the data down immediately. For example, the actual type may be the higher representation, while the piece is just an alias for it, so parsing right away would disrespect that data. This is common when the data is effectively scoped in some way.

If you need to move around two or more pieces of data together all or most of the time, they should be in the same composite structure. You move that instead of the individual pieces. That keeps them from getting mixed up.
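
A minimal sketch, with hypothetical fields; an amount and its currency always travel together, so they live in one structure:

```python
from dataclasses import dataclass

# A sketch of a composite structure. The fields are hypothetical;
# the point is moving one structure instead of loose values.

@dataclass(frozen=True)
class Money:
    amount: int    # minor units, e.g. cents
    currency: str  # e.g. "CAD"

def apply_discount(price: Money, percent: int) -> Money:
    # The pair moves together, so it cannot get mixed up.
    return Money(price.amount * (100 - percent) // 100, price.currency)
```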

Another way to mess up data is to apply incorrect transformations to it. Common variations are changing the data type, altering the character set, or other representation issues. A very common weakness is to use date & time variables to hold only dates, then signal ‘date only’ in-band with a specific time, often midnight. Date, time, and date & time are three very different data types, used for three very different situations.
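
Python’s standard library, for instance, keeps the three types distinct; the variable names in this sketch are hypothetical:

```python
from datetime import date, datetime, time

invoice_date = date(2024, 4, 26)            # a date: no time component
opening_time = time(9, 0)                   # a time: no date component
created_at = datetime(2024, 4, 26, 14, 30)  # a full date & time

# The weakness described above: holding a date in a date & time
# variable and signalling "date only" in-band with a magic time.
bad_invoice_date = datetime(2024, 4, 26, 0, 0)  # real midnight, or a flag?
```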

Ambiguous representations, like combining integers and floating-point values into the same type, are really bad too. You always need extra information to make sense of data, so throwing some of that meta-information away will hurt. Ambiguities are easy to create but pretty deadly to correct.
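
A short sketch of that ambiguity, with made-up fields:

```python
from decimal import Decimal

# Ambiguous: once everything is a float, a count and a measurement
# look identical, and that meta-information is gone.
quantity = 2.0
weight_kg = 2.0

# Explicit: the types themselves carry the distinction.
quantity = 2                # an exact integer count
weight_kg = Decimal("2.0")  # a measured value with known precision
```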


SUMMARY

One of the odder parts of programming culture is the perceived freedom programmers have to represent their data any way they choose. That freedom doesn’t exist when the data originated outside of the system.

There are sometimes a few different implementation choices, but they never come for free; there are always trade-offs. You have to spend the time to understand the data you need in order to decide how the code should deal with it. You need to understand it first. Getting that backward and just whacking out code tends to converge on rather awkward code full of icky patches trying to correct those bad initial assumptions. That type of code never gets better, only worse.

None of these points about handling data changes over time. They are not ancient or modern. People have been writing about this since the 70s, and it was as true then as it is now. It supersedes all technical stacks and applies to every technology. Software that splatters garbage data on the screen is useless, always has been, and always will be.
