Thursday, June 18, 2026

Structureless

The most common mistake I have seen in big ugly balls of mud is to try to capture data without enough structure.

Chopping up some incoming data into a lot of little strongly typed fields is a pain. Sometimes it seems like an unwarranted pain. You may need to get a mailing address. Why not just give the user a big textbox to fill in?

The problem isn’t that the users can’t carefully type in the text with appropriate structure; it’s that sometimes they won’t.

And the code to parse unstructured text is stupidly complicated. They can type anything; you have to be able to apply some type of structure to each and every possible variation. Since there are an infinite number of those, you are going to lose.

You can add a ton of validation, but unless it is rigorous and enforced within a fully interactive interface, any tiny way to bypass it turns the whole thing into a total waste of effort.

You could scream at the users and make them format it correctly, but as time wears on, unless you keep screaming, eventually that practice will degrade. It will just delay the inevitable.

Which is to say that in a computer, a big box full of text is absolutely nothing more than a big box full of text. It has no other use or value. That makes it useful for someone putting a “personal note” somewhere in a report, or something like that, but there is nothing beyond that. It’s not really data; it’s just an extra external comment of some type.

If what you intend to do is collect data with a very specific structure, you should never subvert text boxes to do that for you. It’s not a shortcut; you haven’t “figured it out”, you just made a very bad mistake. And sadly, doing it right wouldn’t have taken that much more time.
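As a rough sketch of what “doing it right” might look like for the mailing address example, here is a small typed record instead of one big string. The field names, the Canadian-style province codes, and the postal code pattern are all just assumptions for illustration; a real system would model whatever its own addresses actually require.

```python
import re
from dataclasses import dataclass

# Illustrative assumptions only: Canadian province/territory codes and the
# "A1A 1A1" postal code shape. Swap in whatever your own addresses need.
PROVINCES = {"AB", "BC", "MB", "NB", "NL", "NS", "NT", "NU", "ON", "PE", "QC", "SK", "YT"}
POSTAL_CODE = re.compile(r"[A-Z]\d[A-Z] \d[A-Z]\d")


@dataclass(frozen=True)
class MailingAddress:
    """Hypothetical structured address: one typed field per piece of data."""
    street_number: int
    street_name: str
    city: str
    province: str
    postal_code: str

    def __post_init__(self):
        # Dataclasses do not enforce annotations, so check each field here.
        if not isinstance(self.street_number, int) or self.street_number <= 0:
            raise ValueError("street_number must be a positive integer")
        if not self.street_name.strip():
            raise ValueError("street_name is required")
        if not self.city.strip():
            raise ValueError("city is required")
        if self.province not in PROVINCES:
            raise ValueError(f"unknown province code: {self.province!r}")
        if not POSTAL_CODE.fullmatch(self.postal_code):
            raise ValueError(f"badly formed postal code: {self.postal_code!r}")


address = MailingAddress(12, "Main St", "Toronto", "ON", "M5V 2T6")
```

Each field maps to its own input in the interface, so a bad value gets rejected at the point of entry rather than discovered much later by a parser.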

Likewise, if you are using some questionable software that has a lot of text boxes, and you come up with a clever idea about how to put structured data into them just because you can’t or don’t know how to change the program, then it isn’t brilliant. It’s just a sloppy, hacky way of trying to get around some other code. One step worse than duct tape.

Data is not data without structure. Untyped data is not data. A big string of characters is a mess. Mostly.

If it originated under the strict control of some code somewhere, and the pathway was closed and guarded, then sure, two programs can use strings and complicated parsers to pass data back and forth. It starts with structure, is transported in an unstructured container, and then it is restructured again. But if the pathway is open or one end of that game is a human, then a big string of stuff is just potential garbage. At some point, usually in the not-too-distant future, someone will fill that string with a problem, and it will end up wasting a lot of time.
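A minimal sketch of that closed round trip, assuming a made-up `Reading` record and JSON as the unstructured container: the sender starts with structure, the string is only the transport, and the receiver refuses anything it cannot fully restructure.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Reading:
    """Hypothetical record passed between two programs."""
    sensor_id: int
    celsius: float


def encode(reading: Reading) -> str:
    # Structure goes in; a plain string comes out for transport.
    return json.dumps(asdict(reading))


def decode(text: str) -> Reading:
    # Restructure on arrival, rejecting anything that is not exactly a Reading.
    raw = json.loads(text)  # raises a ValueError subclass on malformed JSON
    if not isinstance(raw, dict) or set(raw) != {"sensor_id", "celsius"}:
        raise ValueError("unexpected shape")
    sensor_id, celsius = raw["sensor_id"], raw["celsius"]
    if type(sensor_id) is not int:
        raise ValueError("sensor_id must be an integer")
    if type(celsius) not in (int, float):
        raise ValueError("celsius must be a number")
    return Reading(sensor_id, float(celsius))


round_tripped = decode(encode(Reading(7, 21.5)))
```

The guard lives in `decode`. If a human, or an unguarded program, can write into that string, this check is the only thing standing between the data and the garbage.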

Do note that there are subtle variations on this. A user might use one program to render XML that they upload manually into another. In a case like this, the human isn’t the endpoint; they are just the transfer medium. It’s an open pathway, but still just between two programs. And you can close down that pathway by only strictly accepting XML with a very specific schema. Since you can verify that and reject garbage as necessary, it becomes closed.
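To give a feel for “strictly accepting”, here is a sketch using only Python’s standard library. It is a hand-written structural check, not real schema validation (the standard library has no XSD validator), and the `<contact>` format with `name` and `age` elements is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical upload format: <contact><name>...</name><age>...</age></contact>
EXPECTED_CHILDREN = ["name", "age"]


def accept_contact(xml_text: str) -> dict:
    """Return typed data, or raise ValueError for anything off-format."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc

    # Exactly the expected elements, in order, with no attributes or nesting.
    if root.tag != "contact" or root.attrib:
        raise ValueError("unexpected root element")
    if [child.tag for child in root] != EXPECTED_CHILDREN:
        raise ValueError("unexpected child elements")
    for child in root:
        if child.attrib or len(child):
            raise ValueError(f"unexpected content in <{child.tag}>")

    name = (root[0].text or "").strip()
    age_text = (root[1].text or "").strip()
    if not name:
        raise ValueError("name is required")
    if not (age_text.isascii() and age_text.isdigit()):
        raise ValueError("age must be a non-negative integer")
    return {"name": name, "age": int(age_text)}


contact = accept_contact("<contact><name>Ada</name><age>36</age></contact>")
```

Anything that doesn’t match is rejected outright instead of being “interpreted”, which is what makes the pathway closed again.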

The foundation of all software systems is to collect data. The full and complete structure of that data, as it relates to the real and digital worlds, is an essential part of that data.

There are extraordinary times when it makes sense to only sub-model some external data and live with the consequences of that choice. But the default should always be that if the program needs some data, and that data is key to its usage and computations, then it needs to be fully and correctly modelled. That is, it needs the right structures (not, for example, a tree shoved into a list), and it needs all of the individual fields in that data to be very strongly typed. The program needs this in order to make the correct choices with its instructions based on what is actually there. The data can’t be vague or ambiguous; computers are not smart enough to grok external context. They can only act on the very specific information that they have.
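The tree-in-a-list problem is easy to show. In this sketch the `Category` type and the product categories are invented for illustration; the point is only that the flat list has thrown away the parent and child relationships, so some questions can no longer be answered from the data at all.

```python
from dataclasses import dataclass, field
from typing import Optional

# A tree shoved into a list: the names survive, the relationships do not.
flat = ["Hardware", "Tools", "Hammers", "Paint"]


@dataclass
class Category:
    """Hypothetical tree node; the structure is part of the data."""
    name: str
    children: list["Category"] = field(default_factory=list)


def path_to(node: Category, target: str, prefix: tuple = ()) -> Optional[tuple]:
    # Depth-first search returning the chain of names from the root to target.
    here = prefix + (node.name,)
    if node.name == target:
        return here
    for child in node.children:
        found = path_to(child, target, here)
        if found is not None:
            return found
    return None


tree = Category("Hardware", [
    Category("Tools", [Category("Hammers")]),
    Category("Paint"),
])

hammers_path = path_to(tree, "Hammers")
```

With the tree, “what is Hammers filed under?” is a simple walk. With the flat list, the code would have to guess.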

It’s also worth noting that while parsing may look easy, and in some tiny instances it really is not that complicated, it should always be considered a hard problem to solve properly. As such, if you can avoid parsing, or at least push it to a human somewhere, then you will get far fewer bugs and the code will be more likely to behave as expected. If you do have to venture into parsing, then it is one of the coding areas where reading a boatload of stuff in advance will pay huge dividends. Trial and error with parsing is a massive bug generator.
