Friday, December 20, 2024

Datasource Precedence

A common problem in all code is juggling a lot of data.

If you are going to make your code base industrial-strength, one key part of that is moving any and all static data outside of the code.

That includes any strings, particularly since they are the bane of internationalization.

That also includes flags, counters, etc.

A really strong piece of code has no constant declarations. Not for logging, not for configuration, not even for user options. It still needs those constants, but they come from elsewhere.

This goal is really hard to achieve. But where it has been mostly achieved, you usually see the code lasting a lot longer in usage.

What this gives you is a boatload of data for configurations, as well as the domain data, user data, and system data. Lots of data.

We know how to handle domain data. You put it into a reliable data store, sometimes an RDBMS, or maybe NoSQL technology. We know that preventing it from being redundant is crucial.

But we want the same thing for the other data types too.

It’s just that while we may want an explicit map of all of the possible configuration parameters, in most mediums, whether that is in a file, in the environment, or at the cli, we usually want to amend only a partial set. Maybe the defaults for 10,000 parameters are fine, but on a specific machine we need to change two or three. This changes how we see and deal with configuration data.

What we should do instead is take the data model for the system to cover absolutely everything that is data. Treat all data the same, always.

Then we know that we can get it all from a bunch of different ‘sources’. All we have to do is establish a precedence and then merge on top.

For example, we have 100 weird network parameters that get shoved into calls. We put a default version of each parameter in a file. Then when we load that file, we go into the environment and see if there are any overriding changes, and then we look at the command line and see if there are any more overriding changes. We keep some type of hash of this mess; then, as we need the values, we simply do a hash lookup to get our hands on the final value.
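
A minimal sketch of that layering in Python, assuming the defaults live in a JSON file and the overrides reuse the same parameter names; the file name and parameter names here are made up for illustration:

import json
import os
import sys

DEFAULTS_FILE = "network_defaults.json"  # hypothetical file of default values

def load_config(argv=None):
    # 1. Start with the full set of defaults from the file.
    with open(DEFAULTS_FILE) as f:
        config = json.load(f)

    # 2. Overlay any environment variable that shares a parameter's name.
    for name in config:
        if name in os.environ:
            config[name] = os.environ[name]

    # 3. Overlay command-line overrides of the form name=value.
    for arg in (sys.argv[1:] if argv is None else argv):
        if "=" in arg:
            name, value = arg.split("=", 1)
            config[name] = value

    return config

# Later, code simply does a hash lookup to get the final, merged value.
config = load_config()
timeout = config.get("connect_timeout")  # hypothetical parameter name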

This means that a specific piece of code references the data only once, when it is needed, and that there are multiple loaders that write over each other in some form of precedence. With this, it is easy.

Load from file, load on top from env, load on top from cli. In some cases, we may want to load from a db too (there are fun production reasons why we might want this as the very last step).

We can validate the code because it uses a small number of parameters, all in the place they are needed. We can validate the default parameters. We can write code to scream if an env var or cli param exists but there is no matching default. And so forth. Well-structured, easy to understand, easy to test, and fairly clean. All good attributes for code.
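
One possible shape for that screaming, as a small sketch layered on top of the same hypothetical loader: any override whose name has no default gets reported instead of silently accepted.

def apply_overrides(config, overrides, source):
    # Refuse overrides that have no corresponding default; they are usually
    # typos, or parameters that no longer exist in the code.
    unknown = [name for name in overrides if name not in config]
    if unknown:
        raise ValueError(f"{source} sets unknown parameters: {', '.join(unknown)}")
    config.update(overrides)
    return config

# Hypothetical usage, applied in precedence order:
#   config = apply_overrides(defaults, env_overrides, "environment")
#   config = apply_overrides(config, cli_overrides, "command line")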

The fun part though is that we can get crazier. As I said, this applies to all ‘data’ in the system, so we can span it out over all other data sources, and tweak it to handle instances as well as values. In that sense, you can do something fun like use a cli command with args that fill in the raw domain data, so that you can do some pretty advanced testing. The possibilities are endless, but the code is still sane.
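
As a made-up illustration, here is a sketch where a repeatable command-line flag injects raw domain records that then get merged on top of the real sources; the --record flag and its JSON format are invented for the example, not part of any real tool.

import argparse
import json

def parse_test_records(argv=None):
    # Hypothetical flag: --record '{"id": 1, "name": "test customer"}'
    parser = argparse.ArgumentParser()
    parser.add_argument("--record", action="append", default=[],
                        help="raw domain record as JSON, repeatable")
    args = parser.parse_args(argv)
    return [json.loads(r) for r in args.record]

# These records get merged on top of whatever the real data sources supplied,
# using the same precedence rules as the configuration values.
test_records = parse_test_records(["--record", '{"id": 1, "name": "test customer"}'])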

More usefully, you can get some large-scale base domain data from one data source, then amend it with even more data from a bunch of other data sources. If you put checks on validity and drop the garbage, the system could use a whole range of different places to merge the data by precedence. Start with the weakest sources first, and use very strong validation. Then loosen up the validation and pile other sources on top. You’ll lose a bit of the determinism overall, but offset that with robustness. You could do it live, or it could be a constant background refresh. That’s fun when you have huge fragmented data issues.
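
A rough sketch of that kind of layering for domain records, assuming each source arrives as a dict keyed by id and brings its own validator; the sources and checks below are invented stand-ins:

def merge_by_precedence(sources):
    # Each source is a (records, validate) pair; weakest source comes first,
    # so later sources overwrite earlier ones wherever keys collide.
    merged = {}
    for records, validate in sources:
        for key, record in records.items():
            if validate(record):
                merged[key] = record  # this source wins for this key
            # otherwise the record is garbage: drop it and keep going
    return merged

def strict(record):
    return bool(record.get("name")) and bool(record.get("country"))

def loose(record):
    return bool(record.get("name"))

# Weak public feed checked strictly, trusted internal store checked loosely.
public_feed = {1: {"name": "ACME", "country": "CA"}, 2: {"name": ""}}
internal = {2: {"name": "Widget Co"}}

merged = merge_by_precedence([(public_feed, strict), (internal, loose)])
# merged is {1: {"name": "ACME", "country": "CA"}, 2: {"name": "Widget Co"}}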
