Thursday, January 19, 2023

Once

The number of times that you need a specific piece of data in a large organization is exactly “once”. You only need one writable copy of it. Not once per screen, not once per system, but just once for the entire company.

If you have more than one writable copy of the data, you have a huge problem.

If two different people make two different changes to two different copies of the data at the same time, there is no ‘general’ way to merge those changes together with software. There are specific instances where the changes are independent enough that an automatic merge is possible, but the general problem is unsolvable. You have to punt it to a human who can figure out the context and the consequences, then make the update.
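
To make that concrete, here is a minimal sketch of a three-way merge over two diverged copies of a record (the record, fields, and values are all hypothetical). Edits that touch different fields merge cleanly; a field that both copies changed differently can only be flagged, not resolved:

```python
# A minimal sketch of the merge problem. Two writable copies of the same
# hypothetical customer record diverge from a common base at the same time.

base   = {"name": "Acme Corp", "credit_limit": 50_000, "region": "East"}
copy_a = {"name": "Acme Corp", "credit_limit": 75_000, "region": "East"}  # sales raised the limit
copy_b = {"name": "Acme Corp", "credit_limit": 25_000, "region": "West"}  # risk cut it, and moved the region

def three_way_merge(base, a, b):
    """Merge two sets of edits against a common base; punt conflicts to a human."""
    merged, conflicts = {}, []
    for key in base:
        if a[key] == b[key]:
            merged[key] = a[key]      # both agree (or neither changed it)
        elif a[key] == base[key]:
            merged[key] = b[key]      # only b changed it: safe to take
        elif b[key] == base[key]:
            merged[key] = a[key]      # only a changed it: safe to take
        else:
            conflicts.append(key)     # both changed it differently: no general answer
    return merged, conflicts

merged, conflicts = three_way_merge(base, copy_a, copy_b)
print(merged)     # {'name': 'Acme Corp', 'region': 'West'} -- the independent edits
print(conflicts)  # ['credit_limit'] -- software can flag this, but not resolve it
```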

But we don’t want to engineer people into our software if we can avoid it. And we can.

Just keep the writable data once. You can spin off any number of read-only copies, basically as a form of caching or memoization, but have only one version of the data, anywhere in the organization, that can be changed.

For any of the read-only copies, you’ve handled them properly if it is always safe to delete them. That is, deleting one won’t lose data, it just triggers a space-time trade-off, nothing more.
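
A minimal sketch of that rule, using plain in-memory dicts as stand-ins for the primary store and one of its read-only caches:

```python
# A sketch of the one-writable-copy rule: one dict stands in for the primary
# store, another for a read-only cached copy.

primary = {}   # the ONE writable copy, for the whole organization
cache = {}     # a read-only copy; always safe to delete

def write(key, value):
    primary[key] = value      # every change goes to the primary, nowhere else
    cache.pop(key, None)      # drop any stale read-only copy

def read(key):
    if key not in cache:              # a missing or deleted cache entry just
        cache[key] = primary[key]     # costs a refetch: the space-time trade-off
    return cache[key]

write("cust:42", {"name": "Acme Corp"})
print(read("cust:42"))    # fills the cache
cache.clear()             # deleting every read-only copy loses no data
print(read("cust:42"))    # same answer, just refetched from the primary
```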

So, why don’t we do this?

The first reason is control. Programmers want total control over their systems, which includes control over their data. If they don’t have control, it can take a lot longer to do even simple things. If they do have control, and there is stupid politics, they can just fix their own data and let the rest of the organization be damned.

Crafting a crappy ETL to snag a copy of the data, then putting up some screens to let people fix its quality problems, is easy, even if it takes a bit of time. Setting things up so that the users know how to go to some other system’s data administrators and correct a serious data issue is actually hard, mostly because it is incredibly messy. It’s a lot of work, a lot of conversations, and a lot of internal agreements. So, instead of playing nicely with everyone else, a lot of people choose to go it alone.

The second reason is laziness. If someone else has the data, and it is in some other form, you’d have to spend a lot of time figuring it out. You also have to know how to access it, how to get support, and sometimes how to even understand what they are saying. That is often a massive amount of learning and research, not coding. You’d have to read their schema, figure out their naming conventions, work through their examples and special cases, etc. You’d have to deal with budget allocations, security permissions, and then proper support.

A third reason is politics. There are all sorts of territorial lines in a large company that you can’t easily cross. Different levels of management are very protective of their empires, which also means being protective of their data. Letting it leak out everywhere makes it harder to contain, so they put up walls to stop or slow down people wanting access. So, people silo instead.

A fourth reason is budgets, which is a slight variation on the territorial issues and involves services as well. If you put out a database for use by a large number of people in the organization, then there is a fair amount of service and support work that you now have to do on top of maintaining the data. If the funding model doesn’t make that worthwhile, then letting other people access your data is effectively taking a loss on it. You don’t get enough funds to cover the work that people need for access.

How do you fix this?

You make it easier for all of the ‘developers’ to get access to the primary copies of any and all of the corporate data types. You make it easy for them to just do a quick pull, and intermix that data with whatever they have. If it’s not easier than rolling their own version, they will mostly take the easiest route and roll it themselves.

You can’t make it easier by spinning up millions of microservices. That eventually doesn’t scale, because it’s hard to find that one-in-a-million service with the data you are looking for. As it grows and becomes even more of a disorganized mess, it gets even harder. If there is no explicit organization, then it is disorganized. Pretty much the primary benefit of microservices is that you don’t have to spend time organizing them. Oops.

The data lake philosophies share the same issue. Organizing a lot of data is hard, painful work, so we’ve spent decades coming up with clever new technologies that seemingly try to avoid spending any effort on organization. Just throw it all together and it will magically work out. It doesn’t; it never does.

So you need a well-organized, well-maintained “library” where developers can go and find out how to easily source some data in a few minutes. Minutes. You probably need some librarians too. If you get that, then the programmers will take it as the easiest route, the data won’t be copied everywhere, and sanity will prevail.
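
One possible shape for that library, sketched as software (every name and URL below is a hypothetical placeholder): a searchable catalog that maps each corporate data type to its single primary source, so the answer to “where do I get customers?” takes minutes, not months.

```python
# A sketch of the "library" idea: a catalog mapping each corporate data type
# to its one primary, writable source. All names and URLs are hypothetical.

CATALOG = {
    "customers": {"owner": "crm-team", "endpoint": "https://data.example.com/customers"},
    "invoices":  {"owner": "finance",  "endpoint": "https://data.example.com/invoices"},
}

def find_source(data_type: str) -> dict:
    """Answer, in minutes not months: where is the one writable copy of X?"""
    try:
        return CATALOG[data_type]
    except KeyError:
        raise LookupError(
            f"No primary source registered for {data_type!r}; "
            "ask the librarians before rolling your own copy."
        )

print(find_source("customers")["endpoint"])  # https://data.example.com/customers
```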

The priority of any IT department has to be finding ways to reduce the number of silos and copies of data. That needs to take precedence over pretty much everything else. If it isn’t more important than budgets, politics, process, etc., then any of those other issues will continue to derail it, and the organization will end up with dozens, if not hundreds, of copies of the same stuff littered everywhere. Chaos is expensive.
