Wednesday, February 3, 2021

Plumbing

By now it should not be hard to synchronize data between different software systems. We have the storage capacity, the CPU power, and there is a plethora of ways to connect stuff. Still, the situation in the trenches isn’t much better than it was in the 90s, which is sad.


XML was in many ways an attempt to send around self-documenting data. The grand idea was that you could get stuff from anywhere, and there was enough meta-information included that it could be automatically and seamlessly imported into your system. No need to write the same endless ETL code over and over again. Yeah.


But rather than mature into that niche, it got bogged down in complexity which made it hard to use. JSON took a few massive steps backward, but at least you could wire it up quickly.


One of the underlying problems is semantic meaning. If one system has a symbolic representation of part of a user’s identity in the real world, you have to attach a crazy amount of stuff to it when it is exported in order for a second system to be able to interpret it correctly, in a completely different context. Just knowing the bits, or grokking that they are byte encodings of particular glyphs, or even knowing that those glyphs collected together represent an identifier in a particular human language, is not enough. The implicit ‘model’ of the information, as it is mirrored from the real world in both systems, has to hold enough common information that a stable mapping between them is possible. And it’s probably some offbeat artifact of reality that the stability itself needs to be bi-directional.
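
To make that concrete, here is a toy example in Python; the systems, field names, and values are all made up, but they show how two perfectly valid representations of the same person share almost no surface structure:

```python
# Two systems holding a symbolic representation of the "same" person.
# The field names, codes, and structure below are invented for illustration.

crm_record = {
    "cust_name": "J. Smith",          # display string, locale unspecified
    "cust_id": "C-10443",             # key only meaningful inside the CRM
    "region": "EMEA",                 # sales territory, not a legal address
}

billing_record = {
    "account_holder": "Smith, John",  # a different formatting convention
    "account_no": 884210,             # a completely unrelated key space
    "country": "DE",                  # ISO code, a legal/billing concept
}

# Knowing that both values are UTF-8 strings, or even that they are names,
# is not enough to decide whether these two records describe the same
# real-world person; that mapping lives in the implicit models of each system.
```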


We could play with the idea that the two systems engage in a Q&A session. The second system keeps asking “what do you mean by xxxx?”, slowly descending the other system’s knowledge graph until it hits an ‘ah-ha’ moment. Oddly, if they were both event-based, then the conversation would only have to occur after one of them changed code or config, so not that frequently and probably not in haste.
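
As a rough sketch of that conversation, assuming a made-up graph shape and a made-up shared vocabulary, the descent might look something like this in Python:

```python
# A sketch of the Q&A: the requestor keeps asking "what do you mean by xxxx?"
# and walks down the publisher's knowledge graph until it hits a term both
# sides already share. The graph and the shared vocabulary are invented.

publisher_graph = {
    "customer": "party",          # "customer" is defined in terms of "party"
    "party": "legal_person",      # which is defined in terms of "legal_person"
    "legal_person": None,         # a root concept with no further definition
}

shared_vocabulary = {"legal_person"}  # where the 'ah-ha' moment can happen

def negotiate(term, graph, shared, depth=0):
    """Descend the publisher's graph until a mutually understood term is found."""
    if term in shared:
        return term               # 'ah-ha' moment
    if depth > 10 or term not in graph or graph[term] is None:
        return None               # unknown term, dead end, or we give up
    return negotiate(graph[term], graph, shared, depth + 1)

print(negotiate("customer", publisher_graph, shared_vocabulary))  # -> legal_person
```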


That, of course, would be a crazy amount of code. As well as implementing the system, you’d have to implement a malleable knowledge graph that spanned each and every line of code and a standard protocol to traverse it. It’s no surprise that we haven’t seen this type of code in the wild yet, and may never see it.


What if we went the other way? A system needing data contacts an intermediary and requests data. Another system contacts the same intermediary and offers to share data. That gives us two structural representations of a complex set of variables. Some map one to one, based on a mapping between the labels. Some have arbitrarily complex mappings that, in the worst case, could include state. If the intermediary system had a way of throwing up its hands, admitting defeat, and bothering a few human beings for intervention, that version of the problem might be solvable. Certainly, the export and import from the two systems are well constrained. The interface is fairly simple. And, over time, as the intermediary built up a sophisticated database of the different sources and requestors, the effort to connect would converge on being fully automatable.
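
A minimal sketch of that intermediary might look like the following, where every class and field name is an assumption rather than any real product’s API:

```python
# A minimal sketch of the intermediary: one side offers data, the other side
# requests it, and the mappings between them are filled in later by people.

from dataclasses import dataclass, field

@dataclass
class Offer:
    entity: str          # e.g. "customer"
    fields: list         # the publisher's own field names, verbatim
    contact: str         # an email, so a human can be bothered for intervention

@dataclass
class Request:
    entity: str
    fields: list         # the requestor's own field names, also verbatim
    contact: str

@dataclass
class Intermediary:
    offers: list = field(default_factory=list)
    requests: list = field(default_factory=list)
    mappings: dict = field(default_factory=dict)   # label-to-label or small computations

    def accept_offer(self, offer):
        self.offers.append(offer)

    def accept_request(self, request):
        self.requests.append(request)

    def needs_human_help(self):
        """Requests with no mapping yet: where X throws up its hands."""
        return [r for r in self.requests if r.entity not in self.mappings]
```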


If the request for data had the property that any imports were idempotent, with what amounts to a universal key for that specific entity, then the whole thing would also be fairly resilient to change. If the paradigm for the intermediary’s interface was a fancy to-do list based on incoming requests, one could prioritize the connectivity based on organizational needs.
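
Sketching those two properties, assuming every published record carries what amounts to a universal key, both the idempotent import and the prioritized to-do list are pretty small:

```python
# Imports become idempotent upserts keyed on (entity, universal key), and the
# pending requests form a to-do list ordered by organizational priority.

store = {}   # (entity, universal_key) -> latest record

def import_record(entity, universal_key, record):
    """Idempotent: replaying the same record any number of times is harmless."""
    store[(entity, universal_key)] = record

# The same record arriving twice changes nothing the second time.
import_record("customer", "C-10443", {"name": "J. Smith"})
import_record("customer", "C-10443", {"name": "J. Smith"})

todo = [
    {"entity": "invoice", "requested_by": "reporting", "priority": 3},
    {"entity": "customer", "requested_by": "billing", "priority": 1},
]
todo.sort(key=lambda item: item["priority"])   # connect what the organization needs first
```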


So, one programmer’s assignment would be to send all of the public entities to X. X might choose not to save them, since it isn’t a historical database, but it would accept anything and everything.
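
That publishing side might be little more than this sketch, assuming a hypothetical HTTP endpoint on X and an invented payload shape:

```python
# The first programmer's assignment: push every public entity to X.
# The endpoint URL, payload shape, and contact address are all invented.

import requests

X_PUBLISH_URL = "https://x.example/publish"    # hypothetical intermediary endpoint

def publish_entity(entity, universal_key, record):
    payload = {
        "entity": entity,
        "key": universal_key,
        "data": record,                        # the publisher's own schema, as-is
        "contact": "team-crm@example.com",     # included in every call
    }
    # X accepts anything and everything; whether it keeps it is X's business.
    requests.post(X_PUBLISH_URL, json=payload, timeout=5)

publish_entity("customer", "C-10443", {"cust_name": "J. Smith", "region": "EMEA"})
```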


Another programmer’s assignment would be to add some background task to get an entity from X. It might poll, if that works, or it might just register for an event and get data as it is received.
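
And the receiving side is a similarly small sketch, again assuming a hypothetical endpoint and response shape; registering for an event instead of polling would simply replace the loop:

```python
# The second programmer's assignment: a background task that polls X for an entity.

import time
import requests

X_FETCH_URL = "https://x.example/fetch"        # hypothetical intermediary endpoint

def poll_for(entity, handler):
    """Ask X for new instances of an entity and hand them to our own code."""
    while True:
        resp = requests.get(
            X_FETCH_URL,
            params={"entity": entity, "contact": "team-billing@example.com"},
            timeout=5,
        )
        for record in resp.json().get("records", []):
            handler(record)                    # our field names, our schema
        time.sleep(60)
```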


Neither programmer cares about the other’s schema or field names or anything else. They deal with what they need or already have. No fiddling.


In each call, they include a contact email. Then they wait.


A data administrator sees that there is a publisher that kinda matches one or more requestors. Kinda. They can contact either side to get more info or make suggestions. They have some ability to put small computations into the pipeline. They can contact others to sort through naming issues. At some point, hopefully with minimal changes to either side, they just flip a switch, and each instance of the entities gets translated. The receiving system springs to life.
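
What the administrator ends up configuring might be something like this sketch, where the structure, the names, and the ‘switch’ flag are all invented for illustration:

```python
# Label-to-label mappings, a small computation in the pipeline, and a switch.

mapping = {
    "entity": "customer",
    "enabled": False,                                   # the switch, off until agreed
    "fields": {
        "account_holder": "cust_name",                  # simple one-to-one mapping
        "display_name": lambda rec: rec["cust_name"].title(),  # small computation
    },
}

def translate(record, mapping):
    """Translate one instance of an entity from the publisher's shape to the requestor's."""
    if not mapping["enabled"]:
        return None                                     # nothing flows until the flip
    out = {}
    for target, source in mapping["fields"].items():
        out[target] = source(record) if callable(source) else record[source]
    return out

mapping["enabled"] = True                               # flip the switch
print(translate({"cust_name": "j. smith"}, mapping))    # the receiving system springs to life
```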


Obviously, they do this in a testing environment. Once it is agreed upon, the publishers go into production first; that is verified, then the receivers go live. The schedule for this work is fairly deterministic and estimable once agreement is reached in testing.


Ah, but what about ‘state’, you say? Yeah, that is a problem. Some of it is mitigated by keys and idempotency. One can utilize those to support historical requests.
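
As a sketch of that, assuming the keyed store from earlier and an invented ‘updated_at’ field, a historical request is just a re-delivery of records, and the idempotent imports make the duplicates harmless:

```python
# Replay keyed records changed since a given date; re-importing is safe.

store = {
    ("customer", "C-10443"): {"cust_name": "J. Smith", "updated_at": "2021-01-15"},
    ("customer", "C-10511"): {"cust_name": "A. Jones", "updated_at": "2021-02-01"},
}

def replay_since(entity, since, deliver):
    """Re-deliver every keyed record of an entity changed on or after a date."""
    for (ent, key), record in store.items():
        if ent == entity and record["updated_at"] >= since:
            deliver(key, record)

replay_since("customer", "2021-02-01", lambda key, record: print(key, record))
```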


Other state variables tend to be embedded in the data itself. And with an event-handling mechanism, you could have one system offer up a status to be consumed by another. Which segues nicely into error handling. The intermediary could fill in errors with a computation: say one system polls for the state of the other since a specific time. The response could be computed as ‘down’ if no heartbeat was published within that window.


We can do similar things with abstract variables, like computing response time between heartbeats. It opens up a lot of great ways to monitor stuff.
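
A sketch of those derived computations, assuming the publishers send timestamped heartbeat records to the intermediary, might be:

```python
# 'down' means no heartbeat inside the polling window; the gaps between
# heartbeats give a rough response-time measure.

from datetime import datetime

heartbeats = [
    datetime(2021, 2, 3, 10, 0, 0),
    datetime(2021, 2, 3, 10, 1, 2),
    datetime(2021, 2, 3, 10, 2, 1),
]

def status_since(window_start, beats):
    """Computed answer for a poller: 'down' if nothing was published in the window."""
    return "up" if any(beat >= window_start for beat in beats) else "down"

def gaps_between(beats):
    """An abstract variable: the time between consecutive heartbeats."""
    return [later - earlier for earlier, later in zip(beats, beats[1:])]

print(status_since(datetime(2021, 2, 3, 10, 1, 30), heartbeats))  # up
print(gaps_between(heartbeats))                                   # gaps of 62s and 59s
```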


A nice secondary attribute is that the intermediary system effectively documents the system interconnections. It can produce nice reports.


But it won’t scale? Yeah, sure it will. Not as one galactic intermediary system, but as many, many smaller, completely independent ones that could even be parallelized for big data flows. It won’t be super slow, but it’s not the mechanism one needs for real-time data anyway; that is usually custom code, and very expensive code, so you want to keep it to a minimum. It’s an 80/20 solution. If there are hundreds of these beasts running, you could still have one data management interface that binds them all together at an organizational level. Each instance has a synced persistent copy of only the data that it needs, not all of it.
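
As a sketch of that partitioning, assuming we route by entity name (purely an illustrative choice), each small instance only sees its own slice while one management view still binds them together:

```python
# Many small, independent intermediaries instead of one galactic one.

intermediaries = {
    "finance": "https://x-finance.example",
    "customers": "https://x-customers.example",
    "operations": "https://x-ops.example",
}

routing = {
    "invoice": "finance",
    "customer": "customers",
    "heartbeat": "operations",
}

def instance_for(entity):
    """Each instance only ever sees, and stores, the entities routed to it."""
    return intermediaries[routing[entity]]

def management_report():
    """One organizational view across all of the independent instances."""
    return [f"{entity} -> {name}" for entity, name in routing.items()]

print(instance_for("invoice"))      # https://x-finance.example
print(management_report())
```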
