Thursday, April 27, 2023

Transactional Integrity

It is important for computers to do exactly what they are told to do. They must always match the expectation of the users, even when there are other problems happening around them.

Getting any code to work is important, but it is only half the effort. The other half is reliable error handling. It turns out that is actually a very hard problem to tackle.

To really understand this we have to look carefully at a theoretical problem, known as the Two Generals Problem (TGP).

Basically, two generals are trying to communicate a time to attack, but they have unreliable communications between them. If either general attacks without the other, they will lose. So, obviously, they want to make sure they are coordinated.

There was a great deal of work done in the 70s and 80s on how to build semantics that guarantee this kind of coordination. We see it in RDBMSes with their one- or two-phase commit protocols. For people unfamiliar with that work, the semantics may initially seem a bit awkward, but they are built around a very difficult problem.

In TGP, there is a subtle little ambiguity. If one general sends a message to the other and doesn’t get a timely response, one of two very different things may have occurred. First, the initial message may have been intercepted or lost, so the other general doesn’t know when to attack. But it is also possible that the message made it, that the communication was successful, and it’s just the other general’s acknowledgment of receipt that was lost. In that case the second general will attack, but the first general doesn’t know that, so the attack is still in jeopardy.

Thus the ambiguity is that, because of a lack of ‘full’ communication, there are two very different possibilities, and it is entirely unknown which one has actually occurred. Did the message get received or not? Will it be acted on? With no acknowledgment coming back, either scenario is possible.

If we look deeply at this problem, and really at any sort of ambiguity, it is impossible to resolve without some further information. Worse, adding other sorts of information can only reduce the ambiguity; it never actually gets rid of it. For instance, the second general could wait for an acknowledgment of their acknowledgment before attacking. But that just quietly swaps the ambiguity back to them, and while it is a bit smaller it is still there. Maybe their acknowledgment didn’t make it, or the receipt of that didn’t make it back to them. They still can’t tell. Maybe there is a total block on any communications and both generals definitely shouldn’t attack. But if that is only a one-sided block, then one of the generals is still wrong, and we can’t ever know.

So, basically, you can minimize TGP, but it still has that ambiguity, and it will always be there in some shape or form.
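
To make that concrete, here is a minimal sketch of the message chain in Python. The function names and the model (messages alternate between generals A and B, and the chain stops at the first lost message) are assumptions made for illustration, not something taken from a real protocol. It shows that whoever sends the last message sees exactly the same thing whether that message arrived or not, no matter how many acknowledgments are added.

```python
def run_exchange(n_messages, lost=None):
    """Simulate a chain of alternating messages: A->B, B->A, A->B, ...

    `lost` is the index of the one message that goes missing (None means
    nothing is lost). Returns what each general received.
    """
    received = {"A": [], "B": []}
    for i in range(n_messages):
        receiver = "B" if i % 2 == 0 else "A"
        if i == lost:
            break  # nobody replies to a message they never got
        received[receiver].append(i)
    return received


def last_sender_view(n_messages, lost=None):
    """Everything the sender of the final message has seen."""
    last_sender = "A" if (n_messages - 1) % 2 == 0 else "B"
    return run_exchange(n_messages, lost)[last_sender]


def ambiguity_remains(n_messages):
    """True if the last sender cannot tell 'delivered' from 'lost'."""
    delivered = last_sender_view(n_messages, lost=None)
    dropped = last_sender_view(n_messages, lost=n_messages - 1)
    return delivered == dropped
```

Calling `ambiguity_remains(n)` for any chain length `n` of one or more returns `True` in this model: adding another round of acknowledgments only changes which general is left guessing.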

This plays out all of the time on a larger scale in software.

If communication is reliable 100% of the time, then there are no problems. But if it is only 99.9999% reliable, then there is at least some small ambiguity. If we need any two distinct sets of data to be in sync with each other and the communication is not 100% reliable, then there is always a circumstance where it can and probably will break.
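
A quick bit of arithmetic shows why "probably will break" is fair. The sketch below assumes each message fails independently with the same probability, which is a simplification that real networks do not strictly obey, and the function name is made up for illustration.

```python
import math


def chance_of_any_failure(per_message_reliability, message_count):
    """Probability that at least one of `message_count` messages fails.

    Assumes independent failures. Computes 1 - (1 - p)^n using
    log1p/expm1 so the result stays accurate for very small p.
    """
    p_fail = 1.0 - per_message_reliability
    return -math.expm1(message_count * math.log1p(-p_fail))
```

Under that assumption, `chance_of_any_failure(0.999999, 1_000_000)` comes out at roughly 0.63, so a system that sends a million messages at 99.9999% reliability is more likely than not to hit at least one failure.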

This is the core of any distributed programming, since once you are no longer just calling a function explicitly in the same ‘computing context’ (such as the same thread) there will be some ambiguities. Well, almost: we can use an ‘atomic lock’ somewhere to implement reliable communication protocols over an unreliable medium, but we won’t delve into that right now.

If we can’t lock atomically, then we have to implement protocols that at least minimize these windows of ambiguity and thus keep any transactional integrity bugs from occurring too frequently. With a bit of work and thinking, we can reduce the frequency to something tiny like “once in a million years” and thus be fairly sure that if a problem did occur in our lifetime, it would likely only be a one-off issue.

Getting back to the work of the 70s and 80s, if we implement a one- or two-phase commit, we can shrink the window accordingly. This is why in an RDBMS when you do some work, you ‘commit’ it afterward as an ‘independent’ second step.

The trick is to ‘bind’ that commit to any other commit on other dependent resources. That is, if you have to move data from one database to another, you do the work on one database, then do the work on the other database, then commit the first one, then the second one. That intertwines the two resources in the same protocol, reducing the window.
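
As a rough sketch of that ordering, the Python below uses the standard `sqlite3` module with two in-memory databases. The `items` table, the helper names, and the "move one row" task are all assumptions invented for the example. Both pieces of work happen first, and only then are the two commits issued back to back.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database with a small, hypothetical 'items' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO items (id, payload) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def count(conn):
    """Number of rows currently visible in 'items'."""
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


def move_item(src, dst, item_id):
    """Move one row from src to dst with the two commits bound together."""
    try:
        row = src.execute(
            "SELECT id, payload FROM items WHERE id = ?", (item_id,)
        ).fetchone()
        if row is None:
            raise LookupError(f"no item with id {item_id}")
        # Work on the first database, then on the second; nothing is committed yet.
        src.execute("DELETE FROM items WHERE id = ?", (item_id,))
        dst.execute("INSERT INTO items (id, payload) VALUES (?, ?)", row)
        # Only now commit, one right after the other. A small window still
        # exists between these two lines; it is reduced, not eliminated.
        src.commit()
        dst.commit()
    except Exception:
        # Any failure before the commits undoes the work on both sides.
        src.rollback()
        dst.rollback()
        raise
```

If the insert on the second database fails, for example because the id already exists there, the delete on the first database is rolled back too and the row stays where it was.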

There could still be problems with the commits themselves, since they can go wrong too. So you add in a ‘prepare’ phase (the “two” in two-phase commit) that does almost everything possible other than the tiniest bit of effort necessary to turn it all on. Then the transaction does the work in two places, then the ‘prepare’ in two places, and finally the ‘commit’ for both.

If during all of this fiddling, anything at all goes wrong, then all of the work completed so far is rolled back.
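
The work, prepare, commit, and rollback steps described above can be sketched with a toy in-memory coordinator. Everything here (the `Participant` class, its methods, the `fail_on_prepare` switch) is hypothetical and stands in for real resources such as databases; a real implementation would also have to record the prepared state durably and cope with a commit that itself fails.

```python
class Participant:
    """A stand-in for one resource taking part in the transaction."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.committed = {}   # data that is 'turned on' and visible
        self.staged = None    # work done but not yet committed
        self.prepared = False

    def do_work(self, key, value):
        self.staged = (key, value)

    def prepare(self):
        # Do everything except the final switch; this is where most
        # failures should surface.
        if self.fail_on_prepare:
            raise RuntimeError(f"{self.name} cannot prepare")
        self.prepared = True

    def commit(self):
        # The tiny remaining step that makes the staged work visible.
        key, value = self.staged
        self.committed[key] = value
        self.staged = None
        self.prepared = False

    def rollback(self):
        self.staged = None
        self.prepared = False


def two_phase_commit(participants, key, value):
    """Return True if every participant committed, False if all rolled back."""
    try:
        for p in participants:
            p.do_work(key, value)
        for p in participants:
            p.prepare()
    except Exception:
        for p in participants:
            p.rollback()
        return False
    for p in participants:
        p.commit()
    return True
```

With two healthy participants, `two_phase_commit` returns `True` and both hold the new value. If either one is created with `fail_on_prepare=True`, it returns `False` and neither holds it.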

This will result in “mostly” only two circumstances occurring. Either the work is “entirely completed” or “none of the work was completed”. There is no middle ground where some of the work was completed, as that would result in a mess. It is a strict either-or circumstance. All or nothing. All of the transactions were done, or none of the transactions were done. But we need to keep in mind that, particularly with the second case, there is still a very tiny window where that is not correct. Maybe some of the transactions were done and you have absolutely no way of knowing that, but if that can only occur once in a million years, then you can just ignore it, mostly.

As might be obvious from the above, any sort of ambiguity in the digital world is a big problem. But you can minimize it if you are careful and correctly wire up the right semantics. It is best to leave this sort of problem to be encapsulated in a technology like an RDBMS, but knowing how to use it properly from above is crucial. You don’t call one database and do all of the work and commit, and then call the other one. That invalidates your transactional integrity. You have to make sure both efforts are intertwined, everywhere, going right back to the user’s endpoint of initialization. If you do that, then mostly, the code will always do what everyone expects it to do. If you ignore this problem, then some days will be very bad days.
