Sunday, July 18, 2021

Idempotent

When I was very young, a way more experienced programmer told me that for incoming feeds, basically any ‘imports’, they must always be ‘idempotent’. 

In the sense that he meant it, if you have some code to import a collection of data, say in a csv file, you can feel safe and confident rerunning that import as many times as you need to. At worst, it just wastes some CPU.


That feed will do exactly the same thing each time. Well, almost. The result of running it will mean that the data in the file is now in the database, but it is there uniquely. 


If the feed was run before, that earlier data will be updated or it will be ‘deleted and reinserted’. It will not be duplicated.


The corollary is that if the data changed in the file from an earlier version, it is that later data that will persist now. This provides a secondary feature of being able to use the feed to ensure that the data is always up-to-date and correct without first having to verify that it is not. If in doubt, rerun it.


If you have a big system, with a lot of data loaded over years, this means that if you kept all of the data input files, you could replay all of them to thus ensure that the outcome would be clean data. 


But It also means that if someone had edited the data manually, directly in the database, then replaying the feed would overwrite that change. A behavior that is sometimes good, sometimes bad.


To get around that ambiguity, it’s best to understand that any data coming in from an external source should be viewed as ‘read-only’ data by definition. It’s somebody else’s data, you just have a ‘copy’ of it. If there were ‘edits’ to be made to that data, they should be made externally to the system (the source does it or you might edit the feed files, but not the database itself). 


If it is necessary to annotate this incoming data, those modifications would be held separately in the schema, so that replaying the original data would not wipe them out. That doesn’t mean that it isn’t now inconsistent (the new data nullifies the annotations), but it does ensure that all of the work is preserved.


An important part of getting feeds to be idempotent is being able to understand uniqueness and keys. If the data is poorly keyed, it will inevitably become duplicated. So, making sure that the internal and external keys are handled properly is vital to assuring idempotency.


Oddly, the mathematical definition of idempotent is only that repeated ‘computations’ must result in the same output. Some people seem to be interpreting that as having any subsequent run of the feed be nil. It does nothing and now ignores the data. That’s a kind of a landmine, in that it means that fixing data is now a 2 step process. You have to first delete stuff, then rerun the feed. That tends to choke in that deleting things can be a little trickier than performing updates, they can be hard to find, so it’s often been far more effective to set the code to utilize ‘upserts’ (an insert or a delete depending on if the data is already there) as a means of replying the same computation. Technically, that means the first run is actually slightly different from the following ones. Initially, it inserts the data, afterward, it keeps updating it. So, the code is doing something different on the second attempt, but the same for all of the following attempts...


What I think that veteran programmer meant was not that the ‘code paths’ were always identical, rather it was that the final state of the system was always identical if and only if the state of the inputs were identical. That is, the system converges on the same results with respect to the data, even if the code itself wobbles a little bit.


Since we had that talk, decades ago, that has always been my definition of ‘idempotent’. The hard part that I think disrupts people is that by building things this way you are explicitly giving control of the data to an external system. It changes when they decide it changes. That seems to bother many programmers, it’s their data now, they don’t want someone else to overwrite it. But,  if you accept that what you have is only a read-only copy of someone else’s data, then those objections go away.