Sunday, July 18, 2021


When I was very young, a far more experienced programmer told me that incoming feeds, basically any ‘imports’, must always be ‘idempotent’.

In the sense that he meant it, if you have some code to import a collection of data, say in a CSV file, you can feel safe and confident rerunning that import as many times as you need to. At worst, it just wastes some CPU.

That feed will do exactly the same thing each time. Well, almost. The result of running it is that the data from the file is now in the database, and it is there uniquely.

If the feed was run before, that earlier data will be updated or it will be ‘deleted and reinserted’. It will not be duplicated.

The corollary is that if the data changed in the file from an earlier version, it is that later data that will persist now. This provides a secondary feature of being able to use the feed to ensure that the data is always up-to-date and correct without first having to verify that it is not. If in doubt, rerun it.

If you have a big system, with a lot of data loaded over years, this means that if you kept all of the data input files, you could replay all of them to thus ensure that the outcome would be clean data. 

But it also means that if someone had edited the data manually, directly in the database, then replaying the feed would overwrite that change. A behavior that is sometimes good, sometimes bad.

To get around that ambiguity, it’s best to understand that any data coming in from an external source should be viewed as ‘read-only’ data by definition. It’s somebody else’s data; you just have a ‘copy’ of it. If there were ‘edits’ to be made to that data, they should be made externally to the system (the source does it, or you might edit the feed files, but not the database itself).

If it is necessary to annotate this incoming data, those modifications would be held separately in the schema, so that replaying the original data would not wipe them out. That doesn’t mean that it isn’t now inconsistent (the new data nullifies the annotations), but it does ensure that all of the work is preserved.
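A tiny sketch of that separation, using SQLite; the table and column names here are invented for illustration. The feed data lives in its own replaceable table, the annotations live beside it, and the two are only joined back together on the external key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- read-only copy of someone else's data; replaced on each replay
    CREATE TABLE feed_products (ext_id TEXT PRIMARY KEY, price REAL);
    -- our annotations live beside the feed data, never inside it
    CREATE TABLE product_notes (ext_id TEXT PRIMARY KEY, note TEXT);
""")

conn.execute("INSERT INTO feed_products VALUES ('p1', 9.99)")
conn.execute("INSERT INTO product_notes VALUES ('p1', 'verified by ops')")

# Replaying the feed wipes and reloads only the feed table...
conn.execute("DELETE FROM feed_products")
conn.execute("INSERT INTO feed_products VALUES ('p1', 10.49)")

# ...while the annotation survives, joined back on the external key.
row = conn.execute("""
    SELECT p.price, n.note
    FROM feed_products p LEFT JOIN product_notes n USING (ext_id)
""").fetchone()
```

The price reflects the replayed feed, but the locally added note is untouched, because the replay never writes to the annotation table.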

An important part of getting feeds to be idempotent is being able to understand uniqueness and keys. If the data is poorly keyed, it will inevitably become duplicated. So, making sure that the internal and external keys are handled properly is vital to assuring idempotency.

Oddly, the mathematical definition of idempotent is only that repeated ‘computations’ must result in the same output. Some people seem to interpret that as having any subsequent run of the feed be a no-op: it does nothing and now ignores the data. That’s a kind of landmine, in that it means that fixing data is now a two-step process. You have to first delete stuff, then rerun the feed. That tends to choke because deleting things can be a little trickier than performing updates; the rows can be hard to find. So it’s often been far more effective to have the code utilize ‘upserts’ (an insert or an update, depending on whether the data is already there) as a means of replaying the same computation. Technically, that means the first run is actually slightly different from the following ones. Initially, it inserts the data; afterward, it keeps updating it. So, the code is doing something different on the second attempt, but the same for all of the following attempts...
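A minimal sketch of that upsert style, using SQLite (whose `ON CONFLICT` upsert syntax needs version 3.24 or later, bundled with Python 3.7+ in most builds); the table name, columns, and `ext_id` key are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        ext_id TEXT PRIMARY KEY,   -- the external source's key makes reruns safe
        name   TEXT,
        city   TEXT
    )
""")

def import_feed(rows):
    # Insert-or-update on the external key: the first run inserts,
    # later runs update in place, and nothing is ever duplicated.
    with conn:
        conn.executemany("""
            INSERT INTO customers (ext_id, name, city)
            VALUES (:ext_id, :name, :city)
            ON CONFLICT(ext_id) DO UPDATE SET
                name = excluded.name,
                city = excluded.city
        """, rows)

feed = [{"ext_id": "c1", "name": "Ada", "city": "London"}]
import_feed(feed)
import_feed(feed)  # rerunning is harmless: still exactly one row
```

If the file is later corrected, say the city changes, replaying the feed converges the database on the new values, which is exactly the “if in doubt, rerun it” property.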

What I think that veteran programmer meant was not that the ‘code paths’ were always identical, rather it was that the final state of the system was always identical if and only if the state of the inputs was identical. That is, the system converges on the same results with respect to the data, even if the code itself wobbles a little bit.

Since we had that talk, decades ago, that has always been my definition of ‘idempotent’. The hard part that I think disrupts people is that by building things this way you are explicitly giving control of the data to an external system. It changes when they decide it changes. That seems to bother many programmers; it’s their data now, they don’t want someone else to overwrite it. But, if you accept that what you have is only a read-only copy of someone else’s data, then those objections go away.

Sunday, July 11, 2021

Time Consuming

For most development projects, most of the time to build them ends up in just two places:

  • Arranging and wiring up widgets for the screens

  • Pulling a reliable window of persistent data into the runtime

The third most expensive place is usually ETL code to import data. Constantly changing “static” reports often fall into the fourth spot.

Systems with deep data (billions of rows, all similar in type) have easier ETL coding issues than systems with broad data (hundreds or even thousands of entities). 

Reporting can often be off-loaded onto some other ‘reporting system’, generally handled outside of the current development process. That works way better when the persistent schema isn’t a mess.

So, if we are looking at a greenfield project, whose ‘brute force’ equivalent is hundreds or thousands of screens sitting on a broad schema, then the ‘final’ work might weigh in as high as 1M lines of code. If you had reasonably good coders averaging 20K lines per year, that’s like 50 person-years to get to maturity, which is a substantial amount of time and effort. And that doesn’t include all of the dead ends that people will go down, as they meander around building the final stuff.

When you view it that way, it’s pretty obvious that it would be a whole lot smarter to not just default to brute force. Rather, instead of allowing people to spend weeks or months on each screen, you work really hard to bring it down to days or even hours of effort. The same is true in the backend. It’s a big schema, which means tonnes of redundant boilerplate code just to make getting at specific subsets of the data convenient.

What if neither of these massive work efforts is actually necessary? You could build something that lays out each screen in a readable text file. You could have a format for describing data in minimalistic terms, that generates both the code and the necessary database configurations. If you could bring that 1M behemoth down to 150K, you could shave down the effort to just 7.5 person-years, a better than 6x reduction. If you offload the ETL and Reporting requirements to other tools, you could possibly reach maturity in 1/6th of the time that anyone else will get there. 
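As a toy illustration of the first half of that idea (the spec format and the generated HTML here are entirely invented), a few lines of readable description can expand into the repetitive widget code that would otherwise be written by hand, screen after screen:

```python
# Hypothetical minimal screen description: one line per field,
# giving its name, type, and optional flags.
SCREEN_SPEC = """\
screen: customer
  field: name  text  required
  field: city  text
  field: age   number
"""

def generate_form(spec):
    """Expand a tiny screen spec into form markup."""
    lines = [l.strip() for l in spec.splitlines() if l.strip()]
    title = lines[0].split(":")[1].strip()
    out = [f'<form id="{title}">']
    for line in lines[1:]:
        # parts: ['field:', name, type, *flags]
        parts = line.split()
        name, ftype = parts[1], parts[2]
        required = " required" if "required" in parts[3:] else ""
        out.append(
            f'  <label>{name}<input name="{name}" '
            f'type="{ftype}"{required}></label>'
        )
    out.append("</form>")
    return "\n".join(out)

html = generate_form(SCREEN_SPEC)
```

A real generator would handle layout, validation, and data binding too, but the leverage is the same: each new screen costs a handful of spec lines instead of a page of widget wiring.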

Oddly, the above is not a pipe dream. It’s been done by a lot of products over the decades; it is a more common approach to development than our industry promotes. Sure, it’s harder, but it’s not that much riskier given that starting any new development work is always high risk anyways. And the payoff is massive.

So, why are you manually coding the screens? Why are you manually coding the persistence? Why not spend the time learning about all of the ways people have found over the last 50 years to work smarter in these areas? 

There are lots of notions out there that libraries and frameworks will help with these issues. The problem is that at some point the rubber has to hit the road, that is, all of the degrees of freedom have to be filled in with something specific. When building code for general purposes, the more degrees of freedom you plug up, the more everyone will describe it as ‘opinionated’. So, there is a real incentive to keep as many degrees open as possible, but the final workload is proportional to them. 

The other big problem is that once the code has decided on an architecture, approach, or philosophy, that can’t be easily changed if lots of other people are using the code. It is a huge disruption for them. But it’s an extraordinarily difficult task to pick the correct approach out of a hat, without fully knowing where the whole thing will end up. Nearly impossible, really. If you built the code for yourself, and you realized that you were wrong, you could just bite the bullet and fix it everywhere. If it’s a public library, the authors have more pressure to not fix it than to actually correct any problems. So, the flaws of the initial construction tend to propagate throughout the work, getting worse and worse as time goes by. And there will be flaws, unless the authors keep rewriting the same things over and over again.

What that implies is that the code you write, if you are willing to learn from it, will improve. The other code you depend upon still grows, but its core has a tendency to get permanently locked up. If it was nearly perfect, then that’s not really a big problem, but if it was rushed into existence by programmers without enough experience, then it has a relatively short life span. Too short to be usable.

Bad or misused libraries and frameworks can account for an awful lot of gyrations in your code, which can add up quickly and get into the top 5 areas of code bloat. If it doesn’t fit tightly, it will end up costing a lot. If it doesn’t eliminate enough degrees of freedom, then it’s just extra work on top of what you already had to do. Grabbing somebody else’s code seems like it might be faster, but in a lot of cases it ends up eating way more time.  

Sunday, July 4, 2021


For me, when I am coding for a big system, my priorities are pretty simple:

  1. Readability

  2. Reusability

  3. Resources

It’s easy to see why this helps to produce better quality for any system.


At some point, since the ground is always shifting for all code, either it dies an early death or it will need to be changed. If it’s an unreadable ball of syntactic noise, its fate is clear. If the naming convention is odd or blurry, its fate is also clear. If it’s a tangle of unrelated stuff, its fate is clear as well. A big part of the quality of any code is its ability to survive and continue to be useful to lots of people. Coding as a performance art is not the same as building usable stuff.

Deliberately making it hard to read is ensuring its premature demise. Accidentally making it hard to read isn’t any better.

Cutesy little syntactic tricks may seem great at the time, but it is far better to hold to the ‘3am rule’. That is, if they wake you up in the middle of the night for something critical, can you actually make sense of the code? You’re half asleep, your brain is fried, and you haven’t looked at this stuff for over a year. They give you a loose description of the behavior, so your code is only good if a) you can get really close to the problem quickly, and b) the reason for the bug is quite obvious in the code. The first issue is architecture, the second is readability.

Another way of viewing readability is the number of people who won’t struggle to understand what your code is doing. So, if it's just you, then it is 1. It is a lot higher if it’s any other senior programmer out there, and even higher if it’s anyone with a basic coding background. Keep in mind that to really be readable someone needs to understand both the technical mechanics and the domain objectives.


If the code is readable, then why retype the same code, over and over again, 10 or even 100 times? That would be insane. A large part of programming is figuring out how NOT to retype the same code again. To really build things as fast as possible you have to encounter a problem, find a reasonable working solution, get that into the codebase and move on. If you can’t leverage those previous efforts, you can’t move on up to solving bigger and more interesting problems. You're stuck resolving the same trivial issues forever.

As you build up more and more reusable components, they will get more and more sophisticated. They will solve larger chunks of the problem space. The project then gets more interesting as it ages, not more unstable. It’s an active part of making your own work better as well as making the system do more. Finding and using new libraries or frameworks is the fool's gold equivalent since they almost never fit cleanly into the existing work; it just makes the mess worse. What you need is a tight set of components that all fit together nicely, and that are specific to your system’s problem domain.

Detachment: Sometimes in order to get better reusability you have to detach somewhat from the specific problem and solve a whole set of them together. Adding a strong abstraction helps extend the reusability, but it can really hurt the readability (which is why people are afraid of it). Thus to maintain the readability, you have to make sure the abstraction is simple and obvious, and provide relevant comments on how it maps back to the domain. Someone should be able to cleanly understand the abstraction and should be able to infer that it is doing what is expected, even if they don’t quite know the mapping. If the abstraction is correct and consistent, then any fix to its mechanics will change the mapping to be more in line with the real issues.
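A small sketch of the kind of detachment meant here; the domain mapping (trades, desks, notionals) is invented for illustration, and lives in comments so the abstraction stays readable:

```python
from collections import defaultdict

def rollup(rows, key_fn, value_fn, combine):
    """Generic rollup: group rows by a key and combine their values.

    This solves a whole set of grouping problems at once, rather
    than one specific report.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(value_fn(row))
    return {k: combine(vs) for k, vs in groups.items()}

# Mapping back to a (hypothetical) domain:
#   rows     -> today's trades
#   key_fn   -> the desk a trade belongs to
#   value_fn -> the trade's notional amount
#   combine  -> sum, giving per-desk exposure
trades = [
    {"desk": "rates", "notional": 100},
    {"desk": "fx",    "notional": 50},
    {"desk": "rates", "notional": 25},
]
exposure = rollup(trades, lambda t: t["desk"], lambda t: t["notional"], sum)
```

The abstraction is simple enough to verify on its own, and any fix to its mechanics automatically flows through to every domain use of it.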


The ancient adage “Premature optimization is the root of all evil” is correct. Yes, it is way better that the code minimizes all of its resource usage including space, time, network, and disk storage. But if it's an unreadable mess or it’s all grossly redundant, who cares? It’s only after you master the other two priorities that you should focus on resources.

It’s also worth pointing out that fixing self-inflicted deoptimizations caused by bad coding habits is not the same thing as optimizing code. In one case, bad practices callously waste resources; in the other, you have to work hard to find smarter ways to achieve the same results.


If you focus on getting the code readable, it’s not difficult to find ways to encapsulate it into reusable components. Then, after all of that, you can spend some time minimizing resource usage. In that order, each priority flows nicely into the next one, and they don’t conflict with each other. Out of order they interact badly and contribute to the code being a huge mess.