Thursday, April 27, 2023

Transactional Integrity

It is important for computers to do exactly what they are told to do. They must always match the expectations of their users, even when there are other problems happening around them.

Getting any code to work is important, but it is only half the effort. The other half is reliable error handling. It turns out that is actually a very hard problem to tackle.

To really understand this, we have to look carefully at a theoretical problem known as the Two Generals Problem (TGP).

Basically, two generals are trying to communicate a time to attack, but they have unreliable communications between them. If either general attacks without the other, they will lose. So, obviously, they want to make sure they are coordinated.

There was a great deal of work done in the 70s and 80s on how to build semantics in order to guarantee their coordination. We see it in RDBMSes with their one- or two-phase commit protocols. For people unfamiliar with that work, the semantics may initially seem a bit awkward, but they are based around a very difficult problem.

In TGP, there is a subtle little ambiguity. If one general sends a message to the other general, and they don’t get a timely response, one of two very different things may have occurred. First, the initial message may have been intercepted or lost. So the other general doesn’t know when to attack. But, it is also true that the message may have made it -- the communication may have been successful -- it’s just that the other general’s acknowledgment of receipt may have been lost. So, now that general will attack, but the first general doesn’t know that, so the attack time is still in jeopardy.

Thus the ambiguity: because of a lack of ‘full’ communication, there are two very different possibilities, and there is no way to know which one has actually occurred. Did the message get received or not? Will it be acted on? With no acknowledgment coming back, either scenario is possible.

If we look deeply at this problem, or really at any sort of ambiguity, it is impossible to resolve without some further information. Worse, by adding other sorts of information you can only reduce the ambiguity; you’ll never actually get rid of it. For instance, the second general could wait for an acknowledgment of their acknowledgment before attacking. But that just quietly swaps the ambiguity back onto them, and while it is a bit smaller, it is still there. Maybe their acknowledgment didn’t make it, or the receipt of that didn’t make it back to them. They still can’t tell. Maybe there is a total block on any communications and both generals definitely shouldn’t attack. But if that block is one-sided, then one of the generals is still wrong, and we can’t ever know.
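
To make that concrete, here is a tiny sketch in Go (the channel setup, the 30% loss rate, and all of the names are invented purely for illustration): the first general sends the attack time over a lossy link and waits for an acknowledgment, and on a timeout it simply cannot tell which leg of the exchange failed.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// lossy simulates an unreliable link: each delivery may silently vanish.
func lossy(in <-chan string, out chan<- string) {
	for msg := range in {
		if rand.Float64() < 0.3 { // roughly 30% of deliveries are lost
			continue
		}
		out <- msg
	}
}

func main() {
	toGeneral2 := make(chan string, 1)
	g2Inbox := make(chan string, 1)
	toGeneral1 := make(chan string, 1)
	g1Inbox := make(chan string, 1)

	go lossy(toGeneral2, g2Inbox)
	go lossy(toGeneral1, g1Inbox)

	// General 2: acknowledge anything that arrives.
	go func() {
		for msg := range g2Inbox {
			fmt.Println("general 2 received:", msg)
			toGeneral1 <- "ack: " + msg
		}
	}()

	// General 1: send the attack time and wait for an acknowledgment.
	toGeneral2 <- "attack at dawn"
	select {
	case ack := <-g1Inbox:
		fmt.Println("general 1 received:", ack)
	case <-time.After(100 * time.Millisecond):
		// The ambiguity: was the message lost, or just the ack?
		fmt.Println("general 1 timed out, cannot tell which leg failed")
	}
}
```

Run it a few times and you will occasionally see the dangerous case: general 2 prints that it received the order, yet general 1 still times out.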

So, basically, you can minimize TGP, but it still has that ambiguity, and it will always be there in some shape or form.

This plays out all of the time on a larger scale in software.

If there is reliable communication, always 100%, then there are no problems. But if the communication is only 99.9999% reliable, then there is at least some small ambiguity. If we need any two distinct sets of data to be in sync with each other and the communication is not 100%, then there is always a circumstance where it can and probably will break.

This is at the core of any distributed programming, since once you are no longer just calling a function explicitly in the same ‘computing context’ (such as the same thread) there will be some ambiguities. Well, almost, because we can use an ‘atomic lock’ somewhere to implement reliable communication protocols over an unreliable medium, but we won’t delve into that right now.

If we can’t lock atomically, then we have to implement protocols that only minimize these windows of ambiguity, and thus at least keep any transactional integrity bugs from occurring too frequently. With a bit of work and thinking, we can reduce the frequency to something tiny like “once in a million years” and thus be fairly sure that if a problem did occur in our lifetime, it would likely only be a one-off issue.

Getting back to the work of the 70s and 80s, if we implement a one- or two-phase commit, we can shrink the window accordingly. This is why in an RDBMS, when you do some work, you ‘commit’ it afterward as an ‘independent’ second step.

The trick is to ‘bind’ that commit to any other commit on other dependent resources. That is, if you have to move data from one database to another, you do the work on one database, then do the work on the other database, then commit the first one, then the second one. That intertwines the two resources in the same protocol, reducing the window.

There still could be problems with the commits themselves; they can go wrong. So you add in a ‘prepare’ phase (the “two” in two-phase commit) that does almost everything possible other than the tiniest bit of effort necessary to turn it all on. Then the transaction does the work in two places, then the ‘prepare’ in two places, and finally the ‘commit’ for both.

If during all of this fiddling, anything at all goes wrong, then all of the work completed so far is rolled back.
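
As a rough sketch of that shape, the coordinator side might look like the following. The Resource interface and its method names are hypothetical, not any particular database API; the point is purely the ordering: do the work everywhere, prepare everywhere, and only then commit, rolling everything back the moment something fails early.

```go
package main

import "fmt"

// Resource is a hypothetical participant in the transaction, e.g. a
// connection to one of the two databases. The method names are
// illustrative, not any particular vendor's API.
type Resource interface {
	DoWork() error  // make the changes, but nothing is permanent yet
	Prepare() error // do everything except the tiny final switch-over
	Commit() error  // the last little bit that makes it all real
	Rollback()      // undo whatever has been done so far
}

// transact runs the work, prepare, and commit phases across all of
// the resources, rolling everything back if anything fails early.
func transact(resources ...Resource) error {
	rollbackAll := func() {
		for _, r := range resources {
			r.Rollback()
		}
	}

	// The actual work, on every resource.
	for _, r := range resources {
		if err := r.DoWork(); err != nil {
			rollbackAll()
			return fmt.Errorf("work failed: %w", err)
		}
	}

	// Phase one: prepare everywhere; any failure aborts everything.
	for _, r := range resources {
		if err := r.Prepare(); err != nil {
			rollbackAll()
			return fmt.Errorf("prepare failed: %w", err)
		}
	}

	// Phase two: commit. The window of ambiguity is now as small as
	// we can make it, but it is not zero: a crash partway through
	// this loop leaves some resources committed and some not.
	for _, r := range resources {
		if err := r.Commit(); err != nil {
			return fmt.Errorf("commit failed after prepare: %w", err)
		}
	}
	return nil
}

// fakeResource always succeeds; it exists only to show the ordering.
type fakeResource struct{ name string }

func (f fakeResource) DoWork() error  { fmt.Println(f.name, "work"); return nil }
func (f fakeResource) Prepare() error { fmt.Println(f.name, "prepare"); return nil }
func (f fakeResource) Commit() error  { fmt.Println(f.name, "commit"); return nil }
func (f fakeResource) Rollback()      { fmt.Println(f.name, "rollback") }

func main() {
	fmt.Println("result:", transact(fakeResource{"db1"}, fakeResource{"db2"}))
}
```

The comment on the final loop is the same caveat as above: the protocol shrinks the window of ambiguity, it does not eliminate it.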

This will result in “mostly” only two circumstances occurring. Either the work is “entirely completed” or “none of the work was completed”. There is no middle ground where some of the work was completed, as that would result in a mess. It is a strict either-or circumstance. All or nothing. All of the transactions were done, or none of the transactions were done. But we need to keep in mind, particularly with the second case, that there is still a very tiny window where that is not correct. Maybe some of the transactions were done and you have absolutely no way of knowing that, but if that can only occur once in a million years, then you can just ignore it, mostly.

As might be obvious from the above, any sort of ambiguity in the digital world is a big problem. But you can minimize it if you are careful and correctly wire up the right semantics. It is best to leave this sort of problem encapsulated in a technology like an RDBMS, but knowing how to use it properly from above is crucial. You don’t call one database, do all of the work, and commit, and then call the other one. That invalidates your transactional integrity. You have to make sure both efforts are intertwined, everywhere, going right back to the user’s endpoint of initialization. If you do that, then the code will almost always do what everyone expects it to do. If you ignore this problem, then some days will be very bad days.

Thursday, April 20, 2023

Software Development Ethics

Computers can do great things for people, but like any other tool, they can also be used for evil.

So we, as software developers, need an ethical code.

Basically, it is a commitment that we will not build evil things for evil people. That is, when we sit down to build some software, we always do so with ethical standards. If someone asks us to violate those standards, we refuse or we walk away.

For this to work, the ethics have to be simple and clear.

First, no matter how many instructions are executed for any type of automation, the code must always be triggered by humans. Not someone, or anyone, but the actual people involved.

So, for in-house batch jobs, they are started and stopped by operations personnel. Someone also has to explicitly schedule them to run at a frequency. Those people are the ones that ‘operate’ the software and provide it as a service for the users. They are on the hook for it, so they need to be able to control it.

For commercial web app sign-ups, they can’t be automatic. You can’t add someone to an email list, for example, without their explicit consent. And they must always have a say and a way to get themselves off the list. You can’t track where people go after they leave your site.

That even applies to issues like single sign-on. You log into a workstation, and that workstation passes on the credentials. That is fine, as you triggered it by logging in. But you can’t just quietly log users in some other way with different credentials. That would not be okay.

Second, a computer should never lie to or trick the users. People need to be able to trust the machines, which means they need to be able to trust any software running on the machines, all of it. For web signups, you can’t trick people into signing up. You can’t hold them hostage, and you can’t actively coerce or manipulate them.

For big websites, you can’t throw up an unreadable EULA and then use that as an excuse to sell the data out from under the users. If you want to monetize by reselling the things people type in, you need to explicitly tell them, not try to hide it. You need to make sure that they understand.

A big part of this is that the data presented to the users must be correct, at all times. So, it must be named correctly, modeled correctly, stored correctly, cleaned correctly, and presented correctly. Stale stuff in a cache would be an ethics violation. Badly modeled data that is broken is one as well.

This applies to operations personnel and other developers. Writing a script called add_user that actually deletes all of the users is unethical. Reusing an old field in the database for some other type of data is unethical.

The third part is to not enable chaos, strife, and discord. This is the hardest of the three.

If you write a library that makes it easy for people to violate rules, then you are culpable. You did something that enabled them.

It is unethical unless your library is in opposition to some other oppression. But for that to be the case you would have to know that, have researched it, and have found the best ways to oppose that without enabling wanton evil. So your library isn’t an accident, it is a personal statement. And your mitigations are enough to restrict any types of obviously bad usage. So you fully understand the costs of your opposition. If you don't, you can’t ethically release the library.

If you write something innocent and later find out that it is being used for evil, you are now obligated to do something about that. You can’t ignore it or say you didn't know. You are now stuck and have to do your best to make that situation better.

We see a similar problem with social networks. We value freedom of speech, but we also need to not make it easy for lies and hate speech to propagate. When you build social networks, of any kind, both of these requirements must be in your design, and both shape the solution. You can’t pick one and ignore the other.

In that sense, for anything right on the ‘line’ in ethics, you have to know that it is on the line and have to actively try everything to ensure that it doesn’t cross over. If you don’t bother, you are enabling evil, so it is unethical. If you try, there may still be some unhappy occurrences, but as you find them, you must do something about them.

Getting too close to the line is a huge pain, but it was your choice, so now you have to do everything you can to ensure that it stays tilted toward the ethical side. If you don’t like it, walk away.

The physical analogy is power tools. They are very useful but can be dangerous, so people spend a lot of time adding in safeties, guardrails, and other protections. You can build the tool but you cannot ignore its usage. If you are aware of a problem, you have to act on that awareness. If you ignore it or exploit it, you are being unethical.

If you are ethical, you can no longer use the excuse that ‘they’ told you to do it. That is not acceptable. If ‘they’ ask for something bad, you refuse or walk away.

In the practical sense, ‘they’ can order you to do the work and it may take some time to be able to walk away, so while you are stuck, you restrict yourself to the bare minimum. For example, you tell them it is unethical, but they force you to write some dubious code anyway, and since you need to eat and the labor market sucks right now, you are trapped. You do the work, as minimally as you can, but you can’t ethically damage or sabotage it. But you certainly don’t have to do anything to help it get released either. You actively try to find another job in the meantime.

You’ve let them know, and you have not gone above or beyond in your role, and you are in the process of walking away. That is the best that you can do in this type of situation. It is being as ethical as you can get, without being self-destructive.

It is very hard to be an ethical software developer these days. That is why we see so many unethical circumstances. But as software eats the world, we have to live in the mess we have helped create, so we should do everything we can to make it as good as possible.

Thursday, April 13, 2023

DevOps

Not sure what happened, but my initial understanding of the role of DevOps comes from my experiences in the early 90s.

I was working for a big company that had a strong vibrant engineering culture. We were doing very complex, low-level work. Worldwide massively distributed fault tolerance. Very cutting-edge at the time.

Before we spent a lot of time coding, we would write down what we were going to do in a document and send it out for feedback. The different teams were scattered all over the planet, so coordinating all of the efforts was vital.

Once we knew what we needed to do, we’d craft the code and test it in our environments. But these were effectively miniature versions of that global network.

When we were happy with how our code was performing, we’d give the code to the QA group. Not the binaries, not an install package, but the actual source code itself.

We had our own hand-rolled source code control repository; this was long before CVS, Subversion, or later git. The QA team had its own separate repo, and more importantly, they had their own environment configurations for our code as well. Only our source would get copied into their repo.

We had some vague ideas about their environment but were never told the actual details. It was more like we were a vendor than an internal group. The code we wrote was not allowed to make weak, hardcoded assumptions.

They took the code, configured it as they needed to, set it up on their test networks, and beat it mercilessly. Where and when problems occurred, they would send us the details. Usually with some indication about whether the problems needed to be fixed right away or could wait until the next release.

After that, we had no idea. We never knew when stuff was going into production, we only got infrequent reports outlining its behavior. We delivered the code to QA and they integrated it, configured it, tested it, and then when they were happy, deployed it.

That separation between development and operations was really, really great.

We were able to focus on the engineering, while they took care of the operational issues. It worked beautifully. We really only had one bug ever in production, and given the complexity of what we were building, the quality was super impressive.

When DevOps came along, I figured it would play that same intermediary role. Get between dev and ops. That it would be the conduit for code going out and feedback coming in. Developers should develop and operations should operate. Mixing that up just rampantly burns time, degrades quality, and causes chaos. A middle group that can focus on what they have now, how it works, and what could make it better is a much stronger arrangement.

Over the decades, it seems like people have gotten increasingly confused about operations. It’s not just throwing the software out there and then reacting to it later when it glitches. Operations is actively ‘driving’ the software, which is proactive. And certainly, for any recurring issues, a good operations department should be able to handle them all on their own.

On the other side, development is slow, often tedious, and expensive. As technology has matured, it has also gotten increasingly convoluted. While there are more developers than ever, it’s also harder to find developers with enough underlying knowledge to build complex, yet stable, code. It is easier to craft trivial GUIs and static reports, but how many of those do you really need? Hiring tanks to get the core and infrastructure solid is getting harder to do; driving them away by getting them tangled up in operational issues is self-defeating.

So we get back to the importance of having strong developers focus entirely on producing strong code. And we put in intermediaries to deal with the mess of getting that tested and deployed. Then if the operations group really is monitoring properly, we can build less, but leverage it for a lot more uses. That is working smarter, not harder.

Thursday, April 6, 2023

Waterloo Style

When I first started learning how to program, I stumbled onto an extremely strong programming philosophy that was fairly dominant at the University of Waterloo in the 1980s.

Before I was exposed to it, I would struggle with crafting even simple programs. Afterward, for pretty much anything codable, including systems-level stuff like databases, operating systems, and distributed systems, building it was just a matter of being able to carve out enough time to get the work done.

Over and over again I’ve tried different ways to explain it in this blog. But I think I keep getting lost in definitions, which probably makes it inaccessible to most people.

So, I’ll try again.

The primary understanding is that you should ignore the code. It doesn’t matter. It is just a huge list of instructions for the stupid computer to follow.

If you try to code by building up increasingly larger lists of instructions, you’ll go stark raving mad long before you get it to work properly. So, don’t do that.

Instead, focus on the data. Figure out how it should flow around.

Right now it is stored on a disk somewhere, tucked into some type of database. But the user wants to see it on their screen, wrapped in some pretty little interface.

You’ve got stuff running in the middle. You can contact and engage the database with enough information that it can just give you the data you care about. You take it from there and fiddle with it so that it will fit in the middle of the screen. Maybe that prompts the user to mess with the data. In that case, you have to grab it from the screen and tell the database technology to update it.

If you need some data from another system, you call out to it, get a whack load of stuff, then slowly start putting it in your database. If you have to send it elsewhere, you do the opposite. If you don’t want to call other systems, you can have them push the stuff to you instead.

If you take any complex system and view it by how the data flows around it, you’ll find that it isn’t so complex anymore. A data perspective is the dual of a coding perspective (I can’t resist throwing that in), but it is intrinsically less complicated.

Okay, so if you get that perspective, how do you apply it?

In the object-oriented paradigm, you take each and every different type of data that you are moving around, wrap it in an object, and put all of the little functions that fiddle with it in there too. Then you just move around the objects. It is literally a one-to-one mapping.

It gets messy in that some of the technologies aren’t object-oriented. But fortunately one of the core influences on OO was ADTs, or abstract data types. People tend to know these as lists, trees, stacks, and such. But they are just "nearly" objects without the Object ideology stamped into the programming language. That is, any collection of data has some structure to it that you have to accommodate, which is, not surprisingly, called ‘data structures’. Capture that, and the underlying mechanics will be able to correctly hold any instance of that data. Screw it up, and at least one instance won’t fit; you’ll have to hack it, and you will regret that later.

What’s interesting about the ADT philosophy is that it fits on top of any procedural language, and it helps if the language has composite abilities like typedefs or structs. They allow you to package a group of variables together, and keep them together as they move around the system. You don’t need to package them, but it really helps to prevent mistakes.

If you package the data together, and you build some little functions that understand and work correctly with that data and its structure, you can keep these all together effectively “encapsulating” them from everything else. Or basically a loosely defined object.

So, you get right back to the main issue. You move around the data, and you can do that with objects, structures, or in a language like Go, with interfaces. It is all the same thing in the end, as it is all derived from that original ADT philosophy. Different names and different language capabilities, but all are part of the same data structure coding style.
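
As a small illustration of that equivalence (the Account type and its fields are made up for the example), here is the same ‘package the data with its little functions’ idea in Go, once as a struct with methods and once behind an interface:

```go
package main

import "fmt"

// Account is one type of data in the system, packaged together with
// the little functions that know how to fiddle with it.
type Account struct {
	Owner   string
	Balance int64 // cents, to avoid floating-point surprises
}

// Deposit and Withdraw keep the structural rules next to the data.
func (a *Account) Deposit(cents int64) {
	a.Balance += cents
}

func (a *Account) Withdraw(cents int64) error {
	if cents > a.Balance {
		return fmt.Errorf("insufficient funds for %s", a.Owner)
	}
	a.Balance -= cents
	return nil
}

// Displayable is the interface flavour of the same idea: the rest of
// the system just flows the data to wherever it is needed.
type Displayable interface {
	Display() string
}

func (a *Account) Display() string {
	return fmt.Sprintf("%s: $%d.%02d", a.Owner, a.Balance/100, a.Balance%100)
}

func main() {
	acct := &Account{Owner: "pat", Balance: 1000}
	acct.Deposit(250)
	if err := acct.Withdraw(5000); err != nil {
		fmt.Println(err)
	}

	// Anything that satisfies Displayable can be moved to the screen
	// without the caller knowing its internal structure.
	var view Displayable = acct
	fmt.Println(view.Display())
}
```

The rest of the system never reaches inside the struct; it just flows Account values, or anything Displayable, to wherever they are needed.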

Now the only other trick is that you need to keep this organized as you build it. So it takes a little discipline. When you have some new type of data in the system, you wrap it in objects or structs ‘first’, before you start using it everywhere else. You build from the bottom up. You don’t skip that, you don’t cheat the game, and you don’t make exceptions.

Once that data is available, you just flow it to wherever you need it. Since you wrapped it, you will nicely reuse that wrapping everywhere, and if you were missing some little function to fiddle with the data, you would add it very close to that data, in the object, or with the other data structure code. It is all together in the same place.

When you see it that way, an operating system is just a very large collection of data structures. So is a relational database. So are compilers and interpreters. Pretty much everything, really.

Most domain applications are relatively small and constrained sets of data structures. Build the foundation, and move the stuff around where you need it.

Very occasionally, you do need to think about code. Some systems have a small kernel of complexity, usually less than 10%, where they need some really difficult code. It’s best to do a bit of research before you tackle it, but the mechanics are simple. Get all of the data you need, feed it to the algorithm in hopefully a stateless way, then get all of the outputs from it and move them around as necessary. Easy peasy.
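
A hedged sketch of that shape (the ranking rule is invented purely for illustration): gather the inputs, hand them to a small stateless function, then take the outputs and move them wherever they need to go.

```go
package main

import (
	"fmt"
	"sort"
)

// rank is the small, difficult kernel: pure and stateless, it takes
// all of its inputs as arguments and returns all of its outputs.
func rank(scores map[string]int) []string {
	names := make([]string, 0, len(scores))
	for name := range scores {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if scores[names[i]] != scores[names[j]] {
			return scores[names[i]] > scores[names[j]]
		}
		return names[i] < names[j] // deterministic tie-break
	})
	return names
}

func main() {
	// Gather the data first (hardcoded here; normally it would come
	// from the database or another system).
	scores := map[string]int{"alpha": 7, "beta": 12, "gamma": 7}

	// Feed it to the algorithm, then move the results onward.
	for i, name := range rank(scores) {
		fmt.Printf("%d. %s (%d)\n", i+1, name, scores[name])
	}
}
```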

The most common objection to this type of coding comes from people believing that it is too many objects or too many little functions. It is ironic, in that when it is consistently applied, it is usually far fewer objects and functions than just slamming out reams of endless redundant code. It is often a tenth of the amount of code, or less. It may seem to be more work, but it is less.

And while it is harder to see the full width of all of the instructions that will be executed, it is actually far easier to vet that the underlying parts are acting correctly. That is, if, as the code degenerates, you end up with a problem buried deep in the middle of hundreds of lines of tangled code, it is unlikely that you’ll be able to fix it cleanly. But if you have some little function that has an obvious bug, you can fix it simply, and then start walking up the calling stack to make sure that everybody above doesn’t break with the fix. It’s actually less work, usually easier, and since you can properly assess the impact of the change, it is about 1000 times safer. It also has the added benefit that it will fix a ‘class’ of bugs, not just the bug itself, which helps to get far better quality.

Now, wrapping all of the data down low is a bit of extra overhead. It doesn’t make sense to do it for small throwaway programs; you generally write those too quickly. But for a medium system, it usually pays off within the first 6 months, and for large or huge systems, it pretty much exponentially cuts down on the development time. A huge saving.

That is, you may come out of the gate a little slower, but the payoff over the life of the project is massive. You can see the difference in large systems, quite clearly. After a year or so, either they are quickly degrading into an impossible mess, or you can keep extending them with a lot more functionality.

There is more. But you need to accept the foundations first. Then you can deal with difficult data, and maybe even understand why design patterns exist (and how not to abuse their usage).

Figuring out how to flow the data around is a much more powerful means of understanding computer systems than just trying to throw together massive lists of instructions. Once mastered it means that most of the other problems with designing and building complex systems are really about people, processes, and politics. If you can easily build anything you want, then you have a lot more choices for what you build, and why you build it.