Sunday, February 21, 2021

Knowing How Things Work

There are some real benefits in knowing how things work.

Jumping into something totally blind is sometimes fine, but only if you are there to learn it, to enjoy it, or just to take in the experience. It is always amateur-hour, though: the results might work out okay, but more likely they will leave something to be desired.


Having a shallow overview might be okay sometimes; however, being able to get right down into the details and see what is happening underneath is far, far better.


Most things that are complex are somewhat opaque from the outside. Their internal complexity can be counter-intuitive and built up over long and fractious histories. You need to know most of the parts, but it also helps to see how it all evolved, so that the exceptions make sense and fit back into the bigger picture.


There is no such thing as too much depth. There might however be too many details for any one person to cope with, so for very large and complex things, it is often necessary to specialize. A generalist can lead a bunch of specialists, but only if they are trusted and their advice heeded.  Thus one needs to know their limits: how much they know, how much they don’t know, and all of the little things that they think they might know, but really don’t. They need to know if they need someone else who knows more.


If you know how things work, you can make changes to them that have a much higher likelihood of continuing to work, or of actually improving things. If you don’t know how things work, then any changes you make will effectively cause random side-effects that you are constantly reacting to. That is terribly inefficient, somewhat dangerous, and once you start down that path there is no way of knowing whether you’ll ever get it finished in a manner that is acceptable.


Having an opinion about how things ‘should’ work is not the same as knowing how they actually work. And worse, opinions based on misunderstandings are more likely to be destructive than helpful.


If you know how things work, then you know all of the changes that have to be made to achieve a certain goal. You also know how long it will take to make those changes. If it’s long-running work and you set a plan, then that plan will work out, so long as there have not been any unforeseeable events. If the plan fails, it fails because of the things you didn’t know. Thus being able to successfully plan is a good way to prove that you actually know how things work.


It takes a lot of time to know how things work. It takes a lot of reading, ingesting what you have learned, and then getting out there and experiencing stuff before you can really see how all of the pieces fit together. Learning from experienced mentors is the fastest way to gain understanding. Learning from courses or textbooks is better at providing an overall view. Some things can be known discretely, and that knowledge is definitive. Some things can only be known as a sort of intuitive feel for how they will react under difficult circumstances. You might know all of the parts really well, but still not be entirely sure how they will react to a multi-dimensional set of changes. Most things require some form of balance, which is why oversimplifications tend to throw things out of whack. They favor one aspect of the problem over the others.


Full knowledge is obviously the fastest and most effective way forward, but when it spans multiple people the team dynamics become an actual part of the thing getting changed; they are no longer separable. Partial knowledge can be okay if it is accepted and accounted for, but you still have to find out about the unknowns, and the unknown unknowns as well.


A professional is someone who knows how things work for a given industry and can successfully reach any ‘possible’ goals. But they also know which goals are not possible, and which ones are unlikely. They know how to avoid failure.

Sunday, February 14, 2021

Gyrations

It’s instructive to walk through code sometimes. It reveals a lot more than just scanning it at a high level.

One of the things you find is that the code might take some pretty odd paths through the logic, just to get back to some really simple data changes.


These gyrations are of course unnecessary. They usually exist because the programmer didn’t know how to go from A to B. Instead, they went skewing off to H, bounced over to M, and then came back to B.


Sometimes this is a result of a second programmer coming into the code, with the intent to ‘just fix’ the minimum necessary. 


Sometimes it is a result of the original programmer not fully understanding the work they are doing. Some coders have a discrete set of ‘instructions’ that produce certain types of output. They assemble their code from this set. If this set has gaps, their code also has gaps, so they fill them by jumping over to other areas that they do have in their set, doing intermediate work, then eventually jumping back. That is, they know how to go from H to M, and they know how to go from A to H, and from M to B. This can be recursive and often quite intertwined; there can be some pretty crazy gyrations going on.


It’s far easier, if you want to code A to B, to just code A to B, even if there aren’t any underlying libraries that help. Most of these transformations are pretty trivial; they involve basic data type changes or structural rearrangements. If scheduling makes it impossible to write the direct code yourself, then the best way to handle this is to craft a function to encapsulate A to B. In that function, you can throw H and M into the mix; that’s okay. Later, if another programmer sees this and knows how to be more straightforward, it’s pretty safe for them to just replace the H and M madness with better code. If it’s encapsulated, then the impact of this change is easy to figure out.
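
As a hedged sketch of that encapsulation, with purely illustrative names: the detour through H and M is hidden inside one small function, so a later programmer can replace the body with a direct A to B conversion without touching any of the callers.

```python
# All names here are illustrative. The gyration through H and M lives
# inside a_to_b(), so callers only ever see the A -> B transformation.

def a_to_h(record):
    return [f"{k}={v}" for k, v in record.items()]   # e.g. flatten a dict

def h_to_m(pairs):
    return ";".join(pairs)                           # e.g. join into one string

def m_to_b(text):
    return text.lower()                              # e.g. normalize it

def a_to_b(record):
    """Encapsulates the whole A -> B transformation, detours and all."""
    return m_to_b(h_to_m(a_to_h(record)))

print(a_to_b({"Name": "Widget", "Qty": 3}))          # name=widget;qty=3
```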


Sometimes programmers don’t want to encapsulate A to B because they feel that it is either hard to read or slow in performance. Neither issue really applies. It’s easier to read code if the weirdness is pulled away from it. You get a bigger sense of what the logic is trying to accomplish. The performance costs of adding lots of functions have been effectively trivial for decades. There are certain types of high-performance code where it might matter, but chances are that is not the code you are writing right now, and even if it was, it’s better to also provide a slower, more stable, more debuggable version in the source as well.


The biggest problem with large systems is that any and all disorganization feeds into making it a tangled mass of spaghetti. This always starts with little stuff, and each problem itself does not seem significant, but as these problems stack on top of each other, the entire edifice grows shaky and difficult. In a sense, passing through H and M is entirely artificial. They weren’t really needed, they are wasting resources, and they are just confusing the code. The code needs to get B from A, and it needs to do that in the clearest and cleanest way possible. Even if you are rushed, encapsulating that in a little function doesn’t take more than a few minutes, and it will pay off immensely later. 


Fixing this is actually really easy. Just start going through the big functions, creating littler ones. Once you have little ones, revise them to fit into other places in the code. Sometimes you have to unwind it a little to get it cleaned up. So, if A->H->M->B is intertwined with A’->Q->X->N->C, you probably need to separate them first, before you can encapsulate them. Obviously, it was less work to get it right originally, but it’s not complicated work to sit there and unravel it either. It’s just slow and boring. You have to do it a little at a time, then check that the code still runs correctly, then continue on. It definitely is work worth doing, particularly if there are still big changes to the system that are coming. Ignoring it only makes it worse.
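
A toy sketch of that unraveling, with invented transformations: the intertwined version interleaves two unrelated flows in one big function, while the cleaned-up version splits them so that each can be encapsulated and simplified on its own.

```python
# Intertwined: two unrelated flows share one big function, so neither
# can be simplified without touching the other.
def process_everything(a, a_prime):
    h = a.strip()                    # A  -> H
    q = a_prime.split(",")           # A' -> Q
    m = h.upper()                    # H  -> M
    x = [s.strip() for s in q]       # Q  -> X
    b = m + "!"                      # M  -> B
    c = len(x)                       # X  -> ... -> C (collapsed for brevity)
    return b, c

# Unraveled: each flow gets its own little function, which can later be
# rewritten as a direct A -> B or A' -> C without disturbing the other.
def a_to_b(a):
    return a.strip().upper() + "!"

def a_prime_to_c(a_prime):
    return len([s.strip() for s in a_prime.split(",")])

assert process_everything(" hi ", "x, y, z") == (a_to_b(" hi "), a_prime_to_c("x, y, z"))
```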

Thursday, February 11, 2021

Multi-tasking

When I first started coding I often struggled with remaining focused. While writing code, there is a lot of stuff going on in our heads that needs to get put down into the computer. It turns out that the code is best and the work is fastest if you can basically just type what you ‘see’ during prolonged stretches. Later you keep passing over it, again and again, editing it until it becomes clean and tidy.

As I grew in my abilities, I learned to focus for hours and hours at a time. In the heyday of programming, I had my own office; I could close the door and get immersed in the work. That was easily the best code I have ever written. It was very readable, the mechanics were discussed thoroughly and thought out clearly, the project had good strong conventions that we meticulously followed, and the code was consistent. Reuse was critical to the success of the project, otherwise we would have ended up with way too much stuff. It was a very ‘technical’ development, a big distributed fault-tolerant cache with high-performance specifications. Despite the complexity, we only ever had one known software bug in production. It was also my favorite job.

As Agile killed the profession, programming moved from being about deep focus to often just bouncing around, throwing a lot of code fragments everywhere to see if we could get things working. Noisy open office spaces, lots of interruptions, stress, and drama, and maybe about 30% of the time necessary to get barely functional code.

It took some adjustments, but I gradually learned to work in a messy, loud, multi-tasking environment. Part of the secret was learning to write a lot of stuff down. The core habit is to keep a ‘note’ of what you are going to do before you go and do it. If you are disrupted in the middle of working on something, you make a new note for the next task, jump to the new work, and keep going. When you are finished, you go back over your notes and work your way back out of the pit. In rough environments, you might find yourself down 6 or 7 levels, really quickly.
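
That note-keeping habit is essentially a last-in, first-out stack. A trivial sketch, with invented task names:

```python
# Push a note when you get interrupted, pop when the current task is done,
# and you work your way back out of the pit one level at a time.
notes = []

notes.append("refactor the report query")     # the original task
notes.append("prod issue: login timeouts")    # interruption, level 2
notes.append("urgent question from the PM")   # interruption, level 3

while notes:
    current = notes.pop()
    print("working on:", current)
```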

Another great habit is time-boxing. That is, you try to allocate very specific times of the day to do specific work. Mornings are for big email replies, management issues, etc. After lunch, some days are ‘coding days’, some are ‘meeting days’. Rather obviously, sticking all the meetings back to back helps keep you in ‘talking context’, rather than ‘coding mode’. It’s always been very different at different companies, but trying to batch similar stuff together helps; thinking about it in terms of ‘focused work’ and ‘unfocused (chatty) work’ helps too.

Another thing to do is keep a separate list of boring tasks. When you are baked, work on stupid, boring, trivial stuff. When you have the energy, do the more cognitively demanding work. It does cause a bit of interference with scheduling though. When you are giving estimates, you have to give them in terms of the good days/hours. So, if most weeks consist of 2-3 days of insanity and drama, your capacity is really only the remaining 2-3 days of work, and it’s better to err on the lower side, so you are often down to 2. Thus 10 days of grueling, hard-ass coding are going to span 5 weeks unless you start clocking in overtime. If you keep clocking in overtime, eventually all of the days are ‘unfocused days’ and you’re down to 0, which gets messy with your employer (but at least the boring work is all done, isn’t it).

Sometimes I just cycle. Get four projects going at once, work on each until it hits a significant blocker, then move on to the next one. When doing this, the code itself can’t be rocket science, so it’s usually a little more brute-forcey, domain-specific stuff. The blockers then tend to be poor specifications, external decisions, ambiguities, etc. Most of them involve getting a conversation going with other people about some issue that needs to be addressed right away. So, the flow is to read back the items that need to get done (as per above), then work on them. Either close them off or jot down that you need to punt them into some discussion. When the little bits are all finished, flip over into ‘email mode’ and trigger all of the discussions. That project is now all jammed up; move on to the next project. Keep cycling between all of them, with the same flow.

Sometimes shifting priorities will pop up that force a different order, but it’s best if you try to spend a bit of time first to close off the thing you are actively working on right now (by either finishing it or at least updating your notes). Working this way keeps everything moving forward, and you keep a record of where you are and what didn’t get done yet. It doesn’t, however, produce great quality work.

My personal feeling is that spending the effort to get good or even great code is both faster and produces way better results for the users. We can rush through the work, and at times people do only need crappy quality, but doing this all of the time gradually paints the entire effort into a corner. Technical debt is most often exponential: a little of it starts leaking in, and before you know it, getting anything done, at all, is a Herculean effort. Preventing that decline is equally, if not more, important than getting out most features, but since that work is basically invisible to management and the users -- until it is too late -- people incorrectly ignore it. Still, the industry turned in this direction like a huge ship, so it’s not easy to set it back on a good course. We have to deal with whatever madness is paying the bills, so it is best to learn how to spew out lots of code, on many different tasks, all at once, even if you realize that people aren’t really getting what they need.

Saturday, February 6, 2021

Theory and Practice

In many discussions, I’ve often referenced the need to know some ‘theory’ when writing specific types of code. What I’ve seen is that people who bypass acquiring this knowledge end up writing code that just won’t ever work correctly. That is, there are some very deep, rather counter-intuitive problems buried at the heart of some discrete computational problems, and approaching them with clever ideas alone is not enough.

If you are okay with your stuff falling back 30 to 50 years in behavior then you can attempt to reinvent these wheels, but if you need to get it out there, working as well as it can, then you should not ignore the underlying theories.

The first example is parsing. People have long noticed that string operations like ‘string split’ are pretty powerful, and they do some of what is needed to parse incoming text. So if you have a buffer of “x = y” statements all on their own line, then you could split and squeeze the whole thing into a dictionary. Easy, peasy.

It starts to fall apart when bits of the parsing become ‘modal’. The classic example is a quoted string in a .csv file that contains a comma, which is also the field separator for the file. The right way to handle this is to basically have two states: one that is normal and one that is ‘in quotes’. While ‘in quotes’, you treat the comma as a normal character. Otherwise, you treat it as a separator.

And this is where things get messy. The comma character has 2 distinct meanings. It’s either just a normal character in a string, or it is a special character used to act as a separator between fields. If you don’t keep track of the state, the meaning is ‘ambiguous’. You just don’t know. Split might handle the work of breaking up the string into pieces, known as tokenizing, but it provides no means of state mechanics.
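
As a minimal sketch of that two-state idea, assuming a simplified format where double quotes only toggle the ‘in quotes’ state (no escaped quotes, no multi-line fields), the split might look like this:

```python
def split_csv_line(line):
    """Split a simplified CSV line, treating commas inside quotes as
    ordinary characters. Two states: normal, and 'in quotes'."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes         # flip the state
        elif ch == ',' and not in_quotes:
            fields.append("".join(current))   # separator: close the field
            current = []
        else:
            current.append(ch)                # just an ordinary character
    fields.append("".join(current))
    return fields

print(split_csv_line('widget,"3,50",blue'))   # ['widget', '3,50', 'blue']
```

A plain split on commas would have chopped the quoted field in half; the single in_quotes flag is the tiny bit of state that resolves the ambiguity.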

The syntax of a complex programming language has tonnes of these state-based issues. They are everywhere, from the operators we use to the way we handle blocks in the code. It’s a state-fest. This occurs because the bandwidth of our characters is somewhat smaller than the bandwidth of the things we want to do, so we pack our mechanics tightly into the symbols that we have available, causing lots of ambiguity. Language theory classifies grammars by how much lookahead or backtracking it takes to resolve this, LR(1) being the well-behaved case of a single token, but the underlying issue is that we often hit a token that is ambiguous, so we try to process it one way, realize that it is wrong, then have to back up and do it some other way. If the language is really gnarly, we might have to back up a few times before we can resolve the ambiguity.

Without the ability to back up, basically anywhere at any time, the code will continue to hit ambiguities and do the wrong thing. So the underlying problem is that if you try to hard-code the mechanics, there will continue to be circumstances where that static coding won’t do what is expected. These special cases will keep occurring indefinitely. In the beginning you fix a lot of them; it slows down as you get way more code and the unobserved variance of the inputs shrinks, but it will never, ever go away. The code is incomplete because it can’t adapt dynamically to the input, so it is weak by construction. The mechanics can’t be static; they have to be dynamic enough to bend to any ambiguity that they find anywhere in the text.

Ambiguities are actually found all over the space of discrete systems. They are essentially just missing information, but missing in a way that sometimes even the physical universe itself can’t resolve. The classic one is TGP, the two generals problem. It’s at the very heart of all distributed processing.

Basically, if you have more than one ‘serialized computation’, sometimes described as Turing Machines, the communication between them has an intrinsic difficulty. If one ‘machine’ sends a message to another ‘machine’ and after some time there is no response, what happened is ambiguous. The sending of the message may have failed, or the sending of the reply may have failed, or some code in between may have failed. You can’t ever be sure which side of the communication was dropped. You can’t be sure if the work was done or not.

If the message was just asking for some information, basically a ‘read’, then when the first machine gets impatient it can just redo the question.

If the message was asking the second machine to change a value, then it suddenly gets funny. If the message never made it, the second machine obviously didn’t make the change. If the message made it but the reply didn’t come back, the second machine may have actually made the change. The lack of a reply says nothing about the state of the change.
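
Here is a hedged sketch of that asymmetry, with a fake `send` that randomly drops messages; everything here is illustrative rather than a real protocol.

```python
import random

def send(request):
    """Pretend network call: the request or its reply may be lost."""
    if random.random() < 0.3:
        return None        # timeout; we cannot tell which side failed
    return {"ok": True}

# A read is safe to retry: asking twice changes nothing on the other side.
reply = None
while reply is None:
    reply = send({"kind": "read", "key": "balance"})

# A change is ambiguous: on a timeout it may or may not have been applied.
# Blindly retrying could apply it twice, unless the receiver deduplicates
# requests by an id (idempotency).
reply = send({"kind": "write", "id": "req-42", "amount": -100})
if reply is None:
    print("unknown outcome: the debit may or may not have happened")
```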

If you don’t care whether the change has been made or not, it’s not much of a problem, but in a lot of cases we build up huge amounts of ‘related’ work that are distributed over a bunch of different computers. We wrap these in a ‘transaction’ so that we get the property that either all of the changes worked, or all of the changes failed. That’s critical in that we need to inform the user that their request, which was ‘singular’ to them, was actually fully and properly processed, or that there was some problem so nothing was done. We refer to this as ‘transactional integrity’. It’s one transaction to the users, but it may span a lot of different machines, databases, etc. A trivial example is transferring money between two systems. The amount must be decremented from one database if and only if the amount is incremented in the other. The transfer either works, or it does not. There can be no other intermediate states, or things will get ugly.
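
To see why those intermediate states get ugly, here is a deliberately naive sketch with two in-memory stores standing in for the two systems; the names and the failure are invented for illustration.

```python
# Two separate 'systems', each with its own balance store.
bank_a = {"alice": 500}
bank_b = {"alice": 100}

def flaky_send_to_bank_b(amount):
    # The increment never arrives: the link drops mid-transfer.
    raise ConnectionError("link down")

def naive_transfer(amount):
    bank_a["alice"] -= amount      # step 1: committed locally
    flaky_send_to_bank_b(amount)   # step 2: fails in transit
    bank_b["alice"] += amount      # never reached

try:
    naive_transfer(200)
except ConnectionError:
    pass

# 200 has left bank_a but never arrived in bank_b: exactly the kind of
# intermediate state that transactional integrity is supposed to rule out.
print(bank_a["alice"], bank_b["alice"])  # 300 100
```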

So, we have this problem of coordinating a bunch of different machines to either all do the work, or to all fail. But if one of them is late in replying, we don’t know whether that work was done or not. That ambiguity should halt the other machines from continuing until it’s been correctly resolved. Worse, there may be lots of other transactions piling up behind this one, and we don’t know whether they are independent of this change, or they rely on it for their changes. Say, transferring the new balance in one of those earlier accounts to a third system.

If we built this type of computation to be entirely reliable then every time one of the communications gets dropped, the entire system would grind to a halt, until that message was found or resent. That, of course, would perform incredibly badly, as there are lots of reasons in the physical world that communications can be interrupted.

In practice, what we used to do was minimize the window of failure to be as small as possible. That is, a transaction is sent out to all participants to ‘stage’ the work. When they all confirm that the remaining step is tiny, a second round of messages is sent out to turn it all live, to ‘commit’ it. If it’s successfully staged, it is treated as if it will go through. If that first part of the conversation fails, it is all rolled back and an error is propagated upwards to let the initiator know that it didn’t work.
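
That stage-then-commit conversation is essentially two-phase commit. A minimal sketch, with hypothetical participant objects rather than a real transaction manager:

```python
class Participant:
    """A hypothetical system taking part in a distributed transaction."""

    def __init__(self, name, will_fail=False):
        self.name, self.will_fail, self.staged = name, will_fail, None

    def stage(self, change):
        if self.will_fail:
            return False        # cannot promise to do the work
        self.staged = change    # prepared, but not yet live
        return True

    def commit(self):
        print(f"{self.name}: applied '{self.staged}'")

    def rollback(self):
        self.staged = None

def run_transaction(participants, change):
    # Phase 1: everyone stages the work and promises the rest is tiny.
    if all(p.stage(change) for p in participants):
        # Phase 2: a second round of messages turns it all live.
        for p in participants:
            p.commit()
        return "confirmed"
    # Any refusal: roll everyone back and propagate the error upwards.
    for p in participants:
        p.rollback()
    return "error"

print(run_transaction([Participant("ledger"), Participant("cache")], "debit 100"))
print(run_transaction([Participant("ledger"), Participant("cache", will_fail=True)], "debit 100"))
```

Note that this only shrinks the window of ambiguity; a participant that goes silent between staging and committing still leaves the coordinator guessing, which is why the staged step has to be as tiny and as safe as possible.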

If you get this right, the behavior of the system overall is nicely deterministic. When you see a confirmation, it means everything really was updated. When you see an error, it means that it was not. When you get this wrong, neither a confirmation nor an error has an exact meaning. It could mean it mostly worked, or it could mean that it kinda failed. You really can’t trust the system at that point.

In modern practice, the whole notion of transactional integrity is often ignored. All of the subsystems just get the requests and process them or they error out. Everything is handled individually. The problem is amplified by the modern habits of adding more and more distributed systems. When most software was on the same hardware, the likelihood of problems was really tiny. When it is distributed across a lot of machines, the likelihood of problems is quite significant. It happens all of the time. If the data is spread all over the place, and there are a lot of mindless workers changing it, without any integrity, the quality of the data, as a whole, is going to be really, really bad. Constant problems will cause gaps, overlaps, and contradictions. Mostly it will go unnoticed, but every once in a while someone will do the work of reconciling their different data sources, and find out that collectively it is all a mess.

As if we needed more problems, we used to architect large-scale systems with Transaction Management technology, but since those managers are slow, and programmers felt that they were difficult to use and somewhat pedantic, we switched over to using stateless API calls. No transactional integrity at all, on purpose. That might have been okay if we partitioned the systems based on data dependencies, but then along came ‘microservices’, where we break it all up even further into even more little, dependent parts. Rather obviously, that is a recipe for disaster. The finer-grained we go, and the more we distribute across hardware while ignoring the necessity of transactions, the more we degrade the quality of the data we collect. This occurs as a natural result of tunnel-visioning into small performance or organizational problems while ignoring larger cross-cutting concerns like transactions. It’s not easy to cope with physical limitations, but it certainly isn’t advisable to just ignore them.

But it’s not just persistence that gets into trouble. Even though they are way faster, computers are still too slow. That oddly comes from their popularity. As we use them for more processing, we expect them to process way more stuff. The size of the stuff we are trying to handle these days has grown faster than the increases in hardware. That is, the workload for software outpaces Moore’s law in growth, right now.

To get around that, we don’t just want faster hardware, we want to involve more hardware as well. In the early days of computing, when we parallelized stuff, we were very careful to sandbox it as well. So, one computation as it runs could not be affected by any other running computation. We crafted a ‘process’ abstraction so that we could fit a bunch of these together on the same hardware, but they maintained their ‘safety’. Later, when programmers decided that safety was just too awkward, they came up with ‘threads’. These are independent computations that can cause all sorts of havoc with each other. Basically, they are fighting for dominance within the hardware, and if they bang into each other, the results of the collisions can be rather messy.

Realizing that multi-threaded code was a problem, we enhanced our knowledge of fine-grained synchronizations. That is, two threads could share a larger piece of data like a string if they both nicely agreed to use the same mechanics to synchronize their operations on it. The primitives to drive synchronization have been around since the early days, as they were needed for foundational technology like operating systems, networks, and databases. What changed was that we pushed their necessity upwards, higher and higher in the technical stacks, so that the programmers writing code on top now also needed to understand how they worked. It’s a form of unencapsulating stuff, where we solved a problem, made it nicely usable then went out and unsolved it again.
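
A small sketch of what that pushed-up responsibility looks like, using Python’s standard threading primitives; the shared counter is just a stand-in for any larger piece of shared data.

```python
import threading

counter = 0
lock = threading.Lock()

def work():
    global counter
    for _ in range(100_000):
        with lock:            # both threads agree to use the same lock
            counter += 1      # the read-modify-write is now protected

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 200000 with the lock; without it, updates can be lost
```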

It’s not the only place where we provided mechanics to people that were highly likely to let them injure themselves. Colloquially we call it “giving them enough rope” and we’ve seen that over and over again in other areas like memory management, pointers, type management, and variable arguments.

It’s not enough to just avoid data collisions in the code; you also have to worry about starvation, race conditions, deadlocks, livelocks, etc. Synchronizing two dependent computations is non-trivial. But getting it wrong still lets the code mostly work. It works a lot, then strangely errors out, then magically works again. Mostly, if the collisions are infrequent, the testing won’t notice them, and the users just put up with the occasional, unexplained problems.

There are many other known and unknown problems that quickly flow into theoretical issues, like caching or algorithmic growth, or even error handling to some degree. On the surface, they seem easy, but underneath it’s incredibly hard to get them done correctly, or at least close to state-of-the-art knowledge.

Most code however is pretty trivial, and people often use brute force to make it even more trivial. Get some data from a database, throw it to the user, have them make some edits, and then save it again. Ask another system for an update, store it away forever. Spew out a few million slightly customized reports. Most of this type of programming needs a really good understanding of the way that the data should be modeled, but the technical side is quite routine. The trick isn’t coding it, it is keeping it all organized as it becomes large and hard to handle. Keeping millions of little pieces neatly arranged is a somewhat different problem from getting one little tricky piece to behave as expected. What we need to do better is to keep the tricky bits encapsulated, so that they aren’t wrong so often.

Wednesday, February 3, 2021

Plumbing

By now it should not be hard to synchronize data between different software systems. We have the storage capacity, the CPU power, and there are a plethora of ways to connect stuff. Still, the situation in the trenches isn’t much better than in the 90s, which is sad.


XML was in many ways an attempt to send around self-documenting data. The grand idea was that you could get stuff from anywhere, and there was enough meta-information included that it could be automatically, and seamlessly imported into your system. No need to write the same endless ETL code over and over again. Yeah.


But rather than mature into that niche, it got bogged down in complexity which made it hard to use. JSON took a few massive steps backward, but at least you could wire it up quickly.


One of the underlying problems is semantic meaning. If one system has a symbolic representation of part of a user’s identity in the real world, you have to attach a crazy amount of stuff to it when it is exported in order for a second system to be able to interpret it correctly, in a completely different context. Just knowing the bits, or grokking that they are byte encodings of particular glyphs, or even knowing that those glyphs collected together represent an identifier in a particular human language, is not enough. The implicit ‘model’ of the information, as it is mirrored from the real world in both systems, has to hold enough common information that a stable mapping between them is possible. And it is probably some offbeat artifact of reality that the stability itself needs to be bi-directional.


We could play with the idea that the two systems engage in a Q&A session. The second system keeps asking “what do you mean by xxxx?”, slowly descending the other system’s knowledge graph until it hits an ‘ah-ha’ moment. Oddly, if they were both event-based, then the conversation would only have to occur after one of them changed code or config, so not that frequently and probably not in haste.


That, of course, would be a crazy amount of code. As well as implementing the system you’d have to implement a malleable knowledge graph that spanned each and every line of code and a standard protocol to traverse it. It’s no surprise that we haven’t seen this type of code in the wild yet, and may never see it.


What if we went the other way? A system needing data contacts an intermediary and requests it. Another system contacts the same intermediary and offers to share data. That gives us two structural representations of a complex set of variables. Some map one to one, based on a mapping between the labels. Some have arbitrarily complex mappings that, in the worst case, could include state. If the intermediary system had a way of throwing up its hands, admitting defeat, and bothering a few human beings for intervention, that version of the problem might be solvable. Certainly, the export and import from the two systems are well constrained. The interface is fairly simple. And, over time, as the intermediary built up a sophisticated database of the different sources and requestors, the effort to connect would converge on being fully automatable.
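
A hedged sketch of that intermediary’s core, assuming the simplest possible interface: publishers offer entities, requestors register interest, and a human-supplied mapping (possibly a small computation) connects the two. All the names here are invented for illustration.

```python
class Intermediary:
    """Matches data offered by publishers with data wanted by requestors."""

    def __init__(self):
        self.offers = {}    # entity name -> latest published record
        self.requests = {}  # requestor name -> details of what it wants

    def publish(self, entity, record, contact):
        # Publishers just send what they have, plus a contact email.
        self.offers[entity] = {"record": record, "contact": contact}

    def request(self, requestor, entity, contact, mapping=None):
        # Requestors say what they need; the mapping is supplied later by
        # a data administrator once the two sides have been matched up.
        self.requests[requestor] = {"entity": entity, "contact": contact,
                                    "mapping": mapping or (lambda r: r)}

    def deliver(self, requestor):
        req = self.requests[requestor]
        offer = self.offers.get(req["entity"])
        if offer is None:
            return None     # nothing matches yet: a to-do item for the admin
        return req["mapping"](offer["record"])

hub = Intermediary()
hub.publish("client", {"FullName": "Ada Lovelace"}, "teamA@example.com")
hub.request("crm", "client", "teamB@example.com",
            mapping=lambda r: {"name": r["FullName"].upper()})
print(hub.deliver("crm"))   # {'name': 'ADA LOVELACE'}
```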


If the requests for data had the property that any imports were idempotent, keyed by what amounts to a universal key for that specific entity, then the whole thing would also be fairly resilient to change. If the paradigm for the intermediary’s interface was a fancy to-do list based on incoming requests, one could prioritize the connectivity based on organizational needs.
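
The idempotency piece might look like this minimal sketch, where imports are keyed by a universal identifier so that receiving the same entity twice, after a retry or a replay, leaves the store unchanged; the key format is an assumption for illustration.

```python
store = {}

def import_entity(entity):
    # Keyed on a universal id: importing the same entity again simply
    # overwrites it with identical data, so retries and replays are harmless.
    store[entity["universal_id"]] = entity

record = {"universal_id": "client:42", "name": "Ada Lovelace"}
import_entity(record)
import_entity(record)      # duplicate delivery changes nothing
print(len(store))          # 1
```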


So, one programmer’s assignment would be to send all of the public entities to X. X might choose not to save them, since it isn’t a historic database, but it would accept anything and everything.


Another programmer’s assignment would be to add in some background task to get an entity from X. It might poll if that works, or it might just register for an event, and get data as it is received.


Neither programmer cares about the other’s schema or field names or any other thing. They deal with what they need or already have. No fiddling.


In their calls, they include a contact email, each time. Then they wait.


A data administrator sees that there is a publisher that kinda matches one or more requestors. Kinda. They can contact either side to get more info, or make suggestions. They have some ability to put small computations into the pipeline. They can contact others to sort through naming issues. At some point, hopefully with minimal changes to either side, they just flip a switch, and each instance of the entities gets translated. The receiving system springs to life.


Obviously, they do this in a testing environment. Once it is agreed upon, the publishers go into production first, it is verified, then the receivers go live. The schedule for this work is fairly deterministic and estimable after the agreement from testing is reached. 


Ah, but what about ‘state’ you say? Yeah, that is a problem. Some of it is mitigated by keys and idempotency. One can utilize that to support historic requests.


Other state variables more often tend to be embedded in the data itself. And with an event-handling mechanism, you could have one system offer up a status to be consumed by another. Which segues nicely into error handling. The intermediary could fill in errors with a computation: say one system polls for the state of the other since a specific time; the response could be computed to be ‘down’ if no heartbeat was published within that window.
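
A small sketch of that computed status, assuming the intermediary keeps the timestamp of the last heartbeat it saw from each publisher; the window length is an arbitrary illustrative choice.

```python
import time

last_heartbeat = {}   # publisher name -> timestamp of its last heartbeat

def record_heartbeat(publisher):
    last_heartbeat[publisher] = time.time()

def computed_status(publisher, window_seconds=30):
    seen = last_heartbeat.get(publisher)
    if seen is None or time.time() - seen > window_seconds:
        return "down"    # no heartbeat published within the window
    return "up"

record_heartbeat("billing")
print(computed_status("billing"))     # 'up'
print(computed_status("reporting"))   # 'down', it never published
```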


We can do similar things with abstract variables, like computing response time between heartbeats. It opens up a lot of great ways to monitor stuff.


A nice secondary attribute is that the intermediate system effectively documents the system interconnections. It can produce nice reports. 


But it won’t scale? Ya, sure it will. Not as one galactic intermediary system, but as many, many smaller, completely independent ones that could even be parallelized for big data flows. It won’t be super slow, but it’s not the mechanism one needs for real-time data anyway; that is usually custom code, and very expensive code, so you want to keep it to a minimum. It’s an 80/20 solution. If there are hundreds of these beasts running, you could still have one data management interface that binds them all together at an organizational level. Each instance has a synced persistent copy of only the data that it needs, not all of it.