Thursday, June 27, 2024

Guessable

For some technologies I have used, there is a philosophy that holds all of the functionality together. It is mostly consistent. It fits a pattern; it feels connected.

Its existence means that if you need to do something, and you understand at least part of that philosophy, you can take a pretty good guess at how to get it accomplished.

Usage is guessable.

For some technologies I have used, because of their construction, everything is erratic. All of the parts and pieces were built at radically different times, by very different people, following very different individual philosophies of construction. There isn’t one big philosophy holding it all together; rather, there are so many little islands that there is effectively no philosophy at all.

In these technologies, to do something you have to already know how to do it. This doesn't slow you down with common, repeating tasks, but if you have to leave your comfort zone, it gets difficult. You first have to search around a lot until you find the magic incantation that will make it happen. If what you are doing is complicated, you will need to spend a lot of time finding a bunch of incantations.

Any sort of guess you make will be wrong. Some of them will have very bad consequences, so you quickly learn not to guess.

Usage is only searchable, and often not easily searchable.

When you have to spend a lot of time searching, and it is hard since you don’t even know what you are looking for, everything takes way longer to accomplish. It is a lot of friction. It is slow. If you are used to smoother technology, it is frustrating.

It was pretty obvious to me when I was younger that I preferred to spend my days working in an environment that is guessable. The classic example back then was Unix versus Mainframes. Unix had a wickedly great philosophy and was mostly consistent. Mainframes could do all the same stuff, but it was brutal trying to figure out how to do it.

It plays out with programming languages as well. I prefer smaller languages that are consistent over larger languages that are a mess. I liked C, for example, tolerated a subset of Perl, and despised PHP. These days I prefer Golang over Rust, pretty much for the same reason. I return to AWK again and again, because it is just so sweet.

Technologies with tight philosophies can get broken over time. It is inevitable to some degree if it gets popular. Other people want to contribute, but they do not bother to learn or understand the philosophies that make the tech great. They are too keen to get going. They enhance it, but in ways that are not as guessable and require searching. I see those types of enhancements as mostly negative.

Big technologies should have tight philosophies. Sticking within the philosophy should be enforced somehow. It doesn’t matter if you can do it, if you cannot guess how to do it. Existing functionality is nothing if it isn’t frequently used. Oddly, we see a form of this within development projects as well. When usage is guessable, there is more reuse. When it is only searchable, the code is excessively redundant.

Thursday, June 20, 2024

Outside

For most complicated things in our world, there is an inside and an outside.

It’s a repeating pattern.

If there is some clump of complexity, most often not all of that complexity is fully transparent. It is only visible locally. You have to be in there to see it and understand it. That causes a difference in perception depending on your position.

It is at least a form of accidental encapsulation but it can also be intentional.

That lack of visibility forms a boundary, although it doesn’t have to be consistent or well-defined. It can wobble around.

What’s common though is that from the outside it is too easy to underestimate the complexity. It is also true that the inside complexity may twist the dynamics in very non-intuitive ways. From the outside, you cannot guess what lies beneath.

Fuzzy encapsulation borders are everywhere. Most of the specializations or skills in our modern world are heavily wrapped in them. Common knowledge is tiny.

The deeper you go, the more you can come to terms with this built-up complexity. This applies to history, science, art, politics, law, engineering, and any sort of company, institution, or organization. Everything, really. It’s evolved over the centuries. It’s all very convoluted, at least from the outside.

It is best to always assume that you are on the outside, and that whatever view you have is grossly oversimplified and likely not correct. This is not pessimism, just being realistic. Grounded.

If you are outside, then you should step lightly. There is more that you don’t understand than you know. If you expect that the story will get trickier with any increase in depth, you will not be disappointed.

Just blindly leaping into things is not always a bad idea, so long as you don’t lock your expectations in advance. You give it a try, it turns out to be different than you expected, but that is okay and not surprising. In that way, as you go deeper, you can keep adjusting to the new levels of complexity that you encounter. It is best if you always assume you haven’t ever reached the bottom. It’s a long way down.

This also decreases the frustration that things are not going as you expected them to go. Calms the fear. Your outside perspective may contain scattered elements of the truth, but the circumstances are always more nuanced. When it is different, you no longer get angry, but rather curious as to what parts of the complexity you might have missed. Being wrong is a learning opportunity, not a failure. The more you dig, the more you will unravel the complexity. Expect it to be endless.

This obviously helps with planning. You need something done, so you start with a list. But for lots of items on that list, you are effectively an outsider, thus getting it done will explode. It will get harder. The internal complexity will bubble up, throwing off your plans. Again though, if this is just playing out according to your expectations, it will be easier to adapt, instead of stubbornly forging ahead. You knew there were outside items, you accepted them as placeholders for lots more unknown inside items. With each new discovery, you revise your plans. Each new plan gets a little closer to being viable. A little better. One tiny step at a time.

Often as you are descending, you have to deal with impatience. Some people around you are stuck in their outside perspectives. They can be tricky to deal with, but trying to enlighten them often works. If you keep them in the dark, they lose confidence, which further agitates them. The relationship naturally cycles downward. If you spend the effort to carefully communicate the issues as translated into their perspective, they might be sympathetic, even if they don’t fully understand. It doesn’t always work, but being defensive or hiding the problems fails far more often and usually more dramatically. Communication is always worth a try.

It’s okay to be outside. We’re outside of most things. It’s normal to stumble around blindly in the dark, knocking things over accidentally. It’s just that wrapping yourself in a delusion that your perspective is strong is rarely constructive. It is far better to be a happy tourist than to walk yourself off a cliff because you refused to accept its existence.

Thursday, June 13, 2024

Dirty Writes

If there are two people who may edit the same data concurrently, as explained to me when I first started working, there is a serious problem. Someone way back then called it the dirty write problem.

Bob opens up the data and starts to edit. But then he goes to lunch.

During lunchtime, Alice opens up the same data, edits it, then saves it.

After lunch, Bob returns, makes a couple more changes, and then saves his work.

Bob's write will obliterate Alice’s changes. Alice did not know that Bob was also working on the data. Bob did not know that Alice had changed the data.
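The lunch scenario can be replayed in a few lines. This is only a rough sketch, assuming a hypothetical in-memory `store` with made-up `open_for_edit` and `save` helpers that overwrite blindly; it is not how any particular system is built.

```python
# Hypothetical in-memory store with no concurrency checks (last writer wins).
store = {"doc": {"title": "Draft", "body": "v1"}}

def open_for_edit(key):
    # Each editor works on a private copy of the record.
    return dict(store[key])

def save(key, record):
    # Blind overwrite: nothing checks whether the record changed in the meantime.
    store[key] = dict(record)

bob = open_for_edit("doc")        # Bob opens the data, then goes to lunch
bob["title"] = "Bob's title"

alice = open_for_edit("doc")      # Alice opens the same data during lunch
alice["body"] = "Alice's body"
save("doc", alice)                # Alice saves

save("doc", bob)                  # Bob saves after lunch
print(store["doc"])               # Alice's change to "body" is gone
```

Bob's copy still carried the old body, so saving it quietly put the old body back.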

This problem exists in far too many GUIs to count. Way back, people would actually put code in to solve it, but these days that is considered too complicated. Too long, didn’t code. Alice will just have to remember to check later to see if her changes actually did or did not persist. Lucky Alice. Bob should probably check all of the time too. Maybe they should both write down their changes on paper first...

One way to solve this is to have Bob lock the file for editing. Alice will find out at lunchtime that she cannot edit. Of course, if her changes are urgent, that will be a big problem, since Bob might be having a very long lunch. Alice will be upset.
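A minimal sketch of the locking approach might look like the following. The `LockingStore` class, its method names, and `LockedError` are all invented for illustration, and a real system would also need lock timeouts or an admin override for the very long lunch.

```python
class LockedError(Exception):
    pass

class LockingStore:
    """Hypothetical store that allows one editor per record at a time."""

    def __init__(self):
        self.data = {}
        self.locks = {}  # key -> name of the user holding the edit lock

    def open_for_edit(self, key, user):
        holder = self.locks.get(key)
        if holder is not None and holder != user:
            raise LockedError(f"{key} is locked by {holder}")
        self.locks[key] = user
        return dict(self.data[key])

    def save(self, key, user, record):
        if self.locks.get(key) != user:
            raise LockedError(f"{user} does not hold the lock on {key}")
        self.data[key] = dict(record)
        del self.locks[key]  # saving releases the lock

locking = LockingStore()
locking.data["doc"] = {"body": "v1"}

bob_copy = locking.open_for_edit("doc", "bob")   # Bob locks it, goes to lunch
try:
    locking.open_for_edit("doc", "alice")        # Alice tries at lunchtime
    alice_blocked = False
except LockedError:
    alice_blocked = True
```

Alice is refused until Bob saves, which is exactly the trade-off: no lost work, but no urgent edits either.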

Another way to solve the problem is to pop up a warning when Bob starts to make his post-lunch edits, saying the underlying data has changed. It would give Bob a screen to deal with it. It’s a bit tricky since Bob has already made changes, so any merge tool would be three-way at that point: the original, Bob’s version, and Alice’s version. This might hurt Bob’s brain, and it isn’t the easiest stuff to code either.
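To give a feel for why a merge across the original and the two edited versions is harder than it sounds, here is a very simplified field-by-field sketch. The `three_way_merge` function is hypothetical, it assumes records are flat dictionaries whose fields never hold `None`, and it only reports conflicts rather than resolving them.

```python
def three_way_merge(original, mine, theirs):
    """Merge two edited copies of a flat record against their common original.

    Returns (merged, conflicts). Conflicting fields are left out of merged
    and listed in conflicts so a person can decide.
    """
    merged, conflicts = {}, []
    for field in sorted(set(original) | set(mine) | set(theirs)):
        o, m, t = original.get(field), mine.get(field), theirs.get(field)
        if m == t:
            value = m          # both agree (or neither changed it)
        elif m == o:
            value = t          # only they changed it
        elif t == o:
            value = m          # only I changed it
        else:
            conflicts.append(field)  # both changed it, differently
            continue
        if value is not None:
            merged[field] = value
    return merged, conflicts

original = {"title": "Draft", "body": "v1", "owner": "ops"}
bob = {"title": "Bob's title", "body": "v1", "owner": "bob"}
alice = {"title": "Draft", "body": "Alice's body", "owner": "alice"}

merged, conflicts = three_way_merge(original, bob, alice)
```

Here the title and body merge cleanly, but `owner` was changed by both, so it still has to go in front of Bob on some kind of screen.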

A variation on the above is to just err out and lose Bob’s changes. Bob will be upset, it was a long lunch so now he can’t remember what he did earlier.
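The "just err out" variation is usually done with a version number on each record. The sketch below is one common way to do it, under the assumption of a hypothetical `VersionedStore` where every successful write bumps the version; the names are illustrative only.

```python
class StaleWriteError(Exception):
    pass

class VersionedStore:
    """Hypothetical store that rejects writes based on an out-of-date read."""

    def __init__(self):
        self.records = {}  # key -> (version, data)

    def put_initial(self, key, data):
        self.records[key] = (1, dict(data))

    def read(self, key):
        version, data = self.records[key]
        return version, dict(data)

    def write(self, key, base_version, data):
        current_version, _ = self.records[key]
        if base_version != current_version:
            raise StaleWriteError(
                f"{key} is at version {current_version}, edit was based on {base_version}"
            )
        self.records[key] = (current_version + 1, dict(data))

versioned = VersionedStore()
versioned.put_initial("doc", {"body": "v1"})

bob_version, bob_data = versioned.read("doc")      # before lunch
alice_version, alice_data = versioned.read("doc")  # during lunch
alice_data["body"] = "Alice's body"
versioned.write("doc", alice_version, alice_data)  # accepted, version becomes 2

bob_data["body"] = "Bob's body"
try:
    versioned.write("doc", bob_version, bob_data)  # based on version 1
    bob_rejected = False
except StaleWriteError:
    bob_rejected = True
```

Alice's work survives, and Bob at least finds out immediately instead of never.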

In that sense, locking Alice out, but giving her some way to just create a second copy seems better. But then you also need to add in some way to reconcile the two copies after lunch, so we are back to a diff screen again.

Because it is complicated and there are no simple solutions, it is ignored. Most systems are optimistic, assuming that overlapping edits don’t occur on shared data. They do nothing about it. And if work is lost, most people will forget about it or at least not remember the exact circumstances, so no valid bug reports get filed.

Occasionally I will run across a rare programmer who will realize that this is a problem. But then I usually have to explain that because we are working with a shoestring budget, it is best if they just look away. Software, it seems, is increasingly about looking away. Why solve a problem if you can pretend you didn’t see it?

Thursday, June 6, 2024

Data Modelling

The strength and utility of any software system is the data that it persists.

Persisting junk data is just wasting resources and making it harder to utilize any good data you have collected.

Storing partial information is also bad. If you should have collected more data, but didn’t, it can be difficult or impossible to go back and fix it. Any sort of attempt to fake it will just cause more problems.

All data is anchored in the real world, not just the digital one. That might not seem generally true, for example, with log files. But those log entries come from running on physical hardware somewhere and are tied to physical clocks. While they are data about software running in the digital realm, the act of running itself is physical and always requires hardware, thus the anchor.

All data has a form of uniqueness. For example, users in a system mostly match people in reality. When that isn’t true, it is a system account of some type, but those are unique too, and usually have one or more owners.

Temporal data is associated with a specific time. If the time bucket isn’t granular enough, the uniqueness could be improperly lost. That is an implementation bug, not an attribute of the data.
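As a small illustration of the bucket problem, consider two made-up events less than a second apart. The event names and timestamps below are invented, and `index_by` is just a hypothetical helper that keys events by whatever bucket function it is given.

```python
from datetime import datetime

# Two hypothetical events that happen within the same second.
events = [
    (datetime(2024, 6, 6, 9, 30, 15, 120000), "login"),
    (datetime(2024, 6, 6, 9, 30, 15, 870000), "logout"),
]

def index_by(events, bucket):
    index = {}
    for timestamp, name in events:
        # A later event silently replaces an earlier one with the same key.
        index[bucket(timestamp)] = name
    return index

by_second = index_by(events, lambda ts: ts.replace(microsecond=0))
by_microsecond = index_by(events, lambda ts: ts)

print(len(by_second), len(by_microsecond))
```

Keyed by the second, one of the two events disappears. The events were always distinct; the key just wasn't fine enough to show it.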

Geographical data has a place associated with it, and often a time range as well.

Events are temporal; inventory is geographical. History is temporal. We’re seeing a pattern. Data is always anchored, therefore it has some type of uniqueness. If it didn’t, it would at least have an order or membership somewhere.

Because data is unique we can always key it. The key can be composite or generated. There can be different types of unique keys for the same data, which is common when cross-referencing data between systems. For most systems, to get reasonable behavior we need keys so we can build indices. There are exceptions, but the default is to figure out exactly what makes the data unique.
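A rough sketch of the two kinds of keys, using made-up order records. The field names, the `build_index` helper, and the counter-based generated key are all assumptions for the sake of the example.

```python
import itertools

# Hypothetical orders; order_no is only unique within a customer.
orders = [
    {"customer": "acme", "order_no": 1, "total": 120},
    {"customer": "acme", "order_no": 2, "total": 75},
    {"customer": "zenith", "order_no": 1, "total": 40},
]

def build_index(rows, key_fields):
    """Index rows by a (possibly composite) key, refusing duplicates."""
    index = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in index:
            raise ValueError(f"duplicate key {key} for fields {key_fields}")
        index[key] = row
    return index

# Composite key: customer plus order number is what makes an order unique.
by_composite_key = build_index(orders, ("customer", "order_no"))

# Generated key: a counter hands out an identifier for each row.
next_id = itertools.count(1)
by_generated_key = {next(next_id): row for row in orders}

# Picking the wrong key is detected instead of silently losing rows.
try:
    build_index(orders, ("order_no",))
    wrong_key_detected = False
except ValueError:
    wrong_key_detected = True
```

Note that the generated key does nothing to prevent loading the same order twice, which is why figuring out what actually makes the data unique still matters.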

Data always has a structure. Composite data travels together, it does not make sense on its own. Some data is in a list, or if the order is unknown, a set.

Hierarchical data is structured as a tree. If subparts are not duplicated but cross-linked, it is a directed acyclic graph (DAG). If it is more arbitrarily linked, it is a graph, or possibly a hypergraph.

If data has a structure you cannot ignore it. Flattening a graph down to a list for example will lose valuable information about the data. You have to collect and store the data in the same structures as it exists.
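A tiny example of what flattening throws away, using two made-up graphs stored as adjacency sets. The node names and the `flatten` helper are purely illustrative.

```python
# Two different hypothetical graphs over the same four nodes.
graph_a = {"a": {"b", "c"}, "b": {"d"}, "c": set(), "d": set()}  # a branches
graph_b = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}       # a simple chain

def flatten(graph):
    # Keeps the nodes, drops every edge.
    return sorted(graph)

print(flatten(graph_a))
print(flatten(graph_b))
print(flatten(graph_a) == flatten(graph_b))  # True
print(graph_a == graph_b)                    # False
```

Once flattened, the two are indistinguishable, and nothing stored in the list will let you rebuild the links.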

Every datum has a name. Sometimes it is not well known; sometimes it is very generalized. All data should be given a self-describing name. It should never be misnamed, but occasionally we can skip the name when it can be inferred.

Understanding the name of a datum is part of understanding the data. The two go hand in hand. If you cannot name the parts, you do not understand the whole.

If you don't understand the data, the likelihood that you will handle it incorrectly is extremely high. Data is rarely intuitive, so assumptions are usually wrong. You don’t understand a domain problem until you at least understand all of the data that it touches. You cannot solve a problem for people if you don’t understand the problem itself. The solution will not cover the problem correctly.

There are many forms of derived data. Some derivations are the same information just in a different form. Some are composites. Some are summaries of the data along different axes. An average for some numbers is a summary. You can get to the average from the numbers, just redo the calculation, but you cannot go the other way. There are infinitely many sets of numbers that would give the same average.
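The one-way nature of a summary is easy to see with a few made-up numbers, using the standard library's `statistics.mean`.

```python
from statistics import mean

# Three different hypothetical data sets.
readings_a = [2, 4, 6]
readings_b = [4, 4, 4]
readings_c = [1, 1, 10]

averages = [mean(readings_a), mean(readings_b), mean(readings_c)]
print(averages)
```

All three average out to 4, so given only the 4 there is no way to tell which set it came from.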

Technically, derived data can be unique. You can have the composite key for all of the attributes of the underlying data that you are combining to get the value. Realistically we don’t do that very often. It’s often cheaper and safer to just regenerate it as needed.

Most data that is useful is complex. It is a large combination of all sorts of intertwined other data structures. As we present it in different ways it helps people make better decisions. It has recursively diverse sub-structures, so it might be a list of trees of graphs of lists, for example. From that perspective, it is too complex for most people to grapple with, but we can adequately work through each of the subparts on their own and then recombine them effectively.

Sometimes capturing all of the available data for something is just too large. We ignore some dimensions or attributes and only capture parts of the data relative to a given context. That is common, but for any given piece of data within that context, we still don’t want it to be partial, since partial data is junk. A classic example is to capture a graph of roads between cities without capturing it as a geographical database. We dropped the coordinate information, but we have still captured enough to properly identify the cities, so later, if required, we can reconnect the two different perspectives. Thus you may not want or need to capture everything, but you still have to be careful about which aspects you don’t capture, which means you still need to understand them.
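A small sketch of the roads example, with entirely made-up city identifiers and coordinates. The point is only that the road graph keeps a proper identifier for each city, so the coordinate data, captured separately, can be joined back in later; `roads_with_coordinates` is a hypothetical helper.

```python
# Road graph captured without any coordinates, keyed by city identifier.
roads = {
    "city_a": {"city_b"},
    "city_b": {"city_a", "city_c"},
    "city_c": {"city_b"},
}

# Coordinate data captured separately (values are invented).
coordinates = {
    "city_a": (10.0, 20.0),
    "city_b": (11.5, 21.0),
    "city_c": (13.0, 19.5),
}

def roads_with_coordinates(roads, coordinates):
    """Rejoin the two perspectives using the shared city identifier."""
    segments = []
    for start, neighbours in roads.items():
        for end in neighbours:
            if start < end:  # list each two-way road only once
                segments.append((coordinates[start], coordinates[end]))
    return sorted(segments)

segments = roads_with_coordinates(roads, coordinates)
```

If the roads had been captured with sloppy or ambiguous city names instead, this join would not be possible.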

Data modeling, then, is two very different things. First, you are setting up structures to hold the data, but you are also putting in constraints to restrict what you hold. You have to understand what you need to capture as well as what you need to avoid accidentally capturing. What you cannot store is as important as what you can store. A classic example is an organization where people can report to different bosses at the same time. Shoving that into a tree will not work; it needs to be a DAG. You would not want that in a general graph, since it would allow for cycles. If you need to know who is reporting to whom, you need to capture it correctly. Getting it wrong misleads the users, which is very bad.
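The reporting example can be sketched as a structure that holds multiple bosses per person while refusing anything that would create a cycle. The `ReportingLines` class and the names in it are hypothetical; it is one possible way to enforce the constraint, not the only one.

```python
class ReportingLines:
    """Hypothetical org structure: many bosses allowed, cycles refused."""

    def __init__(self):
        self.bosses = {}  # person -> set of direct bosses

    def _reaches(self, start, target):
        # Walk up the reporting chain from start, looking for target.
        stack, seen = [start], set()
        while stack:
            person = stack.pop()
            if person == target:
                return True
            if person in seen:
                continue
            seen.add(person)
            stack.extend(self.bosses.get(person, ()))
        return False

    def add_report(self, person, boss):
        # If boss already reports (directly or not) to person, this link
        # would close a loop, so it must not be stored.
        if self._reaches(boss, person):
            raise ValueError(f"{person} -> {boss} would create a cycle")
        self.bosses.setdefault(person, set()).add(boss)

org = ReportingLines()
org.add_report("dana", "alice")   # dana reports to two bosses,
org.add_report("dana", "bob")     # which a tree could not hold
org.add_report("alice", "carol")

try:
    org.add_report("carol", "dana")  # carol would end up above and below dana
    cycle_rejected = False
except ValueError:
    cycle_rejected = True
```

The structure accepts what a tree cannot, and rejects what a general graph would happily store.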

Limits of understanding and time constraints are often used as excuses for not spending time to properly model the data, but most difficult bugs and project delays come directly from not properly modeling the data. Data anchors all of the code, so if its representations are messed up, any code built on top is suspect and often a huge waste of time. Why avoid understanding the data only to rewrite the code over and over again? It’s obviously slower.

There is a lot more to say about modeling. It is time-consuming and pedantic, but it is the foundation of all software systems. There is a lot to understand, but skipping it is usually a disaster. Do it correctly and the code is fairly straightforward to write. Do it poorly and the hacks quickly explode into painful complexity that always spirals out of control.