Sunday, November 19, 2017

Bombproof Data Entry

The title of this post is from a magazine article that I barely remember reading back in the mid-80s. Although I’ve forgotten much of what it said, its underlying points have stuck with me all of these decades.

We can think of any program as having an ‘inside’ and an ‘outside’. The inside of the program is any and all code that a development team has control of; that they can actively change. The outside is everything else. That includes users, other systems, libraries, the OS, shared resources, databases, etc. It even includes code from related internal teams that is effectively unchangeable by the team in question. Most of the code utilized in large systems is really outside of the program; often there are effectively billions of lines, written by thousands of coders, that could get executed.

The idea is that any data coming from the outside needs to be checked first, and rejected if it is not valid. Bad data should never circulate inside a program.

Each datum within a program is finite. That is, for any given variable, there are only a finite number of different possible values that it can hold. For a data type like floats, there may be a mass of different possibilities, but it is still finite. Sets of data, like a string of characters, can essentially be infinite in size, but practically they are usually bounded by external constraints. In that sense, although we can collect a huge depth of permutations, the breadth of the data is usually limited. We can use that attribute to our advantage.

In all programming languages, for a variable like an integer, we allow any value between the minimum and the maximum, and there are usually some out-of-band signals possible as well, like NULL or Overflow. However, most times when we use an integer, only a tiny fraction of this set is acceptable. We might, for example, only want to collect an integer between one and ten. So what we really want to do as the data comes in from the outside is to reject all other possibilities. We only want a variable to hold our tiny little subset, e.g. ‘int1..10’. The tightest possible constraint.

Now, it would be nice if the programming language allowed us to explicitly declare this data type variation and nicely reject any outside data, but I believe for all modern languages we have to do this ourselves at runtime. Then, for example, the user would enter a number into a widget in the GUI, but before we accept that data we would call a function on it and possibly trigger a set of Validation Errors (errors on data should never be singular; they should always be sets) if our exact constraints are not met.
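
As a rough illustration, here is a minimal sketch in Java of what such a constrained type might look like; the names (BoundedInt, validate, of) are invented for this example, and the validation deliberately returns a set of errors rather than a single one:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: an integer that only accepts values in 1..10.
final class BoundedInt {
    private final int value;

    private BoundedInt(int value) { this.value = value; }

    // Validation always returns a set of errors, never a single one.
    static Set<String> validate(int candidate) {
        Set<String> errors = new LinkedHashSet<>();
        if (candidate < 1 || candidate > 10) {
            errors.add("value must be between 1 and 10, got " + candidate);
        }
        return errors;
    }

    // The only way in: bad data is rejected at the boundary.
    static BoundedInt of(int candidate) {
        Set<String> errors = validate(candidate);
        if (!errors.isEmpty()) {
            throw new IllegalArgumentException(String.join("; ", errors));
        }
        return new BoundedInt(candidate);
    }

    int get() { return value; }
}
```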

Internally, we could pass around that variable ad nauseam, without having to ever check or copy it. Since it made it inside, it is now safe and exactly what is expected. If we needed to persist this data, we wouldn’t need to recheck it on the way to the database, but if the underlying schema wasn’t tightly in sync, we would need to check it on the way back in.

Overall, this gives us a pretty simple paradigm for both structuring all validation code, and for minimizing any data fiddling. However, dealing with only independent integers is easy. In practice, keeping to this philosophy is a bit more challenging, and sometimes requires deep contemplation and trade-offs.

The first hiccup comes from variables that need discontinuous ranges. That’s not too hard; we just need to think of them as concatenated, such as ‘int1..10,20..40’, and we can even allow for overlaps like ‘int1..10,5..7,35,40..45’. Overlaps are not efficient, but not really problematic.
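
A hedged sketch of how a discontinuous range like ‘int1..10,20..40’ might be checked, again with invented names and the same set-of-errors convention:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a discontinuous range held as a list of inclusive
// {low, high} pairs. Overlaps are allowed, just redundant.
final class RangeSet {
    private final List<int[]> ranges;

    RangeSet(List<int[]> ranges) { this.ranges = ranges; }

    Set<String> validate(int candidate) {
        Set<String> errors = new LinkedHashSet<>();
        boolean inside = ranges.stream()
                .anyMatch(r -> candidate >= r[0] && candidate <= r[1]);
        if (!inside) {
            errors.add(candidate + " is outside the allowed ranges");
        }
        return errors;
    }
}

// Usage for 'int1..10,20..40':
//   RangeSet constraint = new RangeSet(List.of(new int[]{1, 10}, new int[]{20, 40}));
//   constraint.validate(15);   // one error
//   constraint.validate(25);   // empty set, so the value is accepted
```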

Of course for floating-point, we get the standard mathematical range issues of open/closed, which might look like ‘float[0..1)’, noting of course that my fake notation now clashes with the standard array notation in most languages and that that ambiguity would need to be properly addressed.

Strings might seem difficult as well, but if we take them to be containers of characters and realize that regular expressions do mostly articulate their constraints, we get data types like ‘string*ab?’ which rather nicely restrict their usage. Extending that, we can quite easily apply wildcards and set theory to any sort of container of underlying types. In Data Modeling, I also discussed internal structural relationships such as trees, which can be described and enforced as well. That then mixes with what I wrote in Containers, Collections and Null, so that we can specify variable inter-dependencies and external structure. That’s almost the full breadth of a static model.
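
Continuing the same pattern, a string constrained by a regular expression (and an external length bound) might look something like this sketch; the class name and the particular bounds are purely illustrative:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative only: a string constrained by a regular expression plus an
// external length bound (e.g. one dictated by the presentation layer).
final class ConstrainedString {
    private final Pattern pattern;
    private final int maxLength;

    ConstrainedString(String regex, int maxLength) {
        this.pattern = Pattern.compile(regex);
        this.maxLength = maxLength;
    }

    Set<String> validate(String candidate) {
        Set<String> errors = new LinkedHashSet<>();
        if (candidate == null) {
            errors.add("a value is required");
            return errors;
        }
        if (candidate.length() > maxLength) {
            errors.add("value is longer than " + maxLength + " characters");
        }
        if (!pattern.matcher(candidate).matches()) {
            errors.add("value does not match the required format");
        }
        return errors;
    }
}
```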

In that way, the user might play with a set of widgets or canvas controls to construct some complicated data model which can be passed through validation and flagged with all of the constraint violations. With that strength of data entry, the rest of the program is nearly trivial. You just have to persist the data, then get it back into the widgets later when requested.

A long time ago, programmers tended towards hardcoding the size of their structures. Often that meant unexpected problems from running into these constraints, which usually involved having to quickly re-deploy the code. Over the decades, we’ve shifted far away from those types of limitations and shouldn’t go backward. Still, the practical reality is that most variable-sized data in a system has external bounds that should be in place but are now often neglected.

For example, in a system that collects users’ names, if a key feature is to always produce high-quality documentation, such as direct client marketing materials, the need for the presentation to never be broken dictates the effective number of characters that can be used. You can’t, for example, have a first name that is 3000 characters long, since it would wrap over multiple lines in the output and look horrible. That type of unspecified data usage constraint is mostly ignored these days but commonly exists within most data models.

Even if the presentation is effectively unconstrained, resources are generally not. Allowing a user to have a few different sets of preferences is a nice feature, but if enough users abuse that privilege then the system could potentially be crippled in performance.

Any and all resources need to be bounded, but modifying those bounds should be easily configurable. We want both sides of this coin.

In an Object-Oriented language, we might choose to implement these ideas by creating a new object for every unique underlying constrained type. We would also need an object for every unique Container and every unique Collection. All of them would essentially refuse to instantiate with bad data, returning a set of error messages. The upper-level code would simply leverage the low-level checks, building them up until all of the incoming data was finally processed. Thus validation and error handling become intertwined.
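
As a rough sketch of that layering, a higher-level object might simply union the error sets of its constrained parts and refuse to instantiate if the set is non-empty; all of the names and constraints below are invented for illustration:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: every unique constrained type gets its own object, and
// higher-level objects simply aggregate the lower-level error sets.
final class Quantity {
    static Set<String> validate(int candidate) {
        Set<String> errors = new LinkedHashSet<>();
        if (candidate < 1 || candidate > 10) {
            errors.add("quantity must be between 1 and 10");
        }
        return errors;
    }
}

final class OrderLine {
    private final int quantity;
    private final String productCode;

    private OrderLine(int quantity, String productCode) {
        this.quantity = quantity;
        this.productCode = productCode;
    }

    // Upper-level validation is just the union of the lower-level error sets.
    static Set<String> validate(int quantity, String productCode) {
        Set<String> errors = new LinkedHashSet<>(Quantity.validate(quantity));
        if (productCode == null || !productCode.matches("[A-Z]{3}-\\d{4}")) {
            errors.add("product code must look like ABC-1234");
        }
        return errors;
    }

    // Refuses to instantiate with bad data.
    static OrderLine of(int quantity, String productCode) {
        Set<String> errors = validate(quantity, productCode);
        if (!errors.isEmpty()) {
            throw new IllegalArgumentException(String.join("; ", errors));
        }
        return new OrderLine(quantity, productCode);
    }
}
```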

Now, this might seem like a huge number of objects, but once each one was completed it would be highly leverageable and extremely trustworthy. Each time the system is extended, the work would actually get faster, since the increasing majority of the system’s finite data would already have been both crafted and tested. Building up a system in this manner is initially slower, but the growing reductions in bugs, testing, coding, etc. ultimately win the day.

It is also possible to refactor an existing system into this approach, rather mindlessly, by gradually seeking out any primitives or language objects and slowly replacing them. This type of non-destructive refactoring isn’t particularly enjoyable, but it's safe work to do when you don’t feel like thinking, and it is well worth it as the system grows.

It is also quite possible to apply this idea to other programming paradigms; in a sense, this is really just a tighter variation on the standard data structure ideas. As well as primitive functions to access and modify the data, there is also a means to validate it and return zero or more errors. Processing only moves forward when the error count is zero.

Now one big hiccup I haven’t yet mentioned is cross-variable type constraints. The most common example is that a user selecting Canada as their country should then be able to pick from a list of Provinces, but if they select the USA, it should be States. We don’t want them to have to select from the rather huge, combined set of all sub-country breakdowns; that would be awful. The secondary, dependent variable effectively changes its enumerated data type based on the primary variable. For practical reasons, I am going to assume that all such inter-type relationships are invertible (mostly to avoid confusing the users). So we can cope with these types of constraints by making the higher-level types polymorphic down to some common base type. This might happen as a union in some programming languages or an underlying abstract object in others. Then the primary validation would have to check the variable, then switch on its associated sub-type validation for every secondary variable. This can get a bit complex to generalize in practice, but it is handled by essentially moving up to the idea of composite variables that then know their sub-type inter-relationship mappings. In our case above, Location would be aware of the Country and Province/State constraints, with each subtype knowing its own enumerations.
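
A minimal sketch of such a composite variable, with a deliberately truncated set of enumerations; the names and the Map-based representation are just one possible way to express the sub-type mapping:

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: the composite knows the mapping between the primary
// variable (country) and the enumeration of its dependent variable.
final class Location {
    private static final Map<String, Set<String>> SUBDIVISIONS = Map.of(
        "Canada", Set.of("Ontario", "Quebec", "British Columbia", "Alberta"),
        "USA", Set.of("New York", "California", "Texas", "Washington")
    );

    static Set<String> validate(String country, String subdivision) {
        Set<String> errors = new LinkedHashSet<>();
        Set<String> allowed = SUBDIVISIONS.get(country);
        if (allowed == null) {
            errors.add("unknown country: " + country);
            return errors;   // no point checking the dependent field
        }
        if (!allowed.contains(subdivision)) {
            errors.add(subdivision + " is not a valid subdivision of " + country);
        }
        return errors;
    }
}
```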

So far everything I’ve talked about is with respect to static data models. This type of system has extremely rigid constraints on what it will and will not accept. That actually covers the bulk of most modern systems, but it is not very satisfying since we really need to build smarter and more adaptable technologies. The real world isn’t static, so the systems we build to model and interact with it shouldn’t be either.

Fortunately, if we see these validations as code, and in this case easily computable code, then we can ‘nounify’ that code into data itself. That is, all of the rules to constrain the variables can be moved around as variables themselves. All we need to do is allow the outside to have the power to request extensions to our data model. We can do that by explicitly asking for the specific data type, the structural changes, and some reference or naming information. Outside, users or code can explicitly or even implicitly supply this information, which is passed around internally. When it does need to go external again, say to the database, the inside code essentially follows the exact same behavior in requesting a model extension. For any system that supports this basic dynamic model protocol, it will just be seamlessly interconnected. The sophistication of all of the parts will grow as the underlying data models are enhanced by users (with the obvious caveat that there needs to be some configurable approach to backfilling missing data as well).

We sort of do this right now, when we drive domain data from persistence. Using Country as an example again, many systems will store this type of enumerated set in a domain table in the database and initialize it at startup. More complex systems might even have an event mechanism to notify the runtime code to re-initialize. These types of system features are useful both for constraining the initial implementation and for being able to adjust it on the fly. Rather than do this on an ad hoc basis though, it would be great if we could just apply it to the whole data model, to be used as needed.
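
A sketch of that familiar approach, assuming a hypothetical country_domain table; the point is just that the enumeration used for validation is loaded (and reloadable) from persistence rather than hardcoded:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: the enumeration behind the Country constraint is loaded
// from a (hypothetical) country_domain table, and can be reloaded on demand.
final class CountryDomain {
    private final Set<String> codes = new LinkedHashSet<>();

    void reload(Connection db) throws SQLException {
        codes.clear();
        try (Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT code FROM country_domain")) {
            while (rs.next()) {
                codes.add(rs.getString("code"));
            }
        }
    }

    Set<String> validate(String candidate) {
        Set<String> errors = new LinkedHashSet<>();
        if (!codes.contains(candidate)) {
            errors.add("unknown country code: " + candidate);
        }
        return errors;
    }
}
```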

Right now we spend an inordinate amount of effort hacking in partial checks and balances in random locations to inconsistently deal with runtime data quality. We can massively reduce that amount of work, making it easier and faster to code, by sticking to a fairly simple model of validation. It also has the nifty side-effect that it will reduce both bloat and CPU usage. Obviously, we’ve known this since at least the mid-80s; it just keeps getting lost and partially reinvented.

Saturday, October 28, 2017

Freedom

A big part of programming culture is the illusion of freedom.

We pride ourselves on how we can tackle modern problems by creatively building new and innovative stuff. When we build for other programmers, we offer a mass of possible options because we don’t want to be the ones who restrict their freedoms.

We generally believe that absolutely every piece of code is malleable and that releasing code quickly is preferable over stewing on it for a long time. In the last decade, we have also added in that coding should be fun, as a fast-paced, dynamic response to the ever-changing and unknowable needs of the users. Although most of us are aware that the full range of programming knowledge far exceeds the capacity of our brains, we still express overconfidence in our abilities while contradictorily treating a lot of the underlying mechanics as magic.

Freedom is one of our deepest base principles. It appears to have its origin in the accelerated growth of the industry and the fact that every 5 to 10 years some brand new, fresh, underlying technology springs to life. So the past, most programmers believe, has very little knowledge to help with these new emerging technologies.

That, of course, is what we are told to believe and it has been passed down from generations of earlier programmers. Rather than fading, as the industry ages, it has been strengthening over the last few decades.

However, if we examine programming closely, it does not match that delusion. The freedoms that we proclaim and try so hard to protect are the very freedoms that keep us from moving forward. Although we’ve seen countless attempts to reverse this slide, as the industry grows most new generations of programmers cling to these beliefs and we actively ignore or push out those that don’t.

Quite obviously, if we want a program to even be mildly ‘readable’, we don’t actually have the freedom to name the variables anything we want. They are bound to the domain; being inconsistent or misusing terminology breeds confusion. It’s really quite evident that if a piece of data flows through a single code base, it should always have the same name everywhere, but each individual programmer doesn’t want to spend their time finding the names of existing things, so they call it whatever they want. Any additional programmers then have to cope with multiple shifting names for the exact same thing, thus making it hard to quickly understand the flow.

So there isn’t really any freedom to name variables ‘whatever’. If the code is going to be readable then the names need to be correct, and in a large system, those names may have been picked years or even decades earlier. We’ve always talked a lot about following standards and conventions, but even cursory inspections of most large code bases show that this usually fails in practice.

We love the freedom to structure the code in any way we choose. To creatively come up with a new twist on structuring the solution. But that only really exists at the beginning of a large project. In order to organize and build most systems, as well as handle errors properly, there aren’t really many degrees of freedom. There needs to be a high-level structure that binds similar code together in an organized way. If that doesn’t exist, then it’s just a random pile of loosely associated code, with some ugly spaghetti interdependencies. Freely tossing code anywhere will just devalue it, often to the point of hopelessness. Putting it with other, similar pieces of code will make it easy to find and correct or enhance later.

We have a similar issue with error handling. It is treated as the last part of the effort, to be sloppily bolted on top. The focus is just on getting the main logic flow to run. But the constraints of adding proper error handling have a deep and meaningful impact on the structure of the underlying code. To handle errors properly, you can’t just keep patching the code at whatever level, whenever you discover a new bug. There needs to be a philosophy or structure that is thought out in advance of writing the code. Without that, the error handling is just arbitrary, fragile and prone to not working when it is most needed. There is little freedom in properly handling errors; the underlying technology, the language and the constraints of the domain mostly dictate how the code must be structured if it is going to be reliable.

All underlying technologies have their own inherent strengths and weaknesses. If you build on top of badly applied or weak components, they set the maximum quality of your system. If you, for instance, just toss some randomish denormalized tables into a jumble in an RDBMS, no matter how nice the code that sits on top, the system is basically an untrustworthy, unfixable mess. If you fix the schema problems, all of the other code has to change. Everything. And it will take at least twice as long to understand and correct it as it would to just throw it away and start over. The value of code that ‘should’ be thrown away is near zero or can even be negative.

So we really don’t have the freedom to just pick any old technology and to use it in any old, creative way we choose. That these technologies exist and can or should be used bounds their usage and ultimately the results as well. Not every tech stack is usable for every system, and picking one imposes severe, mostly unchangeable limitations. If you try to build a system for millions of users on top of a framework that can only really handle hundreds of them, it will fail. Reliably.

It might seem like we have the freedom to craft new algorithms, but even here it is exceptionally limited. If, for example, you decide to roll your own algorithm from scratch, you are basically taking a huge step backward, maybe decades. Any existing implementations may not be perfect, but generally, the authors have based their work on previous work, so that knowledge percolates throughout the code. If they didn’t do that research then you might be able to do better, but you’ll still need to at least know and understand more than the original author. In an oversimplified example, one can read up on sorting algorithms and implement a new variant or just rely on the one in their language’s library. Trying to work it out from first principles, however, will take years.

To get something reasonable, there isn’t much latitude, thus no freedom. The programmers that have specialized in implementing algorithms have built up their skills by having deep knowledge of possibly hundreds of algorithms. To keep up, you have to find that knowledge, absorb it and then be able to practically apply it in the specific circumstance. Most programmers aren’t willing to do enough legwork, thus what they create is crude. At least one step backward. So, practically, they aren’t free to craft their own versions; rather, they are chained by time and their own capabilities.

When building for other programmers it is fun to add in a whole boatload of cool options for them to play with, but that is only really useful if changing any subset of the options is reliable. If the bulk of the permutations have not been tested and changing them is very likely to trigger bugs, then the options themselves are useless. Just wasted make-work. Software should always default to the most reasonable, expected options, and the most useful code fully encapsulates the underlying processing.

So, options are not really that useful, in most cases. In fact, in the name of sanity, most projects should stay as close as possible to the default deployments. Reconfiguring the stuff to be extra fancy increases the odds of running into really bad problems later. You can’t trust that the underlying technology will hold any invariants over time, so you just can’t rely on them, and it should be a very big and mostly unwanted decision to make any changes away from the default. Unfortunately, the freedom to fiddle with the configurations is tied to causing seriously bad long-term problems. Experience makes one extraordinarily suspicious of moving away from the defaults unless your back is up against the wall.


As far as interfaces go, the best ones are the ones that the user expects. A stupidly clever new alternative on a date selection widget, for example, is unlikely to be appreciated by the users, even if it does add some new little novel capability. That break from expectations generally does not offset a small increase in functionality. Thus, most of the time, the best interface is the one that is most consistent with the behaviors of the other software that the user uses. People want to get their work done, not to be confused by some new ‘odd’ way of doing things. In time, they do adjust, but for them to actually ‘like’ these new features, they have to be comfortably close to the norm. Thus, for most systems, most of the time, the user’s expectations trump almost all interface design freedoms. You can build something innovative, but don’t be surprised if the users complain and gradually over time they push hard to bring it back to the norm. Only at the beginning of some new technological shift is there any real freedom, and these are also really constrained by previous technologies and by human-based design issues.

If we dig deeply into software development, it becomes clear that it isn’t really as free as we like to believe. Most experienced programmers who have worked on the same code bases for years have seen that over time the development tends to be refined, and it is driven back towards ‘best practices’ or at least established practices. Believing there is freedom and choosing some eclectic means of handling the problems is usually perceived as ‘technical debt’. So ironically, we write quite often about these constraints, but then act and proceed like they do not exist. When we pursue false freedoms, we negatively affect the results of the work. That is, if you randomly name all of the variables in a system, you are just shooting yourself in the foot. It is a self-inflicted injury.

Instead, we need to accept that for the bulk of our code, we aren’t free to creatively whack it out however we please. If we want good, professional, quality code, we have to follow fairly strict guidelines, and do plenty of research, in order to ensure that result.

Yes, that does indeed make coding boring. And most of the code that programmers write should be boring. Coding is not and never will be the exciting creative part of building large systems. It’s usually just the hard, tedious, pedantic part of bringing a software system to life. It’s the ‘working’ part of the job.

But that is not to say that building software is boring. Rather, it is that the dynamic, creative parts of the process come from exploring and analyzing the domain, and from working around those boundaries in the design. Design and analysis, for most people, are the interesting parts. We intrinsically know this, because we’ve so often seen attempts to inject them directly into coding. Collectively, most programmers want all three parts to remain blurred together so that they don’t just end up as coding monkeys. My take is that we desperately need to separate them, but software developers should be trained and able to span all three. That is, you might start out coding from boring specs, learn enough to handle properly designing stuff, and in time, when you’ve got some experience in a given domain, go out and actually meet and talk to the users. One can start as a programmer, but they should grow into being a software developer as time progresses. Also, one should never confuse the two. If you promote a programmer to run a large project, they don’t necessarily have all of the necessary design and analysis skills, no matter how long they have been coding. Missing skills spell disaster.

Freedom is a great myth and selling point for new people entering the industry, but it really is an illusion. It is sometimes there, at very early stages, with new technologies, but even then it is often false. If we can finally get over its lack of existence, and stop trying to make a game out of the process, we can move forward on trying to collect and enhance our already massive amount of knowledge into something that is more reliably applied. That is, we can stop flailing at the keyboards and start building better, more trustworthy systems that actually do make our lives better, instead of just slowly driving everyone nuts with this continuing madness and sloppy results we currently deliver.

Sunday, October 15, 2017

Efficiency and Quality

The strongest process for building software would be to focus on getting the job done efficiently while trying to maximize the quality of the software.

The two objectives are not independent, although it is easy to miss their connection. In its simplest terms, efficiency would be the least amount of work to get the software into production. Quality then might seem to be able to range freely.

However, at one end of the spectrum, if the quality is really bad, we do know that it sucks in tonnes of effort to just patch it up enough to barely work. So, really awful quality obviously kills efficiency.

At the other end though, it most often seems like quality rests on an exponential scale. That is, getting the software to be 1% better requires something like 10x more work. At some point, perfection is likely asymptotic, so dumping mass amounts of effort into just getting trivial increases in quality also doesn’t seem to be particularly efficient either.

Are they separated in the middle?

That too is unlikely in that quality seems to be an indirect byproduct of organization. That is, if the team is really well-organized, then their work will probably produce a rather consistent quality. If they are disorganized, then at times the quality will be poor, which drags down the quality of the whole. Disorganization seems to breed inefficiency.

From practical experience, mostly I’ve seen that rampant inefficiency results in consistently poor quality. They go hand in hand. A development team not doing the right things at the right time will not accidentally produce the right product. And they certainly won’t do it efficiently.

How do you increase both?

One of the key problems with our industry is the assumption that ‘code’ is the only vital ingredient in software. That if there were only more ‘code’, that would fix everything. Code, however, is just a manifestation of the collected understanding of the problem to solve. It’s the final product, but certainly not the only place where quality and efficiency are noticeable.

Software development has 5 distinct stages: it goes from analysis to design, then into coding and testing and finally, it is deployed. All five stages must function correctly for the effort to be efficient. Thus, even if the coding was fabulous, the software might still be crap.

Little or hasty analysis usually results in unexpected surprises. These both eat into efficiency, but also force short-cuts which drag down quality. If you don’t know what you are supposed to build or what data it will hold then you’ll end up just flailing around in circles hoping to get it right.

Designing a system is mostly organization. A good design lays out strong lines between the pieces and ensures that each line of code is located properly. That helps in construction and in testing. It’s easier to test well-written code, and it's inordinately cheaper to test code if you properly know the precise impact of any of the changes.

On top of that, you need to rely on the right technologies to satisfy any engineering constraints. If you pick the wrong technologies then no amount of effort will ever meet the specifications. That’s a key aspect of design.

If the operational requirements never made it into the analysis and design, the costs of supporting the system will be extraordinary, which will eventually eat away at the resources available for the other stages. A badly performing system will rapidly drag down its own development.

Realistically, a focus on efficiency, across all of the stages, is the best way to ensure maximum quality. There are other factors, of course, but unless the software development process is smoothly humming along, they aren’t going to make a significant impact. You have to get the base software production flow correct first, then focus on refining it to produce high-quality output. It’s not just about coding. Generating more bad code is never going to help.

Sunday, October 1, 2017

Rainy Days

When first we practice to code, we do, of course, worry most about branching between different instructions and repeating similar blocks of work over and over again.

In time, we move on to longer and more complex manipulations.

Once those start to build up, we work to cobble them together. Continually building up larger and larger arrangements, targeting bigger features within the software.

That’s nice and all, but since we’ve usually only focused on getting the code to run, it really is only reliable on ‘sunny’ days. Days when there are no clouds, no rain, no wind, etc. They are warm and pleasant. When everything just works.

Code, however, needs to withstand all of the elements; it runs in a cruel, cruel world where plenty of unexpected things occur at regular frequencies. Storms come often, without warning.

Rainy day coding is a rather different problem to solve. It means expecting the worst and planning for it. It also means building the code in a way that its rainy day behavior is both predictable and easily explainable.

For instance, any and all resources might be temporarily unavailable. And they might go down in any number of permutations. What happens then should be easily explainable, and it should match that explanation. The code should also ensure that during the outages no data is lost, no user uninformed, no questions unanswered. It needs to be trustworthy.
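
For example, one small piece of that predictable behavior might be a bounded retry around a flaky resource; this is only a sketch, with invented names, of the sort of explainable policy meant here:

```java
import java.util.concurrent.Callable;

// Illustrative only: a bounded retry with a pause, so a temporary outage is
// survived without losing the request, and a clear error surfaces if the
// resource stays down longer than we are prepared to wait.
final class Retry {
    static <T> T withRetries(Callable<T> call, int attempts, long pauseMillis)
            throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;                          // remember why it failed
                if (i < attempts - 1) {
                    Thread.sleep(pauseMillis);     // wait for the storm to pass
                }
            }
        }
        // The outage outlasted our patience; report it in an explainable way.
        throw new Exception("resource unavailable after " + attempts + " attempts", last);
    }
}
```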

Any data coming from the outside might be wrong. Any events might occur in a weird order. Any underlying technology might suddenly behave erratically. Any piece of hardware might fail. Anything at all could happen...

If that sounds like a lot of work, it is! It is at least an order of magnitude harder than sunny day coding. It involves a lot of digging to fully understand rainy days, or tornadoes, or hurricanes or even earthquakes. You can’t list out the correct instructions for a rainy day if you’ve never thought about or tried to understand them.

As software eats the world, it must live up to the cruelties that exist out there. When it doesn’t, it might be temporarily more convenient on the sunny days, but it can actually spawn off more work than it saves when it rains. The overall effect can be quite negative. We were gradually learning how to write robust systems, but as the underlying complexity spiraled out of control these skills have diminished. Too many sunny days and too little patience have driven us to rely on fragile systems while deliberately ignoring the consequences. If we want to get the most out of computers, we’ll have to change this...

Sunday, September 17, 2017

Decisions

We can model large endeavors as a series of decisions. Ultimately, their success relies on getting work completed, but the underlying effort cannot even be started until all of the preceding decisions are made. The work can be physical or it can be intellectual or it can even be creative.

If there are decisions that can be postponed, then we will adopt the convention that they refer to separate, but related pieces of work. They can occur serially with the later piece of work relying on some of the earlier decisions as well as the new ones. Some decisions are based on what is known up-front, while others can’t be made until an earlier dependent bit of work is completed.

For now, we’ll concentrate on a single piece of work that is dependent on a series of decisions. Later we can discuss parallel series and how they intertwine.

Decisions are rarely ever ‘right’ or ‘wrong’, so we need some other metric for them. In our case, we will use ‘quality’. We will take each decision relative to its current context and then talk about ‘better’ or ‘worse’ in terms of quality. A better decision will direct the work closer to the target of the endeavor, while a worse one will stray farther away. We’ll accept decisions as being points in a continuous set, and we can bound them between 0.0 and 100.0 for convenience. This allows us to adequately map them back to the grayness of the real world.

We can take the quality of any given decision in the series as being relative to the decision before it. That is, even if there are 3 ‘worse’ decisions in a row, the 4th one can be ‘better’. It is tied to the others, but it is made in a sub-context and only has a limited range of outcomes.

We could model this in a demented way as a continuous tree whose leaves fall onto the final continuous range of quality for the endeavor itself. So, if the goal is to build a small piece of software, at the end it has a specific quality that is made up from the quality of its subparts, which are directly driven by the decisions made to get each of them constructed. Some subparts will more heavily weight the results, but all of it contributes to the quality. With software, there is also a timeliness dimension, which in many cases would outweigh the code itself. A very late project could be deemed a complete failure.

To keep things clean, we can take each decision itself to be about one and only one degree of variability. If, for some underlying complex choice, there are many non-independent variables together, representing this as a set of decisions leaves room to understand that the one or more decision makers may not have realized the interdependence. Thus the collection of decisions together may not have been rational on the whole, even if all of the individual decisions were rational. In this sense, we need to define ‘rational’ as being full consideration of all things necessary or known relative to the current series of decisions. That is, one may make a rational decision in the middle of a rather irrational endeavor.

Any decision at any moment would be predicated on an understanding of both the past and the future. For the past, we accrue a great deal of knowledge about the world. All of it for a given subject would be its depth. An ‘outside’ oversimplification of it would be shallow. The overall understanding of the past would then be about the individual’s or group’s depth in any knowledge necessary to make the decision. Less depth would rely on luck to get better quality. More depth would obviously decrease the amount of luck necessary.

Looking towards the future is a bit trickier. We can’t know the future, but we can know the current trajectory. That is, if uninterrupted, what happened in the past will continue. When interrupted, we assume that it is interrupted by a specific event. That event may have occurred in the past, so we have some indication that it is, say, a once-in-a-year event. Or once-in-a-decade. Obviously, if the decision makers have been around an area for a long time, then they will have experienced most of the lower frequency events, thus they can account for them. A person with only a year’s experience will get caught by surprise at a once-in-a-decade event. Most people will get caught by surprise at a once-in-a-century event. Getting caught by surprise is getting unlucky. In that sense, a decision that is low quality because of luck may indicate a lack of past or future understanding, or both. Someone lucky may look like they possess more knowledge of both than they actually do.

If we have two parallel series of decisions, the best case is that they may be independent. Choices made on one side will have no effect on the work on the other side. But it is also possible that the decisions collide. One decision will interfere with another and thus cause some work to be done incorrectly or become unnecessary. These types of collisions can often be the by-product of disorganization. The decision makers on both sides are unaware of their overlap because the choice to pursue parallel effort was not deliberate.

This shows that some decisions are implicitly made by willingly or accidentally choosing to ignore specific aspects of the problem. So, if everyone focuses on a secondary issue instead of what is really important, that in itself is an implicit decision. If the choices are made by people without the prerequisite knowledge or experience, that too is an implicit decision.

Within this model then, even for a small endeavor, we can see that there are a huge number of implicit and explicit decisions and that in many cases it is likely that there are many more implicit ones made than explicit. If the people are inexperienced, then we expect the ratio to have a really high multiplier and that the quality of the outcome then relies more heavily on luck. If there is a great deal of experience, and the choices made are explicit, then the final quality is more reflective of the underlying abilities.

Now all decisions are made relative to their given context and we can categorize these across the range of being ‘strategic’ or ‘tactical’. Strategic decisions are either environmental or directional. That is, at the higher level someone has to set up the ‘game’ and point it in a direction. Then as we get nearer to the actual work, the choices become more tactical. They get grounded in the detail, become quite fine-grained and although they can have severe long-term implications, they are about getting things accomplished right now. Given any work, there are an endless number of tiny decisions that must be made by the worker, based on their current situation, the tactics and hopefully the strategic direction. So, with this, we get some notion of decisions having a scope and ultimately an impact. Some very poor choices by a low-level core programmer on a huge project, for instance, can have ramifications that literally last for decades, and that degrade any ability to make larger strategic decisions.

In that sense, for a large endeavor, the series of decisions made, both large and small, accumulate together to contribute to the final quality. For long running software projects, each recurring set of decisions for each release builds up not only a context, but also boundaries that intrinsically limit the best and worst possible outcomes. Thus a development project 10 years in the making is not going to radically shift direction since it is weighted down by all of its past decisions. Turning large endeavors slows as they build up more choices.

In terms of experience, it is important for everyone involved at the various levels to understand whether they have the knowledge and experience to make a particular decision, and also whether they are the right person at the right level as well. An upper-level strategic choice to set some weird programming convention, for example, is probably not appropriate if the management does not understand the consequences. A lower-level choice to shove in some new technology that is not in line with the overall direction is also equally dubious. As the work progresses, bad decisions at any level will reverberate throughout the endeavor, generally reducing quality but also the possible effectiveness of any upcoming decisions. In this way, one strong quality of a ‘highly effective’ team is that the right decisions get made at the right level by the right people.

The converse is also true. In a project that has derailed, a careful study of the accumulated decisions can lead one back through a series of bad choices. That can be traced way back, far enough, to find earlier decisions that should have taken better options.

It’s extraordinarily hard to choose to reverse a choice made years ago, but if it is understood that the other outcomes will never be positive enough, it can be easier to weigh all of the future options correctly. While this is understandable, in practice we rarely see people with the context, knowledge, tolerance and experience to safely make these types of radical decisions. More often, they just stick to the same gradually decaying trajectory or rely entirely on pure luck to reset the direction. Thus the status quo is preserved or ‘change’ is pursued without any real sense of whether it might be actually worse. We tend to ping-pong between these extremities.

This model then seems to help understand why complexity just grows and why it is so hard to tame. At some point, things can become so complex that the knowledge and experience needed to fix them is well beyond the capacity of any single human. If many of the prior decisions were increasingly poor, then any sort of radical change will likely just make things worse. Unless one can objectively analyze both the complexity and the decisions leading to it, they cannot return to a limited set of decisions that need to be reversed properly, and so at some point, the amount of luck necessary will exceed that of winning a lottery ticket. In that sense, we have seen that arbitrary change and/or oversimplifications generally make things worse.

Once we accept that the decisions are the banks of the river that the work flows through, we can orient ourselves into trying to architect better outcomes. Given an endeavor proceeding really poorly, we might need to examine the faults and reset the processes, environment or decision makers appropriately to redirect the flow of work in a more productive direction. This doesn’t mean just focusing on the workers or just focusing on the management. All of these levels intertwine over time, so unwinding it enough to improve it is quite challenging. Still, it is likely that if we start with the little problems, and work them back upwards while tracking how the context grows, at some point we should be able to identify significant reversible poor decisions. If we revisit those, with enough knowledge and experience, we should be able to identify better choices. This gives us a means of taking feedback from the outcomes and reapplying it to the process so we can have some confidence that the effects of the change will be positive. That is, we really shouldn’t be relying on luck to make changes, but not doing so is far less than trivial and requires deeper knowledge.