The Programmer's Paradox: 2019

Tuesday, December 10, 2019

Development Speed

Software is counter-intuitive. Sometimes to speed up development, you have to slow it down.

Let me explain.

It’s easy to get a small piece of software together. You belt out the basic code; you don’t need to worry about highfalutin ideas like architecture, organization, etc. You just need code.

That works, but only as long as the software stays small. The technique itself is predicated on pounding out rather frail, hard-coded instructions that are very specific to the task at hand. That is fine, but it doesn’t scale. Not even a little bit.

Once the project has accumulated enough lines of code or other forms of complexity, making any changes to it involves finding the right compromises that are fragmented across all of the source. Once that source exceeds the ability of someone to easily remember it, and move around it seamlessly, then attempts to fix or improve it will slow down. They don’t just gradually slow down, they basically walk off a cliff. Changes that might have required a few hours suddenly balloon to taking days or weeks. As well, the quality of the changes declines, which then feeds back into a cycle that makes all future changes harder. Parts of the code hit this vicious cycle at different times in the descent, but the chaos in one part of the system spreads.

If a project has started swirling downwards this way, the only real way out is to admit that the code has become a huge mess, and that the work necessary to stop this decay is essentially re-organizing huge swaths of the code base with an eye on defragmenting it and laying down some rather strict organization. That kind of work is really slow. You can’t just leap into it; rather, it takes time and effort to diligently, slowly, rearrange it at multiple levels, back up to a place where it can move forward again.

But just cleaning up the code isn’t the only issue. Over-specific code and redundant code both occur in projects that are falling apart. Both of these problems need addressing.

Redundant code is easier since you can just look around the codebase and find a lot of similar functions or config data. If they are close enough to each other, it is an easy change to merge them together and only use one copy. Again it is slow, but if done well, it has a huge lift on the quality and tends to make a lot of bugs just disappear, so its payoff is obvious.
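As a small illustration of that kind of merge, here is a minimal Python sketch. The function names and the formatting rule are hypothetical, invented only for this example. Two near-duplicate helpers have drifted apart, and one of them has quietly lost a step.

```python
# Before: two near-duplicate helpers (hypothetical names) that drifted apart.
def format_customer_name(first, last):
    return (first.strip() + " " + last.strip()).title()


def format_supplier_name(first, last):
    # This copy forgot to trim whitespace, a typical drift bug.
    return (first + " " + last).title()


# After: a single shared copy that every call site can use.
def format_person_name(first: str, last: str) -> str:
    """Trim whitespace and title-case a first/last name pair."""
    return f"{first.strip()} {last.strip()}".title()
```

Once the call sites point at the single copy, the whitespace bug in the supplier version simply goes away, which is the kind of payoff described above.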

Over-specific code is a little harder to tackle.

Most development projects go on for years, if not decades. During the life of the project, the focus generally starts at one specific point in the domain and spreads, much like construction, over the surrounding areas. When the project is started, people are often tunnel-visioned on that initial starting point, but taking a big project one tiny step at a time is painfully slow.

Instead of pounding out code for each specific case, when there are a lot of them in the same neighborhood, the optimal method over the total lifespan of the development is to batch these cases together and deal with them in one large blow. This is an off-handed way of talking about generalization and abstraction, but it gets the point across that picking the right abstraction, one that repeats over and over again in the domain, will accelerate a project through that territory at light speed. The cost is often just marginally more effort in the beginning, but the payoff is massive.
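One way to picture batching the cases is a rule table that replaces a hand-written check per field. This is only a sketch under assumed names: the fields, types, and ranges below are hypothetical and not taken from any real system.

```python
# One rule table instead of a separate hand-coded check per field.
# Every field name and range here is a made-up illustration.
RULES = {
    "age": (int, 0, 150),
    "quantity": (int, 1, 10_000),
    "discount": (float, 0.0, 1.0),
}


def validate(field, value):
    """Return True if the value has the expected type and is within range."""
    kind, low, high = RULES[field]
    return isinstance(value, kind) and low <= value <= high
```

Adding the next case in the same neighborhood is then one new line in the table rather than another specific function.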

Often the counter-argument to the above is that it is impossible to guess at which problems will repeat significantly enough in the domain to make a reasonable guess at what to abstract. However, that’s never been a valid point, in that software starts at a particular spot, and the functionality spreads from there. It is all connected. One doesn’t start writing an accounting system and end up with a video game, it doesn’t work that way. So the possible paths for the code are connected and intertwined, and for the most part, obvious. Given that, some prior experience is enough to lay out the pathways for a very long time, with reasonable accuracy, subject to radical shifts like a startup that pivots. So basically if you get in some experienced people, along with their effort, you get their foresight, which is more than enough to be able to maximize the efficiencies over a long period of time.

On top of all of these other efforts, hard lines need to be drawn through the code to separate it into smaller pieces. Without that separation, code gets fragmented quickly and the problems come right back again.

The test that the separation is good enough and clean enough, is that it can be easily documented with fairly simple diagrams. If the sum of the details is so messy that it is a Herculean job to create an accurate diagram of the parts, then it is disorganized. The two go lockstep with each other. If you can create a simple diagram that only kinda reflects the code, then quite obviously you want to refactor the code to match that diagram.

These lines then form an architecture, which needs to be preserved and enhanced for any and all future extensions. Although it should be noted that as the general size of the code base grows, it is quite common for it to outgrow its current architecture and then need to be entirely reorganized by a different set of parameters. That is, no specific organization of code is scalable. It is all relative to its size and complexity. As it grows, then any organization needs to change with it.

Given the above issues, it is inevitable that during the life of a code base there are times when it can run quickly without any consequences and times when everyone has to slow down, rearrange the project, and then start working through the consequences.

Avoiding that will bring on disaster faster than just accepting it and rescheduling any plans that are impacted. If a project decays enough, it reaches a point where it cannot move forward at all, and is either permanently stuffed in maintenance mode or should be shelved and started over from scratch. The only problem with restarting is that if the factors that forced it into that cycle in the first place are not corrected, then the next iteration of the code will suffer the exact same problem. Basically, if the developers don’t learn from history, then they will absolutely repeat it and get back to the same result.

Wednesday, August 21, 2019

Bugs as a Reflection of Coding Issues

A long, long time ago I read a book about the most popular programming errors in C. Off-by-one was in the top ten.

I have made thousands of off-by-one bugs in my career over the decades. It is easily my biggest mistake, and you’d think knowing that I would be able to reduce the frequency of them.

I suspect that the problem comes from the way I see code. At the point that I am focussed on the higher-level mechanics and flow, I easily ignore the low-level details. So, if I am coding a complex algorithm that is manipulating lots of arrays, whether they use symmetric or asymmetric bounds is not particularly interesting. Depending on the tech stack, it's not uncommon to see too much switching between the two, and when I am clearly not paying attention, I run a 50/50 chance of getting it wrong. Which I do, a lot.

Now, in knowing this, I am not able to change my coding mindset, but that doesn’t mean I can’t correct for the problem. So I do. When I am testing to see that the code I’ve written matches my understanding of its behavior, one of the key places I pay attention to is the index bounds. So, if there is an array, for example, I need to add in at least a few examples, then remove a few. That is one of the key minimal tests before I can trust the work.
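A minimal sketch of that kind of bounds check, in Python. The helper `last_n` is a hypothetical function written only for this illustration; the point is the shape of the test, which adds a few items, probes the edges, then removes a few.

```python
def last_n(items, n):
    """Return the final n items of a list (all of them if the list is shorter)."""
    if n <= 0:
        # Without this guard, items[-0:] would return the whole list.
        return []
    return items[-n:]


def check_bounds():
    """Exercise the index bounds: add a few items, then remove a few."""
    data = []
    for value in (10, 20, 30):
        data.append(value)
    assert last_n(data, 0) == []            # lower edge
    assert last_n(data, 1) == [30]          # smallest real slice
    assert last_n(data, 3) == [10, 20, 30]  # exactly the full length
    assert last_n(data, 4) == [10, 20, 30]  # one past the end
    data.pop()    # remove from the back
    data.pop(0)   # remove from the front
    assert data == [20]
    assert last_n(data, 1) == [20]
    return True
```

The values chosen are arbitrary; what matters is that zero, one, the exact length, and one past the length all get tried.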

As a consequence, even though I make a large number of off-by-one bugs, it is actually very rare for them to get into production. If they do, I generally take that as a warning sign that the development process has at least one significant problem that needs to be addressed right away.

Generalizing, because we model the solutions in our head, then code them into the machines, that code can be no stronger than our internal models. Well, almost. For a large set of code that has been worked on by many different programmers, the strength is an aggregate of the individual contributions and how well they overlap with each other.

What that means is that you could segment each and every bug in the system by its collective authors and use that to refine the development process.
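As a rough sketch of what that segmentation could look like, assuming bug records are available as simple dictionaries. The records, author labels, and category names below are invented purely for illustration.

```python
from collections import Counter

# Made-up bug records; a real list would come from the issue tracker.
BUGS = [
    {"id": 1, "author": "dev_a", "kind": "error-handling"},
    {"id": 2, "author": "dev_b", "kind": "off-by-one"},
    {"id": 3, "author": "dev_a", "kind": "error-handling"},
    {"id": 4, "author": "dev_c", "kind": "ux"},
]


def segment(bugs, key):
    """Count bugs grouped by the given attribute, e.g. 'author' or 'kind'."""
    return Counter(bug[key] for bug in bugs)
```

Calling `segment(BUGS, "kind")` groups by the type of mistake, while `segment(BUGS, "author")` groups by who wrote the code; either view points at a part of the process to look into.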

So, for example, suppose the system is plagued by poor error handling. That’s an attribute of not enough time or the developers not considering the real operational environment for the code. Time is easy to fix, but teaching programmers to look beyond the specified functionality of the code and consider all of the other possible failures that can occur is tricky. Either way though, the bugs are providing explicit feedback into issues within the development process itself.

It’s actually very similar for most bugs. As well as being the problems, they shed light on the overall analysis, design, coding, and testing of the system. A big bug list for a medium-sized project is an itemized list of how the different stages of development are not functioning correctly. If, for example, the system is doing a poor job with handling the workflow of a given business problem, it’s often due to incomplete or disorganized analysis. If the system can’t keep up with its usage, that comes from technical design. If the interfaces are awkward and frustrate the user, then the UX design is at fault. And of course, stupidly embarrassing bugs getting out to production are instances of testing failure.

The way we build software has a profound effect on the software we produce. If we want better software, then we need to change the way we build it, and we need to do this from an observational perspective, not just speculative opinions (since they likely have their own limited-context problems).

Wednesday, July 31, 2019

Breaking it Down

It is generally understood that the best way to solve a large problem is by breaking it down, decomposing it into bite-sized pieces.

While that overall approach is easy to say, it is actually quite difficult to list out the specific steps for how to decompose problems into their ‘atomic’ components. As well, it is often forgotten that once all of these pieces have been decided upon, they still need to be assessed together in order to ensure that they still cover the original problem.

For that first issue, large problems are intrinsically complex, that’s their definition. So it's worth noting that the full weight of all of their details exceeds the ability of any single person to internally understand or visualize them. That’s why they are considered large.

To get beyond their size, we essentially rely on layering. We take the original problem and break it down one level at a time into a new set of smaller problems. Each of these is a layer in the solution.

Obviously adding layers increases complexity for the overall problem, so it is vital that each new layer only contains pieces that are independent from each other. That is, the complexity needs to be split cleanly; any extra complexity from adding the layer should be less than the complexity of the underlying pieces.

If we were to think of this more formally, the sum of the new pieces should be less than the altered whole. That seems easy and obvious, but it entirely relies on the pieces being independent of each other. If they aren't, then in the worst case dependent pieces inherit each other’s complexity; their individual complexities are combined. To be specific, take a problem P, and break it into 3 pieces, c1, c2, and c3. If c1 and c2 are intertwined then in the worst case we get c(P) + L <= (c1 + c2) + (c2 + c1) + c3 where L is the cost of adding a new layer. By decomposing the problem into a ‘blurry’ layer, we’ve essentially increased the artificial complexity beyond any benefits of adding that layer.
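To make the arithmetic concrete, here is a small Python sketch of the right-hand side of that worst case. The complexity values are arbitrary units chosen for illustration, and the function name is hypothetical.

```python
def layered_cost(pieces, intertwined=()):
    """Worst-case total: each piece in an intertwined pair also
    carries the complexity of its partner."""
    total = 0
    for name, cost in pieces.items():
        total += cost
        for a, b in intertwined:
            if name == a:
                total += pieces[b]
            elif name == b:
                total += pieces[a]
    return total


# Arbitrary, equal-sized pieces, purely illustrative.
pieces = {"c1": 3, "c2": 3, "c3": 3}

clean = layered_cost(pieces)                               # c1 + c2 + c3
blurry = layered_cost(pieces, intertwined=[("c1", "c2")])  # (c1+c2) + (c2+c1) + c3
```

With these numbers the clean split costs 9 and the blurry one costs 15, so the overlap between just two pieces adds two thirds again on top of the clean total.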

That is the quantitative cost, but there is a human cost as well. The combination of the first and second parts has only been reduced to around 2/3rds of the whole, not the 1/3rd that we could have had to bring this part of the problem down into a manageable size. This builds up. So, if it should have taken 4 layers to contain the solution, we might need to double that to 8 because of the overlaps.

This points back again to the fact that any breakdown is only useful if the pieces themselves are independent.

The secondary problem, particularly with scaling the solution, is having gaps between the pieces. If the pieces fit poorly, then slightly different reassemblies will create new problems. The solution isn’t scalable. Most solutions only have sufficient value if they are applied more than once, but gaps can interfere with being able to do that.

Both issues, overlaps and gaps, strongly diminish the decomposition. If we add in enough blurry layers, we have seen in practice that we can bloat the complexity exponentially or worse. In that sense, while layers are necessary to solve the problem, they are also risky.

So, the underlying issue is that given a big problem how do you subdivide it cleanly into independent, atomic, pieces? The unfortunate answer is that you can only really do this if you fully understand all of the details. But often that is not possible, and as previously mentioned the problem is already too large for any individual to handle.

The approximate answer is that we need to get close enough to a reasonable solution, then slowly converge on the best answer.

To do this, decomposition requires many forms of definition and categorization. We can start these loosely, and continually revise them. For this, we want as many instances of the problem as we can get our hands on. For each of them, we can precisely define them with all of their known details, then we can parse that information into ‘syntactic’ components, essentially verbs and nouns. From here we can split any clashes in, say, nouns, subdividing them until all of the obvious overlaps are gone. Then for each instance, this gives us a breakdown of what are associated attributes, basically the smallest verbs and nouns. With this set, for each instance, we can partition all of the instances from each other. In doing this, we have to count what are really the dimensions of variability (since the axes provide a natural decomposition).

It’s worth noting that any variable dimensionality must match. If you construct a 1D categorization over 2 dimensions, it increases the likelihood that one of the dimensions will span multiple categories, which becomes a dependency, so it bumps up the complexity. However, if you have 2 distinct categorizations for each of the dimensions, then you can work with the permutations at a higher layer to combine them at the cost of adding in that extra layer. In that way, as we are making sense of all of the special cases and categorizing them, we are also building up these layers. The layering itself though can be seen as a more general instance of another problem that needs its own solution, so it is upwardly recursive.
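A tiny Python sketch of keeping two dimensions as separate categorizations and only combining them at a higher layer. The dimension names and values are hypothetical, picked only to show the shape.

```python
from itertools import product

# Two independent categorizations, one per dimension (illustrative values).
CHANNEL = ("online", "in_store")
CUSTOMER = ("retail", "wholesale", "internal")

# The higher layer combines them; every pairing gets exactly one cell.
CELLS = list(product(CHANNEL, CUSTOMER))


def cell_for(channel, customer):
    """Locate the single cell for a given pair of category values."""
    if channel not in CHANNEL or customer not in CUSTOMER:
        raise ValueError("unknown category value")
    return (channel, customer)
```

A single flat list such as "online", "in_store", "wholesale" would mix the two dimensions, and an online wholesale case would then belong to two categories at once.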

A somewhat related programming example is that you want to define an architecture (overall organization) for a fairly large system. You might split all of the code by its usage, say between an interactive interface and a batch reporting facility. But along with usage, there might be shared commonalities for specific data types like strings. The usage of the code is a different dimension from the shared data-structure manipulation code. We’d like the manipulations to only be specified once (so they are independent) but they are equally likely to be necessary on either side of the usage breakdown. We need them for the interface and we need them in batch. Without giving it much consideration, programmers often bind the usage dimension to the higher architectural arrangements but keep a hybrid category called a shared library available to span any architectural divide. Occasionally, you see this done cleanly, but most often the failure to admit these are different dimensions leads to a hodgepodge of redundant and shared code. So, because of that, it is an easy issue to spot in a large codebase, but an extraordinarily hard one to solve with a trivial solution.

Given all of the above, to really get going with a large decomposition means collecting a large number of examples, breaking them down into attributes, identifying the dimensions, then drawing all of the vertical and horizontal lines between them. At that point, one can cross-check that artificial examples do not fit into multiple boxes and that any permutation has only one unique location. For completeness, the boxes need names that truly reflect their content. At this point, the first step is done.
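The cross-check at the end of that process can be sketched mechanically. This is a minimal Python illustration, assuming each box can be expressed as a predicate; the box names and ranges are hypothetical.

```python
def boxes_for(example, boxes):
    """Return the names of every box whose predicate accepts the example."""
    return [name for name, accepts in boxes.items() if accepts(example)]


def check_partition(examples, boxes):
    """Map each example that lands in zero or several boxes to what it matched."""
    problems = {}
    for example in examples:
        matched = boxes_for(example, boxes)
        if len(matched) != 1:
            problems[example] = matched
    return problems


# A clean breakdown: every number falls into exactly one box.
CLEAN = {
    "small": lambda x: x < 10,
    "medium": lambda x: 10 <= x < 100,
    "large": lambda x: x >= 100,
}

# A blurry breakdown: 100 overlaps two boxes, negatives fall into a gap.
BLURRY = {
    "small": lambda x: 0 <= x < 10,
    "medium": lambda x: 10 <= x <= 100,
    "large": lambda x: x >= 100,
}
```

Running `check_partition` over artificial examples at the edges (here -1, 9, 10, 100) is what exposes both the overlap and the gap in the blurry version.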

As mentioned at the start, after decomposition, all of the boxes need to be specified fully, then recomposed to see that they still work together. If everything were truly independent, and it was obvious, then this next step could be skipped, but again this problem is large enough so that neither of those preconditions exists.

Given a small enough box, a capable person can now produce a solution, but the overall context may have been lost. This box may contain a significant amount of variability and it is this intrinsic freedom that is dangerous. Thus, each box still contains a large number of possibilities, but not all of these solutions will interact correctly with the whole.

Another issue is that the necessary size of the boxes is dependent on individuals. Some people can cope with larger boxes, some cannot, so the boxes may still need further decomposition.

At some point with recomposing, it is likely that some missed dependency will creep in. One little box in one part of the solution will be found to be unexpectedly tied to another box somewhere else. Given the size, scope and general usage of the solution there are multiple ways of handling this. The best, but most time-intensive, is to roll the dependency upwards until it reaches a layer where the cross-dependency exists, then just recategorize that layer and all of the affected children.

Sometimes, due to operational or time issues, that is not possible, so the alternative is documentation. The dependency is noted in the attached documentation but the instances are handled redundantly. That type of documentation needs to be bound to the solution, and to stay that way for its entire usage. The worst thing to do is to ignore or dismiss the problem, as it is most likely to set other problems into motion.

A major concern with the above is the fear that rigorously following it will lead to too many layers. Some layers exist mainly for the purpose of bringing down the complexity, others are tightly bound to discovered dimensions. Obviously invalid layers that do neither are just increased complexity without benefit, but for the necessary layers, the underlying degree of sophistication of the solution is for the most part dependent on their existence. If you remove them, the solution is unknowable or it is oversimplified. In the first case, an unknowable amount of complexity will not be predictable and so is not trustworthy. Eventually, it won’t be the solution, but rather take its place as part of the problem itself, so it is a rather negative contribution. Being over-simplified is similar. It won’t really solve the full problem and will spin off lots of sub-problems as part of its usage. Generally, things get worse, but not necessarily linearly.

Relying on a faulty solution may take a long time to trigger the full weight of the mistake. It might seem like it worked, but that’s misleading. Because of that, for a given problem there is a bound on the necessary number of layers required for a tight-fitting solution. Comprehension and dimension handling open the door for some wiggle room, but it is safe to say that there are some problems that need a nearly fixed number of layers to solve properly. So, if the sophistication of the problem requires around 20 layers, but the proposed solution only has 5, we can pretty well infer that that specific solution will not really handle that set of problems. That at some point, it will go horribly wrong. If the proposed solution has 30 layers, again we can often infer that it will take longer than necessary to implement it and that it could be difficult to extend when needed. There are always a lot of possible solutions, but very few will fit properly.

With all of the above in mind, identifying a problem then decomposing it into pieces to build a solution has a lot of small subtleties that, for problems that are highly intertwined, make it tough to get real workable solutions. From a dependency standpoint, problems that are trivially independent are trivially solvable, but all it takes to throw that off is non-obvious dimensions that weed their way through the decomposition. In a real sense that is why most problems look a lot easier on the outside than they do on the inside when you’ve acquired deep knowledge about them. In that depth, the details lead to interconnections, which bind the parts at a distance. Not seeing those is the beginning of the end.

Tuesday, March 19, 2019

Cooperation, Competition and Control

Life is a dynamic process. All forms of life compete for the ability to propagate.

Our species bands together; this cooperation gives us a competitive advantage. Within our societies, we compete with each other for control of any of the resources. We are driven to do this.

Competition and cooperation intertwine at all levels within our interactions.

When a competition becomes stagnant, incentives grow to subvert the underlying cooperation that enables it. If some of the competitors bend the rules, to remain in the competition, the rest have to as well. Each iteration of the game converges to being stale, so the need to subvert the cooperation increases with time and the individual stability of the players.

Outside enforcement or a steady turnover of the players tends towards a fairer competition. Most new competitions start reasonably fair, but without correction will not remain that way.

Control, in an uncontrollable world, is the prize. It is best utilized when achieved, since it may be increasingly lost with time. Through control, we can offset or at least delay other competitions.

A stronger base cooperation enables more intense competition. The two extremes cycle in dominance; one always pushes back on the other.

The game sometimes plays out across generations, but it isn’t always obvious to the players.

As some players compete and push their way up through the ranks, they become willing to do anything to move into the best position.

Some people just don’t want to play, they favor more cooperative environments.

A desire to win seems to be the stronger deciding factor, but can sometimes backfire, depending on the game.

We build a lot of myths around competing fairly, but most competitions are well past that stage. Most people outside the game are unaware of the status and most of the players would rather not admit it.

A stable competition must constrain the game. Stability comes from the outside, it must be tied to our most basic need to cooperate. The outsiders must maintain their ability to enforce the rules of the game. There is no naturally inherent stability, time will always pass, the game will always get stale.

The rules of the game need verification. Bends or breaks must be detectable. Any enforcement must know when to act.

Limited play time helps, but that can be subverted by cooperating groups which extend the context. If one group’s horizon significantly exceeds the field, then all other players will bind to different groups and the effect is the same as individuals, but just plays out a little slower.

Cooperation drives us at a basic level, but the need to compete and to gain control are dominant in our societies. We will compete at anything and everything. If we want a better world, we need to address this at the core; to accept it and to allow it, but also to contain it to remain positive. Otherwise, the same decaying cycles just play out over and over again.

Thursday, February 21, 2019

Software Optimizations

Most software can execute faster.

There are many ways that software can be optimized to improve its performance. Most of these techniques are well-understood, but they still need to be used with caution, in that they can accidentally harm other attributes of the system.

The most obvious way to speed up code is to no longer do useless work.

One common form of wasted effort is to redundantly copy the same data to many different areas of memory. Another is to parse the data into smaller pieces and then reassemble it later or vice versa. Removing these from code should get it closer to the minimal effort, but not necessarily the minimum.

Sometimes, however, extra copies of data are necessary for security, locking or architectural reasons. These types of redundancies can stay, if they are justified. For sanity reasons, most incoming data for a large system should be fully parsed as soon as possible. Exporting this data may legitimately require reassembling it. This too is fine.

Switching to a better algorithm, quite often, can afford very large time savings. There is a huge amount of knowledge available about algorithms and their performance attributes. Significant research is always required.
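A small Python sketch of what switching algorithms can mean in practice. Both functions are illustrative, written for this example, and answer the same question: does a list contain a duplicate?

```python
def has_duplicates_quadratic(items):
    """Compare every pair: roughly n*n/2 comparisons in the worst case."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """Track seen values in a set: about n lookups, at the cost of extra memory.
    Assumes the items are hashable."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

The second version trades space for time, and it only applies when the items can be hashed, which is the sort of constraint that has to be checked before swapping one algorithm for another.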

Sometimes shifting between time and space works really well. We can rebalance the code to shift this resource usage. In some cases though, there are natural boundaries for reductions, so the embedded information doesn’t exist to optimize the code.

Algorithmic optimizations are often the most effective, but they require a great deal of knowledge and often a lot of time to implement properly.

Beyond that, memoization, which is the reuse of earlier computations, can produce decent optimizations. Caching is the most famous of these, but care must be taken to distinguish between read-only and write-through caching. They are very different from each other and frequently confused. A bad implementation can cause weird bugs.

The big trick with memoization is not in saving the value, but rather in knowing ‘precisely’ when that value is no longer of any use. These types of optimizations require a strong understanding of usage and data frequency. Generalized solutions can help (and hurt), but specific solutions will always produce better results.
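The invalidation problem can be shown with a very small cache. This is a sketch only; the class name and its methods are hypothetical, and the loader stands in for whatever expensive computation or lookup is being reused.

```python
class SimpleCache:
    """Saves loaded values and relies on explicit invalidation."""

    def __init__(self, load):
        self._load = load      # the expensive function being memoized
        self._values = {}
        self.loads = 0         # how many times the loader actually ran

    def get(self, key):
        if key not in self._values:
            self._values[key] = self._load(key)
            self.loads += 1
        return self._values[key]

    def invalidate(self, key):
        """Drop a saved value once it is known to be out of date."""
        self._values.pop(key, None)


source = {"a": 1}
cache = SimpleCache(source.__getitem__)

first = cache.get("a")    # loaded from the source
source["a"] = 2           # the underlying data changes
stale = cache.get("a")    # the cache still returns the old value
cache.invalidate("a")
fresh = cache.get("a")    # reloaded after invalidation
```

Saving the value is the two-line `get`; the hard part is that something outside the cache has to know when to call `invalidate`.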

An example of this is compression. Data can be taken down close to its information-theoretic minimum; beyond that, some data is lost (which can also be acceptable). The size of the data is reduced by exploiting the redundancies within it. This is also a classic time vs space tradeoff.
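A quick way to see this is with the `zlib` module from the Python standard library. The inputs below are made up for illustration: one is highly repetitive, the other is random bytes of the same length with almost no redundancy to exploit.

```python
import os
import zlib

repetitive = b"abcabcabc" * 1000       # 9000 bytes of obvious redundancy
random_like = os.urandom(9000)         # 9000 bytes with essentially none

packed_repetitive = zlib.compress(repetitive)
packed_random = zlib.compress(random_like)

# Lossless: the original comes back exactly, at the cost of CPU time.
restored = zlib.decompress(packed_repetitive)
```

The repetitive input shrinks to a tiny fraction of its size, while the random input barely changes, since there is nothing left to squeeze out.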

Parallelizing computation is another strong form of optimization. To make it work on interconnected data usually requires synchronization primitives like locking. Locking can be coarse-grained or fine-grained, with the latter usually providing better performance at the cost of more management overhead. Locking gets extraordinarily challenging when it occurs outside of a single process.
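A minimal, single-process sketch of locking around shared data, using Python's standard `threading` module. The class and function names are hypothetical. It uses one coarse-grained lock; a fine-grained design would instead hold a separate lock for each independent piece of data.

```python
import threading


class SharedCounter:
    """A counter guarded by one coarse-grained lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        # The read-modify-write must not interleave between threads.
        with self._lock:
            self.value += 1


def run_workers(workers=4, per_worker=10_000):
    """Increment one shared counter from several threads and return the total."""
    counter = SharedCounter()

    def work():
        for _ in range(per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

With the lock in place the total is always `workers * per_worker`; everything here lives inside one process, which is the easy case.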

Locking algorithms spread across different computers are bounded by TGP (two generals problem) which in itself influences impossibility results like CAP and FLP. Generally, this is caused by an inherent ambiguity (missing information) within the communication between the separated computations (getting worse as the underlying reliability of the communication weakens). This is sometimes described as transactional integrity as well.

In general, we can optimize code by seeking out data independence. If for a given computation, there is some dependence for the result on some other piece of data, then that relationship bounds the minimum amount of work. All outputs must be produced from some finite set of inputs. This is true for all computations. As there is a precise minimum for information, there also exists one for computation.

Optimization attempts then can start by observing for a given context that there will never be ties between any specific set of variables and using that information to reorder the work involved to get closer to the minimum. That is, we can conjecture that for any specific output, there are a finite number of both computations and data that form a minimum directed acyclic graph (DAG) with all inputs as leaves. Then there should exist a minimal such DAG (relative to the computational primitives). This can be applied mechanically to any set of instructions, for a given set of data, as it is bounded by a specific context. Fill in these unknowns and the minimal set of work is explicit.
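The idea of only doing the work that a given output depends on can be sketched with a small dependency graph. This is an illustration under assumed names; the graph below is invented, and the function only finds which nodes are needed, it does not claim to find a provably minimal computation.

```python
def needed_nodes(graph, output):
    """Return every node the output transitively depends on, itself included.
    `graph` maps a node to the nodes it is computed from; inputs have no entry."""
    needed = set()
    stack = [output]
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(graph.get(node, ()))
    return needed


# Hypothetical computation: inputs are the leaves (price, quantity, ...).
GRAPH = {
    "total": ["subtotal", "tax"],
    "subtotal": ["price", "quantity"],
    "tax": ["subtotal", "tax_rate"],
    "report_footer": ["timestamp"],
}
```

Asking for `total` never touches `report_footer` or `timestamp`, because there is no dependence between them; that independence is exactly what allows the work to be skipped or reordered.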

Some algorithmic optimizations are tricky in that they would require currently unknown relationships to exist in order to find the actual minimum effort. We can, however, come close to the minimum, even if we can’t get there yet.

Most other optimizations are easier, in that they really come from understanding the data, its usage and the underlying functioning of the computers themselves (sometimes optimizations at one level exist to counterbalance bad optimizations at a lower level).

Most code is written as the ‘obvious first try’, so most of the time there is plenty to optimize. However, most programmers do not fully understand the data or the context, which is why we warn younger coders to not prematurely attempt to optimize. They do not have a full enough understanding yet to do it correctly, and bad optimizations, by definition, will use more resources, not fewer. Poor optimizations can also impair readability or extensibility. Only good optimizations will help.

Friday, February 15, 2019

Model-based Systems Design

Start with some data that you want the system to capture.

Where does this data come from? Data is usually entered by people or generated by some type of machine.

Most data is composite. Break it down into its subparts. It is fully decomposed when each subpart is easily representable by a programming language primitive. Try to stick to a portable subset of primitives like strings, integers, and floating point numbers. Converting data is expensive and dangerous.
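As a sketch of a fully decomposed composite, here is a hypothetical postal address in Python. The entity and its field names are assumptions made for the example; each subpart is a plain string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostalAddress:
    """A composite value broken down into primitive subparts (illustrative)."""
    street: str
    city: str
    postal_code: str   # a string, not an integer: leading zeros and letters matter
    country_code: str  # e.g. a two-letter ISO 3166-1 code


home = PostalAddress(
    street="1 Main St",
    city="Springfield",
    postal_code="01101",
    country_code="US",
)
```

Storing the postal code as an integer would silently turn "01101" into 1101, which is a small example of why converting data is dangerous.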

Are any of these subparts international or common standards? Do they have any known, necessary properties, constraints, restrictions or conventions? Do some research to see what other people in the same industry do for representing these types of values. Do some research to see what other programmers have done in representing these values.

Now, investigate usage. How many of these values will the system store? How often are they generated? Are there any weird rules for frequency or generation? Sometimes close estimates for frequency are unavailable, but it’s always possible to get a rough guess or something like a Fermi estimation. For the lifetime of some systems, the initial data frequency will differ quite a bit from the actual real-world data frequency. Note this, and take it into account later, particularly by not implementing optimizations until they are necessary.
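A rough estimate of this kind is just a few multiplications. The sketch below is in Python, and every number in it is a made-up assumption, not a measurement from any system.

```python
def rows_per_year(users, events_per_user_per_day, days=365):
    """Rough count of records generated in a year."""
    return users * events_per_user_per_day * days


def storage_bytes(rows, bytes_per_row):
    """Rough raw storage needed for that many records."""
    return rows * bytes_per_row


# Illustrative guesses only: 2,000 users, 50 events each per day, 200 bytes per row.
rows = rows_per_year(2_000, 50)
size = storage_bytes(rows, 200)
```

Here that comes to 36.5 million rows and about 7.3 GB per year. Being off by a factor of two either way rarely changes the design decision, which is why a rough guess is good enough.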

Once the data is created, does it ever need to change? Who can change it? Is the list of changes itself significant data too? Are there security concerns? Can anyone see the data, can anyone change it? Are there data quality concerns? What are the chances that the initial version of the data is wrong? Does this require some extended form of auditing? If so, what is the frequency for the audit entities themselves? Is the auditing fully recursive?

For some of the subparts, there may not be a one-to-one relationship. Sometimes, the main data, often called an ‘entity’, is associated with many similar subparts. Break this into its own entity. Break any one-to-many, many-to-one or even many-to-many bits of subparts into separate entities. During implementation, it may not make sense for the initial version of the system to treat the subparts as a separate entity, but for data modeling, it should always be treated as such. The model captures the developers’ understanding of the data as it gets created and used; the implementation may choose to optimize that if necessary. The two viewpoints are related, but should not be intermixed.
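A minimal sketch of splitting a one-to-many subpart into its own entity, in Python. The `Customer` and `PhoneNumber` entities and their fields are hypothetical examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str


@dataclass(frozen=True)
class PhoneNumber:
    customer_id: int   # refers back to Customer.customer_id
    kind: str          # e.g. "mobile" or "office"
    number: str


def phones_for(customer, phones):
    """Collect the phone entities that belong to one customer."""
    return [p for p in phones if p.customer_id == customer.customer_id]


alice = Customer(1, "Alice")
PHONES = [
    PhoneNumber(1, "mobile", "555-0100"),
    PhoneNumber(1, "office", "555-0101"),
    PhoneNumber(2, "mobile", "555-0199"),
]
```

An implementation might later decide to fold a single phone number into the customer record, but the model keeps the two entities apart.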

For some data, there may be an inter-relationship between entities of the same kind. Capture this as well. The relationships span the expressibility of data structures, so they include single entities, lists, trees, DAGs, graphs, and hypergraphs. There may be other structural arrangements as well, but most of these can be decomposed into the above set. These interrelationships sometimes exist externally, in which case they themselves are another entity and should be treated as such. Sometimes they are confused with external indexing intended to support searching functionality; that too is a different issue, with more entities. Only real structural interrelationships should be captured this way. Most entities do not need this.

What makes each entity unique? Is there a key or a composite set of values that is unique? Are there multiple keys? If so, can they conflict with each other? Spend a lot of time understanding the keys. Poorly keyed data causes huge problems that are hard to fix. Almost all entities need to be unique, so there is almost always at least one composite key, sometimes quite a few. Key mappings can be a huge problem.
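One cheap way to test a belief about keys is to run the candidate key against sample data and see whether it really is unique. The helper below is a sketch; `duplicate_keys` and the sample fields are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows: list, key_fields: tuple) -> list:
    """Return each composite key value that appears on more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


rows = [
    {"country": "CA", "tax_id": "123", "name": "Acme"},
    {"country": "US", "tax_id": "123", "name": "Acme Inc"},
    {"country": "CA", "tax_id": "123", "name": "Acme Ltd"},
]

# tax_id alone is not unique, and here neither is (country, tax_id).
print(duplicate_keys(rows, ("tax_id",)))
print(duplicate_keys(rows, ("country", "tax_id")))
```

A clean result on sample data does not prove the key is right, but a duplicate proves that it is wrong, or that the data is.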

While working with one type of entity, a whole bunch more will be created. Each one of these new entities needs the same analysis treatment as the original entity. Once each one has been explored and understood, it can be added to the model. The model should grow fairly large for non-trivial systems. Abstraction can combine sets of entities with the same structural arrangement together, but the resulting abstract entities should not be so generic that they can no longer properly constrain the data. A model is only useful if it accurately holds the data and prevents invalid data from being held.

Be very careful about naming and definitions. The names need to match the expected usage for both the target domain and computer science. Sometimes it takes a while to figure out the correct name; this is normal. Misnaming data shows a lack of understanding, and often causes bugs and confusion later. Spend a lot of time on names. Do research. They are hard to change later. They need to be accurate.

Don’t try to be clever or imaginative. The system only adds value by capturing data from the real world, so the answers to most questions are just lying around, out there, in the real world. There has been plenty of history for building up knowledge and categorizations; leverage that work. Conflicts between the model and reality should be resolved by fixing the model. This work only has value if it is detail-oriented and correct; otherwise, it will become one of the sources of the problem, not part of the solution.

There are forms of optimizations that require injecting abstract data into the model, but those types of enhancements should be added later. They are implementation details, not data modeling.

Some types of data are constrained by a fixed or limited set of values. These sets are domain-based. Try to find existing standards for them. Put them into their own entities and expect them to change over time. Do some analysis to figure out how often they are expected to change and who is expected to change them. Anyone involved in running, using or administering a system is a user, and not all users are end-users.
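A limited set of values that changes over time can be sketched as its own small entity with validity dates, rather than as constants buried in the code. Everything below is hypothetical: the `CodeEntry` name, the codes, and the dates are illustrative only and are not taken from any standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CodeEntry:
    """One allowed value in a domain-defined set, with its validity period."""
    code: str
    description: str
    valid_from: date
    valid_to: Optional[date] = None  # None means still in use


def is_valid(code: str, on: date, table: list) -> bool:
    """True if the code was an allowed value on the given day."""
    return any(
        entry.code == code
        and entry.valid_from <= on
        and (entry.valid_to is None or on <= entry.valid_to)
        for entry in table
    )


# Illustrative entries only.
table = [
    CodeEntry("AAA", "Current example code", date(1999, 1, 1)),
    CodeEntry("BBB", "Retired example code", date(1990, 1, 1), date(2001, 12, 31)),
]
print(is_valid("BBB", date(2000, 6, 1), table))
print(is_valid("BBB", date(2019, 2, 7), table))
```

Keeping the dates means that old records holding a retired value remain valid for the period in which they were created.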

As this work progresses, it will build up a large collection of entities. Groups of these entities will be tightly related. These groups draw architectural lines within the system.

Now, look at how people will use this collection of data. Do they need access to it quickly? Is it huge, and what types of navigation will users need to find out what they are looking for? Is the searching slow, does it need some form of indexing optimization to make it usable? Will the system build up data really quickly, consistently or very slowly? Do different parts of the system have very different access requirements? Is there data in the system which is creativity-based or experimental? Some systems need tools to play around with the data and are subject to lots of little experimental changes. Some systems need the data to remain relatively static and only require changes to fix quality issues.

Will different users need the same data at the same time? Will different users edit the same data at the same time? How do changes with one entity affect others?

For large data, summaries are often important to present overviews of the data. For each entity, what types of summaries are necessary? Are these their own entities? How often do they change, and who can regenerate them? Are there user-specific categorizations that would be helpful in crafting useful summaries? Can these be shared? Are they entities as well? Can this reporting part of the system be fully integrated into the main model so that it stays in sync with any enhancements to the domain model itself?

Can the summary data be computed on-the-fly, or is it expensive enough to be regenerated only periodically? What is the precise cost of computations? Are there industry standards, existing libraries, etc. that can be used?

The answers to some of the above questions, such as how quickly the data needs to be viewed after changes and how many users need to view it, will create optimization requirements. The collection of these requirements will dictate the hardware, the architecture and any dependent technologies. Often there will be multiple possible solutions so regional programming resources and current programming fads will pin down a precise implementation.

The work modeling the data, since it is small in comparison to the implementation work, should extend beyond the initial short term goals of the development. It doesn’t have to fully cover all scope and all detail of the given domain, but it should at least extend out to the next couple of expected iterations in the development cycle. This makes it possible in the future to pipeline extensions in the system first by modeling, then by the actual implementation, so that the current coding work is at least one generation behind the current modeling work. Dramatic, unexpected pivots in the development will, of course, disrupt this cycle, but the frequency of these should diminish rapidly in the early days of development (or the project is already doomed for non-technical reasons).

A full data model then includes all of the entities, their internal and external relationships, all subparts that are ‘typed’ and all of the expected computations and necessary optimizations. Follow-up extensions should be highlighted as changes from the previous version. The version changes should match the code implementations (development cycles). The structure of any entity groups should lay out a high-level architecture, with further architectural constraints driven by the optimizations and possibly the arrangement of the development teams themselves.

The data model should include any of the analyst’s notes, and any issues about standards, ambiguities, conventions, and issues with keys. The full document can then be used to produce a high-level design and a number of necessary mid-level and low-level designs needed to distribute the work to the development teams.

Depending on the haste involved in doing this work, it is possible that there are flaws in the data model. These will come in different types: a) the model does not reflect the real world, b) it has missing elements or c) the model differs from an incorrect model used by an upstream or downstream system. If the model is wrong or incomplete it should be updated. That should be pushed through design and implementation as a necessary change to the core of the system. It is the same process as extending the system. If the model doesn’t reflect some earlier mistake, an appendix should be added that maps that mistake back to the model and outlines the consequences of that mapping. That mapping should be implemented as a separate component (so that it can be removed later) and enabled through configuration.
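A mapping of this kind can be kept in one small, removable piece of code behind a configuration switch. The sketch below assumes a hypothetical upstream system that sends non-standard status codes; the table, the function, and the `use_legacy_mapping` flag are all invented for illustration.

```python
# Hypothetical mapping from an upstream system's codes to the model's values.
LEGACY_STATUS_MAP = {
    "ACTV": "active",
    "CLSD": "closed",
}


def translate_upstream_status(raw: str, use_legacy_mapping: bool) -> str:
    """Map an upstream status onto the model's value when the mapping is enabled.

    Values that are not in the table pass through unchanged.
    """
    if use_legacy_mapping:
        return LEGACY_STATUS_MAP.get(raw, raw)
    return raw


# The flag would normally come from the system's configuration.
print(translate_upstream_status("ACTV", use_legacy_mapping=True))
print(translate_upstream_status("ACTV", use_legacy_mapping=False))
```

Because the core model never sees the upstream codes, the table and the function can be deleted once the other system is fixed.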

For most systems, spending the time to properly understand and model the data will lay out the bulk of the architecture and coding. Building systems this way will produce better quality, reduce the development time and produce far fewer operational issues. Done correctly, this approach also lays out a long-term means of extending the system without degrading it.

Thursday, February 7, 2019

Implementing Sophistication

Computers are intrinsically stupid.

To get around this problem, programmers have to take all of the knowledge they have acquired, visualize it in a way that makes it codable, and then implement it in software.

The easiest approach to this is to be as crude as possible. The coder does nothing other than sling around unknown data; all of the complexity is pushed either to the users or to the operating environment. This gets the baseline mechanics up quickly, but it does create a fragile system that isn’t extendable. It’s good for a demo, but rarely solves any real underlying problems.

Sophistication comes from the code taking on more and more of the problem in a way that is reliable. Instead of pushing back unintelligible character strings for a user to track, a sophisticated system employs powerful navigation so that they don’t have to remember anything. Instead of crashing poorly and requiring a long recovery time, a sophisticated system bounces right up again, ensuring that all of its components are in working order. The efforts pushed back to the ‘users’ are minimized, while the functioning of the system is predictable.

It’s a huge amount of work to add in sophistication for software. We can crank out flaky websites quickly, but to actually build up deep functionality takes time, skill and knowledge. Often people believe that it is not necessary. They figure that it’s only a little extra work that is pushed back to the users, so it is okay. But if you look at our modern software, with the amount of our time that it wastes, then it should be more than obvious that we aren’t making good use of our hardware and all of the electricity that we pour into it. Crude software doesn’t really automate processes for us, rather it just shifts the way we waste our time.

Sophistication starts with understanding the data. Since computers are confined to the digital realms, at best they are only able to ‘symbolically’ represent things in the real world. These representations are bound by real-world constraints that are often extraordinarily complicated. It takes time to map their informal behavior into a rigorous formal environment. Time that isn’t wasted. If the system properly encapsulates the managed data then it doesn’t need external help or hacks when it is operating. If the data is a mess, then any or all of the computations on the data are suspect. Bad systems always have bad data models, the two are intertwined.

Really understanding data is complicated and it usually means having to go beyond just the normal branches of programming knowledge and directly into domain knowledge. For some people, this is the interesting part of programming, but most try very hard to avoid building up depth on any particular domain. That is unfortunate since the same basic ‘structural’ domain issues span across areas like finance, healthcare, etc. From an implementation standpoint, the usages are very similar and digging into one domain can lead to insights in others. Uniqueness and time, for instance, are common across everything, even if the instances of those problems are steeped in domain-specific terminology.

If the base of the system rests on an organized and complete data model, the construction of the system is quite easy. Most code, in most systems, is just moving the data from one part of the system to another. The trick is to maximize reuse.

Coding is still a slow, tedious operation, particularly when you include the work of testing and debugging. Reuse means that the finalized, well-edited code is deployed repetitively, which can eliminate huge amounts of work. That is, the only code that is ‘good’ code has been heavily edited and battle-tested. Otherwise, it is fresh code; it should be assumed that it contains significant bugs. This is always a safe assumption that makes it easier to understand the amount of work involved in releasing a new version of a system.

In most systems, there are huge opportunities for reuse, but they often require deep abstraction skills. Unfortunately, this makes them unavailable for most development efforts. To leverage them requires a significant up-front investment that few organizations are willing to gamble on. It’s not possible, for instance, to convince management that for an extra six months up front, it saves years of work down the road. Our industry is too impatient for that. Still, one can identify reuse and slowly refactor the code in that general direction, without having to commit significant resources immediately. This spreads the effort over the full duration of the project but requires that this type of work is not discontinued halfway through. Thus modern programming should accept that reuse and refactoring are bound together. That the latter is the means to achieve the former.

Big sophisticated systems take years, if not decades, to build. That is never how they are pitched; the time frame is usually ridiculously short and overly ambitious. Still, any developer that has been through a number of big projects is aware that the amount of work invested is massive. Because of this, systems in active development are continuously being extended to do more than their original designs. This is quite dangerous for the software in that the easiest way to extend code is to just build some independent functionality on the side and barely integrate it. This type of decay is extremely common. It is a lot less work to slap something different at the edge than it is to revisit the underlying data model and work out how to enhance it. But each time this shortcut is chosen, the system gets considerably less sophisticated, more fragile and more bug-prone. What seems to be the faster way of achieving our goals is actually extremely destructive in the long run. So, sophistication isn’t just an initial design goal, it is an ongoing effort that continues as long as there are new capabilities and data getting added into the system. Sophistication can be watered down or essentially removed by poor development efforts.

Given that adding sophistication to a system is extremely time-consuming, coders have to learn how to be efficient in order to be able to meet most development constraints.

The first issue is that not all time spent building the system should be actual coding. In fact, coding should be the last, most routine part of the effort. Programmers need to learn how to acquire an understanding first, then visualize a solution and then only at the end do they sit down and start fiddling with the code. Diving head first into the code and getting lost there always wastes a huge amount of time. As well, new programmers are often hesitant to delete their code, so instead, they start to build up unintelligible, disorganized messes, that they flail at to fix a never-ending set of bugs. Code gets sticky and that causes its own problems. Fear of changing code often leads to writing more bad code.

Sometimes, the best approach to fixing the code is to walk away from the computer. Research (textbooks, blogs, etc.) and bouncing ideas off other programmers are two really critical but underused approaches to being more efficient. If you are having trouble explaining what the code should do, then you probably don’t understand it well enough to get it to work properly. It’s a waste of time to fight with code.

Efficiency also comes from non-development activities. Micro-management is a very popular approach these days for software development projects, but it can be horrifically misapplied and lead to tonnes of make-work. Stakeholders need some level of accountability for the work they commission, but software development should never be driven by non-technical people. They don’t understand the priorities so they focus on the shallow issues that are most visible to themselves, while the real problems in development come from the details. This always leads to a quick death as the technical debt overwhelms the ability to progress. A reasonable methodology can help, but it is tricky to apply it properly. Methodology for small projects is trivial, but the complexities grow at least exponentially as the size of the project grows. It is a very difficult and concentrated skill to keep large-scale development projects from imploding or exploding. It is quite a different set of skills from either coding or architecture. In this sense, software development is intrinsically unscalable. More resources often result in more make-work, not real progress.

Realistically, it isn’t difficult to type in a small set of instructions for a computer to follow. It is difficult, however, to type in a complete set of instructions that would help a user reliably solve one of their problems. We often get these two things confused and a great deal of modern software development is about claiming to have done the second, by only doing the first. As the software development industry matures, we should be able to do more of this enhanced type of development, and we do this by getting beyond our crude practices and adding in sophisticated code. This type of coding takes longer, is harder and requires more skills, but ultimately it will make computers significantly more usable for us. We shouldn’t have to accept so many software problems; we shouldn’t let our standards remain so low. Computers are amazing machines which still have a huge ability to improve our lives. Sophisticated software is what makes this possible.