The Programmer's Paradox: October 2015

Sunday, October 25, 2015

Organization

Disorganization is extremely dangerous for a software development project. No matter what code quality is delivered by the programmers, if it is just dumped into a ball of mud it will grow increasingly unstable. Software rusts quite quickly and cannot be maintained if the costs to read and modify it are unreasonable. Disorganization significantly amplifies those costs.

Organization applies both to the output, such as the data and code, and to the processes used for all five stages of software development. If the right work is not being done at the right time, the resulting inefficiencies sap the time and morale of the project, causing the work to falter. Once things have become messy, progress gets painfully slow. So it not only applies to the end, but to the means as well.

What is organization? It's a fairly simple term to claim, but a rather tricky one to define precisely. My favorite definition is the 17th (or 18th) century proverb "a place for everything and everything in its place" but that actually turns out to be a rather incomplete definition. It works as a start, but needs a little bit of refining:

A place for everything.
Everything in its place.
A similarity to everything in the same place.

While this definition may seem to only revolve around the physical world, it does actually work for the digital one as well.

'Everything' for a computer is code and data. There is actually nothing else, just those two although there are plenty of variations on both.

A 'place' is where you need to go to find some code or data. It's an absolute location e.g. the source code is located on machine X in the folder Y. The binary is located on machine P in the folder Q. These are both rather 'physical' places, even if they are in the digital realm.

With the first two rules it is clear that to be organized you need to define a rather large number of places and categorize all data and code into 'everything'. Then you need to make sure that all things are stored exactly where they belong. In principle it isn't that hard; in practice any medium to large computer system consists of millions of little moving parts. To get all of those into the proper place just seems like a huge amount of work. Really it's not. Particularly if you have some overwhelming 'philosophy' for the organization. Higher-level rules, i.e. best practices, fit generalizations over the details and when done appropriately can provide strong organizational guidance towards keeping a project clean. And then it is actually far less effort than being rampantly disorganized.

The third rule is necessary to insure that there aren't too many things all dumped into one big place. Some unscrupulous coders could claim that a ball of mud is organized because all of the code is in one giant directory in the repo. That, of course, is nonsense. Hundreds or thousands of source files in one giant mess is the antithesis of organization. The third rule fixes that. Everything in one place must be 'similar' or it just isn't organized.

So what does 'similar' mean? In my last post on Abstraction I talked about the rather massive permutations available for collections of sets. This directly implies that there are huge variations on degrees of possible similarity. As well, different people have very different tolerances for the proximity of any two things. Some people can find similarities at a great distance, while others only see them if they are close together. Taken together this implies that 'similarity' is subjective.

Two completely different people may disagree on what is close enough to be similar, and that of course propagates up towards organization. Different people will have very opposing views as to what is really organized, and what is disorganized. That being said, there is always some outer boundary on which the majority of reasonable people will agree that two things are definitely not similar, and it is this boundary that is the minimal necessity for the third rule. That is, there is basic organization and then there are many more advanced, more eclectic versions. If a project or process doesn't even meet the basics, then it is clearly doomed. If it twists a little to be tightly organized relative to a specific person's organizational preferences, then it should at least be okay, and it is always possible to re-organize it at some point in the future if the staff changes.

The corollary to this definition does imply however that if there are four programmers working on the same project with four distinctly different organization schemes, then if they overlap in any way, the project is in fact disorganized. If the programmers all choose different naming convention for the same underlying data, a thing, it is essentially stored in four different places, not one, violating the first rule. If duplicate code appears in many places, then the second rule is violated. If some component is the dumping ground for all sorts of unrelated functionality, then the third rule is broken. New programmers ignoring the conventions and tossing a "bag on the side" of an existing project are actually making it disorganized.

A large and growing project will always have some disorganization. But what is important is that there is continual work to explicitly address this. The data and code continually grow, drifting in similarity, so once the disorganization starts to impacts the work it needs to be addresses before it consumes a significant amount of time. And it needs to be handled consistently with the exact same organization scheme that already applies to everything else. A project that sees this as the minimum mandatory work involved in building up a system is one with a fighting chance for success. A project where this isn't happening is in trouble.

Testing to see if a system is organized is fairly simple. You just need to go through the data and code and see if for every place, the things are all similar. If there is lots of stuff out of place, then it is disorganized. If everything fits exactly where it should, then not only is it organized but it is also often described as 'beautiful'. The term 'elegant' is also applied to code, and is that ability to make a very complex problem appear to be rather simple. Underlying that achievement is a excellent organizational scheme, not just a good one.

Organization relates back to simplification and complexity. I talked about this in The Nature of Simplification nearly ten years ago, but it was in regards to simplifying without respect to the whole context. A bad simplification like the files/folder example is also disorganization because it gradually grows to violate the similarity rule. This feeds back into complexity in that disorganization is a direct form of added 'artificial' complexity. A mess is inordinately more complex than a well-designed system, but that mess is not intrinsic to the solution. It could have been done differently.

Organization ties back to other concepts such as Encapsulation and Architecture. A fully defined place for all of the data and code that are 'interrelated' is an alternative definition of Encapsulation. Architecture is often described as the main 'lines' or structures over which everything else fits, which is in a real sense an upper level of organization of the places themselves. Given enough places, they need to be put into meta-places and enough of those need to be organized too, etc. A massive system requires many layers to prevent any potential disorganization from just being pushed elsewhere and ignored. Organization is upwardly recursive as the scale increases.

Applying this definition of organization to processes is a bit tricky, but very similar. As always, l think it's easier to work backwards for understanding process issues. Starting with the users actually accessing the necessary functionality to solve their problems, we can see that the minimal organization includes getting each dependent 'thing' into its place with the minimal amount of effort. So the process of upgrading a system, for example, is well-organized when the only steps necessary are all dissimilar from each other. If there are five config files, then there ought to be just one step that gets them all installed into the right place. More importantly, if upgrading occurs more than once, then actually automating the process is in itself a form of organizing it.

Switching out to the earlier side of the development stages, the output of analysing an underlying problem to be solved requires that everything known and discovered is appropriately stored in the correct place, in the correct form. It isn't mashed together, but rather partitioned properly so that it is most useful for the next design stage. If this principle of organization is followed, it affects everything from the product concept to the actual feedback from operations. Everything is appropriately categorized, collected and then stored in its right place. The process is organized and anyone, at any stage, can just know where to find the correct information they need or know immediately that it is missing (and is now work to be done).

This may seem like a massive effort, but really it's not. There is no point collecting and organizing data from one stage if it's not going to be used in another. Big software projects sometimes amplify make-work because of misguided perspectives on what 'could' be useful, but after decades of development it becomes rather clear as to what 'will' actually be useful and what is just wasted effort. Processes should be crafted by people with significant hands-on experience or they miss those key elements. In a disorganized process, no real distinction can be made on the value of work, since you cannot ever be sure if there will be usage someday. In a well-organized project, spotting make-work is really easy.

One can extend this definition of organization fully to processes by substituting the nouns in the above rules with the appropriate verbs for the process. Then there is a process for every action, and every action takes place in the appropriate process. As well, there is a similarity to all of the actions within the same process. Of course with computers it is entirely possible to automate significant parts of the process, such that an overwhelming number of verbs really fall back to their output; back to nouns again. Seeing the work in this way, one can structure a methodology that ensures that the right outputs are always constructed at the right time, thus ensuring efficiencies throughout the flow of work.

A well-organized process is not at all as onerous as it sounds, rather it makes concentrating on the the quality of work much easier in that there are fewer disruptions and unexpected changes. As well, there are less arguments about how to handle problems, in that the delegation of work is no longer as subjective. For instance, if at the coding stage it becomes clear that some proposed functionality just won't work as advertised, then the work reverts back to the design stage or even further to the analysis. There is little panic, and an opportunity to insure that in the future the missing effort isn't continually repeating. In a disorganized process, someone generally just whips out duct tape and any feedback is lost, the same issues repeat over and over again. In that sense, an organized process is helpful to everyone involved, at a minimal cost. If chaos and pain are frequent process problems, then disorganization and make-work are strongly present.

A development project that is extremely well-organized is one that is fairly simple to manage and will produce software at a predictable pace, allowing for estimations to properly define priorities. A smooth development project, of course, provides more options for the work to match the required solutions. Organization feeds directly into the ability of the work to fulfil its goals in a reasonable time, at a reasonable cost. The very thing that users want most from software development. Organization is initially more expensive, but in any effort that is over a few months it quickly pays for itself. In a project expected to take years, with multiple developers, it is fairly insane not to take it seriously.

Organization springs from some very simple ideas. Being somewhat subjective, it is not the sort of thing that can be appropriately defined by a standard or a committee. Really the core leaders at each and every stage of development need to setup their own form of organization and apply it consistently to everything (old or new). If that is completed, then the development work proceeds without significant friction. If it is just left to happen organically, it won't. Once disorganization takes hold, it will always get worse. It's a vicious cycle, with more and more resources sucked away. It can be painful to reverse this, but given that it is absolutely fatal to a project, ignoring it won't work. Organization is a necessity to build all but the most trivial of software.

Sunday, October 4, 2015

Abstraction

About the hardest thing I've ever tried to write about is 'abstraction'.

It defies a simple explanation precisely because it is so abstract. Some people get it at an intuitive level, but for many others, it slips past their grasp, which is unfortunate since it is the gateway to efficiently building sophisticated software systems. It's the high-performance engine of the software industry, but it is utilized way too infrequently.

For this post, I'll try approaching the description from a completely different angle. It may make it more concrete for some people, perhaps. The main prerequisite is an understanding of entity-relationship modeling -- covered somewhat by my earlier post 'Data Modelling'.

A software entity represents some "concrete" thing in reality. We pick these things because they are useful for solving a user's problem. Each of these things has a set of attributes that are mostly discrete variables, although they could also be sub-entities. There are no continuous variables such as Real numbers, stored in a computer, only discrete finite approximations of them like floating point numbers. That everything, except time, is finite and bounded is sometimes important in ensuring correct behavior, but it also imparts some interesting constraints on the characteristics and usefulness of any software program.

If we turn our attention back to 'reality' we find that all of the 'things' in it are composed of particles. There are decompositions that go far deeper but for this post, we can ignore those. If we were to look at a person, we could think of them as being a 'container' of some incredibly large number of particles. Within that container, are a very large number of subsets, such as arms and legs. In a sense, a person is decomposable into any and all of the possible permutations for any subset of their particles. That's an incredibly huge number of subsets. In practice, however, only a very small number of these subsets actually make sense to us, like 'limbs'; the rest are ignored. Still one could conceivably organize every one of these permutations into hierarchies based on their 'closeness'; perhaps having the full set that represents a person at the root.

Given any two people, there are obviously common subsets like 'limbs', that represent the same relative parts, yet are filled with different particles for each person. In a sense, we simply designate the term 'limb' as being a common 'category' containing the arms or legs for both of these people. We attach it to a particular collection of subsets, even if we don't explicitly and precisely define what they are during most occurrences of our usage.

Now it so happens that our software term 'entity' also means a container, but for variables instead of particles. Thus, we can easily arrange to have entities in the software that have a one-to-one mapping to any containers of particles in reality. That makes a great deal of sense, but we do find that people often have attributes such as 'first name' that aren't directly related to subsets of their particles. That's not really a problem, in that if we use a term like 'limb' for body parts, it really is just a 'label' to refer to a collection of subsets. Names are exactly that, labels, but in this case, they refer to a specific instance, instead of a collection of them.

That idea that we can label both instances and generic groups of subsets is very important. It shows that this labeling mechanism is fairly powerful. We can use it uniquely, or we can apply it to be unbounded across all of time. When we refer to a limb, for instance, it holds its meaning for all of the past, present, and future. Throughout time, there will be a massive number of limbs, too many to ever count. No matter where, or when it is applied, the term references a very specific subset or 'category' of particles; or parts of a person if you would prefer.

And thus, we have this means of relating together vast subsets as being explicit 'categories'. Now we can create these categories and attach them to virtually any group of subsets imaginable, but very much like the permutations only some of these types of labels are useful.

Now comes the fun part. Earlier we noted that entities can be mapped to these subsets, but also that the attributes themselves are explicitly related to the subsets as well. Also, we noted that a person may contain up to four limbs, two of which are arms and two of which are legs. While the term 'legs' reference that same subset -- entity -- as 'limbs' it is just slightly more specific about the instances. If we were to collect specific information about someone's legs, given the differences in particles from 'arms', our corresponding entity would contain leg-specific variables. Placeholders, if you will, for the specifics of an individual.

When jumping from an entity of type 'legs' to a slightly more abstract 'limbs' there would be less variables that we could store. There might be specific information we could add about the 4 limbs like the number still remaining, but anything leg or arm specific would be 'abstracted away' as we moved up to the higher category.

For many object-oriented programmers, they may have recognized the above description as being 'polymorphism' which in itself is a particular type of abstraction. In moving 'upwards' we can see that we lose the specifics related to categories of instances, but we also gain metrics relating to the larger collection of subsets. In that, any abstraction is obviously a generalization that implicitly refers to more things than the original. A key reason for doing this is to bring special cases, with some useful similarity together, to save on the complexities of handling them individually. That is, it is simpler with respect to both organization and resources.

One thing to fully understand is that there is essentially an unlimited number of ways to relate similar subsets together, particularly if they contain a significant number of particles. That gives us an unfathomably large number of ways to categorize entities. That also means that there is no 'right' way to categorize, just convenient ones. This really hits home in software in that modeling any type of data with a static schema is inevitably going to limit the usefulness of that data itself. The static relationships cannot be 'complete' or all-inclusive. So, in a sense, any software system that is growing is going to need to re-arrange or re-categorize its modeling, whenever that functionality requested exceeds the abilities of the earlier model. That is, software is constantly in flux, because it is always, at most, focussed on some absurdly small number of the subset possibilities.

What's interesting about this view is that there are a huge number of subsets of particles in the real world that are useful to track with a computer. We're at the very early stages right now, most of the entities we capture are extremely crude. As we progress in our understanding of both modeling and complexity, we should be able to build systems that really do encompass enough intelligence to simplify people's lives. We're a long way away from that right now.

Earlier, I said that entities represent "concrete" things in reality. That isn't exactly true. For instance, in a video game, an entity might represent a humanoid like an elf or a dwarf. While these 'fictional' ideas share some similarity to reality, they are definitely not real. That is fine, in that our imaginations allow us to combine real things with abstract ones and contemplate these. One could not even say that these imaginary entities are bounded by the real world, give some of the abilities that a superhero, like Superman, has. But we can actually go even further in that the abstractions of mathematics quite easily allow us to ponder any sort of thing -- particularly as a formal system -- that utilizes alternatives that would directly contradict reality. That is, we can model anything, as a set of complex relationships, even though it violates the physical constraints of our existence. We are not bounded by our observable reality. We can still think of any of this as "concrete" because we are able to communicate it to each other. It is describable and can be modeled symbolically by our software.

All of this gives us the ability to see if something is really an abstraction and to possibly optimize the abstractions to be both larger and more useful. It also leaves us with a sense of coverage; that is a supposed abstraction that only intersects minimally with the original subset but fails to cover it. For testing a basic abstraction, all we need to do is see that there are plenty more subsets included as we go higher. If the abstraction bends upward, getting stuck on some alternative categorization, we can see that too. That latter case is actually sometimes useful, in that an abstraction that binds together a collection of subsets with some 'imaginary' subsets as well is an often used technique to increase performance, readability or reduce code. The imaginary parts simply reduce friction and make everything fit properly. It's exactly for that reason that getting pedantic about the correspondence between reality and the subsets can often make the problem harder, not simpler, which leads some programmers to avoid it. That's unfortunate, given that the directed creativity necessary to smooth over the bumps in a complex system is actually one of the types of problems that programmers say they want to solve. It takes a bit of thinking, but it isn't impossible, or particularly strange.

Abstraction then isn't all that abstract. We're really just loosening the labeling to collectively refer to a larger number of things. Given the variety of relations between collections of subsets, we have to be careful to ensure that any loosening shifts upwards in the desired manner. Once we have abstracted, we now have less individual cases to consider so we can save time and resources as we implement the models. Used properly, good abstractions can free the designers, allowing them to tackle the really sophisticated problems, instead of getting bogged down in the brute force tar pit.