Sunday, October 4, 2015


About the hardest thing I've ever tried to write about is 'abstraction'. 

It defies a simple explanation precisely because it is so abstract. Some people get it at an intuitive level, but for many others, it slips past their grasp, which is unfortunate since it is the gateway to efficiently building sophisticated software systems. It's the high-performance engine of the software industry, but it is utilized way too infrequently.

For this post, I'll try approaching the description from a completely different angle. It may make it more concrete for some people, perhaps. The main prerequisite is an understanding of entity-relationship modeling -- covered somewhat by my earlier post 'Data Modelling'.

A software entity represents some "concrete" thing in reality. We pick these things because they are useful for solving a user's problem. Each of these things has a set of attributes that are mostly discrete variables, although they could also be sub-entities. There are no continuous variables such as Real numbers, stored in a computer, only discrete finite approximations of them like floating point numbers. That everything, except time, is finite and bounded is sometimes important in ensuring correct behavior, but it also imparts some interesting constraints on the characteristics and usefulness of any software program.

If we turn our attention back to 'reality' we find that all of the 'things' in it are composed of particles. There are decompositions that go far deeper but for this post, we can ignore those. If we were to look at a person, we could think of them as being a 'container' of some incredibly large number of particles. Within that container, are a very large number of subsets, such as arms and legs. In a sense, a person is decomposable into any and all of the possible permutations for any subset of their particles. That's an incredibly huge number of subsets. In practice, however, only a very small number of these subsets actually make sense to us, like 'limbs'; the rest are ignored. Still one could conceivably organize every one of these permutations into hierarchies based on their 'closeness'; perhaps having the full set that represents a person at the root.

Given any two people, there are obviously common subsets like 'limbs', that represent the same relative parts, yet are filled with different particles for each person. In a sense, we simply designate the term 'limb' as being a common 'category' containing the arms or legs for both of these people. We attach it to a particular collection of subsets, even if we don't explicitly and precisely define what they are during most occurrences of our usage.

Now it so happens that our software term 'entity' also means a container, but for variables instead of particles. Thus, we can easily arrange to have entities in the software that have a one-to-one mapping to any containers of particles in reality. That makes a great deal of sense, but we do find that people often have attributes such as 'first name' that aren't directly related to subsets of their particles. That's not really a problem, in that if we use a term like 'limb' for body parts, it really is just a 'label' to refer to a collection of subsets. Names are exactly that, labels, but in this case, they refer to a specific instance, instead of a collection of them.

That idea that we can label both instances and generic groups of subsets is very important. It shows that this labeling mechanism is fairly powerful. We can use it uniquely, or we can apply it to be unbounded across all of time. When we refer to a limb, for instance, it holds its meaning for all of the past, present, and future. Throughout time, there will be a massive number of limbs, too many to ever count. No matter where, or when it is applied, the term references a very specific subset or 'category' of particles; or parts of a person if you would prefer.

And thus, we have this means of relating together vast subsets as being explicit 'categories'. Now we can create these categories and attach them to virtually any group of subsets imaginable, but very much like the permutations only some of these types of labels are useful.

Now comes the fun part. Earlier we noted that entities can be mapped to these subsets, but also that the attributes themselves are explicitly related to the subsets as well. Also, we noted that a person may contain up to four limbs, two of which are arms and two of which are legs. While the term 'legs' reference that same subset -- entity -- as 'limbs' it is just slightly more specific about the instances. If we were to collect specific information about someone's legs, given the differences in particles from 'arms', our corresponding entity would contain leg-specific variables. Placeholders, if you will, for the specifics of an individual.

When jumping from an entity of type 'legs' to a slightly more abstract 'limbs' there would be less variables that we could store. There might be specific information we could add about the 4 limbs like the number still remaining, but anything leg or arm specific would be 'abstracted away' as we moved up to the higher category.

For many object-oriented programmers, they may have recognized the above description as being 'polymorphism' which in itself is a particular type of abstraction. In moving 'upwards' we can see that we lose the specifics related to categories of instances, but we also gain metrics relating to the larger collection of subsets. In that, any abstraction is obviously a generalization that implicitly refers to more things than the original. A key reason for doing this is to bring special cases, with some useful similarity together, to save on the complexities of handling them individually. That is, it is simpler with respect to both organization and resources.

One thing to fully understand is that there is essentially an unlimited number of ways to relate similar subsets together, particularly if they contain a significant number of particles. That gives us an unfathomably large number of ways to categorize entities. That also means that there is no 'right' way to categorize, just convenient ones. This really hits home in software in that modeling any type of data with a static schema is inevitably going to limit the usefulness of that data itself. The static relationships cannot be 'complete' or all-inclusive. So, in a sense, any software system that is growing is going to need to re-arrange or re-categorize its modeling, whenever that functionality requested exceeds the abilities of the earlier model. That is, software is constantly in flux, because it is always, at most, focussed on some absurdly small number of the subset possibilities.

What's interesting about this view is that there are a huge number of subsets of particles in the real world that are useful to track with a computer. We're at the very early stages right now, most of the entities we capture are extremely crude. As we progress in our understanding of both modeling and complexity, we should be able to build systems that really do encompass enough intelligence to simplify people's lives. We're a long way away from that right now.

Earlier, I said that entities represent "concrete" things in reality. That isn't exactly true. For instance, in a video game, an entity might represent a humanoid like an elf or a dwarf. While these 'fictional' ideas share some similarity to reality, they are definitely not real. That is fine, in that our imaginations allow us to combine real things with abstract ones and contemplate these. One could not even say that these imaginary entities are bounded by the real world, give some of the abilities that a superhero, like Superman, has. But we can actually go even further in that the abstractions of mathematics quite easily allow us to ponder any sort of thing -- particularly as a formal system -- that utilizes alternatives that would directly contradict reality. That is, we can model anything, as a set of complex relationships, even though it violates the physical constraints of our existence. We are not bounded by our observable reality. We can still think of any of this as "concrete" because we are able to communicate it to each other. It is describable and can be modeled symbolically by our software.

All of this gives us the ability to see if something is really an abstraction and to possibly optimize the abstractions to be both larger and more useful. It also leaves us with a sense of coverage; that is a supposed abstraction that only intersects minimally with the original subset but fails to cover it. For testing a basic abstraction, all we need to do is see that there are plenty more subsets included as we go higher. If the abstraction bends upward, getting stuck on some alternative categorization, we can see that too. That latter case is actually sometimes useful, in that an abstraction that binds together a collection of subsets with some 'imaginary' subsets as well is an often used technique to increase performance, readability or reduce code. The imaginary parts simply reduce friction and make everything fit properly. It's exactly for that reason that getting pedantic about the correspondence between reality and the subsets can often make the problem harder, not simpler, which leads some programmers to avoid it. That's unfortunate, given that the directed creativity necessary to smooth over the bumps in a complex system is actually one of the types of problems that programmers say they want to solve. It takes a bit of thinking, but it isn't impossible, or particularly strange.

Abstraction then isn't all that abstract. We're really just loosening the labeling to collectively refer to a larger number of things. Given the variety of relations between collections of subsets, we have to be careful to ensure that any loosening shifts upwards in the desired manner. Once we have abstracted, we now have less individual cases to consider so we can save time and resources as we implement the models. Used properly, good abstractions can free the designers, allowing them to tackle the really sophisticated problems, instead of getting bogged down in the brute force tar pit.