Sunday, January 6, 2008

Abstraction and Encapsulation

Some of our most fundamental programming concepts are steeped in confusion. Because software development is such a young discipline, we often question and redefine many of the basic definitions to suit our personal understandings.

While this is a problem, it is also to be expected. Terms and definitions are the building blocks on which we build all of the higher concepts, if these are weak or incorrect they can lead us in circles long before we realize it is too late. Thinking in depth about their inner meanings is important. Older disciplines have no doubt survived much more discourse on their base topics; we are really only at the beginning of this stage. Other than what is mathematically rigorous, we should absolutely question our assumptions, it is the only way for us to expose our biases and grow. So much of what we take for fact is not nearly as rigid as we would like to believe.

For this posting, I'll bounce around a couple of basic terms that I've been interested in lately. There are no doubt official definitions somewhere, but if you'll forgive my hubris, I want to stick with those definitions that have come from my gut, driven by my experiences in the pursuit of software. When we reconcile experience and theory, we stand on the verge of true understanding.

A SIMPLE DEFINITION OF ENCAPSULATION

Encapsulation is a pivotal concept for software development. Its many definitions range from simple information hiding all of the way up to encompassing a style of programming. For me, the key issue is that encapsulation isolates and controls the details. It is how it relates back to complexity that is truly important. In a couple of my previous earlier posts I clarified my perspective of it:

http://theprogrammersparadox.blogspot.com/2007/11/art-of-encapsulation.html

and I added a few examples of what I think 'partial' encapsulation means and why it is such a big problem:

http://theprogrammersparadox.blogspot.com/2007/12/pedantically-speaking.html

These two posts define encapsulation as a way of breaking off parts of the system and hiding them away. The critical concept is that encapsulation is a way to manage complexity and remove it from the higher levels of the project. Controlling software development is really about controlling complexity growth. If it overwhelms us, the likelihood is failure. Partial encapsulation, which is a common problem, allows the details to peculate upwards which essentially undoes all of the benefits of encapsulation.

A SIMPLE DEFINITION OF ABSTRACTION

It may have just been a poor-quality dictionary, but when I looked up the definition of 'abstraction' it was self-referential: an abstraction is defined as an 'abstract' concept, idea or term. Where some people see this as a failure, I see it as a chance to work my own definition, at least with respect towards developing software. Following in the footsteps of all of those math textbooks that tortured me in school I'll leave a more generalized definition of the word as an exercise to the reader.

Although it is not necessary reading, in another earlier post I delved into the nature of 'simple':

http://theprogrammersparadox.blogspot.com/2007/12/nature-of-simple.html

this entry included a few ideas that might help to understand this some of the following craziness.

An abstraction is a simplification of some 'thing' down to its essential elements and qualities, that still -- under the current circumstances -- uniquely defines it relative to itself and any other 'thing' within the same abstraction. If two 'things' are separated, the two things when abstracted are still separated. Mathematically most abstractions involve isomorphic mappings (unique in both directions: one-to-one and onto) onto a simpler more generalized space, but if the mapping is not isomorphic, than any of the 'collisions' (onto, but not one-to-one?) must be unimportant. Thus for a non-isomorphic abstraction, it is workable if and only if it is projected onto a space that is simplified with respect towards a set of non-critical variables. If not, then it is no longer a valid abstraction for the given data. Other than collisions, if anything is not covered by the abstraction (not one-to-one?), then it too, at least in its present form is not a valid abstraction.

Lets get a little more abstract.

We all know, in our own ways that software is not real; it is not concrete. In that it lives encased in a mathematical existence, it has the capacity for things not tied to the real world, like perfection or a lack of the effects of entropy, for example. All software does is manipulate data, and it is there that we need to focus: a piece of data in a computer is an abstract representation of that data in the real world. It is nothing more than a shadow of the 'thing' onto some number of bits of 'information' that represent it to us. If I have an entry in the computer for me as a user, 'I' am not in the computer, but some information 'about' me is. And, that information is only a subset of all there is to know about me, a mere fraction of the complete set of information in the real world that is essentially infinite.

So all of our data within the computer is nothing more than an abstract representation of something in reality. Well, it could actually be several steps back from reality, it may be an abstract representation of a summary of abstract representations of some things in reality, or something convoluted like that. But we can ignore that for now.

Getting less meta-physical, the data in the computer is a 'placeholder' for some data in reality. The 'captured' relationships between pieces of data in the computer are supposed to mirror the 'important' relationships between the 'things' in reality. In the real world for any given piece of data, there is an 'infinite' number of relationships with other data, but the computer can only capture a discrete number of these relationships. It is entirely finite at any instance in time. We can simulate active infinite relationships, but only through the expenditure of computational power, thus allowing our 'time' dimension to be appear infinite. We can also simulate the infinite handling of things symbolically, but again this comes only through computational power.

It is perhaps an odd way of looking at the overall landscape, but this perspective is important in being able to explore some various points.

A FEW ESSENTIAL POINTS

If we step back and see the data in the software as a placeholder for stuff in reality, that leads us to surmise:

1. Everything in the computer is by virtue of its existence, abstract. Nothing is concrete, and all of this abstract stuff lies entirely in the domain of mathematics. This is why the essence of software development -- Computer Science -- is in some schools considered to be a branch of mathematics. It absolutely is. It is applied, to be sure, but it is all mathematics even if we think we are dealing with real world data. Software inhibits a bizarrely mathematical world, dragged down only by the reality of our problems.

2. Because everything is already an abstraction, there is always at least one abstraction that can contain the current abstraction without losing some essential virtual characteristic. The trivial case of course is every abstraction can be abstracted by itself. This is seemingly useless, except that it leads to the next item.

3. Every 'real-world' based abstraction has at least one complete higher-level abstraction. At very least some type of summary. Most real-world based abstractions have many many higher-level abstractions; often diverging on very different paths. At very least, they have the taxonomy for their specific category of information. Many have multiple taxonomies in which they are situated. For example, poodle is a type of dog, which is a type of mammal, which is a type of animal, etc. But poodle is also a type of pet, which is an animal with a specific relationship towards man, which are sentient beings, etc. Multiple languages act as other alternative taxonomies. Taxonomies are bounded by the messiness of mankind but there are generally a large number of mostly undisputed taxonomies for all things in reality. In fact, there is one for everything we know about, at least for everything we can tell each other about. Everything has a name or at very least a unique description.

4. The abstraction of the data is fundamentally a mathematical concept, which is not the same as the 'implementation' of the abstraction for a specific piece of software. Instantiating an abstraction into a specific computer based language is an implementation, which while based on the abstraction, it is not the same as the abstraction. The abstraction for instance could be correct, while the implementation might not be. One is fundamentally an idea, while the other is a thing. More concretely, the implementation is a physical thing existing at least as a series of magnetic impulses on a disk, if not another similar series of impulses loaded into a couple of silicon chips. The ideas behind software are abstract, but once written software is not. It has a physical manifestation; tiny but still physical.

5. An abstraction as a mathematical concept is not related to encapsulation. Encapsulation is a black box that contains some piece of complexity, encapsulating it away from the rest of the system. Whether or not the encapsulation is abstract, or even which implementation is used inside of the black box, is entirely independent. An implementation may encapsulate an abstraction, but it is more likely that the abstraction itself is the interface towards the encapsulated implementation, or something like that. Keeping abstraction and encapsulation separated is important because both have very different effects on software development.

6. An abstraction fits over an underlying 'data' abstraction completely. If it does not, then while it still may be an abstraction, it is not necessarily related to the underlying data one. E.g. if 80% of the abstraction is correct, then it is a badly fitting abstraction. Just because it partially fits, does not make it a good abstraction. However, multiple simple abstractions can be combined together to form an abstraction that fits entirely over another one, even if the individual ones do not. This new composite abstraction is more or less complete, if it contains no holes, but that doesn't necessarily mean that it isn't 'ugly'.

Together these points about abstractions, encapsulation and implementations have some interesting properties that often seem to be mis-understood. Consider these points, as we look at two related essays and explore some of the base assumptions.

THE LAW OF LEAKY ABSTRACTIONS

In his essay, Joel Spolsky defines abstractions, says they can leak and then defines a law of leaky abstractions: http://www.joelonsoftware.com/articles/LeakyAbstractions.html

While his problems are real, one cannot blame an abstraction for the fact that a) it doesn't fit the underlying problem, or b) the implementation is bad.

An abstract representation is one that fits over the underlying details. So if it only fits over 80%, then it is a poor abstraction. If the programmer doesn't encapsulate the details away from his users, then he is a poor programmer. Often we focus so hard on the code we build that we forget about the people using it.

In his essay Joel says that TCP/IP is an example of a leaky abstraction because the implementation chooses to allow the packet connection to timeout at some point. It could have just as easily never timed out, and then delivered the data on an eventual reconnect. There are publish/subscribe infrastructures that do exactly that, but they do require some interim holding storage that has the potential of overflowing at some point. Because of that, the choice to time-out is a resource vs. timing trade-off that is related to a generalized implementation of TCP. The idea of implementing a reliable protocol over an unreliable one is still a reasonable abstraction, regardless of which trade-offs must be made for an implementation to actually exist at some point.

In another example he says iterating memory in a higher level language leaks through because the order in which the programmer iterates through the array affects the performance. But it is possible for the iteration code to be examined by the software at compile time or even runtime and the 'direction' changed to that of the fastest approach. Some Existing languages such as APL do similar things. While this is not commonly done as an optimization by many languages, that does not imply that the abstraction of the language over assembler is any less valid. The implementation allows this detail through, not the abstraction. The abstraction just creates a higher level syntax to allow for more complicated instructions to be inputted with a smaller number of base primitives through a slightly different idiom.

I don't think abstractions can leak, although some are clearly poorly fitting. Everything has at least one valid abstraction that fits entirely over it. For all but the most esoteric applications, there is a huge number of potential abstractions that could be used. Just because we so frequently choose to jam the wrong abstractions over the wrong data doesn't mean that abstractions are the cause of the problem. They are actually the opposite. Our increases in sophistication come from our discovery and usage of ever increasing complex abstractions. Blaming a mathematical tool for the ugliness of our real world implementations is just not fair.

However, not to lose Joel's main point, there is an awful lot of 'complexity' bubbling up from the lower layers of our software. Every time we build on top of this stuff, we become subject to more and more of these potential problems. Getting back to my earlier writings, I think the real underlying cause is that the implementation is only partially encapsulating the underlying details. Often this is because the programmers want to give the users of their stuff some larger degree of freedom for the way they can utilize the code. Once the holes are open, the details 'leak' through to the higher levels and cause all sorts of fun and wonderful complexity ripples in the dependent code or usage. Because of this, I think Joel's law should exist, but be renamed to: "the law of leaky partial encapsulations". It is not as pretty a name, but, I think it is more accurate.


DIVING INTO AN ABSTRACTION PILE:

Another interesting read is: http://www.ericsink.com/Abstraction_Pile.html

Although I think Eric Sink does a excellent and interesting job of digging into most of the layers existing for his computer software, I disagree somewhat with his three initial rules and some of his terminology. The various 'encapsulated' layers in the system he mentions often have abstractions, but again we need to keep abstractions and layers as separate concepts. Erik puts forth a few rules about abstractions 1) they contain bugs, 2) they reduce performance and 3) they increase complexity. Dealing with each point in turn:

1) Abstractions contain bugs. An abstraction is a mathematical concept, it is what it is. The implementation can easily contain bugs, either because it is a poor fit to the problem or because some of the underlying logic is actually broken. Abstractions fit or they don't. A badly fitting abstraction is not the abstraction's fault, it is the implementation. Being more specific though, all code, abstract or not contains bugs. While a specific sequence of steps done in a brute force manner may be easier to fix if there are bugs discovered, in general abstractions contain orders of magnitude less code and because the behavior is more generalized, the testing of the code is denser. As such, while abstract code is harder to fix, it is less likely to contain bugs as the code ages. Specifically for working tested abstract code, it will contain less bugs than some brute force version, but both will contain bugs. The denser usage and testing will bring problems to the surface much faster. Layering is another important way of breaking off complexity, but you can just as easily do it with very specific brute force code as you can do it with abstract code, it is just a way of slicing and dicing the problem into smaller pieces.

2) Abstractions reduce performance. We all recognize that an expert hand-coding some assembler can produce faster-tighter code than a C compiler can. That is not, I think a by-product of the abstraction that C provides as a language, but simply because the work hasn't been done on a C compiler to better optimized the final code, yet. We've seen tremendous speed increases in Java virtual machines, but primarily because the semantic and syntactic freedom of Java is way less than C, so it appears to be easier to think of new and exciting ways to detect specific circumstances that can easily be optimized. Nothing about a correctly fitting abstraction will inherently interfere with the performance characteristics that some clever programmer with a lot of time cannot detect and counter-balance. If that is not true then the abstraction has collapsed down some critical variables into an ambiguity, which was still the essence of the problem, so it is badly fitting by definition.

Also, in general a compiler can produce code that performs better than the 'average' assembler programmer. The level of expertise for most programming languages spans a huge margin and the average of that margin over time will produce lots of un-optimal code. In the simple cases most compilers, with their inherent consistency will exceed the abilities of an average (or possibly somewhat less) programmer over much of the code they produce. Even the sloppiest of programmers will be better sometimes, but probably not overall.

Getting even more detailed, in order to optimize any piece of code one needs to abstract it to begin with. If we still have just a rigid set of instructions, there is actually no way to optimize them, they are exactly what is needed to get the work done and nothing more. If we go to a higher level of abstraction, then we can start to move many of the 'computational' operations forward, so that we can reuse the calculations for the optimization. Optimizations often come from getting more abstract and then moving more calculations forward and reusing the values. Caching, for example can be viewed that way. Continue this generalization of the problem until at some point the processing can't be reduced any more. This is a core technique for manually optimizing code. Abstractions, rather then reducing performance, are the reasons why we actually can optimize code.

As for layering, the overhead from creating layers in a system is most often minimal in comparison with the work that the system is actually doing. Assuming that you don't go to far, slicing and dicing the code into segments generally pays for itself, and often allows the replacement of some of the generalized segments with optimized abstract variants.

Applying abstractions that allow for optimizations at various layer can cause massive boosts in performance. I once optimized a program that was initially running in 2 hours, down to 1/2 hour with less than half the amount of code. Not only that, but the abstractions allowed for a huge generalized set of things to be calculated instead of just the specific one, and building the code took half as many man-hours as the original. Less code, faster performance, faster development and applicable to many problems instead of just one. That is the true power of abstraction.

3) Abstraction increases complexity. The point of an abstraction is to simplify something, and by virtual of doing that it does not increase complexity *beyond* the actual cost of the understanding and implementing the abstraction. If we replace some long set of repetitive instructions with a generalized abstract set, we will dramatically reduce the overall amount of code in the system. While each line of new code in itself is more complicated, the reduction in size reduces the overall complexity. Not only that, but the newer denser code is easier to test and more likely to be consistent. For any reasonable fitting abstraction, if it is implemented correctly, the system will have less complexity relative to the size of the problems it is solving. Often it will have dramatically less complexity, although on a line-by-line basis it may have been increased.

Also, we can utilize layering and encapsulation to essentially 'hide' complexity in various lower levels of the system. For example, if we break off half of a system and hide it in a black box, then the remaining half is simpler then the original and the half in the black box is simpler, but overall the whole: half + half + box is a little more complicated. If the block box only partially encapsulates, then the system is indeed worse because all of the details are now visible again at the top level. While encapsulating a layer does add some additional complexity, it hugely decreases the 'localized' complexity which allows it to counter-balance the localized complexity increases from implementing an abstraction.

We could never write our current level of programs in assembler, the complexity would overwhelm us. It is the abstractions we have found that give us the ability to build higher and higher solutions to our ever growing problems. Through them we can manage the increases in complexity. Through layering, we can distribute it into smaller parcels. Even if it is disappointing sometimes, we cannot deny that the sophistication of today's systems is orders of magnitude above and beyond those written fifty years ago, progress is evident and abstraction is clearly a core reason. But while we have some neat abstractions, we have only scratched the surface of those that are actually possible.


WHY IS THIS IMPORTANT?

While it is often overlooked, Computer Science is firmly rooted in mathematics. In its purest sense, software exists in an entirely precise and nearly perfect mathematical world. As we fill in more and more placeholders for the real world, we often tie down our software with the vagaries and messiness of our daily existence. Chaos and disorder play in reality, finding their way into the code. But intrinsically there is no entropy for mathematical formulas, they will stand unaltered throughout the ages.

That software straddles the real world while running in a purely mathematical one is important in understanding how much disorder and complexity are necessary in our solutions. We are tied so closely to reality, that software -- unlike mathematics -- actually rusts if it is not continually updated: http://people.lulu.com/blogs/view_post.php?post_id=28963

What this means is that while we cannot undo or avoid any of the inherent complexity of the problem domain for the tools we are building, we can find more 'abstract' ways to represent the information we are manipulating which allows us more flexibility in how we create these tools. Brute forcing the problem of creating code causes programmers to iterate out every single step of the problem they wish to solve, an easy but dangerous approach. If we fall back on more abstract and general representations for our code, not only can we optimize the speed of handling the problem, but we can also optimize the speed of creating a tool to handle the problem. Massively reducing the amount of code we need for the solution comes directly from utilizing abstractions.

It is only in leveraging our work in developing software that we can hope to build the full range of tools possible for a computer. If the alternative is pounding out every line of code that we need or will ever want, then our computer tools will not live up to their full potential for hundreds of years at least. We need massive effort through brute force to create and maintain reams of fragile code.

Our real troubles in development lay with our implementations and encapsulation. We go thorough a lot of work to hide the details, and then we go through a lot more to make them visible again. A small bit of insanity that we have been constantly plagued with.

Even more concerning, there have been many movements towards modelling the real world, or towards pounding out the systems with variations of brute force. In either case we are passing over the obvious: the data starts as an abstraction and we can take advantage of that in our implementations. As we generalize up the abstraction levels, we increase some of the localized complexity, but we leverage the code enough to make it worthwhile. With more generalized solutions we can solve larger problem domains with only marginally larger solutions. Abstraction is the key to leveraging our work.

Abstractions are a pure mathematical concept. They exist for all things that can be computerized and we've only really seen the tip of the iceberg as far as abstractions are concerned. It is a huge space, and there are many millions more 'abstract' ways of solving our current problems than are currently realized. Abstractions are the closest thing to a silver bullet for software development that we will ever get. We should take advantage of them wherever possible.