
Thursday, June 18, 2015

Encapsulation

One of the strongest, but possibly least understood, principles of object-oriented (OO) programming is 'encapsulation'.

The OO paradigm explicitly imposes structure on top of code, which allows programmers to build and maintain considerably larger programs than in the distant past. This extra level of organization is the key to managing complexity. But while this amplifies our ability to build big programs, there is still a 'threshold of complexity' that once crossed will quickly start to degrade the overall stability of the development project, and eventually the software itself.

An individual programmer has fixed limits on how quickly they can build up instructions and later on how quickly they can correct problems. A highly-effective team can support and extend a much larger code base than the sum of its individuals, but eventually the complexity will grow beyond their abilities. There is always some physical maximum after which the work becomes excessively error prone or consistently slower or both. There is no getting around complexity; it is a fundamental limitation on scale.

However, we can significantly minimize it, to prevent crossing the threshold for as long as possible. This obviously comes from the strict avoidance of adding any artificial complexity, such as special cases, twisted logic, arbitrary categorizations and other forms of disorganization. That helps, but there is another approach as well.

Encapsulation can be seen as drawing a 'black box' around a subset of a complex system. That box prevents anyone on the outside from seeing the inner workings, but it also ensures that what's on the inside is not influenced by random outside behaviour. The inside and outside are explicitly walled off from each other, so that the only interaction is through a precisely defined 'interface'.

To get the most out of encapsulation, the contents of the box must do something significantly more than just trivially implement an interface. That is, boxing off something simple is essentially negative, given that the box itself is a bump in complexity. To actually reduce the overall complexity, enough sub-complexity must be hidden away to make the box itself worth the effort.

For example, one could write a new layer on top of a technology like sockets and call it something like 'connections', but unless this new layer really encapsulates enough underlying complexity, like implementing a communications protocol and a data transfer format, then it has hurt rather than helped. It is 'shallow'. What this means is that any useful encapsulation must hide a significant amount of complexity; there should be plenty of code and data buried inside the box that is no longer necessary to know outside of it. It should not leak out any of this knowledge. So a connection that seamlessly synchronizes data between two parties (how? We don't know) correctly removes a chunk of knowledge from the upper levels of the system. And it does so in a way that makes it clear and easy to triage problems as being 'in' or 'out' of the box.
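To make that distinction concrete, here is a minimal sketch of what a non-shallow 'connection' might look like, written in Python for readability. The length-prefixed JSON protocol, the Connection name and the method names are all my own inventions for illustration; the point is only that callers exchange plain data while the sockets, framing and wire format stay inside the box:

```python
import json
import socket
import struct

class Connection:
    """Hides the socket, the framing protocol and the data format.

    Callers exchange Python dicts; everything else -- length-prefixed
    frames, JSON encoding, partial reads -- stays inside the box.
    (The protocol here is hypothetical, chosen only for illustration.)
    """

    def __init__(self, sock: socket.socket):
        self._sock = sock

    def send_message(self, message: dict) -> None:
        payload = json.dumps(message).encode("utf-8")
        # 4-byte big-endian length prefix, then the payload.
        self._sock.sendall(struct.pack(">I", len(payload)) + payload)

    def recv_message(self) -> dict:
        length = struct.unpack(">I", self._recv_exact(4))[0]
        return json.loads(self._recv_exact(length).decode("utf-8"))

    def _recv_exact(self, n: int) -> bytes:
        # Sockets may return short reads; loop until we have n bytes.
        data = b""
        while len(data) < n:
            chunk = self._sock.recv(n - len(data))
            if not chunk:
                raise ConnectionError("peer closed the connection")
            data += chunk
        return data
```

A caller holding a Connection never touches byte counts or recv buffers, so any synchronization problem is quickly triaged as being either 'in' or 'out' of this box.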

Once a subset of the complexity has been fully encapsulated, and is easily diagnosable, it can be vetted and ignored for the moment. That is, if the connections library is known to work for all current circumstances used by the project, then programmers don't have to revisit the internals, unless they want to expand them. That's the real power of encapsulation. For a little bit of extra thinking and work, some part of the system is carved off and put to rest. Later, because of reuse or enhancements, it may make sense to revisit the code and widen its functionality, but for the moment it is one less (hopefully large) problem to deal with. The system is more complex, but the system minus the encapsulated parts is relatively less complex, and thus it is easier to work on what remains unsolved.

In little programs, encapsulation isn't really necessary, it might help but there just isn't enough overall complexity to worry about. Once the system grows however, it approaches the threshold really fast. Fast enough that many software developers ignore it until it is way too late, and then the costs of correcting the code become unmanageable.

It is for that reason that many seasoned developers have learned the habit of encapsulating early and often. They essentially have a second 'editing pass' on any new code that is aimed at breaking off any potentially encapsulated parts into their own independent chunks. This is essentially 'partial' encapsulation. The code is partitioned, but not rigorously enforced. Doing this regularly and reliably means that at some point later, when it is required, the code can easily be enhanced to full encapsulation.

If you look at any well-written code base, the related instructions tend to be localized, and there is at least an implicit organization that cleanly separates the underlying pieces. That is, the individual lines of code appear in the sub-parts of the system exactly where you would expect them to be located. Perhaps some of the code is formally encapsulated, but often that degree of rigor is not necessary at the current stage of the code base, so it's just partial. Contrast that with code containing an overabundance of hollow encapsulation, or code that seems to be randomly located anywhere, and you can see why this principle is so important in keeping the code useful.

To build big systems, you need to build up a huge and extremely complex code base. To keep it manageable you need to keep it heavily organized, but you also need to carve out chunks of it that have been done and dusted for the moment, so that you can focus your current efforts on moving the work forward. There are no short-cuts available in software development that won't harm a project, just a lot of very careful, dedicated, disciplined work that when done correctly, helps towards continuing the active lifespan of the code.

Thursday, December 6, 2007

Pedantically Speaking

Spending your days programming a computer -- if you do it long enough -- will start to have a noticeable effect on your personality. Not that it is a big surprise; one's profession -- no matter how complex or simple -- has always gradually morphed one's personality. If you think that all of the accountants you've met are basically the same, you're not that far off base.

Programming is a discipline where we bury ourselves in the tiniest of details. Perfection is necessary, at least to the degree that your work won't run on the computer if it is too sloppy, and it is exceptionally difficult to manage if it is even a bit sloppy. This drives in us the need to be pedantic. We get so caught up in the littlest of things that we have trouble seeing the whole.

Most often this is a negative, but for this post I wanted to let it loose into its full glory. It is my hope that from the tiniest of things, we may be able to draw the greatest of conclusions.

Originally, the thoughts for this blog entry were a part of a comment I was going to make on another entry, but on reflection I realized that there was way too much of it to fit into a simple comment field. This post may be a bit long-winded with lots of detail, but it is the only real way to present this odd bit of knowledge. You get some kinda reward if you make it to the end in one piece.

I was going to be rigorous in my use of a programming language for any examples. Then I thought I might use pseudo code instead. But after I had decided on the opening paragraph, I figured that the last thing I actually needed was to be more rigorous. If I am going to be pedantic then the examples should be wantonly sloppy. Hopefully it all balances out. The best I can do for that is to mix and match my favorite idioms from various different programming languages. The examples might be messy -- they certainly won't run -- but hopefully beyond their usefulness there should be minimal extra artificial complexity. Straight to the point, and little else.

For me the classic example of encapsulation is the following function:

string getBookTitle() {
    return "The Programmer's Paradox";
}


This is an example of cutting out a part of the code into a function that acts as an indirect reference to the title of a book. The function getBookTitle is the only thing another programmer needs to add into their code in order to access its contents. Inside, encapsulated away from the rest of the world is a specific instance of 'book title' information that references a specific title of a book. It may also happen to be the title of a blog, but that is rather incidental to the information itself. What it has in common with other pieces of similar information out there is possibly interesting, but not relevant.

The title in this case both explains the underlying action for the function -- based on the verb: get -- and expresses a categorization of the encapsulated information: a book title. It may have been more generalized, as in getTitle, or less generalized, as in getMyBookTitle, but that is more about the degree of abstraction we are using than it is about the way we are encapsulating the information.

In a library, if the code were compiled and the source wasn't available, the only way other programmers could access this information is by directly running this function. This 'information hiding', while important, is just one attribute of encapsulation. In this case my book title is hidden, but it is also abstracted to one degree as a generic book title. Used throughout a system, there is no reference to the author or any necessity to know about the author.

Now this "slice and dice" of the code is successful if and only if all of the other locations in the code access the getBookTitle function not the actual string itself. If all of the programmers agree on using this function, and they are consistent in applying that agreement, then we get a special attribute from this. If you consider that this is more work than having the programmers just type in the string each time they needed it, then that extra work should be offset by getting something special in return. In this case, the actual title string itself exists in one and only one place in the code, so it is a) easy to look up, b) easy to change consistently for the whole program and c) easy to verify correctness visually. All strong points that make this more than just a novel approach for coding.

So far, so simple. Now consider the following modification to the function:

string getBookTitle(string override) {
    if(override is NULL)
        return "The Programmer's Paradox";
    else
        return override;
}


This is a classic example of what I often refer to as partial encapsulation. While the base function is identical, the programmer has given the option to any other programmers to override it with their own individual title. Not a realistic example, you say? Well, if you consider how many libraries out there allow you to override core pieces of their information, you'll see that this little piece of code is actually a very common pattern in our software. In a little example like this, it seems silly to allow an override, but it is done so often that it always surprises me that people can't see that what they are doing is really isomorphic to this example. It happens frequently.

The beauty of encapsulation is that we can assume that if all is good in coding-land and everyone followed the rules, that we can easily change the book title to something else. If we rebuild then the impact of that change will be obvious and consistent. More great attributes of the work we did. The problem with partially encapsulating it, however, is that these great attributes are no longer true. We no longer know where in the code programmers used the original string and where they chose to override it with one of their own. While we could use a search program like grep to list all of the instances of the call and check each one manually, that is a lot of extra time required to fully understand the impact of a change to the code. At 4am, that makes partially encapsulating something a huge pain in the ass. We lose the certainty of knowing the impact of the change.

With the exception of still having the ability to at least search for all of the calls to the getBookTitle function, the partially encapsulated version is barely better than just having all of the programmers brute force the code by typing in the same string each time, over and over again. There is still one advantage left, but consider how many great attributes we lost by opening up the call. If it was just to be nice to the calling programmers and give them a 'special' feature, then the costs do not justify the work.
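Rendered as runnable Python (my own sketch of the same idea, with invented names), the partially encapsulated version looks like this. The string still lives in exactly one place, but the guarantee that every caller actually uses it is gone:

```python
BOOK_TITLE = "The Programmer's Paradox"  # exists in one and only one place

def get_book_title(override=None):
    # Partial encapsulation: any caller may substitute its own title,
    # so the 'easy to change consistently' property is lost -- we can
    # no longer know which call sites show the real title.
    return BOOK_TITLE if override is None else override
```

Changing BOOK_TITLE now fixes only the callers that didn't override it, which is exactly the 4am problem described above.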

We can move on to the next iteration of the function:

string getBookTitle() {
    file = openFile("BookInformation.cfg");
    hash = readPropertiesFromFile(file);
    closeFile(file);
    return hash{"BookTitle"};
}


Now in this example, we moved our key information out of the code into a location that is far more configurable. In a file it is easy for programmers, system administrators and sometimes the users themselves to access the file and change the data. I ignored any issues of missing files or corrupt data; they aren't relevant to this example, other than to say that the caller high up the stack is concerned about them. The BookTitle functionality is really only concerned with returning the right answer. Presumably the processing stops if it is unavailable.
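For the curious, here is one way the file-based version might look as runnable Python. The key=value properties format and the read_properties helper are assumptions of mine, not a real library; error handling is left to the caller, as in the text:

```python
def read_properties(path):
    # Minimal key=value parser; a real one would validate this input
    # as carefully as any other external interface.
    properties = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                properties[key.strip()] = value.strip()
    return properties

def get_book_title(path="BookInformation.cfg"):
    # Missing files and bad data propagate up the stack deliberately.
    return read_properties(path)["BookTitle"]
```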

When we push out the information to a file, we open up a new interface for potential users. This includes anyone who can or will access the data. A configuration file is an interface in the same way that a GUI is one. It has all of the same problems, requiring us to check all of the incoming data in a bulletproof manner to make sure we are protecting ourselves from garbage.

More to the point, we've traded away our BookTitle information, which we've encapsulated into the file definition in exchange for the location of the file itself and the ability to correctly parse it. We've extended our core information.

In the outside world, the other programmers don't know about the file and if we are properly encapsulating this information, they never will. If we open up the ability to pass in an overriding config file, we are back to partially encapsulating the solution, but this time it is the config file access info itself, not the original title info that has been opened up.

Now the BookTitle lives in an accessible file; even if we set it read-only, it is hard to actually determine whether it was edited, so in essence the value is no longer encapsulated at all. It should be considered publicly available. But even though its actual value can no longer be explicitly controlled, its usage in the system is still fully encapsulated. The information is no longer hidden, but most of the encapsulation properties still hold. We have the freedom to handle any value, yet we keep the encapsulation on how it is used within the system. The best of both worlds.

Now we could consider what happens when we are no longer looking at a single piece of information, for example:

(title, font, author, date) = getBookInformation();

In this case, the one function is the source for multiple pieces of related information for a book. This has the property of 'clumping' together several important pieces of information into the same functionality, which in itself tends towards ensuring that the information is all used together and consistently. If we open it up:

(title, font, author, date) = getBookInformation(book_identification);

Then a programmer could combine the output of multiple different calls together and get them mixed up, but given that that would take extra work, it is unlikely to end up that way in production code. Sometimes the language can enforce good behavior, but sometimes all you can do is lead the programmer in the right direction and make it hard, but not impossible, for them to be bad.

In this last case, the passed in key acts as a unique way of identifying each and every book. It is itself an indirect reference in the same way that the function name was in the very first example. Assuming that no real information was 'overloaded' into the definition of the key itself, then the last example is still just as encapsulated as the very first example in this post.

Passing around a key does no real harm. Particularly if it is opaque. If the origin of the key comes from navigating directly around some large data source, then the program itself never knows either the key or the book information.

It turns out that sticking information in a relational database is identical to sticking it in a file. If we get the key from a database, along with the rest of the information then it is all encapsulated in the calling program. There are still multiple ways to manipulate it in the database, but if at 4am we see the value in the database we should be safe to fully assume that wherever that key is used in the code it is that specific value, and that property is also true of all other values. We have enough encapsulation to know that the value of the data itself is not significant.

That is, if I see a funny value on the screen then it should match what is in the database, if not then I can make some correct assumptions about how the code disrupted it along the way.

There are many more examples like these, but to some degree or another they are all examples of encapsulation or partial encapsulation. We just need to look at whether or not the information is accessible in some manner, and how much of it is actually accessible. It all comes down to whether or not we are making our jobs harder or easier.

So far this has been a fairly long piece, so I'll need to wrap it up for now. Going back over it slightly: we slice and dice our code to make it easier to build, fix and extend. Encapsulation is a way in which we can take some of the information in the system and put it out of harm's reach, so that our development problems are simpler and easier to manage. Partial encapsulation undoes most of that effort, making our work far less productive. We seek to add as many positive attributes to our code as possible so that we can achieve high quality without having to strain at the effort.

We can construct programs that have little knowledge of the underlying data beyond its elementary structure. For that little extra in complexity, we can add many other great attributes to the program, making it more dynamic and encapsulating the underlying information. The best code possible is always extremely readable, simple and easy to fix. This elegance is not accidental, it must be built into the code line by line. With the correct focus and a bit of practice programmers can easily build less complex and more robust systems. PS. I was only kidding about the reward, all you'll get from me is knowledge. Try the shopping channel for a real reward...

Sunday, January 6, 2008

Abstraction and Encapsulation

Some of our most fundamental programming concepts are steeped in confusion. Because software development is such a young discipline, we often question and redefine many of the basic definitions to suit our personal understandings.

While this is a problem, it is also to be expected. Terms and definitions are the building blocks on which we build all of the higher concepts, if these are weak or incorrect they can lead us in circles long before we realize it is too late. Thinking in depth about their inner meanings is important. Older disciplines have no doubt survived much more discourse on their base topics; we are really only at the beginning of this stage. Other than what is mathematically rigorous, we should absolutely question our assumptions; it is the only way for us to expose our biases and grow. So much of what we take for fact is not nearly as rigid as we would like to believe.

For this posting, I'll bounce around a couple of basic terms that I've been interested in lately. There are no doubt official definitions somewhere, but if you'll forgive my hubris, I want to stick with those definitions that have come from my gut, driven by my experiences in the pursuit of software. When we reconcile experience and theory, we stand on the verge of true understanding.

A SIMPLE DEFINITION OF ENCAPSULATION

Encapsulation is a pivotal concept for software development. Its many definitions range from simple information hiding all of the way up to encompassing a style of programming. For me, the key issue is that encapsulation isolates and controls the details. It is how it relates back to complexity that is truly important. In a couple of earlier posts I clarified my perspective on it:

http://theprogrammersparadox.blogspot.com/2007/11/art-of-encapsulation.html

and I added a few examples of what I think 'partial' encapsulation means and why it is such a big problem:

http://theprogrammersparadox.blogspot.com/2007/12/pedantically-speaking.html

These two posts define encapsulation as a way of breaking off parts of the system and hiding them away. The critical concept is that encapsulation is a way to manage complexity and remove it from the higher levels of the project. Controlling software development is really about controlling complexity growth. If it overwhelms us, the likelihood is failure. Partial encapsulation, which is a common problem, allows the details to percolate upwards, which essentially undoes all of the benefits of encapsulation.

A SIMPLE DEFINITION OF ABSTRACTION

It may have just been a poor-quality dictionary, but when I looked up the definition of 'abstraction' it was self-referential: an abstraction is defined as an 'abstract' concept, idea or term. Where some people see this as a failure, I see it as a chance to work my own definition, at least with respect towards developing software. Following in the footsteps of all of those math textbooks that tortured me in school I'll leave a more generalized definition of the word as an exercise to the reader.

Although it is not necessary reading, in another earlier post I delved into the nature of 'simple':

http://theprogrammersparadox.blogspot.com/2007/12/nature-of-simple.html

that entry includes a few ideas that might help in understanding some of the following craziness.

An abstraction is a simplification of some 'thing' down to its essential elements and qualities that still -- under the current circumstances -- uniquely defines it relative to itself and any other 'thing' within the same abstraction. If two 'things' are separated, the two things when abstracted are still separated. Mathematically most abstractions involve isomorphic mappings (unique in both directions: one-to-one and onto) onto a simpler more generalized space, but if the mapping is not isomorphic, then any of the 'collisions' (onto, but not one-to-one?) must be unimportant. Thus for a non-isomorphic abstraction, it is workable if and only if it is projected onto a space that is simplified with respect towards a set of non-critical variables. If not, then it is no longer a valid abstraction for the given data. Other than collisions, if anything is not covered by the abstraction (not one-to-one?), then it too, at least in its present form, is not a valid abstraction.
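The collision condition can be illustrated with a toy projection (all data here is invented). The third field plays the role of the non-critical variable being projected away; the abstraction is workable only if distinct things stay distinct after the mapping:

```python
# Full 'things': (name, category, texture). Suppose texture is the
# non-critical variable under the current circumstances.
things = [
    ("poodle", "dog", "curly"),
    ("beagle", "dog", "smooth"),
]

def abstract(thing):
    # Project away the non-critical 'texture' dimension.
    name, category, _texture = thing
    return (name, category)

images = {abstract(t) for t in things}
# Distinct things must remain distinct under the abstraction;
# any collision may only ever collapse non-critical differences.
collision_free = len(images) == len(things)
```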

Let's get a little more abstract.

We all know, in our own ways that software is not real; it is not concrete. In that it lives encased in a mathematical existence, it has the capacity for things not tied to the real world, like perfection or a lack of the effects of entropy, for example. All software does is manipulate data, and it is there that we need to focus: a piece of data in a computer is an abstract representation of that data in the real world. It is nothing more than a shadow of the 'thing' onto some number of bits of 'information' that represent it to us. If I have an entry in the computer for me as a user, 'I' am not in the computer, but some information 'about' me is. And, that information is only a subset of all there is to know about me, a mere fraction of the complete set of information in the real world that is essentially infinite.

So all of our data within the computer is nothing more than an abstract representation of something in reality. Well, it could actually be several steps back from reality, it may be an abstract representation of a summary of abstract representations of some things in reality, or something convoluted like that. But we can ignore that for now.

Getting less meta-physical, the data in the computer is a 'placeholder' for some data in reality. The 'captured' relationships between pieces of data in the computer are supposed to mirror the 'important' relationships between the 'things' in reality. In the real world for any given piece of data, there is an 'infinite' number of relationships with other data, but the computer can only capture a discrete number of these relationships. It is entirely finite at any instance in time. We can simulate active infinite relationships, but only through the expenditure of computational power, thus allowing our 'time' dimension to appear infinite. We can also simulate the infinite handling of things symbolically, but again this comes only through computational power.

It is perhaps an odd way of looking at the overall landscape, but this perspective is important in being able to explore some various points.

A FEW ESSENTIAL POINTS

If we step back and see the data in the software as a placeholder for stuff in reality, that leads us to surmise:

1. Everything in the computer is by virtue of its existence, abstract. Nothing is concrete, and all of this abstract stuff lies entirely in the domain of mathematics. This is why the essence of software development -- Computer Science -- is in some schools considered to be a branch of mathematics. It absolutely is. It is applied, to be sure, but it is all mathematics even if we think we are dealing with real world data. Software inhabits a bizarrely mathematical world, dragged down only by the reality of our problems.

2. Because everything is already an abstraction, there is always at least one abstraction that can contain the current abstraction without losing some essential virtual characteristic. The trivial case of course is every abstraction can be abstracted by itself. This is seemingly useless, except that it leads to the next item.

3. Every 'real-world' based abstraction has at least one complete higher-level abstraction. At very least some type of summary. Most real-world based abstractions have many, many higher-level abstractions; often diverging on very different paths. At very least, they have the taxonomy for their specific category of information. Many have multiple taxonomies in which they are situated. For example, poodle is a type of dog, which is a type of mammal, which is a type of animal, etc. But poodle is also a type of pet, which is an animal with a specific relationship towards man, which are sentient beings, etc. Multiple languages act as other alternative taxonomies. Taxonomies are bounded by the messiness of mankind but there are generally a large number of mostly undisputed taxonomies for all things in reality. In fact, there is one for everything we know about, at least for everything we can tell each other about. Everything has a name or at very least a unique description.

4. The abstraction of the data is fundamentally a mathematical concept, which is not the same as the 'implementation' of the abstraction for a specific piece of software. Instantiating an abstraction into a specific computer language is an implementation, which, while based on the abstraction, is not the same as the abstraction. The abstraction for instance could be correct, while the implementation might not be. One is fundamentally an idea, while the other is a thing. More concretely, the implementation is a physical thing existing at least as a series of magnetic impulses on a disk, if not another similar series of impulses loaded into a couple of silicon chips. The ideas behind software are abstract, but once written software is not. It has a physical manifestation; tiny but still physical.

5. An abstraction as a mathematical concept is not related to encapsulation. Encapsulation is a black box that contains some piece of complexity, encapsulating it away from the rest of the system. Whether or not the encapsulation is abstract, or even which implementation is used inside of the black box, is entirely independent. An implementation may encapsulate an abstraction, but it is more likely that the abstraction itself is the interface towards the encapsulated implementation, or something like that. Keeping abstraction and encapsulation separated is important because both have very different effects on software development.

6. An abstraction fits over an underlying 'data' abstraction completely. If it does not, then while it still may be an abstraction, it is not necessarily related to the underlying data one. E.g. if 80% of the abstraction is correct, then it is a badly fitting abstraction. Just because it partially fits, does not make it a good abstraction. However, multiple simple abstractions can be combined together to form an abstraction that fits entirely over another one, even if the individual ones do not. This new composite abstraction is more or less complete, if it contains no holes, but that doesn't necessarily mean that it isn't 'ugly'.
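Point 5 above can be sketched in code: one abstraction (an interface), with two entirely different encapsulated implementations behind it. The class names here are invented for illustration; the interface is the abstraction, while what each class hides inside is the encapsulated implementation:

```python
from abc import ABC, abstractmethod

class TitleSource(ABC):
    """The abstraction: callers see only this interface."""
    @abstractmethod
    def title(self) -> str:
        ...

class HardCodedTitles(TitleSource):
    # One encapsulated implementation ...
    def title(self) -> str:
        return "The Programmer's Paradox"

class MappedTitles(TitleSource):
    # ... and a completely different one behind the same abstraction.
    def __init__(self, mapping, key):
        self._mapping, self._key = mapping, key

    def title(self) -> str:
        return self._mapping[self._key]
```

Which implementation sits inside the black box is independent of the abstraction the callers code against, which is exactly why the two concepts are worth keeping separate.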

Together these points about abstractions, encapsulation and implementations have some interesting properties that often seem to be misunderstood. Consider these points, as we look at two related essays and explore some of the base assumptions.

THE LAW OF LEAKY ABSTRACTIONS

In his essay, Joel Spolsky defines abstractions, says they can leak and then defines a law of leaky abstractions: http://www.joelonsoftware.com/articles/LeakyAbstractions.html

While his problems are real, one cannot blame an abstraction for the fact that a) it doesn't fit the underlying problem, or b) the implementation is bad.

An abstract representation is one that fits over the underlying details. So if it only fits over 80%, then it is a poor abstraction. If the programmer doesn't encapsulate the details away from his users, then he is a poor programmer. Often we focus so hard on the code we build that we forget about the people using it.

In his essay Joel says that TCP/IP is an example of a leaky abstraction because the implementation chooses to allow the packet connection to timeout at some point. It could have just as easily never timed out, and then delivered the data on an eventual reconnect. There are publish/subscribe infrastructures that do exactly that, but they do require some interim holding storage that has the potential of overflowing at some point. Because of that, the choice to time-out is a resource vs. timing trade-off that is related to a generalized implementation of TCP. The idea of implementing a reliable protocol over an unreliable one is still a reasonable abstraction, regardless of which trade-offs must be made for an implementation to actually exist at some point.

In another example he says iterating memory in a higher-level language leaks through because the order in which the programmer iterates through the array affects the performance. But it is possible for the iteration code to be examined by the software at compile time or even runtime and the 'direction' changed to that of the fastest approach. Some existing languages, such as APL, do similar things. While this is not commonly done as an optimization by many languages, that does not imply that the abstraction of the language over assembler is any less valid. The implementation allows this detail through, not the abstraction. The abstraction just creates a higher-level syntax that allows more complicated instructions to be inputted with a smaller number of base primitives, through a slightly different idiom.
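The idea that traversal order is an implementation detail rather than part of the abstraction can be sketched like this (a toy example of mine, not how APL actually works): the caller asks for a sum, and the 'direction' is free to change underneath without affecting the answer:

```python
def sum_matrix(matrix, order="rows"):
    # The caller's abstraction is 'sum of all elements'; the traversal
    # direction is an internal choice, so an implementation could pick
    # whichever order is fastest without the caller ever noticing.
    total = 0
    if order == "rows":
        # Row-first traversal (typically cache-friendly for
        # row-major memory layouts).
        for row in matrix:
            for value in row:
                total += value
    else:
        # Column-first traversal; same result, different access pattern.
        for j in range(len(matrix[0])):
            for row in matrix:
                total += row[j]
    return total
```

Both orders return the identical total, which is the sense in which the performance difference belongs to the implementation, not the abstraction.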

I don't think abstractions can leak, although some are clearly poorly fitting. Everything has at least one valid abstraction that fits entirely over it. For all but the most esoteric applications, there is a huge number of potential abstractions that could be used. Just because we so frequently choose to jam the wrong abstractions over the wrong data doesn't mean that abstractions are the cause of the problem. They are actually the opposite. Our increases in sophistication come from our discovery and usage of increasingly complex abstractions. Blaming a mathematical tool for the ugliness of our real-world implementations is just not fair.

However, not to lose Joel's main point, there is an awful lot of 'complexity' bubbling up from the lower layers of our software. Every time we build on top of this stuff, we become subject to more and more of these potential problems. Getting back to my earlier writings, I think the real underlying cause is that the implementation is only partially encapsulating the underlying details. Often this is because the programmers want to give the users of their stuff some larger degree of freedom for the way they can utilize the code. Once the holes are open, the details 'leak' through to the higher levels and cause all sorts of fun and wonderful complexity ripples in the dependent code or usage. Because of this, I think Joel's law should exist, but be renamed to: "the law of leaky partial encapsulations". It is not as pretty a name, but, I think it is more accurate.


DIVING INTO AN ABSTRACTION PILE:

Another interesting read is: http://www.ericsink.com/Abstraction_Pile.html

Although I think Eric Sink does an excellent and interesting job of digging into most of the layers existing in his computer software, I disagree somewhat with his three initial rules and some of his terminology. The various 'encapsulated' layers in the system he mentions often have abstractions, but again we need to keep abstractions and layers as separate concepts. Eric puts forth three rules about abstractions: 1) they contain bugs, 2) they reduce performance and 3) they increase complexity. Dealing with each point in turn:

1) Abstractions contain bugs. An abstraction is a mathematical concept; it is what it is. The implementation can easily contain bugs, either because it is a poor fit to the problem or because some of the underlying logic is actually broken. Abstractions fit or they don't. A badly fitting abstraction is not the abstraction's fault, it is the implementation's. To be more specific though, all code, abstract or not, contains bugs. While a specific sequence of steps done in a brute-force manner may be easier to fix when bugs are discovered, abstractions generally contain orders of magnitude less code, and because the behavior is more generalized, the testing of the code is denser. So while abstract code is harder to fix, it is less likely to contain bugs as the code ages. Working, tested abstract code will contain fewer bugs than some brute-force version, but both will contain bugs; the denser usage and testing will bring problems to the surface much faster. Layering is another important way of breaking off complexity, but you can do it just as easily with very specific brute-force code as with abstract code; it is just a way of slicing and dicing the problem into smaller pieces.

2) Abstractions reduce performance. We all recognize that an expert hand-coding some assembler can produce faster, tighter code than a C compiler can. That is not, I think, a by-product of the abstraction that C provides as a language, but simply because the work hasn't been done on a C compiler to better optimize the final code yet. We've seen tremendous speed increases in Java virtual machines, primarily because the semantic and syntactic freedom of Java is far less than C's, so it appears to be easier to think of new and exciting ways to detect specific circumstances that can easily be optimized. Nothing about a correctly fitting abstraction will inherently interfere with performance in ways that some clever programmer with a lot of time cannot detect and counter-balance. If that is not true, then the abstraction has collapsed some critical variables down into an ambiguity that was still the essence of the problem, so it is badly fitting by definition.

Also, in general a compiler can produce code that performs better than the 'average' assembler programmer's. The level of expertise for most programming languages spans a huge margin, and the average of that margin over time will produce lots of suboptimal code. In the simple cases, most compilers, with their inherent consistency, will exceed the abilities of an average (or somewhat below average) programmer over much of the code they produce. Even the sloppiest of programmers will be better sometimes, but probably not overall.

Getting even more detailed: in order to optimize any piece of code, one needs to abstract it to begin with. If we have just a rigid set of instructions, there is actually no way to optimize them; they are exactly what is needed to get the work done and nothing more. If we go to a higher level of abstraction, then we can start to move many of the 'computational' operations forward, so that we can reuse the calculations for the optimization. Optimizations often come from getting more abstract and then moving more calculations forward and reusing the values. Caching, for example, can be viewed that way. Continue this generalization of the problem until, at some point, the processing can't be reduced any more. This is a core technique for manually optimizing code. Abstractions, rather than reducing performance, are the reason why we actually can optimize code.
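Memoization is a simple, concrete case of moving calculations forward so their values can be reused. A small sketch using Python's standard `functools.lru_cache` (the counter exists only to make the reuse visible):

```python
# The naive recursive definition redoes the same work astronomically many
# times; caching each value the first time it is computed collapses the
# work down to one evaluation per distinct input.

from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def fib(n):
    calls["count"] += 1            # body runs only on a cache miss
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))          # 832040
print(calls["count"])   # 31 evaluations, instead of well over a million
```

The code for `fib` is the same general definition either way; the optimization lives entirely in reusing values already calculated.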

As for layering, the overhead from creating layers in a system is most often minimal in comparison with the work that the system is actually doing. Assuming that you don't go too far, slicing and dicing the code into segments generally pays for itself, and often allows the replacement of some of the generalized segments with optimized abstract variants.

Applying abstractions that allow for optimizations at various layers can cause massive boosts in performance. I once optimized a program that initially ran in 2 hours down to 1/2 hour, with less than half the amount of code. Not only that, but the abstractions allowed a hugely generalized set of things to be calculated instead of just the specific one, and building the code took half as many man-hours as the original. Less code, faster performance, faster development, and applicable to many problems instead of just one. That is the true power of abstraction.

3) Abstractions increase complexity. The point of an abstraction is to simplify something, and by virtue of doing that it does not increase complexity *beyond* the actual cost of understanding and implementing the abstraction. If we replace some long set of repetitive instructions with a generalized abstract set, we will dramatically reduce the overall amount of code in the system. While each line of new code is in itself more complicated, the reduction in size reduces the overall complexity. Not only that, but the newer, denser code is easier to test and more likely to be consistent. For any reasonably fitting abstraction, if it is implemented correctly, the system will have less complexity relative to the size of the problems it is solving. Often it will have dramatically less complexity, although on a line-by-line basis it may have increased.
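As a tiny illustration of that trade, here is a hypothetical reporting example (the names and data are invented for the sketch): one near-identical brute-force function per special case, versus a single generalized function that covers every grouping at once.

```python
# Invented sample data for the sketch.
records = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 25},
    {"region": "east", "amount": 5},
]

# Brute force: one of these per report, repeated with tiny variations.
def total_east(rows):
    return sum(r["amount"] for r in rows if r["region"] == "east")

# Generalized: each line is denser, but one function replaces them all,
# including groupings nobody ever wrote a special case for.
def totals_by(rows, key, value):
    out = {}
    for r in rows:
        out[r[key]] = out.get(r[key], 0) + r[value]
    return out

assert total_east(records) == 15
assert totals_by(records, "region", "amount") == {"east": 15, "west": 25}
```

Each line of `totals_by` carries more meaning than the brute-force version, yet the system as a whole ends up with far less code to test and keep consistent.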

Also, we can utilize layering and encapsulation to essentially 'hide' complexity in the various lower levels of the system. For example, if we break off half of a system and hide it in a black box, then the remaining half is simpler than the original, and the half in the black box is simpler too, but overall the whole: half + half + box, is a little more complicated. If the black box only partially encapsulates, then the system is indeed worse, because all of the details are visible again at the top level. While encapsulating a layer does add some additional complexity, it hugely decreases the 'localized' complexity, which counter-balances the localized complexity increases from implementing an abstraction.

We could never write our current level of programs in assembler, the complexity would overwhelm us. It is the abstractions we have found that give us the ability to build higher and higher solutions to our ever growing problems. Through them we can manage the increases in complexity. Through layering, we can distribute it into smaller parcels. Even if it is disappointing sometimes, we cannot deny that the sophistication of today's systems is orders of magnitude above and beyond those written fifty years ago, progress is evident and abstraction is clearly a core reason. But while we have some neat abstractions, we have only scratched the surface of those that are actually possible.


WHY IS THIS IMPORTANT?

While it is often overlooked, Computer Science is firmly rooted in mathematics. In its purest sense, software exists in an entirely precise and nearly perfect mathematical world. As we fill in more and more placeholders for the real world, we often tie down our software with the vagaries and messiness of our daily existence. Chaos and disorder play in reality, finding their way into the code. But intrinsically there is no entropy for mathematical formulas, they will stand unaltered throughout the ages.

That software straddles the real world while running in a purely mathematical one is important in understanding how much disorder and complexity are necessary in our solutions. We are tied so closely to reality, that software -- unlike mathematics -- actually rusts if it is not continually updated: http://people.lulu.com/blogs/view_post.php?post_id=28963

What this means is that while we cannot undo or avoid any of the inherent complexity of the problem domain for the tools we are building, we can find more 'abstract' ways to represent the information we are manipulating which allows us more flexibility in how we create these tools. Brute forcing the problem of creating code causes programmers to iterate out every single step of the problem they wish to solve, an easy but dangerous approach. If we fall back on more abstract and general representations for our code, not only can we optimize the speed of handling the problem, but we can also optimize the speed of creating a tool to handle the problem. Massively reducing the amount of code we need for the solution comes directly from utilizing abstractions.

It is only in leveraging our work in developing software that we can hope to build the full range of tools possible for a computer. If the alternative is pounding out every line of code that we need or will ever want, then our computer tools will not live up to their full potential for hundreds of years at least. We need massive effort through brute force to create and maintain reams of fragile code.

Our real troubles in development lie with our implementations and encapsulation. We go through a lot of work to hide the details, and then we go through a lot more to make them visible again. A small bit of insanity that we have been constantly plagued with.

Even more concerning, there have been many movements towards modelling the real world, or towards pounding out the systems with variations of brute force. In either case we are passing over the obvious: the data starts as an abstraction and we can take advantage of that in our implementations. As we generalize up the abstraction levels, we increase some of the localized complexity, but we leverage the code enough to make it worthwhile. With more generalized solutions we can solve larger problem domains with only marginally larger solutions. Abstraction is the key to leveraging our work.

Abstractions are a pure mathematical concept. They exist for all things that can be computerized and we've only really seen the tip of the iceberg as far as abstractions are concerned. It is a huge space, and there are many millions more 'abstract' ways of solving our current problems than are currently realized. Abstractions are the closest thing to a silver bullet for software development that we will ever get. We should take advantage of them wherever possible.

Thursday, November 16, 2023

The Power of Abstractions

Programmers often complain about abstractions, which is unfortunate.

Abstractions are one of the strongest ‘power tools’ for programming. Along with encapsulation and data structures, they give you the ability to recreate any existing piece of modern software, yourself, so long as you have lots and lots of time.

There is always a lot of confusion about them. On their own, they are nothing more than a generalization. So, instead of working through a whole bunch of separate individual special cases for the instructions that the computer needs to execute, you step back a little and figure out what all of those different cases have in common. Later, you bind those common steps back to the specifics. When you do that, you’ve not only encoded the special cases, you’ve also encoded all of the permutations.

Put another way, if you have a huge amount of code to write and you can find a small tight abstraction that covers it completely, you write the abstraction instead, saving yourself massive amounts of time. If there were 20 variations that you needed to cover but you spent a little extra time to just create one generalized version, it’s a huge win.
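For instance, a sketch of many variations collapsing into one generalization (the clamp functions here are hypothetical examples, not from any particular codebase):

```python
# Instead of writing clamp_to_percent, clamp_to_rgb, clamp_to_angle, and
# so on as twenty separate special cases, write the generalization once.

def clamp(value, low, high):
    """The common steps, bound back to the specifics via parameters."""
    return max(low, min(high, value))

# The former special cases become one-line bindings of the abstraction.
clamp_to_percent = lambda v: clamp(v, 0, 100)
clamp_to_rgb     = lambda v: clamp(v, 0, 255)

assert clamp_to_percent(120) == 100
assert clamp_to_rgb(-3) == 0
assert clamp(7, 1, 5) == 5   # a permutation never coded as a special case
```

The last assertion is the point: the generalized version also encodes every permutation that was never written out by hand.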

Coding always takes a long time, so the strongest thing we can do is get as much leverage from every line as possible. If some small sequence of instructions appears in your code dozens of times, it indicates that you wasted a lot of time typing and testing it over and over again. Type it once, name it, make sure it works, and then reuse it. Way faster.

A while back there were discussions that abstractions always leak. The example given was for third-generation programming languages. With those, you still sometimes need to go outside of the language to get some things done on the hardware, like talking directly with the video card. Unfortunately, it was an apples-to-oranges comparison. The abstractions in question generalized the notion of a 'computer'. But just one instance of it. Modern machine architecture, however, is actually a bunch of separate computing devices all talking to each other through mediums like the bus or direct memory access. So, it's really a 'collection' of computers. Quite obviously, if you put an abstraction over one thing, it does not cover a collection of them. Collections are things themselves (which is part of what data structures is trying to teach).

A misfitting abstraction would not cover everything, and an abstraction for one thing would obviously not apply to a set of them. The abstraction of third-generation programming languages fit tightly over only the assembler instructions that manipulated the computer but obviously didn’t cover the ones that were used to communicate with peripherals. That is not leaking, really it is just scope and coverage.

To be more specific, an abstraction is just that: an abstraction. If it misfits and part of the underlying mechanics is sticking out, exposed for the whole world to see, the problem is encapsulation. The abstraction does not fully encapsulate the stuff below it. Partial encapsulation is leaky encapsulation. There are ugly bits sticking out of the box.

In most cases, you can actually find a tight-fitting abstraction. Some generalization with full coverage. You just need to understand what you are abstracting. An abstraction is a step up, but you can also see it as binding together a whole bunch of special cases like twigs. If you can visualize it as the overlaid execution paths of all of the possible permutations forming each special case, then you can see why there would always be something that fits tightly. The broader you make it the more situations it will cover.

The real power of an abstraction comes from a hugely decreased cognitive load. Instead of having to understand all of the intricacies of each of the special cases, you just have to understand the primitives of the abstraction itself. It’s just that it is one level of indirection. But still way less complexity.

The other side of that coin is that you can validate the code visually, by reading it. If it holds within the abstraction and the abstraction holds to the problem, then you know it will behave as expected. It’s obviously not a proof of correctness, but being able to quickly verify that some code is exactly what you thought it was should cut down on a huge number of bugs.

People complain though, that they are forced to understand something new. Yes, absolutely. And since the newer understanding is somewhat less concrete, for some people that makes it a little more challenging. But programming is already abstract and you already have to understand modern programming language abstractions and their embedded sub-abstractions like ‘strings’.

That is, crafting your own abstraction, if it is consistent and complete, is no harder to understand than any of the other fundamental tech stack ones, and to get really good at programming, you have to know those anyway. So adding a few more for the system itself is not onerous. In some cases, your abstraction can even cover a bunch of other lower-level ones, so if it is encapsulated, you don’t need to know those anymore. A property of encapsulation itself is to partition complexity, making the sum more complex but each component a lot less complex. If you want to write something sophisticated with extreme complexity, partitioning it is the only way it will be manageable.

One big fear is that someone will pick a bad abstraction and that will get locked into the code causing a huge mess. Yes, that happens, but the problem isn’t the abstraction. The problem is that people are locking things into the codebase. Treating all of the code in the system as write-once and untouchable is a huge problem. In doing that, it does not matter if the code is abstract or not, the codebase will degenerate either way, but faster if it is brute force. Either the code on top a) propagates the bugs below, b) wraps another onion layer around the earlier mess, or c) just spins off in a new silo. All three of these are really bad. They bloat up the lines of code, enshrine the earlier flaws, increase disorganization, and waste time with redundant work. They get you out of the gate a little faster, but then you’ll be stuck in the swamp forever.

If you pick the wrong abstraction then refactoring to correct it is boring. But it is usually a constrained amount of work and you can often do it in parts. If you apply the changes non-destructively, during the cleanup phase, you can refactor away some of the issues and check their correctness, before you pile more stuff on top. If you do that a bunch of times, the codebase improves for each release. You just have to be consistent about your direction of refactoring, waffling will hurt worse.

But that is true for all coding styles. If you make a mistake, and you will, then so long as you are consistent in that mistake, fixing it is always a smaller amount of work or at the very least can be broken down into a set of small amounts. If there are a lot of them, you may have to apply the sum over a large number of different releases, but if you persist and hold your direction constant, the code will get better. A lot better. Contrast this with freezing, where the code will always get worse. The mark of a good codebase is that it improves with time.

Sometimes people are afraid of what they see as the creativity involved with finding a new abstraction. Most abstractions however are not particularly creative. Really they are often just a combination of other abstractions fitted together to apply tightly to the current problem. That is, abstractions slowly evolve, they don’t just leap into existence. That makes sense, as often you don’t fully appreciate their expressibility until you’ve applied them a few times. So, it’s not creativity, but rather a bit of research or experience.

Programming is complicated enough these days that you will not get really far with it if you just stick to rediscovering everything yourself from first principles. Often the state of the art has been built up over decades, so going all of the way back in time and trying to reinvent everything again is going to be crude in comparison.

This is why learning to research a little is a necessary skill. If you decide to write some type of specific computation, doing some reading beforehand about others' experiences will pay huge dividends. Working with experienced people will pay huge dividends. Absorbing any large amount of knowledge efficiently will allow you to start from a stronger position. Code is just a manifestation of what the programmer understands, so obviously the more they understand the better the code will be.

The other side of this is that an inexperienced programmer seeking a super-creative abstraction will often be a disaster. This happens because they don’t fully understand what properties are necessary for coverage, so instead they hyper-focus on some smaller aspect of the computation. They optimize for that, but the overall fit is poor.

The problem though is that they went looking for a big creative leap. That was the real mistake. The abstraction you need is a generalization of the problems in front of you. Nothing more. Step back once or twice, don’t try to go way, way out, until much later in your life and your experience. What you do know should anchor you, always.

Another funny issue comes from concepts like patterns. As an abstraction, data structures have nearly full coverage over most computations, so you can express most things, with a few caveats, as a collection of interacting data structures. The same isn’t true for design patterns. They are closer to idioms than they are to a full abstraction. That is why they are easier to understand and more tangible. That is also why they became super popular, but it is also their failure.

You can decompose a problem into a set of design patterns, but it is more likely that the entire set now has a lot of extra artificial complexity included. Like an idiom, a pattern was meant to deal with a specific implementation issue, it would itself just be part of some abstraction, not the actual abstraction. They are implementation patterns, not design blocks. Patterns should be combined and hold places within an abstraction, not be a full and complete means of expressing the abstraction or the solution.

Oddly programmers so often seek one-size-fits-all rules, insisting that they are the one true way to do things. They do this because of complexity, but it doesn’t help. A lot of choices in programming are trade-offs, where you have to balance your decision to fit the specifics of what you are building. You shouldn’t always go left, nor should you always go right. The moment you arrive at the fork, you have to think deeply about the context you are buried in. That thinking can be complex, and it will definitely slow you down, thus the desire to blindly always pick the same direction. The less you think about it, the faster you will code, but the more likely that code will be fragile.

You can build a lot of small and medium-sized systems with brute force. It works. You don’t need to learn or even like abstractions. But if you want to work on large systems, or you want to be able to build stuff way faster, abstractions will allow you to do this. If you want to build sophisticated things, abstractions are mandatory. Once the inherent complexity passes some threshold, even the best development teams cannot deal with it, so you need ways of managing it that will allow the codebase to keep growing. This can only be done by making sure the parts are encapsulated away from each other, and almost by definition that makes the parts themselves abstract. That is why we see so many fundamental abstractions forming the base of all of our software, we have no other way of wrangling the complexity.

Monday, November 5, 2007

The Art of Encapsulation

In all software projects we quickly find ourselves awash in an endless sea of annoying little details.

Complexity run amok is the leading cause of project drownings; controlling it is vital. If you become swamped, it usually gets worse before you can manage to get it under control. It starts as a downwards spiral; picking up speed as it grows. If it is not explicitly brought under control, it swallows the entire project.

The strongest concept we have to control complexity is "encapsulation". The strength behind this idea is the ability to isolate chunks of complexity and essentially make them disappear. Well, to be fair they are not really gone, just buried in a black box somewhere; safely encapsulated away from everything else. In this way we can scale down the problem from being a monstrous wave about to capsize us, into a manageable series of smaller waves that can be easily surmounted.

Most of us have been up against our own personal threshold for keeping track of the project details, beyond which things become too complex to understand. We know that increasing the size of the team can push up the ability to handle bigger problems, but it also increases the amount of overhead. Big teams require bigger management, with each new member becoming increasingly less effective. Without a reasonable means to partition the overall complexity, the management cross-talk for any significantly sized project would consume the majority of the resources. Progress quickly sinks below the waves. Just adding more bodies is not the answer; we need something stronger.

Something that is really encapsulated hides the details from the outside world. By definition, it has become simpler. All of the little details are on the inside, and none of them need to be known on the outside to be effective. In this way, that part of the problem has been solved and is ready to go. You are free to concentrate on the other unresolved issues, hopefully removing them one by one until the project comes down to just a set of controlled work that can be completed. At each stage, the complexity needs to be found, dealt with and contained.

The complexities of design, implementation and operation are very different from each other. I've seen a few large projects that have managed to contain the implementation complexity, only to lose it all by allowing the operational complexity to get out of hand. Each part in the process requires its own understanding to control its own special complexity. Each part has its own way of encapsulating the details.

Successful projects get there because at every step, the effort and details are well understood. It is not that they are simpler, just that they are compartmentalized enough that changes do not cause unexpected complexity growth. If you are, for example, adding some management capabilities to an existing system, the parts should be encapsulated enough that the new code does not cause a significant number of problems with the original code, but it should also be tied closely enough to it to leverage its capabilities and its overall consistency. Extending the system should be minimal work that builds on the existing code base. This is actually easy to do if the components and layers in the system are properly encapsulated; it is a guaranteed side effect of a good architecture. It is also less work and higher quality.

Given that, it is distressing how often you see architectures, libraries, frameworks and SDKs that should encapsulate the details, but they do not. There is some cultural aspect to Computer Science where we feel we cannot take away the choices of other programmers. As such, people often build development tools to encapsulate some programming aspects, but then they leave lots of the details free to percolate upwards; violating the whole encapsulation notion. Badly done encapsulation is not encapsulation. If you can see the details, then you haven't hidden them, have you?
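To make the contrast concrete, here is a small hypothetical sketch (the class names are invented): the leaky version lets a connection detail percolate up to every caller, while the encapsulated version keeps it inside the box.

```python
# Hypothetical sketch of leaky vs. proper encapsulation of one detail.

class LeakyStore:
    """Callers must know about, and manage, the connection state."""
    def __init__(self):
        self.connected = False

    def connect(self):                 # a ritual every caller must learn
        self.connected = True

    def save(self, item):
        if not self.connected:
            raise RuntimeError("call connect() first")   # detail escaped
        return f"saved {item}"

class EncapsulatedStore:
    """The same detail exists, but it never leaves the box."""
    def __init__(self):
        self._connected = False        # internal; invisible from outside

    def save(self, item):
        if not self._connected:        # handled on the inside, every time
            self._connected = True
        return f"saved {item}"

assert EncapsulatedStore().save("x") == "saved x"   # no setup ritual
```

Both classes do the same work; the difference is that only one of them forces its users to understand the underlying details in order to call it properly.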

The best underlying libraries and tools are ones that provide a simplified abstraction over some complicated domain, hiding all of the underlying details as they go. If we wanted to get down and dirty we wouldn't use the library, we would do it directly. If we choose to use the library, we don't want to have to understand it and the underlying details too. That's twice as much work, when in the end, we don't care.

If you are going to encapsulate something, then you really should encapsulate it. For every stupid little detail that you let escape, the programmers above you should be complaining heavily. If you hide some of the details, but the people above still need to understand them in order to call things properly, then it wasn't done correctly. We cannot build on an infrastructure that is unstable because we haven't learned how it works. If we have to understand it to use it, it is always faster in the long run just to do it ourselves. So ultimately building on a bad foundation is just wasting our time.

It is possible to properly encapsulate the details. It is something we need to do if we want to build better software. There is no need to watch your projects slip below the churning seas. It is avoidable.

Wednesday, December 10, 2008

The Best of The Programmer's Paradox

Blogs always see a constant flow of new, possibly intrigued readers. To answer that all important question that every reader has for a new blog, I put together a selection of some of my better works. This should help in making that all important judgment between insane lunatic, crazy rambler or just run-of-the-mill nut job. It's an easier choice if I provide references to some of my more interesting pieces, ones that have a strong point, were more popular or just contain fluid writing.

I've collected together my favorites with short intros, which I'll update from time to time to keep it up-to-date. Hopefully I can tag this in Blogger to make it standout.

If you're a long-time reader and I missed a favorite or two, please feel free to leave a comment. If I managed to get a few more suitable entries, I'll update this document. Comments for older pieces should be left on the original piece, Blogger will let me know that they are there.


NORMAL FORMS

Probably the most influential thing that I'll write in my career is the basis for code normalization, however I fully expect that the software development industry will happily choose to ignore any points in these works for at least another hundred or hundred and fifty years. Because of that, don't feel obligated to read any of these, they probably won't be relevant for quite a while:

http://theprogrammersparadox.blogspot.com/2008/11/code-normal-form.html
http://theprogrammersparadox.blogspot.com/2008/10/structure-of-elegance.html
http://theprogrammersparadox.blogspot.com/2008/10/revisiting-structure-of-elegance.html


SIMPLIFICATION

Simple is a constantly misused term; in this entry I'm trying very hard to clarify a definition of it. It's the sort of thing that programmers should really, really understand, yet the kind of thing that they rarely do:

http://theprogrammersparadox.blogspot.com/2007/12/nature-of-simple.html


ABSTRACTION AND ENCAPSULATION

Most of the incoming searches from the web seem to find my abstraction and encapsulation writings. Again, these pieces are focused on clearly defining those terms. The first entry references the other two; I threw in the fourth one because it also deals with more essential definitions:

http://theprogrammersparadox.blogspot.com/2008/01/abstraction-and-encapsulation.html
http://theprogrammersparadox.blogspot.com/2007/11/art-of-encapsulation.html
http://theprogrammersparadox.blogspot.com/2007/12/pedantically-speaking.html
http://theprogrammersparadox.blogspot.com/2008/01/construction-of-primitives.html


SOFTWARE DEVELOPMENT

I've been struggling for years to find the right words to explain how to make it easier to develop software. Software development is a hard task, but people so often make it much, much harder than it actually needs to be. More software fails at the root development level than it does from organizational problems. If it starts off poorly, it will fall apart quickly:

http://theprogrammersparadox.blogspot.com/2008/01/essential-develoment-problems.html
http://theprogrammersparadox.blogspot.com/2007/10/first-principles-and-beyond.html
http://theprogrammersparadox.blogspot.com/2008/05/software-blueprints.html


BLUEPRINTS

Another interesting post that got a fair amount of feedback. I think that a simplified high-level (but imprecise) way to specify functionality would go a long way in reducing risk and allowing programmers to learn from each other:

http://theprogrammersparadox.blogspot.com/2008/05/software-blueprints.html


TESTING

Coding is great, but as programmers we fill up our efforts with a steady stream of bugs. You haven't really mastered software development until you've mastered spending the least amount of effort to find the maximum number of bugs, which is not nearly as easy as it sounds:

http://theprogrammersparadox.blogspot.com/2008/03/testing-for-battleships.html
http://theprogrammersparadox.blogspot.com/2008/02/testing-perspectives.html


IN THE FUTURE

I did a couple of forward-looking posts about what's up and coming. Computer Science forces us to be able to understand knowledge in a new and more precise way than is necessary for most other disciplines:

http://theprogrammersparadox.blogspot.com/2008/02/age-of-clarity.html
http://theprogrammersparadox.blogspot.com/2008/03/science-of-information.html
http://theprogrammersparadox.blogspot.com/2008/12/lines-of-progress.html


BUILDING BLOCKS

These are some of my earlier writings. Similar to this blog, but often a little drier. The whole repository can be found here:

http://www.lulu.com/content/1054566 (should be free to download)

Some of the better entries:

Rust and Bloat -- http://people.lulu.com/blogs/view_post.php?post_id=28963

Five Pillars -- http://people.lulu.com/blogs/view_post.php?post_id=35232

When Ideas become Products -- http://people.lulu.com/blogs/view_post.php?post_id=33933


THE BOOK -- THE PROGRAMMER'S PARADOX

This site is named after my first attempt at a book. I quit work and decided to push all of my software development understanding out onto paper. It was something I always wanted to do. I can't say that it is a good piece of work, but I can say that there is a lot there; it's just in a very raw form. It was my first big attempt at writing (besides research papers) and, apart from a few typos, the printed edition looks and feels very much like a real book (my sister says so, and likes the quotes at the start of each chapter). Professional editing could bring it out, but I've never been able to convince any publishers that this 'type' of material should see the light of day. My advice: don't buy the book, convince a publisher to publish it instead (if you'd really like to read the book, send me an email and I'll send you a copy).

http://www.lulu.com/content/178872


RANTS AND RAVES

Sometimes the world drives me nuts, so I set up a place to vent. Some of the venting is horrible, but occasionally it's amusing:

http://irrationalfocus.blogspot.com/2008/06/seven-platitudes-of-highly-ineffectual.html
http://irrationalfocus.blogspot.com/2007/12/ho-ho-what-ho.html
http://irrationalfocus.blogspot.com/2008/02/edging-towards-third.html


GAMES PEOPLE PLAY

Andy Hunt suggested a long time ago that my writing was OK, but not entirely personal. He said I was delivering observations but not tying them back to the reader. As a sort of fun exercise I set myself the goal of writing just little thought-lets: two-sentence expressions. The first sentence is completely general, an observation; the second one must tie that back to the reader somehow. The earlier versions were on yahoo360:

http://ca.blog.360.yahoo.com/blog-jQ.RlBwherOgu.uuFz5qB.I1

but there is a full copy at:

http://idealeftovers.blogspot.com/2007/11/yahoo360-yahoo-360-rant-in-long-run.html

Thursday, September 22, 2022

Helpers or Encapsulation

I was having a discussion with another developer recently. He suggested that we could create a library of ‘helpers’.

I like creating libraries for most things, but this suggestion turned me off. It just sounded wrong.

A while ago I was working on a medium-sized system where the developers basically went crazy with ‘helpers’. They had them everywhere, for everything.

Generally, if you take any ideal in programming and apply it in an over-the-top extreme manner it doesn’t work out very well, and that was no exception.

It basically destroyed the readability, and the code in the helpers was haphazard, scattered all over the place. Without bouncing through lots of different files and directories, you couldn't get any sense of what the code was actually trying to do. And far worse, that fragmentation was hiding some pretty significant bugs, so really it was a huge mess.

But breaking things down is one of the core ideas of good software development, so why did the helper-fest go so horribly wrong?

You want to build up a lot of reusable pieces, gradually moving up from low-level operations into higher-level domain primitives. You want this so that when you go to build new stuff, you can most often do it from a trusted set of existing lower stuff. You’ve built a lot of the parts already, you’ve tested them, and they are being heavily used. So, if you leverage your earlier work you’ll save yourself massive amounts of time.

But you also need to encapsulate any of the complexity. If you don’t it will grow out of control.

Basically, you want to stick something complex into a box, hide it away, and then keep reusing it. Once you have solved a specific coding problem, you don’t want to waste time by having to solve it again.

Encapsulation is always more than just code. There are some data, constants, or config parameters of some type that go inside the box as well. You might even have a bit of state in there too, but you have to be very careful with that. As long as some of the underlying details never leave the box, you've at least encapsulated something.

So, a good reusable lower component is encapsulated code, data, and the mechanics needed to do some functionality, all nicely wrapped and hidden away from the caller. You see this commonly in language libraries; string handling is often a good example. You don't get caught up in messing with an array of individual characters; instead, you do common operations directly on the abstract notion of a 'string'.
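As a rough sketch of that kind of box (the class and its methods are invented for illustration, not from any real library), a small type can wrap up the fiddly character-level work behind a tiny, precise interface:

```java
// A hypothetical encapsulated component: callers get a precise interface,
// while the messy character-level details stay hidden inside the box.
public class CsvRow {
    private final String[] fields;   // the hidden representation

    public CsvRow(String line) {
        // the fiddly splitting and cleanup is encapsulated here
        this.fields = line.split(",", -1);
    }

    public String field(int i) {
        return fields[i].trim();     // callers never touch raw characters
    }

    public int size() {
        return fields.length;
    }

    public static void main(String[] args) {
        CsvRow row = new CsvRow("alice, 42 ,ottawa");
        System.out.println(row.field(1) + " of " + row.size());
    }
}
```

As long as the representation never leaks out, it can later be swapped for something smarter without touching any caller.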

Basically, you’ve removed a piece of complexity and replaced that with a box that adds only a little bit of extra complexity. Overall, complexity is higher, but at any given level you have lowered it.

As long as you name it correctly, it is reusable. Later someone can find it, use it, and trust it, without having to get lost in those underlying details.

The way most people write helpers, though, they are just pure code fragments wrapped with a function. Really just some idiom or a clump of code that people are using often. It's more like cut-and-paste, but without making explicit copies.

So that's where helpers get into trouble. They're often disjoint code fragments, arbitrarily chosen. Because they do not encapsulate, and because they are not well-defined primitives, you can reuse the code but the overall complexity still goes up. It doesn't hide things; really, it just reduces retyping. You gain a little bit from calling it, but you lose a lot more from its inherent complexity, especially if you need to read it later.

And that's pretty much what I saw in that earlier project. Readability was shattered because the lines drawn around the helper code fragments were arbitrary. You couldn't even guess what the calls would do.

In that earlier case though, a lot of the helpers also modified globals, so the side effects were scary and rather unpredictable. But even if that wasn’t true, the general ideas around helpers may help reduce pasted code, but do not encapsulate any complexity, so they really aren’t helping much.

A variation on this theme is to have a large number of static functions in an OO system. It’s the same problem just not explicitly called out as helpers.
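To make the contrast concrete, here is a hedged sketch (all of the names are invented): a static 'helper' that leans on shared state, next to a small encapsulated type that does the same job with no side effects.

```java
import java.util.Locale;

// The helper style: a bare code fragment wrapped in a static function,
// plus a sneaky piece of shared state that callers can't see.
class FormatHelpers {
    static String lastFormatted;   // effectively a global -- a scary side effect

    static String money(double amount) {
        lastFormatted = String.format(Locale.US, "$%.2f", amount);
        return lastFormatted;      // callers can't guess this mutated state
    }
}

// The encapsulated alternative: the data and the behavior live in one box,
// and nothing leaks out.
class Price {
    private final double amount;

    Price(double amount) { this.amount = amount; }

    String display() { return String.format(Locale.US, "$%.2f", amount); }
}

public class HelpersOrEncapsulation {
    public static void main(String[] args) {
        System.out.println(FormatHelpers.money(3.5));   // works, but mutates a global
        System.out.println(new Price(3.5).display());   // same result, no hidden state
    }
}
```

Both produce the same output, but only one of them can be read, tested, and trusted in isolation.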

Probably the worst thing you can do with code in a large system is duplicate stuff all over the place. It’s a recipe for creating bugs, not functionality. But avoiding encapsulation by using a weaker approach like helpers isn’t really better. There is never enough time to do a proper job coding anymore, so you have to make any work you do count. The more time you save, the more time you now have to do a better job.

Monday, June 18, 2012

What is Complexity?

This is going to be a long and winding post, as there are always fundamental questions that do not have easy or short answers. Complexity is one of those concepts that may seem simple on its surface, but it encompasses a profoundly deep perspective on the nature of our existence. It is paired with simplicity in many aspects, which I wrote about in:

http://theprogrammersparadox.blogspot.ca/2007/12/nature-of-simple.html

It would be helpful in understanding my perspective on complexity to go back and read that older post before reading further.

The first thing I need to establish is that this is my view of complexity. It is inspired by many others, and it may or may not be a common viewpoint, but I’m not going to worry in this posting about getting my facts or references into the text. Instead, I’m just going to give my own intuitive view of complexity and leave it to others to pick out from it what they feel is useful (and to disregard the rest).

Deep down, our universe is composed of particles. Douglas Hofstadter in "I Am a Strange Loop" used the term 'epiphenomenon' to describe how larger meta-behavior forms on top of this underlying particle system. Particles form molecules, which form chemicals, which form materials, which we manipulate in our world. There are many 'layers' going down between us and particles. Going upwards, we collect together as groups and neighborhoods, based in cities in various regional collections, to interact with each other as societies. Each of these layers is another set of discrete 'elements' bound together by rules that control their interaction. Sometimes these rules are unbreakable; I'll call these formal systems. Sometimes they are very malleable: thus informal systems. A deeper explanation can be found here:

http://theprogrammersparadox.blogspot.ca/2011/12/informal-ramble.html

If we were to look at the universe in absolute terms, the sum total of everything we know is one massive complex system. It is so large that I sincerely doubt we know how large it actually is. We can look at any one ‘element’ in this overall system and talk about its ‘context’; basically all of the other elements floating about and any rules that apply to the element. That’s a nice abstract concept, but not very useful given that we can’t cope with the massive scale of the overall system. I’m not even sure that we can grok the number of layers that it has.

Because of this, we narrow the context down to something more intellectually manageable. We pick some 'layer' and some subset of elements and rules in which to frame the discussion. So we talk about an element like a 'country', and we are focused on what is happening internally in it, or we talk about how it interacts with the other countries around it. We can leverage the mathematical terminology 'with respect to' -- abbreviated to 'wrt' -- for this usage. Thus we can talk about a country wrt global politics or wrt citizen unrest. This constrains the context down to something tangible.

A side-effect of this type of constraint is that we are also drawing a rather concrete border around what is essentially a finite set of particles. If we refer to a country, although there is some ambiguity, we still mean a very explicit set of particles at a particular point in time (inferred).

So what does this view of the world have to do with complexity? The first point is that if we were going to craft a metric for complexity then whatever it is it must be relative. So it is ‘complexity wrt a,b,c,.., z’. That is, some finite encapsulation of both the underlying elements (possibly all the way down to particles or lower) and some finite encapsulation of all of the rules that control their behavior, at every layer specified. Complexity then relates to a specific subsystem, rather than some type of absolute whole. Absolute complexity is rarely what we mean.

In that definition, we then get a pretty strong glimpse of the underpinnings of complexity. We could just take it as some projection based on all of the layers, elements and rules. That is of course a simplification of its essence and that in itself is subject to another set of constraints imposed by the reduction. Combined with the initial subsystem, it is easy to see why any metric for complexity is subject to a considerable number of external factors.

Another harder, but perhaps more accurate, way of looking at complexity is as the size of some sort of multidimensional space. In that context we could conceive of what amounts to the equivalent of a 'volume': a spatial/temporal approach to looking at the space occupied by the system. This allows us to take two constrained subsystems and roughly size them up against each other, to be able to say that one is more 'complex' than the other.

Complexity in this way of thinking has some interesting attributes. One of them is that while there is some minimum level of complexity within the subsystem, organization does appear to reduce the overall complexity. That is, in a very simple system, if the rules that bind it are increased, but the increase reduces the interactions of the epiphenomenon, the overall system could be less complex than the original one. There is still a minimum; you can't organize it down to nothing, but chaos increases the size of complexity (which is different from the way information theory sees the world). So there is some 'organizational principle' which can be used to push down complexity to its minimum; however, this principle is still bound by constraints similar to those that hold for any restructuring operation like simplicity. That is, things are 'organized' wrt some attributes.

Another interesting aspect of this perspective of complexity is how it relates to information. If complexity is elements and rules in layers, information is a path of serialization through these elements, rules and layers. That is, it is a linearized syntactic cross-section of the underlying complexity. It is composed of details and relationships that are interconnected, but flattened. In that sense we can use some aspect of Information Theory to identify attributes of an underlying subsystem. There is an inherent danger in doing this because the path through the complexity isn’t necessarily complete and may contain cycles and overlaps, but it does open the door to another method of navigating the subsystem besides ‘space’. We could also use some compression techniques to show that a particular information path is near a minimal information path. So that the traversal and the underlying subsystem are in essence as tightly woven as they could possibly be.

A key point is that complexity is subject to decomposition. That is, things can appear more or less complex by simply ignoring or adding different parts of the overall complexity. Since we are usually referring to some form of ‘wrt’, then what we are referring to is subject to where we drew these lines in the space. If we move the lines substantially, a different subsystem emerges. Since there are no physical restrictions on partitioning the lines, they are essentially arbitrary.

Subsystem complexities are not mutually independent of the overall complexity. We like to think they are, but all things are interrelated. However, there are some influences that are so small that they can be considered negligible. So, for instance, fluctuations in the temperature of Pluto (the planetoid) are unlikely to affect local city politics. The two seem unrelated; however, they both exist in the same system of particles floating about in space, and they are both types of epiphenomenon, although one is composed of natural elements while the other is a rather small group of humans interacting together in a regional confrontation. It is possible (but highly unlikely) that some chunk of Pluto could come crashing down and put an end to both an entire city and any of its internal squabbling. We don't expect this, but there is no rule forbidding it.

The way we as a species deal with complexity is by partitioning it. We simply ignore what we believe is on the outside of the subsystem and focus on what we can fit within our brains. So we tend to think that things are significantly less complex than they really are, primarily because we have focused on some layer and filtered down the elements and rules. Where we often get into trouble with this is with temporal issues. For a time, two subsystems appear independent, but at some point that changes. This often misleads people into incorrectly assessing the behaviors.

Because we have to constrain complexity, we choose not to deal with large systems, but they still affect the complexity. For the largest absolute overall system, it seems likely that there is a fixed amount of complexity possible. One has to be careful with that assumption though, because we already know from Gödel's Incompleteness Theorem that there is essentially an infinite amount of stuff theoretically out there as it relates to abstract formal systems. One could get caught up in a discussion about issues like the tangibility of 'infinite', but I think I'll leave that for another post and just state an assumption: there appear to be a finite number of particles, a maximum size, and an end to time, and thus a finite number of interactions possible in the global system. For now we can just assume it is finite.

Because of the sheer size of the overall system, there is effectively no upper limit on how complex things in our world can become. We could apply the opposite of the earlier 'organizational principle' to build in what amounts to artificial complexity and make things more complicated. We could shift the boundaries of the subsystem to make it more complex. We could also add in new abstract layers, which again would increase the complexity. It is fairly easy to accomplish, and from our perspective there is effectively an infinite amount of space (wrt a lifetime) to extend into.

One way of dealing with complexity is by encapsulating it. That is, cleaving off a subsystem and embedding it in a 'black box'. This works so long as the elements and rules within the subsystem are not influenced by things outside of the subsystem in any significant way. This restriction means that working encapsulation is restricted to what are essentially mutually independent parts. While this is similar to how we as people deal internally with complexity, it requires a broader degree of certainty of independence to function correctly. You cannot encapsulate human behavior away from the rules governing economies, for instance, and these days you cannot encapsulate one economy from any other on the planet; the changes in one are highly likely to affect the other. Encapsulation does work in many physical systems and often in many formal systems, but again only wrt elements in the greater subsystem. That is, a set of gears in a machine may be independent of a motor, but both are subject to outside influences, such as being crushed.

Overall, complexity is difficult to define because it is always relative to some constraints and it is inherently woven through layers. We don't tend to be able to deal with the whole, so we ignore parts and then try to convince ourselves that these parts are not affecting things in any significant way. It is evident from modern societies that we do not collectively deal with complexity very well, and that we certainly can't deal with all of the epiphenomena currently interacting on our planet right now. Rather, we just define very small artificial subsystems, tweak them and then hope for or claim positive results. Given the vast scale of the overall system, we have no realistic way of confirming that some element or some rule is really and truly outside of what we are dealing with, or that the behavior isn't localized or subject to scaling issues.

Mastering complexity comes from an ever-increasing stretching of our horizons. We have to accept external influences and move to partition them or accept their interactions. In software, the complexity inherent in the code comes from the environment of development and the environment of operations. Both of these influence the flow and significance of the details within the system. Fluctuations in outside needs and understanding drive the types of instructions we are assembling to control the computer. Our internal 'symbols' for the physical world align or disconnect with reality based on how well we understand their influences. As such, we are effectively modelling limited aspects of informal systems in the real world with the formal ones in a digital world. Not only is the mapping important, but also the outside subsystems that we use to design and build it. As the boundaries increase, only encapsulation and organization can help control the complexity. They provide footholds for taming the problems. The worst thing we can do with managing complexity is to draw incorrect, artificial lines and then just blind ourselves to things crossing them. Ignoring complexity does not make it go away; it is an elementary property of our existence.

Tuesday, November 9, 2010

Reducing Test Effort

Last week a reader, Criador Profundo, asked a great question on my last post Code Validation:

“Could you comment on how tests and similar practices fit in with the points you have raised?”

I meant to reply earlier, but every time I started to organize my thoughts on this issue I found myself headed down some closely-related side issue. It's an easy area to get side-tracked in. Testing is the point in software development where all of the development effort comes together (or not), so pretty much everything you do, from design to coding to packaging, affects it in some way.

Testing is an endless time sink; you can never do enough of it, never get it perfected. I talked about tactics for utilizing limited resources in an older post, Testing for Battleships (and probably a lot of the other posts as well). I'll skip over those issues in this one.

The real question here is what can be done at the coding level to mitigate as much of the work of testing as possible. An ideal system always requires some testing before a release, but hopefully not a full battery of exhaustive tests that takes weeks or months. When testing eats up too many resources, progress rapidly slows down. It’s a vicious cycle that has derailed many projects originally headed out in the right direction.  

Ultimately, we'd like previously tested sections of the system to be skipped if a release isn't going to affect them. A high-level 'architecture' is the only way to accomplish this. If the system is broken up into properly encapsulated, self-contained pieces, then changes to one piece won't have an effect on the other pieces. This is a desirable quality.

Software architecture is often misunderstood or maligned these days, but it is an absolute necessity for an efficient iterative based development. Architectures don’t happen by accident. They come from experience and a deliberate long-term design. They often require care when extending the system beyond its initial specifications. You can build quickly without them, but you’ll rapidly hit a wall.

Encapsulation at the high level is the key behind an architecture, but it also extends all the way down, to every level in the code. Each part of the system shouldn't expose its internal details. It's not just a good coding practice; it's also essential for knowing the scope of any changes or odd behaviors. Lack of encapsulation is spaghetti.

Redundancies -- of any kind: code, properties, overloaded variables, scripts, etc. -- are also huge problems. Maybe not right away, but as the code grows they quickly start to rust. Catching this type of rusting pushes up the need for more testing. Growth slows to a crawl as it gets quickly replaced by testing and patching. Not being redundant is both less work and less technical debt. But it's easier said than done.

To avoid redundant code, once you have something that is battle-tested it makes little sense to start from scratch again. That's why leveraging any and all code as much as possible is important. Not only that, but a generalized piece of code used for fifty screens is way less work to test than fifty independently-coded screens. Every time something is shared, it may become a little more complex, but any work invested in testing is multiplied. Finding a bug in one screen is actually finding it in all fifty. That is a significant contribution.
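As a hedged illustration of that multiplier (the validator and its fields are invented for the example), one shared component can carry the testing effort for every screen that declares itself in terms of it:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// One generalized validator: test it once, and every screen using it benefits.
class FormValidator {
    private final Map<String, Integer> maxLengths = new LinkedHashMap<>();

    // each screen just declares its fields instead of re-coding the checks
    FormValidator require(String field, int maxLength) {
        maxLengths.put(field, maxLength);
        return this;
    }

    // returns the fields that fail, so every screen reports errors consistently
    List<String> validate(Map<String, String> input) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, Integer> e : maxLengths.entrySet()) {
            String value = input.get(e.getKey());
            if (value == null || value.isEmpty() || value.length() > e.getValue()) {
                bad.add(e.getKey());
            }
        }
        return bad;
    }
}

public class GeneralizedScreens {
    public static void main(String[] args) {
        FormValidator login = new FormValidator().require("name", 10).require("city", 20);
        System.out.println(login.validate(Map.of("name", "", "city", "ottawa")));
    }
}
```

A bug found and fixed in `validate` is fixed for every screen at once; fifty hand-rolled copies would each need their own test pass.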

The key to generalizing some code and avoiding redundancies in a usable manner is abstraction. I've talked about this often, primarily because it is the strongest technique I know to keep large-scale development moving forward. Done well, not only does development not slow down as the project progresses, it actually speeds up. It's far easier to re-use some existing code, if it is well-written, than it is to start again from scratch.

Sadly, abstraction and code re-use are controversial issues in software development because programmers don't like having to look through existing code bases to find the right pieces to use, they fear sinking too much time into building up the mechanics, and because it is just easier to splat out the same redundant code over and over again without thinking. Coding style clashes are another common problem. Still, for any large-scale, industrial-strength project it isn't optional. It's the only way to avoid an exponential growth in the amount of work required as the development progresses. It may require more work initially, but the payoff is huge and it is necessary to keep the momentum of the development going. People often misuse the phrase "keep it simple, stupid" to mean writing very specific, but also very redundant code; however, a single generalized solution used repeatedly is far less complex than a multitude of inconsistent implementations. It's the overall complexity that matters, not the unit complexity.

Along with avoiding redundancies comes tightening down the scope of everything (variables, methods, config params, etc.) wherever possible. Technically it is part of encapsulation, but many programmers allow the scope of their code or data to be visible at a higher level, even if they've tried to encapsulate it. Some misdirected Object Oriented practices manage to just replace the dreaded global variable with a bunch of global Objects. They're both global, so they are both the same problem. Global anything means that you can't gauge the impact of any changes, which means re-testing everything, whether it is necessary or not. If you don't know the impact of a change, you've got a lot more work to do.
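A quick sketch of the difference (the names are illustrative, not from any real system): the same configuration, reachable from everywhere versus passed only to the one object that needs it.

```java
// The global-Object version of the dreaded global variable: anything,
// anywhere can read or reset it, so any change means re-testing everything.
class GlobalConfig {
    static int timeoutMs = 500;
}

// The tightly scoped alternative: the config is only visible to the objects
// it was explicitly handed to, so the impact of a change is knowable.
class Config {
    private final int timeoutMs;

    Config(int timeoutMs) { this.timeoutMs = timeoutMs; }

    int timeoutMs() { return timeoutMs; }
}

class Fetcher {
    private final Config config;   // scope: this object only

    Fetcher(Config config) { this.config = config; }

    String describe() { return "timeout=" + config.timeoutMs() + "ms"; }
}

public class ScopeExample {
    public static void main(String[] args) {
        // changing this Config can only affect the Fetcher it was given to
        System.out.println(new Fetcher(new Config(250)).describe());
    }
}
```

With the narrow scope, a change to one `Config` instance can only ripple out to the objects it was injected into, which is exactly the set of things that need re-testing.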

Another issue that causes testing nightmares is state. Ultimately, the best code is stateless. That is, it doesn't reference anything that is not directly passed into it, and its behavior cannot change if the input doesn't change. This is important because bugs can be found accidentally as well as on purpose, but if the test case to reproduce the bug is too complex, it will likely be ignored (or assumed to have magically disappeared) by the testers. It's not uncommon, for instance, to see well-tested Java programs still having rare but strange threading problems. If they don't occur consistently, they either don't get reported or they are summarily dismissed.
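In code, the contrast looks roughly like this (a hedged sketch with invented names): the stateful version's answer depends on its call history, while the stateless one is trivially reproducible.

```java
// Stateful: the result depends on what was called before, so a bug report
// needs the whole call history to reproduce.
class StatefulTax {
    private double rate = 0.05;

    void setRate(double rate) { this.rate = rate; }

    double tax(double amount) { return amount * rate; }
}

// Stateless: everything comes in through the arguments, so the same inputs
// always give the same output -- a one-line bug report is enough.
class StatelessTax {
    static double tax(double amount, double rate) { return amount * rate; }
}

public class StatelessExample {
    public static void main(String[] args) {
        StatefulTax t = new StatefulTax();
        t.setRate(0.1);                               // hidden precondition for the next line
        System.out.println(t.tax(100.0));
        System.out.println(StatelessTax.tax(100.0, 0.1)); // no preconditions at all
    }
}
```

The stateless call can be dropped into any test at any time; the stateful one is only correct if `setRate` happened first, which is exactly the kind of detail that gets lost in a bug report.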

There are plenty of small coding techniques as well. Consistency and self-discipline are great for reducing the impact of both bugs and extending the system. Proper use of functions (not too big, not too small, and not too pedantic) makes the code easier to test, extend and refactor. Making errors obvious in development, but hiding them in the final system, helps. Limiting comments to "why" and trying to avoid syntactic noise are important. Trying to be smart, instead of clever, helps as well. If it's not obvious after a few months of not looking at it, then it's not readable and thus a potential problem.

Ultimately, once the work has been completed and tested a bit, it should be set aside and ignored until some extension is needed later. If you're forced to constantly revisit the code then you're not building anymore. It's also worth noting that if the code is any good there will be many, many people that look at it over the decades that it remains in service. The real indicator of elegance and quality is how long people continue to use the code. If it's re-written every year, that says a lot about it (and the development shop). (Of course, it can also be so bad that nobody has the nerve to look at it, and it just gets kept in by default.)

There is a big difference between application development and systems programming. The latter involves many complex technical algorithms, usually based on non-intuitive abstractions. It doesn't take a lot of deep thinking to shift data back and forth between the GUI and the database, but it does to deal with resource management, caching, locking, multiple processes, threading, protocols, optimizations, parsing, large-scale sorting, etc. Mostly, these are all well-explored issues and there is a huge volume of available knowledge about how to do them well. Still, it is not uncommon to see programmers (of all levels of skill) go in blindly and attempt to wing it themselves. A good implementation is not that hard, but a bad one is an endless series of bugs that are unlikely to ever be resolved, and thus an endless series of testing that never stops. Programmers love to explore new territory, but getting stuck in one of these traps is usually fatal. I can't even guess at the number of software disasters I've seen that came from people blindly diving in without first doing some basic research. A good textbook on the right subject can save you from a major death march.

Unit testing is hugely popular these days, but the only testing that really counts in the end is done at the system level. Specifically, testing difficult components across a wide range of inputs can be faster at the unit level, but it doesn’t remove the need to verify that the integration with other pieces is also functioning. In that way, some unit testing for difficult pieces may be effective, but unit testing rather simple and obvious pieces at both the unit level and the system level is wasted effort, and it creates make-work when extending the code. Automated system testing is hugely effective, but strangely not very popular. I guess it is just easier to splat it out at the unit level, or visually inspect the results.

From a user perspective, simple functionality that is easily explained is important for a usable tool but it also makes the testing simpler as well. If the test cases are hugely complicated and hard to complete properly, chances are the software isn’t too pleasant to use. The two are related. Code should always encapsulate the inherent difficulties, even if that means the code is somewhat more complicated. An overly-simple internal algorithm that transfers the problems up to the users may seem elegant, but if it isn’t really solving the problem at hand, it isn’t really useful (and the users are definitely not going to be grateful).

There are probably a lot more issues that I've forgotten. Everything about software development comes down to what you are actually building, and since we're inherently less than perfect, testing is the only real form of quality control that we can apply. Software development is really more about controlling complexity and difficult people (users, managers AND programmers) than it is about assembling instructions for a computer to follow (that's usually the easy part). Testing is that point in the project where all of the theories, plans, ideas and wishful thinking come crashing into reality. With practice this collision can be dampened, but it's never going to be easy, and you can't avoid it. It's best to spend as much initial effort as possible to keep it from becoming the main source of failure.