Monday, January 14, 2008

The Construction of Primitives

There is still plenty of 'meat' left in exploring fundamental development issues. We could spend time on the more interesting higher-order topics, but when the discussions get gummed up in problems with basic terminology, understanding or process we don't make significant headway. As a young industry, Computer Science has established some level of competency, but many of our key assumptions are still open for discussion. We have built on our knowledge, but our results are clearly not reliable.

Software developers, it seems, have all gone off to their own little worlds, a situation which is quite possibly analogous to mathematics prior to Sir. Issac Newton managing to enforce notational standards. With everyone running around using their own unique terms and definitions, we are getting a lot of impedance mismatches in our conversations. As far as I know, there isn't even an 'official' body on which we all could agree as being the respected authority on software development. Our industry has fragmented into a million pieces.

For this post I figured I would continue on digging deeply with a look into the way we choose to decompose code into primitive functions. Mostly I'll use the rather older term 'function' when I talk about a block of code, but this discussion is applicable to any functional paradigm including object-oriented, for which the term 'method' is more appropriate. But I'll stick to using function because it is the term with the least connotations.

Again, as with any of these abstract discussions this will start fairly conventionally and then get plenty weird as it winds it way into the depths.

PRIMITIVE FUNCTIONS

We are all familiar with arithmetic; it is a simple branch of mathematics that consists of the four basic operators +, -, * and /. True number theorists would also define some information with respect to the underlying set of numbers, e.g. integers, real, complex, etc. but for this discussion I'd rather stay away from group and ring theory; they are two rather complex branches of mathematics that are interesting, but not immediately relevant.

What is important is that we can also see the above operators as the functions 'add', 'subtract', 'multiple' and 'divide'. Each one of these functions takes two arguments and returns a resulting value. We get something vaguely like:

add(x,y) := return (x + y);
sub(x,y) := return (x - y);
mult(x,y) := return (x * y);
div(x,y) := return (x / y);

There are other 'wordings' that are possible to describe these functions. I think in some cases you might call them axioms, in some contexts they could represent a geometry, or we could talk about them as language tokens that have a specific expressibility. In x86 assembler the following functions for handling unsigned types are similar:

ADD -- Add
SUB -- Subtraction
MUL -- Unsigned multiply
DIV -- Unsigned divide

There are many more 'isomorphic' ways of addressing the same four functions, but in the end, they are all essentially a description of the same four 'primitive' functions that comes from the collection of operators available in arithmetic.

Whatever terminology or space you want to use to look at these four functions, what I want to draw towards you attention is not the verbiage used to describe them, nor their underlying behaviors, instead it is the relationship between the four things themselves. Note that a) they are all of the possible operators* b) they do not overlap** and c) they form a complete set***.

* all of the operators for 'arithmetic' only, on the ring for real numbers for + and * or something like that (it has been decades since I took ring theory).
** multiply is really X added Y times to itself, but you need the Y for the number of adds, so even though it is rooted in addition, it is still something unique.
*** add is the inverse of sub, div is the 'sortof' inverse of mul (see group theory). They have all of the consistent and complete properties: associative, commutative, inverse, entity, etc.

We can see that these four primitive functions form a 'language' for which we can express all possible arithmetic problems. If you need to do any arithmetic, all you need to know and understand are these four functions. They are simple, consistent and complete.

On a lexicographical note, I prefer to the use the term 'primitive' to describe any set of functions which do not overlap with each other. I think 'atomic' was another term we used in school, because each operation was an atom at the data structure level -- which might have possibly been a pre-quantum-physics usage -- but we have added so many connotations to that term that it has a very definitive meaning to most people. Also, 'atomic' sometimes means a function that will execute in one all-or-nothing shot, such as 'test and set'. Other terms like elementary, fundamental and primal, etc. are OK, but 'primitive' works well because we use it in regard to type information as well, e.g. 'int' is a primitive type. So, the functions are primitive, but not in the sense that they are crude. It is the sense that they are at the very bottom level and are not decomposable into smaller 'primitives'.

A DIFFERENT SET OF FUNCTIONS

So far, so easy. Just a set of the four base functions that completely cover and define arithmetic. Now suppose we created eight more functions, each taking two values, and working them as follows:

f1 = (2*x*x - y) / 2*x
f2 = x + 2*y
f3 = 4*x
f4 = 3*x
f5 = x*(-2*x - 5*y)
f6 = 2*x*y
f7 = x - y
f8 = y*( -2*x*(x + y) + x)

What is interesting about these eight functions is that none of them is identical to any of the original four functions. Also, they are obviously quite complex. Interestingly, combined they cover the exact same space as the original four functions:

add = f1 - f2 / f3
sub = f2 * f4 + f5
mul = f4 + f5 / f6
div = f6 * f7 + f8

However, as functions they have a significant amount of overlap with each other. f3 and f4 for example are close but not identical. Most of the functions use the * operator. All of them have at least one one underlying arithmetic operator, while one has as many as 6. Together they form another 'complete' interface for arithmetic, but they are clearly not primitive functions. They are composites.

So, we have two complete sets of functions, each forming an API that can be used to express any and all arithmetic problems. They are similar, equally expressive, but clearly the second one with eight functions is considerably more complex than the first. It would be possible to memorize all eight functions and use them entirely in place of +, -, * and /, but given their rather useless added complexity who would bother?*

*You can speak Klingon? You're kidding :-O

Now this might seem to be a contrived example, but it happens frequently on a larger scale. Consider a library for a graphic user interface (GUI) that consists of 300 basic functions. Now consider another one that has over 4000 functions. It is obvious that if we could find 300 functions to express most of the functionality needed for handing a GUI application, then few of the 4000 functions would likely be 'primitive'. While the overlap between the two libraries might not be exact, without even getting into the specifics we can start seeing when things are composed of primitives and when they are composed of many composite functions.

But we do have to be a little careful. There can of course be multiple different independent sets of primitives that share the same 'space'. The classic example is with Boolean logic. AND, OR and NOT form a complete set of operators. NAND by itself does as well. The two sets of primitives are completely interchangeable, consistent and complete, although one is three times the size of the other.

Primitive functions can exist for anything, although we generally tend to see them as elementary operations on things like abstract data types. For example, linked-lists generally have the functions: new, find, delete and insert defined. Although data-structures are a more popular usage, every sequence of instructions can be broken down into a reasonable set of primitives. And every set of primitives can be consistent and complete.

WHY IS THIS IMPORTANT?

We break our programs into smaller pieces so that we can manage the complexity.

At each different layer we provide some type of interface so that ourselves or more often, other programmers can access the underlying layer. We always want the simplest possible 'interface' that can do the job, given whatever level of abstraction and encapsulation are necessary. Any interface that is even marginally more complex is 'artificially' complex, because it is possible to remove that complexity.

There may be circumstances where artificial complexity is acceptable because of the abstraction or the need for an optimized non-primitive version of the function, but that is probably a weakness in the interface and most likely represents a bad design. Non-primitives are repetitive, which increases the risk of bugs.

Our best interfaces are the ones that presents a consistent set of primitives to the user without overlap. The functions themselves form the basis for the grammar needed to spell out the problem at a higher level. We simply 'say' the higher-level problem in terms of the lower-level primitives. Arithmetic, for example is spoken in terms of addition, subtraction, multiplication and division. It forms the basis of any book-keeping system. With these basic functions we could start to lay the foundations for a double entry accounting program.

A good interface, with its consistency allows the user to remember only a small number of 'axioms' that they can then use to spell out a larger number of higher-order problems. Completeness and consistency with primitives makes it easier, and thus significantly faster to express the solution with the overall set of available functions. All of the extra artificial complexity massively gums up the works. It is easy to remember the four operators for arithmetic, but although it is only another four operators, try actually writing out any type of significant arithmetic calculation using the arbitrarily complex eight functions in the first example. Although the API is only slightly larger, the complexity is through the roof.

GETTIN' JIGGY WIT IT

In an earlier entry on abstraction I suggested that data was only an abstract projection of something in the real world:

http://theprogrammersparadox.blogspot.com/2008/01/abstraction-and-encapsulation.html

That abstract view of data is extremely useful in accepting some of the behavioral characteristics of data within a computer. Extending this a bit, there are essentially only two things in all software: data and code. There is nothing else; everything is either one or the other. Going farther, all code is broken down into functions, and on a very esoteric level there is no real difference at any give point in time, between the data itself and the functions as they exist in the computer.

The key point here is 'in time'. Functions execute, but frozen in time they are just data as it exists in the system. A great visualisation example of this is a fractal generation program. Before it has been run, the fractal program will only have high-level information regarding a Mandelbrot fractal. But that information, in the form of data, is stored as a set of symbolic finite discrete parameters that explain to the computer the underlying 'details' of a fractal. When you run the program, you pick a range for the fractal to be 'visualized' within. The program generates from its internal information a large set of data that it can display as an image. This image shows what the fractal looks like at a specific range.

The information within a fractal is infinite, you can endlessly explore the depths forever, there is no limit*. But at any given point in 'time' all of the data on a computer is discrete and finite. That you appear to be browsing through this infinitely deep Mandelbrot fractal is somewhat of a trick. The computational power of your computer is stretching out the fractal definition infinitely over the time dimension while inside of the computer at point there exists just the finite set of information describing the fractal, and a few images of it at various ranges. Nothing more.

* there are algorithms, such as Karatsuba that can deal with an arbitrarily large precision, given an arbitrarily large amount of time. At some depth, physical resources like disk space may become an issues as well.

A function as a piece of data at a point in time is a hard concept to grasp. The converse, making functions out of all of the data in the system is much easier and a standard for object-oriented programming. Just about everything you access in most modern object-oriented programs is a 'method' for some underlying object. The data exists, but is almost always wrapped in a functional layer. Interestingly enough, most of the Design Patterns center around data, but some are effectively data'izing function calls, such as the Command pattern. Commands which are 'tied' to functions, once issued become data, allowing them to be 'undone'.

Although it is rather an abstract notion, data and functions are essentially the same thing. Well, at least at any discrete slice in time. There are many languages and techniques where data and functions are mixed to great effect. Being able to abstractly see one in terms of the other allows for a broader level of generalization.

PRIMITIVE DATA

Given that data and functions are mostly interchangeable, we can generalize some of our understanding about primitive functions.

Data can be primitive as well, but the internal meaning is a little different than functions. Primitive data is information that has been parsed down into its smallest usable units 'within' a system. If for example, the system stored and used Universal Resource Locators, better known as URLs, to maintain the location of dependencies on the Web, these are the base level and there is no need to decompose them any further. However some systems may need to break up the data to access smaller pieces such as the embedded host name or protocol information. In that case, they need to parse the URLs to access the sub-data. URLs may be primitive in one system, but not in another.

Dealing with non-primitive data brings up two common problems: a) redundant data and b) data overloading. If for example, the code requires both the URL and the host name, some programmers would parse one from the other and store both.

Redundant 'anything' in a computer is always dangerous because as the code is changed frequently by the programmers, there is a significant risk that changes might not occur equally to both copies. At all levels in the code, the data should be manipulated only in its most primitive form. That might change depending on the level, but it needs to be consistent with each segment of the code. Composite data should be deconstructed into primitives immediately. Keeping that type of 'discipline' in development ensures that there are far less bugs to be found.

Data overloading is a related problem. The programmer essentially combines two separate and distinct variables into a single one. Most often we find this happening with calculated conditions that are lying around in the code, such as user-defined options or internal flags. The original definition means something specific, but over time more and more code gets attached to a conditional variable, but with very different meanings. The code breaks when a fix is applied that fixes one usage for the variable, but breaks it for the other one.

The fix for this type of problem is really easy, just break the variable into two distinct variables, updating the correct data type for each. While it is a hard problem to explain, it is a common one in coding and an easy one to fix. Programmers tend towards saving a bit of space from time-to-time, so they frequently overload the meaning of existing variables, instead of creating new ones.

Primitive data is easier to see in a system because all you need do is follow the data around and see if it is decomposed any further. If it is, the 'parsing' should be pushed back as early as possible in the code, and the 're-assembly' as late as possible. If all of the data is stored at the same depth within the code, copies of it are not necessary. In many systems, there can be dozens of copies of the same data as it winds it way through the logic, buffers and caching. This redundant data is dangerous, costly and mostly avoidable.

SUMMARY

There are a couple of very important points spawned by this discussion.

The first of which is that as software developers we should endeavour to pick primitives for all of our API function calls and all of our internal data. We do this because it make the code a) easier, b) more memorable and c) less complex.

For functions, we need to realize that everything we write is an interface and the 'users' are our fellow programmers. In building a system, every function call belongs to some interface. Beyond just good design and simpler code we should also have empathy for those poor fellows stuck in the same position as ourselves. Nobody wants to have to memorize 200 convoluted function calls when 20 would do.

The same is true for data. Systems that are constantly breaking up data, then reassembling it, waste huge amounts of resources and are inherently convoluted. It is not a complicated amount of refactoring to push 'parsing' to be earlier and 'reconstruction' to be later.

If we stick to primitives, it simplifies the code and keeps it closer to optimal; we don't end up with a lot of convoluted data manipulations that aren't necessary. We really want to decompose things in the smallest consistent bits because that is the most expressive and least complex approach towards building.

For functions we also want to 'implement' the 'complete' API at any layer in the system, whether or not we think we are going to use it right way. Ignoring the 'Speculative Generality' code smell issue, whenever we implement a layer for ourselves or other programmers to use, that layer is an interface. The Speculative Generality smell is about not spending effort building something because you are just speculating that someone might use it. That is different than only doing a 'half' job on an interface. At some point, always, if half the code exists, someone will need the other 'half'. That is not speculating, that is fact. It is about finishing off what you started. Where you can add something, you should be able to delete it. If you can edit it, you can save it or create a new one. Half implemented tools are horrible.

The final point is that as consumers of other's code, we should demand that they provide to us the simplest possible code to get the job done. It is not uncommon to see very large and overly complex APIs full of overlapping, inconsistent poorly thought-out functions. These types of interfaces require considerable extra effort in order to utilize them. Not only is effort wasted in development, but it is massively wasted in utilizing this type of code. We need to be more vocal in not accepting this type of mess. If you build on a dependency that is convoluted now, how much worse will it get in the future? Thanks to 'backward compatibility' bad things no longer get removed from code, they just fester and grow worse over time. If we accept a mess now, it will only get worse. Programmers need to stop being fooled by shiny new technology; it is often only shiny because of excessive marketing.

The core thing to remember is that there is really no technical reason for us to have to frame our solutions in non-primitive constructs. And each and every time we do, we have just added artificial complexity to the system. It is always possible to get to an underlying set of primitives. It may take some work to 'find' or 'fix' the code, but the 'value' of that work is beyond doubt.