Saturday, April 25, 2015

Special Case Considered Harmful

Without a doubt, the all-time quickest way to implement any new variability in an existing code base is to create a new branch with if/else statements, copy & paste the original code into its new block, and then modify the copy. You don't have to spend any time thinking. It's quick. It's initially painless. And it is one of the most popular ways to create code.

As a result, we often see programs that consist of layer upon layer of nested branches; sometimes four levels deep, sometimes worse. More experienced programmers avoid the deep nesting, but often only by scattering the special cases across a mass of methods or functions.

Although the code itself gets larger, the testing of conditionals is usually cheap so there aren't any real performance penalties associated with this type of code. It may not be optimal, but it's not really slow either.

Testing does require more effort; forgetting to check the functionality of a newly added special case can easily mean that it gets released untested. If the branches are deeply nested, it can take considerable effort to unwind all of them into a complete list of test cases. Most often that's not actually done, leading to bugs.

Extending the code is where the worst hit is taken. A few branches are not hard to deal with, but as they quickly build up, it becomes harder and harder to guess their behaviour just by reading the code. With so many test cases necessary to cover the whole range of behaviour, it also becomes harder to follow in a debugger. Thus you can't read it, and it's no easy task to walk it either.

As the difficulties mount, the uncertainty grows -- probably exponentially -- until any new change to the code has a much better chance of being wrong than it does of working properly. At this point, since the code can no longer be safely changed, other, worse options are usually selected.

At the root of this very common problem are special cases. If there is only one way in the code to handle all of the data, then everything goes through the same logic and there is no need for any branches. The code is almost trivial.

Still, most people find it easier to solve big problems by breaking off a subset and then handling it in a 'special' way.

It's taught to us offhandedly, in that we learn to solve problems by decomposing them. That's fine, except when the decomposition cuts across deeply related problems. Breaking everything into special cases might be an effective way to understand how to go about solving the pieces, but the second half of that approach -- the half that isn't taught -- is to bring it all together again for the whole solution.

Special cases are often highly favored in many domains, thus the domain experts and analysts are always a significant source of them entering a project. Big systems get built by going after the low-hanging fruit first, then picking off the lesser cases one by one. This ensures that progress is visible right from the onset of the project.

Thus the proliferation of special cases is fast, easy and we're often led to believe that it is the correct way to handle complex problems. That this is then mapped directly into deeply nested code or a quagmire of functions is not really a surprise. But it doesn't have to be that way.

One of the most visible aspects of any elegant code is that it has neither an excessive number of branches nor too many loops. The code is organized, the flow through it is clear, and the special cases are few or non-existent. Clarity comes from the level just above the instructions; done well, it keeps the logic from spewing off into a myriad of spaghetti-like flows.

From those rare examples, we can easily conclude that minimizing branches is both good and definitely worthwhile if we intend for the code to continue to expand. The follow-up questions then are "how?" and "what's the minimum number of branches?"

Long ago, when I first started programming, there was always a lot of effort put into simplifying coding logic. The earlier languages offered little guidance in structure, so the programmers had to spend more time back then trying to work out appropriate ways to keep it from getting out of control. Object Oriented programming, and indirectly design patterns, embedded some of that organization directly into the languages themselves, but at the expense of the newer generations of programmers no longer understanding why well-structured code was important. The emphasis shifted to gluing stuff together as fast as possible. That might work for consolidating clumps of existing technologies, but it's an unmanageable nightmare when applied to a sea of special cases coming in from the domain side.

What's necessary is to avoid implementing special cases wherever and whenever possible. And to clean up some of the existing mess, it is critical to merge any existing special cases back into the mainstream. Fortunately, special cases are easy to spot since they exist as branches, and they are easy to remove: you just get rid of the if/else statements and merge the code together.

This does require some thought and practice, but it is not rocket science and it becomes easier to do as you gain more experience.

In the really easy cases, you can drop a branch when the difference between the two underlying blocks of code is only a variable or two. That is, you create new variables, assign them values at the beginning, and then perform the sequence of actions, passing in the variables where appropriate. In the worst case, you might have to go back a ways through the code, adding the variables right up to where the data is loaded. I'll get back to that approach later.
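As a small sketch of that idea (the function and field names here are hypothetical, not from any real system), two near-identical branches collapse into one path once the differing values are hoisted into variables:

```python
def deliver(subject, body, retries):
    # Stand-in for a real delivery mechanism.
    return {"subject": subject, "retries": retries}

def send_report(report, is_priority):
    # The branch now touches only the two values that actually differ;
    # the shared sequence of actions appears exactly once.
    prefix = "[URGENT] " if is_priority else ""
    retries = 5 if is_priority else 1
    return deliver(prefix + report["title"], report["body"], retries=retries)
```

Before this refactoring, the whole call to deliver would have been duplicated in both blocks of an if/else; afterwards the difference is just data.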

For some cases, the difference is actually contained in a larger set of variables. This is handled the same way as individual variables, but with a structure or object instead of a single variable. Keeping related variables together and building up complex structures to pass around is a strong approach to organizing any code.
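A minimal sketch of bundling the differing variables into one structure (the pricing names are purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class PricingPolicy:
    # A bundle of related variables passed around as one unit.
    tax_rate: float
    discount: float

def price(amount, policy):
    # One code path; the 'special cases' live entirely in the data.
    return round(amount * (1 - policy.discount) * (1 + policy.tax_rate), 2)

retail = PricingPolicy(tax_rate=0.13, discount=0.0)
wholesale = PricingPolicy(tax_rate=0.13, discount=0.15)
```

Adding a new case means constructing another policy object, not adding another branch.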

For other cases, the lists of executed instructions are slightly different in each branch. There are some subtle ways of handling this, but they usually involve either table-driven programming or an intermediate data structure. The idea is that you build up data that lists the instructions, varying by each case, as you go. Once the structure is complete, you walk through it, executing everything. Simple lists fit into statically created tables, while more complex logic can be dynamically created as lists, trees or even DAGs. You'll find this type of code at the heart of interpreters, compilers, editors, graphics programming and some really beautiful domain applications. The if statements are implicitly hidden by the way the structure is constructed. This is a powerful way to eliminate special cases and to control complexity, while still making it almost trivial to extend the code. For static tables you just add new rows; for intermediate representations you just add more functionality to manipulate the structure.
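The table-driven idea can be sketched in a few lines (the record types and handlers here are made up for illustration):

```python
def handle_deposit(rec):
    return rec["balance"] + rec["amount"]

def handle_withdrawal(rec):
    return rec["balance"] - rec["amount"]

# One row per case; extending the system means adding a row, not a branch.
HANDLERS = {
    "deposit": handle_deposit,
    "withdrawal": handle_withdrawal,
}

def process(records):
    # The 'if' is implicit in the table lookup.
    return [HANDLERS[rec["type"]](rec) for rec in records]
```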

That of course leads to employing polymorphism in OO languages. The special cases disappear because everything is handled by overrides of basic primitive methods specific to each class. That is, the blocks in the if/else statements become the methods implemented in the objects. Then you get a 1:1 relationship between the smallest blocks needed and the primitives necessary to solve the problem. Set up correctly, expanding a polymorphic hierarchy requires as little thinking as creating deeply nested code (which generally means you still have to be careful), but the end result is organized. To go a bit farther, you can even reuse common code blocks via inheritance, which saves on typing and testing.
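The classic textbook illustration of this, sketched here with hypothetical shape classes:

```python
import math

class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius
    def area(self):
        return math.pi * self.radius ** 2

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self):
        return self.side ** 2

def total_area(shapes):
    # No if/else on shape type; each class supplies its own primitive.
    return sum(s.area() for s in shapes)
```

A new shape is a new class with its own override; total_area never changes.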

In particularly messy systems where there are an explosive number of permutations that all might need to be handled someday, polymorphism is a really strong approach. Even before OO, we'd implement the same mechanics in C by using table-driven function pointers that implemented the primitives. Then, to change processing for different sets of data or configuration, all you needed to do was update the table with a new set of function pointers, making it easy and reliable to extend to new cases. Setting up that type of organization doesn't take much time, but it saves massive amounts later as the code grows.
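The Python analogue of those C function-pointer tables is a dictionary of callables; swapping configurations means swapping the whole table, not editing branches (the pipeline names here are invented for the sketch):

```python
# Two complete tables of behaviour. Changing how data is processed
# means selecting a different table of 'function pointers'.
NORMAL = {"parse": str.strip, "render": str.upper}
DEBUG = {"parse": str.strip, "render": repr}

def run(pipeline, text):
    # The same driver works for every configuration.
    return pipeline["render"](pipeline["parse"](text))
```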

Sometimes the special cases are very hard to remove. For that we have to go up to abstraction, and then manufacture virtual elements to smooth out the logic. This is a tricky technique to explain in general terms, but the basic idea is to add in things that don't really exist, but if they did, then there wouldn't be any special cases remaining. A very simple example exists in "The Art of Computer Programming", where Donald E. Knuth documents a couple of algorithms for linked lists (circular and doubly linked) that always have a head node. That is, the list is never empty, even if there is nothing in it. This simple trick reduces the logic down to something quite manageable and provides several neat little micro-optimizations. It's a very small "space vs. time" trade-off that's unfortunately been mostly forgotten.
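A rough sketch of that sentinel head-node trick, in the spirit of Knuth's circular doubly linked lists (this is an illustrative toy, not his exact algorithm):

```python
class Node:
    def __init__(self, value=None):
        self.value = value
        # A lone node points at itself, keeping the list circular.
        self.prev = self
        self.next = self

class DList:
    # The list always contains a head node, so it is never truly empty
    # and insert/remove need no special case for the first or last element.
    def __init__(self):
        self.head = Node()

    def push_front(self, value):
        node = Node(value)
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node

    def remove(self, node):
        # No 'if this is the only node' branch; the head absorbs the edge cases.
        node.prev.next = node.next
        node.next.prev = node.prev

    def values(self):
        out, cur = [], self.head.next
        while cur is not self.head:
            out.append(cur.value)
            cur = cur.next
        return out
```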

I've had to use this abstraction technique many times in the past to simplify difficult, yet performance-dependent logical flows. The different cases are tantalizingly close to each other, but they don't actually fit together. It can take a bit of creativity and patience, but eventually you stumble onto some virtual piece that ties it all into one. The key is that the code actually gets simpler; if it doesn't, then you need to toss it immediately and try again. Almost always it has taken a couple of iterations to get right, but in the long run it has always been worth it.

If you look carefully, you'll see this technique in all sorts of interesting places, but because it rounds out the code so nicely it is easy to miss. Still, if you really want to write something sophisticated, it is an indispensable part of your toolkit.

Sometimes branches are just unavoidable, but that doesn't mean they have to gum up the code. Like limiting the scope of variables, you can push a branch down to the absolute lowest point in the code, so that it wraps only the minimum set of instructions that actually differ. Doing that is often enough to encapsulate it away from the rest of the complexity. Another strong way to handle stubborn branches is to move them to the earliest part of the code, where the data comes in. It has always been a good idea to convert and verify data only 'once', upon entry from an external source, thus avoiding needless fiddling and redundant checks. This is also a great place to add variables that avoid branching. Placed there, the branch can set up a variable that flows cleanly through the rest of the main logic.
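A small sketch of hoisting a stubborn branch to the point of entry (the legacy-record scenario and field names are hypothetical):

```python
def load_record(raw):
    # One branch at the boundary: legacy records, hypothetically, lack a
    # 'currency' field, so the default is decided once, on entry.
    return {
        "amount": float(raw["amount"]),
        "currency": raw.get("currency", "USD"),
    }

def format_amount(record):
    # No branch needed here: every record reaching this point is complete.
    return f'{record["amount"]:.2f} {record["currency"]}'
```

All downstream code can then assume the data is fully normalized, which is exactly what keeps the main logic branch-free.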

Most special cases are unnecessary. There are a rare few that are completely unavoidable, but long before accepting any new special case into that category, all effort should be made to force-fit it into the mainstream properly. Special cases match the way people think, and they get created quite naturally by various domains, programmers and the analysis process. That is to be expected, so it is up to the system's designers and its programmers to merge them back into as few cases as possible. Although it's easy to just whack out more special cases in a code base, doing so is one of the fastest and most dangerous forms of technical debt. Intrinsically, each one is a bump up in disorganization. It's also mindless, and in software, anytime something isn't thought about explicitly, it usually manifests itself as a problem. We got into software development because we like to solve problems by thinking deeply about them; oddly, trying not to think deeply about them runs counter to what most people claim to enjoy.

Thursday, April 16, 2015

Shades of Grey

Every profession changes its practitioners. How could it not? They'll spend a big part of their life approaching problems from a given angle, applying knowledge built up by others with that same perspective. Prolonged immersion in such an environment gradually colours their views of the world.

Programmers encounter this frequently. Coding is all about engaging in a dialogue with rigid machines that don't tolerate ambiguity or understand intentions. They do exactly, and precisely, what we tell them to do. There is no greyness, no magic; the machines only follow their instructions as given, and nothing else (if we ignore issues like faulty hardware and viruses).

This constant rigidity leads us to frame our instructions as being 'right' or 'wrong'. That is, if the code compiles and runs then it is right, else it is wrong. There is nothing in between.

Gradually this invades how we see things. We quickly go from the idea that the instructions are right, to the idea that we've built the right thing for the users, and then to talk of the right ways to design systems and run projects. But for many, it doesn't stop there. It continues to propagate outwards, affecting the rest of their views of the world, and the way they interact with it.

A long, long time ago, when we were first learning the basics of object orientation, one of my friends declared: "There can only be 'one' right object oriented design. Everything else is wrong." I giggled a bit when I heard this, but then I went on a rather prolonged ramble about the greyness of trade-offs.

I explained that known issues like space-time trade-offs meant that there were actually a great number of different, yet equally valid designs, all of which should be considered 'right'. That some of those trade-offs might be constrained by known user or operational requirements, but that there were many more that differed only because of 'free' parameters.

My friend was sceptical at first, but he came around when I began talking about different 'sorting' algorithms, ones with very different computational complexities in the best, average and worst categories. They all sorted, quite efficiently, but the whole collection of behaviours varied considerably. Optimizing for any one aspect of a complex system always offsets some other aspect. That is, one is traded off for the other.

The simplest example of this is 2D vectors on a Cartesian plot. If magnitude is the only constraint, then there are an infinite number of vectors that are all basically identical. It doesn't matter where they are on the plot, or at what angle. A single constraint in n variables maps lots of stuff onto each other.
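The vector example can be made concrete with a few lines (a sketch, using polar coordinates):

```python
import math

def vector(magnitude, angle):
    # Every angle yields a different vector satisfying the same
    # single constraint on magnitude -- infinitely many 'right' answers.
    return (magnitude * math.cos(angle), magnitude * math.sin(angle))
```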

Getting back to software: it's discrete, so there is actually a fixed set of 'good' object oriented designs, but it is a really huge set. Outside of that set there is a much larger one of all the less-than-correct designs. But that still doesn't make any one specific design 'right', it just makes it one of the many 'appropriate' solutions. "Right" is definitely the wrong word to use in this context.

When I started programming, the big decision we had to make was whether we were 'vi' or 'emacs' people. Both editors required significant effort to learn to use correctly. Emacs was seen as the grander solution, but vi was pretty much available everywhere and consumed a lot fewer resources. On the early Internet, wars raged between the proponents of both, each pounding on their keyboards about the rightness of their view. That of course is madness; editors are just tools, and the different varieties suit different people, but almost by definition there is no one perfect tool.

There is no 'right' editor. There are certainly some crappy editors out there, but they are invalid only because they are plagued with bugs, or are missing core functionality. They are deficient, rather than 'wrong'. If you can do the basics of editing text, then an editor is functional.

In that same sense, we can talk about lines of code being right, in that they actually compile, and we can talk about compiled code being right in the sense that it actually runs, and perhaps we could include 'proofs of correctness' to show that the code does what we think it does, but beyond that it quickly becomes grey. There is no right program for the users. There is no right architecture. No right way to build systems. As we get farther away from the code, the shades of grey intensify. Right and wrong quickly become meaningless distinctions.

A quick perusal of the web shows that many of the discussions about computers out there are predicated on this mistake. People argue in general terms about how one end of a trade-off is so much better than the other. But it doesn't make sense, and it siphons energy away from more meaningful discussions.

A black-and-white perspective comes with the early stages of learning to program. It is very easy to let those views cloud everything else. I fell into this trap for a long time. Gradually, I have been trying to correct my views. Right and wrong are far too rigid to be of use in making sense of the world. They certainly don't help us understand how to interact with people, or how to come together and build sophisticated computer programs. They don't make life any easier, nor do they make anyone happy. Their sole purpose is to help down at the coding and debugging level, and once you've learned their place, you have to learn to leave them where they belong.

The safest thing to do is to not use the words 'right' or 'wrong' at all. They tend to lead towards missing the essential nature of the underlying trade-offs. It's far better to come at complex systems and environments with an objective viewpoint. Seen that way, any specific choice might have better qualities in some respects, at the expense of known or unknown side-effects.

Another easy example is with computer languages. Some languages make doing a task faster and more understandable, but all of them are essentially equivalent in that they are Turing-complete. For a given problem, it might be more naturally expressed in Lisp, but it also might be way easier to find Java programmers that could continue to work on the project. The time saved in one regard, could be quite costly in another.

More to the point, the underlying paradigms available as primitives in Lisp can be reconstructed in Java, so that the upper-level semantics are similar. That is, you don't have to do lispy things only in Lisp; they are transportable to any other programming language, but you have to take one step backwards from the language in order to see that. And you can't be objective if you are dead-set against the inherent wrongness of something.
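As one tiny example of this transportability, a classically 'lispy' idiom -- functions as values, combined by composition -- carries over directly into Python (or Java, via lambdas), with nothing about it specific to Lisp itself:

```python
def compose(f, g):
    # Build a new function from two others, a staple of functional style.
    return lambda x: f(g(x))

def inc(x):
    return x + 1

def double(x):
    return x * 2

inc_then_double = compose(double, inc)
```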

Putting that all together, although at the lowest levels we need to view things in a black and white context, at the higher levels labelling everything as right and wrong is debilitating. If you're discussing some complex aspect of software development and you use the word 'right', you're probably doing it wrong.

Sunday, March 29, 2015

Types of Data

The primary goal of any software system is to collect data.

Some of that data represents entities in the real world, it is raw information that is directly captured from the domain. Most systems store this raw data as simply as possible, although it usually gets re-keyed to align with the rest of the system, and sometimes the structure is rearranged. Usually it's preserved in its rawest format, so that it is easier to double check its correctness, and update it later if there are corrections.

Raw data accounts for the bulk of most data in the system, and it is usually at the core of allowing the functionality to really solve problems for the user. In most domains, it comes as a finite set of major entities whose structure is driven by the real world collection techniques involved in getting it.

Raw data has breadth or depth, and occasionally both. By breadth, I mean that there are a large number of underlying parts to it, such that if the data were stored in a normalized schema, there would be a huge number of different, interconnected tables. Healthcare is an example of a domain that has considerable breadth, given that there are a lot of different types of diseases, diagnostics and treatments.

By depth, I mean that the same entities are repeated over and over again, en masse. Intraday stock data is an example of depth. On an entity-by-entity basis, breadth and depth usually don't exist together, although some domains are really combinations of the two. Bonds, for example, can have a large number of different flavors, and they also have lots of different intraday valuations. But the master data is always a different entity than the daily quotes.

Depth tends to be static with regard to structure. Some domains have entities of static breadth, but more often than not the breadth is dynamic. It changes on a regular basis, which is often the reason it has so much substructure in the first place. Dynamic breadth can be contained via abstraction, but it becomes tricky for most people to understand and visualize the abstract relationships, so it isn't common practice. Instead there is continual scope creep.

Since computers can conveniently augment the original data, systems usually have a fair amount of derived data. Sometimes this data is calculated by extending the relationships, sometimes it is just basic summary information. Derived data is more often calculated on-the-fly, usually to avoid staleness and inconsistency. Some derived data is composed entirely of calculations using raw data. Some derived data is built on other derived values. Visualizing the dependencies as a directed graph, it is always true that all terminal nodes are raw data. That is, what goes into the calculations must come from somewhere.
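That dependency structure can be sketched in a few lines (the pricing fields are invented for illustration): derived values may build on other derived values, but the chain always bottoms out at raw data.

```python
# Raw data: captured directly, never computed.
raw = {"price": 10.0, "qty": 3}

def subtotal(d):
    # Derived directly from raw data -- a terminal-node calculation.
    return d["price"] * d["qty"]

def total(d, tax_rate=0.13):
    # Derived from another derived value; the graph still ends at raw data.
    return subtotal(d) * (1 + tax_rate)
```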

The computations used are often called 'business logic', but some of the more esoteric ones really encompass the irrationality of historic conventions. Derived data is most often fast to calculate, but there are plenty of domains where it can take hours, or even perhaps days. Those types of big calculations are usually stored in the persistence technology, and as such are sometimes confused with raw data.

Raw data is often self-explanatory in terms of its structure as stored persistently. Derived data, however, usually requires deeper explanations, and in most systems it is the predominant source of programming bugs. Not only do the programmers need to understand how it is constructed, but the users do as well. As a consequence, most derived data calculations are either reasonably straight-forward, or they are created from a component built and maintained by external domain experts who are authoritative references.

In well-organized designs, the derived data calculations are all collected together in some consistent subsystem. This makes it easier to find and correct problems. If the calculations are scattered throughout the code base, then inconsistencies are common and testing is very expensive.

The underlying calculations can and should be implemented in a stateless manner. That is, all of the required data should be explicitly passed into the calculations. In some domains, since the underlying data triggers branches, it is not uncommon to see the code inverted. As the calculation progresses, the branches cause auxiliary data to be gathered. Thus if the data is type A, then table A is loaded, and if it is B, then table B is loaded. This type of logic is easy to build up, but it complicates the control flow and error handling in the code, and it also diminishes reusability.

It takes a bit of thinking, but virtually all derived calculations can be refactored to have the necessary data passed in as the initial inputs. In that way, all valid rows from both table A and B are input, no matter what the underlying type. It is also true that the performance of coding it this way will be better: in a big system, the branching won't gradually degenerate into too many sub-requests. Given this, derived data should always be implemented as a set of straight-forward functions that are stateless and entirely independent from the rest of the system. There should be no side-effects, no global access and no need for external interactions. Just a black-box engine that takes data and returns derived results. Getting derived data fully encapsulated makes it really easy to extend, and to deploy in a wide variety of circumstances. In the rare circumstance where this is not possible for some tricky special cases, the rest of the derived data in the system should still be implemented correctly (one small special case does not excuse ignoring best practices).
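A minimal sketch of such a black-box derived calculation (the portfolio example and names are hypothetical): everything it needs arrives as arguments, and nothing else is touched.

```python
def portfolio_value(positions, prices):
    # Every input arrives as an argument; the function never reaches out
    # to a database, global, or session. A pure black box over its inputs.
    return sum(qty * prices[sym] for sym, qty in positions.items())
```

Because it is stateless, it can be tested in isolation, cached, parallelized, or moved between systems without dragging any infrastructure along.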

The third type of data in a system is presentation data, but given the above definitions of raw and derived, initially it might not seem like there is a lot of data in this category. However, it is surprisingly large. Presentation is primarily about how we dress up data for the screen, but it also implicitly includes the navigational information about how the user got to the screen in the first place. Thus we have decorative data, like font choice, colors, etc., but we also have things like session ids, the current primary data key and any of the options that the user has set or customized. In most professional systems that actually includes a lot of user, group and system preferences, as well as authentication and authorization. Basically it's anything that isn't raw, or derived from the raw data, which in modern systems is considerable.

Presentation data has often confused programmers, particularly with respect to patterns like model-view-controller (MVC). There is a tendency to only think of the model as the raw data, then implement the derived data as methods, and leave out the presentation data altogether. If done this way, the construction generally degenerates into a mess. The model should be the model for all of the data in the whole system, and act as a 1:1 mapping between main entities and objects. Thus the model would contain sets of objects for all three types of data, which would include, for example, an object that encapsulates the current navigational information for a given user session. It would also include any fiddly little switches on the screen that the user has selected.

In most cases, the objects in the model would be nouns and their interactions would be equivalent to the way that the user describes what is happening at an interface level. The model, in this sense, is essentially object wrappers for any and all structural data within the system, at any time. In most cases, one can stop at the main entity level or one below, not all substructural data needs to be explicitly encapsulated with objects, but the depth should be consistent. Creating the full model for the system means that it is well-defined where to put code like validation and decoration. It also means that there is another 1:1 mapping between the widgets in the UI and the model. This again keeps the code organized and clearly defines only one place for any new logic.
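A rough sketch of a model covering all three types of data (every class and field name here is invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    # Raw data: captured directly from the domain.
    amount: float
    tax_rate: float

def invoice_total(inv):
    # Derived data: computed on the fly from raw values.
    return round(inv.amount * (1 + inv.tax_rate), 2)

@dataclass
class Session:
    # Presentation data: navigation state and user-selected options.
    user: str
    current_invoice: int = 0
    sort_order: str = "date"
```

The point is simply that the session and its switches are first-class parts of the model, not an afterthought bolted onto the view.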

An organized system is one where there is only one valid location for any of the code, and a well constructed system is where everything is exactly where it should be.

There are really only three types of data in a system: raw, derived and presentation. These percolate throughout the code. Most code in large systems is about moving the three types of data around between the interface and the persistence technology. If the three types are well-understood, then very straight-forward organizational schemes can be applied to keep them encapsulated and consistent. These schemes considerably reduce the amount of code needed, saving time and reducing testing and bugs. Most programs don't have to be as complicated as they have become; the extra complexity not only makes them hard to extend, it also decreases their performance. If we step back from the code, we are able to see these similarities and leverage them with simple abstractions and consistent rules. Minimal code is directly tied to strong organization; spaghetti code is always larger than necessary and usually disorganized on multiple different levels. If the data management is clean, the rest of the system falls cleanly into place.

Saturday, March 7, 2015

The Data

A very long time ago, I bought my very first -- brand spanking new -- computer. It was an XT clone with a turbo button. I joyfully set it up on the desk in my bedroom; I was very excited.

My grandmother came to look at my new toy. "What does it do?" she asked. I went on a rather long explanation about the various pieces of software I had already acquired: games, editors, etc.

"Sure, but what does it do?" she asked again.

At the time, I was barely able to afford the machine and the monitor, so there was no printer attached nor a modem and this was long before the days of the Internet. I tried to explain the software again, but she waved her hand and said "so it doesn't really do anything, does it?"

I was somewhat at a loss for words. It 'computed', and I could use that power to play games or make documents, but really, all on its own, no, it didn't actually do anything. It didn't print out documents, it didn't help with chores, and at the time it wasn't going to make me any money or improve my life in any way. She had a really valid point. Also, it was very expensive, having drained a significant amount of cash out of my bank account. My bubble had been burst.

I never forgot that conversation because it was so rooted in a full understanding of what a general purpose computer really is. It 'computes' and on its own, that is all it really does. Just hums along quietly manipulating internal symbols and storing the results in memory or on a disk. For long, tedious computations it is quite a useful time saver, but computation, all by itself, doesn't have any direct impact on our world.

Then the mid-90s unexpectedly tossed my profession out of the shadows. Before the Internet explosion, if I explained what I did for a living, people would give me pitying looks. After, when computers leaped into the mainstream of everyone's lives, they seemed genuinely interested. It was quite the U-turn.

Still, for all of the hype, computers didn't fully emerge on the landscape until social networking came along. Before that, for most people they were mere playthings stored in an unused bedroom and fiddled with occasionally for fun. What changed so radically?

What people see when they go online is certainly not the code that is running in the background. And they really only see the interface if it is hideous or awkward. What they see is the data. The stuff from their friends, the events of the world, things for sale and the stories and documents that people have put together to entertain or explain. It's the data that drives it all. It's the data that has value. It's the data that draws them back, over and over again. What computers do these days is collect massive amounts of data and display it in a large variety of different ways. You can search, gain access, then explore how it all relates back to itself.

The best loved and most useful programs are the ones that provide seamless navigation through large sets of structured data. They allow a user to see all of what's collected, and to explore its relationships. They often augment these views with other 'computed' data derived from the raw underlying stuff, but mostly that's just decoration. Raw data is what drives most sites, applications, etc.

Given the prominence of data within our modern applications, it is somewhat surprising that most programmers still see programming as assembling 'lists' of instructions. That's probably because the first skills they learn to cope with are branches and looping. That gradually leads them to algorithms, and by the time they are introduced to data-structures they are somewhat overwhelmed by it all.

That all too common path through the knowledge leaves most people hyper-focused on building up ever more code, more lists of instructions, even if that's only a secondary aspect of the software.

A code-specific perspective of programming also introduces a secondary problem. That is, variables are seen as independent things. One creates and manipulates a number of different variables to hold the data. They may be 'related', but they are still treated as independent entities.

In sharp contrast, the user's perspective of a program is completely different. What they see is the system building up ever more complicated structures of closely related data that they can navigate around. Variables aren't independent, rather they're just the lowest rung that holds a small part of the complex structure. The value comes from the structure, not the underlying pieces themselves. That is, a comment on a blog includes all of the text, the user name and link, the date and time when it was published and possibly some ranking. These are not independent things, they all exist together as one unit. But we also need the whole stream of comments, plus the original post, to fully understand what was said. The value of a comment isn't a single block of text, it is the whole complex data-structure surrounding a particular comment on a blog entry somewhere. Individual data points are useless if they are not placed in their appropriate context.
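The comment-in-context idea can be sketched in a few lines of code. This is only an illustration; the types and field names are hypothetical, not taken from any real blog system:

```python
# A sketch of the idea that a comment only carries its full meaning
# inside the larger structure that contains it.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Comment:
    user: str
    text: str
    published: str   # ISO timestamp
    ranking: int = 0

@dataclass
class BlogPost:
    title: str
    body: str
    comments: List[Comment] = field(default_factory=list)

post = BlogPost("On Data", "Code is just a means to access data.")
post.comments.append(Comment("alice", "Well put.", "2015-04-25T10:00:00"))
post.comments.append(Comment("bob", "Agreed.", "2015-04-25T11:30:00"))

# The 'value' of alice's comment is only clear with the post and the
# surrounding thread of comments attached to it.
```

The individual strings are nearly worthless on their own; it is the nested structure that the user actually cares about.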

There may be a huge amount of underlying code involved in getting a blog entry and its comments out onto a web page, but from the functionality standpoint it is far simpler to just think of it as a means to shift complex data-structures from somewhere in persistent storage to a visible presentation. And with that data-flow perspective, we can quickly get a solid understanding of any large complex system, particularly if in the end it only really displays a small number of different entities. The data may flow around in a variety of different ways, but most of that code is really just there to convert it into alternative structures as it winds its way through the system. The minimal inherent complexity of a system comes directly from the breadth of the complex data-structures that it supports.

Obviously, if you wanted to heavily optimize a big system, reducing the number of variables and intermediate representations computed along the main access pathways would result in significant bloat reduction. And not quite as obvious, it would also make the code way more readable, in that the programmers would not have to understand or memorize nearly as many different formats or manipulations while working on it. The most efficient code reduces the effort required to move data around, and also reuses any intermediate calculations as appropriate.

That in its own way leads to a very clear definition of elegance. We will always have to live with the awkwardness of any of the language constructs used in the code, but looking past those, a really elegant program is one that is organized so that the absolute minimal effort is used to move the data from persistence to presentation, and back again. In that sense if you have a large number of complex data entities, they are slurped up from persistence to an internal 'model' that encapsulates their structure. From there they are transported, as is, to their destination, where they are dressed up for display. If in between there isn't any fiddling or weird logic, then the work has been minimized. If the naming convention is simple, readable and consistent, and any algorithmic enhancements (derived data from the raw stuff) are all collected together, then the code is masterfully done. Not only is it elegant, but in being so, it is also fast and very easy to enhance if you continue to follow its organization.

Viewing a large system as huge lists of instructions is complicated. Viewing it as various data-structures flowing between different end-points is not. In fact it leads to being able to encapsulate the different parts of the system away from each other very naturally. We can see, for instance, that reading and writing data to persistent storage stands on its own. Internally we know that we can't keep all of the data from the database in memory at once, so setting how much is necessary for efficiency is important. We can split off the actual presentation of any data from its own inherent categorization. We can also clearly distinguish between raw data and the stuff that needs to be derived. This allows us to isolate larger primitives, lego blocks that can be reused throughout the code, simply because they are directly involved with manipulating larger and larger, more complex structures. As we build up the structures, we pair them with suitable presentation mappings. And most of this can be driven dynamically.

In a very real sense, the architecture and the bottom up construction of the fundamental blocks comes directly from seeing the system as nothing more than a flow of data. That this also matches the way the users see and interact with the system is an added bonus.

Data really is the defining fundamental thing in building any system. The analysis and design start out with finding the data that is necessary to solve a specific problem. Persistence is the foundation for all of the work built on top, since it sets the structure of the data. The architecture is rooted in organizing all of the moving parts for the data-flow. The code, inspired both from the history of ADTs and Object Oriented, can be encapsulated alongside its specific data, and any of the stuff that does not fit elsewhere, like derived data, can be collected together so that it is easily found and understood. Data is the anchor for all of the other parts of development.

Data drives computers, without it they are just general purpose tools with no specific purpose. In so many ways my Grandmother was correct, but what she and I both failed to see all those decades ago was that as soon as we started filling up these machines with enough relevant data, they did suddenly become very, very useful. Code is nice, but it's not really what's important in software. It's just the means to access the data. And for our young industry, it's definitely the case that we have only just begun to learn how to construct systems; once we've mastered complexity and can manipulate large amounts of data reliably, computers will transform our societies in all sorts of ways that are currently unimaginable.

Sunday, February 15, 2015

Static vs. Dynamic

Possibly the most significant movement in programming has been to avoid 'hardcoding' values. Sizes, limits and text should not be directly encoded into the code; doing so would require the system to be rebuilt and redeployed each time they needed to be changed. That causes long delays in being able to adapt to unexpected changes.

A better approach would be to stick any values into a configuration file, so that they can be easily changed if the need arose. This allows the static values to be managed independently of the code, but could still be somewhat painful on occasion because it requires manual intervention in an operational environment.

The best approach is to make the program dynamic, so that it no longer needs the values and can adapt to any changes. A simple example is with user preferences. It's easy to lock in a fixed number in the code so each user can change no more than, say, 10 settings, but 10 is a completely arbitrary number. Computers really like fixed finite data, but users don't. It really isn't all that complicated to allow any user to have as many custom preferences as they desire. Internally one can shift from a fixed size array to a linked list. Most persistent solutions don't have fixed limitations, so there aren't real problems there. Searching could become an issue, but applying proper data structures like a hash table can keep the performance in line with growth.
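The preferences example can be sketched directly. In Python a dict is a hash table, so swapping the fixed array for one removes the arbitrary cap while keeping lookups fast; the function and user names here are purely illustrative:

```python
# Instead of a fixed array of 10 slots, a dict (a hash table) lets
# each user hold any number of settings with fast keyed lookup.
preferences = {}  # user -> {setting_name: value}

def set_preference(user, name, value):
    preferences.setdefault(user, {})[name] = value

def get_preference(user, name, default=None):
    return preferences.get(user, {}).get(name, default)

set_preference("carol", "theme", "dark")
set_preference("carol", "font_size", 14)
# No arbitrary limit: the structure simply grows with the user's needs.
```

The code that reads a preference never needs to know how many settings exist, which is exactly the point.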

The fact that the size of the data is now variable may make resource allocation slightly more complex, but realistically if we understand why users have individual preferences then we can estimate an average number they need and, along with the expected number of users, we can get a reasonable guess on the overall size. Of course, these days with disk space so cheap and such tiny data we no longer need to spend much time thinking about this type of usage.
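The back-of-envelope estimate looks like this. All of the numbers below are assumptions picked purely for illustration:

```python
# Rough sizing under assumed figures: each user averages 20
# preferences, each around 64 bytes of key plus value, with an
# expected population of 100,000 users.
avg_prefs = 20
bytes_per_pref = 64
users = 100_000

total_bytes = avg_prefs * bytes_per_pref * users
megabytes = total_bytes / (1024 ** 2)
# Roughly 122 MB overall -- trivial against modern disk capacities.
```

Even generous estimates land in territory that modern storage makes irrelevant, which is the author's point.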

Driving things dynamically can be taken to a level far more significant than just avoiding hardcoded values. Dynamic behaviour in code is really internally driven variability. That is, any of the underlying attributes -- both data and code -- can vary based on the current state of the program. Compilers are a good example of this. They can take any large set of instructions in a programming language and use that data to create equivalent machine code that will execute properly on a computer. They can even examine the code macroscopically and 'optimize' parts of it into similar code that executes faster. Internally, they dynamically build up this code based on their input, often creating some form of intermediate representation first, before constructing the final output. Being able to do that gave us the ability to construct programs in languages that were far more convenient than machine code, thus saving a stunning amount of time and allowing us to build bigger, more complex systems.

Being able to build up complex logic on the fly is a neat trick, but there are still more interesting ways to make code dynamic. In the case of compilers, beyond dynamically creating a structure, the compiler itself doesn't actually know what the program it is creating is going to do when it runs. This 'hands off' approach to managing the data is an important side-effect of dynamic behavior. As we generalize, the specifics start to fade. The code understands less of what it is manipulating, but making this tradeoff allows it to be more usable. It wouldn't make sense to write a special compiler that can only compile one specific program. That's too much work for too little value.

We can take this need-to-know approach and apply it to other programs. For instance, if we have a client/server architecture, there might be hundreds of different unique types of data structures moving back and forth between the different tiers. Explicitly coding each message type is a lot of work; it would be far better to find a dynamic and cost effective way of moving any data about. For this we could utilize a data structure format like JSON that would encode that data into 'containers'. Doing so would allow any communications code to have almost no knowledge of what's inside the container, cutting down on a huge amount of code. Instead of one chunk of code for each structure, we just have a single chunk that handles all of the structures.
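A minimal sketch of the container idea, using Python's standard json module: the transport functions serialize whatever structure they are handed without any knowledge of the message types, so one generic path replaces a handler per type. The function names are hypothetical:

```python
# Generic transport: the code knows nothing about what's inside
# the container, only how to move it across the wire.
import json

def send(structure):
    # One serializer for every message type in the system.
    return json.dumps(structure)

def receive(wire_text):
    # One parser back into a generic nested structure.
    return json.loads(wire_text)

msg = {"type": "comment", "user": "alice", "text": "Well put."}
roundtrip = receive(send(msg))
# The structure survives the trip intact, and neither send() nor
# receive() ever needed a 'comment' message handler.
```

Adding a hundredth message type to such a system costs no new communications code at all.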

We can go farther by using technologies like introspection to dynamically create containers out of any other data-structure in the system, and we could apply a looser typing paradigm at the other end to allow parsing the container back into a generic structure. Creating a structure of explicitly declared variables is common, but we could push those compile-time variables into runtime keys attached to the same values. If the values can also include other containers recursively, then the whole dynamic structure has the full expressibility of any other static data-structure created manually. This not only drives the values dynamically, but also their combined structures as well. The in between code can move and manipulate the data, but may still not know what it is or why it's structured in a particular way.
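The introspection step can be sketched as a small recursive walk: compile-time attributes become runtime keys, and nested objects become nested containers with the same expressibility as the original structure. The helper and class names are invented for illustration:

```python
# Turn any plain object graph into nested key/value containers.
def to_container(obj):
    if hasattr(obj, "__dict__"):              # object: attributes become keys
        return {k: to_container(v) for k, v in vars(obj).items()}
    if isinstance(obj, (list, tuple)):        # sequences recurse element-wise
        return [to_container(v) for v in obj]
    return obj                                # primitives pass through as-is

class Comment:
    def __init__(self, user, text):
        self.user = user
        self.text = text

class Post:
    def __init__(self, title, comments):
        self.title = title
        self.comments = comments

container = to_container(Post("On Data", [Comment("alice", "Well put.")]))
# container is now plain dicts and lists: the declared variables have
# become runtime keys, with the full structure preserved recursively.
```

The code in between never needs the Post or Comment classes; it just carries the generic container.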

This type of dynamic behavior can really amplify the functionality of a small amount of code. At some point, but only briefly, the specifics have to come into play. Minimizing the size of that explicit code can save a huge amount of work, allow greater flexibility and actually make the system more robust.

Getting to this level of dynamic code can go right out to the interface as well. Widgets, for example, don't really care about the data they are handling. They have to validate based on a generic type, a domain table or another widget, but other than that it is just some data typed in by a user. If we attached the generic communications data to a dynamic arrangement of widgets, all we need to do is bind by some key, which could easily be from a key/value pair. In that way, we could throw up dynamic forms, filling them with loosely attached dynamic data, and providing some really convenient means of flagging unmatched keys. We could apply the same trick on the backend for popular persistent technologies like key/value databases. Using an ORM and some long skinny tables, we could also persist the data into a relational database. The whole system can be dynamic end-to-end; we could even wire it so that the flows from screen to screen are driven dynamically, thus creating a full dynamic workflow system.
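The key-based binding might look something like the sketch below: a form is generated from widget descriptions, filled from a loosely attached container, and leftover keys are flagged instead of causing errors. Everything here, from the spec format to the field names, is an assumption for illustration:

```python
# Bind a generic data container to a dynamic form by matching keys.
form_spec = [
    {"key": "user", "label": "User name", "widget": "text"},
    {"key": "text", "label": "Comment",   "widget": "textarea"},
]

def bind(spec, data):
    bound, unmatched = [], set(data)
    for widget in spec:
        value = data.get(widget["key"], "")   # missing keys get a blank
        unmatched.discard(widget["key"])
        bound.append({**widget, "value": value})
    return bound, unmatched                   # leftovers can be flagged

fields, leftovers = bind(form_spec, {"user": "alice", "text": "Hi", "extra": 1})
# 'extra' had no matching widget, so it surfaces for review rather
# than silently disappearing or crashing the form.
```

Neither the form code nor the binding code ever knows it is displaying a comment; the spec and the container carry all of the specifics.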

The caveat is that there are limits to how dynamic a system can be. If you go too far you get back to nothing more than a general purpose computer. Still, even at that full 'Turing Machine' level we can get value by implementing a domain specific language (DSL) with primitives and syntax that are tailored to allow users to easily construct logic fragments that precisely match their own quickly changing domain problems. The challenge is to create a language specifically for the users to easily understand. That is, it speaks to them as natively as is possible given the underlying formalities. If they feel comfortable reading and writing it then they can craft fragments that bind dynamic behavior to their own specifics. That can be an immensely powerful way to empower the users without having to get bogged down in statically analysing every tiny detail in their domain. You just push the problem back onto the experts by creating tools that are dynamic enough to allow them to fulfil their own needs. What's in between solves the technical problems, but all of the domain ones are driven dynamically.
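To make the DSL idea concrete, here is a toy rule language a domain expert might plausibly read and edit, interpreted by generic code that knows nothing about the specific rules. The syntax and engine are entirely hypothetical, just a sketch of the division of labour:

```python
# A tiny rule engine: the code stays fixed while the experts
# change only the rules, e.g. "ranking >= 3".
def evaluate(rule, record):
    field, op, value = rule.split()          # "<field> <op> <value>"
    left, right = record[field], int(value)
    results = {
        ">=": left >= right, "<=": left <= right,
        ">":  left > right,  "<":  left < right,
        "==": left == right,
    }
    return results[op]

record = {"ranking": 4}
keep = evaluate("ranking >= 3", record)
# The engine never hardcodes what 'ranking' means or why 3 matters;
# that domain knowledge lives entirely in the rule text.
```

Changing the business behaviour becomes an edit to a rule string rather than a rebuild of the system, which is exactly the hardcoding problem this post opened with.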

What's so powerful about dynamic behavior is that if you have good encapsulated dynamic code you can find all sorts of ways to reuse it. We traditionally do this in very limited circumstances with common libraries for each language. Those always have a limited understanding of the data they are manipulating, which is why they are so usable by so many people. Scaling up these types of abstract coding practices to the rest of the system is the essential ingredient for creating reusable 'lego blocks'. The blocks can be slightly more static than what's in a common library, but so long as they are not completely hardcoded and some thought is given to their many usages, they can be deployed for handling dozens of problems, even ones that are not currently on the foreseeable horizon.

The slow trend in software is gradually towards making our code more dynamic. It happens regularly in technologies like languages, operating systems and databases, but it can also be applied with great success to other places like application software. Paradigms like Object Oriented Design were intended to help programmers make more use of dynamic code, but often those goals were not well shared within the programmer communities. Objects as just a means to slice and dice code makes little sense unless you see it as a way to help simplify creating dynamic code. As such, bloated static hardcoded real-world objects are really going against the grain of the original paradigm, which would prefer fully reusable encapsulated abstract objects as the best way to fully leverage the technology and create better software.

Dynamic code is an exceptionally powerful means of building better software, but in the rush to code this is often forgotten. We should really focus harder on teaching why this is so important to each new generation of programmers. Leveraging existing code is a whole lot better than just rewriting it over and over again. It's a lot faster and safer too.