Sunday, February 21, 2010

Data, Data, Data

In one of my early jobs, almost twenty years back, I worked on a team that spent years building a very sophisticated cache. It was a UNIX-based multi-process system that was built for speed and fault tolerance.

For each instance, it featured eight 'writer' processes combining their efforts to minimize the work of grabbing masses of data from a humongous, but slow database. Their results were restructured and then stored in a huge shared memory segment that was accessible by eight more 'reader' processes responsible for responding to data requests.

The code was written in straight simple ANSI C, and we had no access to any underlying libraries, so we had to write all of our own mechanics: memory management, resource management, locking, process monitoring, logging, etc. It took well over a year to get the initial design up and running, but we also built in a system-level automated testing component, so the code was extremely robust (it went for years in production over several releases, and had only one known bug).

Once we got the system up and running, I did the obvious thing and went back over it to see where we could eke out some better performance.

We had started with a full design, carefully segmented the project into pieces and then built each piece consistently and using very rigid standards. We were a small but dedicated team. We researched all our algorithms, and used the best practices for building 'systems level' code within the architecture. For the most part, the code was clean, easy to read, well-structured and very consistent. Still, even the best works always have some aspect that could have been better.

As I started to go through the code, I concentrated on the way data was percolating throughout the system.

For instance, a request would come in, move from the initial parsing into a structure that was then placed in a queue in shared memory. When available, a number of writers would start up, each with a minimized fragment of the query, and they would hit the database (it was highly parallel).

The responses from the database would then be reformatted, broken up if necessary, and the various sections of shared memory updated. When complete, the last writer would signal any appropriate waiting readers. From there the readers would grab their data, package it for transport and then send it out to a waiting client. A basic straight-forward multi-process read-only caching architecture.

As I dug, I realized that the requests and responses moving through the system were passing through a large number of different components. We had broken the system down into well-defined pieces to make its development and our task allocation easier to manage, but this breakdown had come with consequences. Each component, being written with an appropriate level of caution, carefully checked and copied its incoming data to ensure that it was valid and correct. Each time the cost was small, and the resulting code was more robust, allowing it to detect problems quickly, report them, abort and then force us to fix them while still in development.

The approach was solid, and certainly for the amount of code we wrote, being that consistent and tight with error handling had been a big reason why we were on time and on budget. But we were losing a considerable amount of time to this excessive copying and checking of data. In one case I remember seeing 6 or 7 memcpys on large buffers as the code moved from component to component. In each component, each memcpy seemed reasonable, but the sum of all of them was not optimal for performance.

Of course, we are taught to not worry about performance until later. So, from a development perspective, given our results the project was a raging success. Still, after my analysis, when I had finally cut down on all excessive copies and had moved much of the checking into 'debug mode', we picked up a 5% to 10% boost, which given our equipment and operational specs was necessary. At the time I couldn't help wondering if with some small changes to our development approach we might not have been able to avoid this problem right away.
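
The original system was ANSI C and its code is not shown here, so the following is only a minimal Python sketch of the same idea. The component names (parse_header, extract_payload) and the 4-byte length header are hypothetical. It hands a view of one buffer from component to component instead of copying it, and keeps the validity checks in assert statements, which Python drops when run with -O, roughly like moving them into a 'debug mode'.

```python
import struct

HEADER_SIZE = 4  # hypothetical layout: 4-byte big-endian payload length


def parse_header(buf: memoryview) -> int:
    """Read the payload length directly from the shared buffer."""
    # Checks live in asserts, so `python -O` removes them (a rough 'debug mode').
    assert len(buf) >= HEADER_SIZE, "buffer shorter than header"
    (length,) = struct.unpack_from(">I", buf, 0)
    return length


def extract_payload(buf: memoryview) -> memoryview:
    """Return a view onto the payload; slicing a memoryview does not copy."""
    length = parse_header(buf)
    assert len(buf) >= HEADER_SIZE + length, "truncated payload"
    return buf[HEADER_SIZE:HEADER_SIZE + length]


raw = bytearray(struct.pack(">I", 5) + b"hello" + b"trailing")
payload = extract_payload(memoryview(raw))

# The view shares memory with the original buffer: changing the buffer
# changes what the view sees, which shows that no copy was made.
raw[HEADER_SIZE] = ord("J")
print(bytes(payload))  # b'Jello'
```

The trade-off is the one described above: with the checks disabled, a bad buffer is no longer caught at each component boundary, so this only makes sense once the code has been proven in development.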


DATA-STRUCTURES

From my university days, I was taught the importance of data structures and how we should use them as the basic building blocks on which to form the rest of the system. This approach, sometimes referred to as abstract data types (ADTs), was the father of Object Oriented programming. Before OO, it was up to the programmer's discretion as to how to structure their code in an organized manner. The data-structure approach proved so powerful that the next evolution was for it to become embedded directly into our computer languages, and thus OO was born.

Still, the point of many of these earlier approaches was to help the programmers find the simplest and most effective means of encoding their systems to run on the computer.

If you think of a system as a large set of functionality, that overall perspective can be quite complex. It can also lead to programmers thinking that the best and fastest way to get new functionality is to copy existing functionality and edit it with the changes. That, it turns out, is the heart and soul of a very effective technique to create masses of really bad spaghetti code. It may seem simple and fast at first, but it accrues so much technical debt, so quickly, that it isn't long before it hopelessly swamps the project and ends in failure. Of course, the earlier programmers had lived through this, and figured out ways around it.

If instead of being hyper-focused on the code, the programmer considers the system as just the data moving around within it, the overall system drops astronomically in complexity. In this perspective, the code is secondary, and really only responsible for getting the data from point A to point B with a bit of reformatting along the way. Functionality isn't a way of manipulating data, rather it is what transforms it from some raw state into something more usable for the users.

The 'print' mode in a document editor, for instance, takes the data from the raw, internal cut-buffer data-structure and then pretties it up in a way that is convenient for a printer to output it. There are lots of different ways of making it pretty, so there are lots of little choices for the user. If you consider all of the 'functionality' available in the printing sub-system it is huge, but if you see it all as only some conversions between these two types of data-structures, it is really not that complex at all.

While Object Oriented was an attempt to enshrine these concepts directly into the languages, so that the programmers would have to utilize them, most development still fails to approach this correctly. Most programmers still seem to be steadfastly focused on the 'code' they are going to write, only thinking about the data later. Code, code, code. And that's why, quite quickly as the work progresses, they keep heading to bigger and bigger messes.


WHAT WORKS?

Oddly, one of the most successful technological implementations I've ever seen for getting the focus right was a COBOL derivative. I can't remember the name of the language/system, it was so long ago, but programming in it was simple.

You specified what you wanted from the database. You specified how that was changed as it went to the screen. You specified what each of the different keys on the screen did (if it wasn't default). And then you specified the validation and restructuring of the data as it went back into the database. Simple and straight-forward.
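
Since the original language is not named, the following is not its syntax. It is only a hypothetical Python sketch of the four-part shape just described: what to fetch, how it changes on the way to the screen, what the non-default keys do, and how it is validated and restructured on the way back. The ScreenSpec class, the query text and the field names are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ScreenSpec:
    """Hypothetical four-part screen definition, mirroring the steps above."""
    query: str                            # what you want from the database
    to_screen: Callable[[dict], dict]     # how a row changes on its way to the screen
    to_database: Callable[[dict], dict]   # validation and restructuring on the way back
    keys: Dict[str, str] = field(default_factory=dict)  # non-default key bindings


def format_for_screen(row: dict) -> dict:
    return {"Name": row["name"].title(), "Salary": f"{row['salary']:,}"}


def validate_for_database(form: dict) -> dict:
    salary = int(form["Salary"].replace(",", ""))
    if salary < 0:
        raise ValueError("salary must not be negative")
    return {"name": form["Name"].lower(), "salary": salary}


employee_screen = ScreenSpec(
    query="SELECT name, salary FROM employees",
    to_screen=format_for_screen,
    to_database=validate_for_database,
    keys={"F2": "save"},
)

shown = employee_screen.to_screen({"name": "ada lovelace", "salary": 52000})
stored = employee_screen.to_database(shown)
print(shown)   # {'Name': 'Ada Lovelace', 'Salary': '52,000'}
print(stored)  # {'name': 'ada lovelace', 'salary': 52000}
```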

In truth, I hated it because it was so orderly, simple and easy to do. I felt more like a typist than a computer programmer when I was building systems. Still, it and so many of those other older tools are no doubt the reason why so much of the world's crucial data relies on mainframe technologies and not some of these newer, cooler, hipper (but undependable) technologies. Too much freedom may seem fun initially; right up until you start missing deadlines and everyone is screaming.

In many ways you can see the results of the 'functionality' based approach right away in most big systems. All you need to do is understand the breadth of the underlying data, and compare it back to the interface.

So many products are disconnected masses of randomly placed functionality that have way more to do with laziness and history than with making it easily accessible to the users, or logically consistent. You quickly get lost in these stupid senseless mazes of drop-down menus, redundant screens and little click-able thingies that have gradually built up over time. Related functionality is split all over many different screens. It is organized (badly) by barely related functionality and the composition of programming teams, when it would have been far more accessible if it was organized by the way it outputs or manipulates its data.

For example, you'd like all of the related controls for manipulating the outputted printing structure to be grouped together by the final manipulations produced, not by how the programmers wanted to cut and paste the code. Not by which DLLs or libraries were installed. Not by which driver was active.

A trade-off must always be made between the convenience of the programmers and the convenience of the users. Most developers, these days, make the wrong choice. It is extremely rare to find well-thought-out interfaces. And in some cases, the first generation might have managed to do it well, but were quickly followed by a less gifted one.


A NEW PERSPECTIVE

There is a cure. Things are bad in programming these days, but it doesn't have to be this way. Certainly our elders learned their lessons and through ideas like Object Oriented they tried to pass them on to the newer generations. It failed, but it doesn't mean we can't redeem ourselves.

Mostly, the single largest and most successful change programmers can make is in their perspective. If they give in, and stop fighting against the underlying data-oriented concepts like objects, they can quickly learn to build faster, better code that isn't so hard to maintain. It comes simply from putting the data first.

The first thing software developers need to assemble is some kind of model of the data that will be in the system. Long before they start asking the users questions about what they are doing, they need to know what is in the underlying data, how frequent it is, its lifespan, its quality, and its structure. All the cool functionality in the world is no good unless it has the right data to work on. The first pillar of every system is whether or not it contains the necessary data.

And it's not only the user's data that is important. It is also the system admin's data and the operational data too. There is always more to a system than just the key functionality required, and so often all of these contributing parts get skipped over until later. So, it's the data in the persistent storage, but also the data in the config files and any other parameters that need to all be addressed in a similar and consistent manner.

The thing I learned most from my earlier example with the cache is that it is important to minimize the way the data flows through the system. Having access to the data is just a first step.

It is also important to minimize the parsing and re-assembling of the data in various different subsystems. The data coming into the system should be broken down into its primitive pieces as early as possible. And it should not be reassembled until as late as possible. Both of these rules of thumb help keep the data in its most usable state, and cut down on unnecessary manipulations.
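
A minimal sketch of these two rules of thumb, under stated assumptions: the 'id|A,B,C' wire format, the Request type and the price cache are all hypothetical. The raw input is broken into primitive pieces once at the boundary, the interior code only ever sees those pieces, and a string is reassembled only in the last step.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Request:
    """Primitive pieces of a request, parsed once at the system boundary."""
    client_id: int
    symbols: Tuple[str, ...]


def parse_request(raw: str) -> Request:
    """Break the wire format down immediately (hypothetical 'id|A,B,C' format)."""
    client_id, symbol_list = raw.split("|", 1)
    return Request(int(client_id), tuple(symbol_list.split(",")))


def lookup(request: Request, cache: dict) -> dict:
    """Interior code works on primitives; it never re-parses strings."""
    return {symbol: cache[symbol] for symbol in request.symbols if symbol in cache}


def format_response(request: Request, prices: dict) -> str:
    """Reassemble into a wire format only at the very last step."""
    body = ",".join(f"{symbol}={price}" for symbol, price in prices.items())
    return f"{request.client_id}|{body}"


cache = {"IBM": 127.5, "HPQ": 50.25}
request = parse_request("42|IBM,HPQ,XYZ")
response = format_response(request, lookup(request, cache))
print(response)  # 42|IBM=127.5,HPQ=50.25
```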

It should not be copied into different components. In fact, in an OO system, it should be wrapped initially with one and only one object. Different data objects might be related and grouped together at a higher level, but since each piece of data is immediately broken down into its most primitive components, it should exist in only one form in the system, in only one instance.

As the data flows through the code, similarities should be exploited. That is, if the system supports jpeg, png and gif images, while each may have its own object type if necessary, there should be an overall 'image' category object as well. The higher up you go, the more you should exploit polymorphism, and the less code is required.
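
As a rough illustration of the image example, here is a small sketch. The class names and the two methods are assumptions chosen for the example, not taken from any real imaging library. The shared behavior is written once on the 'image' category, and each format supplies only what differs.

```python
class Image:
    """The overall 'image' category: shared behavior is written once here."""

    extension = ""  # each format overrides only what differs

    def __init__(self, width: int, height: int) -> None:
        self.width = width
        self.height = height

    def thumbnail_size(self, max_side: int) -> tuple:
        """Scale to fit within max_side, preserving the aspect ratio."""
        scale = max_side / max(self.width, self.height)
        return (round(self.width * scale), round(self.height * scale))

    def filename(self, stem: str) -> str:
        return f"{stem}.{self.extension}"


class JpegImage(Image):
    extension = "jpg"


class PngImage(Image):
    extension = "png"


class GifImage(Image):
    extension = "gif"


# Code higher up handles every format through the common category.
images = [JpegImage(800, 600), PngImage(100, 400), GifImage(64, 64)]
for index, image in enumerate(images):
    print(image.filename(f"img{index}"), image.thumbnail_size(100))
```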

But with less code, there is also less debugging. That is, if you have four functions that each use their own code, you need to test four times. If you have four functions that all share the same basic code, testing one essentially tests the others. When you first write the code you should probably test all four, but later on, for small fixes and updates the 'impact' of the changes can often be assessed by only testing one. The long-run savings in 'impact' testing can be enormous if the architecture is well-structured.
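
To make the 'four functions, one core' point concrete, here is a small hypothetical sketch. The function names are invented for the example. All four public functions go through the same loop, so a fix to that loop is exercised by testing any one of them.

```python
def _summarize(values, combine, initial):
    """Shared core: fold the values with one combining function."""
    result = initial
    for value in values:
        result = combine(result, value)
    return result


def total(values):
    return _summarize(values, lambda acc, value: acc + value, 0)


def count(values):
    return _summarize(values, lambda acc, _value: acc + 1, 0)


def largest(values):
    return _summarize(values, max, float("-inf"))


def smallest(values):
    return _summarize(values, min, float("inf"))


data = [3, 1, 4, 1, 5]
print(total(data), count(data), largest(data), smallest(data))  # 14 5 5 1
```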

Once you have the data in the system, and it is managed by a minimal amount of common infrastructure you can start having fun by thinking about how to manipulate it.


FLEXIBILITY AND CHANGES

Because it helps in addressing their issues, users will generally require some redundant screens, and slightly inconsistent handling. It is the nature of people. We're messy and we need messy interfaces.

The trick is to confine those inconsistencies and duplications to the least amount of code possible. If you can contain it, minimize it and then make it easy to read and change, it becomes relatively simple to re-arrange the different aspects of the interface to suit the particularities of the users (it would be nicer if the users could do this themselves).

Interfaces that are locked in stone, or take massive amounts of similar work to create, tend to become broken very quickly. It was one of the few weaknesses with the COBOL systems I discussed above. There is way too much work going into too many inflexible screens. Changes -- which are constant and inevitable -- become problematic and costly. The accrued technical debt becomes an impossible weight, and development grinds to a halt.

Not surprisingly, Object Oriented concepts are well suited towards visual interfaces. This is where they really shine. Interfaces really are just ways of laying out and collecting various bits of data on a screen. The same ideas that apply to getting data from a database also apply to laying out data in an interface. They're just different angles on the same problem. If you map each and every thing you see on a screen to a unique object, then all screen definitions are just some methods for constructing larger structures.

Sure the frameworks and toolkits do this with panels and widgets, but most programmers stop there and then just start belting out long lines of ugly redundant construction code.

The idea is not to make one big bloated super-object for the entire application, but to continue the object composition upwards.

That is, there should be a type of object for everything that is visible on the screen, including the various sections containing underlying widgets and all of the different pieces. If someone points to it and refers to it as something, then that something should be an object of that type. Getting that visual-to-object map simple and consistent makes it far easier to construct or rearrange objects, and thus screens. It makes it easier to debug as well. It exploits the power of the paradigm.

This technique works best if you also map the visual containers (like lists, tables, sections etc) to generic objects, with minimal instance information. In this way you can build up screens from larger and larger consistent structures, and one change will fix all of the different instances on all of the different screens. Instead of fixing all of the screens with user information, you fix the user information object, and it is updated on all of the screens.
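
A toy sketch of that composition, rendering to plain text rather than to a real widget toolkit. The UserInfo and Section classes are hypothetical. The generic container carries almost no instance information, and the same user information object is placed on two different screens.

```python
class UserInfo:
    """One object for the 'user information' block, wherever it appears."""

    def __init__(self, name: str, email: str) -> None:
        self.name = name
        self.email = email

    def render(self) -> str:
        return f"{self.name} <{self.email}>"


class Section:
    """Generic visual container: a title plus any renderable children."""

    def __init__(self, title: str, *children) -> None:
        self.title = title
        self.children = children

    def render(self) -> str:
        lines = [f"[{self.title}]"]
        for child in self.children:
            # Indent whatever the child renders, so containers can nest.
            lines += ["  " + line for line in child.render().splitlines()]
        return "\n".join(lines)


user = UserInfo("Ada", "ada@example.com")
profile_screen = Section("Profile", user)
admin_screen = Section("Admin", Section("Selected user", user))

print(profile_screen.render())
print(admin_screen.render())
```

Changing UserInfo.render changes how the user appears on both screens, with no edits to the screen definitions themselves.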

As well as interfaces, most big systems will have some large specialized set of functionality that is not presentation related. That is, most systems contain some type of processing engine. Generally, these are well-defined enough that they can be fully isolated from the rest of the system. Engines can and should stand on their own.

Here is one of the few places where checking the input, at least in some type of debug mode, is a really good and valuable idea. When engines go wrong, their depth makes them difficult to diagnose, so a fail loud and fast policy tends to pay big dividends in diagnosing escaped bugs.

One interesting thing about engines is that underneath they are driven by their algorithms. In a stark contrast to the rest of the system, here is a place where the programmer should be far more concerned with the code than with the data.

The input to an engine is usually all of the necessary data, while the output is either a set of errors, or the finished data structures. There is no interface. Building an engine with a fine-grained object approach tends to lead towards distributing the behavior of the code all over the engine. But in this type of purely computational programming, a bigger object that is based around an algorithm, and contains a large number of well thought out, small primitive operations is generally a more readable approach, precisely because it keeps all of the related code in the same proximity.
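
One possible shape for such an engine, as a sketch only. The task-ordering problem and the OrderingEngine name are chosen purely for illustration, and the algorithm is a simple form of Kahn's topological sort. All input arrives up front, the input check sits in an assert (so it can be disabled with python -O), the small primitive operations live next to the algorithm that uses them, and the output is either the finished structure or a list of errors.

```python
class OrderingEngine:
    """Hypothetical engine: orders tasks so that dependencies come first."""

    def __init__(self, dependencies: dict) -> None:
        # Fail loud and fast on bad input; `python -O` strips this check.
        assert all(dep in dependencies
                   for deps in dependencies.values() for dep in deps), \
            "dependency on an unknown task"
        self.remaining = {task: set(deps) for task, deps in dependencies.items()}
        self.order = []

    # Small primitive operations, kept beside the algorithm that uses them.
    def _ready(self) -> list:
        return sorted(task for task, deps in self.remaining.items() if not deps)

    def _complete(self, task: str) -> None:
        del self.remaining[task]
        for deps in self.remaining.values():
            deps.discard(task)
        self.order.append(task)

    def run(self):
        """Return (finished order, errors); errors is empty on success."""
        while self.remaining:
            ready = self._ready()
            if not ready:
                return [], [f"cycle among: {sorted(self.remaining)}"]
            self._complete(ready[0])
        return self.order, []


order, errors = OrderingEngine(
    {"build": ["compile"], "compile": [], "test": ["build"]}
).run()
print(order, errors)  # ['compile', 'build', 'test'] []
```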

This is an important point in programming, because while it is nice to have hard and fast rules for the development, and consistency is vital, there are always times where an exception can be the more appropriate answer. Strong consistency in 80% of the code is generally what I am shooting for, but you can't sacrifice other useful attributes like readability just to get to 100%. It's all about making the right trade-offs.

On the other hand, the consistent approach should always be tried or investigated first. Most of the time it works, so most of the time it is a safe bet. Ignore hunches.


PATTERNS, LIBRARIES AND OTHER ISSUES

When design patterns first came along, I thought they were excellent and would be really useful. That was, until I started seeing programmers use them as de facto building blocks and naming their objects after them such as *Factory or *Facade or such.

If your perspective is on the data, it is easy to see that patterns are code related. They're really the opposite of data-structures (code-structures?). Because of this, if you use them as building blocks, they become effective tools for obfuscating the real underlying flow of the code. They were meant to be a starting point for coding; just an initial template, not a building block.

Most sophisticated objects require multiple overlapping patterns. By separating them, and raising them to the status of building block, they just become more noise to hide what the underlying code is really doing. Does knowing that an object is a Singleton change how you'd use it? It shouldn't. So you don't need to know.

Patterns should be mixed and matched, and then should show up in the comments, with references, but they should never, never be in the object name space. With a data-oriented approach, patterns can be useful in helping to drive some consistency into the underlying implementations surrounding the data, but I guess because they are code-centric they are easy to abuse, and this has become standard practice.
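
A small sketch of what that could look like in practice. The Settings class and its contents are hypothetical. The object is named for the data it holds, and the pattern is mentioned only in the comment, with a reference.

```python
class Settings:
    """Application-wide settings.

    Implementation note: one shared instance is reused (Singleton pattern,
    see Gamma et al., "Design Patterns"). The name says what the data is,
    not which pattern builds it.
    """

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.values = {"theme": "plain"}
        return cls._instance


# Callers just ask for Settings; they never need to know about the pattern.
first = Settings()
second = Settings()
first.values["theme"] = "dark"
print(second.values["theme"])  # dark
```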

A data-oriented approach to libraries would be exceptionally useful, particularly if programmers could agree on consistent interfaces. After all, most misery in modern programming comes from interacting with all of these badly engineered, oddly interfaced libraries. Java is the worst language for that. The libraries are sporadic and messy, with horrible interfaces that seem bound to make them much harder to use.

For libraries, I only want two types: one that provides an encapsulation of some type of data. An image library perhaps, that allows me to load and save images. The other type of library is for handling specific algorithms. Say, something to run Gaussian blur on my images.

I'd like them separated because sometimes I only want clean and simple access to the data, particularly if my intent is to add my own higher level algorithms.
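
A toy sketch of that separation, with everything in it assumed for illustration: the GrayImage type, its one-row-per-line text format, and a simple horizontal box blur standing in for a real Gaussian blur. The data side only knows how to load and save; the algorithm side only knows how to transform pixels.

```python
from dataclasses import dataclass
from typing import List


# --- Data library: only encapsulates the image and its load/save format. ---
@dataclass
class GrayImage:
    pixels: List[List[int]]  # rows of 0-255 values

    @classmethod
    def load(cls, text: str) -> "GrayImage":
        """Parse a hypothetical plain-text format: one row of integers per line."""
        rows = text.strip().splitlines()
        return cls([[int(value) for value in row.split()] for row in rows])

    def save(self) -> str:
        return "\n".join(" ".join(str(value) for value in row) for row in self.pixels)


# --- Algorithm library: knows nothing about storage formats. ---
def blur_horizontal(image: GrayImage) -> GrayImage:
    """Average each pixel with its left and right neighbors (a box blur)."""
    blurred = []
    for row in image.pixels:
        new_row = []
        for x in range(len(row)):
            window = row[max(0, x - 1):x + 2]
            new_row.append(sum(window) // len(window))
        blurred.append(new_row)
    return GrayImage(blurred)


image = GrayImage.load("0 90 0\n30 30 30")
print(blur_horizontal(image).save())
```

With this split, someone who only wants clean access to the pixels can take the data side alone and write their own higher level algorithms against it.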

If it were a matter of just matching libraries to specific categories of data, then choosing implementations and working with them would be a whole lot easier. These days, you often get these "half-baked" libraries that partially solve some limited aspect and add in a fraction of data. They are unusable on their own.

They often have large usage flexibility too and you have to really wonder who the programmers thought would use all of this stuff. The coders were probably so concentrated on writing the code, that they didn't give much thought to how people could or would use it. But even if you wanted to, it is not safe to depend on some minor functionality contained in a library. That's the type of thing that changes quickly between releases. Or it doesn't get updated. Either way, it is too much technical debt.

Often with many of the modern available libraries, it seems clear that the programmers were more concerned about the ease of their implementations, than about the ease of other coders using their works. It's a great recipe for an ugly interface.

This leads to a simple technique. Sometimes in programming, if I have some very complex interface to design I'll start by writing the higher-level example code first. That is, I will write out the calling code in a simple and straight-forward manner as it 'should' be, then later I will back-fill in the missing library code to make it match how it was called. It's a hugely valuable technique in getting clean and simple interfaces.
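
A minimal sketch of the calling-code-first technique, using a hypothetical Report class. The calling code is written at the top exactly as it 'should' read, and the library beneath it is filled in afterwards to match those calls.

```python
# Step 1: write the calling code first, the way it 'should' read.
def report_example() -> str:
    report = Report("Quarterly Sales")
    report.add_row("North", 1200)
    report.add_row("South", 950)
    return report.as_text()


# Step 2: back-fill the library so the calls above work as written.
class Report:
    def __init__(self, title: str) -> None:
        self.title = title
        self.rows = []

    def add_row(self, label: str, amount: int) -> None:
        self.rows.append((label, amount))

    def as_text(self) -> str:
        lines = [self.title]
        lines += [f"{label}: {amount}" for label, amount in self.rows]
        return "\n".join(lines)


print(report_example())
```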

A final issue always related to data is simplicity. Most programmers get into programming because they have a deep down love for intricate complexity. They've always loved machinery, cogs, gears, that sort of thing. That is fine, but that is also their Achilles heel. That is, their love of 'intricate' leads them to build intricate code. A finely crafted watch is a thing of beauty. Code that is fine and intricate is also fragile, delicate and it is hard to explain how it really works. All attributes that are undesirable in programming.

The simplest, most straight forward code that does the job (so long as it is not COBOL :-) is the code that all programmers should be striving for.

That doesn't mean that you can't have large complex abstractions that generalize the mechanics and allow for lots of code to leverage off a little. That type of abstraction, particularly if it is within the data-level is fine, and actually desirable. If you weigh it properly, 10,000 lines of an abstract engine for some large range of calculations easily beats 250,000 brute forced lines, where each set of options is cut and pasted. Not only is the smaller, denser code more stable, it is also easier to test.

Simple is especially important when it comes down to user and admin behaviors. A good approach would probably be to write the documentation first, and then back-fill the code in later. Sophisticated code with complex algorithms is really hard to explain to users, and because of that, not particularly appreciated. If the users keep forgetting how the code works, they can't exploit it properly.

Simple code is really spartan code. That is, there are a minimum of variables, nothing is overloaded, and all of the code is broken into reasonably sized functions. Short, clean and with no extra effort. Fancy comments, or extra annotations are just more things that need to be updated in the end, and should be avoided. What programmers need is the exact minimum amount of work that is needed to get the job done, and nothing else.

Spartan code is data-oriented code. It happens that way because the programmers are concentrated on the least number of variables, the least number of calls, and least amount of duplication, and all of these things come easily if you follow how the data moves through the system, not how the code moves the data through it.


SUMMARY

Data, data, data. I can't say it enough, and even if I sound like a broken record it doesn't seem as if the message is getting out there.

This isn't really a new idea, it has been there all along. We just seem to have trouble as an industry in passing along our own domain knowledge to the next generations of programmers. We keep losing our understandings, and each new generation re-invents the same wheels with a small set of improvements, and some huge steps backwards.

It's tough because as the technologies become more complex, and until we get better abstractions on which to build, most development is sitting on the edge of a cliff about to fall. You'd think that the industry would seek out methods of making coding more reliable, but instead we seem to want to do the opposite. More people write code, and more new code is coming into the markets, yet the overall quality has gone down.

Young programmers are too interested in writing code to care if what they are writing is reasonable. Many experienced programmers I know have just dropped out of any public discussions about their occupation. They see little value in many of the newer methodologies. And old programmers... well, old programmers rarely exist. Long before most people have mastered coding, they have left it behind.

Still, with one simple change in perspective, even the largest, most daunting projects can become manageable. If you start out with the right foundations, and keep up with the technical debt, then most software development doesn't have to be a risky endeavor. Projects don't have to fail as often as they do. We don't have to put in as much effort as we have been.

3 comments:

  1. Hi Paul,

    I have read your post multiple times (printed it and read on the way to work) but I'm having a hard time applying it to a typical enterprise app.

    For example in an employee management system and other typical enterprise apps, it seems that the nouns naturally map to object oriented systems (Employee for example). As a data structure I would think of an Employee as a data aggregate like a struct or a class. How would you apply data oriented design to these types of systems? Thanks.

    - John

  2. Hi John,

    Off hand, without tying it to a concrete example, I'd say that the system is composed of a set of unique entity structures and that these map almost on a one-to-one basis with generalized objects, but I realize that that answer may not be particularly clear. If you want to discuss this or specific examples in detail, then email is probably a more comfortable place to talk. I'm at paul underscore homer at yahoo dot ca, and I'm always happy to talk :-)

    Paul.

  3. Hi Paul,

    Thanks! Will email you soon.

