Monday, March 29, 2010

The Edit Loop

A very common programming and architectural problem comes from what I like to call the 'edit loop'.

It is a software programming construct that exists when getting the user's data out of some form of long-term persistent storage and then putting it into a user interface for display or editing. The second half of the loop comes from taking any changes made by the user and getting them back into storage.

Mostly these days the interfaces are GUIs, and the storage is a relational database, but all of the same problems apply even if the interface is a command line or text based, and if the storage is some other type of persistent storage such as a key/value data-store. For this discussion we'll just assume the interface is a standard GUI with panels and widgets, and the database is relational.

These loops are generally triggered by some event mechanism, which is started by a user navigating to a specific screen-full of data in their application. The loops happen often, and are the main form of processing for the application. They usually constitute the bulk of the functionality and of the code.

Millions of programmers have been coding these types of loops for decades. Virtually every system has some type of interface, and in all of these there are often many types of edit loop.

The number of times this code has been written, rewritten, hacked or refactored is staggering. It generally accounts for at least 80% of the work of most application development. It is were most programmers spend most of their time.

This post will start first with the stand practice for building such loops in large systems, and then examine some of the common problems with this approach. It will then get into ideas about how to reduce the work and redundancies. Finally it will deal with more complex architectures.


STANDARD DESIGN

The easiest approach to building an edit loop is to start with the database.

Since the major strength of a relational database is to allow several different applications (such as reporting and mining) to share the same underlying data, the first thing that should be done is to create a fourth normal form schema in a 'universal' representation usable by all of the applications.

Some applications have more stringent performance requirements, so a few of the tables may have to be de-normalized as required.

From the database, the programmers need to get the data into their running code. Since most applications have slightly different requirements then the universal schema, there is usually some finessing of the data to put it into an application specific model.

The current popular programming paradigm is object-oriented, which generally relates the tables in the database to specific objects in the application.

There could be a one-to-one correspondence between the tables and the objects, but it is more likely that the programmers will source the same underlying tables within many different objects in the system. There is usually a lot of repetition. This collection of objects in the application is often loosely referred to as the "model", although it is rarely so precise.

Once the users have created a large number of different models of the data, most system architectures are layered into distinct pieces, either as a client/server architecture or just as some lower data level with an interface level sitting on top of it.

In either case, the data constructed in the modeling part of the code needs to get transported to the interface part of the code.

The predominate convention for this type of 'transportation' code is to make it strongly typed. Each variable loaded from the database has a consistent data type that stays with it as it travels throughout the system. If the field is an integer in the database, it is read as an integer, passed through the system as an integer and edited as an integer. The type stays consistent.

Beyond the transportation code lies the user interface. In most systems this is predominately the largest chunk of code. Although the convention is to not decorate the data with 'presentation', normally the different models in the back-end have been specifically created based on different presentation needs, so some of that presentation information is already implicitly encoded into the data.

This mostly occurs because the panel/widget code is often thick, ugly and confusing, so the programmers push the different views of the data back down to the database layer where it won't get mixed into the GUI code. What starts as good intentions quickly gets obfuscated.


THE REVERSE TRIP

So far, the data has worked its way up from the database, through to the interface. It has been strongly typed right out of the database and throughout its journey.

Once into the interface, if it is only a half loop, the data is just further spruced up for presentation, annotated (with things like links) and then dumped to the screen.

If the data is involved in a full loop, it is usually displayed on the screen in a series of widgets that allow for it to be modified. Once the user's editing work is deemed complete, the data is then heavily validated.

Usually this means some set of checks that is more stringent than just its simple data-type. For example, if the field is an integer, perhaps the only valid values for it are between 1 and 4. It is highly restrictive.

Many fields are also cross-checked with each other, to make sure that the whole set of data fields is consistent with some set of external rules.

If this process fails, the problem is pushed back to the user, to try again. And again if required. Until the data is finally validated enough to start the journey back to the database.

On the reverse trip the data is also commonly strongly typed. It goes from the interface code through some transportation back to the database code. Many systems use entirely different sections of code and different transportation mechanisms for the second half of the loop. It is frequently seen as a different problem from fetching the data.

Now, most schema usually have some serious validations checks encoded into the database as well. It is an essential part of a well-designed schema. If the application requires a relational database, it should use that database properly and to its fullest extent.

The incoming data must be consistent in order to get added or updated in the database. However, most database code assumes that the incoming data is fine, and then blindly tries to save it.

A common problem is basing user edits on stale data -- data that has changed since it was retrieved -- but this one is rarely solved in most systems, usually just ignored.

In the database code, the data is jammed into the database API, and then added or updated as needed. If this fails, the code will generally just chuck an error straight up towards the user. If it succeeds then a success message is sent instead.

Sometimes you'll find a second layer of validation checks, just ahead of the database. Usually this is to allow for a slightly different but more specific message then would have been issued by the database in the error case.


INTRINSIC WEAKNESSES

While this type of code construction is very common, it makes a trade-off between just getting the code written quickly and working smartly as a software programmer.

Simply put, it is more often a result of the programmers needing to get something working now for a deadline, then it is a result of the programmers sitting around spending lots of time thinking about the ultimate way to build their systems. It is a by-product of the pressures to get something built fast.

In these types of systems, there are essentially three main sections: a) the database code, b) the transportation code, and c) the interface code.

Starting backwards, the interface code is the mechanism to allow the users to use some functionality within a specific context. That is, the user really wants to do something to the data in the system; their choices are to edit it, or to view it in some fashion (that may take significant computing power). Because of this, we can view the interface code as just remembering the user's context, and as an entry-point to launch some functionality for the user.

In most big systems, there are literally thousands of different individual 'user-functions' that can be accessed. These can be large like editing data in a form, or small like re-sorting some data in a table. Each and every action by the user executes some user-function of some type.

This is usually the most code in the system, and the worst code in the system.

Generally, most programmers short-cut the object-orientation of their user interfaces so that the code becomes very big and bulky. There is a tendency to create larger objects that contain more underlying screen element construction code, like setting up panels or widgets. Full object-orientation is very rare, particularly in the more modern programming languages like Java and C#.

Most of this interface code is very specific to individual screens in the system. Most screens are only used once within the system. There are a huge number of redundancies.

The transportation code moves the data back and forth between the interface and the database. Although the transportation code is simple, because of an insistence on strong typing, it is not uncommon the find systems out there with lots and lots of very specific transportation code.

Usually it comes across as just glue code, binding one interface to another, but with a large number of specific variable copies. Sometimes you'll find a  number of different data transformations as well, where the programmers have gone to extra work to fiddle with the data in some way, in the middle of transport.

On the opposite spectrum, sometimes the code is just spanned right out to a full, but highly redundant representation, that could have been easily compressed. The programmers thinking it was too much effort to pack the data a little tighter. Simple tricks like pulling out any common absolute partial strings, leaving only relative ones, can help cut down on data duplication, but are rarely used.

Mostly, the bulk of the transportation work goes into setting it up on one side, and then re-sorting it into a different shape on the other. Different schools of philosophy and coding style have very different preferences in how to build this part. Many prefer brute force.

The final section of code is the database code. In a language like Java, much of it is as direct SQL implementations in APIs like JDBC. In C#, the programmers can save a bit of the ugly syntax with LINQ. Most software development environments allows for some type of direct SQL access to the database.

Later implementations of alternative paradigms like object-relational-mapping (OR/M) try to put an abstracted layer over this code to make it less redundant and more consistent.

Still, one key problem is that the data representation in a relational database is both universal and less expressive than the application representation. As such, there are a lot of things that the code can do, that are not easy, convenient or even possible with the database.

The difference in expressibility leads to a very common software development mistake, where some developers try to build the system backwards, starting from the interface, and doing the database layer last.

This top-down approach is easier to code, but it is more likely that there will be issues when trying to force the application perspective into a database, then if they had started with a bottom-up approach.

Also, the top-down approach leaves the data in a highly de-normalized state which is application specific. If the data is only usable in its persistent form by one application then it is a technological waste to store it into a complex container like a relational database, when there are far simpler and more expressive ways to store the data. Still the convention is to not think that way.


CODE BLOAT

These three sections, in most systems, get quickly filled with lots and lots of redundant code. Most programming efforts expect and allow this. That is, they set up the architecture, add in some common libraries, and they start letting these three sections grow large and fat.

There is usually more re-use initially, but as the work progresses and the teams change slightly, more and more programmers tend to avoid the old code, and just write new stuff. Bloat is normal.

What inevitably happens is that the code gets more and more redundant. Each new user-function gets its own specific interface, transport and database code. Most times, the teams are conscious enough to not duplicate data in the actual schema, but even then you can often find the same data stored in multiple different tables.

Redundancies become common place. One clear way to see this is by breaking the code up into two. Given two distinct sets of screens, how easily can you split the system into two separate programs? How much code will they share as common libraries?

If this decomposition is easy and includes the bulk of the code, then the screens are 'independent' from each other.

In a non-independent system, most of the code in the system is shared between the different screens. Each screen has very little unique code.

In an independent system, you could keep breaking the system into two new pieces, recursively, until all of the screens exist in their own programs. If it wasn't for context, you could also break down all of the user-functions into their own programs. A thousand little programs would work just the same as one big one (although navigation from program to program would be tricky).

Independence means that there is more and more specific code that has a 1-to-1 correspondence with some user-function. The code is only executed by a very specific entry-point into the system. As that code is replicated throughout the system, it becomes worse and worse.

We can define a metric to count how well the code is being re-used. We'll call it 'leverage'. The leverage value L, for a line of code in the system, is the number of user-functions which require it. If the code is used only by one specific piece of functionality, then L=1. If it is shared in two places, then L=2. If it is never used then L=0. If there are N user-functions that need it, then L=N. It is a straight-forward count.

We can combine the different L values for the various code blocks to get one overall value for the system. Independent systems have a lower average L value. The more independent the code, the more it is redundantly doing the same work over and over again. The more likely that changes will cause new bugs.

Dense systems have a higher L value. If the value closely matches the amount of functionality in the system, then the bulk of the code is being greatly re-used, and even minimal testing has large coverage. Systems that have higher L values are more likely to have higher overall quality. There is way less code in the system, and more bugs are found with less testing. All good attributes to have in a code base.

We clearly want to try to maximize the leverage in our systems. Less work, means we can do a better job. Nothing is worse than flailing away at poorly structured code.


FIXES AND SOLUTIONS

If you have a big system with thousands of user-functions, then even if the functions are all relatively simple and only require a thousand lines of code each, a complete independent system is easily over a million lines.

And it is that multiplier effect that is so poorly understood by many programmers.

There is a limit to the speed at which we can write, and as the application becomes bigger and more redundant, our coding speed gets slower and slower. Inconsistencies get larger and more dangerous, causing the bug fixing to become another huge problem.

Mostly the lower the leverage value in the system, the harder future development will become, often further lowering it. It is a downward spiral.

The only real solution is to try to increase the leverage to its highest possible value. Only in a highly leverage system will the developers be able to keep up with the work load and most development will get easier with time, not harder.

The first, most obvious change is to the transportation code. Wherever it is, in any system, it really doesn't need to know the type of data it is handling. It is crucial to minimize parsing, and conversions, but we also want to leverage any block of code as highly as possible.

Even though code is handling the data, the code should always have a minimal understanding of its data. That is, a container like a tree, doesn't need to know about its underlying objects, it just needs to hold them. And the same is true in the transportation section.

The data has to come from the model, and the model has to alter the incoming data, but from there all data is the same as any other data. It should go into one large generic container.

In that way, each and every user-function that uses an edit loop, should go through the same transportation code. If there are fifty user-functions that need a full or partial loop, then L should equal 50 for the transportation code.

Modern languages allow for introspection, which allows for further encapsulation of the packaging and unpacking code. The method of transportation for the system should be identical for all parts of the system. Writing duplicate, but slightly different code sections is neither necessary nor reasonable.

From a programming standpoint, it should be trivial to package all or parts of an underlying model into a transport container. It should be a one-line statement. It should also be trivial to get the data from a screen and send it back to the back-end. These are hard, but good qualities to achieve.


STRONG TYPING

Strongly typed code is often viewed as the "correct" way to build systems these days, because it pushes some of the validation work onto the compiler. While that can be useful in detecting some types of compile-time bugs, the compiler cannot check the running content of the data, so there are an awful lot of bugs that can't get caught this way.

On the other hand, any programmers with experience in loosely typed languages know that there is always less code, and that the underlying code is always intrinsically more reusable. Loose typing makes for flexible code. Strongly typing makes it more rigid, which is sometimes useful, but only in specific circumstances.

If we are really interested in significantly reducing the size of our systems, then we are interested in any type of approach that reduces code.

By keeping the data as loosely typed as possible, we can move it through the system without having to create a lot of specific code to manage it. For instance, in Java, since everything can be of type Object, then everything can be loosely typed. Even the containers like lists and trees can be packaged as Objects.

When we finally need to use the data, we can convert it to a more convenient type. This process is natural in a loosely typed language -- usually being handled quietly and automatically -- but can easily be emulated in a strongly typed one. If the interface needs a date from the database we can grab it, cast it to an Object, and move it through the system and into position. Only at the last moment do we need to need to re-convert it back into an actual date string for display.

At the end, this may seem like an unnecessary conversion, but truthfully it is a small cost compared to managing all of the strongly typed code in between. Software development is always about trade-offs, and this is one of the better ones.

The presentation layer generally needs to 'stringify' the data anyways, so most often it isn't even extra, and it might just be an optimization in some cases. If your underlying transportation code is forcing all of the data into a strings, then the final conversion might just be moot.

But it shouldn't matter to the programmer, because the transportation code should abstract away all of the underlying details. String, integer or other data-type, the programmer shouldn't know or have to know about the in-transport data representation, it should be encapsulated.

All they need to know is what their final data-type for the user should be, and if they are using some interface abstraction, they may not even need to know that much.

The less a programmer needs to know about what is underneath, the less they should know. Many programmers tend towards being too specific in their handling of data, often writing large amounts of excessive, unnecessary code. It's a common form of over-complicating the code. Stopping this behavior, or at least containing it, helps in keeping the code from getting under-leveraged.


DUPLICATE DATA VALIDATIONS

For most systems, the interface has very strict validation requires and so does the database. As previously mentioned, the database validates and returns either an error or success during an update or save.

It is not possible to keep the database from ever returning an error, as there will always be unforeseen events like being out of disk space, or communication problems.

Because of this, there must be a passageway through the system for database errors, which one might as well utilize for other things as well. The more we reuse the existing mechanisms, the less specific testing we'll need.

Code re-use always cuts down on testing.

So ultimately, we really don't want to do much duplicate validation at the database level, and we certainly don't want to do any of it at the transport level. Display data is validated by the fact that it came directly from the database.

Since editing always requires strong validation, if we stack our validation code into one place, then it is not duplicated or spread all over the system. Editing is the best candidate.

Implicitly it does also exist in the database schema, but that is OK. The database schema encodes a universal view of the data, which can have its own different validation, and differences should be accommodated with the reading and transformation of the data. The differences may be small, but they need to be encapsulated together.

In that way, although we want strict validations in the interface, the rest of the system shouldn't know, and shouldn't care about its data. That is, it should just be some loosely typed 'thing' that is going someplace.

Code shouldn't know any more about the data, then what is absolutely necessary.

Now once at the front-end, both the presentation and editing do require the data to be strongly typed. In fact, they require 'strict' typing, where the type is far more limited than just a simple data-type, it is tied by its domain and by being cross-referenced to one or more other variables.


FRONT-END ABSTRACTIONS

One place where programmers spend way too much time is on the front end. The code is highly repetitive, and none of the current framework paradigms attempt to abstract that down to something smaller.

Common practice is too create a small set of unique objects per screen, or action, and then fill these with highly-redundant low-level GUI code, such as allocating widgets, or handling events.

Model-view-controller (MVC) frameworks bring this down marginally to dealing with actions, but since they usually have a one-to-one correspondence with screens, there is still a large amount of nearly identical code.

Mostly, all of the interface code is about displaying data or editing it. The displays can be textual or image based. They include embedded user-functionality to help drive the appearance of interactivity. They are all very similar in nature, and always need to have some overriding consistency.

But if the code is spread out into a large collection of redundant sections, keeping it neat and tidy quickly becomes impossible. What's needed is an overlaying abstraction to minimize the code and enforce consistency.

A good abstraction provides a higher-level structure, in an attempt to reduce the amount of code necessary to work with an encapsulated set of behavior. That is, the programmer should have to do way less work, when working with a good abstraction. It not only provides structure, but also reduces effort.

An abstraction that works nice is to fit a 'form' over all of the interfaces. Read-only code fits into a read-only form, while editing works mostly within a standard forms model.

In a reasonable implementation, the user would need some minimum way of defining the form, and then populating it with initial data. A good abstraction would hide most of the normal interaction with the user, allowing for only special cases to flow through. This is necessary because the programmers shouldn't have to reinvent the mechanics for handling simple technical things like paging list of data from the database.

Keeping the programmers away from re-inventing the smaller technical solutions, also helps to enforce consistency in the behavior of the interface.

Some care needs to be established because frameworks can occasionally make things too implicit. That is, APIs often present the wrong types of options to the programmers. Simple obvious values should be set into the defaults, and the only options that need explicit overriding should be those that really represent a degree of freedom within the method call (although multiple calls are a better choice sometimes).

Restricted validation and some cross-conditional handling is always necessary. When ready, a good abstraction will present the final and completed data to the programmer, ready for transportation.

Of course, many applications will have special corner-cases, so the abstraction will need to allow the programmer to 'hook' in code at specific points. This can be complex because these hooks are intrinsically disconnected from the rest of the system, making them less obvious and harder to understand. A good interface should read very simply, and so the nature and purpose of the hooks should be obvious and easy to grasp.


BACK-END COMPRESSION

One of the hardest places to reduce overly redundant code is in the back-end data models. Still, there are some key data elements contained in the database which are of interest to the application. If you're strict about minimizing these, and keeping out presentation information, the code can be reduced.

Most schema have inherent redundancies in their tables, such as date stamps and user auditing information. These can be encapsulated, and reused over and over again. Small convenience libraries can be used to make any of these common fields share one implementation.

As well, even if the schema is forth normal form, some of the tables themselves can be brought together at higher levels. Although this type of generalization can cut down on code, you have to be very careful not to over-do it and make the schema itself impossible to share across applications. It becomes a set of very hard trade-offs.

If the data is only ever used in one application, and has some alternative import/export format for external system's comparability, then using some other non-relational format may be a better choice. Ideally, the less work required to load and finesse the data is the best solution.

Thus in an object-orient language, a real object-oriented persistence mechanism that is super-simple is the best choice, although consideration has to be made in correctly handling system data upgrades at some later point.

The best solution is to just boil it down to its absolute minimum, and take advantage of techniques like polymorphism wherever possible. Less code, means less duplication, less redundancy, less testing and less bugs.


SUMMARY

Inexperienced programmers are generally more concerned with getting their code to run then they are with keeping it so. Because of this, they rely on the simpler and more obvious brute force approaches towards development, which significantly increase the amount of code, reduce the leveraging and the increase the work in testing it.

A system with a few hundred user-functions might start out OK, but if most of the code is poorly leveraged, that quickly changes as the project matures.

Poorly leveraged, highly redundant code, is the most common critical problem with most software development. You see it everywhere. An initial release might be successful, but the accrued technical debt grows more rapidly then the resources to offset it. Stagnation or implosion are the expected consequences.

Ultimately, if the system is so highly leveraged that just logging into it tests a large swath of the transportation and back-end code, then the overall quality of the system will be high. It will be high because it was intrinsically built into the architecture. A few simple tests will cover large code sections, which makes for a trivially well-tested system.

Ultimately, it's not how many bugs you have in the system, it is how easy they are to detect that really matters.

As well, writing the initial code is not particularly hard. Given enough concentration most people can put together a sequence of instructions to tell the computer to do something. The real trick to programming, is to be able to keep this code sane, version after version.

Without structure, and with a low leverage value, the code will degrade rapidly as it gets pushed and expanded.

On the other hand, strong abstractions in highly leverage code are the gold standard of programming. They make extending the code -- any part of it -- easy. A sign of good code is that it should be easier to grow the system, then it should be to re-write it.

We don't want huge, inconsistent systems. We don't want independent code. Programmers never start out with this as their goal, but it is the inevitable consequence of many of our standard programming best practices.

Given, that with a little more effort and thought, the work involved in most systems can easily to be reduced by orders of magnitude, it is surprising to see how rare this is in practice. Programmers are usually their own worst enemies, even though few stay in the profession long enough to correctly understand this.