Thursday, June 26, 2008

Readability

You don't have to hang around many large software development projects before you realize that most of them are crippled by their own code base.

The biggest symptom is how widely the coding styles vary among all of the collaborators.

All programmers have different styles and preferences, but teams that are not trying to work together lead to heavily siloed code; code that is redundant and inconsistent. A bad situation that only gets worse with time.

The projects that work the best do so because (most of) the team is in harmony, going at it the same way, and reusing the same basic infrastructure. What I like to call "Several Species of Small Furry Animals Gathered Together in a Cave and Grooving with a Pict" after the well known Pink Floyd song.

While inconsistency is such a significant indicator of problems, and represents a long-term issue, it is not easy to get a group of programmers to work together in a consistent fashion. There are lots of reasons why that is true, but for this particular post I want to focus on only one of them.

Readability, for those that like things to be clearly defined in the beginning, is the ease with which some unknown 'qualified' person can examine the code in question and make a reasonably correct judgment as to its function. The code is easily readable.

This is a key attribute in code, in that the most elegant answer to the problem is the simplest one that meets all of the necessary constraints, and is implemented in a readable manner.

That is, readability is the second half of making something elegant. It does you no good to have the perfect solution if its implementation obscures the fact that it is the perfect solution. The problem, in that case is not fully solved.

All of that is rather simple in its own way, but as we dig deeper into trying to pin down the details we hit on a really nasty question: is readability subjective?


SUBJECTIVITY

If you are going to get a team of people to follow a standard, it is far more difficult if that standard is arbitrary. You lack convincing arguments to justify it.

Programmers, in particular, love to be unique, so without a good argument they tend to fall back onto their own styles, ones that suit their personality and cultural bias. Good for them, bad for the team.

This means that you need strong arguments to sway a team into following similar courses, and the argument "because I say so" is usually a bust with any group of even mildly contentious or cantankerous programmers. The strongest possible answer you could have is that there is a "provable" right answer which is not debatable.

At this point, I could dig into the essence of subjectivity and form some type of formal proof for saying that readability is or isn't subjective. But I won't. I leave the reasoning behind that for the very end of this post. You'll have to get through the other bits first before I can explain it.

Readability, in its essence is a form of simplification. You can examine it on a line-by-line basis, but just because the individual lines themselves are at their most readable, doesn't mean that the set of them together is particularly readable. This is one of those problems where we must consider the whole as well as the parts.

My instinct suggests that there is some element of subjectivity inherent in the definition of readability, since as a simplification there is no objective way to constrain all of the variables that may or may not make something readable.

More so, what is readable to one person is not necessarily readable to another, and standards, conventions, culture and style all change the readability.

Even more interesting, the encompassing technology itself changes it, as one notices in the change in style brought on by tools like syntax-colored editors and larger screen real estate.

And of course, the implementation technology itself "bounds" the readability. Highly readable APL for example -- a horribly cryptic language of symbols operating on vectors and matrices -- is only "so" readable. COBOL on the other hand is overly so.

But, and this is a huge but, if you fall back and look at a number of textbooks for branches of mathematics like calculus, you will find a remarkable degree of similarity amongst the various proofs for theorems.

In that context, the proofs have to be rigorous, and the readers are always students, so the details need to be simple; these specifics drive the authors to find proofs that meet those criteria. These proofs are similar to each other.

In all ways, proofs are a type of code example, being nothing more than an algorithmic series of steps that must be followed to verify that something is rigorously true.

Those proofs then are isomorphic to computer language programs, which themselves are just sequences of steps that result in some change to a data-structure.

The degree of similarity between different versions of the same mathematical proof is important. The degree of similarity between proofs and code is also important.

This shows, that while there may be some subjectivity, it is barely so. The differences between two proofs are most likely to be trivial or near-trivial. The same is true for "readable" code implementing the same algorithms in the same consistent manner.

Yes, it is subjective, but not enough to meaningfully affect the results.

Indirectly my favorite example of this comes from Raganwald, where Reg Braithwaite writes about narcissism in coding:

http://weblog.raganwald.com/2008/05/narcissism-of-small-code-differences.html

My particular view of that post is that the agnostic was focused on a simple reliable way of getting the job done. A good, simple readable answer.

The other three, the ascetic, the librarian and the purist, all took turns at applying their own "simplifications". The problem was that in their quest to simplify the code with respect to their individual biases, they introduced their own additional complexities. Unnecessary ones.

All three other pieces of code were consistent with their author's beliefs, but that was a trade-off made against the code's readability. And, in this specific case, since we don't know the rest of the system context, the trade-off is a poor one.

Worse though, is the fact that the "team", as it were, is flip-flopping inconsistently between styles. Any of the overly righteous code might be obfuscated, but at least if it were consistent it wouldn't be that hideous. A mixed team is far worse than a purist one.


SIGNIFICANCE

Readability being negligibly subjective is a huge statement, one that, if correct, has significant ramifications for software developers.

If you're writing an introductory book on programming, you spend a great deal of time fiddling with the examples to bring out their visual sense. The readers should be able to quickly look and see that the code is for a linked list and that it functions in such a manner.

Why would it be any different for writing a full system? Is it less likely that there will be other readers of the code? No, we know that if the code is good, lots of people will view and edit it.

Is this a one-shot deal, where it is written and then cast in stone? No, the code will likely go through many iterations before it is even mildly stable.

Is it faster to write or execute ugly code? Definitely not, simple code usually performs better.

To some degree, in a simple programming example, some of the detail has been dumbed down because of the size. But if that detail is not really necessary for the example, it can often also be abstracted away from the actual production code to leave the results more readable. Industrial-strength code shouldn't be that far away from a good textbook example.

Sure, it involves some extra work, but that is a short-term effort that saves significant long-term trouble, and gets easier as you do it more often.

If you are serious about development, then you want to write elegant code. In all ways, that is what separates a professional software developer from just any old domain-specific expert who can type: they can write it, but it's ugly and fragile.

The high end of our skill is in not just getting it to work, but also getting it done elegantly, so that it continues to work for release after release, revision after revision. Doing it once is nice; being able to do it consistently is the great skill.

Then as a team, is it not obvious that coming together to build an elegant solution is a best practice? That said, no matter how hard you agonize about the code, you fall short if its implementation is just plain weird. And if you're not following your team-mates, then you're being just plain weird, aren't you? (weird is relative, after all)

The full strength of this argument is that someone -- anyone with a minor amount of technical experience -- should be able to look at the code and get a sense of what it is for. As such, if they look at it for hours and shake their head, it is more likely the code's fault than the reader's. And that, in the overall sense, is critically important to being successful with the project.

If the code's meaning shines clearly through the awkward syntax, it is good, but if not, you're doing it wrong.


THE BEST EXAMPLES

There are two simple properties to look for in really great code:

- the code "looks" simple.

- the code does "not" match the specific problem it is actually being used to solve.

The first one is obvious, but it is actually very hard to achieve. Some programmers get 'sections' of clean code, but to put forth an entire system that is clean, simple and obvious is a skill beyond most practitioners today.

Partly that's because they've never tried, but also partly because turning something simple into something hard is easy; going the other way is a rare talent.

You know you are looking at good code when it is easy to completely misgauge the amount of effort that went into it. If you figure you could just whack it out in a couple of days, because it reads that simply, then it is clearly readable.

The second property is harder for many programmers to understand.

If you have abstracted the nature of the problem and have thought long and hard about how to generalize it, then the implementation of the code will be at a higher level than the problem itself.

The upper implementation will be smaller, more configurable, more optimized, run faster, use less resources and be more easily debugged than having just belted out all of the instructions. A good abstraction is a huge improvement.

But it is a curve: go too far and you fall off the other side, and it becomes convoluted. Don't go far enough and you have huge bloated code that is rigid and prone to bugs.

At that very peak of abstraction, or somewhere in that neighborhood, the essence of the algorithms and data-structures is generalized just enough, only to the degree needed to leverage the code for its widest possible use. But in that, the code deals with and references the general form of the data.

So, for instance you implement the corporate hierarchy as a multi-branch tree. The code talks with, and deals with the abstract concept of a tree, while the real-world problem is a corporate hierarchy.

You could "name" the variables and routines after the hierarchy, but that would be misleading if you choose to use the code elsewhere, so the real naming scheme for the bulk of the code should follow the level of generalization. Tree code should talk about trees.

Once you've implemented the tree code, then you can bind an instance of that code to any other hierarchy functioning in the system. The "binding" refers to the hierarchy, but the nuts and bolts underneath still refer to the trees.
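As a concrete illustration, here is a minimal Java sketch of the idea; the TreeNode and CorporateHierarchy names are my own, purely illustrative. The tree code only ever talks about nodes and children, and the real-world problem appears only at the binding:

import java.util.ArrayList;
import java.util.List;

// The generalized code: it only knows about trees and nodes.
class TreeNode<T> {
    final T value;
    final List<TreeNode<T>> children = new ArrayList<TreeNode<T>>();

    TreeNode(T value) { this.value = value; }

    TreeNode<T> addChild(T childValue) {
        TreeNode<T> child = new TreeNode<T>(childValue);
        children.add(child);
        return child;
    }
}

// The binding: the only place the real-world problem is named.
class CorporateHierarchy {
    final TreeNode<String> root = new TreeNode<String>("CEO");

    void addReport(TreeNode<String> manager, String employee) {
        manager.addChild(employee);
    }
}

The tree can be re-bound to any other hierarchy in the system without touching the general code.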

But of course, finding that higher level of generalization and abstraction is just the first half in creating an elegant solution. The second half falls back onto that first principle of making it look simple.


DIGGING INTO THE DETAILS

Readability is huge, and there is some bias for reader, culture and technology. However, I'll ignore all of that, and concentrate on a few simple examples.

I've tried to stay away from clearly subjective issues, but some readers will no doubt be entrenched in using particular syntaxes, styles or conventions.

As a very big warning I need to say that not all things that have made it into our "best practices" actually deserve to be there, and just because "everybody" does it that way doesn't make it right.

If you've found that I've over-stepped the line, the best thing to do is make a list of all of the pros and cons, and give them some weight. Most issues are trade-offs, so you need to consider what you lose as well as what you gain. A one-sided approach is flawed.

As well, I don't claim to be perfect, and I'm not always a rigid fan of currently popular approaches like "pure" object-oriented. I'm agnostic; I am willing to try anything, but I want those solutions that are simple and really work.

It is a perspective that seems to grow with age and experience. I've lost my fascination with all of the little gears and dials in a clock; all I want to know is what time it is.


LINE BY LINE

For an individual line, it is most readable when its purpose with respect to the surrounding code is immediately clear, within the context of its technology. That's fairly easy to ascertain.

An assignment, for example, in most common programming languages is equally readable no matter what the syntax. The technical mumbo-jumbo that surrounds the statement is a necessary complexity for expressing the instruction.

Simple statements, on their own are simple and readable. Combining multiple statements/operators/methods on a single line opens up the door for making it unreadable.

So, for example, in a language like Java the syntax allows one to chain method calls, but doing so reads in a backward fashion, from right to left. If the normal syntax is left to right, and suddenly that changes, it is an easy "asking for trouble" indicator:

object.bigMethod(argument, subobject.getNext().leftChild());

is not particularly readable. Chaining can be good, but not mixed in with normal syntax.

Lines of code should be simple, clean and obvious. Each and every line has but one purpose. If you've crossed multiple purposes into the same line, you've got a mess. If the arguments to your functions are syntactically complex, then you're just being lazy. Break it out into simple consistent lines.
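A sketch of breaking out the example above into simple, consistent lines; the intermediate types and names are invented just for illustration:

SubObject next = subobject.getNext();   // one purpose: find the next sub-object
Node left = next.leftChild();           // one purpose: find its left child
object.bigMethod(argument, left);       // one purpose: make the actual call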


EXCEPTIONS

At first, try-catch blocks seemed like a blessing. That was, until they started showing up everywhere. What you get very quickly is two distinctly different programs overlaid onto the same call structure. Two of anything in code is always bad.

And far worse are multiple deeply embedded try-catch blocks in the same method. It might be there to be "specific", but it threatens to be overly messy.

If you need a philosophy, catch the stuff you want to ignore at a low level (encapsulate it into its own functions please). Embed any expected normal results right into the return data. Then put one big global catch at the top to split out everything "exceptional" to log files, email (if needed) and the screen (with full or partial detail depending on the operational circumstances). Lots of little, tightly scoped low handlers and one great big one for everything else.
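A rough Java sketch of that philosophy; the class and method names here are mine, purely illustrative:

class ExceptionPolicy {

    // Low level: catch what we intend to ignore, tightly scoped in its own function,
    // and fold the expected outcome right into the return data.
    static Integer parsePortOrNull(String text) {
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return null;   // an expected result, not an exceptional one
        }
    }

    // Top level: one big catch that routes everything truly "exceptional"
    // out to the log files, email and/or the screen.
    public static void main(String[] args) {
        try {
            Integer port = parsePortOrNull(args.length > 0 ? args[0] : "8080");
            System.out.println("starting on port " + port);
            // ... the rest of the program runs under this one umbrella ...
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
}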

Only truly exceptional things should use exceptions (see "The Pragmatic Programmer" by Andy Hunt and Dave Thomas). If you expect it, then it should be part of the normal program flow.

Just say no to complex error handling madness. It will only make you cry at 4am.


LOOPING BLOCKS

It gets a little messier when we consider more complicated syntax. For example, with loop constructs in Java we can look at the following syntaxes:

for(Iterator it = object.iterator(); it.hasNext();) {
    Object value = it.next();
    ...
}

and:

Iterator it = object.iterator();
while(it.hasNext()) {
    Object value = it.next();
    ...
}

This is a good "block-level" example because it's not necessarily obvious.

From one perspective the 'for' loop is worse than the 'while' loop because it is 'clever' to abuse the syntax by not using a third argument. Clever is never good. Also, in the 'while' version each line itself is simple.

But from a broader perspective the following are also true:

1. The declaration and loop construct all fit neatly into one line (and they reference the same underlying things).

2. The scope of the 'it' variable is limited to the scope of the loop block. (in the while, it is scoped for the whole function).

3. Traditionally 'for' loops are used for 'fixed' length iterations, while 'while' loops are used for variable, conditional iterations. E.g. we usually know in advance the number of times a 'for' loop will execute, but we don't for a 'while' loop. That makes it easier to quickly scan code and draw some assumptions about how it works.

In most cases, the "iterator" is really just syntactic sugar for:

for(int i = 0; i < container.size(); i++) {
    Object obj = container.get(i);
    ...
}


which is a fixed length traversal of a container. The iterator encapsulates the index variable, the size and the get call in one single object (hardly worth it, but that's another topic).

4. The new Java 5 for-each syntax allows:

for(Object obj : container) {
    ...
}

but I think they should have changed the loop name to 'foreach' like Perl to make it more obvious ;-)

5. It's more conventional to use 'for' loops for iterators in Java?

Thus, for a wider range of reasons, the 'for' syntax is more readable than the 'while' one. Someone scanning the two quickly is more likely to be less confused by the 'for' loop.

The biggest reason for me is #3, that a simpler loop would do just as well. In a sense, there is a precedence between the different constructs. Never use a 'while' loop when a 'for' loop will do. Never use a 'do-while' loop when a 'while' loop will do. Save the exotic stuff for exotic circumstances.

All functionally equivalent syntaxes should be seen as having a precedence. In that way, you should always gravitate towards the simplest syntax to get the job done.

In many languages there might be multiple ways to get things done, but you have to restrain yourself to consistently using the most simplified one. Just because you can, doesn't mean you should deliberately hurt yourself.


CONDITIONALS AND LOOPS

The whole point of subroutines, functions or procedures, was to allow the programmer to break off some piece of code for the purpose of making it more readable. Re-using it in multiple places in the program is a nice side-effect, but it is secondary to the ability to isolate and encapsulate some specific set of "related" instructions into a well-named unit.

At its very most basic level, one could advise newbie programmers to make a new function at the point of every conditional or loop in the program. Yep, I am saying that you might put each 'if', 'for' and 'while' statement in its own method.

That perspective is extreme of course, but if you start there, and then refactor backwards combining things together into the same functions because they are related -- in the way that one combines sentences together in the same paragraph -- then you won't be far wrong from having a clearly normalized call structure.

And if the arguments at each level are all similar, and the vocabulary of the variables in each similar function is also similar, then the overall structure of the code is nearly as clean as one can achieve.

Thus, if you have some huge routine with lots of conditionals and loops, the larger it is, the more you have to justify not having broken it down into smaller units. It happens, but usually only for complex processing like parsing, or generation.
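A small Java sketch of that "extreme" starting point, where each loop and conditional gets its own well-named method; the report example and the names are purely illustrative:

import java.util.List;

class ReportBuilder {
    private final StringBuilder report = new StringBuilder();

    // One loop, one purpose: walk the records.
    void build(List<String> records) {
        for (String record : records) {
            appendIfValid(record);
        }
    }

    // One conditional, one purpose: filter out the empty ones.
    private void appendIfValid(String record) {
        if (record.length() > 0) {
            report.append(record).append('\n');
        }
    }
}

From there, you refactor backwards, merging the methods that really belong together.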


WHOLE FILES

The acronym Don't Repeat Yourself (DRY) is well intentioned (see "The Pragmatic Programmer" by Andy Hunt and Dave Thomas), but repeating oneself is not necessarily the real problem.

The reason we don't want to repeat things is because the same information is located in two different places at the same time. As the code moves over time, the odds that both instances are changed correctly decrease rapidly. Thus if you have essentially the same detail in two different spots in the same program, you are just asking for trouble, and trouble is what you will always get.

A huge way around this is to apply encapsulation to bring the details together into the same location.

So, it's not a matter of not repeating yourself, it is more like "don't separate the repeated parts across the program". Bring them all together in one place. If you have to repeat, then do it in one location, as close together as possible.
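For instance, a small sketch of pulling a repeated detail into one single place; the date format here is just a hypothetical example of a detail used throughout a system:

import java.text.SimpleDateFormat;
import java.util.Date;

class Formats {
    // The single copy of the detail; every caller references it from here.
    static final String DATE_PATTERN = "yyyy-MM-dd";

    static String formatDate(Date date) {
        return new SimpleDateFormat(DATE_PATTERN).format(date);
    }
}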

But it is not just identical data; it is any 'related' or similar data that must stay in sync.

That, and only that is the driving nature behind packing things carefully into files. Any code that should be together when you examine it, should be placed together when you build it.

If the compiler or interpreter can pre-determine the correctness of the code, then it is the most technically strong answer. If you can verify it visually by looking at the line, then it is still quite strong. If you can see the whole context at once and verify it, that is also great.

If, in order to verify a line of code, you need to jump around to a dozen files, cross-reference a huge number of fields, and perform some other herculean acts, the code is weak, and likely to have bugs.

So, an element of very bad code is such that in order to make a simple and obvious change, you need to jump around a huge amount to a large number of different files.

Yes, I am saying that, as a consequence of this, some well-dispersed object-oriented code is not particularly readable, and as such is not the most elegant solution to the problem. There is nothing inherent in object-orientation that promotes or creates readable solutions. It's just a way of chopping things up. There are some types of programming problems that are more readable when implemented as object-oriented, but there are also some problems that become horribly more obfuscated.


WRAPPING IT UP

We so fear telling each other that there is a cleaner, more readable way to implement that line of code. But as a group working together, I think it is hugely important to have these types of discussions and to really spend time thinking about this. A good development team will choose to follow the same conventions.

You can't spend too much time trying to get everyone on the same page, because the consequences involve way more effort. Wrong code doesn't get re-written fast enough, gumming up the works and causing maintenance headaches which eat into available resources.

We never move fast enough to fix our early mistakes, so we drown in our own mess.

Somehow culturally we have to get beyond our own insecurities and be able to at least talk to each other in a polite manner about the small things as well as the big.

If readability is to any significant degree widely subjective, then any differences are just opinion. That would be bad, in that it leaves the door open to massive inconsistencies and the many assorted stupid problems caused by them. Two programmers never share the same "opinion", but they might share the same "right" practice.

Not that I really want to suggest it, but in many ways the most correct answer for readability is to presume it is universal; setting it up as a myth. The curious may seek and possibly find the truth, but hopefully along the way they would come to understand why a convenient myth is better. Sometimes we just have to choose between being right, or getting the team to work correctly.

I do know, directly from experience that it is horribly difficult to sit there with other programmers and nitpick their code for little violations of readability. But, and that can't be stressed too heavily, those small inconsistencies are far from trivial. They are in all ways little "complexity" bubbles that are brewing and forming, and getting larger all of the time. They are the start of the trouble that may potentially wreck the project. They are not trivial, even if they are small.

UPDATE: I think that it's just "readability", but someone has done an excellent job in codifying an approach towards mostly making code more readable, and called it 'Spartan Programming'. It's an excellent approach to normalizing code to make it readable and the examples are absolutely worth checking out:

http://ssdl-wiki.cs.technion.ac.il/wiki/index.php/Spartan_programming

Although, in this case, I do think that the overly-short variable names go too far and instead increase the base complexity. Acronyms, codes and mnemonics are far more 'complex' than their full unabbreviated names. Longer is better, so long as it's not too long.

Tuesday, June 10, 2008

Structuring Noun & Verb Data

The "secret" to programming is to boil down 14 million lines of code into a few hundred thousand that do exactly the same thing, without introducing a significant amount of "extra" complexity by having to deal with a nasty configuration. Less is more. But only if you haven't just traded it off for some other type of complexity. Less must really be less.

Some code is clearly redundant, so eliminating it is a straight-forward reduction in complexity. The rest however, only really goes away when you go up to the next level of abstraction.

Writing something in assembler, for instance, is a huge amount of work compared to writing it in Java. Java operates at a higher level, which in turn requires less information, which means less code to complete the same tasks. The higher level axioms accomplish the same underlying work (although often requiring more resources).

Complexity kills software, so smaller manageable "code" bases are an important part of really solving the problems.

With that in mind, I've been off wandering around contemplating different ways of abstracting code and data in systems. It may seem like an abstract tangent, but I've reached that point where I realize that we need to find the next major step upwards. Our systems are insanely complicated, and as anyone in the industry knows, are highly undependable. We like to ignore it, or pretend like it isn't true, but ...


DYNAMIC DYNAMITE

The anchor for most of our modern systems is a database of some type. We need to preserve the data far longer than the system is actually running, so we have turned to solutions like relational database management systems (RDBMS) for holding our information in-between running instances of the program. This type of technology provides control over the long-term persistence of all of the application data.

If you follow the relational data analysis theories fully, you can collect all of the data that will exist in your system, arrange it into entities and relationships (ER diagrams), and then normalize it into "fourth normal form". This will produce a balanced, generalized schema for storing your data. It will allow reasonable performance, and avoid data duplication problems.

The theory and know-how to do this have been around for a long time, so it might be surprising to anyone outside of the industry to find out that it is rarely done this way. Most databases are heavily denormalized.

One of the many problems with a normalized schema is that the data does not easily match the way it is modeled in the code, and is exceptionally "static". Any Turing complete programming language is far more expressive than the underlying set mechanics that forms the basis of relational theory. The difference in "expression" usually means applying a huge number of dangerous transformations on the data as it is loaded into the software, and then undoing them as it is saved back into the database.

Also, the code and the data are both changing rapidly, making it hard to keep the schema in sync with its expectations. Databases often support multiple applications and have specialized administrators that slow down the ability to change the schema. The code is easier to change, but quickly becomes a mess with too many quick hacks.

A great approach to getting around the 'staticness' of a relational database is the Object-Oriented Database (OODB). The "objects" in the code are simply made "persistent" within a transactional context, and 'voila', they are saved or restored. Simple and elegant. It is quite amazing for anyone who felt that relational databases were forged-in-hell just to explicitly waste time and torture programmers (OK, a slight exaggeration, they were just forged to waste time, it's not personal).

For all their strengths, OODBs have a fatal Achilles heel: the schema is completely dynamic. That may not seem like much, but a huge problem with commercial software comes from upgrading the old stuff into the new, in a way that is reliable. People really frown if they have lots of data and you screw this up. It's considered a big no-no.

A static SQL schema can exist as a set of commands in textual format, so it can be stored in a source code control repository and diff'ed with newer versions. In that way, with only a small amount of effort you can know with 100% certainty what exactly are all of the changes from one version of the system to the next. A big requirement for not screwing up a release.

With a truly dynamic database, there is no schema. The structure of the data and the content of the database have a huge latitude. They can be nearly anything. That would be fine if the code were that flexible, but it's very difficult to write extremely dynamic code. Any assumption about structure is a bug waiting to happen. This makes it very important to test all of the code properly with enough data.

If you want to know the differences between database versions you can compare actual sample databases or diff the code and reverse engineer the changes from there.

Both methods are far from 100% certain. Far enough, often enough that while dynamic databases make the coding fast, easy and accurate, they leave in their wake a huge upgrade problem; one that is easily fatal to most software. It is easy for some of the installations to contain unexpected data. Data that can not be handled.

Software is more than just a collection of code that gets released. The persistent data is very important too, and "operational insecurities" such as risky updates are potentially expensive. Damage control for a serious bug takes significant resources.

Code that is easy to build, but hard to update is problematic, to say the least. Updating can be a larger potential cost than development.


QUASI-RELATIONAL

I learned about dynamic databases the hard way, but fortunately my lessons were neither particularly long nor fatal. Even so, one quickly comes to appreciate the 'staticness' of a relational database at times. Static is bad, except when it is good.

After some experience with an OODB, my next attempt was to try to marry together aspects of the relational model with a considerably more dynamic foundation.

The hybrid I liked to call "quasi-relational".

At its highest level it was similar in design to any relational database: all of the major data entities were broken down into tables. The difference came at the lower level. Within the table structure, the columns were arbitrary and any row could have any possible combination or subset of them. Even more important, rather than make each column a 'primitive' type, the columns were actually allowed to contain complex sub-structures.

In this way you could stick some similar things all together in the same table, yet their underlying details could be quite different, and the associated information could be structural (avoiding the need for lots of little sub-tables).

This compacts the representations, and sets the handling on a column-by-column basis, providing a great deal of flexibility for the major entities within the system. The structural sub-column properties allow the code to be built very close to the underlying persistent representation, avoiding many costly and dangerous transformations.
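To make the shape of it concrete, here is a rough Java sketch of the idea (not the original implementation, and the names are mine): a table holds rows, and each row carries an arbitrary set of named columns whose values may themselves be nested structures:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QuasiRow {
    // Any combination of columns; a value may be a primitive or a nested structure.
    final Map<String, Object> columns = new HashMap<String, Object>();
}

class QuasiTable {
    final String name;
    final List<QuasiRow> rows = new ArrayList<QuasiRow>();

    QuasiTable(String name) { this.name = name; }

    QuasiRow addRow() {
        QuasiRow row = new QuasiRow();
        rows.add(row);
        return row;
    }
}

Two rows in the same table can have completely different columns, and a single column can hold a whole list or map of associated detail.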


SOME UNDERSTANDING

What my quasi-relational database allowed was for me to rapidly extend the system with increasingly complex data, but under some static constraints. Used wisely this provided ultra-fast access to a medium-sized complex data-store, something that is impossible for a "traditionally-designed" relational database to accomplish.

There was a danger of the earlier mentioned 'update' problem, but implementing some form of "static constraint enforcement" would have controlled that.

I could have used some declarative format to restrict the structure of the sub-data. I didn't, but that was because there were too few programmers to make it a useful requirement. With a limited number of changes, and frequent releases, the update problem is a lot more manageable.

For this particular system, transactions were not necessary (it was primarily a read-oriented data store), so most of the "tank-like" infrastructure surrounding the ACID properties in an RDBMS would have been total over-kill.

The two big things that were missing were the ability to do "adhoc" queries and the ability to allow access to the data by a reporting engine, although neither of these was really an issue. The database inherently had script access -- it was written in a scripting language -- and the overall system was a fully capable reporting engine, although most people missed that fact.

Overall, I'd have to say it was a big improvement from using a standard relational database, both at the coding level and at the operations level. It was fast, flexible and stable, and required no special management handling.

However, most programmers prefer pre-canned solutions for handling their persistence, so it is not a design idea that people would consider popular.

The essence is good, but what would be better is to get the same properties on a more "conventional" foundation. Programmers are not great risk takers -- a reason why techniques like brute force are chosen so frequently -- they like to stick to what is obvious, even if it has obvious problems (which they will deny with too much enthusiasm).


NOUNS AND VERBS

I've done a lot of deconstructing lately, and I keep coming back to the same underlying realization: data is essentially nouns or verbs. Well, at least the main entities are -- if you squint "and" ignore a few minor anomalies. For the sake of simplicity, we are allowed to do that.

Information breaks down nicely into 3D things being nouns, and any time or action based things being verbs. As a foundation this is enough to contain all of the data that we might choose to pile together in our computer.

This "perspective" comes into play if we are dealing with event based data, and it is particularly useful with historic systems.

If we were building up a repository of the value of financial instruments for example, the instruments and their contractual features or options are nouns, where the events in their lifetimes are verbs. The actions (verbs) that occur in a warehouse for an inventory system (nouns) or the changes in state (verbs) imposed by a user on a document (nouns) in a work-flow also fit well into this model.

It also suits things like the historic buffering of commands in an editor, but its real strength is in partitioning out the different data-types in more verb-based systems.

In fact the real strength is in the fact that all data fits into it in a simple and obvious way.

In English for example, the world is already broken down into syntax based on this split, so implicitly the decomposition is already done. There is a proper, simple sentence to describe everything we know about. You don't have to decide, it just comes that way by knowing how to talk about the data in a sentence.


SCHEMA MADNESS

The idea then is to take the noun/verb decomposition and get it working on top of a standard relational database. Fortunately, this is not horribly complex. We can start with the two obvious tables, Noun and Verb.

The Noun table holds all of the nouns, while the Verb table holds all of the verbs.

Two tables, however, do not a database make. Both tables need unique ids, presumably database-controlled long integers. They both need sub-types so that we can construct some taxonomy of the different types of nouns and verbs. Useful for creating sets of things.

Verbs are events, so we want to know why they occurred, and it is always wise to stamp some user information onto every row to track usage (databases should always 'force' separate accounts, and they should always audit all interactions, as table data, possibly a good place for stored procs or triggers or some other database contained interface).

Time issues can get quite complex in a relational database, so much so that time-series databases were invented in order to deal with handling events in a better fashion. However, we do want this in a relational foundation, but it is worth learning a bit about the features of time-series databases.

Continuing on, we really just need some conventions for dealing with the 'chronological' effects of the verb. Is it a moment in time, a range, a repeating moment or a repeating range? To simplify, we can drop repeats and store them as separate verb rows, but then the door opens on the issue of storing some type of interconnection between different rows. We will leave this issue for the moment and get back to it later.

Nouns are data about things. To be most useful, they are any (static) data that is expressible in a computer language. They can have any structure.

That opens the requirement for the full range of their expressibility to be equivalent to a bidirectional graph (in graph-theory terms, not in Excel terms). But structurally that opens the possibility for some overly-complex performance issues (O(n^n)-type problems), so again, for now, we'll just limit this to being a directed acyclic graph (DAG). Simply put, that means that there are some arrangements of data that we won't be able to store, but not many.

In that spirit, most of the data we will need to represent can fit into a simple model of containers and primitives. Primitive values are easy, as they are the same as most of the base types in the database or language: the usual suspects, strings, integers, floating point numbers, etc. There are some technological issues, and probably more than a few necessary custom primitive types such as money, but they can be sorted out later.

As for containers, it turns out that two structures, the array and the hash table, as a combined pair can be used to create most of the other possible known data-structures. Some usages are obvious, such as pushing and popping values in an array to get a stack; others, such as tree implementations based on storing references to the parents and children in a hash table, are a little more conceptually difficult, but still very understandable. Using arrays and hashes as foundational axiomatic primitives happens in Perl and JSON (Javascript), and probably a huge number of other modern languages.
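A quick Java sketch of that composition, using only lists (arrays) and hash tables; the node names are just for illustration:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Composites {
    public static void main(String[] args) {
        // A stack is just an array with push/pop at one end.
        List<String> stack = new ArrayList<String>();
        stack.add("first");                               // push
        String top = stack.remove(stack.size() - 1);      // pop

        // A tree is just a pair of hashes: node -> parent, node -> children.
        Map<String, String> parent = new HashMap<String, String>();
        Map<String, List<String>> children = new HashMap<String, List<String>>();
        parent.put("child", "root");
        children.put("root", new ArrayList<String>(Arrays.asList("child")));

        System.out.println(top + ", parent of child: " + parent.get("child"));
    }
}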

Getting back to the Nouns, we can convert most, if not all language expressible forms of information into some structure containing nothing more than arrays, hashes and a collection of base primitive types. These can, I admit be structurally complicated, recursive and quite deep at times, but the combination forms a really expressible base-line for any type of dynamic and arbitrary data.

Tying this back to a relational database, we can create one table that has an element id in it, an element type, and then a specific id for that type. If we create one type table for each primitive, one for hashes and another for arrays, the element table ties all of this together in a polymorphic way.

Given an element id, you can use that to traverse the structure and load all of the required dynamic rows from the database. You don't even need to worry about how many elements can be held by a noun, because the top-level element can always be an ordered set. Thus (nearly) any and all data can fit into the static schema, even if the actual form of the data is unknown until it is created. The very model of a modern major dynamic data-source.
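A rough Java sketch of that element model; the table and field names are my own invention, just to make the shape clear:

import java.util.List;
import java.util.Map;

enum ElementType { STRING, INTEGER, FLOAT, ARRAY, HASH }

// One row in the element table: a type tag plus the id of the row
// in the type-specific table that holds the actual value.
class Element {
    long id;
    ElementType type;
    long valueId;
}

// The array and hash "type tables" hold references to further elements,
// so a structure of any depth can be rebuilt by walking element ids.
class ArrayValue { long id; List<Long> elementIds; }
class HashValue  { long id; Map<String, Long> elementIds; }

// A noun or verb then only needs to carry its top-level element id.
class NounRow { long id; String subType; long topElementId; }
class VerbRow { long id; String subType; long nounId; long topElementId; /* plus time and user info */ }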

Falling back to links between verbs, our element structure gives us the compositional tools to be able to create any type of arrangement of collections for different events. We need only add it back as an underlying type of element.

And of course, if it's good for verbs, then it is also good for nouns.


A TANGLED WEB WE WEAVE

So now you're thinking "great, he's just given me the power to create the world's greatest web of tangled data with no way to control it".

The ugly part in most code is tying things like user records back to their persistent representations. I almost don't want to spoil the fun, but certainly in a language like Java you can use powerful tools like introspection to traverse an object structure converting it into nouns, verbs, elements, etc.

Distinguishing between nouns and verbs is an intelligence problem, so you'll have to explicitly map specific objects in your model to one or the other. Using some trick like inheritance is a clean way of making the programmer choose at compile time, without making the code fugly with extraneous config files.

Collections map to arrays, hashes or some combination. All underlying values map to primitives. The "arrangement" of the data is dynamic, but the overall set of building blocks used in the programming language is static. In that way, either by looking directly at the class, or by inheritance all of the incoming objects have a deterministic mapping towards the base structure which is written explicitly to the database.

Reading things back from the database is the same, with the core attribute that all of the data gathering starts from either a noun id or a verb one. If you bury that down in the model's implementation, inheriting all verbs from a Verb class and all nouns from a Noun class, then the ugly specific details can be dealt with when you create the infrastructure, and then forgotten about later.
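A minimal sketch of that arrangement; Employee, Promotion and the print-out stand in for whatever the real model and write logic would be:

import java.lang.reflect.Field;
import java.util.Date;

abstract class Noun { long id; }
abstract class Verb { long id; }

// The programmer chooses noun or verb at compile time, by inheritance.
class Employee extends Noun {
    String name;
    String title;
}

class Promotion extends Verb {
    long employeeId;
    Date when;
}

// The infrastructure walks any model object by introspection.
class Persister {
    void save(Object model) throws IllegalAccessException {
        for (Field field : model.getClass().getDeclaredFields()) {
            field.setAccessible(true);
            Object value = field.get(model);
            // ... map the value to a primitive, array or hash element and write it out ...
            System.out.println(field.getName() + " = " + value);
        }
    }
}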


GOING FURTHER

Coming back to some of my earlier problems, there are still a couple of details left to get sorted out. The first is how to apply some 'staticness' to this mess to make system updates more manageable. A fancy database is no good if you have to keep tossing it away for every release. The second problem is how to leverage the two other great qualities of a relational database that I have seemingly thrown away: adhoc queries and reporting engines.

For the first, in my description of a quasi-relational database I described my ability to create a "static structure enforcer". Simply put, the database stores some meta-data about what arrangements of the arbitrary structure are acceptable. Before writing, and after reading these constraints are applied to the final transformation and an error is thrown if they are violated. Testing would detect transgressions.

That would solve the problem, but it would be far better to just let the structures fill out into their own format. The reason for this will be obvious in a moment.

For the second question, I'm pretty sure that SQL and a reporting engine will be unhappy with this schema. That's not good. But what is good is that this is a normalized abstraction of the underlying data, which means that the type of structure that both SQL and some reporting software would like, is still there, it is just buried.

We can release it by utilizing views. The simplest idea is to write some code that runs through the different nouns and verbs, creating tables, sub-tables, sub-sub-tables etc. for each of the different "unique" things that it discovers.

In a sense, any unique sub-type would normally end up in its own entity table. Any of the underlying data would normally get joined into sub-tables. Any of the ranges of data would end up in a domain table. All of these 'transformations' are simple, deterministic and computable.

In other words, you run some scheduled job to traverse the structure and create a large number of views on the original tables. The combination of all of these views is essentially a normalized schema that is mapped over the noun/verb schema.

That potentially leaves a few synchronization issues with the tables, but these days you can get around that by making automated nightly copies and running most of the reports on day-old data. If mining is not real time, it still shouldn't hurt.

The final big problem is that the noun, verb and element tables will be growing at a huge rate. Partitioning them is obvious, there is already a subtype, but one would prefer to build in the mechanics quietly in the background so the entire database is seamless from a higher perspective. That's probably something worth pressuring the vendor to solve for you. A generalized automatic partitioning (that is invisible to the code) is the best solution.


IN THE END

Dynamic is good. Particularly if you are talking about relatively small amounts of very complex data. However, not being able to reliably update existing persistent data is a far bigger problem.

We keep falling back on static approaches, but these lead to huge unwieldy brute force implementations.

Much like a time vs. space trade-off, the static vs. dynamic ones are not easy choices. You must always give up some good attributes in order to get the best balanced solution.

Implementing a generalized noun/verb model in a relational database is a mix between the different approaches. It should allow flexible data handling, but also provide some "staticness" to the structure to make it easier to write handling code.

Handled well, it should allow different database silos to overlap in the same underlying data source. Realistically, this should open up the base data to allow for more sophisticated mining attempts. Collecting data is easy, combining it is hard.