Tuesday, June 10, 2008

Structuring Noun & Verb Data

The "secret" to programming is to boil down 14 million lines of code into a few hundred thousand that do exactly the same thing, without introducing a significant amount of "extra" complexity by having to deal with a nasty configuration. Less is more. But only if you haven't just traded it off for some other type of complexity. Less must really be less.

Some code is clearly redundant, so eliminating it is a straight-forward reduction in complexity. The rest however, only really goes away when you go up to the next level of abstraction.

Writing something in assembler for instance, is a huge amount of work compared to writing it in Java. Java operates at a higher level, that in turns requires less information, which means less code to complete the same tasks. The higher level axioms accomplish the same underlying work (although often requiring more resources).

Complexity kills software, so smaller manageable "code" bases are an important part of really solving the problems.

With that in mind, I've been off wondering around contemplating different ways of abstracting code and data in systems. It may seem like an abstract tangent, but I've reached that point where I realize that we need to find the next major step upwards. Our systems are insanely complicated, and as any one in the industry knows, are highly undependable. We like to ignore it, or pretend like it isn't true, but ...


DYNAMIC DYNAMITE

The anchor for most of our moderns systems is a database of some type. We need to preserve the data far longer than the system is actually running, so we have turned to solutions like relational database management systems (RDBMS) for holding our information in-between running instances of the program. This type of technology provides control over the long-term persistence for all of the application data.

If you follow the relational data analysis theories fully, you can collect all of the data that will exist in your system, arrange it into entities and relationships (ER diagrams), and then normalized it into "fourth normal form". This will produce a balanced generalized schema for storing your data. It will allow reasonable performance, and avoid data duplication problems.

The theory and know-how to do this have been around for a long time, so it might be surprising to anyone outside of the industry to find out that it is rarely done this way. Most databases are heavily denomalized.

One of the many problems with a normalized schema is that the data does not easily match the way it is modeled in the code, and is exceptionally "static". Any Turing complete programming language is far more expressive than the underlying set mechanics that forms the basis of relational theory. The difference in "expression" usually means applying a huge number of dangerous transformations on the data as it is loaded into the software, and then undoing them as it is saved back into the database.

Also, the code and the data are both changing rapidly, making it hard to keep the schema in sync with its expectations. Databases often support multiple applications and have specialized administrators that slow down the ability to change the schema. The code is easier to change, but quickly becomes a mess with too many quick hacks.

A great approach to getting around the 'staticness' of a relational database is the Object-Oriented Database (OODB). The "objects" in the code are simply made "persistent" within a transactional context, and 'viola', they are saved or restored. Simple and elegant. It is quite amazing for anyone who felt that relational databases were forged-in-hell just to explicitly waste time and torture programmers (OK, a slight exaggeration, they were just forged to waste time, it's not personal).

For all their strengths, OODBs have a fatal Achilles heel: the schema is completely dynamic. That may not seem like much, but a huge problem with commercial software comes from upgrading the old stuff into the new, in a way that is reliable. People really frown if they have lots of data and you screw this up. It's considered a big no-no.

A static SQL schema can exist as a set of commands in textual format, so it can be stored in a source code control repository and diff'ed with newer versions. In that way, with only a small amount of effort you can know with 100% certainty what exactly are all of the changes from one version of the system to the next. A big requirement for not screwing up a release.

With a truly dynamic database, there is no schema. The structure of the data and the content of the database have a huge latitude. They can be nearly anything. That would be fine if the code were that flexible, but it's very difficult to write extremely dynamic code. Any assumption about structure is a bug waiting to happen. This makes it very important to test all of the code properly with enough data.

If you want to know the differences between database versions you can compare actual sample databases or diff the code and reverse engineer the changes from there.

Both methods are far from 100% certain. Far enough, often enough that while dynamic databases make the coding fast, easy and accurate, they leave in their wake a huge upgrade problem; one that is easily fatal to most software. It is easy for some of the installations to contain unexpected data. Data that can not be handled.

Software is more than just a collection of code that gets released. The persistent data is very important too, and "operational insecurities" such as risky updates are potentially expensive. Damage control for a serious bug takes significant resources.

Code that is easy to build, but hard to update is problematic, to say the least. Updating can be a larger potential cost than development.


QUASI-RELATIONAL

I learned about dynamic databases the hard way, but fortunately my lessons where neither particularly long nor fatal. Even so, one quickly comes to appreciate the 'staticness' of a relational database at times. Static is bad, except when it is good.

After some experience with an OODB, my next attempt was to try to marry together aspects of the relational model with a considerably more dynamic foundation.

The hybrid I liked to call "quasi-relational".

It was at it's highest level similar in design to any relational database, all of the major data entities where broken down into tables. The difference came at the lower level. With the table structure, the columns were arbitrary and any row could have any possible combination or subset of them. Even more important, rather than make each column a 'primitive' type, they were actually allowed to contain complex sub-structures.

In this way you could stick some similar things all together in the same table, yet their underlying details could be quite different, and the associated information could be structural (avoiding the need for lots of little sub-tables).

This compacts the representations, and sets the handling on a column-by-column basis, providing a great deal of flexibility for the major entities within the system. The structural sub-column properties allow the code to be built very close to the underlying persistent representation, avoiding many costly and dangerous transformations.


SOME UNDERSTANDING

What my quasi-relational database allowed was for me to rapidly extend the system with increasingly complex data, but under some static constraints. Used wisely this provided ultra-fast access to a medium-sized complex data-store, something that is impossible for a "traditionally-designed" relational database to accomplish.

There was a danger of the earlier mentioned 'update' problem, but implementing some form of "static constraint enforcement" would have controlled that.

I could have used some declarative format to restrict the structure of the sub-data. I didn't, but that was because there were two few programmers to make it a useful requirement. With a limited number of changes, and frequent releases the update problem is a lot more manageable.

For this particular system, transactions were not necessary (it was primarily a read-oriented data store), so most of the "tank-like" infrastructure surrounding the ACID properties in an RDBMS would have been total over-kill.

The two big things that were missing were the ability to do "adhoc" queries and the ability to allow access to the data by a reporting engine, although neither of these was really an issue. The database inherently had script access, it was written in a scripting language and the overall system was a fully capable reporting engine, although most people missed that fact.

Overall, I'd have to say it was a big improvement from using a standard relational database, both at the coding level and at the operations level. It was fast, flexible and stable, and required no special management handling.

However, most programmers prefer pre-canned solutions for handling their persistence, so it is not a design idea that people would consider popular.

The essence is good, but what would be better is to get the same properties on a more "conventional" foundation. Programmers are not great risk takers -- a reason why techniques like brute force are chosen so frequently -- they like to stick to what is obvious, even if it has obvious problems (which they will deny with too much enthusiasm).


NOUNS AND VERBS

I've done a lot of deconstructing lately, and I keep coming back to the same underlying realization: data is essentially nouns or verbs. Well, at least the main entities are -- if you squint "and" ignore a few minor anomalies. For the sake of simplicity, we are allowed to do that.

Information breaks down nicely into 3D things being nouns, and any time or action based things being verbs. As a foundation this is enough to contain all of the data that we might choose to pile together in our computer.

This "perspective" comes into play if we are dealing with event based data, and it is particularly useful with historic systems.

If we were building up a repository of the value of financial instruments for example, the instruments and their contractual features or options are nouns, where the events in their lifetimes are verbs. The actions (verbs) that occur in a warehouse for an inventory system (nouns) or the changes in state (verbs) imposed by a user on a document (nouns) in a work-flow also fit well into this model.

It also suits things like the historic buffering of commands in an editor, but its real strength is in partitioning out the different data-types in more verb-based systems.

In fact the real strength is in the fact that all data fits into it in a simple and obvious way.

In English for example, the world is already broken down into syntax based on this split, so implicitly the decomposition is already done. There is a proper, simple sentence to describe everything we know about. You don't have to decide, it just comes that way by knowing how to talk about the data in a sentence.


SCHEMA MADNESS

The idea then is to take the noun/verb decomposition and get it working on top of a standard relational database. Fortunately, this is not horribly complex. We can start with the two obvious tables, Noun and Verb.

The Noun table holds all of the nouns, while the Verb tables holds all of the verbs.

Two tables however, does not a database make. Both tables need unique ids, presumably database controlled long integers. They both need sub-types so that we can construct some taxonomy of the different types of nouns and verbs. Useful for creating sets of things.

Verbs are events, so we want to know why they occurred, and it is always wise to stamp some user information onto every row to track usage (databases should always 'force' separate accounts, and they should always audit all interactions, as table data, possibly a good place for stored procs or triggers or some other database contained interface).

Time issues can get quite complex in a relational database, so much so that time-series databases were invented in-order to deal with handling events in a better fashion. However, we do want this in a relational foundation, but it is worth learning a bit about the features of time-series dbs.

Continuing on, we really just need some conventions for dealing with the 'chronological' effects of the verb. Is it a moment in time, a range, a repeating moment or a repeating range? To simplify, we can drop repeats and store them as separate verb rows, but then the door opens on the issue of storing some type of interconnection between different rows. We will leave this issue for the moment and get back to it later.

Nouns are data about things. To be most useful, they are any (static) data that is expressible in a computer language. They can have any structure.

That opens the requirement for the full range of their expressibility to be equivalent to a bidirectional graph (in graph-theory terms, not in Excel terms). But structurally that opens the possibility for some overly-complex performance issues (O(n^n) - type problems) so again for now, we'll just limit this to being a directed acyclic graph (DAG). Simply put that means that there are some arrangements of data that we won't be able to store, but not many.

In that spirit, most of the data we will need to represent can fit into a simple model of containers and primitives. Primitive values are easy as they are the same as the most of the base types in the database or language, the usual suspects, strings, integers, floating point numbers, etc. There are some technological issues and probably more than a few necessary custom primitive types such as money, but they can be sorted out later.

As for containers, it turns out that two structures, the array and the hash table, as a combined pair can be used to create most of the other possible known data-structures. Some usages are obvious, such as pushing and popping values in an array to get a stack, others such as tree implementations based on storing references to the parents and children in a hash table are a little more conceptually difficult, but still very understandable. Using arrays and hashes as foundational axiomatic primitives happens in Perl and JSON (Javascript), and probably a huge number of other modern languages.

Getting back to the Nouns, we can convert most, if not all language expressible forms of information into some structure containing nothing more than arrays, hashes and a collection of base primitive types. These can, I admit be structurally complicated, recursive and quite deep at times, but the combination forms a really expressible base-line for any type of dynamic and arbitrary data.

Tying this back to a relational database, we can create one table that has an element id in it, a element type and then a specific id for that type. If we create one type table for each primitive and one for hashes and another for arrays, the element table ties all of this together in a polymorphic way.

Given an element id, you can use that to traverse the structure and load all of the required dynamic rows from the database. You don't even need to worry about how many elements can be held by a noun, because the top-level element can always be an ordered set. Thus (nearly) any and all data can fit into the static schema, even if the actual form of the data is unknown until it is created. The very model of a modern major dynamic data-source.

Falling back to links between verbs, our element structure gives us the compositional tools to be able to create any type of arrangement of collections for different events. We need only add it back as an underlying type of element.

And of course, if it's good for verbs, then it is also good for nouns.


A TANGLED WEB WE WEAVE

So now your thinking "great, he's just given me the power to create the world's greatest web of tangled data with no way to control it".

The ugly part in most code is tying things like user records back to their persistent representations. I almost don't want to spoil the fun, but certainly in a language like Java you can use powerful tools like introspection to traverse an object structure converting it into nouns, verbs, elements, etc.

Distinguishing between nouns and verbs is an intelligence problem, so you'll have to explicitly map specific objects in your model to one or the other. Using some trick like inheritance is a clean way of making the programmer choose at compile time, but not making the code fugly with extraneous config files.

Collections map to arrays, hashes or some combination. All underlying values map to primitives. The "arrangement" of the data is dynamic, but the overall set of building blocks used in the programming language is static. In that way, either by looking directly at the class, or by inheritance all of the incoming objects have a deterministic mapping towards the base structure which is written explicitly to the database.

Reading things back from the database is the same, with the core attribute that all of the data gathering starts from either a noun id, or a verb one. If you bury that down in the model's implementation, inheriting all verbs from a Verb class and all Nouns from a noun class, then the ugly specific details can be dealt with when you create the infrastructure, and then forgotten about later.


GOING FURTHER

Coming back to some of my earlier problems, there are still a couple of details left to get sorted out. The first is how to apply some 'staticness' to this mess to make system updates more manageable. A fancy database is no good if you have to keep tossing it away for every release. The second problem is how to leverage the two other great qualities of a relational database that I have seemingly thrown away: adhoc queries and reporting engines.

For the first, in my description of a quasi-relational database I described my ability to create a "static structure enforcer". Simply put, the database stores some meta-data about what arrangements of the arbitrary structure are acceptable. Before writing, and after reading these constraints are applied to the final transformation and an error is thrown if they are violated. Testing would detect transgressions.

That would solve the problem, but it would be far better to just let the structures fill out into their own format. The reason for this will be obvious in a moment.

For the second question, I'm pretty sure that SQL and a reporting engine will be unhappy with this schema. That's not good. But what is good is that this is a normalized abstraction of the underlying data, which means that the type of structure that both SQL and some reporting software would like, is still there, it is just buried.

We can release it by utilizing views. The simplest idea is to write some code runs through the different nouns and verbs, creating tables, sub-tables, sub-sub-tables etc. for each of the different "unique" things that it discovers.

In a sense, any unique sub-type would normally end up in its own entity table. Any of the underlying data would normally get joined into sub-tables. Any of the ranges of data would end up in a domain table. All of these 'transformations' are simple, deterministic and computable.

In order words, you run some scheduled job to traverse the structure and create a large number of views on the original tables. The combination of all of these views is essentially a normalized schema that is mapped over the noun/verb schema.

That potentially leaves a few synchronization issues with the tables, but these days you can get around that type making automated nightly copies and running most of the reports on day-old data. If mining is not real time, it still shouldn't hurt.

The final big problem is that the noun, verb and element tables will be growing at a huge rate. Partitioning them is obvious, there is already a subtype, but one would prefer to build in the mechanics quietly in the background so the entire database is seamless from a higher perspective. That's probably something worth pressuring the vendor to solve for you. A generalized automatic partitioning (that is invisible to the code) is the best solution.


IN THE END

Dynamic is good. Particularly if you are talking about relatively small amounts of very complex data. However, not being able to reliably update existing persistent data is a far bigger problem.

We keep falling back on static approaches, but these lead to huge unwieldy brute force implementations.

Much like a time vs. space trade-off, the static vs. dynamic ones are not easy choices. You must always give up some good attributes in order to get the best balanced solution.

Implementing a generalized noun/verb model in a relational database is a mix between the different approaches. It should allow flexible data handling, but also provide some "staticness" to the structure to make it easier to write handling code.

Handled well, it should allow different database silos to overlap in the same underlying data source. Realistically, this should open up the base data to allow for more sophisticated mining attempts. Collecting data is easy, combining it is hard.