Monday, December 22, 2008

My Stack Overfloweth

A couple of weeks ago I finally got around to visiting Stack Overflow, the new Q&A site by Joel Spolsky and Jeff Atwood. It's a simple concept that I found both fun and engaging, a place where users can ask and answer specific programming questions.

It's easy to submit new questions, so there is a constant stream of inquiries to be answered. For each question you ask or answer you give, you get a collection of reputation points (some positive, some negative, depending on content). As you answer more questions, you get more abilities, you can do more things on the site. Also, there are these cutesy little badges that you can earn for specific milestones, adding an extra amusing level to it.

It's a great site, completely engaging with a decent interface. It is simple enough to use, and it has the feel of being a cross between a social site, a news feed and a repository of answers. In concept I think it's a great idea, but I was finding that in practice there were some interesting issues.

Sites come and go rapidly on the Internet; it is a very volatile medium. Things move very quickly. This makes it an interesting place to observe how people interact with each other and their software. You don't have to spend a long time to get a good sense of what is happening. Web 2.0 has provided a large number of good public examples for analysis.

For this entry I figured that although I like Stack Overflow, the dynamics were interesting enough that I could spend a little time digging into the types of problems that may occur with the site. It's a good place for analysis, and we usually don't have to wait long to see if it's right or not.


My short membership on the site has been fun, but not without its share of controversy. Not being one to lurk for too long, I jumped in pretty quickly with a few questions and a fair number of answers.

What surprised me was that my first question "Success vs. Freedom" got into trouble right away. The title was re-edited to "What kind of programming method do you prefer? Success vs. Freedom", and then a discussion broke out in the comments about whether or not it was an appropriate question.

The mandate of the site is to answer "programming" questions, and it seems that some of the "stackers" as they like to call themselves, take that very literally, while others would prefer a wider latitude. A problem with this is that roving stackers with high reputation are allowed to enforce any rules as they see fit. If they have enough reputation points, they are allowed to alter the questions. The site content is completely controlled by its users.

My first question was closed initially because there was no way to answer it with just facts, all answers were inherently subjective. However it was re-opened shortly thereafter by another stacker, and managed to stay around for a while.

My next question was trying to provoke a discussion on this issue, clearly against the rules, but it too failed. It was "Are there legitimate programming questions that cannot be answered by facts?", a loaded question making reference to the fact that the FAQ says any question asked should be expecting just facts as answers.

The problem with Computer Science is that it is a highly subjective industry at the moment, we have precious few 'facts' that are not actually contained in a manual somewhere on the web. I wanted to get people thinking about the scope of this site, and whether or not facts are just far too limiting.

In a very real sense the questions with subjective answers are the ones that are above and beyond just the simple RTFM questions. Facts might be nice, but they are too simple to really be helpful, at least not in the long run.

Most surprisingly, my third question, which was really relevant to my current working situation, got shut down immediately. I recently switched my main programming workstation to a quad core chip and I was wondering how to best set it up. Here for the first time I really needed an answer to a real problem, yet my query "What are the best performance options in XP for a quad core chip?" was shut down right away for not being relevant to programming.

The comments suggested that a cheap spin on the question -- like following one of those fortune cookie tricks where one adds "in bed" to the end of each fortune -- might have allowed the question to remain open. In this case one might simply tack on "for programming" to each question and possibly get away with it. I was thinking about trying that, but it just seemed far too lame.

Judging by the range of the various questions in the site, there seemed to be some disagreement going on, as to how to interpret the FAQ. As often plagues these sites, the old guard is struggling against the new guard for control of the content. Wherever there are people, there are disagreements.


The rules of the site are very specific, but some stackers feel that the interpretation should be lenient, while others want a very strict enforcement. That matches any large group of people, where the more serious people often get worked up about the little details, while others advocate a more relaxed approach to life.

The conservative members do have the noble goal of trying to keep the site from degenerating. Clay Shirky famously pointed out that the group is its own worst enemy, so it's likely they are right; if left unchecked the content on the site will go downhill quickly. Still, one has to be careful with simplifying any situation to that degree. Behavior is dependent on the context and the type of people involved.

A site like this is mostly made up of two types of people, those seeking answers and those wanting to answer. The dynamics of those two groups will either help the site to grow, or let it wither away.

The rules dictate little discussion, so the stream of questions is forced to be very specific and more or less junior in nature. Fact-based answers are really things that should be documented in help pages somewhere, thus the site's mandate is to essentially stick to well-known documented items or items that were missed for some reason.

Keeping the scope of the current questions down to a tight-knit range of purely factual questions will definitely keep out the fracas, but it will also heavily constrain both sets of users of the system.

New programmers go to the web to avoid reading documentation, and as I said, that makes up the majority of fact-based questions. The questions are simple, and direct, yet they belong to some larger, unknown context. They weren't asked at random. That means the usage of the answer may actually be more significant than the question itself. We'll get back to that point a bit later.

Also of interest, those answering the questions are not necessarily, as one might assume, the seniors in the industry. In fact it's more likely that they themselves are still in their first five to ten years of experience, just starting their careers. It's a leap, but not a far one since for most programmers who have seen several generations of technology come and go, there is a tendency to be more elastic with their knowledge.

When you first get into coding, you're driven to learn everything possible, but as time goes on and your favorite technologies get replaced again and again, you start to wise up. Why go deep into the technology if it's just going to get replaced again? There are, no doubt, many older senior programmers that maintain their high energies, but they are unlikely to be the majority. Most of us have seen it once too often to be impressed by seeing it again (in a crappier form). We lose interest in the little details as we get more experienced. You can't help it.

Thus, most of the people capable of answering the more modern fact-based questions are the ones that have learned that information recently and are still interested in showing off how much they know. At the moment, it's fairly safe to assume that most of the people attracted to the site are juniors and intermediates.

Sure there are lots of seniors (you see this in the "How old are you, and how old were you when you first started coding?" question that they keep trying to kill), but their answers are probably more selective, and subjective. And they'll drift away faster.

Thus the conservative approach to keeping the site pure by weeding out questions is to continually put pressure on the site to answer just fact-based questions. But these types of questions, for both the asking and answering sides, tend to drive the content down to a more junior level. A level that competes more heavily with the rest of the Internet, and also tends to become stale quickly.

The conservative side is perpetually in jeopardy of choking off the good and really useful questions, primarily because they don't have a long enough horizon to distinguish between the real issues and the faux ones. My question on XP settings was a good example, as it very much relates to programming, yet it doesn't explicitly mention the word 'programming'. A limited scope means limited usefulness.

Still, there always needs to be some type of constraint put over the content. If the site becomes too unruly, the signal to noise ratio will cut off its usefulness. There will be too much junk to make using it worthwhile. There are enough degenerated sites out there to know that it is a very probable direction for this site.

So we can speculate that the site leaning too heavily in either direction will kill it. However, there are also other problems that will build up as significant issues over time.


A long long time ago, when I was a junior programmer, a coworker came in hurriedly and asked me a technical question. I can't remember the details, but it was an easy question for me to answer. Just as I was starting to, my officemate, a really skilled, senior, battle-hardened consultant, jumped in and started asking questions. They were mostly in the form of "why do you want to know that?". As he dug, it turned out that my coworker was clearly on the wrong track. A dangerous track that was minutes away from doing something dreadful to the system.

If we hadn't dug into the issues, I would have contributed to causing a huge mess. My coworker had glommed onto that specific question, because they were basing their knowledge on a series of incorrect assumptions. The question, being simple was actually a huge indicator of a serious misunderstanding. If they understood what was going on, they never, never would have asked that question.

The consultant, of course, chewed me out for having very nearly supplied the fatal piece of information. He told me, and I always remembered, that as a professional it was my duty to understand how people were going to use the information, before I was to give it out easily. In a case like the above, he said it was important to understand why they were asking the questions, because in many ways that was far more important than the question itself.

These days, that attitude feels very selfish, but there are certainly enough things out there that juniors shouldn't know until they are ready to understand them. They need the information, perhaps, but they need the surrounding context far more.

If we give out information, without giving out knowledge, then there is a huge chance that that information will be used poorly or in a wrong way. And it's exactly that which has caused a shift for the worse in programming. It is too easy to get past simple problems, without having the prerequisite knowledge first. We're teaching toddlers to drive cars before they can even walk, and then we're surprised by the number of accidents.


The web brought with it a huge boon for programmers. Manuals were no longer hard-to-get things, hidden in some corner or on another floor. We no longer have to pore over them for hours to seek the smallest of answers to the biggest of questions. All of a sudden a quick search and *poof* you've got your answer, or at least a discussion of the answer.

That change is a good thing in that programming has become easier and diagnosing problems has become faster and more effective. However, all things good come with consequences ...

The web has made it easier to program. In fact the web has made it easier to know far less about programming. And in many ways this is a big problem. We are seeing far more code with a higher degree of artificial complexity that should never have been allowed into a production site. Sites like WTF are getting overloaded with content. Stuff that should have died because the programmer was flailing badly, suddenly gets released because the coder managed to route around the types of bugs that should have been fatal. Bugs, oddly have their value sometimes.

We've become so addicted to this fast-food information, and it's having a huge effect. It's highly dangerous to give all of these programmers the simple stupid answers if they do not understand the content of what they are doing. We're not making software development better, we're ruining it. We're making it more possible for partially trained programmers to route around problems that they have no idea how to actually solve.

When you respond with "set the CosmicInverterThreshold variable to 24" it allows the programmer to skip past understanding why that is the correct parameter, without really knowing what a cosmic inverter is, or why the old value was causing a malfunction. They can just happily ignore the obvious problems, and hopefully the non-obvious ones will disappear too.

We've been seeing the effects of this in the industry for years. Yes, the web opened up the door to quick information, but the quality of the underlying tools has degenerated. We're losing touch with how the machines work, trading it away for just poorly patching problems in a debugger.

It's nice on the one hand for making development faster and easier, but the underlying complexity of our systems has clearly outpaced our abilities. How many programmers actually know what is really happening in their OS? How much is just a big mystery? Are the little elves that move around the bits red or green?

This not only allows weaker programmers to create things that should not have been created, it also helps to preserve libraries and utilities that should have died a natural death because they were poorly built. Too much information with too little knowledge has become a serious problem in most disciplines, but a particularly severe one in software.

What is worse, of course is when you see those posts that ask or wonder if you need to learn any theory to program. The current answer: "no, you don't", but it should really be: "yes you do!"

You should absolutely have that knowledge, particularly if you are going to write something serious, but now because of the web, you can just skip over it and forget about the consequences. It's as if we've given the ability for home hobbyists to start building apartment buildings, in any manner they choose. Some buildings, of course, will be nice and well-built, but the rest?


Quick answer sites serve up huge plates of junk food knowledge. Fast little tidbits that are incomplete platitudes passing for real answers, for real knowledge. The problem is that a steady diet of this type of fluff foolishly convinces most people that they don't need the real thing.

And like anything in this world, there is a huge distance between essentially understanding something, and really understanding the details. Barring those with photographic memory, most people don't really get the details unless they sit within a bigger, more extensive conceptual understanding.

That is, you do need to know the theory, so that you can really understand the practice. No matter how well you know the practice, you cannot fill in the missing bits without understanding the encompassing theory. And it's what you don't know that will cause all the problems.

You might get by, but you'll never be an expert in something if your only real tangible knowledge is relatively light. Sure you can program for instance, but without a bigger deeper theoretical background you probably will have trouble as an architect, for example. Precisely because you'll follow the popular trends, even when they are way off, and you won't be able to see why they are dysfunctional (until it's way too late). Software development is full of a lot of bad advice, and the only cure is to have a deep understanding.

This problem already occurs on a frequent basis in software. The industry relies on a huge number of domain specific programmers to write very specialized code for specific applications. That is fine, but if you've ever encountered a project where it's all domain specific coders, 100% of them, and no classically trained Computer Scientists, you're likely to find very esoteric code that fails on a very basic level. The domain specific stuff may be quite acceptable for the core functionality, but the whole package is often weak and unstable because of what's not known.


Knowledge isn't just remembering a few facts or being able to ape a few movements. Information might be what you get, but when you can place something in its context, its real context, then you truly know it.

We take everything we learn and we use that to drive our work. If there are huge gaps in what we know, then our efforts are more dependent on luck than they should or need to be. That, in an industry like software where we already rely on a huge amount of luck and guesswork to see us through, puts any project driven by people without a lot of knowledge in a dangerous position.

We all know luck pays off sometimes, somebody wins the lottery, and good things sometimes happen, but the difference between an amateur and a professional is easily the amount of luck they rely on to be successful in their endeavors. Anybody can get lucky, and many people do. But most do not.

Software is exceptionally complicated, and now it's packed into endless layers, each a little more complex than the last. For most of the code out there, there are plenty of real theories on which the original designs were based. The industry has a great deal more theoretical knowledge than most programmers are aware of. We know more than we think we do; it's just lost in old papers, private mailing lists or not disseminated to the masses. We're probably the worst industry for having the practitioners ignore most of the common knowledge.

Knowledge is really what most juniors need when they start asking questions; however, easy access to quick answers allows them to bypass that need, and get back to flailing at the keyboards.


Overall, despite the problems with the content I really like the existing site. Stack Overflow is entertaining; however, that is not necessarily a good thing. Fun may draw in a number of people in the short term but it's not enough to keep them hanging around over time. Fun quickly gets dull, and the users move on to the next thing.

It takes more than fun to keep users hanging around for the long term. Once the site is no longer the next cool thing, then it faces the real challenges. You need more than just fun to give people a real reason for coming back to the site. The content must go beyond just simple fact-based questions. If people aren't getting something more substantial, they'll quickly move on to the next form of entertainment.

Software, in its current state is a mess and the only real way out of that space is by evolving our knowledge. The reason this isn't happening is that every new generation of coders is rewriting, in a new technology, the same old stuff over and over. And in some weird ways, some strange memes like MVC become sacred cows, while other more significant ideas get lost and reinvented poorly.

What we really need is to honestly discuss our profession and to explore what works and what doesn't. It sounds simple enough, but programmers always fall back into defensive positions, thus making it nearly impossible to really discuss things.

One reason progress is so slow with software development is that our culture is to hate something or believe in it 110%. There's no in between, so questioning becomes a lack of faith, and thus we get very few real objective discussions. This has been stagnating the industry for years. Once X has become the technology du jour, everyone jumps on the bandwagon or completely hates it. It never gets viewed objectively for both its good and bad qualities. Once it's old, it's 100% crap, and we never learn any lessons from the good parts.

We just go round in circles. Software will never improve until we find a way to improve it.

And we'll never improve it, until we find a way to discuss the strengths and weaknesses, honestly without rhetoric.


Discussion and deeper knowledge are huge issues that I think Stack Overflow will need to confront if it wants to remain current. These are the real value-adds that a site like this needs in order to continuously draw in people over a long period of time, and they are the real value-adds that our industry also needs in order to break out of its current mediocre bonds and start utilizing the potential of hardware.

With that in mind, there are a few simple things I think would really help kick the site up to a more useful level:

- allow polls (and let other people mine the data)

- set up a section for definitions (with the best ones floating to the top)

- limit the site to all of software development, not just programming

- open up separate domain-specific sections

- open up separate sections for environment issues (chairs, desks) and tools.

- allow discussions, but limit them; subjective stuff is very important to our industry.

As far as discussions go, find some way to allow them, yet contain them at the same time. Everyone should get their say, but just once or twice at most (I like how the current comments are way too short, so you can't respond with an essay).

And most importantly:

- Encourage alternative ways of thinking, expressing, working, etc.

The answers we have are by no means at a significantly high state. Software is an immature industry, which eventually will grow tremendously over the centuries. What we know now will get supplanted with better technologies over time; it's just a matter of how fast.

No matter what happens with Stack Overflow, I'm finding it fun right now, so I'll hang around for a bit. But no doubt, like Slashdot, Facebook, Digg and most of the others, when the novelty wears off, I'll fade away. I always lose interest quickly in these types of things.

Wednesday, December 10, 2008

The Best of The Programmer's Paradox

Blogs always see a constant flow of new, possibly intrigued readers. To answer that all important question that every reader has for a new blog, I put together a selection of some of my better works. This should help in making that all important judgment between insane lunatic, crazy rambler or just run-of-the-mill nut job. It's an easier choice if I provide references to some of my more interesting pieces, ones that have a strong point, were more popular or just contain fluid writing.

I've collected together my favorites with short intros, which I'll update from time to time to keep it up-to-date. Hopefully I can tag this in Blogger to make it stand out.

If you're a long-time reader and I missed a favorite or two, please feel free to leave a comment. If I manage to get a few more suitable entries, I'll update this document. Comments for older pieces should be left on the original piece; Blogger will let me know that they are there.


Probably the most influential thing that I'll write in my career is the basis for code normalization, however I fully expect that the software development industry will happily choose to ignore any points in these works for at least another hundred or hundred and fifty years. Because of that, don't feel obligated to read any of these, they probably won't be relevant for quite a while:


Simple is a constantly misused term; in this entry I'm trying very hard to clarify a definition of it. It's the sort of thing that programmers should really really understand, yet the kind of thing that they rarely do:


Most of the incoming searches from the web seem to find my abstraction and encapsulation writings. Again these pieces are focused around clearly defining these terms. The first entry references the other two; I threw in the fourth one because it also deals with more essential definitions:


I've been struggling for years with trying to find the right words to explain how to make it easier to develop software. Software development is a hard task, but people so often make it much much harder than it actually needs to be. More software fails at the root development level than it does from organizational problems. If it starts off poorly, it will fall apart quickly:


Another interesting post that got a fair amount of feedback. I think that a simplified high-level (but imprecise) way to specify functionality would go a long way in reducing risk and allowing programmers to learn from each other:


Coding is great, but as programmers we fill up our efforts with a steady stream of bugs. You haven't really mastered software development until you've mastered spending the least amount of effort to find the maximum number of bugs, which is not nearly as easy as it sounds:


I did a couple of forward-looking posts about things up and coming in the future. Computer Science forces us to be able to understand knowledge in a new and more precise way than is necessary for most other disciplines:


These are some of my earlier writings. Similar to this blog, but often a little drier. The whole repository can be found here: (should be free to download)

Some of the better entries:

Rust and Bloat --

Five Pillars --

When Ideas become Products --


This site is named after my first attempt at a book. I quit work and decided to push out all of my software development understanding onto paper. It was something I always wanted to do. I can't say that it is a good piece of work, but I can say that there is a lot there, it's just in a very raw form. It was my first big attempt at writing (besides research papers) and apart from a few typos, the printed edition looks and feels very much like a real book (my sister says so, and likes the quotes at the start of each chapter). Professional editing could bring it out, but I've never been able to convince any publishers that this 'type' of material should see the light of day. My advice: don't buy the book, convince a publisher to publish it instead (if you'd really like to read the book, send me an email and I'll send you a copy).


Sometimes the world drives me nuts, so I set up a place to vent. Some of the venting is horrible, but occasionally it's amusing:


Andy Hunt suggested a long time ago that my writing was OK, but not entirely personal. He said I was delivering observations but not tying them back to the reader. As a sort of fun exercise I set myself the goal of writing just little thought-lets: two sentence expressions. The first sentence is completely general, an observation; while the second one must tie that back to the reader somehow. The earlier versions were on yahoo360:

but there is a full copy at:

Wednesday, December 3, 2008

The Lines of Progress

Some posts come easily, this one however was a struggle. Life it seems, conspires to keep us occupied making it hard sometimes to find long periods for focusing. And it's an age thing too, I think. One friend of mine believes that thirty is the age cut-off for great inventors, but I tend to favor the idea that our responsibilities grow as we age, along with our expectations. Often the combination far exceeds our abilities. Whichever way, sometimes it makes it hard to get any writing done.

Without bemoaning fate, in this entry I want to show why Computer Science, as well as many of man's other endeavors, suffers so horribly from our fuzzy perceptions of the world around us. It's as if we are walking through a fog with blinders on. The things that underpin our knowledge are often so much weaker than we know or really want to believe, yet until we can accept our knowledge as it is, we cannot fix our understandings. We'll just stay trapped in a world of both abundant truth and falsity that often balance each other out.

I'm sure that most people see our modern knowledge as extensive and nearly complete, we as a species often have a strange faith that somebody somewhere entirely gets it, even if we are personally no longer able to keep up. That might be the case for some limited branches of knowledge, but it has become increasingly clear that the sum of our understandings easily exceeds even the best and the brightest minds of our time. We collectively know way more than any individual can assimilate. And this overall amount of knowledge is increasing at a rapid pace.

It's in exactly this growth that the real dangers lie. Each new leap is becoming increasingly dependent on what preceded it, and increasingly likely to be built on some falsehoods, even if they are subtle and unintentional. This isn't a problem that just affects a specific domain. It winds its way through all of the sciences, no matter how rigorous our efforts.

We build on a house of cards, which often comes to light when we try to organize our understanding, or if we try to automate it in some way. The details reveal our weaknesses.

It starts close to the very roots of our knowledge. Mathematics is our highest abstract, theoretically pure system, far removed from the real world. It sits at the top of all of our knowledge. Below that, science is how we rationalize the world around us. It takes the real world and tries to make sense of it. Computer Science which sits in that shadowy middle ground between these two, ties elements of the real world to some type of theoretical foundation. Running code is theoretically pure mathematics, what it is calculating is not.

It's because of this positioning that software development is one of the few occupations that demands that its practitioners must be able to learn and understand the other disciplines. Software developers are frequently interlopers into other bodies of knowledge. We are free to delve into the details of other knowledge bases, in fact we have no choice, we must learn from their details in order to complete our own efforts.

Because we build on the other disciplines, we delve deep into their information. But so often, as we try to map this back towards our own purely logical systems, we confront all of the irrational inconsistencies that have been ignored and accepted as conventions. Knowledge in a textbook can lightly skip over a few dubious facts, but once in a computer these issues become glaring problems.


As we age, we trade our inexperience for a diminished focus. We may know more but we have less opportunity to change things, mostly because of our own priorities. It's not that we don't rise to positions of power, but rather that the limitless energy and enthusiasm of youth quietly disappears over time. We're higher in authority, but we choose to do less.

In experiencing this knowledge versus energy trade-off I can't help but think about all of the things that I've learned over the years that are slowly fading away. It's the natural consequence of aging that we start forgetting twenty-year-old details. This makes me realize the fragility of everything I understand. The details fade, but I retain the patterns; the justifications disappear but I remember the results. Learning fades first at the details, then spreads upwards.

Our knowledge is built on a foundation. In our modern age we have increasingly become more rigorous in studying and proving our advances, but that rigor is tightly focused. What we learn from scientific methods is combined with what we understand from the world around us. The melding of our theoretical and empirical understandings, which is necessary because we have to allow for the messiness of the real world, opens up a gaping hole whereby the underlying absoluteness of our understanding can become tainted. What might be perfect in a research paper, can become suspect as it reaches practice. What we know, even if it is based around islands of proof, is not nearly as correct as we believe.

To really understand this perspective, you have to set yourself in the position of an academic scholar several hundred years ago. Pick a time when it was easily possible to have a nearly complete and full understanding of all things known to mankind. If you studied long and hard enough you would reach a point where you knew all that there was to know, yet from our modern perspective you knew just a proverbial drop in the bucket.

Yes, you understood geometry, philosophy and farming, and the patterns of the stars and the moon and whatnot, but electricity, engines, flight and computers would all be pure magic. Our modern chemical-based lifestyle with its vast array of foods and materials would seem to be predicated on a mass of unknown information.

Information that would exceed a man's lifespan to learn. So much more detail than one could fill up on. It's not the differences in technology that are so stunning, or the increases in basic issues like human rights and fairness. No, rather it's that we have progressed from a time where a person might have understood most things, to a time where we can barely even keep up with all of the details of our own field, let alone the explosion of science and knowledge that are raining down on us.

As an ancient scholar we could probably cope with the concept of a car, and possibly the highway system. What might be more difficult are the actual details of a gas powered internal combustion engine. The things that are similar to what we observe, we have to accept, but we may choose to explain the underlying details with some other, more convenient explanation.

If you can set yourself back to that ancient time, then you might also be able to set yourself forward about the same distance.

Although to many it may feel like our modern society has opened all of the major doors to most branches of knowledge, there is still much to learn. As an ancient scholar we were equally confident of the fullness of our knowledge, but look how vastly we were mistaken. There is a huge amount we didn't know, just waiting for us. Cars, flight, and chemistry for instance.

In a sense we are probably only halfway between that old ancient knowledge, and the new future knowledge that we are still left to discover. There is as much waiting for us to learn as there was for the ancient scholar if he was in our time. We know a lot, but really we know so little.


I was buried deep in the functionality of a software program recently trying to get a sense of how it was working. It was old, maybe twenty or thirty years -- ancient by software standards -- and its behavior was not as expected.

There was something wrong with the underlying algorithms, something pointing to code that was far cruder than anyone would have initially guessed. There are a lot of known algorithms, but this code wasn't matching any of them.

But then to some degree that is a common problem in software. Dig into something deep enough and you'll always find a programmer winging out a crude version of some known algorithm. Worse, sometimes that coder isn't actually a Computer Scientist, but a domain expert who has moved into coding. There may be a multitude of better, faster, more accurate algorithms, but the one used forever in production is lame.

I can't even say anymore how many times I've been digging into the details, only to find a significant collection of systematic yet long-time accepted mistakes underpinning some well-known software. It's far too common.

It doesn't really matter the industry either, from round-off errors in finance, to printing errors in marketing, to calculation errors in scientific code, to threading errors in development products. The software we produce has a significant number of problems. Some known, some just ignored.

Often it's not even the software's fault. Not actually a bug per se. I can remember one financial product where the convention for a summary statistic was based around an entrenched hardware bug. The bug, extremely well-known, became the basis for the industry convention. A not entirely odd occurrence where an industry bent towards the irrationality of its history. It's always been that way, so why change it?

But it's exactly that digging over the years in so many different industries that has opened up the door to my seeing the foundations of our real understanding. Or at least into accepting that most of what we're currently doing is guessing. I keep drilling down into the details, only to find that the details are wrong. Incorrect. Broken. Never by a huge amount, but almost always by something.

But whenever I've pushed this back up, the various industries are almost always aware of these errors. Small enough to be ignored, big enough to be significant.

Digging into any domain when building software always involves digging into some industry's dirty laundry.


If you sit on the fence between purity and reality, you quickly find that the cleanness and elegance of abstract mathematics holds an allure, a fascinatingly smooth, clean black and white philosophy for the world around us. Of course, it's completely untrue; the real world is a messy grey place with many more in-betweens than we want, but that region on the border easily highlights the differences.

Whenever my desire for perfection in the real world becomes too strong, I always fall back on my understanding of sidewalks.

In my neighborhood, at the top of the street the sidewalk has a huge crack running through it. The ground probably shifted sometime after the concrete was laid. Although this bit of reality in many ways is an ugly blight, that particular sidewalk hosts a large number of people, going back and forth on a daily basis.

Every day a large number of people walk by. And rarely, no more often than in any other place, does someone trip on the crack. For it seems that however imperfect the crack, it does not in most ways diminish the usefulness or working life of the sidewalk. It continues to serve a purpose, cracked or not.

When I've looked back at the software problems, although the errors get out there, the industries themselves just tend to route around them. That is, they become known, accepted, then move into being the convention. Often people just take that knowledge as if it were somehow absolutely true and undeniable.


My understanding of the sidewalk always reminds me that much of what we have or need in this world does not have to be perfect. Of course it doesn't: I've seen software working for decades while pinned on incorrect calculations or serious bugs that have been worked around. Computers crash, software generates bad numbers, hardware burns out, and yet all things continue to exist. We live in a world where our reality tolerates a considerable amount of incorrect data, flaws and disasters.

But that is exactly the key towards looking toward our future. The details in our knowledge are immense, but they are filled with just as many old wives' tales, myths, misbeliefs, lies, spin, half-truths, and other assorted bits of incorrect, or nearly incorrect, knowledge. Even when we are sure that the things underneath are solid, it is not uncommon to dig deep enough and find serious mistakes.

Consider our earlier perspective of an ancient scholar drifting through life with the full complement of man's knowledge. Now, with the exception of man-made items, we as this scholar knew of as many things visible in this world, as people do now. Yet for all of these things, say perhaps lightning, while the sense was the same -- we can see and hear it -- the underlying explanation was completely different.

Since electricity was unknown, static electricity was too. Thus, there was some other theory attached to lightning to explain it, and in our studies we were taught these ideas. It doesn't really matter what we were taught, but it does matter that as time wore on, these explanations grew, and grew more correct, probably by leaps and bounds, until they came to match the modern explanation for lightning. Lightning always existed, but much like a plant, the explanation for it has been morphing and growing all of these centuries, gradually working its way down farther into the depths of the details.

In a sense, if knowledge isn't getting broader, at least not at this moment, it certainly is getting deeper. Much deeper. But even with that trend, everyone can see lightning, and most people can explain simple elements of it, but how many really understand it?


My sense from exploring the other industries has always been that they each have their serious cracks. That the knowledge we know is vast, yet messy and incomplete, and more importantly filled with tonnes of misdirection.

Yes, we've learned how to be rigorous with some of the smaller details, some of the process, but we have no idea how to assemble this knowledge in a rigorous manner. We can collect the knowledge, interpret parts of it, but we cannot stitch it together.

Computers, while being great tools, are also great at showing us the obvious flaws in what we understand. Software isn't hard in concept, but the messiness of our real world understanding makes it horribly difficult. Systems fail because people grossly misestimate the complexity. Complexity that stems from inconsistencies and misunderstandings.

And it's in automating our efforts that we reveal the problems. Only a programmer really needs to understand how the calculations for financial instruments work, for example. They are so messy and often incorrect. The financial industry has lots of quick cheats and rules of thumb for partially accurate calculation. That's all that is needed to start trading them. It's only at the deepest level of detail, where you have to agonize over the tiniest of points, that you really start to see the holes.

Yes, we do know a lot, but we need a way to organize and contain that knowledge. It no longer fits in someone's head. We can't utilize what we know because we don't know how to connect the dots, to bring it all together into something coherent.

Really, if we did we could build one big massive all-inclusive database, containing all of man's known facts. Yet while that idea is easy to write down in a blog post, nothing we have in the way of technology can do this. The best we can do is a great pyramid-inspired scheme to throw massive manpower at it. We can create something like Wikipedia which may seem impressive, but it's not our knowledge that allowed this to happen, it's our vast numbers that we're utilizing.


In many ways it's easy to predict our future. It is, after all, something that Computer Science will have to achieve one day in order to progress into a real science or perhaps engineering. To make things work we must follow an obvious path.

We will have to learn the structure of data and we will have to learn how to combine it together. We will have to learn how to sort out the truth, the real stuff, from the masses of mis-information that are swarming all over.

If in the past we saw a trend where our information was getting broader over time -- spanning out -- then in the future, it will be getting deeper. We know the categories, we just need to understand the details.

We won't just guess at what we know, we won't just optimistically collect it in little bits, hoping to find it useful later. We'll know we need it, and what to do with it.

I could always hope that we'll get there soon, but the truth about mankind is that it doesn't really like progress. Sure, we've become addicted to little fiddly electronics like the iPod and other toys, but while the advancement of these beads has gone at breakneck speed throughout my lifetime, the really big leaps have been far slower. We shouldn't get the consequences of a few major jumps confused with the jumps themselves.

Of course we have more people with more effort, but those population advances have not been radically outpaced by our technological innovations. And we're easily led astray. It's not surprising, for example, to find that whole generations can succumb easily to ideas like Freud's. We, as a species, bend towards the static and easy. Perhaps when we are younger we have the strength and energy to change the world, but time wears us down.

Our next great leaps will come from our dropping the notion that thinking itself is somehow magical; that knowledge mysteriously appears out of nowhere. We've learned to organize physical labor, to control factories, yet we are careful never to apply these industrial ideas to our thinking in other areas. Why?

Oddly, software development, which interacts with so many other disciplines is the one that has been the most active in trying to stay far away from any attempt to organize our approaches to mental effort. Programmers despise order, organization and methodology.

The very things that we know we'll need in the future are the things that scare us most in the present. And it's not like another discipline will find the answer first, software development is the platform on which all of the others now rely. As we guess, and hack, and flail at our keyboard, they follow our examples with their own development. The limits of software now drive research and development.

That, it seems, may actually be a possible explanation for the high degree of systematic mistakes that pollute our technologies. Domain experts go at building their specific code in the same reckless manner that Computer Scientists have applied to their discipline for decades.


The things that we need to get to the next step are simple. We have the puzzle pieces, but we need to know what to do with them. We have a little bit of depth, but we need to perfect it, rather than just wrap it in more complexity.

We need to be able to put structure to our data, in a way that we know, not guess, but know that it is correct. Data forms the heart of our knowledge and while it's still collected randomly with no provable basis for its structure, it is little more than useless. We have lots of it, but we can't combine it easily and we certainly can't mine it for anything other than trivial known relationships. We talk bravely about exploiting it, but you can't dig a mine in a tar pit.

We need to normalize our code and change our practices to stop wrapping the older crappy stuff. You can't build a technological base on mistakes and rampant complexity. You can't just endlessly whack out new code in the hopes that someday it will accidentally be the right stuff. Elegance is not a dream and it's not unachievable. We need and can have beautiful code, not as an end in itself, but as the only way to build a foundation stable enough that we can actually leverage it to really increase our knowledge. We need to know what we know, and know how to know more of it.

We need to take the knowledge that we acquire in building today's systems, and apply it to tomorrow's. We need to capture our understandings so that we can extend them, not just restart each time with a new technology. Just as industrialization transformed factories, we now need it in our technologies. Software, of all the known industries, is the worst here. We're polarized between ideas based around dysfunctional decades-old large-scale manufacturing, that are too static and too large to work properly, and ideas based around cutesy fun game-like processes that are oriented more on being popular (to increase book and training sales) and less on actually being functional or improving the process. Old school dysfunction vs new school irresponsibility.

Still, for all we need these things, they are absolutely not what people want, or are looking for. We've continually chosen the opposite, the fast food concept of knowledge, gorging steadily until we're ready to burst.

Judging from interest, programmers don't want blueprints, they don't want to be able to normalize their code, and they really don't want a methodology that works. They'd much rather go into work and wing it, making wild often incorrect guesses than spend the effort to figure out how to really make it work. Even when they have techniques like database normalization, they'd rather just whip it together on their own, basing their designs on instinct and hunches.

The state of the industry is that most projects fail because most programmers would rather that outcome than change their ways. In a couple of decades we've barely improved from a 15% success rate to a 30% success rate. That says a lot.

The web, of course, documents this. There's far more gossip and fanboy posting than there are fair and honest discussions of a technical nature. And those in the online community are themselves only a small fraction of the currently practicing software industry. Most programmers stay far away from talking about programming. The industry wants to know what magical lines are required to fix a problem, but they don't want to know why they are required, or even whether or not the issue should be fixed.


We live in an age where every week you can read about a new scientific breakthrough study that supposedly changes the way we should think about something. But if you read a bit deeper you'll find that some of these studies, often it feels like more than half of them, are sitting on very dubious foundations. They make it to the news not based on their truth, but rather on their newsworthiness. That is, they are entertaining enough to make it to the papers.

That might be OK if these questionable efforts were to disappear, but we're steadily filling our knowledge banks with as much bad information as good. The world of infomercials has overflowed from our TVs and right into our research and development. The information age has promoted a kind of madness where rigor appears everywhere, and nowhere all at once. Who cares what we know, if it's so padded with crap that we don't know what's true or not anymore? Proper marketing techniques are replacing proper critical thinking ones; grant money, after all, has become more important than progress.

In our future -- distant or not, that's our choice -- we'll eventually find the intellectual tools to quickly sort out the underlying truthfulness of what we know. We have to, as we are quickly being swamped by masses of questionable information. This circumstance cannot continue forever. One day people will look back on these as the dark ages, an information explosion perhaps, but one where we were flooded with propaganda and lies.

Wednesday, November 12, 2008

Code Normal Forms

The unknown quality of code is a simple, yet highly influential problem with existing software. If you have millions of lines of code, is it fine or does it need serious work? Is it well-written, or is it a huge mess? An objective way to determine the current state of a large code base is necessary.

In the past, we've often relied on subjective opinions, but programmers are notoriously jealous of each other's work. Their answers are not always dependable.

With no real way to qualify a block of code, it is hard to properly plan out any development efforts. It's no wonder that necessary cleanup work is never scheduled, how do you know when it's actually needed?

What would effectively solve this problem is a simple, objective way to determine the essential quality of any code base. Relational databases have had this capability for a long time with their use of normal forms for the underlying schema. It makes a huge difference for the industry, giving database analysts a common benchmark for their work. Programmers desperately need similar ideas to help with analyzing their software code.

In my last couple of posts, I outlined six levels of normal forms for code, starting with the easiest and getting progressively harder:

1st -- No duplicate functions or data.
2nd -- Balanced common utilities.
3rd -- Balanced domain code.
4th -- All similar functionality uses similar code.
5th -- All data is minimized in scope.
6th -- All functionally similar code exists only once.

These ideas come directly from the two preceding posts:

I had initially started out by just describing a more structural way to view software, but in the process of answering comments, I gradually blundered my way into something deeper. Way back, I had suggested that a normal form should exist for code (more than one actually), but I wasn't actively searching for one.

More or less, I laid out these rules from an intuitive understanding of the common problems with large code bases. Normalization rules work to re-order the system in any arbitrary manner, so I oriented these specific rules towards well-known common problems.

They are cumulative: code in 2nd normal form must also satisfy 1st normal form. The higher rules sit on top of the lower ones. They are also arranged in order of difficulty. 4th normal form, for instance, is much harder to achieve than 1st normal form. This matches the database equivalents and it roughly matches what I've seen over the years in practice as well. There is lots of software that wouldn't even make 1st normal form, with fewer and fewer projects achieving higher results.


After I wrote my posts, I came across a similar academic paper by Markus Pizka:

The approach in this paper was far more rigorous, but because it was done on an almost line-by-line mapping of code to data, it fell apart when trying to get past 1st normal form. I also found it hard to relate it to coding itself, the ideas are great, and the mapping is far more correct than mine, but it just doesn't easily match how I see the code. To be practical, a useful normal form has to be easily done on at least a manual level. To do this, it needs to relate to a programmer's perspective.

Although I wasn't particularly rigorous, I was really trying to fit these forms into my real understanding of existing coding problems, while still trying to keep in the spirit of the original database normalizations.

Indirectly in the database version, an entity or a table is the central focus, acting as the primary level on which the normalizations are based. I followed that same sense by dealing not with individual lines of code, but with collected sets of them. More importantly, as the code gets higher up, the implicit size of these sets indirectly gets larger. Low-level code is more specific, while high-level code is more generalized. You deal with the broad strokes at a higher level, then descend into the functions to get the actual work done. Depth is a dimension unique to coding issues, it doesn't have a database equivalent.

If you take this collected group of instructions as the atomic piece at whatever level, then it's easy enough to guess that a normalized block of code should precisely match to the same level as a function. A one-to-one correspondence keeps the scope and activities of any function to a single purpose in the code. That is, well-balanced functions have only one single purpose, be it a sequence of broad strokes, an algorithm or a very specific set of bit manipulations. Function calls are computationally cheap, so there is little need to conserve them.

I did embed some arbitrary weirdness into the rules in the form of a) initially setting the structure of the code to four, not three levels, b) splitting the 2nd and 3rd forms based on the type of code and c) breaking 4th and 5th forms apart by code and data. Each of these irregularities draws itself from real underlying coding issues.

Although it is an architectural issue, splitting the domain level into a broad shallow layer and a deeper thicker primitive-based one allows for better optimizing of the code towards its real usage. Trying to apply one rule would fail and produce unwanted side-effects. There are two competing real problems in development, so there should be two levels that are optimized differently. I get into this in more detail later.

Realistically, many software projects get to 2nd normal form but for whatever reasons they don't apply the same process upwards to get to 3rd. Because of that, splitting the two, based on a utility/domain level distinction matches common industry practices. With only a single level, code that was well-constructed in its utilities would be cast as only 1st normal form, when in fact it has risen significantly above that distinction.

Both 4th and 5th normal form are exceptionally hard for large projects to achieve and because of that, they needed to be separated. 4th is about getting to a stable point with neat clean code, something that many code-oriented programmers aim for as a normal basis, while 5th is really about minimizing the usage of the system resources and keeping everything as encapsulated as possible. Because they are both challenging, and very different in effect I separated them to allow programmers to get to the earlier level, without having to achieve both.


The real importance of these rules comes from being able to understand the problems with the code and how to easily fix them. A gradual stepwise refinement is necessary, starting with the easier more common problems and gradually getting more complex. The higher the form achieved, the fewer long-term problems will happen to the code base.

Most projects that are stuck in a nasty development tar pit are there because the bulk of the code isn't even close to 1st normal form. Simple cleanup would fix a lot of the problems, but without a reasonable stopping point, the work seems endless and arbitrary. These rules change that.

Of course these rules don't qualify whether any specific set of instructions is wrong or right, but as the normal form increases, an indirect result is that the code will become more dense. Duplicate blocks of code will be replaced by many more calls to the same functions.

Dense code may be slightly harder to understand, but because it is frequently exercised, it is increasingly likely to be of higher quality. The relationship is pretty clear, if something runs in the middle of a system a thousand times, in a test scenario any errors are more likely to be noticed than if that something runs only once. Density and quality are related. Density magnifies testing.


At the start, 1st normal form is simply a cleanup level. Obviously duplicate code and duplicate variables should be removed, but in practice they rarely are. Simple to do, but it is a very common problem with anything but brand new code.

Once a few rounds of changes have been applied, code starts to get left out. The most common problem is one programmer taking a different approach than the others. That leaves a couple of different but functionally equivalent ways of handling the same problems. Programmers love to roll their own stuff, so few large code bases are even close to 1st normal form. The bigger the code base, the more redundancy gets added.

The code is not always identically duplicated, often it is doing the same basic functionality, but in a slightly different way. The same is true for the data, which gets copied into multiple variables with different names, and sometimes slightly different levels of parsing.

Duplication obviously causes lots of problems, mostly because any future changes do not get evenly applied, so one part of the program starts working actively against another part. Regression testing may catch some of this, but the best solution is to not be repetitive. It's easily said, but rarely followed.
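As a minimal sketch of this kind of drift, consider two hypothetical helpers (the names and the name-cleaning task are assumptions for illustration, not taken from any real system) that were meant to do the same job but have quietly diverged, followed by the single function that would replace them:

```python
# Two hypothetical near-duplicates, written by different programmers.
def clean_customer_name(raw):
    # Trims the ends only; runs of inner whitespace are left alone.
    return raw.strip().title()


def format_client_name(raw):
    # Same intent, but this copy also collapses inner whitespace.
    return " ".join(raw.split()).title()


# 1st normal form: one function, shared by every call site.
def normalize_name(raw):
    """Trim, collapse inner whitespace, and title-case a name."""
    return " ".join(raw.split()).title()
```

Fed the same messy input, the two original helpers disagree, which is exactly the sort of inconsistency that a later change to only one of them would make worse.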

Even if the code itself is functioning properly, inconsistent interfaces look sloppy and lead to user irritation. It may not always be clear, but most programs that you hate using are usually that way because they are exceptionally inconsistent. If your first instinct is to dislike a program, chances are it's plagued with small, irritating inconsistencies.

Sadly, better working arrangements and proper cleanups would easily reduce these common problems, but they don't get done very often. Any software development project that is still in active development should continually spend time to make sure it is in at least 1st normal form. That should be the minimum competence level for any code base. Professional code should not be a mess, and now we have a simple, objective qualification for the minimum amount of effort.


Once programmers have progressed beyond just being able to make simple loops and calculate simple variables, they quickly start to try to unify all of their common efforts into utility libraries. 2nd normal form simply says that these libraries should be structured nicely from a functional point of view. That is, at this level all of the functions that are similar, have similar arguments, are of similar size and can be used interchangeably to work out different problems.

For many languages, the lowest of these utility libraries often ends up as common libraries in the language. A good example is ANSI C, where the included libraries are, for the most part, well-organized primitives that are roughly equivalent to each other. A bad example is Java, where the system libraries are strangely inconsistent (someone noted that there are 42 collection classes in Java, and 4 equivalent ones in Ruby), and hard to use without documentation.

The acid test for consistency is in seeing a few examples of some functionality, can you correctly guess the rest? In a well-ordered system, the consistency and conventions make this easy. In C for example, if you get that all string functions start with str, and that n is a limiting factor, then if you know about strcpy and strcmp, you can guess the existence of strncpy and strncmp. The convention is obvious. In Java on the other hand, for almost all of the libraries, particularly collections and networking, the calls are essentially weird arbitrary inconsistencies that are impossible to guess without documentation. Without online access to help, Java would be a brutally hard language to use. As it is, the inconsistencies make it a slower more awkward language than most.
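The same guessability test can be sketched outside of C. Below is a minimal Python sketch of a hypothetical utility module (none of these helper names come from a real library) in which every helper shares one naming pattern and one argument order, so that seeing `text_first` and `list_last` should be enough to guess the other two:

```python
# Hypothetical utilities: each helper is named <kind>_<end> and takes
# (sequence, n) in that order, returning the same type it was given.
def text_first(s, n):
    """Return the first n characters of s."""
    return s[:n]


def text_last(s, n):
    """Return the last n characters of s (all of s if n exceeds its length)."""
    return s[max(len(s) - n, 0):]


def list_first(items, n):
    """Return the first n elements of items."""
    return items[:n]


def list_last(items, n):
    """Return the last n elements of items (all of them if n exceeds the length)."""
    return items[max(len(items) - n, 0):]
```

The point is not these particular helpers but the convention: a programmer who has used one of them can call the others without opening the documentation.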

On top of the usual system libraries, most big systems have their own custom utilities libraries that are used across the system. If they are really good and well-structured, most of the programmers will use them in their code. When they are messy, people start creating their own, and the code quickly violates 1st normal form. Thus, well-ordered utility libraries are crucial in keeping all of the coders from degenerating the code base. Lots of arbitrary copies of the same underlying code represent structural or organizational problems with the development project.

Projects that fail to get to 2nd normal form easily slide past 1st normal form as well. These are usually the meat-grinder projects, where the coders are flailing at the keyboard faster than the development structure can get set up. Usually these projects degenerate into nasty tar pits, even if the first couple of iterations show some promise. Bad development habits produce bad code which produces an increasingly severe headwind, which eventually drives all work to a near halt. Sloppy work takes longer.

It's important to note that this is not just a documentation problem. A huge library of ugly inconsistent, but well-documented utility functions will not lead the programmers to better practice. They must be able to find what they need quickly and its usage must be obvious. If you have to read a ream of documentation, then it won't get used properly. Programmers want to get into a zone and stay there for as long as possible; looking up documentation is a distraction.

2nd normal form is important for not consistently re-violating 1st normal form as the project moves forward. It also helps to leverage some of the common development effort. It may take a bit of extra work, but it pays enormous dividends, and should be considered a normal practice for professional programmers.


The business end of the system I intentionally split across two layers. I'm sure people will argue about this, but we can't forget to balance long-term needs against the short-term ones.

Users always want more functionality, and they want it quickly. Thus we want this shallow, easy-to-build, but slightly messy layer in the code to match the reality of what we are doing. On the other hand, if years are going to get sunk into a big project, then it would be nice for the work to get easier as it progresses over time, not harder. That can only happen if we are building up larger and larger chunks of functionality. And so, it is inevitable that we optimize the main bulk of our code in two completely different ways for two completely different reasons.

These diverging pressures drive an awful lot of hopeless arguments over correctly applying a single unified approach to coding. Splitting the core of the code into two -- based on reality, I think -- is a more reasonable way around this conflict.

Like 2nd normal form, at the lower business layer we want to build up commonly used functionality so that we can exploit it quickly in the upper layers. Inconsistencies in a shallow layer of code are far easier to clean up than inconsistencies in a deeper one.

The key to 3rd normal form is to match the shallow layer as closely as possible to the business requirements. In fact, the best result is for the code itself to express the actual language of the requirements in the least technical sense with the minimum of translation. If the users want to follow the foobar convention for the first calculation, and the barfoo one for the second, then the code should be exactly that:
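
As a minimal sketch, assuming Python and invented placeholder names (`foobar_convention`, `barfoo_convention` and `price_order` are hypothetical, as are the formulas inside them), it might look like this:

```python
# Hypothetical primitives from the deeper layer. The names and the
# arithmetic are placeholders, not real business conventions.
def foobar_convention(amount: float) -> float:
    """Stand-in for whatever 'foobar' means to the users."""
    return amount * 1.5


def barfoo_convention(amount: float) -> float:
    """Stand-in for whatever 'barfoo' means to the users."""
    return amount - 10.0


# The shallow layer reads like the requirement itself:
# "follow foobar for the first calculation, and barfoo for the second".
def price_order(amount: float) -> float:
    first = foobar_convention(amount)
    second = barfoo_convention(first)
    return second
```

The point is only that `price_order` uses the users' own words; the details of each convention stay hidden one level down.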


In other words, the language of the business domain matches the expression of the code. If there is an intervening abstraction, then the language of the configuration matches the language of the business domain. Either way, it should be more than obvious as to how to map the business issues to the code issues, and the code issues back to the business. Why make it any more complex?

As for the deeper, primitive layer, building this up to larger and larger re-usable primitives allows the upper layer to be more expressive with less syntax. As the system tackles bigger and bigger problems, the reach of the primitives should advance as well. Ugly bits, which occur in all systems, should be encapsulated, so that the details stay out of the shallow layer. Details are important, but mixing them with broad strokes is an easy way to lose their significance and make it harder to visually detect problems with the logic. A function should have a single focus; this is one of the keys to both 2nd and 3rd normal form.


Some languages offer a large number of different ways to accomplish the same goal, under the assumption that flexibility is a good thing. It's true that any overly rigid language has been surpassed by more open ones, but just because this type of flexibility exists, doesn't mean that it is a wise thing to use it.

Experienced C programmers often described 'pointers' as just more rope to hang oneself, and clearly the most common problems with the language were bad pointer issues (dangling pointers). Language designers want to be popular, but one guesses that to get there, they aren't really interested in looking out for the rest of us. Threads, another great bit of rope, are the leading underlying cause of much of today's inconsistent software behavior. If it happens occasionally, with no obvious pattern, it's probably a thread bug (unless it's old, then it's probably a dangling pointer).

The easiest way around these types of problems is to always use the language in a consistent and correct manner. Once the correct approach to handling a programming situation is discovered, using it consistently in the code to solve the same type of problem over and over again is reasonable. Even if the approach isn't entirely correct, consistency makes it easy to find all of the same circumstances and update them with something more reasonable. Consistency makes changes faster and more reliable. Consistency breeds quality.

What this really means is consistency is really more important than the actual code itself. You can easily fix a consistent, but incorrect implementation. Fixing an inconsistent mess is a lot of work, which is often avoided.

As we progress in development knowledge, it is always nicer to be able to make sure that 100% of the code reflects our current understandings. This way there are no weird behaviors leaching through the source, causing unnecessary complexity.

4th normal form is all about making the similar parts of the code look as similar as possible. In that light, it is very easy to see problems that are occurring because of inconsistencies. If four out of five functions have 6 lines of code, the one with 5 lines deserves closer inspection.
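
A rough sketch of that kind of check, assuming Python source and the standard `ast` module. The function name `odd_ones_out` and the use of raw line counts as the measure are illustrative assumptions, not a definition of 4th normal form:

```python
import ast
from collections import Counter


def odd_ones_out(source: str) -> list:
    """Name the functions whose length differs from the most common length."""
    sizes = {
        node.name: node.end_lineno - node.lineno + 1
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.FunctionDef)
    }
    if not sizes:
        return []
    # The most frequent size is treated as "what the code usually looks like".
    usual, _ = Counter(sizes.values()).most_common(1)[0]
    return sorted(name for name, size in sizes.items() if size != usual)
```

A flagged function is not necessarily wrong; it is just the one that deserves the closer look.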

Ultimately we want a small consistent code base that is easy to debug and to extend with the next wave of functionality. These attributes come from making the code obvious, which is another way of saying highly consistent. If you have to struggle with each different piece of code, enhancements are slow and painful. Why deliberately make your job worse?

For a single programmer, getting to 4th normal form is all about self-discipline. Even in rushed schedules, there are usually moments where the programmer can use trivial refactorings as a way of taking a break from larger, more complex coding efforts. Once the habits are developed, they become easier, and pay more dividends.

For big teams of programmers, 4th normal form is next to impossible. Consistency across several development teams is not part of our current development culture. There is no real reason why it can't be done, but the personalities involved in most development efforts will generally derail any significant attempts. As such, big projects just have to accept that only sections of the code can be normalized, not the whole. That's fine, so long as each new set of conventions is rigorously enforced, i.e. teams always spend the extra effort to refactor any inherited code, or strictly honor its original conventions. The system across the board may not be a consistent 4th normal form, but each and every section of it should be.

Undoubtedly 4th normal form is hard to achieve, and unlike some earlier forms the benefits of getting there are not nearly as great. However, it is a necessary step in taking the code base above and beyond just the 'usable' designation. The next couple of forms lay out true excellence in coding, but you can't really get there with a messy code base. Code in 3rd normal form is workable, but hardly impressive.


The biggest waste of CPU in large programs generally comes from assembling, disassembling and copying the same data over and over again throughout the different parts of the system. Most programmers are code-oriented, so the data is an after-thought, something to jam into the code when needed and dump out at the end. The data is grabbed from input, altered slightly, copied, altered, copied, etc. over and over again. The by-product of this is a huge amount of resources wasted in unnecessary copies and manipulations. Bloat is a huge problem.

This code-oriented viewpoint produces a lot of redundant work on data. It's not just unnecessary copies, it's also an endless sea of translating the data from one format to another, frequently leaving around duplicated versions. If we need a string in two pieces, multiple times in the same program, it makes far more sense to break it up once on entry, and then only reassemble it once on exit, keeping one and only one copy throughout.
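
A small sketch of the string case, in Python. The `Address` type and the `user@host` format are assumptions chosen purely for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """The one parsed copy of the incoming string, shared by all later code."""
    user: str
    host: str


def parse_address(raw: str) -> Address:
    """Break the string up once, on entry."""
    user, _, host = raw.partition("@")
    return Address(user=user, host=host)


def format_address(address: Address) -> str:
    """Reassemble the string once, on exit."""
    return f"{address.user}@{address.host}"


def is_local(address: Address, local_host: str) -> bool:
    """Interior code works on the parts and never re-splits the raw string."""
    return address.host == local_host
```

Everything between `parse_address` and `format_address` passes the same `Address` around, so there is no second splitting step to drift out of sync with the first.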

We really want to minimize the handling of the data throughout the code. With effort and a good architecture, this can be achieved. In systems where it has been, the code becomes a fraction of the original size, and it runs at a much faster rate. All of the redundant fiddling is more than just wasted effort in coding; it's also resource intensive, and it makes extending the code harder.

The real amount of unique data flowing through most systems is far smaller than the number of variables and the naming conventions would suggest. All systems are centered around a very small number of key entities, which drive most of the functionality. Even in systems with dynamic data, the range of the data is generally fixed, a necessity in being able to code up a working solution. Few programmers can really deal with hugely dynamic code; its abstract nature makes it tricky.

5th normal form is based around minimizing the data as it propagates its way throughout the system. This creates a truly tight, simple portion of code that wastes no effort, so it does the absolute minimum necessary. Structurally we can view this as a requirement for specific data to only be accessible in the smallest number of subtrees possible. Squeezing down that scope so that the data isn't visible elsewhere ensures that it is encapsulated. In some instances, we may have architectural reasons for redundantly copying the data, but those should be few and far between.

Another important aspect is for the data to be precisely named for what the data really is. If the incoming data is a username, then all of the variables that handle that data at a high level should refer to it with the same variable name: username. As it descends into the code, becoming lower, the name of the variables may be generalized to better represent the current level of abstraction. As such, a bit deeper than its entry point, the username might just be called 'name' or even 'string' because that's the level of generalization in which it is being used.

A 5th normal form code base should have a rational variable name convention that maps out specific names to specific incoming data and levels within the system. If there are all sorts of inconsistencies, such as the same data at the same level sometimes being called name, username, accountname and user, then the code is not in 5th normal form.

4th normal form made all of the code look the same, 5th normal form makes all of the variables look the same as well. These two layers put a necessary consistency over the information in order to reduce the effects of inconsistency and duplication.

For inexperienced programmers 5th normal form may sound very daunting, but as it is really more of a discipline issue, it is very achievable. Its chief benefit is a huge boost to performance and far less resource usage.

It really is about just getting the code to use only the bare minimum to get the job done. When it's achieved, the performance difference can often be orders of magnitude. Although it is rarely understood, if you really need to optimize some code, 5th normal form is where a programmer should go first, before starting to work on more exotic algorithms. Performance issues are either algorithmic or just a result of sloppy code, with most falling in the second category.


As programmers progress in experience and skill, they start noticing that they are writing nearly identical code over and over again. Similar screens, and similar database access for example. Some accept this as a normal part of development, but others are driven to seek out ways to eliminate these types of redundancies. Clearly to be able to build code in 6th normal form is the sign of a master craftsman. It's difficult, and it's not strictly necessary to make the system work, but it is achievable and easily the most elegant solution possible for any given problem.

6th normal form is defined as there being no substantially similar code in the system. That is, each and every piece of code looks different and is totally unique, and is not collapsible into some more general piece of code. For all of the general pieces, they are instanced in the system with only the absolute minimum of configuration information (and that configuration is orderly and easy to see if it's consistent; techniques like 'table-driving' code in C work wonders here).

Only in a very few cases have I seen substantial code bases that actually reach 6th normal form. But it's not as impossible a goal as it seems, although only a few programmers can envision this type of design initially, and refactoring into it is a lot of work. Still, when it is done, you get a tiny code base, that is tightly wound and absolutely consistent, because it is the code itself that enforces the consistency. The quality of this code is nearly self-evident, i.e. if the code actually manages to run, then it's extremely likely to be running correctly.

The common example of not being in 6th normal form comes from using any of the web-based application technologies like ASP. These technologies allow programmers to easily create new screens for their systems by quickly mixing HTML and some programming language code. For a short quick system, say 10 screens, this is great; the code seems to magically pop out of nowhere.

The real problems roost as the system grows, quickly at first, but ever decreasingly over time. Each new screen copies -- redundantly -- many of the aspects of the earlier ones. Sticking to 4th normal form helps somewhat, but by the time the system is getting past medium in size, it's getting pretty ugly.

It's obvious to some degree, that if you have a couple of redundant copies of a variable in the system it is a well-known bad thing, so if you have 200 versions of what is essentially the same screen then it must be a very very bad thing. Still, it is absolutely common practice (and extremely hard to do anything about). Eliminating these redundancies puts the resulting system into 6th normal form.

For example, if you create a system with just 6 basic screen-layouts that are multiplied hundreds of times over for all of the data access, then the interface portion of the system is in 6th. The basic layouts are a little different, but the configuration data instances them into the appropriate list, detail, edit, delete screens. Thus you have minimal code, with minimal configuration, that rigorously enforces a consistent convention over all of the screens in the system. With just 6 layouts, inconsistencies are minimized.
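
A toy version of that arrangement, as a table-driven sketch in Python. The layout functions, the screen names and the field lists are all hypothetical, and only two layouts are shown rather than six:

```python
# A few generic layouts, each written exactly once.
def render_list(title, fields, rows):
    header = " | ".join(fields)
    lines = [" | ".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([title, header] + lines)


def render_detail(title, fields, rows):
    row = rows[0]
    return "\n".join([title] + [f"{f}: {row[f]}" for f in fields])


LAYOUTS = {"list": render_list, "detail": render_detail}

# The table: one short entry per screen, and nothing but configuration.
SCREENS = {
    "client_list": ("list", "Clients", ["id", "name"]),
    "client_detail": ("detail", "Client", ["id", "name", "email"]),
    "order_list": ("list", "Orders", ["id", "total"]),
}


def render_screen(screen_name, rows):
    """Instance a generic layout from its table entry."""
    layout, title, fields = SCREENS[screen_name]
    return LAYOUTS[layout](title, fields, rows)
```

Adding a new screen means adding one line to `SCREENS`, and every list screen in the system looks the same because there is only one `render_list`.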

The same is true for the data/model back-end of the system, particularly with respect to persistence in a relational database. If there is only one set of code to store all data, which is then instanced by a minimal configuration, then the back-end of the system is also in 6th normal form.
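
As a minimal sketch of the same idea for persistence, here is one generic pair of save and load functions driven by a small table of entities. It assumes Python with the standard `sqlite3` module; the `ENTITIES` schema and the function names are invented for illustration:

```python
import sqlite3

# Minimal configuration: table name -> column names (hypothetical schema).
ENTITIES = {
    "client": ["name", "email"],
    "invoice": ["client_id", "total"],
}


def create_tables(conn):
    for table, columns in ENTITIES.items():
        cols = ", ".join(columns)
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, {cols})")


def save(conn, table, record):
    """Store any configured entity; returns the new row id."""
    columns = ENTITIES[table]  # identifiers come only from the config above
    names = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    cursor = conn.execute(
        f"INSERT INTO {table} ({names}) VALUES ({marks})",
        [record[c] for c in columns],
    )
    return cursor.lastrowid


def load(conn, table, row_id):
    """Fetch one row of any configured entity as a dict, or None."""
    columns = ENTITIES[table]
    names = ", ".join(columns)
    row = conn.execute(
        f"SELECT {names} FROM {table} WHERE id = ?", (row_id,)
    ).fetchone()
    return dict(zip(columns, row)) if row else None
```

A new entity is one more entry in `ENTITIES`; no new storage code gets written.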

Anywhere that the code is nearly duplicated is somewhere that can be generalized. All that redundant typing, or horrible cutting and pasting, can and should be eliminated. After all, it is just a nasty long-term problem waiting to spring into action.

Real 6th normal form is an extremely hard state to reach, and younger, less experienced programmers often confuse redundant static configuration messes with minimal configuration. I.e. if you simply trade your 200 screens for two hundred messy XML files, not only have you not solved the issue, you've actually just transferred it to a worse problem. Many of today's common libraries are extremely guilty of making this mistake. Trading redundant code for redundant configuration is a step backwards. Most declarative implementations are horrific nightmares.

6th normal form isn't a reasonable goal if the code base is coming to the end of its lifetime, but for new projects where the programmers really care about producing their best work, this is the highest standard they can achieve.

If you can get a system into 6th normal form -- really get it there, not just think you did -- then it becomes a strong base for quickly building up massive amounts of functionality. As more code gets added, the scope of the system grows rapidly, providing a great base for long-term development. More importantly, development doesn't slow to a crawl as the project ages; it actually gets faster if the quality of code is maintained.

Any, and every big system should start out with this foundation in mind. They need it in the same way that skyscrapers need to go deep into the ground to make the buildings stable. If you are going to build big, build correctly.


If our modern libraries were better structured, our code would be much easier to write. Often many libraries are an inconsistent mix of calls loosely based around some common functionality. A smaller set of consistent libraries based solely around limited data or specific algorithms would be far more useful.

We want to bring the choice of underlying libraries down to a simple one about whether or not a specific system should support a new type of data, or some specific algorithm. If we get there, then it's easy to see that wrapping data in well-contained libraries with clean, simple and consistent interfaces makes it easy to quickly move through a large series of them, increasing the functionality of the system. A set of libraries with nearly common interfaces is far easier to wire up than a smaller number of oddly interfaced ones.

A well-written library may require some explanation for the encapsulated data or algorithms, but really it should require NONE for the interface. If you have to read lots of information about how to use weird configuration files, or set strange values, or even work with some non-obvious paradigm when calling the library, then it is those inconsistencies that are wasting lots and lots of time and leaving in their wake lots and lots of bugs.

A good library is not hard to use, so if you're finding the interfaces awkward and difficult to figure out, and you've been at this for a while, then clearly the problem is the library.

Too many library developers throw in their own unique signature across the API, or write so that the coding issues are simplified at the expense of the interface (Hibernate is a classic example). Either way, there is wasteful unnecessary complexity added with the library that will probably ripple upwards into the code. Java, as a language is particularly bad for this type of problem; the basic libraries are all very obtuse and irrational. Rumor has it that .NET is an order of magnitude worse; just a massive cesspool of poorly thought out code.

The solution to our programming problems is not to have access to more libraries, it is to have access to BETTER libraries. A big ball of mud in the libraries propagates upwards into a big ball of mud for the architecture. If you build on a crappy foundation, your code is?


An important question for code normal forms is whether or not recognition of the different forms is computable. That is, can you write a program that will return with the correct normal form, for any given piece of source code. Making this work rests on being able to identify similar pieces of code and being able to identify similar pieces of data. The first of these problems is a little easier.

A while back, people were writing signature-based hashing algorithms to help determine which parts of a large code base, Linux, were copied from other large code bases. As code is just a series of instructions, you can also match two series together to some relative degree. I.e. the two are 90% similar, for instance.

We can do that by stripping out the contextual information, like variable names, and other things, and just laying out the underlying types and mechanics. Of course, since there is some arbitrariness in some of the order for the steps, it is not an easy problem, but one that could be handled.

In that manner, and by looking at the anchor points in the subtree, a program could flag possible duplicate code. By tracing transitions from an external point, in and around the code, duplicate data can also be found.
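
A crude sketch of the stripping step, assuming Python source and the standard `ast` and `difflib` modules. The names `signature`, `look_alike` and `similarity` are hypothetical, and collapsing every identifier to one placeholder is a deliberate simplification, not the hashing approach mentioned above:

```python
import ast
import difflib


class _StripNames(ast.NodeTransformer):
    """Replace identifiers with placeholders so only the structure remains."""

    def visit_FunctionDef(self, node):
        node.name = "_f"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_a"
        return node

    def visit_Name(self, node):
        node.id = "_v"
        return node


def signature(source: str) -> str:
    """Dump the syntax tree of the code with the contextual names removed."""
    tree = _StripNames().visit(ast.parse(source))
    return ast.dump(tree)


def look_alike(first: str, second: str) -> bool:
    """True when two pieces of code have identical stripped structure."""
    return signature(first) == signature(second)


def similarity(first: str, second: str) -> float:
    """A relative score between 0.0 and 1.0 for the two stripped structures."""
    matcher = difflib.SequenceMatcher(None, signature(first), signature(second))
    return matcher.ratio()
```

This ignores the ordering problem entirely: two functions that perform the same steps in a different order will not compare as identical here.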

Identifying duplicate pieces of data is actually considerably more complex. Just because two pieces currently happen to have the same value, that does not guarantee that they are the same. In that vein, real intelligence is the only way to make sure that any two pieces are actually the same, and that is something a computer can never do. That may seem to complicate things, but since there is actually a fairly small number of unique entities in most software, all we really need are external mappings from one named variable to another. Interestingly enough, if there are no possible entries for this mapping file, then the code is in 5th normal form.
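
One way such a mapping might be applied, sketched in Python with the standard `ast` module. The `ALIASES` table stands in for the external mapping file, and its entries and the function name `alias_usages` are made up for illustration:

```python
import ast

# Hypothetical mapping file contents: alias -> the entity it really is.
ALIASES = {"user": "username", "accountname": "username"}


def alias_usages(source: str, aliases=ALIASES) -> list:
    """Return sorted (alias, entity) pairs for every alias the code uses."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
    return sorted((name, aliases[name]) for name in names if name in aliases)
```

An empty result means the mapping has nothing to say about that code, which is the condition described above. The mapping itself still has to be written by someone who knows what the data means.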

If you know the structure and all of the unique data, you can start plotting out diagrams to show structural problems in the code. Along with a data dictionary automatically kept, and some added extra information included by the programmers, all of the system's variables can be tied down to a limited set of common entities.

In that sense it should be entirely possible to automate recognition for each of the different normal forms. You should be able to press a button and get a listing back for all the code in the system, file-by-file, and its final normal form.

I'm not saying it will be fast, clearly the overhead lights will dim for a while, but that it should be entirely automatable. Linked in with an IDE with some type of form-specific refactoring capabilities, the next wave of programmer tools should be able to give us status on our code bases in the same way that MS Word provides readability statistics on writing. While that doesn't guarantee that the code is good, it does guarantee that it's not total garbage.

We should be able to say things like "you can't check that in until it's at least 3rd normal form", and use a tool to show that the work hasn't been completed properly. When it's no longer subjective, it's no longer a personality conflict issue; it's just a simple fact.


Almost anyone can write computer code, and programmers often take too much pride in this simple exercise. Intrinsically most of them understand that there is a huge difference between just being able to get something to work, and being able to build a longer term, more elegant solution.

Programming, in terms of variables and loops, isn't really that hard, but then again it isn't really the central problem in software development either. Even getting a single version of software out the door, is not the same as reliably releasing version after version for years. There's always lots of programming to be done, but it's all of the things around the code that ultimately determine the success or failure of the project.

To get beyond being just some people who know how to code, we have to set higher standards for ourselves. A professional programmer should be able to produce a grade of code that is far above and beyond some person playing with a computer language.

Just showing up to work on Mondays should orient most programmers towards the goal of trying to get to at least 1st normal form. Cleaning up their code helps them, it makes their lives easier.

2nd and 3rd are structural issues, where the code is laid out with a non-random architecture. This makes extending the code a simpler process, and if done well, it makes it possible for quick turn-arounds on user-driven changes. At the very least, a system in 3rd normal form is fun to work with; below that level the coding is more painful and tiring.

4th and 5th involve a type of consistency that is very difficult for huge teams, or groups of people. They're both necessary in producing an optimal solution, but they are likely beyond the abilities of average groups of programmers. These forms are probably better left to individuals or small highly effective teams.

6th normal form is the dream for any programmer in love with coding. It is that state where just the true minimum of code exists, and it is extremely dense, but not impossible to understand. I've seen it in practice and created a few systems where it has been the underlying goal, so it's quite doable, but the extra effort needed to achieve it is beyond most development aspirations. 6th normal form is where beautiful elegant code exists, and not surprisingly, as many blog commenters have stated, they've never seen that type of code base in their lives.

Once we get these ideas into practice, they will really help with improving development. An organization, for instance, should establish minimal standards for coding, and deviations should only be for good reasons.

What really plagues Computer Science is that we spend too much time guessing, and not enough time making sure that we are really doing the right thing at the right time. While this can be fun, failure, which is the common result of this lack of science, is not. It really does make work better when you remove all the needless anxiety associated with mismanaged complexity.

It's nice to know when you start a project that it will work, and that it will really meet the initial goals. If you allow them, these code normal forms will significantly help with that; but only if you allow them.

Thursday, October 30, 2008

Revisiting the Structure of Elegance

This post is a continuation of my earlier one entitled The Structure of Elegance. If you haven't read the first post, then most of this one won't make any sense.

I left off by presenting a different perspective on code, that of being subtrees of functions within the system. This type of perspective is convenient for allowing a graphical view on the code, one that can be easily manipulated to help normalize the structure to something considerably more elegant. Large collections of code can be an overwhelming jumble of complexity, but by inspecting limited subtrees, most people can get a better grasp on their structure.

This perspective provoked a few questions about how and why this is useful. One of the biggest pieces of feedback I received was about what type of rules would exist for normalizing the code. Before I step into that issue, we need to get a higher level view.

Normalization, much like optimization and simplification, is a reductive process. Given some constraints, the system is reduced to a minimal form that is more or less equivalent to the original system but improved in some regards. These processes all require trade-offs, so it's never a case of the perfect solution. A number of ideal solutions may exist with respect to a set of constraints, and the constraints themselves can vary widely.

For instance, what is simplified with respect to a computer -- a massive list of assembler operations -- is nearly impossible for a human to work with. We pack things into encapsulated functions, adding in lots of extra complexity, but that allows us to bite off limited sections of the code at a time. We must simplify with respect to keeping things in bite-sized pieces. A computer doesn't need that extra complexity.

To pick rules for normalizing code we need to first understand what attributes within the code are really important. For instance, we know that a single node in a tree, with all of the instructions placed together in one massive function, is not a particularly easy bit of code to deal with. It's just a big mess of instructions.

Alternatively, we could place each instruction in its own function, giving us a super-long and skinny tree; but that wouldn't be very nice either. Balancing the depths and breadth of the trees is important.
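
To make the two extremes measurable, here is a small sketch in Python. It assumes the function tree is given as a dictionary from each function to the functions it calls; that representation and the names `depth` and `max_fanout` are assumptions made for illustration:

```python
def depth(tree: dict, root: str) -> int:
    """Number of levels from the root down to the deepest leaf."""
    children = tree.get(root, [])
    return 1 + max((depth(tree, child) for child in children), default=0)


def max_fanout(tree: dict) -> int:
    """Largest number of direct children under any single node."""
    return max((len(children) for children in tree.values()), default=0)


# One instruction per function: a long, skinny chain.
skinny = {"main": ["a"], "a": ["b"], "b": ["c"], "c": []}

# A more balanced layout of the same amount of work.
balanced = {
    "main": ["load", "process", "save"],
    "load": [],
    "process": ["validate", "compute"],
    "validate": [],
    "compute": [],
    "save": [],
}
```

Here `depth(skinny, "main")` is 4 with a fan-out of 1, while the balanced tree has a depth of 3 and a fan-out of 3. Neither number is a target; they only give the shape of the tree something concrete to be compared against.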

We need to understand what types of dimensions work best, but first we need to look back at various layers in the tree. Code at different depths is quite different, and requires different types of normalizations. The overall structure of the program is not uniform, and shouldn't be handled as such.

At the top of the tree is a fair amount of code that deals with technical control issues. It controls the looping, communication, functionality triggering capabilities. Anything that binds the driver factors, human or computer, onto specific configured blocks of functionality.

At the very lower level of the tree is code that is also handling a number of technical issues. Low-level things like logging, number and string handling. Variable manipulation, common libraries, etc. Even persistence through a database. The lowest levels of the tree either end in specific data manipulations or interfaces towards library or external functionality.

It is in the middle where we find the guts of the program. The part where all of the work is completed.

All code is driven by some usage, which we will call the problem, or business, domain (although business isn't necessarily defined to be only commerce). Problems are either very specific, such as a system to handle a particular type of data like client management, or very general such as a spreadsheet. However, even in the general case like a spreadsheet, it gets out there and is configured to be used for very specific problems. In that sense, all software domain problems originate as business ones, and then move inwards to become more abstract, either in the configuration, or in the actual code.

As the problem domain gets more abstract, the language describing it does as well. You might start out with stock prices, getting mapped to configuration information, getting mapped to formulas, then getting mapped to spreadsheet cells. Generally the deeper you go with the data, the more general the usage.

No matter what level of abstraction, the middle of the tree contains some business domain code that gets more and more abstract as it gets lower. Thus we have something like:

Real Problems -> Configuration -> Business Related Problems

Technical Control Issues
Business Related Problems
Abstraction on Business Problems
Smaller and/or Common Technical Issues

So, on the outside we have the real user problems that are mapped onto the primary "business" related ones in the code. Then in the code itself, we have a hierarchy of four layers that can overlay the whole execution tree for the system.

At the very top of the tree, the 1st level, we find we are dealing with the technical control issues. These we want to get to quickly and then set them aside. They are technical issues that are not directly affected by changes in the business issues. Expansion of the business demands sometimes requires enhancements, but ideally we like to minimize this code and encapsulate it away from the rest of the system. Its purpose is really just to bind code to some external triggering mechanism.

The same is true for all of the low-level common stuff, the 4th level. It gets created initially, then appended to over time. It usually is fairly controlled in its growth, so we want to encapsulate it as much as possible, and then hide it away at the lower levels of the code.

The ugly bit that is messy and subject to frequent changes is the stuff in the middle. It's also the hardest code to write and frequently the most boring. But it is the guts of the system, and it defines all of the functionality.


We know that the code is frequently changing at the 2nd level and 3rd level, and that it is driven by forces outside of the developer's control. Generally, the best approach is to try and make a 3rd level that is abstract enough as to be resistant to all but the worst possible changes. In a sense we want to minimize the 2nd level, in order to reduce the impact of change.

That means that we really want to build up a set of reusable primitives in the 3rd level that can be used to express the logic in the 2nd. As trees, we want to minimize the depth of the 2nd level, and maximize the depth of the 3rd. The 2nd level is shallow and wide, while the 3rd is deep and wide.

As the project progresses, if we have more and more base primitives with which to construct increasingly complex domain-specific answers, where each primitive sits on a larger and larger subtree, then it becomes easier over time to construct larger, more complicated solutions to the external problems. It's easier and it gets faster as each new abstracted set of subtrees covers more territory.

A broader 3rd level doesn't help if it's huge -- new programmers would reinvent the wheel each time because they can't find existing answers -- so keeping the number of functions/subtrees down to a bare minimum is important too.

This helps in even more ways, since if a subtree is used only once in a problem, then it is only tested 0 or 1 times. But if it is used in hundreds of places, the likelihood of testing is greatly increased, and thus the likelihood of unnoticed bugs is diminished. Heavily run code is more likely to be better code. Well-structured code promotes quality.

That principle is true for the whole system: a tightly-wound dense program where all of the code is heavily reused will have considerably better quality than a brute-force one where every piece of code is only ever executed in one distinctly small scenario. Ever wonder why after twenty years Microsoft still has so many bugs? Millions of lines of code, especially if they are mostly single use, are an extremely bad thing, and nearly impossible to test. Also, such code is subject to increasing inconsistencies from changes. The modern process of just wrapping old code in a new layer with an ever-increasing onion architecture is foolishly bad. It's worse than spaghetti, and completely unnecessary.

So overall we can see that we want very simple common and library code at the very bottom layer that deals with the technical problems in technical language. At the top we are forced to have layers of technical code that also deal with the problems in a higher technical language.

At the second level we will have lots of code, as it represents all of the functionality of the system, but hopefully it is very shallow. Below that we want to continually build a minimal set of increasingly larger subtrees that act as domain-driven primitives to accomplish non-overlapping goals. The strength of the 3rd layer defines the flexibility of the system, while the consistency in behavior is often defined mostly from the 2nd level.


A big application will have a fairly large number of different major data entities. The scope of this data in terms of the subtrees easily quantifies the structure of the code. Data that is duplicated and copied into many different locations in the code is a mess. Refining the data down to specific subtrees limits the scope and contains the errors.

Another important consideration is that the impact of changes to the data is way easier to calculate if the data is only visible within a specific subtree. Change impact analysis can save on costly retesting.

Data drives the primitives in that each of them, if they are similar, should deal with the exact same set of system data.

All of the string functions for instance, just deal with strings. All of the socket library code deals only with sockets and buffers. The persistence layer only deals with SQL and relational structuring.

A set of common primitives share the same underlying data structures. If one of them requires access to something different, that's a sign of a problem. Similarities are good, inconsistencies point to coding errors.

Within the system there will sometimes be vast gulfs between a few of the subtrees caused by drawing architectural lines in the code. One set of subtrees will be mutually distinct from another set with virtually no overlap.

This is a good thing, even if it means that some data is copied out of one subtree and replicated in another; it's a necessary resource expenditure to ensure that the two sections of the code remain mutually independent. Of course, this means that there is clearly a cost to adding in any architectural line, one that should be understood upfront. Somewhere in another post I talk about putting in lines for structure, but also for organizational and operational reasons.


By now some readers have probably surmised that I've been completely ignoring object-oriented issues. Although I've swept them aside, since objects are really just a collection of related functions tied to a specific set of data, much of the structuring I talked about works very well with object-oriented languages if we work in a few very simple observations.

The first is a little weird: there are really only two types of relationships between objects in this perspective. One is 'composite', the other is 'peer'.

From a subtree perspective, one object in the system contains another if and only if all of the dependent object's methods appear on the stack below the other object. I.e., one is fully encapsulated by the other; it is a composite object. As this is testable on the stack, it is easy to see if it is true: for two objects A and B, if A.method1 is always ahead of B.method1 and B.method2, then more or less A contains B. All methods of one are bounded by the other.

In the peer case, any two objects are interleaving on the stack at various levels. The order isn't important, simply that neither is contained so they work at a peer level with each other.
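To make the stack test concrete, here is a minimal sketch. It assumes a hypothetical representation where each stack snapshot is a list of (object name, method name) frames ordered from the outermost call to the innermost; the function names and return strings are made up for illustration.

```python
def contains(snapshots, outer, inner):
    """True if every frame of `inner` sits beneath a frame of `outer`,
    and `outer` never reappears beneath `inner` (no interleaving)."""
    inner_seen_anywhere = False
    for stack in snapshots:
        outer_seen = False
        inner_started = False
        for owner, _method in stack:
            if owner == outer:
                if inner_started:
                    return False  # outer shows up below inner: interleaved
                outer_seen = True
            elif owner == inner:
                if not outer_seen:
                    return False  # inner is running without outer above it
                inner_started = True
                inner_seen_anywhere = True
    return inner_seen_anywhere


def relationship(snapshots, a, b):
    """Classify two objects as composite (one contains the other) or peer."""
    together = any(
        {a, b} <= {owner for owner, _method in stack} for stack in snapshots
    )
    if not together:
        return "unrelated"  # never on the same stack in these snapshots
    if contains(snapshots, a, b):
        return f"{a} contains {b}"
    if contains(snapshots, b, a):
        return f"{b} contains {a}"
    return "peer"
```

For example, snapshots where `LineItem` frames only ever appear beneath `Order` frames classify as "Order contains LineItem", while a stack that goes View, Model, View classifies the pair as peers. The result is only as good as the snapshots collected, so it is evidence rather than proof.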

There are of course other possibilities, but these we don't want. For instance, if A contains B and B contains A, then either these two are really peers, or there is something messed up about their relationship. Likewise, if two objects are peers and also one contains the other, then there are really at least three objects all tied to each other, and at least one object is horribly overloaded. That is, it is at least two individual objects crammed together in the same body.

One could get into proving how or why some relationships imply that at least one of the objects is overloaded, but it isn't really that necessary. We really just want to focus on the two basic relationships.

Convoluted objects can be refactored. With some snapshots of the stack, you can break the objects into their component objects, thus normalizing them.

With only these two relationships between all of the objects, it becomes considerably simpler to construct the system. For instance in a GUI, we glue upwards into the interface, mostly peer objects. Then we create a model of the data we are going to use in the 3rd level, mostly containing composite objects. We try to bury the SQL database persistence quietly in the 4th level again with composite relationships, and that just leaves the second, which is probably some type of mix, depending on what the code is really doing.

Since we're trying to minimize the 2nd level anyways, the basic overall structure of the code becomes quickly obvious, especially if we're concerned about trying to really encapsulate any of the ugly technical problems.

And this can be extended easily. If you get the same problem, but with a client/server breakup in the middle, it's easy to see that it can be solved by adding in another layer between 2 and 3 that hides the communication, although occasionally optimizing a few pieces of functionality into the client (from the server) just to save common resources. The four layers stay the same, even if you're adding a fifth one into the middle.

With the above perspective, peer objects can have complex runtime relationships, but composite ones should not. They should be very simple containing relationships in order to keep the code as simple as possible.

Some object-oriented schools of thought seem to really like complex behavior between many different objects. The problem is that this quickly leads to yet another layer of possible spaghetti. Wide differences between structural relationships and runtime ones can be very difficult to maintain over time. We're best to focus on simpler code, unless the incentives are really really large.

Ideally, we want to take our business problems and express them in code in a format that is as close as possible to the original description. In that sense, very few real problems have a requirement for a complex runtime relationship; most of our data manipulation is about creating or editing new data in a structural fashion. Systems with extensive peer objects should heavily consider whether the dramatic increases in complexity are worth the cost of the design. It is entirely too easy to build something overcomplicated, when a much simpler design produces the exact same solution.


I've seen a few references to spartan programming rules:

these are great, good steps along the lines of normalization, but like all things, taking simplification too far can actually produce more complexity not less.

The spartan rules about naming conventions, for example, are too much. Naming is a key way to embed "obvious comments" right into the code. The name used should be the shortest, longest possible name correct for that level of abstraction, and consistent with similar usage throughout the system.

All names are always key pieces of information, and even with temporary variables using a rule of simplifying to something like one character names is a really bad idea. Code should flow, and odd naming conventions become distractions. All variables have a proper name, it's just that the programmer may not realize that.

Still, we want to boil down the code to the absolute bare minimum. We want to reuse as much as possible, to achieve better quality. We don't want duplicate information, extra variables, extra code or any other thing that we have to maintain over the years, if in the end it isn't absolutely necessary.

Minimizing most elements brings down the amount of information to be maintained over a long time to the smallest possible set.

There should only be one way to do everything in the system, and it should be consistently done that way. The consistency is important for achieving quality.

In languages where there is more than one way to do things, only one way should be used. More is worse. Pick a good way of handling the technical issues, like loops, and stick with it for as far as it will go (although don't use this rule as an excuse to ignore features of the language, or go too far).

Ultimately it's like anything else: if you have too many bits, they become harder and harder to maintain, so eventually it gets messier and messier. Discipline is part of the answer, but so is cutting down on the overall amount of work. Why spend two months working on something, if you could have completed it in two weeks?


The above discussion provides a good overall understanding of the types of things we are looking for in the code. We want to normalize with respect to a limited subset of variables, more or less in some specific order, so choosing the constraints carefully is important.

There are some easy attributes that we can see we want from our code:

  1. All unique pieces of data have unique names.
  2. Variables only get more abstract as they go deeper.
  3. All equivalent subtrees are called at the same level for any tree.
  4. All symmetric instructions are in the same function, or functions at the same level.
  5. All data is scoped to the minimal tree.
  6. All instructions in a function belong there, and are related to each other
  7. All functions speak of only a few pieces of data.
  8. All data is parsed only once, and then reassembled only once.
  9. Data is only copied between trees for architectural purposes. Otherwise all data exists in the program in one and only one memory location.
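
As one small illustration of attribute 8, here is a sketch in which dates arrive as text, are parsed exactly once at the boundary, are used in structured form everywhere in between, and are turned back into text exactly once on the way out. The `Booking` type and the function names are hypothetical, chosen only for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Booking:
    start: date
    end: date


def parse_booking(raw):
    # The only place where the text form of the dates is parsed.
    return Booking(
        start=date.fromisoformat(raw["start"]),
        end=date.fromisoformat(raw["end"]),
    )


def nights(booking):
    # Interior code works on structured data; no string handling here.
    return (booking.end - booking.start).days


def render_booking(booking):
    # The only place where the dates are reassembled into text.
    return (
        f"{booking.start.isoformat()} to {booking.end.isoformat()} "
        f"({nights(booking)} nights)"
    )
```

Code that instead passes the raw strings around and re-parses them in each function tends to end up with several slightly different parsers for the same piece of data.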

With respect to all of the above, and with an understanding of the common problems we find in code, we can lay out a bunch of layers to the forms:

1st Normal Form

- no duplicate sub-trees or data

Go through the code, line-by-line and delete any duplicate functions or data. Delete copies of any identical/partial data.

2nd Normal Form

- balanced common 4th level libraries: height, subtrees and arguments.

This is an architectural level. Similar functions should have similar arguments and be called on a similar level. Push and pull instructions until this happens. Some new functions might have to be created, some combined, and many deleted. Refactor the code until all of the pieces are encapsulated non-overlapping primitives, that encapsulate specific technical issues.

3rd Normal Form

- balanced domain code: 2nd, 3rd level

This is also an architectural level. 2nd level is minimized, 3rd level is maximized, but with few subtrees. Like 2nd normal form, but at the higher level in the code. The 3rd level can actually be modeled on the verbs/nouns (functions/data) used by the real users to discuss their problems, and their needs. I.e. you can map their language to the 2nd layer, with the building blocks created in the 3rd.

4th Normal Form

- All similar operations use similar code.

All blocks of code are consistent, and use the exact same consistent style and handling. Any code doing similar things is formatted in the same way, using the same style and conventions. This can be done by smaller refactorings in the code to enforce consistency.

5th Normal Form

- All data is minimized in scope.

All data is named properly given its usage, appears an absolute minimum number of times, and is only ever parsed once and only ever reassembled once. It is only stored once per section. Names for identical data should be identical at the same level. Names (and usage) should get more abstract at deeper levels. Some structural refactoring may be necessary, but a data-dictionary with all variable names (and associated levels) would help find misnamed or duplicate variables.

6th Normal Form

- All functionally similar code exists only once, and is instanced with a minimal set of config variables, both at the high and low levels in the system.

For example, there are only a small number of screen types, even though there are hundreds of screens in the system, and there are only a small number of data types, even though there is tonnes of data in the system. When these two common repetitive levels are melded together into a smaller denser system, the code quality is massively increased, the amount of work is decreased and the consistency of the entire system is enforced by the code itself. 6th Normal Form implies that there are no duplicate 'functionally similar' blocks of code, since they have all been combined into general ones (every piece of code looks different).
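
A toy version of this idea, with hypothetical screen names and fields: one general rendering function exists once, and each "screen" is only a small piece of configuration that instances it.

```python
# Each screen is configuration, not code. These entries are made up.
SCREENS = {
    "customer": {"title": "Customer", "fields": ["name", "email"]},
    "order": {"title": "Order", "fields": ["number", "total"]},
}


def render_screen(kind, record):
    """The single general-purpose screen; behavior varies only by config."""
    spec = SCREENS[kind]
    lines = [spec["title"]]
    lines += [f"{field}: {record[field]}" for field in spec["fields"]]
    return "\n".join(lines)
```

Adding a hundredth screen then means adding one more entry to `SCREENS`, and any fix to `render_screen` is applied consistently to every screen at once.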


With plenty of work, any existing system can be refactored into a fully normalized code base. There is nothing but effort, and the occasional regression testing to stand in the way of cleaning up an existing system. Often the work required seems significantly larger than it actually is, and as part of the process, generally a large number of bugs are quickly removed from the system.

For new systems, development can aspire to achieving higher normal forms initially, but on occasion allow for a little less because of time pressure.

All development is iterative, even if the cycles are long and the methodology tries to hide it, so it makes perfect sense to start each new round of development with a period of refactoring and cleaning.

Done a little at a time, the work is not nearly as painful, and the consistency helps keep the development at a brisk pace. Big systems grind to a halt only because of the messiness of the code. New changes become increasingly more onerous, making them increasingly less desirable.

For an initial project, it is best to quickly show that the upper technical problems fit well with the lower building blocks. New technologies should always be prototyped, but sometimes that process is driven by the development itself. Generally good practice is to start by building a small foundational 4th level, then attach upwards a minimal bit of functionality. This inverted T structure quickly shows that the flow through the code and the technologies are working correctly. Enhancements come from adding more functionality, a bit at a time, but if needed expanding the foundation level first. This type of iterative construction minimizes redoing parts, while making sure that the main body of code is usually close to working most of the time.


The single greatest problem that I've seen over and over again with many large or huge development projects is the code base itself. Sure, there are lots of excuses about management, users or technology, but in the end the problems stem from the coders not applying enough discipline to their work.

Messy code is hard to work with, while elegant code is not. But, as we do not teach coding particularly well, it seems as if very few programmers actually have a sense or understanding of the difference between good and bad.

That's why normalization is so important. It will give programmers a clear-cut set of rules for cleaning up their code. For getting it into a state where other programmers can easily help work on it. For not getting caught up in endless debates about right and wrong, and various conventions.

The rules as laid out in this post are simple and ascend in order of importance; that is, just getting to 1st Normal Form for many systems would be a huge accomplishment. Getting past 2nd and 3rd would help prepare the system for a new way of expansion. Getting even higher would elevate the work from a simple system to something more extraordinary. Achieving 6th Normal Form would produce a work of art.

But -- as is also true with this normalization's database counterpart -- no set of normalization rules is absolutely perfect. There will always be occasions to denormalize; to not follow the rules, or break them temporarily. That's fine, so long as the costs are understood, and a rational trade-off is accomplished. We may be rigid in our coding habits, but we should always be flexible with our beliefs. The best code is the stuff that is easy to understand, fix and then expand later as needed. All of this normalization is just there to help point the way towards achieving those attributes in your code.