Wednesday, November 12, 2008

Code Normal Forms

The unknown quality of code is a simple, yet highly influential problem with existing software. If you have millions of lines of code, is it fine or does it need serious work? Is it well-written, or is it a huge mess? An objective way to determine the current state of a large code base is necessary.

In the past, we've often relied on subjective opinions, but programmers are notoriously jealous of each other's work. Their answers are not always dependable.

With no real way to qualify a block of code, it is hard to properly plan out any development efforts. It's no wonder that necessary cleanup work is never scheduled; how do you know when it's actually needed?

What would effectively solve this problem is a simple, objective way to determine the essential quality of any code base. Relational databases have had this capability for a long time with their use of normal forms for the underlying schema. It makes a huge difference for the industry, giving database analysts a common benchmark for their work. Programmers desperately need similar ideas to help with analyzing their software code.

In my last couple of posts, I outlined six levels of normal forms for code, starting with the easiest and getting progressively harder:

1st -- No duplicate functions or data.
2nd -- Balanced common utilities.
3rd -- Balanced domain code.
4th -- All similar functionality uses similar code.
5th -- All data is minimized in scope.
6th -- All functionally similar code exists only once.

These ideas come directly from the two preceding posts:

http://theprogrammersparadox.blogspot.com/2008/10/structure-of-elegance.html
http://theprogrammersparadox.blogspot.com/2008/10/revisiting-structure-of-elegance.html

I had initially started out by just describing a more structural way to view software, but in the process of answering comments, I gradually blundered my way into something deeper. Way back, I had suggested that a normal form should exist for code (more than one actually), but I wasn't actively searching for one.

More or less, I laid out these rules from an intuitive understanding of the common problems with large code bases. Normalization rules work to re-order the system in any arbitrary manner, so I oriented these specific rules towards well-known common problems.

They are cumulative: code in 2nd normal form must also satisfy 1st normal form. The higher rules sit on top of the lower ones. They are also arranged in order of difficulty. 4th normal form, for instance, is much harder to achieve than 1st normal form. This matches the database equivalents and it roughly matches what I've seen over the years in practice as well. There is lots of software that wouldn't even make 1st normal form, with fewer and fewer projects achieving higher results.


PRIOR ART AND OTHER BITS

After I wrote my posts, I came across a similar academic paper by Markus Pizka:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.8060&rep=rep1&type=pdf

The approach in this paper was far more rigorous, but because it was done on an almost line-by-line mapping of code to data, it fell apart when trying to get past 1st normal form. I also found it hard to relate it to coding itself. The ideas are great, and the mapping is far more correct than mine, but it just doesn't easily match how I see the code. To be practical, a useful normal form has to be easily done on at least a manual level. To do this, it needs to relate to a programmer's perspective.

Although I wasn't particularly rigorous, I was really trying to fit these forms into my real understanding of existing coding problems, while still trying to keep in the spirit of the original database normalizations.

Indirectly in the database version, an entity or a table is the central focus, acting as the primary level on which the normalizations are based. I followed that same sense by dealing not with individual lines of code, but with collected sets of them. More importantly, as the code gets higher up, the implicit size of these sets indirectly gets larger. Low-level code is more specific, while high-level code is more generalized. You deal with the broad strokes at a higher level, then descend into the functions to get the actual work done. Depth is a dimension unique to coding issues; it doesn't have a database equivalent.

If you take this collected group of instructions as the atomic piece at whatever level, then it's easy enough to guess that a normalized block of code should precisely match to the same level as a function. A one-to-one correspondence keeps the scope and activities of any function to a single purpose in the code. That is, well-balanced functions have only one single purpose, be it a sequence of broad strokes, an algorithm or a very specific set of bit manipulations. Functions are computationally cheap, so there is little need to conserve them.

I did embed some arbitrary weirdness into the rules in the form of a) initially setting the structure of the code to four, not three levels, b) splitting the 2nd and 3rd forms based on the type of code and c) breaking 4th and 5th forms apart by code and data. Each of these irregularities draws itself from real underlying coding issues.

Although it is an architectural issue, splitting the domain level into a broad shallow layer and a deeper thicker primitive-based one allows for better optimizing of the code towards its real usage. Trying to apply one rule would fail and produce unwanted side-effects. There are two competing real problems in development, so there should be two levels that are optimized differently. I get into this in more detail later.

Realistically, many software projects get to 2nd normal form but for whatever reasons they don't apply the same process upwards to get to 3rd. Because of that, splitting the two, based on a utility/domain level distinction matches common industry practices. With only a single level, code that was well-constructed in its utilities would be cast as only 1st normal form, when in fact it has risen significantly above that distinction.

Both 4th and 5th normal form are exceptionally hard for large projects to achieve and because of that, they needed to be separated. 4th is about getting to a stable point with neat clean code, something that many code-oriented programmers aim for as a normal basis, while 5th is really about minimizing the usage of the system resources and keeping everything as encapsulated as possible. Because they are both challenging, and very different in effect I separated them to allow programmers to get to the earlier level, without having to achieve both.


LEVEL BY LEVEL

The real importance of these rules comes from being able to understand the problems with the code and how to easily fix them. A gradual stepwise refinement is necessary, starting with the easier more common problems and gradually getting more complex. The higher the form achieved, the fewer long-term problems that will happen to the code base.

Most projects that are stuck in a nasty development tar pit are there because the bulk of the code isn't even close to 1st normal form. Simple cleanup would fix a lot of the problems, but without a reasonable stopping point, the work seems endless and arbitrary. These rules change that.

Of course, these rules don't qualify whether any specific set of instructions is wrong or right, but as the normal form increases, an indirect result is that the code will become more dense. Duplicate blocks of code will be replaced by many more calls to the same functions.

Dense code may be slightly harder to understand, but because it is frequently exercised, it is increasingly likely to be of higher quality. The relationship is pretty clear: if something runs in the middle of a system a thousand times in a test scenario, any errors are more likely to be noticed than if that something runs only once. Density and quality are related. Density magnifies testing.


FIRST NORMAL FORM

At the start, 1st normal form is simply a cleanup level. Obviously, duplicate code and duplicate variables should be removed, but in practice they rarely are. The cleanup is simple to do, yet duplication is a very common problem with anything but brand new code.

Once a few rounds of changes have been applied, code starts to get left out. The most common problem is one programmer taking a different approach than the others. That leaves a couple of different, but functionally equivalent ways of handling the same problems. Programmers love to roll their own stuff, so few large code bases are even close to 1st normal form. The bigger the code base, the more redundancy that gets added.

The code is not always identically duplicated; often it is doing the same basic functionality but in a slightly different way. The same is true for the data, which gets copied into multiple variables with different names, and sometimes slightly different levels of parsing.

Duplication obviously causes lots of problems, mostly because any future changes do not get evenly applied, so one part of the program starts working actively against another part. Regression testing may catch some of this, but the best solution is to not be repetitive. It's easily said but rarely followed.
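
As a rough illustration of that drift, here is a small Python sketch. The function names are invented for this example and don't come from any real system. Two programmers each wrote their own username cleanup, the two versions quietly disagree, and the fix is simply to collapse them into one function that everybody calls.

```python
# Hypothetical example: two hand-rolled versions of the same cleanup.

def clean_user_name(raw):
    # Version A: trims whitespace and lowercases.
    return raw.strip().lower()

def normalize_username(raw):
    # Version B: lowercases, but forgets to trim the whitespace.
    return raw.lower()

# The same input now gives different answers depending on the caller.
assert clean_user_name(" Alice ") != normalize_username(" Alice ")

# 1st normal form: one function, used everywhere.
def canonical_username(raw):
    """Return the single, agreed-upon form of a username."""
    return raw.strip().lower()

def login(raw_name):
    return "login:" + canonical_username(raw_name)

def audit(raw_name):
    return "audit:" + canonical_username(raw_name)
```

Once both callers go through canonical_username, a future change to the rule lands in exactly one place.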

Even if the code itself is functioning properly, inconsistent interfaces look sloppy and lead to user irritation. It may not always be clear, but most programs that you hate using are usually that way because they are exceptionally inconsistent. If your first instinct is to dislike a program, chances are it's plagued with small, irritating inconsistencies.

Sadly, better working arrangements and proper cleanups would easily reduce these common problems, but they don't get done very often. Any software development project that is still in active development should continually spend time to make sure it is in at least 1st normal form. That should be the minimum competence level for any code base. Professional code should not be a mess, and now we have a simple, objective qualification for the minimum amount of effort.


SECOND NORMAL FORM

Once programmers have progressed beyond just being able to make simple loops and calculate simple variables, they quickly start to try to unify all of their common efforts into utility libraries. 2nd normal form simply says that these libraries should be structured nicely from a functional point of view. That is, at this level all of the functions that are similar, have similar arguments, are of similar size and can be used interchangeably to work out different problems.

For many languages, the lowest of these utility libraries often ends up as common libraries in the language. A good example is ANSI C, where the included libraries are, for the most part, well-organized primitives that are roughly equivalent to each other. A bad example is Java, where the system libraries are strangely inconsistent (someone noted that there are 42 collection classes in Java and 4 equivalent ones in Ruby), and hard to use without documentation.

The acid test for consistency is in seeing a few examples of some functionality, can you correctly guess the rest? In a well-ordered system, the consistency and conventions make this easy. In C for example, if you get that all string functions start with str, and that n is a limiting factor, then if you know about strcpy and strcmp, you can guess the existence of strncpy and strncmp. The convention is obvious. In Java on the other hand, for almost all of the libraries, particularly collections and networking, the calls are essentially weird arbitrary inconsistencies that are impossible to guess without documentation. Without online access to help, Java would be a brutally hard language to use. As it is, the inconsistencies make it a slower more awkward language than most.
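
The same guess-the-rest test can be tried on a custom utility library. The sketch below is only an illustration in Python. The names (pad_left, clip_right and so on) are made up for this example, and the argument order is just one convention I picked. The point is that after seeing two of these functions, the remaining ones should be guessable.

```python
# Hypothetical utility module. Convention assumed for this sketch:
# the text always comes first, then the width, then an optional fill,
# and every function returns a new string.

def pad_left(text, width, fill=" "):
    """Pad on the left until the text is at least `width` characters."""
    return text.rjust(width, fill)

def pad_right(text, width, fill=" "):
    """Pad on the right until the text is at least `width` characters."""
    return text.ljust(width, fill)

def pad_both(text, width, fill=" "):
    """Pad on both sides until the text is at least `width` characters."""
    return text.center(width, fill)

def clip_left(text, width):
    """Keep at most `width` characters, dropping from the left."""
    return text[-width:] if width > 0 else ""

def clip_right(text, width):
    """Keep at most `width` characters, dropping from the right."""
    return text[:width] if width > 0 else ""
```

If someone later adds a clip_both that takes the width first, or returns a list, the library has slipped out of 2nd normal form.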

On top of the usual system libraries, most big systems have their own custom utilities libraries that are used across the system. If they are really good and well-structured, most of the programmers will use them in their code. When they are messy, people start creating their own, and the code quickly violates 1st normal form. Thus, well-ordered utility libraries are crucial in keeping all of the coders from degenerating the code base. Lots of arbitrary copies of the same underlying code represent structural or organizational problems with the development project.

Projects that fail to get to 2nd normal form easily slide back out of 1st normal form as well. These are usually the meat-grinder projects, where the coders are flailing at the keyboard faster than the development structure can get set up. Usually, these projects degenerate into nasty tar pits, even if the first couple of iterations show some promise. Bad development habits produce bad code which produces an increasingly severe headwind, which eventually drives all work to a near halt. Sloppy work takes longer.

It's important to note that this is not just a documentation problem. A huge library of ugly inconsistent, but well-documented utility functions will not lead the programmers to better practice. They must be able to find what they need quickly and its usage must be obvious. If you have to read a ream of documentation, then it won't get used properly. Programmers want to get into a zone and stay there for as long as possible; looking up documentation is a distraction.

2nd normal form is important for not consistently re-violating 1st normal form as the project moves forward. It also helps to leverage some of the common development effort. It may take a bit of extra work, but it pays enormous dividends and should be considered a normal practice for professional programmers.


THIRD NORMAL FORM

The business end of the system I intentionally split across two layers. I'm sure people will argue about this, but we can't forget to balance long-term needs against the short-term ones.

Users always want more functionality, and they want it quickly. Thus we want this shallow easy to build, but slightly messy layer in the code to match the reality of what we are doing. On the other hand, if years are going to get sunk into a big project, then it would be nice for the work to get easier as it progresses over time, not harder. That can only happen if we are building up larger and larger chunks of functionality. And so, it is inevitable that we optimize the main bulk of our code in two completely different ways for two completely different reasons.

These diverging pressures drive an awful lot of hopeless arguments over correctly applying a single unified approach to coding. Splitting the core of the code into two -- based on reality, I think -- is a more reasonable way around this conflict.

Like 2nd normal form, at the lower business layer we want to build up commonly used functionality so that we can exploit it quickly in the upper layers. Inconsistencies in a shallow layer of code are far easier to clean up than inconsistencies in a deeper one.

The key to 3rd normal form is to match the shallow layer as closely as possible to the business requirements. In fact, the best result is for the code itself to express the actual language of the requirements in the least technical sense with the minimum of translation. If the users want to follow the foobar convention for the first calculation, and the barfoo one for the second, then the code should be exactly that:

followFoobarConvention(firstCalculation);
followBarfooConvention(secondCalculation);

In other words, the language of the business domain matches the expression of the code. If there is an intervening abstraction, then the language of the configuration matches the language of the business domain. Either way, it should be more than obvious as to how to map the business issues to the code issues, and the code issues back to the business. Why make it any more complex?

As for the deeper, primitive layer, building this up to larger and larger re-usable primitives allows the upper layer to be more expressive with less syntax. As the system tackles bigger and bigger problems, the reach of the primitives should advance as well. Ugly bits, which occur in all systems should be encapsulated so that the details stay out of the shallow layer. Details are important, but mixing them with broad strokes is an easy way to lose their significance and make it harder to visually detect problems with the logic. A function should be of a single focus, this is one of the keys to both 2nd and 3rd normal form.
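
To make the two layers a bit more concrete, here is a minimal Python sketch. The snake_case names mirror the followFoobarConvention calls above, but the actual rules (add 5% and round to 5 cents, add 10% and round to 10 cents) are pure assumptions made up for the example, as are the primitives apply_rate and round_half_up.

```python
# --- Deeper, primitive layer: reusable, detail-heavy pieces ------------

def apply_rate(amount_cents, rate_percent):
    """Add a whole-number percentage to a non-negative amount in cents."""
    return amount_cents + (amount_cents * rate_percent) // 100

def round_half_up(amount_cents, step):
    """Round a non-negative amount to the nearest multiple of `step`."""
    return ((amount_cents + step // 2) // step) * step

# --- Shallow, business layer: reads like the requirements --------------

def follow_foobar_convention(amount_cents):
    # Assumed rule for this sketch: add 5%, round to the nearest 5 cents.
    return round_half_up(apply_rate(amount_cents, 5), 5)

def follow_barfoo_convention(amount_cents):
    # Assumed rule for this sketch: add 10%, round to the nearest 10 cents.
    return round_half_up(apply_rate(amount_cents, 10), 10)

def price_order(first_calculation, second_calculation):
    # The top level is nothing but the language of the business.
    return (follow_foobar_convention(first_calculation),
            follow_barfoo_convention(second_calculation))
```

The integer arithmetic and rounding details stay down in the primitives, so the shallow layer can be read against the requirements line by line.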


FOURTH NORMAL FORM

Some languages offer a large number of different ways to accomplish the same goal, under the assumption that flexibility is a good thing. It's true that any overly rigid language has been surpassed by more open ones, but just because this type of flexibility exists, doesn't mean that it is a wise thing to use it.

Experienced C programmers often described 'pointers' as just more rope to hang oneself, and clearly the most common problems with the language were bad pointer issues (dangling pointers). Language designers want to be popular, but one guesses that to get there, they aren't really interested in looking out for the rest of us. Threads, another great bit of rope, are the leading underlying cause of much of today's inconsistent software behavior. If it happens occasionally, with no obvious pattern, it's probably a thread bug (unless it's old, then it's probably a dangling pointer).

The easiest way around these types of problems is to always use the language in a consistent and correct manner. Once the correct approach to handling a programming situation is discovered, using it consistently in the code to solve the same type of problem over and over again is reasonable. Even if the approach isn't entirely correct, consistency makes it easy to find all of the same circumstances and update them with something more reasonable. Consistency makes changes faster and more reliable. Consistency breeds quality.

What this really means is consistency is really more important than the actual code itself. You can easily fix a consistent, but incorrect implementation. Fixing an inconsistent mess is a lot of work, which is often avoided.

As we progress in development knowledge, it is always nicer to be able to make sure that 100% of the code reflects our current understandings. This way there are no weird behaviors leaching through the source, causing unnecessary complexity.

4th normal form is all about making similar parts of the code look as similar as possible. In that light, it is very easy to see problems that are occurring because of inconsistencies. If four out of five functions have 6 lines of code, the one with 5 lines deserves closer inspection.
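
That kind of inspection can be crudely mechanized. The following Python sketch uses the standard ast module to count the statements in each function and flag the ones that differ from the most common size. Statement count is only my stand-in for "lines of code", and the functions inside SOURCE are invented sample input, so treat this as a toy rather than a real checker.

```python
import ast
from collections import Counter

# Invented sample input: three functions that ought to look alike.
SOURCE = '''
def load_customer(db, key):
    check(key)
    row = db.get("customer", key)
    log("customer", key)
    return row

def load_invoice(db, key):
    check(key)
    row = db.get("invoice", key)
    log("invoice", key)
    return row

def load_product(db, key):
    row = db.get("product", key)
    log("product", key)
    return row
'''

def statement_counts(source):
    """Map each top-level function name to its number of body statements."""
    tree = ast.parse(source)
    return {node.name: len(node.body)
            for node in tree.body
            if isinstance(node, ast.FunctionDef)}

def outliers(source):
    """Return the functions whose size differs from the most common size."""
    counts = statement_counts(source)
    if not counts:
        return []
    usual, _ = Counter(counts.values()).most_common(1)[0]
    return sorted(name for name, n in counts.items() if n != usual)
```

Here outliers(SOURCE) points at load_product, which is missing the check(key) call that its siblings make. Whether that is a bug or a deliberate exception still takes a person to decide.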

Ultimately we want a small consistent code base that is easy to debug and to extend with the next wave of functionality. These attributes come from making the code obvious, which is another way of saying highly consistent. If you have to struggle with each different piece of code, enhancements are slow and painful. Why deliberately make your job worse?

For a single programmer getting to 4th normal form is all about self-discipline. Even in rushed schedules, there are usually moments where the programmer can use trivial refactorings as a way of taking a break from larger more complex coding efforts. Once the habits are developed, they become easier, and pay more dividends.

For big teams of programmers, 4th normal form is next to impossible. Consistency across several development teams is not part of our current development culture. There is no real reason why it can't be done, but the personalities involved in most development efforts will generally derail any significant attempts. As such, big projects just have to accept that only sections of the code can be normalized, not the whole. That's fine, so long as each new set of conventions is rigorously enforced, i.e. teams always spend the extra effort to refactor any inherited code, or strictly honor its original conventions. The system across the board may not be a consistent 4th normal form, but each and every section of it should be.

Undoubtedly 4th normal form is hard to achieve, and unlike some earlier forms the benefits of getting there are not nearly as great. However, it is a necessary step on taking the code base above and beyond just the 'usable' designation. The next couple of forms lay out true excellence in coding, but you can't really get there with a messy code base. Code in 3rd normal form is workable but hardly impressive.


FIFTH NORMAL FORM

The biggest waste of CPU in large programs generally comes from assembling, disassembling and copying the same data over and over again throughout the different parts of the system. Most programmers are code-oriented, so the data is an after-thought, something to jam into the code when needed and dump out at the end. The data is grabbed from input, altered slightly, copied, altered, copied, etc. over and over again. The by-product of this is a huge amount of resources wasted in unnecessary copies and manipulations. Bloat is a huge problem.

This code-oriented viewpoint produces a lot of redundant work on data. It's not just unnecessary copies, it's also an endless sea of translating the data from one format to another, frequently leaving around duplicated versions. If we need a string in two pieces, multiple times in the same program, it makes far more sense to break it up once on entry, and then only reassemble it once on exit, keeping one and only one copy throughout.
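
Here is a minimal Python sketch of the break-it-up-once idea, assuming an email-style address as the string in question. The Address class and the helper names are hypothetical, invented only for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Address:
    """An email-style address, split once at the edge of the system."""
    user: str
    host: str

def parse_address(raw):
    # Entry point: the only place the string is broken up.
    # (Raises ValueError if there is no "@"; good enough for a sketch.)
    user, host = raw.split("@", 1)
    return Address(user=user, host=host)

def format_address(address):
    # Exit point: the only place the pieces are put back together.
    return address.user + "@" + address.host

# Everything in between works on the pieces, never on the raw string.
def same_host(a, b):
    return a.host == b.host

def mailbox_path(address):
    return "/mail/" + address.host + "/" + address.user
```

None of the inner functions ever calls split or rebuilds the string, so there is one parse, one copy of the data, and one reassembly at the end.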

We really want to minimize the handling of the data throughout the code. With effort and a good architecture, this can be achieved. In systems where it has been, the code becomes a fraction of the original size, and it runs at a much faster rate. All of the redundant fiddling is more than just wasted effort in coding, it's also resource intensive, and it makes extending the code harder.

The real amount of unique data flowing through most systems is far smaller than the number of variables and the naming conventions would suggest. All systems are centered around a very small number of key entities, which drive most of the functionality. Even in systems with dynamic data, the range of the data is generally fixed, a necessity in being able to code up a working solution. Few programmers can really deal with hugely dynamic code; its abstract nature makes it tricky.

5th normal form is based around minimizing the data as it propagates its way throughout the system. This creates a truly tight, simple portion of code that wastes no effort, so it does the absolute minimum necessary. Structurally we can view this as a requirement for specific data to only be accessible in the smallest number of subtrees possible. Squeezing down that scope so that the data isn't visible elsewhere ensures that it is encapsulated. In some instances, we may have architectural reasons for redundantly copying the data, but those should be few and far between.

Another important aspect is for the data to be precisely named for what the data really is. If the incoming data is a username, then all of the variables that handle that data at a high level should refer to it with the same variable name: username. As it descends into the code, becoming lower, the name of the variables may be generalized to better represent the current level of abstraction. As such, a bit deeper than its entry point, the username might just be called 'name' or even 'string' because that's the level of generalization in which it is being used.

A 5th normal form code base should have a rational variable name convention that maps out specific names to specific incoming data and levels within the system. If there are all sorts of inconsistencies, such as the same data at the same level sometimes being called name, username, accountname, and user, then the code is not in 5th normal form.
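
A small Python sketch of what such a convention might look like follows. The functions are hypothetical and exist only to show the names generalizing as the data descends: username at the top, name in the middle, string at the bottom.

```python
def register(username):
    # Entry level: the data is a username, so it is called username.
    return store_account(username)

def store_account(username):
    # Still dealing with accounts, so the name stays the same.
    key = make_key(username)
    return {"key": key, "username": username}

def make_key(name):
    # Lower level: this works for any kind of name, so it is just a name.
    return collapse_spaces(name).lower()

def collapse_spaces(string):
    # Lowest level: pure string handling, nothing username-specific left.
    return " ".join(string.split())
```

What would break the convention is store_account calling its argument user, or register calling it accountname, while both sit at the same level as username.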

4th normal form made all of the code look the same, 5th normal form makes all of the variables look the same as well. These two layers put a necessary consistency over the information in order to reduce the effects of inconsistency and duplication.

For inexperienced programmers, 5th normal form may sound very daunting, but as it is really more of a discipline issue, it is very achievable. Its chief benefit is in providing a huge boost to performance and far lower resource usage.

It really is about just getting the code to use only the bare minimum to get the job done. When it's achieved, the performance difference can often be orders of magnitude. Although it is rarely understood, if you really need to optimize some code, 5th normal form is where a programmer should go first, before starting to work on more exotic algorithms. Performance issues are either algorithmic or just a result of sloppy code, with most falling in the second category.


SIXTH NORMAL FORM

As programmers progress in experience and skill, they start noticing that they are writing nearly identical code over and over again. Similar screens, and similar database access for example. Some accept this as a normal part of development, but others are driven to seek out ways to eliminate these types of redundancies. Clearly to be able to build code in 6th normal form is the sign of a master craftsman. It's difficult, and it's not strictly necessary to make the system work, but it is achievable and easily the most elegant solution possible for any given problem.

6th normal form is defined as there being no substantially similar code in the system. That is, each and every piece of code looks different and is totally unique, and is not collapsible into some more general piece of code. For all of the general pieces, they are instanced in the system with only the absolute minimum of configuration information (and that configuration is orderly and easy to see if it's consistent; techniques like 'table-driving' code in C work wonders here).

Only in a very few cases have I seen substantial code bases that actually reach 6th normal form. But it's not as impossible a goal as it seems, although only a few programmers can envision this type of design initially, and refactoring into it is a lot of work. Still, when it is done, you get a tiny code base, that is tightly wound and absolutely consistent because it is the code itself that enforces the consistency. The quality of this code is nearly self-evident, i.e. if the code actually manages to run, then it's extremely likely to be running correctly.

The common example of not being in 6th normal form comes from using any of the web-based application technologies like ASP. These technologies allow programmers to easily create new screens for their systems by quickly mixing HTML and some programming language code. For a short quick system, say 10 screens, this is great; the code seems to magically pop out of nowhere.

The real problems come home to roost as the system grows, quickly at first, but ever more slowly over time. Each new screen copies -- redundantly -- many of the aspects of the earlier ones. Sticking to 4th normal form helps somewhat, but by the time the system is getting past medium in size, it's getting pretty ugly.

It's obvious to some degree, that if you have a couple of redundant copies of a variable in the system it is a well-known bad thing, so if you have 200 versions of what is essentially the same screen then it must be a very very bad thing. Still, it is absolutely common practice (and extremely hard to do anything about). Eliminating these redundancies puts the resulting system into 6th normal form.

For example, if you create a system with just 6 basic screen-layouts that are multiplied hundreds of times over for all of the data access, then the interface portion of the system is in 6th. The basic layouts are a little different, but the configuration data instances them into the appropriate list, detail, edit, delete screens. Thus you have minimal code, with minimal configuration, that rigorously enforces a consistent convention over all of the screens in the system. With just 6 layouts, inconsistencies are minimized.
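
Here is a toy version of that arrangement in Python, with two layouts instead of six to keep it short. The screen names, field names and the plain-text "rendering" are all assumptions made up for the sketch. The shape is what matters: a handful of generic layouts, and one line of configuration per screen.

```python
# One generic renderer per layout. Rows are plain dicts in this sketch.

def render_list(title, fields, rows):
    header = " | ".join(fields)
    lines = [" | ".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([title, header] + lines)

def render_detail(title, fields, rows):
    row = rows[0]
    return "\n".join([title] + [f + ": " + str(row[f]) for f in fields])

LAYOUTS = {"list": render_list, "detail": render_detail}

# Each screen is one line of configuration: (layout, title, fields).
SCREENS = {
    "customers":       ("list",   "Customers", ["id", "name"]),
    "customer_detail": ("detail", "Customer",  ["id", "name", "city"]),
    "invoices":        ("list",   "Invoices",  ["id", "total"]),
}

def show(screen_name, rows):
    """Instance the configured layout for the named screen."""
    layout, title, fields = SCREENS[screen_name]
    return LAYOUTS[layout](title, fields, rows)
```

Adding the two-hundredth screen means adding one more entry to SCREENS, and every list screen in the system is consistent because there is only one render_list.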

The same is true for the data/model back-end of the system, particularly with respect to persistence in a relational database. If there is only one set of code to store all data, which is then instanced by a minimal configuration, then the back-end of the system is also in 6th normal form.
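
A minimal sketch of that back-end idea, using Python's standard sqlite3 module, might look like the following. The ENTITIES table is a hypothetical schema invented for the example. Table and column names are taken only from that trusted configuration, while the values go through normal query parameters.

```python
import sqlite3

# Minimal configuration: entity name -> its columns (hypothetical schema).
ENTITIES = {
    "customer": ["id", "name"],
    "invoice":  ["id", "customer_id", "total"],
}

def create_tables(conn):
    """Create one untyped table per configured entity."""
    for table, columns in ENTITIES.items():
        conn.execute("CREATE TABLE %s (%s)" % (table, ", ".join(columns)))

def save(conn, table, record):
    """Insert one record; the same code serves every configured entity."""
    columns = ENTITIES[table]  # KeyError for anything not configured
    placeholders = ", ".join("?" for _ in columns)
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(columns), placeholders)
    conn.execute(sql, [record[c] for c in columns])

def load_all(conn, table):
    """Read every record of one entity back as a list of dicts."""
    columns = ENTITIES[table]
    sql = "SELECT %s FROM %s ORDER BY id" % (", ".join(columns), table)
    return [dict(zip(columns, row)) for row in conn.execute(sql)]
```

There is no save_customer or save_invoice anywhere. Supporting a new entity is one new line in ENTITIES, which is the whole point.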

Anywhere that the code is nearly duplicated, is somewhere that can be generalized. All that redundant typing, or horrible cutting and pasting, can and should be eliminated. After all, it is just a nasty long-term problem waiting to spring into action.

Real 6th normal form is an extremely hard state to reach, and younger less experienced programmers often confuse redundant static configuration messes with minimal configuration. I.e. if you simply trade your 200 screens for two hundred messy XML files, not only have you not solved the issue, you've actually just transferred it to a worse problem. Many of today's common libraries are extremely guilty of making this mistake. Trading redundant code for redundant configuration is a step backwards. Most declarative implementations are horrific nightmares.

6th normal form isn't a reasonable goal if the code base is coming to the end of its lifetime, but for new projects where the programmers really care about producing their best work, this is the highest standard they can achieve.

If you can get a system into 6th normal form -- really get it there, not just think you did -- then it becomes a strong base for quickly building up massive amounts of functionality. As more code gets added, the scope of the system grows rapidly, providing a great base for long-term development. More importantly, development doesn't slow to a crawl as the project ages; it actually gets faster if the quality of code is maintained.

Any and every big system should start out with this foundation in mind. They need it in the same way that skyscrapers need to go deep into the ground to make the buildings stable. If you are going to build big, build correctly.


NORMALIZED LIBRARIES

If our modern libraries were better structured, our code would be much easier to write. Often many libraries are an inconsistent mix of calls loosely based around some common functionality. A smaller set of consistent libraries based solely around limited data or specific algorithms would be far more useful.

We want to bring the choice of underlying libraries down to a simple one about whether or not a specific system should support a new type of data or some specific algorithm. If we get there, then it's easy to see that wrapping data in well-contained libraries with clean, simple and consistent interfaces makes it easy to quickly move through a large series of them, increasing the functionality of the system. A set of libraries with nearly common interfaces is far easier to wire up than is a smaller number of oddly interfaced ones.

A well-written library may require some explanation for the encapsulated data or algorithms, but really it should require NONE for the interface. If you have to read lots of information about how to use weird configuration files, or set strange values, or even work with some non-obvious paradigm when calling the library, then it is those inconsistencies that are wasting lots and lots of time and leaving in their wake lots and lots of bugs.

A good library is not hard to use, so if you're finding the interfaces awkward and difficult to figure out, and you've been at this for a while, then clearly the problem is the library.

Too many library developers throw in their own unique signature across the API or write so that the coding issues are simplified at the expense of the interface (Hibernate is a classic example). Either way, there is wasteful unnecessary complexity added with the library that will probably ripple upwards into the code. Java, as a language is particularly bad for this type of problem; the basic libraries are all very obtuse and irrational. Rumor has it that .NET is an order of magnitude worse; just a massive cesspool of poorly thought out code.

The solution to our programming problems is not to have access to more libraries, it is to have access to BETTER libraries. A big ball of mud in the libraries propagates upwards into a big ball of mud for the architecture. If you build on a crappy foundation, your code is?


AUTOMATION

An important question for code normal forms is whether or not recognition of the different forms is computable. That is, can you write a program that will return the correct normal form for any given piece of source code? Making this work rests on being able to identify similar pieces of code and being able to identify similar pieces of data. The first of these problems is a little easier.

A while back, people were writing signature-based hashing algorithms to help determine which parts of a large code base, such as Linux, were copied from other large code bases. As the code is just a series of instructions, you can also match two series together to some relative degree, e.g. the two are 90% similar.

We can do that by stripping out the contextual information, like variable names, and other things, and just laying out the underlying types and mechanics. Of course, since there is some arbitrariness in some of the order for the steps, it is not an easy problem, but one that could be handled.

In that manner, and by looking at the anchor points in the subtree, a program could flag possible duplicate code. By tracing transitions from an external point, in and around the code, duplicate data can also be found.
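As a rough sketch of the "strip out the names and compare the mechanics" idea, the following Python uses the standard `ast` and `difflib` modules. The function names `structural_signature` and `similarity` are my own inventions for illustration. The sketch reduces each piece of code to the sequence of its syntax-tree node types, so identifiers and literal values drop out, and then scores how closely two sequences match. It makes no attempt to handle reordered steps, which is the harder part of the problem mentioned above.

```python
import ast
from difflib import SequenceMatcher


def structural_signature(source: str) -> list[str]:
    """Reduce source code to its sequence of AST node type names.

    Variable names, function names and literal values are not part of the
    node type, so they are ignored by construction.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def similarity(source_a: str, source_b: str) -> float:
    """Return a score between 0.0 and 1.0 for how alike two pieces of code are."""
    sig_a = structural_signature(source_a)
    sig_b = structural_signature(source_b)
    # autojunk=False keeps common node types from being discarded on long inputs.
    return SequenceMatcher(None, sig_a, sig_b, autojunk=False).ratio()
```

Two functions that differ only in their naming score as identical under this measure, which is exactly the kind of candidate a tool would flag for a human to confirm. A real tool would need something sturdier than a flat node sequence, but the principle is the same.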

Identifying duplicate pieces of data is actually considerably more complex. Just because two pieces currently happen to have the same value, that does not guarantee that they are the same. In that vein, real intelligence is the only way to make sure that any two pieces are actually the same, and that is something a computer can never do. While this may seem to complicate things, there is actually a fairly small number of unique entities in most software, so all we really need are external mappings from one named variable to another. Interestingly enough, if there are no possible entries for this mapping file, then the code is in 5th normal form.
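One way such a mapping might be used, sketched in Python under my own assumptions: the mapping is a plain dictionary written by a person, where each entry says "this variable is really the same thing as that one". The function name `group_aliases` and the variable names in the usage note are hypothetical. The code simply merges the declared pairs into groups, and every group it returns is a piece of data that exists under more than one name.

```python
from collections import defaultdict


def group_aliases(same_as: dict[str, str]) -> list[set[str]]:
    """Group variable names that the mapping declares to be the same entity."""
    parent: dict[str, str] = {}

    def find(name: str) -> str:
        # Follow links up to the representative name for this group.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # shorten the chain as we go
            name = parent[name]
        return name

    # Each mapping entry merges two groups into one.
    for left, right in same_as.items():
        parent[find(left)] = find(right)

    groups: dict[str, set[str]] = defaultdict(set)
    for name in list(parent):
        groups[find(name)].add(name)
    # Only groups with more than one name represent duplicated data.
    return [members for members in groups.values() if len(members) > 1]
```

Given entries such as `{"cust_id": "customer_id", "customerId": "customer_id"}`, the three names come back as a single group. An empty mapping returns no groups at all, which corresponds to the case described above where nothing can be entered in the mapping file.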

If you know the structure and all of the unique data, you can start plotting out diagrams to show structural problems in the code. Along with a data dictionary automatically kept, and some added extra information included by the programmers, all of the system's variables can be tied down to a limited set of common entities.

In that sense, it should be entirely possible to automate recognition for each of the different normal forms. You should be able to press a button and get a listing back for all the code in the system, file-by-file, and its final normal form.

I'm not saying it will be fast (clearly the overhead lights will dim for a while), but it should be entirely automatable. Linked in with an IDE with some type of form-specific refactoring capabilities, the next wave of programmer tools should be able to give us status on our code bases in the same way that MS Word provides readability statistics on writing. While that doesn't guarantee that the code is good, it does guarantee that it's not total garbage.

We should be able to say things like "you can't check that in until it's at least 3rd normal form", and use a tool to show that the work hasn't been completed properly. When it's no longer subjective, it's no longer a personality conflict issue; it's just a simple fact.


SUMMARY

Almost anyone can write computer code, and programmers often take too much pride in this simple exercise. Intrinsically, most of them understand that there is a huge difference between just being able to get something to work and being able to build a longer-term, more elegant solution.

Programming, in terms of variables and loops, isn't really that hard, but then again it isn't really the central problem in software development either. Even getting a single version of software out the door is not the same as reliably releasing version after version for years. There's always lots of programming to be done, but it's all of the things around the code that ultimately determine the success or failure of the project.

To get beyond being just some people who know how to code, we have to set higher standards for ourselves. A professional programmer should be able to produce a grade of code that is far above and beyond some person playing with a computer language.

Just showing up to work on Mondays should orient most programmers towards the goal of trying to get to at least 1st normal form. Cleaning up their code helps them; it makes their lives easier.

2nd and 3rd are structural issues, where the code is laid out with a non-random architecture. This makes extending the code a simpler process, and if done well, it makes it possible for quick turn-arounds on user-driven changes. At the very least, a system in 3rd normal form is fun to work with; below that level, the coding is more painful and tiring.

4th and 5th involve a type of consistency that is very difficult for huge teams or groups of people. They're both necessary in producing an optimal solution, but they are likely beyond the abilities of average groups of programmers. These forms are probably better left to individuals or small highly effective teams.

6th normal form is the dream for any programmer in love with coding. It is that state where just the true minimum of code exists, and it is extremely dense, but not impossible to understand. I've seen it in practice and created a few systems where it has been the underlying goal, so it's quite doable, but the extra effort needed to achieve it is beyond most development aspirations. 6th normal form is where beautiful, elegant code exists, and not surprisingly, as many blog commenters have stated, they've never seen that type of code base in their lives.

Once we get these ideas into practice, they will really help with improving development. An organization, for instance, should establish minimal standards for coding, and deviations should only be for good reasons.

What really plagues Computer Science is that we spend too much time guessing, and not enough time making sure that we are really doing the right thing at the right time. While this can be fun, failure, which is the common result of this lack of science, is not. It really does make work better when you remove all the needless anxiety associated with mismanaged complexity.

It's nice to know when you start a project that it will work and that it will really meet the initial goals. If you allow them, these code normal forms will significantly help with that; but only if you allow them.

12 comments:

  1. Quite a rich and interesting post. It clarified the previous posts for me.

    Why didn't you explicitly refer to design patterns when talking about the 5th normal form? And why didn't you say that the second layer is kind of a DSL layer? And did I miss some other hidden references?

    BTW, although I contradict a bit some of my own previous comment, there is a link between programming and databases: object orientation. IIRC at the beginning, OO was about capturing the problem domain by correctly modeling its data (in a way the customer could understand and validate; it often eventually mapped to database structures). I believe your normal form is the dual form of OO, moving the focus from data structure to code structure. Of course, both approaches encompass code and data, but there's always one of them that is sort of a second-class citizen. I think it is a weakness; functional programming - at a language level (let's forgive OO-FPLs like Scala) - may set up a better balance.

    Speaking of languages, in your opinion, in which ways might a programming language help with normalizing code, or with writing normalized code?

    PS: you should fix the typo "FORTH normal form". That's really too weird for me, being a hobby forth programmer.

  2. Hi Astrobe,

    Thanks, I was hoping to clarify it :-)

    The normal forms are above and beyond any specific language or programming paradigm (such as OO). They should be applicable to any type of language where stack traces make sense. Some type of normalization exists for all types of programming languages.

    All languages have the same basic structure, in that they contain a series of instructions (code) on how to manipulate underlying variables (data). The two might be mixed somewhat in the OO model, but they really can be treated independently. In a strange sense, code can be seen as just a way to move data through a series of transformations. Its dynamic nature makes it more expressive, but it's still related.

    The best languages are simple, easy to relate back to the domain problems, but allow sophisticated methods for encapsulating issues. What we really want is a language that matches the way we express the solutions at each different level in the code. I.e. the less fiddly we make the language, the easier it is to see if it works. The really important issue is in allowing real encapsulation. Once we solve a specific technical problem, it's important to be able to set it aside and ignore it.


    Paul.

  3. I linked to this great post on my blog here. I particularly found myself agreeing with your 3rd Normal Form splitting the domain layer into two layers to address the competing needs of code expressing your model and enterprise re-use.

  4. Hi Moffdub,

    Thanks for the comment. Coding sits on the fence between reality and an abstract machine; we have to accept being pulled in two different directions at once. Some things in code are obvious and inherently correct, while some are subject to the will of the people around us. It takes two different approaches to deal with two different types of requirements.

    Paul.

  5. Hello Paul,

    Coding sits on the fence between reality and an abstract machine [...]

    From my point of view, you accept this, or take it for granted, too easily.
    I think that abstractions that are not in the user requirements are unnecessary complexity.
    Your 3rd layer is necessary not in order to convert users' requirements into the abstractions of a machine, but because one often needs to use many computer abstractions in order to match users' abstractions. If there is a mismatch between users' and computers' abstractions, it is because computers' abstractions are more fine-grained, or more orthogonal, or more general. In the latter case for instance, we simply have to specialize them.
    In an ideal world, APIs (of OSes, of libraries) are well designed, and they are well designed because they introduce only the necessary abstractions, and thus often match the users' abstractions. For instance, a well-designed 3D graphics engine library lets the programmer manipulate cubes, spheres, etc., without requiring him to learn the esoterica of 3D trigonometry or the complexities of OpenGL or DirectX.
    In the real world, we have to deal with poorly designed OS and library APIs. To correct this is the role of the bottom layer in your model.

  6. Great post. However, knowing what you wrote, I'd redo it a bit differently. So let's climb on your shoulders!

    I see two main things: the qualities of the code, and the separation of that code into layers. (That's an approximation, I hope it's good enough.)

    Each layer has an interface and an implementation. The notion of normal form should exist separately for both.

    Example: some big project has some common utilities, and some application code (2 layers implemented in that project). We could divide this project into:

    -> Interface 0: the programming language and other COTS.
    -> Implementation 1: common utilities.
    -> Interface 1: common utilities.
    -> Implementation 2: application code
    -> Interface 2: application interface

    The qualities of your code and interfaces are more like (Cf your normal forms):
    -> 1: no duplicates
    -> 2: similar functionality => similar code/interface
    -> 3: minimized scope (for code), minimized data input (for interfaces)
    -> 4: similar functionality => same interface/code

    Take it as another classification of normal forms. Of course, 2=>1, 3=>2, and 4=>3.

    The only thing an end user will care about is the application interface. I take for granted that it should be in 4th normal form, at least in an ideal world. The interesting question is how much a given normal form at a given level helps when one wants to achieve another given normal form at the level above.

  7. Hi Loup,

    Thanks for the comments.

    Every sort of functional decomposition has an explicit interface, it's just a question of whether you want to expose that as an API for a layer. Internally or externally I don't see much difference, other than in programmers trying to save time by cutting corners for their own code. You can see 2nd and 3rd as interface-oriented normalizations, where the other levels are directed more towards implementations. The lines we draw in the code are there to encapsulate, split work and help with testing. Keeping all of them consistent with each other makes it easier to refactor the environmental elements of the project (not just the code).

    To answer your last question, abnormalities in lower layers work their way upwards. That means that if you build something on a non-normalized library, it is very difficult (but not impossible) to normalize it. If you build something on a normalized layer, then if you follow (more or less) the same conventions, the results tend to be very close (if not exactly) to normalized.

    Paul.

  8. Thank you for the fast answer. Now I can state the obvious.

    Maybe the decomposition of a given project is not that relevant beyond your 3rd and 4th normal forms. I'm not experienced enough to tell. There is still a crucial layer left, however: the Components Off The Shelf.

    Definition: Abnormality <=> mess <=> non normalized.

    Axiom: "Abnormalities in lower layers work their way upwards". Maybe you can make it a theorem.

    Theorem: With messy COTS, normalized code will be difficult, and so will be normalized end user interface. So, as far as possible, we should not use messy COTS.

    Axiom: The programming language(s) is the single most important COTS of basically any project. That's obvious, but again, maybe we can prove this.

    Theorem: Using a messy programming language is the best first step to jeopardize a project and doom your career (or secure your job).

    Conclusion: Don't use messy programming languages. Murder them, and reformat some brains along the way.

    PS: C++ is not in normal form, right?

  9. Hi Loup,

    Sorry, I was out of the country for a few weeks.

    I'd guess that if you did start with a normalized programming language, it would be easier to produce a normalized result, but there is no reason why you can't build a normalized layer on a non-normalized one. It is more work, and is safer if you hide the mess underneath, but there should really be nothing in any Turing-complete language that should stop you from building something normalized on top. Any 'layer' can be normalized regardless of its interactions with any other layer (although the 'border' in that case is unlikely to be normalized).

    Caveat: I'm still jet-lagged :-)

    Paul.

  10. Fascinating post. Have you pursued this further?

    Also, I would not be so quick to dismiss the academic paper you cited. I'm going to have to give it a good read, but at least as far as db apps go, defining code forms in terms of data structures seems sound.

  11. Followup:

    If we take the spirit of the first three database normal forms, we could paraphrase them as:

    1) indivisible

    2) An action claiming to do something must do all of it (no insert or update anomalies)

    3) An action claiming to do something must not do something else as well (delete anomalies)

    So three forms of code, which I find it convenient to think of in terms of functions, might be:

    1st normal form: a function is in first normal form if it cannot be further decomposed

    2nd normal form: a function is in second normal form if it completes its entire purpose

    3rd normal form: a function has no side effects

  12. Hi Ken,

    No, I haven't taken it much further, other than having it affect the way I write code (or avoid writing redundant code). I talked with Marcus (the author of the paper I cited) a bit, but I believe he went on to be a consultant, so I don't think he pursued it either. I think it's an idea slightly ahead of its time :-)

    I like your paraphrasing, what these ideas need most is to get boiled down into their essence :-)

    I took a quick skim through Andromeda, and it looks great. I'm glad to see that there are others pursuing new approaches to frameworks and databases; it is desperately needed. I've played around a lot in my blog with alternative paradigms, so you may find some interesting stuff in the archives. Not too popular, but there are definitely better abstractions out there than what the industry is currently using.

    Thanks for the comments. You can send me email if you want a deeper discussion.

    Paul.

