Thursday, October 30, 2008

Revisiting the Structure of Elegance


This post is a continuation of my earlier one entitled The Structure of Elegance. If you haven't read the first post, then most of this one won't make any sense.

I left off by presenting a different perspective on code, that of seeing it as subtrees of functions within the system. This type of perspective is convenient for allowing a graphical view of the code, one that can be easily manipulated to help normalize the structure into something considerably more elegant. Large collections of code can be an overwhelming jumble of complexity, but by inspecting limited subtrees, most people can get a better grasp on their structure.

This perspective provoked a few questions about how and why this is useful. One of the biggest pieces of feedback I received was about what type of rules would exist for normalizing the code. Before I step into that issue, we need to get a higher level view.

Normalization, much like optimization and simplification, is a reductive process. Given some constraints, the system is reduced to a minimal form that is more or less equivalent to the original system but improved in some regards. These processes all require trade-offs, so it's never a case of finding the perfect solution. A number of ideal solutions may exist with respect to a set of constraints, and the constraints themselves can vary widely.

For instance, what is simplified with respect to a computer -- a massive list of assembler operations -- is nearly impossible for a human to work with. We pack things into encapsulated functions, adding in lots of extra complexity, but that allows us to bite off limited sections of the code at a time. We must simplify with respect to keeping things in bite-sized pieces. A computer doesn't need that extra complexity.

To pick rules for normalizing code we need to first understand what attributes within the code are really important. For instance, we know that a single node in the tree, with all of the instructions placed together in one massive function, is not a particularly easy bit of code to deal with. It's just a big mess of instructions.

Alternatively, we could place each instruction in its own function, giving us a super-long and skinny tree; but that wouldn't be very nice either. Balancing the depth and breadth of the trees is important.

We need to understand what types of dimensions work best, but first we need to look back at the various layers in the tree. Code at different depths is quite different, and requires different types of normalization. The overall structure of the program is not uniform, and shouldn't be handled as such.

At the top of the tree is a fair amount of code that deals with technical control issues. It controls the looping, communication, and functionality-triggering capabilities: anything that binds the driving factors, human or computer, onto specific configured blocks of functionality.

At the very lowest levels of the tree is code that also handles a number of technical issues: low-level things like logging, number and string handling, variable manipulation, common libraries, and even persistence through a database. The lowest levels of the tree either end in specific data manipulations or in interfaces to library or external functionality.

It is in the middle where we find the guts of the program. The part where all of the work is completed.

All code is driven by some usage, which we'll call the problem, or business, domain (although business isn't necessarily defined to be only commerce). Problems are either very specific, such as a system to handle a particular type of data like client management, or very general, such as a spreadsheet. However, even in a general case like a spreadsheet, it gets out there and is configured to be used for very specific problems. In that sense, all software domain problems start out as business ones, and then move inwards to become more abstract, either in the configuration, or in the actual code.

As the problem domain gets more abstract, the language describing it does as well. You might start out with stock prices, getting mapped to configuration information, getting mapped to formulas, then getting mapped to spreadsheet cells. Generally the deeper you go with the data, the more general the usage.

No matter what level of abstraction, the middle of the tree contains some business domain code that gets more and more abstract as it gets lower. Thus we have something like:

Real Problems -> Configuration -> Business Related Problems

  1. Technical Control Issues
  2. Business Related Problems
  3. Abstraction on Business Problems
  4. Smaller and/or Common Technical Issues


So, on the outside we have the real user problems that are mapped onto the primary "business" related ones in the code. Then in the code itself, we have a hierarchy of four layers that can overlay the whole execution tree for the system.

At the very top of the tree, the 1st level, we find we are dealing with the technical control issues. These we want to get to quickly and then set aside. They are technical issues that are not directly affected by changes in the business issues. Expansion of the business demands sometimes requires enhancements, but ideally we like to minimize this code and encapsulate it away from the rest of the system. Its purpose is really just to bind code to some external triggering mechanism.

The same is true for all of the low-level common stuff, the 4th level. It gets created initially, then appended to over time. It usually is fairly controlled in its growth, so we want to encapsulate it as much as possible, and then hide it away at the lower levels of the code.

The ugly bit that is messy and subject to frequent changes is the stuff in the middle. It's also the hardest code to write and frequently the most boring. But it is the guts of the system, and it defines all of the functionality.


THE MIDDLE ROAD

We know that the code is frequently changing at the 2nd and 3rd levels, and that it is driven by forces outside of the developer's control. Generally, the best approach is to try to make a 3rd level that is abstract enough to be resistant to all but the worst possible changes. In a sense we want to minimize the 2nd level, in order to reduce the impact of change.

That means that we really want to build up a set of reusable primitives in the 3rd level that can be used to express the logic in the 2nd. As trees, we want to minimize the depth of the 2nd level, and maximize the depth of the 3rd. The 2nd level is shallow and wide, while the 3rd is deep and wide.
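As a rough sketch of that shape (the domain, class and method names here are entirely hypothetical, just for illustration), a 2nd level function should read as a short, flat sequence of calls into 3rd level primitives, each of which can hide an arbitrarily deep subtree of its own:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PortfolioScreen {

        // 2nd level: shallow and wide, it reads like the user's description of the feature.
        public String showCurrentValue(List<String> symbols, Map<String, Double> holdings) {
            Map<String, Double> prices = fetchLatestPrices(symbols);  // 3rd level primitive
            double value = valuePortfolio(holdings, prices);          // 3rd level primitive
            return formatReport(value);                               // 3rd level primitive
        }

        // 3rd level: reusable domain primitives, each sitting on its own deeper subtree.
        private Map<String, Double> fetchLatestPrices(List<String> symbols) {
            // ... would call down into the communication and parsing subtrees
            return new HashMap<>();
        }

        private double valuePortfolio(Map<String, Double> holdings, Map<String, Double> prices) {
            double total = 0.0;
            for (Map.Entry<String, Double> holding : holdings.entrySet()) {
                total += holding.getValue() * prices.getOrDefault(holding.getKey(), 0.0);
            }
            return total;
        }

        private String formatReport(double value) {
            return String.format("Portfolio value: %.2f", value);
        }
    }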

As the project progresses, if we have more and more base primitives with which to construct increasingly complex domain-specific answers, where each primitive sits on a larger and larger subtree, then it becomes easier over time to construct larger, more complicated solutions to the external problems. It's easier, and it gets faster, as each new abstracted set of subtrees covers more territory.

A broader 3rd level doesn't help if it's huge -- new programmers would reinvent the wheel each time because they can't find existing answers -- so keeping the number of functions/subtrees down to a bare minimum is important too.

This helps in even more ways, since if a subtree is used only once in a program, then it is only tested 0 or 1 times. But if it is used in hundreds of places, the likelihood of testing is severely increased, and thus the likelihood of unnoticed bugs is diminished. Heavily run code is more likely to be better code. Well-structured code promotes quality.

That principle is true for the whole system: a tightly-wound dense program where all of the code is heavily reused will have considerably better quality than a brute-force one where every piece of code is only ever executed in one distinctly small scenario. Ever wonder why after twenty years Microsoft still has so many bugs? Millions of lines of code, especially if they are mostly single use, are an extremely bad thing, and nearly impossible to test. They are also subject to increasing inconsistencies from changes. The modern process of just wrapping old code in a new layer with an ever-increasing onion architecture is foolishly bad. It's worse than spaghetti, and completely unnecessary.

So overall we can see that we want very simple common and library code at the very bottom layer that deals with the technical problems in technical language. At the top we are forced to have layers of technical code that also deals with the problems in a higher technical language.

At the second level we will have lots of code, as it represents all of the functionality of the system, but hopefully it is very shallow. Below that we want to continually build a minimal set of increasingly larger subtrees that act as domain-driven primitives to accomplish non-overlapping goals. The strength of the 3rd layer defines the flexibility of the system, while the consistency in behavior is often defined mostly from the 2nd level.


HOW DATA FITS IN

A big application will have a fairly large number of different major data entities. The scope of this data, in terms of the subtrees, easily quantifies the structure of the code. Data that is duplicated and copied into many different locations in the code is a mess. Refining the data down to specific subtrees limits the scope and contains the errors.

Another important consideration is that the impact of changes to the data is way easier to calculate if the data is only visible within a specific subtree. Change impact analysis can save on costly retesting.

Data drives the primitives, in that similar primitives should all deal with the exact same set of system data.

All of the string functions for instance, just deal with strings. All of the socket library code deals only with sockets and buffers. The persistence layer only deals with SQL and relational structuring.

A set of common primitives share the same underlying data structures. If one of them requires access to something different, that's a sign of a problem. Similarities are good, inconsistencies point to coding errors.

Within the system there will sometimes be vast gulfs between a few of the subtrees caused by drawing architectural lines in the code. One set of subtrees will be mutually distinct from another set with virtually no overlap.

This is a good thing, even if it means that some data is copied out of one subtree and replicated in another; it's a necessary resource expenditure to ensure that the two sections of the code remain mutually independent. Of course, this means that there is clearly a cost to adding in any architectural line, one that should be understood upfront. Somewhere in another post I talk about putting in lines for structure, but also for organizational and operational reasons.


IN OBJECT-ORIENTATION

By now some readers have probably surmised that I've been completely ignoring object-oriented issues. Although I've swept them aside, since objects are really just a collection of related functions tied to a specific set of data, much of the structuring I talked about works very well with object-oriented languages if we work in a few very simple observations.

The first is a little weird: there are really only two types of relationships between objects in this perspective. One is 'composite', the other is 'peer'.

From a subtree perspective, one object in the system contains another if and only if all of the dependent object's methods appear on the stack below the other object's. I.e. one is fully encapsulated by the other; it is a composite object. As this is testable on the stack, it is easy to see if it is true: for two objects A and B, if A.method1 is always ahead of B.method1 and B.method2, then more or less A contains B. All methods of one are bounded by the other.

In the peer case, the two objects interleave on the stack at various levels. The order isn't important, simply that neither is contained, so they work at a peer level with each other.

There of course are other possibilities, but these we don't want. For instance, if A contains B and B contains A, then either these two are really peers, or there is something messed up about their relationship. Consequently, if two objects are peers and one also contains the other, then there are really at least three objects all tied to each other; at least one object is horribly overloaded. That is, it is at least two individual objects crammed together in the same body.

One could get into proving how or why some relationships imply that at least one of the objects is overloaded, but it isn't really that necessary. We really just want to focus on the two basic relationships.

Convoluted objects can be refactored. With some snapshots of the stack, you can break the objects into their component objects, thus normalizing them.
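A minimal sketch of that kind of check (the class names and snapshots are invented; in practice the snapshots would come from a profiler or from logging stack traces): each snapshot lists the classes on the stack from the entry point down to the currently executing method, and A contains B only if A is always on the stack above B.

    import java.util.Arrays;
    import java.util.List;

    public class RelationshipCheck {

        // A "contains" B if, in every snapshot where B appears, A is already on the
        // stack above it (closer to the entry point). If neither contains the other,
        // the two objects are peers.
        static boolean contains(String a, String b, List<List<String>> snapshots) {
            for (List<String> stack : snapshots) {
                int firstB = stack.indexOf(b);
                int firstA = stack.indexOf(a);
                if (firstB >= 0 && (firstA < 0 || firstA > firstB)) {
                    return false;  // B was running without A above it
                }
            }
            return true;
        }

        public static void main(String[] args) {
            List<List<String>> snapshots = Arrays.asList(
                Arrays.asList("Main", "Invoice", "LineItem", "Money"),
                Arrays.asList("Main", "Invoice", "Money"),
                Arrays.asList("Main", "Report", "Invoice", "LineItem"));

            System.out.println(contains("Invoice", "LineItem", snapshots)); // true: composite
            System.out.println(contains("Report", "Invoice", snapshots));   // false
            System.out.println(contains("Invoice", "Report", snapshots));   // false: neither contains the other, so peers
        }
    }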

With only these two relationships between all of the objects, it becomes considerably simpler to construct the system. For instance in a GUI, we glue upwards into the interface, mostly peer objects. Then we create a model of the data we are going to use in the 3rd level, mostly containing composite objects. We try to bury the SQL database persistence quietly in the 4th level again with composite relationships, and that just leaves the second, which is probably some type of mix, depending on what the code is really doing.

Since we're trying to minimize the 2nd level anyways, the basic overall structure of the code becomes quickly obvious, especially if we're concerned about trying to really encapsulate any of the ugly technical problems.

And this can be extended easily. If you get the same problem, but with a client/server breakup in the middle, it's easy to see that it can be solved by adding in another layer between 2 and 3 that hides the communication, although occasionally optimizing a few pieces of functionality into the client (from the server) just to save common resources. The four layers stay the same, even if you're adding a fifth one into the middle.

With the above perspective, peer objects can have complex runtime relationships, but composite ones should not. They should be very simple containing relationships in order to keep the code as simple as possible.

Some object-oriented schools of thought seem to really like complex behavior between many different objects. The problem is that this quickly leads to yet another layer of possible spaghetti. Wide differences between structural relationships and runtime ones can be very difficult to maintain over time. We're best to focus on simpler code, unless the incentives are really, really large.

Ideally, we want to take our business problems and express them in code in a format that is as close as possible to the original description. In that sense, very few real problems have a requirement for a complex runtime relationship, most of our data manipulation is about creating or editing new data in a structural fashion. Systems with extensive peer objects should heavily consider whether the dramatic increases in complexity are worth the cost of the design. It is entirely too easy to build something overcomplicated, when a much simpler design produces the exact same solution.


SPARTAN CODE

I've seen a few references to spartan programming rules:

http://ssdl-wiki.cs.technion.ac.il/wiki/index.php/Spartan_programming

These are great, good steps along the lines of normalization, but like all things, taking simplification too far can actually produce more complexity, not less.

The spartan rules about naming conventions, for example, are too much. Naming is a key way to embed "obvious comments" right into the code. The name used should be the shortest possible name that is still correct for that level of abstraction, and consistent with similar usage throughout the system.

All names are always key pieces of information, and even with temporary variables, using a rule of simplifying to something like one-character names is a really bad idea. Code should flow, and odd naming conventions become distractions. All variables have a proper name, it's just that the programmer may not realize that.

Still, we want to boil down the code to the absolute bare minimum. We want to reuse as much as possible, to achieve better quality. We don't want duplicate information, extra variables, extra code or any other thing that we have to maintain over the years, if in the end it isn't absolutely necessary.

Minimizing most elements brings down the amount of information to be maintained over the long term to the smallest possible set.

There should only be one way to do everything in the system, and it should be consistently done that way. The consistency is important for achieving quality.

In languages where there is more than one way to do things, only one way should be used. More is worse. Pick a good way of handling the technical issues, like loops, and stick with it as far as it will go (although don't use this rule as an excuse to ignore features of the language, or go too far).

Ultimately it's like anything else: if you have too many bits, they become harder and harder to maintain, so eventually it gets messier and messier. Discipline is part of the answer, but so is cutting down on the overall amount of work. Why spend two months working on something, if you could have completed it in two weeks?


NORMALIZATION RULES

The above discussion provides a good overall understanding of the types of things we are looking for in the code. We want to normalize with respect to a limited subset of variables, more or less in some specific order, so choosing the constraints carefully is important.

There are some easy attributes that we can see we want from our code:

  1. All unique pieces of data have unique names.
  2. Variables only get more abstract as they go deeper.
  3. All equivalent subtrees are called at the same level for any tree.
  4. All symmetric instructions are in the same function, or functions at the same level.
  5. All data is scoped to the minimal tree.
  6. All instructions in a function belong there, and are related to each other.
  7. All functions speak of only a few pieces of data.
  8. All data is parsed only once, and then reassembled only once.
  9. Data is only copied between trees for architectural purposes. Otherwise all data exists in the program in one and only one memory location.

With respect to all of the above, and with an understanding of the common problems we find in code, we can lay out the normal forms in a series of layers:


1st Normal Form

- no duplicate sub-trees or data

Go through the code, line by line, and delete any duplicate functions or data. Delete copies of any identical or partial data.
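A tiny, made-up illustration of what this usually looks like in practice: the same few instructions copied into several functions collapse into a single shared subtree once the duplicates are deleted.

    public class Text {
        // Before 1st Normal Form, this logic was copied inline in several places:
        //   String name = rawName == null ? "" : rawName.trim().toLowerCase();
        //   String city = rawCity == null ? "" : rawCity.trim().toLowerCase();
        //
        // After: one subtree, called from everywhere that needs it.
        public static String canonical(String raw) {
            return raw == null ? "" : raw.trim().toLowerCase();
        }
    }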


2nd Normal Form

- balanced common 4th level libraries: height, subtrees and arguments.

This is an architectural level. Similar functions should have similar arguments and be called on a similar level. Push and pull instructions until this happens. Some new functions might have to be created, some combined, and many deleted. Refactor the code until all of the pieces are non-overlapping primitives that encapsulate specific technical issues.
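A small sketch of what "similar functions, similar arguments, similar level" can look like for a 4th level library (this particular file-handling wrapper is hypothetical): the names are symmetric, the argument shapes match, and each call encapsulates exactly one technical issue.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // A balanced 4th level library: every primitive sits at the same level,
    // with mirrored names and consistent arguments, and none of them overlap.
    public class TextFiles {
        public static List<String> readLines(Path file) throws IOException {
            return Files.readAllLines(file);
        }

        public static void writeLines(Path file, List<String> lines) throws IOException {
            Files.write(file, lines);
        }

        public static boolean exists(Path file) {
            return Files.exists(file);
        }

        public static void delete(Path file) throws IOException {
            Files.deleteIfExists(file);
        }
    }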


3rd Normal Form

- balanced domain code: 2nd, 3rd level

This is also an architectural level. 2nd level is minimized, 3rd level is maximized, but with few subtrees. Like 2nd normal form, but at the higher level in the code. The 3rd level can actually be modeled on the verbs/nouns (functions/data) used by the real users to discuss their problems, and their needs. I.e. you can map their language to the 2nd layer, with the building blocks created in the 3rd.


4th Normal Form

- All similar operations use similar code.

All blocks of code are consistent, and use the exact same style and handling. Any code doing similar things is formatted in the same way, using the same style and conventions. This can be done by smaller refactorings in the code to enforce consistency.


5th Normal Form

- All data is minimized in scope.

All data is named properly given its usage, appears an absolute minimum number of times, is only ever parsed once, and is only ever reassembled once. It is only stored once per section. Names for identical data should be identical at the same level. Names (and usage) should get more abstract at deeper levels. Some structural refactoring may be necessary, but a data-dictionary with all variable names (and associated levels) would help find misnamed or duplicate variables.
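The data-dictionary doesn't need to be fancy. A crude sketch like the one below (purely illustrative; reflection is just one possible way to gather the names) simply lists every field name and the classes that declare it, so duplicated or inconsistently named copies of the same data stand out:

    import java.lang.reflect.Field;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class DataDictionary {

        // Collect every declared field name and the classes it appears in.
        // A name showing up in many classes is a candidate for duplicated data.
        public static Map<String, List<String>> build(Class<?>... classes) {
            Map<String, List<String>> dictionary = new TreeMap<>();
            for (Class<?> cls : classes) {
                for (Field field : cls.getDeclaredFields()) {
                    dictionary.computeIfAbsent(field.getName(), k -> new ArrayList<>())
                              .add(cls.getSimpleName());
                }
            }
            return dictionary;
        }

        public static void main(String[] args) {
            build(String.class, StringBuilder.class)
                .forEach((name, owners) -> System.out.println(name + " -> " + owners));
        }
    }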


6th Normal Form

- All functionally similar code exists only once, and is instanced with a minimal set of config variables, both at the high and low levels in the system.

For example, there are only a small number of screen types, even though there are hundreds of screens in the system, and there are only a small number of data types, even though there are tonnes of data in the system. When these two common repetitive levels are melded together into a smaller, denser system, the code quality is massively increased, the amount of work is decreased and the consistency of the entire system is enforced by the code itself. 6th Normal Form implies that there are no duplicate 'functionally similar' blocks of code, since they have all been combined into general ones (every piece of code looks different).
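As a very small sketch of that melding (the screen and field names are invented), instead of hundreds of hand-written screens there is one generic screen type, instanced with a minimal set of config variables:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // One generic screen, driven by configuration, instead of one class per screen.
    public class GenericScreen {
        private final String title;
        private final List<String> fields;

        public GenericScreen(String title, List<String> fields) {
            this.title = title;
            this.fields = fields;
        }

        // The same rendering code serves every screen in the system.
        public String render(Map<String, String> data) {
            StringBuilder out = new StringBuilder(title).append("\n");
            for (String field : fields) {
                out.append("  ").append(field).append(": ")
                   .append(data.getOrDefault(field, "")).append("\n");
            }
            return out.toString();
        }

        public static void main(String[] args) {
            GenericScreen client = new GenericScreen("Client", Arrays.asList("name", "phone"));
            Map<String, String> data = new HashMap<>();
            data.put("name", "Alice");
            data.put("phone", "555-1234");
            System.out.print(client.render(data));
        }
    }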


BUILDING SYSTEMS

With plenty of work, any existing system can be refactored into a fully normalized code base. There is nothing but effort, and the occasional regression testing to stand in the way of cleaning up an existing system. Often the work required seems significantly larger than it actually is, and as part of the process, generally a large number of bugs are quickly removed from the system.

For new systems, development can aspire to achieving higher normal forms initially, but on occasion allow for a little less because of time pressure.

All development is iterative, even if the cycles are long and the methodology tries to hide it, so it makes perfect sense to start each new round of development with a period of refactoring and cleaning.

Done a little at a time, the work is not nearly as painful, and the consistency helps keep the development at a brisk pace. Big systems grind to a halt only because of the messiness of the code. New changes become increasingly more onerous, making them increasingly less desirable.

For an initial project, it is best to quickly show that the upper technical problems fit well with the lower building blocks. New technologies should always be prototyped, but sometimes that process is driven by the development itself. Generally good practice is to start by building a small foundational 4th level, then attach upwards a minimal bit of functionality. This inverted T structure quickly shows that the flow through the code and the technologies are working correctly. Enhancements come from adding more functionality, a bit at a time, but if needed expanding the foundation level first. This type of iterative construction minimizes redoing parts, while making sure that the main body of code is usually close to working most of the time.


GRAND FINALE

The single greatest problem that I've seen over and over again with many large or huge development projects is the code base itself. Sure, there are lots of excuses about management, users or technology, but in the end the problems stem from the coders not applying enough discipline to their work.

Messy code is hard to work with, while elegant code is not. But, as we do not teach coding particularly well, it seems as if very few programmers actually have a sense or understanding of the difference between good and bad.

That's why normalization is so important. It will allow programmers a clear cut set of rules for cleaning up their code. For getting it into a state where other programmers can easily help work on it. For not getting caught up in endless debates about right and wrong, and various conventions.

The rules as laid out in this post are simple and ascend in order of importance; just getting to 1st Normal Form for many systems would be a huge accomplishment. Getting past 2nd and 3rd would help prepare the system for a new way of expansion. Getting even higher would elevate the work from a simple system to something more extraordinary. Achieving 6th Normal Form would produce a work of art.

But -- as is also true with this normalization's database counterpart -- no set of normalization rules is absolutely perfect. There will always be occasions to denormalize; to not follow the rules, or break them temporarily. That's fine, so long as the costs are understood, and a rational trade-off is accomplished. We may be rigid in our coding habits, but we should always be flexible with our beliefs. The best code is the stuff that is easy to understand, fix and then expand later as needed. All of this normalization is just there to help point the way towards achieving those attributes in your code.

Saturday, October 25, 2008

Less Than Best Practices

We live in an era where there is a massive volume of information available about any topic. An era where people often combine common knowledge into idealized concepts like "best practices", giving them some sort of semi-official sounding endorsement. An era where large amounts of what we know is right, isn't really.

For those that build or operate computers, we work in an industry where we wildly guess at our needs. Where we cling to ideas that are easily proven incorrect, because we lack better ones. Where we react, rather than understand. An industry that says incorrect things often with a certainty that should be laughable, if only it really was.

There is a huge distance between an idea that sounds good and a good idea; something that too easily escapes so many people. Recently I stumbled right across another classic example, that of forcing users to constantly change their passwords at regular, enforced intervals.

If a person stands back and thinks heavily about a problem, they can often envision a reasonable number of the variables. In that context, they might come up with some pretty reasonable sounding solutions, but they always need to understand that their "universe" of construction was just an artificially limited subset of the real one. We're not able to work in the full scale, only a subset. That makes a huge difference, because while the primary behaviors may work as expected with the primary variables, it's the secondary side-effects that are not properly being accounted for. And sometimes, those side-effects are far more significant than the original variables.

Forcing users to change their passwords on a fixed cycle, and to not reuse old passwords, is considered to be a "best practice" by most of the IT industry. There are a lot of intelligent sounding arguments that have been devised to justify this approach. In theory it sounds great.

For anyone who has actually worked in an environment where this has been implemented, if they are objective and thorough, they will find empirically that it is frequently not working correctly. It can actually force a significant number of the users into making their passwords less physically secure, and far less secure in general, not to mention that it is seen as an irritant and a morale killer.

The problem is that most people barely remember their password to begin with. And in many environments it is not entirely uncommon to have lots of passwords. All of the grand attempts at unifying systems under big corporate user databases have generally failed due to politics, complexity or technological problems. Most users still have multiple passwords, and many users have dozens of them, particularly operations and development staff.

While some truly organized person with a great memory devised this password changing scheme -- perhaps fearing that a stale password was a weak one -- the bulk of the population does not have the capacity or desire to change their multiple passwords every few months. Stale passwords are no more or less easy to crack through brute force; their biggest problem is in them getting out verbally to the public, or in a crack not being noticed. Not sharing accounts and passwords, and locking out dormant accounts, as well as telling a user each time they log on when they were in last, are all effective ways around these issues.

Stale passwords aren't necessarily weak passwords. Causing new crack attempts after every password change doesn't negate why the initial attempt was successful; the problem remains.

Forcing people to change their password every three months is forcing the passwords out of most people's active memory. That is, it's far harder to remember something that is constantly changing, so you don't. And that is one big, giant, huge, enormous step backwards.

So once the users cannot remember their passwords, they must find other ways not to forget them. Writing it down, memory schemes, common dates, etc. Each and every one of these schemes is weak by definition. Where a user "might" have picked a strong password, frequently changing schemes force most of them into definitely picking weak ones. The problem hasn't been fixed, it has simply been moved into a worse one.

The missed variable in this case is people's inability to remember constantly changing things. For some security expert whose focus is on passwords, the idea that the bulk of the population are too focused on their work to be able to, or care to, spend effort continuously on creating unique passwords is totally missed; just not added to the equation. Their focus is too narrow to match the reality.

An additional problem is procrastination, as all users wait till the very last moment before the computer forces them to change their password, so the new password is often chosen in panic and haste.

In many cases, using some type of simplification scheme is the only way to avoid constantly forgetting the passwords. Automated checks may prevent a bit of this, but there are a multitude of clever ways around this. For instance, my six digit phone mail password is actually just a two digit incrementing (by 2) number. For the "once a month", call-me-back message I get, real security is pointless. I kept forgetting six digits, remembering one isn't that bad.

Thus the passwords end up on papers and stickies, and other bits floating physically around the office. Clever schemes are created to remember them. Frequently they are lost, causing wasted hours of work waiting for resets. Operational people have to keep resetting them, using up more of their time and costs. In the end, it irritates most of the users and it is all a big mess.

There are probably some environments where this type of scheme may work well -- perhaps low-skill jobs in a high turnover environment -- but for most environments that involve a computer this type of practice is counter-productive.

The really big problem is that it is more than obvious to many of the users that this type of practice is needlessly controlling, and therefore insulting to them. It's a "we don't trust you" practice. When IT departments fall back on the mantra that it's officially a "best practice" so it's not arguable, that diminishes their standing in front of their users. People hate to be told that things are good, when they are clearly not. It's just another reason why there is a growing gulf between many IT departments and their users. If they don't trust our security practices, why should they trust our estimates, or even our architectural choices?

Sunday, October 12, 2008

The Structure of Elegance

Computer programming, for many coders, is essentially creating a series of instructions that need to be executed. The order of execution is unimportant, so long as all of the instructions have been completed. This instruction-oriented perspective generally implies that the breakup of the actual instructions themselves into methods or functions is more or less arbitrary. So long as enough instructions are executed, these programmers don't seem to mind the structure or order. It is a common way of seeing code, but it has its own inherent problems.

Over the years, I've found that the more randomized the code base, the harder it is to make it work properly. Random and brute force code both have the annoying attribute that you cannot easily tell visually whether or not the code is correct, or even close to correct. On the other hand, a well-balanced, well-structured program not only looks cleaner, but if all of the pieces are in the right place, the imperfections are obvious. Bad code stands out.

Yes, you can see the bugs caused by the inconsistencies in construction. They become obvious blights on an otherwise clean canvas. You should be able to read the code and have some idea of what it is doing, and whether or not it will work correctly.

Clearly, being able to visually detect inconsistencies in code is a highly critical aspect in achieving high quality. Testing is hit or miss, with never enough resources; getting it right at a lower level is far more effective.

Since we're frequently digging into the code, if it is obvious that there are problems that need to be corrected, it is easier to correct the problems as they are found, rather than allow them to build up into bigger issues.

If code is just some mysterious mess until it's running in a debugger, then new code is tossed in haphazardly, causing a toxic buildup. Relying on a debugger is a dangerous practice because you're only walking through one specific path at a time, which makes it easy and likely that the corner cases will have a significant number of bugs.

While paradigms such as object-oriented were intended to discourage programmers from creating spaghetti code, they can't actually stop it from happening. Logic spread randomly through a messy series of objects, even if they have plausible real-world sounding names, is no better than a random series of functions and procedures. There must be structure to the code, or else the code is a mess.

Nothing in code should ever be arbitrary or random. Ever.

The way to avoid these types of problems is by effectively normalizing the code. Relational database theory has a similar concept, whereby a set of rules is applied to a schema in increasing order to make it more and more orderly. The most orderly version, known as 4th normal form (or possibly 5th, I always get that confused), is considered to be the correct one to use in most general circumstances. Certainly, even if the schema has been denormalized for performance, most good data architects are still well aware of the equivalent fourth normal form version of their database. They know what it is, before they choose to violate it.

The process of normalization applies an increasingly strict rule-base to an existing structure, to force it into some generalized simplifications. You can't arbitrarily simplify everything:

http://theprogrammersparadox.blogspot.com/2007/12/nature-of-simple.html

but these rules of schema normalization take into account the necessary variables to bring down the schema's redundancy and overall complexity. There are no doubt trade-offs made, but they are fairer than just leaving it up to chance.

Code, too, can be modified with a simple set of rules until it is in a cleaner, more normalized form. Simplifications, particularly when there are subjective elements, are never straightforward, but within reason the purpose of applying rules to a code base is to amplify the readability without incurring a considerable expense to the performance. This brings down the major variables into something more tangible.

This isn't a particularly new idea, refactoring has been around for ages, but I often think that many people aren't applying it effectively because they have no idea what it should become. It's a series of small transformations based on localized issues, but that still leaves it all rather arbitrary. When, where and why at a global level do these things help the code, and when are they actually making it worse?


FROM EXPERIENCE

For this posting, which is going to be very long, I want to go through my own perspective on code. In particular, on how I see its internal structure and why I think it can be normalized. It's a long, and often painful argument, but without people understanding the foundations I have no really good way of just boiling this down into nice little bits of advice. Sorry.

In one of my very early development experiences I was lucky enough to work with a medium sized code base that had been heavily edited by a lot of very determined programmers. The results were as close to elegant as I've ever been able to see in any real production code.

Everything had its place, and there was a place for everything; it all fit nicely, it was obvious and everything was right where you would expect it. If you closed your eyes and guessed where a specific set of instructions would have been placed, you'd find that they were almost always exactly where they should be. It was in its own technical way incredibly beautiful.

Fixing and extending a good normalized code base is a pleasant task, hacking away at some pile of muck is not. There are fewer bugs, changes are obvious and extending the original code is actually fun. Because I had a good experience early on, I was always really sensitive to the difference between a disorganized mess and something more elegant. And, more importantly, I was always aware of just how little can often differentiate the two.

The biggest problem has always been trying to explain this to other programmers. I can go on forever about attributes and properties of code, but to most people that just doesn't stick. My "obsession" with arrangement seems counter-productive to some, but only until they've learned for themselves that in getting it right, we don't have to needlessly keep pounding on the same code hoping that it will work better each time. Good code is easy to make work, poor code is not. Good code is always less work.

To some, the concept of elegance may not make sense, but to someone who's seen it, it is crazy not to build things this way. I'm not into putting in any extra work into my projects, I've just learned that keeping the code clean, simple, consistent and elegant is actually the least amount of effort. If we keep up with discipline, the workload decreases. It's that real understanding of how much time gets wasted with sloppy code and quick stupid changes that drives me forward, nothing else. The shortest path to a good long term product is via elegance. I know this to be true from multiple different working experiences.

To get the full sense of normalization, you need to understand the context, so I'll start off this discussion with a few abstract perspectives on code. Weird, yes, but entirely necessary later when understanding my (poor) attempts at normalization rules.


DECOMPOSITION

Software is a long sequence of instructions assembled for computer hardware to execute. It has a beginning, and at some point* it has an end. Over any given instance of the lifespan of a piece of software, the instructions executed are a finite list. The list may change from run to run, but it still is fixed.

* everything ends, but more specifically best practices for computer operations should involve rebooting the machine on a fixed schedule. Theoretical models of computing like the Turing machine have infinite tape, but that just complicates matters unnecessarily in this case.

You can see the instructions from the hardware perspective, say as micro-code or assembler as it is executing, but it is just as easy and convenient to see them in terms of a higher-level language, one that supports the modern notions of functions, scope, conditionals and looping constructs. Mostly for this discussion any of the functional or procedural languages will do (they do embed specific paradigms into their mechanics, but not enough to change their underlying nature).

In the crudest sense, if we wanted to create a new software program, we could just create a program with each and every instruction included, in its proper order. Yes, there would be a huge amount of redundancy, in fact most of the program would be redundant repetitive tasks happening over and over again.

For a program running with over an hour's worth of CPU time, there would be a massively large number of instructions. It would be insane to attempt to sit down and type all of those instructions into a computer. Even with a totally impossible 100% accuracy, it would take years and years to complete the work. Clearly that's impossible.

While it's one big set of instructions, most software interacts with some other control mechanism, be it a user, some hardware, or some other software. In that way we can partition the whole as just smaller sets of instructions triggered by individual entry-points into the list. Each subset performs some discrete piece of functionality and then returns back to one or more controlling loops.

In these many subsets of the instructions there are a huge number of repeating patterns of various sizes. Patterns that repeat quickly, over and over again, and longer running patterns that play out through similar instructions for huge sections of the code. Patterns within patterns.

So we can really see software as a smaller set of lists mapped back to specific functional actions. Lists that are driven by functionality. This perspective helps to break down the big problem into smaller ones, but it's still not really that useful.


DIRECTED WITH NO CYCLES

The idea of having lots of these smaller lists does not make it easy to picture or build complex software. We need a better viewpoint for assembling the functionality.

The list-of-instructions view of software may be interesting from a conceptual point of view, but it really does not match how we build the code. To save time, energy and to make it less likely to have problems, we have to take these huge lists and mentally break them down into a large number of sublists that we call names like functions, procedures or methods. The difference between the three is not important for this particular essay, so I'll refer to each smaller block of instructions as just a function.

We continuously deconstruct the bigger lists into many many smaller ones, primarily to make the problem easier to handle. Once the functions are small enough, they become readily implementable.

A typical program consists of thousands and thousands of functions, broken down into collections based on underlying functionality and/or data. We group these functions together with various concepts likes libraries, packages, modules, etc.

Often at an even higher level, referred to as the architecture, we collect the libraries, packages, modules, etc. into larger parts, called things like engines, subsystems, components, etc. In this way we start building up more complicated structural pieces from the little pieces that we have just torn off the main list. At each level it's just a specific term attached to a sub-list of some size.

Mostly we start by looking down on the problem, then decompose it into little pieces, and start building it up again. These layers of abstraction help us to encapsulate the massive complexity of the system into a small finite number of discrete components that should all work together nicely. Software in total is too complex, so we must continually break it into pieces.


FUNCTIONAL PATHWAYS

Functions are a common visual representation for us. We work with them, but we've also become completely used to seeing them in other circumstances like stack traces. When an un-handled error occurs, most modern languages dump out a stack trace, a list of the currently executing functions at the time of the error.

This is a useful debugging device, but there is also more happening here. A stack trace is a specific pathway in the software. A collection of functions executed at a specific time, in a specific order. While we may see this as a time slice leading to an erroneous condition, the truth is that you could create a stack trace for each and every instruction in the program. Why would you do this? If you took all of the possible stack traces, treating them as paths, you could assemble a much larger data-structure that shows the complete run-time linkages within your program. You'll get one big massive execution graph.

Nice, but it's still not leading anywhere. A graph is a rather hard structure to deal with. In its purest definition it is just an unordered collection of vertices and edges. There's lots of theory and algorithms to deal with them, but life would be easier if we continue to simplify.

We can flatten the expressibility somewhat by the realization that any cycles in the graph are caused by recursion. Function A calls function B which calls function A again. It is interesting to know where and when the design is recursive, but not a necessary bit of knowledge for handling normalizations. Thus we can drop the recursions, by simply truncating any path at the first sign of a repetitive element.

This leaves us with a simpler structure, generally known as a dag, which stands for directed acyclic graph. What's nice about this structure, is that it pretty much looks like a tree where some of the children have been repeated in different locations. A tree where many different parents can point to the same underlying children. Thousands of functions point to utility functions like string append, for example. There is a lot of overlap.

Just to keep life mildly simpler, for the rest of this post I'll talk about the execution graph as a tree. When you see the word "tree", think dag, although I prefer the earlier term because it fits in a bit more with my concerns.

In this discussion I'm not readily concerned with recursion, or the fact that the same function pops up in multiple different places in the same tree. They may have some impact on a higher perspective, but that really shouldn't make a difference here. Because of that, we can just choose to see the whole thing as one big execution tree of functions.


THE IMPORTANCE OF TREES

Sometimes, if you get things framed with the right perspective, understanding comes more naturally. In this case, if we can see all software programs as just big trees of functions, we can make some very interesting statements about their arrangement and structure.

In most large programs, in the first few levels of each tree there is often some control looping construct that the programmer has no influence over. Beyond that, at specific entry points in the system, a programmer can start attaching code in specific sub-trees. Simple programs might have a small number of entry-points, while complex ones might have hundreds.

Seeing a big complex program as a massive tree of functions is probably more detail than most people can handle, so we need to focus in on the details instead. We're not particularly interested in the whole of the program, as much as we are interested in specific sub-trees of the program, and often just within limited ranges (depths) for those trees. What we are most interested in is two things: the relative level of similar functions, and the sub-tree scope of any accessed data.

But we'll have to digress for the moment.


TYPES OF CODE

There are, as it were, only a small number of things that you are actually doing with your code.

Some code is basically a single long running algorithm that follows a particular set of logic to achieve a result, basically a specific connected series of instructions. In some cases, a large collection of algorithms has been stitched together conceptually in something like an engine, all co-operating with each other. More complex, but basically the same as a single algorithm.

Some code is just glue. We are tying together disassociated parts of the system, at either a very high level like a GUI interface, or a low level like an asynchronous callback. Another huge amount of glue in most systems is just taking an internal data model and allowing it to be persistent. Glue is really just a mapping between two orthogonal interfaces.

The final common type of code is those sets of reusable primitives intended to work over and over again. Common routines are here, but so is all of the explicit data handling that forms some internal model of the data that is accessed by other parts of the system. Not always, but the bulk of many large complex systems is the composition/parsing/traversal code that wraps the main data-types. We spend a lot of resources converting the persistent form into something more flexible, applying some simple type of operators, and then repackaging it for long-term storage again.

Thus we have: algorithms, glue and primitives forming our most basic types of code.

Algorithms are easy to deal with, in that you really want to get the entire algorithm all into one big function. Splitting it over a lot of little functions, even if it matches some paradigm like object-oriented generally makes it significantly harder to debug. The biggest most important attribute of an algorithm is that it works. Usually it forms some anchor for the functionality, and it's often subject to permutations on input, making testing all the more critical.

A big function that handles the algorithm simplifies any of the issues, so it's worth violating paradigms like object-orientation in order to maintain the oneness of the algorithm. Of course, the design of a full engine, particularly if it has lots of co-operating algorithms, is considerably more difficult, as the programmer is forced to balance distributing the logic for cleanliness with making it more complex. Realistically, it often takes several attempts to find good trade-offs for complex engines; experience pushes developers to accept having to do way more refactoring on that type of code than is normal.

Glue code is just ugly by nature, and usually uglier in languages that don't make static initial declarations an easy process. Code that sits between any two arbitrary interfaces is inherently ugly by definition, and there is little, other than comments, to help. Glue is glue, and it is increasingly common in our code bases, the side-effect of having lots of underlying libraries to call. The best result is that the glue itself is encapsulated and not allowed to leak out across the rest of the design. More about that later.

So mostly, the heart and soul of our systems are the models and primitive functions we build based around the fair amount of data that needs to be manipulated. We spend a great deal of effort in modern systems copying the data back and forth between a persistence representation and the runtime one. We generally build systems by implementing some internal model of the data we want to manipulate and then map it forwards and backwards to the other parts of the system. Forwards to the interfaces, and GUI. Backwards to the database and persistence.

For all of the complexities of modern software, there really isn't all that much happening under the hood. Sure there is a lot of copying the data around, combining it together and then parsing it again. Moving it from this block of functions, over to another one, and then back. Often there are tangles of if/else statements blocking out endless features, strange sub-loops, and scary error handling. And of course the GUI is inherently ugly, but so is the persistence handling, both parts of the system that quickly degenerate into silliness.

Early spaghetti code was such because it had no inherent structure. Concepts like abstract data-types (ADT) came along, giving us ways to create structure out of modeling the data in the system. We moved more of our code base into being nice well structured primitives. Object-orientation is just a language based implementation of that philosophy. In each case, the structure of the code is actually driven by the structure of the data. Sometimes it gets confused, and often it is not implemented that way, but that's the core of the ideas behind these paradigms.

This, I think is important to understand because it means that inherently the way we have been pushing ourselves to structure our code has always been indirectly driven by the actual structure of the data that we are choosing to manipulate. Granted, this often gets lost in modern dogma, but once we get back to understanding our execution trees and the scope of data within them, this data-oriented approach makes far more sense. Basing the system around the way data is transformed is a simpler perspective than basing it on the millions of steps needed to complete those transformations.


BALANCED TREES

Returning to our overall perspective, we can see every program as a series of entry-points into various sub-trees of functions. If we want the cleanest most simplified system then we can apply various rules at this level to move the instructions and/or functions around to achieve the cleanest, most balanced version of the sub-trees possible. The benefit of all this effort should be to reduce the system to a simple enough state that a larger degree of deficiencies become obvious visually-detectable coding problems.

Two of the key properties in the tree are balance and symmetry. Balance not only refers to width/height of the tree being optimized, it also implies that any two given sub-trees that are similar are in balance with each other, roughly the same height, width and depth and that the arguments to the different functions at the head of the sub-trees are nearly or exactly the same.

The first big property, balance, means that co-aligned primitives should sit at the same level together. All of the similar sub-trees always start together on the same level. All of the primitives are balanced if they form sub-trees of approx the same level and size. The level and depth of all similar functions should be in balance with each other.

For any instruction in a sub-tree, if there is a symmetrical instruction, it too should be in balance. For instance, an 'open' at a specific level should also have a 'close' at that level. The open/close pair should bound a block of code, visually, even if that means they exist by themselves.

This property of symmetry is important because its absence is easily noticeable. It is a great way to spot code that is out of place. If all the functions have a starting instruction, and an ending one, then any function missing one or both is a problem. When we cannot use the computer to enforce this type of consistency, such as in aspect-oriented programming, we must do so visibly.
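A small sketch of the kind of symmetry meant here: the open and the close sit at the same level in the same function, visually bounding the block of work between them, so a missing close is immediately noticeable.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class Symmetry {
        // The open and the close appear as a visible pair at the same level.
        public static long countLines(Path file) throws IOException {
            BufferedReader reader = Files.newBufferedReader(file);   // open
            try {
                long count = 0;
                while (reader.readLine() != null) {
                    count++;
                }
                return count;
            } finally {
                reader.close();                                      // close, same level
            }
        }
    }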

If the same underlying code is being used in multiple places at various different levels, then that is an indicator of a problem. The underlying code and data should fit neatly into the puzzle. The more the structure is graph-like, the messier the architecture is. If all of the calls of a specific function are on the same tree level, then the use of that function is well-balanced.


MIXED PRIMITIVES AND OTHER FAUX PAS

One very common structural problem is to create a set of primitives from one interface paradigm and mix them with another set from another paradigm, providing multiple redundant interfaces to the same underlying code. This common problem, which you'll often see in popular Java libraries for example, is caused by some assumption that more is better, or that the library would be more beneficial if it was more flexible. Bad idea. Two overlapping primitive sets just expand out the complexity for no real benefit.

A complete primitive set forms a closed loop, with just one non-overlapping operation per primitive. Simple examples are add/delete/modify or insert/update/delete, or even add/subtract/multiply/divide. What is crucial here is that all other operations can be expressed as a set of primitives, and that the total set spans all of the possible functionality. There is one and only one way to do everything with a balanced set of primitives; if there are two ways to accomplish the same goal, then one or more of the operators are overlapping.
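A sketch of a closed primitive set (this key/value store interface is hypothetical): every other operation is composed from these, rather than being added as a second, overlapping way in.

    // One non-overlapping operation per primitive; the set spans all of the functionality.
    public interface Store {
        void insert(String key, String value);   // fails if the key already exists
        void update(String key, String value);   // fails if the key is missing
        void delete(String key);
        String fetch(String key);
    }

    // Everything else is composed from the primitives, not added alongside them.
    class StoreOperations {
        static void rename(Store store, String oldKey, String newKey) {
            store.insert(newKey, store.fetch(oldKey));
            store.delete(oldKey);
        }
    }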

It is far better to create two separate, clean implementations, one for each primitive set, than to mix the two together. Mixing is just opening the door to potentially dangerous corner-case problems caused by badly mixing the calls. Why waste time working out all of the weird interactions, especially if they aren't necessary or shouldn't be used in that way? Why give programmers the means to write increasingly convoluted steps, just because they misunderstood how to work with each individual primitive set? It's the type of wasted effort that we should have learned to avoid by now.

Null handling is another common problem, although not necessarily a structural one. Programmers overuse nulls, even though their purpose is very specific. For instance, there is rarely any meaningful difference between an empty container and a null one. Why distinguish between them? Having to test whether a container is null, and then again whether it is empty, is useless code. Just never allow null containers, and use the one and only condition, emptiness, as the indicator. Structurally, empty containers overlap with nulls in virtually all usages. Nulls as an out-of-band signal for a condition are sometimes necessary, but not if their meaning is fake or artificial. Too many programs are poor tangled webs of over-extended null handling.
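
As a small illustrative sketch (the class and method names are hypothetical), returning an empty container instead of null leaves callers with exactly one condition to test:

    import java.util.Collections;
    import java.util.List;

    public class OrderLookup {

        // Returning an empty list instead of null means callers need exactly one check.
        public List<String> findOrders(String customerId) {
            List<String> orders = queryDatabase(customerId);   // hypothetical lookup
            return (orders == null) ? Collections.emptyList() : orders;
        }

        // Caller: one condition, no separate null test.
        public boolean hasOrders(String customerId) {
            return !findOrders(customerId).isEmpty();
        }

        private List<String> queryDatabase(String customerId) {
            return Collections.emptyList();   // stand-in for a real query
        }
    }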

Exception handling, another overused language feature, was intended to clean up specific low-level handling code and to build a better highway for systems to pass up significant errors. Often, though, programmers go beyond that low-level and high-level usage and start using it indiscriminately everywhere. Syntax like try/catch forms secondary execution paths through the system. One nice execution path is visually verifiable, but overlay a lot of little, radically different ones and you quickly swamp the usefulness of the syntax.

Programming is often about restraint, self-discipline and reductionism. Exception handling is one case where it is wise to get rid of as many handler blocks as possible. For low-level external error handling, and for high-level handling, try/catch blocks are extremely helpful, but used anywhere else they should be eyed with suspicion.
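
To illustrate with a rough Java sketch (the file name and class are hypothetical), the try/catch blocks sit only at the low-level external boundary and at the very top; the middle layers simply let errors pass through:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ConfigLoader {

        // Low-level boundary: catch the external error here, translate it once.
        public String readConfig(String path) {
            try {
                return new String(Files.readAllBytes(Paths.get(path)));
            } catch (IOException e) {
                throw new RuntimeException("cannot read config: " + path, e);
            }
        }

        // The middle layers just let errors pass through untouched.
        public int loadTimeout(String path) {
            return Integer.parseInt(readConfig(path).trim());
        }

        // High-level handler: one place at the top that reports the failure.
        public static void main(String[] args) {
            try {
                System.out.println(new ConfigLoader().loadTimeout("app.conf"));
            } catch (RuntimeException e) {
                System.err.println("startup failed: " + e.getMessage());
            }
        }
    }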

All three of these issues are really just instances of programmers adding an extra level of complexity to their instructions to over-compensate for an overall lack of structure. Wasted nulls and excessive try/catch blocks are very noticeable blights in elegant code, but they just fade into the background noise in messy code.

Once the code has been balanced to some degree, it is far easier to see what can be easily deleted, because it is serving no real functional purpose.


DATA AND CODE

Looking at programs as sub-trees of functions allows us to give great consideration to the program's overall structure without getting too lost in the details. But the code by itself will not fully normalize a program into something elegant.

Programs are always composed of two distinct, and often conflicting, things: code and data. The sub-trees lay a structural framework, but we also need to understand how data access is distributed throughout the overall structure.

If we look at all of the data in the system, we can see that it is a relatively small, discrete collection of data-structures, which essentially contain all of the data-types in the system. That is, for any given system, the amount of data used in it is both limited and finite*. You could create a small fixed list of the major entities.

* even when programmers support dynamic data representations they often do so in very static ways, defeating the full power of their dynamic code. It's a safe bet to ignore dynamic code, or at least to contain it all in a fixed set of 'dynamic' data-types (thus making it limited and finite).

This notion is extremely helpful because we can start looking at the scope of all of the data, in terms of the trees in the system. A well-balanced, normalized bit of code will encapsulate specific data structures within specific sub-trees. The data is hidden from any code outside of that tree. This is information hiding, and encapsulation (if the code is buried there too).

We want, at each functional level, for the concepts, information and ideas below that sub-tree to be a small consistent set. As we descend further down the tree, we want a tighter scoping of the data. The data and code in a fixed set of primitive string utilities, for instance, would underneath it all refer to just strings and specific manipulations. As the data sinks lower into the tree, the understanding of it should become more and more general: explicit parameters at a high level are a hash table one level below, and then just strings and keys below that. Thus the language we use in the code to describe variables, function names, parameters, etc. should all match the level and scope of the data. As the level gets deeper, the terminology gets more general.
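
A small hypothetical sketch of this narrowing: the top-level method speaks the domain language, one level down the data is just a hash table of settings, and at the bottom it is nothing but strings and keys:

    import java.util.HashMap;
    import java.util.Map;

    public class Preferences {

        private final Map<String, String> values = new HashMap<>();

        // High level: explicit, domain-named parameters.
        public void rememberHomePage(String url) {
            set("home.page", url);
        }

        // Middle level: the data is just a hash table of settings.
        private void set(String key, String value) {
            values.put(normalize(key), value);
        }

        // Low level: nothing but strings and keys, no domain language left.
        private static String normalize(String key) {
            return key.trim().toLowerCase();
        }
    }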

For a normalized model, the collection of sub-trees that make up the interface all encapsulate the scope of the underlying data. Any data that gets beyond that scope "leaks" into other parts of the system, is global or is effectively global. And these problems with the data are more common than expected.


STATE AND ITS EFFECTS

A global variable is one that is accessible from any location in the entire tree. We've known for a long time that globals are considered dangerous, because they allow multiple access points in different parts of the system to quickly fall out of sync even with simple changes. A reckless change can be followed by a long and painful hunt for the culprit. Because of this, we actively try to avoid globals.

What we know is true, and a big problem, for the whole tree is also true, and a big problem, for any given sub-tree within the system. For any sub-tree, any data location shared across it is the "state" of that sub-tree. Sometimes we don't see it as such, but if multiple locations within the tree access the same variable, then it is essentially a global. Scoped, a bit, but still global.

We've known for a long time that state is bad. State hugely increases the likelihood of errors, and makes it very hard to test to see if the software is working or not. State problems may require weird compound testing methods to re-create, so they are very expensive both in terms of development and testing. Most non-obvious bugs* are a result of state problems of some type.

* OK, threading bugs in Java, and dangling pointers in C and C++ are probably way more popular, but these were "features" added to the languages to keep programmers employed.

Implicit state is still state, and it is far more dangerous because the programmers are generally unaware of the problem. Any sort of data that is not explicitly passed in and/or out of a function is some type of state. That means that any and every side-effect is an implicit state of some kind. Any state that is changed in many different places in the program, even if the change goes through a covering function call, is a de facto global variable, with all of its inherent problems and weaknesses. Any and all of those locations are dangerous.
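
A tiny, deliberately simplified Java sketch of the difference (the names are hypothetical): the first method works through a hidden field, which behaves like a scoped global, while the second passes everything explicitly through its signature:

    public class InvoiceTotals {

        // Implicit state: this field is read and written from several places,
        // so it behaves like a scoped global for the whole class.
        private double runningTotal;

        public void addImplicitly(double amount) {
            runningTotal += amount;            // side-effect, hidden from the caller
        }

        // Explicit version: everything flows in and out through the signature,
        // so there is no hidden state to fall out of sync.
        public static double addExplicitly(double total, double amount) {
            return total + amount;
        }
    }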

Stateless code was a great idea and a best-practice goal for a while, but that went horribly wrong with paradigms like object-orientation. Objects are inherently state-happy, and in that way they often hide other, more hideous state problems from unsuspecting programmers. An instance of an object can be scattered across an execution graph like a splatter-paint artwork. One object instance can easily end up acting as a bad global variable for every other. It can get very ugly, very quickly.

Without careful consideration of these structural relationships, old problems that we banished for good reasons in the past can easily creep back into our code bases. Worse, they can be effectively hidden from most developers. Toss in a fine helping of threads, and it is easy to understand why so many popular applications occasionally, and often quietly, cease to behave correctly, and why they do so only in a tiny fraction of their runs. Seemingly random, non-deterministic problems, lurking in the background, wasting lots and lots of time and effort.

These realizations are one of the primary reasons why it is important to sometimes change perspective on a problem. Hidden, yet inherent, flaws in one viewpoint become far more obvious and understandable from another.


WASTED RESOURCES

If you trace the data through many systems you will find that it progresses through the code, jump by jump, in a series of copies. Sometimes it is buffer copies, sometimes it is being parsed, sometimes it is being reassembled. The path of any given piece of data through a system always involves lots of copies. Modern languages and paradigms have made this problem worse, causing this type of bloat to increase rapidly.

From a tree perspective, what is happening is that the data is scoped within a large series of different sub-trees. As it leaves one sub-tree, it is copied into the next one. In this sense, we can see each copy as an implicit violation of the sub-tree encapsulation. More to the point, if a large chunk of data is copied into a sub-tree only to facilitate some small set of manipulations, then that specific code could easily be moved to a more appropriate tree.
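
A rough sketch of the same idea in Java (the names are made up for illustration): handing out a copy of all the data to another sub-tree, versus moving the small manipulation to where the data already lives so that only the answer crosses the boundary:

    import java.util.ArrayList;
    import java.util.List;

    public class CustomerList {

        private final List<String> statuses = new ArrayList<>();

        public void add(String status) {
            statuses.add(status);
        }

        // Wasteful: hands a copy of all the data to some other sub-tree,
        // which only wanted one small number out of it.
        public List<String> copyOfStatuses() {
            return new ArrayList<>(statuses);
        }

        // Better: the small manipulation lives with the data, so only the
        // answer crosses the sub-tree boundary, not a copy of everything.
        public int countActive() {
            int count = 0;
            for (String status : statuses) {
                if ("ACTIVE".equals(status)) {
                    count++;
                }
            }
            return count;
        }
    }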

In that sense, the smaller the tree, and the fewer of them that hold the data, the more encapsulated it is. Watching how the data flows throughout the system gives a good indication of a better working structure.


ARCHITECTURAL LINES

At an even higher level, we can see the architecture as how the big major sub-trees in the system are laid out with respect to each other. Balance and symmetry apply here as well as anywhere else.

To get a real architectural line between two pieces of code, they both need to have entirely separate sub-trees and data. Overlap in either crosses the line.

Encapsulation is burying all of the messy details, code and data, of something behind a small subset of sub-trees. They act as the interface that hides all of the other detail. Decomposing a problem properly makes it easier to build a real, workable solution, not just one that is close to workable. We need to encapsulate the details in order to manage the complexity of the project and actually get it done.

More importantly, libraries, modules, etc. should be organized around their underlying data, not around their algorithmic code. That principle makes it easy to see a library as the data-containment functionality for a specific data type in the program. The algorithm handling and the data handling should be separated.
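
A minimal sketch of that separation, with hypothetical names: one module does nothing but contain and guard a specific data type, while the algorithm sits in a separate module that only works through that containment interface:

    import java.util.ArrayList;
    import java.util.List;

    // Data handling: this module does nothing but contain and guard one data type.
    class PriceSeries {
        private final List<Double> prices = new ArrayList<>();

        void add(double price)  { prices.add(price); }
        int size()              { return prices.size(); }
        double get(int index)   { return prices.get(index); }
    }

    // Algorithm handling: kept in a separate module that only reads the data
    // through the containment interface above.
    class PriceSmoothing {
        static double average(PriceSeries series) {
            if (series.size() == 0) {
                return 0.0;
            }
            double sum = 0.0;
            for (int i = 0; i < series.size(); i++) {
                sum += series.get(i);
            }
            return sum / series.size();
        }
    }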

As a related note, many user libraries and packages combine mashes of algorithms and data-handling that are inconsistent and unbalanced. Clearly defining the structure around well-balanced decompositions would make most libraries considerably easier to use. We need a movement that wraps simple components around specific data structures or algorithms, with complete access to each. That would make the choice of using a specific library really a decision about supporting a new type of data, and it would also cut down on new versions and upgrades.

The spasmodic and arbitrary blend of data and functionality whipped up into most modern libraries forces a constant cycle of updating, if for no other reason than to try to get some of the contained functionality into a more complete state. For many libraries, this dynamic upgrade path is not necessary; it is simply a by-product of disorganization and bad partitioning. This is a clear example of why better-normalized libraries would significantly cut down on development effort.


REFACTORING

Knowing what a good structure is doesn't help unless there is some easy and simple way to get any program there. Refactoring acts as the set of micro-normalization rules that allow a programmer to start with anything and make it more orderly. Of course, simple consistency is also critical in making it all hold together.

You can see all of the refactoring algorithms as just ways of pushing and pulling the code up and down between the different levels of the tree. In this sense we can balance the functions, then balance their usage, then balance the data, etc. Think of it like the "rotate" operations in a weight-balanced binary tree.
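
As a tiny illustrative example of pushing code down a level (the tax rate and names are hypothetical), the classic extract-method refactoring takes instructions sitting at one level and moves them into their own function one level below, without changing the behavior:

    public class Billing {

        // Before: the tax calculation is inlined at this level of the tree.
        public double totalBefore(double amount) {
            double tax = amount * 0.13;
            return amount + tax;
        }

        // After: the same instructions pushed down one level into their own
        // function, which can now be balanced against similar sub-trees.
        public double totalAfter(double amount) {
            return amount + taxOn(amount);
        }

        private double taxOn(double amount) {
            return amount * 0.13;
        }
    }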

It is possible to take any working program and, after applying a very long series of non-volatile refactorings, arrive back at another working program. Refactoring doesn't have to interfere with the functioning of the code; in fact it is far better to pass through with a large series of non-altering changes first, before moving on to expanding the code base with new functionality.

Not all refactorings done this way will be non-destructive, because by definition some of them will actually be removing bugs from the system. The changes in behavior are usually good, but there can be unexpected dependencies tied to the buggy code. Under these types of circumstances, it's best to temporarily duplicate the code, keeping a new clean version alongside the existing broken one. That makes it possible to reassemble all of the pieces first and do some comparison testing to ensure that none of the behavior has changed, before moving on to deleting the dependencies on the broken code. Getting the code quickly back to working order finds obvious problems early and keeps the development work moving forward in a series of small, independent, discrete steps.
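
A rough sketch of that temporary duplication (the rounding example is hypothetical): the old, slightly broken version stays in place beside the new clean one, both are compared on the same inputs, and only then are the dependencies repointed and the old version deleted:

    public class Rounding {

        // Existing, slightly broken version, kept temporarily so callers still work.
        public static long roundOld(double value) {
            return (long) (value + 0.5);          // mishandles some negative values
        }

        // New, clean version introduced alongside it.
        public static long roundNew(double value) {
            return Math.round(value);
        }

        // Comparison step: run both against the same inputs before deleting
        // the old one and repointing any dependencies that relied on the bug.
        public static void main(String[] args) {
            double[] samples = { 1.4, 1.5, -1.5, -2.6 };
            for (double v : samples) {
                System.out.println(v + " old=" + roundOld(v) + " new=" + roundNew(v));
            }
        }
    }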

Normalized code that has been refactored, then retested (lightly), sets a strong base for extending the system to encompass the next level of functionality. Without this type of behavior, the code simply degenerates into some hideous onion-nightmare, a sad and embarrassing state of affairs that is entirely unnecessary. Each time the code degenerates, it becomes more of a work magnet, drawing in masses of wasted time debugging stupid problems and working around fixable issues. Anybody working on that type of code base knows that pretty quickly more effort goes into badly patching sloppy problems than goes into new development. A sad, and absolutely avoidable, state.


FINAL SUMMARY

I wasn't really specific in producing a finite set of "forms" for normalizing code. But if you see it as a structural problem, then the rules themselves are less important; they are simply the easier way to transform one structure into a better one. The final structure is what's key.

Someday, I'm sure, someone will come along with a clearer set of rules. Something that can easily fit onto the back of refactoring, and that makes it easily understood at a higher level.

We know the code is normalized by the fact that the final structure we create is an easy one to read. We've simplified the execution graph. The code maps to the structure, which maps back to the code again. A messy graph is usually messy code.

Be careful in applying this knowledge, for as I said in "The Nature of Simple", human-based simplifications are not the same as machine ones. We are somewhat flawed, and as such our normalizations will be too. We don't want things that are truly universally simplified, just ones that are 'simpler' to us.

If that's true, then why bother? As with databases, a good developer knows what normal form is for their code, even if they don't strictly follow it. There are exceptions, but you cannot understand when they are OK if you do not grasp the complete picture. Breaking the rules without understanding them just pushes success back onto luck. Relying on luck fails often enough.

Don't forget that this work, extra as it may be, isn't to be done for fun, or because it's right. It is to be done to make it easier to move the code base forward to the next version. It is to be done to clean up the old messes, and to make way for a better version. It is to be done to save time, and to allow us to leverage our coding abilities, instead of our ability to continuously hit "next" in a debugger. It's neither arbitrary nor extra, simply work that needs to be completed to ensure that the overall health of the project gets better from month to month, not worse.