Thursday, October 30, 2008

Revisiting the Structure of Elegance


This post is a continuation of my earlier one entitled The Structure of Elegance. If you haven't read the first post, then most of this one won't make any sense.

I left off by presenting a different perspective on code: that of subtrees of functions within the system. This perspective is convenient because it allows a graphical view of the code, one that can be easily manipulated to help normalize the structure into something considerably more elegant. Large collections of code can be an overwhelming jumble of complexity, but by inspecting limited subtrees, most people can get a better grasp of their structure.
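
To make that concrete, here is a rough sketch in Python. The call graph below is entirely made up, but it shows the basic idea: once you have a map of which functions call which, any function becomes the root of a subtree that you can inspect on its own.

    # A minimal sketch of the "code as subtrees" view. The caller -> callee
    # map below is hypothetical; a real one would be extracted from the system.
    call_graph = {
        "main": ["load_orders", "report"],
        "load_orders": ["parse_row", "validate"],
        "report": ["format_line", "validate"],
        "parse_row": [],
        "validate": [],
        "format_line": [],
    }

    def print_subtree(func, depth=0, path=()):
        """Print the subtree of functions reachable from 'func'."""
        print("  " * depth + func)
        if func in path:          # guard against recursive calls in the graph
            return
        for callee in call_graph.get(func, []):
            print_subtree(callee, depth + 1, path + (func,))

    print_subtree("report")       # inspect one limited subtree at a time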

This perspective provoked a few questions about how and why it is useful. One of the biggest pieces of feedback I received was about what types of rules would exist for normalizing the code. Before I step into that issue, we need to get a higher-level view.

Normalization, much like optimization and simplification, is a reductive process. Given some constraints, the system is reduced to a minimal form that is more or less equivalent to the original, but improved in some regards. These processes all require trade-offs, so it's never a case of finding the one perfect solution. A number of ideal solutions may exist with respect to a set of constraints, and the constraints themselves can vary widely.

For instance, what is simplified with respect to a computer -- a massive list of assembler operations -- is nearly impossible for a human to work with. We pack things into encapsulated functions, adding in lots of extra complexity, but that allows us to bite off limited sections of the code at a time. We must simplify with respect to keeping things in bite-sized pieces. A computer doesn't need that extra complexity.

To pick rules for normalizing code, we first need to understand which attributes of the code are really important. For instance, we know that a single node in the tree, with all of the instructions placed together in one massive function, is not a particularly easy bit of code to deal with. It's just a big mess of instructions.

Alternatively, we could place each instruction in its own function, giving us a super-long and skinny tree; but that wouldn't be very nice either. Balancing the depth and breadth of the trees is important.
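
As a tiny, invented illustration of that balance, compare one big node holding every instruction against a shallow root that calls a couple of bite-sized subtrees:

    # Hypothetical example: a single giant node versus a balanced tree.
    def report_totals_flat(rows):
        # everything inline; harmless at this size, unreadable at 5,000 lines
        total = 0
        for row in rows:
            parts = row.split(",")
            total += float(parts[1].strip())
        return "Total: %.2f" % total

    # Balanced: a shallow root over a few small, reusable subtrees.
    def parse_value(row):
        return float(row.split(",")[1].strip())

    def sum_values(rows):
        return sum(parse_value(row) for row in rows)

    def report_totals(rows):
        return "Total: %.2f" % sum_values(rows)

    rows = ["widgets, 10.50", "gadgets, 3.25"]
    print(report_totals_flat(rows), report_totals(rows))   # both print Total: 13.75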

We need to understand what types of dimensions work best, but first we need to look back at the various layers in the tree. Code at different depths is quite different, and requires different types of normalizations. The overall structure of the program is not uniform, and shouldn't be handled as such.

At the top of the tree is a fair amount of code that deals with technical control issues. It controls the looping, communication, and functionality-triggering capabilities: anything that binds the driving factors, human or computer, onto specific configured blocks of functionality.

At the very lowest levels of the tree is code that also handles a number of technical issues: low-level things like logging, number and string handling, variable manipulation, common libraries, and even persistence through a database. The lowest levels of the tree either end in specific data manipulations or in interfaces to library or external functionality.

It is in the middle where we find the guts of the program: the part where all of the work is completed.

All code is driven by some usage, which we will call the problem, or business, domain (although business isn't necessarily defined to be only commerce). Problems are either very specific, such as a system to handle a particular type of data like client management, or very general, such as a spreadsheet. However, even something as general as a spreadsheet gets out there and is configured to be used for very specific problems. In that sense, all software problems start out as business ones, and then move inwards to become more abstract, either in the configuration or in the actual code.

As the problem domain gets more abstract, the language describing it does as well. You might start out with stock prices, which get mapped to configuration information, then to formulas, and finally to spreadsheet cells. Generally, the deeper you go with the data, the more general the usage.

No matter what level of abstraction, the middle of the tree contains some business domain code that gets more and more abstract as it gets lower. Thus we have something like:

Real Problems -> Configuration -> Business Related Problems

  1. Technical Control Issues
  2. Business Related Problems
  3. Abstraction on Business Problems
  4. Smaller and/or Common Technical Issues


So, on the outside we have the real user problems that are mapped onto the primary "business" related ones in the code. Then in the code itself, we have a hierarchy of four layers that can overlay the whole execution tree for the system.

At the very top of the tree, the 1st level, we find ourselves dealing with the technical control issues. These we want to get to quickly and then set aside. They are technical issues that are not directly affected by changes in the business issues. Expansion of the business demands sometimes requires enhancements, but ideally we would like to minimize this code and encapsulate it away from the rest of the system. Its purpose is really just to bind code to some external triggering mechanism.

The same is true for all of the low-level common stuff, the 4th level. It gets created initially, then appended to over time. Its growth is usually fairly controlled, so we want to encapsulate it as much as possible, and then hide it away at the lower levels of the code.

The ugly bit that is messy and subject to frequent changes is the stuff in the middle. It's also the hardest code to write and frequently the most boring. But it is the guts of the system, and it defines all of the functionality.


THE MIDDLE ROAD

We know that the code at the 2nd and 3rd levels changes frequently, and that the changes are driven by forces outside of the developer's control. Generally, the best approach is to try to make the 3rd level abstract enough to be resistant to all but the worst possible changes. In a sense, we want to minimize the 2nd level in order to reduce the impact of change.

That means that we really want to build up a set of reusable primitives in the 3rd level that can be used to express the logic in the 2nd. As trees, we want to minimize the depth of the 2nd level, and maximize the depth of the 3rd. The 2nd level is shallow and wide, while the 3rd is deep and wide.
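
Here is a hypothetical sketch of that shape, using an invented portfolio example (none of these names come from a real system): the 2nd-level function reads like a single domain statement, while the 3rd-level primitives underneath it are the reusable pieces that keep growing deeper.

    # 3rd level: reusable domain primitives, each the root of its own subtree.
    def fetch_quote(symbol):
        # stand-in for whatever really supplies prices
        return {"ACME": 12.5, "GLOBEX": 7.0}.get(symbol, 0.0)

    def load_quotes(symbols):
        return {s: fetch_quote(s) for s in symbols}

    def apply_fees(amount, fee_rate=0.01):
        return amount * (1.0 - fee_rate)

    def portfolio_value(holdings, quotes):
        return sum(qty * quotes[sym] for sym, qty in holdings.items())

    # 2nd level: shallow and wide, just the domain logic expressed in primitives.
    def report_portfolio(holdings):
        quotes = load_quotes(holdings.keys())
        return "Net value: %.2f" % apply_fees(portfolio_value(holdings, quotes))

    print(report_portfolio({"ACME": 10, "GLOBEX": 4}))   # Net value: 151.47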

As the project progresses, if we have more and more base primitives with which to construct increasingly complex domain-specific answers, where each primitive sits on a larger and larger subtree, then it becomes easier over time to construct larger, more complicated solutions to the external problems. It's easier, and it gets faster, as each new abstracted set of subtrees covers more territory.

A broader 3rd level doesn't help if it's huge -- new programmers would reinvent the wheel each time because they can't find the existing answers -- so keeping the number of functions/subtrees down to a bare minimum is important too.

This helps in even more ways: if a subtree is used only once in a program, then it is only tested zero or one times. But if it is used in hundreds of places, the likelihood of it being tested is greatly increased, and thus the likelihood of unnoticed bugs is diminished. Heavily run code is more likely to be better code. Well-structured code promotes quality.

That principle is true for the whole system: a tightly-wound, dense program where all of the code is heavily reused will have considerably better quality than a brute-force one where every piece of code is only ever executed in one distinctly small scenario. Ever wonder why after twenty years Microsoft still has so many bugs? Millions of lines of code, especially if they are mostly single-use, are an extremely bad thing, and nearly impossible to test. They are also subject to increasing inconsistencies from changes. The modern process of just wrapping old code in a new layer, in an ever-growing onion architecture, is foolishly bad. It's worse than spaghetti, and completely unnecessary.

So overall, we can see that we want very simple common and library code at the very bottom layer, dealing with the technical problems in technical language. At the top we are forced to have layers of technical code that also deal with the problems in a higher technical language.

At the 2nd level we will have lots of code, since it represents all of the functionality of the system, but hopefully it is very shallow. Below that, we want to continually build a minimal set of increasingly larger subtrees that act as domain-driven primitives to accomplish non-overlapping goals. The strength of the 3rd level defines the flexibility of the system, while the consistency of behavior is mostly defined by the 2nd.


HOW DATA FITS IN

A big application will have a fairly large number of different major data entities. The scope of this data, in terms of the subtrees, easily quantifies the structure of the code. Data that is duplicated and copied into many different locations in the code is a mess. Refining the data down to specific subtrees limits the scope and contains the errors.

Another important consideration is that the impact of changes to the data is way easier to calculate if the data is only visible within a specific subtree. Change impact analysis can save on costly retesting.

Data drives the primitives, in that each of them, if they are similar, should deal with the exact same set of system data.

All of the string functions for instance, just deal with strings. All of the socket library code deals only with sockets and buffers. The persistence layer only deals with SQL and relational structuring.

A set of common primitives share the same underlying data structures. If one of them requires access to something different, that's a sign of a problem. Similarities are good, inconsistencies point to coding errors.
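
For instance (an invented Order record, purely for illustration), a family of primitives that all speak about the same structure looks like this; if one of them suddenly needed a socket handle or a customer record, that would be the inconsistency worth questioning:

    from dataclasses import dataclass, field

    @dataclass
    class Order:
        id: int
        items: list = field(default_factory=list)
        total: float = 0.0

    # Every primitive in this family deals only with Order data.
    def order_is_empty(order):
        return not order.items

    def order_add_item(order, item, price):
        order.items.append(item)
        order.total += price

    def order_summary(order):
        return "order %d: %d items, %.2f" % (order.id, len(order.items), order.total)

    o = Order(id=7)
    order_add_item(o, "widget", 10.5)
    print(order_summary(o))   # order 7: 1 items, 10.50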

Within the system there will sometimes be vast gulfs between a few of the subtrees caused by drawing architectural lines in the code. One set of subtrees will be mutually distinct from another set with virtually no overlap.

This is a good thing, even if it means that some data is copied out of one subtree and replicated in another; it's a necessary resource expenditure to ensure that the two sections of the code remain mutually independent. Of course, this means that there is clearly a cost to adding in any architectural line, one that should be understood upfront. Somewhere in another post I talk about putting in lines for structure, but also for organizational and operational reasons.


IN OBJECT-ORIENTATION

By now some readers have probably surmised that I've been completely ignoring object-oriented issues. Although I've swept them aside, since objects are really just collections of related functions tied to specific sets of data, much of the structuring I talked about works very well with object-oriented languages, if we work in a few very simple observations.

The first is a little weird: there are really only two types of relationships between objects in this perspective. One is 'composite', the other is 'peer'.

From a subtree perspective, one object in the system contains another if and only if all of the dependent object's methods appear on the stack below methods of the other object. I.e. one is fully encapsulated by the other; it is a composite object. As this is testable on the stack, it is easy to see if it is true: for two objects A and B, if A.method1 is always ahead of B.method1, B.method2, etc., then more or less A contains B. All methods of one are bounded by the other.

In the peer case, the two objects are interleaved on the stack at various levels. The order isn't important, simply that neither is contained, so they work at a peer level with each other.
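
Here is a rough sketch of how that test could look, using made-up stack snapshots (each one is just a list of "Class.method" frames, outermost call first); it isn't a real profiler, only the classification idea:

    def contains(outer, inner, snapshots):
        """True if every frame of 'inner' sits below some frame of 'outer'."""
        for stack in snapshots:
            outer_top = next((i for i, f in enumerate(stack)
                              if f.startswith(outer + ".")), None)
            for i, frame in enumerate(stack):
                if frame.startswith(inner + "."):
                    if outer_top is None or i <= outer_top:
                        return False
        return True

    def classify(a, b, snapshots):
        a_holds_b = contains(a, b, snapshots)
        b_holds_a = contains(b, a, snapshots)
        if a_holds_b and b_holds_a:
            return "suspect: each appears to contain the other"
        if a_holds_b or b_holds_a:
            return "composite"
        return "peer"

    snapshots = [["A.method1", "B.method1"],
                 ["A.method1", "A.method2", "B.method2"]]
    print(classify("A", "B", snapshots))   # composite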

There are of course other possibilities, but these we don't want. For instance, if A contains B and B contains A, then either the two are really peers, or there is something messed up about their relationship. Similarly, if two objects are peers and yet one also contains the other, then there are really at least three objects all tied to each other, and at least one of them is horribly overloaded. That is, it is at least two individual objects crammed together in the same body.

One could get into proving how or why some relationships imply that at least one of the objects is overloaded, but it isn't really that necessary. We really just want to focus on the two basic relationships.

Convoluted objects can be refactored. With some snapshots of the stack, you can break them into their component objects, thus normalizing them.

With only these two relationships between all of the objects, it becomes considerably simpler to construct the system. For instance, in a GUI we glue upwards into the interface, mostly with peer objects. Then we create a model of the data we are going to use in the 3rd level, mostly with composite objects. We try to bury the SQL database persistence quietly in the 4th level, again with composite relationships, and that just leaves the 2nd, which is probably some type of mix, depending on what the code is really doing.

Since we're trying to minimize the 2nd level anyway, the basic overall structure of the code becomes quickly obvious, especially if we're concerned about really encapsulating any of the ugly technical problems.

And this can be extended easily. If you get the same problem, but with a client/server split in the middle, it's easy to see that it can be solved by adding another layer between 2 and 3 that hides the communication, while occasionally optimizing a few pieces of functionality into the client (from the server) just to save common resources. The four layers stay the same, even if you're adding a fifth one into the middle.

With the above perspective, peer objects can have complex runtime relationships, but composite ones should not. They should be very simple containing relationships in order to keep the code as simple as possible.

Some object-oriented schools of thought seem to really like complex behavior between many different objects. The problem is that this quickly leads to yet another layer of possible spaghetti. Wide differences between the structural relationships and the runtime ones can be very difficult to maintain over time. We're best to focus on simpler code, unless the incentives are really large.

Ideally, we want to take our business problems and express them in code in a format that is as close as possible to the original description. In that sense, very few real problems require a complex runtime relationship; most of our data manipulation is about creating or editing data in a structural fashion. Systems with extensive peer objects should heavily consider whether the dramatic increases in complexity are worth the cost of the design. It is entirely too easy to build something overcomplicated, when a much simpler design produces the exact same solution.


SPARTAN CODE

I've seen a few references to spartan programming rules:

http://ssdl-wiki.cs.technion.ac.il/wiki/index.php/Spartan_programming

These are great, good steps along the lines of normalization, but like all things, taking simplification too far can actually produce more complexity, not less.

The spartan rules about naming conventions, for example, go too far. Naming is a key way to embed "obvious comments" right into the code. The name used should be the shortest name that is still fully correct for that level of abstraction, and consistent with similar usage throughout the system.

All names are key pieces of information, and even with temporary variables, a rule of simplifying to something like one-character names is a really bad idea. Code should flow, and odd naming conventions become distractions. Every variable has a proper name; it's just that the programmer may not realize it yet.
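
A trivial, made-up payroll example of the difference; the one-character version runs just as well, but the reader has to reverse-engineer what it means:

    def p(l, r):
        # what are l and r? the names tell you nothing
        return sum(x * r for x in l)

    def gross_payroll(hours_worked, hourly_rate):
        # the names carry the "obvious comment" right in the code
        return sum(hours * hourly_rate for hours in hours_worked)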

Still, we want to boil down the code to the absolute bare minimum. We want to reuse as much as possible, to achieve better quality. We don't want duplicate information, extra variables, extra code or any other thing that we have to maintain over the years, if in the end it isn't absolutely necessary.

Minimizing most elements brings the amount of information to be maintained over the long term down to the smallest possible set.

There should only be one way to do everything in the system, and it should be consistently done that way. The consistency is important for achieving quality.

In languages where there is more than one way to do things, only one of those ways should be used; more is worse. Pick a good way of handling each technical issue, like loops, and stick with it for as far as it will go (although don't use this rule as an excuse to ignore features of the language, or take it too far).

Ultimately, it's like anything else: if you have too many bits, they become harder and harder to maintain, so eventually it gets messier and messier. Discipline is part of the answer, but so is cutting down on the overall amount of work. Why spend two months working on something, if you could have completed it in two weeks?


NORMALIZATION RULES

The above discussion provides a good overall understanding of the types of things we are looking for in the code. We want to normalize with respect to a limited subset of variables, more or less in some specific order, so choosing the constraints carefully is important.

There are some easy attributes that we can see we want from our code:

  1. All unique pieces of data have unique names.
  2. Variables only get more abstract as they go deeper.
  3. All equivalent subtrees are called at the same level for any tree.
  4. All symmetric instructions are in the same function, or in functions at the same level.
  5. All data is scoped to the minimal tree.
  6. All instructions in a function belong there, and are related to each other.
  7. All functions speak of only a few pieces of data.
  8. All data is parsed only once, and then reassembled only once.
  9. Data is only copied between trees for architectural purposes. Otherwise all data exists in the program in one and only one memory location.

With respect to all of the above, and with an understanding of the common problems we find in code, we can lay out a series of normal forms:


1st Normal Form

- no duplicate sub-trees or data

Go through the code, line by line, and delete any duplicate functions or data. Delete copies of any identical or partial data.
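
A crude sketch of how a tool might hunt for duplicate subtrees, by fingerprinting function bodies; the example functions are invented, and a real tool would compare syntax trees rather than stripped source text:

    import hashlib
    import inspect

    def body_fingerprint(func):
        """Hash a function's body, ignoring its name and surrounding whitespace."""
        body_lines = inspect.getsource(func).splitlines()[1:]   # drop the 'def' line
        normalized = "".join(line.strip() for line in body_lines)
        return hashlib.sha1(normalized.encode()).hexdigest()

    def find_duplicates(functions):
        seen = {}
        for func in functions:
            seen.setdefault(body_fingerprint(func), []).append(func.__name__)
        return [names for names in seen.values() if len(names) > 1]

    def total_a(xs):
        return sum(x * 2 for x in xs)

    def total_b(xs):   # identical body: a candidate for deletion
        return sum(x * 2 for x in xs)

    print(find_duplicates([total_a, total_b]))   # [['total_a', 'total_b']]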


2nd Normal Form

- balanced common 4th level libraries: height, subtrees and arguments.

This is an architectural level. Similar functions should have similar arguments and be called at a similar level. Push and pull instructions until this happens. Some new functions might have to be created, some combined, and many deleted. Refactor the code until all of the pieces are non-overlapping primitives that encapsulate specific technical issues.
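
A small, invented before-and-after of what "similar functions, similar arguments" means in practice:

    # Before: three comparable readers with mismatched shapes and names.
    def read_csv(path): ...
    def load_json_file(handle, encoding): ...
    def xml_reader(path, validate, schema, encoding, strict): ...

    # After: one consistent family of non-overlapping primitives, all called
    # the same way from the same level of the tree.
    def read_csv(path, encoding="utf-8"): ...
    def read_json(path, encoding="utf-8"): ...
    def read_xml(path, encoding="utf-8"): ...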


3rd Normal Form

- balanced domain code: 2nd, 3rd level

This is also an architectural level. The 2nd level is minimized, the 3rd level is maximized, but with few subtrees. It is like 2nd Normal Form, but at a higher level in the code. The 3rd level can actually be modeled on the verbs and nouns (functions and data) used by the real users to discuss their problems and their needs. I.e. you can map their language to the 2nd level, with the building blocks created in the 3rd.


4th Normal Form

- All similar operations use similar code.

All blocks of code are consistent, and use the exact same style and handling. Any code doing similar things is formatted in the same way, using the same style and conventions. This can be done through smaller refactorings in the code to enforce consistency.


5th Normal Form

- All data is minimized in scope.

All data is named properly given its usage, appears an absolute minimum number of times, is only ever parsed once, and is only ever reassembled once. It is only stored once per section. Names for identical data should be identical at the same level. Names (and usage) should get more abstract at deeper levels. Some structural refactoring may be necessary, but a data dictionary with all variable names (and associated levels) would help find misnamed or duplicate variables.
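
A small sketch of the "parsed once, reassembled once" idea, with an invented trade record: raw text becomes structured data at exactly one entry point, all of the interior code works on the structure, and it is turned back into text at exactly one exit point.

    from datetime import date

    def parse_trade(line):
        """The single place where raw text becomes structured data."""
        symbol, day, qty = line.split(",")
        return {"symbol": symbol.strip(),
                "trade_date": date.fromisoformat(day.strip()),
                "quantity": int(qty)}

    def apply_split(trade, ratio):
        # interior code never touches the raw text again
        trade["quantity"] *= ratio
        return trade

    def format_trade(trade):
        """The single place where the structure becomes text again."""
        return "%s,%s,%d" % (trade["symbol"],
                             trade["trade_date"].isoformat(),
                             trade["quantity"])

    print(format_trade(apply_split(parse_trade("ACME, 2008-10-30, 100"), 2)))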


6th Normal Form

- All functionally similar code exists only once, and is instanced with a minimal set of config variables, both at the high and low levels in the system.

For example, there are only a small number of screen types, even though there are hundreds of screens in the system, and there are only a small number of data types, even though there is tonnes of data in the system. When these two common repetitive levels are melded together into a smaller, denser system, the code quality is massively increased, the amount of work is decreased, and the consistency of the entire system is enforced by the code itself. 6th Normal Form implies that there are no duplicate 'functionally similar' blocks of code, since they have all been combined into general ones (every remaining piece of code looks different).
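
A toy version of that melding, with a couple of invented screen definitions: hundreds of screens collapse into one block of screen code that is instanced by a small configuration table.

    SCREENS = {
        "client":  {"title": "Client Details",  "fields": ["name", "phone", "email"]},
        "account": {"title": "Account Details", "fields": ["number", "balance"]},
    }

    def render_screen(screen_id, record):
        """One generic screen, driven entirely by configuration."""
        config = SCREENS[screen_id]
        lines = [config["title"], "-" * len(config["title"])]
        for field in config["fields"]:
            lines.append("%-8s %s" % (field + ":", record.get(field, "")))
        return "\n".join(lines)

    print(render_screen("client", {"name": "Jane Doe", "phone": "555-0100"}))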


BUILDING SYSTEMS

With plenty of work, any existing system can be refactored into a fully normalized code base. There is nothing but effort, and the occasional round of regression testing, to stand in the way of cleaning up an existing system. Often the work required seems significantly larger than it actually is, and as part of the process, a large number of bugs are generally removed from the system quickly.

For new systems, development can aspire to achieving higher normal forms initially, but on occasion allow for a little less because of time pressure.

All development is iterative, even if the cycles are long and the methodology tries to hide it, so it makes perfect sense to start each new round of development with a period of refactoring and cleaning.

Done a little at a time, the work is not nearly as painful, and the consistency helps keep the development at a brisk pace. Big systems grind to a halt only because of the messiness of the code. New changes become increasingly more onerous, making them increasingly less desirable.

For an initial project, it is best to quickly show that the upper technical problems fit well with the lower building blocks. New technologies should always be prototyped, but sometimes that process is driven by the development itself. Generally, good practice is to start by building a small foundational 4th level, then attach upwards a minimal bit of functionality. This inverted-T structure quickly shows that the flow through the code and the technologies are working correctly. Enhancements come from adding more functionality, a bit at a time, expanding the foundation level first if needed. This type of iterative construction minimizes redoing parts, while making sure that the main body of code is usually close to working most of the time.


GRAND FINALE

The single greatest problem that I've seen over and over again with many large or huge development projects is the code base itself. Sure, there are lots of excuses about management, users or technology, but in the end the problems stem from the coders not applying enough discipline to their work.

Messy code is hard to work with, while elegant code is not. But, as we do not teach coding particularly well, it seems as if very few programmers actually have a sense or understanding of the difference between good and bad.

That's why normalization is so important. It gives programmers a clear-cut set of rules for cleaning up their code. For getting it into a state where other programmers can easily help work on it. For not getting caught up in endless debates about right and wrong, and various conventions.

The rules as laid out in this post are simple and ascend in order of importance; just getting to 1st Normal Form would be a huge accomplishment for many systems. Getting past 2nd and 3rd would help prepare the system for a new way of expansion. Getting even higher would elevate the work from a simple system to something more extraordinary. Achieving 6th Normal Form would produce a work of art.

But -- as is also true with normalization's database counterpart -- no set of normalization rules is absolutely perfect. There will always be occasions to denormalize; to not follow the rules, or to break them temporarily. That's fine, so long as the costs are understood, and a rational trade-off is made. We may be rigid in our coding habits, but we should always be flexible in our beliefs. The best code is the stuff that is easy to understand, fix, and then expand later as needed. All of this normalization is just meant to help point the way towards achieving those attributes in your code.

4 comments:

  1. I didn't get the "peer objects" stuff - can you provide an example?

    I'm not sure that the database metaphor or model works well here. I'm no DB expert, but I perceive databases as warehouses where data are put, kind of in a static fashion. In programs, data is the blood, or better the body fluids, and functions or procedures are organs. A symptom of the mismatch is that you don't talk about the status of references: in your 9 attributes, you talk about data duplication, scope, and naming. Of course, we can extrapolate from data to data references. But I think that inappropriate scope or duplication are really beginner's errors, whereas anarchic creation of references between data entities is a common mistake and a common source of bugs. Some are even extremely well known: double deletion of an object, or use of a pointer to a freed object, for instance. Even though these are typical C/C++ bugs, this kind of problem arises in garbage-collected languages as soon as they provide the possibility to make a reference to an object.

    Let me re-interpret your four-level system from a different perspective. Software is a three-character movie: there's the "user", the "programmer", and the "computer". The user is the guy who wants to hang out with the computer, but the chick speaks German. So he asks his more-or-less friend who knows German to talk to her and try to get a rendez-vous -- preferably a German rendez-vous, not a French rendez-vous, because the German rendez-vous is more likely to end with a French kiss (*). The plot is that the programmer is French. He understands quite well what his English friend wants him to do, but misses a few details. Next, although he is fluent in German, there are little gaps both when talking to the girl and when getting the answer. There are also "little gaps" in telling his friend the answer of the girl. But happy ending: they got married -- after the French friend came to the rescue several times to correct the little gaps in his translations.

    I think your model has four layers because the middle part - we, the programmers - is split into two: we translate the users' requests into computer language, and we translate back the capabilities from computer language into user language.
    It is a difficult job because there are always idiomatic expressions that are hard to translate or that you simply don't know, cultural references, words unique to one language, etc.

    It suggests that, in the system, the two middle layers should be equal and symmetric (didn't you say that symmetry is good?). In practice, however, they might be asymmetric and uneven depending on the language of the user and the language of the program.


    (*) a "rendez-vous" in french can be a professional appointment or a lover's date (the former meaning is more common); german retains only the latter meaning.

  2. Hi Astrobe,

    Perhaps a better set of labels is 'structural' vs 'interactive'. The relationship between some objects is more or less defined by their real-world structural positions (patterns: composites, flyweight, factory). For other objects, specifically those that are algorithmic (patterns: visitors, strategy), their interaction with each other is what is significant. In either case, we are interested in being able to consistently normalize all objects down to simpler forms.

    I'm not trying to fit a database metaphor over programming, but rather I am just borrowing one key process (normalization) from schema creation and applying it to code. Once we have a definitive 'state' for any work, then we have a solid goal to accomplish (and a way of knowing that we have fallen short). Now we can ask "is it normalized? In which form?" and have an idea about the quality of the underlying code. This clearly helped databases, so it should help code as well.

    In the sense of references, don't confuse the real-world relationships between the data with the technical relationships that you had to create to implement the system. The data has a real structure, one that we try to mirror with the technology, but often it is incorrect. Technological problems like doubly deleting objects and hanging pointers are algorithmic mistakes in the code; they should be encapsulated in the 4th level (not doing so is the real source of the error). If the code is well-structured, poor algorithmic choices are easier to detect (an algorithm is to code as content is to data, beyond the normalization rules).

    I split the middle layers of code into two because the need for simplification is very different in different parts of the code. Symmetry is best, but when it is artificially forced it becomes more of a problem than not. At the higher level (2nd), things are changing very quickly, so big bulky code is a problem. At the lower level, things are changing very slowly, so we want bigger and bigger pieces to allow us to leverage our earlier work as much as possible. No single set of rules would work with both very opposite requirements; splitting them works well.

    The movie idea is amusing, but software developers build things that the users use. It's more of a 'thing' getting created and less of a 'relationship'. Most miscommunication comes from differences in perspective. Users can't articulate their needs, and developers wear blinders (often because the truth is actually too hard to code). As an industry, we take wild guesses about the nature of the data, and wild guesses about the nature of the software, and then we attempt to build everything too quickly. With that type of 'process' as a foundation, massive changes are completely expected.

    Someday we'll stop guessing, but the science and engineering needed to support that aren't even on the horizon yet, so it will be a while. The best we can do is to accept that reality, and build it into our working habits.


    Paul.

  3. Hello Paul,

    How viable do you think building a Lint-like tool for 'code normalization' would be? The rules themselves seem simple enough, but as always the difficulty is in the application of important principles.

  4. Hi John,

    My next post has a longer explanation, but I don't see a problem with making the normal forms computable. I.e. we should be able to write something that states whether or not a program is in 1st Normal Form. The same should be true for all of the other normal forms.

    The real problem comes from trying to understand a) similarity in code, and b) similarity in data. Code isn't too hard; people played around with this during the Linux lawsuits. Data is harder, but with a small amount of external meta-information about which incoming variables are identical, it is a solvable problem.

    I'm almost finished editing my next post (maybe tonight), which is far less esoteric than this one. I go through each normal form and explain it, and why it is important. Hopefully it's a little nicer than this one :-)


    Paul.

