Sunday, July 5, 2009

In Our Dependence

Consider a line of code. Any line really; it doesn't make a difference. From a far-off, hazy perspective they are all similar: just another instruction for the computer to process, in a long sequence of instructions.

At some point, probably many times during its operational life, the line of code will get executed, performing some simple operation like changing a variable, checking a condition or calling a function.

Most complex languages do allow for significantly more complicated sequences of events to be packed within a single line of code, but even then, each event is really something simple underneath. A simple instruction of some type.

We can consider these complex aggregations to be syntactic conveniences, and not something more advanced. In that sense, a line of code might actually be a clump of lines all compacted together, but for this post we'll ignore that.

We're only interested in looking at a single, simple instruction, and how it relates to the rest of the things around it. For now, we really don't care what the line does, or even how it really interacts with the other lines in the system. We're just concerned with its dependencies.


SETS OF INSTRUCTIONS

For our line of code to work properly a great number of things have to happen first.

Most obviously, the preceding line of code must have executed properly. And before that, all of the other preceding lines must have executed properly too. Each instruction, in a long sequence of instructions, depends on the earlier instructions having worked correctly. An algorithm that is only half successful at the end is clearly invalid. Completely useless.

In this context, one can view all functionality in a software system as being just fixed sequences of instructions, with some minimal looping. We could lay out all of the lines of code that preceded ours, into a giant flattened list. A complete and final list needed to implement a specific instance of the code.

At some point, deep in the program, this current driving piece of functionality was triggered by some entry point. Something started it running.

All programs, whether they are command line utilities, batch jobs or GUIs, are really just launching pads for similar, related functionality. Our line of code belongs to one or more of these execution sets. The primary difference is whether the entry point was triggered by a human, a set of arguments, or some other mechanism.

Most computer processing relates back to the actions of individuals somewhere, although wiring it up to run at certain times of the day is popular too.


IN CONTEXT

As lines of code execute, they share significant dependencies on the various different "contexts" around them that contain data.

The functionality starts, for instance, with some constrained set of arguments that specifically control its behavior. This might be a set of states in a user application, or it might just be some parameters in a command line. Possibly some user input coming from a dialog or a form. It's a context local to this specific execution of this specific functionality. The functionality context usually maintains the specifics of the execution.

The overall user context may have been a long running session of some type, gradually building up more and more specific information as it grows. Sessions can share data across many different functionality contexts. This can build up very complex, long running dependencies. In this way, the session context essentially stores the short-term history.

The application itself started up with some preset series of configurations, generally in a file, but they could also be from some persistent data store (database) or possibly command line arguments. Some applications can even be initialized on the fly from a third party location. Usually this information is consistent between different runs of the application, but not always. An application context is often more operationally oriented data, information about how to run in or interact with the environment.

The most significant data is usually the persistent stuff for the actual domain. That long-term data has been building up in the system, often over years. It is the core data that the user is trying to work with. The real stuff.

All functionality must implement some type of domain specific behavior. Software doesn't execute unless it is trying to provide a specific answer to a specific problem. Generalized code may get reused underneath, but at the highest level, the code that gets executed must be bound to some real and useful problem. Code runs because people use it for something.

Thus, the underlying persistent domain data is clearly the most influential data in the system. It really defines the success or failure of the computation.

The other contexts generally change the nature and the shape of the instruction sets, whereas the domain data is directly affected by the instructions themselves. In a real sense, the various contexts combine to control the way the algorithms create or manipulate the core domain data.

Of course, in practice, it is rarely that clean or refined.


A LITTLE BIT OF KNOWLEDGE

Sometimes the data affected by a specific sequence of code is actually an amalgamation of several bits collected together from multiple contexts.

For instance, the application may have a root path in a configuration file that is then appended with some relative path information from other sources like the database, before finally being completed with some user input for a file name.

If the line of code is to call a function to open a file, for example, the final absolute path for the file may have been assembled from a very long and complex process starting in a huge number of different places in the code base.

In a very real sense, all of those necessary locations, code and data, combine together to be fragments of the same underlying data requirement. They are all directly tied to the success of the code. That is, the knowledge required to find the final file is scattered throughout the code, the configuration files and the database.

It is "partially repeated" in each of the different locations. If any one part fails, the entire handling fails.
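The file path example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CONFIG` and `DATABASE` dictionaries, their keys, the directory names and the `build_report_path` function are all hypothetical stand-ins for a real configuration file, a real data store and real user input.

```python
from pathlib import PurePosixPath

# Hypothetical stand-ins for two of the places the knowledge lives.
CONFIG = {"root_path": "/var/app/data"}      # application context (configuration file)
DATABASE = {"reports_dir": "reports/2009"}   # persistent data store


def build_report_path(config, database, user_file_name):
    """Assemble one absolute path from three independent fragments."""
    root = config["root_path"]            # fragment 1: configuration
    relative = database["reports_dir"]    # fragment 2: database
    # fragment 3: user input, which must be a plain file name
    if not user_file_name or "/" in user_file_name:
        raise ValueError("file name must be a plain, non-empty name")
    return PurePosixPath(root) / relative / user_file_name


path = build_report_path(CONFIG, DATABASE, "summary.txt")
print(path)  # /var/app/data/reports/2009/summary.txt
```

The line that eventually opens the file only sees `path`, yet it quietly depends on all three sources. Rename the configuration key, change the stored directory or accept a bad file name, and the open fails even though the line itself never changed.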

Of course, we know that it is a very significant design goal to not have the same pieces repeated multiple times in multiple locations. We've been trying for decades to create programming paradigms to help with this issue.

In some cases, we do this for code, but it is also equally applicable for data, particularly operational context data.

Inevitably something changes, thus increasing the probability that any repeated knowledge (code or data) will fall out of synchronization. Bugs -- the most significant ones -- generally come from different parts of a program failing to coordinate their behaviors. They can be difficult and expensive to track down.

Static data, configuration files, caches and data stores, as well as the active contexts, are all locations of "state" related information that needs to be coordinated. Modern "Object Oriented" programming languages compound the problem by also allowing the code itself to have its own run-time object state as well. The object state has little to do with the actual sequence of instructions running; it is just a by-product of the language decomposition, yet it is another easy place for errors to accumulate.

State, of any type, is always a bad thing. It builds up and changes in the system over time. In that way it becomes very difficult to test the code, because of the wide variety of possible internal states. Complicated logic tied to states is hard to maintain. Entirely state-free code, which is only ever dependent on a single direct input stream, is far easier to test.

If the code "remembers" nothing, then a suitable range of inputs can validate its limited behavior in an operational environment. For each thing it remembers, a multiplier happens on the testing effort.
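A small Python sketch of that multiplier, using made-up names (`add_tax`, `Checkout`, and the two flags are hypothetical, as are the rates). It only counts the test cases needed; it is not a testing framework.

```python
from itertools import product


def add_tax(price):
    """State-free: the result depends only on the argument."""
    return round(price * 1.10, 2)


class Checkout:
    """Stateful: the result also depends on two remembered flags."""

    def __init__(self):
        self.holiday_mode = False     # remembered thing 1
        self.loyal_customer = False   # remembered thing 2

    def add_tax(self, price):
        rate = 1.10
        if self.holiday_mode:
            rate -= 0.02
        if self.loyal_customer:
            rate -= 0.03
        return round(price * rate, 2)


inputs = [0.0, 10.0, 99.99]

# State-free: one case per input is enough.
stateless_cases = len(inputs)

# Stateful: every input has to be tried under every combination of flags.
flag_combinations = list(product([False, True], repeat=2))
stateful_cases = len(inputs) * len(flag_combinations)

print(stateless_cases, stateful_cases)  # 3 12
```

Each additional remembered boolean doubles the second number again, and remembered values with more than two possible settings grow it faster still.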


DOWNSTREAM

Clearly, there are a lot of things happening that can affect the execution of our single line of code.

A lot has to go right in order for it to work as expected. Still, that is nothing compared to what is actually dependent on it working. The results of our line running, good or bad, affect a huge number of things. Changes in what the line does can have an even wider effect.

One of the most obvious dependencies is any direct downstream code. Worse off are the lines farthest away; possibly in another module, possibly in some other system. Failures or changes could have a significant effect on whether or not they continue to work. The code running later depends on the current results. And that's not just code local to this line, it's also a huge amount of code in this system and any other system that will need access to the final results at some point.

But that also includes all of the future versions of all of the dependent code as well.

In a philosophical sense, because the line ultimately affects the long-term persistent data, there is an infinite set of dependent lines of code for as long as that data stays in play. Data lasts nearly forever, and is dependent on all those bits of code in the past and upcoming in the future to do the right thing. So long as the data is not being decommissioned, it binds them all together. Something wrong early on propagates its way through all of them, until it is fixed.

There are all sorts of other lines of code, both adjacent, and far away physically and in time, that share significant dependencies on this line of code. This includes building, testing, packaging, installation and operations.

There could be problems with the build scripts that create the system. Most are generic, but some are occasionally driven by the code itself.

Testing can be adversely affected as well. A change in a specific line of code may force a huge number of tests to be altered. Depending on the depth and binding, the dependencies between the code and the tests may be just as strong as any other line in the system.

Supporting code, sometimes called scaffolding, may not be active in an operational environment, but it is still code by any other name.

Some unit-testing philosophies bind the code to tests at almost a one-to-one ratio, which means that any impact on one side is easily doubled (if not worse) across the whole system.

Packaging and installation could be affected as well. These are often forgotten about, but reasonable installation also includes both the ability to install a new version and to upgrade. Small changes in code can cause massive upgrade and installation issues, particularly if there are new dependencies on previously uncollected data. Upgrade or "migration paths" are also usually the weakest parts of most systems precisely because they aren't used often, are badly written and are hard to test.

Most multi-user software also gets some level of custom integration once it actually goes into an operational environment. There could, for example, be countless lines of code in various scripting languages that are dependent on this line. Scripts to clean up, monitor or facilitate better integration between different systems. Since the software vendors rarely cooperate with each other, and are usually only selling thin slices of the solution, operational binding is a huge part of any multi-user software's life span.


SUPPORTING ROLES AND DOCUMENTS

Of course, a change to the line of code may also significantly change all of the related documentation as well.

A strong dependency comes from the comments in the source code itself. They describe the code, but there is nothing to enforce that description. Thus comments often quickly fall out of sync with the surrounding behavior. The comments may come in front, next to, behind or batched in a block, but the accuracy depends on how well people have kept them up-to-date with the code. A misleading comment may ultimately waste time. A lot of time.

Changes may cause failures in multiple language translations. There can be a mass of duplicated text and documentation with multiple-language systems. Every piece of text has to exist multiple times, once in every language. Keeping that in sync is often a huge anchor on the code. It is organizationally hard, and often involves expensive re-translations.

If our line of code fails to run as expected, then huge amounts of other documentation and supplementary information could in fact turn out to be wrong as well. These surrounding dependencies rely on things working the way they should. They rely on things not changing often.

With most modern commercial software, the actual code is tiny compared to all of these secondary documentation dependencies. They exist to facilitate the running of the code in different environments. This includes all of the design documents, marketing brochures and help files, but also any manuals and tutorials that get created. Any non-trivial system has (or should have) a significant library of related documents. A simple change to the code could spawn countless hours of related updating work.

Issues may also be propagated out to bug fix systems and support domains. There may be procedures that must change. FAQs that need updating. There are always the first line support tools, but often there are also a huge number of deeper ones. Some explicit, some just ad-hoc.

Various operations personnel may be affected as well. There is usually an army of people responsible in most organizations for running and upgrading systems. A single line of code could possibly change their jobs, create huge upgrade projects or just render them obsolete. For some commercial systems, the upgrade procedures themselves are complex enough to have spawned related consulting industries.

In some cases, new people may need to be hired to monitor or augment the code. Software has always created a large number of specific employment positions beyond just the basic development and operational ones. The organization may have to change as well, to support using the code. The implementations may need project managers, and other people to keep the systems running smoothly. Domain specific experts may be required to keep it all working properly.

Code, often thought of as only a virtual thing -- ignoring the obvious point that it does "physically" alter magnetic bits in memory and on disks -- reaches out from its machine existence and entangles itself all over the real world, in all sorts of non-obvious ways.


NOT AFFECTED

While we've been looking at what is affected by our line of code, it is also worth looking at what is not.

There will always be lines of code in a system, particularly low-level ones, that have some widespread effect across the entire scope of the system. However, a well-written system will actually minimize the amount of code that has these types of dependencies.

System architecture, encapsulation and good functional decompositions go a long way toward containing the overall dependencies of any single line of code.

If our line was properly encapsulated into a module, then most of the other lines in the system should be isolated from it.

If the context going into our system was well laid out by the architecture, then only that specific context will affect the running.

If the different contexts and data are logically separated, then it is easy and obvious to know which ones are affected by our line of code.

We should be able to see the transformations in the domain data, caused by the overall algorithm or functionality to which this line belongs. There should be no mysteries; any changes should be clear and obvious.

The code itself should sit at a point in the files and directories of the overall system that indicates its place in the design and architecture. Everything about the project should help lead to some significantly useful piece of information. Some piece of knowledge. It should all be as obvious as possible.

As a general rule of thumb, any programmer "experienced" with the code should be able to reasonably and reliably predict the impact of any changes. They should be able to identify all dependent code and data, fairly quickly.
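One way to picture that impact test is as a walk over a dependency map. The sketch below is an assumption-laden illustration, not a real analysis tool: the `DEPENDS_ON` map and every name in it are invented, and in practice the map usually lives in a programmer's head or in the structure of the code itself.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed names.
DEPENDS_ON = {
    "report_view": ["report_builder"],
    "report_builder": ["open_report_file"],
    "nightly_export": ["open_report_file"],
    "login_form": ["session"],
    "open_report_file": [],
    "session": [],
}


def impacted_by(changed, depends_on):
    """Return every name that directly or transitively depends on `changed`."""
    # Invert the map so we can walk from a changed item to its dependents.
    dependents = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


print(sorted(impacted_by("open_report_file", DEPENDS_ON)))
# ['nightly_export', 'report_builder', 'report_view']
```

In a well-encapsulated system the answer stays small and unsurprising, as it does here: `login_form` and `session` are untouched by the change. When the answer is "nearly everything", or nobody can produce the map at all, the code has failed the test.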

If the code fails that type of impact test, it is absolutely spaghetti code, by any definition. All lines of code should exist and be in a location where their contribution to the whole system is obvious. Arbitrary or seemingly random lines of code show serious self-discipline or structuring problems. Nothing about code should ever be random or arbitrary. It always matters.

And certainly there is no need or purpose in Computer Science for having code that does not behave in a predictable, rational and explainable manner. Even with complex heuristics or models like neural nets, there is always some high level of explainable expected behavior, and the impact of the low-level code is constrained to very specific parts of the system. We cannot (and should not) use systems, if we do not understand what they will really do.


SUMMARY

A single line of code is a remarkable thing. It sits there in a big complex world surrounded by a sea of dependencies. It somehow has to manage to work correctly each and every time it runs, although most of the state of the world around it is gray and fuzzy.

We have so many problems with our systems simply because we do not properly value the significance of all of our code. It's all too common to see reasonable systems fall down because of really badly written supporting scripts, for example. It's common to see configuration files composed of arbitrary junk that has been building up for many versions. Many installation scripts barely work, and are often highly fragile.

Some gifted programmers might put a huge effort into getting the core of the system to be well-built, but that doesn't mean much if the other 80% is a mess.

We have all sorts of strategies and rules of thumb about not repeating code over and over again, but oddly few of these apply to the various bits of data that also are necessary. If we only encapsulate one small aspect of the code, the other aspects can still easily become indecipherable. Nicely commented spaghetti code is still spaghetti code, for instance. Quality is defined by the weakest points, not the best ones or even the average.

The idea then, is for all "related" elements (code or data) to be as spatially compacted as possible. The closer things are together, the easier it is to keep them in sync with each other.

The tighter we can truly encapsulate "knowledge" into the system, the more independent the pieces really become. When they can stand alone, they make it possible for us to utilize them for as much as possible.

If changes in the code must be mirrored by ones in an XML configuration file as well, or vice versa, then the overall structure is repetitive and poorly encapsulated. The "knowledge" is split between two distinct locations. An architecture that declaratively redefines most elements in an external file, is clearly not a good design.
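The same problem can be shown without XML. In this Python sketch the `EXTERNAL_CONFIG` dictionary stands in for an external configuration file, and every name (`export_split`, `exporter`, `EXPORTERS`, the format names) is hypothetical. The first half splits the "knowledge" of which export formats exist across two places; the second half keeps it in one.

```python
import json

# Split knowledge: the list of formats is declared here...
EXTERNAL_CONFIG = {"export_formats": ["csv", "json", "xml"]}  # stand-in for an external file


def export_split(data, fmt):
    """...and declared again, implicitly, by the branches below."""
    if fmt not in EXTERNAL_CONFIG["export_formats"]:
        raise ValueError(f"unknown format: {fmt}")
    if fmt == "csv":
        return ",".join(data)
    if fmt == "json":
        return json.dumps(data)
    # "xml" was added to the configuration, but the code never followed.
    raise NotImplementedError(fmt)


# Single location: registering a function is the only declaration.
EXPORTERS = {}


def exporter(fmt):
    """Decorator that records a function as the handler for `fmt`."""
    def register(func):
        EXPORTERS[fmt] = func
        return func
    return register


@exporter("csv")
def to_csv(data):
    return ",".join(data)


@exporter("json")
def to_json(data):
    return json.dumps(data)


def export(data, fmt):
    try:
        return EXPORTERS[fmt](data)
    except KeyError:
        raise ValueError(f"unknown format: {fmt}") from None


print(export(["a", "b"], "csv"))  # a,b
print(sorted(EXPORTERS))          # ['csv', 'json']
```

In the split version the configuration and the code have already drifted apart: asking for "xml" passes the configuration check and then fails. In the second version the set of supported formats cannot disagree with the code, because the code is the only place it is written down.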

Well-structured code is a far better thing to work with. We should be able to look at a line of code and easily determine its effect on the overall system. Since we don't need chaotic systems (except for study and simulations), there is no reason we should build them.

Testing the overall quality of the code is not that complex of a process. For the most part, the lines in the system should be in obvious places with obvious behaviors. If the system needs some extremely complicated type of documentation in order for a reasonable programmer to make fixes, then it is likely that the structuring is extremely poor.

Well written programs make the solution look simple and obvious. Badly written ones require documentation. Knowing the impact of a change in a well written system is easy. Guessing the impact in spaghetti code is hard. At any given line of code in the program you should be able both to understand what it does, and to correctly predict its effect on everything else. It's that simple.