The Programmer's Paradox

Sunday, January 10, 2016

Thinking Problems

What’s most interesting about reading lots of code is that you get a fairly deep insight into how people think. Building software solutions involves decomposing a bunch of related problems into discrete pieces and then recomposing them back into a solution that allows users to get their work completed. Examining these pieces shows both how the individuals involved break apart the world, and how much they understand of it to be able to put it all back together again.

Supplemented with an understanding of the history of the development and enough experience to be able to correctly visualize a better solution, this can allow us to analyze various types of common thinking problems.

The first and most obvious observation is whether or not the programmers are organized. That just flows right from how much structure and consistency is visible. Since most large bodies of code are written over years, there can sometimes be local organization, but that might change radically in different parts.

Organization is both external and internal. It applies both to the overall process of development and to how the analysts, designers, and programmers arrived at the code.

Internal organization, as it applies to thinking, is in a sense how fast one can head in the right direction, rather than just wander aimlessly in circles. Often times checking the repo sheds light, in that disorganization usually generates a large number of small fixes that come in sporadically. A good testing stage should hide this from the users, but in broken environments that doesn’t happen. So really messy code, and a boatload of little fixes, generally sets the base thinking quality. If the fixes are spread across fairly large time spans -- implying that they are driven by users, not testers -- then a significant amount of the thinking was rather disorganized.

Disorganized thinking is generally fairly shallow, mostly driven by the constant lack of progress. That is, people who make little progress in their thinking tend to want to avoid thinking about things because it doesn’t ultimately help them. That pushes them to the tendency to react to the world in the short-term. But long-term planning and goal achievement, almost by definition, require deep thinking about both the future and the means to get there. Thus disorganized thinkers are more often a-types that prefer the bull-in-the-china-shop approach to handling problems. For the most part there seems to be a real correlation between the brute force approach to coding and disorganized thinking. It’s the “work harder, not smarter” approach to getting things done.

Another really common thinking problem that is visible is with local simplification. It’s not uncommon to see programmers go from A to B via all sorts of other crazy paths. So instead of going straight there, they go A -> M -> Q -> B, or something equally convoluted. Generally, if you get a chance to inquire, many of them will rather confidently state that since A -> M is shorter than A -> B, that they have actually simplified the work and that they are correctly following an approach like KISS. What they seem to have difficulty understanding is that A -> M -> Q -> B is inordinately more complex. That the comparison between the two approaches is flawed because they aren’t comparing two equal things, rather just a part of the solution vs. the whole solution. You see this quite frequently in discussions about architecture, tools, design, etc. and of course these types of complexity choices get burned directly into the code and are quite visible.

In a programmatic sense, any code that isn’t directly helping to get to the destination is a possible suspect for being an M. More specifically, if you ignore the code, but follow the data instead, you can see its path through the runtime as being a long series of data copies and translations. If the requirement of the system is to get from the persistence storage to a bunch of widgets and then back again, then unless intermediate destinations are helping to increase performance, they are really just unnecessary waypoints. Quite often they come from programmers not understanding how to exploit the underlying primitive mechanics effectively, or from them blindly following poor conventions. Either way, the code and the practices combine to consume excess work that would have been better spent elsewhere. In that way, local simplifications are often related to shallow thinking. Either the programmer came to the wrong conclusion about what really is the simpler approach, or they just choose not to think about it at all and decided to blindly follow something they read on the web. Either way, the code points back to a thinking problem.

The third, and most complex problem is with the decomposition of things into primitive parts. It’s easy enough to break a large problem down into its parts, but what isn’t often understood well is that the size and relative positioning of the parts actually matters. There is nearly an infinite number of ways to decompose a problem, but only a select few really make it easy to solve it for a given context. Ideally, the primitives fit together nicely; that is there are no gaps in between them. Also, they do not overlap each other, so that there is a real canonical decomposition. They need to be balanced as well so that if the decomposition is multi-level, each level only contains the appropriate primitives. This whole approach is mathematical in nature since it adheres to a rigid set of relative rules, created specifically for the problem at hand.

A problem broken up in this manner is easily reassembled for a solution, but that’s generally not what we see in practice. Instead, people tear off little bits at a time, rather randomly, then they belt them out as special cases. The erraticness of the decomposition provides great obstacles towards cleanly building a solution and as it progresses over time, it introduces significant spaghetti back into the design. It’s the nature of decomposition that we have to start from the top, but to get it reasonable we also have to keep iterating up and down until the pieces fit properly; it's that second half that people have trouble with because that type of refinement process is often mistaken as not providing enough value. However, it really is one of those a-stitch-in-time-saves-nine problems, where initial sloppiness generates significantly more downstream work.

A more formal background makes it easier to decompose into solid primitives, but given the size and complexity of most problems we are tackling, knowledge or even intrinsic abilities does not alleviate the grunt work of iterating through and refactoring them to fit appropriately. This then shows up as two different thinking problems, the first being able to notice that the pieces don’t fit properly, while the second is being able to fix that specific mismatch. Both problems come from the way people think about the underlying system, and how they visualize what it should ultimately look like.

Now so far, I have been using the term ‘thinking’ rather loosely, to really refer to what is in between observations and conclusions. That is, people see enough of the problem, and after some thinking, they produce a solution. That skips over issues like cognitive bias, where they deliberately ignore what they see that does not already support their conclusions. Given that it is internal, I do consider that part of the thinking process; the necessity to filter out some observations that don’t fit the current context and focus. Rose-colored glasses, however, are a very common problem with software development. Sometimes it related to the time pressures, sometimes it's just unreasonable optimism, but sometimes it is an inability to process the relevant aspects of the world around us. It is a thinking failure that prevents us from being able to build up internal models that are sophisticated enough to match observed evidence. To some degree, everyone is guilty of wearing blinders, of filtering out things that we should not have, but in some cases, it is considerably more extreme than that. It’s really a deep-seated need to live within an overly simplified delusion, rather than reality. Strangely that can be as much of a plus as a minus in many cases, particularly if success is initially difficult. Ignoring the obvious roadblocks allows people to make a full-hearted attempt. But as much as it can launch unexpected successes (occasionally), it becomes increasingly negative as time progresses and the need for more refined thinking grows more significant.

Analysis, design, and programming are all heavily dependent on observation and thought. In the end, a software system is just an instantiation of various people’s understanding of a solution to a problem. In that, it is really only as strong as the sum of its commonly used weakest links (some parts of the system are just irrelevant). So, if the quantity of the observations is compromised by time, and this is combined with poor thinking, the resulting system most often falls far short of actually solving the problem. It is not uncommon to see systems that were meant to cut down on resources need significantly more resources than they save. That is, a grunt job for 4 people is so badly automated that 6 people have to be involved to keep it running. In that sense, the system is an outright failure and underlying that is not a technical problem, but rather one about how various people came to think about the world, and how they combined their efforts to solve things relative to their conclusions.

Thinking well is a prerequisite for handling complex problems, with or without technology. If people can’t clearly understand the issues, then any attempt to solve them is as equally likely to just make them worse. If they break down the problem into a convoluted mess of ill-fitting parts, then any solutions comprised of these are unlikely to work effectively and is more likely to make things worse. Software, as it is applied to sophisticated problems, is inordinately complex and as such cannot be constructed without first arriving at a strong understanding of the underlying context. This comes from observations, but it doesn’t come together until enough thinking has been applied to it to get it into a usable state. Software in this sense is just well-organized intelligence that gets processed by a computer. There is no easy way around this problem.

Saturday, December 19, 2015

Routine Software

As our knowledge and understanding of software grows, it is important to keep track of what are ‘trivial’ and ‘routine’ software projects. Trivial means that something can be constructed with only the bare minimum of effort, by someone who has done it before. Routine means that it has been done so many times in the past, that the knowledge -- explicitly or implicitly -- is available within the industry; it doesn’t, however, mean that it is trivial, or that isn’t a large amount of work.

Both of these categories are, of course, relative to the current state of the industry. Neither of them means that a novice will find it to be ‘easy’, in that doing anything without pre-requisite knowledge is basically ‘hard’ by definition. If you don’t know what you are doing then only massive amounts of sweat and luck will ever make it possible.

Definition

At this time, and it has remained rather consistent for the last decade at least, a routine system is medium-sized or smaller, with a statically defined data model. Given our technologies, medium means approximately 400 users or less, and probably about 40,000 to 60,000 lines of code (LOC). A statically defined data model means that the structure of what is collected now is not going to change unless the functionality of the system is explicitly extended. That is, the structure doesn’t change dynamically on its own, what is valid about a data entity today is also valid tomorrow.

Modern routine systems almost always have a graphical user interface. Some are thick clients (a single stand-alone process), while others are client-server based (which also covers the architectural constraints on a basic web or mobile app, even if it involves several backend components). None of this affects whether the development is routine, since it has all been around for decades, but it does involve separate knowledge bases that need to be learned.

All of these routine systems rely primarily on edit loops:

http://theprogrammersparadox.blogspot.ca/2010/03/edit-loop.html

They move the data back and forth between a series of widgets and a database of some form. Most import external data through some form of stable ETL mapping, and export data out via well-defined data formats. Some have deeper interactive presentations.

Any system that does not get more complicated than this is routine, in that we have been building these now for nearly three decades, and the problems involved are well-defined and understood.

There are at least three sub-problems that will cause the development to no longer be routine, although in most cases these only affect a portion of the system, not the whole. They are:

Dynamic data-models
Scale larger than medium
Complex algorithmic requirements

Dynamic Data

A dynamic data model means that the underlying structure of the data can and will change all of the time. Users may enter one structure, one day, then something substantially different the next, yet these will still be the same data entity. The reason this occurs is because the domain is purposely shifting, often to its own advantage. Obviously, you can’t statically encode the entire space of possible changes, because that would involve knowing the future.

Dealing with dynamic data models means pushing the problem back to the users. That is, you give them some really convenient means of keeping up with the changes, like a DSL or complex GUI, so that they can adapt quickly. That may seem easy, but the problem leaks over both the interface and the persistence. That is, you need some form of dynamic interface that adapts to both the changing collection and reporting necessary, and you need this whole mess to be dynamically held in a persistent technology. The trick is to be able to write code that has almost no knowledge of the data that it is handling. The breadth and abstract nature of the problem are what makes it tricky to implement correctly; it is very rare to see it done well.

Scaling

Once the required scale exceeds the hardware capabilities, the system needs to be decomposed into pieces that can execute independently instead of all together. This sizing problem continually shifts because the hardware is evolving quickly, but there is always some threshold where it becomes the primary technical problem. In a simple system, if the decomposition leads to a set of independent pieces, the problem is only minorly painful. Each piece is pushed out on its own hardware. Sometimes this is structural, such as splitting the backend server into a web server and a database server. Sometimes it can be partitioned, such as sticking a load balancer in front of replicated servers.

If the data and code aren’t independent then very complex synchronization algorithms are needed, many of which are cutting edge computer science right now.

Software solutions for scale also exist in one of the many forms of memoization, such as caching or paging, however in the former case adding the ability to ‘reuse’ data or sub-calculations also means being able to precisely understand and scope the lifespan of the data, failing to do this makes it easy to accidentally rely on stale data.

Most scaling solutions in of themselves are not complex, but when multiple ones exist together their interactions can be extraordinarily complicated. As the necessity of scale grows, the need to bypass more bottlenecks means significant jumps in this complexity and increased risk of sending the performance backward. This complex interaction makes scaling one of these most difficult problems, and because we don’t have a commonly used toolkit for forecasting the behavior it means that much of the work is based on intuition or trial and error.

Algorithms

Most derived data is rather straightforward, but occasionally people are looking for subtle relationships within the structure of the data. In this way, there is a need for very complex algorithms, the worst of which is AI (since if we had that, it could find the others). Very difficult algorithms are always a challenge, but at least they are independent of the rest of the system. That is, in most applications, they usually only account for a small percentage of the functionality, say 10% to 20%. The rest of the system is really just a routine wrapper that is necessary to collect or aggregate the necessary data to feed them. In that way, these algorithms can be encapsulated into an ‘engine’ that is usable by a routine system, and so the independence is preserved.

For some really deep ‘systems’ programming problems like operating systems, compilers or databases the state-of-the-art algorithms have advanced significantly, and require significant research to understand. Most have some routine core that intermingles with the other two problems. What often separates systems programming from applications programming is that ignoring what is known, and choosing to crudely reinvent it, is way more likely to be defective. It’s best to either do the homework first or use code by someone who has already done the pre-requisite research.

Sometimes semi-systems programming approaches blend into routine application development, such as locking, threading, real-time(ish) etc. These are really a mix between scaling and algorithmic issues, that should really only be used in a routine system if the default performance is unacceptable. If they are used, significant research is required to use them properly, and some thought should be given as to how to properly encapsulate them so that future changes don’t turn ugly. Quite often it is common to see poor implementations that actually degrade performance, instead of helping it, or that cause strange, infrequent bugs that go for years without being found.

Finale

Now, of course, all three of these problems can be required for the same system at the same time, and it can be necessary to run this within a difficult environment such as fault tolerant. There are plenty of examples of our constructing such beasts and trying to tame them. At the same time, there are way more examples of mostly routine systems, that share few of these problems. In that latter category there are also examples of people approaching a routine system as if it were an extraordinarily complex one, and in those cases, their attempted solutions are unnecessarily expensive or unstable or even doomed. Understanding what work is actually routine then means knowing what and how to go about doing the work, so that it is most likely to succeed and be useful in the future.

What we should do as an industry is to produce better recipes and training for building routine systems. In an organized development environment, this type of work should proceed in a smooth and highly estimable fashion. For more advanced systems, the three tricky areas can often be abstracted away from the base, since they can be harder to estimate and considerably riskier. Given all these constraints, and strong analysis and design, there is no reason why most modern software development projects are so chaotic. They can, and should be, under much better control.

Friday, December 4, 2015

Requirements and Specifications

Programmers often complain about scope creep.

The underlying cause is likely that their project has been caught up in an endless cycle of requirements and specifications that bounce all over the place, which is extraordinarily expensive and frustrating for everyone.

The inputs to programming are design specifications, which are created from requirements gathered during analysis. There are many other ways to describe these two stages and their outputs, but ultimately they all boil down to the same underlying information. Failures in one or both of these earlier stages is really obvious during the coding stage. If the details aren’t ever locked down, and everything is interconnected, then frequent erratic changes mean that a lot of work gets wasted. In that sense, scope creep isn’t a real programming problem, but rather a process one.

Quite obviously, scope creep wouldn’t happen if the specification for the system was 100%. The programmers would just code exactly what is needed -- once -- and then proceed to polish the work with testing. The irony is that the work of specifying a system to 100% is actually the work of writing the system itself. That is, if you make the effort to ensure that no detail was vague or left unspecified, then you could write another program to turn that specification directly into the required program.

A slight variation on this idea was actually floated a long time ago by Jack W Reeves in “What is Software Design?” but never went mainstream:

http://www.developerdotstar.com/mag/articles/reeves_design_main.html

Of course, time and a growing understanding of the solution generally mean that any original ideas for a piece of software will always require some fiddling. But it is obviously cheaper and far more efficient to work out these changes on a smaller scale first -- on paper -- before moving on to committing to the slow, detailed work of writing and testing the code. Thus, it is a very good practice to create short, high-level specifications to re-arrange the details, long before slogging through the real effort.

As mentioned, a good specification is the by-product of the two earlier stages, analysis and design. The first stage is the collection of all details that are necessary to solve the problem. The second is to mix that analysis with technology and operational requirements, in order to flesh out an architecture that organizes the details. Scope creep is most often caused by a failure of analysis. The details were never collected, or they weren’t vetted, or an aspect of the domain that is dynamic was treated statically. Anyone of these three problems will result in significant changes, and any one of them in a disorganized environment will set off a change cyclone.

There is also a rather interesting fourth problem: the details were collected, but not in a way that was stable.

The traditional approach to requirements is to craft a sentence that explains some need for the user. By definition, this expresses a winding ‘path’ through the underlying data and computations while often being targeted at only a special case. People find it easier to isolate their needs this way, but it is actually problematic. If the specification for a large system is composed of a really large collection of these path-based requirements, then it resembles a sort of cross-hatched attempt to fill in the details of the system, not unlike the scratchings of a toddler in a coloring book. But the details are really a ‘territory’, in that it is a finite set of data, broken down by modeled entities, with higher level functionality coming from computations and derived data. It is also packed with some navigational aids and a visual aesthetic.

A good system is complete, in the sense that it manages all of the data correctly and provides a complete set of tools to do any sort of display or manipulation necessary. It is a bounded territory that needs to be filled in. Nicely. Describing this with an erratic set of cross-hatched paths is obviously confusing, and prone to error. If the programmers fixate on the wrong subset of paths, necessary parts of the system fall through the cracks. Then when they are noticed, things have to change to fill those gaps. Overlaps likewise cause problems in driving the creation of redundancies which eventually lose synchronization with each other.

A very simple example of this confusion happened a while back when a user told an analyst that he needed to ‘add’ some new data. The analyst took that path ‘literally’ and set it down as a requirement, which he turned into a screen specification. The programmer took that screen literally and set it down as code. A little time passed and the user made a typo, that he only noticed after he had already saved the data. He went to edit the data, but… The system could ‘add’ new data, however, it lacked any ability to ‘edit’ it, or ‘delete’ it, because these were not explicitly specified by the user. That’s pretty silly because what the user meant by ‘add’ was really ‘manage’ and that implies that the three bits of functionality: add, edit and delete are all available. They are a ‘unit’, they only make sense together.

If instead of focusing on the literalness of the user, the analyst understood that the system itself was going to be the master repository for this newly collected entity then it would have been more than obvious what functionality was necessary. The work to create the requirements and the screen where superfluous and already well-defined by the existing territorial boundaries (the screen didn’t even match the existing interface conventions). A single new requirement to properly manage a new data entity was all that should have been necessary. Nothing more. The specification would then be completely derived from this and the existing conventions, either explicitly by an interface designer or implicitly by the programmer who would need to look up the current screen conventions in the code (and hopefully reuse most of it).

It is important to understand that territorial requirements are a lot less work, as well as being less vague. You need only list out the data, the computations, and for the interface: the navigation. In some cases, you might also have to list out hard outputs like specific reports (because there is limited flexibility in how they appear or their digital formats). With this information and performance and operational requirements, the designers can go about finding efficient ways to organize and layout the details for the programmers.

While the boundary described by the requirements needs to be reasonably close to 100% (although it can be abstract), the actual depth of the specifications are entirely dependent on the abilities of the programming teams. Younger, less experienced programmers, need more depth in the specifications to prevent them from going rogue. Battle-scarred seniors might only need the requirements themselves. Keep in mind that unwanted ‘creativity’ makes for a hideous interface, convoluted navigation, and brutal operational problems, as well as being a huge resource drain. A programmer that creates a whole new unique sub-system within an existing one is really just pouring fuel on the fire. It will only annoy the users and waste resources, even if it initially looks like it will be faster to code. The resulting disorganization is deadly, so it's best to not let it take hold. A programmer that goes rogue when there is an existing specification is far easier to manage then if there is nothing. Thus specifications are often vital to keep a group of programmers working nicely together. To keep them all building one integrated system, instead of just a pile of disconnect code.

The two initial stages can be described in many different ways, but they are best understood as decomposition and recomposition. That is, analysis is decomposing the problem into its underlying details. The most efficient way of doing this ensures that the parts of the territory are not overlapping, or just expressing the same things in different ways. Recomposition is the opposite. All of the pieces are put back together again, as a design, that ensures that the minimal amount of effort is needed to organize and complete the work. Stated that way, it is obvious that effective designs will heavily leverage reuse because it will take the least amount of overall work. Massive redundancies introduced via brute force will prevent entanglement but they do it by trading them for significant future problems. For any non-trivial system, that rapidly becomes the dominant roadblock.

An unfortunate cultural problem in programming is to continually push all decisions back to the users. Many programmers feel that it is not their responsibility to interpret or correct the incoming requirements and specifications. Somehow the idea that the user champions can correctly visualize the elements of a large system has become rooted. They certainly do know what they want the program to do, but they know this as a massive collection of different, independent path requirements, and often that set in their head isn’t fully resolved or complete and might even be contradictory. Solutions are indeed built for the users, but the work needs to progress reasonably. Building off a territory means the development teams can appropriately arrange the pieces to get constructed in the most efficient manner. Building off a stream of paths, most often means that each is handled independent, at a huge work multiplier. And no organization can get applied.

In that sense, giving control of the development process directly to a user champion will not result in anything close to efficient use of resources, rather the incoming chaos percolates throughout the code base. There might be some rare champion that does have the abilities to visualize the territorial aspects of the requirements, but even then the specifications still need to be created.

Analysis and design are different, although related, skill sets that need to exist and can likely be independently measured. For example, if there is significant scope creep, the analysis is failing. If there are plenty of integration problems, it is the specification. The first is that the necessary details were never known, while the second is that they were never organized well enough that the independent tasks were synchronized. In fact, categorizing bugs and using them to identify and fix overall process problems is the best way to capitalize on testing and operational feedback. The code needs to be fixed, but the process is often weak as well.

In a very real sense, it is entirely possible to walk backward from a bug, to the code, to the specifications and then to the requirements, to see if the flow of work has serious problems. There will always be minor hiccups, but in really badly run projects you see rather consistent patterns, such as massive redundancies. These can always be unwound by ensuring that the different stages of development fit together properly. Specifications, or the lack of them, sit in the middle, so they provide an early indicator of success.

It's also worth noting that some projects are small enough or straightforward enough that they don’t really need to actuate the specifications. The requirements should be recorded, but more as a means for knowing the direction that is driven by the users. If organization exists and the new requirements are just filling in familiar territory, then the code itself is enough to specify the next round of work. That’s why it is not uncommon on medium sized programs to see senior developers jump straight from conversations with the users to actual working code. Every detail that is necessary is already well-known, so given the lack of resources, the documentation portion is never done. That does work well when the developer is an organized, methodical person, and if they are ever replaced it is by someone that can actually read code (the only existing specification), but it fails really badly if those two necessary conditions don’t exist. Some code handovers go smoothly, some are major disasters.

Sometimes people use shifting territories as a reason to avoid analysis and specification. That is because the territory itself isn’t even locked down, then everything should be ad hoc, or experimental. This is most common with startups that are likely to pivot, at some point. The fallacy with this is that the pivots most often do not shift entirely away from the starting territory. That is, a significant aspect of the business changed, but not the technologies nor the required base infrastructure. And in most cases the shift itself doesn’t remove data entities, it just adds new ones that are higher in priority. So, a great deal of the technical base is still intact. If it wasn’t, then the only rational thing to do would be to drop 100% of the previous effort, but that necessity is actually quite rare. In that sense, territories expand and contract throughout the life of any development. Seeing and adjusting to that is important in effectively using the available resources, but it is an issue that is independent of analysis and design. No matter how the territories change, they still need to be decomposed, organized and then recomposed in order to move forward. The work is always constant, even if it is sporadically spread across the effort.

Building anything practical is always a byproduct of some form of analysis and design. As the scale of the attempt increases, the need for rigor and organization become increasingly bound to quality. If we set our sights on creating sophisticated software solutions that really make life easy for everyone, we need to understand how to properly set up these prerequisites to ensure that this happens as desired.

Sunday, November 22, 2015

Containers, Collections and Null

A variable in a computer program holds exactly one member of a finite set of possibilities. This datum symbolically represents something in the physical world, although it is also possible that it represents something from an imaginary reality.

The usefulness of a variable is that it can be displayed at the appropriate moment to allow a person, or something mechanical, to make a decision. Beyond that, it is just being saved under the belief that it will be used for this purpose some day or that it will be an input into some other variable’s moment.

A variable can also have with it an associated null flag. It is not actually ‘part’ of that variable, but rather an independent boolean variable that sheds light on the original contents. A friend of mine used to refer to null values as ‘out-of-band’ signals; meaning that they sit outside of the set of possible values. This is important in that they cannot be confused with another member of the set, what he used to call ‘in-band’ signals or often ‘magic numbers’.

There is generally a lot of confusion surrounding the use of nulls. People ascribe all sorts of program-specific meanings to them, which act as implicit data in their program. In this way, they overload the out-of-band signal to represent a custom state relative to their own data. This is a bad idea. Overloading any meaning in a variable guarantees some sort of future problem since it is subjective and easily forgotten. It is far better to ascribe a tighter definition.

A variable is collected or it is not; computers are perfectly binary in that regard. If it is possible for a variable to not be collected, the only reason for this is that the variable is ‘optional’ otherwise the program will not continue with its processing. Obviously bad data should not be allowed to propagate throughout a system or get stored in the database. If there are different, and useful, excuses for not having specified a variable right away, then these are independent variables themselves. With that in mind, a null only ever means that the variable is optional and was not collected, nothing else should be inferred from it.

Then for the sake of modeling, for all variables, one only needs to know if the data is mandatory -- it must be collected -- or it is nullable. Getting a null in a mandatory field is obviously an error, which needs to invoke some error handling. Getting a null in an optional field is fine. If a variable is nullable ‘anywhere’ in the system then it should be nullable ‘everywhere’ in the system (it is part of the data model). If there is functionality dependent on an optional variable, then it should only ever execute when that variable is present, but should not be an error if it is missing. If the underlying data is optional, then any derived data or functionality built on top of it is optional as well. Optionality propagates upwards.

With that perspective, handling nulls for a variable is easy and consistent. If it doesn’t make sense that the data could be missing then don’t make it nullable. It also helps to understand where to reinforce the data model if the data is a bit trickier than normal. For instance, partially collected data is ‘bad’ data until all of the mandatory values have been specified. So it should be partitioned or marked appropriately until it is ready.

The strength of a computer is not that it can remember a single variable, but rather that it can remember a huge number of them. And they can be interrelated. What really adds to the knowledge is the structure of these cross-variable relationships. There can be a huge number of instances of the same variable and/or there can be a huge number of relationships to other types of variables.

The larger and more complex the structure, the more information that can be gleaned out of this collected data. The history of software has always been about trying to cope with increasingly larger and more complex data structures even if that hasn’t been explicitly stated as the goal. In a sense, we don’t just want data, we want a whole model of something (often including a time dimension) that we can use to make deeper, more precise decisions.

The complex structural relationships between variables are represented within a program as ‘containers’. A simple container, say a structure in C or a simple object in Java, is just a static group of related variables. These can be moved throughout the program as if they were a single variable, ensuring that they are all appropriately kept together. In this arrangement, the name of the variable in many programming languages is a compile-time attribute. That is, the programmer refers to the name when writing the code, but no such name exists in the system at runtime. Some languages provide introspection, allowing the programmer to retrieve their name and use it at runtime.

Null handling for simple containers is similar to null handling for individual variables. Null means that ‘all’ of the data in the structure is optional. In some languages, however, there can be confusion on how to handle mandatory data. With a simple variable, the language itself can be used to insist that it always exists, but with a pointer or reference to a container that check needs to be explicitly in the code. It can come in the form of asserts or some explicit branches that jump to error handling. The code should not continue with functionality that relies on mandatory data if it is missing.

The next level of sophistication allows for a runtime based ‘named’ variable. That is, both the name and the variable contents are passed around as variables themselves, together in a container. Frequently this container is implemented as a hashtable (sometimes called a dictionary), although in some cases ‘order’ is required so there can also be an underlying linked-list. This is quite useful for reusing code for functionally on similar data with only some minor processing attached to a few specific variables. Then it mostly leaves the bulk of the code to manipulate the data without having to really understand it, making it strongly reusable. This works well for communications, printing, displaying stuff in widgets, etc. Any part of the code whose data processing isn’t explicitly dependent on the explicit meaning of the underlying data, although sometimes there needs to be categories (often data types) of behavior. Learning to utilize this paradigm can cut out a huge number of redundant lines of code in most common programs.

Null handling for containers of named variables is slightly different, in that the absence of a particular named pair is identical to the name existing with a null value. Given this overlap, it is usually best to not add empty, optional data into the container. This is also reflected symmetrically by not passing along values without a name either. This type of structuring means that processing such a container is simplified in that each pair is either mandatory, or it is optionally included. If a pair is missing, then it was optional. To enforce mandatory variables, again there needs to be some trivial code that interrupts processing.

Containers get more interesting when they allow multiple instances of the same data, such as an array, list or tree. Large groups of collected data shed a brighter light on the behavior of their individual datum, thus providing deeper details. For these ‘collections’, they can be ordered or unordered although the latter is really a figment of the programmer’s imagination in that everything on a computer has an intrinsic order, it is just sometimes ignored.

Ordering can be based on any variables within the data although often it is misunderstood; the classic case of that being tables in a relational database returning their default order based on their primary index construction, thus leading to the extremely common bug of the SELECT statement order changing unexpectedly when the tables grow large or rows are deleted. Programmers don’t see this potentially chaotic reordering when testing with small datasets, so they make bad assumptions about what really caused the visible ordering.

One frequent confusion that occurs is with null handling for collections, in that an empty collection or a null reference can be interpreted as the same thing. In most cases, it is really optional data has not been collected, so it doesn’t make sense to support this redundancy. It is more appropriate to handle the ‘zero’ items condition as an optional collection, and the null reference itself as a programming error. This is supported quite elegantly by having any code that returns a collection to always allocate an empty one. This can then be mirrored by any function that needs a collection as well, in that it can assume that null will not be passed in, just empty collections. This reduces bugs caused by inconsistent collection handling, but it does mean that every branch of any new code should be executed at least once in testing to catch any unwanted nulls. It doesn’t, however, mean that every permutation of the logic needs to be tested, just the empty case, so the minimum test cases are fairly small. This isn’t extra work in that the minimum reasonable practice for any testing is always that no ‘untested’ lines of code should ever be deployed. That’s just asking for trouble, and if it happens it is a process problem, not a coding one.

This null handling philosophy prevents the messy and time-wasting inefficiencies of never knowing which direction an underlying programmer is going to choose. We’ll call it ‘normalized collection handling’. It does, however, require wrapping any questionable, underlying calls from other code just in case it doesn’t behave this way, but wrapping all third-party library calls was always considered a strong best practice, right up until it was forgotten.

Some programmers may not like it because they believe that it is more resource-intensive. Passing around a lot of empty collections will definitely use extra memory. Getting the count out of a collection is probably more CPU than just checking for a null (but less if you have to do both which is usually the case). This is true, but because an empty collection eats up the base management costs, it also means that the resource usage during processing, at least from the lowest level, is considerably more consistent. Programs that fluctuate with large resource usage extremes are far more volatile, which makes them far more difficult for operations. That is, if whenever you run a program, its total resource usage is predictable and relative to load, then it becomes really easy to allocate the right hardware. If it swings to extremes, it becomes far more likely to exceed its allocation. Stable-resource systems are way easier to manage. A little extra memory then is a fair price to pay for simpler and more stable code.

We can generalize all of the above by realizing that null handling differs between static and dynamic variables. Adding nulls is extra work in the static case while enforcing mandatory requirements is extra work in the dynamic case. Static data is easier to code, but dynamic data is both flexible and more reusable. In that sense, if we are going to just hardcode a great deal of static data, it is best if the default is mandatory and optional data is the least frequent special case. The exact opposite is true with dynamic data. Most things should be optional, to avoid filling the code with mandatory checks. This behavioral flip flop causes a great deal of confusion because people want a one-size-fits-all approach.

A nice attribute about this whole perspective of data is that it simplifies lots of stuff. We only have containers of unnamed or named variables. Some containers are collections. This is actually an old philosophy, in that it shows up in languages like AWK that have ‘associated arrays’ where the array index can be an integer or a string. The former is a traditional array, while the latter is a hashtable, but they are treated identically. In fact, in AWK, I think it just cheats and makes everything a hashtable, converting any other data type to a string key. This makes it a rather nice example of an abstraction smoothing away the special cases, for the convenience of both the users of the language and the original programmers.

We have to, of course, extend this to allow for containers of a mix between variables and sub-containers, and do that in a fully recursive manner so that a variable can be a reference to a container or collection itself. This does open the door to having cycles, but in most cases, they are trivial to detect and easily managed. Going a little farther, the ‘name’ of the variable or key of the hashtable itself can be a recursive container of structures as well, although eventually, it needs to resolve down to something that is both comparable and will generate a constant hashing code (it can't change over time). Mixed all together, these techniques give us a fully expressible means of symbolically representing any model in spacetime and thus can be increasingly used to make even more complex decisions.

We should try not to get lost in the recursive nature of what we need to express, and it is paradigms like ADTs and Object-Oriented programming that are really helpful in this regard. A sophisticated program will have an extraordinarily complex data model with many, many different depths to it, but we can assemble and work on these one container at a time. That is, if the model is a list of queues of trees with stacks and matrices scattered about, we don’t have to understand the entire structure as a ‘whole’ in order to build and maintain the code. We can decompose it into its underlying components, data structures and/or objects, and ensure that each one works as expected. We can then work on, and test, each sub-relationship again independently. If we know all the structural relationships are correct, and we know the underlying handling is correct, then we can infer that the overall model is correct.

While that satisfies the ability to implement the code, it says nothing about how we might decide on a complex model to a real-world solution. If there is art left in programming, much of it comes directly from this issue. What’s most obvious is that the construction of sophisticated models needs to be prototyped long before the time is spent to implement them, so that its suitability can be quickly rearranged as needed. The other point is that because of the intrinsic complexity, this type of modeling needs to be built from the bottom-up, while still being able to understand the top-down imposed context. Whiteboards, creativity and a lot of discussions seem to be the best tools for arriving at decent results. Realistically this is actually more of an analysis problem than a coding one. The structure of the data drives the behavior, the code just needs to be aware of what it is supposed to be, and handle it in a consistent manner.

Some programmers would rather avoid this whole brand of complexity and stick to masses of independent static variables, but it is inevitable that larger, more complex, structural relationships will become increasingly common as the user’s requirements for their software gradually get more sophisticated. A good example of this being the non-trivial data-structures that underpin spreadsheets. They have only gained in popularity since they were invented. Their complexity unleashed power-users to solve programmatic issues for themselves that were unimaginable before this technology existed. That higher level of sophistication is sorely needed for many other domain-based problems, but currently, they are too often encoded statically. We’ve been making very painfully slow progress in this area, but it is progress and it will continue as users increasingly learn what is really possible with software.

Thinking of the possible variable arrangements as structural relationships between containers makes it fairly straightforward to understand how to represent increasingly complex real-world data. With tighter definitions, we can avoid the disorganization brought on by vagueness, which will also cut down on bugs. Software is always founded on really simple concepts, the definition of a Turing machine for example, but we have built on top of this a lot of artificial complexity due to needless variations. Being a little stricter with our understanding allows us to return to minimal complexity, thus letting us focus our efforts on achieving both sophistication and predictability. If we want better software, we have to detach from the sloppiness of the past.

Sunday, November 8, 2015

Software Engineering

The standard definition for engineering from Wikipedia is:

Engineering is the application of mathematics, empirical evidence and scientific, economic, social, and practical knowledge in order to invent, design, build, maintain, research, and improve structures, machines, tools, systems, components, materials, and processes.

But I like it to simplify it into just two rules:

Understanding something really deeply.
Utilizing that knowledge to build stuff with predictable behavior.

With the added caveat that by ‘predictable behavior’ I mean that it is predictable for ‘all’ possible circumstances in the real world. That’s not to say that it must withstand everything that could be thrown at it, but rather that given something unexpected, the behavior is not. There is no need to guess, it can be predicted and will do exactly as expected. Obviously, there is a strong tie between the actual depth of the knowledge and the accuracy of these predictions.

The tricky part about engineering good software is acquiring enough deep knowledge. Although the existing underlying software is deterministic and explicitly built by people, it has been expanding so rapidly over the last five decades that it has become exceptionally convoluted. Each individual technology has blurry conventions and lots of quirky behavior. It becomes difficult to both properly utilize the technology and mitigate it under adverse conditions. Getting something to ‘sort of’ work is not too difficult, but getting it to behave reliably is extraordinarily complex and time-consuming.

For example, if you wanted to really utilize a relational database, that would require a good understanding of the set-theoretical nature of SQL, normalization, query plans, implicit queries, triggers, stored procedures, foreign keys, constraints, transactions, vendor-specific event handling and how to combine all of these together effectively for models that often exceed the obvious expressiveness. When used appropriately, a relational database is a strong persistence foundation for most systems. Inappropriate usage, however, makes it awkward, time-consuming and prone to unsolvable bugs. The same technology that can nearly trivialize routine systems can also turn them into hopeless tangles of unmanageable complexity. The difference is all about the depth of understanding, not the virtues of the actual software.

A big obstacle in acquiring deep knowledge is the lack of authoritative references. Someone could write a book that would explain in precise detail how to effectively utilize a relational database for standard programming issues, but culturally we don’t get that specific because it would be discounted due to subjective objections. That is, anyone with even a small variation of opinion on any subset of the proposed approach would discount the entire approach as invalid, thus preventing it from becoming common. In addition, creativity is valued so highly that most programmers would strongly prefer to rediscover how to use a relational database, over decades, then just adopt the pre-existing knowledge. That is unfortunate because there are more interesting problems to solve if you get past these initial ones.

To get to actual engineering we would have to be able to recognize the routine parts of our problems, and then solve them with standardized components whose ‘full’ behavior is well documented. This would obviously be a lot easier if we had a reliable means of categorizing that behavior. Thus we would not need to consume massive resources experimentally determining what happens if we knew that a technology was certified as ‘type X’ for instance. In that sense, the details need to be encapsulated, but all behavioral changes, such as possible errors, need to be explicitly documented and to strictly follow some standard convention. If we can achieve this, then we have components which can be used and if a programmer sticks with a collection of them from a limited set of categories, they can actually have a full and deep understanding of how they will affect the system. That depth will give us the ability to combine our work on top in a manner that is also predictable. Without -- of course -- having to deeply understand all of the possible conventions currently out there or even the full depth of the underlying technology.

What prevents us from actual software engineering is our own cultural evolution. We pride ourselves on not achieving any significant depth of knowledge, but rather just jumping in and flailing at crude solutions. Not standardizing what we build works in favor of both the programmers and the vendors. The former is in love with the delusion of creativity, while the latter deem it as a means to lock in clients. There is also a persistent fear that any lack of perceived freedom will render the job of programming boring. This is rather odd, and clearly self-destructive, since continuously re-writing ‘similar’ code gradually loses its glamour, resulting in a significant shortening of one’s career. It’s fun and ego fulfilling the first couple of times, but it eventually gets frustrating. Solving the same simple problems over and over again is not the same as really solving challenging problems. We do the first while claiming we are really doing the second.

There are many little-isolated pockets of software engineering in existence right now, I’ve worked in a few of them in the past. What really stands out is that they are considerably less stressful, more productive and you feel really proud of the work. Slapping together crud in a hurry is the exact opposite; some crazy deadline gets met, but that’s it. Still, the bulk of the nearly 18M programmers on the planet right now are explicitly oriented towards just pounding stuff out. And of the probably trillions of lines of code that are implicitly relied on by any non-trivial system, more and more of it is utilizing less and less knowledge. It is entirely possible to create well-engineered software, and it is possible to achieve decent quality in a reasonable amount of time, but we are slipping ever farther away from that goal.

At some point, software engineering will become mandated. As software ‘eats the world’ the unpredictability of what we are currently creating will become increasingly hazardous. Eventually, this will no longer be tolerated. Given its inevitability, it would be far better if we voluntarily refactored our profession instead of having it forced on us by outsiders. A gentle realignment of our culture would be less of a setback than a Spanish-style inquisition. It’s pretty clear from recent events that we are running out of time, and it’s rather obvious that this needs to be a grassroots movement. We can actually engineer software, but it just isn’t happening often right now and it certainly isn’t a popular trend.