Saturday, December 19, 2015

Routine Software

As our knowledge and understanding of software grows, it is important to keep track of which software projects are ‘trivial’ and which are ‘routine’. Trivial means that something can be constructed with only the bare minimum of effort, by someone who has done it before. Routine means that it has been done so many times in the past that the knowledge -- explicit or implicit -- is available within the industry; it doesn’t, however, mean that it is trivial, or that there isn’t a large amount of work involved.

Both of these categories are, of course, relative to the current state of the industry. Neither of them means that a novice will find the work ‘easy’; doing anything without the prerequisite knowledge is basically ‘hard’ by definition. If you don’t know what you are doing, then only massive amounts of sweat and luck will ever make it possible.

Definition

At this time -- and it has remained rather consistent for at least the last decade -- a routine system is medium-sized or smaller, with a statically defined data model. Given our technologies, medium means approximately 400 users or fewer, and probably about 40,000 to 60,000 lines of code (LOC). A statically defined data model means that the structure of what is collected now is not going to change unless the functionality of the system is explicitly extended. That is, the structure doesn’t change dynamically on its own; what is valid about a data entity today is also valid tomorrow.

Modern routine systems almost always have a graphical user interface. Some are thick clients (a single stand-alone process), while others are client-server based (which also covers the architectural constraints on a basic web or mobile app, even if it involves several backend components). None of this affects whether the development is routine, since it has all been around for decades, but it does involve separate knowledge bases that need to be learned.

All of these routine systems rely primarily on edit loops:

http://theprogrammersparadox.blogspot.ca/2010/03/edit-loop.html

They move the data back and forth between a series of widgets and a database of some form. Most import external data through some form of stable ETL mapping, and export data out via well-defined data formats. Some have deeper interactive presentations.
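
Below is a minimal sketch of such an edit loop in Python, assuming a hypothetical storage layer and form widget set (none of these names come from a real library); real systems add locking, auditing and much richer error handling:

  def edit_loop(storage, form, entity_id):
      record = storage.load("customer", entity_id)      # database -> memory
      form.populate(record)                             # memory -> widgets
      while True:
          action = form.wait_for_user()                 # user edits the widgets
          if action == "cancel":
              return
          updated = form.collect()                      # widgets -> memory
          errors = validate(updated)
          if errors:
              form.show_errors(errors)
              continue
          storage.save("customer", entity_id, updated)  # memory -> database
          return

  def validate(record):
      # Domain rules live here; this placeholder only checks required fields.
      return [f for f in ("name", "email") if not record.get(f)]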

Any system that does not get more complicated than this is routine, in that we have been building these now for nearly three decades, and the problems involved are well-defined and understood.

There are at least three sub-problems that will cause the development to no longer be routine, although in most cases these only affect a portion of the system, not the whole. They are:

  • Dynamic data-models
  • Scale larger than medium
  • Complex algorithmic requirements

Dynamic Data

A dynamic data model means that the underlying structure of the data can and will change all of the time. Users may enter one structure one day, then something substantially different the next, yet these will still be the same data entity. This occurs because the domain is purposely shifting, often to its own advantage. Obviously, you can’t statically encode the entire space of possible changes, because that would involve knowing the future.

Dealing with dynamic data models means pushing the problem back to the users. That is, you give them some really convenient means of keeping up with the changes, like a DSL or a complex GUI, so that they can adapt quickly. That may seem easy, but the problem leaks into both the interface and the persistence. That is, you need some form of dynamic interface that adapts to whatever collection and reporting is currently needed, and you need this whole mess to be held dynamically in a persistence technology. The trick is to be able to write code that has almost no knowledge of the data that it is handling. The breadth and abstract nature of the problem are what make it tricky to implement correctly; it is very rare to see it done well.
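
As a rough illustration, here is a sketch (with invented names and a toy in-memory store) of code that has almost no knowledge of the data it handles: the structure of each entity is itself data, defined at runtime, so nothing about the fields is hard-coded:

  schema = {}   # entity name -> {field name: type}
  store = {}    # (entity name, id) -> {field name: value}

  def define_entity(name, fields):
      schema[name] = dict(fields)          # users extend the model at runtime

  def save(name, record_id, values):
      unknown = set(values) - set(schema[name])
      if unknown:
          raise ValueError(f"unknown fields: {unknown}")
      store[(name, record_id)] = dict(values)

  def render_form(name):
      # The interface is generated from the schema, never hand-built per entity.
      return list(schema[name].items())

  # The domain shifts: a new attribute appears, yet no code changes are needed.
  define_entity("trade", {"instrument": str, "quantity": int})
  define_entity("trade", {"instrument": str, "quantity": int, "venue": str})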

Scaling

Once the required scale exceeds the hardware capabilities, the system needs to be decomposed into pieces that can execute independently instead of all together. This sizing threshold continually shifts because the hardware is evolving quickly, but there is always some point at which scale becomes the primary technical problem. In a simple system, if the decomposition leads to a set of independent pieces, the problem is only mildly painful: each piece is pushed out onto its own hardware. Sometimes this is structural, such as splitting the backend server into a web server and a database server. Sometimes it can be partitioned, such as sticking a load balancer in front of replicated servers.
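
As a toy illustration of partitioning (the host names are invented), requests can be routed to one of several replicated servers by hashing a key; in practice this routing usually lives in an off-the-shelf load balancer or proxy rather than in hand-rolled code:

  import hashlib

  SERVERS = ["app-1.internal", "app-2.internal", "app-3.internal"]

  def pick_server(key):
      digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
      return SERVERS[int(digest, 16) % len(SERVERS)]

  # All requests for the same account land on the same replica, so any
  # per-account state never needs cross-server synchronization.
  print(pick_server("account-42"))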

If the data and code aren’t independent, then very complex synchronization algorithms are needed, many of which are cutting-edge computer science right now.

Software solutions for scale also exist in one of the many forms of memoization, such as caching or paging. In the case of caching, however, adding the ability to ‘reuse’ data or sub-calculations also means being able to precisely understand and scope the lifespan of that data; failing to do this makes it easy to accidentally rely on stale results.
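
A small sketch of what scoping that lifespan might look like, assuming an arbitrary time-to-live (the right lifespan is entirely domain-specific): each cached value carries an expiry time, and skipping that check is exactly how stale data sneaks in:

  import time

  CACHE = {}          # key -> (value, expires_at)
  TTL_SECONDS = 30    # an assumed lifespan; the right value is domain-specific

  def cached_lookup(key, compute):
      now = time.monotonic()
      hit = CACHE.get(key)
      if hit is not None and hit[1] > now:
          return hit[0]                      # reuse the earlier result
      value = compute(key)                   # the expensive calculation
      CACHE[key] = (value, now + TTL_SECONDS)
      return value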

Most scaling solutions, in and of themselves, are not complex, but when multiple ones exist together their interactions can be extraordinarily complicated. As the necessary scale grows, the need to bypass more bottlenecks means significant jumps in this complexity and an increased risk of sending the performance backward. This complex interaction makes scaling one of the most difficult problems, and because we don’t have a commonly used toolkit for forecasting the behavior, much of the work is based on intuition or trial and error.

Algorithms

Most derived data is rather straightforward, but occasionally people are looking for subtle relationships within the structure of the data. In this way, there is a need for very complex algorithms, the worst of which is AI (since if we had that, it could find the others). Very difficult algorithms are always a challenge, but at least they are independent of the rest of the system. That is, in most applications they usually only account for a small percentage of the functionality, say 10% to 20%. The rest of the system is really just a routine wrapper that collects or aggregates the data needed to feed them. In that way, these algorithms can be encapsulated into an ‘engine’ that is usable by a routine system, and so the independence is preserved.
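
A sketch of that encapsulation (the engine and its methods are invented for illustration): the routine wrapper only knows a narrow interface, while the hard mathematics stays behind it:

  class RecommendationEngine:
      """Encapsulates the complex algorithmic core behind a narrow interface."""

      def fit(self, historical_data):
          # ... the genuinely hard, research-driven work would go here ...
          self._weights = {row["item"]: row["weight"] for row in historical_data}

      def score(self, candidate):
          return self._weights.get(candidate, 0.0)

  # The routine wrapper: collect the data, feed the engine, display the results.
  engine = RecommendationEngine()
  engine.fit([{"item": "A", "weight": 0.9}, {"item": "B", "weight": 0.2}])
  print(engine.score("A"))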

For some really deep ‘systems’ programming problems like operating systems, compilers or databases, the state-of-the-art algorithms have advanced significantly and require significant research to understand. Most have some routine core that intermingles with the other two problems. What often separates systems programming from applications programming is that ignoring what is known, and choosing to crudely reinvent it, is far more likely to produce something defective. It’s best to either do the homework first or use code by someone who has already done the prerequisite research.

Sometimes semi-systems programming approaches, such as locking, threading or real-time(ish) processing, blend into routine application development. These are really a mix of scaling and algorithmic issues, and they should only be used in a routine system if the default performance is unacceptable. If they are used, significant research is required to use them properly, and some thought should be given to how to properly encapsulate them so that future changes don’t turn ugly. It is quite common to see poor implementations that actually degrade performance, instead of helping it, or that cause strange, infrequent bugs that go for years without being found.
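
One way to keep that encapsulation, sketched here with Python's standard threading module: the locking is confined to a single small class, so the routine code never sees a lock:

  import threading

  class SharedCounter:
      def __init__(self):
          self._value = 0
          self._lock = threading.Lock()

      def increment(self):
          with self._lock:                 # the only place locking is visible
              self._value += 1

      def value(self):
          with self._lock:
              return self._value

  # Routine code uses the counter without knowing about threads or locks.
  counter = SharedCounter()
  threads = [threading.Thread(target=counter.increment) for _ in range(8)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()
  print(counter.value())   # 8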

Finale

Now, of course, all three of these problems can be required for the same system at the same time, and it can be necessary to run this within a difficult environment, such as one that must be fault tolerant. There are plenty of examples of us constructing such beasts and trying to tame them. At the same time, there are far more examples of mostly routine systems that share few of these problems. In that latter category there are also examples of people approaching a routine system as if it were an extraordinarily complex one, and in those cases their attempted solutions are unnecessarily expensive, unstable or even doomed. Understanding what work is actually routine, then, means knowing how to go about doing that work so that it is most likely to succeed and be useful in the future.

What we should do as an industry is produce better recipes and training for building routine systems. In an organized development environment, this type of work should proceed in a smooth and highly estimable fashion. For more advanced systems, the three tricky areas can often be abstracted away from the base, since they are harder to estimate and considerably riskier. Given these constraints, and strong analysis and design, there is no reason why most modern software development projects should be so chaotic. They can, and should, be under much better control.

Friday, December 4, 2015

Requirements and Specifications

Programmers often complain about scope creep. 

The underlying cause is likely that their project has been caught up in an endless cycle of requirements and specifications that bounce all over the place, which is extraordinarily expensive and frustrating for everyone.

The inputs to programming are design specifications, which are created from requirements gathered during analysis. There are many other ways to describe these two stages and their outputs, but ultimately they all boil down to the same underlying information. Failures in one or both of these earlier stages become really obvious during the coding stage. If the details aren’t ever locked down, and everything is interconnected, then frequent erratic changes mean that a lot of work gets wasted. In that sense, scope creep isn’t really a programming problem, but rather a process one.

Quite obviously, scope creep wouldn’t happen if the specification for the system were 100% complete. The programmers would just code exactly what is needed -- once -- and then proceed to polish the work with testing. The irony is that the work of specifying a system to 100% is actually the work of writing the system itself. That is, if you made the effort to ensure that no detail was vague or left unspecified, then you could write another program to turn that specification directly into the required program.

A slight variation on this idea was actually floated a long time ago by Jack W. Reeves in “What is Software Design?” but never went mainstream:

http://www.developerdotstar.com/mag/articles/reeves_design_main.html

Of course, time and a growing understanding of the solution generally mean that any original ideas for a piece of software will always require some fiddling. But it is obviously cheaper and far more efficient to work out these changes on a smaller scale first -- on paper -- before committing to the slow, detailed work of writing and testing the code. Thus, it is a very good practice to create short, high-level specifications in which to rearrange the details, long before slogging through the real effort.

As mentioned, a good specification is the by-product of the two earlier stages, analysis and design. The first stage is the collection of all the details that are necessary to solve the problem. The second is to mix that analysis with technology and operational requirements, in order to flesh out an architecture that organizes the details. Scope creep is most often caused by a failure of analysis: the details were never collected, or they weren’t vetted, or an aspect of the domain that is dynamic was treated statically. Any one of these three problems will result in significant changes, and any one of them in a disorganized environment will set off a change cyclone.

There is also a rather interesting fourth problem: the details were collected, but not in a way that was stable.

The traditional approach to requirements is to craft a sentence that explains some need of the user. By definition, this expresses a winding ‘path’ through the underlying data and computations, while often being targeted at only a special case. People find it easier to isolate their needs this way, but it is actually problematic. If the specification for a large system is composed of a really large collection of these path-based requirements, then it resembles a sort of cross-hatched attempt to fill in the details of the system, not unlike the scratchings of a toddler in a coloring book. But the details are really a ‘territory’, in that they are a finite set of data, broken down by modeled entities, with higher-level functionality coming from computations and derived data. The territory is also packed with some navigational aids and a visual aesthetic.

A good system is complete, in the sense that it manages all of the data correctly and provides a complete set of tools to do any sort of display or manipulation necessary. It is a bounded territory that needs to be filled in. Nicely. Describing this with an erratic set of cross-hatched paths is obviously confusing and prone to error. If the programmers fixate on the wrong subset of paths, necessary parts of the system fall through the cracks. Then, when they are noticed, things have to change to fill those gaps. Overlaps likewise cause problems by driving the creation of redundancies, which eventually lose synchronization with each other.

A very simple example of this confusion happened a while back when a user told an analyst that he needed to ‘add’ some new data. The analyst took that path ‘literally’ and set it down as a requirement, which he turned into a screen specification. The programmer took that screen literally and set it down as code. A little time passed and the user made a typo that he only noticed after he had already saved the data. He went to edit the data, but… The system could ‘add’ new data; however, it lacked any ability to ‘edit’ it or ‘delete’ it, because these were not explicitly specified by the user. That’s pretty silly, because what the user meant by ‘add’ was really ‘manage’, which implies that all three bits of functionality -- add, edit and delete -- are available. They are a ‘unit’; they only make sense together.
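
In code, the ‘unit’ is easy to see; here is a toy in-memory sketch (the entity and store are invented) of treating ‘manage’ as add, edit and delete travelling together:

  records = {}

  def add(record_id, data):
      records[record_id] = dict(data)

  def edit(record_id, changes):
      records[record_id].update(changes)   # the fix the user expected to make

  def delete(record_id):
      del records[record_id]

  add("42", {"name": "Smiht"})
  edit("42", {"name": "Smith"})            # without edit(), the typo is permanent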

If, instead of focusing on the literal words of the user, the analyst had understood that the system itself was going to be the master repository for this newly collected entity, then it would have been more than obvious what functionality was necessary. The work to create the requirements and the screen was superfluous, and already well-defined by the existing territorial boundaries (the screen didn’t even match the existing interface conventions). A single new requirement to properly manage a new data entity was all that should have been necessary. Nothing more. The specification would then be completely derived from this and the existing conventions, either explicitly by an interface designer or implicitly by the programmer, who would need to look up the current screen conventions in the code (and hopefully reuse most of it).

It is important to understand that territorial requirements are a lot less work, as well as being less vague. You need only list out the data, the computations, and, for the interface, the navigation. In some cases, you might also have to list out hard outputs like specific reports (because there is limited flexibility in how they appear or in their digital formats). With this information, plus the performance and operational requirements, the designers can go about finding efficient ways to organize and lay out the details for the programmers.
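
As a rough sketch of what a territorial requirement might look like (the entity and its fields are invented), it is closer to a structured listing than to a pile of path sentences:

  requirement = {
      "entity": "invoice",
      "data": ["customer", "line_items", "issued_date", "due_date", "status"],
      "computations": ["total = sum of line_items", "overdue = today > due_date"],
      "navigation": ["customer screen -> invoice list -> invoice detail"],
      "hard_outputs": ["monthly statement (fixed layout, PDF)"],
  }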

While the boundary described by the requirements needs to be reasonably close to 100% (although it can be abstract), the actual depth of the specifications is entirely dependent on the abilities of the programming teams. Younger, less experienced programmers need more depth in the specifications to prevent them from going rogue. Battle-scarred seniors might only need the requirements themselves. Keep in mind that unwanted ‘creativity’ makes for a hideous interface, convoluted navigation and brutal operational problems, as well as being a huge resource drain. A programmer that creates a whole new unique sub-system within an existing one is really just pouring fuel on the fire. It will only annoy the users and waste resources, even if it initially looks like it will be faster to code. The resulting disorganization is deadly, so it's best to not let it take hold. A programmer that goes rogue when there is an existing specification is far easier to manage than if there is nothing. Thus specifications are often vital to keep a group of programmers working nicely together -- to keep them all building one integrated system, instead of just a pile of disconnected code.

The two initial stages can be described in many different ways, but they are best understood as decomposition and recomposition. That is, analysis is decomposing the problem into its underlying details. The most efficient way of doing this ensures that the parts of the territory are not overlapping, or just expressing the same things in different ways. Recomposition is the opposite: all of the pieces are put back together again, as a design, in a way that ensures the minimal amount of effort is needed to organize and complete the work. Stated that way, it is obvious that effective designs will heavily leverage reuse, because that takes the least amount of overall work. Massive redundancies introduced via brute force will prevent entanglement, but they do so by trading it for significant future problems. For any non-trivial system, that rapidly becomes the dominant roadblock.

An unfortunate cultural problem in programming is to continually push all decisions back to the users. Many programmers feel that it is not their responsibility to interpret or correct the incoming requirements and specifications. Somehow the idea that the user champions can correctly visualize the elements of a large system has become rooted. They certainly do know what they want the program to do, but they know this as a massive collection of different, independent path requirements, and often that set in their head isn’t fully resolved or complete, and might even be contradictory. Solutions are indeed built for the users, but the work needs to progress reasonably. Building off a territory means the development teams can appropriately arrange the pieces to get constructed in the most efficient manner. Building off a stream of paths most often means that each one is handled independently, at a huge work multiplier, and no organization can be applied.

In that sense, giving control of the development process directly to a user champion will not result in anything close to an efficient use of resources; rather, the incoming chaos percolates throughout the code base. There might be some rare champion who does have the ability to visualize the territorial aspects of the requirements, but even then the specifications still need to be created.

Analysis and design are different, although related, skill sets that need to exist and can likely be measured independently. For example, if there is significant scope creep, the analysis is failing. If there are plenty of integration problems, it is the specification that is failing. The first means that the necessary details were never known, while the second means that they were never organized well enough for the independent tasks to stay synchronized. In fact, categorizing bugs and using them to identify and fix overall process problems is the best way to capitalize on testing and operational feedback. The code needs to be fixed, but the process is often weak as well.

In a very real sense, it is entirely possible to walk backward from a bug, to the code, to the specifications and then to the requirements, to see if the flow of work has serious problems. There will always be minor hiccups, but in really badly run projects you see rather consistent patterns, such as massive redundancies. These can always be unwound by ensuring that the different stages of development fit together properly. Specifications, or the lack of them, sit in the middle, so they provide an early indicator of success.

It's also worth noting that some projects are small enough or straightforward enough that they don’t really need to write out the specifications. The requirements should be recorded, but more as a means of knowing the direction that is driven by the users. If organization exists and the new requirements are just filling in familiar territory, then the code itself is enough to specify the next round of work. That’s why it is not uncommon on medium-sized programs to see senior developers jump straight from conversations with the users to actual working code. Every detail that is necessary is already well-known, so given the lack of resources, the documentation portion is never done. That does work well when the developer is an organized, methodical person, and when any replacement is someone who can actually read the code (the only existing specification), but it fails really badly if those two necessary conditions don’t hold. Some code handovers go smoothly; some are major disasters.

Sometimes people use shifting territories as a reason to avoid analysis and specification. That is, because the territory itself isn’t even locked down, they argue that everything should be ad hoc or experimental. This is most common with startups that are likely to pivot at some point. The fallacy with this is that the pivots most often do not shift entirely away from the starting territory. That is, a significant aspect of the business changes, but not the technologies nor the required base infrastructure. And in most cases the shift itself doesn’t remove data entities; it just adds new ones that are higher in priority. So, a great deal of the technical base is still intact. If it weren’t, then the only rational thing to do would be to drop 100% of the previous effort, but that necessity is actually quite rare. In that sense, territories expand and contract throughout the life of any development. Seeing and adjusting to that is important in effectively using the available resources, but it is an issue that is independent of analysis and design. No matter how the territories change, they still need to be decomposed, organized and then recomposed in order to move forward. The work is always constant, even if it is sporadically spread across the effort.

Building anything practical is always a byproduct of some form of analysis and design. As the scale of the attempt increases, the need for rigor and organization becomes increasingly bound to quality. If we set our sights on creating sophisticated software solutions that really make life easy for everyone, we need to understand how to properly set up these prerequisites to ensure that this happens as desired.