Saturday, December 19, 2015

Routine Software

As our knowledge and understanding of software grows, it is important to keep track of which software projects are ‘trivial’ and which are ‘routine’. Trivial means that something can be constructed with only the bare minimum of effort, by someone who has done it before. Routine means that it has been done so many times in the past that the knowledge -- explicitly or implicitly -- is available within the industry; it doesn’t, however, mean that it is trivial, or that it isn’t a large amount of work.

Both of these categories are, of course, relative to the current state of the industry. Neither of them means that a novice will find it to be ‘easy’, in that doing anything without pre-requisite knowledge is basically ‘hard’ by definition. If you don’t know what you are doing then only massive amounts of sweat and luck will ever make it possible.

Definition

At this time, and it has remained rather consistent for the last decade at least, a routine system is medium-sized or smaller, with a statically defined data model. Given our technologies, medium means approximately 400 users or less, and probably about 40,000 to 60,000 lines of code (LOC). A statically defined data model means that the structure of what is collected now is not going to change unless the functionality of the system is explicitly extended. That is, the structure doesn’t change dynamically on its own; what is valid about a data entity today is also valid tomorrow.

Modern routine systems almost always have a graphical user interface. Some are thick clients (a single stand-alone process), while others are client-server based (which also covers the architectural constraints on a basic web or mobile app, even if it involves several backend components). None of this affects whether the development is routine, since it has all been around for decades, but it does involve separate knowledge bases that need to be learned.

All of these routine systems rely primarily on edit loops:

http://theprogrammersparadox.blogspot.ca/2010/03/edit-loop.html

They move the data back and forth between a series of widgets and a database of some form. Most import external data through some form of stable ETL mapping, and export data out via well-defined data formats. Some have deeper interactive presentations.

Any system that does not get more complicated than this is routine, in that we have been building these now for nearly three decades, and the problems involved are well-defined and understood.

There are at least three sub-problems that will cause the development to no longer be routine, although in most cases these only affect a portion of the system, not the whole. They are:

  • Dynamic data-models
  • Scale larger than medium
  • Complex algorithmic requirements

Dynamic Data

A dynamic data model means that the underlying structure of the data can and will change all of the time. Users may enter one structure one day, then something substantially different the next, yet these will still be the same data entity. The reason this occurs is that the domain is purposely shifting, often to its own advantage. Obviously, you can’t statically encode the entire space of possible changes, because that would involve knowing the future.

Dealing with dynamic data models means pushing the problem back to the users. That is, you give them some really convenient means of keeping up with the changes, like a DSL or a complex GUI, so that they can adapt quickly. That may seem easy, but the problem leaks into both the interface and the persistence. That is, you need some form of dynamic interface that adapts to the changing collection and reporting needs, and you need this whole mess to be dynamically held in a persistence technology. The trick is to be able to write code that has almost no knowledge of the data that it is handling. The breadth and abstract nature of the problem are what make it tricky to implement correctly; it is very rare to see it done well.
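
To make this a little more concrete, here is a minimal sketch in Java of code that manipulates data it knows almost nothing about. The names here (DynamicEntity, FieldDef and the sample fields) are entirely hypothetical; the point is only that the structure lives in a schema that can change at runtime, not in the code:

  import java.util.*;

  // Hypothetical sketch: the structure of the data is described by a schema that
  // the users can change, so the code below never hardcodes any field names.
  final class FieldDef {
      final String name;
      final boolean mandatory;
      FieldDef(String name, boolean mandatory) { this.name = name; this.mandatory = mandatory; }
  }

  final class DynamicEntity {
      final String type;                                   // e.g. "trade", "client", ...
      final Map<String, Object> values = new HashMap<>();  // contents are not known at compile time

      DynamicEntity(String type) { this.type = type; }

      // Generic validation: the only knowledge used is the schema supplied at runtime.
      List<String> missingMandatory(List<FieldDef> schema) {
          List<String> missing = new ArrayList<>();
          for (FieldDef f : schema)
              if (f.mandatory && values.get(f.name) == null)
                  missing.add(f.name);
          return missing;
      }
  }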

Scaling

Once the required scale exceeds the hardware capabilities, the system needs to be decomposed into pieces that can execute independently instead of all together. This sizing problem continually shifts because the hardware is evolving quickly, but there is always some threshold where it becomes the primary technical problem. In a simple system, if the decomposition leads to a set of independent pieces, the problem is only mildly painful. Each piece is pushed out onto its own hardware. Sometimes this is structural, such as splitting the backend server into a web server and a database server. Sometimes it can be partitioned, such as sticking a load balancer in front of replicated servers.

If the data and code aren’t independent then very complex synchronization algorithms are needed, many of which are cutting edge computer science right now.

Software solutions for scale also exist in the many forms of memoization, such as caching or paging. With caching, however, adding the ability to ‘reuse’ data or sub-calculations also means being able to precisely understand and scope the lifespan of that data; failing to do this makes it easy to accidentally rely on stale data.
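
As a rough illustration of scoping that lifespan, here is a tiny, hypothetical time-to-live cache in Java; the class and parameter names are invented for the sketch, and a real cache would also need size limits and thread safety:

  import java.util.*;

  // Every cached value carries an explicit scope -- an expiry time -- so stale
  // data cannot silently be reused forever.
  final class TtlCache<K, V> {
      private static final class Entry<V> {
          final V value;
          final long expiresAt;
          Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
      }

      private final Map<K, Entry<V>> entries = new HashMap<>();
      private final long ttlMillis;

      TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

      void put(K key, V value) {
          entries.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
      }

      // Returns null when the value is missing or has outlived its scope.
      V get(K key) {
          Entry<V> e = entries.get(key);
          if (e == null || e.expiresAt < System.currentTimeMillis()) {
              entries.remove(key);
              return null;
          }
          return e.value;
      }
  }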

Most scaling solutions in and of themselves are not complex, but when multiple ones exist together their interactions can be extraordinarily complicated. As the necessity of scale grows, the need to bypass more bottlenecks means significant jumps in this complexity and an increased risk of sending the performance backward. This complex interaction makes scaling one of the most difficult problems, and because we don’t have a commonly used toolkit for forecasting the behavior, much of the work is based on intuition or trial and error.

Algorithms

Most derived data is rather straightforward, but occasionally people are looking for subtle relationships within the structure of the data. In this way, there is a need for very complex algorithms, the worst of which is AI (since if we had that, it could find the others). Very difficult algorithms are always a challenge, but at least they are independent of the rest of the system. That is, in most applications, they usually only account for a small percentage of the functionality, say 10% to 20%. The rest of the system is really just a routine wrapper that is necessary to collect or aggregate the necessary data to feed them. In that way, these algorithms can be encapsulated into an ‘engine’ that is usable by a routine system, and so the independence is preserved.

For some really deep ‘systems’ programming problems like operating systems, compilers or databases, the state-of-the-art algorithms have advanced significantly, and require significant research to understand. Most have some routine core that intermingles with the other two problems. What often separates systems programming from applications programming is that ignoring what is already known, and crudely reinventing it, is far more likely to produce something defective. It’s best to either do the homework first or use code from someone who has already done the prerequisite research.

Sometimes semi-systems programming approaches, such as locking, threading or real-time(ish) behavior, blend into routine application development. These are really a mix of scaling and algorithmic issues, and should only be used in a routine system if the default performance is unacceptable. If they are used, significant research is required to use them properly, and some thought should be given as to how to properly encapsulate them so that future changes don’t turn ugly. It is quite common to see poor implementations that actually degrade performance, instead of helping it, or that cause strange, infrequent bugs that go for years without being found.

Finale

Now, of course, all three of these problems can be required for the same system at the same time, and it can be necessary to run this within a difficult environment, such as one that must be fault tolerant. There are plenty of examples of our constructing such beasts and trying to tame them. At the same time, there are way more examples of mostly routine systems that share few of these problems. In that latter category there are also examples of people approaching a routine system as if it were an extraordinarily complex one, and in those cases, their attempted solutions are unnecessarily expensive or unstable or even doomed. Understanding what work is actually routine thus means knowing how to go about doing that work, so that it is most likely to succeed and be useful in the future.

What we should do as an industry is to produce better recipes and training for building routine systems. In an organized development environment, this type of work should proceed in a smooth and easily estimated fashion. For more advanced systems, the three tricky areas can often be abstracted away from the base, since they can be harder to estimate and are considerably riskier. Given all these constraints, and strong analysis and design, there is no reason why most modern software development projects should be so chaotic. They can, and should be, under much better control.

Friday, December 4, 2015

Requirements and Specifications

Programmers often complain about scope creep. 

The underlying cause is likely that their project has been caught up in an endless cycle of requirements and specifications that bounce all over the place, which is extraordinarily expensive and frustrating for everyone.

The inputs to programming are design specifications, which are created from requirements gathered during analysis. There are many other ways to describe these two stages and their outputs, but ultimately they all boil down to the same underlying information. Failures in one or both of these earlier stages are really obvious during the coding stage. If the details aren’t ever locked down, and everything is interconnected, then frequent erratic changes mean that a lot of work gets wasted. In that sense, scope creep isn’t a real programming problem, but rather a process one.

Quite obviously, scope creep wouldn’t happen if the specification for the system were 100% complete. The programmers would just code exactly what is needed -- once -- and then proceed to polish the work with testing. The irony is that the work of specifying a system to 100% is actually the work of writing the system itself. That is, if you made the effort to ensure that no detail was vague or left unspecified, then you could write another program to turn that specification directly into the required program.

A slight variation on this idea was actually floated a long time ago by Jack W Reeves in “What is Software Design?” but never went mainstream:

http://www.developerdotstar.com/mag/articles/reeves_design_main.html

Of course, time and a growing understanding of the solution generally mean that any original ideas for a piece of software will always require some fiddling. But it is obviously cheaper and far more efficient to work out these changes on a smaller scale first -- on paper -- before moving on to committing to the slow, detailed work of writing and testing the code. Thus, it is a very good practice to create short, high-level specifications to re-arrange the details, long before slogging through the real effort.

As mentioned, a good specification is the by-product of the two earlier stages, analysis and design. The first stage is the collection of all details that are necessary to solve the problem. The second is to mix that analysis with technology and operational requirements, in order to flesh out an architecture that organizes the details. Scope creep is most often caused by a failure of analysis. The details were never collected, or they weren’t vetted, or an aspect of the domain that is dynamic was treated statically. Any one of these three problems will result in significant changes, and any one of them in a disorganized environment will set off a change cyclone.

There is also a rather interesting fourth problem: the details were collected, but not in a way that was stable.

The traditional approach to requirements is to craft a sentence that explains some need of the user. By definition, this expresses a winding ‘path’ through the underlying data and computations, while often being targeted at only a special case. People find it easier to isolate their needs this way, but it is actually problematic. If the specification for a large system is composed of a really large collection of these path-based requirements, then it resembles a sort of cross-hatched attempt to fill in the details of the system, not unlike the scratchings of a toddler in a coloring book. But the details are really a ‘territory’, in that they form a finite set of data, broken down by modeled entities, with higher-level functionality coming from computations and derived data. It is also packed with some navigational aids and a visual aesthetic.

A good system is complete, in the sense that it manages all of the data correctly and provides a complete set of tools to do any sort of display or manipulation necessary. It is a bounded territory that needs to be filled in. Nicely. Describing this with an erratic set of cross-hatched paths is obviously confusing, and prone to error. If the programmers fixate on the wrong subset of paths, necessary parts of the system fall through the cracks. Then when they are noticed, things have to change to fill those gaps. Overlaps likewise cause problems by driving the creation of redundancies, which eventually lose synchronization with each other.

A very simple example of this confusion happened a while back when a user told an analyst that he needed to ‘add’ some new data. The analyst took that path ‘literally’ and set it down as a requirement, which he turned into a screen specification. The programmer took that screen literally and set it down as code. A little time passed and the user made a typo that he only noticed after he had already saved the data. He went to edit the data, but… The system could ‘add’ new data, however, it lacked any ability to ‘edit’ it, or ‘delete’ it, because these were not explicitly specified by the user. That’s pretty silly because what the user meant by ‘add’ was really ‘manage’, and that implies that the three bits of functionality -- add, edit and delete -- are all available. They are a ‘unit’; they only make sense together.

If, instead of focusing on the literal words of the user, the analyst had understood that the system itself was going to be the master repository for this newly collected entity, then it would have been more than obvious what functionality was necessary. The work to create the requirements and the screen was superfluous and already well-defined by the existing territorial boundaries (the screen didn’t even match the existing interface conventions). A single new requirement to properly manage a new data entity was all that should have been necessary. Nothing more. The specification would then be completely derived from this and the existing conventions, either explicitly by an interface designer or implicitly by the programmer who would need to look up the current screen conventions in the code (and hopefully reuse most of it).

It is important to understand that territorial requirements are a lot less work, as well as being less vague. You need only list out the data, the computations, and for the interface: the navigation. In some cases, you might also have to list out hard outputs like specific reports (because there is limited flexibility in how they appear or their digital formats). With this information, plus the performance and operational requirements, the designers can go about finding efficient ways to organize and lay out the details for the programmers.

While the boundary described by the requirements needs to be reasonably close to 100% (although it can be abstract), the actual depth of the specifications is entirely dependent on the abilities of the programming teams. Younger, less experienced programmers need more depth in the specifications to prevent them from going rogue. Battle-scarred seniors might only need the requirements themselves. Keep in mind that unwanted ‘creativity’ makes for a hideous interface, convoluted navigation, and brutal operational problems, as well as being a huge resource drain. A programmer that creates a whole new unique sub-system within an existing one is really just pouring fuel on the fire. It will only annoy the users and waste resources, even if it initially looks like it will be faster to code. The resulting disorganization is deadly, so it's best to not let it take hold. A programmer that goes rogue when there is an existing specification is far easier to manage than if there is nothing. Thus specifications are often vital to keep a group of programmers working nicely together. To keep them all building one integrated system, instead of just a pile of disconnected code.

The two initial stages can be described in many different ways, but they are best understood as decomposition and recomposition. That is, analysis is decomposing the problem into its underlying details. The most efficient way of doing this ensures that the parts of the territory are not overlapping, or just expressing the same things in different ways. Recomposition is the opposite. All of the pieces are put back together again, as a design, that ensures that the minimal amount of effort is needed to organize and complete the work. Stated that way, it is obvious that effective designs will heavily leverage reuse because it will take the least amount of overall work. Massive redundancies introduced via brute force will prevent entanglement, but they do so by trading it for significant future problems. For any non-trivial system, that rapidly becomes the dominant roadblock.

An unfortunate cultural problem in programming is to continually push all decisions back to the users. Many programmers feel that it is not their responsibility to interpret or correct the incoming requirements and specifications. Somehow the idea that the user champions can correctly visualize the elements of a large system has become rooted. They certainly do know what they want the program to do, but they know this as a massive collection of different, independent path requirements, and often that set in their head isn’t fully resolved or complete and might even be contradictory. Solutions are indeed built for the users, but the work needs to progress reasonably. Building off a territory means the development teams can appropriately arrange the pieces to get constructed in the most efficient manner. Building off a stream of paths most often means that each is handled independently, at a huge work multiplier, and no organization can be applied.

In that sense, giving control of the development process directly to a user champion will not result in anything close to efficient use of resources, rather the incoming chaos percolates throughout the code base. There might be some rare champion that does have the abilities to visualize the territorial aspects of the requirements, but even then the specifications still need to be created.

Analysis and design are different, although related, skill sets that need to exist and can likely be independently measured. For example, if there is significant scope creep, the analysis is failing. If there are plenty of integration problems, it is the specification. The first is that the necessary details were never known, while the second is that they were never organized well enough that the independent tasks were synchronized. In fact, categorizing bugs and using them to identify and fix overall process problems is the best way to capitalize on testing and operational feedback. The code needs to be fixed, but the process is often weak as well.

In a very real sense, it is entirely possible to walk backward from a bug, to the code, to the specifications and then to the requirements, to see if the flow of work has serious problems. There will always be minor hiccups, but in really badly run projects you see rather consistent patterns, such as massive redundancies. These can always be unwound by ensuring that the different stages of development fit together properly. Specifications, or the lack of them, sit in the middle, so they provide an early indicator of success.

It's also worth noting that some projects are small enough or straightforward enough that they don’t really need to write out the specifications. The requirements should be recorded, but more as a means for knowing the direction that is driven by the users. If organization exists and the new requirements are just filling in familiar territory, then the code itself is enough to specify the next round of work. That’s why it is not uncommon on medium-sized programs to see senior developers jump straight from conversations with the users to actual working code. Every detail that is necessary is already well-known, so given the lack of resources, the documentation portion is never done. That does work well when the developer is an organized, methodical person, and if they are ever replaced it is by someone that can actually read code (the only existing specification), but it fails really badly if those two necessary conditions don’t exist. Some code handovers go smoothly, some are major disasters.

Sometimes people use shifting territories as a reason to avoid analysis and specification. That is, because the territory itself isn’t even locked down, everything should supposedly be ad hoc, or experimental. This is most common with startups that are likely to pivot at some point. The fallacy with this is that the pivots most often do not shift entirely away from the starting territory. That is, a significant aspect of the business changed, but not the technologies nor the required base infrastructure. And in most cases the shift itself doesn’t remove data entities; it just adds new ones that are higher in priority. So, a great deal of the technical base is still intact. If it wasn’t, then the only rational thing to do would be to drop 100% of the previous effort, but that necessity is actually quite rare. In that sense, territories expand and contract throughout the life of any development. Seeing and adjusting to that is important in effectively using the available resources, but it is an issue that is independent of analysis and design. No matter how the territories change, they still need to be decomposed, organized and then recomposed in order to move forward. The work is always constant, even if it is sporadically spread across the effort.

Building anything practical is always a byproduct of some form of analysis and design. As the scale of the attempt increases, the need for rigor and organization becomes increasingly bound to quality. If we set our sights on creating sophisticated software solutions that really make life easy for everyone, we need to understand how to properly set up these prerequisites to ensure that this happens as desired.

Sunday, November 22, 2015

Containers, Collections and Null

A variable in a computer program holds exactly one member of a finite set of possibilities. This datum symbolically represents something in the physical world, although it is also possible that it represents something from an imaginary reality.

The usefulness of a variable is that it can be displayed at the appropriate moment to allow a person, or something mechanical, to make a decision. Beyond that, it is just being saved under the belief that it will be used for this purpose some day or that it will be an input into some other variable’s moment.

A variable can also have an associated null flag. It is not actually ‘part’ of that variable, but rather an independent boolean variable that sheds light on the original contents. A friend of mine used to refer to null values as ‘out-of-band’ signals, meaning that they sit outside of the set of possible values. This is important in that they cannot be confused with another member of the set -- what he used to call ‘in-band’ signals, or often ‘magic numbers’.
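
A small Java example of the distinction (the Reading class and its fields are made up for the sketch): the in-band magic number can quietly participate in calculations as if it were real data, while the out-of-band signal cannot be confused with a value:

  import java.util.OptionalInt;

  class Reading {
      int temperatureMagic = -999;                    // in-band: a 'magic number' meaning "not collected"
      OptionalInt temperature = OptionalInt.empty();  // out-of-band: absence is carried separately

      static double averageMagic(Reading[] rs) {
          double sum = 0;
          for (Reading r : rs) sum += r.temperatureMagic;  // oops: -999 quietly skews the average
          return sum / rs.length;
      }

      static double average(Reading[] rs) {
          double sum = 0;
          int n = 0;
          for (Reading r : rs)
              if (r.temperature.isPresent()) { sum += r.temperature.getAsInt(); n++; }
          return n == 0 ? 0 : sum / n;                 // missing values cannot be mistaken for data
      }
  }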

There is generally a lot of confusion surrounding the use of nulls. People ascribe all sorts of program-specific meanings to them, which act as implicit data in their program. In this way, they overload the out-of-band signal to represent a custom state relative to their own data. This is a bad idea. Overloading any meaning in a variable guarantees some sort of future problem since it is subjective and easily forgotten. It is far better to ascribe a tighter definition.

A variable is collected or it is not; computers are perfectly binary in that regard. If it is possible for a variable to not be collected, the only reason for this is that the variable is ‘optional’; otherwise the program will not continue with its processing. Obviously bad data should not be allowed to propagate throughout a system or get stored in the database. If there are different, and useful, excuses for not having specified a variable right away, then these are independent variables themselves. With that in mind, a null only ever means that the variable is optional and was not collected; nothing else should be inferred from it.

Then for the sake of modeling, for all variables, one only needs to know if the data is mandatory -- it must be collected -- or it is nullable. Getting a null in a mandatory field is obviously an error, which needs to invoke some error handling. Getting a null in an optional field is fine. If a variable is nullable ‘anywhere’ in the system then it should be nullable ‘everywhere’ in the system (it is part of the data model). If there is functionality dependent on an optional variable, then it should only ever execute when that variable is present, but should not be an error if it is missing. If the underlying data is optional, then any derived data or functionality built on top of it is optional as well. Optionality propagates upwards.
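
A minimal Java sketch of these rules, using a hypothetical Invoice entity: a missing mandatory value triggers error handling, an optional value stays optional, and anything derived from it is optional as well:

  import java.util.Optional;

  final class Invoice {
      final double amount;              // mandatory: must be collected
      final Optional<Double> discount;  // optional: may simply not have been collected

      Invoice(Double amount, Double discount) {
          if (amount == null)
              throw new IllegalArgumentException("amount is mandatory");  // error handling, not a silent default
          this.amount = amount;
          this.discount = Optional.ofNullable(discount);
      }

      // Derived data built on top of an optional value is itself optional.
      Optional<Double> discountedAmount() {
          return discount.map(d -> amount * (1.0 - d));
      }
  }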

With that perspective, handling nulls for a variable is easy and consistent. If it doesn’t make sense that the data could be missing then don’t make it nullable. It also helps to understand where to reinforce the data model if the data is a bit trickier than normal. For instance, partially collected data is ‘bad’ data until all of the mandatory values have been specified. So it should be partitioned or marked appropriately until it is ready.

The strength of a computer is not that it can remember a single variable, but rather that it can remember a huge number of them. And they can be interrelated. What really adds to the knowledge is the structure of these cross-variable relationships. There can be a huge number of instances of the same variable and/or there can be a huge number of relationships to other types of variables.

The larger and more complex the structure, the more information that can be gleaned out of this collected data. The history of software has always been about trying to cope with increasingly larger and more complex data structures even if that hasn’t been explicitly stated as the goal. In a sense, we don’t just want data, we want a whole model of something (often including a time dimension) that we can use to make deeper, more precise decisions.

The complex structural relationships between variables are represented within a program as ‘containers’. A simple container, say a structure in C or a simple object in Java, is just a static group of related variables. These can be moved throughout the program as if they were a single variable, ensuring that they are all appropriately kept together. In this arrangement, the name of the variable in many programming languages is a compile-time attribute. That is, the programmer refers to the name when writing the code, but no such name exists in the system at runtime. Some languages provide introspection, allowing the programmer to retrieve their name and use it at runtime.
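
For example, in Java the field names below are normally just compile-time attributes, but reflection (its form of introspection) makes them available at runtime as well; the Address type and its fields are made up for this sketch:

  import java.lang.reflect.Field;

  class Address {
      String street = "10 Main St";
      String city = "Toronto";

      public static void main(String[] args) throws IllegalAccessException {
          Address a = new Address();
          for (Field f : Address.class.getDeclaredFields())
              System.out.println(f.getName() + " = " + f.get(a));  // the names exist at runtime too
      }
  }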

Null handling for simple containers is similar to null handling for individual variables. Null means that ‘all’ of the data in the structure is optional. In some languages, however, there can be confusion on how to handle mandatory data. With a simple variable, the language itself can be used to insist that it always exists, but with a pointer or reference to a container that check needs to be explicitly in the code. It can come in the form of asserts or some explicit branches that jump to error handling. The code should not continue with functionality that relies on mandatory data if it is missing.

The next level of sophistication allows for a runtime based ‘named’ variable. That is, both the name and the variable contents are passed around as variables themselves, together in a container. Frequently this container is implemented as a hashtable (sometimes called a dictionary), although in some cases ‘order’ is required so there can also be an underlying linked-list. This is quite useful for reusing code across similar data, with only some minor processing attached to a few specific variables. It mostly leaves the bulk of the code to manipulate the data without having to really understand it, making it strongly reusable. This works well for communications, printing, displaying stuff in widgets, etc. -- any part of the code whose processing isn’t dependent on the explicit meaning of the underlying data, although sometimes there need to be categories (often data types) of behavior. Learning to utilize this paradigm can cut out a huge number of redundant lines of code in most common programs.

Null handling for containers of named variables is slightly different, in that the absence of a particular named pair is identical to the name existing with a null value. Given this overlap, it is usually best to not add empty, optional data into the container. This is also reflected symmetrically by not passing along values without a name either. This type of structuring means that processing such a container is simplified in that each pair is either mandatory, or it is optionally included. If a pair is missing, then it was optional. To enforce mandatory variables, again there needs to be some trivial code that interrupts processing.
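
Here is a rough Java sketch of a container of named variables handled this way: optional values that were not collected are simply never added, and mandatory names are enforced with a trivial check. The field names ('name', 'phone') are invented for illustration:

  import java.util.*;

  final class NamedData {
      static void put(Map<String, Object> data, String name, Object value) {
          if (value != null) data.put(name, value);    // don't store empty, optional pairs
      }

      static void requireAll(Map<String, Object> data, String... mandatory) {
          for (String name : mandatory)
              if (!data.containsKey(name))
                  throw new IllegalStateException("missing mandatory value: " + name);
      }

      // Generic, reusable processing that needs no knowledge of the meaning of the data.
      static void print(Map<String, Object> data) {
          for (Map.Entry<String, Object> e : data.entrySet())
              System.out.println(e.getKey() + ": " + e.getValue());
      }

      public static void main(String[] args) {
          Map<String, Object> customer = new LinkedHashMap<>();
          put(customer, "name", "Alice");
          put(customer, "phone", null);    // optional and not collected: nothing is stored
          requireAll(customer, "name");    // mandatory enforcement is a trivial interruption
          print(customer);
      }
  }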

Containers get more interesting when they allow multiple instances of the same data, such as an array, list or tree. Large groups of collected data shed a brighter light on the behavior of each individual datum, thus providing deeper details. These ‘collections’ can be ordered or unordered, although the latter is really a figment of the programmer’s imagination, in that everything on a computer has an intrinsic order; it is just sometimes ignored.

Ordering can be based on any variables within the data, although it is often misunderstood; the classic case is a relational database returning rows in a default order that comes from the construction of the primary index, leading to the extremely common bug of a SELECT statement’s order changing unexpectedly when the tables grow large or rows are deleted. Programmers don’t see this potentially chaotic reordering when testing with small datasets, so they make bad assumptions about what really caused the visible ordering.
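
The defensive habit, sketched below with JDBC against a hypothetical 'users' table, is to make the ordering explicit rather than relying on whatever order the indexes happen to produce:

  import java.sql.*;

  final class OrderedQuery {
      // "SELECT name FROM users" alone appears ordered on small test data, but that
      // order is an accident of the indexes, not a guarantee.
      static void printUsers(Connection conn) throws SQLException {
          String sql = "SELECT name FROM users ORDER BY name";  // the ordering is now part of the contract
          try (Statement st = conn.createStatement();
               ResultSet rs = st.executeQuery(sql)) {
              while (rs.next())
                  System.out.println(rs.getString("name"));
          }
      }
  }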

One frequent confusion that occurs is with null handling for collections, in that an empty collection or a null reference can be interpreted as the same thing. In most cases, it is really optional data that has not been collected, so it doesn’t make sense to support this redundancy. It is more appropriate to handle the ‘zero’ items condition as an optional collection, and the null reference itself as a programming error. This is supported quite elegantly by having any code that returns a collection always allocate an empty one. This can then be mirrored by any function that needs a collection as well, in that it can assume that null will not be passed in, just empty collections. This reduces bugs caused by inconsistent collection handling, but it does mean that every branch of any new code should be executed at least once in testing to catch any unwanted nulls. It doesn’t, however, mean that every permutation of the logic needs to be tested, just the empty case, so the minimum test cases are fairly small. This isn’t extra work in that the minimum reasonable practice for any testing is always that no ‘untested’ lines of code should ever be deployed. That’s just asking for trouble, and if it happens it is a process problem, not a coding one.
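
A small Java sketch of this normalized handling (the repository and method names are invented): collection-returning code always hands back an empty collection, so callers never need a null check:

  import java.util.*;

  final class OrderRepository {
      private final Map<String, List<String>> ordersByCustomer = new HashMap<>();

      // Never returns null; "no orders" is just the empty collection.
      List<String> findOrders(String customerId) {
          return ordersByCustomer.getOrDefault(customerId, Collections.emptyList());
      }
  }

  final class Report {
      // The empty case falls out naturally, with no special branch for null.
      static int countOrders(OrderRepository repo, String customerId) {
          return repo.findOrders(customerId).size();
      }
  }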

This null handling philosophy prevents the messy and time-wasting inefficiencies of never knowing which direction an underlying programmer is going to choose. We’ll call it ‘normalized collection handling’. It does, however, require wrapping any questionable, underlying calls from other code just in case it doesn’t behave this way, but wrapping all third-party library calls was always considered a strong best practice, right up until it was forgotten.

Some programmers may not like it because they believe that it is more resource-intensive. Passing around a lot of empty collections will definitely use extra memory. Getting the count out of a collection is probably more CPU than just checking for a null (but less if you have to do both which is usually the case). This is true, but because an empty collection eats up the base management costs, it also means that the resource usage during processing, at least from the lowest level, is considerably more consistent. Programs that fluctuate with large resource usage extremes are far more volatile, which makes them far more difficult for operations. That is, if whenever you run a program, its total resource usage is predictable and relative to load, then it becomes really easy to allocate the right hardware. If it swings to extremes, it becomes far more likely to exceed its allocation. Stable-resource systems are way easier to manage. A little extra memory then is a fair price to pay for simpler and more stable code.

We can generalize all of the above by realizing that null handling differs between static and dynamic variables. Adding nulls is extra work in the static case while enforcing mandatory requirements is extra work in the dynamic case. Static data is easier to code, but dynamic data is both flexible and more reusable. In that sense, if we are going to just hardcode a great deal of static data, it is best if the default is mandatory and optional data is the least frequent special case. The exact opposite is true with dynamic data. Most things should be optional, to avoid filling the code with mandatory checks. This behavioral flip flop causes a great deal of confusion because people want a one-size-fits-all approach.

A nice attribute about this whole perspective of data is that it simplifies lots of stuff. We only have containers of unnamed or named variables. Some containers are collections. This is actually an old philosophy, in that it shows up in languages like AWK that have ‘associative arrays’, where the array index can be an integer or a string. The former is a traditional array, while the latter is a hashtable, but they are treated identically. In fact, in AWK, I think it just cheats and makes everything a hashtable, converting any other data type to a string key. This makes it a rather nice example of an abstraction smoothing away the special cases, for the convenience of both the users of the language and the original programmers.

We have to, of course, extend this to allow for containers of a mix between variables and sub-containers, and do that in a fully recursive manner so that a variable can be a reference to a container or collection itself. This does open the door to having cycles, but in most cases, they are trivial to detect and easily managed. Going a little farther, the ‘name’ of the variable or key of the hashtable itself can be a recursive container of structures as well, although eventually, it needs to resolve down to something that is both comparable and will generate a constant hashing code (it can't change over time). Mixed all together, these techniques give us a fully expressible means of symbolically representing any model in spacetime and thus can be increasingly used to make even more complex decisions.
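
A tiny Java sketch of this recursive arrangement, using an entirely made-up 'portfolio' structure: containers can hold plain variables, collections, or further containers, to any depth, under the same simple rules:

  import java.util.*;

  final class RecursiveContainers {
      public static void main(String[] args) {
          Map<String, Object> position = new LinkedHashMap<>();
          position.put("symbol", "ABC");
          position.put("quantity", 100);

          List<Object> positions = new ArrayList<>();
          positions.add(position);                 // a collection of containers

          Map<String, Object> portfolio = new LinkedHashMap<>();
          portfolio.put("owner", "Alice");
          portfolio.put("positions", positions);   // a container holding a collection

          System.out.println(portfolio);           // arbitrary depth, one set of rules
      }
  }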

We should try not to get lost in the recursive nature of what we need to express, and it is paradigms like ADTs and Object-Oriented programming that are really helpful in this regard. A sophisticated program will have an extraordinarily complex data model with many, many different depths to it, but we can assemble and work on these one container at a time. That is, if the model is a list of queues of trees with stacks and matrices scattered about, we don’t have to understand the entire structure as a ‘whole’ in order to build and maintain the code. We can decompose it into its underlying components, data structures and/or objects, and ensure that each one works as expected. We can then work on, and test, each sub-relationship again independently. If we know all the structural relationships are correct, and we know the underlying handling is correct, then we can infer that the overall model is correct.

While that satisfies the ability to implement the code, it says nothing about how we might arrive at a complex model for a real-world solution. If there is art left in programming, much of it comes directly from this issue. What’s most obvious is that the construction of sophisticated models needs to be prototyped long before the time is spent to implement them, so that the model can be quickly rearranged as needed. The other point is that because of the intrinsic complexity, this type of modeling needs to be built from the bottom-up, while still being able to understand the top-down imposed context. Whiteboards, creativity and a lot of discussions seem to be the best tools for arriving at decent results. Realistically this is actually more of an analysis problem than a coding one. The structure of the data drives the behavior; the code just needs to be aware of what it is supposed to be, and handle it in a consistent manner.

Some programmers would rather avoid this whole brand of complexity and stick to masses of independent static variables, but it is inevitable that larger, more complex, structural relationships will become increasingly common as the users’ requirements for their software gradually get more sophisticated. A good example of this is the non-trivial data structures that underpin spreadsheets. They have only gained in popularity since they were invented. Their complexity unleashed power-users to solve programmatic issues for themselves that were unimaginable before this technology existed. That higher level of sophistication is sorely needed for many other domain-based problems, but currently, they are too often encoded statically. We’ve been making very painfully slow progress in this area, but it is progress and it will continue as users increasingly learn what is really possible with software.

Thinking of the possible variable arrangements as structural relationships between containers makes it fairly straightforward to understand how to represent increasingly complex real-world data. With tighter definitions, we can avoid the disorganization brought on by vagueness, which will also cut down on bugs. Software is always founded on really simple concepts, the definition of a Turing machine for example, but we have built on top of this a lot of artificial complexity due to needless variations. Being a little stricter with our understanding allows us to return to minimal complexity, thus letting us focus our efforts on achieving both sophistication and predictability. If we want better software, we have to detach from the sloppiness of the past.

Sunday, November 8, 2015

Software Engineering

The standard definition for engineering from Wikipedia is:

Engineering is the application of mathematics, empirical evidence and scientific, economic, social, and practical knowledge in order to invent, design, build, maintain, research, and improve structures, machines, tools, systems, components, materials, and processes.

But I like to simplify it into just two rules:

  1. Understanding something really deeply.
  2. Utilizing that knowledge to build stuff with predictable behavior.

With the added caveat that by ‘predictable behavior’ I mean that it is predictable for ‘all’ possible circumstances in the real world. That’s not to say that it must withstand everything that could be thrown at it, but rather that given something unexpected, the behavior is not. There is no need to guess, it can be predicted and will do exactly as expected. Obviously, there is a strong tie between the actual depth of the knowledge and the accuracy of these predictions.

The tricky part about engineering good software is acquiring enough deep knowledge. Although the existing underlying software is deterministic and explicitly built by people, it has been expanding so rapidly over the last five decades that it has become exceptionally convoluted. Each individual technology has blurry conventions and lots of quirky behavior. It becomes difficult to both properly utilize the technology and mitigate it under adverse conditions. Getting something to ‘sort of’ work is not too difficult, but getting it to behave reliably is extraordinarily complex and time-consuming.

For example, if you wanted to really utilize a relational database, that would require a good understanding of the set-theoretical nature of SQL, normalization, query plans, implicit queries, triggers, stored procedures, foreign keys, constraints, transactions, vendor-specific event handling and how to combine all of these together effectively for models that often exceed the obvious expressiveness. When used appropriately, a relational database is a strong persistence foundation for most systems. Inappropriate usage, however, makes it awkward, time-consuming and prone to unsolvable bugs. The same technology that can nearly trivialize routine systems can also turn them into hopeless tangles of unmanageable complexity. The difference is all about the depth of understanding, not the virtues of the actual software.

A big obstacle in acquiring deep knowledge is the lack of authoritative references. Someone could write a book that would explain in precise detail how to effectively utilize a relational database for standard programming issues, but culturally we don’t get that specific because it would be discounted due to subjective objections. That is, anyone with even a small variation of opinion on any subset of the proposed approach would discount the entire approach as invalid, thus preventing it from becoming common. In addition, creativity is valued so highly that most programmers would strongly prefer to rediscover how to use a relational database, over decades, than just adopt the pre-existing knowledge. That is unfortunate because there are more interesting problems to solve if you get past these initial ones.

To get to actual engineering we would have to be able to recognize the routine parts of our problems, and then solve them with standardized components whose ‘full’ behavior is well documented. This would obviously be a lot easier if we had a reliable means of categorizing that behavior. Thus we would not need to consume massive resources experimentally determining what happens if we knew that a technology was certified as ‘type X’ for instance. In that sense, the details need to be encapsulated, but all behavioral changes, such as possible errors, need to be explicitly documented and to strictly follow some standard convention. If we can achieve this, then we have components which can be used and if a programmer sticks with a collection of them from a limited set of categories, they can actually have a full and deep understanding of how they will affect the system. That depth will give us the ability to combine our work on top in a manner that is also predictable. Without -- of course -- having to deeply understand all of the possible conventions currently out there or even the full depth of the underlying technology.

What prevents us from actual software engineering is our own cultural evolution. We pride ourselves on not achieving any significant depth of knowledge, but rather just jumping in and flailing at crude solutions. Not standardizing what we build works in favor of both the programmers and the vendors. The former are in love with the delusion of creativity, while the latter deem it a means to lock in clients. There is also a persistent fear that any lack of perceived freedom will render the job of programming boring. This is rather odd, and clearly self-destructive, since continuously re-writing ‘similar’ code gradually loses its glamour, resulting in a significant shortening of one’s career. It’s fun and ego-fulfilling the first couple of times, but it eventually gets frustrating. Solving the same simple problems over and over again is not the same as really solving challenging problems. We do the first while claiming we are really doing the second.

There are many little isolated pockets of software engineering in existence right now; I’ve worked in a few of them in the past. What really stands out is that they are considerably less stressful, more productive and you feel really proud of the work. Slapping together crud in a hurry is the exact opposite; some crazy deadline gets met, but that’s it. Still, the bulk of the nearly 18M programmers on the planet right now are explicitly oriented towards just pounding stuff out. And of the probably trillions of lines of code that are implicitly relied on by any non-trivial system, more and more of it is utilizing less and less knowledge. It is entirely possible to create well-engineered software, and it is possible to achieve decent quality in a reasonable amount of time, but we are slipping ever farther away from that goal.

At some point, software engineering will become mandated. As software ‘eats the world’ the unpredictability of what we are currently creating will become increasingly hazardous. Eventually, this will no longer be tolerated. Given its inevitability, it would be far better if we voluntarily refactored our profession instead of having it forced on us by outsiders. A gentle realignment of our culture would be less of a setback than a Spanish-style inquisition. It’s pretty clear from recent events that we are running out of time, and it’s rather obvious that this needs to be a grassroots movement. We can actually engineer software, but it just isn’t happening often right now and it certainly isn’t a popular trend.

Sunday, October 25, 2015

Organization

Disorganization is extremely dangerous for a software development project. No matter what code quality is delivered by the programmers, if it is just dumped into a ball of mud it will grow increasingly unstable. Software rusts quite quickly and cannot be maintained if the costs to read and modify it are unreasonable. Disorganization significantly amplifies those costs.

Organization applies both to the output, such as the data and code, and to the processes used for all five stages of software development. If the right work is not being done at the right time, the resulting inefficiencies sap the time and morale of the project, causing the work to falter. Once things have become messy, progress gets painfully slow. So it not only applies to the end, but to the means as well.

What is organization? It's a fairly simple term to claim, but a rather tricky one to define precisely. My favorite definition is the 17th (or 18th) century proverb "a place for everything and everything in its place" but that actually turns out to be a rather incomplete definition. It works as a start, but needs a little bit of refining:

  1. A place for everything.
  2. Everything in its place.
  3. A similarity to everything in the same place.

While this definition may seem to only revolve around the physical world, it does actually work for the digital one as well.

'Everything' for a computer is code and data. There is actually nothing else, just those two although there are plenty of variations on both.

A 'place' is where you need to go to find some code or data. It's an absolute location e.g. the source code is located on machine X in the folder Y. The binary is located on machine P in the folder Q. These are both rather 'physical' places, even if they are in the digital realm.

With the first two rules it is clear that to be organized you need to define a rather large number of places and categorize all data and code into 'everything'. Then you need to make sure that all things are stored exactly where they belong. In principle it isn't that hard; in practice any medium to large computer system consists of millions of little moving parts. To get all of those into the proper place just seems like a huge amount of work. Really it's not, particularly if you have some overarching 'philosophy' for the organization. Higher-level rules, i.e. best practices, fit generalizations over the details and when done appropriately can provide strong organizational guidance towards keeping a project clean. And then it is actually far less effort than being rampantly disorganized.

The third rule is necessary to ensure that there aren't too many things all dumped into one big place. Some unscrupulous coders could claim that a ball of mud is organized because all of the code is in one giant directory in the repo. That, of course, is nonsense. Hundreds or thousands of source files in one giant mess is the antithesis of organization. The third rule fixes that. Everything in one place must be 'similar' or it just isn't organized.

So what does 'similar' mean? In my last post on Abstraction I talked about the rather massive permutations available for collections of sets. This directly implies that there are huge variations on degrees of possible similarity. As well, different people have very different tolerances for the proximity of any two things. Some people can find similarities at a great distance, while others only see them if they are close together. Taken together this implies that 'similarity' is subjective.

Two completely different people may disagree on what is close enough to be similar, and that of course propagates up towards organization. Different people will have very opposing views as to what is really organized, and what is disorganized. That being said, there is always some outer boundary on which the majority of reasonable people will agree that two things are definitely not similar, and it is this boundary that is the minimal necessity for the third rule. That is, there is basic organization and then there are many more advanced, more eclectic versions. If a project or process doesn't even meet the basics, then it is clearly doomed. If it twists a little to be tightly organized relative to a specific person's organizational preferences, then it should at least be okay, and it is always possible to re-organize it at some point in the future if the staff changes.

The corollary to this definition does imply however that if there are four programmers working on the same project with four distinctly different organization schemes, then if they overlap in any way, the project is in fact disorganized. If the programmers all choose different naming conventions for the same underlying data, a thing, it is essentially stored in four different places, not one, violating the first rule. If duplicate code appears in many places, then the second rule is violated. If some component is the dumping ground for all sorts of unrelated functionality, then the third rule is broken. New programmers ignoring the conventions and tossing a "bag on the side" of an existing project are actually making it disorganized.

A large and growing project will always have some disorganization. But what is important is that there is continual work to explicitly address this. The data and code continually grow, drifting in similarity, so once the disorganization starts to impact the work it needs to be addressed before it consumes a significant amount of time. And it needs to be handled consistently with the exact same organization scheme that already applies to everything else. A project that sees this as the minimum mandatory work involved in building up a system is one with a fighting chance for success. A project where this isn't happening is in trouble.

Testing to see if a system is organized is fairly simple. You just need to go through the data and code and see if for every place, the things are all similar. If there is lots of stuff out of place, then it is disorganized. If everything fits exactly where it should, then not only is it organized but it is also often described as 'beautiful'. The term 'elegant' is also applied to code, and refers to that ability to make a very complex problem appear to be rather simple. Underlying that achievement is an excellent organizational scheme, not just a good one.

Organization relates back to simplification and complexity. I talked about this in The Nature of Simplification nearly ten years ago, but it was in regards to simplifying without respect to the whole context. A bad simplification like the files/folder example is also disorganization because it gradually grows to violate the similarity rule. This feeds back into complexity in that disorganization is a direct form of added 'artificial' complexity. A mess is inordinately more complex than a well-designed system, but that mess is not intrinsic to the solution. It could have been done differently.

Organization ties back to other concepts such as Encapsulation and Architecture. A fully defined place for all of the data and code that are 'interrelated' is an alternative definition of Encapsulation. Architecture is often described as the main 'lines' or structures over which everything else fits, which is in a real sense an upper level of organization of the places themselves. Given enough places, they need to be put into meta-places and enough of those need to be organized too, etc. A massive system requires many layers to prevent any potential disorganization from just being pushed elsewhere and ignored. Organization is upwardly recursive as the scale increases.

Applying this definition of organization to processes is a bit tricky, but very similar. As always, I think it's easier to work backwards for understanding process issues. Starting with the users actually accessing the necessary functionality to solve their problems, we can see that the minimal organization includes getting each dependent 'thing' into its place with the minimal amount of effort. So the process of upgrading a system, for example, is well-organized when the only steps necessary are all dissimilar from each other. If there are five config files, then there ought to be just one step that gets them all installed into the right place. More importantly, if upgrading occurs more than once, then actually automating the process is in itself a form of organizing it.

Switching out to the earlier side of the development stages, the output of analysing an underlying problem to be solved requires that everything known and discovered is appropriately stored in the correct place, in the correct form. It isn't mashed together, but rather partitioned properly so that it is most useful for the next design stage. If this principle of organization is followed, it affects everything from the product concept to the actual feedback from operations. Everything is appropriately categorized, collected and then stored in its right place. The process is organized and anyone, at any stage, can just know where to find the correct information they need or know immediately that it is missing (and is now work to be done).

This may seem like a massive effort, but really it's not. There is no point collecting and organizing data from one stage if it's not going to be used in another. Big software projects sometimes amplify make-work because of misguided perspectives on what 'could' be useful, but after decades of development it becomes rather clear as to what 'will' actually be useful and what is just wasted effort. Processes should be crafted by people with significant hands-on experience or they miss those key elements. In a disorganized process, no real distinction can be made on the value of work, since you cannot ever be sure if there will be usage someday. In a well-organized project, spotting make-work is really easy.

One can extend this definition of organization fully to processes by substituting the nouns in the above rules with the appropriate verbs for the process. Then there is a process for every action, and every action takes place in the appropriate process. As well, there is a similarity to all of the actions within the same process. Of course with computers it is entirely possible to automate significant parts of the process, such that an overwhelming number of verbs really fall back to their output; back to nouns again. Seeing the work in this way, one can structure a methodology that ensures that the right outputs are always constructed at the right time, thus ensuring efficiencies throughout the flow of work.

A well-organized process is not nearly as onerous as it sounds; rather, it makes concentrating on the quality of work much easier in that there are fewer disruptions and unexpected changes. As well, there are fewer arguments about how to handle problems, in that the delegation of work is no longer as subjective. For instance, if at the coding stage it becomes clear that some proposed functionality just won't work as advertised, then the work reverts back to the design stage or even further to the analysis. There is little panic, and an opportunity to ensure that in the future the missing effort isn't continually repeated. In a disorganized process, someone generally just whips out the duct tape and any feedback is lost, so the same issues repeat over and over again. In that sense, an organized process is helpful to everyone involved, at a minimal cost. If chaos and pain are frequent process problems, then disorganization and make-work are strongly present.

A development project that is extremely well-organized is fairly simple to manage and will produce software at a predictable pace, allowing estimations to properly define priorities. A smooth development project, of course, provides more options for the work to match the required solutions. Organization feeds directly into the ability of the work to fulfil its goals in a reasonable time, at a reasonable cost, which is the very thing that users want most from software development. Organization is initially more expensive, but in any effort longer than a few months it quickly pays for itself. In a project expected to take years, with multiple developers, it is fairly insane not to take it seriously.

Organization springs from some very simple ideas. Being somewhat subjective, it is not the sort of thing that can be appropriately defined by a standard or a committee. Really, the core leaders at each and every stage of development need to set up their own form of organization and apply it consistently to everything (old or new). If that is done, then the development work proceeds without significant friction. If organization is just left to happen organically, it won't happen. Once disorganization takes hold, it will always get worse. It's a vicious cycle, with more and more resources sucked away. It can be painful to reverse, but given that disorganization is absolutely fatal to a project, ignoring it won't work. Organization is a necessity for building all but the most trivial of software.

Sunday, October 4, 2015

Abstraction

About the hardest thing I've ever tried to write about is 'abstraction'. 

It defies a simple explanation precisely because it is so abstract. Some people get it at an intuitive level, but for many others, it slips past their grasp, which is unfortunate since it is the gateway to efficiently building sophisticated software systems. It's the high-performance engine of the software industry, but it is utilized way too infrequently.

For this post, I'll try approaching the description from a completely different angle. Perhaps it will make the idea more concrete for some people. The main prerequisite is an understanding of entity-relationship modeling -- covered somewhat by my earlier post 'Data Modelling'.

A software entity represents some "concrete" thing in reality. We pick these things because they are useful for solving a user's problem. Each of these things has a set of attributes that are mostly discrete variables, although they could also be sub-entities. There are no continuous variables, such as Real numbers, stored in a computer; there are only discrete, finite approximations of them, like floating-point numbers. That everything, except time, is finite and bounded is sometimes important in ensuring correct behavior, but it also imparts some interesting constraints on the characteristics and usefulness of any software program.
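
A minimal sketch, assuming Python (the entity, its fields, and its values are invented for illustration), of an entity whose attributes are all discrete variables, one of which is itself a sub-entity:

    # Illustrative only: a 'Person' entity as a container of discrete variables.
    from dataclasses import dataclass

    @dataclass
    class Address:           # a sub-entity
        street: str
        city: str

    @dataclass
    class Person:
        first_name: str      # a label referring to a specific instance
        age: int             # discrete
        height_m: float      # a finite approximation of a continuous quantity
        home: Address        # attributes can themselves be sub-entities

    alice = Person("Alice", 34, 1.7, Address("12 Main St", "Toronto"))

Even the 'height_m' field is not a Real number; it is a bounded, finite approximation, which is exactly the constraint described above.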

If we turn our attention back to 'reality' we find that all of the 'things' in it are composed of particles. There are decompositions that go far deeper, but for this post we can ignore those. If we were to look at a person, we could think of them as being a 'container' of some incredibly large number of particles. Within that container are a very large number of subsets, such as arms and legs. In a sense, a person is decomposable into any and all of the possible permutations of any subset of their particles. That's an incredibly huge number of subsets. In practice, however, only a very small number of these subsets actually make sense to us, like 'limbs'; the rest are ignored. Still, one could conceivably organize every one of these permutations into hierarchies based on their 'closeness', perhaps having the full set that represents a person at the root.

Given any two people, there are obviously common subsets, like 'limbs', that represent the same relative parts, yet are filled with different particles for each person. In a sense, we simply designate the term 'limb' as a common 'category' containing the arms or legs of both of these people. We attach it to a particular collection of subsets, even if we don't explicitly and precisely define what those subsets are in most of our everyday usage.

Now it so happens that our software term 'entity' also means a container, but for variables instead of particles. Thus, we can easily arrange to have entities in the software that have a one-to-one mapping to any containers of particles in reality. That makes a great deal of sense, but we do find that people often have attributes such as 'first name' that aren't directly related to subsets of their particles. That's not really a problem, in that if we use a term like 'limb' for body parts, it really is just a 'label' to refer to a collection of subsets. Names are exactly that, labels, but in this case, they refer to a specific instance, instead of a collection of them.

The idea that we can label both instances and generic groups of subsets is very important. It shows that this labeling mechanism is fairly powerful. We can use it uniquely, or we can apply it unbounded across all of time. When we refer to a limb, for instance, it holds its meaning for all of the past, present, and future. Throughout time, there will be a massive number of limbs, too many to ever count. No matter where or when it is applied, the term references a very specific subset or 'category' of particles; or parts of a person, if you prefer.

And thus, we have a means of relating together vast subsets as explicit 'categories'. We can create these categories and attach them to virtually any group of subsets imaginable, but, very much like the permutations, only some of these labels are useful.

Now comes the fun part. Earlier we noted that entities can be mapped to these subsets, but also that the attributes themselves are explicitly related to the subsets as well. We can also note that a person may contain up to four limbs, two of which are arms and two of which are legs. While the term 'legs' references the same subset -- entity -- as 'limbs', it is just slightly more specific about the instances. If we were to collect specific information about someone's legs, given the differences in particles from 'arms', our corresponding entity would contain leg-specific variables. Placeholders, if you will, for the specifics of an individual.

When jumping from an entity of type 'legs' to the slightly more abstract 'limbs', there would be fewer variables that we could store. There might be specific information we could add about the four limbs, like the number still remaining, but anything leg- or arm-specific would be 'abstracted away' as we moved up to the higher category.

Many object-oriented programmers will have recognized the above description as 'polymorphism', which is itself a particular type of abstraction. In moving 'upwards' we can see that we lose the specifics related to categories of instances, but we also gain metrics relating to the larger collection of subsets. In that sense, any abstraction is obviously a generalization that implicitly refers to more things than the original. A key reason for doing this is to bring special cases with some useful similarity together, to save on the complexities of handling them individually. That is, it is simpler with respect to both organization and resources.
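
As a minimal sketch, assuming Python and with invented attribute names, the limbs/legs example maps directly onto a small class hierarchy; moving up to the 'Limb' level abstracts away the arm- and leg-specific variables while still letting us compute things, like the number remaining, across the whole collection:

    # Illustrative sketch of the limbs/arms/legs categories as polymorphism.
    from dataclasses import dataclass

    @dataclass
    class Limb:                      # the more abstract category
        still_attached: bool         # information meaningful for any limb

    @dataclass
    class Arm(Limb):                 # more specific: arm-only variables
        grip_strength_kg: float

    @dataclass
    class Leg(Limb):                 # more specific: leg-only variables
        shoe_size: int

    limbs: list[Limb] = [Arm(True, 32.5), Arm(True, 30.0), Leg(True, 9), Leg(False, 9)]

    # Working at the 'Limb' level: leg- and arm-specific details are abstracted away.
    remaining = sum(1 for limb in limbs if limb.still_attached)
    print(remaining)                 # 3

Counting the limbs still remaining only needs the abstract category; fitting shoes would require dropping back down to the more specific 'Leg' entities.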

One thing to fully understand is that there is essentially an unlimited number of ways to relate similar subsets together, particularly if they contain a significant number of particles. That gives us an unfathomably large number of ways to categorize entities. It also means that there is no 'right' way to categorize, just convenient ones. This really hits home in software, in that modeling any type of data with a static schema is inevitably going to limit the usefulness of that data. The static relationships cannot be 'complete' or all-inclusive. So, in a sense, any software system that is growing is going to need to re-arrange or re-categorize its modeling whenever the requested functionality exceeds the abilities of the earlier model. That is, software is constantly in flux, because it is always, at most, focussed on some absurdly small number of the subset possibilities.

What's interesting about this view is that there are a huge number of subsets of particles in the real world that are useful to track with a computer. We're at the very early stages right now; most of the entities we capture are extremely crude. As we progress in our understanding of both modeling and complexity, we should be able to build systems that really do encompass enough intelligence to simplify people's lives. We're a long way away from that right now.

Earlier, I said that entities represent "concrete" things in reality. That isn't exactly true. For instance, in a video game, an entity might represent a humanoid like an elf or a dwarf. While these 'fictional' ideas share some similarity to reality, they are definitely not real. That is fine, in that our imaginations allow us to combine real things with abstract ones and contemplate these. One could not even say that these imaginary entities are bounded by the real world, given some of the abilities that a superhero, like Superman, has. But we can actually go even further, in that the abstractions of mathematics quite easily allow us to ponder any sort of thing -- particularly as a formal system -- that utilizes alternatives that would directly contradict reality. That is, we can model anything, as a set of complex relationships, even though it violates the physical constraints of our existence. We are not bounded by our observable reality. We can still think of any of this as "concrete" because we are able to communicate it to each other. It is describable and can be modeled symbolically by our software.

All of this gives us the ability to see whether something really is an abstraction, and to possibly optimize our abstractions to be both larger and more useful. It also leaves us with a sense of coverage: a supposed abstraction may only intersect minimally with the original subsets and fail to cover them. For testing a basic abstraction, all we need to do is see that there are plenty more subsets included as we go higher. If the abstraction bends upward, getting stuck on some alternative categorization, we can see that too. That latter case is actually sometimes useful, in that binding a collection of subsets together with some 'imaginary' subsets is an often-used technique to increase performance or readability, or to reduce code. The imaginary parts simply reduce friction and make everything fit properly. It's exactly for that reason that getting pedantic about the correspondence between reality and the subsets can often make the problem harder, not simpler, which leads some programmers to avoid it. That's unfortunate, given that the directed creativity necessary to smooth over the bumps in a complex system is actually one of the types of problems that programmers say they want to solve. It takes a bit of thinking, but it isn't impossible, or particularly strange.

Abstraction, then, isn't all that abstract. We're really just loosening the labeling to collectively refer to a larger number of things. Given the variety of relations between collections of subsets, we have to be careful to ensure that any loosening shifts upwards in the desired manner. Once we have abstracted, we have fewer individual cases to consider, so we can save time and resources as we implement the models. Used properly, good abstractions can free the designers, allowing them to tackle the really sophisticated problems, instead of getting bogged down in the brute force tar pit.