Sunday, November 22, 2015

Containers, Collections and Null

A variable in a computer program holds exactly one member of a finite set of possibilities. This datum symbolically represents something in the physical world, although it is also possible that it represents something from an imaginary reality.

The usefulness of a variable is that it can be displayed at the appropriate moment to allow a person, or something mechanical, to make a decision. Beyond that, it is just being saved under the belief that it will be used for this purpose some day or that it will be an input into some other variable’s moment.

A variable can also have an associated null flag. It is not actually ‘part’ of that variable, but rather an independent boolean variable that sheds light on the original contents. A friend of mine used to refer to null values as ‘out-of-band’ signals, meaning that they sit outside of the set of possible values. This is important in that they cannot be confused with another member of the set, unlike what he called ‘in-band’ signals, which are often implemented as ‘magic numbers’.
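
As a rough sketch of that distinction in Java, with purely hypothetical field names, an in-band magic number lives inside the value set and can collide with real data, while a null sits outside of it:

    // A sketch of the difference between an out-of-band signal and an
    // in-band 'magic number'. The field names are hypothetical.
    public class Measurement {
        // In-band: -1 is a member of the int value set, so it can be
        // confused with real data and its meaning is easily forgotten.
        private int temperatureMagic = -1;   // -1 supposedly means "not collected"

        // Out-of-band: null sits outside the set of valid temperatures,
        // so it cannot collide with any legitimate reading.
        private Integer temperature = null;

        public boolean wasCollected() {
            return temperature != null;
        }
    }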

There is generally a lot of confusion surrounding the use of nulls. People ascribe all sorts of program-specific meanings to them, which act as implicit data in their program. In this way, they overload the out-of-band signal to represent a custom state relative to their own data. This is a bad idea. Overloading any meaning in a variable guarantees some sort of future problem since it is subjective and easily forgotten. It is far better to ascribe a tighter definition.

A variable is collected or it is not; computers are perfectly binary in that regard. If it is possible for a variable to not be collected, the only reason for this is that the variable is ‘optional’; otherwise the program will not continue with its processing. Obviously bad data should not be allowed to propagate throughout a system or get stored in the database. If there are different, and useful, excuses for not having specified a variable right away, then those excuses are independent variables themselves. With that in mind, a null only ever means that the variable is optional and was not collected; nothing else should be inferred from it.

Then for the sake of modeling, for every variable one only needs to know whether the data is mandatory (it must be collected) or nullable. Getting a null in a mandatory field is obviously an error, which needs to invoke some error handling. Getting a null in an optional field is fine. If a variable is nullable ‘anywhere’ in the system then it should be nullable ‘everywhere’ in the system (it is part of the data model). If there is functionality dependent on an optional variable, then it should only ever execute when that variable is present; it should not be an error if the variable is missing. If the underlying data is optional, then any derived data or functionality built on top of it is optional as well. Optionality propagates upwards.
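
To make that concrete, here is a minimal sketch in Java, assuming a hypothetical Customer with one mandatory and one optional field:

    import java.util.Objects;

    // A minimal sketch of the mandatory/optional distinction, assuming a
    // hypothetical Customer record. Names and fields are illustrative.
    public class Customer {
        private final String name;     // mandatory: must be collected
        private final String nickname; // optional: may legitimately be null

        public Customer(String name, String nickname) {
            // A null in a mandatory field is an error; fail immediately.
            this.name = Objects.requireNonNull(name, "name is mandatory");
            // A null in an optional field is fine; nothing else is inferred.
            this.nickname = nickname;
        }

        public void greet() {
            // Functionality that depends on optional data only runs when
            // that data is present; its absence is not an error.
            if (nickname != null) {
                System.out.println("Hi, " + nickname);
            } else {
                System.out.println("Hello, " + name);
            }
        }
    }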

With that perspective, handling nulls for a variable is easy and consistent. If it doesn’t make sense that the data could be missing then don’t make it nullable. It also helps to understand where to reinforce the data model if the data is a bit trickier than normal. For instance, partially collected data is ‘bad’ data until all of the mandatory values have been specified. So it should be partitioned or marked appropriately until it is ready.

The strength of a computer is not that it can remember a single variable, but rather that it can remember a huge number of them. And they can be interrelated. What really adds to the knowledge is the structure of these cross-variable relationships. There can be a huge number of instances of the same variable and/or there can be a huge number of relationships to other types of variables.

The larger and more complex the structure, the more information that can be gleaned out of this collected data. The history of software has always been about trying to cope with increasingly larger and more complex data structures even if that hasn’t been explicitly stated as the goal. In a sense, we don’t just want data, we want a whole model of something (often including a time dimension) that we can use to make deeper, more precise decisions.

The complex structural relationships between variables are represented within a program as ‘containers’. A simple container, say a structure in C or a simple object in Java, is just a static group of related variables. These can be moved throughout the program as if they were a single variable, ensuring that they are all appropriately kept together. In this arrangement, the name of the variable in many programming languages is a compile-time attribute. That is, the programmer refers to the name when writing the code, but no such name exists in the system at runtime. Some languages provide introspection, allowing the programmer to retrieve the name and use it at runtime.
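
For instance, a simple container in Java is just a class with fields, and introspection (reflection, in Java’s case) can recover the field names at runtime. The Point class below is purely illustrative:

    import java.lang.reflect.Field;

    // A minimal sketch of a static container and of introspection, assuming
    // a hypothetical Point class. The field names are compile-time
    // attributes; reflection recovers them again at runtime.
    public class Point {
        public int x;
        public int y;

        public static void main(String[] args) throws Exception {
            Point p = new Point();
            p.x = 3;
            p.y = 4;
            // Retrieve the variable names and values at runtime.
            for (Field f : Point.class.getFields()) {
                System.out.println(f.getName() + " = " + f.get(p));
            }
        }
    }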

Null handling for simple containers is similar to null handling for individual variables. Null means that ‘all’ of the data in the structure is optional. In some languages, however, there can be confusion on how to handle mandatory data. With a simple variable, the language itself can be used to insist that it always exists, but with a pointer or reference to a container that check needs to be made explicitly in the code. It can come in the form of asserts or explicit branches that jump to error handling. The code should not continue with functionality that relies on mandatory data if that data is missing.
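
A sketch of that explicit check, assuming a hypothetical Order container, might look like this:

    // A sketch of guarding a mandatory container reference, assuming a
    // hypothetical Order type. With a simple variable the language itself can
    // insist that it exists; with a reference the check must be explicit.
    public class OrderProcessor {
        static class Order {
            String id;
        }

        public void process(Order order) {
            if (order == null) {
                // Mandatory data is missing: jump to error handling rather
                // than continuing with functionality that relies on it.
                throw new IllegalArgumentException("order is mandatory");
            }
            // ... continue, knowing the mandatory data is present ...
            System.out.println("processing " + order.id);
        }
    }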

The next level of sophistication allows for a runtime-based ‘named’ variable. That is, both the name and the variable contents are passed around as variables themselves, together in a container. Frequently this container is implemented as a hashtable (sometimes called a dictionary), although in some cases ‘order’ is required so there can also be an underlying linked-list. This is quite useful for reusing code for functionality on similar data, with only some minor processing attached to a few specific variables. It mostly leaves the bulk of the code to manipulate the data without having to really understand it, making it strongly reusable. This works well for communications, printing, displaying stuff in widgets, etc.: any part of the code whose processing isn’t dependent on the explicit meaning of the underlying data, although sometimes there needs to be categories (often data types) of behavior. Learning to utilize this paradigm can cut out a huge number of redundant lines of code in most common programs.
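
In Java such a container is typically a Map; the sketch below uses a LinkedHashMap so that order is also preserved, and the names and values are purely illustrative:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A minimal sketch of runtime 'named' variables, with hypothetical
    // names and values. A LinkedHashMap also preserves insertion order.
    public class NamedVariables {
        public static void main(String[] args) {
            Map<String, Object> record = new LinkedHashMap<>();
            record.put("name", "Ada");
            record.put("age", 36);

            // Generic code that displays any record without understanding
            // the explicit meaning of the underlying data.
            for (Map.Entry<String, Object> pair : record.entrySet()) {
                System.out.println(pair.getKey() + ": " + pair.getValue());
            }
        }
    }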

Null handling for containers of named variables is slightly different, in that the absence of a particular named pair is identical to the name existing with a null value. Given this overlap, it is usually best to not add empty, optional data into the container. This is also reflected symmetrically by not passing along values without a name either. This type of structuring means that processing such a container is simplified in that each pair is either mandatory, or it is optionally included. If a pair is missing, then it was optional. To enforce mandatory variables, again there needs to be some trivial code that interrupts processing.
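
A minimal sketch of those pairing rules, again with hypothetical names, could be a pair of helpers: one that silently drops empty optional pairs, and one that interrupts processing when a mandatory pair is missing:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A sketch of the pairing rules: optional values that were not collected
    // are simply left out, and mandatory names are checked with trivial code.
    public class PairRules {
        public static void put(Map<String, Object> record, String name, Object value) {
            // Do not add empty, optional data; absence and null mean the same.
            if (name != null && value != null) {
                record.put(name, value);
            }
        }

        public static Object require(Map<String, Object> record, String name) {
            Object value = record.get(name);
            if (value == null) {
                // A missing mandatory pair interrupts processing.
                throw new IllegalStateException(name + " is mandatory");
            }
            return value;
        }

        public static void main(String[] args) {
            Map<String, Object> record = new LinkedHashMap<>();
            put(record, "name", "Ada");
            put(record, "nickname", null); // optional, not collected: left out
            System.out.println(require(record, "name"));
        }
    }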

Containers get more interesting when they allow multiple instances of the same data, such as an array, list or tree. Large groups of collected data shed a brighter light on the behavior of each individual datum, thus providing deeper details. These ‘collections’ can be ordered or unordered, although the latter is really a figment of the programmer’s imagination in that everything on a computer has an intrinsic order; it is just sometimes ignored.

Ordering can be based on any variables within the data, although it is often misunderstood. The classic case is tables in a relational database returning their default order based on their primary index construction, which leads to the extremely common bug of a SELECT statement’s order changing unexpectedly when the tables grow large or rows are deleted. Programmers don’t see this potentially chaotic reordering when testing with small datasets, so they make bad assumptions about what really caused the visible ordering.
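
The fix is simply to never rely on a default order; if the order matters, it belongs in the query itself. The table and column names below are hypothetical:

    // A sketch of the ordering bug, with hypothetical table and column names.
    public class OrderByExample {
        // Fragile: the row order here is an accident of the primary index and
        // can change as the table grows or rows are deleted.
        static final String FRAGILE = "SELECT id, name FROM customer";

        // Predictable: the order is part of the query, not an assumption.
        static final String EXPLICIT = "SELECT id, name FROM customer ORDER BY name";

        public static void main(String[] args) {
            System.out.println(FRAGILE);
            System.out.println(EXPLICIT);
        }
    }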

One frequent confusion that occurs is with null handling for collections, in that an empty collection or a null reference can be interpreted as the same thing. In most cases, it is really optional data that has not been collected, so it doesn’t make sense to support this redundancy. It is more appropriate to handle the ‘zero items’ condition as an optional collection, and the null reference itself as a programming error. This is supported quite elegantly by having any code that returns a collection always allocate an empty one. This can then be mirrored by any function that needs a collection as well, in that it can assume that null will not be passed in, just empty collections. This reduces bugs caused by inconsistent collection handling, but it does mean that every branch of any new code should be executed at least once in testing to catch any unwanted nulls. It doesn’t, however, mean that every permutation of the logic needs to be tested, just the empty case, so the minimum test cases are fairly small. This isn’t extra work in that the minimum reasonable practice for any testing is always that no ‘untested’ lines of code should ever be deployed. That’s just asking for trouble, and if it happens it is a process problem, not a coding one.
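
A minimal sketch of this in Java, assuming a hypothetical findOrders lookup, is to always return an allocated (possibly empty) list and to let callers assume they will never see null:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // A minimal sketch of 'normalized collection handling'. The findOrders
    // lookup and its data are hypothetical.
    public class Orders {
        public static List<String> findOrders(String customerId) {
            if (customerId == null || customerId.isEmpty()) {
                // Optional data was not collected: return the 'zero items'
                // condition, never a null reference.
                return Collections.emptyList();
            }
            List<String> results = new ArrayList<>();
            // ... populate from the underlying storage ...
            return results;
        }

        public static void printOrders(List<String> orders) {
            // No null check needed; an empty list simply does nothing.
            for (String order : orders) {
                System.out.println(order);
            }
        }
    }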

This null handling philosophy prevents the messy and time-wasting inefficiencies of never knowing which direction an underlying programmer is going to choose. We’ll call it ‘normalized collection handling’. It does, however, require wrapping any questionable, underlying calls from other code just in case they don’t behave this way, but wrapping all third-party library calls was always considered a strong best practice, right up until it was forgotten.
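
A sketch of such a wrapper, where ThirdPartyClient and fetchItems are hypothetical stand-ins for any external library call, might look like this:

    import java.util.Collections;
    import java.util.List;

    // A sketch of wrapping a questionable third-party call so that the rest
    // of the system only ever sees normalized collections.
    public class ItemGateway {
        private final ThirdPartyClient client;

        public ItemGateway(ThirdPartyClient client) {
            this.client = client;
        }

        public List<String> fetchItems() {
            List<String> items = client.fetchItems();
            // The library may return null; the wrapper guarantees that it
            // never escapes into the rest of the code.
            return (items == null) ? Collections.emptyList() : items;
        }

        public interface ThirdPartyClient {
            List<String> fetchItems();
        }
    }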

Some programmers may not like it because they believe that it is more resource-intensive. Passing around a lot of empty collections will definitely use extra memory. Getting the count out of a collection is probably more CPU than just checking for a null (but less if you have to do both, which is usually the case). This is true, but because an empty collection eats up the base management costs, it also means that the resource usage during processing, at least from the lowest level, is considerably more consistent. Programs that fluctuate between large resource usage extremes are far more volatile, which makes them far more difficult for operations to handle. That is, if whenever you run a program, its total resource usage is predictable and proportional to load, then it becomes really easy to allocate the right hardware. If it swings to extremes, it becomes far more likely to exceed its allocation. Stable-resource systems are way easier to manage. A little extra memory then is a fair price to pay for simpler and more stable code.

We can generalize all of the above by realizing that null handling differs between static and dynamic variables. Adding nulls is extra work in the static case, while enforcing mandatory requirements is extra work in the dynamic case. Static data is easier to code, but dynamic data is both more flexible and more reusable. In that sense, if we are going to hardcode a great deal of static data, it is best if the default is mandatory and optional data is the least frequent special case. The exact opposite is true with dynamic data: most things should be optional, to avoid filling the code with mandatory checks. This behavioral flip-flop causes a great deal of confusion because people want a one-size-fits-all approach.

A nice attribute of this whole perspective on data is that it simplifies lots of stuff. We only have containers of unnamed or named variables. Some containers are collections. This is actually an old philosophy, in that it shows up in languages like AWK that have ‘associative arrays’, where the array index can be an integer or a string. The former is a traditional array, while the latter is a hashtable, but they are treated identically. In fact, in AWK, I think it just cheats and makes everything a hashtable, converting any other data type to a string key. This makes it a rather nice example of an abstraction smoothing away the special cases, for the convenience of both the users of the language and the original programmers.
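
We could mimic that AWK behavior in Java with a single associative container that converts every index to a string key; the helper below is just an illustration of the idea:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A sketch of the AWK-style trick of treating every index as a string
    // key in one associative container. The helper names are hypothetical.
    public class AssociativeArray {
        private final Map<String, Object> entries = new LinkedHashMap<>();

        public void put(Object index, Object value) {
            // Any index, integer or string, is converted to a string key,
            // smoothing away the array/hashtable special cases.
            entries.put(String.valueOf(index), value);
        }

        public Object get(Object index) {
            return entries.get(String.valueOf(index));
        }

        public static void main(String[] args) {
            AssociativeArray a = new AssociativeArray();
            a.put(0, "first");        // behaves like a traditional array
            a.put("label", "second"); // behaves like a hashtable
            System.out.println(a.get(0) + ", " + a.get("label"));
        }
    }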

We have to, of course, extend this to allow for containers holding a mix of variables and sub-containers, and do that in a fully recursive manner so that a variable can be a reference to a container or collection itself. This does open the door to having cycles, but in most cases, they are trivial to detect and easily managed. Going a little farther, the ‘name’ of the variable or key of the hashtable can itself be a recursive container of structures as well, although eventually it needs to resolve down to something that is both comparable and generates a constant hash code (it can't change over time). Mixed together, these techniques give us a fully expressive means of symbolically representing any model in spacetime, and thus they can be increasingly used to make even more complex decisions.
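
A small sketch of such recursive structure in Java, with purely illustrative data, shows named variables whose values are themselves containers or collections:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // A minimal sketch of recursive containers: a named variable whose value
    // is itself a container or a collection. The data is illustrative.
    public class RecursiveModel {
        public static void main(String[] args) {
            Map<String, Object> address = new LinkedHashMap<>();
            address.put("city", "Toronto");

            List<Object> phones = new ArrayList<>();
            phones.add("555-1234");

            Map<String, Object> person = new LinkedHashMap<>();
            person.put("name", "Ada");
            person.put("address", address); // value is a sub-container
            person.put("phones", phones);   // value is a collection

            // Keys must stay comparable and hash consistently; immutable
            // strings satisfy that, while a mutable composite key would not.
            System.out.println(person);
        }
    }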

We should try not to get lost in the recursive nature of what we need to express, and it is paradigms like ADTs and Object-Oriented programming that are really helpful in this regard. A sophisticated program will have an extraordinarily complex data model with many, many different depths to it, but we can assemble and work on these one container at a time. That is, if the model is a list of queues of trees with stacks and matrices scattered about, we don’t have to understand the entire structure as a ‘whole’ in order to build and maintain the code. We can decompose it into its underlying components, data structures and/or objects, and ensure that each one works as expected. We can then work on, and test, each sub-relationship again independently. If we know all the structural relationships are correct, and we know the underlying handling is correct, then we can infer that the overall model is correct.

While that satisfies the ability to implement the code, it says nothing about how we might map a complex model onto a real-world solution. If there is art left in programming, much of it comes directly from this issue. What’s most obvious is that the construction of sophisticated models needs to be prototyped long before the time is spent to implement them, so that the model can be quickly rearranged as needed. The other point is that because of the intrinsic complexity, this type of modeling needs to be built from the bottom up, while still being able to understand the top-down imposed context. Whiteboards, creativity and a lot of discussion seem to be the best tools for arriving at decent results. Realistically this is more of an analysis problem than a coding one. The structure of the data drives the behavior; the code just needs to be aware of what the structure is supposed to be, and handle it in a consistent manner.

Some programmers would rather avoid this whole brand of complexity and stick to masses of independent static variables, but it is inevitable that larger, more complex structural relationships will become increasingly common as users’ requirements for their software gradually get more sophisticated. A good example of this is the non-trivial data structures that underpin spreadsheets. They have only gained in popularity since they were invented. Their complexity unleashed power-users to solve programmatic issues for themselves that were unimaginable before this technology existed. That higher level of sophistication is sorely needed for many other domain-based problems, but currently they are too often encoded statically. We’ve been making painfully slow progress in this area, but it is progress, and it will continue as users increasingly learn what is really possible with software.

Thinking of the possible variable arrangements as structural relationships between containers makes it fairly straightforward to understand how to represent increasingly complex real-world data. With tighter definitions, we can avoid the disorganization brought on by vagueness, which will also cut down on bugs. Software is always founded on really simple concepts, the definition of a Turing machine for example, but we have built on top of this a lot of artificial complexity due to needless variations. Being a little stricter with our understanding allows us to return to minimal complexity, thus letting us focus our efforts on achieving both sophistication and predictability. If we want better software, we have to detach from the sloppiness of the past.

Sunday, November 8, 2015

Software Engineering

The standard definition for engineering from Wikipedia is:

Engineering is the application of mathematics, empirical evidence and scientific, economic, social, and practical knowledge in order to invent, design, build, maintain, research, and improve structures, machines, tools, systems, components, materials, and processes.

But I like to simplify it into just two rules:

  1. Understanding something really deeply.
  2. Utilizing that knowledge to build stuff with predictable behavior.

With the added caveat that by ‘predictable behavior’ I mean that it is predictable for ‘all’ possible circumstances in the real world. That’s not to say that it must withstand everything that could be thrown at it, but rather that, given something unexpected, the behavior is not unexpected. There is no need to guess; it can be predicted and will do exactly as expected. Obviously, there is a strong tie between the actual depth of the knowledge and the accuracy of these predictions.

The tricky part about engineering good software is acquiring enough deep knowledge. Although the existing underlying software is deterministic and explicitly built by people, it has been expanding so rapidly over the last five decades that it has become exceptionally convoluted. Each individual technology has blurry conventions and lots of quirky behavior. It becomes difficult to both properly utilize the technology and mitigate it under adverse conditions. Getting something to ‘sort of’ work is not too difficult, but getting it to behave reliably is extraordinarily complex and time-consuming.

For example, if you wanted to really utilize a relational database, that would require a good understanding of the set-theoretical nature of SQL, normalization, query plans, implicit queries, triggers, stored procedures, foreign keys, constraints, transactions, vendor-specific event handling and how to combine all of these together effectively for models that often exceed the obvious expressiveness. When used appropriately, a relational database is a strong persistence foundation for most systems. Inappropriate usage, however, makes it awkward, time-consuming and prone to unsolvable bugs. The same technology that can nearly trivialize routine systems can also turn them into hopeless tangles of unmanageable complexity. The difference is all about the depth of understanding, not the virtues of the actual software.

A big obstacle in acquiring deep knowledge is the lack of authoritative references. Someone could write a book that would explain in precise detail how to effectively utilize a relational database for standard programming issues, but culturally we don’t get that specific because it would be discounted due to subjective objections. That is, anyone with even a small variation of opinion on any subset of the proposed approach would discount the entire approach as invalid, thus preventing it from becoming common. In addition, creativity is valued so highly that most programmers would strongly prefer to rediscover how to use a relational database, over decades, than just adopt the pre-existing knowledge. That is unfortunate because there are more interesting problems to solve if you get past these initial ones.

To get to actual engineering we would have to be able to recognize the routine parts of our problems, and then solve them with standardized components whose ‘full’ behavior is well documented. This would obviously be a lot easier if we had a reliable means of categorizing that behavior. If we knew that a technology was certified as ‘type X’, for instance, we would not need to consume massive resources experimentally determining what happens. In that sense, the details need to be encapsulated, but all behavioral changes, such as possible errors, need to be explicitly documented and to strictly follow some standard convention. If we can achieve this, then we have components which can be used and, if a programmer sticks with a collection of them from a limited set of categories, they can actually have a full and deep understanding of how they will affect the system. That depth will give us the ability to combine our work on top in a manner that is also predictable, without, of course, having to deeply understand all of the possible conventions currently out there or even the full depth of the underlying technology.

What prevents us from actual software engineering is our own cultural evolution. We pride ourselves on not achieving any significant depth of knowledge, but rather just jumping in and flailing at crude solutions. Not standardizing what we build works in favor of both the programmers and the vendors. The former are in love with the delusion of creativity, while the latter see it as a means to lock in clients. There is also a persistent fear that any lack of perceived freedom will render the job of programming boring. This is rather odd, and clearly self-destructive, since continuously re-writing ‘similar’ code gradually loses its glamour, resulting in a significant shortening of one’s career. It’s fun and ego-fulfilling the first couple of times, but it eventually gets frustrating. Solving the same simple problems over and over again is not the same as really solving challenging problems. We do the first while claiming we are really doing the second.

There are many small, isolated pockets of software engineering in existence right now; I’ve worked in a few of them in the past. What really stands out is that they are considerably less stressful, more productive and you feel really proud of the work. Slapping together crud in a hurry is the exact opposite; some crazy deadline gets met, but that’s it. Still, the bulk of the nearly 18M programmers on the planet right now are explicitly oriented towards just pounding stuff out. And of the probably trillions of lines of code that are implicitly relied on by any non-trivial system, more and more of it is utilizing less and less knowledge. It is entirely possible to create well-engineered software, and it is possible to achieve decent quality in a reasonable amount of time, but we are slipping ever farther away from that goal.

At some point, software engineering will become mandated. As software ‘eats the world’ the unpredictability of what we are currently creating will become increasingly hazardous. Eventually, this will no longer be tolerated. Given its inevitability, it would be far better if we voluntarily refactored our profession instead of having it forced on us by outsiders. A gentle realignment of our culture would be less of a setback than a Spanish-style inquisition. It’s pretty clear from recent events that we are running out of time, and it’s rather obvious that this needs to be a grassroots movement. We can actually engineer software, but it just isn’t happening often right now and it certainly isn’t a popular trend.