Sunday, December 31, 2017

Visualizing Code

In its simplest sense, programming is about assembling a list of instructions for a computer to follow.

It has become somewhat more complicated lately because some of these instructions are quite high-level and the underlying details are convoluted. We like to use badly designed libraries and erratic frameworks to accomplish most of the tasks that we no longer feel obligated to work on ourselves.

That might seem to define programming as just utilizing the knowledge embedded into both the programming language and underlying dependencies. Learn enough of the details and off you go.

But there seems to be another dimension to programming.

In order to fully comprehend the behavior of the code, it helps a great deal if we can visualize the running code. To be able to ‘see’ it in our heads, somehow.

There are plenty of debuggers and techniques for embedding print statements, but on their own these are often not sufficient for understanding really complex bugs. And they don’t help at all with design issues. They only show us what is there and, step by step, how it runs.

The ability to ‘see’ the running code in our heads, however, allows us to work through the possible corner-cases, instead of having to acquire all of our knowledge from the runtime. Programmers who can visualize their code can quickly diff their understanding against the actual behavior. It’s this skill that amplifies some programmers to new levels.

The basic underlying act of programming is not particularly creative, since generally the problems should be analyzed before the coding begins. Most design aspects are fairly routine if the underlying technologies are understood.

But even on an organized and well-thought-out development project, being able to visualize the code before it is built is a huge advantage. That is, to be able to see the code running in our heads, we need to imagine it in some quasi-tangible way, one that is simpler than the actual operation, but not so simple that it no longer reflects the reality of the running code. Because we are human, that gives us internal representations that let us compare how we think the code operates with how it is actually running.

That ability brings creativity back into the mix since it requires imagination, and because it is a skill that grows slowly over time, it explains why programmers can sometimes accidentally exceed their abilities. That is, someone who is quite comfortable debugging a problem in 5K of code might suddenly flounder with a problem that spans 20K. It also explains why a software developer who can design a 50K system might completely fail to design one at 150K or larger. What happens is that they become unable to see these increases in complexity, and if the issues span the full width of the problem then they are unlikely to make ‘reasonable’ decisions.

What seems true is that visualization size is a very slowly accumulating skill set. It takes plenty of time to keep increasing this ability, little by little. It doesn’t come right away, most people aren’t even aware of its existence and there is no magical way to just hyper-jump it into a broader ability. In that sense, although occasionally an interview question almost taps into it, we generally focus on the exact opposite during interviews. We ask specific ‘definition’ questions, not general visualization questions (a rare exception is Fermi estimations).

There have been plenty of anti ‘10x programmer’ movements, where we insist that no such beings exist, but that doesn’t seem to match experience.

If 10x programmers exist, then since their ability to type in code is effectively bounded, it would be exactly this ability to visualize 10x larger contexts that separates them from the pack. That makes sense, in that they would be able to see where more of the code should be before they start coding and thus save a huge amount of time by not going down wrong paths. They would also be able to diagnose bugs a whole lot faster and make fewer nasty ones. Both of these account for more of our time than actually typing in the code, which, while it comes in bursts, is not what usually consumes most programming effort. Someone with stronger visualization skills would obviously be able to work through complicated programs a lot faster than someone with lesser abilities. And it wouldn’t be surprising if this ability rests on an exponential scale, given that increases in code size do as well. Small increases in visualization would give large advantages.

So, it does seem that there is a significant creativity aspect to programming, but it isn’t revolving around finding “creative” solutions when unneeded, rather it is about being able to visualize larger and larger solutions and bind them better to reality.

If we accept this, then it is clear that we need to respond by doing two things. The first is that we need a way of identifying this ability. This helps with interviewing, but it also helps in assigning development work to people. Problems that exceed a programmer’s abilities will not likely get solved well. If we are realistic about the size of our problems, then we can do a better job of matching up the coders to them. We can also assign problems that help push a programmer’s ability a little bit farther each time. Just a bit beyond their current level, but not far enough that it is demoralizing.

The second thing we should do is find some training methods that help increase this ability. We’ve always focused on simple decomposition methods like data structures or objects, but we’ve never reoriented this approach with respect to how it might help a programmer actually grow their abilities. It’s just considered base useful knowledge. We just drown them with definitions. It also makes a huge amount of sense to learn more alternative means of decomposition and different programming paradigms. These help to strengthen our visualizations by giving us different ways of seeing the same things. Thus we need to revisit the way we are teaching programming with an intent to find new ways to expand this ability.

This issue with visualization may actually help explain both the software crisis and modern bloat. Older programmers always lament that the kids don’t pay attention to the details anymore, and this oddly has been a long-running sentiment for over half a century. That makes sense in that each new generation of tools took some of that cognitive effort away from the coders while exchanging it for the ability to work on larger systems. In this exchange, the underlying visualization requirements are similar but are transferred to different areas. Thus the younger programmers work at a higher level, but what they do spans way more underlying code. So the basic effort stays relatively constant, but for each older generation, it looks like the kids are focusing on the wrong issues.

Indirectly this also aligns with many of Bret Victor’s ideas about reducing the lag between working and results. It is a more explicit external approach to achieving the same goal. It is just that even with his advances in technology, our need to visualize will still supersede them because there will always be a need for some larger visualizations that are beyond the current abilities of hardware. It’s a never-ending problem of growing complexity and context.

The underlying mechanics of programming seem to be fairly straightforward, but we are too focused on the instructions themselves. As we try to build bigger systems, this complexity quickly builds up. In order to cope with it, we develop the skills to visualize the behavior. As we grow this secondary skill, we are able to cope with larger and larger programming issues. This is the ability that gives some programmers an edge over the others. People who can see more of the full solution are far more likely to actually solve a lot more of the problem with fewer resources and better quality.

Saturday, December 30, 2017

Immutability

All too often software development practices are driven by fads.

What starts out as a good idea gets watered down to achieve popularity and then is touted as a cure-all or silver bullet. It may initially sound appealing, but eventually, enough evidence accrues that it is, as expected, far from perfect. At that point the negatives can overwhelm the positives, resulting in a predictable backlash, so everyone rushes off to the next big thing. We’ve been through this cycle many, many times over the last few decades.

That is unfortunate, in that technologies, practices, idioms, ideas, etc. are always a mixture of both good and bad. Rather than continually searching for the next, new, easy, thoughtless, perfect approach we really should just accept that both sides of the coin exist, while really focusing on trying to enhance our work. That is, no approach is completely good or completely bad, just applicable or not to a given situation. New isn’t any better than old. Old isn’t intrinsically worse. Technologies and ideas should be judged on their own merits, not by what is considered popular.

It’s not about right or wrong, it is about finding a delicate balance that improves the underlying quality at a reasonable cost. Because in the end, it's what we’ve built that really matters, not the technologies or process we used.

Sometimes, a technique like immutability is really strong and will greatly enhance the code, making it easier and more likely to be correct. But it needs to be used in moderation. It will fix some problems, but it can easily make others far worse.

Recently, I’ve run into a bunch of situations where immutability was abused, but still, the coders were insisting that it was helping. Once people are committed, it isn’t unusual for them to turn a blind eye to the flaws; to stick to a bad path, even in the face of contradictory evidence.

The idea of immutability became popular from the family of functional programming languages. There are good reasons for this. In general, spaghetti code and globals cause unknown and unintentional side-effects that are brutal to track down. Bug fixing becomes long, arduous and dangerous. The core improvement is that if there are no side-effects and the scope is tightly controlled, it is far easier to reason about the behavior of subparts of the code. Thus, we can predict the impact of any change.

That means that the underlying code then can no longer be allowed to modify any of its incoming data. It's not just about global variables, it's also any internal data and the inputs to functions. To enforce immutability, once some data has been created, anywhere, it becomes untouchable. It can’t be changed. If the data is wrong later, then there should be only one place in the system where it was constructed, thus only one place in the system where it should be fixed. If a whole lot of data has that attribute, then if it is buggy, you can reason quickly about how to repair it with a great deal of confidence that it won’t cascade into other bugs.
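As a rough sketch of what that looks like in practice (in Java, with an invented class name and fields), an immutable value has only final fields, no setters, and any ‘change’ produces a brand new instance:

// A minimal sketch of an immutable value: all fields are final, there are
// no setters, and the only place the data can be assembled is the constructor.
public final class Quote {
    private final String symbol;
    private final double price;

    public Quote(String symbol, double price) {
        this.symbol = symbol;
        this.price = price;
    }

    public String symbol() { return symbol; }
    public double price() { return price; }

    // "Changing" the data means building a new instance, leaving the original
    // untouched for any other code that still references it.
    public Quote withPrice(double newPrice) {
        return new Quote(symbol, newPrice);
    }
}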

While that, for specific types of code, is really great and really strong, it is not, as some believe, a general attribute. The most obvious flaw is that there could be dozens of scattered locations where the immutable data is possibly created, thus diminishing its scope benefits. You might still have to spend a lot of time tracking down where a specific instance was created incorrectly. You’ve protected the downstream effects but still have an upstream problem.

So, immutability does not magically confer better programmatic qualities. In fact, it can make things far worse. There are rather obvious places in a system where it makes the code slower, more resource intensive, more complex and way less readable. There is a much darker side to immutability as well.

Performance is the obvious first issue.

It was quite notable in Java that its immutable strings were severely hurting its performance. Code often has to do considerable string fiddling and quite often this is distributed in many locations throughout the program as the final string gets built up by many different computations.

Minimizing this fiddling is a great way to get a micro-boost in speed, but no matter how clever, all of that fiddling never really goes away. Quite obviously, if strings are immutable, then for each fiddle you temporarily double the string, which is both memory usage and CPU time spent copying and cleaning up.

Thus pretty quickly the Java language descended into having special string buffers that were mutable (StringBuffer, and later StringBuilder), and eventually an optimization to hide them when compiling. The takeaway from this is that making things that need to change ‘frequently’ immutable is extraordinarily costly. For data that doesn’t change often, it is fine. For data that changes a lot, it is a bad idea if there are any time or resource constraints.
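As a small illustration of that cost (a sketch only, with invented helper names), consider building up a report line by line. The first version copies the entire string on every append; the second appends into a mutable buffer. Note that the compile-time optimization mentioned above only helps within a single concatenation expression, not across the iterations of a loop:

import java.util.List;

public class Concat {
    // Each '+=' creates a brand-new String, copying everything built so far,
    // so the work grows quadratically with the number of lines.
    static String slowJoin(List<String> lines) {
        String report = "";
        for (String line : lines) {
            report += line + "\n";
        }
        return report;
    }

    // A mutable buffer appends in place, avoiding the repeated copying.
    static String fastJoin(List<String> lines) {
        StringBuilder buffer = new StringBuilder();
        for (String line : lines) {
            buffer.append(line).append('\n');
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("alpha", "beta", "gamma");
        System.out.println(slowJoin(lines).equals(fastJoin(lines))); // true
    }
}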

Performance isn’t nearly as critical as it used to be, but it still matters enough that resources shouldn’t just be pissed away. Most systems, most of the time, could benefit from better performance.

In a similar manner to the Java strings, there was a project that decided that all of the incoming read-only data it was collecting should be stored immutably in the database. It was not events, but just fairly simple static data that was changeable externally.

The belief was that this should provide both safety and a ‘history’ of any incoming changes, so basically they believed it was a means of making the ETL easier (only using inserts) and getting better auditing (keeping every change).

In general, strong decompositions of an underlying problem make it easier and safer to reason about the behavior and tend towards fewer side-effects, but intermixing two distinctly different problems causes unintentional complexity and makes it harder to work through the corner-cases. Decomposition is about separating the pieces, not combining them. We’ve known that for a long time, in that a tangle of badly nested if statements is usually described as spaghetti code. It is hard to reason about and expensive to fix. The pieces need to be separated to be usable.

So it’s not surprising that this persistent, immutable collection of data required overly complex gymnastics in the query language just to be read. Every select from the database has to pick through a much larger set of different but essentially redundant records, each time the data is accessed. The extra storage is huge (since there are extra management costs like secondary indices), the performance is really awful and the semantics of querying the data are convoluted and hopelessly unreadable. They could have added another stage to copy that data into a second schema, but besides being resource intensive and duplicating a lot of data, it would also add a lag time to the processing. Hardly seems better.

If the data was nicely arranged in at least a 3rd NF schema and the data flow was idempotent (correctly using inserts or updates as needed), then simple logging or an audit log would have easily solved that secondary tracking problem, with the extra benefit that it could be retained for a far shorter timeframe than the rest of the data. That easily meets all of the constraints and is a far more manageable approach to solving both problems. Immutability, in this case, goes against the grain of far older and more elegant means of dealing with the requirements.

Since immutability is often sold as a general safety mechanism, it is not uncommon for people to leverage that into misunderstandings as well. For instance, there was one example of a complex immutable data structure that spanned two completely different contexts: memory and persistence. The programmers assumed that the benefits of immutability in one context automatically extended to the other.

A small subset of the immutable structure was in memory, with the full version persisted. This is intrinsically safe. You can build up new structures on top of it in memory and then persist them later when completed.

The flaw, however, is to assume that any and all ‘operations’ on this arrangement preserve that safety just because the data is immutable. If one copies the immutable structure in memory but does not mirror that copy within the other context, then just because the new structure is immutable doesn’t mean it will stay in sync and not get corrupted. There aren’t two things to synchronize anymore, there are three, and that is the problem.

Even if the memory copy is not touched, the copy operation introduces a disconnect between the three contexts that amounts to the data equivalent of a race condition. That is, the copies are not independent of the original or the persisted version; mutable or immutable, that ‘dependence’ dominates every other attribute. So, a delete on the persistent side would not be automatically reflected in the copy. A change to the original would not be reflected either. It is very easy to throw the three different instances out of whack; the individual immutability of each is irrelevant. To maintain its safety, the immutability must span all instances of the data at all structural levels, but that is far too resource intensive to be a viable option. Thus partial immutability causes far more problems and is harder to reason about than just making the whole construct mutable.

Immutability does, however, have some very strong and important usages. Quite obviously, for a functional program in the appropriate language, being able to reason correctly about its behavior is huge. To get any sort of programming quality, at a bare minimum, you have to be able to quickly correct the code after it is initially written, and we’ve yet to find an automated means to ensure that code is bug free with respect to its usage in the real world (it can still be proven to be mathematically correct, but that is a bit different).

In object-oriented languages, using a design pattern like flyweights for a fixed and finite set of data can be a huge performance boost with a fairly small increase in complexity. Immutable objects also help with concurrency and locking. As well, for very strict programming problems, compile-time errors on modifying immutable static data are great, and runtime exceptions when touching immutable data help a lot in debugging.
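As a hedged sketch of the flyweight idea (the class name and the notion of a small, fixed vocabulary are invented for illustration), immutable instances are created once and shared, which also makes them cheap to pass around and safe to hand to multiple threads:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A small, fixed vocabulary of immutable values, created once and shared.
public final class CurrencyCode {
    private static final Map<String, CurrencyCode> CACHE = new ConcurrentHashMap<>();

    private final String code;

    private CurrencyCode(String code) {
        this.code = code;
    }

    // Returns the shared instance for this code, creating it on first use.
    public static CurrencyCode of(String code) {
        return CACHE.computeIfAbsent(code, CurrencyCode::new);
    }

    public String code() { return code; }
}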

Immutability has its place in our toolboxes, but it just needs to be understood that it is not the first approach you should choose and that it never comes for free. It’s an advanced approach that should be fully understood before utilizing it.

We really need to stop searching for silver bullets. Programming requires knowledge and thought, so any promise of what are essentially thoughtless approaches to quickly encode an answer to a sub-problem is almost always dangerous.

In general, if a programmer doesn’t have long running experience or extensively deep knowledge then they should err on the side of caution (which will obviously be less efficient) with a focus on consistency.

Half-understanding a deep and complex approach will always result in a mess. You need to seek out deep knowledge and/or experience first; decades have passed so trying to reinvent that experience on your own in a short time frame is futile; trying to absorb a casual blog post on a popular trick is also futile. The best thing to do is to ensure that the non-routine parts of systems are staffed with experience and that programmers wishing to push the envelope of their skills seek out knowledge and mentors first, not just flail at it hoping to rediscover what is already known.

Routine programming itself is fairly easy and straightforward, but not if you approach it as rocket science. It is really easy to make it harder than it needs to be.

Saturday, December 9, 2017

Sophistication

Computers aren’t smart, but they are great at remembering things and can be both quick and precise. Those qualities can be quite helpful in a fast-paced modern world. What we’d like is for the computer to take care of our problems, while deferring to us on how or when this work should be executed. What we’d like then, is for our software running on those computers to be ‘sophisticated’ enough to make our lives easier.

A simple example of this is a todo list. Let’s say that this particular version can schedule conference calls. It sends out a time and medium to a list of participants, then collects back their responses. If enough or the right people agree, then the meeting is scheduled. That’s somewhat helpful, but the program could go farther. If it is tied to the phone used in the teleconference, it could detect that the current discussion is getting close to going over time and will impact the schedule of any following meetings. At that point, it could discreetly signal the user and inquire if any of the following meetings should be rescheduled. A quite smart version of this might even negotiate with all of the future participants to reoptimize the new schedule to meet most of their demands all on its own.

That type of program would allow you to lean on it in the way that you might rely on a trusted human secretary. They often understand enough of the working context to be able to take some of the cognitive load off their boss, for specific issues. The program's scope is to understand all of the interactions and how they intertwine and to ensure that any rescheduling meets most of the constraints. In other words, it’s not just a brain-dead todo app; it doesn’t just blissfully kick off timers and display dialogues with buttons, instead, it has a sophisticated model of how you are communicating with other people and the ability to rearrange these interactions if one or more of those meetings exceed some threshold. So it’s not really intelligent, but it is far more sophisticated than just an egg timer.

It might seem that it would be a complicated program to write, but that upper-level logic isn’t hard. If the current meeting is running late, prompt the user and possibly reschedule. It just needs to know the current meeting status, have a concept of late, be able to interact with the user and then reschedule the other meetings.

A programmer might easily get lost in the details that would be necessary to craft this logic. It might, for example, be an app running on a phone that schedules a lot of different mediums, like conference calls, physical meetings, etc. It would need timers to check the meeting status, an interface to prompt the user and a lot of code to deal with the intrinsic complexities of quickly rescheduling, including plenty of data to deal with conflicts, unavailable contacts, etc.

It is certainly a non-trivial piece of code and to my knowledge, although parts of it probably appear spread across lots of different software products, none of them have it all together in one place. As one giant piece of code, it would be quite the tangle of timers, events, widgets, protocols, resources, contacts, business logic, conditional blocks of code and plenty of loops. It would be so complicated that it would be nearly impossible to get right and brutal to test. Still, conceptually it’s not rocket science, even if it is a lot of individual moving parts. It’s just sophisticated.

So how would we go about building something this sophisticated? Obviously, as one huge list of instructions, it would be impossible. But clearly, we can describe the higher level logic quite easily. So what we need to do is to make it easy to encode that logic as described:

If the current meeting is running late, prompt the user and possibly reschedule.

If that is all there is to the logic, then the problem is simple. We basically have four ‘parts’ that need to be tied together in a higher level block of code. The current meeting is just some data with respect to time. Running late is a periodic event check. Prompting the user is a simple interface, and rescheduling, while tricky, is a manageable algorithm. Each of these components wraps code and data. This is where encapsulation becomes critical. For each part, if it is a well-defined black box that really encapsulates all of the underlying issues, then we are able to build that sophistication on top. If not, then we’ll get lost in the details.
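To make that concrete, here is a rough sketch of just the upper-level block in Java. All of the interface and class names are invented; each one stands in for a well-encapsulated black box hiding the timers, widgets, protocols and rescheduling details:

// Only the high-level logic lives here; the four parts sit behind interfaces.
interface MeetingSchedule {
    Meeting current();                       // the meeting data
    boolean isRunningLate(Meeting meeting);  // the periodic check
}

interface UserPrompt {
    boolean confirmReschedule(Meeting meeting);  // the simple interface
}

interface Rescheduler {
    void reschedule(Meeting after);          // the tricky but manageable algorithm
}

record Meeting(String title) {}

final class MeetingMonitor {
    private final MeetingSchedule schedule;
    private final UserPrompt prompt;
    private final Rescheduler rescheduler;

    MeetingMonitor(MeetingSchedule schedule, UserPrompt prompt, Rescheduler rescheduler) {
        this.schedule = schedule;
        this.prompt = prompt;
        this.rescheduler = rescheduler;
    }

    // "If the current meeting is running late, prompt the user and possibly reschedule."
    void check() {
        Meeting current = schedule.current();
        if (schedule.isRunningLate(current) && prompt.confirmReschedule(current)) {
            rescheduler.reschedule(current);
        }
    }
}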

Some of the underlying data is shared, so the data captured and its structure need to span these different components. These components are not independent. The current meeting, for example, overlaps with the rescheduling, in that both rely on the underlying list of contacts, the meeting schedule and the communications medium. That implies that under those two components there are at least three more. Having both higher-level components anchor on the same underlying code and data ensures that they will interoperate correctly; thus we need to acknowledge these dependencies and even leverage them as reusable sub-components.

And of course, we want each of these upper-level components to be as polymorphic as possible. That is, we want a general concept of medium, without having to worry about whether that means a teleconference, video conference or a physical meeting. We don’t need to care, so we shouldn’t care.

So, in order to achieve sophistication, we clearly need encapsulation and polymorphism. These three are all tied together, and the last two fall directly under the banner of abstraction. This makes a great deal of sense in that for large programs to build up the necessary high-level capabilities you have to be both tightly organized and to generalize at that higher level.

Without organization and generalization, the code will grow so complex that it will eventually become unmanageable by humans. There is a real physical limit to our ability to visualize larger and larger contexts. If we fail to understand the behavior of enough code, then we can no longer predict what it will do, which essentially is the definition of a fragile and buggy system.

It is worth noting here that we quite easily get to a reasonable decomposition by going at the problem from the top down, but we are actually just building up well-encapsulated components from the bottom up. This contradiction often scares software developers away from wanting to utilize concepts like abstraction, encapsulation, polymorphism, generalization, etc. in their efforts, but without them, they are really limited in what they can achieve.

Software that isn’t sophisticated is really just a fancy means of displaying collected data. Initially, that was useful, but as we collect so much more data from so many sources, having it distributed across dozens of different systems and interfaces becomes increasingly useless. It shifts the burden of stitching it all back together over to the user, and that too will eventually exceed their capacity. So all of that extra effort to collect that data is heavily diminished. It doesn’t solve problems, rather it just shifts them around.

What the next generation of software needs to do is to acknowledge these failings, and start leveraging the basic qualities of computers to be able to really help the users with sophisticated programs. But to get there, we can’t just pound out brute force code on a keyboard. Our means of designing and building this new generation of software has to become sophisticated as well. We have to employ higher-level approaches to designing and building systems. We need to change the way we approach the work.

Sunday, November 19, 2017

Bombproof Data Entry

The title of this post is from a magazine article that I barely remember reading back in the mid-80s. Although I’ve forgotten much of what it said, its underlying points have stuck with me all of these decades.

We can think of any program as having an ‘inside’ and an ‘outside’. The inside of the program is any and all code that a development team has control of; that they can actively change. The outside is everything else. That includes users, other systems, libraries, the OS, shared resources, databases, etc. It even includes code from related internal teams, that is effectively unchangeable by the team in question. Most of the code utilized in large systems is really outside of the program; often there are effectively billions of lines, written by thousands of coders, that could get executed.

The idea is that any data coming from the outside needs to be checked first, and rejected if it is not valid. Bad data should never circulate inside a program.

Each datum within a program is finite. That is, for any given variable, there are only a finite number of different possible values that it can hold. For a data type like floats, there may be a mass of different possibilities, but it is still finite. A set of data, like a string of characters, can essentially be infinite in size, but practically it is usually bounded by external constraints. In that sense, although we can collect a huge depth of permutations, the breadth of the data is usually limited. We can use that attribute to our advantage.

In all programming languages, for a variable like an integer, we allow any value between the minimum and the maximum, and there are usually some out-of-band signals possible as well, like NULL or Overflow. However, most times when we use an integer, only a tiny fraction of this set is acceptable. We might, for example, only want to collect an integer between one and ten. So what we really want to do as the data comes in from the outside is to reject all other possibilities. We only want a variable to hold our tiny little subset, e.g. ‘int1..10’. The tightest possible constraint.

Now, it would be nice if the programming language would allow us to explicitly declare this data type variation and to nicely reject any outside data appropriately, but I believe for all modern languages we have to do this ourselves at runtime. Then, for example, the user would enter a number into a widget in the GUI, but before we accept that data we would call a function on it and possibly trigger a set of Validation Errors (errors on data should never be singular, they should always be sets) if our exact constraints are not met.
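A sketch of that runtime check, in Java with invented names, might look like the following; the constrained type can only ever hold an already-accepted value, and rejections always come back as a set of errors, never a single message:

import java.util.ArrayList;
import java.util.List;

// A stand-in for the 'int1..10' idea: the type refuses to exist unless the
// raw data is exactly within its allowed subset.
public final class BoundedInt {
    private final int value;

    private BoundedInt(int value) {
        this.value = value;
    }

    public static Result parse(String raw, int min, int max) {
        List<String> errors = new ArrayList<>();
        int parsed = 0;
        try {
            parsed = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            errors.add("'" + raw + "' is not a number");
        }
        if (errors.isEmpty() && (parsed < min || parsed > max)) {
            errors.add("value " + parsed + " is outside " + min + ".." + max);
        }
        return errors.isEmpty() ? new Result(new BoundedInt(parsed), errors)
                                : new Result(null, errors);
    }

    public int value() { return value; }

    // Errors are always returned as a collection, never a single message.
    public record Result(BoundedInt value, List<String> errors) {}
}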

Internally, we could pass around that variable ad nauseam, without having to ever check or copy it. Since it made it inside, it is now safe and exactly what is expected. If we needed to persist this data, we wouldn’t need to recheck it on the way to the database, but if the underlying schema wasn’t tightly in sync, we would need to check it on the way back in.

Overall, this gives us a pretty simple paradigm for both structuring all validation code, and for minimizing any data fiddling. However, dealing with only independent integers is easy. In practice, keeping to this philosophy is a bit more challenging, and sometimes requires deep contemplation and trade-offs.

The first hiccup comes from variables that need discontinuous ranges. That’s not too hard, we just need to think of them as concatenated, such as ‘int1..10,20..40’, and we can even allow for overlaps like ‘int1..10,5..7,35,40..45’. Overlaps are not efficient, but not really problematic.

Of course for floating-point, we get the standard mathematical range issues of open/closed, which might look like ‘float[0..1)’, noting of course that my fake notation now clashes with the standard array notation in most languages and that that ambiguity would need to be properly addressed.

Strings might seem difficult as well, but if we take them to be containers of characters and realize that regular expressions do mostly articulate their constraints, we get data types like ‘string*ab?’ which rather nicely restrict their usage. Extending that, we can quite easily apply wildcards and set theory to any sort of container of underlying types. In Data Modeling, I also discussed internal structural relationships such as trees, which can be described and enforced as well. That then mixes with what I wrote in Containers, Collections and Null, so that we can specify variable inter-dependencies and external structure. That’s almost the full breadth of a static model.
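Following the same pattern, a hedged sketch of a regex-constrained string type (the pattern and the class name are purely illustrative) would be:

import java.util.List;
import java.util.regex.Pattern;

// A container of characters that carries its own shape, in the spirit of the
// 'string*ab?' notation above.
public final class OrderCode {
    private static final Pattern SHAPE = Pattern.compile("[A-Z]{3}-[0-9]{4}");

    private final String value;

    private OrderCode(String value) { this.value = value; }

    // Returns the errors that prevent construction; an empty list means the
    // raw text is acceptable.
    public static List<String> validate(String raw) {
        return SHAPE.matcher(raw).matches()
                ? List.of()
                : List.of("'" + raw + "' does not match AAA-9999");
    }

    public static OrderCode of(String raw) {
        List<String> errors = validate(raw);
        if (!errors.isEmpty()) {
            throw new IllegalArgumentException(String.join("; ", errors));
        }
        return new OrderCode(raw);
    }

    public String value() { return value; }
}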

In that way, the user might play with a set of widgets or canvas controls to construct some complicated data model which can be passed through validation and flagged with all of the constraint violations. With that strength of data entry, the rest of the program is nearly trivial. You just have to persist the data, then get it back into the widgets later when requested.

A long time ago, programmers tended towards hardcoding the size of their structures. Often that meant unexpected problems from running into these constraints, which usually involved having to quickly re-deploy the code. Over the decades, we’ve shifted far away from those types of limitations and shouldn’t go backward. Still, the practical reality of most variable-sized data in a system is that there are now often neglected external bounds that should be in place.

For example, in a system that collects users’ names, if a key feature is to always produce high-quality documentation, such as direct client marketing materials, the need for the presentation to never be broken dictates the effective number of characters that can be used. You can’t, for example, have a first name that is 3000 characters long, since it would wrap over multiple lines in the output and look horrible. That type of unspecified data usage constraint is mostly ignored these days but commonly exists within most data models.

Even if the presentation is effectively unconstrained, resources are generally not. Allowing a user to have a few different sets of preferences is a nice feature, but if enough users abuse that privilege then the system could potentially be crippled in performance.

Any and all resources need to be bounded, but modifying those bounds should be easily configurable. We want both sides of this coin.

In an Object-Oriented language, we might choose to implement these ideas by creating a new object for every unique underlying constrained type. We would also need an object for every unique Container and every unique Collection. All of them would essentially refuse to instantiate with bad data, returning a set of error messages. The upper-level code would simply leverage the low-level checks, building them up until all of the incoming data was finally processed. Thus validation and error handling become intertwined.

Now, this might seem like a huge number of objects, but once each one was completed it would be highly leverageable and extremely trustworthy. Each time the system is extended, the work would actually get faster, since the increasing majority of the system’s finite data would already have been both crafted and tested. Building up a system in this manner is initially slower, but the growing reductions in bugs, testing, coding, etc. ultimately win the day.

It is also possible to refactor an existing system into this approach, rather mindlessly, by gradually seeking out any primitives or language objects and slowly replacing them. This type of non-destructive refactoring isn’t particularly enjoyable, but it's safe work to do when you don’t feel like thinking, and it is well worth it as the system grows.

It is also quite possible to apply this idea to other programming paradigms, in a sense this is really just a tighter variation on the standard data structure ideas. As well as primitive functions to access and modify the data, there is also a means to validate and return zero or more errors. Processing only moves forward when the error count is zero.

Now one big hiccup I haven’t yet mentioned is cross-variable type constraints. The most common example is that a user selecting Canada for country should then be able to pick from a list of Provinces, but if they select the USA, it should be States. We don’t want them to have to select from the rather huge, combined set of all sub-country breakdowns, that would be awful. The secondary dependent variable effectively changes its enumerated data type based on the primary variable. For practical reasons, I am going to assume that all such inter-type relationships are invertible (mostly to avoid confusing the users). So we can cope with these types of constraints by making the higher-level types polymorphic down to some common base type. This might happen as a union in some programming languages or an underlying abstract object in others. Then the primary validation would have to check the variable, then switch on its associated sub-type validation for every secondary variable. This can get a bit complex to generalize in practice, but it is handled by essentially moving up to the idea of composite variables that know their sub-type inter-relationship mappings. In our case above, Location would be aware of the Country and Province/State constraints, with each subtype knowing its own enumerations.
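A minimal sketch of that composite variable (in Java, with tiny invented enumerations) might look like this; the primary value selects which enumeration constrains the secondary one:

import java.util.List;
import java.util.Map;
import java.util.Set;

// The composite variable owns the mapping from the primary value (country)
// to the enumeration that constrains the secondary one (province or state).
public final class Location {
    private static final Map<String, Set<String>> SUBDIVISIONS = Map.of(
            "Canada", Set.of("Ontario", "Quebec", "British Columbia"),
            "USA", Set.of("New York", "California", "Texas"));

    private final String country;
    private final String subdivision;

    private Location(String country, String subdivision) {
        this.country = country;
        this.subdivision = subdivision;
    }

    // The primary value is checked first, then the secondary value is checked
    // against the enumeration that the primary selected.
    public static List<String> validate(String country, String subdivision) {
        Set<String> allowed = SUBDIVISIONS.get(country);
        if (allowed == null) {
            return List.of("unknown country '" + country + "'");
        }
        if (!allowed.contains(subdivision)) {
            return List.of("'" + subdivision + "' is not valid for " + country);
        }
        return List.of();
    }

    public static Location of(String country, String subdivision) {
        List<String> errors = validate(country, subdivision);
        if (!errors.isEmpty()) {
            throw new IllegalArgumentException(String.join("; ", errors));
        }
        return new Location(country, subdivision);
    }
}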

So far everything I’ve talked about is with respect to static data models. This type of system has extremely rigid constraints on what it will and will not accept. That actually covers the bulk of most modern systems, but it is not very satisfying since we really need to build smarter and more adaptable technologies. The real world isn’t static so our systems to model and interact with it shouldn’t be either.

Fortunately, if we see these validations as code, and in this case easily computable code, then we can ‘nounify’ that code into data itself. That is, all of the rules that constrain the variables can be moved around as variables themselves. All we need to do is allow the outside to have the power to request extensions to our data model. We can do that by explicitly asking for the specific data type, the structural changes, and some reference or naming information. Outside, users or code can explicitly or even implicitly supply this information, which is passed around internally. When it does need to go external again, say to the database, essentially the inside code follows the exact same behavior in requesting a model extension. For any system that supports this basic dynamic model protocol, it will just be seamlessly interconnected. The sophistication of all of the parts will grow as the underlying data models are enhanced by users (with the obvious caveat that there needs to be some configurable approach to backfilling missing data as well).

We sort of do this right now, when we drive domain data from persistence. Using Country as an example again, many systems will store this type of enumerated set in a domain table in the database and initialize it at startup. More complex systems might even have an event mechanism to notify the runtime code to re-initialize. These types of system features are useful for both constraining the initial implementation, but also being able to adjust it on-the-fly. Rather than do this on an ad hoc basis though, it would be great if we could just apply it to the whole data model, to be used as needed.

Right now we spend an inordinate amount of effort hacking in partial checks and balances in random locations to inconsistently deal with runtime data quality. We can massively reduce that amount of work, making it easier and faster to code, by sticking to a fairly simple model of validation. It also has the nifty side-effect that it will reduce both bloat and CPU usage. Obviously, we’ve known this since at least the mid-80s, it just keeps getting lost and partially reinvented.

Saturday, October 28, 2017

Freedom

A big part of programming culture is the illusion of freedom.

We pride ourselves on how we can tackle modern problems by creatively building new and innovative stuff. When we build for other programmers, we offer a mass of possible options because we don’t want to be the ones who restrict their freedoms.

We generally believe that absolutely every piece of code is malleable and that releasing code quickly is preferable over stewing on it for a long time. In the last decade, we have also added in that coding should be fun, as a fast-paced, dynamic response to the ever-changing and unknowable needs of the users. Although most of us are aware that the full range of programming knowledge far exceeds the capacity of our brains, we still express overconfidence in our abilities while contradictorily treating a lot of the underlying mechanics as magic.

Freedom is one of our deepest base principles. It appears to have its origin in the accelerated growth of the industry and the fact that every 5 to 10 years some brand new, fresh, underlying technology springs to life. So the past, most programmers believe, has very little knowledge to help with these new emerging technologies.

That, of course, is what we are told to believe and it has been passed down from generations of earlier programmers. Rather than fading, as the industry ages, it has been strengthening over the last few decades.

However, if we examine programming closely, it does not match that delusion. The freedoms that we proclaim and try so hard to protect are the very freedoms that keep us from moving forward. Although we’ve seen countless attempts to reverse this slide, as the industry grows most new generations of programmers cling to these beliefs and we actively ignore or push out those that don’t.

Quite obviously, if we want a program to even be mildly ‘readable’, we don’t actually have the freedom to name the variables anything we want. They are bound to the domain; being inconsistent or misusing terminology breeds confusion. It's really quite evident that if a piece of data flows through a single code base, it should always have the same name everywhere, but each individual programmer doesn’t want to spend their time finding the names of existing things, so they call it whatever they want. Any additional programmers then have to cope with multiple shifting names for the exact same thing, thus making it hard to quickly understand the flow.

So there isn’t really any freedom to name variables ‘whatever’. If the code is going to be readable then names need to be correct, and in a large system, those names may have been picked years or even decades earlier. We’ve always talked a lot about following standards and conventions, but even cursory inspections of most large code bases show that it usually fails in practice.

We love the freedom to structure the code in any way we choose. To creatively come up with a new twist on structuring the solution. But that only really exists at the beginning of a large project. In order to organize and build most systems, as well as handle errors properly, there aren’t really many degrees of freedom. There needs to be a high-level structure that binds similar code together in an organized way. If that doesn’t exist, then it’s just a random pile of loosely associated code, with some ugly spaghetti interdependencies. Freely tossing code anywhere will just devalue it, often to the point of hopelessness. Putting it with other, similar pieces of code will make it easy to find and correct or enhance later.

We have a similar issue with error handling. It is treated as the last part of the effort, to be sloppily bolted on top. The focus is just on getting the main logic flow to run. But the constraints of adding proper error handling have a deep and meaningful impact on the structure of the underlying code. To handle errors properly, you can’t just keep patching the code at whatever level, whenever you discover a new bug. There needs to be a philosophy or structure that is thought out in advance of writing the code. Without that, the error handling is just arbitrary, fragile and prone to not working when it is most needed. There is little freedom in properly handling errors; the underlying technology, the language and the constraints of the domain mostly dictate how the code must be structured if it is going to be reliable.

With all underlying technologies, they have their own inherent strengths and weaknesses. If you build on top of badly applied or weak components, they set the maximum quality of your system. If you, for instance, just toss some randomish denormalized tables into a jumble in an RDBMS, no matter how nice the code that sits on top, the system is basically an untrustworthy, unfixable, mess. If you fix the schema problems, all of the other code has to change. Everything. And it will cost at least twice as long to understand and correct it, as it would to just throw it away and start over. The value of code that ‘should’ be thrown away is near zero or can even be negative.

So we really don’t have the freedom to just pick any old technology and to use it in any old, creative, way we choose. That these technologies exist and can or should be used, bounds their usage and ultimately the results as well. Not every tech stack is usable for every system and picking one imposes severe, mostly unchangeable limitations. If you try to build a system for millions of users on top of a framework that can only really handle hundreds of them, it will fail. Reliably.

It might seem like we have the freedom to craft new algorithms, but even here it is exceptionally limited. If, for example, you decide to roll your own algorithm from scratch, you are basically taking a huge step backward, maybe decades. Any existing implementations may not be perfect, but generally, the authors have based their work on previous work, so that knowledge percolates throughout the code. If they didn’t do that research then you might be able to do better, but you’ll still need to at least know and understand more than the original author. In an oversimplified example, one can read up on sorting algorithms and implement a new variant or just rely on the one in their language’s library. Trying to work it out from first principles, however, will take years.

To get something reasonable, there isn’t much latitude, thus no freedom. The programmers that have specialized in implementing algorithms have built up their skills by having deep knowledge of possibly hundreds of algorithms. To keep up, you have to find that knowledge, absorb it and then be able to practically apply it in the specific circumstance. Most programmers aren’t willing to do enough legwork, thus what they create is crude. At least one step backward. So, practically they aren’t free to craft their own versions; rather, they are chained by time and their own capabilities.

When building for other programmers it is fun to add in a whole boatload of cool options for them to play with, but that is only really useful if changing any subset of options is reliable. If the bulk of the permutations have not been tested and changing them is very likely to trigger bugs, then the options themselves are useless. Just wasted make-work. Software should always default to the most reasonable, expected options and the code that is the most useful fully encapsulates the underlying processing.

So, options are not really that useful, in most cases. In fact, in the name of sanity, most projects should mostly build on the closest deployments to the default. Reconfiguring the stuff to be extra fancy increases the odds of running into really bad problems later. You can’t trust that the underlying technology will hold any invariants over time, so you just can’t rely on them, and it should be a very big and mostly unwanted decision to make any changes away from the default. Unfortunately, the freedom to fiddle with the configurations is tied to causing seriously bad long-term problems. Experience makes one extraordinarily suspicious of moving away from the default unless your back is up against the wall.


As far as interfaces go, the best ones are the ones that the user expects. A stupidly clever new alternative on a date selection widget, for example, is unlikely to be appreciated by the users, even if it does add some new little novel capability. That break from expectations generally does not offset a small increase in functionality. Thus, most of the time, the best interface is the one that is most consistent with the behaviors of the other software that the user uses. People want to get their work done, not to be confused by some new ‘odd’ way of doing things. In time, they do adjust, but for them to actually ‘like’ these new features, they have to be comfortably close to the norm. Thus, for most systems, most of the time, the user’s expectations trump almost all interface design freedoms. You can build something innovative, but don’t be surprised if the users complain and gradually over time they push hard to bring it back to the norm. Only at the beginning of some new technological shift is there any real freedom, and these are also really constrained by previous technologies and by human-based design issues.

If we dig deeply into software development, it becomes clear that it isn’t really as free as we like to believe. Most experienced programmers who have worked on the same code bases for years have seen that over time the development tends to be refined, and it is driven back towards ‘best practices’ or at least established practices. Believing there is freedom and choosing some eclectic means of handling the problems is usually perceived as ‘technical debt’. So ironically, we write quite often about these constraints, but then act and proceed like they do not exist. When we pursue false freedoms, we negatively affect the results of the work. That is, if you randomly name all of the variables in a system, you are just shooting yourself in the foot. It is a self-inflicted injury.

Instead, we need to accept that for the bulk of our code, we aren’t free to creatively whack it out however we please. If we want good, professional, quality code, we have to follow fairly strict guidelines, and do plenty of research, in order to ensure that result.

Yes, that does indeed make coding boring. And most of the code that programmers write should be boring. Coding is not and never will be the exciting creative part of building large systems. It’s usually just the hard, tedious, pedantic part of bringing a software system to life. It’s the ‘working’ part of the job.

But that is not to say that building software is boring. Rather, it is that the dynamic, creative parts of the process come from exploring and analyzing the domain, and from working around those boundaries in the design. Design and analysis, for most people, are the interesting parts. We intrinsically know this, because we’ve so often seen attempts to inject them directly into coding. Collectively, most programmers want all three parts to remain blurred together so that they don’t just end up as coding monkeys. My take is that we desperately need to separate them, but software developers should be trained and able to span all three. That is, you might start out coding from boring specs, learn enough to handle properly designing stuff and in time, when you’ve got some experience in a given domain, go out and actually meet and talk to the users. One can start as a programmer, but they should grow into being a software developer as time progresses. Also, one should never confuse the two. If you promote a programmer to run a large project, they don’t necessarily have all of the necessary design and analysis skills, and it doesn’t matter how long they have been coding. Missing skills spell disaster.

Freedom is a great myth and selling point for new people entering the industry, but it really is an illusion. It is sometimes there, at very early stages, with new technologies, but even then it is often false. If we can finally get over its lack of existence, and stop trying to make a game out of the process, we can move forward on trying to collect and enhance our already massive amount of knowledge into something that is more reliably applied. That is, we can stop flailing at the keyboards and start building better, more trustworthy systems that actually do make our lives better, instead of just slowly driving everyone nuts with this continuing madness and sloppy results we currently deliver.

Sunday, October 15, 2017

Efficiency and Quality

The strongest process for building software would be to focus on getting the job done efficiently while trying to maximize the quality of the software.

The two objectives are not independent, although it is easy to miss their connection. In its simplest terms, efficiency would be the least amount of work to get the software into production. Quality then might seem to be able to range freely.

However, at one end of the spectrum, if the quality is really bad, we do know that it sucks in tonnes of effort to just patch it up enough to barely work. So, really awful quality obviously kills efficiency.

At the other end though, it most often seems like quality rests on an exponential scale. That is, getting the software to be 1% better requires something like 10x more work. At some point, perfection is likely asymptotic, so dumping mass amounts of effort into just getting trivial increases in quality also doesn’t seem to be particularly efficient either.

Are they separated in the middle?

That too is unlikely in that quality seems to be an indirect byproduct of organization. That is, if the team is really well-organized, then their work will probably produce a rather consistent quality. If they are disorganized, then at times the quality will be poor, which drags down the quality of the whole. Disorganization seems to breed inefficiency.

From practical experience, mostly I’ve seen that rampant inefficiency results in consistently poor quality. They go hand in hand. A development team not doing the right things at the right time, will not accidentally produce the right product. And they certainly won’t do it efficiently.

How do you increase both?

One of the key problems with our industry is the assumption that ‘code’ is the only vital ingredient in software. That if there were only more ‘code’, that would fix everything. Code, however, is just a manifestation of the collected understanding of the problem to solve. It’s the final product, but certainly not the only place where quality and efficiency are noticeable.

Software development has 5 distinct stages: it goes from analysis to design, then into coding and testing and finally, it is deployed. All five stages must function correctly for the effort to be efficient. Thus, even if the coding was fabulous, the software might still be crap.

Little or hasty analysis usually results in unexpected surprises. These both eat into efficiency, but also force short-cuts which drag down quality. If you don’t know what you are supposed to build or what data it will hold then you’ll end up just flailing around in circles hoping to get it right.

Designing a system is mostly organization. A good design lays out strong lines between the pieces and ensures that each line of code is located properly. That helps in construction and in testing. It’s easier to test well-written code, and it's inordinately cheaper to test code if you properly know the precise impact of any of the changes.

On top of that, you need to rely on the right technologies to satisfy any engineering constraints. If you pick the wrong technologies then no amount of effort will ever meet the specifications. That’s a key aspect of design.

If the operational requirements never made it into the analysis and design, the costs of supporting the system will be extraordinary, which will eventually eat away at the resources available for the other stages. A badly performing system will rapidly drag down its own development.

Realistically, a focus on efficiency, across all of the stages, is the best way to ensure maximum quality. There are other factors, of course, but unless the software development process is smoothly humming along, they aren’t going to make a significant impact. You have to get the base software production flow correct first, then focus on refining it to produce high-quality output. It’s not just about coding. Generating more bad code is never going to help.

Sunday, October 1, 2017

Rainy Days

When first we practice to code, we do, of course, worry most about branching between different instructions and repeating similar blocks of work over and over again.

In time, we move on to longer and more complex manipulations.

Once those start to build up, we work to cobble them together. Continually building up larger and larger arrangements, targeting bigger features within the software.

That’s nice and all, but since we’ve usually only focused on getting the code to run, it really is only reliable on ‘sunny’ days. Days when there are no clouds, no rain, no wind, etc. They are warm and pleasant. When everything just works.

Code, however, needs to withstand all of the elements; it runs in a cruel, cruel world where plenty of unexpected things occur at regular frequencies. Storms come often, without warning.

Rainy day coding is a rather different problem to solve. It means expecting the worst and planning for it. It also means building the code in a way that its rainy day behavior is both predictable and easily explainable.

For instance, any and all resources might be temporarily unavailable. And they might go down in any number of permutations. What happens then should be easily explainable, and it should match that explanation. The code should also ensure that during the outages no data is lost, no user is left uninformed, and no question goes unanswered. It needs to be trustworthy.

Any data coming from the outside might be wrong. Any events might occur in a weird order. Any underlying technology might suddenly behave erratically. Any piece of hardware might fail. Anything at all could happen...
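
To make that concrete, here is a minimal Python sketch of two rainy day habits: retrying a temporarily unavailable resource with backoff instead of failing silently, and validating outside data before trusting it. The names and limits (TemporarilyUnavailable, call_with_retries, five attempts) are invented for illustration, not taken from any particular system.

    import random
    import time

    class TemporarilyUnavailable(Exception):
        """Raised when a resource is down but might recover."""

    def call_with_retries(operation, attempts=5, base_delay=0.5):
        # Retry a flaky operation with exponential backoff and a little jitter.
        # On the final failure, re-raise so the outage is visible and
        # explainable rather than silently swallowed.
        for attempt in range(1, attempts + 1):
            try:
                return operation()
            except TemporarilyUnavailable:
                if attempt == attempts:
                    raise
                delay = base_delay * (2 ** (attempt - 1))
                time.sleep(delay + random.uniform(0, delay))

    def parse_price(raw):
        # Outside data might be wrong; check it before letting it in.
        value = float(raw)                # raises ValueError on garbage input
        if value <= 0:
            raise ValueError("price must be positive, got %r" % (raw,))
        return value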

If that sounds like a lot of work, it is! It is at least an order of magnitude harder than sunny day coding. It involves a lot of digging to fully understand rainy days, or tornadoes, or hurricanes or even earthquakes. You can’t list out the correct instructions for a rainy day if you’ve never thought about or tried to understand them.

As software eats the world, it must live up to the cruelties that exist out there. When it doesn’t, it might be temporarily more convenient on the sunny days, but it can actually spawn off more work than it saves when it rains. The overall effect can be quite negative. We had been gradually learning how to write robust systems, but as the underlying complexity spiraled out of control, those skills diminished. Too many sunny days and too little patience have driven us to rely on fragile systems while deliberately ignoring the consequences. If we want to get the most out of computers, we’ll have to change this...

Sunday, September 17, 2017

Decisions

We can model large endeavors as a series of decisions. Ultimately, their success relies on getting work completed, but the underlying effort cannot even be started until all of the preceding decisions are made. The work can be physical or it can be intellectual or it can even be creative.

If there are decisions that can be postponed, then we will adopt the convention that they refer to separate, but related pieces of work. They can occur serially with the later piece of work relying on some of the earlier decisions as well as the new ones. Some decisions are based on what is known up-front, while others can’t be made until an earlier dependent bit of work is completed.

For now, we’ll concentrate on a single piece of work that is dependent on a series of decisions. Later we can discuss parallel series and how they intertwine.

Decisions are rarely ever ‘right’ or ‘wrong’, so we need some other metric for them. In our case, we will use ‘quality’. We will take the decision relative to its current context and then talk about ‘better’ or ‘worse’ in terms of quality. A better decision will direct the work closer to the target of the endeavor, while a worse one will stray farther away. We’ll treat each decision as a point in a continuous range, which we can bound between 0.0 and 100.0 for convenience. This allows us to map decisions back to the grayness of the real world.

We can take the quality of any given decision in the series as being relative to the decision before it. That is, even if there are 3 ‘worse’ decisions in a row, the 4th one can be ‘better’. It is tied to the others, but it is made in a sub-context and only has a limited range of outcomes.

We could model this in a demented way as a continuous tree whose leaves fall onto the final continuous range of quality for the endeavor itself. So, if the goal is to build a small piece of software, at the end it has a specific quality that is made up from the quality of its subparts, which are directly driven by the decisions made to get each of them constructed. Some subparts will more heavily weight the results, but all of it contributes to the quality. With software, there is also a timeliness dimension, which in many cases would outweigh the code itself. A very late project could be deemed a complete failure.
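
To make the roll-up concrete, here is a tiny Python sketch; the subparts, weights and numbers are invented purely for illustration, and the weighted average is just one simple way to combine them.

    def rollup_quality(subparts):
        # subparts: list of (quality, weight) pairs, quality in [0.0, 100.0].
        # The endeavor's quality is the weighted average of its subparts,
        # each of which is ultimately driven by the decisions behind it.
        total_weight = sum(weight for _, weight in subparts)
        return sum(quality * weight for quality, weight in subparts) / total_weight

    # Three subparts; the second is weighted most heavily.
    print(rollup_quality([(80.0, 1.0), (55.0, 3.0), (90.0, 1.0)]))  # 67.0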

To keep things clean, we can take each decision itself to be about one and only one degree of variability. If some underlying complex choice involves many non-independent variables, representing it as a set of decisions leaves room to acknowledge that the decision makers may not have realized the interdependence. Thus the collection of decisions together may not have been rational on the whole, even if each individual decision was rational. In this sense, we need to define ‘rational’ as full consideration of everything necessary or known relative to the current series of decisions. That is, one may make a rational decision in the middle of a rather irrational endeavor.

Any decision at any moment would be predicated both on an understanding of the past and the future. For the past, we accrue a great deal of knowledge about the world. All of it for a given subject would be its depth. An ‘outside’ oversimplification of it would be shallow. The overall understanding of the past would then be about the individual’s or group’s depth that they have in any knowledge necessary to make the decision. Less depth would rely on luck to get better quality. More depth would obviously decrease the amount of luck necessary.

Looking towards the future is a bit trickier. We can’t know the future, but we can know the current trajectory. That is, if uninterrupted, what happened in the past will continue. When it is interrupted, we assume the interruption comes from a specific event. That event may have occurred in the past, so we have some indication that it is, say, a once-in-a-year event. Or once-in-a-decade. Obviously, if the decision makers have been around an area for a long time, then they will have experienced most of the lower frequency events, thus they can account for them. A person with only a year’s experience will get caught by surprise at a once-in-a-decade event. Most people will get caught by surprise at a once-in-a-century event. Getting caught by surprise is getting unlucky. In that sense, a decision that is low quality because of luck may indicate a lack of past or future understanding, or both. Someone lucky may look like they possess more knowledge of both than they actually do.

If we have two parallel series of decisions, the best case is that they may be independent. Choices made on one side will have no effect on the work on the other side. But it is also possible that the decisions collide. One decision will interfere with another and thus cause some work to be done incorrectly or become unnecessary. These types of collisions can often be the by-product of disorganization. The decision makers on both sides are unaware of the overlap because the choice to pursue parallel efforts was not deliberate.

This shows that some decisions are implicitly made by willingly or accidentally choosing to ignore specific aspects of the problem. So, if everyone focuses on a secondary issue instead of what is really important, that in itself is an implicit decision. If the choices are made by people without the prerequisite knowledge or experience, that too is an implicit decision.

Within this model then, even for a small endeavor, we can see that there are a huge number of implicit and explicit decisions and that in many cases it is likely that there are many more implicit ones made than explicit. If the people are inexperienced, then we expect the ratio to have a really high multiplier and that the quality of the outcome then relies more heavily on luck. If there is a great deal of experience, and the choices made are explicit, then the final quality is more reflective of the underlying abilities.

Now all decisions are made relative to their given context and we can categorize these across the range of being ‘strategic’ or ‘tactical’. Strategic decisions are either environmental or directional. That is, at the higher level someone has to set up the ‘game’ and point it in a direction. Then as we get nearer to the actual work, the choices become more tactical. They get grounded in the detail, become quite fine-grained and although they can have severe long-term implications, they are about getting things accomplished right now. Given any work, there are an endless number of tiny decisions that must be made by the worker, based on their current situation, the tactics and hopefully the strategic direction. So, with this, we get some notion of decisions having a scope and ultimately an impact. Some very poor choices by a low-level core programmer on a huge project, for instance, can have ramifications that literally last for decades, and that degrade any ability to make larger strategic decisions.

In that sense, for a large endeavor, the series of decisions made, both large and small, accumulate together to contribute to the final quality. For long running software projects, each recurring set of decisions for each release builds up not only a context, but also boundaries that intrinsically limit the best and worst possible outcomes. Thus a development project 10 years in the making is not going to radically shift direction since it is weighted down by all of its past decisions. Turning large endeavors slows as they build up more choices.

In terms of experience, it is important for everyone involved at the various levels to understand whether they have the knowledge and experience to make a particular decision, and also whether they are the right person at the right level to make it. An upper-level strategic choice to set some weird programming convention, for example, is probably not appropriate if the management does not understand the consequences. A lower-level choice to shove in some new technology that is not in line with the overall direction is equally dubious. As the work progresses, bad decisions at any level will reverberate throughout the endeavor, reducing not only the quality but also the possible effectiveness of any upcoming decisions. In this way, one strong quality of a ‘highly effective’ team is that the right decisions get made at the right level by the right people.

The converse is also true. In a project that has derailed, a careful study of the accumulated decisions can lead one back through a series of bad choices. That can be traced way back, far enough, to find earlier decisions that should have taken better options.

It’s extraordinarily hard to choose to reverse a choice made years ago, but if it is understood that the other outcomes will never be positive enough, it can be easier to weigh all of the future options correctly. While this is understandable, in practice we rarely see people with the context, knowledge, tolerance and experience to safely make these types of radical decisions. More often, they just stick to the same gradually decaying trajectory or rely entirely on pure luck to reset the direction. Thus the status quo is preserved or ‘change’ is pursued without any real sense of whether it might be actually worse. We tend to ping-pong between these extremities.

This model then seems to help us understand why complexity just grows and why it is so hard to tame. At some point, things can become so complex that the knowledge and experience needed to fix them is well beyond the capacity of any single human. If many of the prior decisions were increasingly poor, then any sort of radical change will likely just make things worse. Unless one can objectively analyze both the complexity and the decisions leading to it, there is no way to get back to the limited set of decisions that need to be reversed, and so at some point, the amount of luck necessary will exceed that of winning the lottery. In that sense, we have seen that arbitrary change and/or oversimplifications generally make things worse.

Once we accept that the decisions are the banks of the river that the work flows through, we can orient ourselves towards trying to architect better outcomes. Given an endeavor proceeding really poorly, we might need to examine the faults and reset the processes, environment or decision makers appropriately to redirect the flow of work in a more productive direction. This doesn’t mean just focusing on the workers or just focusing on the management. All of these levels intertwine over time, so unwinding them enough to improve things is quite challenging. Still, it is likely that if we start with the little problems, and work them back upwards while tracking how the context grows, at some point we should be able to identify significant, reversible poor decisions. If we revisit those, with enough knowledge and experience, we should be able to identify better choices. This gives us a means of taking feedback from the outcomes and reapplying it to the process, so that we can have some confidence that the effects of the change will be positive. That is, we really shouldn’t be relying on luck to make changes, but avoiding that is far from trivial and requires deeper knowledge.

Sunday, September 10, 2017

Some Rules

Big projects can be confusing, and people rarely have the time or energy to think deeply about their full intertwined complexity. Not thinking enough is often the start of serious problems.

Programmers would prefer to concentrate on one area at a time, blindly following established rules for the other areas. Most of our existing rules for software are unfortunately detached from each other or are far too inflexible, so I created yet another set:

  1. The work needs to get used
  2. There is never enough time to do everything properly
  3. Never lie to any users
  4. The code needs to be as readable as possible
  5. The system needs to be fully organized and encapsulated
  6. Don’t be clever, be smart
  7. Avoid redundancy, it is a waste of time

The work needs to get used

At the end of the day, stuff needs to get done. No software project will survive unless it actually produces usable software. Software is expensive, someone is paying for it. It needs backers and they need constant reassurance that the project is progressing. Fail at that, nothing else matters.

There is never enough time to do everything properly

Time is the #1 enemy of software projects. It’s lack of time that forces shortcuts which build up and spiral the technical debt out of control. Thus, using the extremely limited time wisely is crucial to surviving. All obvious make-work is a waste of time, so each bit of work that gets done needs to have a well-understood use; it can’t just be “because”. Each bit of code needs to do what it needs to do. Each piece of documentation needs to be read. All analysis and design needs to flow into helping to create actual code. The tests need to have the potential to find bugs.

Still, even without enough time, some parts of the system require more intensive focus, particularly if they are deeply embedded. Corners can be cut in some places, but not everywhere. Time lost for non-essential issues is time that is not available for this type of fundamental work. Building on a bad foundation will fail.

Never lie to any users

Lying causes misinformation, misinformation causes confusion, confusion causes a waste of time and resources, lack of resources causes panic and short-cuts which results in a disaster. It all starts with deviating from the truth. All of the code, data, interfaces, discussions, documentation, etc. should be honest with all users (including other programmers and operations people); get this wrong and a sea of chaos erupts. It is that simple. If someone is going to rely on the work, they need it to be truthful.

The code needs to be as readable as possible

The code in most systems is the only documentation that is actually trustworthy. In a huge system, most knowledge is long gone from people's memory, the documentation has gone stale, so if the code isn't readable that knowledge is basically forgotten and it has become dangerous. If you depend on the unknown and blind luck, expect severe problems. If, however, you can actually read the code quickly, life is a whole lot easier.

The system needs to be fully organized and encapsulated

If the organization of the code, configuration, documentation or data is a mess, you can never find the right parts to read or change when you need to. So it all needs organization, and as the project grows, the organizational schemes need to grow with it and they need organization too, which means that one must keep stacking more and higher levels of organization on top. It is never ending and it is part of the ongoing work that cannot be ignored. In huge systems, there is a lot of stuff that needs organization.

Some of the intrinsic complexity can be mitigated by creating black boxes; encapsulating sub-parts of the system. If these boxes are clean and well-thought out, they can be used easily at whatever level is necessary and the underlying details can be temporarily ignored. That makes development faster and builds up ‘reusable’ pieces which leverages the previous work, which saves time and of course, not having time is the big problem. It takes plenty of thought and hard work to encapsulate, but it isn’t optional once the system gets large enough. Ignoring or procrastinating on this issue only makes the fundamental problems worse.

Don’t be clever, be smart

Languages, technologies and tools often allow for cute or clever hacks. That is fine for a quick proof of concept, but it is never industrial strength. Clever is the death of readability and organization, and it is indirectly a subtle form of lying, so it causes lots of grief and eats lots of time. If something clever is completely and totally unavoidable, then it must be well-documented and totally encapsulated. But it should be initially viewed as stupendously dangerous and wrong. Every other option should be tried first.
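
As a small, invented illustration of the difference: both functions below test whether a number is a power of two, but the first leans on a bit trick while the second reads like the definition. Only one of them can be skimmed safely at 2 a.m.

    def is_power_of_two_clever(n):
        # Compact, but opaque to most readers and easy to get subtly wrong.
        return n != 0 and (n & (n - 1)) == 0

    def is_power_of_two_smart(n):
        # Longer, but it reads like the definition and needs no explanation.
        if n < 1:
            return False
        while n % 2 == 0:
            n //= 2
        return n == 1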

Avoid redundancy, it is a huge waste of time

It is easy to stop thinking and just type the same stuff in over and over again. Redundancy is convenient, but it is ultimately a massive waste of time. Most projects have far more time than they realize, but they waste it foolishly. Redundancy makes the system fragile and reduces its quality. It generates more testing. It may sometimes seem like the easiest approach, but it isn’t. Applying brute force to just pound out the ugliest code possible may be initially faster, but it is ultimately destructive.
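
A tiny, made-up example of the trade-off: the copy-pasted check has to be re-tested and re-fixed everywhere it appears, while the shared helper gives one place to read, test and change. The record and field names are invented.

    def save_customer(record, store):
        # Copy-pasted validation: the same check gets repeated in every
        # save_xxx function, and each copy can drift out of sync.
        if not record.get("name") or not record.get("email"):
            raise ValueError("missing required field")
        store.append(record)

    def require_fields(record, fields):
        # One shared helper: a single place to read, test and fix.
        for field in fields:
            if not record.get(field):
                raise ValueError("missing required field: " + field)

    def save_customer_v2(record, store):
        require_fields(record, ["name", "email"])
        store.append(record)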

Overall

There really is no finite, well-ordered set of one-size-fits-all rules for software development, but this set, in this order, will likely bypass most of the common problems we keep seeing over the decades. Still, software projects cannot be run without thinking deeply about the interrelated issues and constantly correcting for the ever-changing context. Keeping a big project moving forward efficiently requires a lot of special skills and a large knowledge base of why projects fail. The only real way to gain these skills is to get mentored on a really successful project. At depth, many aspects of developing big systems are counter-intuitive to outsiders armed only with oversimplifications. That is, with no strong prior industry experience, most people will misunderstand where they need to focus their attention. If you are working hard on the wrong area of the project, the neglected areas are likely to fail.

Oftentimes, given an unexpectedly crazy context, many of the rules have to be temporarily dropped, but they should never be forgotten, and all attempts should be made to return the project to a well-controlled, proactive state. Just accepting the madness and assuming that it should be that way is the road to failure.

Monday, September 4, 2017

Data Modeling

The core of most software systems is the collection of data. This data is rarely independent; there are many inter-relationships and outward associations with the real world. Some of these interconnections form the underlying structure of the data.

If the system doesn’t collect the data or if the data collected is badly structured, it can be extraordinarily expensive to fix it later. Most often, deficiencies in the data that are too expensive to fix lead the software developers to overlay convoluted hacks in an attempt to hide the problems. Long-term application of this strategy eventually degrades the system to the point of uselessness.

As such, data forms the foundation for all software systems. Code without data is useless, and unlike data, code can be easily changed later. This means that when building big complex systems, more care should be spent on ensuring that the necessary data is collected and structured properly than on any other aspect of the system. Getting the data wrong is a fatal mistake that often takes years to fully manifest itself.

A common way to model data is with an ER diagram. This decomposes the data into a set of ‘entities’ and provides a means of specifying ‘relationships’ between them. The relationships are characterized by cardinality, so they can be zero, one or many on either side. Thus one entity might be connected to many different related entities, forming a one-to-many relationship. All possible permutations are viable, although between multiple entities some collections of relationships are unlikely.
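
As a sketch, consider a hypothetical Customer entity related to many Order entities, the classic one-to-many relationship; once it lands in code it might look something like this:

    from dataclasses import dataclass

    @dataclass
    class Customer:
        customer_id: int
        name: str

    @dataclass
    class Order:
        order_id: int
        customer_id: int   # the 'many' side carries the reference back to the 'one'
        total: float

    def orders_for(customer, orders):
        # Walk the one-to-many relationship from the Customer side.
        return [o for o in orders if o.customer_id == customer.customer_id]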

This is a reasonable way to model some of the structure, but it is hampered because its expressiveness is directly tied to relational databases. Their underlying model is to store the entities into tables which represent a two-dimensional means of encoding the data. This works for many common programming problems but is not able to easily capture complex internal relationships like hierarchies or graphs.

The older form of decomposing into data structures is far more expressive. You can break down extremely complex programs as a set of data structures like lists, hash tables, trees, graphs, etc. With those, and some associated atomic primitives as well as the larger algorithms and glue code, we have the means to specify any known form of computation. That decomposition works but does not seem to be generally well known or understood by the software industry. It is also a lot harder to explain and teach. It was softened way back to form the underlying philosophy behind object-oriented programming, but essentially got blended with other ideas about distributed computing, evolutionary design, and real-world objects. The modern remnants of this consolidation are quite confusing and frequently misunderstood.

So we have at least two means of modeling data, but neither of them really fit well or are easy to apply in practice. We can, however, fuse them together and normalize them so that the result is both consistent and complete. The rest of this post will try to outline how this might be achieved.

Terminology and definitions form the basis for understanding and communication, so it is well worth consideration. From ER diagrams, the word ‘entity’ is general enough for our purposes and isn’t horrifically overloaded in the software industry yet. For the subparts of an entity, the common term is ‘attributes’. Often in other contexts, we use the term ‘properties’. Either seems suitable for describing the primitive data that forms the core of an entity, but it might be better to stick with ‘attributes’. It is worth noting that if some of this underlying associated data has its own inherent non-trivial structure, it can be modeled as a separate entity bound by a one-to-one relationship. Because of that, we can constrain any of these attributes to just being single instances of primitive data. No need for internal recursion.

From an earlier post on null handling (Containers, Collections and Null) we can set the rules that no instance of an entity can exist unless it has all of its mandatory attributes. If an attribute is optional for any possible instance of the entity, it is optional for all possible instances. These types of constraints ensure that the handling and logic are simple and consistent. They need to be rigorous.
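
A sketch of how those constraints might be enforced in code, using an invented Invoice entity: the mandatory attributes are required for every instance, and the optional one is optional for all of them.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Invoice:
        # Mandatory attributes: no instance can exist without them.
        invoice_id: int
        amount: float
        # Optional for any instance means optional for every instance.
        memo: Optional[str] = None

        def __post_init__(self):
            if self.invoice_id is None or self.amount is None:
                raise ValueError("an Invoice cannot exist without its mandatory attributes")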

Every entity corresponds to something unique in the real world. This is another rigorous constraint that is necessary to ensure the data is useful. If under some circumstance that isn’t true, then the entity corresponds to a ‘collection’ of such instances, not an instance itself. In this way, we can always bind the data to something unique, even if that is just a bunch of arbitrarily categorized collections. The caveat though, is that we should never just arbitrarily categorize because that creates problems (since we are adding in fake information). But instead, we need a deeper analysis of the underlying domain to resolve any ambiguities.

If we get this uniqueness principle in place, then it is absolutely true that every entity has a unique key. In many circumstances this will be a composite key, but any such instance still refers to one and only one thing in the real world.

In developing systems we often create a system-generated key as a level of indirection to replace some of the original data keys. This is a good practice in that it allows some of the composite attributes to be modifiable, without having to create a new entity. In historic systems, this keeps the related data attached across changes. This is a necessary property for systems that deal with human-centered input, mostly because there is a relatively constant rate of errors and long periods of time before they are noticed. That needs to be built into the data model.
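
A rough sketch of that indirection, with invented names: the system-generated account_id never changes, so the natural attributes can be corrected later without severing the history attached to the entity.

    import itertools

    _next_account_id = itertools.count(1)

    class Account:
        def __init__(self, branch_code, account_number):
            self.account_id = next(_next_account_id)  # surrogate key, never changes
            self.branch_code = branch_code            # natural key part, correctable
            self.account_number = account_number      # natural key part, correctable

        def correct_branch(self, new_branch_code):
            # Fixing a data-entry error does not create a new entity; all of
            # the history keyed on account_id stays attached.
            self.branch_code = new_branch_code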

So, with some tighter constraints, we can decompose all of the data into entities and we can show how they relate to each other, but we still need a means of showing how the entities relate internally to their own set. As an example, the employees in a company mostly form a tree or directed acyclic graph with respect to their managers. We could model this with some messy arrangement that has multiple entity types allowing different people to both report upwards and have employees as well, but this would be quite awkward and pedantic. What is preferable is to just model the entity itself as a tree (ignoring the more complicated arrangements for the moment).
This could be as easy as specifying the type for the entity as a ‘tree’ and adding in an attribute for ‘manager’. We simply overlay the data structure decomposition on top of the original entity. Thus we get an extension, where the internal entity relationship might be a data structure like a tree and to satisfy that structure we add in the appropriate attributes.
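
A sketch of that overlay, with an invented Employee entity: the entity is declared to be a tree, and the extra ‘manager’ attribute is what realizes the internal relationship.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Employee:
        employee_id: int
        name: str
        manager_id: Optional[int] = None   # None marks the root of the tree

    def direct_reports(manager_id, employees):
        # Follow the internal 'tree' relationship one level down.
        return [e for e in employees if e.manager_id == manager_id]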

That works cleanly, but we would like to know which set of data structures is necessary to express the full breadth of any possible data. An initial guess might start with this collection: item, list, tree, dag, graph, hypergraph. That may seem fairly expressive, but we are aware of at least some form of dimensionality with respect to entities. That is, an entity might be a list of lists, or even a hypergraph of trees of lists. Thus we have the initial collection on any given axis, with any N axes. It can be any such combination.

As weird as that sounds, we could then specify a matrix as an entity of list^2, but noting that the lists are strongly bounded by the M and N sizes of the matrix. In most underlying data collections, the entities are just not that complicated, but we really don’t want to limit the models to the 80% mark.

As for representing these models, we can easily use the original ER diagrams with the extra information for including the entity type. We would also like a means of serializing these and given that the relationships are not necessarily planar we need a bit of extra mechanics here as well. Before we get there, it is worth noting that the external relationships are bound as graphs. A reasonable extension would be to allow them to be hypergraphs as well, which would make them consistent with the internal relationships. That is, we could have a relationship such as one-many-many between three entities. That could further compact the model, but would only be usable if it was there to highlight a relationship, not obscure it.

We can easily serialize an entity (and we know it isn’t intrinsically recursive). So all we need is a means to serialize a hypergraph. That is well understood for graphs as an adjacency matrix, but with the cells containing the relationship itself instead of just 1 or 0. Or we can just list out the vertices (entities) and the edges (relationships). Something like ‘entityA-entityB-entityC: one-many-many’ would be quite suitable. We could write out any list of unnamed values with brackets [], and any named values (like entities) with braces {}. Another suggestion is to list keys first, separated internally by commas, but switch to a semicolon to separate them from the rest of the attributes or other keys (there may be multiple means of getting uniqueness). If there is some reason to distinguish entities from other hash tables, we could use parentheses () for this distinction. Thus we might then differentiate between objects and dictionaries, for clarity.
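
As a purely illustrative guess at how such a serialization might read (the entities, attributes and exact punctuation below are only one possible interpretation of the conventions just described):

    (Customer) {customer_id; name, email, tags []}
    (Product)  {sku; description, price}
    (Order)    {order_id; customer_id, placed_at, total}

    Customer-Order: one-many
    Order-Product-Customer: many-many-one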

With a bit of white space, this would all provide a means to have a serialized text-based format for specifying a data model, which means that we could apply other tools such as source-code control and formatting to be able to manipulate these specifications. With the addition of positioning information, we could easily render this as a diagram when needed.

Given the importance of data for any software system, it seems that we should focus strongly on building up analytical tools for correctly laying the foundations for the system. It should be easy to model data, and organizations should have large repositories of all of their current models. It is often the case that some systems use eclectic subsets of the actual underlying data models, but what we need is both a way to catch this early and a way to gradually enhance our knowledge to properly model the data for all of the domains. A subset is fine, but when it needs to be extended, it should be done properly. It is reckless to leave this to chance, and we so often see the consequences of doing that. If the analysis stage for software development is properly focused on really modeling the data, the other stages will proceed smoothly. If not, the initial failures will percolate into the rest of the work.