Monday, December 22, 2014

Thinking About What You Don't Know

I've observed that many people, when they think about stuff, choose to limit themselves to only what they know. That is, they revisit the facts and opinions with which they are familiar, over and over again, trying to squeeze something new from them.

What I've always found tosses a big wrench into the works is not what is known, but rather what isn't. Some people might think it is a contradiction to think about what you don't know, since obviously you have nothing in your head that relates to it, but what you do have are essentially dangling threads: information that only goes so deep and then suddenly stops. If you want, you can explore these.

You start by inventorying what you already understand, then gradually decompose that into more precise fragments. At some point, you either bang into an assumption or you just draw a blank. Blanks are easy since you know that you don't know, so you now have something to learn about. Assumptions however can be very hard to recognize. They often appear as concrete knowledge, except they are not based on any underlying facts that you know. What they really do is cleverly hide unknowns, removing uncertainty, but at the cost of quite possibly being wrong. Most problems start with at least one faulty assumption.

So you ask yourself: what do I know that really is fact? Are those facts absolute, or are they relative to some specific context? Once you've filtered out the facts, what remains needs clarification. From that you can start asking questions, and with those you can search for answers. It's a long, sometimes slow loop, but if you dig deep enough you'll find that you can uncover all sorts of unexpected things.

There are of course "right turns" that come up so unexpectedly that no amount of thinking would ever lead you to them. There is little you can do about these. You never want to go too far down any rabbit holes, since they might lead you to start reconstructing things incorrectly and end up creating worse assumptions.

One thing that helps is to analyse whether or not the people you are talking to are themselves referring to actual knowledge. I've often found that people will confidently rely on some pretty shaky assumptions, and that the confidence is inversely proportional to the quality of the assumption: the shakier it is, the more some people convince themselves to believe it. It's considered rude to point that out, so often what you need to do is note down what they are saying and find an independent way to verify it. Two people saying the same thing is better, but it's best when you can break it down to real underlying facts that can be shown to be true. Digging in this way always brings up plenty of unexpected issues, some of which turn out to be dead ends, but many of which lead to valuable insight.

Things generally go wrong, not because of what we know, but because of what we missed. Learning to think deeply about what might be missing is a valuable skill. With practice, you start to see the weaknesses right away, and you don't really have to ponder them for days. Rather, in the midst of discussion, you become aware of a blind spot and start asking questions. If you confront your own assumptions, you often find that others were operating with similar ones, and that those assumptions are quietly hiding something nasty. Getting that out into the open usually saves a lot of trouble in the future.

Sunday, December 14, 2014

Definitions

The easiest thing to do in response to a major problem is to create a bunch of new 'rules' to prevent it from ever happening again. It might be the obvious solution, but if the underlying problem was caused by a systemic breakdown due to unmanageable complexity, it is most likely also a bad solution. Once things get too complicated, rules lose their effectiveness.

Although I've seen this reoccur time and time again in large organizations, I wasn't really sure what was driving it or how to really fix the underlying problems. The 21st century is dominated by large organizational systems teetering on the brink of collapse. We have built up these mega-institutions, but they do not seem to have weathered the unbounded complexities exploding from the early part of the digital age. Most are barely functional, held together by that last bit of energy left in people devoted to not being unemployed.

Just saying "don't create any new rules" isn't helpful. These rules happen as a response to actual problems, so turning away from them isn't going to work. It's this no-win situation that has intrigued me for years. Surely, there must be some formalizable approach towards putting the genie back into the bottle?

A few years ago I was sitting in an interview. When asked a technical question, I started off my answer with "I can't remember the exact definition for that, but...".

At which point the interviewer reassuringly interrupted and said "Don't worry about it, we aren't concerned with definitions in this company".

At the time I let the exchange slide by, but there was something about it that deeply bothered me. Of course people should be concerned about definitions! They are, after all, the basic tokens of any conversation. If two people are talking, but using two completely different definitions, then the conversation itself is either useless or misleading.

I've encountered this cross-talk circumstance so many times, in so many meetings, that it has become automatic to stop the discussion and get both parties to dump out their definitions onto the table. Always, after such an exchange, things have started to click and real progress gets made. Until then, the meeting is just noise.

Recently I've begun to understand that bad or vague definitions have even deeper consequences: they relate directly back to systemic failure, not just useless meetings. They can be used to empower and enhance dysfunction.

So I have a very simple conjecture: any general rules like "all X must have property Y" are going to cause 'trouble' if any of the underlying 'definitions' are wrong.

For example, if X is really something messy like x+q, x'' and xish then it is easy to overlook the impact on all three loose forms of X. Worse, when the boundaries are fuzzy, then people can include or exclude things based on a whim. Some days z is in X, some days it is not.

It gets even worse when Y is messy as well, such as y1, y3, y4 and y7. Now there are 12 to 16 different cases that are broadly covered. If someone crafted the rule because one day an x+q should have had the property y3, that might have been really clear at the time, but may not generally apply in all cases.

If however X really means precisely and only X, and the property Y is really just Y, then the rule boundaries are tight. We know exactly what is covered and we can manage that.
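
To make that counting concrete, here is a minimal sketch (using the purely hypothetical loose forms from the example above) that enumerates the cases a vague rule implicitly covers versus a tight one:

```python
from itertools import product

# Hypothetical loose interpretations of "X" and "Y" from the example above.
loose_x = ["x+q", "x''", "xish"]      # some days "z" sneaks into this list too
loose_y = ["y1", "y3", "y4", "y7"]

# Every combination is a distinct case that the rule
# "all X must have property Y" now implicitly covers.
cases = list(product(loose_x, loose_y))
print(len(cases))   # 12 cases; 16 once "z" is included

# With tight definitions there is exactly one case to manage.
tight_cases = list(product(["X"], ["Y"]))
print(len(tight_cases))   # 1
```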

With loose rules it is the corner-cases that people miss. If the definitions are vague, then there are lots of known and unknown cases. Any bad side-effect percolates causing other problems, so a bad rule can really do a lot of damage.

Sometimes the definitions are vague on purpose. Someone high up in the hierarchy wants control. With a vague definition they can yank their inferiors' chains any time they choose. They just pull the carpet out from under them, and then punish them for doing a face plant. This isn't an uncommon executive trick; many even believe that it works well. In some corporate cultures the hierarchies are fear-based and sometimes even excessively micromanaged. Those sorts of cruel executive games trickle down to the masses, causing widespread corporate dysfunction.

Sometimes the definitions are vague because management is clueless about what it is actually managing. If you don't understand stuff then you really don't want to have a precise definition, and without one most top-down rules are most likely to go bad.

Sometimes it goes the other way. The employees don't want management oversight, so they redefine things as vaguely as possible to give themselves wiggle room to operate as they please. That, of course, is in tandem with clueless management, but in this case the definitions are coming from the bottom up.

Sometimes the origin of the rules is lost in history. Sometimes it's just outsiders looking in, but missing all but what floats on the surface.

In some cases organizations really like to make up their own terminology or redefine existing terms. That promotes a "club membership" sort of environment. The danger is of course that the definitions slide all over the place, subject to various agendas. Gradually that converges on dysfunction.

This sort of issue can be easily detected. If rules have been created in response to a problem, but the problem doesn't go away, then it is very likely that the underlying definitions are broken. More rules are not needed. Fix the definitions first, then use them to simplify the existing rules, then finally adjust those rules to ensure that the specific corner-cases are handled. It's important to note that if fixing a definition involves "retraining", there is a strong chance that the definition itself is still broken. A tight, reasonable definition should make perfect sense to the people doing the work, so it should not require effort to remember. If you're just flinging around arbitrary words, then the problems will creep in again.

Sometimes the dysfunction is hidden. A bunch of rules get created to throttle the output, so that people can claim improvement. There are fewer problems if less work is getting done, but that will eventually make the cost of any work inordinately expensive. Also, it hasn't fixed the quality issues; it has probably just made them much worse. It's a hollow victory.

Broken definitions are everywhere these days. To find them, all you really need to do is start asking questions such as "what does X really mean?" If different people give a range of vague answers, you know that the definition is broken. Unfortunately, asking very basic questions about actual definitions to lots of different people has a tendency to agitate them. Most people would rather blindly chase a rule they don't understand than actually talk or think about it. When you question a definition, they tend to think that you are either stupid or just being difficult.

Still, once a definition is broken it will not fix itself. If it has been broken for years, it will have seeped into all sorts of cracks. People may be running around, throwing out all sorts of hasty band-aids for these new problems, but until light is thrown on the root cause, the issues will either get worse or be ignored. Just saying "that's the way it is" not only preserves the status quo, but also lays a foundation for people to continue to make the problems worse.

Automation via computers was once envisioned as a means of escaping these sorts of problems, but it actually seems to be making them worse. At least when processes were manual, it was obvious that they were expensive and ineffective. Once they were computerized, the drudgery shifted to the machines, so any feedback that might have helped someone clean up the mess disappeared. The computer quietly enables people to be more dysfunctional. The programmers might have been aware of the issues they created, but most often they are detached from the usage, so they just craft a multitude of special cases and move on to the next bit of code.

In a sense, the definition problem is really trivial. You cannot manage what you don't understand, and any understanding is built on how you define the underlying parts. If the definitions are vague and shifty then the management effort is disconnected from the underlying reality. As the distance between the two grows, things get more complicated and gradually degrade. To reverse the trend requires revisiting how things are defined.

Sunday, December 7, 2014

Bad Process

To work effectively, software development projects need some type of process. The more people involved, the bigger and more formalized the methodology needs to be.

A good process helps people understand their roles, keep track of the details and work together with others. It is the glue that holds everything together.

Too little process usually ends up being a bad process. It doesn't help complete the tasks, and it enables disorganization. That shows up as dangerous shortcuts, hurry-up-and-wait scheduling, and too much time spent plugging avoidable holes. When busy, most people drop best practices in favor of the most expedient approach. The occasional lapse is fine, but shortcuts always come with a hidden cost. Get too far into debt and a nasty cycle ensues which drags everything down further. A good process will help prevent this.

Too much process generally results in useless make-work. Following the process itself becomes the focus, not getting the work done. Most heavy processes claim to increase quality, but if there is excessive make-work it actually drains away resources, causing the quality to drop. Morale is also affected. People can usually tell valuable work from the useless stuff. They know if their contributions are going directly into the end product, or if they're just propping up some meta-administration. As such, when they spend their time doing make-work instead of actually dealing with the real issues, they lose interest in getting things done.

The ideal process is one that everybody actually wants to follow. They realize that it is there to make their lives easier and to help achieve quality. They don't have to work around the process in special circumstances, because the process genuinely covers all aspects of their work. It guides them through their labors, keeping them safe from making dumb mistakes and allowing them to focus on what they really want to do: get the work done.

A process crafted by people with little real experience in actual software development obviously won't focus on what matters. Rather it gloms onto the superficial, because it's faster to understand. It usually makes up its own eclectic terminology as an attempt to hide its lack of depth and to make it appear more intellectual than it really is. It might be presented with much fanfare and a raft of unachievable claims, but at its heart it is nothing more than a tragic misunderstanding.

Software has had many, many bad processes injected into it, probably because on the surface it appears to be such a simple discipline. It looks like we are just tossing together endless lists of 'thingies' for the computer to do, so it can't be that hard to smother it with silly paperwork, brain-dead committees, convoluted metrics or even some type of hokey ticketing system. Complex tasks are always easily misunderstood, which often just makes them more complex.

What we need, of course, is for the people who have spent decades laboring in the mines to get together and craft their own extensive experiences into a deep methodology for reliably building good software. There is certainly plenty of aged knowledge out there, and lots of hard lessons learned, but that just doesn't seem to be making it back into the mainstream.

It should be noted that there is no one-size-fits-all methodology for creating software systems. As the size of the project increases, the process has to change drastically to avoid dysfunction. Small works have a greater tolerance for mistakes and disorganization. As things grow, little annoyances propagate out into major disasters. Since many projects start with humble origins, as the system changes the process must grow along with it.

There are five distinct stages to software development: analysis, design, coding, testing and deployment. Each stage has its own unique issues and a good process will address them specifically. The biggest challenge is to keep each stage separated enough that it becomes obvious if one of them is being ignored. Too often programmers try to collapse the first three into a single stage in a misguided attempt to accelerate the work, but coding before you know what the problem is and before you get it organized is obviously not going to produce well thought out solutions. Programs are far more than just the sum of the code they contain; they need to really solve things for real people, not just spin the data around.

Analysis is the act of gathering information about a problem that needs to be solved. That information should be collected together in an organized form, but it shouldn't be presented as a design of any type. I've often seen analysts create screen layouts for new features, but rarely have I seen those designs actually fit back into the existing system; as a result, all of that work is effectively lost. The end product of analysis is 'who', 'why' and 'what': who needs the new feature, why they need it, and what all of the data, formulas, external links, etc. really are. For the data itself, an accurate model, along with its frequency, quantity and quality, are important facts to have around when designing and building.

The underlying goal of design can be stated by the well-known 17th century proverb: "a place for everything, and everything in its place". That is, the design organizes the work by fitting a structure over the top of it. The walls of that structure form boundaries through the code that keep unrelated parts separated. If the design is followed -- the second half of the quote -- then the resulting code should both work properly and be extendable in the future. If there is no design, the chances are that what follows will be such a tangled mess that it will be unmaintainable. There will often be separate designs for the different aspects of the same system, common ones include architecture, graphic design and UX design. They get separated because they are all fundamentally different and each requires very different skill sets to accomplish properly. A design is only useful if it provides enough information to the programmers to allow them to build their code correctly, thus the 'depth' of the design is vital but also relative to the programming team. For instance, a sky-high design isn't particularly useful for junior programmers since it doesn't provide clear guidance, but it might be all that a team of veterans needs to get going right away.

Within a good process, the programming stage is the least exciting. It is all about spending time to write code that matches the design and analysis. There are many small problems to be worked out, but all of the major ones should have been settled already. As such, coding should just be a matter of time, and it should be possible to understand how much time is necessary to get it done. In practice that is rarely the case, but those unexpected hiccups come from deficiencies in the first two stages: missing analysis, lack of design or invalid technical assumptions. Working those common problems back into a process is extremely tricky since they cause the tasks to change hands in the middle of the work, leaving the possibility that the programmers may just end up sitting around waiting for others. Or worse, the programmers just make some random assumptions and move on. Neither circumstance works out very well, the work is either late or wrong.

Testing is an endless affair since you can't ever test everything; eventually you just have to stop and hope for the best. For that reason the most critical parts of the system -- those that will lead to embarrassment if they are wrong -- should be tested first. Unit testing is fine for complex pieces, but only system tests really validate that everything is working correctly in the final version. Manual testing is slow and imprecise, but easy. A mix of testing is usually the most cost effective way forward. Retesting is expensive, so a list of problems is preferable to just handling one problem at a time. Often the users are involved too, but again that opens up the door to the work having to go all of the way back to the analysis stage. Care should be taken to document good test cases, both to avoid redundant work and to ensure future quality.
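
As a rough illustration of documenting the critical test cases so they can be rerun rather than redone, here is a minimal sketch using Python's built-in unittest; the calculate_interest function and its expected values are entirely hypothetical stand-ins for one of those "embarrassing if wrong" pieces:

```python
import unittest

def calculate_interest(principal, rate, years):
    # Hypothetical critical calculation: one of the pieces that would be
    # embarrassing if it were wrong in production.
    return principal * (1 + rate) ** years

class CriticalCalculationTests(unittest.TestCase):
    """Documented, repeatable test cases for the most critical code paths."""

    def test_single_year(self):
        self.assertAlmostEqual(calculate_interest(1000, 0.05, 1), 1050.0)

    def test_zero_rate(self):
        self.assertAlmostEqual(calculate_interest(1000, 0.0, 10), 1000.0)

if __name__ == "__main__":
    unittest.main()
```

Unit tests like these cover the complex pieces cheaply; they don't replace the system tests that validate the assembled final version.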

Deployment should be the simplest stage in development. It amounts to putting the software out there and then collecting lots of feedback about the problems. For releases, there need to be both a standard release process and a quick-fix one. The feedback is similar to the initial analysis and needs to be distributed to every other stage within the development so that any problems can be corrected. Testing might have let an obvious bug slip through, coding might have made a wrong assumption, the design might not have accounted for existing behavior or resource requirements, and analysis might have missed the problem altogether. That feedback is the most crucial information available for repairing process problems so they don't reoccur in the future. Quick fixes must only release the absolute minimum amount of code. A classic mistake is to allow untested code to escape into production as part of an emergency patch.

What makes software development complicated is that there are often millions of little moving parts, most of which depend on each other. A good process helps track and manage these relationships while trying to schedule the work as efficiently as possible. Since any work in any stage has the chance that it might suddenly fall back to an earlier stage, it's incredibly hard to keep everyone properly busy and adhere to a schedule. What I've always found is that essentially pipelining the work fits well. That is, there are many parallel projects all moving through the different stages at different times. So if one feature gets punted from coding back to analysis, there are plenty more waiting to move forward. Pipelines can work well, but only if development management has control over the process and is willing to address and fix the problems.
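
A minimal sketch of that pipelining idea, assuming a simple in-memory model where each feature is tracked by its current stage and can be punted back to an earlier one (the feature names are hypothetical; the stages are the five listed above):

```python
STAGES = ["analysis", "design", "coding", "testing", "deployment"]

# Each feature is tracked independently, so the pipeline stays full even
# when one item gets punted back to an earlier stage.
features = {
    "reporting-screen": "coding",
    "audit-trail": "testing",
    "bulk-import": "analysis",
}

def advance(feature):
    """Move a feature forward one stage."""
    i = STAGES.index(features[feature])
    if i < len(STAGES) - 1:
        features[feature] = STAGES[i + 1]

def punt(feature, back_to):
    """Send a feature back to an earlier stage, e.g. for missing analysis."""
    assert STAGES.index(back_to) < STAGES.index(features[feature])
    features[feature] = back_to

punt("reporting-screen", "analysis")   # coding hit a missing requirement
advance("audit-trail")                 # testing passed, ready to deploy
print(features)
```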

For a big team, the flow issues are really issues of personalities. Software developers don't like to think of programming as a 'people' problem so they desperately try to ignore this. Tweaking the process is often about modifying people's work load, responsibilities and their role in the team, and it's a never ending battle. The process needs to change when the people involved change.

Over the decades I've worked in plenty of really bad and really good processes. The bad ones usually share the same characteristics: a blurring of the five different stages, which causes the roles to be poorly defined. Those cracks either mean stuff doesn't get done or weird rules get created to patch them. Examples include the analysts doing low-quality screen designs, operations pushing admin tasks back to programmers, programmers doing their own testing and the ever popular programmers doing graphic design. Sometimes a single role is broken up into too many smaller pieces, so you might have sys admins that aren't allowed to reboot, for instance, or operational support staff that have no actual knowledge of the running systems, so they can't do any real work by themselves. Redefining common terminology, incorrect titles and making up new words or cute acronyms are other rather obvious signs of dysfunction, particularly in an industry that has been around for decades and has involved millions of people.

What good processes seem to have is clearly defined roles at each stage that have enough control to ensure that the work is done correctly and are held responsible if it isn't. So the flow goes smoothly: analyst->designer->coder->tester->operations for each piece of work. In very small groups many roles will be filled by the same person; that works so long as they have the requisite skill set and they know which roles and responsibilities they are in, at which times. In fact one person could do it all -- which saves a huge amount of time in communication -- but that absolutely restricts the size of the project to small or medium at best.

With a large (or larger) and messy development there is often a temptation to parallelize everything, for example having four separate testing environments to support four separate development paths. The problem is that deployment is serial. If there are two different projects in testing at the same time, they have independent changes. If they go into production as is, the second release reverts the first change. If instead they merge the two, the second set of testing becomes invalidated and needs to be redone. Supporting parallel development usually only makes sense for the first three stages, and in the third stage by constructing a separate branch in source code control and then relying on good quality tools to merge them later.

People sometimes confuse "demo'ing rough functionality to the user" with 'user acceptance testing'. In the latter it should be unlikely that any major changes are expected, while in the former it is common. For parallel works that need visibility, there could be a demo branch from coding that lets the second set of changes be seen early, but still preserves the serial nature of the testing and deployment. The trick is to not accidentally call it a testing environment, so there is no confusion.

One of the best new process additions to software development has been the idea of 'iterations'. With exceptionally long programming stages, not only was there concern about things changing, but also there was no way to confirm that the first two stages worked correctly. That, coupled with the monotony of just getting the code done, would often cause people to lose faith in the direction. Small iterations cost more (almost everything is cheaper in bulk), but they balance out the work nicely. Some people have advocated really short iterations of a fixed size. I've always preferred to vary the size based on the work, and usually end each with some testing and most often deployment, so as to lock in the finished code. Varying the size also helps with seasonal issues and the effective handling of both large and small changes. Some technical debt repayments, when left too long, can require significant time periods and thus need very long iterations to get done correctly.

One thing that I found works really well is to see every new development as a 'set' of changes to the different layers in the system. Starting from the bottom, most changes affect the schema, parts of the middleware and then the user interface. Rather than trying to change all three things at the same time, I schedule them to get done as three separate, serial projects. So any schema changes go into the first release. They can be checked to see if they are correct and complete. The next release follows with the new middleware functionality. Again it is verified. Finally the interface changes are deployed to a very stable base, so you can focus directly on what they need. Working from the bottom of the system upwards avoids awkward impedance mismatches, simplifies the code, promotes reuse and tends to drive programmers into extending what is already there instead of just reinventing the wheel. Working top-down has the opposite effect.
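
As a small illustration of that bottom-up scheduling, here is a sketch of a single hypothetical feature split into three serial releases, each verified before the next layer starts:

```python
# Hypothetical feature split into three serial releases, ordered bottom-up.
feature = "customer-notes"

releases = [
    {"layer": "schema",     "change": "add a customer_notes table"},
    {"layer": "middleware", "change": "expose read/write calls for customer notes"},
    {"layer": "interface",  "change": "add a notes panel to the customer screen"},
]

# Each release is deployed and verified before the next one begins, so the
# interface work always lands on a stable, already-verified base.
for number, release in enumerate(releases, start=1):
    print(f"Release {number} ({feature}): {release['layer']} -- {release['change']}")
```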

A friend of mine once said "companies deserve the software they get", which if combined with another friend's sentiment of "the dirty little secret of the software industry is that none of this stuff actually works" is a pretty telling account of why we have so much effort going into software development, yet we are still not seeing substantial benefits from it. Lots of stuff is 'online' now, which is great when it works, but when it doesn't it becomes a Herculean effort to try and get it sorted out. It's not that people have simply forgotten how to fix things, but rather that the whole sad tangled mess is so impenetrable that they would try anything to avoid facing it head on. While we know these are design and coding problems, the wider issue is: after all these decades, why are we still allowing these types of well-understood problems to get into production systems in the first place? The rather obvious answer is that the process failed, that it was a bad process.

Most people want to do a good job with their work, but that's highly unlikely in a bad environment. Most development environments are not centred around software engineering. They have other priorities. The only recourse is to craft a protective process around the development itself to prevent destructive cultural elements from derailing everything. In that sense, process is the first and last obstacle between a development project and high quality output. Without it, we are at the mercy of the environment, and so many of those have allowed uncontrollable complexity to lead them into madness. When the process is wrong, we spend too much time trying to scale it, instead of doing the right things. Software development isn't 'fully' understood right now, but we have amassed a huge amount of knowledge over the decades, although little of it is actually being utilized. As such, we know what attributes we really need for a good process; there are just some details left to be filled in.