Monday, December 22, 2014

Thinking About What You Don't Know

I've observed that many people, when they think about stuff, choose to limit themselves to only what they know. That is, they revisit the facts and opinions with which they are familiar, over and over again, trying to squeeze something new from them.

What I've always found, that tosses a big wrench into the works, is not what is known but rather what isn't. Some people might think it is a contradiction to think about what you don't know, since obviously you have nothing in your head that relates to it, but what you do have is essentially dangling threads. Information that only goes so deep and then suddenly stops. If you want, you can explore these.

You start by inventorying what you already understand, then gradually decompose that into more precise fragments. At some point, you either bang into an assumption or you just draw a blank. Blanks are easy since you know that you don't know, so you now have something to learn about. Assumptions however can be very hard to recognize. They often appear as concrete knowledge, except they are not based on any underlying facts that you know. What they really do is cleverly hide unknowns, removing uncertainty, but at the cost of quite possibly being wrong. Most problems start with at least one faulty assumption.

So you ask yourself: what do I know that really is a fact? Are these facts absolute, or are they relative to some specific context? Once you've filtered out the facts, what remains needs clarification. With these you can start asking questions, and with those you can search for answers. It's a long, sometimes slow, loop but if you dig deep enough you'll find that you can uncover all sorts of unexpected things.

There are of course "right turns" that come up so unexpectedly that no amount of thinking would ever lead you to them. There is little you can do about these. You never want to go too far down any rabbit holes, since they might lead you to start reconstructing things incorrectly and end up creating worse assumptions.

One thing that helps is to analyse whether or not the people you are talking to are themselves referring to actual knowledge. I've often found that people will confidently rely on some pretty shaky assumptions, and that their confidence is inversely proportional to how solid the assumption is: the shakier the assumption, the more firmly some people convince themselves to believe it. It's considered rude to point that out, so often what you need to do is note down what they are saying and find an independent way to verify it. Two people saying the same thing is better, but it's best when you can break it down to real underlying facts that can be shown to be true. Digging in this way always brings up plenty of unexpected issues, some of which turn out to be dead ends, but many of which lead to valuable insight.

Things generally go wrong, not because of what we know, but because of what we missed. Learning to think deeply about what might be missing is a valuable skill. With practice, you start to see the weaknesses right away, and you don't really have to ponder them for days. Rather, in the midst of discussion, you become aware of a blind spot and start asking questions. If you confront your own assumptions, you often find that others were operating with similar ones and that they are quietly hiding something nasty. Getting that out into the open usually saves a lot of trouble in the future.

Sunday, December 14, 2014


The easiest thing to do in response to a major problem is to create a bunch of new 'rules' to prevent it from ever happening again. It might be the obvious solution, but if the underlying problem was caused by a systemic breakdown due to unmanageable complexity, it is most likely also a bad solution. Once things get too complicated, rules lose their effectiveness.

Although I've seen this reoccur time and time again in large organizations, I wasn't really sure what was driving it or how to really fix the underlying problems. The 21st century is dominated by large organizational systems teetering on the brink of collapse. We have built up these mega-institutions, but they do not seem to have weathered the unbounded complexities exploding from the early part of the digital age. Most are barely functional, held together by that last bit of energy left in people devoted to not being unemployed.

Just saying "don't create any new rules" isn't helpful. These rules happen as a response to actual problems, so turning away from them isn't going to work. It's this no-win situation that has intrigued me for years. Surely, there must be some formalizable approach towards putting the genie back into the bottle?

A few years ago I was sitting in an interview. When asked a technical question, I started off my answer with "I can't remember the exact definition for that, but...".

At which point the interviewer reassuringly interrupted and said "Don't worry about it, we aren't concerned with definitions in this company".

At the time I let the exchange slide by, but there was something about it that deeply bothered me. Of course people should be concerned about definitions! They are, after all, the basic tokens of any conversation. If two people are talking, but using two completely different definitions, then the conversation itself is either useless or misleading.

I've encountered this cross-talk circumstance so many times, in so many meetings, that it has become automatic to stop the discussion and get both parties to dump out their definitions onto the table. Always, after such an exchange, things have started to click and real progress gets made. Until then, the meeting is just noise.

Recently I've begun to understand that bad or vague definitions have even deeper consequences such that they relate directly back to systemic failure, not just useless meetings. They can be used to empower and enhance dysfunction.

So I have a very simple conjecture: any general rules like "all X must have property Y" are going to cause 'trouble' if any of the underlying 'definitions' are wrong.

For example, if X is really something messy like x+q, x'' and xish then it is easy to overlook the impact on all three loose forms of X. Worse, when the boundaries are fuzzy, then people can include or exclude things based on a whim. Some days z is in X, some days it is not.

It gets even worse when Y is messy as well, such as y1, y3, y4 and y7. Now there are 12 to 16 different cases that are broadly covered. If someone crafted the rule because one day an x+q should have had the property y3, that might have been really clear at the time, but may not generally apply in all cases.
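One way to read the "12 to 16" figure is as a simple cross product: three loose forms of X against four loose forms of Y gives 12 cases, and 16 on the days when z is counted as part of X. The short Python sketch below just enumerates those pairs. The labels are the placeholders used above, and the helper name `rule_cases` is made up for illustration.

```python
from itertools import product

# Placeholder labels taken from the example in the text.
x_forms = ["x+q", "x''", "xish"]
y_forms = ["y1", "y3", "y4", "y7"]


def rule_cases(xs, ys):
    """Return every (X form, Y form) pair that a loose rule broadly covers."""
    return list(product(xs, ys))


# On days when z is not treated as part of X.
cases_without_z = rule_cases(x_forms, y_forms)

# On days when someone decides that z is part of X after all.
cases_with_z = rule_cases(x_forms + ["z"], y_forms)

print(len(cases_without_z), len(cases_with_z))
```

A rule written with only the pair ("x+q", "y3") in mind is one entry in that list; all of the others come along for the ride, whether anyone thought about them or not.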

If however X really means precisely and only X, and the property Y is really just Y, then the rule boundaries are tight. We know exactly what is covered and we can manage that.

With loose rules it is the corner-cases that people miss. If the definitions are vague, then there are lots of known and unknown cases. Any bad side-effect percolates causing other problems, so a bad rule can really do a lot of damage.

Sometimes the definitions are vague on purpose. Someone high up in the hierarchy wants control. With a vague definition they can yank their inferiors' chains any time they choose. They just pull the carpet out from under them, and then punish them for doing a face plant. This isn't an uncommon executive trick; many even believe that this works well. In some corporate cultures the hierarchies are fear-based and sometimes even excessively micromanaged. Those sorts of cruel executive games trickle down to the masses causing widespread corporate dysfunction.

Sometimes the definitions are vague because management is clueless about what it is actually managing. If you don't understand stuff then you really don't want to have a precise definition, and without that most top-down rules are most likely to go bad.

Sometimes it goes the other way. The employees don't want management oversight, so they redefine things as vaguely as possible to give themselves wiggle room to operate as they please. That, of course, is in tandem with clueless management, but in this case the definitions are coming from the bottom up.

Sometimes the origin of the rules is lost in history. Sometimes it's just outsiders looking in, but missing all but what floats on the surface.

In some cases organizations really like to make up their own terminology or redefine existing terms. That promotes a "club membership" sort of environment. The danger is of course that the definitions slide all over the place, subject to various agendas. Gradually that converges on dysfunction.

This sort of issue can be easily detected. If rules have been created in response to a problem, but the problem doesn't go away, then it is very likely that the underlying definitions are broken. More rules are not needed. Fix the definitions first, then use them to simplify the existing rules, then finally adjust the rules to ensure that the specific corner-cases are handled. But it's important to note that if fixing a definition involves "retraining" then there is a strong chance that the definition itself is still broken. If you have a tight, reasonable definition, that should make perfect sense to the people working, so it should not require effort to remember. If you're just flinging around arbitrary words, then the problems will creep in again.

Sometimes the dysfunction is hidden. A bunch of rules get created to throttle the output, so that people can claim improvement. There are fewer problems if less work is getting done, but that will eventually make the cost of any work inordinately expensive. Also, it hasn't fixed the quality issues, but probably just made them much worse. It's a hollow victory.

Broken definitions are everywhere these days. To find them, all you really need to do is start asking questions such as "what does X really mean?" If different people give a range of vague answers, you know that the definition is broken. Unfortunately, asking very basic questions about actual definitions to lots of different people has a tendency to agitate them. Most people would rather blindly chase a rule they don't understand than actually talk or think about it. In questioning the definition they tend to think that you are either stupid or just being difficult.

Still, once a definition is broken it will not fix itself. If it has been broken for years, it will have seeped into all sorts of cracks. People may be running around, throwing out all sorts of hasty band-aids for these new problems, but until light is thrown on the root cause, the issues will either get worse or be ignored. Just saying "that's the way it is" not only preserves the status quo, but also lays a foundation for people to continue to make the problems worse.

Automation via computers was once envisioned as a means of escaping these sorts of problems, but it actually seems to be making them worse. At least when processes were manual, it was obvious that they were expensive and ineffective. Once they were computerized, the drudgery shifted to the machines, so any feedback that might have helped someone clean up the mess disappeared. The computer quietly enables people to be more dysfunctional. The programmers might have been aware of the issues they created, but most often they are detached from the usage, so they just craft a multitude of special cases and move on to the next bit of code.

In a sense, the definition problem is really trivial. You can not manage what you don't understand and any understanding is built into how you define the underlying parts. If the definitions are vague and shifty then the management effort is disconnected from the underlying reality. As the distance between the two grows, things get more complicated and gradually degrade. To reverse the trend requires revisiting how things are defined.

Sunday, December 7, 2014

Bad Process

To work effectively, software development projects need some type of process. The more people involved, the bigger and more formalized the methodology needs to be.

A good process helps people understand their roles, keep track of the details and work together with others. It is the glue that holds everything together.

Too little process usually ends up being a bad process. It doesn't help complete the tasks, enabling disorganization. That comes as dangerous shortcuts, hurry up and wait scheduling and too much time spent plugging up avoidable holes. When busy, most people drop best practices in favor of the most expedient approach. The occasional lapse is fine, but shortcuts always come with a hidden cost. Get too far into debt and a nasty cycle ensues which drags everything down further. A good process will help prevent this.

Too much process generally results in useless make-work. Following the process itself becomes the focus, not getting the work done. Most heavy processes claim to increase quality, but if there is excessive make-work it actually drains away resources, causing the quality to drop. Morale is also affected. People can usually tell valuable work from the useless stuff. They know if their contributions are going directly into the end product, or if they're just propping up some meta-administration. As such, when they spend their time doing make-work instead of actually dealing with the real issues, they lose interest in getting things done.

The ideal process is one that everybody actually wants to follow. They realize that it is there to make their lives easier and to help achieve quality. They don't have to work around the process in special circumstances, because the process genuinely covers all aspects of their work. It guides them through their labors, keeping them safe from making dumb mistakes and allowing them to focus on what they really want to do: get the work done.

A process crafted by people with little real experience in actual software development obviously won't focus on what matters. Rather it gloms onto the superficial, because it's faster to understand. It usually makes up its own eclectic terminology as an attempt to hide its lack of depth and to make it appear more intellectual than it really is. It might be presented with much fanfare and a raft of unachievable claims, but at its heart it is nothing more than a tragic misunderstanding.

Software has had many, many bad processes injected into it, probably because on the surface it appears to be such a simple discipline. It looks like we are just tossing together endless lists of 'thingies' for the computer to do, so it can't be that hard to smother it with silly paperwork, brain-dead committees, convoluted metrics or even some type of hokey ticketing system. Complex tasks are always easily misunderstood, which often just makes them more complex.

What we need, of course, is for the people who have spent decades laboring in the mines to get together and craft their own extensive experiences into a deep methodology for reliably building good software. There is certainly plenty of aged knowledge out there, and lots of hard lessons learned, but that just doesn't seem to be making it back into the mainstream.

It should be noted that there is no one-size-fits-all methodology for creating software systems. As the size of the project increases, the process has to change drastically to avoid dysfunction. Small works have a greater tolerance for mistakes and disorganization. As things grow, little annoyances propagate out into major disasters. Since many projects start with humble origins, as the system changes the process must grow along with it.

There are five distinct stages to software development: analysis, design, coding, testing and deployment. Each stage has its own unique issues and a good process will address them specifically. The biggest challenge is to keep each stage separated enough that it becomes obvious if one of them is being ignored. Too often programmers try to collapse the first three into a single stage in a misguided attempt to accelerate the work, but coding before you know what the problem is and before you get it organized is obviously not going to produce well thought out solutions. Programs are far more than just the sum of the code they contain; they need to really solve things for real people, not just spin the data around.

Analysis is the act of gathering information about a problem that needs to be solved. That information should be collected together in an organized form, but it shouldn't be presented as a design of any type. I've often seen analysts that create screen layouts for new features, but rarely have I seen those designs actually fit back into the existing system and as a result, all of the work is effectively lost. The end-product of analysis is 'who', 'why' and 'what'. Who needs the new feature, why do they need it and what all of the data, formulas, external links, etc. really are. Also for data, an accurate model, frequency, quantity and quality of the data are important facts to have around when designing and building.

The underlying goal of design can be stated by the well-known 17th century proverb: "a place for everything, and everything in its place". That is, the design organizes the work by fitting a structure over the top of it. The walls of that structure form boundaries through the code that keep unrelated parts separated. If the design is followed -- the second half of the quote -- then the resulting code should both work properly and be extendable in the future. If there is no design, the chances are that what follows will be such a tangled mess that it will be unmaintainable. There will often be separate designs for the different aspects of the same system, common ones include architecture, graphic design and UX design. They get separated because they are all fundamentally different and each requires very different skill sets to accomplish properly. A design is only useful if it provides enough information to the programmers to allow them to build their code correctly, thus the 'depth' of the design is vital but also relative to the programming team. For instance, a sky-high design isn't particularly useful for junior programmers since it doesn't provide clear guidance, but it might be all that a team of veterans needs to get going right away.

Within a good process, the programming stage is the least exciting. It is all about spending time to write code that matches the design and analysis. There are many small problems to be worked out, but all of the major ones should have been settled already. As such, coding should just be a matter of time, and it should be possible to understand how much time is necessary to get it done. In practice that is rarely the case, but those unexpected hiccups come from deficiencies in the first two stages: missing analysis, lack of design or invalid technical assumptions. Working those common problems back into a process is extremely tricky since they cause the tasks to change hands in the middle of the work, leaving the possibility that the programmers may just end up sitting around waiting for others. Or worse, the programmers just make some random assumptions and move on. Neither circumstance works out very well, the work is either late or wrong.

Testing is an endless affair since you can't ever test everything; eventually you just have to stop and hope for the best. For that reason the most critical parts of the system -- those that will lead to embarrassment if they are wrong -- should be tested first. Unit testing is fine for complex pieces, but only system tests really validate that everything is working correctly in the final version. Manual testing is slow and imprecise, but easy. A mix of testing is usually the most cost-effective way forward. Retesting is expensive, so a list of problems is preferable to just handling one problem at a time. Often the users are involved too, but again that opens up the door to the work having to go all of the way back to the analysis stage. Care should be taken to document good test cases, both to avoid redundant work and to ensure future quality.

Deployment should be the simplest stage in development. It consists of putting the software out there and then collecting lots of feedback about the problems. For releases there needs to be both a standard release and a quick fix one. The feedback is similar to the initial analysis and needs to be distributed to every other stage within the development so that any problems can be corrected. Testing might have let loose an obvious bug, coding might have made a wrong assumption, the design might not have accounted for existing behavior or resource requirements and analysis might have missed the problem altogether. That feedback is the most crucial information available to be able to repair process problems so they don't reoccur in the future. Quick fixes must only release the absolute minimum amount of code. A classic mistake is to allow untested code to escape into production as part of an emergency patch.

What makes software development complicated is that there are often millions of little moving parts, most of which depend on each other. A good process helps track and manage these relationships while trying to schedule the work as efficiently as possible. Since any work in any stage has the chance that it might just suddenly fall back to an earlier stage, it's incredibly hard to keep everyone properly busy and adhere to a schedule. What I've always found is that essentially pipelining the work fits well. That is, there are many parallel projects all moving through the different stages at different times. So if one feature gets punted from coding back to analysis, there are plenty more waiting to move forward. Pipelines can work well but only if development management has control over the process and is willing to address and fix the problems.

For a big team, the flow issues are really issues of personalities. Software developers don't like to think of programming as a 'people' problem so they desperately try to ignore this. Tweaking the process is often about modifying people's work load, responsibilities and their role in the team, and it's a never ending battle. The process needs to change when the people involved change.

Over the decades I've worked in plenty of really bad and really good processes. The bad ones usually share the same characteristics, there is a blurring of the five different stages which causes the roles to be poorly defined. Those cracks either mean stuff doesn't get done or weird rules get created to patch them. Examples include the analysts doing low-quality screen designs, operations pushing admin tasks back to programmers, programmers doing their own testing and the ever popular programmers doing graphic design. Sometimes a single role is broken up into too many smaller pieces, so you might have sys admins that aren't allowed to reboot for instance, or operational support staff that have no actual knowledge of the running systems so they can't do any real work by themselves. Redefining common terminology, incorrect titles and making up new words or cute acronyms are other rather obvious signs of dysfunction, particularly in an industry that has been around for decades and has involved millions of people.

What good processes seem to have is clearly defined roles at each stage that have enough control to ensure that the work is done correctly and are held responsible if it isn't. So the flow goes smoothly: analyst->designer->coder->tester->operations for each piece of work. In very small groups many roles will be filled by the same person; that works so long as they have the prerequisite skill set and they know which roles and responsibilities they are in, at which times. In fact one person could do it all -- which saves a huge amount of time in communication -- but that absolutely restricts the size of the project to small or medium at best.

With a large+ and messy development there is often a temptation to parallelize everything, so for example to have four separate testing environments to support four separate development paths. The problem is that deployment is serial. If there are two different projects in testing at the same time, they have independent changes. If they go into production as is, the second release reverts the first change. If instead they merge the two, the second set of testing becomes invalidated and needs to be redone. Supporting parallel development usually only makes sense for the first three stages, and in the third stage by constructing a separate branch in the source code control and then relying on the good quality tools to merge them later.

People sometimes confuse "demo'ing rough functionality to the user" with 'user acceptance testing'. In the latter it should be unlikely that any major changes are expected, while in the former it is common. For parallel works that need visibility, there could be a demo branch from coding that lets the second set of changes be seen early, but still preserves the serial nature of the testing and deployment. The trick is to not accidentally call it a testing environment, so there is no confusion.

One of the best new process additions to software development has been the idea of 'iterations'. With exceptionally long programming stages, not only was there concern about things changing, but also there was no way to confirm that the first two stages worked correctly. That, coupled with the monotony of just getting the code done, would often cause people to lose faith in the direction. Small iterations cost more (almost everything is cheaper in bulk), but they balance out the work nicely. Some people have advocated really short iterations of a fixed size. I've always preferred to vary the size based on the work, and usually end each with some testing and most often deployment so as to lock in the finished code. Varying the size also helps with seasonal issues and the effective handling of both large and small changes. Some technical debt repayments, when left too long, can require significant time periods and thus need very long iterations to get done correctly.

One thing that I found works really well is to see every new development as a 'set' of changes to the different layers in the system. Starting from the bottom, most changes affect the schema, parts of the middleware and then the user interface. Rather than trying to change all three things at the same time, I schedule them to get done as three separate, serial projects. So any schema changes go into the first release. They can be checked to see if they are correct and complete. The next release follows with the new middleware functionality. Again it is verified. Finally the interface changes are deployed to a very stable base, so you can focus directly on what they need. Working from the bottom of the system upwards avoids awkward impedance mismatches, simplifies the code, promotes reuse and tends to drive programmers into extending what is already there instead of just reinventing the wheel. Working top-down has the opposite effect.

A friend of mine once said "companies deserve the software they get", which if combined with another friend's sentiment of "the dirty little secret of the software industry is that none of this stuff actually works" is a pretty telling account of why we have so much effort going into software development, yet we are still not seeing substantial benefits from it. Lots of stuff is 'online' now which is great when it works, but when it doesn't it becomes a Herculean effort to try and get it sorted out. It's not that people have simply forgotten how to fix things, but rather that the whole sad tangled mess is so impenetrable that they would try anything to avoid facing it head on. While we know these are design and coding problems, the wider issue is that after all these decades why are we still allowing these types of well-understood problems to get into production systems in the first place? The rather obvious answer is that the process failed, that it was a bad process. Most people want to do a good job with their work but that's highly unlikely in a bad environment. Most development environments are not centred around software engineering. They have other priorities. The only recourse is to craft a protective process around the development itself to prevent destructive cultural elements from derailing everything. In that sense, process is the first and last obstacle between a development project and high quality output. Without it, we are at the mercy of the environment, and so many of those have allowed uncontrollable complexity to lead them into madness. When the process is wrong, we spend too much time trying to scale it, instead of doing the right things. Software development isn't 'fully' understood right now, still we have amassed a huge amount of knowledge over the decades although little of it is actually being utilized. As such, we know what attributes we really need for a good process, there are just some details left to be filled in.

Sunday, November 30, 2014

Error Handling

When I first started programming my focus would be on what I'll call the 'main execution path'. That is, if my task was to add a screen with some new features to the system, I would concentrate on getting the data from the persistence layer to the screen with as few distractions as possible. Basically, to simplify the task, I'd ignore all of the external things that might go wrong while the code was running, effectively assuming that the system was living in a "perfect world".

It didn't take long to realize that this was a problem. Internally computers might be deterministic, but more and more they have to interact with the messy chaos of the surrounding world. If you don't account for that, as I quickly realized, then the operational and support issues start to overwhelm your ability to continue to extend the system. That is, if the code is breaking for every little unexpected problem and frequently requires manual intervention, as the system grows the interventions grow faster. At some point the whole thing is completely operationally unstable. The main execution paths might be working perfectly, but a robust system also needs decent error handling.

The reason I think most programmers choose to code for what I call the perfect world scenario is that it is a whole lot easier than trying to write code that intelligently copes with reality. Really good code that can withstand anything that happens is extremely complex and generally has a large number of interwoven execution paths. It takes time to think through all of the corner-cases, and then even more time to organize and structure it properly to be readable.

The pinnacle of engineering for code is to achieve 'zero maintenance'. That is, you install the system and never have to worry about it again. It just runs. If for any reason there are questions about its state, then you can safely reboot the hardware without concern. There is nothing else that needs to be done, except possibly to upgrade to a better version someday.

Zero maintenance software is exceptionally hard to build, but it can really help a big project achieve its goals. Instead of hacking away at an ever-growing sea of ugly problems caused by accidental inter-dependencies, you get the surety of knowing that the foundations are fully encapsulated and reliable. This allows developers to concentrate on moving forward, extending or adding features. Once a problem is solved, we want it to stay solved until we choose to revisit it later. If you have that, then you have the freedom to plan out a reasonable work strategy that won't keep getting pushed aside because of the next crisis.

Mostly, under modern time constraints it's nearly impossible to get real zero maintenance code; the expectations of the stakeholders no longer allow us to spend enough time perfecting all of the little details. It's unfortunate, but unlikely to change anytime soon. These days I tend to shoot for very low maintenance code, always hoping for a chance to do better.

For any piece of code, it is important to distinguish between what's internal and what's external. The demarcation point is that what's truly internal in any system is only the code that your team has written and is currently maintaining. That's it. If you can't change it, then it is external. This is really important because the first rule of zero maintenance is to never 'trust' anything external. If it's outside of your ability to control, then you have to accept that it will go wrong someday. I first encountered this philosophy when I was young and thought it was overly paranoid. However, after decades of watching things fail, and change, and get disorganized, I've got at least one good story for every possible external dependency, including code, configuration, libraries, data, users, administration, etc. all going bad. Nothing, it turns out, is really safe from tampering or breaking.

Even internally, you still have to watch out for your own team. Programmers come and go, so it's easy for a new one to misunderstand what was written and weaken it to the point of continual and hard to diagnose failures. Good documentation and code reviews help, but a strong architecture, good working habits and well-written software are the only real defenses.

The first big problem programmers encounter is shared resources, like a persistent database. If you have to open up a connection to something, then there are days when it will fail. The naive approach is to just let errors get thrown right to the top, stopping the code. That might be okay if the database was down for days because of a major failure, but most often it's really just unavailable for a short time, so the system should wait nicely until it is up again and then continue with whatever it was doing. That means that for any and all database requests, there is some code to catch the problem and keep retrying until the database is available again. The first time this occurs, some warning should be logged, and if the problem persists, every so often another message is logged to say that the database still isn't available, but care should be taken not to swamp the log file with too many messages. Depending on what's calling the database access code, this handling should happen forever or it should time out. If the code is a long running backend process, it should just keep trying. If it's a GUI request, it should eventually return an 'unavailable' message to the user, or gracefully degrade into lesser service (like stale data).

Some errors will require reinitialization, while others may just need to wait before retrying. Any error coming from the connection should be explicitly handled, so that other programmers following can verify the existence of the code. Given the amount of database code that is in most systems, writing this type of proper error handling every time would be way too much work, so this is one place where reuse is absolutely paramount. The retry/reconnect semantics should be encapsulated away from the calling code, with the exception of perhaps one simple flag indicating whether or not this code should 'quit' after some predefined system timeout (and all interactive code should use the same timing).
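To make that concrete, here is a rough sketch of the kind of wrapper I mean. The names (with_retry, Unavailable, quit_after) are made up for illustration, and the choice of which exceptions count as 'transient' is an assumption; a real database driver raises its own error types and some of them will need a reconnect rather than a simple wait.

```python
import logging
import time

log = logging.getLogger("db")


class Unavailable(Exception):
    """Raised when an interactive caller has waited past the shared timeout."""


def with_retry(operation, *, quit_after=None, delay=1.0, log_every=60.0,
               transient=(ConnectionError, TimeoutError),
               sleep=time.sleep, clock=time.monotonic):
    """Run operation(), waiting and retrying while the resource is down.

    quit_after=None keeps trying forever (backend processes); a number of
    seconds gives up with Unavailable (interactive requests).
    """
    started = clock()
    last_logged = None
    while True:
        try:
            return operation()
        except transient as err:
            now = clock()
            # Log the first failure, then only an occasional reminder,
            # so a long outage does not swamp the log file.
            if last_logged is None or now - last_logged >= log_every:
                log.warning("resource unavailable for %.0fs: %s",
                            now - started, err)
                last_logged = now
            if quit_after is not None and now - started >= quit_after:
                raise Unavailable(str(err)) from err
            sleep(delay)
```

A backend job would call with_retry(run_query) and simply wait out the outage, while a GUI handler would pass the one system-wide quit_after value and turn Unavailable into a polite message or stale data.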

It's hard enough to structure this type of code for simple queries or updates, but it's a whole other level to make this work across distributed transactions on separate databases. Still, what we want if one of the databases is currently unavailable is to wait for it to come up again and then continue on with the processing. We want the software to remember where it was, and what still needs to be done. We do have to watch for tying up resources though. Say you have a transaction across two independent databases and the second one is down. The whole transaction often needs to be rolled back to avoid unnecessarily holding a 'read' lock on the first database. What works for most circumstances is to put parts of the system into a passive 'waiting' state until the missing resources return.

This works for handling long running backend processes that interact with databases, but there are also shared resources for simpler programs like web applications. These can be files, pipes or even cooperating processes that can fluctuate depending on the chaos of their implementations. Our desire is to not let their problems propagate into our system.

Operationally, because of tools like 'quota' it is even possible for a server to run out of disk space. Again the best approach is for the system to wait patiently until the space has been resized, then continue on with whatever it was doing. You rarely see this in code, usually running out of disk space causes a major crash, but given that the disk is just another shared resource it should be treated the same. RAM however is often treated differently because under most circumstances the OS will 'thrash' long before memory is exceeded. That is, it will be swapping in and out an excessive number of virtual pages to the point that that's all it is really doing anymore. For long running processes that need potentially large amounts of memory, care needs to be taken at all levels in the code to only allocate what is absolutely necessary and to reuse it appropriately. Creating leaks is really easy and often expensive to correct later. It's best to deal with these types of constraints up front in the design, rather than later in panic recoding.

In a simpler sense, the zero maintenance ideas even apply to command line utilities. Two properties that are commonly desired are 'interruptible' and 'idempotent'. The first means that any command can be interrupted at any time while it's running, but even if the timing was awkward that won't prevent it from working when run again. That's extraordinarily hard to accomplish if the command line does work like updating an existing file, but it is a very useful property. If you started something accidentally or it's running slow, you should just be able to kill it without making any of the problems worse. Idempotent is similar, in that if the same command is run multiple times then there should be no ill effects. The first time it might insert data in the database from a file. Each subsequent time, with the same file, it will either do nothing or perform updates instead of inserts. I like the second approach better, because it allows for the file to be modified and then the system to be synchronized to those changes.
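For the 'updating an existing file' case, one common trick is to write the new contents to a temporary file and swap it into place at the very end. This is only a sketch of that trick, and replace_file_contents is a hypothetical name, not something from a particular tool.

```python
import os
import tempfile


def replace_file_contents(path, new_text):
    """Rewrite path so that a kill at any moment leaves either the old
    or the new contents, never a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic swap on POSIX; overwrites the target
    except BaseException:
        # Interrupted or failed: remove the partial temporary file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Running it twice with the same text leaves the file unchanged, which covers the idempotent side as well. A hard kill can still leave a stray temporary file behind, so a careful utility would also sweep those up on its next run.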

For whatever reason the originators of relational databases decided to split writes into two very separate functions: insert and update. Often programmers will use one or the other for their task, but generally zero maintenance code always needs both. That is, it checks first to see if the data set is already in the system, and then adds or modifies it accordingly. Getting that right can actually be very messy. In some cases the 'master' version of the data is in the database, so any edits are applied directly. Sometimes the actual master version of the data exists in an external system. If the edits are coming from that system they obviously take precedence, but if they aren't then it's a problem. The best approach is for any specific data entity to assign one and only one system as the actual master of the data, every other system just owns a read-only copy. For most systems that is fine, but there are several cases where it becomes murky. That should be sorted out long before coding. For really bad problems, to preserve integrity and help catch issues, the data needs to be consistently labeled with a 'source' and a 'time'. In some systems there is also an audit requirement, so the actual user making the insert, update or delete needs to be captured as well. Adding in these types of features may seem like overkill, but the moment something goes wrong, and it always will, being able to correctly diagnose the problem because you haven't just blown over the original data, will be a lifesaver and of course the first and very necessary step towards actually fixing the problem permanently. Sometimes you can make the code smart, other times you just have to provide the tools for the users to work out what has happened.
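A bare-bones version of 'check first, then add or modify', with the 'source' and 'time' labels attached, might look like the following. The price table and its columns are invented for the example, and SQLite is used only because it ships with Python. Keeping the previous values around in a history or audit table is left out here.

```python
import sqlite3
from datetime import datetime, timezone


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE price (symbol TEXT PRIMARY KEY, value REAL,"
        " source TEXT, updated_at TEXT)"
    )
    return conn


def save_price(conn, symbol, value, source):
    """Insert the row if it is new, otherwise update it in place."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # one transaction: commit on success, roll back on error
        exists = conn.execute(
            "SELECT 1 FROM price WHERE symbol = ?", (symbol,)
        ).fetchone()
        if exists:
            conn.execute(
                "UPDATE price SET value = ?, source = ?, updated_at = ?"
                " WHERE symbol = ?",
                (value, source, now, symbol),
            )
            return "updated"
        conn.execute(
            "INSERT INTO price (symbol, value, source, updated_at)"
            " VALUES (?, ?, ?, ?)",
            (symbol, value, source, now),
        )
        return "inserted"
```

Loading the same file twice then produces an insert followed by a harmless update, which is the idempotent behaviour described earlier.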

Persistent databases actually cause all sorts of data clashing problems that are frequently ignored. One of the simplest is 'dirty' reads. Someone starts editing data, but they walk away from the computer for a while. In the meantime, someone else comes in and makes some edits. Nice systems will alert the original user that the underlying data has changed, and really nice ones will present all three versions to the user to allow them to manually merge the changes. Mean systems will just silently clobber the second person's data. Some systems will just irritate the user by timing out their editing session. Occasionally you see a system where the first user locks out the second, but then those systems also need tools and special handling to break the locks when it turns out the first user has gone on vacation for a couple of weeks. Mostly though, you see mean systems, where the programmers haven't even noticed that there might be a problem.
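One way to build a 'nice' system is to keep a version number on each record and refuse a save if the version has moved since the user loaded it. The sketch below assumes a made-up note table with a version column; the merge screen itself is not shown, only the detection.

```python
import sqlite3


class StaleEdit(Exception):
    """The record changed after this user loaded it."""


def load(conn, note_id):
    """Return (body, version) as seen when the user opened the editor."""
    return conn.execute(
        "SELECT body, version FROM note WHERE id = ?", (note_id,)
    ).fetchone()


def save(conn, note_id, new_body, seen_version):
    """Save only if nobody else has saved since seen_version was read."""
    with conn:
        cur = conn.execute(
            "UPDATE note SET body = ?, version = version + 1"
            " WHERE id = ? AND version = ?",
            (new_body, note_id, seen_version),
        )
        if cur.rowcount == 0:
            raise StaleEdit(
                f"note {note_id} changed since version {seen_version}"
            )
```

When StaleEdit is raised, the interface has what it needs to alert the user: the version they started from, their own edit and whatever is now in the database.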

The key aspect for zero maintenance is not trusting anything external. That means that any data coming from any source, user interface, database, or external feed from any other system is not to be trusted. Thus it needs very specific code to validate it, and often to validate its cross-dependencies as well. Like the database code, this type of checking would require too much code if it were rewritten for every screen or every data feed. Thus, this is another place where reuse is mandatory. If you have one library of checks for any possible external data entity, then all you need to do is call a validator if you know that the data comes from outside of the code base. Again, dealing with bad data from a feed is different than dealing with bad data from an editing screen, but it should be handled similarly. Given a data entity, which in OOP would be an object, the validator should return a list of errors. If there are none, then it is fine. If there are many, then on the edit screen each bad field can be flagged or if it is batch the errors are combined into a log message. Depending on the data, it might be correct to ignore bad feed data, or the system might fall back to default data, or even sometimes the data might be redirected to an edit queue, so that someone can fix it when they get a chance and then retry. Each approach has its own merits, based on the source of the data and the system requirements.
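As a small illustration of a validator that returns a list of errors, consider the following. The Trade entity and its three rules are invented for the example; a real library would carry one such validator per external entity.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    symbol: str
    quantity: int
    price: float


def validate_trade(trade):
    """Return a list of (field, message) pairs; an empty list means valid."""
    errors = []
    if not trade.symbol.strip():
        errors.append(("symbol", "is required"))
    if trade.quantity <= 0:
        errors.append(("quantity", "must be positive"))
    if trade.price < 0:
        errors.append(("price", "must not be negative"))
    return errors


def errors_for_log(errors):
    """Batch feeds combine every error into one log message."""
    return "; ".join(f"{field} {message}" for field, message in errors)
```

The edit screen walks the same list and flags each field by name, so both callers share one set of checks and differ only in how they present the result.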

In some systems, if the incoming data is unavailable, the best choice is to show an error screen. But that is fairly ugly behavior and sometimes the users would rather see stale data. In that case the system should fall back to older data, but it needs to clearly indicate this to the users. For a realtime monitoring system, if the users are watching a table of numbers, the nicest code would colour the table based on the age of the data. If nothing were available for perhaps minutes or hours, the colour would gradually go to red, and the users would always be aware of the quality of what they are seeing. They wouldn't need to call up the operations dept. and ask if things are up-to-date.
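The colouring itself can be very little code. Here is one possible mapping from the age of a reading to a colour; the function name and the thresholds (fresh for a minute, fully red after an hour) are arbitrary choices for the sketch and would be tuned per system.

```python
def staleness_colour(age_seconds, fresh_for=60.0, red_after=3600.0):
    """Map the age of a reading to an RGB hex colour, green through to red."""
    if age_seconds <= fresh_for:
        fraction = 0.0
    elif age_seconds >= red_after:
        fraction = 1.0
    else:
        fraction = (age_seconds - fresh_for) / (red_after - fresh_for)
    red = round(255 * fraction)
    green = round(255 * (1.0 - fraction))
    return f"#{red:02x}{green:02x}00"
```

Each cell in the table then carries its own colour, computed from the timestamp of the last value that actually arrived.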

Zero maintenance also implies that any sort of editing requirement by the users is fully handled by the interface. For all entities if the data is in the system it is either read-only or there is a way to add, edit and delete it. There should never be any need to go outside of the interface for any data admin tasks. If the system contains historic data, there are tools available to check and correct it if there are errors. Data can always be removed, but even if it is no longer visible, it still maintains a presence in the log or audit handling. The general idea is to be able to empower the users with the full ability to track and correct everything that they've collected. If that is set up well, the programmers can stay far away from the issues of data maintenance, and if the underlying persistence storage is a relational database with a normalized schema, the programmers can avoid 'reporting hell' as well by pushing that work onto some generic reporting tool. The up-front time saved from having the development team avoid data maintenance and reporting can be better spent on perfecting the code base or fully adding in new features.

Getting really old data into a new system makes it more useful, but really it is often a massive amount of work. Mostly the problems come from modeling, poor quality or fuzzy data merges. What often happens is that programmers skip the temporal aspects of the data, or its original source information. Sometimes they just model only the subset that concerns them at the present moment, and then try to duct tape their way into extending the model. By now you'd think that there were readily established tools to model and move data around, but except for some very pricey products (which no one ever wants to pay for) most of the ETL code is just hacked individually, for each specific type of data. If the requirements are for a wide ranging collection of historic data, then it makes a lot of sense to design and implement one big general way of back-filling any type of data within the system. It's not trivial to write, but it really is a lot of fun, and it teaches programmers to handle coding issues like abstraction and polymorphism. The idea is to eliminate absolutely every special case possible, while providing an easily configurable interface like a DSL to apply transformations and defaults. Once that type of facility is built, the original programmers can move onto something else, while some other group does the actual migration. It's best to get the right people focused on working on the right problems at the right time.

Another common mistake is confusion about keys. They are really an internal concept. That is, a system creates and manages its own keys. Sometimes you get data from other systems with their keys included. Those are external keys; they can't be relied upon and the data needs to be carefully rekeyed internally. Most developers really like creating new numeric keys for all major entities; this not only helps speed up access, but it abstracts away the underlying fields in the key and allows for reasonable editing. Sometimes programmers shy away from creating numeric keys in an effort to save time or thinking, but that's almost always the type of mistake that is heavily regretted later. Either because it is inconsistent with the rest of the schema, or a situation arises where the underlying fields really do need to be edited but also remain linked with other entities. Keeping the inter-relationships between data entities is crucial for most systems, since real world data is rarely flat. Once the data is in the system, its modeling should be strong enough that it doesn't need to be messed with later, which is always a horrifically expensive exercise. Just tossing data in, in any form, and then expecting to fix it later is a recipe for disaster. Dumping unrelated data into arbitrary fields is absolute madness.
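Rekeying can be as simple as a lookup from (source system, external key) to an internal number. The KeyMap class below is a toy, in-memory version written just to show the shape of it; in a real system the mapping would live in its own table in the database.

```python
import itertools


class KeyMap:
    """Assigns internal numeric keys to (source system, external key) pairs."""

    def __init__(self):
        self._next = itertools.count(1)
        self._by_external = {}

    def internal_key(self, source, external_key):
        pair = (source, external_key)
        # The same external record always maps to the same internal key;
        # a pair that has not been seen before gets the next number.
        if pair not in self._by_external:
            self._by_external[pair] = next(self._next)
        return self._by_external[pair]
```

Because the source is part of the lookup, two feeds that happen to reuse the same external key never collide, and if a feed renumbers its records only the mapping has to change.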

Given that zero maintenance systems can be conveniently restarted if there are questions about their internal state, it is also necessary that they are configured to start automatically and that their startup times are minimized. In the middle of a bad problem you can't afford to wait while the code warms up gradually. Most often the delays are caused by faulty attempts at performance optimizations like caching. When implemented correctly these types of optimizations shouldn't require prolonged start up times, they should kick in immediately and gradually get more efficient as time progresses. It's easier to just wait until everything has been slowly loaded into memory, but again that's just another dangerous shortcut to avoid spending the effort to build the code properly.

Eventually all systems need to be installed or upgraded. A zero maintenance system has an installer to make the process trivial. But it also often has the attribute that any two versions of the system can co-exist on the same machine without interfering with each other. Again, really getting that to work is complex particularly when it comes to handling shared resources. For instance, if you have one version installed on a box with an older schema and lots of data, installation of the newer version should be able to create a new schema, then copy and upgrade the old data into it. In fact, at any time, any new version should be able to grab data from an older one. In some cases, the databases themselves are just too large to allow this, but a well-written system can find a way to copy over all of the small stuff, only causing minimal disruption. The key reason this is necessary is that new releases sometimes contain subtle, lurking bugs that go unnoticed for days. If one presents itself, the process to rollback should be easy, safe and quick, but also carry with it any new data for the old schema. That way most of the work isn't lost and the choice to roll back isn't devastating.

One important thing to be able to detect is faulty installations. Once code goes into a production environment, it is out of the hands of the programmers. Strange things happen. It is best for each piece to have a clear and consistent version number and for all of the pieces to cross check this and warn if there is a problem. Overly strict code can be annoying, particularly if it is wrong about the problem, so I've always felt that in 'debug' mode the code should stop at the first error it encounters, but in production it should try to muddle through but issue warnings to the log. This avoids the problem where a minor error with a change gets escalated to a major problem that stops everything.
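A cross check along those lines does not need to be elaborate. In this sketch the names (check_versions, VersionMismatch) and the idea of passing in a dictionary of component versions are assumptions made for illustration; how each piece reports its version will depend on the system.

```python
import logging

log = logging.getLogger("startup")


class VersionMismatch(Exception):
    pass


def check_versions(expected, components, debug=False):
    """Compare each component's reported version against the expected one.

    Returns the names of mismatched components. In debug mode the first
    mismatch raises; otherwise each one is logged as a warning.
    """
    mismatched = []
    for name, version in sorted(components.items()):
        if version == expected:
            continue
        message = f"{name} is at version {version}, expected {expected}"
        if debug:
            raise VersionMismatch(message)
        log.warning(message)
        mismatched.append(name)
    return mismatched
```

In production the caller can carry on with whatever list comes back, while a developer running with debug=True gets stopped at the first inconsistency.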

Bugs happen, so zero maintenance accepts that while trying very hard to not let any of them deceive the users. That is, if the data is questionable there needs to be a way to still present it, but flag it. Then it becomes the user's problem to decide whether or not it is serious. It doesn't matter if it really is a data problem or if it is because of a coding bug, the facility to move forward and alert the users should be designed into the system.

Overall, the general idea with zero maintenance is to solve the problem at a deep enough level that the programmers don't have to get involved in operations or any of the user issues with collecting and managing a large set of data. If it's all done well, the system becomes unobtrusive. That is, it fades into the background and the problems encountered by the users are those that belong to their job, not the technology that they are using to accomplish it. Operations is rarely bothered, and there is just one simple approach to handling bad days. If you think about most software out there, it works fine for a bit but becomes extraordinarily ugly if you try to deviate outside of its main execution path. This provides all sorts of pain and frustration and generally is one of my core reasons for insisting that the current state of software is 'awful'. Computers with well-written software can be powerful tools that help us manage the chaos and complexity of the world around us. What we don't need is an extra level of annoyance brought on by programmers ignoring that reality. A system that works on good days is barely usable, but one that we trust and is zero maintenance is what we really need. And as programmers, just splatting some data to the screen in a perfect world is only an interesting problem to solve for the first few years, but after that it's time to move on to the real challenges.

Saturday, October 18, 2014

The Necessity of Consistency

A common problem with large projects starts with adding new programmers. What happens is that the new coder bypasses the existing work and jumps right into adding a clump of code. Frequently, both to make their mark but also because of past experiences, they choose to construct the code in a brand new, unrelated way. Often, they'll justify this by saying that the existing code or conventions are bad. That's where all of the trouble starts.

In terms of good or bad, most coding styles are neither. They're really a collection of arbitrary conventions and trade-offs. There are, of course, bad things one can do with code, but that's rarely the issue. In fact, if one had to choose between following a bad coding practice, or going in a new direction, following the existing practice is the superior choice. The reason for this is that it is far easier to find and fix 20 instances of bad code than it is to find 20 different ways of doing the same thing. The latter is highly subject to missing code and bug-causing side effects, while the former is simple refactoring.

When a programmer breaks from the team, the work they contribute may or may not be better, but it definitely kicks up the complexity by a notch or two. It requires more testing and it hurts the ability to make enhancements, essentially locking in any existing problems. And as I said earlier, it is often no better or worse than what's already there. 

It's ironic to hear programmers justify their deviations as if they were improving things, when experience shows that they are just degenerating the existing work. Once enough of this happens to a big code base it becomes so disorganized that it actually becomes cheaper to completely start over again, tossing out all of that previous effort. Overall the destruction outweighs any contributions.

That's why it is critical to get all of the programmers on the same 'page'. It is a sometimes-painful necessity to allow any large development project to achieve its full potential and it is one of the core ingredients necessary to forming a 'highly effective team'. The latter being one of the keys to building great systems; in that dysfunctional teams always craft dysfunctional software. 

With all of that in mind, there are a couple of 'rules' that are essential for any software project if it wants to be successful. The first is that all of the original programmers for a new project need to come together and spend some time to lay out a thorough and comprehensive set of coding conventions. These should cover absolutely 'everything', and everyone should buy into them before any coding begins. This isn't as difficult as it sounds, since most brand new projects start with very small tiger teams. 

The next rule is that any new programmer has to follow the existing conventions. If they don't they should be removed from the team before they cause any significant problems. It should be non-negotiable.

Now of course, this is predicated on the idea that the original conventions are actually good, but often they are not. To deal with this, a third rule is needed. It is the simple idea that anyone at any time can suggest a change to the existing conventions. For this to come into effect two things are necessary. First, the remaining programmers all need to buy into the change, and second the programmer requesting the change must go back to 'all' existing instances in the code and update them. The first constraint ensures that the new change really is better than what is there, while the second one sets a cost for making the change and also ensures that everything in the code base remains consistent afterwards (it also teaches people to read code and to refactor).

If teams follow the above rigorously, what happens is that the code base converges onto something organized and well-thought out. It does add some overhead, but a trivial amount relative to a messy disorganized ball of mud. And of course it means that the development project never ends for avoidable internal reasons. 

Monday, October 13, 2014

The Depth of Knowledge

A few decades back, I decided that I wanted to know more about quantum physics. I've always been driven to indulge my curiosity about various disciplines; it's been a rather long-standing obsession since I was a kid.

I went out to the bookstores (remember those :-) and bought a huge stack of books on what might be best described as the 'philosophy' underpinning quantum physics; stuff like the wave/particle duality and Schrödinger's rather unfortunate cat. It was fun and I learned some interesting stuff. 

I'm fairly comfortable with a wide range of mathematics, so I bought a book called Quantum Theory by David Bohm. The first chapter 'The Origin of the Quantum Theory' nearly killed me. I was so hopelessly lost by the first sixteen pages. I'm sure if I dedicated years to studying, I could at least glean 'some' knowledge from that tome, but it was likely that it just wasn't possible to get there as a hobby.

Knowledge has both breadth and depth. That is, there is a lot of stuff to know, but getting right down into the nitty gritty details can be quite time-consuming as well. Remembering a long series of definitions and facts is useful, but knowing them doesn't necessarily help one to apply what they've memorized. Knowledge is both memory and understanding. My leap into quantum physics is a great example, I learned enough to chat about it lightly at a party but not enough to actually apply it to anything useful, not even chatting to an actual quantum physicist. I got a very shallow taste of the higher level perspective, but even after all those books I still didn't really understand it.

That knowledge is both wide and deep causes lots of problems. People can easily convince themselves that they understand something when they've only scratched the surface. It's like trying to make significant contributions to a field where you have only taken the introductory 101 course. If it's a deep field, there are years, decades and possibly centuries of thinking and exploration built up. What's unknown is buried deep below that accumulated knowledge. Finding something new or novel isn't going to be quick. If you want to contribute new stuff, you really have to build on what is already known, otherwise you'll just be right back at the starting point looking over well-trodden ground.

And some knowledge is inherently counter-intuitive. That is, it contradicts things you already thought you knew, so that you literally have to fight to understand it. Not everything we understand is correct, some of it is just convenient over-simplifications to make it possible for us to operate on a day-to-day basis. That 'base code' helps, but it also hinders our deeper understanding. We see what we need to, not what is really there.

It's pretty amazing how deep things can really go, and how often what's under the surface is so different from what lies above. My curiosity has led me to all sorts of unexpected places, one of my favorites being how much can be learned from really simple systems like boolean equations. Going down, it can feel like a bottomless pit that endlessly twists and turns each time you think it can't go any further. With so much to grok from just simple things, it's truly staggering to contemplate the full scope of what we know collectively, and what we've forgotten. That 'field' of all things knowable exceeded our own individual capacities a long time ago, and has steadily maintained its combinatorial growth since then. At best we could learn but a fraction of what's out there.

Software is an interesting domain. Its breadth is huge and continuously growing as more people craft their own unique sub-domains. A quick glance might lead one to believe that the whole domain is quite shallow, but there are frequently deep pools of understanding that are routinely ignored by most practitioners. Computational complexity, parsing, synchronization, intelligence and data mining are all areas that sit over deep crevices. They are still bottomless pits, but depth in these areas allows programmers to build up the sophistication of their creations. Most software is exceptionally crude, just endless variations on the same small set of tricks, but some day we'll get beyond that and start building code that really does have the embedded intelligence to make people's lives easier. It's just that there is so much depth in those sorts of creations and our industry is only focused on expanding the breadth at a hair-raising pace.

It's easy to believe that as individuals we possess higher intellectual abilities, but when you place what our smartest people know alongside our collective knowledge, it becomes obvious that our inherent capabilities are extremely limited. Given that, the rather obvious path forward is for us to learn how to harness knowledge together as groups, not individuals. Computers act as the basis for these new abilities, but we're still mired in the limits of the personalities involved in their advancements. Software systems, for example, are only as strong as their individual programmers and they are known to degenerate as more people contribute. We only seem to be able to utilize the 'overlap' between multiple intellects, not the combination. That barrier caps our abilities to utilize our knowledge.

Despite all we know, and all of the time that has passed, we are still in the early stages of building up our knowledge of the world around us. If you dig carefully, what you find is a landscape that resembles swiss cheese. There are gaping holes in both the breadth and depth of what we know. Sure, people try to ignore the gaps, but they percolate out into the complexity of modern life at a mind-numbing rate. We rooted our cultures on a sense of superiority over the natural world, but clearly one of the largest things we still have to learn is that we are not nearly as smart as we think we are. If we were, the world would progress a lot more smoothly than it has been. Knowledge is, after all, a very significant way to avoid having to rely on being lucky.

Wednesday, September 3, 2014


One of the most persistent myths in software development is that it is impossible to estimate the amount of programming work required for a project. No doubt this is driven by people who have never been put in a position where estimates mattered. It is indeed a very difficult skill to master but a worthwhile one to have, since it routinely helps to make better decisions about what to write and when to write it.

In my younger days I was firmly in the camp that estimates were impossible. That changed when I got a job working with a very experienced, veteran team of coders. I whined that it was "hopeless", but they said to "just try it", so eventually I did. We settled on lines-of-code as the metric, and over the years we tracked how much we wrote and how large the system was. Surprisingly, after years of tracking our progress, we found that we could make quite strong estimates about our future work output.

After I left that job, I found myself in a startup that was particularly dependent on selling custom code to the clients. It accounted for a significant part of our revenue, and it all relied on being able to produce quick, accurate estimates during client meetings. Estimate too high and they wouldn't buy in. Estimate too low and we'd lose money on the work. Since it was just after the dotcom bomb, things were really tricky so we needed every penny we could get to avoid bankruptcy. My earlier experience really paid off. I would sit in meetings, noting down all of the pieces in discussion, then be able to quickly turn these features into dollars so that we could hammer out an appropriate set of functionality that matched back to the client's budget. That on-the-fly type of price negotiation could have been really dangerous except that I was quite accurate with calculating the numbers. When you have no alternatives, you get forced to sharpen your skills.

For the rest of this post I'll do my best to describe how I estimate work. There isn't a formal process and of course there are some tricky tradeoffs to be made, but with plenty of practice anyone can keep track of their past experiences and use that to project future events. It's a skill, a very useful one and one that really helps with ensuring that any software project does not get derailed by foreseeable issues.

The first thing that is necessary is to choose a reasonable metric for work. Although it is imperfect, I really like lines-of-code (LOC) since it's trivial to compute and when used carefully it does reasonably match the output of a development project. It's also easy for programmers to visualize.

Most of the systems I've worked on have been at least tens of thousands of lines, often hundreds of thousands, so to simplify this discussion I'll use the notation 10k to mean 10,000 LOC; it just makes things easier to read.

For me, I count most of the lines including spaces and comments, since they do require effort. I generally avoid counting configuration files, and for web apps I usually separate out the GUI. Coding in languages like HTML and CSS is often slower than the more generalized languages like Java or C#. This causes a multiplier effect between different programming languages. That is, one line in Java is worth two lines in straight C, since the underlying primitives are larger. Tracking the differences in multipliers due to language or even the type of code being written is important to being able to assess the size of the work properly.

I generally like to watch two numbers. The first is the current running total of lines for the system. So we might talk about a 150k system or perhaps a rather smaller 40k one. If along the way someone replaces 5k with a better 10k piece, the total has only increased 5k. That is, modified lines count for nothing, deleted lines are subtracted and new lines are added. It's only the final total in the repo that really matters. It's easy to calculate this using a command line tool like word count, e.g. "wc -l *.java". One of the first scripts I write for any new job is to traverse all of the source and get various counts for the existing code. The counts are always tied to the repo. If it isn't checked in, it hasn't been completed.
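A minimal sketch of that kind of counting script might look like the following. It is written in Python rather than as a shell one-liner, and the function names and the default ".java" extension are assumptions for illustration, not the actual script described above.

```python
import os


def count_lines(root, extensions=(".java",)):
    """Return {path: line_count} for every matching source file under root."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # str.endswith accepts a tuple, so several extensions can be passed.
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                # Blank lines and comments are counted too. Unlike "wc -l",
                # a final line without a trailing newline is also counted.
                with open(path, "rb") as handle:
                    counts[path] = sum(1 for _ in handle)
    return counts


def total_lines(root, extensions=(".java",)):
    """Running total of lines for the whole system."""
    return sum(count_lines(root, extensions).values())
```

Running total_lines(".") from the top of a checked-out repo gives the running total, and the difference between two runs gives the net change over that period.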

The second number I like to watch is any single developer's production. That is the total number of lines added to the repo per year (although sometimes I talk about weekly numbers). So I think of programmers as being able to contribute say 20k or 50k. Sometimes I think in terms of new lines, sometimes in terms of net lines, depending on the level of reuse in the code. Again, language and code type are also significant; you can't compare a 15k JavaScript programmer to a 25k C# coder, but knowing the capacity of the coders on a given team really helps in determining if any particular project is actually viable.

One big issue to watch out for is the copy-and-pasters. They might have awesome numbers like 75k, but the messiness of their output and the dangers of their redundant code really reduce their work to 1/3 or worse of their actual counts. That, and they endanger any ability to safely extend their code. If you were to try and capture the counts, the copied code would count for 0 lines, modifications and additions would add, but deletes would be ignored.
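To make the two counting rules concrete, here is a small sketch. The Change record and both function names are hypothetical, and the second rule is only one reading of how copied code might be discounted.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """Line counts for one piece of checked-in work (hypothetical record)."""
    added: int = 0      # brand new lines
    modified: int = 0   # existing lines edited in place
    deleted: int = 0    # lines removed from the repo
    copied: int = 0     # lines pasted in from elsewhere in the code base


def repo_total_delta(change: Change) -> int:
    """Change to the running system total: only what ends up in the repo counts."""
    # Modified lines add nothing; pasted lines still make the repo bigger.
    return change.added + change.copied - change.deleted


def discounted_output(change: Change) -> int:
    """Output credited to a developer once copy-and-paste is discounted."""
    # Copied lines count for 0, additions and modifications add,
    # and deletes are ignored.
    return change.added + change.modified
```

For example, replacing a 5k piece with a better 10k one is Change(added=10_000, deleted=5_000), which moves the system total by 5k.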

Now generally a 50k programmer might have lots of 1k weeks, but very few programmers are actually consistent in their output and modern software development is highly disruptive. It's a lot of 'hurry up and wait'. Back in the waterfall days, when the development stage lasted months or years it was easier to be consistent for long periods, and it also meant higher yearly counts. But even then some weeks will have very low counts, while others might be over 2k. Because of that variability it is more appropriate to look at the yearly numbers. As the project matures, the counts should get lower as well. That happens, particularly if the software is well-written, because the size of the underlying primitives grows. Instead of having to start from scratch each time, most new work can leverage the existing code (if the counts aren't going down, then possibly the cut & pasting is increasing).

Code that is algorithmically complex or is particularly abstract is obviously a lot slower to write and debug. In that sense, the backend guy from a project will progress much slower than the front end guy. So you might have a domain expert coder at 15k, with a GUI front guy at 35k. A decent generalized piece might have really low counts, but when it's reused each time it saves the overall system size from bloating, although it doesn’t reflect directly in the numbers.

Thinking in terms of totals and capacity has the nice secondary attribute that it makes it easy to translate into man-years. For example, a small but functional enterprise web-app product needs at least 100k to round out its functionality (including administration screens). For a 50k coder that's at least 2 man-years. If you see a 350k system, you can get a sense of how long it might take to rewrite it with a team of 25k coders: it's sitting at about 14 man-years to replace, unless you can get away with a lot less code.
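The arithmetic here is just a division, sketched below. The function name is an assumption for illustration; the numbers are the ones from the paragraph above.

```python
def man_years(system_loc: int, loc_per_coder_year: int) -> float:
    """Rough effort to write a system of a given size at a given yearly output."""
    if loc_per_coder_year <= 0:
        raise ValueError("yearly output must be positive")
    return system_loc / loc_per_coder_year


print(man_years(100_000, 50_000))  # 2.0  (100k web app, 50k coder)
print(man_years(350_000, 25_000))  # 14.0 (350k rewrite, 25k coders)
```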

My suggestion to most programmers is to learn to track both numbers for themselves. They don't have to share them with management, but understanding their own abilities to produce code and the size of the projects that they've experienced really helps in knowing their current skill level. That keeps programmers from over-promising on their assignments. If you are a 30k programmer and someone wants you to hammer out a 40k web app in half a year, you can be pretty sure that you're in trouble coming right out of the gate. If you know that in advance -- as opposed to the last month or so -- you can start taking action to address the issue, such as asking for more resources or reducing expectations (or reducing the quality).
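The same numbers can be turned into an early warning. This is only a sketch, with a hypothetical function name, and it assumes output is spread evenly over the year, which rarely holds week to week.

```python
def shortfall(required_loc: int, loc_per_year: int, years: float) -> float:
    """Lines that will still be missing at the deadline (0.0 if it fits)."""
    capacity = loc_per_year * years
    return max(0.0, required_loc - capacity)


# A 30k programmer asked for a 40k web app in half a year:
print(shortfall(40_000, 30_000, 0.5))  # 25000.0
```

A result well above zero on day one is the signal to ask for more resources or reduce expectations.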

It's also worth noting that speed isn't what counts; it's organization and quality that really matter. I would be far happier with a detail-oriented 15k programmer whose work we never have to debug or rewrite than a reckless 75k programmer who is essentially just contributing more problems. Projects last forever, so it's important to get the work done well, encapsulate it and then move on to other more interesting code. A well-rounded team usually has a mixture of fast and slow programmers, with hopefully a wide range of different skills.

My first code tracking experiences came when building a massive 150k distributed cache. In the early years I was cranking out about 50k per year as my primary project (I've always had at least one secondary project that was not intended for production, but just to sharpen my skillset). Years later, I was working on a 120k enterprise product in Perl at about 30k per year (with lots of sales and management duties). These days I'm somewhat less than 15k in C# on a 350k behemoth (which if it wasn't for cut & paste should have been closer to 150k), but again my role isn't just pure coding anymore.

Size matters. I've worked on a few >1M systems, but at that level there are usually teams that are responsible for 50k - 200k chunks of the system. From experience I've found that a code base over 50k - 100k starts to get unwieldy for a single individual to handle by themselves. Most single developer in-house systems tend to range from 25k - 50k. Commercial systems have to be larger (to justify the sales) and tend to be higher quality so there are more lines; they usually start around 100k and go into the millions (except possibly for apps). Quality requires more code, but reuse requires quite a bit less, and there can be big multipliers such as 2x or even 10x for brute force and cut & paste. It's those latter multipliers that always make me push reuse so hard. Time might be saved from pasting in code, but it gets lost again in testing, bugs and obfuscating any attempt to extend the work. In a big project, bad multipliers can result in the work falling "years" behind schedule, particularly if the work is compared to a competitor with an elegant code base. If you have to do three times the work, you'll have 1/3 of the features.

Once you've mastered the skill of tracking, the next thing to learn is estimating. I always use ranges, so a new bit of work might be 10k - 20k in size. The best and worst cases are drawn directly from experience with having written similar code. The ranges get wider if there is a lot of uncertainty. They are always relative to the system in question, so that a new screen for one rather brutish system might require 20k in code, but for an elegant one with lots of reuse that might be 5k or less (getting below 1k is awesome).
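One way to keep ranges honest is to carry both ends through the arithmetic instead of collapsing to a single number too early. The sketch below assumes a hypothetical Estimate type; apart from the 10k - 20k piece mentioned above, the sizes are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A best-case / worst-case size in lines of code."""
    low: int
    high: int

    def __post_init__(self):
        if self.low > self.high:
            raise ValueError("best case cannot exceed worst case")

    def __add__(self, other: "Estimate") -> "Estimate":
        # Adding two pieces adds the best cases and the worst cases separately.
        return Estimate(self.low + other.low, self.high + other.high)


def total(estimates) -> Estimate:
    """Combine the ranges for several pieces of work into one range."""
    result = Estimate(0, 0)
    for estimate in estimates:
        result = result + estimate
    return result


pieces = [Estimate(10_000, 20_000), Estimate(5_000, 8_000), Estimate(1_000, 3_000)]
print(total(pieces))  # Estimate(low=16000, high=31000)
```

Adding up every worst case is deliberately pessimistic; in practice some pieces land near their best case, so the real total usually falls somewhere inside the combined range.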

Of course estimating in an ongoing project is easier than a green field one, but in some sense if a programmer isn't experienced enough to make a reasonable estimate he or she might take that as a sign that they don't currently have enough experience to get the work completed properly. Most system extensions are essentially horizontal, that is they are just adding in features that have the same design patterns as existing features. In that case, if the new work leverages existing work, then a huge amount of time can be saved per feature and the code will automatically maintain good consistency (and need less testing).

I pretty much expect that something similar to the Pareto rule works for overall complexity. That is, between 10 and 20 percent of the system is difficult, slow code while the rest is fairly routine. The 80% still requires work, but it should be quite estimable with practice. The difficult stuff is nearly impossible to estimate since unexpected problems crop up, which is why each key aspect of it should be prototyped first during the analysis and/or design phase (for startups this is essentially version 1.0). If you separate out the code into independent problems then it makes it easier to plan accordingly. Often under-estimates in some parts balance against over-estimates in others.

Another common issue is uncontrolled scope creep, where the underlying requirements are constantly shifting. Sometimes that is just a lack of reasonable analysis. Since analysis is way cheaper than code, the development should be put on hold until the details are explored more thoroughly. It's always faster than just flailing at the code. Sometimes scope creep is caused by natural shifts in the underlying domain, in which case the design needs to be more dynamic to properly solve the real problem.

It's worth keeping in mind that depending on the testing requirements and the release procedures, a coding estimate is only a fraction of the work. Testing, in reasonable systems, tends to stay in lock step as a fraction of the coding work. Releases are more often a fixed amount of effort.

Once you can size incoming work, the next trick is to be able to shift the team around to get it done effectively. Programmers are like chess pieces; they all differ in strengths. It doesn't work to give a backend guy some GUI piece if you have a faster front end guy available for the job. This is complicated, of course, since some programmers don't want to focus on their strengths; they want assignments that stretch their abilities. Good estimations actually make it easier to let them try out different parts, since tracking lets you know how much work is remaining and whether there is actually some slack time available. High efficiency teams are generally ego-less such that the programmers aren't siloed; they can work on any section of the code at any time. Of course having at least two coders familiar with every part of the system reduces disruptions with emergencies, holidays and people leaving. It also helps with communication and overall quality. But it has to be managed in a way that doesn't just blindly assume that all programmers are interchangeable cogs. Each one is unique, with their own skills and suitability, thus it takes plenty of observations and some deep management skill to deploy them well.

For me, with deadlines, I'll either accept a variable number of features with a hard date, or the full set of features with a variable date. That is, we can deliver 4 - 10 features on date X, or we can deliver all 10 features at some floating time in the future. Sometimes the stakeholders argue, but generally there are one or two really key things they are after anyways. With reasonable estimations, fixed dates are manageable. You can choose to do a couple of priority items early, then go after some low hanging fruit (easy items), then just aggressively cut what isn't going to make it. A key trick to this approach is to get the developers working on just one or possibly two items at a time, and don't let them go on until the items are done and dusted. Estimations really help here because you can quickly see if any of the items exceeds the length of the next iteration, and thus should be branched off from the main development (and then not tied up in cross dependencies). Obviously the more accurate the estimates, the more options you have in planning the development and managing the expectations of the stakeholders. Accurate estimates reduce a lot of the anxieties.
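The hard-date approach can be sketched as a simple greedy pass: priority items first, then the easiest remaining ones, and whatever no longer fits gets cut. The Feature type, the function name and the use of worst-case line counts as the budget are all assumptions for illustration, not a formal method.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    loc: int                # worst-case estimate, in lines of code
    priority: bool = False  # one of the few things stakeholders really need


def plan_fixed_date(features, capacity_loc):
    """Split features into (planned, cut) for a hard date with fixed capacity."""
    # Priority items sort first; within each group, smaller items come first
    # (the low hanging fruit).
    ordered = sorted(features, key=lambda f: (not f.priority, f.loc))
    planned, cut = [], []
    remaining = capacity_loc
    for feature in ordered:
        if feature.loc <= remaining:
            planned.append(feature.name)
            remaining -= feature.loc
        else:
            cut.append(feature.name)
    return planned, cut
```

With 8k of capacity, a 4k priority item plus 1k, 2k and 6k extras would plan the first three and cut the 6k one, which can then be offered on a floating date instead.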

Once a project gets established, the incoming workload tends to fluctuate. That is, it's quiet for a while, then a whole load of new work comes in all at once followed by a rush to get it done. Software development is best when it is a smooth and steady process, so being able to scale the work loads and re-arrange the teams really helps in smoothing out the development to a nice consistent pace. This avoids destructive practices like 'crunch mode' and allows for intermediate periods to reduce technical debt. It also makes it easier to decide when to just hack up some throw-away code and when to take the time to generalize the code so that you can leverage it later. Foresight is an incredibly useful skill for big development projects.

What gets a lot of projects into trouble is that they become reactive. They focus on just trying to keep up with the incoming work, which ultimately results in very high technical debt and further complicates the future workload. Learning to estimate is tricky, but valuable as it helps to make better decisions about what to write and when to write it. This helps in getting ahead of the ball in the development, which leads to getting enough 'space' to be able to build clean, efficient and high quality systems.

Once a programmer has mastered the basics of coding, they start getting the itch of wanting to build larger and larger systems. For medium sized systems and above the technical issues quickly give way to the architectural, process and management issues. Within that level of software development, life is much easier if you have the skills to size both the work and the final system. It allows significant planning and it provides early indicators of potential trouble.

Estimating is a tricky skill and one that cannot be trivially formalized, but it's not impossible and once you have enough mastery you really wonder how you ever lived without it. Nothing is perfect, estimates less so, but with practice and experience software developers can produce very usable numbers for most of their development efforts, and the technical leads can understand and utilize the underlying capacity of their team members. Estimates from non-programmers are not feasible, and estimates should not be used to rank programmers against each other. There are very real limits to how and why they can be practical, but when used correctly they can at least eliminate many of the foreseeable problems that plague modern development efforts. A little less chaos is a good thing.