Thursday, April 16, 2020

Development and Operations

As a recurring theme, there seems to be endless confusion within the software industry as to the different roles for employees. We can clear that up with two precise definitions:
  • Software development is the act of writing, refactoring and testing software.
  • Software operations is the act of configuring, deploying, monitoring, administering, supporting and patching software.
They are two very different stages that occur over the life of software.

The first part of any software project involves building something. Deciding what to build (analysis), planning to build it (design), building it (coding) and ensuring that it works as expected (testing) are all components of building software. They are the constructive part of the process.

Once something is built, it needs to go out into the world and get used. That is the operational side, and it doesn’t matter if the software is shipped on disk, installed over the Internet, or run remotely on a server. It’s the runtime operation of the system, which is very different from the construction of the system. The best way to envision it is the difference between building a car and driving it around.

The line starts to get blurred with bugs. While running, software often hits a problem or two. Triaging the severity of the problem often needs someone technical, and occasionally needs the author of the actual code to be involved. If it's handled reasonably, once the bug is reported, the operations side does some minimal work to document it, reproduce it, assess its severity, and deal with its consequences.
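The triage steps above (document, reproduce, assess severity, deal with the consequences) can be sketched as a simple record. This is just an illustrative sketch; the field names and the escalation rule are invented, not from any particular tracking tool:

```python
from dataclasses import dataclass, field
from enum import Enum

class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

@dataclass
class BugReport:
    """Minimal record the operations side fills in before any developer is involved."""
    title: str
    description: str                                  # document it
    repro_steps: list = field(default_factory=list)   # reproduce it
    severity: Severity = Severity.LOW                 # assess its severity
    workaround: str = ""                              # deal with its consequences

    def needs_developer(self) -> bool:
        # Only escalate to a developer once triage is done and the
        # problem is serious enough to warrant a code change.
        return bool(self.repro_steps) and self.severity.value >= Severity.HIGH.value

bug = BugReport("Login timeout", "Sessions expire after 5 seconds",
                repro_steps=["log in", "wait 5 seconds"], severity=Severity.HIGH)
print(bug.needs_developer())  # → True
```

The point of the `needs_developer` gate is exactly the division of labour described here: most of the record gets filled in on the operational side, and a developer only sees it once that setup work is done.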

The operational aspect of handling a bug can involve getting a developer to alter the code, but the responsibility for, and control of, the process stays on the operational side. Sometimes, if the bug is difficult, operations has to arrange for the developer to get access to production, or move the data into a testing environment. These are all operational tasks.

If it is set up well, all the developer needs to do is walk through the issue with an operations person, get a sense of the problem, and then start working out how to fix it. The setup is done before they arrive. Since bug fixing is wildly unestimatable, any sort of scheduling for putting the fix back into production cannot be started until the issue is believed to be resolved.

What often happens in practice, particularly with unstable systems, is that the operational side absolves itself of its responsibilities and shoves them all onto the developers. They install software, but little else. There are two critical problems with this: a) the developers are no longer doing any significant software development, and b) the developers don’t have the desire, context or interest to handle the bugs in an expedient fashion.

To this point, that is obvious: if you hired someone whose career and job title are listed as programming, then their primary interest is probably just in programming. Getting them deeply involved with runtime issues is both frustrating and demotivating for them, so they are unlikely to do a good job of it. Without support, they have a tendency to just slap on the fastest band-aid possible, which in many cases merely delays the problems rather than solving them. And it frequently makes them worse.

In a dysfunctional environment, when this is happening a lot, these recurring issues build up into a growing cycle that eventually stops all forward progress. The developers are not coding, but rather sticking their fingers into the holes of a dam that is about to come crumbling down. This is fairly common.

The cloud made this issue worse. Once things became way less ‘physical’, and ‘machine rooms’ disappeared, the distinction between operating software and building it got lost. Suddenly, you see shops where no one is effectively responsible for ‘operations’, and everyone is doing their best to dodge the ever-growing pile of operating problems. It's easy to detect, in that both the developers and the business are unhappy, and the development progress is overly slow.

Sometimes there are QA or DevOps roles, but oddly, even though these roles seem to span both worlds, the practitioners see themselves on the development side, and try hard to avoid getting involved in, or taking responsibility for, operational problems. That’s not surprising, in that both roles are defined to avoid operational responsibilities, but at least for DevOps it would likely make more sense if the positioning of the role were reversed. That is, it should be an operational person who frequently crosses over into development issues, rather than a developer who gets dragged into operational problems. That would lay the responsibility for tracking big issues and setting up specific bug-related testbeds directly in their camp, and would minimize the involvement of the core developers.

Fixing this issue starts with assigning someone to handle the operations of the software. It has to be someone’s job to monitor the system and take responsibility for ensuring that any issues with it are handled correctly. It’s also important to understand that that ‘someone’ should not be one of the active developers. They should have a technical background, but they should not be anywhere close to the critical path for getting development work completed. It’s a full time job, and they need to spend their days (and probably more) on making sure that the right work to stabilize the system is getting done in the right ways, at the right time.

Once there is an operations manager, since runtime issues usually have strong business consequences, they generally have the right to co-opt developers to fix things as needed. But they have to be careful, and respectful, with that authority, in that they should minimize developer involvement, to ensure that the developers spend most of their time on development tasks. If they can find a way forward that doesn’t involve developers, then that path is preferable. They should have an inventory of all of the common software issues and their resolutions. They should be able to assess the working stability of the system and any user dissatisfaction with the way it was built. They should be able to say whether things are getting better, or whether they are getting worse.
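As a sketch of what such an inventory, and a better-or-worse assessment, could look like in its simplest form (the issues, resolutions and dates below are all hypothetical):

```python
from collections import Counter
from datetime import date

# Hypothetical inventory: each known issue maps to its standard resolution,
# so recurrences can be handled without pulling in a developer.
KNOWN_ISSUES = {
    "disk-full":    "rotate logs, expand volume",
    "stale-cache":  "flush cache, restart worker",
    "cert-expired": "renew certificate, reload proxy",
}

# Incident log: (date, issue key). The trend answers the operations
# manager's question: are things getting better, or worse?
incidents = [
    (date(2020, 3, 1),  "disk-full"),
    (date(2020, 3, 9),  "stale-cache"),
    (date(2020, 4, 2),  "disk-full"),
    (date(2020, 4, 5),  "disk-full"),
    (date(2020, 4, 11), "cert-expired"),
]

def monthly_counts(log):
    # Count incidents per calendar month.
    return Counter(d.strftime("%Y-%m") for d, _ in log)

counts = monthly_counts(incidents)
print(counts["2020-03"], counts["2020-04"])  # → 2 3, so things are getting worse
```

Even a crude tally like this gives the operations manager something factual to report, rather than a vague feeling that the system is unstable.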

Once the structure within the organization properly separates development and operations, it can be further enhanced by feeding lots of operational feedback back to the development groups, to aid in analysis, design, and testing. All three of these areas can be better if the knowledge of how the software is doing in operations is deep enough, but that is a step above just keeping the operational side from washing over the development side.

Monday, April 13, 2020

Bullet Specs

Implementing the wrong code wastes everyone's time. A good specification that prevents this makes a real difference in keeping a software development project from derailing.

For software, time is a scarce resource. So, it is inordinately more effective to work out any design issues long before coding. If you try to fix a mess later, in the code, it won’t happen, and everything built on top will be compromised. Stack a mess high enough and it becomes unusable.

So, a sane project that is expected to build non-trivial software needs specifications to avoid wasting its resources.

Over the decades, I’ve used all sorts of methodologies and formats for producing specifications, but they have either been a) too big and bloated, or b) too vague. One wastes precious time, while the other isn’t precise enough to dredge up any critical problems before they become destructive.

Lately, I’ve been using what I call ‘bullet specs’. It’s a compressed variant on what some people might call ‘manage-by-fact’. It’s far from perfect, but it is fast enough to meet time expectations, while still providing enough detail to keep the coding work from going bad.

The idea is that you specify the smallest set of bullets that need to be satisfied. The bullets need to be short, concise and contain only factual statements. If it’s important, then it is a bullet. If it isn’t listed, it can be whatever, it’s a free variable.

You belt out only, exactly, what you know has to be, for the code to be accepted in production. Nothing more, nothing extraneous, no opinions, whys, speculations, or anything else that is not a factual bullet. Just the necessary facts.
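As an illustration (the feature and the numbers here are entirely invented, not from any real project), a bullet spec for a small export feature might be nothing more than:

```
Report Export -- bullet spec
  • Users can export any report as CSV.
  • Exports contain every column visible on screen, in the same order.
  • Exports over 100,000 rows are generated asynchronously and emailed.
  • All timestamps are exported in UTC, ISO 8601 format.
  • Generating an export must not lock the reporting tables.
```

Everything not listed (file naming, button placement, which library does the writing) is a free variable.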

That works quite well, but it still needs more in a lot of cases. Bullets are not quite expressive enough to communicate all of the issues.

I tend to think of architecture as a means of organization that draws hard ‘lines’ separating different sets of code in the system. The best format for lines is a simple diagram, without a lot of fiddly bits. E.g. draw two boxes; then specific code goes into one or the other, but not both. If the code is in the wrong box, it needs to be moved.
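For example, a two-box diagram of this kind might be as simple as (the box names are just an illustration):

```
+---------------+        +----------------+
|  Calculation  |        |  Presentation  |
|  (all domain  | -----> |  (all screen   |
|   formulas)   |        |   formatting)  |
+---------------+        +----------------+
```

The line is hard: any formula found in the Presentation box is in the wrong place and needs to be moved, no matter how convenient it was to put it there.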

So, if we were interested in laying out a fast spec for a big system, it would consist of a few top-down diagrams that chop up the mechanics, and a bunch of bullet specs that tighten down each part. It’s not onerous to read, and it is precise enough not to be arguable. If it is followed, the results are predictable.

That’s fairly inexpensive, but it can still go wrong, usually due to convoluted business logic. So, the third piece that is sometimes needed is a data model. For those, I find that classic ER diagrams fit most domains well, but I sometimes augment that with the entities being data-structures themselves, as I discuss in this post: http://theprogrammersparadox.blogspot.com/2017/09/data-modeling.html
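As a small sketch of the ‘entities as data-structures’ idea, assuming a hypothetical invoicing domain: each entity from the ER diagram becomes a concrete structure, and the relationships between entities become fields.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LineItem:
    sku: str
    quantity: int
    unit_price_cents: int

@dataclass
class Invoice:
    number: str
    customer_id: str
    items: List[LineItem]   # the one-to-many relationship from the ER diagram

    def total_cents(self) -> int:
        # A derived value: computed from the structure, never stored in it.
        return sum(i.quantity * i.unit_price_cents for i in self.items)

inv = Invoice("INV-001", "C42",
              [LineItem("A1", 2, 500), LineItem("B2", 1, 1000)])
print(inv.total_cents())  # → 2000
```

The data model pins down the types and the structure; the bullet specs can then refer to these entities by name without re-describing them.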

Once we’ve laid out the underlying data, its types and its structure, and organized the code into explicit places, the contents of the bullets fill in all of the remaining facts.

A full specification is complete enough that the ‘coding’ can be fairly thoughtless, but again, to save time, we might not want to invest that amount of effort upfront, or the work may be better done by people closer to the metal. Either way, we can look at specifications at three levels: high, medium and low.

If the spec is high level, it will list out required properties or metrics that need to hold for the system as a whole, and the overall runtime structure. It might cut the system up into a set of libraries. It will list out the major technologies used. The focus is on solving problems for people.

At a medium level, it might list out an API and the incoming and outgoing data types, or it could specify the major features to be used from an underlying library or framework. It may lay out parts of some of the screens, or the parameters for a CLI. The focus is on sets of usable features.

At a lower level, it might list out attributes like locking, formulas, computations, user options, particular algorithms, or even the variable names. The focus is on ensuring strong implementations.
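Putting the three levels side by side, a spec fragment for a hypothetical trade-capture system (every detail below is invented for illustration) might look like:

```
High level:
  • The system ingests 10,000 trades/day with end-of-day reconciliation.
  • Runtime: one stateless web tier, one relational database instance.

Medium level:
  • POST /trades accepts {id, symbol, qty, price}; returns 201, or 409 on a duplicate id.
  • The reconciliation report is a nightly CSV, one row per unmatched trade.

Low level:
  • qty is a positive integer; price is a decimal with 4 places, never a float.
  • Duplicate detection uses a unique index on trade id, not an application-level check.
```

Each level tightens the one above it; a team can stop at whichever level matches the experience of the people doing the coding.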

Collectively, if all three levels did exist, they would cover all of the ‘major details’ that could be found in the code; there would be no wiggle room. Most often though, the specs can save time by leaving some of the details up to the individual programmer, or to the agreed-upon architecture, styles or conventions for the project. If part of the system exists in a particular way, the spec can assume that the programmers will follow suit.

What’s specified is only those details that can’t be gotten wrong, where for high-level reasons there is no wiggle room.

Now, some people might suggest that it is impossible with any new code to know what has no wiggle room before actually writing it. That’s only true for inexperienced coders or pure research projects. For everything else, a specification would reduce the risk of making a mess, which would reduce the amount of time spent on the code, which would help the project achieve its goals. More importantly, it would allow senior developers to lay down fundamentals for less experienced coders, preventing screw-ups from only being caught way late, like in code reviews, testing, or production. Why wait until the end to find out it’s wrong? Why let it go that far?

If a team doesn’t have anyone who can produce any type of spec for the upcoming development work, then that problem should be addressed before continuing. Big projects need strong technical leads. It’s fun to try and stretch our abilities in building complex stuff, but there is a ‘bridge too far’ that usually ends in tears. We should do a better job of not starting projects that are impossible without the necessary prerequisites. That would save us from a lot of coding disasters.