Thursday, April 16, 2020

Development and Operations

As a recurring theme, there seems to be endless confusion within the software industry about the different roles for employees. We can clear that up with two precise definitions:
  • Software development is the act of writing, refactoring and testing software.
  • Software operations is the act of configuring, deploying, monitoring, administering, supporting and patching software.
They are two very different stages in the life of software.

The first part of any software project involves building something. Deciding what to build (analysis), planning to build it (design), building it (coding) and ensuring that it works as expected (testing) are all components of building software. They are the constructive part of the process.

Once something is built, it needs to go out into the world and get used. That is the operational side, and it doesn’t matter if the software is shipped on disk, installed over the Internet, or run remotely on a server. It’s the runtime operation of the system, which is very different from the construction of the system. The best way to envision it is the difference between building a car and driving it around.

The line starts to get blurred with bugs. Often while running, software runs into a problem or two. Triaging the severity of the problem often needs someone technical, and occasionally needs the author of the actual code to be involved. If it's handled reasonably, once the bug is reported, the operational side does some minimal work to document it, reproduce it, assess its severity, and deal with its consequences.

The operational aspect of handling a bug can involve needing to get a developer to alter the code, but the responsibility for and control of the process are still on the operational side. Sometimes, if the bug is difficult, operations has to arrange for the developer to get access to production, or move the data around into a testing environment. These are all operational tasks.

If it is set up well, all the developer needs to do is walk through the issue with an operations person and get a sense of the problem, and then they can start to work on determining how to fix it. The setup is done before they arrive. Since bug fixing is notoriously hard to estimate, any sort of scheduling for putting the fix back into production cannot be started until the issue is believed to be resolved.

What often happens in practice, particularly with unstable systems, is that the operational side absolves itself of its responsibilities and shoves them all onto the developers. They install software, but little else. There are two critical problems with this: a) the developers are no longer doing any significant software development and b) the developers don’t have the desire, context or interest to handle the bugs in an expedient fashion.

That second point should be obvious: if you hire someone whose career and job title are in programming, then their primary interest is probably just in programming. Getting them deeply involved with runtime issues is both frustrating and demotivating for them, so they are unlikely to do a good job of it. Without support, they have a tendency to just try and slap on the fastest band-aid possible, which in many cases merely delays the problems rather than solving them. And it frequently makes them worse.

In a dysfunctional environment, when this is happening a lot, we see that these recurring issues build up into a growing cycle that eventually stops all forward progress. The developers are not coding, but rather just sticking their fingers into the holes of a dam that is about to come crumbling down. This is fairly common.

The cloud made this issue worse. Once running software became way less ‘physical’, and ‘machine rooms’ disappeared, the distinction between operating software and building it got lost. Suddenly, you see shops where no one is effectively responsible for ‘operations’, and everyone is doing their best to try and dodge the ever-growing pile of operating problems. It's easy to detect, in that both the developers and the business are unhappy, and the development progress is overly slow.

Sometimes there are QA or DevOps roles, but oddly, even though these roles seem to span both worlds, the practitioners see themselves on the development side, and they too try hard to avoid getting involved in or taking responsibility for operational problems. That’s not surprising, in that both roles are defined to avoid operational responsibilities, but at least for DevOps it would likely make more sense if the positioning of the role was reversed. That is, it's an operational person who frequently crosses over into development issues, rather than a developer who gets dragged into operational problems. That would lay the responsibility for tracking big issues and setting up specific bug-related testbeds directly in their camp, and would then minimize the interactions with the core developers.

Fixing this issue starts with assigning someone to handle the operations of the software. It has to be someone’s job to monitor the system and take responsibility for ensuring that any issues with it are handled correctly. It’s also important to understand that this ‘someone’ should not be one of the active developers. They should have a technical background, but they should not be anywhere close to the critical path for getting development work completed. It’s a full-time job, and they need to spend their days (and probably more) on making sure that the right work to stabilize the system is getting done in the right ways, at the right time.

Once there is an operations manager, since runtime issues usually have strong business consequences, they generally have the right to co-opt developers to fix stuff as needed. But they have to be careful, and respectful, of that authority, in that they should be minimizing the developer involvement, to ensure that the developers spend most of their time on development tasks. If they can find a way forward that doesn’t involve developers, then that path is preferable. They should have an inventory of all of the common software issues and their resolution. They should be able to assess the working stability of the system and any user dissatisfaction with the way it was built. They should be able to say whether things are getting better, or they are getting worse.
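As a rough illustration of what such an inventory could look like, here is a minimal sketch in Python. Everything in it is an assumption made up for the example: the `KnownIssue` fields, the `handled_by_operations` and `stability_trend` helpers, the 30-day window, and the sample issues are hypothetical, not a description of any particular tool. It records each known issue with its resolution, notes whether a developer is needed, and compares the number of incidents in the latest window against the window before it.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


@dataclass
class KnownIssue:
    """One entry in the operations inventory (field names are illustrative)."""
    symptom: str
    resolution: str
    needs_developer: bool = False
    occurrences: List[date] = field(default_factory=list)


def handled_by_operations(issues):
    """Issues that have a known path forward without involving a developer."""
    return [issue for issue in issues if not issue.needs_developer]


def count_between(issues, start, end):
    """Count incidents with start <= day < end across all issues."""
    return sum(1 for issue in issues for day in issue.occurrences
               if start <= day < end)


def stability_trend(issues, today, window_days=30):
    """Compare the latest window of incidents with the window before it."""
    window = timedelta(days=window_days)
    recent_end = today + timedelta(days=1)   # include today in the latest window
    recent_start = recent_end - window
    previous_start = recent_start - window
    recent = count_between(issues, recent_start, recent_end)
    previous = count_between(issues, previous_start, recent_start)
    if recent < previous:
        return "better"
    if recent > previous:
        return "worse"
    return "unchanged"


# Hypothetical sample data, purely for illustration.
inventory = [
    KnownIssue("Nightly import stalls", "Restart the import job",
               occurrences=[date(2020, 3, 2), date(2020, 3, 9), date(2020, 4, 10)]),
    KnownIssue("Report totals are off", "Requires a code change",
               needs_developer=True, occurrences=[date(2020, 3, 5)]),
]

print(stability_trend(inventory, today=date(2020, 4, 16)))
print([issue.symptom for issue in handled_by_operations(inventory)])
```

Counting raw incidents is a crude measure; a real inventory would likely weight by severity or by user impact, but even this much lets an operations person say whether things are trending better or worse without pulling a developer off their work.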

Once the structure within the organization properly separates developers and operations, it can be further enhanced by moving lots of feedback back to the development groups, to aid in analysis, design, and testing. All three of these areas can be better if the knowledge of how the software is doing in operations is deep enough, but that is a step above just keeping the operational side from washing over the development.
