Thursday, October 20, 2022

Operational Issues

If there is a problem with a system in “production”, it should obviously be recorded with all of the relevant facts and, ideally, a way to reproduce it.

Then the problem should be triaged:
  • Is it an operational problem? Such as configuration or resources?
  • Is it a data administration problem? Bad or stale data?
  • Is it a development problem? Our code or a vendor’s code is incorrect?

There should be a run book that lists all problems, old and new, categorized this way.
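
As a rough illustration, a run book entry might capture the triage category alongside the facts and the vetted fix. This is only a sketch, assuming Python-based tooling; all of the field names here are made up for illustration.

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List

    class Triage(Enum):
        OPERATIONAL = "operational"          # configuration or resources
        DATA_ADMINISTRATION = "data-admin"   # bad or stale data
        DEVELOPMENT = "development"          # our code or a vendor's code is incorrect

    @dataclass
    class RunBookEntry:
        system: str           # which production system had the problem
        tech_stack: str       # underlying stack, so fixes carry across systems
        triage: Triage
        symptoms: str         # the relevant facts, as observed
        reproduction: str     # how to reproduce it, if known
        remediation: str      # the vetted fix, once one exists
        occurrences: List[str] = field(default_factory=list)  # timestamp of each hit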

If a problem is operational, say we’ve run out of some type of resource, then two things need to occur (the first is sketched after the list):
  1. Bump up the resource, and restart the system.
  2. Analyze the problem and make recommendations to avoid it in the future.
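
As a minimal sketch of step 1, assuming a hypothetical application config file and a systemd-managed service (both names are invented here), the immediate remediation might look something like this. Step 2, the analysis and recommendations, still happens afterwards against the facts recorded in the run book.

    import subprocess
    from pathlib import Path

    CONFIG = Path("/etc/myapp/app.conf")   # hypothetical config file
    SERVICE = "myapp.service"              # hypothetical service name

    def bump_pool_size_and_restart(factor: int = 2) -> None:
        # Step 1: raise the exhausted resource (here, a connection pool size)...
        lines = CONFIG.read_text().splitlines()
        for i, line in enumerate(lines):
            if line.startswith("pool_size="):
                current = int(line.split("=", 1)[1])
                lines[i] = f"pool_size={current * factor}"
        CONFIG.write_text("\n".join(lines) + "\n")
        # ...then restart the system so the new limit takes effect.
        subprocess.run(["systemctl", "restart", SERVICE], check=True)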

If the problem is data administration, then there are a few options:
  • Was the data changed recently? Who changed it and why?
  • Is it stale? Why did it fall out of sync?
  • Is there a collision in usage, where both the old and new values are needed at the same time?
  • Who is responsible for changing the data? Who is responsible for making sure it is correct?

If it is a development problem, then:
  • Is there a workaround?
  • How important is it that this gets fixed?
  • When can we schedule the analysis, design, coding, and testing efforts necessary to change it?

In the case of resources such as operating system configuration parameters, the first occurrence on any system will come as a surprise. The issue should be logged both against the system itself and against the underlying tech stack, so that later, even if it happens in a completely different system, the fix is already known and vetted.
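
Continuing the earlier run book sketch (and reusing its RunBookEntry type), indexing entries by tech stack as well as by system is one simple way to make a previously vetted fix easy to find; the helper below is hypothetical.

    from collections import defaultdict
    from typing import Dict, List

    def index_by_stack(entries: List[RunBookEntry]) -> Dict[str, List[RunBookEntry]]:
        # Group entries by tech stack so a fix vetted on one system is easy to
        # find when a completely different system hits the same underlying issue.
        by_stack: Dict[str, List[RunBookEntry]] = defaultdict(list)
        for entry in entries:
            by_stack[entry.tech_stack].append(entry)
        return dict(by_stack)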

If it is a well-known tech stack issue, then operations can address it immediately, and only later let the developers know that it had occurred.

If the problem is infrequent or mysterious, then operations may ask for developer involvement in order to do a full investigation. If the problem is repeating, trivial, or obvious, then they should be able to handle it on their own. When a developer does need to get involved, they often need full and complete access to production, which is expensive and not something you want to happen very often.

For common and recurring problems, operations should be empowered to handle them immediately, on their own. For any system, the default behavior should be to reboot immediately. If reboots aren’t “safe”, then that needs to be corrected right away (before any other new work commences).

As well as reactively responding to any problems with the system, operations need to be proactive. They should set up their own tooling that fully monitors all of the systems they are running and alerts them to limited-resource issues long before they occur. Non-trivial usage issues come from the users.
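
As a minimal sketch of that kind of proactive check, assuming a simple disk-space threshold and a placeholder alert hook (send_alert here stands in for whatever paging or alerting tool is actually in use), it could be as small as this:

    import shutil

    ALERT_THRESHOLD = 0.80  # warn well before the resource actually runs out

    def send_alert(message: str) -> None:
        print(f"ALERT: {message}")  # placeholder for a real paging/alerting integration

    def check_disk(path: str = "/") -> None:
        usage = shutil.disk_usage(path)
        used_fraction = usage.used / usage.total
        if used_fraction >= ALERT_THRESHOLD:
            send_alert(f"{path} is {used_fraction:.0%} full; add capacity before it runs out")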

Upgrades and bug fixes should be tested and queued up by development, but deployed by operations at their convenience. Since operations are actively monitoring all systems, they are best able to decide when any changes are made. Operations are also responsible for tracking any underlying infrastructural changes, assessing the potential impacts, and scheduling them, although they may have to consult with vendors or in-house development to get more information.

2 comments:

  1. "...default behavior should be to reboot immediately".

    I disagree with this, as a reboot will generally make it difficult or impossible to determine the root cause of the issue. If root cause is not determined, the issues (and reboots) will continue to plague the system. Much better is to establish a data-gathering routine that captures essential debug data, then restart errant processes, and only reboot as a last resort.

    I ran large-scale systems for a long time, and the worst ones were where the only recourse was to reboot. For me and my staff, 'reboot' was not a valid troubleshooting methodology, and was not permitted until troubleshooting related data was gathered. I then did everything possible to ensure that the developer/vendor assisted in determining root cause and resolving the issue.

  2. I've seen a few places where everyone is super terrified to reboot. In those places, fairly trivial issues that could have been handled rapidly became really large outages. That tends to make every problem a big problem.

    I do usually suggest rebooting on regular intervals, maybe weekly or biweekly. It prevents a lot of surprises, and ensures that reboots are safe.

    I tend to rely more on good logs, persistent data, and user feedback for most troubleshooting. That's mostly for applications, so I guess if it's some weird state change in a complex backend engine, keeping it up and interacting with it might somehow help shed light, but I'd probably only put that into practice the second time an issue happened. Often developers have no access to production anyway, for security reasons. They need to build in their own feedback mechanisms. Logging is cheap, and easy to capture.

