Sunday, June 1, 2014

Requirements

The idea of specifying programs by defining a set of requirements goes way way back. I saw a reference from 1979, but it is probably a lot earlier. Requirements are the output from the analysis of a problem. They outline the boundaries of the solution. I've seen many different variations on them, from very formal to quite relaxed.

Most people focus on direct stakeholder requirements; those from the users, their management and the people paying the bill for the project. These may frame the usage of the system appropriately, but if taken as a full specification they can lead to rampant technological debt. The reason for this is that there are also implicit operational and development requirements that although unsaid, are necessary to maintain a stable and usable system. You can take shortcuts to avoid the work initially, but it always comes back to haunt the the project.

For this post I'll list out some of the general requirements that I know, and the importance of them. These don't really change from project to project, or for domain. I'll write them in rather absolute language, but some of these requirements are what I would call the gold or platinem versions, that is they are above the least-acceptable bar.

Operational Requirements

Software must be easily installable. If there is an existing version, then it must be upgradeable in a way that retains the existing data and configuration. The installation or upgrade should consist of a reasonably small number of very simple steps and any information that the install needs that can be obtained from the environment should automatically be filled in. In-house systems might choose to not to do a fresh install, but if they do, then they need another mechanism for setting up and synchronizing test versions. The installation should be repeatable and any upgrade should have a way to rollback to the earlier version. Most installs should support having multiple instances/version on the same machine, since that helps with deployments, demos, etc..

Having the ability to easily install or upgrade provides operations and the developers with ability to deal with problems and testing issues. If there is a huge amount of tech debt standing in the way, the limits force people into further destructive short-cuts. That is they don't set things up properly, so they get caught by surprise when the testing fails to catch what should have been an obvious problem. Since full installations occur so infrequently, people feel this is a great place to save time.

Software should handle all real world behaviours, it should not assume a perfect world. Internally it should expect and handle every type of error that can be generated from any sub-components or shared resources. If it is possible for the error to be generated, then there should be some consideration on how to handle that error properly. Across the system, error handling should be consistent, that is if one part of the system handles the error in a specific way, all parts should handle it in the same way. If there is a problem that affects the users, it should be possible for them to work around the issue, so if some specific functionality is unavailable, the whole system shouldn't be down as well. For any and all shared resources, the system should use the minimal amount and that usage should be understood and monitored/profiled before the software is considered releasable. That includes both memory and CPU usage, but it also exists for resources like databases, network communications and file systems. Growth should be understood and it should be in line with reasonable operational growth. 

Many systems out there work fine when everything is perfect, but are down right ornery when there are any bumps, even if they are minor. Getting the code to work on a good day is only a small part of writing a system. Getting it to fail gracefully is much harder, but often ignored as a invalid shortcut.

When a problem with software does occur, there should be sufficient information generated to properly diagnose the problem and zero in on a small part of the code base. If the problem is ongoing, there should not be too much information generated, it should not eat up the file system or hurt the network. It should just provide a reasonable amount of information initially and then advise on the ongoing state of the problem at reasonable intervals. Once the problem has been corrected it should be automatic or at least easy to get back to normal functionality. It should not require any significant effort.

Quirky systems often log excessively, which defeats the purpose of having a log since no one can monitor it properly. Logs are just another interface for the operations personal and programmers so they should be treated nicely. Some systems require extensive fiddling after an error to reset poorly written internal states. None of this is necessary and just adds an extra set of problems after the first one. It is important not to over-complicate the operational aspects by not addressing their existence or frequency.

If there is some major, unexpected problems then there should be a defined way of getting back to full functionality. A complete reboot of the machine should always work properly, there may also be faster, less sever options such as just restarting specific processes. Any of these corrective actions should be simple, well-tested and trustworthy, since they may be choosen in less than ideal circumstances. They should not make the problems worse, even if they do not fix them outright. As well, it should be possible to easily change any configuration and then do a full reboot to insure that that configuration is utilized. There may be less sever options, but again there should always be one big, single, easy route to getting everything back to a working state.

It is amazing how fiddly and difficult many of the systems out there are right now. Either correcting them wasn't considered or the approach was not focused. In the middle of a problem, there should always be a reliable hail mary pass that is tested and ready to be employed. If it is done early and tested occasionally, it is always there to provide operational confidence. Nothing is worse than a major problem being unintentionally followed by a long series of minor ones.

Development Requirements

Source code control systems have matured to the point where they are mandatory for any professional software development. They can be centralized or distributed, but they all provide strong tracking and organizational features such that they can be used to diagnose programming or procedural problems at very low cost. All independent components of a software system must also have a unique version number for every instance that has been released. The number should be easily identifiable at runtime and should be included with any and all diagnostic information.

When things are going well, source repos don't take much extra effort to use properly. When things are broken they are invaluable at pinpointing the source of the problems. They also work as implicit documentation and can help with understanding historic decisions. It would be crazy to build anything non-trivial without using one.

A software system needs to be organized in a manner that encapsulates its sub-parts into pieces that can be used to control the scope for testing and changes. All of the related code, configurations and static data must be placed together in a specific location that is consistent with the organization of the rest of the system. Changes to the system are scoped and the minimal number of tests is completed to verify their correctness. Consistency is required to allow programmers the ability to infer system wide    behavior from specific sub-sections of code.

The constant rate of change for hardware and software is sufficiently fast enough that any existing system that is no longer in active development starts to 'rust'. That is, after some number of years it has slipped so far behind its underlying technologies that it becomes nearly impossible to upgrade. As such, any system in active operation also needs to maintain some development effort as well. It doesn't have to be major extensions, but it always necessary to keep moving forward on the versions. Because of this it is important that a system be well-organized so that at very least, any changes to a sub-part of a system can be completed with the confidence that it won't affect the whole. This effectly encapsulates the related complexities away from the rest of the code. This allows any change to be correctly scoped so that the whole system does not need a full regression test. In this manner, it minimizes any ongoing work to a sub-part of the system. The core of a software architecture is this necessary system-wide organization. Beyond just rusting, most systems get built under sever enough time constraits that it takes a very long time and a large number of releases before the full breath of functionality has been implemented. This means that there are ongoing efforts to extend the functionality. Having a solid architecture reduces the amount of work required to extend the system and provides the means to limit the amount of testing necessary to validate the changes.

Finally

There are lots more implicit non-user requirements, but these are the ones that I commonly see violated on a regular basis. With continuously decreasing time expectations, it is understandable why so many people are looking for shortcuts, but these never come for free so there are always consequences that appear later. If these implicit requirements are correctly accounted for in the specifications of the system, the technical debt is contained so that the operations and ongoing development of the system is minimized. If these requirements are ignored, ever-increasing hurdles get introduced which compromise the ability to correctly mange the system and make it significantly harder to correct later.