The Programmer's Paradox: Testing for Battleships

In one of my earlier blog entries, I casually suggested that testing was like playing a game of 'Battleship'.

It is an odd analogy, but as I pondered it a bit more I realized that there were interesting parallels between playing Battleship and testing, many that are worth exploring.

Often, it is exactly a change of perspective that allows us to see things with fresh eyes, even if the ground is tired and well-trodden.

AN INITIAL DEFINITION

For those of you not entirely familiar with it, Battleship was a popular kids' game, a long long time ago, in a galaxy far far away.

Wikipedia has a great description of it: http://en.wikipedia.org/wiki/Battleship_%28game%29

The version I grew up with was non-electronic and consisted of two plastic cases each containing a couple of grids, the 'ships', and a huge number of little red and white pegs.

In the first phase of the game, both players take their five 'ships' of varying sizes and place them on the horizontal grid. After that, one after the other, the players take turns calling out 'shots' by the grid coordinates.

The coordinates are alphabetical on one axis, and numerical on the other. A shot to C5 for instance is the third position across and the fifth one down.

If the shot hits a ship, the defending player has to say 'it is a hit'. The calling player puts a red peg in the vertical grid to mark the spot. Otherwise, just so they don't call the same spot again, a white peg is used.

The game continues until one player has sunk all of the other player's 'ships'. It is a race to see who sinks whose fleet first.

The big strategy of the game is to find ways to 'hide' the ships. For instance, lining them up against each other, so it is hard to see where one starts and one ends. Sometimes hiding them on the outskirts of the grid is great, sometimes not. Once the game gets going, it becomes a contest between the two players as to who figures out all of the other locations first.

Mostly, the game ends long before all of the pegs have been entered into the board. From my rather random and fuzzy memory, I'd say usually around two thirds of the board was full at the end. Sometimes more, sometimes less.

OBSERVATIONS

What does this have to do with testing?

Well, interestingly enough, testing is kinda like playing half a game of Battleship against the code. You can see all of the functionality in the system as a giant grid, a vast sea that contains a number of hidden bugs.

The tester's job is to place tests (pegs) at various points in the grid to see if there is a bug there or not. Each test -- or part of the test -- acts as a little peg that reveals a possible bug. For a test suite, one after another, the testers keep adding pegs until all of the testing is complete.

Battleship ends when one player has sunk all of the other player's ships. Testing ends when the time is up or you've run out of little pegs. It is also a race, but usually against time, resources, or patience.

In the real game, the 'ships' come in multiple sizes that are well known. Bugs, on the other hand, come individually or in clumps and there is very little rationale behind either. In fact, there is a totally unknown number hidden in the grid that can be anywhere from zero to almost NxN bugs.

Now, most people might think testing occurs with a huge number of pegs, but if you really look at the coverage on a line-by-line, input-by-input basis, the grid is absolutely huge, while the pegs are few.

Even if you are going to push the whole development team into an intense round of final system-wide super integration testing, it is still a pitifully small number of pegs versus the actual size of the grid. One big difference between the game and real testing is that the grid grows with each release. It keeps getting larger all of the time. The game is ever changing. Even when you think you have mastered it, it will still occasionally catch you by surprise.

Although for the analogy I picture the functionality as an N by N grid, the real topology of any system, depending on representation, is some multi-dimensional arbitrary structure, where the sea-scape itself holds many clues into the increased presence or likelihood of bugs.

What I really like about this analogy is that it gets across the point that testing is hit-or-miss, and by definition not everything is tested. People tend to view testing as a deterministic process; if you do enough testing, then you will catch all of the bugs. In fact, I've seen 'intensely' tested code, that has run in production for years and years, still yield up a bug or two.

There are always a huge number of bugs and they are always buried deep. Unless a peg hits them directly with the right input, the presence of the bug is unknown.
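To put rough numbers on the pegs-versus-grid picture, here is a minimal sketch in Python. It assumes a square board, bugs that each occupy exactly one cell, and pegs dropped uniformly at random. None of those assumptions hold for real code (the function names and parameters are made up for illustration), but they show how little of the sea a modest pile of pegs actually touches.

```python
import random


def expected_fraction(grid_size, peg_count):
    """Share of single-cell bugs that uniformly random pegs hit, on average."""
    return peg_count / (grid_size * grid_size)


def fraction_of_bugs_found(grid_size, bug_count, peg_count, seed=0):
    """Simulate one round: hide bugs, drop pegs, report the share of bugs hit.

    Assumption: bugs and pegs each land on distinct, randomly chosen cells.
    """
    rng = random.Random(seed)
    cells = [(row, col) for row in range(grid_size) for col in range(grid_size)]
    bugs = set(rng.sample(cells, bug_count))
    pegs = set(rng.sample(cells, peg_count))
    return len(bugs & pegs) / bug_count


if __name__ == "__main__":
    # A 100x100 board with 500 pegs covers only 5% of the cells.
    print(expected_fraction(100, 500))
    print(fraction_of_bugs_found(100, 50, 500))
```

Under these assumptions, a bug is found only if a peg lands right on it, so the average share of bugs found is simply the share of the board that the pegs cover.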

TESTING STRATEGIES

Over the years I've seen many 'interesting' strategies for testing.

Sometimes it is all manual, sometimes it is heavily automated. Some testing focuses on catching new problems, some of it is more biased towards regression.

Generally, though, the most common thing about testing is that most of it is done on the easier bits. The hard stuff rarely gets touched, even though it is the most likely source of bugs.

In that sense, developers quickly use up all their pegs clumped together in a close location on the grid, while ignoring the rest of the board. They tend to put the most pegs into the spots where the fewest bugs will exist.

Unlike the game, in testing the board itself offers tonnes of clues on how to test more effectively, but most developers prefer to disassociate themselves from that knowledge. As if, somehow, if we don't get smart about what we test, then the testing will be more thorough, which would be a possibly workable strategy only if you have a huge mountain of pegs. If you have way too few pegs, its effect is probably the opposite.

Software developers hate testing, so it is usually handled very badly. Most big integration tests for example are just replications of unit-tests, done that way to save time. An oddly ineffectual use of time, if the tests both refer back to the same position on the grid.

Underneath, some bugs may have moved around but we still need to test for them in their final position. Sticking the first peg in, then removing it, then adding it again is unnecessary effort.

Interestingly enough, for whole sections of the board, if there haven't been any changes to the code, then the same bugs remain hidden in the same positions. So executing the same peg pattern, or regression testing, will produce the exact same results as last time: nothing.

No changes to the code + no changes to the tests = no new bugs, but also a huge waste of time.

If you get the sense of this game, then spending a great deal of time sticking the pegs into the boards only to not find any bugs seems as though it may have been a waste of time. Programmers like to test to 'confirm' the absence of bugs, but really for most code, particularly if it is new -- less than twenty years old -- not finding bugs more likely means the testing is ineffective, not that the code is good. It ain't. So that one last pass through the tests, with the expectation of not finding anything wrong, is really a big waste of time.

You regression test to make sure none of the older bugs have reappeared, but that only makes sense if the code was actually affected. Retesting the same code multiple times, even when it hasn't changed, is a standard practice, but still crazy.

Clearly this also implies that any second round of testing should be based on the result of any changes, and a first round of testing. Parts of the board are not worth retesting, and it is important to know which parts of the board are which to avoid wasting resources.
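As a minimal sketch of what 'knowing which parts of the board are which' might look like, the Python below picks the tests worth rerunning from a map of which modules each test touches. The map, the test names, and the `select_tests` function are all hypothetical; a real project would have to build that map from coverage data or from knowledge of the architecture.

```python
def select_tests(test_coverage, changed_modules, previously_failing=()):
    """Return the sorted names of tests worth rerunning.

    test_coverage: hypothetical dict of test name -> modules that test exercises.
    changed_modules: modules touched since the last round of testing.
    previously_failing: tests that found something last round (red pegs).
    """
    changed = set(changed_modules)
    selected = {
        name
        for name, modules in test_coverage.items()
        if set(modules) & changed
    }
    # Always revisit the spots that scored a hit in the first round.
    selected.update(previously_failing)
    return sorted(selected)


if __name__ == "__main__":
    coverage = {
        "test_login": {"auth", "session"},
        "test_report": {"reports"},
        "test_export": {"reports", "io"},
    }
    # Only the tests that touch 'reports' are selected.
    print(select_tests(coverage, ["reports"]))
```

Everything that lands outside the selection is the part of the board where nothing has moved, so the pegs saved there can be spent somewhere new.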

PATTERNS

I can remember when I was a kid, I'd come up with 'patterns' to try and optimize my peg placement for Battleship. You know, like hitting every other spot, or doing a giant X or cross or something. I wanted a secret formula for always winning. These never worked, and I would generally be pushed back to messing up the pattern because of the nature of the game.

Once you've identified a red peg, you want to finish off the ship quickly. Then you get back to searching. Overall the best strategy was to randomly subdivide as much as possible. Big gaps allow big things to hide. Random shots, distributed more or less evenly across the board tended to be more fruitful.

One thing for certain about Battleship was that no matter who you were playing against, there was some underlying structure to the way they placed the ships.

As the game progressed, the strategy for finding the next ships was based on the success of the current strategy. None of the fixed patterns worked because they were too static. Although no one would actually be able to explain why they placed the pieces the way they did -- to them it was random -- it just wasn't.

Different opponents tended towards different strategies, which after a game or two became fairly obvious. In a weird, but rather indirect way, the same is true for code. The bugs in the system are not really random; they come from an extremely complex pattern that will always seem random to human beings, but there is some ordered structure.

The most effective ways to find them are by using instinct to work through the parts of the system that are most likely to have problems, or to ensure that there aren't problems where they might be really devastating. On fixing a bug, retesting the entire system is a huge waste of resources. It isn't hard in most systems to narrow the board down to a smaller one that might have been affected by the change in code.

Finding the true 'impact' of a change may seem expensive, but it is considerably cheaper than retesting everything. In that regard, a well-crafted architecture, one that isn't just a big ball of mud, certainly helps in drawing lines around the possible effects of a software change. It makes understanding the impact of a change a lot easier. Obviously, if you can reduce the size of the board, you increase the chances of finding any other hidden ships on the field, particularly if you didn't significantly decrease the number of pegs.
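The effect of shrinking the board can be worked out directly. The sketch below is an idealized model, not a claim about any real system: it assumes each bug sits on one cell, the pegs land on distinct cells chosen uniformly at random, and the name `chance_of_a_hit` is made up for the example. The chance of hitting at least one bug is one minus the chance that every peg lands on a clean cell.

```python
import math


def chance_of_a_hit(cell_count, bug_count, peg_count):
    """Probability that at least one of the pegs lands on a bug.

    Assumes single-cell bugs and pegs on distinct, uniformly random cells,
    so the chance of missing everything is C(clean, pegs) / C(cells, pegs).
    Requires Python 3.8+ for math.comb.
    """
    if peg_count > cell_count or bug_count > cell_count:
        raise ValueError("pegs and bugs must both fit on the board")
    clean_cells = cell_count - bug_count
    all_misses = math.comb(clean_cells, peg_count) / math.comb(cell_count, peg_count)
    return 1.0 - all_misses


if __name__ == "__main__":
    # Same 200 pegs, one hidden bug: the whole board versus the impacted corner.
    print(chance_of_a_hit(10_000, 1, 200))  # about 0.02
    print(chance_of_a_hit(400, 1, 200))     # about 0.5
```

With the same handful of pegs, narrowing the search from ten thousand cells to the four hundred that a change could have touched takes the odds of finding a lone bug from roughly one in fifty to a coin flip.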

SUMMARY

There are lots of things we can learn from studying an analogy. They allow us to morph the problem into some other perspective, and then explore the relationships and consequences in some other way.

Pushing testing alongside a board-game shows the importance of being responsive in the testing process, and not just trying to issue fixed patterns across a huge space. A static approach to testing fails repeatedly and is more expensive. A dynamic approach is a more likely way to utilize the available resources.

Clearly shrinking the board but not heavily reducing the pegs will increase the likelihood of finding problems. Also, not just retesting all of the old spots will help to find new hidden dangers.

Understanding the topography of the functionality will help to highlight areas that need to be specifically addressed. There are lots of strategies for making better use of the resources. If you see the life of a product as a long series of games against the code, you get an awfully large number of chances to test for underlying problems.

If you squander that on a massive stack of simple-but-useless regression tests that cover the same unchanging territory again and again, it is no wonder that the product keeps getting released with a significantly large number of problems. Time and resources are not making headway into the real underlying territory. If you craft tests as absolute sets of instructions, then recrafting them for new games is hard and a huge amount of work.

Since we hate testing, we shy away from any work in that area. As such, even projects that do everything 'correctly' and have the 'proper' suites of regression tests and always test 'before' each and every release, let large quantities of bugs through. In fact, the ones that followed the established best practices are probably of average or worse quality because the programmers tend to shift their minds into neutral during the testing phase. It is the guys with no resources, who are anxious about not wasting what little effort they have, that are getting the most bang for their buck.

2 comments:

Anonymous, March 30, 2008 at 3:11 PM
You mention things about the topology of the code which could reveal high risk areas. Could you elaborate a bit on ways to identify these spots.

As a Bsc (wannabe Cand.sc) I'm thinking it could be fun to use some kind of static analysis to visualize these areas :)

Paul W. Homer, March 30, 2008 at 4:52 PM
Hi Søren,

Great question. Density and structure come to mind right away. A metric like cyclomatic complexity would show denser areas in the code, which would probably require more testing. Structural inconsistencies (depth, # of calls, types of calls) would probably reveal higher likelihoods for bugs as well, given that they are less prevalent in code that has been heavily refactored, and thus probably a little more 'battle-tested'.

Most systems I've written generally have a small amount of core 'complex' functionality, and then the rest of the stuff (admin, data, preferences, etc). The complex bits are more often encapsulated in a library or 'engine'. Focusing on the corner-cases for that type of code always reveals a huge number of issues. It's often easy to find those bugs, it's just hard to know if there is some value in doing so.

Risk may also come from the common 'areas' or pathways through the system that you just can't afford to get wrong. Making sure the most common usage is bug free helps prevent embarrassing problems that are often costly in support. It's one of those rare places where perhaps wasting time re-running regression tests is cheaper than allowing a stupid bug to get released. If it is that critical, automating it is the only guaranteed consistent approach that will work (but it is expensive).

Paul.

Friday, March 21, 2008