Friday, March 13, 2020

Testing for Oversimplification

For any ‘lump’ of complexity there is an ‘inside’ and an ‘outside’.

From the outside, things always look a lot simpler. The devil is in the details, and the details are always on the inside. If you really want to know something, you have to get depth, which basically means digging around on the inside and trying to, at the very least, inventory most of the details.

So, with that as a foundation, if someone presents a set of primitives that they claim solves some aspect of the complexity, it would be really useful to have a test we could apply to see whether their work is correct, an oversimplification, or possibly even over-the-top.

The way to do that is to get some inside knowledge. That gives us a sense of how frequently different things occur. With those frequencies, we can take some of the more frequent ‘corner cases’ and see whether they fit into the primitives.

If we find that some fairly frequent thing cannot be represented within the primitives, even though many other things can, then the most likely explanation is that the solution is oversimplified.

That’s a pretty abstract way of looking at it. Let's try an example.

In most companies, there is some reporting hierarchy. There are employees who are direct reports to a boss. From the outside, it might seem like an N-ary tree is a great way to represent this as a data structure. Then we might go forth and start implementing any collection for this type of data as a tree.
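A minimal sketch of that naive tree idea, in Python, with hypothetical names just to make the structure concrete: each boss node holds a list of direct reports, and every employee appears under exactly one boss.

class Employee:
    def __init__(self, name):
        self.name = name
        self.reports = []   # children in the N-ary tree

    def add_report(self, employee):
        self.reports.append(employee)

ceo = Employee("CEO")
cto = Employee("CTO")
dev = Employee("Developer")
ceo.add_report(cto)
cto.add_report(dev)   # a clean hierarchy, so far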

However, it is not infrequent that some employees, for some period of time, report to multiple people. This happens for a lot of reasons, and it is more likely where the employee roles are less defined. How does our tree idea work with this?

A singly linked tree, where the parent points at a set of children, obviously won’t work. A doubly linked one, where the children also point back up to the parent, doesn’t help either. If a child node points to multiple parent nodes, that correctly holds the relationship, but technically the data structure isn’t a tree anymore; it is now either a directed acyclic graph (DAG) or a full graph. We’d rather have the former, since the latter permits cycles, which can be difficult to deal with properly in the system.
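As a rough sketch of that shift, here is the doubly linked variant once a child is allowed multiple parents; the names are again hypothetical:

class Employee:
    def __init__(self, name):
        self.name = name
        self.bosses = []    # multiple parents allowed
        self.reports = []

    def add_report(self, employee):
        self.reports.append(employee)
        employee.bosses.append(self)

mgr_a = Employee("Manager A")
mgr_b = Employee("Manager B")
dev = Employee("Developer")
mgr_a.add_report(dev)
mgr_b.add_report(dev)   # dev now has two parents: no longer a tree

Nothing in the code stops this; the multiple-boss case was always representable, it just silently broke the tree property.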

So, using a tree (of any type) is an oversimplification, a DAG seems pretty good, and a full graph is likely over the top.

Or is it? What if we just changed the definition of ‘reporting’ to be ‘direct reporting’ to one and only one person? Then an employee might have zero or more indirect bosses, but always just one boss that captures the HR relationship.

So, we think we’ve fixed the problem by taking the concept of ‘reporting’ and splitting it into two distinct groups, the first of which maintains the tree relationship. But what about the second? We could have some ‘dotted line’ structure working through our tree, but since the number of indirect bosses can be 0, 1, 2, or N, the dotted-line relationships fall right back to being a DAG. Even if a tree is overlaid on a DAG, with two different types of structural referencing, stepping back a bit shows that the bounding structure is still a DAG.
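One way that split might look in code, assuming hypothetical field names: exactly one ‘direct’ boss forming the HR tree, plus zero or more dotted-line bosses layered on top.

class Employee:
    def __init__(self, name, direct_boss=None):
        self.name = name
        self.direct_boss = direct_boss   # one and only one HR parent
        self.dotted_bosses = []          # 0..N indirect relationships

ceo = Employee("CEO")
mgr = Employee("Manager", direct_boss=ceo)
dev = Employee("Developer", direct_boss=mgr)
dev.dotted_bosses.append(ceo)   # a dotted line around the tree

The direct_boss links alone form a tree, but direct_boss and dotted_bosses taken together are still a DAG; the split changed the names, not the shape.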

More to the point, if the system needs to capture the way people report to each other, then just throwing away half of that so that we can jam the whole thing into a tree is not satisfying the initial requirements. It will just cause other problems.

As people often do, the next obvious approach is to just disconnect the whole thing and not think of it as a data structure.

Any employee can point to a list of bosses, and any boss can point to a list of employees. Super flexible, right? Except that without restricting it in any way, a boss can point to an employee who can then point back as their boss. We can create a cycle really easily.
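A quick sketch of how easily that happens with the unrestricted links; the names are made up:

class Employee:
    def __init__(self, name):
        self.name = name
        self.bosses = []
        self.reports = []

alice = Employee("Alice")
bob = Employee("Bob")
alice.reports.append(bob)
bob.bosses.append(alice)    # Alice manages Bob
bob.reports.append(alice)
alice.bosses.append(bob)    # ...and Bob 'manages' Alice: a cycle of length 2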

If we check for that special case, is that better? Not really: cycles of length 2 can be caught and stopped, but then we have to do the same for cycles of length 3, then 4, then ... So now we have some huge cycle checker in the code to avoid crashes or endless loops, and if it does find something wrong, we may need a human to intervene and arbitrate in some way. Either the new link causes the cycle, or one of the old links is badly wrong; the code can’t tell accurately. So we tumble farther down the rabbit hole, all while trying to avoid what is basically the intrinsic nature of the original data.
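As a rough sketch, assuming the unrestricted Employee links from the previous sketch, the general check before adding a link looks something like this:

def would_create_cycle(boss, employee):
    """Walk upward from boss; if we ever reach employee, the new
    boss -> employee link would close a cycle."""
    seen = set()
    stack = [boss]
    while stack:
        node = stack.pop()
        if node is employee:
            return True
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.bosses)
    return False

def add_report(boss, employee):
    if would_create_cycle(boss, employee):
        raise ValueError("link would create a reporting cycle")
    boss.reports.append(employee)
    employee.bosses.append(boss)

Note that even this only detects that some cycle would form; it cannot say which of the existing links is the wrong one, which is exactly where the human has to step in.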

The big trick here is to start by treating the need to capture these working relationships as a requirement for the system. There may be lots of proposals for how to handle it, but if we know enough about the underlying data, we can use that knowledge to quickly discount the weak ones. If we leap on the first plausible-sounding idea, implement it, and then try to correct it from there, it will go badly.

Oversimplification is really, really common. For most of the stuff we encounter, we are outsiders. It takes a bit of work to prevent, but it is the type of work that ends up saving a lot of wasted effort later.
