Sunday, September 13, 2015

Data Modelling

Data is the foundation for all software. The strength of its internal relationships makes it possible for computers to assist us with real world problems. Without carefully modelled data, code just spins uselessly.

A collected datum by itself has little value. It needs both recurring frequency and interconnections with other related points. Collected together, however, a group of points identifies a deeper meaning; it pinpoints something specific. If we collect enough data, and we model it appropriately, we can leverage this information to help us.

There hasn't been any real growth in data modelling for at least 20 years; the strongest tools we currently have are entity-relationship (ER) diagrams. They are backed by a deeper set of entity modelling methodologies, but few brave souls ever descend to those depths.

In an ER diagram each datum is an attribute; attributes are combined to form an entity. We can then specify relationships between these various entities. For most data with relatively static relationships this works quite well. The analysis and modelling involved are well understood and easy to apply, although in practice they often seem to be ignored.
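As a rough sketch (the tables and names here are purely illustrative, not taken from any particular system), two entities and a one-to-many relationship between them might land in a relational store like this:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Entity: each column is an attribute of the entity.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT
    );

    -- Entity with a relationship back to customer (one customer, many orders).
    CREATE TABLE purchase_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        placed_on   TEXT NOT NULL
    );
""")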

Each end of a relationship between entities has a cardinality of 0, 1 or many. This can capture any entity with a set of children, or even a many-to-many relationship multiplexed through an entity that sits between two others. The allowance of 0 is really an offhanded means of supporting optional relationships. It's expressive, but somewhat overloaded.
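A minimal sketch of those cardinalities, again with hypothetical entities: a nullable foreign key gives an optional (0 or 1) end, and a junction table carries the many-to-many case:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE advisor (advisor_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- 0-or-1 end: a student may have no advisor (NULL) or exactly one.
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        advisor_id INTEGER REFERENCES advisor(advisor_id)  -- nullable
    );

    CREATE TABLE course (course_id INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- Many-to-many: the junction entity multiplexes students and courses.
    CREATE TABLE enrolment (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        course_id  INTEGER NOT NULL REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
""")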

ER diagrams map directly back to relational databases (entity -> table) and have simple (although somewhat obtuse) rules for how to appropriately normalize the structure in order to maximize the functionality. Normalization, unfortunately, is also often ignored, resulting in severe operational problems and damaging the ability to extend the usefulness of the code later. That's sad, since it isn't difficult knowledge to acquire and is essentially mandatory if a relational database is needed for persistence. If the foundations are broken, everything built on top of them inherits that dysfunction.
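To make the normalization point concrete, here is a small, hypothetical before-and-after: customer details repeated on every order row, versus the same facts pulled out into their own entity and referenced:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Unnormalized: the customer's name and address repeat on every order,
    -- so a change of address must be applied to many rows (or be missed).
    CREATE TABLE order_flat (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT,
        customer_addr TEXT,
        placed_on     TEXT
    );

    -- Normalized: each customer fact is stored exactly once and referenced.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        addr        TEXT
    );
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        placed_on   TEXT NOT NULL
    );
""")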

ER diagrams are not particularly good for complex data-structures like time series, trees and graphs. This is unfortunate, since as we tackle more sophisticated problems these structures show up far more frequently. Still, it doesn't take much fiddling to adapt the basic ideas to handle them. A data-structure is just a recursive container for variables and other data-structures. An entity is also a container, just not a recursive one, so the extension is obvious, although it is difficult to know whether to express the recursion internally in the entity or as an external relationship. The latter is clearer for breadthwise data, while the former is clearer when coping with depth.
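A brief sketch of those two choices, using invented names: recursion held inside the entity as a self-reference (good for depth, like trees), versus recursion expressed as an external relationship in its own edge table (good for breadth, like general graphs):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Recursion held internally in the entity: each row points at its parent.
    CREATE TABLE category (
        category_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        parent_id   INTEGER REFERENCES category(category_id)  -- NULL at the root
    );

    -- Recursion as an external relationship: the edges live in their own table.
    CREATE TABLE node (node_id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE edge (
        from_id INTEGER NOT NULL REFERENCES node(node_id),
        to_id   INTEGER NOT NULL REFERENCES node(node_id),
        PRIMARY KEY (from_id, to_id)
    );
""")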

Decomposing complex data models into data-structures and their relationships makes it easy to map the data back to a model in code, but care must be taken. Turing-complete programming languages most often appear to be considerably more expressive than the core relational model. I'm not sure if this is theoretically true, but it has shown itself to be practically the case; it is most often made obvious by programmers writing their internal code first, then getting into severe problems when trying to connect it back to the schema. For this reason, for any persistent data, it is far superior to sort out the representation in the persistence technology first, then map it in a 1:1 fashion back to the code. That remains true no matter what programming paradigm is being used. It also provides justification for viewing the persistence as a 'foundation' on which everything else is built.
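As an illustration of that 1:1 mapping (the table and class here are hypothetical), the code-side structure simply mirrors the persistent one, column for column:

import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass
class Customer:                 # one field per column: same names, same meaning
    customer_id: int
    name: str
    email: Optional[str]

def load_customers(conn: sqlite3.Connection) -> list[Customer]:
    # Read rows straight into the mirrored structure, no translation layer.
    rows = conn.execute("SELECT customer_id, name, email FROM customer")
    return [Customer(*row) for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, "
             "name TEXT NOT NULL, email TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Ada', NULL)")
print(load_customers(conn))     # [Customer(customer_id=1, name='Ada', email=None)]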

As such, while user interfaces should be constructed top-down (to ensure empathy with the users), data should be built from the bottom up. A slight contradiction, but one resolved by always dealing with the data first; once that has been completed, it can be connected to the user interface. It's an inverted T pattern: lay the foundations, extend them up to the interface, then rinse and repeat to span out to the full functionality. Built in that order, the system opens up the ability to maximize reuse, which saves massive resources.

Doing the actual modelling work is fairly easy. Find all of the currently existing data, figure out what other data will also need to be captured, then start bringing the attributes together into entities. A whiteboard really helps, in that working through all of the special and corner cases sometimes takes several rounds of rearranging the structure until it fits appropriately. Still, it is far faster and a lot cheaper to work through the issues in design first, before they become trapped in a large code base. Modelling usually requires a group of people working together, to avoid tunnel vision.

For a very complex schema with a huge number of entities, the same basic structural patterns will repeat over and over again. Much like code reuse, we would prefer to generalize these common structural relationships and only implement them once, saving both time and resources. Generalization in this fashion is a form of abstraction that can be seen as collapsing as many special cases as possible into a minimal number of unique structures. Obviously, to keep it readable, the terminology must lift upwards and apply appropriately across the breadth of all of the included special cases. A generalization that lies is inappropriate; one that abstracts the structure with enough flexibility to hold all of the special cases is a huge asset. Generalizing, however, isn't independent of performance and resource issues. For instance, you can store all of the data in an entity-attribute-value (EAV) schema, in long skinny tables, but care needs to be taken because the number of rows now has a strong multiplier, which can kill performance. A table with millions of rows isn't too bad on modern computers, but handling billions of rows isn't so easy. This type of generalization should always be prototyped first, to avoid burning through lots of resources on predictable problems.
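For reference, a bare-bones EAV sketch looks something like this (the names are invented); the comment shows where the row multiplier comes from:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eav (
        entity_id  INTEGER NOT NULL,   -- which thing this fact belongs to
        attribute  TEXT    NOT NULL,   -- which property of the thing
        value      TEXT,               -- the value, usually stored as text
        PRIMARY KEY (entity_id, attribute)
    )
""")

# One 'customer' entity becomes three rows instead of one -- the multiplier
# that can quietly turn millions of entities into billions of rows.
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    (1, "name",  "Ada"),
    (1, "email", "ada@example.com"),
    (1, "city",  "London"),
])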

Our current understanding of data modelling is predicated on the relationships being static. We essentially model the rather informal real world as a very rigid formal system. In some constrained contexts, like an accounting or inventory system, this works fine, but as we start to try to build smarter, more helpful software this rigidity is a huge problem. It would seem that the best option, then, is to include no structure at all and just let the data land where it may. That's not actually practical; a fully free-floating clump of data, like a massive text field, is essentially worthless because we cannot deterministically extract knowledge from it (humans can, but not computers). If we decompose everything, type it, and then build up dynamic graphs to capture the relationships, we now have something useful, but we will run afoul of computational complexity. That is, the resource usage will grow exponentially as we add just a little bit more data, which will rapidly make the program so slow it becomes useless again. The time for a calculation will quickly jump from a few seconds to days, months and eventually years.
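A very rough, purely illustrative sketch of the fully dynamic end of that spectrum, with every datum as a typed node and every relationship as a typed edge built up at runtime:

from collections import defaultdict

nodes: dict[int, tuple[str, str]] = {}          # node_id -> (kind, value)
edges: defaultdict[int, list[tuple[str, int]]] = defaultdict(list)  # node_id -> [(relation, other_id)]

def add_node(node_id: int, kind: str, value: str) -> None:
    nodes[node_id] = (kind, value)

def relate(a: int, relation: str, b: int) -> None:
    edges[a].append((relation, b))

add_node(1, "person", "Ada")
add_node(2, "company", "Analytical Engines Ltd")
relate(1, "works_at", 2)

# Flexible, but every question now means walking the graph; as the nodes and
# relationship types grow, that traversal cost is what explodes.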

The trade-off between static and dynamic relationships is extraordinarily difficult and has not been explored nearly enough. Given this, the strongest approach is to model rather statically at first, then gradually generalize and make relationships dynamic. A careful step-by-step progression, checking for real performance issues at each iteration, can lead to finding the most appropriate level of dynamic behaviour for any given model. It certainly isn't fast work, and although experience can speed it up, in a big data model the inherent context is well beyond our ability to visualize the behaviour. That means each step must be fully tested before the next one is taken. Although it is costly to build, dynamic data better matches the real world, making the software super-flexible. Done well, this incorporates the type of intelligence and sophistication that users so desperately need to get their problems really solved, without tossing in negative side-effects.

Over the years, whenever I've encountered really bad software, the all-too-common thread has been that the data modelling was ignored. There was a plethora of excuses and plenty of hollow justifications, but none of that really mitigated the severe damage that erratic data caused in the code base, the development project and, in some cases, the business itself. I've generally taken bad modelling to be a symptom of other, more severe problems, but once it gets entrenched it anchors the development far away from any possibility of success. That is, there is never a turnaround in the fortunes of a failing development project, not a real one, until the foundations of the project have been addressed and fixed. Fix the modelling, fix the schema, then get in all of the refactoring required to fix the code, and then, at least from a development perspective, the project has a fighting chance of survival. Continue to avoid acquiring any knowledge about modelling, and to ignore the consequences of that, and the project will just get worse. It doesn't really matter what the larger environmental or political problems are if the foundations for a software development project are fubar'ed. If there is no upper-level will to correct this, go find something else to do with your life; it will be infinitely more satisfying.

Monday, September 7, 2015

Just Managing

Somewhere, headed into the 21st century, we lost our understanding of how to manage people. To compound this, as the last of the great managers vanish, there are fewer and fewer role models to show us how to lead.

Good management is not just talking. Walking around and chatting about anything and everything does not help to get stuff done. Throwing out a stream of highly creative ideas doesn't help either. Managing people is all about getting them to get their work completed; it isn't about spewing out blue-sky ideas on the off-chance that they might be useful.

Management isn't about screaming at people. People really don't like that, and as a consequence they aren't going to respond well to it. A good manager is 'firm', but they are also 'fair'. Their directions are consistent and applied equally. They listen well and can provide enough guidance to help the people below get their jobs done.

Management isn't about placing blame. Rather it is about protecting the employees, particularly when the problems are a fairly routine consequence of people being all too human. On occasion individuals do need to be corrected, but that shouldn't be a public display of shaming; that's only going to make the issues worse.

Management isn't about hiding away when the shit hits the fan. In fact, it is quite the opposite. Managers need to lead by example. They need to step into a position of authority and act quickly to get everything resolved. If they can't lead, how can people follow?

Management isn't about endlessly mulling over a lack of information while procrastinating on making decisions. Things must get done. Sometimes the available information leads to the wrong conclusions, but a good manager accepts this, admits it, and adapts to the changing circumstances. Not doing anything is far less productive than having to make a few alterations on the fly.

Management isn't about shutting off your brain. The idea that getting promoted to a higher level means not having to work as hard anymore is both backwards and inherently self-destructive. What makes a great manager is that they understand the work that their people are doing, and that they can help them jump through the obstacles to get it done efficiently. You can't manage people if you have no idea what it is that they are supposed to be doing. Pretending to understand is not a rational way to make decisions.

Management is not an independent skill. You can't take a manager from one domain and just toss them into another. You can't just teach 'generic' management and expect that it applies to any type of work. Management at all levels means understanding, in great detail, how things need to be done, so that you can make sure that they actually get done.

The skill of managing is different from any specific work skill. Good workers aren't intrinsically good managers. To make that leap, an employee needs to learn a whole new set of people and communication skills as well as get some serious mentoring from an existing, good manager. It's an upper level perspective, but still tied to people.

Management is sometimes baby-sitting. Not to be demeaning to some employees, but people under stress can be quite volatile, so they often need a bit of hand-holding to keep them stable. That requires patience and it requires empathy. It also requires a sense that sharp turns in steering can upturn a lot of baggage. People are the key resource and they need to be handled individually, and with dignity. Robots are mindless replaceable cogs, employees are not.

Management means disappointment. You might specify very clearly a few of the degrees of freedom for an important piece of work, but you have to accept that for the other degrees, what was left unsaid could be interpreted very differently. The final results may not match the expectations. If it was a bad choice, it's the manager's fault, not the employee's. A good manager knows what needs to be specified, precisely sometimes, and what can float. They also learn to whom they have to provide more details and greater clarifications. They can't just blindly hope that things will get done correctly.

A good manager is not a dictator. Nor are they a control freak or a bully. They have to ensure the work gets done, but they also have to avoid burning through resources such as their employees. They lead -- intelligently -- but they can also step in and assist when necessary. People want to work for good managers; they will follow them to new organizations. That loyalty is sometimes a strong indicator of ability.

Management is a lost art. We should try to find it again.