Friday, February 24, 2012

Complexity and Decomposition

The standard approach to getting a big project accomplished is to identify the problem, come to an overall solution, then keep breaking it down into smaller, more manageable pieces. Once these are small enough -- specific tasks -- all that’s left is to march through them one-by-one and complete them. This is quite a reasonable way to approach problems that aren’t heavily hierarchical in nature and whose sub-parts are mutually independent, or at least mostly independent.

If, however, the problem is multi-dimensional in nature, and there are a significant number of different ways to decompose it, this doesn’t work. If there are at least two logically consistent ways to decompose a problem, then the underlying pieces are interdependent. These cross-dependencies mean that “linearizing” the problem and marching through each piece in sequence will result in a significant amount of redundant effort. Besides the extra work, redundancies wreak havoc because people are inherently inconsistent. The resulting differences in output (and any consequential problems) amplify the complexity, and again result in extra work.

An example of this is a problem that can be broken down into A, B and C along one dimension, and into 1, 2, 3 and 4 along another. If we decompose lexically first, we get the following pieces:

A1, A2, A3, A4, B1, B2, B3, B4, C1, C2, C3 and C4.

If we decompose numerically we get:

1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A, 4B and 4C.

Thus we have a two-level hierarchy that falls into 12 different sub-pieces we need to complete in order to get the project done, with two reasonable ways to slice and dice it.

While this type of problem exists with any type of labor, it is very noticeable in software development. As such, the concept of ‘polymorphism’ arose. The essence of this concept in software is that a number of similar but different types of data all share a common interface, so the code working on them doesn’t need a specific implementation for each different type. Thus, this approach addresses a decomposition over a finite number of static data-types. But the concept itself is general. It is not restricted to things that are particularly static, nor to data. It can be applied to dynamic data as well: data that is mutating in unexpected ways can still share a common interface (and support reflection). It can also be applied to static code, where all of the blocks of code share an identical interface. And of course, given those other two, it can apply to dynamic code as well.
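As a rough sketch of that shared-interface idea (the names Exportable, Invoice and Receipt are invented purely for illustration, not taken from any particular library), the code that operates on the data only ever sees the interface:

    import java.util.ArrayList;
    import java.util.List;

    // Two similar but different types of data sharing one common interface.
    interface Exportable {
        String toCsvRow();
    }

    class Invoice implements Exportable {
        public String toCsvRow() { return "invoice,100.00"; }
    }

    class Receipt implements Exportable {
        public String toCsvRow() { return "receipt,42.50"; }
    }

    public class Exporter {
        // One piece of code handles every current and future Exportable type;
        // no per-type branching is needed here.
        static void export(List<Exportable> items) {
            for (Exportable item : items) {
                System.out.println(item.toCsvRow());
            }
        }

        public static void main(String[] args) {
            List<Exportable> items = new ArrayList<Exportable>();
            items.add(new Invoice());
            items.add(new Receipt());
            export(items);
        }
    }

Adding a third data type means writing one more class, not touching the export code at all.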

While it is a commonly used and very effective programming paradigm, it can also be applied back to any basic problem decomposition. In our above example, there are 12 different pieces of work that need to be completed, but there are really only 7 unique items in that work list: the three letters and the four numbers. Their combinations may grow combinatorially, but that doesn’t mean the only way to get the work done is to explicitly work through all 12 pieces. We might conclude that at maximum efficiency we need only complete 7 pieces of work, though I suspect that in practice there is some extra overhead required to bind the sub-pieces together. Still, the expectation would be that a polymorphic approach to handling the work would land somewhere between 7 and 12 pieces. If it were 8, for instance, that would mean the most efficient way to solve the problem took two-thirds of the effort required to just pound through all 12 pieces. That would be a considerable savings.
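To make the counting concrete, here is a minimal sketch using the A, B, C and 1 through 4 labels from the example above (the class and method names are purely illustrative). Three variants implement one interface and four operations are written once against that interface -- 7 pieces of code in total -- yet together they cover all 12 combinations:

    // Each variant supplies only its own small piece of behaviour.
    interface Variant {
        String name();
    }

    class A implements Variant { public String name() { return "A"; } }
    class B implements Variant { public String name() { return "B"; } }
    class C implements Variant { public String name() { return "C"; } }

    public class Work {
        // Four operations, each written once against the interface,
        // not once per variant.
        static void op1(Variant v) { System.out.println(v.name() + "1"); }
        static void op2(Variant v) { System.out.println(v.name() + "2"); }
        static void op3(Variant v) { System.out.println(v.name() + "3"); }
        static void op4(Variant v) { System.out.println(v.name() + "4"); }

        public static void main(String[] args) {
            Variant[] variants = { new A(), new B(), new C() };
            for (Variant v : variants) {        // 3 variants x 4 operations
                op1(v); op2(v); op3(v); op4(v); // covers all 12 combinations
            }
        }
    }

Whatever binding overhead shows up in practice, the structure of the savings is the same: the 12 combinations fall out of 7 pieces of written code.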

Now, at this point some readers are probably on the edge of their keyboards, but they likely fall into two different categories: a) those that disagree that there is a more efficient way of solving the problem, and b) those that see this whole discussion as obvious. I’ve certainly met both kinds of people in practice, although I think more people fall into the ‘a’ category. The reason I think this is more common is that it is easy for most people, when faced with a problem, to adopt tunnel vision. That is, they don’t focus on the ABCs, nor on the 123s; rather, they pick what they believe to be the one correct decomposition (lexical or numerical), then they narrow their field of vision to deal only with a sub-problem like A1. While working on A1, they don’t want to hear anything about A2, and certainly don’t want to know about, or even consider, C4. They just accept that there are twelve things to do and start doing them.

Again, this shows up very clearly, and quite frequently, in programming. It is fairly common, if you read a lot of code, to see a programmer belt out very specific handling for A1 in one location, and then find a completely separate, yet nearly identical, A2 in another. Not only does this occur in code, but it also pops up in the data and in the static data used for operational configuration. Software, these days, is highly redundant. It actually gets worse when you look at teams of programmers. Large to massive code bases seem to start with 2x - 3x redundancy, and I wouldn’t even want to guess at their worst factors. They just become ever-increasing seas of redundant data and code.
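That kind of duplication tends to look something like this contrived sketch (the domain and the numbers are invented): two near-identical handlers written separately, beside the single shared version they could have been:

    // Two near-duplicate handlers, typically written months apart
    // in different parts of the code base.
    class Duplicated {
        static double bookTotal(double price)  { return price + price * 0.05; }
        static double musicTotal(double price) { return price + price * 0.05; } // copy-paste of the above
    }

    // The factored version: one piece of code, with the variation pushed into data.
    class Factored {
        static double total(double price, double taxRate) {
            return price + price * taxRate;
        }
    }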

Even though polymorphism is a well-known, popular approach to architecting code, it is very unlikely in most cases that a significant proportion of a code base is utilizing it across the board. There may be some sub-instances, but most often they are trivial uses. This continues to occur even though, oddly, one of the most famous modern adages of programming is the DRY principle -- don’t repeat yourself -- which came out of The Pragmatic Programmer book (and possibly series). It’s rather obvious that this wisdom isn’t being followed on a regular basis, quite simply because it’s really, really hard to accomplish in practice. If it were being attempted more often, we’d see it dominating far more of the technical conversations, and we’d see it show up more often in the vast amount of open code that exists. And a huge number of modern technologies not only violate this principle, they actively encourage violating it. So while it is talked about, it is quite often swept under the rug in the rush to belt out the next version.

I do find this ironic, because certainly in software we’ve seen a huge growth in the expectations for what software can do, and in general, given the rapidly increasing -- out of control -- complexity of our societies, you’d think more and more people would start to look for efficiencies. But it does seem that the opposite is true of human nature. That is, the more complex things become, the more tunneled the vision of the people who have to deal with them. Our species seems to retreat towards the shallows the moment things get to be too much. This, it seems, sets up a feedback loop: people who took the long road end up late, which forces them to not analyze the next fork sufficiently, so they again choose a longer route. And then it never ends.

And it’s that type of negative feedback loop that I’ve seen in many software development projects, both at the coding level and the organizational one. The projects get so ground down by slapping band-aids on unnecessary problems caused by redundancies that they lack the resources to avoid creating those redundancies in the first place. The work degrades into a loosely coupled collection of disorganized functionality that mitigates most of its own benefits. Not only have they chosen the least effective path, they’ve also opened up an ever-widening sinkhole for their own resources. And it seems that this type of problem comes directly from our own intuitive approach to solving problems. We are most likely to not make the right choices (and then we are far too stubborn to backtrack).

Thursday, February 16, 2012

Software Clearing Houses

I love the idea behind open source. If I’m building something that uses someone else’s work, being able to drop into their code, investigate, curse, and then work around their problem is a huge time-saver. Nothing is worse than wasting hours guessing at what weirdness lies beneath.

The problem is that this idea of ‘open’ mutated into the idea of ‘free’, and ‘free’ is not a good idea in a society that revolves around money. If you write some fabulous piece of code and give it away for free, not only do you not make money yourself, but you’ve also prevented other programmers from making money by writing something similar. Not all of us are lucky enough to get funded by other means; some of us need to pay mortgages and bills and such. We do this by getting paid to work -- by writing code for a paycheck. If everything is free, we’re going to have to find some other (less agreeable) way to pay the bills.

A slightly worse problem is that as more and more stuff becomes free, more and more of the low-hanging fruit disappears. What that means in reality is that it becomes harder and harder for programmers to go out on their own, to start their own companies. Instead, control of the industry shifts to the big players, who have little incentive to innovate. If you can write something small and profitable, then you get the freedom to experiment. If you can’t, then you’re stuck for life in a big sweatshop writing broken code for people who don’t get computers. I’ve definitely seen this trend in the industry over the last few decades. The really innovative works have nearly vanished and been replaced by more and more sloppy re-works of existing wheels. Not only that, but the profits come from the upper levels of software, so the lower ones stagnate as the bugs get permanently frozen into the code base. Thus our software looks prettier but, because of the increase in complexity, is dropping in sophistication and quality. And it’s not in the big companies’ interest to change this trend. They seem to make more money with lesser-quality code.

So what can we do? My first suggestion is that we should push to get more and more software into open source. That is an easy win: transparency promotes quality and ease-of-use. But at the same time, we need to attach a price to every single piece of code out there. I don’t think home users and developers should pay -- I like that they ride for free -- but for big companies, making profits from our labors, money should definitely return to our community. Money that we can use to innovate with.

The problem is that programmers aren’t business people, and few of them really want to deal with business. What they want is to build really cool stuff and leave the hassles of collecting money to other people who enjoy that sort of thing. To allow for this, I think we need to set up ‘software clearing houses’. Programmers would deposit their code into these organizations, and the staff there would handle the wheeling and dealing in the business arena. The clearing house could deal with licenses, accounts receivable and accounts payable. It would be the repository of the running code and of the source code. It could collect bug reports, then deal with farming them back out to the communities that built the code. Basically, it would act as a middleman between a large number of developers on one side and a large number of companies on the other.

Many of our current licenses ask for funds when the code goes into a commercial product, but not if it is being used in-house. One reason for letting the in-house users ride for free is that not doing so would result in a huge number of little payments that would all have to be coordinated. That would be messy for an individual developer, but if a clearing house represented a significant number of projects, libraries, utilities, etc., most big companies wouldn’t have a problem paying a single, reasonable, yearly lump sum. There are a huge number of ways of structuring this type of arrangement -- fine-grained versus coarse payments, etc. -- but what is really important is that it isn’t a burden to the companies, and that the money is flowing to the developers.

A significant problem with relying on many of the newer open source libraries is support, both for bugs and for ongoing rust prevention. A clearing house could provide some assurance that it will contact the developers and try to get the issues sorted out. If that proves infeasible, it could also contact other, unrelated developers and get the code fixed or updated that way. Once deposited in the clearing house, the code could live on well past the author’s interest. It would also be less subject to dramatic shifts in design or licensing. If enough people were interested in the preservation of a fork, the fork would find an easy means to continue.

One problem for commercial developers is the proliferation of various licenses for libraries. There might be a great library to use, but the license may be vague or destructive. Often, approaching the developers directly results in outrageous financial demands, making it impossible to utilize the work. Commercial developers are keen to make profits, and aren’t against sharing them fairly, but the commodification of software has dramatically lowered the margins. It’s getting harder and harder to make a profit directly on software; the money comes more often from the services and support side, particularly for categories like niche enterprise software (5-20 clients). Thus payments to the authors of dependencies would be fairly small, yet would constitute a considerable overhead if there were a lot of them. This, again, would be fixed by clearing houses. Lump-sum payments, or per-sale payments, sent to a single clearing house that then disperses them to a large group of developers would allow the money to flow. If all of the works were under the same license and the terms were reasonable, that would remove another big problem for the commercial developers.

Another important point is that there should be many clearing houses. Competition is a good thing, and some of the houses could specialize in providing access to particular types of code. Some industries are highly regulated, and a house that could provide certified libraries would be hugely appreciated. License and support features could also differ significantly, as could the underlying quality of the code. A house that only provided vetted, high-quality libraries, for instance, would be a very useful entity and would save a lot of the development time currently spent evaluating the existing options.

I should point out that, to some degree, this idea already exists. Both Apple and Google have markets for apps that act as middlemen between the developers and the consumers. This seems to be working quite well (although it does also seem to have reduced the price of software). What I think we should do is expand that basic mechanism out to all code, all of the time. Pretty much everything would go through clearing houses, and everything that is usable would have some cost attached to it.

There are lots of other benefits as well, but my readers seem to really hate it when I go on and on :-) My key point for most people is: why kill yourself in the evenings and on weekends to do great work that may end up making other people money for decades, if you are not going to get some share of the pie? Write something, deposit it into a couple of different houses, and if in a few years that provides the means to retire early, then you are free to focus on the code you’ve always wanted to write. Wouldn’t that be a great thing? You’re happy, the other coders are happy, the businesses are happy and the industry is happy. Everybody wins.