Saturday, September 6, 2008

A Dependency Too Far

History is littered with failed ideas. Bright sounding, sensible suggestions that utterly fail to deliver, or are just outright fallacies. Heck, we spent how long thinking the planet was flat? Math was only geometry? Alchemy and science were magic? Women were envious?

Every great intellectual pursuit has wandered down a blind alley and gotten stuck for periods in its history. It's expected. Once we get a bad idea entrenched, it takes some time to free it up again. It's a shame, but it is all part of growth. Progress, in terms of learning, spreads out in all directions, some of which are duds.

Computer Science, being young and all, has more than its fair share of bad ideas. Where some disciplines have received a bit of rain, we've always gotten a torrent. We're drowning under a steady flow of black and white rigid half-truths that only have limited applicability. Something about software causes people to try to jam down half-baked one-size-fits-all theories that ultimately cause more harm than good. Ideas that stick when they shouldn't.


REINVENT ME

My personal favorite, used incorrectly so many times, is "don't reinvent the wheel". You can't get a bunch of programmers together for too long before one or more of them falls back on this old saying.

When most people shout this one out, they are using it as the de facto reason why all programmers, everywhere in the world, should use some specific existing library to perform some action. It is usually used as a reason why you shouldn't be writing code. "It already exists, you clearly don't want to reinvent the wheel, do you?" goes the logic. Sometimes it's specific to a known library; often it's just a reference to writing anything that is already known to have been completed.

There is so much wrong with this statement, and its usage is often badly applied.

The first point is that it isn't really a reference to making use of a specific library, it's a reference to making use of a specific idea. By "not reinventing the wheel", you're actually trying to avoid recreating the ideas behind it. The wheel is a concept, not a specific instance.

That point is more obvious when you realize that you wouldn't use a bike wheel on a car, or a wooden carriage wheel on a bike. Depending on what you are building, you often have to create your own adapted version of the wheel.

The advice is really to avoid rethinking solved problems, not to stick only to the instances that are available. Reinventing it is quite different from reimplementing it.

It is far closer to saying don't reinvent the design pattern than it is to saying don't reinvent the specific code in some library. If there is a known algorithm for handling that problem, why agonize over creating a new one? Build on the ideas that are there already.

But you can't confuse that with just building on specific known instances. Working through your own algorithm is totally different than working through your own code.

Had we not reimplemented the wheel occasionally, we never would have evolved. Given that we invented wooden wheels first, had everyone stuck with them, most of the vehicles we have today would not have been possible. Our wooden wheels would blow apart on the highway. Progress would be nil.


EXISTING CODE

Even if you account for the misuse of the term, a lot of programmers think that it is always a good idea to exploit any existing code. If the code exists, why rewrite it? This notion is predicated on the idea that if it was already written, for some reason it must be a reasonable and workable way to solve the problem. It's sort of the antithesis of progress.

It is one thing to reuse the code you wrote, over and over again in a project. That is not only perfectly reasonable, it is also considered good practice. Internally in any development, the programmers should leverage all of their work to the maximum effect, which definitely means not constantly rewriting the same subset of instructions again and again. Coding is slow and time consuming, finding ways to leverage the effort is good.

Where this goes wrong is that people assume that if it works internally within the project, it is just as reasonable externally. If the code exists in the industry, it should also always be used, no matter what. No exceptions.

If this were true, we'd get some horrible problems. Once it's written, it is written for all time. We could never fix the problems or find a better solution, because someone already wrote that. It also implies that the people who get to the problem first inherently know how to solve it best. If someone wrote a library for handling logging, for instance, some people seem to think that implies that they are experts on logging. Sometimes that is the case, while often it is not. With more experience and hindsight, later implementations are often much better.

Just because something exists, that in itself is not a good enough reason to use it. Yes, you should be aware of it, but sometimes it can be more trouble than it's worth.

The real problem is that programmers are always looking for hard and fast rules that keep them from thinking. Everyone just wants something fixed, so the issue doesn't have to go up for discussion for hours and hours, or days and days. That wish, of course, is the key part of the problem.


DEPENDENCIES

Ultimately, we must always look at our software from a higher perspective. When we write a solution, these days, because of the intense complexity of our environments, infrastructure, etc. we are only writing a very small percentage of the actual final code. What we control is tiny in comparison to what we do not.

A huge problem then, one that often rears its head in development, is what happens with the code we do not control.

If we find a bug or strange behavior in something we wrote, obviously we can fix it. If it's not ours, but is fixable (avoidable) that problem propagates upwards into our work. If we can't fix it, we have even more problems.

The old (and reasonable) term for someone else's code was calling it a dependency. That term implies that the underlying library, framework, technology, protocol, etc. is something on which we are depending. And the effect of any dependency is to impact the success or failure of a project.

We can't release our code if it's based on features that don't work in a database, for example. All the work we've done, all of the effort, is wasted if we can't find a way around the problem.

Dependencies are risky things. They are inherently dangerous, and should be well tracked and well understood.

If you knew there was an inherent risk in performing some action, even if it was small, then you understand that the more you perform that act, the higher the likelihood of having problems. That's the basis of probability. Do something five times with a 1 in 5 chance for some outcome, and the odds are pretty good you'll trigger that outcome.
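To put a number on "pretty good", here is a minimal sketch of the arithmetic. It assumes each attempt is independent of the others, which real dependencies may not be, and the function name and the 2% per-dependency figure are made up purely for illustration.

```python
def chance_of_at_least_one(p: float, n: int) -> float:
    """Chance that an outcome with per-try probability p shows up
    at least once in n tries, assuming the tries are independent."""
    return 1.0 - (1.0 - p) ** n


# The example from the text: a 1 in 5 chance, tried five times.
five_tries = chance_of_at_least_one(0.2, 5)
print(f"{five_tries:.1%}")  # 67.2%

# Hypothetical figures: 20 dependencies, each with a 2% chance of
# causing a serious problem over the life of the project.
twenty_deps = chance_of_at_least_one(0.02, 20)
print(f"{twenty_deps:.1%}")
```

Even a small per-dependency risk adds up quickly as the count grows, which is the point of the paragraph above.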

If you're out there dumping every possible dependency you see into your code base, you're on your way to a major meltdown. Your risk is skyrocketing at an alarming rate.

The way it's used, the phrase "don't reinvent the wheel" is tantamount to saying we should try to gather as many dependencies as possible, without thinking at all about whether or not they are reasonable: implying that dependencies are always better than code. That's just madness.


REASONS

Sometimes a dependency is unavoidable, sometimes it isn't. The total inherent complexity of one's own code can be less than the inherent complexity of a library. That is, sometimes, in some cases, for some reasons, writing it yourself is a much much better idea.

But, as you may have guessed, I was being really really ambiguous about when and why that is true. Like all good and messy things, the choices aren't simple.

A couple of easy ones exist. If you're the expert, or more expert than the authors of an existing version, then you should likely write your own version. If the code is so simple that everyone is an equal expert, then again, unless there are time constraints, you should probably write it. Another expression, "a bird in hand is worth two in the bush", applies to code, specifically when the version you write can be orders of magnitude simpler than an established library.

That happens when, for example, the library code is heavily generalized and attempting to meet the requirements of a large diverse group. If your usage is a tiny fraction, and you understand it well, then it can be time well spent to make your own.
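As a rough illustration of how small the "tiny fraction" can be, here is a sketch of a purpose-built logger for a project that only ever needs timestamped lines on one stream. The function name, the line format and the choice of stderr are all assumptions for the example, not a recommendation over any particular logging library.

```python
import sys
from datetime import datetime, timezone


def log(level: str, message: str, stream=sys.stderr) -> str:
    """Write one timestamped line to the stream and return it.

    No handlers, no configuration files, no hierarchy of loggers:
    just the one behaviour this hypothetical project needs.
    """
    timestamp = datetime.now(timezone.utc).isoformat()
    line = f"{timestamp} {level.upper()} {message}"
    print(line, file=stream)
    return line


log("info", "nightly import started")
log("error", "could not open the input file")
```

Whether a dozen lines like this beat a general-purpose library depends on the trade-offs discussed in this section: it is only a win if the usage really stays this small and well understood.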

On the other hand, if you're building an accounting application, it would make almost no sense to try to rewrite something huge and complex like a jpeg library. When you are coming from a problem domain unrelated to some specific technological domain, you have to focus on the types of code and solutions that you want to support. You might have the ability to write an image library, or a specific subset, but do you want to lose focus on the real coding issues? What happens when your expert image programmer leaves? Who will maintain or enhance the code?

Another thing to consider is who is building the code, and how long they are going to continue to support it. Some libraries need frequent updates, yet if the initiative drops out, you've invested time in integrating something that is rusted and useless. Particularly for Open Source, if the project doesn't reach a significant level of success it quickly becomes orphaned. That can render the code legacy; why maintain someone else's mistake, when you could have created your own better solution?

On occasion freely available code has even shifted, and been ground down in ugly legal issues. The license for one version is not necessarily the same for the next. Projects have gone from Open Source to commercial and back again. If people smell money, that can change their intent. If there never was any money, that can also influence them. Things change, and sometimes that creates serious problems. You can't always trust the developers.

In some cases, the functionality is basically simple, but the code is large anyways. If the initial size of the development work is small for example, it doesn't make sense to add in a huge amount of already simple code. It just increases the code base to make the project medium or large, with no real advantage. Assuming the code works well enough, the size of the original project vs. the size of the library is important in not pushing a smaller project into a larger implementation, without sufficient payback for the complexity increases.

A classic example is a small project taking advantage of a weak, but usable logging library in exchange for keeping the project small. A large project would easily benefit from a total rewrite, since it would fix problems and provide enhanced capabilities, but also a lot more code to maintain. Crossing the size barrier for a project has all sorts of implications.

These issues get even more complex if you account for the different sectors of the programming industry. Programming is done in a wide variety of places with different constraints and different expectations. Where you are programming, and what you are programming, affect your choices.


IN-HOUSE

In-house refers to a group of programmers working on domain specific applications in a company or organization whose focus is not software. That includes industries like finance, manufacturing or health care. If the central goal of the organization is not to code, then its priorities lie elsewhere. Programming is a means to an end. Although this is a diminishing area, for the longest time this is where most of the code was getting written.

All major companies need accounting, inventory and sales systems, and may have found specific automations can be competitive advantages. Internally the programmers are driven to build custom systems, but often with a focus on trying to jam in as many off-the-shelf parts as possible. Historically in-house code has been quirky, and developer-centric, so organizations have learned hard lessons about having mission critical systems tightly bound to specific employees. Costs and faults shoot through the roof if key personnel are lost, so after decades, most big companies are actively trying to avoid more of this type of work.

The range and quality of in-house code varies widely. Some of it is high quality, but the internal standards are always significantly less than any commercial sector. Essentially, in-house works can be very crude, and half-completed, but because there is only one real client, and a lot of hands-on operational exposure, this is accepted.

Mostly, for many shops, the "don't reinvent the wheel" philosophy is reasonable. This is the heartland of that expression. In-house developers stay forever or turn over really quickly, but either way the focus is on stitching together a final solution at a high level, not developing a full and complete real solution to the problem. The depth of the code base is always very shallow. With the exception of mainframes, the average lifespan of in-house code is short, and often relative to the turnover in employees.


CONSULTING

Because of the obvious problems with companies expending too much effort on custom implementations, only to have their lifespans cut short, companies are eager to avoid in-house development. That would be fine, except that some customizations in code can represent significant competitive advantages in the market. An easy example is in the financial sector, offering better information access and highly customized reporting to help draw in clients with higher incomes. Tracking is a significant issue for people with a lot of financial assets. Providing this draws in clients.

Over the last few decades consulting companies have risen to take up the challenge of building custom applications for large companies. Their mandate is usually to gather a great mass of requirements and then build a big system. Financially, writing code is only a small part of the revenue, so consultants most often prefer to use libraries wherever possible. Interestingly enough, bugs with underlying libraries are not the annoyances they are to most other programmers.

Most consulting companies work on a "get the foot in the door" philosophy, where their initial goal is just to get the base contract. From there, they want to "widen the crack", and introduce more billing. Scope increase is a great way to do this, but bugs in underlying dependencies also help. You can't blame your consultants if they have to do twice as much work to get around a database bug, can you?

Consulting code is generally neat, organized and well documented, mostly because making it that way increases revenue. Its biggest problem is that there is little incentive to do a great long-term job, so the code is usually cobbled together from a very short-term perspective. Classically poor architecture, for instance. It's far better to just wing it today, and then rewrite it in five years, than it is to get it right for the next twenty or even fifty. Alas, a constraint in consulting is to make it possible to continue consulting.

For consulting, new technologies and a huge number of dependencies fit well into their philosophy. Long design times and short development further reduce failure risks. Most of the systems built are one-offs, so they can be quirky, incomplete systems, but they need to map back to a large amount of paperwork somewhere. The paper, after all, pays better than the code. It's far more visible.

Even if they are expensive and heavily focused on the short-term, companies still like consulting built systems because the employee turnover risk is completely removed. Any of the companies will happily return to fix the system for you, for as long as you want. For a price, of course. It becomes a price vs. risk trade-off.


ASP COMMERCIAL

A huge up and coming market has been the commercial quality systems that are being hosted in limited places. The market is called Application Service Provider (ASP), an unfortunate acronym clash with a Microsoft technology of the same letters. It's usually companies that write and host their own solutions. They cover a full range of services.

Unlike in-house development, the code is often used by a huge number of paying or engaged customers. At the interface level, this demands a much higher degree of sophistication and appearance. In-house interfaces can be quirky; few ASP ones can get away with that. Another huge issue is often performance. Some of these systems are clearly the largest systems in the world, hitting huge levels of complexity that are hard to imagine for many other development sectors. Millions of users have a huge impact, a high load and a lot of feedback.

Without digging into it, you might guess that dependencies are good here as well, but clearly the single largest player in the field, Google, has shown that rewriting everything from scratch is a better idea. The more they write, the more they own, the more they control. If you control it, you can change and fix it. That's not possible if it's a dependency.

Fewer dependencies are better. They are less expensive to fix and less rigid. When you have more control, you can tackle bigger problems.

Reimplementing the wheel, particularly as a wheel 2.0, is a strong way to gain market share. If you are good with technology, the things you build can surpass most of what is available.


COMMERCIAL PRODUCTS

At an operations level, ASP code gets implemented in a smaller, more personalized environment. The quality of the interface may have to be excellent, but the overall packaging of the code can be sloppy.

The next level up happens when even small problems at an operation level can become huge financial burdens. The most complicated, hardest sector of programming easily belongs to the shipped commercial product arena.

Products have to maintain a higher degree of professionalism and a higher degree of packaging. They also have to have built-in means of distributing support issues like patches and upgrades. While it is the hardest level of programming, it is easy to find many companies, products and even industries that fall entirely short of living up to this expectation.

The market, over the years has traded reduced expectations for faster implementations. A poor trade-off to be sure, but people get swayed by dancing baloney and forget about frequent crashes.

In a product, a dependency of any type is an unwelcome issue. If it were possible, writing the whole operating system would be best, because it would eliminate any external issues. Microsoft understands this, but Google seems to be cluing in as well.

While it's far too much work for most people, commercial products still have to be extremely vigilant in letting in dependencies. Not only do they cause problems in development, documentation and support, they can also be financial or ownership issues as well. Raising investment is far easier, for example, if you own a significant amount of the underlying intellectual property (IP) of the product. If it's just a nice set of finishing touches on somebody else's code, it is far easier for competition to enter the market, thus it is far riskier to fund the effort.

Another big issue with libraries in the commercial arena is the licenses. Some licenses, like the GPL, are totally inaccessible. You can't use them, the code is untouchable. You might be able to weasel around the issue with an ASP implementation, but not in a real product. The vast number of vague, confusing and often changing licenses is a constant headache for any commercial developer. Reason enough to avoid specific groups of coders.

Strong products control their dependencies, eliminating them wherever possible. If you're running your code on an infinitely large permutation of different configurations, things are ugly enough without having to trust that some other group of developers did it right as well. In cases of extremely high complexity, eliminating any of it, even if it is small, is a priority.


A BETTER IDEA

Hopefully I've made my case that there are times and places where "reimplementing the wheel" is actually crucial to success. The real trick in software development is to not get too attached to any one way of doing things. There is never really a "right" way, there are only the blinders that we put on to convince ourselves that we don't have to think about something. In an industry where thinking and creativity are revered, it's odd that people keep tending in the opposite direction on these types of issues.

If you need to have fixed rules to follow, you should start with the idea that a dependency is never good. It's not something you want, but occasionally, it's something that you have to accept. Instead of a negative platitude that forces always accepting, we should be looking for strong reasons why we aren't rewriting this code.

Even in consideration of time, and schedules, it often makes sense to exploit a library today with the intent of re-writing it tomorrow. We just may have to live with a few more dependencies than are helpful for a while.
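One common way to keep that rewrite cheap, sketched here as an assumption rather than anything prescribed above, is to route every use of the library through a thin layer the project owns. The function names are hypothetical, and Python's standard json module is only a stand-in for whatever external library is being borrowed for now.

```python
import json
from typing import Any


def encode_record(record: dict[str, Any]) -> str:
    """Project-owned entry point for turning a record into text.

    Today it delegates to the json module (the stand-in dependency).
    The rest of the code base calls only this function, so a later
    rewrite touches one place instead of every call site.
    """
    return json.dumps(record, sort_keys=True)


def decode_record(text: str) -> dict[str, Any]:
    """Project-owned entry point for reading a record back."""
    return json.loads(text)


saved = encode_record({"b": 1, "a": 2})
print(saved)  # {"a": 2, "b": 1}
print(decode_record(saved) == {"a": 2, "b": 1})  # True
```

The dependency is still there, but it is confined to two functions, which makes the decision to replace it tomorrow a small one.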

Like dead code, we should always be going back, trying to figure out how to remove these things from the project. Of course some of them are there forever, usually because they encapsulate something super-complex like images, or because the sheer amount of added code to implement them would significantly bump up the size and complexity to a new level. Whatever, we need to understand why they shouldn't be removed, not vice versa.

In the grandest possible manner, it's not unusual for the complexity of the dependencies to exceed the complexities of the code. It's almost always true in a small project, particularly if they tie in a couple of different technologies. It is also sometimes true for large or huge projects as well.

It is a hugely significant statement to say that the un-encapsulated complexity of all of the dependencies is more significant than the coding work in a project. It shows where you should be putting your design and testing resources, and it also shows how little control you have over the full effect of the system. Control, particularly in an industry prone to legacy or degenerating new versions, can make a big difference to the long-term success of the project. You can't change it if you don't control it.