I just finished listening to a discussion between Jim Coplien and Bob Martin:
http://www.infoq.com/interviews/coplien-martin-tdd
and there were some really interesting things in that conversation.
The most stunning was that Jim Coplien works with systems that are hundreds of millions of lines of code. That is a massive amount of code. Shouldn't a higher paradigm like Object-Oriented save one from such a difficult fate? How long does it take to assemble that much code? Why haven't millions of those lines been refactored away?
The next is that Jim was dead-on in his observation about having to start close to the underlying understanding of the actual domain, as in his example of writing software for a bank -- otherwise you'll spend 40 years reinventing a well-known wheel -- yet that center of the conversation just died. If you know how you need to structure your system, why wouldn't you go straight there with the minimum amount of work, especially if you have huge time constraints? Approaches such as CDD or TDD seem to be taking the long way around the block.
Jim's point about developers without an architecture painting themselves into a corner is true regardless of any discussion on Object-Oriented or Agile issues. It is a long-standing tradition in Computer Science for aimless meanderings to finish off a project badly; that hasn't changed in 40 years. It didn't work before OO, YAGNI, TDD or any other acronym and it still won't work after. If you don't start close to where you need to be, you probably won't ever get there unless you are really lucky.
Bob's description of TDD seems more like a 'game' where you are pitting two different representations of the same thing against each other. All the more so when you realize that Jim is correct about unit testing just being a game of 'sink the battleship' where you randomly fire specific tests over a large surface hoping to hit something, before running out of time and shipping the code. Testing is always about getting lucky. And doesn't context switching between tests and code tire one out more quickly? For speed, I've always liked to batch up the work as much as possible.
Bob's thesis that it was unprofessional to 'not' unit-test every line of code before a release was absolutely crazy. And not possible. Mass buckets of code get released by 'professionals' all of the time where the 'coverage' of the testing is probably way, way less than 30%. That's just a guess, but an easy one. TDD doesn't even stop that; given all of the areas where you can't use TDD (GUIs and databases?), huge amounts of untested code will get loose. Untested code will always get loose.
Even if you unit-tested each and every line to 110%, the combination of all of the pieces will open up unexpected and undesirable results, which, if you didn't have some overall architecture to tie it all together, would be a particularly huge and likely fatal mess. You can't skip integration testing; it isn't safe. If you have to do a lot of it at the end, then wouldn't it be better to minimize the stuff in the middle or at the start?
Further, TDD is not a main-stream process right now, even if it is getting lots of coverage from the 'Agile Press'. Most code is still written using waterfall methods, for most industries. Agile itself isn't even main-stream, and TDD is only a tiny piece of it. Saying it is so, doesn't make it any truer, and it leaves a bad taste in the mouth.
The weirdest part for me is that both of these guys are arguing about how to make code more stable, or at least that's what I hope they are arguing about, otherwise what is the point? All this low-level testing is nice, but it doesn't invalidate the higher level full-system testing that 'must' be done to end-to-end test the various pathways through the code. When is low-level testing effective and when is it just doubling up the workload for no apparent reason?
So, I add a process to my house-building technique that makes sure each and every brick is super-accurate. Now they are all exactly the same size given some huge level of precision. That's nice, but does that really help me build a better house?
Software is a static list of instructions, which we are constantly changing.
Tuesday, February 26, 2008
Tuesday, February 19, 2008
Fundamental Coding Issues
Everybody loves a short summary, one that can easily compress a complicated idea into a simple concept. They are hard to find, but always very useful.
In software development there are many complex issues surrounding our computer language 'code' and the various conventions we use for programming. There are lots of different attributes that a program can have beyond just working, but ultimately there are only a few things that easily define a great program. With that in mind, I find the following summary quite reasonable:
The secret to great code is to get the smallest program possible without resorting to being clever, while generalizing to as large a problem space as possible given the time constraints. Make all of the broad strokes explicit, while making all of the details implicit. Get this right, keep it clean and consistent, and you've got elegance.
It is simple, as it should be, but still the underlying parts require deeper explanation.
THE SMALLEST CODE BASE
Software developers don't like to admit it, but the amount of the code in a system is a hugely significant issue for all sorts of reasons.
Programming is a massive amount of work. If you look at the various parts of development: analysis, implementation, testing and deployment, the second one -- implementation, which is the act of writing the code to implement the solution -- doesn't appear all that large in the overall process. It is only one of four basic steps. However, despite appearances, the size of the code drives issues with all of the other work, including the analysis. The code is the anchor for everything else; bigger code means more work and bigger problems.
The problem starts with having to convert your understanding of the user's need into some set of interfaces. Complicated problems produce complicated interfaces which feed on themselves. It is far more work to add a new function to Microsoft Word, for example, than it is to add it to some really simple interface. The analysis doesn't just cover the problem space, it also includes how to adapt the proposed solution to a specific tool. Adding a function to a small tool is way less work than adding it into some huge existing monolith. Because of this, the analysis changes depending on the target infrastructure. More code means more work integrating any new functionality.
The size of the code itself causes its own problems. If your intended solution requires 150,000 lines of code you'd be looking at a small team of programmers for over a year. If your intended solution requires 1,000,000 lines of code it will take a huge team many years to build it. Once it is built, if there is a problem with the 150,000 lines of code, refactoring chunks of it is significant, but not dangerous to the project. With 1,000,000 lines you are committed to your code base for good or bad. Like the Titanic, it is slow to turn, so if there are any obstacles along the way, such as icebergs, you are in serious trouble. Why use 1,000,000 lines if 150,000 will do?
With each line of code, you commit to something you've got to maintain throughout history. It requires work to keep it updated, it even requires work to delete it. The more code there is, the more work that is required to just figure out the available options. Big code bases are unwieldy beasts that are expensive and difficult to handle. Developers often try to get away with just bolting a bit more code onto the side, but that rather obvious tactic is limited and ugly.
In all code there are always latent unwanted behaviors (bugs); that includes mature code that has been running in production for years. These 'problems' sit there like little land mines waiting for unsuspecting people to pass by and trigger some functionality. Testing is the act of mine-sweeping, where the testers spend as much time as they have, trying to detect as many of these problems as possible. Most current test processes are horribly inefficient; they end up spending way too much time on the easy issues and way too little time on the hard ones. Once the time is up, the code is released in whatever state. So often you'll find that code works well for its simple common usages, but becomes increasingly unstable the more you try to push its envelope. Not surprisingly, that matches the testing.
Testing follows some type of square law, e.g. it likely takes four times as much effort to test twice as much code. By committing to a bigger code base you are committing to a huge increase in the amount of testing or, as is more often the case, you are actually diminishing your existing testing by a significant degree. So often, the code base doubles but the testing resources remain the same, only now they are 25% as effective.
With the increase in testing requirements being ignored, most big software packages get more operational problems and issues. For all software, there is a considerable support cost even if it is just an installer, an operator, a simple help desk and some beepers for the programmers. For large commercial projects support can include an entire division of the company.
Running code is hugely expensive; most programmers don't seem to understand this. They just hand off their work and don't think about the consequences. There are the fixed costs, but also those driven by behavioral problems. A particularly bad bug in a commercial setting could cost millions of dollars and really hurt the reputation of a company. Even in an in-house setting, the code going down could delay other significant events, costing money or causing bad publicity. The more useful the code, the more expensive the failures.
Bigger code 'will' fail more often. From experience it is clear that with twice as much code come twice as many bugs. Programmers have a nearly consistent rate of adding bugs to their code that is mostly independent of the actual programmer or testing. Better programmers often have fewer bugs per line of code, but they still have them, and because they tend to work on the more complicated sections of the system or write more code, it is not uncommon for their bug count to be higher. It stands to reason: if there is twice as much code, there are at least twice as many bugs, so hitting a bug is twice as likely.
Finally, bigger code bases also mean bigger algorithms, and a lot more of them. The more complex code is harder to use and has more bugs, but it also means more documentation work. Usually the 'rules' for any of the functional behavior of the system start to become really 'sophisticated'. Well, sophisticated isn't the right word, overcomplicated is probably more appropriate. Once the functionality bloats, it takes 'essays' to explain even the simplest usage of the system, and the support costs go through the roof. A very simple program that does one thing and strictly follows the conventions in an obvious way probably doesn't need any help or tutorials. Once you grow to do fancier tasks with lots of customization, then online help becomes a necessity. Open up the system to allow the users to adapt it to themselves and then you need lots of tutorials, courses and books. A whole industry springs forth from really complex software. Photoshop is the classic example, with a phenomenal secondary industry devoted to helping people utilize the internals by translating between the language of the programmers and the language of the users (such as "how do I remove red-eye from this photo?").
NOT CLEVER
Given all of the above problems with big code bases, programmers shouldn't rush out to compact their code into the tiniest possible pieces. Size is an issue, but readable code is more important. If you get too clever and really tightly pack your code it can be impossible to understand. Clever is bad. Very bad.
The obfuscated C code contest is a great example of really clever ways to dramatically alter or reduce the size of the code. While it is entertaining, in practice it is extremely dangerous. Clever code packs way too much complexity into too small a package. If we wrote things once and only once and then never touched them, that would be fine, but the lifespan of code is huge. In its life, there are always periods where people make fast judgment calls on the functioning of the code. Code needs a constant amount of never-ending fixing and updating. Stress is a part of software because the amount of work is always larger than the resources. Clever code just sets itself up to cause problems in the future. It is yet another land mine waiting to happen, as if it were just a bug of some sort.
If you get away with some clever trick of the language or some other weird way of getting your results, it will be easily missed by someone else. That code is dangerous, and definitely not elegant.
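As a small illustration of the difference, here is the same trivial rule written twice in Python. The example (the well-known FizzBuzz exercise) and both function names are mine, chosen only because the rule is familiar; the point is how much a reader has to know to trust the first version.

```python
def fizzbuzz_clever(i: int) -> str:
    # Relies on bool-times-string repetition and on "" being falsy,
    # so the whole rule fits on one line. Easy to misread under pressure.
    return "Fizz" * (i % 3 == 0) + "Buzz" * (i % 5 == 0) or str(i)


def fizzbuzz_readable(i: int) -> str:
    # Same behavior, one rule per line, nothing to decode.
    if i % 15 == 0:
        return "FizzBuzz"
    if i % 3 == 0:
        return "Fizz"
    if i % 5 == 0:
        return "Buzz"
    return str(i)


# Both versions agree on every input checked here.
assert all(fizzbuzz_clever(i) == fizzbuzz_readable(i) for i in range(1, 101))
```

The readable version is longer, but someone with a very light coding background can verify it at a glance, which is the property that matters when a fast judgment call has to be made years later.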
Code is complicated by its very nature. All problem spaces have huge amounts of complexity, but we really want to lay out each and every line of code in the simplest, most straight-forwardly reusable manner possible. That also includes not commenting too much, as well as not commenting too little. Taking away the readability of code is always asking for trouble. If you can give it to someone with a very light coding background and they get 'it' immediately then it is probably close enough to elegant. Programming students, for example, should be able to easily understand the nature and purpose of the code. You shouldn't need an extensive background to read good code, although you definitely need one to write it.
The formatting, names of the variables, comments, and all of the syntactic attributes are very important in producing something that is easy to read. In big languages such as Perl, showing a tremendous amount of discipline in 'not' using some of the more esoteric features of the language is the industrial strength way of coding. In Java, not following weak design pattern and bean conventions will produce things that are more readable and easily understood. Obscuring the underlying functioning of the code by focusing on its structure just makes it harder to understand. The magic comes from generalizing it, not from being clever.
Some of the 'movements' in programming over the years have been counter-productive because they get too obsessed about the 'right way' to realize that they are doing extreme damage to the 'readability' of their code base; a factor that is far more important than being right. If it is readable and consistent it can't be too far away from elegant.
LARGEST PROBLEM SPACE
When you are solving a problem, the range of possible solutions extends from pounding out each and every instruction using a 'brute-force' approach, to writing some extremely generalized configurable program to solve a huge number of similar problems. At the one end of this spectrum, using brute-force, there is a tremendous amount of work in typing and maintaining a very fragile and stiff code base. The solution is static and brittle, needing lots of fixes. While it is probably huge, each sub-section of it is very straight-forward as it is just a long sequence of instructions to follow, e.g. get this file, open it, read the contents, put them in this structure, add these numbers, do these manipulations, save them in this format, etc. Most computer languages provide some higher level of abstraction, so at least each instruction stands in for many lines of assembler, but it is still rigid and explicit.
Adding more degrees of freedom to the instructions, generalizing or making it dynamic, means that the resulting code can be used to solve more and more similar problems. The size of the problem space opens up, and with additional configuration information the usage of the code becomes huge.
As we shift the dynamic behavior away from the static lines of code, we have to provide some 'meta-data' in order for the generalized version of the code to work on a specific problem. In a very philosophical sense, the underlying details must always exist, no matter how general the solution. When we generalize however, we shift the details from being statically embedded into the code in a fragile manner, to being either implicit or explicitly held in the meta-data affecting the behavior.
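A tiny sketch of that shift, in Python. Everything here (the two export functions, the `EXPORT_FORMATS` table, the field names) is a made-up example rather than code from any real system. The brute-force versions freeze the details into the instructions; the generalized version keeps one routine and moves the very same details into meta-data.

```python
# Brute force: field order and separator are embedded in each function.
def export_sales(record: dict) -> str:
    return str(record["region"]) + "," + str(record["total"])


def export_payroll(record: dict) -> str:
    return str(record["employee"]) + ";" + str(record["salary"])


# Generalized: the details still exist, but now they live in meta-data.
EXPORT_FORMATS = {
    "sales": {"fields": ["region", "total"], "separator": ","},
    "payroll": {"fields": ["employee", "salary"], "separator": ";"},
}


def export(kind: str, record: dict) -> str:
    """One routine for every format described in EXPORT_FORMATS."""
    fmt = EXPORT_FORMATS[kind]
    return fmt["separator"].join(str(record[name]) for name in fmt["fields"])


sale = {"region": "north", "total": 42}
assert export("sales", sale) == export_sales(sale)
```

Supporting a third format is now a new entry in the table, not a new function, which is where the savings in code size come from.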
Like energy, the primary problem domain details* can neither be created nor destroyed. You just shift them around. They just get shifted from being explicitly embedded into the code to existing somewhere else, either implicitly or in the configuration data.
*We can and do create massive amounts of artificial complexity, which creates artificial details, which can be refactored out of the code. If you can delete details without altering the functionality, then it was clearly artificial.
Way, way off at the very end of the spectrum, one might imagine some very complicated all-purpose general piece of code that can do everything but, funnily enough, that 'code' exists and is the computer itself. It is the ultimate general solution.
Building and maintaining a system is a huge amount of work that is often underestimated. Usually wherever a specific software based tool can be used to solve a problem, there is an abundance of similar problems that also need to be solved. Programmers love to just bite off a piece and chew on that, ignoring the whole space, but it is far more effective to solve a collection of problems all at once. All it takes is the ability to step back and look at the specific problems in their larger context.
Which of the business rules seem to bend, and how many other problems in the company have the same feel to them? The larger the problem space, the cheaper the solution. If you have ten departments in a company that all need customized phone books, solving each one by itself is considerably more work than solving them all together. If you have one group that needs approvals for their documents, that problem spans a huge number of different groups. They will all benefit by a common solution.
Generalizing makes the solution cheaper, and reduces the overall work. Also, strangely enough, generalized code is always smaller than brute force. That means that the long term costs of maintaining a code base are cheaper as well. The size issue on its own is significant enough that in a long run perspective it is always worth generalizing to bring down the size of the code, even if there are no available similar problems. Generalizing can reduce the code base enough to get it to fit in the available development window. The technique can be applied as a measure to control the costs of the project and build more efficiently.
It also helps in maintaining consistency. If there is one routine that is responsible for rendering multiple screens in a system, then by virtue of its usage it enforces consistency.
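A minimal sketch of what I mean, using a text-only 'screen' to keep it short. The `render_screen` function and the sample screens are hypothetical; the point is only that when every screen goes through the same routine, the header style and field layout cannot drift apart.

```python
def render_screen(title: str, fields: dict) -> str:
    """Hypothetical shared renderer: same header and field layout for every screen."""
    width = max((len(name) for name in fields), default=0)
    lines = [f"== {title.upper()} =="]
    for name, value in fields.items():
        # Pad the field names so the colons line up on every screen.
        lines.append(f"{name.ljust(width)} : {value}")
    return "\n".join(lines)


print(render_screen("Login", {"user": "amy", "role": "admin"}))
print(render_screen("Invoice", {"number": 1017, "total": "42.00"}))
```

Change the header style once in `render_screen` and every screen picks it up, instead of someone having to remember to fix each one by hand.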
TIME CONSTRAINTS
Everybody underestimates the amount of time it takes to build software. They also underestimate the amount of effort it takes to keep it running. Like an iceberg, only a small portion of the code is actually visible to the users so they see it as a rather small malleable thing that can be easily changed. Any significant set of functionality gets locked in by its own size and complexity.
If you make frequent quick changes to your code base you'll quickly find out that you are destabilizing it. The more fast hacks you add, the worse the code becomes. There is always an expensive trade-off to be made between getting it done quickly and getting it done correctly. Many projects make the wrong choices, although the damage often takes years before it fully shows. Getting out one good release is easy, getting one out time and time again is difficult.
Without a doubt, time is the biggest problem encountered in software. There is never enough. That means that any 'technique' that helps reduce the time taken to do some work is probably good if it helps both in the short run and the long run. Any technique that adds extra work is probably bad. We should always keep in mind that sometimes you need to add a little extra effort in the short run to save time in the long run.
For instance, keeping the code clean and neat makes it easy to add stuff later. A well-maintained project takes more time, but is worth it. Sloppy is the friend of complexity.
Optimization is always about doing something extra now that results in a gain later because the results get re-used over and over. Clean code saves time in understanding and modifying it. A little bit of extra work and discipline pays off. Design saves time from being lost later. With a solid design you can go straight to the code you need; you don't have to waste a lot of time wandering around trying to guess at what might work.
Time is an all important quantity in software development. Wasting it is a bad idea. Some of the newer programming development techniques seek to make 'games' out of programming. It is an immature way of trying to take some of the natural tediousness out of coding. We want to build things and get to the end as fast as possible. Playing around is avoiding the issue.
If you want to build great tools, at times it will be painful; it comes with the territory. No job is 100% entertaining, they all have their down-sides, that is why it is called work and we have to be paid to do it. There is nothing wrong with hobbyist programmers playing games and competing, but it is not appropriate for the work place.
Testing is one area where people waste massive amounts of time without getting any additional benefit. Again, some of the newer testing techniques add extra work, but in exchange they claim to reduce the number of bugs. If that works it is great, but you have to be sceptical of most claims. Testing components thoroughly is good, but if you cannot assure the interaction of the components with each other, then there is always some minimal level of final testing that still needs to be done.
That final testing is immutable, and as such cannot be optimized away. If you must test the final product as a whole coherent piece, then you cannot skip those tests no matter how much work you have done on the sub-components. If you are not skipping the tests, then you are not saving any time. If you down-grade the final tests to be less, then you upgrade the risks of allowing problems through. Of course this is true as a basic principle: in-stream testing in any process is there to reduce the amount of bouncing around between states, but it does significantly bump up the amount of testing work without increasing the quality. If the bouncing between states is not significant, then reducing it doesn't add much extra value. It doesn't negate any of the final testing, which still needs to be done.
In the same way that one cannot solve the halting problem, the amount of testing required to achieve 100% certainty that there are no bugs is 'infinite'. It is an easy proof, once you accept the possibility that there is some sequence of input that causes the software to get into a state that can break it. Given that all non-trivial software maintains some form of internal state, to achieve absolute certainty that there are no bugs you have to test every combination of possible input. Since there is no limit to the size of the input, the number of possible test scenarios is infinite, and it would take forever to create and apply them. Given our restrictions on time, unless we do our development on the edge of a black hole, there will always be some possibility of bugs with any release.
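To get a feel for how quickly the input space grows even when the input is capped, here is a small Python sketch. It simply counts the distinct input sequences up to a given length; the function name and the choice of 256 possible values per input step (one byte) are my assumptions for illustration.

```python
def count_input_sequences(alphabet_size: int, max_length: int) -> int:
    """Number of distinct input sequences of length 0 through max_length."""
    return sum(alphabet_size ** length for length in range(max_length + 1))


# Inputs of at most 2 bytes already give 1 + 256 + 65536 cases.
assert count_input_sequences(256, 2) == 65793

for max_length in (1, 2, 4, 8, 16):
    print(max_length, count_input_sequences(256, max_length))
```

The count grows exponentially with the length, and since nothing bounds the length in general, no finite test budget ever covers it all.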
You may put in significant effort for a series of releases to really produce stellar quality, but you always need to remember that software development projects never really end. Sooner or later a bug is going to escape. In practice it is always sooner, and it always happens way more than most programmers anticipate. Although one person called it 'defeatist', it is a really good idea to plan for sending out patches to fix the code. Just assume there will be bugs and deal with it. If you build this into distribution and deployment, then when it is necessary it is just part of the system, otherwise the 'crisis' will eat up huge amounts of time.
Unexpected, 'expected' events cause delays, morale issues and scheduling conflicts. When we continually see the same problems happen over and over again, we need to accept them as part of the process, rather than try to ignore them or rail against how unfair they are. If we anticipate problems with the deployment of the software we can build in the mechanisms to make dealing with the problem easier. Taking a broad approach that brings development, testing and operations into the problem domain is the best way to build practical solutions that can withstand significant shifts in their environments.
The time issues never go away and ignoring them only makes it worse and more disruptive.
BROAD STROKES AND DETAILS
There are lots of arguments between strongly typed and loosely typed languages. Different camps of programmers feel that one or the other is the perfect approach to the problem. After lots of experience on both sides, I've come to realize that some problems are best handled with strongly typed approaches -- while others are best handled with very loose ones. The difference comes down to whether or not you need to be pedantic about the details.
For some types of solutions, the underlying correctness of the data is the key to making it work. In those cases you want to build up internal structures while programming that are very specific, and during the process you want to ensure that the correct structure is actually built.
If you're writing some code to write out a specific file format, for example, you'd like the internal structure to mirror the file format. In that way, depending on the format, the structure and syntax are highly restricted. As the various calls go about in the system building up the structure, there is also code to make sure that the structure stays consistent and correct. With many types of errors, the sooner the program stops after the first error, the easier it is to diagnose. When running in a development environment, a program that is strongly typed and checking its data can stop immediately the moment it deviates from the prescribed plan. That makes it really easy to find and fix the problems, which also helps to minimize testing.
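Here is a minimal sketch of that idea in Python, using a dataclass that mirrors one record of an invented file format. The `HeaderRecord` layout (a 4-character tag, a single-digit version, a 6-digit count) is purely hypothetical; what matters is that a bad value stops the program at the moment the structure is built, not later when the file is read back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeaderRecord:
    """Hypothetical fixed-layout header: 4-char tag, version 1-9, 6-digit count."""
    tag: str
    version: int
    record_count: int

    def __post_init__(self) -> None:
        # Stop immediately on the first deviation from the format.
        if not isinstance(self.tag, str) or len(self.tag) != 4:
            raise ValueError(f"tag must be exactly 4 characters, got {self.tag!r}")
        if not isinstance(self.version, int) or not 1 <= self.version <= 9:
            raise ValueError(f"version must be an int from 1 to 9, got {self.version!r}")
        if not isinstance(self.record_count, int) or not 0 <= self.record_count <= 999_999:
            raise ValueError(f"record_count must fit in 6 digits, got {self.record_count!r}")

    def to_line(self) -> str:
        """Serialize in the same shape the structure was validated against."""
        return f"{self.tag}{self.version}{self.record_count:06d}"


print(HeaderRecord("DATA", 1, 42).to_line())  # DATA1000042
```

Constructing `HeaderRecord("TOOLONG", 1, 42)` raises a `ValueError` right at the call site, which is exactly where you want to be standing when you start debugging.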
For some solutions, if you are performing a large number of steps and you want the program to be as tolerant as possible, then loosely typed is way better. It doesn't matter for example, how the data got into the program, instead it matters how it will be transformed. High-level programs, scripting, data filters, and any other program that needs to deal with a wide range of unexpected inputs fit well into this type of circumstance. The range of the data is massive, but the transformations are pretty trivial. If the program stopped each and every time the input was an unknown combination, the code would be so fragile that it would be useless. Loosely typing data under this circumstance means that the importance is on the set of instructions, they need to be completed, the data is only secondary. Scripting in particular requires this.
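The opposite stance looks something like the following sketch of a tolerant data filter. The function name and the sample inputs are invented for illustration; the filter takes whatever it is handed, uses what it can, and keeps going.

```python
def total_amounts(values):
    """Sum whatever looks like a number; count and skip everything else."""
    total = 0.0
    skipped = 0
    for value in values:
        # Accept strings, numbers, stray whitespace and thousands separators.
        text = str(value).strip().replace(",", "")
        try:
            total += float(text)
        except ValueError:
            skipped += 1  # unknown input: note it and carry on
    return total, skipped


print(total_amounts(["10", " 2.5 ", "1,000", "n/a", "", 7]))  # (1019.5, 2)
```

Here the sequence of instructions always completes, and the odd inputs are merely counted. For a script or a filter that is usually the right trade, whereas for the file-format writer it would be a disaster.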
This dichotomy is true for all software. For some goals the data is the most important thing and it needs to be structured correctly. For some goals, it is the sequence of instructions and the final output that is significant. It doesn't matter how it got there.
So for example, we can use a typed language like Java to perform the algorithmic combinations, but a basically untyped tool like ant to ensure that the code was built and deployed correctly. That the shell scripting languages in Unix are mostly loosely typed is no accident. The two approaches are needed for two very different problems.
It is also true that within all systems, the sequence of instructions is more important at the higher level, and the structure of the data is more important at the low level. The nice part is realizing that the depth of the code makes a big difference. If you are building a complex system, the broad strokes of the code should be loosely typed, since they are more flexible and less rigid that way, while the detailed calculations should be strongly typed because the accuracy of the data is often the key. Loose typing at the higher level also helps in decoupling the architecture and splitting off the presentation from the underlying data. All these things work together at the different levels within any system.
Clearly any new language that covers the whole domain of building and deploying complex systems will cover the whole spectrum and be both strongly and loosely typed. Working both into the same syntax will be one of those key things that will bump us forward in complexity handling.
ELEGANCE
Clean and consistent code is a great secret in getting things launched, but it is exceptionally difficult to get a group of programmers to follow some underlying conventions. There are cultural issues that make it nearly impossible for big teams to synchronize their working practices. That could be why the initial version of most commercial software is built by little teams and later handed off to big teams for maintenance. We still don't know how to coordinate our efforts correctly on a large scale. Our most common official methodologies are horrible.
Elegance just for the sake of elegance is never a great idea. Elegance because it makes your job easier and it means that the software works better is a great idea. It is even better when it takes away a lot of the stresses associated with messy programming habits. It becomes a means to an end.
We build and maintain tools to help our users solve their problems, which are most generally in playing with their ever-increasing piles of data. The most important thing is that the tools we build actually solve the problems for the users. Everybody getting their input into the design, programmers having fun while coding, and processes being left wide open and 'casual' may amuse some people while they are working, but they do not get the basic tasks completed any faster. Spending time to understand the user's real needs, keeping the code clean and consistent, using real graphic design for the interface, and providing simple tools that are easy to understand and effective are some of the things that make the users come to appreciate (or not) the underlying software. Computers can make people's lives easier or they can make them harder, the difference is up to the abilities of the programmers involved with the code.
In the end, a short simple solution that works simply and consistently is as good as it gets. Bad, overly complicated bloated software with every function imaginable isn't good, and it isn't much of an accomplishment. All the fancy graphics, dancing baloney and crammed in information can't hide a badly written program. It's not the technology, it's what you do with it that matters.
In software development there are many complex issues surrounding our computer language 'code' and the various conventions we use for programming. There are lots of different attributes that a program can have beyond just working, but ultimately there are only a few things that easily define a great program. With that in mind, I find the following summary quite reasonable:
The secret to great code is to get the smallest program possible without resorting to being clever, while generalizing to as large a problem space as possible given the time constraints. Make all of the broad strokes explicit, while making all of the details implicit. Get this right, keep it clean and consistent, and you've got elegance.
It is simple, as it should be, but still the underlying parts require deeper explanation.
THE SMALLEST CODE BASE
Software developers don't like to admit it, but the amount of the code in a system is a hugely significant issue for all sorts of reasons.
Programming is a massive amount of work. If you look at the various parts of development: analysis, implementation, testing and deployment, the second one -- implementation, which is the act of writing the code to implement the solution -- doesn't appear all that large in the overall process. It is only one of four basic steps. However, despite appearances, the size of the code drives issues with all of the other work, including the analysis. The code is the anchor for everything else; bigger code means more work and bigger problems.
The problem starts with having to convert your understanding of the user's need into some set of interfaces. Complicated problems produce complicated interfaces which feed on themselves. It is far more work to add a new function to Microsoft Word, for example, than it is to add it to some really simple interface. The analysis doesn't just cover the problem space, it also includes how to adapt the proposed solution to a specific tool. Adding a function to a small tool is way less work than adding it into some huge existing monolith. Because of this, the analysis changes depending on the target infrastructure. More code means more work integrating any new functionality.
The size of the code itself causes its own problems. If your intended solution requires 150,000 lines of code you'd be looking at a small team of programmers for over a year. If your intended solution requires 1,000,000 lines of code it will take a huge team many years to build it. Once it is built, if there is a problem with the 150,000 lines of code, refactoring chunks of it is significant, but not dangerous to the project. With 1,000,000 lines you are committed to your code base for good or bad. Like the Titanic, it is slow to turn, so if there are any obstacles along the way, such as icebergs, you are in serious trouble. Why use 1,000,000 lines if 150,000 will do?
With each line of code, you commit to something you've got to maintain throughout history. It requires work to keep it updated, it even requires work to delete it. The more code there is, the more work that is required to just figure out the available options. Big code bases are unwieldy beasts that are expensive and difficult to handle. Developers often try to get away with just bolting a bit more code onto the side, but that rather obvious tactic is limited and ugly.
In all code there are always latent unwanted behaviors (bugs), and that includes mature code that has been running in production for years. These 'problems' sit there like little land mines waiting for unsuspecting people to pass by and trigger some functionality. Testing is the act of mine-sweeping, where the testers spend as much time as they have, trying to detect as many of these problems as possible. Most current test processes are horribly inefficient; they end up spending way too much time on the easy issues and way too little time on the hard ones. Once the time is up, the code is released in whatever state. So often you'll find that code works well for its simple common usages, but becomes increasingly unstable the more you try to push its envelope. Not surprisingly, that matches the testing.
Testing follows some type of square law, e.g. it likely takes four times as much effort to test twice as much code. By committing to a bigger code base you are committing to a huge increase in the amount of testing, or as is more often the case, you are actually diminishing your existing testing by a significant degree. So often, the code base doubles but the testing resources remain the same, only now they are 25% as effective.
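The arithmetic behind that 25% figure can be sketched in a few lines. This is only an illustration, assuming the rule of thumb above (twice the code, four times the effort) holds as an exact square law, which is a guess rather than a measurement. The function names and the constant k are hypothetical.

```python
def effort_to_test(lines_of_code, k=1.0):
    """Effort needed to test a code base, ASSUMING effort grows with the square of its size."""
    return k * lines_of_code ** 2


def relative_effectiveness(old_size, new_size):
    """Fraction of the needed testing that a fixed budget still covers after the code grows."""
    return effort_to_test(old_size) / effort_to_test(new_size)


if __name__ == "__main__":
    # Doubling the code quadruples the effort required to test it.
    print(effort_to_test(300_000) / effort_to_test(150_000))
    # With the same testing resources, the coverage drops to a quarter.
    print(relative_effectiveness(150_000, 300_000))
```

Under a different growth assumption the numbers change, but the direction does not: unless the testing budget grows faster than the code, the testing gets thinner.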
With the increase in testing requirements being ignored, most big software packages get more operational problems and issues. For all software, there is a considerable support cost even if it is just an installer, an operator, a simple help desk and some beepers for the programmers. For large commercial projects support can include an entire division of the company.
Running code is hugely expensive, but most programmers don't seem to understand this. They just hand off their work and don't think about the consequences. There are the fixed costs, but also those driven by behavioral problems. A particularly bad bug in a commercial setting could cost millions of dollars and really hurt the reputation of a company. Even in an in-house setting, the code going down could delay other significant events, costing money or causing bad publicity. The more useful the code, the more expensive the failures.
Bigger code 'will' fail more often. From experience it is clear that with twice as much code come twice as many bugs. Programmers have a nearly consistent rate of adding bugs to their code that is mostly independent of the actual programmer or testing. Better programmers often have fewer bugs per line of code, but they still have them, and because they tend to work on the more complicated sections of the system or write more code, it is not uncommon for their bug count to be higher. It stands to reason that if there is twice as much code, there are at least twice as many bugs, so the odds of hitting a bug are twice as high.
Finally, bigger code bases also mean bigger algorithms, and a lot more of them. The more complex code is harder to use and has more bugs, but it also means more documentation work. Usually the 'rules' for any of the functional behavior of the system start to become really 'sophisticated'. Well, sophisticated isn't the right word, overcomplicated is probably more appropriate. Once the functionality bloats, it takes 'essays' to explain even the simplest usage of the system, and the support costs go through the roof. A very simple program that does one thing and strictly follows the conventions in an obvious way probably doesn't need any help or tutorials. Once you grow to do fancier tasks with lots of customization, then online help becomes a necessity. Open up the system to allow the users to adapt it to themselves and you need lots of tutorials, courses and books. A whole industry springs forth from really complex software. Photoshop is the classic example, with a phenomenal secondary industry devoted to helping people utilize the internals by translating between the language of the programmers and the language of the users (such as "how do I remove red-eye from this photo?").
NOT CLEVER
Given all of the above problems with big code bases, programmers shouldn't rush out to compact their code into the tiniest possible pieces. Size is an issue, but readable code is more important. If you get too clever and really tightly pack your code it can be impossible to understand. Clever is bad. Very bad.
The obfuscated C code contest is a great example of really clever ways to dramatically alter or reduce the size of the code. While it is entertaining, in practice it is extremely dangerous. Clever code packs way too much complexity into too small a package. If we wrote things once and only once then never touched them, that would be fine, but the lifespan of code is huge. In its life, there are always periods where people make fast judgment calls on the functioning of the code. Code needs a constant amount of never-ending fixing and updating. Stress is a part of software because the amount of work is always larger than the resources. Clever code just sets itself up to cause problems in the future. It is yet another land mine waiting to happen, as if it were just a bug of some sort.
If you get away with some clever trick of the language or some other weird way of getting your results, it will be easily missed by someone else. That code is dangerous, and definitely not elegant.
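As a small, made-up illustration of the difference, here are two Python functions that compute the same thing: the sum of the squares of the even numbers in a list. Both names are hypothetical. The first leans on a bit trick that is easy to misread in a hurry, the second spells out its intent.

```python
def total_clever(numbers):
    # '~x & 1' is 1 for even x and 0 for odd x, so odd squares are multiplied away.
    return sum(x * x * (~x & 1) for x in numbers)


def total_plain(numbers):
    # Add up the squares of the even numbers, one obvious step at a time.
    total = 0
    for x in numbers:
        if x % 2 == 0:
            total += x * x
    return total


if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    print(total_clever(sample), total_plain(sample))
```

The clever version is shorter, but the reader has to decode it before they can trust it. The plain version costs a few more lines and can be checked at a glance.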
Code is complicated by its very nature. All problem spaces have huge amounts of complexity, but we really want to lay out each and every line of code in the simplest, most straight-forwardly reusable manner possible. That also includes not commenting too much, as well as not commenting too little. Taking away the readability of code is always asking for trouble. If you can give it to someone with a very light coding background and they get 'it' immediately then it is probably close enough to elegant. Programming students, for example, should be able to easily understand the nature and purpose of the code. You shouldn't need an extensive background to read good code, although you definitely need one to write it.
The formatting, names of the variables, comments, and all of the syntactic attributes are very important in producing something that is easy to read. In big languages such as Perl, showing a tremendous amount of discipline in 'not' using some of the more esoteric features of the language is the industrial strength way of coding. In Java, not following weak design pattern and bean conventions will produce things that are more readable and easily understood. Obscuring the underlying functioning of the code by focusing on its structure just makes it harder to understand. The magic comes from generalizing it, not from being clever.
Some of the 'movements' in programming over the years have been counter-productive because they get too obsessed about the 'right way' to realize that they are doing extreme damage to the 'readability' of their code base; a factor that is far more important than being right. If it is readable and consistent it can't be too far away from elegant.
LARGEST PROBLEM SPACE
When you are solving a problem, the range of possible solutions extends from pounding out each and every instruction using a 'brute-force' approach, to writing some extremely generalized configurable program to solve a huge number of similar problems. At the one end of this spectrum, using brute-force, there is a tremendous amount of work in typing and maintaining a very fragile and stiff code base. The solution is static and brittle, needing lots of fixes. While it is probably huge, each sub-section of it is very straight-forward as it is just a long sequence of instructions to follow, e.g. get this file, open it, read the contents, put them in this structure, add these numbers, do these manipulations, save them in this format, etc. Most computer languages provide some higher level of abstraction, so at least each instruction is millions of lines of assembler, but it is still rigid and explicit.
Adding more degrees of freedom to the instructions, generalizing or making it dynamic, means that the resulting code can be used to solve more and more similar problems. The size of the problem space opens up, and with additional configuration information the usage of the code becomes huge.
As we shift the dynamic behavior away from the static lines of code, we have to provide some 'meta-data' in order for the generalized version of the code to work on a specific problem. In a very philosophical sense, the underlying details must always exist, no matter how general the solution. When we generalize, however, we shift the details from being statically embedded into the code in a fragile manner to being either implicit or explicitly held in the meta-data affecting the behavior.
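A minimal sketch of that shift, using an invented example: two brute-force routines that each total a column of a delimited report, and one generalized routine that takes the same details from a small table of meta-data. The report kinds, column positions and function names are all hypothetical.

```python
import csv
import io

# Brute force: the delimiter and column are embedded statically in each routine.
def total_sales_report(text):
    total = 0.0
    for row in csv.reader(io.StringIO(text), delimiter=","):
        total += float(row[2])
    return total


def total_payroll_report(text):
    total = 0.0
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        total += float(row[1])
    return total


# Generalized: one routine, with the same details moved out into meta-data.
REPORTS = {
    "sales": {"delimiter": ",", "amount_column": 2},
    "payroll": {"delimiter": ";", "amount_column": 1},
}


def total_report(kind, text):
    spec = REPORTS[kind]
    reader = csv.reader(io.StringIO(text), delimiter=spec["delimiter"])
    return sum(float(row[spec["amount_column"]]) for row in reader)


if __name__ == "__main__":
    print(total_report("sales", "north,widgets,10.5\nsouth,gears,4.5\n"))
    print(total_report("payroll", "ann;2.0\nbob;3.0\n"))
```

The details have not gone anywhere. They now sit in REPORTS, so supporting another report of the same shape means adding one entry rather than another routine.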
Like energy, the primary problem domain details* can neither be created nor destroyed. You just shift them around. They just get shifted from being explicitly embedded into the code to existing somewhere else, either implicitly or in the configuration data.
*We can and do create massive amounts of artificial complexity, which creates artificial details, which can be refactored out of the code. If you can delete details without altering the functionality, then it was clearly artificial.
Way way off to the very end of the spectrum, one might imagine some very complicated all-purpose general piece of code that can do everything, but funnily enough, that 'code' exists and is the computer itself. It is the ultimate general solution.
Building and maintaining a system is a huge amount of work that is often underestimated. Usually wherever a specific software based tool can be used to solve a problem, there is an abundance of similar problems that also need to be solved. Programmers love to just bite off a piece and chew on that, ignoring the whole space, but it is far more effective to solve a collection of problems all at once. All it takes is the ability to step back and look at the specific problems in their larger context.
Which of the business rules seem to bend, and how many other problems in the company have the same feel to them? The larger the problem space, the cheaper the solution. If you have ten departments in a company that all need customized phone books, solving each one by itself is considerably more work than solving them all together. If you have one group that needs approvals for their documents, that problem spans a huge number of different groups. They will all benefit by a common solution.
Generalizing makes the solution cheaper, and reduces the overall work. Also, strangely enough, generalized code is always smaller than brute force. That means that the long term costs of maintaining a code base are cheaper as well. The size issue on its own is significant enough that in a long run perspective it is always worth generalizing to bring down the size of the code, even if there are no available similar problems. Generalizing can reduce the code base enough to get it to fit in the available development window. The technique can be applied as a measure to control the costs of the project and build more efficiently.
It also helps in maintaining consistency. If there is one routine that is responsible for rendering multiple screens in a system, then by virtue of its usage it enforces consistency.
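As a toy illustration, assuming a text-only interface, a single hypothetical render_screen routine might lay out every screen in the system. Because each screen passes through the same code, the title style and the alignment of the labels cannot drift apart.

```python
def render_screen(title, fields):
    """Render any screen as text; every screen shares this one layout."""
    width = max(len(label) for label, _ in fields)
    lines = [title.upper(), "=" * len(title)]
    for label, value in fields:
        # Pad the labels so the values line up the same way on every screen.
        lines.append(f"{label.ljust(width)} : {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_screen("Customer", [("Name", "Ada"), ("City", "London")]))
    print()
    print(render_screen("Invoice", [("Number", "1001"), ("Total", "42.00")]))
```

A change to the layout is made once and shows up everywhere, which is the consistency coming for free.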
TIME CONSTRAINTS
Everybody underestimates the amount of time it takes to build software. They also underestimate the amount of effort it takes to keep it running. Like an iceberg, only a small portion of the code is actually visible to the users so they see it as a rather small malleable thing that can be easily changed. Any significant set of functionality gets locked in by its own size and complexity.
If you make frequent quick changes to your code base you'll quickly find out that you are destabilizing it. The more fast hacks you add, the worse the code becomes. There is always an expensive trade-off to be made between getting it done quickly and getting it done correctly. Many projects make the wrong choices, although the damage often takes years before it fully shows. Getting out one good release is easy, getting one out time and time again is difficult.
Without a doubt, time is the biggest problem encountered in software. There is never enough. That means that any 'technique' that helps reduce the time taken to do some work is probably good if it helps both in the short run and the long run. Any technique that adds extra work is probably bad. We should always keep in mind that sometimes you need to add a little extra effort in the short run to save time in the long run.
For instance, keeping the code clean and neat makes it easy to add stuff later. A well-maintained project takes more time, but is worth it. Sloppy is the friend of complexity.
Optimization is always about doing something extra now that results in a gain later, because the results get re-used over and over. Clean code saves time in understanding and modifying it. A little bit of extra work and discipline pays off. Design saves time from being lost later. With a solid design you can fit right into the code you need; you don't have to waste a lot of time wandering around trying to guess at what might work.
Time is an all important quantity in software development. Wasting it is a bad idea. Some of the newer programming development techniques seek to make 'games' out of programming. It is an immature way of trying to remove some of the natural tediousness from coding. We want to build things and get to the end as fast as possible. Playing around is avoiding the issue.
If you want to build great tools, at times it will be painful; it comes with the territory. No job is 100% entertaining, they all have their down-sides, and that is why it is called work and we have to be paid to do it. There is nothing wrong with hobbyist programmers playing games and competing, but it is not appropriate for the work place.
Testing is one area where people waste massive amounts of time without getting any additional benefit. Again, some of the newer testing techniques add extra work, but in exchange they claim to reduce the number of bugs. If that works it is great, but you have to be sceptical of most claims. Testing components thoroughly is good, but if you cannot assure the interaction of the components with each other, then there is always some minimal level of final testing that still needs to be done.
That final testing is immutable, and as such it cannot be optimized away. If you must test the final product as a whole coherent piece, then you cannot skip those tests no matter how much work you have done on the sub-components. If you are not skipping the tests, then you are not saving any time. If you down-grade the final tests to be less, then you upgrade the risks of allowing problems through. Of course this is true as a basic principle: in-stream testing in any process is there to reduce the amount of bouncing around between states, but it does significantly bump up the amount of testing work without increasing the quality. If the bouncing between states is not significant, then reducing it doesn't add much extra value. It doesn't negate any of the final testing, which still needs to be done.
In the same way that one cannot solve the halting problem, the amount of testing required to achieve 100% certainty that there are no bugs is 'infinite'. It is an easy proof, once you accept the possibility that there is some sequence of input that causes the software to get into a state that can break it. Given that all non-trivial software maintains some form of internal state, to achieve absolute certainty that there are no bugs you have to test every combination of possible input. Since there is no limit to the size of the input, the number of possible test scenarios is infinite, and it would take forever to create and apply them. Given our restrictions on time, unless we do our development on the edge of a black hole, there will always be some possibility of bugs with any release.
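To get a feel for how quickly the input space runs away, here is a small counting sketch. It assumes the input is just a sequence of symbols drawn from a fixed alphabet (bytes, for instance) and counts the distinct sequences up to a given length. The function name is hypothetical.

```python
def count_input_sequences(alphabet_size, max_length):
    """Number of distinct input sequences of length 1 up to max_length."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


if __name__ == "__main__":
    # Every extra byte of allowed input multiplies the test scenarios by roughly 256.
    for length in (1, 2, 4, 8):
        print(length, count_input_sequences(256, length))
```

The count is already far beyond any test budget for inputs of only a few bytes, and since nothing caps the length, it never stops growing.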
You may put in significant effort for a series of releases to really produce stellar quality, but you always need to remember that software development projects never really end. Sooner or later a bug is going to escape. In practice it is always sooner, and it always happens way more than most programmers anticipate. Although one person called it 'defeatist', it is a really good idea to plan for sending out patches to fix the code. Just assume there will be bugs and deal with it. If you build this into distribution and deployment, then when it is necessary it is just part of the system, otherwise the 'crisis' will eat up huge amounts of time.
Unexpected 'expected' events cause delays, morale issues and scheduling conflicts. When we continually see the same problems happen over and over again, we need to accept them as part of the process, rather than try to ignore them or rail against how unfair they are. If we anticipate problems with the deployment of the software we can build in the mechanisms to make dealing with them easier. Taking a broad approach that includes development, testing and operations in the problem domain is the best way to build practical solutions that can withstand significant shifts in their environments.
The time issues never go away and ignoring them only makes it worse and more disruptive.
BROAD STROKES AND DETAILS
There are lots of arguments between strongly typed and loosely typed languages. Different camps of programmers feel that one or the other is the perfect approach to the problem. After lots of experience on both sides, I've come to realize that some problems are best handled with strongly typed approaches -- while others are best handled with very loose ones. The difference comes down to whether or not you need to be pedantic about the details.
For some types of solutions, the underlying correctness of the data is the key to making it work. In those cases you want to build up internal structures while programming that are very specific, and during the process you want to ensure that the correct structure is actually built.
If you're writing some code to write out a specific file format, for example, you'd like the internal structure to mirror the file format. In that way, depending on the format, the structure and syntax are highly restricted. As the various calls go about in the system building up the structure, there is also code to make sure that the structure stays consistent and correct. With many types of errors, the sooner the program stops after the first error, the easier it is to diagnose. When running in a development environment, a program that is strongly typed and checking its data can stop immediately the moment it deviates from the prescribed plan. That makes it really easy to find and fix the problems, which also helps to minimize testing.
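A minimal sketch of the strict style, assuming an invented file format in which each line is a 4-character code followed by an 8-digit, zero-padded count. The Record class and its fields are hypothetical. The structure mirrors the format and refuses to exist in an invalid state, so the program stops at the first bad value instead of writing a corrupt file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One line of the hypothetical format: a 4-character code and a count."""
    code: str
    count: int

    def __post_init__(self):
        # Check the structure the moment it is built, not when it is written out.
        if not isinstance(self.code, str) or len(self.code) != 4:
            raise ValueError(f"code must be exactly 4 characters: {self.code!r}")
        if isinstance(self.count, bool) or not isinstance(self.count, int):
            raise ValueError(f"count must be an integer: {self.count!r}")
        if not 0 <= self.count <= 99_999_999:
            raise ValueError(f"count must fit in 8 digits: {self.count!r}")

    def to_line(self):
        return f"{self.code}{self.count:08d}"


if __name__ == "__main__":
    print(Record("ABCD", 42).to_line())
    try:
        Record("AB", 1)
    except ValueError as error:
        print("stopped early:", error)
```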
For some solutions, if you are performing a large number of steps and you want the program to be as tolerant as possible, then loosely typed is way better. It doesn't matter for example, how the data got into the program, instead it matters how it will be transformed. High-level programs, scripting, data filters, and any other program that needs to deal with a wide range of unexpected inputs fit well into this type of circumstance. The range of the data is massive, but the transformations are pretty trivial. If the program stopped each and every time the input was an unknown combination, the code would be so fragile that it would be useless. Loosely typing data under this circumstance means that the importance is on the set of instructions, they need to be completed, the data is only secondary. Scripting in particular requires this.
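A sketch of the tolerant style, for contrast. This hypothetical filter totals the last field of each input line and does not care where the lines came from or what else is on them. Anything it cannot interpret is counted and skipped rather than treated as a reason to stop.

```python
def total_amounts(lines):
    """Sum whatever looks like a number in the last field; skip everything else."""
    total = 0.0
    skipped = 0
    for line in lines:
        fields = line.split()
        try:
            total += float(fields[-1])
        except (IndexError, ValueError):
            # Blank line or unknown input: note it and keep going.
            skipped += 1
    return total, skipped


if __name__ == "__main__":
    messy_input = ["rent 1200", "", "coffee 3.5", "n/a", "misc 7 "]
    print(total_amounts(messy_input))
```

The important part is the sequence of steps being completed. The data only has to be good enough to carry each step through.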
This dichotomy is true for all software. For some goals the data is the most important thing and it needs to be structured correctly. For some goals, it is the sequence of instructions and the final output that is significant. It doesn't matter how it got there.
So for example, we can use a typed language like Java to perform the algorithmic combinations, but a basically untyped tool like Ant to ensure that the code was built and deployed correctly. That the shell scripting languages in Unix are mostly loosely typed is no accident. The two approaches are needed for two very different problems.
It is also true that within all systems, the sequence of instructions matters more at the higher level, and the structure of the data matters more at the low level. The nice part is realizing that the depth of the code makes a big difference. If you are building a complex system, the broad strokes of the code should be loosely typed, since they are more flexible and less rigid that way, while the detailed calculations should be strongly typed because the accuracy of the data is often the key. Loose typing at the higher level also helps in decoupling the architecture and splitting off the presentation from the underlying data. All these things work together at the different levels within any system.
Clearly any new language that covers the whole domain of building and deploying complex systems will cover the whole spectrum and be both strongly and loosely typed. Working both into the same syntax will be one of those key things that will bump us forward in complexity handling.
ELEGANCE
Clean and consistent code is a great secret in getting things launched, but it is exceptionally difficult to get a group of programmers to follow some underlying conventions. There are cultural issues that make it nearly impossible for big teams to synchronize their working practices. That could be why the initial version of most commercial software is built by little teams and later handed off to big teams for maintenance. We still don't know how to coordinate our efforts correctly on a large scale. Our most common official methodologies are horrible.
Elegance just for the sake of elegance is never a great idea. Elegance because it makes your job easier and it means that the software works better is a great idea. It is even better when it takes away a lot of the stresses associated with messy programming habits. It becomes a means to an end.
We build and maintain tools to help our users solve their problems, which are most generally in playing with their ever-increasing piles of data. The most important thing is that the tools we build actually solve the problems for the users. Everybody getting their input into the design, programmers having fun while coding, and processes being left wide open and 'casual' may amuse some people while they are working, but they do not get the basic tasks completed any faster. Spending time to understand the user's real needs, keeping the code clean and consistent, using real graphic design for the interface, and providing simple tools that are easy to understand and effective are some of the things that make the users come to appreciate (or not) the underlying software. Computers can make people's lives easier or they can make them harder, the difference is up to the abilities of the programmers involved with the code.
In the end, a short simple solution that works simply and consistently is as good as it gets. Bad, overly complicated bloated software with every function imaginable isn't good, and it isn't much of an accomplishment. All the fancy graphics, dancing baloney and crammed in information can't hide a badly written program. It's not the technology, it's what you do with it that matters.
Sunday, February 10, 2008
The Power of Expression
My writing archives are littered with half-completed, mostly dead posts on the nature of expression. More so than any other topic, this one has defeated any attempt to roll up my ideas into a coherent, finished piece of work.
Writing needs to come together in a way that leads the reader on a journey while leaving them satisfied at the end. Half-thoughts, while interesting, leave the reader longing for more. Sort of like an appetizer with no main course. You won't starve to death, but you're still very hungry afterward.
To get around that, this post is -- I guess -- a series of appetizers; hopefully enough to fulfill. If you don't like one, perhaps some of the following might be more satisfying. If you keep reading long enough, hopefully, you'll be satiated. If you're still hungry at the end, stay tuned, there will always be more.
THE NEVER CHANGING ELEMENTS OF DEVELOPMENT
Software development is all about building 'tools' for users to play with their piles of data. The most important attribute for any tool is that it is usable. The second most important attribute is that it is extensible, or at the very least that it can be kept updated.
A tool that only works for a short time is a pain. Any investment in learning how to utilize it is squandered when it is no longer available. Even the best tools take some effort to master, while most of the software ones take huge effort because of their poor designs.
There is an implicit covenant between the users and the programmers. The users commit to learning how to utilize the tools only if we commit to building and maintaining them over the long run. A 'release' is only a brief instance in the life of any software, and you're only as good as your last release.
Software itself is just a large set of instructions. These days, it is often a huge set of instructions, in fact, trillions and trillions of them if you are looking at assembler. Most modern computers are happily working their way through millions of lines of instructions every second. The sheer size of our existing code bases and amount of code that is being executed is daunting.
Collecting together these instructions is hard, messy and prone to failure. There is much we can do to improve our accuracy, but I'll leave those thoughts for another day.
Once we know what to build, if we didn't know better, we might try to manually type each and every one of the instructions explicitly into the computer. We did it that way initially, but we've learned a lot since then. Or have we?
Even though we are no longer manually flipping switches or pounding out endless lines of assembler, we still commonly employ higher level brute force approaches as our primary means of building software. Generally, most programmers build the system by belting out or copying each and every line of code that needs to be executed. They'll use some 'theory' to reduce the redundant code, but it usually is not applied to great effect. Few code-bases are not endlessly-repeating chunks of nearly identical code, despite how we as an industry proclaim that we are following principles like DRY (Don't Repeat Yourself -- The Pragmatic Programmers).
Even more disconcerting, on an industry level, is that we are madly adding in as much code as possible to handle all of our perceived problems. There is some type of implicit assumption that having more code will actually help. That if we just had 'enough' code, we could solve all of our user's problems.
The funny thing is that it is a waste of time. You can never win, the amount of brute force code you need is infinite and infinitely growing. E.g. the more crappy code you have, the more crappy code you need to monitor it. That's an exponentially escalating mess. It defines the building culture for 'many' of our current operating systems and major tools. Although we've found ways to add in instructions at a faster rate we have not fundamentally changed our approach to what we are doing. We're just pounding each and every instruction explicitly into the computer.
It might be OK if it wasn't for the fact that code rusts. If we have code we have to maintain it and if it is rapidly growing out of control, that means that sooner or later lots of that code is going to rust. We cannot support it all. Our approach is flawed.
At least there will always be work in programming. Unless there is a shift, companies will endlessly pound-out partial tools. And we will endlessly refactor those tools into arbitrarily inconsistent pieces. And people like virus writers will help, in producing counter-productive code that needs to be monitored and controlled. It really is quite endless with this approach. The more code you have, the more you need, the more work you have to do to keep it going. We are not evolving, we are just barely keeping up with the demand.
LANGUAGE EXPRESSION
There are huge debates over which programming language is best. This rather silly subjective argument has been going on for the entire life of software development. It is not that I don't think the choice of language is important. It can be a critical component in getting the tool successfully built. But inherently, all of the languages that we currently have -- suck. They suck to varying degrees, but they still suck.
We haven't found the right level of abstraction yet, that allows us to build our systems reliably and consistently. We're not even in the right corner of the solution space, so we are quibbling over an endless array of broken and incomplete languages. Do you care what the mnemonic is for incrementing a register in assembler? No, of course not, that discussion is long gone. Is a Pascal pointer better than a C one? That too is ancient history. So, too, will many of today's issues just disappear.
The language we want is the one that makes its representation the closest possible to the way we think about the problem. The 'farther' we have to translate the answers, the more likely there will be errors. The 4am test is critical. Will I be able to sort this out at 4am or do I have to do some huge amount of mental gymnastics? Clear, straight-forward syntax and semantics that match the problem space are inherently necessary in minimizing the undesirable translations from the real world into the computer.
None of our current language paradigms really match the problem space for which we are building. Not object-oriented, nor functional programming, nor any of the older models. Users don't come to us asking for 'objects', nor do they come to us asking for functional closures. There is little in the technical language that maps back to most problem domains.
They come to us to build very specific tools to solve their data pile problems. They talk to us about data and they talk to us about their problems. These other 'things' are abstract technology concepts. We spend a massive amount of time and effort mapping the user-based problems onto abstract 'technology' issues. That is a critical amount of our effort.
Not that 'abstraction' itself is bad. The foundation on which we leverage our work across many problem domains is abstraction and it is here that we need to put more effort, not less. Abstractions are the answer to the brute force problem. Abstraction as a concept is great and truly important. It is just that most of our 'current' abstractions are not nearly as strong as we need; they could be more effective. So this leads to problems with 'expressing' our solutions in these underlying languages, technologies, and abstractions.
Instead of foolishly defending our own favorite languages, we should really try to come together and see what works and what doesn't. But honestly, not with a bias in trying to show one language is way better than any other. That really doesn't matter. These days the deciding factor isn't even the language itself, it's the libraries and communities that matter most.
The problem we are trying to solve is relatively simple: we want to be able to build tools quickly and correctly. Never lose sight of that underlying problem. Being able to cut and paste a million lines of 'for' and 'if' statements into a barely-stable GUI that tortures its users is not the makings of a great programmer.
At some point, we will find better, stronger abstractions that are closer to the way our users express their problems. When we have reduced that impedance mismatch, building a system will become trivial. Deciding what to build, however, will never change. Understanding the structure of the 'data' will never change. It is only the way we instantiate our solutions that can be affected. That doesn't mean we won't find larger super-tools that will be leverage-able to provide the underlying mechanics for huge swaths of problems. Given the current redundancy of most of our systems, it isn't hard to guess that yes, there are still more than a few elegant solutions just waiting to be found. We've only covered a tiny segment of the capabilities of our machines.
That growth that we still need to accomplish is the underlying essence behind my prediction for the future:
http://theprogrammersparadox.blogspot.com/2008/02/age-of-clarity.html
One day we will understand the data we are collecting, and what it really means. Then we will be able to use this understanding to 'deterministically' improve ourselves and our societies. The problem that keeps us from getting to this point is not with our technologies, we just don't know how to use them properly yet. The key to solving this problem will absolutely be the computer; it was the single most significant invention of the 20th century, and it will drive huge social changes in the 21st. We are still vastly underrating the significance of these machines.
COMPLEXITY REVISITED
You can't get very far in software development without having to learn to deal with complexity. Software development management is complexity management. While there are lots of definitions for complexity, people seem to understand the essence of the concept, but they still have trouble with the mechanics.
If we pick a convenient way to break it down into underlying effects, it becomes easier to see where the problems arise. A simple clean definition is always the strongest starting point.
We can start with a few definitions: all 'business' domains have an inherent complexity. The business domain is the specific industry, problem, etc. for which you are writing the tool. Some generalized tools cover huge domains, but all tools must always cover some domain. Unless of course, the code is entirely random and pointless. Even a simple demo is aimed at a specific set of users.
Most developers understand this and go about performing or acquiring a significant amount of analysis of the business domain on which to build their solutions for their common user problems.
What often seems to get missed however, is that the development, testing, and deployment of the software itself is significant. The problem domain for any piece of software isn't just the business domain, it is all of the development domains as well. For example, if you write the perfect tool, but it is flaky, then it is not usable. Everything about the tool, including itself, is part of the tool's 'problem' domain.
In addition, for every technology used in providing the solution, there is an inherent underlying amount of complexity that comes from the technologies themselves. To write 'to' a specific operating system platform, for example, you have to understand its strengths and weaknesses, or your solution will be volatile. Depending on any specific aspect of any technology is a mandatory risk for a project, but a manageable one.
So, for our development project, we have a huge problem domain with its inherent complexity and a significant amount of technical complexity for each and every piece in the system. If you were brilliant and your underlying technologies were clean, and this and only this was the sum total of the complexities in your solution, it would essentially be perfect. However, for all of these complexities, the culture of software development has an extreme tendency to 'add' in way more complexity on top of all of this.
Beyond the inherent complexity in a system, everything else is 'artificial'. It need not be there, but there is -- in practice -- generally a huge amount of it. It is entirely possible with enough effort to refactor any solution to completely remove, forever, all artificial complexity. That is true by definition. It serves no purpose other than to 'bulk' up the solution.
Frederick P. Brooks uses the term 'accidental' complexity, but I believe that includes both what I call artificial complexity and some of the technical complexity. This makes it a less than desirable term because there is nothing you can do about technical complexity; it is as much a part of the solution as the problem domain. Artificial complexity, on the other hand, is removable.
Also, accidental is a horrible word, although his intended definition is the centuries-old version favored by Aristotle. Oddly, I think that using archaic terms for modern things is in itself a form of artificial complexity; we like to make things sound special so we can be exclusive. Simple is better.
Artificial complexity for most development projects equals or exceeds the other inherent complexities. If not directly, the underlying technologies contribute huge amounts of fancy dancing that need not be necessary in an ideal world to properly complete the solution. The actual amount of artificial complexity in the software industry is astoundingly vast these days, and growing at an exponential rate. It is so large and so pervasive that most developers don't even realize how much of the underlying infrastructure could actually be simplified to make their lives easier. Getting it is a mind-blowing experience.
The funniest part about Computer Science is that one's instinctive guess about software might lead to the assumption that if a large group of people were working on the same code year after year, it would gradually approach a system with a decreased amount of artificial complexity. As work progresses, the problem domain would grow larger and the solution, overall, should get simpler.
In reality, the longer these big teams work on their systems, the worse the artificial complexity gets. In some cases, the entire terminology and development practices of some of these massive groups are so choked with artificial complexity that it probably represents up to 95% of their effort to discuss, push, rehash, extend or mess with their code on a regular basis. Artificial complexity breeds artificial complexity. Stay at it long enough, and most of what you have is just artificial complexity. Very little real stuff gets done underneath.
My favorite example of artificial complexity is a very visible one. There is but one operating system on the whole planet that distinguishes between binary and text files, and it is rumored that the cause of that distinction was a quick fix for a demo, many decades ago. The reason, so I was told, was that a specific hard drive needed to have the newline characters translated between the operating system and the disk. This then became the reason for differentiating between text and binary files. Translate in one case, ignore in the other.
So this was some simple little problem that briefly reared its head on some early DOS system. Of course, the ripples from this are visible all over. Windows still differentiates for no apparent reason, in the text and binary file modes of its C runtime. Protocols like FTP require specification of this parameter. No doubt it has worked its way into a countless number of interfaces, particularly any that want portability to DOS or Windows. Millions and millions of lines of code have had to deal with keeping track of this for one file or another. Millions and millions of hours of time have been spent debugging problems related to this.
This little 'artificial' distinction -- completely unnecessary -- has had a significant impact on the world. In fact, I'm willing to bet that if we took all of the effort involved in this silly little problem in one way or another and converted it into some other form of constructive effort, we'd have quite the funky-cool skyscraper by now. Possibly the largest one on the planet. If you think of how many soon-to-be programmers will eventually trip up on this issue one day, it is very depressing.
Even more disconcerting is that, across the various decades, there were multiple periods where this issue could have, and should have, been put to bed. Removed, refactored or cleaned up. But still, it remains. And more importantly, its brothers and sisters and cousins -- wanton little bits of artificial complexity -- all amount to more effort than what we needed to solve the actual underlying problems for our users. I'll go out on a limb here, but it is a very thick one: the amount of artificial complexity in the software industry is larger than the amount of inherent complexity, but I'll leave that proposition for someone more knowledgeable and wiser to prove.
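To make the 'translate in one case, ignore in the other' rule concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular operating system implements it: the helper name write_text is made up for this example, and an in-memory buffer stands in for the disk.

```python
import io


def write_text(text: str, newline: str) -> bytes:
    """Push text through a text-mode layer and return the bytes that would reach the 'disk'."""
    raw = io.BytesIO()  # stand-in for a file on disk (an assumption for this sketch)
    layer = io.TextIOWrapper(raw, encoding="latin-1", newline=newline)
    layer.write(text)
    layer.flush()
    return raw.getvalue()


# 'Binary' style: every '\n' is written exactly as given.
unix_bytes = write_text("a\nb\n", newline="\n")

# 'Text' style with the DOS convention: every '\n' becomes '\r\n' on the way out.
dos_bytes = write_text("a\nb\n", newline="\r\n")

# Binary data that merely contains the byte 0x0A is silently altered
# if it is pushed through the translating layer by mistake.
payload = bytes([0x01, 0x0A, 0x02])
corrupted = write_text(payload.decode("latin-1"), newline="\r\n")
```

The last case is the one that costs the debugging hours: the payload grows by a byte and no error is raised anywhere.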
WHY EVEN SLICE AND DICE?
Modern day software developers are easily lost. While we may know what we want to accomplish, there is a dizzying array of technological and technique choices to be made before even sitting down to consider a design. With so many subjective arguments and contradictory opinions, it is easy to get lost amongst the voices screaming at each other about the right way to build software.
That is why it is so critical, time and again, to go back to the basics and reexamine them. When the noise gets too loud, you have to ground yourself in what you know to be universally true.
We chop programs up into little pieces to make them easier to build. That is the only reason why we should be doing it. If it isn't easier, then it is just artificial complexity. That being said, there are so many 'theories' for programming, many of which are great, but all of which are dangerous if you take them too seriously. Again, "we chop programs up into bits to make them easier to build."
That means they are easier to read, easier to fix, easier to understand, etc. The attributes 'gained' by chopping up the code are all good things that help in the long run. It is not about typing, nor about the 'right' way to do it, nor anything else. Elegance comes solely from the ease with which we can manipulate the system. Clever, convoluted 'tricky' code is never elegant.
In Java, for example, the whole idea of dumping things back into mindless sub-objects called 'beans', only to stitch them back to other fuller objects later, seems like an exercise in futility. For things to be readable and elegant, we want to bring all of the relevant code together, and we don't want to repeat it over and over. What are beans then, other than some semantic mess for non-object structures that aren't even convenient to use in the system?
Now given the above definition of elegance, you might be thinking that it is far easier to pound out a set of instructions, over and over again, than it is to apply some fancy abstraction to it in an attempt to generalize the solution. That assertion might be true if the time spent pounding out the instructions wasn't significant. However, the more 'brutal' the brute force, the more work is required to get the job done. And not just a little more work; we are talking about massive amounts of work. Pounding out the code is hugely time-consuming, and an infinitely losing proposition.
A good abstraction that opens up the problem domain and really solves a series of related problems is absolutely more work. But, in comparison, it is only marginally more effort, relative to the alternative. If, for example, you find a way to create twenty GUI screens with one block of code by changing the incoming parameters, it may have taken you twice as much time as writing one of the screens, but 1/10 as much time as writing all twenty. The more powerful the abstraction, the more leverage. The more leverage, the more time 'saved'.
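As a rough sketch of the 'one block of code, many screens' idea, consider the following Python. Everything in it is hypothetical: the Field type, the render_screen function and the two sample screens are invented for illustration, and plain text stands in for a real GUI toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    label: str
    kind: str = "text"  # hypothetical field kinds, e.g. "text", "number", "date"


def render_screen(title: str, fields: list[Field]) -> str:
    """Render one plain-text form; the screen is described entirely by its parameters."""
    width = max(len(field.label) for field in fields)
    lines = [title, "=" * len(title)]
    for field in fields:
        lines.append(f"{field.label.ljust(width)} : [{field.kind}]")
    return "\n".join(lines)


# Each additional screen costs a line or two of data, not another copy of the code.
SCREENS = {
    "Customer": [Field("Name"), Field("Joined", "date")],
    "Invoice": [Field("Number", "number"), Field("Due", "date"), Field("Amount", "number")],
}

rendered = {title: render_screen(title, fields) for title, fields in SCREENS.items()}
```

Writing render_screen takes longer than hard-coding a single form, but the twentieth screen is then no more work than the second.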
When we generalize programs, we still need to chop them up to be easier to build. All of the same reasons for slicing and dicing some explicit set of instructions also apply to slicing and dicing some generalized set of instructions. We build with an abstraction in mind to make it easier to understand the code. We build with a pattern in mind for the same reason. The abstraction and the pattern are irrelevant, except with regard to making the underlying code easier to understand. A pattern helps to slice and dice, but unless it is also an abstraction, it should not leave remnants of artificial complexity in the code. Naming objects after design patterns, for example, is misleading. The data in a system is 'that' data, its structure is the pattern. It should be named for what it is, not how it is structured. It is the same as calling your intermediate counter variable in a 'for' loop 'integer'.
Again and again, movements grow from programmers that seek easier ways of building software. That is to be expected, but it is also to be expected that many of these movements are not improvements. With that in mind, we should always fall back to first principles when examining a new movement. If it does not jibe with the base problem we are trying to solve, then it is a poor solution. Also, you should never buy the counter-argument that you have to try something to know if it is good or bad. Our ability to think through a problem is tremendous, and our ability to ignore the truth is also tremendous. Just because you find a technique fun, doesn't make it a good idea.
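The naming point is easy to see side by side. This is a small made-up Python example, with every name in it hypothetical; both functions do exactly the same thing.

```python
# Named for how the data is structured: the reader learns nothing about the domain.
def process_list_of_tuples(list_of_tuples):
    integer = 0
    for tuple_item in list_of_tuples:
        integer += tuple_item[1]
    return integer


# Named for what the data is: identical logic, but it reads like the problem.
def invoice_total(line_items):
    total = 0
    for _description, amount in line_items:
        total += amount
    return total


line_items = [("widgets", 40), ("shipping", 2)]
```

Only the second version would pass the 4am test, even though the machine cannot tell them apart.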
I'VE HAD TOO MUCH JAVA TODAY
Like all programmers, I often have a bias towards whatever technology I am currently working with. The more you dig in and understand something, the easier it becomes. Oddly, even though I've been working heavily in Java recently, I don't find the language very appealing.
Languages range from being very flexible and accommodating, to being stiff and fragile. COBOL was the original stiff board. While it did an excellent job of making screens to capture data, it was just painfully boring to work with. The various incarnations of Visual Basic were another example of stiff. The language forces its practitioners to pound out brute force code because of the weak semantics of the language. I've never seen elegant VB, and I would be surprised if I ever did.
Java has some of that stiffness. The original plan was to not provide too much rope to allow the programmers to hang themselves. Languages like C are incredibly flexible, but for most programmers that flexibility is dangerous. They don't use it wisely, so their code gets unstable. Language designers don't want to stifle expression, but they can stop their users from creating some types of programs. Stiffness is good and bad.
The biggest problem with Java isn't the language itself, it is the culture that grew out of the language after it matured. The underlying libraries in Java are just damned awkward, and that translated into the coding practices being damned awkward. Things like beans and struts and just about every library I've ever seen are so over-engineered, inconsistent, and messy. Stuffing in fifty million unrelated Design Patterns became the vogue, which fills the system with a tremendous amount of artificial complexity. So much so that, working the whole environment on a messy operating system like Windows with one of the more modern icky IDEs, the collection of stuff is as arbitrary and convoluted as working on a mainframe or AS400s (iSeries). It is one giant arbitrary inconsistent mess. When I was younger, I remember not wanting to work on mainframes because the technology was arbitrary and ugly. Now it has found me on the PC. Java has become the new COBOL and Windows has become the new mainframe. The cycle of life perhaps?
There is much interest in adding new features like closures to the language. I'll dispense a little advice: Hey guys, the problem isn't that the language is missing stuff. The problem is that the libraries are damned awkward. Fix the problem. Re-release a Java2 with a decent set of clean and normalized libraries that don't suck. Cut back on the stupid functionality and focus on getting really really simple abstractions. You know you are close when the examples for simple things aren't hideously large. Make simple things simple. Look to the primitive libraries for other languages like Perl and C. Get a philosophy other than 'over-complicate-the-hell-out-of-it', and then clean up the implementation. And, ok make a couple of language changes, like making strings easy to use (remove StringBuffer), get rid of arrays and primitive types, and find a nicer syntax for callbacks. But please, whatever you do, don't adopt the C# strategy of just dumping more and more crap into the pot until you can't see the bottom, that's a guaranteed recipe for 'unstable'.
Despite my misgivings about Java, I'll probably be using it for a while still. At least until we can convince someone to bankroll a serious effort into finding new and better ways of really building systems (if you've got big money, I've got big ideas :-). But it doesn't seem as if the focus right now is on moving forward. We're just too busy drowning in our own artificial complexity to even consider that we shouldn't be. Besides, bugs are big business.
COMING BACK TO THE IMPORTANCE OF LANGUAGES
The various sections in this post fit together like a meal of appetizers, because of how they relate to the way we express our solutions for our users. Our primary computer language and its libraries may be the heart of our implementations, but what we really do is translate the perceived needs of our users into long and complex sequences of instructions. By the time you include development, testing, packaging, and distribution, a commercial software project will often involve the coordination of many full and partial computer languages. A web-based application may involve over a dozen. These different forms of expressing the solution fit some problems easily, but most require more effort. It is in this expression that we so easily go astray.
I see each and every language as a series of good and bad attributes, for which I would really like to collect the good together and discard the bad. If you've read my earlier posting on primitives, you probably understand why that is not a great idea, but as an approach towards enhancing development, it is a good direction to start. We need a new language that more closely matches the way we express our problems with each other. We don't necessarily need a fancy 'natural' language system, but we should focus on reducing the amount of translation that is happening between analysis and implementation.
Our latest technologies are extremely complex. Much of it is accumulated artificial complexity that could be removed if we had the nerve to refactor our solutions. When we find the right underlying abstractions, they will create a consistent layer on which we can easily express super-complicated problems. This combined with a more natural representation should give us a huge leap in technical sophistication. It is always worth noting that a computer is an incredibly powerful mind-machine and that our current level of software development is a disappointingly crude attempt at utilizing it.
It should be easier to express our "understanding from the users" into specific tools. There is no real reason why I need to write, over and over again, the same basic solutions to the same basic technical problems. My real problem is the nature and structure of the data, not what type of list structure is returned from some bizarre internal call. We get so caught up in the fantasy of our brilliance in pounding out little solutions to the same common problems, that we forget about the big picture, the real problem we are trying to solve. Users want to manipulate a pile of data. We need to build specific tools to accomplish this. That underlying consistency threads all types of programming for all types of industries together into one giant related effort. We are building ever-increasing piles of data in the same way that ancient Egyptians were building ever larger pyramids. It is just that it is hard to physically see our efforts, although the web does allow tourists to visit our piles.
A key source of our problems is our underlying technologies, particularly our languages. We need to fix or refactor these if we want to make building things easier. However, if things are too easy, programmers may actually refuse to use the technologies, because they take away too much of the fun. Oddly, one can easily suspect that Frederick P. Brooks is correct about there not being a silver bullet, not because it is impossible, but because people wouldn't use it, even if it existed. Humanity -- in that regard -- is a strange crowd.
Writing needs to come together in a way that leads the reader on a journey while leaving them satisfied at the end. Half-thoughts, while interesting, leave the reader longing for more. Sort of like an appetizer with no main course. You won't starve to death, but you're still very hungry afterward.
To get around that, this post is -- I guess -- a series of appetizers; hopefully enough to fulfill. If you don't like one, perhaps some of the following might be more satisfying. If you keep reading long enough, hopefully, you'll be satiated. If you're still hungry at the end, stay tuned, there will always be more.
THE NEVER CHANGING ELEMENTS OF DEVELOPMENT
Software development is all about building 'tools' for users to play with their piles of data. The most important attribute for any tool is that it is usable. The second most important attribute is that it is extensible, or at the very least it can be kept updated.
A tool that only worked for a short time is a pain. Any investment in learning how to utilize it is squandered when it is no longer available. Even the best tools take some effort to master, while most of the software ones take huge effort because of their poor designs.
There is an implicit covenant between the users and the programmers. The users commit to learning how to utilize the tools only if we commit to building and maintaining them over the long run. A 'release' is only a brief instance in the life of any software, and you're only as good as your last release.
Software itself is just a large set of instructions. These days, it is often a huge set of instructions, in fact, trillions and trillions of them if you are looking at assembler. Most modern computers are happily working their way through billions of instructions every second. The sheer size of our existing code bases and the amount of code that is being executed is daunting.
Collecting together these instructions is hard, messy and prone to failure. There is much we can do to improve our accuracy, but I'll leave those thoughts for another day.
Once we know what to build, if we didn't know better, we might try to manually type each and every one of the instructions explicitly into the computer. We did it that way initially, but we've learned a lot since then. Or have we?
Even though we are no longer manually flipping switches or pounding out endless lines of assembler, we still commonly employ higher level brute force approaches as our primary means in building software. Generally, most programmers build the system by belting out or copying each and every line of code that needs to be executed. They'll use some 'theory' to reduce the redundant code, but it usually is not applied to great effect. Few code-bases are not endlessly-repeating chunks of nearly identical code, despite how we as an industry proclaim we are following principles like DRY (Don't Repeat Yourself -- The Pragmatic Programmers).
Even more disconcerting, on an industry level, is that we are madly adding in as much code as possible to handle all of our perceived problems. There is some type of implicit assumption that having more code will actually help. That if we just had 'enough' code, we could solve all of our users' problems.
The funny thing is that it is a waste of time. You can never win, the amount of brute force code you need is infinite and infinitely growing. E.g. the more crappy code you have, the more crappy code you need to monitor it. That's an exponentially escalating mess. It defines the building culture for 'many' of our current operating systems and major tools. Although we've found ways to add in instructions at a faster rate we have not fundamentally changed our approach to what we are doing. We're just pounding each and every instruction explicitly into the computer.
It might be OK if it wasn't for the fact that code rusts. If we have code we have to maintain it and if it is rapidly growing out of control, that means that sooner or later lots of that code is going to rust. We cannot support it all. Our approach is flawed.
At least there will always be work in programming. Unless there is a shift, companies will endlessly pound-out partial tools. And we will endlessly refactor those tools into arbitrarily inconsistent pieces. And people like virus writers will help, in producing counter-productive code that needs to be monitored and controlled. It really is quite endless with this approach. The more code you have, the more you need, the more work you have to do to keep it going. We are not evolving, we are just barely keeping up with the demand.
LANGUAGE EXPRESSION
There are huge debates over which programming language is best. This rather silly subjective argument has been going on with the entire life of software development. It is not that I don't think the choice of language is important. It can be a critical component in getting the tool successfully built. But inherently, all of the languages that we currently have -- suck. They suck to varying degrees, but they still suck.
We haven't found the right level of abstraction yet, that allows us to build our systems reliably and consistently. We're not even in the right corner of the solution space, so we are quibbling over an endless array of broken and incomplete languages. Do you care what the mnemonic is for incrementing a register in assembler? No, of course not, that discussion is long gone. Is a Pascal pointer better than a C one? That too is ancient history. So, too, will many of today's issues just disappear.
The language we want is the one that makes its representation the closest possible to the way we think about the problem. The 'farther' we have to translate the answers, the more likely there will be errors. The 4am test is critical. Will I be able to sort this out at 4am or do I have to do some huge amount of mental gymnastics? Clear, straight-forward syntax and semantics that match the problem space are inherently necessary in minimizing the undesirable translations from the real world into the computer.
None of our current language paradigms really match the problem space for which we are building. Not objected oriented, nor functional programming, nor any of the older models. Users don't come to us asking for 'objects', nor do they come to us asking for functional closures. There is little in the technical language that maps back to most problem domains.
They come to us to build very specific tools to solve their data pile problems. They talk to us about data and they talk to us about their problems. These other 'things' are abstract technology concepts. We spend a massive amount of time and effort mapping the user based problems onto abstract 'technology' issues. That is a critical amount of our effort.
Not that 'abstraction' itself is bad. The foundation on which we leverage our work across many problem domains is abstraction and it is here that we need to put more effort, not less. Abstractions are the answer to the brute force problem. Abstraction as a concept is great and truly important. It is just that most of our 'current' abstractions are not nearly as strong as we need; they could be more effective. So this leads to problems with 'expressing' our solutions in these underlying languages, technologies, and abstractions.
Instead of foolishly defending our own favorite languages, we should really try to come together and see what works and what doesn't. But honestly, not with a bias in trying to show one language is way better than any other. That really doesn't matter. These days the deciding factor isn't even the language itself, its the libraries and communities that matter most.
The problem we are trying to solve is relatively simple: we want to be able to build tools quickly and correctly. Never lose sight of that underlying problem. Being able to cut and paste a million lines of 'for' and 'if' statements into a barely-stable GUI that tortures its users is not the makings of a great programmer.
At some point, we will find better, stronger abstractions that are closer to the way our users express their problems. When we have reduced that impedance mismatch, building a system will become trivial. Deciding what to build, however, will never change. Understanding the structure of the 'data' will never change. It is only the way we instantiate our solutions that can be affected. That doesn't mean we won't necessarily find larger super-tools that will be leverage-able to provide the underlying mechanics for huge swatches of problems. Given the current redundancy of most of our systems it isn't hard to guess that yes, there are still more than a few elegant solutions just waiting to be found. We've only covered a tiny segment of the capabilities of our machines.
That growth that we still need to accomplish is the underlying essence behind my prediction for the future:
http://theprogrammersparadox.blogspot.com/2008/02/age-of-clarity.html
One day we understand the data we are collecting, and what it really means. Then we will be able to use this understanding to 'deterministically' improve ourselves and our societies. The problem that keeps us from getting to this point is not with our technologies, we just don't know how to use them properly yet. The key to solving this problem will absolutely be the computer; it was the single most significant invention of the 20th century, and it will drive huge social changes in the 21st. We are still vastly underrating the significant of these machines.
COMPLEXITY REVISITED
You can't get very far in software development without having to learn to deal with complexity. Software development management is complexity management. While there are lots of definitions for complexity, people seem to understand the essence of the concept, but they still have trouble with the mechanics.
If we pick a convenient way to break it down into underlying effects, it becomes easier to see where the problems arise. A simple clean definition is always the strongest starting point.
We can start with a few definitions: all 'business' domains have an inherent complexity. The business domain is the specific industry, problem, etc. for which you are writing the tool. Some generalized tools cover huge domains, but all tools must always cover some domain. Unless of course, the code is entirely random and pointless. Even a simple demo is aimed at a specific set of users.
Most developers understand this and go about performing or acquiring a significant amount of analysis of the business domain on which to build their solutions for their common user problems.
What often seems to get missed however, is that the development, testing, and deployment of the software itself is significant. The problem domain for any piece of software isn't just the business domain, it is all of the development domains as well. For example, if you write the perfect tool, but it is flaky, then it is not usable. Everything about the tool, including itself, is part of the tool's 'problem' domain.
In addition, for every technology used in providing the solution, there is an inherent underlying amount of complexity that comes from the technologies themselves. To write 'to' a specific operating system platform, for example, you have to understand its strengths and weaknesses, or your solution will be volatile. Depending on any specific aspect of any technology is a mandatory risk for a project, but a manageable one.
So, for our development project, we have a very huge problem domain with its inherent complexity and a significant amount of technical complexity for each and every piece in the system. If you were brilliant and your underlying technologies were clean, and this and only this was the sum total of the complexities in your solution, it would essentially be perfect. However, for all of these complexities, the culture of software development has an extreme tendency to 'add' in way more complexity on top of all of this.
Beyond the inherent complexity in a system, everything else is 'artificial'. It need not be there, but there is -- in practice -- generally a huge amount of it. It is entirely possible with enough effort to refactor any solution to completely remove, forever, all artificial complexity. That is true by definition. It serves no purpose other than to 'bulk' up the solution.
Frederick P. Brooks uses the term 'accidental' complexity, but I believe that includes both what I call artificial complexity and some of the technical complexity. This makes it a less than desirable term because there is nothing you can do about technical complexity; it is as much a part of the solution as the problem domain. Artificial complexity, on the other hand, is removable.
Also, accidental is a horrible word, although his intended definition is the centuries-old version favored by Aristotle. Oddly, I think that using archaic terms for modern things is in itself a form of artificial complexity, we like to make things sound special so we can be exclusive. Simple is better.
Artificial complexity for most development projects equals or exceeds the other inherent complexities. If not directly, the underlying technologies contribute huge amounts of fancy dancing that need not be necessary in an ideal world to properly complete the solution. The actual amount of artificial complexity in the software industry is astoundingly vast these days, and growing at an exponential rate. It is so large and so pervasive that most developers don't even realize how much of the underlying infrastructure could actually be simplified to make their lives easier. Getting it is a mind-blowing experience.
The funniest part about Computer Science is that one's instinctive guess about software might lead to the assumption that if a large group of people were working on the same code year after year, it would gradually approach a system with a decreased amount of artificial complexity. As work progresses, the problem domain would grow larger and the solution, overall, should get simpler.
In reality, the longer these big teams work on their systems, the worse the artificial complexity gets. In some cases, the entire terminology and development practices of some of these massive groups are so choked with artificial complexity that it probably represents up to 95% of their effort to discuss, push, rehash, extend or mess with their code on a regular basis. Artificial complexity breeds artificial complexity. Stay at it long enough, and most of what you have is just artificial complexity. Very little real stuff gets done underneath.
My favorite example of artificial complexity is a very visible one. There is but one operating system on the whole planet that distinguishes between binary and text files, and it is rumored that the cause of that distinction was a quick fix for a demo, many decades ago. The reason, so I was told, was that a specific hard drive needed to have the newline characters translated between the operating system and the disk. This then became the reason for differentiating between text and binary files. Translate in one case, ignore in the other.
So this was some simple little problem that briefly reared its head on some early DOS system. Of course, the ripples from this are visible all over. Windows still differentiates for no apparent reason. Protocols like FTP require specification of this parameter. No doubt it has worked its way into countless interfaces, particularly any that want portability to DOS or Windows. Millions and millions of lines of code have had to deal with keeping track of this for one file or another. Millions and millions of hours of time have been spent debugging problems related to this.
This little 'artificial' distinction -- completely unnecessary -- has had a significant impact on the world. In fact, I'm willing to bet that if we took all of the effort involved in this silly little problem in one way or another and converted it into some other form of constructive effort, we'd have quite the funky-cool skyscraper by now. Possibly the largest one on the planet. If you think of how many soon-to-be programmers will eventually trip up on this issue one day, it is very depressing.
Even more disconcerting is that, over the decades, there were multiple periods where this issue could have, and should have, been put to bed. Removed, refactored or cleaned up. But still, it remains. And more importantly, its brothers and sisters and cousins -- wanton little bits of artificial complexity -- all amount to more effort than what we needed to solve the actual underlying problems for our users. I'll go out on a limb here, but it is a very thick one: the amount of artificial complexity in the software industry is larger than the amount of inherent complexity, but I'll leave that proposition for someone more knowledgeable and wiser to prove.
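The translate-or-ignore split is still easy to observe. What follows is only a minimal Python sketch, assuming Python 3's default newline handling for text mode (an illustration picked for convenience, not the DOS mechanism from the story). The helper name read_both_ways is made up for the example. The same bytes come back differently depending on the mode used to open the file.

```python
import os
import tempfile


def read_both_ways(raw: bytes):
    """Write raw bytes to a temp file, then read them back in binary and in text mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # Binary mode: the bytes come back untouched.
        with open(path, "rb") as f:
            as_binary = f.read()
        # Text mode (default newline handling): "\r\n" is translated to "\n".
        with open(path, "r", encoding="ascii") as f:
            as_text = f.read()
    finally:
        os.remove(path)
    return as_binary, as_text


as_binary, as_text = read_both_ways(b"one\r\ntwo\r\n")
print(len(as_binary), len(as_text))
```

Every program that touches such a file has to know which of the two views it wants, which is exactly the bookkeeping the distinction forces on everyone.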
WHY EVEN SLICE AND DICE?
Modern day software developers are easily lost. While we may know what we want to accomplish, there is a dizzying array of technological and technique choices to be made before even sitting down to consider a design. With so many subjective arguments and contradictory opinions, it is easy to get lost amongst the voices screaming at each other about the right way to build software.
That is why it is so critical, time and again, to go back to the basics and reexamine them. When the noise gets too loud, you have to ground yourself in what you know to be universally true.
We chop programs up into little pieces to make them easier to build. That is the only reason why we should be doing it. If it isn't easier, then it is just artificial complexity. That being said, there are so many 'theories' for programming, many of which are great, but all of which are dangerous if you take them too seriously. Again, "we chop programs up into bits to make them easier to build."
That means, they are easier to read, easier to fix, easier to understand, etc. The attributes 'gained' by chopping up the code are all good things that help in the long run. It is not about typing, nor about the 'right' way to do it, nor anything else. Elegance comes solely from the ease with which we can manipulate the system. Clever, convoluted 'tricky' code is never elegant.
In Java, for example, the whole idea of dumping things back into mindless sub-objects called 'beans' only to stitch them back to other fuller objects later, seems like an exercise in futility. For things to be readable and elegant, we want to bring all of the relevant code together, and we don't want to repeat it over and over. What are beans then, other than some semantic mess for non-object structures that aren't even convenient to use in the system?
Now given the above definition of elegance, you might be thinking that it is far easier to pound out a set of instructions, over and over again, than it is to apply some fancy abstraction to it, in an attempt to generalize the solution. That assertion might be true if the time spent pounding out the instructions wasn't significant. However, the more 'brutal' the brute force, the more work is required to get the job done. And not just a little more work, we are talking about massive amounts of work. Pounding out the code is hugely time-consuming, and an infinitely losing proposition.
A good abstraction that opens up the problem domain and really solves a series of related problems is absolutely more work. But, in comparison, it is only marginally more effort, relative to the alternative. If, for example, you find a way to create twenty GUI screens with one block of code by changing the incoming parameters, it may have taken you twice as much time as writing one of the screens, but 1/10 as much time as writing all twenty. The more powerful the abstraction, the more leverage. The more leverage, the more time 'saved'.
When we generalize programs, we still need to chop them up to be easier to build. All of the same reasons for slicing and dicing some explicit set of instructions also apply to slicing and dicing some generalized set of instructions. We build with an abstraction in mind to make it easier to understand the code. We build with a pattern in mind for the same reason. The abstraction and the pattern are irrelevant, except in regards to making the underlying code easier to understand. A pattern helps to slice and dice, but unless it is also an abstraction, it should not leave remnants of artificial complexity in the code. Naming objects after design patterns, for example, is misleading. The data in a system is 'that' data, its structure is the pattern. It should be named for what it is, not how it is structured. It is the same as calling your intermediate counter variable in a 'for' loop 'integer'.
Again and again, movements grow from programmers that seek easier ways of building software. That is to be expected, but it is also to be expected that many of these movements are not improvements. With that in mind, we should always fall back to first principles when examining a new movement. If it does not jibe with the base problem we are trying to solve, then it is a poor solution. Also, you should never buy the counter-argument that you have to try something to know if it is good or bad. Our ability to think through a problem is tremendous, and our ability to ignore the truth is also tremendous. Just because you find a technique fun, doesn't make it a good idea.
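To make the leverage concrete, here is a rough Python sketch of the idea, assuming plain-text 'screens' rather than a real GUI toolkit. The names Field, render_screen and SCREENS are hypothetical, invented for this illustration. One block of code renders any number of screens, and each new screen is only a new set of parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One labelled input on a screen (hypothetical structure for this sketch)."""
    label: str
    width: int = 20


def render_screen(title, fields):
    """Render one data-entry screen as text from a title and a list of fields."""
    lines = [title, "=" * len(title)]
    for field in fields:
        lines.append(f"{field.label}: [{'_' * field.width}]")
    return "\n".join(lines)


# Hypothetical screen definitions: adding a screen means adding data, not code.
SCREENS = {
    "Customer": [Field("Name"), Field("Phone", 12)],
    "Invoice": [Field("Number", 8), Field("Amount", 10)],
}

for title, fields in SCREENS.items():
    print(render_screen(title, fields))
    print()
```

The generic function costs more to write than a single hard-coded screen, but the twentieth screen costs about as much as the second.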
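A small Python sketch of the naming point, with made-up function and variable names: both functions below do exactly the same work, but the first is named after how the data is structured, while the second is named after what the data is.

```python
# Named after types and structure: the reader learns nothing about the problem.
def sum_list_of_dicts(list_of_dicts):
    integer = 0
    for dictionary in list_of_dicts:
        integer += dictionary["amount"]
    return integer


# Named after what the data is: same structure, same behavior.
def invoice_total(line_items):
    total = 0
    for item in line_items:
        total += item["amount"]
    return total


line_items = [{"amount": 5}, {"amount": 10}]
print(sum_list_of_dicts(line_items), invoice_total(line_items))
```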
I'VE HAD TOO MUCH JAVA TODAY
Like all programmers, I often have a bias towards whatever technology I am currently working with. The more you dig in and understand something, the easier it becomes. Oddly even though I've been working heavily in Java recently, I don't find the language very appealing.
Languages range from being very flexible and accommodating, to being stiff and fragile. COBOL was the original stiff board. While it did an excellent job of making screens to capture data, it was just painfully boring to work with. The various incarnations of Visual Basic were another example of stiff. The language forces its practitioners to pound out brute force code because of the weak semantics of the language. I've never seen elegant VB, and I would be surprised if I ever did.
Java has some of that stiffness. The original plan was to not provide too much rope to allow the programmers to hang themselves. Languages like C are incredibly flexible, but for most programmers that flexibility is dangerous. They don't use it wisely, so their code gets unstable. Language designers don't want to stifle expression, but they can stop their users from creating some types of programs. Stiffness is good and bad.
The biggest problem with Java isn't the language itself, it is the culture that grew out of the language after it matured. The underlying libraries in Java are just damned awkward, and that translated into the coding practices being damned awkward. Things like beans and struts and just about every library I've ever seen are so over-engineered, inconsistent, and messy. Stuffing in fifty million unrelated Design Patterns became the vogue, which fills the system with a tremendous amount of artificial complexity. So much, that working the whole environment on a messy operating system like Windows with one of the more modern icky IDEs, the collection of stuff is as arbitrary and convoluted as working on a mainframe or AS400s (iSeries). It is one giant arbitrary inconsistent mess. When I was younger, I remember not wanting to work on mainframes because the technology was arbitrary and ugly. Now it has found me on the PC. Java has become the new COBOL and Windows has become the new mainframe. The cycle of life perhaps?
There is much interest in adding new features like closures to the language. I'll dispense a little advice: Hey guys, the problem isn't that the language is missing stuff. The problem is that the libraries are damned awkward. Fix the problem. Re-release a Java2 with a decent set of clean and normalized libraries that don't suck. Cut back on the stupid functionality and focus on getting really really simple abstractions. You know you are close when the examples for simple things aren't hideously large. Make simple things simple. Look to the primitive libraries for other languages like Perl and C. Get a philosophy other than 'over-complicate-the-hell-out-of-it', and then clean up the implementation. And, ok make a couple of language changes, like making strings easy to use (remove StringBuffer), get rid of arrays and primitive types, and find a nicer syntax for callbacks. But please, whatever you do, don't adopt the C# strategy of just dumping more and more crap into the pot until you can't see the bottom, that's a guaranteed recipe for 'unstable'.
Despite my misgivings for Java, I'll probably be using it for a while still. At least until we can convince someone to bankroll a serious effort into finding new and better ways of really building systems (if you've got big money, I've got big ideas :-) But it doesn't seem as if the focus right now is on moving forward. We're just too busy drowning in our own artificial complexity to even consider that we shouldn't be. Besides, bugs are big business.
COMING BACK TO THE IMPORTANCE OF LANGUAGES
The various sections in this post fit together like a meal of appetizers, because of how they relate to the way we express our solutions for our users. Our primary Computer language and its libraries may be the heart of our implementations but what we really do is translate the perceived needs of our users into long and complex sequences of instructions. By the time you include development, testing, packaging, and distribution a commercial software project will often involve the coordination of many full and partial computer languages. A web-based application may involve over a dozen. These different forms of expressing the solution fit some problems easily, but most require more effort. It is in this expression, that we so easily go astray.
I see each and every language as a series of good and bad attributes, for which I would really like to collect the good together and discard the bad. If you've read my earlier posting on primitives, you probably understand why that is not a great idea, but as an approach towards enhancing development, it is a good direction to start. We need a new language that more closely matches the way we express our problems with each other. We don't necessarily need a fancy 'natural' language system, but we should focus on reducing the amount of translation that is happening between analysis and implementation.
Our latest technologies are extremely complex. Much of it is accumulated artificial complexity, that could be removed if we have the nerve to refactor our solutions. When we find the right underlying abstractions, they will create a consistent layer on which we can easily express super-complicated problems. This combined with a more natural representation should give us a huge leap in technical sophistication. It is always worth noting that a computer is an incredibly powerful mind-machine and that our current level of software development is an extremely disappointingly crude attempt at utilizing it.
It should be easier to express our "understanding from the users" into specific tools. There is no real reason why I need to write, over and over again, the same basic solutions to the same basic technical problems. My real problem is the nature and structure of the data, not what type of list structure is returned from some bizarre internal call. We get so caught up in the fantasy of our brilliance in pounding out little solutions to the same common problems, that we forget about the big picture, the real problem we are trying to solve. Users want to manipulate a pile of data. We need to build specific tools to accomplish this. That underlying consistency threads all types of programming for all types of industries together into one giant related effort. We are building ever-increasing piles of data in the same way that ancient Egyptians were building ever increasingly large pyramids. It is just that it is hard to physically see our efforts, although the web does allow tourists to visit our piles.
A key source of our problems is our underlying technologies, particularly our languages. We need to fix or refactor these if we want to make building things easier. However, if things are too easy, programmers may actually refuse to use the technologies, because they take away too much of the fun. Oddly, one can easily suspect that Frederick P. Brooks is correct about there not being a silver bullet, not because it is impossible, but because people wouldn't use it, even if it existed. Humanity -- in that regard -- is a strange crowd.
Saturday, February 2, 2008
The Age of Clarity
"What do we really know? Hmmm." I pondered as we walked.
I was out with the dog the other night. The quiet tranquil nature of empty suburban streets is a great place for deep thinking. The cold chill of winter keeps one from wandering too far off topic while wandering aimlessly in the streets. Dogs make wonderful intellectual companions for these types of journeys because they don't interrupt with too many questions. They are very good listeners.
I was pondering information quality, and I foolishly started to wonder about how much inaccurate information was choking up my memory. Certainly, there is lots of spin, lies, half-truths, deceptions and other stuff built up over the years from less than quality sources like politics, news and TV. Some things in my memory are just easy simplifications. Some things are outright fabrications. There is also the changing nature of science, and our non-stop quest for learning. Some of my knowledge is just 'relative', it wouldn't stand up to a universal judge. It is considered true here and now, but won't be in the future. In an overall sense, how much of this is really accurate?
If you factor in all of the different reasons for low quality, and take a big sweeping guess, the amount of truth in our brains could be lower than 30%. Just a wild guess, but I could easily believe that 1 in every 3 facts in my brain is true, while the other 2 are questionable for all sorts of reasons. I am just speculating of course, but in this misinformation age, we are full of a tremendous amount of low quality knowledge. And it feels like it is growing at an ever increasing rate, although that might just be our ability to confirm that it is suspect.
THOSE WHO FORGET HISTORY ARE DOOMED TO REPEAT IT
In the past, mankind was mostly ignorant of the accuracy of their information. They could take pride in their depth of knowledge without ever knowing how dubious it really was. Now all we have to do is check Wikipedia and we can instantly find out the truth, well, err at least a pointer towards the truth.
How often have I pulled forth some ancient fact from the depths of my brain, only to discover that it was fundamentally untrue? Worse still is how those facts actually make it into my head in the first place. Some were obviously from disreputable sources, but others had come from well-known authorities, and were still incorrect. My problem is not loss or corruption of memory, it is the opposite, these 'facts' stay for far too long. If I just dumped them faster, I might find they were more accurate overall.
It is oddly telling. It allows us to guess that this huge degree of inaccuracy in our current knowledge is actually some type of pointer towards the future. The Renaissance was an awakening about the world that we live in. A moment when we first opened our eyes and saw it for what it actually was. This in turn drove the foundations for the industrial age, where we learned to create and use an unlimited number of machines. One of those machines, the computer, has driven us into an information age, where we collect huge piles of information, about virtually everything in this world. There is a trend here. The next age will follow along in this sequence.
Even though we have built up a tremendous collection of fantastic machines, they do not serve us well. We can build things, but we have trouble maintaining them. Our massive and complex cities crumble around us. We are forever fighting a losing battle against entropy; like a runner that has leaned too far forward we are continually off balance. We must continue to build to move forward, we don't know how to preserve what we have and we don't know how to live within our environmental means. We grow at a severe cost to the world around us.
With all of our equipment and learning, collecting data is still a hit or miss proposition. We just guess at what we want to collect and how it is structured. It is not orderly and we don't have any underlying theories that drive our understanding. Computer Science is still so young that it is frequently wrong. Often it is just random guessing. We are currently only utilizing a small fraction of the capabilities of our computers because we keep bumping into complexity thresholds each time we try to build truly sophisticated systems. We are trapped with crude software.
Even though we can collect the data, we continuously fail to be able to mine or interpret it. We gather the stuff, format it and then save it to backup tapes. But we get little actual use from all of our work in collecting it. Some decisions are made from the data, but given the real amount of underlying information contained in our efforts we could actually use what we have to make really sophisticated decisions. To actually know, for a fact, that we are changing things for the better. If we understood what we have.
THE YELLOW BRICK ROAD
Given those trends, it is not hard at all to predict the future. The next step in the sequence. The path we must take is the only one available: machines to labour for us physically and mentally lead to vast infrastructures and vast piles of information. We built up these things, but we don't understand them, and we have trouble keeping them going.
It is not like we will wake up one day and the light will get turned on, but I imagine that over time like a dull and steady wind blowing away the haze, much of what we know will become clear and finally fit into place. It will be a modern day Renaissance reoccurring not with our perspective of the world around us, but with our perspective of the information and knowledge that we have collected. It will take time. Many years, decades or even a century or two, but one day there will be an 'age of clarity', where mankind can finally see the information around them for what it actually is. That is, if we survive the turmoil of our current societies; we have so many dangers that await us, because of what we know, but don't yet understand.
And what could we expect in such an age? I imagine that we will have a real understanding of information, probably based on a currently unknown science. Maybe several. We will know how to quickly, conveniently monitor and collect information for any questions. Inherently, we will understand the truthfulness of what we collect, and we'll be able to immediately use this information to ascertain whether things are improving or getting worse. The term 'immediately' being one of the very key points.
Unlike now, this won't be a big effort, but rather something simple that people do as a matter of due diligence. Government effectiveness for example, will be based on simple true numbers that show that things are improving or getting worse. Unlike the statistics of our day, these numbers and their interpretation, based on science will be irrefutable. We will be able to show cause and effect relationships between policies and real life. We will be able to measure the effectiveness, not guess at it. If we say things are getting better, it won't just be 'spin'.
Underneath, if we capture enough data, we will get a vibrant picture of all of the relationships, how they fit with each other and what they really mean. When we choose to make changes, they will not be partially-informed guesses, they will be tangible deterministic improvements to our societies that will work as expected. In the same way that the industrial age leapt from wildly building things to the reproducible industrialization of products with a tremendous amount of consistent quality, we will shift our understanding of the data around us. Like the difference between B&W photography and color we will learn how to start really capturing the information that is of real value, and we will learn how to really interpret it.
AT THE EDGE OF REASON
Does it sound too overly deterministic or crazy? Whatever comes in the future, it must be something that isn't here now. So, if it isn't pushing the envelope of convention, then it's not really much of a prediction, is it? Jules Verne wrote about ships that travelled underwater, a famously crazy concept if ever there was one, except that it now has become common knowledge. He wrote about air ships, defying gravity and hanging in the sky with birds, clearly another bit of wackiness. Yet, this too is common, and rather boring now.
Does it sound very similar to what we have now? For all we know, we know so very little. We have many approaches and methods to really prove things, to get to the real underlying truth, but because we can't do that easily on a grand scale we are awash in misinformation. All of this low quality knowledge chokes our pathways and keeps us from progressing. It becomes food for subjective arguments, and endless discussions. And while some of us may suspect falsehoods, proving it is costly and often distracting. We can't fight all of the battles all of the time, so the low quality stuff washes over us like a tsunami.
Sometimes when I am out walking the dog, my mind drifts around to us being so sophisticated that there is not much left in this world that we don't know. That 'proposition' is comforting in many ways, but patently false. Like the pre-Renaissance societies, we think we have reached some level of sophistication, but we barely even realize how to keep our own existence from spiraling out of control. And what we don't know, is the question: "what do we really know?" We feel pride in having built up knowledge bases like the World Wide Web, but realistically the things are a mess. What good is a massive unorganized pile of data, if we can't use it to answer the serious questions in our lives? We live in an age where subjective arguments are possible for most of what we commonly deal with in our lives. Everything is up for grabs; everything is based on opinion. We can barely distinguish the quality of our facts, let alone position them into some coherent and universally correct structure of the world around us. For all that we know, we are still incredibly ignorant.
The next big thing then is obvious. If we are to survive, then we have to pass through the Clarity Age. We have no choice. If this understanding hasn't already popped into someone else's brain, it was bound to sooner or later. You can't get very far down the path, if you don't know where the path lies. It is murky now, and for us to progress it must be clear.
Woof, woof, woof! My thinking and wanderings were interrupted because the dog spotted a raccoon. I was ripped from the depths by the pulling, jumping and barking. In the here and now, I am reminded that it is best if I move on quickly to keep the dog from making too much racket. I don't want to wake up my whole neighborhood with the commotion. However much I long to spend time in the future, I must live with the world around me as it is now. These dark ages are apt to last a while, possibly my entire life. I ought not to waste it, pining for enlightenment.
I was out with the dog the other night. The quiet tranquil nature of empty suburban streets is a great place for deep thinking. The cold chill of winter keeps one from wandering too far off topic while wandering aimlessly in the streets. Dogs make wonderful intellectual companions for these types of journeys because they don't interrupt with too many questions. They are very good listeners.
I was pondering information quality, and I foolishly started to wonder about how much inaccurate information was choking up my memory. Certainly, there are lots of spin, lies, half-truths, deceptions and other stuff built up over the years from less than quality sources like politics, news and TV. Somethings in my memory are just easy simplifications. Somethings are out right fabrications. There is also the changing nature of science, and our non-stop quest for learning. Some of my knowledge is just 'relative', it wouldn't stand up to a universal judge. It is considered true here and now, but won't be in the future. In an overall sense, how much of this is really accurate?
If you factor in all of the different reasons for low quality, and take a big sweeping guess, the amount of truth in our brains could be lower than 30%. Just a wild guess, but I could easily believe that 1 in every 3 three facts in my brain are true, while the other 2 are questionable for all sorts of reasons. I am just speculating of course, but in this misinformation age, we are full of a tremendous amount of low quality knowledge. And it feels like it is growing at an ever increasing rate, although that might just be our ability to confirm that it is suspect.
THOSE WHO FORGET HISTORY ARE DOOMED TO REPEAT IT
In the past, mankind was mostly ignorant of the accuracy of its information. People could take pride in their depth of knowledge without ever knowing how dubious it really was. Now all we have to do is check Wikipedia and we can instantly find out the truth, well, err, at least a pointer towards the truth.
How often have I pulled forth some ancient fact from the depths of my brain, only to discover that it was fundamentally untrue? Worse still is how those facts actually made it into my head in the first place. Some were obviously from disreputable sources, but others had come from well-known authorities, and were still incorrect. My problem is not loss or corruption of memory, it is the opposite: these 'facts' stay for far too long. If I just dumped them faster, I might find that what remained was more accurate overall.
It is oddly telling. It allows us to guess that this huge degree of inaccuracy in our current knowledge is actually some type of pointer towards the future. The Renaissance was an awakening about the world that we live in. A moment when we first opened our eyes and saw it for what it actually was. This in turn laid the foundations for the industrial age, where we learned to create and use an unlimited number of machines. One of those machines, the computer, has driven us into an information age, where we collect huge piles of information about virtually everything in this world. There is a trend here. The next age will follow along in this sequence.
Even though we have built up a tremendous collection of fantastic machines, they do not serve us well. We can build things, but we have trouble maintaining them. Our massive and complex cities crumble around us. We are forever fighting a losing battle against entropy; like a runner that has leaned too far forward, we are continually off balance. We must continue to build to move forward. We don't know how to preserve what we have, and we don't know how to live within our environmental means. We grow at a severe cost to the world around us.
With all of our equipment and learning, collecting data is still a hit-or-miss proposition. We just guess at what we want to collect and how it is structured. It is not orderly and we don't have any underlying theories that drive our understanding. Computer Science is still so young that it is frequently wrong. Often it is just random guessing. We are currently only utilizing a small fraction of the capabilities of our computers because we keep bumping into complexity thresholds each time we try to build truly sophisticated systems. We are trapped with crude software.
Even though we can collect the data, we continuously fail to mine or interpret it. We gather the stuff, format it and then save it to backup tapes. But we get little actual use from all of our work in collecting it. Some decisions are made from the data, but given the real amount of underlying information contained in our efforts, we could use what we have to make really sophisticated decisions, and to actually know for a fact that we are changing things for the better. If only we understood what we have.
THE YELLOW BRICK ROAD
Given those trends, it is not hard at all to predict the future, the next step in the sequence. The path we must take is the only one available: machines that labour for us physically and mentally lead to vast infrastructures and vast piles of information. We built up these things, but we don't understand them, and we have trouble keeping them going.
It is not like we will wake up one day and the light will get turned on, but I imagine that over time, like a dull and steady wind blowing away the haze, much of what we know will become clear and finally fit into place. It will be a modern-day Renaissance, reoccurring not with our perspective of the world around us, but with our perspective of the information and knowledge that we have collected. It will take time. Many years, decades or even a century or two, but one day there will be an 'age of clarity', where mankind can finally see the information around it for what it actually is. That is, if we survive the turmoil of our current societies; so many dangers await us because of what we know, but don't yet understand.
And what could we expect in such an age? I imagine that we will have a real understanding of information, probably based on a currently unknown science. Maybe several. We will know how to quickly and conveniently monitor and collect information for any question. Inherently, we will understand the truthfulness of what we collect, and we'll be able to immediately use this information to ascertain whether things are improving or getting worse. The term 'immediately' is one of the key points.
Unlike now, this won't be a big effort, but rather something simple that people do as a matter of due diligence. Government effectiveness, for example, will be based on simple, true numbers that show whether things are improving or getting worse. Unlike the statistics of our day, these numbers and their interpretation, based on science, will be irrefutable. We will be able to show cause and effect relationships between policies and real life. We will be able to measure the effectiveness, not guess at it. If we say things are getting better, it won't just be 'spin'.
Underneath, if we capture enough data, we will get a vibrant picture of all of the relationships, how they fit with each other and what they really mean. When we choose to make changes, they will not be partially-informed guesses; they will be tangible, deterministic improvements to our societies that will work as expected. In the same way that the industrial age leapt from wildly building things to the reproducible industrialization of products with a tremendous amount of consistent quality, we will shift our understanding of the data around us. Like the difference between B&W photography and color, we will learn how to start really capturing the information that is of real value, and we will learn how to really interpret it.
AT THE EDGE OF REASON
Does it sound overly deterministic or crazy? Whatever comes in the future, it must be something that isn't here now. So, if it isn't pushing the envelope of convention, then it's not really much of a prediction, is it? Jules Verne wrote about ships that travelled underwater, a famously crazy concept if ever there was one, except that it has now become commonplace. He wrote about airships, defying gravity and hanging in the sky with the birds, clearly another bit of wackiness. Yet this too is common, and rather boring now.
Does it sound very similar to what we have now? For all we know, we know so very little. We have many approaches and methods to really prove things, to get to the real underlying truth, but because we can't do that easily on a grand scale we are awash in misinformation. All of this low-quality knowledge chokes our pathways and keeps us from progressing. It becomes food for subjective arguments and endless discussions. And while some of us may suspect falsehoods, proving them false is costly and often distracting. We can't fight all of the battles all of the time, so the low-quality stuff washes over us like a tsunami.
Sometimes when I am out walking the dog, my mind drifts around to us being so sophisticated that there is not much left in this world that we don't know. That 'proposition' is comforting in many ways, but patently false. Like the pre-Renaissance societies, we think we have reached some level of sophistication, but we barely even realize how to keep our own existence from spiraling out of control. And what we don't know is the answer to the question: "what do we really know?" We feel pride in having built up knowledge bases like the World Wide Web, but realistically these things are a mess. What good is a massive, unorganized pile of data if we can't use it to answer the serious questions in our lives? We live in an age where subjective arguments are possible for most of what we commonly deal with in our lives. Everything is up for grabs; everything is based on opinion. We can barely distinguish the quality of our facts, let alone position them into some coherent and universally correct structure of the world around us. For all that we know, we are still incredibly ignorant.
The next big thing, then, is obvious. If we are to survive, then we have to pass through the Clarity Age. We have no choice. If this understanding hasn't already popped into someone else's brain, it was bound to sooner or later. You can't get very far down the path if you don't know where the path lies. It is murky now, and for us to progress it must be clear.
Woof, woof, woof! My thinking and wanderings were interrupted because the dog spotted a raccoon. I was ripped from the depths by the pulling, jumping and barking. In the here and now, I am reminded that it is best if I move on quickly to keep the dog from making too much racket. I don't want to wake up my whole neighborhood with the commotion. However much I long to spend time in the future, I must live with the world around me as it is now. These dark ages are apt to last a while, possibly my entire life. I ought not to waste it, pining for enlightenment.