Thursday, June 19, 2014


I was chatting with a friend the other day. We're both babysitting large systems (>350,000 lines) that have been developed by many, many programmers over the years. Large, disorganized mobs of programmers tend to create rather sporadic messes, with each new contributor heading further off in their own unique direction as the chaos ensues. As such, debugging even simple problems in that sort of wreckage is not unlike trying to make sense of a novel where everyone wrote their own paragraphs, in their own unique voice, with different tenses and different character names, and now all the paragraphs are hopelessly intertwined into what is supposedly a story. Flipping between the many different approaches is intensely headache-inducing, even when they are somehow related.

Way back in late 2008, I did some writing about the idea of normalizing code in the same way that we normalize relational database schemas:

There are a few other discussions and papers out there, but the idea was never popular. That's strange, given that being able to normalize a decrepit code base would be a huge boon to people like my friend and me, who have found ourselves stuck with somebody else's shortsightedness.

What we could really use to make our lives better is a way to feed the whole source clump into an engine that will non-destructively clean it up in a consistent manner. It doesn't really matter how long it takes, it could be a week or even a month, just so long as in the end the behaviour hasn't gotten worse and the code is now well-organized. It doesn't even matter anymore how much disk space it uses, just that it gets refactored nicely. Not just the style and formatting, but also the structure and perhaps the naming as well. If one could create a high-level intermediate representation around the statements, branches and loops, in the same way that symbolic algebra systems like Maple and Mathematica manipulate mathematics, then it would be straightforward processing to push and pull the lines matching any normalizing or simplification rule.
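To make that concrete, here's a minimal sketch of what one rule in such an engine could look like, using Python's own `ast` module (Python 3.9+ for `ast.unparse`) as a stand-in for the intermediate representation. The rule itself -- collapsing double negation -- is just an invented toy example; a real engine would have hundreds of these:

```python
import ast

class NormalizeNotNot(ast.NodeTransformer):
    """One tiny normalization rule: rewrite `not (not x)` into `x`."""

    def visit_UnaryOp(self, node):
        self.generic_visit(node)  # normalize children first
        if (isinstance(node.op, ast.Not)
                and isinstance(node.operand, ast.UnaryOp)
                and isinstance(node.operand.op, ast.Not)):
            return node.operand.operand  # strip both negations
        return node

def normalize(source: str) -> str:
    """Parse to the IR, apply the rewrite rules, emit code back out."""
    tree = ast.parse(source)
    tree = NormalizeNotNot().visit(tree)
    return ast.unparse(tree)

print(normalize("flag = not (not ready)"))  # -> flag = ready
```

Each rule is just a small tree rewrite; chaining many of them and re-running until nothing changes is exactly how the symbolic algebra systems grind expressions down to a normal form.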

Picking between the many names for variables holding the same type or instance of data would require stopping for human intervention, but that interactive phase would be far less time-consuming than refactoring by hand, or even with the current tool set available in most IDEs. And a reasonable structural representation would allow identifying not only duplicate code, but also code that is structurally similar yet contains a few different hard-coded parameters. That second case opens the door to automated generalization, which, given most code out there, would drastically reduce the code size.
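A crude way to catch that second case is to compare the shape of each function with its literals blanked out, so that two functions differing only in hard-coded values collide. A sketch, again leaning on Python's `ast` module purely for illustration:

```python
import ast

def structural_signature(source: str) -> str:
    """Dump the shape of the code with literals and function names
    blanked out, so structurally similar code gets the same signature."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant):
            node.value = None        # abstract away the hard-coded value
        elif isinstance(node, ast.FunctionDef):
            node.name = "_"          # ignore the function's name too
    return ast.dump(tree)

a = structural_signature("def f(x): return x * 3 + 7")
b = structural_signature("def g(x): return x * 10 + 42")
c = structural_signature("def h(x): return x - 1")
print(a == b, a == c)  # -> True False
```

Every collision is a candidate for generalization: the differing literals become parameters of a single shared function.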

One could even apply meta-compiler-style ideas to use the whole infrastructure to convert easily between languages. The code-to-intermediate-representation front end could be split away from the representation-to-code back end. That second half could be supplied with any number of processing rules and modern style guides, so that most programmers who follow well-known styles could easily work on the revised ancient code base.
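That split is easy to sketch: one function owns the parsing, the other owns the emitting, and everything interesting happens to the tree in between. Here Python's `ast` stands in for the shared intermediate representation, and plain `ast.unparse` stands in for a style-driven back end:

```python
import ast

def to_ir(source: str) -> ast.AST:
    """Front end: source text to a tree (Python's AST standing in
    for the shared intermediate representation)."""
    return ast.parse(source)

def from_ir(tree: ast.AST) -> str:
    """Back end: emit code from the IR. A real back end would take
    style rules, or target a different language entirely."""
    return ast.unparse(tree)

# Even a trivial round trip normalizes spacing and formatting.
print(from_ir(to_ir("x=(1+  2)*3")))  # -> x = (1 + 2) * 3
```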

Of course another benefit is that once the code was cleaned up, many bugs would become obvious. Asymmetric resource handling is a good example. If the code grabbed a resource but never released it, that flaw might previously have been buried in spaghetti, but once normalized it would be glaring. Threading problems would also be brought quickly to the surface.
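A toy version of that asymmetry check: flag any variable assigned from `open()` that never sees a `.close()` call. This is only a sketch of the idea -- a real tool would track aliasing, `with` blocks and control flow:

```python
import ast

def unreleased_resources(source: str):
    """Return the names of variables assigned from open() that are
    never close()d anywhere in the source."""
    tree = ast.parse(source)
    opened, closed = set(), set()
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Name)
                and node.value.func.id == "open"
                and isinstance(node.targets[0], ast.Name)):
            opened.add(node.targets[0].id)       # resource acquired
        elif (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "close"
                and isinstance(node.func.value, ast.Name)):
            closed.add(node.func.value.id)       # resource released
    return opened - closed

code = """
f = open('a.txt')
g = open('b.txt')
f.close()
"""
print(unreleased_resources(code))  # -> {'g'}
```

On normalized code, a whole family of checks like this one becomes nearly mechanical.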

This of course leads to the idea of code recycling. Why belt out new bug-riddled code, when this type of technology would allow us to reclaim past efforts without the drudgery of having to unravel their mysteries?

A smart normalizer might even leverage the structural understanding to effectively apply higher-level concepts like design patterns. That's possible in that functions, methods, objects, etc. are in essence just ways to slice and dice the endless series of instructions that we need to supply. With structure, we can shift the likely DAG-based representations around, changing where and how we insert those meta-level markers. We could even extract large computations tangled up with global variables into self-standing stateless engines. That capability alone would turbocharge many large projects.
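The before-and-after of that extraction is simple enough to show directly; the names here are made up for illustration:

```python
# Before: a computation tangled up with module-level state.
total = 0

def add_order(price, qty):
    global total
    total += price * qty

# After: the same computation as a self-standing stateless engine.
def order_total(orders):
    """Pure function: all state comes in as arguments, the answer
    comes out as the return value."""
    return sum(price * qty for price, qty in orders)

add_order(3.0, 2)
add_order(1.5, 4)
print(total)                                  # -> 12.0
print(order_total([(3.0, 2), (1.5, 4)]))      # -> 12.0
```

The stateless version is trivially testable and thread-safe, which is exactly why surfacing these buried computations pays off so quickly.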

With enough computing time -- and we have that -- we could even compute all of the 'data paths' through the code, showing how essentially the same underlying data is broken apart, copied and recombined many times over, which is always an easy and early target when trying to optimize code. Once the intermediate representation is known and understood, the possibilities are endless.
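A first approximation of those data paths falls out of the tree almost for free: for each assignment, record which other variables feed into it. The code being analyzed below is invented for the example, and this is only a crude stand-in for real data-flow analysis:

```python
import ast

def data_paths(source: str):
    """Map each assigned variable to the set of variables that feed it."""
    tree = ast.parse(source)
    paths = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
            # Exclude names in call position, keeping only the data inputs.
            funcs = {id(c.func) for c in ast.walk(node.value)
                     if isinstance(c, ast.Call)}
            paths[node.targets[0].id] = {
                n.id for n in ast.walk(node.value)
                if isinstance(n, ast.Name) and id(n) not in funcs}
    return paths

code = """
raw = load()
parts = split(raw)
merged = join(parts, raw)
"""
print(data_paths(code))  # 'merged' is fed by both 'parts' and 'raw'
```

Chasing these edges transitively gives the full path of each piece of data through the system, making the redundant break-apart-and-recombine churn visible.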

There are at least trillions of lines of code out there, much of which has been decently vetted for usability. That stuff bit-rots constantly, and our industry has shied away from learning to really utilize it. Instead, year after year, decade after decade, each new wave of programmers happily rewrites what has already been done hundreds of times before. Sure it's easier, and it's fun to solve the simple problems, but we're not making any real progress by working this way. Computers are stupid, but they are patient and quite powerful, so it seems rather shortsighted for us not to be leveraging this for our own development work in the same way that we try to do it for our users. Code bases don't have to be stupid and ugly anymore; we have the resources now to change that. All we need to do is put together what we know into a working set of tools. It's probably about time that we stopped solving CS 101 problems and moved on to the more interesting stuff.