Sunday, November 25, 2012

Theory and Practice

Nearly three decades ago, when I started university all I really wanted to learn was the magic of programming. But my course load included plenty of mathematics and computer theory courses, as well as crazy electives. “What does all this have to do with programming?” I often complained. At first I just wished they’d drop the courses from the curriculum and give me more intensive programming assignments. That’s what I thought I needed to know. In time I realized that most of it was quite useful.

Theory is the backbone of software development work. For a lot of programming tasks you can ignore the theory and just scratch out your own eclectic way of handling the problem, but a strong theoretical background not only makes the work easier it also is more likely to withstand the rigors of the real world. Too often I’ve seen programmers roll their own dysfunctional code to a theoretical problem without first getting a true appreciation of the underlying knowledge. What most often happens is that they flail away at the code, unable to get it to be stable enough to work. If they understood the theory however, not only is the code shorter, but they’d spend way less time banging at it. It makes it easier. Thus for some types of programming, understanding the underlying theory is mandatory. Yes, it’s a small minority of the time, but it’s often the core of the system, where even littlest of problems can be hugely time intensive.

The best known theoretical problem is the ‘halting problem’. Loosely stated, it is impossible to write some code that can determine if some other code will converge on an answer or run forever (however one can write an estimation that works with a finite subset within a Turing Machine and that seems doable).  

In its native form the halting problem isn’t crossed often in practice, but we do see it in other ways. First is that an unbounded loop could run forever. An unbounded recursion can run forever as well. Thus in practice we really don’t want code that is ever unbounded -- infinite loops annoy users and waste resources -- at some point the code has to process a finite set of discrete objects and then terminate. If that isn’t possible, then some protective form of constraint is necessary (although the size should be easily configurable at operational time).

The second way we see it is that we can’t always write code to understand what code is trying to do. In an offbeat way, that limits the types of tools we can use in automation. It would be nice for instance if we could write something that would list out the stack for all possible exceptions in the code with respect to input, but that would require the lister to ‘intelligently’ understand the code enough to know the behavior. We could approx that, but the lack of accuracy might negate the value of the tool.

Another interesting theoretical problem is the Two Generals Problem. This is really just a coordination issue between any two independent entities (computers, threads, processes, etc.). There is no known way to reliability get 100% communication if the entities are independent. You can reduce the window of problems down to a tiny number of instructions, but you can never remove it entirely. With modern computers we can do billions of things within fractions of a second, so even a tiny 2 ms window could result in bugs occurring monthly in a system with a massive number of transactions. Thus what seems like an unlikely occurrence can often turn into a recurring nuisance that irritates everyone.

Locking is closely related to the Two Generals Problem. I’ve seen more bugs in locking than in any other area of modern programming (dangling pointers in C were extremely common in the mid 90s but modern languages mitigated that). It’s not hard to write code to lock resources, but it is very easy to get it wrong. At its heart, it really falls back to a simple principle: to get reliable locking you need a ‘test-and-set’ primitive. That is, in one single uninterrupted single-threaded protected operation, you need to test a variable and set it to ‘taken’ or return that is it unavailable. Once you have that primitive, you can build all other locking mechanisms on top of it. If it’s not atomic however, there will always be a window of failure. That links back to the Two Generals Problem quite nicely, since where it becomes an issue is when you can’t have access to an atomic ‘test-and-set’ primitive (and thus there will always be problems).

Parsing is one of those areas where people often tread carelessly without a theoretical background, and it always ends badly. If you understand the theory and have read works like The Red Dragon Book then belting out a parser is basically a time problem. You just decide what the ‘language’ requires such as LR(1), and how big the language is and then you do the appropriate work, which more often than not is either a recursive descent parser or a table driven one (using tools like lex/yacc or antlr). There are messy bits of course, particularly if you are trying to draft your own new language, but the space is well explored and well documented. In practice however what you see is a lot of crude split/join based top-down disasters, with the occasional regular expression disaster thrown in for fun. Both of those techniques can work with really simple grammars, but then fail miserably when applied to more complex ones. Thus being able to parse a CSV file, does mean you know how to parse something more complex. Bad parsing usually is a huge time sink, and if it’s way off then the only reasonable option is to rewrite it properly. Sometimes it’s just not fixable.

One of my favorite theoretical problems is the rather well-known P vs NP problem. While the verdict is still outstanding on the relationship, it has a huge implication for code optimizations. For people unfamiliar with ‘complexity’, it is really a question of growth. If you have an algorithm that takes 3 seconds to run with 200 inputs, what happens when you give it 400 inputs? With a simple linear algorithm it takes 6 seconds to run. Some algorithms perform worse, so they may take 9 secs (3^2 -- three squared) to run, or even 64 seconds (4^3 -- four to the power of three). We can take any algorithm and calculate its ‘computational complexity’ which will tell us exactly how the time grows with respect to the size of the input. We usually categorize this by the dominant operators so O(1) is a constant growth, O(n) is growing linearly by the size of the input, O(n^c) is growing by a constant exponent (polynomial time) and O(c^n) has the size of the input as the exponent (exponential time). The P in the equation is a reference to polynomial time, while NP is rather loosely any growth such as exponential that is larger (I know, that is a gross oversimplification of NP, but it serves well enough to explain that it references problems that are larger, without getting into what constrains NP itself).

Growth is a really important factor when it comes to designing systems that run efficiently. Ultimately what we’d like is to build is a well-behaved system that runs in testing on a subset of the data, and then to know when it goes into production that the performance characteristics have not changed. The system shouldn’t suddenly grind to a halt when it is being accessed by a real number of users, with a real amount of data. What we’ve learned over the years is that it is really easy to write code where this will happen, so often to get the big industrial stuff working, we have to spend a significant amount of time optimizing the code to perform properly. The work a system has to do is fixed, so the best we can do is find approaches to preserve and reuse the work (memoization) as much as possible. Optimizing code, after its been shown to work, is often crucial to achieving the requirements.

What P != NP is really saying in practice is that there is a very strong bound on just exactly how optimized the code can really be. If it’s not true then there would be no possible way you could take an exponential problem and find clever tricks to get it to run in polynomial time. You can always optimize code, but there might be a physical bound on exactly how fast you can get it. A lot of this work was best explored with respect to sorting and searching, but for large systems it is essential to really understand it if you are going to get good results.

if it were true however, amongst many other implications, that would mean that we are able to calculate some pretty incredible stuff. Moore’s law has always been giving us more hardware to play with, but users have kept pace and are continually asking for processing beyond our current limits. Without that fixed boundary as a limitation, we could write systems that make our modern behemoth's look crude and flaky, and it would require a tiny fraction of the huge effort we put in right now to build them (also it would take a lot of fun out of mathematics according to Gödel).

Memoization as a technique is best known from ‘caching’. Somewhere along the way, caching became the over-popular silver bullet for all performance problems. Caching in essence is simple, but there is significant more depth there than most people realize, and as such it is not uncommon to see systems that are deploying erratic caching to harmful effect. Instead of magically fixing the performance problems, they manage to make them worse and provide a slew of inconsistencies in the results. So you get really stale data, or a collection of data with parts out of sync, slower performance, rampant memory leaks, or just sudden scary freezes in the code that seem unexplainable. Caching, like memory management, threads and pointers is one of those places where ignoring the underlying known concepts is most likely to result in pain, rather than a successful piece of code.

I’m sure there are plenty of other examples. Often when I split programming between ‘systems programming’ and ‘applications programming’ what I am really referring too is that the systems variety requires a decent understanding of the underlying theories. Applications programming needs an understanding of the domain problems, but they can often be documented and passed on to the programmer. For the systems work, the programmer has to really understand what they are writing, for if they don’t, the chances of just randomly striking it lucky and getting the code to work are are nearly infinitesimal. Thus, as I found out over the years, all of those early theory courses that they made me take are actually crucial to being able to build big industrial strength systems. You can always build on someone else’s knowledge, which is fine, but if you dare tread into any deep work, then you need to take it very seriously and do the appropriate homework. I’ve seen a lot of programmers fail to grok that and suffer horribly for their hubris.