Tuesday, February 3, 2009

In Expression

The other day I was reading a recent issue of National Geographic. It was a story on Charles Darwin, and the author, David Quammen, was speculating about when and where Darwin finally came upon his famous theories. I found it interesting, since I could easily imagine that just prior to Darwin's 'aha' moment most of what was circulating around his head were vague notions of some hazy concept. Pieces sure, but not the whole thing, and certainly not a refined version. Ideas don't just pop into heads that way, as complete pieces. He, in a sense, was beginning to formulate the knowledge, but he had no way of expressing it.

The big difference between a partial understanding and really 'getting' something is being able to express it in a simple understandable manner. You might have a reasonable amount of knowledge around a specific topic, but when you sit down and try to explain it, the gaps suddenly become extremely noticeable. You know something when you can express it. And you know it really well, when you can express it in multiple ways.

Those ideas of thought interest me because we can apply them to Computer Science too, i.e. you cannot write what you don't know. A programmer flailing away with only a vague notion in mind will not be successful, almost by definition. If they don't know what they are writing, it will not work, because they can't express it.

Even more interesting is that, in some sense, certain languages are going to help in assembling ideas more quickly. People often talk about how the Inuit and other far northern cultures have a huge vocabulary for snow. I'm not sure if that's really true, but certainly many natural languages have been influenced directly by their environment. In that sense, a speaker in a specific language with more appropriate words is more likely to be able to cross that knowledge gulf to the other side and express their ideas more clearly.

The elements of spoken knowledge -- our vocabulary -- assist us in understanding them.

But even with a limited natural language, if your vocabulary is wide and domain specific it certainly makes it far easier to extend your underlying knowledge to new points, particularly if they are not too far from the initial ones. If you know how to express several big ideas, you can build on them to create even bigger ones.

Expression, then, is more than just formulating a correct syntax. It's finding a suitable arrangement to communicate something complex, whether to another human, or to a machine for execution. In a human sense, it's about taking those vague threads in one's understanding and actualizing them into a coherent stream of information.

In a computer sense, it's about taking those vague notions of possible functionality, data and user needs, and actualizing them into a complex structural form in multiple computer languages as a system.

How we ourselves assemble the bits to create knowledge is similar in many ways to how we as system analysts assemble the parts to create specifications. Both bring order to the chaos. Both actualize vague notions.


EXPRESSIVENESS

If you've set out to create a theory of evolution, even if you haven't named it that, the first thing you do is to collect as much underlying related information as possible. A trip around the world would do, for example. It's on those base facts that you'll build your ideas.

In creating a big software system, the designers and analysts set out with the same goal. They, on deciding which problems to solve, collect a huge amount of base information in a very domain-specific format. If you talk to enough people, preferably experts in the domain, from all of the different perspectives, you can assemble a deep and complex picture on which you can build.

Given that effort, it makes the most intuitive sense to want to do the absolute least amount of translation necessary in order to express that domain-specific understanding into a form directly usable by a computer. Translation is inherently dangerous.

We've collected the data in a domain-specific format. Shouldn't we try to utilize it there as much as possible?

That translation work, often ignored by language developers, underscores the success that languages like COBOL and APL have had over the decades in forming the basis of many systems.

COBOL is a particularly verbose and clunky language, but for your standard business application it fits well with the domain, minimizing translations. COBOL was certainly one of software's most popular languages, and it's highly likely -- given the failure of most modern technologies in displacing the older, cruder, yet way more stable mainframes -- that it still accounts for most of the data, and certainly most of the world's mission critical data (your bank accounts, for instance, are likely held by a mainframe with COBOL; if they're not, you may want to consider changing banks).

Similarly APL, which is a matrix-oriented language, was hugely popular with those disciplines solving mostly matrix problems like actuarial science, better known as insurance. Translating from the natural mathematical domain of the problems into some procedural or object-oriented paradigm opens up considerable dangers. Translating to something closer to the domain is considerably safer. Bugs come from misunderstandings, but way more bugs come from mistranslations.

It seems rather obvious that we'd like to avoid translating our domain problems into other more complex formats, but we keep pursuing technologies that show this feature very poorly.

Software development is about solving technical problems as well as domain ones, and so much of the industry prefers to tackle the technical problems. They are simpler, more straight-forward, and generally black and white. Domain problems are bigger, uglier and often very grey. With that difference, it's no wonder the technical problems hold more of a fascination for programmers.

Unfortunately, a purely technical solution solves no real world issues directly on its own. Technical solutions all need to be embedded into domain-specific solutions to find their real value in this world. The trouble comes, not from a technology like a database, but from how we use it in our customer relationship system.

Not surprisingly, most of the modern popular languages focus heavily on solving technical problems, while absolutely ignoring anything in the domain spectrum. The best languages, however, make up for this a bit by allowing themselves to be extendable. The programmers then, if they put in the effort, can tailor the language to become more domain-specific, hopefully encapsulating the technical and lower-level domain details deep into the foundations, away from most of the other coders.
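
As a rough sketch of what that tailoring can look like, here is a small Python example. Everything in it is invented for illustration: the `Money` type, its `of` constructor and the invoice figures are hypothetical, and it only handles non-negative amounts. The technical detail (rounding, storing whole cents) is buried in the type, so the code at the bottom reads closer to the domain.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP


@dataclass(frozen=True)
class Money:
    """Hypothetical domain type: a non-negative amount held in whole cents."""
    cents: int

    @classmethod
    def of(cls, amount):
        # Parse a decimal string such as "19.99" and round to whole cents.
        value = Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        return cls(int(value * 100))

    def __add__(self, other):
        # Adding two amounts adds their cents.
        return Money(self.cents + other.cents)

    def __mul__(self, quantity):
        # Multiplying by a whole-number quantity scales the cents.
        return Money(self.cents * quantity)

    def __str__(self):
        return f"{self.cents // 100}.{self.cents % 100:02d}"


# Domain-level code: three items at 19.99 plus a 4.50 delivery charge.
total = Money.of("19.99") * 3 + Money.of("4.50")
print(total)  # prints 64.47
```

The coders working at the top level never touch the rounding rules; they only see amounts, quantities and totals.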


QUALITY

If you were going to write an academic paper, you'd be very careful in choosing your language. Most disciplines have evolved over time, so there are well-known concepts that everyone uses in order to work through the mechanics of their problems. The denotations and connotations of the underlying terminology grow ever larger as each new paper builds on a continuing theme. In that way, the pieces get bigger and bigger.

But, assuming that the underlying papers survived time and peer review, the quality of the upper levels of work also gets intrinsically better. How? As the terms grow, and become larger generalizations, the pieces become far more static. That is, they are harder to put together in an incorrect manner. If you're using the right terms, in the right way, as previously defined, their depth means that the allowed permutations for interconnecting them are reduced, otherwise you'd be violating the definitions. Your peers, presumably, would notice this right away.

That also applies to computers, although for programming languages it is a lot more obvious. In assembler for example, a programmer might easily forget to push or pop something on the stack, causing a bug. Skip up to the higher abstraction in C, and the compiler itself does all of the pushing and popping, eliminating most, if not all stack problems. But, at that particular abstraction level, pointers can be easily manipulated. Thus, C programs suffer horribly from a lot of loose pointer errors. Memory management is also up to the programmer, causing another common set of bugs.

As we work ever higher, the lower-level problems disappear, but new ones surface, although smaller and less debilitating. Java, for instance, cannot have a loose pointer, and although possible, it is far harder to leak memory. On the other hand, threading problems are rampant, and the systems are so big, bloated and bulky that they've exceeded the growth of the hardware. The problems are still there at the higher level, but they've become way smaller.

It's far more likely that a group of programmers will get a reasonable system done in Java than it is that they will get the same one done in assembler. While it's possible, a system in assembler even half the size of a modest Java one would be an uncontrollable bear to keep running. Way, way too much work.

What does this have to do with quality? Well, those problems, seen as indirect inputs into the process of building a system, are a natural byproduct of the constructive process. Bugs, I am saying, are impossible to avoid. Bigger bugs are presumably harder to find, and more work to fix.

If an underlying step up in abstraction, almost by definition, causes smaller problems, then it is also indirectly taking the system closer to a higher quality. Although not entirely linear, 4 huge pointer bugs are a far harder and more time-consuming problem than 30 little typos. If your language doesn't allow pointer bugs, and hasn't nicely replaced them with some other equivalent bug like threading problems, then that step upwards comes with a noticeable step up in quality.


ABSTRACTIONS AND THINGS

Although it's obvious that a higher-level abstraction will increase the quality, it is not always obvious that a 'different' abstraction is a higher one. Java programs and C programs share an instability, although their underlying causes are very different.

But the paradigm itself, as an aspect of the programming language, may also have a big effect.

We deal with most things in our real world in a 4 dimensional sense, and as I've often said, it affects the way we structure things and our language itself. The object-oriented paradigm is a way to model the world around us as a series of atomic 'objects' each made up of some code and some data. This particular model mixes the limited expressibility of static data with the more broad expressibility of running code in order to provide an atomic element that is flexible enough to span our common 4 dimensional functional space.

In a sense, it breaks down every element in our world into a model of a 'thing' (data) in 'time' (code).

That model, it turns out, can be applied to all things in our world, but most people applying it think that the code attributes should be utilized properly. That's nice, but much of what we seek to represent in a computer is happily static data, of the very boring 3D kind.

An inventory system for a restaurant, for example, need only keep track of what's in and what's out for the current time period. I.e. the 4th dimension is not used or particularly desired for the system to run. Modeling an inventory system with no time constraints in an object-oriented framework forces the designers to translate pretty simple data into dangerously state-driven objects. It is a complex translation that we know how to do, yet one that is often done incorrectly.
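
To make the 'boring' version concrete, here is a minimal sketch in Python of one period's inventory held as plain data: what came in and what went out. The item names, the quantities and the `on_hand` helper are all made up for the example.

```python
# Inventory for the current period as plain static data:
# item -> (quantity received, quantity used). All values are invented.
inventory = {
    "flour_kg": (50, 32),
    "eggs": (240, 180),
    "butter_kg": (20, 20),
}


def on_hand(stock):
    """Return what is left of each item: received minus used."""
    return {item: received - used for item, (received, used) in stock.items()}


remaining = on_hand(inventory)
print(remaining)  # prints {'flour_kg': 18, 'eggs': 60, 'butter_kg': 0}
```

Nothing here carries state from one call to the next. The table is the model, and the single function is a plain calculation over it.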

The generalness of the object-oriented model imposes an order of complexity on any problems where that specific model is not necessary. That, it turns out, is a lot more places than most people realize. History -- time in particular -- is not often applied to modern computer systems, and rarely applied well or consistently. Which means, that for the bread and butter of modern systems we are going to a huge amount of extra work, trying to force our real-world view into a paradigm that makes it way harder to express.

It fails often, as one would expect it to.


LANGUAGES

Ultimately, we'd like a collection of technologies that makes it easy to express the various different problems we encounter frequently with software development. We want the underlying solutions to be generalized, but we want to tailor specifically what sits on the top to be very domain intensive.

Many large organizations find that their systems provide an edge of competitiveness, so there is always value for any corporation in distinguishing itself through process. Since the computer systems are coming more and more often to define the process, they still represent important competitive areas.

The closer we are to expressing things in their domain-specific formats, the closer we are to massively reducing the presence of bugs. Certainly, it takes some of the fun out of programming. People love to struggle with overly complex logic and fragile systems, but as an industry we have to move past having our low success rate rooted in our passion for hacking. Sooner or later, someone is going to figure out how to make it more reliable, so sooner, hopefully, the techniques in building things are going to change.

Going back to languages, C programmers used their own code or libraries to manipulate complex data structures like hash tables. The expression of access to a hash table was through pointers, so consequently, even if the underlying library was solid there were many problems with getting hash tables to work in a typical C program.

Perl, on the other hand, has hash tables, also known as associative arrays, built right into the language. This higher-level paradigm means that the Perl interpreter can perform semantic as well as syntactic checks on the code, allowing the language to prevent the user from utilizing it incorrectly. C, on the other hand, knows nothing of its programmer's intent; the hash table code is just like any other code in the system. The syntax can be checked, but only in the most minimal sense.

Does that matter? Associative arrays as an expression paradigm provide a large set of ways of handling complex problems that are more difficult in a straight-forward functional language. Fluent Perl programmers can write smaller, better quality solutions for a specific class of functions that would be far harder and more volatile in languages like C or Java. Text processing, for example, can be easily scripted to help summarize or transform data. For some technical problems, and some domains that are plagued by smaller data collection variations, utilizing Perl can be orders of magnitude more efficient and produce significantly higher quality. It just comes intrinsically from using a more appropriate language.
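
As a rough illustration of that kind of summarizing script, here is a minimal sketch in Python, whose built-in `dict` plays a similar role to a Perl hash. It is a stand-in, not Perl itself, and the `count_by_status` function and the sample log lines are invented for the example.

```python
def count_by_status(lines):
    """Summarize log lines by their last whitespace-separated field."""
    counts = {}  # built-in associative array: key -> running total
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        status = fields[-1]
        counts[status] = counts.get(status, 0) + 1
    return counts


# Hypothetical sample data, made up for this sketch.
sample = [
    "GET /index.html 200",
    "GET /missing.html 404",
    "",
    "POST /login 200",
]
summary = count_by_status(sample)
print(summary)  # prints {'200': 2, '404': 1}
```

There is no pointer handling and no library to wire up; the lookup, insertion and growth of the table are all the language's problem, not the programmer's.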

Although Perl is a more complex language than either C or Java, for specific classes of problems it is a more appropriate one. That is true for just about every language. Each and every one, with its own syntax and paradigm, has areas of greater competency, which always means time and quality.


FINAL THOUGHTS

I've written no less than three different versions of this post. Each one grinding to a crashing halt, as the text becomes hopelessly incomplete and lost.

Like Darwin, before his ideas became clear, the notion and sense that there is something extremely important about how and what we express is filtering about in my head. I feel as if I am just circling around the outside of some deeper understanding. Expression, as an issue to Computer Science, is far more important than whether Ruby is better than Java, or Python is prettier than Perl. It's more than static versus dynamic typing.

It pops up, over and over again, and I know that it is the key to getting to that next plateau for us. The technologies we currently have impede our expression in the same way that a crude language like Klingon or Elvish would make it hard to express some extremely complex scientific papers. It might, as if programming in assembler, be possible to pound out the full and entire text, yet the underlying pieces are too small and too frail to allow that sophistication.

Our next leap in technology, which we need soon, comes from examining our current abstractions, and finding an extension or even something completely different. More so than the C to Java leap, we don't just want to trade one problem for another. We want another Assembler to Java leap where we will well and truly bury a lot of complexity in the underlying levels that will never be seen again. In that way we can move forward and build the size and complexity of systems that will finally utilize a computer for what it can really do.