Wednesday, December 9, 2009

Quickie: Let it be Free

I was thinking about how the "scientists" in the Climategate scandal are hiding their raw data. If the science is really objective, then by its own merits other researchers will reach the same conclusions.

If you're afraid that they won't, then you're hiding something.

As such, no scientific researcher should ever hide their data. Ever! Not only that, but their work should be out there too -- formulas, theories, methodologies, everything -- to be judged not only by their peers, but also by the public at large. If science is a quest for the truth, then the truth has nothing to fear from inspection.

But I realize that it is not just our disgraced climate researchers who have adopted policies of secrecy. You can't, for example, get most of the academic work in Computer Science unless you join the ACM. In fact you can't get most works in academia unless you join some tight-knit clique. It's all closely guarded.

Long ago, I could understand how these organizations needed to charge money in order to recoup the costs of publishing on paper. It makes sense that they shouldn't take a loss. But amazingly that changed in the Information Age, and instead of offering free downloads it appears that most organizations choose to monetize their IP (business speak for "go for the profits").

You could make some well thought-out arguments about how the funds collected go back into the organizations to help spread their good works, but oddly one can easily suspect that most of the income is really used to protect the revenue generation (business speak for "stuff that causes profits") and pay salaries.

Academia -- data, papers, methodology, everything -- should be free and done out in the public. It should be open, and visible. Approachable. Companies hide their research because they are locked in competitive battles. Universities (and other research institutes) are not companies.

Research is not competition; it should be co-operative. And it shouldn't matter who contributes (particularly when we live in a day and age where so many people are both eager to help and have the available time).

I should never be told that I can't download a paper unless I pay a big sum of money. It should be free.

Saturday, December 5, 2009


First, some Administrivia. Given that I haven't sold any copies for quite a while now, I've decided to make the electronic copy of "The Programmer's Paradox" free for a while (a few months probably). It can be found at Lulu:

Lack of sales is really my fault, as I was never really strong about pushing the book (as it was never edited as well as I would have liked). It was my first real act of writing, so please go easy on me :-)


While I was fiddling with prime numbers, I also had another collection of un-related ideas invade my brain. At the high level, it involves how we have constructed Computer Science as a formal system.

Computers deal with two distinct things: code and data. This is an inherent duality in the basic concept.

However, all of our theoretical models are extremely "code-centric". That is, they focus on the boundaries of the code within a formal system, and leave the data as just a secondary player.

This is true of Turing machines, deterministic finite automata and all of the other formal models of computing that I am aware of. The data exists, but only as an unimportant set of "symbols" that come and go on some "tape". It's the code we are really studying, not the data.

As I've said before, that is not entirely unexpected, as most people's introduction to computers comes from a code perspective, and they see the machines in terms of the user functionality and how they can apply it.

Still, the counter-approach of seeing computers in terms of machines that assemble mass piles of data is an equally correct, yet significantly less complex way of visualizing them.

Code is intrinsically complex and hard to unwind; data is structurally complex and is mostly normalized by virtue of its usage. That is, there are an infinitely large number of ways to write a program, but there is only a small (and finite) number of reasonable mappings from the internal data back to reality. The data is either correct, or it isn't (not accounting for bugs, of course).


So, I started thinking about some way to model the data "within" the system, without having it overshadowed by the code.

I've worked out a number of the basic elements, but these ideas are far from complete. Because of this, I'll try to include my rationale in this discussion, along with the rather sparse ideas, in the hope that collectively these ideas will get pushed farther along by anyone interested in them.

Starting at the top, if we aren't interested in the code, then it shouldn't appear in the model explicitly. With that in mind we can define a transform T that maps the data from its value A to another value B.

T(A) -> B

A transformer is some block of (unknown) code that takes an explicit set of inputs, and returns an explicit set of outputs. To make life easier, there won't be a concept of global variables. Everything necessary for the transformation must be passed in explicitly. And there is also no concept of "side effects". Everything changed in any way is passed back outwards.  Transforms can return multiple values.

So we might have something more complex like:

T(a,b,c,d,e,f,g,h) -> A,B,C

A transform that takes 8 parameters, applies some type of algorithm to the incoming data and returns three other new parameters. It's a nice simple start.
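To make the shape of a transform a little more concrete, here is a minimal Python sketch. The function name and the particular outputs (a sum, a minimum and a maximum) are purely illustrative assumptions; the model itself says nothing about what the code inside a transform does, only that all inputs are passed in explicitly and everything that changes is passed back out.

```python
from typing import Tuple


def transform_example(a: int, b: int, c: int, d: int,
                      e: int, f: int, g: int, h: int) -> Tuple[int, int, int]:
    """Hypothetical transform T(a,b,c,d,e,f,g,h) -> A,B,C.

    No globals are read and nothing outside the function is modified:
    the three return values are the only results.
    """
    inputs = (a, b, c, d, e, f, g, h)
    total = sum(inputs)      # A: illustrative output only
    smallest = min(inputs)   # B: illustrative output only
    largest = max(inputs)    # C: illustrative output only
    return total, smallest, largest
```

Calling it twice with the same eight values always yields the same three results, which is the property the model relies on.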

Now, none of this would be interesting if we couldn't use this to model something relevant. What might be interesting to know, is "given that there is some infinite number of possible algorithms for a class of transforms (or there are none), what are the bounds of these algorithms as they are running?"


I woke up one night, a while back, with an idea about how to use transforms to calculate the bounds on algorithms. While asleep it seemed to be a very strong idea, but awake I've been unable to fully pin it down. It's mostly about continuously decomposing the algorithms, to get a better idea of how they work, but without ever having to examine the inner workings themselves.

The general idea is pretty simple (if it works). While we can't just splat a bunch of algorithms into the transforms and calculate the bounds, we can sub-divide the transforms into smaller ones (endlessly) until we can use those atomic pieces to calculate the best and worst performing boundaries.

So for instance, we might be interested in studying the class of "sorting" transformers. That is, any transformer that takes N elements in one arrangement and returns them in another re-arranged "sorted" (lexically, numerically, etc.) order.

If we were going to look at the trivial case, then we find that:

Ts(A) -> A [ O(1) .. O(1) ]

In English, this says that any transform that "sorts" a single entity requires at minimum "constant time" to sort, and at maximum, also constant time to sort. (I used a range notation, but I'd guess that using little o and big O notation may be more expected, or acceptable.)

What's really interesting here is to figure out the range of growth of the performance of the family of algorithms. What I'd really like to know is what is the range for:

Ts(A1,A2,...,An) -> Aw,Ac,...,Am

The various different families of algorithms that differently sort through a list of N elements.

We can deal with this by taking the two bounding cases at the same time; the best and worst case.

In the best case, the elements are sorted because the algorithm "knows" the behavior. That is the above algorithm decomposes into

Ts(A1,A2,...,An,L) -> Al,P1
Ts(A1,A2,...,An,K) -> Ak,P2
Ts(A1,A2,...,An,M) -> Am,Pn

That is, for each different element (L,K,M) at a time, a transformation takes the whole list and the current element, and returns that element and its position in the new list.

Since we're still working with the best case, there are N decompositions of the original algorithm that generate N positions, which are used to re-arrange the list. Each decomposition, if we are still in the best case, could use something "intrinsic" from its input to calculate P. As such, the best possible sort algorithm, which can use "intrinsic information" from its incoming data, can only ever re-order the list in n*O(1) time, which I think works out to O(n) time.
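A small Python sketch of that best case follows. It rests on a loud assumption that is mine, not part of the model: the list is a permutation of 0..n-1, so each element's value "intrinsically" is its sorted position. The function names are hypothetical.

```python
from typing import List, Tuple


def position_of(items: List[int], element: int) -> Tuple[int, int]:
    """Ts(A1,...,An,L) -> Al,P for the best case.

    Assumption: items is a permutation of 0..n-1, so the element's own
    value is its final position and no other element is examined: O(1).
    """
    return element, element


def sort_best_case(items: List[int]) -> List[int]:
    """Apply the N decompositions, one per element: n * O(1) = O(n)."""
    result = [0] * len(items)
    for element in items:
        value, position = position_of(items, element)
        result[position] = value
    return result
```

Without some intrinsic hint of this kind, each decomposition has no way to compute P in constant time.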

In the worst case, however, assuming that the code is all practical, and really does things, we can decompose the sort algorithm down into N transforms, each of which has to examine all N elements in order to calculate P. As such, it works out to n*n*O(1) or O(n^2). So we get something like:

Ts(A1,A2,...,An) -> Aw,Ac,...,Am [ O(n) .. O(n^2) ]
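The worst-case decomposition can be sketched in Python as a simple rank sort, where each of the N per-element transforms scans all N elements to count how many belong before it. This is only one illustrative instance of the worst case, with hypothetical function names, and not a claim about how real sorting algorithms are built.

```python
from typing import List, Tuple


def position_of(items: List[int], index: int) -> Tuple[int, int]:
    """Ts(A1,...,An,L) -> Al,P for the worst case.

    Every element is examined to compute P, so each call is O(n).
    Ties are broken by original index so equal values get distinct slots.
    """
    element = items[index]
    position = 0
    for other_index, other in enumerate(items):
        if other < element or (other == element and other_index < index):
            position += 1
    return element, position


def sort_worst_case(items: List[int]) -> List[int]:
    """Apply the N decompositions: n * O(n) = O(n^2)."""
    result = [0] * len(items)
    for index in range(len(items)):
        value, position = position_of(items, index)
        result[position] = value
    return result
```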


In an analogous manner to sorting, we could start examining the class of transformers that take a large number of inputs, and return a subset of "found" outputs. I won't go into too much detail, but at each level of decomposition we can reduce the number of arguments and split off a best and a worst case. In that way we might start with:

Tf(A1,A2,...,An,C1,C2,...,Cm) -> (null | Ak,...,Al)

We could simplify this by adding null in as an input:

Tf(null, A1,A2,...,An,C1,C2,...,Cm) -> Ak,..,Al

Then this describes the "find" family of transforms, that takes a set of input, and a set of criteria, and then returns nothing, or some smaller subset of "found" items.

At the very bottom, as a best case, a transform would simply "know" whether or not the data was what it was looking for. Otherwise the transform would have to look through everything, and determine which were the correct elements.

The best and worst cases are similar to those for sorting.
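As a rough Python sketch of the "find" family in its look-through-everything form: the transform takes the items and a set of criteria, and returns either nothing or the subset that matches. Representing the criteria as predicate functions, and the name `find`, are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence


def find(items: Sequence[int],
         criteria: Sequence[Callable[[int], bool]]) -> Optional[List[int]]:
    """Tf(A1,...,An,C1,...,Cm) -> (null | Ak,...,Al).

    Every item is checked against every criterion, which is the case
    where the transform has no prior knowledge of the data.
    """
    found = [item for item in items
             if all(criterion(item) for criterion in criteria)]
    # None stands in for the "null" result when nothing matches.
    return found if found else None
```

For example, `find([1, 2, 3, 4, 5, 6], [lambda x: x % 2 == 0, lambda x: x > 2])` keeps only the even values greater than 2.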


So why do I think this is useful? Primarily because it can be used to draw hard boundaries on a family of algorithms. From the standard Computer Science perspective, you can work with algorithmic complexity to find better algorithms, but you have no way of knowing if you've found the best one.

There is some computational bound which cannot be crossed, and it is nice to know where that bound is.

Of course in my examples, I took the best case to be a trivially not-possible case, but that was primarily because I really haven't gone deep enough to understand that lower bound in a better fashion. It's likely, from the output, and from decompositions, that we should be able to figure out a way to calculate a much more accurate best case boundary.

Another place where I think this could be useful is in studying the computational complexities themselves. We've had categories of complexity like NP for a long time, but we've been unable to make a dent in determining elementary issues like P=NP.

These problems too represent important boundaries on the things that we can, and would like to, do with our computers. Unfortunately, the studies in these areas have so far been tied to working with existing algorithmic problems, and with specific algorithms.

If we can work with the complexity classes, without having to descend into finding the actual algorithms themselves, we may be able to construct higher order proofs about our boundaries, without getting too wrapped up in the specifics of existing algorithms.

Another interesting issue about this approach is that it takes us one more step down a path I've talked about before:

With our current code-centric perspectives on software systems, it seems as if the systems can quickly exceed everyone's threshold of understanding as they swell in complexity.

What this means in practical terms is that while the system may start under control, with enough people contributing, and enough work getting done, the system rapidly moves into a "stupid zone", where it has become so large and so complex, that nobody can fully understand it anymore.

If nobody can understand it, then nobody can use it (fully), and as such it is just a monolithic nightmare of related functionality.

Once large enough, the only way programmers can deal with adding in new functionality is by carving off little sections and ignoring the rest. This divide and conquer strategy works, but it also means the functionality itself starts becoming disjoint, and the system gets even more complex as a result.

So we get these behemoths that are so large and so disjoint that people can't make use of the available functionality. We pass over some maximum point in combined functionality, where on the other side our usage degenerates horribly.

The perfect example of this is Microsoft Word. Maybe a few decades ago, most users understood most of the functionality and could use it well. Now, the software is so convoluted that most users are forced to use fewer features, just because the interactions between the features rarely do what is expected.

We used to heavily "type" our documents to allow the formatter to make them auto-magically consistent. Now, anybody would have to be crazy to turn on the formatting; it hasn't worked (behaved consistently) in years. So we're back to working with our documents in the most crude manner possible.

The general case is that if you keep building on some underlying complexity, at some point the complexity itself will overwhelm what you are doing, and become its own issue.

If instead of building up ever-increasing mountains of complexity, we just worked at one low level abstraction and used the machines themselves to dynamically assemble the complexity, we could avoid a lot of problems.

The abilities of our computers are strong enough that they can be used to map pathways through a sea of transformations. Intellect is necessary for "how" to do the transformations, but it is not necessary to calculate a deterministic path between a large series of them.
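A toy Python sketch of that idea: people write the individual transforms, and the machine finds a path between them. Everything here is hypothetical, including the registry, the "kinds" of data used as labels, and the three sample transforms; a breadth-first search is just one simple way to compute such a path.

```python
from collections import deque
from typing import Any, Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]

# Hypothetical registry: (input kind, output kind) -> transform.
TRANSFORMS: Dict[Step, Callable[[Any], Any]] = {
    ("csv_text", "rows"):
        lambda text: [line.split(",") for line in text.splitlines()],
    ("rows", "totals"):
        lambda rows: [sum(int(cell) for cell in row) for row in rows],
    ("totals", "report"):
        lambda totals: "\n".join(f"row {i}: {t}" for i, t in enumerate(totals)),
}


def find_path(start: str, goal: str) -> Optional[List[Step]]:
    """Breadth-first search for a chain of transforms from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        kind, path = queue.popleft()
        if kind == goal:
            return path
        for source, target in TRANSFORMS:
            if source == kind and target not in seen:
                seen.add(target)
                queue.append((target, path + [(source, target)]))
    return None  # no chain of transforms connects the two kinds


def run(start: str, goal: str, value: Any) -> Any:
    """Assemble the path mechanically, then apply each transform in order."""
    path = find_path(start, goal)
    if path is None:
        raise ValueError(f"no path from {start} to {goal}")
    for step in path:
        value = TRANSFORMS[step](value)
    return value
```

Here `run("csv_text", "report", "1,2\n3,4")` chains all three transforms without the caller ever naming the intermediate steps.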

We've based our current architectures and technologies implicitly on our existing theories. The Turing machine model drove the hardware construction and that drove the way we've built software. This in turn controls how we see our software problems.

We see every problem as a nail, because we started out with using hammers. And after a very long time of pounding nails into anything that moved, we're now seeing the intrinsic weaknesses in our approach to building things.

If we can't control and contain complexity, we can't exceed a fairly low threshold for the sophistication of our systems. We are stuck, just crudely re-writing the same things over and over again, with simple variations on the same underlying technologies. After twenty years, this lack of progress is getting boring.