Wednesday, December 9, 2009

Quickie: Let it be Free

I was thinking about how the "scientists" in the Climategate scandal are hiding their raw data. If the science is really objective, then by its own merits other researchers will reach the same conclusions.

If you're afraid that they won't, then you're hiding something.

As such, no scientific researcher should ever hide their data. Ever! Not only that, but their work should be out there too -- formulas, theories, methodologies, everything -- to be judged not only by their peers, but also by the public at large. If science is a quest for the truth, then the truth has nothing to fear from inspection.

But I realize that it is not just our disgraced climate researchers who have adopted policies of secrecy. You can't, for example, get most of the academic work in Computer Science unless you join the ACM. In fact you can't get most works in academia unless you join some tight-nit clique. It's all closely guarded.

Long ago, I could understand how these organizations needed to charge money in order to recoup the costs of publishing on paper. It makes sense that they shouldn't take a loss. But amazingly that changed in the Information Age, and instead of offering free downloads it appears that most organizations choose to monetize their IP (business speak for "go for the profits").

You could make some well thought-out arguments about how the funds collected go back into the organizations to help spread their good works, but oddly one can easily suspect that most of the income is really used to protect the revenue generation (business speak for "stuff that causes profits") and pay salaries.

Academia -- data, papers, methodology, everything -- should be free and done out in the public. It should be open, and visible. Approachable. Companies hide their research because they are locked in competitive battles. Universities (and other research institutes) are not companies.

Research is not competition, it should be co-operative. And it shouldn't matter who contributes (particularly when we live in a day and age were so many people are both eager to help, but also have the available time).

I should never be told that I can't download a paper unless I pay a big sum of money. It should be free.

Saturday, December 5, 2009


First, some Administrivia. Given that I haven't sold any copies for quite a while now, I've decided to make the electronic copy of "The Programmer's Paradox" free for a while (a few months probably). It can be found at Lulu:

Lack of sales is really my fault, as I was never really strong about pushing the book (as it was never edited as well as I would have liked). It was my first real act of writing, so please go easy on me :-)


While I was fiddling with prime numbers, I also had another collection of un-related ideas invade my brain. At the high level, it involves how we have constructed Computer Science as a formal system.

Computers deal with two distinct things: code and data. This is an inherent duality in the basic concept.

However, all of our theoretical models are extremely "code-centric". That is, they focus on the boundaries of the code within a formal system, and leave the data as just a secondary player.

This is true of Turing machines, deterministic finite automata and all of the other formal system's modelling of computing that I am aware of. The data exists, but only as an unimportant set of "symbols" that come and go on some "tape". It's the code we are really studying, not the data.

As I've said before, that is not entirely unexpected as most people's introduction to computers come from a code perspective, and they see the machines in terms of the user functionality and how they can apply it.

Still, the counter-approach of seeing computers in terms of machines that assemble mass piles of data is an equally correct, yet significantly less complex way of visualizing them.

Code is intrisically complex and hard to unwind, data is structurally complex and mostly is normalized by virtue of its usage. That is, there are an infinitely large number of ways to write a program, but there is only a small (and finite) number of reasonable mappings from the internal data back to reality. The data is either correct, or it isn't (not accounting for bugs, of course).


So, I started thinking about some way to model the data "within" the system, without having it overshadowed by the code.

I've worked out a number of the basic elements, but these ideas are far from complete. Because of this, I'll try to include my rational in this discussion, along with the rather sparse ideas in the hope that collectively these ideas will get pushed farther along by anyone interested in them.

Starting at the top, if we aren't interested in the code, then it shouldn't appear in the model explicitly. With that in mind we can define a transform T that maps the data from it's value A to another value B.

T(A) -> B

A transformer is some block of (unknown) code that takes an explicit set of inputs, and returns an explicit set of outputs. To make life easier, there won't be a concept of global variables. Everything necessary for the transformation must be passed in explicitly. And there is also no concept of "side effects". Everything changed in any way is passed back outwards.  Transforms can return multiple values.

So we might have something more complex like:

T(a,b,c,d,e,f,g,h) -> A,B,C

A transform that takes 8 parameters, applies some type of algorithm to the incoming data and returns three other new parameters. It's a nice simple start.

Now, none of this would be interesting if we couldn't use this to model something relevant. What might be interesting to know, is "given that there is some infinite number of possible algorithms for a class of transforms (or there are none), what are the bounds of these algorithms as they are running?"


I woke up one night, a while back, with an idea about how to use transforms to calculate the bounds on algorithms. While asleep it seemed to be a  very strong idea, but awake I've been unable to fully pin in down. It's mostly about continuously decomposing the algorithms, to get a better idea of how they work, but without ever having to examine the inner workings themselves.

The general idea is pretty simple (if it works). While we can't just splat in a bunch of algorithms into the transforms and calculate the bounds, we can sub-divide the transforms into smaller ones (endlessly) until we can use those atom pieces to calculate best and worst performing boundaries.

So for instance, we might be interest in studying the class of "sorting" transformers. That is, any transformer that takes N elements in one arrangement and returns them in another re-arranged "sorted" (lexically, numerically, etc.) order.

If we were going to look at the trivial case, then we find that:

Ts(A) -> A [ O(1) .. O(1) ]

In English, this says that any transform that "sorts" a single entity, requires at minimum "constant time" to sort, and at maximum, also constant time to sort  (I used a range notation, but I'd guess that using little o and big O notation may be more expected (acceptable)).

What's really interesting here is to figure out the range of growth of the performance of the family of algorithms. What I'd really like to know is what is the range for:

Ts(A1,A2,...,An) -> Aw,Ac,...,Am

The various different families of algorithms that differently sort through a list of N elements.

We can deal with this by taking the two bounding cases at the same time; the best and worst case.

In the best case, the elements are sorted because the algorithm "knows" the behavior. That is the above algorithm decomposes into

Ts(A1,A2,...,An,L) -> Al,P1
Ts(A1,A2,...,An,K) -> Ak,P2
Ts(A1,A2,...,An,M) -> Am,Pn

That is, for each different element (L,K,M) at a time, a transformation takes the whole list, the current element, and returns that element and it's position in the new list.

Since we're still working with the best case, there are N decompositions of the original algorithm that generate N positions, which are used to re-arrange the list. Each decomposition, if we are still in the best case, could use something "intrinsic" from its input to calculate P. As such, the best possible sort algorithm which can use "intrinsic information" from it's incoming data, can only ever re-order the list in n*O(1) time which I think works out to O(n) time.

In the worst case, however, assuming that the code is all practical, and really does things, we can decompose the sort algorithm down into N transforms, each of which has to examine every N elements, in order to calculate P. As such, it works out to n*n*O(1) or O(n^2). So we get something like:

Ts(A1,A2,...,An) -> Aw,Ac,...,Am [ O(1), O(n^2) ]


In an analogous manner to sorting we could start examining the class of transformers that take a large number of inputs, and return a subset of "found" outputs. I won't go into too much detail, but at each level of decomposition we can reduce the number of arguments, split off a best and worst case. In that way we might start with:

Tf(A1,A2,...,An,C1,C2,...,Cm) -> (null | Ak,...,Al)

We could simplify this by adding null in as an input:

Tf(null, A1,A2,...,An,C1,C2,...,Cm) -> Ak,..,Al

Then this describes the "find" family of transforms, that takes a set of input, and a set of criteria, and then returns nothing, or some smaller subset of "found" items.

At the very bottom, as a best case, a transform would either "know" about that the data was or was not it what it was looking for. Otherwise the transform would have to look through everything, and determine which were the correct elements.

The best and worse cases are similar to the space for sorting.


So why do I think this is useful? Primarily because it can be used to draw hard boundaries on a family of algorithms. From the standard Computer Science perspective, you can work with algorithmic complexity to find better algorithms, but you have no way of knowing if you've found the best one.

There is some computational bound which cannot be crossed, and it is nice to know where that bound is.

Of course in my examples, I took the best case to be a trivially not-possible case, but that was primarily because I've really haven't gone deep enough to understand that lower bound in a better fashion. It's likely, from the output, and from decompositions, that we should be able to figure a way to calculate a much more accurate best case boundary.

Another place where I think this could be useful is in studies the computational complexities themselves. We've had categories of complexity for a long time like NP, but we've been unable to make a dent in determining elementary issues like P=NP.

These problems too represent important boundaries on the things that can and would like to do with our computers. Unfortunately now, the studies in these areas have been tied to working with existing algorithmic problems, and with specific algorithms.

If we can work with the complexity classes, without having to descend into finding the actual algorithms themselves, we may be able to construct higher order proofs about our boundaries, without getting too wrapped up in the specifics of existing algorithms.

Another interesting issue about this approach is that it takes us one more step down a path I've talked about before:

With our current code-centric perspectives on software systems, it seems as if the systems can quickly exceed everyone's threshold of understanding as they swell in complexity.

What this means in practical terms is that while the system may start under control, with enough people contributing, and enough work getting done, the system rapidly moves into a "stupid zone", where it has become so large and so complex, that nobody can fully understand it anymore.

If nobody can understand it, then nobody can use it (fully), and as such it is  just a monolithic nightmare of related functionality.

Once large enough, the only way programmers can deal with adding in new functionality is by carving off little sections and ignoring the rest. This divide and conquer strategy works, but it also means the functionality itself starts becoming disjoint, and the system gets even more complex as a result.

So we get these behemoths that are so large and so disjoint that people can't make use of the available functionality. We pass over some maximum point in combined functionality, where on the other side our usage degenerates horribly.

The perfect example of this is Microsoft Word. Maybe a few decade agos, most users understood most of the functionality and could use it well. Now, the software is so convoluted that most users are forced to use less features just because the interactions between the features rarely does what is expected.

We used to heavily "type" our documents to allow the formater to make them auto-magically consistent. Now, anybody would have to be crazy to turn on the formatting, it hasn't worked (behaved consistently) in years. So we're back to working with our documents in the most crude manner possible.

The general case is that if you keep building on some underlying complexity, at some point the complexity itself will overwhelm what you are doing, and become its own issue.

If instead of building up ever-increasing mountains of complexity, we just worked at one low level abstraction and used the machines themselves to dynamically assemble the complexity, we could avoid a lot of problems.

The abilities of our computers are strong enough that they can be used to map pathways through a sea of transformations. Intellect is necessary for "how" to do the transformations, but it is not necessary to calculate a deterministic path between a large series of them.

We've based our current architectures and technologies implicitly on our existing theories. The Turing machine model drove the hardware construction and that drove the way we've built software. This in turn controls how we see our software problems.

We see every problem as a nail, because we started out with using hammers. And after a very long time of pounding nails into anything that moved, we're now seeing the intrinsic weaknesses in our approach to building things.

If we can't control and contain complexity, we can't exceed a fairly low threshold for the sophistication of our systems. We are stuck, just crudely re-write the same things over and over again, with simple variations on the same underlying technologies. After twenty years, this lack of progress is getting boring.

Sunday, November 29, 2009

Spaghetti Code

And now back to our regularly scheduled programming :-)

I think I've manged to get past my prime fetish for a while (although I'm still intrigued by Goldbach's conjecture). Fun stuff, but clearly not good for one's popularity. At least somebody said my diagrams were pretty ...

Continuing on from my earlier post:

My second set of laws was:

Fundamental Laws of Spaghetti Code:

1. You can't understand it just by looking at it.
2. Repetitive code will always fall out of synchronization.
3. The work will always gets harder and slower as it progresses.
4. Fixing one bug creates several more.
5. You can't enhance the code if you are spending all of your time fixing the bugs.

But first, some comments from the earlier post. Astrobe commented:
Spaghetti code:
I think that characterizing "bad code" is interesting only if one says by which process the original code "degrades" or "rusts". For instance:
"backwards compatibility is a poison for software"

Hans-Eric Grönlund also commented (on a later law, but still relevant here):
"2. Well-structured software is easier to build."

While I can't argue with the above "law" it makes me think. If well-structured software really is easier to build then why isn't everybody building it so? Why is so much of our code, despite good intentions, ending up as "bad design"?
Could it be that well-structured software is easy to build but requires good design, which is where the experienced difficulty lies?

I figured for this post I could dig deeply into my notions of "spaghetti code". All and all it is actually a controversial issue, since one programmer's spaghetti approach can be another's set of well-applied principles. And vice versa.


Back in school, in one of our computer theory courses, we were taught this useful concept of assigning a "number" to various different programs. In the days of statically compiled code that was pretty easy to accomplish because the number was just the bits of the code itself.

We can interpret all of the bits in an executable as one giant string representing a fairly huge number. A massive number, but still just a number.

Now, if we take the uniqueness of a program to mean that it handles a specific set of inputs by producing a specific set of outputs, and that two programs are identical if the inputs and output are the same, this leaves us with what are essentially "groups" of programs that perform nearly identically to each other.

From the outside, the programs all do the same thing, but they do not necessary execute the same statements in the same order to get there. Nor do they use the same amount of resources (CPU, memory, etc.)

Simplifying this a bit, for any set of user functionality there are an infinitely large number of programs that can satisfy it. Infinitely large? Yes, because there is always code you can add to a program that doesn't change the results, but does force the program itself to be larger (unless your compiler is super-intelligent) and more complex.

For every possible program, we can construct an infinite number of larger programs that do exactly the same thing, but use up more space, memory, disk, etc. without providing any extra functionality.

Within this group of programs, some of the larger numbers are clearly going to be obfuscated for absolutely no reason. They may contain more instructions, but the instructions contribute nothing towards the overall functionality. There are an infinitely large number of these variations.

Interestedly, something similar is probably true about some of the smallest versions (numbers) of the code as well. Way back I wrote about simplification:

Our human perspective of simple is not the same as the computers. We don't actually prefer the "simplest" version at any specific level, but rather something that has been well-balanced as it has been reduced.

For example, simple could come from the code being contained in one great giant function or method, instead of being broken down into smaller well-managed calls. It's simpler with respect to the overall size of the code (and the run-time), but certainly not decomposed into something more manageable for a human to understand.

Because of that, for many of these groups of programs, the smallest numbers are likely simplified in ways that are not easy for us to understand. They've not desirable versions.

So, along this range of numbers, representing programs, we find that if we were looking for the best, most reasonable, version of our program, it would be located somewhere towards the middle. The smaller code lacks enough structure, while the bigger code is probably arbitrarily over-complicated; just a tangle of spaghetti. We need something balanced between both sides.


If we are interested in a a reasonable definition for spaghetti code, it would have to be any code were the internal structure is way more complex than necessary. Where the complexity of the code, exceeds the complexity required. Most of the problems lay at the structural level.

Almost by definition you can't mess with the individual lines themselves. Well, that's not entirely true. You can use something like a pre-processor to put a complex second layer over the primary language one.

You can also use the language features themselves to make any conditions grossly complex, or use some crazy number of pointer de-references, or any other syntax that ultimately amounts to the programmer being able to "hide" something important in the line of code. It's not the syntax itself, but what it obscures from the processing that matters.

Now, in general, hiding things isn't necessarily a bad thing.

We do want to "encapsulate away" most of the complexity into nice little sealed areas where we can manage it specifically and not allow it to leak over into other parts of the program.

However, the success of what is being done, is related absolutely to what we are trying to hide in the code. That is, if you hide the "right stuff" (coming up from a lower level), then the code appears far more readable. If you hide the "wrong stuff", then it becomes impossible to tell exactly what is going on with the code.

The different "levels" in the code need to be hidden from above, but they should be perfectly clear when you are looking directly at them. Everything in the code should be obvious and look like it belongs, exactly where it ended up.

We don't want it to be a giant tangled mess of instructions. To avoid being spaghetti code, there must be some reasonable, consistent structure overlaid, that does more than just add extra complexity.

The last thing programs need is unnecessary complexity.

Design patterns are one technique that are often abused in this manner. I saw a post a while ago about how one should replace most switch statements with the strategy pattern.

It's a classic mistake, since by adding in this new level of complexity, you shift the program into one of the higher numbers in its range. But there have been no significant gains in any other attribute (performance, readability, etc) in order to justify the change. If no useful attributes come out of the increase, then the code has passed over into spaghetti territory. It is arbitrarily more complex for no real reason.

The motivation for using the strategy patten in this manner is simply to have a program that is deemed (by pattern enthusiasts) to be more correct. The reality is just more complexity with no real gain.

It's a negative effort, but oddly a common approach to programming. It turns out that one of the seemingly common characteristics of many programmers is their love of complexity.

We're a bunch of people that are always fascinated by small complex, inter-working machinery, so it is no wonder that the second greatest problem in most people's code is over-complexity (the first is always inconsistency, the third encapsulation).

Most programs contain far more machinery than is actually necessary to solve the problem. Many contain far more machinery than will actually ever be used at any point in the future to solve the problems.

From all of the this we can easily understand that over-complexity essentially equals spaghetti code. They are the same thing. And that spaghetti code makes it hard to verify that the code works, find bugs and extend future versions.

All in all, spaghetti code is a bad thing.


Now getting back to the (possible) fundamental laws, the first one was that:

1. You can't understand it just by looking at it.

I'm not sure that's an entirely correct statement. It's more like: if you're skilled, and you should be able to understand it, but you don't then it is probably over-complicated.

Some code is just inherently ugly and hard to understand by any non-practitioners. But once understood by an experienced professional it can be managed.

The classic example is always the APL programming language. The language is a brutal collection of algebraic symbols, where the language experts use any vector processing trick in the book to avoid doing things like simple loops. If they can find some strange set of operators to accomplish their string processing for example, they're consider to be gurus in the language. If they start to use if/for like C processing logic, then they're just using the language wrong.

But it's not just complex syntax like APL. Any non-programmer would have a tough time understanding anything other than the most trivial C or Java. And they certainly wouldn't catch all the little ugly inherent subtleties in the language (like using StringBuffers and appends in Java).

If you give some C code to an experienced C programmer, while he (or she) might not understand the point or the domain aspects of the code, they certainly should be able to understand what the code is doing. And with some investigation (at the various levels in the code) they should be able to take that code and alter it in 'predictable' ways.

Even more to the point, in most languages, we have enough flexibly to encapsulate the different aspects of the logic so that virtually all of the code written in the system should look like it belongs as an example in a text-book. It should always be presented in a way that it looks like it is just a simple example of "how to do something".

Of course, anyone with experience on big systems knows that it is never the case. Sadly most of the code out there is horrifically un-readable. It's not any where close to being able to be used as an example. It shouldn't even see the light of day.

Wiith good conventions and some normalization there is no practical reason why it should be such a mess; still it ends up that way.

It's probably that our industry doesn't insist on reasonable code, and thus the code is ugly and messy. That allows horrible problems to hide gracefully, which ultimately makes it way harder to work with the code.

It's the incredibly stupid, but most common way, that groups of programmers shoot themselves in the foot.


2. Repetitive code will always fall out of synchronization.

So, by now we have all of these great acronyms like KISS and DRY, but that doesn't stop us from encoding the same thing, in multiple ways, in multiple files, over and over again. When we do this, it increases the probability that some future set of changes will cause a bug because two different "sources" of the same information that are no longer the same.

You gotten love it when someone sticks the "constants" in the code AND in the database AND in the config files. It's just an accident waiting to happen, and eventually it does.

If you consider the sum total of all of the work (code, scripts, installations, documentation, packaging, etc.), then there are a massive number of places where a relatively small amount of key information is replicated over and over again.

If it is fundamentally the same thing (even in a slightly different form) and it is contained in multiple places, and the programmers don't have the underlying discipline in process to make sure it is all updated, then at some point it will go "kablooey".

You can always trace a large number of bugs back to duplication information. It's a popular problem.


3. The work will always get harder and slower as it progresses.

If you've got a mess, then everything you're trying to do with it either makes it worse, or it takes way longer than it should. Messes are the tar pits in which most programmers work daily.

What's really funny is how many excuses people will make for not cleaning up the code.

The favorite is invoking the term "history", as if that in itself is a reasonable justification for not cleaning up.

Another horrible one, mentioned earlier by Astrobe is "backwards-compatibility". A freakin huge amount of our unnecessary complexity has come from bad choices percolating upwards, year after year, under this banner.

Software vendors claim that their lazy-onion approaches to hacking code come from the market's need to support the older versions, but realistically there have always been "points" were it would have been acceptable to fix the earlier problems and remove the backwards compatibility.

For example, nothing has chained OSes to their own stinking tar pits more cataclysmically then their own marketing groups claiming to be backward compatible.

Apple is one of the few companies that ever really turfed their old weak and crippled OS (versions 1 to 9), to jump to a hugely different one with ten (OS X) and above.

A gamble, for sure, but proof that it can be done. Since for most OS versions you inevitable have to upgrade most software packages anyways, the whole notion of backwards compatibility being important is more myth, than good sense.

If the programmers don't fix the rapidly multiplying problems with inconsistent and bad structure, then how does one expect them to go away? Why wouldn't they make it more expensive to dance around? You can't bury your head in the sand and expect that the coding problems will just disappear. It won't happen.


4. Fixing one bug creates several more.

The correct description for this is of stuffing straw into a sack full of holes. As you push it in one side, it pops out the others. It's impossible to finish. A hopeless task.

Anytime a group of programmers generates more bugs then they fix, it is either because they don't have the necessary experience to understand and fix the code, or that the code itself is hopelessly scrambled.

In the first case, there is a significant number of programmers out there that will leap off quickly and start working on code that they do not have the pre-requisite knowledge to understand. It is endemic in our industry to think we know more than we do, to think that we are smarter than we actually are (and user comments don't help).

Still, I've seen enough programmers try to tackle issues like parsers or resource management or even multi-threaded programming without having anything close to enough understanding.

It probably comes from the fact that within the domain itself, we are certainly not experts, but also not even novices. We get a perspective of what our users are trying to accomplish, but particularly for complex domains, we could never use the software ourselves to meet the same goals.

I guess if we learn to become comfortable with our weak domain knowledge, that transfers over to our approach towards our own technical knowledge.

The crux, then, is that a group of programmers may be expending great deals of time and effort trying to fix some multi-threaded code, but to no real great effect. They just move the bugs around the system.

In truth, the programmer's lack of knowledge is the source of the problem, but even as they acquire more understanding, the code itself is probably not helping.

And that comes back to the second case.

If the code is structurally a mess, then it is very hard to determine, with only a small amount of effort, whether or not the code will do as advertised.

If the code is refactored, and cleaned up, then its functioning will be considerably more obvious. And if it is obvious, even if the programmer is not fully versed on all of the theoretical knowledge, they have a much better chance of noticing how to convince the code to do what the programmer desires.

Bugs, and how they related to each other become strong indicators of development problems.


5. You can't enhance the code if you are spending all of your time fixing the bugs.

The single biggest issue with code is time.

The time to write it, the time to debug it, the time to document it, the time to package it, the time to distribute it, and the time to correctly support it. Time, time, and less time drive all programming issues.

And as such, it is always so hard to hear programmers making excuses about how they don't have enough time to fix something, when it's actually that something that is eating up their time in the first place.

Strangely, for a group of people who like detail, we tend not to noticed the cause of our troubles too often.

Time is such a big problem, that I think that's why I always find it so disconcerting when I hear industry experts pushing processes that require large amounts of time, but don't return much in exchange. It is easy to make up a process to waste time, it is hard to get one to use it efficiently in a series of reasonable trade-offs.

But we seem to be suckers in wanting to find easy, and fun solutions, that don't respect the fact that we can't afford the luxury of taking our time.

Programming is at least an order of magnitude more expensive than most people are willing to pay for. It survives because the total final costs get hidden over time, and by a sleazy industry that knows that truth is certainly not the best way to sell things. Software sales and consulting teams often rival our worst notions of used-car salesmen.

So the real underlying fundamental question that every programmer has to always ask about every new process, and every new technology is simply: "does using this process or technology pay for itself in terms of time, or not?"

If it's not helping, then it is hurting.

One popular example these days is unit testing. Somehow, in some strange way, a bunch of people have taken a rather simple idea and sent it off on a crash course.

In the end, we should test most of the system in its most complete, most integrated state. Testing a partially built system, or just some of the pieces is not going to catch enough of the problems.

Testing is very limited, and based on guess-work, so it certainly is an area where we need to be cautious with our resources.

In the past I have built a couple of really great, fully automated regression test environments to insure that the attached products had exceptionally high quality. Automated testing assures that, but it is expensive. It's basically a two for one deal. You write twice as much code, and that will give you a small bump in quality. Expensive, but if that is the priority on the project, it is a necessary cost.

Still, if the automated tests are being run against the full and final integrated system then they conclusively prove, for the coverage of the tests, that the system is working as expected. You can run them, and then release the system.

A unit test, on the other hand is a very low-level test.

In the past, I've used them for two reasons: one is when the permutations of the code make it unreasonable to do the test manually at the integration level, but the code is encapsulated enough to mean that the unit test is nearly identical with an integrated manual one.

That is, I wanted to save time by not having a person test it, and I wanted the test to really permute through all of the possible combinations (which a person is unlikely to do correctly). 

The other times I've used unit tests are as scaffolding to work on some particularly algorithmically nasty code, that's been having a lot of trouble.

It's really just a  a temporary bit of scaffolding to make debugging easier.

Generally, once the code has stabilized (over a few releases) I nuke the tests because I don't want them laying around as possible work sink holes. They did their job, now it's time they get cleaned up. You put up scaffolding, with the intention of taking it down someday.

Unit tests are just more code in the system that are going to be wrong, need fixing, must be updated, and take time. Time that I'd rather apply to areas that need more help. To areas where we can be more successful (like cleaning up the code).

If the choice is: a) write a unit test, or b) refactor a mess, then between the two, I've learned that the second one has a way higher probability of paying off.

But in our industry, the first one has become a newly popular process. As such a lot of programmers have been attracted by it because they would rather write something new, than actually fix what they have. The avoidance of work appeals directly into most people's weaknesses.

Since no one can see that the code is a mess, and it is much more fun to write some new (but not too complex) code nstead of fixing the real problems, programmers choose to add new stuff. And the new stuff eats up more time than fixing the old stuff would have.

In the end, any poor use of time, gradually over the course of a development starts to show up badly. Our industry says that two thirds of all of our projects fail, which means that the majority of programmers don't actually have any real clue on how to build and release systems reliably.

My experience in working in many different organizations has shown that to be mostly true (although some large companies often support completely impractical projects, and that leaves the programmers thinking that they're success rates are far higher than they actually are).

With this law, I might have been more direct and said something like: recovering time lost to maintaining spaghetti code should be the first priority of any development project. It's a sink for wasted time, and it shouldn't be.

Once we figured out how to refactor (mostly non-destructively), we had the tools available to get our code bases in order. It's a critically bad mistake to not fix code early on. The longer it lingers, the more odorous the onion gets, the more resources that get tied into testing, support, patching, covering up and apologizing for the code.


We all know spaghetti code is bad, but it still shows up frequently in our systems. It is everywhere. In fact, one might guess that most of our code is grossly over-complicated in some easily fixable way. Mis-used of techniques like Design Patterns only make it worse.

Hans-Eric Grönlund was really close in his earlier comments. I do think it is  actually easier for programmers to build over-complicated code, than it is for them to build simple and elegant code. That accounts for the difference in volume between the two.

I think in general, it is way easier to be a bad programmer than a good one, and that most people choose the bad route (whether or not they know it).

Few humans are capable of hammering out minimally complex versions of something. It's not how we think. Even incredibly experienced gurus have to go back, again and again, to simplify their calls and make their work consistent. The first pass is always the brute force one. And the first pass is often an extremely over-complicated one.

We get bad code because, left on our own, we build bad code. And left again on our own, we'd rather do anything else, then cleanup our own mess. So the code stays around, and people just build worse things on top of it.

We are also an industry of extremists.

The Web always has someone, somewhere, taking some idea and pushing it way past its point of usefulness.

A good idea in moderation can become awesomely stupid when applied too pedantically. The machinery we build doesn't need to be complex, we don't want it to be complex, and complex isn't going to help.

So why do people keep dreaming up these crazy ideas that only obfuscate the underlying solutions? Why do programmers keep flocking to them?

Some of it, as I wrote earlier is about simplifying with respect to a different set of variables, but a lot of it is probably about ego.

Programmers have big egos, and they like to work on the hard projects. So, even if their task isn't particularly hard, they'll do everything in their power to make it as difficult and hard as possible. It's a common weakness.

In a similar manner, many people believe that writing big long complex sentences with lots of obscure terms make them sound smarter for example. But the truth is (and always has been), that the smartest people are not the ones that take a simple problem and make it appear complicated. They are instead the ones that make all of their solutions look as simple and trivial as possible. Ultimately that's what makes them famous, makes them stand out from the crowd. Being able to obfuscate something isn't intelligence, it's just patience, and an inability to see the larger picture.

The same is true in the programming world. The systems that survive are the ones that do their job as simply and painlessly as possible (that's why main-frames are still popular, they're ugly but straight-forward). The really complex stuff generally gets replaced fairly often (or never used). But there are just too many people in the industry trying too hard to prove that they are as smart as they think they are. And as such, we are awash in an endless sea of bad code.

Sunday, November 15, 2009

Prime Perspectives

It all started with a dream about infinity. Well, not just one infinity, but an infinite number of them, spanning out in all directions. A repeating pattern of infinities splitting off from other infinities. It's a simple pattern, but also a very complex one.

It was intriguing enough to compel me to spend some time exploring my thoughts a little deeper. I was captivated by the overall pattern.

Somehow, after playing around a bit, I associated the dream with the underlying structure for natural numbers (originally I though it might be related to the P=NP somehow).

I suspected that the structure for prime numbers was driven by some type of blending of an infinite number of far simpler infinite patterns; where a new complex pattern emerges as each simpler one is merged into the rest.

Prime numbers play an important role in Number Theory -- one of the most elementary branches of mathematics -- yet they are cast in all sorts of different devilish manners. They are surrounded by myths, superstitions and a long history of obsessions.

They draw a lot of attention because they are simple, yet hugely complex.
"Mathematicians have tried in vain to this day to discover some order in the sequence of prime numbers, and we have reason to believe that it is a mystery into which the mind will never penetrate." -- Leonhard Euler
Because they are a basic building block of our numbering system, they affect Computer Science heavily. Computers rely directly on primes for things like security, but also indirectly they affect all of our numeric calculations. They help to build up the intrinsic errors we live with when using a discrete space of digital numbers to represent a continuous space of real ones.

While primes rarely intercede directly on the day-to-day aspects of programming, they are always there, quietly in the background influencing our number crunching, and contributing to our errors.

What follows in this post are various different 'perspectives' I played with while exploring the world of primes. I'm not sure of their value, nor of their novelty. I suspect they've all been played with before in the past, but I've found little direct reference to any of them.

By necessity, this is a long, long blog entry. I could have broken it up into smaller pieces, but I really wanted to get all of the major elements out at the same time. Posting a bunch of smaller entries, all on the same day is essentially the same as posting one big long one, so I opted for keeping the entire work together.


Our standard number system is based around 10 symbols, but programmers are long familiar with the convenience of changing the number of symbols we used to represent our numbers. Binary (two symbols) and hexadecimal (16 symbols) provide well-know examples that are frequently used because they make it more transparent to read some properties of the numbers.

Roman numerals too, have their uses. Although there are normally only 7 symbols (14 if you include differentiating symbols with lines over them) in the scheme, it could easily have been extended to use an infinite number of symbols. However, they likely stopped at a finite number because allowing an infinite number of symbols would ensure inconsistencies in usage (everyone would pick different symbols for any that are undefined, and an infinite number means that there will always be some undefined).

Sometimes an alternative perspective on the same old underlying information is an extremely useful place to start exploration.

With that in mind, I started experimenting around with creating another base for the natural numbers. One which is based around having an infinite number of symbols.

In this case I wanted one symbol for each unique prime number. It is best described as the prime base, or 'base P' for short.

For convenience I will picked the symbols a, b, c, d, ... mostly because they help clearly differentiate the various properties of this encoding (but also because they are easier to type into my blog :-)

Using a rather common perspective of ignoring 1 as not being a prime number, I started by assigning the symbol 'a' to the prime 2. Then 'b' to 3, 'c' to 5, etc. Composite numbers are composed of strings of their prime factors, so 'aa' is 4, 'ab' is 6, 'ac' is 10, etc. To round it out, I used an empty string ("") to represent 1, and and the NULL string (or NULL character) for zero.

Thus, in numeric order (and ignoring 0 and 1 for the moment) we get the following sequence for natural numbers:

a  b  aa  c  ab  d  aaa  bb  ac  e  aab  f  ad  ...

which represent the normal base 10 sequence:

2  3  4  5  6  7  8  9  10  11  12  13  ...

All of the strings are 'normalized'. For any two different types of symbols, say a and b, if they are in a string together then the lessor one should always be written first.

Thus 'ba' is invalid and is actually identical to 'ab'. Both strings represent the number 6. Then 'dea' is actually 'ade'. They are the same, it is just that one is not normalized. 


There are a few interesting things that show up within base P.

The first most obvious one is that multiplication and division are very straight-forward. They are just string operations: concatenation for multiplication and deletion for division. For instance 42 divided by 6 looks like:

abd / ab = d

A number is divisible by another, if and only if the same symbols appear in both strings. Integer division is just a simple matter of removing some symbols. Testing for common divisors is just a matter of seeing what subset is the same in both strings.

If multiplication and division get really easy and simple in this new base, addition and subtraction become horrific. A few examples:

"" + aa = c

aabb - d = j

Both of these show that the operations of addition and subtraction are extremely non-intuitive. There is no easy way of doing these two operations. Interestingly this seems to be a swap from the finite bases, where summation is more intuitive and multiplication has always required learning some type of lookup tables.

Of all of the interesting things to do with these numbers, one caught my attention right away. It is best described as 'merging' series.

I started with the series:

a  aa  aaa  aaaa  aaaaa  aaaaaa ....

and seeing that as an 'axis' of some type, I crossed it with a similar axis:

b  bb  bbb  bbbb  bbbbb  bbbbbb  ....

Once I sorted these numbers into numerical order they gave me something like:

a  b  aa  ab  aaa  bb  aab  aaaa  abb  aaab  bbb  aaaaa  aabb  ...

When I placed both axises on a Cartesian-like graph, and then drew arrows from each number in the sequence to the next, I got this really interesting 'search' pattern:

This implies that the sequence is working it's way through each and every possible combination of the two axis in base P.

It's an interesting result given that I started out by looking for patterns composed of an infinite number of patterns branching off to infinitely.

This pattern comes from just a pair of infinite patterns, combined. If I start adding in the other axises, one by one, this could easily become a complex and diverse pattern that fully represents our basic number system.


Playing around some more, I decided to create a 'chart' with all of the base P strings, ordered numerically.

I placed the natural numbers in base P on the vertical axis, and each prime symbol itself on the horizontal axis. Because I did the original in a spreadsheet, I put the numbers on the vertical axis headed downwards, and the prime symbols on the horizontal axis headed right. But it could have been oriented horizontally as well.

My original table was vertically labeled AxisA, PlaneAB, DimABC, etc. as I was treating each repeating symbol sequence as an axis, and the whole sequence together as a search pattern throughout these various axises, as they get added into the sequence one by one.

While it was interesting chart, it was really noisy and hard to understand. So I simplified it, eventually to D1, D2, D3, etc. And then later to just a, b, c, .... Then I started highlight specific cells to see if I could find a pattern.

It was during the processing of simplifying the chart that I suddenly realized that there was a very simple deterministic pattern that was clearly flowing though all of the entries (I also realized that I had made a few mistakes in my original input).

After looking at it a bit, I simplified it into what I like to call the "Prime Product Chart". It is an easy and simple representation of the frequency and structure of prime numbers in our natural numbering system:

When I looked at the above chart, I knew I was onto something interesting. Instead of the complex patterns within patterns I was looking for, I had stumbled onto something far more simple.

This chart basically marks out each of the prime factors for each number. That is, where the number equals 0 mod P (is divisible by P) for all of the different primes in the columns. When there are no possible factors, a new prime gets created and added into the pattern.

Looking at the columns, we can see a clear and easily repeating pattern where the distance between each of the marks in the columns is the value of the prime itself. This is a pretty strong pattern, and at least indirectly it has been known for thousands of years.

The Greeks first discovered this as the algorithm named "the Sieve of Eratosthenes". It has been a pretty well-tested over the centuries (although historic computers -- even those just over a hundred years ago -- tended to have personal problems, make mistakes, get hungry and take occasional lunch breaks).

The algorithm generates the set of all primes for a fixed range.

It iterates through the series of numbers, marking off those which are known to not be prime. It does so in each iteration by marking off the factors of the last known prime number.

The first pass marks off every other number, the second every third number, the fourth every fifth one, and so on. One pass for each prime number.

In that way, eventually "all" of the factors for "all" of the numbers have been marked off to insure that any and all composite numbers are properly removed. What's left over are prime numbers.

The Prime Product chart makes the algorithm obvious. The chart can actually be interpreted as just listing out each different iteration of the sieve. Each column is one pass through the algorithm, and where each row is unmarked that is where a new prime is created.

The next prime in the algorithm, for the next pass, can always be found by moving forward from the last prime until a row of all unmarked cells is reached. It's a simple deterministic step.

From the chart we can see that the location (or not) of primes is a direct result of how each of these different columns interact with each other. As they gradually align together, new symbols are needed to fill in gaps in the pattern.

In that way, it is clear that primes are not laid out indiscriminately. They fall where they fall because of what is essentially a negation of the Chinese remainder theorem. When the next number has nothing in common with any of the previous symbols, a new symbol is necessary.

Where I marked the cells, I also added in the number of times the same symbol repeats as one of the factors 1, 2, 3, etc. There is clearly another pattern buried here with the number of repeated factors, but so far I haven't done anything fun with it. There were so many other interesting questions to play with first.

Getting back to it, a new prime occurs whenever there are no other possible symbols to fill in. Gaps always keep occurring (even when there are a huge number of existing primes), so that new primes are always added into the mix. The pattern doesn't stop (or even slow down that much). There are clearly an infinite number of primes that will get created.

From this chart we can make some interesting observations.

The well-known case where two primes fall closely together is a fairly obvious consequence of the fact that a lot of the earlier primes start coming closer into phase (being at 0 mod P at the same time), but at some point in the pattern the heavily repetitive factor 2 becomes the only reason that the primes are not one right after another. Thus, it enforces a minimal distance.

We'll also see later that it is entirely likely that any two closely related primes can appear at any location within the number system, e.g., that there are an infinite number of prime pairs.

In a way we can visualize any of the unmarked cells as holes through which new primes might be created, a sort of "filter effect",  but we'll get deeper into that later in the post.

Note that there is an 'a' symbol in every other string in base P (and a 'b' in every third, etc.). So the sieve itself is directly represented in base P.

Another point that will get way more interesting later is that like the two base P sequences I talked about merging earlier in this post, it is clear that the natural numbers go through each and every permutation of all of the currently existing symbols.

That is, for any set of normalized symbols Pa*Pb*...*Pm, where Pa can be the same prime as Pb, there is one and only one corresponding natural number, and there is always only one number.

The numbers in this sense, lay out all possibly permutations for the normalized strings in base P.


As I explored more deeply, I found a number of interesting things about this chart at the row level.

One easy thing is that it is clear that a prime only exists when there are absolutely no other earlier prime factors are available. That is, a prime occurs when none of the lower primes are 0 mod Pn, where Pn is the nth prime less than the new one.

The gcd function finds lines where each parameter P is 0 mod P, so we need an analogous function, possibly called the lowest common non-multiple (lcnm?) such that:

lcnm(v1, v2, v3, ..., vn) -> 0 != mod v1 && 0 != mod v2 && ... && 0 != mod vn

In this way we can say that:

Pn = lcnm(P2, P3, ... Pn-1)

The nth prime is the lowest common non-multiple of all previous prime numbers.

It's an extremely simple functional representation of the nth prime number (although it nicely glosses over the actual complexities of the numbers themselves).

Another thing I noted was that the minimum distance between prime numbers is always 2, since it is the most common factor. Interestedly enough, the maximum distance between primes changes radically. I watched it go up to around 150 for a small set of numbers, but it clearly goes higher. It is an easy assumption that it will grow to an infinitely large size.

That the distance between primes continues to grow is an important property of the structure of primes. Intuitively it tends to lead to incorrect assumptions.

I had always expected that primes are spaced out a lot farther from each other, as the numbers got bigger. From imagination I would have assumed that prime numbers are few and far between at the really high ranges.

But oddly it just isn't like that. They seem far more dense that one would guess. There are a huge number of massive primes. A surprisingly large number. There are problem a few massive ranges of "low density", but those are followed by ranges of more-or-less normal density (high-density) prime numbers.

The prime-counting function (Pi symbol) attempts to calculate a boundary for prime numbers, but the chart itself is also quite useful for showing how constrained and how dense the overall set of primes really is. There are some structural aspects to the chart that can really help illustrate the patterns.

There are always more symbols entering, but the distances are expanding at more or less the same rate. The density of the primes grows rapidly, but over a rapidly growing field of numbers.


What is very noticeable in the chart is that for any N prime numbers (or N columns), there is a block (of rows) in the chart that is exactly P1*P2*...*Pn cells in length.

We can call these N by P1*P2*...*Pn structures 'prime blocks'. As each one is essentially a basic building block coming from the Nth prime number.

That is, for the prime 2 the blocks are every other number. For the prime 3, the blocks repeat every 6 cells. For the prime 5 it repeats every 30 (2*3*5) cells.

Within the block, there are essentially two types of primes. Those that are inherently in the definition of the structure (primes <= Pn) and those that are essentially newly created, above or outside the structure.

I will refer to these as internal and external primes.

As we look at larger blocks in the chart, we see that the thickness of the block grows really slowly by one prime each time, while the depth grows really fast, increases at a massive rate by the product of the next prime.

When I first looked at prime blocks, I missed the fact that as the blocks get larger, the external primes repeat themselves more frequently. While the internal primes form a more-or-less deterministic pattern, the interaction of the different frequencies in the external primes makes for an increasingly complex pattern.

The blocks are essentially stretching rapidly, each time we include another prime, a point which is interesting. Important, because although while we've only explicitly calculated N prime numbers, we've actually tested (and  factored) P1*P2*...*Pn numbers.

That is, in the seive algorithm, as we generate larger and larger prime blocks, we get a sort of magnifying effect out into the number system, showing as all of the other related primes within that block.

Another thing to note is that prime blocks are finite structures for a fixed number of N primes. That is, they exist and get replicated over and over again. Some of the higher external primes will extinguish some of the 'possible' lower ones, but only when they have no common factors. In a way, we can see the blocks themselves a type of base 'filter', and then the external primes add another layer on top of that.

Judging by the overall density of primes, the extinguishing effect from external primes is relatively in-frequent, mostly primes that start in one instance of a block have a good chance of staying prime in any of the following instances (a point that I have been waffling on, so it deserves much more investigation).

At the end of every instance of an N block of primes, the last entry, the one at P1*P2* .. *Pn is always a composite number whose that is divisible by every one of the internal block primes. Lets call it the 'last index' number.

In base P that number is:

a  ab  abc  abcd  abcde  abcdef  abcdefg ...

The last index number is interesting because it delineates the end of the block, but also because of the numbers both on both sides of it. Since it is the last number in the first instance of the block, will call the next number the 'block-start' number.

Often in many cases, the block-start number is a prime.

Looking back at the chart we can see that 7 and 31 are very obvious block-start primes. The 6 pattern repeats on a few different primes until it gets cut down by a couple of powers of 5 (an external prime for the 2*3 block).

Now, it is extremely interesting to note that all this time that we have been assuming that 1 is not, by definition, a prime number, but it is a block-start number (shared for all of the blocks).

Initially it might seem like block-start numbers are always prime, since at least they are trivially co-prime with each of the internal prime numbers. However, the external primes easily make the overall structure significantly more complex.

So, for larger blocks, it turns out that block-starts aren't always prime. The smallest example where they are not is 30031, which is a block-start number for 2*3*5*7*11*13 (abcdef), and is divisible by the external primes 59 and 509.

Block-start numbers are also known as Euler Number and have been around for a long time.

My interest in them comes from mistakenly not realizing that some of the externals could actually extinguish them from being prime. This mistake lead to some very excited effort in trying to generate a massive block-start prime (I generated one with 116 million digits using only a home computer and a week's worth of computing). Oddly, while the number may be prime (or it may not), the sheer size of the number makes it hard to prove, one way or another.


A huge question in Number Theory has been to be able to accurately count the total number of prime numbers. Many great mathematicians have thrown themselves at this problem, always with a limited degree of success.

Perhaps the key issue is that the increasing size of the prime blocks makes the actual number of primes vary significantly for each of the block ranges.

There might not be a single function for the whole range, but possibly only ones for each of the the prime-block ranges (and then a limit as the blocks go to infinity).

I figured it was an interesting issue, but really only from the perspective of trying to show that the arrangement of external primes in very high-up ranges wasn't essentially random.

So I tried taking a new look at this problem but only as trying to find the count within a discrete closed range, may actually produce less complex results.

For instance, we can count the number of primes within specific prime blocks:

For block 2: There is one prime (assuming we don't count 1), which is 50%
For block 6: There are 3, which is 50%
For block 30: There are 10, which is 30%
For block 210: There are 46 primes, which is 21.9%
For block 2310:There are 343 primes, which is 14.5%

The percentages start cleanly, but quickly built up to be 'less obvious' numeric patterns. As the blocks grow larger, the numbers look weirder.

Interestingly enough, we know that the Nth block which is P2*...*Pn in size is actually the N-1 block repeated N times.

If we consider the prime-block itself to be a filter of some type, then for each instance of the filter we can work out the total number of "possible" prime numbers allowed in each block (certainly in terms of the previous smaller prime-blocks, if not in terms of some complete formula).

The big problem comes from having to subtract the correct number of external primes that extinguished any of the possible primes.

Although we could calculate the density of the external primes, we need a simple density function that takes into account overlaps (such as 59, and 509, both extinguishing 30031).

I highly suspect that this calculation is possible, but lately I haven't had the time to dig into it.


Another interesting issue that pops up all of the time is whether or not we could find one single continuous function for calculating prime numbers. For those not familiar with Binet's formula, this may seem like a near impossible task, but we actually been able to take certain classes of complex recursive functions like the Fibonacci sequence, and create functions from them.

It all starts with being able to get some type of recursive definition for the sequence.

Because we can see the structure of the primes, we can write a program to start at one position and find the next one. We can easily write a program to implement the sieve for instance.

Programs (to some degree) share the same expressive level with recursive formulas, so we can write a recursive formula for creating primes.

We start by stating that a prime number comes from the previous prime, plus some "gap" between the two numbers:

Pn = Pn-1 + Gn(Pn-1)

To make things easier, we want to be able to test to see if any number is not equal to zero modulus some prime. Because we want to use this test for recursive purposes, if the number is not 0 mod P, then we return 1, otherwise 0. In this way we can use it as a multiple, to drop out different terms if necessary. I think the following works:

T(x,m) = Ceil ( ( x mod m ) / x )

If x is 0 mod m, then the fraction is 0 / x, which is 0. Otherwise it is 1. With this function in mind, we can create a recursive function that tests a particular column in the chart to see if all of the cells are all != 0 mod Pk, for all Pk:

FTk(x,Pk) = T(x,Pk) + FTk-1(x, Pk-1)

This is a classic tail-end recursion that evaluates to 0 if all of the recursive levels are 0 mods. From here we can then create a function to calculate the gap simply by trying each row, and if it fails, adding one to the result and moving (via recursion) to the next row.

Gk(x) = 1 + T(FTk(x,Pk), 2) * Gk(x+1)

The outer test function T simply forces the results of FTk to be either 1 or 0, pushing us to the next row in the gap.

With a few initial starting values:

G1(x) = 1
G2(x) = Ceil(x mod 2 / x)

FT1(x, 1) = 1
FT2(x,2) = T(x, 2)
FT3(x,3) = T(x,3) + FT2(x,2)

We have a complete recursive representation of prime numbers.

Now, just because we have a recursive representation doesn't mean that we can find a generator function for this, or that we could find a formula for the Nth base prime.

While Binet's formula does work for the recursive Fibonacci series, the difference is in the fact that the recursion in the Fibonacci sequence is simple (and not entirely necessary, although it is the easiest way to see it). The recursion in the prime number formula is extremely complex and self-referential.

Oddly, it all comes down some issues about information. Although it's not explicit, the Fibonacci sequence contains within its composition a structure that can be represented in multiple ways. Once by a simple recursive function, and again by a more complex one for the Nth number that is based on golden ratios. Although the secondary representation is far from obvious, the "information" to make it work is implicit with the recursive formula and the number system itself (in the same sense that many computer programs depend heavily on the operating system to be complete).

Since it was non-intuitive, it certain opens up the possibility that there is something similar available (but unknown) that would work with the recursive prime function as well. Even more interesting is that this indirectly states that there is more to our axiomatic abstractions, then just the axioms themselves. In a way, they sit on a "consistent" and "logical" foundation, which itself has an intrinsic about of information, which indirectly controls the representations of the axioms and allows (or does not allow) for alternative isomorphisms.

We can see this as the number system itself (as a formal system) allowing for the mapping between Binet's formula and the Fibonacci sequence.

Getting jiggy with it, if we extrapolate this out a bit farther, we can draw a conclusion that there are likely an infinite number of alternative isomorphic representations, any of which can be built on some intrinsic information contained by the rather large space of an infinite number of possible formal systems.

That opens the door to easily saying that just because we have not found an obvious alternative representation of some mathematical object, doesn't in any way imply that there are not some other "useful" ones (there are an infinite number of non-useful ones :-). Or in short, for those that are interested in it, I wouldn't write off the possibility that P=NP just yet, no matter how counter-intuitive the problem appears.


The Prime Product Chart proved so useful, that I started searching for a similar type of representation for addition and subtraction.

Realistically, the Prime Summation Chart should have a similar structure to the Product one, but be focused on addition. The Product chart broke down the factors, essentially dealing with the sub-structure of the numbers. In a related fashion the summation chart will break down the base numbers based on partitioning them into prime terms.

It is believed that all numbers can be decomposed into either the sum of two or three prime numbers. With that in mind, if we are looking for arithmetic partitions, we should concentrate those that only involved prime numbers and those that involve a minimum number of them.

The Prime Summation Chart should be similar to it's multiplicative cousin, so once again it's the numbers going down and the lowest prime (in the partition) on the right.

Now with addition since there were so many different partitions I had to narrow it down somewhat to the more interesting ones. This version of the chart sits on the middle numbers for the next round.

I went that way initially because I suspected that partitions involving 1 were going to be trivial, and thus not interesting.

This is similar in many ways to the product version, except that it is using summation.

Wherever possible I tried to represent the numbers with the absolute minimum number of primes, but always primes, since they are the main building blocks.

Although, for each new row there were many different possibly partitions,  in the list of possible partitions I high-lighted the outside elements in orange, and the inside one in grey

Interestingly enough, in the chart is how the patterns of primes keep repeating downwards and to the left. This attribute actually makes the chart fairly easy to generate and verify.

Still, while this chart shows some interesting patterns, it is still fairly complex.

If choosing the center of the possible partitions produced something complex, it might be interesting to see what the trivial choice for partition produces:

I dropped most of the triple partitions, and always picked the most trivial partition for the next row (which I high-lighted in grey).

It is really interesting to see how the patterns of descending numbers are reproducing themselves again and again in each column. The nth prime column just starts the pattern all over again for each row. Generating entries in this table is actually a simple, manual process.

Initially most numbers are either a prime or a product of two primes. But 27 and 35 both don't have any two prime number representations.

The question, is whether or not there is always at least a three prime representation or if at higher numbers it will degrade to four, five, etc.

The answer to this lies in the fact that each new column repeats the same pattern over and over again. It is this property that controls how the higher numbers get generated.

The most obvious point is that 27 and 35 only become partitions of three because the gap between primes is greater than 4. That is, the next numbers are always going to be X plus the last prime. So it's P, then 1 + P, 2 + P, 3+ P, etc. Except that there is not single one-prime representation for 4, so it must become (1 + 3) + P.

Any time the gap between primes is equal to 4, that number can only be decomposed into three prime numbers, not 2.

So, extrapolating a bit, what happens when the gap between numbers reaches 27? If 4 was the first non-prime number that needs to be represented as 2 prime numbers, then 27 is the first prime number that needs to be represented as 3. Thus the smallest partition is going to be:

1 + 3 + 23 + Pk

Where k is some value such that Pk - Pk-1 >= 27.

From this we can guess that this particular circumstance will continue infinitely. Each new entry that can't be represented with some limited number of primes will force some new entry into the number system. So for any N, there is always some number that cannot be represented with less than N prime numbers (thus showing that some of the original guesses over the years were probably wrong).

What might be interesting is whether or not these numbers fall on even or odd numbers. It wouldn't surprise me to see them flip flopping between evens and odds, or just staying with the odd numbers themselves (there is an existing proof in this area which may or may not conflict with the above).

UPDATE:  The range between the primes 2971 and  2999 may be the first one greater than equal to 27. The number 2998 can be expressed as: 1 + 3 + 23 + 2971, but is there also some prime decomposition that is smaller (with only 2 prime terms?)? If so, then the Goldbach conjecture stands, if not then it is wrong. It's easy enough to brute force my way through this search, so hopefully in a few days I'll try, and then update the blog again.My guess is that there won't be another smaller decomposition into primes AND the number 1 + 3 + 23 + 2971 + Pk, where Pk - Pk-1 >=2998 will be an odd number. Thus implying that the minimal decomposition (into 2 or 3 primes) is neither true for odd nor even numbers.

UPDATE2:  Interestingly enough: 2998 = 2 * 1499 = 1499 + 1499. And the smallest decomposition is 2998 = 29 + 2969. Also the smallest composite that is 27 numbers away from a prime is 1354. Oddly, like 2998 the following relation also occurs: 1354 = 2 * 677 = 677 + 677.

The first gap of 35 occurs between 9551 and 9587. Also, 9586 = 47 + 9539 = 2 * 4793 = 4793 + 4793, which shows that it has the same underlying decompositions as the case with 27 (which is exactly as we would expect).

So all of these numbers have the property that Ck = 2* Pi = Pi + Pi. In that way, even if 27 and 35 can only be the sum of three primes, their pattern multiples (1354 and 9551) are back to being expressible with 2 primes. Which (loosely speaking) means that the conjecture is true (although I'd like to state it as every number can be minimally stated as the sum of 1, 2 or 3 primes).

UPDATE3: While I'm here: there are no two-prime decompositions of 27 and 35, because both numbers are odd, and except for 2, all primes are odd numbers. To be decomposable into two primes Pa and Pb, the following must hold true:

Pa = 27 - Pb   (or 35)

And since Pa can't be even (divisible by 2) Pb must be even (and thus not a prime).

(the concept of 'odd' numbers is clearly useful beyond things just being a factor of 2, so is it possible to extrapolate that to higher numbers, say 3, for instance? Could be call that "trodd" ? Or is it just Friday?)

UPDATE4: This is my last update, I promise. I just started thinking that every positive natural number (k) can be deterministically generated (in sequence) from the following rules:
  • The number is a prime number Pi
  • The number is an Nth entry away from a prime number, either as the last prime plus another, or the last prime plus two others. So Pi + Pa or Pi + Pb + Pc
  • The number divided by two is a prime number, so Pi + Pi
Then a stack-based algorithm to generate all of minimal prime sums for all of the numbers is:
  1. Check if the number is prime, if so use that number and reset the stack
  2. If not, take the last prime and the last numbers on the stack
  3. If the last numbers are one or two, use them plus the last prime
  4. Else divide the number by two, and use it twice
  5. Finally: add the representation (numbers) to the stack
The only question left over, is we know this appears to work for expansions of 1 + 3, but does it work similarly for the next two-sum expansion 1 + 5 ? The first gap where that shows up is between 89 and 97, where 95 = 89 + 1 + 5, and there are no smaller representations. If everything holds true, then

95 + Pk = 2 * Pi

where both Pk and Pi are primes, and there are no prime numbers in the range (Pk, Pk+95]. One quick check shows that this doesn't occur in less than numbers less than 25,000. Maybe I'll need another update after all :-)


Getting way back to generating large prime numbers, I suspect that it is entirely possible to work backwards from some point in the number space to finding any nearby associated prime numbers.

Although primes clearly have a pattern, to get to the Nth element in the pattern, one has to take in account all of the "information" from the previous N-1 elements.

For massive numbers, this is a huge amount of information. However I keep thinking it might be similar to solving a Sudoku puzzle.

In a Sudoku puzzle, there is a relationship that each number only appears uniquely per row, per column and per square (the major ones).

The puzzle starts with a few numbers filled in (which we can see as information). In order to solve the puzzle, readers can utilize all of the intrinsic relationships in order to absolutely assert the location of a specific number. It's not a puzzle where one needs to guess, you can "absolutely" be certain that a number appears in a specific location if and only if you have made the correct inferences from all of the available information.

As such, it is deterministic, and once someone starts the puzzle (assuming that it really is a valid Sudoku puzzle), it can always be correctly finished. No guessing, no wrong answers, no ambiguity.

Within the number system there are a larger number of 'partial information' holders that can be used to say something relative about a range of numbers. We know all sorts of relations that occur between numbers that do not need to be tied to all of the intrinsic structural knowledge.

That is, we could drop into a range of numbers and starting at one specific point, make associations between the various different sets of numbers. We can make some high level associations, but also some low-level ones, such as that some number is a power of 2.

If we can build up enough information, over a large enough range, we should be able to correctly infer that some of those numbers in this (increasing) range are prime numbers.

This is very similar to how I produced the Prime Product Chart for the block-start number 30031. I didn't have to compute everything up to that number in order to be able to diagram the number itself. I just started by placing 30031 in the center and worked my way outwards on both sides. Each time I factored the nearby numbers, I updated the chart to show there revised structure.

Of course for that example, I cheated somewhat. I had access to a UNIX factor program, which quickly and conveniently produced the correct factors, that I added to the graph.

Still, if factor can be seen as producing an "absolute" reference to some information (factors in this case), I suspect that there is a "relative" variant that could still provide reasonable information that we can use.

Even small operations are expensive on multi-million digit numbers, so to be practical we'd have to produce some pretty strong information based on very little related information. With enough of it, we could re-construct the surrounding numerical space, and then draw inferences about the elements in that space.


There is more. Lots more. I manged to get the major points out for each of my key focuses, but it'll probably take me a long time to sift through the details and find all of the rest of things of interest.

My biggest problem is being able to find the time to explore this world of primes to the degree that my curiosity is calling for. It's nice because it is simple and accessible, but often I find I have to think about things for a long time, or spend a lot of time reading books on topics like Number Theory.

Hopefully in the new year I'll get a chance to do some follow up work. I'd really like to try experimenting around with finding inexpensive "relative" relationships. Some intuitive inner-voice suggests that that might be very workable. Or perhaps, I might just wait again, until the next weird dream wakes me up in the middle of the night. I managed to get a fair amount of traction from that last one.

Friday, November 6, 2009

Programming Structure

I've been a little sluggish with my writing, lately. Probably because I've been very busy rewriting a big piece of code at work and I'm not much of multi-tasker (if it involves thinking).

Still, I need to get back to my exploring proposed laws, starting with the first set:

Fundamental Laws of Programming Structure:

1. All programming code is hard to write and has bugs.
2. The lower the code is in an architecture, the more you can re-use it.
3. The fastest way to develop code is by re-using it.
4. The best way to develop code is by deleting it.

Before I dive in, I thought I would start with xlr8's comment:

"the first thing I can think is: there is some difference between math laws and these."

Computers straddle mathematical abstractions and the real world. As such, software runs in an abstract system that references back to the real one. What we are really doing when we design and build systems is finding useful ways to bind the two together.

On the purely abstract side, a term like "theory" works well for describing something that has been rigorously proven to be true. Something that is abstract in nature.

That's nice, but most of the interesting or complex problems from computers come not from their roots in the pure mathematical space, but rather from how we tie that back to reality.

Mathematics can have a intellectual purity to it that is impossible in the real world. The real world, on the other hand effectively resists any type of intellectual containment. In that way, even if we are rigorous with our real world observations, there will always be exceptions.

Economics sets a fine precedent for dealing with the mixture and uses the term "law" for handling things are mostly proven to be true:

so I think that it also works well when we are discussing any Computer Science problems that fall on the "real world" side of the fence.

Software development in that sense will always be a "soft" science similar to Economics (which calls its 'hard' math side "econometrics"). We already have names for our "hard" math side, it's the references to the softer side of our jobs that get most confusing.

Given that, for any proposed "law", there will always be exceptions:

xlr8 also said:

"2 This leads to second point, definitions. In fact in math everything is well defined, here, it can not be. See the words i quoted in preceding point ("useful", "you can't", etc)."

As I go through the different sets, I will try and be more precise about what I mean for the various terms I am using. I usually have a very precise definition in mind, even if I haven't managed to document it well.

The issues that play out in the real world, are usually based around scheduling or development costs, both of which serve as very sharp and painful obstacles for software development projects. Software development is almost always primarily bounded by time and resources, it is after all, real world stuff that needs to function in order to be useful.


I really agonized over my first law for this set, mostly because I don't actually think programming is "hard", at least it is not hard all of the time. Parts of it are, but it shouldn't be large parts.

In fact, I know that the road to success for any project is to get the "hard" part out of the way as early as possible. If you're limping into the deadline with nothing but mindless effort between you and getting it done, you're pretty sure that you going to make it on time.

However, if a few weeks before you're supposed to deliver, you find that there is still some major chunk of the program that is unknown, and you haven't even worked out a battle plan for getting written yet, then you're pretty much sure that you ain't making that deadline.

If there are "hard" things to do between you and the finish line, then you're not managing the project well (even if it is not your fault).

Hard, as it comes about in programming, is either that it is hugely time-consuming, hugely resource consuming, or just requires a mass amount of thought. If it's beyond hard, then likely it's impossible (or near enough so).

No sane construction company would start building a structure that they knew was impossible, but oddly that's not all that uncommon in software development. We're often too eager to rush off and start coding.

Still, I've found that with experience comes an increasing ability to recognize "hard", and to learn to worry about it as early as possible.

In that way, if all is going well in a development project, then the second half should not be hard, it should just be work. And to some degree, it should be predictable (even to the point of always expecting the get one or two of those really nasty week-consuming bugs that grind everything to a standstill).

"Hard" is not something we want, need or should put up with for long periods of time. "Hard" is the first big obstacle in many cases, but there are ways to manage it successfully.

What I really should have said was:

1. All programming code is time-consuming to write.

which I think is far closer to the underlying reality of what I am trying to say.

Writing code, any code takes a very long time. And, to some degree, if it doesn't, it is because the project is saving time by accruing technical debt, but I'll get deeper into that point when I talk about enhancing software and some of the rules-of-thumb metrics I added under the commercial software laws.

Also, my second point about bugs deserves its own specific law:

2. All programming code has some unexpected behavior (bugs).

Bugs have always been a bit of a funny concept particularly in programming. We often blend several very different things together. Astrobe writes:

"- Programs have bugs because undecidability prevents automated checks
- Programs have bugs because they are written by humans, and errarum humanum est."

In the purest sense, a bug in a programming language really only exists when a compiler of some type refuses to continue building the executable code. In any other circumstance, including run-time, the computer is just doing exactly, and only exactly, what it was told to do (including a segmentation violation or core dump).

Ignoring undetected hardware failures (for now), a computer that does exactly what it is told, but still does something unexpected is more of a problem with our expectations and our point-of-view of what we think is right, rather than with its ability to perform a given task. The computer's not at fault, we are.

Al responded to Astrobe with:

"That said, it is very true that we cannot be sure that all bugs are detected automatically, because of undecidability (thanks to Mr Gödel and Mr Turing for the theories behind it). But, even without any theory about programming, the number of combinations for the inputs of any non-trivial program is far too large to do exhaustive testing (could the number of combinations be a beginning for a characterization of "non-trivial" ?).

And when the bug lies in the spec, well, we cannot write a jUnit that checks that the spec is what the customer meant ;)"

In fact I'd say that the far majority of the bugs I've had to deal with over the years have been rooted in domain issues; coming directly from communication problems with the customers or the users.

Like the coding itself, technical problems are easier to detect, and fix. Often they are fairly obvious in their solution. Domain problems on the other hard, are very hard to detect, and frequently very expensive to fix.

Nothing automated could ever detect that some information wasn't right if it doesn't have some context in which to compare it. Since the code is the first attempt to work with that domain information (correctly), it is unlikely to ever have a secondary code body that can validate the primary one (or we'd be using the secondary body as the primary one).

There is zero intelligence in computer behavior, above and beyond exactly what we explicitly encode there. Computers are not wrong, nor are they right, they just are ....

About the only thing we can do is maximize our use of the specific programming language features that force the compiler to be stricter in detecting that there is a problem with the code.

Of course, that opens up a severe trade-off where we are getting better value from our compilers automated checking, at the expense of having significantly more code that is explicitly more rigid. For some domain-critical algorithms, that is an appropriate, if not excellent trade-off, but for something more general code like a re-usable framework that is an exceptionally poor choice.


Projects die for all sorts of reasons, but often they die because the programmers squandered their resources and where never able to get past the mountain of technical debt they created.

The idea that every messy piece of code, every inconsistent file, and every quick hack eventually builds up to drive the forward momentum of a project to nearly zero is not a new one. Still, it is ignored so frequently that it is staggering. Making too many short term vs. long term decisions nicely encapsulates itself under the term "debt".

The term "debt" fits well because software development is really bounded by the real world problems from the domain, not by the mathematical ones or the technical ones.

No-one (that I know) has ever failed because they violated or crashed into the halting problem (although some clearly should). The theory sets some pretty broad limits, but the fatalities in software come from more mundane things.

Technical debt, like financial debt can build up to the point where it's effect is tremendous. Ignored it is frequently fatal.

In that way, my first goal on all of my development projects has always been to "make sure all aspects of the development are tangible". Due to my first point about programming being hugely time consuming, most projects are seriously short on resources before they've even reached the preliminarily design stage. Knowing this early on allows for one to make better trade-offs.

Limits are intrinsic in the creation of projects. If you decided to renovate your kitchen, chances are you're not wealthy enough to allow the contractors to spend "whatever it takes" to get the job done. There is some amount of money in your saving account that is bounding the work, before it's even been planned.

In all but a few cases, the same point is true in software. The value of the work, in either dollars, time or man-power is known long in advance of the design. When I was young, I can remember how much that irritated me, how I sensed it was an injustice. However, when I had to write the cheques my perspective changed, and fast. Controlling the resources is tad amount to controlling the risk. Controlling the risk is necessary for success.

One does want to lose their house because a renovation project went 200X over budget, that would be madness, and the same occurs in software.

My point of this little ramble is to justify, as much as possible, the importance for all software projects to leverage their code base to it's full extent. Just pounding out masses of code at a high rate is the same as just generating masses of technical debt.

Brute force coding may seem effective in the short run, but it eventually involves callously corning one-self far away from the door while trying to paint the floor, i.e., painting oneself into a very tight corner.

It's one of those stupid things that seems on reflection to be obvious, but is still done far too frequently by people who should know better. With that in mind, points 2-4 in the original laws might be summed up with:

3. Every line of code, comments and document counts as work.

And added to by the nearly obvious:

4. It is less work to update a smaller amount of code, comments and documentation.

And here I should point out that by "code" I not only mean all of the code that is written to deal with the technical and domain issues, but also any "code" that is written for packaging, automation and scaffolding (like testing). Code is code is code, it doesn't really matter what it is doing. Comments and documentation are just sloppy code (or code is just strict documentation). they are all the same thing, differentiated only by precision.

In the end, these are all things by which people have to expend effort to update, examine or remove. When there are more of them, there is more work.

Ignoring things (like the last version of the user manual) doesn't make them go away, it just builds up more technical debt.

By 'work', I mean any effort expended by any persons for the purpose of managing, containing, checking, meeting, talking about, updating, changing, cleaning up or re-writing or any other effort that somehow involves some amount of dedicated time.

It's easy to forget that even if some chunk of complex code is not receving constant updates, if it is coming up in discussions, lectures, meetings, etc. it is still "costing" some amount of effort on the people involved. If the coders all whine once a week about how bad a particular library is to use, that's still expended effort, even if it's not constructive effort (and bad morale heavily eats at development resources, whether or not the pin-headed management consultants realize it).


My original goal for this set of laws was to lay down something that stressed the importance of leveraging code to it's highest capabilities. Why waste it, if you have it.

As a profession, we constantly refuse to do this.

Too many programmers have fallen into the bad practice of just accepting that the bulk of their system won't get re-used. Some programmers have even mastered the black art of explicitly not reusing stuff, so that more or less it is their intrinsic style of programming (each and every line of code has exactly one and only one usage within the system).

Many programmers seem to fear that any extra effort in thinking won't get paid off by the savings in work. Which, in many regards is true initially.

Doing it right, generalizing and making sure that the code is re-usable takes more time, right away. Creating an architecture, doing design, fixing inconsistencies, and not just taking the quick way around all of the problems is always longer. But not long enough to negate any payoffs later in the project.

Unrealistic schedules force us to always have to accept some technical debt, and all of it is absolutely more resource expensive then doing it right the first time. Minimizing it, and carefully picking and choosing, are key choices that can effect the success or failure.

Some programmers fear that highly leveraged code is complex, and changes will propagate to unexpected places. In some ways that is true, but primarily since it's the re-use itself that is helping to enforce consistency across the system. If the code is re-used, but the usage is an inconsistent mess then, of course, changes can have negative consequences, but realistically it's not because of the re-use, it's because the upper body of code is a mess.

From this we get:

5. Lower level code is easier to re-use.

A small body of well-formatted, consistent code is worth so much more than some untold millions of lines of crap (well, at least to technical people, the market still occasionally disagrees, but only in the short run).

Thus if we have one and only one main programming principle it should be to always solve the problem with the "absolute minimum" of resources, including code, documentation and testing.

If it isn't necessary, it isn't necessary and shouldn't be done. If there is little or no practical value, then it is just wasting time.

As a side-note, I'm a huge fan of design and architecture. I'll never commit to a project without at least a preliminary architecture, but I've often sensed that people get the wrong ideas about what is necessary.

Creating a big fancy architecture, with hundreds of pages is a waste of time. Creating one with any level of detail above and beyond exactly what is going to be useful is a waste of time. Detail is not the issue. Systems always need an architecture, but only in the context of not wasting too much time in experimental programming, and in not wasting too much time in overall system testing, and in not wasting too much time in operational support triage. For these goals, the lines need to be clearly drawn, but beyond that the specifics are less crucial. So long as you are clear on where the line should be, enforcing that is not hard (and is mostly boring refactoring work).

The architecture -- to have any value at all -- needs to pay for itself and the time it took in producing it. Otherwise it's just another -1 to a project in desperate need of more resources.

Getting back to re-use, it is important that we maximize any and all effort in all levels of the project. While that might appear to be an obvious statement, most people too easily forget about side-effects.

If you built some testing mechanism that required a couple of hours to create, but saves 4 hours of debugging, you've managed to pull ahead by two hours. If you've built some automation for weeks that might have been done manually by a low level employee in a couple of days, it might require years, decades or even centuries to pay off. Patting yourself on the back for "saving time" is a little premature.

It is too easy for projects to invest too much time in the wrong areas.

A skilled and experienced software developer learns not to take any effort for granted. The rewards of the work, may or may not pay off the effort. Sometimes it makes more sense to not automate something. To not add fancy diagrams, or to just not commit to some huge effort right away. Sometimes stupid and ugly code is the best approach (rarely, and only at higher levels).

For people who don't know and are trying to figure it out there is a very simple way to think about it. Trace everything backwards through time. If you can't associate the work with some useful tangible (and necessary) effect (either in the project or outside of it), chances are, it has no value and should not be done. In environments with constrained resources, there always needs to be a practical reason for everything. Nothing should be random, nothing should be "just because".

If you're relating the work to "assumed" facts, then it might make some real sense to validate you assumptions before investing too much time. Simple things have a nifty way of getting out of control if they are not tied to specific outcomes (in fact most things out of control get that way precisely because they are unbounded).

All of this comes back to getting the most value out of your work. The biggest investment in effort in all software development projects comes from working on the code (not including sales and operations). You can't find the time to do it right, if you can't figure out how to leverage any existing work. You can't get ahead if you are always behind.

You can't win if you're not making long-term trade-offs, and you can't makes these unless you have enough resources. Thus:

6. Minimizing the work allows for making more longer-term trade-offs.


Computers are amazing machines, but our current (disappointing) state of software construction diminished their usefulness.

Well-structured programming code is part of the balance required for large and small software projects to be able to find the time to correct any significant issues with their efforts. Poor development practices leave the programming teams scrambling to make tight deadlines, often forcing them to accrue significant technical debt. Debt eats back into the resources, forcing the teams to further compound their weak practices. This downward cycle is fairly standard for most new and upcoming projects.

This cycle can only be broken by minimizing all of the overall work, and then using what is saved to increase the number of long-term choices made. As more of the debt is paid off, the "crunch" from each development cycle will ease and more time can be dedicated towards improving the system, not just patching it.

So I think programming is bound by the following laws:

1. All programming code is time-consuming to write.
2. All programming code has some unexpected behavior (bugs).
3. Every line of code, comments and document counts as work.
4. It is less work to update a smaller amount of code, comments and documentation.
5. Lower level code is easier to re-use.
6. Minimizing the work allows for making more longer-term trade-offs.