Nearly three decades ago, when I started university, all I really wanted to learn was the magic of programming. But my course load included plenty of mathematics and computer theory courses, as well as some crazy electives. “What does all this have to do with programming?” I often complained. At first I just wished they’d drop those courses from the curriculum and give me more intensive programming assignments. That’s what I thought I needed to know. In time I realized that most of it was quite useful.
Theory is the backbone of software development work. For a lot of programming tasks you can ignore the theory and just scratch out your own eclectic way of handling the problem, but a strong theoretical background not only makes the work easier, it also produces code that is more likely to withstand the rigors of the real world. Too often I’ve seen programmers roll their own dysfunctional solution to a theoretical problem without first getting a true appreciation of the underlying knowledge. What most often happens is that they flail away at the code, unable to get it stable enough to work. If they understood the theory, however, not only would the code be shorter, they’d also spend far less time banging away at it. Thus for some types of programming, understanding the underlying theory is mandatory. Yes, it’s a small minority of the time, but it’s often the core of the system, where even the littlest of problems can be hugely time-intensive.
The best known theoretical problem is the ‘halting problem’. Loosely stated, it is impossible to write code that can determine whether some other arbitrary code will converge on an answer or run forever (although one can write an approximation that works for a finite subset of programs, and that seems doable).
In its native form the halting problem isn’t encountered often in practice, but we do see it in other ways. The first is that an unbounded loop can run forever, and an unbounded recursion can run forever as well. Thus in practice we really don’t want code that is ever unbounded: infinite loops annoy users and waste resources, so at some point the code has to process a finite set of discrete objects and then terminate. If that isn’t possible, then some protective form of constraint is necessary (and the size of that bound should be easily configurable at operational time).
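As a rough sketch of that kind of protective constraint (the class, method and property names here are mine, purely for illustration), the loop below refuses to run past a limit that can be changed at deployment time without touching the code:

import java.util.Iterator;

// A protective bound on an otherwise open-ended loop. The limit comes from a
// system property (the name is hypothetical) so it can be tuned operationally.
public class BoundedProcessor {
    private static final int DEFAULT_MAX_ITEMS = 100_000;

    public static <T> int processAll(Iterator<T> work) {
        int maxItems = Integer.getInteger("processor.maxItems", DEFAULT_MAX_ITEMS);
        int processed = 0;
        while (work.hasNext()) {
            if (processed >= maxItems) {
                // Refuse to run unbounded: stop and report rather than spin forever.
                throw new IllegalStateException("hit configured limit of " + maxItems + " items");
            }
            handle(work.next());
            processed++;
        }
        return processed;
    }

    private static <T> void handle(T item) {
        // placeholder for the real per-item work
    }
}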
The second way we see it is that we can’t always write code that understands what other code is trying to do. In an offbeat way, that limits the types of tools we can build for automation. It would be nice, for instance, if we could write something that would list out the stack for every possible exception in the code with respect to its input, but that would require the lister to ‘intelligently’ understand the code well enough to know its behavior. We could approximate that, but the lack of accuracy might negate the value of the tool.
Another interesting theoretical problem is the Two Generals Problem. This is really just a coordination issue between any two independent entities (computers, threads, processes, etc.). There is no known way to reliably get 100% certain communication between independent entities. You can reduce the window of problems down to a tiny number of instructions, but you can never remove it entirely. With modern computers we can do billions of things within fractions of a second, so even a tiny 2 ms window could result in bugs occurring monthly in a system with a massive number of transactions. Thus what seems like an unlikely occurrence can often turn into a recurring nuisance that irritates everyone.
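To make that concrete, here is a tiny simulation of my own (not anyone’s actual protocol): each hop across an unreliable channel can fail, and from the sender’s side a lost message and a lost acknowledgement look exactly the same.

import java.util.Random;

// Toy illustration of the Two Generals Problem: over an unreliable channel,
// "message lost" and "ack lost" are indistinguishable to the sender.
public class TwoGenerals {
    private static final Random CHANNEL = new Random();

    // Each trip across the channel independently fails some of the time.
    private static boolean deliver() {
        return CHANNEL.nextDouble() > 0.1; // 10% loss, purely illustrative
    }

    public static void main(String[] args) {
        boolean messageArrived = deliver();
        boolean ackArrived = messageArrived && deliver();

        if (ackArrived) {
            System.out.println("sender: got the ack, both sides agree");
        } else {
            // These two failures differ on the receiver's side, but the
            // sender sees the same thing in both cases: silence.
            System.out.println("sender: no ack; was the message lost, or just the ack?");
            System.out.println("receiver actually got the message: " + messageArrived);
        }
    }
}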
Locking is closely related to the Two Generals Problem. I’ve seen more bugs in locking than in any other area of modern programming (dangling pointers in C were extremely common in the mid 90s, but modern languages have mitigated that). It’s not hard to write code to lock resources, but it is very easy to get it wrong. At its heart, it falls back to a simple principle: to get reliable locking you need a ‘test-and-set’ primitive. That is, in one single uninterrupted, protected operation, you need to test a variable and either set it to ‘taken’ or return that it is unavailable. Once you have that primitive, you can build all other locking mechanisms on top of it. If it’s not atomic, however, there will always be a window of failure. That links back to the Two Generals Problem quite nicely, since the trouble starts precisely when you don’t have access to an atomic ‘test-and-set’ primitive, and then there will always be problems.
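In Java, for instance, that primitive is exposed through classes like AtomicBoolean, and a crude lock can be sketched directly on top of it (just an illustration of the principle; in real code you would normally reach for the platform’s existing locks):

import java.util.concurrent.atomic.AtomicBoolean;

// A crude spinlock built directly on an atomic test-and-set primitive.
public class SpinLock {
    private final AtomicBoolean taken = new AtomicBoolean(false);

    public void lock() {
        // compareAndSet is the test-and-set: in one atomic step it checks that
        // the flag is false and flips it to true, or fails and leaves it alone.
        while (!taken.compareAndSet(false, true)) {
            Thread.yield(); // busy-wait until the current holder releases it
        }
    }

    public void unlock() {
        taken.set(false);
    }
}

Without that single atomic step, the test and the set would be two separate operations, and another thread could slip in between them; that gap is exactly the window of failure.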
Parsing is one of those areas where people often tread carelessly without a theoretical background, and it always ends badly. If you understand the theory and have read works like the Red Dragon Book, then belting out a parser is basically just a matter of time. You decide what class of grammar the ‘language’ requires, such as LR(1), and how big the language is, and then you do the appropriate work, which more often than not means either a recursive descent parser or a table-driven one (using tools like lex/yacc or ANTLR). There are messy bits of course, particularly if you are trying to draft your own new language, but the space is well explored and well documented. In practice, however, what you see is a lot of crude split/join-based top-down disasters, with the occasional regular expression disaster thrown in for fun. Both of those techniques can work with really simple grammars, but they fail miserably when applied to more complex ones. Being able to parse a CSV file does not mean you know how to parse something more complex. Bad parsing is usually a huge time sink, and if it’s way off then the only reasonable option is to rewrite it properly. Sometimes it’s just not fixable.
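As a sketch of what the recursive descent approach looks like, here is a tiny parser for a toy arithmetic grammar (the grammar and the names are mine, chosen only to show the shape of the technique: one method per grammar rule):

// Toy grammar:
//   expr   := term (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')'
public class ExprParser {
    private final String src;
    private int pos;

    public ExprParser(String text) { this.src = text.replaceAll("\\s+", ""); }

    public long parse() {
        long value = expr();
        if (pos != src.length()) throw new IllegalArgumentException("junk at " + pos);
        return value;
    }

    private long expr() {                       // expr := term (('+'|'-') term)*
        long value = term();
        while (peek() == '+' || peek() == '-') {
            char op = next();
            value = (op == '+') ? value + term() : value - term();
        }
        return value;
    }

    private long term() {                       // term := factor (('*'|'/') factor)*
        long value = factor();
        while (peek() == '*' || peek() == '/') {
            char op = next();
            value = (op == '*') ? value * factor() : value / factor();
        }
        return value;
    }

    private long factor() {                     // factor := NUMBER | '(' expr ')'
        if (peek() == '(') {
            next();
            long value = expr();
            if (next() != ')') throw new IllegalArgumentException("missing ')'");
            return value;
        }
        int start = pos;
        while (Character.isDigit(peek())) next();
        if (start == pos) throw new IllegalArgumentException("expected a number at " + pos);
        return Long.parseLong(src.substring(start, pos));
    }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }
    private char next() { return src.charAt(pos++); }

    public static void main(String[] args) {
        System.out.println(new ExprParser("2 * (3 + 4) - 5").parse()); // prints 9
    }
}

Because each grammar rule maps onto one method, a written-down grammar makes the code almost mechanical to produce; the crude split/join approach has no such structure to lean on.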
One of my favorite theoretical problems is the rather well-known P vs NP problem. While the verdict is still outstanding on the relationship, it has huge implications for code optimization. For people unfamiliar with ‘complexity’, it is really a question of growth. If you have an algorithm that takes 3 seconds to run with 200 inputs, what happens when you give it 400 inputs? A simple linear algorithm takes 6 seconds. Worse-behaved algorithms grow faster: a quadratic one would take 12 seconds (four times as long for twice the input), and a cubic one 24 seconds (eight times as long). We can take any algorithm and work out its ‘computational complexity’, which tells us how the running time grows with respect to the size of the input. We usually categorize this by the dominant term, so O(1) is constant (no growth), O(n) grows linearly with the size of the input, O(n^c) grows with a constant exponent (polynomial time), and O(c^n) has the size of the input in the exponent (exponential time). The P in the name refers to polynomial time, while NP is used rather loosely here for anything whose known solutions grow faster, such as exponentially (I know, that is a gross oversimplification of NP, which actually stands for ‘nondeterministic polynomial’, but it serves well enough to point at the harder problems without getting into what constrains NP itself).
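A small illustration of why the category matters more than the constants (my own example, not tied to any particular system): both methods below answer the same question, but doubling the input roughly quadruples the work for the first and only doubles it for the second.

import java.util.HashSet;
import java.util.Set;

// Two ways to ask "does this array contain a duplicate value?".
public class DuplicateCheck {

    // O(n^2): every element is compared against every later element.
    static boolean hasDuplicateQuadratic(int[] values) {
        for (int i = 0; i < values.length; i++)
            for (int j = i + 1; j < values.length; j++)
                if (values[i] == values[j]) return true;
        return false;
    }

    // O(n): a single pass with a set of the values seen so far.
    static boolean hasDuplicateLinear(int[] values) {
        Set<Integer> seen = new HashSet<>();
        for (int v : values)
            if (!seen.add(v)) return true;   // add() returns false if already present
        return false;
    }
}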
Growth is a really important factor when it comes to designing systems that run efficiently. Ultimately what we’d like to build is a well-behaved system that runs in testing on a subset of the data, and then to know that when it goes into production the performance characteristics will not change. The system shouldn’t suddenly grind to a halt when it is being accessed by a real number of users with a real amount of data. What we’ve learned over the years is that it is really easy to write code where this will happen, so to get the big industrial stuff working we often have to spend a significant amount of time optimizing the code to perform properly. The work a system has to do is fixed, so the best we can do is find approaches that preserve and reuse that work (memoization) as much as possible. Optimizing code, after it’s been shown to work, is often crucial to achieving the requirements.
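Memoization in its simplest form looks something like the sketch below (the class is hypothetical; deciding what the key is, how big the cache may grow and when entries go stale is the part that actually takes the effort):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Do the expensive work once per distinct input, then reuse the stored answer.
public class Memoizer<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> expensive;

    public Memoizer(Function<K, V> expensive) { this.expensive = expensive; }

    public V get(K key) {
        // computeIfAbsent only invokes the function when the answer isn't cached yet.
        return cache.computeIfAbsent(key, expensive);
    }
}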
What P != NP really says in practice is that there is a very strong bound on just how optimized the code can ever be. If it holds, then there is no possible way to take an exponential problem and find clever tricks that get it to run in polynomial time. You can always optimize code, but there may be a hard bound on exactly how fast you can get it. A lot of this was best explored with respect to sorting and searching, but for large systems it is essential to really understand it if you are going to get good results.
If it turned out to be false, however (that is, if P really did equal NP), then among many other implications we would be able to calculate some pretty incredible stuff. Moore’s law has kept giving us more hardware to play with, but users have kept pace and are continually asking for processing beyond our current limits. Without that fixed boundary as a limitation, we could write systems that make our modern behemoths look crude and flaky, and it would take a tiny fraction of the huge effort we put in right now to build them (it would also take a lot of the fun out of mathematics, according to Gödel).
Memoization as a technique is best known from ‘caching’. Somewhere along the way, caching became the over-popular silver bullet for all performance problems. Caching in essence is simple, but there is significantly more depth to it than most people realize, and as such it is not uncommon to see systems deploying erratic caching to harmful effect. Instead of magically fixing the performance problems, they manage to make them worse and introduce a slew of inconsistencies in the results. So you get really stale data, or a collection of data with parts out of sync, slower performance, rampant memory leaks, or just sudden scary freezes in the code that seem inexplicable. Caching, like memory management, threads and pointers, is one of those places where ignoring the underlying known concepts is most likely to result in pain rather than a successful piece of code.
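Even something as basic as bounding the cache is routinely skipped, which is where the memory leaks and the scary freezes tend to come from. Here is a minimal sketch of a bounded, least-recently-used cache (not thread-safe, and it says nothing about staleness; those details still need their own answers):

import java.util.LinkedHashMap;
import java.util.Map;

// A bounded LRU cache: once it grows past its limit, the least recently
// used entry is evicted, so memory use cannot grow without bound.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder = true gives LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;   // evict the oldest entry past the limit
    }
}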
I’m sure there are plenty of other examples. Often when I split programming between ‘systems programming’ and ‘applications programming’, what I am really referring to is that the systems variety requires a decent understanding of the underlying theories. Applications programming needs an understanding of the domain problems, but those can often be documented and passed on to the programmer. For the systems work, the programmer has to really understand what they are writing, for if they don’t, the chances of just randomly striking it lucky and getting the code to work are nearly infinitesimal. Thus, as I found out over the years, all of those early theory courses that they made me take are actually crucial to being able to build big, industrial-strength systems. You can always build on someone else’s knowledge, which is fine, but if you dare tread into any deep work, then you need to take it very seriously and do the appropriate homework. I’ve seen a lot of programmers fail to grok that and suffer horribly for their hubris.
Sunday, November 18, 2012
Best Practices
One significant problem in software development is not being able to end an argument by pointing to an official reference. Veteran developers acquire considerable knowledge about ‘best practices’ in their careers, but there is no authoritative source for all of this learning. There is no way to know whether a style, technique, approach, algorithm, etc. is well-known, or just a quirk of a very small number of programmers.
I have heard a wide range of different things referred to as best practices, so it’s not unusual to have someone claim that their eclectic practice is more widely adopted than it is. In a sense there is no ‘normal’ in programming; there is such a wide diversification of knowledge and approaches. But there are clearly ways of working that consistently produce better results, and over time we should be converging on a stronger understanding, rather than just continually retrying every possible permutation.
Our not having a standard base of knowledge makes it easier for people from outside the industry to make “claims” of understanding how to develop software. If, for instance, you can’t point to a reference that says there should be separate development, test and production environments, then it is really hard to talk people out of just using one environment and hacking at it directly. A newbie manager can easily dismiss three environments as being too costly, and there is no way to convince them otherwise. No doubt it is possible to do everything on the same machine; it’s just that the chaos is going to exact a serious toll in time and quality, but for people unfamiliar with software development, issues like ‘quality’ are not easily digestible.
Another example is that I’ve seen essentials like source code control set up in all manner of weird arrangements, yet most of these variations provide ‘less’ support than the technology can really offer. A well-organized repository not only helps synchronise multiple people, but it also provides insurance for existing releases. Replicating a bug in development is a huge step in being able to fix it, and basing that work on the certainty that the source code is identical between the different environments is crucial.
Schemas in relational databases are another classic area where people easily and often deviate from reasonable usage, and then either claim their missteps are well-established practice or dismiss the idea that there is only a small window of reasonable ways to set up databases. If you use an RDBMS correctly, it is a strong, stable technology. If you don’t, then it becomes a black hole of problems. A normalized schema is easily sharable between different systems, while a quirky one is implicitly tied to a very specific code base. It makes little sense to utilize a sharable resource in a way that isn’t sharable.
Documentation and design are two other areas where people often have very eclectic practices. Given the increasing time pressures of the industry, there is a wide range of approaches out there, swinging from ‘none’ to ‘way over the top’, with a lot of developers believing that one extreme or the other is best. Neither too much nor too little documentation serves the development, and often documentation isn’t really the end-product, but just a necessary step in a long chain of work that eventually culminates in a version of the system. A complete lack of design is a reliable way to create a ball of mud, but overdoing it can burn resources and lead to serious over-engineering.
Extreme positions are common elsewhere in software as well. I’ve always figured that in their zeal to over-simplify, many people have settled on their own unique minimal subset of black and white rules, when the underlying problems are often really trade-offs that require subtle balancing instead. I’ll often see people crediting K.I.S.S. (keep it simple, stupid) as the basis for some over-the-top complexity that is clearly counter-productive. They become so focused on simplifying some small aspect of the problem that they lose sight of the fact that they’ve made everything else worse.
Since I’ve moved around a lot I’ve encountered a great variety of good and insane opinions about software development. I think it would be helpful if we could consolidate the best of the good ones into some single point of reference. A book would be best, but a wiki might serve better. One single point of reference that can be quoted as needed. No doubt there will be some contradictions, but we should be able to categorize the different practices by family and history.
We do have to be concerned that software development is often hostage to what amounts to pop culture these days. New “trendy” ideas get injected, and it often takes time before people realize that they are essentially defective. My favorite example was Hungarian notation, which has hopefully vanished from most work by now. We need to distinguish between established best practices and upcoming ‘popular’ practices. The former have been around for a long time and have earned their respect. The latter may make it to ‘best’ someday, but they’re still so young that it is really hard to tell yet (and I suspect more of these new practices end up deemed ineffective than promoted to ‘best’ status).
What would definitely help in software development is to be able to sit down with management or rogue programmers and stop a wayward discussion early with a statement like “storing all of the fields in the database as text blobs is not considered by X to be a best practice, so we’re not going to continue doing it that way”. With that ability, we’d at least be able to look at a code base or an existing project and get some idea of conformity. I would not expect everyone to build things the same way, but rather this would show up projects that have deviated way too far toward the extremes (and because of that are very likely to fail). After decades, I think it’s time to bring more of what we know together into a usable reference.
Sunday, November 4, 2012
Work Smarter...Not Harder
I’ve always loved this quote:
"Work Smarter...Not Harder"
Allan H. Mogensen
But what does it really mean, particularly when it comes to software development?
- get organized immediately and cleanup often
- being consistent is more important than being right
- be very sensitive to scale (bigger systems mean fewer shortcuts)
- utilize the full abilities of any technology or tools, but don’t abuse them
- automate as much as possible, spend the time to get it reliable
- minimize dependencies, accept them only when there are really no other options
- read the manual first, even if it is boring
- research before you write code, avoid flailing at a problem, seek expertise
- choose to refactor first, before coding
- delete as much code as possible
- encapsulate the subproblems away from the system, spend the time to get it right
- break the problem cleanly, fear special cases
- apply abstraction and generalization to further reduce the work
- think very hard about the consequences before diving in
- fail, but admit it and learn from it
- don’t be afraid of the domain, learn as much as possible about it
- focus on the details, quality is all about getting them right
- accept that complexity breeds below the surface, if you think something is simple then you probably don’t understand it yet
- know the difference between knowing, assuming and guessing
- everything matters, nothing should be random in software
- don’t ignore Murphy’s law
- a small number of disorganized things doesn’t appear disorganized until it grows larger
- reassess the organization as things grow, updating it frequently as needed
Working smarter is most often about spending more time to think your way through the problems first, before diving in. Under intense time pressure, people often rush to action. There are times when this is necessary, but there is always a cost. Rack up enough of this technical debt and just servicing it becomes the main priority, which only amplifies the time pressure. Thus any gains from swift action are lost, and working harder won’t undo the downward spiral.
Being smart, however, can prevent this from occurring. Yes, the pace is slower and getting the details right always requires more effort, but minimal technical debt means more resources are available to move forward. Eventually it pays off. Being smart isn’t just about thinking hard; it also requires having enough data to ensure that the thinking is accurate. Thus acquiring more information (dealing with the details) is the key to being able to amplify one’s thinking. In software development, what you don’t understand can really harm you, and it’s very rare that you can safely ignore it.