There are two main schools of thought in software development about how to build really big, complicated stuff.
The most prevalent one, these days, is that you gradually evolve the complexity over time. You start small and keep adding to it.
The other school is that you lay out a huge specification that fully works through all of the complexity in advance, then build it.
In a sense, it is the difference between the way an entrepreneur might approach doing a startup versus how we build modern skyscrapers. Evolution versus Engineering.
I was working at a large company a while ago, and I stumbled on the fact that they had well over 3000 active systems covering dozens of lines of business and all of the internal departments. The portfolio had evolved that way over fifty years, and included lots of different tech stacks, as well as countless vendors. Viewed as ‘one’ thing, it was a pretty shaky house of cards.
It’s not hard to see that if they had a few really big systems, then a great number of their problems would disappear. The inconsistencies between data, security, operations, quality, and access were huge across all of those disconnected projects. Some systems were up-to-date, some were ancient. Some worked well, some were barely functional. With way fewer systems, a lot of these self-inflicted problems would just go away.
It’s not that you could cut the combined complexity in half, but more likely that you could bring it down to one-tenth of what it is today, if not less. It would function better, be more reliable, and be far more resilient to change. It would likely cost far less and require fewer employees as well. All sorts of ugly problems that they have now simply would not exist.
The core difference between the two schools centers on how they deal with dependencies.
If you had thousands of little blobs of complexity that were all entirely independent, then getting finished is just a matter of banging out each one by itself until they are all completed. That’s the dream.
But in practice, very few things in a big ecosystem are actually independent. That’s the problem.
If you are going to evolve a system, then you ignore these dependencies. Sort them out afterwards, as the complexity grows. It’s faster, and you can get started right away.
If you are going to design a big system up front, then these dependencies dictate that design. You have to go through each one and understand them all right away. They change everything from the architecture all the way down to the idioms and style in the code.
But that means that all of the people working to build up this big system have to interact with each other. Coordinate and communicate. That is a lot of friction that management and the programmers don’t want. They tend to feel like it would all get done faster if they could just go off on their own. And it will, in the short-term.
If you ignore a dependency and try to fix it later, it will be more expensive. More time, more effort, more thinking. And it will require the same level of coordination that you tried to avoid initially. Worse, under time pressure, doing it correctly generally gives way to just getting it done quickly, which pumps up the overall artificial complexity. The more hacks you throw at it, the more hacks you will need to hold it together. It spirals out of control. You lose big in the long term.
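A tiny, hypothetical sketch of what that looks like in practice (all module, field, and function names are invented): two teams consume the same upstream order feed without agreeing on a shared contract, so each one patches the feed's quirks locally and the patches drift apart. Handling the dependency once, in a shared adapter, is the engineered alternative.

```python
# Hypothetical illustration of an ignored dependency (all names invented).
# Team A's reporting code works around the feed's two id formats locally:
def order_total_report(raw_orders):
    seen, total = set(), 0.0
    for order in raw_orders:
        # hack: older feeds send "id", newer ones send "order_id"
        oid = order.get("order_id") or order.get("id")
        # hack on the hack: legacy ids arrive as "ORD-123", strip the prefix
        if isinstance(oid, str) and oid.startswith("ORD-"):
            oid = oid[4:]
        if oid in seen:          # duplicates show up in the merged feed
            continue
        seen.add(oid)
        total += float(order["amount"])
    return total

# Team B's shipping code quietly re-invents a slightly different copy of the
# same workaround, and the two drift apart with every new feed quirk.

# The engineered alternative: own the dependency in one shared adapter that
# every consumer goes through, so a feed change is handled exactly once.
def normalize_order(order):
    oid = str(order.get("order_id") or order.get("id"))
    if oid.startswith("ORD-"):
        oid = oid[4:]
    return {"order_id": oid, "amount": float(order["amount"])}
```

The workaround version ships faster today; the adapter version is the one that is still maintainable after the tenth feed change.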
One of the big speed bumps preventing big up-front designs is a general lack of knowledge. Since the foundations like tech stacks, frameworks, and libraries are always changing rapidly these days, there are few accepted best practices, and most issues are incorrectly believed to be subjective. They’re not, of course, but it takes a lot of repeated experience to see that.
The career path of most application programmers is fairly short. In most enterprises, the majority have five years or less of real in-depth experience, and battle-scarred twenty-plus-year veterans are rare. Mostly, these novices are struggling through early career experiences, not yet ready to deal with the unbounded, massive complexity present in a big design.
The other side of it is that evolutionary projects are just more fun; I’ve preferred them. You’re not loaded down with all those messy dependencies. There are way fewer meetings, so you can just get into the work and see how it goes. Endlessly arguing about fiddly details in a giant spec is draining, and it’s made worse when the experience around you is weak.
Evolutionary projects go very badly sometimes. The larger they grow, the more likely they will derail. And the fun gives way to really bad stress. That severe last-minute panic that comes from knowing that the code doesn't really work as it should, and probably never will. And the longer-term dissatisfaction of having done all that work to ultimately just contribute to the problem, not actually fix it.
Big up-front designs are often better from a stress perspective. A little slow to start and sometimes slow in the middle, they mostly smooth out the overall development process. You’ve got a lot of work to do, but you’ve also got enough time to do it correctly. So you grind through it, piece by piece, being as attentive to the details as possible. Along the way, you actively look for smarter approaches to compress the work. Reuse, for instance, can take a ton of code off the table, cut down on testing, and provide stronger certainty that the code will do the right thing in production.
The fear that big projects will end up producing the wrong thing is often overstated. It’s true for a startup, but entirely untrue for some large business application for a market that’s been around forever. You don’t need to burn a lot of extra time, breaking the work up into tiny fragments, unless you really don’t have a clue what you are building. If you're replacing some other existing system, not only do you have a clue, you usually have a really solid long-term roadmap. Replace the original work and fix its deficiencies.
There should be some balanced path in the middle somewhere, but I haven’t stumbled across a formal version of it after all these decades.
We could go first to the dependencies, then come up with reasons why they can be temporarily ignored. You can evolve the next release, but still have a vague big design as a long-term plan. You can refactor the design as you come across new, unexpected dependencies. Change your mind, over and over again, to try to get the evolving work to converge on a solid grand design. Start fast, slow right down, speed up, slow down again, and so forth. The goal is one big giant system to rule them all, but it may just take a while to get there.
The other point is that the size of the iterations matters, a whole lot. If they are tiny, it is because you are blindly stumbling forward. If you are not blindly stumbling forward, they should be longer, since longer iterations are more effective. They don’t all have to be the same size. And you really should stop and take stock after each iteration. The faster people code, the more cleanup is required. The longer you avoid cleaning it up, the worse it gets, on basically an exponential scale. If you run forward like crazy and never stop, the working environment becomes such a swamp that it all grinds to an abrupt halt. This is true in building anything, even cooking in a restaurant. Speed is a tradeoff.
Evolution is the way to avoid getting bogged down in engineering, but engineering is the way to ensure that the thing you build really does what it is supposed to do. Engineering is slow, but spinning way out of control is a heck of a lot slower. Evolution is obviously more dynamic, but it is also more chaotic, and you have to continually accept that you’ve gone down a bad path and need to backtrack. That is hard to admit sometimes. For most systems, there are parts that really need to be engineered, and parts that can just be allowed to evolve. The more random the evolutionary path, the more stuff you need to throw away and redo. Wobbling is always expensive. Nature gets away with this by having millions of species, but we really only have one development project, so it isn’t particularly convenient.
It is good to see both extreme sides clearly through this article: start small and add complexity over time versus specify everything and then build it. And the core of both sides comes down to dependencies. One anecdote I'd like to contribute is our choice of dependency at RudderStack. We chose Postgres over Kafka as our event queue solution. It was a controversial choice back then: why make your life hard when there is a solution made exactly for that? But it has paid off in the long term.
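For readers curious what that looks like mechanically, here is a minimal sketch of the Postgres-as-event-queue pattern, not RudderStack's actual implementation; the table name, schema, and use of psycopg2 are assumptions for illustration. It leans on Postgres's FOR UPDATE SKIP LOCKED so multiple workers can pull events concurrently without blocking each other.

```python
# Minimal sketch of a Postgres-backed event queue (illustrative only, not
# RudderStack's code). Assumes a table created roughly as:
#   CREATE TABLE events (
#       id      BIGSERIAL PRIMARY KEY,
#       payload JSONB NOT NULL,
#       done    BOOLEAN NOT NULL DEFAULT FALSE
#   );
import psycopg2
from psycopg2.extras import Json

def enqueue(conn, payload):
    # "with conn" commits the transaction on success, rolls back on error.
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO events (payload) VALUES (%s)", (Json(payload),))

def process_next(conn, handler):
    with conn, conn.cursor() as cur:
        # SKIP LOCKED lets many workers dequeue concurrently; if a worker
        # crashes, its transaction dies and the row becomes visible again
        # instead of being lost.
        cur.execute(
            """SELECT id, payload FROM events
               WHERE NOT done
               ORDER BY id
               FOR UPDATE SKIP LOCKED
               LIMIT 1"""
        )
        row = cur.fetchone()
        if row is None:
            return False                      # queue is empty
        event_id, payload = row
        handler(payload)                      # process inside the transaction
        cur.execute("UPDATE events SET done = TRUE WHERE id = %s", (event_id,))
        return True

# Example usage:
# conn = psycopg2.connect("dbname=app user=app")
# enqueue(conn, {"type": "page_view", "url": "/pricing"})
# process_next(conn, lambda event: print("got", event))
```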
You are missing a point in engineering: before the skyscraper is built, there are many R&D sessions evolving the tech stack. Some skyscrapers crumbled, others stood, best practices formed, etc. The point being that an engineered system ran through its own evolution before series deployment.
The thing with software is that usually it is a new problem to be solved. Many best SWE practices exist, but they equate more to OSHA guidelines than to the actual civil engineering tech.
So I think the point should be to not be afraid of serious refactoring efforts.
To make it clear to management that they are resting on a house of cards, and while it works now, it can have catastrophic effects.
The reason that systems in the wild are still just the result of their evolution is usually cost pressure.
Nice observation. Captures what software engineering is supposed to mean, I think. In the pre-small-system days, systems like telephony switches or airline reservation systems were engineered. Constraints of time and money made the up-front design worthwhile. Brooks' famous Mythical Man-Month observations were an extreme case, where R&D and engineering were intertwined; they tried to design a new approach to OSes while shipping one. Same with Multics. The lessons learned there led to the small Bell Labs team's success with Unix that we all love :-) The engineering behind it stood on solid research about semaphores, regular expressions, parsing and grammars, etc. As computing got cheaper and more accessible, we could afford to experiment and iterate, and evolve systems, as you put it. It makes total sense. However, it does present maintenance issues, since the structure can grow to be byzantine. It has become a social activity to write readable code that works and can be maintained. With LLMs we can strive for the same thing if we make sure to understand and communicate the underlying engineering behind our design decisions. To manage dependencies well, they must be contained, and this goes against the business model of software upgrades for profit (sometimes). I think the example I like best as a poster child of good engineering for complex systems is the compiler: a solid theoretical foundation, lots of rigorous multi-environment testing, and an evolution path as research comes in about things like optimization techniques.