Saturday, June 27, 2020

Innovation

For a long time now, I’ve been noticing that the pace of innovation in software has been slowing. I started paying attention to the industry in the mid-80s, so that length of time gives me a pretty good perspective on the (cyclic) changes.


There are a couple of really great links that I thought shed light on these problems. The first one gives a great general idea about knowledge transfer: 


https://www.youtube.com/watch?v=pW-SOdj4Kkk

And this second one takes the perspective of how it plays out from an economic standpoint:



I’ve had a strong sense that ‘software projects’ act as a fishbowl, allowing us to observe complexity problems at a reduced, constrained scale. Our projects usually have all of the core fuzzy issues, including personalities, agendas, irrational behavior, and politics, as well as the underlying technical ones like constraints, strict logic, discrete components, etc. That is, they exist in that intersection between mathematics and the physical world. This volatile mix plays out all across our societies, in different industries, in different ways, but it is somewhat smaller and easier to observe in software because the projects are often tiny. 


In the mid-90s I saw the lure of commercialization pull away a lot of the pure and applied researchers. There are pockets left out there, but they are all underfunded. Decades of too little research eventually resulted in stagnation. We basically locked ourselves into construction techniques that haven’t moved in at least 20 years, all while the number of programmers has exploded and the amount of available code that we can leverage has grown even larger. Lots of people, lots of code, but we still don’t see innovative products coming out very often, and we’ve seen a huge decrease in the quality of the code produced. 


“And some things that should not have been forgotten were lost. History became legend. Legend became myth. And for two and a half thousand years, the ring passed out of all knowledge.”

― Galadriel, The Lord of the Rings: The Fellowship of the Ring

If there are any billionaires out there who would like to help reverse this problem, there are a lot of people like myself who are fountains of innovative ideas but have no serious way to explore them. Once life grants you a few dependencies, and takes away some of your energy, working a full-time, high-stress job to pay the bills is cognitively demanding enough that we need our nights and weekends to recover. Personal research projects tend to take hundreds of times longer as side-projects and are prone to disruptions, so they aren’t pursued often and have a super-high rate of not getting finished. Instead of toiling away on another half-finished idea, I’ve tended to just write them up in the blog and move on. It’s less frustrating.


One of the brighter sources of help is the MacArthur Fellowship grants, but they are pretty much just focused on academics in the USA. The idea is great though: someone can grant you a five-year, no-strings-attached opportunity to pursue your ideas. I think if this type of financing were more easily available, we’d see a lot more people trying out innovative software ideas, and eventually some of those ideas would result in huge improvements in our industry. 


The reason we don’t see this type of innovation in private companies is that most of the focus, as expected, is on monetization. The biggest and most pressing problems in software development don’t have direct ways of making money. They lay out a base platform on which the upper layers may be monetizable, but much like the Internet, if they get bent too early toward just making money, that hurts both their construction and their usage. We need cooperative platforms first before we can rest competitive games on top of them. Skipping step one tends to create speculative games, not solid ecosystems. 


If we want to lift software development with innovation, we really have to go backward and explore a lot of vague ideas that are attached to very deep root causes. We suspect that there are better ways of framing our construction processes that would result in faster, more stable systems. We know that as we become more dependent on software, we can’t continue to live with our low-quality construction methods. If we rely on some code, that code has to work within a very tight tolerance, and we currently can’t afford the massive amount of time we know is needed to achieve that type of precision. There are so many fundamental issues that we’ve ignored over the decades in our mad rush to just push half-baked products into the market. At some point, this has to change if we want to move forward.

Sunday, June 21, 2020

Defensive Coding

Over my career, I have joined a number of ongoing software development projects that were having extreme difficulties.


Most of these troubled projects had very common symptoms. 


On joining, the first and most obvious problem was that the developers were kept really busy with operational problems. There were lots of known bugs, but generally, the number of unknown or unrecorded bugs was way, way higher. So, a lot or most of the effort was going into trying to put bandaids on existing issues, without really trying to understand or solve them. 


That sense of stress, lack of time, and anxiety was not only affecting how they were dealing with operational problems but also any new code additions. Since the environment was chaotic, with frequent interruptions, the new stuff was also not receiving much cognitive focus. If they weren’t just hacking away at bugs, they were blindly writing new code and hoping that it might work correctly.


Another interesting trait is that they generally blamed management for their problems. They felt like they were being forced to move too fast and that the focus was too short-term.


I’ve seen this in both small and large companies, with huge projects and medium-sized ones. It basically spans all tech stacks and has been pretty much constant over the 30 years I have been working.


What most of the developers wanted, when I arrived, was a magic bullet that would make all of their problems go away and suddenly put the project back under control. They usually figured there was something that management could change. What turned out to be true most times was that the core issues at the heart of the problem were the habits of the developers themselves. That is, they were hoping it was a management problem, and at times it was aggravated by high-level personality issues, but most often it turned out that it was the way the developers were working. Their code was a mess and getting way worse as it grew larger, and that was throwing everything else out of whack. Often they were using the language, the framework, or other libraries incorrectly. They generally didn’t leverage any of their tools either. Sometimes there were critical things they needed to know but didn’t. They were pounding away furiously at the keyboard, often doing a lot of work, sometimes even consistently, but it was the work itself that was causing most of their problems.


It’s a really hard problem to fix when you have one or more programmers who are often over-confident but still struggling with getting something built. It makes it harder when management believes that too, but for the wrong reasons, like that the work itself is trivial. That compounds the rift between developers and management, particularly if there are actually some complicated parts to solving the problem. Neither side wants to admit any faults, neither side wants to change.


The fix for this type of problem is always that the quality of the code needs to be better. Way better. But not just by sporadically adding a few piddly performance optimizations around the code or switching to some other alternative technology. It all needs to be better. It needs to be organized, disciplined, and every little part of it thought through. The developers need to change their habits, get tighter about their work, and keep the code as tidy as they can, given any time constraints. Then they need to spend more time thinking deeply about how they will add new features by extending the code, and less time at their keyboard experimenting or trying to route around code that they don’t want to spend time understanding. 


It’s not unlike a driver’s training course, where it’s not that hard to learn to drive a car, but it is difficult to be able to drive one safely in most major cities around the world. There are two skillsets there, not one. The first is to know how to do things with the car, the second is to know how to move around the environment properly without crashing into things. Programming is the same. The coding part of programming is just one piece of the puzzle, but knowing how not to end up with ‘bad’ code in a production environment that triggers these types of vicious cycles is a completely different skill.


So, the very first, high-level understanding of Defensive Coding is that not all code is created equal. 


There is ‘good’ code, which you want to be running in production because it is stable and trustworthy and really solves the problems. Then all of the other code is essentially ‘bad’ code. It has one or more serious problems that either affect it operationally (short-term) or affect its ability to be extended and grown with the rest of the system (long-term). 


In modern software development, people often expect the work to be way less than it actually is. So, there is always a lot of pressure to find shortcuts, to reduce estimates, etc. In order to keep financing for any large software project, programmers need to occasionally bend to these types of pressure, to get something out quickly. So, it would be unreasonable to say that a production system will never have ‘bad’ code running, but it is reasonable to expect that the programmers who built the system were aware of which code is ‘good’ and which is ‘bad’. And, as a matter of their work habits, when they are forced to quickly put ‘bad’ code into production, they also allocate the time and effort necessary to replace it later with ‘good’ code. And then replace it. That is, if it’s written quickly, it will probably need to be replaced soon. It’s not until you’ve gone back over the existing code base a few hundred times, rechecking it and cleaning up little issues, that you can be certain that the code is industrial-grade and really ready for a production system.


The converse is also true. There is a huge coding problem if the programmers can’t distinguish between ‘good’ and ‘bad’ code, or if they think that everything in production should be there and can be ignored. If they aren’t willing to revisit earlier work, then any of its bad issues will percolate outwards and destabilize everything else built on top. 


Which gets us to the second major issue. Code is always stacked on top of other code, whether it was written by the current programmers or other people. If the lower levels of the code were written badly, then the upper levels are tainted. Nothing in a large system is truly independent; there are always interdependencies, whether they are explicit or implicit. Worse, they sometimes form when nobody is watching, so components may start out as independent, but as time progresses that will change. 


The simplest way to think about this is that the system is only as good as its worst code. Technically that’s not entirely true, in that there may be onion-layered sub-systems that are awful but only support unused features; even those infrequent ‘entry-points’ into the system represent significant risks that a user may accidentally trigger them someday, and that may cascade back into the rest of the functionality. So, for safety reasons, they need to be at least disabled, and for consistency reasons, marked as unusable. But it’s probably best that they are just removed, with the only traces of them left in deleted files in the repo. 


That gets us back to the habit of cleaning up stuff. Really, nothing in the system should be ‘off-limits’, and it is a recurring, constant task to go back and clean up the code. Oddly, even if it is not causation, there is usually a correlation between a large bug count and a lot of old dead code, invalid comments, lack of consistency, stray configurations, etc., probably because of bad habits. The two always seem to come together, mostly because when the primary concern is not ‘good’ code, anything will get into production. Eventually, that will build up enough to trigger a downward cycle.


Indirectly that leads us to a somewhat incalculable metric. If you were to get a full and honest account of all of the technical, domain, interface, and operational issues that ‘bug’ people in the current system, all of them, then while for many systems that would be thousands of bugs, we could divide the overall number of lines of code and code comments currently written by that count. So, for easy example numbers, we might have 1500 bugs coming out of 40k lines of code. Although it’s not evenly distributed, that essentially gives us one issue for every 26.7 lines of code. That sets a rough quality marker for the work done and gives someone an idea of how frequently issues are getting built into the system. Way back, one of the developers I used to work with felt that about 1 bug in 200 lines was acceptable, but that at least 4 out of 10 of those should be caught by the developer, and probably 4 out of 10 of the remainder should be caught by basic QA. Or, roughly, that we might see 1 bug leaking into an ‘average’ production system for every 1000 lines written. So, if we go back to 26.7, that is a huge number of bugs really, or basically poor quality. And the closer the metric gets to 0 (which I know is asymptotic), the lower the quality of the code. This metric is incalculable though, because very few projects are actually diligent enough to really get a full list of bugs. More often, they don’t have explicit specifications or get full feedback from the users or the operations department, so their sense of the problems is grossly underrepresented. They count only known, obvious, technical bugs, ignoring the other three categories.
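The metric itself is trivial arithmetic; here is a tiny sketch, using the made-up example numbers from the text:

```python
# A tiny sketch of the defect-density metric described above, using
# the made-up example numbers from the text: 1500 recorded issues
# against 40k lines of code and comments.

def lines_per_issue(total_lines: int, total_issues: int) -> float:
    """Lines of code (and comments) written per recorded issue."""
    return total_lines / total_issues

density = lines_per_issue(40_000, 1500)
print(round(density, 1))  # → 26.7
```

The hard part, as noted, isn’t the division; it’s getting an honest count of all four categories of issues to plug into it.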


There is a lot more I can say about Defensive Coding, but this is getting too long for a blog post. Indirectly I’ve covered this a lot in past posts, but I think it’s best to try and focus it into a specific mindset going forward. Programmers don’t want their jobs to be hard or painful, but in order to avoid that fate, they have to choose good working habits that keep their ‘workspaces’ tidy and organized. That’s true of any profession, but the digital realm allows us to hide a huge mess and ignore it more easily, so we have to work harder to ensure that we are not shooting ourselves in the foot by focusing on the wrong issues. And much like driving a car, we have to realize that while some accidents are unavoidable, there are lots of things we can do to keep ourselves out of obvious trouble, most of the time. Good programming requires discipline, even if that sometimes makes parts of the job boring. We shouldn’t focus on making ‘coding’ fun or fast; instead, we need to focus on what we build. Real satisfaction from building stuff comes afterward, when it is ‘good’, when it is used, and when it stays around a long, long time. Those basic principles can help us reframe our perspectives so that with our limited time we can get the right things done, at the right time, to keep everyone (mostly) happy and the project moving forward smoothly. 

Tuesday, June 16, 2020

Comments...

When I first started blogging, over a decade ago, there were lots of times where I found the comments frustrating. It’s nice when people compliment you, but beyond that, I wasn’t so sure. 


Occasionally I’d get comments about my spelling or grammar. Although I understand where that comes from, I felt that people were missing the point. I’m an awful writer, horrible speller, and never really learned English grammar despite it being my first and only language. In the beginning, I was struggling so hard with getting anything written that I’d be super happy that I managed to get anything published, but then people would point out that it wasn’t really well edited. They were right, and I usually fixed the errors, but I rather wished they would sometimes add something more valuable in their comments.


I really liked genuine insights, where they pointed out that there was something I misunderstood and provided a reference or explanation. That’s a great way to learn, and in a field like software development that is so large we can’t even get exposed to a fraction of it, someone helping like that, giving pointers to deeper knowledge is great.


Sometimes people asked interesting questions. That was cool too, but I remember a few times struggling to answer something after a long, and horrible day at work. I guess I felt obligated to answer as soon as possible, so while I enjoyed it, I did find it a bit stressful sometimes. As I’ve aged, and consumed way too much ‘information’ about programming, sometimes I just overflow and go blank. That’s another odd feeling in that I know that I’ve encountered the answer at some point, but my brain just can’t search for it anymore.


Once in a while, I’d get a “you’re all wrong” comment, without references, knowledge, etc. In general, I try to stay away from the deeply subjective parts of the industry, but over the years there have often been popular trends that were rather obviously not good ones. I suppose the classic is Hungarian notation, which was a way of obfuscating variable names, supposedly to make them easier to think up, but really it had a strong tendency to decrease the readability of the code quite heavily. At the time, since it came from Microsoft, lots of younger programmers felt quite strongly that it was a superior way of handling naming and couldn’t understand why other programmers were against it. We still see those types of conflicts in our industry, often enough that in my cynical old age I sometimes equate too much popularity with wrongness, in the sense that if an idea has been watered down so much that it now appeals to everyone, that process itself may have caused the idea to go astray. 


These days, I am rather keen again to get comments. It’s partly because I’ve pretty much written every post at least twice by now, but also because I have a real sense that our industry is stagnating. I’ve always been searching around the edges, looking for new and interesting work, but over the last decade, although there are some neat things happening in a few hidden corners, it seems as if a lot less of it is occurring now. We’re not talking about software, we’re not exploring new ideas much anymore, we’re not even trying to analyze what we are doing to find a better way to get it done. It seems like pretty much everyone has just given up and decided that coding is about copying and pasting stuff from StackOverflow into their editor as fast as possible. It’s just a stressful grind that cycles between different broken code bases. 


So, it would be nice to get back to discussing stuff again. There are so many interesting paths for us to choose to move this industry forward. We’ve pretty much been doing the same work over and over again for the last twenty years, it would be nice to explore some new territory.


Sunday, June 14, 2020

Naturally Encapsulated Coding

In a number of different systems, more so recently, I’ve seen a lot of code that I refer to as ‘brute force’.

It manifests itself as huge convoluted functions that do all sorts of disconnected things, along with a lot of hardcoded data and often a crazy number of ‘if’ statements. Extreme examples are highly fragmented as well; it’s hard to follow the logic since it is distributed all over the code and the flow bounces around erratically.

Generally, code gets built that way because that’s how the programmers want to think about the problems they are given, and many of them feel that it is easier if it’s explicitly coded into the program in the exact same way.

The concern is that this type of code is highly redundant, fragile, and generally, once the program gets large enough, nearly impossible to keep extending. So, what starts out as an easier and cheaper way to get quick functionality into the code eventually becomes a significant blocker that degrades the project’s ability to move forward.

Underneath, the primary issue seems to be the way people think about implementing a solution.

Most programmers are pretty much left to their own devices to figure out larger code structuring issues. It’s not really talked about or taught, and even though it is somewhat implied by a paradigm like object-orientation, it was rarely ever explicitly stated that way. As such, unless you ended up working with code that was well-structured at some point in your career, you’re probably not going to figure it out on your own (unless you are a mathematician). You’ll just default to what you are most comfortable with, unaware that in doing so, you are gradually making your job way harder.

Instead of seeing the work to be done by a computer as a very long list of instructions that need to be executed, we need to flip our perspective around. Data is the dual of code. Both are necessary for code to execute, but in many ways seeing the computation from the data perspective is a lot easier than seeing it from the code perspective.

So, let’s say that there are 250 little execution steps that need to be triggered for a new feature.

The first set of instructions is about fetching some data, X, from persistence somewhere, and ensuring that it is good data.

So, we can consider the translation null -> X. Basically we start with nothing, then we get an X. But, as is usually the case, we almost never really start with nothing, really we are given some search data S, and that is what we need to find X.

So, rather obviously, and kinda cleanly, we just need something in the code that looks like getX(S) -> X. Now, that thing may go to the database, and as it travels there, it may find out that the connection to the database doesn’t exist yet, and the config info needed to initialize it may not be in memory, it might still be somewhere on disk. But we need this DB thing to be up.

So, we follow the same mechanics with getDB(C) -> DB, where C is some config information. If we don’t have that yet, again the pattern repeats as getC() -> C.

This whole DB/C thing is messy, and we want it decomposed away from the other work we are doing. It’s probably a one-time initialization that may or may not happen lazily. There is also fault handling that should be wrapped around it to handle periods of unavailability or bad queries.

So, if we are being nice, and we want getX() to not be stateful, we pass the connection in from above, in some type of environmental context that understands the running issues in the system, let’s call this E for environment.

So, getX(E,S) -> X is what we need to build. That gives us our base work.

Now X is probably raw data, and it needs to be prettied up, again guided by some context that was given to us from above. We’ll call that the user context, or U. So to get to pretty data X’ we need something like:

     decorate(U,X) -> X’

At that point, we can distribute X’ to a set of widgets for a screen, or send it down some pipe and then to a bunch of widgets. Putting this all together we get:

once-in-a-while:
     getC() -> C
     getDB(C) -> DB
     DB -> E

per X-type request:
     -> U,S
     getX(E,S) -> X
     decorate(U,X) -> X’
     X’ -> widgets|pipe

Now, what’s important here is that we’ve decomposed this problem, not as a single large set of instructions but really by what amounts to ‘topic’ or ‘paragraph’ as it is oriented to the data. Basically, when the data changes the topic changes along with it, so we can break that into a new paragraph (function, method, etc.). If we put in ‘layers’ on data transformations, then the code is really easy to write, and use, and it is highly reusable.
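The layered decomposition above can be sketched in code. This is just an illustrative mock, not a real implementation; the config, the fake in-memory ‘database’, and the ‘caps’ user preference are all invented for the example:

```python
# An illustrative mock of the layered decomposition above; the config,
# the fake in-memory 'database', and the user preference are all
# invented for this sketch.

def get_c() -> dict:
    """getC() -> C: one-time load of config info (hardcoded here)."""
    return {"dsn": "db://example"}

def get_db(c: dict) -> dict:
    """getDB(C) -> DB: stand-in for opening a database connection."""
    return {"dsn": c["dsn"], "rows": {"42": {"name": "x", "raw": True}}}

def get_x(e: dict, s: str) -> dict:
    """getX(E,S) -> X: fetch raw X using search data S and environment E."""
    return e["db"]["rows"][s]

def decorate(u: dict, x: dict) -> dict:
    """decorate(U,X) -> X': pretty up raw X using the user context U."""
    pretty = dict(x)
    if u.get("caps"):
        pretty["name"] = pretty["name"].upper()
    return pretty

# once-in-a-while: build the environment E
env = {"db": get_db(get_c())}

# per X-type request: U and S arrive from above
user = {"caps": True}
x = get_x(env, "42")
x_prime = decorate(user, x)
print(x_prime["name"])  # → X
```

Each function owns exactly one data transformation, which is what makes the later triage argument work: a bad raw X points at getX, while a bad pretty X’ points at decorate.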

If we had a second decorative type like X’’, then it is easy to create a new request type, or even put a switch in the code (depending on what minimizes the overall complexity). The same holds true if we need a Z that is composed from X and Y’.

If there is a bug that is allowing bad X’s into the system, it’s rather obviously in getX. If X’ looks funny on the screen, but the data in the database is good, then ‘decorate’ is the culprit. What that means is that, as well as consistency and code reuse causing fewer bugs, we are also getting a really solid means of triaging the bugs we see, letting us narrow down the code by the impact of the bug.

The core thing here is to build upwards. The incoming requirement came from the top-down, but taking that literally gets us back to big, ugly lists of code. Instead, we start by looking at the data that needs to be persisted, we get that into the code, move it to where it is needed, then we start applying algorithms to it, to get it into the correct format. From there we just take that ‘data’ and get it into the final widget structure of the screens, or the data structure of a file, or the data format for an exported protocol, or any of the other ways that the data may leave the system.

So, start with the data, and move it to where it is needed, making the minimum number of transformations along the way. The whole task might equate to a list of things to do, but the construction is handled by building up larger and larger components until the goal is accomplished.

While that’s a fairly simple example, a more common problem is that we might need some new Z, as a computation based on X’ and Y’. But for whatever reason, X and Y are somewhat of a mess as currently stored in persistence. Instead of trying to redo them somehow, the first two steps are extending the persisted data properly, so that we really do have X’ and Y’. In the code, that leaves us with some code that uses X and some new code that needs X’. That’s fine if X’ is a proper superset of X, but what if they contradict each other? If it’s mergeable, then we can combine both into an X’’ and modify the existing getX(E,S) -> X’’ to handle the new data and backfill any missing parameters. If for some reason the changes are not mergeable, then we might need polymorphism to treat both types as the same, even if they have different structures, so X’’ is either X or X’ depending on its origins. Once X’ is available, we do the same for Y’ and then follow the initial approach to get Z into the code. The tricky part is seeing the feature as Z, Z’, X’ and Y’, but once that is understood the coding is very straightforward and somewhat mechanical.
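The mergeable case can be sketched as a small normalization step, where older X records are lifted into X’’ by backfilling the parameters that the extended X’ format added. The field names here are invented purely for illustration:

```python
# Hypothetical sketch of merging old X records with extended X'
# records into a uniform X''; the field names are invented for
# illustration only.

def to_x2(record: dict) -> dict:
    """Normalize either an X or an X' record into X'', backfilling
    any parameters missing from the older format."""
    return {
        "id": record["id"],
        "name": record.get("name", ""),
        "units": record.get("units", "unknown"),  # new field added by X'
    }

old_x = {"id": 1, "name": "alpha"}                 # original X shape
new_x = {"id": 2, "name": "beta", "units": "kg"}   # extended X' shape

print(to_x2(old_x)["units"])  # → unknown
print(to_x2(new_x)["units"])  # → kg
```

With all reads funneled through one function like this, the rest of the code only ever sees X’’, which is what keeps the extension from wrapping yet another layer around the existing mess.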

What’s key here is that we are not just trying to add in new features by setting them up independently and wrapping them around the existing mess, but rather looking at the underlying persistent data and extending its model to be larger to hold any new understandings that come with the feature.

What scares programmers away from doing this is both having to understand that X is already in the system, and having to change an older ‘getX’ or ‘decorate’ written by someone else. Shying away from that may be convenient, but it causes significant accidental complexity and disorganization, which accelerates the technical debt. So, it’s a twofold degeneration: the way that they are thinking about the solution is incorrect, and their habits of updating the code will lead to a mess.

It’s unfortunate that we haven’t spent much time analyzing why code gets written in problematic ways. Likely it is because some aspects of coding are subjective, and we use it as an excuse to avoid trying to talk about any of it. Another candidate is that we still want the act of programming itself to be ‘creative’, even when it’s obvious that the ‘coding’ part of building big software shouldn’t be since it just leads to a chaotic mess. If we forget about the code itself and concentrate on making sure the data is of good quality and it is visible to the users when they need it, most programming tasks are easier, a few are more difficult and the users are happier.

If we get lost in trying to creatively guess at what we can do in the code based on the number of coding tricks that we are currently aware of, then that view of how we are coding influences the stuff we produce, which inevitably is oriented towards being easier for a specific programmer to write, than it is for the user to get their work done. I’ve written about that in the past: https://theprogrammersparadox.blogspot.com/2012/05/bag-otricks.html, but it is not that easy to describe it in a way that people can pick it up and use directly.

Monday, June 8, 2020

Common Questions

There are lots of strategy questions that circulate around software development. I thought I would address a few of the major ones.


Buy vs build?


A company that depends on, but does not own, their main technology is going to have a couple of problems: a) anyone else can use the same technology to compete with them and b) it limits any possibilities of competitive advantage.


If a company is basically a service company that is underpinned by technology, then since it is core to their line-of-business, they should set up their own development effort and build the key pieces that they need. In that sense, the ‘coding’ is actually part of the business, and they are essentially an applied software company. If they are going to do the work themselves, then it is also super important that they take it seriously and acquire enough experienced resources that the result has at least reasonable quality. 


The counter is also true. If there is a complex technical problem that is necessary, but outside the scope of their business, then it doesn’t make sense for them to build up the resources to solve that unless they are considering getting into the business of selling that specific technology. 


So, it’s not a question of buy vs build, but rather a question of which specific parts we need to build to out-compete the other companies in our industry.


Outsource, offshore, or assemble local talent?


Code is a manifestation of a programmer’s understanding of a solution to a problem. That’s a pretty loaded sentence, but what it boils down to is that if the programmers don’t ‘get’ what they are building, then it is unlikely that the outcome will be cost-effective. So, it’s a bad idea to try to cheap out and staff a development effort with the lowest-priced resources, unless there is a reliable way to drive them to the goal correctly. But that means having to produce very precise, very low-level specifications for everything, or at least most of the major things. That work itself is almost the same as writing the program. 


Thus, if you have a group of programmers that don’t already know how to build something to solve the problem, then you essentially need another group that does, to write the specifications for the first group. Which obviously a) doesn’t save any money and b) opens up the risks of communications errors. If you try to get a group of non-technical people to write high-level specifications, then the intrinsic vagueness of their output is highly likely to result in the actual code being unusable for its purpose. 


Or in short, with programming, you get what you pay for. There is no finding a cheaper way around it. If you pay crappy, you get a crappy system. If you are willing to spend the extra money, and you’ve gone to the extra work of hiring the right resources with the right knowledge to get the thing built, then you should get the system you need. But it’s also worth noting that it is extremely hard for non-technical people to validate the skillsets of technical ones, so you still run the risk of hiring someone to lead who says they can do it, but ultimately can’t.


Proprietary vs OpenSource?


It’s nice to get things for free, but when your business depends on something, it is important to make sure that it has decent quality control in its construction. In that sense, it doesn’t matter what you’ve paid; the quality of the components is what drives their usefulness. So, if it’s a well-known OpenSource project that is amazingly well-supported and has a great track record of heading in a positive direction, then it is quite usable. If it is some little proprietary piece that is badly built and whose releases are scary, then you really don’t want to be dependent on it. Quality is more important than origin.


The only other factor is time. If the OpenSource project is hot today but dies off quickly, then the code might end up being unsupportable before the lifetime of the system is finished. When it comes down to money vs popularity, companies that are making a bit of money off their software have a strong tendency to keep it going for as long as possible. So, proprietary software has the edge with respect to time. The popularity of OpenSource tends to be shorter-lived (though not always, as big projects like Linux prove). So, the expected lifespan of the system should be a big factor in choosing the dependencies, but oddly it’s rarely considered.


Refactor or Rewrite?


I didn’t save the link, but I saw a good post a while back that basically said that it depends on where the knowledge of the system lies. If the knowledge is outside of the system, and it is relatively complete, then a rewrite might be faster and less resource-intensive. If however, the system is decades worth of understanding piled together in a heap, then a rewrite is most likely to set the whole game backward. 


Refactoring a big ugly mess is very slow, but it can be done fairly safely, one self-contained part at a time. As well, if it is non-destructive refactoring (the behavior of the system doesn’t change afterward), then it preserves the knowledge that went into it. Technical people will often prefer rewrites, as they don’t like to read or deal with other people’s code, so getting their opinion on it is difficult because of their bias. 


If the hole is big enough, then what seems to mostly work is to identify the lowest and most costly mistakes and then refactor them one-by-one in a long series of releases. Obviously, from a business perspective, that is horrible, as it means the technical resources remain at the same cost while the outward status of the project looks frozen. So what happens in practice is that the project might flip between long-running refactoring, followed by new features, then long-running refactoring again. The temptation is to do both in parallel, but the cross-dependencies between the work tend toward a merge nightmare, which ultimately makes it all slower (and can often throw the whole thing back into the same dysfunctional state that you were trying to get out of). 



And Finally ...


Choosing to use software to automate part of a company is a non-trivial and difficult decision. When it is handled well and executed properly, it can keep the costs low enough that it is hard for others to compete with you. When it goes wrong, it is a huge cost sink. It’s not the type of decision where it is easy to get lucky, or that even has a 50/50 chance, but rather the type where the number of bad outcomes easily exceeds the number of good ones. The wisest thing to do is to find someone who understands it at a very deep level and has been living with the consequences of it for decades. To understand the trade-offs, you have to have spent a lot of time in the trenches; otherwise, it is far too easy to over-simplify the nature of the task.