Thursday, December 28, 2023

Identity

The biggest problem with software security is that we have wired up a great deal of our stuff to rely on ‘anonymous’ actions.

A user logs into some device, but with distributed computing that machine will talk to a very large number of other machines, which often talk to even more machines behind the scenes. Many of those conversations default to anonymous. When we implement security, we only take ‘some’ of those conversations and wrap them in some type of authentication.

The most common failures are that either we forget to wrap some important conversations, or that there are various bugs when we do.

A much better way is to insist that ‘all’ conversations have the originating user’s identity “attached” to them. Everything. All of the way down. No verifiable identity, no computation. Simple rule.

"Why?"

Security as an afterthought will always get forgotten in a rush. And if it’s complicated and redundant, the implementations will vary, most landing on the broken side. It is a losing battle.

“But we don't need that much security...”

Actually, we do, and we’ve always needed that much security. It’s just that long, long ago when these things were first written, there were so few people involved and they tended to be far more trustworthy. Now everyone is involved and the law of averages prevails. We put security on everything else in our lives, why wouldn’t we do it properly on software?

“It’s too hard to fix it...”

Nope. It certainly isn’t easy, but if you look at the technologies we’ve been using for a long, long time now, they have the capacity to get extended to do this. It won’t be easy or trivial, but it isn’t impossible either. If we get it wired up correctly, we can gradually evolve it to improve.

“It doesn’t work for middleware...”

Any incoming request must be associated with an identity. The use of generic identities would be curtailed. So, the web server runs as SERVER1, but all of its request threads run as the identity of the caller. No identity, no request.
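
To make that concrete, here is a minimal sketch of what the rule could look like in middleware; the request shape, the verify step, and the names are all placeholders, not a particular framework. The server process starts up under its own identity, but every request either carries a verifiable caller identity or it is refused outright, and the handler runs under that caller.

```python
from contextvars import ContextVar
from dataclasses import dataclass

# The caller's identity for the current request -- never the server's own identity.
current_identity: ContextVar[str] = ContextVar("current_identity")

@dataclass
class Request:
    identity_token: str | None   # e.g. a claim already checked by the auth layer
    payload: dict

class IdentityRequired(Exception):
    pass

def verify(token: str | None) -> str:
    # Placeholder verification; a real system would check a signature,
    # a session store, or an identity provider.
    if not token:
        raise IdentityRequired("no verifiable identity, no computation")
    return token

def handle(request: Request, handler):
    # Every request runs as the caller, never as SERVER1.
    user = verify(request.identity_token)
    token = current_identity.set(user)
    try:
        return handler(request.payload)
    finally:
        current_identity.reset(token)

# Downstream code can always ask "who is this computation for?"
def save_preferences(payload: dict) -> str:
    return f"saved preferences for {current_identity.get()}"

print(handle(Request("alice", {"theme": "dark"}), save_preferences))
try:
    handle(Request(None, {}), save_preferences)
except IdentityRequired as err:
    print("rejected:", err)
```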

“That’s crazy, we’d have to know all of the user's identities in advance...”

Ah, but we do. We always do. Any non-trivial system has to authorize, which means that some functionality is tagged directly to a set of users. If you have user preferences for example, then only you are authorized to modify them (in most cases). There could be anonymous access, but that is mostly for advertising or onboarding. It is special, so it should not be the default.

Some systems could have anonymous identities, and it can be turned on or off, in the same way that we learned to live with them in FTP. But they wouldn’t be the default, you’d have to do a lot of extra work to add them, and you’d only do that for very special cases.

Every thread in middleware could have an identity attached to it that is not the ‘system identity’, aka the base code that is doing the initialization and processing the requests. It’s pretty simple and it should be baked in so low that people can’t change it. They could only just ‘add’ some other anonymous identity if they wanted to bypass the security issues. It’s analogous to the split between processes and the kernel in a reasonable operating system.

“But the database doesn’t support it...”

Oddly, the problem with most databases does not seem to be technical. It is all about licenses. Historically, the way companies figured out how to make extra money was through licensing users. It’s a great proxy for usage, and usage is a way of sizing the bill to fit larger companies. You set a price for small companies, then add multipliers to get more out of the bigger ones.

We should probably stop doing that now. Or at least stop using ‘users’ as proxies for it, especially if that is one of the root causes of all of our security issues.

Then any statement to the database is also attached to an identity. Always. The database has all of the individual users, and every update is automatically stamped with user and time. No need to rewrite an application version of this anymore. It is there for all rows and all tables, always.
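
A rough sketch of what that could look like, using SQLite purely as a stand-in; the table and column names are invented for illustration. The point is that the stamp is applied by the data layer itself, for every write, rather than being re-implemented by each application.

```python
import sqlite3
from datetime import datetime, timezone

# A thin wrapper that refuses to execute a write without an identity and
# stamps the row with who did it and when.
class StampedDB:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("""CREATE TABLE IF NOT EXISTS preferences (
            user TEXT, theme TEXT,
            updated_by TEXT NOT NULL, updated_at TEXT NOT NULL)""")

    def write(self, identity: str | None, user: str, theme: str):
        if not identity:
            raise PermissionError("no identity, no statement")
        self.conn.execute(
            "INSERT INTO preferences (user, theme, updated_by, updated_at) "
            "VALUES (?, ?, ?, ?)",
            (user, theme, identity, datetime.now(timezone.utc).isoformat()))
        self.conn.commit()

db = StampedDB()
db.write("alice", "alice", "dark")
print(db.conn.execute("SELECT * FROM preferences").fetchall())
```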

“That’s too much processing, some rows need far less...”

Programmers cheat the game in their applications and don’t properly audit some of the changes. Usually, that seems like a great idea right up until someone realizes that it isn’t. Whenever you collect data, you always need a way of gauging its reliability, and that is always the source of the data. If it comes from somewhere else, you need to keep that attached to the data. If a user changes it, you need to know that too. If a user changes it and it jumps through 18 systems, then if you lose its origins, you also lose any sense that it is reliable. So, it would make far more sense if, during an ETL, you keep that information too, and honor it. It would increase your data quality and certainly make it a whole lot easier to figure out how bugs and malicious crimes happened.

“That’s too much disk space...”

Most large organizations store their data redundantly. I’ve actually seen some types of data stored dozens of times in different places. We really should stop doing that. Eliminating that redundancy would be a macro optimization, saving a huge amount of badly used disk space, as opposed to the micro optimization of saving a little space by lowering the data quality.

“But what about caching...”

I’ve said it before, and I’ll say it again, you should not be rolling your own caching. Particularly not adding in a read cache when you have writable data. You’re just causing problems. So, realistically, you initialize with a system identity, and then it primes the cache under that identity. If someone builds a real working cache for you, it needs user identities, and it figures out how to weigh those against the system identity work to appropriately account for each. It does that both for security and to ensure that, as a cache, it is effective. If the system identity reads a whack of data for one user but it never gets used again, then the cache is broken. So, a weight of 100%, for example, would mean that the caching was totally and utterly useless, while a weight of less than 0.01% would probably be quite effective. Security and instrumentation, combined.
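
One plausible way to read that weighting is sketched below; the bookkeeping and the definition of the weight are my assumptions, not a description of an existing cache. Track which identity primed each entry and whether any real user identity ever read it back, then report the fraction of system-primed entries that were never reused.

```python
# A sketch of identity-aware cache instrumentation. The 'weight' here is the
# fraction of entries primed under the system identity that no user identity
# ever read back -- one plausible reading of the idea above, not a standard.
class IdentityCache:
    SYSTEM = "SYSTEM"

    def __init__(self):
        self.data = {}        # key -> value
        self.primed_by = {}   # key -> identity that loaded it
        self.reused = set()   # keys later read under a real user identity

    def put(self, identity, key, value):
        self.data[key] = value
        self.primed_by[key] = identity

    def get(self, identity, key):
        if key in self.data and identity != self.SYSTEM:
            self.reused.add(key)
        return self.data.get(key)

    def wasted_weight(self):
        primed = [k for k, who in self.primed_by.items() if who == self.SYSTEM]
        if not primed:
            return 0.0
        unused = [k for k in primed if k not in self.reused]
        return len(unused) / len(primed)

cache = IdentityCache()
cache.put(cache.SYSTEM, "rates:2023", [1.1, 1.2])   # primed at startup
cache.get("alice", "rates:2023")                    # actually used by a user
print(cache.wasted_weight())                        # 0.0 -- effective so far
```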

“But what about ex-users...”

People come and go. Keeping track of that is an organizational issue. They really shouldn’t forget that someone worked for them a few decades back, but if they wanted to do that, they could just swap to a single ‘ex-employee’ identity. I wouldn’t recommend this myself, I think it makes far more sense that if you have returned to a company they reconnect you to your previous identity, but it should be a company-wide decision, not left to the whims of each application. When you start building something new, the ‘group’ of people that can use it should already be established, otherwise, how would you know that you need to build the thing?

“What about tracking?”

If you know all of the computations that an identity triggers and all of the data that they have changed, then you have a pretty powerful way of assessing them. That’s not necessarily a good thing, and it would have to be dealt with outside of the scope of technology. It would not be accurate though, because it is really easy to game, so if a company used it as a performance metric, it would only end up hurting them.


“But I want to roll my own Security...”

Yeah, that is the problem with our security. It takes a crazy amount of knowledge to do it correctly, everyone wants to do it differently, most attempts get it wrong, and while it would be fun to code up some super security, in reality, it is always the first functionality that gets slashed when everyone realizes they aren’t going to make the release deadlines. If your job is effectively to rush through coding, then you should stick mostly to straightforward coding. It sucks, but it is reality. It also plays back to the notion that you should always do the hard stuff first, not last. That is, the first release of any application should be a trivial shell that sets the foundations but effectively has no features. Then the first feature release of the application is actually an upgrade. Doing it this way will eliminate a lot of pain and is easier to schedule.

"There are too many vendors, they won't agree to this..."

The industry is notoriously addicted to locking customers in. This type of change would not affect that, so if we crafted it as an ISO standard, and then there was pressure to be compliant, most of them would comply simply because it was good for sales. The downside is that in some cases it would affect their invoicing, but I'm sure they could find another proxy for organization size that is probably easier and cheaper to implement.

Identity, like a lot of other software development problems, is difficult simply because we like to shoot ourselves in the foot. If we could stop doing that, then we could put in place some technologies that would help ensure that the things we build work far better than they do now. Oddly, these solutions are not hard to implement, and we basically know how to do them correctly. The issue isn’t technological, it has nothing to do with the computers themselves; it is all about people.

Thursday, December 21, 2023

Software Knowledge

There are at least two different categories of knowledge in software.

One is the specifics of using a tech stack to make a computer jump through specific hoops to do some specific computations. It is very specific. For example, the ways to store certain arrangements of bits for persistence when using Microsoft stacks.

When you are building software, if you know how to do something, it will make the work go faster. But people are right when they say this type of knowledge has a half-life. It keeps changing all of the time; if you haven’t done it for a while, you’ve forgotten it or it has moved out from under you.

This is the stuff you look up on StackOverflow.

The other category of knowledge is far more important.

There are all sorts of ways of working and building things with software that have not changed significantly for decades. They are as true now, as when I started forty years ago. They are the same no matter what tech stack you use, be it COBOL or JavaScript. Sometimes they are forgotten and then reinvented later under different names. A good example is that we used to edit our code, now we refactor it.

Fundamentally, building software is a construction effort. The medium appears more malleable than most, but it is not immune to the issues of any other kind of construction. And because programmers intentionally freeze far too much, we change things too fast, and we often need backward compatibility, it is rarely as malleable as it could be.

There are a couple of obvious anchors.

The first is that size matters. It is a whole lot easier to build tiny things than massive ones. Size is everything.

The second is that as the size of the effort grows, disorganization causes worse problems. If you write some tiny spaghetti, it is okay, you can still change it. But if you have a million lines of spaghetti you are screwed.

Organizing stuff isn’t fun, and oddly it isn’t a one-time task either. It is an ongoing effort. The data and code are only as organized as the explicit effort you put into them to keep them organized. If you aren’t doing anything, it is likely a mess.

But even more specifically, there is a lot of general knowledge about how to code things like data structures, algorithms, data normalization, or GUI conventions that holds true regardless of the stack. You may not need to create a hash table yourself anymore, but you still need to understand how to leverage it and its limits. People will always need to ensure their data is at least 3NF or they will pay the price for storing it badly. A poorly wired GUI will diminish trust; it may be marginally workable, but it generates ill will.

The tools too. Learning to properly configure and use an editor or IDE tends to stay relevant for a very long time. There are all sorts of build tools and scripting, most of which haven’t changed for decades, although sometimes they get obscured by trendy stuff that doesn’t last. But the need for the tools and the usage of them never changes. If you spend the time to figure one out, the others come easily. It also helps in understanding why some newer trends are poor and should be avoided.

Of course, all of the issues with people and politics never, ever change. We build software as a solution to some users' problems. If you don’t fully understand what you are trying to solve, then the things you’ve built are far less likely to work as needed. There is also a lot of gymnastics involved with funding software development, often resulting in too much stress and rushing through the work. Stress is bad for thinking; rushing is bad for quality.

Most of what you do specifically in software changes. The trends come and go in roughly five-year waves, developers need to keep up but not every wave. You can skip some waves, but if you skip too many your opportunities narrow. Once you are old and out, it is brutal getting back in.

Most of the general knowledge is far more important than the specifics. It is what keeps the projects from chaos, ensures that the work is at least good enough, and helps control the expectations of the people on the margins. If you know generally how to build things well, you can always look up the specifics. But if you don’t know how to properly persist the data, for instance, the work is doomed before it even starts. If you don’t understand fragmentation, you won’t understand why your work keeps failing when you bring it all together. If you don’t understand the components, you cannot craft a reasonable architecture.

You actually need more general knowledge to ensure that a large project is successful than specific knowledge. It is what keeps it all out of trouble while people are coding like mad. This is probably why the high failure rate of modern software is independent of the methodologies used. It's more likely experience-related. A bunch of kids who really know their specifics well will still usually fail in very predictable, general ways. That was true when I was a kid and it still holds true today.

Thursday, December 14, 2023

Fragmentation

When you have never experienced large-scale software development, you often prefer fragmentation. You want lots of little parts.

It’s understandable.

Most people are taught to decompose large problems into smaller ones. Once they have completed that, they assume that all of those smaller sub-solutions are mostly independent from each other.

They believe that dealing with each piece independently would be easier. That way you can just focus on one and ignore the rest.

It’s just that when people learn to decompose stuff, they tend to choose rather arbitrary lines of decomposition. There are lots of options, they choose the easiest. But that means that the pieces below are more likely to have dependencies between them. That they are not independent.

If the problem was decomposed based on ‘natural’ lines that tend away from dependencies, then the idea of treating stuff as fragments would work. But they don’t know what that means, so it doesn’t happen.

The other part of this issue then comes into play.

If you decompose a problem into parts, you need to ensure that the parts themselves still hold together to solve the original problem. That at least they cover everything necessary. That is, after you break it down, you build it back up again to make sure it is right. Breaking it down is only the first half of the effort.

So overlaps, gaps, and dependencies tend to derail a lot of decomposition attempts.

Once any complexity gets fragmented it grows exponentially. If there are enough fragments it becomes nearly impossible to assert anything reasonable about its overall behavior. That is, each of the components may work as expected, but the combination of them does not. This is an all too common problem in software.

The cure is to be suspicious of fragmentation. It’s not the same as encapsulation.

With encapsulation, all of the rough edges are hidden nicely inside of a box. With fragmentation, the edges are exposed and are effectively their own pieces, thus throwing the complexity out of whack. You quickly end up with far more pieces than you can handle.

You can see this as an issue of ‘scope’ and most programming languages provide strong tools to control it, but very few programmers take advantage of them. We first figured this out with global variables, but there are endless ways to create similar issues.

If your decomposition into an Object is correct, for example, then all of the internal variables in the Object can be set to private. They are not modifiable from the outside. They are not visible from the outside. The entire interface is methods, and all of the variables are properly encapsulated. Instead, we have crazy ideas like ‘getters’ and ‘setters’ that drop functions directly over the variables, so that we can pretend that we encapsulated them when clearly we didn’t.
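
A toy illustration of the difference, not taken from any particular codebase. The first class pretends to encapsulate but really just exposes the variable; the second keeps the variable private and makes the interface entirely behavioral.

```python
# Fragmented: getters and setters just expose the internal variable,
# so every caller still depends on how the balance is stored.
class LeakyAccount:
    def __init__(self):
        self._balance = 0
    def get_balance(self):
        return self._balance
    def set_balance(self, value):
        self._balance = value      # any caller can put it into a bad state

# Encapsulated: the internals stay private, the interface is behavior.
class Account:
    def __init__(self):
        self._balance = 0          # never touched from outside
    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposits must be positive")
        self._balance += amount
    def withdraw(self, amount):
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount
    def statement(self):
        return f"balance: {self._balance}"
```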

Other fun examples of fragmentation include the early attempts to distribute code throughout lots of static HTML files, making it nearly impossible to correctly predict the behavior of anything non-trivial.

Modern frameworks are often based around fragments as well. You know there is a problem if you need to access ‘globals’ in a lot of little ‘callbacks’; it will quickly become a mess.

Even a lot of modern data storage philosophies make the same mistake. Just dumping all of the data in little files into a disorganized pool or lake is only going to blow out the complexity. Sure, you save time while collecting the data, but if it is nearly impossible to find stuff when you need it, then the collection will grow into a swamp.

Breaking things down into smaller pieces without fully encapsulating them is fragmentation. It is bad, in that while encapsulation wraps and controls complexity, fragmentation just amplifies it. Complexity is the impassable barrier for size. If you can’t manage it, you cannot get any larger or more sophisticated. If you encapsulate parts of it properly, you can grow the solution until it covers the whole problem.

Thursday, December 7, 2023

Solving Hard Problems

Trivial solutions are great for solving trivial problems. But if you have a hard problem, then no combined set of trivial solutions will ever end up correctly solving it.

The mechanics of this are fairly easy to understand.

If you try to solve only a small subset of a big problem with a solution that targets just that subset, then it will spin off more problems; it is a form of fragmentation.

You’ve addressed the subset, but now that piece needs to interact with many of the other parts, and those interactions are artificial complexity. They would not have been necessary if you had addressed the whole problem.

If you try to solve a hard problem, either one that is huge or one that is complex, with a lot of trivial solutions, there will be an endless stream of these unsolved fragments, and as you try to solve those, they will spawn even more, making the work effectively perpetual. You can’t get a perpetual motion machine in our physical universe, but you can effectively spend nearly forever creating and patching up little holes in a misfitting solution.

If you want to solve a hard problem, then the solution itself cannot be trivial. It will not be simple, it will not be easy. Any desire or quest to get around this is hopeless.

“Things should be made as simple as possible, but no simpler” -- possibly Albert Einstein

The really important part of that possibly misattributed quote is at the end. That there is a notion of too simple.

The belief that one can get away with over-simplifying solutions is the underlying cause of so many software problems. It’s nice when software is simple and easy to understand, but if it isn’t the right solution, then it will cause trouble.

Yes, you can whack out a large number of trivialized silos into software. You can get these into active usage really quickly. But if they do not fit properly, the net effect is now worse than before. You’ve created all sorts of other problems that will accumulate to be worse than the original one. The software isn’t really helping, it’s just distracting everyone from really fixing the problem.

This is quite common in a lot of large companies. They have an incredible number of systems that spend more of their resources pushing data back and forth than they do working with their intended users. And the more they move data around, the worse it gets. The quality issues spin off secondary systems trying to fix those data problems, but by then it is all just artificial complexity. The real underlying problems have been lost.

If any of the silos aren’t fully encapsulated, then either the partitioning is wrong, or the problem is hard and can’t be split up.

Thursday, November 30, 2023

Observable on Steroids

A while back I wrote a small product that, unfortunately, had an early demise. It was great technically -- everyone who saw it loved it -- it’s just that it didn’t get any financial support.

The idea was simple. What if all of the data in an application was observable? Internally it was just thrown into a giant pool of data. So, it could be variables in a model, calculations, state, context, etc. Everything is just observable data.

Then for each piece of derived data, it would watch any of its dependencies. If it was a formula, it would see that any of the underlying values changed, recalculate, and then tell anyone watching it that it is now different.

The stock way of implementing observables in an object-oriented paradigm is to keep a list of anyone watching, then issue an event, aka a function call, to each. The fun part is that since any object can watch any other object, this is not just a tree or a DAG, it can be an arbitrary graph, cycles and all.

Once you get a graph involved, any event percolating through it can get caught in a cycle. To avoid this, I traded off space by having each event keep a dictionary of the other objects it had already visited. If an object gets notified of an event and it is already in that visited set, it just ignores the event. Cycles defeated.
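
Here is a small sketch of that mechanism; the class names and the use of object ids for the visited set are illustrative choices, not the original code. Each event carries the set of observables it has already touched, so even a cyclic watch graph settles instead of looping.

```python
# Every event carries the set of observables it has already visited,
# so a cyclic watch graph cannot loop forever.
class Observable:
    def __init__(self, name, value=None):
        self.name, self.value = name, value
        self.watchers = []

    def watch(self, other):
        # register self as a watcher of 'other'
        other.watchers.append(self)

    def set(self, value):
        self.value = value
        self.notify(visited={id(self)})

    def notify(self, visited):
        for w in self.watchers:
            if id(w) in visited:
                continue            # already saw this event -- cycle defeated
            visited.add(id(w))
            w.on_change(visited)

    def on_change(self, visited):
        self.recalculate()
        self.notify(visited)

    def recalculate(self):
        pass                        # leaf variables have nothing to compute

class Sum(Observable):
    def __init__(self, name, *deps):
        super().__init__(name)
        self.deps = deps
        for d in deps:
            self.watch(d)
    def recalculate(self):
        self.value = sum(d.value or 0 for d in self.deps)

a, b = Observable("a", 1), Observable("b", 2)
total = Sum("total", a, b)
a.set(5)
print(total.value)   # 7
```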

So, now we have this massive pool of interconnected variables, and if we dumped some data into the pool, it would set off a lot of events. The leaf variables would just be updated and notify watchers, but the ones above would recalculate and then notify their watchers.

I did do some stuff to consolidate events. I don’t remember exactly, but I think I’d turn on pause, update a large number of variables, then unpause. While paused, events would be noted but not issued. For any calculation with a lot of dependencies, if one thing changed, I’d track the time of the recalculation; since the recalc had already grabbed all of the child variables’ data, it could toss any events from the other dependencies that were earlier than it. So, for A+B+C, after the unpause, you’d be notified that A changed and do the recalc, but then the times that B and C changed would be earlier than the recalc, so they would be ignored. That cut down significantly on any event spikes.
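
A loose sketch of that consolidation idea, with the details simplified; the timestamps and the batcher shape are assumptions rather than a record of the original implementation. While paused, changes are only noted; on unpause, the first change triggers a recalculation, and any queued change older than that recalculation gets dropped because its value was already picked up.

```python
import time

class EventBatcher:
    def __init__(self, recalc):
        self.recalc = recalc
        self.paused = False
        self.pending = []            # (timestamp, variable name)
        self.last_recalc = 0.0

    def changed(self, name):
        stamp = time.monotonic()
        if self.paused:
            self.pending.append((stamp, name))
        else:
            self._fire(name)

    def _fire(self, name):
        self.last_recalc = time.monotonic()
        self.recalc(name)            # the recalc reads all dependencies anyway

    def pause(self):
        self.paused = True

    def unpause(self):
        self.paused = False
        for stamp, name in self.pending:
            if stamp <= self.last_recalc:
                continue             # already covered by an earlier recalc
            self._fire(name)
        self.pending.clear()

batcher = EventBatcher(lambda trigger: print("recalc of A+B+C, triggered by", trigger))
batcher.pause()
for name in ("A", "B", "C"):
    batcher.changed(name)
batcher.unpause()                    # exactly one recalc, triggered by A
```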

Then finally, at the top, I added a whole bunch of widgets. Each one was wired to watch a single variable in the pool. As data poured into the pool, the widgets would update automatically. I was able to wire in some animations too, so that if a widget displayed a number, whenever it changed it would flash white and then slowly return to its original color. But then I wired in tables and plotting widgets as well. The plot, for instance, is a composite widget whose children, the points of the curve, are all wired to different variables in the pool. So, the plot would change on the fly as things changed in the pool.

Now if this sounds like MVC, it basically is. I just didn’t care if the widgets were in one view or a whole bunch of them. They’d all update correctly either way, and the entire model was in the pool. And any of the interface context stuff was in the model, so in the pool. And any of the user’s settings were in the model, so in the pool. In fact, every variable, anywhere in the system, was in the model, so in the pool. Thus the steroids designation. Anything that can vary in the code is an object, and that is observable in the pool.

Since I effectively had matrices and formulas, it was a superset of a spreadsheet. A sort of disconnected one. A whole lot more powerful.

Because time was limited, my version had to wire up each object explicitly as code. But the objects didn’t do much other than inherit the mechanics, define the children to watch, and provide a function to calculate. It would not have been too hard to make all of that dynamic, and then provide an interface to create and edit new objects in the app itself. This would let someone create new objects on the fly and arrange them as needed.

It was easy to hook it up to other stuff as well. There was a stream of incoming data, so it was handled by a simple mapping between the data’s parameters and the pool objects. Get the next record in the stream, and update the pool as necessary. Also, keyboard and button events would dump stuff directly into the pool. I think some of the widgets even had two-way bindings, so the underlying pool variable changed when the user changed the widget and could percolate to everything else. I had some half cycles, where a widget displayed a value that was two-way bound; as the user changed it, it triggered other changes in the pool, which updated stuff on the fly, which would in turn change the widget. I used that for cross-widget validations as well. The widgets changed the meta information for each other.

I did my favorite form of widget binding, which is by name. I could have easily added scope to that, but the stuff I was working with was simple enough that I didn’t have any naming collisions. I’ve seen structure-based binding sometimes, but it can be painful and rigid. The pool has no structure, and the namespace was tiny because the objects were effectively hardwired to their dependencies. Extending it would need scope.

To pull it all together I had dynamic forms and set them to handle any type of widget, primitive or composite. I pulled a trick from my earlier work and expanded the concept of forms to include everything on the screen including menus and other recursive forms. As well, forms could be set to not be editable, which lets one do new, edit, and view all with the same code, which helps to save coding time and enforce consistency.

Then I put in some extremely complex calculations, hooked it up to a real-time feed, and added a whole bunch of screens. You could page around while the stream was live and change stuff on the fly, while it was continuously updating. Good fun.

It’s too bad it didn’t survive. That type of engine has the power to avoid a lot of tedious wiring. Had I been allowed to continue I would have wired in an interpreter, a pool browser, and a sophisticated form creation tool. That and some way to dynamically wire in new data streams would have been enough to compete with tools like Excel. If you could whip up a new table and fill it with formulas and live forms, it would let you craft complex calculation apps quickly.

Thursday, November 23, 2023

Bottom Up

The way most people view software is from the top down. They see the GUI and other interfaces, but they don’t have any real sense of what lies behind it.

The way that bugs are reported is from the top down. Usually, there is an interface problem somewhere, but the underlying cause may be deep in the mechanics.

The foundation of every system is the data it persists. If there isn’t a way to reliably keep the data around for a long time, it might be cute, but it certainly isn’t practical.

The best way to build software is from the bottom up. You lay down a consistent set of behaviors and then you build up more complicated behaviors on top of those. This leverages the common lower stuff, so you don’t have to duplicate the work for everything on top.

The best way to extend a system is bottom-up. You start by getting the data into persistence, then into the core of the system, and then you work your way upwards until all of the interfaces reflect the changes.

The way to deal with bugs is top-down. But the trick is to keep going as low as you can. Make the lowest fix that time will allow. Sometimes it is rushed, so you might fix a bunch of higher-level symptoms, but even if you do that, you still have to schedule the proper lower-level fixes soon.

The best way to screw up a large system is to let it get disorganized. The way you organize a system is from the bottom up; looking at it top-down will only confuse you.

Some people only want to see it one way or the other, as top-down or bottom-up, but clearly, that isn’t possible. When a system becomes large, the clash in perspectives becomes the root of a lot of the problems. Going in the wrong direction, against the grain, will result in hardship.

Thursday, November 16, 2023

The Power of Abstractions

Programmers often complain about abstractions, which is unfortunate.

Abstractions are one of the strongest ‘power tools’ for programming. Along with encapsulation and data structures, they give you the ability to recreate any existing piece of modern software, yourself, so long as you have lots and lots of time.

There is always a lot of confusion about them. On their own, they are nothing more than a generalization. So, instead of working through a whole bunch of separate individual special cases for the instructions that the computer needs to execute, you step back a little and figure out what all of those different cases have in common. Later, you bind those common steps back to the specifics. When you do that, you’ve not only encoded the special cases, you’ve also encoded all of the permutations.

Put another way, if you have a huge amount of code to write and you can find a small tight abstraction that covers it completely, you write the abstraction instead, saving yourself massive amounts of time. If there were 20 variations that you needed to cover but you spent a little extra time to just create one generalized version, it’s a huge win.

Coding always takes a long time, so the strongest thing we can do is get as much leverage from every line as possible. If some small sequence of instructions appears in your code dozens of times, it indicates that you wasted a lot of time typing and testing it over and over again. Type it once, name it, make sure it works, and then reuse it. Way faster.
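
As a toy illustration of that leverage (the report example is invented, not from the text above): rather than twenty near-identical export functions, one small generalization covers all of them, plus the permutations nobody has asked for yet.

```python
import csv, io

def export_csv(rows, columns, delimiter=","):
    """Write any list of dict rows out as CSV, for any choice of columns."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns,
                            delimiter=delimiter, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# One generalized function instead of export_sales_csv, export_inventory_csv, ...
sales = [{"region": "east", "total": 120}, {"region": "west", "total": 90}]
print(export_csv(sales, ["region", "total"]))
```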

A while back there were discussions that abstractions always leak. The example given was for third-generation programming languages. With those, you still sometimes need to go outside of the language to get some things done on the hardware, like talking directly with the video card. Unfortunately, it was an apples-to-oranges comparison. The abstractions in question generalized the notion of a ‘computer’. But just one instance of it. Modern machine architecture, however, is actually a bunch of separate such computing devices all talking to each other through mediums like the bus or direct memory access. So, it’s really a ‘collection’ of computers. Quite obviously if you put an abstraction over the one thing, it does not cover a collection of them. Collections are things themselves (which is part of what data structures is trying to teach).

A misfitting abstraction would not cover everything, and an abstraction for one thing would obviously not apply to a set of them. The abstraction of third-generation programming languages fit tightly over only the assembler instructions that manipulated the computer but obviously didn’t cover the ones that were used to communicate with peripherals. That is not leaking, really it is just scope and coverage.

To be more specific, an abstraction is just an abstraction. If it misfits and part of the underlying mechanics is sticking out, exposed for the whole world to see, the problem is encapsulation. The abstraction does not fully encapsulate the stuff below it. Partial encapsulation is leaking encapsulation. There are ugly bits sticking out of the box.

In most cases, you can actually find a tight-fitting abstraction. Some generalization with full coverage. You just need to understand what you are abstracting. An abstraction is a step up, but you can also see it as binding together a whole bunch of special cases like twigs. If you can visualize it as the overlaid execution paths of all of the possible permutations forming each special case, then you can see why there would always be something that fits tightly. The broader you make it the more situations it will cover.

The real power of an abstraction comes from a hugely decreased cognitive load. Instead of having to understand all of the intricacies of each of the special cases, you just have to understand the primitives of the abstraction itself. It’s just that it is one level of indirection. But still way less complexity.

The other side of that coin is that you can validate the code visually, by reading it. If it holds within the abstraction and the abstraction holds to the problem, then you know it will behave as expected. It’s obviously not a proof of correctness, but being able to quickly verify that some code is exactly what you thought it was should cut down on a huge number of bugs.

People complain though, that they are forced to understand something new. Yes, absolutely. And since the newer understanding is somewhat less concrete, for some people that makes it a little more challenging. But programming is already abstract and you already have to understand modern programming language abstractions and their embedded sub-abstractions like ‘strings’.

That is, crafting your own abstraction, if it is consistent and complete, is no harder to understand than any of the other fundamental tech stack ones, and to get really good at programming, you have to know those anyway. So adding a few more for the system itself is not onerous. In some cases, your abstraction can even cover a bunch of other lower-level ones, so if it is encapsulated, you don’t need to know those anymore. A property of encapsulation itself is to partition complexity, making the sum more complex but each component a lot less complex. If you want to write something sophisticated with extreme complexity, partitioning it is the only way it will be manageable.

One big fear is that someone will pick a bad abstraction and that will get locked into the code causing a huge mess. Yes, that happens, but the problem isn’t the abstraction. The problem is that people are locking things into the codebase. Treating all of the code in the system as write-once and untouchable is a huge problem. In doing that, it does not matter if the code is abstract or not, the codebase will degenerate either way, but faster if it is brute force. Either the code on top a) propagates the bugs below, b) wraps another onion layer around the earlier mess, or c) just spins off in a new silo. All three of these are really bad. They bloat up the lines of code, enshrine the earlier flaws, increase disorganization, and waste time with redundant work. They get you out of the gate a little faster, but then you’ll be stuck in the swamp forever.

If you pick the wrong abstraction then refactoring to correct it is boring. But it is usually a constrained amount of work and you can often do it in parts. If you apply the changes non-destructively, during the cleanup phase, you can refactor away some of the issues and check their correctness, before you pile more stuff on top. If you do that a bunch of times, the codebase improves for each release. You just have to be consistent about your direction of refactoring, waffling will hurt worse.

But that is true for all coding styles. If you make a mistake, and you will, then so long as you are consistent in that mistake, fixing it is always a smaller amount of work or at the very least can be broken down into a set of small amounts. If there are a lot of them, you may have to apply the sum over a large number of different releases, but if you persist and hold your direction constant, the code will get better. A lot better. Contrast this with freezing, where the code will always get worse. The mark of a good codebase is that it improves with time.

Sometimes people are afraid of what they see as the creativity involved with finding a new abstraction. Most abstractions however are not particularly creative. Really they are often just a combination of other abstractions fitted together to apply tightly to the current problem. That is, abstractions slowly evolve, they don’t just leap into existence. That makes sense, as often you don’t fully appreciate their expressibility until you’ve applied them a few times. So, it’s not creativity, but rather a bit of research or experience.

Programming is complicated enough these days that you will not get really far with it if you just stick to rediscovering everything yourself from first principles. Often the state of the art has been built up over decades, so going all of the way back in time and trying to reinvent everything again is going to be crude in comparison.

This is why learning to research a little is a necessary skill. If you decide to write some type of specific computation, doing some reading beforehand about others' experiences will pay huge dividends. Working with experienced people will pay huge dividends. Absorbing any large amount of knowledge efficiently will allow you to start from a stronger position. Code is just a manifestation of what the programmer understands, so obviously the more they understand the better the code will be.

The other side of this is that an inexperienced programmer seeking a super-creative abstraction will often be a disaster. This happens because they don’t fully understand what properties are necessary for coverage, so instead they hyper-focus on some smaller aspect of the computation. They optimize for that, but the overall fit is poor.

The problem though is that they went looking for a big creative leap. That was the real mistake. The abstraction you need is a generalization of the problems in front of you. Nothing more. Step back once or twice, don’t try to go way, way out, until much later in your life and your experience. What you do know should anchor you, always.

Another funny issue comes from concepts like patterns. As an abstraction, data structures have nearly full coverage over most computations, so you can express most things, with a few caveats, as a collection of interacting data structures. The same isn’t true for design patterns. They are closer to idioms than they are to a full abstraction. That is why they are easier to understand and more tangible. That is also why they became super popular, but it is also their failure.

You can decompose a problem into a set of design patterns, but it is more likely that the entire set now has a lot of extra artificial complexity included. Like an idiom, a pattern was meant to deal with a specific implementation issue, it would itself just be part of some abstraction, not the actual abstraction. They are implementation patterns, not design blocks. Patterns should be combined and hold places within an abstraction, not be a full and complete means of expressing the abstraction or the solution.

Oddly programmers so often seek one-size-fits-all rules, insisting that they are the one true way to do things. They do this because of complexity, but it doesn’t help. A lot of choices in programming are trade-offs, where you have to balance your decision to fit the specifics of what you are building. You shouldn’t always go left, nor should you always go right. The moment you arrive at the fork, you have to think deeply about the context you are buried in. That thinking can be complex, and it will definitely slow you down, thus the desire to blindly always pick the same direction. The less you think about it, the faster you will code, but the more likely that code will be fragile.

You can build a lot of small and medium-sized systems with brute force. It works. You don’t need to learn or even like abstractions. But if you want to work on large systems, or you want to be able to build stuff way faster, abstractions will allow you to do this. If you want to build sophisticated things, abstractions are mandatory. Once the inherent complexity passes some threshold, even the best development teams cannot deal with it, so you need ways of managing it that will allow the codebase to keep growing. This can only be done by making sure the parts are encapsulated away from each other, and almost by definition that makes the parts themselves abstract. That is why we see so many fundamental abstractions forming the base of all of our software, we have no other way of wrangling the complexity.

Thursday, November 9, 2023

Time Ranges

Sometimes you can predict the future accurately. For instance, you know that a train will leave the station tomorrow at 3 pm destined for somewhere you want to go.

But the train may not leave on time. There are a very large number of ‘unexpected’ things that could derail that plan, however, there is only a tiny probability that any of them will actually happen tomorrow. You can live with that small uncertainty, ignore it.

Sometimes you can only loosely predict the future. It will take me somewhere between 10 minutes to 1.5 hours to make dinner tonight. It’s a curve and the probability is most likely that the time it takes to make dinner lands somewhere in the middle; so not 10 mins and not 1.5 hours. It may only take half an hour, but I will only be certain about that after I finish.

If you have something big to accomplish and it is made up of a huge number of little things, and you need to understand how long it will take, you pretty much should only work with time ranges. Some things may be certain like train schedules, but more often the effort is like making dinner. This is particularly true if it is software development.

So, you have 10 new features that result in 30 functional changes to different parts of the code. You associate a time range for each functional change. Then you have to add in the time for various configurations and lots of testing.

Worth noting that the time to configure something is not the actual time to add or modify the parameters; that is trivial. It is the time required to both find and ensure that the chosen configuration values are correct. So, lots of thinking and some testing. It might take 1 minute to update a file, but up to 3 days to work through every possible permutation until you find one that works as expected. Then the range is 5 mins if you get lucky, and 3 days if the universe is against you, which it seems to be sometimes.

For most things, most of the time, you’ll get a bit of luck. The work falls on the lower side of the range. But sometimes life, politics, or unexpected problems cause delays. With time ranges, you can usually absorb a few unexpected delays and keep going.

As the work progresses the two big levers of control are a) the completion date and b) the number of features. If luck is really not on your side, you either move the date farther out or drop a few features. You often need to move one or the other lever, which is why if they are both taken away it becomes more likely the release will explode.

Some things are inherently unestimatable. You can’t know what the work is until someone has managed to get to the other side and there is no way to know if anyone will ever get to the other side.

These types of occurrences are a small part of development work, but lots of other stuff is built on top of them. If you have something like that, you do that exploration first, then re-estimate when you are done.

For example, half the features depend on you making an unguessable performance improvement. If you fail to guess how to fix that lower issue in a reasonable time frame, then those features get cut. You can still proceed with the other features. The trick is to know that as early as possible, thus don’t leave unestimatable work until the end. Do it right away.

It’s worth noting too that coding is often only one-third of the time. The analysis and design should be equal to the coding time, as should the testing time. That is, it can take 3 times longer to get done than most programmers expect, but they start 2/3rds of the way through.

This is often shortcut by doing almost no analysis or design, but that tends to bog things down in scope creep and endless changes. Skimping on testing lets more bugs escape into production, which makes any operational drama far more expensive. So, in both cases, attempting to save time by not doing necessary work comes back to haunt the project later and always ends up wasting more time than what was saved. Shortcuts are always nasty time trade-offs. Save a bit of time today, only to pay more for it later.

If you have all of the work items as time ranges, then you can pick an expected luck percentage and convert the schedule into an actual date. Most times, it’s around the 66% mark. You can tell if you are ahead or behind, and you can lock in some of the harder and easier work early, so there is at least something at the end. If you end up being late, at least you know why you are late. For example, most of the tasks ended up near their maximum times.
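
A small sketch of that conversion; the tasks and numbers are made up, and ‘luck’ is just a fraction of the way between the best and worst case for each range.

```python
from datetime import date, timedelta

# Invented task ranges: task -> (best case days, worst case days).
tasks = {
    "schema changes": (2, 6),
    "new report":     (3, 10),
    "configuration":  (0.5, 3),
    "testing":        (5, 12),
}

def expected_days(ranges, luck=0.66):
    # 66% luck means each task lands about two-thirds of the way up its range.
    return sum(lo + (hi - lo) * luck for lo, hi in ranges.values())

def expected_date(start, ranges, luck=0.66):
    # Calendar days for simplicity; a real plan would count working days.
    return start + timedelta(days=round(expected_days(ranges, luck)))

print(round(expected_days(tasks), 1))          # total effort at 66% luck
print(expected_date(date(2023, 11, 9), tasks)) # a concrete target date
```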

Time ranges also help with individual estimates. For example, you can ask a young programmer for an estimate and also a senior one. The difference will give you a range. In fact, everyone could chime in with dates, accurate or crazy, and you’d have a much better idea of how things may or may not progress. You don’t have to add in random slack time, as it isn’t ever anchored in reality. It is built in with ranges.

Time ranges are great. You keep them as is when you are working, but can convert them to fixed dates when dealing with external people. If you blow the initial ranges, you’ll know fairly early that you are unlucky or stuck in some form of dysfunction. That would let you send out an early ‘heads up’ that will mitigate some of the anger from being late.

Thursday, November 2, 2023

Special Cases

One of the trickiest parts of coding is to not let the code become a huge mess, under the pressure of rapid changes.

It’s pretty much impossible to get a concrete static specification for any piece of complex software, and it is far worse if people try to describe dynamic attributes as static ones. As such, changes in software are inevitable and constant. You write something, be prepared for it to change, it will change.

One approach to dealing with this is to separately encode each and every special case as its own stand-alone, siloed piece of code. People do this, but I highly recommend against it. It is just an exponential multiplier in the amount of work and testing necessary. Time that could have been saved by working smarter.

Instead, we always write for the general case, even if the only thing we know today is one specific special case.

That may sound a bit weird, but it is really a mindset. If someone tells you the code should do 12 things for a small set of different data, then you think of that as if it were general. But then you code out the case as specified. Say you take the 3 different types of data and put them directly through the 12 steps.

But you’ve given it some type of more general name. It isn’t ProcessX, it is something more akin to HandleThese3TypesOfData. Of course, you really don’t want the name to be that long and explicit. Pick something more general that covers the special case, but does not explicitly bind to it. We’re always searching for ‘atomic primitives’, so maybe it is GenerateReport, but it only actually works for this particular set of data and only does these 12 steps.

And now the fun begins.

Later, they have a similar case, but different. Say it is 4 types of data, but only 2 overlap with the first instance. And it is 15 steps, but only 10 overlap.

You wrap your generate report into some object or structure that can hold any of the 5 possible datatypes. You set an enumeration that switches between the original 12 steps and the newer 15 steps.

You put an indicator in the input to say which of the 2 cases match the incoming data. You write something to check the inputs first. Then you use the enum to switch between the different steps. Now someone can call it with either special case, and it works.
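
Sketched out, that shape might look something like the following; the data types, step names, and descriptor layout are placeholders standing in for the 12 and 15 steps above, not a real report engine. One general entry point is driven by a small descriptor per known case, instead of one silo per case.

```python
# Each known case declares what data it accepts and which steps it runs.
CASES = {
    "original": {"accepts": {"trades", "positions", "prices"},
                 "steps": ["load", "validate", "aggregate", "format"]},
    "newer":    {"accepts": {"trades", "positions", "risk", "limits"},
                 "steps": ["load", "validate", "enrich", "aggregate", "format"]},
}

# Placeholder step implementations; real ones would do actual work.
STEP_FUNCTIONS = {
    "load":      lambda data: data,
    "validate":  lambda data: data,
    "enrich":    lambda data: data,
    "aggregate": lambda data: {"rows": len(data)},
    "format":    lambda data: f"report: {data}",
}

def generate_report(case, data):
    spec = CASES[case]
    unexpected = set(data) - spec["accepts"]
    if unexpected:
        raise ValueError(f"{case} cannot handle {unexpected}")
    result = data
    for step in spec["steps"]:
        result = STEP_FUNCTIONS[step](result)
    return result

print(generate_report("original", {"trades": [], "positions": [], "prices": []}))
```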

Then more fun.

Someone adds a couple more special cases. You do the same thing, trying very carefully to minimize the logic wherever possible.

Maybe you put polymorphism over the input to clean that up. You flatten whatever sort of nested logic hell is building up. You move things around, making sure that any refactoring that hits the original functionality is non-destructive. In that way, you leverage the earlier work, instead of redoing it.

And it continues.

Time goes by, you realize that some of the special cases can be collapsed down, so you do that. You put in new cases and collapse parts of the cases as you can. You evolve the complexity into the code, but make sure you don’t disrupt it. The trick is always to leverage your earlier work.

If you do that diligently, then instead of a whole pile of spaghetti code, you end up with a rather clean, yet sophisticated processing engine. It takes a wide range of inputs but handles them and all of the in-between permutations correctly. You know it is correct because, at the lower levels, you are always doing the right thing.

It’s not ‘code it once and forget it’, but rather carefully grow it into the complicated beast that it needs to become in order to be useful.

Thursday, October 26, 2023

Two Generals

There are two armies, encamped on different sides of a large city.

If both armies attack the city at exactly the same time they will be victorious. If only one army attacks, the city defenders can wipe it out.

One of the generals wants to notify his peer in the other army about when to begin the attack. He sends out a message “Tomorrow at 6am”. But he receives no reply.

He has a huge problem now. Did his messenger make it to the other general? Maybe he did, but the returning messenger was captured and killed. Or maybe his messenger never made it. As he was sneaking around the outskirts of the city he met an untimely end.

So, tomorrow morning, should he attack as he said he would, assuming the messenger was successful? Or should he not attack?

Whether the message was received or not is ‘ambiguous’. The general cannot know which of the two possibilities is true; that his message didn’t make it or that the reply didn’t come back. He doesn’t have enough information to make an informed decision.

Yet the fate of the battle rests on reliable communications…

I’m sure that many readers have various suggestions for ways to remove the ambiguity. For instance, you could send other messengers, lots of them. But if they just disappear too, then the ambiguity is still there. You could try to establish waypoints, so the distance of the communication is shorter. But if there is still even a tiny corridor where the defenders reign supreme, it makes no difference.
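
A tiny simulation of the ‘send lots of messengers’ idea, with an invented loss probability, just to show where the uncertainty lands. Both sides keep acknowledging until a message is lost, and whoever sent that last, lost message is left not knowing whether it arrived; the ambiguity never disappears, it only moves.

```python
import random

def negotiate(loss_probability=0.3, max_rounds=50, seed=None):
    # Generals alternate sending acknowledgments until a messenger is lost.
    rng = random.Random(seed)
    sender = "General A"
    for _ in range(max_rounds):
        if rng.random() < loss_probability:
            return f"{sender} is left uncertain: their last message may be lost"
        sender = "General B" if sender == "General A" else "General A"
    return "ran out of messengers; whoever sent the last one is still uncertain"

for trial in range(3):
    print(negotiate(seed=trial))
```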

What if instead of sending “tomorrow at 6am” you send it as a question “What about tomorrow at 6am?” Then if it is intercepted the general isn’t compelled to act. But now the other general has the exact same original problem with their reply. It swapped the problem, but it didn’t go away.

Clever people might suggest a different, alternative medium. Like flags or something, but the city is large enough that there is no guarantee about visibility, and if you used smoke or balloons, the defenders could just clear it away. The medium isn’t the problem.

The thing is, there is no perfect, always-working scenario. Where you need information but there is only an ambiguity, no matter how small you shrink it, it still remains. The ambiguity is the information.

Worse is that this is a fundamental physical constraint imposed by the universe for any sort of distributed communication. All distributed software, where separate computations need to synchronize on any sort of non-perfect medium, is bound by it. If you have a client and server, or a bunch of peers, or even two processes using a less-than-perfect medium such as files, no technology, protocol, or magic bullet will make the conversation 100% reliable, and even if it is 99.9999% reliable, there is still some sort of ambiguity in your way.

At your very best you could shrink it down to say a 1 in 100 years likelihood of failure, so you could not see it go wrong in your entire career, but it will still go wrong, someday. And that is what makes it so different from a regular computation. It is not deterministic.

Short of some unexpected external event like a flood or gamma rays or a hardware defect or something, all of the other computations will work perfectly each and every time they run on the computer. If they work for their full context, you can always assume they will continue to work. Oddly, we treat them as 100%, even though the physical nature of the computation itself is subject to adverse events; the software itself, as a formal system, is not.

There is and will always be a huge gulf between 100% and 99.9...%, between deterministic computations and non-deterministic (in the distributed sense, not the language theory one) ones. Ultimately it affects the level of trust that we place in our software.

Thursday, October 19, 2023

Impermanence

Physical things can hang around for a long, long time. You can keep them in your house, for example. Short of some epic disaster like a fire, it is up to you how long you value them and when you finally get rid of them. It is in your control. For some people, it is possible to keep their stuff safe for their entire life. It becomes an heirloom.

Digital things are impermanent. There are an endless number of ways for them to get lost, forgotten, or corrupted.

The next hardware you buy is probably only good for five years, maybe a little bit longer. Rolling over to newer hardware probably won’t go well, even if you spend a ridiculous amount of time trying to figure it out.

The cloud storage you pay for may be discontinued or out of business next week. They probably scrimped on backups, so odds are that in a disaster the stuff you wanted won’t make it. They’ll apologize, of course. And they are always working harder on finding ways to increase the price than they are on finding ways to make it more reliable, engineering is not as important as exploitation. You may wake up one day to a nasty price increase. Storing stuff on the cloud is extremely hazardous.

Compounding it all, software keeps changing. Most programmers suck at achieving any real sort of backward compatibility. They’ll just force you to wipe out everything because it was easier for them to code it that way. The chances that persistent data survives for longer than a decade are slim. And even if technically it did survive, it has probably become unreachable, unusable. The software you used before to leverage it has long since been broken by someone else who didn’t know or care about what you need.

There was a big stink about forgetting things on the web, but honestly, it was a total waste of time. Not only is the web tragically forgetful, but the infrastructure on the web is getting a little more useless every day. Technically stuff could still be out there, but how would you find it now? The web is about hype, not information. Stuff tends to vanish when the limelight gets shifted.

In the grand scheme of things, the digital realm is extraordinarily flakey. Far more of its history is lost than preserved. It’s a dangerous, somewhat unpredictable place where selfishness and irrationality have more weight than quality. The odds are better that someone else whom you don’t want to see your data will see it than you getting that same data back in twenty or thirty years. We make stuff digital to be trendy, not smart.

Thursday, October 12, 2023

Curiosity

If I had to pick one quality that I think is necessary to make a big software project run smoothly, it is curiosity.


If you get code that is a 100% perfect fit to the domain problems, the project will be great. If you correctly guessed how long it would take in advance, and people actually believed you and gave you that time, the politics would be negligible.


But anyone who has been on a bunch of big projects for a long time knows that it almost never works that way.


It’s usually some sort of dumpster fire. The time isn’t enough, the fit is bad, and the technology is flaky.


So, ultimately, for the first few versions of the code, you do what you need to do in order to get it out the door. It ain’t pretty but it seems to be an inescapable reality of the job.


But after that smoke clears, and if there is still an appetite to go forward, the circumstances have changed. Hopefully, there is confidence in the work now.


Curious people will look back and what happened earlier, and what they have, and start asking the hard questions. Why did someone do that? How is that supposed to work? etc. 


If they are curious and enabled, then at least some of the huge collection of smaller problems that plague the earlier versions are now in their focus. And they will get looked at, and hopefully corrected. All of that happens outside of the main push for whatever new features other people desire. 


It is fundamentally cleanup work. There is a lot of refactoring, or replacement of weak parts. There is more investigation, and deep diving into the stranger issues. All of which is necessary in order to build more on what is there now.


Non-curious people will just claim that something is already in production, that it is locked that way, and it should not be touched, ever. They will enshrine the damage, and try to move on to other things. That is a classic mistake, building anything on a shaky foundation is a waste of time. But you have to be curious about the foundation in order to be able to assess that it is not as good as necessary.





Thursday, October 5, 2023

Feedback

I was working with a group of people a while ago who built a web app and put it on the Internet. It was a bit crude, didn’t quite follow normal user conventions, and was quite rough around the edges. When they built it, they added a button to get users' feedback.

Once they put it out there live, they got swamped by negative feedback. People took the time to complain about all sorts of things, to point out the deficiencies. It was a lot. They were overwhelmed.

So, they removed the feedback button.

As far as solving problems goes, this was about the best example of the worst possible way to do it. They had this engaged group of people who were willing to tell them what was wrong with their work, and instead of listening to that feedback and improving, they just shut down the interaction.

Not surprisingly, people avoided the app and it never really took off.

For programmers, feedback is difficult. We are already on thin ice when we build stuff, as there is more we don’t know about what we are doing than what we do know. And it is easy to throw together something quickly, but it takes a crazy long time to make it good. This all leaves us with a perpetual feeling of uncertainty, that the things we build could always be way better. You never really master the craft.

Those nagging doubts tend to make most programmers highly over-sensitive to criticism. They only want positive feedback.

On top of that, user feedback is almost never literal. The users are vague and wishy-washy when they talk about what is wrong or why something bothers them.

They often know what they don’t like, but they do not know what is better or correct. Just that it is wrong. They are irrational and they usually don’t like fully explaining themselves.

In order to make sense of what they are saying, you have to learn to read between the lines. Use what they say to get an idea that something might be wrong, but then work out the actual problems on your own. Once you think you understand, you change things and test to see if that is better. It’s a soft, loose process, that usually involves endless rounds of refinements.

Things gradually get better, but it isn’t boolean. It’s not done or undone, it is a convergence. Thinking of it as one discrete task is a common mistake made by popular feedback-tracking tools. They confuse the rigor of issuing instructions to a computer with the elasticity of interfaces. They try to treat it all the same when it is completely different.

If you were being pragmatic about it, you would capture all of the feedback and triage it. Positive or negative, categorized by the visible features that are involved. Then you might collect together certain negative entries and hypothesize that the underlying cause is something tangible. From there, you would schedule some work, and then schedule some form of testing. The overall category of the problem though would likely never really go away, never get resolved. It would stay there for the life of the system. Just something that you are gradually working towards improving.
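
As a rough sketch of that kind of triage, with entirely hypothetical field names and categories (nothing here is a specific tool or format, just an illustration of grouping the negative comments by feature):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Feedback:
    feature: str      # the visible feature the user was touching
    sentiment: str    # "positive" or "negative"
    text: str         # the raw, usually vague, comment

def triage(entries: list[Feedback]) -> dict[str, list[Feedback]]:
    """Group the negative comments by feature so a human can
    hypothesize about a tangible underlying cause for each group."""
    groups: dict[str, list[Feedback]] = defaultdict(list)
    for entry in entries:
        if entry.sentiment == "negative":
            groups[entry.feature].append(entry)
    return dict(groups)

if __name__ == "__main__":
    inbox = [
        Feedback("search", "negative", "it feels awkward"),
        Feedback("search", "negative", "too slow to be usable"),
        Feedback("reports", "positive", "love the new export"),
    ]
    for feature, items in triage(inbox).items():
        print(feature, len(items), "negative comments")
```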

The classic example is when the users say that the system or some of its features are “awkward”. That most often means that parts of the behavior do not ‘fit’ well with the users as they deal with their problem domain. It could be because the workflow is wrong, or that the interface conventions clash with the other tools they are using, or that the features should have been somewhere else, or that it is all too slow to be usable. It is hard to tell, but it is still vital feedback.

You don’t “de-awkward” a system, it is not a ‘thing’. It’s not a requirement, a ticket, a feature request, anything really. It is more about the ‘feel’ the users experience while using the features. If you want to make it less awkward, you probably have to directly interact with them while they are doing things they find awkward, then take the scratchy points you observed and guess how to minimize them. You definitely won’t be 100% correct, you might not even be 10% correct. It will take a lot to finally get your finger on the types of things that ‘you’ can do to improve the situation.

A rather huge problem in the software industry is that most people don’t want to do the above work. They only want to solve contained discrete little problems, not get lost in some undefinable, unquantifiable, swamp. We lay out methodologies, build tools, and craft processes on the assumption that all things are atomic, discrete, and tangible, and it shows in our outputs. ‘Awkward’ comments are ignored. The bad behaviors get locked in, unchangeable. People just wrap more stuff around the outside, but it too suffers from the same fate. Eventually, we just give up, start all over again from first principles, and eventually arrive at the same conclusion again. It’s an endless cycle where things get more complicated but gradually less ‘user-friendly’.

Thursday, September 28, 2023

Trifecta

Right from the beginning of my career, I have been bothered by the way we handle software development. As an industry, we have a huge problem with figuring out ‘who’ is responsible for ‘what’.

For decades, we’ve had endless methodologies, large and small, but all of them just seem to make poor tradeoffs between make-work and chaos. Neither is appealing.

As well, there are all sorts of other crazy processes and plenty of misconceptions floating around. Because of this most projects are dumpster fires, which only adds to the stress, wastes energy, and ensures poor quality.

For me, whenever development has worked smoothly, it has been because of strong personalities who subverted the enforced methodology. Strong, knowledgeable leadership works well.

Whenever projects have been excessively painful, it was often because confusion in the roles and responsibilities resulted in poor outcomes. Politics blossoms when the roles or rules are convoluted or vague. Focus gets misplaced, time gets wasted, and the quality plummets. It gets ugly.

It’s not that I have an answer, but after 30 years of working and 17 years of writing about it, I feel like I should at least lay down some basic principles.

So, here goes...

There are three primary areas for software. They are: a) the problem domain, b) the operational environment, and c) the development environment. Software (c) is a set of solutions (b) for some problems (a).

A system is a collection of similar solutions for a common problem domain.

There are two primary motivators for creating software: a) vertical and b) horizontal.

A vertical motivator is effectively a business-driven need for some software. Either they use it, offer it as a service, or sell it.

A horizontal motivator is an infrastructural need for some software. Missing parts of the puzzle that are disrupting either the operational or development flow.

Desired quality is a growing exponential curve, where low-quality throw-away code is to the left, then static, hardcoded, in-house development, then decent commercial products, then likely healthcare, aerospace, and NASA. Getting to the next category is maybe 2x-10x more work for each hop.

The actual quality is the desired level plus the sum of all testing, which is also exponential. So to find the next diminishing set of less visible bugs is 2x - 10x more effort. There is an endless series of bug sets. Barely reasonable commercial quality is probably a 1:1 ratio of testing with coding.
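
To make that arithmetic concrete, here is a rough sketch; the 4x multiplier and the point where the testing ratio reaches 1:1 are assumptions pulled from the ranges above, not measurements:

```python
# Rough illustration of the exponential hops described above.
MULTIPLIER = 4  # assumed midpoint of the 2x-10x range

categories = ["throw-away", "in-house", "commercial", "healthcare/aerospace"]

coding = 1.0
for name in categories:
    # assume roughly 1:1 testing-to-coding from commercial quality onward
    testing = coding if name in ("commercial", "healthcare/aerospace") else coding * 0.25
    print(f"{name:22s} coding ~{coding:5.0f}x   testing ~{testing:5.1f}x")
    coding *= MULTIPLIER
```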

The quality of the code itself is dependent on the design and the enforcement of good style and conventions. Messy code is buggy code. The quality of the design is dependent on the depth of the analysis. The overall results are a reflection of the understanding of the designers and coders. The problem domain is often vague and irrational but it has to be mapped to code which is precise and logical. That is a very tricky mapping.

Ultimately while software is just instructions for a computer to run, its genesis is from and all about people. It is a highly social occupation. Non-trivial software takes a team to build and a team to run.

So, for every system, we end up with three main players:
  • Domain Champion
  • Operations Manager
  • Lead Software Developer
The domain champion represents all of the users. They also represent some of the funding; they are effectively paying for work to get done. They have a short-term agenda of making sure the software out there runs as expected and they are the ones that commission new features for it. They drive any non-technical analysis. They have or can get all of the answers necessary for the problem domain, which they need to understand deeply.

The operational manager is effectively the day-to-day ‘driver’ of the software. They set it up and offer it for others to use. They need to get the software installed, upgraded, and carefully monitor it. They are the front line for dealing with any issues the users encounter. They offer access as a service.

The lead developer builds stuff. It is constructive. They should focus on figuring out the stuff that needs to be built and the best way to do it, given all of the domain and operational issues. The features are usually from the top down, but to get effective construction the code needs to be built from the bottom up. The persistence foundation should exist before the GUI, for example.

For most domain functionality, the champion is effectively responsible for making sure that the features meet the needs of the users. The lead makes sure the implementation of those features is reasonable. They do this by breaking those features down into lots of different functionality to get implemented. A champion may need the system to keep track of some critical data, the lead may implement this as a set of ETL feeds and some user screens.

If there are bugs in production the users should go directly to the operations manager. If the manager is unable to resolve the issues, then they would go to the lead developer, but it would only be for bugs that are brand new. If it's recurring, the manager already knows how to deal with it. The operations manager would know if the system is slow or overusing resources. Periodically they would provide feedback to the lead.

If a project is infrastructure, cleanup, or reuse, it would be commissioned directly by the lead developer. They should be able to fund maintenance, proof of concept, new technologies, and reuse work on their own since the other parties have no reason to do so. The project will decay if someone doesn't do it.

The lead needs to constantly make the development process better, smoother, and more effective. They need to make sure the technology used is keeping up with the industry. Their primary focus is engineering, but they also need to be concerned with solution fit, and user issues like look and feel. They set the baseline for quality. If the interface is ugly or weird, it is their fault.

As well as the champion, the operations manager would have their own system requirements. They set up and are responsible for the runtime, so they have a strong say in the technologies, configuration, security, performance, resource usage, monitoring, logging, etc. All of the behavior and functionality they need to do their job. If they have lots of different systems, obviously having it consistent or aligned would be highly important to them. They would pick the OS and persistence for example, but not the programming language. The dependencies used for integration would fall under their purview.

The process for completing the analysis needed to come up with a reasonable set of features is the responsibility of the champion. Any sort of business analyst would report to them. They would craft the high-level descriptions and required features. This would be used by the lead to get a design.

If the project is infrastructure, instead of the champion, it is the responsibility of the lead to do the analysis. Generally, the work is technical or about organization, although it could be reliant on generalities within the problem domain. The work might be combining a bunch of redundant software engines altogether, to get reuse, for example.

Any sort of technical design is the lead, and if the organization is large, they likely need to coordinate the scope and designs with the firm’s architects and security officers. As well, the operational requirements would need to be followed. A design for an integrated system is not an independent silo, it has to fit with all of the other existing systems.

Architects would also be responsible for keeping the higher level organized. So, they wouldn’t allow the lead or champion to poach work from other teams.

The process of building stuff is up to the lead. They need to do it any which way, and in any order, that best suits them and their teams. They should feel comfortable with the processes they are using to turn analysis into deployment.

They do need to give time estimates, and if they miss them, detailed explanations of why they missed them. Leads need to learn to control the expectations of the champion and the users. They can’t promise two years of work in six months, for example. If development goes poorly or the system is unusable they are on the hook for it.

There should be a separate quality assurance department that would take the requirements from the champions, leads, and operations managers, and ensure that the things being delivered meet those specifications. They would also do performance and automated testing. With the specs and the delivery items, they would return a report on all of the deficiencies to all three parties. The lead and champion would then decide which issues to fix. Time and expected quality would drive those decisions.

The items that were tested in QA are the items that are given directly to operations to install or upgrade. There are two release processes: the full one and the fast one. The operations manager schedules installations and patches at their own convenience and notifies the users when they are completed. The lead just queues up the almost-finished work for QA.

The lead has minimal interaction with operations. They might get pulled into net new bug issues, they get requirements for how the software should operate, and they may occasionally, with really tricky bugs, have to get direct special access to production in order to resolve problems. But they don’t monitor the system, and they aren’t the frontline for any issues. They need to focus on designing and building stuff.

The proportion of funding for the champion and for the lead defines the control of technical debt. If the system is unstable or development is slow, more funding needs to go into cleanup. The champion controls the priority of feature development, and the lead controls the priority of the underlying functionality. That may mean that a highly desired feature gets delayed until missing low-level functionality is ready. Building code out of order is expensive and hurts quality.

So that’s it. It’s a combination of the best of all of the processes, methodologies, writing, books, arguments, and discussions that I’ve seen over the decades and in the companies that I have worked for directly or indirectly. It offsets some of the growing chaos that I’ve seen and puts back some of the forgotten knowledge.

All you need, really, is three people leading the work in sync with each other in well-defined roles. There are plenty of variations. For example in pure software companies, there is a separate operations manager at each client. In some cases, the domain champion and lead are the same person, particularly when the domain is technical. So, as long as the basic structure is clear, the exact arrangement can be tweaked. Sometimes there are conflicting, overlapping champions pulling in different directions.

Thursday, September 21, 2023

Historic Artifacts with Data

Software development communities have a lot of weird historical noise when it comes to data. I guess we’ve been trying so hard to ignore it, that we’ve made a total mess of it.

So, let’s try again:

A datum is a fixed set of bits. The size is set; it is always fixed, it does not change. We’ll call this a primitive datum.

We often need collections of datums. In the early days, the minimum and maximum number of entries in a collection were fixed. Later we accepted that it could be a huge number, but keep in mind that it is never infinite. There is always a fixed limitation, it is just that it might not be easily computable in advance.

Collections can have dimensions and they can have structure. A matrix is 2 dimensions, a tree is a structure.

All of the data we can collect in our physical universe fits into this. It can be a single value, a list or set of them, a tree, a directed acyclic graph, a full graph, or even possibly a hypergraph. That covers it all.

I suppose some data out there could need a 14-dimension hypergraph to correctly represent it. I’m not sure what that would look like, and I’m probably not going to encounter that data while doing application programming.
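
As a loose sketch of those shapes as containers, with purely illustrative type names that are not from any particular library:

```python
from dataclasses import dataclass, field

# A primitive datum: a single fixed-size value.
Primitive = int | float | bool

@dataclass
class TreeNode:
    value: Primitive
    children: list["TreeNode"] = field(default_factory=list)  # structure

Matrix = list[list[Primitive]]     # two dimensions
Graph = dict[str, set[str]]        # nodes mapped to the nodes they connect to
Hypergraph = list[set[str]]        # each edge can join any number of nodes
```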

Some of the confusion comes from things that we’ve faked. Strings are a great example of this. If you were paying attention, a character is a primitive datum. A string is an arbitrary list of characters. That list is strictly ordered. The size of a string is mostly variable, but there are plenty of locations and technologies where you need to set a max.

So, a string is just a container full of characters. Doing something like putting double quotes around it is a syntactic trick to use a special character to denote the start of the container, and the same special character to denote the end. Denoting the start and end is changing the state of the interpretation of those characters. That is, you need a way of knowing that a bunch of sequential datums should stay together. You could put in some sort of type identifier and a size, or you could use a separator and an implicit end, which is usually something like EOF or EOS. Or you can just mark out the start and end, as we see commonly in strings.
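
To make those options concrete, here is a tiny illustrative sketch; the length-prefix syntax and the NUL separator are just assumptions for the example, not any real wire format:

```python
# Three ways to frame the same sequence of characters. All three add
# meta-data on top of the raw characters themselves.
chars = "hello"

quoted = '"' + chars + '"'                 # special character marks start and end
length_prefixed = f"{len(chars)}:{chars}"  # size up front, no end marker needed
separated = chars + "\x00"                 # separator (here NUL) with implicit end

print(quoted, length_prefixed, repr(separated))
```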

Any which way, you add structure on top of a sequence of characters, but people incorrectly think the result is itself a primitive datum. It is not. It is actually a secret collection.

The structural markers embedded in the data are data themselves. Given a data format, there can be a lot of them. They effectively are meta-data that tells one how to collect together and identify the intervening data. They can be ambiguous, noisy, unbalanced, and a whole lot of other issues. They sometimes look redundant, but you could figure out an exact minimum for them to properly encode the structure. But properly encoding one structure is not the same as properly encoding all structures. The more general you make things, the more permutations you have to distinguish in the meta-data, thus the noisier it will seem to get.

Given all that pedantry, you could figure out the minimum necessary size of the meta-data with respect to all of the contexts it will be used for. Then you can look at any format and see if it is excessive.

Then the only thing left to do is balance out the subjectiveness of the representation of the markers.

If you publish that, and you explicitly define the contexts, then you have a format whose expressibility is understood and is as nice as possible with respect to it, and then what’s left is just socializing the subjective syntax choices.

In that sense, you are just left with primitive datums and containers. If you view data that way, it gets a whole lot simpler. You are collecting together all of these different primitives and containers, and you need to ensure that any structure you use to hold them matches closely to the reality of their existence. We call that a model, and the full set of permutations the model can hold is its expressiveness. If you want the best possible representation then you need to tighten down that model as much as possible, without constricting it so much that legitimate data can’t be represented.
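
As a small, hypothetical illustration of tightening a model so that garbage simply cannot be expressed (the order and status names are invented for the example):

```python
from dataclasses import dataclass
from enum import Enum

# Loose model: any string at all can be stored, so garbage fits too.
@dataclass
class LooseOrder:
    status: str

# Tighter model: only the states that can legitimately occur are representable.
class Status(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"

@dataclass
class TightOrder:
    status: Status

    def __post_init__(self):
        if not isinstance(self.status, Status):
            self.status = Status(self.status)  # raises ValueError on garbage

print(TightOrder("shipped"))   # fine, normalized to Status.SHIPPED
# TightOrder("maybe?")         # rejected: not a legitimate state
```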

From time to time, after you have collected it, you may want to move the data around. But be careful: only one instance of that data should be modifiable. Merging structure reliably is impossible. The effort to get a good model is wasted if the data is then just haphazardly spread everywhere.

Data is the foundation of every system. Modeling properly can be complex, but the data itself doesn’t have to be the problem. The code built on top is only as good as the data below it. Good code with bad data is useless. Bad code with good data is fixable.

Thursday, September 14, 2023

Modelling Complex Data

Usually, the point of building a software system is to collect data. The system is only as good as the data that it persists for a long time.

Machines go up and down, so data is really only persisted when it is stored in a long-term device like a hard disk.

What you have in memory is transitory. It may stay around for a while, it may not. A system that collects stuff, but accidentally forgets about it sometimes, is useless. You can not trust it, and trust is a fundamental requirement for every piece of software, large and small.

But even if you manage to store a massive amount of data, if it is a chaotic mess it is also useless. You store data so you can use it later; if that isn’t possible, then you haven’t really stored it.

So, it is extremely important that the data you store is organized. Organization is the means of retrieving it.

A long, long time ago everybody rolled their own persistence. It was a disaster. Then relational databases were discovered and they dominated. They work incredibly well, but they are somewhat awkward to use and you need to learn a lot of stuff in order to use them properly. Still, they have given us decades of reliability.

NoSQL came along as an alternative, but to get the most out of the tech people still had to understand concepts like relational algebra and normalization. They didn’t want to, so things returned to the bad old days where people effectively rolled their own messes.

The problem isn’t the technology, it is the fact that data needs to be organized to be useful. Some new shiny tech that promises to make that issue go away is lying to you. You can’t just toss the data somewhere and figure it out later. All of those promises over the decades ended in tears.

Realistically, you cannot avoid having at least one person on every team that understands how to model persistent data. More is obviously better. Like most things in IT, from the outside, it may seem simple but it is steeped in difficulties.

The first fundamental point is that any sort of redundant data is bad. Computers are stupid and merging is mostly unreliable, so it’s not about saving disk space, but rather the integrity, aka quality, of the data. The best systems only store everything once; then the code is simpler and the overall quality of the system is always higher.
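
A tiny, hypothetical sketch of the difference; the customer and order fields are invented for the example:

```python
from dataclasses import dataclass
from decimal import Decimal

# Redundant: the customer's details are copied onto every order, so a
# correction has to be merged into many places and will eventually drift.
@dataclass
class DenormalizedOrder:
    order_id: int
    customer_name: str
    customer_address: str
    amount: Decimal

# Stored once: the order only references the customer by id.
@dataclass
class Customer:
    customer_id: int
    name: str
    address: str

@dataclass
class Order:
    order_id: int
    customer_id: int   # a reference, not a copy
    amount: Decimal
```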

The second fundamental issue is that you want to utilize the capabilities of the computer to keep you from storing garbage. That is, the tighter your model matches the real world, the less likely it is choked with garbage.

The problem is that that means pedantically figuring out the breadth and depth of each and everything you need to store. It is a huge part of analysis, a specialized skill all on its own. Most people are too impatient to do this, so they end up paying the price.

To figure out a model that fits correctly to the problem domain means actually having to understand a lot of the problem domain. Many programmers are already so overwhelmed by the technological issues that they don’t want to poke into the domain ones too. Unfortunately, you have no choice. Coders code what they know, and if they are clueless as to what the users are doing, their code will reflect that. But also, coders with domain expertise are far more valuable than generic coders, so there's a huge career upside to learning what the users are doing with their software.

If you avoid redundant data and you utilize the underlying technology to its best abilities to ensure that the data you need fits tightly then it’s a strong foundation to build on top of.

If you don’t have this, then all of those little modeling flaws will percolate through the code, which causes it to converge rapidly on spaghetti. That is, the best code in the world will still be awful if the underlying persisted data is awful. It will be awful because either it lets the bad data through, or it goes to insane lengths to not let the bad data through. You lose either way. A crumbly foundation is an immediate failure out of the gate.

Time spent modeling the data ends up saving a lot of the time that would otherwise be wasted on hacking away at questionable fixes in the code. The code is better.

Saturday, September 9, 2023

The Groove

A good development project is smooth. Not fun, not exciting, but smooth.

You figure out what you need to build. Spending lots of time here helps.

Then you sit down to code it. There are always a whole lot of things, some obvious, and some not, that the code needs to do in order for it to work as expected.

If you are curious, you’ll find out lots of interesting things about the problem domain.

If you are rational, you start coding at the bottom. Get the data, add it to persistence, and make it available to the code above. You work your way slowly upwards.

If you are efficient, the code has a lot of reuse. So, instead of having to add lots of stuff again, you are mostly just extending it to do more.

Toward the end, you’ll deal with the interfaces. The API, CLI, GUI, and NUI ones. Wiring them up nicely is the last thing you do because if you do it too early, they will keep changing.

If there is a time crunch, then you’ll tighten the focus down to the things that absolutely must be there. It is not nice, but sometimes you have no choice.

Before release, you go through some form of extensive QA. When you find a problem, you fix it as low as possible. You iterate through as many cycles as you need to get the quality you desire.

The first thing you do after the release is go back and clean up the mess you made in order to get the release out. You do this before adding more features and functionality.

If you follow this pattern, you get into a rhythm. The code grows rapidly, it evolves to satisfy more and more of the user’s needs. If it's good, the users may even like using it. As time goes on, you build up larger and larger machinery, to make future coding tasks even easier. The work should get smoother over time. It’s a good sign.

Thursday, August 31, 2023

Patching is not Programming

One of the best skills to have in software development is the ability to jump into some code, find, and then fix an error it is having.

All software needs bug fixing; being able to do it quickly and reliably is a great skill.

But ...

Programming is not bug fixing. They are completely different. You can, of course, just toss in some code from StackOverflow, then bug fix it until it seems to be doing what it needs to do. That is a common strategy. But it is not programming.

In programming, you go the other way. You take the knowledge you understand, and you carefully lay out the mechanics to make that reality a thing. What is important is how you have laid out the code in an organized manner with an expectation for all of its possible behaviours.

Then later you bug fix it.

Thursday, August 24, 2023

Cache Invalidation Problem

In an earlier post, when I suggested that it was unethical for computers to lie to their users, someone brought up the “cache invalidation” problem as a counterpoint.

Cache invalidation is believed to be a hard problem to solve. If you cache any type of data, how do you know when to invalidate and remove that data from the cache?

Their argument was that because it is believed to be hard (but not impossible), it is then acceptable for any implementation of caching to send out the wrong data to people. That ‘hard’ is a valid excuse for things not working correctly.

As far as difficult programming problems go, this one is mostly self-inflicted. That is, it is often seen as a hard problem simply because people made it a hard problem. And it certainly isn’t impossible like the Two Generals’ Problem (TGP).

Essentially, if you don’t know how to cache data correctly, then don’t cache data. Simple.

“We have to cache to get performance”

In some cases yes, but for most of my life, I have been ‘removing’ caching from applications to get performance. Mostly because the caching was implemented so badly that it would never work.

There are two very common problems with application caching:

First: Caching everything, but never emptying the cache.

A cache is a fixed-sized piece of memory. You can’t just add stuff without deleting it. A well-written cache will often prime (preload values) and then every miss will cause an add/delete. It requires a victim selection approach such as LRU.

So, yeah, putting dictionaries everywhere and checking them for previous things is not caching; it is just crude lookup tables, and you have to be extra careful with those. You can use them for memoization in long-running algorithms, but you cannot use them as cheap-and-easy fake caching.

If you add stuff and never get rid of it, it isn’t caching, it is just intentional or accidental leaking.
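
For contrast, here is a minimal sketch of a bounded cache with LRU victim selection versus an add-only dictionary; the capacity is arbitrary and Python's OrderedDict is just a convenient way to keep it short:

```python
from collections import OrderedDict

class LRUCache:
    """A fixed-size cache: every add past capacity evicts the least
    recently used entry, so it cannot silently leak memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None                      # a miss: the caller fetches and puts
        self._items.move_to_end(key)         # mark as recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the LRU victim

# By contrast, a bare dict used "as a cache" only ever grows:
leaky = {}  # add-only lookup table, not a cache
```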

Second: Caching doesn’t even work on random or infrequent access.

For caching to save time, you have to have a ‘hit’ on the cache. Misses eat more CPU, not less. Misses are bad. If 90% of the time you get a miss, the cache is wasting more CPU than it is saving. You can increase the size of the cache, but priming it just takes longer.

So caching really only works on common data that is used all of the time. You can’t cache everything, and it probably follows the Pareto rule or something, so maybe less than 20% of the data you have in a system would actually benefit from caching.

The core point though is if you don’t know the frequency and distribution of the incoming requests, you are not ready to add in caching. You’re just taking wild guesses and those guesses will mostly be wrong.
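
A quick back-of-the-envelope check of that claim; the costs below are invented numbers, only the shape of the arithmetic matters:

```python
def expected_cost(hit_rate: float, hit_cost: float, miss_cost: float) -> float:
    # a miss pays for the failed lookup *and* then the real fetch
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost

FETCH = 10.0            # assumed cost of just fetching, with no cache at all
HIT, MISS = 2.0, 12.0   # assumed: every lookup costs 2, a miss also pays the fetch

for rate in (0.9, 0.5, 0.1):
    print(f"hit rate {rate:.0%}: with cache ~{expected_cost(rate, HIT, MISS):.1f}, without {FETCH:.1f}")
# With these numbers, a 10% hit rate costs more than not caching at all.
```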

“We have to cache to reduce throughput”

This is the other side of the coin, which shows up when scaling. You use caching to reduce the throughput for a lot of clients. In this case, it doesn’t matter if accessing the data is slower, so long as the requests for data are reduced. It is not a problem for small, medium, and even some large systems. It’s a special case.

General caching really only works correctly for static resources. If the resource changes at all, it should not be cached. If the resource could be different, but you don’t know, then you cannot cache it.

If you do something like web browsers where you do allow caching, but provide a refresh action to override it, you’re mostly just annoying people. People keep hitting shift-refresh just in case. All the time. So the throughput isn’t less, it is just delayed and can even be more.

“We can’t know if the data we cached has been changed ....”

There are 2 types of caches. A read-only cache and a write-through cache. They are very different, they have different usages.

A read-only cache holds data that never changes. That is easy. For however you define it, there is a guarantee out there, somewhere, that the data will not change (at least for a fixed time period which you know). Then you can use a read-only cache, no other effort is needed.

Although it is read-only, it is also a good ‘best practice’ to add in a cache purge operational end-point anyway. Guarantees in the modern world are not honored as they should be, and someone may need to ask ops to purge the cache in an emergency.

If you do have data that may occasionally change after it has been cached, there are a couple of ways of handling it.

The worst one is to add timeouts. It sort of works for something like static HTML pages or CSS files, but only if your releases are very infrequent. You can get a bit of resource savings at the cost of having short periods of known buggy behavior. But it is a direct trade-off, for up to the entire length of the timeout remaining after a change, the system is possibly defective.

It may also be important to stagger the timeouts if what you are trying to do is ease throughput. Having everything timeout in the middle of the night (when you expect data refreshes) doesn’t help with an early morning access spike. If most people use the software when they first get into work, that approach is mostly ineffective.

It makes no sense at all, though, if most of the caching attempts miss because they have timed out earlier. If you have a giant webapp with a short timeout, and most traversals through the site are effectively random walks, then the hits on the overlap are drowned out by timing and space constraints. If you bypass the memory constraints by utilizing the disk, you will save a bit on throughput, but the overall performance suffers. And you should only do this with static pages.
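
If timeouts are used anyway, staggering them is cheap. A tiny sketch, where the base TTL and the jitter window are arbitrary assumptions:

```python
import random
import time

BASE_TTL = 3600   # assumed one-hour timeout
JITTER = 600      # spread expiries over an extra ten minutes

# note: still unbounded; a real cache would also need a size limit
_cache: dict[str, tuple[float, object]] = {}

def put(key: str, value: object) -> None:
    # each entry gets its own expiry so they do not all lapse at once
    expires_at = time.time() + BASE_TTL + random.uniform(0, JITTER)
    _cache[key] = (expires_at, value)

def get(key: str):
    entry = _cache.get(key)
    if entry is None or entry[0] < time.time():
        _cache.pop(key, None)   # expired or missing: treat as a miss
        return None
    return entry[1]
```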

There were days in the early web when people blindly believed that caching as much as possible was the key to scaling. That made quite a mess.

The second worst way of handling changing data is to route around the edge. That is, set up a read-only cache, have a secondary way of updating the data, then use events to trigger a purge on that entry, or worse, the entire cache. That is weak, in that you are leaving a gaping hole (aka race condition) between the update time and the purge time. People do this far too often and for the most part, the bugs are not noticed or reported, but it doesn’t make it reasonable.

The best way of dealing with changeable data is having a ‘write-through’ cache that is an intermediary between all of the things that need the cached data and the thing that can update it. That is, you push down a write-through cache low enough so everybody uses it, and the ones that need changes do their updates directly on the cache. The cache receives the update and then ensures that the database is updated too. It is all wired with proper and strict transactional integrity, so it is reliable.
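
A minimal sketch of that arrangement, assuming a single cache object that every reader and writer goes through, and a backing store with simple get/put calls (both invented for the example):

```python
class WriteThroughCache:
    """All readers and writers go through this one intermediary: an update
    lands in the cache and the backing store together, so this layer never
    hands out an entry it knows to be stale."""

    def __init__(self, store):
        self._store = store      # anything with get(key) / put(key, value)
        self._cache: dict = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._store.get(key)   # miss: fill from the store
        return self._cache[key]

    def put(self, key, value):
        # write through: the store and the cache are updated together;
        # in a real system this would be wrapped in a transaction
        self._store.put(key, value)
        self._cache[key] = value

# Example backing store, for illustration only.
class DictStore:
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

cache = WriteThroughCache(DictStore())
cache.put("user:1", {"name": "Ada"})
print(cache.get("user:1"))
```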

“We can’t use a write-through cache, it is a different team that does the updates”

That is an organizational or architectural problem, not a caching one. There is a better way to solve this, but the way you are arranged, half of the problem falls on the wrong people.

At very worst set up an update event stream, purge individual items, and live with the race condition. Also, provide an operational end-point to purge everything, as you’ll need it sometimes.

If it can change and you won’t ever know when it changes, then you should not cache it.

“We have 18 copies of the data, and 3 of them are actively updated”

That is just insane. What you have is a spaghetti architecture with massive redundancy and an unsolvable merge problem. About the only thing you should do here is to not cache. Fix your overall architecture, clean it up, and do not try to bypass ‘being organized’ with invalid hacks.

It’s worth pointing out again, that if you can not cache correctly, you should not cache. Fixing performance should not come at the cost of destroying integrity. Those are not real performance fixes, just dangerous code that will cause grief.

To summarize, there isn’t really a cache invalidation problem. You are not forced to lie. Yes, it can be hard to select a victim in a cache to purge. But it is hard because the implementations are bad, or the organization is chaotic, or the architecture is nonexistent. It is hard because people made it hard; it is not a hard problem in general. You don’t have to cache, and if you do, then do it properly. If you can’t do it properly, minimize the invalid periods. Make sure that any improvements are real and that they do not come at the cost of throwing around stale, incorrect data.