Thursday, January 26, 2023

Difficult Choices

When dealing with a large technical project, there are a huge number of decisions that constantly need to be made. Some of them are quite simple. You can pick the intuitive or obvious answer.

Some of them are incredibly difficult. Pick correctly and things will more or less work out. Pick badly and your problems will only get worse.

The most difficult choices are always between picking an okay short-term fix, or choosing what is often referred to as “the right thing”. In that second case, you will spend the time to really solve the problem correctly.

If you choose the short-term fix, then almost immediately something will be done and the problem will appear to go away. You can claim success, then move on to other problems.

If you do "the right thing” then it is a long-term play. For the short term, things will take longer, but you are hoping to make it up later.

The problem is that the future benefits are mostly things that will now “not happen” to slow you down anymore. So, it’s the ‘lack’ of other, later problems that you are getting, which makes the payoff easy to miss.

To understand what those other problems will be, you probably have to have not only lived through them in the past but also accepted them as consequences of some earlier choice and then tried to learn from them. If you haven’t experienced them directly for yourself, the most common behavior is to significantly undervalue them, so they don’t seem like a big deal. Or, as they like to say, “hindsight is 20/20”.

So, it’s not surprising that if someone is facing a relatively new choice between making a problem quickly go away or delaying things a bit in order to not face other problems down the road, they will usually choose the first path. Really, why not? Without any prior experience, choosing the second path just seems irrational. If you did that for everything, you’d never get anything done.

In the case where the short-term fix spawns other problems above, say, a 2 to 1 margin, so it is basically “do it now” or “do twice the amount of work later”, it’s a fairly high cost to cope with. In many cases though, it is more like “spend a week now” or “lose 1 day every few weeks”, so it takes nearly 3 months for the long-term fix to finally pay off. But it is worth it, because the loss recurs endlessly and is likely getting worse over time too.
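To make that arithmetic concrete, here is a rough break-even sketch. The numbers are just the illustrative ones from above, not measurements:

```python
# Break-even for "spend a week now" vs "lose 1 day every few weeks".
# All numbers are illustrative assumptions from the example above.
fix_cost_days = 5         # "spend a week now" doing the right thing
loss_days = 1             # days lost each time the quick fix bites
weeks_between_losses = 2  # how often it bites

weeks_to_break_even = (fix_cost_days / loss_days) * weeks_between_losses
print(f"break even after ~{weeks_to_break_even:.0f} weeks")  # ~10 weeks, close to 3 months
```

After that point, every recurrence is pure savings, and the gap only widens if the recurring loss grows.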

So we can see that 1 or 2 of these poor choices can hurt a little. But they are also a slippery slope, in that making a few of them often leads to making more and more of them. And since they can act as multipliers on each other, they can combine to grow exponentially. Or put another way, just when you thought you were doing really well and solving all of your problems, the consequences of doing it all too quickly explode.

This is probably a general property of complexity. The more things get complicated, the more you will face difficult choices, and the more expensive the tradeoff between the short and long term will become. You don’t see this in simple problems, mostly because the penalties are trivial. The only real mitigation is prior experience, in that it may force a person to do the right thing even when it seems less rational than any of the short-term options. Which, I think, is what we commonly refer to as ‘wisdom’.

Thursday, January 19, 2023

Once

The number of times that you need a specific piece of data in a large organization is exactly “once”. You only need one writable copy of it. Not once per screen, not once per system, but actually just once for each company.

If you have more than one writable copy of the data, you have a huge problem.

If 2 different people make 2 different changes to the 2 different sources of data at the same time, there is no ‘general’ way to merge those changes together with software. There are specific instances where the independence is low enough that an automatic merge is possible, but the general problem is unsolvable. You would have to punt it to a human who can figure out the context and consequences, then update it.

But we don’t want to engineer people into our software if we can avoid it. And we can.

Just keep the writable data once. You can spin off any number of read-only copies, basically as a form of caching or memoization, but have only one version of the data, anywhere in the organization, that can be changed.

For any of the read-only copies, you’ve handled them properly if it is always safe to delete them. That is, you won’t lose data, you’ll just undo a space-time trade-off, nothing more.
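As a minimal sketch, assuming a stand-in primary store as the one writable copy, a read-only copy is just a read-through cache; the names here are hypothetical:

```python
PRIMARY = {"user:42": {"name": "Ada"}}  # stand-in for the one writable copy

cache = {}  # a read-only copy: purely a space-time trade-off

def get_record(key):
    if key not in cache:
        cache[key] = PRIMARY[key]  # repopulate on demand from the primary
    return cache[key]

def drop_cache():
    cache.clear()  # always safe: the next read just pays the fetch cost again
```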

So, why don’t we do this?

The first reason is control. Programmers want total control over their systems, which includes control over their data. If they don’t have control, it can take a lot longer to do even simple things. If they do have control, and there is stupid politics, they can just fix their own data and let the rest of the organization be damned.

Crafting a crappy ETL to snag a copy of the data, and putting some screens out there to let people fix its bad quality, is easy, even if it takes a bit of time. Setting things up so that users know how to go to some other system’s data administrators and correct a serious data issue is actually hard, mostly because it is incredibly messy. It’s a lot of work, a lot of conversations, and a lot of internal agreements. So, instead of playing nicely with everyone else, a lot of people choose to go it alone.

The second reason is laziness. If someone else has the data, and it is in some other form, you’d have to spend a lot of time figuring it out. You also have to know how to access it, how to get support, and sometimes how to understand what they are even saying. That is often a massive amount of learning and research, not coding. You’d have to read their schema, figure out their naming convention, work through their examples and special cases, etc. You have to deal with budget allocations, security permissions, and then proper support.

A third reason is politics. There are all sorts of territorial lines in a large company that you can’t easily cross. Different levels of management are very protective of their empires, which also means being protective of their data. Letting it leak out everywhere makes it harder to contain, so they put up walls to stop or slow down people wanting access. So, people silo instead.

A fourth reason is budgets, which is a slight variation on territorial issues and involves services as well. If you put out a database for use by a large number of people in the organization, then there is a fair amount of service and support work that you now have to do as well as maintaining the data. If the funding model doesn’t make that worthwhile, then letting other people access your data is effectively taking a loss on it. You don’t get enough funds to cover the work that people need for access.

How do you fix this?

You make it easier for all of the ‘developers’ to get access to the primary copies of any and all of the corporate data types. You make it easy for them to just do a quick pull, and intermix that data with whatever they have. If it’s not easier than rolling their own version, they will mostly take the easiest route and roll it themselves.

You can’t make it easier by spinning up millions of micro-services. That eventually doesn’t scale, because it’s hard to find the one-in-a-million service holding the data you are looking for. As it grows and becomes even more of a disorganized mess, it gets even harder. If there is no explicit organization, then it is disorganized. Pretty much the primary benefit of microservices is that you don’t have to spend time organizing them. Oops.

The data lake philosophies share the same issue. Organizing a lot of data is hard, painful work, so we’ve spent decades coming up with clever new technologies that seemingly try to avoid spending any effort on organization. Just throw it all together and it will magically work out. It doesn’t; it never does.

So you need a well-organized, well-maintained “library” where developers can go and find out how to easily source some data in a few minutes. Minutes. You probably need some librarians too. If you get that, then the programmers will take it as the easiest route, the data won’t be copied everywhere, and sanity will prevail.

The priority of any IT dept has to be about finding ways to reduce the number of silos and copies of data. That needs to take precedence over pretty much everything else. If it isn’t more important than budgets, politics, process, etc., then any of those other issues will continue to derail it, and the organization will have dozens, if not hundreds, of copies of the same stuff littered everywhere. Chaos is expensive.

Thursday, January 12, 2023

Software Evolution

As software gets more complicated, it would be nice to see newer tech stacks that absorb some of the older complexities away from developers. Over the decades, I’ve seen three places where we could do a much better job handling this.

These days, we have a nearly common repeating pattern for API endpoints. So it would be nice if there was a really cheap way to spin them up that included security, configuration, and monitoring.

That is, all you provide is a table of endpoints, with behaviors turned on or off, and some matching code that does all of the heavy lifting. There would also be some persistent, secure configuration parameters that you might need to change occasionally.

To help people use it properly, there is a set of known, documented patterns for every possible scenario; you just have to follow the instructions.

It includes not only REST endpoints but also background executions (aka batch jobs, repeated or one-off). So it is everything, all in one well-organized place, completely put together so that the programmers can concentrate on what their code needs to do, not on how to get it set up. It should be so easy that the programmers would dread doing it any other way.
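As a rough sketch, assuming a hypothetical framework that reads this table and wires up the security, scheduling, and monitoring itself, an endpoint is just a row with behaviors toggled on or off:

```python
def get_accounts(request):
    return {"accounts": []}  # the only code the developer writes

def nightly_rollup():
    pass  # a background execution, declared alongside the REST endpoints

ENDPOINTS = [
    {"path": "/accounts", "method": "GET", "handler": get_accounts,
     "auth": True, "rate_limited": True, "monitored": True},
    {"job": "nightly-rollup", "handler": nightly_rollup,
     "schedule": "0 2 * * *", "monitored": True},  # a repeated batch job
]
```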

It would have a facility to browse what is available, which would support security to restrict some of that information. If you were on a dev team, you could see what was running, when, and how often. You could get a list of the configuration parameters available, but not their current values. Basically, after an upgrade, you’d be able to verify that all the pieces were in all of the places that you expected. It would satisfy a lot of documentation requirements as well.

Another piece that would be good is solid data plumbing. I think in the past there were lots of excellent examples, but their price was so high that they were unaffordable most of the time. So the costs here would have to be reasonable.

You want a nice clean inventory screen that lists out all of the persistent data for the organization. Developers can browse. The permissions are easy to open up to match the business needs. The security requirements for the data are baked in. That is, if a business unit can see some data, the developers for that unit can see the definitions for that data, its model, and examples. They may not be able to see the data itself, though.

So, it is both the full inventory of all persistent data and the home of all documentation necessary to be able to use that data in development. It is not tied to vendors or specific tech stacks. So, it includes RDBMSes, NoSQL, file systems, and external references, such as APIs. Anything, anywhere that is persisted. If the organization gets or uses any form of data, even from the outside, it is included.

But the real strength is the third part: plumbing. It also contains the list of all internal jobs that move data from one place to another. It distinguishes between real-time and ETL. It holds all configurations and transformations needed to move all of the data around. From those lists, it explicitly does all of the scheduling and monitoring. You would be able to see all jobs that were executed over months or years, their status, and the number of times they were rerun. You’d know when they were last updated and by whom.

If a team needs some new data in their system, they can arrange and schedule it from here. It should be simple, and it should allow that plumbing to come up rapidly. It should also behave like a code repository, keeping a list of all deltas made to its instructions and configurations, and by whom they were made. Everything, so that it can be tested in one environment, then safely moved to another.
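A feed in that inventory might be declared something like this; the field names are assumptions, but the point is that every data movement is explicit, versioned data rather than custom code:

```python
FEEDS = [
    {
        "name": "crm-to-warehouse",
        "source": "crm.accounts",             # an entry in the data inventory
        "destination": "warehouse.accounts",  # a read-only replica
        "mode": "etl",                        # or "real-time"
        "schedule": "hourly",
        "transform": "map_account_fields",    # a named, versioned transformation
        "owner": "data-platform",
    },
]
```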

In this way, we would know that a specific feed was changed months ago, know what those changes were, and know whether or not that made it work better or worse.

It can move data from outside to inside, from databases to files, and from primary dbs to read-only replicas. It is a complete inventory of all data and all the plumbing.

No longer would a system rely on its own custom ETL madness. This would organize all of it in one place.

The final piece is the ability to spin up simple interfaces, quickly.

A complex interface takes a long time to build and a lot of work, but most interfaces needed by most systems are fairly trivial. So, it should be possible to just whip them up quickly.

We can do this now, on two fronts.

The first is having frameworks that allow people to build “super” widgets. You can build them and use them alongside all of the regular widgets, and you can design them to handle complex data. That is, you have a widget to display a complex composite type like user options. Then instead of just passing it a primitive variable, you give it your composite one. Everything else stays the same. Basically, it is analogous to adding composite types to a programming language, but by adding super widgets to the gui.

The second part is dynamic forms. Well, not really “forms” but screens full of widgets that can be arranged and rearranged dynamically. Each widget in the form has a textual description of itself and some type of binding id. If you put the form on the screen, you can give it one big piece of composite data and it will traverse that and find all of the pieces that match the widget ids. In its simplest arrangement, the ids are just unique names that match up. It can be a bit more sophisticated with a hierarchy or scope, but simple names work well.

Then you get the definition of a form and some large clump of data. You draw the form on the screen and give that screen the data. You get another data clump back from that and give it to the backend. You don’t have to go through the data or understand it. If there is other stuff there that doesn’t bind, it doesn’t show up in widgets, but it does stay around after edits. Instead of being fixated on this or that variable, you only have to worry about this or that clump. Basically, you have lifted up your efforts.
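A minimal sketch of that binding, assuming the simplest arrangement where the ids are just unique names:

```python
form = [
    {"id": "name",  "widget": "text"},
    {"id": "email", "widget": "text"},
]

clump = {"name": "Ada", "email": "ada@example.com", "internal_rev": 7}

def bind(form, clump):
    # Only fields with a matching widget id show up on the screen.
    return {w["id"]: clump[w["id"]] for w in form if w["id"] in clump}

def merge_back(edited, clump):
    # Unbound fields (internal_rev) stay around after the edit.
    return {**clump, **edited}

edited = dict(bind(form, clump), email="ada@example.org")
print(merge_back(edited, clump))  # internal_rev survives untouched
```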

To really make it flexible, the forms can be recursive. And they can contain super widgets. But even more flexible is that they also can contain buttons, links, and images. All of the widgets can be either editable or read-only, so you can use the same layout for viewing and editing. All of the widgets have a type and a validation description. Forms implicitly know not to return invalid data.

Doing this, the main screen is just a form with a top-level menu, and as the user interacts with it, other forms are loaded. Any data is synced with the backend as needed, so the only two real things a developer needs to do are define a few forms and define the structure or model of the data. With that work, the interface is mostly done.

For any complex interactions, basically cross-widget wiring like country/state mechanics, that is handled by super widgets. The super widget binds to both the countries and the sub-regional breakdown. It synchronizes correctly on the screen as the user would expect.
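A sketch of such a super widget, with a hypothetical API; only the synchronization logic matters here:

```python
REGIONS = {"US": ["NY", "CA"], "CA": ["ON", "QC"]}  # illustrative data

class CountryRegionWidget:
    def __init__(self, value):
        self.value = value  # binds to a composite: {"country": ..., "region": ...}

    def set_country(self, country):
        self.value["country"] = country
        if self.value.get("region") not in REGIONS[country]:
            self.value["region"] = REGIONS[country][0]  # keep the pair consistent

    def region_options(self):
        return REGIONS[self.value["country"]]

w = CountryRegionWidget({"country": "US", "region": "NY"})
w.set_country("CA")  # the region resets to "ON" so the screen stays in sync
```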

If you have some persistent data in a relational database, all you need is a corresponding form to be able to supply a basic gui for viewing or editing it. If the data is normalized and modeled reasonably, you can automatically generate that form. Mistakes in the schema would have to be worked around by hand.
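Assuming a hypothetical schema description, generating that default form is just a mapping from columns to widgets, in the same form format as the earlier sketch:

```python
schema = {"users": [("name", "text"), ("email", "text"), ("age", "int")]}

def generate_form(table):
    # A real generator would read the database catalog instead.
    return [
        {"id": col, "widget": "number" if typ == "int" else "text"}
        for col, typ in schema[table]
    ]

print(generate_form("users"))  # a form definition, ready to bind to a row
```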

Of course, that won’t work for every screen in every system. The really complicated ones or the screens that don’t match the persistent representation will still need a lot of custom work. Super widgets may cover some of it, but some screens may need full customization outside of this framework. Still, it gives developers more time to craft those properly, while the trivial stuff just gets thrown together easily. In urgent situations, you could put up some interim screens for a while, until the correct ones are ready. Then you can throw away the interim ones since it wasn’t a lot of work.

Together I think these three ideas would form an incredibly strong infrastructure for most large organizations. You’d end up knowing exactly what is in production and how it is doing. You’d know what data is persisted and where it is all located. And you’d be able to quickly throw together simple interfaces to deal with any urgent problems.

It would not, however, take away from any of the heavier work that still needs to be done. You’d still need some complex custom interfaces, there would still be data representation problems, some of the data would still need expensive migrations or strange transformations, etc. That work never goes away, but at least it wouldn’t be happening in the middle of a huge disorganized mess. Everything would have a place, and there would be a place for everything.

About the only concern with this might be performance. You wouldn’t want the demands of one side of the business consuming resources needed by another side. That is an easy fix: you partition them away from each other. In really large companies, they would have independent setups.

Costs are often another big problem, and in the past, we have seen that result in a lot of disorganization. People try to do things cheaply, but the resulting mess usually makes it way more expensive. Big systems require big efforts. You can’t keep cheating the game and expect the results to be any different.

Thursday, January 5, 2023

Dependencies

One of the trickiest issues in software is figuring out the exact dependencies between any two complex things. Sometimes you kind of know they are dependent, but it is vague. To discuss it, we first have to go really abstract, then we can come back down to the underlying issues.

In the craziest, broadest sense, everything is basically dependent on everything else.

We live in one universe, so everything that we are aware of is currently in here with us. It all shares the same global ‘context’.

That’s a kind of fun way of looking at it. It implies that no two things are ever ‘entirely’ independent in that broadest context. Thus ‘independence’ is another human abstraction like ‘perfection’ and ‘infinity’. It’s a term we use loosely, but it's probably only a byproduct of our creative imagination, not the world around us.

We can talk about dependence relative to smaller, tighter contexts. And we can keep shrinking the context until any two things do become independent. The context is null only if they are the exact same thing. But even if they are absolutely perfect clones of each other, they still occupy different spatial coordinates, so there is still a context.

This gives us a rather grand gradient. Any two things are dependent until the context is tight enough that they are independent.

As odd as that sounds, we can use it and its converse for some interesting stuff.

Given two things we think are independent, we can expand the context to the point where they are dependent. That can actually define the ‘dependencies’. If the smallest context where they are dependent is finite, then it is composed of a normalized set of variables. If we tighten the context enough, that set of variables is exactly the dependence between the two things.

And the opposite is true as well. Given any two dependent things, we can shrink the context until they are independent, then use that to figure out what is variable; what binds them.

What makes this interesting is that the context, described that way, is pretty much always a finite set of independent variables itself, and we can expand, contract, or shift it around as needed to get to a different context.

From this, it seems like there is only one ‘normalized’ context that sits on the border between independence and dependence, although it may be true that there are an infinite number of different basis variables (equivalent but alternative variable sets).

While this is a very abstract discussion so far, it does actually apply to software. It applies to both data and code, so we’ll start with the data first.

Two pieces of data are dependent on each other in some context. But that context may be looser than the given system parameters, so, with respect to the system, they are independent. We can fiddle with both, at the exact same time, without having to worry about interference. Thus we can parallelize any operations on the two independent pieces of data.

If, however, the context of the dependence is within the context of the system, then the two pieces of data are dependent on each other, and we cannot safely fiddle with both at the same time. We need to coordinate any actions, which means either some form of locking or utilizing atomic operations (which just embeds an implicit lock into the operation itself).

If two pieces of data are dependent, and the system treats them as independent, that is effectively a race condition, and it will go wrong at some frequency. That frequency might be once every 100 years, so the people who built the system may be unaware that it has a problem, but it is still a problem.
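A minimal sketch of that kind of race, assuming two fields whose values are dependent because they must always sum to zero:

```python
import threading

account = {"debit": 0, "credit": 0}  # dependent: they must always sum to zero
lock = threading.Lock()

def transfer_unsafe(amount):
    account["debit"] += amount   # another thread can interleave between these
    account["credit"] -= amount  # two lines and observe a broken invariant

def transfer_safe(amount):
    with lock:  # coordination: the dependent pair changes atomically
        account["debit"] += amount
        account["credit"] -= amount
```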

Data issues are effectively ‘runtime’ issues. They happen when the system is being used. Code issues are more often ‘compile time’ issues, as we are building stuff. Although not being threadsafe, for example, is a runtime issue caused by a code dependency, so those exist too, but we’ll skip them for now.

With code, as you are generally putting together millions of instructions for the computer to follow, it would be best if you didn’t have redundant code everywhere. Why? It is a behavioral dependency that degrades with time. That is, there is an implicit dependency on 2 pieces of code behaving the same way to make things work, but someone later edits one of them, changing its behavior and triggering an unwanted side effect.

In that sense, given an overlap or commonality between any two sequences of instructions, if that dependency is within the context of the system, then keeping the code redundant is fragile and may be the root cause of a bug in the not-too-distant future.

Some people believe that all that matters is the next release, but really a system is strong because it was built in anticipation of its full life span. That is, ignoring code dependencies is ignoring a fundamental requirement of the system itself, which is that it should continue to run correctly until it is finally retired.

That’s it at a high level, but we can even get a little lower, yet more general when discussing dependencies. One great issue is what we often call the ‘merge problem’.

In general, given two different sets of data that are related, a computer can’t automatically merge them. If there is a single variable with two values, you can set a precedence and use it to merge automatically. But if the data is composite, and you have two different versions of it, you can’t automatically merge. Why? Because some of the fields (attributes) may be dependent on other fields, and the changes themselves may involve context outside of the system. For example, if you have 5 fields, A .. E, and one user changes A and B, while another changes A and E, you can’t just put them together. Only the users would know whether B or E is more appropriate with the given A. They are a part of the context.
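As a sketch, a field-by-field merge of that A .. E example has no choice but to punt to a human:

```python
base   = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 1}
user_1 = {"A": 2, "B": 9, "C": 0, "D": 0, "E": 1}  # B was chosen to fit A=2
user_2 = {"A": 3, "B": 1, "C": 0, "D": 0, "E": 7}  # E was chosen to fit A=3

def naive_merge(base, left, right):
    merged = dict(base)
    for k in base:
        if left[k] != base[k] and right[k] != base[k]:
            raise ValueError(f"conflict on {k}: only a human can resolve it")
        merged[k] = left[k] if left[k] != base[k] else right[k]
    return merged

# naive_merge(base, user_1, user_2) raises on "A".  Even if we picked a
# winner for A, combining B=9 with E=7 may violate dependencies that only
# exist in the users' heads, outside the context of the system.
```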

Now, we do seem to get around that with code repositories, like git, but there is a trick. Their merges are done line-by-line, so lifting it up that way, any merge conflicts are reduced. But they can still occur and often do, which is why we have so many tools and techniques for manually merging code. It kinda works but it also goes wrong fairly often. Enough that we can’t just automate it and expect it to be reliable.

So, it’s pretty fair to say that you can’t merge data unless you can constrain it within a very tight context. In general, you can’t do it, but some days you might get lucky.

Dependency plays out in other ways as well. You build a system on top of some libraries or frameworks. But those dependencies are recursive, and some of them clash. That would be okay if every dependency were effectively ‘runtime’ safe and you could load different versions into memory at the same time, but it is more often the case that there are dependent locations in memory that get shared, which causes unexpected behaviors. The authors might have left a static or global lying around, but are playing with it in different ways in the different versions.

We see all sorts of other dependencies playing out in software projects. The development work itself is often dependent on other work. Domain logic is usually dependent on its industry. Synchronizing data is dependent on it not changing.

Complexity comes from the intricacies of the thing itself, plus all of its dependencies, within a given context. This is why tunnel vision is often a problem: you shrink the context until the dependencies become manageable, but if some of the discarded dependencies were significant, things go very wrong when the work goes back into the larger context. Dependencies drive complexity growth.