Thursday, May 30, 2024

Identitynet

It’s a simple idea.
 
Anywhere you go in this world, you can prove that you have some attribute. But at the very same time, absolutely all other information about you remains private.

The inquirer gets nothing but an opaque token. Its existence validates that they checked for that attribute and that the check succeeded at that time, nothing else. The token itself would be anonymous; the inquirer might link it to other information they already have, but that is up to them.

A check would be in two stages: request and approve. You would see the request and then be able to approve or decline it.
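
As a rough sketch of how the two stages might be represented, here is a hypothetical set of structures in Go. Every name and field is made up; it is only meant to show the shape of the exchange.

    package identitynet

    import "time"

    // AttributeRequest is the first stage: an inquirer asks whether the
    // subject holds a given attribute.
    type AttributeRequest struct {
        RequestID   string    // unique id for this check
        Requester   string    // validated identity of the inquirer
        Attribute   string    // e.g. "age-over-18" or "citizen-of-X"
        RequestedAt time.Time // when the check was made
    }

    // Decision is the second stage: the subject, or one of their delegates,
    // approves or declines the request.
    type Decision struct {
        RequestID  string
        Approved   bool
        DecidedAt  time.Time
        DelegateID string // empty if the subject decided directly
    }

    // Approve records an approval for a pending request. A real system would
    // sign and store this; here it is only illustrative.
    func Approve(req AttributeRequest, delegateID string) Decision {
        return Decision{
            RequestID:  req.RequestID,
            Approved:   true,
            DecidedAt:  time.Now(),
            DelegateID: delegateID,
        }
    }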

To get around access issues, most people would have one or more delegates. Their delegates could approve or decline on their behalf.

The incoming request would always carry the identity of the requester. These identities would be validated, so they could be trusted.

If you were the delegate of someone who got arrested in a foreign country, you would see which officials were making requests about their attributes and be able to accept or deny them. The person arrested might not have access to technology.

Requests would come from an individual or a person acting on behalf of some organization. In the case of lower-level employees, the requester would be anonymized: you would only see the organization, but the request would still be trackable and accurate. If there were an issue with that employee, the organization would be able to see who they were, but only some sort of legal case could force them to be revealed. Still, you would have proof that a duly appointed person for that organization did receive confirmation.

You could have lots of different delegates for different attributes. In some cases, like citizenship, your attribute would just wrap a token stating that your country has accepted you as a citizen. You would essentially pass that along.

Access to this system and its information would be available everywhere around the world. It could not be blocked or interfered with by territorial governments.

You can set whatever attributes you want, but getting official ones is a two-or-more-hop process. You might want to be able to prove that you are the owner of funds in a bank account somewhere. But more likely you would just want to approve a transfer of funds to someone else, while the source account remains anonymous. A token would act as a receipt and a verification that the source funds exist and have been allocated for that transfer. The actual transfer would take place outside, between institutions, and be settled at their convenience. That type of token would have a best-before date.

To get all of this to work, you need a globally accessible protocol that can verify the correctness of the tokens it is issuing. The tokens would be secure. You cannot tamper with them or fake them. It needs to be fully distributed to avoid being compromised by any individuals, organizations, or governments. As long as both parties can tap into the Internet, the request and approval mechanics will function.
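
Continuing the hypothetical sketch above, a token could be made tamper-proof with an ordinary digital signature. This assumes standard Ed25519 keys; none of it is a real specification, just an illustration of the idea.

    package identitynet

    import (
        "crypto/ed25519"
        "encoding/json"
        "time"
    )

    // Token is an opaque receipt: it proves that a check succeeded at a point
    // in time, and nothing else. Expiry is the "best-before date".
    type Token struct {
        Attribute string
        IssuedAt  time.Time
        Expiry    time.Time
        Signature []byte
    }

    // body is the portion of the token that gets signed.
    func (t Token) body() []byte {
        b, _ := json.Marshal(struct {
            Attribute string
            IssuedAt  time.Time
            Expiry    time.Time
        }{t.Attribute, t.IssuedAt, t.Expiry})
        return b
    }

    // Issue signs the token so it cannot be altered or forged without the
    // issuer's private key (created elsewhere with ed25519.GenerateKey).
    func Issue(priv ed25519.PrivateKey, attribute string, ttl time.Duration) Token {
        t := Token{Attribute: attribute, IssuedAt: time.Now()}
        t.Expiry = t.IssuedAt.Add(ttl)
        t.Signature = ed25519.Sign(priv, t.body())
        return t
    }

    // Verify checks both the signature and the best-before date.
    func Verify(pub ed25519.PublicKey, t Token) bool {
        return time.Now().Before(t.Expiry) && ed25519.Verify(pub, t.body(), t.Signature)
    }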

There would be legal remedies to compel the issuers or holders of these tokens to reveal them at times. A token might be evidence in a criminal or civil case, for example. If someone requested one of your tokens from its holder, the request would be issued as a token itself, and you would receive a token identifying who they are. So, you would know immediately that that person had checked your earlier token. You could turn this off as needed.

The tokens themselves would be quite large, but as they are also formal receipts for tracking one half of the request, they are storable and searchable. Declined requests would occupy no space on either side. To prevent DoS attacks, there would be an auto-decline that could be turned on for a while.

Some non-formal requests would also be auto-declined. A stranger asking for your name or email, for example. Someone could not scoop these up in a public place.

For social media, the returned tokens could be anonymized. So you could post, without an account, and only legal actions could be used to find out who you are. But you could also issue tokens that publicly identify that your post is associated with your account on their site. Both would work, and the social media site would not be in control of what is happening.

There would need to be some small fee established for using these tokens, mostly to prevent misuse. There could be some way of discounting it for some people while charging others a fair rate. Requests for discounted service would likely bounce the discount to a different party for approval. You might get your government or a charity to authorize your reduced token rate, for example. This would level the playing field for usage.

In that way, tokens would be stacked. One token might link a whole chain of other tokens together to prove a complex attribute, but to reveal all of it in its full detail would require legal remedies against each party in the chain. This would allow eventual transparency, but in a way that could not be abused. You would know who is tracing the token.
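
A hypothetical sketch of that stacking, continuing in Go: each link in the chain carries only a hash of the link beneath it, so unwinding the full detail means going to each party in turn. The names are made up.

    package identitynet

    import (
        "crypto/sha256"
        "encoding/hex"
    )

    // ChainLink is one hop in a stacked token. It proves its own claim and
    // points at the link below it only by hash, revealing nothing further.
    type ChainLink struct {
        Issuer     string // who vouched at this hop
        Claim      string // what they vouched for
        ParentHash string // hash of the link below, empty at the root
    }

    // Hash identifies this link without exposing the links beneath it.
    func (l ChainLink) Hash() string {
        sum := sha256.Sum256([]byte(l.Issuer + "|" + l.Claim + "|" + l.ParentHash))
        return hex.EncodeToString(sum[:])
    }

    // Stack adds a new hop on top of an existing link.
    func Stack(parent ChainLink, issuer, claim string) ChainLink {
        return ChainLink{Issuer: issuer, Claim: claim, ParentHash: parent.Hash()}
    }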

Someone could get a social media site to reveal some tokens. Those would link back to an identity provider or government. Going there might have another level of indirection. It might take some time to get to the end of the chain, but the chain is finite, so it would be possible.

All the attributes about you that you might need to share in your life would be tokens in this way. You would have a one-stop location for identifying just the things you want others to know about yourself. You would know what you revealed and to whom.

As all tokens are time-related, you could change those attributes as you need to. So the trail would be reasonable: you used to prefer to be called Bill, but now it is William. You used to live in the suburbs, but now you reside downtown. The tokens would reflect this, and the dates attached to them would let the holders know both when they are stale and when they should be legally deleted.

This would make a lot of things in life easier, yet provide some protection against the bad stuff.

Thursday, May 23, 2024

Data Driven Programming

I know the term data-driven has been used before, but I want to ignore that and redefine it as a means of developing code.

So, you have some code to write.

Step 1: Figure out all of the data that may be used by the code. Not some of the data, or parts of it, but all of it.

Step 2: Set up all of the structures to hold that data. They can be objects, typedefs, structs, etc. Anything in the language that makes them composite variables. If the same data already exists somewhere else in another data structure, use that instead of creating your own. Duplication is bugs.

You have finished this step when there are reasonable data structures for absolutely all of the data.
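
To make Steps 1 and 2 concrete, here is a small hypothetical example in Go, using a made-up billing domain. Every structure and field is an assumption for illustration; the point is only that all of the data has a home and nothing is duplicated.

    package billing

    import "time"

    // Customer holds all of the customer data this code will touch.
    type Customer struct {
        ID    string
        Name  string
        Email string
    }

    // LineItem is one charge on an invoice.
    type LineItem struct {
        SKU      string
        Quantity int
        UnitCost int64 // cents, to avoid floating-point money
    }

    // Invoice reuses the existing Customer structure instead of copying its fields.
    type Invoice struct {
        Customer Customer
        Items    []LineItem
        IssuedAt time.Time
    }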

Step 3: Link up the structures with each other. Stuff comes in from an endpoint, maybe an API or something. Put that data into the correct structure and flow that structure through the code. If there are any algorithms, heuristics, or transformations applied on those structures to get others, code each as a small transformation function. If there is messy conditional weirdness, code it as a decision function. Link all of those functions directly to the data structure if the language allows (methods, receivers, etc.). If there are higher-level workflows, add those in as separate higher functions that call the other functions underneath. Always have at least two layers. Maybe try not to exceed five or six.
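
Continuing the same hypothetical billing sketch, Step 3 attaches small transformation and decision functions to the structures, with one higher-level workflow function on top of them:

    // Total is a small transformation: an invoice reduced to an amount.
    func (inv Invoice) Total() int64 {
        var total int64
        for _, item := range inv.Items {
            total += int64(item.Quantity) * item.UnitCost
        }
        return total
    }

    // NeedsReview is a decision function; the messy conditional weirdness lives here.
    func (inv Invoice) NeedsReview() bool {
        return len(inv.Items) == 0 || inv.Total() > 1000000
    }

    // ProcessInvoice is the higher-level workflow that calls the functions underneath.
    func ProcessInvoice(inv Invoice) string {
        if inv.NeedsReview() {
            return "hold for review"
        }
        return "send to customer"
    }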

Step 4: Go back to all of the code and make sure it handles errors and retries properly. Try to get this error code as close to the endpoints as possible; don’t let it litter the other code.
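
In the same sketch, Step 4 keeps the retries at the endpoint, where the data first arrives, instead of scattering them through the transformation code. The fetch argument stands in for whatever call loads the raw data.

    // loadInvoiceWithRetry wraps the endpoint call with error handling and retries.
    func loadInvoiceWithRetry(fetch func() (Invoice, error)) (Invoice, error) {
        var lastErr error
        for attempt := 0; attempt < 3; attempt++ {
            inv, err := fetch()
            if err == nil {
                return inv, nil
            }
            lastErr = err
            time.Sleep(time.Duration(attempt+1) * time.Second) // crude backoff
        }
        return Invoice{}, lastErr
    }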

Step 5: Test, then fix and edit. Over and over again, until the code is pretty and does what you need it to do, and you’ve checked all of the little details to make sure they are correct. This is the most boring part of coding, but you always need to spend the most time here. The code is not done until it has been polished.

If you do all of this, in this order, the code will work as expected, be really clean, and be easily extendable.

If someone wants the behavior of the code to change (which they always do), the first thing to do is Step #1: see if you need more data. Then Step #2, add in that extra data. Then Step #3, add or extend the small functions. Don’t forget Steps #4 and #5; the code still needs to handle errors and get tested and polished again, but sufficient reuse will make that far less work this time.

It’s worth noting that for Step #3 I usually start bottom-up, but sometimes I start top-down for a while, then go back to bottom-up. For instance, if I know it is a long-running job with twenty different steps, I’ll write out a workflow function that makes the twenty function calls first, before coding the underlying parts. Flipping between directions tends to help refine the structure of the code a little faster and make it cleaner. It also helps to avoid over-engineering.

Step #3 tends to drive encapsulation, which helps with quality. But it doesn’t help with abstraction. That is Step #2, where instead of making a million little data structures, you make fewer that are more abstract, trying to compact the structures as much as possible (within reason). If you can model data really well, then most abstractions are obvious. It’s the same data doing the same things again, but a little differently sometimes. You just need to lift the name up to a more generalized level, dump all of the data in there, and then code it. Minimize the 'sometimes'. 

The code should not allow garbage data to circulate; that is a major improvement in quality, and you can tightly control it with the data structures and persistence. As you add the structures, you add in common validations that prevent crap from getting in. The validations feel like they are Step #3, but really they can be seen as Step #2b or something. The data structures and their validations are part of the Model, and they should tightly match the schema if you are using an RDBMS. Don’t let garbage in, don’t persist it.
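
In the hypothetical billing sketch, those validations would sit right on the structure (this assumes the standard errors package is added to the imports):

    // Validate keeps garbage out of the Invoice structure before it circulates
    // or gets persisted.
    func (inv Invoice) Validate() error {
        if inv.Customer.ID == "" {
            return errors.New("invoice has no customer")
        }
        if len(inv.Items) == 0 {
            return errors.New("invoice has no line items")
        }
        for _, item := range inv.Items {
            if item.Quantity <= 0 || item.UnitCost < 0 {
                return errors.New("invoice has a nonsensical line item")
            }
        }
        return nil
    }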

That’s it. I always use some variation on this recipe. It changes a little, depending on whether I am writing application or systems code, and on the length of the work. The biggest point is that I go straight to the data first and always. I see a lot of programmers start the other way around and I almost always see them get into trouble because of it. They try to stitch together the coding tricks that they know first, then see what happens when they feed data into it. Aside from being non-performant, it also degrades quickly into giant piles of spaghetti. Data is far more important than code. If you get the data set up correctly, the coding is usually easy, but the inverse is rarely true. Let the data tell you what code you need.

Thursday, May 16, 2024

Megafunctions

For a long time, I have been attempting to write something specific about how to decompose code into reasonable functions.

The industry takes this to be an entirely subjective issue. 

Programming styles go through waves. Sometimes the newest generation of coders write nicely structured code, but sometimes they start writing all sorts of crazy “megafunctions”.

Over the decades there has been an all too obvious link between megafunctions and bugs. Where you find one, you tend to find a lot of the other.

Still, it oscillates, and every so often you start to see one-function-to-rule-them-all code examples, and people arguing about how this is so much better. Once I was even rejected for a job because after the interview they said I had too many functions in my code, which was madness.

Functions should be small. But how small?

Within the context of software code, a ‘concern’ is a sequence of related instructions applied to a group of variables within the same contextual level.

If the code is low-level, the operations applied to those variables are all low-level. If the code is structural then the operations applied are structural. If the code is domain-based, then the operations are all domain-based. etc.

This is an organizational category. It starts with “some of these instructions are not like the others” and flows through “there is a place for everything, everything in its place, and not too many similar things in the same place.”

It is that last point that is key. The code is not organized if some of the instructions in a function should not be in that function because they are dealing with a completely different concern.

It is an objective definition, but somewhat subjective on the margins. You can have a wider or narrower view of concerns, but at some point, a different concern is a different concern. When that is ignored it is usually obvious.

If an operation needs to be applied to one or more variables at a different level than the context, it is a different concern. If an operation mixes different types of data, that is a different concern. If its scope is global then it is a bunch of concerns. If it is for optimization then it is a different concern.

This categorization is important.

Limiting the number of lines of code in a function is not a reasonable way to control function scope. Rather, any given function should just take care of one and only one concern, however large or small it is. If part of doing that involves a different concern, then the function should call out to another function to perform that effort. That is, each function is an atomic primitive operation that does exactly one concern.

So then it is a one-to-one relationship. The function says it does X, and the code in it just does X.
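
A small hypothetical sketch of that one-to-one relationship, in Go. The names are invented; the point is that each function handles exactly one concern and calls out for the rest.

    package report

    import "strings"

    // parseLine is a low-level concern: raw text in, fields out.
    func parseLine(line string) []string {
        return strings.Split(strings.TrimSpace(line), ",")
    }

    // isActive is a domain concern: deciding whether a record matters.
    func isActive(fields []string) bool {
        return len(fields) > 2 && fields[2] == "active"
    }

    // countActive is the structural concern that ties the other two together.
    func countActive(lines []string) int {
        count := 0
        for _, line := range lines {
            if isActive(parseLine(line)) {
                count++
            }
        }
        return count
    }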

If you don’t know if two things are the same concern or different ones, then it is far better to err on the side of caution and treat them as different concerns. Fixing that later by joining them together is a lot easier than separating them.

This seems to be a hard concept to get across to some programmers. There are a number of reasons why they want megafunctions:
  1. Some programmers just don’t know any better, or they don’t care.
  2. Some programmers feel like not creating functions is an optimization.
  3. Some programmers like to see everything all together at once.
  4. Some programmers think the code is easier to step through in a debugger.
None of these are correct.

If you don’t care about what you code, it will show, both in readability and in quality. Your job will be stressful and difficult.

Functions execute quickly, so it is almost never reasonable to avoid using them just to save a tiny insignificant overhead.

Big functions may help in writing the code slightly faster, but they really hurt when you have to debug them. They only get worse with time.

Using a debugger is a vital way of figuring out what the code is doing, but it is far better if you just code the right thing in a clean and readable way and don’t have to step through it. It saves a lot of time.

Functions are a main feature of all programming languages. Learning to use them properly is a critical skill.

Keeping the scope of any function down to just a single concern has a big impact:
  • The code is readable
  • The code is debuggable
  • You can deal with ensuring that the concern itself is correct without getting lost in the details
  • The code is extendable
  • The code can be reused as many times as you need
Megafunctions are often an indication that the programmer does not have a lot of experience. However, sometimes it is just that the programmer got stuck in really bad habits. The more you code, the more you come to realize that keeping your code organized isn’t optional; it is key to getting high-quality work out to deployment quickly.

It’s like cooking. If your kitchen is nicely organized and clean, cooking is the main task. If your kitchen is dirty, disorganized, and a total mess, then trying to cook in that environment is extremely hard. The friction from your kitchen being a dumpster fire is preventing you from focusing on the cooking itself. Whether you like or hate cooking, you still want a clean kitchen. Coding is the same.

Thursday, May 9, 2024

Iterative Development

Even if I know that I will end up building something extremely complex, I never start there.

When we get a complex problem to solve, we are taught to decompose that into pieces. But often that is not enough to get something sophisticated to work correctly.

I’ll start with some simplified goals. Usually the most common “special case”.

I lay out some groundwork. Bottom-up programming is always the strongest approach, but it still needs to be guided. You could whack out all of the common foundational parts you know about, but that still isn’t wide enough to get it all completed. So don’t. Build just enough to get going.

Then take that special case and work from the top down.

If it’s a GUI, you start with a common screen like the landing one. If it’s an ETL, you just get some relatively close test data. If it's a calculation, you put in a simplish test case. Ultimately you need a fairly simple starting point.

Then you wire it up. Top to bottom. While wiring, you will find all sorts of deficiencies, but you constrain yourself to only fixing the core problems. You also note all of the others, do not forget them, and dump them into a big and growing list of stuff to do. They are important too, just not now.

So, hopefully, now you have something crude that kinda solves the issue in very limited circumstances and a rather tiny foundation for it to sit on. It is what a friend of mine used to call an inverted T.

If it sort of works, this is not the end; it is just the beginning. You iterate now, slowly evolving the code into the full complex thing that you need.

One key trick is to pass over the unknown-unknowns as early as possible. If there is some technology you are unfamiliar with, or a difficult algorithm, you touch those areas with something crude, and you do it as early as possible. Unknown-unknowns are notorious for turning out to be way larger than you have time or estimates to handle. They like to blow stuff up. The sooner you deal with them, the more likely you can get a reasonable sense of when the work will be completed.

For each iteration, once you’ve decided what you need to do, you first need to set the stage for the work. That comes in the form of a non-destructive refactoring. That is, if your structure is crude or wonky, you rearrange that first in a way that does not change the overall behavior of the code. If you don’t have enough reuse in the code, you fix that first.
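
As a tiny hypothetical illustration of a non-destructive refactoring, in Go: the same formatting appeared in two places, so it gets pulled into one helper before any new behavior is added, and both callers produce exactly the same output as before.

    package contacts

    type Customer struct {
        Name  string
        Email string
    }

    // displayLabel is the extracted helper; previously this concatenation was
    // repeated inside both functions below.
    func displayLabel(c Customer) string {
        return c.Name + " <" + c.Email + ">"
    }

    func emailHeader(c Customer) string { return "To: " + displayLabel(c) }

    func auditRecord(c Customer) string { return "customer=" + displayLabel(c) }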

If you did a good job with the non-destructive refactoring, the extension code should be easy; it is just work now. Sure, it takes time, but you spend the time to make sure it is neat, tidy, and super organized. You need to do this each and every time, or the iterations will converge on a mess. You don’t want that, so you avoid it early too.

Once you get into this habit, each and every development is a long-running series of iterations, and each iteration is cleaning up stuff and making it better and more capable. The code will keep getting a little better each time.

This has two core benefits. First, you can level out your cognitive effort. The non-destructive refactorings are boring, but they are always followed by the expansions. As you are tracking everything, you have a huge list of todo items, so there is never a ‘hurry up and wait’ scenario. If one little thing is stuck on other people, there are plenty of other better things to do anyway.

The second benefit is that all of the people around you see that the work is getting better and better. They may have been nervous initially that the total effort will be slow, but continual improvements will dissolve that. What they don’t want is big highs and lows, and an endless number of unknown-unknown dramas. That wonky approach will keep them from trusting you.

Done well, the development is smooth. There are plenty of little issues, but they do not derail the path forward. Eventually, you will get to something really sophisticated, but you will not get lost in the complexity of it. You will not crumble under technical debt. This is when you know you have mastered your profession.

Thursday, May 2, 2024

Symmetry

“One of these things is not like the others” -- Sesame Street

Symmetry is an often ignored property of code that is incredibly helpful when editing or debugging. It is part of readability, along with naming and structuring.

If you do a quick visual scan of some suspect code and you see asymmetric parts, it quickly tells you that you should spend a little more time there.

Symmetry issues can be simple, for instance:

    if condition {
        // do thing one

    } else {
        if sub-condition {
            // do thing two

        } else {
            // do thing three

        }
    }

It doesn’t seem to be a problem, but the second conditional block is broken into two sub-blocks, while the first one is not. It is unbalanced. There are 3 different things that can be triggered from this code. That is a little odd: if the conditions really are independent and applied consistently, there could be 4 things, so we may be missing one. Depending on what behavior you are investigating, this would deserve closer inspection.
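
For comparison, if the conditions really are independent, the balanced version would cover all four permutations:

    if condition {
        if sub-condition {
            // do thing one

        } else {
            // do thing two

        }
    } else {
        if sub-condition {
            // do thing three

        } else {
            // do thing four

        }
    }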

That makes symmetry an indirect property of readability. For example, you quickly scan large amounts of code and see that every function definition has exactly 3 arguments, and then you come across one with 7. Hmmmm. It stands out; maybe it is wrong.

With good code, given most bugs, you will have a strong idea about where the problem is located. Then a quick scan for asymmetry will highlight some parts around there that are off. You double-check those, and maybe find one that is wonky. That can get you from the report of the bug to the actual flaw in the code in a few minutes, with a fairly low cognitive effort. All you need to do is correct it and test that it is now behaving as expected. Symmetry problems are often second in frequency to typos. 

Symmetry is entirely missing in spaghetti code. The logic wobbles all over the place doing unexpected work. Jumping around. If you are not the author, the flow makes no sense. It takes a lot of cognitive effort to hypothesize about any sort of behavior, and then you have to focus in on as little as possible, just to figure out how to alter it in a way that hopefully is better.

It’s why debugging someone else’s mess is so painful.

Symmetry is not an accidental property. It is there only because someone had the discipline to make sure it is there. They pounded out some code but went back to it later and cleaned it up. Fixed the names, made sure it only does one thing, and put in as much symmetry as they could. That took them longer initially, but it will pay off later. The worse the code, the more you will need to revisit it.

Symmetry can occur at the syntax level, the semantic level, the data level, the function level, and the structural level. Everywhere. It is a sign of organization. Where and when it could have been there, but is missing, it is always suspect. 

The most obvious example is indenting lines of code. Most languages allow you to do whatever you want, but if you do it properly, it becomes a form of line-by-line symmetry that helps make the flow more understandable.

Spending the time to find and enhance symmetry saves a huge amount of grief later with quality issues. The code is way easier to refactor, debug, and extend. It is worth the time.