Thursday, May 23, 2024

Data Driven Programming

I know the term data-driven has been used before, but I want to ignore that and redefine it as a means of developing code.

So, you have some code to write.

Step 1: Figure out all of the data that may be used by the code. Not some of the data, or parts of it, but all of it.

Step 2: Set up all of the structures to hold that data. They can be objects, typedefs, structs, etc. Anything in the language that makes them composite variables. If the same data already exists somewhere else in another data structure, use that instead of creating your own. Duplication is bugs.

You have finished this step when there are reasonable data structures for absolutely all of the data.
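To make Step 2 concrete, here is a minimal sketch in Python for a made-up ordering system. The names (Customer, LineItem, Order) and their fields are my own assumptions for illustration. The point is only that every piece of data has a home, and that Order holds the existing Customer structure rather than copying its fields.

```python
from dataclasses import dataclass, field
from datetime import date


# Hypothetical structures for an imaginary ordering system.
@dataclass
class Customer:
    customer_id: int
    name: str
    email: str


@dataclass
class LineItem:
    sku: str
    quantity: int
    unit_price_cents: int


@dataclass
class Order:
    order_id: int
    customer: Customer  # reuse the existing structure, do not duplicate its fields
    placed_on: date
    items: list[LineItem] = field(default_factory=list)
```

No behavior yet, just the shape of the data. If the customer's email changes, there is exactly one place it lives.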

Step 3: Link up the structures with each other. Stuff comes in from an endpoint, maybe an API or something. Put that data into the correct structure, then flow that structure through the code. If there are any algorithms, heuristics, or transformations applied to those structures to get others, code each as a small transformation function. If there is messy conditional weirdness, code it as a decision function. Link all of those functions directly to the data structure if the language allows (methods, receivers, etc). If there are higher-level workflows, add those in as separate higher functions that call the other functions underneath. Always have at least two layers. Maybe try not to exceed five or six.
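A rough sketch of Step 3, again in Python with invented names and invented business rules (the free-shipping thresholds and the flat shipping fee are assumptions, not anything real). Data arrives as a payload, goes straight into a structure, small functions hang off that structure, and one workflow function sits on top.

```python
from dataclasses import dataclass, field


@dataclass
class LineItem:
    sku: str
    quantity: int
    unit_price_cents: int

    def total_cents(self) -> int:
        # Small transformation: one line item to its cost.
        return self.quantity * self.unit_price_cents


@dataclass
class Order:
    order_id: int
    items: list[LineItem] = field(default_factory=list)

    @classmethod
    def from_payload(cls, payload: dict) -> "Order":
        # Endpoint data goes directly into the correct structure.
        items = [
            LineItem(i["sku"], i["quantity"], i["unit_price_cents"])
            for i in payload["items"]
        ]
        return cls(payload["order_id"], items)

    def subtotal_cents(self) -> int:
        return sum(item.total_cents() for item in self.items)

    def qualifies_for_free_shipping(self) -> bool:
        # Decision function: the conditional weirdness lives in one place.
        # Both thresholds are made up for this example.
        return self.subtotal_cents() >= 5000 or len(self.items) >= 10


FLAT_SHIPPING_CENTS = 799  # hypothetical fee


def price_order(payload: dict) -> dict:
    # Higher-level workflow: it only calls the functions underneath.
    order = Order.from_payload(payload)
    shipping = 0 if order.qualifies_for_free_shipping() else FLAT_SHIPPING_CENTS
    subtotal = order.subtotal_cents()
    return {
        "order_id": order.order_id,
        "subtotal_cents": subtotal,
        "shipping_cents": shipping,
        "total_cents": subtotal + shipping,
    }
```

That is two layers: the methods on the structures, and the workflow that strings them together. When the shipping rule changes, only the decision function moves.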

Step 4: Go back to all of the code and make sure it handles errors and retries properly. Try to get this error code as close to the endpoints as possible. Don’t let it litter the other code.
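One way Step 4 might look, as a sketch. The Reading structure, the EndpointError exception, and the fetch_raw callable are all hypothetical. The retry loop and the payload checks sit in the one function that touches the endpoint, so the transformation below it never has to think about failures.

```python
import time
from dataclasses import dataclass


class EndpointError(Exception):
    """Raised when the endpoint cannot give us a usable reading."""


@dataclass
class Reading:
    sensor_id: str
    celsius: float


def fetch_reading(fetch_raw, attempts: int = 3, delay_seconds: float = 0.0) -> Reading:
    # fetch_raw is any callable that returns a dict from the endpoint
    # (hypothetical; swap in a real client call).
    for attempt in range(1, attempts + 1):
        try:
            raw = fetch_raw()
        except ConnectionError as err:
            # Transient failure: retry until the attempts run out.
            if attempt == attempts:
                raise EndpointError(f"gave up after {attempts} attempts") from err
            time.sleep(delay_seconds)
            continue
        try:
            # A malformed payload will not fix itself, so do not retry it.
            return Reading(str(raw["sensor_id"]), float(raw["celsius"]))
        except (KeyError, TypeError, ValueError) as err:
            raise EndpointError("malformed payload") from err
    raise EndpointError("attempts must be at least 1")


def to_fahrenheit(reading: Reading) -> float:
    # No error handling needed here: a Reading is already known to be good.
    return reading.celsius * 9 / 5 + 32
```

Everything past fetch_reading works with a clean Reading or does not run at all.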

Step 5: Test, then fix and edit. Over and over again, until the code is pretty and does what you need it to do, and you’ve checked all of the little details to make sure they are correct. This is the most boring part of coding, but you always need to spend the most time here. The code is not done until it has been polished.

If you do all of this, in this order, the code will work as expected, be really clean, and be easily extendable.

If someone wants the behavior of the code to change (which they always do), the first thing to do is Step #1: see if you need more data. Then Step #2, add in that extra data. Then Step #3, add or extend the small functions. Don’t forget Steps #4 and #5. The code still needs to handle errors and get tested and polished again, but sufficient reuse will make that far less work this time.

It’s worth noting that for Step #3 I usually start bottom-up, but sometimes I start top-down for a while, then go back to bottom-up. For instance, if I know it is a long-running job with twenty different steps, I’ll write out a workflow function that makes the twenty function calls first, before coding the underlying parts. Flipping between directions tends to help refine the structure of the code a little faster and make it cleaner. It also helps to avoid over-engineering.
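As a small illustration of the top-down pass, here is a sketch with three steps instead of twenty. The job and every function name are invented. The workflow function gets written first, and the parts underneath start as stubs that get filled in later.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    source: str
    rows: list[dict] = field(default_factory=list)
    log: list[str] = field(default_factory=list)


def run_nightly_job(job: Job) -> Job:
    # Written first: the whole workflow, before any step exists.
    load_rows(job)
    clean_rows(job)
    publish_rows(job)
    return job


def load_rows(job: Job) -> None:
    # Filled in on a later bottom-up pass (hard-coded data for the sketch).
    job.rows = [{"id": 1, "name": " Ada "}]
    job.log.append("load")


def clean_rows(job: Job) -> None:
    for row in job.rows:
        row["name"] = row["name"].strip()
    job.log.append("clean")


def publish_rows(job: Job) -> None:
    # Still a stub: the workflow already shows where it belongs.
    raise NotImplementedError("publish_rows is not written yet")
```

Running the workflow at this point fails at the stub, which is a handy to-do list.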

Step #3 tends to drive encapsulation, which helps with quality. But it doesn’t help with abstraction. That is Step #2, where instead of making a million little data structures, you make fewer that are more abstract, trying to compact the structures as much as possible (within reason). If you can model data really well, then most abstractions are obvious. It’s the same data doing the same things again, but a little differently sometimes. You just need to lift the name up to a more generalized level, dump all of the data in there, and then code it. Minimize the 'sometimes'. 
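Here is a sketch of what lifting the name up can look like. Instead of separate email, SMS, and push structures, there is one more general Notification. The names and fields are assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    EMAIL = "email"
    SMS = "sms"
    PUSH = "push"


@dataclass
class Notification:
    # One generalized structure in place of three near-identical ones.
    channel: Channel
    recipient: str
    body: str
    subject: str = ""  # only meaningful for email

    def render(self) -> str:
        # The single 'sometimes': email carries a subject line.
        if self.channel is Channel.EMAIL:
            return f"Subject: {self.subject}\n\n{self.body}"
        return self.body
```

Adding a fourth channel is a new enum value, not a new structure with its own copy of the code.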

The code should not allow garbage data to circulate. That is a major improvement in quality that you can tightly control with the data structures and persistence. As you add the structures, you add in common validations that prevent crap from getting in. The validations feel like they are Step #3, but really they can be seen as Step #2b or something. The data structures and their validations are part of the Model, and they should tightly match the Model in the schema if you are using an RDBMS. Don’t let garbage in, don’t persist it.
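A sketch of Step #2b, using Python's built-in sqlite3 module as a stand-in for whatever RDBMS you have. The Account structure, the table, and the rules themselves are hypothetical. What matters is that the structure and the schema enforce the same constraints, so garbage is stopped both on the way in and on the way to disk.

```python
import sqlite3
from dataclasses import dataclass

# Schema constraints mirror the validations in the structure below.
SCHEMA = """
CREATE TABLE account (
    email         TEXT    NOT NULL CHECK (instr(email, '@') > 1),
    balance_cents INTEGER NOT NULL CHECK (balance_cents >= 0)
)
"""


@dataclass(frozen=True)
class Account:
    email: str
    balance_cents: int

    def __post_init__(self):
        # Same rules as the schema: something before the '@', no negative balance.
        if self.email.find("@") < 1:
            raise ValueError(f"invalid email: {self.email!r}")
        if self.balance_cents < 0:
            raise ValueError("balance_cents must not be negative")


def save(conn: sqlite3.Connection, account: Account) -> None:
    conn.execute(
        "INSERT INTO account (email, balance_cents) VALUES (?, ?)",
        (account.email, account.balance_cents),
    )
```

A bad Account cannot be constructed, and if some other code path skips the structure, the database still refuses the row.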

That’s it. I always use some variation on this recipe. It changes a little, depending on whether I am writing application or systems code, and on the length of the work. The biggest point is that I go straight to the data first and always. I see a lot of programmers start the other way around and I almost always see them get into trouble because of it. They try to stitch together the coding tricks that they know first, then see what happens when they feed data into it. Aside from performing poorly, that code also degrades quickly into giant piles of spaghetti. Data is far more important than code. If you get the data set up correctly, the coding is usually easy, but the inverse is rarely true. Let the data tell you what code you need.
