Thursday, February 21, 2019

Software Optimizations

Most software can execute faster.

There are many ways that software can be optimized to improve its performance. Most of these techniques are well understood, but they still need to be applied with caution, since they can accidentally harm other attributes of the system.

The most obvious way to speed up code is to stop doing useless work.

One common form of wasted effort is redundantly copying the same data to many different areas of memory. Another is parsing the data into smaller pieces and then reassembling it later, or vice versa. Removing these from the code gets it closer to the minimal effort, but not necessarily all the way to the minimum.
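
As a rough sketch in Python (the payload and field names are just illustrative), compare re-parsing the same raw data on every access with parsing it once and passing the structured result around:

    import json

    # Wasteful: the same raw payload is re-parsed on every iteration.
    def total_wasteful(raw_payload, keys):
        total = 0
        for key in keys:
            record = json.loads(raw_payload)   # redundant parse each time
            total += record[key]
        return total

    # Leaner: parse once, then share the structured result by reference.
    def total_lean(raw_payload, keys):
        record = json.loads(raw_payload)       # single parse
        return sum(record[key] for key in keys)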

Sometimes, however, extra copies of data are necessary for security, locking or architectural reasons. These types of redundancies can stay if they are justified. For sanity reasons, most incoming data for a large system should be fully parsed as soon as possible. Exporting this data may legitimately require reassembling it. This too is fine.

Switching to a better algorithm can quite often afford very large time savings. There is a huge amount of knowledge available about algorithms and their performance attributes, but taking advantage of it requires significant research.
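
A classic illustration, sketched in Python with made-up record structures: replacing a nested linear scan with a set lookup changes the work from quadratic to roughly linear.

    # Quadratic: for each record, scan the whole list of wanted ids.
    def find_matches_slow(records, wanted_ids):
        return [r for r in records if any(r["id"] == w for w in wanted_ids)]

    # Near-linear: build the set once, then each membership test is O(1).
    def find_matches_fast(records, wanted_ids):
        wanted = set(wanted_ids)
        return [r for r in records if r["id"] in wanted]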

Sometimes shifting between time and space works really well; we can rebalance the code to trade one resource for the other. In some cases, though, there are natural boundaries for these reductions, and the embedded information simply doesn’t exist to optimize the code any further.

Algorithmic optimizations are often the most effective, but they require a great deal of knowledge and often a lot of time to implement properly.

Beyond that, memoization, which is the reuse of earlier computations, can produce decent optimizations. Caching is the most famous of these, but care must be taken to distinguish between read-only and write-through caching. They are very different from each other and frequently confused, and a bad implementation can cause weird bugs.

The big trick with memoization is not in saving the value, but rather in knowing ‘precisely’ when that value is no longer of any use. These types of optimizations require a strong understanding of usage and data frequency. Generalized solutions can help (and hurt), but specific solutions will always produce better results.
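
A minimal sketch of that point, in Python with a hypothetical compute function: saving the value is trivial, but every change to the underlying data must remember to invalidate it or stale results will leak into the rest of the system.

    class Memoized:
        def __init__(self, compute):
            self._compute = compute   # expensive function of (key, source)
            self._cache = {}

        def get(self, key, source):
            # Save the result of earlier computations for reuse.
            if key not in self._cache:
                self._cache[key] = self._compute(key, source)
            return self._cache[key]

        def invalidate(self, key):
            # The hard part: this must be called precisely when the underlying
            # data for 'key' changes, or callers will see stale values.
            self._cache.pop(key, None)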

An example of this is compression. Data can be taken down close to its information-theoretic minimum by exploiting the redundancies within it; beyond that point, some data is lost (which can also be acceptable). This is also a classic time vs. space tradeoff.
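
For instance, with Python’s standard zlib module the compression level is an explicit time vs. space dial (the sample data here is made up):

    import zlib

    data = b"the same phrase repeated over and over " * 1000

    fast = zlib.compress(data, level=1)    # less CPU time, larger output
    small = zlib.compress(data, level=9)   # more CPU time, smaller output

    assert zlib.decompress(fast) == data and zlib.decompress(small) == data
    print(len(data), len(fast), len(small))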

Parallelizing computation is another strong form of optimization. Making it work on interconnected data usually requires synchronization primitives like locking. Locking can be coarse-grained or fine-grained, with the latter usually providing better performance at the cost of more management overhead. Locking gets extraordinarily challenging when it occurs outside of a single process.
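
A small sketch of the coarse vs. fine-grained distinction, using hypothetical account objects in Python; the per-account locks let unrelated transfers run in parallel, but now lock ordering has to be managed to avoid deadlock.

    import threading

    # Coarse-grained: one lock guards all accounts, so every transfer serializes.
    GLOBAL_LOCK = threading.Lock()

    # Fine-grained: one lock per account.
    class Account:
        def __init__(self, balance):
            self.balance = balance
            self.lock = threading.Lock()

    def transfer(src, dst, amount):
        first, second = sorted((src, dst), key=id)   # fixed order avoids deadlock
        with first.lock, second.lock:
            src.balance -= amount
            dst.balance += amount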

Locking algorithms spread across different computers are bounded by the Two Generals Problem, which in turn underlies impossibility results like CAP and FLP. Generally, this is caused by an inherent ambiguity (missing information) in the communication between the separated computations, which gets worse as the underlying reliability of the communication weakens. This is sometimes described in terms of transactional integrity as well.

In general, we can optimize code by seeking out data independence. If, for a given computation, the result depends on some other piece of data, then that relationship bounds the minimum amount of work. All outputs must be produced from some finite set of inputs; this is true for all computations. Just as there is a precise minimum for information, there also exists one for computation.

Optimization attempts can then start by observing, for a given context, that there will never be ties between some specific set of variables, and using that information to reorder the work involved to get closer to the minimum. That is, we can conjecture that for any specific output, there is a finite number of computations and pieces of data that form a directed acyclic graph (DAG) with all inputs as leaves, and that there exists a minimal such DAG (relative to the computational primitives). This can be applied mechanically to any set of instructions, for a given set of data, since it is bounded by a specific context. Fill in these unknowns and the minimal set of work becomes explicit.
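
As a toy illustration of that conjecture (the node names and functions are invented), the computation can be written as a DAG where each node lists its inputs; only the nodes that the requested output actually depends on ever get evaluated:

    # Each node: (names of its inputs, function that combines them).
    graph = {
        "a": ((), lambda: 2),                    # leaves are the raw inputs
        "b": ((), lambda: 3),
        "c": (("a", "b"), lambda a, b: a + b),
        "d": (("c",), lambda c: c * 10),         # the output we actually want
        "e": (("b",), lambda b: b * 99),         # independent of "d", never run
    }

    def evaluate(node, graph, done=None):
        done = {} if done is None else done
        if node not in done:
            deps, fn = graph[node]
            done[node] = fn(*(evaluate(d, graph, done) for d in deps))
        return done[node]

    print(evaluate("d", graph))   # 50; node "e" is never computed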

Some algorithmic optimizations are tricky in that they would require currently unknown relationships to exist in order to find the actual minimum effort. We can, however, come close to the minimum, even if we can’t get there yet.

Most other optimizations are easier, in that they really come from understanding the data, its usage and the underlying functioning of the computers themselves (sometimes optimizations at one level exist to counterbalance bad optimizations at a lower level).

Most code is written as the ‘obvious first try’, so most of the time there is plenty to optimize. However, most programmers do not fully understand the data or the context, which is why we warn younger coders not to optimize prematurely. They do not yet have a full enough understanding to do it correctly, and bad optimizations, by definition, will use more resources, not less. Poor optimizations can also impair readability or extensibility. Only good optimizations will help.

Friday, February 15, 2019

Model-based Systems Design

Start with some data that you want the system to capture.

Where does this data come from? Data is usually entered by people or generated by some type of machine.

Most data is composite. Break it down into its subparts. It is fully decomposed when each subpart is easily representable by a programming language primitive. Try to stick to a portable subset of primitives like strings, integers, and floating point numbers. Converting data is expensive and dangerous.
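
As a hypothetical sketch in Python, a composite value like a mailing address decomposes until every subpart is a plain primitive:

    from dataclasses import dataclass

    @dataclass
    class MailingAddress:
        street_number: str   # a string, since "12B" is a valid number in places
        street_name: str
        city: str
        postal_code: str     # formats vary by country, so not an integer
        country_code: str    # e.g. ISO 3166-1 alpha-2, such as "CA"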

Are any of these subparts international or common standards? Do they have any known, necessary properties, constraints, restrictions or conventions? Do some research to see what other people in the same industry do for representing these types of values. Do some research to see what other programmers have done in representing these values.

Now, investigate usage. How many of these values will the system store? How often are they generated? Are there any weird rules for frequency or generation? Sometimes close estimates for frequency are unavailable, but it is always possible to get a rough guess or something like a Fermi estimate. Early in the lifetime of some systems, the observed data frequency will differ quite a bit from the eventual real-world frequency. Note this, and take it into account later, particularly by not implementing optimizations until they are necessary.
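
A made-up Fermi estimate, just to show the shape of the arithmetic:

    # All of these figures are rough, order-of-magnitude guesses.
    users = 10_000
    records_per_user_per_day = 5
    bytes_per_record = 500

    yearly_bytes = users * records_per_user_per_day * bytes_per_record * 365
    print(f"~{yearly_bytes / 1e9:.1f} GB per year")   # ~9.1 GB per year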

Once the data is created, does it ever need to change? Who can change it? Is the list of changes itself significant data too? Are there security concerns? Can anyone see the data? Can anyone change it? Are there data quality concerns? What are the chances that the initial version of the data is wrong? Does this require some extended form of auditing? If so, what is the frequency for the audit entities themselves? Is the auditing fully recursive?

For some of the subparts, there may not be a one-to-one relationship. Sometimes the main data, often called an ‘entity’, is associated with many similar subparts. Break these out into their own entities. Break any one-to-many, many-to-one or even many-to-many subparts into separate entities. During implementation, it may not make sense for the initial version of the system to treat a subpart as a separate entity, but for data modeling, it should always be treated as such. The model captures the developers’ understanding of the data as it gets created and used; the implementation may choose to optimize that if necessary. The two viewpoints are related, but should not be intermixed.
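
A hypothetical Python sketch of that split: a customer with many phone numbers becomes two entities rather than one entity with a repeated subpart.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class PhoneNumber:          # its own entity: many per customer
        customer_id: str
        kind: str               # e.g. "home", "mobile", "work"
        number: str

    @dataclass
    class Customer:
        customer_id: str
        name: str
        phone_numbers: List[PhoneNumber] = field(default_factory=list)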

For some data, there may be inter-relationships between entities of the same kind. Capture these as well. The relationships span the expressibility of data structures, so they include single entities, lists, trees, DAGs, graphs, and hypergraphs. There may be other structural arrangements as well, but most of these can be decomposed into the above set. These interrelationships sometimes exist externally, in which case they are themselves another entity and should be treated as such. Sometimes they are confused with external indexing intended to support searching functionality; that too is a different issue, and more entities. Only real structural interrelationships should be captured this way. Most entities do not need this.

What makes each entity unique? Is there a key or a composite set of values that is unique? Are there multiple keys? If so, can they conflict with each other? Spend a lot of time understanding the keys. Poorly keyed data causes huge problems that are hard to fix. Almost all entities need to be unique, so there is almost always at least one composite key, sometimes quite a few. Key mappings can be a huge problem.
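
A small illustrative sketch in Python (the entity and key parts are invented): a composite key is just the tuple of values that together make the entity unique, and collisions on it should be treated as errors.

    # Hypothetical: a trade is unique per (account, symbol, trade date).
    trades = {}

    def add_trade(account_id, symbol, trade_date, details):
        key = (account_id, symbol, trade_date)   # composite key
        if key in trades:
            raise ValueError(f"duplicate trade for key {key}")
        trades[key] = details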

While working with one type of entity, a whole bunch more will be created. Each of these new entities needs the same analysis treatment as the original. As each one is explored and understood, it can be added to the model. The model should grow fairly large for non-trivial systems. Abstraction can combine sets of entities with the same structural arrangement, but the resulting abstract entities should not be so generic that they can no longer properly constrain the data. A model is only useful if it accurately holds the data and prevents invalid data from being held.

Be very careful about naming and definitions. The names need to match the expected usage for both the target domain and computer science. Sometimes it takes a while to figure out the correct name; this is normal. Misnaming data shows a lack of understanding, and often causes bugs and confusion later. Spend a lot of time on names. Do research. They are hard to change later. They need to be accurate.

Don’t try to be clever or imaginative. The system only adds value by capturing data from the real world, so the answers to most questions are just lying around, out there, in the real world. There has been plenty of history for building up knowledge and categorizations; leverage that work. Conflicts between the model and reality should be resolved by fixing the model. This work only has value if it is detail-oriented and correct; otherwise, it will become one of the sources of the problem, not part of the solution.

There are forms of optimizations that require injecting abstract data into the model, but those types of enhancements should be added later. They are implementation details, not data modeling.

Some types of data are constrained by a fixed or limited set of values. These sets are domain-based. Try to find existing standards for them. Put them into their own entities and expect them to change over time. Do some analysis to figure out how often they are expected to change and who is expected to change them. Anyone involved in running, using or administrating a system is a user, and not all users are end-users.
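
A hypothetical sketch of such a lookup entity in Python, using the ISO 4217 currency standard as the fixed set of values; keeping it as data rather than hard-coded logic lets it change over time:

    from dataclasses import dataclass

    @dataclass
    class Currency:              # its own entity, loadable from a table or file
        code: str                # ISO 4217 code, e.g. "USD"
        name: str
        minor_units: int         # decimal places: 2 for cents, 0 for yen

    CURRENCIES = {
        "USD": Currency("USD", "US Dollar", 2),
        "EUR": Currency("EUR", "Euro", 2),
        "JPY": Currency("JPY", "Japanese Yen", 0),
    }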

As this work progresses, it will build up a large collection of entities. Groups of these entities will be tightly related. These groups draw architectural lines within the system.

Now, look at how people will use this collection of data. Do they need access to it quickly? Is it huge, and what types of navigation will users need to find what they are looking for? Is searching slow, and does it need some form of indexing optimization to make it usable? Will the system build up data really quickly, consistently or very slowly? Do different parts of the system have very different access requirements? Is there data in the system that is creativity-based or experimental? Some systems need tools to play around with the data and are subject to lots of little experimental changes. Others need the data to remain relatively static and only require changes to fix quality issues.

Will different users need the same data at the same time? Will different users edit the same data at the same time? How do changes with one entity affect others?

For large data, summaries are often important to present overviews of the data. For each entity, what types of summaries are necessary? Are these their own entities? How often do they change, and who can regenerate them? Are there user-specific categorizations that would be helpful in crafting useful summaries? Can these be shared? Are they entities as well? Can this reporting part of the system be fully integrated into the main model so that it stays in sync with any enhancements to the domain model itself?

Can the summary data be computed on-the-fly, or is it expensive enough to be regenerated only periodically? What is the precise cost of computations? Are there industry standards, existing libraries, etc. that can be used?
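
A minimal sketch of the periodic option, assuming a hypothetical load_orders function; the summary is only regenerated once it is older than an agreed-upon staleness window:

    import time

    class OrderTotals:
        def __init__(self, load_orders, max_age_seconds=300):
            self._load_orders = load_orders     # hypothetical data access call
            self._max_age = max_age_seconds
            self._cached = None
            self._computed_at = 0.0

        def total(self):
            stale = time.time() - self._computed_at > self._max_age
            if self._cached is None or stale:
                self._cached = sum(o["amount"] for o in self._load_orders())
                self._computed_at = time.time()
            return self._cached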

The answers to some of the above questions, such as how quickly the data needs to be viewed after changes and how many users need to view it, will create optimization requirements. The collection of these requirements will dictate the hardware, the architecture and any dependent technologies. Often there will be multiple possible solutions so regional programming resources and current programming fads will pin down a precise implementation.

The work modeling the data, since it is small in comparison to the implementation work, should extend beyond the initial short term goals of the development. It doesn’t have to fully cover all scope and all detail of the given domain, but it should at least extend out to the next couple of expected iterations in the development cycle. This makes it possible in the future to pipeline extensions in the system first by modeling, then by the actual implementation, so that the current coding work is at least one generation behind the current modeling work. Dramatic, unexpected pivots in the development will, of course, disrupt this cycle, but the frequency of these should diminish rapidly in the early days of development (or the project is already doomed for non-technical reasons).

A full data model then includes all of the entities, their internal and external relationships, all subparts that are ‘typed’, and all of the expected computations and necessary optimizations. Follow-up extensions should be highlighted as changes from the previous version. The version changes should match the code implementations (development cycles). The structure of any entity groups should lay out a high-level architecture, with further architectural constraints driven by the optimizations and possibly the arrangement of the development teams themselves.

The data model should include any of the analyst’s notes, and any issues about standards, ambiguities, conventions, and issues with keys. The full document can then be used to produce a high-level design and a number of necessary mid-level and low-level designs needed to distribute the work to the development teams.

Depending on the haste involved in doing this work, it is possible that there are flaws in the data model. These come in different types: a) the model does not reflect the real world, b) it has missing elements, or c) the model differs from an incorrect model used by an upstream or downstream system. If the model is wrong or incomplete, it should be updated, and that update should be pushed through design and implementation as a necessary change to the core of the system. It is the same process as extending the system. If the model differs from some earlier mistake, an appendix should be added that maps that mistake back to the model and outlines the consequences of that mapping. That mapping should be implemented as a separate component (so that it can be removed later) and enabled through configuration.

For most systems, spending the time to properly understand and model the data will lay out the bulk of the architecture and coding. Building systems this way will produce better quality, reduce the development time and produce far fewer operational issues. Done correctly, this approach also lays out a long-term means of extending the system without degrading it.

Thursday, February 7, 2019

Implementing Sophistication

Computers are intrinsically stupid.

To get around this problem, programmers have to take all of the knowledge they have acquired, visualize it in a way that makes it codable, and then implement it in software.

The easiest approach to this is to be as crude as possible. The coder does nothing other than sling around unknown data; all of the complexity is pushed either to the users or to the operating environment. This gets the baseline mechanics up quickly, but it does create a fragile system that isn’t extendable. It’s good for a demo, but rarely solves any real underlying problems.

Sophistication comes from the code taking on more and more of the problem in a way that is reliable. Instead of pushing back unintelligible character strings for a user to track, a sophisticated system employs powerful navigation so that they don’t have to remember anything. Instead of crashing poorly and requiring a long recovery time, a sophisticated system bounces right up again, ensuring that all of its components are in working order. The efforts pushed back to the ‘users’ are minimized, while the functioning of the system is predictable.

It’s a huge amount of work to add sophistication to software. We can crank out flaky websites quickly, but actually building up deep functionality takes time, skill and knowledge. Often people believe that it is not necessary; they figure that it’s only a little extra work pushed back to the users, so it is okay. But if you look at how much of our time modern software wastes, it should be more than obvious that we aren’t making good use of our hardware and all of the electricity that we pour into it. Crude software doesn’t really automate processes for us; it just shifts the way we waste our time.

Sophistication starts with understanding the data. Since computers are confined to the digital realms, at best they are only able to ‘symbolically’ represent things in the real world. These representations are bound by real-world constraints that are often extraordinarily complicated. It takes time to map their informal behavior into a rigorous formal environment. Time that isn’t wasted. If the system properly encapsulates the managed data then it doesn’t need external help or hacks when it is operating. If the data is a mess, then any or all of the computations on the data are suspect. Bad systems always have bad data models, the two are intertwined.

Really understanding data is complicated and it usually means having to go beyond just the normal branches of programming knowledge and directly into domain knowledge. For some people, this is the interesting part of programming, but most try very hard to avoid building up depth on any particular domain. That is unfortunate since the same basic ‘structural’ domain issues span across areas like finance, healthcare, etc. From an implementation standpoint, the usages are very similar and digging into one domain can lead to insights in others. Uniqueness and time, for instance, are common across everything, even if the instances of those problems are steeped in domain-specific terminology.

If the base of the system rests on an organized and complete data model, the construction of the system is quite easy. Most code, in most systems, is just moving the data from one part of the system to another. The trick is to maximize reuse.

Coding is still a slow, tedious operation, particularly when you include the work of testing and debugging. Reuse means that finalized, well-edited code is deployed repeatedly, which can eliminate huge amounts of work. That is, the only ‘good’ code is code that has been heavily edited and battle-tested; otherwise, it is fresh code, and it should be assumed that it contains significant bugs. This is always a safe assumption, and it makes it easier to understand the amount of work involved in releasing a new version of a system.

In most systems, there are huge opportunities for reuse, but they often require deep abstraction skills. Unfortunately, this makes them unavailable to most development efforts. Leveraging them requires a significant up-front investment that few organizations are willing to gamble on. It’s not possible, for instance, to convince management that an extra six months up front will save years of work down the road. Our industry is too impatient for that. Still, one can identify reuse and slowly refactor the code in that general direction, without having to commit significant resources immediately. This spreads the effort over the full duration of the project, but requires that this type of work is not discontinued halfway through. Thus modern programming should accept that reuse and refactoring are bound together: the latter is the means to achieve the former.

Big sophisticated systems take years, if not decades, to build. That is never how they are pitched; the time frame is usually ridiculously short and overly ambitious. Still, any developer who has been through a number of big projects is aware that the amount of work invested is massive. Because of this, systems in active development are continuously being extended to do more than their original designs. This is quite dangerous for the software, in that the easiest way to extend code is to just build some independent functionality on the side and barely integrate it. This type of decay is extremely common. It is a lot less work to slap something different at the edge than it is to revisit the underlying data model and work out how to enhance it. But each time this shortcut is chosen, the system gets considerably less sophisticated, more fragile and more bug-prone. What seems to be the faster way of achieving our goals is actually extremely destructive in the long run. So, sophistication isn’t just an initial design goal; it is an ongoing effort that continues as long as new capabilities and data are getting added to the system. Sophistication can be watered down or essentially removed by poor development efforts.

Given that adding sophistication to a system is extremely time-consuming, coders have to learn how to be efficient in order to be able to meet most development constraints.

The first issue is that not all time spent building the system should be actual coding. In fact, coding should be the last, most routine part of the effort. Programmers need to learn how to acquire an understanding first, then visualize a solution, and only at the end sit down and start fiddling with the code. Diving head first into the code and getting lost there always wastes a huge amount of time. As well, new programmers are often hesitant to delete their code, so instead they start to build up unintelligible, disorganized messes that they flail at to fix a never-ending set of bugs. Code gets sticky, and that causes its own problems. Fear of changing code often leads to writing more bad code.

Sometimes, the best approach to fixing the code is to walk away from the computer. Research (textbooks, blogs, etc.) and bouncing ideas off other programmers are two really critical but underused approaches to being more efficient. If you are having trouble explaining what the code should do, then you probably don’t understand it well enough to get it to work properly. It’s a waste of time to fight with code.

Efficiency also comes from non-development activities as well. Micro-management is a very popular approach these days for software development projects, but it can be horrifically misapplied and lead to tonnes of make-work. Stakeholders need some level of accountability for the work they commission, but software development should never be driven by non-technical people. They don’t understand the priorities so they focus on the shallow issues that are most visible to themselves, while the real problems in development come from the details. This always leads to a quick death as the technical debt overwhelms the ability to progress. A reasonable methodology can help, but it is tricky to apply it properly. Methodology for small projects is trivial, but the complexities grow at least exponentially as the size of the project grows. It is a very difficult and concentrated skill to keep large scale development projects from imploding or exploding. It is quite a different set of skills from either coding or architecture. In this sense, software development is intrinsically unscalable. More resources often result in more make-work, not real progress.

Realistically, it isn’t difficult to type in a small set of instructions for a computer to follow. It is difficult, however, to type in a complete set of instructions that will help a user reliably solve one of their problems. We often get these two things confused, and a great deal of modern software development is about claiming to have done the second while only doing the first. As the software development industry matures, we should be able to do more of this enhanced type of development, and we do that by getting beyond our crude practices and adding in sophisticated code. This type of coding takes longer, is harder and requires more skills, but ultimately it will make computers significantly more usable for us. We shouldn’t have to accept so many software problems; we shouldn’t let our standards remain so low. Computers are amazing machines that still have a huge ability to improve our lives. Sophisticated software is what makes this possible.