Tuesday, May 17, 2011

Deep Refactoring II

In my last post, I laid out some very high-level ideas about refactoring our basic software architecture by removing the distinction between memory and disk. It’s was a very simplistic sketch of some of the underlying characteristics of this type of change.

A reader, Kemp, commented on several issues. The first was with regard to temporary memory usage and the ability to clear out short-term objects. The second was with respect to the program counters (PCs) and scheduling. I thought these were interesting points, worthy of a bit more exploration of the idea.

Please keep in mind that both of these posts are really just playing around with the basic concept, not laying out a full proposal. Ideas must start somewhere; in some simplified form, but it is often worthy to give them consideration, even if only to understand why they will or will not work appropriately. We tend to take the status quo too seriously. Just because things have evolved in a particular fashion doesn’t make them right, or other alternatives wrong. At some point big progress has to leap frog over things existing. The path to innovation is not linear, but often curvy and riddled with gaps and lumps.

Kemp’s first concern was with temporary data. A process has two basic sections of allocatable memory, the stack and the heap. The stack works by pushing new frames onto the structure for each function call, and popping them when its completed, thus allowing for recursion. Temporary variables are allocated within that chunk of memory. The heap, on the other hand, generally just grows as new blocks of memory are allocated. It is explicitly managed by the process (although languages like Java now hide them from the programmers).  In many systems, the code and any static data are loaded roughly into the center of memory, while the stack and heap grow out in both directions.

Way back, when I was building long-running processes in C, it was considered good practice to only use the heap for long-term data. Anything temporary was placed on the stack, which was essentially cleared when a function finished. Buffer overflow concerns, Object Oriented languages, and more advanced memory management techniques contributed to change this, by dumping more stuff into the stack.

These changes, and a reduced understanding of the underlying behavior of the machines have contributed to an increasing problem with bloat in modern software. There was a day when 64K could be used effectively, but these days many programs eat megabytes or even gigabytes excessively. Although the hardware has followed Moore’s law, doubling every two years, software programs aren’t significantly more responsive than they were twenty years ago. In some cases, like scrolling, they’ve even gotten noticeably slower.

If we are considering refactoring the underlying architecture, then it is not unreasonable that this would have a significant affect on how we design programming languages as well. The compiler or interpreter can tell in many cases from the scope of the variable that it is temporary. But there will be many situations where some data is allocated, used and then passed around into other regions of the code, where it will be assumed to be global.

One way to fix this, would be to explicitly indicate that some memory should be deleted, or preserved. In an Object Oriented like language long-term objects could inherit from Persistence, for example (OODBs worked this way). That still might leave cyclic dependencies, but given that the program never actually stops running, there is a considerable amount of available CPU power during slack times to walk the structures and prune them. For workstations, late at night they might start a cleanup process (similar to how dreaming compacts/deletes memories).

One thing that is really noticeable with most modern systems is that they are paging excessively. Disks have considerably more space then addressable memory, and we have the expectation that there are many different programs and threads all running at the same time. There just isn’t enough physical memory to hold all of the code and data for everything that is running. Depending on the OS (Unix and Windows use different underlying philosophies for paging), there can be a considerable amount of resources burned, causing excessive copying (thrashing) back and forth between memory and disk. In many overloaded Windows systems, this is often the reason for the strange, inconsistent pauses that occur.

If you consider even a simple arrangement for many systems it’s easy to see how inefficient these operations really are. For example, consider someone running Outlook on their workstation. If they have a large number of inbox messages, and they want to search them it can be very resource intensive. Their client talks to an Exchange server, which then talks to a database server. The emails are stored in the database on disk, so if you want to use Advanced Find to search for some text in the emails, each and every one of them needs to be loaded into the database. Most databases have some sort of internal caching mechanism, plus there is the OS paging as well. For a large number of emails, the data is loaded, searched, then probably paged out and/or cached. The actual text matching code may be in the database, or it may be in the Exchange server. The resulting Subjects (and possibly text) are collected, probably cached/paged and then sent to the Exchange server, which again caches/pages. Finally the data is sent over a socket to the client, but again this triggers more caching or paging. In the worst case, any of the email bodies might exist twice in memory, three times on disk. The Subjects could be replicated as many as six times on several machines. And all of this extra work doesn’t necessarily make the process faster. Caches are only effective if the same data is requested multiple times (which in this case they aren’t), while memory thrashing can add large multiples of time to the necessary work (depending on the paging algorithm).

One large block of fully accessible memory that covers all programs and their current execution states would still have to have some way of indexing memory blocks (or our pointers would be arbitrarily long), but it could follow an architecture more like a memory mapped file. The process would have some internal relative addresses that would then be indexed back onto the larger space (but no need to copy). Given the rapid growth of modern storage, the indexing method would likely be hierarchical, to support a much larger range than 32 or 64 bits (or we could fall back into ‘memory models’ from the old DOS days).

The access to each instruction in the code, and its data might be slower, but these days instruction caching in the chips is far more effective. Sometimes with the early RISC systems, for some programs it was necessary to turn off all of the cache optimizations (in programs with complex flows), but that doesn’t seem to be the case anymore (or people just stopped analyzing performance at this level).

Paging and caching aren’t the only places that eat resources. Most systems store their data in a relational database (although other types of data stores are becoming popular again). The representation in the database is considerably different from the one used in memory, so there are often huge translation costs associated with moving persistent data into memory and back again. These can be alleviated by keeping the persistent representation almost identical to the internal one (absolute memory addresses have to be made relative, at least), but that cripples the ability to share data across different programs. Thus programmers send a great deal of effort writing code to convert from a convenient memory representation to a larger, more generalized persistent one. Not only does this eat up lots of time, but it also makes it far more laborious to write the code and the underlying static nature of the database representation tends to be fragile. So basically we do more work to waste more resources and get more bugs.

With all of that in mind, changing our language paradigm somewhat to be more vigilant in cleaning out temporary data isn’t a onerous shift, particularly if we can eliminate thrashing problems, and although I didn’t talk about them, race conditions caused by faulty caching.

That leads nicely into Kemp’s second point. My original description of the idea in the last post was considerably vague. But what I had in mind is to flip the dynamics. Right now we send the code and data to the CPU, but in this new idea we could send the CPU (and all of its PCs) to the code, wherever it lies.

For a single machine, since this new memory would be slower, it would take longer. But not necessarily longer than it already takes to page in the data, then cache it in the chip. Like our current architecture, there could be many different threads all running within the same process. One simple arrangement would be to basically emulate our modern control flow, so that there is a main thread which spans off new threads as needed.

This is OK, but I can envision also changing this arrangement. Instead of a main line branching off different executions paths, we could reconstruct the system so that everything executing was basically similar to async handling. That is, the code is completely event-driven. Nothing happens in the program until an event is processed. There is no main busy-wait loop. Each program is just a massive series of independent functionality. As the user interacts with the system one or many of these are triggered. They too can trigger other bits of code.  In this way, the user interaction, and the scheduler can send various PCs to spots in a program, based on their needs. If the user triggers an operation that can exist in parallel, a considerable amount of resources can be sent in that direction to handle the processing. A system level scheduler can also launch back-ground processing as it is required. Given this flexibility, the system has a greater degree of control on how to balance the load between competing processes and back-ground jobs. Re-indexing the file system, or checking for viruses can be temporary delayed if the user is performing furiously with their mouse. This flexibility doesn’t exist in modern systems because the control over the internal resources is buried inside of each program. The scheduler isn’t privy to the main loop, or to any busy-waits.

From a user perspective on a workstation, there would be no noticeable difference, except that intermittent pauses would be less likely to occur. But underneath, the only main loop in operation would be the operating system’s.

Now, since we’ve got one massive chunk of memory, and some ability to memory map the process space onto this, it isn’t a stretch to extend this further. Instead of going to a particular external server to request data with a sub-language like SQL, we could just map a portion of the server’s memory right into our workstations. Then we could access it’s data, as if it were already in our process. That of course, would require coordination, locking, and fighting with some particular race conditions from writes, but if the performance were fast enough then we could use the network to effectively extend out our workstation to terabytes or petabytes. Why ‘copy’ the data over, when you can just ‘mount it’ and use it directly?

While that would be fun (if it’s workable), even more entertaining would be to mount our memory onto other workstations. Late at night, when every one is gone, we could just co-opt their CPUs to start updating our processes. You’d just need to send some type of communications to the other stations to access your memory, and respond to, or generate events. After working-hours, programmers could turn their now abandoned offices into giant clusters, all working in tandem to handle some heavy duty processing.

This type of architecture is far from unique, way (way) back I worked on some Apollo computers, that run a CAD/CAM for super-computer circuit boards. The system was developed in Pascal, and as the co-op student my task was to write some code to distribute processing to any workstation that wasn’t being used (unfortunately the company crashed before I had made significant progress on the work). Systems like that and Beowulf clusters distribute the code and data to their nodes, but in this model the the nodes would instead converge on the code and data.

There would of course be lots of resource contention issues, and some maximum threshold for how many CPUs could effectively work together, but distributing or replicating the underlying information to speed up computation would be a lot simpler, and could be handled dynamically, possibly on demand.

Alright, that’s it for now (or possibly forever). With this much text, I’ve got a 2% chance of any reader getting down to this paragraph, so although I could go on, I’ll stop, after getting one final round-up paragraph :-)

If the underlying synchronization, locking and race conditions are manageable one big large addressable chunk of memory would reduce a lot of redundant data movement and translation. It would also allow for ways to extend it across networks to both provide massive available memory and to allow independent machines to co-operate with each other. Scheduling would be more effective, since it has more information to work with. From a coding perspective, it would drop out a lot of the work required to convert the data from the database, although some other multi-process synchronization technique would still be necessary. As an odd point, it would make many user’s lives easier because they already can’t distinguish between memory and disk. Also, after power outages, the programs would go right back to their last state, there would be no rebooting time. Backups would no longer be hot/cold and there might be a way to use RAID-like ideas for synchronizing memory, leading to better fault tolerance and recovery (particularly if  the same chunk of memory were shared on multiple machines, and kept up-to-date).

Then again, it might not be possible :-) We can’t know, until someone tries to build it.