Sunday, April 11, 2010

Data Vaults

"Insanity: doing the same thing over and over again and expecting different results."

-- Albert Einstein

We build massive monolithic systems, trapping them in vertical silos. Many of these systems are so deeply embedded in our organizations that these giant reservoirs of potentially useful data never get utilized. While we've learned to collect data, we fail miserably at actually using it, particularly with larger volumes and/or complicated data structures.

Given these problems, I have an idea for turning the solution on its head. If we're not getting real value out of every organization collecting its own huge repositories, then perhaps we need to change that.

We'll start with a very simple concept. Each and every user has their own unique little 'data vault'. It is a safe place where they can store some data and allow others to access it.

The technology would be simple: all of the data in the vault is application-independent. That is, it is constructed in a manner that ensures it isn't implicitly reliant on some application code. Everything needed to properly interpret the data is in the data.

A vault, then, would be a place where a large number of outside systems could deposit this universally structured data and then read it back later.
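To make that concrete, here is a rough sketch of what one of these self-describing deposits might look like. The field names and type tags are purely my own invention, just to illustrate the idea that the schema travels with the values:

```python
import json
from datetime import datetime, timezone

# A hypothetical self-describing record: everything needed to interpret
# the values is carried along with them, so no application code is
# implicitly required to make sense of the data later.
record = {
    "schema": {
        "name": {"type": "string", "description": "customer's full name"},
        "contact-date": {"type": "string", "description": "when we spoke (ISO 8601)"},
        "notes": {"type": "string", "description": "free-form conversation notes"},
    },
    "values": {
        "name": "Jane Doe",
        "contact-date": datetime.now(timezone.utc).isoformat(),
        "notes": "Asked about upgrading her laptop warranty.",
    },
}

print(json.dumps(record, indent=2))
```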

For most of the data, the vault's owner could likely delete it if they wanted to, but they would be unable to change it in any way. This immutable nature helps assure organizations that the vaulted data is not being used in some inappropriate manner against them: if it exists, it has not been altered. It is for the organization's comfort.

Each and every vault would explicitly audit all reads and writes to its contents. That is, anyone and everyone who ever had access to the vault would be added to a permanent internal record.
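Those two rules, append-only writes with delete-but-never-edit, plus a permanent audit trail, are easy to sketch. Here's a toy in-memory model; the class and method names are mine, not part of any proposed standard:

```python
import uuid
from datetime import datetime, timezone

class Vault:
    """Toy vault: records can be deposited, read or deleted, but never
    edited, and every operation is appended to a permanent audit log."""

    def __init__(self, owner):
        self.owner = owner
        self.records = {}   # record id -> immutable data
        self.audit = []     # append-only, never pruned

    def _log(self, who, action, record_id):
        self.audit.append((datetime.now(timezone.utc), who, action, record_id))

    def deposit(self, who, data):
        record_id = str(uuid.uuid4())
        self.records[record_id] = data  # note: no update method exists, by design
        self._log(who, "write", record_id)
        return record_id

    def read(self, who, record_id):
        self._log(who, "read", record_id)
        return self.records[record_id]

    def delete(self, who, record_id):
        if who != self.owner:
            raise PermissionError("only the vault's owner may delete")
        del self.records[record_id]
        self._log(who, "delete", record_id)
```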

To make this effective, every user of every system that utilized the data would have to be declared; no 'shared accounts' or 'temporary users' would be acceptable. Organizations would be held legally accountable for ensuring the integrity of their usage.

The vault hosting sites would also be registered, and would be held to some very strict access rules. They would have to make sure, for example, that access to any backups was not possible without clearance, and that the basic access and editing security rules were being followed to the letter. They would be legally accountable for breaking any of these rules.

The security around the vault may sound a bit Draconian, but in concept we are moving the trust from within organizations out to some outside party. Without a well-entrenched security model, the trust necessary to make these ideas work could not be established. The organization has to trust the host, and the customer has to trust both of them.

So, what we now have is a mechanism for any organization to save specific customer data into a customer-defined location. In some cases the customer can delete the data, but in all cases they cannot edit it. Each and every organization capable of altering the vault is registered, along with an up-to-date list of all employees with access. Access can be given, but it will be tracked. Anything that happens to the vault is recorded permanently.

Now, to make this more interesting, an organization can write some data into the customer's vault and then retrieve it later. A simple scenario for this would be customer-relationship management (CRM). Each and every time the customer talks with the organization, they would start by identifying their vault, and then the details of the conversation would be stored in it. Organizations could 'personalize' their services without having to store all of this information locally. They could save huge costs on running their own complex local IT systems, instead concentrating on a simpler IT infrastructure that maintains just the core systems.

For an organization storing data, there would obviously be some concern about rival organizations exploiting its efforts, so one of the storage choices would be 'private': only the depositors of the data and the customer could see the underlying data.

For customers, some would prefer to have a context kept over their conversations, so the vault would be really handy. Others, no doubt, would prefer that no one retain any information, so they would be free to visit the vault and eliminate its contents. In fact, they could just choose not to mention a vault id at all.

Another interesting aspect is that a single customer could easily have multiple vaults, like email accounts. In that case they could choose which vaults to present to which companies, allowing some to have closer relationships while not trusting others.

Not all vaults are created equal. Since it is a computer system, some vault hosts may choose to offer premium services: larger or faster access, better UIs, better backups, backups in other regions, better security, archiving, etc. Data vaulting would commoditize really quickly, with different organizations competing heavily over an ever-increasing territory. There will always be more data to collect, and it will always be growing.

While storage space is generally cheap, there may be applications for vaults that could eat up lots of storage. Customers aren't going to want to pay the full cost of their storage; in fact, most aren't going to want to pay any costs at all. To make this work, each and every time an organization writes to the vault, its costs help subsidize extra space for the customer.

That is, if DELL stores 100K worth of client interactions in someone's vault, it is implicitly paying for 200K worth of space; the customer gets that extra storage for free. Although DELL has paid for more space, the overall cost of the data vault would still be significantly less than what it currently pays for its siloed systems, so it would represent a price cut. Also, some of the marketing budget could be redirected into vault 'promotions'. The vault industry itself would be priced very low, because there would be significant competition.
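The subsidy arithmetic is trivial. As a sketch, using the two-for-one ratio from the example above (the ratio itself is made up, not a proposed standard):

```python
def subsidized_space(bytes_written, subsidy_ratio=2.0):
    """An organization writing N bytes implicitly pays for N * subsidy_ratio
    bytes; the difference becomes free space for the vault's owner."""
    total_paid_for = bytes_written * subsidy_ratio
    customer_bonus = total_paid_for - bytes_written
    return total_paid_for, customer_bonus

paid, bonus = subsidized_space(100 * 1024)  # DELL deposits 100K of interactions
print(f"paid for {paid / 1024:.0f}K, customer gets {bonus / 1024:.0f}K free")
```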

There would also be a lot of other ways for the customer to avoid paying the bills: research being a big one, governments being another.

Certified research companies could pay for access to anonymized vault information. Since it is the vault itself that anonymizes the access, security is maintained and access behavior is consistent. The data could be truly anonymized and still be hugely useful for research.
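The key point is that nothing identifying ever leaves the vault. A rough sketch of the filtering step; in practice the set of identifying fields would be fixed by the standard, not hard-coded like this:

```python
# Hypothetical: the vault strips identifying fields from a record before
# releasing it to a certified research company.
IDENTIFYING_FIELDS = {"name", "address", "phone", "email", "vault-id"}

def anonymize(record):
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}

record = {"name": "Jane Doe", "email": "jane@example.com",
          "age-range": "30-39", "purchase-category": "laptops"}
print(anonymize(record))  # {'age-range': '30-39', 'purchase-category': 'laptops'}
```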

Every vault used in the research could be rewarded with some extra space or some other payment option. Organizations get high-quality data for their research, while the users get the comfort of knowing their data can't be misused.

Beyond customer relations, purchasing and research, I can see an even bigger market for the vaults. Healthcare is struggling with the digital age, and the vaulting system is a perfect answer to its problems.

Healthcare providers can use the vault as the official electronic medical record (EMR). All of the data from all of the different doctors, diagnostics, treatments and other interactions can go into the vaults just like the data from any other organization.

The vaults can hold a limitless amount of data, so in the special cases where the medical relationship has been long and difficult, any enhanced vaulting requirements for massive data can be handled specially. Large or complex treatments may generate large data sets that need to be distributed.

For most customers, however, handling their medical data in a vault would be fairly trivial. To their benefit, they would get to see their data, review it, and possibly get some subsidized vault space out of the arrangement.

For people without home computer access, Internet cafes and public libraries would provide a way to see their data.

Now, an interesting issue would be allowing users to move their vaults around, and providing some federated access for users with multiple vaults, in case the vaults grew too large and needed to be split.

So, it makes sense at a higher level to apply a simple 'redirection' layer to the technology.

Keeping it simple, vault ids would be analogous to URLs, and the vault locations themselves would be analogous to IP addresses. This would allow a distributed, DNS-like technology to supply and cache the specific vault location, independent of the customer's id. Further, because it would be almost identical to existing DNS technology, caching would provide fast access for frequent clients.

Then a 'vault' could be spread over multiple separate locations, and locations could be moved around from machine to machine. The whole indexing mechanism would provide a simple layer over the top to make things more flexible. It would be possible for a customer to easily switch their vault handling from one host to another.
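A minimal sketch of that redirection layer: resolve a vault id to its current host locations, caching the answer the way DNS resolvers do. The registry, id format and TTL handling here are all stand-ins for whatever the real distributed service would look like:

```python
import time

# Hypothetical global registry: vault id -> current host locations.
# In reality this would be a distributed, DNS-like lookup, not a dict.
REGISTRY = {
    "vault://jane.doe/main": ["host-a.example.com", "host-b.example.com"],
}

class VaultResolver:
    """Resolves vault ids to locations, with DNS-style caching."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.cache = {}  # vault id -> (locations, expiry)

    def resolve(self, vault_id):
        cached = self.cache.get(vault_id)
        if cached and cached[1] > time.time():
            return cached[0]                # cached fast path for frequent clients
        locations = REGISTRY[vault_id]      # stand-in for a network lookup
        self.cache[vault_id] = (locations, time.time() + self.ttl)
        return locations

resolver = VaultResolver()
print(resolver.resolve("vault://jane.doe/main"))
```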

Because of the nature of some of the data, all vault providers would have to be registered and follow some strict basic rules. The software itself would be fixed by standards, but there would probably be a small number of different distributions. Different software companies would implement their vaults in slightly different manners, and hosting companies could pick and choose the best software (or write their own).

One thing that needs to be included in the vaulting standard is a large set of software libraries used for vault access by the various different organizations. There should be a consistent, common interface for all of the applications that need access to vault data. Basically, given a vault id, the return from a vault access library should be the correct 'data model'. Partial models and paging support would be necessary.
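Something like the following interface, sketched with names I've made up; the point is only that every application sees the same calls and gets partial models and paging for free:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataModel:
    """One page of a vault's data model, possibly partial."""
    vault_id: str
    records: list
    next_page: Optional[str]  # opaque paging token; None when complete

def fetch_model(vault_id: str, page: Optional[str] = None,
                page_size: int = 100) -> DataModel:
    """Hypothetical common entry point: every application, in every
    organization, calls this instead of rolling its own access code."""
    # ... the resolved vault host would be contacted here ...
    return DataModel(vault_id=vault_id, records=[], next_page=None)

# A client walks the whole model one page at a time:
page = fetch_model("vault://jane.doe/main")
while page.next_page is not None:
    page = fetch_model("vault://jane.doe/main", page=page.next_page)
```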

Once the programmers have access to the model, the code would be very strict about letting them add or delete data. This isn't a restriction intended to annoy other programmers; it is based on the observation that the vaults would quickly fill with junk data if there were little control over what gets stored. A free-for-all, given our current software development industry, would be a disaster, but the data still needs to be dynamic in structure.

Relational databases would likely be a bad underlying technology for this type of data model, because they can't handle dynamic data properly. A slightly new storage model would be needed to get the necessary balance between flexibility and structure. I've got some ideas on how this could be worked out. It's not complicated, but it is the key to making all of this work properly.

All outside applications would need to be able to add dynamically structured values, but unless there was some clear, simple overriding structure, sharing data, research and data mining would be impossible.
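One way to square that circle is to let the structure stay dynamic but require every write to declare it, and have the vault reject anything that doesn't. A toy validator for the self-describing deposits sketched earlier; the type names are invented for illustration:

```python
# Toy overriding structure: every deposit must carry a schema, and every
# value must match a type the standard knows about.
KNOWN_TYPES = {"string": str, "integer": int, "number": float, "boolean": bool}

def validate_deposit(record):
    schema, values = record.get("schema"), record.get("values")
    if not schema or not values:
        raise ValueError("deposits must carry both a schema and values")
    for field, spec in schema.items():
        if spec.get("type") not in KNOWN_TYPES:
            raise ValueError(f"unknown type for field '{field}'")
        if not isinstance(values.get(field), KNOWN_TYPES[spec["type"]]):
            raise ValueError(f"field '{field}' does not match its declared type")
    return True
```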

Since this idea was intended to get beyond just collecting data into giant silos, distributed but inaccessible data isn't really a sufficient improvement.

Besides CRM and healthcare, there are plenty of other uses. People could purchase vaults to store their digital images or back up their digital music collections. They could store scans of important documents like birth certificates, taxes or other legal papers. They could automatically sync their hard drives to a vault somewhere. The integrity of the vault system would make it an ideal candidate for handling data replication issues. Extended a little farther, other organizations could utilize it as well for their off-site storage backups.

There is no limitation on the type of data that can be stored.

Along with personal documents, vaults could also be the anchor point for a huge number of social systems, allowing different networks to share common related data based on individual agreements between companies. For instance, three social sites could form a common organization that would allow each of them to read and write common data. The vault provides neutral ground and a standardized technology, so that none of the three gets an unfair advantage.

Customers could wire all of their accounts into vaults. In that way, they would become responsible for managing their own identities. They might have one vault that contains all of the preferences for their favorite, most frequently accessed systems, and another that 'hides' their identity in some manner (a fake user name, for example).

There are a huge number of other possibilities as well.

This idea balances a number of different factors: for their own comfort, customers want to control their own data within a system; organizations need access to that data, but full control of it gains them nothing; neutral mechanisms are needed to share data between organizations; and large data needs to be split across hardware. Vaults strike a reasonable trade-off between these different issues.

All of these things become achievable if we break the storage off from the system and allow third parties to maintain it. While it is a really simple idea, if we can get the right data representations, a network of data vaults will allow us to reshape the way we think about data, and to fix many of the intrinsic problems we currently have with our existing systems. Once free of the silos, we can start to employ our data to enrich our lives, rather than just making IT operations personnel miserable because they are working so hard to collect stuff without actually being able to use it for anything.

So, who is willing to invest in building this?

Monday, April 5, 2010

Development Ideas

I have some great ideas about how to build solutions to some of our most common programming issues. The problem? I don't really know any person, company or organization that is genuinely interested in trying to solve real problems.

There are, of course, a massive number of people out there keen on making money, and a few dedicated to making noise, but when it comes down to really working on ideas that could change the world, everyone is suddenly busy with other things.

Perhaps if I could be more modern and swore up and down that my ideas will definitely work right away, and that they aren't just experimentation or research, I could con, err... I mean: convince someone into backing this work. But then, if I knew for sure that these ideas were winners, my house would already be mortgaged and I'd be eating cat food (and not the human-grade kind) for years while busily trying to finish up the work.

Over the years I've tried getting my own companies going, finding angel or venture capital, and just setting up something simple, but all to no avail. I'd just do some sort of free open source project, but I've got a house to pay for and a family to feed (although half of it is canine). I work to maintain my life, and that also draws off my excess energy for working on the side. Writing at night and photography on the weekends complement my full-time software development activities; coding until the wee hours of the morning does not.

I'd like to change the world, and it could certainly use the help, but I have no idea how to get myself into a position where I could try. Any suggestions?

Saturday, April 3, 2010

Future Considerations

Something one of my friends said got me thinking. He was complaining about the iPad not having any communication ports on it, like USB or serial.

Yeah, at first I thought it was crazy too, but on consideration I started to get the bigger picture.

The classic view of a computer was this thing that exists all together in one place; you interact with it to calculate stuff. The first machines were massive, but we lived through an age where they became smaller and smaller. With each downward shift in size, the machines got into closer proximity to our homes and our lives, and closer to being invisible. They became ubiquitous.

But still, while the machines were growing smaller and smaller, they were all contained in one place. A machine and its peripherals were still a single collected entity, all interacting nicely with each other.

At the same time as things started getting smaller, they also started getting more distributed. Sun's "the network is the computer" slogan started a range of efforts to make things distributed, parallel or both.

The computer went gradually from being a single discrete entity into a larger number of parts, distributed over larger and larger areas, culminating with the Internet spanning the world, which is an amazing feat.

But still, even as the 'computing' became distributed, the basic building block was the single 'computer'.

Now, along comes wifi, and it gives us this really cool ability to create invisible, transparent networks. The interconnections have been great, but so far limited; we've just replaced the most obvious wires on the floor.

But what we can do now is start using wifi networks as the backbone for the individual computers themselves. That is, you no longer buy a machine for your home; instead you just buy the individual pieces and assemble them into your house as needed. Plug and play will become "proximity and play", where just bringing in a new component and giving it the wifi password will be enough to fully configure it into its new environment. It will be easy to extend the capabilities of your house gradually, as you keep buying more gadgets.

The iPad becomes the screen and keyboard, but you could have a separate machine to hold the disk space, another for backup, and a completely independent printer. I've heard of people connecting speakers to their network, and one could connect bigger displays like TVs as well. This could also extend to all of the appliances, and even to various types of storage, like closets.

Basically, the house becomes the computer, the wifi becomes the bus between the devices, and all of the components just connect to everything. You're already living in your next machine ...
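As a rough sketch of what "proximity and play" might mean in practice, here's a toy discovery scheme over UDP broadcast; the port and message format are entirely hypothetical, and a real standard would have to handle security, but the shape of it is simple:

```python
import json
import socket

PORT = 50000  # hypothetical; any agreed-upon UDP port would do

def announce(device_type, capabilities):
    """A freshly unboxed device broadcasts what it is and what it can do."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    message = json.dumps({"type": device_type, "capabilities": capabilities})
    sock.sendto(message.encode(), ("<broadcast>", PORT))
    sock.close()

def listen():
    """The house's existing devices listen for newcomers and wire them in."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    while True:
        data, addr = sock.recvfrom(4096)
        device = json.loads(data.decode())
        print(f"new device at {addr[0]}: {device['type']} offers {device['capabilities']}")

# e.g., a new printer joining the house:
# announce("printer", ["print", "scan"])
```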

Now if we mix this further with the growing trend towards 'natural user interfaces' (NUI), we should start to see more and more intuitive, natural methods of interfacing with these extended machines.

Talking, moving, interacting with objects in the real world: all of these can be tracked, analyzed and enhanced by our equipment. Someday your mother or wife won't have to nag you about tossing your recently worn clothes into the laundry hamper; it will be completely automated instead.

The big thing about natural interfaces, to me, is that we previously had to go to the computer to interact with it. We were on the computer's turf, playing according to the computer's rules. Now these new interfaces are allowing the computer to come to us instead. That's a big change.

As always with technology, though, while the hardware is going through a number of really exciting new changes and growth, software has been lagging behind, restricting our usage.

The last few decades have seen more code and more data get created, but while they are prettier, computers are more frustrating to use than ever. We've started a trend towards stupidly complex systems that are just horrifically painful to work with. Every feature under the sun is packed badly into some weird or awkward interface, and kept in place by sheer size. Big, ugly and monolithic.

Each new generation of programmers simply rebuilds the failures of the last generation, with few real improvements and many, many dis-improvements. Little is learned from history.

As an industry, software has for decades been about creating new fads and flaky technology, not about fixing or improving the essentials.

Even the current mantra about things like "not creating new frameworks" shows how tunnel-visioned we've become, since all of the existing frameworks are awkward to use, don't solve the core technological problems, and are all basically variations on the same design. One that clearly isn't working.

You'd think we'd be out exploring new and interesting alternatives given our current technologies, but that approach is frowned upon. The current fad is to be so focused on rewriting that domain application (for the third time) that there is no time left to fix any of the real low-level core issues. Leave 'em for someone else.

But then, software development has always been driven by fanatics to some degree, although lately the trend is getting worse. Most of what is being offered up as solutions are pretty little band-aids trying to cover gaping wounds.

I'm more embarrassed year after year about the failing state of the industry, and it is only getting worse. I rarely tell people what my main profession is anymore (I'm just sick of the complaints); more often I'll talk about photography or writing.

Still, someday software too will grow out of its current malaise. With all of this new hardware, we'll need to tackle two very complex problems: a) making distributed things simple and reliable, and b) upgrading to new versions of software.

Neither problem seems difficult, but neither works right now.

Our paradigms for building software are mostly based around mammoth, single, deterministic architectures. Our current technologies are intricate, overly complex and easy to misuse. We've yet to find an abstraction that makes distributed computing trivial.

And as long as it is non-trivial, it will always contain a significant number of bugs and be difficult to trust. This problem will only grow as the hardware becomes more and more distributed.

Our models of software development are still based on big-design-upfront projects (even if we do release them more often, in less complete versions), and thus we have no idea how to release a multitude of small, ever-changing pieces without accruing so much technical debt that the system becomes mostly unusable.

Common approaches to backward compatibility and version upgrades are heavily dependent on human effort, and thus are costly and frequently prone to failure.

If machines get more distributed, in smaller parts, the upgrade problems we are having trouble with now are going to get exponentially worse. More pieces means more dependency failures.

The hardware will quickly become an interoperability nightmare: a sea of strange, undesirable errors and odd behaviors. It is bad now, and it will get even uglier.

If not, it will only be because we've simply stopped upgrading the code, and many of the new advances in hardware will be lost for generations. Either way, it will be software problems holding us back.

So it seems that the future is going to provide us with more ubiquitous computing, where machines are getting smaller, more specialized, less feature-laden and more embedded in their environment. Once software breaks free of its current dark age, we'll be quietly surrounded by a vast array of different machines, all interacting with each other in order to help us manage our lives and our world.

I can imagine someday just grabbing a 'pad' someplace and telling it to give me the grocery shopping list, which will just pop up at a little printer by the doorway as I leave. The items will all be cross-referenced against the contents of the fridge and cabinets, and my wife and I will both have our 'favorites' represented. Our food budget will have been consulted, as well as our schedules. Mostly the heavy bulk stuff will already have been delivered by the weekly run, but because I'm more traditional, I'll still prefer to visit the store and pick out the fresh fruit and veggies. This week, "fresh carrots and broccoli" from the little farmer's market down the street. Nice.