The Programmer's Paradox: Data Vaults

"Insanity: doing the same thing over and over again and expecting different results."

-- Albert Einstein

We build massive monolithic systems, trapping them in vertical silos. Many of these systems are so deeply embedded into our organizations that these giant reservoirs of potentially useful data never get utilized. While we've learned to collect data, we fail miserably at being able to use it, particularly with larger volumes and/or complicated data structures.

Given these problems, I have an idea for turning the solution on its head. If we're not getting real value out of every organization collecting their own huge repositories, then perhaps we need to change this?

We'll start with a very simple concept. Each and every user has their own unique little 'data vault'. It is a safe place where they can store some data, and allow others to access it.

The technology would be simple: all of the data in the vault is application-independent. That is, it is constructed in a manner that ensures that it isn't implicitly reliant on some application code. Everything needed to properly interpret the data is in the data.
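As a minimal sketch of what "everything needed to interpret the data is in the data" might look like, here is a self-describing record in Python. The schema label, field names and metadata keys are assumptions made up for illustration; they are not part of any vault standard.

```python
import json

# Hypothetical self-describing record: every field carries its own type and
# any unit/format hints, so a reader needs no application code to interpret it.
record = {
    "schema": "example/customer-contact",  # illustrative label, not a real standard
    "fields": {
        "contacted_at": {"type": "timestamp", "format": "ISO-8601",
                         "value": "2010-04-11T14:30:00Z"},
        "duration": {"type": "integer", "unit": "seconds", "value": 420},
        "summary": {"type": "text", "language": "en",
                    "value": "Asked about a late delivery."},
    },
}


def describe(rec):
    """Render a record using only the metadata stored inside it."""
    lines = []
    for name, field in sorted(rec["fields"].items()):
        extras = ", ".join(
            f"{key}={val}"
            for key, val in sorted(field.items())
            if key not in ("type", "value")
        )
        lines.append(f"{name} ({field['type']}; {extras}): {field['value']}")
    return lines


# Round-trip through plain JSON to show nothing depends on in-memory objects.
restored = json.loads(json.dumps(record))
for line in describe(restored):
    print(line)
```

Any application that can parse the container format can display or process the record, because the types and units travel with the values.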

A vault then, would be this place where a large number of outside systems could deposit this universally structured data and then read it back later.

For most of the data, the vault's owner could likely delete it if they wanted, but would be unable to change it in any way. This immutable nature helps to assure organizations that the vaulted data is not being used in some inappropriate manner against them. If it exists, it has not been altered. It is for the organization's comfort.

Each and every vault would explicitly audit all reads and writes to its contents. That is, anyone and everyone who ever had access to the vault would be added to an internal permanent record.
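The combination of "delete but never edit" and "audit everything" can be sketched in a few lines of Python. This is only an in-memory illustration under my own assumptions: the class name, method names and the audit tuple layout are hypothetical, and a real vault would need durable, tamper-evident storage.

```python
from datetime import datetime, timezone
from types import MappingProxyType


class Vault:
    """Toy vault: records are read-only once written, and every access is logged."""

    def __init__(self, owner):
        self.owner = owner
        self._records = {}
        self._audit = []      # append-only list of (time, user, action, record_id)
        self._next_id = 1

    def _log(self, user, action, record_id):
        self._audit.append((datetime.now(timezone.utc), user, action, record_id))

    def write(self, user, data):
        record_id = self._next_id
        self._next_id += 1
        # Store a read-only view over a private copy, so neither the caller's
        # original dict nor the returned record can be used to alter the data.
        self._records[record_id] = MappingProxyType(dict(data))
        self._log(user, "write", record_id)
        return record_id

    def read(self, user, record_id):
        self._log(user, "read", record_id)  # the attempt is logged even if it fails
        return self._records[record_id]

    def delete(self, user, record_id):
        if user != self.owner:
            self._log(user, "delete-denied", record_id)
            raise PermissionError("only the vault owner may delete")
        del self._records[record_id]
        self._log(user, "delete", record_id)

    def audit_trail(self):
        return tuple(self._audit)
```

Note that there is deliberately no update method at all: the only ways to change the vault's contents are to add a new record or, for the owner, to remove one.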

To make this effective, for every system that utilized the data, every user would have to be declared, and no 'shared accounts' or 'temporary users' would be acceptable. Organizations would be held legally accountable for ensuring the integrity of their usage.

The vault hosting sites would also be registered, and would be held accountable to some very strict access rules. They would have to make sure, for example, that access to any backups was not possible without clearance, and that the basic access and editing security rules were being followed to the letter. They would be legally accountable for breaking any of these.

The security around the vault may be sounding a bit Draconian, but in concept we are moving the trust from within organizations out to some outside party. Without a well-entrenched security model, the necessary trust to make these ideas work could not be established. The organization has to trust the host, and the customer also has to trust both of them.

So, now what we have is a mechanism for any organization to save specific customer data into a customer-defined location. In some cases the customer can delete the data, but in all cases they cannot edit it. Each and every organization that is capable of altering the vault is registered, along with an up-to-date list of all employees with access. Access can be given, but it will be tracked. Anything that happens to the vault is recorded permanently.

Now to make this more interesting, an organization can write some data into the customer's vault and then retrieve it later. A simple scenario for this would be customer-relationship management (CRM). Each and every time the customer talks with the organization, they would start by identifying their vault, and then the details of the conversation would be stored in it. Organizations could 'personalize' their services, but not have to store all of this information locally. They could save huge costs on running their own local complex IT systems, instead just concentrating on using a simpler IT infrastructure to maintain just the core systems.

For an organization storing data, there would obviously be some concern over rival organizations utilizing their efforts, so one of their storage choices would be 'private'. Only the depositors of the data and the customers could see the underlying data.
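The 'private' rule is small enough to state as code. In this hedged Python sketch, only 'private' comes from the text above; the 'shared' label, the class and the function name are my own placeholders for whatever other storage choices might exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deposit:
    depositor: str   # organization that wrote the data
    visibility: str  # "private", or a hypothetical "shared" alternative
    payload: str


def can_read(deposit, reader, vault_owner):
    """Private deposits are visible only to the depositor and the customer."""
    if reader in (vault_owner, deposit.depositor):
        return True
    return deposit.visibility != "private"


crm_note = Deposit(depositor="dell", visibility="private", payload="support call notes")
print(can_read(crm_note, "dell", "alice"))    # depositor
print(can_read(crm_note, "alice", "alice"))   # customer
print(can_read(crm_note, "rival", "alice"))   # rival organization
```

The rival organization is refused while both the depositor and the customer get through, which is the whole point of the 'private' choice.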

For customers, some would prefer to have a context set over their conversations, so the vault would be really handy. Others, no doubt, would prefer to have no one retain any information, so they could be free to visit the vault and eliminate its contents. In fact they could just choose to not mention a vault id at all.

Another interesting aspect is that, just as with email accounts, a single customer could easily have multiple vaults. In that case they may choose which vaults to present to which companies, allowing some to have closer relationships, while not trusting others.

Not all vaults are created equal. That is, since it is a computer system, some vault hosts may choose to offer premium services: larger storage or faster access, better UIs, better backup, backups in other regions, better security, archiving, etc. Data vaulting would commoditize really quickly, with different organizations competing heavily over an ever increasing territory. There'll always be more data to collect, and it will always be growing.

While storage space is generally cheap, there may be applications for vaults that could eat up lots of storage. Customers aren't going to want to pay the full cost of their storage, in fact most aren't going to want to pay any costs at all. To make this work, each and every time an organization writes to the vault, their costs help subsidize extra space for the customers.

That is, if DELL stores 100K worth of client interactions into someone's vault, they are implicitly paying for 200K worth of space. The customer gets that extra storage for free. Although DELL has paid for more space, the overall costs of the data vault would still be significantly less than they are currently paying for their siloed systems, so it would represent a price cut to them. Also, some of the marketing budget could be redirected for silo 'promotions'. The vault industry itself would be priced very low, because there would be significant competition.
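The arithmetic behind that example is trivial, but writing it down makes the subsidy explicit. This Python sketch assumes a fixed two-for-one factor taken from the DELL figures and treats "100K" as 100,000 bytes; both the factor being constant and the function name are my assumptions.

```python
SUBSIDY_FACTOR = 2  # from the example: 100K stored, 200K paid for


def paid_and_free_space(bytes_written, factor=SUBSIDY_FACTOR):
    """Return (space the organization pays for, space left free for the customer)."""
    paid = bytes_written * factor
    free_for_customer = paid - bytes_written
    return paid, free_for_customer


paid, free = paid_and_free_space(100_000)
print(f"organization pays for {paid} bytes; customer gains {free} bytes")
```

With a factor of two, every byte an organization deposits earns the customer one byte of their own.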

There would also be a lot of other ways the customer could avoid paying bills, research being a big one, governments being another.

Certified research companies could pay for access to anonymized vault information. Since it is the vault itself that is anonymizing the access, the security is maintained and access behavior is consistent. The data could be truly anonymized and still be hugely useful for research.
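Here is a rough Python sketch of the vault doing the anonymizing itself, so that researchers never see the vault id. Everything named here is an assumption for illustration: the list of identifying fields, the function name, and the choice of a keyed hash. Strictly speaking a keyed hash only gives a stable pseudonym, and dropping a few named fields is not enough for true anonymity; a real scheme would need considerably more care.

```python
import hashlib
import hmac

IDENTIFYING_FIELDS = {"name", "email", "address"}  # hypothetical list


def anonymized_view(vault_id, records, host_secret):
    """Return the records with identifying fields removed and the vault id
    replaced by a pseudonym that only the vault host can reproduce."""
    token = hmac.new(host_secret, vault_id.encode("utf-8"), hashlib.sha256)
    cleaned = [
        {key: val for key, val in rec.items() if key not in IDENTIFYING_FIELDS}
        for rec in records
    ]
    # Truncated to 16 hex characters purely for readability.
    return {"subject": token.hexdigest()[:16], "records": cleaned}


records = [{"name": "Alice", "visit": "2010-04-11", "diagnosis_code": "J10"}]
view = anonymized_view("vault-123", records, host_secret=b"known-only-to-the-host")
print(view["records"])
```

Because the same vault always maps to the same subject token, researchers can still follow one person's data over time without learning who that person is.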

Every vault used in the research could be rewarded with some extra space or some other payment option. Organizations get high quality data for their research, while the users get the comfort of knowing their data can't be misused.

Beyond customer relations, purchasing and research, I can see an even bigger market for the vaults. Healthcare is struggling with the digital age, and the vaulting system is the perfect answer to these problems.

Healthcare providers can use the vault as the official electronic medical record (EMR). All of the data from all of the different doctors, diagnostics, treatments and other interactions can go into the vaults like any other organization.

The vaults can hold a limitless amount of data, so in the special cases where the medical relationship has been long and difficult, any enhanced vaulting requirements for massive data can be handled specially. Large or complex treatments may generate large data that needs to be distributed.

For most customers however, handling their medical data in a vault would be fairly trivial. To their benefit, they would get to see their data, review it, and possibly get some subsidized vault space out of the arrangement.

For people without home computer access, Internet cafes and public libraries would provide a way to see their data.

Now, an interesting issue would be to allow users to move their vaults around, and to try to provide some federated access for users with multiple vaults, if they were too large and needed to be split.

So, it makes sense at the higher level to have a simple 'redirection' layer applied to the technology.

Keeping it simple, vault ids would be analogous to URLs. The vault locations themselves would be analogous to IP addresses. This would allow a distributed DNS-like technology to supply and cache the specific vault location, independent of the customer's id. Further, because it would be almost identical to existing DNS technology, caching would provide fast access for frequent clients.
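To make the DNS analogy concrete, here is a small Python sketch of a resolver that maps a vault id to its current location and caches the answer for a while. The id format, the host names, the time-to-live and the dictionary standing in for the authoritative lookup service are all hypothetical.

```python
import time


class VaultResolver:
    """Resolve vault ids to locations, caching each answer for ttl_seconds."""

    def __init__(self, registry, ttl_seconds=300.0, clock=time.monotonic):
        self._registry = registry  # stand-in for an authoritative lookup service
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # vault_id -> (location, expiry time)
        self.lookups = 0           # how often the registry was actually consulted

    def resolve(self, vault_id):
        now = self._clock()
        cached = self._cache.get(vault_id)
        if cached is not None and cached[1] > now:
            return cached[0]
        self.lookups += 1
        location = self._registry[vault_id]
        self._cache[vault_id] = (location, now + self._ttl)
        return location


registry = {"vault://alice.example": "host-a.example:7000"}
resolver = VaultResolver(registry, ttl_seconds=300)
print(resolver.resolve("vault://alice.example"))
print(resolver.resolve("vault://alice.example"))  # served from the cache
print(resolver.lookups)
```

The trade-off is the same as with DNS: a longer time-to-live means faster access for frequent clients, but a moved vault takes longer to be noticed.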

Then a 'vault' could be spread over multiple separate locations. Locations could be moved around between different machines. The whole indexing mechanism would provide a simple layer over the top to make things more flexible. It would be possible for a customer to easily switch their vault handling from one host to another.

Because of the nature of some of the data, all vault providers would have to be registered, and follow some strict basic rules. The software they would be using would itself be fixed by standards, but there would probably be a small number of different distributions. Different software companies will implement their vaults in slightly different manners. Hosting companies could pick and choose the best software (or write their own).

One thing that needs to be included into the vaulting standard is a large number of software libraries used for vault access from the various different organizations. There should be a consistent/common interface for all of the applications that need access to vault data. Basically, given a vault id, the return from a vault access library should be the correct 'data model'. Partial models and paging support would be necessary.
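A common access library with paging could be as simple as the following Python sketch. The names (Page, fetch_page, iter_records), the integer cursor and the in-memory dictionary standing in for a vault host are all assumptions for illustration, not a proposed standard.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class Page:
    items: List[dict]
    next_cursor: Optional[int]  # None once the last page has been returned


def fetch_page(store, vault_id, cursor=0, page_size=2):
    """Return one slice of a vault's records plus a cursor for the next slice."""
    records = store[vault_id]
    chunk = records[cursor:cursor + page_size]
    end = cursor + len(chunk)
    return Page(items=chunk, next_cursor=end if end < len(records) else None)


def iter_records(store, vault_id, page_size=2) -> Iterator[dict]:
    """Walk the whole vault one page at a time, never holding it all at once."""
    cursor = 0
    while cursor is not None:
        page = fetch_page(store, vault_id, cursor, page_size)
        yield from page.items
        cursor = page.next_cursor


store = {"vault-123": [{"n": i} for i in range(5)]}  # toy stand-in for a host
first = fetch_page(store, "vault-123")
print(first.items, first.next_cursor)
print(sum(1 for _ in iter_records(store, "vault-123")))
```

An application that only needs a partial model asks for a page or two; one that needs everything iterates, and the library hides how many round trips that takes.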

Once the programmers have access to the model, the code would be very strict in allowing them to add or delete data. This isn't a restriction intended to annoy other programmers, but it is based on the observation that the vaults would quickly get filled with junk data if there is little control over what is stored. A free-for-all, given our current software development industry, would be a disaster, but the data still needs to be dynamic in structure.

Relational databases would likely be bad underlying technologies for this type of data model because they can't handle dynamic data properly. A slightly new storage model would be needed to get the necessary balance between flexibility and structure. I've got some ideas on how this could be worked out. It's not complicated, but it is the key to making all of this work properly.

All outside applications would need to add dynamic structured values, but unless there was some clear, simple overriding structure, sharing data, research and data mining would be impossible.

Since this idea was intended to get beyond just collecting data into giant silos, distributed but inaccessible data isn't really a sufficient improvement.

Besides CRM and Healthcare, there are plenty of other uses. People could purchase vaults to store their digital images, or back up their digital music collections. They could store scans of important documents like birth certificates, taxes or other legal documents. They could automatically sync their hard-drive to a vault somewhere. The integrity of the vault system would make it an ideal candidate for handling data replication issues. Extended a little further, other organizations could utilize it as well for their off-site storage backup.

There is no limitation on the type of data that can be stored.

Along with personal documents, vaults could also be the anchor point for a huge number of social systems, allowing different networks to share common related data based on individual agreements between companies. For instance, three social sites could form a common organization that would allow each other to read/write common data. The vault provides neutral ground and a standardized technology so that none of the three gets an unfair advantage.

Customers could wire all of their accounts into vaults. In that way, they could become responsible for managing their own identities. They might have a vault that contains all of the preferences for their favorite and most-accessed systems, and another that 'hides' their identity in some manner (fake user name for example).

There are a huge number of other possibilities as well.

This idea balances a number of different factors: for their own comfort, customers want to control their own data within a system. Organizations need access to that data, but full control of it does not gain them anything. Neutral mechanisms are needed to share data between organizations. Large data needs to be split across hardware. Vaults strike a reasonable trade-off between these different issues.

All of these things become achievable if we break off the storage from the system, and allow third-parties to maintain it. While it is a really simple idea, if we can get the right data representations, a network of data vaults will allow us to shift the way we think of data, and to fix many of the intrinsic problems we currently have with our existing systems. Once free of the silos, we can start to employ our data to enrich our lives, rather than to just make IT operations personnel miserable because they are working so hard to collect stuff, without actually being able to use it for anything.

So, who is willing to invest in building this?

Sunday, April 11, 2010