Start with some data that you want the system to capture.
Where does this data come from? Data is usually entered by people or generated by some type of machine.
Most data is composite. Break it down into its subparts. It is fully decomposed when each subpart is easily representable by a programming language primitive. Try to stick to a portable subset of primitives like strings, integers, and floating point numbers. Converting data is expensive and dangerous.
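As a minimal sketch of what full decomposition looks like (the entity and field names here are hypothetical, not from any particular system), a mailing address might break down into portable primitives like this:

```python
from dataclasses import dataclass

# Hypothetical example: a mailing address decomposed until every subpart is a
# portable primitive. Note that several "numbers" stay as strings on purpose.
@dataclass
class MailingAddress:
    street_number: str   # "221B" is not an integer
    street_name: str
    city: str
    region: str          # province/state; ISO 3166-2 codes where possible
    postal_code: str     # leading zeros and letters rule out integers
    country_code: str    # ISO 3166-1 alpha-2, e.g. "CA"
```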
Are any of these subparts international or common standards? Do they have any known, necessary properties, constraints, restrictions or conventions? Do some research to see how other people in the same industry represent these types of values, and how other programmers have represented them.
Now, investigate usage. How many of these values will the system store? How often are they generated? Are there any weird rules for frequency or generation? Sometimes close estimates of frequency are unavailable, but it’s always possible to get a rough guess or something like a Fermi estimate. For some systems, the data frequency seen initially will differ quite a bit from the real-world frequency seen later in their lifetime. Note this, and take it into account later, particularly by not implementing optimizations until they are necessary.
Once the data is created, does it ever need to change? Who can change it? Is the list of changes itself significant data too? Are there security concerns? Can anyone see the data? Can anyone change it? Are there data quality concerns? What are the chances that the initial version of the data is wrong? Does this require some extended form of auditing? If so, what is the frequency for the audit entities themselves? Is the auditing fully recursive?
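When the list of changes does turn out to be significant data, it becomes an entity of its own. A rough sketch, with hypothetical names, of what such an audit entity might capture:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical audit entity: each change is itself data, with its own
# generation frequency, keys, and access restrictions.
@dataclass
class ChangeRecord:
    entity_type: str      # which kind of entity was changed
    entity_key: str       # key of the changed entity
    field_name: str       # which subpart changed
    old_value: str
    new_value: str
    changed_by: str       # the user who made the change
    changed_at: datetime
```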
For some of the subparts, there may not be a one-to-one relationship. Sometimes the main data, often called an ‘entity’, is associated with many similar subparts. Break this out into its own entity. Break any one-to-many, many-to-one or even many-to-many subparts into separate entities. During implementation, it may not make sense for the initial version of the system to treat the subparts as separate entities, but for data modeling, they should always be treated as such. The model captures the developers’ understanding of the data as it gets created and used; the implementation may choose to optimize that if necessary. The two viewpoints are related, but should not be intermixed.
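As a hypothetical illustration, a customer with many phone numbers is modeled as two entities related by a key, not as one entity with repeated fields:

```python
from dataclasses import dataclass

# One-to-many: the phone numbers are a separate entity related back to the
# customer by its key, rather than columns like phone1, phone2, ...
@dataclass
class Customer:
    customer_id: int
    name: str

@dataclass
class CustomerPhoneNumber:
    customer_id: int   # relationship back to the owning Customer
    phone_type: str    # e.g. "home", "mobile", "work"
    number: str
```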
For some data, there may be an interrelationship between entities of the same kind. Capture this as well. These relationships span the expressibility of data structures, so they include single entities, lists, trees, DAGs, graphs, and hypergraphs. There may be other structural arrangements as well, but most of these can be decomposed into the above set. These interrelationships sometimes exist externally, in which case they are themselves another entity and should be treated as such. Sometimes they are confused with external indexing intended to support search functionality; that too is a different issue, and again more entities. Only real structural interrelationships should be captured this way. Most entities do not need this.
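A common case of this, again with hypothetical names, is a tree of entities of the same kind captured through a reference to the parent:

```python
from dataclasses import dataclass
from typing import Optional

# A self-referential structural relationship: departments form a tree.
# If the relationship were managed externally, it would be its own entity.
@dataclass
class Department:
    department_id: int
    name: str
    parent_department_id: Optional[int]  # None for the root of the tree
```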
What makes each entity unique? Is there a key or a composite set of values that is unique? Are there multiple keys? If so, can they conflict with each other? Spend a lot of time understanding the keys. Poorly keyed data causes huge problems that are hard to fix. Almost all entities need to be unique, so there is almost always at least one composite key, sometimes quite a few. Key mappings can be a huge problem.
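A small sketch of a composite key (the names are illustrative): a course offering that is only unique by several values taken together:

```python
from dataclasses import dataclass

# Composite key: no single field is unique, but the combination is.
# If a second key exists (say, an external registrar code), it must be
# checked for conflicts with this one.
@dataclass(frozen=True)
class CourseOfferingKey:
    course_code: str   # e.g. "MATH101"
    term: str          # e.g. "2024-FALL"
    section: int

offerings: dict[CourseOfferingKey, str] = {}
offerings[CourseOfferingKey("MATH101", "2024-FALL", 1)] = "Introduction to Calculus"
```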
While working with one type of entity, a whole bunch more will be created. Each of these new entities needs the same analysis treatment as the original. As each one is understood, it can be added to the model. The model should grow fairly large for non-trivial systems. Abstraction can combine sets of entities with the same structural arrangement, but the resulting abstract entities should not be so generic that they can no longer properly constrain the data. A model is only useful if it accurately holds the data and prevents invalid data from being held.
Be very careful about naming and definitions. The names need to match the expected usage in both the target domain and computer science. Sometimes it takes a while to figure out the correct name; this is normal. Misnaming data shows a lack of understanding, and often causes bugs and confusion later. Spend a lot of time on names. Do research. They are hard to change later. They need to be accurate.
Don’t try to be clever or imaginative. The system only adds value by capturing data from the real world, so the answers to most questions are just lying around, out there, in the real world. There is plenty of history for building up knowledge and categorizations; leverage that work. Conflicts between the model and reality should be resolved by fixing the model. This work only has value if it is detail-oriented and correct; otherwise, it will become one of the sources of the problem, not part of the solution.
There are forms of optimizations that require injecting abstract data into the model, but those types of enhancements should be added later. They are implementation details, not data modeling.
Some types of data are constrained by a fixed or limited set of values. These sets are domain-based. Try to find existing standards for them. Put them into their own entities, and expect them to change over time. Do some analysis to figure out how often they are expected to change and who is expected to change them. Anyone involved in running, using or administering a system is a user, and not all users are end-users.
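One way to sketch this, with hypothetical names, is to keep the allowed values as their own data rather than hard-coding them, since the set is expected to change:

```python
from dataclasses import dataclass

# A domain-based value set kept as its own entity, so the people running the
# system can extend it over time without a code change.
@dataclass
class CurrencyCode:
    code: str        # ISO 4217, e.g. "USD", "EUR"
    description: str
    active: bool     # values get retired, not deleted

allowed_currencies = {
    "USD": CurrencyCode("USD", "US dollar", True),
    "EUR": CurrencyCode("EUR", "Euro", True),
}

def is_valid_currency(code: str) -> bool:
    entry = allowed_currencies.get(code)
    return entry is not None and entry.active
```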
As this work progresses, it will build up a large collection of entities. Groups of these entities will be tightly related. These groups draw architectural lines within the system.
Now, look at how people will use this collection of data. Do they need access to it quickly? Is it huge, and what types of navigation will users need to find what they are looking for? Is searching slow? Does it need some form of indexing optimization to make it usable? Will the system build up data really quickly, consistently, or very slowly? Do different parts of the system have very different access requirements? Is there data in the system that is creativity-based or experimental? Some systems need tools to play around with the data and are subject to lots of little experimental changes. Some systems need the data to remain relatively static and only require changes to fix quality issues.
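When searching does turn out to be slow, the indexing optimization sits beside the model as an implementation detail. A minimal sketch, with hypothetical names, of a secondary index built to speed up one search path:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: int
    name: str
    city: str

customers = [
    Customer(1, "Ada", "Toronto"),
    Customer(2, "Grace", "Toronto"),
    Customer(3, "Edsger", "Ottawa"),
]

# A secondary index built to make a slow search usable; it supports the
# model but is not part of the data model itself.
by_city: dict[str, list[Customer]] = defaultdict(list)
for c in customers:
    by_city[c.city].append(c)

print([c.name for c in by_city["Toronto"]])  # ['Ada', 'Grace']
```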
Will different users need the same data at the same time? Will different users edit the same data at the same time? How do changes to one entity affect others?
For large data, summaries are often important to present overviews. For each entity, what types of summaries are necessary? Are these their own entities? How often do they change, and who can regenerate them? Are there user-specific categorizations that would be helpful in crafting useful summaries? Can these be shared? Are they entities as well? Can this reporting part of the system be fully integrated into the main model so that it stays in sync with any enhancements to the domain model itself?
Can the summary data be computed on-the-fly, or is it expensive enough to be regenerated only periodically? What is the precise cost of computations? Are there industry standards, existing libraries, etc. that can be used?
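A rough sketch of the second case, with hypothetical names: a summary stored as its own entity and regenerated only when it has become stale, because recomputing it on every view would be too expensive:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# A stored summary entity, regenerated periodically rather than computed
# on-the-fly for every view.
@dataclass
class DailySalesSummary:
    day: str               # e.g. "2024-06-01"
    total_orders: int
    total_amount: float
    generated_at: datetime

def is_stale(summary: DailySalesSummary, max_age: timedelta) -> bool:
    # Decide whether the periodic regeneration is due.
    return datetime.now() - summary.generated_at > max_age
```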
The answers to some of the above questions, such as how quickly the data needs to be viewed after changes and how many users need to view it, will create optimization requirements. The collection of these requirements will dictate the hardware, the architecture and any dependent technologies. Often there will be multiple possible solutions, so regional programming resources and current programming fads will pin down a precise implementation.
The work of modeling the data, since it is small in comparison to the implementation work, should extend beyond the initial short-term goals of the development. It doesn’t have to fully cover all scope and all detail of the given domain, but it should at least extend out to the next couple of expected iterations in the development cycle. This makes it possible to pipeline future extensions of the system first through modeling, then through the actual implementation, so that the current coding work is at least one generation behind the current modeling work. Dramatic, unexpected pivots in the development will, of course, disrupt this cycle, but the frequency of these should diminish rapidly in the early days of development (or the project is already doomed for non-technical reasons).
A full data model then includes all of the entities, their internal and external relationships, all subparts that are ‘typed’, and all of the expected computations and necessary optimizations. Follow-up extensions should be highlighted as changes from the previous version. The version changes should match the code implementations (development cycles). The structure of any entity groups should lay out a high-level architecture, with further architectural constraints driven by the optimizations and possibly the arrangement of the development teams themselves.
The data model should include any of the analyst’s notes, and any issues with standards, ambiguities, conventions, and keys. The full document can then be used to produce a high-level design and a number of necessary mid-level and low-level designs needed to distribute the work to the development teams.
Depending on the haste involved in doing this work, it is possible that there are flaws in the data model. These come in different types: a) the model does not reflect the real world, b) it has missing elements, or c) the model differs from an incorrect model used by an upstream or downstream system. If the model is wrong or incomplete, it should be updated, and that update should be pushed through design and implementation as a necessary change to the core of the system. It is the same process as extending the system. If the model correctly differs from some earlier mistake in another system, an appendix should be added that maps that mistake back to the model and outlines the consequences of that mapping. That mapping should be implemented as a separate component (so that it can be removed later) and enabled through configuration.
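One possible shape for that separate mapping component (everything here is hypothetical): a small translation step, enabled through configuration, that converts the other system’s incorrect representation back into the model and can be deleted once the mistake is fixed upstream:

```python
# Hypothetical mapping component: translates an upstream system's incorrect
# values back into the model. It is enabled only through configuration so it
# can be removed cleanly once the upstream mistake is corrected.
UPSTREAM_FIXUP_ENABLED = True  # would normally be read from a config file

# Example mistake: the upstream system sends ISO 3166-1 alpha-3 codes where
# the model expects alpha-2.
UPSTREAM_COUNTRY_FIXES = {"GBR": "GB", "USA": "US"}

def map_upstream_country(code: str) -> str:
    if UPSTREAM_FIXUP_ENABLED:
        return UPSTREAM_COUNTRY_FIXES.get(code, code)
    return code
```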
For most systems, spending the time to properly understand and model the data will lay out the bulk of the architecture and coding. Building systems this way will produce better quality, reduce the development time and cause far fewer operational issues. Done correctly, this approach also lays out a long-term means of extending the system without degrading it.
Hi Paul,
Is there a book you can recommend that discusses the same topic?
Thanks.
I haven't come across any books that put it all together, but understanding a lot more about relational databases, normal forms, entity-relationship diagrams, etc. would help. Many of my other posts discuss different perspectives on this as well.
Can you give a good example of an open source project that is designed this way? Your posts seem to imply data-oriented design, which is more commonly used in the gaming industry.
Actually, my post is more oriented towards big proprietary enterprise systems, where the data models are huge and often quite vague. In one company, they had at least 5 different systems to handle similar data, with 5 very different models, and all of the problems one would expect when trying to synchronize the flow between them.
I've never worked in gaming, but for the most part open source projects tend towards tackling technical problems, which are more fun to write and aimed at larger audiences. Most very domain-specific code is usually proprietary (since it is written for companies and the people working on it need to pay their bills) and usually has very, very large models (since it tends to cover broader functionality that spans large organizations).