Data modeling was once a very stable discipline. It focused on relationally organized data and modeled that data as entities, relationships, and attributes. I was a good data modeler in the 1980s and
at the top of the game in the early ‘90s. Entities and relationships were easy things, and the interesting stuff came in the form of patterns, abstraction, and specialization. Good data models
were easy with five simple principles:
- Represent the real world – model the data to look like the business.
- Model relationships and cardinality to represent business rules.
- Abstract for flexibility.
- Specialize when roles or states are important.
- Normalize, normalize, normalize …
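To make the last of those principles concrete, here is a minimal sketch in Python (the order and customer fields are hypothetical, not drawn from any particular system): normalizing a flat order record into third normal form by moving customer attributes into their own entity.

```python
# A denormalized order record repeats customer attributes on every row.
orders_flat = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Acme", "city": "Denver", "total": 250.0},
    {"order_id": 2, "customer_id": 10, "customer_name": "Acme", "city": "Denver", "total": 75.0},
]

# Third normal form: every non-key attribute depends on its own entity's key,
# so customer attributes move to a customer entity, stored once per customer.
customers = {}
orders = []
for row in orders_flat:
    customers[row["customer_id"]] = {"name": row["customer_name"], "city": row["city"]}
    orders.append({"order_id": row["order_id"],
                   "customer_id": row["customer_id"],
                   "total": row["total"]})
```

Each customer now appears exactly once, and updating a customer’s city touches a single row – the classic payoff of normalization.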
I came to believe that there was nothing new to be learned about data modeling. We had it all figured out – a true engineering discipline with clear rules and little ambiguity.
Since that time there have been a few bumps along the road. The first was data warehousing, where I learned to rethink normalization. Like many of the Codd and Date generation, I treated third normal form like the eleventh commandment … sacred and carved in stone. You can imagine what a problem I was as a modeler on my first data warehousing project.
So I changed my thinking about normalization, and I learned the principles of purposeful normalization. Then came another little ripple – dimensional data – to shake my data modeling
world. But it was really just a bit of turbulence. Soon I learned that dimensional models are really just relational models with some additional constraints. I extended my five simple principles.
They became eight principles to accommodate both relational and dimensional data:
- Represent the real world by modeling the facts to look like business measures.
- Represent the real world by modeling dimensions to look like business hierarchies.
- Denormalize, denormalize, denormalize … (for star schema design).
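A rough sketch of what that denormalization looks like in a star schema, in Python (the fact and dimension contents are invented for illustration): measures live in a fact table, and each dimension row flattens a business hierarchy so analytic roll-ups need no joins up the hierarchy.

```python
# A minimal star schema: a fact table of business measures keyed to a
# denormalized date dimension that flattens the day/month/quarter/year hierarchy.
date_dim = {
    20240105: {"day": 5, "month": "Jan", "quarter": "Q1", "year": 2024},
    20240212: {"day": 12, "month": "Feb", "quarter": "Q1", "year": 2024},
}
sales_fact = [
    {"date_key": 20240105, "product_key": 7, "amount": 120.0},
    {"date_key": 20240212, "product_key": 7, "amount": 80.0},
]

# Rolling up by quarter reads the flattened hierarchy straight off the dimension row.
by_quarter = {}
for fact in sales_fact:
    quarter = date_dim[fact["date_key"]]["quarter"]
    by_quarter[quarter] = by_quarter.get(quarter, 0.0) + fact["amount"]
```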
By the mid-1990s the dust had settled, and I was once again comfortable and confident in the stable and structured world of data modeling. But comfort leads to complacency, and complacency to
surprises. After nearly fifteen years of relative quiet in the world of data modeling, I looked around recently to discover that much has changed. Starting with dimensional data, which today is
widely used, we’ve seen many changes in the organization of structured data – columnar and correlation databases, for example. Beyond structured data we find new challenges in modeling unstructured data such as text, images, voice, and video. Then there are the highly specialized data structures that challenge modelers to integrate them with mainstream corporate data: clickstream and web analytics data from e-commerce applications, geospatial data, RFID tagging and location data, and more.
Of course, even the world of OLTP data has changed for most of us. We no longer have the freedom to design optimal databases for transaction processing. Instead we seek to understand and unravel the
hidden, complex, and obscure data structures of commercial software and ERP databases.
So what does all of this mean to data modelers? Quite simply, it means that eight simple principles don’t work anymore. Let’s look at just a few examples of how those principles break
down with new data management challenges:
Columnar Databases: Modeling the data to look like the business doesn’t work. Nor can I model it to look like business measures or business hierarchies.
These are analytic databases. I need to model the data to look like the questions.
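One way to picture the difference, as a minimal Python sketch (the region and amount data are hypothetical): a column store pivots records into per-attribute arrays, so an analytic question touches only the columns it actually asks about.

```python
# Row store: each record kept together. A column store pivots the same data
# so each attribute becomes a contiguous array, which favors analytic scans.
rows = [
    {"region": "East", "amount": 100.0},
    {"region": "West", "amount": 40.0},
    {"region": "East", "amount": 60.0},
]
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An analytic question ("total amount in the East") reads just two columns,
# never the full records.
total_east = sum(amount
                 for region, amount in zip(columns["region"], columns["amount"])
                 if region == "East")
```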
Correlation Databases: A correlation database is, by definition, data-model independent. Correlation DBMS architect Joe Foley argues that data models are unnecessary here: value-based storage, load-generated indexing, and metadata make them irrelevant. For data modelers this raises some hard questions: Are we ready to advocate un-modeled data? What does that mean for data governance and data stewardship? How does it fit into data management programs and processes? If this is value-based storage, do I model the data to look like the values? Do I model the data at all?
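As I understand the value-based storage idea, it can be sketched roughly like this in Python (the records and index structure are my own simplification, not the actual DBMS design): each distinct value is stored once, and load-time indexing records every position where it occurs.

```python
# Value-based storage, simplified: every distinct value gets a single entry,
# with a load-generated index of each (row, column) position where it appears.
rows = [
    {"name": "Smith", "city": "Denver"},
    {"name": "Jones", "city": "Denver"},
]
value_index = {}
for row_id, row in enumerate(rows):
    for column, value in row.items():
        value_index.setdefault(value, []).append((row_id, column))

# Any value can now be found without a schema-driven model:
# value_index["Denver"] lists every place "Denver" occurs.
```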
Unstructured Data: Once easily dismissed as binary large objects (BLOBs), text, images, voice, and video now represent the majority of corporate data. Some studies indicate that unstructured data makes up as much as 80% of the overall data resource. Content management, business intelligence, knowledge management, and similar applications increase the need to integrate rather than merely append unstructured data. The “BLOB” approach works to append but not to integrate. Again, new questions arise for data modelers: How to distinguish content from context? How to distinguish unstructured from semi-structured data? When to impose structure on unstructured data? How to distinguish data from metadata – for example, is tagging data or metadata? Where do semantic structures fit? What about lexical structures? Do I model the data to fit a taxonomy, or do I model it ontologically?
So many changes in the data management world – columnar, correlation, unstructured, clickstream, geospatial, RFID, ERP, and likely more on the horizon – mean that a command of entity-relationship modeling simply isn’t enough to do the job today. The old foundations are shaky, and the new questions are many and varied. Every data modeler, no matter how experienced, needs new skills – and has the opportunity to find new challenges and new rewards in the work.