In this column, I am making the case for Data Centric architectures for enterprises. There is a huge economic advantage to converting to the data centric approach, but curiously few companies are making the transition.
One reason may be that people confuse data centric with data driven, and believe they are already on the road to data centric nirvana when in fact they are nowhere near it.
Data centric refers to an architecture where data is the primary and permanent asset, and applications come and go. In the data centric architecture, the data model precedes the implementation of any given application and will be around and valid long after it is gone.
Many people may think this is what happens now, or what should happen. But it very rarely happens this way. Businesses want functionality, so they purchase or build application systems. Each application system has its own data model, and its code is inextricably tied to that data model. It is extremely difficult to change the data model of an implemented application system, as there may be millions of lines of code dependent on the existing model.
Of course, this application is only one of hundreds or thousands of such systems in an enterprise. Each application on its own has hundreds to thousands of tables and tens of thousands of attributes. These applications are only partially, and precariously, “interfaced” to one another through some middleware that periodically schleps data from one database to another.
The data centric approach turns all this on its head. There is a data model, a semantic data model (more on that in a subsequent white paper), and each bit of application functionality reads and writes through the shared model. If there is application functionality that calculates suggested reorder quantities for widgets, it makes its suggestion and adds it to the shared database using the common core terms. Any other system can access the suggestions and know what they mean. If the reordering functionality goes away tomorrow, the suggestions will still be there.
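To make the widget example concrete, here is a minimal sketch in Python. It is an illustration only, not a reference implementation: the SharedStore class, the "core:" term names, and the reorder rule (order up to twice the reorder point) are all assumptions invented for this example. What it shows is one piece of functionality writing a suggestion using shared terms, and another piece reading it after the first has gone away.

```python
from dataclasses import dataclass, field

# Hypothetical core terms: the shared vocabulary every function agrees on.
ON_HAND_QTY = "core:quantityOnHand"
REORDER_POINT = "core:reorderPoint"
SUGGESTED_REORDER_QTY = "core:suggestedReorderQuantity"


@dataclass
class SharedStore:
    """Toy stand-in for the shared database: (subject, term) -> value."""
    facts: dict = field(default_factory=dict)

    def write(self, subject, term, value):
        self.facts[(subject, term)] = value

    def read(self, subject, term):
        return self.facts.get((subject, term))


def suggest_reorders(store, items, target_multiple=2):
    """Reordering functionality: reads core terms, writes its suggestion back.

    The rule used here is an assumption made purely for illustration.
    """
    for item in items:
        on_hand = store.read(item, ON_HAND_QTY)
        reorder_point = store.read(item, REORDER_POINT)
        if on_hand is None or reorder_point is None:
            continue  # not enough shared data to make a suggestion
        if on_hand < reorder_point:
            quantity = target_multiple * reorder_point - on_hand
            store.write(item, SUGGESTED_REORDER_QTY, quantity)


def purchasing_report(store, items):
    """A separate piece of functionality that knows only the core terms."""
    report = {}
    for item in items:
        quantity = store.read(item, SUGGESTED_REORDER_QTY)
        if quantity is not None:
            report[item] = quantity
    return report


store = SharedStore()
store.write("widget-A", ON_HAND_QTY, 5)
store.write("widget-A", REORDER_POINT, 20)
store.write("widget-B", ON_HAND_QTY, 50)
store.write("widget-B", REORDER_POINT, 20)

widgets = ["widget-A", "widget-B"]
suggest_reorders(store, widgets)
del suggest_reorders  # the reordering functionality "goes away tomorrow"

report = purchasing_report(store, widgets)
print(report)  # {'widget-A': 35}
```

The reporting function never imports or calls the reordering code. The two are coupled only through the shared terms, which is why removing one leaves the other, and the data, intact.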
Many companies now claim to be data driven, far more than claim to be data centric.
But they aren’t the same thing.
In Creating a Data-Driven Organization, Carl Anderson starts off saying, “Data-drivenness is about building tools, abilities, and, most crucially, a culture that acts on data.” This is a very good book, and I think it echoes what most people think of when they think “data driven.” It’s about acquiring and analyzing data to make better decisions.
As our appetites to acquire and analyze more data intensified, “Big Data” emerged.
But acquiring more data isn’t going to make you data centric, and may even make you less data centric. If each dataset you acquire has a different data model, and you just plop them down in a data lake without any attempt to harmonize them, you are getting less and less data centric—even as you become more data driven.
We understand why data lakes are popular now. The traditional data warehouse environment relied on complex ETL (extract, transform, and load) routines to scrub the data and get it all to conform to a predesigned data warehouse schema.
But this process is slow. It is not unusual for it to take weeks or months to incorporate a new data source into the data warehouse environment. The biggest problem is that until the data is normalized and cleansed, it is unavailable for analytics. This is fine for canned analytics, but it is a problem for exploratory analytics. You would like to analyze the data to determine whether it is worth the effort of normalizing, but by the time you can run your analytics, you have already invested the cost of conforming it.
The data lake approach says, “just put all your data in the lake, roughly in the format it was, and the data scientists will take it from here.” This is great for the initial exploration, but the explosion in data sources means that, over time, the data lake will be overrun with inconsistent variation and unnecessary variety.
Adding Data-Centricity to Your Data Driven Organization
It is possible to get the best of both worlds. You can become a data centric and data driven organization.
The key is to have a core model of the concepts in your organization. This core model, or enterprise ontology, can become the organizing principle for your firm.
Let’s say you’re a healthcare enterprise. You’ve acquired data about physicians in addition to data about doctors and nurses, not to mention data about people’s residential addresses. With a core model in place, you will learn over time that physicians, doctors, and nurses are people (really!!), and in addition to attributes they may have related to their person-ness (such as residing at a physical address), they have attributes about their specialties, etc.
The key is that you don’t have to do all this mapping up front, and you can use the model and the data to help you understand what you don’t yet know. You gradually get the data in the data lake cataloged in a way that makes it easier for your analysts to use. You catalog it in such a way that your applications can access it, if need be (as they will be sharing the same core model).
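The incremental cataloging described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real ontology tool: the CORE_MODEL dictionary, the dataset names, the labels, and the choice to map “doctor” to the Physician concept are all hypothetical. The point is that mappings can be added a few at a time, and the model can tell you what has not yet been understood.

```python
# Hypothetical core model: each concept points to its broader concept.
CORE_MODEL = {
    "Person": None,
    "Physician": "Person",
    "Nurse": "Person",
}

# Mappings from (dataset, label) pairs to core concepts, added over time.
mappings = {}


def map_label(dataset, label, concept):
    """Record that a label in one dataset means a given core concept."""
    if concept not in CORE_MODEL:
        raise KeyError(f"{concept!r} is not in the core model")
    mappings[(dataset, label)] = concept


def is_a(concept, ancestor):
    """Walk up the core model to see whether concept is a kind of ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = CORE_MODEL[concept]
    return False


def catalog(lake):
    """Split the lake into what is cataloged so far and what is not yet understood."""
    known, unknown = {}, []
    for entry in lake:
        concept = mappings.get(entry)
        if concept is None:
            unknown.append(entry)
        else:
            known[entry] = concept
    return known, unknown


# Invented (dataset, label) pairs sitting in the data lake.
lake = [
    ("hr_feed", "doctor"),
    ("claims", "physician"),
    ("staffing", "nurse"),
    ("mailing", "resident"),
]

# First pass: map only what is understood today.
map_label("hr_feed", "doctor", "Physician")
map_label("claims", "physician", "Physician")
known_first, unknown_first = catalog(lake)

# Later: one more mapping is learned; nothing already cataloged is redone.
map_label("staffing", "nurse", "Nurse")
known_later, unknown_later = catalog(lake)

# Anything mapped to a kind of Person can be queried as people.
people = sorted(entry for entry, c in known_later.items() if is_a(c, "Person"))
print(unknown_later)  # [('mailing', 'resident')]
print(people)
```

The list of unmapped entries is as useful as the mapped ones: it is a running inventory of what the organization does not yet know about its own lake.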
Skillfully executed, the meaning of the data in your data lake can grow right along with the lake itself and become an asset, rather than a liability.
Data centric and data driven are not synonyms. In fact, unchecked ambitions in acquiring and analyzing data sets could easily make your organization less data centric, as you drown in your data lake.
Luckily, the data centric approach has a life preserver: use the shared model of your data centric architecture as a way to organize and interpret the data you are acquiring in an agile way.