Where Did “Data Lakes” Come From?
Many large firms are implementing “Data Lakes.” The term, and the practice, came into being when two trends collided. The first trend was the upwelling of demand for “Big Data”-inspired architectures. Firms found Big Data approaches very attractive: massively parallel processing of large data sets, and the ability to perform analytics with little or no schema. The second trend was growing frustration with data warehousing infrastructure. As data warehouses have matured, firms have experienced longer and longer delays in onboarding new data sets. Each new data set (or application database) that needs to be incorporated into the data warehouse faces an ever more complex environment to be merged into. It is not unusual now for firms to take months to incorporate a new data set into their data warehouse.
Many firms looked at this situation and said, “Why don’t we treat our own data as if it were big data?” Just copy it, lay it down (more or less as-is), and let the analyst work it out.
Schema on Write vs. Schema on Read
By the way, this was the origin of the distinction between “schema on write” and “schema on read.” “Schema on write” is the paradigm of traditional databases. In a traditional system, it is just common sense that the schema must exist, and must be complete, before you write any data to the database. It’s not just a good idea; a relational database won’t work without a schema. Hence the term “schema on write”: you must have a schema if you’re writing to the database, and everything you write must conform to the schema. A corollary to this, and the one that ultimately causes the most angst, is that changes to the schema are very disruptive. Any change that alters the structure generally requires an unload and reload, or a conversion. And in relational technology, developers generally write application code bound to the schema in such a way that any change to the schema requires an impact analysis to determine what else is adversely affected.
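To make the contrast concrete, here is a minimal sketch of schema on write in Python, using the standard library’s sqlite3 module (the customer table and its columns are hypothetical): the schema must be declared before the first row is written, and a nonconforming write is rejected.

```python
import sqlite3

# Schema on write: the table definition must exist before the first row
# is stored, and every write must conform to it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        region TEXT
    )
""")
conn.execute("INSERT INTO customer (id, name, region) VALUES (1, 'Acme', 'West')")

# A write that violates the schema is rejected outright.
try:
    conn.execute("INSERT INTO customer (id, name) VALUES (2, NULL)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)  # NOT NULL constraint failed: customer.name
```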
Big Data doesn’t need a schema. Big Data generally has some structure (more often, these days, the dictionary-and-array structure of JSON), but the structure need not be dependent on or derived from a schema. This is a big plus for flexibility, but as we will see, it has many minuses elsewhere. This ability to find patterns in the data and then use them to query is what people call “schema on read.”
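And here is the same idea from the schema-on-read side, as a minimal Python sketch with hypothetical records: the data is laid down as-is, and the structure is discovered at read time.

```python
import json

# Schema on read: records are stored as-is. No declared schema gates the
# write, and different records may carry different fields.
raw = [
    '{"name": "Acme", "region": "West"}',
    '{"name": "Bravo", "golf_handicap": 12}',  # unexpected field, loads anyway
]
records = [json.loads(line) for line in raw]

# Find patterns in the data first, then use them to query.
fields = {key for record in records for key in record}
print(sorted(fields))                # ['golf_handicap', 'name', 'region']
print([r for r in records if r.get("region") == "West"])
```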
Will the Data Lake Be Successful?
I think it is safe to say that there will be declared successes in the Data Lake movement. A clever data scientist, given petabytes of data to troll through, will find insights that will be of use to the enterprise. The more enterprising will use machine learning techniques to speed up their exploration and will uncover additional insights.
Will the Data Lake Be Unsuccessful?
But in the broader sense, we think the Data Lake movement will not succeed in changing the economics or overall architecture of the enterprise. In a way, the Data Lake is something to do instead of dealing with the very significant problems of legacy ecosystems and the diseconomies of change.
Even at the analytics level, where the Data Lake has the most promise, we think it will fall short. The business analyst who generates reports and dashboards in the current data warehouse environment will likely be stunned by the sophistication of the tools and architecture of the Data Lake environment, and overwhelmed by the complexity of the data itself. In the data warehouse environment, the structures were simplifications and homogenizations of the complex schemas that exist in the source systems. Without the gating factor of the ETL (Extract, Transform, and Load) process, the analyst will be forced to deal with all the complexity of all the schemas. Most large enterprises have millions of distinctions in their collective application schemas (we count, at a minimum, all the tables and all the columns, as well as some of the enumerated and taxonomic distinctions, as long as they are essential to understanding what the data means).
Our handicapping suggests that most firms will have a Data Lake, most firms will get some value out of their Data Lake, but most firms will not have used the opportunity to transform the Tower of Babel they have created.
What Will It Take to Make Your “Data Lake” “Data Centric”?
Conceptually, the Data Lake is not far off from the Data Centric Revolution. The data does have a more central position. However, there are three things that a Data Lake needs in order to be Data Centric:
- Understandability
- Usability
- Updateability
Understandability
The data scientists who work through the data in the Data Lake are coming across meaning, but they are doing so in a very ad hoc fashion. We suggest making this process more structured and its result more useful.
Separate the Lake into two partitions: one pretty much as it is now (the raw data, unadorned), and a second partition where the data is partially interpreted. In the second partition, every schema concept (every tag, every element) should be mapped to its type in the Enterprise Ontology. This, at least, creates a repository of primitive understanding. For instance, at a minimum you would understand the difference between customers and their orders. You may not have all the distinctions among types of customers, but you will have the building blocks to make those distinctions.
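Here is a minimal Python sketch of what that mapping might look like; the raw schema names, the ontology URIs, and the `interpret` helper are all hypothetical illustrations.

```python
# Map each raw schema concept (table, column, tag) in the second partition
# to its type in the Enterprise Ontology. All names and URIs are made up
# for illustration.
ONTOLOGY_MAP = {
    "CUST_MSTR":         "https://ontology.example.com/Customer",
    "ORD_HDR":           "https://ontology.example.com/Order",
    "CUST_MSTR.CRD_LMT": "https://ontology.example.com/creditLimit",
}

def interpret(raw_concept: str) -> str | None:
    """Return the Enterprise Ontology type for a raw schema concept, if known."""
    return ONTOLOGY_MAP.get(raw_concept)

# At a minimum, an analyst now knows that CUST_MSTR rows are Customers and
# ORD_HDR rows are Orders -- the building blocks for finer distinctions.
print(interpret("CUST_MSTR"))  # https://ontology.example.com/Customer
```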
Usability
Once the Data Lake has some schema, it becomes usable. Business analysts can rely on the Enterprise Ontology for their basic needs. The Enterprise Ontology gives them the ability to “follow their nose”: to take what they do know and lead themselves gracefully to what they will know.
The big difference in the follow-your-nose style is that you can run a query and get back results in context. You could query for information on prospects and retrieve a property that describes a prospect’s golf handicap. You will retrieve this despite the fact that you weren’t expecting it and don’t know what it means. But if you become curious, there is a URI that resolves to an explanation of what a golf handicap is, should you wish to know.
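A minimal sketch of that follow-your-nose step, using rdflib (a common Python RDF toolkit); the golf-handicap URI is a hypothetical stand-in for a real, dereferenceable ontology term.

```python
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

# A query result came back with a property we weren't expecting. Because the
# property is identified by a URI, we can dereference that URI to learn what
# it means. The URI below is hypothetical.
unexpected = URIRef("https://ontology.example.com/golfHandicap")

g = Graph()
g.parse(unexpected)  # fetch the RDF that the URI resolves to

for label in g.objects(unexpected, RDFS.label):
    print("Property name:", label)
for comment in g.objects(unexpected, RDFS.comment):
    print("Definition:", comment)
```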
Updateability
The most significant difference between a Data Lake and the Data Centric Revolution is updateability. So far, virtually no one is considering making their Data Lake updateable, and on the face of it, that makes sense: the Data Lake is not the system of record, so updating it doesn’t do any good.
But if you rethink this, you can have it both ways. Consider having some of the repositories in your Data Lake be updateable as well as the system of record for the data stored within. Consider starting with the low-volume, relatively stable items, such as reference data. Having a portion of the Data Lake be the system of record for your other systems will allow you to prototype the functionality you will need to create a Data Centric Architecture. This portion of your Data Lake will have to have data security, validation and constraints, and identity management. To be useful, it will share schema with the non-curated parts of your Data Lake. Once this data architecture is in place, you can begin to see how you would use it to start converting application systems to the data-centric approach.
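A minimal Python sketch of such a curated, updateable partition; the store, its rules, and the record shapes are hypothetical stand-ins for real security, validation, and identity-management machinery.

```python
# A curated, updateable partition of the Data Lake acting as the system of
# record for low-volume, stable reference data. Everything here is a
# hypothetical illustration.
class ReferenceDataStore:
    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def upsert(self, record: dict) -> None:
        # Unlike the raw lake, this partition is the system of record,
        # so constraints are enforced on write.
        if "id" not in record:
            raise ValueError("every record needs a stable identity")
        if "type" not in record:
            raise ValueError("every record must be typed against the ontology")
        self._records[record["id"]] = record

    def get(self, record_id: str) -> dict | None:
        return self._records.get(record_id)

store = ReferenceDataStore()
store.upsert({
    "id": "country:US",
    "type": "https://ontology.example.com/Country",  # schema shared with the lake
    "label": "United States",
})
print(store.get("country:US"))
```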
Summary
No question, the Data Lake is the fad du jour. But the Data Lake is missing three key attributes that would be needed to make it a platform for the Data Centric Revolution: Understandability, Usability, and Updateability.
By pointing out the shortcomings of the Data Lake approach, we can suggest incremental improvements that will put us on the road to the Data Centric Revolution.