About 540 million years ago, most of the animal body plans we see today suddenly appeared in a “relatively brief” period of about 20 million years or so. This was the “Cambrian Explosion.” A good number of strange forms that subsequently went extinct were also part of this episode. Since that time, the types of animal that survived have basically improvised on these fundamental body plans, exhibiting tremendous diversity, but always within the constraints set by their original architecture.
Today, it seems that we are seeing a Cambrian Explosion in the realm of Data Catalogs. A large group of tools offering a broad range of functionality is quickly appearing, and all of them seem to be very different from their primitive forerunners that evolved in previous decades.
Data Catalogs are a new class of metadata tool. Broadly speaking, their goal is to unlock the value of the enterprise’s data resource for everyone who may need to work with data. As such, they are definitely an enterprise class of tool, rather than a class of tool designed to support a specific type of business unit. Being enterprise-wide in scope makes them extremely important, and strategic in nature. It also means that the market for them is enormous.
Before 2019, an epoch that is the equivalent of the Precambrian period in this story, tools that resembled primitive Data Catalogs inhabited favorable environmental niches, like Data Governance units, or BI teams. But in 2019, and continuing into 2020, a bewildering array of products came into the market, in many cases seemingly from nowhere. I recently counted 42 of them, and the number appears to be growing.
Now, if all these Data Catalogs were clones of each other, or had very similar functionality, it would be easier to understand them. But their functionality varies, and it is this diversity that makes the Cambrian Explosion a particularly good analogy for what is going on. Just like good paleontologists, we need to start by classifying the fundamental types of forms we are looking at. This is tricky and can easily be proven wrong at a later date, but let’s give it a try.
It seems to me that if we look at the fundamental paradigms of the current Data Catalogs, they have three major orientations that are likely to be the pathways for their future evolution, as shown in the illustration below:
Let’s explore the three fundamental paradigms shown in Figure 2:
- Technical Metadata Inventory. This covers all the technical metadata that is related to data. Data Dictionaries that provide an understanding of databases and other data stores are one example. Report metadata, data lineage, data discovery, and automated data classification are examples of other areas of metadata covered by this functionality.
- Human Factors in Data. This covers all metadata at the level of business understanding, including the metadata that guides human behavior around data. A big part of this is what has been called the “Business Glossary,” which today covers much more than terms and definitions. Collaboration and sharing are also included, as are rules, roles, and responsibilities in dealing with data.
- Active Data Management. This covers enabling people to work directly with data through the Data Catalog. The catalog does not just provide helpful information; it provides an environment where actual data manipulation can occur. Also included is metadata engineering, which is the use of metadata to directly manipulate data.
Each Data Catalog product has some combination of all three of these fundamental orientations, but each product typically emphasizes one of them. The likelihood is that each Data Catalog will continue to focus on this one orientation and build more and more functionality to support it. Of course, this is a prediction, so we will see how it really turns out.
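To make the idea of "some combination of all three, with one emphasized" a little more concrete, here is a minimal Python sketch of how one might record a product's orientation profile. Everything in it is an assumption made for illustration: the class names, the `emphasis` weights, and the product name "ExampleCatalog" are hypothetical and do not describe any real product or any scoring method used in this article.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Orientation(Enum):
    """The three fundamental paradigms described above."""
    TECHNICAL_METADATA_INVENTORY = "Technical Metadata Inventory"
    HUMAN_FACTORS_IN_DATA = "Human Factors in Data"
    ACTIVE_DATA_MANAGEMENT = "Active Data Management"


@dataclass
class CatalogProfile:
    """Hypothetical profile of one Data Catalog product."""
    name: str
    # Relative emphasis per orientation; the weights are illustrative
    # guesses, not measurements of any real product.
    emphasis: Dict[Orientation, float]

    def primary_orientation(self) -> Orientation:
        """Return the orientation with the largest weight."""
        return max(self.emphasis, key=lambda o: self.emphasis[o])


# "ExampleCatalog" is an invented product used only to show the idea.
example = CatalogProfile(
    name="ExampleCatalog",
    emphasis={
        Orientation.TECHNICAL_METADATA_INVENTORY: 0.6,
        Orientation.HUMAN_FACTORS_IN_DATA: 0.3,
        Orientation.ACTIVE_DATA_MANAGEMENT: 0.1,
    },
)

print(example.name, "emphasizes", example.primary_orientation().value)
```

A profile like this keeps all three orientations visible for every product while still letting a reviewer group products by the one they emphasize most.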
One other feature to note is the distinction between “Active” and “Passive” Data Catalogs. Active Data Catalogs help users to create data products, usually oriented to some kind of analytics. Passive Data Catalogs hold information that is used to understand, govern, manage, and use the enterprise data resource. Every Data Catalog mixes active and passive features to some degree.
From this discussion, we can see that Data Catalogs belong to groups based on three different paradigms, and that individual products are likely to evolve in ways that differentiate them further based on the particular paradigm they have adopted. No doubt there will continue to be new entrants, but if the Cambrian Explosion analogy holds, the number of new entrants will decline rapidly in the near future, followed by a long period of extinctions and variations on the themes already established.
Time will tell.