About a week ago, I was teaching a data modeling class, and an attendee asked me to explain the concept of a data catalog.
Like many hyped terms in IT, "data catalog" has more than one definition. Fortunately, I had recently read The Data Catalog: Sherlock Holmes Data Sleuthing for Analytics by Bonnie K. O’Neil and Lowell Fryman.
The book provides a clear and objective definition.
In addition to explaining what a data catalog is and which features data catalogs offer, the book covers the current state of the industry. It is organized into three sections:
- Chapters 1 and 2 reveal the rationale for a data catalog and share how data scientists, data administrators, and curators fare with and without a data catalog.
- Chapters 3-10 present the many different types of data catalogs.
- Chapters 11 and 12 provide an extensive feature list, current trends, and visions for the future.
Here is an excerpt from Chapter 1 (explaining the concept of a data catalog):
Organizations are drowning in data—it is overwhelming. The growth of data of all kinds is exponential, and surpassing an organization’s ability to manage it, let alone exploit it. Data offers the promise of many benefits, from profits to efficiency, but in order to realize these benefits, it must be understood and managed.
Enter the Data Catalog, which is an automated inventory of data assets, augmented (powered) by machine learning (ML). An asset is a highly valuable resource that merits management. Assets include money, real estate, and personnel, all of which contribute to the organization’s ability to perform its mission. Valuable assets must be tracked and managed. Data has universally been recognized in recent years as a valuable asset as well. It has great potential to maximize efficiency, pinpoint new opportunities, and report on the status of mission goals. However, as such, it requires management and tracking, just like other assets.
A data catalog is an inventory of data assets that enables users to discover and explore all the data sources available, enhancing their understanding of these sources, enabling collaboration with other users to enrich the quality of the assets, and achieving more value from the organization’s data.
A card catalog for data.
A data catalog is a reference for data very much like a card catalog works for library books. A card catalog helps readers select and locate the books that are potentially pertinent to a specific research endeavor. It provides lots of useful facts about books such as:
- Author’s name(s)
- Topic
- Publication date
- Publisher
- Brief book description
- Dewey Decimal classification number indicating its shelf location
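To make the card catalog analogy concrete, here is a minimal sketch (not from the book) of how those card fields might map onto metadata for data assets. The class names `CatalogEntry` and `DataCatalog`, every field name, and the sample assets are hypothetical and chosen only for illustration. A real data catalog would harvest this metadata automatically rather than rely on manual registration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class CatalogEntry:
    """Metadata about one data asset (all field names are hypothetical)."""
    name: str
    owner: str          # analogous to the author's name
    topic: str          # analogous to the book's topic
    created: date       # analogous to the publication date
    source_system: str  # analogous to the publisher
    description: str    # analogous to the brief book description
    location: str       # analogous to the shelf location


class DataCatalog:
    """A toy in-memory inventory of data assets, keyed by asset name."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        """Add an entry, replacing any existing entry with the same name."""
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[CatalogEntry]:
        """Return entries whose topic or description mentions the keyword."""
        needle = keyword.lower()
        return [
            e for e in self._entries.values()
            if needle in e.topic.lower() or needle in e.description.lower()
        ]


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="customer_orders",
    owner="Sales Operations",
    topic="orders",
    created=date(2020, 1, 15),
    source_system="order-entry database",
    description="One row per customer order, refreshed nightly.",
    location="warehouse.sales.customer_orders",
))
catalog.register(CatalogEntry(
    name="web_clickstream",
    owner="Digital Marketing",
    topic="web traffic",
    created=date(2021, 6, 1),
    source_system="website event collector",
    description="Raw page-view events from the public website.",
    location="lake.raw.web_clickstream",
))

# Look up assets the way a reader would look up books on a subject.
print([e.name for e in catalog.search("order")])  # prints ['customer_orders']
```

As with a card catalog, the search returns descriptions and locations of the assets, not the data itself.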
An “Amazon-like” online catalog for data shopping.
The shopping experience can also be another great analogy for the data catalog. Imagine you are looking for a new book from your favorite author, such as the one in Figure 1-1.
We are all familiar with Amazon’s “recommender engine,” which shows related items. Figure 1-2 shows what other customers buy together with your item.
The online catalog shows what customers also view when looking at this item, as shown in Figure 1-3. This is helpful because you might spot one of the author’s books that you might not have read.
The editorial reviews along with product details and a favorability ranking are shown in Figure 1-4.
The online sales catalog shown in Figure 1-5 features a summary of the reviews based on five stars and actual customer reviews.