451 Research begins its paper on the “unstoppable rise of the Data Catalog” with the following:
Could the data catalog be the most important data management breakthrough to have emerged in the last decade? There’s certainly a case to be made given its importance in enabling modern analytics architecture.[1]
The data catalog is indeed the most important breakthrough in data management in the last decade, and it may even rival the advent of the data warehouse. The data warehouse brought back-office statistical capabilities into the front office, enabling business consumers to conduct their own analyses and obtain their own insights. The data catalog is the next wave: it empowers business users to reduce time to insight even further, despite the rising tide of data flooding the enterprise. Sherlock Holmes could quickly deduce insights from sharp observations and brilliant connections; the data catalog does this too. Powerful indeed, and an enormous breakthrough.
The Data Flood
Organizations are drowning in data. The growth of data of all kinds is exponential, surpassing an organization’s ability to manage it, let alone exploit it. Data promises many benefits, from profits to efficiency, but to realize those benefits it must be understood and managed.
Enter the Data Catalog: an automated inventory of data assets, powered by machine learning (ML). An asset is a highly valuable resource that merits management. Assets include money, real estate, and personnel, all of which contribute to the organization’s ability to perform its mission, and all of which must be tracked and managed. TDAN has, over the years, stressed the importance of managing data as a valuable asset; data has great potential to maximize efficiency, pinpoint new opportunities, and report on the status of mission goals. Like any other asset, it therefore requires management and tracking.
The problem with the data flood is that data enters the organization in huge volumes from many directions: analysts make copies of pre-existing data, analysts and business people alike bring in new data sets, and every organization has systems that create data. This flood cannot reasonably be tracked and inventoried without automation. But the automation requires intelligence to maintain not only the simple facts about the data (data set names, column names, and so on) but also its inherent meaning and value to the organization, which depends on its connectedness to other assets in the organization.
This is where Machine Learning (ML) comes in.
Machine Learning
ML is the “secret sauce” that makes data catalogs so powerful; a catalog product that is not ML-augmented simply cannot handle the sheer volume and complexity of enterprise data. In our book, The Data Catalog, Lowell Fryman and I illustrate why this is true by peeling back the covers to show what ML provides and how it enables the inferences made from associating many different types of data assets. It is these connections that give the analyst the knowledge and insight to choose the right data to use.
In this column, I will briefly touch upon some of the main advantages that data catalogs, empowered by ML, provide.
Affinity Analysis
Data catalogs use ML in various ways to relate a data asset to others in the catalog, by means such as:
- Similar names
- Data coming up in similar searches
- Columns and tables with similar data: duplicate detection
- Overlapping values such as reference data
- Data lineage: source to target connections
- Business rules and the data they control
- Business terms linked to technical assets like columns in a table
- Topics, categories, and tags
Automation ingests data as it enters the organization, and ML creates the links: it makes inferences and suggests connections between data assets, which a human can accept or reject. The catalog then “learns” from the body of knowledge created by this partnership between catalog and human; the more the catalog is used, the smarter it becomes.
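To make these mechanics concrete, here is a minimal sketch in Python of two of the affinity signals listed above: name similarity and overlapping values. The function names, weights, and threshold are invented for illustration, not any vendor’s implementation; real catalog products blend many more signals and tune them with curator feedback.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Name-affinity signal: normalized similarity of two column names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_overlap(values_a: set, values_b: set) -> float:
    """Jaccard overlap of distinct values; high overlap hints at shared reference data."""
    if not values_a or not values_b:
        return 0.0
    return len(values_a & values_b) / len(values_a | values_b)

def suggest_link(name_a, values_a, name_b, values_b, threshold=0.6):
    """Blend the signals into a link suggestion for a human curator to accept or reject."""
    score = 0.5 * name_similarity(name_a, name_b) + 0.5 * value_overlap(values_a, values_b)
    return score >= threshold, round(score, 2)

# Two columns that probably hold the same reference data (US state codes):
ok, score = suggest_link("cust_state", {"NY", "CA", "TX"},
                         "customer_state", {"NY", "CA", "FL"})
print(ok, score)  # the curator's accept/reject decision becomes training feedback
```

In this toy example the two state columns score above the threshold, so the catalog would surface a suggested link for a curator to confirm or reject.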
General Benefits
You cannot manage what you don’t know, and the vast amount of data in most organizations is neither known nor understood. The first benefit is the creation of a data inventory, which is important for:
- e-discovery
- legal matters
- regulatory compliance
- policy enforcement
- privacy protection
- risk management
- security
- data usage agreements (DUAs)
- financial data rules
- exploiting the power of data to produce new insights leading to higher profits or cost savings.
If you don’t know what you have, you will likely suffer problems in two major areas:
- Search
- Curation (maintenance)
Search
The most recognizable benefit of a data catalog is its search facilitation. It adds context to data, greatly enhancing understanding for data analysts and data scientists. InfoWorld reported: “Most data scientists spend only 20 percent of their time on actual data analysis and 80 percent of their time finding, cleaning, and reorganizing huge amounts of data, which is an inefficient data strategy.”[2]
Data scientists do indeed spend most of their time finding data sets and trying to decipher which are appropriate for their study. Data set and column names are often deceptive, providing unreliable information about the data they actually contain. What if the columns that appear most promising are all NULL (which usually means unknown), containing no data? Data analysts also want to know who else is using the data: Did they find it useful for their purposes? What other studies are based on this data set? Are those studies similar to the one I’m working on?
Data catalogs can answer these questions. They provide rich metadata[3] that aids analysts’ understanding, offering a window into the contents of the data without the analyst having to perform many preliminary steps to investigate it themselves. Analysts often must clean and prepare the data for analysis, and a data catalog can help them locate versions of the data that have already been cleaned, saving them the effort of performing redundant transformations. A catalog can also provide a forum where users add their own comments and ratings for data assets, offering advice on how to clean the data and pitfalls to avoid.
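As an illustration of the profiling metadata a catalog can surface, the following sketch (in Python with pandas; the data set and column names are hypothetical) computes each column’s NULL fraction and distinct-value count, exactly the facts that reveal when a promising-looking column is actually empty.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column profile: the kind of metadata a catalog surfaces up front."""
    return pd.DataFrame({
        "null_fraction": df.isna().mean(),           # an all-NULL column profiles as 1.0
        "distinct_values": df.nunique(dropna=True),  # low counts often indicate reference data
        "dtype": df.dtypes.astype(str),
    })

# Hypothetical data set: 'discount_pct' looks promising by name but is entirely NULL.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "discount_pct": [None, None, None, None],
    "state": ["NY", "NY", "CA", "TX"],
})
print(profile(orders))
```

Seeing a NULL fraction of 1.0 in the catalog spares the analyst from downloading and opening the data set at all.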
Management, Governance, and Compliance (Oh My!): Curation
Data catalogs also help simplify and reduce the time spent on the management of assets. Some data sets have sharing rules that must be followed when repurposing them or using them in a study. Some data may also be subject to regulations such as the General Data Protection Regulation (GDPR).[4]
The role responsible for managing data assets is called a curator, by analogy with a museum curator who is in charge of the museum’s artifacts. Consequently, the work of managing the assets is called curation.
The data catalog records where each data set resides in the enterprise’s Information Technology (IT) ecosystem, so both the analyst and the curator can find it. Data catalogs help curators locate, manage, and track the status of data assets. For example, if an error occurs in a load or quality check, the curator can change the status of the asset to notify users of a potential problem, which may influence the analyst’s decision whether to include that asset in a study.
The automation of curation is very important, as anyone who has tried to keep track of data, files, or soft copies of reports knows. It solves problems such as the following (a minimal sketch of one approach appears after the list):
- How do you know when there is a new data asset available?
- What if there are changes to the data?
- When was the last time that this data was refreshed?
- Who is using this data?
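By way of illustration only, the sketch below shows one way automated change detection might answer the first three questions: fingerprint each file and compare it with the catalog’s last-seen record. The catalog store, record layout, and function are invented for this example; commercial products track far richer state.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical catalog store: path -> {"checksum", "last_refreshed", "status"}
catalog = {}

def scan(path: Path) -> str:
    """Fingerprint a file and update the catalog, flagging new or changed assets."""
    checksum = hashlib.sha256(path.read_bytes()).hexdigest()
    record = catalog.get(str(path))
    if record is None:
        status = "new"        # "How do you know when there is a new data asset?"
    elif record["checksum"] != checksum:
        status = "changed"    # "What if there are changes to the data?"
    else:
        status = "unchanged"
    catalog[str(path)] = {
        "checksum": checksum,
        "last_refreshed": datetime.now(timezone.utc),  # "When was this data refreshed?"
        "status": status,
    }
    return status

# A scheduled job would call scan() for every file in a landing zone; a curator
# could subscribe to "changed" events and, after a failed quality check,
# downgrade the asset's status to warn analysts.
```

The fourth question, who is using the data, is typically answered by usage tracking, such as mining query logs and catalog search activity, rather than by scanning files.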
Manual inventories of data sets are usually maintained in spreadsheets or SharePoint. Both methods depend on author initiative: manually adding a row to a spreadsheet or uploading a document to SharePoint. The metadata captured is not uniform and is highly dependent on the author’s manual input, and it is never certain that all relevant data assets are included in either inventory method. The locations of the assets may change, so search results are unpredictable and do not inspire confidence. Users and researchers therefore embark on their own manual search journeys, scouring the intranet and internet for anything that might have potential. The net result is a huge waste of people’s valuable time, which lengthens the total elapsed time before a mathematical model or report can be produced. The lack of confidence, stemming from source ambiguity, also casts doubt on the final study’s dependability.
Data catalog products can automate many ingestion and management tasks, cutting curation time considerably and helping to increase confidence in the data. Curation benefits greatly from automation, but a human curator can never be replaced by it; human judgment is still required, for example, to accept or reject automated suggestions. The tool performs the rote tasks, freeing the curator to focus on the more important tasks that require human judgment.
Another Benefit: Making Sense of Data Lakes
One of the main benefits, as discussed above, is the ability to automate data and metadata ingest. This is very important for “Data Lake” or “Big Data” environments.[5] Adding intelligent automation and metadata discovery to the ingest process is extremely beneficial for data lakes because they often contain large volumes of data that lack schemas describing their format. Data catalogs can use automation and ML to discover the underlying format of the data.
Some data catalog products provide file comparison and automated schema discovery upon ingest. This helps both curation and search by grouping files with like structures together, facilitating understanding of the data. The ability to tie natural-language names to columns enhances that understanding further.
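The following sketch illustrates automated schema discovery in its simplest form: sample a headerless delimited file, guess a type per column, and emit a candidate schema that like-structured files can be grouped by. The function names and the naive type ladder are my own assumptions, far simpler than the fingerprinting real products perform.

```python
import csv

def infer_type(values) -> str:
    """Guess the narrowest type that fits every sampled value in a column."""
    for cast, name in ((int, "integer"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

def discover_schema(path: str, sample_rows: int = 100) -> list:
    """Infer a candidate schema from a headerless CSV by sampling its rows."""
    with open(path, newline="") as f:
        sample = [row for _, row in zip(range(sample_rows), csv.reader(f))]
    columns = list(zip(*sample))  # transpose sampled rows into columns
    return [infer_type(col) for col in columns]

# A file containing rows like "1001,19.95,NY" would yield
# ["integer", "float", "string"]: a structural fingerprint the catalog can
# use to group like files together.
```

Two files that yield the same candidate schema are good candidates to be grouped, and possibly deduplicated, during curation.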
Types of Data Catalogs
There is a wide variety of data catalog products in the market, mainly because each started out focusing on a specific aspect of data management. Here’s a summary of some of the main players and their focus areas:
- Business-Friendly: Alation was the first product to brand itself as a self-service data catalog for data discovery and collaboration. Its main emphasis is business friendliness, and it is very easy to install and get up and running quickly.
- Data Prep: Boomi Unifi started out as a data preparation tool, helping analysts and businesspeople alike with the usually time-consuming work of preparing data for analysis. Unifi then broadened its offering to include data catalog features.
- Data Governance Platform: Collibra began as a business glossary product, then became a full-fledged enterprise data governance platform. It has a very flexible and rich meta-model and great governance capabilities.
- “One-Stop-Shop” and Data Management Platform: IBM Watson Cloud Pak for Data and Informatica Enterprise Data Catalog/Axon both have data catalogs complete with comprehensive data management capabilities such as data quality, data lineage, data profiling and reference data management, to name a few.
- Data Lake: Waterline by Hitachi Vantara specializes in a data catalog for managing data lakes, with comprehensive data profiling and duplicate detection, providing “fingerprinting” for affinity matching at scale.
Conclusion
Data catalogs are imperative for managing today’s data glut and enabling data scientist productivity. In the next column, we will showcase how use case/user story analysis can help you determine which data catalog is right for you.
Excerpts from The Data Catalog: Sherlock Holmes Data Sleuthing for Analytics, by Bonnie O’Neil and Lowell Fryman, published by Technics Publications, 2020
©2020 The MITRE Corporation. All rights reserved. Approved for public release. Distribution unlimited, case number 20-00222-1.
[1] https://clients.451research.com/reportaction/95778/Toc? Figure ©451 Research, used by permission.
[2] “The 80/20 data science dilemma”, InfoWorld, September 26, 2017. https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html
[3] The term “metadata” refers to data that describes other data. For example, SharePoint prompts a user uploading a document to provide various information about it, such as the author (who may be different from the person uploading it), its topic, and so on. This user-supplied information is metadata.
[4] The General Data Protection Regulation governs data associated with individuals living in the European Union (EU).
[5] See Chapter 6 of The Data Catalog, “Fishing in the Data Lake.”