If you do a general internet search for data catalogs, all sorts of possibilities emerge. If you look closely, and ask a lot of questions, you will find that some of these products are not actually fully functional data catalogs at all.
Some software products start out life solving a specific use case related to data, and then the inventors realize that there’s a great market out there for data catalogs.
But it is not just about cashing in; they see real benefits that a data catalog can bring, and their software has some related functionality. These tools perform a specific function, and their vendors have found that a data catalog is a nice complement to their offering. However, just because a software product can perform one feature desired from a Data Catalog product does not make it a Data Catalog. While the Data Catalog market is rapidly evolving, there are some basic features that every Data Catalog should have, such as the ability to serve all data consumers in the enterprise and allow them to collaborate.
Examples of these types of products and solution providers that may not be fully functional are:
- Cloud providers
- Data wrangling tools
- Data integration tools
- Data governance tools
- Data visualization tools
I label these products as “add-ons” because they typically do not work outside of their product’s purview. Their biggest problem is that most are siloed by their very nature: they assist the user only in the environment or platform where they reside, which limits their usefulness and the use cases they can support. A Data Catalog should be able to serve the data needs of all data consumers across the Enterprise and for all business and analytics functions. It should also be noted that products change rapidly, so you should always investigate a specific product and its ability to ingest data from multiple platforms and/or service multiple use cases (if this is what you need). Always examine your use cases and compare each product’s functionality against them.
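The comparison of use cases against product functionality can be sketched in a few lines of Python. This is only an illustrative sketch: the platform names, the product names, and the `CatalogCandidate` and `coverage_gaps` helpers are hypothetical and do not describe any real vendor's capabilities.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogCandidate:
    """A product under evaluation (hypothetical structure)."""
    name: str
    supported_platforms: set[str] = field(default_factory=set)


def coverage_gaps(required: set[str], candidate: CatalogCandidate) -> set[str]:
    """Return the required platforms the candidate cannot ingest from."""
    return required - candidate.supported_platforms


# Platforms your use cases depend on (made-up examples).
required_platforms = {"azure_sql", "on_prem_oracle", "aws_s3"}

# Made-up candidates: one siloed add-on, one broader catalog.
candidates = [
    CatalogCandidate("cloud_add_on", {"azure_sql"}),
    CatalogCandidate(
        "enterprise_catalog",
        {"azure_sql", "on_prem_oracle", "aws_s3", "hdfs"},
    ),
]

# Map each candidate to the sorted list of platforms it fails to cover.
report = {
    c.name: sorted(coverage_gaps(required_platforms, c)) for c in candidates
}

for name, gaps in report.items():
    status = "covers all required platforms" if not gaps else f"missing: {gaps}"
    print(f"{name}: {status}")
```

A siloed add-on shows up here as a product with a non-empty gap list. In practice the list of supported platforms should come from a proof of concept in your own environment, not from marketing material.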
There are a few products, mostly open source, that are Portals and not fully functional Data Catalogs. A portal makes resources available along with simple search features but does not really offer inventory management of data assets.
A portal allows the user to publish and share data. Portals are publishing vehicles that make data accessible. They usually offer some search capabilities through hard-coded web page links but do not have the vast array of metadata that catalogs have. An example of a Portal is CKAN (https://ckan.org/), an open source data portal platform used by many government entities to provide public access to data based on the Open Data Initiative and other similar data policies and regulations.
A major distinction is the vast array of linkage types that data catalogs provide, such as usage statistics, tying a data asset to users that have searched for it or are actively using the data. Portals don’t provide crowd sourcing (unless a programmer manually adds it in). They have very simple metadata. Catalogs provide a rich pool of metadata, along with the powerful ML that makes the searches smarter the more they are used. And we must not forget auto curation and metadata ingest, automating the inventory collection process.
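To make the difference in metadata concrete, here is a minimal Python sketch contrasting a portal-style record with a catalog-style record. The class and field names (`PortalEntry`, `CatalogEntry`, `searches_by_user`, and so on) are assumptions made for illustration; they are not taken from CKAN or from any catalog product.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PortalEntry:
    """Simple publishing metadata, as a portal might hold (hypothetical)."""
    title: str
    url: str
    description: str = ""


@dataclass
class CatalogEntry:
    """Richer metadata linking an asset to its users (hypothetical)."""
    title: str
    location: str
    description: str = ""
    tags: set[str] = field(default_factory=set)  # crowd-sourced tags
    searches_by_user: Counter = field(default_factory=Counter)

    def record_search(self, user: str) -> None:
        """Usage statistic: remember who searched for this asset."""
        self.searches_by_user[user] += 1

    def add_tag(self, tag: str) -> None:
        """Crowd sourcing: any data consumer can contribute a tag."""
        self.tags.add(tag.lower())

    def top_users(self, n: int = 3) -> list[str]:
        """Users who searched for this asset most often."""
        return [user for user, _ in self.searches_by_user.most_common(n)]


entry = CatalogEntry("Regional sales", "warehouse.sales.regional")
entry.record_search("ana")
entry.record_search("ben")
entry.record_search("ana")
entry.add_tag("Sales")

print(entry.top_users(1))
print(sorted(entry.tags))
```

The portal record only describes where the data is published, while the catalog record accumulates links between the asset and the people who use it. A real catalog would populate such links automatically and feed them into search ranking.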
Each of the major public cloud vendors offers a multiplicity of services that assist data usage. Microsoft Azure offers many services and data storage options, one of which is the Azure Data Catalog.
The Azure Data Catalog cuts across all the products and services, enabling a data catalog that provides rich metadata about all the data assets stored in Azure. But what if you have a hybrid cloud environment? Azure Data Catalog is not available for Azure Stack (the private cloud option). Plus, what if your environment has on-premises databases or other cloud providers? The Azure Data Catalog would not be able to serve up metadata for these outside data assets. Thus, I don’t consider the Azure Data Catalog a fully functional Data Catalog even though Microsoft has named it so. In this case, just because it “quacks like a duck” does not mean it is a duck (or a Data Catalog).
Some independent cloud products (those not offered by the cloud providers) are beginning to support hybrid cloud architectures, but at this writing, the major cloud providers only offer data catalogs for their platforms.
Data Virtualization and Integration Tools
Data integration tools first emerged to perform ETL (Extract, Transform and Load) and were used primarily for data warehouses. Then a new class of tools evolved which enabled users to integrate data virtually: not moving the data but instead integrating it “on the fly” in memory. An example of this type of tool is Denodo. Denodo and others are offering data catalog add-ons to enhance their integration functionality. The typical limitations of these tools involve the types of data that they can connect to. Data virtualization vendors mainly solved the problem of integrating relational data, and their tools may not be suited for other forms of data such as document (XML, JSON), key-value pair, etc. Denodo states that it handles unstructured data, so you should verify its support of document data if your organization’s use case includes it. The vendors are always adding new features, and by the time this article is published some limitations may no longer exist. Buyers should always verify that all desired data types and formats for their use cases are adequately covered by whatever product they ultimately choose.
Denodo has extended its virtualization solution to include data management features. Denodo also supports cloud environments. The cloud should help with scalability. Virtualization sometimes had scalability issues in the past for extremely complex joins.
Business Intelligence Tools
There are many data visualization tools, otherwise known as Business Intelligence (BI) tools, which got their start in the Data Warehouse days. These tools were used for creating dashboards depicting Key Performance Indicators (KPIs) in attractive visualizations, all on one easy-to-understand screen. They also featured drill-down and rollup capabilities for hierarchical data, such as drilling down into sales by a specific region, then a specific office, etc. Here’s a partial list of these tools, as there are hundreds of them:
- Microsoft Power BI
- Business Objects
Some of these tools such as Cognos were acquired by larger companies (in this case, IBM) and incorporated into their larger product suite offering. The larger suite of products often includes a data catalog function. One such example is Tableau.
Tableau, along with Qlik, was among the first tools to offer superior visualizations while at the same time breaking the cost barrier. Today Tableau markets an analytics platform, which is a large ecosystem including desktop, server, prep, data management (including governance), mobile, developer tools and embedded analytics. The platform is of course structured around the Tableau universe. The description of the Tableau Catalog includes “…you get a complete view of all of the data being used by Tableau…” It is Tableau-centric: an add-on to Tableau, limited to those assets, and not aimed at the general data consumer.
Caution: Proceed with Care
For this article I have explored the realm of data catalogs with the Enterprise in mind: the biggest gain that a data catalog brings is the ability to make sense of the vast array of siloed pockets of dark data that are difficult to find, and to enable the enterprise to know where all its data assets are. However, there may exist a need for targeted, siloed use cases within the organization that may make a siloed data catalog appropriate. Examples of this include a specific data lake that merits its own specific data catalog; or a cloud hosted by one specific cloud provider.
Be strongly cautioned that these siloed solutions usually will not scale to become an enterprise data catalog because they do not work on all platforms or technologies. For example, Azure Data Catalog is not available for Azure Stack, Microsoft’s on-premises cloud option. And if you have a hybrid cloud that combines on-premises systems with one or more cloud providers, cloud-hosted data catalogs usually will not span the whole environment.
The best solution is to “try before you buy.” Create a trial setup or proof of concept in your own environment.
Please note that much of the content of this article is an excerpt from a book that Bonnie O’Neil and I are writing. I must recognize Bonnie for a significant contribution to the content of this article, although she has not asked to be listed as an author. We anticipate the publication of our book on Data Catalogs in March 2020. In it, we document the functionality that should be available and include examples and screenshots from many vendor products to communicate the concepts and usage of Data Catalogs.