As I write this, I can almost hear you wail “No, no, we don’t have too much metadata, we don’t have nearly enough! We have several projects in flight to expand our use of metadata.”
Sorry, I’m going to have to disagree with you there. You are on a fool’s errand that will just generate busywork and will have no real impact on your firm’s ability to make use of the data it has.
Let me tell you what I have seen in the last half dozen or so very large firms I’ve been involved with, and you can tell me if this rings true for you. If you are in a mid-sized or even a small firm, you may want to divide these numbers by an appropriate denominator, but I think the end result will remain the same.
Most large firms have thousands of application systems. Each of these systems has a data model consisting of hundreds of tables and many thousands of columns. Complex applications, such as SAP, explode these numbers (a typical SAP install populates 90,000 tables and half a million columns).
Even as we speak, every vice president with a credit card is unknowingly expanding their firm’s data footprint by implementing suites of SaaS (Software as a Service) applications. And let’s not even get started on your Data Scientists. They are rabidly vacuuming up every dataset they can get their hands on, in the pursuit of “insights.”
Naturally you are running out of space in your data centers, and especially out of system admin bandwidth, so you turn to the cloud. “Storage is cheap.”
This is where the Marie Kondo analogy kicks in. As you start your migration to the cloud (or to your Data Lake, which may or may not be in the cloud), you decide “this would be a good time to catalog all this stuff.” You launch into a project with the zeal of a Property and Evidence Technician at a crime scene: “Let’s carefully identify and tag every piece of evidence.” The advantage they have, and you don’t, is that their world is finite. You are faced with cataloging billions of pieces of metadata. You know you can’t do it alone, so you implore the people who are putting the data into the Data Swamp (er, Lake) to help. You mandate that anything that goes into the lake must be completely cataloged. Pretty soon you notice that the people putting the data in don’t know what it is either. And they know most of it is crap, but there are a few good nuggets in there. If you require a description for each data element, they will copy the column heading and call it a description.
Let’s just say, hypothetically, that you succeeded in getting a complete and decent catalog of all the datasets in use in your enterprise. Now what?
You have somewhere between hundreds of millions and billions of pieces of metadata. Someone wants to consume some data, so they go to the catalog. But it’s all descriptive. It’s in the terms, words, level of abstraction, structure, and whim of whoever put it in there. There is no real way to tell what the definitive source is, which dataset is “better,” and so on. Yes, you could double down and add more and more metadata to the pile, but I don’t see this as a viable approach.
Have we been looking through the wrong end of the telescope? Have we been looking at a voluminous mass of metadata and trying to make sense out of it, when maybe we should have started with sense making, and from there directed our gaze?
Our work has convinced us, quite compellingly, that there is a core of interrelated meaning at the heart of every enterprise, and that this core is comparatively simple. For most firms in a single industry, the core amounts to fewer than a thousand concepts. By “concept” here, we mean either a class / table / entity or a column / attribute / property. It seems hard to believe that a firm could be simpler than most of its applications, but this is what we find over and over again.
This core is extended in subdomains (but only slightly) and augmented with taxonomies (which add new labels but no new structure).
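To make the scale of this core concrete, here is a minimal sketch in Python of what such a core model might look like. Every class, property, and taxonomy name below is hypothetical, invented purely for illustration; the point is only that the whole structure, core plus modest subdomain extensions, stays in the hundreds of concepts rather than the millions.

```python
# A hypothetical core model: a few shared concepts (classes) and their
# properties, expressed as plain data. Real core models are richer, but
# the entire thing stays small by design.
CORE_MODEL = {
    "Person":       {"properties": ["name", "birthDate", "residesAt"]},
    "Organization": {"properties": ["name", "registeredAt"]},
    "Address":      {"properties": ["streetLine", "locality", "postalCode"]},
    "Agreement":    {"properties": ["party", "effectiveDate", "expiryDate"]},
}

# Subdomains extend the core, but only slightly ...
SUBDOMAIN_EXTENSIONS = {
    "claims": {"Claim": {"subClassOf": "Agreement",
                         "properties": ["claimant", "lossDate"]}},
}

# ... and taxonomies add new labels, but no new structure.
TAXONOMIES = {
    "agreementType": ["lease", "loan", "policy", "serviceContract"],
}

def concept_count(core, extensions):
    """Count classes and properties -- 'concepts' in the sense used above."""
    n = 0
    for spec in core.values():
        n += 1 + len(spec["properties"])
    for ext in extensions.values():
        for spec in ext.values():
            n += 1 + len(spec["properties"])
    return n  # taxonomy labels add vocabulary, not structural concepts

print(concept_count(CORE_MODEL, SUBDOMAIN_EXTENSIONS))  # stays tiny, by design
```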
How does this benefit you, and what does it have to do with Marie Kondo? Once you know your core model, you know that all the datasets of interest to your firm somehow relate to this core. You know that there are more definitive and less definitive sources for each of the concepts in the core. Your job now becomes going from the thousand to the billion, which, while a bit of work, is far more viable than trying to go from the billions to the thousands.
Now, instead of trying to embrace the billions of bits of metadata, most of which are worth far, far less than all that junk in your attic, basement, or garage, you start with the items that bring you joy: the concepts you really run your business on. This tiny subset of metadata will lead you to the most authoritative sources for each. There may be many, but there won’t be hundreds or thousands. (If you boil the ocean from the bottom up, you will be surprised at, for instance, how many Address Line 1 concepts you have across your various systems, and, as obvious as it sounds at first, you will have no idea what any of them mean until you put them in context.)
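To illustrate the direction of travel, here is a small, hypothetical sketch of going from the thousand to the billion: a single core concept fanning out to the columns that carry it across systems. The system names, column names, and scores are invented for illustration; what matters is that the map starts from the core and points outward to a short list of trustworthy sources.

```python
# A hypothetical mapping from a core concept to the columns that carry it.
# You walk from the (small) core out to the (huge) catalog, not the other
# way around, and you rank sources by how definitive they are.
CONCEPT_TO_SOURCES = {
    "Address.streetLine": [
        # (system, table.column, context, authoritativeness 0-1)
        ("CRM",      "contact.addr_line_1",  "customer mailing address",  0.9),
        ("ERP",      "vendor.address1",      "vendor remittance address", 0.8),
        ("HR",       "employee.home_addr_1", "employee home address",     0.7),
        ("DataLake", "stg_cust_2019.ADDR1",  "unknown copy, stale",       0.1),
    ],
}

def definitive_sources(concept, threshold=0.75):
    """Return the handful of sources worth trusting for a core concept."""
    candidates = CONCEPT_TO_SOURCES.get(concept, [])
    return [src for src in candidates if src[3] >= threshold]

for system, column, context, score in definitive_sources("Address.streetLine"):
    print(f"{system}: {column}  ({context}, score {score})")
```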
Most of the data in your datascape is copies of data from elsewhere in your datascape. Most of it was copied because a second system needed some seed data to get its functionality to work. It is often copied to preserve referential integrity in some local application. Occasionally the data is copied in order to be augmented and/or improved. Different studies estimate the extent of duplicate data to be between 50% and 85% of all data. [i] [ii] [iii] [iv]
Right off the bat, we know we only need 15-50% of our data. My suspicion is that it is much worse than this, as many of the transmogrified copies are escaping detection as true duplicates. And much of what is left is of marginal value.
Often the copy gets transformed in some way to conform to the second system. Sometimes subsets are omitted. And there is delay in all this copying. Not every copy is up to date. Not every copy is equivalent to every other copy. It behooves us to let the core metadata direct our gaze to the system or systems most likely to hold the best version of that type of data.
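This is also why naive duplicate detection undercounts. A minimal sketch of the problem: if a copy has been re-cased, trimmed, or reduced to a subset, an exact comparison says the two columns are different, while a comparison of normalized value overlap still traces the copy back to its source. A real copy-data-management tool does far more than this; the toy below just shows the principle.

```python
# Why transformed copies escape naive duplicate detection: exact comparison
# fails as soon as one side is re-cased, trimmed, or a subset, while a
# normalized-overlap comparison still links the copy to its source.

def normalize(values):
    """Crude normalization: trim, lower-case, drop empties."""
    return {str(v).strip().lower() for v in values if str(v).strip()}

def overlap_ratio(col_a, col_b):
    """Share of the smaller column's (normalized) values found in the other."""
    a, b = normalize(col_a), normalize(col_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

source_col = ["123 Main St", "45 Oak Ave", "9 Elm Rd", "77 Pine Ct"]
copied_col = ["123 MAIN ST ", "45 oak ave", "9 Elm Rd"]  # re-cased, trimmed subset

print(source_col == copied_col)               # False: exact match misses the copy
print(overlap_ratio(source_col, copied_col))  # 1.0: every copied value traces back
```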
So, if you find yourself overwhelmed by a tsunami of metadata, instead of trying to find a place for all of it, take a page from Marie Kondo and simplify your life. Build the most elegant model you can of the information that runs your business. This model will be metadata. Connect the dots between this model and the definitive sources in your existing systems (more metadata). These two sets of metadata are the ones that will bring you joy. The rest you can either discard or, if you must, retain, but don’t mix it in with the good stuff. Put it in the equivalent of a U-Store-It locker.
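If it helps to see that triage spelled out, here is one hypothetical way it might look in code, under the assumption that a dataset earns a place with the good stuff only when it maps to a core concept and is a definitive source for it; everything else goes to the locker or the bin. All names and fields are invented for illustration.

```python
# Hypothetical Kondo-style triage of cataloged datasets: keep the metadata
# that brings joy (core model + definitive sources), park the rest.

def triage(dataset):
    """Return 'keep', 'locker', or 'discard' for a cataloged dataset."""
    if dataset.get("core_concept") and dataset.get("definitive", False):
        return "keep"     # maps to the core AND is an authoritative source
    if dataset.get("retention_required", False):
        return "locker"   # must be retained, but kept away from the good stuff
    return "discard"

catalog = [
    {"name": "crm.contact",   "core_concept": "Person", "definitive": True},
    {"name": "stg_cust_2019", "core_concept": "Person", "definitive": False,
     "retention_required": True},
    {"name": "tmp_export_final_v7_REALLY_final", "core_concept": None},
]

for ds in catalog:
    print(ds["name"], "->", triage(ds))
```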
[ii] https://info.aiim.org/aiim-blog/newaiimo/2011/09/20/5-myths-about-rot-redundant-obsolete-and-trivial-files
[iii] https://whatis.techtarget.com/definition/ROT-redundant-outdated-trivial-information
[iv] https://www.varinsights.com/doc/copy-data-management-could-save-billions-on-redundant-data-0001