As I begin this article, it strikes me that the entire world is focusing on data, the data about COVID-19. Each day, here in New York City, so many of us wait with a mix of anticipation and dread for the daily recitation of statistics at our governor’s news conference.
The numbers are more encouraging these days, though the count of fatalities is still staggering.
It also occurs to me that there may never have been a time when more brilliant people—medical researchers, statisticians, data scientists, public health experts—were poring over such enormous amounts of data, of all types, searching for the correlations, the trends, the evidence of success and failure of clinical trials, all with the intention of finding a treatment, a vaccine, and new strategies and tactics to slow the spread, to save lives, and to salvage devastated economies.
Imagine how much these efforts could benefit from a structure that enhanced the transparency, the understandability, and the reliability of this vast universe of data. What could be a more fitting goal for data governance than to provide a structure for invention and innovation to fight this deadly, complex, and invisible enemy?
Returning to Eliyahu M. Goldratt and Jeff Cox’s book, The Goal, which I wrote about in my first article in this series, the cover notes for the 30th Anniversary edition observe, “One of Eli Goldratt’s convictions was that the goal of an individual or an organization should not be defined in absolute terms. A good definition of a goal is one that sets us on a path of ongoing improvement.”[1]
Providing a structure for invention is all about creating such a path.
Since finishing my last article, I have read Gartner's 2020 Planning Guide for Data Management. Much in the guide was familiar to me, including DataOps, hybrid cloud, data streaming technologies, and the like. But I was struck by how many new technologies and concepts Gartner highlighted of which I had heard little or nothing—LDWs (Logical Data Warehouses), Data Gravity, CEP (complex-event processing), multi-model databases, geospatial databases, and more. I found myself constantly going from the guide to Google to find out what these unfamiliar acronyms and terms meant. I admit to thinking more than once that my idea of deriving data governance's goal from Bach's The Art of the Fugue, written 270-odd years ago, was a bit quaint.
But when I read Gartner’s comments about data governance priorities, I saw highlighted there the same areas I had related to the structure of fugues—metadata, clarity, lineage. These formidable-sounding technologies were all just tools to manage data, and as Gartner wrote,
“Managing data effectively depends on being able to answer the following questions:
- What data do we have?
- Where is it?
- What does it mean?
“Without answers to these questions, many data management initiatives—including self-service analytics, data security, data privacy, information life cycle management and many aspects of regulatory compliance—are doomed to fail. The key to all of these initiatives is data governance powered by metadata management.”[2]
You may recall from my previous article how I described the way Bach used the subject of his fugues to provide "metadata" and "data lineage" to the listener: guideposts with which to keep one's bearings in the sea of contrapuntal music he creates. Just so, as Gartner points out, metadata management is "the key" to unleashing the full creative power of data.
As Goldratt writes in The Goal, in words spoken by his fictional alter ego, Jonah, “Every action that brings a company closer to its goal is productive. Every action that does not bring a company closer to its goal is not productive.”[3]
When the book's protagonist, the plant manager, tells his physics teacher he has discovered what the goal of his plant is (and if you haven't yet read The Goal, I won't reveal what it is), they discuss how they need to translate that goal into measurements that make sense in the context of running the plant. Similarly, I need to translate my proposed goal of data governance into measurements that are meaningful to business and technology users. What does data need to be for people to use it to generate invention?
I am sticking with “invention,” rather than innovation, because, as I discussed in my last article, the definition of invention is broader than innovation and includes “productive imagination: inventiveness,” as well as “discovery, finding.”[4] This definition is extremely relevant to the goal of data governance I have proposed.
Thinking back to the data governance principles I have used as guideposts in the past, I think three align well with the goal in terms of what the user needs for data to drive invention:
- The user needs to understand the data. If the business meaning of the data isn't clear, and the relationships between data sets are murky, then users will have to spend much of their energy and time sorting these things out and have little left to innovate.
- The user needs to rely on the data, from multiple perspectives, including its quality and its source. They need to trust that the data meets their requirements for quality, timeliness, and other needs, so they can explore it without first having to clean it up or figure out where it came from.
- The user needs easy access to the data they need, but only the data they are permitted to use. They need to have the data at their fingertips, with controls in place to prevent restricted data from being revealed to the wrong person.
All three contribute towards the all-important trust in the data. Maximizing a data user’s creativity and capability for invention happens only by freeing their minds of concern about the quality, meaning, origin, and appropriate accessibility of the data.
If we can determine a meaningful metric for each of the above, then (to the extent that data governance can provide a structure to improve these metrics) the framework we establish will foster invention and innovation. It will be similar to how the “fugue governance framework” enabled Bach and other composers to create complex but accessible works of art.
Before I propose measures for reliability, understandability, and accessibility, I need to frame my suggestions with a few words about DataOps. I learned of DataOps last year while searching for a data-centric answer to DevOps, the hot new approach to faster software integration and deployment. DataOps is exactly that: it combines Agile development methodologies and DevOps, adding a relentless focus on data quality through the Lean Manufacturing concept of Statistical Process Control.[5]
DataOps focuses on applying these disciplines to “data pipelines”. There are many definitions of data pipelines—the one I find most useful reads as follows:
“A data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. A pipeline also may include filtering and features that provide resiliency against failure.”[6]
Even this definition is a bit limited, in my view. Most of the DataOps literature talks about data pipelines in connection with data analytics, but almost any business process that uses data involves ingesting raw data, combining data, aggregating data, and/or filtering data. Viewed from this perspective, data pipelines are truly ubiquitous. Much as a composer transforms the raw materials of music—notes, rhythm, instruments—into a melody, a song, a symphony, a fugue, a data pipeline transforms data into valuable, usable information.
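To make this concrete, here is a minimal sketch in Python of the shape this definition describes: ingest raw data from a source, filter and transform it, and move it to a destination. The source text, field names, and in-memory "warehouse" are hypothetical stand-ins, not any particular product's API.

```python
import csv
import io

# Hypothetical raw source; in practice this would be a file, an API, or
# a message queue feeding the pipeline.
RAW_CSV = "region,cases\nBronx,120\nQueens,x\nBrooklyn,95\n"

def ingest(source):
    """Ingest raw records from a disparate source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(records):
    """Filter and reshape the raw data into usable information."""
    return [
        {"region": r["region"], "cases": int(r["cases"])}
        for r in records
        if r["cases"].isdigit()  # drop malformed rows, e.g. "x"
    ]

def load(records, destination):
    """Move the data to its destination for storage and analysis."""
    destination.extend(records)

warehouse = []  # stand-in for a real storage destination
load(transform(ingest(RAW_CSV)), warehouse)
print(warehouse)  # [{'region': 'Bronx', 'cases': 120}, {'region': 'Brooklyn', 'cases': 95}]
```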
DataOps talks about orchestrating data pipelines—controlling the flow across multiple platforms and tools, testing the quality of the data at each step, and monitoring the process. Orchestration, like invention, is another word with a musical context. The Merriam-Webster definition is:
- the arrangement of a musical composition for performance by an orchestra
- harmonious organization[7]
I think the second definition is what DataOps strives to do through orchestration, by arranging the data pipeline, which composes the data needed by the user, for performance by an orchestra of tools and code.
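To illustrate that reading, here is a sketch of an orchestrator that arranges hypothetical pipeline steps and performs them in sequence while monitoring progress; the step functions and log format are my own illustrative assumptions.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

def orchestrate(steps, data):
    """Run each pipeline step in order, logging progress for monitoring."""
    for name, step in steps:
        logging.info("running step: %s", name)
        data = step(data)
    return data

# Hypothetical steps standing in for real platform- or tool-specific tasks.
pipeline = [
    ("ingest",    lambda d: d + [{"region": "Bronx", "cases": 120}]),
    ("transform", lambda d: [r for r in d if r["cases"] >= 0]),
]
result = orchestrate(pipeline, [])
```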
So, with this overview of DataOps, let us get back to those measurements, and start with the easy one—how do we know our data is reliable? There are many ways to measure quality, but quality is meaningful only in context, as Laura Madsen points out in her book, Disrupting Data Governance.[8] This is where I believe DataOps comes into the picture.
The most salient feature of DataOps, to me, is its focus on automated tests of data quality at each step of a data pipeline. In fact, as Chris Bergh has explained, an effective way to begin DataOps is simply to define and build these tests, and then monitor the results.[9] I propose that the metric for data reliability be the number of automated tests running in the relevant data pipelines. The more data testing, the less likely unreliable data is to flow to the end user, and the more likely that interesting data variations will emerge to fuel invention.
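A minimal sketch of what counting those tests might look like follows; the checks, field names, and pipeline steps are illustrative assumptions, not drawn from any specific DataOps tool.

```python
def no_nulls(records, field):
    """Data test: every record carries a value for the field."""
    return all(r.get(field) is not None for r in records)

def non_negative(records, field):
    """Data test: the field never goes below zero."""
    return all(r[field] >= 0 for r in records)

# Automated tests registered against each step of a hypothetical pipeline.
PIPELINE_TESTS = {
    "ingest":    [lambda recs: no_nulls(recs, "region")],
    "transform": [lambda recs: non_negative(recs, "cases")],
}

def run_step_tests(step_name, records):
    """Run every automated test registered for a step; fail loudly."""
    for check in PIPELINE_TESTS[step_name]:
        if not check(records):
            raise ValueError(f"data test failed at step '{step_name}'")

# The proposed reliability metric: how many automated tests are running.
reliability_metric = sum(len(tests) for tests in PIPELINE_TESTS.values())
print(f"automated data tests in force: {reliability_metric}")  # 2
```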
For accessibility, we need to know whether users are getting their data from the authoritative data source, via a recognized data pipeline, rather than looking for the same data elsewhere. This presumes that the authoritative data sources have the appropriate data controls in place. We can measure the number of users obtaining their data from the approved pipelines and sources, and how often they do so, compared to the number of requests those same users make to other sources for the same data. This measurement can build confidence that users are not circumventing the system and going to unprotected, unvalidated sources.
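One way to compute such a measurement is sketched below; the query log and source names are invented for illustration.

```python
# Authoritative sources with appropriate controls in place (hypothetical).
AUTHORITATIVE_SOURCES = {"sales_warehouse", "customer_master"}

# A made-up log of which source served each user query.
query_log = [
    {"user": "analyst1", "source": "sales_warehouse"},
    {"user": "analyst1", "source": "spreadsheet_export"},  # circumvention
    {"user": "analyst2", "source": "customer_master"},
]

approved = sum(1 for q in query_log if q["source"] in AUTHORITATIVE_SOURCES)
other = len(query_log) - approved
print(f"approved-source queries: {approved}, other-source queries: {other}")
```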
Here is one that I think is tough—what one metric can tell us if data is understandable? The number of terms defined in your business glossary tool? The fact that you have hundreds of data elements defined does not alone yield an effective understanding of data when attributes such as data relationships and context are important.
We could consider data literacy the metric here. Jordan Morrow, Global Head of Data Literacy at Qlik, offers some guidance on such measurements in his article, Developing a Data Literate Workforce.[10] The focus of such an assessment, however, is on the skills and knowledge of the people using the data; it does not necessarily tell us the degree to which the data itself is understandable.
I found inspiration in an article by Lars Ole Schwen and Sabrina Rueschenbaum, "Ten quick tips for getting the most scientific value out of numerical data."[11] The authors write about the importance of tracking sources when performing statistical analysis of data, noting that "in any scientific work, it is crucial to document in sufficient detail how these data were acquired and which calculations were performed so that the reported results can be reproduced." From a data governance perspective, this points to the criticality of data lineage in making data understandable. A user needs to know the source of the data and the transformations applied to it along the way.
Interestingly, I think this ties back to my earlier comments on DataOps and the metric proposed for data reliability. Again, our metric here could be something conventional, such as the number of CDEs (critical data elements) with fully captured data lineage. However, if we have built data tests into each step of a data pipeline, have we not captured de facto data lineage for the data elements the pipeline produces for users? The percentage of data pipelines, relevant to a specific data use case, with automated tests in place for every step, including each data transformation, could be an effective metric. It works because, to decide where, what, and how to test, we must understand the processes creating the data we need.
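Computing that percentage could look like the sketch below, assuming we can inventory how many automated tests each step of each relevant pipeline carries; the pipeline names and counts are invented for illustration.

```python
# Automated test counts per step for each relevant pipeline (hypothetical).
pipelines = {
    "daily_case_counts": {"ingest": 2, "transform": 1, "load": 1},
    "hospital_capacity": {"ingest": 1, "transform": 0, "load": 1},  # a gap
}

fully_tested = [
    name for name, steps in pipelines.items()
    if all(count > 0 for count in steps.values())
]
coverage = len(fully_tested) / len(pipelines) * 100
print(f"pipelines with tests at every step: {coverage:.0f}%")  # 50%
```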
In summary, here are the three key measurements I propose for translating the goal into a structure for invention:
- The number of automated data tests built into the data pipelines providing users the data they need.
- The percentage of relevant data pipelines with automated tests in place for each step.
- The number of user queries run against the authoritative data sources and relevant data pipelines, versus the number of queries run against other sources.
Getting back to Goldratt’s hypothesis that any activity that contributes toward the goal is productive and any that does not contribute is not productive, and focusing on these three measures with which we’ve translated our data governance goal, I think the activities below need to be given highest priority in building the framework:
- Metadata management
- Identification of authoritative sources[12]
- Access control for the data in these authoritative sources
- Data lineage
- Implementing DataOps, including developing and deploying automated data quality tests and establishment of data issue monitoring and tracking
For this to work, everyone, as Bob Seiner and Laura Madsen have emphasized,[13][14] must be a data steward. That means each person is accountable for asking questions about data and notifying someone, or some group, about data they think is incorrect. I know I have been as guilty as anyone: finding something questionable in data I've pulled for a report, process, or other use; checking the data against some other source or with the provider; finding the right value; and going right ahead with whatever I have to do, without making sure the person or group who can fix the original data knows about it. And when I have notified that group, then pulled the same data again and found the same error, I have likely said to myself, "Oh, I told them about this, they will get it fixed sometime, so I will just go ahead and fix it for my work, and not bother them again." True data stewardship means taking personal accountability for always following up on data questions.
As we have observed, data governance can contribute to invention by making data reliable, understandable, accessible, and secure. The DataOps focus on continuous improvement, on running automated tests at each step of the data pipeline to catch data issues before users have to deal with them and to spot variances that need review and may yield new insights, operationalizes this aspect of data governance. This is a long way from merely measuring the quality of the output, or creating rigid business rules in a vain attempt to prevent bad data from entering the system. Instead, our measurements and processes provide a structure for invention, which I believe is what data governance is all about.
In the first part of this series, I described the three eras of data governance as defined by Winston Chen[15]:
- The Application Era (1960-1990)
- The Enterprise Repository Era (1990-2010)
- The Policy Era (2010 – ?)
I wrote in that first article that I believed that in 2020, data governance would commence a new era. The explosion in the prominence of data as a weapon against COVID-19 has convinced me that we have indeed embarked on data governance's fourth era, the Invention Era. Data governance must rise to the challenges posed by rapidly changing technology and the enormous volume and variety of data to provide the structure that fosters creativity and innovation. Just as composers and listeners alike have found in the fugue a framework for the invention and appreciation of musical works of genius, data governance can enable individuals and organizations to use data to create potentially endless insights, providing value to businesses and societies.
[1] Goldratt, Eliyahu M. and Cox, Jeff, The Goal, 30th Anniversary Edition, The North River Press Publishing Corporation, 1984
[2] Gartner (Sanjeev Mohan, Lyn Robison, Sumit Pal, Joe Maguire, Dk Mukherjee), “2020 Planning Guide for Data Management”, October 7, 2019
[3] Goldratt and Cox, The Goal, The North River Press Publishing Corporation, 1984, p. 32
[4] https://www.merriam-webster.com/dictionary/invention
[5] Bergh, Christopher; Benghiat, Gil; Strod, Eran, The DataOps Cookbook, DataKitchen, Inc., 2019.
[6] https://www.stitchdata.com/resources/what-is-data-pipeline/
[7] https://www.merriam-webster.com/dictionary/orchestration
[8] Madsen, Laura, Disrupting Data Governance (Technics Publications, 2019), pp. 173-174.
[9] Bergh, Christopher, Orchestrate Your Production Pipelines for Low Errors, webinar broadcast April 22, 2020.
[10] Morrow, Jordan, https://www.qlik.com/us/-/media/files/training/global-us/qlik-education-data-literacy-program-strategy-and-framework.pdf, July 2018
[11] Schwen, Lars Ole and Rueschenbaum, Sabrina, "Ten quick tips for getting the most scientific value out of numerical data", https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6181270/
[12] A data catalog is a useful tool to address items 1 and 2, as it is a collection of metadata, with tools that help users find the authoritative sources of the data they need.
[13] Seiner, Robert S., It’s all in the Data: Everybody is a Data Steward (Get Over It), TDAN.com, July 1, 2016.
[14] Madsen, How to Disrupt Data Governance, webinar broadcast April 21, 2020.
[15] Chen, Winston, A Brief History of Data Governance, Magnitude, June 25, 2010