How a Data-Centric Environment Becomes Harder to Govern
A traditional data landscape has the advantage of being extremely silo-ed. By taking your entire data landscape and dividing it into thousands of databases, there is the potential that each database is small enough to be manageable.
As it turns out, this is more potential than actuality. Many of the individual application data models that we look at are individually more complex than the entire enterprise model should be. However, that doesn’t help anyone trying to govern. It is what it is.
What is helpful about all this silo-ization is that each silo has a smaller community of interest. When you cut through all the procedures, maturity models and the like, governance is a social problem. Social problems, such as “agreement,” get harder the more people you get involved.
From this standpoint, the status quo has a huge advantage, and a Data-Centric firm has a big challenge: there are far more people whose agreement one needs to solicit and obtain.
The other problem that Data-Centric brings to the table is the ease of change. Data Governance likes things that change slower than the process can manage. Often this is a toss-up. Most systems are hard to change and most data governance processes are slow. They are pretty much made for each other.
I remember when we built our first model-driven application environment (unfortunately we chose health care for our first vertical). We showed how you could change the UI, API, Schema, Constraints, etc. in real time. This freaked our sponsors out. They couldn’t imagine how they would manage [govern] this kind of environment. In retrospect, they were right. They would not have been able to manage it.
This doesn’t mean the approach isn’t valid; it means we need to spend a lot more time on the approach to governance. We have two huge things working against us: we are taking the scope from tribal silos to the entire firm, and we are increasing the tempo of change.
How a Data-Centric Environment Becomes Easier to Govern
A traditional data landscape has the disadvantage of being extremely silo-ed. You get some local governance from being silo-ed, but you have almost no hope of enterprise governance. This is why it’s high-fives all around for local governance, while little progress is made on firm-wide governance.
One thing that data-centric provides that makes the data governance issues tractable is incredible reduction in complexity. Because governance is a human activity, getting down to human scales of complexity is a huge advantage.
Furthermore, to enjoy the benefits of data-centric you have to be prepared to share. A traditional environment encourages copying of enterprise data to restructure it and adapt it to your own local needs. Pretty much all enterprises have data on their employees. Lots of data actually. A large percentage of applications also have data on employees. Some merely have “users” (most of whom are employees) and their entitlements, but many have considerably more. Inventory systems have cycle counters, procurement systems have purchasing agents, incident systems have reporters, you get the pattern.
Each system is dealing with another copy (maybe manually re-entered, maybe from a feed) of the core employee data. Each system has structured the local representation differently and of course named all the fields differently. Some of this is human nature, or maybe data modeler nature, that they want to put their own stamp on things, but some of it is inevitable. When you buy a package, all the fields have names. Few of them, if any, are the names you would have chosen, or the names in your enterprise model, if you have one.
With the most mature form of data-centric, you would have one set of enterprise employee data. You can extend it, but the un-extended parts are used just as they are. For most developers, this idea sounds either too good to be true or too bad to be true. Most developers are comfortable with a world they control. This is a world of tables within their database. They can manage referential integrity within that world. They can predict performance within that world. They don’t like to think about a world where they have to accept someone else’s names and structures, and to agree with other groups’ decision making.
But once you overcome developer inertia on this topic and you are actually re-using data as it is, you have opened up a channel of communication that naturally leads to shared governance. Imagine a dozen departments consuming the exact same set of employee data. Not local derivations of the HR golden record, or the LDAP files, but an actual shared data set. They are incented to work together on the common data. The natural thing to happen, and we have seen this in mature organizations, is that the focus shifts to the smallest, realest, most common data elements. This social movement, and this focus on what is key and what is real, actually makes it easier to have common governance. You aren’t trying to foist one application’s view of the world on the rest of the firm; you are trying to get the firm to understand and communicate what it cares about and what it shares.
And this creates a natural basis for governance despite the fact that the scope became considerably larger.
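The idea of one shared employee record that departments extend rather than copy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, field names and sample values below are hypothetical and do not come from any particular enterprise model.

```python
from dataclasses import dataclass


# Hypothetical shared core record. The field names are illustrative only.
@dataclass(frozen=True)
class Employee:
    employee_id: str
    name: str
    email: str


# Each department adds its own facts but references the shared record
# instead of re-keying or renaming the core fields.
@dataclass(frozen=True)
class PurchasingAgent:
    employee: Employee
    spending_limit: int


@dataclass(frozen=True)
class CycleCounter:
    employee: Employee
    warehouse: str


core = Employee("E-1001", "Pat Lee", "pat.lee@example.com")
agent = PurchasingAgent(core, spending_limit=50_000)
counter = CycleCounter(core, warehouse="WH-7")

# Both extensions point at the very same core record, not at local copies.
shares_core = agent.employee is counter.employee
```

Because procurement and inventory both hold a reference to the same `Employee`, any disagreement about what a name or an email means has to be settled once, on the shared record, which is the social effect described above.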
Critical Data Elements
Is it just me, or is this idea of managing “Critical Data Elements” in a traditional environment insane? A large firm typically has millions of data elements. We are working with a vendor that does detailed scanning and profiling of large complex data-scapes. They are in the midst of a scan (which will take many months to complete) of a large bank (it’s one of the “too big to scale” banks) and they are now predicting that they will discover over a billion data elements [columns]. A billion. Are you kidding me? This is a billion columns, not a billion rows.
And faced with that impossible to comprehend level of complexity, what do firms do? They pick a few dozen to understand very, very well. Somehow this reminds me of the George Carlin routine of the seven words you can’t say on television.[1] There are 500,000 words in the English language and seven that you can’t say on broadcast TV. “They must really be bad!”
In the same sort of way, you have between a million and a billion (a billion, really?) elements under “management” and you are going to govern a couple dozen of them. These must be very, very important data elements. I’m going to let you in on a little secret: this is a semi-randomly selected list that is all a group of earnest people can handle in a finite amount of time. It doesn’t mean anything. It is an audit, not governance. If these several dozen can’t be traced and documented, there is no hope for the other 999,950 or 999,999,950.
Is This Even Governance, or Something Else?
Governance tends to be a top-down, command-and-control enterprise. It tends to be more focused on dampening change than encouraging it.
But the data-centric enterprise needs to run on shared trust. And it wants to encourage rapid change rather than stolid status quo.
Some of our clients are finding even the word “governance” (or as they say, “Capital G Governance”) is part of the problem. It’s part of the problem because just putting the label “governance” on an activity attracts corporate types who believe it is their prerogative to establish control. Further, the term sends the wrong message. Governance typically is about conforming to procedures.
What we’re looking for in the data-centric approach is sharing.
Some Approaches We Are Finding Very Helpful
We’re borrowing as much as we can from the Linked Open Data movement. Linked Open Data[2] is a movement around publishing data in open standards in a structure-free (graph) format, which makes it as easy as possible for others to consume the data directly. The Linked Open Data (LOD) movement originally coalesced around DBpedia, which was a semantic version of Wikipedia. Since then thousands of sites and data sources have joined the confederation.
The first thing to know, and to learn from, is that no one is in charge of LOD. Therefore, its governance approaches cannot be top-down and heavy-handed. They govern by encouraging sharing. If no one shares the data, then it is lost to the world.
When we adopt this approach in our enterprises, we win. Rather than mandating how people are to comply, we should set up meritocracies of data. If your data is worthwhile and easy to consume, then it will get used. The only thing I want to offer to mitigate a full-on “let a thousand flowers bloom” is that the cacophony of a thousand flowers may be less productive than a small amount of coordination. Creating a nucleus of key shared concepts around which the shareable data can form can be very productive.
We are finding most of the approaches from LOD useful in data-centric governance. For instance, the mantra “Cool URIs don’t change[3]” means once you publish an identifier (even internally in your enterprise) you should make every effort to make it permanent. This goes for classes and properties as well as URIs for specific things like people and products.
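One way to picture “Cool URIs don’t change” inside an enterprise is an append-only registry: once an identifier is published it can be deprecated, but never reassigned or removed. The sketch below is a minimal illustration, not a description of any real tool; the `IdentifierRegistry` class and the `https://data.example.com/id/` base address are assumptions made up for the example.

```python
class IdentifierRegistry:
    """Hypothetical append-only registry of published identifiers."""

    def __init__(self, base):
        self._base = base.rstrip("/")
        self._entries = {}  # uri -> {"label": str, "deprecated": bool}

    def publish(self, local_name, label):
        uri = f"{self._base}/{local_name}"
        if uri in self._entries:
            # A published URI is permanent: it cannot be given a new meaning.
            raise ValueError(f"{uri} is already published")
        self._entries[uri] = {"label": label, "deprecated": False}
        return uri

    def deprecate(self, uri):
        # Deprecation flags the entry but keeps it resolvable.
        self._entries[uri]["deprecated"] = True

    def resolve(self, uri):
        return self._entries[uri]


registry = IdentifierRegistry("https://data.example.com/id/")
employee_class = registry.publish("Employee", "Employee (class)")
registry.deprecate(employee_class)
still_resolves = registry.resolve(employee_class)["label"]
```

The registry deliberately has no delete or rename operation. Consumers that stored the URI last year can still resolve it today, even if the publisher would now prefer a different term.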
Graph databases are very helpful in promoting sharing, as consumers do not need to agree on a rigid structure and are not limited to a single type for an instance. Both of these make it far easier to share information.
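The point about multiple types per instance can be shown with nothing more than a set of subject, predicate, object triples. This is a toy sketch in plain Python rather than a real graph database; the prefixes (`emp:`, `ex:`) and property names are hypothetical.

```python
# Triples as (subject, predicate, object) tuples. All names are illustrative.
triples = {
    ("emp:E-1001", "rdf:type", "ex:Employee"),
    ("emp:E-1001", "rdf:type", "ex:PurchasingAgent"),
    ("emp:E-1001", "rdf:type", "ex:IncidentReporter"),
    ("emp:E-1001", "ex:name", "Pat Lee"),
    ("emp:E-1001", "ex:spendingLimit", 50000),
}


def types_of(subject):
    """Return every type asserted for a subject, in sorted order."""
    return sorted(o for s, p, o in triples if s == subject and p == "rdf:type")


def describe(subject, wanted):
    """Return only the properties a given consumer cares about."""
    return {p: o for s, p, o in triples if s == subject and p in wanted}


all_types = types_of("emp:E-1001")
hr_view = describe("emp:E-1001", {"ex:name"})
procurement_view = describe("emp:E-1001", {"ex:name", "ex:spendingLimit"})
```

The same instance is an employee, a purchasing agent and an incident reporter at once, and each consumer reads only the properties it needs. Nobody had to agree on a single table layout first.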
Strict versioning is important, as people will not commit to something if they believe it may change out from under them in ways that would be disruptive.
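The text does not name a versioning scheme, so the following sketch assumes a semantic-versioning style convention (major.minor.patch) applied to a shared model: removing a property that consumers may rely on forces a major bump, while adding one is only a minor bump. The function names and property names are hypothetical.

```python
def required_bump(old_props, new_props):
    """Classify a model change by comparing the published property sets."""
    old, new = set(old_props), set(new_props)
    if old - new:
        return "major"  # something consumers may depend on disappeared
    if new - old:
        return "minor"  # purely additive change
    return "patch"      # no change to the published properties


def next_version(version, bump):
    """Compute the next major.minor.patch version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


additive = required_bump({"name", "email"}, {"name", "email", "hireDate"})
breaking = required_bump({"name", "email"}, {"name"})
after_additive = next_version("1.4.2", additive)
after_breaking = next_version("1.4.2", breaking)
```

A consumer that commits to major version 1 can then accept any 1.x release knowing nothing it uses has been taken away, which is the assurance that makes people willing to share.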
Summary
The data-centric approach turns a lot of what we know about governance on its head. The scope of sharing and the rate of change are major challenges to traditional data governance. At the same time, the reduction in complexity and the encouragement of sharing can set up an environment where governance comes naturally.
https://www.amazon.com/dp/1634623169/
[1] https://www.youtube.com/watch?v=kyBH5oNQOS0
[2] https://www.w3.org/standards/semanticweb/data
[3] https://www.w3.org/Provider/Style/URI