The Data-Centric Revolution: The Role of SemOps (Part 2)

In our previous installment of this two-part series we introduced three ideas.

First, data governance may be more similar to DevOps than first meets the eye.

Second, the rise of Knowledge Graphs, Semantics and Data-Centric development will bring with it the need for something similar, which we are calling “SemOps” (Semantic Operations).

Third, when you peel back what people are doing in DevOps and Data Governance, you get down to five key activities that will be very instructive in our SemOps journey:

Quality
Allowing/ “Permission-ing”
Predicting Side Effects
Constructive Change
Traceability

We’ll take up each in turn and compare and contrast how each activity is performed in DevOps and Data Governance to inform our choices in SemOps.

But before we do, I want to cover one more difference: how the artifacts scale under management.

Code

There isn’t any obvious hierarchy to code, from abstract to concrete or general to specific, as there is in data and semantics. It’s pretty much just a bunch of code, partitioned by silos. Some of it you bought, some you built, and some you rent through SaaS (Software as a Service).

Each of these silos often represents a lot of code. Something as simple as QuickBooks is 10 million lines of code. SAP is hundreds of millions. Most in-house software is not as bloated as most packages or software services; still, it isn’t unusual to have millions of lines of code in an in-house developed project (much of it is in libraries that were copied in, but it still represents complexity to be managed). The typical large enterprise is managing billions of lines of code.

The only thing that makes this remotely manageable is, paradoxically, the thing that makes it so problematic: isolating each codebase in its own silo. Within a silo, the developer’s job is to not introduce something that will break the silo and to not introduce something that will break the often fragile “integration” with the other silos.

Data and Metadata

There is a hierarchy to data that we can leverage for its governance. The main distinction is between data and metadata.

There is almost always more data than metadata: more rows than columns. But in many large enterprises there is far, far more metadata than anyone could possibly guess. We were privy to a project to inventory the metadata for a large company, which shall go nameless. At the end of the profiling, it was discovered that there were 200 million columns under management across the firm. That is columns, not rows. No doubt there were billions of rows in all their data.

There are also other levels that people often introduce to help with the management of this pyramid. People often separate Reference data (e.g., codes and geographies) and Master data (slower changing data about customers, vendors, employees and products).

These distinctions help, but even as the data governance people are trying to get their arms around this, the data scientists show up with “Big Data.” Think of big data as sitting below the bottom of this pyramid. Typically, it is even more voluminous, and usually has only the most ad hoc metadata (the “keys” in the “key/value pairs” in deeply nested JSON data structures are metadata, sort of, but you are left guessing what these short cryptic labels actually mean).

Knowledge Graph / Data-Centric

The knowledge graph is also layered, but several aspects of its layering are especially interesting for governance and SemOps. The first is that, done well, the ontology that holds the whole thing together is “human scale.” Unlike 200 million bits of metadata or even 10,000 applications, the top ontology is often 400-600 concepts. Not only is it understandable at that level, but it also doesn’t change as often. These are 400-600 enduring themes that change slowly.

The taxonomy is usually thousands to tens of thousands of concepts, but unlike the ontology, they have little structure or complexity, and it is rare (but occasional) that developers hard code to the taxonomy terms (they do this all the time with the ontology, even though this isn’t best practice).

We show the data segmented here to indicate that unlike data warehouses, and data lakes, there isn’t a need to co-locate all the data, and as we’ll see later there are some great advantages in performance and security for not co-locating all the data.

Ok, now we’re ready to start looking at how governance is different in our three domains, by our five activities.

Quality

The primary approaches to detecting quality issues in each of the sub domains are:

Software	Data / Metadata	Knowledge Graph / Data-Centric
Regression testing	Statistics and conformance between metadata and data	Inference, patterns, hygiene queries

Software developers typically build vast suites of tests: unit tests, regression tests, systems tests and the like. Developers know it is futile to try to predict what is going to break when a change is made. Make the change, run the tests, repair if necessary. Part of this is because of how little layering or leveling there is in software code. Some development shops are good at “design by contract” around their APIs and can limit the side effects, but most are not.

Most enterprises live with acceptable levels of data and metadata quality. They know they live with “dirty data,” unnecessary duplicates, missing data, inconsistencies and the like. It is very hard to get ahead on data quality because there is so much of it. Data quality ends up being dashboards, where earnest people try to keep a trend line (e.g., orders taken without a credit report on file, or addresses without postal codes) moving in the right direction.

In the knowledge graph world, the first quality approach is rigorous socialization. The ontology is small enough that it can be reviewed, in great detail, by technologists as well as subject matter experts. Inspection is also a good technique in coding, but the volume of changes in code bases is typically so much greater, and the ability for non-technical people to contribute to the review is limited.

By using formal axioms in the definition of concepts, the system (a “reasoner” or “inference engine”) can detect a large range of logical errors that would not be caught in a traditional system until much later in the process. Also, we are finding that just putting data in a graph gives easy-to-write-and-execute queries that can detect anomalous patterns.
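As a rough illustration of the kind of hygiene query described above, here is a minimal sketch in plain Python. The triples, the class names (`Order`, `Customer`) and the predicates (`placedBy`, `hasPostalCode`) are all hypothetical, and an in-memory list stands in for a real triple store; in practice this would more likely be a short SPARQL query.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); every name is illustrative.
TRIPLES = [
    ("order1", "type", "Order"),
    ("order1", "placedBy", "cust1"),
    ("order2", "type", "Order"),          # no placedBy: an anomaly
    ("cust1", "type", "Customer"),
    ("cust1", "hasPostalCode", "80525"),
    ("cust2", "type", "Customer"),        # no postal code: an anomaly
]


def missing_property(triples, class_name, required_predicate):
    """Return the instances of class_name that lack required_predicate, sorted."""
    predicates_by_subject = defaultdict(set)
    instances = set()
    for subject, predicate, obj in triples:
        predicates_by_subject[subject].add(predicate)
        if predicate == "type" and obj == class_name:
            instances.add(subject)
    return sorted(
        s for s in instances
        if required_predicate not in predicates_by_subject[s]
    )


orders_without_customer = missing_property(TRIPLES, "Order", "placedBy")
customers_without_postcode = missing_property(TRIPLES, "Customer", "hasPostalCode")
print(orders_without_customer)
print(customers_without_postcode)
```

The same helper can be pointed at any class and predicate pair, which is part of why such checks tend to be cheap to write once the data is in graph form.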

Allowing/ “Permission-ing”

Authorization is the process of deciding who can see and modify the code or the data. The other side of security, authentication (how do we know you are who you claim to be), is not so different among the three realms, but the authorization is different.

Software	Data / Metadata	Knowledge Graph / Data-Centric
Few roles, automated turnover process	Entitlements, and many hard coded application specific roles.	Roles and rules derived from your real world relationships including your relationship to the data.

It is essential that unauthorized actors do not get into the codebase. Getting into the codebase is a vector where all kinds of bad things can happen. Generally, developers have a few well-defined roles (e.g., Developer, QA, Move-to-Production) with no overlap in duties. Developers code in a partitioned environment but do not have permission to move their code to the QA environment. The QA developers independently verify that the code works and does not introduce stray code into the codebase; then they approve the code to move to production, at which point the Move-to-Production team takes over.

For metadata, the process is much more like code, in that a change to metadata can easily break production code. If you drop a table that a program relies on, the program will fail.

Data is different. The authorization rules for any given application system are designed as the application is being designed or implemented. The most common approach is to invent a series of “roles,” such as Admin, AR Entry, and Vendor Management. A system admin assigns people to these roles (often in records called “entitlements”), and the developer's code inspects each authenticated user, picks up that user's entitlements and grants or denies access either to the use case or, occasionally, more directly to the data (as defined by its metadata). The data and system entitlements are almost always limited to an individual application. Unfortunately, the same data and same functions are often present in other applications, and it is a very difficult discipline to apply the rules consistently over many applications and users.

Knowledge Graphs open up new possibilities, but they also raise the stakes considerably. Even if all the data isn’t co-located, the shared meaning and federated queries often mean that it is easy to wander your way into data you shouldn’t have access to. We have just begun working on architectures and patterns that take advantage of our ability to know someone’s real life roles (if you filed an insurance claim, you’re a “claimant”) and their relationship to specific data (being a claimant should only give you access to your claims, being a claims manager should give you access to a much wider set of claims).
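To make the claimant and claims manager example concrete, here is a minimal sketch in which access is derived from relationships stored in the graph rather than from hard-coded application roles. The facts, the predicate names (`filedClaim`, `managesRegion`, `inRegion`) and the rule that a manager sees claims in regions they manage are all assumptions made for illustration, not a description of any particular architecture.

```python
# Hypothetical real-world facts stored as (subject, predicate, object).
FACTS = {
    ("alice", "filedClaim", "claim1"),
    ("bob", "filedClaim", "claim2"),
    ("carol", "managesRegion", "west"),
    ("claim1", "inRegion", "west"),
    ("claim2", "inRegion", "east"),
}


def objects(facts, subject, predicate):
    """Return every object linked to subject by predicate."""
    return {o for s, p, o in facts if s == subject and p == predicate}


def can_view_claim(facts, person, claim):
    """Decide access from the person's relationship to the claim."""
    # Claimant rule: you may see a claim you filed yourself.
    if claim in objects(facts, person, "filedClaim"):
        return True
    # Claims manager rule (assumed): you may see claims in a region you manage.
    managed_regions = objects(facts, person, "managesRegion")
    claim_regions = objects(facts, claim, "inRegion")
    return bool(managed_regions & claim_regions)


print(can_view_claim(FACTS, "alice", "claim1"))
print(can_view_claim(FACTS, "alice", "claim2"))
print(can_view_claim(FACTS, "carol", "claim1"))
```

Because the rules read from the same graph as everything else, they are not tied to one application's entitlement records.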

Separating especially sensitive data to repositories with very limited access is another way to implement authorization control in a Knowledge Graph.

Predicting Side Effects

Predicting side effects is about contemplating a change, and then trying to determine what else will be affected.

Software	Data / Metadata	Knowledge Graph / Data-Centric
Very little attempt to predict, change and test	Ultra conservative in change making as most metadata changes break code	Use the graph to predict the change

Most developers have a pretty good sense of what kinds of changes will break things. But they use this knowledge more to predict effort than specific impact. In practice, they seem to make the change and run the tests. For example, if a change is later found to create defects in production, the approach is to write another test that would have detected that problem.

Metadata change is very slow. Data modelers know that once there is a body of code dependent on a set of metadata, almost any change will have a major ripple effect. You won’t necessarily know the impact until you make the change, as the cause of the failure is often references to metadata in code. My observation is that in a typical business application, half of the code directly or indirectly addresses the metadata.

The data is easy. No one contemplates regression testing or impact analysis when they are changing data. When you add a new contact to your contact management system you don’t expect the system to crash. You rely on the fact that validation and constraint management coded into the application will prevent you from entering data that would break the system.

In a knowledge graph system, especially one that relies heavily on model driven development, most of the dependencies are right there in the graph. If you write a query that refers to Employee and Salary, your query persistence routine can easily parse the query text and attach a link from the query to the two concepts. This way, when you consider changing something about Employee or Salary, it is easy to find the parts of the system that will be affected.
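A minimal sketch of that idea follows. The `QueryRegistry` class, the concept list and the query names are hypothetical, and the regular expression is a deliberately crude stand-in for a proper SPARQL parser: it only recognizes prefixed names such as `:Employee`.

```python
import re

# Hypothetical set of ontology concepts.
CONCEPTS = {"Employee", "Salary", "Department"}


def concepts_in_query(query_text, known_concepts):
    """Find known concepts referenced as prefixed names, e.g. :Employee."""
    tokens = set(re.findall(r":(\w+)", query_text))
    return tokens & known_concepts


class QueryRegistry:
    """Stores saved queries along with links to the concepts they mention."""

    def __init__(self, known_concepts):
        self.known_concepts = known_concepts
        self.links = {}  # query name -> set of concept names

    def save(self, name, query_text):
        self.links[name] = concepts_in_query(query_text, self.known_concepts)

    def affected_by(self, concept):
        """Return the saved queries that would be touched by changing concept."""
        return sorted(n for n, cs in self.links.items() if concept in cs)


registry = QueryRegistry(CONCEPTS)
registry.save(
    "payroll_report",
    "SELECT ?e ?s WHERE { ?e a :Employee ; :hasSalary ?s . ?s a :Salary }",
)
registry.save("dept_list", "SELECT ?d WHERE { ?d a :Department }")

print(registry.affected_by("Salary"))
print(registry.affected_by("Department"))
```

In a real system the links would themselves be stored as triples, so the impact analysis becomes just another graph query.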

Constructive Change

Here the issue is: if you decide to make a change, and your impact analysis tells you what will be affected, can the system help with the change?

Software	Data / Metadata	Knowledge Graph / Data-Centric
Some refactoring	The metadata side can be scripted, but the code will have to be fixed by hand	There are some scripts that can be packaged with the change

For code, we mostly just have renaming and refactoring. If someone decides to change Employee to Personnel, the coder will have a lot of find and replace style refactoring.

Some of the data / metadata changes can be automated. If you decide that a previously optional field is to become required, you will have to pick a suitable default and run a script to update the data to agree with the metadata.
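The optional-to-required case might be scripted roughly as follows. This sketch uses Python's built-in `sqlite3` module with an in-memory database; the `contact` table, the `country` column and the `UNKNOWN` default are invented for illustration. The final step of tightening the constraint is left as a comment because the syntax varies by database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT)"
)
conn.executemany(
    "INSERT INTO contact (name, country) VALUES (?, ?)",
    [("Ann", "US"), ("Raj", None), ("Mei", None)],
)

DEFAULT_COUNTRY = "UNKNOWN"  # hypothetical default chosen for the backfill

# Step 1: update existing rows so the data agrees with the new metadata.
cursor = conn.execute(
    "UPDATE contact SET country = ? WHERE country IS NULL", (DEFAULT_COUNTRY,)
)
rows_backfilled = cursor.rowcount

# Step 2: verify that no NULLs remain before making the column required.
remaining_nulls = conn.execute(
    "SELECT COUNT(*) FROM contact WHERE country IS NULL"
).fetchone()[0]

# Step 3 (not shown): apply the NOT NULL constraint using whatever DDL
# the target database supports.
print(rows_backfilled, remaining_nulls)
```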

In the Knowledge Graph world, we often find that quite significant changes can be packaged with an update-in-place script (a SPARQL update query, typically) and pushed with the change.
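As a sketch of what such an update-in-place script does, the following renames a predicate across a small set of in-memory triples. The predicate names (`worksFor`, `employedBy`) and the data are hypothetical; the comment shows what a corresponding SPARQL update could look like, but only the Python is executed here.

```python
# A corresponding SPARQL update, for orientation only (prefix declarations
# omitted; not executed here):
#   DELETE { ?s :worksFor ?o }
#   INSERT { ?s :employedBy ?o }
#   WHERE  { ?s :worksFor ?o }


def rename_predicate(triples, old_predicate, new_predicate):
    """Return a new set of triples with old_predicate replaced by new_predicate."""
    return {
        (s, new_predicate if p == old_predicate else p, o)
        for s, p, o in triples
    }


# Hypothetical graph contents.
graph = {
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("acme", "locatedIn", "denver"),
}

migrated = rename_predicate(graph, "worksFor", "employedBy")
print(sorted(migrated))
```

The ontology change and the data migration can then travel together as one package.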

Traceability

Finally, one part of governance is: Can we assure ourselves that we can account for all the changes in the system?

Software	Data / Metadata	Knowledge Graph / Data-Centric
Change logs	Metadata change logs	Provenance

In a modern development environment, every line of code is associated with a specific change or “delta.” Code repositories are able to present the change, what was changed along with it, who made the change, when it was made, when it was pushed to production, etc.

Data and metadata systems have something similar, although they are nowhere near as consistent. Each application may log data its own way, and each Database Management System logs changes to metadata in its own way.

Knowledge graphs rely on provenance, which is a rich vocabulary for tracing a change all the way back to its source. Provenance, in the real world, is about detecting whether the chain of custody for a piece of artwork is legitimate (i.e., there is a continual record of ownership, and there haven’t been episodes when it was stolen or plundered). Similarly, provenance is used to detect blood diamonds by tracing the chain of ownership back to a mine that either is or isn’t known for cruelty to its workers.

For the knowledge graph we can determine where an individual triple came from. We can also determine if that triple was generated from an analytic step, what the steps were, what input it relied on, and so on, back to the equivalent of the diamond mine.
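Walking a derived triple back to its origins can be sketched as a simple graph traversal. The provenance records below, including the step names and source identifiers, are invented for illustration; a real knowledge graph would express them in a provenance vocabulary rather than a Python dictionary.

```python
# Hypothetical provenance records: each derived item names the step that
# produced it and the inputs that step relied on.
PROVENANCE = {
    "triple:riskScore": {
        "step": "risk_model_v2",
        "inputs": ["triple:claimCount", "triple:creditRating"],
    },
    "triple:claimCount": {"step": "count_claims", "inputs": ["source:claims_db"]},
    "triple:creditRating": {"step": "load_bureau_file", "inputs": ["source:bureau_feed"]},
}


def trace_to_sources(item, provenance):
    """Follow inputs backwards and return the original sources, sorted.

    An item with no provenance record is treated as an original source,
    the equivalent of the diamond mine.
    """
    sources = set()
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        record = provenance.get(current)
        if record is None:
            sources.add(current)
        else:
            stack.extend(record["inputs"])
    return sorted(sources)


print(trace_to_sources("triple:riskScore", PROVENANCE))
```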

Summary

By disassembling DevOps and Data Governance and rethinking what a SemOps platform needs to do, we think we have outlined a practice that may let us make changes to Ontologies, Taxonomies or even model driven applications, predict with a great deal of confidence what the side effects will be, and arrange to push those changes to production with little human intervention.

Unlike Data Governance with its army of Data Stewards, Quality Assurance Professionals and Change Managers, we think Knowledge Graph based systems will be human-scaled and much easier to manage.

Unlike code, we believe that a mature Knowledge Graph environment will not have hundreds of thousands of lines of test code to be run every time a change is made, and that if we can predict and introduce the constructive changes, we have a path to continuous evolution.
