The Data-Centric Revolution: ABox Versioning

Have you ever invented something, seemingly out of whole cloth, only to do a simple Google search to find out it’s a well-defined discipline you’d never heard of? That’s how this article started. 

The mini-methodology we’re going to describe in this article started life in a discussion with Mike Pool and his team at Bloomberg. They were interested in our approach to versioning in graph databases. We described how we used “semantic versioning” (you know, version numbers like 13.2.1 denoting major, minor, and patch versions) for our ontologies. But the more we talked, the more we started reflecting on other problems we were overlooking.

There are plenty of cases where usage patterns create downstream problems. You might not even change (and therefore not version) the ontology but may change how you’re using it. You may decide that :dependsUpon is a more appropriate predicate than the one you were using, :directlyDependsUpon, and make that switch (without making any changes to the ontology). This innocuous change often breaks downstream programs and queries that referred to the prior property.

Even more diabolical is when the cardinality changes slightly. Again, this can easily happen without any changes to the ontology. For everyone whose SHACL neuron just fired, hold steady; that can be part of the solution, but we want to spend a bit more time with the problem first.

The Problem, More Specifically 

Maybe you have a situation (as we did) where each employee has a single employment agreement. This isn’t an ontological restriction; it is conceivable for someone to have more than one employment agreement, but it’s quite rare. It would be easy (we know, we did it) to write a query that implicitly expected only one employment agreement per employee, because that’s all we’d had for many years. These types of queries break subtly. It’s not like a reference to a property that is no longer there, which fails spectacularly (that is, a query that previously returned a dataset suddenly returns nothing). The extra cardinality creates subtle problems that you might not notice immediately. There is an extra row in a table of 500 results. Who is going to notice that?

Or the converse: there was “always” a budget for every project. Maybe it wasn’t a required field, but up until now, people who set up projects always put a budget in. It might not occur to a query writer to put an “optional” around the part of the query that accesses the budget (those “optionals” are such performance drags). But when someone sets up a project without a budget, we get another silent failure. The project without a budget just vanishes from queries that should have included it.

It’s Not Just an Ontology Versioning Problem 

As we said earlier, this is orthogonal to ontology versioning. This can come up in traditional systems, but we think the flexibility of graph-based systems makes these problems more prevalent. 

So, what to call this? Our first thought was “data versioning.” 

Data Versioning 

At first the idea of data versioning in enterprise systems seems absurd (probably why it doesn’t come up very often). Literally every update to a database creates a new version. While this is true, it isn’t very useful. What good does it do you to know there have been 10,000 versions of the database today? Even knowing there have been 500 versions of the customer master file isn’t very helpful. 

Then, we started working on what we thought would be useful. At first, we called it “ABox Versioning” (because in our semantic nerd speak, the TBox is where terms are defined (the ontology), the CBox is where categories are maintained (the taxonomies), and the ABox is where the assertions live). So ABox Versioning was perfect, until you wanted to talk to anyone outside a small clique.

So, we pivoted to data versioning and worked out a lot of what the rest of this paper will describe. Before writing this article, I wanted to make sure the term wasn’t already taken. It was.  

A quick Google search reveals: Of course there is such a thing as data versioning! (Although it has very little to do with traditional enterprise data.) Data versioning is for data scientists and AI engineers to be able to refer to which version of a dataset they did their analysis or training on. Totally makes sense. Don’t want to squat on their term and cause ambiguity around it. 

Graph Data Versioning 

So, graph data versioning it is. Except I googled that, and it too is already a thing. Still not the thing I was working on, but a thing nonetheless. There is some very cool stuff there, mostly about schema evolution in graph environments, but it is still not the point I was trying to make, so I’m back to semantic nerd speak.

ABox Versioning 

Here’s the deal: we want some way to communicate to consumers of graph data that something has changed that may affect them. Ideally something of the major, minor, patch ilk. We want to warn people according to how concerned they need to be.

And yes, this does have something to do with shapes, as in SHACL shapes, but I think the conversation is broader than that. We want to be able to say, “The shapes of these objects, in this area of the graph, have changed in a way you need to be aware of.”

Major ABox Version Change 

As we alluded to in the intro, the big thing we want to alert consumers of graph data to are cases where the data they are processing has crossed a threshold that is likely to adversely affect them. 

The main one we are targeting is when a shape’s cardinality crosses a very specific threshold. That threshold is 1.00. But not just any 1.00.

Run a query to count the min, average, and max property counts on a class. If you had 1,000 projects that each had a budget, you’d have min 1, average 1.00, and max 1. I’m going to concentrate on the average, but astute readers will realize there is an edge case: if 100 projects had 2 budgets, 100 had none, and the other 800 had exactly 1, you’d get a false-positive 1.00 (min 0, average 1.00, max 2). So really, we’re going to look at changes of the min and max from 1, but the discussion is much easier to follow by tracking changes in the average.
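As a sketch of that query’s logic (in Python over an in-memory list of triples rather than SPARQL against a store; the class and property names here are hypothetical):

```python
from collections import Counter

def property_stats(triples, cls, prop, type_pred="rdf:type"):
    """Min, average, and max count of `prop` per instance of `cls`."""
    instances = {s for (s, p, o) in triples if p == type_pred and o == cls}
    counts = Counter(s for (s, p, o) in triples if p == prop and s in instances)
    per_instance = [counts.get(i, 0) for i in instances]  # include the zeros
    return min(per_instance), sum(per_instance) / len(per_instance), max(per_instance)

# Hypothetical toy data: three projects, one of which is missing a budget.
triples = [
    (":p1", "rdf:type", ":Project"), (":p1", ":hasBudget", ":b1"),
    (":p2", "rdf:type", ":Project"), (":p2", ":hasBudget", ":b2"),
    (":p3", "rdf:type", ":Project"),
]
print(property_stats(triples, ":Project", ":hasBudget"))  # min 0, avg 2/3, max 1
```

In a real implementation this would be a SPARQL aggregate query, but the arithmetic is the same: count occurrences per instance, remembering that instances with zero occurrences still count toward the average.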

Let’s take the case of the project class that previously had exactly 1.00 budgets per project. When that cardinality drops, even to 0.99, we have a problem: some of our queries will be missing a project. Similarly, when the cardinality goes from 1.00 to 1.01, we have introduced the possibility of double counting.

It turns out no other transition matters. It is hard to think of a normal scenario where going from an average of 2.00 to average of 2.01 or even to 3.00 would break a working query. 

The scenario that is on the fence for me is whether a change in the type of the object class counts as a major version change. I think this is going to be a site-by-site decision. We’re going to experiment with it a bit and see how it goes.

Minor ABox Version Change 

Going from 0.99 to 1.00 is not a breaking change. At 0.99 or any lower number, the query writer was already dealing with an optional property. They had been (or should have been) dealing with the optionality, either in their code or with an optional clause in their SPARQL. It is a minor change, and it would be nice to make them aware of it. They may choose to take the optional clause out of their query and get a free performance boost.

In a similar vein, dropping from 1.01 down to 1.00 is also not a breaking change. Again, the programmer or query writer had some strategy for dealing with the extra cardinality (maybe a group by or a distinct, depending on how it showed up). That the property is now exactly one for the whole set is still worth knowing; not as urgent, but nice to know.

I’m going to suggest (and may get shot down for this) that adding entirely new properties to a class is a minor change. Most consumers of a class will be unaffected but may want to know.  

Patches 

I suppose any detectable change in the average cardinality could be considered a patch. There is typically not anything anyone would do with this information, but it is nice to know. 
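The major, minor, and patch rules above can be sketched as a small classifier. The thresholds follow the discussion, though the function name and the epsilon handling are my own:

```python
def classify_change(old, new, eps=1e-9):
    """Classify a cardinality change for one property on one class.
    `old` and `new` are (min, avg, max) tuples."""
    o_min, o_avg, o_max = old
    n_min, n_avg, n_max = new
    # Major: a property that was exactly 1.00 drifts away from 1.00,
    # i.e. required becomes optional or singular becomes multi-valued.
    if abs(o_avg - 1.0) < eps and abs(n_avg - 1.0) >= eps:
        return "major"
    # Also major via the min/max edge case, where the average stays 1.00.
    if o_min == 1 and n_min < 1:
        return "major"
    if o_max == 1 and n_max > 1:
        return "major"
    # Minor: tightening up to exactly 1.00 (0.99 -> 1.00, or 1.01 -> 1.00).
    if abs(n_avg - 1.0) < eps and abs(o_avg - 1.0) >= eps:
        return "minor"
    # Patch: any other detectable change in the average (e.g. 2.00 -> 2.01).
    if abs(o_avg - n_avg) >= eps:
        return "patch"
    return "none"

print(classify_change((1, 1.00, 1), (0, 0.99, 1)))  # major
print(classify_change((0, 0.99, 1), (1, 1.00, 1)))  # minor
print(classify_change((1, 2.00, 4), (1, 2.01, 4)))  # patch
```

Note this only covers cardinality; as discussed above, a new property on a class would separately rate a minor bump, and a change to the object class's type is a site-by-site call.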

Changing Property Patterns 

The example cited above, of changing which property is being used in a graph, will in most cases trigger a major version change. If you went from using the property :directlyDependsUpon to :dependsUpon, and :directlyDependsUpon had an average cardinality of 1.00, this would trip the major version flag, because a property that had been at 1.00 dropped to something less (in this case 0.00). If :directlyDependsUpon was less than 1.00 to start with, this would probably be a minor version change: the query writers would have already considered the property optional; it now doesn’t show up at all, and the new property bumps the minor version.

Detection/Prevention/Correction 

I think the control systems trio of detection, prevention, and correction is a good way to anchor the next bit of the discussion. In a control system, it is generally best if you can prevent all risks. But if you can’t prevent all risks, then you want to make sure you have a way of detecting when they have occurred and correcting (repairing) the damage done. 

In our analogy here, “risk” will be replaced with “change.” One prevention strategy is SHACL shapes. If every update goes through a SHACL engine and every shape has a complete set of constraints, we can prevent changes that would take a property from required to optional or from singular to multi-valued. It’s one thing (and a good thing) to be able to prevent unanticipated changes like this, but at some point you may intentionally decide to change the cardinality, and you still need a way to communicate those changes to consumers of your data. One way to do that is covered below, under communicating version changes. The other issue is that not all sites have 100% SHACL validation on all their classes, which means they need to rely even more on the detection and correction tactics.
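A minimal sketch of such a constraint, assuming a hypothetical ex:Project class with an ex:hasBudget property (SHACL, in Turtle):

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <https://example.com/ns#> .

ex:ProjectShape
    a sh:NodeShape ;
    sh:targetClass ex:Project ;
    sh:property [
        sh:path ex:hasBudget ;
        sh:minCount 1 ;   # required: rejects a project with no budget
        sh:maxCount 1 ;   # singular: rejects a project with two budgets
    ] .
```

With this shape enforced on every update, the budget cardinality simply cannot drift away from 1.00 without someone deliberately editing the shape, which is itself a versionable, communicable event.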

The detection part of the triad implies that we can write a query that will detect the kinds of changes we’re talking about. Indeed, the query that gets the basic as-implemented shape is not too difficult. The slightly harder bit is keeping a baseline and detecting and reporting changes against it. We’ll have a bit more to say about that in the communication section.
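The baseline comparison might look something like this sketch (a baseline kept between runs as a dict of per-class, per-property stats; the key format and names are mine):

```python
def diff_shapes(baseline, current):
    """Compare a saved baseline of (min, avg, max) stats against a fresh run.
    Both args are dicts keyed by "Class|property" strings."""
    changes = []
    for key in sorted(set(baseline) | set(current)):
        old = baseline.get(key, (0, 0.0, 0))  # absent from baseline: new property
        new = current.get(key, (0, 0.0, 0))   # absent from current: property gone
        if old != new:
            changes.append((key, old, new))
    return changes

# Hypothetical baseline, persisted between runs (we would keep it in the store).
baseline = {":Project|:hasBudget": (1, 1.00, 1)}
current  = {":Project|:hasBudget": (0, 0.99, 1)}
print(diff_shapes(baseline, current))
```

Each reported change can then be classified as major, minor, or patch using the rules from the earlier sections, and the baseline replaced with the current stats for the next run.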

Finally, assuming you detect a change, you want some mechanism for making the repair as simple as possible. Part of the solution we have been implementing internally is to put all queries into the triple store. Our current implementation relies on a string search through those queries to find the ones that rely on the property in question. This at least gets a candidate list of queries to be reviewed. A future version that hasn’t made its way up the priority ladder is: at the moment of storage, parse the query and attach it to all the metadata it refers to. This makes finding the affected queries easier. By the way, this doesn’t catch every case; there are some meta cases where the property in question isn’t explicitly named in the query. We’ll deal with those as we come to them.
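The string-search step is as naive as it sounds; a sketch (the stored query names and texts are hypothetical):

```python
def candidate_queries(stored_queries, property_iri):
    """Return names of stored queries whose text mentions the property.
    A candidate list for human review, not a guarantee: queries that reach
    the property indirectly (the "meta" cases) won't be caught."""
    return [name for name, text in stored_queries.items()
            if property_iri in text]

# Hypothetical stored queries (we keep ours in the triple store itself).
stored = {
    "project-report": "SELECT ?p ?b WHERE { ?p :hasBudget ?b }",
    "org-chart":      "SELECT ?e WHERE { ?e :reportsTo ?m }",
}
print(candidate_queries(stored, ":hasBudget"))  # ['project-report']
```

The parse-and-attach approach would replace the substring test with a lookup against metadata links recorded when each query was stored.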

The other side of this, which affects us a lot less but still a bit, is references to the ontology in code. Our environment is mostly model-driven, so the number of references to domain objects in our code is far less than in traditional development. That said, we still have some cases. And the architecture itself is expressed in an ontology; if the cardinality of architectural shapes changes, there will be big side effects in the code. At the moment, our main recourse is to grep the source code.

Communicating Version Changes 

OK, so we detected some major or minor changes to the shapes of some of the classes in our domain. How are we going to talk about this and communicate it to our consumers?

First note that every major change will increment the left-hand version number by 1. If we were at version 3.2.1 and had a major version change, we would now be at version 4.0.0. A subsequent minor change would take us to 4.1.0. 
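That bump logic is the same as in semantic versioning; as a sketch (the function name is mine):

```python
def bump(version, change):
    """Semantic-version-style bump: a major change resets minor and patch,
    a minor change resets patch, a patch bumps only the last number."""
    major, minor, patch = (int(n) for n in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version

print(bump("3.2.1", "major"))  # 4.0.0
print(bump("4.0.0", "minor"))  # 4.1.0
```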

Note that declaring versions at the graph level is probably too broad. In a large graph, we could imagine going from, say, version 84.0.0 to 85.0.0 every week. You would be broadcasting change notices to a lot of people who will be unaffected.

The flip side, versioning every class, is likely too granular, although in some cases this may work. Most of our implementations have hundreds of classes; many others have thousands. I suppose we could have a mini configuration file that declared what each class’s version was and is.

I think we’re going to start with data domains, beginning at the top level of gist. Since virtually all our classes are proper descendants of 14 high-level gist classes, that seems like a reasonable place to start. If there is a major version change in any of the subclasses of, say, gist:Organization or gist:Event or gist:Place, the people affected will likely know right away: “Oh, that is probably going to affect me” or “No, I can safely ignore that.” Since any subclass of those top-level classes could increment the version number, there will be a few more version changes at the top level, but it seems like a good tradeoff for simplifying the notifications.

Summary 

There is a hidden problem in graph databases that arises at least in part from their flexibility. Traditional systems change more slowly, and their changes tend to be driven by changes in the schemas, which create an early warning sign for affected developers. 

In graph systems, subtle changes in usage can change the effective shape of parts of the schema and, if made without warning, can break existing queries or code.

ABox versioning gives us a way to detect and communicate these changes. Presumably most sites will want to implement this in their development and test environments to minimize the effect on live data. 

Dave McComb

Dave McComb is President and co-founder of Semantic Arts, Inc. (https://www.semanticarts.com), a Fort Collins, Colorado-based consulting firm. Since its founding in 2000, Semantic Arts has remained 100% focused on helping clients adopt Data-Centric principles and implement Enterprise Knowledge Graphs. Semantic Arts has 30 employees in three countries, and is currently helping ten highly visible Fortune 500 companies adopt Data-Centric. Semantic Arts was named a Colorado Company to Watch in 2022 (https://coloradocompaniestowatch.org/winners/2022-winners/). Dave is the author of Semantics in Business Systems (https://www.amazon.com/dp/1558609172/), Software Wasteland (https://www.amazon.com/dp/1634623169/), bulk purchases available through Technics (https://technicspub.com/software_wasteland/), and The Data-Centric Revolution (https://www.amazon.com/dp/1634625404/), which is also available for bulk buys (https://technicspub.com/data-centric/).
