In my last piece, “Where does semantic debt come from?,” I described the Semantic Debt Life-cycle. At a high level, “semantic debt” is the difference between an organization’s business model (its current processes, expectations, and strategies) and the data models that describe those processes and strategies. Semantic debt increases as an organization’s business model drifts from what is described by its data models, making those data models less accurate and less valuable to the organization over time.
Organizations create data models to structure extracts from various systems; those extracts support downstream processes that need data to make decisions. This modeling may be conscious, or it may be implicit in the software development process. We call these extracts (the data model and the data structured by the model) “semantic assets.”
Assets are often combined with other assets, and new extracts are created from the combinations as needed to support processes further downstream. But over time, every asset’s structure (or data model) looks less and less like the business it was intended to describe. This decline in value is natural: the business is organic and always changing, while assets are fixed. I identified four types of debt (four reasons we might see assets become less useful over time) and provided examples of each, although the examples focused more on the consequences of the debt than on the business processes that created the liability.
My previous two pieces were done in a very Agile frame of mind, so before we can cash out the concept of semantic debt any further, we need to pay down some architectural debt. In this piece we need to clean up that debt, but the cleanup will be a lot less tedious because we’ve already got some conceptual infrastructure in place. Isn’t “Agile” wonderful? Thankfully most of my readers should be comfortable with delaying architectural coherence in favor of building machinery. We’ll also introduce some taxonomy and finish with the outline of a program for paying down semantic debt.
What Do We Mean By “Semantics”?
Let’s start with a clarification: what do we mean by the term “semantics”? There are old and rich philosophical, linguistic, and computer-science literatures on “semantics,” and we’re not going to get into any of them. A simple gloss describes the semantics of a system as “the meaning of the system,” or “the things a system is about.” (“Meaning” and “things” are not the same thing philosophically, but they’re good enough for our purposes.) In philosophy “system” is synonymous with “theory,” or in more modern terms “model.” All software systems are theories, or best-fit models, of the structure of a particular environment.
A software system could be the comparatively simple model underlying a knowledge management system, where we really just want a robust metadata regime around specific assertions stored as a well-thought-out fact table. Or it could be a complex CRM application, combining a detailed logging framework with customer and product data. Or our model could be a predictive model for customer add-to-cart behavior on an e-commerce site. (We’re going to refer to these three examples repeatedly, so keep them in mind.) The “semantics” of any of these systems is the assertion that the model underlying the system is true of some set of things out there in the world.
It’s easier to think of software systems in terms of “applications,” though. (Most readers already know this, but let’s make sure we’re speaking the same language.) An “application” combines an input layer and a persistence layer. The input layer is designed to structure some interactions with some stuff in the world, typically mediated by a user. The persistence layer is designed to record the interesting bits of those interactions. Both layers are structured so users of the application, of either layer, can predict what it will do. This is not a debatable point; even so-called “unstructured” data is created by some structured process, or it would be simply white noise. We build applications for many reasons, but the primary overarching reason is to collect good data about some situation so we can do something with the data we’ve collected.
The input layer is most often a UI that collects data and structures user interaction with the application. But the input layer could also be source data like a text file from some other system, or a device reading and structuring some portion of the environment. Whatever the actual input mechanism, there’s a set of mapping functions that states “we’re going to label this specific user behavior with that string.” In the simplest terms, anything typed into box 10 is labeled first_name while anything in box 11 is labeled last_name.
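The labeling step above can be sketched as a small mapping function. This is a minimal illustration, not anyone’s production code; the box identifiers and field names are the hypothetical ones from the example.

```python
# A minimal sketch of an input-layer mapping function: raw values,
# keyed by UI widget, are relabeled with the application's vocabulary.
# Box identifiers and field names are illustrative only.
INPUT_MAPPING = {
    "box_10": "first_name",
    "box_11": "last_name",
}

def label_inputs(raw_form: dict) -> dict:
    """Apply the mapping function: widget key -> model label."""
    return {INPUT_MAPPING[box]: value
            for box, value in raw_form.items()
            if box in INPUT_MAPPING}
```

Anything typed into box 10 comes out labeled `first_name`, and unmapped boxes are simply dropped; one small way an input layer fixes, ahead of time, what the application can be about.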
The persistence layer structures the results of the input layer so that the data collected by the application can be used later. There’s another set of mapping functions between input and persistence that more or less re-labels the input layer objects so they can be persisted efficiently. In a relational persistence layer the forms of the input layer are translated into relational entities; this makes it easy to keep entities consistent. In a NoSQL persistence layer such as a document model, we reflect the contents of the input layer into the database; this means we can add and subtract form fields easily without worrying about normal form. In either case, the persistence layer is where we put stuff we might need later.
There is always a “later” in an application, even if the “later” is milliseconds away, and so there is always a need for a persistence layer. We get stuff out of the persistence layer using a query function, which effectively just reverses the mapping function. So far, this is pretty basic software architecture.
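As a toy illustration of that round trip (all names here are hypothetical, with a list standing in for the real persistence layer), persistence can be seen as a mapping into storage and the query function as its inverse:

```python
# Persistence maps a labeled input-layer record into storage and
# returns a key; the query function reverses that mapping.
def persist(record: dict, store: list) -> int:
    """Map an input-layer record into the persistence layer."""
    store.append(dict(record))   # copy, so later edits don't leak in
    return len(store) - 1        # the record's key

def query(key: int, store: list) -> dict:
    """Reverse the mapping: recover the record by its key."""
    return dict(store[key])

store = []
key = persist({"first_name": "Ada", "last_name": "Lovelace"}, store)
```

Whatever the storage engine, the contract is the same: what went in under a key can be recovered by that key later.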
We said earlier that the semantics of a system or application are “the things the application is about.” But the semantics of the application isn’t just the inputs, or the “far left side” of the application’s two big mapping functions. It’s not just The World, in our diagram. The semantics of the application is an assertion that the application is true of some set of things. That set of things is, of necessity, precisely defined, and most of software engineering practice is focused on ensuring precise definitions. We generally don’t have the time or energy to build applications for uses people don’t pay for, so we have to say ahead of time what The World consists of.
For example, the semantics of an add-to-cart predictive model is a site’s information architecture joined up with actual aggregated customer behavior. We assert that when we’re creating the model. The “information architecture” is a list of site pages and what they’re intended to do, e.g. categories like “call to action,” “help,” “pitch,” and so on. The customer behavior is a couple of simple graphs of behavior with high-level nodes. These may be as simple as “saw pages and then added to cart” and “saw pages and abandoned session,” plus some customer segmentation. The predictive model predicts which pages are likely to convert which types of users, so it conjoins both sets and calculates a probability based on actual graphs of behavior, segmentation, and specific outcomes. The result is an assertion that customers exhibit a certain pattern of behavior, which we can use to tease our friends in Marketing about how AI is coming for them.
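The conjunction described above can be sketched in a few lines: join page categories from the IA to aggregated outcome counts and estimate a conversion rate per (page category, segment). The events and category names are invented for illustration; a real model would be trained on far richer behavior graphs.

```python
from collections import defaultdict

# (page_category, customer_segment) -> [added_to_cart, abandoned]
counts = defaultdict(lambda: [0, 0])

# Invented, aggregated session outcomes joining IA categories to behavior.
events = [
    ("pitch", "new", "added"),
    ("pitch", "new", "abandoned"),
    ("pitch", "new", "added"),
    ("help", "returning", "abandoned"),
]
for page, segment, outcome in events:
    counts[(page, segment)][0 if outcome == "added" else 1] += 1

def conversion_rate(page: str, segment: str) -> float:
    """Estimate P(add-to-cart) for a page category and customer segment."""
    added, abandoned = counts[(page, segment)]
    total = added + abandoned
    return added / total if total else 0.0
```

The assertion the model makes is exactly the one in the text: that these (page, segment) pairs and their observed outcomes are true of the site’s actual customers.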
Or the semantics of an application might be customer complaints, or literature citations for observations of specific genome location-states and their associations with specific diseases. An application’s data model organizes these semantics, defining, for lack of a better term, the entities assumed to exist by the application, or its ontology. That data model may be explicit in the sense of a document model in MongoDB or an ER model in SQL Server. Or it may be implicit, in the sense of a poorly-documented stack of text files of uncertain provenance sitting on a file server somewhere. In the implicit case (also the worst-case), the files are still about some things, but our understanding of those things may be incomplete or vague. In the implicit case we lack both a positive assertion and a defined domain.
In mathematical logic the semantics of a model is often described as “the list of things that make the theory true or false.” This is a much more operational definition of a system’s semantics. So if the semantics of any given model is the set of things the model is about, the persistence layer implements the semantics so it sticks around. This implementation might take the form of records in relational tables that decompose entities and their relationships, or documents that more-or-less store data collected by input forms, or a log of incidents, observations or events created by a device doing some incident, event or observation monitoring. Thus we can subdivide a system’s semantics into components that belong in either the Input or the Persistence layer: a particular set of values corresponding to a particular feature of a model, such as a short list of reason codes for an RMA that populate a dropdown box on an RMA form.
Considering the foregoing it’s easy to see how a system might go into semantic debt. Indeed it’s easy to see how semantic debt is inevitable, and that for many systems semantic debt may begin accruing as soon as the system is used. There are two modes this failure might occur in.
First, there’s the simple fact that the world has an irritating way of changing. A model of the world breaks the world into entities: a “product to be returned,” which presupposes there are things called “products” that can be unsatisfactory to their purchasers and are thus returned. Or an “add-to-cart” event on a website, which supposes there is an agent who will add products to their “cart” on a website. Or a statistically interesting relationship between the local configuration of an interval on a strand of some person’s DNA and a disease state. Each of these three statements is a model of the world. But when the world changes the model may no longer be valid. If there is a list of products that can be returned, and the product that the purchaser wants to return is new and somehow not on that list, then the model exhibits a failure to model reality.
It’s not a major failure in this case, because we can just update the list, but you could see how it might be worse. We call this kind of debt “coverage debt,” because the model doesn’t cover some portion of reality that it should cover; there are elements of the business model that have not been organized by a data model.
Let’s extend the thought experiment. Suppose our predictive model for “add-to-cart” behavior is initially intended to cover a single add-to-cart process, one where the web pages are those long, drawn-out, graphics-heavy things with the relevant button at the very bottom of the page. The predictive model is trained on customers browsing those pages. The very same model will not work as well (it will not be as accurately predictive) on pages that are constructed differently. If the pages put the relevant button at the very top of the page, effectively bypassing all the attempted persuasion of the long page, then a model trained on long pages may underestimate the likelihood of conversion. Or it may overestimate it, if the top-button pages are structured differently in other ways. In such a case we say we have “equivalence debt,” in that the downstream semantic asset (our predictive model) is expected to merge two inequivalent situations.
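One hedged way to surface this kind of equivalence debt is to score the model separately on each page template instead of pooled; a large gap between cohorts suggests the model is merging inequivalent situations. The template names and scored data below are invented.

```python
# Evaluate one model's predictions per page template rather than pooled.
def accuracy(predictions, outcomes):
    """Fraction of predictions that matched the actual outcome."""
    hits = sum(p == o for p, o in zip(predictions, outcomes))
    return hits / len(outcomes)

# (page_template, predicted_conversion, actual_conversion) -- invented data.
scored = [
    ("long_page", True, True), ("long_page", False, False),
    ("long_page", True, True), ("long_page", True, False),
    ("top_button", True, False), ("top_button", True, False),
    ("top_button", False, True), ("top_button", True, True),
]

by_template = {}
for template, pred, actual in scored:
    preds, actuals = by_template.setdefault(template, ([], []))
    preds.append(pred)
    actuals.append(actual)

per_template_accuracy = {t: accuracy(p, o) for t, (p, o) in by_template.items()}
```

In this toy data the model is right three times out of four on the long pages it was trained on, and only once out of four on the top-button pages; a single pooled accuracy number would hide that split.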
Now these examples may seem spurious. Would anyone actually try to report on conversion using a partly defined IA? Would an organization fail to keep its list of returnable products updated in code and virtually ensure negative experiences for its customers? Certainly no reader of TDAN would make such a mistake, and software engineers in general would be appalled by such a lack of accountability. But while conscious and intentional decisions of these sorts may be extremely rare, they highlight the more likely cause of Coverage and Equivalence debt: change over time. We see both Coverage and Equivalence debt because the organization’s business model has expanded but its stock of data models has not. In cases of Coverage debt, that’s because there are areas of the business that have not been modeled. In cases of Equivalence debt, that’s because there are areas of the business that have been modeled, but not integrated.
In both types of failure mode, for both Coverage and Equivalence debt, the incurred semantic debt might be Sudden or Immediate, or it might be Gradual. These are our “failure points.” Coverage debts are “sudden failures” in situations where, for example, the model worked just fine yesterday and today it doesn’t, as in the example of the new product missing from the RMA-eligible list. Failure is “Immediate” when the model was never intended to cover a particular scenario, as in a situation where we build a data model or UI for returns-management that can’t manage returns due to product functionality. The Coverage debt is an “immediate failure” because the model (either the data model or the UI) simply doesn’t have a way to describe the situation; the Coverage debt is intentionally built in. “Immediate failure” is thus usually discernible as a function of scope. “Sudden failure” is more difficult to predict, but not usually all that surprising to experienced data managers.
“Gradual failures” are both more insidious and more common, but often much more difficult to recognize (gradual accretion of semantic debt is what inspired me to think this concept through). In a Gradual failure the model is thought to be adequate for the world it’s intended to describe, but in actual fact it isn’t. The model is insulated from reality somehow, and so users of the model (whether downstream silicon-based systems that rely on the model’s accuracy, or carbon-based lifeforms using the model) are unaware the model doesn’t match reality. These failures can lead to a broad-based poverty of imagination when they’re “Coverage debt,” where the entities of the model encompass an increasingly smaller portion of the actual world the model needs to cover.
The Coverage debt might arise because there are simply more kinds of things in the world than the ontology has, or the attributes of those things are more numerous than the ontology allows for. The paradigm case here is a “Flatland”-type scenario, where a two-dimensional model of the world is completely bewildered by three-dimensional behavior. But even the case of a hard-coded list of returnable products, or the standard practice of using hard-coded enums in code instead of lookup tables in a separate persistence layer, produces a gradual form of semantic debt that unintentionally constrains what the application can do. The hassle of doing a software release to update the RMA-able enum list leads to its own form of discipline, one where the team will do almost anything to its architecture to avoid having to do that update. That discipline functions as a constraint on future options. Thus the same feature in an application can be subject to Sudden failure, when the RMA-able list becomes incomplete, and Gradual failure, as the architectural decisions constrain future features.
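The enum-versus-lookup contrast can be made concrete. In this sketch (product names invented), the hard-coded enum is frozen until the next release, while the lookup, standing in for a table in the persistence layer, can be updated while the application runs:

```python
from enum import Enum

class ReturnableProduct(Enum):
    """Hard-coded in the application: frozen until the next release."""
    WIDGET = "widget"
    GADGET = "gadget"

def is_returnable_enum(sku: str) -> bool:
    return sku in {p.value for p in ReturnableProduct}

# The alternative: a mutable set standing in for a lookup table
# in a separate persistence layer.
returnable_lookup = {"widget", "gadget"}

def is_returnable_lookup(sku: str) -> bool:
    return sku in returnable_lookup

# A new product launches: the lookup is one row-insert away from
# current, while the enum silently refuses the return until a deploy.
returnable_lookup.add("gizmo")
```

Neither choice is free, of course; the enum buys compile-time safety at the cost of exactly the release discipline described above.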
Paying Down Debt
I said at the top of this piece and at the end of my last that I would discuss here how to pay down semantic debt. And that’s also the title of this piece. There is obviously a great deal to be said tactically about paying down semantic debt, and a lot of case-by-case examples. We’ve talked through a couple of examples where the point at which semantic debt creeps into the system becomes obvious. But let’s talk in broad terms about how an organization could go about identifying, prioritizing, and paying down semantic debt.
First, let’s get the obvious out of the way. Semantic debt is inevitable because the world changes on its own while mapping functions are fixed. Changing a mapping function requires human intervention, at the very least a release. But organizations can limit semantic debt by simply paying attention to their data architects. Indeed, if the role of the data architect is to balance the most efficient persistence mechanisms against the least risky for some subset of an enterprise’s processes, it is the notion of semantic debt that the architect uses to understand where that balance lies. That is, all good data architects worry about semantic debt, whether they’re conscious of it or not.
So when your data architect says “the data warehouse is a decade old, we should update it,” they’re not just trying to find something to do. They’re concerned about coverage and equivalence failures. They’re worried there’s some gradual debt built up that can’t be seen, in addition to the debt everyone knows about, and that gradual debt may have strategic implications. They want the organization to consciously decide whether to fix the identifiable immediate failures, and they want to try to get ahead of any sudden failures that may be lurking in the business model. While the admonition to “pay attention to your data architect” is a “Try to be nice to people, avoid eating fat, read a good book every now and then, get some walking in…”– type of rule, it doesn’t hurt to repeat it.
Once you’ve listened to your data architect, you want to catalog your semantic assets. This catalog doesn’t need to be exhaustive, but it should have a couple of dimensions and hierarchies. First, identify to some satisfactory level of detail which systems are creating data and how well they’re doing it. For example, is your ERP system using the same order status scheme today that it was using twenty years ago? Are those order statuses sufficient? Under the heading of that particular system begin to catalog its outputs. Some of those outputs will be feeds into downstream systems, such as PLMs, MRPs, suppliers, etc. Many of those outputs will be into BI systems of some kind, whether that’s large targets such as a data warehouse or data lake or one-off reports written directly against the ERP system. The creation of this catalog should be prioritized by order of importance, where “importance” is some function of “critical to employee satisfaction,” “customer impact,” “revenue impact,” or a matrix that combines all three. It does not need to be exhaustive, but it should not hide critical semantic assets.
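The start of such a catalog might look like the sketch below: each semantic asset records its source system, its outputs, and a priority combining the three dimensions named above. The weights, systems, and asset names are assumptions for illustration; any real weighting would be negotiated with the business.

```python
def importance(employee_sat: float, customer_impact: float,
               revenue_impact: float) -> float:
    """Combine the three dimensions (each scored 0-1) into one priority.
    The weights are illustrative assumptions, not a standard."""
    return round(0.2 * employee_sat + 0.4 * customer_impact
                 + 0.4 * revenue_impact, 2)

# A two-entry catalog of semantic assets, hypothetical throughout.
catalog = [
    {"system": "ERP", "asset": "order_status_feed",
     "outputs": ["MRP", "data_warehouse"],
     "priority": importance(0.6, 0.9, 0.9)},
    {"system": "ERP", "asset": "adhoc_margin_report",
     "outputs": ["one_off_report"],
     "priority": importance(0.3, 0.1, 0.4)},
]
catalog.sort(key=lambda a: a["priority"], reverse=True)
```

Even a spreadsheet with these columns is enough to start; the point is that critical assets sort to the top rather than hiding among the one-off reports.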
Are there any other kinds of outputs? If the data isn’t flowing to a downstream system, either for reporting or as an input into some automated process, where else is it going? While I suppose it’s possible those aren’t the only two categories, that an organization might have black holes in its data management systems, I can’t think of what that would look like. So if you, dear reader, run across one, please let me know.
Once you’ve got the beginning of this catalog, begin to categorize failure modes and points. Our two failure modes are “Equivalence debt” and “Coverage debt.” An asset might be subject to both equally, although that’s rare. An asset is subject to the “Equivalence” failure mode when there is another asset (or at least another system) that has natural keys or schemas that are incompatible with the subject asset. Reporting that merges two email marketing systems with different natural keys for the definition of a “customer” is subject to Equivalence failure, in that there may be scenarios in the model or even between records where the debt is impossible or at least very difficult to pay down because there are two different and incompatible definitions of customer. Our predictive model example provides another case, where a model is trained on one kind of content and then expected to predict behavior using a substantially different kind of content as well.
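A minimal sketch of that Equivalence failure (keys and data invented): system A keys customers on email address, system B on phone number, and the merge only works where a crosswalk links the two. Records outside the crosswalk silently drop out, which is exactly the debt.

```python
# Two email marketing systems with incompatible natural keys for "customer".
system_a = {"ada@example.com": {"opens": 12}}   # keyed on email
system_b = {"+1-555-0100": {"opens": 3}}        # keyed on phone

def merge_customers(a: dict, b: dict, crosswalk: dict) -> dict:
    """Merge per-customer stats where the crosswalk links the two keys.
    Customers with no crosswalk entry are silently lost from the merge."""
    merged = {}
    for email, stats in a.items():
        phone = crosswalk.get(email)
        if phone in b:
            merged[email] = {"opens": stats["opens"] + b[phone]["opens"]}
    return merged
```

With an empty crosswalk the merged report is empty, even though both systems are full of customers; the reporting asset is in debt from the moment it is built.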
An asset is subject to “Coverage” failure when the model is inadequate for the business situation it’s intended to describe. This may be because the model is vague on certain entities; there’s a notion of “product” for example, but no way to subtype the product by composition, delivery mechanism or returnability. Or it may be because there are entities in the real world that are relevant to the model that aren’t documented in the model. It’s tempting to go down the slippery slope of ontological relativity and say that all models are, to some extent, vague with respect to reality. Prediction is not deduction, after all, and the Planck constant is pretty small. But we’re not doing Philosophy, we’re doing Data Management. So there are relevant questions: Are there processes happening right now that might benefit from being recorded that are not currently recorded? Do these processes cost a lot of money or create some sort of business impact? If so, then the Coverage failure is relevant. Similarly, has the business changed since the model was created? If not, is that because the model is perfect, or is it because there’s a bunch of ad-hoc annexes created to ensure no one needs to hire a data modeler?
Our failure points are “sudden,” “immediate” and “gradual.” “Sudden failures” tend to be pretty obvious: We’ve used the example of the organization with two email marketing systems repeatedly, and that’s an excellent example of two systems that generate semantic assets that can cause sudden failures. “Immediate failures” should be relatively simple to find, in that they’re often a function of scope: What aspects of the business were not included in an application that probably should be, and do we have at least an early-warning system to detect an inadequate model? These may be specific features that were left out because of time, technical debt, or architectural errors, such as enums that can’t be updated, products that can’t be returned, or processes that require manual intervention. ETL or pipeline processes often include this type of failure.
Finally, “Gradual failures” are the hardest to determine, and this is where your data architect earns their pay. One key question for the business is “what are you not able to do that you want to do?” This question isn’t a wish list or a fishing expedition, because it’s a “now” problem: what can’t marketing personnel do now that they used to do at prior employers, for example? It might arise in a discussion of “best practices” that your organization currently doesn’t follow.
Or consider our “add-to-cart” predictive model asset. If the model was built on a particular information architecture and is used to determine which experience a potential customer should follow, but there have been pages or even flows added to the IA since the model’s construction that weren’t part of the model’s training, then we have an obvious case of “gradual failure.” The “gradual failure” may be harder to find, though, in the same example if new pages were added and the model was retrained, but the pages are structurally distinct from the older style. In such a case we have a gradual equivalence failure, in that the two types of pages are not the same experience. But the model presumes they are, and the asset is in semantic debt.
In this piece we promised to talk about how to pay down semantic debt. First we took a long detour with a discussion of the concept of “semantics” so we could deal with architectural debt. We then introduced a couple of failure modes for semantic debt. “Coverage debt” occurs when an asset’s model doesn’t include as much of the business as it needs to. “Equivalence debt” occurs when two assets have incompatible models. We also talked about three failure points for semantic debt. “Immediate failure” occurs when an asset simply doesn’t describe the situation it needs to describe, and this failure is known to its creators and/or users. “Sudden failure” occurs when an asset unexpectedly fails to perform. And “Gradual failure” occurs over time, as an asset grows less and less accurate.
Much more can be said about the analysis of semantic debt, and I will pick up both individual examples and the overall picture in future articles.