Semantic debt arises when an organization’s data management systems are conceptually inadequate. We often see semantic debt in source-to-target mappings buried deep in otherwise stable ETL processes. A more complex example would be the transaction system that simply can’t store relationships between entities that modelers never considered worth relating.
“Semantic debt” is distinguished from “technical debt” in a couple of ways. For one, while organizations may be aware of their technical debt (and clever ways to kite it), they generally have no idea how much their strategic plans depend on their data management.
For another, paying down complex semantic debt is often harder than paying down technical debt. Semantic debt results from systems that constrain an organization’s ability to evolve over time, while the parameters of technical debt are often tightly constrained by precisely defined Agile timelines.
I want to sketch a picture of semantic debt in this article. In future articles, I plan to propose a more rigorous definition and discuss how to pay down semantic debt.
Where does Semantic Debt come from?
Like technical debt, semantic debt results from scoping decisions. Unlike technical debt, semantic debt is rarely called out. We can often accurately gauge the amount of technical debt a given development effort creates by asking developers a simple question: “How much longer would it take to do this the right way instead of the quick way?” The cost of that time in current dollars is equivalent to the amount of technical debt we’re taking on. We can choose to ignore technical debt until we need to refactor the software, or we can establish paydown teams that periodically but regularly remedy short-term decisions. Some organizations have so much technical debt that they’re never able to pay it down and must scrap their code bases. Others take a more rigorous approach, testing various aspects of their code base to ensure they’re not carrying an onerous amount of debt.
Semantic debt is a much broader concept. I hope it’s not so broad that the concept loses traction, but some examples will help explain where the notion of “semantic debt” can be helpful.
Consider an organization that depends on one or more parties for access to its paying customers. Over time the organization will consolidate its data collection and reporting around the status of those intermediaries and their relationships to their customers: how many customers are associated with each intermediary, the average dollar value per customer for each intermediary, and the total dollar value paid to each intermediary for access. Those relationships are instantiated in the organization’s data models, and more than likely each party and relationship is a solid, physical table with data about each type of party and its clearly defined relationships to other parties. A mature reporting regime will highlight changes in these relationships: which customer groups are changing, which intermediaries are rising or declining, and the cost of acquiring a customer.
Even as the business environment changes because of regulatory chaos, technology disruption, mergers, or failures, the organization continues to see the world through that initial model. But if every relationship with a customer is supposed to require an intermediary, what happens to the transaction model when the direct acquisition of customers is suddenly an option? One solution might be to re-purpose the cross-reference tables that instantiate the relationship between provider, customer, and intermediary so that a customer can act as their own intermediary. This is a straightforward ETL fix, with some modifications on the BI side to ensure we’ve performed the proper incantation to identify intermediaries vs. intermediaries-who-are-customers.
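To make the workaround concrete, here is a minimal sketch of that re-purposed cross-reference table, using SQLite for illustration. All table and column names (`party`, `customer_intermediary_xref`, and so on) are hypothetical stand-ins, not the actual schema of any organization described here; the point is the self-referencing row and the CASE expression that the BI layer now needs.

```python
import sqlite3

# Hypothetical schema: a party table and a cross-reference table that
# relates customers to the intermediaries who provide access to them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE party (party_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_intermediary_xref (
    customer_id     INTEGER REFERENCES party(party_id),
    intermediary_id INTEGER REFERENCES party(party_id)
);
""")
conn.executemany("INSERT INTO party VALUES (?, ?)",
                 [(1, "Acme Broker"), (2, "Jane Customer"), (3, "Direct Dan")])
# The original model: Jane is reached through the Acme intermediary.
conn.execute("INSERT INTO customer_intermediary_xref VALUES (2, 1)")
# The workaround: a directly acquired customer acts as their own intermediary.
conn.execute("INSERT INTO customer_intermediary_xref VALUES (3, 3)")

# The "incantation" at the presentation layer: a CASE expression that
# distinguishes true intermediaries from intermediaries-who-are-customers.
rows = conn.execute("""
SELECT p.name,
       CASE WHEN x.customer_id = x.intermediary_id
            THEN 'direct' ELSE 'intermediated' END AS channel
FROM customer_intermediary_xref x
JOIN party p ON p.party_id = x.customer_id
ORDER BY x.customer_id
""").fetchall()
print(rows)  # [('Jane Customer', 'intermediated'), ('Direct Dan', 'direct')]
```

Note that nothing in the schema itself records that “direct” is a distinct kind of relationship; that fact now lives only in a CASE expression someone has to remember to repeat.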
This workaround may serve a long and honorable life, but it should be obvious to the reader that this fix is actually a semantic debt. While the new type of relationship can be accommodated in the existing model, and we don’t need to change much except some queries at the presentation layer, we aren’t explicitly calling out the new type of relationship. Not all organizations have data modelers who can refactor their systems correctly, but even inexperienced data management people can see that the supposedly straightforward solution involves CASE statements and plumbing fixes in multiple layers that will cost time now – though less time than rethinking the data model – while creating some unknown headaches down the line. When the inevitable complication arises – the Marketing team decides they want to type these directly-acquired customers by acquisition channel – our fix must now find a way to store a table’s worth of detail, spread across the ETL, storage, and BI layers.
Semantic debt in this case clearly embeds technical debt. The “do it right versus do it fast” calculation is part of the data management team’s decision to promote the fix and subsequent fixes as technical fixes, instead of telling their management chain that a model change is required. But this is semantic debt instead of just technical debt, because while we can quantify the ROI of the short-term fix, we’re no closer to making explicit the new type of relationship. For that we need a data model. Our technical debt might require minor changes to our existing data collection, storage, and reporting systems, but our semantic debt in this case is implicit. The organization has decided, in short, that any new relationships it may build with customers are just like the old relationships.
While this would be a fine decision if the organization had built models that allowed any and all types of relationships, it is the author’s sad experience that even seasoned database developers have never heard of Len Silverston’s Party model, almost twenty years after its publication.
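For readers who haven’t encountered it, the core move in a Party-style model is to make relationship types data rather than schema, so a new kind of relationship is a new row, not a refactor. The sketch below is a drastically simplified illustration in that spirit, not Silverston’s actual published schema; all names are hypothetical.

```python
import sqlite3

# Simplified Party-style structure: parties, relationship types, and a
# generic relationship table. New kinds of relationships become rows in
# party_role_type, so the model never needs DDL changes to express them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE party (party_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE party_role_type (role_type_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE party_relationship (
    from_party_id INTEGER REFERENCES party(party_id),
    to_party_id   INTEGER REFERENCES party(party_id),
    role_type_id  INTEGER REFERENCES party_role_type(role_type_id)
);
""")
conn.executemany("INSERT INTO party VALUES (?, ?)",
                 [(1, "Acme Broker"), (2, "Jane Customer"), (3, "Direct Dan")])
conn.executemany("INSERT INTO party_role_type VALUES (?, ?)",
                 [(1, "customer-of-intermediary"),
                  (2, "direct-customer")])  # the new relationship: a row, not DDL
conn.executemany("INSERT INTO party_relationship VALUES (?, ?, ?)",
                 [(2, 1, 1),   # Jane is a customer of the Acme intermediary
                  (3, 3, 2)])  # Dan is a direct customer
channels = conn.execute("""
SELECT p.name, rt.description
FROM party_relationship r
JOIN party p ON p.party_id = r.from_party_id
JOIN party_role_type rt ON rt.role_type_id = r.role_type_id
ORDER BY r.from_party_id
""").fetchall()
print(channels)
# [('Jane Customer', 'customer-of-intermediary'), ('Direct Dan', 'direct-customer')]
```

The tradeoff is that the semantics move out of the table structure and into reference data, which reporting and ETL then have to interpret; but at least the new relationship type is explicit somewhere, rather than implied by a self-join trick.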
This problem is as difficult to solve with a NoSQL collection system as with a relational model. In fact, the semantic debt in a NoSQL system will often be even greater, as a YAGNI-driven semantic model that seems obvious to developers in their mid-20s won’t work for the formal and rigorous analytical style of finance. Any development team, in software or reporting, that has had to work with an overgrown EAV implementation will agree that semantic debt is easy to rack up yet very hard to pay down.
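An entity-attribute-value (EAV) store is the canonical example of that rack-it-up-now dynamic: trivially flexible to write into, and painful to report from. A minimal illustration, again with hypothetical names:

```python
import sqlite3

# A minimal EAV store: every fact is a (entity, attribute, value) triple.
# Flexible to write -- any attribute, any time -- but sparse and untyped.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (entity_id INTEGER, attr TEXT, value TEXT)")
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    (1, "name", "Jane"), (1, "ltv", "120.50"),
    (2, "name", "Dan"),  (2, "channel", "direct"),  # attributes vary per entity
])

# Reconstructing one conventional "row" for finance-style reporting
# requires a pivot, one CASE expression per attribute you care about.
row = conn.execute("""
SELECT MAX(CASE WHEN attr = 'name' THEN value END) AS name,
       MAX(CASE WHEN attr = 'ltv'  THEN value END) AS ltv
FROM eav
WHERE entity_id = 1
""").fetchone()
print(row)  # ('Jane', '120.50')
```

Every new attribute the business invents silently works at write time and silently breaks every pivot query that doesn’t know about it, which is precisely the debt a formal analytical consumer ends up paying.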
Our simpler example earlier gives some insight into the other end of the spectrum. Many mature data warehouses fell exhausted across the finish line a decade or more ago, and the experience was so painful that no one really wants to go through that again. But they generally work, or more accurately the organization more-or-less adapted to the business model reified by the warehouse’s various structures. Generally, there will be some collection systems within the organization that haven’t been integrated into the warehouse; some may have been partially integrated, ideally on some clean partition. While that integration gap is itself a form of semantic debt, there’s an even more practical example. For many data warehouses showing their age the ETL tools and mappings haven’t changed substantially since they were first built. Again, some additional integration may have been performed (e.g. the CASE statements to accommodate direct customers mentioned earlier). But the mappings themselves – the design of the ETL process, what you might call the “architecture” implicit in the ETL – have likely not been revisited. That is almost entirely because the system is producing stable numbers, and nobody wants to break it.
A simple example: When I first started doing ETL at scale I discovered the “upsert.” But because I was using Microsoft DTS, I didn’t have an UPSERT operation in my ETL tool, and I was just beginning to understand ETL architecture. So I wrote a lot of stored procedures that separated the identification of new and old records into multiple explicit steps, often with several intermediate tables to hold results. As my experience with ETL tools expanded I came across products that allowed me to explicitly identify the logical components of an upsert, and now of course there are MERGE statements in SQL that unify upsert tasks. But it’s distinctly possible there is some seasoned ETL practitioner out there in Silicon Valley staring at some sparsely-commented and over-long stored procedures written by somebody named DG in the early 00s, and that developer, significantly better at their job now than I was then, is not allowed to change the procedure because doing so, for some reason, changes the outcome.
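The two styles look something like the following sketch, using SQLite for illustration (SQLite’s `ON CONFLICT` upsert stands in for SQL Server’s `MERGE`, and the table names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Jane')")

incoming = [(1, "Jane Q."), (2, "Dan")]  # one update, one new record

# The DTS-era pattern: explicit steps and an intermediate staging table.
conn.execute("CREATE TEMP TABLE staging (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", incoming)
# Step 1: update records that already exist in the target.
conn.execute("""UPDATE dim_customer SET name =
    (SELECT s.name FROM staging s WHERE s.customer_id = dim_customer.customer_id)
    WHERE customer_id IN (SELECT customer_id FROM staging)""")
# Step 2: insert records that don't.
conn.execute("""INSERT INTO dim_customer
    SELECT s.customer_id, s.name FROM staging s
    WHERE s.customer_id NOT IN (SELECT customer_id FROM dim_customer)""")

# The modern one-statement equivalent: both steps unified in a single upsert.
conn.executemany("""INSERT INTO dim_customer VALUES (?, ?)
    ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name""",
    [(2, "Daniel"), (3, "Eve")])

rows = conn.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall()
print(rows)  # [(1, 'Jane Q.'), (2, 'Daniel'), (3, 'Eve')]
```

Both versions produce the same rows; the debt lies in the first version’s sprawl of steps and intermediate tables, which nobody dares consolidate once the downstream numbers depend on it.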
This is the more mundane and almost more insidious kind of semantic debt, and it is common. The mapping may be opaque because the developer is gone or was high when they wrote it, but the downstream results are a foundation stone for the business. Modifying the mapping would change the foundation, and so the team’s management won’t permit changes to the mapping. Performance issues may require the team to do extra-clever work elsewhere to compensate for DG’s poor ETL design, or may mean that reports are simply late, but the technical debt here is easy to identify: The ETL team can point to multiple tools that will do the job better, each with well-known costs. But because there is an unknown impact to the downstream reporting, to the semantics everyone has come to rely on, the mapping must remain as semantic debt.
This kind of problem, where components act as anti-explanatory non-linear functions within an ETL process that should create legitimacy, is rampant. It’s a smaller form of the epistemic closure one finds in large scale examples of semantic debt, but it can often occupy the bulk of a team’s time in a larger data management group.
We have taken two examples of semantic debt and worked through their contours. At one extreme, an organization’s semantic debt is the difference between what its business could be and what the organization’s current data management systems allow it to be. This is big-picture semantic debt, a kind of epistemic closure that makes it difficult for an organization to respond to the world as it is.
At the other extreme, semantic debt might be described as the outcome of processes that may or may not be amenable to technical changes, but which can’t be changed anyway because doing so would impact stable business decisions.