In my first piece, Thoughts on Semantic Debt, I sketched out the concept of “semantic debt” and provided some examples. Semantic debt arises “when an organization’s data management systems are conceptually inadequate.” Organizations usually try to manage semantic debt with ad-hoc fixes or kludges, which often spread throughout the organization’s data management systems. These debts collectively create problems when adapting to changing business conditions.
In this article, I want to expand on the notion of semantic debt, provide some more examples, and explain how organizations get into semantic debt.
Every organization of any size implements applications to collect data about customers and internal operations. Some front-line collection applications are out-of-the-box systems with prefabricated forms and databases, and are more-or-less customizable to the organization’s needs. Some are custom-built from scratch by the organization’s own development teams. Each application is designed to collect specific types of data, and some applications – such as monster ERP systems – claim to be able to collect data about everything the organization does, from interacting with customers and taking orders, through inventory replenishment, to billing and returns. These systems generally collect data about various activities and store that data somewhere, perhaps in a relational database, but as often as not in a NoSQL system or log files.
The data collected by these systems is meant, at some point, to be used to make decisions, whether by customers themselves or for customers. Even devices that generate reams of log data are generating that data because someone wanted it. Although log data may be so large as to be practically unreadable, or so obscurely recorded as to require complex interpretation, there is a point to the recording, and purchase of that software was predicated on the notion that it could be used to record and report something valuable. This is true of both the most all-encompassing ERP system and the smallest metering tool.
Over time an organization will add and obsolesce many collection systems, large and small, for reasons that are varied and mostly familiar to the reader. Sometimes these systems merge, as when a big ERP vendor buys a giant CRM vendor and merges the operations into an even bigger ERP/CRM offering. Other times the vendor goes bankrupt or stops development of the system, and the organization that uses that package needs to decide what to do. And organizations are always paying down technical debt created by their customizations of, and integrations between, their collection systems.
Technical debt results when short-term architectural decisions are consciously preferred over long-term ones. Examples of technical debt are also familiar to readers: we choose the expedient over the correct because we’re not sure the correct can be justified by ROI, because we don’t have the resources and skill-sets, or because we want to test out some ideas by correctly doing one layer of an application while doing other layers with less attention to consequences. Many organizations can quantify their technical debt, and several even have formal processes for paying down technical debt. Technical debt can be found throughout the organization’s technology assets, both within an application and in the integrations that pass data from one collection point to another.
In any organization of any size there are also systems designed to co-locate data from various collection points for reporting. Sometimes reporting systems are built into applications themselves to report directly on the data collected by those applications, although thankfully it is rare today for a system to report directly against the data model that also manages user interaction. Organizations build repositories to co-locate data from multiple collection points, and organize those repositories to make them performant for reporting. The reporting system must also be easy to use for its intended users, and so it must reflect a shared semantics or business ontology.
Large and/or old organizations will have many of these repositories. Most are intended to co-locate data from a small number of applications, perhaps connecting data from a nascent direct email-marketing process with order data, to try to determine conversion rates. Others aim for a “grand unified theory” of the organization. Whatever the ultimate size, the way we populate repositories is the same whether it’s performed by an ETL team or an individual data scientist. Datasets are “extracted” from source systems, where each dataset is structured by some fragment of the source’s data model. The quality of an extract largely depends on which fragment of the source’s data model it captures; an extract that mixes cardinalities can confuse or miss key relationships, which makes downstream repositories difficult to work with. (Sometimes that problem exists in the upstream model, but more often than not it’s the design of the extract that’s at fault.) Generally, if the downstream repository reorganizes the extracts into some kind of internally-coherent data model, it is called a “data warehouse.” If the extracts are dropped as-is into the repository, it’s called a “data lake.” Obviously data lakes require significantly less work than data warehouses, and have become quite popular as a result.
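The cardinality problem is worth seeing concretely. Here is a minimal Python sketch (all table and field names are hypothetical): an extract that flattens an order header onto its order lines changes the grain of the data, and a downstream consumer who doesn’t know that grain will double-count header-level measures.

```python
# Hypothetical source tables: one order header row, many line rows.
orders = [  # cardinality: one row per order
    {"order_id": 1, "shipping_fee": 5.00},
    {"order_id": 2, "shipping_fee": 7.50},
]
order_lines = [  # cardinality: many rows per order
    {"order_id": 1, "sku": "A", "amount": 10.00},
    {"order_id": 1, "sku": "B", "amount": 20.00},
    {"order_id": 2, "sku": "A", "amount": 15.00},
]

# The "extract": a denormalized join, as an ETL job might produce it.
# The order-level shipping fee is now repeated on every line row.
fee_by_order = {o["order_id"]: o["shipping_fee"] for o in orders}
extract = [dict(line, shipping_fee=fee_by_order[line["order_id"]])
           for line in order_lines]

# A consumer who sums shipping_fee over the extract double-counts it,
# because order 1's fee appears once per line.
naive_fee_total = sum(row["shipping_fee"] for row in extract)  # 17.50, wrong
true_fee_total = sum(o["shipping_fee"] for o in orders)        # 12.50, right
```

Nothing in the extract itself is false; the debt is that its grain (order line, not order) is nowhere recorded, so the consumer has to know it by institutional memory.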
Note that there’s not much difference, ontologically, between an order detail extract from an ERP designed to go into a gigantic corporate data warehouse, and an extract out of a super-normalized data warehouse that shows last month’s five highest-selling products. There is a content and model difference, obviously, and experienced ETL people know that the former – the order detail extract – can be used to generate the latter and a lot more. But “reports” are also just extracts from systems. They may feed into visualization tools and generate pie charts, or they may go into a downstream repository for combination with other extracts. But once data has been collected, everything we do with it requires an extract first.
Let’s call an extract from a system a “semantic asset.” A semantic asset is the data model fragment (from the source system) in the extract, plus the data. We could also say a semantic asset is a schema and data, as long as we’re clear that a “schema” isn’t some mysterious thing that appears out of nowhere, but instead comes from a source’s data model. If the source system is an ERP then an extract might consist of a denormalized “super transaction” combining the order, the customer, the customer’s address, and a few thousand records; if the source system is a device then the extracted semantic asset might just be a timestamp, an error code, and billions of rows. Someone designs a semantic asset because they want to do something with it. In practice, there will be multiple stages of semantic assets between a source system busily collecting data and a Kimball warehouse storing the organization’s data in a coherent whole. Each time an extract is modified or transformed into something else, another semantic asset is created, both at the point of transformation – i.e. the application of some mapping logic on one schema so it will fit into another data model – and in the result – i.e. the asset that results from the merging of two data models. A data warehouse functionally requires the creation of many semantic assets. A data lake requires comparatively few, simply because the process of creating a coherent semantic model in the data warehouse requires more transformation than the simple “pump and dump” operations that constitute a data lake’s creation.
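As a sketch of the definition above (the class and field names are my own illustration, not an established API), a semantic asset can be modeled as a schema fragment plus rows, and every transformation of one asset yields another asset:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticAsset:
    """A fragment of a source's data model (the schema) plus the data."""
    source_system: str
    schema: dict                  # column name -> type name
    rows: list = field(default_factory=list)

def transform(asset, mapping, target_schema):
    """Apply mapping logic to one asset's schema; the result is a new,
    distinct semantic asset, as described in the text."""
    new_rows = [{dst: row[src] for src, dst in mapping.items()}
                for row in asset.rows]
    return SemanticAsset(asset.source_system, target_schema, new_rows)

# An extract from a hypothetical device: tiny schema, many rows.
device_asset = SemanticAsset(
    source_system="meter-7",
    schema={"ts": "timestamp", "error_code": "int"},
    rows=[{"ts": "2024-01-01T00:00:00Z", "error_code": 17}],
)

# Renaming columns to fit a warehouse model creates a second asset.
warehouse_asset = transform(
    device_asset,
    mapping={"ts": "event_time", "error_code": "event_code"},
    target_schema={"event_time": "timestamp", "event_code": "int"},
)
```

The point of the sketch is the count: one extract plus one transformation already means two semantic assets, which is why a warehouse, with its many transformations, accumulates so many more of them than a lake.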
If the last thirty years of data management has taught us anything, it’s that we don’t invent new tools and techniques so much as provide a name and a set of best-practices for stuff that operations people figured out a long time ago. (To paraphrase a comment Justin McCarthy once made to me, architecture is just three heuristics and an attitude.) We now know, from experience, that organizations build both data lakes and data warehouses for different purposes, as part of their organic response to changing business conditions. At the enterprise level, it may be that an organization is implementing a formal data lake development plan and is beginning to think about what a data warehouse might look like. At the department level and below, as Curt Monash has pointed out repeatedly, the flow of data from source system through lake and warehouse to decision-makers, and then back into customer interactions, is faster than at the organizational level; it is also almost certainly more mature and stable in the short term. That is, departments build their own downstream repositories and figure out what works faster than is possible at the overall organization level. The department-level data lake becomes the scaffolding for the organizational system, and the original semantic assets are implemented with more robust ETL tools and rigid change management, which leads the department to build a new data lake to respond to environment changes that move faster than the change management process allows. Then the cycle starts again.
What also becomes clear from the last thirty years is that we repeat a pattern of repository development analogous to the pattern of technical debt creation we see in application development; call it the “semantic debt cycle.” Application development generally starts with a basic set of use cases. An architecture is developed (implicitly or explicitly) that attempts to generalize those use cases, perhaps into modules, layers, meta-objects, or some more discrete set of entities and interactions, such as “customer,” “CSR,” “ticket,” and “product.” We decide which entities owe their existence to the application we’re building – e.g. the “ticket” – and which must be derived from other sources – e.g. the “customer” – and we begin development of the application.
In the course of that development process, as we break down the interactions and entities into smaller components, we need to make decisions about what software to build and when. We may decide that to save time we will use some prefabricated libraries, classes, objects or methods, either from prior applications we’ve built or from third parties. Or we may decide that given our time and resource constraints we will need to exclude certain subtleties of interaction, relationship, or properties from our application. We mark that exclusion, or that adoption of a handy template, as technical debt. “Technical debt” can include a whole lot more than what I’ve just characterized, but the idea here should be obvious. We create technical debt when we take a shortcut that restricts what we can do later with our application. We can create technical debt consciously, as when we argue about what features are in scope and why their exclusion from the current scope will make it more difficult to add them later. Or we can create technical debt unconsciously, when we design an application in such a way that certain kinds of collection or interaction are unintentionally excluded from the scope of the application.
Data management follows a similar lifecycle. Organizations implement applications to collect data about portions of their business: orders, customer complaints, email marketing, web analytics, payroll and HR, inventory management, and facilities management. Each of those applications more-or-less reflects, at a higher level, the business ontology of the organization and the various tasks and interactions performed by its departments. Over time data from each application is passed to other applications as needs are identified, and then eventually to one or more data lakes and perhaps to a data warehouse.
The data lakes, having no intentional internally-coherent data models, contain straight-up copies of fragmented and denormalized data models from the original systems; semantic assets are just created and dropped into the lake with hopefully some metadata and institutional knowledge about their origin attached. The data warehouse at the very least tries to do some deduping on important entities like customers, and rekeys incoming semantic assets to conform to the warehouse’s local notion of customer. But a given organization will have multiple repositories with overlapping data; one system for customer service, for example, that contains perhaps customer complaints and orders, and another for marketing, which contains orders, outbound emails and inbound web analytics. Whether these systems are weedy, junk-filled lakes (on the one end of sophistication) or refined and organized warehouses (on the other) depends on the culture and resources of the department that owns them.
But time causes change. Upstream applications are swapped out or collapse, or new collection points are added because of mergers. The business ontology changes. The Marketing data lake now includes outbound emails from multiple email vendor systems; that is, multiple semantic assets with slightly different data models. The customer key in one semantic asset is not the same as the customer key in another, and so to get a consolidated view someone needs to create a new semantics to join the two assets. At the same time the CS data warehouse contains data from the old relational CRM system and the new SAAS-based CRM system. The old system allowed hourly extracts, and because a contractor hoodwinked us the warehouse is structurally an exact copy of the source data model; the new system only allows daily extracts, and its data model is a NoSQL document model that couldn’t be mapped directly into the warehouse schema, so there are a lot of gothic, weird hierarchical tables with no apparent relationship to the old data. These scenarios are common, and more than one team is dealing with this problem as you read this.
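The email-vendor reconciliation just described can be sketched in a few lines (all keys and field names here are invented). The crosswalk mapping both vendors’ keys to a shared customer identity is the “new semantics” someone has to build and maintain, and it is itself a new semantic asset:

```python
# Two email-vendor extracts that key customers differently.
vendor_a = [{"cust_key": "A-100", "emails_sent": 3}]   # vendor A's model
vendor_b = [{"customer_id": 9001, "emails_sent": 2}]   # vendor B's model

# The crosswalk: each vendor's key mapped to a shared customer identity.
# Building this table is the semantic work; the SQL (or Python) is trivial.
crosswalk = {"A-100": "CUST-1", 9001: "CUST-1"}

# Consolidated view: total emails per unified customer.
consolidated = {}
for row in vendor_a:
    key = crosswalk[row["cust_key"]]
    consolidated[key] = consolidated.get(key, 0) + row["emails_sent"]
for row in vendor_b:
    key = crosswalk[row["customer_id"]]
    consolidated[key] = consolidated.get(key, 0) + row["emails_sent"]
```

Note that the code is the easy part; the debt lives in deciding, and defending, the claim that "A-100" and 9001 are the same customer.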
Virtually every modern organization, large or small, has this problem of reconciling semantic assets. As application developers formalized “rapid prototyping” into “Agile,” scope management became tighter, timelines compressed, and the cost of application development dropped. But the recognition that applications need to do more than simply exchange information through an API – which is itself just an automated semantic asset generator – has disappeared from the modern enterprise. That downstream repositories need semantic assets to enable a coherent view of the enterprise comes as something of a surprise to modern software teams, when it’s not viewed with outright distaste.
Over time, however, the decay of the at-least-initially unified view of the enterprise present in its various source systems and collection points is a semantic debt that must, at some point, be paid. There is very clearly technical debt here: someone has to write some SQL to convert the customer key of one email vendor to another, and someone has to rewrite reports so they can use both the old boring CRM warehouse model and the new super-abstract hierarchical CRM document model. The SQL will almost certainly be extremely simple technically, but very complex semantically. But that technical debt doesn’t result from decisions made during the construction of the application; indeed, the upstream application may remain completely free of any technical debt. (Application development managers are often deeply frustrated by complaints that their shiny new application causes downstream chaos, and are as likely to blame the repository managers for the problem as their own designers.) The technical debt results from the semantic debt incurred by the addition of new and so-far unreconciled data models to the organization’s collection of data models. New layers of semantic assets need to be created to merge the new models.
But the organization inevitably creates semantic debt by trying to adapt to changing business conditions. As we suggested, it can create this debt when it swaps out its old and busted CRM system for a shiny new NoSQL SAAS product. To connect the semantic assets generated by those two systems in a downstream repository, someone needs to rethink the organization’s operational or implicit data model. That is, they need to develop a model that is both more all-encompassing and more abstract than what is contained in the semantic assets pulled from either source system.
Or it may create this debt when it changes its business and decides to do trigger email marketing with a new ESP for new customers while retaining batch-and-blast for warranty updates to old customers. The two email marketing systems will have different definitions of the word “customer” and may even be driven by different extracts – one from the order system and one from the CRM – and thus create subtle differences in cardinality in downstream semantic assets, resulting in significant differences in customer experience. But these differences in data model are semantic differences, and they are debts that must be paid down before the system – whether it’s a casually-developed lake or a formally-designed warehouse – can be returned to its prior level of utility.
A very common and much more insidious form of semantic debt is found in the modern rush to store data of varying schemas on file servers for consumption by data scientists on an ad-hoc basis. Each file type is its own self-contained fragment of an upstream data model, whether as a log of some device behavior or as a supplementary extract from a larger system designed to provide context. Without robust metadata about the origin and intention of the content of the file – which particular version of the device generated the file, which activities the device was performing at the time, which particular partition of source data in the source system was indicated in the WHERE clause to create the semantic asset – any attempt at automated “model recognition” or reconciliation will be under-determined by the data. (This is “under-determined” in the philosophical or Quinean sense: you can propose any number of valid models to organize the data you’re looking at, without getting at the actual data model that would stabilize reproducibility.) Files that purport to be of the same “type” may not in fact be generated by the same partition, or from the same device version or activities. In that case, their meanings may be different, and automated inferences of the type data scientists are fond of generating won’t be reliable.
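A minimal guard against this kind of under-determination might look like the following sketch (the provenance fields are hypothetical, chosen to match the examples above): two files may be pooled only if their provenance matches on every field, and missing metadata makes the question undecidable rather than answerable.

```python
# Provenance fields without which a file's meaning is under-determined.
# These names are illustrative, matching the discussion above.
REQUIRED_PROVENANCE = ("device_version", "activity", "source_partition")

def poolable(meta_a, meta_b):
    """Return True if two extracts can safely be pooled, False if their
    provenance differs, and None if metadata is missing -- i.e. the
    question is under-determined and no automated answer is honest."""
    for key in REQUIRED_PROVENANCE:
        if key not in meta_a or key not in meta_b:
            return None   # can't tell: the debt is the missing metadata
        if meta_a[key] != meta_b[key]:
            return False  # same "type", different meaning
    return True

meta_a = {"device_version": "2.1", "activity": "idle", "source_partition": "p7"}
meta_b = {"device_version": "2.1", "activity": "idle", "source_partition": "p7"}
meta_c = {"device_version": "2.2", "activity": "idle", "source_partition": "p7"}
meta_d = {"activity": "idle"}   # provenance never recorded
```

The three-valued result is the point: the honest answer for files with no recorded provenance is not "no" but "unknowable," which is exactly the debt the paragraph describes.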
There’s a final kind of semantic debt, a kind of confirmation bias I’ll call “epistemic circularity.” I gave an example of this in my earlier piece, but given what we discussed above, the reader should be able to see a couple of obvious examples. Consider the well-constructed CS warehouse which does not, as of yet, contain data from the super-fancy SAAS product, which is intended to manage complaints for the organization’s own new SAAS product. If the original CS warehouse didn’t contain tables for “product” – because no one assumed it was important – then the problem is not simply the reconciliation of a hierarchical document model where the data is provided daily with a relational star schema populated hourly. The problem is that there’s no product table anywhere to even differentiate the two product lines. It’s not just that the various semantic assets can’t be reconciled without a richer model, it’s that the gigantic fact table in the middle of the CS warehouse’s schema doesn’t contain a product key.
While the original design of the warehouse may have dismissed the need for a product key, there is now a debt that must be paid. All the old semantic assets have to be adapted to include a product, and the star schema modified to include product as part of the fact. That will have significant knock-on effects, perhaps to the granularity of our fact table and any reports that depend on them. We will likely need to do some significant work to modify business logic and data models in order to pay down the debt in our semantic assets, and to ensure our downstream users see much the same data as before.
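The backfill step above can be sketched as follows; the legacy key value and the field names are assumptions for illustration, not a prescription. Every old fact row must be re-keyed with a product before it can sit in the modified star schema:

```python
# Assumed default: everything predating the new product line belongs
# to the original ("classic") product.
LEGACY_PRODUCT_KEY = "CLASSIC"

def backfill_product(fact_rows, product_key=LEGACY_PRODUCT_KEY):
    """Pay down the debt in an old semantic asset by adding the
    product key its fact rows never had."""
    return [dict(row, product_key=product_key) for row in fact_rows]

# A hypothetical old CS fact row: complaint and customer, but no product.
old_facts = [{"complaint_id": 1, "customer_key": "CUST-1"}]
new_facts = backfill_product(old_facts)
```

The code is a one-liner; the expensive part is the decision it encodes, that every historical complaint concerned the classic product, which someone must verify against the business rather than the database.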
We’ve seen, then, that the creation of semantic debt results naturally from an organization’s evolution. Semantic debt nearly always creates technical debt as well, but while the technical debt can often be trivially resolved, the conceptual work can be daunting and often requires significant rework.
We talked about four types of semantic debt:
- Debt that results from two incompatible but otherwise well-defined schemas.
- Debt that results from two incompatible natural keys.
- Debt that results from under-determined schemas or content.
- Debt that results from epistemic circularity.
While this debt may be proximally located in the semantic assets generated out of source systems, we should note that sometimes debt results from improperly scoped or designed assets. But even the most transparent semantic asset will be unable to overcome the debt it inherits from its parent.
In my next piece, I’ll talk more at length about how one can minimize semantic debt, and how you might pay it down over time.
[Writer’s Note: I was describing this problem to Dan Avery when he coined the term “semantic debt.” I think it’s a genius coinage, but I can’t claim credit for the term. This article benefited from feedback from Erin Yokota, John van der Koijk and Justin McCarthy.]