We have taken the position that a core model is an essential part of your data centric architecture. In this article, we will review what a core model is, how to go about building one, and how to apply it both to analytics and to new application development.
What is a Core Model?
A core model is an elegant, high fidelity, computable, conceptual, and physical data model for your enterprise.
Let’s break that down a bit.
By elegant we mean appropriately simple, but not so simple as to impair usefulness. All enterprise applications have data models. Many of them are documented and up to date. Data models come with packaged software, and often these models are either intentionally or unintentionally hidden from the data consumer. Even hidden, their presence is felt through the myriad of screens and reports they create. These models are the antithesis of elegant. We routinely see data models meant to solve simple problems with thousands of tables and tens of thousands of columns. Most large enterprises have hundreds to thousands of these data models, and are therefore attempting to manage their datascape with over a million bits of metadata.
No one can understand or apply one million distinctions. There are limits to our cognitive functioning. Most of us have vocabularies in the range of 40,000-60,000 words, which suggests the upper limit of a domain that people are willing to spend years to master.
Our experience tells us that at the heart of most large enterprises lies a core model that consists of fewer than 500 concepts, qualified by a few thousand taxonomic modifiers. When we use the term “concept” we mean a class (e.g., set, entity, table, etc.) or property (e.g., attribute, column, element, etc.). An elegant core model is typically 10 times simpler than the application it’s modeling, 100 times simpler than a sub-domain of an enterprise, and at least 1000 times simpler than the datascape of a firm.
An overly simple model is simplistic and therefore not terribly useful. Sure, we could build a model that says: customers place orders for products. This is literally true but not sufficiently detailed to build systems that drive analytics. This is the main reason that application data models have increased in complexity: an attempt to represent requisite detail.
But virtually every application we’ve looked at has way overshot the mark and executed poorly, to boot. Application developers, who really drive their data models, tend to do one of two things when they encounter a requirement: write some code to address it, or amend the data model to address it (and then write some more code). It rarely occurs to them (and in fairness, they haven’t had access to approaches and techniques that would make a difference even if it had occurred to them) to consider a way to represent the distinction that would be reusable. Very often the additions being made to a model are “distinctions without a difference.” That is, they add something that was “required” but never used in a way that affected any outcome.
Our lens for fidelity is this: if the distinction is needed to support “data structuration,” business rules, or classification for retrieval or analytics, then you need that distinction in the model. “Data structuration” is a term our European customers use, and one that I have grown very fond of. It essentially means making decisions around how you want your data structured. So if you decide that you need to store different information on exempt employees versus non-exempt employees (e.g., monthly salaries versus hourly rates and overtime rates), then you need to be able to represent that distinction in the model.
For business rules, if you charge more to insure convertibles than hardtop cars, then the model has to have a place to keep the distinction between the two. If your users want to sort their customer lists between VIPs and riff raff, then the model needs that distinction. If your analytics need to aggregate and do regressions on systolic versus diastolic blood pressure readings, then you must keep that distinction.
You’d be forgiven for believing this justifies the amount of complexity found in most data models. It doesn’t. For the most part, these necessary distinctions are redundantly (but differently) stored in different systems, and distinctions that could easily be derived are modeled as if they could not be.
The real trick we have found is determining which distinctions warrant being modeled as concepts (classes or properties), and which can be adequately modeled as taxonomic distinctions. The former have complex relationships between them. Changing them can disrupt any systems depending on them. The latter are little more than tags in a controlled vocabulary, which are easier to govern and evolve in place.
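The difference between a concept and a taxonomic distinction can be sketched in a few lines. This is a minimal illustration, not a recommended implementation: the `Vehicle` class, the `BODY_STYLES` vocabulary, the `premium` rule, and its 25% surcharge are all hypothetical names and values invented for the example.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary: a taxonomic distinction is just a term here.
BODY_STYLES = {"convertible", "hardtop"}


@dataclass
class Vehicle:
    """A concept (class). Other systems depend on its shape, so it changes rarely."""
    vin: str
    body_style: str  # a tag drawn from the controlled vocabulary above

    def __post_init__(self):
        if self.body_style not in BODY_STYLES:
            raise ValueError(f"unknown body style: {self.body_style}")


def premium(vehicle: Vehicle, base: float = 1000.0) -> float:
    """Illustrative business rule: convertibles cost more to insure."""
    surcharge = 1.25 if vehicle.body_style == "convertible" else 1.0
    return base * surcharge


# Evolving the taxonomy in place: one new term, no new class or property.
BODY_STYLES.add("targa")
```

Had each body style been modeled as its own subclass of `Vehicle`, adding “targa” would have meant a structural change that every dependent system has to absorb. As a tag, it is a one-line vocabulary update.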
The other trick for incorporating the needed distinctions without letting complexity get away from you is the judicious use of faceting. For reasons that we can pin on Carl Linnaeus and Melvil Dewey, taxonomists today feel the urge to create a single giant, rooted taxonomic tree to represent their domain. There are almost always many smaller, orthogonal facets trapped in those big trees, and extricating them will not only reduce the overall complexity, it has the added benefit of making the pieces far more reusable than the whole.
We have found the secret to high fidelity coupled with elegance is in moving as many distinctions as possible to small, faceted taxonomies.
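To see why faceting reduces complexity, consider this small sketch. The three facets and their terms are hypothetical, chosen only to make the arithmetic visible: a single rooted tree has to enumerate every combination as a leaf, while a faceted design governs only the individual terms.

```python
from itertools import product

# Hypothetical facets: each is a small, flat, independently governed vocabulary.
FACETS = {
    "material": ["steel", "aluminum", "plastic"],
    "finish": ["painted", "anodized", "raw"],
    "use": ["indoor", "outdoor"],
}

# One giant tree needs a leaf for every combination of terms.
tree_leaves = list(product(*FACETS.values()))

# The faceted design only has to manage the terms themselves.
facet_terms = sum(len(terms) for terms in FACETS.values())


def classify(**tags):
    """Tag an item with terms from any subset of facets, rejecting unknown terms."""
    for facet, term in tags.items():
        if term not in FACETS.get(facet, []):
            raise ValueError(f"{term!r} is not in facet {facet!r}")
    return tags
```

With these assumed vocabularies, the tree has 3 × 3 × 2 = 18 leaves against 8 facet terms, and the gap widens multiplicatively with every facet added. A call such as `classify(material="steel", use="outdoor")` also shows the reuse benefit: an item can be tagged on only the facets that apply to it.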
A computable model is one that a program can do something useful with directly.
By analogy, it is the difference between a paper Rand McNally road map and Google Maps: both model the same territory, but one might be more or less useful for the purpose at hand, and either could be more detailed. But the Google Map is computable in a way the paper map isn’t. You can ask Google Maps for a route between two points or what coffee shops are nearby. You can ask Rand McNally all day long, but nothing will happen.
A data model on a whiteboard is not computable. One in Visio is barely more so. Sophisticated data modeling tools give some computability, but this is often not available in the final product (in the same way that Rand McNally probably uses geographic information system software to build its maps and could have run Google-like queries in its design environment, but that capability is no longer present in the delivered map).
The core model that we advocate continues to be present, in its original design form, in the delivered application, and can be interrogated in ways previously only available in the design environment.
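As a rough sketch of what “interrogating the model” could look like in a delivered application, assume the model ships with the application as a list of (subject, predicate, object) statements. The `MODEL` contents and the `isA` and `appliesTo` predicates are hypothetical simplifications; a real deployment would more likely hold the model in a graph database.

```python
# Hypothetical model fragment, shipped with the application as plain statements.
MODEL = [
    ("Person", "isA", "Class"),
    ("Organization", "isA", "Class"),
    ("name", "isA", "Property"),
    ("startDate", "isA", "Property"),
    ("name", "appliesTo", "Person"),
    ("name", "appliesTo", "Organization"),
    ("startDate", "appliesTo", "Organization"),
]


def classes():
    """Ask the model which classes it defines."""
    return sorted(s for s, p, o in MODEL if p == "isA" and o == "Class")


def properties_of(cls):
    """Ask the model which properties apply to a given class."""
    return sorted(s for s, p, o in MODEL if p == "appliesTo" and o == cls)
```

Because the model is data rather than a diagram, the running application can call `properties_of("Organization")` to decide, for example, which fields to show on a form, without a developer translating the design by hand.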
Conceptual and Physical
Received wisdom these days is that a data model is either a conceptual model, a logical model, or a physical model. This is mostly driven by the analogy with construction, where a conceptual model is the architect’s drawings, the logical model is the blueprints, and the physical model is the as-built.
In the data world, these models are often derived from each other. More specifically, the logical is derived from the conceptual and the physical from the logical. Sometimes these derivations are partially automated. To the extent the transformation is automated, there is more likely to be some cross reference between the models, and there is more possibility that a change will be made in the conceptual model and propagated down. But in practice, this is rarely done.
However, the need for three models has more to do with the state of tooling and technology decades ago than what is possible now. Applications can now be built directly on top of graph databases. The graph database makes it possible to have your cake and eat it too (with regard to structure). The graph database, when combined with the new standard SHACL, allows application builders to define minimum structure that will be enforced. At the same time, the inherent flexibility of the graph database, coupled with the open world assumptions of OWL, allows us to build models that have structure but are not limited by that structure.
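The idea of enforcing a minimum structure without closing the model can be illustrated in plain Python. This is a stand-in for the concept only: it is not SHACL syntax and does not use any SHACL library, and the `PERSON_SHAPE` definition and property names are assumptions made for the example.

```python
# Hypothetical "shape": the minimum every Person record must carry.
PERSON_SHAPE = {"name": str, "birthDate": str}


def violations(record, shape):
    """Report only breaches of the minimum structure.

    Properties that the shape does not mention are allowed, which mirrors
    the open world stance: structure is required, but it is not a limit.
    """
    problems = []
    for prop, expected in shape.items():
        if prop not in record:
            problems.append(f"missing required property: {prop}")
        elif not isinstance(record[prop], expected):
            problems.append(f"{prop} should be {expected.__name__}")
    return problems


ada = {"name": "Ada", "birthDate": "1815-12-10", "favoriteColor": "green"}
incomplete = {"birthDate": "1815-12-10"}
```

Here `violations(ada, PERSON_SHAPE)` returns an empty list even though `favoriteColor` was never declared, while `violations(incomplete, PERSON_SHAPE)` reports the missing name. A relational table, by contrast, would reject the undeclared column outright.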
Because the data model uses URIs as identifiers, once a concept has been defined (e.g., in the equivalent of a conceptual model), the exact same URI is used in the equivalent of the logical and physical models. The logical conclusion is that the conceptual, logical, and physical are the same.
The real shift that needs to happen is a mental one. We’ve been separating conceptual, logical, and physical models for decades. We have a tendency to do conceptual modeling at a more abstract level, but this isn’t necessary. If you start your conceptual core modeling project with “concrete abstractions,” they can be used just as well in implementation as in design. Concrete abstractions are concepts that, while they are at a more general level, can be implemented directly. Classes such as Person, Organization, Event, and Document fit this description, as do properties such as hasPart, hasJurisdiction, governs, startDate, and name.
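A small sketch may help make the “same URI everywhere” point concrete. The `https://example.org/core/` namespace and the `_ada` individual are hypothetical; the `rdf:type`, `rdfs:label`, and `owl:Class` URIs are the standard W3C ones. The graph is modeled as a plain Python set of triples rather than a real graph database.

```python
# Hypothetical namespace for the core model.
EX = "https://example.org/core/"
PERSON = EX + "Person"
NAME = EX + "name"

# Standard W3C vocabulary URIs.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"

# Model statements and instance data live side by side in one graph.
graph = {
    (PERSON, RDF_TYPE, OWL_CLASS),        # the "conceptual" definition
    (PERSON, RDFS_LABEL, "Person"),
    (EX + "_ada", RDF_TYPE, PERSON),      # the "physical" data, same URI
    (EX + "_ada", NAME, "Ada Lovelace"),
}


def instances_of(cls):
    """Find stored individuals using the very URI that defines the class."""
    return sorted(s for s, p, o in graph if p == RDF_TYPE and o == cls)


def label_of(uri):
    """Look up the human-readable label recorded in the model."""
    return next((o for s, p, o in graph if s == uri and p == RDFS_LABEL), None)
```

There is no mapping layer between a design-time `Person` and a run-time table: the identifier that appears in the definition is the one that appears in the stored data, so `instances_of(PERSON)` and `label_of(PERSON)` both work against the same graph.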
This article has defined what a “core model” is and put some parameters around that. In the next installment, we will provide advice on how to build a core model, how to apply it in analytic applications, and how to use it to change the nature of application development.