Occasionally, I pick up on trends in my peripheral vision. These are trends that aren’t in the center of my professional field of view, but are out there on the edges. Obviously, these trends are in the center of someone’s field of view, and there are people out there who make a living tracking technology trends, so my apologies to anyone for whom this is yesterday’s news.
The Metrics Layer
A few weeks ago, I started seeing flashes on my periphery about something called “the metrics layer.” I’ve stumbled across pieces like this where they don’t really tell you what it is, but their product will address a bunch of problems you have as a result of not having a metrics layer. This one, though brief, is much more descriptive as to what a metrics layer is: it’s a layer on top of other layers (the data warehouse, for instance). About this time, you start wondering: what happened to my data warehouse such that I need another layer on top of it just to make sure I derive data from it consistently? Finally, this one does a pretty good job not only with the definition, but also with identifying the incumbents in the other parts of the data stack.
The Modern Data Stack
I had to pause for a minute. How can someone such as me, an apostle of the Data-Centric movement, be so unaware of this whole ecosystem? Most of our clients have numerous tools called out in the last article above, but I haven’t really paid much attention to why they have them or what they are doing with them.
This article points out that the modern data stack is composed of the following “layers” (sort of from bottom to top, though it is not a strict layering like the OSI telecommunications model):
- Data Orchestration
- Data Catalog
- Data Observability
- Cloud Data Warehouse
- Event Tracking
- Data Integration
- Data Transformation
- Reverse ETL
- Artificial Intelligence (this is the raison d’être for most stacks these days)
- Metric Store
- Data Analytics
A Data-Centric Architecture
Even more puzzling to me is how orthogonal this is to what we’re working on. You would think two tribes trying to solve sort of the same problem would accidentally align better than what I’m seeing. Week after next (historical by the time you read this), June 6-8, 2022, we’re hosting the fourth annual Data-Centric Architecture Forum. This year, we’re rolling up our sleeves and tackling what we’ve come to believe are the key challenges in deploying a Data-Centric Architecture, namely (in the order that they will be presented at the conference):
- Approaches to Graphical Visualization
- AI/ML and Data Science (Ok, busted, we’re fad surfing too)
- Unstructured Data
- URI Resolution
- Entity Resolution
- Federation (especially to relational/big data)
- Model Driven UI
- Constraint Management
- Sem Ops (DevOps for Semantic Graph driven systems)
It’s frightening how little overlap there is between a bunch of practitioners who are trying to orchestrate a data-centric future and the consensus view of the modern data stack (no wonder none of those vendors signed up for our conference).
Headless BI (Business Intelligence)
Just as I’m reflecting on this, I get another peripheral flicker from the other side of my field of view: Headless BI. The first Headless BI article I noticed was this one. It did a nice job of framing the problem. It wasn’t that article, but another I read around the same time (which I can’t find now) that pointed out that many firms have tens of thousands of metrics. This started blowing my mind.
As the cube.dev blog pointed out, the source of this explosion is the realization that “users can only use the metrics they’ve defined within the four walls of the visualization tool.” Wow. Yes. Duh. It’s like a nano version of application-centricity. Every user, every dashboard is redefining the metrics they use.
This camp proposes the solution is “Headless BI” which is essentially to build a BI tool without any UI, which just serves up consistent metrics. The consistent metrics get served up via APIs of course.
This might work, but look at all the coordination needed: first, to get Tableau users to find and then attach to a series of APIs when all their history and training leads them elsewhere. And there isn’t a natural and simple evolution from the itch (“I just need this one new metric for my dashboard”) to the scratch (“I have to contact a developer to build a headless BI endpoint and give me the API before I can proceed”).
Metrics in Data-Centric
While several of these articles mention semantics, it ends there. Neither the metrics layer nor headless BI seems to address what I see as the central problems:
- How to unambiguously define a new metric?
- How to compose metrics out of other metrics?
- How to find the metric you want?
In fairness, each of the approaches has something to say about these. One open-source product I drilled into had a way to define metrics in SQL. That kind of works, but it seems like it is just layering on top of something that isn’t workable. The example I saw suggested that cost per unit could be defined (in SQL) as REVENUE / NUMBEROFUNITS (can’t fight logic like that). But of course, the real problem is that you have thousands of tables, many of which have revenue, and there are many qualifiers to the revenue (time frame, product, geographic region, including tax, including freight, etc.), so which one do you take?
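To make the ambiguity concrete, here is a small sketch (the class and qualifier names are invented for illustration, not from any product) of why a bare formula like REVENUE / NUMBEROFUNITS under-specifies a metric: two metrics can share a name and a formula and still be different metrics once the qualifiers are pinned down.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class MetricDef:
    """A hypothetical metric definition: formula plus explicit qualifiers."""
    name: str
    formula: str                                      # e.g. "revenue / units"
    qualifiers: tuple = field(default_factory=tuple)  # (key, value) pairs

# Two metrics that a bare SQL definition would conflate:
gross = MetricDef("cost_per_unit", "revenue / units",
                  (("tax", "included"), ("freight", "included"), ("region", "EMEA")))
net   = MetricDef("cost_per_unit", "revenue / units",
                  (("tax", "excluded"), ("freight", "excluded"), ("region", "EMEA")))

assert gross != net  # same name, same formula, different metrics
```

The point of the sketch is only that the qualifiers must be part of the definition itself; otherwise “which revenue?” is answered silently and differently by every dashboard.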
The Data-Centric approach to metrics puts the definition of the metrics in the shared data. Not in the BI tool, not in code in an API. It’s in the data, right along with the measurement itself.
With Data-Centric principles, we just define metrics like we do anything else: rigorously and unambiguously. We haven’t yet done a project where the metrics resemble most of the examples I’ve been seeing. Those examples deal with conversion rates, order sizes, and net promoter scores, all of which are a bit simpler than the metrics we tend to end up with, which involve unit of measure conversions and multiple dimensions (where “dimension” is used as in the scientific measurement community).
The following measurement examples come from a variety of firms: a commodity price assessment company, an international central-bank-like organization, an industrial electric device manufacturer, an oncology hospital, a credit rating firm, and a professional services firm (see if you can tell which is which).
What Do They Share?
Separation of the Measurement From the Definition of What the Measurement Means
Every time we do this, we create a formal definition of what is being measured. For instance, the price of Brent Crude in Yen, Gross Domestic Product per Capita, Rated Breaking Capacity, All-Cause Mortality, U3 Unemployment and Consultant Utilization.
Explicit Units of Measure
Every measurement has a unit of measure. Some are simple one-dimensional measures like distance, weight, or time. Many are more complex and are either ratios or products. For instance, speed is the ratio of distance over time, and acceleration the ratio of speed over time. Area is the product of distance by distance. The definitions bottom out in one of the Système International (SI) base units: meter (length), kilogram (mass), second (time), ampere (electric current), kelvin (temperature), mole (quantity), and candela (brightness). We have added three more: dollar (monetary value), bit (information), and each (count).
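One minimal way to sketch this (the representation is my own illustration, not a description of any particular product) is to treat a complex unit as a map from base units to integer exponents. Ratios subtract exponents, products add them, and everything bottoms out in the base units:

```python
def combine(a, b, sign=1):
    """Compose two dimension maps: sign=+1 for a product, -1 for a ratio."""
    out = dict(a)
    for unit, exp in b.items():
        out[unit] = out.get(unit, 0) + sign * exp
        if out[unit] == 0:
            del out[unit]  # drop cancelled dimensions
    return out

length = {"m": 1}
time_  = {"s": 1}

speed = combine(length, time_, sign=-1)  # distance over time
accel = combine(speed, time_, sign=-1)   # speed over time
area  = combine(length, length)          # distance times distance

assert speed == {"m": 1, "s": -1}
assert accel == {"m": 1, "s": -2}
assert area  == {"m": 2}
```

Two units with the same exponent map measure the same kind of thing, which is exactly the property that makes conversion between them possible.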
Unit of Measure Conversion
To introduce a new simple unit (say, furlong), you are obligated to supply its conversion factor to one of the base units. If you were wondering, one furlong is 201.168 meters. By supplying this information, we not only know how to convert furlongs to any other distance measure, but we also know that this unit is in fact a distance measure (a distance unit is anything that can be converted to meters).
Two of the simple unit types require a bit more information to convert them. Most temperature units (Celsius, Fahrenheit, and Delisle) require an offset as well as a factor to get to and from kelvin, the base (Rankine, being an absolute scale, needs only a factor). And currencies require a date, as conversions are time sensitive.
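A sketch of that conversion scheme (the class is invented for illustration; the conversion factors are the standard ones): most simple units carry only a factor to their base unit, while temperature scales also carry an offset.

```python
class SimpleUnit:
    """Hypothetical simple unit: base_value = value * factor + offset."""
    def __init__(self, name, base, factor, offset=0.0):
        self.name, self.base = name, base
        self.factor, self.offset = factor, offset

    def to_base(self, value):
        return value * self.factor + self.offset

furlong = SimpleUnit("furlong", "meter", 201.168)       # a distance unit: it converts to meters
celsius = SimpleUnit("celsius", "kelvin", 1.0, 273.15)  # same size as kelvin, but offset

assert furlong.to_base(1) == 201.168
assert celsius.to_base(0) == 273.15   # 0 °C is 273.15 K
```

Because `furlong` declares `meter` as its base, the system can both convert it and classify it as a distance unit, which is the point made above.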
Ratios with Numerators and Denominators
For complex units, we need to supply either the numerator and denominator or the multiplicands. Eventually, everything bottoms out in base units and therefore all units can be converted.
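For instance, converting miles per hour to meters per second just means converting the numerator and the denominator separately (a sketch; the factors are standard, the code is my own illustration):

```python
MILE_IN_METERS = 1609.344   # conversion factor for the numerator
HOUR_IN_SECONDS = 3600.0    # conversion factor for the denominator

def ratio_factor(numerator_factor, denominator_factor):
    """Factor taking a ratio unit to the corresponding ratio of base units."""
    return numerator_factor / denominator_factor

mph_to_mps = ratio_factor(MILE_IN_METERS, HOUR_IN_SECONDS)

# 60 mph is 26.8224 m/s
assert abs(60 * mph_to_mps - 26.8224) < 1e-9
```

Since both mph and m/s reduce to the same base-unit ratio (meters over seconds), the conversion is guaranteed to exist, which is what “all units can be converted” amounts to.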
Sometimes, knowing the unit of measure for a measurement is not sufficient to disambiguate it. A pipe has at least three length measurements: the overall length, but also the inside and outside diameters, all of which are expressed in distance units.
One way to distinguish them would be to create separate properties for each variation. But this rapidly expands the size of the data model schema (the ontology) and puts maintenance of simple domain information in the hands of the ontologists. We generally find that introducing taxonomic distinctions (inside diameter, outside diameter, etc.) and relating them to either the units or the actual measurement via a “has aspect” relationship is the best tradeoff.
The choice as to whether to qualify the unit or the measure comes down to intended usage. Qualifying the unit creates a new unit of measure (“inside diameter in inches,” for example), whereas qualifying the measure is a bit like tagging it.
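As a triple-style sketch (the predicate and identifiers here are invented for illustration, not from any published ontology), tagging a measurement with an aspect looks like this:

```python
# A pipe's inside-diameter measurement, as subject-predicate-object triples.
triples = {
    (":pipe42_m1", ":value", 4.5),
    (":pipe42_m1", ":unit", ":inch"),
    (":pipe42_m1", ":hasAspect", ":InsideDiameter"),  # the tag that disambiguates
}

# The aspect tag distinguishes this length from overall length or OD
# without minting a new property for every variation.
aspects = {obj for subj, pred, obj in triples if pred == ":hasAspect"}
assert aspects == {":InsideDiameter"}
```

The aspect terms live in a taxonomy that domain experts can extend, which is what keeps this maintenance out of the ontologists’ hands.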
Qualifiers for Aggregates
The above gives us an unambiguous definition of anything we can measure. But metrics are mostly about aggregation.
Here is where we may learn something from Headless BI or the Metrics Layer (or even use some of these products).
The simplest aggregate is the ephemeral one: the one that exists only at its point of consumption. We can write these as GROUP BYs in a query. In a more model-driven approach, we could define the dimensions and filters to be used for the aggregation and store them as data. This way, we have a formal definition of the aggregate results, but the results are still ephemeral.
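A sketch of what “stored as data” could mean (the spec format and field names are invented for illustration): the dimensions and filter live as a declarative spec, and the ephemeral result is computed from it on demand.

```python
from collections import defaultdict

# The aggregate's definition, held as data rather than buried in a query.
spec = {"measure": "revenue", "group_by": "region",
        "filter": lambda row: row["year"] == 2021}

rows = [
    {"region": "EMEA", "year": 2021, "revenue": 100},
    {"region": "EMEA", "year": 2020, "revenue": 70},
    {"region": "APAC", "year": 2021, "revenue": 40},
]

def run_aggregate(spec, rows):
    """Execute a stored aggregate spec: filter, then GROUP BY and sum."""
    totals = defaultdict(float)
    for row in filter(spec["filter"], rows):
        totals[row[spec["group_by"]]] += row[spec["measure"]]
    return dict(totals)

assert run_aggregate(spec, rows) == {"EMEA": 100.0, "APAC": 40.0}
```

The result is thrown away after consumption, but the definition persists, so two dashboards asking for “2021 revenue by region” are guaranteed to mean the same thing.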
We are just starting to experiment with unambiguous definitions for accounting systems. Until I sat down to author this paper, I wouldn’t have thought of financial reports as “metrics,” but that’s what they are. If Revenue and Profit aren’t Key Performance Indicators, I don’t know what is. What we are now looking at is a subset of the aggregates in a financial reporting system, specifically the sums that make up the figures on the published financials. These should be persisted. And the act of persisting can link the sums to the figures that contributed to them (much as double-clicking a cell in a pivot table brings up the rows that contributed to it). This approach has the potential to transform how companies respond to their Sarbanes-Oxley obligations.
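A sketch of a persisted aggregate with provenance (the structure is invented for illustration): the stored figure carries links to the rows that produced it, not just the number.

```python
# Source figures, keyed by a hypothetical transaction id.
contributing_rows = {"tx1": 100.0, "tx2": 250.0, "tx3": 400.0}

# The persisted aggregate keeps the links, enabling pivot-table-style drill-down.
persisted = {
    "figure": "Q1 Revenue",
    "value": sum(contributing_rows.values()),
    "contributors": sorted(contributing_rows),  # row ids, not just the total
}

assert persisted["value"] == 750.0
assert persisted["contributors"] == ["tx1", "tx2", "tx3"]
```

An auditor asking “where did this number come from?” can follow the contributor links instead of reverse-engineering the query that produced it.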
The final question is: should arbitrary, analytical aggregates be cached? It seems like a fully generalized approach would end up recreating a resource-oriented computing framework. See this source for more detail on how this would work.
The Steve Jobs Moment
I’m having a realization that reminds me of Steve Jobs’ iPhone reveal. In the famous 2007 Macworld keynote, he announced that Apple would be launching three new products: a touchscreen iPod, an internet communication device, and a mobile phone. If you haven’t seen it, it’s a classic.
Three products: iPod, mobile phone, internet device. “Are you getting it?” Not three separate devices.
Headless BI, Metrics Layer, Data-Centric Architecture. Three products. Get it?
There is a Real Problem to be Solved
But as with the application-centric quagmire, it is a creature of our own creation. I reluctantly accept that with the current state of fragmentation of data, there may well be a place for a metrics layer and headless BI. There may be a need for the modern data stack. But I think these are transient needs. When the data model is simple and much of the data of the firm has been conformed to the model, most of the problems the modern data stack seeks to address will have been sidestepped.