Data Mesh and Data as a Product
In the first article, I introduced the approach to application development called Domain-Driven Design (or DDD), explained some of the Data Management concerns with this approach, and described how a well-constructed data model can add value to a DDD project by helping to create the Ubiquitous Language that defines the Bounded Context in which application development occurs.
In the second article, I talked about the importance of modeling and persisting data at as high a level as possible, to ensure that services don’t have to make multiple calls to different subdomains to get fragments of data for a canonical data entity and then translate that data into a useable form for the application.
In the third article, I discussed various issues around the physical persistence and virtual representation of data in databases and other types of data stores.
In this article, I’d like to explore two additional subjects of interest: an approach to data architecture called Data Mesh, and the idea of Data as a Product. As I will show, both of these are related to the ideas we have already covered in the previous articles.
The “Data Mesh” concept was first proposed by Zhamak Dehghani, a ThoughtWorks consultant who coined the term.[i] In its simplest form, the idea is to move away from centralized storage and control of data in a Data Warehouse or Data Lake by a centralized Data Engineering or Data Warehouse team. Instead, data is managed by multiple teams of Domain Experts at a Domain (or business subject area) level. Each team manages the data in its Domain and creates “data products” that are then published to a central Repository (usually in the Cloud) for consumption by end users. Each team manages its own data pipelines and does its own data cleansing and transformation prior to publication.
The advantage, according to Data Mesh proponents, is that new sources and streams of data can be more easily accommodated by multiple teams, leading to increased scalability of an organization’s Data and BI solutions. In this view, a centralized Data or BI team creates bottlenecks and lengthens the time needed to respond to user requests, whereas decentralized Data teams can foster innovation and the creation of specialized data products. This is especially important when the meaning of data can change from one business Domain to another. For example, “Customer” in the Sales domain may contain sales leads, whereas in the Orders domain it may include only people who actually place orders for goods and services.
This is all well and good. However, there are several things that need to be said about this approach. Proponents of Data Mesh view it as a logical extension of Domain-Driven Design concepts, applied to “data products” consumed by business users instead of “persistence stores” consumed by applications and services. But, as we’ve already seen, this is almost the exact opposite of what Domain-Driven Design does! In true DDD, persistence stores are not curated by Domain Experts; they are created ad hoc by application developers who answer to no one outside of their team. Data in the persistence stores does not have to conform to any naming conventions or business rules, and each persistence store contains only the minimum data needed to support the functionality of applications and services within a particular subdomain. So, we have to ask: Where do the “data products” come from?
One hard truth that is never acknowledged, but needs to be, is that creating data for consumption by business users is a very different, and much harder, problem than creating data for consumption by applications and services. For one thing, as I’ve noted, data that supports applications and services does not have to be of the same quality as data that supports business users. Nor does application data have to be modeled or defined in business terms, or conform to business definitions or business rules. It’s quite acceptable, in an application data store, for a customer’s business telephone number to be called “custPhone” (or “cusTelNo”, or “phoneNbr”), be defined as a string, and contain data formatted in half a dozen different ways.
Another issue is that data consumed by applications and services does not have to be self-explanatory. The logic needed to consume and understand the data is coded within the application or service and does not need to be part of the database. So, it may not matter to an application that the customer phone data is stored in different formats under different names; the code logic can take care of this. But it will matter a lot to a business user trying to make sense of the data in Excel or Tableau!
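To make this concrete, here is a minimal sketch of the kind of logic that typically lives inside an application rather than in the database. The column names come from the example above; the function name business_phone, the sample records, and the assumption of ten-digit North American numbers are all hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

# Column names an application store might use for the same business attribute
# (taken from the example in the text).
PHONE_ALIASES = ("custPhone", "cusTelNo", "phoneNbr")


def business_phone(record: dict) -> Optional[str]:
    """Return the customer's business phone as a 10-digit string, or None.

    Assumes North American numbers, purely for illustration.
    """
    for name in PHONE_ALIASES:
        raw = record.get(name)
        if raw:
            # Keep digits only, whatever punctuation the source used.
            digits = re.sub(r"\D", "", str(raw))
            # Drop a leading country code of 1, if present.
            if len(digits) == 11 and digits.startswith("1"):
                digits = digits[1:]
            return digits if len(digits) == 10 else None
    return None


# Hypothetical rows from three different application data stores.
rows = [
    {"custPhone": "(555) 010-4477"},
    {"cusTelNo": "555.010.4477"},
    {"phoneNbr": "+1 555 010 4477"},
]
normalized = [business_phone(r) for r in rows]
print(normalized)
```

An application can hide this translation from its own users, but a business user opening the raw table in Excel or Tableau sees three column names and three formats, with none of this logic attached.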
The idea of “self-service” data for business users is not a new thing; that idea has been around for decades. But we know, or should know, by now that such data does not magically create itself. In order for data to be truly “self-service”, it has to meet the following criteria:
- The data has to be defined in business (not application) terms
- The data has to conform to business definitions and business rules
- The data has to be self-explanatory
- The data has to be of sufficient quality and currency to be useful
- The data has to be easy to find
- The data has to be easily accessible and consumable by non-technical people
- The data has to be secure from access by non-authorized people
Data Mesh advocates refer to these sorts of requirements for self-service data with the acronym DATSIS: Discoverable, Addressable, Trustworthy, Self-Describing, Inter-operable and Secure.
None of these are requirements for data consumed by applications and services. All of these are requirements for data consumed by business users. Software developers writing applications and services are not going to care about any of these, nor should they be expected to. This work needs to be done by Data professionals in an organization.
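One way to picture the DATSIS criteria is as a checklist that a data product must pass before publication. The sketch below is an assumption-laden illustration, not a standard: the DataProduct class, its field names, and the rule that a non-empty value counts as satisfying a criterion are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor for a data product awaiting publication."""
    name: str
    owner_domain: str
    catalog_entry: bool = False                       # Discoverable
    address: str = ""                                 # Addressable
    quality_checked: bool = False                     # Trustworthy
    column_docs: dict = field(default_factory=dict)   # Self-Describing
    data_format: str = ""                             # Inter-operable
    allowed_roles: set = field(default_factory=set)   # Secure


def datsis_gaps(product: DataProduct) -> list:
    """Return the DATSIS criteria the product does not yet satisfy."""
    checks = {
        "Discoverable": product.catalog_entry,
        "Addressable": bool(product.address),
        "Trustworthy": product.quality_checked,
        "Self-Describing": bool(product.column_docs),
        "Inter-operable": bool(product.data_format),
        "Secure": bool(product.allowed_roles),
    }
    return [criterion for criterion, ok in checks.items() if not ok]


# A product that has been documented and catalogued, but not yet
# quality-checked or secured.
draft = DataProduct(
    name="customer_orders",
    owner_domain="Orders",
    catalog_entry=True,
    address="warehouse.orders.customer_orders",
    column_docs={"customer_id": "Identifier of the ordering customer"},
    data_format="parquet",
)
gaps = datsis_gaps(draft)
print(gaps)
```

In practice each criterion would need a far richer test than a non-empty field, and deciding what those tests are is exactly the Data Management work described here.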
It should also be apparent that the goal of self-service data cannot be achieved if data is fragmented across multiple application persistence stores and defined differently within each subdomain. As noted in the previous articles, an effort must be made to determine at which level of the organization data should be properly defined. Some data may be domain-specific; other data may be canonical. Some data may need to exist in a curated MDM repository. At the very least, data that spans multiple domains and subdomains should be similarly named and defined, so that it is easier to get a holistic understanding of how data is used across the organization.
Also, it needs to be acknowledged that the criteria for self-service data listed above cannot be achieved unless an organization has a very high degree of “Data Literacy” (to use the current buzzword). In practical terms, this means that an organization must have implemented, and be actively using, the following Data Management tools and practices:
- Data Governance. Data Governance is the process by which an organization assigns Domain Experts (aka Data Stewards) to oversee the creation and usage of data. Note: one important part of Data Governance is understanding which business domains create and consume which sets of data. You literally can’t implement a domain-centric approach to data management without Data Governance!
- Data Quality Management. At a bare minimum, an assessment must be made of the quality, usability and business-relevance of the data at each data source that is going to be accessed and consumed by business users. If the data has been modeled correctly up-front and is fundamentally sound, the data may be fit to use as-is. Otherwise, if the number of defects is small, it might be possible to fix the problems in the creating (source) applications. Or, if the data is really bad, there may be no choice but to move the data to a Data Lake or Data Warehouse and fix it there.
- Data Catalog. A Data Catalog is what end users will use to quickly and easily find the data they need for a particular purpose. This is an especially important requirement for Data Mesh, since there is no one single place users can go to get the data they need.
- Data Marketplace. An extension of a Data Catalog, this tool gives users a place to publish datasets (aka Data Products) that they think will be useful to others.
- Metadata Management. In conjunction with a Data Catalog, business users will need access to metadata in order to understand where a given set of data comes from, how the data is used in the business, how current it is, what cleansing and transformations have been done to it, what it legally can and cannot be used for, etc. This is especially important for data entities that might have different meanings in different business domains. For example, “Customer” data might include sales leads in the Sales domain, but not in the Orders domain.
- Master Data Management. An MDM repository is essential for data that must be defined consistently across an organization. If not consumed directly from the MDM repository, this data must be published regularly to every data store from which it will be consumed.
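The three outcomes described under Data Quality Management above (use the data as-is, fix it in the source application, or move it elsewhere and fix it there) amount to a triage decision. The following is a rough sketch of that decision. The function names, the idea of reducing quality to a single defect rate, and the 1% and 10% thresholds are all assumptions made for illustration; a real assessment would also weigh usability and business relevance.

```python
def defect_rate(records, is_valid) -> float:
    """Fraction of records that fail a validity rule."""
    records = list(records)
    if not records:
        raise ValueError("no records to assess")
    bad = sum(1 for r in records if not is_valid(r))
    return bad / len(records)


def triage(rate: float, sound_max: float = 0.01, small_max: float = 0.10) -> str:
    """Map a defect rate to one of the three outcomes in the text.

    The thresholds are hypothetical defaults, not recommendations.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if rate <= sound_max:
        return "use as-is"
    if rate <= small_max:
        return "fix in source application"
    return "move to Data Lake or Data Warehouse and fix there"


# Hypothetical sample: one of four email values is unusable.
emails = ["a@example.com", "b@example.com", "", "c@example.com"]
rate = defect_rate(emails, lambda e: "@" in e)
decision = triage(rate)
print(rate, decision)
```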
The important thing that needs to be acknowledged is that Data Mesh, properly understood, is not an architecture or a collection of Cloud technologies. Data Mesh needs to be understood as a Data Governance Program! The essence of Data Governance is the assigning of responsibility for data meanings and values at a business domain level, where teams of domain experts and Data professionals (overseen by a higher-level Data Governance Council and Executive Steering Committee)[ii] work together to ensure that each Domain’s data assets are properly defined, managed, catalogued and published for consumption. It must also be acknowledged that these teams need to include Data Management expertise as well as Domain knowledge and technology skills (mostly in Cloud technologies). Most software developers do not have the Data Management expertise needed to support this work, nor is it usually their concern. So, whereas Domain-Driven Design for applications and services can be done by software developers familiar with Object-Oriented programming, Data Mesh can only be done by Data professionals conversant with Data Management tools and processes.
This is one of the critical differences between centralized approaches to data management, such as data warehouses and data lakes, and decentralized approaches such as data mesh. With a centralized approach, you need a single team of Data specialists (which admittedly can produce bottlenecks in delivery), but with decentralized approaches you need to have Data specialists on every domain team.
Another problem that needs to be addressed is the question of where and how Data Products will be published, and how they will be consumed. There are two basic approaches: Data Products can be published to a central repository (in which case you will need a centralized IT Data team to administer it), or the Data Products can be stored and published locally. In organizations that practice Domain-Driven Design, the Data Products are usually made available as APIs (i.e., services). This requires some specialized skills, as not everyone knows how to write (or consume) APIs. Also, the organization must have a centralized API management system and services catalog.
If Data Products are not published as APIs (if, for example, they are stored in one or more databases), then some sort of centralized Data Catalog and/or Metadata Repository will be needed so that end users can easily find the data they need. If the database(s) are relational in nature, then good SQL coding skills will be needed in order to query the data.
Finally, what about Data Products whose data spans multiple business domains? Data Mesh consultancy Monte Carlo expresses the problem like this:
Underlying each domain is a universal set of data standards that helps facilitate collaboration between domains when necessary— and it often is. It’s inevitable that some data (both raw sources and cleaned, transformed, and served data sets) will be valuable to more than one domain. To enable cross-domain collaboration, the data mesh must standardize on formatting, governance, discoverability, and metadata fields, among other data features. Moreover, much like an individual microservice, each data domain must define and agree on SLAs and quality measures that they will “guarantee” to its consumers.[iii]
In other words, cross-domain Data Products must be similarly modeled, and their common data attributes similarly defined. Moreover, the data within these Products must have similar business meanings, along with similar quality metrics and currency. This will involve a close degree of cross-collaboration between the domain teams.
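As a rough illustration of what “similarly modeled and similarly defined” might mean in practice, the sketch below compares the attributes that two domain Data Products have in common and reports any whose type or business definition differs. The schema representation (attribute name mapped to a type and a definition), the function name, and the sample Sales and Orders schemas are hypothetical.

```python
def shared_attribute_conflicts(schema_a: dict, schema_b: dict) -> dict:
    """Return shared attributes whose (type, definition) pairs differ.

    Each schema maps an attribute name to a (type, business definition) tuple.
    """
    conflicts = {}
    for attr in sorted(schema_a.keys() & schema_b.keys()):
        if schema_a[attr] != schema_b[attr]:
            conflicts[attr] = (schema_a[attr], schema_b[attr])
    return conflicts


# Hypothetical schemas echoing the "Customer" example used in this article.
sales = {
    "customer_id": ("string", "Identifier assigned by the MDM repository"),
    "customer": ("entity", "Any person or company, including sales leads"),
}
orders = {
    "customer_id": ("string", "Identifier assigned by the MDM repository"),
    "customer": ("entity", "Only people who have actually placed an order"),
    "order_date": ("date", "Date the order was placed"),
}

conflicts = shared_attribute_conflicts(sales, orders)
for attr, (in_sales, in_orders) in conflicts.items():
    print(attr, "| Sales:", in_sales[1], "| Orders:", in_orders[1])
```

A check like this can only flag the disagreement. Resolving it, by agreeing on one definition or by documenting both in the metadata, is the work of the domain teams and their Data Stewards.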
There is much to be said for the Data Mesh approach. But we should understand that the biggest difference between centralized approaches such as Data Warehouses and Data Lakes and decentralized approaches such as Data Mesh is this: You can have a Data Warehouse or a Data Lake without Data Governance. This is true in most organizations. But you absolutely cannot do Data Mesh without an organizational Data Governance program and corresponding Data Management processes in place. It simply will not work.
[i] Dehghani, Zhamak. “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”. May 20, 2019. martinfowler.com. https://martinfowler.com/articles/data-monolith-to-mesh.html.
[ii] See, for example, Chapter 9 of Robert Seiner’s book, Non-Invasive Data Governance (Technics Publications LLC, 2014). These higher-level oversight bodies are needed to help prevent the creation of “data silos”, and to ensure that data assets being created at the Domain level provide value to the entire organization.
[iii] Gavish, Lior and Barr Moses. “What is a Data Mesh, and How Not to Mesh it Up”. Blog post, August 2, 2022. https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to-mesh-it-up/.