I love the term “Zero Copy Integration.” I didn’t come up with it; it was the Data Collaboration Alliance[1] that coined it. The Data Collaboration Alliance is a Canadian-based advocacy group promoting localized control of data along with federated access.
What I like about the term is how evocative it is. Everyone knows that all integration consists of copying and transforming data, whether you do that through an API, through ETL (Extract, Transform, and Load), or through data-lake-style ELT (Extract, Load, and leave it to someone else to maybe eventually Transform). Either way, we know from decades of experience that integration is, at its core, copying data from a source to a destination.
This is why “copy-less copying” is so evocative. It forces you to rethink your baseline assumptions.
We like it because it describes what we’ve been doing for years but never had a name for. In this article, I’m going to drill a bit deeper into the enabling technology (i.e., what you need to have in place to get Zero Copy Integration to work), then do a case study, and finally wrap up with “do you literally mean zero copy?”
Enabling Technology
To consider what the enabling technology is, we first need to consider what the disabling technology is. In other words, what is it about traditional technology that prevents us from doing this right now?
The main issue is that each application system or tool is completely hidebound by its own data model. Integrating two applications, or integrating an application with a data warehouse, means getting data expressed in a source model to conform to the target model. This isn’t as simple as just mapping different terms, although even that is a chore. If a source system calls an employee a “worker” and a target system calls them “staff,” there is a bit of mapping to do. But that is trivial compared to the deeper problems with integration.
The real problems come when one system structures the data differently, or one system operates at a different level of abstraction. It is exacerbated when two systems use different identifiers for the same (or worse, similar) items. This happens all the time and is harder to overcome than it first sounds. This is essentially why integration is hard. Even the famous edict of Jeff Bezos, that all applications must communicate through APIs, does not solve this problem. Each API is a data model. Each API has its own language. Each API exposes data in some level of abstraction or another. Each API is a front for an application that uses its own local identifiers.
So of course, every integration copies data from the source model format to the target.
In order to break this spell, we need four things (which at first sound difficult, but when you work in this stack you realize they just come along for free):
- A single shared structure for all users of the data
- A shared semantics for all users of the data
- Shared identifiers
- Cross-vendor query federation
I’m talking about Semantics and Knowledge Graphs. The shared structure is the triple. There is no other structure, and therefore no need to restructure your data. RDF is the structure. The brief sketch below, and the case study later on, make this more concrete.
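As a minimal, purely illustrative sketch (the ex: namespace and property names are invented, not drawn from any real core model), here is what an “employee” row from any source system looks like once it is reduced to triples, expressed as a SPARQL update:

```sparql
# Hypothetical example: an employee record reduced to subject-predicate-object
# triples. The ex: namespace and property names are illustrative only.
PREFIX ex: <https://example.org/model/>

INSERT DATA {
  ex:Emp1 ex:hasName     "Pat Smith" ;   # who the employee is
          ex:worksFor    ex:Org1 ;       # which organization they belong to
          ex:hasJobTitle "Analyst" .     # their role
}
```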
The shared semantics are in what we call the core model. In many other articles in this series, and in my books, I emphasize the importance of having a shared model of all the information of the firm. Many are reluctant because they believe it is not achievable. It is. We have proven in dozens of major engagements that even complex enterprises have a relatively simple set of concepts (fewer than 1,000) at their core. Converging on this shared model provides the ability to interrogate once and receive data from many sources.
To the third point, shared identifiers are URIs/IRIs. The magic of URIs/IRIs is that they do not need metadata, and they do not need documentation to be useful. An ID in a relational table only means something if you know the database, table, and column (the metadata). The same is true of a key in a JSON document or a field in an API. The identity is clouded.
In RDF, the identifier is globally unique (best practice, which is followed by 99% of all practitioners, is to base identifiers either on a domain name you own or on one you can rely on, such as purl.org or a WebID). Also, best practice is to provide, subject to authorization, the ability to resolve a URI/IRI. The URI is a uniform identifier, not a universal identifier. That means that the URI refers to a single thing, but it might not be the only identifier that does so. As we discover, through entity resolution, that two URIs refer to the same thing, we can note that (with another triple, remember that’s the only structure) in the form :URI1 owl:sameAs :URI2 .
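To make that concrete, here is a hedged sketch of a query that follows those sameAs links, assuming :URI1 owl:sameAs :URI2 has already been asserted and the store does not materialize sameAs inferences on its own (the empty prefix is illustrative):

```sparql
# Hypothetical example: collect every fact recorded against either identifier
# by walking owl:sameAs links in both directions, zero or more hops.
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX :    <https://example.org/id/>

SELECT ?p ?o
WHERE {
  :URI1 (owl:sameAs|^owl:sameAs)* ?same .  # :URI1 itself, plus anything sameAs it
  ?same ?p ?o .                            # all facts about any of those identifiers
}
```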
Finally, the ability to seamlessly federate is a key ingredient in avoiding copying. The traditional data warehouse or data lake has made the implicit decision that integration = co-location, and of course co-location implies copying. The standards-compliant triple store (RDF graph) database vendors all support the SPARQL protocol, which allows for transparent federation of queries. Obviously, federation has the potential to slow things down, and some planning is usually in order.
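For example, a single SPARQL 1.1 federated query can span two stores without copying either; the endpoint URLs and property names below are placeholders, not real systems:

```sparql
# Hypothetical example: join employee emails from an HR graph with cost
# centers from a finance graph, in place, via SERVICE clauses.
PREFIX ex: <https://example.org/model/>

SELECT ?emp ?email ?costCenter
WHERE {
  SERVICE <https://hr-graph.example.org/sparql> {
    ?emp ex:hasEmail ?email .
  }
  SERVICE <https://finance-graph.example.org/sparql> {
    ?emp ex:billedTo ?costCenter .
  }
}
```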
Case Study – Investment Bank
We worked with a large investment bank. Our first project with the bank was in the legal department. Specifically, we were helping with the automation of records retention classification. To do this, we extracted data from a number of systems. We got all of the SharePoint sites, shared folders, and shared databases from several different sources, and turned them into triples, conforming to the core model we had built in parallel. We extracted their organizational structure and cost centers from their financial systems, “triplified” them, and conformed them to the core model. We also obtained key data about employees and contractors from HR and various entitlement systems — triples that were likely to help us with the process of classifying information.
Our sponsor’s hypothesis, which turned out to be true, was that a very good first cut at retention classification could be built by introspecting the contextual data about a set of records. That is, what part of the organization owned the repository, what cost center it was billed to, who set up and contributed to the repository, and what their job descriptions were.
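A query over that contextual data might look something like the following sketch; all of the property names are invented for illustration, and the engagement’s actual core model is not reproduced here:

```sparql
# Hypothetical example: for each repository, pull the owning organization,
# the cost center it is billed to, and the job titles of its contributors.
PREFIX ex: <https://example.org/model/>

SELECT ?repo ?owningOrg ?costCenter ?jobTitle
WHERE {
  ?repo   ex:ownedBy       ?owningOrg ;
          ex:billedTo      ?costCenter .
  ?person ex:contributedTo ?repo ;
          ex:hasJobTitle   ?jobTitle .
}
```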
The project worked well and spawned a longer-term project within the firm to further refine this and build on the initial work. Meanwhile, we extended the core in many other domains within the firm.
Years later, on a Tech Asset project, the sponsor needed employee data. We told them that this earlier project had triplified most of the employee data they needed, and that a pipeline was keeping the data fresh.
So strong is the siren’s song of copy-based integration that the sponsors said, “Great, let’s get a copy of that employee data and adapt it to our own use.”
We had to remind them that there was no need to copy it, and that no restructuring was needed or even possible. Each fact about an employee is a triple. We can say :Emp1 :reportsTo :Org1, or :Emp1 :hasEmail “xx@xx.com”. A consumer of this data can either say, “I’m not interested in some of those facts,” or they can add facts of their own, but there is no need to restructure, rename, or in any other way change the source data.
And with federation they can consume it as needed.
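As a hedged sketch of what that consumption could look like (the endpoint URL and property names are placeholders), the Tech Asset project can read just the employee facts it cares about over federation and record its own facts alongside them:

```sparql
# Hypothetical example: consume a subset of the shared employee facts over
# federation, then assert new, project-specific facts locally. Nothing in
# the source graph is copied or reshaped.
PREFIX ex: <https://example.org/model/>

SELECT ?emp ?org
WHERE {
  SERVICE <https://employee-graph.example.org/sparql> {
    ?emp ex:reportsTo ?org .   # only the facts this project is interested in
  }
}

# Issued separately as a SPARQL update against the project's own graph:
# INSERT DATA { ex:Emp1 ex:ownsTechAsset ex:Asset42 . }
```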
Do I Literally Mean ‘Zero Copy’?
Astute readers will have noticed that even in the case study we copied data from the HR and entitlements systems into the triple store. We also mentioned there was a pipeline keeping data fresh, which of course also implies copies.
As long as the legacy systems remain in place, copies will be needed. But we are headed for a long-term future where legacy systems aren’t needed. Once the knowledge graph is in place, new functionality can be built directly in place. At that point, there truly will be no more need for copying, as the source / golden / master will be the original.
In the meantime, maybe we should call this the “Last Copy” you will need to do.
Further, as we cautioned when introducing federation, the act of federating queries does introduce coordination and latency concerns. The solution is often to create a local copy. However, this copy is more like a cache than a copy that has had human labor applied to transform it from one form to another. For our purposes, a cache does not count as a copy (or at least not as a copy that needs to be eliminated).
Summary
“Zero Copy Integration” is a term coined by the Data Collaboration Alliance. It simultaneously challenges orthodox thinking and strongly hints at the requirements such an architecture will need. We are completely in favor of this meme.
[1] https://www.datacollaboration.org/zero-copy-integration