In the past few weeks, we (the author’s company) hosted a new Data-Centric Architecture Conference. One of the conclusions the participants reached was that it wasn’t like a traditional conference. This wasn’t marching from room to room to sit through another talking-head, PowerPoint-led presentation. There were a few PowerPoint slides that served as anchors, but it was much more a continual co-creation of a shared artifact.
The consensus was:
- Yes, let’s do it again next year.
- Let’s call it a forum, rather than a conference.
- Let’s focus on implementation next year.
- Let’s make it a bit more vendor-friendly next year.
So retrospectively, last week was the first annual Data-Centric Architecture Forum.
What follows are my notes and conclusions from the forum.
Shared DCA Vision
I think we came away with a great deal of commonality and more specifics on what a DCA needs to look like and what it needs to consist of. The straw-man (see appendix A) came through with just a few revisions (coming soon). More importantly, it grounded everyone on what was needed and gave a common vocabulary about the pieces.
Uniqueness
With all the brain power in the room, and given that people have been looking for this for a while, I think that if anyone had known of a platform or set of tools that provided all of this out of the box, they would have said so once we had described what such a solution entailed.
I think we have outlined a platform that does not yet exist and needs to. With a bit of perseverance, next year we may have a few partial (maybe even more than partial) implementations.
Completeness
After working through this for 2 ½ days, I think if there were anything major missing, we would have caught it. Therefore, this seems to be a pretty complete stack. All the components, and at least a first cut at how they are related, seem to be in place.
Doable-ness
While there are a lot of parts in the architecture, most of the people in the room thought that most of the parts were well-known and doable.
This isn’t a DARPA challenge to design some state-of-the-art thing; it is more a matter of putting together pieces we already understand.
Vision v. Reference Architecture
As noted right at the end, this is a vision for an architecture, not a specific architecture or a reference architecture.
Notes From Specific Sessions
DCA Strawman
Most of this was already covered above. I think we eventually suggested that “Analytics” might deserve its own layer. You could say that analytics is a “behavior,” but that seems to be burying the lead.
I also thought it might be helpful to spell out some of the specific key APIs the architecture suggests. It also looks like we need to split the MDM style of identity management from user identity management, both for clarity and for positioning in the stack.
State of the Industry
There is a strong case to be made that knowledge-graph-driven enterprises are eating the economy. Part of this may be because network-effect companies are sympathetic to network data structures, but we think the case can be made that the flexibility inherent in KGs applies to companies in any industry.
According to research that Alan provided, the average enterprise now runs 1,100 different SaaS services. This is fragmenting the data landscape even faster than legacy systems did.
Business Case
A lot of the resistance isn’t technical, but instead tribal.
Even within the AI community there are tribes with little cross-fertilization:
- Symbolists
- Bayesians
- Statisticians
- Connectionists
- Evolutionaries
- Analogizers
On the integration front, the tribes are:
- Relational DB Linkers
- Application-Centric ESB Advocates
- Application-Centric RESTful developers
- Data-centric Knowledge Graphers
Model-Driven / Low-Code / No-Code / Declarative Systems
We didn’t spend as much time as I had hoped on model-driven development. I think the general consensus was:
- Model-driven isn’t essential for a DCA
- A DCA makes model-driven much easier, and it is a natural fallout of thinking this way
- There is a lot of model-driven out there that isn’t DCA
- The more of a system you can reduce to a model and put back into the graph, the easier impact analysis, evolution, and governance become (a small sketch follows this list)
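To make “putting the model back into the graph” concrete, here is a minimal sketch. It assumes rdflib and an invented ex: vocabulary (the form and property names are hypothetical, not something from the forum); the point is only that once a UI form is described as triples, impact analysis is just another query.

```python
# A minimal sketch of model-driven in a DCA: the UI form definition itself
# lives in the graph as triples, so "which screens touch this property?"
# is answered by the same query machinery as everything else.
# The ex: namespace and property names are hypothetical.
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.com/dca#")
g = Graph()
g.bind("ex", EX)

# Describe a form and its fields as data, not code.
g.add((EX.ClaimForm, RDF.type, EX.Form))
g.add((EX.ClaimForm, EX.rendersClass, EX.WorkersCompClaim))
g.add((EX.ClaimForm, EX.hasField, EX.claimantField))
g.add((EX.claimantField, EX.boundToProperty, EX.claimant))
g.add((EX.claimantField, EX.label, Literal("Claimant")))

# Impact analysis: which forms would be affected if ex:claimant changed?
impact_query = """
SELECT ?form WHERE {
  ?form ex:hasField ?field .
  ?field ex:boundToProperty ex:claimant .
}
"""
for row in g.query(impact_query, initNs={"ex": EX}):
    print(row.form)
```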
NLP
NLP and DCA are a match made in heaven.
- Ontology-driven NLP is going to surface better and more consistent triples from a corpus and make them available to the users of the architecture (see the sketch after this list)
- NLP can help extend the ontology by looking for specialized classes in a corpus
- NLP can participate in turning questions into queries as a key part of the UI for a DCA
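As a toy illustration of the first point (and emphatically not a production NLP pipeline), the sketch below uses the labels already in an ontology as a gazetteer: it scans a corpus for those labels and emits provisional “mentions” triples for a person or downstream process to confirm. It assumes rdflib; the ex: namespace and the ex:mentions predicate are invented for illustration.

```python
# A toy sketch of ontology-driven extraction: ontology labels act as a
# gazetteer, matches become provisional triples. Namespaces are hypothetical.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.com/dca#")

onto = Graph()
onto.add((EX.WorkersCompClaim, RDF.type, RDFS.Class))
onto.add((EX.WorkersCompClaim, RDFS.label, Literal("workers compensation claim")))
onto.add((EX.Employee, RDF.type, RDFS.Class))
onto.add((EX.Employee, RDFS.label, Literal("employee")))

corpus = ["The employee filed a workers compensation claim last March."]

found = Graph()
found.bind("ex", EX)
for i, sentence in enumerate(corpus):
    text = sentence.lower()
    for cls, label in onto.subject_objects(RDFS.label):
        if str(label) in text:
            found.add((EX[f"doc{i}"], EX.mentions, cls))  # provisional triple

print(found.serialize(format="turtle"))
```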
Metadata in the DCA
There was huge interest in metadata in the DCA. I think it split into two issues (three, really, but the third was an after-the-conference reflection):
- The architecture is managing an ontology (a TBox) which is its primary metadata, and should be managed as such
- The models for model-driven development are also essentially metadata; this includes how constraints are expressed as a model, how the UI is described as a model, etc.
- The metadata of the broader ecosystem – we can either sample the broader world (which is what we do when we build an R2RML model, for instance, though that is only aware of the small part of the legacy environment we have directly mapped) or we can turn part of the DCA into a broader metadata management system.
The argument for turning the DCA into a metadata management system is that a graph-based system is so much better equipped to handle the complex relationships that metadata management involves.
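A minimal sketch of that argument, assuming rdflib and an invented ex: lineage vocabulary: once column-to-column lineage lives in the graph, “what is impacted if this source field changes?” is a single transitive query rather than a pile of joins.

```python
# A minimal sketch of graph-based metadata management: lineage is transitive,
# so impact analysis is one property-path query. The ex: terms are invented.
from rdflib import Graph, Namespace

EX = Namespace("http://example.com/meta#")
g = Graph()
g.bind("ex", EX)

# Hypothetical lineage: a report column derives from a warehouse column,
# which derives from a column in a source system.
g.add((EX.report_total, EX.derivedFrom, EX.dw_claim_amount))
g.add((EX.dw_claim_amount, EX.derivedFrom, EX.src_claims_amt))

lineage_query = """
SELECT ?downstream WHERE {
  ?downstream ex:derivedFrom+ ex:src_claims_amt .
}
"""
for row in g.query(lineage_query, initNs={"ex": EX}):
    print(row.downstream)
```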
Security
I think we all agreed that security was going to be the most difficult of all the layers, or at least the most difficult to generalize and have a solution that really works for a large number of targets.
The good news, as we’ll see when we get to implementation strategies, is that we may be able to postpone deep security to subsequent phases.
In general, I think most people considered authentication important but felt there are plenty of existing solutions that would work fine with a DCA. Likewise, there is nothing special or new about encryption and the other techniques for keeping data secure.
The consensus was that authorization was the tough one to handle. By bringing a lot of disparate data together, we raise the visibility and stakes for getting authorization right.
There may be a place for blockchain-based identity management à la Sovrin.
We believe that there are two general problems to be solved:
- Finding a way to describe authorization. To be tractable, this will have to be some combination of roles (groups of people who share authorization permissions) and rules (ways of describing what data should be accessible). The rules need to be “subjective,” meaning you may have permission to see your own employees’ workers compensation claims, but not all workers compensation claims (a toy sketch follows this list).
- Designing an efficient way to implement the rules and roles at run time.
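As a toy sketch of a “subjective” rule (assuming rdflib; the data and vocabulary are invented), the authorization decision is itself a graph query: a manager may see a claim only if the claimant reports to them.

```python
# A sketch of a role-plus-relationship rule: whether a user may view a claim
# depends on a relationship in the graph. Vocabulary and data are hypothetical.
from rdflib import Graph, Namespace

EX = Namespace("http://example.com/dca#")
g = Graph()
g.bind("ex", EX)

g.add((EX.alice, EX.manages, EX.bob))
g.add((EX.claim42, EX.claimant, EX.bob))
g.add((EX.claim99, EX.claimant, EX.carol))

def can_view_claim(user, claim):
    """True if the claimant of `claim` reports to `user`."""
    return g.query(
        "ASK { ?claim ex:claimant ?emp . ?user ex:manages ?emp . }",
        initNs={"ex": EX},
        initBindings={"user": user, "claim": claim},
    ).askAnswer

print(can_view_claim(EX.alice, EX.claim42))  # True: bob reports to alice
print(can_view_claim(EX.alice, EX.claim99))  # False: carol does not
```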
We have launched a working group that is going to concentrate on designing an approach that will satisfy a wide range of authorization use cases.
Knowledge Graphs and Graph DBs
Many issues already solved by RDF graphs (schema, Linked Data, provenance, resource resolution) would need to be reimplemented in a labeled property graph.
We are going to need to interoperate between RDF and Labeled Property Graphs.
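A deliberately naive sketch of that projection (assuming rdflib; the output shape is invented): triples with resource objects become edges, triples with literal objects become node properties. Everything RDF provides for free, such as globally resolvable identifiers, schema, and provenance, has to be carried along explicitly, which is the point of the caution above.

```python
# A naive RDF-to-labeled-property-graph projection. Real interoperability
# needs far more (provenance, schema, identity); names here are illustrative.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.com/dca#")
g = Graph()
g.add((EX.bob, EX.worksFor, EX.acme))
g.add((EX.bob, EX.name, Literal("Bob")))

nodes, edges = {}, []
for s, p, o in g:
    nodes.setdefault(str(s), {"id": str(s), "props": {}})
    if isinstance(o, Literal):
        nodes[str(s)]["props"][str(p)] = o.toPython()   # node property
    else:
        nodes.setdefault(str(o), {"id": str(o), "props": {}})
        edges.append({"from": str(s), "to": str(o), "label": str(p)})

print(nodes)
print(edges)
```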
Strategies for Implementing
This session perhaps produced the greatest surprise. In the moment, we generated a very interesting matrix of the seven or eight generic strategic approaches you might take to introduce a DCA in an organization, and which layers of the architecture would matter most under each approach.
The key strategic approaches to implementing a data-centric architecture are:
- Semantic ETL – This approach focuses on copying and aligning as many internal sources as possible. In some ways it is like creating a data warehouse, but instead of conformed dimensions you end up with a knowledge graph. In a lot of ways this is the fastest and least risky way to go. This is how Jacobus Geluk did it at BNYM and Parsa Mirhaji did it at Montefiore, and we are doing this with several of our customers. (A minimal sketch of this pattern follows the list.)
- Exemplar App – This approach says build an app, using this stack, that does something that would be hard to do with a more traditional application. We did this for proofs of concept at Schneider and Sentara, but sadly neither was adopted despite compelling features.
- Data fabric – This approach uses semantics to define an abstraction layer that buys time to switch out legacy systems. You write all new functionality to the fabric and gradually migrate the data behind the scenes. One of our clients is doing this and another is preparing to.
- Data storage swap w/ same API – Boris had an experience where there was a well-written application (meaning it had a proper business object layer) whose storage-layer API could be substituted. This can port an application to a triple store, but getting advanced functionality means writing something new directly on the triple store, as anything going through the object layer will behave like a traditional app.
- Metadata – Several of our clients have an overwhelming problem with legacy metadata management. The complexity impedance mismatch is so great that we can’t map directly from their existing systems to a semantic model. We had one client with 150,000 elements just in their customer-facing APIs (many times that in their entire data supply chain). The core model had 500 concepts and covered all the concepts represented in the 150K. By creating a faceted meta-model we were able to build a bridge, first between their existing APIs, with an eye to later connecting it to the core, where new development could be done.
- Burning platform – In some cases there may be a system that simply has to be replaced. Doing it in a data-centric fashion is a bit risky in this case, as the MVP (Minimum Viable Product) is all the functionality of the old system, but it is also very risky to do these projects with traditional technology. We have talked to several firms that got deep into a terrible implementation and wished they had gone data-centric, but by that point it is often too late. This approach will be more feasible when there are commercial or open-source platforms ready to go, which would take 6-12 months out of the project schedule.
- True contingency – I have pitched this a couple of times, but so far unsuccessfully. This approach says that if you are already committed to a $100+ Million implementation project, you likely have a $20 Million contingency budget. For a fraction of the contingency (say $2-$4 Million) you can build a true contingency: a fully functional system ready to go in case the main project fails. Politically, you can’t say you are betting against the main project you have just spent years lobbying for, but secretly, if you are paying attention, you should have grave doubts. What you do instead is launch a project to build a fully functioning prototype to try out UI/UX concepts, so the main project can learn from them. The requirements for the fully functioning prototype should include the full volume of the existing system and all the high-volume and happy-path use cases.
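For the Semantic ETL strategy above, here is a minimal sketch, assuming rdflib and using an in-memory SQLite table as a stand-in for a legacy source; the ex: terms and column names are invented for illustration. The essence is simply pulling rows out of a relational source and aligning them to a shared vocabulary as triples.

```python
# A minimal Semantic ETL sketch: read rows from a relational stand-in and
# emit triples aligned to a shared (hypothetical) ex: vocabulary.
import sqlite3
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.com/dca#")

# Stand-in for a legacy source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (id INTEGER, employee TEXT, amount REAL)")
conn.execute("INSERT INTO claims VALUES (42, 'bob', 1250.0)")

g = Graph()
g.bind("ex", EX)
for claim_id, employee, amount in conn.execute("SELECT id, employee, amount FROM claims"):
    claim = EX[f"claim{claim_id}"]
    g.add((claim, RDF.type, EX.WorkersCompClaim))
    g.add((claim, EX.claimant, EX[employee]))
    g.add((claim, EX.amount, Literal(amount)))

print(g.serialize(format="turtle"))
```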
What came spontaneously out of the conversation was the relationship of the strategies to the parts of the architecture you would need. In the early days the architecture will need to be built incrementally, so it is worth considering which parts of the architecture could be deferred depending on the strategy taken.
This chart summarizes the discussion, with the strategies as rows and the layers of the architecture as columns.
Governance
We discussed the key pivot from sandbox to production as the entry point for governance.
The stakes are much higher for governance in a data-centric architecture. As more people are sharing portions of a model, governance will need to be more centralized. We discussed strategies for coordinating subdomain governance with core governance.
Next Steps
Next year’s “Forum” is already in the planning stages. If you want to save big, sign up now at https://www.semanticarts.com/dcc/. Up to April 15th we will offer tuition at our cost ($225 for the full 2 ½ days, which just covers the room and refreshments).
Next year’s event is going to focus on implementations. We are hoping to see exemplars of some of the key layers in the architecture and lessons learned.
Next year there will be an exhibitors’ room, and all the food (and a cocktail hour) will be at the back of that room, so participants will run the gauntlet to get there, giving the vendors more traffic and better opportunities for demos and discussions.
In the meantime, we are launching some working groups, the first one being “Security.”
Appendix
Strawman diagram (it requires zooming to see; the layers are listed below from inner to outer)
The diagram is arranged around an epicenter to represent that much of the architecture applies more to the curated data (the right-hand, yellower of the grapefruit slices in the middle).
Layers of the DCA (from center out):
- Data store
- Curated Data
- Mapped External Data
- Harvested, aligned data
- Provenance data
- Federated Query layer over data storage
- Core Ontology
- Metadata → Ontologies, SHACL
- Data Security & Access logging
- Data Identity Management
- Integrity Management, Constraints and Triggers
- Consistent, correct data
- Data compliance
- Programmatic Behaviors
- Geospatial
- Temporal
- Mathematical
- Workflow
- Domain Ontologies & Taxonomies
- Endpoint Access Management
- User Interfaces (applications)
- Hard coded? Model Builders
- Model driven UI
- Bespoke UI