We’ve been working on something we call “SemOps” (like DevOps, but for Semantic Technology + IT Operations). The basic idea is to create a pipeline that moves proposed enterprise ontology or taxonomy enhancements to “in production” as frictionlessly as possible.
As so often happens, when we shine the Semantic Light on a topic area, we see things anew. In this very circuitous way, we’ve come to some observations and benefits that we think will be of interest even to those who aren’t on the Semantic path.
DevOps for Data People
If you’re completely on the data side, you may not be aware of what developers are doing these days. Most mature development teams have deployed some version of DevOps (Software Development + IT Operations) along with CI/CD (Continuous Integration / Continuous Deployment).
To understand what they are doing, it helps to harken back to what preceded DevOps and CI/CD. Once upon a time, software was delivered via the waterfall methodology. Months, or occasionally years, would be spent getting the requirements for a project “just right.” The belief was that if you didn’t get the requirements right up front, adding even a single new feature later would cost 40 times what it would have cost had the requirement been identified at the start. It turns out there was some good data behind this cost factor, and it still casts its shadow any time you try to modify a packaged enterprise application: 40x remains a reasonable benchmark compared to what it would cost to implement that feature outside the package. This, as a side note, is the economics that creates the vast number of “satellite systems” that spring up alongside large packaged applications.
Once the requirements were signed off on, the design began (more months or years), then coding (more months or years), and finally systems testing (more months or years). Then came the big conversion weekend: the system went into production, tee shirts were handed out to the survivors, and the system became IT Operations’ problem.
There really was only ever one “move to production,” and few thought it worthwhile to invest the energy in making it more efficient. Most sane people, once they’d stayed up all night on a conversion weekend, were loath to sign up for another, and it certainly didn’t occur to them to figure out a way to make it better.
Then agile came along. One of the tenets of agile was that you always had a working version that you could, in theory, push to production. In the early days people weren’t pushing to production on any frequent schedule, but the fact that you always could was a good discipline for avoiding technical debt and for not straying off into building hypothetical components.
Over time, the idea that you could push to production became the idea that you should. As people invested more and more in unit testing, regression testing, and pipelines to move from dev to QA to production, they became used to pushing small incremental changes into production systems. That was the birth of DevOps and CI/CD. In mature organizations like Google and Amazon, new versions of software are pushed to production many times per day (some reports say many times per second, but that may be hyperbole).
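The core mechanic described above can be sketched in a few lines: a change moves through the dev → QA → production stages only while the automated tests stay green. This is a minimal illustration, not any real CI tool’s API; all the names here are invented.

```python
# Hypothetical sketch of a test-gated CI/CD pipeline:
# a change is promoted stage by stage, but only while its tests pass.

STAGES = ["dev", "qa", "production"]

def run_tests(change):
    """Stand-in for a real unit/regression suite."""
    return all(check(change) for check in change["checks"])

def promote(change):
    """Move a change through each stage, stopping at the first red build."""
    reached = []
    for stage in STAGES:
        if not run_tests(change):
            break  # a failing test stops the pipeline cold
        reached.append(stage)
    return reached

# A change whose checks all pass flows to production;
# one failing check never leaves dev.
good = {"checks": [lambda c: True, lambda c: True]}
bad = {"checks": [lambda c: True, lambda c: False]}
```

The point of the sketch is the discipline, not the code: promotion is automatic and unconditional *given* green tests, which is what makes many small pushes per day feasible.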
The reason I bring this up is that there are some things in there we expect to duplicate with SemOps, and some that we already have with data. (As I was writing this sentence, I was tempted to write “DataOps” and thought: “is there such a thing?” A nanosecond of googling later, I found an extremely well written article on the topic from our friends at DataKitchen.) They focus more on the data analytics part of the enterprise, which is a hugely important area. The points I was going to make are more focused on the data source end of the pipeline, but the two ideas tie together nicely.
Capital Costs vs. Operational Costs
In some ways, what happened on the way to DevOps and CI/CD was a shift from operational costs to capital costs. (I don’t think that’s what motivated it, but it is a good way to characterize what happened.) In the old days, when someone introduced a change to an existing system, they ran a suite of tests (by hand!) against the database to make sure nothing was adversely affected. The UI tests were all done by hand. As the world’s most interesting man puts it: “I don’t always test my code. But when I do, I do it in production.”
This approach incurs a lot of operational cost each time code is moved to production. It has also been shown to miss a lot of bugs.
Building a rigorous unit and regression test suite is a large one-time investment, partly in infrastructure but mostly in writing the tests themselves. It is therefore a capital cost. It is not unusual for a mature modern system to have half as much test code as production code. But the payoff, as with most capital investments, is that it makes each individual turnover cheaper and safer.
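To make the “capital asset” framing concrete, here is the kind of regression test that accumulates in such a suite: written once, then run automatically on every change, forever locking in current behavior. The function under test (`tax_code_for`) and its values are invented for illustration.

```python
# A toy regression test: a small one-time cost to write,
# then it guards this behavior on every future push.

def tax_code_for(state):
    """Hypothetical function under test."""
    return {"CA": "CA-2021", "NY": "NY-2019"}.get(state, "DEFAULT")

def test_known_states_keep_their_codes():
    # Locks in current behavior so a later change can't silently break it.
    assert tax_code_for("CA") == "CA-2021"
    assert tax_code_for("NY") == "NY-2019"

def test_unknown_state_falls_back_to_default():
    assert tax_code_for("ZZ") == "DEFAULT"
```

Each such test costs something up front and nearly nothing per run, which is exactly the operational-to-capital cost shift described above.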
In a moment (or maybe in the next installment) we’re going to look long and hard at the capital costs we’re already incurring, and how to think about that tradeoff going forward.
Data Governance and Software Development
We’re going to spend some time on the similarities and differences between data governance and software development, especially because we’re going to try to figure out when changes to ontologies more resemble Data Governance and when they more resemble Software Development.
In order to make these comparisons, I’m going to have to abstract what both groups are doing in a way that makes them comparable. This is harder than it sounds. Or easier than it sounds, depending on your point of view. If your view is that these two disciplines are so different that no sensible comparison could be made, then what I’m going to outline is easier than the impossible task of reconciling two irreconcilables. If, on the other hand, you think “they really are kind of doing the same thing,” then you’ll find this a bit harder than it first sounds.
What makes this harder, as far as I’m concerned, is that there is a great deal of literature on Data Governance, but not much that gets specific about what people actually do when they do “Data Governance.” I went back and re-read my (DAMA International) DMBOK Chapter 3 on Data Governance, and while there is a great deal about strategic importance and organizational considerations, it is light on what Data Governance people are actually doing. Same with many other sources I consulted. I’m hoping someone reading this will come up with some helpful suggestions, and if I’m really lucky they will align with what I’m describing here. Meanwhile, I’m stuck putting forward my own framework.
A lot of what I’m about to suggest is derived from my experience and observations working side by side with Data Governance professionals. We encounter all kinds of Data Stewards, Data Custodians, Metadata Managers, and Data Council Leaders, all the way up to Chief Data Officers. What I’m going to outline is my impression of what they are doing.
Some of what they do is simply being an expert in a particular subdomain and knowing where the credible sources are. This is a valuable role, especially in companies that have let their data landscape grow in an uncontrolled fashion and now have thousands of potential sources of data. We rely on these experts when we build ontologies, because knowing what all the data means has become a highly decentralized function. But being an expert and a go-to resource isn’t a “doing” in the sense I want to explore here.
My thesis is that in both Data Management and IT Operations, the landscape already exists: the data landscape and the code landscape. The thousands of applications and thousands of databases are already there. Governance is not so much about creating new applications or data sources; the firm tends to create projects to do that. Governance of code and data is more about dealing with what exists and performing the following activities:
- Quality – Assessing and attempting to continually improve the overall quality of the data or the code.
- Allowing/“Permissioning” – Creating mechanisms for determining who gets to make changes to data or code, and how those changes are prioritized.
- Predicting side effects – Doing impact analysis to determine whether a given change will have a detrimental effect on the overall landscape (data or code).
- Constructive change – Sometimes a proposed change requires knock-on changes. Renaming a schema element may require that code, queries, or documentation be changed to accommodate it. This task is about making those upgrades to a system that is in use.
- Traceability – A great deal of compliance is about proving that all changes were authorized. Whether it is Sarbanes-Oxley wanting to know the provenance of key data elements on financial reports, or cyber-security wanting assurance that all changes to code were authorized and unlikely to contain viruses, this is the discipline of traceability.
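Two of these activities, predicting side effects and constructive change, reduce to the same underlying operation: walking a dependency graph from the element you want to change to everything that would need a knock-on change. Here is a minimal sketch of that idea; the graph and element names are invented examples, not real metadata.

```python
# Sketch of impact analysis over a dependency graph: given an element
# to be changed (e.g. renaming a schema column), find every downstream
# artifact that would need a knock-on change. All names are hypothetical.

from collections import deque

# Who depends on whom: changing the key impacts each listed dependent.
DEPENDENTS = {
    "customer_table.cust_nm": ["billing_query", "kyc_report"],
    "billing_query": ["invoice_service"],
    "kyc_report": [],
    "invoice_service": [],
}

def impact_of(element):
    """Breadth-first walk: every artifact reachable from `element`."""
    seen, queue = set(), deque([element])
    while queue:
        node = queue.popleft()
        for dep in DEPENDENTS.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)
```

So renaming `customer_table.cust_nm` would flag not just the queries that read it directly, but also the service that depends on one of those queries. Whether this walk happens over a metadata repository, a code index, or an ontology, the shape of the activity is the same.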
This is what I think governance means at the ground level. There are many strategic reasons to want to do this, there are many ways to organize for it, but at some point I think we need to execute these basic functions. I think the functions exist on the data side and the code side, and by extension on the semantic side. They exist at the data level and at the metadata level.
What is fascinating to me (and I will need to leave it to the next episode, because I’ve hit my word target) is:
- How and why have data professionals and software developers evolved different strategies to satisfy these five basic functions?
- What can ontologists learn from this, and therefore, what will SemOps ultimately look like?
This is where we pick up next time.