As you can probably tell from my previous columns, my thinking lately has been very focused on the issue of data curation and data governance automation. I have always been troubled by the amount of manual effort required in almost all phases of the data life cycle. In my November 2023 article, Waldo Where Are You?, I talked about some of the funded research focused on developing methods and techniques to perform unsupervised data linking, clustering, and cleaning, as well as digital data governance policies. I strongly believe that more automation is the key needed to unlock the full potential value of our data resources. The question is, how do we get there?
Broadly speaking, I see research following two different paths. The first path is the human replacement approach. As is the mantra of AI, if a person can do it, a machine can do it. This path starts with the assumption that the data won’t change. It is always going to be dirty, noisy, and un-annotated. On this path, the goal is to replace the data analysts who are spending their time profiling, assessing, cleaning, and organizing the data with AI systems that can automatically perform the same tasks.
Clearly, we have a long way to go with this approach. First, almost all the work being done with machine learning assumes that someone has already expended the time and effort to prepare the data before the models are ever trained. We have all heard the data scientist’s universal complaint that 80% of their time is spent preparing data and only 20% analyzing data. But data science must share some of the blame for this situation because it has been slow to turn its focus inward and apply AI techniques to the data preparation problem itself.
Second, some of the most successful AI advances rely on supervised machine learning. This means the system must be trained with presumably correct examples so it can learn, and in many cases, the training must be repeated as the data being processed changes over time. Supervision is still a step away from the idea of fully automated data curation. The development of training data and the training itself both require analyst involvement. More unsupervised methods are needed to adequately address this problem.
There is another path, a path less traveled. It does not accept the premise that data will always be dirty, noisy, and un-annotated. As I noted in my data littering article, it is possible to imagine a world in which systems produce and exchange fully curated data. This can happen through much more robust (and fully automated) data governance. For example, if more people would simply follow established standards such as the ISO 8000 Part 110 standard for exchanging data, then manual data curation effort would be reduced significantly. But like most data issues, the adoption of these practices requires not so much a change in technology as a change in the attitudes of individuals and the culture of the organization.
The ISO 8000-110 standard is basically a metadata standard. The key to producing and exchanging fully curated data is the concept of embedded metadata and semantic encoding. The biggest issue with metadata management is that the data and the metadata that describes it usually reside in different systems and are only loosely coupled, or not connected at all.
In my opinion, organizations should not be producing and exchanging datasets. Instead, they should be producing and exchanging data objects (smart data) that wrap together both the data and the metadata describing it. Data formats such as XML and JSON are more than capable of representing this. Encapsulating data with its metadata would not only follow ISO standards, it would also increase the level of data process automation by allowing systems to be metadata-driven. Much of the data processing software design in use today is still stuck in the last century, when the primary objective was to minimize storage and software developers assumed that input data met all data quality and format requirements.
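A minimal sketch of what such a smart data object might look like in JSON. The field names and structure here are purely illustrative inventions for this column, not a layout prescribed by ISO 8000-110:

```python
import json

# Hypothetical "smart data" object: the payload and the metadata that
# describes it travel together in a single JSON document. All field
# names (metadata, fields, source_system, etc.) are illustrative.
smart_record = {
    "metadata": {
        "source_system": "crm_export",  # where the data originated
        "fields": {
            "customer_id": {"type": "string", "description": "Unique customer key"},
            "balance": {"type": "decimal", "unit": "USD"},
        },
    },
    "data": {
        "customer_id": "C-10042",
        "balance": "1250.00",
    },
}

# Serialize the object for exchange; the receiver can interpret the
# payload using the embedded metadata instead of out-of-band documentation.
wire_format = json.dumps(smart_record)
received = json.loads(wire_format)

# The receiving system discovers the meaning of each field from the
# object itself -- the unit of "balance" arrives with the data.
balance_unit = received["metadata"]["fields"]["balance"]["unit"]
```

The point is not the particular schema but the coupling: a consumer never has to guess what a column means, because the description rides along with the values.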
These legacy designs are usually hardwired to assume that a data item found at a particular ordinal position of an input record is of a certain type and requires a certain type of processing. Such systems fail when there is an unexpected change in the input format or in the meaning of a data item. The decision to separate data from its metadata annotation to save storage also makes debugging data issues much more difficult: a data analyst must unravel the code and trace the logic of how each item is processed. This is especially problematic when the code is complex and poorly documented, and the original software developer has long since moved on.
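The fragility described above is easy to demonstrate. The record layout below is invented for this sketch; the parser is hardwired to ordinal positions, so an upstream format change shifts the meaning of every field:

```python
# A legacy-style parser hardwired to ordinal positions (illustrative).
# Position 0 is assumed to be an ID, position 1 an amount in cents.
def parse_record_positional(line):
    fields = line.split(",")
    return {"id": fields[0], "amount_cents": int(fields[1])}

# Works as long as the assumed layout holds...
ok = parse_record_positional("C-10042,125000")

# ...but if an upstream system adds a leading date column, position 1
# no longer holds the amount, and the parser breaks (or worse, silently
# misinterprets the data when the shifted value happens to be numeric).
broke = False
try:
    parse_record_positional("2024-01-15,C-10042,125000")
except ValueError:
    broke = True  # "C-10042" is not a valid integer
```

Note that the failure here is the lucky case; had the new column been numeric, the parser would have happily produced wrong answers with no error at all.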
These system designs implicitly hide the metadata in the program (the process) and separate it from the data. A better approach is to read each data item together with its metadata label from a data object, then let the system take appropriate actions based on the metadata content. Embedded metadata can also promote more automation in data governance. Data processing actions should not only be driven by the metadata in the source data objects; these processes should also produce data objects as their output. The output data objects should not only carry forward the input metadata, they should also enrich it with new metadata describing the transformations and new data items produced by the process. This by itself would greatly simplify and help to automate the data governance functions of data lineage and data content discovery.
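A hedged sketch of this metadata-driven pattern, reusing the illustrative smart data layout from above: the program dispatches on the metadata type labels carried in the input object rather than on hardwired positions, and the output object carries the input metadata forward and appends a lineage entry. All names (`HANDLERS`, `normalize_v1`, the field layout) are assumptions for this sketch:

```python
import datetime

def normalize_currency(value):
    # Illustrative cleanup rule: parse and round a currency string.
    return round(float(value), 2)

# Processing is selected by the metadata "type" label, not by position.
HANDLERS = {
    "decimal": normalize_currency,
    "string": str.strip,
}

def process(data_object):
    out_data = {}
    for name, value in data_object["data"].items():
        field_type = data_object["metadata"]["fields"][name]["type"]
        out_data[name] = HANDLERS[field_type](value)
    return {
        "metadata": {
            **data_object["metadata"],  # carry the input metadata forward
            # Enrich with a lineage entry describing this transformation.
            "lineage": data_object["metadata"].get("lineage", []) + [{
                "process": "normalize_v1",
                "run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            }],
        },
        "data": out_data,
    }

source = {
    "metadata": {"fields": {"customer_id": {"type": "string"},
                            "balance": {"type": "decimal"}}},
    "data": {"customer_id": " C-10042 ", "balance": "1250.00"},
}
result = process(source)
```

Because every process appends to the same lineage list, a downstream governance tool could reconstruct how any data item was produced by reading the object alone, with no access to the processing code.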
All these ideas are in line with the new DataOps thinking, which seeks to remove the system and organizational barriers in dataflows that impede the efficient production of data products. However, one problem with the smart data path is that fewer companies now develop custom code, which leaves the development and deployment of smart, metadata-driven data tools to the software vendors. But vendors simply respond to the market. Unless and until companies demand metadata-driven data governance tools and processes, organizations are left to build them themselves.
In my opinion, we could move much faster toward a future of automated data curation and data governance if we move along both paths at the same time. Embedded metadata would make it much easier for data scientists to find, organize, and prepare data for analysis. This would, in turn, accelerate the development of new methods and techniques to automate data transformation processes. Achieving success in digital transformation will require everyone to think about data, metadata, and data transformation in a new way.