Almost every child knows about the cute search-and-find Waldo with the red-striped shirt and cap hidden in the crowd. But what you may not know is that there was another, much earlier Waldo. It was an early robotic device that let a human remotely operate mechanical hands in a sealed chamber. It was first used in 1945 at Argonne National Laboratory to safely manipulate radioactive and other toxic materials. While the original Waldo was a relatively primitive mechanism that has been replaced by much more sophisticated and intelligent robots, I would argue that the principles of the device can still inform our approach to, and perhaps even portend the future of, data governance.
The notion of a sealed chamber is already well established in the world of digital networks. Virtually every enterprise has a network firewall to automate protection from unwanted and potentially malicious connections. Connections and attempted connections are constantly monitored, and the logs are analyzed for patterns that could forewarn the system of attempts by bad actors.
In addition to the sealed chamber, the most innovative feature of the Waldo device was the way it allowed users to perform actions on objects in the chamber while remaining completely separated from, and protected against, the materials. Waldo’s mechanical arms mimicked the user’s arm and hand movements, allowing the user to manipulate objects in the chamber and perform actions such as mixing materials or transferring them between containers. However, it did not allow the user to perform every action. It prevented the user from doing things that were potentially dangerous, such as moving materials outside of the chamber.
So, what does any of this have to do with data governance? While the Waldo was not an electronic device, it did enforce a certain level of governance on the user by controlling both the user’s access to materials and the actions the user could perform. One of the biggest problems we have with data governance today is the lack of automation. While we have lots of software tools, a great deal of DG operational compliance depends on the goodwill and integrity of individuals to understand and follow policies and guidelines. In my opinion, the two largest gaps in trying to automate the enforcement of policies and standards are the lack of software controls to limit user actions on data, and the lack of automatic metadata generation and capture for the actions that are allowed.
The current approaches to DG automation focus primarily on user authentication and data source access. All organizations implement some type of user authentication mechanism, and most have some way of limiting access to certain data sources, or even specific fields within the data. This is usually done by assigning users to predefined roles. But once a user is authenticated and his or her role grants access to data, compliance with the policies governing the actions that user performs on the data is largely voluntary. In addition to the lack of automated controls on data actions, there is also the issue of operational context. The same user might have authorization to perform certain actions on a given dataset for one purpose, but not for another purpose. These scenarios can be difficult to manage through role assignment alone. Systems for automating DG need to accommodate all four of these factors: the user, the data, the actions, and the context.
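To make the four factors concrete, here is a minimal sketch in Python of a policy decision that must match on the user (resolved to a role), the data, the action, and the context before anything is permitted. The names (PolicyRule, is_allowed) and the default-deny behavior are my own illustration, not any particular product’s API.

```python
from dataclasses import dataclass

# Illustrative only: PolicyRule and is_allowed are hypothetical names,
# not part of any existing governance product.

@dataclass(frozen=True)
class PolicyRule:
    role: str      # the user (resolved to a role after authentication)
    dataset: str   # the data
    action: str    # the action (read, update, export, ...)
    context: str   # the operational context or purpose
    allow: bool

def is_allowed(rules, role, dataset, action, context):
    """Permit an operation only if an explicit rule covers all four factors."""
    for r in rules:
        if (r.role, r.dataset, r.action, r.context) == (role, dataset, action, context):
            return r.allow
    return False  # default deny: anything not explicitly granted is refused

rules = [
    PolicyRule("analyst", "claims", "read", "reporting", True),
    PolicyRule("analyst", "claims", "read", "marketing", False),  # same user and data, different purpose
]

print(is_allowed(rules, "analyst", "claims", "read", "reporting"))   # True
print(is_allowed(rules, "analyst", "claims", "update", "reporting")) # False: no rule grants it
```

The point of the sketch is simply that purpose sits alongside user, data, and action in the decision, which is exactly what role assignment alone struggles to express.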
So, to be clear, I am suggesting that DG automation needs to be more like the Waldo, where users interact with the data only through an intervening layer. You can think of this as an extension of the growing trend to implement self-service analytics. Self-service systems allow authenticated users to access a limited number of datasets and perform a limited (predefined) set of actions in a fixed context, usually business analytics. In the same way we have seen the data warehouse migrate from the backend of operational systems to the frontend (i.e., lakehouses), I predict we will see self-service analytics expand to include more datasets and more upstream data operations.
A very useful model for controlling data tools was implemented in the Hadoop ecosystem through an application called Apache Ranger. It is a software application designed to communicate with and control other tools, such as Hive and Spark, through a stored digital policy. For example, a policy might allow a user to perform a read operation on certain data but not an update. In addition to controlling the actions of a data tool, Ranger can also collect metadata describing the actions the tool performed. Some commercial applications in the big data space have followed this model to develop very robust automated DG policies that can control data, users, and actions, but mostly in the context of data analytics, not as an enterprise solution. There is also ongoing academic research in this area. I am currently part of a team of professors, students, and industry partners exploring methods for automating data curation and data governance, sponsored by a National Science Foundation grant (https://dartproject.org/).
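As an illustration of that pattern, here is a rough Python sketch of a stored digital policy that permits reads but not updates on a resource, with an audit record captured for every attempted action. The policy layout is my own simplification and does not follow the actual Apache Ranger schema or REST API.

```python
import datetime
import json

# Sketch of the Ranger-style pattern: a stored policy controls the action,
# and every attempted action produces an audit (metadata) record.
# Illustrative only; not the Apache Ranger policy format.

policy = {
    "resource": "sales_db.orders",
    "items": [
        {"users": ["mjones"], "accesses": {"read": True, "update": False}},
    ],
}

audit_log = []  # stands in for the metadata stream a real controller would emit

def enforce(user, resource, action):
    allowed = False
    if resource == policy["resource"]:
        for item in policy["items"]:
            if user in item["users"]:
                allowed = item["accesses"].get(action, False)
    # capture metadata describing the attempted action, whether allowed or not
    audit_log.append({
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "resource": resource,
        "action": action,
        "allowed": allowed,
    })
    return allowed

print(enforce("mjones", "sales_db.orders", "read"))    # True
print(enforce("mjones", "sales_db.orders", "update"))  # False
print(json.dumps(audit_log, indent=2))
```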
However, Ranger-type control is much more difficult for legacy data tools such as relational databases, where such controllers must be custom built. Just the opposite of the Waldo, most vendors of data tools want to give the user as much power as possible. This is because they compete with other vendors on features and functionality, the more the better. This makes harnessing these tools even more difficult.
But just for a moment, try to imagine a world in which all your organization’s data are safely enclosed in a controlled data space (the Waldo chamber) and users can only manipulate the data through a software control layer (the Waldo arms) that encloses the space and automates compliance with digital policies. And suppose the control layer’s digital policies allow authorized users to perform only the operations allowable for their roles and given operational context. And as long as we are imagining, suppose that every action taken inside the controlled data space is captured as metadata and sent back to the control layer, where a series of filters forwards updates to the data catalog, dashboards, and other monitors in real, or near-real, time. As an added bonus, the system would prevent data littering because data could only enter or leave the controlled data space if it adhered to the ISO 8000-110 standard for master data exchange, which requires complete metadata references.
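For the sake of the thought experiment, here is a minimal sketch of such a control layer. Everything in it is hypothetical: REQUIRED_METADATA merely stands in for the ISO 8000-110 requirement of complete metadata references, and the filters stand in for whatever forwards events to a data catalog, dashboard, or other monitor.

```python
# Hypothetical sketch of the imagined control layer (the Waldo arms) around
# a controlled data space (the Waldo chamber). Names and fields are illustrative.

REQUIRED_METADATA = {"source", "steward", "definition"}  # stand-in for ISO 8000-110 references

class ControlLayer:
    def __init__(self, filters):
        self.filters = filters  # callables that route metadata events downstream
        self.space = {}         # the controlled data space

    def _emit(self, event):
        for forward in self.filters:
            forward(event)

    def write(self, user, name, record, metadata):
        # data may only enter the space with complete metadata (no data littering)
        if not REQUIRED_METADATA.issubset(metadata):
            raise ValueError("rejected: incomplete metadata references")
        self.space[name] = (record, metadata)
        self._emit({"user": user, "action": "write", "dataset": name})

    def read(self, user, name):
        self._emit({"user": user, "action": "read", "dataset": name})
        return self.space[name]

catalog_updates = []  # stand-in for the data catalog
layer = ControlLayer(filters=[catalog_updates.append, print])  # print = a stand-in dashboard
layer.write("mjones", "orders", {"id": 1},
            {"source": "ERP", "steward": "sales ops", "definition": "customer orders"})
layer.read("mjones", "orders")
```

In this toy version the policy check from the earlier sketches would sit inside write and read; the essential idea is simply that no action touches the data, and no data enters or leaves, except through the layer that records it.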
While developers and data scientists might shudder at the thought of having to go through an intermediate software layer that limits their data access and actions, I do believe that eventually this will all happen. The increasing risks and costs of not having automated DG policy compliance are simply unacceptable.