Recently I had the privilege of taking part in a panel discussion about Data Discovery with several distinguished colleagues from both banks and solutions providers. Knowing that the first question for the panelists would be the definition of Data Discovery, I did a bit of research and had no problem finding multiple definitions, covering a vast range of capabilities. For example, in her article Data Discovery and Classification Are Complicated, But Critical to Your Data Protection Program, Joanne Godfrey writes:
“Data discovery is the process of scanning your environment to determine where data (both structured and unstructured) resides — e.g., in database and file servers that could potentially contain sensitive and/or regulated data.”[1]
Further along the spectrum, Bernardita Calzon, in How Can Smart Data Discovery Tools Generate Business Value?, answers the question of what Data Discovery is this way:
“Data discovery is a term used to describe the process for collecting data from various sources by detecting patterns and outliers with the help of guided advanced analytics and visual navigation of data, thus enabling consolidation of all business information.”[2]
Today, there’s even the concept of “Smart Data Discovery,” with a truly vast scope of functionality. Aditya Kapoor, in Smart Data Discovery – How smart can we become in the future?, explains that:
“Data discovery was always a case of building data models and/or writing algorithms. Not anymore! ‘Smart’ data discovery is a next-gen data discovery potential by readily finding and presenting data through visualization and machine-generated narration. Smart data discovery is an encapsulation of predictive analytics, interactive data visualization, pattern matching and machine learning to produce automated decision support.”[3]
As I mentioned in the panel discussion, my definition of Data Discovery’s core capability is a bit more modest – discover the data. Discover all the sources of truth. Or, as the Oxford English Dictionary’s etymology has it, “discover” traces back to the Old French descovrir, “to make known, reveal, divulge (something not generally known).”[4]
Successful discovery requires the ability to understand the context of data. For example, a 9-digit number in a table of individual client data screams “Social Security number”, but in the context of a database of financial instrument data for institutional clients, it might be a unique internal identifier for a security, a company, or a transaction.
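To make the problem of context concrete, here is a minimal, purely illustrative Python sketch – the table names, column names, and keyword lists are all hypothetical, not any real tool’s logic. The very same 9-digit pattern is flagged as a possible SSN in one context and as a likely internal identifier in another:

```python
import re

# Matches a 9-digit value, with or without SSN-style dashes.
NINE_DIGITS = re.compile(r"^\d{3}-?\d{2}-?\d{4}$|^\d{9}$")

def classify_nine_digit(value: str, table_name: str, column_name: str) -> str:
    """Guess what a 9-digit value means based on where it was found."""
    if not NINE_DIGITS.match(value):
        return "not a 9-digit identifier"
    context = f"{table_name} {column_name}".lower()
    # Individual/retail client context: likely a Social Security number.
    if any(hint in context for hint in ("client", "employee", "individual", "tax")):
        return "possible SSN – sensitive, flag for review"
    # Instrument/transaction context: likely an internal identifier
    # (a CUSIP, for instance, is also 9 characters).
    if any(hint in context for hint in ("instrument", "security", "trade", "issuer")):
        return "likely internal security/transaction identifier"
    return "ambiguous – needs an SME"

print(classify_nine_digit("123-45-6789", "retail_clients", "tax_id"))
print(classify_nine_digit("037833100", "instruments", "security_id"))
```

A real tool would of course weigh far more signals – data distributions, reference data, relationships between tables – but even this toy shows that the surrounding metadata, not the value itself, does the heavy lifting.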
The existing solution to the problem of context is machine learning: fine-tuning algorithms by running them through huge data sets and “adjusting the dials” with input from human SMEs, so the software can identify an ever-wider range of scenarios.
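As a rough illustration of that training loop – a sketch, not any particular vendor’s method – here is a toy human-in-the-loop example using scikit-learn, with entirely invented contexts and labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented, SME-labeled training examples: the context a 9-digit value
# was found in, and what the SME says it actually is.
contexts = [
    "retail clients tax identifier",
    "employees payroll government identifier",
    "instruments security identifier",
    "trades transaction reference",
]
labels = ["ssn", "ssn", "internal_id", "internal_id"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(contexts, labels)

# A context the model has never seen. With only four examples the answer
# is little better than a guess; at enterprise scale, thousands of
# SME-labeled contexts are what make this reliable.
print(model.predict(["institutional clients issuer identifier"]))

# "Adjusting the dials": when the SME corrects a bad call, the corrected
# example joins the training set and the model is refit for the next run.
contexts.append("institutional clients issuer identifier")
labels.append("internal_id")
model.fit(contexts, labels)
```

The last step is the crux: every SME correction becomes new training data for the next iteration, which is precisely why the tuning takes time and resources.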
This approach is effective, but it takes time and resources. Each company that implements a data discovery tool must train the tool on the context scenarios particular to its enterprise. At our panel discussion, I raised the concept of building a “learned machine,” where the software developer could gather the teachings from each customer’s usage and build them into each new iteration of the tool. This raises interesting questions about intellectual property – who “owns” what a machine “learns” on site, through deployment with a customer – the software company, or the customer? Both? That is for another article!
Still, even such a learned machine will likely be stymied the first time it encounters a scenario significantly different from anything it’s learned previously. Time to call in the subject matter expert once more.
The computer, for all the “knowledge” gathered through machine learning, is still not capable of thinking like a human being, of applying analogies to tie familiar scenarios to new ones.
Not long ago, I finished reading Douglas R. Hofstadter’s epic, Pulitzer Prize-winning, Gödel, Escher, Bach: An Eternal Golden Braid.[5] I had started it early last year, when I was working on my article, Data Governance and The Art of the Fugue, as I had found that Hofstadter also cited Bach’s masterpiece in his text. I got about halfway through by last summer, and then ran aground somewhere around page 421 (out of ~740) amidst Cantor’s Original Diagonal Argument. I put the book aside, and then decided this year I should finish it.
This is a massive, stunning literary and intellectual achievement, interweaving vastly different worlds – metamathematics (particularly the brilliant Kurt Gödel and his Incompleteness Theorem), cellular biology, computer science, the aforementioned composer J.S. Bach, and the remarkable art of the Dutch artist M.C. Escher – all in the service of seeking to understand how human beings think, and whether it is possible for a computer to think like a human being. I have often seen this book cited as one of the seminal works of Artificial Intelligence, yet, as James Somers wrote in his article about Hofstadter, The Man Who Would Teach Machines to Think,[6] the field has largely left behind the path Hofstadter lays out, embracing machine learning instead and deploying that technology successfully across many industries.
Somers’ article itself is fascinating. I highly recommend it. He provides what I think is a wonderfully straightforward explanation of machine learning, reviews its widespread success, and then presents this assessment, very much in line with Hofstadter’s own skepticism of what we call “artificial intelligence”:
“It’s insidious, the way your own success can stifle you. As our machines get faster and ingest more data, we allow ourselves to be dumber. Instead of wrestling with our hardest problems in earnest, we can just plug in billions of examples of them. Which is a bit like using a graphing calculator to do your high-school calculus homework—it works great until you need to actually understand calculus.”[7]
Described this way, machine learning reminds me of learning by rote, which I first encountered back in fourth grade when I began studying cello in elementary school. We used the Applebaum String Builder books, and, in addition to melodies and scales, they included “rote projects” – suggestions for teaching a student a technique through imitation and memorization. A perfect definition of rote learning can be found on Dictionary.com: “Learning or memorization by repetition, often without an understanding of the reasoning or relationships involved in the material that is learned.”[8] I remember my school strings teacher took a very dim view of rote learning and, to my recollection, never had us do any of these “rote projects”. I do not think my playing suffered from their omission.
As I read Somers’ article, my mind seemed to make one of those “strange loops” Hofstadter postulates is at the core of human intelligence, which he describes as “an interaction between levels in which the top level reaches back down towards the bottom level and influences it, while at the same time being itself determined by the bottom level.”[9] Or, as Somers explains:
“But at each step, Hofstadter argues, there’s an analogy, a mental leap so stunningly complex that it’s a computational miracle: somehow your brain is able to strip any remark of the irrelevant surface details and extract its gist, its “skeletal essence,” and retrieve, from your own repertoire of ideas and experiences, the story or remark that best relates.”
So, what was this strange loop for me, this unbidden analogy? Somewhere, out of the depths of my memory, came a sentence I recognized from my reading back in high school of Frank Herbert’s Dune. I didn’t have the entire quote in my mind, but it was close enough for me to look it up:
“And always, he fought the temptation to choose a clear, safe course, warning ‘That path leads ever down into stagnation’.”[10]
Given the premiere of the new film, I certainly have been thinking about Dune more than I have in an exceedingly long time. But I hadn’t recalled that quote in many, many years. And yet, here’s what my brain retrieved as the remark that best relates to what I had just read.
Can our Learned Machine, chock full of the lessons gained from collective experience (please, not “learnings” – truly one of the most annoying misuses of the English language, and not just to me[11]), be taught analogical thinking, so that, when faced with a new scenario, it can strip away the irrelevant surface details, extract the gist, and turn its mission of Data Discovery from retrieving random facts into truly uncovering value?
Now that would be something.
[1] Godfrey, Joanne (November 4, 2019), “Data Discovery and Classification Are Complicated, But Critical to Your Data Protection Program”, Security Intelligence, https://securityintelligence.com/posts/data-discovery-and-classification-are-complicated-but-critical-to-your-data-protection-program/
[2] Calzon, Bernardita (May 19, 2021), “How Can Smart Data Discovery Tools Generate Business Value?”, Business Intelligence, https://www.datapine.com/blog/what-are-data-discovery-tools/
[3] Kapoor, Aditya (February 2, 2017), “Smart Data Discovery – How smart can we become in the future?”, https://www.linkedin.com/pulse/smart-data-discovery-how-can-we-become-future-aditya-kapoor
[4] “discover, v.”, OED Third Edition, December 2013; most recently modified version published online September 2021
[5] Hofstadter, Douglas R., Gödel, Escher, Bach: An Eternal Golden Braid, 1979, Basic Books
[6] Somers, James (November 2013), “The Man Who Would Teach Machines to Think”, The Atlantic
[7] Ibid.
[8] “Rote learning”, Dictionary.com, https://www.dictionary.com/browse/rote-learning
[9] Hofstadter, page 709
[10] Herbert, Frank, Dune, 1965, Penguin Publishing Group, Kindle Edition, page 352
[11] Ramsden, Dan (August 16, 2019), “Learnings is not a word”, Medium