“There is another undefined frontier, that of time. A living language is in a continuous state of change, of ‘slow but incessant dissolution and renovation’…Where then, does one fix the dates of entry and termination of a word’s ‘current usage’?”[1]
As data professionals, we’ve all been there: preparing a set of data for migration from one data source to another, we discover the older data just doesn’t make sense. We find strange values which the business requirements do not recognize, and, in history tables, imponderable sequences of events which seem contradictory or plain wrong.
We are tempted to chalk this all up to erroneous data entry, but if you’ve had experiences like mine, you have found this approach a serious mistake. Excluding such data from a migration, or transforming it by arbitrary rules, often results in unpleasant surprises once the business begins relying on the migrated data.
To avoid this, we conduct a frantic search for the person or persons who might have been with the organization long enough to know what the “ancient” data means. We ask one person, who refers us to another, and that person to another, and eventually we find a person with some knowledge of the scenario…or we don’t, and we are left to guess.
If we do learn the truth, it often seems bizarre when viewed in the light of the present data usage. But it may have been quite correct at the time, or a workaround due to system limitations. In either case, the knowledge we’ve obtained can help us decide how to migrate the data to the new system: whether to map the old values to new ones, keep the original values as historically correct but now obsolete, or discard some legacy data entirely. We might even divine how to represent data in history tables as accurately as possible.
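One way to make these decisions explicit, rather than burying them in conversion scripts, is to record each legacy value alongside its disposition and the reason we learned for it. The following is only a minimal sketch: the `Disposition` names, the `RULES` table, and the sample status codes are hypothetical illustrations, not taken from any real system. Note that a value nobody has explained yet is flagged for review instead of being silently dropped or transformed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Disposition(Enum):
    MAP = "map"          # translate the legacy value to a current one
    KEEP = "keep"        # historically correct but now obsolete; migrate as-is
    DISCARD = "discard"  # confirmed to have no place in the new system
    REVIEW = "review"    # meaning unknown; ask someone before deciding


@dataclass(frozen=True)
class Rule:
    disposition: Disposition
    new_value: Optional[str] = None
    reason: str = ""  # what we learned about the value's history


# Hypothetical rules for a hypothetical "order status" column.
RULES = {
    "X": Rule(Disposition.MAP, "CANCELLED", "older code for cancelled orders"),
    "HOLD-2": Rule(Disposition.KEEP, reason="workaround for a past system limitation"),
    "TEST": Rule(Disposition.DISCARD, reason="test rows, confirmed by the business"),
}


def migrate_value(value: str) -> Tuple[Disposition, Optional[str]]:
    """Return the disposition and the value to load into the new system."""
    rule = RULES.get(value)
    if rule is None:
        # No one has explained this value yet: do not guess.
        return Disposition.REVIEW, value
    if rule.disposition is Disposition.MAP:
        return Disposition.MAP, rule.new_value
    if rule.disposition is Disposition.KEEP:
        return Disposition.KEEP, value
    return Disposition.DISCARD, None
```

Keeping the `reason` next to each rule means the knowledge we worked so hard to find travels with the migration, instead of living only in someone’s memory.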
Data Governance professionals are familiar with the concept that “Data has a Lifecycle”. I have seen it included as a Data Governance principle, with specific lifecycle phases laid out, each with its own requirements. Malcolm Chisholm’s article, 7 phases of a data life cycle,[2] describes one approach to depicting these phases.
Missing from the lifecycle schemas I’ve encountered is the idea that data itself has a history, an etymology, if you will, like a language’s words. The Oxford English Dictionary (OED) supplies exhaustive detail on the history of a word’s meaning, usage, and spelling. What if we considered data having not a fixed definition and usage, but an analogous history, complete with evolving denotations which the practitioner often must derive from context and convention? How might that change our approach to data migration and conversion?
There is a book which has sat on my bookshelf, wherever I have lived, for many more years than I will reveal here. I received it as an award at my high school graduation, and, while I certainly appreciated the recognition, the title, Caught in the Web of Words, did not appeal to my interests at the time. But through all my many moves and clear-outs over the years, I did not part with this book. I thought someday I might even read it.
Flash forward to last year, when I started writing about data governance and simultaneously rekindled my love of J.R.R. Tolkien, from listening to “The Prancing Pony Podcast,” hosted by Alan Sisto and Shawn Marchese. The hosts, embracing the centrality of philology to Tolkien’s writing, often quote from the OED.[3] I have found “word nerdery” valuable to my thinking on data governance, and in my last article I cited the OED’s fascinating etymological revelations about the word “assets.”
At some point, I remembered the subtitle of that book, which continued to gather dust on our bookshelves: James Murray and the Oxford English Dictionary. Now I knew the time had come to at last open the cover and read this amazing biography by James Murray’s granddaughter, K.M. Elisabeth Murray.
James Murray was a remarkable man, largely self-educated, speaker of many languages and interested in everything, with apparently boundless energy. For many years, he worked as a banker or teacher by day while building his philological CV by editing volumes for the Early English Text Society. Even when he took on the role of chief editor for the first edition of the OED, a task that would take every ounce of his time and energy for 35 years, he managed to be a loving father to his eleven (!) children.
In his article, The evolution of crowdsourcing: an old idea whose time has come,[4] Matt Ellis cites the creation of the OED as one of the first examples of crowdsourcing. Elisabeth Murray goes into detail about her grandfather’s efforts to make this complex approach work.
Volunteers who answered the OED editor’s appeal were asked to read books and note down quotations of ordinary and unusual words. The volunteers wrote each quote with the particulars of the book and quotation on a half sheet of note paper and sent reams of these “slips” back to Dr. Murray, who stored them in pigeonholes in a custom-built shed called The Scriptorium, first in Mill Hill, a suburb of London, and later at Oxford. Murray and his staff, later aided by additional editors, would sift through and organize the slips, trying to deduce the history of each word. It is not surprising that completing the OED took many years longer, many more pages, and far more money than the Delegates of the Oxford University Press initially believed. But even the Delegates came to realize the immense importance of the Dictionary to the understanding of the English language.
Crowdsourcing benefits today from automation and the internet, but it’s still uniquely dependent on human participation. In his article, 5 Reasons Data Democratization May Fail, Malcolm Chisholm notes that, “There is a lot of information that modern data catalogs can harvest automatically. However, there is also a lot of important metadata that only humans can contribute to the data catalog. This includes semantics and facts of business significance that cannot be gleaned from technical metadata.”[5]
I would also include the etymology of data – what did it mean in the past, and what was it used for? This is information only people can provide. What is needed is a modern version of the OED’s crowdsourcing methodology, one that streamlines the process by taking advantage of the tools available today.
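To make the analogy concrete, here is a rough sketch of what a data “slip” and a resulting “entry” might look like. Everything in it is an assumption for illustration: the `Slip` fields, the `build_etymology` function, and the sample contributions are hypothetical, and a real catalog would need far more (review, conflicting accounts, end dates). The idea is simply that each contribution records what a value meant, from when, and on what evidence, and that contributions are then grouped and ordered in time, much as Murray’s staff sorted quotations.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Slip:
    column: str       # the data element being described
    value: str        # the value whose meaning is in question
    meaning: str      # what the contributor says it meant
    year_from: int    # earliest year the contributor recalls this usage
    contributor: str  # who supplied the slip
    evidence: str     # e.g. an email, a spec, or "personal recollection"


def build_etymology(slips: List[Slip]) -> Dict[Tuple[str, str], List[Slip]]:
    """Group slips by (column, value) and order each group chronologically."""
    entries = defaultdict(list)
    for slip in slips:
        entries[(slip.column, slip.value)].append(slip)
    return {
        key: sorted(group, key=lambda s: s.year_from)
        for key, group in entries.items()
    }


# Hypothetical contributions about one status code.
EXAMPLE_SLIPS = [
    Slip("order_status", "X", "cancelled by customer", 2012, "analyst B", "email thread"),
    Slip("order_status", "X", "record entered in error", 2003, "analyst A", "personal recollection"),
]

history = build_etymology(EXAMPLE_SLIPS)
for slip in history[("order_status", "X")]:
    print(slip.year_from, slip.meaning, "(" + slip.evidence + ")")
```

Recording the evidence for each slip matters as much as the meaning itself, since recollections of the same value may differ.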
Communicating a request for historic information to employees and giving them a straightforward way of replying is well within the capabilities of today’s collaborative tools, be they Slack, a wiki, or assorted others. Targeting the request to employees having a certain tenure or experience within the company is helpful, but should we limit such queries only to current employees? Organizations may be hesitant to ask former employees about the work they did while employed, but with people changing employers an average of twelve times in 32 years,[6] it is certainly worth exploring. Organizations often have “alumni networks.” Why not put them to work? I have no idea of the compliance or legal aspects, but it is worth investigating.
Using machine learning and AI tools could supplement the search for human knowledge. I can think back to many email exchanges over data definitions which I filed away in my Outlook archive; intelligent exploration of these repositories could shed meaningful light on history which people may have forgotten.
Of course, the biggest challenge may be motivating employees (and potentially non-employees as well) to dig into their memory banks or email archives. Financial rewards may seem an obvious solution, but it is worth noting that the Oxford English Dictionary began and continues its crowdsourcing requests to the public on a strictly volunteer basis. The history of appeals and even a televised contest with the wonderful name “Balderdash and Piffle” can be found on the OED website, in the article “‘Your dictionary needs you’: a brief history of the OED’s appeals to the public.”
Let’s not underestimate the power of such appeals, given how … their contribution to the amazing work of scholarship which is the OED. Paraphrasing their exhortation, “Your Data Needs You!”, may turn out to be an effective slogan.
[1] Murray, K.M. Elisabeth (1977), Caught in the Web of Words: James Murray and the Oxford English Dictionary, Yale University Press, pg. 194
[2] Chisholm, Malcolm (2015) “7 phases of a data life cycle”, Information Management, licensed by Bloomberg https://www.bloomberg.com/professional/blog/7-phases-of-a-data-life-cycle/
[3] Tolkien himself worked on the OED after the end of World War I. See Gilliver, Peter; Marshall, Jeremy; Weiner, Edmund. The Ring of Words, Tolkien and the Oxford English Dictionary, August 2009, OUP Oxford. Kindle Edition.
[4] Ellis, Matt (2017), “The evolution of crowdsourcing: an old idea whose time has come”, 99designs, https://99designs.com/blog/crowdsourcing/evolution-crowdsourcing/
[5] Chisholm, Malcolm (2021), “5 Reasons Data Democratization May Fail”, LinkedIn
[6] Tran, Michelle (2020), “How Many Careers Does the Average Person Have?”, Sparrows, https://stories.sparrows.co/stories/careers-in-a-lifetime