Data Speaks for Itself: Data Littering

No, this is not a mistyping of data literacy. Yes, like everyone, I am aware of and fully on board with the growing movement to improve data literacy in the enterprise. What I want to talk about is Data Littering, which is something else entirely. Data Littering is the deliberate act of creating and distributing data with missing or inadequate metadata annotation, thus rendering the data unusable junk for future applications. Another way to describe it is practicing “stingy metadata.” It is a deeply rooted problem that has been plaguing information technology and data management for years.

[Publisher’s Note: Join me in welcoming Dr. John Talburt as a new quarterly TDAN.com columnist. John is a recognized thought-leader and educator in the data management space.]

All of us are guilty of data littering. We create spreadsheets with nothing more than one- or two-word column headers. We send data files with no embedded metadata, only an unwritten understanding between the sender and receiver of what the records represent. Our directories are filled with old files whose original purpose we vaguely remember, but whose structure and data element meanings we have forgotten. When I worked at Acxiom Corporation several years ago, more than 800 associates were employed full-time to review the hundreds of data files received each day. Many of the files did not include any metadata, and those that did often had very sparse or incorrect metadata. One of the primary functions of the reviewers was to add or correct the metadata so that processing could proceed.

I think we all acknowledge that there has been a cultural shift when it comes to data. Everyone is talking about data as an asset. Companies are clearly beginning to realize that data is a resource for generating value and gaining competitive advantage in their markets. Thankfully, many years into the so-called Information Age, data and information are beginning to achieve parity with software. The “little I, big T” view of IT is beginning to turn around as we realize that the purpose of an information system is to build information products, and that software is only a means to that end.

That is all well and good, but there is still a major disparity: the lack of adequate metadata needed to make data not only a usable resource, but a reusable resource. While many organizations are embracing the FAIR principles of data (findable, accessible, interoperable, and reusable), we sometimes fail to understand that robust metadata is a key success factor for each of these components. Just as data has gained parity with software, metadata needs to have full standing with data.

Inadequate and missing metadata are a legacy of the past, when storage was at a premium and processor memory was measured in kilobytes, not gigabytes. Developers devoted every byte of memory to operational data; there was just no room for descriptive metadata. Metadata was either provided outside of the computer on paper or stored in someone’s head. Unfortunately, these habits still live on. Getting people to populate data dictionaries and data catalogs with meaningful metadata, and to keep it up to date, is like pulling teeth.

I remember attending a conference several years ago and hearing a speaker say that “in the future we will need thousands, and perhaps millions, of bytes of metadata for each byte of data.” At the time, I thought this was a bit hyperbolic if not absurd, but now I agree. When I think of all the aspects of data that are becoming increasingly important, such as security, classification, access, quality, lineage, retention, and responsible and accountable stewards, it is not so difficult to accept such a high ratio of metadata to data.

Having described the problem, what is the solution? First, I would say that the movement toward embracing data governance is definitely a step in the right direction. In many ways, data governance is metadata management. The questions of what data you have, where it is located, and what is in it are all answered by metadata. Almost all of the data governance requirements, such as data quality, data models, access, and security, are forms of metadata, albeit often only on paper. In short, I believe better data governance equals better metadata.

Another major problem we have with data governance is the lack of automation. The current practice of data governance relies heavily on the manual curation of metadata. Until vendors and the IT industry in general are fully focused on the importance of metadata, this will remain a problem. Software applications that read, write, and transform data need to be more metadata aware, able to produce metadata as well as ingest it. A great example of this is the Apache Ranger application developed for the Hadoop environment. It can be attached to specific data transformation tools and act on digital policies to control the actions the user can perform on data using the tool. In addition, Ranger generates event messages describing in detail the actions performed by the user with the tool. Software tools need to have these metadata features and controls built in by design. This would allow the data governance standards written on paper to be embodied as machine-interpretable metadata, automating compliance and enabling many catalog update functions to be fully automated.
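To make the idea of machine-interpretable policy a little more concrete, here is a minimal Python sketch. It is not Apache Ranger's API or policy format. The policy table, the `perform` function, and the event fields are all hypothetical names I am assuming for illustration. It shows only the general pattern: a tool checks each requested action against policy metadata and emits an event message describing what was attempted.

```python
import json
from datetime import datetime, timezone

# Hypothetical policy metadata: which roles may perform which actions on a dataset.
# This layout is an assumption for illustration, not a real policy format.
POLICIES = {
    "customer_master": {"analyst": {"read"}, "steward": {"read", "write"}},
}

audit_log = []  # event messages produced as a side effect of every request


def perform(user, role, dataset, action):
    """Check the policy, record an event message, and report whether the action is allowed."""
    allowed = action in POLICIES.get(dataset, {}).get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "dataset": dataset,
        "action": action,
        "allowed": allowed,
    })
    return allowed


read_ok = perform("jdoe", "analyst", "customer_master", "read")
write_ok = perform("jdoe", "analyst", "customer_master", "write")
print(read_ok, write_ok)
print(json.dumps(audit_log, indent=2))
```

Because the policy is data rather than a paragraph in a governance document, the same table could in principle feed both enforcement and catalog updates.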

A second factor is the development and adoption of standards for metadata management. One of the most recent and complete standards is ISO 8000 Part 110: Master data – Exchange of characteristic data: Syntax, semantic encoding, and conformance to data specifications. It describes in detail the requirement to merge into a single file the data, its metadata, and any data specifications using a machine-interpretable language. To ensure that the sender and receiver have the same understanding of the data, the standard describes a method of semantic encoding by which the metadata definitions and descriptions reference an entry in a shared data dictionary.
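The general idea of semantic encoding can be sketched in a few lines of Python. This is not the ISO 8000-110 syntax. The dictionary identifiers (`DICT-0001`, `DICT-0002`), the JSON layout, and the function names are assumptions made up for illustration. The sketch bundles the data and its metadata into one message, with each column pointing at an entry in a dictionary that sender and receiver share.

```python
import json

# Hypothetical shared data dictionary: concept identifier -> definition.
# Identifiers and layout are illustrative, not taken from any real dictionary.
SHARED_DICTIONARY = {
    "DICT-0001": {"term": "customer name", "datatype": "string"},
    "DICT-0002": {"term": "postal code", "datatype": "string"},
}


def build_message(records, column_refs):
    """Sender side: bundle records with metadata that ties each column to a dictionary entry."""
    unknown = [ref for ref in column_refs.values() if ref not in SHARED_DICTIONARY]
    if unknown:
        raise ValueError(f"unresolved dictionary references: {unknown}")
    return json.dumps({"metadata": {"columns": column_refs}, "data": records})


def describe(message):
    """Receiver side: resolve each column name to its shared definition."""
    payload = json.loads(message)
    return {
        column: SHARED_DICTIONARY[ref]["term"]
        for column, ref in payload["metadata"]["columns"].items()
    }


message = build_message(
    records=[{"nm": "Ada Lovelace", "zip": "72204"}],
    column_refs={"nm": "DICT-0001", "zip": "DICT-0002"},
)
resolved = describe(message)
print(resolved)
```

The cryptic column headers `nm` and `zip` no longer matter, because the receiver resolves them through the shared dictionary rather than through an unwritten understanding.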

The Electronic Commerce Code Management Association (ECCMA) even advocates a Universal Electronic Technical Dictionary that would provide Universal Data Semantics (see Len Silverston’s TDAN.com column, Zen and the Art of Data Maintenance: Data ‘Mine’ing and Universal Data Semantics). While this might not be practical in the near term, the idea of sharing the same metadata definitions through an electronic dictionary within a consortium of organizations is a practical solution for data exchange that is implementable today. If Acxiom and its data suppliers had all agreed to use this standard, then none of the manual reviews of incoming files would have been necessary. Instead, the validation of the metadata and the conformance of the data to data requirements would have been entirely automatic.
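As a rough illustration of what automatic conformance checking could look like, consider the following Python sketch. The requirement rules, the identifiers, and the `validate` function are hypothetical and are not drawn from ISO 8000-110 or from any Acxiom system. Each column of an incoming file references a shared dictionary identifier, and the data requirements attached to that identifier are checked without a human reviewer.

```python
import re

# Hypothetical data requirements keyed by shared dictionary identifier.
REQUIREMENTS = {
    "DICT-0001": {"required": True, "pattern": r".+"},
    "DICT-0002": {"required": True, "pattern": r"\d{5}"},
}


def validate(records, column_refs):
    """Return (row_index, column, problem) tuples; an empty list means the file conforms."""
    problems = []
    # First check the metadata itself: every reference must be known.
    for column, ref in column_refs.items():
        if ref not in REQUIREMENTS:
            problems.append((None, column, f"unknown reference {ref}"))
    # Then check each value against the requirements for its column.
    for i, record in enumerate(records):
        for column, ref in column_refs.items():
            rule = REQUIREMENTS.get(ref)
            if rule is None:
                continue
            value = record.get(column)
            if value is None or value == "":
                if rule["required"]:
                    problems.append((i, column, "missing value"))
            elif not re.fullmatch(rule["pattern"], str(value)):
                problems.append((i, column, "value does not match pattern"))
    return problems


column_refs = {"nm": "DICT-0001", "zip": "DICT-0002"}
records = [
    {"nm": "Ada Lovelace", "zip": "72204"},
    {"nm": "", "zip": "7220"},
]
issues = validate(records, column_refs)
for issue in issues:
    print(issue)
```

A file that arrives with an unknown reference or a nonconforming value is flagged immediately, which is the kind of work the manual reviewers were doing by hand.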

A detailed implementation of the ISO 8000-110 standard is described in ISO 22745: Open technical dictionaries and their application to master data. ISO 22745 is implemented in XML and was first adopted by the members of the North Atlantic Treaty Organization (NATO) for the exchange of parts and supplies among companies in treaty member countries. If you would like more information about the ISO 8000-110 data exchange standard, please send me an email (jrtalburt@ualr.edu) and I will be happy to send it to you. While you must purchase the actual standard from ANSI (the US member of ISO, ansi.org), I can send you a PDF of a chapter from one of my recent books describing the standard in detail.

By the way, ISO 8000-110 is only one of several standards in the ISO 8000 family of data quality standards currently under development. ISO 8000 is designed to be a companion to the widely adopted ISO 9000 family of process standards, and there are many other parts of it you may be interested in learning more about. I am a particularly strong advocate of ISO 8000-61: Reference model for data quality management, as the foundation for data quality programs and data governance standards for data quality.

When I was younger, people did not have second thoughts about throwing trash out of their car window onto the side of the road. But through education, the consequences of trash-strewn highways were brought into public awareness. On the positive side, people were encouraged to save trash in bags in their cars and deliver it to roadside waste cans. On the negative side, laws were passed to fine those who continued to litter. Littering is now illegal and very much frowned upon as an unacceptable practice by most of the public. You might say there has been a cultural change in our attitudes about highway littering.

I earnestly hope that we will see another cultural shift when it comes to metadata, and that we can stamp out data littering. As people and organizations become more metadata savvy, I can foresee a time when someone creating and sending a file that does not conform to an agreed-upon metadata exchange standard such as ISO 8000-110 would be looked upon as if they had thrown trash out of their car onto the highway.

Dr. John Talburt