TDAN: The Data Administration Newsletter, Since 1997

THE DATA ADMINISTRATION NEWSLETTER – TDAN.com
ROBERT S. SEINER – PUBLISHER

It's All About the Data

by Noreen Kendle
Published: January 1, 2008
It would be wonderful to be able to simply purchase a tool or technology and have your data challenges disappear. It is time to step back and take a much-needed, different look at data.

The Data Situation

With rapidly growing economic globalization and an increasing demand for technology, data has become a very valuable resource in our world today. As the demand for data increases, so does its growth. At the rate it’s growing, the total volume of data accumulated throughout human history is expected to double in just a few more years. Gartner predicts, “A typical business in 2007 stores ten times more data than in 2000.” With the world on a data binge, we have truly become an information society in a knowledge economy – extremely dependent on data and lost in a galaxy of unmanageable data.

Our dependency on data, as well as its growth rate, creates many challenges, especially concerning its quality. Larry English, a well-known data quality expert, estimates the cost of bad quality data may be as high as 10-20 percent of an organization’s revenue, and as much as 40-50 percent of the typical IT budget may actually be spent due to data errors. The Data Warehousing Institute estimates that bad data costs U.S. businesses over $600 billion a year. But the even greater costs are the unknown ones, arising from unrecognized inaccurate data used to make decisions and drive the organization.

Most organizations recognize their data has problems and spend ridiculous amounts of time hunting for missing data, correcting inaccurate data, creating “work arounds,” pasting data together, and reconciling conflicting data. This greatly impacts productivity, detracting from the bottom line. And to make matters worse, organizations have become so accustomed to this insanity that they treat these functions as “normal” and acceptable. In spite of the growing number of tools and technology touted as the “magic” solution to our data problems, the problems continue to grow as rapidly as the data. Tools and technology are simply not the solution. Rather, tools and technology automate the solutions. It would be wonderful to be able to simply purchase a tool or technology and have your data challenges disappear. It is time to step back and take a much-needed, different look at data.

Data and the Real World

Introduction

Data is core to the data professions and, as in any other field, it is very important to have a solid foundational understanding of it. Those who practice without theory are like electricians who work with electricity without knowledge of the basic physics of electricity. Sooner, rather than later, they are going to get zapped! As Leonardo da Vinci stated, “Practice should always be based on a sound knowledge of theory.”

There are many definitions of the term data, including one that defines Data as the name of a Star Trek character. In a Google search, almost all of the first 25 definitions were limited to a computer technology perspective. Therein lies a fundamental issue underlying many a data challenge today: an unawareness of the real world perspective of data. Data is not a computer technology phenomenon. Data existed long before computers were ever imagined. Understanding the concepts of data outside the confines of computer systems is essential to a foundational knowledge of the subject, as well as to solving many of our data challenges today. Let’s get out of the box of computer technology and really look at the subject of data.

Data Basics

Data is everywhere. The study of the subject of data is unique in that data is at the root of all fields of study, particularly scientific ones. The facts about anything only have to be recognized to exist. Data is Latin for “a given,” a term interchangeable with “fact.” A primary function of any field is the collection of facts (data), with the ultimate goal to discern the order that exists between and amongst the various facts and then to discover the ways in which the facts can be organized into meaningful patterns. Data is the observable evidence of all science.

Although data is typically thought of as a computer technology phenomenon, it existed long before computers, for as long as man and maybe even before, depending on your philosophical and religious views. In this broader sense, data is defined as the observable or measurable sensory input (facts) from research, discovery or collection about the real world: people, places, things, or events. These facts are usually recorded in the form of numbers, words, and images. Data is both the mechanism and the medium for capturing reality.

Data representing reality is captured through a process of abstraction, both reducing and generalizing reality. This conceptualization of the real world presents a much simpler view, easy to grasp and process. The human mind simply does not have the capacity to capture, hold and process all of reality because it is infinitely complex. Data represents the characteristics of reality considered important, neglecting those that seem secondary. As an approximation of reality, it does not include every aspect of the real world. If it did, it would no longer be a representation, but the real world itself.

The ultimate goal of data is to provide useful, accurate information in support of knowledge, as knowledge is power.  The usefulness of the data is directly dependent on the accuracy of its alignment to reality. The data has no value of its own. Its value totally depends on how well it represents the real world. Data can be processed by computers, individuals and/or organizations, but its viability still depends on the human skill of accurately capturing the reality it represents. Getting the data “right” is essential, although getting it perfect is nearly impossible because of the infinite complexities of the real world it represents. 

Data & Information

The term data is widely used, although inconsistently. Data is technically a plural noun, but it is popularly used as a singular one. The terms information and data are frequently used interchangeably, as data in everyday language is synonymous with information. Although the terms are very closely related, technically speaking, data are the individual facts collected about reality and used as the materials to build information. Information is what a person understands about reality; it is data in a more useful form.

The individual pieces of data are often referred to as “raw data.” At this lowest level, raw data has no inherent meaning because it is on its own, out of context. Data needs to be captured and recorded in context to have value. Context is the surrounding circumstances that help decipher meaning. The human mind needs this context to process information. Data in context does not necessarily equate to information because context provides no judgment or interpretation, teaches nothing and provides no sustainable basis for action. It is only when data is organized in some manner, for a specific purpose, that it yields information. Giving data context is not the same thing as organizing data for a purpose. Thus, the terms data and information are often confused.

Data does not automatically assemble itself into information; it takes human intervention. Information is what a person is able to understand about reality and use to answer questions – resolving uncertainty, adding to knowledge. Knowledge is what’s understood from the result of perception, learning and reasoning. The intellectual flow from data to information, knowledge and ultimately wisdom is critical in the decision-making process. What constitutes information to one person may, in fact, be data to another person. The distinctions between data, information, knowledge, and wisdom are more like shades of gray. The process of collecting information, analyzing and understanding it, making a decision and then taking action is extremely important to the survival of anything. If data are the details that form information, then only when the data is accurate, timely, and reliable will the information be useful and reliable, leading to accurate knowledge, decisions and, ultimately, survival. Thus data is essential for survival.

The Real World

If data is about capturing the real world or reality, it raises the question: What exactly is reality, and what are its nature and characteristics? Reality, in the broadest sense, is everything that is. The “real world” is a phrase used to refer to the physical reality of everyday life. Matter, space and time make up the basic framework of reality. Physical things are made of matter, constituting much of what can be observed. These physical things of the real world have identity, relationships, and properties – all giving meaning or context. If we can understand these characteristics of the real world, we can then understand the characteristics of data.

Physical things exist within a three-dimensional space, having the dimensions of length, width, and depth required to describe the position of anything. Reality changes from moment to moment as time advances. So physical things not only exist within space, but also exist within time. It is thought that time is the fourth dimension, although the human mind is unable to visualize four dimensions. Also, humans perceive time and space separately because of the nature of our senses, but scientific and mathematical theories view them as connected. This is the space-time concept of physics. Opinions differ on whether the past and future are part of reality, or whether reality exists only in the present moment. All of these interesting characteristics of reality are important to consider when capturing and working with data.

Matter, space, and time are intrinsically linked, and each gives meaning to the others. Matter exists within both space and time. Although this is based on high level, complex mathematical theories, simple observations support it. A tree, made of matter, has a location in space with the three dimensions of length, width, and height. The tree continually changes as time passes; it is either growing or dying. Thus, space and time give context to the tree (matter). Physical things in the real world are interrelated; as David Bohm, a quantum physicist, states, “Everything is connected to everything else.” A tree is in relation to the ground, as well as the atmosphere. If the atmosphere or ground changes, the tree is affected. Nothing exists by itself; everything is related. As the relational theory proposes, things are meaningful relative to other things. This interrelationship is very important, as it gives meaning or context.

Time is not only a fundamental component of reality, it is also fundamental to change.  Everything continually changes, as everything exists within time and time never stands still. Change is often spoken of in terms of time; age and speed are good examples. In nature, things are either growing or in a state of decay. Even if something changes by moving positions, time is still a factor. Physical events take place within space (a location) and time. Time and location are related as the time interval between two events could have different values if they are measured from different positions in space (frames of reference). This is a well-known effect of Einstein’s theory of relativity. 

The things and events in the real world can be defined and described by their properties. These properties can change or remain the same over time or space (location). Color is a reflection of light and a property shared by many physical things. Color can definitely change over time or space. Properties can be repeatable, being common to many things. They are often used to group or categorize things for generalizations. Classification systems are generalizations used to determine what kinds of things exist in the real world, based on similar properties. The taxonomy of biology, defining the kinds of living things, is a great example of a classification system. 

Although there are many similar characteristics that exist among things in reality, nothing is truly identical in the real world. Take snowflakes, for example. Everything has its own uniqueness that gives it identity. There is also a repeatable nature about reality, as there are many cycles, or sequences of events, that repeat themselves in the real world. The seasons, the phases of the moon and the tides are good examples of real world cycles. The characteristics of reality or the real world, such as location, time/change, relationships, properties, and identity, are very important when capturing reality because they give meaning or context.

Capturing the Real World through Data

Data exists, whether it is captured or not. It does not have to be written down to be captured, as it can be captured within someone’s head. But, for data to be used by many, it is generally recorded, as it has been for as long as man has existed. Data can be recorded manually or electronically in the form of numbers, figures, text, images or audio/video. Prior to the computer age, data was captured manually with paper-based files and records. With computerization came automated data capture and electronic formats. Regardless of the media used, data needs to effectively represent reality. Capturing data in context is core to assuring an accurate, meaningful representation. It makes perfect sense that the same things that give context in the real world give context to the data as it’s recorded. These things include time/change, identity, relationships, and properties/classification.

Data representing reality is a “snapshot” of facts at a point in time. These facts are the things and events in the real world that exist within the dimension of time. Due to the nature of the time dimension, reality in itself cannot be saved, as it continually passes. Unlike in fiction, time never stops in the real world. Nothing stays the same; everything “ages” or changes – some things quickly, some things slowly. This is especially true for real world events, as they happen at a moment in time. The real world is always captured at a point in time. Then, as it changes, the time element is used to help represent that change, as well as to reference the various states of the captured data or reality. In the same way as in the real world, data is relevant to time, whether the dimension of time is recorded with the data or not. But when time is not captured, there is a risk the data will be compromised, as time gives data meaning or context.
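The role of time in a snapshot can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions: the observations, their dates, and the function name `value_as_of` are all hypothetical and do not come from any particular system. It shows that once each fact is recorded with the point in time at which it was observed, earlier states of reality can still be referenced after reality has changed.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical snapshots of one property of one thing,
# ordered by the date on which each was observed.
observations = [
    (date(2005, 6, 1), "brown"),
    (date(2007, 3, 15), "gray"),
]


def value_as_of(history, when):
    """Return the most recent recorded value on or before `when`, or None."""
    dates = [observed_on for observed_on, _ in history]
    index = bisect_right(dates, when)  # number of snapshots taken by `when`
    return history[index - 1][1] if index else None


print(value_as_of(observations, date(2006, 1, 1)))
print(value_as_of(observations, date(2008, 1, 1)))
```

Had only the latest value been stored, with no date, the earlier state would be lost and there would be no way to tell how current the remaining value is.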

Real world things, including concepts and events, have identity, as they can be uniquely described. Identity needs to be recorded with the data, and perhaps even levels of identity, giving it context. Description is just a more verbose form of identity and also needs to be captured. Raw data does not have meaning on its own. Recording the “raw data” item “brown” by itself is pretty meaningless. Brown what? Brown as a color, brown as a person’s last name, brown as a place? If we choose hair color, then is it a hair color choice, or someone’s actual hair color? If someone’s hair color, then whose? But if Brown is the name of a person, then is it the last name or the first? Whose name? Identity is essential when it comes to capturing data in context.
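The “brown” example can be sketched in code. What follows is a minimal Python illustration, not a recommended design; the class `Fact`, its field names, and the identifiers `P-1001` and `P-2002` are hypothetical, chosen only to show how the same raw value takes on different meanings once identity and context are recorded with it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A raw value recorded together with the context that gives it meaning."""
    subject_type: str  # what kind of thing the fact is about, e.g. "person"
    subject_id: str    # which specific thing (its identity)
    attribute: str     # which property of that thing
    value: str         # the raw data item itself


def describe(fact: Fact) -> str:
    """Render a fact as a sentence a reader can interpret."""
    return f"{fact.attribute} of {fact.subject_type} {fact.subject_id} is {fact.value}"


raw = "brown"  # meaningless on its own: brown what?

# The same raw item, captured in two different contexts.
hair = Fact("person", "P-1001", "hair color", raw)
surname = Fact("person", "P-2002", "last name", raw.capitalize())

print(describe(hair))
print(describe(surname))
```

The two records share a raw value but answer different questions, which is possible only because the subject and the attribute were captured alongside it.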

Data derives meaning from its relationships. These are very important and need to be captured because everything that exists in the real world derives meaning from its relationships; things have meaning relative to other things. Relationships are fundamental to the nature of data and information. A change in a relationship changes the meaning of the data. Capturing these relationships can be challenging because there are usually multiple relationships between things, as well as different types of relationships in the real world. The relationships that can exist between people are a good example: a person can be both a manager and a brother to the same person. The brother relationship is two-way, whereas the manager relationship is one-way. There are many combinations and types of relationships. All relationships give data meaning and context. It is critical to capture only the relevant relationships, in support of the simplest yet most meaningful representation possible.
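The manager-and-brother example can be written down as data. The sketch below is one possible representation, assuming a simple list of typed relationships; the names `Relationship`, `related`, `Al` and `Bob` are invented for illustration. It records two relationships between the same pair of people and marks which of them holds in both directions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    source: str
    kind: str
    target: str
    symmetric: bool = False  # True when the relationship holds in both directions


def related(relationships, a, kind, b):
    """Return True if `a` stands in relationship `kind` to `b`."""
    for r in relationships:
        if r.kind != kind:
            continue
        if (r.source, r.target) == (a, b):
            return True
        if r.symmetric and (r.source, r.target) == (b, a):
            return True
    return False


rels = [
    Relationship("Al", "manages", "Bob"),                     # one-way
    Relationship("Al", "brother_of", "Bob", symmetric=True),  # two-way
]

print(related(rels, "Al", "manages", "Bob"))     # Al manages Bob
print(related(rels, "Bob", "manages", "Al"))     # the reverse does not hold
print(related(rels, "Bob", "brother_of", "Al"))  # holds in either direction
```

Dropping either record, or recording the wrong direction, would change what the data says about these two people even though no value about either person changed.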

Characteristics are features, properties or details about real world things or events that also need to be captured with data. Categorization is where things are recognized, differentiated and understood, based on common characteristics. As there is order and structure to the real world, identity, relationships, characteristics and categories are all important to help structure the data. Structure provides a framework for capturing the data about reality and adds context or meaning to the data, such that not only the data itself, but also its structure represents the real world.   

Consumers of information oftentimes have different needs. Rarely is there only one use for the data collected. Since data is used in a building-block fashion to produce information, it is important that it be collected and recorded in the most basic or atomic form possible. This is similar to reusable parts that can fit into many different subassemblies. Data needs to be captured in the most detailed form possible, rather than in a specialized form. This enables the flexibility to put the data together in different ways, supporting future requirements for information, as the same data organized differently can produce different information.
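A small sketch shows how atomic data supports more than one use. The Python below is illustrative only; the sales records, their field names, and the helper `total_by` are made up for the example. The same detailed records are organized two different ways to yield two different pieces of information.

```python
from collections import defaultdict

# Hypothetical atomic records: one row per individual sale.
sales = [
    {"date": "2008-01-02", "store": "Atlanta", "product": "hammer", "amount": 12.0},
    {"date": "2008-01-02", "store": "Boston", "product": "hammer", "amount": 12.0},
    {"date": "2008-01-03", "store": "Atlanta", "product": "ladder", "amount": 80.0},
]


def total_by(records, key):
    """Sum the `amount` field of the records, grouped by the given field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["amount"]
    return dict(totals)


by_store = total_by(sales, "store")      # information for a regional manager
by_product = total_by(sales, "product")  # information for a merchandiser

print(by_store)
print(by_product)
```

If only the per-store totals had been recorded, the per-product view could not be reconstructed; keeping the detail preserves the option to answer questions that were not anticipated at capture time.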

Data is recorded in order to manage current reality, look at past reality and/or to predict future realities. Regardless, it is absolutely essential that the data represents reality as closely as possible. Therefore, choosing the right data to record, at the right level, as well as correctly recording the data is essential. Errors in capturing or recording reality result in an erroneous picture of the past and present reality, leading to a distorted prediction of the future. Capturing a correct picture of reality is critical to the usefulness of data. Data not properly collected and recorded is actually more of a detriment to a user of the information than not having the data.

Data Capture and Perception

Capturing reality is not an easy task due to the nature of perception in general.  Perception is an individual’s sensory experience or interpretation of reality. And as it is often said, “perception is everything.”  There is external reality and then there is internal reality. They are rarely, if ever, the same. Internal reality is everything observed within the mind, as well as the thoughts about the observations. Even when something is directly observed by actually seeing it, internally, the mind has to process and interpret back a three-dimensional view from the two dimensions the eyes work in. 

Individuals interpret reality differently due to the uniqueness of the internal pre-processing our brains use to filter the information around us. Then there is the added filtering the brain performs using each person’s experiences, knowledge base, prejudices, and genetics. In addition to the filtering, the mind is also able to creatively produce things that are independent of observations – feelings about the observation, for example. The variations in perception complicate the task of accurately and consistently capturing reality, so carefully choosing a consistent means for capturing and recording the data is essential.

Perception is not only a factor in capturing reality, but also when using the data to interpret the real world things or events.  As the cave drawings captured prehistoric reality, the correct interpretation of the drawings is very important to get an accurate understanding. The human mind or perception is also involved when the representation of reality is interpreted and used. The greater the degree of detail and clarity of context captured within the data, the more likely the interpretation will be correct and the data used successfully. This is where data context or structure and definition play a big role. If either the capture of reality or its interpretation is incorrect, then the resulting information will be flawed and relatively useless. 

Closing Thoughts

A solid knowledge of the real world things or events is critical to recording them correctly. When these things or events are not familiar or understood, or are even out of context, then their representation will surely be flawed. Also, the more removed from reality the capture of data and its context, the higher the probability of errors. Second-hand or third-hand knowledge is a very big risk, as is the case when data is not captured directly from reality, but from another representation of reality that was captured from yet another representation of reality, and so forth. This would be like taking a photograph of a photograph of a photograph, leading to a badly distorted image.

Technology often takes on a life of its own, especially when it ends up driving the organization, like the tail wagging the dog. When deep in the weeds of technology, it can be extremely difficult to look up and realize what it’s all about and what we are really doing. It all boils down to the data that represents the real world in order to manage the current reality, look at past reality and/or predict future realities. So it is really “all about the data,” not the technology. Technology is great, but it is still only the vehicle for capturing, manipulating, managing and delivering the data. “If the data isn’t right, nothing is.” There cannot be a truer statement. As data professionals, we especially need to get back to basics and understand data from this real world perspective, outside the confines of computer technology, if we are ever going to have any hope of taming the data chaos.

Noreen Kendle -

Noreen is currently the Senior manager of Data Architecture for the Home Depot. With over 25 years of experience in information technology, primarily within data architecture, she has worked within the communication, financial, manufacturing, health, non-profit, retail, graphics, and travel industries.  Noreen has extensive industry experience leading the development and implementation of data architectural initiatives including: enterprise modelling methodology, master data, enterprise data management, data ownership, enterprise data integration, data quality, enterprise business intelligence, metadata, data governance, and physical design.  She can be reached at noreen.kendle@comcast.net  or 678-768-6591.