Data Value Domains

Published in TDAN.com April 2003


Data Value Domains

Since each column in a database table is assigned a data type, we have an expectation that each data attribute draws its values from a specific set of allowed values as defined by that data type.
For example, when a column is defined to have the type integer, we expect that all values assigned into that column are integral numbers, and that column could not be assigned a character string.

In a way, our expectation is that any value bound to a field is taken from a value set that has some structural or syntactic definition as well as some semantic meaning. Essentially, these
expectations boil down to a set of restrictions on the values that an attribute may take. The semantic rules are additional restrictions on the set of valid values for an attribute that are
expressed as a subset of the allowed structural values. When we can define an explicit set of restrictions on a set of values within a data type, we call that a data value domain, and we can use
the definition of data value domains as a means for data validation.


An Example

A simple example is US Social Security numbers. A social security number has a data type (VARCHAR(11)), but it also has both structural restrictions and semantic restrictions. The structural
restriction is on the form: 3 digits (0-9) followed by a hyphen (-), followed by 2 digits, a hyphen, then 4 digits. The semantic restrictions specify rules about the number itself. The first 3
digits denote the state (or area) where the application for the number was filed. The next 2 digits are called the group number, and they are issued in a particular order, namely odd numbers from
01 through 09, followed by even numbers from 10 though 98. After those groups have been allocated, then even numbers from 02 through 08, followed by odd groups numbers from 11 through 99. Each
month, the Social Security Administration (SSA) publishes high group numbers for each area. The final four digits, called the serial number, are allocated consecutively for each group. And,
according to the SSA, “alleged Social Security numbers containing area numbers other than those found on that table are impossible.” These rules define what is and is not a valid social
security number. A program can be written to determine whether a specific social security number is valid or not.


What is a Data Value Domain?

More formally, a data value domain definition is composed of a domain name, a set of rules that describe the set of values that belong to a restriction set, and a description of the meaning
associated with that set. The most basic data value domains are the simple data types themselves, since they possess these characteristics. For example, “VARCHAR(25)” is the name of a
data value domain consisting of character strings of length 25 or less, which both provides the restriction rule (i.e., length(x) <= 25), and the domain description.

Defining data value domains is useful as meta data within an organization for at least three interesting reasons. The first is that as organizations become more distributed in their processing,
there is a greater likelihood that different groups make use of the same conceptual data value domains without having an enterprise-wide convention for their use. Having a centralized repository of
domain definitions will help in the information sharing and integration process. The second reason is that when data value domains are defined in a formal way, those definitions can be used to
automatically generate validation filters. For example, knowing the rules that define a valid social security number allows us to generate a script or program that can then test a social security
number to make sure it is valid within the domain definition.

The third reason is that by ascribing a name and a meaning to a set of values, we are supplying additional knowledge about the use of data within the organization. This knowledge can survive the
maturation of systems and the departure of personnel, since referring to used data domains provides meta knowledge about the business process, even if the specific rules for defining that domain
may change over time. For example, we will want a payroll process to only pay “active employees,” no matter how the set of active employees is defined.

While there may be many ways to define data value domains, in this article we look at three specific ways:

  • Enumeration
  • Description
  • Derivation


Enumerated Domains

Perhaps the simplest way to define domains, an enumerated data value domain is a list of values that belong to that domain. For example, the colors of the rainbow can be defined as
{“RED”, “ORANGE”, “YELLOW”, “GREEN”, “BLUE”, “INDIGO”, “VIOLET”}. We make use of enumerated domains very
frequently, such as US Postal Service state abbreviations, country names or country codes, or currency names. When using enumerated domains, automated validation can be performed by constructing a
hash set of the valid values and then checking each column value for membership within that has set.


Descriptive Domains

A descriptive domain is defined as a set of one or more formats or patterns to which values must conform. A good example is a North American telephone number, which we might restrict to one of
these two patterns (where D represents a digit from 0-9):

  • (DDD) DDD-DDDD
  • DDD-DDD-DDDD

Checking for value conformance within pattern-based descriptive domains is relatively straightforward, especially if the patterns are amenable to representation using regular expressions. Many
programming languages provide support for building finite state machines that can automatically check whether a value can be constructed using the described pattern, and this can help with value
validation.


Derived Domains

Given an already existing set of values derived from data instances (such as records) within a structured context (such as a database table), we can derive new domains from the existing set. These
domains are derived as a subset of the original values using rules that express dependence on corresponding attributes within each instance. For example, we might define the set of active employee
IDs as the set of employee IDs from the EMPLOYEE table where the ACTIVE field is set to ‘Y’.

These kinds of domains are interesting in that they embed intra-instance dependence within the definition. This dependence is important because it exposes potential business dependences that are
embedded in the data set itself. These kinds of dependences can be “extracted” and articulated as business meta-knowledge as well.


Composite Definitions

Some domains are described better using composite definitions that make use of format (i.e., syntactic) and derivational (i.e., semantic) rules together. Our earlier description of North American
telephone numbers is a good example. In reality, we might want to restrict the format of a telephone number to one of the above patterns, but we also might want to provide additional restrictions
regarding the numeric components of the number. For example, the first group of three digits within the number is known as the NPA (an abbreviation for “Numbering Plan Area,” also known
as an area code), and second group of three digits within the number is referred to as the NXX (or exchange). For any telephone number, the NPA must be taken from the set of valid NPAs, and the NXX
must be valid within the area attributed to the NPA (as defined by a standards committee). This is an augmentation of the domain definition for telephone numbers by enforcing a
“sub-domain” relationship related to components of the number.


Domains and Metadata

The value of data value domains as metadata is clearly seen. The International Committee for Information Technology Standards (which operates under ANSI, the American National Standards Institute)
has a technical committee (called L8) to define metadata standards, and a large part of their work involves the definition of metadata repositories, which is largely focused on data value domain
definitions. For more information, see their web page http://incits-l8.org/.

Share

submit to reddit

About David Loshin

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality ApproachÊand Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.

Editor's Note: More articles and resources are available in David's BeyeNETWORK Expert Channel. Be sure to visit today!

Top