Taking the Temperature of Your Data

There is a large variety of quantitative techniques available to assist in the development of mathematical models, but the seasoned practitioner understands that they all do basically the same
thing: they help us search for a set of variables, weights and operators in the form of an equation. When that equation is applied to a set of decision data, it enhances the performance of our
decision making.

The algorithms behind our model development effort are seeking those variables that have information content relative to the goals we have defined. Our data, and the information content it
contains, is the source of enhanced performance.

Successful practitioners typically spend 75% to 80% of their overall modeling effort preparing data. These efforts deal with issues such as understanding the context of the available data fields,
handling of missing data, identifying and correcting data errors, identification and representation of interaction effects between variables, mathematical transformation of data to obtain different
perspectives on the information content, and data representation schemes appropriate for the type of data being utilized.

Practitioners new to predictive analytics often overlook this last issue. The physical representation of the data in their data set can have a significant impact on the information content
presented to the modeling technique. This article presents a brief discussion comparing two approaches: common data representation, and an enhanced approach for certain types of data.


Data Types

Just as quantitative techniques have strengths and weaknesses, so does our data. When considering the context of our data, it is also important to understand the mathematical capabilities of our
data. It is trivial to point out that the mean and standard deviation of a variable such as ZIP code are meaningless at best. However, many practitioners overlook more serious
considerations and miss important data representation issues as a result.

Each variable in your data set should be clearly identified as being either quantitative or qualitative in nature. The characteristic of importance here is “order.” There is no inherent order in a
qualitative variable. Quantitative variables, on the other hand, have an underlying order. It is beyond the scope of this article to consider the types of mathematics that are appropriate for the
various types of quantitative variables (ordinal, interval and continuous). Rather, we will focus on the implications of the characteristic of “order” and data representation schemes that are of
use to enhance the extraction of information content.


Qualitative Variables

A qualitative variable is simply a variable that describes a set of categories. The variable will have two or more values, each representing a category meeting a particular set of
conditions. An example of a qualitative variable is marital_status.

For this discussion, let’s assume that marital_status has the following values:

 

 Marital Status
 Married
 Single
 Divorced
 Widowed
 Separated
 Other

 

The values of the variable marital_status have no relative order. We can easily rearrange them in any other order with no impact on the information content.

However, from a predictive analytics perspective, we still have many questions to address for a field of this type.

  • Are the values exhaustive? Have they captured all possible circumstances? The value “Other” takes care of this for us.

  • Are the values mutually exclusive? Treating the field as an individual’s current status, rather than every status that has ever applied, is a start. However, the values as stated imply that
    “Separated” is mutually exclusive from “Married.” Is that the context we want?

  • Should this variable be represented in our modeling data as one variable with six values, or are there other alternatives that should be considered?

  • Do we need six values for the variable marital_status?


Collapsing Values

For marital_status, we have identified six values. Is this the appropriate number of categories? It is important to understand that there is no generally “right” answer to
this question. The answer always depends on the context of usage. For some decision environments, six values will be the most appropriate representation. For others, we should ask:

  • Is it sufficient to use only the values “Married” and “Other”?

  • Do we need to combine “Divorced” and “Widowed”? Is it sufficient to know only that the individual was at one time married, but no longer is?

  • Should we combine “Single,” “Divorced” and “Widowed”? Is it sufficient to know that the person is not currently married?

These are empirical questions. They can only be answered in the context of the particular decision environment we are exploring. How many values to use, and how to collapse the values, is best
answered by testing each of the combinations and measuring the impact that the representation has on performance.
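
One way to set up such a test is to write each candidate collapsing scheme as a lookup table and apply it to the raw field before modeling. The sketch below is illustrative only: the scheme names and the function collapse_values are hypothetical, and the groupings are simply the ones raised in the questions above, not recommendations.

```python
# Hypothetical collapsing schemes for marital_status. Any value that a scheme
# does not list falls through to the default category "Other".
COLLAPSE_SCHEMES = {
    "married_vs_other": {
        "Married": "Married",
    },
    "previously_married": {
        "Married": "Married",
        "Single": "Single",
        "Divorced": "Previously Married",
        "Widowed": "Previously Married",
        "Separated": "Separated",
    },
    "not_currently_married": {
        "Married": "Married",
        "Single": "Not Married",
        "Divorced": "Not Married",
        "Widowed": "Not Married",
    },
}


def collapse_values(values, scheme, default="Other"):
    """Map each raw value to its collapsed category under the named scheme."""
    mapping = COLLAPSE_SCHEMES[scheme]
    return [mapping.get(value, default) for value in values]


raw = ["Married", "Divorced", "Widowed", "Single", "Separated", "Other"]
for name in COLLAPSE_SCHEMES:
    print(name, collapse_values(raw, name))
```

Each collapsed version of the field would then be fed to the same modeling technique, and the resulting performance compared on whatever metric the decision environment calls for.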


Data Representation Alternatives

We must also consider the impact of different data representation schemes. In this case there are two alternatives:

  • A single variable with six values, as above, and

  • Six variables, one for each of the values, using binary representation for each… commonly referred to as a 1 of N representation.

 

 

            Married  Single  Divorced  Widowed  Separated  Other
 Married       1       0        0         0         0        0
 Single        0       1        0         0         0        0
 Divorced      0       0        1         0         0        0
 Widowed       0       0        0         1         0        0
 Separated     0       0        0         0         1        0
 Other         0       0        0         0         0        1

 

The 1 of N representation allows for more flexibility. Some of our modeling techniques may identify relationships differently than others. Some may focus only on one of the values. Others may use more
than one, but not all of the values. Still others may use all six values. This inherent flexibility makes the 1 of N representation appropriate for virtually all qualitative variables.
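
A 1 of N representation is simple to build by hand. The following is a minimal sketch in plain Python; the function name one_of_n and the constant MARITAL_STATUS_VALUES are made up for this illustration, and the column order is just the order used in this article.

```python
# Column order for the six binary variables (an assumption of this sketch).
MARITAL_STATUS_VALUES = ["Married", "Single", "Divorced", "Widowed", "Separated", "Other"]


def one_of_n(value, categories):
    """Return a list of 0/1 flags with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


for status in MARITAL_STATUS_VALUES:
    print(f"{status:<10}", one_of_n(status, MARITAL_STATUS_VALUES))
```

Because each value becomes its own binary variable, a modeling technique is free to use one, several, or all of the six columns.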


Quantitative Data

Let’s explore another example… Education_Level.

 Education Level

 < High School
 High School
 Some College
 Bachelor’s Degree
 > Bachelor’s Degree  

 

Education_Level is an example of quantitative data. While it isn’t represented by numeric values, “order” is a significant characteristic. This is, in fact, an ordinal
variable. It would be inappropriate to perform arithmetic on these values, even if they were coded numerically, because the intervals between adjacent levels are not consistent.

Just as we considered collapsing the values in the variable Marital_Status, above, the same considerations apply here. The number of values appropriate for
Education_Level is purely determined by empirical testing in the decision environment we are working in.

The data representation issues are also similar. We can obtain a number of advantages by using a 1 of N representation for Education_Level.

 

 

                      < High School  High School  Some College  Bachelor’s Degree  > Bachelor’s Degree
 < High School              1             0            0                0                   0
 High School                0             1            0                0                   0
 Some College               0             0            1                0                   0
 Bachelor’s Degree          0             0            0                1                   0
 > Bachelor’s Degree        0             0            0                0                   1

 

While this 1 of N representation allows for the flexibility advantages discussed above, it does not capture the “order” characteristics of the variable Education_Level. If this
representation were used as an output variable, for instance, your answers would either be correct or incorrect. You would be unable to assess the degree of incorrectness, as the data
representation scheme does not capture that information.
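
To make this concrete, the sketch below compares 1 of N encodings by counting the positions where they disagree. That count is only one assumed way to score an error, chosen here for illustration, and the helper names are hypothetical. Under this measure every wrong answer looks equally wrong.

```python
EDUCATION_LEVELS = [
    "< High School",
    "High School",
    "Some College",
    "Bachelor's Degree",
    "> Bachelor's Degree",
]


def one_of_n(value, categories):
    """Return 0/1 flags with a single 1 at the position of `value`."""
    return [1 if category == value else 0 for category in categories]


def mismatched_positions(a, b):
    """Count the positions where two equal-length encodings disagree."""
    return sum(x != y for x, y in zip(a, b))


actual = one_of_n("< High School", EDUCATION_LEVELS)
distances = [
    mismatched_positions(actual, one_of_n(predicted, EDUCATION_LEVELS))
    for predicted in EDUCATION_LEVELS
]
for predicted, distance in zip(EDUCATION_LEVELS, distances):
    print(f"{predicted:<20} {distance}")
```

The correct answer scores 0 and every incorrect answer scores 2, whether the prediction missed by one level or by four.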

On the other hand, consider a different representation scheme, a Thermometer Representation.

 

                      < High School  High School  Some College  Bachelor’s Degree  > Bachelor’s Degree
 < High School              1             0            0                0                   0
 High School                1             1            0                0                   0
 Some College               1             1            1                0                   0
 Bachelor’s Degree          1             1            1                1                   0
 > Bachelor’s Degree        1             1            1                1                   1

 

The logic of a Thermometer Representation is very straightforward. An individual in the category High_School has all of the attributes of someone in the category
< High_School… plus something else. An individual in the category Some_College has all of the attributes of someone in the category
High_School… plus something else. And so on.

The Thermometer Representation allows us to capture “order” in our values and, as a result, allows us to consider degree of incorrectness.
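
A minimal sketch of a Thermometer Representation follows, assuming the levels are supplied from lowest to highest. The function names are hypothetical, and counting mismatched positions is again just one assumed way to score the degree of incorrectness.

```python
# Levels listed from lowest to highest; the encoding depends on this order.
EDUCATION_LEVELS = [
    "< High School",
    "High School",
    "Some College",
    "Bachelor's Degree",
    "> Bachelor's Degree",
]


def thermometer(value, ordered_levels):
    """Set a 1 for the level of `value` and for every level below it."""
    rank = ordered_levels.index(value)
    return [1 if position <= rank else 0 for position in range(len(ordered_levels))]


def mismatched_positions(a, b):
    """Count the positions where two equal-length encodings disagree."""
    return sum(x != y for x, y in zip(a, b))


actual = thermometer("< High School", EDUCATION_LEVELS)
distances = [
    mismatched_positions(actual, thermometer(predicted, EDUCATION_LEVELS))
    for predicted in EDUCATION_LEVELS
]
for predicted, distance in zip(EDUCATION_LEVELS, distances):
    print(f"{predicted:<20} {thermometer(predicted, EDUCATION_LEVELS)} {distance}")
```

Here the count grows with the number of levels separating the prediction from the actual value, so a near miss scores lower than a distant one.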

While it would be physically possible to use a Thermometer Representation on the Marital_Status variable, discussed above, it would not make sense to do so. A qualitative
variable has no “order.” On the other hand, restricting our data representation method for a quantitative variable to a 1 of N representation misses an important characteristic of the
information content available.

It is worth noting that a Thermometer Representation also allows us to control the direction of error. In the representation above, the logic reinforces the building of levels. As a result, this
representation scheme will have a tendency to underestimate the value.

Is this what we want? Again, it depends. If we are in a decision environment where we would prefer to have overestimation when we are incorrect, we simply need to invert the Thermometer
Representation to achieve that result.

 

                      < High School  High School  Some College  Bachelor’s Degree  > Bachelor’s Degree
 < High School              1             1            1                1                   1
 High School                0             1            1                1                   1
 Some College               0             0            1                1                   1
 Bachelor’s Degree          0             0            0                1                   1
 > Bachelor’s Degree        0             0            0                0                   1
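
The inverted form can be sketched the same way. As before, the function name is hypothetical and the levels are assumed to be listed from lowest to highest. The only change from the standard thermometer is the direction of the comparison.

```python
# Levels listed from lowest to highest; the encoding depends on this order.
EDUCATION_LEVELS = [
    "< High School",
    "High School",
    "Some College",
    "Bachelor's Degree",
    "> Bachelor's Degree",
]


def inverted_thermometer(value, ordered_levels):
    """Set a 1 for the level of `value` and for every level above it."""
    rank = ordered_levels.index(value)
    return [1 if position >= rank else 0 for position in range(len(ordered_levels))]


for level in EDUCATION_LEVELS:
    print(f"{level:<20}", inverted_thermometer(level, EDUCATION_LEVELS))
```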



Conclusion

Take the time to carefully consider the attributes of your data fields. Creatively match data representation schemes with the characteristics of the variable in use. This effort can have a
dramatic impact on the performance of your models.

Enhanced model performance comes from extracting as much information content as possible… relative to the specific performance metrics you are using to measure success.

All Rights Reserved, The Modeling Agency. Copyright © 2008

 


About Thomas Rathburn

Thomas A. "Tony" Rathburn is a senior consultant and director of training with The Modeling Agency. Tony has more than 25 years of predictive analytics development experience, and he is a regular speaker on data mining and predictive analytics at TDWI Conferences. He is also a co-presenter for a popular webinar entitled “Data Mining: Failure to Launch,” produced live monthly by The Modeling Agency. He can be contacted at Tony@The-Modeling-Agency.com.
