How to Manage Null Values in Your Data Warehouse

ART03x - Iverson & Oates - imageI’m sure many business intelligence developers have experienced a customer asking why there are null values in their report and what they mean.

It’s very common to have unknown values in a data warehouse solution and there are a number of reasons for them. Sometimes developers take action to fix or hide these unknowns, but sometimes developers decide it’s okay to just leave them alone;it all depends if the developer understands where this null value came from.

With that as a backdrop, it is important that the developer consider a few rudimentary questions before deciding on the next course of action.

Data Quality

Forget about the design; the first action item is to always check your data quality. Often the source data is invalid for the field type or there’s a domain constraint that wasn’t captured correctly. These are easy mistakes to make, but it takes time and effort to discover them since it’s a very detailed process. It’s best to begin the data quality audit by checking to see if there’s a violation on the data selection rules.

Is it a Dimension Attribute?

In an OLTP relational database, it’s important to keep null values in place to ensure data integrity and the “real truth” of the data. However, unlike a transactional system, a data warehouse has more complex situations. For instance, in scenario one, a null value attribute could reside in a dimension table. This could occur because certain attributes may not be applied to all the dimension records.

For example, one can have an empty value in state if not a U.S.-based customer. On the other hand, the data would not be available if there is a delay of entry in the source system. For either situation, we can create a custom transformation to capture the null values instead of just leaving a blank field there. A simple way is to insert a row with -1 as key and “Not Applicable” as the description. Also, we can always assign any custom value to represent a missing piece of data.

Is it a Fact Measure?

A null fact measure is very similar to a null dimension attribute. The cause could be non-existing data or delayed data input. When we consider the interpretation of fact measures, it’s actually fine to treat a blank value as zero. When it comes to cube aggregation and report development, it won’t be a big challenge to display the null values as 0 or 0.00. That said, a developer can simply leave the null value as is and change the format strings in the following database tools.

Is it a Fact Foreign Key?

This is probably one of the most important ETL steps a developer should take care of. In theory, any foreign key should be assigned a value to avoid the violation of referential integrity. There are plenty of reasons for a null value in a foreign key field. For example, the join could be incorrect, the fact row may not be applicable to the specific dimension, or the key might not exist in the source system. Thus, we need to maintain a consistent method to deal with empty foreign keys. For anything that is not applicable or not related to the fact table, we can set the key as -1,-2, or -3 for its particular situation. From there, we can insert the keys and corresponding descriptions into the dimension tables.

Inserting -1, -2, or -3 for different missing fact foreign keys can make it difficult to reconcile the missing dimension value when it finally arrives. Many fact records that have a -1 in multiple foreign key fields may cause confusion since many fact records may, in fact themselves, have a -1 in this particular foreign key field. There is no way to determine with which fact record to associate the missing dimension data when it finally does arrive.

For missing fact foreign keys, most experts recommend creating an “inferred” type 2 slowly changing dimension record for the missing fact foreign key. You can do this by:

  1. Making sure that your dimension records have a field for a Business Key (this is the key for this field from the source system).
  2. Making sure that your dimension record has a field called Status.
  3. Creating a new dimension record and inserting the surrogate key from this new dimension into the fact record foreign key field.
  4. Inserting the source Business Key from the source transaction record into the newly created dimension record Business Key field.
  5. Setting the status of the newly created dimension record Status field to a value representing something like “Dimension key not available when fact record loaded.”
  6. Also, when the source data for the late arriving dimension record finally does arrive, handle it in the normal way for type 2 slowly changing dimensions by creating a new dimension record and updating the missing values from the source data record.

An example of a late arriving dimension applies to the scenario when a marketing custom product is sold in a special business unit and the item cannot be found in the product dimension table. When the information about the new product finally comes in from the source system, the missing data will be properly updated. 

Consistent Action

No matter what action you take, we have learned from the field that the key is to stay consistent in your data warehouse development. Inconsistency can definitely create confusion to business users when they see three rows each with a blank, “N/A,” or “Not Applicable” in their pivot table, even if they have the same meaning. A developer’s job is to treat null values based on the identified cause and implement the solution consistently throughout the system.

Share

submit to reddit

About Heine Krog Iverson, Joe Oates

Heine Krog Iverson is the CEO of TimeXtender, the largest provider of data warehouse automation software for the Microsoft SQL Server. He can be reached at hki@timextender.com.   Joe Oates is a Senior Data Warehouse Architect at TimeXtender. He provides architecture and consulting services to clients, as well as being involved in new product development.

Top