When I first heard the term “data archeology,” many things went through my head. Was this the process of searching for old data to preserve in a museum somewhere? Or maybe the pharaohs had been
buried with data storage devices. Or maybe this was just another term created by consultants to confuse everyone.
I asked my colleagues what it was. The answers were all interesting, but all different. Hmmm, so we consultants are finally confusing ourselves. With all of this in mind, I decided to
create a definition that can be used by everyone and will unconfuse the confused.
Data archeology is the process of identifying critical data sources (both internal and external) and their respective data for analysis and migration to a separate database and/or data warehouse.
It identifies and classifies both “good” and “bad” data. It feeds the data migration effort and helps to ensure that only “good” data is migrated to the new data repository. It is also very
useful in helping determine whether the analysts and modelers have discovered all entities and attributes.
A simplified example of the process would be:
- Determine the scope of the project.
- Create a logical model – fully attributed.
- Determine data sources of each attribute.
- Evaluate characteristics and properties of each attribute.
- Build requirements for tool selection.
- Purchase the tool.
- Test each data source for valid/invalid data (see the sketch after this list).
- Document all data attributes (metadata).
- Determine what to do with data marked “invalid.”
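To make the “test each data source” step concrete, here is a minimal sketch of what such a check might look like. The attribute names, validity rules, and the tiny stand-in extract are all hypothetical placeholders, not the output of any particular tool:

```python
import csv
import io
import re

# Hypothetical validity rules, one per attribute of a customer source file.
RULES = {
    "customer_id": lambda v: v.isdigit(),
    "state":       lambda v: re.fullmatch(r"[A-Z]{2}", v) is not None,
    "zip_code":    lambda v: re.fullmatch(r"\d{5}(-\d{4})?", v) is not None,
}

def test_source(rows):
    """Count valid and invalid values per attribute in one data source."""
    results = {field: {"valid": 0, "invalid": 0} for field in RULES}
    for row in rows:
        for field, rule in RULES.items():
            bucket = "valid" if rule(row.get(field, "")) else "invalid"
            results[field][bucket] += 1
    return results

# A tiny stand-in extract; in practice this would be the real source file.
sample = io.StringIO(
    "customer_id,state,zip_code\n"
    "1001,NY,10001\n"
    "TEST,ZZ,ABCDE\n"
)
for field, counts in test_source(csv.DictReader(sample)).items():
    print(field, counts)
```

The counts by themselves become part of the metadata you document in the next step, and they tell you how much “invalid” data you will have to deal with later.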
Obviously, you can proceed with data archeology while the logical model is being created. This will help you get a quick start on this tedious process. Just coordinate carefully with your DBA and
modelers. At this point you will need to purchase special tools and create programs, or scripts, to migrate the original data over to its new home. These tools and programs will
eliminate, modify, or identify the invalid data.
An example of bad data that would be eliminated is “test” data inserted into production files so programmers could easily test their changes. An example of data that would be
identified might be fields within the source files that ingenious users have repurposed to help them do their jobs. You may not want to eliminate this data; you may simply identify it
and create a separate attribute for it in the new database.
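As a sketch of those two outcomes, the fragment below drops obvious “test” records and carries a repurposed comment field forward as its own attribute rather than discarding it. The field names and the “TEST” marker are assumptions made purely for illustration:

```python
def migrate_row(row):
    """Decide what to do with one source record during migration.

    Returns None for records that should be eliminated, otherwise a
    cleaned dictionary ready for the new database.
    """
    # Eliminate: programmers' test records planted in the production file.
    if row.get("customer_name", "").upper().startswith("TEST"):
        return None

    # Identify: a comment field that users repurposed to hold a region code.
    # Keep it, but give it its own attribute in the target model.
    return {
        "customer_id": row["customer_id"],
        "customer_name": row["customer_name"],
        "region_code": row.get("comment", "").strip() or None,
    }

print(migrate_row({"customer_id": "1", "customer_name": "TEST ACCOUNT", "comment": ""}))  # None
print(migrate_row({"customer_id": "2", "customer_name": "ACME CORP", "comment": "NE"}))
```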
At times you may wish to modify data. Let’s say you have a customer file containing data that lacks standardization. For instance, a post office box may have been entered in numerous ways
(e.g., PO Box, P.O. Box, Box). You may wish to standardize it to a single format, say PO Box. A parsing utility would help greatly in this instance. Remember that many data warehousing
projects have failed due to problems originating from “bad” data. If success is to be assured, every database and/or data warehouse project should include this process in its work plan.
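Returning to the post office box example, a small parsing routine along these lines could handle the standardization; the exact variants it recognizes are assumptions, and a production cleansing tool would cover far more cases:

```python
import re

# Common ways a post office box might have been keyed into the source file.
PO_BOX_PATTERN = re.compile(r"^\s*(P\.?\s*O\.?\s*BOX|BOX)\s+(\w+)", re.IGNORECASE)

def standardize_address(line):
    """Rewrite assorted post office box spellings to the single form 'PO Box'."""
    match = PO_BOX_PATTERN.match(line)
    if match:
        return f"PO Box {match.group(2)}"
    return line.strip()

print(standardize_address("P.O. Box 123"))  # PO Box 123
print(standardize_address("Box 9"))         # PO Box 9
print(standardize_address("po box 42"))     # PO Box 42
```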
Now that I think about it, I suppose “data archeology” is as good a term as any to identify this very important process.