Database Archiving for Long-Term Data Retention

Published in TDAN.com October 2006

Organizations are generating and keeping more data now than at any time in history. This is so for many reasons. First of all, the amount of data in general is growing. According to
industry analysts, enterprise databases are growing at the rate of 125% annually. Even more interesting is that as much as 80% of the information in those databases is not actively used (in other
words, it is ready for archiving).

But why are we producing so much data? True, technology advances have improved our ability to capture and store data. But technology alone is not sufficient to account for the current rate of
data growth.

Data may need to be retained for both internal and external reasons. Internal reasons are driven by company needs. If an organization requires the data to conduct business and make money,
then that data will be retained. Today’s modern organizations are storing more data for longer periods of time for many internal reasons. Typically, data is stored longer than it used to be to
enable analytical processes to be conducted on the data. Data warehousing, data mining, OLAP, and similar technologies have delivered more and better techniques for extracting information out of
data. So businesses are inclined to keep the data around for longer periods of time.

But external reasons, typically driven by the mandate to comply with legal and governmental regulations, are another significant factor driving the need to store more data.


Legal Requirements to Archive

The corporate accounting scandals of the past few years have caused an onslaught of new laws to be written. These laws place regulations on how businesses are to treat their sensitive,
business-critical data. Additionally, older laws that have been on the books are being enforced more rigorously than in the past. Basically, government regulations are being adopted to ensure that
corporations are “doing the right thing” with their data. One of the things that is being mandated by these regulations is longer data retention periods.

Indeed, the number one driver of data management initiatives is likely to be government regulations. The growing number of regulations and the need for organizations to be in compliance is driving
data retention. Regulations such as the Sarbanes-Oxley Act, HIPAA, and Basel II are some of the laws governing how long data must be retained. Moreover, industry analysts have estimated that there
are over 150 federal and state laws that dictate how long data must be retained.

Many of these laws greatly expand the duration over which data must be retained. Until recently most organizations dealt with mandatory retention periods of only a few years for important business
data. And this data was kept around longer because of business reasons, not legal requirements. But the situation has changed due to the bevy of new regulations at the federal, state, and local
levels. Depending on the industry, what were once five- or seven-year retention periods are now expanding to 20, 30, or even 70 years. Today, retention periods are determined almost exclusively by
government regulations and not by business needs.

To comply with these laws corporations must re-evaluate their established methods and policies for managing and retaining data. What worked in the past to retain data for a few years will no longer
be sufficient over a much longer period.

Perhaps the most significant piece of legislation impacting data governance is the Sarbanes-Oxley Act. Section 802 of this act defines penalties for altering or deleting important business data and
documents. Additionally, this legislation supports the records preservation rule defined under the Securities Exchange Act of 1934 (Rule 240.17a-4). This means that electronic storage media must
preserve the records in a non-rewritable, non-erasable format. Clearly, Sarbanes-Oxley requires organizations to implement a robust data retention solution. But, of course, Sarbanes-Oxley is not
the only legislation driving data retention requirements.

According to research conducted by Enterprise Strategy Group – in its report titled “Digital Archiving: End-User Survey & Market Forecast 2006-2010” – digital archive capacity will increase
roughly tenfold between 2005 and 2010. Total worldwide digital archive capacity in the commercial and government sectors will grow from about 2,500 petabytes in 2005 to more than 27,000 petabytes by
2010. And they state that the major factors driving this growth will be regulatory compliance, corporate governance, litigation support, records management, and data management initiatives.

Clearly, organizations will be retaining more data over longer periods of time. And this will create the need for new policies, procedures, methodologies, and software to support storage,
management and access of archived data.


The Lifecycle of Data

So how can we determine when data needs to be archived? In order to accurately answer that question we need to understand the different states of data as it progresses through its lifespan.

The diagram in Figure 1 delineates the various states of data over its useful life. Data is created at some point, usually by means of a transaction: a product is released, an order is processed, a
deposit is made, etc. For a period of time after creation, the data enters its first state: it is operational. That is, the data is needed to complete on-going business transactions. This is where
it serves its primary business purpose. Transactions are enacted upon data in this state.

Figure 1. The Lifecycle of Data

The operational state is followed by the reference state. This is the time during which the data is still needed for reporting and query purposes. The data might be needed to produce internal reports
or external statements, or it may simply be kept in case a customer asks for it.

Then, after some additional period of time, the data moves into an area where it is no longer needed for completing business transactions and the chance of it being needed for querying and
reporting is small to none. However, the data still needs to be saved for regulatory compliance and other legal purposes, particularly if it pertains to a financial transaction. This is the archive
state. It is the requirements for data in this state which this white paper addresses.

Finally, after a designated period of time in the archive, the data is no longer needed at all and it can be discarded. This actually should be emphasized much more strongly: the data must be discarded.
In most cases the only reason older data is being kept at all is to comply with regulations, many of which help to enable lawsuits. When there is no legal requirement to maintain such data, it is
only right and proper for organizations to demand that it be destroyed – why enable anyone to sue you if it is not a legal requirement to do so?
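
To make the four states concrete, here is a minimal sketch that classifies a piece of data by its age. It assumes the state boundaries can be expressed as a number of years since creation. The threshold values and the names DataState, years_between, and data_state are hypothetical and chosen only for illustration; real boundaries come from business needs and the regulations that apply to the data.

```python
from datetime import date
from enum import Enum


class DataState(Enum):
    OPERATIONAL = "operational"
    REFERENCE = "reference"
    ARCHIVE = "archive"
    DISCARD = "discard"


# Hypothetical policy: age in whole years at which each state ends.
# These numbers are illustrative assumptions, not regulatory guidance.
OPERATIONAL_YEARS = 1
REFERENCE_YEARS = 3
RETENTION_YEARS = 7


def years_between(start: date, end: date) -> int:
    """Return the number of whole years elapsed from start to end."""
    years = end.year - start.year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1  # the anniversary has not been reached yet this year
    return years


def data_state(created: date, today: date) -> DataState:
    """Classify data by age under the hypothetical policy above."""
    age = years_between(created, today)
    if age < OPERATIONAL_YEARS:
        return DataState.OPERATIONAL
    if age < REFERENCE_YEARS:
        return DataState.REFERENCE
    if age < RETENTION_YEARS:
        return DataState.ARCHIVE
    return DataState.DISCARD
```

Under these assumed thresholds, an order created five years ago falls in the archive state, and one created seven or more years ago is due to be discarded.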

Don’t think in terms of databases or technologies that you already know when considering these data states. The data could be in three separate databases, a single database, or any combination
thereof. Furthermore, don’t think about data warehousing in this context – here we are talking about the single, official store of data – and its production lifecycle.

From here on out we will use the terms introduced here for the various states of data throughout its lifecycle, with the emphasis being on archiving database data and the issues arising from doing
so.


What is Database Archiving?

Database Archiving is part of a larger topic, namely Data Archiving. Data exists in many formats and for many purposes, and only a small percentage of it is actually in a database. Physical
documents, electronic documents, computer files and data sets, e-mail, and multimedia files are all examples of data that may reasonably need to be archived at some point. Refer to Figure 2. Each
of these “things” needs to be archived to fulfill regulatory, legal, and business requirements.

But each type of data has different archival processing requirements due to its form and nature. What works to archive e-mail is not sufficient for archiving database data, and so on. In other
words, each type of data may need its own technology. This is most certainly true for database data. Why?

Well, data stored in a database is different from other types of data in many ways. The main advantage of using a DBMS is to impose a logical, structured organization on the data. A DBMS provides a
layer of independence between the data and the applications that use the data. In other words, applications are insulated from how data is structured and stored. The interface to the data is
through the DBMS data language, whether it is SQL for relational databases, DL/1 for IMS, or even XQuery for XML databases. So the archival of data from a database requires knowledge of, and
operation in conjunction with, the mechanisms and interfaces of the DBMS.

Figure 2. All Types of Data Need to be Archived

OK, if we now accept that database archiving is a subset of data archiving, let’s define exactly what we mean by the term. Database Archiving is the process of removing
selected data records from operational databases that are not expected to be referenced again and storing them in an archive data store where they can be retrieved if needed.

Let’s examine each of the major components of that last sentence. We say removing because the data is deleted from the operational database when it is moved to the data
archive. Recall our earlier discussion of the data lifecycle. When data moves into the archive state, query and access are no longer expected to be required.

Next, we say selected records. This is important because we do not want to archive database data at the file level. We need only those specific pieces of data that are no
longer needed for operational and reference purposes by the business. This means that the archive needs to be able to selectively choose particular pieces of related data for archival… not the
whole database, not an entire table or segment, and not even a specific row. Instead, all of the data that represents a business object is archived at the same time. For example, if we choose to
archive order data, we would also want to archive the specifics about each item on that order. This data likely spans multiple constructs within the database (tables for DB2 or Oracle; segments
and/or databases for IMS).

The next interesting piece of the definition is this: and storing them (the data) in an archive data store. This implies that the data is stored separately from the
operational database and does not require either the DBMS or the operational applications any longer. Archived data is separate and independent from the production systems from which it was moved.
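
As a rough illustration of these first three components, the sketch below archives order data from a small SQLite database. Everything in it is an assumption made for the example: the orders and order_items tables, the one-JSON-file-per-order layout, and the function names are hypothetical, and a production archive would also need to record metadata describing the data and to meet the non-rewritable storage requirements discussed earlier. The point is only to show a whole business object (an order plus its items) being selected, written to a store that does not need the DBMS, and then removed from the operational database.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def build_demo_db() -> sqlite3.Connection:
    """Create a hypothetical operational database with two orders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, order_date TEXT);
        CREATE TABLE order_items (order_id INTEGER, line_no INTEGER, product TEXT, quantity INTEGER);
        INSERT INTO orders VALUES (1, 'Acme', '1998-03-14'), (2, 'Globex', '2006-09-30');
        INSERT INTO order_items VALUES (1, 1, 'widget', 10), (1, 2, 'gasket', 4), (2, 1, 'sprocket', 7);
    """)
    return conn


def archive_orders(conn: sqlite3.Connection, cutoff: str, archive_dir: Path) -> int:
    """Move each order dated before cutoff (ISO date), with its items, to a JSON file."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    conn.row_factory = sqlite3.Row  # lets rows be converted to dictionaries
    orders = conn.execute(
        "SELECT * FROM orders WHERE order_date < ?", (cutoff,)
    ).fetchall()
    for order in orders:
        order_id = order["order_id"]
        items = conn.execute(
            "SELECT * FROM order_items WHERE order_id = ? ORDER BY line_no", (order_id,)
        ).fetchall()
        # Selected records: the whole business object, not a table or a single row.
        document = {"order": dict(order), "items": [dict(item) for item in items]}
        # Archive data store: a plain file that needs neither the DBMS nor the application.
        (archive_dir / f"order_{order_id}.json").write_text(json.dumps(document, indent=2))
        # Removing: delete from the operational database only after the copy is written.
        with conn:
            conn.execute("DELETE FROM order_items WHERE order_id = ?", (order_id,))
            conn.execute("DELETE FROM orders WHERE order_id = ?", (order_id,))
    return len(orders)


def run_demo():
    """Archive orders older than 2000 and report what moved and what stayed."""
    conn = build_demo_db()
    with tempfile.TemporaryDirectory() as tmp:
        archived = archive_orders(conn, "2000-01-01", Path(tmp))
        files = sorted(p.name for p in Path(tmp).iterdir())
        document = json.loads((Path(tmp) / "order_1.json").read_text())
    remaining = [row[0] for row in conn.execute("SELECT order_id FROM orders")]
    return archived, files, len(document["items"]), remaining
```

Calling run_demo() archives the 1998 order together with both of its line items and leaves the 2006 order in the operational database.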

The final component of the definition that warrants clarification is… where they can be retrieved if needed. The whole purpose of archiving is to maintain the data in
case it is required for some purpose. The purpose may be external, in the form of a lawsuit or to support a governmental regulation; or the purpose may be internal, in the form of a new business
practice or requirement. At any rate, the data needs to be readily accessible in a reasonable timeframe without requiring a lot of manual manipulation. I mean, let’s face it, anyone can archive
data if they don’t have to worry about how to query it later, right?
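
To show what retrieval without the original DBMS might look like, here is a minimal sketch that searches an archive made of one JSON document per order. The file layout, the sample documents, and the function names are all hypothetical assumptions for the example, not a description of any particular archiving product.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator

# Hypothetical archived business objects, one document per order.
SAMPLE_DOCUMENTS = [
    {"order": {"order_id": 1, "customer": "Acme", "order_date": "1998-03-14"},
     "items": [{"line_no": 1, "product": "widget", "quantity": 10}]},
    {"order": {"order_id": 2, "customer": "Globex", "order_date": "1999-07-02"},
     "items": [{"line_no": 1, "product": "sprocket", "quantity": 7}]},
    {"order": {"order_id": 3, "customer": "Acme", "order_date": "1999-11-20"},
     "items": [{"line_no": 1, "product": "gasket", "quantity": 4}]},
]


def write_sample_archive(archive_dir: Path) -> None:
    """Write the sample documents as order_<id>.json files."""
    for document in SAMPLE_DOCUMENTS:
        path = archive_dir / f"order_{document['order']['order_id']}.json"
        path.write_text(json.dumps(document))


def find_archived_orders(archive_dir: Path, customer: str) -> Iterator[dict]:
    """Yield archived order documents for one customer by reading the files directly."""
    for path in sorted(archive_dir.glob("order_*.json")):
        document = json.loads(path.read_text())
        if document["order"]["customer"] == customer:
            yield document


def run_demo() -> list:
    """Build a throwaway archive and return the order ids found for Acme."""
    with tempfile.TemporaryDirectory() as tmp:
        archive_dir = Path(tmp)
        write_sample_archive(archive_dir)
        return [doc["order"]["order_id"] for doc in find_archived_orders(archive_dir, "Acme")]
```

Scanning every file is acceptable for a handful of documents, but an archive holding decades of data would need some form of index to answer such a request in a reasonable timeframe.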

So, what do you think? Does your organization have the technology and resources at your disposal to archive your database data in accordance with legal requirements?

About Craig Mullins

Craig S. Mullins is a data management strategist and principal consultant for Mullins Consulting, Inc. He has three decades of experience in the field of database management, including working with DB2 for z/OS since Version 1. Craig is also an IBM Information Champion and is the author of two books: DB2 Developer’s Guide and Database Administration: The Complete Guide to Practices and Procedures. You can contact Craig via his website.