The Data Maze October 2010

In today’s “information hungry” society, most mission-critical applications and data warehouses contain large volumes of historical data that is rarely accessed and is growing exponentially. With the increased visibility and viability of analytics in the enterprise, yet another multiplier is being added to a company’s ever-increasing data storage budget. These vast stores of historical data need to be retained on some storage media, but often do not need to reside in the online portion of a data warehouse, where excessive data growth degrades application performance and availability, slows system response time, and consumes expensive processing power. Also, the larger the database, the more time it takes to load, unload, search, reorganize, index and optimize, which results in longer daily processing times and declining service levels. Excessive data growth and storage directly affect data consumers, their productivity and, most importantly, a company’s bottom line.

One answer to this problem is the effective use of data archiving techniques. Data archiving can be defined as a structured, methodical process for moving inactive or infrequently accessed data to lower-cost storage media while still providing access to that data. Many business drivers, such as regulatory compliance rules, merger and acquisition data consolidation, the ability to reduce operational expense and the ability to respond quickly to legal or government discovery requests, have made data archiving solutions a necessary infrastructure component for all enterprises.

High Level Requirements

Before embarking on an archival project, whether choosing a vendor solution or a custom approach, make sure that certain high level requirements addressing accessibility, scalability, availability, flexibility and reliability are agreed upon with the user community.

Accessibility

  • Archived data must be readily accessible and selectively restorable when needed.
  • User expectations will need to be set so that users realize it will take longer to retrieve archived data than online data.
  • Active and archive data’s metadata, such as table schemas, must be extracted and maintained together. For example, if you need to restore data from several years ago, you need to know what the table structure looked like at that time so that the data values can be attached to the proper column. 
  • For retrieval, the archive solution must be able to retrieve multiple years’ worth of data for initial loads, ad hoc queries or for research and reporting.
  • The archive solution must be able to find and select specific portions of archived data for the purpose of satisfying audits or legal needs.
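The point about keeping table schemas together with the archived values can be illustrated with a small sketch. It assumes a hypothetical `ArchiveBatch` record that stores the column names in force at archive time next to the rows, and a hypothetical `restore_rows` helper that maps old values onto the current columns by name. The table and column names are invented for illustration and do not come from any particular archiving product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveBatch:
    """Archived rows stored together with the schema in force at archive time."""
    table: str
    archived_on: str   # ISO date string, e.g. "2004-06-30"
    columns: tuple     # column names as they existed when the data was archived
    rows: tuple        # each row is a tuple of values in `columns` order


def restore_rows(batch, current_columns):
    """Map archived values onto today's columns by name (hypothetical helper).

    Columns added since archiving come back as None. Columns dropped since
    archiving are returned separately so that nothing is silently lost.
    """
    restored, dropped = [], []
    for row in batch.rows:
        by_name = dict(zip(batch.columns, row))
        restored.append({col: by_name.get(col) for col in current_columns})
        dropped.append({c: v for c, v in by_name.items() if c not in current_columns})
    return restored, dropped


# Illustrative data: the table had a cust_no column in 2004 that no longer
# exists, and has since gained a currency column.
batch = ArchiveBatch(
    table="orders",
    archived_on="2004-06-30",
    columns=("order_id", "cust_no", "amount"),
    rows=((1001, "C-17", 250.0),),
)
restored, dropped = restore_rows(batch, ["order_id", "amount", "currency"])
```

Without the stored `columns` tuple, the restore step would have no way to tell which archived value belongs to which column once the live table has changed shape.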

Scalability

  • With the current data explosion and the planning for additional sources of data being stored in an enterprise information environment, the archive solution must be able to scale to meet these demanding storage and retrieval requirements. For example, if we calculate volume metrics for two years’ worth of data and use them to estimate the size of twenty years’ worth of data, the solution will need to be able to manage that volume.
  • Ensure comprehensive application coverage: both operational and informational applications need to be addressed.
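The two-year to twenty-year estimate mentioned above can be sketched as a simple projection. The function name, the 500 GB figure and the 10% growth rate are assumptions made for illustration only. The sketch treats the observed average yearly intake as the first year's intake and optionally grows it year over year, which is a deliberate simplification.

```python
def project_volume(observed_gb, observed_years, horizon_years, annual_growth=0.0):
    """Estimate total stored volume (GB) accumulated over `horizon_years`.

    observed_gb    -- volume accumulated during the observation period
    observed_years -- length of the observation period in years
    annual_growth  -- assumed year-over-year growth of yearly intake
                      (0.0 gives a straight-line extrapolation)
    """
    yearly_intake = observed_gb / observed_years
    total = 0.0
    for year in range(horizon_years):
        total += yearly_intake * (1 + annual_growth) ** year
    return total


# Hypothetical figures: 500 GB observed over two years, projected to twenty.
linear = project_volume(500, 2, 20)                         # flat intake
compound = project_volume(500, 2, 20, annual_growth=0.10)   # 10% yearly growth
```

Even a modest growth assumption roughly triples the straight-line figure over twenty years, which is why the growth rate deserves as much scrutiny as the measured volume.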

Availability

  • The archive solution implementation and resulting regularly scheduled execution should not interfere with or degrade existing processing windows, existing service level agreements or normal usage requirements.
  • Backup and recovery of archive data should be similar to that of active data, as requirements dictate.

Flexibility

  • Due to new or revised legislation and marketplace dynamics, the archive solution must have a flexible architecture and design so that the rules used to store and retrieve archived data can change with those dynamics.
  • The archive solution must be able to delete/sanitize records that have met their legal retention requirements. This includes shorter term sanitization/deletion based on systems’ records management requirements and, in the long term, deletion of data that has exceeded the determined history window (e.g., 20 years).
  • The ability to restore data back into the archive must be available. This would be needed for record sanitization or data quality purposes.
  • The solution must be able to support a multi-tiered data storage model.
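One way to keep retention rules flexible is to hold them as data rather than code, so they can change when legislation does. The sketch below assumes a hypothetical rule table keyed by record type and uses the 20-year history window from the example above as an upper bound. The record types and retention periods are invented for illustration.

```python
from datetime import date

HISTORY_WINDOW_YEARS = 20                        # example value from the text
RETENTION_YEARS = {"invoice": 7, "web_log": 1}   # hypothetical rule table


def years_between(start, end):
    """Whole years elapsed from `start` to `end`."""
    years = end.year - start.year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


def is_purgeable(record_type, record_date, today,
                 rules=RETENTION_YEARS, window=HISTORY_WINDOW_YEARS):
    """Return True once a record has met its retention period.

    Record types without an explicit rule fall back to the history window,
    and no rule can keep data longer than the window.
    """
    limit = min(rules.get(record_type, window), window)
    return years_between(record_date, today) >= limit


today = date(2010, 10, 1)
invoice_due = is_purgeable("invoice", date(2003, 10, 1), today)    # exactly 7 years old
invoice_kept = is_purgeable("invoice", date(2003, 10, 2), today)   # one day short of 7 years
unknown_due = is_purgeable("misc", date(1990, 1, 1), today)        # falls back to the window
```

Because the rules are passed in as a parameter, a revised retention period only requires a new rule table, not a change to the purge logic.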

Reliability

  • Data movement from active to archive state should not introduce any data integrity or data loss issues.
  • The restoration of archived data should not overlay any existing active data or any other production data. 
  • Archived data must be restorable with its original referential integrity intact.
  • The incorporation of a data archive strategy into the current processing should not introduce any system degradation issues.
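The first two reliability points can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions, not a production design: the `orders` and `orders_archive` tables, their columns and the helper names are all hypothetical. The move is verified with a row count and checksum before anything is deleted from the active table, and a restore refuses to overlay a row that is already active.

```python
import hashlib
import sqlite3

COLS = "id, placed_on, amount"  # hypothetical columns shared by both tables


def fingerprint(conn, table, where, params):
    """Return (row_count, sha256) over the selected rows in primary-key order.

    `table` and `where` are fixed strings supplied by this module only.
    """
    digest, count = hashlib.sha256(), 0
    sql = f"SELECT {COLS} FROM {table} WHERE {where} ORDER BY id"
    for row in conn.execute(sql, params):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def archive_before(conn, cutoff):
    """Move orders placed before `cutoff` into the archive, all or nothing."""
    with conn:  # commits on success, rolls back if an exception is raised
        expected = fingerprint(conn, "orders", "placed_on < ?", (cutoff,))
        conn.execute(
            f"INSERT INTO orders_archive SELECT {COLS} FROM orders "
            "WHERE placed_on < ?", (cutoff,))
        actual = fingerprint(
            conn, "orders_archive",
            "id IN (SELECT id FROM orders WHERE placed_on < ?)", (cutoff,))
        if actual != expected:
            raise RuntimeError("archive copy does not match source")
        conn.execute("DELETE FROM orders WHERE placed_on < ?", (cutoff,))
    return expected[0]


def restore_order(conn, order_id):
    """Copy one archived order back to the active table without overlaying it."""
    with conn:
        active = conn.execute(
            "SELECT 1 FROM orders WHERE id = ?", (order_id,)).fetchone()
        archived = conn.execute(
            "SELECT 1 FROM orders_archive WHERE id = ?", (order_id,)).fetchone()
        if active or not archived:
            return False  # never overwrite active data; nothing to restore
        conn.execute(
            f"INSERT INTO orders SELECT {COLS} FROM orders_archive WHERE id = ?",
            (order_id,))
    return True


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, placed_on TEXT, amount REAL);
CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, placed_on TEXT, amount REAL);
INSERT INTO orders VALUES (1, '2004-03-01', 10.0);
INSERT INTO orders VALUES (2, '2005-07-15', 20.0);
INSERT INTO orders VALUES (3, '2010-09-30', 30.0);
""")
moved = archive_before(conn, "2008-01-01")
first_restore = restore_order(conn, 1)
second_restore = restore_order(conn, 1)   # order 1 is active again by now
```

Doing the copy, the verification and the delete inside one transaction means a mismatch leaves both tables exactly as they were. Restoring related rows across several tables so that referential integrity is preserved would need the same treatment applied per relationship, which this sketch does not attempt.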

If the selected solution meets all of these requirements and, in addition, you define detailed data retention requirements, consider the impact on all systems, define a low-impact archiving strategy, keep all data, business rules and relationships intact, define and implement sound data security requirements and, finally, empower your business users, you will have a very good shot at project success.

Finally, the benefits of any data archiving project include the following:

Increased functionality in a constrained resource environment. The need for current and additional enterprise data will explode as analytic capabilities are recognized and proven across the enterprise. This will lead to additional funding for additional sources of data to make it accessible to the enterprise to enhance and grow its capabilities.

Efficient storage of current data to prepare for future data growth. An archiving solution will allow the enterprise to target its data for the most cost effective storage, resulting in increased utilization and performance for its production systems.


Explicit data privacy, legal and regulatory compliance. Newly enacted legislation requiring companies to provide on-demand documentation is causing companies to initiate or revisit their archive strategies. Data stored by an enterprise must be categorized and retrieved based on rules and laws that are continually changing. A flexible archiving solution will allow these requirements to be met without storing all of the data online on expensive storage media.

Reduced operational costs. A seamless, real-time access path to all historical data is critical in this competitive era, but the cost of storing the data and providing that access must not outweigh its value. A robust archiving solution will allow you access to all of a client’s data without “breaking” the budget.

Vendor solutions for data archival are improving at a rapid pace due to the demands of the marketplace, so working with a vendor partner or multiple vendors on a proof of concept is an excellent first step in the process. Take a holistic, enterprise view of the project because, over time, data archiving and its approach within an enterprise architecture will be as prevalent in the data management lifecycle as many of the foundational components we design for today.

About Dan Sutherland

Dan is an IT Architect at IBM specializing in business intelligence solutions and integrated data architectures. Over the past 20+ years, he has gained valuable experience working in multiple technical leadership roles defining requirements, architecting solutions, designing large scale relational database management systems using accepted design practices and successfully implementing systems on multiple software and hardware platforms. 
