Data archiving is an important aspect of data governance and data management. Not only does archiving help to reduce hardware and storage costs, but it is also an important aspect of long-term data retention and a key participant in regulatory compliance efforts.
When long-term data retention is imposed on your data—anything more than a couple of years—then archiving it can be the most optimal solution. First of all, archiving data can improve the efficiency of your operational platform because there is less data to process and manage. Second, a large amount of data that must be stored for a long period of time can challenge the limits of your operational platform. And finally, an archive is designed to control authenticity of data and the removal of data when its retention period passes.
Ensuring Proper Data Recovery
If we are archiving our data into a separate archive data store, we will need to integrate our archive policies with our backup schedule. This is mandatory to ensure that data recoveries, which are unscheduled by definition, can be performed properly and effectively.
Data archiving is the process of removing selected data records from your operational system when that data is not expected to be referenced again. Plus, the data is stored in an archive data store, where it can be retrieved as needed. It is very important to keep in mind that when data is archived from the operational platform, whether that is a database system, an email platform, or any other data storage mechanism, that data is no longer stored in its operational source, only in the archive.
So far, so good. But now, let’s think about some potential recovery scenarios. Let’s say that during a weekly batch cycle, jobs were run that erroneously modified data in the operational database. Also, that same day, a database archive policy was run that caused data to be removed from the operational database. If we are not careful, our recovery could reverse not only the changes that were made in error, but also the data that was deleted and moved to the archive data store. In such a case, the data would then exist in both the operational database and the archive data store. The next time we execute the archive policy for that database, we would archive the data again, which would result in duplicate data in the archive.
To avoid this type of situation, DBAs must have knowledge of when database archive policies are run. The DBA would need to either ensure that the recovery does not apply the data deleted from the archive, or alternately, discard the data from the archive data store immediately after the recovery. The former solution is probably better than the latter.
Different Organizations Require Different Approaches
Of course, much depends on the complexity of the database recovery. Consider the following situation: Process 1 runs and makes erroneous changes, the Archive Policy runs, and then Process 2 runs, also making erroneous changes. If you recover the database back to the point in time before Process 1, you will include the archived data in your recovery. If you have a log-based recovery scenario that can apply application recovery by reversing the impact of a process, then you could run it for both processes and avoid reapplying the archived data.
There are innumerable recovery scenarios a DBA could be faced with, so it is impractical (if not impossible) to come up with a step-by-step guide to database backup and recovery when archiving is involved. Suffice it to say, you will need to integrate knowledge of and policies for recovering your databases in conjunction with your database archives.
Furthermore, although the examples given were focused on database systems, the same type of issues can arise for any other operational platform where data is both archived and backed up for possible recovery scenarios.