Friday, September 05, 2008

The Most Important Thing is Recoverability

I know that many readers will question the title of this blog posting. But it is true. Oh, many DBAs think that managing performance is the most important thing they do, but they are confusing frequency with importance. Yes, many are managing performance more often than building backup plans, and they had better be managing performance more frequently than they are actually recovering their databases, or their company has big problems!

Anyway, why do I place recoverability at the very top of the DBA task list? Well, if you cannot recover your databases after a problem then it won’t matter how fast you can access them, will it? Anybody can deliver fast access to the wrong information. It is the job of the DBA to keep the information in our company’s databases accurate, secure, and accessible.

So what do we need to do to assure the integrity of our database data? First, we need to understand the availability needs of our data in terms of the business. In the event of a failure, how rapidly must we be able to recover from that failure? Keep in mind that the failure could be either physical, such as a failed disk drive, or logical, such as applying the wrong input to a process, which then corrupts the database.

Only after we know the impact to the business can we develop an appropriate backup and recovery plan. We need service level agreements (SLAs) for recovery just like we have SLAs for performance. The recovery SLA, or Recovery Time Objective (RTO), needs to be stated from an application perspective, such as “Time to restore application availability after a failure for application X cannot exceed 2 hours (or 10 minutes or …).”

To create effective RTOs you must be able to answer the question “What is the cost of not having this data available?” When we know the expectations of the business we can work to create a backup and recovery plan that matches the requirements. There are multiple techniques and methods for backing up and recovering databases. Some techniques, while more costly, can enhance availability by recovering data more rapidly.

It is imperative that the DBA team creates an appropriate recovery strategy for each database object. This requires mapping database objects to applications so we can adopt the proper strategy in accordance with RTOs. Some database objects will participate in multiple applications, and their recovery strategy will therefore be more complex.
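The mapping described above can be sketched in a few lines of Python. This is only an illustration: the application names, object names, and RTO values are hypothetical, and it assumes that an object shared by several applications inherits the strictest (shortest) RTO among them.

```python
from datetime import timedelta

# Hypothetical application RTOs; real values come from SLAs agreed with the business.
APP_RTO = {
    "order_entry": timedelta(minutes=10),
    "billing": timedelta(hours=2),
    "reporting": timedelta(hours=24),
}

# Hypothetical mapping of database objects to the applications that use them.
OBJECT_APPS = {
    "CUSTOMER": ["order_entry", "billing", "reporting"],
    "INVOICE": ["billing", "reporting"],
    "SALES_SUMMARY": ["reporting"],
}


def effective_rto(obj):
    """Return the strictest (shortest) RTO among the applications using obj."""
    return min(APP_RTO[app] for app in OBJECT_APPS[obj])


def recovery_priority():
    """List objects from tightest to loosest effective RTO."""
    return sorted(OBJECT_APPS, key=effective_rto)


if __name__ == "__main__":
    for obj in recovery_priority():
        print(obj, effective_rto(obj))
```

In this sketch the shared CUSTOMER object ends up with the 10-minute objective of the most demanding application, which is why objects that participate in multiple applications tend to need the more expensive recovery strategy.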

Not all data is created equal. Some of your databases and tables contain data that is necessary for the core of your business. Other database objects contain data that is less critical or easily derived from other sources. Armed with this information -- and our RTOs -- a DBA can create a recovery plan that matches the needs of the business.

Establishing a reasonable backup schedule requires you to balance two competing demands: the need to take image copy backups frequently to assure reasonable recovery time, and the need to take image copies infrequently so as not to interrupt daily business. Keep in mind that if you make fewer image copies, you will need to apply more log records during the recovery, and the recovery will take longer. The DBA must balance these competing objectives based on SLAs, usage criteria, and the capabilities of the DBMS.
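To make the trade-off concrete, here is a rough back-of-the-envelope model in Python. It assumes, purely for illustration, that recovery time is the time to restore the image copy plus the time to apply the log written since that copy, and that both the log generation rate and the log apply rate are constant. The function names and all of the numbers are hypothetical; real recoveries involve other costs (index rebuilds, archive log retrieval, and so on) that this sketch ignores.

```python
def estimated_recovery_minutes(hours_since_copy, restore_minutes,
                               log_gb_per_hour, apply_gb_per_minute):
    """Estimate recovery time: restore the image copy, then apply the log.

    Simplified linear model (an assumption, not a DBMS formula).
    """
    log_gb = hours_since_copy * log_gb_per_hour
    return restore_minutes + log_gb / apply_gb_per_minute


def max_copy_interval_hours(rto_minutes, restore_minutes,
                            log_gb_per_hour, apply_gb_per_minute):
    """Longest gap between image copies that still meets the RTO in the worst case."""
    budget = rto_minutes - restore_minutes  # minutes left for log apply
    if budget <= 0:
        return 0.0  # the restore alone already exceeds the RTO
    return budget * apply_gb_per_minute / log_gb_per_hour


if __name__ == "__main__":
    # Hypothetical figures: 30-minute restore, 2 GB of log per hour,
    # 0.5 GB of log applied per minute.
    print(estimated_recovery_minutes(24, 30, 2.0, 0.5))
    print(max_copy_interval_hours(120, 30, 2.0, 0.5))
```

With these made-up figures, a daily image copy would leave a worst-case recovery longer than a 2-hour RTO, so either the copies must be taken more often or the restore and log apply must be made faster. Measured values from an actual recovery test should replace the guesses before anyone relies on such an estimate.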

When was the last time you re-evaluated and tested your backup and recovery plans? Oh, you may have looked at disaster plans, but have you examined your ability to recover locally? Do you know how long it would take to recover your most important primary customer tables, for example, if you took a hit in the middle of the day?

Regular recoverability health checking should be a standard, documented responsibility for every DBA team, and if you can acquire software to automate the health-check process, all the better.

SEGUS offers a nice option called Recovery AssuranceExpert for checking the recoverability of your DB2 databases. Using this automated tool, you can monitor the recoverability of your DB2 environment, including DB2 settings (such as DB2 logging, buffer pools, and DSNZPARMs), recovery prerequisites, recovery service levels, and recovery time objectives for your database objects.

When was the last time you tested recovery? Are you sure you can recover your DB2 databases within a satisfactory timeframe? Wouldn’t you sleep better if you had a methodology and process in place for doing so? I know I would…
