Thursday, September 04, 2014

The Importance of SLAs and RTOs

Assuring optimal performance is one of the most frequently occurring tasks for DB2 DBAs. Being able to assess the effectiveness and performance of various and sundry aspects of your DB2 systems and applications is one of the most important things that a DBA must be able to do. This can include online transaction response time evaluation, sizing of the batch window and determining whether it is sufficient for the workload, end-to-end response time management of distributed workload, and so on. 

But in order to accurately gauge the effectiveness of your current environment and setup, Service Level Agreements, or SLAs, are needed. SLAs are derived out of the practice of Service-level management (SLM), which is the “disciplined, proactive methodology and procedures used to ensure that adequate levels of service are delivered to all IT users in accordance with business priorities and at acceptable cost.”

In order to effectively manage service levels, a business must prioritize its applications and identify the amount of time, effort, and capital that can be expended to deliver service for those applications.

A service level is a measure of operational behavior. SLM ensures that applications behave accordingly by applying resources to those applications based on their importance to the organization. Depending on the needs of the organization, SLM can focus on availability, performance, or both. In terms of availability, the service level might be defined as “99.95 percent uptime from 9:00 a.m. to 10:00 p.m. on weekdays.” Of course, a service level can be more specific, stating that “average response time for transactions will be 2 seconds or less for workloads of 500 or fewer users.”

For an SLA to be successful, all parties involved must agree on stated objectives for availability and performance. The end users must be satisfied with the performance of their applications, and the DBAs and technicians must be content with their ability to manage the system to the objectives. Compromise is essential to reach a useful SLA.
In practice, though, many organizations do not institutionalize SLM. When new applications are delivered, there may be vague requirements and promises of subsecond response time, but the prioritization and budgeting required to assure such service levels are rarely tackled (unless, perhaps, if the IT function is outsourced). It never ceases to amaze me how often SLAs simply do not exist. I always ask for them whenever I am asked to help track down performance issues or to assess the performance of a DB2 environment.

Let's face it, if you do not have an established agreement for how something should perform, and what the organization is willing to pay to achieve that performance, then how can you know whether or not things are operating efficiently enough? The simple answer is: you cannot.

It may be possible for a system assessment to offer up general advice on areas where performance gains can be achieved. But in such cases -- where SLAs are non-existent -- it you cannot really deliver guidance on whether the effort to remediate the "problem areas" is worthwhile. Without the SLAs in place you simply do not know if current levels of performance are meeting agreed upon service levels, because there are no agreed-upon service levels (and, no, "subsecond respond time" is NOT a service level! Additionally, you cannot know what level of spend is appropriate for any additional effort needed to achieve the potential performance, because no budget has been agreed upon.

Another potential problem is the context of the service being discussed. Most IT professionals view service levels on an element-by-element basis. In other words, the DBA views performance based on the DBMS, the SysAdmin views performance based on the operating system or the transaction processing system, and so on. SLM properly views service for an entire application. However, it can be difficult to assign responsibility within the typical IT structure. IT usually operates as a group of silos that do not work together very well. Frequently, the application teams operate independently from the DBAs, who operate independently from the SAs, and so on.

To achieve end-to-end SLM, these silos need to be broken down. The various departments within the IT infrastructure need to communicate effectively and cooperate with one another. Failing this, end-to-end SLM will be difficult to implement.

The bottom line is that the development of SLAs for your batch windows, your transactions and business processes is a best practice that should be implemented at every DB2 shop (indeed, you can remove DB2 from that last sentence and it is still true).

Without SLAs, how will the DBA and the end users know whether an application is performing adequately? Not every application can, or needs to, deliver subsecond response time. Without an SLA, business users and DBAs may have different expectations, resulting in unsatisfied business executives and frustrated DBAs—not a good situation.
With SLAs in place, DBAs can adjust resources by applying them to the most-mission-critical applications as defined in the SLA. Costs will be controlled and capital will be expended on the portions of the business that are most important to the business. Without SLAs in place, an acceptable performance environment will be ever elusive. Think about it; without an SLA in place, if the end user calls up and complains to the DBA about poor performance, there is no way to measure the veracity of the claim or to gauge the possibility of improvement within the allotted budget.

Recovery Time Objectives (RTOs)

Additionally, the effectiveness of backup and recovery should be a concern to all DB2 DBAs. This requires that RTOs (Recovery Time Objectives) be established. An RTO is basically an SLA for the recovery of your database objects. Without RTOs, it is difficult (if not impossible) to gauge the state of recoverability and the efficacy of image copies being taken. 

Each database object should have an RTO assigned to it. The RTO needs to take into account the same type of things that an SLA considers. In other words, the business must prioritize its applications, DBAs must map database objects to the applications, and together they must identify the amount of time, effort, and capital that can be expended to assure the minimization of downtime for those applications.

Again, we are measuring operational behavior. The RTO ensures that, when problems occur requiring database recovery, the application outage is limited to what has been defined as tolerable for the business (in terms of uptime and cost to provide that uptime).
Again, as with an SLA, for the RTO to be successful, all parties involved must agree on stated objectives for downtime and time to recovery. The end users must be satisfied with the potential duration of their application’s downtime, and the DBAs and technicians must be content with their ability to recover the system to the objectives. And again, cost is a contributing factor. The RTO cannot simply be I need my application up in 5 minutes and I can’t spend any more money to do that, because that is not reasonable (or possible).

Without written RTOs, DBAs can provide due diligence to make sure that database objects are backed up and recoverable, but cannot really provide any guarantee in terms of how quickly the data can be recovered (or perhaps, to what point in time) when an outage occurs. Of course, the DBA can create and review backup policies and procedures to encourage a recoverable environment. But there won't be any way to ensure with any consistency that the backup plan can deliver the time-to-recovery needed by the business.

So why don't organizations create SLAs and RTOs as a regular course of business? 

And if your organization does create SLAs and RTOs, please share with us how doing so became a standard at your shop...


No comments: