Tuesday, November 29, 2005

Cross Platform Help

As regular readers of this blog know, this space is devoted to DB2 for z/OS – that’s the mainframe for those of you just visiting. I’d like to take a moment here, though, to point out some interesting posts for cross-platform DB2 DBAs.

Chris Eaton, Senior Product Manager for DB2 Universal Database at IBM, writes a blog for ITtoolbox focusing on DB2 UDB for Linux, Unix, and Windows (LUW). If you use DB2 on that platform, be sure to read his blog regularly.

So why am I mentioning this on a mainframe DB2 blog? Well, Chris has recently posted some very nice entries outlining the similarities and differences between DB2 on the z/OS and LUW platforms. Here are the links to those postings for those of you interested in expanding your understanding. Each is interesting and worth reading:


Cheers!
Craig

Tuesday, November 22, 2005

Mainframes Rock!

It is good to see mainframes getting some positive press again. I'm talking about this November 17, 2005 article published in InfoWorld. It talks about a company that tried to get rid of its mainframe, replacing it first with Windows servers and, when that didn't work, with Unix servers. When neither worked for them, they finally gave in and moved back to the reliable environment provided by mainframe computing.

Basically, it boils down to this. There are some workloads that just are better off being served by mainframes. This is the parallel I like to draw:

If you are going to plow a field, what animal(s) would you choose to drive your plow: a nice strong, sturdy ox (mainframe) -or- 64 chickens (Unix servers) -or- 128 gerbils (Windows servers)?

Monday, November 21, 2005

To REBIND or Not to REBIND, That is the Question

There are two basic mindsets on when to REBIND your DB2 plans and packages. The first -- which I believe is the best approach -- is to REBIND regularly after running RUNSTATS. Using this approach you will ensure that your access paths have been formulated by the DB2 optimizer using the most up-to-date information available on your data. If you fail to REBIND your static SQL you are failing to give DB2 the chance to achieve the best performance it can for your applications.

(Of course, this begs the question: "How frequently should I run RUNSTATS?" to which my answer is "As frequently as possible based on how often your data changes," but enough of this aside for now.)

Of course, the DB2 optimizer is not perfect so sometimes rebinding can cause the performance of certain SQL statements to degrade. You will have to be ready to handle these problems by using optimization hints (OPTHINT in the PLAN_TABLE) to go back to a satisfactory access path or by tweaking your SQL to achieve a better performing access path (and some people also may say "...or change the catalog statistics," but that should only be a last resort and is rarely required these days).

Additionally, we have not considered the impact and need to periodically reorganize your DB2 table spaces using the REORG utility. RUNSTATS populates the DB2 Catalog with the information you need to decide when a REORG is warranted. Of course, you would want to run RUNSTATS again after a REORG to obtain the most up-to-date statistics... and only then would you want to REBIND your plans and packages.

The second approach is the "if it ain't broke, don't fix it" approach. In this scenario, you will continue to run RUNSTATS regularly but you will not REBIND your plans and packages until performance degrades. This approach is embraced by shops that do not have the manpower or time to review all access paths after a mass REBIND. By not running REBIND the thought is that performance will continue along as is until data volumes change so significantly that end users start to complain. Only then will individual plans and packages be rebound following the next scheduled RUNSTATS or immediately if the problem is large enough. This approach will degrade performance, albeit possibly subtly over time. However, it does save DBA manpower, which might be in short supply.

Examine your shop's approach to the REBIND issue to see which approach is best for you. Although philosophically I agree with the first approach, I understand that the second approach sometimes can be preferable in practice. If you follow the second approach, be sure that you have pre-agreed Service Level Agreements (SLAs) for your DB2 applications. Then, you can reasonably argue that there is no reason to REBIND anything until you are no longer meeting the SLA.

Friday, November 11, 2005

New DB2 DBA Portal from IBM

Just a quick post to alert readers to a new portal from the IBM developerWorks team. The portal is named DBA Central and it bills itself as offering resources for IBM Information Management database administrators.

So, the portal won't be 100% mainframe content, but it should have some interesting nuggets of data for mainframe DB2 DBAs.

Thursday, November 10, 2005

DB2 Compression: z/OS versus LUW

Space compression for non-mainframe DB2 is quite a bit different than it is for DB2 for z/OS. In mainframe DB2, specifying COMPRESS YES on the CREATE TABLESPACE statement will cause DB2 to implement Ziv-Lempel compression for the table space in question. Data is compressed upon entry to the database and decompressed when it is read.

For DB2 UDB on Linux/Unix/Windows, when creating a table, you can use the optional VALUE COMPRESSION clause to specify that the table is using the space saving row format at the table level and possibly at the column level. There are two ways in which tables can occupy less space when stored on disk:
  • If the column value is NULL, do not set aside the defined, fixed amount of space.
  • If the column value can be easily known or determined (like default values) and the value is available to the database manager during record formatting and column extraction, do not store the value itself.
When VALUE COMPRESSION is used, NULLs and zero-length data that has been assigned to defined variable-length data types (VARCHAR, VARGRAPHIC, LONG VARCHAR, LONG VARGRAPHIC, BLOB, CLOB, and DBCLOB) will not be stored on disk. Only overhead values associated with these data types will take up disk space.

If VALUE COMPRESSION is used then the optional COMPRESS SYSTEM DEFAULT parameter can also be specified to further reduce disk space usage. Minimal disk space is used if the inserted or updated value is equal to the system default value for the data type of the column. The default value will not be stored on disk. Data types that support COMPRESS SYSTEM DEFAULT include all numerical type columns, fixed-length character, and fixed-length graphic string data types. This means that zeros and blanks can be compressed.

The two platforms vary dramatically in how they approach "compression." The mainframe actually applies an algorithm to the data to compress it into another format. Every row that is inserted must first be compressed before storing it; every row that is read must be decompressed. On LUW platforms, DB2 compression is simply a way of avoiding the storage of certain types of data that either can be determined easily, or need not be stored.

So, it is highly probable that you will get completely different results on LUW than you do on a mainframe (OS/390, z/OS). Which one is better will depend on the type of data you are storing based on the requirements of your applications.

So, when should you consider using compression? In general, use DB2 for z/OS compression for larger tablespaces where the disk savings can be significant. For very small tables, the amount of space required to store the compression dictionary may exceed the space saved by compressing the data.

What is the compression dictionary? Well, as I mentioned earlier, DB2 for z/OS compression is enabled by specifying COMPRESS YES for the tablespace in your DDL. When compression is specified, DB2 builds a static dictionary to control compression. This will cause from 2 to 17 dictionary pages to be stored in the tablespace. These pages are stored after the header and first space map page.
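
To see why the dictionary matters for small tables, here is a minimal Python sketch of the break-even arithmetic. It rests on assumptions made purely for illustration: a full 4,096 bytes usable per page (real pages carry some overhead), 800-byte rows that compress to 560 bytes, and the worst case of 17 dictionary pages mentioned above. The function names are made up for this example.

```python
PAGE_BYTES = 4096          # simplification: ignores page header overhead
MAX_DICTIONARY_PAGES = 17  # upper bound quoted in this post


def data_pages(row_count, row_bytes):
    """Pages needed when only whole rows are stored on each page."""
    rows_per_page = PAGE_BYTES // row_bytes
    return (row_count + rows_per_page - 1) // rows_per_page  # ceiling division


def net_pages_saved(row_count, row_bytes, compressed_row_bytes,
                    dictionary_pages=MAX_DICTIONARY_PAGES):
    """Pages saved by compression after paying for the dictionary pages."""
    before = data_pages(row_count, row_bytes)
    after = data_pages(row_count, compressed_row_bytes) + dictionary_pages
    return before - after


small_table = net_pages_saved(50, 800, 560)         # negative: compression costs space
large_table = net_pages_saved(1_000_000, 800, 560)  # positive: dictionary is negligible
print(small_table, large_table)
```

Under these assumptions the 50-row table ends up larger with compression than without it, while for the million-row table the dictionary pages barely register.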

For partitioned tablespaces, DB2 will create a separate compression dictionary for each tablespace partition. Multiple dictionaries tend to cause better overall compression ratios. In addition, it is more likely that the partition-level compression dictionaries can be rebuilt more frequently than non-partitioned dictionaries. Frequent rebuilding of the compression dictionary can lead to a better overall compression ratio.

Avoid compressing table spaces with multiple tables in them because the compression ratio can be impacted by the different types of data in the multiple tables, and DB2 can only have one compression dictionary per table space.

But why compress data at all? Consider an uncompressed table with a large row size, say 800 bytes. Five of this table's rows fit on a 4K page. If the compression routine achieves 30 percent compression, on average, the 800-byte row uses only 560 bytes, because 800 * (1 - 0.3) = 560. Now, on average, seven rows fit on a 4K page. Because I/O occurs at the page level, the cost of I/O is reduced because fewer pages must be read for tablespace scans, and the data is more likely to be in the bufferpool because more rows fit on a physical page. This can be a significant I/O improvement.

Consider the following scenarios. A 10,000-row table with 800-byte rows requires 2,000 pages. Using a compression routine as outlined previously, the table would require only 1,429 pages. Another table also with 800-byte rows but now having 1 million rows would require 200,000 pages without a compression routine. Using the compression routine, you would reduce the pages to 142,858 - a reduction of more than 57,000 pages.
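
The page counts above can be reproduced with a short Python sketch. It makes the same simplifications as the example: a full 4,096 bytes usable per page (page overhead is ignored) and a flat 30 percent compression rate. The function names are hypothetical and exist only for this illustration.

```python
PAGE_BYTES = 4096  # simplification: treats the whole 4K page as usable


def compressed_row_bytes(row_bytes, percent_saved):
    """Row size after compression, using integer arithmetic."""
    return row_bytes * (100 - percent_saved) // 100


def pages_needed(row_count, row_bytes):
    """Pages needed when only whole rows are stored on each page."""
    rows_per_page = PAGE_BYTES // row_bytes
    return (row_count + rows_per_page - 1) // rows_per_page  # ceiling division


row = 800
packed = compressed_row_bytes(row, 30)  # 560 bytes

for rows in (10_000, 1_000_000):
    before = pages_needed(rows, row)
    after = pages_needed(rows, packed)
    print(f"{rows:>9,} rows: {before:,} pages -> {after:,} pages "
          f"(saves {before - after:,})")
```

Changing the row size or the compression percentage shows how sensitive the savings are to how many whole rows happen to fit on a page.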

Another question I am commonly asked is about overhead. Yes, there is going to be some overhead involved if you turn on compression... CPU is required to apply the Ziv-Lempel algorithm to compress upon insertion - and to decompress upon access. Of course, this does NOT mean that overall performance will suffer if you turn on compression. Remember the trade-off: additional CPU in exchange for possibly improved I/O efficiency. You see, when more compressed rows fit onto a single page, fewer I/O operations may be needed to satisfy your query processing needs. If you are performing a lot of sequential access (as opposed to random access) you can get improved performance because fewer I/O operations are required to access the same number of rows.

Of course, there is always the other trade-off to consider, too: disk storage savings in exchange for CPU cost of compressing and decompressing data. Keep in mind, too, that DB2 can use hardware-assisted compression if you have the right type of hardware. Hardware-assisted compression simply speeds up the compression and decompression of data -- it is not a requirement for the inherent data compression features of DB2. So, the overall cost of compression may be minimal with hardware-assisted compression. Indeed, due to I/O issues, overall elapsed time for certain I/O heavy processes may decrease when data is compressed.

You can use the DSN1COMP utility to estimate how much disk space will be saved by compressing a tablespace before deciding whether to turn compression on or not. This utility can be run on full image copy data sets, VSAM data sets that contain DB2 table spaces, or sequential data sets that contain DB2 table spaces (such as DSN1COPY output). DSN1COMP does not estimate savings for data sets that contain LOB table spaces or index spaces. Refer to the IBM Utility Guide and Reference for more information on DSN1COMP.

Of course, before you consider compression be sure to examine all of its details -- and be sure to understand all of the nuances of your particular data and applications. But don't be afraid of investigating its use... compression can be a very handy tool in the DBA's arsenal!

Friday, November 04, 2005

No Black Boxes!

I've written about this subject before, but I think it is important enough to merit another go-round. Before I go any further, let's define what I mean by a “black box.” If I plan to recommend that you prohibit them, we had better both understand what it is we are talking about proscribing.

Simply put, a black box is a database access program that sits in between your application programs and DB2. It is designed so that all application programs call the black box for data instead of writing SQL statements that are embedded into a program. The general idea behind such a contraption is that it will simplify DB2 development because programmers will not need to know how to write SQL. Instead, the programmer just calls the black box program to request whatever data is required. SQL statements become calls – and every programmer knows how to code a call, right?

This approach is commonly referred to as a “black box” approach because the data access interface shields the developers from the “complexities” of SQL. The SQL is contained in that black box and programmers do not need to know how the SQL works – just how to call the black box for data. Black boxes usually are introduced into an organization when management gets the notion that it would be quicker and easier for programmers to request data from a central routine than to teach them all SQL.

But I think there are very good reasons why this approach is not sound. Let’s examine them.

Ignorance (of SQL) is not a Virtue

The basic premise of implementing black box technology is that it is better for programmers to be ignorant of SQL. This means that your company will be creating DB2 applications using developers with little to no understanding of how SQL works. So what may seem like simple requests to a non-educated programmer may actually involve very complex and inefficient SQL “behind the scenes” running in the black box. So innocuous requests for data can perform quite poorly.

When programmers are knowledgeable about SQL they can at least understand the complexity of their data requests and formulate them to perform better. For example, SQL programmers will understand when data must be joined and thereby can form their data requests in such a way as to join efficiently (and perhaps to minimize joining in certain circumstances). With no knowledge of SQL the programmer will have no knowledge of joining – and more importantly, no true means at his or her disposal to optimize their data requests.

Industry experts agree ("kind of") that about 80 percent of database performance problems are due to inefficient application code - mostly SQL. Whereas basic SQL is simple to learn and easy to start using, SQL tuning and optimization is an art that can take years to master.

Be sure to train your application development staff in the proper usage of SQL – and let them write the SQL requests in their programs. Develop and publish SQL guidelines in a readily accessible place (such as your corporate intranet or portal). These guidelines should outline the basic elements of style for DB2 SQL programming. I won't go into what these rules are here...

Now let’s face it, even when using the "black box" technique some technicians in your organization will still have to understand SQL – namely the writer(s) of the black box code. Because all of the SQL is coded in the black box program (or programs) someone has to be capable of writing efficient and effective SQL inside of the black box program. Which brings us to our next consideration.

Shortcuts Make for Poor Performance

The SQL programmers in charge of writing the black box code will inevitably introduce problems into the mix. This is so because of simple human nature – and because of most technicians’ desire to find shortcuts. But SQL shortcuts can lead to poor performance.

The black box inevitably will deviate from the standards and procedures of good SQL development. For example, let’s assume that there are three application programs and each one of them needs to retrieve customer information by area code. Program 1 needs the customer name and address, program 2 requires customer ID, name, and phone number, and program 3 requires customer ID, name and type. This is properly coded as three different SQL requests (each one in its own program). For program 1 we would write:

SELECT FIRST_NAME, LAST_NAME, ADDRESS, CITY,
STATE, ZIP
FROM CUSTOMER
WHERE AREA_CODE = :HV-AC;

For program 2 we would write:

SELECT CUST_ID, FIRST_NAME, LAST_NAME, PHONE_NUM
FROM CUSTOMER
WHERE AREA_CODE = :HV-AC;

And for program 3 we would write:

SELECT CUST_ID, FIRST_NAME, LAST_NAME, CUST_TYPE
FROM CUSTOMER
WHERE AREA_CODE = :HV-AC;

Of course, all of these SQL statements are remarkably similar, aren’t they? If we were in charge of writing the black box for these requests we would likely think about consolidating these three SQL statements into one statement like this:

SELECT CUST_ID, FIRST_NAME, LAST_NAME, ADDRESS, CITY,
STATE, ZIP, PHONE_NUM, CUST_TYPE
FROM CUSTOMER
WHERE AREA_CODE = :HV-AC;

Of course, this revised query will work for all three of these requests. When program 1 calls the black box we execute the query and return just the customer name and address; for program 2 we return just customer ID, name, and phone number; and for program 3 the black box returns only customer ID, name and type. If we do this, we’ve just coded a shortcut in our black box.

“So what?” you may ask. "Isn't it good to cut down on the amount of code that must be written?" Well, this is bad program design because we are violating one of the cardinal SQL coding guidelines, namely: SQL statements should retrieve only those columns required; never more. This is so because additional work is required by DB2 to send additional columns to your programs. Minimizing the number of columns in your SELECT statements will improve application performance.
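
A rough Python sketch of the shortcut makes the waste visible. SQLite (from the Python standard library) stands in for DB2 only so the snippet runs anywhere, and the sample row, the black_box and direct functions, and the program names are all invented for illustration. The point is simply to count how many columns the database has to hand back in each style.

```python
import sqlite3

# Hypothetical CUSTOMER table in an in-memory SQLite database (a stand-in for DB2).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMER (CUST_ID INTEGER, FIRST_NAME TEXT, LAST_NAME TEXT, "
    "ADDRESS TEXT, CITY TEXT, STATE TEXT, ZIP TEXT, PHONE_NUM TEXT, "
    "CUST_TYPE TEXT, AREA_CODE TEXT)"
)
conn.execute(
    "INSERT INTO CUSTOMER VALUES (1, 'Ann', 'Lee', '1 Main St', 'Pittsburgh', "
    "'PA', '15201', '555-0100', 'R', '412')"
)

# Every column that any of the three programs needs.
ALL_COLUMNS = ["CUST_ID", "FIRST_NAME", "LAST_NAME", "ADDRESS", "CITY",
               "STATE", "ZIP", "PHONE_NUM", "CUST_TYPE"]
NEEDED = {
    "program1": ["FIRST_NAME", "LAST_NAME", "ADDRESS", "CITY", "STATE", "ZIP"],
    "program2": ["CUST_ID", "FIRST_NAME", "LAST_NAME", "PHONE_NUM"],
    "program3": ["CUST_ID", "FIRST_NAME", "LAST_NAME", "CUST_TYPE"],
}


def select_by_area_code(columns, area_code):
    sql = "SELECT " + ", ".join(columns) + " FROM CUSTOMER WHERE AREA_CODE = ?"
    return conn.execute(sql, (area_code,)).fetchall()


def black_box(program, area_code):
    """Shortcut: always fetch every column, then trim in host-language code."""
    rows = select_by_area_code(ALL_COLUMNS, area_code)
    keep = [ALL_COLUMNS.index(name) for name in NEEDED[program]]
    trimmed = [tuple(row[i] for i in keep) for row in rows]
    return trimmed, len(ALL_COLUMNS)


def direct(program, area_code):
    """Each program asks only for the columns it needs."""
    columns = NEEDED[program]
    return select_by_area_code(columns, area_code), len(columns)


box_rows, box_fetched = black_box("program2", "412")
own_rows, own_fetched = direct("program2", "412")
print(box_rows == own_rows, box_fetched, own_fetched)
```

Both styles hand program 2 the same data, but the black box makes the database return more than twice as many columns for every qualifying row, and then spends extra host-language code throwing the surplus away.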

By coding shortcuts such as these into the black box you are designing poor performance into your DB2 applications. And a black box will use shortcuts. The example given here is a simple one, but even more complex shortcuts are possible in which WHERE clauses are coded so that they can be bypassed with proper host variables. For example, perhaps sometimes we need to query by area code and other times by area code and customer type. Well, we could code the CUST_TYPE predicate as a range something like this:

WHERE CUST_TYPE >= :HV1 AND CUST_TYPE <= :HV2;

When we want to query for CUST_TYPE we simply provide the same value to both HV1 and HV2; when we do not want to query for CUST_TYPE we provide the lowest possible value for HV1 and the highest possible value for HV2, so that every row falls within the range. This effectively blocks out the CUST_TYPE predicate. Using tricks like this it is possible to cram a lot of different SQL statements into one – with the results usually being worse performance than if they were separate SQL statements.
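
How such a range predicate behaves for different host variable values can be checked with a small Python sketch. Again SQLite stands in for DB2 purely so the example is runnable; the table contents, the numeric CUST_TYPE values, and the cust_ids helper are assumptions made for illustration, and SQLite named parameters play the role of host variables.

```python
import sqlite3

# Hypothetical data in an in-memory SQLite database (a stand-in for DB2).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMER (CUST_ID INTEGER, AREA_CODE TEXT, CUST_TYPE INTEGER)"
)
conn.executemany(
    "INSERT INTO CUSTOMER VALUES (?, ?, ?)",
    [(1, "412", 1), (2, "412", 2), (3, "412", 3), (4, "281", 2)],
)

# One statement meant to serve both "by area code" and "by area code and type".
QUERY = """
    SELECT CUST_ID
    FROM CUSTOMER
    WHERE AREA_CODE = :hv_ac
      AND CUST_TYPE >= :hv1 AND CUST_TYPE <= :hv2
    ORDER BY CUST_ID
"""


def cust_ids(area_code, hv1, hv2):
    params = {"hv_ac": area_code, "hv1": hv1, "hv2": hv2}
    return [row[0] for row in conn.execute(QUERY, params)]


one_type = cust_ids("412", 2, 2)   # same value twice: filters on CUST_TYPE = 2
all_types = cust_ids("412", 0, 9)  # widest range: the predicate no longer filters
inverted = cust_ids("412", 1, 0)   # low bound above high bound: empty range
print(one_type, all_types, inverted)
```

The widest range returns every customer in the area code, while an inverted range returns nothing at all. Either way the database still evaluates a predicate that the request did not actually need, which is part of why the combined statement tends to perform worse than separate ones.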

Imagine the further performance difficulties that can ensue if instead of just returning extra columns, we code SQL that returns extra rows -- but pass back only what is needed. Of course, this violates an even more important basic SQL rule, namely: return only those rows required by the program, never more. Application and SQL performance suffers greatly when programs access and/or return rows that are not needed.

Extra Code Means Extra Work

Additionally, when you code a black box your application will require more lines of code to be executed than without the black box. It is elementary when you think about it. The call statement in the calling program is extra and the code surrounding the statements in the black box that ties them together is extra. None of this is required if you just plug your SQL statements right into your application programs.

This extra code must be compiled and executed. When extra host language code is required – no matter how little or efficient it may be – extra CPU will be expended to run the application. More code means more work. And that means degraded performance.

SQL is Already an Access Method

The final argument I will present here is a bit of a philosophical one. When you code a black box you are basically creating a data access method for your programs. To access data each program must call the black box. But SQL is already an access method – so why create another one?

Not only is SQL an access method but it is a very flexible and comprehensive access method at that. You will not be able to create an access method in your black box that is as elegant as SQL – so why try?

Summary

I assert that you should not implement data access interfaces that are called by application programs instead of coding SQL requests as needed in each program or stored procedure. When a black box is used, the tendency is that shortcuts are taken. The black box inevitably deviates from proper SQL development guidelines, requires additional work and additional code, and is just another access method that is not required. Do not get lost in the black box – instead, train your programmers to code efficient SQL statements right in their application programs. Your applications will thank you for it!