Sunday, September 24, 2006

Sequence Objects and Identity Columns

When designing DB2 databases a frequently heard request is for a column that contains sequentially generated numbers. For example, each row has a counter associated with it. When a new row is inserted, the counter should be incremented by one for the new row. This way, each new DB2 row has a unique “row number” associated with it. Until recently such a design was difficult to deliver.

Without sequence objects or identity columns an application program can implement similar functionality, but usually not in a manner that performs adequately as database usage scales. A common technique is to maintain a one-row table that contains the sequence number. Each transaction locks that table, increments the number, and then commits the change to unlock the table. In this scenario only one transaction at a time can increment the sequence number. A variation uses something like this:

SELECT MAX()+ 1
FROM ONEROW_TABLE
WITH RR;

The result is the next highest number to be used. This value is used by the application and ONEROW_TABLE must be updated with the incremented value. Performance bottlenecks will occur with this method when a lot of concurrent usage is required.

But now DB2 offers two methods of automatically generating sequential numbers for a column:
· Identity columns, and;
· SEQUENCE objects.

Identity Columns

Identity columns were formally added to DB2 as of Version 7, but were actually available as of the DB2 Version 6 refresh. The identity property is applied to a DB2 column using the IDENTITY parameter. A column thusly defined will cause DB2 to automatically generate a sequential value for that column when a row is added to the table. For example, identity columns might be used to generate primary key values or a value that somewhat mimics Oracle’s row number capability. Using identity columns helps to avoid some of the concurrency and performance problems that can occur when application programs are used to populate sequential values for a “counter” column.

When inserting data into a table that uses an identity column, the program or user will not provide a value for the identity column. Instead, DB2 automatically generates the appropriate value to be inserted.

Only one identity column can be defined per DB2 table. Additionally, the data type of the column must be SMALLINT, INTEGER, or DECIMAL with a zero scale, that is DECIMAL(n,0). The data type also can be a user-defined DISTINCT type based on one of these numeric data types. The designer has control over the starting point for the generated sequential values, and the number by which the count is incremented.

An example creating a table with an identity column follows:

CREATE TABLE EXAMPLE
(ID_COL INTEGER NOT NULL
GENERATED ALWAYS AS IDENTITY
START WITH 100
INCREMENT BY 10 ...);

In this example, the identity column is named ID_COL. The first value stored in the column will be 100 and subsequent INSERTs will add 10 to the last value. So the identity column values generated will be 100, 110, 120, 130, and so on.

Note, too, that each identity column has a property associated with it assigned using the GENERATED parameter. This parameter indicates how DB2 generates values for the column. You must specify GENERATED if the column is to be considered an identity column or the data type of the column is a ROWID. This means that DB2 must be permitted to generate values for all identity columns. There are two options for the GENERATED parameter: ALWAYS and BY DEFAULT.

GENERATED ALWAYS indicates that DB2 will always generate a value for the column when a row is inserted into the table. You will usually specify ALWAYS for your identity columns unless you are using data propagation.

GENERATED BY DEFAULT indicates that DB2 will generate a value for the column when a row is inserted into the table unless a value is specified. So, if you want to be able to insert an explicit value into an identity column you must specify GENERATED BY DEFAULT.

Additionally, you can specify what to do when the maximum value is hit. Specifying the CYCLE keyword will cause DB2 to begin generating values from the minimum value all over again. Of course, this can cause duplicate values to be generated and should only be used when uniqueness is not a requirement.

Actually, the only way to ensure uniqueness of your identity columns is to create a unique index on the column. The IDENTITY property alone will not guarantee uniqueness.

Sometimes it is necessary to retrieve the value of an identity column immediately after it is inserted. For example, if you are using identity columns for primary key generation you may need to retrieve the value to provide the foreign key of a child table row that is to be inserted after the primary key is generated. DB2 provides the IDENTITY_VAL_LOCAL() function that can be used to retrieve the value of an identity column after insertion. For example, you can run the following statement immediately after the INSERT statement that sets the identity value:

VALUES IDENTITY_VAL_LOCAL() INTO :IVAR;

The host variable IVAR will contain the value of the identity column.

Problems with Identity Columns

Identity columns can be useful, depending on your specific needs, but the problems that accompany identity column are numerous. Some of these problems include:


  • Handling the loading of data into a table with an identity column defined as GENERATED BY DEFAULT. The next identity value stored by DB2 to be assigned may not be the correct value that should be generated. This can be especially troublesome in a testing environment.
  • LOAD INTO PART x is not allowed if an identity column is part of the partitioning index.
    What about environments that require regular loading and reloading (REPLACE) for testing? The identity column will not necessarily hold the same values for the same rows from test to test.
  • Prior to V8, it was not possible to change the GENERATED parameter (such as from GENERATED BY DEFAULT to GENERATED ALWAYS).
  • The IDENTITY_VAL_LOCAL() function returns the value used for the last insert to the identity column. But it only works after a singleton INSERT. This means you cannot use INSERT INTO SELECT FROM or LOAD, if you need to rely on this function.
  • When the maximum value is reached for the identity column, DB2 will cycle back to the beginning to begin reassigning values - which might not be the desired approach.
If you can live with these caveats, then identity columns might be useful to your applications. However, in general, these "problems" make identity columns a very niche solution. IBM has intentions to rectify some of these problems over time in upcoming versions of DB2.

SEQUENCE Objects

But remember, DB2 has two methods of automatically generating sequential numbers. The first method is to define an identity column for the table; the second is to create a SEQUENCE object. A SEQUENCE object is a separate structure that generates sequential numbers.

New to DB2 V8, a SEQUENCE is a database object specifically created to generate sequential values. So, a using a SEQUENCE object requires the creation of a database object; using an identity column does not.

A SEQUENCE objects is created using the CREATE SEQUENCE statement.

When the SEQUENCE object is created it can be used by applications to “grab” a next sequential value for use in a table. SEQUENCE objects are ideal for generating sequential, unique numeric key values. A sequence can be accessed and incremented by many applications concurrently without the hot spots and performance degradation associated with other methods of generating sequential values.

Sequences are designed for efficiency and to be used by many users at the same time without causing performance problems. Multiple users can concurrently and efficiently access SEQUENCE objects because DB2 does not wait for a transaction to COMMIT before allowing the sequence to be incremented again by another transaction.

An example creating a SEQUENCE object follows:

CREATE SEQUENCE ACTNO_SEQ
AS SMALLINT
START WITH 1
INCREMENT BY 1
NOMAXVALUE
NOCYCLE
CACHE 10;

This creates the SEQUENCE object named ACTNO_SEQ. Now it can be used to generate a new sequential value, for example:

INSERT INTO DSN8810.ACT (ACTNO, ACTKWD, ACTDESC)
VALUES
(NEXT VALUE FOR ACTNO_SEQ, ‘TEST’, ‘Test activity’);


The NEXT VALUE FOR clause is known as a sequence expression. Coding the sequence expression causes DB2 to use the named SEQUENCE object to automatically generate the next value. You can use a sequence expression to request the previous value that was generated. For example:

SELECT PREVIOUS VALUE FOR ACTNO_SEQ
INTO :IVAR
FROM DSN8810.ACT;


As you can see, sequence expressions are not limited to INSERT statements, but can be used in UPDATE and SELECT statements, too.

Caution: If you specify the NEXT VALUE FOR clause more than once in the same SQL statement DB2 will return the same value for each NEXT VALUE FOR specification.

SEQUENCE Object Parameters

Similar to identity columns, a SEQUENCE object has parameters to control the starting point for the generated sequential values, and the number by which the count is incremented. You can also specify the data type to be generated (the default is INTEGER). You can also specify a minimum value (MINVALUE) and a maximum value (MAXVALUE) if you wish to have further control over the values than is provided by the data type chosen.

Again, as with identity columns, you can specify how the SEQUENCE should handle running out of values when the maximum value is hit. Specifying the CYCLE keyword will cause the SEQUENCE object to wrap around and begin generating values from the minimum value all over again.

A final consideration for SEQUENCE objects is caching. Sequence values can be cached in memory to facilitate better performance. The size of the cache specifies the number of sequence values that DB2 will pre-allocate in memory. In the previous example CACHE 10 indicates that ten sequence values will be generated and stored in memory for subsequent use. Of course, you can turn off caching by specifying NO CACHE. With caching turned off each new request for a sequence number will cause I/O to the DB2 Catalog (SYSIBM.SYSSEQUENCES) to generate the next sequential value.

SEQUENCE Object Guidelines

DB2 does not wait for an application that has incremented a sequence to commit before allowing the sequence to be incremented again by another application. Applications can use one sequence for many tables, or create multiple sequences for use of each table requiring generated key values. In either case, the applications control the relationship between the sequences and the tables.

The name of the SEQUENCE object indicates that we are going to use it to generate activity numbers (ACTNO), but its usage is not limited to that. Of course, failure to control the use of a SEQUENCE object can result in gaps in the sequential values. For example, if we use the ACTNO_SEQ object to generate a number for a different column, the next time we use it for ACTNO there will be a gap where we generated that number.


Other scenarios can cause gaps in a SEQUENCE, too. For example, issuing a ROLLBACK after acquiring a sequence number will not roll back the value of the sequence generator - so that value is lost. A DB2 failure can also cause gaps because cached sequence values will be lost.
Please note, too, that when sequences were introduced in non-mainframe DB2, syntax was supported that did not conform to the SQL standard. This non-standard syntax is supported on the mainframe as well:

  • NEXTVAL can be used in place of NEXT VALUE; and
  • PREVVAL can be used in place of PREVIOUS VALUE.
Choosing Between IDENTITY and SEQUENCE
Although both identity columns and SEQUENCE objects are useful for generating incremental numeric values, you will be confronted with situations where you will have to choose between the two. Consider the following criteria for when to use one instead of the other. Identity columns are useful when:

  • Only one column in a table requires automatically generated values
  • Each row requires a separate value
  • An automatic generator is desired for a primary key of a table
  • The LOAD utility is not used to load data into the table
  • The process of generating a new value is tied closely to inserting into a table, regardless of how the insert happens

SEQUENCE objects are useful when:

  • Values generated from one sequence are to be stored in more than one table
  • More than one column per table requires automatically generated values (multiple values may be generated for each row using the same sequence or more than one sequence)
  • The process of generating a new value is independent of any reference to a table
Unlike SEQUENCE objects, which are more flexible, identity columns must adhere to several rigid requirements. For example, an IDENTITY column is always defined on a single table and each table can have at most one IDENTITY column. Furthermore, when you create an IDENTITY column, the data type for that column must be numeric; not so for sequences. If you used a SEQUENCE object to generate a value you could put that generated into a CHAR column, for example. Finally, when defining an IDENTITY column you cannot specify the DEFAULT clause and the column is implicitly defined as NOT NULL. Remember, DB2 automatically generates the IDENTITY column’s value, so default values and nulls are not useful concepts.

Summary

Both identity columns and SEQUENCE objects can be used to automatically generate sequential values for DB2 columns. Prior to Version 8, identity columns are your only option. However, after you move to V8, SEQUENCE objects will provide more flexibility and be easier to use than the identity column option.

Happy sequential value generation with DB2!

Friday, September 15, 2006

The Path to an Executable DB2 Program

Here is another Q+A exchange that I thought might be useful to share with everyone here on the blog:

QUESTION: Could you please explain the difference between a package and a plan, the process of precompilation, compilation and running a COBOL DB2 program as well as a COBOL CICS DB2 program?

ANSWER: Well, I'll hit the highlights in response to your question but if you really want an in-depth answer then you should read the DB2 manuals (or a book like DB2 Developer's Guide).

What is a plan?

A plan is an executable module containing the access path logic produced by the DB2 optimizer. It can be composed of one or more DBRMs and packages. Before a DB2 for z/OS program (with static SQL) can be run, it must have a plan associated with it.

Plans are created by the BIND command. The plan is stored in the DB2 directory and accessed when its program is run. Information about the plan is stored in the DB2 catalog.

What is a package?

A package is a single, bound DBRM with optimized access paths. By using packages, the table access logic is "packaged" at a lower level of granularity than a plan -- at the package (or program) level.

To execute a package, you first must include it in the package list of a plan (usually, there are some exceptions, such as for triggers and user-defined functions). In general, packages are not directly executed, they are only indirectly executed when the plan in which they are contained executes -- as discussed previously, UDFs and triggers are exceptions to this rule. A plan can consist of one or more DBRMs, one or more packages or, a combination of packages and DBRMs.
What is program preparation?

A DB2 application program must go through a process known as program preparation before it can run successfully. Program preparation requires a series of code preprocessors that when enacted in the proper sequence, create an executable load module and a DB2 application plan. The combination of the executable load module and the application plan is required before any DB2 program can be run, whether batch or online. CICS programs require an additional preprocessing step.

The first step is precompilation. DB2 programs must be parsed and modified before normal compilation. The DB2 precompiler performs this task. At a high level, the precompiler basically searches for SQL, strips it out of the program and puts it into a DBRM. The output of precompilation is a DBRM with SQL and a modified source program.

The DBRM is bound using the BIND command, which is a type of compiler for SQL statements. In general, BIND reads SQL statements from DBRMs and produces a mechanism to access data as directed by the SQL statements being bound.

The modified source is compiled using the language compiler of your choice. The compiled source then is link-edited to an executable load module. The appropriate DB2 host language interface module also must be included by the link edit step. This interface module is based on the environment (TSO, CICS, or IMS/TM) in which the program will execute.

At the end of this series of steps, if all goes well, you will have an executable DB2 program.

Wednesday, September 06, 2006

Help for SAP Shops Using DB2 for z/OS

Just a quick blog post today to alert folks to a relatively new blog focusing entirely on SAP usage with DB2 for z/OS. According to the blogger, Omer Brandis:

Whether you have already implemented SAP on DB2 for z/OS, or are just seriously considering it, this is the blog for you. This blog will deal with real issues from the point of view of a true real-world professional, discussing the good, bad, and ugly of SAP on DB2 for z/OS.

Recent posts have covered offsite recovery, reorganizing SAP OFFICE, and hidden REORG jobs. If you use SAP and DB2 on the mainframe, be sure to check back in on this blog on a regular basis (no pun intended).

Friday, September 01, 2006

The Two Biggest DB2 Performance Things

DB2 performance is one of those perennial topics that people just can't seem to get enough of. I guess that is because the performance of applications is one of the bigger issues that end users complain about. And DBAs can be heroes if they can resolve performance problems quickly. It alos could be that performance problems are so ubiquitous because people keep on making the same design and coding mistakes...

With all of that in mind, let's take a look at what I think are the two biggest things you need to get control of to keep DB2 and SQL performance in check.

(1) Keep DB2 statistics up-to-date

Without the statistics stored in the DB2' system catalog, the optimizer will have a difficult time optimizing anything. These statistics provide the optimizer with information pertinent to the state of the tables that the SQL statement being optimized will access. The type of statistical information stored in the system catalog include:
  • Information about tables, including the total number of rows, information about compression, and total number of pages;
  • Information about columns, including number of discrete values for the column and the distribution range of values stored in the column;
  • Information about table spaces, including the number of active pages;
  • Current status of the index, including whether an index exists, the organization of the index (number of leaf pages and number of levels), the number of discrete values for the index key, and whether the index is clustered;
  • Information about the table space and index node groups or partitions.
Statistics populate the DB2 system catalog when you execute the RUNSTATS utility. Generally, this utility is invoked in batch jobs. Be sure to work with your DBA to ensure you accumulate statistics at the appropriate time, especially in a production environment.

(2) Build appropriate indexes

Perhaps the most important thing you can do to assure optimal DB2 application performance is create correct indexes for your tables based on the queries your applications use. Of course, this is easier said than done. But we can start with some basics. For example, consider this SQL statement:

SELECT LASTNAME, SALARY
FROM EMP
WHERE EMPNO = '000010'
AND DEPTNO = 'D01' ;


What index or indexes would make sense for this simple query? ""'First, think about all the possible indexes that you could create. Your first short list probably looks something like this:

  • Index1 on EMPNO
  • Index2 on DEPTNO
  • Index3 on EMPNO and DEPTNO
This is a good start, and Index3 is probably the best of the lot. It lets DB2 use the index to immediately look up the row or rows that satisfy the two simple predicates in the WHERE clause. Of course, if you already have a lot of indexes on the EMP table, you might want to examine the impact of creating yet another index on the table. Factors to consider include:

Modification impact: DB2 will automatically maintain every index you create. This means every INSERT and every DELETE to this table will insert and delete not just from the table, but also from its indexes. And if you UPDATE the value of a column that is in an index, you also update the index. So, indexes speed the process of retrieval but slow down modification.

Columns in the existing indexes: If an index already exists on EMPNO or DEPTNO, it might not be wise to create another index on the combination. However, it might make sense to change the other index to add the missing column. But not always, because the order of the columns in the index can make a big difference depending on the query. For example, consider this query:

SELECT LASTNAME, SALARY
FROM EMP
WHERE EMPNO = '000010'
AND DEPTNO > 'D01';

In this case, EMPNO should be listed first in the index. And DEPTNO should be listed second, allowing DB2 to do a direct index lookup on the first column (EMPNO) and then a scan on the second (DEPTNO) for the greater-than.

Furthermore, if indexes already exist for both columns (one for EMPNO and one for DEPTNO), DB2 can use them both to satisfy this query so creating another index might not be necessary.

Importance of this particular query: The more important the query, the more you might want to tune by index creation. If you are coding a query that the CIO will run every day, you want to make sure it delivers optimal performance. So building indexes for that particular query is important. On the other hand, a query for a clerk might not necessarily be weighted as high, so that query might have to make do with the indexes that already exist. Of course, the decision depends on the application's importance to the business-not just on the user's importance.

Index design involves much more than I have covered so far. For example, you might consider index overloading to achieve index-only access. If all the data that a SQL query asks for is contained in the index, DB2 might be able to satisfy the request using only the index. Consider our previous SQL statement. We asked for LASTNAME and SALARY, given information about EMPNO and DEPTNO. And we also started by creating an index on the EMPNO and DEPTNO columns. If we include LASTNAME and SALARY in the index as well, we never need to access the EMP table because all the data we need exists in the index. This technique can significantly improve performance because it cuts down on the number of I/O requests.

Keep in mind that making every query an index-only access is not prudent or even possible. You should save this technique for particularly troublesome or important SQL statements.

Summary

If you are just embarking on your journey into the wonderful world of DB2 performance management, please, start with the two items covered in this short blog entry. I've just scratched the surface of both areas and you can benefit by additional research and education in both statistics gathering and index design. And if you are a long-time DB2 professional, it can't hurt to bone up on these topics either. You might learn about some newer DB2 feature or function that you haven't used yet, or maybe just strengthen what you already know.

So what do you think? Are these the two most important DB2 performance things, or do you think something else is more important? Post your comments below...