Monday, April 21, 2014

A Little Bit About LOBs

In today's blog post we will take a brief look at LOBs, or Large OBjects, in DB2. I have been prepping for my webinar later this week, titled Bringing Big Data to DB2 for z/OS with LOBs: Understanding, Using, and Managing DB2 LOBs. Be sure to click on the link and join me on April 24, 2014 for the webinar!

But back to the topic du jour... LOBs. Let's start with a bit of history. LOBs were added to relational products like DB2 back in the 1990s, ostensibly to compete more capably against the Object-Oriented databases of the time. Remember them? Back then it seemed that everybody thought OO DBMS products would supplant relational technology. Seems somewhat quaint now, doesn't it?

At any rate, the idea was to enable relational tables to house unstructured data like images and audio and video data. DB2 for z/OS users were slow to adopt LOBs in their mainframe databases. I think that is due to several different reasons, not the least of which is that it took several new versions of DB2 for LOBs to mature into capable, usable things for enterprise applications. Early implementations of LOBs in DB2 were somewhat kludgy and difficult to administer and use. But IBM has corrected many of the deficiencies over time and there are now tools that can help us to effectively manage DB2 LOBs, too.

Another force driving LOB usage is the Big Data movement. Big Data is driving organizations to accumulate and analyze more data, and more varied types of data, to gain business insight. Perhaps the most common definition of Big Data comes from Forrester Research, which defines it in terms of “the 4 V’s”: volume, velocity, variety, and variability. That definition is somewhat limiting, but I do not want to turn this post into a definition of Big Data, so... let's just think of Big Data as MORE DATA, MORE TYPES OF DATA, and FASTER GENERATION OF DATA. One example of Big Data driving the usage of LOBs in DB2 is the JSON support that has been added to DB2: JSON objects are stored in DB2 as BLOBs.

So LOBs have been around for a while now, but it has taken some time for them to gain significant levels of usage "out there!"

Why LOBs, you may ask? Why not just store the data in regular old DB2 data types like VARCHAR or VARGRAPHIC? The basic answer is that DB2's traditional data types are not large enough to hold this kind of data, because they max out at around 32 KB. Some multimedia data can get VERY large. For example, high resolution video requires about 3 GB for each hour. And high-def TV video requires 720 GB/hour!

But an additional part of the answer is that you do not always want to access the LOB data along with the traditional data. For example, if you are querying employee information you probably don't always want each employee's photo or entire resume... so the data can be stored in a LOB, external to the rest of the data (so as not to impede performance) but connected to it (for when you need to access it).

There are three types of LOBs supported by DB2:
1. BLOB – Binary Large Object – for binary data
2. CLOB – Character Large Object – for text data
3. DBCLOB – Double Byte Character Large Object – for graphic character data

Furthermore, there are two options for specifying LOBs in DB2 tables:

  1. The traditional way, in which the LOB data is stored separately from the rest of the data. This requires defining additional database objects.
  2. A newer method, introduced with DB2 10 for smaller LOBs, called inline LOBs.

Using the traditional method, LOBs are defined such that they are stored external to the base table. The column is specified as a BLOB, CLOB or DBCLOB in the base table, but DB2 stores the actual LOB data in an auxiliary table in a LOB table space. You must define a LOB table space per LOB per partition. So, let’s say we are creating a table with 2 LOBs in a table space with ten (10) partitions. That means we will need to define 20 LOB table spaces – 10 for the first LOB (one per partition) and 10 for the second LOB (one for each partition). An auxiliary table is defined for each LOB table space to hold the LOB data. And an auxiliary index is required on each auxiliary table. 

Now back to the base table. Remember that we have defined the LOB columns in the base table. But we also need to include a ROWID column in the base table. Only one ROWID column is needed per base table no matter how many LOBs you have defined in the table.
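
To make all of that concrete, here is a minimal sketch of the DDL for the traditional approach, using hypothetical object names (EMP, DB1, TSRESUME, and friends) and a single non-partitioned table with one CLOB column:

-- Base table: declares the CLOB column and the required ROWID
CREATE TABLE EMP
  (EMPNO     CHAR(6)  NOT NULL,
   EMP_ROWID ROWID    NOT NULL GENERATED ALWAYS,
   RESUME    CLOB(1M))
  IN DB1.TS1;

-- LOB table space: one per LOB column per partition
CREATE LOB TABLESPACE TSRESUME IN DB1;

-- Auxiliary table: holds the actual CLOB data
CREATE AUXILIARY TABLE EMP_RESUME_AUX
  IN DB1.TSRESUME
  STORES EMP COLUMN RESUME;

-- Auxiliary index: required; note that no column list is specified
CREATE UNIQUE INDEX XEMP_RESUME
  ON EMP_RESUME_AUX;

For a partitioned base table you would repeat the LOB table space, auxiliary table, and auxiliary index for each partition of each LOB column, as described above.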


Don’t confuse the ROWID with other concepts that may seem similar. A ROWID is not an identity column, nor is it a SEQUENCE. The ROWID is used by DB2 behind the scenes to connect the base table row to the auxiliary table row that contains the LOB data. You generally will not access the ROWID, although it is possible to do so just like any other column using SQL. And because the ROWID value implicitly contains the location of the row, it can be used to perform direct row access.
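
For example, an embedded SQL application could save the ROWID from the hypothetical EMP table sketched above into a host variable (here :HVROWID) and use it later to get back to the row directly:

-- Retrieve and save the ROWID for a particular employee
SELECT EMP_ROWID
INTO   :HVROWID
FROM   EMP
WHERE  EMPNO = '000010';

-- Later, re-access the row using the saved ROWID;
-- DB2 may choose direct row access for this query
SELECT RESUME
FROM   EMP
WHERE  EMP_ROWID = :HVROWID;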

Finally, each LOB column also has a 2-byte version number associated with it, which DB2 uses when LOB data is modified. You cannot access the version number.

Now as of DB2 10 for z/OS, you can create inline LOBs in your DB2 tables. But what is an inline LOB? Well, at a high level, the name is somewhat self-explanatory. The LOB data for inline LOBs is stored with the rest of the base table.

But it is really a bit more nuanced than that. You can store part of the LOB inline with the rest of the data and the rest of the LOB externally, if you so choose. So an inline LOB can store all of its data in the base table (if it is small enough) or only a portion of it. This can be a great technique to use if you have applications that require, say, only the first 500 bytes of a larger LOB. Those 500 bytes can be stored inline with the rest of the data, while the remainder is stored externally and accessed only when needed. You create an inline LOB by specifying the INLINE LENGTH clause on your CREATE TABLE statement.

Inline LOBs can improve performance depending upon the type of access required. Consider an example where RESUME data is stored in a CLOB. The first 200 bytes of the resume are accessed most of the time, with the entire resume being accessed only in rare instances (e.g., interviews or performance reviews). By storing those first 200 bytes inline with the rest of the data, we can eliminate the I/O to the LOB in the auxiliary table, thereby improving performance. Additionally, external LOB data is not buffered, but inline LOB data is, and this too can impact performance.
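
Here is a minimal sketch of that resume example, reusing the hypothetical EMP table from above. (Note that inline LOBs require the base table to reside in a universal table space.)

CREATE TABLE EMP
  (EMPNO     CHAR(6)  NOT NULL,
   EMP_ROWID ROWID    NOT NULL GENERATED ALWAYS,
   RESUME    CLOB(1M) INLINE LENGTH 200)  -- first 200 bytes stored inline
  IN DB1.TS1;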

This blog entry is getting a little longer than I was expecting, so I am going to cut it off here. We've reviewed what LOBs are, a little bit of their history, and discussed a bit about both traditional and inline LOBs. To learn more, be sure to join me for the webinar on April 24th (or watch it later - it will be recorded).

Monday, April 14, 2014

Aggregating Aggregates Using Nested Table Expressions


Sometimes when you are writing your SQL to access data you come across the need to work with aggregates. Fortunately, SQL offers many simple ways of aggregating data. But what happens when you uncover the need to perform aggregations of aggregates?

What does that mean? Well, consider an example. Let's assume that you want to compute the average of a sum. This is a reasonably common requirement in applications that are built around sales amounts. Let's say that we have a table containing sales information, where each sales amount has additional information indicating the salesman, region, district, product, date, etc.

A common requirement is to produce a report of the average sales by region for a particular period, say the first quarter of 2014. But the data in the table is at a detail level, meaning we have a row for each specific sale.

A novice SQL coder might try to write a query with a function inside a function, like AVG(SUM(SALE_AMT)). Of course, this is invalid SQL syntax. DB2 will not permit the nesting of aggregate functions. But we can use nested table expressions and our knowledge of SQL functions to build the correct query.

Let’s start by creating a query to return the sum of all sales by region for the time period in question, giving the sum a column name (TOTAL_SALES) so that it can be referenced later. That query should look something like this:

SELECT REGION, SUM(SALE_AMT) AS TOTAL_SALES
FROM   SALES
WHERE SALE_DATE BETWEEN DATE('2014-01-01')
                AND     DATE('2014-03-31')
GROUP BY REGION;


Now that we have the total sales by region for the period in question, we can embed this query into a nested table expression in another query like so:

SELECT NTE.REGION, AVG(NTE.TOTAL_SALES)
FROM (SELECT REGION, SUM(SALE_AMT) AS TOTAL_SALES
      FROM   SALES
      WHERE SALE_DATE BETWEEN DATE('2014-01-01')
                      AND     DATE('2014-03-31')
      GROUP BY REGION) AS NTE
GROUP BY NTE.REGION;


And voila! We have aggregated an aggregate (averaged a sum)...
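
One thing worth noting: because the inner query returns exactly one row per region, grouping the outer query by region simply returns each region's sum. If what you want instead is a single average of the regional sums across all regions, just drop the outer GROUP BY:

SELECT AVG(NTE.TOTAL_SALES)
FROM (SELECT REGION, SUM(SALE_AMT) AS TOTAL_SALES
      FROM   SALES
      WHERE SALE_DATE BETWEEN DATE('2014-01-01')
                      AND     DATE('2014-03-31')
      GROUP BY REGION) AS NTE;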

Sunday, April 06, 2014

DB2 Buffer Pool Monitoring

After setting up your buffer pools, you will want to regularly monitor your configuration for performance. The most rudimentary way of doing this is using the -DISPLAY BUFFERPOOL command. There are many options of the DISPLAY command that can be used to show different characteristics of your buffer pool environment; the simplest is the summary report, requested as follows:

-DISPLAY BUFFERPOOL(BP0) LIST(*) DBNAME(DSN8*)

And a truncated version of the results will look something like this:

DSNB401I - BUFFERPOOL NAME BP0, BUFFERPOOL ID 0, USE COUNT 20
DSNB402I - VIRTUAL BUFFERPOOL SIZE = 500 BUFFERS 736
             ALLOCATED = 500 TO BE DELETED = 0
             IN-USE/UPDATED = 0
DSNB404I - THRESHOLDS - 739
             VP SEQUENTIAL        = 80   HP SEQUENTIAL = 75
             DEFERRED WRITE       = 85   VERTICAL DEFERRED WRT = 80,0
             PARALLEL SEQUENTIAL  = 50   ASSISTING PARALLEL SEQT = 0

Of course, you can request much more information to be displayed using the DISPLAY BUFFERPOOL command by using the DETAIL parameter. Additionally, you can request that DB2 return either incremental statistics (since the last DISPLAY) or cumulative statistics (since the buffer pool was allocated). The statistics in a detail report are grouped in the following categories: GETPAGE information, Sequential Prefetch information, List Prefetch information, Dynamic Prefetch information, Page Update statistics, Page Write statistics, and Parallel Processing Activity details.
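
For example (again using BP0 as the target buffer pool):

-DISPLAY BUFFERPOOL(BP0) DETAIL(INTERVAL)
-DISPLAY BUFFERPOOL(BP0) DETAIL(*)

The first form returns incremental statistics accumulated since the last incremental display; the second returns cumulative statistics accumulated since the buffer pool was allocated.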

A lot of interesting and useful details can be gathered simply by using the DISPLAY BUFFERPOOL command. For example, you can review GETPAGE requests for random and sequential activity, the number of prefetch requests (whether static or dynamic, or for sequential or list prefetch), the number of times each threshold was tripped, and much more. Refer to the DB2 Command Reference manual (SC19-4054-02 for DB2 11) for a definition of each of the actual statistics returned by DISPLAY BUFFERPOOL.

Many organizations also have a performance monitor (such as IBM’s Omegamon) that simplifies the gathering and display of buffer pool statistics. Such tools are highly recommended for in-depth buffer pool monitoring and tuning. More sophisticated tools also exist that offer guidance on how to tune your buffer pools — or that automatically adjust your buffer pool parameters according to your workload. Most monitors also provide more in-depth statistics, such as buffer pool hit ratio calculations.

The buffer pool hit ratio is an important tool for monitoring and tuning your DB2 buffer pools. It is calculated as follows:

Hit ratio = (GETPAGEs - pages_read_from_disk) / GETPAGEs

“Pages read from disk” is a calculated field that is the sum of all random and sequential reads.
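
To illustrate with some hypothetical numbers: if a buffer pool services 1,000,000 GETPAGEs while 150,000 pages are read in from disk, the hit ratio is (1,000,000 - 150,000) / 1,000,000 = 0.85.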

The highest possible buffer pool hit ratio is 1.0. This value is achieved when each requested page is always in the buffer pool. When requested pages are not in the buffer pool, the hit ratio will be lower. You can have a negative hit ratio — this just means that prefetch was requested to bring pages into the buffer pool that were never actually referenced.


In general, the higher the hit ratio the better because it indicates that pages are being referenced from memory in the buffer pool more often. Of course, a low hit ratio is not always bad. The larger the amount of data that must be accessed by the application, the lower the hit ratio will tend to be. Hit ratios should be monitored in the context of the applications assigned to the buffer pool and should be compared against hit ratios from prior processing periods. Fluctuation can indicate problems.