Monday, December 21, 2009

Happy Holidays


Just a brief posting today to wish everyone a very happy holiday season.

I am taking some down time thru the end of the year to visit family "up North."

So, until next year, may your databases run without a glitch and here's hoping you all have an enjoyable holiday!

Wednesday, December 16, 2009

Quick Thoughts on DB2 Performance

Database performance problems are not caused by magic. Indeed, all performance problems are always caused by change. That statement flies in the face of what I normally say, which is “Almost never say always or never”… but in this case, it is true.

Think about it for a moment. If everything remains stable and unchanging in your environment, then why would performance vary? That’s right, it wouldn’t.

Something tangible must change before a performance problem can be experienced. The challenge of performance tuning is to find the source of the change, gauge its impact, and formulate a solution.

Change can take many forms, including the following:
  • Physical changes to the environment, such as a new CPU, new disk devices, or different tape drives.
  • Changes to system software, such as a new release of a product (for example, WebSphere, CICS, or even z/OS), the alteration of a product (for example, the addition of more or fewer CICS regions or an IMS SYSGEN), or a new product (for example, implementation of DFHSM). Also included is the installation of a new release or version of DB2, which can result in changes in access paths as well as utilization of new features.
  • Changes to the DB2 engine from maintenance releases and PTFs, which can change the optimizer (and sometimes introduce other new functionality).
  • Changes in system capacity. More or fewer jobs could be executing concurrently when the performance problem occurs. Or additional users may be banging away at your transactions.
  • Environmental changes, such as the implementation of client/server programs, the adoption of SOA, or other new technologies.
  • Database changes. This involves changes to any DB2 object, and ranges from adding a new column or an index to dropping and re-creating an object.
  • Changes to the application development methodology, such as usage of check constraints instead of application logic or the use of stored procedures.
  • Changes to application code, both SQL and host language code (COBOL, C, Java, etc.).

Although the majority of your performance problems are likely to be application-oriented, you must be prepared to explore any and all of these other areas when application tuning has little effect.

My advice is to be sure that you institute strict change control tracking across all areas of your IT infrastructure. That way, whenever you experience a performance problem, you will be able to track what has changed recently, along with who changed it and why. This is important because every DBA knows what the answer to the question “What changed?” will be… right?

It is always “nothing!”

And that cannot be true. Oh, it does not mean that the person answering is lying. He or she may not have changed anything. And it is not necessarily reasonable to expect an application developer to know what all could have changed…especially when what can impact DB2 performance spans so many areas of the IT infrastructure.


So do yourself… and your company a favor: be sure that you meticulously track each and every change to any aspect of your systems. Then – and this is where many shops break down – make sure that you have methods of tying all of the change information together in such a way that it can be queried and examined in the face of a performance problem.


Only then can you reasonably expect your DBAs to be able to rapidly track down and remedy DB2 performance problems… because only then will they have the pertinent information at their disposal.

Friday, December 11, 2009

A Short History of DB2 for z/OS – Part 2

Today’s blog entry is a continuation of yesterday’s post in which we began a brief review of the history of DB2 for z/OS. That post covered Versions 1 through 3; so today we pick up our historical review with Version 4.

Version 4 was a very significant milestone in the history of DB2. It was highlighted by the introduction of Type 2 indexes, which removed the need to lock index pages (or subpages, now obsolete). Prior to V4, index locking was a particularly thorny performance problem that vexed many shops. And, of course, I’d be remiss if I did not discuss data sharing, which made its debut in V4. With data sharing, DB2 achieved new heights of scalability and availability unmatched in the DBMS realm; it afforded users the ability to upgrade without an outage and to add new subsystems to a group “on the fly.” The new capabilities did not stop there; V4 also introduced stored procedures, CP parallelism, performance improvements, and more. DB2 V4 was, indeed, a major milestone in the history of mainframe DB2.

In June 1997 DB2 Version 5 became generally available. It was the first DB2 version to be referred to as DB2 for OS/390 (previously it was DB2 for MVS). Though not as significant as V4, it continued the trend of even-numbered versions being bigger and more significant than odd-numbered ones (of course, this is just my opinion). V5 was touted by IBM as the e-business and BI version. It included Sysplex parallelism, prepared statement caching, reoptimization, online REORG, and conformance to the SQL-92 standard.

Version 6 brings us to 1999 and the introduction of the Universal Database term to the DB2 moniker. The “official” name of the product is now DB2 Universal Database for OS/390. And the Release Guide swelled to over 600 pages! Six categories of improvements were introduced with V6 spanning:
  • Object-relational extensions and active data
  • Network computing
  • Performance and availability
  • Capacity improvements
  • Data sharing enhancements
  • User productivity
The biggest of the new features were SQLJ, inline statistics, triggers, large objects (LOBs), user-defined functions, and distinct types.

Version 6 is also somewhat unique in that there was this “thing” typically referred to as the V6 refresh. It added functionality to DB2 without there being a new release or version. The new functionality in the refresh included SAVEPOINTs, identity columns, declared temporary tables, and performance enhancements (including star join).

March 2001 brings us to DB2 Version 7, another “smaller” version of DB2. Developed and released around the time of the Year 2000 hubbub, it offered much-improved utilities and some nice new SQL functionality including scrollable cursors, limited FETCH, and row expressions. Unicode support was also introduced in DB2 V7. For a more detailed overview of V7 (and the V6 refresh) consult An Introduction to DB2 for OS/390 Version 7.

DB2 Version 8 followed, but not immediately. IBM took advantage of Y2K and the general desire of shops to avoid change during that period to take its time and deliver the most significant and feature-laden version of DB2 ever. V8 had more new lines of code than DB2 V1R1 had total lines of code!

I don’t want to get bogged down in recent history here outlining the features and functionality of DB2 releases that should be fresh in our memory (V8 and V9). If you really want some details on those versions, refer to these links:

An Overview of DB2 for z/OS Version 8


DB2 9 for z/OS Features



Which brings us to today. Most shops should be either running Version 9 in production or planning their migration from V8 to V9. And we are all waiting with bated breath for DB2 X… which hopefully will be announced sometime next year.

Thursday, December 10, 2009

A Short History of DB2 for z/OS – Part 1

Let's go back in time... almost three decades ago... back to the wild and woolly 1980s! And watch our favorite DBMS, DB2, grow up over time.

Version 1 Release 1 was announced on June 7, 1983. And it became generally available on Tuesday, April 2, 1985. I wonder if it was ready on April 1st but not released because of April Fool’s Day? Any old-time IBMer out there care to comment?

Initial DB2 development focused on the basics of making a relational DBMS work. Early releases of DB2 were viewed by many as an “information center” DBMS, not for production work like IMS.

Version 1 Release 2 was announced on February 4, 1986 and was released for general availability a month later on March 7, 1986. Wow! Can you imagine waiting only a month for a new release of DB2 these days? But that is how it happened back then. Same thing for Version 1 Release 3, which was announced on May 19, 1987 and became GA on June 26, 1987. DB2 V1R3 saw the introduction of date data types.

You might notice that IBM delivered “releases” of DB2 back in the 1980s, whereas today (and ever since V3) there have only been versions. Versions mark major changes, whereas releases deliver less significant enhancements.

Version 2 of DB2 became a reality in 1988. Version 2 Release 1 was announced in April 1988 and delivered in September 1988. Here we start to see the gap widening again between announcement and delivery. V2R1 was a very significant release in the history of DB2. Some mark it as the bellwether for when DB2 began to be viewed as a DBMS capable of supporting mission critical, transaction processing workloads. Not only did V2R1 provide many performance enhancements but it also signaled the introduction of declarative Referential Integrity (RI) constraints. RI was important for the acceptance of DB2 because it helps to assure data integrity within the DBMS.

No sooner had V2R1 become GA than IBM announced Version 2 Release 2 on October 4, 1988. But it was not until nearly a year later that it became generally available, on September 23, 1989. DB2 V2R2 again bolstered performance in many ways. It also saw the introduction of distributed database support (private protocol) across MVS systems.

Version 2 Release 3 was announced on September 5, 1990 and became generally available on October 25, 1991. Two very significant features were added in V2R3: segmented table spaces and packages. Segmented table spaces have become the de facto standard for most DB2 data and packages made DB2 application programs easier to support. DB2 V2R3 is also the version that beefed up distributed support with Distributed Relational Database Architecture (DRDA). Remote unit of work distribution was not available in the initial GA version, but IBM came out with RUOW support for DB2 V2R3 in March 1992.

And along comes DB2 Version 3 announced in November 1993 and GA in December 1993. Now it may look like things sped up again here, but not really. This is when the QPP program for early support of DB2 started. QPP was announced in March 1993 and delivered to customers in June 1993. Still though, fairly rapid turnaround by today’s standards, right?

V3 greatly expanded the number of bufferpool options available (from 5 pools to 80). There were many advances made in DB2 V3 to take better advantage of the System/390 environment: V3 introduced support for hardware-assisted compression and hiperpools. It was also V3 that introduced I/O parallelism for the first time.

We’ll stop here for today and continue our short history of DB2 in my next DB2Portal blog posting. See you soon...

Wednesday, December 02, 2009

Wordle of my DB2 Portal Blog

The "jumble" of words shown here is a Wordle, which is a "word cloud" of text. I fed my blog location into the Wordle generator and it created this pretty picture based on the words I most commonly use here in this blog.

The cloud gives greater prominence to words that appear more frequently in the source text. No surprise that "DB2" and "data" dominate the other words!

Monday, November 23, 2009

Reading Things That Aren't There... and Missing Things That Are!

You can shoot yourself in the foot using DB2 if you are not careful. There are options that you can specify that may cause you to read data that is not really in the database. And, alternately, you can set things up so that you miss reading data that is actually in the database.

How, you might be asking? Well, dirty reads will take care of the first one. Specifying ISOLATION(UR) implements read-through locks, commonly referred to as dirty reads. It applies to read operations only. With this isolation level, data may be read that never actually exists in the database, because the transaction can read data that has been changed by another process but is not yet committed – and that may later be rolled back.

Read uncommitted isolation provides the highest level of availability and concurrency of the isolation levels, but the worst degree of data integrity. It should be used only when data integrity problems can be tolerated. Certain types of applications, such as those performing analytical queries, estimates, and averages, are likely candidates for read uncommitted locking. A dirty read can cause duplicate rows to be returned where none exist, or no rows to be returned when one (or more) actually exists. When choosing read uncommitted isolation the programmer and DBA must ensure that these types of problems are acceptable for the application.
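For example, an analytical query that computes an average salary and can tolerate uncommitted data might specify UR at the statement level (a sketch using the DB2 sample employee table):

SELECT AVG(SALARY)
FROM DSN8710.EMP
WITH UR;

No locks are taken, so the query never waits – but the average may include changes that are later rolled back.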

OK, so what about not reading data that is in the database? DB2 9 provides an option to do just that: it is possible for a transaction to skip over rows that are locked. This is accomplished by means of the SKIP LOCKED DATA option within your SQL statement(s). SKIP LOCKED DATA can be specified in SELECT, SELECT INTO, and PREPARE, as well as searched UPDATE and DELETE statements. You can also use the SKIP LOCKED DATA option with the UNLOAD utility.
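As a minimal sketch, assuming the DB2 sample employee table, a query that bypasses locked rows might be coded as follows:

SELECT EMPNO, LASTNAME, SALARY
FROM DSN8710.EMP
WHERE WORKDEPT = 'D11'
WITH CS SKIP LOCKED DATA;

Any qualifying row (or page) that another process holds locked is simply bypassed; the remaining rows are returned as usual.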

When you tell DB2 to skip locked data, that data is not accessed and your program will not have it available. DB2 simply skips over any locked data instead of waiting for it to be unlocked. The benefit, of course, is improved performance because you will not incur any lock wait time. But it comes at the cost of not accessing the locked data at all. This means that you should only utilize this clause when your program can tolerate skipping over some data.

The SKIP LOCKED DATA option is compatible with cursor stability (CS) isolation and read stability (RS) isolation. But it cannot be used with uncommitted read (UR) or repeatable read (RR) isolation levels. DB2 will simply ignore the SKIP LOCKED DATA clause under UR and RR isolation levels.

Additionally, SKIP LOCKED DATA works only with row locks and page locks. That means that SKIP LOCKED DATA does not apply to table, partition, LOB, XML, or table space locks. And the bigger the lock size, the more data that will be skipped when a lock is encountered. With row locking you will be skipping over locked rows, but with page locking you will be skipping over all the rows on the locked page.

Use both of these features with extreme care and make sure that you know exactly what you are telling DB2 to do. Otherwise, you might be reading more... or less than you want!

Tuesday, November 17, 2009

Replacing UNION with CASE

When a UNION is required to put together data from multiple queries, you might be able to use a CASE statement instead. This is very useful, particularly when the data for each of the queries in the UNION come from the same table. The CASE statement can potentially enhance performance by minimizing the number of times the data is read.

Let’s look at an example to clarify why:

SELECT CREATOR, NAME, 'TABLE'
FROM SYSIBM.SYSTABLES
WHERE TYPE = 'T'
UNION
SELECT CREATOR, NAME, 'VIEW '
FROM SYSIBM.SYSTABLES
WHERE TYPE = 'V'
UNION
SELECT CREATOR, NAME, 'ALIAS'
FROM SYSIBM.SYSTABLES
WHERE TYPE = 'A'
ORDER BY NAME;

This simple SQL statement uses UNION to put together the results of three queries against the SYSTABLES table. The report shows all of the DB2 table-like objects that exist in the DB2 subsystem: tables, views, and aliases.

To do this, DB2 must scan through the table three times – once for each query (as there is no index on the TYPE column). But, you can use CASE and code an equivalent, but more efficient query, as follows:

SELECT CREATOR, NAME,
CASE TYPE
WHEN 'T' THEN 'TABLE'
WHEN 'V' THEN 'VIEW '
WHEN 'A' THEN 'ALIAS'
END
FROM SYSIBM.SYSTABLES
ORDER BY NAME;

This new query will need to scan SYSTABLES only once. The CASE statement will translate the code in the TYPE column into the text that we desire.

CASE statements are very powerful and you should use them when you can to create elegant and optimal SQL in your DB2 applications.

Tuesday, November 03, 2009

Deprecated Features Planned for DB2 X for z/OS

Everyone is always interested in the latest and greatest features of their favorite DBMS, in this case DB2. But sometimes features get removed from the DBMS when a new version is released. According to the IBM teleconference on DB2 X for z/OS today, there are several features planned to be deprecated (i.e. removed). Let's briefly take a look at them.

The first feature that will be removed is private protocol distribution. This should come as no surprise to anybody since IBM has been indicating that private protocol distribution was on its way out for a number of releases now. And it really is not that difficult to convert to DRDA (and there will be some new help in DSNTP2DP).

Perhaps more troublesome to some shops will be the removal of support for plans containing DBRMs. You will have to convert the DBRMs to packages when you get to DB2 X. You would think that since packages have been available as far back as V2.3 that most shops would have converted to them already. But I do know that there are some shops out there who still have DBRMs bound directly into their plans. So start planning to convert ASAP! IBM will offer some support in DB2 X to track down affected plans.

While on the topic of plans and packages, you will have to REBIND any that have not been rebound since V5 or before. But it is a good rule to REBIND everything when you go to a new version of DB2 anyway, and most (but not all) shops do that.

From a documentation perspective, BookManager will no longer be supported. Instead we get the PDF versions and the Info Center online.

And the DB2 Management Clients (DB2 Administration Server, Control Center, and Development Center) are deprecated. IBM's new management client direction for DB2 is Data Studio.

Other deprecated items include:
  • ACQUIRE(ALLOCATE) [change to ACQUIRE(USE)]
  • Workload capture through profile monitor
  • XML Extender [change to use pureXML data type]
  • DB2 MQ XML user-defined functions and stored procedures [change to XML functions]
  • msys for Setup DB2 Customization Center [change to install panels]
At this point, DB2 X is on track to become generally available at the end of next year, 2010. IBM has announced that migration will be supported from DB2 9 NFM (and has hinted that they are considering supporting migration to DB2 X directly from DB2 V8 NFM).

Monday, November 02, 2009

New DB2 Twitter List

Just a very quick post this morning to let all you DB2 Twitter folks out there know that I created a list of the DB2 tweeters I know about at http://twitter.com/craigmullins/db2-folks.

If you are a DB2 professional and I left you off the list please leave a comment here or drop me an e-mail and I'll be happy to add you.

Friday, October 30, 2009

IOD2009 Day Three – Malcolm Gladwell

Today’s blog entry is a little late seeing as how this is Friday and IOD is over, but I’m writing about Wednesday morning’s keynote session highlighted by Malcolm Gladwell.

For those who do not know him, Malcolm Gladwell is a Canadian journalist and author best known as the author of the books The Tipping Point (2000), Blink (2005), and Outliers (2008). I’ve read all three of them and I highly recommend that you do, too. He also has a new book, What The Dog Saw, that I bought at the airport on the way home from IOD. I hope it is as good as the other three!

Gladwell’s books deal with the unexpected implications of research in the social sciences. He spoke about some of these during his keynote session in a very entertaining and informative way. It was especially rewarding to hear him tie his message into the conference theme of information-led transformation… and to hear him call all of the attendees mavens (read his books to understand that term).

Gladwell began his talk by noting the irony of hosting a conference dealing with information analytics in Las Vegas, of all places. You would think that the casinos might have an interest in that topic!

The major idea conveyed by Gladwell during his talk focused on change, and how it never occurs the way you think it will. He explained how radical changes happen much more quickly than we imagine. And he used a story about how the broadcast of a major prize fight transformed radio from a niche product to a transformative one.

He also talked about reframing as being necessary to elicit major changes. Prior to broadcasting the boxing match, radio was used to deliver news and classical music. But reframing it as a product for the delivery of real-time sports coverage – reframing its use – caused major transformation. Gladwell also highlighted Apple and the iPod. The iPod was not the first portable MP3 product, but it is the most successful. Because Apple simplified the interface and promoted it simply… that is, they reframed the issue!

I also enjoyed Gladwell’s story about how he purchased a laptop at CompUSA. Upon entering the store he was confronted with tables and tables of similar-looking laptops and had no way to differentiate them. He tried the sales people, but as anyone who ever went to CompUSA knows, they either couldn’t be found when you needed them or knew nothing about technology. So he called up his brother, who works in IT. His brother is not a well-known expert, but he is Malcolm’s maven – the guy he turns to for help – and he told him which laptop to purchase. Gladwell then went back to CompUSA and stood there pointing at the laptop his brother recommended. About a half hour later the crack CompUSA salesman came to help him.

This story highlighted the concept of the maven. He also mentioned that if CompUSA executives had watched how confidently he pointed to the machine he wanted, it might have led them to erroneously believe that the store needed less experienced – and, indeed, fewer – sales people. Which might have led to their demise.

All in all, Gladwell was entertaining and informative – a very powerful combination.

The other highlight of the keynote session, for me at least, was that I was mentioned by name as the most prolific Twitter-er at the IOD conference. Now does that mean I was the most helpful or the most annoying… I’ll leave that to you all to decide.

Wednesday, October 28, 2009

IOD2009 Day Two

Day two of the IBM Information on Demand conference was just as informative and exciting as day one. The day kicked off with a general session titled "A New Kind of Intelligence for a Smarter Planet." The idea presented is that the world is changing: it is becoming more instrumented, interconnected, and intelligent. Basically, as Steve Mills of IBM clarified, the ability to embed intelligence into millions of things will lead the transformation to an information-led smarter planet. And this information-led transformation will create opportunities for organizations to strategically gain control of information and create a new kind of intelligence.

Expanded intelligence begins with sensors and metering, which is doable today because price points have become reasonable. There are 1 billion transistors per human today. And an estimated 2 billion people will soon be on the Internet. At the same time, we are moving toward one trillion connected objects (video cameras, GPS devices, healthcare instruments, and even livestock). And there are hundreds of satellites orbiting the Earth, generating terabytes of data each and every day.

As we begin to intelligently interconnect these devices and analyze their streaming data, a transformative opportunity can result...

IBM brought up three customers to talk about how their organizations were helping to transform to a smarter planet. The customers came from Cardinal Health, Statoil Hydro ASA (Norway), and the Food and Drug Administration. Highlights of the panel discussion:
  1. Security in the supply chain for food is absolutely critical and food safety is one of the most complex systems to deal with.
  2. The US imports 50 percent of its food - from over 150 different countries.
  3. Every food supplier to US (domestic or foreign) must be registered with the FDA.
  4. And each supplier must complete forms as to the safety of its food. This can be difficult since suppliers range from large agri-businesses to small farms many of whom do not have a computer at all. So some records are probably kept on paper in shoe boxes.
  5. Supply chain security is very important in the health care (drugs) and oil industries too! In fact, it was discussed how each of these seemingly disparate industries face many similar challenges.
  6. Preventing problems requires understanding risk. And complexity requires collaboration in order to succeed because nobody has enough resources to do it all alone.
  7. In some areas (such as remote parts of Norway) instrumentation is essential to even get the data because nobody wants to go there.
  8. Legacy systems often are rich sources of information (hey, that probably means mainframes!)
  9. Analytics are required for prevention of problems, but also aid in reaction to problems that actually occur too. No one prevents 100 percent of their problems.
After the panel IBM came back and wrapped it up. They mentioned how IBM was awarded the National Medal of Technology and Innovation for its Blue Gene supercomputer, used for DNA sequencing. It was a very informative and entertaining general session.

I then attended the DB2 9.7 versus Oracle 11g Smackdown presentation. It was chock full of statistics on why IBM's DB2 is superior to Oracle in terms of cost. The presenter explained how DB2 outperforms Oracle on TPC-C benchmarks for the same test, on the same machine, at the same point in time. He cautioned folks to read the small-print details on all benchmark results... for example, if you are examining the cost of ownership, double-check whether the benchmark uses a term or full purchase license. Also, the cost of the database license depends a great deal on your ability to negotiate a discount (if the vendor will discount). And you also need to be aware of how the products are licensed. Some features are separately licensed for both DB2 and Oracle. The bottom line is that licensing can cause a more than 30 percent swing in price/performance results.

But do people even believe these benchmarks any more? I don't think very many people put much stock in benchmark tests today.

The presenter frequently cited an independent study by ITG that compares the value proposition of DB2 9.7 versus Oracle Database 11g. You can read the study yourself at this link.

(Note to my regular z/OS readers: the previous discussion was all about DB2 LUW and not DB2 for z/OS).

I also got to attend a special briefing for bloggers on IBM's new stream computing solution, which I blogged about on my Data Management Today blog if you are interested.

And finally, I'll plug my Twitter feed again. You can follow my tweets at IOD this week (well, thru Wednesday) by following me on Twitter at http://www.twitter.com/craigmullins.

Monday, October 26, 2009

IBM IOD2009 Day One

It is Monday, October 26, 2009, and the annual IBM Information on Demand (IOD2009) conference is officially underway. Well, actually it kicked off with a bang on Sunday. The exhibition hall opened at 6:00 pm and the early goers traipsed through the vendor hall sharing stories, checking out the vendors' wares, and looking for the latest tchotchkes (the favorites seem to be a mini book light being given out by SPSS and the nifty DB2 t-shirts being given out by SEGUS, Inc.).

But the event really does not get started (at least as far as I'm concerned) until the big Monday morning kickoff, which this year was emceed by Terry Fator (the ventriloquist and impersonator who won America's Got Talent last year). He did a very nice job, and his mimicry and ventriloquism skills are great... but I think ventriloquism works better in a more intimate setting.

Before Terry came on, an IBMer shared several tidbits of data from research and from polls they have been taking at IOD. For example, did you know that 15 petabytes of data are created every day? Regarding the event, they shared that there are more overall attendees here than there were at last year's event (6,000+) and that the average distance folks traveled to attend #IOD2009 is 2,500 miles. You can actually participate in the daily IOD polls at http://iodpoll.com.

In between the entertainment respites, we got to hear from several IBM folks, including Ambuj Goyal, Arvind Krishna, and Frank Kern. According to Mr. Goyal, IBM is building an information on demand software stack to deliver trusted information. This can solve a very real problem that organizations are experiencing today. You see, it seems that one in three business leaders frequently make critical decisions based on inaccurate or non-existing information. Business executives don't just need data, but trusted information that can be relied upon for analysis to make accurate business decisions.

Again, according to Goyal, he sees IBM's Information Agenda doing for trusted information what ERP did for enterprise resource processes. IBM will help move its customers from information projects to information-based architecture and operations, moving organizations from information on demand to information transformation.

Then Frank Kern hosted three IBM customers as he asked each of them to explain how they benefited from IBM's Information Management solutions. The representative from BlueCross BlueShield talked about their data warehouse with 54 million records, which happens to be the largest healthcare data warehouse in the world. Their analytical data is made available thru a portal. Offering integrated healthcare details helps improve healthcare. And the Chevron representative spoke about how IBM has helped Chevron improve its supply chain operations using integrated information.

And Arvind Krishna, GM of the Information Management group at IBM, spoke about IBM's ongoing investments in research and development for IOD. So far, IBM has invested over $12 billion in R&D and acquisitions.

IBM also unveiled some announcements today at IOD including several new offerings to provide organizations with analytics capabilities to make better use of their archived information and improve business processes. These offerings are part of IBM's unified archiving strategy called IBM Smart Archive. Most intriguing to me is the preview of a SaaS offering called IBM Information Archive Cloud Services. The entire press release can be read here, if you're interested in the topic.

I also attended a presentation on DB2 X which was quite interesting, too. Delivered by Jeff Josten, there was much good news for mainframe DB2 folks including improved performance, temporal support, improved utility operations, hashed data and access paths, and more. I tweeted about this quite extensively on Twitter.

In fact, if you want to follow my tweets all week (well, thru Wednesday) be sure to follow me on Twitter at http://www.twitter.com/craigmullins... even when I'm not at a conference like IOD, I regularly tweet on DB2, database, and information management topics.

Monday, October 19, 2009

Hear Ye, Hear Ye, Learn All About DB2 X for z/OS

IBM is hosting a free webinar on November 3, 2009, offering a technical preview of the next version of DB2 for z/OS, currently known as DB2 X (but we all know, err, ah, think it will be called DB2 10).

If you work with DB2 on the mainframe you will want to set aside some time to attend this informative DB2 X webinar. It will have information for DBAs, as well as for Managers, Application Architects, Developers, System Administrators, System Programmers, and System Architects... and that is just about anyone who works with DB2.

The speaker will be IBM Distinguished Engineer, Jeff Josten.

How do you participate, you may be asking? Well, that is easy. Simply click on over to DB2 X for z/OS Technical Preview and register yourself.

The webinar will be held on November 3, 2009 at 11:00 a.m. Eastern Standard Time, 4:00 p.m. UTC.

Monday, October 12, 2009

On The Importance of Choosing the Correct Data Type

Most DBAs have grappled with the pros and cons of choosing one data type versus another. Sometimes the decision is easy, whereas sometimes the decision is a little bit more difficult. In today's blog entry, we'll discuss both situations.

Data type and length are the most fundamental integrity constraints applied to data in a database. Simply by specifying the data type for each column when a table is created, DB2 automatically ensures that only the correct type of data is stored in that column. Processes that attempt to insert or update the data to a non-conforming value will be rejected. Furthermore, a maximum length is assigned to the column to prohibit larger values from being stored in the table.

The DBA must choose the data type and length of each column wisely. The general rule of thumb for choosing a data type for the columns in your tables is to choose the data type that most closely matches the domain of correct values for the column. That means you should try to adhere to the following rules:
  • If the data is numeric, favor SMALLINT, INTEGER, BIGINT, or DECIMAL data types. DECFLOAT and FLOAT are also options for very large numbers.
  • If the data is character, use CHAR or VARCHAR data types.
  • If the data is date and time, use DATE, TIME, and TIMESTAMP data types.
  • If the data is multimedia, use GRAPHIC, VARGRAPHIC, BLOB, CLOB, or DBCLOB data types.
But, of course, there are always exceptions. For example, what about social security number? That is a numeric column, but you won't be performing calculations using it. And formatting it correctly requires hyphens, which cannot be stored in a numeric data type. So, perhaps, SSN is an exception.

Leading Zeroes

Let's consider another typical situation: the desire to display leading zeroes. Maybe the application requires a four-byte code that is used to identify products, accounts, or some other business object, and all of the codes are numeric and will stay that way. But, for reporting purposes, the users want the codes to print out with leading zeroes. So, the users request that the column be defined as CHAR(4) to ensure that leading zeroes are always shown. But what are the drawbacks of doing this?

Well, there are drawbacks! Without proper edit checks, inserts and updates could place invalid alphabetic characters into the product code. This can be a very valid concern if ad hoc data modifications are permitted. This is rare in production databases, but data problems can still occur if the proper edit checks are not coded into every program that can modify the data. But let's assume that proper edit checks are coded and will never be bypassed. This removes the data integrity question (yeah, right!).

There is another problem that is related to performance and filter factors. Consider the possible number of values that a CHAR(4) column and a SMALLINT column can assume. Even if edit checks are coded for each, DB2 is not aware of these and assumes that all combinations of characters are permitted. DB2 uses base 37 math when it determines access paths for character columns, under the assumption that 26 alphabetic letters, 10 numeric digits, and a space will be used. This adds up to 37 possible characters. For a four-byte character column there are 37^4, or 1,874,161, possible values.

A SMALLINT column can range from -32,768 to 32,767, producing 65,536 possible small integer values. The drawback here is that negative or five-digit product codes could be entered. However, if we adhere to our proper edit check assumption, the data integrity problems will be avoided here, as well. And we can always code a simple CHECK constraint to ensure that the code is always greater than zero, as shown below.
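A minimal sketch of such a table, using a hypothetical PRODUCT table with the four-digit code stored as SMALLINT:

CREATE TABLE PRODUCT
 (PROD_CODE SMALLINT NOT NULL,
  PROD_NAME VARCHAR(40) NOT NULL,
  CONSTRAINT CODE_POS CHECK (PROD_CODE > 0));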

DB2 will use the HIGH2KEY and LOW2KEY values to calculate filter factors. For character columns, the range between HIGH2KEY and LOW2KEY is larger than for numeric columns because there are more total values. The filter factor will be larger for the numeric data type than for the character data type, which may influence DB2 to choose a different access path. For this reason, favor the SMALLINT over the CHAR(4) definition.

The leading zeroes problem can usually be solved using other methods. If you are using a report writer, most of them have quick methods of displaying leading zeroes. When using QMF, you can ensure that leading zeroes are shown by using the "J" edit code. Report programs can be coded to display leading zeroes easily enough by moving the host variables to appropriate display fields. And ad hoc access through other reporting tools typically provides a parameter that can enable leading zeroes to be displayed.
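You can even produce the leading zeroes in plain SQL. A sketch, again assuming the hypothetical PRODUCT table: DIGITS converts the SMALLINT into a CHAR(5) string with leading zeroes, and SUBSTR trims it to the four-byte code the users want to see:

SELECT SUBSTR(DIGITS(PROD_CODE), 2, 4)
FROM PRODUCT;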

In general, it is wise to choose a data type which is closest to the domain for the column. If the column is to store numeric data, favor choosing a numeric data type: SMALLINT, INTEGER, DECIMAL, or floating point. In addition, always be sure to code appropriate edit checks to ensure data integrity.

A DATE with DB2

What about dates? In my opinion, you should always use a DATE data type for date data. Why anyone would ever store a date or time in a DB2 table any other format than DATE, TIME, or TIMESTAMP is beyond me. But, oh, they do it all the time. And it causes all sorts of headaches. The benefits of using the proper DB2 data type are many, including:
  • Ensuring data integrity because DB2 will ensure that only valid date and time values are stored
  • The ability to use date and time arithmetic
  • A vast array of built-in functions to operate on and transform date and time values
  • Multiple formatting choices
Additionally, you have multiple built-in functions at your disposal for manipulating date and time data, but only when the data is stored as DATE, TIME, and TIMESTAMP data types. The time-related DB2 functions include CHAR, DATE, DAY, DAYOFMONTH, DAYOFWEEK, DAYOFYEAR, DAYS, HOUR, JULIAN_DAY, MICROSECOND, MIDNIGHT_SECONDS, MINUTE, MONTH, QUARTER, SECOND, TIME, TIMESTAMP, WEEK, WEEK_ISO, and YEAR.
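For example, date arithmetic plus a couple of these functions make common requests trivial. A sketch using the DB2 sample employee table (which includes a HIREDATE column):

SELECT EMPNO, LASTNAME,
       DAYS(CURRENT DATE) - DAYS(HIREDATE) AS DAYS_EMPLOYED
FROM DSN8710.EMP
WHERE HIREDATE > CURRENT DATE - 10 YEARS;

Try coding that against a date stored in a CHAR column!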

I get questions all the time from folks who've stored dates using character data types; they ask all kinds of integrity and manipulation questions and I always, always, always start my reply with "You shouldn't have done that!"... and then I try to help them out, if I can. Don't fall into the trap of storing dates using anything but a DATE data type... you'll thank me later.

Summary

There is a lot more that can be said about choosing appropriate data types, but I think we've covered enough for today. The bottom line on data types is to use them to protect the integrity of your DB2 data and to simplify your job by taking advantage of DB2’s built-in capabilities. By choosing the data type that most closely matches your data you will be doing yourself, your systems, and your users a big favor.

Friday, October 02, 2009

IDUG Europe is Right Around the Corner

Just a quick post today to remind everybody that the annual European IDUG conference will be held next week (the week of October 5, 2009) in Rome, Italy. And it is not too late to ensure that you will be there to hear the latest and greatest news, tips, tricks, and guidelines on our favorite DBMS - IBM's DB2!

For those of you not lucky enough to be there keep an eye on my DB2portal blog here where I will attempt to summarize the key events of the week.

And if you are a Twitter aficionado, be sure to follow me on Twitter as I will try to make regular Tweets about the event (as long as my Blackberry works in Rome).

Thursday, September 24, 2009

Limiting the Number of Rows Fetched

Application developers frequently need to retrieve a limited number of qualifying rows from a table. For example, maybe you need to list the top ten best selling items from inventory, or a list of the top five most expensive products (i.e., highest price tag). There are several ways to accomplish this prior to DB2 V7 using SQL, but they are not necessarily efficient.

The first reaction is to simply use the WHERE clause to eliminate non-qualifying rows. But this is simplistic, and often is not sufficient to produce the results desired in an optimal manner. What if the program only requires that the top ten results be returned? This can be a somewhat difficult request to formulate using SQL alone.

Consider, for example, an application that needs to retrieve only the top ten most highly paid employees from the EMP sample table. You could simply issue a SQL request that retrieves all of the employees in order by salary, but only use the first ten retrieved. That is easy; for example:

SELECT EMPNO, FIRSTNME, LASTNAME, SALARY
FROM DSN8710.EMP
ORDER BY SALARY DESC;

You must specify the ORDER BY clause with the DESC key word. This sorts the results into descending order, instead of the default, which is ascending. Without the DESC key word, the "top ten" would be at the very end of the results set, not at the beginning.

But that does not really satisfy the requirement - retrieving only the top ten. It merely sorts the results into descending sequence. So the results would still be all employees in the table, but in the correct order so you can view the "top ten" salaries very easily. The ideal solution should return only the top ten employees with the highest salary and not merely a sorted list of all employees.

You can code some "tricky" SQL to support this request for all versions of DB2, such as the following:

SELECT EMPNO, FIRSTNME, LASTNAME, SALARY
FROM DSN8710.EMP A
WHERE 10 > (SELECT COUNT(*)
FROM DSN8710.EMP B
WHERE A.SALARY < B.SALARY)
AND SALARY IS NOT NULL
ORDER BY SALARY DESC;

This SQL is portable from DB2 to other DBMSs, such as Oracle or SQL Server. And, of course, you can change the constant 10 to any number you wish, thereby retrieving the top 20, or top 5, as deemed necessary by the needs of your application.

Since the SALARY column is nullable in the EMP table, you must remove the nulls from the results set. And the ORDER BY is required to sort the results in the right order. If it is removed from the query, the results will still contain the top ten, but they will be in no particular order.

But DB2, as of V7, provides an easier and less complicated way to limit the results of a SELECT statement - the FIRST key word. You can code FETCH FIRST n ROWS ONLY, which will limit the number of rows that are fetched and returned by a SELECT statement.
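So the top-ten salary query can be recoded, using the same sample table, as:

SELECT EMPNO, FIRSTNME, LASTNAME, SALARY
FROM DSN8710.EMP
ORDER BY SALARY DESC
FETCH FIRST 10 ROWS ONLY;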

Additionally, you can specify the FETCH FIRST ROW ONLY clause on SELECT INTO statements when the query can return more than one row in the answer set. Doing so informs DB2 to ignore the other rows.
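For example, in an embedded program (the host variable and predicate are hypothetical):

SELECT SALARY
INTO :HV-SALARY
FROM DSN8710.EMP
WHERE WORKDEPT = 'D11'
FETCH FIRST 1 ROW ONLY;

Without the clause, this SELECT INTO would fail with an error if more than one employee qualified.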

There is one difference between the new V7 formulation and the other SELECT statement we reviewed, and that is the way "ties" are handled. A tie occurs when more than one row contains the same value. The previous query we examined may return more than 10 rows if there are multiple rows with the same value for salary within the top ten.

Using the FIRST key word DB2 will limit the number of rows returned to ten, even if there are other rows with the same value for salary as the number ten row in the results set. The needs of your application will dictate whether ties are to be ignored or included in the result set. If all "ties" need to be included in the results set, which would mean that more than 10 rows would be needed, the new V7 feature may not prove to be helpful.

And it is also important to note that as of DB2 9, you can include the FETCH FIRST clause in a subselect. ORDER BY is allowed in a subselect, too. The subselect MUST be enclosed in parentheses and the FETCH FIRST (or ORDER BY) cannot be in the outermost fullselect of a view, or in a materialized query table.
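This means, for example, that you can compute an aggregate over a limited set, such as the average of the top ten salaries (a sketch using the same sample table):

SELECT AVG(SALARY)
FROM (SELECT SALARY
      FROM DSN8710.EMP
      ORDER BY SALARY DESC
      FETCH FIRST 10 ROWS ONLY) AS TOP_TEN;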

Friday, September 04, 2009

Dynamic SQL Causing Lock Escalation?

Lock escalation is the promotion of a lock from a row, page, or LOB lock to a table space lock, which occurs when the number of lower-level locks held concurrently on a given resource exceeds a preset limit.

But lock escalation can cause problems. Yes, when fewer locks are taken, CPU cost and memory usage can be reduced. On the other hand, escalating the size of a lock causes more resources to be locked... and that impacts concurrency with the most likely result being applications experiencing lock timeouts or deadlocks.
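The escalation threshold itself is controlled at the table space level by the LOCKMAX parameter (with LOCKMAX SYSTEM deferring to the NUMLKTS DSNZPARM). A hypothetical sketch, assuming a table space named DB1.TS1:

ALTER TABLESPACE DB1.TS1 LOCKMAX 2000;

Specifying LOCKMAX 0 disables lock escalation for the table space entirely – which avoids the concurrency hit, but at the risk of programs holding very large numbers of locks.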

When lock escalation occurs, DB2 writes messages (DSNI031I) to the system log. So when complaints inevitably arrive about failing transactions you can look for the appropriate lock escalation message indicating that it could be the culprit.

But such simple steps are typically not sufficient to track down the root cause and take corrective actions. You need to be able to determine what SQL statement in what program and under what circumstances is causing the escalation. Doing so involves tracing and using performance monitoring tools.

For those interested in this topic (indeed, the primary purpose of this blog post), you should seek out the IBM TechDoc titled Identifying the dynamic SQL statement which is causing a lock escalation in DB2 for z/OS. For those who don't know what a TechDoc is, don't worry, it isn't that complicated. It is basically a published piece of technical documentation that isn't part of a larger manual or IBM RedBook. TechDocs can be short papers, product flashes, presentations, etc.

This particular TechDoc details a methodology for identifying dynamic SQL statements involved in a lock escalation. It describes which traces to start, and which jobs to run to analyze the collected traces. It also explains how to analyze the trace report to find the problematic SQL statement.

You can search all of IBM's TechDocs by following this link.

Happy reading!

Friday, August 21, 2009

Approaches to Access Path Management

BIND and REBIND are important components in assuring efficient DB2 applications. Because the BIND/REBIND process determines exactly how your DB2 data is accessed, it is important that you develop an appropriate strategy for when and how to REBIND your programs.

There are several common REBIND approaches taken by DB2 users. By far, the best approach is to REBIND your applications over time as your data and systems change. This approach involves some form of regular maintenance that keeps DB2 statistics up to date and formulates new access paths as data volumes and patterns change.

Other approaches include REBINDing only when a new version of DB2 is installed, or perhaps more ambitious, whenever new PTFs are applied to DB2. Another approach is to REBIND automatically after a regular period of time (days, weeks, months, etc.). This approach can work if the period of time is wisely chosen based on the application data – but it still can pose administrative issues.

The final approach can be summarized as “if it ain’t broke don’t fix it!” This is the worst of the several approaches discussed here. The biggest problem with this approach is that you are penalizing every program in your subsystem for fear that a program or two may have a degraded access path. This results in potentially many programs having sub-optimal performance because the optimizer never gets a chance to create better access paths as the data changes.

Of course, the possibility of degraded performance is real – and that is why this approach has been adopted at some sites. The problem is being able to find which statements have degraded. In an ideal world we would be able to review the access path changes beforehand to determine if they are better or worse. But DB2 itself does not provide any systematic method of administering access paths that way. There are third-party tools that can help you achieve this, though.

Anyway, let’s go back to the best approach again, and that is to REBIND on a regular basis as your data changes. This approach has become known as the three Rs. To implement this approach you:
  1. Regularly reorganize the data to ensure that it is optimally structured.
  2. Follow that with RUNSTATS to be sure that the reorganized state of the data is reflected in the DB2 Catalog.
  3. And follow that with a REBIND for all the application programs that access the data structures impacted by the REORG and RUNSTATS (a sketch of this sequence follows the list).
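A minimal sketch of the three steps, assuming hypothetical object and collection names (the REORG and RUNSTATS are utility control statements; the REBIND is issued through the DSN command processor):

REORG TABLESPACE DB1.TS1
RUNSTATS TABLESPACE DB1.TS1 TABLE(ALL) INDEX(ALL) UPDATE(ALL)

DSN SYSTEM(DSN1)
REBIND PACKAGE(COLL1.*)
END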
At any rate, your goal should be to keep your access paths up-to-date with the current state of your data. Failing to do this means that DB2 is accessing data based upon false assumptions. DB2 is unlikely to make the same access path choice as your data grows – and as patterns within the data change.

By REBINDing you can generally improve the overall performance of your applications because the access paths will be better designed based on an accurate view of the data. Additionally, as DB2 changes are introduced (PTFs, new version/release) optimizer improvements and new access techniques can be incorporated into the access paths. That is, if you never REBIND, not only are you forgoing better access paths due to data changes but you are also forgoing better access paths due to changes to DB2 itself.

Adopting the Three R’s approach can pose additional questions. For example, when should you reorganize? In order to properly determine when a REORG is needed you’ll have to look at statistics. This means looking at either RUNSTATS or Real-Time Statistics (RTS). So, we need to add another R – in other words:
  1. RUNSTATS or better yet, RTS
  2. REORG
  3. RUNSTATS
  4. REBIND
Now it is true that some folks don’t rely on statistics to schedule a REORG. Instead, they just build the JCL to REORG their database objects when they create the object. So they create a table space then build the REORG job and schedule it to run monthly, or quarterly, or on some regular basis. This is better than no REORG at all, but it is probably not the best approach because you are most likely either reorganizing too soon (in which case you waste the CPU cycles to do the REORG) or you are reorganizing too late (in which case performance is suffering for a period of time before the REORG runs). Better to base your REORGs off of statistics and thresholds using either RUNSTATS or RTS.

Without accurate statistics there is little hope that the optimizer will formulate the best access path to retrieve your data. If the optimizer doesn’t have accurate information on the size, organization, and particulars of your data then it will be creating access paths based on either default or inaccurate statistics. Incorrect statistics will cause bad choices to be made – such as choosing a merge-scan join when a nested loop join would be better, or failure to invoke sequential prefetch, or using the wrong index – or no index at all. And the problem of inaccurate statistics is pervasive.

There are shops out there that never, or rarely, run RUNSTATS to gather up-to-date statistics. Make sure yours is not one of those shops!

When should you run RUNSTATS? One answer is “As frequently as possible based on how often your data changes.” To do this you will need to know a thing or two about your data growth patterns: what is its make-up, how is it used, how fast does it grow, and how often does it change? These patterns will differ for every table space in your system.

Next we need to decide when to REBIND? The best answer for this is when statistics have changed significantly enough to change access paths. When we know that data has significantly changed it makes sense to REBIND after the RUNSTATS completes. But the trick is determining exactly when we have a “significant” change in our data. Without an automated method of comparing and contrasting statistics (or even better yet, access paths) coming up with an answer in a manual way can be time-consuming and error-prone – especially if we have thousands of DB2 programs to manage.

As we REBIND, we always must be on alert for rogue access paths. A rogue access path is created when the optimizer formulates a new access path that performs worse than the previous access path. This can happen for a variety of reasons. Of course, number one is that the optimizer, though good, is not perfect. So mistakes can happen. Other factors can cause degraded access paths, too. The access paths for volatile tables depend on when you run the RUNSTATS. Volatile tables are those that start out empty, get rows added to them during processing, and are emptied out at the end of the day. And, of course, if the catalog or statistics are not accurate we can get problems, too.

So adopting the Four R’s approach implies that you will have to develop a methodology for reviewing your access paths and taking care of any “potential” problem access paths. Indeed, the Four R’s becomes the Five R’s as we add a step to review the access paths after REBINDing to make sure that there are no rogue access paths:

  1. Start RTS (or a RUNSTATS) to determine when to REORG.
  2. Reorganize the table spaces (and/or indexes)
  3. After reorganizing, run RUNSTATS again,
  4. Follow that with the REBINDs.
  5. Then we need that fifth R – which is to Review the access paths generated by the REBIND.
The review is of utmost importance because the optimizer can make mistakes. And, of course, so can you. But your users will not call you when performance is better (or the same). They only ring you up when performance gets worse. As such, proactive shops will put best practices in place to test REBIND results comparing the before and after impact of the optimizer’s choices.

Thursday, July 30, 2009

Discount on IBM Information on Demand Conference

The IBM Information on Demand (IOD, for short) conference is rapidly approaching. The conference will be held in Las Vegas, Nevada the week of October 25 – 30, 2009 at the Mandalay Bay casino and hotel.

The IOD conference is IBM’s signature event for their data and information management product lines. By attending IOD 2009 you can gain unique perspectives from IBM experts, technical leaders and visionaries as well as peers in your industry. Many of the developers of IBM’s offerings, such as DB2, Informix, and IMS, will be delivering educational sessions at IOD. And over 200 customers will share real-world experiences detailing how they have unlocked the value of their information and realized tangible and immediate return on investment.

OK, so what about that discount? IBM has been kind enough to provide me with a discount code that I can share with my readers. You will need to provide the code, G09MULL, when you register for the conference and you will get a $100 discount off of your registration!

Here’s how to register on-line:

  • An IBM userid and Password are required to register. If you have an existing IBM.com account ID, please select 'Register for the Conference' and sign on. If you do not have an IBM.com account ID, please follow the link 'MY IBM ID' to obtain one. Complete the form and click 'SUBMIT'. You will then be allowed to continue as a registered IBM user. Click 'Continue' and select 'Register for the Conference' and sign on with your new userid and password.
  • Select registration type of 'Customer' and click 'Continue'
  • Complete the appropriate fields on the enrollment form
  • Under the 'Promotion Code Information' section, enter PROMOTION CODE G09MULL. You must enter this code to receive your $100 discount.
  • Complete the form with attendee, arrival and payment information clicking 'Continue' at the bottom of each section.
  • Click 'Submit Registration'

Note: Credit card information is required to guarantee your registration regardless of your method of payment and/or discount amount.

That’s all there is to it. And you can use the promotion code multiple times for all of the folks at your company who will be attending the Information On Demand conference in 2009.

Cheers!

Monday, July 06, 2009

The "Dirty" Read (AKA Uncommitted Read)

Anyone accustomed to application programming when access to a database is required understands the potential for concurrency problems. When one application program attempts to read data that is in the process of being changed by another, the DBMS must forbid access until the modification is complete in order to ensure data integrity. Most DBMS products, DB2 included, use a locking mechanism for all data items being changed. Therefore, when one task is updating data on a page, another task cannot access data (read or update) on that same page until the data modification is complete and committed.

Programs that read DB2 data typically access numerous rows during their execution and are thus quite susceptible to concurrency problems. Since V4, DB2 has provided read-through locks, also known as “dirty read” or “uncommitted read,” to help overcome concurrency problems. When using an uncommitted read, an application program can read data that has been changed but is not yet committed.

Dirty read capability is implemented using the UR isolation level, for uncommitted read. This adds to the isolation levels of RR (repeatable read) and CS (cursor stability). If an application program uses the UR isolation level, it will read data without taking locks. This enables the application program to read data contained in the table as it is being manipulated.
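For example, a plan could be bound to use dirty reads with a BIND subcommand along these lines (the plan and DBRM names here are hypothetical):

BIND PLAN(RPTPLAN) -
     MEMBER(RPTPROG) -
     ACTION(REPLACE) -
     ISOLATION(UR)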

Consider the following sequence of events:

(1) At 9:00 AM, a transaction is executed containing the following SQL to change a specific value.

UPDATE EMP
SET FIRST_NAME = 'MICHELLE'
WHERE EMPNO = 10020;

The transaction is a long-running one and continues to execute without issuing a COMMIT.

(2) At 9:01 AM, a second transaction attempts to SELECT the data that was changed, but not committed.


If the UR isolation level were specified for the second transaction, it would read the changed data even though it had yet to be committed. Obviously, if the program need not wait to take a lock and merely reads the data in whatever state it happens to be at that moment, the program will execute faster than if it had to wait for locks to be taken and resources to be freed before processing.

However, the implications of reading uncommitted data must be carefully examined before being implemented. Several types of problems can occur. A dirty read can cause duplicate rows to be returned where none exist. Alternatively, a dirty read can cause no rows to be returned when one (or more) actually exists. Additionally, an ORDER BY clause does not guarantee that rows will be returned in order if the UR isolation level is used. Obviously, these problems must be taken into consideration before using the UR isolation level.

UR Isolation Requirements

  • The UR isolation level applies to read-only operations: SELECT, SELECT INTO, and FETCH from a read-only result table.
  • Any application plan or package bound with an isolation level of UR will use uncommitted read functionality for any read-only SQL. Operations that are contained in the same plan or package and are not read-only will use an isolation level of CS.
  • The isolation level that is defined at the plan or package level during BIND or REBIND can be overridden as desired for each SQL statement in the program. The WITH clause can be used to specify the isolation level for any individual SQL statement. For example:


SELECT EMPNO, LAST_NAME
FROM EMP
WITH UR;

The WITH clause is used to allow an isolation level to be used on a statement-by-statement basis. The UR isolation level can be used only with read-only SQL statements. This includes read-only cursors and SELECT INTO statements.
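For example, a read-only cursor could specify UR at the statement level, using the same table as above (the cursor name is merely illustrative):

DECLARE CSR1 CURSOR FOR    -- hypothetical cursor name
SELECT EMPNO, LAST_NAME
FROM EMP
FOR FETCH ONLY
WITH UR;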

Benefits and Drawbacks

So when is it a good idea to implement dirty reads using the UR isolation level? The general rule of thumb is to avoid dirty reads whenever the results must be 100 percent accurate. Examples of this would be when:
  • calculations that must balance are being performed on the selected data
  • data is being retrieved from one source to insert to or update another
  • production, mission-critical work is being performed that cannot contain or cause data integrity problems
Truthfully, most DB2 applications are not usually candidates for dirty reads. However, there are a few specific situations in which the dirty read capability will be of major benefit. Consider the following cases in which the UR isolation level could prove to be useful:
  • Access is required to a reference, code, or look-up table that is basically static in nature. Due to the non-volatile nature of the data, a dirty read would be no different than a normal read the majority of the time. In those cases when the code data is being modified, any application reading the data would incur minimal problems, if any.
  • Statistical processing must be performed on a large amount of data. For example, your company may wish to determine the average age of female employees within a certain pay range. The impact of an uncommitted read on an average of multiple rows will be minimal because a single value changed will not greatly impact the result.
  • Dirty read can prove invaluable in a data warehousing environment that uses DB2 as the DBMS. A data warehouse is a time-sensitive, subject-oriented store of business data that is used for on-line analytical processing. Other than periodic data propagation and/or replication, access to the data warehouse is read-only. An uncommitted read can be perfect in a read-only environment since the data is generally not changing and little damage can be done. More and more data warehouse projects are being implemented in corporations worldwide and DB2 with dirty read capability may be a very wise choice for data warehouse implementation.
  • In those rare cases when a table, or set of tables, is used by a single user only, UR can make a lot of sense. If only one individual can be modifying the data, the application programs can be coded such that all (or most) reads are done using UR isolation level, and the data will still be accurate.
  • Finally, if the data being accessed is already inconsistent, little harm can be done using a dirty read to access the information.
Administration

Because of the data integrity issues associated with dirty reads, it is a good idea for DBAs to keep track of the plans and packages that specify an isolation level of UR. This information can be found in the DB2 catalog. The following two queries can be used to find the applications using uncommitted reads.

Issue the following SQL for a listing of plans that were bound with ISOLATION(UR) or contain at least one statement specifying the WITH UR clause:

SELECT DISTINCT S.PLNAME
FROM SYSIBM.SYSPLAN P,
SYSIBM.SYSSTMT S
WHERE (P.NAME = S.PLNAME AND
P.ISOLATION = 'U'
)
OR S.ISOLATION = 'U'
ORDER BY S.PLNAME;

Issue the following SQL for a listing of packages that were bound with ISOLATION(UR) or contain at least one statement specifying the WITH UR clause:

SELECT DISTINCT P.COLLID, P.NAME, P.VERSION
FROM SYSIBM.SYSPACKAGE P,
SYSIBM.SYSPACKSTMT S
WHERE P.LOCATION = S.LOCATION AND
P.LOCATION = ' ' AND
P.COLLID = S.COLLID AND
P.NAME = S.NAME AND
P.VERSION = S.VERSION AND
(P.ISOLATION = 'U'
OR S.ISOLATION = 'U')
ORDER BY P.COLLID, P.NAME, P.VERSION;

Synopsis

The dirty read capability can provide relief to concurrency problems and deliver faster performance in very specific situations. Be certain to understand the implications of the UR isolation level and the “problems” it can cause before diving headlong into implementing it in your production applications.

Monday, June 22, 2009

Know Your ISOLATION Levels

Did you know that DB2 provides a way to change the way that a program or SQL statement acquires locks? That way is known as the isolation level and it can be set to specify the locking behavior for a transaction or statement. Standard SQL defines four isolation levels that can be set using the SET TRANSACTION ISOLATION LEVEL statement:

  • Serializable
  • Repeatable read
  • Read committed
  • Read uncommitted

The isolation level determines the mode of page or row locking implemented by the program as it runs.
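In standard SQL, the isolation level for a transaction is requested with a statement along these lines (note that this SET TRANSACTION syntax comes from the SQL standard; as discussed below, DB2 for z/OS instead uses BIND parameters and the WITH clause):

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;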

DB2 supports a variation of the standard isolation levels. DB2 implements page and row locking at the program execution level, which means that all page or row locks are acquired as needed during the program run. Page and row locks are released as the program runs, depending on the isolation level.

In DB2 you can specify the following four isolation levels:

  • cursor stability (CS),
  • repeatable read (RR),
  • read stability (RS), and
  • uncommitted read (UR).

Using the ISOLATION parameter of the BIND command you can set the isolation level of a package or plan. You also can use the WITH parameter on a SELECT statement to set the isolation level of a single SQL statement.
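For example, the following binds a package using cursor stability, while the subsequent SELECT overrides the package-level setting for a single statement (the collection and program names are hypothetical):

BIND PACKAGE(COLL1) -
     MEMBER(PROG1) -
     ISOLATION(CS)

SELECT DEPTNO, DEPTNAME
FROM DSN8710.DEPT
WITH RR;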

Cursor stability is the DB2 implementation of the SQL standard read committed isolation level. CS is perhaps the most common DB2 isolation level in use in production applications because it offers a good tradeoff between data integrity and concurrency. When CS is specified the transaction will never read data that is not yet committed; only committed data can be read.

A higher level of integrity is provided with repeatable read. Under an RR isolation level all page locks are held until they are released by a COMMIT (or ROLLBACK), whereas with CS read-only page locks are released as soon as another page is accessed. Repeatable read is the default isolation level if none is specified at BIND time.

An RR page locking strategy is useful when an application program requires consistency in rows that may be accessed twice in one execution of the program, or when an application program requires data integrity that cannot be achieved with CS.

As an example of a good reason to use RR page locking, consider a reporting program that scans a table to produce a detail report, and then scans it again to produce a summarized managerial report.

Suppose that you are reporting the estimated completion dates for project activities. The first report lists every project and the estimated completion date. The second, managerial report lists only the projects with a completion date greater than one year.

The first report indicates that two activities are scheduled for more than one year. After the first report but before the second, however, an update occurs. A manager realizes that she underestimated the resources required for a project. She invokes a transaction to change the estimated completion date of one of her project's activities from 8 months to 14 months. The second report is produced by the same program, but now reports three activities.

If the program used an RR isolation level rather than CS, an UPDATE that occurs after the production of the first report but before the second would not have been allowed. The program would have maintained the locks it held from the generation of the first report and the updater would be locked out until the locks were released.

How about another example? Consider a program that is looking for pertinent information about employees in the information center and software support departments who make more than $30,000 in base salary. In the DB2 sample tables department 'C01' is the information center and department 'E21' is software support.

The program opens a cursor based on the following SELECT statement:


SELECT EMPNO, FIRSTNME, LASTNAME,
WORKDEPT, SALARY
FROM DSN8710.EMP
WHERE WORKDEPT IN ('C01', 'E21')
AND SALARY > 30000;

The program then begins to FETCH employee rows. Assume further, as would probably be the case, that the statement uses the XEMP2 index on the WORKDEPT column. An update program that implements employee modifications is running concurrently. The program handles transfers by moving employees from one department to another, and implements raises by increasing the salary.

Assume that Sally Kwan, one of your employees, has just been transferred from the information center to software support. Assume further that another information center employee, Heather Nicholls, received a 10 percent raise. The update program running concurrently with the report program implements both of these modifications.

If the report program were bound with an isolation level of CS, the second program could move Sally from 'C01' to 'E21' after she was reported to be in department 'C01' but before the entire report was finished. Thus, she could be reported twice: once as an information center employee and again as a software support employee. Although this circumstance is rare, it can happen with programs that use cursor stability. If the program were bound instead with RR, this problem could not happen. The update program probably would not be allowed to run concurrently with a reporting program, however, because it would experience too many locking problems.

Now consider Heather's dilemma. The raise increases her salary 10 percent, from $28,420 to $31,262. Her salary now fits the parameters specified in the WHERE condition of the SQL statement. Will she be reported? It depends on whether the update occurs before or after the row has been retrieved by the index scan, which is clearly a tenuous situation. Once again, RR avoids this problem.

You might be wondering, "If CS has the potential to cause so many problems, why is it used so ubiquitously? Why not trade the performance and concurrency gain of CS for the integrity of RR?"

The answer is simple: the types of problems outlined are rare. The expense of using RR, however, can be substantial in terms of concurrency. So trading the efficiency and concurrency of CS for the integrity of RR usually is not a sound decision.

The third isolation level provided by DB2 is read stability (RS). Read stability is similar in functionality to the RR isolation level, but a little less restrictive. A retrieved row or page is locked until the end of the unit of work; no other program can modify the data until the unit of work is complete, but other processes can insert values that might be read by your application if it accesses the row a second time.

Consider using read stability over repeatable read only when your program can handle retrieving a different set of rows each time a cursor or singleton SELECT is issued. If using read stability, be sure your application is not dependent on having the same number of rows returned each time.
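For example, a single statement could request RS via the WITH clause (the predicate here is merely illustrative):

SELECT EMPNO, LASTNAME
FROM DSN8710.EMP
WHERE WORKDEPT = 'E21'
WITH RS;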

Finally, we come to the last, and most maligned, isolation level, uncommitted read (UR). The UR isolation level provides read-through locks, also known as dirty read or read uncommitted. Using UR can help to overcome concurrency problems. When you're using an uncommitted read, an application program can read data that has been changed but is not yet committed.

UR can be a performance booster, too, because application programs bound using the UR isolation level will read data without taking locks. This way, the application program can read data contained in the table as it is being manipulated. Consider the following sequence of events:

1. To change a specific value, at 9:00 a.m. a transaction containing the following SQL is executed:


UPDATE EMP
SET FIRSTNME = 'MICHELLE'
WHERE EMPNO = 10020;

The transaction is a long-running one and continues to execute without issuing a COMMIT.


2. At 9:01 a.m., a second transaction attempts to SELECT the data that was changed, but not committed.


If the UR isolation level were used for the second transaction, it would read the changed data even though it had yet to be committed. Obviously, if the program doesn't need to wait to take a lock and merely reads the data in whatever state it happens to be at that moment, the program will execute faster than if it had to wait for locks to be taken and resources to be freed before processing.

The implications of reading uncommitted data, however, must be carefully examined before being implemented. Several types of problems can occur. Using the previous example, if the long-running transaction rolled back the UPDATE to EMPNO 10020, the program using dirty reads may have picked up the wrong name ("MICHELLE") because it was never committed to the database.

Inaccurate data values are not the only problems that can be caused by using UR. A dirty read can cause duplicate rows to be returned where none exist. Alternatively, a dirty read can cause no rows to be returned when one (or more) actually exists. Additionally, an ORDER BY clause does not guarantee that rows will be returned in order if the UR isolation level is used. Obviously, these problems must be taken into consideration before using the UR isolation level.

Keep in mind, too, that the UR isolation level applies to read-only operations: SELECT, SELECT INTO, and FETCH from a read-only result table. Any application plan or package bound with an isolation level of UR will use uncommitted read functionality for any read-only SQL. Operations contained in the same plan or package that are not read-only will use an isolation level of CS.

When is it appropriate to use UR isolation? The general rule of thumb is to avoid UR whenever the results must be 100 percent accurate. Following are examples of when this would be true:

  • Calculations that must balance are being performed on the selected data

  • Data is being retrieved from one source to insert to or update another

  • Production, mission-critical work is being performed that cannot contain
    or cause data integrity problems

In general, most DB2 applications are not serious candidates for dirty reads. In a few specific situations, however, the dirty read capability will be of major benefit. Consider the following cases in which the UR isolation level could prove to be useful:

  • Access is required to a reference, code, or look-up table that basically is static in nature. Due to the non-volatile nature of the data, a dirty read would be no different than a normal read the majority of the time. In those cases when the code data is being modified, any application reading the data would incur minimum, if any, problems.

  • Statistical processing must be performed on a large amount of data. Your company, for example, might want to determine the average age of female employees within a certain pay range. The impact of an uncommitted read on an average of multiple rows will be minimal because a single value changed will not greatly impact the result.

  • Dirty reads can prove invaluable in a data warehousing environment that uses DB2 as the DBMS. A data warehouse is a time-sensitive, subject-oriented store of business data that is used for online analytical processing. Other than periodic data propagation and/or replication, access to the data warehouse is read-only. Because the data is generally not changing, an uncommitted read can cause little damage and is well suited to a read-only environment. More data warehouse projects are being implemented in corporations worldwide and DB2 with dirty read capability is a very wise choice for data warehouse implementation.

  • In those rare cases when a table, or set of tables, is used by a single user only, UR can make a lot of sense. If only one individual can be modifying the data, the application programs can be coded such that all (or most) reads are done using UR isolation level, and the data will still be accurate.

  • Finally, if the data being accessed already is inconsistent, little harm can be done using a dirty read to access the information.

Although the dirty read capability can provide relief to concurrency problems and deliver faster performance in specific situations, it also can cause data integrity problems and inaccurate results. Be sure to understand the implications of the UR isolation level and the problems it can cause before diving headlong into implementing it in your production applications.

Summary

It is important for DB2 DBAs and application programmers to know the four isolation levels and their impact on SQL. Using the isolation levels is an effective way to control concurrency and locking for your DB2 applications.

Friday, June 12, 2009

The Mainframe is Number One!

Congratulations to the mainframe, just cited as #1 in the InfoWorld Hardware Hall of Fame.

Okay, so maybe they are listing the hardware chronologically, but I refuse to acknowledge that. The IBM System/360 mainframe is listed first -- and that means number one, right?

Actually, the entire list is interesting and entertaining, so take a moment to click over and review the InfoWorld Hardware Hall of Fame, which is littered with titans from the past (DEC, Compaq) as well as current offerings.

Thursday, May 28, 2009

Hasn't DB2 9 for z/OS Been GA for 2 Years Now?

Just a quick note to question the entire DB2 community about embracing the latest (I hesitate to call a 2-year-old product "new") version of DB2. Isn't it about time that most of the DB2 base moved to DB2 V9?

Why would people not be moving? Well, there is always the cost and time involved in migrating to a new version. Perhaps organizations are not willing to expend the resources needed to migrate as frequently as they did in the past.

Perhaps some organizations are waiting for others to shake the bugs out of a new release. But this is troublesome. The more organizations that follow that approach, the longer it takes for those that do adopt the latest version to fully test the software. Which means that waiting might not help.

There is also the possibility that some organizations do not find a compelling need for the new functionality offered by V9. Even if that is the case, most organizations still should want the performance gains to be had by upgrading. But to fully take advantage of that requires rebinding, which also takes time and resources to do properly.

And then there is the issue of support for new versions. Organizations need their DB2 tool providers (ISVs and IBM) to not only support, but exploit the new features of the "new" version. Remember a few years ago when IBM entered the DB2 tools market in a big way? One of their stated objectives was to provide day one support for new versions. I wonder what happened to that objective? It is years after V9 GA and many of the IBM tools (according to IBM's own web site) only offer a "very limited number of new functions and features of DB2 9."

So is your shop at DB2 V9 yet? If not, why? If so, howzitgoing?

Tuesday, May 26, 2009

Ensuring Data Integrity is a Tricky Business

The term "data integrity" can mean different things to different people and at different times. But at a high level, there really are two aspects of integrity with respect to databases: database structure integrity and semantic data integrity. Keeping track of database objects and ensuring that each object is created, formatted and maintained properly is the goal of database structure integrity. Each DBMS uses its own internal format and structure to support the databases, table spaces, tables, and indexes under its control. System and application errors at times can cause faults within these internal structures. The DBA must identify and correct such faults before insurmountable problems occur. Semantic data integrity refers to the meaning of data and relationships that need to be maintained between different types of data. The DBMS provides options, controls and procedures to define and assure the semantic integrity of the data stored within its databases.

Structural database integrity and consistency is critical in the ongoing administration of databases. If the structural integrity of the database is not sound, everything else will be suspect, too. There are multiple types of structural problems that can occur. Indexing problems are one example. Certain types of database maintenance can cause such problems, and DBAs need to be able to recognize the problem and rebuild the indexes to correct their structural integrity. Indexes are not the only database objects that utilize pointers. Many DBMSs use pointers to store very large objects containing text and image data, and these can become corrupted. In today's modern database systems, structural integrity problems are rare -- much rarer than they used to be.
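In DB2, for example, utilities can be used to verify and then repair such structures. A sketch (with hypothetical database and table space names) of checking, and then rebuilding, all of the indexes on a table space might look like this:

CHECK INDEX (ALL) TABLESPACE DBNAME1.TSNAME1

REBUILD INDEX (ALL) TABLESPACE DBNAME1.TSNAME1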

The more difficult and more pervasive problem is assuring the semantic integrity of the data. Getting that right requires proper design, processes that match your business requirements, good communication skills, and constant vigilance.

Perhaps the number one cause of data integrity problems is improperly designed databases. Just getting the data type and length correct for each column can go a long way to making sure the right data is stored. Think about it. If you need to store dates but the column is defined as CHAR(8), how can you enforce that only valid dates are stored? You would need to code program logic to accomplish that. But if the column is defined as DATE, then the DBMS would take care of it -- and more of the data would be likely to be correct.
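To illustrate, compare the following two hypothetical definitions. The first will happily accept a value such as '20091332'; the second will reject anything that is not a valid date:

CREATE TABLE EMP_V1                -- hypothetical table
 (EMPNO     CHAR(6) NOT NULL,
  HIRE_DATE CHAR(8));              -- accepts any 8 characters

CREATE TABLE EMP_V2                -- hypothetical table
 (EMPNO     CHAR(6) NOT NULL,
  HIRE_DATE DATE);                 -- DB2 enforces valid dates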

The DBA must also set up data relationships properly in the database. This is done using referential integrity (RI), a method for ensuring the "correctness" of data within a DBMS. People tend to over-simplify RI, stating that it is merely the identification of relationships between tables. It is actually much more than this. Of course, the identification of the primary and foreign keys that constitute a relationship between tables is a component of defining referential integrity. Basically, RI guarantees that an acceptable value is always in the foreign key column. Acceptable is defined in terms of an appropriate value as housed in the corresponding primary key (or perhaps null).

The combination of the relationship and the rules attached to that relationship is referred to as a referential constraint. The rules that accompany the RI definition are just as important as the relationship. These rules define how data is to be properly added to the databases and what happens when it is removed.

There are other mechanisms in the DBMS that DBAs can use to enforce semantic data integrity. Check constraints and rules can be applied to columns that dictate valid values. The DBMS will reject invalid data that does not conform to the constraints. More complex data relationships can be set up using database triggers.
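As a simplified, hypothetical sketch, the following DDL ties these ideas together by defining a referential constraint with a delete rule and a check constraint (in practice you would also create the unique indexes that back the primary keys):

CREATE TABLE DEPT
 (DEPTNO   CHAR(3)     NOT NULL PRIMARY KEY,
  DEPTNAME VARCHAR(36) NOT NULL);

CREATE TABLE EMP
 (EMPNO    CHAR(6)      NOT NULL PRIMARY KEY,
  WORKDEPT CHAR(3),
  SALARY   DECIMAL(9,2) CHECK (SALARY >= 0),   -- check constraint
  FOREIGN KEY (WORKDEPT) REFERENCES DEPT (DEPTNO)
    ON DELETE SET NULL);                       -- referential constraint rule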

Every DBA should take advantage of the mechanisms provided by the DBMS to ensure data integrity. When DBMS-provided methods are used, fewer data integrity problems are likely to be found. Fewer data integrity problems mean higher quality databases and more proficient end users. You have to know what integrity rules are proper for the DBMS to enforce. But once defined, many of those rules can be enforced by the DBMS.

And that is very good, indeed!

Tuesday, May 05, 2009

Approaches to Access Path Management... or The Five R's

BIND and REBIND are important components in assuring efficient DB2 applications. Because the BIND/REBIND process determines exactly how your DB2 data is accessed, it is important that you develop an appropriate strategy for when and how to REBIND your programs.

There are several common REBIND approaches taken by DB2 users. By far, the best approach is to REBIND your applications over time as the data changes. This approach involves some form of regular maintenance that keeps DB2 statistics up to date (or better yet, uses Real Time Statistics) and formulates new access paths as data volumes and patterns change.

Other approaches include REBINDing only when a new version of DB2 is installed, or perhaps more ambitious, whenever new PTFs are applied to DB2. Another approach is to REBIND automatically after a regular period of time (days, weeks, months, etc.). This approach can work if the period of time is wisely chosen based on the application data – but it still can pose administrative issues.

Another (unfortunately) popular approach can be summarized as “if it ain’t broke don’t fix it!” This is the worst of the several approaches discussed here. The biggest problem with this approach is that you are penalizing every program in your subsystem for fear that a program or two may have a degraded access path. This results in potentially many programs having sub-optimal performance because the optimizer never gets a chance to create better access paths as the data changes.

Of course, the possibility of degraded performance is real – and that is why this approach has been adopted at some sites. The problem is being able to find which statements have degraded. In an ideal world we would be able to review the access path changes beforehand to determine if they are better or worse. But DB2 itself does not provide any systematic method of administering access paths that way. There are third-party tools that can help you achieve this, though.

Anyway, let’s go back to the best approach again, and that is to REBIND on a regular basis as your data changes. This approach has become known as the Three R’s. To implement this approach you:

  1. Regularly REORGanize the data to ensure that it is optimally structured.
  2. Follow that with RUNSTATS to be sure that the reorganized state of the data is reflected in the DB2 Catalog.
  3. And follow that with a REBIND for all the application programs that access the data structures impacted by the REORG and RUNSTATS.
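As a sketch, with hypothetical object and plan names, the sequence might look like the following, where the REORG and RUNSTATS are utility statements and the REBIND is a DSN subcommand:

REORG TABLESPACE DB1.TS1

RUNSTATS TABLESPACE DB1.TS1 TABLE(ALL) INDEX(ALL)

REBIND PLAN(PLAN1)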

At any rate, your goal should be to keep your access paths up-to-date with the current state of your data. Failing to do this means that DB2 is accessing data based upon false assumptions. DB2 is unlikely to make the same access path choice as your data grows – and as patterns within the data change.

By REBINDing you can generally improve the overall performance of your applications because the access paths will be better designed based on an accurate view of the data. Additionally, as DB2 changes are introduced (PTFs, new version/release) optimizer improvements and new access techniques can be incorporated into the access paths. That is, if you never REBIND, not only are you forgoing better access paths due to data changes but you are also forgoing better access paths due to changes to DB2 itself.

Adopting the Three R’s approach can pose additional questions. For example, when should you reorganize? In order to properly determine when a REORG is needed you’ll have to look at statistics. This means looking at either RUNSTATS or RTS. So, perhaps it should be at least 4 R’s – in other words:

  1. RUNSTATS or RTS
  2. REORG
  3. RUNSTATS
  4. REBIND

Now it is true that some folks don’t rely on statistics to schedule a REORG. Instead, they just build the JCL to REORG their database objects when they create the object. So they create a table space then build the REORG job and schedule it to run monthly, or quarterly, or on some regular basis. This is better than no REORG at all, but it is probably not the best approach because you are most likely either reorganizing too soon (in which case you waste the CPU cycles to do the REORG) or you are reorganizing too late (in which case performance is suffering for a period of time before the REORG runs). Better to base your REORGs off of statistics and thresholds using either RUNSTATS or RTS.
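For example, on DB2 9, where RTS is housed in the DB2 Catalog, a simplified query along these lines could flag table spaces where more than 10 percent of the rows have changed since the last REORG (the 10 percent threshold is merely illustrative, and RTS columns can be null if statistics have not yet been externalized):

SELECT DBNAME, NAME              -- table spaces that may need a REORG
FROM SYSIBM.SYSTABLESPACESTATS
WHERE TOTALROWS > 0
AND (REORGINSERTS + REORGUPDATES + REORGDELETES) > (TOTALROWS * 0.10);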

Without accurate statistics there is little hope that the optimizer will formulate the best access path to retrieve your data. If the optimizer doesn’t have accurate information on the size, organization, and particulars of your data then it will be creating access paths based on either default or inaccurate statistics. Incorrect statistics will cause bad choices to be made – such as choosing a merge-scan join when a nested loop join would be better, or failure to invoke sequential prefetch, or using the wrong index – or no index at all. And the problem of inaccurate statistics is pervasive. There are shops out there that never, or rarely, run RUNSTATS to gather up-to-date statistics. Make sure yours is not one of those shops!

When should you run RUNSTATS? One answer is “As frequently as possible based on how often your data changes.” To do this you will need to know a thing or two about your data growth patterns: what is its make-up, how is it used, how fast does it grow, and how often does it change? These patterns will differ for every table space in your system.

Next, we need to decide when to REBIND. The best answer for this is when statistics have changed significantly enough to change access paths. When we know that data has significantly changed, it makes sense to REBIND after the RUNSTATS completes. But the trick is determining exactly when we have a “significant” change in our data. Without an automated method of comparing and contrasting statistics (or even better yet, access paths), coming up with an answer in a manual way can be time-consuming and error-prone – especially if we have thousands of DB2 programs to manage.

As we REBIND, we always must be on alert for rogue access paths. A rogue access path is created when the optimizer formulates a new access path that performs worse than the previous access path. This can happen for a variety of reasons. Of course, number one is that the optimizer, though good, is not perfect. So mistakes can happen. Other factors can cause degraded access paths, too. The access paths for volatile tables depend on when you run the RUNSTATS. Volatile tables are those that start out empty, get rows added to them during processing, and are emptied out at the end of the day. And, of course, if the catalog or statistics are not accurate we can get problems, too.

So adopting the Four R’s approach implies that you will have to develop a methodology for reviewing your access paths and taking care of any “potential” problem access paths. Indeed, the Four R’s becomes the Five R’s as we add a step to review the access paths after REBINDing to make sure that there are no rogue access paths:

  1. Start with a RTS (or use RUNSTATS) to determine when to REORG.
  2. REORGanize the table spaces (and indexes)
  3. After reorganizing, run RUNSTATS (to ensure the DB2 Catalog is up-to-date)
  4. Follow that with the REBINDs.
  5. Then we need that fifth R – which is to review the access paths generated by the REBIND.

The review is of utmost importance because the optimizer can make mistakes. And, of course, so can you. But your users will not call you when performance is better (or the same). They only dial your number when performance gets worse. As such, proactive shops will put best practices in place to test REBIND results, comparing the before-and-after impact of the optimizer’s choices. Again, there are tools that can help to automate this review process.