Showing posts with label Big Data. Show all posts

Monday, November 22, 2021

"Unstructured" Data and LOBs

Let's take a look at the burgeoning role of "unstructured" data and the impact of using LOBs to store that data in our Db2 databases.

The Growth of "Unstructured" Data

Although structured data remains the bedrock of the information infrastructure in most organizations, unstructured data is growing in importance. And there is much more "unstructured" data “out there” than structured. Indeed, analysts at IDC estimate that unstructured data accounts for as much as 90 percent of all digital information.

The rise of "unstructured" data is often attributed to the growing amount of multimedia data being adopted by organizations as they store more audio, video, and images. But that is only part of the story. Text, in the form of business forms, letters, documents, and most importantly e-mail, greatly contributes to the growing importance of unstructured data.

Now I'm sure you've noticed the quotes I've been putting around the word "unstructured," and they are there for a reason. You see, unstructured is a horrible adjective to apply to this type of data. I refer you to an article I wrote called The Big Unstructured Data Lie for why I think this way. Nevertheless, the term "unstructured" will continue to be used to describe this type of data, so from now on I will stop using the quotes...

As organizations expand their digital information footprint, more types of unstructured data are being captured, stored, and made available for analysis. And not just internally generated data, but from more external data sources, too.

Db2 and Unstructured Data

Db2 for z/OS can be used to store unstructured data using BLOB, CLOB, and DBCLOB data types – collectively known as LOBs. Historically, LOBs were added to relational products like DB2 to compete more capably against the Object-Oriented databases of the time – this was back in the late 1990s. The idea was to enable relational tables to house unstructured data like images and audio and video data.

But Db2 for z/OS users were slow to adopt LOBs in their mainframe databases. This was due to several different reasons, not the least of which is that it took several versions of Db2 for LOBs to mature into capable, useful things for enterprise applications. Early implementations of LOBs in Db2 were somewhat kludgy and difficult to administer and use. But IBM has corrected many of those deficiencies over time and there are now tools that can help organizations to exploit and effectively manage Db2 LOBs, too.

The other new force driving LOB usage is the whole Big Data movement. Big Data is driving organizations to accumulate and analyze more data, and more varied types of data, to gain business insight. A specific example of Big Data driving the usage of LOBs in Db2 is the JSON support that has been added to Db2. JSON objects are stored in Db2 as BLOBs.
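To make that concrete, here is a hedged sketch of how JSON can land in a BLOB column. The table, column, and database names below are invented for illustration; the SYSTOOLS functions shown are the ones documented with the Db2 11 for z/OS JSON support, but verify the details against your release before using them.

```sql
-- Illustrative only: all object names here are made up.
-- (A table with a LOB column also needs a ROWID; Db2 can
-- generate an implicit hidden one.)
CREATE TABLE ORDER_JSON
  (ORDER_ID  INTEGER NOT NULL,
   ORDER_DOC BLOB(1M));   -- the JSON document, stored as a BLOB

-- The JSON support converts JSON text to a binary (BSON) BLOB...
INSERT INTO ORDER_JSON (ORDER_ID, ORDER_DOC)
  VALUES (1, SYSTOOLS.JSON2BSON('{"customer": "ACME", "total": 100.00}'));

-- ...and JSON_VAL extracts values from the stored BLOB
SELECT SYSTOOLS.JSON_VAL(ORDER_DOC, 'customer', 's:20')
  FROM ORDER_JSON;
```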

So, more and more organizations are adopting LOB data in their Db2 databases – to support unstructured data, for their big data projects, and to store documents and multimedia data.

Nevertheless, sometimes I hear DBAs say things like “Oh, well, I don’t use LOBs so I don’t really need to worry about them.” That can cause trouble, because you have been using LOBs for some time now, whether you know it or not. LOB data has been part of the Db2 Catalog since Version 7 and the number of LOB columns used in the Db2 Catalog has been increasing over the past couple of Db2 releases. 

As you can see in the table below, the Db2 Catalog now has over 40 LOB columns. So, even if you have not created any user Db2 tables with LOBs, you have system Db2 tables with LOBs. For example, SYSIBM.SYSVIEWS contains a CLOB column holding the source text used to create the view. 

LOBs are also used in the Db2 Directory: in the DBD01 “table,” a BLOB column is used to store the DBD data (up to 2GB), and in the SPT01 “table” there are two BLOB columns for storing data and explain information.

Using LOBs in Db2

I will not attempt to train you on the usage of LOBs in Db2 in this blog post. Suffice it to say that LOBs require different management and administration tactics to ensure their accuracy and usability. But I do want to bring up some of the administration concerns that using LOBs can create.

The first thing to keep in mind is that most LOBs are larger than your typical column data. I mean, it’s right there in the name: LOB = Large OBject. As the size of data increases, so do the management concerns, such as lengthy elapsed times to run utilities on the data, slower performance when accessing the data, and so on. You’ll also need to decide whether or not to log changes to your LOB data. If you log LOB changes, you can stress your logs because of the size of the LOB data. If you do not log LOB changes, you will need to make sure that you have sufficient methods to recover the LOB data, because the changes between image copies will not be on the log. Generally speaking, most shops choose to avoid logging LOBs. You can turn off LOB logging by specifying NOT LOGGED in the LOB table space DDL.
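For example, a LOB table space defined without logging might look like this (the table space, database, and buffer pool names are hypothetical):

```sql
-- Hedged sketch: turn off logging for a LOB table space
CREATE LOB TABLESPACE TSPHOTO
  IN DBMEDIA
  NOT LOGGED        -- changes to LOBs here are not written to the log
  BUFFERPOOL BP0;
```

If you go this route, remember that image copies become your recovery lifeline, so take them frequently.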

There are also many restrictions on how LOB data can be used with SQL. LOB data is not like traditional, structured database data, so Db2 imposes some limitations, for example:

  • You cannot use a LOB column in a GROUP BY or ORDER BY clause
  • You cannot specify SELECT DISTINCT on a LOB column
  • You cannot reference a LOB column in the INCLUDE(column-name) clause of a MERGE statement
  • You cannot define check constraints, primary keys, unique constraints, or foreign keys on LOB columns
  • You cannot use a LOB column in any predicate except EXISTS, LIKE, and NULL

This list is for illustrative purposes only. There are other restrictions, all of which can be found in the IBM Db2 SQL Reference manual.

Unless it is an inline LOB where the entire LOB is stored in the base table, a LOB will require a LOB table space, auxiliary table, and LOB index. When building auxiliary tables and indexes, you do not specify columns the same way that you do for normal tables and indexes. For the Auxiliary Table you specify the LOB column and base table and Db2 automatically generates the columns needed. For the Auxiliary Index you just specify the auxiliary table and Db2 implicitly generates the needed index keys.
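A minimal sketch of the three objects backing one (non-inline) LOB column follows; every object name below is invented for the example:

```sql
-- The LOB table space that will hold the auxiliary table
CREATE LOB TABLESPACE TSLOB IN DBMEDIA;

-- The auxiliary table: you name only the base table and LOB column,
-- and Db2 generates the columns of the auxiliary table itself
CREATE AUXILIARY TABLE EMP_PHOTO_AUX
  IN DBMEDIA.TSLOB
  STORES EMP          -- the base table
  COLUMN PHOTO;       -- the LOB column in the base table

-- The auxiliary index: no key columns are specified;
-- Db2 implicitly generates the index keys
CREATE UNIQUE INDEX XEMP_PHOTO_AUX
  ON EMP_PHOTO_AUX;
```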

Each table can have zero, one, or many LOB columns, and each LOB instance can be up to 2GB in size. Each table with at least one LOB column must have a ROWID; the ROWID is a 17-byte varying-length field. One page of a LOB table space will never contain more than one LOB, but one LOB can span multiple LOB table space pages. An auxiliary table, which resides in the LOB table space, can store only one LOB column of a base table, and there must be one and only one index on that column. All of these things can alter the way in which you manage and administer your Db2 tables and table spaces.

Let’s backtrack and review the size of LOBs in a little more depth. Each LOB instance can be up to 2GB – and that is per row! Each LOB table space can have as many as 254 data sets with a DSSIZE of 2GB to 64GB each, for a total of about 16 terabytes (TB). And that is per partition, so if there are 4096 partitions, the total size for a single LOB column is over 65,000 TB. That is big!

Unless all of your LOB data is static – meaning it never changes – the size of your LOB data sets will continue to grow. Are you prepared for running utilities on such large table spaces?

When you drop a LOB column from a base table, Db2 will not automatically clean up the LOB table space. After removing the LOB column you can either drop the LOB table space yourself explicitly, or perhaps reuse it for another LOB.

And finally, LOB columns are not really updated. The old version of the LOB is de-allocated, and a new LOB is allocated. So LOBs are a bit different than the traditional data we are used to managing.

What can go wrong with LOBs?

Errors with LOBs occur when there are inconsistencies between the components of the LOB. We all know that “normal” Db2 indexes can be inconsistent with their associated table, but the issues are multiplied for LOB indexes:

  1. The ROWID-Version number in the base table row may not be found in the LOB index.
  2. There may be entries in the LOB index that are not referenced by any row in the base table.
  3. The LOB data itself may not be where the LOB index points to.
  4. There may be LOBs in the LOB table space that are not referenced by the LOB index.

CHECK DATA can be used to find errors 1 and 2 (from the list above); CHECK LOB can be used to find errors 3 and 4. But it is possible that CHECK LOB will convert a type 4 error into a type 2 error, so proceed with caution.

Then there is the issue of LOB index consistency. If the LOB index is inconsistent with the base table data, the LOB data cannot be accessed. There is no direct access to the LOB table space except through the LOB index. If the LOB index is inconsistent with the LOB table space, Db2 will get errors trying to access the LOB data for that row.

Maintaining LOBs

Db2 maintains LOB data in the LOB table space using a hierarchical structure.

LOB data in the LOB table space can be distributed over many different pages of the LOB table space. Remember, this LOB data is very large. Db2 uses a structure of map pages to point to data pages. At the top is the first map page and it is this page number that is stored in the LOB index. This first map page contains a list of pages, which can be other map pages and data pages. It also contains the total size of the LOB data. If all the data pages are not referenced by map pages or if the map pages are not properly referenced by a higher level map page, LOB data will be lost.

With all of these pointers and structures to maintain, there are a variety of things that can go wrong. To verify that your LOBs are structurally sound you must run a series of Db2 utilities, in the following order:

  1. Run CHECK DATA to verify that the ID fields specified in the base table are also found in the LOB index. 
  2. Run CHECK INDEX to verify that the LOB index is valid.
  3. Run CHECK LOB to verify that the internal structure of the LOB table space is sound.
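In utility control statement form, the sequence above might look something like this. The table space names are made up, and the options shown are illustrative rather than a recommendation; check the Utility Guide for the choices appropriate to your shop:

```sql
-- 1. Verify the base table ROWID/version values against the LOB index
CHECK DATA TABLESPACE DBMEDIA.TSBASE
  SCOPE AUXONLY AUXERROR REPORT

-- 2. Verify the LOB (auxiliary) index itself
CHECK INDEX (ALL) TABLESPACE DBMEDIA.TSLOB

-- 3. Verify the internal structure of the LOB table space
CHECK LOB TABLESPACE DBMEDIA.TSLOB
```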

Of course, there are easier ways. It can make a lot of sense to consider using a modern tool that understands the nuances of LOBs so you can manage them appropriately.

The Bottom Line

Business and industry trends dictate that unstructured data, in the form of LOBs, is increasingly being stored in our Db2 databases. This type of data is different than traditional, structured data and must be managed with these differences in mind. To do so requires in-depth knowledge and planning to avoid inconsistencies and errors. 



Tuesday, September 08, 2020

Know Your Db2 Universal Table Spaces

One of the biggest changes in the last decade or so has been the introduction of new types of table spaces – known as Universal table spaces, or UTS. Not only are UTS newer than the traditional types of table spaces, they are quickly becoming the de facto standard type of table space for Db2 applications, new and old.

At some point, Universal table spaces will displace your existing segmented and classic partitioned table spaces. We’ll examine why this is so later in the post, but first let’s briefly describe what Universal table spaces are.

Two Types of Universal Table Spaces

Introduced in Db2 9 for z/OS, Universal table spaces combine the best attributes of partitioned and segmented table spaces. If you do not know what partitioned and segmented table spaces are, I refer you to this older article I wrote on DB2 Table Space Options to bring you up to speed (note that this article is almost 20 years old at this point).

Universal table spaces offer improved space management for variable length rows because they use space map pages (like segmented table spaces). Also, like segmented table spaces, UTS deliver improved mass delete performance, and you can immediately reuse the table segments after the mass delete. And like partitioned table spaces, Universal table spaces can grow large (up to 128TB of data) and consist of multiple partitions.

At a high-level, there are two types of Universal table spaces:

1. Partition-by-growth (PBG): A PBG UTS creates new partitions as the amount of data grows, without the need to specify key ranges. This type of UTS is beneficial for tables that grow over time, need the larger limits afforded by partitioning, and can benefit from the space management of segmented table spaces.

2. Partition-by-range (PBR): A range-partitioned, or PBR, UTS requires a key range for partitioning, like classic partitioned table spaces. A PBR UTS basically adds segmentation to the existing partitioned table space.

 

Both types of UTS can contain only a single table, but IBM presentations have indicated that this is likely to change at some point in the future (although nothing has been announced or confirmed for certain).

A partition-by-range UTS is basically a segmented, partitioned table space. The limit key ranges must be specified in the table DDL. Index partitioning, which was supported for the earliest classic partitioned table spaces, is not supported for a PBR UTS. So before converting your classic partitioned table spaces to PBR UTS, you must first convert from index-controlled partitioning to table-controlled partitioning. Check out this blog post for a trick to quickly convert to table-controlled partitioning.
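For reference, the commonly cited trick is that altering the partitioning index so it is no longer the clustering index causes Db2 to switch the table to table-controlled partitioning. The index name below is hypothetical, and you should verify the behavior on your Db2 release before relying on it:

```sql
-- Flip index-controlled partitioning to table-controlled partitioning
ALTER INDEX IXPART NOT CLUSTER;

-- Optionally restore the clustering attribute afterward;
-- the table remains table-controlled partitioned
ALTER INDEX IXPART CLUSTER;
```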

The second type of UTS is the partition-by-growth Universal table space. As its name implies, a PBG UTS can automatically add a new partition as the data in the table space grows. Over time, as the UTS is used by applications, data gets added to the table. When the PBG UTS reaches its maximum size, a new partition is automatically added to the table space. The new partition uses the same characteristics as the existing partitions, including compression details, free space, and so on.

You control the type of UTS using the DDL keywords: NUMPARTS, MAXPARTITIONS, and SEGSIZE. To create a PBR UTS you specify both NUMPARTS and SEGSIZE. To get a PBG UTS you must code the MAXPARTITIONS and SEGSIZE parameters. MAXPARTITIONS indicates the limit on the total number of partitions that a PBG UTS can grow to. Be careful, because if you only code the NUMPARTS parameter without SEGSIZE, then you will create a traditional partitioned table space. If you only code the SEGSIZE parameter (without either NUMPARTS or MAXPARTITIONS) you will create a traditional segmented table space.
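A sketch of those keyword combinations, with invented table space and database names (note that later Db2 12 function levels changed some defaults, so consult the SQL Reference for your level):

```sql
-- PBR UTS: NUMPARTS and SEGSIZE together
CREATE TABLESPACE TSPBR IN DBSALES
  NUMPARTS 12
  SEGSIZE 64;

-- PBG UTS: MAXPARTITIONS and SEGSIZE together
CREATE TABLESPACE TSPBG IN DBSALES
  MAXPARTITIONS 100
  SEGSIZE 32;

-- NUMPARTS alone => classic partitioned table space
-- SEGSIZE alone  => traditional segmented table space
```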

Db2 12 for z/OS

A significant new feature for supporting big data was introduced in Db2 12, relative page numbering (or RPN) for range-partitioned table spaces. An RPN range-partitioned table space can be created, or an existing range-partitioned table space can be changed to RPN via an ALTER TABLESPACE with PAGENUM RELATIVE, followed by an online REORG of the entire table space.

An RPN table space delivers many benefits for availability and storing large amounts of data. This requires an expanded RID, which increases from 5 bytes to 7 bytes.

From an availability perspective, you can specify DSSIZE at the partition level for RPN table spaces. Furthermore, the allowable DSSIZE value is no longer dependent on the page size and number of table space partitions. The DSSIZE change can be an immediate change (no online REORG required to take effect) as long as the change does not decrease the DSSIZE value. You still can decrease DSSIZE, but only at the table space level.
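For example, converting an existing PBR table space to RPN and then adjusting one partition's DSSIZE could look like this. The names are invented, and the PAGENUM change is a pending change until the REORG runs:

```sql
-- Convert an existing range-partitioned table space to RPN
ALTER TABLESPACE DBSALES.TSPBR PAGENUM RELATIVE;
-- ...then materialize the change with an online REORG
-- of the entire table space

-- With RPN in effect, DSSIZE can be altered at the partition level,
-- and an increase takes effect immediately (no REORG required)
ALTER TABLESPACE DBSALES.TSPBR
  ALTER PARTITION 3 DSSIZE 256 G;
```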

From a big data perspective, the DSSIZE can grow up to 1 TB for a partition. And the maximum table size increases to 4 PB with approximately 280 trillion rows per table. That is a lot of data that can be stored. Think about it this way: if you were to insert 1000 rows per second it would take more than 8000 years to fill the table to capacity!

Why Are Universal Table Spaces the Future of Db2?

As of today (September 2020, Db2 12 for z/OS), there are basically five types of table spaces from which to choose:

1. Segmented table spaces
2. Universal Partition-By-Growth (PBG) table spaces
3. Universal Partition-By-Range (PBR) table spaces
4. Universal Partition-By-Range Relative Page Number (PBR RPN) table spaces
5. Classic partitioned table spaces

Of course, for new databases, it is best to remove the classic partitioned table space from consideration because the PBR UTS is more efficient (and classic partitioning will likely be deprecated at some point). Technically speaking, there are actually two other types of table spaces (LOB and XML table spaces), but they are not general-purpose table spaces and can be used only in specific situations (with LOB and XML data).

So why do I advise that you favor Universal table spaces over segmented whenever you can? Well, for a number of reasons. First of all, Universal table spaces are newer and are all you really need for almost every Db2 implementation. Secondly, many new features of Db2 can only be used with Universal table spaces. Newer features that only work with UTS include:

  • Clone tables
  • Hash-organized tables
  • Currently committed locking
  • Pending DDL
  • Inline LOBs
  • XML multi-versioning
  • ALTER TABLE with DROP COLUMN

And this trend is likely to continue. As IBM introduces new function levels and versions of Db2 with new features that only work with UTS, it will become increasingly difficult for DBAs to keep track of which table spaces are not UTS so that they can make sure they are not using any new features that will not work with their old types of table spaces.

What this means is that other than Universal table spaces, the only other type you should be using are segmented table spaces and then only when you absolutely must have a multi-table table space. Therefore, the best practice I recommend is to make all new table spaces Universal (except for multi-table table spaces which can be segmented).

So, what is the future of the segmented table space? For the immediate future, segmented table spaces will continue to be supported. My guess is that, at some point, IBM will deliver a multi-table UTS capability, and then at some point deprecate segmented table spaces. But this is only my guess. As of the date I am writing this, IBM has not committed to a multi-table UTS and the segmented table space is still the recommended (and only) method for assigning multiple tables into a single table space.

My general recommendation though is that you avoid multi-table table spaces unless you have many very small tables and are close to reaching the open data set limit (200,000). Of course, your limit may be lower depending on the setting of the DSMAX subsystem parameter, which specifies the maximum number of data sets that can be open at one time. Acceptable values range from 1 to 200,000; a lower setting may be specified due to operating system constraints or storage/memory limitations.

My general recommendation for table spaces is to slowly work on a conversion project to migrate your classic partitioned table spaces to PBR UTS and your segmented table spaces to PBG UTS. Doing so will bring you to the latest and greatest Db2 table space technology and position you to be able to use all new functionality in current and future versions of Db2 whenever – and wherever – you see fit.

Summary

To make sure that your systems are up-to-date and ready for new functionality it makes sense to adopt Universal table spaces for all of your Db2 tables. The only exception is for multi-table segmented table spaces, and you shouldn’t have too many of them.

Good luck Universalizing your Db2 databases!

Thursday, January 17, 2019

Making Plans for IBM Think 2019


I'm looking forward to once again attending IBM Think, IBM's flagship technology conference. I attended the inaugural Think conference in 2017 and it was one of the highlights of that year. This year IBM Think is being held in San Francisco February 12 thru 15 at the Moscone Center and surrounding hotels. San Francisco is a wonderful location because it will give the conference more room to accommodate the large crowds more comfortably than the 2017 Las Vegas venue.

One of the great things about Think is the breadth and scope of pertinent technical content that it covers. So whether you are a developer, a DBA, a data scientist, a manager, or any flavor of IT specialist, there will be a plethora of useful sessions and activities to educate and make you “think.”

Now you all know that my primary background is database administration and Db2, but I also work with and have interest in many other technologies, including data governance, security and data protection, DevOps, machine learning, AI, blockchain, quantum computing, and cloud computing. And the great thing about the IBM Think conference is that it provides in-depth coverage of all of these areas, and more.

A big struggle for such a large event (expected attendance in excess of 30,000) is finding what you need. Well, IBM Think makes it a bit easier because it is broken down into campuses that each focus on a specific area. This year’s campuses include:
  • Smarter Business Showcase
  • Data & AI Campus
  • Cloud & Infrastructure Campus
  • Security & Resiliency Campus

There will be more than 2,000 business strategy sessions and technical deep dives over the course of the week, along with professional development opportunities from 100s of hands-on labs and certification exams.

One of the big highlights of IBM Think is always the great speakers, and this year is no exception. From IBM speakers like CEO Ginni Rometty and Sr. VP Hybrid Cloud Arvind Krishna, to industry speakers like Founder & CEO of Mogul Tiffany Pham and AT&T CEO John Donovan, to researchers like MIT Media Lab and Harvard research specialist Dr. Kate Darling, to entertainers like Super Bowl MVP Joe Montana and skateboarding legend Tony Hawk, there will be a lot of knowledge imparted. I’m particularly looking forward to hearing Paul Cormier, EVP and President of Products and Technologies at Red Hat, discuss how the IBM / Red Hat combination is working.

Another advantage of attending IBM Think is the access to exclusive information about IBM products, technologies, strategies, and services that are sure to be shared during the event. IBM always unveils a ton of great stories and technologies at Think.

I’ll be live-tweeting at IBM Think 2019, so be sure to follow me at twitter.com/craigmullins so you can experience Think right along with me, as it happens. Some of the sessions I plan on attending include topics on governed data science, using machine learning to prioritize business issues, and Db2 on cloud... but those are just the tip of the tech iceberg.

And finally, it is not too late. Click here if you want to attend IBM Think 2019… If you do, maybe I’ll see you there amongst 30,000 of our IT friends!

Monday, November 02, 2015

IBM Insight 2015 Wrap-Up

Last week I attended the IBM Insight conference and blogged about the first few days of the conference here at http://db2portal.blogspot.com/2015/10/an-update-from-ibm-insight-2015.html… and I promised to blog about the remainder of the conference, so here is a synopsis of the highlights.


On Wednesday, the focus of the general session was on IBM’s acquisition of The Weather Company’s technology.  The deal calls for IBM to acquire The Weather Company’s B2B, mobile and cloud-based web properties, including WSI, weather.com, Weather Underground and The Weather Company brand. IBM will not be acquiring The Weather Channel television station, which will license weather forecast data and analytics from IBM under a long-term contract. IBM intends to utilize its newly acquired weather data in its Watson platform.

The deal is expected to close in the first quarter of 2016. Terms were not disclosed.

You can read all about the acquisition in this IBM press release.

I spent some of my time at Insight this year learning more about dashDB, and it is a very interesting technology. Marketed as data warehousing in the cloud, IBM touts four use cases for dashDB: as a standalone cloud data warehouse, as a data store for data scientists, as part of a hybrid data warehouse, and for NoSQL analysis and rapid prototyping.

IBM promotes simplicity, performance, analytics on both traditional and NoSQL data, and polyglot language support as the most important highlights of dashDB. And because it has DB2 BLU under the covers, IBM dashDB not only super-compresses data, but it can operate on that data without necessarily decompressing it.

Additionally, a big theme of the conference was in-memory technology, and dashDB sports CPU cache capabilities. In fact, I heard several folks at the conference say some variation of “RAM is too slow”… meaning that CPU cache is faster and IBM is moving in that direction.

The bottom line for dashDB is that it offers built-in high availability and workload management capabilities, along with being in-memory optimized and scalable. It is worth a look for folks needing a powerful data warehousing platform.

For you DB2 for z/OS folks, IDAA was a big theme of this year’s Insight conference. The latest version, V5.1, adds advanced analytics capabilities and in-database transformation, making the mainframe queries that can take advantage of the accelerator faster than ever.

Apache Spark was another pervasive topic this year. It was talked about in multiple sessions and I even had the opportunity to play with it in a hands-on lab. The big news for z folks is that IBM is bringing out a version of Spark for the mainframe that will run on z/OS – it is already supported on zLinux.

Of course, I attended a whole slew of DB2 sessions, including SQL coding, performance, and administration presentations. Some of the highlights included the announcement of DB2 11 for LUW, several discussions about dark data, and a lot of information about IBM's Big SQL and how it can be used to rapidly and efficiently access Hadoop (and other unstructured) data using SQL.

I live-tweeted a bunch of highlights of those sessions, too – indeed, too many to include here. If you are interested in catching everything I have to say about a conference, keep reading these blog posts, of course, but you should really follow me on Twitter, too, at http://twitter.com/craigmullins

I also had the honor of delivering a presentation at this year's conference on the changes and trends going on in the world of DB2 for z/OS. Thanks to the 70 or so people who attended my session - I hope you all enjoyed it and learned something, too!

As usual, and as you well know if you've ever attended this conference before, there was also a LOT of walking to be done. From the hotel to the conference center to the expo hall to lunch and back to the conference center. But at least there were some signs making light of the situation this year!

There was a lot of fun to be had at the conference, too. The vendor exhibition hall was stocked with many vendors, big and small, and it seems like they all had candy. I guess that’s what you get when the conference is so close to Halloween! The annual Z party at the House of Blues (for which you need a Z pin to get in – this year’s pin was orange) was a blast, and the Maroon 5 concert courtesy of Rocket Software was a lot of fun, too.

If you are looking for a week of database, big data, and analytics knowledge transfer, the opportunity to chat and connect with your peers, as well as some night-time entertainment, be sure to plan to attend next year’s IBM Insight conference (October 23 thru 27, 2016 at the Mandalay Bay in Las Vegas).

Tuesday, June 09, 2015

New England DB2 User Group

Just a brief blog post today to promote my upcoming speaking engagement at the New England DB2 User Group, in Sturbridge, MA (just outside of Boston). If you are in or around the area on June 18, 2015 be sure to stop by and participate in the meeting!

I will be giving two presentations (in the afternoon) that day. First up is my Big Data introductory presentation titled A Big Data Roadmap for the DB2 Professional. This was voted one of the Top Ten presentations at IDUG North America in 2014. My second presentation is a little bit different than what I normally present. It is titled Reduce Costs by Tuning DB2 to Reduce Your Rolling 4 Hour Average and in it I will walk through the components of subcapacity pricing and variable workload license charge... and how to use that knowledge to tune and reduce your monthly mainframe costs.

There are two other speakers that day, including a pitch from IBM on migrating to DB2 11 for z/OS and another presentation from the State of CT on IBM's PureData system.

So if you are going to be in the New England area be sure to put the NEDB2UG meeting on your agenda.

Hope to see you there!

Tuesday, May 12, 2015

A Trip Report from the 2015 IDUG DB2 Tech Conference

Last week I attended, and spoke at, the annual North American IDUG DB2 Tech Conference in Philadelphia, PA. As usual, the event was busy and chock full of useful and interesting DB2 information.

My week started on Sunday with the IBM gold consultant briefing, whereby IBM talks to the gold consultants about their upcoming plans and solicits feedback from us. I can’t really share with you what happened there, but as is almost always the case, it was time well spent.
The conference kicked off in earnest on Monday with the keynote from IBMers Tim Vincent and Namik Hrle titled “Big Data, Analytics and the Future of Data Management.” Tim and Namik discussed how the growth of data is fueling innovation causing a big change in the way value is created. Some of the key takeaways from the keynote, for me at least, were:
  • The predominant source of Big Data for most projects is still structured transactions
  • Primary focus of most data projects is usually on understanding customers
  • There is a trend toward self-service
  • Clearly there is value in Hadoop but you can't replace all your DBMSes with it!

Perhaps the most salient tidbit shared at the keynote address was this: “People have forgotten, or never truly understood, how complex data integration actually is.” I think this is true; all too often people underestimate how difficult the integration of data will be. And I agree, too, with the sentiment of the presenters who declared that “We need to realize that data is never going to be in one place.”

The bottom line of the keynote: All of our existing DB2 skills will still be relevant but must co-exist with skills in newer technologies like NoSQL and Hadoop.

Good stuff!

Some of the other highlights of the week:
  • Attendance seemed to be higher than at the past few IDUG conferences. And I see growth in the number of DB2 for LUW attendees. IDUG, at least historically, was mostly attended by DB2 for z/OS folks. And it is probably still true that more z/OS folks attend than LUW folks, but the LUW attendance continues to grow and may surpass z/OS at some point! Of course, this is all based on my unscientific eyeballing of attendees at presentations.
  • My session on DB2 performance had a full room and nobody seemed to nod off or run for the exits while I spoke. I also delivered a VSP for Dell to a room full of DBAs and developers, as well as a couple of presentations for DataKinetics during an evening session. So that was all good!
  • I attended a couple of sessions on newer features of DB2 and how some of them are under-utilized. I think this speaks to a few trends hitting the world of DB2 these days. Number one: training is drying up. If you don’t get training on new features it is hard to use them appropriately. Number two: people fail to take advantage of free learning opportunities like webinars and freely-available PDF manuals.
  • The vendor exhibit hall was active and well-attended throughout the week. All the usual suspects were there like BMC, CA, dbi, Dell, and IBM, but there were some newer (at least to IDUG) companies, too, like DataKinetics and Imperva.
  • The educational seminars were integrated into the conference this year. And they did not cost extra to attend. That means that more people attended ed seminars this year (at least I think they did) and the conference offered more value to attendees looking for more in-depth education than can be gained in a one-hour session.


All in all, the 2015 North American IDUG Tech Conference was a big success. And it is not too early to start planning for next year’s conference, which will be in Austin, TX. Finally, I’ll be able to drive to an IDUG… see you there in 2016!

Monday, April 21, 2014

A Little Bit About LOBs

In today's blog post we will take a brief look at LOBs, or Large OBjects, in DB2. I have been prepping for my webinar later this week, titled Bringing Big Data to DB2 for z/OS with LOBs: Understanding, Using, and Managing DB2 LOBs. Be sure to click on the link and join me on April 24, 2014 for that webinar!

But back to the topic du jour... LOBs. Let's start with a bit of history. LOBs were added to relational products like DB2, back in the 1990s, ostensibly to compete more capably against the Object-Oriented databases of the time. Remember them? Back then it seemed that everybody thought OO DBMS products would supplant relational technology. Seems somewhat quaint now, doesn't it?

At any rate, the idea was to enable relational tables to house unstructured data like images and audio and video data. DB2 for z/OS users were slow to adopt LOBs in their mainframe databases. I think that is due to several different reasons, not the least of which is that it took several new versions of DB2 for LOBs to mature into capable, usable things for enterprise applications. Early implementations of LOBs in DB2 were somewhat kludgy and difficult to administer and use. But IBM has corrected many of the deficiencies over time and there are now tools that can help us to effectively manage DB2 LOBs, too.

The other new force driving LOB usage is the whole Big Data movement. Big Data is a force that is driving organizations to accumulate and analyze more data, and more varied types of data, to gain business insight. The most common definition of Big Data was coined by Forrester Research, which defines big data in terms of “The 4 V’s” -- volume, velocity, variety, variability. But that is somewhat limiting, and I do not want to turn this posting into a definition of big data, so... let's just think of Big Data as MORE DATA, MORE TYPES OF DATA, and FASTER GENERATION OF DATA. One example of Big Data driving the usage of LOBs in DB2 is the JSON support that has been added to DB2. JSON objects are stored in DB2 as BLOBs.

So LOBs have been around for a while now, but it has taken some time for them to gain significant levels of usage "out there!"

Why LOBs, you may ask? Why not just store the data in regular old DB2 data types like VARCHAR or VARGRAPHIC? The basic answer is that those data types top out at around 32 KB, which is not nearly large enough to hold this kind of data. Some multimedia data can get VERY large. For example, a high resolution video requires about 3 GB for each hour. And high-def TV video requires 720 GB/hour!

But an additional part of the answer is that you do not always want to access the LOB data with the traditional data. For example, if you are querying employee information you probably don't always want their photo or their entire resume... so the data can be stored in a LOB, external to the rest of the data (so as not to impede performance) but connected to the data (when you need to access it).

There are three types of LOBs supported by DB2:
1. BLOB – Binary Large Object – for binary data
2. CLOB – Character Large Object – for text data
3. DBCLOB – Double Byte Character Large Object – for graphic character data

Furthermore, there are two options for specifying LOBs in DB2 tables:

  1. The traditional way to define LOBs where the LOB data is stored separately from the rest of the data. This requires defining additional database objects.
  2. A newer method for smaller LOBs, introduced with DB2 10, called inline LOBs.

Using the traditional method, LOBs are defined such that they are stored external to the base table. The column is specified as a BLOB, CLOB or DBCLOB in the base table, but DB2 stores the actual LOB data in an auxiliary table in a LOB table space. You must define a LOB table space per LOB per partition. So, let’s say we are creating a table with 2 LOBs in a table space with ten (10) partitions. That means we will need to define 20 LOB table spaces – 10 for the first LOB (one per partition) and 10 for the second LOB (one for each partition). An auxiliary table is defined for each LOB table space to hold the LOB data. And an auxiliary index is required on each auxiliary table. 

Now back to the base table. Remember that we have defined the LOB columns in the base table. But we also need to include a ROWID column in the base table. Only one ROWID column is needed per base table no matter how many LOBs you have defined in the table.


Don’t confuse the ROWID with other concepts that may seem to be similar. A ROWID is not an identity column nor is it a SEQUENCE. The ROWID is used by DB2 behind the scenes to connect the base table row to the auxiliary table row that contains the LOB data. You generally will not access the ROWID, although it is possible to do so just like any other column using SQL. With the ROWID you can perform direct row access. This is so because the row ID value implicitly contains the location of the row. 
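Putting the pieces above together, the traditional setup looks something like the following sketch. All of the names (the EMP table, the MYDB database, the table spaces, and so on) are hypothetical, made up for illustration; check the DB2 for z/OS SQL Reference for the full syntax and options before using anything like this.

```sql
-- Base table: one CLOB column plus the required ROWID column.
CREATE TABLE EMP
  (EMPNO      CHAR(6)     NOT NULL,
   LASTNAME   VARCHAR(40) NOT NULL,
   EMP_ROWID  ROWID       NOT NULL GENERATED ALWAYS,
   RESUME     CLOB(1M),
   PRIMARY KEY (EMPNO))
  IN MYDB.TSEMP;

-- A LOB table space to hold the CLOB data
-- (remember: one per LOB column per partition).
CREATE LOB TABLESPACE TSRESUME
  IN MYDB
  LOG NO;

-- The auxiliary table that actually stores the LOB values.
CREATE AUXILIARY TABLE AUX_RESUME
  IN MYDB.TSRESUME
  STORES EMP
  COLUMN RESUME;

-- The required auxiliary index (note: no column list is specified).
CREATE UNIQUE INDEX XAUX_RESUME
  ON AUX_RESUME;
```

And, tying back to the direct row access point: a query with an equality predicate on the ROWID column (WHERE EMP_ROWID = ?) is the kind of statement that allows DB2 to go straight to the row without an index or table space scan.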

Finally, each LOB column also has a 2 byte version number associated with it. This is used by DB2 when LOB data is modified. You cannot access the version number.

Now as of DB2 10 for z/OS, you can create inline LOBs in your DB2 tables. But what is an inline LOB? Well, at a high level, the name is somewhat self-explanatory. The LOB data for inline LOBs is stored with the rest of the base table.

But it is really a bit more nuanced than that. You can store part of the LOB inline with the rest of the data and the rest of the LOB externally, if you so choose. So an inline LOB can have all of the LOB data stored inline with the rest of the data (if it is small enough) or it can store only a portion with the rest of the data. This can be a great technique to use if you have some applications that require perhaps only the first 500 bytes of a larger LOB. Those 500 bytes can be stored inline – with the rest of the data – while the rest is stored externally and accessed only when needed. You can create an inline LOB by specifying the INLINE LENGTH clause on your CREATE TABLE statement.
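Here is a minimal sketch of what that INLINE LENGTH clause looks like, again using made-up names (EMP, MYDB, and the 200-byte length are just for illustration). Note my assumption here that the base table lives in a universal table space, which is what DB2 10 requires for inline LOBs.

```sql
-- Hypothetical example: keep the first 200 bytes of each resume
-- inline with the base row; any data beyond 200 bytes is stored
-- externally in the auxiliary (LOB) table space as usual.
CREATE TABLE EMP
  (EMPNO   CHAR(6)  NOT NULL,
   RESUME  CLOB(1M) INLINE LENGTH 200)
  IN MYDB.TSEMP;
```

Applications that only ever read the inline portion never touch the auxiliary table at all, which is where the performance benefit comes from.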

Inline LOBs can improve performance depending upon the type of access required. Consider the example where RESUME data is stored in a CLOB. The first 200 bytes of the resume are accessed most of the time, with the entire resume being accessed only in rare instances (e.g., interviews or performance reviews). By storing those first 200 bytes inline with the rest of the data we can eliminate the I/O to the LOB in the auxiliary table, thereby improving performance. External LOB data is not buffered, but inline LOB data is – this too can impact performance.

This blog entry is getting a little longer than I was expecting, so I am going to cut it off here. We've reviewed what LOBs are, a little bit of their history, and discussed a bit about both traditional and inline LOBs. To learn more, be sure to join me for the webinar on April 24th (or watch it later - it will be recorded).

Wednesday, November 06, 2013

IBM Information on Demand 2013, Tuesday

The second day of the IBM IOD conference began like the first, with a general session attended by most of the folks at the event. The theme of today's general session was Big Data and Analytics in Action. And Jake Porway was back to host the festivities.

The general session kicked off talking about democratizing analytics, which requires putting the right tools in people's hands when and where they want to use them. And also the necessity of analytics becoming a part of everything we do.

These points were driven home by David Becker of the Pew Charitable Trust when he took the stage with IBM's Jeff Jonas, Chief Scientist and IBM Fellow. Becker spoke about the data challenges and troubles with maintaining accurate voting rolls. He talked about more than 12 million outdated records across 7 US states. Other issues mentioned by Becker included deceased people still legitimately registered to vote, people registered in multiple states, and the biggest issue, 51 million citizens not even registered.

Then Jonas told the story of how Becker invited him to attend some Pew meetings because he had heard about Jonas' data analytics expertise. After sitting through the first meeting Jonas immediately recognized the problem as being all about context. Jonas offered up a solution to match voter records with DMV records instead of relying on manual modifications.

The system built upon this idea is named ERIC, short for the Electronic Registration Information Center. And Pew has been wowed by the results. ERIC has helped to identify over 5 million eligible voters in seven states. The system was able to find voters who had moved, eligible citizens who had not yet registered, and registrations for people who had passed away.

"Data finds data," Jonas said. If you've heard him speak in the past, you've probably heard him say that before, too! He also promoted the G2 engine that he built and mentioned that it is now part of IBM SPSS Modeler.

This particular portion of the general session was the highlight for me. But during this session IBMers also talked about Project NEO (the next generation of data discovery in the cloud), IBM Concert (delivering insight and cognitive collaboration), and what Watson has been up to.

I followed up the general session by attending a pitch on Big Data and System z delivered by Stephen O'Grady of Redmonk and IBM's Dan Wardman. Stephen started off the session and he made a couple of statements that were music to my ears. First, "Data doesn't always have to be big to lead to better decisions." Yes! I've been saying this for the past couple of years.

And he also made the observation that since data is more readily available, businesses should be able to move toward evidence-based decision-making. And that is a good thing. Because if instead we are making gut decisions or using our intuition, the decisions simply cannot be as good as those based on facts. And he backed it up with this fact: organizations using analytics are 2.2x more likely to outperform their industry peers.

O'Grady also offered up some Big Data statistics that are worth taking a look at --> here

And then Wardman followed up with IBM's System z information management initiatives and how they tie into big data analytics. He led off by stating that IBM's customers are most interested in transactional data instead of social data for their Big Data projects, which led him to posit that analytics and decision engines need to exist where the transactional data exists -- and that is on the mainframe!

Even though the traditional model moves data for analytics processing, IBM is working on analytics on data without moving it. And that can speed up Big Data projects for mainframe users.

But my coverage of Tuesday at IOD would not be complete without mentioning the great concert sponsored by Rocket Software. Fun. performed and they rocked the joint. It is not often that you get to see such a new, young and popular band at an industry conference. So kudos to IBM and Rocket for keeping things fresh and delivering high quality entertainment. The band performed all three of their big hits ("Carry On", "We Are Young", and "Some Nights"), as well as a bevy of other great songs including a nifty cover of the Stones' "You Can't Always Get What You Want."

All in all, a great day of education, networking, and entertainment. But what will Wednesday hold? Well, for one thing, my presentation on Understanding The Rolling 4 Hour Average and Tuning DB2 to Control Costs.

So be sure to stop by the blog again tomorrow for coverage of IOD Day Three!