Thursday, March 17, 2016

Digital Transformation and DB2 for z/OS: It’s Not Your Daddy’s DB2!

If you are a DBA who has been using DB2 for z/OS for a while you should have noticed that we are not doing things the same way we used to. DB2 is changing and we should be changing with it. If you are still using DB2 the same way you did 10 or 20 years ago, then you are definitely not adhering to industry best practices!
The same trends that are driving the digital explosion are also changing DB2 and the traditional role of the DBA. We are storing more data and different types of data for longer periods of time and in different ways than we have in the past.
And DB2 for z/OS keeps changing to adopt and embrace modern data management requirements and techniques. Whether it is modernizing storage with universal table spaces, embracing unstructured data in LOBs, or expanding the SQL language with new and more functionality, today’s DB2 looks a lot different than it did yesterday. Indeed, it is different – it is not your daddy’s DB2.
I’ve been writing a series of blog posts for BMC about this topic under the title It’s Not Your Daddy’s DB2!  You can find the first three blog posts in this series here: 1 2 3
But you can also attend a live webinar that BMC is sponsoring where I will talk about these issues. You can learn about:
·        Trends that influence the size and complexity of your DB2 environment and how this impacts data management
·        How to adapt to new DB2 data types and structures
·        Best practices and technologies for managing DB2 in the digital age
·        And BMC will share its next generation technology for managing the new world of DB2 for z/OS.

Learn how digital transformation will change the way your DBAs manage critical business needs. Attend this webinar on March 30, 2016, at 12:00 pm CT.

Tuesday, March 15, 2016

The Most Misunderstood Features of DB2 – Part 6: Not Indexing

Welcome to our on-going series of misunderstood issues in the world of DB2 for z/OS. Today’s topic is indexing, or to be more precise: not indexing. What do I mean by that?

Well, I’ve heard folks say that you should always create at least one index on every DB2 table. And while I can empathize with that general notion, I can’t agree completely because of that word “always,” which almost always makes a sentence wrong.

Sure, there are several good arguments for creating indexes on DB2 tables. Perhaps the most important one is clustering. DB2 uses an index to control how data is clustered on disk. Only one index can be specified as the clustering index (because, of course, the data on disk can only be stored in one order, right!). If you do not specify a clustering index then DB2 will use the earliest created (oldest) index to cluster the data. So it is usually a good idea to create an index to guide clustering.

Another popular reason to create an index on a DB2 table is to enforce uniqueness for a UNIQUE constraint or PRIMARY KEY. The only way to enforce uniqueness on a DB2 column (or set of columns) is by using a unique index.

Of course, there are a lot of other good reasons to create indexes, most of them to improve query performance. But I contend that there are situations when it makes sense not to create any indexes at all.

So when does it make more sense not to build an index for a DB2 table?

Let's start by saying that most of the time you will want to build at least one - and probably multiple - indexes on each table that you create. Indexes are crucial for optimizing performance of SQL access. Without an index, queries must scan every row of the table to come up with a result. And that can be very slow.

Having said that, here are a few situations some times where it can make sense to have no indexes defined on a table:

When all (or most) accesses retrieve every row of the table. Because every row will be retrieved every time you want to use the table, an index (if used) would just add extra I/O and would diminish, not enhance performance. Though not extremely common, you may indeed come across such tables in your organization.
For a very small table with only a few pages of data and no primary key or uniqueness requirements. A very small table (perhaps 20 to 30 or so pages) might not need an index because simply reading all of the pages is very efficient already.
When performance doesn't matter and the table is only accessed very infrequently. But, hey, when do you ever have that type of requirement in the real world?

Other than for these circumstances, you will most likely want to build one or more indexes on each table, not only to optimize performance, but also to ensure uniqueness, to support referential integrity, and to drive data clustering.


Of course, indexes do not come without cost. Indexes take up disk space and adding a lot of indexes will consume disk space. An additional consideration is their impact in data modification. Although indexes speed up queries they degrade inserts and deletes, as well as any modification to indexed columns.

What do you think? Are there other situations where a table should have no indexes? Are there any pertinent high-level issues I missed? Feel free to add your thoughts and comments below!

Friday, March 04, 2016

The Most Misunderstood Features of DB2 – Part 5: Choosing the Clustering Key

Hello everybody, and welcome to part 5 in my on-going series where we take a look at The Most Misunderstood Features of DB2. You can find the earlier installments in this series here (Part 1 Part 2 Part 3 Part 4).

Today’s topic, on choosing an appropriate clustering index, might be a bit confusing to some of you. So let’s start at the beginning and describe what clustering is.

What is Clustering?

In DB2 for z/OS, you use an index to indicate how DB2 should try to store table space data physically on disk. This is called clustering. A DB2 index is a clustering index if the CLUSTER keyword is specified when the index is created. Clustering causes inserted rows to be stored contiguously in sequence whenever possible. Additionally, when the table space is reorganized the data will be sequenced according to the clustering index. Since there can only be one physical sequence for data on disk, there can only be one clustering index per table.

For tables that do not specify APPEND YES and do not have hashing defined, if you do not specify a clustering index, DB2 chooses to cluster the data using the oldest existing index. Therefore, it is wise to explicitly specify a clustering index instead of letting DB2 decide because you can almost always choose better than the (basically random) choice DB2 makes.

Clustering of data is important because when the data is sequentially accessed, if the data exists on the same page (or range of pages) then I/O is minimized because fewer pages are read than if the data were spread throughout the entire table space on “random” pages. And fewer I/Os means improved performance.

So then, how should a DBA go about deciding how to cluster the data? Well, the best answer is to analyze the various ways in which the data is to be accessed. This means reviewing the actual SQL along with frequency and importance of the execution patterns. When the most important/frequently executed sequential accesses are identified, then you can choose to cluster the data according to that information. By following this approach then the most frequent and/or important sequential accesses will be optimized by clustering.

Of course, frequently the decision on how to cluster is made during the database design phase when few, if any, SQL statements or access patterns are readily available. So the DBA sometimes makes a knee-jerk decision to just cluster based on the primary key (PK) and be done with it. This is usually not a good idea.

Why is Clustering by PK not a very good idea?

Think about what we just said about sequentially accessing data. How often is data sequentially accessed by PK? Sometimes this might be the case, maybe if you are producing a report in order by PK, but even that example is not a very good one. Think about it. If you are producing a customer report you probably do not create it in CUSTNO order but probably in order by customer name (or something more interesting to humans than a key number).

So the best option is to figure out the most common and important sequential access patterns and cluster accordingly. But what if this information is not available? My assertion is that clustering by Foreign Key (FK), instead of by PK, makes a lot more sense. Usually a FK is on the many side of a one-to-many relationship. Think about what that means. When you join PK to FK there will be one PK row being joined to multiple FK rows. If those multiple FK rows are all on the same page (or on contiguous pages) due to clustering, then we can impact I/O.

The Bottom Line


The bottom line here is that you should think about (and even re-think) your general assumptions and why you make any database design decision. Sometimes your immediate gut reaction won’t be the right one. And it is always better to think with our heads instead of our guts, right?

Wednesday, February 17, 2016

The Most Misunderstood Features of DB2 – Part 4: Base Table Views

Welcome to my continuing series focusing on The Most Misunderstood Features of DB2. Today’s topic is an oldie but a goodie: base table views.

What is a Base Table View?
I am going to assume that regular readers of this blog know what a view is (if not, go here). So what is a base table view? It is a view that contains all of the columns and all of the rows of the base table and nothing else. In other words, it is like a SELECT * against the base table with no WHERE clauses.

Now why would somebody create such a beast? This gets to the heart of the misunderstanding here. There is a continuing notion “out there” that it is a good idea to create a base table view for every table and to give programmers access to the base table view only, not the base table itself.

The reason given for doing so is to insulate programs from change. But this is a weak argument at best. I admit that this particular misunderstanding is less prevalent than it was in the earlier years of DB2. But I do still find shops adhering to this archaic, and poorly founded, idea.

Reasons Given for Base Table Views
One reason given by base table view proponents is that when you add a column to a table you do not have to change the program. But this is the case, too, if you code your programs appropriately, avoiding SELECT * and coding only the specific columns you need to access. Most shops do this even when using base table views.

Other reasons given revolve around removing columns or splitting tables. If you are interested in all of the reasons and rebuttals I urge you to read my article (PDF) on this topic from 1991 titled One View Per Base Table? Don’tDo It! (Wow! This is a quarter of a Century old now!)

Another View Fallacy
I recall yet another misunderstanding about views from many years ago that I have not heard in a long time. Basically, the argument was that accessing a base table view would out-perform accessing the base table.

This was never proven to me – and I do not believe it was ever true. How could it be? Static SQL against a base table view would use view merge so it would be the same as going against the base table. And dynamic SQL would have the added step of having to resolve the view to the base table. Sooo…

Summary

Base table views are not worth the time and effort because the bring no added value of any type whatsoever. Do not use them.

Wednesday, February 10, 2016

The Most Misunderstood Features of DB2 – Part 3: Nulls

Welcome to Part 3 in my on-going blog series on The Most Misunderstood Features of DB2. You can find the first two parts here: Part 1 (on locking) and Part 2 (OPTIMIZE FOR v. FETCH FIRST). Today’s topic is one that confuses many SQL developers, Nulls.

What is a Null?
A null represents missing or unknown information at the column level. When a column is set as null, it can mean one of two things: the attribute is not applicable for certain occurrences of the entity, or the attribute applies to all entity occurrences, but the information may not always be known. Of course, it could be a combination of these two situations, too.

A null is not the same as 0 or blank. Null means no information has been specified for the column and it implies that the value is either unknown or not applicable.

Because DB2 supports null you can distinguish between a deliberate entry of 0 (for numerical columns) or a blank (for character columns) and an unknown or inapplicable entry (NULL for both numerical and character columns). Null indicates that the user did not explicitly make an entry or has explicitly entered NULL for the column. For example, a null “value” in the Price column of the ITEM table in some database does not mean that the item is being given away for free; instead it means that the price is not known or has not yet been set.

Nulls sometimes are inappropriately referred to as “null values.” Using the term value to describe a null is inaccurate because a null implies the lack of a value. Therefore, it is a good practice to simply use the term null or nulls (without appending the term “value” or “values” to it).

A Few Examples 
When are nulls useful? Well, defining a column as NULL provides a placeholder for data you might not yet know. For example, when a new employee is hired and is inserted into the EMP table, what should the employee termination date column be set to? I don’t know about you, but I wouldn’t want any valid date to be set in that column for my employee record. Instead, null can be used to specify that the termination date is currently unknown.

Let’s consider another example. Suppose that we also capture employee’s hair color when they are hired. Consider three potential entity occurrences: a man with black hair, a woman with unknown hair color, and a bald man. The woman with the unknown hair color and the bald man both could be assigned as null, but for different reasons. The woman’s hair color would be null meaning presently unknown; the bald man’s hair color could be null too, in this case meaning not applicable. But there are probably better methods of dealing with the bald man’s hair color.

How could you handle bald without using nulls? You could create special values for the HairColor column that mean “bald” and “unknown.” This is possible for a CHAR column like HairColor, but what about a DB2 DATE column? All occurrences of a column assigned as a DATE data type are valid dates. It might not be possible to use a special date value to mean “unknown.” This is where using a null is most practical.
DB2 does not differentiate between nulls that signify unknown data and those that signify inapplicable data. This distinction must be made by the program logic of each application.

Keep in mind, though, that using null to indicate “not applicable” can be an indication of improper database design. By properly modeling and normalizing your data structures you can usually eliminate the need to use nulls to indicate that a column is inapplicable for a specific row.

DB2 and Indicator VariablesDB2 represents null in a special “hidden” column known as an indicator variable. An indicator variable is defined to DB2 for each column that can accept nulls. The indicator variable is transparent to the end user, but must be managed when programming in a host language (such as Java, COBOL or PL/I).

The null indicator is used by DB2 to track whether its associated column is null or not. A positive value or a value of 0 means the column is not null and any actual value stored in the column is valid. If a CHAR column is truncated on retrieval because the host variable is not large enough, the indicator value will contain the original length of the truncated column. A negative value indicates that the column is set to null. If the value is -2 then the column was set to null as the result of a data conversion error.

Let’s take a moment to clear up one common misunderstanding right here: nulls NEVER save storage space in DB2 for z/OS. Every nullable column requires one additional byte of storage for the null indicator. So, a CHAR(10) column that is nullable will require 11 bytes of storage per row – 10 for the data and 1 for the null indicator. This is the case regardless of whether the column is set to null or not.

DB2 for Linux, Unix, and Windows has a compression option that allows columns set to null to save space. Using this option causes DB2 to eliminate the unused space from a row where columns are set to null. This option is not available on the mainframe, though.

Every column defined to a DB2 table must be designated as either allowing or disallowing nulls. A column is defined as nullable – meaning it can be set to NULL – in the table creation DDL. Null is the default if nothing is specified after the column name. To prohibit the column from being set to NULL you must explicitly specify NOT NULL after the column name. In the following sample table, COL1 and COL3 can be set to null, but not COL2, COL4, or COL5:

CREATE TABLE SAMPLE1
 (COL1 INTEGER,
  COL2 CHAR(10) NOT NULL,
  COL3 CHAR(5),
  COL4 DATE NOT NULL WITH DEFAULT,
  COL5 TIME NOT NULL);

In SELECT statements, testing for null is accomplished differently than testing for other “values.” You cannot specify WHERE COL = NULL, because this does not make any sense. Remember, null is a lack of a value so the column does not equal anything. Instead, you would have to code WHERE COL IS [NOT] NULL. In INSERT statements NULL can be specified in the VALUES clause to indicate that a column is to be set to NULL; but in UPDATE statements you can use the equality predicate (=) to assign a column to NULL. When inserting data, if the user fails to make an entry in a column that allows nulls, DB2 supplies the NULL as the default (unless another default value exists). If an attempt to insert NULL is made against a column defined as NOT NULL, the statement will fail.

More Confusion
OK, now that we have discussed what nulls are and the basics of using nulls in DB2, let’s look at how the presence of nulls can confuse SQL developers.

The AVG, COUNT DISTINCT, SUM, MAX, and MIN functions omit column occurrences set to null. The COUNT(*) function, however, does not omit columns set to null because it operates on rows. Thus, AVG is not equal to SUM/COUNT(*) when the average is being computed for a column that can contain nulls. You must be aware of which columns can be null as well as which functions operate on columns versus rows to make sure you code correctly!

It is important to understand, too, that for comparison in a SELECT statement, two null columns are not considered to be equal. When a nullable column participates in a predicate in the WHERE or HAVING clause, the nulls that are encountered cause the comparison to evaluate to UNKNOWN. But ORDER BY, GROUP BY, DISTINCT, and unique indexes consider nulls to be equal and handle them accordingly.

Additionally, when a column set to null participates in a calculation, the result is null. This is true even if the calculation is NULL/0 (which as every math savvy person knows should be an error). But because there is a null in the calculation, the result will be null!

Let’s look at another statement that confuses a lot of people:

SELECT SUM(SALARY)
FROM   EMP
WHERE  JOBCODE > 999;

Assume that SALARY is defined as NOT NULL. Furthermore, assume that the largest JOBCODE is 500 (that is, less than the 999 we are checking for). What is the results of this SQL?

No rows satisfy the predicate. As such, many people say the results should be 0. But it is not. The result is null! Yes, you can get a null result when performing functions on a non-nullable column. The predicate resolves to the empty set and the sum of the salaries in the empty set is null because there are no employees, and hence no salaries. Therefore, you better make sure that you code a null-indicator in your program when you write code like this.

The Bottom Line
Nulls are confusing, but cannot be completely avoided in DB2. Take the time to learn how nulls work and how they impact your coding and development practices… or you will certainly make mistakes.



----------------------------------------------------------------------------------------------------------
Additional resources for understanding nulls and their usage: