Tuesday, November 28, 2006

On DB2 Naming Conventions

What's in a name? The establishment and enforcing of naming conventions is often one of the first duties to be tackled when implementing new software. Adequate thought and preparation is required in order for such a task to be successful. What amount of effort should be extended in the creation of appropriate database naming standards? Are current industry standards acceptable?

Shakespeare, many, many years ago, may have said it best when he wrote: "What's in a name? That which we call a rose by any other name would smell as sweet." But, if that is true, then why do those of us in IT spend so much time and effort developing and enforcing naming standards? Maybe what something is called is not quite so trivial a matter after all!

Well, we know what Shakespeare was trying to convey: the name by which we call something has no effect upon the actual object. Calling a desk a lamp will not turn it into a lamp: it is still a desk. Let's not forget this.

In today's blog, though, I want to offer a basic approach to database naming standards, and DB2 for z/OS naming standards, in particular.

Tables and Their Cousins

In an today's SQL DBMSes the primary access point for data is the "table." A table consists of multiple rows, each with a fixed and unchanging number of defined columns. But there are several table alternatives that behave just like tables. For example, DB2 allows the following:

  • ALIAS - an alternative name that can be used in SQL statements to refer to a table or a view in the same or a remoteDB2 subsystem.
  • SYNONYM - an alternative name that can be used in SQL statements to refer to a table or a view in the same DB2 subsystem. Synonyms are accessible only by the synonym owner.
  • VIEW - an alternative representation of data from one ormore tables or views.
These three alternative means of access are similar in one way: they all present data by means of values in rows and columns. An end user need not know whether he or she is querying a table, an alias, a synonym, or a view -- the results are the same, namely data represented by values in rows and columns.

So why do some shops impose different naming standards on views than they do on tables? It makes sense to use the exact same naming convention for tables, views, aliases, and synonyms. These four database objects all logically refer to the same thing - a representation of data in terms of columns and rows.

Why would you want to implement different naming conventions for each of these objects? Let's examine the pros and cons. Consider your current table naming conventions. If your shop is typical, you will find a convention that is similar to this:

Format: aaa_dddddddddddddd

Example: ORD_ITEM

Here aaa is a three character application identifier and the remainder is a free-form alphanumeric description of the table. In the example we have the three character ORD (representing perhaps "order entry" followed by an underscore and ITEM. So this is the "item" table in the order entry system.

If your standards are significantly different, pause for a moment to ask yourself why. The format shown is almost an industry standard for table naming. You most surely do not want to force every DB2 table to begin with a T (or have a strategically embedded T within the table name). The name of a DB2 table should accurately and succinctly convey the contents of the data it contains. The naming convention displayed in Figure 1 accomplishes this.


So this brings us to our second naming convention recommendation: avoid embedding a 'T', or any other character, into table names to indicate that the object is a table. Likewise, indicator characters should be avoided for any other table-like object (i.e. aliases, synonyms, and views). Although most shops avoid embedding a 'T' in the table name, many of these same shops do embed a character into alias, synonym, and especially view names. The primary reason given is that the character makes it easy to determine what type of object is being accessed just by looking at the name.

There are two reasons why this is a bad idea. The first is a semantic reason, the second a flexibility issue. In semantic terms, an object's name need only identify the object, not the object's type. Consider the following arguments: How are people named? Usually one can ascertain the gender of someone simply by knowing their name but would you banish all males named Chris, Pat, or Terry? Or maybe all females named Chris, Pat, and Terry? After all, men and women are different. Shouldn't we make sure that all men's names are differentiated from women's names? Maybe we should start all men's names with an 'M' and all women's names with a W? If we did, we'd sure have a lot of Marks and Wendys, wouldn't we? The point here is that context enables us to differentiate men from women, when it is necessary. The same can be said of database objects.

How about another example: how are COBOL program variables named? Do you name your 01, 05, and 77 level variable names differently in your COBOL programs? For example, do all 01 levels start with 'O' (for one), all 05 levels start with 'F', and all 77 levels start with 'S'? No? Why not? Isn't this the same as forcing views to start with V (or having a strategically embedded V within the name)?

What about the naming of pets? Say I have a dog, a cat, and a bird. Now, I wouldn't want to get them confused, so I'll make sure that I start all of my dog names with a D, cat names with a C, and bird names with a B. So, I'll feed C_FELIX in the morning, take D_ROVER for a walk after work, and make sure I cover B_TWEETY's cage before I go to bed. Sounds ridiculous, doesn't it?

The whole point of this tirade is that if we don't manufacture hokey names in the real world, why would we want to do it with our DB2 objects? There is really no reason to embed special characters into DB2 objects names to differentiate them from one another. It is very practical and desirable to name DB2 objects in a consistent manner, but that consistent manner should be well-thought-out and should utilize the system to its fullest capacity wherever possible.

The second reason for this rule is to increase flexibility. Say, for example, that we have a table that for some reason is significantly altered, dropped, or renamed. If views are not constrained by rigid naming conventions requiring an embedded 'V' in the name, then a view can be constructed that resembles the way the table used to look. Furthermore, this view can be given the same name as the old table. This increases system flexibility. Most users don't care whether they are using a table, view, synonym, or alias. They simply want the data. And, in a relational database, tables, views, synonyms, and aliases all logically appear to be identical to the end user: as rows and columns.

Although it is true that there are certain operations that cannot be performed on certain types of views, the users who need to know this will generally be sophisticated users. For example, very few shops allow end users to update any table they want using QMF, SPUFI, or some other tool. Updates, deletions, and insertions (the operations which are not available to some views) are generally coded into application programs and scheduled in batch or executed on-line in transactions. The end user does need to query tables dynamically. Now you tell me, which name will your typical end user remember more readily when he needs to access his marketing contacts: MKT_CONTACT or VMKTCT01?

Further Arguments For Indicators

Some folks believe they have very valid reasons for embedding an object type indicator character into database objects - view names, in particular. Let's examine these arguments.

Point Embedding a V into our view names enables our DBAs to quickly determine which objects are views and which are tables.

Counterpoint Many of these shops do not embed a T into the table name, but feel that a V in the view name is necessary. It is believed that the DBA will be able to more easily discern views from tables. But, rarely do these shops force an S into synonym names or an A into alias names. Even if they do, it is usually overkill. Any good DBA already knows which objects are tables and which are views, and if he or she doesn't, a simple query against the system catalog will clarify the matter. For example, in DB2, this query will list all table-like objects:

SELECT NAME, CREATOR, "TABLE"
FROM SYSIBM.SYSTABLES
WHERE TYPE = "T"
UNION ALL
SELECT NAME, CREATOR, "ALIAS"
FROM SYSIBM.SYSTABLES
WHERE TYPE = "A"
UNION ALL
SELECT NAME, CREATOR, "SYNONYM"
FROM SYSIBM.SYSSYNONYMS
UNION ALL
SELECT NAME, CREATOR, "VIEW"
FROM SYSIBM.SYSVTREE
ORDER BY 3, 1;

Point It is necessary to code view names differently so that users understand that they are working with a view and not all operations can be performed on the view.

Counterpoint All operations can be performed on some views but not all operations can be performed on all tables! What if the user does not have the security to perform the operation? For example, what is the difference, from the user's perspective, between accessing a non-updateable view and accessing a table where only the SELECT privilege has been granted?

Use It or Lose It

Another common problem with database naming conventions is unnecessary size restrictions. Using DB2 as an example, most objects can have a name up to 18 characters long (even more after moving to DB2 V8). But, in many instances, shops establish naming standards that do not utilize all of the characters available. This is usually unwise, which brings us to our third recommendation: Unless a compelling reason exists, ensure that your standards allow for the length of database object names to utilize every character available.

Here is a list of maximum and recommended lengths for names of each DB2 object:


----------------------------------------------------------
Recommended
DB2 Object Max Length (V7) Max Length(V8) Length
---------- --------------- -------------- -----------
STOGROUP 8 128 128
Database 8 8 8
Tablespace 8 8 8
Table 18 128 128
View 18 128 128
Alias 18 128 128
Synonym 18 128 128
Column 18 128 128
Check Constraint 18 128 128
Ref. Constraint 8 128 128
Index 18 128 8
----------------------------------------------------------




Notice that, except for indexes, the recommended length is equal to the maximum length for each object. Why are indexes singled out in DB2? This is an example of a compelling reason to bypass the general recommendation. Developers can explicitly name DB2 indexes, but they cannot explicitly name DB2 index spaces. Yet, every DB2 index requires an index space - and an index space name. The index space name will be implicitly generated by DB2 from the index name. If the index name is 8 characters or less in length, then the index space name will be the same as the index name. However, if the index name is longer than 8 characters, DB2 will use an internal, proprietary algorithm to generate a unique, 8-byte index space name. As this name cannot be determined prior to index creation, it is wise to limit the length of index names to 8 characters. Also, since the only folks who will be interested in the index space name are DBAs, there is no reason to make the names more descriptive for general end users... and DBAs are used to dealing with cryptic names (some of them even like it).

This is a good example of the maxim that there are exceptions to every rule.

Another exception may be that users of DB2 V8 may want to impose V7 length restrictions on their objects. You might want to do this to discourage verbose names (128 is a BIG extension over 18 bytes). Maybe you have applications that cannot easily use or display such wordy object or column names.

Embedded Meaning

One final troublesome naming convention used by some shops is embedding specialized meaning into database object names. The name of an object should reflect what that object is or represents. However, it should not attempt to define the object and describe all of its metadata.

With this in mind, it is time for another recommendation: Do not embed specialized meaning into database object names.

Let's examine this in more detail by means of an example. Some shops enforce DB2 index naming conventions where the type of index is embedded in the index name. Consider the following example:

Format: Xaaaaayz

Example: XORDITCU

Here X is a constant, aaaaa is unique five character description (perhaps derived from the table name), y is an indicator for clustering (either C for clustered or N for nonclustered), and z is an indicator for index type (P for primary key, F for foreign key, U for unique, and N for non-unique).

So, the example is a unique, clustering index on the ORD_ITEM table. Note two potential problem areas with this standard:

  • An embedded X identifies this object as an index.

  • Embedded meaning in the form of indicators detailing the type of index.

The embedded indicator character 'X', although unnecessary, is not as evil as indicator characters embedded in table-like objects. Indexes are not explicitly accessed by users. Therefore, obscure or difficult-to-remember naming conventions are not as big of a problem. The same arguments hold true for tablespace names. In fact, indicator characters may actually be helpful to ensure that tablespaces and indexes are never named the same. Within the same database, DB2 does not permit a tablespace to have the same name as any index, and vice versa. DB2 uses a name generation algorithm to enforce uniqueness if it is attempted. So, if you must use indicator characters in database names, use them only in objects which are never explicitly accessed by end users.

The second potential problem area poses quite a bit of trouble. Consider the following cases which would cause the embedded meaning in the index name to be incorrect:

  • The primary key is dropped.

  • A foreign key is dropped.

  • The index is altered from non-unique to unique (or vice versa) using a database alteration tool.

In each of these cases we would have to re-name the index or we would have indexes with embedded characters that did not accurately depict the metadata for the index. Both are undesireable.

And there are additional problems to consider with this naming convention, as well. What if an index is defined for a foreign key, but is also unique? Should we use an 'F' or a 'U'? Or do we need another character?

The naming convention also crams in whether the index is clustering ('C') or not ('N'). This is not a good idea either. Misconceptions can occur. For example, in DB2, if no clustering index is explicitly defined, DB2 will use the first index created as a clustering index. Should this index be named with an embedded 'C' or not? And again, what happens if we switch clustering indexes - we would have to re-name indexes.

Let's look at one final example from the real world to better illustrate why it is a bad idea to embed specialized meaning into database object names. Consider what would happen if we named corporations based upon what they produce. When IBM began, they produced typewriters. If we named corporations like some of us name database objects, the company could have been named based upon the fact that they manufactured typewriters way back when. So IBM might be called TIBM (the T is for typewriters).

Now guess what, TIBM doesn't make typewriters any longer. What should we do? Rename the company or live with a name that is no longer relevant? Similar problems ensure with database object names over time.

Synopsis

Naming conventions evoke a lot of heated discussion. Everybody has their opinion as to what is the best method for naming database objects. Remember, though, that it is best to keep an open mind.

It is just like that old country novelty hit "A Boy Named Sue." Johnny Cash may have been upset that his father gave him a girl's name, but that was before he knew why. In a similar manner, I hope that this blog topic caused you to think about naming conventions from a different perspective. If so, I will consider it a success.

Tuesday, November 21, 2006

Character Versus Numeric Data Types

Most DBAs have faced the situation where one of their applications requires a four-byte code that is used to identify products, accounts, or some other business object, and all of the codes are numeric and will stay that way. But, for reporting purposes, users or developers wish the codes to print out with leading zeroes. So, the users request that the column be defined as CHAR(4) to ensure that leading zeroes are always shown. But what are the drawbacks, if any, to doing this?

Well, there are drawbacks! Without proper edit checks, INSERTs and UPDATEs could place invalid alphabetic characters into the alphanumeric column. This can be a very valid concern if ad hoc data modifications are permitted. Although it is uncommon to allow ad hoc modifications to production databases, data problems can still occur if all of the proper edit checks are not coded into every program that can modify the data. But let's assume (big assumption) that proper edit checks are coded and will never be bypassed. This removes the data integrity question.

There is another problem, though, that is related to performance and filter factors. Consider the possible number of values that a CHAR(4) column and a SMALLINT column can assume. Even if programmatic edit checks are coded for each, DB2 is not aware of these and assumes that all combinations of characters are permitted. DB2 uses base 37 math when it determines access paths for character columns, under the assumption that 26 alphabetic letters, 10 numeric digits, and a space will be used. This adds up to 37 possible characters. For a four-byte character column there are 3**74 or 1,874,161 possible values.

A SMALLINT column can range from -32,768 to 32,767 producing 65,536 possible small integer values. The drawback here is that negative codes and/or 5 digit codes could be entered. Both do not conform to the 4 digit maximum. However, if we adhere to our proper edit check assumption, the data integrity problems will be avoided here, as well.

DB2 will use the HIGH2KEY and LOW2KEY values to calculate filter factors. For character columns, the range between HIGH2KEY and LOW2KEY is larger than numeric columns because there are more total values. The filter factor will be larger for the numeric data type than for the character data type which may influence DB2 to choose a different access path. For this reason, favor the SMALLINT over the CHAR(4) definition.

The leading zeroes problem should be able to be solved using other methods. It is not necessary to store the data in the same format that users wish to display it. For example, when using QMF, you can ensure that leading zeroes are shown in reports by using the "J" edit code. Other report writes offer similar functionality. And report programs can be coded to display leading zeroes easily enough by moving the host variables to appropriate display fields.

In general, it is wise to choose a data type which is closest to the domain for the column. If the column is to store numeric data, favor choosing a numeric data type: SMALLINT, INTEGER, DECIMAL, or floating point. Same goes for temporal data (that is, choose DATE, TIME, or TIMESTAMP instead of a character or numeric data type). In addition, always be sure to code appropriate edit checks to ensure data integrity - but remember, fewer need to be coded if you choose the correct data type because DB2 automatically prohibits data that does not conform to the data type for each column.

Wednesday, November 08, 2006

DB2 Access Paths and Change Management Procedures

No one working as a DB2 DBA or performance analyst would deny that one of the most important components of assuring efficient DB2 performance is making sure that DB2 access paths are appropriate for your DB2 programs. Binding programs with EXPLAIN(YES) specified is important to ensure that we know what access paths DB2 has chosen for each SQL statement. Without the information that EXPLAIN puts in the PLAN_TABLE we would be "flying blind."

Anyway, programs need to be rebound periodically to ensure that DB2 has forumlated access paths based on up-to-date statistics and to ensure that you are taking advantage of all the latest and greatest DB2 optimizer features. Whether you are implementing changes into your DB2 applications, upgrading to a new version of DB2, or simply trying to achieve optimum performance for existing application plan and packages, an exhaustive and thorough REBIND management process is a necessity.

But not every organization does what is necessary to keep access paths up-to-date with the current state of their data. Oh, there are always reasons given as to why the acknowledged “best practice” of REORG/RUNSTATS/REBIND is not followed religiously. But these reasons are not always reasonable - especially when there are approaches to overcome them.

But let's approach this subject from another perspective: that is, from a change management procedures perspective. On the mainframe, change has traditionally been strictly controlled. But one exception has been DB2 access paths. In a mainframe shop everything we do is tightly controlled. If we make even a minor change to an application program, that program is thoroughly tested before it ever reaches a production environment. The program progresses through unit testing, QA testing, volume testing, and so on. As developers, we do a good job of testing a change to minimize the risk that the change might have unintended consequences.

We do the same type of due diligence with most other changes in the mainframe world. Database changes are planned and thoroughly tested. Many shops use change manager software to ensure that database changes are implemented appropriately and in a controlled manner. System software (e.g. CICS, WebSphere, etc.), including subsystem and DB2 changes, are all subject to strict change control procedures. This is done to minimize disruption to the production work being conducted by our business folks.

But, if you think about it, there is one exception to this tight change control environment: Binds and Rebinds are typically done in the production environments without the benefit of oversight or prior testing. This lack of change control results in unpredictable performance impacts. In most shops, programs are moved to production and bound there. Indeed, we are at the mercy of the DB2 optimizer, which generates access paths on the fly when we Bind or Rebind our programs. Any issues with inefficient access paths are then dealt with in a reactive mode. That is, problems are addressed after the fact.

One of the biggest reasons for not implementing strict change control processes for access paths is the lack of built-in methods for ensuring access path change control discipline. Let’s face it, manually evaluating thousands of packages and tens of thousands of SQL statements can be quite impractical. But there are things that can be done to help alleviate this problem.

Rebinding does not always produce DB2 performance improvements—and in some cases rebinding can cause DB2 performance to degrade. Typically, 75% to 90% of all rebinds are unnecessary. Bind ImpactExpert (from NEON Enterprise Software) manages the bind and rebind processes to assure optimal application performance by checking what access path changes will be before your rebind, and then only rebinding those programs where performance would improve. How does it do this?

Well, Bind ImpactExpert helps to eliminate the unpredictable results of rebinds that DBAs experience daily. By filtering out the rebinds that are likely to have a negative impact on DB2 performance, Bind ImpactExpert guarantees improved or consistent performing DB2 applications.

And one of the most popular features of Bind ImpactExpert is its ability to eliminate the unpredictability of rebinds when moving between DB2 releases. The EarlyPrecheck feature assists you in moving to new DB2 releases. The EarlyPrecheck feature allows you to evaluate and correct access path problems for both dynamic and static SQL prior to an actual DB2 version upgrade. You can perform access path evaluation weeks or months prior to the migration date to allow for in-depth analysis and pre-emptive correction of access path issues.

Bind ImpactExpert can also be stop unnecessary binds. It does this by determining whether a DBRM contains changed SQL and skipping the bind step for those that do not. This process eliminates unnecessary binds, optimizes and accelerates the compile procedure, reduces CPU usage, and reduces locks to the DB2 catalog.

All forward-thinking organizations should adopt a liberal Rebind process to ensure optimal access paths based on up-to-date statistics. Keeping abreast of data changes and making sure that your programs are optimized for the current state of the data is the best approach. This means regular executions of RUNSTATS, REORG, and Rebind.

So, if you are not rebinding your programs on a regular basis because you are afraid of degrading a few access paths, it is time to take a look at Bind ImpactExpert. If you never Rebind, not only are you forgoing better access paths due to data changes but you are also forgoing better access paths due to changes to DB2 itself. I mean, why would you want to penalize every DB2 program in your subsystem for fear that a program or two may have a few degraded access paths? Especially when NEON Enterprise Software offers an automated solution to the problem...

The bottom line is this: failing to keep your access paths aligned with your data is a sure recipe for declining DB2 application performance.

Monday, November 06, 2006

Try Out the XML Capability of DB2 9 for Free

Are you aware that there is a version of DB2 that you can use free of charge? It is called DB2 Express-C and it is basically IBM's way of removing price as being the issue in terms of you trying out and using DB2. Think of it as a way to use DB2 just like you would use an open source DBMS (except you don't get the source code).

According to IBM: DB2 Express-C is a version of DB2 Express Edition (DB2 Express) for the community. DB2 Express-C is a no-charge data server for use in development and deployment of applications including: XML, C/C++, Java, .NET, PHP, and more. DB2 Express-C can be run on up to 2 dual-core CPU servers, with up to 4 GB of memory, any storage system setup and with no restrictions on database size or any other artificial restrictions.

So, if you are wondering what it means for DB2 to support pureXML, then you might want to download DB2 Express-C and try it out for yourself. DB2 Express-C can run on AIx, HP-UX, Linux, Solaris, and Windows.

Now why would I be writing about a LUW product on a z/OS blog? Well, DB2 9 for LUW and DB2 9 for z/OS both support pureXML in the same way. So even if you are a DB2 for z/OS user, getting familiar with the XML support in DB2 Express-C can prepare you to help plan for how you might want to use XML in DB2 9 for z/OS when it becomes available.