
Tuesday, December 02, 2014

DSN1COPY Improvements in DB2 11 for z/OS

There have been some nice data validation improvements made to the IBM DSN1COPY utility in DB2 11 for z/OS. I suppose I should first explain what the DSN1COPY utility is before I talk about how it has been improved, so...

DSN1COPY is also known as the "Offline Copy utility." It has many uses. Of course, the primary use case for DSN1COPY is to copy data sets without DB2 having to be up and running.  DSN1COPY can be used to copy VSAM data sets to sequential data sets, and vice versa. It also can copy VSAM data sets to other VSAM data sets and can copy sequential data sets to other sequential data sets. As such, DSN1COPY can be used to:
  • Create a sequential data set copy of a DB2 table space or index data set.
  • Create a sequential data set copy of another sequential data set copy produced by DSN1COPY.
  • Create a sequential data set copy of an image copy data set produced using the DB2 COPY utility, except for segmented table spaces. 
  • Restore a DB2 table space or index using a sequential data set produced by DSN1COPY.
  • Restore a DB2 table space using a full image copy data set produced using the DB2 COPY utility.
  • Move DB2 data sets from one disk to another.
  • Move a DB2 table space or index space from a smaller data set to a larger data set to eliminate extents. Or move a DB2 table space or index space from a larger data set to a smaller data set to eliminate wasted space.
DSN1COPY runs as a batch job, so it can run as an offline utility when the DB2 subsystem is inactive. It can also run when the DB2 subsystem is active, but the objects it operates on should be stopped to ensure that DSN1COPY creates valid output. DSN1COPY does not check whether an object is stopped before carrying out its task, because it does not directly communicate with DB2.

DSN1COPY performs a page-by-page copy. Therefore, you cannot use DSN1COPY to alter the structure of DB2 data sets. For example, you cannot copy a partitioned table space into a segmented table space.

Perhaps the nicest feature of DSN1COPY is its ability to modify the internal object identifiers stored in DB2 table space and index data sets, as well as in data sets produced by DSN1COPY and the DB2 COPY utility. When you specify the OBIDXLAT option, DSN1COPY reads a data set specified by the SYSXLAT DD statement. This data set lists source and target DBIDs, PSIDs or ISOBIDs, and OBIDs, thereby enabling you to translate these IDs (for example, when moving data from one subsystem to another).
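
If you need to gather the IDs for the SYSXLAT input, they are available in the DB2 Catalog. Here is a minimal sketch of the queries, assuming a hypothetical database MYDB and table space MYTS; run them on both the source and the target subsystems and pair up the results:

    -- DBID and PSID for the table space (hypothetical names MYDB/MYTS)
    SELECT DBID, PSID
      FROM SYSIBM.SYSTABLESPACE
     WHERE DBNAME = 'MYDB'
       AND NAME   = 'MYTS';

    -- OBID for each table in that table space
    SELECT NAME, OBID
      FROM SYSIBM.SYSTABLES
     WHERE DBNAME = 'MYDB'
       AND TSNAME = 'MYTS';

The source/target value pairs returned by these queries become the records you code in the SYSXLAT data set.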

You can also use DSN1COPY (with its CHECK option) to check the validity of table space and index pages.

OK Then, But What's New?

So now that we understand the DSN1COPY utility, let's dig in to learn a little bit about how it has been improved in DB2 11 for z/OS. Basically, DB2 11 brings improved data validation to the DSN1COPY utility.

In DB2 11, the target data set produced by DSN1COPY is automatically validated after it is populated. The first time that the target data set is physically opened by an operation other than a utility, DB2 checks for inconsistencies between the data and the DB2 Catalog. The validation includes checking:
  • DBID, PSID, and OBID
  • SEGSIZE and PAGESIZE
  • Table space type
  • Table schema (if the table space contains only one table)

If inconsistencies are found, DB2 throws a -904 SQLCODE and reports the issue. You can then use the REPAIR utility to remediate the reported issues. In past releases, validation did not occur immediately, which could have resulted in data corruption issues, storage overlays, and even ABENDs.

Summary

So you can rest easier knowing that DSN1COPY data is checked after it is created, thereby removing a lot of the chance for calamity if you ran the utility improperly... and that's a good thing!

Monday, October 12, 2009

On The Importance of Choosing the Correct Data Type

Most DBAs have grappled with the pros and cons of choosing one data type versus another. Sometimes the decision is easy, whereas sometimes the decision is a little bit more difficult. In today's blog entry, we'll discuss both situations.

Data type and length are the most fundamental integrity constraints applied to data in a database. Simply by specifying the data type for each column when a table is created, DB2 automatically ensures that only the correct type of data is stored in that column. Processes that attempt to insert or update the data to a non-conforming value will be rejected. Furthermore, a maximum length is assigned to the column to prohibit larger values from being stored in the table.

The DBA must choose the data type and length of each column wisely. The general rule of thumb for choosing a data type for the columns in your tables is to choose the data type that most closely matches the domain of correct values for the column. That means you should try to adhere to the following rules:
  • If the data is numeric, favor SMALLINT, INTEGER, BIGINT, or DECIMAL data types. DECFLOAT and FLOAT are also options for very large numbers.
  • If the data is character, use CHAR or VARCHAR data types.
  • If the data is date and time, use DATE, TIME, and TIMESTAMP data types.
  • If the data is multimedia, use GRAPHIC, VARGRAPHIC, BLOB, CLOB, or DBCLOB data types.
But, of course, there are always exceptions. For example, what about a social security number? That is a numeric column, but you won't be performing calculations using it. And formatting it correctly requires hyphens, which cannot be stored in a numeric data type. So, perhaps, SSN is an exception.
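
To make the guidance concrete, here is a minimal sketch of a table definition that matches each column's data type to its domain. The table and column names are hypothetical, and the SSN column shows the "exception" stored as character data:

    CREATE TABLE PRODUCT_ORDER
      (ORDER_ID    INTEGER       NOT NULL,   -- numeric identifier used in joins and calculations
       ORDER_QTY   SMALLINT      NOT NULL,   -- small numeric range
       ORDER_AMT   DECIMAL(11,2) NOT NULL,   -- exact precision for monetary amounts
       CUST_NAME   VARCHAR(100)  NOT NULL,   -- variable-length character data
       CUST_SSN    CHAR(11),                 -- the exception: formatted with hyphens, no arithmetic
       ORDER_DATE  DATE          NOT NULL,   -- date data stored as DATE, not CHAR
       PRIMARY KEY (ORDER_ID));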

Leading Zeroes

Let's consider another typical situation: the desire to display leading zeroes. Maybe the application requires a four-byte code that is used to identify products, accounts, or some other business object, and all of the codes are numeric and will stay that way. But, for reporting purposes, the users want the codes to print out with leading zeroes. So, the users request that the column be defined as CHAR(4) to ensure that leading zeroes are always shown. But what are the drawbacks of doing this?

Well, there are drawbacks! Without proper edit checks, inserts and updates could place invalid alphabetic characters into the product code. This can be a very valid concern if ad hoc data modifications are permitted. This is rare in production databases, but data problems can still occur if the proper edit checks are not coded into every program that can modify the data. But let's assume that proper edit checks are coded and will never be bypassed. This removes the data integrity question (yeah, right!).

There is another problem that is related to performance and filter factors. Consider the possible number of values that a CHAR(4) column and a SMALLINT column can assume. Even if edit checks are coded for each, DB2 is not aware of these and assumes that all combinations of characters are permitted. DB2 uses base 37 math when it determines access paths for character columns, under the assumption that 26 alphabetic letters, 10 numeric digits, and a space will be used. This adds up to 37 possible characters. For a four-byte character column, there are 37 to the 4th power, or 1,874,161, possible values.

A SMALLINT column can range from -32,768 to 32,767, producing 65,536 possible small integer values. The drawback here is that negative or 5-digit product codes could be entered. However, if we adhere to our proper edit check assumption, the data integrity problems will be avoided here, as well. And we can always code a simple CHECK constraint to ensure that the code is always greater than zero.
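
For instance, here is a minimal sketch of such a constraint; the table, column, and constraint names are hypothetical:

    CREATE TABLE PRODUCT
      (PROD_CODE  SMALLINT NOT NULL
         CONSTRAINT CK_PROD_CODE CHECK (PROD_CODE BETWEEN 1 AND 9999),  -- only positive codes of four digits or fewer
       PROD_NAME  VARCHAR(50) NOT NULL);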

DB2 will use the HIGH2KEY and LOW2KEY values to calculate filter factors. For character columns, the range between HIGH2KEY and LOW2KEY is larger than for numeric columns because there are more total values. The filter factor will be larger for the numeric data type than for the character data type, which may influence DB2 to choose a different access path. For this reason, favor the SMALLINT over the CHAR(4) definition.

The leading zeroes problem can usually be solved using other methods. If you are using a report writer, most of them have quick methods of displaying leading zeroes. When using QMF, you can ensure that leading zeroes are shown by using the "J" edit code. Report programs can be coded to display leading zeroes easily enough by moving the host variables to appropriate display fields. And ad hoc access through other reporting tools typically provides a parameter that can enable leading zeroes to be displayed.
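
You can even produce the leading zeroes in SQL itself using the DIGITS built-in function. A minimal sketch, using the hypothetical PRODUCT table from above (DIGITS of a SMALLINT returns a 5-character string, so it is trimmed to 4 characters here):

    SELECT RIGHT(DIGITS(PROD_CODE), 4) AS PROD_CODE_DISPLAY   -- e.g., 42 displays as '0042'
      FROM PRODUCT;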

In general, it is wise to choose a data type that is closest to the domain for the column. If the column is to store numeric data, favor choosing a numeric data type: SMALLINT, INTEGER, DECIMAL, or floating point. In addition, always be sure to code appropriate edit checks to ensure data integrity.

A DATE with DB2

What about dates? In my opinion, you should always use a DATE data type for date data. Why anyone would ever store a date or time in a DB2 table in any format other than DATE, TIME, or TIMESTAMP is beyond me. But, oh, they do it all the time. And it causes all sorts of headaches. The benefits of using the proper DB2 data type are many, including:
  • Ensuring data integrity because DB2 will ensure that only valid date and time values are stored
  • The ability to use date and time arithmetic
  • A vast array of built-in functions to operate on and transform date and time values
  • Multiple formatting choices
Additionally, you have multiple built-in functions at your disposal for manipulating date and time data, but only when the data is stored as DATE, TIME, and TIMESTAMP data types. The time-related DB2 functions include CHAR, DATE, DAY, DAYOFMONTH, DAYOFWEEK, DAYOFYEAR, DAYS, HOUR, JULIAN_DAY, MICROSECOND, MIDNIGHT_SECONDS, MINUTE, MONTH, QUARTER, SECOND, TIME, TIMESTAMP, WEEK, WEEK_ISO, and YEAR.
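
As a quick illustration of date arithmetic and a few of these functions, here is a minimal sketch using the hypothetical PRODUCT_ORDER table from earlier:

    SELECT ORDER_ID,
           ORDER_DATE + 30 DAYS                   AS DUE_DATE,     -- date arithmetic with a labeled duration
           DAYOFWEEK(ORDER_DATE)                  AS ORDER_DOW,    -- 1 = Sunday ... 7 = Saturday
           DAYS(CURRENT DATE) - DAYS(ORDER_DATE)  AS AGE_IN_DAYS   -- elapsed days between two dates
      FROM PRODUCT_ORDER
     WHERE ORDER_DATE >= CURRENT DATE - 1 YEAR;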

I get questions all the time from folks who've stored dates using character data types; they ask all kinds of integrity and manipulation questions and I always, always, always start my reply with "You shouldn't have done that!"... and then I try to help them out, if I can. Don't fall into the trap of storing dates using anything but a DATE data type... you'll thank me later.

Summary

There is a lot more that can be said about choosing appropriate data types, but I think we've covered enough for today. The bottom line on data types is to use them to protect the integrity of your DB2 data and to simplify your job by taking advantage of DB2’s built-in capabilities. By choosing the data type that most closely matches your data you will be doing yourself, your systems, and your users a big favor.

Tuesday, May 26, 2009

Ensuring Data Integrity is a Tricky Business

The term "data integrity" can mean different things to different people and at different times. But at a high level, there really are two aspects of integrity with respect to databases: database structure integrity and semantic data integrity. Keeping track of database objects and ensuring that each object is created, formatted and maintained properly is the goal of database structure integrity. Each DBMS uses its own internal format and structure to support the databases, table spaces, tables, and indexes under its control. System and application errors at times can cause faults within these internal structures. The DBA must identify and correct such faults before insurmountable problems occur. Semantic data integrity refers to the meaning of data and relationships that need to be maintained between different types of data. The DBMS provides options, controls and procedures to define and assure the semantic integrity of the data stored within its databases.

Structural database integrity and consistency is critical in the ongoing administration of databases. If the structural integrity of the database is not sound, everything else will be suspect, too. There are multiple types of structural problems that can occur. Indexing problems are one example. Certain types of database maintenance can cause such problems, and DBAs need to be able to recognize the problem and rebuild the indexes to correct their structural integrity. Indexes are not the only database objects that utilize pointers. Many DBMSs use pointers to store very large objects containing text and image data, and these pointers can become corrupted. In modern database systems, structural integrity problems are rare -- much rarer than they used to be.

The more difficult and more pervasive problem is assuring the semantic integrity of the data. Getting that right requires proper design, processes that match your business requirements, good communication skills, and constant vigilance.

Perhaps the number one cause of data integrity problems is improperly designed databases. Just getting the data type and length correct for each column can go a long way toward making sure the right data is stored. Think about it. If you need to store dates but the column is defined as CHAR(8), how can you enforce that only valid dates are stored? You would need to code program logic to accomplish that. But if the column is defined as DATE, then the DBMS would take care of it -- and more of the data would be likely to be correct.
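
A minimal sketch of the difference, using hypothetical tables EVENT_LOG_CHAR and EVENT_LOG_DATE:

    -- With a CHAR(8) column, DB2 happily accepts garbage:
    INSERT INTO EVENT_LOG_CHAR (EVENT_DT) VALUES ('20090231');    -- February 31st is stored without complaint

    -- With a DATE column, DB2 rejects the invalid date:
    INSERT INTO EVENT_LOG_DATE (EVENT_DT) VALUES ('2009-02-31');  -- fails with SQLCODE -181 (invalid datetime value)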

The DBA must also set up data relationships properly in the database. This is done using referential integrity (RI), a method for ensuring the "correctness" of data within a DBMS. People tend to over-simplify RI stating that it is merely the identification of relationships between tables. It is actually much more than this. Of course, the identification of the primary and foreign keys that constitutes a relationship between tables is a component of defining referential integrity. Basically, RI guarantees that an acceptable value is always in the foreign key column. Acceptable is defined in terms of an appropriate value as housed in the corresponding primary key (or perhaps null).

The combination of the relationship and the rules attached to that relationship is referred to as a referential constraint. The rules that accompany the RI definition are just as important as the relationship. These rules define how data is to be properly added to the databases and what happens when it is removed.

There are other mechanisms in the DBMS that DBAs can use to enforce semantic data integrity. Check constraints and rules can be applied to columns that dictate valid values. The DBMS will reject invalid data that does not conform to the constraints. More complex data relationships can be set up using database triggers.
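
Putting a few of these mechanisms together, here is a minimal sketch of a referential constraint with a delete rule plus a check constraint; the table, column, and constraint names are hypothetical:

    CREATE TABLE CUSTOMER
      (CUST_ID    INTEGER      NOT NULL PRIMARY KEY,
       CUST_NAME  VARCHAR(100) NOT NULL);

    CREATE TABLE CUST_ORDER
      (ORDER_ID   INTEGER NOT NULL PRIMARY KEY,
       CUST_ID    INTEGER NOT NULL,
       ORDER_AMT  DECIMAL(11,2) NOT NULL
         CONSTRAINT CK_ORDER_AMT CHECK (ORDER_AMT >= 0),       -- semantic rule: no negative amounts
       CONSTRAINT FK_ORDER_CUST FOREIGN KEY (CUST_ID)
         REFERENCES CUSTOMER (CUST_ID)
         ON DELETE RESTRICT);                                  -- delete rule: cannot delete a customer that still has orders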

Every DBA should take advantage of the mechanisms provided by the DBMS to ensure data integrity. When DBMS-provided methods are used, fewer data integrity problems are likely to be found. Fewer data integrity problems mean higher quality databases and more proficient end users. You have to know what integrity rules are proper for the DBMS to enforce. But once defined, many of those rules can be enforced by the DBMS.

And that is very good, indeed!

Thursday, November 06, 2008

Data Driven: A Great New Book on Data Quality Issues

If you are at all involved in assuring the quality of your company’s data you need to know the work of Thomas C. Redman. Dr. Redman has been working on improving data quality for years and he has written numerous articles and books on the subject. His latest book, Data Driven: Profiting From Your Most Important Business Asset is another winner.

Redman offers the basic thesis of the book right there on page one, where he states “…bad data lie at the root of issues of international importance, including the current subprime mortgage meltdown, lost and stolen identities, hospital errors and contested elections.” After laying down the problem, the rest of the book tells us what we need to do to correct the problems.

Data Driven will help you to improve the methods you deploy for the care and feeding of your data and information; in other words, it helps you control and manage data using processes and controls similar to those you deploy on your other assets (finances, people, structures, etc.) – a noble goal, indeed!

The writing is concise and snappy – you won’t get bored reading this book. The style is engaging and easy to read. For example, instead of just saying what to do and how to do it, which can be boring, Redman discusses many of the arguments people use to claim that data quality is impossible, and then debunks them, showing that data quality is possible if approached properly and thoroughly.

There are many good ideas, charts, and graphs in Data Driven, too. One of my favorites is on page 54, where you can find a chart of the ten habits followed by those with the best data. If you buy this book, make a poster-sized photocopy of that page and hang it up on the wall of the break room and in the data folks’ cubicles. Maybe the habits will rub off on everyone as they gaze upon them everywhere.

But the best little gem in this wonderful book is the entirety of the last chapter, which is titled “The Next One Hundred Days.” In this chapter Dr. Redman offers what he calls a hundred-day panorama. It is not a grand plan because most will not have the depth of understanding required to create such a plan and have it succeed. Instead, the panorama strives for breadth, not depth, with a focus on quality. Diligent readers can follow the guidance in this chapter and thereby begin the long-term process of appreciating the importance of data quality on their business practices.

And that alone is worth the price of the book… but, of course, Data Driven offers much more and I recommend it to every IT and business professional whose job relies on accurate data.