2012-02: HASH Index – What is it good for?

Oh dear! I seem to have made a HASH of it!

Many thanks to Jeff for requesting information about HASH indexing and thus causing the terrible pun in the first line!

 

A brand new access path for DB2 10

Hash access is one of the new goodies that arrived in DB2 10 NFM and is a brand new access path for DB2. (Anyone remember HDAM?) First the good news – When Hash access is used, it is pretty fast! If you are lucky, one I/O is all that is needed unless you get your hash space definition hopelessly wrong. In which case, it gets messy!

 

HASH Access. What is it?

Simply put, hash access takes your given key (max 255 bytes) and computes a 64 bit number. DB2 computes the "relative page number" as the MOD of this 64 bit number and a prime number derived from the size of the table. Next, DB2 computes the "hash anchor point" as the MOD of this 64 bit number and another prime number between 17 and 53. Together, the "relative page number" and the "hash anchor point" act like a cross-reference directly to the row in question. Of course, life is a bit more complicated than that, and you can easily get collisions where two keys end up with the same numbers. If this happens, the second one gets written to the ominous sounding "hash overflow area" and its associated index (remember for later that this is also a sparse index!). By the way, the optimal setting is to have between 0% and 8% of rows in the overflow area.
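For orientation, the DDL to ask for hash organization looks roughly like this (a minimal sketch – table, column, database and tablespace names are all made up, and 64 MB is just a placeholder):

CREATE TABLE BOXWELL.CUSTOMER               -- hypothetical table
     ( CUST_NO    CHAR(10)      NOT NULL
     , CUST_NAME  VARCHAR(100)  NOT NULL
     )
       IN BOXDB.BOXTS                       -- must be a UTS
       ORGANIZE BY HASH UNIQUE (CUST_NO)    -- the hash key (max 255 bytes)
       HASH SPACE 64 M                      -- the fixed hash area - size it with care!
;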

 

Any physical limitations?

Yes, there are a whole bunch! The table must be in a UTS, and it must be in RRF format. The hash key is *not* updateable (déjà vu here with partitioned key changes from aeons ago!). DROP a hash, and your table space is put into REORGP, which of course needs a SHRLEVEL NONE REORG to repair it and that is not a recoverable scenario either. No CLUSTER indexes are allowed of course. Hash index *must* be implicitly created. No LOB or XML columns and it cannot be an MQT. <phew!>

Which all reminds me of a certain Monty Python sketch: “But apart from that, what have the Romans done for us?”. Just replace Romans with Hash Access. 🙂

So that is the list of physical limitations – What about logical limitations?

 

What about logical limitations?

No parallel access or star/hybrid joins are possible. Hash access is not used for sensitive dynamic scrollable cursors with multi-element in-lists. RUNSTATS is a waste of time as it is a sparse index and so, by definition, only has a few percent of the table rows in it. My advice is to simply exclude hash indexes from your DB2 database maintenance during RUNSTATS processing. However, remember you must still monitor these indexes for REORG and for re-sizing, as well as hash usage.

For which data does it actually make sense to change to hash access? Straight off the bat: single row, unique, equals and IN predicates. No range scans. Access is random. The row that you are searching for is normally found. The size of the table is static or well known. The columns all have the same size (no large variants please!), which should ideally be between 30 and 2000 bytes. Finally, it would be perfect if 20 or more rows fit on a page.

Good candidates should have four or more index levels in order to save tons of I/O.

Bad candidates have high insert rates or updates that change row length.
So you can see that you have to do a fair bit of work to find the candidates!
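To get a first hit list of candidates, you could start with a catalog query along these lines (a rough sketch of my own, not an official method – it simply picks out unique indexes with four or more levels and shows the table cardinality next to them):

SELECT IX.TBCREATOR, IX.TBNAME, IX.CREATOR, IX.NAME
     , IX.NLEVELS, IX.FULLKEYCARDF, TB.CARDF
FROM SYSIBM.SYSINDEXES IX
   , SYSIBM.SYSTABLES  TB
WHERE IX.TBCREATOR  = TB.CREATOR
  AND IX.TBNAME     = TB.NAME
  AND IX.UNIQUERULE <> 'D'          -- unique indexes only
  AND IX.NLEVELS    >= 4            -- four or more index levels
ORDER BY IX.NLEVELS DESC
;

From that list you still have to check the insert/update profile, the access pattern, and the row sizes by hand, of course.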

Once you have decided to change a table to be hash-based, you can see from EXPLAIN whether or not a given query will attempt to use the hash method. Access types of H, HN, and MH will confirm usage – and you can use the RTS to see when and how often the last HASH access actually occurred (HASHLASTUSED and REORGHASHACCESS columns).
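A quick way to check is a simple query against your PLAN_TABLE (a sketch – the qualifier is made up):

SELECT QUERYNO, PROGNAME, TNAME, ACCESSTYPE, ACCESSNAME
FROM BOXWELL.PLAN_TABLE
WHERE ACCESSTYPE IN ('H', 'HN', 'MH')   -- the hash access types
ORDER BY PROGNAME, QUERYNO
;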

 

If you decide to stop using the hash organization and drop it, remember the REORGP.

LOADing is still a bit of a trauma of course. The current best method is:
Create the table, LOAD the data, ALTER to hash, and REORG – roughly as sketched below.
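Using a hypothetical table BOXWELL.CUSTOMER with hash key CUST_NO (the same made-up names as in the sketch above), steps three and four would look roughly like this:

ALTER TABLE BOXWELL.CUSTOMER                -- table already created and LOADed
      ADD ORGANIZE BY HASH UNIQUE (CUST_NO)
      HASH SPACE 64 M
;
-- followed by a REORG of the table space (with AUTOESTSPACE YES if you
-- want DB2 to size the hash space for you - see the tests below)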

Mass Inserts are also pretty bad.

I did some testing here loading up 90,000 rows with an eight byte (repeatable) random key and found the elapsed time varies greatly with the size of the HASHSPACE and with the bufferpool definition of the table space.  (Remember that basic BP tuning is always a good idea!) Here is what I found:

  1 MB Hash Space:  73,183 overflow records, 36 secs, 251 hash pages

Then I went high with the HASH SPACE:

256 MB Hash Space:  zero overflow records, 96 secs, 65,327 hash pages

Then I stepped down (in binary of course!):

128 MB Hash Space:  zero overflow records, 82 secs, 32,573 hash pages
 64 MB Hash Space:  zero overflow records, 44 secs, 16,381 hash pages
 32 MB Hash Space:  zero overflow records, 23 secs, 8,191 hash pages
 16 MB Hash Space:  zero overflow records, 22 secs, 4,093 hash pages
  8 MB Hash Space:  one overflow record, 22 secs, 2,039 hash pages
  4 MB Hash Space:  21,620 overflow records, 36 secs, 1,021 hash pages
  2 MB Hash Space:  55,897 overflow records, 38 secs, 509 hash pages

And then finally the optimum size for my little test:

  9 MB Hash Space:  zero overflow records, 19 secs, 2,297 hash pages

Then I decided to test out the AUTOESTSPACE YES function to see what changes it made.
The following is a list of results from my test data:
256 MB Hash Space: LOADed 90,000 rows using 32,719 hash pages with 0 overflows
                   REORG of the above changed the hash pages down to 653 with 3,762 overflow records
  1 MB Hash Space: with no data, REORGed to 31 pages
  1 MB Hash Space: LOADed 90,000 rows using 127 hash pages with 72,728 overflows
                   REORG of the above changed the hash pages up to 653 with 3,762 overflow records
  8 MB Hash Space: LOADed 90,000 rows using 1,021 hash pages with 0 overflows
                   REORG of the above changed the hash pages down to 653 with 3,762 overflow records

So you can see that REORG aims for about 5% of rows, which is OK.
Finally, also check out “Chapter 20. Organizing tables by hash for fast access to individual rows” in the Managing Performance book for some great details about how to squeeze the last bytes out of your hash space!

Feel free to send me your comments and ask questions.

TTFN,
Roy Boxwell
Senior Architect

2012-03: INCLUDE Index Usage with Examples to find Redundant Indexes

DB2 10 contains a whole bunch of great new stuff, but my personal favourite has got to be the new INCLUDE syntax for CREATE INDEX.

What it does is quite simply magical – it allows you to "add" columns to an index which are *not* used for uniqueness, but are in the index of course. This means you can get that wonderful little Y to appear in the INDEXONLY column of the PLAN_TABLE when you BIND / REBIND with EXPLAIN(YES) – you are all doing that, aren't you?

Up until DB2 10, indexes "bred" because the business process *needed* a unique key on COL1, COL2, and COL3, but for INDEXONLY Y access you *needed* COL5 and COL6 as well; thus a second, basically pointless, index came into existence. Now of course, the Optimizer is a clever little devil and may, in its infinite wisdom, decide never to use the first index, even though an INSERT still needs it to throw the odd -803 duplicate-key error. The end user suffers under the extra CPU, disk space and I/O time of two indexes. With the INCLUDE syntax, you no longer need that second index.
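In its simplest form, the new syntax looks something like this (a sketch with made-up names):

CREATE UNIQUE INDEX MYSCHEMA.MYINDEX
    ON MYSCHEMA.MYTABLE
     ( COL1, COL2, COL3 )          -- these columns enforce uniqueness
       INCLUDE ( COL5, COL6 )      -- carried along purely for INDEXONLY access
;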

 

DB2 10 new INCLUDE syntax for CREATE INDEX

The following indexes existed for the following query runs (I did a full RUNSTATS of course!):

For QUERYs 1 – 34

CREATE TYPE 2 UNIQUE INDEX IQA061HU.IQAXY1092    
                        ON IQA061HU.IQATY109     
                         ( STMT_KEY              
                         , STMT_TYPE             
                         )                       
                           USING STOGROUP SYSDEFLT
                           BUFFERPOOL BP0        
                           FREEPAGE 0            
                           PCTFREE 10            
                           CLOSE YES;

Leaf pages 2079 space 10800

For QUERYs 21 – 34

CREATE TYPE 2 UNIQUE INDEX IQA061HU.IQAXY1093    
                        ON IQA061HU.IQATY109     
                         ( STMT_KEY              
                         , STMT_TYPE             
                         , ISOLATION_LEVEL       
                         )                       
                           USING STOGROUP SYSDEFLT
                           BUFFERPOOL BP0        
                           FREEPAGE 0            
                           PCTFREE 10            
                           CLOSE YES;

Leaf pages 2294 space 10800

For QUERYs 41 – 54

CREATE TYPE 2 UNIQUE INDEX IQA061HU.IQAXY109I      
                        ON IQA061HU.IQATY109       
                         ( STMT_KEY                
                         , STMT_TYPE               
                         )                         
                           INCLUDE (ISOLATION_LEVEL)
                           USING STOGROUP SYSDEFLT 
                           BUFFERPOOL BP0          
                           FREEPAGE 0              
                           PCTFREE 10              
                           CLOSE YES;

Leaf pages 2294 space 10800

 

The queries looked like

EXPLAIN ALL SET QUERYNO = 1, 21, 41 FOR             
SELECT STMT_KEY, STMT_TYPE                  
FROM IQA061HU.IQATY109                      
;                                           
COMMIT ;                                    
EXPLAIN ALL SET QUERYNO = 2, 22, 42 FOR             
SELECT STMT_KEY, STMT_TYPE                  
FROM IQA061HU.IQATY109                      
WHERE STMT_KEY = X'01329149008E899B000001D6'
;                                           
COMMIT ;                                    
EXPLAIN ALL SET QUERYNO = 3, 23, 43 FOR             
SELECT STMT_KEY, STMT_TYPE                  
FROM IQA061HU.IQATY109                      
WHERE STMT_KEY = X'01329149008E899B000001D6'
  AND STMT_TYPE = 'M'                       
;                                             
EXPLAIN ALL SET QUERYNO = 11, 31, 51 FOR           
SELECT STMT_KEY, STMT_TYPE, ISOLATION_LEVEL
FROM IQA061HU.IQATY109                     
;                                          
COMMIT ;                                   
EXPLAIN ALL SET QUERYNO = 12, 32, 52 FOR           
SELECT STMT_KEY, STMT_TYPE, ISOLATION_LEVEL
FROM IQA061HU.IQATY109                     
WHERE STMT_KEY = X'01329149008E899B000001D6'
;                                          
COMMIT ;                                   
EXPLAIN ALL SET QUERYNO = 13, 33, 53 FOR           
SELECT STMT_KEY, STMT_TYPE, ISOLATION_LEVEL
FROM IQA061HU.IQATY109                     
WHERE STMT_KEY = X'01329149008E899B000001D6'
  AND STMT_TYPE = 'M'                      
;                                          
COMMIT ;                                   
EXPLAIN ALL SET QUERYNO = 14, 34, 54 FOR           
SELECT STMT_KEY, STMT_TYPE, ISOLATION_LEVEL
FROM IQA061HU.IQATY109                     
WHERE STMT_KEY = X'01329149008E899B000001D6'
  AND STMT_TYPE = 'M'                      
  AND ISOLATION_LEVEL = 'CS'               
;

Results were

QUERY  QNO  PNO  SQ  M  TABLE_NAME    A   CS  INDEX         IO
---------+---------+---------+---------+---------+---------+--
00001  01   01   00  0  IQATY109      I   00  IQAXY1092     Y
00002  01   01   00  0  IQATY109      I   01  IQAXY1092     Y
00003  01   01   00  0  IQATY109      I   02  IQAXY1092     Y
00011  01   01   00  0  IQATY109      R   00                N
00012  01   01   00  0  IQATY109      I   01  IQAXY1092     N
00013  01   01   00  0  IQATY109      I   02  IQAXY1092     N
00014  01   01   00  0  IQATY109      I   02  IQAXY1092     N

00021  01   01   00  0  IQATY109      I   00  IQAXY1092     Y
00022  01   01   00  0  IQATY109      I   01  IQAXY1092     Y
00023  01   01   00  0  IQATY109      I   02  IQAXY1092     Y
00031  01   01   00  0  IQATY109      I   00  IQAXY1093     Y
00032  01   01   00  0  IQATY109      I   01  IQAXY1093     Y
00033  01   01   00  0  IQATY109      I   02  IQAXY1093     Y
00034  01   01   00  0  IQATY109      I   03  IQAXY1093     Y

00041  01   01   00  0  IQATY109      I   00  IQAXY109I     Y
00042  01   01   00  0  IQATY109      I   01  IQAXY109I     Y
00043  01   01   00  0  IQATY109      I   02  IQAXY109I     Y
00051  01   01   00  0  IQATY109      I   00  IQAXY109I     Y
00052  01   01   00  0  IQATY109      I   01  IQAXY109I     Y
00053  01   01   00  0  IQATY109      I   02  IQAXY109I     Y
00054  01   01   00  0  IQATY109      I   02  IQAXY109I     Y

 

As can be seen, the first block gave especially bad results when the ISOLATION_LEVEL was added – leading to a tablespace scan in the worst case scenario!
Creating the second index alleviated the problem, and as you can see, the accesses all went INDEXONLY = Y, but now we have two indexes! That is double the disk space and double the time for updates, inserts and deletes. After I dropped the first two indexes and then created the INCLUDE one, you can see that the access is the same (apart from query 54 only matching on 2 columns instead of 3 of course – here the original three-column index would actually be better, because ISOLATION_LEVEL is in the predicate list and not just in the select list!) and now there is only the one index "to worry about" – for RUNSTATS, REORGs, etc.

 

Now all you need to do is find where you have these "doubly" defined indexes. The method is to look for any unique index whose key columns also appear as the leading columns of another index. Easier said than done… however, LISTSERV can help you here! If you are registered, you can easily find some SQL written by Larry Kirkpatrick that very handily does nearly what you need. Search for "This could be a useful query when going to V10" to get a really neat SQL that you can simply run in SPUFI (you must make one tiny change: the SPACE line must be rewritten to look like INTEGER(A.SPACEF) AS UNIQ_IX_KB, to actually get the allocated space). On my test database it gives this result:

---------+---------+----- --+---------+---------+---------+
UNIQUE_IX_TO_DEL           UNIQ_IX_KB  IX_WITH_PART_UNIQUE     
---------+---------+--------+---------+---------+---------+
IQA061HU.IQAXY1092             10800  IQA061HU.IQAXY1093      
IQA061HU.IQAXY1092             10800  IQA061HU.IQAXY109I

 

Here you can see that it is correctly telling me to drop the 1092 index as it is completely redundant, and of course it finds it again due to my INCLUDEd column index 109I. Now of course what you actually want to do is the inverse of this query: you probably want to DROP the longer index *and* the shorter index and then recreate the shorter one with INCLUDE columns like the longer. Now is also a good time to think about using DSN_VIRTUAL_INDEXES to check the effects of the DROP before you actually do it… and also take a look in my newsletter from June 2011 for details about finding dead indexes (or ask me to resend it).
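If you cannot get at the LISTSERV archive, a home-grown starting point could look something like this (my own rough sketch, not Larry's query, and without the space calculation – it lists unique indexes whose key columns are the leading columns of a longer index on the same table):

SELECT A.CREATOR AS UNIQ_IX_CREATOR, A.NAME AS UNIQUE_IX_TO_DEL
     , B.CREATOR AS OTHER_IX_CREATOR, B.NAME AS IX_WITH_PART_UNIQUE
FROM SYSIBM.SYSINDEXES A
   , SYSIBM.SYSINDEXES B
WHERE A.TBCREATOR  = B.TBCREATOR
  AND A.TBNAME     = B.TBNAME
  AND A.UNIQUERULE <> 'D'                 -- A must be a unique index
  AND NOT (A.CREATOR = B.CREATOR AND A.NAME = B.NAME)
  AND A.COLCOUNT   < B.COLCOUNT
  AND NOT EXISTS (                        -- no key column of A may be missing
        SELECT 1                          -- at the same position in B
        FROM SYSIBM.SYSKEYS KA
        WHERE KA.IXCREATOR = A.CREATOR
          AND KA.IXNAME    = A.NAME
          AND NOT EXISTS (
                SELECT 1
                FROM SYSIBM.SYSKEYS KB
                WHERE KB.IXCREATOR = B.CREATOR
                  AND KB.IXNAME    = B.NAME
                  AND KB.COLSEQ    = KA.COLSEQ
                  AND KB.COLNAME   = KA.COLNAME
                  AND KB.ORDERING  = KA.ORDERING))
;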
Feel free to send me your comments and ask questions.

TTFN,
Roy Boxwell
Senior Architect

2012-04: SOUNDEX and other “cool” features – part one for DB2 V8

New Scalar functions in every release

DB2 has always been delivered with scalar functions, and in every release IBM creates and/or improves more and more. This month I thought I would do a trawl through the archives and list out my personal favorites from DB2 V8, 9, and 10. This is not a complete list of all new scalar functions in each release, but nearly!! In fact there are so many that I decided to split this newsletter into three parts: one for V8, one for 9, and one for 10. Starting off with DB2 V8 – the last ever version of DB2 with a V in the title… shame really, as I was looking forward to V12.

See the  Scalar functions in DB2 V9: Newsletter 2012-05 SOUNDEX part 2

See the Scalar functions in DB2 10:  Newsletter 2012-06  SOUNDEX part 3

 

Starting off with DB2 V8 :  ASCII function

Now V8 introduced us to the ASCII function:

SELECT ASCII('BOXWELL')              
FROM SYSIBM.SYSDUMMY1 ;              
         66

So it takes the leftmost character of any given string expression and returns the ASCII value, in this case 66 for B – I must admit that I have *not* yet found a use for this little gem… Now V8 was also a major change due to the use of UNICODE, and so IBM released CHARACTER_LENGTH to complement the LENGTH scalar function. The difference being that CHARACTER_LENGTH counts the characters and LENGTH counts the bytes. Now in the US or GB these will probably be the same, but in Germany they are nearly always different!

SELECT CHARACTER_LENGTH(                                  
    CAST('DüSSELDORF' AS VARCHAR(128) CCSID UNICODE)      
                       , CODEUNITS32)                     
FROM SYSIBM.SYSDUMMY1 ;                                   
         10                                               

SELECT CHARACTER_LENGTH(                                  
    CAST('DüSSELDORF' AS VARCHAR(128) CCSID UNICODE)      
                       , CODEUNITS16)                     
FROM SYSIBM.SYSDUMMY1 ;                                   
         10                                               

SELECT CHARACTER_LENGTH(                                  
    CAST('DüSSELDORF' AS VARCHAR(128) CCSID UNICODE)      
                       , OCTETS)                          
FROM SYSIBM.SYSDUMMY1 ;                                   
         11                                               

SELECT LENGTH(                                            
    CAST('DüSSELDORF' AS VARCHAR(128) CCSID UNICODE)      
              )                                           
FROM SYSIBM.SYSDUMMY1 ;                                   
         11

Note that CODEUNITS/OCTETS is how you specify the CCSID schema. Not that clear, but you can see that the umlaut (ü) in Düsseldorf is counted as one character for CODEUNITS32 and CODEUNITS16 but as two bytes for OCTETS and the simple LENGTH function. I bet that there are hundreds of SQLs out there that do not use this feature and are therefore not 100% correct! Just waiting for your first UNICODE-named employee or customer to cause problems somewhere! Now of course, with a new LENGTH you also had to be able to SUBSTR the data on a character boundary, *not* on the byte boundary, and sure enough along came SUBSTRING:

SELECT SUBSTRING(                                         
    CAST('DüSSELDORF' AS VARCHAR(128) CCSID UNICODE)      
                , 1 , 2 , CODEUNITS32)                    
FROM SYSIBM.SYSDUMMY1 ;                                   
Dü                                                        

SELECT SUBSTRING(                                         
    CAST('DüSSELDORF' AS VARCHAR(128) CCSID UNICODE)      
                , 1 , 2 , CODEUNITS16)                    
FROM SYSIBM.SYSDUMMY1 ;                                   
Dü                                                        

SELECT SUBSTRING(                                         
    CAST('DüSSELDORF' AS VARCHAR(128) CCSID UNICODE)      
                , 1 , 2 , OCTETS)                         
FROM SYSIBM.SYSDUMMY1 ; 
D                                                                                  

SELECT SUBSTR(                                                         
    CAST('DüSSELDORF' AS VARCHAR(128) CCSID UNICODE)                   
                , 1 , 2)                                               
FROM SYSIBM.SYSDUMMY1 ;                                                

DSNE610I NUMBER OF ROWS DISPLAYED IS 0
DSNT408I SQLCODE = -331, ERROR:  CHARACTER CONVERSION CANNOT BE PERFORMED
         BECAUSE A STRING, POSITION 1, CANNOT BE CONVERTED FROM 1208 TO 1141,
         REASON 16
DSNT418I SQLSTATE   = 22021 SQLSTATE RETURN CODE
DSNT415I SQLERRP    = DSNXROHB SQL PROCEDURE DETECTING ERROR 
DSNT416I SQLERRD    = -117  0  0  -1  0  0 SQL DIAGNOSTIC INFORMATION
DSNT416I SQLERRD    = X'FFFFFF8B'  X'00000000'  X'00000000'  X'FFFFFFFF'
         X'00000000'  X'00000000' SQL DIAGNOSTIC INFORMATION

OK, just like CHARACTER_LENGTH, you "tell" the system which schema to use by giving CODEUNITS32, CODEUNITS16 or OCTETS again. Here you can see what happens if you attempt to use the normal SUBSTR and split a two byte character right down the middle. Again this is a "lurking" bug that will byte people at some point (sorry about that, I could not resist!). Finally, IBM brought out TIMESTAMPDIFF. Bonnie Baker once held a class where she wondered out loud "Who on Earth programmed this?"

 SELECT TIMESTAMPDIFF( 64 ,                                  
         CAST(CURRENT_TIMESTAMP                             
            - CAST('1939-05-10-01.00.00.000000' AS TIMESTAMP)
         AS CHAR(22))) AS GRANDADS_AGE_IN_MONTHS
FROM SYSIBM.SYSDUMMY1 ;                     
---------+---------+---------+---------+---------+---------+-
GRANDADS_AGE_IN_MONTHS                                      
---------+---------+---------+---------+---------+---------+-
                   874

Now the first option is the bit that Bonnie was talking about (oh dear, another pun has crept in – sorry!). This number is just a simple bit flag; the valid values for the numeric-expression and their equivalent intervals are:

1 Microseconds
2 Seconds
4 Minutes
8 Hours
16 Days
32 Weeks
64 Months
128 Quarters
256 Years

The question is just "Why?" Why not use the words??? There is no answer except "Blame the programmer!". The important thing to remember is that TIMESTAMPDIFF works *only* with assumptions and has rounding errors, such as: one year has 365 days, one year has 52 weeks, one month has 30 days. These are obviously not always true (think of leap years or simply the days in February), but if you are looking for ball-park figures – like Grandad – then who cares? The other thing to remember is that the input must be cast as a CHAR(22) like in my example.

So stay tuned for more scalar fun next month with a bumper crop of interesting DB2 9 new functions!

Feel free to send me your comments and ask questions.

TTFN,
Roy Boxwell
Senior Architect

2012-05: SOUNDEX and other “cool” features – part two for DB2 V9

 

Part two of my walk through the new scalar functions, and I will start with a whole new "group" of functions all to do with character manipulation for cross-system strings: ASCII_CHR, ASCII_STR, EBCDIC_CHR, EBCDIC_STR, UNICODE, and UNICODE_STR.

 

See the  Scalar functions in DB2 V8: Newsletter 2012-04 SOUNDEX part 1

See the Scalar functions in DB2 10:  Newsletter 2012-06  SOUNDEX part 3

 

Character manipulation for cross-system strings in DB2 V9

SELECT ASCII('$') -- THIS IS THE V8 FUNCTION AS AN EXAMPLE TO GIVE US INPUT                              
FROM SYSIBM.SYSDUMMY1 ;                                        
         36                                                    
                                                               
SELECT ASCII_CHR(36)                                           
FROM SYSIBM.SYSDUMMY1 ;                                        
          $                                                              

SELECT ASCII_STR(                                              
                'EMBEDDED UNICODE ' CONCAT UX'C385' CONCAT ' FRED'
                )                                              
FROM SYSIBM.SYSDUMMY1 ;                                        
EMBEDDED UNICODE \C385 FRED

DSNT400I SQLCODE = 000,  SUCCESSFUL EXECUTION                  
DSNT418I SQLSTATE   = 01517 SQLSTATE RETURN CODE               

SELECT EBCDIC_CHR(124)                                                 
FROM SYSIBM.SYSDUMMY1 ;                                                
          §                                                                      

SELECT EBCDIC_STR(                                                     
                 'EMBEDDED UNICODE ' CONCAT UX'C385' CONCAT ' FRED'       
                 )                                                      
FROM SYSIBM.SYSDUMMY1 ;                                                
EMBEDDED UNICODE \C385 FRED

DSNT400I SQLCODE = 000,  SUCCESSFUL EXECUTION                          
DSNT418I SQLSTATE   = 01517 SQLSTATE RETURN CODE                       
                                                                       
SELECT UNICODE(UX'C385')                                               
      ,HEX(UNICODE(UX'C385'))                                          
FROM SYSIBM.SYSDUMMY1 ;                                                
      50053  0000C385                                                  
                                                                     
SELECT UNICODE_STR(                                                    
                  'A UNICODE CHAR C385 WITH BACK SLASH AT THE END'     
                  )                                                      
FROM SYSIBM.SYSDUMMY1 ;                                                
A UNICODE CHAR . WITH BACK SLASH AT THE END                           

DSNT400I SQLCODE = 000,  SUCCESSFUL EXECUTION 
DSNT418I SQLSTATE   = 01517 SQLSTATE RETURN CODE

The _CHR functions return the character in the respective code page using the decimal number as input. The _STR functions return a string where the "unprintable" data is prefixed with a backslash (\) and then the hex codes are listed out. Note the extra use of the backslash in the UNICODE_STR input to get a literal \ in UNICODE.

Then we have the SOUNDEX and DIFFERENCE functions, which are pretty neat and excellent for searching for names that are similar but not exact (e.g. Roy and Roi). Here are some samples of both:

SELECT DIFFERENCE('LUCY' , 'JUICY')
      ,SOUNDEX('LUCY')            
      ,SOUNDEX('JUICY')           
FROM SYSIBM.SYSDUMMY1 ;           
          3  L200  J200           

SELECT DIFFERENCE('BEER' , 'BUYER')
      ,SOUNDEX('BEER')            
      ,SOUNDEX('BUYER')           
FROM SYSIBM.SYSDUMMY1 ;           
          4  B600  B600           

SELECT DIFFERENCE('COOL' , 'KEWL')
      ,SOUNDEX('COOL')            
      ,SOUNDEX('KEWL')            
FROM SYSIBM.SYSDUMMY1 ;           
          3  C400  K400           

SELECT DIFFERENCE('AARDVARK' , 'ZOMBIE')
      ,SOUNDEX('AARDVARK')             
      ,SOUNDEX('ZOMBIE')               
FROM SYSIBM.SYSDUMMY1 ;                
          0 A631  Z510

SOUNDEX takes any input string and converts it down into a four character string. As you can see, the first character is the first character of the string and the rest are the "value" of the string. This is mostly of academic interest; it really becomes useful when you plug in DIFFERENCE, which returns a value of 0, 1, 2, 3, or 4: 0 means the words have no similarity, 4 means the words are practically the same. That BEER == BUYER was clear!

MONTHS_BETWEEN harks back to one of my examples from last month

SELECT MONTHS_BETWEEN(                                      
              CURRENT_TIMESTAMP                             
            , CAST('1939-05-10-01.00.00.000000' AS TIMESTAMP)
                     ) AS GRANDADS_AGE_IN_MONTHS            
FROM SYSIBM.SYSDUMMY1 ;                                     
---------+---------+---------+---------+---------+---------+-
           GRANDADS_AGE_IN_MONTHS                           
---------+---------+---------+---------+---------+---------+-
              874.645161290322580

Given two timestamps, it calculates the number of months between them – however, this time more accurately than the bit-flag stuff in TIMESTAMPDIFF.

Lots of string stuff came in 9 as well: LPAD, RPAD, LOCATE_IN_STRING, and OVERLAY. Here are some examples:

SELECT LPAD('FRED', 8, 'Q')
FROM SYSIBM.SYSDUMMY1 ;   
QQQQFRED                  

SELECT RPAD('FRED', 8, 'Q')
FROM SYSIBM.SYSDUMMY1 ;   
FREDQQQQ                  

SELECT LOCATE_IN_STRING('RÖY LIVES IN DÜSSELDORF' ,
                        'Ü' , 1 , CODEUNITS32)    
FROM SYSIBM.SYSDUMMY1 ;                           
         15

LPAD pads any given input string to the left with the supplied padding character. RPAD does the same but from the right hand side. These two are very handy for filling data up with "." to make alignment easier on the eye. LOCATE_IN_STRING does what it says – it returns the position in the string of the character given, but it works, like the CHARACTER_LENGTH and SUBSTRING functions in V8, at the character and not the byte level. Further, there is an "instance" parameter, which I did not use in the example and which defaults to 1, so that you can find the nth occurrence if desired. Very handy if you want to check that the user has entered four hyphens in the input field, etc.

SELECT LOCATE_IN_STRING('TEST 1-2-3-4-5' , '-' , 1 , 4 , CODEUNITS32)
FROM SYSIBM.SYSDUMMY1 ;                                             
         13                                                         

SELECT LOCATE_IN_STRING('TEST 1-2-3-4 5' , '-' , 1 , 4 , CODEUNITS32)
FROM SYSIBM.SYSDUMMY1 ;                                             
          0

Here you can see that I start the search at the start (1) and that I want the position of the 4th occurrence. So the first returns 13 which is correct and the next returns 0 as there are indeed only three hyphens. As you can imagine – lots of potential uses here!

OVERLAY is the same as good old INSERT except that the length argument is optional.

Maths got a small boost in DB2 9 with the introduction of QUANTIZE. A strange function that outputs a quantized DECFLOAT field based on the parameters input. Again I have never found a use for it but it *must* be used somewhere… Here are the documented examples of what it does:

 QUANTIZE(2.17, DECFLOAT(0.001)) = 2.170
 QUANTIZE(2.17, DECFLOAT(0.01)) = 2.17
 QUANTIZE(2.17, DECFLOAT(0.1)) = 2.2
 QUANTIZE(2.17, DECFLOAT('1E+0')) = 2
 QUANTIZE(2.17, DECFLOAT('1E+1')) = 0E+1
 QUANTIZE(2, DECFLOAT(INFINITY)) = NAN -- exception
 QUANTIZE(-0.1, DECFLOAT(1) ) = 0
 QUANTIZE(0, DECFLOAT('1E+5')) = 0E+5
 QUANTIZE(217, DECFLOAT('1E-1')) = 217.0
 QUANTIZE(217, DECFLOAT('1E+0')) = 217
 QUANTIZE(217, DECFLOAT('1E+1')) = 2.2E+2
 QUANTIZE(217, DECFLOAT('1E+2')) = 2E+2

The first parameter is simply changed to have the same look-and-feel as the second parameter definition. I guess when working with floating point, bigint etc. it would be nice to get the output all the same format…

Finally, they brought out a few new DATE/TIME functions: EXTRACT, TIMESTAMPADD and TIMESTAMP_ISO. EXTRACT is a high-level version of existing functions and looks like this:

SELECT EXTRACT(YEAR FROM CURRENT_TIMESTAMP)                            
      ,        YEAR(CURRENT_TIMESTAMP)                                         
      ,EXTRACT(MONTH FROM CURRENT_TIMESTAMP)                           
      ,        MONTH(CURRENT_TIMESTAMP)                                        
      ,EXTRACT(DAY FROM CURRENT_TIMESTAMP)                             
      ,        DAY(CURRENT_TIMESTAMP)                                          
FROM SYSIBM.SYSDUMMY1 ;                                                
  2012         2012            3            3           30           30
                                                                     
SELECT EXTRACT(HOUR FROM CURRENT_TIMESTAMP)                            
      ,        HOUR(CURRENT_TIMESTAMP)                                         
      ,EXTRACT(MINUTE FROM CURRENT_TIMESTAMP)                          
      ,        MINUTE(CURRENT_TIMESTAMP)                                       
      ,EXTRACT(SECOND FROM CURRENT_TIMESTAMP)                          
      ,        SECOND(CURRENT_TIMESTAMP)                                       
FROM SYSIBM.SYSDUMMY1 ;                                                
 10           10           11           11   26.511601              26

As can be seen, the functions are the same (apart from the better accuracy of EXTRACT(SECOND FROM xxxx)), but you could always simply encapsulate it in an INTEGER function and then you would really have 1:1 functionality. The results are all INTEGER apart from SECOND, which is DECIMAL(8,6).

TIMESTAMP_ISO simply returns a version of your given date in the ISO timestamp format – pretty handy when all you get as input is a date.

SELECT TIMESTAMP_ISO (CURRENT DATE)                         
FROM SYSIBM.SYSDUMMY1 ;                                     
2012-03-30-00.00.00.000000

TIMESTAMPADD was obviously written by the same programmer who developed TIMESTAMPDIFF, as it uses the same "odd" bit masking to get the job done. If you remember my example with Grandad's age in months, we can now reverse it like so:

SELECT TIMESTAMPADD( 64                                     
            , 874                                           
            , CAST('1939-05-10-01.00.00.000000' AS TIMESTAMP)
                     ) AS GRANDADS_BIRTHDATE_PLUS_MONTHS    
FROM SYSIBM.SYSDUMMY1 ;                                     
---------+---------+---------+---------+---------+---------+-
GRANDADS_BIRTHDATE_PLUS_MONTHS                              
---------+---------+---------+---------+---------+---------+-
2012-03-10-01.00.00.000000

Aha! The first parameter is 64 (which means months – look at last month's newsletter for a table listing all valid values) and the second parameter is 874, so it adds 874 months to the supplied timestamp. There are better uses, I am sure!

Last but not least are two TEXT extender functions that I have not actually used but seem pretty neat. CONTAINS searches through the text index for a hit. The result is 1 if the text is found in the document. Here is an example from the docu:

SELECT EMPNO
FROM EMP_RESUME
WHERE RESUME_FORMAT = 'ascii'
AND CONTAINS(RESUME, 'cobol') = 1

If the text index column RESUME contains the word “cobol” it returns 1

Finally SCORE, just cut-and-pasted from the docu as I have not got the text extender stuff installed:

The following statement generates a list of employees in the order of how well their resumes matches the query “programmer AND (java OR cobol)”, along with a relevance value that is normalized between 0 (zero) and 100.

SELECT EMPNO, INTEGER(SCORE(RESUME, 'programmer AND
      (java OR cobol)') * 100) AS RELEVANCE
  FROM EMP_RESUME
 WHERE RESUME_FORMAT = 'ascii'
   AND CONTAINS(RESUME, 'programmer AND (java OR cobol)') = 1
 ORDER BY RELEVANCE DESC

 

DB2 first evaluates the CONTAINS predicate in the WHERE clause, and therefore, does not evaluate the SCORE function in the SELECT list for every row of the table. In this case, the arguments for SCORE and CONTAINS must be identical.

If you have the text support then these two functions could be very useful indeed!
OK, next month DB2 10 functions!
Feel free to send me your comments and ask questions.

TTFN,
Roy Boxwell
Senior Architect

 

 

2012-08: DSC Control – ZPARMs and Aggregation

 

The Dynamic Statement Cache (DSC) has been around for a long, long time (since DB2 V5 in fact), but I still meet people – and, more importantly, DB2 subsystems – who/which are not clued up on the correct way to monitor, check and tune the DSC so that it, and the enterprise, fly along at a nice cruising speed.

The DSC is where any PREPAREd statement lands if caching is enabled. And if the very same SQL statement is then PREPAREd again then this PREPARE can be either completely avoided – very good; or made into a “short” PREPARE – good. Remember that the primary authorization Id is also part of the “key” so you can still get duplicate DSC records but for different Authorizations. (In detail all of the BIND Options [CURRENTDATA, DYNAMICRULES, ISOLATION, SQLRULES, and QUALIFIER], Special registers [CURRENT DEGREE, CURRENT RULES, CURRENT PRECISION, CURRENT PATH and CURRENT OPTIMIZATION HINT], Authorization Id, Declared cursor data [HOLD, STATIC, SENSITIVE, INSENSITIVE, SCROLLABLE], Parser options [APOST or QUOTE delimiter, PERIOD or COMMA decimal delimiter, date and time format DECIMAL options] and the complete Attribute string must all be 100% the same – in fact it is a MIRACLE that there is ever a hit!).

In a perfect world, the SQL in the cache should stay there forever. However reality dictates that two days is about the best you can aim for, and I have seen shops where 2 hours is good. Of course at 10 minutes you are just thrashing the cache and you might as well switch it off to improve performance. Remember that sometimes no cache is better.
The major controller of the DSC is the ZPARM CACHEDYN, which you simply set to YES or NO. Set to NO, it severely limits the use of the cache, as the cache is really two caches: the LOCAL cache for the thread and the GLOBAL cache for everyone. The LOCAL cache is storage local to the thread, uses a simple FIFO (First In, First Out) queue, and is controlled by the MAXKEEPD ZPARM, which is normally set and left at 5000 statements. The GLOBAL cache is in the EDMPOOL (EDMP) and uses a sophisticated LRU (Least Recently Used) algorithm, but its true size has varied a lot over the years:

 

Version   MAXKEEPD      CACHEDYN   Notes
          (0 – 65535)   (YES/NO)
  5         5000           NO      EDMPOOL was the cache
  6         5000           NO      If CACHEDYN "YES" then new EDMDSPAC with valid range 1K – 2,097,152K
  7         5000           NO      New EDMDSMAX with valid range 0 – 2,097,152K and default 1,048,576K
  8         5000           YES     EDMDSPAC and EDMDSMAX removed. New EDMSTMTC with valid range
                                   5,000K – 1,048,576K and new "opaque" ZPARM CACHEDYN_FREELOCAL
                                   with valid values 0 or 1 and default 0 (off)
  9         5000           YES     CACHEDYN_FREELOCAL default changed to 1 (on) and EDMSTMTC default
                                   changed to 56,693K
 10         5000           YES     EDMSTMTC default changed to 113,386K

 

The LOCAL size default has not changed a bit, but the other defaults and the location of the GLOBAL have changed regularly! Now the GLOBAL is 113,386K and that is HUGE!
Setting the KEEPDYNAMIC bind option to NO is also not brilliant for the DSC but if the ZPARM CACHEDYN is YES you can still get the GLOBAL cache benefit. This now leads us to the four flavors of PREPARE:

 

• Full
Skeleton copy of the SQL is not in the cache or the cache is not active. Caused by a PREPARE or EXECUTE IMMEDIATE statement.
• Short
A skeleton copy of the PREPAREd SQL statement is copied to local storage.
• Avoided
PREPARE avoided by using full caching. PREPAREd statement information is still in thread’s local storage.
• Implicit
Due to limits, such as MAXKEEPD, a PREPARE cannot be avoided and DB2 will issue the PREPARE on behalf of the application.

A Full PREPARE takes the most time to process (Think of 100 here) then a Short PREPARE (Think of one here) and then an Avoided PREPARE (Think of zero here)

So now in a parameter matrix:

			  CACHEDYN NO 		  CACHEDYN YES
KEEPDYNAMIC(NO) 	-> No caching		-> GLOBAL cache
KEEPDYNAMIC(YES)	-> LOCAL cache		-> LOCAL and GLOBAL (Called FULL)

 

LOCAL keeps no skeletons in the EDMP, only allows FULL PREPAREs, allows you to PREPARE once over COMMITs (only for WITH HOLD cursors), and statement strings are kept across commits, which gives you Implicit PREPAREs.

GLOBAL keeps skeletons in the EDMP, 1st PREPARE is FULL, others are SHORT, allows you to PREPARE once over COMMITS (Only for WITH HOLD cursors) and no statement strings are kept across commits.

FULL keeps skeletons in the EDMP, 1st PREPARE is FULL, others are SHORT, keeps PREPAREd statements over COMMITS (avoided PREPARES) and statement strings are kept across commits which gives you Implicit PREPARES.

So remember MAXKEEPD > 0, CACHEDYN=YES and on the BIND statement KEEPDYNAMIC(YES) to get the most out of the DSC but watch out for memory growth.
OK, so now we know what knobs and buttons to fiddle with. The next question is: Where should I concentrate my efforts? Now you can easily extract the DSC with an EXPLAIN statement and you can dump the contents and then analyze all of the data. There are LOTS of columns full of great data but you must remember to switch on IFCID 318 when you start DB2; otherwise all the really good data will be zero!

 

To start the IFCID 318 you must issue a

-START TRACE(PERFM) CLASS(31) IFCID(318)

command whenever DB2 has started up. It is also recommended to always have the accounting trace class(3) started, by using either the installation parameter "SMF ACCOUNTING" field with a value of "*" or "3" on panel DSNTIPB, or by issuing a

-START TRACE(ACCTG) CLASS(3)

command.

 

Now of course you have between 5,000 and 120,000 rows of data to work with. First off the bat are the heavy hitters, where you simply ORDER BY "CPU" or "GETPAGES" to pick out simple fixes where an INDEX or an additional column could avoid a sort or a data scan. But then you run out of low hanging fruit and you will want to aggregate your SQL. The best method is to group together the SQL which is "the same" but "different". To do this you must make sure that a given SQL statement has the same tables in all of the FROM statements and that the predicates are the same. You can basically ignore the SELECT line and also the literals in the predicates to begin with. This then aggregates similar SQL together so that you can see the "big picture". I have been to sites where 120,000 statements actually shrank down to only 900 different statements. The vast majority was literal usage in predicates instead of parameter markers, which is of course anathema to the DSC *but* could be a performance requirement! Remember that "It Depends" is the correct answer – but what you should not be doing is trying to wrap your head around 120,000 individual statements!
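If you want to roll your own extract, a minimal sketch looks something like this (the table qualifier is made up, and the STAT_* columns only contain useful numbers when IFCID 318 is active, as mentioned above):

EXPLAIN STMTCACHE ALL;            -- externalize the DSC into DSN_STATEMENT_CACHE_TABLE

SELECT STAT_EXEC                  -- number of executions
     , STAT_CPU                   -- accumulated CPU time
     , STAT_GPAG                  -- getpages
     , STAT_ELAP                  -- accumulated elapsed time
     , SUBSTR(STMT_TEXT, 1, 100) AS STMT_START
FROM BOXWELL.DSN_STATEMENT_CACHE_TABLE
ORDER BY STAT_CPU DESC            -- or STAT_GPAG for the getpage heavy hitters
FETCH FIRST 50 ROWS ONLY
;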

Once you have a method to do this (Or you buy our SQL PerformanceExpert  or  Bind ImpactExpert that does this for you of course!) you can repeatedly extract and check that all is “well” in the world of the DSC. Finally, you should start doing “baselines” where you snap the cache and then a month or so later, or just before/after a new roll out, you snap the cache again and then compare the two cache extracts against one another. This enables you to see which SQL is “new” in the Dynamic World, and if your cache allows it you can then get a “rolling DSC” that contains all the dynamic SQL at your shop. This is a very powerful audit and performance possibility that not many shops are doing today. Just having the chance to easily see what the JAVA developers will be running next week in production is worth its weight in gold!

 

OK, that’s all for this month, as usual any questions or comments are gladly wanted/wished!
TTFN,
Roy Boxwell
Senior Architect

 

2012-09: DB2 Catalog Statistics – revisited

1 – Which data is used by the DB2 Optimizer and which is updated by RUNSTATS?

2 – Which default column values trigger the DB2 Optimizer to use its own internal default values?

 

Every now and again, I hold a little presentation called "Are you a RUNSTATS Master?" where I describe, in detail, what the DB2 Optimizer uses for access path selection in relation to the DB2 catalog data.

I am always surprised at how often people say “just that data?” or “Is that it?” (the various other reasons for access path selection like CP speed, Number of CPs, RID Pool size, Sort Pool size, Max data caching size, and, of course, the 80 bufferpools are also mentioned, but these have nothing to do with RUNSTATS).

So generally the answer is “Yes”, however the permutations and combinations make the devil in the detail – The DB2 Optimizer’s algorithms are top secret, but the input data it uses is fully described in the documentation.

What I want to do this month is show the Catalog data that is used, the default values that can cause surprising things to happen and the problem of correlations in the catalog.

 

First is – Which data is used by the DB2 Optimizer and which is updated by RUNSTATS?

Here is a complete list of the eleven tables used by the DB2 Optimizer:

  • SYSIBM.SYSCOLDIST
  • SYSIBM.SYSCOLSTATS *
  • SYSIBM.SYSCOLUMNS
  • SYSIBM.SYSINDEXES
  • SYSIBM.SYSINDEXPART
  • SYSIBM.SYSKEYTARGETS            9 and up (same as SYSCOLUMNS)
  • SYSIBM.SYSKEYTGTDIST             9 and up (same as SYSCOLDIST)
  • SYSIBM.SYSROUTINES
  • SYSIBM.SYSTABLES
  • SYSIBM.SYSTABLESPACE
  • SYSIBM.SYSTABSTATS

* degree of parallelism only and, after APAR PK62804, also "sometimes" used to bound filter factor estimates…
Now we can also list out all of the columns (obviously not including the key columns) which are used by the DB2 Optimizer:

SYSCOLDIST
CARDF, COLGROUPCOLNO, COLVALUE, FREQUENCYF, HIGHVALUE, LOWVALUE, NUMCOLUMNS, QUANTILENO, STATSTIME

SYSCOLSTATS
COLCARD, HIGHKEY, LOWKEY

SYSCOLUMNS
COLCARDF, HIGH2KEY, LOW2KEY

SYSINDEXES
CLUSTERING*, CLUSTERRATIO, CLUSTERRATIOF, DATAREPEATFACTORF, FIRSTKEYCARDF, FULLKEYCARDF, NLEAF, NLEVELS

SYSINDEXPART
LIMITKEY*

SYSKEYTARGETS
CARDF, HIGH2KEY, LOW2KEY, STATS_FORMAT

SYSKEYTGTDIST
CARDF, KEYGROUPKEYNO, KEYVALUE, FREQUENCYF, HIGHVALUE, LOWVALUE, NUMKEYS, QUANTILENO, STATSTIME

SYSROUTINES
CARDINALITY*, INITIAL_INSTS*, INITIAL_IOS*, INSTS_PER_INVOC*, IOS_PER_INVOC*

SYSTABLES
CARDF, EDPROC*, NPAGES, NPAGESF, PCTROWCOMP

SYSTABLESPACE
NACTIVE, NACTIVEF

SYSTABSTATS
CARD, CARDF, NPAGES

Notes: columns marked with * are not updated by RUNSTATS, and some columns are not updatable at all. The column STATSTIME is used only if there are duplicates in the SYSCOLDIST table, in which case the DB2 Optimizer will use the "newer" data, which was probably inserted by a user.

 

Second is – Which default column values trigger the DB2 Optimizer to use its own internal default values?

SYSCOLUMNS
If COLCARDF= -1 then use 25
SYSINDEXES
If CLUSTERRATIOF<= 0 then use CLUSTERRATIO
If CLUSTERRATIO<= 0 then use 0.95 if the index is CLUSTERing = ‘Y’ otherwise 0.00
If FIRSTKEYCARDF= -1 then use 25
If FULLKEYCARDF= -1 then use 25
If NLEAF= -1 then use 33 (Which is SYSTABLES.CARDF / 300)
If NLEVELS= -1 then use 2
SYSROUTINES
If CARDINALITY= -1 then use 10,000
If INITIAL_INSTS= -1 then use 40,000
If INITIAL_IOS= -1 then use 0
If INSTS_PER_INVOC= -1 then use 4,000
If IOS_PER_INVOC= -1 then use 0
SYSTABLES
If CARDF= -1 then use 10,000
If NPAGESF<= 0 then use NPAGES
If NPAGES= -1 then use 501 (Which is CEILING (1 + SYSTABLES.CARDF / 20))
SYSTABLESPACE
If NACTIVEF<= 0 then use NACTIVE
If NACTIVE<= 0 then use 501 (Which is CEILING (1 + SYSTABLES.CARDF / 20))
SYSTABSTATS
If CARDF= -1 then use SYSTABSTATS.CARD
If CARD= -1 then use 10,000
If NPAGES= -1 then use 501 (Which is CEILING (1 + SYSTABSTATS.CARDF / 20))

 

So now you can see that non-floating point "old" data may still be used today and can then cause access path headaches!

Now to top it all, the data in the SYSCOLDIST and SYSKEYTGTDIST never gets simply “deleted”. Once that data is inserted it stays there, until it is overwritten by new data or the object is dropped. This all leads to some very old data in these two tables that can and does cause the DB2 Optimizer a ton of grief! One of the first things I do is select the MIN(STATSTIME) from these tables just to see how old the data really is. Do it yourself and be surprised! I have seen sites with eight years old data in the SYSCOLDIST and that cannot be good!
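The check itself is trivial – something along these lines:

SELECT MIN(STATSTIME) AS OLDEST_COLDIST_ENTRY
FROM SYSIBM.SYSCOLDIST
;
SELECT MIN(STATSTIME) AS OLDEST_KEYTGTDIST_ENTRY
FROM SYSIBM.SYSKEYTGTDIST
;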

Finally, now onto correlations… There are lots of little tricks that DBAs use to massage access path choice, and one of these is to just set NLEVELS to 15 for a given index. Then lots of queries simply refuse to touch it, as it would appear to be HUGE. But naively updating columns can cause the DB2 Optimizer, in the best case, to ignore your updates – or perhaps make things even worse! So here is a list of the correlations (in other words, if you change xxx, remember to change yyy and zzz as well):

  • Relationships exist among certain columns of certain tables:
    • Columns within SYSCOLUMNS
    • Columns in the tables SYSCOLUMNS and SYSINDEXES
    • Columns in the tables SYSCOLUMNS and SYSCOLDIST
    • Columns in the tables SYSCOLUMNS, SYSCOLDIST, and SYSINDEXES
  • If you plan to update some values, keep in mind the following correlations:
    • COLCARDF and FIRSTKEYCARDF/FULLKEYCARDF (They must be equal for the 1st column and full, if a single column index)
    • COLCARDF, LOW2KEY and HIGH2KEY. (For non-default COLCARDF LOW2KEY and HIGH2KEY key must be filled with data) and if the COLCARDF is 1 or 2 DB2 uses LOW2KEY and HIGH2KEY as domain statistics to generate frequencies.
    • CARDF in SYSCOLDIST.  CARDF is related to COLCARDF and FIRSTKEYCARDF and FULLKEYCARDF. It must be at a minimum
      • A value between FIRSTKEYCARDF and FULLKEYCARDF if the index contains the same set of columns
      • A value between MAX(colcardf of each col) and the product of all the columns COLCARDFs in the group
    • CARDF in SYSTABLES. CARDF must be equal or larger than any other cardinalities, such as COLCARDF, FIRSTKEYCARDF, FULLKEYCARDF, and CARDF in SYSCOLDIST
    • FREQUENCYF and COLCARDF or CARDF. The number of frequencies collected must be less than or equal to COLCARDF for the column or CARDF for the column group
    • FREQUENCYF. The sum of frequencies collected for a column or column group must be less than or equal to 1
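As a taste of what checking these correlations looks like, here is a rough sketch of my own for the first one in the list – for the leading column of each index, COLCARDF should match FIRSTKEYCARDF (ignoring the -1 defaults):

SELECT IX.CREATOR, IX.NAME, KY.COLNAME
     , CO.COLCARDF, IX.FIRSTKEYCARDF
FROM SYSIBM.SYSINDEXES IX
   , SYSIBM.SYSKEYS    KY
   , SYSIBM.SYSCOLUMNS CO
WHERE KY.IXCREATOR = IX.CREATOR
  AND KY.IXNAME    = IX.NAME
  AND KY.COLSEQ    = 1                    -- leading index column only
  AND CO.TBCREATOR = IX.TBCREATOR
  AND CO.TBNAME    = IX.TBNAME
  AND CO.NAME      = KY.COLNAME
  AND CO.COLCARDF  > -1                   -- ignore default values
  AND IX.FIRSTKEYCARDF > -1
  AND CO.COLCARDF  <> IX.FIRSTKEYCARDF    -- these should normally be equal
;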

 

Do not forget that our little Freeware tool StatisticsHealthCheck will find all bad correlations, old data and badly updated data for you and it is FREE!

So I hope this little round-up of Catalog Statistics data was interesting, and, as usual, if you have any  comments or questions, then please, feel free to mail me!

2012-10: Existence check SQL – Through the ages

 

Before DB2 V7
DB2 V7
DB2 V8

 

Over the years, we have all been faced with the problem of checking for existence. (not just with DB2 and SQL either – that is for another BLOG!).

Now in the days before even SYSIBM.SYSDUMMY1 existed, it was pretty nasty, and in my old firm we created our own single row dummy table for exactly this requirement (and for doing date arithmetic etc.). However, as the programmers of the world re-invent the wheel time and time again, the simple existence check has also evolved over the years.

Here is a brief selection of the various ways that people have done it in the past, and could well be still running today, every one to two seconds, somewhere in the world…

 

Before DB2 V7

Line 188 in the EXPLAIN Output

SELECT COUNT(*)
FROM ZIPCODES
WHERE COL1 = 'xxx'

If the count(*) was > 0 – BINGO!
(Line 199 WITH UR)

Or

Line 211
SELECT 1
FROM ZIPCODES
WHERE COL1 = 'xxx'

If the SQLCODE was 0 or -811 – BINGO!
(Line 222 WITH UR)

Or

Line 234
SELECT 1
FROM SYSIBM.SYSDUMMY1
WHERE EXISTS ( SELECT 1
               FROM ZIPCODES A
               WHERE A.COL1 = 'xxx')

If the SQLCODE was 0 – BINGO!
(Line 247 WITH UR)

Or

Line 261

SELECT 1
FROM SYSIBM.SYSDUMMY1 D
WHERE EXISTS ( SELECT 1
               FROM ZIPCODES A
               WHERE A.COL1 = 'xxx'
                 AND D.IBMREQD = D.IBMREQD )

 

If the SQLCODE was 0 – BINGO!
Note the use of the “correlated” predicate to actually force good access!
(Line 275 WITH UR)

 

DB2 V7

In DB2 V7 you could then add FETCH FIRST 1 ROW ONLY, so the scan would stop early:

Line 290
SELECT 1
FROM ZIPCODES
WHERE COL1 = 'xxx'
FETCH FIRST 1 ROW ONLY

If the SQLCODE was 0 – BINGO!
(Line 302 WITH UR)
Adding WITH UR "allegedly" also sped up the processing of SYSDUMMY1, and the possibility arrived to declare a cursor with OPTIMIZE FOR 1 ROW and issue a single FETCH.

 

DB2 V8

In DB2 V8 you could then drop the correlation check and still get good performance.
Same as line 234:
SELECT 1
FROM SYSIBM.SYSDUMMY1
WHERE EXISTS ( SELECT 1
               FROM ZIPCODES A
               WHERE A.COL1 = 'xxx' )

If the SQLCODE was 0 – BINGO!

 

As Terry Purcell wrote on listserv many years ago, “The FETCH FIRST 1 ROW ONLY made all other existence checks obsolete”, and so now is the time to dig through all of your old SQL, and see if you can improve even simple existence checks!

 

Here is the EXPLAIN output of all of these variations in a DB2 10 NF:

---------+---------+---------+---------+---------+---------+---------
LINE   QNO  TABLE_NAME    A   P  CE  TYPE             MS        SU
---------+---------+---------+---------+---------+---------+---------
00188  01   ZIPCODES      R   S  R   SELECT          137        388
00199  01   ZIPCODES      R   S  R   SELECT          137        388
00211  01   ZIPCODES      R   S      SELECT          187        529
00222  01   ZIPCODES      R   S      SELECT          187        529
00234  01   SYSDUMMY1     R   S      SELECT           11         29
00234  02   ZIPCODES      R   S      NCOSUB           11         29
00247  01   SYSDUMMY1     R   S      SELECT           11         29
00247  02   ZIPCODES      R   S      NCOSUB           11         29
00261  01   SYSDUMMY1     R   S      SELECT            2          6
00261  02   ZIPCODES      R   S      CORSUB            2          6
00275  01   SYSDUMMY1     R   S      SELECT            2          6
00275  02   ZIPCODES      R   S      CORSUB            2          6
00290  01   ZIPCODES      R          SELECT            1          1
00302  01   ZIPCODES      R          SELECT            1          1

 

Access is always a tablespace scan, and sometimes sequential prefetch is active. So you can see that, today, the WITH UR makes no difference, and the absolute best performance is indeed with the FETCH FIRST 1 ROW ONLY code. Just for fun, I ran a COBOL program calling each of these methods 10,000,000 times… the CPU usage varied from 10.87 seconds for the worst to 10.63 seconds for the best. That is how good DB2 really is: 0.24 / 10,000,000 is a very, very small number indeed!
Sometimes you can teach an old dog new tricks!

Finally here’s a real cutey from about ten years ago, where the task was not to see if a given row existed but if a given TABLE existed.

 

DECLARE TESTTAB CURSOR FOR
SELECT 'PS_DOES_TABLE_EXIST'
FROM PS_TAX_BALANCE
WHERE 1 = 0
FOR FETCH ONLY

I hope this has been fixed to actually query the DB2 catalog these days, as in DB2 9, it now gets a PRUNED access path.

 ---------+---------+---------+---------+---------+---------+---
 LINE   QNO   TABLE_NAME  A P CE  TYPE       MS       SU
 ---------+---------+---------+---------+---------+---------+---
 00319 01                         PRUNED     0         0

Which would probably really kill the application!!!
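If all you really want to know is whether a given table exists, then asking the catalog directly is the cleaner way – a sketch (the schema is made up):

SELECT 1
FROM SYSIBM.SYSTABLES
WHERE CREATOR = 'SYSADM'          -- hypothetical schema
  AND NAME    = 'PS_TAX_BALANCE'
  AND TYPE    = 'T'               -- a real table, not a view or alias
FETCH FIRST 1 ROW ONLY
WITH UR
;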

 

As usual, any  questions or comments, then please feel free to mail me!

2012-11: EXPLAIN table maintenance and clean-up

Remember the good old days when there was a <userid>.PLAN_TABLE with all your access path data in it?
When the package was dropped or changed by some program maintenance you could simply do a DELETE WHERE NOT EXISTS style SQL to keep your data up to date and pristine?

Of course those days have *long* gone…
Nowadays everyone on the block has got DataStudio installed and it installs a whole bunch of “new” hidden tables to enable Query Tuning. This is OK of course, but do you actually take care of this data? If it is the production EXPLAIN tables – Are you image copying, runstating and reorging when you should? Do you have all the indexes you need? Are you deleting too much or not enough data? Can you, in fact, delete anything these days without the risk that you will delete an access path that may “come back from the dead”?

First, a quick review of which tables you may well have in existence at the moment:

 

DB2 V8

PLAN_TABLE                  The good old original, now with 58 columns
DSN_STATEMNT_TABLE          Also available as a published API, with 12 columns used for costs
DSN_FUNCTION_TABLE          As above, with 15 columns for checking which function is called
DSN_DETCOST_TABLE           The hidden tables first officially externalized by DataStudio
DSN_FILTER_TABLE
DSN_PGRANGE_TABLE
DSN_PGROUP_TABLE
DSN_PREDICAT_TABLE
DSN_PTASK_TABLE
DSN_QUERY_TABLE             Not populated by EXPLAIN(YES) for BIND/REBIND
DSN_SORT_TABLE
DSN_SORTKEY_TABLE
DSN_STRUCT_TABLE
DSN_VIEWREF_TABLE

Now, most of the descriptive text for the new ones is "IBM internal use only", but some of them are pretty cool! Please remember that all this data is as-is, and will probably *not* help you in your day-to-day tuning one little bit. That being said, I do use the DSN_PREDICAT_TABLE quite a lot.
The first "problem" is the DSN_QUERY_TABLE, because it contains a CLOB(2M) column, which means that it can grow very, very quickly – well, not the DSN_QUERY_TABLE itself of course, rather its auxiliary table that holds the LOB data.

 

DB2 9

PLAN_TABLE                  Now with 59 columns
DSN_STATEMNT_TABLE
DSN_FUNCTION_TABLE
DSN_DETCOST_TABLE
DSN_FILTER_TABLE
DSN_PGRANGE_TABLE
DSN_PGROUP_TABLE
DSN_PREDICAT_TABLE
DSN_PTASK_TABLE
DSN_QUERYINFO_TABLE         New for query rewrite & accelerator usage, and with two LOBs
DSN_QUERY_TABLE
DSN_SORT_TABLE
DSN_SORTKEY_TABLE
DSN_STATEMENT_CACHE_TABLE   New, with a LOB column
DSN_STRUCT_TABLE
DSN_VIEWREF_TABLE

The DSN_STATEMENT_CACHE_TABLE must be separately maintained as it has nothing to do with the PLAN_TABLE of course!

 

DB2 10

PLAN_TABLE                  Now with 64 columns
DSN_STATEMNT_TABLE
DSN_FUNCTION_TABLE
DSN_COLDIST_TABLE           New, and contains the SYSCOLDIST data
DSN_DETCOST_TABLE
DSN_FILTER_TABLE
DSN_KEYTGTDIST_TABLE        New, and contains the SYSKEYTGTDIST data
DSN_PGRANGE_TABLE
DSN_PGROUP_TABLE
DSN_PREDICAT_TABLE
DSN_PTASK_TABLE
DSN_QUERYINFO_TABLE
DSN_QUERY_TABLE
DSN_SORT_TABLE
DSN_SORTKEY_TABLE
DSN_STATEMENT_CACHE_TABLE
DSN_STRUCT_TABLE
DSN_VIEWREF_TABLE

 

Now it must be apparent that the amount and size of this data is getting out of hand. If someone, by accident, creates all of these tables in production, then every BIND or REBIND with EXPLAIN(YES) will be doing a vast amount of I/O, possibly up to the point that you run out of space in one or more of the tablespaces (especially the LOB ones really hurt!). I actually filled two packs with data… cost me a pizza…

In DB2 10, IBM aligned all of the EXPLAIN tables to have a “common” key: QUERYNO, APPLNAME, PROGNAME, COLLID, GROUP_MEMBER, SECTNOI, VERSION and EXPLAIN_TIME. Before this, some or nearly all of these columns were missing (DSN_PGRANGE_TABLE and DSN_QUERY_TABLE being the worst, with only QUERYNO, GROUP_MEMBER, and EXPLAIN_TIME as a “key”).

This leads to some minor problems in determining which rows can be deleted, but nothing disastrous!

To clean up in DB2 10, the simplest and best method is a DELETE WHERE NOT EXISTS from the child tables to the PLAN_TABLE, and as the keys are nowadays all the same, it is very simple:

 

DELETE
FROM <userid>.DSN_<name>_TABLE A 
WHERE NOT EXISTS (SELECT 1
                  FROM <userid>.PLAN_TABLE B
                  WHERE A.QUERYNO = B.QUERYNO
                    AND A.APPLNAME = B.APPLNAME
                    AND A.PROGNAME = B.PROGNAME
                    AND A.COLLID = B.COLLID
                    AND A.GROUP_MEMBER = B.GROUP_MEMBER
                    AND A.SECTNOI = B.SECTNOI
                    AND A.VERSION = B.VERSION
                    AND A.EXPLAIN_TIME = B.EXPLAIN_TIME)
;
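If any of your tables are still in the older, narrower format (the DSN_QUERY_TABLE “key” mentioned above, for instance), the clean-up can only use the columns that actually exist. A minimal sketch for such an old-style table – assuming it really does still have only the three-column “key”:

DELETE
FROM <userid>.DSN_QUERY_TABLE A
WHERE NOT EXISTS (SELECT 1
                  FROM <userid>.PLAN_TABLE B
                  WHERE A.QUERYNO      = B.QUERYNO
                    AND A.GROUP_MEMBER = B.GROUP_MEMBER
                    AND A.EXPLAIN_TIME = B.EXPLAIN_TIME)
;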

And you are done once you have done this on all 16 tables…

DSN_STATEMNT_TABLE
DSN_FUNCTION_TABLE
DSN_COLDIST_TABLE
DSN_DETCOST_TABLE
DSN_FILTER_TABLE
DSN_KEYTGTDIST_TABLE
DSN_PGRANGE_TABLE
DSN_PGROUP_TABLE
DSN_PREDICAT_TABLE
DSN_PTASK_TABLE
DSN_QUERYINFO_TABLE
DSN_QUERY_TABLE
DSN_SORT_TABLE
DSN_SORTKEY_TABLE
DSN_STRUCT_TABLE
DSN_VIEWREF_TABLE

Of course the first real work is finding out which data you no longer need in the <userid>.PLAN_TABLE, and here the new catalog table SYSIBM.SYSPACKCOPY can help you – once you have bound everything in NFM, of course!

  
DELETE
FROM <userid>.PLAN_TABLE A
WHERE NOT EXISTS (SELECT 1
                  FROM (SELECT COLLID
                             , NAME
                             , BINDTIME
                        FROM SYSIBM.SYSPACKCOPY
                        UNION ALL
                        SELECT COLLID
                             , NAME
                             , BINDTIME
                        FROM SYSIBM.SYSPACKAGE
                       ) B
                  WHERE A.COLLID = B.COLLID
                    AND A.PROGNAME = B.NAME
                    AND A.BIND_TIME = B.BINDTIME)
;

This finds all of the currently “in use” (or possibly in use) collections and packages and then simply deletes from the PLAN_TABLE any rows that are not there. Having run this, you can then run the 16 other tidy-up queries before finally doing a REORG with inline RUNSTATS and COPY, as this data is critical for your business!

Now, as it is critical… where have you got it stored? Does it have the same stringent standards as other production data? One table per tablespace, with a correct SEGSIZE, bufferpool, etc.? Now is the time to review where your EXPLAIN data actually lives and take corrective action – it will speed up BIND and REBIND as deadlocks will reduce – and finally add all of the tablespaces to your DB2 database maintenance system! Just doing INSERTs is all well and good, but when you try and SELECT the data, the performance can be terrible! For this the RealTime DBAExpert (RTDX) is your friend.

At this point it is also worth remembering that indexes can be a great help! You should have two indexes on the PLAN_TABLE: one on BIND_TIME, COLLID, PROGNAME and one on BIND_TIME, QUERYNO, PROGNAME (a sketch follows below). All of the other EXPLAIN tables need at least EXPLAIN_TIME, QUERYNO, and PROGNAME. You can happily add to or alter these, of course – it might even make sense to have indexes on QUERYNO, APPLNAME, PROGNAME, COLLID, GROUP_MEMBER, SECTNOI, VERSION, and EXPLAIN_TIME. Feel free to mix’n’match to suit what you are doing with your tables! For example, if using HINTs, it is recommended to have QUERYNO, APPLNAME, PROGNAME, VERSION, COLLID, OPTHINT, or the performance will be pretty bad.
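The DDL is nothing special – the index names below are purely hypothetical and the columns simply follow the recommendation above:

CREATE INDEX <userid>.PLAN_TABLE_IX1
    ON <userid>.PLAN_TABLE
       (BIND_TIME, COLLID, PROGNAME);

CREATE INDEX <userid>.PLAN_TABLE_IX2
    ON <userid>.PLAN_TABLE
       (BIND_TIME, QUERYNO, PROGNAME);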
Bottom line is – Just make sure that you *only* create the EXPLAIN tables you actually *need*
and you *only* create the indexes you actually *need*

 

Happy EXPLAINing! As usual, comments and queries are always welcome!

TTFN,
Roy Boxwell
Senior Architect

2011-11: Index Compression – Which ones are worth it?

Before DB2 9, index spaces were stuck at 4K pages; then in DB2 9, the full range of page sizes was opened up – 8K, 16K and even 32K pages. Knowing that the classic answer to any DB2 question is “It Depends”, for index compression that is still the correct answer!

The PROs for index compression are:

    • Less disk space
    • Fewer levels in the index
    • Faster I/O

Of course, the world is not really that simple. There are also CONs for index compression:

    • Increased CPU
    • Worse I/O response time
    • Increased bufferpools (more bufferpools, not just increasing the sizes of existing ones)
    • Real memory
    • Standards

Also bear in mind that you do not *have* to use compression to use the larger spaces. In fact, just switching to a larger page size can reduce index splits and index levels without the CPU and bufferpool tradeoffs that you must accept with compression.

If you’re interested in compressing indexes, here are a few quick facts about how it works:

    • The compression is done purely on disk. In the bufferpool, on the log, etc., the data is always uncompressed.
    • The compression is done at the “page” level.
    • The compression is only on the LEAF pages (it uses the old VSAM methodology).
    • The index *must* be defined with an 8K, 16K, or 32K page size, as these are then compressed down to a 4K page on disk.
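Turning compression on is just a clause on the index – here is a minimal sketch with purely made-up index, table, and column names (the page size comes from the buffer pool you pick):

CREATE INDEX <creator>.ORDER_IX1
    ON <creator>.ORDER_TABLE (ORDER_NO)
    BUFFERPOOL BP8K0
    COMPRESS YES;

An existing index can be switched over with an ALTER INDEX to an 8K, 16K, or 32K bufferpool plus COMPRESS YES, but be aware that it will then typically need a REBUILD INDEX or REORG before it is usable again.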

How do you choose which index to compress? Here are some simple rules:

  1. Only compress big indexes, and only when it is worth it (Rule of Thumb: anything between 48K and 1536K is *not* worth compressing).
  2. A good candidate is an index with “normal” sequential access, because there is less I/O.
  3. A bad candidate is an index with “normal” random access, a bad bufferpool hit ratio, and a CPU-limited system, as you will bring more data than needed into the bufferpool.
  4. Run the DSN1COMP job and look at its results to choose the best option.

There are two Rules of Thumb that you may want to keep in mind:

  1. Normally 8K is the best choice, as 16K can result in too much bufferpool space being wasted (refer to the 1st example below).
  2. Use 8K if you get 50% or more compression, unless you get 75% or more, in which case you can indeed use 16K (refer to the 2nd example below).

I have never seen a 32k space worth using at all. However, remember I have not tested every possible key length and data mix in the whole universe!

The Txt File contains typical JCL for DSN1COMP. Note the use of PARM LEAFLIM(10000). This specifies how many index leaf pages should be evaluated to determine the compression estimate, and it stops DSN1COMP from processing every index leaf page in the input data set – which is, by definition, pretty large.
The range is from 1 to 99000000. If the LEAFLIM PARM is not specified, the entire index will be scanned.

The output from DSN1COMP shows the results from evaluating the compression with different index page sizes. Look for message DSN1940I to see the details of the compression report; see the following two examples:

Example 1:

  8  K Page Buffer Size yields a     
 51  % Reduction in Index Leaf Page Space                          
     The Resulting Index would have approximately
 49  % of the original index's Leaf Page Space
     No Bufferpool Space would be unused 
     ----------------------------------------------
 16  K Page Buffer Size yields a                                 
 68  % Reduction in Index Leaf Page Space
     The Resulting Index would have approximately
 32  % of the original index's Leaf Page Space
 21  % of Bufferpool Space would be unused to
     ensure keys fit into compressed buffers
     ----------------------------------------------
 32  K Page Buffer Size yields a 
 68  % Reduction in Index Leaf Page Space
     The Resulting Index would have approximately
 32  % of the original index's Leaf Page Space
 60  % of Bufferpool Space would be unused to 
     ensure keys fit into compressed buffers

Here you can quickly see that 8K wins, even though it saves less space (51% instead of the 68% for 16K), because no bufferpool space would be left unused, a.k.a. wasted. The following shows another output, where the 16K page wins:

Example 2:

  8  K Page Buffer Size yields a
 51  % Reduction in Index Leaf Page Space
     The Resulting Index would have approximately
 49  % of the original index's Leaf Page Space
     No Bufferpool Space would be unused
     ----------------------------------------------
 16  K Page Buffer Size yields a 
 76  % Reduction in Index Leaf Page Space 
     The Resulting Index would have approximately 
 24  % of the original index's Leaf Page Space 
     No Bufferpool Space would be unused
     ---------------------------------------------- 
 32  K Page Buffer Size yields a            
 76  % Reduction in Index Leaf Page Space        
     The Resulting Index would have approximately   
 24  % of the original index's Leaf Page Space 
 47  % of Bufferpool Space would be unused to 
     ensure keys fit into compressed buffers

Finding the right balance between space savings, bufferpool usage, and CPU is what decides the best page size for your indexes. This brings us to the downside of all this – Standards. Agree on, and write down, which index gets what page size, and verify it. In shops where the DDL is generated by many DBAs or tools, the “compress index methodology” must be documented and understood by everyone; otherwise, it will go horribly wrong at some point!

Bufferpool monitoring and tuning must also be done (or redone) with compression in place, and your need for real memory may go up, as well as your CPU (a little bit).

There is a great red paper that describes the whole process in great detail. One nice little snippet of info about indexes that came in with DB2 9 was the asymmetric page split. Before DB2 9, when you were inserting within a range (not just inserting at the end of a table!), the page splits were always 50/50, and the first page’s space was then almost never used until the next index REORG came along. In DB2 9 this changed: the engine can detect that this insert pattern is happening and then change the split to be adaptive – even up to 90/10 – thus drastically reducing the number of splits.

In DB2 9, index look-aside was also enabled for indexes with CLUSTERRATIOF >= 0.80, not just for the CLUSTERING index. Also in DB2 9, RANDOM indexes were delivered, but I have never seen one in the wild – so to speak (a purely hypothetical example is sketched below).
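Just so you have seen one at least once: the sketch below – with made-up table, index, and column names – randomizes an ever-ascending key to spread out insert hot spots, at the price of losing range-scan capability on that column:

CREATE INDEX <creator>.AUDIT_IX1
    ON <creator>.AUDIT_TABLE
       (AUDIT_TS RANDOM);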

In DB2 10, we now have the excellent INCLUDE columns to eliminate redundant indexes that were only created for index-only access (see the sketch below). The point is that every index that exists on a table adds to the “overhead” of INSERT and DELETE, as well as disk space. Not to forget that the poor little Optimizer can get a headache trying to work out the best access path with all the data there is to read and process! This overhead can get quite large, and reducing the number of indexes should always be a goal for the DBA group; refer to my earlier newsletter about the RTS to see how I go hunting for redundant indexes.
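As a sketch – again with purely hypothetical names – one unique index carrying two extra INCLUDE columns can replace a second index that only existed to give index-only access:

CREATE UNIQUE INDEX <creator>.CUST_IX1
    ON <creator>.CUSTOMER (CUST_NO)
    INCLUDE (CUST_NAME, CUST_STATUS);

The INCLUDE columns are carried in the index but play no part in the uniqueness, so the old index on CUST_NO, CUST_NAME, CUST_STATUS can simply be dropped.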

Below is a little table from the IBM labs detailing the CPU usage of different INSERTs:

Scope                             9672-Z17 CPU time
No Index                          40 to 80µs
One index with no index read I/O  40 to 140µs 
One index with index read I/O     130 to 230µs 
Five indexes with index read I/O  500 to 800µs

Feel free to send me your questions or comments.
TTFN,
Roy Boxwell
Senior Architect

2011-12: SOFTWARE ENGINEERING Holiday present – Free DB2 HealthCheck Series

 

Hi!

As a thank you to everybody joining my monthly News from the Labs Newsletter, this month we have a seasonal freebie for you all! SEGUS offers a DB2 HealthCheck package that pinpoints weaknesses and opportunities for improvements and optimization of your DB2 system.

Just reply to this issue and my support colleagues will ship our Licensed Freeware edition of our HealthCheck series for DB2 z/OS.

 

DB2 z/OS Subsystem check, including Coupling Facility

PerformanceHealthCheck for DB2 z/OS (PHC) checks your DB2 subsystem for a range of problems and lists out what it finds, including the latest enhancement – the Coupling Facility checker. I read on listserv about people with “Coupling Facilities under stress”, and so I added some CF checks. It checks six important values in your CF: the level of the microcode, the transfer time, the number of rejects, the false contention percentage, the subchannel busy percentage and, finally, the all-paths-busy count. From these KPIs you can see whether your CF is “under stress” or not! Now, to get all this juicy data, the LOAD library *must* be APF authorized, of course! Remember that the normal PerformanceHealthCheck still runs fine without being APF auth’d – just the CF check must be.

 

Analyzes and assesses a complete DB2 Subsystem

Along with PHC comes Statistics HealthCheck for DB2 z/OS (SHC), which lots of you may already be familiar with. It allows you to analyze and assess a complete DB2 subsystem, down to a single database, and tells you what is “wrong” or inconsistent with your DB2 catalog statistics. This enables you to spot problems before they get bad and to improve performance by providing the DB2 Optimizer with better information on which to base its cost estimates. It fully supports, and is compliant with, DB2 10. It is a perennial favorite and you cannot run it enough – especially when you are going to migrate to a new DB2 version, this software is a must to protect yourself from strange optimizer behavior.

The binaries come with the product documentation, including a full feature overview that summarizes what our PHC can do for you!
Feel free to send me your questions or comments.
TTFN,
Roy Boxwell
Senior Architect