2012-02: HASH Index – What is it good for?

Oh dear! I seem to have made a HASH of it!

Many thanks to Jeff for requesting information about HASH indexing and thus causing the terrible pun in the first line!


A brand new access path for DB2 10

Hash access is one of the new goodies that arrived in DB2 10 NFM and is a brand new access path for DB2. (Anyone remember HDAM?) First the good news: when hash access is used, it is pretty fast! If you are lucky, one I/O is all that is needed, unless you get your hash space definition hopelessly wrong, in which case it gets messy!


HASH Access. What is it?

Simply put, hash access takes your given key (max 255 bytes) and computes a 64-bit number from it. DB2 computes the “relative page number” using the MOD function with two arguments: this 64-bit number and a prime number derived from the size of the table. Next, DB2 computes the “hash anchor point” using the MOD function with two arguments: this 64-bit number and another prime number between 17 and 53. Now the “relative page number” and the “hash anchor point” act like a cross-reference directly to the row in question. Of course, life is a bit more complicated than that, and you can easily get collisions where two keys end up with the same numbers. If this happens, the second one gets written to the ominous-sounding “hash overflow area” and its associated index (remember for later that this is also a sparse index!). By the way, the optimal setting is to have between 0% and 8% of the rows in the overflow area.
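Just to make the arithmetic concrete, here is a little sketch of the two MOD steps with made-up numbers (the real hash function and both primes are internal to DB2; I am simply assuming 2,297 as the table-size prime and 31 as the second prime):

-- Illustration only: 1234567890123 stands in for the 64-bit hash of the key
SELECT MOD(1234567890123, 2297) AS RELATIVE_PAGE_NO   -- prime derived from table size
     , MOD(1234567890123, 31)   AS HASH_ANCHOR_POINT  -- prime between 17 and 53
  FROM SYSIBM.SYSDUMMY1;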


Any physical limitations?

Yes, there are a whole bunch! The table must be in a UTS, and it must be in RRF format. The hash key is *not* updateable (déjà vu here with partitioned key changes from aeons ago!). DROP a hash, and your table space is put into REORGP status, which of course needs a SHRLEVEL NONE REORG to repair it, and that is not a recoverable scenario either. No CLUSTER indexes are allowed, of course. The hash overflow index *must* be implicitly created. No LOB or XML columns, and it cannot be an MQT. <phew!>

Which all reminds me of a certain Monty Python sketch: “But apart from that, what have the Romans done for us?”. Just replace Romans with Hash Access. 🙂

So that is the list of physical limitations.


What about logical limitations?

No parallel access or star/hybrid joins are possible. Hash access is not used for sensitive dynamic scrollable cursors or for multi-element IN-lists. RUNSTATS on the hash overflow index is a waste of time, as it is a sparse index and so, by definition, only contains a few percent of the table's rows. My advice is to simply exclude hash indexes from your DB2 database maintenance during RUNSTATS processing. However, remember that you must still monitor these indexes for REORG and for re-sizing, as well as for hash usage.
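A little RTS sketch for that monitoring (the overflow index name is made up; TOTALENTRIES of the sparse overflow index is simply the number of rows currently sitting in the overflow area):

SELECT DBNAME, NAME, TOTALENTRIES, NLEVELS
  FROM SYSIBM.SYSINDEXSPACESTATS
 WHERE NAME = 'IMYHASHOVERFLOW';   -- made-up overflow index name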

For which data does it actually make sense to change to hash access? Straight off the bat: single-row, unique “equals” and IN predicates. No range scans. Access is random. The row that you are searching for is normally found. The size of the table is static or well known. The columns all have the same size (no large variants please!), ideally between 30 and 2,000 bytes in total. Finally, it would be perfect if 20 or more rows fit on a page.

Good candidates should have four or more index levels in order to save tons of I/O.

Bad candidates have high insert rates or updates that change the row length.
So you can see that you have to do a fair bit of work to find the candidates! A little catalog query, like the sketch below, is a good starting point.
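Just a sketch (unique indexes with four or more levels, per the criteria above; adapt to taste):

SELECT CREATOR, NAME, TBNAME, NLEVELS
  FROM SYSIBM.SYSINDEXES
 WHERE UNIQUERULE <> 'D'   -- unique indexes only
   AND NLEVELS >= 4        -- four or more index levels
 ORDER BY NLEVELS DESC;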

Once you have decided to change a table to be hash-based, you can see from EXPLAIN whether a given query will attempt to use the hash method. Access types of H, HN, and MH confirm usage. You can also use the RTS to see when hash access last occurred and how often it has been used (the HASHLASTUSED and REORGHASHACCESS columns).
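For example (just a sketch; it assumes a current-format PLAN_TABLE, and the database name is made up):

-- Did the optimizer pick hash access?
SELECT QUERYNO, TNAME, ACCESSTYPE
  FROM PLAN_TABLE
 WHERE ACCESSTYPE IN ('H', 'HN', 'MH');

-- When was hash access last used, and how often since the last REORG?
SELECT DBNAME, NAME, HASHLASTUSED, REORGHASHACCESS
  FROM SYSIBM.SYSTABLESPACESTATS
 WHERE DBNAME = 'MYDB';            -- made-up database name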


If you decide to stop using the hash organization and drop it, remember the REORGP.
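In DDL terms that means (table and table space names made up):

ALTER TABLE MY.HASHTAB DROP ORGANIZATION;
-- The table space is now in REORGP, so schedule that SHRLEVEL NONE REORG straight away!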

LOADing is still a bit of a trauma, of course. The current best method is:
Create the table, LOAD the data, ALTER to hash, and REORG.
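As a sketch, with made-up names, that sequence looks like this:

CREATE TABLE MY.HASHTAB
      (ID      CHAR(8) NOT NULL
      ,PAYLOAD VARCHAR(50))
  IN MYDB.MYUTS;                   -- must be a UTS in RRF
-- LOAD the data with the LOAD utility, then:
ALTER TABLE MY.HASHTAB
  ADD ORGANIZE BY HASH UNIQUE (ID) HASH SPACE 8 M;
-- The ALTER is pending; this REORG actually builds the hash:
REORG TABLESPACE MYDB.MYUTS AUTOESTSPACE(YES) SHRLEVEL REFERENCE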

Mass Inserts are also pretty bad.

I did some testing here, loading up 90,000 rows with an eight-byte (repeatable) random key, and found the elapsed time varies greatly with the size of the HASH SPACE and with the buffer pool definition of the table space. (Remember that basic BP tuning is always a good idea!) Here is what I found:

1 MB hash space: 73,183 overflow records, 36 secs, 251 hash pages
Then I went high with the HASH SPACE:
256 MB hash space: zero overflow records, 96 secs, 65,327 hash pages
Then I stepped down (in binary of course!):
128 MB hash space: zero overflow records, 82 secs, 32,573 hash pages
64 MB hash space: zero overflow records, 44 secs, 16,381 hash pages
32 MB hash space: zero overflow records, 23 secs, 8,191 hash pages
16 MB hash space: zero overflow records, 22 secs, 4,093 hash pages
8 MB hash space: one overflow record, 22 secs, 2,039 hash pages
4 MB hash space: 21,620 overflow records, 36 secs, 1,021 hash pages
2 MB hash space: 55,897 overflow records, 38 secs, 509 hash pages

And then finally the optimum size for my little test:
9 MB hash space: zero overflow records, 19 secs, 2,297 hash pages
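For the record, resizing between the runs is just an ALTER followed by a REORG, along these lines (again with my made-up names):

ALTER TABLE MY.HASHTAB
  ALTER ORGANIZATION SET HASH SPACE 9 M;
-- AUTOESTSPACE(NO) tells REORG to honour the size just set:
REORG TABLESPACE MYDB.MYUTS AUTOESTSPACE(NO) SHRLEVEL REFERENCE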

Then I decided to test out the AUTOESTSPACE(YES) option to see what changes it made.
The following is a list of results from my test data:
256 MB hash space: LOADed 90,000 rows using 32,719 hash pages, with 0 overflows
REORG of the above changed the hash pages down to 653, with 3,762 overflow records
1 MB hash space with no data: REORGed to 31 pages
1 MB hash space: LOADed 90,000 rows using 127 hash pages, with 72,728 overflows
REORG of the above changed the hash pages up to 653, with 3,762 overflow records
8 MB hash space: LOADed 90,000 rows using 1,021 hash pages, with 0 overflows
REORG of the above changed the hash pages down to 653, with 3,762 overflow records

So you can see that REORG with AUTOESTSPACE(YES) aims for about 5% of the rows in the overflow area, which is OK.
Finally, also check out “Chapter 20. Organizing tables by hash for fast access to individual rows” in the Managing Performance book for some great details about how to squeeze the last bytes out of your hash space!

Feel free to send me your comments and ask questions.

TTFN,
Roy Boxwell
Senior Architect