2018-02 Db2 Catalog Statistics revisited

Db2 Optimizer & access path selection for Db2 11 & Db2 12 :

Db2 Catalog dataProblematic default values | Correlations in the Db2 Catalog 

It has been six years since the last update so I thought, after Terry Purcell’s excellent presentation in January 2018, it would be a good point in time to go over and rake the coals again—especially as a couple of things have changed for Db2 12!


Terry Purcell – Db2 12 for z/OS Optimizer and RUNSTATS improvements
Webcast replay          Abstract

Are you a RUNSTATS Master?

Every now and again, I hold a little presentation called “Are you a RUNSTATS Master?” Actually these days it’s called “Db2 z/OS Lies, Damn Lies, and Statistics…” where I describe in detail, what the Db2 Optimizer uses for access path selection in relation to the Db2 Catalog data.

Surprised? You will be!

Personally, I am always surprised at how often people say “just that data?” or “is that it?” (the various other reasons for access path selection like CP speed, RID Pool size, Sort Pool size, Max data caching size, and, of course, the 80 bufferpools are also mentioned, but these have nothing to do with RUNSTATS).

So generally the answer is “Yes.” However, the permutations and combinations make the devil in the detail – The Db2 Optimizer’s algorithms are top secret, but the input data it uses is fully described in the documentation.

Just the facts ma’am

What I want to do, is show :

  • the Db2 Catalog data that is used
  • the default values that can cause surprising things to happen
  • the problem of correlations in the Db2 Catalog

Which data are used by the Db2 Optimizer and which are updated by RUNSTATS?

Here is a complete list of the eleven tables used by the Db2 Optimizer:

  1. SYSIBM.SYSCOLDIST
  2. SYSIBM.SYSCOLSTATS *
  3. SYSIBM.SYSCOLUMNS
  4. SYSIBM.SYSINDEXES
  5. SYSIBM.SYSINDEXPART
  6. SYSIBM.SYSKEYTARGETS (same as SYSCOLUMNS)
  7. SYSIBM.SYSKEYTGTDIST (same as SYSCOLDIST)
  8. SYSIBM.SYSROUTINES
  9. SYSIBM.SYSTABLES
  10. SYSIBM.SYSTABLESPACE
  11. SYSIBM.SYSTABSTATS

* degree of parallelism only and, after APAR PK62804, also „sometimes“ used to bound filter factor estimates…

By the Columns

Now we can also list out all of the columns (obviously not including the key columns) which are used by the Db2 Optimizer:


SYSCOLDIST
CARDF, COLGROUPCOLNO, COLVALUE, FREQUENCYF, HIGHVALUE, LOWVALUE, NUMCOLUMNS, QUANTILENO, STATSTIME


SYSCOLSTATS
COLCARD, HIGHKEY, LOWKEY


SYSCOLUMNS
COLCARDF, HIGH2KEY, LOW2KEY


SYSINDEXES
CLUSTERING*, CLUSTERRATIO, CLUSTERRATIOF, DATAREPEATFACTORF, FIRSTKEYCARDF, FULLKEYCARDF, NLEAF, NLEVELS


SYSINDEXPART
LIMITKEY*


SYSKEYTARGETS
CARDF, HIGH2KEY, LOW2KEY, STATS_FORMAT


SYSKEYTGTDIST
CARDF, KEYGROUPKEYNO, KEYVALUE, FREQUENCYF, HIGHVALUE, LOWVALUE, NUMKEYS, QUANTILENO, STATSTIME


SYSROUTINES
CARDINALITY*, INITIAL_INSTS*, INITIAL_IOS*, INSTS_PER_INVOC*, IOS_PER_INVOC*


SYSTABLES
CARDF, EDPROC*, NPAGES, NPAGESF, PCTROWCOMP


SYSTABLESPACE
NACTIVE, NACTIVEF


SYSTABSTATS
CARD, CARDF, NPAGES


Notes: * Columns are not updated by RUNSTATS and _ Columns are not updatable at all. The column STATSTIME is used only if there are duplicates in the SYSCOLDIST table, and then the Db2 Optimizer will use the “newer” data that was probably inserted by a User.

Know your defaults

Which default column values trigger the Db2 Optimizer to use its own internal default values?


SYSCOLUMNS


If COLCARDF                       = -1 then use 25


SYSINDEXES


If CLUSTERRATIOF            = 0 then use CLUSTERRATIO


If CLUSTERRATIO              = 0 then use 0.95 if the index is CLUSTERing = ‘Y’ otherwise 0.00


DATAREPEATFACTORF    = -1 then is ignored


If FIRSTKEYCARDF            = -1 then use 25


If FULLKEYCARDF             = -1 then use 25


If NLEAF                              = -1 then use 33 (Which is SYSTABLES.CARDF / 300)


If NLEVELS                         = -1 then use 2


SYSROUTINES


If CARDINALITY                  = -1 then use 10,000  


If INITIAL_INSTS                 = -1 then use 40,000


If INITIAL_IOS                     = -1 then use 0


If INSTS_PER_INVOC        = -1 then use 4,000


If IOS_PER_INVOC            = -1 then use 0


If IOS_PER_INVOC            = -1 then use 0


SYSTABLES


If CARDF                             = -1 then use 10,000


If NPAGESF                      <= 0 then use NPAGES


If NPAGES                          = -1 then use 501 (Which is CEILING (1 + SYSTABLES.CARDF / 20))

Here you must be very careful if using NPGTHRSH ZPARM as 501 is more than the default value in most shops. This is one of the little changes in Db2 12 where the value -1 is treated as -1 for the NPGTHRSH check.


SYSTABLESPACE


If NACTIVEF                     = 0 then use NACTIVE


If NACTIVE                       = 0 then use 501 (Which is CEILING (1 + SYSTABLES.CARDF / 20))


SYSTABSTATS


If CARDF                         = -1 then use 10,000


If NPAGES                       = -1 then use 501 (Which is CEILING (1 + SYSTABSTATS.CARDF / 20))


So now you can see that non-floating point “old” data, may still be used today and this may cause access path headaches!

Never ever say never

Now to top it all, the data in the SYSCOLDIST and SYSKEYTGTDIST never gets simply “deleted”.

Well, actually, in Db2 12 you can now do a RUNSTATS xxx.yyy RESET ACCESSPATH to delete all SYSCOLDIST and SYSKEYTGTDIST data and set all other relevant columns to their respective defaults, but you must time this RUNSTATS very wisely! If you run it and then forget to do a normal full RUNSTATS…

Oldie but a goldie

Once the data are inserted, they stay there, until they are overwritten by new data, a RUNSTATS RESET, or the object is dropped. This all leads to some very old data in these two tables that can and do cause the Db2 Optimizer a ton of grief! One of the first things I do is to simply select the MIN(STATSTIME) from these tables just to see how old the data really is. Do it yourself and be surprised! I have seen sites with eight-year old data in the SYSCOLDIST and that cannot be good!

Correlate the world

Now onto correlations… There are lots of little tricks that DBAs use to “massage” access path choice. One of these, is to just set NLEVELS to 15 for a given index. Then lots of queries simply refuse to touch it as it would appear to be HUGE. Now, just simply updating columns can cause the Db2 Optimizer, in the best case, to ignore the updates or, perhaps, makes things even worse! So here is a list of the correlations (In other words, if you change xxx remember to change yyy and zzz as well):

Relationships exist among certain columns of certain tables:

  •       Columns within SYSCOLUMNS
  •       Columns in the tables SYSCOLUMNS and SYSINDEXES
  •       Columns in the tables SYSCOLUMNS and SYSCOLDIST
  •       Columns in the tables SYSCOLUMNS, SYSCOLDIST, and SYSINDEXES

 If you plan to update some values, keep in mind the following correlations:

  •  COLCARDF and FIRSTKEYCARDF/FULLKEYCARDF (They must be equal for the 1st column and full, if a single column index)
  •  COLCARDF, LOW2KEY and HIGH2KEY. (For non-default COLCARDF LOW2KEY and HIGH2KEY key must be filled with data) and if the COLCARDF is 1 or 2 Db2 uses LOW2KEY and HIGH2KEY as domain statistics to generate frequencies.
  • CARDF in SYSCOLDIST.  CARDF is related to COLCARDF and FIRSTKEYCARDF and FULLKEYCARDF. It must be at a minimum
  • A value between FIRSTKEYCARDF and FULLKEYCARDF if the index contains the same set of columns
  • A value between MAX(colcardf of each col) and the product of all the columns COLCARDFs in the group
  • CARDF in SYSTABLES. CARDF must be equal or larger than any other cardinalities, such as COLCARDF, FIRSTKEYCARDF, FULLKEYCARDF, and CARDF in SYSCOLDIST
  • FREQUENCYF and COLCARDF or CARDF. The number of frequencies collected must be less than or equal to COLCARDF for the column or CARDF for the column group
  • FREQUENCYF. The sum of frequencies collected for a column or column group must be less than or equal to 1

New in Db2 11

In Db2 11, the table SYSSTATFEEDBACK was introduced giving us the first chance to see what the optimizer thinks is missing. This is truly awesome, as then we can tailor our RUNSTATS to generate exactly what the optimizer needs to really validate and generate a good, stable access path. Of course, you should be a little bit careful with this data as too much of a good thing can be bad for you!

New in Db2 12

(Not just the lowercase b!)

In Db2 12, the SYSSTATFEEDBACK was made even more interesting by now externalizing the required RUNSTATS options *directly* into the already existing RUNSTATS profile or, indeed, actually creating a RUNSTATS profile for you.

I think that is really dangerous, as then you could easily flood your system with bogus stats for end user QMF/SPUFI queries that were run “by accident,” or so called “boss queries” where someone with *no* idea of SQL clicks together a highly complex and badly written SQL before letting it run for a weekend. Naturally the SQL gets rewritten by a helpful ever present DBA, but the statistics recommendations have now landed in the profile and will be updated and kept from this point on.

My personal recommendation is to switch off this feature as it is sadly *on* by default!

Here are the ZPARMs of interest

ZPARM STATFDBK_SCOPE set to ALL by default.
ZPARM STATFDBK_PROFILE set to YES by default.
Plus, in table SYSIBM.SYSTABLES column STATS_FEEDBACK is set to Y by default.

Out-of-the box it starts automatically creating (for TYPE=’C‘ with NUMCOLS > 1 and TYPE=’F‘ or ‚H‘) profiles and updating existing profiles…Here you must manually check the size of your profiles every now and again just to make sure everything is ok!

One other new thing in Db2 12, is that XML columns can get statistics now to help XMLEXISTS get a better access path.

and finally

Do not forget that our little Freeware tool StatisticsHealthCheck will find all bad correlations, old data and badly updated data for you and it is FREE!

So I hope this little round-up of Db2 Catalog Statistics data was interesting, and, as usual, if you have any comments or questions, then please, feel free to mail me!

TTFN

Roy Boxwell