I heard about an interesting problem the other day. Please remember that “interesting” to me is just that: “interesting”. ”Interesting” for the DBAs and employees of the firm where it happened is, naturally, a bit different.
A normal start
- Monday morning and all is well until around 07:00, when delays start appearing in online transactions.
- Soon the delays are gaining the advantage and customers are starting to complain.
- At around about 07:20 nearly the whole machine just sat there…
- About 15 minutes later everything started running normally.
The stage is set
So, the lead DBAs are off and running, looking for bad SQL that could possibly have caused the disturbance in the force.
They were checking whether the coupling facility was under stress, they were checking for parallel running REORG, MODIFY, or QUIESCE in the SYSIBM.SYSCOPY, they were using our WorkLoadExpert (WLX) to see what was happening in the time window involved.
And so do I! As luck would have it, I am at this site to hold a presentation all about BindImpactExpert, which saves you from bad access paths, and RunstatsRescue, which rescues you from badly timed RUNSTATS. Now this site already has these products, but I must present to a new intake of DBA and developer employees.
After my presentation we checked everything and found a few timeouts and deadlocks, but nothing serious. Then I got my Deer Stalker hat on, (now there’s an image!), and decided to see where delays were coming in. One of the developers had already done a quick WLX check and had seen very high Drain Lock values.
WLX outputs a summary of what workload has been processed which, here in the labs on my little test system, looks like this:
Wait times in microseconds because of … latch requests : 594 0 page latch : 0 0 drain locks : 0 0 drain lock claims : 0 0 log writer : 32230 0 synchronous I/O : 6840389 9623 locks : 0 0 synchronous execute : 0 0 global locks : 0 0 other threads read activity : 28429563 0 other threads write activity : 13166 0
At the actual customer site I could see a 1000 times increase in wait drain locks between two runs!
Utility versus SQL
Now, as I am sure you are all aware, a drain is *only* used by a command or a utility, so I started thinking:
“There must be a parallel running something somewhere!”
So I used WLX to show me the SQLs that had the highest wait drain locks. I took the top two (over 30,000 seconds of delay!) and got their tablespace names from the Db2 Catalog using the first referenced table column.
Horrible Job to do
It is not a pleasant task to search master address space sysouts, but in this case it was the only way. Using the tablespace names from the Db2 Catalog. I just navigated to the date and time in question and did F commands in SDFS on the tablespace names.
After a few minutes I found a strange message (Correlation Id=010.TLPLKNC3) about a drain not being possible for an internal Db2 system task
– This happens to be used by REORG, and it gave me the info about where the drain came from. I looked at that system’s log output in the time range, and sure enough there was a REORG of that very table which kept failing due to not getting the drain!
A retry too far?
At this site they use a 3rd Party software tool to generate REORG, RUNSTATS and COPY and it had a default of RETRY 30. It kept trying 30 times before eventually failing.
This explains the missing SYSCOPY entry as the REORG had failed!
The other one?
So that was one bad boy found – What about the other? That tablespace did not appear in any of the sysouts. So I drilled down to get the full SQL text (Over 8000 bytes long!) and scrolled on down to the FROM lines – and there was the *first* table name! After the dust had settled, I went back and I saw that :
this one table was actually in every single SQL in the top 200 delay candidates! A pretty central table if you ask me!
Who? and Why?
The management level now wanted to know who did it? And why? I left that part up to the customer, of course, as I do not want to get involved in a finger pointing exercise! My feeling is: like most disasters, it was probably a chain of events something like:
- REORG generated on Sunday.
- Due to some unforeseen problem the JCL was shunted to Monday.
- On Monday at 07:00 it started and killed the machine.
Best way is to generate jobs straight to the Job Scheduler for instantaneous execution (No waiting or shunting allowed) and guess what? We have the RealTimeDBAExpert (RTDX) that allows you to do just that! You can easily exclude objects from utilities based on days of the week, hours of day etc. If you have a bought-in or home-grown system would it also have caused this disaster?
If the customer had had our WLX Console Message Use Case licensed, it would have also made the detective work much easier, as then you have a central place to go where *all* console messages from *all* members are written and searchable! This would have saved a lot of time and trouble.
(Removing my Deer Stalker hat and replacing it with a mortar board.)
Look everywhere, trust no-one and remember that a DRAIN is almost definitely nothing to do with SQL or a badly timed RUNSTATS.
As always, any questions or comments would be most welcome!