AWR analysis: another case study
To review an AWR report for a database suffering from instance-level performance issues.
Here are the the key parts of that report (with some masking):
WORKLOAD REPOSITORY report for DB Name DB Id Instance Inst Num Release RAC Host ------------ ----------- ------------ -------- ----------- --- ------------ XXXX XXXXX XXXXX 1 10.2.0.5.0 NO XXXX Snap Id Snap Time Sessions Curs/Sess --------- ------------------- -------- --------- Begin Snap: 65115 03-May-16 11:00:09 152 17.8 End Snap: 65116 03-May-16 12:00:18 152 17.7 Elapsed: 60.16 (mins) DB Time: 2,712.41 (mins) Cache Sizes ~~~~~~~~~~~ Begin End ---------- ---------- Buffer Cache: 5,856M 5,856M Std Block Size: 8K Shared Pool Size: 2,048M 2,048M Log Buffer: 14,340K Load Profile ~~~~~~~~~~~~ Per Second Per Transaction --------------- --------------- Redo size: 2,455,599.10 14,087.84 Logical reads: 613,415.60 3,519.18 Block changes: 12,238.64 70.21 Physical reads: 12,233.70 70.19 Physical writes: 1,517.54 8.71 User calls: 1,159.19 6.65 Parses: 39,080.15 224.20 Hard parses: 32.45 0.19 Sorts: 708.22 4.06 Logons: 0.31 0.00 Executes: 39,393.06 226.00 Transactions: 174.31 ... |
Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time Wait Class ------------------------------ ------------ ----------- ------ ------ ---------- db file sequential read 12,719,449 73,404 6 45.1 User I/O CPU time 50,461 31.0 db file parallel read 2,127,146 14,407 7 8.9 User I/O read by other session 3,660,106 11,240 3 6.9 User I/O db file scattered read 1,005,907 5,493 5 3.4 User I/O ------------------------------------------------------------- Time Model Statistics DB/Inst: XXXX/XXXX Snaps: 65115-65116 -> Total time in database user-calls (DB Time): 162744.5s -> Statistics including the word 'background' measure background process time, and so do not contribute to the DB time statistic -> Ordered by % or DB time desc, Statistic name Statistic Name Time (s) % of DB Time ------------------------------------------ ------------------ ------------ sql execute elapsed time 159,211.8 97.8 DB CPU 50,461.4 31.0 PL/SQL execution elapsed time 5,870.2 3.6 parse time elapsed 5,788.9 3.6 RMAN cpu time (backup/restore) 1,328.8 .8 hard parse elapsed time 227.6 .1 repeated bind elapsed time 169.1 .1 connection management call elapsed time 6.3 .0 PL/SQL compilation elapsed time 4.2 .0 hard parse (sharing criteria) elapsed time 3.8 .0 hard parse (bind mismatch) elapsed time 2.7 .0 failed parse elapsed time 0.1 .0 sequence load elapsed time 0.0 .0 DB time 162,744.5 N/A background elapsed time 9,714.6 N/A background cpu time 1,816.4 N/A ------------------------------------------------------------- ... Operating System Statistics DB/Inst: XXXX/XXXX Snaps: 65115-65116 Statistic Total -------------------------------- -------------------- AVG_BUSY_TIME 345,463 AVG_IDLE_TIME 15,407 AVG_SYS_TIME 53,275 AVG_USER_TIME 292,060 BUSY_TIME 5,529,148 IDLE_TIME 248,026 SYS_TIME 854,375 USER_TIME 4,674,773 RSRC_MGR_CPU_WAIT_TIME 0 VM_IN_BYTES -4.771264721965E+18 VM_OUT_BYTES -1.354598512669E+15 PHYSICAL_MEMORY_BYTES 17,178,402,816 NUM_CPUS 16 NUM_CPU_CORES 16 ... |
What can we learn from all this?
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 2,455,599.10 14,087.84
Logical reads: 613,415.60 3,519.18
Block changes: 12,238.64 70.21
Physical reads: 12,233.70 70.19
Physical writes: 1,517.54 8.71
User calls: 1,159.19
6.65
Parses: 39,080.15 224.20
Hard parses: 32.45 0.19
Sorts: 708.22 4.06
Logons: 0.31 0.00
Executes: 39,393.06
226.00
Transactions: 174.31
Interestingly, the number of parses is also exceptionally high (quite close to the number of executes). However, only a small fraction of parses are hard ones.
The amount of redo generated is fairly small (about 2.4MB/s) and in general the database does much more reading than writing.
Top 5 Timed Events Avg %Total
~~~~~~~~~~~~~~~~~~ wait Call
Event Waits Time (s) (ms) Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
db file sequential read 12,719,449 73,404 6 45.1 User I/O
CPU time 50,461 31.0
db file parallel read 2,127,146 14,407 7 8.9 User I/O
read by other session
3,660,106 11,240 3 6.9 User I/O
db file scattered read 1,005,907 5,493 5 3.4 User I/O
-------------------------------------------------------------
Time Model Statistics DB/Inst: XXXX/XXXX Snaps: 65115-65116
-> Total time in database user-calls (DB Time): 162744.5s
-> Statistics including the word 'background' measure background process
time, and so do not contribute to the DB time statistic
-> Ordered by % or DB time desc, Statistic name
Top 5 timed events section doesn’t look particularly exciting. We can see that the database spends most of its time doing I/O which is to be expected.
Some DBAs sometimes get alarmed by “read by other session” events, taking it as a sign of concurrency — but it’s a perfectly normal event which doesn’t necessarily indicate a problem (another session is reading a block we’re interested in into memory so that our session wouldn’t have to do it itself).
I don’t think there is enough information in the report to draw a conclusion with 100% certainty, but there is enough to make some educated guesses. Could parsing be the main problem here? The exceptionally high parsing rate (39k/second, i.e. practically 1 per every execution) from the load profile certainly seems to suggest so. It’s a possibility. However, on the other hand, only a tiny fraction (less than 0.1%) of those are hard parses, and it is hard parses that contribute to CPU usage the most. Further, “Time Model Statistics” section is claiming that only 3.6% of DB time is taken by parsing, and while this figure could be sometimes inaccurate and I wouldn’t trust it blindly, combined with all other information it looks sensible. So while parsing certainly seems to be excessive here, and no doubt it’s making the performance worse, there are likely to be other problems, those that would give more performance benefits if fixed.
What about context switching? That too is a possibility. The strongest evidence in favor of this theory is the disproportion between user calls and executions (1k vs 39k per second), which suggests that there is a lot of SQL called from PL/SQL routines, with a potentially large performance overhead for SQL-PL/SQL context switching. So I would keep this possibility in mind. However, analysing this overhead quantitatively is very difficult, and I wouldn’t spend too much time on it before I ruled out other possibilities.
Finally, logical I/O. 613k logical I/Os per second is certainly a lot, especially for a database that is hosted on a server with only 16 cores. Given the block size here is the standard one, i.e. 8k, this means that the database is reading from cache at 5GB/s, or 18TB/hr! While there exist many databases that read a lot more than that, they are typically hosted on more powerful servers, or even multiple servers connected into a RAC cluster.
Is this enough to explain the high CPU usage? Let’s see. While a logical I/O can be carried out in a number of ways, and exact timings would depend on CPU architecture, generally a logical I/O tends to take somewhere between a few microseconds and a few tens of microseconds (e.g. see here ). E.g. if we assume that on this particular system a logical I/O takes 20 us, that would account for almost the entire CPU load. Of course we don’t know if it’s 20 us or 10 us or less (or more), but I would say that overall the evidence in favor of excessive logical I/Os looks stronger than the other two versions we analyzed here. So that’s what I would focus on first. If after reducing the LIO the problem is still there, then I would start looking at other possibilities.
Very often, logical I/O can be dramatically reduced by tuning top SQL. Of course, that would require more database access than just an AWR report. But still, AWR reports provide an excellent starting point for this analysis, as they contain a subsection called “SQL ordered by gets”. In this particular case, the usefulness of this section was somewhat limited by the abundance of PL/SQL calls and the fact that the total percentage came to more than 100% (probably due to double-counting of SQL called from PL/SQL routines), but it was still useful. After excluding PL/SQL routine calls, most top statement had structure like this:
select id from A where id = :b1 and colA = :b2 and 0 = ( select count (*) from B where colB = :b3) and 1 < ( select count (*) from C where colC1 = :b4 and colC2 = :b5 and colC3 = :b6) and 0 <= ( select count (*) from D where colD1 = :b7 and colD2 = a.ColA2) ... -- and so on and so forth |
so basically, it’s a query to check existence of records in certain tables that satisfy certain conditions. Many of such queries had abundance of both correlated and uncorrelated scalar subqueries in the WHERE clause. One obvious problem is that this increases the risk of an inefficient plan, because the optimizer needs to correctly assess whether the subquery needs to be unnested, and it’s not always easy to do so (and even if it evaluates the situation correctly, it can still fail to do the unnesting). But that’s not all. If you look at the uncorrelated subqueries, you can see, that they can be separated out into standalone queries, which would have allowed to shortcut the execution when one of the predicates is found to be false! For example, Oracle would have been able to evaluate the condition (select count(*) from B where colB = :b3) = 0, and if false, exited this branch of logic, avoiding unnecessary I/O.
No comments:
Post a Comment