
Monday 21 February 2022

Case Studies - AWR analysis

I was asked to review an AWR report for a database suffering from instance-level performance issues. Here are the key parts of that report (with some masking):
WORKLOAD REPOSITORY report for
 
DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
XXXX           XXXXX     XXXXX               1 10.2.0.5.0  NO  XXXX
 
              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     65115 03-May-16 11:00:09       152      17.8
  End Snap:     65116 03-May-16 12:00:18       152      17.7
   Elapsed:               60.16 (mins)
   DB Time:            2,712.41 (mins)
 
Cache Sizes
~~~~~~~~~~~                       Begin        End
                             ---------- ----------
               Buffer Cache:     5,856M     5,856M  Std Block Size:         8K
           Shared Pool Size:     2,048M     2,048M      Log Buffer:    14,340K
 
Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:          2,455,599.10             14,087.84
              Logical reads:            613,415.60              3,519.18
              Block changes:             12,238.64                 70.21
             Physical reads:             12,233.70                 70.19
            Physical writes:              1,517.54                  8.71
                 User calls:              1,159.19                  6.65
                     Parses:             39,080.15                224.20
                Hard parses:                 32.45                  0.19
                      Sorts:                708.22                  4.06
                     Logons:                  0.31                  0.00
                   Executes:             39,393.06                226.00
               Transactions:                174.31
 
 ...

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
db file sequential read          12,719,449      73,404      6   45.1   User I/O
CPU time                                         50,461          31.0
db file parallel read             2,127,146      14,407      7    8.9   User I/O
read by other session             3,660,106      11,240      3    6.9   User I/O
db file scattered read            1,005,907       5,493      5    3.4   User I/O
          -------------------------------------------------------------
Time Model Statistics            DB/Inst: XXXX/XXXX  Snaps: 65115-65116
-> Total time in database user-calls (DB Time): 162744.5s
-> Statistics including the word 'background' measure background process
   time, and so do not contribute to the DB time statistic
-> Ordered by % or DB time desc, Statistic name
 
Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
sql execute elapsed time                            159,211.8         97.8
DB CPU                                               50,461.4         31.0
PL/SQL execution elapsed time                         5,870.2          3.6
parse time elapsed                                    5,788.9          3.6
RMAN cpu time (backup/restore)                        1,328.8           .8
hard parse elapsed time                                 227.6           .1
repeated bind elapsed time                              169.1           .1
connection management call elapsed time                   6.3           .0
PL/SQL compilation elapsed time                           4.2           .0
hard parse (sharing criteria) elapsed time                3.8           .0
hard parse (bind mismatch) elapsed time                   2.7           .0
failed parse elapsed time                                 0.1           .0
sequence load elapsed time                                0.0           .0
DB time                                             162,744.5          N/A
background elapsed time                               9,714.6          N/A
background cpu time                                   1,816.4          N/A
-------------------------------------------------------------
...
 
Operating System Statistics       DB/Inst: XXXX/XXXX  Snaps: 65115-65116
 
Statistic                                       Total
-------------------------------- --------------------
AVG_BUSY_TIME                                 345,463
AVG_IDLE_TIME                                  15,407
AVG_SYS_TIME                                   53,275
AVG_USER_TIME                                 292,060
BUSY_TIME                                   5,529,148
IDLE_TIME                                     248,026
SYS_TIME                                      854,375
USER_TIME                                   4,674,773
RSRC_MGR_CPU_WAIT_TIME                              0
VM_IN_BYTES                       -4.771264721965E+18
VM_OUT_BYTES                      -1.354598512669E+15
PHYSICAL_MEMORY_BYTES                  17,178,402,816
NUM_CPUS                                           16
NUM_CPU_CORES                                      16
 
...

What can we learn from all this?

First of all, there are a few interesting numbers in the load profile section. There are about 1k user calls per second, but the number of executes is much higher: about 40k per second, which suggests that most of the database load falls on SQL statements in PL/SQL routines.

Interestingly, the number of parses is also exceptionally high (quite close to the number of executes). However, only a small fraction of parses are hard ones.

The amount of redo generated is fairly small (about 2.4MB/s) and in general the database does much more reading than writing.

The Top 5 Timed Events section doesn't look particularly exciting. We can see that the database spends most of its time doing I/O, which is to be expected. Some DBAs get alarmed by the "read by other session" event, taking it as a sign of concurrency problems, but it's a perfectly normal event that doesn't necessarily indicate a problem: another session is reading the block we're interested in into memory, so that our session doesn't have to do it itself.

There are no obvious "bad" events here signalling contention or configuration issues. The second line in the top events list is taken by "CPU time", which is responsible for 31.0% of the database time. Since the database only has 16 CPUs, 50,461 CPU seconds means that CPU utilization is high. To convert to a percentage, simply divide by the number of CPUs (16) and the number of seconds in an hour (3600): 50461/16/3600 = 0.876, i.e. the database alone accounts for about 88% CPU utilization! Furthermore, if we look at the OS stats section, we'll see that the overall CPU utilization including non-Oracle processes is even higher, about 95% (just take BUSY_TIME and divide by BUSY_TIME + IDLE_TIME).
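This arithmetic can be double-checked with a quick query; the values are simply the ones quoted from the report above, plugged into a SELECT (nothing here comes from the database itself):

-- sanity check of the CPU figures quoted above: 50,461 s of DB CPU,
-- 16 CPUs, one hour of elapsed time, and the OS BUSY_TIME/IDLE_TIME counters
select round(50461 / 16 / 3600 * 100, 1)            as db_cpu_pct_of_capacity,  -- ~87.6
       round(5529148 / (5529148 + 248026) * 100, 1) as os_busy_pct              -- ~95.7
from   dual;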

Now the question becomes: what causes the high CPU usage? There are many possibilities. Basically, anything that is not related to physical I/O can cause CPU consumption (actually, even physical I/O can contribute to CPU usage, as I've shown earlier, but that's an exception rather than the rule, as typically physical I/O is too slow to contribute significantly). The most common scenarios include parsing, context switching overhead and logical I/O.

I don’t think there is enough information in the report to draw a conclusion with 100% certainty, but there is enough to make some educated guesses. Could parsing be the main problem here? The exceptionally high parse rate (39k/second, i.e. practically one per execution) in the load profile certainly seems to suggest so, and it is a possibility. On the other hand, only a tiny fraction (less than 0.1%) of those parses are hard, and it is hard parses that contribute most to CPU usage. Furthermore, the “Time Model Statistics” section claims that only 3.6% of DB time is spent parsing; while this figure can sometimes be inaccurate and I wouldn’t trust it blindly, combined with all the other information it looks sensible. So while parsing certainly seems excessive here, and no doubt it makes performance worse, there are likely other problems whose fixing would bring bigger performance benefits.
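If we did want to dig further into the parsing angle, the AWR repository itself could be queried. A sketch along these lines (assuming the standard DBA_HIST views and a single-instance repository; the snapshot id is the end snap of this report) would list the statements with the most parse calls in the interval:

-- top 10 statements by parse calls in the 65115-65116 interval
-- (the delta columns are stored against the end snapshot)
select *
from  (select sql_id,
              sum(parse_calls_delta) as parse_calls,
              sum(executions_delta)  as executions
       from   dba_hist_sqlstat
       where  snap_id = 65116
       group  by sql_id
       order  by sum(parse_calls_delta) desc)
where  rownum <= 10;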

What about context switching? That, too, is a possibility. The strongest evidence in favor of this theory is the disproportion between user calls and executions (1k vs 39k per second), which suggests that a lot of SQL is called from PL/SQL routines, with a potentially large performance overhead from SQL-to-PL/SQL context switching. So I would keep this possibility in mind. However, analysing this overhead quantitatively is very difficult, and I wouldn’t spend too much time on it before ruling out other possibilities.

Finally, logical I/O. 613k logical reads per second is certainly a lot, especially for a database hosted on a server with only 16 cores. Given that the block size here is the standard 8K, this means the database is reading from cache at about 5GB/s, or 18TB per hour! While there are many databases that read far more than that, they are typically hosted on more powerful servers, or even on multiple servers joined into a RAC cluster.

Is this enough to explain the high CPU usage? Let’s see. While a logical I/O can be carried out in a number of ways, and exact timings depend on the CPU architecture, a logical I/O generally takes somewhere between a few microseconds and a few tens of microseconds. For example, if we assume that on this particular system a logical I/O takes 20 us, that would account for almost the entire CPU load. Of course we don’t know whether it’s 20 us, or 10 us, or less (or more), but I would say that overall the evidence in favor of excessive logical I/O looks stronger than for the other two hypotheses analyzed here. So that’s what I would focus on first. If the problem is still there after reducing the LIO, then I would start looking at the other possibilities.
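To put rough numbers on this (the 20 us per logical read is purely an assumption, as stated above):

-- hedged estimate: cache read rate and CPU burned by 613,415 logical reads/s
-- at an assumed cost of ~20 microseconds of CPU per logical read
select round(613415.6 * 8192 / 1e9, 1)       as cache_read_gb_per_sec,  -- ~5.0
       round(613415.6 * 20e-6, 1)            as cpu_sec_per_sec,        -- ~12.3
       round(613415.6 * 20e-6 / 16 * 100, 1) as pct_of_16_cpus          -- ~77
from   dual;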

Very often, logical I/O can be dramatically reduced by tuning the top SQL. Of course, that requires more database access than just an AWR report. Still, AWR reports provide an excellent starting point for this analysis, as they contain a subsection called “SQL ordered by gets”. In this particular case the usefulness of that section was somewhat limited by the abundance of PL/SQL calls and by the fact that the total percentage came to more than 100% (probably due to double-counting of SQL called from PL/SQL routines), but it was still useful. After excluding the PL/SQL routine calls, most top statements had a structure like this:

select id
from A
where id = :b1
and colA = :b2
and 0 = (select count(*) from B where colB = :b3)
and 1 < (select count(*) from C where colC1 = :b4 and colC2 = :b5 and colC3 = :b6)
and 0 <= (select count(*) from D where colD1 = :b7 and colD2 = a.ColA2)
...
-- and so on and so forth

So basically, it’s a query that checks for the existence of records in certain tables satisfying certain conditions. Many such queries had an abundance of both correlated and uncorrelated scalar subqueries in the WHERE clause. One obvious problem is that this increases the risk of an inefficient plan, because the optimizer needs to assess correctly whether a subquery should be unnested, and that’s not always easy (and even if it evaluates the situation correctly, it can still fail to do the unnesting). But that’s not all. If you look at the uncorrelated subqueries, you can see that they could be separated out into standalone queries, which would have allowed the execution to be short-circuited as soon as one of the predicates was found to be false. For example, Oracle could have evaluated the condition (select count(*) from B where colB = :b3) = 0 first and, if it were false, exited this branch of logic, avoiding unnecessary I/O.
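To illustrate the idea, here is a hypothetical sketch against the masked tables and binds shown above (not the actual production code): the cheap uncorrelated check is evaluated on its own first, and the expensive main query is skipped entirely whenever that check already fails.

-- hypothetical rewrite: short-circuit on the uncorrelated subquery
declare
  l_b_cnt  pls_integer;
  l_id     a.id%type;
begin
  select count(*) into l_b_cnt from b where colb = :b3;

  if l_b_cnt = 0 then   -- i.e. the predicate 0 = (select count(*) from B ...) holds
    select id
      into l_id
      from a
     where id   = :b1
       and cola = :b2
       and 1 < (select count(*) from c
                 where colc1 = :b4 and colc2 = :b5 and colc3 = :b6)
       -- ... remaining subqueries as in the original statement
       ;
  else
    l_id := null;       -- the main query cannot return rows, so it is never run
  end if;
exception
  when no_data_found then
    l_id := null;       -- no matching row in A: same outcome as the original query
end;
/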



==============================================================


Reading AWR reports: case study

Sometimes the answer comes from an unexpected source. For example, the workload profile section of the report contains key information for understanding what the database workload looks like, but it seldom gives a direct answer to the problem (except maybe for excessive parsing or excessive commits). But recently I came across a case where this section alone was enough to identify the root cause of a non-trivial issue:

                      Per Second   Per Transaction
Redo size:          1,895,241.12         12,004.40
Logical reads:        832,945.54          5,275.85
Block changes:         11,937.82             75.61
Physical reads:         7,458.75             47.24
Physical writes:          759.33              4.81
User calls:               449.83              2.85
Parses:                   225.18              1.43
Hard parses:               15.90              0.10
Sorts:                    467.90              2.96
Logons:                     1.38              0.01
Executes:             103,266.84            654.09
Transactions:             157.88

This excerpt comes from an AWR report for a database that virtually froze, with 100% CPU consumption on the box. The question was what was causing this high CPU consumption (the SAs had ruled out the possibility that other processes on the box were to blame).

Looking carefully at the numbers above, one notices that the number of executes per second is enormous. This becomes even more apparent when comparing it to the rate of user calls, which is a couple of orders of magnitude lower. These numbers, combined with the high CPU usage, are enough to make context switching the primary suspect: a SQL statement calling a PL/SQL function which, in turn, executes SQL of its own, resulting in hundreds of thousands of recursive executions.
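The pattern behind those numbers typically looks something like this (a made-up illustration of the shape of the problem, with hypothetical names, not the actual application code): a single user call whose parent SELECT invokes a PL/SQL function for every candidate row, and each invocation runs its own recursive SQL.

-- hypothetical function: one recursive SQL execution per call
create or replace function f_lookup(p_id number) return number is
  l_val number;
begin
  select some_col into l_val from lookup_table where id = p_id;
  return l_val;
end;
/

-- a single user call that fans out into hundreds of thousands of executes
-- when the function is applied before most rows have been filtered out
select t.id, f_lookup(t.id)
from   big_table t
where  t.status = 'OPEN';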

Further investigation confirmed that this was indeed the case. A stats job had run shortly before the incident, invalidating the existing plan, and the new plan called the PL/SQL function at an early stage, before most rows had been eliminated.

The point I am trying to make is that one should maintain a good balance between focusing on a few key performance indicators and paying attention to secondary details as well.




=====================================================


CPU starvation disguised as an I/O issue (yet another AWR case study)



In AWR analysis, what appears to be the root cause of the issue can easily turn out to be just a symptom. Last week, Rajat sent me an AWR report which is a perfect illustration of this (thanks Rajat). I have posted the key sections from that report below (sorry for the less-than-perfect formatting: I had to manually re-format the HTML version of the report into text).

WORKLOAD REPOSITORY report for
DB Name      DB Id           Instance       Inst num        Release          RAC           Host
DSS          37220993      dss              1               10.2.0.4.0       NO            dssdbnz
 
                  Snap Id      Snap Time             Sessions      Cursors/Session
Begin Snap:       18471      12-Oct-12 08:30:28      131              1.5
End Snap:         18477      12-Oct-12 14:30:24      108              1.8
Elapsed:          359.93 (mins)
DB Time:          25,730.14 (mins)
 
Load Profile
                              Per Second      Per Transaction
Redo size:                    325,282.85      103,923.02
Logical reads:                33,390.52       10,667.77
Block changes:                1,307.95        417.87
Physical reads:               1,927.33        615.75
Physical writes:              244.65          78.16
User calls:                   391.34          125.03
Parses:                       68.14           21.77
Hard parses:                  3.33            1.06
Sorts:                        47.86           15.29
Logons:                       3.15            1.01
Executes:                     234.32          74.86
Transactions:                 3.13
% Blocks changed per Read:       3.92       Recursive Call %:      61.11
Rollback per transaction %:      24.71      Rows per Sort:         3325.52
 
Top 5 Timed Events
Event                               Waits          Time(s)      Avg Wait(ms)      % Total Call Time      Wait Class
free buffer waits              10,726,838      344,377     32      22.3      Configuration
db file sequential read        6,122,262      335,366      55      21.7      User I/O
db file scattered read         3,597,607      305,576      85      19.8      User I/O
CPU time                                      161,491              10.5
read by other session          2,572,875      156,821     61       10.2      User I/O
 
Operating System Statistics
Statistic                                 Total
AVG_BUSY_TIME                             2,093,109
AVG_IDLE_TIME                             63,212
AVG_IOWAIT_TIME                           18,463
AVG_SYS_TIME                              87,749
AVG_USER_TIME                             2,004,722
BUSY_TIME                                 16,749,988
IDLE_TIME                                 510,692
IOWAIT_TIME                               152,594
SYS_TIME                                  707,137
USER_TIME                                 16,042,851
LOAD                                      4
OS_CPU_WAIT_TIME                          ###############
RSRC_MGR_CPU_WAIT_TIME                    0
VM_IN_BYTES                               5,503,492,096
VM_OUT_BYTES                              2,054,414,336
PHYSICAL_MEMORY_BYTES                     34,288,209,920
NUM_CPUS                                  8
NUM_CPU_SOCKETS                           8

The first thing that catches the eye is the I/O problem, which manifests itself via free buffer waits and abnormally high average times for the I/O-related events: 55 ms for db file sequential read (!) and 85 ms for db file scattered read (!!) are 10-20 times higher than normal. That alone would be enough to bring any application to its knees.
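For reference, these averages can be re-derived from the AWR repository itself. A sketch along these lines (standard DBA_HIST views, single-instance repository, the snapshot ids from this report) recomputes the per-event average wait over the interval:

-- average single-block / multi-block read latency between snaps 18471 and 18477;
-- the counters are cumulative, hence the max-minus-min delta
select event_name,
       round((max(time_waited_micro) - min(time_waited_micro)) / 1000
             / nullif(max(total_waits) - min(total_waits), 0), 1) as avg_wait_ms
from   dba_hist_system_event
where  snap_id in (18471, 18477)
and    event_name in ('db file sequential read', 'db file scattered read')
group  by event_name;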

So the next question is what could have caused all this. It’s possible that this abnormality stems from a hardware failure or a misconfiguration in the I/O subsystem, so it would make sense to contact the SA and ask him to check the I/O stats from his angle. However, if we look a bit further in the AWR report, we will find yet another problem:

BUSY_TIME 16,749,988
IDLE_TIME 510,692

i.e. the CPU was 97% busy (!!) during a 6-hour interval (!!!). Now that we have established that both the I/O and the CPUs were bottlenecks, it's quite unlikely that the two problems are unrelated. Most probably there is a causal link between them, but which is the cause and which is the consequence? An I/O problem is unlikely to drive up CPU usage, but the reverse is possible: if all CPUs on the system are busy, it can take forever for the OS to complete an I/O request, and eventually the entire system virtually comes to a standstill.
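The 97% figure is simply BUSY_TIME / (BUSY_TIME + IDLE_TIME). If the report itself isn't at hand, the same number can be pulled from the repository; here is a sketch (standard DBA_HIST views, counters treated as cumulative, single instance assumed):

-- CPU busy percentage over the 18471-18477 interval from the OS stat counters
select round(100 * sum(case when stat_name = 'BUSY_TIME' then delta end)
                 / sum(delta), 1) as cpu_busy_pct
from  (select stat_name, max(value) - min(value) as delta
       from   dba_hist_osstat
       where  snap_id in (18471, 18477)
       and    stat_name in ('BUSY_TIME', 'IDLE_TIME')
       group  by stat_name);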

But that’s not yet the end of the story; it would also be nice to know why the CPU usage went up so dramatically, so let’s take a look at the “SQL ordered by CPU time” section of the report (column names and sql_ids have been obfuscated at the sender’s request):

CPU Time (s)  Elapsed Time (s)  Executions  CPU per Exec (s)  % Total DB Time  SQL Id          SQL Module      SQL Text
78,629      390,095          6       13104.92            25.27      b3p2r2asdfyf        SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
5,302       21,053           0                           1.36       0vpp40n6p6zq5       SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
5,238       21,042           0                           1.36       2y03w8vc4uz4c       SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
5,230       21,053           0                           1.36       8pcz31gc88vba       SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
4,976       18,853           1       4975.57             1.22       asd0tw7sj5pxv       SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
4,928       18,863           1       4928.05             1.22       4fdsaj175f4jk       SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
4,356       21,049           0                           1.36       1w1fasdff0av        SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...

We can see that one statement is responsible for about 50% of the CPU utilization during the analyzed period, and that a bunch of very similar statements (which probably differ only in literal values) is responsible for at least another 15%.

So everything points to the top statement as the sole culprit of the entire issue, and indeed, after I was provided with additional information about that statement, I saw that its plan had flipped right around the time of the incident. If I had access to the database, I would have pursued the matter further, e.g. by using dbms_xplan.display_awr to look at the different plans and dbms_stats.diff_table_stats_in_history to see what changes in statistics might have provoked the plan change.
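For completeness, this is roughly what that follow-up would look like (a sketch only: the sql_id is the obfuscated one from the listing above, and the owner, table name and time range are placeholders):

-- all plans captured in AWR for the suspect statement
select * from table(dbms_xplan.display_awr('b3p2r2asdfyf'));

-- what changed in the table's optimizer statistics around the incident
select report, maxdiffpct
from   table(dbms_stats.diff_table_stats_in_history(
                 ownname      => 'APP_OWNER',
                 tabname      => 'SOME_TABLE',
                 time1        => systimestamp - interval '7' day,
                 time2        => systimestamp,
                 pctthreshold => 10));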

Conclusion

This post hopefully illustrates the following points:

1) CPU starvation has many different appearances; among other things, it can look like an I/O issue.

2) Don’t jump to conclusions. After finding a symptom, resist the temptation to pronounce it the root cause of the problem. Note that in this specific case, if we had decided that the I/O problem was the root cause, we would most likely have proceeded to the “SQL ordered by reads” section of the report rather than “SQL ordered by CPU time”, and that section doesn’t point in the right direction (the culprit query isn’t anywhere near the top of that list).

3) Oracle databases contain tons of diagnostic information which allows the root cause of almost any performance problem to be identified in great detail. You don’t have to settle for a vague answer like “something’s wrong with the I/O subsystem” or “the workload was probably higher”.

Update: I spotted an error in my CPU time calculation: 6 hours is not 216,000 s, but 21,600 s. With this taken into account, BUSY_TIME is almost exactly equal to NUM_CPUS x elapsed time, and we don’t even need to refer to OS_CPU_WAIT_TIME to know that there was CPU starvation, so I’ve removed that part from the post.




============================================================

