Disclaimer

Monday 21 February 2022

Troubleshooting stuck queries

 What if the query is so slow that it cannot finish within reasonable time frame?

SQL Monitor

If your Oracle version is 11g or higher, then the most convenient tool to deal with “stuck” queries is SQL monitor. It’s very simple in use:

declare
  report clob;
begin
  report := DBMS_SQLTUNE.REPORT_SQL_MONITOR();
  dbms_output.put_line(report);
end;
/

where are optional, and can be sql_id of the statement or session_id or other parameters that can identify the activity of interest (see herefor a full description). If no parameters are specified, then the report is generated for the latest SQL execution monitored. By default, Oracle monitors queries that take longer than 5 seconds, but you can change this behavior with MONITOR hint.

This is what the output looks like:

SQL Monitoring Report
 
SQL Text
------------------------------
select count(*) from t t1, t
 
Global Information
------------------------------
 Status              :  EXECUTING
 Instance ID         :  1
 Session             :  SCOTT (200:69)
 SQL ID              :  bdkccn9auvaf2
 SQL Execution ID    :  16777216
 Execution Started   :  11/09/2012 10:43:25
 First Refresh Time  :  11/09/2012 10:43:31
 Last Refresh Time   :  11/09/2012 10:45:51
 Duration            :  147s
 Module/Action       :  SQL*Plus/-
 Service             :  SYS$USERS
 Program             :  sqlplus.exe
 
Global Stats
=========================================
| Elapsed |   Cpu   |  Other   | Buffer |
| Time(s) | Time(s) | Waits(s) |  Gets  |
=========================================
|     146 |     144 |     1.23 |   2159 |
=========================================
 
SQL Plan Monitoring Details (Plan Hash Value=840385502)
=====================================================================================================================================
| Id   |        Operation        | Name |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Mem | Activity | Activity Detail |
|      |                         |      | (Estim) |      | Active(s) | Active |       | (Actual) |     |   (%)    |   (# samples)   |
=====================================================================================================================================
|    0 | SELECT STATEMENT        |      |         |      |           |        |     1 |          |     |          |                 |
| -> 1 |   SORT AGGREGATE        |      |       1 |      |       141 |     +6 |     1 |        0 |     |     6.90 | Cpu (10)        |
| -> 2 |    MERGE JOIN CARTESIAN |      |     10G |  43M |       141 |     +6 |     1 |       4G |     |          |                 |
| -> 3 |     TABLE ACCESS FULL   | T    |    100K |  429 |       141 |     +6 |     1 |    39675 |     |     0.69 | Cpu (1)         |
| -> 4 |     BUFFER SORT         |      |    100K |  43M |       145 |     +2 | 39675 |       4G |  2M |    92.41 | Cpu (134)       |
|    5 |      TABLE ACCESS FULL  | T    |    100K |  427 |         1 |     +6 |     1 |     100K |     |          |                 |
=====================================================================================================================================


It’s almost everything you need in order to identify your problem — for each operation, you can find the expected and actual rowcount, memory usage, activity details etc.

For more information on real-time SQL monitoring, see here.

Active session history

If you’re not on 11g yet, then you have to use other tools, and one of the best alternatives is Active Session History (ASH). Using the V$ACTIVE_SESSION_HISTORY view, you can see what your SQL is waiting on, e.g.:

select nvl(event, session_state), count(*)
from v$active_session_history
where sql_id = 'bdkccn9auvaf2'
and session_id = 200
and session_serial# = 69
group by nvl(event, session_state)
order by 2 desc;

This query will tell you what wait events the query is waiting on. Even though you don’t get a break-down by plan operation with this approach, you still get some very valuable clues as to your performance problem. Furthermore, if you find that your query is waiting on disk I/O you can use event parameters (columns p1, p2 and p3 of the view) to identify the file and block the query is waiting for, which you can then use to identify the object name (e.g. here).

Unfortunately, often the query is stuck on CPU (e.g. when performing lots of logical reads because of an inefficient plan), in which case ASH cannot provide any additional details (if you have server access, and your platform is Solaris, you can use DTrace scripts for that, e.g. see Alexander Anokhin’s post).

Longops

If a plan operation is taking more than 6(?) seconds, it will show up in V$SESSION_LONGOPS view (if you have Toad for Oracle with DBA module, you can conveniently see the contents of the view on “Longops” tab of the Session browser). This is very convenient because not only you can identify which operation your plan is stuck on, but also see the progress of this operation using TOTALWORK and SOFAR columns (however keep in mind that for some sorts and joins these numbers can be inaccurate).

Using longops is great for spotting expensive full table scans, index fast full scans, hash joins, sorts etc. However if Oracle spends hours on an operation in total, but only a couple seconds at once, you won’t see it.

Tracing

Tracing is a classic performance troubleshooting tool. Normally it’s enabled via ALTER SESSION SET EVENT statement or a dbms_monitor call (a wrapper to dbms_system), but when a query is stuck then the only way I know is by calling dbms_system.set_ev directly:
exec dbms_system.set_ev(&&sid, &&serial#, 10046, 12, ”), where &&sid and &&serial# are the sid and serial# for the session running the stuck query, 10046 is the event number for SQL tracing (other events will allow you to trace optimizer, joins, sorts etc.), 12 is the level of tracing (it means that both wait events and binds will be captured; if you don’t want binds, use level = 8). To stop the tracing, call exec dbms_system.set_ev(&&sid, &&serial#, 10046, 0, ”).

A serious downside of that it requires access to the database server filesystem (because that’s where the trace file resides). A yet more serious problem is that if tracing stops before the query stopped running, the trace file won’t capture plan statistics, and in any case trace files don’t contain any information about individual logical reads — so it will probably only contain wait events for this session, i.e. essentially same sort of information that you can find in ASH. Of course, with ASH you only have samples (taken once per second), while with tracing you have the actual events — but if the number of samples is big enough samples are almost as good.

Conclusion

In this post, I reviewed several common tools for troubleshooting runaway queries. In most cases the best choice is SQL monitor, which is only available on 11g. For earlier database versions, there are other methods available, which have their pluses and minuses. The bottom line is that there are always ways to obtain accurate diagnostic information, so there is no excuse for guessing. Instead of randomly trying hints, blaming suspiciously looking operations (“my plan has a full table scan — that must be why it’s so slow”) or the database as a whole (“my query is running slow, can you please check the database?”), one should find the most appropriate tool to see where exactly the query is spending most of the time, and work on it.



No comments:

Post a Comment

100 Oracle DBA Interview Questions and Answers

  Here are 100 tricky interview questions tailored for a Senior Oracle DBA role. These questions span a wide range of topics, including perf...