Disclaimer

Tuesday, 22 February 2022

Read consistency overhead

SQL performance can degrade for many reasons, some of most common are:

– plan changes
– data skewness
– low caching efficiency
– data growth
– contention.

All these factors are relatively well known. A somewhat less common, although not exceptionally rare scenario, is read consistency overhead due to concurrent DML against queried tables. Because of being less common, this scenario is often overlooked, which leads to false diagnoses (and eventually to “fixes” that can do more harm than good).


Read consistency

Read consistency dictates that:
1) changes by other sessions should not be seen until they are committed
2) changes by other sessions made after the current query started should not be seen even if they are committed.

These rules protect from inconsistent results. But consistency comes at a certain performance price, since they require reads of undo data for modified blocks, and I/O is generally the slowest part in the database operation. 

Additionally, there’s also some CPU overhead because the information read from undo needs to be applied to reconstruct the right point-in-time version of the data. 

If DML is intensive enough, all this can cause serious slowdown, especially considering that there may be some “positive feedback loop” kind of effect here: the slower a query is, the more undo you need to make it read-consistent, the more undo you read, the slower your query becomes. 

Eventually, after a certain threshold, SQL performance would spiral down until the query undo needs exceed undo retention, and the query fails with ORA-1555.


Recognizing this situation can be tricky because even though Oracle provides statistics for consistency-related I/O requests, in most real-life scenarios they are not very useful (e.g. to use session-level stats you need to take the baseline, and you don’t normally know that you need to take a baseline until after a performance issue occurs). 

There are some indirect ways to observe the read consistency overhead, e.g. by noticing that SQL is taking more I/O than it should. Or you can notice that I/O per execution varies a lot even though everything else — data, SQL inputs, plan — remain the same, and this variation is in phase with changes in DML activity against queried tables. But the most straightforward and reliable way is by looking at breakdown of disk reads made by the query by their source.



ASH query

This is an ideal case for ASH, because with ASH you don’t need to worry about baselines, or any typical tracing problems such as database server filesystem access. With ASH, all you need (apart from having the Diagnostic and Tuning Pack license to use it) is to run a simple query:

select nvl(ts.contents, 'not in I/O') "I/O type",
round(100*count(distinct sample_time)/
    sum(count(distinct sample_time))over(),1)
        "wall clock time (%)",
round(100*count(1)/
    sum(count(1))over(),1) "DB time (%)"
from dba_hist_active_sess_history ash,
dba_data_files df,
dba_tablespaces ts
where ash.p1 = df.file_id (+)
and df.tablespace_name = ts.tablespace_name (+)
and sql_id = :sql_id
and sql_exec_id = :sql_exec_id
group by ts.contents
order by count(distinct sample_time) desc;


In this query, :sql_exec_id is optional in case you’re only interested in a specific execution of a statement that was executed multiple times during the interval for which ASH is available. The query shows which fraction of query time is spent on reading data, and the breakdown of data read time by tablespace contents type (UNDO/PERMANENT/TEMPORARY).

Note that it returns two kinds of time: wall clock time and DB time. 

They are different for parallel queries and this difference can be very important (like in a recent case when failure to take this difference into account has prevented me from identifying the root cause early), but this is a broad topic which I’m hoping to cover in a separate post.

One limitation of this query is that it doesn’t show the overhead due to undo blocks read from the cache — from the database point of view, this will look as CPU time, without possibility to drill down any further.

Using results

Read consistency overhead cannot be eliminated: if you read and modify data at the same time, you’re going to pay the performance price. 

In this sense, this problem cannot be resolved. However, knowing how much of a price you are paying and for what is still extremely helpful:


– you know where performance overhead and/or instability is coming from which eliminates the risk of blaming some non-related activity for it
– you know how much performance benefits you can get if users move the reading and writing processes to different time slots (if that’s at all possible)
– in some special cases you can sacrifice consistency for performance by using temp tables — and you will be able to know how much you’ll save performancewise (although generally giving up read consistency is a very bad idea)
– in some cases you may want to consider various replication/redundancy mechanism to mitigate the problem, and this will provide you the “benefit” side of the cost/benefit analysis.

Summary

Flexible and powerful architecture of Oracle database allows to maintain read consistency without reading blocking writers. However, even without locking, there’s still a performance price for read consistency. This overhead cannot be eliminated, but it can be measured using ASH.

No comments:

Post a Comment

How to recovery PDB when PDB database is dropped in Oracle

  How to recovery PDB when PDB database is dropped :) [oracle@rac01 ~]$ sqlplus '/as sysdba' SQL*Plus: Release 21.0.0.0.0 - Product...