Tuesday, 14 September 2021

Golden rules of RAC diagnostics

1. Beware of top event tunnel vision.

2. Eliminate infrastructure as an issue.

3. Identify problem instance(s).

4. Review Send side metrics, and not just Receive side metrics.

5. Use histograms to identify any abnormalities, and not just averages.
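
Rule 5 is easy to see with a little arithmetic. The sketch below uses invented bucket counts shaped like the rows of a wait event histogram (an upper bound in milliseconds and a wait count per bucket). The function names and every number are illustrative assumptions, not output from a real system.

```python
# Hypothetical histogram buckets: (upper bound in ms, number of waits).
# The numbers are invented for illustration only.
buckets = [(1, 9000), (2, 500), (4, 100), (512, 400)]


def approx_average_ms(buckets):
    """Approximate average wait, charging each wait its bucket's upper bound."""
    total_waits = sum(count for _, count in buckets)
    total_ms = sum(bound * count for bound, count in buckets)
    return total_ms / total_waits


def fraction_slower_than(buckets, threshold_ms):
    """Fraction of waits that fall in buckets above threshold_ms."""
    total_waits = sum(count for _, count in buckets)
    slow = sum(count for bound, count in buckets if bound > threshold_ms)
    return slow / total_waits


print(round(approx_average_ms(buckets), 2))
print(fraction_slower_than(buckets, 4))
```

With these invented numbers the average lands above 20 ms, yet 96% of the waits completed within 4 ms. The average describes neither the fast majority nor the slow outliers, which is why the histogram is the better tool.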


Top event tunnel vision

 In RAC, the root cause can be hidden on another node and might not show up in the Top-5 wait events.

 The impact of side-effect events can be much, much worse than the actual root cause event itself.

 The performance of the instances is intertwined.

 The performance of background processes on one node can affect application performance on all the other nodes.


Examples 1 to 3 (figures not reproduced here)



gc buffer busy



 The ‘gc buffer busy [acquire|release]’ events can be a symptom of any of the RAC performance root causes discussed earlier.

 A gc buffer busy wait simply means that a request for the BL lock on that block is already pending.

 All other processes wait for the BL lock to become available and charge the time to the gc buffer busy event.



Gc buffer busy – Objects related

The following four pages apply only if no other events indicate systemic issues.

 ASH or trace files can be used to identify the blocks suffering from excessive ‘gc buffer busy’ waits.

 Only use this method if there is no other issue affecting background processes.

 To understand gc buffer busy waits:

-> Identify the object and object type

-> Identify the block type

 ash_gcwait_to_obj.sql can be used to identify the object name and type.
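
As a rough illustration of the kind of aggregation involved, the sketch below counts sampled ‘gc buffer busy’ waits per object and block. It is not the ash_gcwait_to_obj.sql script. The sample rows are invented, and the dictionary keys merely mirror the ASH columns event, current_obj#, current_file#, and current_block#.

```python
from collections import Counter

# Invented ASH-like samples; the keys mirror ASH column names.
samples = [
    {"event": "gc buffer busy acquire", "current_obj#": 74112, "current_file#": 4, "current_block#": 103582},
    {"event": "gc buffer busy acquire", "current_obj#": 74112, "current_file#": 4, "current_block#": 103582},
    {"event": "gc buffer busy release", "current_obj#": 74112, "current_file#": 4, "current_block#": 103582},
    {"event": "gc buffer busy acquire", "current_obj#": 74120, "current_file#": 5, "current_block#": 2210},
    {"event": "log file sync", "current_obj#": -1, "current_file#": 0, "current_block#": 0},
]


def hot_blocks(samples, prefix="gc buffer busy"):
    """Count samples per (object id, file, block) for events starting with prefix."""
    counts = Counter()
    for row in samples:
        if row["event"].startswith(prefix):
            key = (row["current_obj#"], row["current_file#"], row["current_block#"])
            counts[key] += 1
    # Most frequently sampled blocks first.
    return counts.most_common()


for (obj, file_no, block_no), n in hot_blocks(samples):
    print(obj, file_no, block_no, n)
```

The object id at the top of the list is the one to resolve to an object name and type, and its file and block numbers are the ones to inspect further.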



Gc buffer busy – Identify object






Gc buffer busy – Identify block type


 You might need to dump the block to identify the type and contents of the block.

alter system dump datafile 4 block min 103582 block max 103582;

 In this example, it is a leaf block of an index.






2. Eliminate infrastructure as an issue.


 Review the wait events ‘gc cr block 2-way’, ‘gc cr block 3-way’, ‘gc current block 2-way’, ‘gc current block 3-way’, ‘gc cr grant 2-way’, and ‘gc current grant 2-way’.

 Concurrency or congestion issues are not factored into these events.

 They represent the lowest wait time that you can possibly get for a block or message transfer.

 An average wait time of less than 2 ms is normal; on Exadata, less than 0.5 ms.
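
The check can be sketched as a few lines of arithmetic. The figures below are invented, and the sketch assumes you have already collected, per event, a total wait count and a total time waited in microseconds; `slow_events` is a hypothetical helper name.

```python
# Invented cumulative figures per event: (total waits, time waited in microseconds).
events = {
    "gc cr block 2-way": (1_000_000, 900_000_000),
    "gc cr block 3-way": (400_000, 520_000_000),
    "gc current block 2-way": (800_000, 2_480_000_000),
}


def slow_events(events, limit_ms=2.0):
    """Return {event: average ms} for events whose average wait is not below limit_ms."""
    slow = {}
    for name, (waits, micros) in events.items():
        if waits == 0:
            continue  # no waits recorded, nothing to average
        avg_ms = micros / waits / 1000
        if avg_ms >= limit_ms:
            slow[name] = avg_ms
    return slow


print(slow_events(events))                # general guideline of 2 ms
print(slow_events(events, limit_ms=0.5))  # stricter Exadata guideline
```

If any of these events averages above the guideline, the interconnect or the hardware underneath it deserves a look before any deeper analysis.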




3. Identify problem instance(s).

 gv$instance_cache_transfer tracks the receive-side (RX) metrics at both the instance and the class level.



Identify problem inducing instance

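-- INSTANCE is the sending instance; inst_id is the receiving instance.
-- Dividing by 1000 reports milliseconds, assuming the time columns are in microseconds.
SELECT instance || '->' || inst_id transfer,
       class,
       cr_block cr_blk,
       TRUNC(cr_block_time / cr_block / 1000, 2) avg_cr,
       current_block cur_blk,
       TRUNC(current_block_time / current_block / 1000, 2) avg_cur
FROM   gv$instance_cache_transfer
WHERE  cr_block > 0 AND current_block > 0  -- also avoids division by zero
ORDER  BY instance, inst_id, class
/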
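
To read the output of such a query, it helps to roll the rows up by sending instance. The sketch below does that for CR blocks only. The rows are invented, the tuple layout (sender, receiver, class, blocks, time in microseconds) is an assumption made for this example, and `avg_cr_ms_by_sender` is a hypothetical helper.

```python
from collections import defaultdict

# Invented rows: (sender, receiver, class, cr_block, cr_block_time in microseconds).
rows = [
    (1, 2, "data block", 1000, 1_200_000),
    (1, 3, "data block", 1000, 1_400_000),
    (2, 1, "data block", 1000, 9_000_000),
    (2, 3, "data block", 1000, 11_000_000),
    (3, 1, "data block", 1000, 1_500_000),
    (3, 2, "data block", 1000, 1_100_000),
]


def avg_cr_ms_by_sender(rows):
    """Block-weighted average CR receive time in ms, grouped by sending instance."""
    blocks = defaultdict(int)
    micros = defaultdict(int)
    for sender, _receiver, _cls, cr_block, cr_block_time in rows:
        blocks[sender] += cr_block
        micros[sender] += cr_block_time
    return {s: micros[s] / blocks[s] / 1000 for s in blocks if blocks[s] > 0}


averages = avg_cr_ms_by_sender(rows)
suspect = max(averages, key=averages.get)
print(averages, "suspect sender:", suspect)
```

In this invented data every receiver sees slow transfers only when instance 2 is the sender, which points at instance 2 rather than at the instances reporting the waits.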



4. Review Send side metrics, and not just Receive side metrics.


Sending side analysis

gc cr block receive time =

Time for the foreground (FG) process to send a message to the remote LMS process

+ Time taken by LMS to build the block (statistic: gc cr block build time)

+ LMS wait for LGWR latency (statistic: gc cr block flush time)

+ LMS send time (statistic: gc cr block send time)

+ Wire latency.
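
The breakdown above can be turned into a quick calculation: subtract the three sender-side statistics from the receive time, and whatever remains is message and wire time. The sketch below assumes all inputs are per-block averages already converted to milliseconds. The numbers are invented and `receive_time_breakdown` is a hypothetical helper.

```python
def receive_time_breakdown(receive_ms, build_ms, flush_ms, send_ms):
    """Split an average receive time into sender-side parts plus the residual."""
    # The residual covers the FG-to-LMS message time and the wire latency.
    residual_ms = receive_ms - (build_ms + flush_ms + send_ms)
    return {
        "build": build_ms,
        "flush": flush_ms,
        "send": send_ms,
        "message and wire": residual_ms,
    }


# Invented per-block averages in milliseconds.
breakdown = receive_time_breakdown(receive_ms=6.0, build_ms=0.1, flush_ms=5.2, send_ms=0.1)
largest = max(breakdown, key=breakdown.get)
print(largest)
```

Here the flush component dominates, so the slow receive time seen on one node would trace back to LGWR latency on the sending node, not to the interconnect.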





Problem node



