Tuning Oracle RAC InterConnect(Private Network)
In RAC most important part is interconnect performance .
We will walk through how to tune oracle cluster resource .
Critical Oracle Background Processes using private interconnect
• LMSn (Global Cache Service Process)
Handles cache fusion(GCS) (Block transfers, messages)
Maintains records of each data file and block open in a cache in GRD (Global Resource Directory)
Controls flow of messages of remote instance(s)
Tracks global data block access
• LCK0 (Instance Enqueue Background Process)
Handles requests related to non-fusion resources such as library and row cache requests
• LMD0 (Global Enqueue Service Daemon)
Manages global enqueues
Detection of deadlocks
• LMON (Global Enqueue Service Monitor)
Maintenance instance membership
Reconfiguration of GCS/GES during recovery from instance crash or startup
• LMHB (Global Cache/Enqueue Service Heartbeat Monitor)
Monitors LMON, LMD and LMSn Processes
Wait events in RAC Database :
1. gc cr/current grant 2-way (message wait)
• Block does not exist in cache and LMS grants to FG process to read from disk
2. gc cr/current block 2-way/3-way (block wait)
• Requested for read and write operations
• Every execution triggers this since SCN advances
3. gc cr/current block congested (block congestion waits)
• LMS didn’t process within 1 ms due to CPU shortage or scheduling delays
4. gc cr/current block busy (block concurrency/contention waits)
• Indicates high concurrency
• LMS needs to perform additional work to prepare block in requested mode
5. gcs log flush sync
• Before sending reconstructed CR/CURR block, LMS request LGWR to flush redo vectors
6. gc cr failure/gc cr retry
• Inefciencies with interconnect or invalid block request or checksum error
7. gc cr/current block corrupt/lost
• Dropped packets due bufer overflow
• Misconfigured interconnect
Gc [current/cr] [2/3]-way – If you have 2 node cluster you cannot get 3-way as only two RAC instances are available and therefore only 2-way is possible as you can have at most two hops. If you have three or more RAC instances then 2-way or 3-way is possible. Blocks are received after 2 or 3 network hops immediately. The event is not a subject to any tuning except increasing private interconnects bandwidth and decreasing the private interconnects latency. Monitor if average ms > 1ms or close to Disk I/O latency. Look at reducing latency.
Gc [current/cr] grant 2-way – Event when grant is received immediately. Grant is always local or 2-way. Grant occurs when a request is made for a block image current or cr and no instance have the image in its local buffer cache. The requesting instance is required to do an I/O from data file to get the blocks. The grant simply is a permission from the LMS this to happen that is, the process to read the block from the data file. Grant can be either cr or current. Gc current grant is go read the block from the database files, while gc cr grant is read the block from disk and build a read consistent block once is read.
Gc [current/cr][block/grant] congested – means that it has been received eventually but with a delay because of intensive CPU consumption, memory lack, LMS overload due to much work in the queues, paging, swapping. This is worth investigating as it provides a room for improvement. You should look at it as it indicates that LMS could not dequeue message fast enough.
Gc [current/cr] block busy – Received but not sent immediately due to high concurrency or contention. This means that the block is busy for example somebody issue block recover command from RMAN. Variety of reasons for being busy just means cannot be sent immediately but not because of memory, LMS or system oriented reasons but Oracle oriented reasons. It is also worth investigating.
Gc current grant busy – Grant is received but there is a delay due to many shared block images or load. For example you are extending the high water mark and you are formatting the block images or blocks with block headers.
Gc [current/cr][failure/retry] – Not received because of failure, checksum error usually in the protocol of the private interconnect due to network errors or hardware problems. This is something worth investigating. Failure means that cannot receive the block image while retry means that the problems recovers and ultimately the block image can be received but it needs to retry.
Gc buffer busy – time between block accesses less than buffer pin time. Pin buffers can be in exclusive or shared mode depending if buffers can be modified or read only. Obviously if there is a lot of contention for the same block by different processes this event can manifest itself in grater magnitude. Buffer busy are global cache events as a request is made from one instance and the block is available in another instance and the block is busy due to contention.
How to check interconnect details :
$ oifcfg getif
lan902 172.17.1.0 global
lan901 10.28.188.0 global public
Ways to check interconnect performance :
1) ifconfig -a
Using ifconfig we can find if any packets are dropped
ifconfig –a: eth0 Link encap:Ethernet HWaddr 00:0B:DB:4B:A2:04 inet addr:130.35.25.110 Bcast:130.35.27.255 Mask:255.255.252.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95 TX packets:273120 errors:0 dropped:27 overruns:0 carrier:0
2) global awr report
In Global Awr report we can find cluster performance .
2) From V$ views
We use racdiag.sql . Also have posted various scripts used my me .
Script to Collect RAC Diagnostic Information (racdiag.sql) (Doc ID 135714.1)
1. Estimate Global Cache traffic flowing in/out a given node from AWR
• Messages are typically 200 bytes in size or less while CR/Curr blocks are in 8k(Same as
DB_BLOCK_SIZE) in size
• Goal is to keep network Packets/sec under 70% of estimate throughput of interconnect device
• DBA_HIST_SYSSTAT provides data related to all GC wait events, block/message transfers
• DBA_HIST_IC_DEVICE_STATS provides stats like packets received/transmitted/dropped for each
interface
• DBA_HIST_IC_CLIENT_STATS provides usage of interconnect by area(IPQ, DLM and Cache)
select * from select * from v$instance_cache_transfer v$instance_cache_transfer
where class = 'data block' where class = 'data block'
and instance = 1;
SET lines 200 pages 100
SELECT inst_id,cr_requests,current_requests
FROM gv$cr_block_server;
SET lines 200 pages 100
SELECT * FROM gv$cluster_interconnects ORDER BY inst_id,name;
SET lines 200 pages 100
SELECT * FROM gv$configured_interconnects ORDER BY inst_id,name;
SET lines 200 pages 100
SELECT inst_id,class,cr_block,current_block
FROM gv$instance_cache_transfer WHERE instance IN (1,2)
ORDER BY inst_id,class;
##### Interesting one coming from racdiag.sql script:
SET lines 200
SELECT b1.inst_id, b2.VALUE "GCS CR BLOCKS RECEIVED",
b1.VALUE "GCS CR BLOCK RECEIVE TIME",
((b1.VALUE / b2.VALUE) * 10) "AVG CR BLOCK RECEIVE TIME (ms)"
FROM gv$sysstat b1, gv$sysstat b2
WHERE b1.name = 'global cache cr block receive time'
AND b2.name = 'global cache cr blocks received'
AND b1.inst_id = b2.inst_id OR b1.name = 'gc cr block receive time'
AND b2.name = 'gc cr blocks received'
AND b1.inst_id = b2.inst_id;
####Latency of interconnect messages:
SET lines 200 pages 100
SELECT * FROM dba_hist_interconnect_pings WHERE snap_id=2846;
#### Global Cache Service processes (LMS) statistics:
SET lines 200 pages 100
SELECT * FROM gv$current_block_server;
SELECT
DECODE(name,'gc cr blocks received','global cache blocks received','gc cr blocks served','global cache blocks served','gc current blocks received','global cache blocks received','gc current blocks served','global cache blocks served',name) AS name,
SUM(VALUE) AS VALUE,
TO_CHAR(SYSDATE,'yyyy-mm-dd hh24:mi:ss') AS date_taken
FROM gv$sysstat
WHERE inst_id=1
AND name IN ('gc cr blocks received','gc cr blocks served','gc current blocks received','gc current blocks served','gcs messages sent','ges messages sent')
GROUP BY DECODE(name,'gc cr blocks received','global cache blocks received','gc cr blocks served','global cache blocks served','gc current blocks received','global cache blocks received','gc current blocks served','global cache blocks served',name)
UNION
SELECT name,VALUE,TO_CHAR(SYSDATE,'yyyy-mm-dd hh24:mi:ss') AS date_taken
FROM gv$dlm_misc
WHERE name IN ('gcs msgs received','ges msgs received')
AND inst_id=1;
Ways to Tune interconnect :
A) Network Kernel Buffer
UDP buffers
You want the fastest possible network to be used for the interconnect. To maximize your speed and efficiency on the interconnect, you should ensure that the User Datagram Protocol (UDP) buffers are set to the correct values. On Linux, you can check this via the following command:
sysctl net.core.rmem_max net.core.wmem_max net.core.rmem_default net
.core.wmem_default
net.core.rmem_max = 4194304
net.core.wmem_max = 1048576
net.core.rmem_default = 262144
net.core.wmem_default = 262144
Alternatively, you can read the associated values directly from the respective files in the directory /proc/sys/net/core. These values can be increased via the following SYSCTL commands:
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=1048576
sysctl -w net.core.rmem_default=262144
sysctl -w net.core.wmem_default=262144
The numbers in this example are the recommended values for Oracle RAC on Linux and are more than sufficient for the majority of configurations. Nevertheless, let’s talk about some background of the UDP buffers. The values determined by rmem_max and wmem_max are on a “per-socket” basis. So if you set rmem_max to 4MB, and you have 400 processes running, each with a socket open for communications in the interconnect, then each of these 400 processes could potentially use 4MB, meaning that the total memory usage could be 1.6GB just for this UDP buffer space. However, this is only “potential” usage. So if rmem_default is set to 1MB and rmem_max is set to 4MB, you know for sure that at least 400MB will be allocated (1MB per socket). Anything more than that will be allocated only as needed, up to the max value. So the total memory usage depends on the rmem_default, rmem_max, the number of open sockets, and the variable piece of how much buffer space each process is actually using. This is an unknown—but it could depend on the network latency or other characteristics of how well the network is performing and how much network load there is altogether. To get the total number of Oracle-related open UDP sockets, you can execute this command:
netstat -anp -udp | grep ora | wc -l
B) Remove Reverse Path Filtering interference, by means of configuring rp_filter kernel parameter with value 0 (disabled) or 2 (loose). This is an IP level functionality, so in case we would be using HAIP over bonding, this parameter must be configured at bonding OS device and not the individual interface devices. With this configuration, we are preventing, in 2.6.31 and up kernels, a situation where blocks get locked or discarded. We can find more specific information in MOS rp_filter for multiple private interconnects and Linux Kernel 2.6.32+ (Doc ID 1286796.1).
C) Jumbo Frames introduces the ability for an Ethernet frame to exceed its IEEE 802 specified Maximum Transfer Unit of 1500 bytes up to a maximum of 9000 bytes. Even though Jumbo Frames is widely available in most NICs and data-center class managed switches it is not an IEEE approved the standard. While the benefits are clear, Jumbo Frames interoperability is not guaranteed with some existing networking devices. Though Jumbo Frames can be implemented for private Cluster Interconnects, it requires very careful configuration and testing to realize its benefits. In many cases, failures or inconsistencies can occur due to incorrect setup, bugs in the driver or switch software, which can result in sub-optimal performance and network errors.
D) Parameters Influencing RAC behavior
Disclaimer: Hidden parameters (those that begin with an underscore) such as _LM_DLMD_PRCS should be modified with caution and only after first consulting with Oracle Support. It should be noted that Oracle could change or remove such parameters without notice in subsequent releases.
In Oracle RAC, the setting of DB_BLOCK_SIZE multiplied by the MULTI_BLOCK_READ_COUNT determines the maximum size of a message for the Global Cache and the PARALLEL_EXECUTION_MESSAGE_SIZE determines the maximum size of a message used in Parallel Query. These message sizes can range from 2K to 64K or more and hence will get fragmented more so with a lower/default MTU.
The LMS background process (represents global cache services or GCS) of the requested instance may not be able to keep up with the requests from other instances and may cause the LMT background process (represents global enqueue services or GES) to queue new requests increasing its backlog and thus causing delayed response. The LMS process accepts requests on a First-In-First-Out (FIFO) basis. Oracle by default creates one LMS process for every two CPU’s on the system. While Oracle has not provided any direct method to tune the number of LMS processes, one method available is to increase them by modifying a hidden parameter _LM_DLMD_PROC
1. Always run critical background processes in RT priority • _lm_lms_priority_dynamic = false • _high_priority_processes = ‘LMS*|VKTM|LMD*|LGWR’
2. Disable Undo DRM during instance start after crash • _gc_undo_afnity = false
3. Disable DRM to avoid unacceptable and unpredictable freezes • _gc_policy_time = 0
4. Minimize reconfiguration time for bigger SGA • _lm_tickets • gcs_server_processes
5. Monitoring related critical background parameters • Heartbeat ping to local processes • _lm_rcvr_hang_check_frequency • _lm_rcvr_hang_allow_time • _lm_rcvr_hang_kill • Heartbeat ping to peer process on remote instances • _lm_idle_connection_check • _lm_idle_connection_check_interval • _lm_idle_connection_kill
6. Fairness and light work rule (_fairness_threshold)
E) Reducing Full table scans
Evaluate and reduce the number of full table scans on the database. Queries that retrieve data by performing full table scans could cause large cache movement between instances and thus cause significant load on the cluster interconnects.
F) Using multiple cluster interconnects
Multiple cluster interconnects can be configured and allocated to Oracle using the parameter CLUSTER_INTERCONNECTS. The parameter overrides the default interconnect setting at the operating system level with a preferred traffic network and in certain platforms disables the high availability feature of RAC.
G) Dont Use too big Sga :
Though this is not way to Tune Interconnect but to avoid block transfer from one node to Another . I have seen experts recommending not to use too large sga/buffer pool , as they prefer physical reads over cache fusion .
Tuning Cluster Timeout parameters :
Most of node eviction happens due to inproper configuration of cluster timeouts . Please work with Oracle support for idle value however below i am sharing based on
expirience . Have seen this that after migration time out values changes.
The CSS misscount parameter represents the maximum time, in seconds, that a heartbeat can be missed before entering into a cluster reconfiguration to evict the node.
Disktimeout: Disk Latencies in seconds from node-to-Votedisk. Default Value is 200. (Disk IO)
crsctl get css diagwait --- should give 13
crsctl get css misscount --- should give 30
crsctl get css disktimeout --- should give 200
crsctl get css reboottime --- should give 3
crsctl set css diagwait 13
crsctl set css misscount 30
crsctl set css disktimeout 200
crsctl set css reboottime 3
Reference :
Steps To Change CSS Misscount, Reboottime and Disktimeout (Doc ID 284752.1)
Other issues with RAC Performance :
A) Parallel Query Parallel Query
One major issue in RAC is parallel query One major issue in RAC is parallel query that goes across many nodes
Define Instance Groups Define Instance Groups
Specify in Specify in init.ora init.ora
prodb1.instance_groups='pqgroup1' prodb1.instance_groups='pqgroup1'
prodb2.instance_groups='pqgroup2' prodb2.instance_groups='pqgroup2'
Specify Instance Groups in Session Specify Instance Groups in Session
SQL> alter session set SQL> alter session set
parallel_instance_group parallel_instance_group ='pqgroup1';
RAC Common Tuning Tips
Application tuning is often the most beneficial
Resizing and tuning the buffer cache
Increasing sequence caches to a high value: to avoid index leaf contention caused by high or batch inserts.
Reducing long full-table scans in OLTP systems: to reduce GCS requests.
Using Automatic Segment Space Management
Using partitioning to reduce interinstance traffic
Avoiding unnecessary parsing
Minimizing locking usage
Removing unselective indexes
Configuring interconnect properly
Script intended to display RAC private interconnect traffic
#!/usr/bin/ksh
#
#
#
# -----------------------------------------------------------------------------
#
# Script intended to display RAC private interconnect traffic
#
# -----------------------------------------------------------------------------
#
# Usage : $0 <instance name> <interval in seconds (default 10)> <count (default 5)>
#
# According to sprepins.sql file in $ORACLE_HOME/rdbms/admin RAC interconnect
# traffic formula is
# Estd Interconnect traffic:
# ((Global Cache blocks received + Global Cache blocks served)*db_block_size
# +(GCS/GES messages received + GCS/GES messages sent)*200)/elapsed time
#
# -----------------------------------------------------------------------------
# Variables
# -----------------------------------------------------------------------------
OSTYPE=$(uname -s)
case $OSTYPE in
"AIX" ) alias bdf="/usr/bin/df -Ik"
alias ll="/usr/bin/ls -l" ;;
"SunOS") alias bdf="/usr/bin/df -k"
alias ll="/usr/bin/ls -l"
alias awk="/usr/xpg4/bin/awk"
alias grep="/usr/xpg4/bin/grep" ;;
"Linux") alias bdf="/bin/df -k"
alias grep=egrep
alias ll="ls -l" ;;
esac
LOG=/tmp/`basename $0`.$$.log
DB_BLOCK_SIZE=0
START_GCBR=0
START_GCBS=0
START_GCSMS=0
START_GCSMR=0
START_GESMS=0
START_GESMR=0
START_DATE="N/A"
END_DATE="N/A"
TRAFFIC=0
i=1
# -----------------------------------------------------------------------------
# Initialization
# -----------------------------------------------------------------------------
if [ "$LOGNAME" != "root" ]
then
echo -e "\nYou must launch this script as root\n"
exit 1
fi
if [ $# -lt 1 ]
then
echo -e "Usage : $0 <instance name> <interval in seconds (default 10)> <count (default 5)>\n"
echo -e "List of instances : \n`grep -v -e "^#" -e "^$" /etc/oratab|cut -f1 -d:`"
exit 1
else
if grep -q "^$1:" /etc/oratab
then
echo -e "\nInstance $1 exists in /etc/oratab\n"
else
echo -e "Instance <$1> does not exist in /etc/oratab"
echo -e "List of existing instances:\n`grep -v -e "^#" -e "^$" /etc/oratab|cut -f1 -d:`"
exit 1
fi
fi
ORACLE_SID=$1
if [ -n "$2" ]
then
INTERVAL=$2
else
INTERVAL=10
fi
if [ -n "$3" ]
then
COUNT=$3
else
COUNT=5
fi
if [ `bdf /tmp|grep tmp|awk '{print $4}'|tr -d %` -ge 98 ]
then
echo -e "\nFile system /tmp is full\n"
exit 1
fi
rm -f ${LOG}
IFS=:
OHOME=`grep "^$ORACLE_SID:" /etc/oratab | cut -d: -f2`
FLAG=`grep "^$ORACLE_SID:" /etc/oratab | cut -d: -f3`
USER=`grep "^$ORACLE_SID:" /etc/oratab | cut -d: -f4`
VG=`grep "^$ORACLE_SID:" /etc/oratab | cut -d: -f5`
# -----------------------------------------------------------------------------
# Main
# -----------------------------------------------------------------------------
echo -e "Please wait $INTERVAL seconds...\n"
IFS=*
while [ $i -le $COUNT ]
do
su - ${USER} -c "
sqlplus -S / as sysdba << EOF > ${LOG}
set serveroutput on size 999999 linesize 200 feedback off
declare
instance_number number;
db_block_size number;
type figures_type is table of number index by varchar2(64);
start_figures figures_type;
end_figures figures_type;
cursor cursor1(instance_number number) is select
decode(name,'gc cr blocks received','Global Cache blocks received','gc cr blocks served','Global Cache blocks served','gc current blocks received','Global Cache blocks received','gc current blocks served','Global Cache blocks served',name) as name,
sum(value) as value,to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') as date_taken
from gv\\\$SYSSTAT
where inst_id=instance_number
and name in ('gc cr blocks received','gc cr blocks served','gc current blocks received','gc current blocks served','gcs messages sent','ges messages sent')
group by decode(name,'gc cr blocks received','Global Cache blocks received','gc cr blocks served','Global Cache blocks served','gc current blocks received','Global Cache blocks received','gc current blocks served','Global Cache blocks served',name)
union
select name,value,to_char(sysdate,'yyyy-mm-dd hh24:mi:ss') as date_taken
from gv\\\$DLM_MISC
where name in ('gcs msgs received','ges msgs received')
and inst_id=instance_number;
item1 cursor1%rowtype;
plsql_start_date varchar2(19);
plsql_end_date varchar2(19);
interval number;
computed_interval number;
i number;
begin
select value into instance_number from v\\\$parameter where name='instance_number';
select value into db_block_size from v\\\$parameter where name='db_block_size';
interval:=${INTERVAL};
start_figures('Global Cache blocks received'):=${START_GCBR};
start_figures('Global Cache blocks served'):=${START_GCBS};
start_figures('gcs messages sent'):=${START_GCSMS};
start_figures('gcs msgs received'):=${START_GCSMR};
start_figures('ges messages sent'):=${START_GESMS};
start_figures('ges msgs received'):=${START_GESMR};
plsql_start_date:='${START_DATE}';
--dbms_output.put_line('Instance Number: ' || instance_number);
--dbms_output.put_line('db_block_size: ' || db_block_size);
--dbms_output.put_line('interval: ' || interval);
if (start_figures('Global Cache blocks received')=0) then
open cursor1(instance_number);
loop
fetch cursor1 into item1;
exit when cursor1%notfound;
start_figures(item1.name):=item1.value;
end loop;
plsql_start_date:=item1.date_taken;
close cursor1;
dbms_lock.sleep(interval);
end if;
open cursor1(instance_number);
loop
fetch cursor1 into item1;
exit when cursor1%notfound;
end_figures(item1.name):=item1.value;
end loop;
plsql_end_date:=item1.date_taken;
close cursor1;
dbms_output.put_line('DB_BLOCK_SIZE=' || db_block_size);
dbms_output.put_line('START_GCBR=' || start_figures('Global Cache blocks received'));
dbms_output.put_line('START_GCBS=' || start_figures('Global Cache blocks served'));
dbms_output.put_line('START_GCSMS=' || start_figures('gcs messages sent'));
dbms_output.put_line('START_GCSMR=' || start_figures('gcs msgs received'));
dbms_output.put_line('START_GESMS=' || start_figures('ges messages sent'));
dbms_output.put_line('START_GESMR=' || start_figures('ges msgs received'));
dbms_output.put_line('END_GCBR=' || end_figures('Global Cache blocks received'));
dbms_output.put_line('END_GCBS=' || end_figures('Global Cache blocks served'));
dbms_output.put_line('END_GCSMS=' || end_figures('gcs messages sent'));
dbms_output.put_line('END_GCSMR=' || end_figures('gcs msgs received'));
dbms_output.put_line('END_GESMS=' || end_figures('ges messages sent'));
dbms_output.put_line('END_GESMR=' || end_figures('ges msgs received'));
dbms_output.put_line('START_DATE=\"' || plsql_start_date || '\"');
dbms_output.put_line('END_DATE=\"' || plsql_end_date || '\"');
computed_interval:=round((to_date(plsql_end_date,'yyyy-mm-dd hh24:mi:ss')-to_date(plsql_start_date,'yyyy-mm-dd hh24:mi:ss'))*24*60*60);
dbms_output.put_line('COMPUTED_INTERVAL=' || computed_interval);
dbms_output.put_line('TRAFFIC=' || ((end_figures('Global Cache blocks received')+end_figures('Global Cache blocks served')-start_figures('Global Cache blocks received')-start_figures('Global Cache blocks served'))*db_block_size+(end_figures('gcs messages sent')+end_figures('gcs msgs received')+end_figures('ges messages sent')+end_figures('ges msgs received')-start_figures('gcs messages sent')-start_figures('gcs msgs received')-start_figures('ges messages sent')-start_figures('ges msgs received'))*200)/1024/computed_interval);
end;
/
EOF
"
while read line
do
#echo $line
eval $line
done < ${LOG}
#TRAFFIC=`expr (($END_GCBR+$END_GCBS-$START_GCBR-$START_GCBS)*$DB_BLOCK_SIZE+($END_GCSMS+$END_GCSMR+$END_GESMS+$END_GESMR-$START_GCSMS-$START_GCSMR-$START_GESMS-$START_GESMR)*200)/1024/$INTERVAL)`
if [ $TRAFFIC -le 1024 ]
then
echo -e "Estd Interconnect traffic at $END_DATE (KBytes/s): $TRAFFIC\n"
else
TRAFFIC=`echo $TRAFFIC/1024 | bc -l`
echo -e "Estd Interconnect traffic at $END_DATE (MBytes/s): $TRAFFIC\n"
fi
#cat ${LOG}
START_GCBR=$END_GCBR
START_GCBS=$END_GCBS
START_GCSMS=$END_GCSMS
START_GCSMR=$END_GCSMR
START_GESMS=$END_GESMS
START_GESMR=$END_GESMR
START_DATE=$END_DATE
i=`expr $i + 1`
sleep $INTERVAL
done
rm -f ${LOG}
No comments:
Post a Comment