
Wednesday 25 August 2021

 

Oracle RAC Tuning Tips

1-Take AWR or Statspack snapshots on each instance (at the same time) at regular intervals.
If there are severe inter-instance performance issues or hung sessions, you may also want to
run the racdiag.sql script from the following note to collect additional RAC-specific data:
(Note 135714.1 Script to Collect RAC Diagnostic Information (racdiag.sql))
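For reference, a manual AWR snapshot can be taken with DBMS_WORKLOAD_REPOSITORY; the interval and
retention values below are only illustrative, not recommendations from the note.

-- Illustrative only. An AWR snapshot covers all open instances;
-- with Statspack, statspack.snap must be run on each instance separately.
exec DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT

-- Optionally adjust the automatic snapshot interval and retention (values are in minutes):
exec DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(interval => 30, retention => 20160)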



2-Monitoring RAC Cluster Interconnect Performance
The most important aspects of RAC tuning are monitoring and tuning the global cache and
global enqueue services. The Global Cache Service (GCS) and Global Enqueue Service (GES)
processes communicate through the cluster interconnect.

2-1- (GLOBAL CACHE CR PERFORMANCE)

-- This shows the average latency of a consistent block request.
-- AVG CR BLOCK RECEIVE TIME should typically be about 15 milliseconds, depending
-- on your system configuration and volume.
-- If your CPU has limited idle time and your system typically processes long-running queries,
-- then the latency may be higher. However, it is possible to have an average latency of less
-- than one millisecond with user-mode IPC.
-- Latency can be influenced by a high value for the DB_FILE_MULTIBLOCK_READ_COUNT parameter.
-- Also check interconnect bandwidth, OS TCP settings, and OS UDP settings
-- if AVG CR BLOCK RECEIVE TIME is high.
-- If problems remain, also refer to Doc ID 181489.1: Tuning Inter-Instance Performance
-- in RAC and OPS.

set numwidth 20
column "AVG CR BLOCK RECEIVE TIME (ms)" format 9999999.9
prompt '****Interconnect Latency should be lower than 15ms****'
select
b1.inst_id,
b2.value "GCS CR BLOCKS RECEIVED",
b1.value "GCS CR BLOCK RECEIVE TIME",
((b1.value/b2.value) * 10) "AVG CR BLOCK RECEIVE TIME (ms)"
from gv$sysstat b1,
gv$sysstat b2
where
(b1.name = 'global cache cr block receive time' and
b2.name = 'global cache cr blocks received' and
b1.inst_id = b2.inst_id
)
or
(b1.name = 'gc cr block receive time' and
b2.name = 'gc cr blocks received' and
b1.inst_id = b2.inst_id
)
/

2-2-The level of cluster interconnect performance

These GCS waits show how well data is being transferred between instances. The waits to monitor
are visible in v$session_wait, v$segment_statistics, and v$enqueue_stat. The major waits to be
concerned with for RAC are:
* global cache busy
* buffer busy global cache
* buffer busy global cr
(In 10g and later these are reported as 'gc buffer busy' waits.)


SELECT
INST_ID, EVENT, P1 FILE_NUMBER, P2 BLOCK_NUMBER, WAIT_TIME
FROM GV$SESSION_WAIT WHERE
EVENT IN ('buffer busy global cr', 'global cache busy', 'buffer busy global cache')
/

In order to find the object that corresponds to a particular file and block, the following query can
be issued for each file/block combination returned by the query above:

SELECT OWNER, SEGMENT_NAME, SEGMENT_TYPE
FROM DBA_EXTENTS
WHERE FILE_ID = &file_id AND &block
BETWEEN BLOCK_ID AND BLOCK_ID+BLOCKS-1;

Once the objects causing the contention are determined, they should be modified by:
* Reducing the rows per block.
* Adjusting the block size.
* Modifying INITRANS and FREELISTS.
All of these object modifications reduce the chances of application contention for the blocks in the object.
Index leaf blocks are usually the most contended objects in a RAC environment; therefore,
using a smaller block size for index objects can decrease inter-instance contention for index
leaf blocks.
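As a purely illustrative sketch (the object, tablespace, and file names are hypothetical, and a 4K
buffer cache must be configured before a 4K tablespace can be created), a hot index could be given
more ITL slots and moved to a smaller block size like this:

-- Hypothetical example; adjust names, sizes, and the datafile path to your system.
alter system set db_4k_cache_size = 64M;

create tablespace idx_4k
  datafile '/u01/oradata/ORCL/idx_4k_01.dbf' size 1G
  blocksize 4K;

alter index app.orders_pk rebuild
  tablespace idx_4k
  initrans 10;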

2-3-The following wait events indicate that remotely cached blocks were shipped to the
local instance without having been busy, pinned, or requiring a log flush, and can safely be ignored
(a quick query to see their overall impact follows the list):
* gc current block 2-way
* gc current block 3-way
* gc cr block 2-way
* gc cr block 3-way
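
To get a rough idea of how much time these transfer events actually account for on each instance,
a simple illustrative query against gv$system_event can be used:

-- Illustrative query; average_wait in gv$system_event is in centiseconds.
set linesize 120
column event format a30 tru
select inst_id, event, total_waits,
       round(time_waited/100,1) "TIME WAITED (s)",
       round(average_wait*10,2) "AVG WAIT (ms)"
from gv$system_event
where event in ('gc current block 2-way', 'gc current block 3-way',
                'gc cr block 2-way', 'gc cr block 3-way')
order by inst_id, event
/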


3-GLOBAL CACHE LOCK PERFORMANCE

-- GLOBAL CACHE LOCK PERFORMANCE
-- This shows the average global enqueue get time.
-- Typically AVG GLOBAL LOCK GET TIME should be 20-30 milliseconds. The elapsed
-- time for a get includes the allocation and initialization of a new global enqueue. If the average
-- global enqueue get time or the average global enqueue conversion time is excessive,
-- then your system may be experiencing timeouts.
-- See the 'WAITING SESSIONS', 'GES LOCK BLOCKERS', 'GES LOCK WAITERS', and
-- 'TOP 10 WAIT EVENTS ON SYSTEM' sections of the racdiag.sql output if the AVG GLOBAL LOCK GET TIME is high.

set numwidth 20
column "AVG GLOBAL LOCK GET TIME (ms)" format 9999999.9
prompt '****Average Lock Time should be 20-30 ms ****'
select
b1.inst_id, (b1.value + b2.value) "GLOBAL LOCK GETS",
b3.value "GLOBAL LOCK GET TIME",
(b3.value / (b1.value + b2.value) * 10) "AVG GLOBAL LOCK GET TIME (ms)"
from gv$sysstat b1, gv$sysstat b2, gv$sysstat b3
where
(
b1.name = 'global lock sync gets' and
b2.name = 'global lock async gets' and
b3.name = 'global lock get time' and
b1.inst_id = b2.inst_id and b2.inst_id = b3.inst_id
)
or
(
b1.name = 'global enqueue gets sync' and
b2.name = 'global enqueue gets async' and
b3.name = 'global enqueue get time' and
b1.inst_id = b2.inst_id and b2.inst_id = b3.inst_id
)
/

4-GES LOCK BLOCKERS and GES LOCK WAITERS


-- GES LOCK BLOCKERS:
-- This section will show us any sessions that are holding locks that
-- are blocking other users. The inst_id will show us the instance that
-- the session resides on while the sid will be a unique identifier for
-- the session. The grant_level will show us how the GES lock is granted to
-- the user. The request_level will show us what status we are trying to obtain.
-- The state column will show us what status the lock is in. The last column
-- shows how long this session has been waiting.
--
set numwidth 5
column state format a16 tru;
column event format a30 tru;
select /*+ordered */
dl.inst_id, s.sid, p.spid, dl.resource_name1,
decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',grant_level) as grant_level,
decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',request_level) as request_level,
decode(substr(dl.state,1,8),'KJUSERGR','Granted','KJUSEROP','Opening',
'KJUSERCA','Canceling','KJUSERCV','Converting') as state,
s.sid, sw.event, sw.seconds_in_wait sec
from gv$ges_enqueue dl, gv$process p, gv$session s, gv$session_wait sw
where blocker = 1
and (dl.inst_id = p.inst_id and dl.pid = p.spid)
and (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by sw.seconds_in_wait desc
/


-- GES LOCK WAITERS:
-- This section will show us any sessions that are waiting for locks that
-- are blocked by other users. The inst_id will show us the instance that
-- the session resides on while the sid will be a unique identifier for
-- the session. The grant_level will show us how the GES lock is granted to
-- the user. The request_level will show us what status we are trying to obtain.
-- The state column will show us what status the lock is in. The last column
-- shows how long this session has been waiting.

set numwidth 5
column state format a16 tru;
column event format a30 tru;
select /*+ ordered */
dl.inst_id, s.sid, p.spid, dl.resource_name1,
decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',grant_level) as grant_level,
decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',request_level) as request_level,
decode(substr(dl.state,1,8),'KJUSERGR','Granted','KJUSEROP','Opening',
'KJUSERCA','Cancelling','KJUSERCV','Converting') as state,
s.sid, sw.event, sw.seconds_in_wait sec
from gv$ges_enqueue dl, gv$process p, gv$session s, gv$session_wait sw
where blocked = 1
and (dl.inst_id = p.inst_id and dl.pid = p.spid)
and (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by sw.seconds_in_wait desc
/

5-Check db_file_multiblock_read_count if interconnect latency is high

SELECT inst_id,SUBSTR(NAME,1,30) ,SUBSTR(VALUE,1,50) from gv$parameter
where name ='db_file_multiblock_read_count'
/


6-Load distribution between instances (current values)


set pagesize 60 space 2 numwidth 8 linesize 132
column service_name format a20 truncated heading 'Service'
column instance_name heading 'Instance' format a10
column stime heading 'Service Time mSec/Call' format 999999999
Select service_name ,
instance_name,
elapsedpercall stime,
cpupercall cpu_time,
dbtimepercall db_time,
callspersec throughput
from gv$instance gvi,
gv$active_services gvas,
gv$servicemetric gvsm
where
gvas.inst_id=gvsm.inst_id
and gvas.name_hash=gvsm.service_name_hash
and gvi.inst_id=gvsm.inst_id
and gvsm.group_id=10
order by
service_name,
gvi.inst_id
/



7-DLM TRAFFIC INFORMATION


prompt '**************************************************************'
prompt 'How many tickets are available in the DLM. '
prompt 'If the TCKT_WAIT column says "YES", then'
prompt 'we have run out of DLM tickets, which could cause a DLM hang. '
prompt 'Make sure that you also have enough TCKT_AVAIL. '
prompt '**************************************************************'
set numwidth 5
select * from gv$dlm_traffic_controller
order by TCKT_AVAIL
/



8-If the interconnect is slow, busy, or faulty, look for dropped packets,
retransmits, or cyclic redundancy check (CRC) errors. You can use netstat commands
to check the network; on Unix, check the man page for netstat for a list of options.
Also check the OS logs for any errors and make sure that inter-instance traffic is
not routed through a public network.
With most network protocols, you can use 'oradebug ipc' to see which interconnects
the database is using:
SQL> oradebug setmypid
SQL> oradebug ipc

This will dump a trace file to the user_dump_dest. The output will look something
like this:
SSKGXPT 0x1a2932c flags SSKGXPT_READPENDING info for network 0
socket no 10 IP 172.16.193.1 UDP 43749
sflags SSKGXPT_WRITESSKGXPT_UP info for network 1
socket no 0 IP 0.0.0.0 UDP 0...
So you can see that this database is using IP 172.16.193.1 with the UDP protocol.
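
On 10g and later, the interconnect(s) actually in use can also be checked from SQL with
gv$cluster_interconnects, as an alternative to the oradebug dump (illustrative query):

-- Shows the interconnect address registered by each instance and where it came from
-- (e.g. the OCR, the cluster_interconnects parameter, or the OS).
select inst_id, name, ip_address, is_public, source
from gv$cluster_interconnects
order by inst_id
/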


9-Under-configured network settings at the OS level.
Check UDP or other network protocol settings and tune them. See your OS-specific documentation
for instructions on how to do this. If using UDP, make sure the parameters relating to send buffer space,
receive buffer space, send highwater, and receive highwater are set well above the
OS defaults. The alert.log will indicate which protocol is being used. Example:
cluster interconnect IPC version: Oracle RDG
IPC Vendor 1 proto 2 Version 1.0
Refer to the OS manual for changing the network parameters to optimal values.
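
If dropped packets or undersized UDP buffers are suspected, the 'gc blocks lost' statistics (the
subject of Doc ID 563566.1 listed below) give a quick confirmation from inside the database; an
illustrative query:

-- Non-zero, steadily increasing values here usually point to interconnect
-- packet loss or undersized OS network buffers.
select inst_id, name, value
from gv$sysstat
where name in ('gc blocks lost', 'gc blocks corrupt')
order by inst_id, name
/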


10-Database Import (OpenVMS)
Install imp.exe as a resident image with shared address space
$ install add imp.exe/resident/share=addr
Increase default quotas for BEQ’s mailboxes
$ define/sys ORA_BEQ_MBXSIZ 64000
$ define/sys ORA_BEQ_MBXSBFQ 64000
Set DEFMBXBUFQUO to 64000
Set DEFMBXMXMSG to 64000


11-DBMS_STATS.GATHER_SCHEMA_STATS
Calling GATHER_SCHEMA_STATS results in a database server process being created.
The server process is not multithreaded and typically consumes 100% of one CPU.
On OpenVMS, performance can be improved by affinitizing the server process to one CPU and increasing
QUANTUM to 20.
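
For reference, a typical invocation looks like the following; the schema name and degree are
placeholders, not values from the original note:

-- Hypothetical example; pick the schema and degree appropriate for your system.
begin
  DBMS_STATS.GATHER_SCHEMA_STATS(
    ownname          => 'APP',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
    degree           => 4,
    cascade          => TRUE);
end;
/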



12-Analyze the efficiency of sort operations
Determine the number of optimal, one-pass, and multipass work area executions.


select SUBSTR(NAME,1,30) , SUBSTR(VALUE,1,50) from v$parameter where 
name ='pga_aggregate_target'
/

SELECT optimal_count, round(optimal_count*100/total, 2)
optimal_perc,
onepass_count, round(onepass_count*100/total, 2)
onepass_perc,
multipass_count, round(multipass_count*100/total, 2)
multipass_perc
FROM
(SELECT decode(sum(total_executions), 0, 1,
sum(total_executions)) total,
sum(OPTIMAL_EXECUTIONS) optimal_count,
sum(ONEPASS_EXECUTIONS) onepass_count,
sum(MULTIPASSES_EXECUTIONS) multipass_count
FROM v$sql_workarea_histogram
WHERE low_optimal_size > 64*1024)
/

SELECT
PGA_TARGET_FACTOR , round(PGA_TARGET_FOR_ESTIMATE/1024/1024) target_mb,
ESTD_PGA_CACHE_HIT_PERCENTAGE cache_hit_perc,
ESTD_OVERALLOC_COUNT
FROM v$pga_target_advice
/


13-Size the SGA and the shared pool


select sga_size, sga_size_factor as size_factor,
estd_physical_reads as estimated_physical_reads
from v$sga_target_advice order by sga_size_factor;


select
SHARED_POOL_SIZE_FOR_ESTIMATE ,SHARED_POOL_SIZE_FACTOR,
ESTD_LC_TIME_SAVED_FACTOR, ESTD_LC_LOAD_TIME
from V$SHARED_POOL_ADVICE
/

14-Tune the SQL.
Poor SQL or bad optimizer plans can cause additional block gets via the interconnect, just as
they would via I/O.
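
One hedged way to find the statements responsible is to sort cached SQL by cluster wait time;
the column names below are from gv$sql, and cluster_wait_time is reported in microseconds in
recent releases:

-- Illustrative query: top 10 cached statements by time spent on cluster (gc) waits.
set linesize 150
column sql_text format a60 tru
select *
  from (select inst_id, sql_id, executions,
               round(cluster_wait_time/1e6,1) "CLUSTER WAIT (s)",
               round(elapsed_time/1e6,1) "ELAPSED (s)",
               substr(sql_text,1,60) sql_text
          from gv$sql
         where cluster_wait_time > 0
         order by cluster_wait_time desc)
 where rownum <= 10
/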



15-Set the base priority of the CRS queue to 12 (OpenVMS). CRS is running in batch; by default, the
base priority of a batch queue is 4, so set the base priority of the CRS queue to 12.

16-The use of locally managed tablespaces (instead of dictionary-managed) with the
'SEGMENT SPACE MANAGEMENT AUTO' option can reduce dictionary and
freelist block contention. Check the current settings with the query below.

select
TABLESPACE_NAME ,EXTENT_MANAGEMENT ,ALLOCATION_TYPE ,
SEGMENT_SPACE_MANAGEMENT
from dba_tablespaces
/
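
A hypothetical example of such a tablespace definition (the name, path, and size are placeholders):

-- Locally managed, ASSM tablespace; new or rebuilt segments placed here
-- avoid freelist management entirely.
create tablespace app_data
  datafile '/u01/oradata/ORCL/app_data01.dbf' size 2G
  extent management local autoallocate
  segment space management auto
/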



17-A poor gc_files_to_locks setting can cause problems. In almost all cases in RAC,
gc_files_to_locks does not need to be set at all.

SELECT inst_id,SUBSTR(NAME,1,30) ,SUBSTR(VALUE,1,50) from gv$parameter
where name ='gc_files_to_locks'
/
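
If the query shows a non-default value, the parameter can normally just be removed from the spfile
(illustrative command; the change takes effect after the instances are restarted):

-- Remove gc_files_to_locks from the spfile on all instances.
alter system reset gc_files_to_locks scope=spfile sid='*'
/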


For details, refer to the following Oracle Metalink notes:

Doc ID: 135714.1:Script to Collect RAC Diagnostic Information (racdiag.sql)
Doc ID: 181489.1:Tuning Inter-Instance Performance in RAC and OPS
Doc ID: 563566.1:gc lost blocks diagnostics



Thursday 19 August 2021

ORA-609-opiodr aborting process unknown ospid

Details of error are as follows.

ORA-00609: could not attach to incoming connection
ORA-12537: TNS:connection closed
ORA-609 : opiodr aborting process unknown ospid (2436_7464)
ORA-609 : opiodr aborting process unknown ospid (21631120_1)

Opiodr aborting process unknown ospid


When a client makes a connection to an Oracle database, there are multiple steps involved.
These steps make multiple network round trips between the client and the server.
The parameters we are using here are essentially hiding a network problem from the application.
They control when Oracle will throw an error about connections not getting completed.
Please also ask your network team to look into network issues.


These ORA-609 errors are related to the client disconnecting before the connection can be fully established. The timeout parameters INBOUND_CONNECT_TIMEOUT_<listener_name> in the LISTENER.ORA file and SQLNET.INBOUND_CONNECT_TIMEOUT in the SQLNET.ORA file need to be reviewed.

If the default of 60 seconds is in use (no explicit setting), it is likely these parameters will need to be increased.

  • Review and check network settings for the server machine where the database is running. Ensure settings are all correct and as expected and DNS servers are available
  • If the server platform is Microsoft Windows, check the Windows Services for the TNS listener and the database, and ensure each service is started using the same account.

 ORA-609 with TNS-12537: TNS:connection closed

We sometimes receive an error such as ORA-609, which shows in the alert log as:

TNS-12537: TNS:connection closed
        ns secondary err code: 12560
        nt main err code: 0
        nt secondary err code: 0
        nt OS err code: 0
    opiodr aborting process unknown ospid (12345) as a result of ORA-609

This error means the server could not attach to the incoming connection, so our database process was aborted or closed because the incoming connection was not passed to it by the listener. In our case we found the SQL*Net error stack shown below:
 TNS-12537: TNS:connection closed

It means the dedicated process no longer had a client connection to work with.

As we all know, the process of establishing a client connection is:

1. First, the client initiates a connection to the database, so it connects to the listener.

2. The listener then starts a dedicated DB process that will receive this connection or session.

3. Once this dedicated process has started, the listener passes the client connection to it.

4. The server process then takes the connection from the listener and continues the handshake with the client.

5. The server process and the client exchange the messages or information required for establishing a session or user logon.

6. The session is now open.


In our case, it may be that when the dedicated process tries to communicate with the client, it finds that the connection has already been closed.


👉Solution or workaround:

We can eliminate this error by increasing the value of SQLNET.INBOUND_CONNECT_TIMEOUT in sqlnet.ora file. 

By default the SQLNET.INBOUND_CONNECT_TIMEOUT value is 60 seconds.

The sqlnet.ora file can be found in $ORACLE_HOME/network/admin of the RDBMS home.

For example:

$ cd $ORACLE_HOME/network/admin

$vi sqlnet.ora

And change the value of SQLNET.INBOUND_CONNECT_TIMEOUT as:


SQLNET.INBOUND_CONNECT_TIMEOUT=120
SQLNET.EXPIRE_TIME = 15

Then save the file.


There are many more possibilities that can cause this error, such as:

1. This can happen if our client closes the connection immediately after initiating it.

2. This may occur when a firewall kills the connection.

3. Sometimes it may happen because of a database, OS, or network performance problem.

4. This error may occur when a client crashes.


To solve this error, 

set SQLNET.INBOUND_CONNECT_TIMEOUT=180 in Sqlnet.ora.

and set INBOUND_CONNECT_TIMEOUT_listener_name=120 in Listener.ora 

under

$ORACLE_HOME/network/admin.



SQL*Plus / SQL*Net Dead Connection Detection

Recently I came across a situation where I knew for a fact that my sessions to the database were dead, because I had pulled the power plug out of my application server for a failover test. But the sessions stayed visible in the database and kept their locks, and therefore the failover failed.

Now how is that possible?

First let me sketch the layout of the setup.

Suppose you have a vendor-supplied application that runs on their middleware. That middleware is using two application servers in a cluster, with only one of the application servers communicating with the database. Let's name this one AppServerOne.

The second application server (AppServerTwo) contacts AppServerOne when its clients need to talk to the database. But when AppServerOne becomes unresponsive, AppServerTwo will try to set up a connection to the database and check whether AppServerOne is indeed no longer handling the database requests.




In this case the middleware checks if AppServerOne has released a lock in the database that AppServerTwo can then take.

When I kill AppServerOne by killing an important process on that server, AppServerTwo indeed takes over and the clients do not experience a problem. But I want to make this a bit more realistic and decide to pull a power plug.

AppServerTwo still notices the problem with AppServerOne and tries to obtain the lock, but it fails to do so. AppServerTwo gives up trying after about 15 minutes.

High Availability down the drain! Now what?

I repeat the test and look a bit closer at what is happening in the database.
In the first test the sessions from AppServerOne are disappearing from the database, but in the second test they are still there after 18 minutes. In fact they disappear somewhere between 18 and 21 minutes.

This is not acceptable, as the failover should not be noticed by the clients, and here they are noticing it.

Of course I had created a sqlnet.ora on the database server with sqlnet.expire_time specified to enable Dead Connection Detection:

$ cat sqlnet.ora
sqlnet.expire_time = 2

But that is apparently not working as expected.

The vendor had specified that the following TCP/IP parameters had to be set at all the servers involved.

net.ipv4.tcp_keepalive_time = 180
net.ipv4.tcp_keepalive_probes = 10
net.ipv4.tcp_keepalive_intvl = 6

With these settings a stale socket is detected after 4 minutes (180 + (10 * 6) = 240 seconds). Or so they claim.

Their default values are (On Oracle Linux 6.5):

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

Humm? Could the vendor be right after all?

I wanted to get to the bottom of this and needed to test more without causing problems on the application servers. So I created a test with a sqlplus session to a database. When I kill that session the database notices that the session has died and removes it from the database. But when I pull the power plug of the server that initiated the sqlplus session I get the same behavior.

Setting the parameters that the vendor wanted didn't change the test result. The sqlplus session still stayed in the database way too long.

Using Google didn't get me closer to a solution, but searching on Oracle Support led me to these documents:
“Performance problem with Oracle*Net Failover when TCP Network down (no IP address) (Doc ID 249213.1)”
“Tuning TCP/IP parameter in Linux Box for SQLNET (Doc ID 274953.1)”
“Dead Connection Detection (DCD) Explained (Doc ID 151972.1)”

The problem here is that the database cannot reach the network endpoint that initiated the connection (because it is powered off), and the network stack tries too long to reach it before giving up. Once it gives up, the database removes the session. That is probably one of the reasons that Oracle started using VIPs and SCAN listeners for RAC: those are restarted on the surviving nodes and thus reappear on the network.

The parameter that I had to change is: net.ipv4.tcp_retries2.
It defaults to 15 retries.

Once this was set to 3, the database sessions were removed quickly enough: just a little over 4 minutes, and still way more than the 2 minutes an unsuspecting DBA might expect.

Apparently it isn't a linear function, as my results varied in the time it took for a session to be removed from the database. The explanation of the parameter net.ipv4.tcp_retries2 didn't help me understand why, and I had to give up looking for a reason because I ran out of time.

If you are reading this blog because you have a similar problem, please make sure you test your situation carefully. And do not forget to make your changes reboot-persistent; in Oracle Linux 6 you edit the file /etc/sysctl.conf, but that can change in future versions.

Hope This Helps

Oracle TNS-12535 and Dead Connection Detection

 Dead Connection Detection


These days everything goes to the cloud or is colocated somewhere in a shared infrastructure. In this post I'll talk about sessions being disconnected from your databases, firewalls, and dead connection detection.

Changes:

We moved a number of 19c databases from one data center to another.

Symptoms

Now, probably many of you have seen the following error in your database alert log: "TNS-12535: TNS:operation timed out". If you haven't, you will definitely see it some day.


Consider the following error from database alert log:


Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 19.3.0.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 19.3.0.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 19.3.0.0 - Production
Time: 12-MAR-2021 10:28:08
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831))
Thu Mar 12 10:28:09 2021


Now this error indicates timing issues between the server and the client. It's important to mention that these errors are resultant: they are informational and not the actual cause of the disconnect. Although this error might happen for a number of reasons, it is commonly associated with firewalls or slow networks.


Troubleshooting:

The best way to understand what's happening is to build a histogram of the duration of the sessions. In particular, we want to understand whether the disconnects are sporadic and random or whether they follow a specific pattern.

To do so you need to parse the listener log and locate the following line from the example above:

(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831))


Since the port is random, you might not find the same record, or if you do, it might be days apart.


Here's what I found in the listener log:

12-MAR-2021 08:16:52 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831)) * establish * ORCL * 0


In other words, at 8:16 the user scott established a connection from host 192.168.0.10.

Now if you compare both records you'll get the duration of the session:

Established: 12-MAR-2021 08:16:52
Disconnected: Thu Mar 12 10:28:09 2021

Here are a couple of other examples:

alertlog:
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620))
Thu Mar 12 10:31:20 2021
listener.log:
12-MAR-2021 08:20:04 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620)) * establish * ORCL * 0

alertlog:
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157))
Thu Mar 12 10:37:51 2021
listener.log:
12-MAR-2021 08:26:36 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157)) * establish * ORCL * 0

alertlog:
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618))
Tue Mar 10 19:09:09 2021
listener.log
10-MAR-2021 16:57:54 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=root))(SERVICE_NAME=ORCL1)(SERVER=DEDICATED)) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618)) * establish * ORCL1 * 0

As you may have noticed, the errors follow a very strict pattern: each session gets disconnected exactly 2 hrs 11 mins after it has been established.

Cause

Given the repetitive behaviour of the issue, and that it happened for multiple databases and application servers, we can conclude that it's definitely a firewall issue.

The firewall recognizes the TCP protocol and keeps a record of established connections, and it also recognizes TCP connection-closure packets (TCP FIN packets). However, sometimes the client may abruptly end communication without closing the endpoints properly by sending a FIN packet, in which case the firewall will not know that the endpoints will no longer use the opened channel. To resolve this problem, the firewall imposes a blackout on connections that stay idle for a predefined amount of time.

The only issue with this blackout is that neither of the sides is notified.

In our case the firewall disconnects idle sessions after around 2 hrs of inactivity.

Solution:

The solution on the database server side is to use the Dead Connection Detection (DCD) feature. DCD detects when a connection has terminated unexpectedly and flags the dead session
so PMON can release the resources associated with it.

DCD sets a timer when a session is initiated, and when the timer expires SQL*Net on the server sends a small 10-byte probe packet to the client to make sure the connection is still active. If the client has terminated unexpectedly, the server will get an error, the connection will be closed, and the associated resources will be released. If the connection is still active, the probe packet is discarded and the timer is reset.

To enable DCD you need to set SQLNET.EXPIRE_TIME in the sqlnet.ora of your RDBMS home!

cat >> $ORACLE_HOME/network/admin/sqlnet.ora

SQLNET.EXPIRE_TIME=10

This will set the timer to 10 minutes. Remember that sessions need to reconnect for the change to take effect; it won't work for existing connections.

Firewalls are becoming smarter, and they can now inspect packets even more deeply. Make sure the following settings are also disabled:
- SQLNet fixup protocol
- Deep Packet Inspection (DPI)
- SQLNet packet inspection
- SQL Fixup

I had a similar issue with Data Guard already; read more here: Smart Firewalls.



How to test Dead Connection Detection?

You might want to test or make sure DCD really works. You've got multiple options here: Oracle SQL*Net client trace, Oracle SQL*Net server trace, sniffing the network with a packet analyzer, or using strace to trace the server process. I used strace since I had access to the database server and it was non-intrusive.

1. Establish a connection to the database through SQL*Net

2. Find the processes number for your session:

SQL>  select SPID from v$process where ADDR in (select PADDR from v$session where username='SVE');

SPID
------------------------
62761


3. Trace the process

[oracle@dbsrv ~]$ strace -tt -f -p 62761
Process 62761 attached - interrupt to quit
11:36:58.158348 --- SIGALRM (Alarm clock) @ 0 (0) ---
11:36:58.158485 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
....
11:46:58.240065 --- SIGALRM (Alarm clock) @ 0 (0) ---
11:46:58.240211 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
...
11:46:58.331063 write(20, "\0\n\0\0\6\20\0\0\0\0", 10) = 10
...

What I did was attach to the process, simulate some activity at 11:36, and then leave the session idle. Ten minutes later the server process sent a small probe packet to the client to check whether the connection was still alive.

Conclusion

The errors in the alert log disappeared after I enabled DCD.

Make sure to enable DCD if you host your databases in a shared infrastructure or there are firewalls between your database and application servers.


References
How to Check if Dead Connection Detection (DCD) is Enabled in 9i ,10g and 11g (Doc ID 395505.1)
Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out (Doc ID 1628949.1)
Resolving Problems with Connection Idle Timeout With Firewall (Doc ID 257650.1)
Dead Connection Detection (DCD) Explained (Doc ID 151972.1)

