Amit's Oracle DBA Blog: What happens when a node crashes in RAC

Saturday, 2 November 2024

What happens when a node crashes in RAC

In Oracle Real Application Clusters (RAC), if a node crashes or becomes unavailable, the Oracle Clusterware daemons and the database background processes detect the failure, notify the surviving nodes, and recover so that the database stays available. Here’s a step-by-step breakdown of the notification and communication process between Oracle RAC nodes during a node failure:

1. Heartbeat Monitoring

Each Oracle RAC node constantly monitors the health of the other nodes using heartbeats.
There are two kinds of heartbeat, each with its own channel:
- Network heartbeat, sent over the Cluster Interconnect (the private network connecting the RAC nodes)
- Disk heartbeat, written to the Voting Disks (shared disks accessible to all nodes).

2. Detection of Node Failure

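If Node 1 crashes, the surviving nodes (Node 2, Node 3, etc.) stop receiving heartbeats from Node 1.
The Oracle Clusterware daemon that tracks this is Cluster Synchronization Services (CSS, process ocssd). When heartbeats from a node stay missing for longer than the configured timeout (the CSS misscount, commonly 30 seconds by default), CSS treats that node as failed.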
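
The timeout logic described above can be illustrated with a small Python sketch. This is a toy model, not how CSS is implemented: the HeartbeatMonitor class, its method names, and the 30-second value are assumptions made for illustration, and the real timeout depends on your cluster's CSS settings.

```python
from dataclasses import dataclass, field

# Assumed timeout for the sketch; check the actual CSS setting on your cluster.
MISSCOUNT_SECONDS = 30.0


@dataclass
class HeartbeatMonitor:
    """Toy model: remembers when each peer node was last heard from."""
    timeout: float = MISSCOUNT_SECONDS
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Store the timestamp of the most recent heartbeat from this node.
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        # A node counts as failed once its silence exceeds the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


monitor = HeartbeatMonitor()
monitor.record_heartbeat("node1", now=0.0)
monitor.record_heartbeat("node2", now=0.0)
monitor.record_heartbeat("node2", now=25.0)  # node2 keeps reporting, node1 goes silent
print(monitor.failed_nodes(now=31.0))        # ['node1']
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and easy to test.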

3. CSS Daemon Notifies CRSD

The CSS daemon (on surviving nodes) identifies Node 1 as unavailable and informs the Cluster Ready Services daemon (CRSD).
CRSD is responsible for managing cluster resources and initiating recovery processes.

4. GCS and GES Handle Cache Synchronization

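The Global Cache Service (GCS) and Global Enqueue Service (GES) manage the shared buffer cache and the locks (enqueues) in RAC. They are services rather than standalone daemons, and are implemented by database background processes such as LMS, LMD, and LMON.
Upon failure detection, GCS and GES reconfigure the Global Resource Directory: locks and cache resources held by the failed node are released, and resources that were mastered on it are remastered on surviving nodes so that those nodes can take over.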
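
As a rough illustration of releasing a failed node's locks and handing its resources to other nodes, here is a minimal Python sketch. The resource table layout, the function name release_failed_node, and the round-robin choice of the new owner are all hypothetical simplifications; Oracle's actual algorithm is more involved.

```python
def release_failed_node(resources: dict, failed: str, survivors: list) -> dict:
    """Return a new resource table without the failed node.

    resources maps a resource name to {"master": node, "holders": set of nodes}.
    Holders on the failed node are dropped; resources it mastered are handed
    to the survivors in round-robin order (an assumption made for this sketch).
    """
    if not survivors:
        raise ValueError("at least one surviving node is required")
    rebuilt = {}
    next_master = 0
    for name in sorted(resources):
        entry = resources[name]
        holders = set(entry["holders"]) - {failed}   # release the failed node's locks
        master = entry["master"]
        if master == failed:                         # pick a new master
            master = survivors[next_master % len(survivors)]
            next_master += 1
        rebuilt[name] = {"master": master, "holders": holders}
    return rebuilt


table = {
    "block_101": {"master": "node1", "holders": {"node1", "node2"}},
    "block_102": {"master": "node2", "holders": {"node1"}},
    "block_103": {"master": "node1", "holders": {"node3"}},
}
after = release_failed_node(table, failed="node1", survivors=["node2", "node3"])
for name, entry in after.items():
    print(name, entry["master"], sorted(entry["holders"]))
# block_101 node2 ['node2']
# block_102 node2 []
# block_103 node3 ['node3']
```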

5. Inter-node Communication and Notification

The change in cluster membership is communicated to every surviving node, and the CRSD daemon on those nodes acts on Node 1’s failure for the cluster resources it manages.
This notification lets the other nodes adjust their processes and workloads and take over sessions and services accordingly.

6. Database Recovery Process

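Instance recovery is triggered automatically. It does not involve Oracle Recovery Manager (RMAN), which is a backup and restore tool; the work is done by the SMON background process of a surviving instance.
The surviving instance reads the redo thread of the failed instance on Node 1 and rolls forward the changes recorded there, then rolls back any in-flight transactions that had not committed when the node crashed.
This ensures that committed transactions are preserved and uncommitted work is undone, without data loss or corruption.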
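
The general idea behind crash recovery, rolling forward the logged changes and then rolling back whatever never committed, can be sketched in a few lines of Python. This is a simplified model and not Oracle's implementation: the record format, the recover function, and the in-memory "datafile" are assumptions made for illustration.

```python
_MISSING = object()  # marks "key did not exist before this write"


def recover(datafile: dict, redo_log: list) -> dict:
    """Roll forward every logged write, then undo writes of uncommitted transactions."""
    data = dict(datafile)
    undo = {}          # txn id -> list of (key, previous value), in write order
    committed = set()

    # Roll forward: replay the log in order, remembering the old values.
    for record in redo_log:
        if record["op"] == "commit":
            committed.add(record["txn"])
        else:  # "write"
            key = record["key"]
            undo.setdefault(record["txn"], []).append((key, data.get(key, _MISSING)))
            data[key] = record["value"]

    # Roll back: restore old values for transactions with no commit record.
    for txn, changes in undo.items():
        if txn in committed:
            continue
        for key, old in reversed(changes):
            if old is _MISSING:
                data.pop(key, None)
            else:
                data[key] = old
    return data


datafile = {"balance_a": 100, "balance_b": 50}
redo_log = [
    {"op": "write", "txn": "T1", "key": "balance_a", "value": 80},
    {"op": "write", "txn": "T1", "key": "balance_b", "value": 70},
    {"op": "commit", "txn": "T1"},
    {"op": "write", "txn": "T2", "key": "balance_a", "value": 0},  # crash before commit
]
print(recover(datafile, redo_log))  # {'balance_a': 80, 'balance_b': 70}
```

T1 committed before the crash, so its changes survive. T2 did not, so its write to balance_a is undone.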

7. Rebalancing Workloads

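Sessions that were connected to Node 1 are disconnected by the crash. Oracle RAC keeps the workload balanced by relocating services and having those clients reconnect to the other available nodes, which can happen transparently when a feature such as Transparent Application Failover (TAF) or Application Continuity is configured.
The Load Balancing Advisory publishes service-level load information so that clients and connection pools direct new connections to the less loaded instances, ensuring minimal disruption.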
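
To illustrate how sessions from a failed node might be spread across the survivors, here is a minimal Python sketch that sends each reconnecting session to the currently least loaded node. The redistribute function and the session-count metric are hypothetical; the actual Load Balancing Advisory uses service-level goals and metrics rather than a plain session count.

```python
import heapq


def redistribute(sessions_by_node: dict, failed: str) -> dict:
    """Assign each session of the failed node to the least loaded survivor."""
    survivors = {n: list(s) for n, s in sessions_by_node.items() if n != failed}
    if not survivors:
        raise ValueError("no surviving nodes")

    # Min-heap of (session count, node name): the top is the least loaded node.
    heap = [(len(s), n) for n, s in survivors.items()]
    heapq.heapify(heap)

    for session in sessions_by_node.get(failed, []):
        count, node = heapq.heappop(heap)
        survivors[node].append(session)
        heapq.heappush(heap, (count + 1, node))
    return survivors


cluster = {
    "node1": ["s1", "s2", "s3"],
    "node2": ["s4"],
    "node3": ["s5", "s6"],
}
print(redistribute(cluster, failed="node1"))
# {'node2': ['s4', 's1', 's2'], 'node3': ['s5', 's6', 's3']}
```

Ties are broken by node name because the heap compares tuples, which is why s2 lands on node2 in this run.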

8. Reconfiguration of the Cluster

Oracle Clusterware performs a cluster reconfiguration to remove Node 1 from the active cluster membership.
Node 1’s virtual IP (VIP) fails over to a surviving node, so new sessions that try to connect to the failed node get an immediate error instead of waiting for a network timeout, and resources are redistributed to the remaining nodes.

9. Notification to DBA or Monitoring Tools

Oracle Notification Service (ONS) publishes the failure events, and monitoring tools (Oracle Enterprise Manager or third-party products) raise alerts.
DBAs and administrators are informed of the failure and can take corrective actions as needed.

Each of these steps ensures that Oracle RAC handles node failures automatically, keeping the database available and redistributing the workload efficiently.
