
Sunday, 24 November 2024

Fencing, Quorum, and Split-Brain in Oracle RAC

 

Split-brain in Oracle RAC refers to a situation where two or more nodes (servers) in a Real Application Clusters (RAC) environment each believe they are the only active part of the cluster. This can happen when the private interconnect between the nodes fails and the nodes lose contact with each other. Each node, or group of nodes, then assumes the others are down and continues to operate independently, which can lead to data inconsistency.

To explain it in layman's terms:

Imagine you and a friend are in charge of managing a shared notebook. Both of you are sitting at different desks, and you communicate by passing the notebook between you. However, one day, the connection between your desks is cut off, and you can't talk to each other.

Since neither of you knows what the other is doing, you each continue writing in the notebook, thinking you're the only one using it. When the connection is restored, you realize that both of you have written in different parts of the notebook, which could lead to confusion and errors when you compare the two versions.

In Oracle RAC, split-brain is like that situation: each node thinks it's the only one managing the data, and they both continue to make changes independently. When the connection is restored, the system faces the problem of reconciling the two different versions of data, which can result in inconsistency or corruption.

Key Points:

  • Split-brain occurs due to a network failure or communication issues between RAC nodes.
  • It leads to data inconsistencies, as each node may make changes without knowing what the others are doing.
  • Oracle RAC has mechanisms like clusterware fencing and quorum to prevent split-brain, by ensuring that only one group of nodes that can still talk to each other keeps access to the shared data.

In short, split-brain is when the "brain" (the cluster) gets divided, and each part thinks it’s in charge, leading to chaos!
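To make the idea of a divided cluster concrete, here is a minimal Python sketch that groups nodes into the sub-clusters that can still reach each other. It is an illustration only, not how Oracle Clusterware is implemented; the function name find_subclusters and the node names are hypothetical.

```python
def find_subclusters(nodes, links):
    """Group nodes into sub-clusters that can still reach each other.

    nodes: list of node names.
    links: list of (node, node) pairs whose interconnect link still works.
    """
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    groups = []
    for start in nodes:
        if start in seen:
            continue
        # Walk every node reachable from this starting node.
        group = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Node A has lost its links; only B and C can still talk to each other.
print(find_subclusters(["A", "B", "C"], [("B", "C")]))
```

If the result contains more than one group, the cluster is partitioned, and something has to decide which group keeps running.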


In Oracle RAC, fencing and quorum are mechanisms designed to protect the database from split-brain scenarios, ensuring that only one group of connected nodes keeps control over the shared data when a communication failure occurs between nodes.

Let’s break it down step by step to clarify how fencing and quorum work, and whether you need to set up a quorum disk.


1. What is Fencing in Oracle RAC?

Fencing is a process that prevents a node that has lost contact with the other nodes in the cluster from making changes to the shared data. This is critical to avoid the risk of multiple nodes independently modifying the same data, which can lead to corruption (split-brain scenario).


How is Fencing Initiated?

Fencing in Oracle RAC is initiated by Oracle Clusterware, chiefly by Cluster Synchronization Services (CSS). CSS monitors node health through a network heartbeat over the private interconnect and a disk heartbeat to the voting disks. If a node is found to be isolated or “out of the loop,” it is evicted from the cluster to prevent it from accessing or modifying the data.

The following processes are involved:

  1. CSSD / OCSSD (Cluster Synchronization Services Daemon): These are two names for the same daemon (ocssd.bin). It tracks cluster membership, monitors the heartbeats, and initiates the eviction of a node that is isolated or failing.
  2. cssdagent and cssdmonitor: In 11g Release 2 and later, these processes watch the CSS daemon itself and can restart the node if it hangs.
  3. CRSD (Cluster Ready Services Daemon): The CRSD process manages cluster resources such as database instances, listeners, and VIPs, and restarts or relocates them after the cluster membership changes. It does not fence nodes itself.

Example of Fencing:

  • If Node A loses communication with Node B and Node C, Oracle Clusterware detects this isolation.
  • Node A is evicted: it is rebooted, or its Clusterware stack is restarted, so it can no longer write to the shared storage (disk groups).
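The detection step in the example above can be sketched as a simple timeout check. This is a simplified model, not Clusterware's actual logic: the function name nodes_to_evict is hypothetical, and the 30-second threshold is an assumption based on the commonly cited default for the CSS misscount setting, so check the value on your own cluster.

```python
def nodes_to_evict(last_heartbeat, now, misscount=30):
    """Return nodes whose last network heartbeat is older than misscount seconds.

    last_heartbeat: dict of node name -> time of last heartbeat (seconds).
    now: current time in the same units.
    misscount: allowed silence before a node is considered lost (assumed 30).
    """
    return sorted(
        node
        for node, seen_at in last_heartbeat.items()
        if now - seen_at > misscount
    )


# Node A was last heard from 31 seconds ago; B and C are recent.
heartbeats = {"A": 100, "B": 128, "C": 129}
print(nodes_to_evict(heartbeats, now=131))
```

In a real cluster the surviving nodes also have to agree on the new membership before anyone is evicted, which is where quorum comes in.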


2. What is Quorum in Oracle RAC?

Quorum is a mechanism used in Oracle RAC to ensure that, after a communication failure, only one partition of the cluster (normally the one holding the majority) keeps running. This avoids the scenario where separate partitions make conflicting decisions (split-brain).

Quorum Explained:

  • In a RAC cluster, Quorum represents the majority of nodes that must agree on any decision for the cluster to function normally.
  • If communication is lost between certain nodes, the nodes that still hold quorum agree on which nodes keep running and which are evicted.

Example of Quorum:

Imagine a 3-node RAC cluster (Node A, Node B, Node C):

  • If Node C loses contact with Node A and Node B, Node A and Node B still hold the majority (2 out of 3) and continue running.
  • Node C is alone with 1 out of 3 votes, so it cannot continue making decisions and is evicted.

In a case of a 2-node RAC cluster (Node A and Node B):

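  • If Node A loses contact with Node B, each node is alone with 1 out of 2 votes, so node count alone cannot form a majority. A tiebreaker is needed, and that is the role of the voting disk: the node that wins the vote there survives, and the other is evicted.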
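The two examples above can be sketched in a few lines of Python. This is a deliberately simplified model: it assumes the largest sub-cluster survives and that a tie goes to the sub-cluster containing the lowest node number, which matches the commonly described behavior of older releases; recent releases may weigh other factors. The function name surviving_subcluster is hypothetical.

```python
def surviving_subcluster(subclusters):
    """Pick the sub-cluster that keeps running after a network split.

    subclusters: list of lists of node numbers, one list per partition.
    Assumed rule: most nodes wins; on a tie, the partition that contains
    the lowest node number wins.
    """
    return min(subclusters, key=lambda group: (-len(group), min(group)))


# 3-node cluster: nodes 1 and 2 can see each other, node 3 is isolated.
print(surviving_subcluster([[1, 2], [3]]))

# 2-node cluster: equal sizes, so the tie-break rule decides.
print(surviving_subcluster([[2], [1]]))
```

Every node outside the returned group would be evicted.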


3. Is Quorum Default in Oracle RAC?

Yes. Quorum handling is built into Oracle Clusterware, and voting disks are configured when Grid Infrastructure is installed. Depending on your environment, you may want to add or relocate voting disks afterwards.


Do You Need a Quorum Disk?

A voting disk (often called a quorum disk) is a shared disk or file in which the cluster stores membership and voting information. The voting disk helps ensure that all nodes in the cluster can agree on which nodes should keep control when failures happen.

In a 2-node Oracle RAC, neither node holds a majority of node votes after a split. The voting disk acts as the tiebreaker, so that exactly one node survives and can continue to function properly.


When to Use a Quorum Disk:
  • Two-Node Cluster: The voting disk is essential as the tiebreaker, because node count alone cannot produce a majority after a split.
  • Three or More Nodes: The larger sub-cluster can be identified by node count, but Oracle Clusterware still requires voting disks, and a node that cannot access more than half of them is evicted.
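Oracle Clusterware expects each node to reach more than half of the configured voting disks, which is why an odd number of them is recommended. The arithmetic can be sketched as follows; the function name has_voting_disk_majority is hypothetical.

```python
def has_voting_disk_majority(accessible, total):
    """Return True if a node can reach strictly more than half the voting disks."""
    if total <= 0 or not 0 <= accessible <= total:
        raise ValueError("need 0 <= accessible <= total and total > 0")
    return 2 * accessible > total


# With 3 voting disks, a node survives the loss of one disk but not two.
for accessible in (3, 2, 1):
    print(accessible, "of 3:", has_voting_disk_majority(accessible, 3))

# With 2 voting disks, losing either one already breaks the majority.
print("1 of 2:", has_voting_disk_majority(1, 2))
```

This is why two voting disks tolerate no more disk failures than one, while three tolerate a single failure.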


Quorum Disk Configuration:

  1. Creating a Quorum Disk: You can place voting disks in an Oracle ASM (Automatic Storage Management) disk group or on other shared storage. The storage must be accessible by all nodes in the cluster.

    Example commands to add a voting disk:

    crsctl replace votedisk +<disk_group>          # voting disks stored in ASM
    crsctl add css votedisk <path_to_voting_disk>  # voting disks on non-ASM shared storage
  2. Checking Quorum Configuration: You can list the configured voting disks and their state:

    crsctl query css votedisk



4. How Fencing and Quorum Work Together to Prevent Split-Brain:

When a split-brain scenario happens (nodes can’t communicate), fencing and quorum ensure that only the majority of nodes are allowed to operate.

  1. Fencing locks out any isolated node that cannot communicate with the cluster.
  2. Quorum ensures that the remaining nodes can continue to function if they have the majority vote.


Example Scenario:

  • 3-node RAC Cluster: Node A, Node B, Node C.
  • If Node A loses communication with Node B and Node C, fencing ensures Node A is locked out.
  • Quorum ensures that Node B and Node C (together the majority, 2 out of 3) continue to operate, while Node A cannot operate independently, as it does not have a majority.


Summary Steps:

  1. Quorum decides which nodes have the majority and can continue making decisions.
  2. Fencing locks out nodes that are isolated from the cluster to prevent them from making conflicting changes.
  3. If you're using a 2-node RAC, the voting disk acts as the tiebreaker; in a larger cluster, the surviving group is usually the one with more nodes, though voting disks are still required.


By setting up fencing and quorum correctly, Oracle RAC prevents multiple nodes from conflicting over data when they lose communication, thereby avoiding split-brain and ensuring data consistency.


