What is a “Split Brain”?
“Split brain” describes the scenario in which two or more nodes in a cluster lose connectivity with one another but continue to operate independently, including acquiring logical or physical resources, under the incorrect assumption that the other process(es) are no longer operational or no longer using those resources. In simple terms, “split brain” means that there are two or more distinct sets of nodes, or “cohorts”, with no communication between the cohorts.
For example:
Suppose there are 3 nodes in the following situation.
1. Nodes 1 and 2 can talk to each other.
2. But 1 and 2 cannot talk to 3, and vice versa.
Then there are two cohorts: {1, 2} and {3}.
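The cohorts in the three-node example can be worked out mechanically by treating the nodes as a graph and the working interconnect links as edges. The following is only an illustrative sketch in Python, not how the clusterware computes membership; the function name find_cohorts and the link representation are assumptions made for this example.

```python
from collections import defaultdict

def find_cohorts(nodes, links):
    """Group nodes into cohorts: sets of nodes that can still reach each other."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    cohorts = []
    for node in sorted(nodes):
        if node in seen:
            continue
        # Depth-first walk over the working links, starting from this node.
        cohort, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in cohort:
                continue
            cohort.add(current)
            stack.extend(neighbours[current] - cohort)
        seen |= cohort
        cohorts.append(sorted(cohort))
    return cohorts

# Nodes 1 and 2 can talk to each other; node 3 is cut off from both.
print(find_cohorts({1, 2, 3}, [(1, 2)]))  # prints [[1, 2], [3]]
```

Each list in the result is one cohort; a healthy cluster yields exactly one.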
Why is this a problem?
The biggest risk following a split-brain event is the potential for corrupting system state. There are three typical causes of corruption:
1. The processes that were co-operating before the split-brain event independently modify the same logically shared state, leading to conflicting views of system state. This is often called the “multi-master problem”.
2. New requests are accepted after the split-brain event and then performed on potentially corrupted system state (thus potentially corrupting system state even further).
3. When the processes of the distributed system “rejoin”, they may have conflicting views of system state or resource ownership. While these conflicts are being resolved, information may be lost or become corrupted.
In simpler terms, in a split-brain situation there are, in a sense, two (or more) separate clusters working on the same shared storage. This has the potential for data corruption.
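The multi-master problem can be shown with a toy model. The sketch below is plain Python and has nothing to do with how RAC stores data; the dictionary keys and the helper names apply_updates and conflicting_keys are made up for illustration. Two cohorts start from the same shared state, modify it independently, and end up with views that disagree when they rejoin.

```python
def apply_updates(state, updates):
    """Return a copy of the state with one cohort's independent changes applied."""
    new_state = dict(state)
    new_state.update(updates)
    return new_state

def conflicting_keys(view_a, view_b):
    """List the keys on which the two cohorts' views disagree."""
    return sorted(k for k in view_a.keys() & view_b.keys() if view_a[k] != view_b[k])

# State that both cohorts agreed on before the split.
shared = {"balance": 100, "owner": "node1"}

# After the split, each cohort keeps working as if it were the only one.
cohort_a = apply_updates(shared, {"balance": 80})
cohort_b = apply_updates(shared, {"balance": 150, "owner": "node3"})

print(conflicting_keys(cohort_a, cohort_b))  # prints ['balance', 'owner']
```

Neither view is obviously the correct one, which is why resolving the conflict after a rejoin can lose information.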
How does clusterware resolve a “split brain” situation?
In a split-brain situation, the voting disk is used to determine which node(s) survive and which node(s) are evicted. The common voting results are:
a. The group (cohort) with more cluster nodes survives.
b. If each group has the same number of nodes, the group (cohort) containing the lowest-numbered node survives.
c. Some improvements have been made to ensure that the node(s) with lower load survive when the eviction is caused by high system load.
Commonly, one will see messages similar to the following in ocssd.log when a split brain happens:
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2015-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[ CSSD]2015-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[ CSSD]2015-01-12 23:23:08.090 [1262557536] >ERROR:
###################################
[ CSSD]2015-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
###################################
The above messages indicate that communication from node 2 to node 1 is not working, so node 2 sees only 1 node, while node 1 is working fine and can see two nodes in the cluster. To avoid split brain, node 2 aborted itself.
To ensure data consistency, each instance of a RAC database needs to maintain a heartbeat with the other instances. The heartbeat is maintained by background processes such as LMON, LMD, LMS and LCK. If any of these processes experiences an IPC send timeout, it triggers a communication reconfiguration and an instance eviction to avoid split brain. The controlfile is used at the instance layer much as the voting disk is used at the clusterware layer, to determine which instance(s) survive and which instance(s) are evicted. The voting result is similar to the clusterware voting result. As a result, one or more instances will be evicted.
Common messages in the instance alert log are similar to:
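When scanning ocssd.log for such events, the cohort comparison line can be pulled apart with a regular expression. This is a small Python sketch written against the sample line shown above only; the exact message format may differ between versions, and the function name parse_cohort_line is an assumption for this example.

```python
import re

# Matches the "my node(..), Leader(..), Size(..) VS Node(..), Leader(..), Size(..)" message.
PATTERN = re.compile(
    r"my node\((\d+)\), Leader\((\d+)\), Size\((\d+)\)"
    r" VS Node\((\d+)\), Leader\((\d+)\), Size\((\d+)\)"
)

def parse_cohort_line(line):
    """Return (local, remote) cohort details, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    values = [int(v) for v in match.groups()]
    fields = ("node", "leader", "size")
    return dict(zip(fields, values[:3])), dict(zip(fields, values[3:]))

sample = ("[ CSSD]2015-01-12 23:23:08.090 [1262557536] >ERROR: : "
          "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)")
local, remote = parse_cohort_line(sample)
print(local["size"], remote["size"])  # prints 1 2
```

Here the local cohort (node 2) has size 1 and the other cohort has size 2, which matches the explanation above of why node 2 aborted.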
alert log of instance 1:
---------
Mon Dec 07 19:43:05 2011
IPC Send timeout detected.Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave: 2 ...
alert log of instance 2:
---------
Mon Dec 07 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 07 19:42:18 2011
Errors in file /u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 07 19:42:20 2011
Waiting for clusterware split-brain resolution
Mon Dec 07 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 07 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 07 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 07 19:52:27 2011
Errors in file /u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc (incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in: /u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc
In the above example, instance 2's LMD0 (pid 29940) is the receiver in the IPC Send timeout.
Split-brain syndrome occurs when the instances in a RAC cluster fail to connect to or ping each other via the private interconnect. In a two-node situation, each instance will then think that the other instance is down because of the lack of connection. The problem that can arise is that the same block may be read and updated in both instances, which causes data integrity issues, because a block changed in one instance will not be locked and could be overwritten by the other instance.
So, when a node fails, the failed node is prevented from accessing all the shared disk devices and groups. This methodology is called I/O Fencing, Disk Fencing or Failure Fencing.
The split-brain concept can become more complicated in large RAC setups. For example, suppose there are 10 RAC nodes in a cluster and 4 of them are not able to communicate with the other 6. Two groups are then formed in this 10-node RAC cluster (one group of 4 nodes and another of 6 nodes). The nodes will quickly try to affirm their membership by locking the controlfile, and the node that locks the controlfile will then check the votes of the other nodes. The group with the largest number of active nodes gets preference and the others are evicted.
You will see the Oracle error ORA-29740 when there is a node eviction in RAC. There are many possible reasons for a node eviction, such as a missed heartbeat to the controlfile or a node being unable to communicate with the clusterware.
The CSS (Cluster Synchronization Services) daemon in the clusterware maintains the heartbeat to the voting disk.
What is I/O fencing?
In Veritas cluster environments, I/O fencing is provided by the kernel-based fencing module (vxfen), which behaves identically on node failures and communication failures. The surviving node tries to eject the keys of the departed nodes from the coordinator disks using the pre-empt and abort command. When the node successfully ejects the departed nodes from the coordinator disks, it also ejects them from the data disks. In a split-brain scenario, both sides of the split race for control of the coordinator disks. The side that wins the majority of the coordinator disks wins the race and fences the loser. The loser then panics and restarts the system.
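The outcome of the coordinator-disk race comes down to a majority count. The sketch below is a simplified Python model of that rule only: it does not call vxfen or issue any SCSI commands, and the function name race_winner and the side labels "A" and "B" are hypothetical.

```python
from collections import Counter

def race_winner(disk_owners):
    """Given which side won each coordinator disk, return the side holding
    a strict majority of the disks, or None if no side has one."""
    if not disk_owners:
        return None
    side, won = Counter(disk_owners).most_common(1)[0]
    # A strict majority means more than half of all coordinator disks.
    return side if won > len(disk_owners) // 2 else None

# Three coordinator disks: side A ejected side B's keys from two of them.
print(race_winner(["A", "B", "A"]))  # prints A
```

An odd number of coordinator disks guarantees that a two-way race cannot end in a tie, which is one reason to configure three rather than two.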
Voting disk will be used to determine which node(s) survive and which node(s) will be evicted.
Voting Disk – Heart of RAC
It is a file that resides on shared storage and manages cluster membership. The voting disk reassigns cluster ownership between the nodes in case of failure.
The voting disk is a file that manages information about node membership. It is used by the Oracle Cluster Synchronization Services daemon (ocssd) on each node to mark its own attendance and to record the nodes it can communicate with.
The following logic decides which nodes survive and which nodes are evicted from the cluster.
- If the sub-clusters are of different sizes, the clusterware identifies the largest sub-cluster and aborts all the nodes that do not belong to that sub-cluster.
- If all the sub-clusters are of the same size, the sub-cluster containing the lowest-numbered node survives, so that, in a 2-node cluster, the node with the lowest node number will survive.
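The two rules above can be written as a single ordering: prefer the larger sub-cluster, and break ties by the lowest node number. This is a minimal Python sketch of that decision rule as described here, not the clusterware's actual implementation; the function name surviving_cohort is an assumption, and load-based improvements are ignored.

```python
def surviving_cohort(cohorts):
    """Pick the sub-cluster that survives a split.

    Larger sub-clusters win; if sizes are equal, the sub-cluster that
    contains the lowest node number wins.
    """
    return min(cohorts, key=lambda cohort: (-len(cohort), min(cohort)))

# Different sizes: the larger sub-cluster survives.
print(surviving_cohort([[3], [1, 2]]))  # prints [1, 2]

# Two-node cluster split into equal halves: the lowest node number survives.
print(surviving_cohort([[2], [1]]))     # prints [1]
```

Every node outside the returned sub-cluster would be evicted.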
Facts about Voting disk
- Each voting disk must be accessible by all nodes in the cluster.
- If a node does not send its heartbeat to the other nodes or to the voting disk, that node will be evicted from the cluster.
- A minimum of 1 and a maximum of 15 copies of the voting disk are possible.
- Voting disk consists of two types of data:
Static data: Information about the nodes in cluster.
Dynamic data: Disk heartbeat logging.
- To find the location of the voting disk, run:
crsctl query css votedisk
Oracle Cluster Registry
It resides on shared storage and maintains information about the cluster configuration and the cluster database. The OCR contains information such as which database instances run on which nodes and which services run on which database.
Oracle automatically takes a backup every 4 hours on the master node. You can also take a backup manually using the ocrconfig utility with the -export option.
Facts about OCR
- It is created at the time of Grid Infrastructure installation.
- It stores the information needed to manage Oracle Clusterware and its components, such as RAC databases, listeners, VIPs, SCAN IPs and services.
- A minimum of 1 and a maximum of 5 copies of the OCR are possible.
- The node that stores the OCR backups is the master node. The first node in the cluster to come up becomes the master node. The master node is the node that the other nodes contact to get information about node status.