1. Heartbeats in Oracle RAC
CSSD processes (Cluster Services Synchronization Daemon) monitors the health of RAC nodes by two distinct heart beats, Network heart beat and Disk heart beat.
a) Network heartbeat:-
Network heartbeat is across the interconnect, every one second, a thread (sending) of CSSD sends a network tcp heartbeat to itself and all other nodes, another thread (receiving) of CSSD receives the heartbeat.
If the network packets are dropped or has error, the error correction mechanism on tcp would re-transmit the package, Oracle does not re-transmit in this case.
In the CSSD log, you will see a WARNING message about missing of heartbeat if a node does not receive a heartbeat from another node for 15 seconds (50% of misscount).
Another warning is reported in CSSD log if the same node is missing for 22 seconds (75% of misscount) and similarly at 90% of misscount and when the heartbeat is missing for a period of 100% of the misscount (i.e. 30 seconds by default), the node is evicted.
b) Disk heartbeat:-
Disk heartbeat is between the cluster nodes and the voting disk.
CSSD process in each RAC node maintains a heart beat in a block of size 1 OS block in a specific offset by read/write system calls (pread/pwrite), in the voting disk.
In addition to maintaining its own disk block, CSSD processes also monitors the disk blocks maintained by the CSSD processes running in other cluster nodes.
The written block has a header area with the node name and a counter which is incremented with every next beat (pwrite) from the other nodes.
Disk heart beat is maintained in the voting disk by the CSSD processes and If a node has not written a disk heartbeat within the I/O timeout, the node is declared dead. Nodes that are of an unknown state, i.e. cannot be definitively said to be dead, and are not in the group of nodes designated to survive, are evicted, i.e. the node’s kill block is updated to indicate that it has been evicted.
Thus summarizing the heartbeats, N/W Heartbeat is pinged every second, nodes must respond in css_miscount time, failure would lead to node eviction. Similarly Disk Heartbeat, node pings (r/w) voting disk every second, nodes must recieve a response in (long/short) disk timeout time.
Below are the different possibilities of individual heartbeat failures.
Network Ping | Disk Ping | Reboot |
Completes within misscount seconds | Completes within Misscount seconds | N |
Completes within Misscount seconds | Takes more than misscount seconds but less than Disktimeout seconds | N |
Completes within Misscount seconds | Takes more than Disktimeout seconds | Y |
Takes more than Misscount Seconds | Completes within Misscount seconds | Y |
Voting Disk contains information regarding nodes in the cluster and the disk heartbeat information, CSSD of the individual nodes registers the information regarding their nodes in the voting disk and with that pwrite() system call at a specific offset and then a pread() system call to read the status of other CSSD processes.
But as information regarding the nodes is in OCR/OLR too and system calls have nothing to do with previous calls, there isn’t any useful data kept in the voting disk.
So, if you lose voting disks, you can simply add them back without losing any data.
But, of course, losing voting disks can lead to node reboots.
If you lose all voting disks, then you will have to keep the CRS daemons down, then only you can add the voting disks.
Now that we have understood both the heartbeats which was the most important part, we will dig deeper into Voting Disk, as in what is stored inside voting disk, why Clusterware needs of Voting Disk, how many voting disks are required etc.
Now finally to understand the whole concept of voting disk we need to know what is split brain syndrome, I/O Fencing and simple majority rule.
Now we are in a state to understand the use of voting disk in case of heartbeat failure.
Example 1:-
Suppose in a 3 node cluster with 3 voting disks, a network heartbeat fails between Node 1 and Node 3 & Node 2 and Node 3 whereas Node 1 and Node 2 are able to communicate via interconnect, and from the Voting Disk CSSD notices that all the nodes are able to write to Voting Disks thus spli-brain, so the healthy nodes Node 1 & Node 2 would would update the kill block in the voting disk for Node 3.
Then when during pread() system call of CSSD of Node 3, it sees a self kill flag set and thus the CSSD of Node 3 evicts itself. And then the I/O fencing and finally the OHASD will finally attempt to restart the stack after graceful shutdown.
Example 2:-
Suppose in a 2 node cluster with 3 voting disk, a disk heartbeat fails such that Node 1 can see 2 Voting Disks and Node 2 can see 1 Voting Disk, ( If here the Voting Disk wouldn’t have been odd then both the Nodes would have thought the other node should be killed hence would have been difficult to avoid split-brain), thus based on Simple Majority Rule, CSSD process of Node 1 (2 Voting Disks) sends a kill request to the CSSD process of Node 2 (1 Voting Disk) and thus the Node 2 evicts itself and then the I/O fencing and finally the OHASD will finally attempt to restart the stack after graceful shutdown.
Thus Voting disk plays a role in both the heartbeat failures, and hence a very important file for node eviction & I/O fencing in case of a split brain situation.
No comments:
Post a Comment