
Monday 5 July 2021

Voting Disk in Oracle 11g R2 RAC – Airy’s Notes

Voting Disk in Oracle 11g R2 RAC:

  1. The voting disk is a shared area that Oracle Clusterware uses to verify cluster node membership and status. It maintains node membership information by periodically collecting the heartbeats of all nodes in the cluster.
  2. The voting disk must reside on ASM or on shared disk(s) accessible by all of the nodes in the cluster. Once ASM is used to store these files, they are referred to as voting files.
  3. The CSSD process is responsible for collecting the heartbeats and recording them in the voting disk.
  4. The CSSD process on each node registers information about its node in the voting disk using a pwrite() system call at a specific offset, and then uses a pread() system call to read the status written by the CSSD processes of the other nodes.
  5. Oracle Clusterware uses the voting disk to determine which instances are members of a cluster by way of a health check, and arbitrates cluster ownership among the instances in case of network failures.
  6. For high availability, Oracle recommends that you have multiple voting disks.
  7. In 10g, Oracle Clusterware supports up to 32 voting disks, whereas Oracle Clusterware 11g R2 supports a maximum of 15. Oracle recommends a minimum of 3 and a maximum of 5. If you define a single voting disk, then you should use external mirroring to provide redundancy.
  8. Oracle Clusterware can be configured to maintain multiple voting disks (multiplexing), but you must have an odd number of voting disks, such as three, five, and so on.
  9. A node must be able to access more than half of the voting disks at any time. For example, if you have 5 voting disks configured, then a node must be able to access at least 3 of them at any time. If a node cannot access the minimum required number of voting disks, it is evicted (removed) from the cluster. After the cause of the failure has been corrected and access to the voting disks has been restored, you can instruct Oracle Clusterware to recover the failed node and restore it to the cluster.
  10. Because information about the nodes also exists in the OCR/OLR, and the system calls have nothing to do with previous calls, no useful data is kept in the voting disk except the heartbeats. So if you lose the voting disks, you can simply add them back without losing any data. Of course, losing voting disks can lead to node reboots.
  11. If you lose all voting disks, you must keep the CRS daemons down; only then can you add the voting disks back. (You can inspect the current voting disk configuration with the crsctl sketch shown after this list.)
  12. All nodes in the RAC cluster register their heartbeat information in the voting disks/files. The RAC heartbeat is the polling mechanism that is sent over the cluster interconnect to ensure all RAC nodes are available.
  13. The primary function of the voting disk is to manage node membership and prevent what is known as split-brain syndrome, in which two or more instances attempt to control the RAC database. This can occur when there is a break in communication between nodes through the interconnect.
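
To see what this looks like in practice, you can list the configured voting files from any node. A minimal sketch using the standard 11g R2 crsctl interface (the output layout described below is illustrative):

```bash
# List the voting disks/files currently configured for the cluster
# (run as the Grid Infrastructure owner or root on any cluster node).
crsctl query css votedisk
```

On 11g R2 the output shows one line per voting file with its state (e.g. ONLINE), the file universal ID, the path, and, when stored in ASM, the disk group name.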

Now, finally, to understand the whole concept of the voting disk, we need to know what type of data the voting disk contains, what voting is, how voting happens, what I/O fencing is, what the NETWORK and DISK HEARTBEATS are, what split-brain syndrome is, and the concept of the simple majority rule.


What Type of Data Does the Voting Disk Contain?

The voting disk consists of two types of data:

  1. Static data: information about the nodes in the cluster.
  2. Dynamic data: disk heartbeat logging.

The voting disk/files contain important details of cluster node membership, such as:

  1. Node membership information.
  2. Heartbeat information of all nodes in the cluster.
  3. How many nodes are in the cluster.
  4. Who is joining the cluster?
  5. Who is leaving the cluster?

What is Voting in a Cluster Environment?

  1. The CGS (Cluster Group Services) is responsible for checking whether members are valid.
  2. To determine periodically whether all members are alive, a voting mechanism is used to check the validity of each member.
  3. All members in the database group vote by providing details of what they presume the instance membership bitmap looks like, and the bitmap is stored in the GRD (Global Resource Directory).
  4. A predetermined master member tallies the votes, communicates to the respective processes that the voting is done, and then waits for registration by all the members who have received the reconfigured bitmap.

How Voting Happens in a Cluster Environment:

  1. The CKPT process updates the control file every 3 seconds in an operation known as the heartbeat.
  2. CKPT writes into a single block that is unique for each instance, thus intra-instance coordination is not required. This block is called the checkpoint progress record.
  3. All members attempt to obtain a lock on a control file record (the result record) for updating.
  4. The instance that obtains the lock tallies the votes from all members.
  5. The group membership must conform to the decided (voted) membership before the GCS (Global Cache Service)/GES (Global Enqueue Service) reconfiguration is allowed to proceed.
  6. The control file vote result record is stored in the same block as the heartbeat, i.e. in the control file checkpoint progress record (see the sketch below).
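
Since the CKPT heartbeat lives in the checkpoint progress record, you can watch it through the undocumented fixed table X$KCCCP. A minimal sketch, assuming SYSDBA access on one instance; X$ tables are internal and version-dependent, so treat the column name as an assumption to verify on your release:

```bash
# Sample the checkpoint heartbeat counter (CPHBT) twice, a few seconds apart;
# it should advance as CKPT beats roughly every 3 seconds.
sqlplus -s "/ as sysdba" <<'EOF'
select cphbt from x$kcccp;
EOF
```

If the counter advances between runs, the instance heartbeat is being written as described above.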

What is I/O Fencing in a Cluster Environment?

  1. There can be situations where leftover write operations from failed database instances (the cluster function failed on the nodes, but the nodes are still running at the OS level) reach the storage system after the recovery process starts.
  2. Since these write operations are no longer in the proper serial order, they can damage the consistency of the stored data.
  3. Therefore, when a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or disk groups. This methodology is called I/O fencing or failure fencing.
  4. The I/O fencing implementation is a function of the CM (cluster manager) and depends on the clusterware vendor.
  5. I/O fencing is designed to guarantee data integrity in the case of faulty cluster communications causing a split-brain condition.

Why the Voting Disk is Essential and Needed:

The voting disk files are used by Oracle Clusterware for the overall health check.

  1. The voting disk files are used by CSS to determine which nodes are currently members of the cluster.
  2. They are used in concert with other cluster components, such as CRS, to shut down, fence, or reboot single or multiple nodes whenever network communication is lost between any nodes within the cluster, in order to prevent the dreaded split-brain condition in which two or more instances attempt to control the RAC database. This protects the database information.
  3. The voting disk is used by the CSS daemon to arbitrate with peers that it cannot see over the private interconnect in the event of an outage, allowing it to salvage the largest fully connected subcluster for further operation.
  4. Each node checks the voting disk to determine whether there is a failure on any other node in the cluster. During this operation, the NM (Node Monitor) makes an entry in the voting disk to record its vote on availability. Similar operations are performed by the other instances in the cluster.
  5. Configuring three voting disks also provides a method to determine which node in the cluster should survive. For example, if eviction of one of the nodes is necessitated by an unresponsive action, then the node that has access to two of the voting disks will start evicting the other node. The NM alternates its action between the heartbeat and the voting disk to determine the availability of the other nodes in the cluster.

What are the NETWORK and DISK HEARTBEATS and how are they registered in the VOTING DISKS/FILES?

All nodes in the RAC cluster register their heartbeat information in the voting disks/files. The RAC heartbeat is the polling mechanism that is sent over the cluster interconnect to ensure all nodes are available. Voting disks/files are just like an attendance register where the nodes mark their attendance (heartbeats).

1: NETWORK HEARTBEAT:

The network heartbeat travels across the interconnect. The CSSD process on every node makes entries in the voting disk to ascertain the membership of the node; in addition, every second a sending thread of CSSD sends a network TCP heartbeat to itself and to all other nodes, while a receiving thread of CSSD receives the heartbeats. That means that while marking their own presence, all the nodes also register information about their communicability with the other nodes in the voting disk. This is called the NETWORK HEARTBEAT. If network packets are dropped or contain errors, the error-correction mechanism of TCP retransmits the packets; Oracle itself does not retransmit in this case. If a node does not receive a heartbeat from another node for 15 seconds (50% of misscount), a WARNING message about the missing heartbeat appears in the CSSD log. Further warnings are reported when the same node has been missing for 22 seconds (75% of misscount) and again at 90% of misscount; when the heartbeat has been missing for 100% of misscount (i.e. 30 seconds by default), the node is evicted.
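
These escalating warnings can be seen in the CSS daemon log. A minimal sketch for spotting them; the log path below is the usual 11g R2 layout and the Grid home value is an assumption, so adjust both for your installation:

```bash
# Assumption: adjust GRID_HOME to your actual Grid Infrastructure home.
GRID_HOME=/u01/app/11.2.0/grid

# Scan the CSS daemon log for missed network heartbeat warnings.
grep -i "heartbeat" "$GRID_HOME/log/$(hostname -s)/cssd/ocssd.log" | tail -n 20
```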



2: DISK HEARTBEAT:

The disk heartbeat is between the cluster nodes and the voting disk. The CSSD process on each RAC node maintains a heartbeat in a block of one OS block in size, at a specific offset in the voting disk, using read/write system calls (pread/pwrite). In addition to maintaining its own disk block, each CSSD process also monitors the disk blocks maintained by the CSSD processes running on the other cluster nodes. The written block has a header area with the node name and a counter that is incremented with every beat (pwrite). The disk heartbeat is maintained in the voting disk by the CSSD processes, and if a node has not written a disk heartbeat within the I/O timeout, the node is declared dead. Nodes that are in an unknown state (i.e. cannot definitively be said to be dead) and are not in the group of nodes designated to survive are evicted: the node's kill block is updated to indicate that it has been evicted, i.e. a message to this effect is written in the KILL BLOCK of the node. Each node reads its kill block once per second/beat; if the kill block has been overwritten, the node commits suicide.

During a reconfiguration (a node leaving or joining), CSSD monitors all nodes' heartbeat information and determines which nodes have a disk heartbeat, including those with no network heartbeat. If no disk heartbeat is detected, the node is considered dead.

Summarizing the heartbeats: the network heartbeat is pinged every second, and nodes must respond within the css misscount time; failure leads to node eviction. Similarly, for the disk heartbeat, each node pings (reads/writes) the voting disk every second and must receive a response within the (long/short) disk timeout time.
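
Both thresholds can be read directly from CSS with standard 11g R2 crsctl commands:

```bash
# Network heartbeat threshold (defaults to 30 seconds on most platforms):
crsctl get css misscount

# Disk heartbeat I/O timeout (defaults to 200 seconds):
crsctl get css disktimeout
```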

What are the different possibilities of individual heartbeat failures?

As we know, the voting disk is the key communication mechanism within Oracle Clusterware, where all nodes in the cluster read and write heartbeat information. A break in the heartbeat indicates a possible error scenario. A few different scenarios are possible with missing heartbeats:

  1. The network heartbeat is successful, but the disk heartbeat is missed.
  2. The disk heartbeat is successful, but the network heartbeat is missed.
  3. Both heartbeats fail.

In addition, with numerous nodes, other scenarios are possible too. A few of them:

  1. Nodes have split into N sets of nodes, communicating within each set, but not with members of the other sets.
  2. Just one node is unhealthy.

Nodes with a quorum will maintain active membership of the cluster, and the other node(s) will be fenced/rebooted.

Misscount Parameter: The CSS misscount parameter represents the maximum time, in seconds, that a network heartbeat can be missed before a cluster reconfiguration is initiated to evict the node.

For the NETWORK HEARTBEAT: the CSS misscount parameter determines the network heartbeat timeout and defaults to 30 seconds (the disk timeout is 200 seconds). If the network heartbeat is missed for 30 seconds, a reboot is initiated (in practice after approximately 34 seconds), regardless of what happens with the disk heartbeat.

For the DISK HEARTBEAT (voting disk): if the heartbeat does not complete within 200 seconds, the node will be rebooted. If the disk heartbeat completes in under 200 seconds, the reboot will not happen as long as the network heartbeat is successful. This behaves a little differently at cluster reconfiguration time.

By default, misscount is less than the disktimeout value.

Also, if there is a vendor clusterware in play, then misscount is set to 600.
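
The parameter can also be changed through the same interface, although the defaults should normally be left alone and any change made only under Oracle Support guidance. A sketch of the interface only; 45 is an arbitrary illustrative value:

```bash
# Run as root with the clusterware stack up; the value is in seconds.
crsctl set css misscount 45
crsctl get css misscount   # verify the new value
```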

The following are the default values, in seconds, for the misscount parameter in the respective Oracle Clusterware versions:

| Operating System | Oracle RAC 10g R1 and R2 | Oracle 11g R1 and R2 |
| --- | --- | --- |
| Windows | 30 | 30 |
| Linux | 60 | 30 |
| Unix | 30 | 30 |
| VMS | 30 | 30 |

The table below also shows the different possibilities of individual heartbeat failures on the basis of misscount.

| Network Ping | Disk Ping | Reboot |
| --- | --- | --- |
| Completes within misscount seconds | Completes within misscount seconds | N |
| Completes within misscount seconds | Takes more than misscount seconds but less than disktimeout seconds | N |
| Completes within misscount seconds | Takes more than disktimeout seconds | Y |
| Takes more than misscount seconds | Completes within misscount seconds | Y |

What is the Split Brain Condition or Syndrome in a Cluster Environment?

  1. A split-brain occurs when cluster nodes hang or node interconnects fail and, as a result, the nodes lose the communication link between them and the cluster.
  2. Split-brain is a problem in any clustered environment; it is a symptom of clustering solutions in general, not of RAC specifically.
  3. Split-brain conditions can cause database corruption when nodes become uncoordinated in their access to the shared data files.
  4. For a two-node cluster, split-brain occurs when the nodes in the cluster cannot talk to each other (the internode links fail) and each node assumes it is the only surviving member of the cluster. If the nodes in the cluster have uncoordinated access to the shared storage area, they end up overwriting each other's data, causing data corruption, because each node assumes ownership of the shared data.
  5. To prevent data corruption, one node must be asked to leave the cluster or must be forced out immediately. This is where IMR (Instance Membership Recovery) comes in.
  6. Many internal (hidden) parameters control IMR and determine when it should start.
  7. If a vendor clusterware is used, split-brain resolution is left to it, and Oracle has to wait for the clusterware to provide a consistent view of the cluster and resolve the split-brain issue. This can potentially cause a delay (and a hang in the whole cluster), because each node can potentially think it is the master and try to own all the shared resources. Still, Oracle relies on the clusterware to resolve these challenging issues.
  8. Note that Oracle does not wait indefinitely for the clusterware to resolve a split-brain issue; a timer is used to trigger an IMR-based node eviction. These internal timers are also controlled using hidden parameters. The default values of these hidden parameters should not be touched, as that can cause severe performance or operational issues with the cluster.
  9. As mentioned time and again, Oracle completely relies on the cluster software to provide cluster services, and if something is awry, Oracle, in its overzealous quest to protect data integrity, evicts nodes or aborts an instance, assuming that something is wrong with the cluster.

Split Brain Syndrome in Oracle RAC:

In an Oracle RAC environment, all the instances/servers communicate with each other using high-speed interconnects on the private network. This private network interface or interconnect is redundant and is used only for inter-instance Oracle data block transfers. In the context of Oracle RAC, split-brain occurs when the instance members in a RAC cluster fail to ping/connect to each other via this private interconnect while the servers are all physically up and running, and the database instance on each of these servers is also running. These individual nodes run fine and can conceptually accept user connections and work independently. Essentially, due to the lack of communication, each instance thinks that the other instance it cannot connect to is down, and that it needs to do something about the situation. The problem is that if we leave these instances running, the same block might be read and updated in these individual instances, leading to data integrity issues, because blocks changed in one instance will not be locked and could be overwritten by another instance. This situation is termed Split Brain Syndrome.



Now consider a 3-node cluster: in the case of a network error, a split-brain problem would occur without a voting disk. Suppose node1 has lost its network connection to the interconnect. Node1 cannot use the interconnect anymore, though it can still access the voting disk. Nodes 2 and 3 still see each other's heartbeats, but no longer node1's. The node with the network problem gets evicted by placing a poison pill into the voting file for node1. The CSSD of node1 will then commit suicide and leave the cluster.

Simple Majority Win Rule:

According to Oracle, "an absolute majority of voting disks configured (more than half) must be available and responsive at all times for Oracle Clusterware to operate." This means that to survive the loss of N voting disks, you must configure at least 2N+1 voting disks.

That means a node must be able to access more than half of the voting disks at any time. 

Example 1: Suppose we have a 2-node cluster with an even number of voting disks, say 2. Suppose Node1 can access only voting disk 1 and Node2 can access only voting disk 2. Then there is no common file where the clusterware can check the heartbeat of both nodes. Hence, if we have 2 voting disks, all the nodes in the cluster must be able to access both of them.

Example 2: If we have 3 voting disks and both nodes are able to access more than half, i.e. 2 voting disks, there will be at least one disk accessible by both nodes. The clusterware can use that disk to check the heartbeat of both nodes. Hence, each node should be able to access more than half the number of voting disks. A node not able to do so will have to be evicted from the cluster to maintain the integrity of the cluster. After the cause of the failure has been corrected and access to the voting disks has been restored, you can instruct Oracle Clusterware to recover the failed node and restore it to the cluster.

Loss of more than half of your voting disks will cause the entire cluster to fail.

Example 3: Suppose that in a 3-node cluster with 3 voting disks, the network heartbeat fails between Node 1 and Node 3 and between Node 2 and Node 3, whereas Node 1 and Node 2 can still communicate via the interconnect. From the voting disk, CSSD notices that all the nodes are still able to write to the voting disks (a split brain), so the healthy nodes, Node 1 and Node 2, update the kill block in the voting disk for Node 3.




Then, during a pread() system call, the CSSD of Node 3 sees its self-kill flag set and evicts itself. I/O fencing follows, and finally OHASD attempts to restart the stack after a graceful shutdown.

Example 4: Suppose that in a 2-node cluster with 3 voting disks, the disk heartbeat fails such that Node 1 can see 2 voting disks and Node 2 can see only 1. (If the number of voting disks were not odd, both nodes might each conclude that the other should be killed, making it difficult to avoid split-brain.) Based on the simple majority rule, the CSSD process of Node 1 (2 voting disks) sends a kill request to the CSSD process of Node 2 (1 voting disk); Node 2 evicts itself, I/O fencing follows, and finally OHASD attempts to restart the stack after a graceful shutdown.

That is why voting disks are configured in odd numbers.

A node in the cluster must be able to access more than half of the voting disks at any time in order for the cluster to tolerate the failure of n voting disks. Therefore, it is strongly recommended that you configure an odd number of voting disks, such as 3, 5, and so on.

Here is a table showing, for different numbers of voting disks, how many must be accessible and how many failures can be tolerated:

| Total Voting Disks | No. of Voting Disks That Must Be Accessible | No. Whose Failure Can Be Tolerated |
| --- | --- | --- |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |
| 6 | 4 | 2 |

It can be seen that the number of voting disks whose failure can be tolerated is the same for 2n-1 as for 2n voting disks, where n can be 1, 2, or 3. Hence, to save a redundant voting disk, 2n-1, i.e. an odd number of voting disks, is desirable.
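
The table follows directly from the strict-majority requirement, as a quick sketch can confirm:

```bash
# With n voting disks a node needs a strict majority, floor(n/2) + 1,
# so it can tolerate the loss of n - (floor(n/2) + 1) disks.
for n in 1 2 3 4 5 6; do
  req=$(( n / 2 + 1 ))
  echo "$n voting disks: $req must be accessible, tolerate $(( n - req )) failure(s)"
done
```

Running it reproduces the accessible/tolerated columns above and makes it obvious that an even count buys no extra fault tolerance over the odd count below it.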

Thus the voting disk/file plays a role in both heartbeat failures, and hence it is a very important file for node eviction and I/O fencing in case of a split-brain situation.

Storage Mechanism of Voting Disk/Files:

Voting disks must be stored on shared, accessible storage, because during cluster operation the voting disk must be accessible by all member nodes of the clusterware.

  1. Prior to 11g R2 RAC, the voting disk could be placed on a raw device or on a clustered filesystem supported by Oracle RAC, such as OCFS, Sun Cluster, or Veritas Cluster filesystem.
  2. You should plan on allocating 280 MB for each voting disk file.

Storage Mechanism of Voting Disk/Files in Oracle 11g R2 RAC:

  1. As of Oracle 11g R2 RAC, the voting disk can be placed on ASM disks.
  2. This simplifies management and improves performance, but it also brought up a puzzle.
  3. For a node to join the cluster, it must be able to access the voting disk; but the voting disk is on ASM, and ASM can't be up until the node is up.
  4. To resolve this issue, Oracle ASM reserves several blocks at a fixed location on every Oracle ASM disk used for storing the voting disk.
  5. As a result, Oracle Clusterware can access the voting disks present in ASM even if the ASM instance is down, and CSS can continue to maintain the Oracle cluster even if the ASM instance has failed.
  6. The physical location of the voting files on the ASM disks used is fixed, i.e. the cluster stack does not rely on a running ASM instance to access the files. The location of the files is visible in the ASM disk header.
  7. The voting disk is not striped but placed as a whole on an ASM disk.
  8. In the event that the disk containing the voting disk fails, Oracle ASM chooses another disk on which to store this data.
  9. It eliminates the need for a third-party cluster volume manager.
  10. It reduces the complexity of managing disk partitions for voting disks during Oracle Clusterware installations.
  11. The voting disk needs to be mirrored; if it becomes unavailable, the cluster will come down. Hence, you should maintain multiple copies of the voting disks on separate disk LUNs so that you eliminate a single point of failure (SPOF) in your Oracle 11g RAC configuration.
  12. If the voting disk is stored on ASM, the multiplexing level of the voting disk is decided by the redundancy of the ASM diskgroup, as shown in the table below:
| Redundancy of the Diskgroup | No. of Copies of Voting Disk | Minimum # of Disks in the Diskgroup |
| --- | --- | --- |
| External | 1 | 1 |
| Normal | 3 | 3 |
| High | 5 | 5 |

i. If the voting disk is on a diskgroup with external redundancy, one copy of the voting file will be stored on one disk in the diskgroup.

ii. If we store the voting disk on a diskgroup with normal redundancy, one copy of the voting file will be stored on each of 3 disks in the diskgroup. We should be able to tolerate the loss of one disk, i.e. even if we lose one disk, we should still have a sufficient number of voting disks for the clusterware to continue.

iii. If a diskgroup with normal redundancy has 2 disks (the minimum required for normal redundancy), we can store 2 copies of the voting disk on it. If we lose one disk, only one copy of the voting disk is left and the clusterware won't be able to continue: to continue, the clusterware must be able to access more than half the number of voting disks, i.e. more than 2*1/2 = 1, so at least 2 of them. Hence, to be able to tolerate the loss of one disk, we should have 3 copies of the voting disk on a diskgroup with normal redundancy. So a normal redundancy diskgroup holding the voting disk should have a minimum of 3 disks in it.

iv. Similarly, if we store the voting disk on a diskgroup with high redundancy, 5 voting files are placed, each on one ASM disk, i.e. a high redundancy diskgroup should have at least 5 disks, so that even if we lose 2 disks, the clusterware can continue (the query sketch below shows which disks hold voting files).
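
You can confirm which ASM disks actually hold a voting file from the ASM instance. A minimal sketch, assuming SYSASM access on an 11g R2 ASM instance (V$ASM_DISK carries a VOTING_FILE Y/N column in 11.2):

```bash
# Flag the ASM disks that hold a voting file.
sqlplus -s "/ as sysasm" <<'EOF'
select group_number, name, voting_file
from   v$asm_disk
order  by group_number, name;
EOF
```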

13. Ensure that all the nodes participating in the cluster have read/write permissions on disks.

14. You can have up to a maximum of 15 voting disks. However, Oracle recommends a minimum of 3 voting disks and advises not going beyond 5 (see the sketch below for placing the voting files on ASM).
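
On 11g R2 the voting files are placed into (or moved between) ASM disk groups with a single crsctl command. A minimal sketch; +DATA is an example disk group name, so substitute your own:

```bash
# Move/replace all voting files into the +DATA disk group (run as root).
crsctl replace votedisk +DATA

# Verify the new configuration.
crsctl query css votedisk
```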

Backing up voting disk:

  1. In previous versions of Oracle Clusterware, you needed to back up the voting disks with the dd command.
  2. Starting with Oracle Clusterware 11g R2, backing up the voting disk using the dd command is not supported.
  3. Automatic backups of the voting disk and OCR happen every four hours, at the end of the day, and at the end of the week. This means there is no need to take backups of the voting disks manually.
  4. The voting disk and OCR are backed up automatically and kept together in a single file (see the ocrconfig sketch below).
  5. In fact, Oracle explicitly indicates that you should not use a backup tool like dd to back up or restore voting disks. Doing so can lead to the loss of the voting disk.
  6. Although the voting disk contents are not changed frequently, you should take a backup of the voting disk file every time you add or remove a node from the cluster, and immediately after you configure or upgrade a cluster.
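
Because the voting data rides along with the automatic OCR backups, you can list those backups with the standard ocrconfig command:

```bash
# Show the automatic (and manual) OCR backups, which also cover the voting data
# (run from the Grid Infrastructure home, typically as root).
ocrconfig -showbackup
```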

    Thank you for reading… This is Airy…Enjoy Learning:)

