
Monday 13 September 2021

Restore OCR and Voting Disk in case of corruption

Automatic backups: Oracle Clusterware automatically creates OCR backups every four hours. At any one time, it retains the last three backup copies of OCR. The CRSD process that creates the backups also retains an OCR backup for each full day and for the end of each week. You cannot customize the backup frequency or the number of files retained.
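The retained backups can be listed with `ocrconfig -showbackup`; they live under the Grid home's `cdata/<cluster_name>` directory. A minimal sketch of finding them on disk (the Grid home path and cluster name below are assumptions, not values from this post):

```shell
# Standard way to list retained OCR backups (run as root):
# ocrconfig -showbackup

# Portable helper: list *.ocr backup files in a directory, newest first.
list_ocr_backups() {
  ls -1t "$1"/*.ocr 2>/dev/null
}

# Example with a hypothetical backup directory:
# list_ocr_backups /u01/app/11.2.0/grid/cdata/rac-scan
```

`ocrconfig -showbackup` is the authoritative source; the `ls` helper is only a convenience when browsing the backup directory by hand.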


Procedure to restore OCR on Linux or UNIX systems:

1.  List the nodes in your cluster by running the following command on one node:
$ olsnodes 

2.  Stop Oracle Clusterware by running the following command as root on all of the nodes:
# crsctl stop crs 

3.  If the preceding command returns any error due to OCR corruption, stop Oracle Clusterware by running the following command as root on all of the nodes:
# crsctl stop crs -f 

4.  If you are restoring OCR to a cluster file system or network file system, then run the following command as root to restore OCR with an OCR backup that you can identify in "Listing Backup Files". After you complete this step, skip to step 8.
# cd $ORACLE_HOME/cdata/rac-scan
# ocrconfig -restore file_name 

5.  Start the Oracle Clusterware stack on one node in exclusive mode by running the following command as root:
# crsctl start crs -excl 

Ignore any errors that display.

6.  Check whether crsd is running. If it is, stop it by running the following command as root:
# crsctl stop resource ora.crsd -init 

Caution:
Do not use the -init flag with any other command.

7.  Restore OCR with an OCR backup that you can identify in "Listing Backup Files" by running the following command as root:
# ocrconfig -restore file_name 

Notes:
If the original OCR location does not exist, then you must create an empty (0 byte) OCR location before you run the ocrconfig -restore command.

8.  Ensure that the OCR devices that you specify in the OCR configuration exist and that these OCR devices are valid.
If you configured OCR in an Oracle ASM disk group, then ensure that the Oracle ASM disk group exists and is mounted.

Verify the integrity of OCR:
# ocrcheck 
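If you want to script the integrity check, a simple pass/fail wrapper can grep the summary that `ocrcheck` prints. The exact wording of the success line ("integrity check succeeded") is an assumption based on typical 11.2 output; verify it against your version before relying on it:

```shell
# Reads ocrcheck output on stdin; succeeds if the integrity-check
# success line is present (wording assumed from 11.2-era output).
ocr_ok() { grep -q "integrity check succeeded"; }

# usage (as root):
# ocrcheck | ocr_ok && echo "OCR OK" || echo "OCR check FAILED"
```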

9.  Stop Oracle Clusterware on the node where it is running in exclusive mode:
# crsctl stop crs -f 

10.  Begin to start Oracle Clusterware by running the following command as root on all of the nodes:
# crsctl start crs 

11.  Verify the OCR integrity of all of the cluster nodes that are configured as part of your cluster by running the following CVU command:
$ cluvfy comp ocr -n all -verbose


Restore the voting disk in case all copies of the voting disk are corrupted.
The voting disk is automatically recovered using the latest available backup of OCR.

=================================================

Current scenario:
The only copy of the voting disk is present in the test diskgroup on disk ASMDISK010.
We will corrupt ASMDISK010 so that we lose the only copy of the voting disk.
We will then restore the voting disk to another diskgroup using the OCR.

Currently, we have 1 voting disk. 

Let us corrupt it and check whether clusterware keeps running.

FIND OUT LOCATION OF VOTEDISK

[grid@host01 cssd]$ crsctl query css votedisk
##  STATE    File Universal Id                 File Name           Disk group
--  -----    -----------------                 ---------           ----------
 1. ONLINE   00ce3c95c6534f44bfffa645a3430bc3  (ORCL:ASMDISK010)   [TEST]

FIND OUT THE NO. OF DISKS IN test DG (CONTAINING VOTEDISK)

ASMCMD> lsdsk -G test
Path
ORCL:ASMDISK010

Let us corrupt ASMDISK010
-- bs = block size = 4096 bytes
-- count = number of blocks overwritten = 1000000
-- total bytes corrupted = 4096 * 1000000 (~4096 MB = size of one partition)

#dd if=/dev/zero of=/dev/oracleasm/disks/ASMDISK010 bs=4096 count=1000000
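The arithmetic in the comments above is easy to double-check in the shell before running a destructive `dd` (note that 4096 * 1000000 bytes is about 4096 MB in decimal units, or roughly 3906 MiB in binary units):

```shell
# Bytes overwritten by: dd bs=4096 count=1000000
bs=4096
count=1000000
total=$((bs * count))
echo "$total bytes (~$((total / 1024 / 1024)) MiB) will be zeroed"
```

Sanity-checking the size against the partition before wiping it is cheap insurance; `dd` gives no second chances.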

Here, I was expecting clusterware to stop since the only voting disk was no longer available, but surprisingly clusterware kept running.
Finally, I stopped clusterware and tried to restart it. It was not able to restart.
Reboot all the nodes and note that clusterware does not start because the voting disk is not accessible.

#crsctl stat res -t

-- Now, since the voting disk cannot be restored back to the test diskgroup (its disk has been corrupted),
   we will create another diskgroup, votedg, where we will restore the voting disk.


RECOVER VOTING DISK
-- To move the voting disk to the votedg diskgroup, the ASM instance must be up, and for the ASM
   instance to be up, CRS must be up. Hence we will:
  1.      stop crs on all the nodes
  2.      start crs in exclusive mode on one of the nodes (host01)
  3.      start asm instance on host01 using pfile (since spfile of ASM instance is on ASM)
  4.      create a new diskgroup votedg
  5.      move voting disk to votedg  diskgroup
  6.      stop crs on host01(was running in exclusive mode)
  7.      restart crs on host01
  8.      start crs on rest of the nodes
  9.      start cluster on all the nodes
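The nine steps above can be sketched as a single runbook script. This is a dry-run sketch only: it prints each command instead of executing it, so the sequence can be reviewed before anything is run as root. The hostnames, the three-node layout, and the use of ssh are assumptions for illustration:

```shell
#!/bin/sh
# Dry-run sketch of the voting-disk recovery runbook.
# Prints commands rather than executing them; adapt names before use.
NODES="host01 host02 host03"   # assumption: three-node cluster
RECOVERY_NODE="host01"
VOTEDG="+votedg"               # target diskgroup for the voting disk

run() { echo "WOULD RUN: $*"; }

recover_votedisk() {
  for n in $NODES; do run "ssh root@$n crsctl stop crs -f"; done
  run "ssh root@$RECOVERY_NODE crsctl start crs -excl"
  run "sqlplus / as sysasm  -- startup pfile, create diskgroup votedg"
  run "ssh root@$RECOVERY_NODE crsctl replace votedisk $VOTEDG"
  run "ssh root@$RECOVERY_NODE crsctl stop crs -f"
  for n in $NODES; do run "ssh root@$n crsctl start crs"; done
  run "ssh root@$RECOVERY_NODE crsctl start cluster -all"
}

recover_votedisk
```

Keeping the recovery sequence in one place like this makes it much harder to run the steps out of order under pressure.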

-- IMPLEMENTATION --

👉stop crs on all the nodes(if it does not stop, kill ohasd process and retry)

root@hostn# crsctl stop crs -f


👉start crs in exclusive mode on one of the nodes (host01)

root@host01# crsctl start crs -excl

👉start asm instance on host01 using pfile

root@host01# ps -ef | grep +ASM1
If an ASM instance is running, kill it or shut it down:
grid@host01$ sqlplus / as sysasm
sql> shut abort

grid@host01$ vi  /u01/app/oracle/init+ASM1.ora
INSTANCE_TYPE=ASM
asm_diskstring=/dev/oracleasm/disks/*

grid@host01$ chown grid:oinstall /u01/app/oracle/init+ASM1.ora

grid@host01$ sqlplus / as sysasm
 SQL>startup pfile='/u01/app/oracle/init+ASM1.ora';

-- create a new diskgroup votedg
SQL> create diskgroup votedg normal redundancy
failgroup fg1 disk '/dev/oracleasm/disks/VOTE1'
failgroup fg2 disk '/dev/oracleasm/disks/VOTE2'
failgroup fg3 disk '/dev/oracleasm/disks/VOTE3'
attribute 'compatible.asm'='11.2';

Make sure the compatibility attribute is set to the version of the Grid software you are using. You can change it using the following command:
SQL> alter diskgroup votedg set attribute 'compatible.asm'='11.2';

Restore the voting disk to the votedg diskgroup.
  The voting disk is automatically recovered using the latest available backup of OCR.
root@host01#crsctl replace votedisk +votedg

root@host01# ocrcheck
root@host01# crsctl stat res -t

stop crs on host01 (it was running in exclusive mode)
root@host01#crsctl stop crs -f

restart crs on host01
root@host01#crsctl start crs

start crs on rest of the nodes (if it does not start, kill ohasd process and retry)
root@host02#crsctl start crs
root@host03#crsctl start crs

start cluster on all the nodes and check that it is running
root@host01#crsctl start cluster -all
            crsctl stat res -t

Recover the voting disk in case we lose 2 out of 3 copies of the voting disk.
In this case, the voting disk is recovered using the surviving copy.

Current scenario:
3 copies of the voting disk are present in the test diskgroup on disks ASMDISK010, ASMDISK011, and ASMDISK012.
We will corrupt two disks, ASMDISK010 and ASMDISK011, so that ASMDISK012 still has a copy of the voting disk. We will then restore the voting disk to another diskgroup using the only valid copy we have.

Currently, we have 3 voting disks. At least 2 must be accessible for the clusterware to work. Let us corrupt one of the voting disks and check whether clusterware keeps running.
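The "at least 2 out of 3" rule is the general quorum rule: with n voting disk copies, clusterware needs a strict majority, i.e. floor(n/2) + 1, to stay online. A quick check of that arithmetic (plain shell, not an Oracle tool):

```shell
# Minimum number of voting disks that must stay online out of n copies:
# a strict majority, floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

echo "1 disk  -> need $(quorum 1)"   # 1
echo "3 disks -> need $(quorum 3)"   # 2
echo "5 disks -> need $(quorum 5)"   # 3
```

This is also why voting disk counts are kept odd: adding a fourth copy raises the quorum to 3 without tolerating any more failures than three copies do.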

FIND OUT LOCATION OF VOTEDISK
[grid@host01 cssd]$ crsctl query css votedisk
##  STATE    File Universal Id                 File Name           Disk group
--  -----    -----------------                 ---------           ----------
 1. ONLINE   00ce3c95c6534f44bfffa645a3430bc3 (ORCL:ASMDISK012) [TEST]
 2. ONLINE   a3751063aec14f8ebfe8fb89fccf45ff (ORCL:ASMDISK010) [TEST]
 3. ONLINE   0fce89ac35834f99bff7b04ccaaa8006 (ORCL:ASMDISK011) [TEST]
Located 3 voting disk(s).

FIND OUT THE NO. OF DISKS IN test DG (CONTAINING VOTEDISK)
ASMCMD> lsdsk -G test
Path
ORCL:ASMDISK010
ORCL:ASMDISK011
ORCL:ASMDISK012

-- Let us corrupt ASMDISK010
-- bs = block size = 4096 bytes
-- count = number of blocks overwritten = 1000000
-- total bytes corrupted = 4096 * 1000000 (~4096 MB = size of one partition)

#dd if=/dev/zero of=/dev/oracleasm/disks/ASMDISK010 bs=4096 count=1000000

CHECK THAT C/W KEEPS RUNNING, AS 2 VOTING DISKS (MORE THAN HALF) ARE STILL AVAILABLE
#crsctl stat res -t

-- Now let us corrupt ASMDISK011
-- bs = block size = 4096 bytes
-- count = number of blocks overwritten = 1000000
-- total bytes corrupted = 4096 * 1000000 (~4096 MB = size of one partition)
#dd if=/dev/zero of=/dev/oracleasm/disks/ASMDISK011 bs=4096 count=1000000

Here, I was expecting clusterware to stop, as only 1 voting disk (less than half of the total of 3) was available, but surprisingly clusterware kept running. I even waited for quite some time, but to no avail. I would be glad if someone can give more input on this.
Finally, I stopped clusterware and tried to restart it. It was not able to restart.

CHECK THAT C/W IS NOT RUNNING
#crsctl stat res -t

Now that we still have one copy of the voting disk on one of the disks in the test diskgroup, we can use that copy to get the voting disk back. Since the voting disk cannot be restored to the test diskgroup (its disks have been corrupted), we will restore the voting disk to the data diskgroup.

RECOVER VOTING DISK --
-- To move the voting disk to the data diskgroup, the ASM instance must be up, and for the ASM instance to be up, CRS must be up. Hence we will:
  1. stop crs on all the nodes
  2. start crs in exclusive mode on one of the nodes (host01)
  3. start asm instance on host01 using pfile (since spfile of ASM instance is on ASM)
  4. move voting disk to data diskgroup.
  5. drop test diskgroup (it will allow as it does not have voting disk any more)
  6. stop crs on host01(was running in exclusive mode)
  7. restart crs on host01
  8. start crs on rest of the nodes
  9. start cluster on all the nodes

-- IMPLEMENTATION --

stop crs on all the nodes(if it does not stop, kill ohasd process and retry)
root@hostn# crsctl stop crs -f

start crs in exclusive mode on one of the nodes (host01)
root@host01# crsctl start crs -excl

start asm instance on host01 using pfile
root@host01# ps -ef | grep +ASM1
If an ASM instance is running, kill it or shut it down:
grid@host01$ sqlplus / as sysasm
sql> shut abort

grid@host01$ vi  /u01/app/oracle/init+ASM1.ora
INSTANCE_TYPE=ASM
asm_diskstring=/dev/oracleasm/disks/*


grid@host01$ chown grid:oinstall /u01/app/oracle/init+ASM1.ora

grid@host01$ sqlplus / as sysasm
 SQL>startup pfile='/u01/app/oracle/init+ASM1.ora';

-- the data diskgroup already exists, so there is no need to create it

Make sure its compatibility attribute is set to the version of the Grid software you are using. You can change it using the following command:
SQL> alter diskgroup data set attribute 'compatible.asm'='11.2';

Check that the data diskgroup is mounted on host01. If not, mount it.
ASMCMD> lsdg
ASMCMD> mount data

move the voting disk to the data diskgroup. The voting disk is automatically recovered using the surviving copy.
root@host01#crsctl replace votedisk +data

root@host01# ocrcheck
root@host01# crsctl stat res -t

drop the test diskgroup (this is allowed, as it no longer contains the voting disk)
SQL>drop diskgroup test force including contents;

stop crs on host01 (it was running in exclusive mode)
root@host01#crsctl stop crs -f

restart crs on host01
root@host01#crsctl start crs

start crs on rest of the nodes (if it does not start, kill ohasd process and retry)
root@host02#crsctl start crs
root@host03#crsctl start crs

start cluster on all the nodes and check that it is running
root@host01#crsctl start cluster -all                                       
root@host01#crsctl stat res -t



