5 Administering Oracle Clusterware Components

The main components to manage in your Oracle Clusterware environment are the voting disks and the Oracle Cluster Registry (OCR).

About Oracle Clusterware

Oracle Real Application Clusters (Oracle RAC) uses Oracle Clusterware as the infrastructure that binds multiple nodes that then operate as a single server. In an Oracle RAC environment, Oracle Clusterware monitors all Oracle components (such as instances and listeners). If a failure occurs, then Oracle Clusterware automatically attempts to restart the failed component and also redirects operations to a surviving component.

Oracle Clusterware includes a high availability framework for managing any application that runs on your cluster. Oracle Clusterware manages applications to ensure they start when the system starts. Oracle Clusterware also monitors the applications to ensure they are always available. For example, if an application process fails, then Oracle Clusterware attempts to restart the process based on scripts that you customize. If a node in the cluster fails, then you can program application processes that typically run on the failed node to restart on another node in the cluster.


About the Voting Disks

The voting disk records node membership information. A node must be able to access more than half the voting disks at any time. To avoid simultaneous loss of multiple voting disks, each voting disk should be on a storage device that does not share any components (controller, interconnect, and so on) with the storage devices used for the other voting disks.

For example, if you have five voting disks configured, then a node must be able to access at least three of the voting disks at any time. If a node cannot access the minimum required number of voting disks, then it is evicted, or removed, from the cluster. After the cause of the failure has been corrected and access to the voting disks has been restored, you can instruct Oracle Clusterware to recover the failed node and restore it to the cluster.
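The majority rule above reduces to simple integer arithmetic. The following shell sketch is illustrative only: min_voting_disks and node_survives are hypothetical helper names, not Oracle Clusterware commands.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Oracle Clusterware) that express the
# voting disk majority rule: a node must be able to access strictly more
# than half of the configured voting disks.

# Print the minimum number of voting disks a node must be able to access.
min_voting_disks() {
  local configured=$1
  echo $(( configured / 2 + 1 ))
}

# Succeed (exit status 0) if a node that can reach $2 of $1 configured
# voting disks keeps its cluster membership; fail otherwise.
node_survives() {
  local configured=$1 reachable=$2
  [ "$reachable" -ge "$(min_voting_disks "$configured")" ]
}
```

For example, min_voting_disks 5 prints 3, matching the five-disk example above. The same arithmetic shows that an even count adds no fault tolerance: with four voting disks a node still needs three, so the cluster tolerates the loss of only one disk, the same as with three.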

About Oracle Cluster Registry

Oracle Cluster Registry (OCR) is a file that contains information about the cluster node list and instance-to-node mapping information. OCR also contains information about Oracle Clusterware resource profiles for resources that you have customized. The voting disk data is also backed up in OCR.

Each node in a cluster also has a local copy of the OCR, called an Oracle Local Registry (OLR), that is created when Oracle Clusterware is installed. Multiple processes on each node have simultaneous read and write access to the OLR particular to the node on which they reside, whether or not Oracle Clusterware is fully functional. By default, OLR is located at Grid_home/cdata/$HOSTNAME.olr.

About High Availability of Oracle Clusterware Files

High availability configurations have redundant hardware and software that maintain operations by avoiding single points of failure. When a component is down, Oracle Clusterware redirects its managed resources to a redundant component. However, if a disaster strikes, or a massive hardware failure occurs, then having redundant components might not be enough. To fully protect your system it is important to have backups of your critical files.

The Oracle Clusterware installation process creates the voting disk and the OCR on shared storage. If you select the option for normal redundant copies during the installation process, then Oracle Clusterware automatically maintains redundant copies of these files to prevent the files from becoming single points of failure. The normal redundancy feature also eliminates the need for third-party storage redundancy solutions. When you use normal redundancy, Oracle Clusterware automatically maintains two copies of the OCR file and three copies of the voting disk file.

See Also:

Oracle Clusterware Administration and Deployment Guide for more information about managing voting disks

Managing the Oracle Clusterware Stack

By default, Oracle Clusterware is configured to restart whenever the server it resides on is restarted. During certain maintenance operations, you may be required to stop or start the Oracle Clusterware stack manually.


Note:

Do not use Oracle Clusterware Control (CRSCTL) commands on Oracle entities (such as resources, resource types, and server pools) that have names beginning with ora unless you are directed to do so by Oracle Support. The Server Control utility (SRVCTL) is the correct utility to use on Oracle entities.

Starting Oracle Clusterware

You use the CRSCTL utility to manage Oracle Clusterware. If the Oracle High Availability Services daemon (OHASD) is running on all the cluster nodes, then you can start the entire Oracle Clusterware stack (all the processes and resources managed by Oracle Clusterware) on all nodes in the cluster by executing the following command on any node:

crsctl start cluster -all

You can start the Oracle Clusterware stack on specific nodes by using the -n option followed by a space-delimited list of node names, for example:

crsctl start cluster -n racnode1 racnode4

To use the previous command, the OHASD process must be running on the specified nodes.

To start the entire Oracle Clusterware stack on a node, including the OHASD process, run the following command on that node:

crsctl start crs
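If you start the stack from a maintenance script, a thin wrapper can choose between the two crsctl start cluster forms shown above. This is a minimal sketch: start_cluster_stack is a hypothetical function name, and it assumes crsctl is on the PATH of the user running it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around "crsctl start cluster".
# With no arguments it starts the Oracle Clusterware stack on all nodes;
# otherwise it starts the stack only on the node names given.
# OHASD must already be running on every target node.
start_cluster_stack() {
  if [ "$#" -eq 0 ]; then
    crsctl start cluster -all
  else
    crsctl start cluster -n "$@"
  fi
}
```

For example, start_cluster_stack racnode1 racnode4 runs crsctl start cluster -n racnode1 racnode4. On a node where OHASD itself is not running, use crsctl start crs on that node instead.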

Stopping Oracle Clusterware

To stop Oracle Clusterware on all nodes in the cluster, execute the following command on any node:

crsctl stop cluster -all

The previous command stops the resources managed by Oracle Clusterware, the Oracle ASM instance, and all the Oracle Clusterware processes (except for OHASD and its dependent processes).

To stop Oracle Clusterware and Oracle ASM on select nodes, include the -n option followed by a space-delimited list of node names, for example:

crsctl stop cluster -n racnode1 racnode3

If you do not include either the -all or the -n option in the stop cluster command, then Oracle Clusterware and its managed resources are stopped only on the node where you execute the command.

To completely shut down the entire Oracle Clusterware stack, including the OHASD process, use the crsctl stop crs command. CRSCTL attempts to gracefully stop the resources managed by Oracle Clusterware during the shutdown of the Oracle Clusterware stack. If any resources that Oracle Clusterware manages are still running after executing the crsctl stop crs command, then the command fails. You must then use the -f option to unconditionally stop all resources and stop the Oracle Clusterware stack, for example:

crsctl stop crs -f
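The graceful-then-forced shutdown described above can be sketched as a small function. This is an illustration under stated assumptions, not an Oracle-supplied script: stop_crs_stack is a hypothetical name, and it must run as root on the node being shut down.

```shell
#!/usr/bin/env bash
# Hypothetical helper: shut down the full Oracle Clusterware stack
# (including OHASD) on the local node. It tries a graceful stop first and
# uses the forced stop only if the graceful stop fails. Run as root.
stop_crs_stack() {
  if crsctl stop crs; then
    return 0
  fi
  echo "crsctl stop crs failed; retrying with -f" >&2
  crsctl stop crs -f
}
```

Falling back to -f automatically is a design choice, not a requirement. In many environments you may prefer to stop after the first failure and find out which resources are still running before forcing the shutdown.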

Note:

When you shut down the Oracle Clusterware stack, you also shut down the Oracle Automatic Storage Management (Oracle ASM) instances. If the Oracle Clusterware files (voting disk and OCR) are stored in an Oracle ASM disk group, then the only way to shut down the Oracle ASM instances is to shut down the Oracle Clusterware stack.

Administering Voting Disks for Oracle Clusterware


Adding and Removing Voting Disks

If you choose to store Oracle Clusterware files on Oracle ASM and use redundancy for the disk group, then Oracle ASM automatically maintains the ideal number of voting files based on the redundancy of the disk group.

If you use a different form of shared storage to store the voting disks, then you can dynamically add and remove voting disks after installing Oracle RAC. Use the following commands, where path is the fully qualified path of the voting disk to add or remove.

To add or remove a voting disk that is stored on disk:

  1. Run the following command as the grid user to add a voting disk:

    crsctl add css votedisk path
    
  2. Run the following command as the grid user to remove a voting disk:

    crsctl delete css votedisk path
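When you move a voting disk to a different non-Oracle ASM location, one reasonable ordering is to add the new voting disk before deleting the old one, so that the cluster never has fewer voting disks than before during the move. The function below sketches that ordering; move_votedisk is a hypothetical name, and the device paths in the usage line are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper for voting disks on non-ASM shared storage.
# Moves a voting disk from OLD_PATH to NEW_PATH by adding the new disk
# first and deleting the old one only if the add succeeded.
# Run as the grid user.
move_votedisk() {
  local old_path=$1 new_path=$2
  crsctl add css votedisk "$new_path" || return 1
  crsctl delete css votedisk "$old_path" || return 1
  # List the voting disks now in use so the result can be checked.
  crsctl query css votedisk
}
```

For example: move_votedisk /dev/sdb1 /dev/sdc1 (placeholder device names).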
    

Backing Up and Recovering Voting Disks


Backing Up Voting Disks

The voting disk data is automatically backed up in OCR as part of any configuration change, so you do not have to perform manual backups of the voting disk. The voting disk files are backed up automatically by Oracle Clusterware if the contents of the files have changed in the following ways:

  • Configuration parameters, for example misscount, have been added or modified

  • After performing voting disk add or delete operations

Replacing Voting Disks

If a voting disk is damaged, and no longer usable by Oracle Clusterware, then you can replace or re-create the voting disk. You replace a voting disk by deleting the unusable voting disk and then adding a new voting disk to your configuration. The voting disk contents are restored from a backup when a new voting file is added; this occurs regardless of whether the voting disk file is stored in Oracle Automatic Storage Management (Oracle ASM).

To replace a corrupt, damaged, or missing voting disk that is not stored in Oracle ASM:

  1. Use CRSCTL to remove the damaged voting disk. For example, if the damaged voting disk is stored in the disk location /dev/sda3, then execute the command:

    crsctl delete css votedisk /dev/sda3
    
  2. Use CRSCTL to create a new voting disk in the same location, for example:

    crsctl add css votedisk /dev/sda3
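The two steps above can be combined into one function that also refuses Oracle ASM disk group names, because voting disks in Oracle ASM are changed only with crsctl replace votedisk. This is a minimal sketch; replace_damaged_votedisk is a hypothetical name, and it assumes that Oracle ASM disk group names are given with a leading "+", as elsewhere in this chapter.

```shell
#!/usr/bin/env bash
# Hypothetical helper: re-create a damaged voting disk in the same
# non-ASM location by deleting it and adding it back. Run as the grid user.
# ASM disk group names (starting with "+") are rejected.
replace_damaged_votedisk() {
  local path=$1
  case "$path" in
    +*)
      echo "use crsctl replace votedisk for voting disks in Oracle ASM" >&2
      return 2
      ;;
  esac
  crsctl delete css votedisk "$path" || return 1
  crsctl add css votedisk "$path" || return 1
  # Confirm that the re-created voting disk is listed.
  crsctl query css votedisk
}
```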
    

Restoring Voting Disks

If all voting disks are corrupted, then you can restore them as described in Oracle Clusterware Administration and Deployment Guide.

Note:

Restoring a voting disk from a copy created with the Linux or UNIX operating system dd command is not supported.

Migrating Voting Disks to Oracle ASM Storage

You can store the Oracle Clusterware voting disk files in an Oracle ASM disk group. If you choose to store your voting disks in Oracle ASM, then Oracle ASM stores all the voting disks for the cluster in the disk group you choose. You cannot combine voting disks stored in Oracle ASM and voting disks not stored in Oracle ASM in the same cluster.

The number of voting files you can store in a particular Oracle ASM disk group depends upon the redundancy of the disk group. By default, Oracle ASM puts each voting disk in its own failure group within the disk group. A normal redundancy disk group must contain at least two failure groups; however, if you store your voting disks on Oracle ASM, then a normal redundancy disk group must contain at least three failure groups. A high redundancy disk group must contain at least three failure groups.

Once you configure voting disks on Oracle ASM, you can change the voting disk configuration only by using the crsctl replace votedisk command. This is true even when there are no working voting disks: although the crsctl query css votedisk command reports zero voting disks in use, Oracle Clusterware remembers that Oracle ASM was in use, so the replace verb is required. The CRSCTL commands add css votedisk and delete css votedisk become usable again only after you use the replace verb to move the voting disks back to non-Oracle ASM storage.

To move voting disks from shared storage to an Oracle ASM disk group:

  1. Use the Oracle ASM Configuration Assistant (ASMCA) to create an Oracle ASM disk group.

  2. Verify that the ASM Compatibility attribute for the disk group is set to 11.2.0.0 or higher.

  3. Use CRSCTL to create a voting disk in the Oracle ASM disk group by specifying the disk group name in the following command:

    crsctl replace votedisk +ASM_disk_group
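Step 3 and a follow-up check can be wrapped in a small function. This sketch assumes the disk group already exists and meets the compatibility requirement from Step 2; migrate_votedisks_to_asm is a hypothetical name, and +DATA in the tests or examples is a placeholder disk group name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move all voting disks into an Oracle ASM disk group.
# The argument must be a disk group name prefixed with "+", as in the
# crsctl replace votedisk command above.
migrate_votedisks_to_asm() {
  local diskgroup=$1
  case "$diskgroup" in
    +?*) ;;
    *)
      echo "usage: migrate_votedisks_to_asm +DISKGROUP" >&2
      return 2
      ;;
  esac
  crsctl replace votedisk "$diskgroup" || return 1
  # Confirm which voting disks Oracle Clusterware now uses.
  crsctl query css votedisk
}
```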
    


Backing Up and Recovering the Oracle Cluster Registry

Oracle Clusterware automatically creates OCR backups every four hours. At any one time, Oracle Clusterware always retains the latest three backup copies of the OCR that are four hours old, one day old, and one week old.

You cannot customize the backup frequencies or the number of files that Oracle Clusterware retains. You can use any backup software to copy the automatically generated backup files at least once daily to a different device from where the primary OCR file resides.


Viewing Available OCR Backups

Use the ocrconfig utility to view the backups generated automatically by Oracle Clusterware.

To find the most recent backup of the OCR:

Run the following command on any node in the cluster:

ocrconfig -showbackup

Manually Backing Up the OCR

Use the ocrconfig utility to force Oracle Clusterware to perform a backup of OCR at any time, rather than wait for the automatic backup that occurs at four-hour intervals. This option is especially useful when you want to obtain a binary backup on demand, such as before you make changes to OCR.

To manually backup the contents of the OCR:

  1. Log in as the root user.

  2. Use the following command to force Oracle Clusterware to perform an immediate backup of the OCR:

    ocrconfig -manualbackup
    

    The date and identifier of the newly generated OCR backup are displayed.

  3. (Optional) If you must change the location for the OCR backup files, then use the following command, where directory_name is the new location for the backups:

    ocrconfig -backuploc directory_name
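Steps 2 and 3 can be combined into a helper that you run before changing the OCR. In this sketch, backup_ocr_now is a hypothetical name; if you pass a directory, it first changes the backup location with ocrconfig -backuploc, then forces a backup and lists the available backups with ocrconfig -showbackup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: take an on-demand OCR backup. Run as root.
# Optional argument: a new backup directory to set before backing up.
backup_ocr_now() {
  if [ "$#" -gt 0 ]; then
    ocrconfig -backuploc "$1" || return 1
  fi
  if ! ocrconfig -manualbackup; then
    echo "manual OCR backup failed" >&2
    return 1
  fi
  # Show the backups that now exist, including the new manual one.
  ocrconfig -showbackup
}
```

Keep in mind that -backuploc changes the configured backup location, so passing a directory also affects later backups, not only this one.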
    

The default location for generating backups on Oracle Linux systems is Grid_home/cdata/cluster_name where cluster_name is the name of your cluster and Grid_home is the home directory of the Oracle Grid Infrastructure for a cluster installation. Because the default backup is on a local file system, Oracle recommends that you include the backup file created with the ocrconfig command as part of your operating system backup using standard operating system or third-party tools.
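Because the default backup location is a local file system, the backup files should also be copied to other storage. The sketch below copies every backup file from a backup directory into a dated subdirectory elsewhere. It is a hypothetical helper, not an Oracle tool, and it assumes that the backups are regular files whose names end in .ocr; check the names and the node that ocrconfig -showbackup reports on your system, and schedule the copy (for example with cron) on that node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy OCR backup files to other storage.
#   src_dir   : directory holding the OCR backups (assumed to be a local
#               directory, such as the default Grid_home/cdata/cluster_name)
#   dest_root : directory on a different device; a dated subdirectory
#               is created beneath it
# Prints the number of files copied.
copy_ocr_backups() {
  local src_dir=$1 dest_root=$2 dest_dir f copied=0
  dest_dir="$dest_root/ocr_$(date +%Y%m%d)"
  mkdir -p "$dest_dir" || return 1
  for f in "$src_dir"/*.ocr; do
    [ -e "$f" ] || continue   # the glob matched no files
    cp -p "$f" "$dest_dir/" || return 1
    copied=$((copied + 1))
  done
  echo "$copied"
}
```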

Note:

You can use the ocrconfig -backuploc command to change the location where the OCR backups are created.

Recovering the OCR

There are two methods for recovering the OCR. The first method uses automatically generated OCR file copies and the second method uses manually created OCR export files.


Checking the Status of the OCR

In the event of a failure, before you attempt to restore the OCR, ensure that the OCR is unavailable.

To check the status of the OCR:

  1. Run the following command:

    ocrcheck 
    
  2. If this command does not display the message 'Device/File integrity check succeeded' for at least one copy of the OCR, then all copies of the OCR have failed. You must restore the OCR from a backup or OCR export.

  3. If there is at least one copy of the OCR available, then you can use that copy to restore the other copies of the OCR.
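The check in Step 2 can be automated by counting the "Device/File integrity check succeeded" lines in the ocrcheck output. This is a minimal sketch; check_ocr_status is a hypothetical name, and it only counts healthy copies, so it does not tell you which copy failed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run ocrcheck and report how many OCR copies passed
# the per-location integrity check.
check_ocr_status() {
  local healthy
  # grep -c prints 0 (and exits 1) when nothing matches; "|| true" keeps
  # that case from aborting a script that runs with "set -e".
  healthy=$(ocrcheck 2>&1 | grep -c 'Device/File integrity check succeeded' || true)
  if [ "$healthy" -eq 0 ]; then
    echo "no OCR copy passed the integrity check: restore from a backup"
    return 1
  fi
  echo "OCR copies that passed the integrity check: $healthy"
}
```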

Restoring the OCR from Automatically Generated OCR Backups

When restoring the OCR from automatically generated backups, you first have to determine which backup file to use for the recovery.

To restore the OCR from an automatically generated backup on an Oracle Linux system:

  1. Log in as the root user.

  2. Identify the available OCR backups using the ocrconfig command:

    [root]# ocrconfig -showbackup
    
  3. Review the contents of the backup using the following ocrdump command, where file_name is the name of the OCR backup file for which the contents should be written out to the file ocr_dump_output_file:

    [root]# ocrdump ocr_dump_output_file -backupfile file_name
    

    If you do not specify an output file name, then the utility writes the OCR contents to a file named OCRDUMPFILE in the current directory.

  4. As the root user, stop Oracle Clusterware on all the nodes in your cluster by executing the following command:

    [root]# crsctl stop cluster -all
    
  5. As the root user, restore the OCR by applying an OCR backup file that you identified in Step 2 using the following command, where file_name is the name of the OCR to restore. Ensure that the OCR devices that you specify in the OCR configuration exist, and that these OCR devices are valid before running this command.

    [root]# ocrconfig -restore file_name
    
  6. As the root user, restart Oracle Clusterware on all the nodes in your cluster by running the following command:

    [root]# crsctl start cluster -all
    
  7. Use the Cluster Verification Utility (CVU) to verify the OCR integrity. Exit the root user account, and, as the software owner of the Oracle Grid Infrastructure for a cluster installation, run the following command, where the -n all argument retrieves a list of all the cluster nodes that are configured as part of your cluster:

    cluvfy comp ocr -n all [-verbose]
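Steps 4 through 6 can be scripted as follows. This is a sketch, not a supported tool: restore_ocr_from_backup is a hypothetical name, it must run as root, and it assumes the backup file path you pass is readable on the node where you run it (ocrconfig -showbackup reports which node holds each backup). It deliberately stops if the restore fails, leaving Oracle Clusterware down so that you can investigate.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restore the OCR from a backup file (Steps 4-6).
# Run as root after choosing the backup with ocrconfig -showbackup.
restore_ocr_from_backup() {
  local backup_file=$1
  if [ ! -f "$backup_file" ]; then
    echo "backup file not found on this node: $backup_file" >&2
    return 2
  fi
  crsctl stop cluster -all || return 1
  if ! ocrconfig -restore "$backup_file"; then
    echo "restore failed; Oracle Clusterware is left stopped" >&2
    return 1
  fi
  crsctl start cluster -all
}
```

After it succeeds, run the cluvfy comp ocr -n all check from Step 7 as the Oracle Grid Infrastructure software owner.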
    

Changing the Oracle Cluster Registry Configuration

This section describes how to administer the Oracle Cluster Registry (OCR). The OCR contains information about the cluster node list, which instances run on which nodes, and information about Oracle Clusterware resource profiles for applications that have been modified to be managed by Oracle Clusterware.


Note:

The operations in this section affect the OCR for the entire cluster. However, the ocrconfig command cannot modify OCR configuration information for nodes that are shut down or for nodes on which Oracle Clusterware is not running. Avoid shutting down nodes while modifying the OCR using the ocrconfig command.

Adding an OCR Location

Oracle Clusterware supports up to five OCR copies. You can add an OCR location after an upgrade or after completing the Oracle RAC installation. Additional OCR copies provide greater fault tolerance.

To add an OCR file:

As the root user, enter the following command to add a new OCR file:

[root]# ocrconfig -add new_ocr_file_name 

This command updates the OCR configuration on all the nodes on which Oracle Clusterware is running.
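Because Oracle Clusterware supports at most five OCR copies, a script can check the current count before adding another. This minimal sketch assumes that ocrcheck prints one "Device/File Name" line per configured location, as in the sample output in "About the OCRCHECK Utility"; add_ocr_location is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add an OCR location only if fewer than five are
# configured. The count comes from the "Device/File Name" lines in the
# ocrcheck output. Run as root.
add_ocr_location() {
  local new_location=$1 configured
  configured=$(ocrcheck 2>&1 | grep -c 'Device/File Name' || true)
  if [ "$configured" -ge 5 ]; then
    echo "five OCR locations are already configured" >&2
    return 2
  fi
  ocrconfig -add "$new_location"
}
```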

Migrating the OCR to Oracle ASM Storage

You can store the OCR in an Oracle ASM disk group. By default, the OCR is configured to use Oracle ASM when you perform a new installation of Oracle Clusterware. However, if you upgrade from a previous release, then you can migrate OCR to reside on Oracle ASM, and take advantage of the improvements in managing Oracle Clusterware storage.

The OCR inherits the redundancy of the disk group. If you want high redundancy for the OCR, then you must create an Oracle ASM disk group with high redundancy. You should use a disk group with at least normal redundancy, unless you have an external mirroring solution. If you store the OCR in an Oracle ASM disk group, and the Oracle ASM instance fails on a node, then the OCR becomes unavailable only on that node. The failure of one Oracle ASM instance does not affect the availability of the entire cluster.

Oracle does not support storing the OCR on different storage types simultaneously, such as storing the OCR on both Oracle ASM and a shared file system, except during a migration. After you have migrated the OCR to Oracle ASM storage, you must delete the existing OCR files.

To move the OCR from shared storage to an Oracle ASM disk group:

  1. Use the Oracle ASM Configuration Assistant (ASMCA) to create an Oracle ASM disk group that is at least the same size as the existing OCR and has at least normal redundancy.

  2. Verify that the ASM Compatibility attribute for the disk group is set to 11.2.0.0 or higher.

  3. Run the following OCRCONFIG command as the root user, specifying the Oracle ASM disk group name:

    # ocrconfig -add +ASM_disk_group
    

    You can run this command more than once if you add multiple OCR locations. You can have up to five OCR locations. However, each successive run must point to a different disk group.

  4. Remove the non-Oracle ASM storage locations by running the following command as the root user:

    # ocrconfig -delete old_storage_location
    

    You must run this command once for every shared storage location for the OCR that is not using Oracle ASM.
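Steps 3 and 4 can be sketched as one function that adds the Oracle ASM location first and removes the old locations only after the add succeeds. The function name and the paths in the example comment are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper for Steps 3 and 4: add an Oracle ASM disk group as
# an OCR location, then delete each old non-ASM location given after it.
# Run as root. Example (placeholder paths):
#   migrate_ocr_to_asm +DATA /ocrfs/ocr1 /ocrfs/ocr2
migrate_ocr_to_asm() {
  local diskgroup=$1 old
  case "$diskgroup" in
    +?*) ;;
    *)
      echo "first argument must be an Oracle ASM disk group, such as +DATA" >&2
      return 2
      ;;
  esac
  shift
  ocrconfig -add "$diskgroup" || return 1
  for old in "$@"; do
    ocrconfig -delete "$old" || return 1
  done
  # Confirm the resulting configuration.
  ocrcheck
}
```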


Replacing an OCR

If you must change the location of an existing OCR, or change the location of a failed OCR to the location of a working one, then you can use the following procedure if one OCR file remains online.

To change the location of an OCR or replace an OCR file:

  1. Use the OCRCHECK utility to verify that a copy of the OCR other than the one you are going to replace is online, using the following command:

    ocrcheck 
    

    Note:

    The OCR that you are replacing can be either online or offline.
  2. Use the following command to verify that Oracle Clusterware is running on the node on which you are going to perform the replace operation:

    crsctl check cluster -all
    
  3. As the root user, enter the following command to designate a new location for the specified OCR file:

    [root]# ocrconfig -replace source_ocr_file -replacement destination_ocr_file
    

    This command updates the OCR configuration on all the nodes on which Oracle Clusterware is running.

  4. Use the OCRCHECK utility to verify that OCR replacement file is online:

    ocrcheck 
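Step 1 asks you to confirm that a copy other than the one you are replacing is online. The sketch below automates that check by pairing each "Device/File Name" line in the ocrcheck output with the status line that follows it, as in the sample output in "About the OCRCHECK Utility". Both function names are hypothetical, and the parsing assumes that output layout.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the OCR replacement procedure. Run as root.

# Read ocrcheck output on stdin; print each location whose status line
# reports "Device/File integrity check succeeded".
healthy_ocr_locations() {
  awk '
    /Device\/File Name/ {
      name = $0
      sub(/^[^:]*:[ \t]*/, "", name)   # keep the text after the first colon
      sub(/[ \t]+$/, "", name)
      next
    }
    /Device\/File integrity check succeeded/ && name != "" {
      print name
      name = ""
    }
  '
}

# Replace SOURCE with DEST only if another OCR copy passed its check.
replace_ocr_location() {
  local source=$1 dest=$2 others
  others=$(ocrcheck 2>&1 | healthy_ocr_locations | grep -Fxvc -- "$source" || true)
  if [ "$others" -eq 0 ]; then
    echo "no other OCR copy is online; use the restore procedure instead" >&2
    return 2
  fi
  crsctl check cluster -all || return 1
  ocrconfig -replace "$source" -replacement "$dest" || return 1
  ocrcheck
}
```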
    

Removing an OCR

To remove an OCR file, at least one copy of the OCR must be online. You can remove an OCR location to reduce OCR-related overhead or to stop mirroring your OCR because you moved the OCR to a redundant storage system, such as a redundant array of independent disks (RAID).

To remove an OCR location from your cluster:

  1. Use the OCRCHECK utility to ensure that at least one OCR other than the OCR that you are removing is online.

    ocrcheck
    

    Note:

    Do not perform this OCR removal procedure unless there is at least one active OCR online.
  2. As the root user, run the following command on any node in the cluster to remove a specific OCR file:

    [root]# ocrconfig -delete ocr_file_name
    

    This command updates the OCR configuration on all the nodes on which Oracle Clusterware is running.

Repairing an OCR Configuration on a Local Node

If a node in your cluster was not available when you modified the OCR configuration, then you must repair the OCR configuration on that node before it is restarted.

To repair an OCR configuration:

  1. As the root user, run one or more of the following commands on the node on which Oracle Clusterware is stopped, depending on the number and type of changes that were made to the OCR configuration:

    [root]# ocrconfig -repair -add new_ocr_file_name
    
    [root]# ocrconfig -repair -delete ocr_file_name
    
    [root]# ocrconfig -repair -replace source_ocr_file -replacement dest_ocr_file
    

    These commands update the OCR configuration only on the node from which you run the command.

    Note:

    You cannot perform these operations on a node on which the Oracle Clusterware daemon is running.
  2. Restart Oracle Clusterware on the node you have just repaired.

  3. As the root user, check the OCR configuration integrity of your cluster using the following command:

    [root]# ocrcheck
    

Troubleshooting the Oracle Cluster Registry


About the OCRCHECK Utility

The OCRCHECK utility displays the data block format version used by the OCR, the available space and used space in the OCR, the ID used for the OCR, and the locations you have configured for the OCR. The OCRCHECK utility calculates a checksum for all the data blocks in all the OCRs that you have configured to verify the integrity of each block. It also returns an individual status for each OCR file and a result for the overall OCR integrity check. The following is a sample of the OCRCHECK output:

Status of Oracle Cluster Registry is as follows :
   Version                  :          3
   Total space (kbytes)     :     262144
   Used space (kbytes)      :      16256
   Available space (kbytes) :     245888
   ID                       :  570929253
   Device/File Name         : +CRS_DATA
                              Device/File integrity check succeeded
...
                              Device/File not configured

   Cluster registry integrity check succeeded

   Logical corruption check succeeded

The OCRCHECK utility creates a log file in the following directory, where Grid_home is the location of the Oracle Grid Infrastructure for a cluster installation, and hostname is the name of the local node:

Grid_home/log/hostname/client

The log files have names of the form ocrcheck_nnnnn.log, where nnnnn is the process ID of the session that issued the ocrcheck command.

Common Oracle Cluster Registry Problems and Solutions

The following table describes common OCR problems and their corresponding solutions.

Table 5-1 Common OCR Problems and Solutions

Problem: The OCR is not mirrored.
Solution: Run the ocrconfig command with the -add option as described in the section "Adding an OCR Location".

Problem: A copy of the OCR has failed and you must replace it. Error messages are being reported in Enterprise Manager or the OCR log file.
Solution: Run the ocrconfig command with the -replace option as described in the section "Replacing an OCR".

Problem: OCRCHECK does not find a valid OCR, or all copies of the OCR are corrupted.
Solution: Run the ocrconfig command with the -restore option as described in the section "Restoring the OCR from Automatically Generated OCR Backups".

Problem: The OCR configuration was updated incorrectly.
Solution: Run the ocrconfig command with the -repair option as described in the section "Repairing an OCR Configuration on a Local Node".

Problem: You are experiencing a severe performance effect from updating multiple OCR files, or you want to remove an OCR file for other reasons.
Solution: Run the ocrconfig command with the -delete option as described in the section "Removing an OCR".

Problem: You want to change the location or storage type currently used by the OCR.
Solution: Run the ocrconfig command with the -replace option while Oracle Clusterware is running, as described in the section "Replacing an OCR". If some cluster nodes are down when you move the OCR, then you must run ocrconfig -repair on each node that was down before you start Oracle Clusterware on that node.