HP Tru64 UNIX and TruCluster Server Version 5.1B-6

Patch Summary and Release Notes: Chapter 3, TruCluster Server Patches

Prior Release Notes

The following sections describe some of the key features and enhancements that were first delivered in previous patch kits.

New Flag Option to Turn OFF All Existing Flags for Services

A new cluster alias option, none, can be added for a service in the /etc/clua_services file. This option lets you unset an option without having to specify a different option in its place. The most common use of none is to clear the single option set on a service.
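
As a minimal sketch, suppose a service entry in /etc/clua_services previously set the in_single option (the telnet service is illustrative); replacing that option with none clears it:

# Before:  telnet  23/tcp  in_single
telnet  23/tcp  none

After editing the file, have each member reread it (see cluamgr(8)).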

Select Option to Check Tagged Files

During the preinstall stage of a rolling upgrade, you have the option of checking tagged files. Override the default setting and select the option to check tagged files. The reason for selecting this option is described in “Check for Tagged Files if Messages Are Displayed”.

Check for Tagged Files if Messages Are Displayed

When installing this patch kit during a rolling upgrade, you may see the following error and warning messages during the setup stage:

Creating tagged files.

*** Error *** 
The tar commands used to create tagged files in the '/usr' file system have
reported the following errors and warnings:
     tar: lib/nls/msg/en_US.88591/ladebug.cat : No such file or directory
 
*** Warning *** 
The above errors were detected during the cluster upgrade. If you believe that
the errors are not critical to system operation, you can choose to continue.
If you are unsure, you should check the cluster upgrade log and refer
to clu_upgrade(8) before continuing with the upgrade.

If you see these messages during the setup stage, you should verify that the tagged files were properly created when you execute the preinstall stage.

In cases where the tagged files are not created, you can repeat the setup stage.
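
One way to verify that the tagged files were created is to run the setup-stage check (the same command mentioned in “Noncritical Errors” below):

# clu_upgrade -v check setup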

Noncritical Errors

During a rolling upgrade to install this patch kit, you may encounter the following noncritical situations:

  • The tagged file for ifaccess.conf (.Old..ifaccess.conf) may disappear. This does not cause any problems with the rolling upgrade procedure or the installation of the kit. A message alerts you to this condition if you use the clu_upgrade undo command. Running the clu_upgrade -v check setup command at the start of the procedure fixes this error.

  • When the Worldwide Language Support subset is installed, the attempt to create a tagged file for wwinstall fails. This error does not affect the operational status of the cluster.

Unrecoverable Failure Procedure

The procedure to follow if you encounter unrecoverable failures while running dupatch during a rolling upgrade has changed. The new procedure calls for you to run the clu_upgrade undo install command and then set the system baseline. The procedure is explained in the Patch Kit Installation Instructions as notes in Section 5.3 and Section 5.6.

Do Not Add or Delete OSF, TCR, IOS, or OSH Subsets During Roll

During a rolling upgrade, do not use the /usr/sbin/setld command to add or delete any of the following subsets:

  • Base Operating System subsets (those with the prefix OSF).

  • TruCluster Server subsets (those with the prefix TCR).

  • Worldwide Language Support (WLS) subsets (those with the prefix IOS).

  • New Hardware Delivery (NHD) subsets (those with the prefix OSH).

Adding or deleting these subsets during a roll creates inconsistencies in the tagged files.

Undo Stages in Correct Order

If you need to undo the install stage because the lead member is in an unrecoverable state, be sure to undo the stages in the correct order.

During the install stage, clu_upgrade cannot tell whether the roll is going forward or backward. This ambiguity incorrectly allows the clu_upgrade undo preinstall stage to be run before clu_upgrade undo install. Refer to the Patch Kit Installation Instructions for additional information on undoing a rolling patch.
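
For example, to back out both stages, run the undo commands in this order, undoing the install stage first:

# clu_upgrade undo install
# clu_upgrade undo preinstall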

clu_upgrade undo of Install Stage Can Result in Incorrect File Permissions

This note applies only when both of the following are true:

  • You are using installupdate, dupatch, or nhd_install to perform a rolling upgrade.

  • You need to undo the install stage; that is, to use the clu_upgrade undo install command.

In this situation, incorrect file permissions can be set for files on the lead member. This can result in the failure of rsh, rlogin, and other commands that assume user IDs or identities by means of setuid.

The clu_upgrade undo install command must be run from a nonlead member that has access to the lead member's boot disk. After the command completes, follow these steps:

  1. Boot the lead member to single-user mode.

  2. Run the following script:

    #!/usr/bin/ksh -p
    #
    # Script for restoring installed permissions
    #
    cd /
    for i in /usr/.smdb./@(OSF|TCR|IOS|OSH)*.sts
    do
        # For each installed subset, reapply the permissions recorded
        # in its inventory file.
        grep -q "_INSTALLED" $i 2>/dev/null && /usr/lbin/fverify -y <"${i%.sts}.inv"
    done
  3. Rerun installupdate, dupatch, or nhd_install, whichever is appropriate, and complete the rolling upgrade.

For information about rolling upgrades, see the Patch Kit Installation Instructions and the installupdate(8) and clu_upgrade(8) reference pages.

Missing Entry Messages Can Be Ignored During Rolling Patch

During the setup stage of a rolling patch, you might see a message like the following:

Creating tagged files.
 ............................................................
 
clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597530

clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597568

An Entry not found message will appear once for each member in the cluster. The number in the message corresponds to a PID.

You can safely ignore this Entry not found message.

Relocating AutoFS During a Rolling Upgrade on a Cluster

This note applies only to performing rolling upgrades on cluster systems that use AutoFS.

During a cluster rolling upgrade, each cluster member is singly halted and rebooted several times. The Patch Kit Installation Instructions direct you to manually relocate applications under the control of Cluster Application Availability (CAA) prior to halting a member on which CAA applications run.

The manual relocation of AutoFS can sometimes fail, most often when NFS traffic is heavy. The following procedure avoids that problem.

At the start of the rolling upgrade procedure, use the caa_stat command to learn which member is running AutoFS. For example:

# caa_stat -t
Name          Type           Target    State     Host
------------------------------------------------------------
autofs         application    ONLINE    ONLINE    rye 
cluster_lockd  application    ONLINE    ONLINE    rye 
clustercron    application    ONLINE    ONLINE    swiss
dhcp           application    ONLINE    ONLINE    swiss
named          application    ONLINE    ONLINE    rye

To minimize your effort in the following procedure, perform the roll stage last on the member where AutoFS runs.

When it is time to perform a manual relocation on a member where AutoFS is running, follow these steps:

  1. Stop AutoFS by entering the following command on the member where AutoFS runs:

    # /usr/sbin/caa_stop -f autofs
  2. Perform the manual relocation of other applications running on that member:

    # /usr/sbin/caa_relocate -s current_member -c target_member

After the member that had been running AutoFS has been halted as part of the rolling upgrade procedure, restart AutoFS on a member that is still up. (If this is the roll stage and the halted member is not the last member to be rolled, you can minimize your effort by restarting AutoFS on the member you plan to roll last.)

  1. On a member that is up, enter the following command to restart AutoFS. (The member where AutoFS is to run, target_member, must be up and running in multi-user mode.)

    # /usr/sbin/caa_start autofs -c target_member
  2. Continue with the rolling upgrade procedure.

Messages Displayed During Rolling Upgrade Can Be Ignored

You can ignore the following messages if you see them displayed during a rolling upgrade:

  • kill: 1048674: no such process

    This message may be displayed after the roll stage. For example:

    # clu_upgrade roll
    
    This is the cluster upgrade program.
    The 'roll' stage has completed successfully.  This
    member must be rebooted in order to run with the newly
    installed software.
    Do you want to reboot this member at this time? []:y
    You indicated that you want to reboot this member at this time.
    Is that correct? [yes]:
    
    The 'roll' stage of the upgrade has completed successfully.
    kill: 1048674: no such process
    
    #
  • rmdir: /var/.clu_upgrade: File exists

    This message may be displayed after the clean stage. For example:

    # clu_upgrade clean
    This is the cluster upgrade program.
    
    You have indicated that you want to perform the 'clean' stage
    of the upgrade. Do you want to
    continue to upgrade the cluster? [yes]:
    ⋮
    Deleting tagged files.
    .................................................................
    .................................................................
    .................................................................
    .................................................................
    ...................................Removing back-up and kit files
    
    rmdir: /var/.clu_upgrade: File exists
    
    The 'clean' stage of the upgrade has completed successfully.
    
    #

Error on Cluster Creation

When you attempt to create a cluster after having deleted patches, you may see the following error messages:

*** Error ***
This system has only Tru64 UNIX patches installed.
Please install the latest TruCluster Server patches on your system.
You  can obtain the most recent patch kit from:
http://www.support.compaq.com/patches/ 

*** Error *** 
The system is not configured properly for cluster creation.
Please fix the previously reported problems, and then rerun the
'clu_create' command. 

If you see these messages, enter the following command:

# ls -tlr /usr/.smdb./*PAT*.sts

If this command returns a file with 000000 in its name, you will have to run the clu_create command with the -f option to force the creation of your cluster. The problem is caused by the cluster software misinterpreting the existence of some patches and will be corrected in a future patch kit.
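
That is, force the creation of the cluster by running:

# clu_create -f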

If the command does not return a file with 000000 in its name, you will need to contact HP support to determine the cause of the problem.

When Taking a Cluster Member to Single-User Mode, First Halt the Member

To take a cluster member from multiuser mode to single-user mode, first halt the member and then boot it to single-user mode. For example:

# shutdown -h now
>>> boot -fl s

Halting and booting the system ensures that it provides the minimal set of services to the cluster and that the running cluster has a minimal reliance on the member running in single-user mode.

When the system reaches single-user mode, enter the following commands:

# /sbin/init s
# /sbin/bcheckrc
# /usr/sbin/lmf reset

Login Failure Possible with C2 Security Enabled

Login failures may occur as a result of a rolling upgrade on systems with Enhanced Security (C2) enabled. The failures may be exhibited in two ways:

  • With the following error message:

    Can't rewrite protected password entry for user
  • With the following set of error messages:

    login: Ignoring log file: /var/tcb/files/dblogs/log.00001: magic number 0, not 8
    
    
    login: log_get: read: I/O error
    Can't rewrite protected password entry for user

The problem may occur after the initial reboot of the lead cluster member or after the rolling upgrade is completed and the clu_upgrade switch procedure has been run. The following sections describe the steps you can take to prevent the problem or correct it after it occurs.

Preventing the problem

You can prevent this problem by performing the following steps before beginning the rolling upgrade:

  1. Disable the prpasswdd daemon from running on the cluster:

    # rcmgr -c set PRPASSWDD_ARGS \ 
    "`rcmgr get PRPASSWDD_ARGS` -disable"
  2. Stop the prpasswdd daemon on every node in the cluster:

    # /sbin/init.d/prpasswd stop
  3. Perform the rolling upgrade procedure through the clu_upgrade switch step and reboot all the cluster members.

  4. Perform one of the following actions:

    • If PRPASSWDD_ARGS did not exist before this upgrade (that is, if rcmgr get PRPASSWDD_ARGS at this point shows only -disable), then delete PRPASSWDD_ARGS:

      # rcmgr -c delete PRPASSWDD_ARGS
    • If PRPASSWDD_ARGS existed before this upgrade, then reset PRPASSWDD_ARGS to the original string:

      # rcmgr -c set PRPASSWDD_ARGS \ 
      "`rcmgr get PRPASSWDD_ARGS | sed 's/ -disable//'`"
  5. Check that PRPASSWDD_ARGS is now set to what you expect:

    # rcmgr get PRPASSWDD_ARGS
  6. Start the prpasswdd daemon on every node in the cluster:

    # /sbin/init.d/prpasswd start
  7. Complete the rolling upgrade.

Correcting the problem

If you have already encountered the problem, perform the following steps to clear it:

  1. Restart the prpasswdd daemon on every node in the cluster:

    # /sbin/init.d/prpasswd restart
  2. Reboot the lead cluster member.

  3. Check to see if the problem has been resolved. If it has been resolved, you are finished. If you still see the problem, continue to step 4.

  4. Try to force a change to the auth database by performing the following steps:

    1. Use edauth to add a harmless field to an account; the exact commands depend on your editor. For example, pick an account that does not have a vacation set and add u_vacation_end:

      # edauth
      s/:u_lock@:/u_vacation_end#0:u_lock@:/
      w
      q
    2. Check to see that the u_vacation_end#0 field was added to the account:

      # edauth -g
    3. Use edauth to remove the u_vacation_end#0 field from the account.

    If the edauth commands fail, do not stop. Continue with the following instructions.

  5. Check to see if the problem has been resolved. If it has been resolved, you are finished.

    If you still see the problem, observe the following warning and continue to step 6.

    Warning!

    Continue with the following steps only if the following conditions are met:

    • You encountered the described problem while doing a rolling upgrade of a cluster running Enhanced Security.

    • You performed all previous steps.

    • All user authentications (logins) still fail.

  6. Disable logins on the cluster by creating the file /etc/nologin:

    # touch /etc/nologin
  7. Disable the prpasswdd daemon from running on the cluster:

    # rcmgr -c set PRPASSWDD_ARGS \ 
    "`rcmgr get PRPASSWDD_ARGS` -disable"
  8. Stop the prpasswdd daemon on every node in the cluster:

    # /sbin/init.d/prpasswd stop
  9. Force a checkpoint of the authentication database, using the db_checkpoint command with the -1 (number 1) option:

    # /usr/tcb/bin/db_checkpoint -1 -h /var/tcb/files

    Continue with the instructions even if this command fails.

  10. Delete the files in the dblogs directory:

    # rm -f /var/tcb/files/dblogs/*
  11. Force a change to the auth database, as follows:

    • Use the edauth command to add a harmless field to an account; the exact commands depend on your editor. For example, pick an account that does not have a vacation set and enter the following:

      # edauth
      s/:u_lock@:/u_vacation_end#0:u_lock@:/ 
      w
      q
    • Check to see that the u_vacation_end#0 field was added to the account:

      # edauth -g
    • Use the edauth command to remove the u_vacation_end#0 field from the account.

    Warning!

    If the edauth command fails, do not proceed further. Contact HP support.

  12. If the edauth command was successful, perform one of the following actions:

    • If PRPASSWDD_ARGS did not exist before this upgrade (that is, if rcmgr get PRPASSWDD_ARGS at this point shows only -disable), then delete PRPASSWDD_ARGS:

      # rcmgr -c delete PRPASSWDD_ARGS
    • If PRPASSWDD_ARGS existed before this upgrade, then reset PRPASSWDD_ARGS to the original string:

      # rcmgr -c set PRPASSWDD_ARGS \ 
      "`rcmgr get PRPASSWDD_ARGS | sed 's/ -disable//'`"
  13. Check that PRPASSWDD_ARGS is now set to what you expect:

    # rcmgr get PRPASSWDD_ARGS
  14. Start the prpasswdd daemon on every node in the cluster:

    # /sbin/init.d/prpasswd start
  15. Re-enable logins on the cluster by deleting the file /etc/nologin:

    # rm /etc/nologin
  16. Check to see if the problem has been resolved. If it has not, contact HP support.

File System Unmount Recommended if Message Is Displayed

Under certain error conditions, the following message may be seen during a relocation or failover, or during the boot of a member:

WARNING: Unable to failover /mnt: pfs and cfs fsids differ

The result is that the fileset in question is now unserved in the cluster. For example:

 # cfsmgr /mnt
          Domain or filesystem name = /mnt
          Server Status : Not Served

If this occurs, we recommend that you immediately do the following:

  1. Use the following command to unmount the filesystem:

    # cfsmgr -u -p [mountpoint]
  2. If other mounted filesets exist in the same domain, unmount them (they should also be in the "Not Served" state):

    # cfsmgr -u -d [domain]

    For steps on checking an AdvFS domain, see the AdvFS Administration Guide, Section 6.3.1, steps 3-7.

  3. Run diagnostics on the domain prior to remounting its file systems.

    To verify the domain, you can use the AdvFS verify utility or the fixfdmn utility. If you use fixfdmn, we recommend first running it with the -n option to see what errors are found before allowing fixfdmn to fix them.
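
    As a sketch, assuming an illustrative domain name data_domain and the standard /sbin/advfs utility location:

    # /sbin/advfs/verify data_domain
    # /sbin/advfs/fixfdmn -n data_domain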

Once you have successfully verified the domain, remounting the domain's file systems in the cluster should succeed.

If the domain cannot be immediately verified, we recommend that you do not remount the original fileset until this can be done.

Note:

In rare cases, the warning message will be accompanied by a system panic. This will occur if CFS error handling is unable to successfully unmount the underlying physical file system. If this occurs, the console will direct you to use cfsmgr to unmount the domain on one of the remaining nodes prior to rebooting the member.

This action will prevent the rebooted member from attempting to failover-mount the file system and will minimize access to the domain. Prior to remounting the file system, it is advisable that the domain be sanity-checked using the steps given above.

Tunable Attribute May Help Performance Problem

The tunable attribute cfs_clone_noccr, included in this patch kit, may correct a problem in which cluster fileset writes occurring simultaneously with reads of the fileset clone on a cluster client (for example, during a backup) degrade performance. This occurs most often when the clone file being read consists of many thousands of extents (for example, 20,000 or more).

If a degradation during cluster clone reads is noticeable (for example, the clone read appears to be hanging and requires a long time to complete), set the value of cfs_clone_noccr to 1 on the server of the given fileset. This sysconfig tunable attribute is set to 0 by default and should be changed only when the degradation is noticeable.
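
As a minimal sketch, assuming the attribute belongs to the cfs kernel subsystem, you can query and then change it at run time with sysconfig:

# sysconfig -q cfs cfs_clone_noccr
# sysconfig -r cfs cfs_clone_noccr=1

To make the change persist across reboots, place the equivalent entry in the member's /etc/sysconfigtab file.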

Note that all filesets with clones that are served by the node on which the attribute is set will also see this change. It may be advisable (though not required) to have those filesets whose clone files have fewer extents be served by a different node during the time the tunable attribute is set.

AlphaServer ES47 or AlphaServer GS1280 Hangs When Added to Cluster

If, after you run clu_add_member to add an AlphaServer ES47 or AlphaServer GS1280 as a member of a TruCluster Server cluster, the new member hangs during its first boot, try rebooting it with the original V5.1B generic cluster kernel, clu_genvmunix.

Use the following instructions to extract and copy the V5.1B cluster genvmunix from your original Tru64 UNIX kit to your AlphaServer ES47 or AlphaServer GS1280 system. In these instructions, the AlphaServer ES47 or AlphaServer GS1280 is designated as member 5. Substitute the appropriate member number for your cluster.

  1. Insert the Tru64 UNIX Associated Products Disk 2 into the CD-ROM drive of an active member.

  2. Mount the CD-ROM to /mnt. For example:

    # mount -r /dev/disk/cdrom0c /mnt
  3. Mount the boot disk of the AlphaServer ES47 or AlphaServer GS1280 on its specific mount point; for example:

    # mount root5_domain#root /cluster/members/member5/boot_partition
  4. Extract the original clu_genvmunix from the CD-ROM and copy it to the boot disk of the AlphaServer ES47 or AlphaServer GS1280 member. (The zcat command shown here assumes that your current directory is the directory on the mounted CD-ROM that contains the TCRBASE540 subset.)

    # zcat < TCRBASE540 | ( cd /cluster/admin/tmp; \
    tar -xf - ./usr/opt/TruCluster/clu_genvmunix)
    # cp /cluster/admin/tmp/usr/opt/TruCluster/clu_genvmunix \
    /cluster/members/member5/boot_partition/genvmunix
    # rm /cluster/admin/tmp/usr/opt/TruCluster/clu_genvmunix
  5. Unmount the CD-ROM and the boot disk:

    # umount /mnt 
    # umount /cluster/members/member5/boot_partition
  6. Reboot the AlphaServer ES47 or AlphaServer GS1280.

Problems with clu_upgrade Switch Stage

If the clu_upgrade switch stage does not complete successfully, you may see a message like the following:

versw: No switch due to inconsistent versions

The problem can be due to one or more members running genvmunix, a generic kernel.

Use the command clu_get_info -full and note each member's version number, as reported in the line beginning

Member base O/S version

If a member has a version number different from that of the other members, shut down that member and reboot it from vmunix, the custom kernel. If multiple members have different version numbers, reboot them one at a time from vmunix.
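
For example, to pull out just the relevant lines for comparison:

# clu_get_info -full | grep 'Member base O/S version'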

Data Protector Issues and Restrictions

The following sections describe issues and restrictions for Version 5.1 of the HP OpenView Storage Data Protector backup and recovery product when configuring it on a Tru64 UNIX cluster.

Possible Error Backing Up Cluster Mount Points

When backing up cluster mount points using the cluster alias as the client name, you may encounter an error in which the directory is reported as a mount point to a different file system and is backed up as an empty directory.

To correct this problem, create TruCluster Server clients as follows:

  • Create a client for each node in the cluster, using the node's host name.

  • Create another client using the cluster alias name, selecting it as a virtual host.

You can then create backups using the alias as the client name.

You may also need to define your mount points to back up using the manual add function of the Add Backup wizard. Under some circumstances, backups that are created using the default device discovery encounter the “backed up as an empty directory” problem.

Configuring Data Protector for Oracle Integration

When configuring Data Protector for Oracle integration, libobk.so should be linked with /usr/omni/lib/libob2oracle8_64bit.so.

The Data Protector UNIX Integration Guide incorrectly states that it should be linked with /usr/omni/lib/libob2oracle8_64.so.
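
As an illustrative sketch only (the $ORACLE_HOME location and the convention of placing the libobk.so symbolic link under it are assumptions, not statements from this note):

# ln -sf /usr/omni/lib/libob2oracle8_64bit.so \
$ORACLE_HOME/lib/libobk.so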

Set ipport_userreserved Attribute on Large Systems

Larger systems can encounter portmapper problems in a local area network (LAN) cluster if the value of the ipport_userreserved attribute has not been tuned. The recommended value is 65535 and should be the same for all cluster members. Set the value before adding the first member.

If this value is not set for a LAN cluster with larger machines, the machines may run out of ports for interconnect services. For more information, see the manual Tuning Tru64 UNIX for Internet Servers.
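
As a sketch, assuming the attribute belongs to the inet kernel subsystem, you can set it at run time with sysconfig:

# sysconfig -r inet ipport_userreserved=65535

The persistent form is a stanza such as the following in /etc/sysconfigtab:

inet:
    ipport_userreserved = 65535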