c00816695
 
HP Tru64 UNIX - Corrections for Device-related Hangs, Panics, and Boot Issues

 Advisory Information
 

RELEASE DATE: 2006-12-19

DESCRIPTION

The ERP identified in the Engineering Advisory contains numerous fixes for device-related hangs, panics, and boot issues.

Descriptions of the fixes follow:

  • This patch fixes a configuration issue found in non-CAM devices and CD-ROM devices.
  • This patch improves the reliability of the Tru64 Cluster DRD subsystem when handling tape devices and tape device failures.
    1. A timing hole allowed two opens to be sent to the tape driver at the same time. The paths could change before the tape driver checked whether the device was already open, resulting in a kernel memory fault panic. A typical stack trace for the panic is:

      THREAD 1
      drd_open()
      drd_set_tape_changer_server()
      drd_check_path()
      drd_issue_local_ioctl()
      ctape_ioctl()
      ccmn_path_setup3
      ccmn_alloc_path3()
      cmn_reg_hier_path3

      THREAD 2
      drd_open()
      drd_local_open()
      drd_local_device_open()
      drd_issue_local_ioctl()
      ctape_ioctl()
      ctape_verify_path()
      ccmn_path_setup3
      ccmn_del_stale_paths3()
      ccmn_destroy_invalid_paths()
      ccmn_reg_hier_path3

    2. When a device is deleted via hwmgr while an open is in progress, the open can hang. This patch removes the timing hole that allows the open to progress to the point where it hangs.
    3. When a device fails, all current IOs are returned with an appropriate error status code. If the upper layers continue to send IOs after the device has been marked as failed, those IOs can hang in DRD.
    4. This patch also fixes barrier issues when devices fail and a barrier is in progress.

      Symptoms for items 2, 3, and 4 are:

    Status of a drd disk with stalled IOs:
    drd_disk            d_hwid  d_state  d_flags     d_type  errno   eei     d_bp_cnt
    0xfffffc00f4fe0e00  0x0086  0x0003   0x0a800081  0x0000  0x0013  0x0000  1
    DRD_FAILED
    DRD_DISK_BLOCKED
    DK_DAIO_DISK
    DRD_DISK_NOT_USABLE bp 0xfffffc00291b3500 00:02:24.180
    DRD_DRAINED_FLAGS
    DRD_DISK_FAILED
    DRD_STOP_SERVER
    DRD_DO_NOT_DELETE
    DRD_IS_BARRIERABLE

    Typical thread trace for vold threads at the time of hung IOs:
    0 thread_block
    1 volsiowait
    4 volsioctl_rea
    5 spec_ioctl
    6 vn_ioctl
    7 ioctl_base
    8 syscall
    9 _Xsyscall
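
The timing hole in item 1 is a check-then-act race: two opens can both pass the "already open?" check before either finishes path setup. The advisory does not show the driver change itself. The Python sketch below only illustrates the general remedy of making the check and the setup a single atomic step. The names TapeDevice, open, path_setups, and race_opens are hypothetical and do not correspond to Tru64 kernel symbols.

```python
import threading


class TapeDevice:
    """Toy model of a device whose open path must not run twice at once."""

    def __init__(self):
        self._lock = threading.Lock()
        self.is_open = False
        self.path_setups = 0  # how many times path setup actually ran

    def open(self):
        # Hold the lock across BOTH the "already open?" check and the
        # path setup, so a second opener cannot slip in between them.
        with self._lock:
            if self.is_open:
                return False  # someone else already opened the device
            self.path_setups += 1  # stands in for the real path setup
            self.is_open = True
            return True


def race_opens(device, n_threads=8):
    """Start n_threads concurrent opens and return their results."""
    results = []
    results_lock = threading.Lock()

    def worker():
        outcome = device.open()
        with results_lock:
            results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

However many opens arrive together, only one of them performs the setup; the rest see the device as already open and return without touching the paths.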

  • This patch fixes an error in the DRD subsystem wherein uninitialized disk attributes can cause a system panic.
    1. A typical stack trace for the panic is:

      4 panic
      5 trap
      6 _XentMM
      7 free
      8 drd_release_bp_resources
      9 drd_ics_io
      10 drd_ics_read
      11 svr_drd_ics_read
      12 icssvr_daemon_from_pool

      This problem appears when an open or read is attempted on deleted XCR disks.
    2. This patch also fixes an error during failback of a tape device wherein the character devt is not restored properly.
  • Corrects a problem where the DRD event thread may run indefinitely while responding to a bid server transaction.
  • This patch fixes a problem whereby the DRD subsystem may cause a system panic because routines may be called from a lightweight context (LWC). This could result in a system panic with the following or similar stack trace:

    0 boot
    1 panic
    2 thread_block
    3 lock_wait
    4 lock_write
    5 (source file cannot be determined)
    6 (source file cannot be determined)
    7 (source file cannot be determined)
    8 drd_restart_io
    9 drd_io_barrier_complete_timeout
    10 softclock_scan
    11 lwc_schedule
    12 exception_exit
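
In the trace above, a timeout routine (drd_io_barrier_complete_timeout) run from lwc_schedule reaches lock_write and thread_block, that is, a blocking lock is taken in a context that must not block. The advisory does not say how the fix was implemented. One common pattern, sketched here in Python purely as an assumption, is for the timer callback to queue the work and return immediately, leaving an ordinary worker thread to take the blocking lock. All names in the sketch (restart_queue, on_barrier_timeout, restart_worker) are hypothetical.

```python
import queue
import threading

restart_queue = queue.Queue()   # work handed off by the timer callback
state_lock = threading.Lock()   # stands in for a lock that may block
restarted = []                  # disk ids whose I/O was restarted


def on_barrier_timeout(disk_id):
    """Timer callback: must not block, so it only queues the request."""
    restart_queue.put_nowait(disk_id)


def restart_worker():
    """Ordinary thread: allowed to block while waiting for the lock."""
    while True:
        disk_id = restart_queue.get()
        if disk_id is None:          # sentinel: shut the worker down
            restart_queue.task_done()
            break
        with state_lock:
            restarted.append(disk_id)
        restart_queue.task_done()


worker = threading.Thread(target=restart_worker, daemon=True)
worker.start()

# Simulate two timeouts firing; neither call waits for the lock.
on_barrier_timeout(0x86)
on_barrier_timeout(0x87)
restart_queue.join()                 # wait until the worker has drained both
restart_queue.put(None)
worker.join()
```

The callback never touches state_lock, so it cannot sleep; only the worker thread, which is free to block, does the locked work.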

  • Fixes a hang with disklabel(8) that occurred if a local open failed for the same disk simultaneously.
  • Corrects reference counting issues within the DRD subsystem that can prevent the deletion of hwids.
  • Fixes a disk I/O hang in DRD that could cause commands such as disklabel and showfdmn, or any file system I/O, to hang. A typical stack trace is as follows:

    0 thread_block
    1 sleep_prim
    2 mpsleep
    3 drd_reopen_partitions
    4 drd_change_server_node
    5 drd_complete_failback
    6 drd_handle_event_io_drained
    7 drd_handle_one_event
    8 drd_handle_events
    9 drd_event_thread
     

  • DRD now plays an active role in the device deletion callback and voting. Previously, DRD was notified only after the device deletion had occurred, via an EVM event. This caused numerous panics and hung devices, because DRD could attempt to access a deleted device. With this fix, DRD no longer accesses a device that has a deletion pending or in progress.
  • This patch fixes an issue of DRD returning incorrect device information when the hwid is not found.
  • Provides a fix for a Kernel Memory Fault in drd disk code. A typical stack trace of the problem is as follows:

    0 boot
    1 panic
    2 trap
    3 _XentMM
    4 simple_lock_D
    5 drd_add_server
    6 drd_find_local_disks
    7 drd_config_thread
     

  • Fixes DRD_IOCTL_ERROR handling for tape devices.
  • Fixes a Kernel Memory Fault in the I/O path for served disks and for stalled IOs. A typical stack trace of the problem is as follows:

    0 stop_secondary_cpu
    1 panic
    2 event_timeout
    3 printf
    4 panic
    5 trap
    6 _XentMM
    7 drd_ics_get_disk
    8 drd_ics_io
    9 drd_ics_read
    10 svr_drd_ics_read
    11 icssvr_daemon_from_pool
     

  • Fixes disk access issues that show up early in the boot process.
    This problem could result in a system panic with the following or similar stack trace:
    PANIC: "CNX MGR: Invalid configuration for cluster seq disk"

    0 boot
    1 panic
    2 init_globals
    3 init_cnx
    4 cnx_subsys_configure
    5 cnx_callback
    6 dispatch_callback
    7 main
    8 main
     

  • Fixes a hang during cluster bootup caused by early reservation conflicts. During cluster bootup, the following warning message appears and the node hangs until another node comes up:
    "WARNING: cfs_perform_glroot_mount: cfs_mountroot_local failed to mount"
  • Fixes a cluster hang issue during cluster boot-up, when local disk open operations fail while disklabel is in progress.
  • This patch corrects an erroneous error message that can be displayed by drdmgr when relocating a device. For example:
    drdmgr: Error, Uknown error -1431655766 for device 'tape0' attribute DRD_SERVER
  • Handles reservation conflict errors to address a cluster node hang during boot. During cluster boot, the node may hang until a second node comes up. A typical message that appears on the console when the node hangs is:
    "WARNING: cfs_perform_glroot_mount: cfs_mountroot_local failed to mount"

    This message appears because the path is configured later in the boot process, which results in a reservation conflict.
  • Allows retries of a disk open at boot time if the device is in the MUNSA reject state. A disk open can fail while the device is in this state, which can cause the system to hang during boot.
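
The last item describes retrying a disk open that fails while the device is in the MUNSA reject state. A bounded retry loop of that general shape might look like the Python sketch below. The exception type DeviceRejecting, the helper open_with_retry, and the attempt count and delay are all assumptions made for illustration; the advisory does not give the actual retry policy.

```python
import time


class DeviceRejecting(Exception):
    """Hypothetical error: the device is temporarily refusing opens."""


def open_with_retry(open_fn, attempts=5, delay=0.5, sleep=time.sleep):
    """Call open_fn until it succeeds or attempts are exhausted.

    Only the transient DeviceRejecting error is retried; any other
    exception propagates immediately. The last failure is re-raised
    so the caller sees an error instead of waiting forever.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return open_fn()
        except DeviceRejecting:
            if attempt == attempts:
                raise
            sleep(delay)
```

Bounding the number of attempts matters here: an unbounded retry would simply replace one boot hang with another.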
 
SCOPE

The following version of HP Tru64 UNIX is affected:

v 5.1B-3

 
RESOLUTION

HP is releasing the following ERP kits publicly for use by any customer. The ERP kits use dupatch to install and will not install over any Customer Specific Patches (CSPs) that have file intersections with the ERP. Contact your service provider for assistance if the installation of an ERP is blocked by any of your installed CSPs.

The fixes contained in the ERP kit are available in the following mainstream patch kit:

HP Tru64 UNIX v 5.1B-4

The kit distributes the following files:

  • /usr/opt/TruCluster/sys/drd.mod
  • /sys/BINARY/cam_disk.mod
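
Because the ERP will not install over a CSP that has file intersections with it, it can help to compare file lists before attempting the installation. The Python sketch below treats the two lists as sets, using the two files this kit distributes. The CSP file list is made up for the example, and the sketch is not a description of how dupatch performs its own check.

```python
# Files distributed by the ERP kit, as listed in this advisory.
ERP_FILES = {
    "/usr/opt/TruCluster/sys/drd.mod",
    "/sys/BINARY/cam_disk.mod",
}


def conflicting_files(csp_files, erp_files=ERP_FILES):
    """Return the sorted list of files present in both the CSP and the ERP."""
    return sorted(set(csp_files) & set(erp_files))


# Hypothetical CSP contents, for illustration only.
example_csp = ["/sys/BINARY/cam_disk.mod", "/sys/BINARY/example.mod"]
print(conflicting_files(example_csp))  # ['/sys/BINARY/cam_disk.mod']
```

An empty result suggests no overlap between the two lists; a non-empty one identifies the files to raise with your service provider.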

Early Release Patches

HP Tru64 UNIX version: 5.1B-3
ERP Kit Name: TCRKIT1001020-V51BB26-E-20061205
Kit Location:
http://www.itrc.hp.com/service/patch/patchDetail.do?patchid=TCRKIT1001020-V51BB26-E-20061205