HPE P2000 G3 Troubleshooting Guide Part Number: 869944-001 Published: July 2016 Edition: 1 Contents 1 Troubleshooting HPE P2000 G3.........................................................................3 Events and LEDs..................................................................................................................................3 Event Monitoring..............................................................................................................................3 Events sent as Indications to SMI-S Clients....................................................................................3 Events requiring FRU replacements...............................................................................................4 LED Indications and Recommended Actions..................................................................................6 Isolating a host-side connection fault..............................................................................................9 Disk Drives............................................................................................................................................9 Disk Drive LEDs............................................................................................................................10 Disk Error Conditions and Recommended Actions.......................................................................11 Disk Drive Bay Numbers...............................................................................................................13 Drive Status...................................................................................................................................13 Disk Drive Leftover........................................................................................................................14 Vdisk or Disk Group Status...........................................................................................................16 Vdisk or Disk Group Quarantined..................................................................................................16 Disk Drive Failure during Reconstruct...........................................................................................18 Vdisk or disk group Member Unavailable......................................................................................19 Two or Multiple Disk Drive Failure.................................................................................................19 Vdisk or Disk Group Alerts ...........................................................................................................20 Vdisk or disk group Expansion Frequently Asked Questions (FAQs)...........................................20 Management Controller Issue.............................................................................................................21 Restarting the Management Controller.........................................................................................21 Instances when the Management Controller becomes Unresponsive..........................................22 Power Supply Faults and Power Cycle...............................................................................................24 Power Cycle..................................................................................................................................24 Power Supply Faults and Recommended Actions........................................................................25 Chassis Replacement.........................................................................................................................26 Using the Trust Command..................................................................................................................26 Running the trust command..........................................................................................................27 After running the trust command...................................................................................................29 2 Support and other resources.............................................................................31 Accessing Hewlett Packard Enterprise Support.................................................................................31 Accessing updates..............................................................................................................................31 Websites.............................................................................................................................................31 Customer self repair...........................................................................................................................32 Remote support..................................................................................................................................32 Documentation feedback....................................................................................................................32 2 Contents 1 Troubleshooting HPE P2000 G3 The following section details the events and their descriptions used to troubleshoot HPE P2000 G3: • “Events and LEDs” (page 3) • “Disk Drives” (page 9) • “Management Controller Issue” (page 21) • “Power Supply Faults and Power Cycle” (page 24) • “Chassis Replacement” (page 26) • “Using the Trust Command” (page 26) Events and LEDs • “Event Monitoring” (page 3) • “Events sent as Indications to SMI-S Clients” (page 3) • “Events requiring FRU replacements” (page 4) • “LED Indications and Recommended Actions” (page 6) Event Monitoring Event notification can be set to four different levels: • Informational • Warning • Error • Critical INFORMATIONAL— used to inform all system events. This level will result in increased messages as all categories of messages will be reported. This level requires regular attention and review to ensure that WARNING (or ERROR or CRITICAL) events are not lost in the resulting expanded event report. WARNING—recommended as it signals a problem exists that results in the loss of redundancy. If another similar failure occurs, a system outage (and resulting loss of access to data) occurs. HPE recommends that you investigate and resolve the event promptly. ERROR—used to inform failure occurred that might affect data Integrity or system stability. Correct the problem as soon as possible. CRITICAL—used to inform failure occurred that might cause a controller to shutdown. Correct the problem immediately. Events sent as Indications to SMI-S Clients If the storage systems SMI-S interface is enabled, the system will send events as indications to SMI-S clients. The SMI-S clients can then monitor the system performance. For information on enabling the SMI-S interface, see Configuring the system in the MSA 1040/2040 SMU Reference Guide on the Hewlett Packard Enterprise Support Center website. For information on the event categories pertaining to Field Replaceable Unit (FRU) assemblies and certain FRU components, see Events sent as indications to SMI-S clients in the MSA Event Descriptions Reference Guide on the Hewlett Packard Enterprise Support Center website. Events and LEDs 3 Events requiring FRU replacements Table 1 Events requiring FRU replacements scenarios Event ID FRU Symptom Action Required 8 HDD See “Disk Error Conditions and Recommended Actions” (page 11) 39 Controller The sensors monitored a temperature or voltage in the warning range. Verify the following: • The system has two power supplies and both are running correctly • The ambient temperature is not too warm. The enclosure operating range is 5–40°C (41°F–104°F) • A module or blank plate resides in every module slot in the enclosure If none of these conditions apply, replace the controller module that reported the error. 40 Controller The sensors monitored a temperature or voltage in the failure range. Verify the following: • The system has two power supplies and both are running correctly • The ambient temperature is not too warm. The enclosure operating range is 5–40°C (41°F–104°F) • A module or blank plate resides in every module slot in the enclosure If none of these conditions apply, replace the controller module that reported the error. 4 50 Controller Correctable ECC error in buffer memory. Replace on multiple occurrences. 51 Controller An uncorrectable ECC error occurred in buffer memory. Replace on first occurrence. 55 HDD A disk drive reported a SMART event. Replace on first occurrence. 58 HDD A disk drive detected a serious error, such as a parity error or disk hardware failure. Replace on first occurrence. 62 HDD The indicated global or Replace the drive. dedicated spare disk has failed. The drive LED changes to amber. 65 Controller An uncorrectable ECC error occurred on the buffer memory on startup. 95 Controller Both controllers in an Replace one of the controllers on first occurrence. Active-Active configuration have the same serial number. 157 Controller A failure occurred when trying to Replace on first occurrence. write to the Storage Controller flash chip. 168 PSU See “Power Supply Faults and Recommended Actions” (page 25) 204 CompactFlash A failure occurred when trying to Replace the CompactFlash. write to the CompactFlash. 217 Controller A supercapacitor failure occurred Replace on first occurrence. on the controller. Troubleshooting HPE P2000 G3 Replace on first occurrence. Table 1 Events requiring FRU replacements scenarios (continued) Event ID FRU Symptom Action Required 218 Controller The supercapacitor pack is nearing end-of-life. Replace on first occurrence. 233 HDD The indicated disk type is invalid Replace on first occurrence. and is not enabled in the current configuration. 235 Controller An enclosure management processor (EMP) detected a serious error. Replace on first occurrence. 239 CompactFlash A timeout occurs while flushing the CompactFlash. Replace the CompactFlash. 240 CompactFlash The CompactFlash hardware detected an error. Replace the CompactFlash. 242 CompactFlash The controller module's CompactFlash card has failed. Replace the CompactFlash. 246 Controller The coin battery is not present or is not properly seated or has reached end-of-life. Replace on first occurrence. 247 Various The FRU ID SEEPROM for the indicated FRU cannot be read. FRU ID data might not be programmed. Replace the FRU indicated. 307 Controller A temperature sensor on a Verify the following: controller FRU detected an • The system has two power supplies and both over-temperature condition that are running correctly caused the controller to • The ambient temperature is not too warm. The shutdown. enclosure operating range is 5–40°C (41°F–104°F) • Obstructions to airflow. • A module or blank plate resides in every module slot in the enclosure If none of these conditions apply, replace the controller module that reported the error. 313 Controller The indicated controller module Replace on first occurrence. has failed. This event can be ignored for a single-controller configuration. 314 Various The indicated FRU has failed or To determine whether the FRU must be replaced, is not operating correctly. This see Verifying component failure in the component's event follows some other replacement instructions document. FRU-specific event indicating a problem. 318 Controller A hardware failure occurred: XOR Replace on first occurrence. error. 319 HDD The indicated available disk has Replace on first occurrence. failed. 442 Controller Power-On Self-Test (POST) Replace on first occurrence. diagnostics detected a hardware error. 455 SFP The controller detected that the configured host-port link speed Replace with an SFP that supports a later speed. Events and LEDs 5 Table 1 Events requiring FRU replacements scenarios (continued) Event ID FRU Symptom Action Required exceeded the capability of an FC SFP. 464 SFP A user inserted an unsupported Replace with a supported SFP. cable or SFP into the indicated controller host port. 551 PSU See “Power Supply Faults and Recommended Actions” (page 25) LED Indications and Recommended Actions Table 2 LED Indications and recommended actions Indicator Status Cause Action Required The enclosure front panel Fault/Service Required LED Off System is functioning properly. No action required. Amber A fault condition Verify the following: exists/occurred. If installing • The LEDs on the back of an I/O module FRU, the the controller enclosure module has not gone online to narrow the fault to a and likely failed its self-test. FRU, connection, or both • The event log for specific information regarding the fault; follow any Recommended Actions • If installing an IOM FRU, try removing and reinstalling the new IOM, and check the event log for errors If the previous actions do not resolve the fault, isolate the fault, and contact an authorized service provider for assistance. Replacement is necessary. 6 Troubleshooting HPE P2000 G3 Table 2 LED Indications and recommended actions (continued) Indicator Status Cause Action Required The enclosure rear panel FRU OK LED On; blinking regularly System is functioning No action required. Wait for properly. System is booting. system to boot. Off The controller module is not Verify the following: powered on. The controller • The controller module is module has failed. fully inserted and latched in place, and the enclosure is powered on • The event log for specific information regarding the failure The enclosure rear panel Fault/Service Required LED Off System is functioning properly. No action required. Amber; blinking regularly • Hardware-controlled power-up error • Restart this controller from the other controller using the SMU or the CLI • Cache flush error • Cache self-refresh error • If the previous action does not resolve the fault, remove the controller and reinsert it If the previous actions do not resolve the fault, contact an authorized service provider for assistance. Replace the controller. Connected host port Host Link Status LED On; blinking regularly System is functioning properly. No action required. Events and LEDs 7 Table 2 LED Indications and recommended actions (continued) Indicator Status Cause Action Required Off The link is down. • Check cable connections and reseat if necessary • Inspect cables for damage. Replace cable if necessary • Defective cable causes fault. Swap cables to determine the fault. Replace cable if necessary • Verify that the switch, if any, is operating properly. If possible, test with another port • Verify that the HBA is fully seated, and that the PCI slot is powered on and operational • In the SMU, review event logs for indicators of a specific fault in a host datapath component; follow any Recommended Actions. • Contact an authorized service provider for assistance. A connected port Expansion Port Status LED On System is functioning properly. No action required. Off The link is down. • Check cable connections and reseat if necessary • Inspect cables for damage. Replace cable if necessary • Defective cable causes fault. Swap cables to determine the fault. Replace cable if necessary • In the SMU, review event logs for indicators of a specific fault in a host datapath component; follow any Recommended Actions. • Contact an authorized service provider for assistance A connected port Network On Port Link Status LED Off 8 Troubleshooting HPE P2000 G3 System is functioning properly. No action required. The link is down. Use standard networking troubleshooting procedures to isolate faults on the network. Table 2 LED Indications and recommended actions (continued) Indicator Status Cause Action Required The power supply Input Power Source LED On System is functioning properly. No action required. Off The power supply is not receiving adequate power. • Verify that the power cable is properly connected and check the power source to which it connects • Check that the power supply FRU is firmly locked into position • In the SMU, check the event log for specific information regarding the fault; follow any Recommended Actions • If the previous action does not resolve or isolate the fault, contact an authorized service provider for assistance The power supply Voltage/Fan Fault/Service Required LED Off System is functioning properly. No action required. On; blinking regularly The power supply unit or a fan is operating at an unacceptable voltage/RPM level, or has failed. When isolating faults in the power supply, remember that the fans in both modules receive power through a common bus on the midplane. If a power supply unit fails, the fans continue to operate normally: Verify the following: • The power supply FRU is firmly locked into position • The power cable is connected to a power source • The power cable is connected to the power supply module Isolating a host-side connection fault If data hosts are having trouble accessing the storage system, and you cannot locate a specific fault or cannot access the event logs, see Isolating a host-side connection fault section in the HPE P2000 G3 FC/iSCSI MSA System User Guide. For more troubleshooting information, see Troubleshooting chapter in the HPE P2000 G3 FC/iSCSI MSA System User Guide. Disk Drives • “Disk Drive LEDs” (page 10) • “Disk Error Conditions and Recommended Actions” (page 11) • “Disk Drive Bay Numbers” (page 13) Disk Drives 9 • “Disk Drive Leftover” (page 14) • “Vdisk or Disk Group Quarantined” (page 16) • “Disk Drive Failure during Reconstruct” (page 18) • “Vdisk or disk group Member Unavailable” (page 19) • “Two or Multiple Disk Drive Failure” (page 19) • “Vdisk or Disk Group Alerts ” (page 20) • “Vdisk or disk group Expansion Frequently Asked Questions (FAQs)” (page 20) Disk Drive LEDs Figure 1 2.5” SFF disk drive Figure 2 3.5” LFF disk drive Table 3 LEDs - Disk drive LEDs 10 LED Description 1 Fault/UID (amber/blue) 2 Online/Activity (green) Troubleshooting HPE P2000 G3 Table 4 LEDs - Disk drive combinations Online/Activity (green) Fault/UID (amber/blue) On Off Description Recommended Action Normal operation. The drive is online, No action required. but it is not currently active. Blinking irregularly Off The drive is active and operating normally. No action required. Off Offline; the drive is not being accessed. A drive error might be received for this device. Further investigation is required. • Check the event log for specific information regarding the fault Amber; blinking regularly (1 Hz) On Amber; blinking regularly (1 Hz) Blinking irregularly Amber; blinking regularly (1 Hz) Online; possible I/O activity. A drive error might be received for this device. Further investigation is required. • Isolate the fault • Contact an authorized service provider for assistance The drive is active, but a drive error might be received for this drive. Further investigation is required. Off Amber; solid Offline; no activity. A failure or critical fault condition is identified for this drive. This activity indicates that the drive is leftover rather than a problem with the drive itself. Off Blue; solid Offline. A management application (SMU) selects the drive. No action required. On or blinking Blue; solid The controller is driving I/O to the drive. A management application (SMU) selects the drive. No action required. Blinking regularly Off (1 Hz) Off Off CAUTION: Do not remove the drive. Removing a drive closes the current operation and causes data loss. The drive is rebuilding. No action required. Either there is no power, the drive is Check that the disk drive is fully offline, or the drive is not configured. inserted and latched in place, and the enclosure is powered on. Disk Error Conditions and Recommended Actions Table 5 Disk error conditions and recommended actions Symptom Recommended Action Event 8 reports that the RAID controller can no longer detect the disk. Reseat the disk. If it does not solve the problem, replace the disk. Event 8 reports a media error for the disk. If all the volumes and Vdisks or disk groups are online and available, then replace the disk. If the drive is marked as Leftover or Failed, and data recovery is needed, see “Using the Trust Command” (page 26). Event 8 reports a hardware error for the disk. If all the volumes and Vdisks or disk groups are online and available, then replace the disk. If the drive is marked as Leftover or Failed, and data recovery is needed, see “Using the Trust Command” (page 26). Event 8 reports an Illegal Request sense If all the volumes and Vdisks or disk groups are online and available, code for a command the disk supports. then replace the disk. Otherwise, if the drive is marked as Leftover or Disk Drives 11 Table 5 Disk error conditions and recommended actions (continued) Symptom Recommended Action Failed, and data recovery is needed, see “Using the Trust Command” (page 26). Event 8 reports that RAID-6 logic intentionally failed the disk. Replace the disk. Event 55 reports a SMART error for the disk. If all the volumes and Vdisks or disk groups are online and available, then replace the disk. If the drive is marked as Leftover or Failed, and data recovery is needed, see “Using the Trust Command” (page 26). At the time, a disk failed, the dynamic spares feature was enabled, and a properly sized disk was available to use as a spare. No action required; the system automatically uses that disk to reconstruct the Vdisk or disk group. Replace the failed disk when reconstruction has completed. At the time, a disk failed, the dynamic spares feature was enabled but no properly sized disk was available to use as a spare. The best option is to supply an appropriately sized spare so the system can automatically use the new disk to reconstruct the Vdisk or disk group, and replace the failed disk once reconstruction has completed. If that is not possible, and all Vdisks or disk groups are online and your data is available, then replace the failed disk with an appropriately sized disk. Assign the replacement disk as a spare unless dynamic sparing is enabled. At the time, a disk failed, the dynamic spares feature was disabled and no dedicated spare or properly sized global spare was available. The best option is to supply an appropriately sized disk using the SMU or CLI. Assign it as a dedicated Vdisk or disk group spare and let the system automatically reconstruct the Vdisk or disk group, then replace the failed disk once reconstruction is complete. If that is not possible, and all Vdisks or disk groups are online and the data is available, then replace the failed disk with an appropriately sized disk. Assign the replacement disk as a spare. The status of the Vdisk or disk group that Use SMU to assign the new disk as either a global spare or a Vdisk or originally had the failed disk status is disk group spare. Good. A global Vdisk or disk group (dedicated) spare has been successfully integrated into the Vdisk or disk group and the replacement disk can be assigned as either a global spare or a Vdisk or disk group spare. The status of the disk installed is LEFTOVR All the member disks in a Vdisk or disk group contain metadata in the first sectors. The storage system uses the metadata to identify Vdisk or disk group members after restarting or replacing enclosures. This disk was a member of a Vdisk or disk group on another system, and that Vdisk or disk group does not exist on this system. Clear the drive metadata if it is not required for a Vdisk or disk group from another array. If the status of the Vdisk or disk group that originally had the failed disk status is OFFL (Offline) or QTOF (Quarantined Offline), one or more disks have failed in a RAID-0 Vdisk or disk group. Two or more disks have failed in a RAID-1, 3, or 5 Vdisk or disk group. Three or more disks have failed in a RAID-6 Vdisk or disk group. The data in the Vdisk or disk group is inaccessible and at risk. Attempt to recover by reseating the last failed drives. If that does not resolve the issue, and data recovery is needed, see “Using the Trust Command” (page 26). The status of the Vdisk or disk group that Wait for the Vdisk or disk group to complete its operation. Do not remove originally had the failed disk indicates that any leftover drives until the rebuild is complete. the Vdisk or disk group is being rebuilt. The status of the Vdisk or disk group that If this status occurs after you replace a defective disk with a known good originally had the failed disk is CRIT disk, the enclosure midplane might have experienced a failure. Replace (Critical) or QTDN (Quarantined with a the enclosure. down disk). 12 Troubleshooting HPE P2000 G3 Disk Drive Bay Numbers Figure 3 MSA 2040 SAN Array SFF enclosure Figure 4 MSA 2040 SAN Array LFF or supported drive enclosure Drive Status Table 6 Drive Status Status Displayed in the CLI or SMU Description Available AVAIL The drive is available for use. Failed FAILED The drive is unusable and is replaced. Reasons for this status include excessive media errors, SMART errors, drive hardware failures, unsupported drives. Global spare GLOBAL SP The drive is assigned as a global spare. Leftover LEFTOVR The drive is leftover. It is missing during a rescan of drives or it failed due to Unrecoverable Read Errors (UREs), SMART issues, or other errors. Disk Drives 13 Table 6 Drive Status (continued) Status Displayed in the CLI or SMU Description Vdisk or disk group Vdisk or disk group The drive is used in a Vdisk or disk group. Vdisk or disk group Spare Vdisk or disk group SP The drive is a spare assigned to a Vdisk or disk group. Disk Drive Leftover Table 7 Hard disk drive leftover scenarios Symptom Cause Action Required A disk drive is marked as LEFTOVR. Due to MEDIUM/SMART/PROTOCOL/ Check if any Vdisks or disk groups are OFFL; if disks are offline, see “Using I/O TIMEOUT errors. the Trust Command” (page 26). If all Vdisks or disk groups are online, consider replacing the disk drive as an option to resolve the issue. 14 Multiple drives all in one enclosure are marked as LEFTOVR. In earlier versions of P2000 G3 firmware, if a Vdisk was spanned across multiple enclosures, and all drives in an enclosure other than the enclosure numbered 1 went missing due to a site issue, a power loss or power supply issue, or loose cabling, the Vdisk may go OFFL. Once the issue is corrected and the disk drives are detected, those disks are marked as LEFTOVR. It is not necessary to replace all the disk drives that are marked LEFTOVR, as it is not a disk drive issue. Resolve the site, power or cabling issue. See “Using the Trust Command” (page 26). A disk drive in a redundant Vdisk or disk group (in this scenario, other than RAID 6) fails and is marked as LEFTOVR and the Vdisk or disk group status is CRIT. A spare disk is used and reconstruction begins. Before the reconstruction completes, if another disk drive from the same Vdisk or disk group (same sub-Vdisk or subgroup in RAID 10 or RAID 50) fails, the status of the second failed disk drive along with the spare disk is marked as LEFTOVR. The Vdisk or disk group is OFFL or QTOF (quarantined offline). Attempt to dequarantine the Vdisk or disk group by reseating the last failed drives. If unable to dequarantine the Vdisk or disk group, see“Using the Trust Command” (page 26). A disk drive in a RAID 6 Vdisk or disk Vdisk or disk group status is FTDN, group (with 1 spare) fails and is a spare is invoked, and reconstruction marked as LEFTOVR. begins. Before the reconstruction completes, if another member of the NOTE: In cases where two/multiple same Vdisk or disk group fails, its spares are available and a drive fails, status is marked as LEFTOVR and the drive is marked as LEFTOVR and the Vdisk or disk group status is the Vdisk or disk group is FTDN. A CRIT. The reconstruction for the first spare is invoked and reconstruction failed drive does not stop but begins. If another member of the continues to completion. Once the same Vdisk or disk group fails before reconstruction completes, the status the reconstruction completes, the of the Vdisk or disk group is back to spare for the second failed drive is FTDN. not invoked despite available multiple spares. Once the status of the Vdisk or disk group changes to FTDN, another spare is invoked for this Vdisk or disk group and reconstruction begins. Consider replacing the drives if the drives failed due to hardware issues. If the Vdisk or disk group go QTOF after the failure of a third Vdisk or disk group member and before the reconstruction completes, attempt to dequarantine the Vdisk or disk group by reseating the last failed drives. If unable to dequarantine the Vdisk or disk group, consider “Using the Trust Command” (page 26). Troubleshooting HPE P2000 G3 Table 7 Hard disk drive leftover scenarios (continued) Symptom Cause Action Required Reconstruction stops for a RAID 6 Vdisk when a second drive fails In earlier versions of P2000 G3 firmware, if a RAID 6 Vdisk experiences one drive failure, it would use a spare to reconstruct the Vdisk. But if a second drive fails during the reconstruction and no more spares are available, then the reconstruction stops, the Vdisk status becomes CRIT, and the second failed drive is LEFTOVR. If one more member of the same Vdisk fails and is marked LEFTOVR, the Vdisk goes OFFL. Clear the disk metadata on the drive used for reconstruction. Replace the second failed hard drive if possible, and set both to be spares. Allow the reconstruction to restart. If the vdisk goes OFFL, see“Using the Trust Command” (page 26). Reconstruction restarts for a RAID 6 In earlier versions of P2000 G3 Wait for the Vdisk reconstruction to Vdisk when a second drive fails. firmware, if a RAID 6 Vdisk complete. If the Vdisk goes OFFL, see experiences one drive failure, it would “Using the Trust Command” (page 26). use a spare to reconstruct the Vdisk. If a second drive fails during the reconstruction and a spare is available, then the reconstruction restarts from 0%, and the second failed drive is LEFTOVR. If one more member of the same Vdisk fails and is marked LEFTOVR, the Vdisk goes OFFL. Disk Drives 15 Vdisk or Disk Group Status Table 8 Vdisk or disk group status Status Displayed in the CLI or SMU Description Critical CRIT The Vdisk or disk group is online; however, some drives are down and the Vdisk or disk group is not fault tolerant. Fault Tolerant with down drives FTDN The Vdisk or disk group is online and fault tolerant; however, some drives are down. Fault Tolerant and online FTOL The Vdisk or disk group is online and fault tolerant. Offline OFFL The Vdisk or disk group is offline because it is using offline initialization or the drives are down and data is lost. Quarantined critical QTCR The Vdisk or disk group is in a critical state and quarantined because some drives are missing. Quarantined offline QTOF The Vdisk or disk group is offline and quarantined because some drives are missing. Quarantined with down drives QTDN The Vdisk or disk group is offline and quarantined because at least one drive is missing; however, the Vdisk or disk group is accessed and is fault tolerant. Example: One drive is missing from a RAID-6. Up UP The Vdisk or disk group is online and does not have fault-tolerant attributes. Vdisk or Disk Group Quarantined If a virtual disk group is quarantined, see Accessing Hewlett Packard Enterprise Support. Do not dequarantine Virtual disk groups without the assistance of knowledgeable support personnel. Table 9 Quarantine during array boot (available in all FW versions) 16 Level Symptom Action Required Raid 5 During boot, multiple disk drives from the Vdisk or disk group go missing and the Vdisk or disk group status is marked as QTOF. Raid 6 During boot, more than two disk drives go missing from the Vdisk or disk group and the Vdisk or disk group status is marked as QTOF. The system de-quarantines the Vdisk or disk group once the missing disk drives are recognized after a rescan. If the Vdisk or disk group is not de-quarantined, a manual de-quarantine is performed which brings the Vdisk or disk group status to OFFL (if the drives are not recognized during a rescan). If the status of the Vdisk or disk group changes to Troubleshooting HPE P2000 G3 Table 9 Quarantine during array boot (available in all FW versions) (continued) Level Symptom Action Required Raid 10 During boot, both the disk drives from the same OFFL, follow the troubleshooting steps “Using the sub-Vdisk or subgroup go missing and the Vdisk or Trust Command” (page 26) for solution. disk group status is marked as QTOF. One disk drive goes missing from each of the sub-Vdisks or subgroups and the Vdisk or disk group status is CRIT. Raid 50 During boot, multiple disk drives from the same sub-Vdisk or subgroup go missing and the Vdisk or disk group status is marked as QTOF. One disk drive goes missing from each of the sub-Vdisk or subgroups and the Vdisk or disk group status is CRIT. N/A The wrong controller took ownership of the Vdisk or disk group on boot. If controller A owns the Vdisk or disk group, for example, before booting the controller was removed or failed to boot up, the other controller owns the Vidisk, controller B. Because the last known cache and other Vdisk or disk group information is not available on the current controller, the Vdisk or disk group has been quarantined. Shutdown the system, remove the controller that took ownership of the Vdisk or disk group, insert the other controller, and boot. If the controller that was the previous owner is not available, then manually dequarantine the Vdisk or disk group. NOTE: • In the previous scenarios, the drives that went down/missing are not spares or reconstruction targets. • In the previous scenarios, if the disk drives go missing after a re-scan (firmware T210 and earlier), the Vdisk status is marked as OFFL as there is no Quarantine upon rescan feature available. See “Using the Trust Command” (page 26). Table 10 Quarantine upon rescan Level Symptom Action Required Raid 5 A disk drive is marked as LEFTOVR or go missing, making the Vdisk or disk group status as CRIT. Another disk drive fails or is missing, the Vdisk or disk group is marked as QTOF. If the disk drives are: Raid 6 Raid 10/50 • Recognized during a rescan, the system will quarantine the Vdisk or disk group for two minutes • Not recognized during a rescan, perform a manual A disk drive is marked as LEFTOVR or go missing de-quarantine to change the Vdisk or disk group making the Vdisk or disk group status as FTDN. status to OFFL. Another disk drive from the same Vdisk or disk group • If the Vdisk or disk group status changes to OFFL, is marked as LEFTOVR or go missing, the status of see “Using the Trust Command” (page 26). the Vdisk or disk group is CRIT. One more disk drive from the Vdisk or disk group is marked as LEFTOVR or go missing, the Vdisk or disk group is marked as QTOF. A disk drive is marked as LEFTOVR or go missing, making the Vdisk or disk group status as CRIT. Another disk drive from the same sub-Vdisk or subgroup fails or is missing the Vdisk or disk group is marked as QTOF. Disk Drives 17 NOTE: • In the previous scenarios, if the disk drives fail or go missing or both during a boot and also upon a rescan, the status of the Vdisk or disk group is marked as QTOF. • In the previous scenarios, the drives that went down or missing are not spares or reconstruction targets. Disk Drive Failure during Reconstruct A compatible spare has a capacity equal to or greater than the smallest disk in the Vdisk or disk group and is the same disk type (SAS or SATA). Table 11 RAID-6 reconstruction scenarios Symptom Action Required To start reconstruction manually, replace each failed disk, and then do one of the following: No compatible spares are available, reconstruction does not start automatically • Add each new disk as either a dedicated spare or a global spare. Remember that a different critical Vdisk or disk group might take the global spare than the one you intended. When a global spare replaces a disk in a Vdisk or disk group, the global spare’s icon in the enclosure view changes to match the other disks in that Vdisk or disk group. • Enable the Dynamic Spare Capability option to use the new disks without designating them as spares. One disk fails and a compatible spare is The system uses the spare to reconstruct the Vdisk or disk group. If a available second disk fails during reconstruction, reconstruction continues until it is complete, regardless of whether a second spare is available. If the spare fails during reconstruction, reconstruction stops. Two disks fail and only one compatible spare is available The system waits five minutes for a second spare to become available. After five minutes, the system uses the spare to reconstruct one disk in the Vdisk or disk group (referred to as fail 2, fix 1 mode). If the spare fails during reconstruction, reconstruction stops. Two disks fail and two compatible spares The system uses both spares to reconstruct the Vdisk or disk group. If are available one of the spares fails during reconstruction, reconstruction proceeds in fail 2, fix 1 mode. If the second spare fails during reconstruction, reconstruction stops. Table 12 LEDs - Disk drive LEDs Symptom Cause Fault/UID LED illuminates amber A disk fails. Online/Activity LED illuminates green A spare is used as a reconstruction target. NOTE: Depending on the Vdisk or disk group RAID level and size, disk speed, utility priority, and other processes running on the storage system, reconstruction can take hours or days to complete. You can stop reconstruction only by deleting the Vdisk or disk group. 18 Troubleshooting HPE P2000 G3 Vdisk or disk group Member Unavailable RAID Level Symptom Action Required NRAID Vdisk or disk group status is OFFL (Offline). RAID 0 Vdisk or disk group status is OFFL (Offline). A single drive failure causes the Vdisk or disk group to go OFFL and Trust cannot recover the data. Data is restored from the backup. RAID 1 Vdisk or disk group status is CRIT If a spare is already available (global or dedicated to the (Critical). Vdisk or disk group), it is used for the Vdisk or disk group in CRIT or FTDN state and reconstruct begins. Replace the failed disk drive if necessary. If a spare is not available, consider the option to replace the failed disk drive or clear meta-data on LEFTOVR drive and add it as a spare. RAID 5 RAID 10 RAID 50 RAID 6 Vdisk or disk group status is FTDN (Degraded). Two or Multiple Disk Drive Failure RAID Level RAID 1 RAID 5 RAID 6 Symptom Description/Notes Action Required Two disk drive failures in RAID 1 make it QTOFor OFFL. RAID 1 supports a maximum of two If the Vdisk goes QTOF, the Vdisk disk drives. gets de-quarantined automatically after the drives are recognized. If the Vdisk does not get de-quarantined Failure of two or more disk NA automatically, run a manual drives in a RAID 5 Vdisk or de-quarantine to bring the Vdisk to disk group makes the Vdisk OFFL state; once the Vdisk or disk or disk group QTOFor group is in the OFFL state, see OFFL. “Using the Trust Command” (page 26). Failure of two disk drives in If multiple spares are available for a RAID 6 Vdisk or disk the Vdisk or disk group, two disk group makes the Vdisk or drive reconstruction begins. disk group CRIT. Failure of multiple disk NA drives (greater than two) in a RAID 6 Vdisk or disk group makes the Vdisk or disk group QTOFor OFFL. RAID 10 Failure of both the disk drives in the same sub-Vdisk or subgroup makes the Vdisk or disk group QTOF NA RAID 10, RAID 50 Failure of two or more drives, all in different sub-Vdisks or subgroups. In a RAID sub-Vdisk or subgroup, if one drive fails, the sub-Vdisk or subgroup status, and therefore the Vdisk or disk group status, is CRIT. If two or more sub-Vdisks or subgroups have only one drive failed, the status of the Vdisk or disk group is still CRIT. Available spares will be used for reconstruction. Disk Drives 19 Vdisk or Disk Group Alerts Symptom Action Required Vdisk or disk group is critical • For RAID types other than RAID 6, see “Vdisk or disk group Member Unavailable” (page 19) and “Two or Multiple Disk Drive Failure” (page 19) • For RAID 6, see “Two or Multiple Disk Drive Failure” (page 19) Vdisk or disk group is quarantined See “Vdisk or Disk Group Quarantined” (page 16). Vdisk or disk group is offline See “Using the Trust Command” (page 26). Vdisk or disk group reconstruction failed See “Disk Drive Failure during Reconstruct” (page 18). Vdisk or disk group Expansion Frequently Asked Questions (FAQs) Question Answer During a Vdisk or disk group expansion, multiple disk drives go missing or are failed while the Vdisk or disk group is quarantined. Will the expansion continue once the disk drives return and the Vdisk or disk group de-quarantines? Yes, expansion resumes once the Vdisk or disk group is automatically de-quarantined. During the expansion of a Vdisk or disk group (other than RAID 6), if a drive fails making the Vdisk or disk group CRIT and a spare is available can a reconstruction happen? No, the spare is not used and the reconstruction does not begin immediately after the drive fails. Reconstruction begins only after the completion of the expansion. A backup is recommended in case there is further drive failures. NOTE: Avoid using the manual de-quarantine option as it takes the Vdisk or disk group OFFL, which causes data loss. During RAID 6 Vdisk or disk group expansion if two drives Two drive reconstructions are performed after the fail, can it do two drive reconstruction? expansion completes. A backup is recommended in case there is further drive failures. Is it possible to create a backup of a Vdisk or disk group Yes, backup of a Vdisk or disk group is performed during while an expansion is underway? Vdisk or disk group expansion. NOTE: Best practice is to perform a backup of the data before initiating the expansion of a Vdisk or disk group. During a Vdisk or disk group expansion, if the Vdisk or disk group go OFFL, can we use TRUST to bring the Vdisk or disk group online? If so will the expansion then proceed? Trust cannot be performed if a Vdisk or disk group go OFFL during Vdisk or disk group expansion. Expansion is aborted and any data on the Vdisk or disk group is irrevocably lost. During a Vdisk or disk group expansion, if the controller Yes, Vdisk or disk group expansion continues after the owning the Vdisk or disk group crashes, will the expansion Vdisk or disk group is failed over to the partner controller. continue after the Vdisk or disk group is failed over to the partner controller? Can we create a secondary volume and create snapshots We can create a secondary volume and snapshots are of a volume that is a member of a Vdisk or disk group created for a Vdisk or disk group under expansion. under expansion? What about other activities like All the other activities work fine. replication? Can the volumes of an expanding Vdisk or disk group be Yes, the volumes can be remapped. remapped to different hosts? For information on replacing the drive modules, see HPE MSA2000/P2000 Drive Module Replacement Instructions Guide on the Hewlett Packard Enterprise Support Center website. 20 Troubleshooting HPE P2000 G3 Management Controller Issue • “Restarting the Management Controller” (page 21) • “Instances when the Management Controller becomes Unresponsive” (page 22) Restarting the Management Controller The Management Controller (MC) is the array component that is responsible for functions related to the management of the array. The Storage Controller (SC) is responsible for handling Host IO to the array. Both the MC and SC reside within the same physical controller module. Some Management Controller functions are: • Managing array user accounts • Managing array logins • Managing array configuration information Table 13 Management Controller scenarios Symptom Action Required SC component is functioning • Restart MC from the other controller, if possible. properly and all Host I/O are being • If that does not resolve the issue, restart the array. handled correctly while the MC component is not accepting a user login A storage controller might be replaced or power-cycled due to an unresponsive Management Controller port. Log in to the Management Controller on the other storage controller and restart the unresponsive Management Controller. For example, an array with the following setup: Controller A—IP 15.5.224.9 Controller B—IP 15.5.224.10 Verify the network connectivity by issuing a ping to the MC IP address of the controller. Ensure that an issue does not exist with the intranet on which the MSA is installed. # ping 15.5.224.9 Reply from 15.5.224.9 bytes=32 time60 nl nl If the ping command does not return a valid response, investigate the network or cabling. If controller A responds to the ping but is not enabling logins via the CLI or SMU interface, log in to controller B via the CLI and attempt to restart the MC on controller A. The following example shows a login and restart of a controller through the CLI interface: #restart mc A Continue? yes <enter return> Success: MC A restarted. There may be up to a minute delay between the Continue prompt and hitting the Yes <enter return> to when the Success information is displayed. Wait a few moments after the Success information is displayed for the restart process to complete and then try to log in to controller A. If the Management Controller remains unresponsive, shutdown and restart controller A using the opposite controller B, then recheck the Management Controller. NOTE: During restart, you will briefly lose communication with the specified management controllers: From controller B: #shutdown A Wait for A to shutdown Management Controller Issue 21 Table 13 Management Controller scenarios (continued) Symptom Action Required #restart sc a < this will restart the Management Controller on controller A as well >. NOTE: In a single controller environment, if the CLI command is not successful, quiesce all host I/Os before restarting the SC of the controller. Instances when the Management Controller becomes Unresponsive Table 14 Instances when the Management Controller becomes Unresponsive Symptom Cause Action Required The following messages regularly appear in the array logs: Firmware Flash is blocked, upgrade retries are evident. For systems that are in a firmware upgrade loop: 1. Backup and verify all data on the system. 2. Run the Online ROM Flash Component again to upgrade both the controllers. 3. Use the FTP option to upgrade both controllers: • 237 INFORMATIONAL Firmware update progress: The firmware was verified. (from MC: yes, for MC: no) (p1: 1, p2: 1, p3: 0, p4: 0) • 237 INFORMATIONAL Firmware update progress: The SC loader was updated. Saved in primary location in flash. The firmware is the same so it was not flashed. (p1: 2, p2: 1, p3: 0, p4: 0) • 237 INFORMATIONAL Firmware update progress: The SC app was updated. Saved in primary location in flash. The firmware is different so it was flashed; flashed successfully. (p1: 4, p2: 1, p3: 1, p4: 0) • 237 INFORMATIONAL Firmware update progress: The memory-controller FPGA was updated. Saved in primary location in flash. The firmware is the same so it was not flashed. (p1: 3, p2: 1, p3: 0, p4: 0) 22 Troubleshooting HPE P2000 G3 ftp> put firmware_file flash 4. Allow firmware upgrade to complete fully. 5. Use either SMU or the CLI, check, and confirm all firmware versions for the Management Controller and Storage Controller matches between both controllers Table 14 Instances when the Management Controller becomes Unresponsive (continued) Symptom Cause The following sequence of messages Management Controller and Storage appears in the array logs: Controller are unable to communicate. • 152 INFORMATIONAL The Storage Controller is not receiving data from the Management Controller. (This is normal during firmware update.) • 153 INFORMATIONAL The Storage Controller resumed communications with the Management Controller. Action Required For systems that have a communication error between the MC and SC, perform any one of the following steps: • Run the following CLI command: #restart mc A Continue? yes <enter return> Success: MC A restarted. • Restart the MC using SMU. • 152 INFORMATIONAL The Storage Controller is not receiving data from the Management Controller. NOTE: • 156 WARNING The Management Controller was restarted by the Storage Controller. • In some cases, if the restart function previously noted does not clear the MC hang issue, you might attempt to reseat the offending controller. Before reseating the offending controller, you must perform a couple of safety measures: 1. Get and verify a backup of the data on the array. 2. If possible, quiesce the I/O to the array. 3. Shutdown the controller, for example, shutdown a|b|both • During the restart process, you will briefly lose communication with the specified management controllers • 139 INFORMATIONAL The Management Controller booted up. MC firmware version: XXXXXXXX (baselevel: XXXX) • 153 INFORMATIONAL The Storage Controller resumed communications with the Management Controller. • 152 INFORMATIONAL The Storage Controller is not receiving data from the Management Controller. (This is normal during firmware update.) Alternatively, shutdown the controller using the SMU. Shutting down both the controllers is an offline activity. • 153 INFORMATIONAL The Storage Controller resumed communications with the Management Controller The configuration information is lost, array management, event messaging, and logging cease to function, but the host I/O continues to operate normally 4. Reseating a controller with active I/O in a dual controller array will cause I/O failover to the other controller. In a single controller array, this reseating should be scheduled during a maintenance window. In both single and dual controller environments, when the MC fails, the result may be a loss of the controller configuration information on the affected controller. The array cannot be managed from the impacted controller, but through the partner controller if available. The configuration information is stored in a non-volatile memory. Resetting or powering off the controller will not resolve the error. The controller must be replaced to resolve this error. 1. Log on to SMU or CLI/telnet. Single Controller Environment—If successful, the MC will be up, and you can proceed with the firmware upgrade. Dual Controller Environments—If successful, the MC will be up. Check the MC on the other controller to confirm it is accessible. a. Halt all host I/O before upgrading the controller. b. The firmware upgrade can be performed using either FTP or SMU. Management Controller Issue 23 Table 14 Instances when the Management Controller becomes Unresponsive (continued) Symptom Cause Action Required If unsuccessful, the MC is suffering from the firmware issue. a. Single Controller Environment—Halt all Host I/O and power cycle the array. If access is restored, proceed with the firmware upgrade. Dual Controller Environments—Halt all Host I/O and power cycle the array. If access is restored, verify that MC on the second controller is accessible or operational. b. If access is not restored, replace the controller and confirm that the new controller is at a new code version or upgrade if necessary. 2. Once the firmware upgrade is complete, verify the MC code version has upgraded correctly. 3. If you cannot log onto SMU or telnet after the firmware upgrade, you will need to replace the controller. For information on replacing the Controller Module, see HPE MSA Controller Module Replacement Instructions Guide on the Hewlett Packard Enterprise Support Center website. Power Supply Faults and Power Cycle • “Power Cycle” (page 24) • “Power Supply Faults and Recommended Actions” (page 25) Power Cycle Under normal operations because of the redundant nature of the MSA array, the system should not require a full system power cycle. Restarting each controller independently results in the same outcome as a full power cycle, while still maintaining host connectivity. On the front of every MSA is an Enclosure ID LED. This LED will aid in identifying the Controller shelf. The Controller Shelf will report as "1" and any subsequent enclosures will have higher numbered IDs. If a power cycle is needed, follow these steps: 1. Shut down, dismount, or unmap the hosts (servers) with access to the MSA volumes. This allows any pending writes in Application cache or Server memory to flush to the MSA controllers write-back cache. 2. Shut down both array controllers in a dual controller system or a single controller system by using SMU or CLI interface. This allows the controllers to flush the controller write cache to the disks. 24 Troubleshooting HPE P2000 G3 NOTE: Never remove the power cords from the Power Supplies on the RAID enclosure without properly shutting down the array controllers first. Shutting down the controllers allows for any write cache in the controller modules to be properly flushed down to the Hard Drives. Removing power from single JBODs may adversely affect Vdisk or disk groups that are shared between enclosures. 3. JBOD Enclosures only need to be powered off for routine maintenance. The JBODs must be powered off only after the controllers are properly shut down using the CLI or SMU. To power off the JBOD enclosures either remove the power cables or turn off the switch, whichever step is applicable for your power supplies. Perform the reverse steps when restoring power to the Storage infrastructure. Power on the JBOD enclosures from the bottom to the top, allowing for the drives in each JBOD to power up before powering the next JBOD enclosure. If the enclosures are powered off, wait a minimum of one minute after powering on the last JBOD enclosure before powering up the MSA controllers. This allows the Hard Drives enough time to spin up before the array controllers come online. CAUTION: Never power cycle an array without knowing the status of the controllers. Removing power from a controller processing host IO could have negative consequences to data integrity. NOTE: Power on the JBOD enclosures from the bottom enclosure to the top, allowing enough time for the drives to spin up. The 6TB and 8TB drives take time. Power Supply Faults and Recommended Actions Table 15 Power supply faults and recommended actions Symptom Recommended Action Power supply warning or failure. Related Event code 551. Verify the following: • All the power supply units are working. • No slots are left open for more than two minutes. If you might need to replace a module, leave the old module in place until you have the replacement, or use a blank cover to close the slot. Leaving a slot open negatively affects the airflow and might cause the unit to overhead. • The controller modules are properly seated in their slots and that their latches are locked. Power supply module status is listed as failed or you • If the power supply module has a switch, ensure that receive a voltage event notification. Related Event code the switch is turned on. 551. • Ensure that the power cables are firmly plugged into both power supply and into an appropriate functional electrical outlet. • Replace the power supply module. Power LED is off. • If the power supply module has a switch, ensure that the switch is turned on. • Ensure that the power cables are firmly plugged into both power supply and into an appropriate electrical outlet. • Replace the power supply module. Voltage/Fan Fault/Service Required LED is on. • Replace the power supply module. Power Supply Faults and Power Cycle 25 For information on replacing the DC Power and Cooling Module, see HPE DC Power and Cooling Module Replacement Instructions Guide on the Hewlett Packard Enterprise Support Center website. Chassis Replacement Table 16 Chassis replacement scenarios Symptom Action Required • A FRU fails to seat properly, cannot be fully inserted, Where a problem exists with the physical chassis itself, or once inserted will not slide all the way into the slot. replacement is necessary. Verify that a physical problem does not exist with the specific FRU before replacing the • The locking mechanism on a FRU is fully closed and chassis. A mechanical issue of this type will not require the FRU does not lock in place or locked down log evaluation. properly. • A chassis with a defect that does not allow a FRU to be installed. An issue relating to the midplane A chassis must be replaced. To determine if the midplane is the problem, collect and investigate the array logs. Also, look for a 314 midplane FRU notification in array events and a corresponding 247 related event. NOTE: • Avoid unnecessary chassis replacement. Other than a mechanical failure, it is rare to have a chassis or midplane issue requiring replacement. • A FRU is any HPE orderable replacement part. For information on replacing the Chassis, see HPE Chassis Replacement Instructions Guide on the Hewlett Packard Enterprise Support Center website. Using the Trust Command The trust command re-synchronizes time and date metadata on the drives, making Leftover or Failed drives that were once members of the Vdisk or disk group, active members of the Vdisk or disk group again. Use the trust command when a Vdisk or disk group is Offline and there is no data backup, or to recover the current data on a Vdisk or disk group. In these cases, the trust command works, only if the drives continue to operate. Use the trust command as follows: 26 Troubleshooting HPE P2000 G3 1. 2. 3. After the trusted Vdisk or disk group is back online, back up the Vdisk or disk group and verify that the data is valid. Delete the trusted Vdisk or disk group, replace any drive that was a member of the trusted Vdisk or disk group that Failed or went Leftover because of Unrecoverable Read Errors (UREs), SMART issues, or other errors, and then create a Vdisk or disk group. Restore data from a valid backup to the new Vdisk or disk group. CAUTION: • The trust command can cause permanent data loss and unstable Vdisk or disk group operation. • Using the trust command on a Vdisk or disk group is a disaster-recovery measure only; the Vdisk or disk group has no tolerance for any additional failures and must never be put back into a production environment. • After the trust command is issued on a Vdisk or disk group, additional disaster recovery troubleshooting steps are limited. • The trust command must be used only if the Vdisk or disk group has an Offline status; the trust command will not run on a Vdisk or disk group with another status. • Do not use the trust command when the storage system is unstable, during power events, or events where enclosures or drives are added or removed unintentionally. • Do not attempt to run the trust command on a Vdisk or disk group that is Quarantined critical, Quarantined offline, or Quarantined with down drives. • When the Vdisk or disk group is Offline: ◦ Never update the controller-module, expansion-module, or drive firmware. ◦ Never clear unwritten cache data. • You cannot use the trust command on a Vdisk or disk group that went Offline during Vdisk or disk group expansion. • You cannot use the trust command on a Vdisk or disk group with a Critical status. Instead, add spares and let the system reconstruct the Vdisk or disk group. NOTE: • The trust command is used as a last step in a disaster recovery situation, and must be performed only by someone who knows how to use the trust command and has experience with Vdisk or disk group configurations and reviewing array logs. • If you are not sure to take the corrective action, contact the Hewlett Packard Enterprise Support Center for further assistance. Running the trust command 1. 2. 3. Disable the background scrub. For more information, see CLI User Guide. Identify the cause for the Vdisk or disk group going Quarantined offline or Offline. If an external issue, such as power failure or cable failures or disconnections, caused the Vdisk or disk group to go Quarantined offline or Offline, fix the external issue. Return the array to a stable state before continuing to the next step. Using the Trust Command 27 4. Determine the drives to be used with the trust command. • If the Vdisk or disk group went Quarantined offline or Offline during a reconstruction, determine if the drive that caused the Vdisk or disk group to go Quarantined offline or Offline went Leftover or Failed. ◦ If the drive went Leftover and was not experiencing UREs, SMART issues, or other errors, unseat the spare drives that were brought into the Vdisk or disk group for reconstruction. ◦ If the drive Failed or went Leftover due to UREs, SMART issues, or other errors (from now on known simply as the failed drive), determine whether to use the partially reconstructed target drive rather than the failed drive. For example, if the reconstruction is mostly done and the failed drive had numerous errors before failing, you may recover more data using the partially reconstructed drive. Even though the data is not complete, because the failed drive may not stay up long enough to be useful. To determine how far the reconstruction has gone, consider the size of the Vdisk or disk group, the type of drives, and review the array logs to determine when the reconstruction started and the time a drive failure occurred. Since using the trust command can irrevocably destroy data, or if you cannot determine the correct drives to use, or the status of the reconstruct, contact Hewlett Packard Enterprise Support Center for assistance. • If the Vdisk or disk group went Quarantined offline or Offline, but not during a reconstruction, note the drives that caused the Vdisk or disk group to most recently go Fault Tolerant with a down disk, Critical, and Quarantined offline or Offline. Unseat the drives that caused the Vdisk or disk group to go Fault Tolerant with a down disk and Critical. ◦ For a RAID-5 Vdisk or disk group, unseat the first Failed or Leftover drive that led to the Vdisk or disk group going Quarantined offline or Offline, according to the logs. ◦ For a RAID-6 Vdisk or disk group, unseat the first two Failed or Leftover drives that led to the Vdisk or disk group going Quarantined offline or Offline, according to the logs. ◦ For a RAID-50 Vdisk or disk group, if you know the sub-Vdisk details, unseat the first Failed or Leftover drive that led to the RAID-5 sub- Vdisk failing, according to the logs. If you do not have the sub-Vdisk details, contact Hewlett Packard Enterprise Support Center for assistance. If you are uncertain about the order the drives Failed or went Leftover, or which drives were added into the Vdisk or disk group for reconstruction, contact Hewlett Packard Enterprise Support Center for further assistance. IMPORTANT: If a drive Failed or went Leftover due to UREs, SMART issues or other errors, DO NOT attempt to clear the metadata on the drive and reuse the drive. Failed drives and drives that went Leftover due to UREs, Smart issues, or other errors are unstable and must be replaced. 28 Troubleshooting HPE P2000 G3 5. Prevent all automatic reconstruction actions on a trusted Vdisk or disk group by following these steps: • Unseat the spare drives associated with the Vdisk or disk group. For more information, see CLI User Guide. • Unseat or delete global spares. For more information, see CLI User Guide. • Turn off the dynamic spares feature. Reconstruction causes heavy usage of drives that were already reporting errors. This usage can cause drives to fail during reconstruction, which can cause data to be unrecoverable. 6. 7. 8. Reseat the remaining affected drives. Using the CLI, run the enable trust command. Run the trust command on the Vdisk or disk group. If you get the following warning on the CLI while running the Trust command, contact the Hewlett Packard Enterprise Support Center for further assistance: Error: The trust operation failed because the disk group has an insufficient number of in-sync disks. - Please contact Support for further assistance. Or WARNING! Found out-of-sync disk(s). Using these disks for trust might cause data corruption. nl Do you want to include the out-of-sync disk(s)? nl • yes or y: Allows the command to proceed, using these disks. nl • no or n: Allows the command to proceed, without using these disks. nl • abort or a: Cancels the command. Or WARNING! Found partially reconstructed disk(s).Using these disks for trust might cause data corruption. nl Do you want to include the partially reconstructed disk(s)? nl • yes or y: Allows the command to proceed, using the out-of-sync disks. nl • no or n: Allows the command to proceed, without using the out-of-sync disks. nl • abort or a: Cancels the command. After running the trust command 1. 2. 3. 4. 5. Perform a complete backup of the Vdisk and validate all backed-up data. Delete the Vdisk. Replace the Failed drives associated with this Vdisk, if necessary. If there are any Leftover drives, see HPE Guided Troubleshooting. Recreate the Vdisk. Restore the data from the most recent valid backup. Using the Trust Command 29 6. 7. 30 Re-enable automatic reconstruction by: • Reinserting or adding global spares that were unseated or deleted previously. For more information, see CLI User Guide. • Re-enabling the spare drives associated with the Vdisk that were unseated previously. • Turning on the dynamic spares feature. For more information, see CLI User Guide. Re-enable background scrub. For more information, see CLI User Guide. Troubleshooting HPE P2000 G3 2 Support and other resources Accessing Hewlett Packard Enterprise Support • For live assistance, go to the Contact Hewlett Packard Enterprise Worldwide website: www.hpe.com/assistance • To access documentation and support services, go to the Hewlett Packard Enterprise Support Center website: www.hpe.com/support/hpesc Information to collect • Technical support registration number (if applicable) • Product name, model or version, and serial number • Operating system name and version • Firmware version • Error messages • Product-specific reports and logs • Add-on products or components • Third-party products or components Accessing updates • Some software products provide a mechanism for accessing software updates through the product interface. Review your product documentation to identify the recommended software update method. • To download product updates, go to either of the following: ◦ Hewlett Packard Enterprise Support Center Get connected with updates page: www.hpe.com/support/e-updates ◦ Software Depot website: www.hpe.com/support/softwaredepot • To view and update your entitlements, and to link your contracts and warranties with your profile, go to the Hewlett Packard Enterprise Support Center More Information on Access to Support Materials page: www.hpe.com/support/AccessToSupportMaterials IMPORTANT: Access to some updates might require product entitlement when accessed through the Hewlett Packard Enterprise Support Center. You must have an HP Passport set up with relevant entitlements. Websites Website Link Hewlett Packard Enterprise Information Library www.hpe.com/info/enterprise/docs Hewlett Packard Enterprise Support Center www.hpe.com/support/hpesc Accessing Hewlett Packard Enterprise Support 31 Website Link Contact Hewlett Packard Enterprise Worldwide www.hpe.com/assistance Subscription Service/Support Alerts www.hpe.com/support/e-updates Software Depot www.hpe.com/support/softwaredepot Customer Self Repair www.hpe.com/support/selfrepair Insight Remote Support www.hpe.com/info/insightremotesupport/docs Serviceguard Solutions for HP-UX www.hpe.com/info/hpux-serviceguard-docs Single Point of Connectivity Knowledge (SPOCK) Storage compatibility matrix www.hpe.com/storage/spock Storage white papers and analyst reports www.hpe.com/storage/whitepapers nl Customer self repair Hewlett Packard Enterprise customer self repair (CSR) programs allow you to repair your product. If a CSR part needs to be replaced, it will be shipped directly to you so that you can install it at your convenience. Some parts do not qualify for CSR. Your Hewlett Packard Enterprise authorized service provider will determine whether a repair can be accomplished by CSR. For more information about CSR, contact your local service provider or go to the CSR website: www.hpe.com/support/selfrepair Remote support Remote support is available with supported devices as part of your warranty or contractual support agreement. It provides intelligent event diagnosis, and automatic, secure submission of hardware event notifications to Hewlett Packard Enterprise, which will initiate a fast and accurate resolution based on your product’s service level. Hewlett Packard Enterprise strongly recommends that you register your device for remote support. For more information and device support details, go to the following website: www.hpe.com/info/insightremotesupport/docs Documentation feedback Hewlett Packard Enterprise is committed to providing documentation that meets your needs. To help us improve the documentation, send any errors, suggestions, or comments to Documentation Feedback ([email protected]). When submitting your feedback, include the document title, part number, edition, and publication date located on the front cover of the document. For online help content, include the product name, product version, help edition, and publication date located on the legal notices page. 32 Support and other resources
© Copyright 2025 Paperzz