Suboptimal RAID card behavior on drive failure and suboptimal Seagate web-based customer service

On Monday morning I got email from the RAID controller on one of my servers, stating that a drive had been “removed”.  I went to the colo, and found that while the drives were all physically present, the controller was reporting that the three-drive RAID 5 array was “unusable”.  I mistakenly assumed that it just needed another drive to rebuild onto, but adding a drive didn’t help, as the controller refused to let me start a rebuild.  On further reflection, I realized that I’d never seen it report a status of “unusable” before; normally with a single drive failure the status becomes “degraded”.  I would expect that having a second drive fail while the array is degraded would result in “unusable”.  I checked the logs, and there was no indication that more than one drive had failed, or that the array had been in the degraded state prior to the Monday morning failure.  There is some indication that there may have been a power glitch at that time, and although the computer did not reboot, the power to the disk drives may have sagged enough to make them reset.  If so, this could have caused the RAID controller to think that two (or all three) of the drives had failed simultaneously.  As a result, it apparently marked the metadata of the two working drives as unusable, so that it is unwilling to even attempt a rebuild.  Gee, thanks.

The three disk drives in the array were all Seagate Barracuda ES.2 1TB drives (ST31000340NS).  Two of them still work fine, with all sectors reading without error.  One of them, however, will not do anything but identify its model number.  It doesn’t report the correct drive capacity, and it returns an immediate error to all read or write commands without seeking.

Friends told me about a known failure mode of these drives (and a few other Seagate drives including the Barracuda 7200.11), which can brick the drive.  They said that there was updated firmware from Seagate to fix the problem.  I googled for it, and found a Seagate serial number check utility that will, given a drive serial number, tell you whether you need new firmware.  I entered the serial number of the failed drive, and it reported that there was “No download available for this serial number.”

Further googling, however, turned up a different Seagate page that had specific instructions on how to identify drives subject to the problem, along with the firmware download.  Unfortunately Seagate doesn’t provide any release notes for the firmware on their web site, or any detailed description of the failure.  Contrary to what their serial number check utility had told me, my failed drive was in fact one of those that needed the upgrade.  The firmware upgrade is provided as an ISO image for a bootable CD.  I downloaded it and booted it; it verified that it was suitable for the drive and upgraded the firmware.  Unfortunately the drive still didn’t work, and appeared to behave exactly as before.

Further searching revealed that while Seagate doesn’t provide release notes for the firmware on their own web site, they did provide it elsewhere as an “Urgent Field Update” notice to their OEMs, and someone else put it online.  The document gives an overview of the problem, which only occurs at drive power up time.  The drive firmware hits an assertion and goes into a “safe mode”, where safe is defined as bricking the drive.

Seagate says that the firmware upgrade won’t fix the failure once it has already occurred.  If the drive has already failed, it has to be sent back to Seagate for repair.  It would have been rather helpful if Seagate’s own web site had explained that, so that I wouldn’t have wasted time doing a firmware upgrade that wasn’t going to help me at all.

The notice says that Seagate will provide data recovery service for failed drives at no charge.  On the one hand, the drive is under warranty, and I’d rather not spend $160 to replace it.  On the other hand, there is some sensitive data on the drive, so I’d rather not send it to any third party.  This is a good argument for using host-based drive encryption (e.g., LUKS/dm-crypt on Linux).
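For what it’s worth, here is a minimal sketch of how host-based encryption might be set up when provisioning a new data drive, using the standard cryptsetup tool on Linux.  The device node, mapper name, and mount point below are placeholders, and this is not how this particular server was configured:

    # Sketch only: set up LUKS on a hypothetical new drive /dev/sdX.
    import subprocess

    DEV = "/dev/sdX"        # placeholder device node
    NAME = "cryptdata"      # placeholder mapper name

    # Create the LUKS container (prompts for a passphrase) and open it;
    # the plaintext device then appears at /dev/mapper/cryptdata.
    subprocess.run(["cryptsetup", "luksFormat", DEV], check=True)
    subprocess.run(["cryptsetup", "luksOpen", DEV, NAME], check=True)

    # Make a filesystem on the mapped device and mount it.  Anything
    # written through the mapping is encrypted before it reaches the
    # platters, so a drive sent out for repair holds only ciphertext.
    subprocess.run(["mkfs.ext3", "/dev/mapper/" + NAME], check=True)
    subprocess.run(["mount", "/dev/mapper/" + NAME, "/srv/data"], check=True)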

Instead of sending the drive to Seagate for data recovery, waiting however long that takes, and hoping for the best, I decided that since two of the three drives appeared to be intact, I should be able to do a RAID rebuild myself.  A few years ago when I accidentally wiped out the metadata on a RAID 5, I wrote a program “r5test” to do a rebuild.  At that time, I had an array with all drives intact, so r5test just verified the parity as it did the rebuild.  This time I was actually missing one drive, so I enhanced r5test to allow for a missing drive, in which case it can attempt reconstruction but can’t verify integrity.
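The idea behind the missing-drive case is simple: in RAID 5, any one block in a stripe, data or parity, is the XOR of the other blocks in that stripe, so the surviving drives fully determine the missing one.  Here is a rough sketch of that reconstruction for a three-drive array; this is not r5test itself, and the 64KB chunk size, left-symmetric parity rotation, and lack of any handling of controller metadata are assumptions for illustration only:

    # Sketch of rebuilding the logical image of a 3-drive RAID 5 array
    # when one member drive is unreadable.  Assumes a left-symmetric
    # layout and 64 KiB chunks; ignores any controller metadata at the
    # start or end of the member drives.
    CHUNK = 64 * 1024

    def xor_blocks(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def rebuild(member_paths, missing_index, out_path):
        """member_paths: the three members in array order, with None in
        place of the failed drive (at position missing_index)."""
        members = [open(p, "rb") if p else None for p in member_paths]
        n = len(members)
        with open(out_path, "wb") as out:
            stripe = 0
            while True:
                chunks = [m.read(CHUNK) if m else None for m in members]
                good = [c for c in chunks if c is not None]
                if not good[0]:
                    break                 # reached the end of the members
                # The missing chunk (data or parity) is the XOR of the others.
                chunks[missing_index] = xor_blocks(good[0], good[1])
                # Left-symmetric: parity rotates backward one member per
                # stripe, and data starts on the member after the parity.
                parity = (n - 1 - stripe) % n
                data = [chunks[(parity + 1 + k) % n] for k in range(n - 1)]
                out.write(b"".join(data))
                stripe += 1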

With three drives in the array, there are six possible drive orderings.  The first partition was an 8GB /boot partition, so I tried reconstructing it using each possible ordering.  I found the correct one on the fifth try.  That accomplished, I set it to work reconstructing the entire array onto a single new 2TB drive.  The rate of reconstruction was such that I expected it to take about 25-26 hours.  At some point about 1/3 of the way into the process, r5test exited with no indication of an error.  I don’t know why this happened, but r5test prints a progress message every second, so I was able to restart it from a point slightly before the last stripe it had recovered.
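The search itself is just a loop over the six orderings, reconstructing the first partition for each candidate and checking whether the result looks like a filesystem.  A sketch of that, using the hypothetical rebuild() helper above and the ext2/ext3 superblock magic as the sanity check (in practice you would reconstruct only the /boot partition per attempt, not the whole array):

    from itertools import permutations

    # ext2/ext3 superblock magic 0xEF53, stored little-endian at byte
    # 0x438 of a partition (1024-byte superblock offset + 56-byte field).
    EXT_MAGIC = b"\x53\xef"

    def looks_like_ext(image_path, partition_offset):
        with open(image_path, "rb") as f:
            f.seek(partition_offset + 0x438)
            return f.read(2) == EXT_MAGIC

    def find_order(good_a, good_b, boot_offset):
        """good_a, good_b: paths to the two surviving drives.  boot_offset:
        byte offset of /boot within the array image, taken from the
        partition table."""
        for slots in permutations([good_a, good_b, None]):
            rebuild(slots, slots.index(None), "/tmp/candidate.img")
            if looks_like_ext("/tmp/candidate.img", boot_offset):
                return slots              # correct drive ordering found
        return None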

Once r5test had rebuilt the image of the entire array contents, I checked the filesystems.  /tmp was completely trashed, and /var was mostly trashed, but the other filesystems passed fsck with no issues.  I was concerned about /var since it contains MySQL databases.  I was able to find those in lost+found.  I saved the partially recovered /var filesystem, recreated it from a backup, then dropped in the MySQL databases from the rebuilt image.

I took the new drive to the colo to install it in the server.  A second server was physically on top of the one with the failed RAID, so I shut it down and moved it out of the way in order to make some changes to the SATA cabling in the failed server.  After booting a Fedora recovery CD in order to install GRUB on the new drive, I was able to boot the server successfully, about 78 hours after the original failure.

I put the second server back in place, powered it up, and it refused to boot.  The motherboard BIOS displays its information, then the PCI SATA card BIOS displays its information, then the machine resets.  It just went through that cycle continuously, for no obvious reason.  As far as I could tell, it wasn’t even trying to actually boot from the disk.  I decided that the best course of action was to install a new motherboard with on-board SATA.  After I got to the computer store, though, I realized that I actually can’t put a new motherboard in that chassis.  It is a 2U chassis with a redundant power supply, but it is from the days of 20-pin ATX power connectors, and all recent motherboards use 24-pin ATX power connectors.  I’ll probably have to replace the whole chassis.  As a temporary measure, I’m replacing the server hardware with a Shuttle X27D mini barebones system (similar to a mini-ITX).  I think the longer-term solution will be to convert it to a virtual machine running on the other server.  It will take about six hours to transfer all the data from the old disk drive to a new one in the X27D.

In the process of doing all this, I also discovered that my desktop computer at home will not work when I tip it on its side to work on it.  The CPU fan won’t spin, and the motherboard plays a recorded “CPU failed” message.  It works fine when it is upright.  Maybe the fan bearings have worn out.

As a result of all this, I’ve gotten absolutely none of the work done this week that urgently needed to be done.
