What's All This Brouhaha? » Disaster recovery

Motherboard ethernet port went bad

Eric — Wed, 20 Oct 2010 06:20:17 +0000

The onboard ethernet port of my Asus M4A785TD-M EVO motherboard went bad. It was working fine, then all of a sudden quit. Neither Windows nor Linux can detect that the Ethernet chip is even present, e.g., not even with an lspci command on Linux. At first I thought that the BIOS settings had somehow changed and disabled it, but they hadn’t. I tried disabling it in the BIOS, rebooting, enabling it, and rebooting again. I tried resetting the BIOS settings to factory defaults. I tried clearing the CMOS.

This is the first ASUS motherboard failure I’ve ever had, out of several dozen ASUS motherboards I’ve used.

For a little while I got by with an old Belkin USB 10/100 Ethernet adapter I had lying around. The performance in Linux was not very good, but at least it worked. However, there are no Windows 7 drivers for it, and I occasionally do have to boot Windows 7. Today I installed an Intel Gigabit CT Desktop Adapter, which is a PCIe NIC. It is working quite well. I haven’t yet tried out its Wake-on-LAN feature, but I was unhappy to lose that feature when the motherboard port failed, since the USB adapter doesn’t provide it.

I’ll probably replace the motherboard soon. I was happy with it until this happened, but now I’ve got an excuse to get a better one, such as the M4A89GTD PRO/USB3. I don’t actually need the onboard graphics, so in principle I could get the M4A89T PRO/USB3 instead, which is very similar but without the onboard graphics. However, it’s time to upgrade my server also, and I do want onboard graphics for that. I’ll get two of the with-graphics motherboards, and that way the motherboard in my desktop can serve as a backup if the one in the server fails.

Suboptimal RAID card behavior on drive failure and suboptimal Seagate web-based customer service

Eric — Fri, 20 Nov 2009 09:33:02 +0000

On Monday morning I got email from the RAID controller on one of my servers, stating that a drive had been “removed”. I went to the colo, and found that while the drives were all physically present, the controller was reporting that the three-drive RAID 5 array was “unusable”. I mistakenly assumed that it just needed another drive to rebuild onto, but adding a drive didn’t help as the controller refused to let me start a rebuild. On further reflection, I realized that I’d never seen it report a status of “unusable” before; normally with a single drive failure the status becomes “degraded”. I would expect that having a second drive fail when the array is degraded should result in “unusable”. I checked the logs, and there was no indication that more than one drive had failed, or that the array had been in the degraded state prior to the Monday morning failure. There is some indication that there may have been a power glitch at that time, and although the computer did not reboot, it is possible that the power to the disk drives may have sagged enough to trigger the disk drives to reboot. If so, this could have caused the RAID controller to think that two (or all three) of the drives had failed simultaneously. As a result, it apparently marked the metadata of the two working drives as unusable, so that it is unwilling to even attempt a rebuild. Gee, thanks.

The three disk drives in the array were all Seagate Barracuda ES.2 1TB drives (ST31000340NS). Two of them still work fine, with all sectors reading without error. One of them, however, will not do anything but identify its model number. It doesn’t report the correct drive capacity, and it returns an immediate error to all read or write commands without seeking.

Friends told me about a known failure mode of these drives (and a few other Seagate drives including the Barracuda 7200.11), which can brick the drive. They said that there was updated firmware from Seagate to fix the problem. I googled for it, and found a Seagate serial number check utility that will, given a drive serial number, tell you whether you need new firmware. I entered the serial number of the failed drive, and it reported that there was “No download available for this serial number.”

Further googling, however, turned up a different Seagate page that had specific instructions on how to identify drives subject to the problem, and the firmware download. Unfortunately Seagate doesn’t provide any release notes on the firmware on their web site, or any detailed description of the failure. Contrary to what their serial number check utility had told me, my failed drive was in fact one of those that needed the upgrade. The firmware upgrade is provided as an ISO image for a CD. I downloaded it and booted it; it verified that it was suitable for the drive, and upgraded the firmware. Unfortunately the drive still didn’t work, and appeared to behave exactly as before.

Further searching revealed that while Seagate doesn’t provide release notes for the firmware on their own web site, they did provide it elsewhere as an “Urgent Field Update” notice to their OEMs, and someone else put it online. The document gives an overview of the problem, which only occurs at drive power up time. The drive firmware hits an assertion and goes into a “safe mode”, where safe is defined as bricking the drive.

Seagate says that the firmware upgrade won’t fix the failure once it has already occurred. If the drive has already failed, it has to be sent back to Seagate for repair. It would have been rather helpful if Seagate’s own web site had explained that, so that I wouldn’t have wasted time doing a firmware upgrade that wasn’t going to help me at all.

The notice says that Seagate will provide data recovery service for failed drives at no charge. On the one hand, the drive is under warranty, and I’d rather not spend $160 to replace it. On the other hand, there is some sensitive data on the drive, so I’d rather not send it to any third party. This is a good argument for using host-based drive encryption (e.g., LUKS/dm-crypt on Linux).

Instead of sending the drive to Seagate for data recovery, waiting however long that takes, and hoping for the best, I decided that since two of the three drives appeared to be intact, I should be able to do a RAID rebuild myself. A few years ago when I accidentally wiped out the metadata on a RAID 5, I wrote a program “r5test” to do a rebuild. At that time, I had an array with all drives intact, so r5test just verified the parity as it did the rebuild. This time I was actually missing one drive, so I enhanced r5test to allow for a missing drive, in which case it can attempt reconstruction but can’t verify integrity.
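Rebuilding with one drive missing relies on the basic RAID 5 property that the parity chunk of each stripe is the XOR of its data chunks, so any single missing chunk is the XOR of the survivors. The Python sketch below illustrates only that property. It is not the r5test source (which is not shown here), and the function names are hypothetical.

```python
def xor_chunks(chunks):
    """XOR a list of equal-length byte strings together."""
    result = bytearray(len(chunks[0]))
    for chunk in chunks:
        if len(chunk) != len(result):
            raise ValueError("chunks must all be the same length")
        for i, value in enumerate(chunk):
            result[i] ^= value
    return bytes(result)


def parity_ok(stripe):
    """With every drive present, data XOR parity must be all zeros."""
    return not any(xor_chunks(stripe))


def rebuild_missing(stripe):
    """Fill in the single chunk marked None from the surviving chunks.

    There is nothing left to check the result against, which is why a
    rebuild with a missing drive cannot verify integrity.
    """
    missing = [i for i, chunk in enumerate(stripe) if chunk is None]
    if len(missing) != 1:
        raise ValueError("exactly one chunk must be missing")
    present = [chunk for chunk in stripe if chunk is not None]
    rebuilt = list(stripe)
    rebuilt[missing[0]] = xor_chunks(present)
    return rebuilt
```

With all chunks present, parity_ok gives the integrity check; with one chunk absent, rebuild_missing can only trust the survivors.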

With three drives in the array, there are six possible permutations. The first partition was an 8GB /boot partition, so I tried reconstructing it using each possible permutation. I found the correct one on the fifth try. That accomplished, I set it to work reconstructing the entire array onto a single new 2TB drive. The rate of reconstruction was such that I expected it to take about 25-26 hours. At some point about 1/3 of the way into the process, r5test exited with no indication of an error. I don’t know why this happened, but r5test prints a progress message every second, so I was able to restart it from a point slightly before the last stripe it had recovered.

Once r5test had rebuilt the image of the entire array contents, I checked the filesystems. /tmp was completely trashed, and /var was mostly trashed, but the other filesystems passed fsck with no issues. I was concerned about /var since it contains MySQL databases. I was able to find those in lost+found. I saved the partially recovered /var filesystem, recreated it from a backup, then dropped in the MySQL databases from the rebuilt image.

I took the new drive to the colo to install it in the server. A second server was physically on top of the one with the failed RAID, so I shut it down and moved it out of the way, in order to make some changes to the SATA cabling in the failed server. After booting a Fedora recovery CD in order to install GRUB on the new drive, I was able to boot it successfully, about 78 hours after the original failure.

I put the second server back in place, powered it up, and it refused to boot. The motherboard BIOS displays its information, then the PCI SATA card BIOS displays its information, then the machine resets. It just went through that cycle continuously, for no obvious reason. As far as I could tell, it wasn’t even trying to actually boot the disk. I decided that the best course of action was to install a new motherboard with on-board SATA. After I got to the computer store, though, I realized that I actually can’t put a new motherboard in that chassis. It is a 2U chassis with a redundant power supply, but it is from the days of 20-pin ATX power connectors, and all recent motherboards use 24-pin ATX power connectors. I’ll probably have to replace the whole chassis. As a temporary measure, I’m replacing the server hardware with a Shuttle X27D mini barebones system (similar to a mini-ITX). I think the longer-term solution will be to convert it to a virtual machine running on the other server. It will take about six hours to transfer all the data from the old disk drive to a new one in the X27D.

In the process of doing all this, I also discovered that my desktop computer at home will not work when I tip it on its side to work on it. The CPU fan won’t spin, and the motherboard plays a recorded “CPU failed” message. It works fine when it is upright. Maybe the fan bearings have worn out.

As a result of all this, I’ve gotten absolutely none of the work done this week that urgently needed to be done.

New computer preloaded with craplets including Norton 360

Eric — Thu, 16 Jul 2009 00:11:59 +0000

My sister’s office PC was damaged by lightning, so she got a new emachines PC at Costco, and I’m helping her get it set up. It came with Vista, of course. And a bunch of craplets, including Norton 360 with a 60 day “free” subscription.

When you boot the machine, Norton 360 pops up a window to start the subscription process, and there is no cancel or quit. The close item on the menu window is disabled, and the “X” button at the top right doesn’t work. It won’t let you proceed without creating a new Norton account or entering credentials for an existing account.

Best of all, the fuckwits somehow have managed to subvert the “End program” feature of the Task Manager. When I try to kill Norton 360, it says “program not responding”, and then the “End now” button does nothing!

Since Vista rearranged everything, it took me a while to figure out where Add/Remove programs went. Now it’s “Programs and Features” or some such. It let me uninstall Norton, even while the stupid Norton signup screen was up.

There was also some kind of “BigFix” thing that is supposed to make security updates and fixes easier. It came up with a message that as of April 30, 2009, emachines customers no longer get BigFix service.

RAID Capacity Expansion, the Suboptimal Way, or How Not to Have a Fun Weekend

Eric — Tue, 20 Nov 2007 01:55:53 +0000

On Friday, I needed to add disk space to the RAID 5 array on my server. I’m using a 3ware 9550SX-8LP RAID controller, which I’ve generally been very happy with. It has support for online capacity expansion, so I decided to reconfigure it to drop the hot spare drive, then add that drive into the RAID 5.

Due to a layer 8 error (PEBKAC), the system went down around noon on Friday, and it took me until 4 AM Monday morning to restore it to normal operation. Sometimes you know on an intellectual level that something is dangerous, but you haven’t fully internalized it until you have experience with it biting you on the ass. I knew that poking around in the RAID management interface was dangerous, but I thought I knew what I was doing.

The live RAID 5 array was unit 0, and the hot spare was unit 1. I needed to delete unit 1 and add the drive to unit 0. I told the management software to delete unit 1, it asked me if I was sure, and I confirmed that I wanted it deleted.

Unfortunately it appears that I screwed up and actually asked it to delete unit 0, which it did. Unsurprisingly, Linux immediately started reporting disk errors.

When I realized what I’d done, I paused to think about it for a moment. I do have a full backup of the system, but it’s more than a month old, and I’d rather not have to revert to that. Deleting a RAID unit probably only means changing a little bit of metadata on the drives, so the filesystems were probably all still intact. The metadata format isn’t documented, but maybe there’s a way to reconstruct it.

I called 3ware tech support at around 1 PM. The engineer listened to my problem description, and said that it was possible that they might be able to help me, if the server had a floppy drive. They would email me some scripts which I could put on a floppy disk and run on the machine. The scripts would pull various data out of the RAID controller and from the drives, write to log files on the floppy, and I would email those log files to support. They would analyze them, and generate new scripts which would rewrite the appropriate metadata back onto the drives, reintegrating them as a RAID 5.

Since my email goes through the server in question, I had them send the scripts to an alternate email address. An hour and a half later, I hadn’t received them yet, so I called back. I spoke to a different engineer, gave him yet another alternate email address, and he sent the scripts. I received them almost instantly.

The server does have a floppy drive, but none of my other commonly used machines does. I had to search for my USB floppy drive, which I hadn’t used recently. Once I found that, I realized that I didn’t have any diskettes handy, so I went to the local office supply store to buy a pack. Finally I got MS-DOS 6.22 written to one floppy and the 3ware files to another.

I went to the colocation facility, booted MS-DOS, swapped floppies, and ran the 3ware script. It chugged along for a few minutes, with various hexadecimal data scrolling by on the display and being written to log files on the floppy. When it was done, I drove back, and it was about 4 PM when I emailed the files back to the 3ware engineer. He’d told me that I needed to email the files as individual attachments, rather than putting them in a ZIP file or the like, because their corporate mail server would drop ZIP files as being potential malware.

After a few minutes I decided to call 3ware again to confirm that they’d received the files. The engineer confirmed that they had. I asked when I might expect that they would have the recovery script for me, and he said they’d analyze the logs on Monday. I really didn’t like the idea of waiting until Monday, but 3ware was trying to do me a favor and offer me support that they were in no way obligated to provide, so I certainly couldn’t complain. It would have been entirely reasonable for them to tell me that there was no way to recover the RAID and that I’d have to restore from my backup.

I still didn’t like having the server down that long. It acts as the DNS and mail server for my own domains, and for those of a few friends. There’s nothing enterprise-critical about my own stuff, but I was providing DNS for a business a friend works for, and if the server was down that long the backup DNS might also time out, which would cause problems for him. In hindsight, I should have logged into the backup DNS machine and changed all the slave zones into masters temporarily to prevent that problem, but I didn’t think of it until too late.

I started thinking about whether it was possible to recover the data myself. The basic concept of a RAID 5 is actually pretty simple, so I might be able to figure out the details of the on-disk layout of the data, and write a program to extract it. I decided to give it a try; as long as I didn’t modify the contents of the original RAID drives, if my own recovery attempt failed, I’d still be able to use the script from 3ware support.

First, I needed to reverse-engineer the exact details of how the 3ware controller organized the user data on the drives in the array. I wrote a program “r5test” that could fill a file or disk with known patterns. The program writes 512-byte records (exactly the size of a disk block) each containing a recognizable string “-r5test-” and an eight byte sequential record (or block) number, with the remainder of the record filled with pseudorandom values seeded from the block number. Another function in the program could verify that a file or disk contained this pattern. I tested the program by writing a file, examining a hex dump of the file, then running the verify function.
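A minimal Python sketch of such a test pattern follows. It is not the r5test source. The little-endian encoding of the block number, the use of Python's random module for the filler, and the function names are all assumptions made for illustration.

```python
import io
import random
import struct

BLOCK_SIZE = 512      # one disk block per record
TAG = b"-r5test-"     # recognizable 8-byte marker


def make_block(block_number):
    """Build one record: tag, 8-byte block number, pseudorandom filler."""
    # Byte order is an assumption; the post does not specify it.
    header = TAG + struct.pack("<Q", block_number)
    # Seeding from the block number makes the filler reproducible.
    rng = random.Random(block_number)
    filler = bytes(rng.getrandbits(8) for _ in range(BLOCK_SIZE - len(header)))
    return header + filler


def first_bad_block(stream):
    """Return the number of the first block that fails to match, or None."""
    block_number = 0
    while True:
        data = stream.read(BLOCK_SIZE)
        if not data:
            return None
        if data != make_block(block_number):
            return block_number
        block_number += 1


# Write a small pattern into memory and verify it.
image = io.BytesIO(b"".join(make_block(n) for n in range(16)))
result = first_bad_block(image)
```

Because every record carries its own block number, a hex dump of any sector on a member drive shows directly which logical block landed there, which is what makes the pattern useful for working out the on-disk layout.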

I went to Central Computer, a local retailer, to buy four inexpensive 160GB SATA drives for the reverse-engineering project, one 1TB SATA drive to copy the real RAID data onto, and four Antec Veris MX-1 USB-SATA external drive cases to hook up the drives to my desktop computer at home. I already had one MX-1 not currently in use. I like the MX-1 because they have a fan to keep the drives cool; despite claims of advanced cooling features, most external USB cases let drives get extremely hot.

At home, I made two sets of labels R0 to R4 and S0 to S3 to use for the live RAID drives and the scratch drives, and put labels on the scratch drives.

I took the 160GB drives and another scratch drive to the colo. I powered down the machine, unplugged the ethernet, pulled the RAID drives out, and carefully labeled them R0-R4 based on the RAID port number (R4 having been the hot spare). I put the 160GB drives into the RAID, and hooked up the scratch drive to a motherboard SATA port. I installed Fedora Core 6 x86-64 on the scratch drive and booted it. FC6 is out of date, but that was the DVD that I had at hand, and it was fine for this purpose. In fact, FC6 seems to install much faster than Fedora 7 or Fedora 8, so it was actually a very good choice. Once FC6 was running, I configured the ethernet, verified outside connectivity and that the SSH daemon was running, and went home.

From home, I copied the r5test sources to the server, compiled them, and started them running. Although I didn’t necessarily need the entire RAID 5 array (about 480GB effective capacity) filled with the test pattern, I let it run all evening until the array was full.

In the meantime, I started adding more functions to r5test. The next requirement would be to verify that the test pattern could be read from the individual RAID drives. I didn’t yet know the exact format, but I wrote code based on my best guess as to what the format would be, with the expectation that it might be different but probably not so different that I couldn’t tweak the code as appropriate.

I also added a function that could write the test pattern onto multiple drives or files in the expected RAID 5 organization, in order to test the reader.

I made a trip back to the colo to pull the 160GB drives, brought them home, and installed them into the MX-1 cases. I put the second set of S0-S3 labels on the cases in order to keep from getting the drives confused. I hooked them up to the desktop machine, one at a time, making note of the device assigned by the kernel (e.g., /dev/sdg). Finally I started looking at them with a hex editor.

From prior experience, I believed that the 3ware controller’s metadata (DCB) was stored near the end of the drive, and that user data started at block 0. The hex editor confirmed this, and that the distribution of the user data was exactly what I’d expected! I ran the RAID 5 pattern checker for a while (but not to completion on the entire drive), and it confirmed that both the test data pattern and the parity were good, and that the parity rotation worked as expected. Stripe n has parity on drive n mod drive_count.
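The parity rotation rule can be sketched as a mapping from a logical block number to a drive and a block on that drive. Two things in this Python sketch are assumptions rather than stated facts: the chunk size of 128 blocks, and the rule that data chunks fill the non-parity drives in ascending drive order. The function name is hypothetical, and this is not the r5test code.

```python
CHUNK_BLOCKS = 128  # assumed chunk (stripe unit) size, in 512-byte blocks


def locate(logical_block, drive_count, chunk_blocks=CHUNK_BLOCKS):
    """Map a logical block to (data drive, block on that drive, parity drive)."""
    data_drives_per_stripe = drive_count - 1
    chunk, offset = divmod(logical_block, chunk_blocks)
    stripe, index = divmod(chunk, data_drives_per_stripe)
    # Stripe n has parity on drive n mod drive_count.
    parity_drive = stripe % drive_count
    # Assumption: data chunks use the remaining drives in ascending order.
    data_drives = [d for d in range(drive_count) if d != parity_drive]
    return data_drives[index], stripe * chunk_blocks + offset, parity_drive
```

Under these assumptions, with four drives, logical blocks 0, 128, and 256 land at block 0 of drives 1, 2, and 3, with the stripe's parity at block 0 of drive 0.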

While the pattern checker was running, I wrote the code that would actually extract live RAID 5 data, verify the parity, and write the data onto a target drive. I tested that by running it from the 160GB drives onto a file (again, not continuing to completion), then running the pattern checker on that file. It all seemed to work, so it was time to try the live RAID drives.

I unplugged the MX-1 cases, removed the 160GB drives, and set them aside. I have no further immediate need for them, but perhaps they’ll be useful as scratch drives in the future. I put the RAID drives into the cases, again being careful to maintain the order and note the Linux device names. I used chmod on each device to set the permissions to 444, so the drives would be read-only. I really wish the drives or the cases had a write-protect jumper to do this in hardware. I also put the 1TB drive into a case, and hooked it up.

With four drives in the RAID 5 array, there are 24 possible permutations of ordering. I’d hoped that they would be ordered the same way as the scratch drives, but examination of hex dumps suggested otherwise. On two drives, R0 and R1, block 0 contained what appeared to be a valid master boot record (MBR). Block 0 of the other two drives contained zeros. This is consistent with my expectations; three drives should start with RAID blocks 0, 128, and 256, and one drive should have the parity of those blocks. Blocks 128 and 256 are part of logical cylinder zero, so they are expected to contain zeros, and thus the first parity block should match RAID block 0, the MBR. This reduces the possible permutations to only four. But how to determine which is correct?

The first partition on the RAID was a 4GB /boot partition, so I decided to simply use my recovery function to extract about 5GB from the drives in each of the four possible permutations, and use the e2fsck filesystem checker on each. For the (0, 1, 2, 3) permutation, e2fsck reported only minor errors. For (0, 1, 3, 2), e2fsck reported many major errors, and for (1, 0, 2, 3) and (1, 0, 3, 2) e2fsck reported one major error so serious that it couldn’t proceed. This convinced me that (0, 1, 2, 3) was the correct permutation.
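The narrowing from 24 orderings to four can be reproduced with a few lines of Python. This is only an illustration of the counting argument; the function name is hypothetical, and it assumes that the first two positions of an ordering are the ones holding the first parity chunk and RAID block 0.

```python
from itertools import permutations


def candidate_orders(drive_count, mbr_drives):
    """List the orderings in which the drives whose block 0 looks like
    an MBR occupy the first two array positions."""
    return [
        order
        for order in permutations(range(drive_count))
        if set(order[:2]) == set(mbr_drives)
    ]


all_orders = list(permutations(range(4)))   # every possible ordering
candidates = candidate_orders(4, {0, 1})    # R0 and R1 held the MBR copies
```

Each surviving candidate then gets the same treatment: extract the first few gigabytes in that order and let e2fsck judge the result.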

I started the recovery program, running as a user rather than root so it would respect the block device permissions. I’d put in code that printed a progress update every second or so, and that let me estimate that it would take 14 hours to complete. (It might have run faster if I’d arranged to plug the destination 1TB drive into a different USB host controller than the source drives.)

After about six hours, I noticed that the program had aborted with an error, though it was not clear why. I was somewhat disturbed to see that the permissions on the block device files had reverted to 600, meaning that they were read-write for root and no access for other users. That shouldn’t have caused any problem; changing the permissions on a file or device doesn’t affect a program that already has it open. Maybe something in GNOME “fixed” the permissions for me, or maybe a cron job did it?

Then I noticed that one of the device files was missing. Sure enough, /proc/scsi/scsi now only showed three of the four source drives. The blue LEDs on all four source drives were still on, though. I ended up disconnecting all the drives, and reconnecting them in sequence again.

I didn’t want to start the recovery over again from the beginning, losing the six hours already spent, so I modified r5test to allow it to start partway into the process, based on the progress numbers the previous invocation had last printed (minus a few thousand blocks). I started it up again, and went to bed.

When I got up, I found that it still had a few hours to go. I waited for it to finish, and then was shocked to find that the destination drive did not have a valid partition table. Investigation revealed that when I restarted the program, it did seek to the correct block on the source drives, but not on the destination drive. I’d made a mistake in where I put the destination seek in the code, and it never happened. I’d have to start it over from the beginning.
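The restart bug is easy to reproduce in miniature. The Python sketch below is a deliberately simplified single-source copy, not the r5test code (which reads from several drives at once). The function and parameter names are hypothetical. The point is only that a resumed copy has to seek the destination as well as the source.

```python
import io


def copy_from(src, dst, start_block, block_size=512, blocks_per_read=128):
    """Copy src to dst, starting at start_block in BOTH streams."""
    offset = start_block * block_size
    src.seek(offset)
    dst.seek(offset)  # omit this and the resumed data lands at offset 0
    copied = 0
    while True:
        data = src.read(blocks_per_read * block_size)
        if not data:
            break
        dst.write(data)
        copied += len(data)
    return copied


# Simulate a run that was interrupted after the first two blocks.
source_bytes = bytes(range(256)) * 8          # 2048 bytes, four blocks
src = io.BytesIO(source_bytes)
dst = io.BytesIO(source_bytes[:1024])         # what the first run wrote
copied = copy_from(src, dst, start_block=2)
```

Checking the first block of the destination a few minutes into a resumed run, as was done on the final attempt, catches this kind of mistake early instead of after many hours.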

I did start it over, and a few hours into the new run it hit an error again. I fixed the bug in r5test, and started it again. This time, a few minutes after I restarted it I checked the MBR on the destination drive, and it looked fine. I checked up on progress from time to time, but there were no further errors.

The process completed around 3:45 AM Monday morning. I took the 1TB drive back to the colo and attached it to the motherboard SATA port. Now for the moment of truth: I turned the machine on. It went through the normal boot sequence, and everything looked fine. I logged in, and it seemed that all the daemons were running properly. I verified the network connectivity, and went home.

As of 5:51 PM on Monday, I haven’t yet heard back from 3ware about the scripts they were planning to send me. I’ll email them and tell them that I no longer need the scripts, and thank them for offering to help.

At some point I’ll clean up r5test and release it with a GPL license. I’m not sure how likely it is that anyone else might need it, but if someone does, it would be nice if they didn’t have to write their own program from scratch. While the on-disk format might not be identical for other RAID controllers, it should be similar enough that r5test can be easily modified to be usable.

I should emphasize again that this problem was entirely of my own devising, and in no way reflects negatively on 3ware. The 3ware products and their customer service are absolutely exemplary. I will continue to use 3ware products in the future, and I highly recommend them to anyone needing high-quality RAID controllers.

/home partition recovered

Eric — Sat, 23 Sep 2006 05:20:13 +0000

I was able to recover the /home partition from the disk that Windows clobbered a week ago. Windows overwrote the partition table, and none of the usual methods I’ve used for data recovery were able to find it, nor was gpart. I ended up writing a program to scan for likely ext2/ext3 superblocks, identified the right one, then dd’d the contents of the drive from two sectors prior to the superblock into a file. Then I was able to mount that as a filesystem via the loopback interface. Worked like a charm. A lot of hassle, though. There wasn’t anything too important on the partition, but it was still nice getting it back.
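A scan of this kind can be sketched in Python as follows. This is not the original program. It relies on two facts about the ext2/ext3 format: the primary superblock starts 1024 bytes into the filesystem (two 512-byte sectors), and it holds the magic number 0xEF53 as a little-endian 16-bit value at byte offset 56. The function name and the sector-aligned scanning strategy are assumptions.

```python
import struct

SECTOR = 512
SUPERBLOCK_OFFSET = 1024   # primary superblock position within the filesystem
MAGIC_OFFSET = 56          # offset of the magic field within the superblock
EXT_MAGIC = 0xEF53


def find_superblock_candidates(stream, sector=SECTOR):
    """Return byte offsets of sectors carrying the ext2/ext3 magic number."""
    hits = []
    position = 0
    while True:
        data = stream.read(sector)
        if len(data) < MAGIC_OFFSET + 2:
            break
        (magic,) = struct.unpack_from("<H", data, MAGIC_OFFSET)
        if magic == EXT_MAGIC:
            hits.append(position)
        position += len(data)
    return hits


def filesystem_start(superblock_offset):
    """The filesystem begins two sectors before its primary superblock."""
    return superblock_offset - SUPERBLOCK_OFFSET
```

Backup superblocks inside the same filesystem carry the same magic number, and stray data can match by chance, so the hits are only candidates. Picking the right one still takes a look at the other superblock fields.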

I need to set up an automated backup system for my laptop, so that when I go home I just plug it into the network and it gets backed up.

Screwed by Windows again!

Eric — Fri, 15 Sep 2006 07:36:18 +0000

I’m trying to copy an NTFS partition from an external USB drive to a spare partition on the internal IDE drive of my laptop. In the process, before I’d even so much as gotten Ghost to start doing anything, Windows XP managed to write a copy of the external drive’s partition table over that of the internal drive, rendering the internal drive unbootable, and not easily recoverable.

Fortunately the Linux program “gpart” seems to have the ability to scan a disk, find the probable locations of partitions, and build a new partition table. gpart is included in the Fedora Core CDs (32-bit only?), which can be used in rescue mode. I’m running the scan now.

Geez, what a nightmare.

Backing up a damaged DVD

Eric — Mon, 05 Dec 2005 08:25:06 +0000

I have a damaged DVD that will no longer play in my DVD player. My computer seems to be able to read it OK. The official position of the MPAA is that I’m shit-outa-luck, and should just buy another copy. Naturally I prefer to exercise fair use rights, and make a backup copy.

I didn’t buy a copy of DVD Copy Plus back when it was being sold, and in any case I prefer to use Linux to solve these problems. I was expecting to have to jump through a lot of hoops to copy a video DVD on Linux, but a search turned up a paper explaining how to do it. In case the information is useful to someone else, I’ve put the paper online at floobydust.com.

The backup process worked fine, and the resulting disc plays in my DVD player, so I’m happy. Of course, normally one should make a backup of a disc before the original goes bad, but in this case I got lucky.

Disk drive failure

Eric — Mon, 25 Apr 2005 10:00:05 +0000

A 200G drive in one of my RAID systems failed today. Since it was RAID 5, no data was lost. However, the original drive had an actual capacity of 203G, while new ones sold with the identical part number are only 200G even. So I can’t put an “identical” replacement in; I’ll have to get a 250G drive.

The failed drive was acquired second-hand and had previously been used for life cycle testing, so it is not too surprising that it failed since it was subjected to abnormally high wear. I didn’t expect that these particular drives would be very reliable, given their history, so I was only willing to use them in a RAID 5 configuration.

Fire

Eric — Thu, 14 Aug 2003 23:27:16 +0000

Originally Mike thought only his garage and a small portion of the house needed to be torn down and rebuilt. However, it has now been determined that the entire house must be torn down.

Mike has been able to recover data from the disks in one of the computers from the garage that was near floor level (lower than my machine), but the drive from the machine that was next to mine will not spin up. Fortunately the disks he did recover have backups of his most important data. Some of the backup data is months old, but for much of the data that’s new enough.

Fire photos

Eric — Thu, 14 Aug 2003 01:17:12 +0000

I’ve just set up an album of photos of my server after the fire.