On Friday, I needed to add disk space to the RAID 5 array on my server. I’m using a 3ware 9550SX-8LP RAID controller, which I’ve generally been very happy with. It has support for online capacity expansion, so I decided to reconfigure it to drop the hot spare drive, then add that drive into the RAID 5.

Due to a layer 8 error (PEBKAC), the system went down around noon on Friday, and it took me until 4 AM Monday morning to restore it to normal operation. Sometimes you know on an intellectual level that something is dangerous, but you haven’t fully internalized it until you have experience with it biting you on the ass. I knew that poking around in the RAID management interface was dangerous, but I thought I knew what I was doing.

The live RAID 5 array was unit 0, and the hot spare was unit 1. I needed to delete unit 1 and add the drive to unit 0. I told the management software to delete unit 1, it asked me if I was sure, and I confirmed that I wanted it deleted.

Unfortunately it appears that I screwed up and actually asked it to delete unit 0, which it did. Unsurprisingly, Linux immediately started reporting disk errors.

When I realized what I’d done, I paused to think about it for a moment. I do have a full backup of the system, but it’s more than a month old, and I’d rather not have to revert to that. Deleting a RAID unit probably only means changing a little bit of metadata on the drives, so the filesystems were probably all still intact. The metadata format isn’t documented, but maybe there’s a way to reconstruct it.

I called 3ware tech support at around 1 PM. The engineer listened to my problem description, and said that it was possible that they might be able to help me, if the server had a floppy drive. They would email me some scripts which I could put on a floppy disk and run on the machine. The scripts would pull various data out of the RAID controller and from the drives, write to log files on the floppy, and I would email those log files to support. They would analyze them, and generate new scripts which would rewrite the appropriate metadata back onto the drives, reintegrating them as a RAID 5.

Since my email goes through the server in question, I had them send the scripts to an alternate email address. An hour and a half later, I hadn’t received them yet, so I called back. I spoke to a different engineer, gave him yet another alternate email address, and he sent the scripts. I received them almost instantly.

The server does have a floppy drive, but none of my other commonly used machines does. I had to search for my USB floppy drive, which I hadn’t used recently. Once I found that, I realized that I didn’t have any diskettes handy, so I went to the local office supply store to buy a pack. Finally I got MS-DOS 6.22 written to one floppy and the 3ware files to another.

I went to the colocation facility, booted MS-DOS, swapped floppies, and ran the 3ware script. It chugged along for a few minutes, with various hexadecimal data scrolling by on the display and being written to log files on the floppy. When it was done, I drove back, and it was about 4 PM when I emailed the files back to the 3ware engineer. He’d told me that I needed to email the files as individual attachments, rather than putting them in a ZIP file or the like, because their corporate mail server would drop ZIP files as being potential malware.

After a few minutes I decided to call 3ware again to confirm that they’d received the files. The engineer confirmed that they had. I asked when I might expect that they would have the recovery script for me, and he said they’d analyze the logs on Monday. I really didn’t like the idea of waiting until Monday, but 3ware was trying to do me a favor and offer me support that they were in no way obligated to provide, so I certainly couldn’t complain. It would have been entirely reasonable for them to tell me that there was no way to recover the RAID and that I’d have to restore from my backup.

I still didn’t like having the server down that long. It acts as the DNS and mail server for my own domains, and for those of a few friends. There’s nothing enterprise-critical about my own stuff, but I was providing DNS for a business a friend works for, and if the server was down that long the backup DNS might also time out, which would cause problems for him. In hindsight, I should have logged into the backup DNS machine and changed all the slave zones into masters temporarily to prevent that problem, but I didn’t think of it until too late.

I started thinking about whether it was possible to recover the data myself. The basic concept of a RAID 5 is actually pretty simple, so I might be able to figure out the details of the on-disk layout of the data, and write a program to extract it. I decided to give it a try; as long as I didn’t modify the contents of the original RAID drives, if my own recovery attempt failed, I’d still be able to use the script from 3ware support.

First, I needed to reverse-engineer the exact details of how the 3ware controller organized the user data on the drives in the array. I wrote a program “r5test” that could fill a file or disk with known patterns. The program writes 512-byte records (exactly the size of a disk block), each containing a recognizable string “-r5test-” and an eight-byte sequential record (or block) number, with the remainder of the record filled with pseudorandom values seeded from the block number. Another function in the program could verify that a file or disk contained this pattern. I tested the program by writing a file, examining a hex dump of the file, then running the verify function.
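A record format like the one just described can be sketched in a few lines of Python. This is not the actual r5test code: the byte order of the block number, the choice of pseudorandom generator, and the function names `make_record`, `write_pattern`, and `first_bad_block` are all assumptions made for illustration.

```python
import random
import struct
from typing import BinaryIO, Optional

BLOCK_SIZE = 512          # one disk block, as in the post
MAGIC = b"-r5test-"       # the recognizable 8-byte string from the post


def make_record(block_number: int) -> bytes:
    """Build the 512-byte test record for one block."""
    # Assumption: the block number is stored as a little-endian 64-bit integer.
    header = MAGIC + struct.pack("<Q", block_number)
    # Assumption: Python's random module stands in for the unspecified PRNG.
    rng = random.Random(block_number)
    filler = bytes(rng.getrandbits(8) for _ in range(BLOCK_SIZE - len(header)))
    return header + filler


def write_pattern(f: BinaryIO, block_count: int) -> None:
    """Fill a file or device with consecutive test records."""
    for n in range(block_count):
        f.write(make_record(n))


def first_bad_block(f: BinaryIO, block_count: int) -> Optional[int]:
    """Return the number of the first block that fails to verify, or None."""
    for n in range(block_count):
        if f.read(BLOCK_SIZE) != make_record(n):
            return n
    return None
```

Because every record carries its own block number and the filler is derived from that number, any single block found on a raw drive identifies which logical block it is, which is what makes the pattern useful for working out an unknown layout.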

I went to Central Computer, a local retailer, to buy four inexpensive 160GB SATA drives for the reverse-engineering project, one 1TB SATA drive to copy the real RAID data onto, and four Antec Veris MX-1 USB-SATA external drive cases to hook up the drives to my desktop computer at home. I already had one MX-1 not currently in use. I like the MX-1 because they have a fan to keep the drives cool; despite claims of advanced cooling features, most external USB cases let drives get extremely hot.

At home, I made two sets of labels R0 to R4 and S0 to S3 to use for the live RAID drives and the scratch drives, and put labels on the scratch drives.

I took the 160GB drives and another scratch drive to the colo. I powered down the machine, unplugged the ethernet, pulled the RAID drives out, and carefully labeled them R0-R4 based on the RAID port number (R4 having been the hot spare). I put the 160GB drives into the RAID, and hooked up the scratch drive to a motherboard SATA port. I installed Fedora Core 6 x86-64 on the scratch drive and booted it. FC6 is out of date, but that was the DVD that I had at hand, and it was fine for this purpose. In fact, FC6 seems to install much faster than Fedora 7 or Fedora 8, so it was actually a very good choice. Once FC6 was running, I configured the ethernet, verified outside connectivity and that the SSH daemon was running, and went home.

From home, I copied the r5test sources to the server, compiled them, and started them running. Although I didn’t necessarily need the entire RAID 5 array (about 480GB effective capacity) filled with the test pattern, I let it run all evening until the array was full.

In the meantime, I started adding more functions to r5test. The next requirement would be to verify that the test pattern could be read from the individual RAID drives. I didn’t yet know the exact format, but I wrote code based on my best guess as to what the format would be, with the expectation that it might be different but probably not so different that I couldn’t tweak the code as appropriate.

I also added a function that could write the test pattern onto multiple drives or files in the expected RAID 5 organization, in order to test the reader.

I made a trip back to the colo to pull the 160GB drives, brought them home, and installed them into the MX-1 cases. I put the second set of S0-S3 labels on the cases in order to keep from getting the drives confused. I hooked them up to the desktop machine, one at a time, making note of the device assigned by the kernel (e.g., /dev/sdg). Finally I started looking at them with a hex editor.

From prior experience, I believed that the 3ware controller’s metadata (DCB) was stored near the end of the drive, and that user data started at block 0. The hex editor confirmed this, and that the distribution of the user data was exactly what I’d expected! I ran the RAID 5 pattern checker for a while (but not to completion on the entire drive), and it confirmed that both the test data pattern and the parity were good, and that the parity rotation worked as expected. Stripe n has parity on drive n mod drive_count.
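The parity rule above, combined with the 128-block stripe unit implied by the block numbers 0, 128, and 256 mentioned later in the post, is enough to sketch a mapping from a logical block to a physical location. The function names are hypothetical, and the rule that data units fill the non-parity drives in ascending drive order is an assumption that happens to agree with the drive order found later in the post, not a documented property of the controller.

```python
STRIPE_UNIT_BLOCKS = 128   # 64 KB per drive per stripe, inferred from the post


def parity_drive(stripe: int, drive_count: int = 4) -> int:
    """Stripe n has parity on drive n mod drive_count (stated in the post)."""
    return stripe % drive_count


def locate(logical_block: int, drive_count: int = 4,
           unit: int = STRIPE_UNIT_BLOCKS) -> tuple:
    """Map a logical block to (drive index, block number on that drive)."""
    data_drives = drive_count - 1
    unit_index, offset = divmod(logical_block, unit)
    stripe, slot = divmod(unit_index, data_drives)
    p = parity_drive(stripe, drive_count)
    # Assumption: data units occupy the non-parity drives in ascending order.
    drive = slot if slot < p else slot + 1
    return drive, stripe * unit + offset
```

Under these assumptions, `locate(0)` gives drive 1, block 0, because drive 0 holds the parity for stripe 0, and `locate(384)` gives drive 0, block 128, the first data unit of stripe 1.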

While the pattern checker was running, I wrote the code that would actually extract live RAID 5 data, verify the parity, and write the data onto a target drive. I tested that by running it from the 160GB drives onto a file (again, not continuing to completion), then running the pattern checker on that file. It all seemed to work, so it was time to try the live RAID drives.
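The extract-and-verify step can be sketched roughly as follows. This is an illustration, not the r5test source: `extract` and `xor_units` are hypothetical names, the 128-block stripe unit is inferred from block numbers given later in the post, and the data units are assumed to sit on the non-parity drives in ascending order. The parity check relies on a basic RAID 5 property: the XOR of every unit in a stripe, parity included, is all zeros.

```python
from functools import reduce

UNIT_BYTES = 128 * 512   # stripe unit size inferred from the post


def xor_units(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    n = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return n.to_bytes(len(a), "big")


def extract(drives, dest, stripe_count: int, unit_bytes: int = UNIT_BYTES) -> None:
    """Copy user data from member drives (in array order) to dest.

    drives: binary file objects opened read-only; dest: writable file object.
    Raises ValueError on the first stripe whose parity does not check out.
    """
    for stripe in range(stripe_count):
        units = []
        for d in drives:
            d.seek(stripe * unit_bytes)
            units.append(d.read(unit_bytes))
        if any(len(u) != unit_bytes for u in units):
            raise EOFError(f"short read in stripe {stripe}")
        # XOR across the whole stripe must be zero if data and parity agree.
        if any(reduce(xor_units, units)):
            raise ValueError(f"parity mismatch in stripe {stripe}")
        parity = stripe % len(drives)   # rotation rule stated in the post
        for index, unit in enumerate(units):
            if index != parity:
                dest.write(unit)
```

Reading every drive for every stripe costs one extra read per stripe compared with skipping the parity unit, but it means a wrong guess about the layout or a bad read shows up immediately instead of silently corrupting the copy.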

I unplugged the MX-1 cases, removed the 160GB drives, and set them aside. I have no further immediate need for them, but perhaps they’ll be useful as scratch drives in the future. I put the RAID drives into the cases, again being careful to maintain the order and note the Linux device names. I used chmod on each device to set the permissions to 444, so the drives would be read-only. I really wish the drives or the cases had a write-protect jumper to do this in hardware. I also put the 1TB drive into a case, and hooked it up.

With four drives in the RAID 5 array, there are 24 possible permutations of ordering. I’d hoped that they would be ordered the same way as the scratch drives, but examination of hex dumps suggested otherwise. On two drives, R0 and R1, block 0 contained what appeared to be a valid master boot record (MBR). Block 0 of the other two drives contained zeros. This is consistent with my expectations: three drives should start with RAID blocks 0, 128, and 256, and one drive should hold the parity of those blocks. Blocks 128 and 256 are part of logical cylinder zero, so they are expected to contain zeros, and thus the first parity block should match RAID block 0, the MBR. This reduces the possible permutations to only four: R0 and R1 hold block 0 and its parity in one order or the other, and the remaining two drives hold blocks 128 and 256 in one order or the other. But how to determine which is correct?
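The counting argument can be checked mechanically. The sketch below enumerates all orderings of the four drives and keeps those in which the first two array positions (stripe 0's parity and RAID block 0, under the layout found on the scratch drives) are occupied by the two drives whose block 0 looks like an MBR. The function name `candidate_orders` is hypothetical.

```python
from itertools import permutations

DRIVES = ["R0", "R1", "R2", "R3"]
HAS_MBR = {"R0", "R1"}   # drives whose block 0 looked like an MBR


def candidate_orders(drives, has_mbr):
    """Orderings whose first two positions hold exactly the MBR-like drives."""
    return [p for p in permutations(drives) if set(p[:2]) == set(has_mbr)]


if __name__ == "__main__":
    print(len(list(permutations(DRIVES))), "orderings in total")
    for order in candidate_orders(DRIVES, HAS_MBR):
        print(order)
```

The same filter would scale to a larger array, although with many drives the hex dumps narrow things down far less and some other distinguishing evidence would be needed.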

The first partition on the RAID was a 4GB /boot partition, so I decided to simply use my recovery function to extract about 5GB from the drives in each of the four possible permutations, and use the e2fsck filesystem checker on each. For the (0, 1, 2, 3) permutation, e2fsck reported only minor errors. For (0, 1, 3, 2), e2fsck reported many major errors, and for (1, 0, 2, 3) and (1, 0, 3, 2) e2fsck reported one major error so serious that it couldn’t proceed. This convinced me that (0, 1, 2, 3) was the correct permutation.

I started the recovery program, running as a user rather than root so it would respect the block device permissions. I’d put in code that printed a progress update every second or so, and that let me estimate that it would take 14 hours to complete. (It might have run faster if I’d arranged to plug the destination 1TB drive into a different USB host controller than the source drives.)

After about six hours, I noticed that the program had aborted with an error, though it was not clear why. I was somewhat disturbed to see that the permissions on the block device files had reverted to 600, meaning that they were read-write for root and no access for other users. That shouldn’t have caused any problem; changing the permissions on a file or device doesn’t affect a program that already has it open. Maybe something in GNOME “fixed” the permissions for me, or maybe a cron job did it?

Then I noticed that one of the device files was missing. Sure enough, /proc/scsi/scsi now only showed three of the four source drives. The blue LEDs on all four source drives were still on, though. I ended up disconnecting all the drives, and reconnecting them in sequence again.

I didn’t want to start the recovery over again from the beginning, losing the six hours already spent, so I modified r5test to allow it to start partway into the process, based on the progress numbers the previous invocation had last printed (minus a few thousand blocks). I started it up again, and went to bed.

When I got up, I found that it still had a few hours to go. I waited for it to finish, and then was shocked to find that the destination drive did not have a valid partition table. Investigation revealed that when I restarted the program, it did seek to the correct block on the source drives, but not on the destination drive. I’d made a mistake in where I put the destination seek in the code, and it never happened. I’d have to start it over from the beginning.
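A resume of this kind has to account for the fact that the source and destination positions for a given stripe are different: each source drive advances by one stripe unit per stripe, while the destination advances by one unit for every data drive. A minimal sketch, with hypothetical function names and the 128-block stripe unit inferred from the post:

```python
UNIT_BYTES = 128 * 512   # stripe unit size inferred from the post


def resume_offsets(stripe: int, drive_count: int = 4,
                   unit_bytes: int = UNIT_BYTES) -> tuple:
    """Byte offsets at which to resume: (on each source drive, on the destination)."""
    source_offset = stripe * unit_bytes
    dest_offset = stripe * (drive_count - 1) * unit_bytes
    return source_offset, dest_offset


def seek_for_resume(drives, dest, stripe: int,
                    unit_bytes: int = UNIT_BYTES) -> None:
    """Position every source drive and the destination before resuming."""
    source_offset, dest_offset = resume_offsets(stripe, len(drives), unit_bytes)
    for d in drives:
        d.seek(source_offset)
    # Without this seek, resumed data lands at the start of the destination.
    dest.seek(dest_offset)
```

Resuming on a whole-stripe boundary a little before the last reported position keeps the arithmetic simple and rewrites only data that was already copied correctly.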

I did start it over, and a few hours into the new run it hit an error again. I fixed the bug in r5test, and started it again. This time, a few minutes after I restarted it I checked the MBR on the destination drive, and it looked fine. I checked up on progress from time to time, but there were no further errors.

The process completed around 3:45 AM Monday morning. I took the 1TB drive back to the colo and attached it to the motherboard SATA port. Now for the moment of truth: I turned the machine on. It went through the normal boot sequence, and everything looked fine. I logged in, and it seemed that all the daemons were running properly. I verified the network connectivity, and went home.

As of 5:51 PM on Monday, I haven’t yet heard back from 3ware about the scripts they were planning to send me. I’ll email them and tell them that I no longer need the scripts, and thank them for offering to help.

At some point I’ll clean up r5test and release it with a GPL license. I’m not sure how likely it is that anyone else might need it, but if someone does, it would be nice if they didn’t have to write their own program from scratch. While the on-disk format might not be identical for other RAID controllers, it should be similar enough that r5test can be easily modified to be usable.

I should emphasize again that this problem was entirely of my own devising, and in no way reflects negatively on 3ware. The 3ware products and their customer service are absolutely exemplary. I will continue to use 3ware products in the future, and I highly recommend them to anyone needing high-quality RAID controllers.


4 Responses to “RAID Capacity Expansion, the Suboptimal Way, or How Not to Have a Fun Weekend”  

  1. 1 Vince

    Nice scary horror story! And good hack to bring the data back to life :-)

    Are your plans to release r5test out just theoretical at the moment? I’d actually be happy to put my hands on r5test; i just had bad luck with a 12-drive raid5 array on a 3ware controller and i might have to use your trick to recover the data. (In my case i removed a failing drive from the array, shut down the power, swapped the drive with a new one, powered back up and in the process a 2nd drive decided to disappear… I’m hoping 3ware will come up with scripts to reinject the failing-but-not-quite-dead drive into the array but if not, r5test will be my friend.)

    Thanks!

  2. 2 Eric

    As it currently exists, r5test assumes that the drives all contain valid data. It doesn’t try to do reconstruction from a failed drive. I expect that it wouldn’t be too difficult to add that feature. However, dealing with two failed drives is in general not possible, so you’ll have to hope that the second failed drive hasn’t actually lost your data.

    What I’d do first, regardless of whether you want to get help from 3ware or hack on r5test, is try to make an exact block-for-block copy of the second failed drive. With Linux/BSD/Solaris, etc, it is trivial to do this with dd. (I’d probably do that for all the drives.)

    If you want to try hacking r5test, such as it is, send me email, and I’ll send you the code. Note that you’ll still have to determine somehow what order the 12 drives had within the array, and that seems pretty challenging without knowing the details of the 3ware DCB format. Fortunately since you didn’t do anything dumb like I did, the DCBs on all the drives should be intact. I think 3ware support is much more promising at this point than r5test.

  3. 3 netproteus

    I’ve got a similar problem. A disk has completely failed and I’ve managed to screw up the DCB on one of my good drives.

    3ware are being a bit rubbish and generally very slow. A copy of your r5test program as a starting point to manually recover the data would be a real help.

  4. 4 Eric

    The source code for r5test is now available at
    http://www.brouhaha.com/~eric/software/r5test/
