Mysterious Linux RAID Problems at DZone

by Rick Ross on May 31, 2009

On New Year’s Day, 2009, we deployed a major upgrade to the DZone data center servers, and for the most part performance has been fantastic since then. Every once in a while, however, we have a peculiar problem with RAID storage that causes one of our primary arrays to drop offline, with obvious adverse consequences for the rest of our systems. It’s driving us crazy, especially when it happens at 3 AM on a Sunday like it did today! I’d love to get your help pinpointing the source of this problem and fixing it.

The machine in question is running 64-bit Ubuntu 8.10 (kernel 2.6.27-14-generic) on a Supermicro 6025W-NTR+B with dual quad-core Xeons, 32 GB of RAM, a pair of 64 GB Samsung SLC solid-state drives in RAID 0, eight 300 GB Western Digital VelociRaptor WD3000HLFS drives in RAID 5, a 1 TB Western Digital Caviar Black (WD1001FALS) to boot from, and an LSI SAS3081E-R 8-port SAS/SATA card. I mention all the parts because it could be that something in this particular combination is causing the problem. As far as I know, all the drivers are current.

Overall, it’s a powerhouse machine that really delivers the goods. It runs the master databases for most of our services and has numerous front-end servers and a slaved database backup hitting it all the time. With 8 powerful Xeon cores available, this baby rarely breaks a sweat. I think the SSDs were an especially helpful addition, providing very low-latency read/write access that drives high database IOPS numbers. We load a lot into RAM, and with 16 DIMM slots the Supermicro motherboard leaves us room to add more if we need it.

Anyway, last night we had one of these failures at about 3 AM, and it’s hard to make sense of what really happened. Here’s an excerpt from the machine’s syslog at what seems to be the point of the original error. There’s not a lot of explanation about how or why the attempted writes failed:

May 31 03:04:23 galactica kernel: [3861092.890630] end_request: I/O error, dev sde, sector 586067135
May 31 03:04:23 galactica kernel: [3861092.890642] md: super_written gets error=-5, uptodate=0
May 31 03:04:23 galactica kernel: [3861092.890646] raid5: Disk failure on sde1, disabling device.
May 31 03:04:23 galactica kernel: [3861092.890647] raid5: Operation continuing on 7 devices.
May 31 03:04:23 galactica kernel: [3861092.907339] end_request: I/O error, dev sdh, sector 586067135
May 31 03:04:23 galactica kernel: [3861092.907353] md: super_written gets error=-5, uptodate=0
May 31 03:04:23 galactica kernel: [3861092.907357] raid5: Disk failure on sdh1, disabling device.
May 31 03:04:23 galactica kernel: [3861092.907358] raid5: Operation continuing on 6 devices.
May 31 03:04:23 galactica kernel: [3861092.913925] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861092.913931]  --- rd:8 wd:6
May 31 03:04:23 galactica kernel: [3861092.913934]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861092.913938]  disk 1, o:0, dev:sde1
May 31 03:04:23 galactica kernel: [3861092.913940]  disk 2, o:1, dev:sdf1
May 31 03:04:23 galactica kernel: [3861092.913941]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861092.913943]  disk 4, o:0, dev:sdh1
May 31 03:04:23 galactica kernel: [3861092.913945]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861092.913946]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861092.913948]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica kernel: [3861092.953761] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861092.953771]  --- rd:8 wd:6
May 31 03:04:23 galactica kernel: [3861092.953773]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861092.953777]  disk 1, o:0, dev:sde1
May 31 03:04:23 galactica kernel: [3861092.953779]  disk 2, o:1, dev:sdf1
May 31 03:04:23 galactica kernel: [3861092.953781]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861092.953783]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861092.953785]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861092.953787]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica kernel: [3861092.953799] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861092.953800]  --- rd:8 wd:6
May 31 03:04:23 galactica kernel: [3861092.953802]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861092.953803]  disk 1, o:0, dev:sde1
May 31 03:04:23 galactica kernel: [3861092.953805]  disk 2, o:1, dev:sdf1
May 31 03:04:23 galactica kernel: [3861092.953807]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861092.953808]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861092.953810]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861092.953811]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica kernel: [3861092.973757] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861092.973762]  --- rd:8 wd:6
May 31 03:04:23 galactica kernel: [3861092.973764]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861092.973766]  disk 2, o:1, dev:sdf1
May 31 03:04:23 galactica kernel: [3861092.973769]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861092.973770]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861092.973771]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861092.973773]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica kernel: [3861092.973802] Buffer I/O error on device md1, logical block 64134586
May 31 03:04:23 galactica kernel: [3861092.973806] lost page write due to I/O error on md1
May 31 03:04:23 galactica kernel: [3861093.187738] end_request: I/O error, dev sdf, sector 586067135
May 31 03:04:23 galactica kernel: [3861093.187752] md: super_written gets error=-5, uptodate=0
May 31 03:04:23 galactica kernel: [3861093.187756] raid5: Disk failure on sdf1, disabling device.
May 31 03:04:23 galactica kernel: [3861093.187757] raid5: Operation continuing on 5 devices.
May 31 03:04:23 galactica kernel: [3861093.195898] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861093.195901]  --- rd:8 wd:5
May 31 03:04:23 galactica kernel: [3861093.195903]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861093.195905]  disk 2, o:0, dev:sdf1
May 31 03:04:23 galactica kernel: [3861093.195907]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861093.195908]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861093.195910]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861093.195911]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica mdadm[9177]: Fail event detected on md device /dev/md1, component device /dev/sde1
May 31 03:04:23 galactica kernel: [3861093.396258] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861093.396263]  --- rd:8 wd:5
May 31 03:04:23 galactica kernel: [3861093.396266]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861093.396268]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861093.396269]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861093.396271]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861093.396273]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica mdadm[9177]: Fail event detected on md device /dev/md1, component device /dev/sdh1
May 31 03:04:23 galactica mdadm[9177]: Fail event detected on md device /dev/md1, component device /dev/sdf1
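
When this hits, the quickest way to see what md currently thinks of the array (beyond the syslog above) is roughly the following; this is just a sketch, using the device names from the log and assuming the array is /dev/md1:

cat /proc/mdstat              # shows which members are marked failed (F) and whether md1 is still running
mdadm --detail /dev/md1       # array state plus per-device status
mdadm --examine /dev/sde1     # superblock and event count for an individual member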

The failure of the RAID array caused pretty much everything else to fall over. In fact, it’s always a dilemma whether to switch over to our slaved backup, because we’ll then have trouble getting the two back in sync when we bring the main RAID back online. In this case we decided to bring the main RAID back online as soon as possible and did not switch over to the slaved backup. The following command added the dropped drives back into the RAID with no problem, as if nothing had happened. It made us wonder what the real problem was, since the drives all seemed happy when we re-added them:

mdadm -A /dev/md1 -f -U summaries /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1

It’s a bit maddening, since the only corrective action needed was to add the drives back into the set. It makes me suspect there’s a timing issue, or some type of edge-case driver problem that we hit once in a blue moon, but once in a blue moon is too often for my taste! I’d really love to eliminate these troubling RAID dropouts and get this machine back to rock-solid reliability. If you have any insight into what measures we should take, please share your suggestions. Thanks!
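
For completeness: if only a single member had dropped, the gentler route would presumably have been to remove and re-add just that device and let md rebuild it, rather than force-assembling the whole set. Something along these lines (a sketch only; we did not actually need this today, and /dev/sde1 is just one example member):

mdadm /dev/md1 --remove /dev/sde1    # clear the failed slot
mdadm /dev/md1 --re-add /dev/sde1    # put the member back (a full rebuild unless a write-intent bitmap is configured)
watch cat /proc/mdstat               # follow the recovery progress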

3 comments

rick May 31, 2009 at 4:25 pm

Here’s the smartctl SMART status report for one of the drives, /dev/sde. It looks like there have been some errors logged, but it also seems to say that the drive is passing. I’m not sure what to make of this:

http://pastebin.ca/1442741
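
For anyone who wants to pull the same kind of report from their own drives, it’s essentially smartctl output (the tool comes with smartmontools; depending on your controller you may need its -d option to select the right pass-through):

smartctl -H /dev/sde          # overall health self-assessment
smartctl -A /dev/sde          # vendor attribute table (reallocated sectors, CRC errors, etc.)
smartctl -l error /dev/sde    # the drive's error log, with power-on lifetime timestamps
smartctl -a /dev/sde          # all of the above in one shot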

We noticed that each of the SMART events logged by the drive occurred 49 days, 17 hours, 2 minutes and 41 seconds after the previous event. The same timing occurred on the other drives that dropped out. This cannot be a coincidence.

Topnotch June 1, 2009 at 11:48 am

Rick, I did some quick calculations based on your comment and it looks like there may be some kind of integer overflow error in the device driver that manages the RAID array.

I calculated [1], the interval in seconds, as 49(3600 * 24) + (3600 * 17) + (120 + 41) on the Wolfram|Alpha site:
http://www.wolframalpha.com/input/?i=49(3600+*+24)+%2B+(3600+*+17)+%2B+(120+%2B+41)

and compared it to [2] 2^32
http://www.wolframalpha.com/input/?i=2^32

[1] = 4,294,961
[2] = 4,294,967,296

Multiply [1] by 1,000 (that is, express the interval in milliseconds) and you get 4,294,961,000, which is just 6,296 short of [2]. In other words, the interval between events is almost exactly 2^32 milliseconds.
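
If you want to check the arithmetic yourself, it only takes integer math in bash (a quick sketch):

echo $(( 49*86400 + 17*3600 + 2*60 + 41 ))            # interval in seconds -> 4294961
echo $(( (49*86400 + 17*3600 + 2*60 + 41) * 1000 ))   # the same interval in milliseconds -> 4294961000
echo $(( 2**32 ))                                      # -> 4294967296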

Giorgio Sironi June 2, 2009 at 1:47 pm

“Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It “wraps” after 49.710 days.”
So the documentation itself says there is a 2^32 limit, but if this is a physical limit of a hardware register it won’t be simple to fix. Reproducing the bug won’t be simple either, since you would have to wait 49 days; you would need a way to force the power-up time to start from around 49.5 days.
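
The 49.710-day figure quoted above is exactly what a 32-bit millisecond counter gives you; a quick check with bc:

echo 'scale=3; 2^32 / 1000 / 86400' | bc    # 2^32 ms expressed in days -> 49.710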
