Archive for the ‘Uncategorized’ Category

Never, ever, EVER compliment your ISP!

June 1st, 2009 Rick Ross 2 comments

If you ever need proof that Murphy is always around, just make the silly mistake of complimenting your ISP. Within a short amount of time you will experience a network outage. The severity of the outage will, of course, be directly proportional to the importance of the person you were addressing when you gave your ISP the compliment. The length of time until the outage will usually be inversely proportional!

This is like the internet version of the 1992 horror movie “Candyman,” based on a Clive Barker story. In that movie, the legend was that if you looked at yourself in the mirror and said “Candyman” five times, then a brutal killer would appear out of nowhere and end your miserable existence.

If you feel like tempting fate, just find the nearest mirror and say “My ISP is awesome and never goes down!” five times, with feeling. You’re almost certain to spend the next several hours regretting your unabashed foolishness!

Never, ever, EVER compliment your ISP!

Mysterious Linux RAID Problems at DZone

May 31st, 2009 Rick Ross 3 comments

On New Year’s Day, 2009, we deployed a huge upgrade to the DZone data center servers, and for the most part performance has been fantastic since then. Every once in a while, however, we have a peculiar problem with RAID storage that causes one of our primary arrays to drop offline, with obvious adverse consequences for the rest of our systems. It’s driving us crazy, especially when it happens at 3 AM on a Sunday morning like it did today! I’d love to get your help to pinpoint the source of this problem and fix it.

The machine in question is running 64-bit Ubuntu 8.10 (kernel 2.6.27-14-generic) on a Supermicro 6025W-NTR+B with dual quad-core Xeons, 32 GB of RAM, a pair of 64 GB Samsung SLC solid state drives in RAID 0, eight Western Digital VelociRaptor WD3000HLFS 300 GB drives in RAID 5, a 1 TB Western Digital Caviar Black (WD1001FALS) to boot from, and an LSI SAS3081E-R 8-port SAS/SATA card. I mention all the parts because it could be that something in this particular combination is causing the problem. As far as I know, all the drivers are current.
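As a rough sanity check on that storage layout, here is a minimal Python sketch of the usual RAID arithmetic. The function names are made up for illustration, the sizes are the nominal drive capacities from the parts list, and filesystem and metadata overhead are ignored.

```python
def raid0_capacity_gb(drive_gb, drives):
    """RAID 0 stripes across every drive: full capacity, no redundancy."""
    return drive_gb * drives


def raid5_capacity_gb(drive_gb, drives):
    """RAID 5 spends one drive's worth of space on parity."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return drive_gb * (drives - 1)


def raid5_is_online(total_drives, working_drives):
    """RAID 5 keeps running with at most one member missing."""
    return working_drives >= total_drives - 1


print("SSD RAID 0 capacity (GB):", raid0_capacity_gb(64, 2))
print("RAID 5 capacity (GB):", raid5_capacity_gb(300, 8))
print("RAID 5 online with 7 of 8 drives:", raid5_is_online(8, 7))
print("RAID 5 online with 5 of 8 drives:", raid5_is_online(8, 5))
```

The last two lines are the important ones: an eight-drive RAID 5 rides out a single dropped member, but not two or more dropping at once.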

Overall, it’s a powerhouse machine that really delivers the goods. It runs the master databases for most of our services and has numerous front-end servers and a slaved database backup hitting it all the time. With 8 powerful Xeon cores available, this baby rarely breaks a sweat. I think the SSDs were an especially helpful addition, providing very low-latency read/write access to drive high database IOPS numbers. We load a lot into RAM, so with 16 DIMM slots the Supermicro motherboard leaves us room to add more RAM if we need it.

Anyway, last night we had one of these failures at about 3 AM, and it’s hard to make sense of what really happened. Here’s an excerpt from the machine’s syslog at what seems to be the point of the original error. There’s not a lot of explanation about how or why the attempted writes failed:

May 31 03:04:23 galactica kernel: [3861092.890630] end_request: I/O error, dev sde, sector 586067135
May 31 03:04:23 galactica kernel: [3861092.890642] md: super_written gets error=-5, uptodate=0
May 31 03:04:23 galactica kernel: [3861092.890646] raid5: Disk failure on sde1, disabling device.
May 31 03:04:23 galactica kernel: [3861092.890647] raid5: Operation continuing on 7 devices.
May 31 03:04:23 galactica kernel: [3861092.907339] end_request: I/O error, dev sdh, sector 586067135
May 31 03:04:23 galactica kernel: [3861092.907353] md: super_written gets error=-5, uptodate=0
May 31 03:04:23 galactica kernel: [3861092.907357] raid5: Disk failure on sdh1, disabling device.
May 31 03:04:23 galactica kernel: [3861092.907358] raid5: Operation continuing on 6 devices.
May 31 03:04:23 galactica kernel: [3861092.913925] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861092.913931]  --- rd:8 wd:6
May 31 03:04:23 galactica kernel: [3861092.913934]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861092.913938]  disk 1, o:0, dev:sde1
May 31 03:04:23 galactica kernel: [3861092.913940]  disk 2, o:1, dev:sdf1
May 31 03:04:23 galactica kernel: [3861092.913941]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861092.913943]  disk 4, o:0, dev:sdh1
May 31 03:04:23 galactica kernel: [3861092.913945]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861092.913946]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861092.913948]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica kernel: [3861092.953761] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861092.953771]  --- rd:8 wd:6
May 31 03:04:23 galactica kernel: [3861092.953773]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861092.953777]  disk 1, o:0, dev:sde1
May 31 03:04:23 galactica kernel: [3861092.953779]  disk 2, o:1, dev:sdf1
May 31 03:04:23 galactica kernel: [3861092.953781]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861092.953783]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861092.953785]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861092.953787]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica kernel: [3861092.953799] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861092.953800]  --- rd:8 wd:6
May 31 03:04:23 galactica kernel: [3861092.953802]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861092.953803]  disk 1, o:0, dev:sde1
May 31 03:04:23 galactica kernel: [3861092.953805]  disk 2, o:1, dev:sdf1
May 31 03:04:23 galactica kernel: [3861092.953807]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861092.953808]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861092.953810]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861092.953811]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica kernel: [3861092.973757] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861092.973762]  --- rd:8 wd:6
May 31 03:04:23 galactica kernel: [3861092.973764]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861092.973766]  disk 2, o:1, dev:sdf1
May 31 03:04:23 galactica kernel: [3861092.973769]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861092.973770]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861092.973771]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861092.973773]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica kernel: [3861092.973802] Buffer I/O error on device md1, logical block 64134586
May 31 03:04:23 galactica kernel: [3861092.973806] lost page write due to I/O error on md1
May 31 03:04:23 galactica kernel: [3861093.187738] end_request: I/O error, dev sdf, sector 586067135
May 31 03:04:23 galactica kernel: [3861093.187752] md: super_written gets error=-5, uptodate=0
May 31 03:04:23 galactica kernel: [3861093.187756] raid5: Disk failure on sdf1, disabling device.
May 31 03:04:23 galactica kernel: [3861093.187757] raid5: Operation continuing on 5 devices.
May 31 03:04:23 galactica kernel: [3861093.195898] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861093.195901]  --- rd:8 wd:5
May 31 03:04:23 galactica kernel: [3861093.195903]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861093.195905]  disk 2, o:0, dev:sdf1
May 31 03:04:23 galactica kernel: [3861093.195907]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861093.195908]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861093.195910]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861093.195911]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica mdadm[9177]: Fail event detected on md device /dev/md1, component device /dev/sde1
May 31 03:04:23 galactica kernel: [3861093.396258] RAID5 conf printout:
May 31 03:04:23 galactica kernel: [3861093.396263]  --- rd:8 wd:5
May 31 03:04:23 galactica kernel: [3861093.396266]  disk 0, o:1, dev:sdd1
May 31 03:04:23 galactica kernel: [3861093.396268]  disk 3, o:1, dev:sdg1
May 31 03:04:23 galactica kernel: [3861093.396269]  disk 5, o:1, dev:sdi1
May 31 03:04:23 galactica kernel: [3861093.396271]  disk 6, o:1, dev:sdj1
May 31 03:04:23 galactica kernel: [3861093.396273]  disk 7, o:1, dev:sdk1
May 31 03:04:23 galactica mdadm[9177]: Fail event detected on md device /dev/md1, component device /dev/sdh1
May 31 03:04:23 galactica mdadm[9177]: Fail event detected on md device /dev/md1, component device /dev/sdf1
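A log like this is easier to reason about once it is boiled down to a summary. The following is a minimal Python sketch, not a general-purpose syslog parser: the regular expressions are written only against the message shapes visible in the excerpt above, and the `summarize` helper and `SAMPLE` name are made up for illustration.

```python
import re

# Patterns for three kinds of kernel messages that appear in the excerpt.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
DISK_FAILURE = re.compile(r"raid5: Disk failure on (\w+), disabling device")
CONF_STATE = re.compile(r"--- rd:(\d+) wd:(\d+)")


def summarize(lines):
    """Collect I/O errors, failed members, and the last rd/wd counts seen."""
    summary = {"io_errors": [], "failed": [], "raid_disks": None, "working": None}
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            summary["io_errors"].append((match.group(1), int(match.group(2))))
            continue
        match = DISK_FAILURE.search(line)
        if match:
            summary["failed"].append(match.group(1))
            continue
        match = CONF_STATE.search(line)
        if match:
            summary["raid_disks"] = int(match.group(1))
            summary["working"] = int(match.group(2))
    return summary


# A few lines copied verbatim from the excerpt above.
SAMPLE = """\
May 31 03:04:23 galactica kernel: [3861092.890630] end_request: I/O error, dev sde, sector 586067135
May 31 03:04:23 galactica kernel: [3861092.890646] raid5: Disk failure on sde1, disabling device.
May 31 03:04:23 galactica kernel: [3861092.907339] end_request: I/O error, dev sdh, sector 586067135
May 31 03:04:23 galactica kernel: [3861092.907357] raid5: Disk failure on sdh1, disabling device.
May 31 03:04:23 galactica kernel: [3861092.913931]  --- rd:8 wd:6
May 31 03:04:23 galactica kernel: [3861093.187738] end_request: I/O error, dev sdf, sector 586067135
May 31 03:04:23 galactica kernel: [3861093.187756] raid5: Disk failure on sdf1, disabling device.
May 31 03:04:23 galactica kernel: [3861093.195901]  --- rd:8 wd:5
""".splitlines()

result = summarize(SAMPLE)
sectors = {sector for _, sector in result["io_errors"]}
print("failed members:", result["failed"])
print("distinct error sectors:", sorted(sectors))
print("working/total:", result["working"], "/", result["raid_disks"])
```

Run against these lines, the summary shows three members (sde1, sdh1, sdf1) dropped within about 300 milliseconds, leaving 5 of 8 working. All three I/O errors report the same sector number, and each is followed by a `super_written` error in the full excerpt. That hints, but does not prove, that the failing writes were md superblock updates rather than ordinary data writes.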

The failure of the RAID array caused pretty much everything else to flop over. In fact, it’s always a dilemma whether to switch over to our slaved backup, because we’ll then have trouble getting the two back in sync when we bring the main RAID back online. In this case we decided to bring the main RAID back online ASAP and did not switch over to the slaved backup. The following command added the dropped drives back into the RAID with no problem, as if nothing had happened. It made us wonder what the real problem was, since the drives all seemed happy when we re-added them:

mdadm -A /dev/md1 -f -U summaries /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1
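For readers less familiar with mdadm's short flags, here is a small Python sketch that rebuilds the same command with the long option names: `-A` is `--assemble`, `-f` is `--force`, and `-U summaries` is `--update=summaries`. The helper names are made up for illustration, and the sketch only prints the command; it does not run it.

```python
import string


def member_partitions(first="d", last="k"):
    """Return /dev/sdd1 through /dev/sdk1, the eight members named in the post."""
    letters = string.ascii_lowercase
    start, end = letters.index(first), letters.index(last)
    return ["/dev/sd%s1" % letter for letter in letters[start:end + 1]]


def assemble_command(array="/dev/md1"):
    """Build the argument list with long-form mdadm options."""
    return (
        ["mdadm", "--assemble", array, "--force", "--update=summaries"]
        + member_partitions()
    )


# Print only; actually running this requires root and a stopped array.
print(" ".join(assemble_command()))
```

Per the mdadm manual, `--force` lets the assembly proceed even when some members' metadata looks out of date, and `--update=summaries` corrects the device counts stored in the superblock. Forcing members back into an array is worth doing with care, since it overrides exactly the checks that flagged them.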

It’s a bit maddening, since the only corrective action we had to take was to add the drives back into the set. It makes me suspect there’s a timing issue, or some type of edge-case driver problem that we hit once in a blue moon, but once in a blue moon is too often for my taste! I’d really love to eliminate these troubling RAID dropouts and get this machine to rock-solid reliability. If you have any insights about what measures we should take, please share them. Thanks!

It’s complicated to make things easy

February 28th, 2007 Rick Ross No comments

I’ll bet every developer I know has uttered the words, “keep it simple,” but the truth is it’s just not that easy. Simplicity is not the default state of information and processes. More often the opposite holds true, and chaos is the natural order of things, so to speak. Information and processes are made simple only by thoughtful, effective organization and filtering. It can be surprisingly difficult to distill simplicity out of the information slag most of us have to wallow in and manage. Matthew Broderick doesn’t really save the world from imminent destruction by playing tic-tac-toe with an ornery computer, and Chloe from “24” doesn’t really hack into DoD databases on her lunch break to reposition satellites and “patch the information through” to Jack Bauer’s PDA. (What kind of PDA is that, anyway?) Getting a computer to perform any task well can be a grueling challenge. It’s a lot harder than uttering overused admonitions like “keep it simple.”

Of course, while we are struggling to design easy and effective computer solutions, we often have well-wishing friends and colleagues offer helpful suggestions that begin with “Can’t you just…” Occasionally these lead us to a Eureka moment in which the fog slips away, and we suddenly attain clarity. Typically, however, despite the speaker’s good intentions, these “can’t you just…” suggestions will never get us where we’re trying to go. Good solutions to difficult problems, especially new solutions, can be painfully, maddeningly elusive. The general public seems to operate under a widespread illusion that anyone who can successfully plug in and turn on a computer is just a hop, skip and a jump away from being the next Matthew Broderick or Chloe from “24.” You and I know it’s not so simple. It’s complicated to make things easy.

I’m saying all this because for the past several days I have been working on making it easier for you to vote for links at dzone.com. Actually it hasn’t been just the past several days. If I’m being plainly honest I have to admit it has taken much longer than that. It has probably been more like several weeks since I began sketching out new possibilities, but I think I have finally reached a point where it is tolerable, if not easy. The new DZone voting widget should be online by the time next week’s newsletter reaches you. I hope you’ll like it. Participation is the lifeblood of DZone’s “social filtering,” so we’d like to make it really, really easy for you to participate.
