SRE Revelation and RAID of Death
The first half of my career was as an SRE. SRE practices have only recently come into widespread use, but they’re what I always did. I could code, and I understood how computers worked, how the internet worked, and how server and software architecture was done.
One recently popular aspect of failure introspection is, in part, looking at the bigger picture. A component failed or some code errored, but what were the characteristics of the human system that got us to this failure point? If we unwind the stack all the way past meetings and the organization of humans, what do we learn from it?
There was one particular failure a few years into my career that triggered my own thinking beyond “hey, this person screwed up” and into “what was the cosmic combination of circumstances that led to me standing here, staring down a hard drive, wondering whether a company with over 100 employees would exist in the morning”.
With that hell of a sentence, let’s do a war story!
First, apologies to the whole human-cognition crew who think about failure as a higher concept; I’m incapable of understanding what you do. I never achieved enlightenment, nor did I ever have the influence to do anything about organizational problems. This post isn’t about purity of thought; it’s more like a halfway point.
The Stage
I can’t recall exactly when this happened anymore; second half of 2003 or first half of 2004. I had recently moved to New York City for a relatively low-paying tech job as the country slowly recovered from a horrible economic crash. To be more specific, I’d moved to Staten Island because the rent was half. My job was in Manhattan, and I took a bus, a boat, then a train to work. It took an hour and a half to travel roughly five miles.
My first day of oncall was the Great Northeastern Blackout, which led to some interesting problems with our datacenter. Perhaps I’ll write about that some other time.
Our datacenter cages were in Manhattan, near Chinatown. The cloud wouldn’t start to exist for a few more years. Most companies had physical racks of equipment. Companies in the post dot-com crash era may have started skimping a bit on their infrastructure. We certainly did.
Multiple datacenters? Nah. Hardly anyone did that.
Redundant hardware? That’s what you buy Big Name machines for. If your server has a “please do not tilt this vending machine” sticker on it, you’re safe from hardware failures.
Proper rackmount equipment? We did fix this over time, but someone had a brilliant idea of loading 200lbs worth of hardware onto two-post rack shelves rated for 100lbs. They would occasionally give way and dump servers onto hapless sysadmins.
You did what you could with the budget you had and the expertise available to you. None of this (perhaps aside from the literal death trap) was really wrong in particular.
The Failure
One night oncall I got a page about a disk failure in our NFS filer. It began recovering its RAID-5 array onto a hot spare drive.
Sometime after that, I forget if it was minutes or an hour, I got another page. Another disk failure. The array hadn’t recovered yet.
Context is everything here. I’ll slowly fill it in.
NFS Filer
Before S3 and non-POSIX filestores were popular, NFS filers were standard for user uploads. This company ran a couple of websites, but most of the traffic (and most of the revenue) came from a single one. That website had an NFS filer (a nice, expensive NetApp) with 48 drives configured as one large RAID-5 volume. Every user-uploaded file lived on this device. Users may have been upset if it went away.
Of course there weren’t two. You paid the NetApp premium so you wouldn’t need a second one; the first was supposed to be reliable. We used a CDN (Akamai, the only CDN at the time), so performance wasn’t really a concern. Akamai was mind-bendingly expensive at the time.
It hurts my old bones to admit it, but some readers might not have had to use RAID in their lives, so here, use Wikipedia.
RAID-5 is the configuration that preserves the most disk space in an array of drives while still giving you some safety against drive failures. If you have 3 drives in a RAID-5, you lose 1/3rd of the disk space but can survive the loss of any one drive. Thus, in a 48-drive RAID-5, you lose only 1/48th of the space. Talk about financial impact! Whoever did this was primed for a promotion.
When a drive does fail, a new drive is placed into the array and the RAID array is rebuilt live. It would be bad if a drive failed and someone had to physically swap the dead drive before the array could recover, so there were one or two “hot spares” in the filer as well. These drives sat idle, simply waiting to be used.
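To put rough numbers on that trade-off, here’s a back-of-the-envelope sketch in Python. The annual failure rate and the 12-hour rebuild window are assumptions I’m inventing for illustration, not the filer’s actual numbers; the point is the shape, not the decimals.

```python
# Back-of-the-envelope RAID-5 math. All rates and durations here are
# illustrative assumptions, not measurements from the actual filer.

def usable_fraction(n_drives: int) -> float:
    """RAID-5 spends one drive's worth of space on parity, so you keep (n-1)/n."""
    return (n_drives - 1) / n_drives

def second_failure_risk(n_drives: int, rebuild_hours: float,
                        annual_failure_rate: float = 0.03) -> float:
    """Rough chance that any surviving drive fails during the rebuild window.

    Assumes independent failures at a constant rate, which is generous:
    drives from the same batch, baking in the same enclosure, like to
    fail together.
    """
    hourly_rate = annual_failure_rate / (365 * 24)
    survivors = n_drives - 1
    p_one_drive_survives = (1 - hourly_rate) ** rebuild_hours
    return 1 - p_one_drive_survives ** survivors

for n in (3, 8, 48):
    print(f"{n:2d} drives: {usable_fraction(n):.1%} usable, "
          f"~{second_failure_risk(n, rebuild_hours=12):.2%} chance of a "
          f"second failure during a 12-hour rebuild")
```

The bigger the single RAID-5 group, the more cheap capacity you keep, and the more drives you’re betting won’t hiccup while the array is degraded and rebuilding.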
RAID is not a backup. RAID isn’t even redundancy.
Contemplation and Boat Rides
I got on the Staten Island Ferry sometime around midnight. Being relatively new to the job, I’d also called the secondary oncall, and we agreed to meet at the datacenter. Our main website was still up, sort of, but uploads would fail and any uncached images wouldn’t load.
With two dead drives in the filer, there was no way for the RAID to recover. We would need to restore from backup. It wasn’t until the next day that I discovered what our backup situation was, so I’ll get back to that later. I knew it would take days to recover at minimum, with data loss since the last backup.
It felt like a long ride.
The Ineptitude of Knowledge
Having arrived at the datacenter, I took a good look at the NFS filer. In most datacenters you have “hot rows” and “cold rows”. Cold air is pushed into the cold rows, which the fan intakes face. Every other row is a “hot row”, with the machines’ exhausts pointing at each other. Those rows get pretty warm.
Of course we were different. Instead of hot rows and cold rows, we had what I’ll call “progressively hotter rows”, or perhaps a “fan exhaust centipede”. For most rows, the intakes faced someone else’s exhaust.
The NFS filer at least faced a colder row. However, there was a metal divider wall inches from the back of the device. Hot air pooled up behind its rows of screaming SCSI drives, making the wall warm to the touch.
After conferring with the secondary oncall and staring at the red lights marking both dead drives, I picked one of the two drives at random and pulled it out of the filer.
What made me decide to do this? The ultimate combination of all of my knowledge up to that point in my life told me that this was the best course of action.
- I had been logged into the filer and tried to convince it to re-add a drive, re-detect a drive, recover anyway, or any option that might help in any way. Nothing could be done from software.
- I knew how RAID systems worked at a deep level: a “drive failure” could be as simple as a timeout. This is why some SATA drives are marketed with “Time Limited Error Recovery” (TLER): better to have a RAID hiccup than to eject a drive.
- I knew, in general, how a hard drive worked and how it could fail:
- Bad drive head.
- Scratched disk.
- Warped disk.
- Unremappable bad sectors.
- Dead motors. Dying motors.
- Firmware hang!
- Memory corruption! Unstable bits in drive cache causing parity errors!
- CPU faults in the disk controller!
- Bad cable.
- Bad cable contacts.
- Dying power supply components. [popped caps!]
- Went off to lunch trying to heroically recover a bad sector by re-reading it over and over.
I knew about half of these could be ephemeral, at least when a drive has first started to fail. A motor might skip or bits occasionally flip, causing a firmware hang.
So what did I do, with all of this knowledge that I had learned over a lifetime of fiddling with computers? I pulled the damn drive out and stared at it. I counted to 60 a couple times. Letting the drive cool might help its chance of recovery. Waiting for the capacitors to drain would ensure crashed firmware would reset.
What would any reasonable person have done, even without all of this knowledge? I’d wager they would try turning it off and on again, by way of taking the drive out and putting it back in again. Maybe they’d blow on the contacts.
Useless. I thought long and hard about all of the factors that had come together to have me, a junior “sysadmin”, standing in a datacenter in the middle of the night, holding in my hand a hard drive that held the fate of everyone receiving a paycheck from the company.
I put the drive back in.
It Worked!
The dead drive came back to life. The array began recovering onto a hot spare. We ordered a replacement drive from NetApp and went home. Later, I force-failed the once-dead drive and made sure it also got replaced.
The Backups
The next day I asked a coworker where the backups lived. We used tape jukeboxes for backups. These systems were the size of mini-fridges and contained stacks of tapes, 2-4 tape drives, and some robot arms to move the tapes around. Convenient!
We had an offsite rotation of tapes. Every week a truck would come to the datacenter and pick up some tapes that our backup system had designated as going into the offsite rotation. That way if we lost all data we could recover. Maybe.
I found that the backup for the NFS filer had been running, exactly as it was configured to, for 31 days.
Looking at a live view of the backup, I saw it pulling one jpeg at a time, engaging the tape drive, then grabbing another file. This took 5+ seconds per file.
Data Momentum
If you don’t know what a tape cassette is, go find a YouTube video of one.
Why was the backup so slow? The tape drives weren’t great, but they could muster at least a few megabytes per second. These files were hundreds of kilobytes at most, yet took many seconds to write.
Well, tape goes in one direction at a time: backwards or forwards. It moves at a certain speed. The tape head sits in a fixed position, above the center of the tape, in between the two reels.
Tape motors don’t start and stop instantaneously. They might be able to, but that would strain the tape and the motors. More importantly, in the time it takes the computer to decide it wants the tape to stop and to signal the drive, the tape has kept moving at its existing speed.
I should diagram this, but I’m not feeling it right now.
In short: for each file, the tape drive would seek back to an earlier position on the tape, start the motors, write the data, slow the tape to a stop, then reposition it for the next file. For every file, the tape had to start, stop, and change direction several times.
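To get a feel for how bad that is, here’s a rough sketch in Python. The 5 seconds per file is what I watched happen; the file count and average size are guesses I’m making up for illustration.

```python
# Rough effect of per-file tape repositioning. The 5 s/file figure is
# what the backup was actually doing; the file count and average size
# are illustrative guesses, not real numbers from the filer.

seconds_per_file = 5            # seek back, spin up, write, stop
avg_file_bytes = 200 * 1024     # assume ~200 KB per jpeg
n_files = 500_000               # assume half a million uploads

effective_kbps = avg_file_bytes / seconds_per_file / 1024
full_pass_days = n_files * seconds_per_file / 86_400

# Compare with streaming the same data at a modest 3 MB/s.
streaming_hours = n_files * avg_file_bytes / (3 * 1024 * 1024) / 3600

print(f"effective throughput: {effective_kbps:.0f} KB/s")
print(f"one full pass over {n_files:,} files: {full_pass_days:.0f} days")
print(f"same data streamed at 3 MB/s: about {streaming_hours:.0f} hours")
```

The drive spends nearly all of its time repositioning instead of writing.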
Turns out there’s this thing called tar, aka Tape ARchive, which was invented partly to deal with this exact situation. Our backup had been configured by pointing the backup software at a directory of files, so it streamed each file onto the tape in its own on-tape format. It should have been configured to pipe a single tar archive into the tape drive instead. The drive could then buffer a few megabytes of data and keep the tape moving in one direction until the tape was full.
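Here’s a minimal sketch of the streaming idea in Python, using the standard library’s tarfile in stream mode. The device path, buffer size, and upload directory are placeholders, not our actual setup; in practice you’d more likely just pipe tar(1) straight at the drive.

```python
# Minimal sketch: stream a directory to tape as one continuous tar
# archive so the drive can keep moving in one direction.
# The device path, buffer size, and directory are placeholder assumptions.
import tarfile

TAPE_DEVICE = "/dev/nst0"        # non-rewinding tape device (assumed)
BUFFER_BYTES = 4 * 1024 * 1024   # hand the drive large sequential writes

with open(TAPE_DEVICE, "wb", buffering=BUFFER_BYTES) as tape:
    # "w|" is tarfile's write-only stream mode: no seeking, which is
    # exactly what a tape wants.
    with tarfile.open(fileobj=tape, mode="w|") as archive:
        archive.add("/exports/uploads", recursive=True)
```

Because the archive is one continuous stream, the OS and the drive can buffer big chunks and keep the tape moving instead of stuttering on every file.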
Also consider that 95% of the files on the NFS filer never changed. I believe the backup system had options for checksumming and only sending changed files, but they may or may not have been enabled here.
I don’t recall what happened from there. Someone fixed it maybe.
Fault
There’s no point to this. It’s a fun story I tell sometimes: I was smart, but it made no difference. Everyone did their job to spec. It wasn’t anyone’s job to manually ensure the backups actually worked; everyone was busy doing other things. The person who would’ve thought to check the backups worked at a different company.
New servers were configured manually. When I first started, that meant marching to the DC with a floppy disk and a CD (because the servers couldn’t boot from the CD!). Over a year I set up networked PXE boot and automated installations, but imagine all the time you lose when everything is manual!
I figured everyone was doing their job as well as they were able to. They did the job they were hired for, with the context they had. If we wanted to do better, we had to decide that doing better was actually an important thing to do (and what that meant, exactly).
A year later, a coworker was setting up a new disk filer to replace the aging NetApp. After noticing that it was set up as a single 50+ drive RAID-5 array, I asked that they split it into a couple of smaller volumes. They complained bitterly.
End
From the top down, it was more important for the company to ship code and try new avenues of business than it was to ensure it would exist day to day. Of course, if you walked up to the CTO and asked about the status of the backups, he would probably fly into a rage and say “of course it’s your job! Make sure the damn backups work!”
It’s easy to throw blame, hard to find acceptance. Of course he would say that.