Comments on: Server Outages at Percona

By: Tom Basil

Tom Basil — Thu, 11 Aug 2011 11:15:49 +0000

This is a belated response to reader comments asking for an explanation of how we lost three disks simultaneously on our primary web server in July, and why recovering from it was so difficult for us.

This server had a six-disk RAID5 configured as its only logical drive. One of these disks failed, and we did not replace it in time to prevent what followed. On July 16 the RAID controller itself failed, which affected two more disks of the server’s RAID set. Since RAID5 tolerates the loss of only one disk, the array could no longer protect the data. Our staff connected to the RAID BIOS utility and saw that the RAID disks were flapping between the “asserted” and “deasserted” states, which is Dell terminology for in/out of sync. Because the RAID controller kept trying to use the disks, and because EXT3 was not set to remount read-only on error (remount-ro), writes continued and essential files became badly damaged. Once our staff gained physical access to the server, the filesystem could be mounted from a bootable rescue CD and used, but with errors in many files.
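As an illustration of the remount-ro point, here is a minimal sketch of a check that flags ext3 entries in an fstab-style file that do not carry the `errors=remount-ro` mount option. The function name and the sample fstab contents are hypothetical and are not taken from the server described here. The check is also only partial, since the default error behavior can be stored in the filesystem superblock (for example with `tune2fs -e`) rather than in fstab.

```python
def ext3_entries_without_remount_ro(fstab_text):
    """Return mount points of ext3 entries lacking errors=remount-ro.

    Hypothetical helper for illustration; it only inspects fstab options,
    not the error behavior stored in the filesystem superblock.
    """
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fstab fields: device, mount point, type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "ext3":
            continue
        options = fields[3].split(",")
        if "errors=remount-ro" not in options:
            missing.append(fields[1])
    return missing


# Made-up sample data, not the real configuration of any server.
SAMPLE_FSTAB = """\
# device   mountpoint  type  options                     dump pass
/dev/sda1  /           ext3  defaults,errors=remount-ro  0    1
/dev/sda2  /var        ext3  defaults                    0    2
/dev/sda3  none        swap  sw                          0    0
"""

print(ext3_entries_without_remount_ro(SAMPLE_FSTAB))
```

With the sample data, only the `/var` entry is reported, because it is the one ext3 line without the option.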

Recovering from this unhappy failure would have been less of a problem if our backups had been in the condition we needed them to be. They were not. But this was more a human failure than a mechanical one, as crucial error notifications were missed due to staff turnover. It is a reminder that it’s hard to be overly attentive or too paranoid about the state of your backups.

By: Nils

Nils — Fri, 29 Jul 2011 15:27:23 +0000

I had a few close calls myself with disks failing merely hours after one another.

By: Justin Rovang

Justin Rovang — Thu, 28 Jul 2011 21:07:32 +0000

The complexity and level of redundancy is a matter of criticality – in the big picture the MPB site probably doesn’t require anything beyond a reasonable backup setup and maybe some simplistic caching. 3 drives in a go is a damn curious matter though; still curious!

By: Patrick Sciaroni

Patrick Sciaroni — Mon, 25 Jul 2011 17:09:25 +0000

Anything mechanical can and will fail, but three disks at once? At carbonlogic.com, we haven’t ever seen that in the decade that we have been offering economical, Tier 3 hosting. If you like having direct 24 hour access to your assigned Network Engineer, and need extreme availability, we are the logical choice. Let us prepare a quote for you, to see if we can take some of the hassle out of your life!

By: George

George — Fri, 22 Jul 2011 04:32:18 +0000

Ouch, I was wondering where the Percona forums disappeared to! Hope you get everything sorted out. But yes, applicable lessons learnt would be great to read about.

My web host had a RAID controller failure, but they used R1Soft CDP/MySQL backup and just restored a recovery point (entire OS, data etc) to a totally new server and were up and running very quickly.

By: William

William — Wed, 20 Jul 2011 23:29:32 +0000

Downtime always sucks, regardless of when it happens or how many customers are affected. But it is fascinating to study the causes of outages. If there are applicable lessons, they would be interesting to hear about. However, I completely understand if you don’t want to divulge your infrastructure details on a public forum.

By: Baron Schwartz

Baron Schwartz — Wed, 20 Jul 2011 21:40:01 +0000

FYI, Tom’s traveling right now and won’t be available to respond to comments until next week.

By: Michael

Michael — Wed, 20 Jul 2011 19:20:06 +0000

Why was there no failover system?

By: Patrick Casey

Patrick Casey — Wed, 20 Jul 2011 17:53:08 +0000

If it’s a RAID 10 and nobody was watching the disk alarms, it’s possible they lost the disks over a period of months.

Lose disk 0 left … 0 right is still ok, raid up
Lose disk 1 right … 1 left is still ok, raid up
Lose disk 0 right … 0 left is already down, raid down

I’ve lost servers that way before (embarrassing, but true)
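The sequence above can be sketched as a small simulation. This is only an illustration under stated assumptions: a RAID 10 made of two mirror pairs, with hypothetical disk labels such as "0-left" and "0-right", and function names invented for the example. The array stays up until some mirror pair has lost both of its members.

```python
def raid10_is_up(failed, pairs):
    """A RAID 10 survives as long as no mirror pair has lost both disks."""
    return all(not (left in failed and right in failed) for left, right in pairs)


def replay_failures(pairs, failures):
    """Apply disk failures in order and record the array state after each."""
    failed = set()
    history = []
    for disk in failures:
        failed.add(disk)
        history.append((disk, raid10_is_up(failed, pairs)))
    return history


# Hypothetical layout: two mirror pairs, labeled as in the comment above.
pairs = [("0-left", "0-right"), ("1-left", "1-right")]
failures = ["0-left", "1-right", "0-right"]

for disk, up in replay_failures(pairs, failures):
    print(f"lose {disk}: raid {'up' if up else 'down'}")
```

The first two failures hit different mirror pairs, so the array keeps running degraded. The third failure takes out the surviving half of pair 0, and the array goes down, which is why unwatched disk alarms are so dangerous.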

By: Justin Rovang

Justin Rovang — Wed, 20 Jul 2011 14:58:08 +0000

Holy hannah! Must’ve been controller related?

By: Edmar

Edmar — Wed, 20 Jul 2011 12:02:39 +0000

Multiple disk failures are extremely important to me as a database/data manager.

Could you tell us whether the 3 disks failed simultaneously, or whether they failed one at a time over a period of days/months, with the RAID degradation going undetected due to absent/faulty monitoring?

Thanks in advance.

By: James Cohen

James Cohen — Wed, 20 Jul 2011 11:08:51 +0000

Sounds like you were unlucky with the disk failures. Can you give any more details as to what caused them to fail? Three at the same time seems very unlucky and unusual.