Comments on: Server Outages at Percona https://www.percona.com/blog/server-outages-at-percona/ Thu, 11 Aug 2011 11:15:49 +0000

By: Tom Basil https://www.percona.com/blog/server-outages-at-percona/#comment-811603 Thu, 11 Aug 2011 11:15:49 +0000 https://www.percona.com/blog/?p=6362#comment-811603 This is a belated response to reader comments asking for an explanation of how we lost three disks simultaneously on our primary web server in July, and why recovering from it was so difficult for us.

This server had a six-disk RAID5 configured as its only logical drive. One of these disks failed, and we did not replace it in time to prevent what followed. On July 16 the RAID controller itself failed, which affected two more disks of the server’s RAID set. Our staff connected to the RAID BIOS utility and saw that the RAID disks were flapping between the “asserted” and “deasserted” states, which is Dell terminology for in/out of sync. Because the RAID controller kept trying to use the disks, essential files became badly damaged. The damage reached the file system because EXT3 was not set to remount read-only on error (the errors=remount-ro mount option), so writes continued and the file system became corrupted. Once our staff gained physical access to the server, the file system could be mounted from a bootable rescue CD and used, but with errors in many files.
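
For readers who want to check their own servers for the remount-ro gap described above, here is a minimal sketch in Python. It is an illustration only, not the tooling Percona used: the function name and the sample fstab text are made up for the example. It scans fstab-style text and lists ext2/ext3/ext4 mount points whose options do not include `errors=remount-ro`. It does not inspect the superblock default, so a file system configured with `tune2fs -e remount-ro` would still be flagged.

```python
# Hypothetical helper: flag ext file systems in fstab-style text that are
# not mounted with errors=remount-ro. Names and sample data are assumptions.

EXT_TYPES = {"ext2", "ext3", "ext4"}


def ext_mounts_without_remount_ro(fstab_text):
    """Return mount points of ext file systems lacking errors=remount-ro."""
    risky = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fstab fields: device, mount point, type, options, dump, pass.
        if len(fields) < 4:
            continue
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in EXT_TYPES and "errors=remount-ro" not in options.split(","):
            risky.append(mount_point)
    return risky


# Made-up sample; on a real machine you would read /etc/fstab instead.
SAMPLE_FSTAB = """\
# <device> <mount point> <type> <options> <dump> <pass>
/dev/sda1  /      ext3  defaults,errors=remount-ro  0 1
/dev/sda2  /var   ext3  defaults                    0 2
/dev/sda3  none   swap  sw                          0 0
"""

risky_mounts = ext_mounts_without_remount_ro(SAMPLE_FSTAB)
print(risky_mounts)
```

In the sample, only `/var` is reported, since the root file system already carries the option and swap is not an ext file system.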

Recovering from this unhappy failure would have been less of a problem if our backups had been in the condition we needed them to be. They were not. But this was more a human failure than a mechanical one, as crucial error notifications were missed due to staff turnover. It is a reminder that it’s hard to be overly attentive or too paranoid about the state of your backups.

By: Nils https://www.percona.com/blog/server-outages-at-percona/#comment-807847 Fri, 29 Jul 2011 15:27:23 +0000 https://www.percona.com/blog/?p=6362#comment-807847 I had a few close calls myself with disks failing merely hours after one another.

By: Justin Rovang https://www.percona.com/blog/server-outages-at-percona/#comment-807749 Thu, 28 Jul 2011 21:07:32 +0000 https://www.percona.com/blog/?p=6362#comment-807749 The complexity and level of redundancy is a matter of criticality – in the big picture the MPB site probably doesn’t require anything beyond a reasonable backup setup and maybe some simplistic caching. 3 drives in a go is a damn curious matter though; still curious!

By: Patrick Sciaroni https://www.percona.com/blog/server-outages-at-percona/#comment-807403 Mon, 25 Jul 2011 17:09:25 +0000 https://www.percona.com/blog/?p=6362#comment-807403 Anything mechanical can and will fail, but three disks at once? At carbonlogic.com, we haven’t ever seen that in the decade that we have been offering economical, Tier 3 hosting. If you like having direct 24 hour access to your assigned Network Engineer, and need extreme availability, we are the logical choice. Let us prepare a quote for you, to see if we can take some of the hassle out of your life!

By: George https://www.percona.com/blog/server-outages-at-percona/#comment-807052 Fri, 22 Jul 2011 04:32:18 +0000 https://www.percona.com/blog/?p=6362#comment-807052 Ouch was wondering where the percona forums disappeared to! Hope you get everything sorted out. But yes applicable lessons learnt would be great to read about.

My web host had a raid controller failure but they used R1Soft CDP/MySQL backup and just restored a recovery point (entire OS, data etc) to a totally new server and was up and running very quickly.

By: William https://www.percona.com/blog/server-outages-at-percona/#comment-806910 Wed, 20 Jul 2011 23:29:32 +0000 https://www.percona.com/blog/?p=6362#comment-806910 Downtime always sucks, regardless of when it happens or how many customers are affected. But it is very fascinating to study the causes of outages. If there are applicable lessons, it would be interesting to hear about them. However, I completely understand if you don’t want to divulge your infrastructure details on a public forum.

By: Baron Schwartz https://www.percona.com/blog/server-outages-at-percona/#comment-806904 Wed, 20 Jul 2011 21:40:01 +0000 https://www.percona.com/blog/?p=6362#comment-806904 FYI, Tom’s traveling right now and won’t be available to respond to comments until next week.

By: Michael https://www.percona.com/blog/server-outages-at-percona/#comment-806899 Wed, 20 Jul 2011 19:20:06 +0000 https://www.percona.com/blog/?p=6362#comment-806899 Why was there no failover system?

By: Patrick Casey https://www.percona.com/blog/server-outages-at-percona/#comment-806880 Wed, 20 Jul 2011 17:53:08 +0000 https://www.percona.com/blog/?p=6362#comment-806880 If it’s a RAID 10 and nobody was watching the disk alarms, it’s possible they lost the disks over a period of months.

Lose disk 0 left … 0 right is still ok, raid up
Lose disk 1 right … 1 left is still ok, raid up
Lose disk 0 right … 0 left is already down, raid down

I’ve lost servers that way before (embarrassing, but true)
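
The sequence above can be sketched in a few lines of Python. This is an illustrative model only, assuming a hypothetical four-disk RAID 10 made of two mirrored pairs; the disk labels and function names are invented for the example and do not describe the Percona server (which, per Tom Basil's explanation, was a RAID5).

```python
# Toy model: a RAID 10 stays up while every mirrored pair still has at
# least one healthy disk. Labels and names are assumptions for the example.

def raid10_is_up(pairs, failed):
    """True if each mirrored pair keeps at least one disk not in `failed`."""
    return all(any(disk not in failed for disk in pair) for pair in pairs)


def replay_failures(pairs, failures):
    """Apply failures one by one; return (disk, array_up) after each step."""
    failed = set()
    history = []
    for disk in failures:
        failed.add(disk)
        history.append((disk, raid10_is_up(pairs, failed)))
    return history


# Two mirrored pairs: pair 0 (left/right) and pair 1 (left/right).
pairs = [("0-left", "0-right"), ("1-left", "1-right")]

# Unnoticed failures spread over time, in the order from the comment.
history = replay_failures(pairs, ["0-left", "1-right", "0-right"])

for disk, up in history:
    print(f"lose disk {disk}: raid {'up' if up else 'down'}")
```

The first two failures hit different pairs, so the array survives in a degraded state; the third takes out the remaining half of pair 0 and the array goes down, which is why unmonitored disk alarms are dangerous.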

By: Justin Rovang https://www.percona.com/blog/server-outages-at-percona/#comment-806851 Wed, 20 Jul 2011 14:58:08 +0000 https://www.percona.com/blog/?p=6362#comment-806851 Holy hannah! Must’ve been controller related?

By: Edmar https://www.percona.com/blog/server-outages-at-percona/#comment-806826 Wed, 20 Jul 2011 12:02:39 +0000 https://www.percona.com/blog/?p=6362#comment-806826 Multiple disk failures are extremely important to me as a database/data manager.

Could you say whether the 3 disks failed simultaneously, or whether they failed one at a time over a period of days/months, with the RAID degradation going undetected due to absent/faulty monitoring?

Thanks in advance.

By: James Cohen https://www.percona.com/blog/server-outages-at-percona/#comment-806825 Wed, 20 Jul 2011 11:08:51 +0000 https://www.percona.com/blog/?p=6362#comment-806825 Sounds like you were unlucky with the disk failures. Can you give any more details as to what caused them to fail? Three at the same time seems very unlucky and unusual.
