On Saturday, July 16, Percona suffered a catastrophic failure of three disks on our primary web server. This was compounded by unexpected problems in recovering from our backups, arising from staff changes. The net result was that several Percona web properties were offline for periods ranging from several hours to several days, and our cleanup continues. Interruptions in our provision of software downloads, documentation, and credit card transactions directly affected Percona customers and users of our software. However, no customer data was compromised, and customer access to our engineers via our customer service portal and online chat was uninterrupted. The web server that failed is completely separate from those that contain customer data.

The recovery lessons learned have been considerable and will be incorporated into our internal processes. Availability and performance of all of our websites are a top priority. On behalf of all of us at Percona, I apologize for the inconvenience this has caused for our users.

Sincerely,

– Tom Basil, COO

12 Comments
James Cohen

Sounds like you were unlucky with the disk failures. Can you give any more details as to what caused them to fail? Three at the same time seems very unlucky and unusual.

Edmar

Multiple disk failures are extremely important to me as a database/data manager.

Could you tell us whether the three disks failed simultaneously, or whether they possibly failed one at a time over a period of days or months, with the RAID degradation going undetected due to absent or faulty monitoring?

Thanks in advance.

Justin Rovang

Holy hannah! Must’ve been controller related?

Patrick Casey

If it's a RAID 10 and nobody was watching the disk alarms, it's possible they lost the disks over a period of months.

Lose disk 0 left …. 0 right is still ok, raid up
Lose disk 1 right … 1 left is still ok, raid up
Lose disk 0 right … 0 left is already down, raid down

I’ve lost servers that way before (embarrassing, but true)
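Patrick's failure sequence can be sketched in a few lines of Python: a RAID 10 is a set of mirrored pairs, and the array stays up as long as every pair still has at least one healthy disk. (This is an illustrative sketch of the scenario he describes, not Percona's actual configuration, which Tom later explains was RAID 5.)

```python
def raid10_up(pairs):
    """pairs: list of [left_ok, right_ok] booleans, one per mirror pair.
    The array is up iff every mirror pair has at least one working disk."""
    return all(any(pair) for pair in pairs)

# Two mirrored pairs (a 4-disk RAID 10), all disks healthy to start.
pairs = [[True, True], [True, True]]

pairs[0][0] = False           # lose disk 0 left  ... 0 right is still ok
assert raid10_up(pairs)       # raid up
pairs[1][1] = False           # lose disk 1 right ... 1 left is still ok
assert raid10_up(pairs)       # raid up
pairs[0][1] = False           # lose disk 0 right ... 0 left is already down
assert not raid10_up(pairs)   # raid down
```

The array survives two of four disks failing, then dies on the third failure, because the third failure happens to complete a pair. With no alarms watched, those first two failures are invisible to users.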

Michael

Why was there no failover system?

Baron Schwartz

FYI, Tom’s traveling right now and won’t be available to respond to comments until next week.

William

Downtime always sucks, regardless of when it happens or how many customers are affected. But it is very fascinating to study the causes of outages. If there are applicable lessons, it would be interesting to hear about them. However, I completely understand if you don’t want to divulge your infrastructure details on a public forum.

George

Ouch, I was wondering where the Percona forums disappeared to! Hope you get everything sorted out. But yes, applicable lessons learnt would be great to read about.

My web host had a RAID controller failure, but they used R1Soft CDP/MySQL backup and just restored a recovery point (entire OS, data, etc.) to a totally new server, and we were up and running very quickly.

Patrick Sciaroni

Anything mechanical can and will fail, but three disks at once? At carbonlogic.com, we haven’t ever seen that in the decade that we have been offering economical, Tier 3 hosting. If you like having direct 24-hour access to your assigned Network Engineer, and need extreme availability, we are the logical choice. Let us prepare a quote for you, to see if we can take some of the hassle out of your life!

Justin Rovang

The complexity and level of redundancy are a matter of criticality – in the big picture the MPB site probably doesn’t require anything beyond a reasonable backup setup and maybe some simplistic caching. Three drives in one go is a damn curious matter though; still curious!

Nils

I had a few close calls myself with disks failing merely hours after one another.

Tom Basil

This is a belated response to reader comments asking for an explanation of how we lost three disks simultaneously on our primary web server in July, and why recovering from it was so difficult for us.

This server had a six-disk RAID 5 configured as its only logical drive. One of these disks failed, and we did not replace it in time to prevent what followed. On July 16 the RAID controller itself failed, which affected two more disks of the server’s RAID set. Our staff connected to the RAID BIOS utility and saw the RAID disks flapping between the “asserted” and “deasserted” states, which is Dell terminology for in/out of sync. Because the RAID controller kept trying to use the disks, and ext3 was not set to remount read-only on error, the filesystem became corrupted and essential files were badly damaged. Once our staff gained physical access to the server, the system could be mounted from a bootable rescue CD and the filesystem could be used, but with errors in many files.
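For reference, the ext3 behavior described above is controlled by the `errors=` mount option, which can be set in `/etc/fstab` or stored in the filesystem superblock with `tune2fs`. A minimal sketch (the device name here is hypothetical):

```shell
# /etc/fstab entry: remount the filesystem read-only on any ext3 error,
# rather than continuing to write to a failing array:
#
#   /dev/sda1   /   ext3   defaults,errors=remount-ro   0   1

# The same behavior can be made the filesystem's default via the superblock:
tune2fs -e remount-ro /dev/sda1
```

With `errors=continue` (or no setting at all), the kernel keeps using the filesystem after it detects corruption, which is how damage can spread the way it did here.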

Recovering from this unhappy failure would have been less of a problem if our backups had been in the condition we needed them to be. They were not. But this was more a human failure than a mechanical one: crucial error notifications were missed due to staff turnover, a reminder that you cannot be too attentive or too paranoid about the state of your backups.