A couple of weeks ago one of my colleagues and I worked on a data corruption case that reminded me that sometimes people make unsafe assumptions without knowing it. This one involved SAN snapshotting that was unsafe.
In a nutshell, the client used SAN block-level replication to maintain a standby/failover MySQL system, and there was a failover that didn’t work; both the primary and fallback machines had identically corrupted data files. After fsck was run on the replica, the InnoDB data files were deleted entirely.
When we arrived on the scene, there was a data directory with an 800+ GB data file, which we determined had been restored from a SAN snapshot. Accessing this file caused a number of errors, including warnings about accessing data outside of the partition boundaries. We were eventually able to coax the filesystem into truncating the data file back to a size that didn’t contain invalid pointers and could be read without errors on the filesystem level. From InnoDB’s point of view, though, it was still completely corrupted. The “InnoDB file” contained blocks of data that were obviously from other files, such as Python exception logs. The SAN snapshot was useless for practical purposes. (The client decided not to try to extract the data from the corrupted file, which we have specialized tools for doing. It’s an intensive process that costs a little money.)
The problem was that the filesystem was ext2, with no journaling and no consistency guarantees. A snapshot on the SAN is just the same as cutting the power to the machine — the block device is in an inconsistent state. A filesystem that can survive that has to ensure that it writes data to the block device in such a way that it can be brought back into a consistent state later. The techniques for doing this include things like ordered writes and metadata journaling. But ext2 does not know how to do that. The data that’s seen by the SAN is some jumble of blocks that represents the most efficient way to transfer the changed blocks over the interconnect, without regard to logical consistency at the filesystem level.
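The ordering guarantee a journaling filesystem gives you can be sketched as a toy model — this is illustrative Python, not how ext3 is actually implemented: the key idea is that an intent record is made durable in a journal *before* the in-place write, so a crash (or a SAN snapshot taken at any instant) between the two steps can be repaired by replaying the journal.

```python
import json
import os
import tempfile

# Toy model of metadata journaling (a sketch of the idea, not real ext3
# internals): the intent record is fsync()ed to the journal BEFORE the
# in-place write, so whatever instant the "snapshot" captures, replaying
# the journal brings the data file back to a consistent state.

def journaled_update(journal_path, data_path, offset, payload):
    # 1. Durably record the intent first.
    with open(journal_path, "a") as j:
        j.write(json.dumps({"offset": offset, "payload": payload}) + "\n")
        j.flush()
        os.fsync(j.fileno())
    # 2. Only then touch the data file in place.
    with open(data_path, "r+b") as d:
        d.seek(offset)
        d.write(payload.encode())

def replay(journal_path, data_path):
    # Crash recovery: re-apply every journaled record. Re-applying a
    # record that already reached the data file is harmless (idempotent).
    with open(journal_path) as j, open(data_path, "r+b") as d:
        for line in j:
            rec = json.loads(line)
            d.seek(rec["offset"])
            d.write(rec["payload"].encode())
```

Ext2 does neither step in a recoverable order, which is why the SAN’s copy was a jumble.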
Two things can help avoid such a disaster: 1) get qualified advice and 2) don’t trust the advice; backups and disaster recovery plans must be tested periodically.
This case illustrates an important point that I repeat often. The danger of using a replica as a backup is that data loss on the primary can affect the replica, too. This is true no matter what type of replication is being used. In this case it’s block-level SAN replication. DRBD would behave just the same way. At a higher level, MySQL replication has the same weakness. If you rely on a MySQL slave for a “backup,” you’ll be out of luck when someone accidentally runs DROP TABLE on your master. That statement will promptly replicate over and drop the table off your “backup.”
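The DROP TABLE scenario is easy to see in a toy model — a sketch, not real MySQL internals: the slave replays the master’s statement stream verbatim, so a destructive statement hits the “backup” just as promptly as any other.

```python
# Toy model of statement-based replication (illustrative only): the slave
# applies the same statement stream as the master, so a destructive
# statement like DROP TABLE is faithfully replicated downstream.

class Server:
    def __init__(self):
        self.tables = {"orders": ["row1", "row2"]}

    def execute(self, statement):
        # Only the one statement we care about is modeled here.
        if statement.startswith("DROP TABLE "):
            self.tables.pop(statement.split()[-1], None)

master, slave = Server(), Server()

def run_on_master(statement):
    master.execute(statement)
    slave.execute(statement)  # replication ships the same statement downstream

run_on_master("DROP TABLE orders")
# Both copies have now lost the table -- the replica is no backup.
```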
I still see people using a replica as a backup, and I know it’s just a matter of time before they lose data. In my experience, the types of errors that will propagate through replication are much more common than those that’ll be isolated to just one machine, such as hardware failures.
Great post! Thanks! Been wondering about this for a while now. Particularly agree with “the types of errors that will propagate through replication are much more common than those that’ll be isolated to just one machine, such as hardware failures”. So really, for me, if you can afford some downtime, perhaps the best practice is to streamline your server setup (and full backup + incremental backup restore) so much that you can afford to rebuild it in case you lose it in production.
I’m curious what you think of time-delayed replication, where a slave (or potentially multiple slaves at different intervals) is intentionally behind on replication by an hour or two. The idea being that mistakes (like ‘DROP TABLE’) would be caught within that time frame.
Not a replacement for proper backups, but a potentially interesting addition.
Which is why we have a dedicated backup slave, on which we stop replication and take a dump nightly. We’ve been toying with the idea of delayed replication, such as Joseph mentioned above, but the only way to do that, as far as I can tell, is simply to have a cron job that toggles replication in some sort of estimated manner. But until they add that feature to MySQL proper, it’s not really reliable.
What is anyone’s experience of using Maatkit mk-slave-delay to have a time delayed replica?
http://www.maatkit.org/doc/mk-slave-delay.html
Delayed replication is a good way to augment a disaster recovery plan, and help reduce the downtime when something goes bad. But it shouldn’t be relied on all by itself, of course. I wrote mk-slave-delay and often suggest its use.
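The idea behind a delayed slave can be sketched in a few lines — this is the assumed logic, not mk-slave-delay’s actual implementation: hold back each replicated event until it is at least `delay` seconds old, leaving a window in which a mistake like DROP TABLE can be caught before the slave replays it.

```python
# Sketch of the delayed-replication idea (assumed logic, not the tool's
# real code): only replay an event once it is at least `delay` seconds
# old, and never apply events out of binlog order.

def apply_ready_events(events, delay, now):
    """events: list of (timestamp, statement) in binlog order."""
    applied = []
    for ts, stmt in events:
        if now - ts >= delay:
            applied.append((ts, stmt))  # old enough: safe to replay
        else:
            break  # too recent: keep waiting, preserve binlog order
    return applied

events = [(100, "INSERT INTO t ..."), (200, "DROP TABLE t")]
# With a one-hour delay, only the first event is old enough to apply;
# the accidental DROP TABLE is still held back, so it can be caught.
assert apply_ready_events(events, delay=3600, now=3700) == [(100, "INSERT INTO t ...")]
```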
And this is why we use multiple off-site incremental backups and only do full replication periodically (without deleting old replicas). Something got messed up? Revert to a stable replica and apply some increments until you get the most recent stable version.
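The restore procedure described above can be sketched roughly like this — a toy model with a hypothetical backup format, not any particular tool: start from the last known-good full backup and apply incremental backups in order, stopping before the one that contains the mistake.

```python
# Sketch of full + incremental restore (hypothetical format): start from
# the last stable full backup and apply increments in order, stopping
# before the increment that introduced the damage.

def restore(full_backup, increments, stop_before=None):
    state = dict(full_backup)            # last known-good full backup
    for label, changes in increments:
        if label == stop_before:         # don't apply the bad increment
            break
        state.update(changes)            # apply one incremental backup
    return state

full = {"users": 100, "orders": 50}
incs = [("mon", {"orders": 75}), ("tue", {"users": 0})]  # "tue" holds the mistake
assert restore(full, incs, stop_before="tue") == {"users": 100, "orders": 75}
```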
If you’re looking at a backup solution, there are two (common) things you’re trying to defend against:
1) The primary server eating itself/disappearing into a puff of smoke
2) Some sort of logical error in the application, or database corruption, damaging your data even though the server is technically just fine
Replication solves #1 *really well*.
Replication does not solve #2 at all since most common logical errors are happily replicated. As some have mentioned, running a delayed slave can mitigate this, but that only works if you notice the problem before the slave experiences it.
Traditional backups solve #1 weakly, since the restore time on a large database can be prohibitive.
Traditional backups solve #2 really well.
In the best of all possible worlds I usually recommend we run both, but depending on the business needs it doesn’t always make economic sense to run replication + backups.
In the enterprise environments I’m most familiar with the usual approach is to *always* run traditional backups, since they give absolute protection against both scenarios above, albeit with a high recovery time. For systems where we can’t afford the recovery time, we’ll also add in replication so that we can recover from some subset of failures quickly.
A backup isn’t a backup unless you’ve tried a recovery 😉 It is possible the ext2 weaknesses could have been discovered with a test restore.
A test restore, under production load. I’d guess they did a test — maybe before putting it into production, though. If the system were quiet, pdflush would flush blocks to the SAN every 5 seconds, and soon the SAN’s copy of the data would be consistent too.
It’s worth pointing out at this point that my experience with snapping a hot InnoDB system, on ext3 at least, is that it *almost always* works.
We snap hot databases, mount the snaps elsewhere, and run a recovery all the time both via LVM and storage snaps (netapp).
In point of fact, I can’t recall ever having a snap we took this way fail the recovery process.
I keep reading about cases like this where it didn’t work, so I’m skeptical about proposals to use storage snaps in lieu of backups (because, inevitably, the snap you need will be the one that doesn’t recover), but as a general-purpose tool to quickly clone databases, I think it’s phenomenally useful and infinitely faster than a dump/restore.
Point being, I suppose, that it’s entirely possible the customer in this case did do testing on this, even under heavy load, and it worked fine in testing. That doesn’t mean it was a bad test; it was probably more a failure of imagination (or research).
Patrick, this customer was using ext2. I’ve never seen a problem with this kind of corruption on ext3, but ext2 doesn’t even TRY to handle things like remote snapshots.
In case it’s not a completely horrible setup…and useful for someone else.
We run master/slave MySQL replication…ZFS is the FS for both master and slave. We snapshot the filesystems on master and slave regularly. 3 times a week we copy one of those snaps off to a set of DataDomain backup servers.
Also seems to me like a typical case of “we bought this expensive SAN so our data is safe”.
The beauty of mysql replication is that the db copy is “hot” — there’s always a functioning database running against the copy. You can run queries, CHECK TABLEs, whatever you want while replication is proceeding to assure yourself the thing is going to work when it’s needed. Not true of low-level replication schemes.
I second that backing up by “snapshotting” a hot innodb db works. You just let the copy run crash recovery and voila you have a consistent backup copy of your database. I don’t use “snapshots” per se. My experience with LVM snapshots has been disappointing in terms of performance. I use RAID-10 and split mirrors.
Another beauty of mysql replication that’s unfortunately also a curse — it’s single threaded. Obscure bugs hit in Linux or mysql due to threads tripping over themselves just don’t happen on the slave. I’m pretty sure this fact has saved me at least once. (The curse is that mysql replication has no hope of keeping up with a busy master.)
Quote: A snapshot on the SAN is just the same as cutting the power to the machine — the block device is in an inconsistent state.
So you are telling me I can lose all my data just by cutting power?
I won’t blame the filesystem here, I will blame the database.
Andreas, I think it actually depends on both the DB and the filesystem. The most common Linux filesystems like ext3 are journaling filesystems, and as such they very, very, very rarely get corrupted during a hard power-off, but it does happen. If your filesystem gets corrupted, it really doesn’t matter what kind of shape the database is in, because the underlying FS is munged and you can’t get your blocks back.
In practice, I can’t recall a snap of mine ever failing a recovery on ext3, but it’s not outside the realm of possibility.
Point being, I suppose, that modern filesystems almost always survive a crash, so we tend to take that kind of durability for granted, but it shouldn’t be. The fact that a server comes back at all after an abrupt power loss is a testament to some very careful filesystem design. The fact that a database running on top of that filesystem recovers as well is still more good engineering, but both have to work to get your data back :).
Given a properly behaving operating environment, InnoDB does not corrupt your data, and it is better than most at detecting and recovering from many types of misbehavior. But if the database writes and fsync()s a file, and the filesystem botches the result good and proper, there is nothing the database CAN do.
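The durability contract described here is narrow but precise, and a minimal sketch makes it concrete (the filename is illustrative): once `write()` plus `fsync()` have returned, the database has done everything it can — if the filesystem still loses or scrambles the data, no database can protect against that.

```python
import os
import tempfile

# Minimal sketch of the database's side of the durability contract:
# hand the bytes to the filesystem, then ask for them to be forced to
# stable storage. Anything that goes wrong after fsync() returns is
# below the database's control.

def durable_write(path, data):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)   # hand the bytes to the filesystem
        os.fsync(fd)         # force them to stable storage
    finally:
        os.close(fd)

# Illustrative usage (the filename is made up):
path = os.path.join(tempfile.mkdtemp(), "ibdata1")
durable_write(path, b"page contents")
```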
For a different perspective on this topic, I suggest reading http://www.xaprb.com/blog/2010/02/08/how-postgresql-protects-against-partial-page-writes-and-data-corruption/.
I know I’m a bit late in responding to this, but isn’t this really about “why you can’t rely on ext2 for anything”? I think pretty much everyone agrees that using a non-journaling file system is a really bad idea if you want any resilience in the face of power outage at all, regardless of whether any sort of replication is involved.
Just my $.02.
The unreliability of ext2 is just one specific example of a larger principle, in my opinion.
InnoDB has a transaction log. Isn’t this log responsible for recovering from any glitches after a power outage?
Read previous comments, please.
Why doesn’t MySQL support raw devices like Oracle? Then you don’t have to rely on a filesystem driver for anything! You write the bits out to the device yourself and have complete and total control over that device.
InnoDB does support raw devices, and if used, then this type of problem is avoided. But it’s kind of nice to be able to use “cp” and “mv” and other tools. Most people don’t want to give that up. “Complete and total control” comes with limitations.
Hi,
Replication is not a backup. It is aimed at providing high availability (HA), not disaster recovery (DR).
To be safe (the minimum!):
1) use good hardware, mainly by doubling all devices (power supplies, etc.).
2) use RAIDed disks on a SAN (be careful with the ‘write back’ or ‘write through’ parameter!)
3) use ZFS if possible, or at least a journaling FS.
4) use an ACID-compliant database (PostgreSQL, …); use its internal replication feature if possible, for HA and DR (see 6)
5) snapshots can be useful sometimes, but rarely for databases, because the data can be inconsistent.
Don’t snapshot database directories, unless you can tell the database that you are backing up or snapshotting the data. (PostgreSQL allows it: SELECT pg_start_backup(‘label’); )
ZFS allows powerful snapshots that can be sent to another remote node…
6) do regular database dumps into a (compressed) file for snapshots and backups. That can be done on a replicated slave database for performance.
7) do complete and incremental backups to tapes, disks and optical disks. Protect them in a fireproof vault.
8) use version control for system configuration (/etc/, …) and protect it (Tripwire, …)
9) duplicate the whole setup in two datacenters in two different cities at least, or two continents if possible.
Cross the backups: datacenter A is backed up in datacenter B and vice versa.
10) test data recovery and your disaster recovery plan (which implies you wrote one :>) …)
And pray… if you are believer.