Clarification on "Call me Maybe: MariaDB Galera Cluster"

Recently Aphyr (Kyle Kingsbury) published https://aphyr.com/posts/327-call-me-maybe-mariadb-galera-cluster

The article is technically valid, I am not going to dispute a conclusion Aphyr made, but it is also quite technically involved, so users who just jump to conclusion may get the wrong impression and we’re left with more questions than ever.

So, let me state what is the real conclusion of this article:
“Galera cluster does not support SNAPSHOT ISOLATION LEVEL, in contract to what was stated in the documentation”.
Following that conclusion is using Galera cluster may result in “corrupted” data.

I do not quite like the usage of the word “corrupted” here. For me, the more correct word be to use is “inconsistent”.

So with this clarification, the Aphyr’s conclusion that Galera Cluster (and it affects both MariaDB Galera and Percona XtraDB Cluster products)
does not support SNAPSHOT ISOLATION and may leave data in inconsistent state is valid.
But there I need to add quite IMPORTANT addition: it may leave data in inconsistent state
if you use SPECIAL TYPE of transactions in default isolation levels that Aphyr uses in his test.
Moreover, if we test the same workload on a simple single instance InnoDB, we will get the same result.

Before getting too scary of “inconsistent data”, let’s review what kind of transactions are used and what are practical implications.

Aphyr uses following logic:
Assume we have a table

CREATE TABLE `accounts` (
  `account_id` int(11) NOT NULL,
  `balance` int(11) NOT NULL,
  PRIMARY KEY (`account_id`)
) ENGINE=InnoDB

CREATE TABLE `accounts` (

`account_id` int(11) NOT NULL,

`balance` int(11) NOT NULL,

PRIMARY KEY (`account_id`)

) ENGINE=InnoDB

We have N rows in table accounts, and each row populated with initial balance 100.
That results in SUM(balance) FROM accounts == 100*N

Application logic: Execute following transactions concurrently:

BEGIN (start transactions)
// in application: renerate random $fromAccount in a range [1;N]
SELECT balance FROM accounts WHERE account_id=$fromAccount
// in application read balance into $from_balance variable
// in application: renerate random $toAccount in a range [1;N], but != $fromAccount
SELECT balance FROM accounts WHERE account_id=$toAccount
// in application read balance into $to_balance variable
// in application, generate random amount to move from one account to another, as $moveAmt
// in application, calculate new balance for account1: $newBalance1=$from_balance-$moveAmt
// in application, calculate new balance for account2: $newBalance2=$to_balance+$moveAmt
// execute queries:
UPDATE account SET balance=$newBalance1 WHERE account_id=$fromAccount
UPDATE account SET balance=$newBalance2 WHERE account_id=$toAccount
COMMIT;

BEGIN (start transactions)

// in application: renerate random $fromAccount in a range [1;N]

SELECT balance FROM accounts WHERE account_id=$fromAccount

// in application read balance into $from_balance variable

// in application: renerate random $toAccount in a range [1;N], but != $fromAccount

SELECT balance FROM accounts WHERE account_id=$toAccount

// in application read balance into $to_balance variable

// in application, generate random amount to move from one account to another, as $moveAmt

// in application, calculate new balance for account1: $newBalance1=$from_balance-$moveAmt

// in application, calculate new balance for account2: $newBalance2=$to_balance+$moveAmt

// execute queries:

UPDATE account SET balance=$newBalance1 WHERE account_id=$fromAccount

UPDATE account SET balance=$newBalance2 WHERE account_id=$toAccount

COMMIT;

As you see it includes some application logic, so on database side, the transactions looks like:
(assuming we move 25 from account 5 to 8)

BEGIN
SELECT balance FROM accounts WHERE account_id=5
SELECT balance FROM accounts WHERE account_id=8
UPDATE account SET balance=75 WHERE account_id=5
UPDATE account SET balance=125 WHERE account_id=8
COMMIT;

BEGIN

SELECT balance FROM accounts WHERE account_id=5

SELECT balance FROM accounts WHERE account_id=8

UPDATE account SET balance=75 WHERE account_id=5

UPDATE account SET balance=125 WHERE account_id=8

COMMIT;

Aphyr’s proves that these transactions executed concurrently should keep balances consistent (that is SUM(balances)==N*100) if database support SNAPSHOT ISOLATION or SERIALIZABLE isolation levels.
In his test he shows that running on Galera cluster these transactions executing concurrently results in inconsistent balance, therefore Galera cluster does not support SNAPSHOT ISOLATION level.

This however is totally expected.
Moreover, if you try this test on a single server against InnoDB in REPEATABLE-READ (default) mode,
you also will end up in inconsistent state (you can find my code here: https://github.com/vadimtk/go-trx-iso-test).

This is because how InnoDB handles REPEATABLE-READ mode (one may argue that InnoDB’s REPEATABLE-READ is weaker
than standard defined REPEATABLE-READ, and it is more closer to READ-COMMITED. This is a good opportunity for Asphyr to start another FUD “Call Me Maybe: InnoDB”). In simplified terms, InnoDB executes reads in repeatable-read mode, and writes or locked-read in read-committed mode.

What does it mean from practical standpoint?
From my opinion these transactions are little bit artificial (although are totally valid).

If you use this in a real life, the more obvious way to write these transactions is:

BEGIN
SELECT balance FROM accounts WHERE account_id=5
SELECT balance FROM accounts WHERE account_id=8
UPDATE account SET balance=balance-25 WHERE account_id=5
UPDATE account SET balance=balance+25 WHERE account_id=8
COMMIT;

BEGIN

SELECT balance FROM accounts WHERE account_id=5

SELECT balance FROM accounts WHERE account_id=8

UPDATE account SET balance=balance-25 WHERE account_id=5

UPDATE account SET balance=balance+25 WHERE account_id=8

COMMIT;

If you do this, it will NOT produce an inconsistent state.

Another way to handle this (for a single server InnoDB) is to use SERIALIZABLE isolation level (with an expected performance penalty).
Unfortunately Galera Cluster does not support SERIALIZABLE isolation level, as it does not pass read-set between nodes,
and node communication happens on COMMIT stage.

Third way, MySQL also provides an extension: SELECT .. FOR UPDATE statements to handle cases exactly like these.
So if you want to keep REPEATABLE-READ and original transactions, you will need to rewrite this as

BEGIN
SELECT balance FROM accounts WHERE account_id=5 FOR UPDATE;
SELECT balance FROM accounts WHERE account_id=8 FOR UPDATE;
UPDATE account SET balance=75 WHERE account_id=5
UPDATE account SET balance=125 WHERE account_id=8
COMMIT;

BEGIN

SELECT balance FROM accounts WHERE account_id=5 FOR UPDATE;

SELECT balance FROM accounts WHERE account_id=8 FOR UPDATE;

UPDATE account SET balance=75 WHERE account_id=5

UPDATE account SET balance=125 WHERE account_id=8

COMMIT;

This will result in a consistent state for table accounts and will work for both single InnoDB and multi-node Percona XtraDB Cluster deployments.

One thing to remember, that with Percona XtraDB Cluster you may get a DEADLOCK error trying to execute "COMMIT" statement, so your application should be ready to handle this error, rollback and repeat the transaction if needed.

So in conclusion of my post:

Using transactions described in https://aphyr.com/posts/327-call-me-maybe-mariadb-galera-cluster may result in inconsistent state (not in data corrupted state!), for both Galera Cluster and single instance InnoDB. But this is because these transactions do not use properly InnoDB’s REPETABLE-READ isolation level. To reflect InnoDB’s requirement we need to use “SELECT ... FOR UPDATE” or to rewrite transactions in a described way.

UPDATE 18-Sep-2015.
Based on Twitter comments and comments from https://news.ycombinator.com/item?id=10238690, I would like to add following.

1. If it was not clear from my post: Galera documentation http://galeracluster.com/documentation-webpages/isolationlevels.html#cluster-transaction-isolation is wrong. Galera DOES NOT support SNAPSHOT ISOLATION level. At the same time it states “It is recommended that you avoid using SERIALIZABLE in Galera Cluster.”
2. Percona XtraDB Cluster product uses Galera replication library and affected in the same way.
3. With that, Percona is NOT affiliated with Codership and galeracluster.com. We have NO control over galeracluster.com website, so we can’t make changes to the documentation. There is a posted bug https://github.com/codership/galera/issues/336 on this matter.

17 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Tom Santero

8 years ago

Instead of getting defensive and nitpicking the language of Kyle’s post, the professional thing to do would have been to update the documentation on your site that claims Snapshot-Isolation [0] and be grateful that someone spent several hours of their life, independently verifying the behavior of your system in a transparent, repeatable manner (which also lead to the discovery of a corner case where the system fails in an interesting way).

Distributed systems are hard; apparently taking criticism even more so.

[0] http://galeracluster.com/documentation-webpages/isolationlevels.html#cluster-transaction-isolation

Mao Geng

8 years ago

Good point and examples. Thanks!
Oracle (and Postgres) doesn’t support REPEATABLE_READ at all (Tom has a good explanation here – http://www.oracle.com/technetwork/issue-archive/2010/10-jan/o65asktom-082389.html), why similar MVCC based InnoDB supports it? Even the consistency it provides is “weak”, and performance cost is high (https://www.percona.com/blog/2015/01/14/mysql-performance-implications-of-innodb-isolation-modes)?
I am curious if you can share a situation where using REPEATABLE_READ is more appropriate than using READ_COMMITTED with explicit locks?

Doug

8 years ago

Nice post, Vadim. I appreciate the detailed distinction made concerning isolation levels and how to rewrite the transaction to avoid inconsistencies.

Mark Callaghan

8 years ago

Excellent post. Thanks for doing the hard work to explain this. My first response on reading his post was the assumption that the problem existed for InnoDB independent of Galera, but I don’t want to credit as you did the work.

While figuring out what to do for repeatable read for RocksDB, I documented InnoDB, Oracle and Postgres. I think MySQL+RocksDB will end up with Postres-style.
https://github.com/MariaDB/webscalesql-5.6/wiki/Cursor-Isolation

Vadim Tkachenko

Author

8 years ago

By some reason a comment from Tom Santero (@tsantero) did not come through (our webmasters are looking into this)
I repost it here that there is no claims in censorship:

His text:
======
Instead of getting defensive and nitpicking the language of Kyle’s post, the professional thing to do would have been to update the documentation on your site that claims Snapshot-Isolation [0] and be grateful that someone spent several hours of their life, independently verifying the behavior of your system in a transparent, repeatable manner (which also lead to the discovery of a corner case where the system fails in an interesting way).

Distributed systems are hard; apparently taking criticism even more so.

[0] http://galeracluster.com/documentation-webpages/isolationlevels.html#cluster-transaction-isolation
======

Vadim Tkachenko

Author

8 years ago

@Mark, thank you!

Good luck with adding isolations to RocksDB. It seriously is a quite complicated project, especially given MySQL API layer.

Jonathan Oliver

8 years ago

Vadim,

Thanks for clarifying this. I don’t see your post as defensive or nit-picky at all. Language matters. Inconsistent and corrupted are two very different words with very different meanings.

R. Kyle Murphy

8 years ago

As a developer I have to say that Aphyr’s example is more realistic and reflects more accurately how DBs are typically used. The usual flow from a application perspective is to start a new transaction, query for the records you need, perform some transformations of them in your application, write the updated records back, and then commit the transaction which is exactly what Aphyr did. Although the SQL you provided would work in this particular instance, it’s very problematic from the standpoint of an application because now you’ve moved part of the business logic down into the DB. Taken to the logic conclusion in order to maintain consistency you will end up having to use stored procedures for all application logic as anything that touches the DB records could potentially lead to an inconsistent state unless ALL of the logic is contained inside of the DB (and wrapped in a transaction).

Mao Geng

8 years ago

I left a comment yesterday however seems it did not make it. Let me retry.
Very good examples and clarification, however I am curious why InnoDB supports REPEATABLE_READ, while with similar MVCC design Oracle/Postgres don’t support it at all (Tom has a good post about it – http://www.oracle.com/technetwork/issue-archive/2010/10-jan/o65asktom-082389.html)? In what situations, REPEATABLE_READ is better than READ_COMMITTED plus explicit locking, and better than SERIALIZABLE – can you please give an example?

Vadim Tkachenko

Author

8 years ago

Mao,

Let me provide you a quick answer, let’s see if it help, if not I will need to look into detailed example.

InnoDB REPEATABLE-READ is better than READ-COMMITTED, as in READ-COMMITTED you are vulnerable to “phantom read” anomaly. https://en.wikipedia.org/wiki/Isolation_(database_systems)#Phantom_reads
This is interesting, as standard allows to have “phantom reads” in REPEATABLE-READ, but InnoDB’s implementation does not have it.

As REPEATABLE-READ vs SERIALIZABLE:
SERIALIZABLE gives you a way to execute a transaction as it is totally isolated, and no other transaction may interfere. That makes possible for original transactions in this post to work.
Unfortunately it comes with a serious performance and locking overhead. Your queries might be locked (while not locked in REPEATABLE-READ) and even result in more DEADLOCK and LOCK WAIT TIMEOUT errors under high concurrency.

Please also check:
https://www.percona.com/blog/2015/01/14/mysql-performance-implications-of-innodb-isolation-modes/
https://www.percona.com/blog/2012/08/28/differences-between-read-committed-and-repeatable-read-transaction-isolation-levels/

Mark Callaghan

8 years ago

Mao Geng – Postgres supports repeatable read. See this for a comparison with InnoDB behavior – https://github.com/MariaDB/webscalesql-5.6/wiki/Cursor-Isolation

R Kyle Murphy – the community of InnoDB experts knows all about this. Alas, the community of InnoDB users is larger than that and sometimes they make the mistake of not using “FOR UPDATE” in the select statement as described above. The workaround doesn’t require embedding business logic. This is well documented behavior for InnoDB. I don’t think it will change. The alternative as implemented by Postgres isn’t without problems — there will be more rollbacks from write-write conflicts.

lots of people supporting InnoDB in production know that users sometimes do the wrong thing with InnoDB (start transaction, SELECT … WHERE k, UPDATE … WHERE k). Adding ‘for update’ fixes that. This is well documented behavior. As much as I like most of the work

Mao Geng

8 years ago

Thanks Mark. I just found Postgres added support for repeatable_read since 9.1. Also I found its document (http://www.postgresql.org/docs/9.1/static/transaction-iso.html) describes the difference between repeatable_read and read_committed pretty well, especially it points out “select for update” statement in repeatable_read transaction might fail – the repeatable read transaction will be rolled back with the message:
“ERROR: could not serialize access due to concurrent update”.

Valim, thanks for reply. I saw the differences. I have a long history on Oracle. It is a bit hard for me to think a case repeatable_read is required.
In your example above:
BEGIN
SELECT balance FROM accounts WHERE account_id=5
SELECT balance FROM accounts WHERE account_id=8
UPDATE account SET balance=balance-25 WHERE account_id=5
UPDATE account SET balance=balance+25 WHERE account_id=8
COMMIT;
Is repeatable_read required here to keep consistency? The next “select for update” example can use read_committed, isn’t it?

Thanks,
Mao

Henrik Ingo

8 years ago

Hi Tom

I’m afraid you’re barking up the wrong tree here, as Vadim or Percona are not the creators of Galera. Nor is Vadim being defensive, it seems to me he is quite open about InnoDB’s limitations.

Regarding Kyle, this example is quite similar to most of his posts in “Call me maybe” series. At least I can say this regarding the DB technologies I know well enough to have an opinion of. While Kyle’s method of investigating consistency levels provided by different databases is really advanced, it is impossible for a single person to be an expert in all databases, so often his analysis is also quite superficial. Combined with his entertaining but not that professional “you’re all shit anyway” attitude, it makes his posts hard to read. An average reader will get the conclusion that Galera (or even just InnoDB) is useless for transactional apps, when in reality the correct conclusion is simply that Kyle didn’t know how to use InnoDB correctly. I’m not even super-experienced with InnoDB myself, and even then my first comment when reading that post was “but shouldn’t you use FOR UPDATE to do that?”

Even then, it must be said that Kyle usually succeeds in uncovering some genuine bug or flaw in the systems he tests. Which proves that this is a difficult topic for vendors to get right and on a general level Kyle’s methodology is very sound. In this case the genuine finding was that Galera documentation claimed to support snapshot isolation, but it turned out that is not true, and the Galera developers admitted that. It is indeed bad that they didn’t rush to fix their documentation yet.

King Arthur

8 years ago

Henrik,

> Kyle didn’t know how to use InnoDB correctly.

One does not have to be “super-experienced”, to understand
that Aphyr test does not need either LOCK-IN-SHARED-MODE nor FOR-UPDATE lock explicitly requested.
Look at it again, please. According to mysql/innodb docs

SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;

forces the following selects to acquire SHARED lock which suffice
transactions

select * from accounts where id = 0
select * from accounts where id = 1
UPDATE accounts SET balance = 8 WHERE id = 0
UPDATE accounts SET balance = 12 WHERE id = 1
COMMIT

be runnable correctly and justifiably against the server alone.

Negative results for the cluster made already few changes in docs,
good for the user!
But who knows if Repeatable-read or Read-committed are not compromised too..

henrikingo

8 years ago

Hi Arthur

The point is that a Galera Cluster does not support serializable isolation level (nor snapshot isolation level). The fact that this is not clearly documented, is a bad thing and it is the part of Kyle’s post that adds new and correct information.

You’re wrong in saying that FOR UPDATE is not needed in the Aphyr test. This blog post shows that it is precisely what is needed. It is incorrect to try to do isolated transactions any other way in a Galera Cluster.

If Kyle was more familiar with InnoDB, he would have known this, and it would have made his blog post much better. That said, I don’t fault Kyle for not being an InnoDB expert. He clearly is an expert in distributed databases and consistency levels. But for the full picture, the reader must be careful when reading his posts. This was a good example of why.

King Arthur

8 years ago

Dear Henrik,

As you agreed Aphyr exposed false promises. Always good for the user!!!
When I stated a trivial no-need-FOR-UPDATE lock that related to the proof of SI.

It’s not Innodb and its features to blame as you can run this very test against the bare server, the bare Innodb to it passes.

To spell it out Innodb supports SI as far Aphyr’s test concerns.

That means “wrong” is a way you read my post or perceive Aphyr’s results.

Cheers.

henrikingo

8 years ago

Arthur: If you intended to say that “FOR UPDATE” is not needed in Snapshot Isolation mode, then we are in agreement. However, since we have learned that a Galera Cluster does not provide SI, it is needed for Galera.

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Clarification on “Call me Maybe: MariaDB Galera Cluster”

Related

Related Blog Articles

RECOMMENDED ARTICLES

High Availability: Choosing the Right Option for Your Percona Monitoring and Management

New Valkey Packages by Percona

Can We Set up a Replicate Filter Within the Percona XtraDB Cluster?

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Clarification on “Call me Maybe: MariaDB Galera Cluster”

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

High Availability: Choosing the Right Option for Your Percona Monitoring and Management

New Valkey Packages by Percona

Can We Set up a Replicate Filter Within the Percona XtraDB Cluster?

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation