In this blog post, we’ll discuss the ramifications of the Galera error "Failed to report last committed" (Interrupted system call).
I have recently seen this error with Percona XtraDB Cluster (or Galera):
[Warning] WSREP: Failed to report last committed 549684236, -4 (Interrupted system call)
It was reported as a bug on Launchpad in 2013: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1434646
My colleague Przemek replied and explained it as follows:
Reporting the last committed transaction is just a part of the certification index purge process. In case it fails for some reason (it occasionally does), the cert index purge may be a little delayed. But it does not mean the transaction was not applied successfully. This is a warning after all.
If we look up this error in the source code, we find that it reuses standard Linux system error codes (errno values); the "-4" in the warning is a negated EINTR. Specifically:
#define EINTR 4 /* Interrupted system call */
As there isn’t much documentation on this error, and internet searches turned up nothing useful, my colleague David Bennett and I delved into the source code (as we do on occasion).
If we look in the Galera source file gcs_sm.hpp (line 289), we see:
* @retval -EINTR - was interrupted by another thread
We also see (line 317):
/* was interrupted, will be handled by someone else */
This means that the thread was interrupted, but the operation will be handled by another thread. As it is just a warning, it isn’t anything to be too concerned about, unless these warnings begin to pile up (which could be a sign of concurrency issues).
The warning itself is logged in galera_service_thd.cpp, at lines 58-64:
if (gu_unlikely(ret < 0))
{
    log_warn << "Failed to report last committed "
             << data.last_committed_ << ", " << ret
             << " (" << strerror (-ret) << ')';
    // @todo: figure out what to do in this case
}
This warning could be handled better, so that it neither floods the logs nor sounds cryptic enough to worry administrators.
On PXC 5.7, running sysbench locally, I see this message on an HDD:
2016-08-09T09:43:58.131621Z 0 [Warning] WSREP: Failed to report last committed 71, -4 (Interrupted system call)
2016-08-09T09:44:01.746062Z 0 [Warning] WSREP: Failed to report last committed 43, -4 (Interrupted system call)
2016-08-09T09:44:04.974485Z 0 [Warning] WSREP: Failed to report last committed 45, -4 (Interrupted system call)
2016-08-09T09:44:12.695912Z 0 [Warning] WSREP: Failed to report last committed 51, -4 (Interrupted system call)
2016-08-09T09:44:16.046751Z 0 [Warning] WSREP: Failed to report last committed 53, -4 (Interrupted system call)
The same does not happen on an SSD in the same server (all else being equal).
I got a slightly different message:
[Warning] WSREP: Failed to report last committed 285293519, -110 (Connection timed out)
What does it mean?
This simply means that the node was unable to send the commit report notification to the group channel, probably due to heavy network traffic. It falls into the same category and can be ignored, but it is also an important signal that you may want to re-evaluate your load and available network bandwidth. Things will not break immediately, but if the load keeps growing this way, you may see nodes dropping out of the cluster in the future due to network issues.