While we do have many blog posts on replication on our blog, such as on replication being single-threaded, on semi-synchronous replication or on estimating replication capacity, I don’t think we have one that covers the very basics of how MySQL replication really works on the high level. Or it’s been so long ago I can’t even find it. So, I decided to write one now.
Of course, there are many aspects of MySQL replication, but my main focus will be the logistics – how replication events are written on the master, how they are transferred to the replication slave and then how they are applied there. Note that this is NOT a HOWTO setup replication, but rather a howstuffworks type of thing.
MySQL Replication events
I say replication events in this article because I want to avoid discussion about different replication formats. These are covered pretty well in the MySQL manual here. Put simply, the events can be one of two types:
- Statement based – in which case these are write queries
- Row based – in this case these are changes to records, sort of row diffs if you will
But other than that, I won’t be going back to differences in replication with different replication formats, mostly because there’s very little that’s different when it comes to transporting the data changes.
On the master
So now let me start with what is happening on the master. For replication to work, first of all master needs to be writing replication events to a special log called binary log. This is usually very lightweight activity (assuming events are not synchronized to disk), because writes are buffered and because they are sequential. The binary log file stores data that replication slave will be reading later.
Whenever a replication slave connects to a master, master creates a new thread for the connection (similar to one that’s used for just about any other server client) and then it does whatever the client – replication slave in this case – asks. Most of that is going to be (a) feeding replication slave with events from the binary log and (b) notifying slave about newly written events to its binary log.
Slaves that are up to date will mostly be reading events that are still cached in OS cache on the master, so there is not going to be any physical disk reads on the master in order to feed binary log events to slave(s). However, when you connect a replication slave that is few hours or even days behind, it will initially start reading binary logs that were written hours or days ago – master may no longer have these cached, so disk reads will occur. If master does not have free IO resources, you may feel a bump at that point.
On the replica
Now let’s see what is happening on the slave. When you start replication, two threads are started on the slave:
1. IO thread
This process called IO thread connects to a master, reads binary log events from the master as they come in and just copies them over to a local log file called relay log. That’s all.
Even though there’s only one thread reading binary log from the master and one writing relay log on the slave, very rarely copying of replication events is a slower element of the replication. There could be a network delay, causing a steady delay of few hundred milliseconds, but that’s about it.
If you want to see where IO thread currently is, check the following in “show slave statusG”:
- Master_Log_File – last file copied from the master (most of the time it would be the same as last binary log written by a master)
- Read_Master_Log_Pos – binary log from master is copied over to the relay log on the slave up until this position.
And then you can compare it to the output of “show master statusG” from the master.
2. SQL thread
The second process – SQL thread – reads events from a relay log stored locally on the replication slave (the file that was written by IO thread) and then applies them as fast as possible.
This thread is what people often blame for being single-threaded. Going back to “show slave statusG”, you can get the current status of SQL thread from the following variables:
- Relay_Master_Log_File – binary log from master, that SQL thread is “working on” (in reality it is working on relay log, so it’s just a convenient way to display information)
- Exec_Master_Log_Pos – which position from master binary log is being executed by SQL thread.
Replication lag
Now I want to briefly touch the subject of replication lag in this context. When you are dealing with replication lag, first thing you want to know is which of the two replication threads is behind. Most of the time it will be the SQL thread, still it makes sense to double check. You can do that by comparing the replication status variables mentioned above to the master binary log status from the output of “show master statusG” from the master.
If it happens to be IO thread, which, as I mentioned many times already, is very rare, one thing you may want to try to get that fixed is enabling slave compressed protocol.
Otherwise, if you are sure it is SQL thread, then you want to understand what is the reason and that you can usually observe by vmstat. Monitor server activity over time and see if it is “r” or “b” column that is “scoring” most of the time. If it is “r”, replication is CPU-bound, otherwise – IO. If it is not conclusive, mpstat will give you better visibility by CPU thread.
Note this assumes that there is no other activity happening on the server. If there is some activity, then you may also want to look at diskstats or even do a query review for SQL thread to get a good picture.
If you find that replication is CPU bound, this maybe very helpful.
If it is IO bound, then fixing it may not be as easy (or rather, as cheap). Let me explain. If replication is IO bound, most of the time that means that SQL thread is unable to read fast enough because reads are single threaded. Yes, you got that right – it is reads that are limiting replication performance, not writes. Let me explain this further.
Assume you have a RAID10 with a bunch of disks and write-back cache. Writes, even though they are serialized, will be fast because they are buffered in the controller cache and because internally RAID card can parallelize writes to disks. Hence replication slave with similar hardware can write just as fast as master can.
Now Reads. When your workset does not fit in memory, then the data that is about to get modified is going to have to be read from disk first and this is where it is limited by the single-threaded nature of the replication, because one thread will only ever read from one disk at a time.
That being said, one solution to fix IO-bound replication is to increase the amount of memory so working set fits in memory. Another – get IO device that can do much more IO operations per second even with a single thread – fastest traditional disks can do up to 250 iops, SSDs – in the order of 10,000 iops.
Questions? Comments? Concerns?
More resources
Blog posts
- Overview of Different MySQL Replication Solutions
- Group Replication: The Sweet and the Sour
- Quest for Better Replication in MySQL: Galera vs. Group Replication
Free eBook
Having trouble with replication in MySQL? Our webinar provides expert tips to help you identify and solve common replication errors.
Hi Aurismas,
Thanks for writing such a great high-level view on “Behind the Scenes Replication”. Just to dive into lower level details of replication, you mentioned that the master signals the Slave_IO thread for any new binlog events. Could you please enlighten how actually the IO thread copies the binlog events to Relay logs. My understanding of a simplistic approach would be the Slave_IO thread executing “Show binlog events in from ” and a single event is copied at a time which is bound by “max_allowed_packet”. Also where does the Binlog Dump thread stands in this picture.
Also in one section you mentioned about the single-threaded performance for Reads, internally mysql especially innodb read IO threads (multiple) and some myisam read process could be responsible for that, isnt it. General understanding is that only one thread has to do all the DMLs serially which on the other hand was done in concurrent manner on the master.
Please correct if wrong.
Cheers
after 5.6 GTID manner make replication easy use and to control sush as failover . switchover or filter etc.
What about binlog_format=MIXED?
Akshay, –
Replication uses its own protocol. “show binlog events” command is mostly to be used by DBAs. I guess there is some similarity to how it is implemented, but I wouldn’t know. However, I just double checked with tcpdump and I do see that events are not copied one by one, but rather in chunks. That is of course if there’s more than one event to fetch.
Note that since MySQL 5.1.64 and 5.5.26 there’s a new variable slave_max_allowed_packet which takes precedence over max_allowed_packet for a single event.
Binlog Dump is the replication thread on the master. I guess I should have named it in the article.
Regarding Reads, the multiple IO threads can only be used for read-ahead which in reality rarely happens. Otherwise, reads are always single-threaded – done in foreground by replication SQL thread – and most of the time reads are going to be the bottle-neck of the replication if it is IO-bound. It could be writes too, but only if the system does not have write-back cache or if it has less powerful IO subsystem than the master. Besides, writes are often pretty well optimized to be sequential while reads are mostly random.
Daniël, –
MIXED only suggests that either Statement or Row based replication is going to be used, depending on what is best for a given situation (or what MySQL thinks is best). However, I wanted to keep this article simple and not dive too deep into some details.
A discussion on different replication formats is certainly an interesting one though, I may want to do it sometime if there’s interest.
Aurimas
Thank you, Aurimas. Good stuff. I’d like to see a discussion on different replication formats in this context as well.
Hello Aurimas,
Is it possible to tweak replication and replicate only inserts from master and ignore the rest of the activity? I might be a little out of league asking for this, but is it possible to filter binlog events to filter before it is applied on the slave? I have a specific requirement that i am working on… Any ideas would be great help 🙂 I thought of using mysql proxy, but its still not GA…
Thanks,
Raghu
Raghunandan, –
I’m not sure MySQL Proxy is ever going to go GA 🙂 Although we were using it quite succesfully with some projects regardless of the alpha status.
I don’t think there’s any filtering other than by table and/or schema available in the stock MySQL, so you are definitely looking at something outside the box, yet I’m not quite sure yet how one would engage mysql-proxy or sql relay in filtering replication traffic..
Either way, let me know if we can help you with implementation, I’m sure we can figure something out, but that sounds like a feature request to me.
Hello Aurimas,
Thanks for the response. Here is what I am trying to do, might sound crazy 🙂 I have a master/slave setup and the master has heavy writes, for which, I have created partitions so that the innodb memory would hold only a day’s worth of data. That way my insert performance wont degrade as the no. of inserts increase on the table. Now on the slave, the indexing and reporting requirements are different. On the master, once the partition is not required, I will truncate that partition and keep that to minimum. Now here comes my requirement. I dont want to replicate the truncate partition statement, but the inserts should go to slave. 🙂 Can I achieve this?
Hello Aurimas,
Many thanks for the article. It is very useful.
I have a basic doubt.
As we transfer binary logs from Master to Slave as relay logs (both are OS files), why do we need to create a user on Master with REPLICATION SLAVE privilege to enable replication?
Why does slave need to connect MySQL on master?
Thanks in advance.
Abhijit Buchake
Thanks for your question, Abhijit. The reason is actually pretty simple – all communication goes through MySQL and therefore in order to fetch binary logs from master, slave must authenticate, otherwise anyone could knock and get binary logs. Moreover, later on slave thread is “pinged” about every new event arrived on the master binary log so it would start fetching it. If you copy binary logs manually and parse them locally (eg. using mysqlbinlog utility), then you do not need to connect to master, but you can only use that for delayed replication.
Let me know if that answers your question. See here for explanation on what REPLICATION SLAVE privilege does.
Aurimas
I have a requirement where I need the master to send bin logs to the slave without slave connecting first. Is this possible?
Hi,
Can anyone pls explain how does the slave know that a particular command is executed successfully on slave. I mean how does the acknowledgement actually happen to move the relay log position one step ahead.?
Madhu, –
I’m not sure how deep do you want to dig here, I can only scratch the surface, but the SQL threadreads an event from relay log, tries to execute it and after the execution it checks if the command returns the same result as it did on the master (this information is carried in the binary logs (and therefore is copied to relay logs)), if it does – master.info is updated and next event is read. If not, SQL thread stops, throws an error and waits for a user input.
Note that master.info in MySQL 5.5 and earlier is not syncronised to disk (Percona Server has very nice workarounds for that), hence in case of a OS crash, the true position and position written in master.info will differ most of the time.
Hope that makes sense.
Aurimas
Thanks a lot for the info.. 🙂 That helps!
Thank you for you great article;
Need to ask about mysql replication;
you have talked about master to slave replication.
Will it also handle slave to master?
The actual scenario is;
when online the software posts to the hosted server mysql db.
the online web server will replicate to the local server mysql server.
When the web is offline then the slave mysql entries could be made and then replicated to the main server.
can mysql replication handle the above?
Azhar
Could you tell me how replication stops. You mentioned how it is started. The slave contacts the master and requests log information. This starts the process. But when does the process stop, or does it simply go on forever. I don’t believe it goes on forever, but what set of conditions cause the replication work to terminate. Like maybe it runs until is completely caught up and then it terminates. Can the user specify the conditions the replication should stop, or does it really go on forever.
Azhar, I don’t quite understand your question, but you can potentially set up two mysql servers to replicate to each other. While it’s not recommended setup (active-active), in your case, as I understand, it would be passive-active master-master so it might work. Just make sure you don’t write to both servers at once and if you want to, then consider PXC instead.
Ken, essentially the replication slave does replicate infinitely, however slave obviously reaches a point where it caught up with the master at which point it stops and waits until master will “ping” the slave about a new event.
As for the second part, yes – you can configure the replication to stop at some point, however this technique is mostly used for point-in-time recovery: start slave until [condition x].
Thank you for writing this brief overview.
In few words, you explained to me what I was lookig for.
Thanks!
Sir, Please help me to understand.
I have one question related to master-master replication concept….might be I didn’t understand the concept correctly.
Both mysql systems having bin log,relay log and logically how it is not going to get conflict replication/data in M1M2 based on the concept which i read from your article.
Thank you very much for your help.
Vadiraj
Vadiraj –
thanks for the question. If you mean active-active master-master setup, chances for the conflicts are great, unless you running something like Galera which takes entirelly different approach. Active-active master-master is otherwise something we do not recommend and if you mean my article about MMM back in the day, it talks about active-passive setup, not active-active.
Awesome article .
Just a small doubt , does replication also exists for lmdb ? since i am using py-lmdb as an interface to lmdb , do i need to write protocols and binlog methods , connection methods .
Thanks
Thanks for your nice article. I have a doubt. Whether the bin log file every time clear and update the new data or it will have the data from the scratch ie. when we start replication
Viji,
I don’t think I fully understand your question. But. Binary logs are normally not overwritten – when the master server starts writing binary logs, it will write to binlog.000001, then binlog.000002 and so on.
However, if you have expire_logs_days not set to 0, then binary logs that are more than X days old will be automatically removed upon log rotation. So if your server is now writing binlog.001000, expire_logs_days=30 and binlog.000500 was closed 31 days ago, then all logs prior to binlog.000500 will be removed once server starts writing binlog.001001.
Hope that answers your question. Basically, binary logs are not written in a circular fashion like transaction logs are.
Hello Aurimas,
Thanks for this article.
I have one query related to mysql replication. Is it recommended to have a MySQL replication when both the master and slave are geographically separated. And that too when the replication is both way.
Kind regards,
Manu
Hi, Manu
Thanks for the question. I don’t think it’s very complete though as it depends really on what you’re trying to achieve with geo separated servers. Also, depending on the amount of write workload, WAN may become a bottle-neck for the replication.
While replication does work just fine over distance, it will not be synchronous normally. And it’s definitely not going to work both ways unless you write to different schemas or at the very least use offset for auto increment (doesn’t help in all cases btw).
I know my answer is incomplete, but so is the question 🙂 It’s not an easy design decision to make. A lot of variables to consider. And also a lot of alternative approaches depending on what the goals are.
Hope that helps.
Thank you Aurimas.
We already have a geographically separated setup but we are facing issues because of network latency.
I just wanted to make sure about certain other things. Your answer helped a lot, so its not incomplete in that way.
Kind regards,
Manu
Hi,
I have question on replication of DB. Is it possible to perform replication using the old DB dump? Is it going to re-sync on Master?
The reason I ask because we have 1 MySQL DB (around 160GB size and this DB is updating every seconds) and performing DB Dump is fast but the problem is the restoration, it took around 15hrs to finish.
Is it possible?
Thanks in advance. 🙂
Jove,
sorry for a slow response. Essentially, it is possible, but the key is to use –master-data when exporting the data with mysqldump. And for that you have to make a consistent backup which means either using –lock-all-tables (which I think is used by default if you’re using –master-data) or –single-transaction if all your tables are transactional.
Consider using mydumper which will make both export and import faster, assuming you have enough CPU cores and IO capacity.
Aurimas
Hi,Aurimas. In your article, you said that 「This process called IO thread connects to a master, reads binary log events from the master as they come in and just copies them over to a local log file called relay log. That’s all」.
But from MySQL offical documentation,it said that(https://dev.mysql.com/doc/refman/5.5/en/master-thread-states.html)
8.14.5 Replication Master Thread States
● Sending binlog event to slave
Binary logs consist of events, where an event is usually an update plus some other information. The thread has read an event from the binary log and is now sending it to the slave.
● Master has sent all binlog to slave; waiting for binlog to be updated
The thread has read all outstanding updates from the binary logs and sent them to the slave. The thread is now idle, waiting for new events to appear in the binary log resulting from new updates occurring on the master.
So, my question is that how the binary logs transferred to the slave?
Hello,
is it possible to configure the replication only as “backup” solution with no reads on the slave?
Thanks in advance for your answer!
Afox
I was dropped 1 table on production server yesterday. but that time I have backup till last day so that I was restored but for yesterdays data i have to refer bin-log file .
My table used now() function for insertion of date-time column, so my question is all expertise is how to restore that now() function in current date with yesterdays date & time.
I am trying to insert the exact value of column datetime when that
data was created.
for Additionally binlog file has statement like this :
#160115 10:15:14
INSERT INTO TABLEA ( id, col2,col3, datetime) values
(1,value2,valu3,now());
#160115 10:15:24
INSERT INTO TABLEA ( id, col2,col3, datetime) values
(1,value2,valu3,now());
#160115 10:25:10
INSERT INTO TABLEA ( id, col2,col3, datetime) values
(1,value2,valu3,now());
I am having just starting my Administration work. so please help me.
Thanks in advance .
I have a database table with a BINARY(16) column that has a UNIQUE index. It seems to take forever to insert into this table when replication among the clusters is enabled. Any ideas?
Thank you for explain replication
Good Job Aurimas. I, Like It ..
We have master-slave replication in place utilizing GTID approach; the use case we have is to configure this replication such that some records that exist in some tables on the master, which meet specific criteria do not get replicated to the salve; is this possible?
Thanks