MySQL 101: Monitor Disk I/O with pt-diskstats

Here on the Percona Support team we often ask customers to retrieve disk stats to monitor disk IO and to measure block device IOPS and latency. There are a number of tools available to monitor IO on Linux. iostat is one of the popular ones, and Percona Toolkit, which is free, includes the pt-diskstats tool for this purpose. pt-diskstats is similar to iostat, but it is more interactive and shows extended information. It reports current disk activity, printing statistics for the most recent interval (1 second by default) until interrupted. pt-diskstats collects its samples from /proc/diskstats.

In this post, I will share some examples of how to monitor the IO subsystem and check whether it is performing properly or whether any disks are a limiting factor – all using the pt-diskstats tool.

pt-diskstats output consists of a number of columns, and in order to interpret the output we need to know what each column represents.

  • rd_s is the number of reads per second, while wr_s is the number of writes per second.
  • rd_rt and wr_rt show the average response time in milliseconds for reads and writes respectively. This is similar to the await column in iostat output, except that pt-diskstats reports separate response times for reads and writes at the disk level. Just a note: modern iostat does split read and write latency out, but most distros don't ship the latest iostat in their sysstat (or equivalent) package.
  • rd_mrg and wr_mrg are two other important columns in pt-diskstats output. The *_mrg columns tell us how many of the original operations the IO elevator (disk scheduler) was able to merge to reduce IOPS, so they tell us something quite important: whether the IO scheduler was able to consolidate many or few operations. If rd_mrg/wr_mrg is a high percentage, the IO workload is sequential; if rd_mrg/wr_mrg is a low percentage, the IO workload is random. Binary logs, redo logs (aka ib_logfile*), the undo log and the doublewrite buffer are all written sequentially.
  • qtime and stime are the last two columns in pt-diskstats output. qtime is the time spent in the disk scheduler queue, i.e. the average queue time before the request is sent to the physical device, while stime is the average service time, i.e. the time the physical device takes to process the request. Note that qtime is not broken down between reads and writes; if response time is dominated by qtime, that points to the disk scheduler (queueing) rather than to the device itself. Also note that service time (the stime field in pt-diskstats and the svctm field in iostat) is not reliable on Linux; if you read the iostat manual you will see it is deprecated.

Along with these, there are many other parameters for pt-diskstats – you can find the full documentation here. Below is an example of pt-diskstats in action. I used the --devices-regex option, which prints information only for devices matching the given Perl regex.
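As a rough sketch of that invocation (the sd[a-d] pattern below is only a placeholder for the device names on this kind of server, not the exact command used here; adjust the regex to match your own disks):

    # Show per-disk statistics only for devices matching the Perl regex;
    # 'sd[a-d]' is a placeholder pattern for four SATA disks.
    pt-diskstats --devices-regex='sd[a-d]'

    # The header line of the output carries the columns described above,
    # e.g. rd_s, rd_mrg, rd_rt, wr_s, wr_mrg, wr_rt, busy, qtime and stime.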

These are the stats from 7200 RPM SATA disks. As you can see, the write response time is very high, and most of it is IO queue time. That shows the problem exactly: the IO subsystem cannot keep up with the write workload, because the number of writes being performed is far beyond what it can handle, and the disks cannot service every request concurrently. The workload also depends a lot on where the hot data is stored, and in this particular case it hits only a single disk out of the four. A single 7.2K RPM disk can only do about 100 random writes per second, which is not a lot under a heavy workload.
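As a back-of-the-envelope check on that 100-writes-per-second figure (the seek and rotational numbers below are typical ballpark values for a 7.2K RPM SATA drive, not measurements from this server):

    # Why a 7,200 RPM drive tops out around 100 random IOs per second:
    #   one rotation        = 60,000 ms / 7,200 rpm ~ 8.3 ms
    #   avg rotational wait ~ half a rotation       ~ 4.2 ms
    #   avg seek time       ~ 8-9 ms (typical for a SATA drive)
    #   time per random IO  ~ 4.2 ms + 8.5 ms       ~ 12-13 ms
    #   => roughly 1,000 ms / 12.5 ms ~ 80-100 random IOs per second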

This is not so much a hardware failure as a hardware capacity issue: the workload, which is mostly writes as the disk stats show, generates more writes per second than the IO subsystem can handle in an efficient manner.

Let me show you a second example. Here you can see read latency: rd_rt is consistently between 10 ms and 30 ms, depending on how fast the disks spin and how many of them there are. Possible ways to deal with this are to optimize queries to avoid table scans, use memcached where possible, and use SSDs, which provide good I/O performance at high concurrency. You will find this post on SSDs from our CEO, Peter Zaitsev, useful.

From the diskstats output below it seems that IO is saturated by both reads and writes, which you can see from the high values in the rd_s and wr_s columns. In this particular case, putting the disks in a RAID 5 array (better for a read-only workload) or a RAID 10 array, along with a battery-backed write cache (BBWC), is a good option, because a single disk can really hurt performance when you are IO bound.

The following example reflects write-heavy activity, but write response time is very good (under 1 ms), which shows the disks are healthy and capable of handling a high number of IOPS.

Let me show you a final example. I used the --interval and --iterations parameters for pt-diskstats, which tell it how many seconds to wait before printing the next disk stats and how many samples to collect, respectively. If you look at the third iteration you will notice high latency (rd_rt, wr_rt), mostly for reads, along with high values for queue time (qtime) and service time (stime); qtime is related to the disk IO scheduler settings. For MySQL database servers we usually recommend noop or deadline instead of the default cfq.
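The commands behind such a run, plus a quick way to check and change the scheduler, look roughly like this (sda and the 5-second/3-sample settings are placeholders; changing the scheduler requires root and does not persist across reboots):

    # Print a new set of disk stats every 5 seconds, 3 samples in total:
    pt-diskstats --interval=5 --iterations=3

    # See which IO scheduler a device is using; the entry in [brackets] is active:
    cat /sys/block/sda/queue/scheduler

    # Switch to deadline at runtime (make it permanent via your distro's
    # boot or udev configuration):
    echo deadline > /sys/block/sda/queue/scheduler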

You can see the busy column in pt-diskstats output, which is the same as the %util column in iostat. pt-diskstats is quite similar to iostat, but it is more interactive and carries more information. The busy percentage only tells us how long the IO subsystem was busy; it does not indicate capacity. So the only time you really care about %busy is when it reaches 100% and, at the same time, latency (await in iostat, rd_rt/wr_rt in the diskstats output) rises above, say, 5 ms. You can estimate the capacity of your IO subsystem and then look at the IOPS being consumed (the rd_s + wr_s columns, or r/s + w/s in iostat). Also, the system can process more than one request in parallel (in the case of RAID), so %busy can go beyond 100% in pt-diskstats output.
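For comparison, the iostat counterparts of busy and rd_rt/wr_rt come from an extended-statistics run, for example:

    # Extended per-device statistics refreshed every second; -x adds the
    # await and %util columns that correspond to rd_rt/wr_rt and busy:
    iostat -dx 1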

If you need to check disk throughput and block device IOPS, run pt-diskstats to capture metrics from your IO subsystem and see whether utilization matches other worrisome symptoms. I would suggest capturing disk stats during peak load. Output can be grouped by sample or by disk using the --group-by option. You can use the sysbench benchmark tool to generate load and measure database server performance for this purpose. You will find this link useful for sysbench tool details.
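A sketch of that kind of capture, using sysbench's file IO test to generate load (option names vary a little between sysbench versions; this uses the classic --test=fileio syntax, and the 8G file size and 5-minute runtime are arbitrary examples):

    # Terminal 1: prepare test files, run a mixed random read/write workload,
    # then clean up afterwards:
    sysbench --test=fileio --file-total-size=8G prepare
    sysbench --test=fileio --file-total-size=8G --file-test-mode=rndrw --max-time=300 run
    sysbench --test=fileio --file-total-size=8G cleanup

    # Terminal 2: capture disk stats while the benchmark runs,
    # one output line per sample instead of per disk:
    pt-diskstats --group-by=sample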


Conclusion:

pt-diskstats is one of the finest tools in Percona Toolkit. By using it you can easily spot disk bottlenecks, measure your IO subsystem and identify how many IOPS your drives can handle (i.e. disk capacity).

6 Comments
gggeek

Pardon my ignorance, but how does this work for NFS mounts? Are they ignored, treated as physical disks, or something else? And DRBD volumes?

One of the things I often find hard to grok when looking at the output of disk-statistics tools is how to map the data to the underlying physical disks – especially when there are network filesystems and LVM at play.

Ernie Souhrada

NFS mounts are ignored; pt-diskstats collects its information from /proc/diskstats, which does not contain any information about NFS volumes. For DRBD volumes, both the actual physical disk and the DRBD volume will appear in the pt-diskstats output, often with very different numbers, since the DRBD metrics will include network and DRBD overhead but the physical device metrics will just be from the raw disk. The same is true for LVM, since you might have multiple LVs on a given PV or a single LV spanning multiple PVs.

Valerie

Doesn’t work on FreeBSD. Sad face.

Cannot open /proc/diskstats: No such file or directory at /usr/local/bin/pt-diskstats line 3355.

Masroor Farooqi

I like that this is a read-only tool. Combine that with the fact that it polls /proc/diskstats, and one does not have to worry about adding overhead to the disk subsystem. As may be obvious to most readers of this post, I/O access is the most expensive operation a database does, and DBAs spend (or should spend) a lot of time fine-tuning it. Careful use of a tool like this, or say iostat, should be a “must use” tool in any DBA’s toolkit.

One point I’d like to make to gggeek in the context of databases: please do not put database datafiles on NFS volumes. In fact, something like Oracle actually checks for NFS, and if it finds its tablespaces are on NFS it will refuse to go online. This is because NFS is a no-guarantees filesystem. Couple that with the fact that it uses TCP as a transport mechanism, and the result is that database access is (a) slow and (b) not fully guaranteed.

gggeek

@Masroor I pretty much agree.

But as a web consultant, I often come across servers which are not dedicated to database serving, and they can host the webserver as well as memcache, solr, queueing services and whatnot. It is thus important to be able to easily tell apart IO coming from the different services, and which disks/partitions are being hit.

And while NFS is not so good a choice for a db storage, using DRBD is quite a common practice to achieve high-availability setups.

Masroor Farooqi

gggeek Agreed. I was a database consultant for a long time, so I have seen my fair share of, let's say, architectural abuses 🙂 and it's invaluable to know what the disk is spinning for. I have not personally used DRBD for database replication, but I have for offsite backups (we used a proprietary block-replication solution called Compellent). Anyway, good to know for the future.

Have a good one