The information obtained from the “pt-summaries” discussed in the previous posts of this series (Part 1: The Percona Support Way, Part 2: Knowing the Server, Part 3: What MySQL?) helps us form a first impression of a MySQL server. However, apart from the quick glance we get at two samples of a select group of MySQL status variables, they provide what I call a “static” view of the server, akin to looking at a still picture of an engine. We get the chance to spot major discrepancies between the MySQL configuration and the resources available on the host, and to adjust those following some best practices, but the real tuning starts by observing how the engine performs in motion.
pt-stalk is the final tool in our Percona Toolkit trifecta: we use it to collect data from a running server, from both the OS and MySQL sides, so we can then analyze how the server is operating. For better gains, you should run pt-stalk during a period of high activity (ideally, peak time). It can also be used to collect data while the server is undergoing a performance issue, for troubleshooting purposes.
If you have the Percona Toolkit installed on the server, you can run pt-stalk as follows:
sudo pt-stalk --no-stalk --iterations=2 --sleep=30 -- --user=root --password=<mysql-root-pass>
A few observations:
- The options --no-stalk --iterations=2 --sleep=30 are used to collect two sets of diagnostic data with a 30-second interval between them; since --run-time is not defined, the default value of 30 seconds is employed, thus the command above provides one minute's worth of data.
- By default, pt-stalk will save the captured data under /var/lib/pt-stalk. You can choose a different destination directory with the option --dest.
- If your system is configured in such a way that you don’t need to provide credentials to access MySQL, you can remove the -- --user=root --password=<mysql-root-pass> section.
- Many of the data collection tools used by pt-stalk are part of the sysstat package, so make sure it is installed on the server.
Let’s have a look at the data collected.
A Mix of OS and MySQL Diagnostics Data
Part of the data collected by pt-stalk is purely OS-related, which makes the tool partially useful for inspecting other kinds of Linux servers as well. The other part is MySQL-specific. Here’s the list of files generated by the command above on one of my test servers – note there are two groups of files below, prefixed with timestamps set 30 seconds apart:
$ ls /var/lib/pt-stalk/
2021_05_01_12_55_30-df               2021_05_01_12_56_00-df
2021_05_01_12_55_30-disk-space       2021_05_01_12_56_00-disk-space
2021_05_01_12_55_30-diskstats        2021_05_01_12_56_00-diskstats
2021_05_01_12_55_30-hostname         2021_05_01_12_56_00-hostname
2021_05_01_12_55_30-innodbstatus1    2021_05_01_12_56_00-innodbstatus1
2021_05_01_12_55_30-innodbstatus2    2021_05_01_12_56_00-innodbstatus2
2021_05_01_12_55_30-interrupts       2021_05_01_12_56_00-interrupts
2021_05_01_12_55_30-iostat           2021_05_01_12_56_00-iostat
2021_05_01_12_55_30-iostat-overall   2021_05_01_12_56_00-iostat-overall
2021_05_01_12_55_30-log_error        2021_05_01_12_56_00-log_error
2021_05_01_12_55_30-lsof             2021_05_01_12_56_00-lsof
2021_05_01_12_55_30-meminfo          2021_05_01_12_56_00-meminfo
2021_05_01_12_55_30-mpstat           2021_05_01_12_56_00-mpstat
2021_05_01_12_55_30-mpstat-overall   2021_05_01_12_56_00-mpstat-overall
2021_05_01_12_55_30-mutex-status1    2021_05_01_12_56_00-mutex-status1
2021_05_01_12_55_30-mutex-status2    2021_05_01_12_56_00-mutex-status2
2021_05_01_12_55_30-mysqladmin       2021_05_01_12_56_00-mysqladmin
2021_05_01_12_55_30-netstat          2021_05_01_12_56_00-netstat
2021_05_01_12_55_30-netstat_s        2021_05_01_12_56_00-netstat_s
2021_05_01_12_55_30-opentables1      2021_05_01_12_56_00-opentables1
2021_05_01_12_55_30-opentables2      2021_05_01_12_56_00-opentables2
2021_05_01_12_55_30-output           2021_05_01_12_56_00-output
2021_05_01_12_55_30-pmap             2021_05_01_12_56_00-pmap
2021_05_01_12_55_30-processlist      2021_05_01_12_56_00-processlist
2021_05_01_12_55_30-procstat         2021_05_01_12_56_00-procstat
2021_05_01_12_55_30-procvmstat       2021_05_01_12_56_00-procvmstat
2021_05_01_12_55_30-ps               2021_05_01_12_56_00-ps
2021_05_01_12_55_30-slabinfo         2021_05_01_12_56_00-slabinfo
2021_05_01_12_55_30-slave-status     2021_05_01_12_56_00-slave-status
2021_05_01_12_55_30-sysctl           2021_05_01_12_56_00-sysctl
2021_05_01_12_55_30-top              2021_05_01_12_56_00-top
2021_05_01_12_55_30-trigger          2021_05_01_12_56_00-trigger
2021_05_01_12_55_30-variables        2021_05_01_12_56_00-variables
2021_05_01_12_55_30-vmstat           2021_05_01_12_56_00-vmstat
2021_05_01_12_55_30-vmstat-overall   2021_05_01_12_56_00-vmstat-overall
We won’t be examining all of these in detail here; I’ll focus on the most important ones (in general), following the order (and logic) I usually employ. As was the case in the previous posts, the excerpts of data used in the following sections do not all come from the same pt-stalk collection, nor even from the same server; they were assembled from different sources to better illustrate the case at hand.
Was the Server Under Moderate or High Load When the Data was Captured?
My first stop tends to be the first top sample; there we can see reasonably clearly whether the server was under high load when pt-stalk was run, and for how long:
top - 11:23:16 up 14 days, 23:54,  4 users,  load average: 20.17, 17.82, 14.98
Tasks: 262 total,   1 running, 261 sleeping,   0 stopped,   0 zombie
%Cpu(s): 78.2 us,  6.9 sy,  0.0 ni, 14.2 id,  0.4 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 13186303+total,   658972 free, 11101868+used, 20185376 buff/cache
KiB Swap:  8388604 total,  8215548 free,   173056 used. 20041612 avail Mem

  PID USER   PR NI    VIRT    RES    SHR S  %CPU %MEM   TIME+   COMMAND
 3408 mysql  20  0  112.4g  104.9g 141832 S  1169 83.4  3761:10 mysqld
16379 root   20  0  113652    1340    608 S   6.2  0.0  0:00.01 bash
16530 root   20  0  162020    2180   1528 R   6.2  0.0  0:00.01 top
    1 root   20  0   52148    3448   1940 S   0.0  0.0  4:37.76 systemd
In the sample above, we can see the server was undergoing an increase in load, averaging close to 15 over the past 15 minutes, 18 over the past 5 minutes, and 20 in the last minute. Just remember that the load metric must be interpreted relative to the number of cores in the server. This other sample below may look scarier, but it comes from a box with 192 cores; while this is an oversimplification of the metric, we could say the server was operating at (76.90/192 ≈) 40% of its maximum processing capacity in the last minute:
top - 15:30:26 up 258 days, 21 min, 16 users,  load average: 76.90, 78.63, 80.69
Tasks: 1711 total,   1 running, 1710 sleeping,   0 stopped,   0 zombie
%Cpu(s): 23.2 us,  6.5 sy,  0.0 ni, 59.7 id, 10.0 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem : 15847333+total,  8529432 free, 78651558+used, 78968825+buff/cache
KiB Swap: 23488102+total, 23441689+free,   464128 used. 79652627+avail Mem

  PID USER   PR NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
45930 mysql  20  0  652.5g  645.5g  10128 S  2163 42.7 160718:24 mysqld
91413 apuser 20  0   34.8g    9.1g 164012 S 368.4  0.6   1037:45 java
11450 root   20  0  173092    4332   1920 R  21.1  0.0   0:00.05 top
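That normalization can be scripted as a quick sanity check. This is a sketch with the 1-minute load average and core count hardcoded from the sample above; on a live server you would read them from /proc/loadavg and nproc instead:

```shell
# Normalize the 1-minute load average by the number of cores.
# 76.90 and 192 come from the top sample above; on a live box use:
#   load=$(cut -d' ' -f1 /proc/loadavg); cores=$(nproc)
load=76.90
cores=192
awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.0f%% of capacity\n", 100 * l / c }'
```

A result well above 100% means more runnable tasks than cores, i.e. work is queuing for CPU.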
It is not a problem if the data was captured under low or moderate load; it just changes the way we should look at and analyze the rest of it.
We can also get a good idea of overall memory usage, and how much of it was allocated to MySQL. From there, I like to take a sneak peek at meminfo as well; sometimes we find a server with a sizable part of its memory converted into transparent huge pages (THP):
TS 1616111316.005848775 2021-03-18 23:48:36
MemTotal:       197910740 kB
MemFree:          3300968 kB
MemAvailable:    15070884 kB
Buffers:           137580 kB
Cached:          11954976 kB
SwapCached:             0 kB
Active:         178836988 kB
Inactive:         7812608 kB
Active(anon):   174796296 kB
Inactive(anon):    162712 kB
Active(file):     4040692 kB
Inactive(file):   7649896 kB
Unevictable:            0 kB
Mlocked:                0 kB
SwapTotal:              0 kB
SwapFree:               0 kB
(...)
AnonHugePages:  108390400 kB
No swap space and 55% of the total memory converted into THP doesn’t look that good.
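That 55% figure can be derived directly from the meminfo sample. Here is a small awk sketch fed with the two relevant lines from above; on a collected data set you would point it at the *-meminfo file instead of the here-document:

```shell
# Share of total memory sitting in transparent huge pages (values in kB),
# using the MemTotal and AnonHugePages lines from the sample above.
awk '/^MemTotal:/ { total = $2 } /^AnonHugePages:/ { thp = $2 }
     END { printf "THP: %.0f%% of RAM\n", 100 * thp / total }' <<'EOF'
MemTotal:       197910740 kB
AnonHugePages:  108390400 kB
EOF
```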
Depicting CPU Usage and I/O
My next stop is the mpstat sample, here’s one from a 16-core server under high load:
11:23:16 AM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
11:23:17 AM  all  77.51   0.00   1.50    0.94   0.00   0.25   0.00   0.00   0.00  19.80
11:23:17 AM    0  91.92   0.00   1.01    0.00   0.00   0.00   0.00   0.00   0.00   7.07
11:23:17 AM    1  61.86   0.00   2.06    3.09   0.00   0.00   0.00   0.00   0.00  32.99
11:23:17 AM    2  67.68   0.00   3.03    1.01   0.00   0.00   0.00   0.00   0.00  28.28
11:23:17 AM    3  81.00   0.00   2.00    0.00   0.00   0.00   0.00   0.00   0.00  17.00
11:23:17 AM    4 100.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00   0.00
11:23:17 AM    5  55.56   0.00   2.02    3.03   0.00   0.00   0.00   0.00   0.00  39.39
11:23:17 AM    6  73.74   0.00   3.03    0.00   0.00   0.00   0.00   0.00   0.00  23.23
11:23:17 AM    7  62.63   0.00   1.01    0.00   0.00   0.00   0.00   0.00   0.00  36.36
11:23:17 AM    8  56.00   0.00   3.00    3.00   0.00   1.00   0.00   0.00   0.00  37.00
11:23:17 AM    9  82.00   0.00   1.00    2.00   0.00   0.00   0.00   0.00   0.00  15.00
11:23:17 AM   10  98.02   0.00   0.00    0.00   0.00   0.99   0.00   0.00   0.00   0.99
11:23:17 AM   11  67.68   0.00   1.01    1.01   0.00   0.00   0.00   0.00   0.00  30.30
11:23:17 AM   12  79.80   0.00   2.02    0.00   0.00   0.00   0.00   0.00   0.00  18.18
11:23:17 AM   13 100.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00   0.00
11:23:17 AM   14  87.88   0.00   2.02    0.00   0.00   0.00   0.00   0.00   0.00  10.10
11:23:17 AM   15  76.24   0.00   0.99    0.99   0.00   0.99   0.00   0.00   0.00  20.79
MySQL should run as a regular, non-superuser account (named mysql by default), so all database work should be accounted for in the usr column above (which, on a dedicated server, is essentially all MySQL). In the sample above we see two cores pegged at 100% utilization; unless this sample came from an exceptional period of high load, this is a fine example of a server we should probably not be running with the powersave CPU scaling governor (as seen in the 2nd post of this series) but with all cores running at full speed instead.
In the sample above we find a very good balance of load: all cores are working, and working hard. But picture a similar sample where only one of several cores shows 100% usr load: on a replica, this is a sign that replication is operating in single-threaded mode. Back in the days of MySQL 5.5, and even 5.6, this was mostly the norm: we sometimes saw servers with one core running at 100% all the time and replication lag continuously increasing. This was a time when replicas with fewer but faster cores excelled. Multi-threaded replication appeared in MySQL 5.6, but it was limited to one database per (replication) thread; if your whole schema was organized around a single database, or if one database received most of the writes, it didn’t help much. Multi-threaded replication really blossomed in MySQL 5.7, with the introduction of the LOGICAL_CLOCK mode. The variables file contains all MySQL settings, and you can use it to check how replication is configured on a replica:
$ grep "slave_parallel_" *-variables
slave_parallel_type     LOGICAL_CLOCK
slave_parallel_workers  8
Another important column to monitor in mpstat is iowait: if you see values repeatedly above 1.0, particularly under moderate load, it may be a sign that the storage subsystem is a limiting factor. Double-check in the iostat sample whether the average queue size (avgqu-sz) is often greater than 1:
Device:  rrqm/s  wrqm/s   r/s     w/s    rkB/s    wkB/s  avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdb        0.00    4.00  0.00 1187.00     0.00 58873.33     99.20     4.71   3.97    0.00    3.97   0.04   5.20
sda        0.00    0.00  0.00    0.33     0.00     1.33      8.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-0       0.00    0.00  0.00    0.00     0.00     0.00      0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-1       0.00    0.00  0.00 1187.33     0.00 58841.33     99.12     4.76   4.01    0.00    4.01   0.04   5.20
In the sample above, we can see that sdb is already saturated at around 1,000 write IOPS. Sometimes the I/O capacity of the system is overestimated and the related settings are set too high, making InnoDB believe it has more IOPS at its disposal than there actually are:
$ grep "innodb_io_capacity" *-variables
innodb_io_capacity      1000
innodb_io_capacity_max  10000
It is good practice to measure the practical IOPS limit your disks can sustain. Overestimating innodb_io_capacity may actually lead to worse performance and stalls in response times. For more information on this, see Tuning MySQL/InnoDB Flushing for a Write-Intensive Workload.
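When skimming many iostat samples, a quick filter helps surface the saturated devices. This is a sketch that assumes the column layout shown above, where avgqu-sz is the 9th field; it is fed with an inline copy of the sample, but on a collected data set you would pipe the *-iostat file through it:

```shell
# Print devices whose average queue size (avgqu-sz, field 9 in this
# iostat -x format) exceeds 1. The "+ 0" coerces the header's text
# field to 0 so the header line is skipped.
awk '$9 + 0 > 1 { printf "%s queue depth %.2f\n", $1, $9 }' <<'EOF'
Device:  rrqm/s  wrqm/s   r/s     w/s    rkB/s    wkB/s  avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdb        0.00    4.00  0.00 1187.00     0.00 58873.33     99.20     4.71   3.97    0.00    3.97   0.04   5.20
sda        0.00    0.00  0.00    0.33     0.00     1.33      8.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-1       0.00    0.00  0.00 1187.33     0.00 58841.33     99.12     4.76   4.01    0.00    4.01   0.04   5.20
EOF
```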
The last stop in my usual tour around the OS data collected by pt-stalk is the vmstat sample, which I mostly use to check if there was some swapping taking place when the data was collected:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd    free   buff    cache   si   so    bi    bo     in    cs us sy id wa st
28  2  172544  676900 160176 20025276    0    0   158   243      1     1 16  1 83  0  0
29  0  172544  672692 160312 20027832  260    0  1064  3588  22249  3959 95  2  3  0  0
17  0  172544  669032 160316 20028860   52    0   348  3436  20754  5070 78  2 20  1  0
(...)
13  0  172544  697792 160536 19998792    0    0   576  3600  19421  5221 73  2 24  1  0
21  0  173056  705116 160532 19991780    0  300  1744  3936 183103  4090 86  3 10  1  0
11  0  173056  696420 160540 19999820    0   88  7360  3548  48524  4451 82  2 16  0  0
21  0  173056  692664 160540 20003536    0    0  3280  3716  20936  4711 88  2  9  1  0
While swapping is detrimental to performance, as discussed in the 2nd post, I’m one to advocate in favor of having swap space configured on a server in most cases – just make sure to set swappiness to a minimum, as explained there. You can check that this is the case by looking at the sysctl sample:
$ grep swappiness *-sysctl
vm.swappiness = 1
MySQL in Motion
This is my favorite part: the mysqladmin files contain samples of SHOW GLOBAL STATUS. While it is difficult to make sense of the raw data, pt-mext, another tool from the Percona Toolkit, can process them and provide a clear view of how the status counters vary over time, using the first sample in the series as the base reference. Here’s how you can use it:
pt-mext -r -- cat 2021_05_01_12_55_30-mysqladmin > ptmext.out
Let the fun begin!
Does My Hot Data Fit in Memory?
This is important information to have: is the InnoDB Buffer Pool big enough to store the data that is accessed most often? If it is, very little data needs to be fetched from the tablespace files on disk (which is slow) because most of it is already found in memory. We get that ratio from two InnoDB status counters:
$ grep "Innodb_buffer_pool_read_requests\|Innodb_buffer_pool_reads" ptmext.out
Innodb_buffer_pool_read_requests 69496091591 4040274 3636485 3475398 3509297 3324067 2790171 3811242 3820111 4472993 4240323 3603441 4636822 4713943 4880787 4032916 4328607 3959668 4187900 3883735 3850137 4166323 3933372 3628560 4101771 2766796 3220822 4093628 3581942
Innodb_buffer_pool_reads 9420027 12 17 7 11 28 21 16 53 5 4 17 10 9 14 9 14 5 30 4 6 7 4 10 1 115 8 9 56
- Innodb_buffer_pool_read_requests indicates the number of (logical) read requests
- Innodb_buffer_pool_reads indicates the number of read requests that could not be satisfied from data already present in the Buffer Pool; the data had to be read from disk and loaded into the Buffer Pool first before being served.
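From the base values of these two counters we can compute an overall miss rate – the fraction of read requests that had to go to disk. A quick awk sketch, using the numbers from the excerpt above:

```shell
# Buffer pool miss rate = reads-from-disk / total logical read requests.
# 9420027 and 69496091591 are the base values from the excerpt above.
awk 'BEGIN {
  reads = 9420027; requests = 69496091591
  printf "miss rate: %.4f%%\n", 100 * reads / requests
}'
```

A miss rate this low suggests the hot data set fits comfortably in the Buffer Pool.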
The “magic” of pt-mext is shown above: the first value presented for each metric comes from the first sample collected, which provides a base reference, while the value for each subsequent sample, taken 1 second apart, is the difference from the previous sample. This way, we get a clear idea of how the metric changes over time.
Here are the values for the first four samples of Innodb_buffer_pool_reads used in the excerpt above, to better illustrate how this works:
Raw value | pt-mext
9420027   | 9420027
9420039   | +12
9420056   | +17
9420063   | +7
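The same delta transformation can be reproduced with a short awk one-liner – a sketch, not a substitute for pt-mext, fed with the four raw values from the table above:

```shell
# Print the first raw value as the base reference, then the delta
# between each consecutive sample, pt-mext style.
echo "9420027 9420039 9420056 9420063" | awk '{
  printf "%s", $1
  for (i = 2; i <= NF; i++) printf " +%d", $i - $(i-1)
  print ""
}'
```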
Looking at Temporary Tables
Another important aspect we discussed in the last post, key to MySQL tuning, is the use of temporary tables. If too many of them are being created every second:
- Your queries may need some optimization;
- They may take up too much of the available memory, and the situation may become critical (as in running out of memory) during peaks of high activity/concurrency.
There’s another element to it: if they grow bigger than the threshold for temporary tables to remain in memory:
$ grep "tmp_table_size\|max_heap_table_size" *-variables
max_heap_table_size  209715200
tmp_table_size       209715200
they will be moved to disk (becoming on-disk temporary tables) and your queries will take longer to complete, which may also create a vicious cycle during moments of high concurrency. Here’s a somewhat extreme example from a server with only 126G of RAM and a Buffer Pool of 60G:
Created_tmp_disk_tables 61327112 73 523 501 448 518 421 533 540 464 618 741 528 332 216 522 400 113 8 7 363 208 575 519 396 400 483 610 236
Created_tmp_tables 134023235 115 546 525 474 542 452 550 557 493 641 763 548 356 250 558 419 129 23 19 406 234 595 542 420 419 511 637 271
- Created_tmp_tables indicates the total number of temporary tables created by the server, be it in-memory or on-disk.
- Created_tmp_disk_tables only accounts for temporary tables created on-disk
In the excerpt above, we see a high number of temporary tables created every second, with most of them being created as, or converted to, on-disk temporary tables. Considering that they are created in memory and only moved to disk once they reach the threshold set by tmp_table_size, the amount of memory allocated to temporary tables during peak time may surpass what remains available on the server. That is what was happening with this server until the buffer pool was downsized to its current value of 60G as a short-term solution; despite having swappiness set to 1 and theoretically available memory in the OS cache, swap space was almost depleted due to memory pressure:
top - 15:54:29 up 126 days,  1:18,  3 users,  load average: 4.17, 4.34, 4.70
Tasks: 445 total,   4 running, 441 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.0 us,  6.3 sy,  0.0 ni, 71.8 id, 13.5 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 13176490+total,  1973120 free, 10337484+used, 26416932 buff/cache
KiB Swap: 10485756 total,   562384 free,  9923372 used. 26613916 avail Mem

  PID USER   PR NI    VIRT   RES   SHR S  %CPU %MEM    TIME+   COMMAND
11830 mysql  20  0  111.4g 93.8g  8760 S 175.0 74.6 10332:00   mysqld
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b     swpd    free  buff    cache    si   so    bi    bo     in    cs us sy id wa st
18  4  9923884 1988808    40 26417080     1    1    33   584      0     0  2  1 97  0  0
 5  3  9919788 1980508    40 26424644  5556    0 17444 30576  42479 53317  4  2 82 11  0
 2  3  9912876 1970328    40 26427844  7872    0 21856 35881  50203 67962  6  2 82 11  0
 2  3  9903404 1955664    40 26433916 10992    0 24208 41452  58789 76326  6  2 82 10  0
 3  4  9893932 1938188    40 26437384 11008    0 25712 39040  54518 65092  6  2 82 11  0
 5  3  9888044 1926884    40 26442912  7776    0 27152 37513  75462 82150 17  3 71  9  0
 2  4  9881644 1917948    40 26445612  7336    0 23288 32787  44001 53016  7  2 79 12  0
 5  4  9869612 1896296    40 26450052 15308    0 30748 34702  50177 58230  9  2 77 12
The long-term solution for them was to optimize their queries so as to decrease the number and size of the temporary tables they create.
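A back-of-envelope calculation illustrates the risk. In the worst case, each active connection can materialize an in-memory temporary table up to tmp_table_size bytes; the concurrency figure of 64 connections below is a hypothetical example, not taken from the server above:

```shell
# Worst-case memory that in-memory temporary tables could claim:
# tmp_table_size (200 MiB, from the variables excerpt above) times a
# hypothetical 64 connections each materializing one temporary table.
awk 'BEGIN {
  tmp_table_size     = 209715200   # bytes, from the variables file
  active_connections = 64          # hypothetical peak concurrency
  printf "worst case: %.1f GiB\n", tmp_table_size * active_connections / 2^30
}'
```

Note a single query can create more than one temporary table, so the real worst case can be even higher.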
Contention at the Table Cache Level
When a connection runs a query that accesses a given table, the table is opened and the Opened_tables counter is incremented. If ten connections run the same query concurrently, the table will be opened ten times; the reference to a given opened table is not shared between client sessions because each may have a different state set for the target table. Note this impacts the number of file descriptors used by MySQL.
For performance reasons, the reference to each opened table is kept in a cache. When that cache is full and a session tries to open a table that is not in it, MySQL needs to make space, either by finding a table that is no longer in use by any thread and recycling its entry in the table cache, or by temporarily expanding the table cache.
The general advice is that if the value of Opened_tables increases constantly, it’s a sign the table cache might be too small:
Opened_tables 17369503 123 179 187 228 307 296 350 209 208 192 219 308 183 316 260 312 291 123 237 152 146 208 273 284 335 110 238
I'd rather look at the number of cache misses and overflows, which provide a more direct indication of contention at the table cache level:
Table_open_cache_hits 697915937 5362 7645 6461 4789 4815 5912 6569 5989 7260 4128 5892 7822 4265 7562 5032 6892 6451 2865 5740 5475 3744 5568 6137 4688 5687 1940 3743
Table_open_cache_misses 17355555 123 179 187 228 307 296 350 209 208 192 219 308 183 316 260 312 290 123 237 152 146 208 273 284 335 110 238
Table_open_cache_overflows 17344722 123 179 187 228 307 296 350 209 208 192 219 308 183 316 260 312 290 123 237 152 146 208 273 284 335 110 238
As discussed in the previous post, pay attention to the relative size of each table cache instance:
$ grep table_open_cache *-variables
table_open_cache            10708
table_open_cache_instances  16
Ever since the table cache was partitioned to perform better in highly concurrent environments, it is no longer one big cache with table_open_cache entries: that value is now divided by the number of table_open_cache_instances. If the number of entries per instance is low, it won’t scale well; you may need to increase table_open_cache, or decrease the number of instances.
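The per-instance figure for the server above is a one-line division – a trivial sketch, with the two values hardcoded from the variables excerpt:

```shell
# Entries available per table cache instance:
# table_open_cache (10708) / table_open_cache_instances (16).
awk 'BEGIN { printf "%d entries per instance\n", 10708 / 16 }'
```

Whether ~670 entries per instance is enough depends on how many distinct tables the workload touches concurrently.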
There are many other status variables we could cover here, but I hope the examples above give you the general idea of how pt-mext can be used to process the mysqladmin samples. We shall keep going, though: there are two essential pieces of data we cannot leave uncovered.
Looking Inside the (Real) Engine
And that would be InnoDB. Yes, many InnoDB-related metrics are covered by the status variables, but there is precious information about the activity of the storage engine we can only find in the output of SHOW ENGINE INNODB STATUS (“SEIS”), which we find in the innodbstatus files.
There is a lot of information in there, and I reckon it is common to feel a bit lost when trying to make sense of it all. The trick is to break it down into sections; you will usually focus on a subset of them, according to what you are looking for.
Transactions Running
InnoDB is a transactional storage engine and SEIS is one of the best places to look for information about transactions. Open one of the innodbstatus files (two are collected for each of the iterations executed) and have a general look. Before I do that myself, I like to get a general idea of the number of transactions inside InnoDB; I use grep for this:
$ grep "\---TRANSACTION" innodbstatus | wc -l
2503
That’s a lot of transactions, but it doesn’t mean they are all running:
$ grep "\---TRANSACTION" innodbstatus | grep ACTIVE | wc -l
273
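Beyond counting, you can rank the ACTIVE transactions by how long they have been running. This sketch parses the "---TRANSACTION <id>, ACTIVE <n> sec" lines; it is fed with fabricated example lines here, but on a collected data set you would grep the *-innodbstatus1 file and append | head:

```shell
# Rank ACTIVE transactions by runtime, longest first. Field layout:
# $2 is the transaction id (with a trailing comma), $4 the seconds active.
grep "ACTIVE" <<'EOF' | awk '{ gsub(",", "", $2); printf "%s sec  trx %s\n", $4, $2 }' | sort -rn
---TRANSACTION 48793810, ACTIVE 293183 sec
---TRANSACTION 48911201, ACTIVE 12 sec
---TRANSACTION 48911342, ACTIVE 907 sec fetching rows
EOF
```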
If you have a lot of active transactions running inside InnoDB and they are taking too long to complete – each getting just a slice of CPU cycles at a time and then going back to waiting, generating a lot of context switches – you may benefit from limiting the maximum number of transactions allowed inside InnoDB at any one time; the rest will have to wait at the door, a concept not much different from using a connection pooler. The variable that controls this is innodb_thread_concurrency: the default value of zero means no limit is in place.
This is one of the best examples I have in mind when I think about “tuning the engine for the race track”: the workload dictates how MySQL should be tuned. In a high-concurrency system, if there are queries that compete for the same resources and block each other, it may be better to allow only a fraction of them to run at a time so they complete faster and leave. In cases like this, how many threads should be allowed to run concurrently is a matter of experimentation: try values that are multiples of the number of cores in the server.
Long-Running Transactions
You will find the oldest, longest-running transactions at the bottom of the list of transactions: if you are troubleshooting a performance problem, there is a good chance those transactions are involved – either as culprits or as victims.
Long-running transactions can be very detrimental to overall server performance: they drive the InnoDB history list length up and can hold locks needed by other queries to complete. Here’s an example of one:
---TRANSACTION 48793810, ACTIVE 293183 sec
106688 lock struct(s), heap size 10887376, 1642661 row lock(s), undo log entries 3698942
MySQL thread id 349, OS thread handle 140316247336704, query id 64765183 myserver.hostname.com 172.x.x.x myuser
Trx read view will not see trx with id >= 48793810, sees < 48793810
You can investigate these further by looking them up in the respective transactions file:
*************************** 2. row ***************************
                    trx_id: 48793810
                 trx_state: RUNNING
               trx_started: 2020-08-21 00:00:33
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 3805639
       trx_mysql_thread_id: 349
                 trx_query: NULL
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 11
          trx_lock_structs: 106688
     trx_lock_memory_bytes: 10887376
           trx_rows_locked: 1642663
         trx_rows_modified: 3698951
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
processlist file:
*************************** 1. row ***************************
           Id: 349
         User: myuser
         Host: myserver.hostname.com:60260
           db: mydatabase
      Command: Sleep
         Time: 1
        State:
         Info: NULL
    Rows_sent: 0
Rows_examined: 0
and lock-waits file:
*************************** 1. row ***************************
   who_blocks: thread 349 from myserver.hostname.com:60260
  idle_in_trx: 1
max_wait_time: 49
  num_waiters: 10
I reckon I’m stopping short of providing a complete procedure for investigating locking issues; that is another subject that warrants more attention.
Checkpointing and Redo Log Space
We briefly discussed the importance of tuning the redo log space (innodb_log_file_size x innodb_log_files_in_group) in the previous post, and how it should be big enough to optimize the checkpointing mechanism. There is an old formula that can still be used for estimation, and the data for it can be found in SEIS. Here we use the first innodbstatus file from each of the two samples:
$ grep "Log sequence number" 2020_08_24_15_*-innodbstatus1
2020_08_24_15_30_25-innodbstatus1:Log sequence number 66452632416280
2020_08_24_15_30_55-innodbstatus1:Log sequence number 66453484080767
We can obtain the number of bytes written in this 30-second window and make a rough projection for one hour:
(66453484080767 - 66452632416280) * 2 * 60 ~= 95G/h
Such a projection is very dependent on the moment the SEIS samples were captured; you may want to track the Log sequence number over an extended period, but do note it is important that the redo log be sized for the most demanding write periods, which usually take place at peak time. You certainly don’t need to have a full hour covered; 15-30 minutes should be enough for checkpointing to operate in an optimized way:
20min ~= ( innodb_log_file_size → 16G ) x ( innodb_log_files_in_group → 2 )
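The whole estimation can be scripted in one pass. This sketch hardcodes the two LSNs from the grep output above; the LSN is a byte offset, so the difference between two samples 30 seconds apart, times 120, gives bytes written per hour:

```shell
# Project redo write rate from two LSNs captured 30 seconds apart,
# then size the redo space to cover ~20 minutes (1/3 hour) of writes.
awk 'BEGIN {
  lsn1 = 66452632416280; lsn2 = 66453484080767
  per_hour = (lsn2 - lsn1) * 2 * 60            # bytes written per hour
  printf "redo rate: %.0f GiB/h\n", per_hour / 2^30
  printf "20-min redo space: %.0f GiB\n", per_hour / 3 / 2^30
}'
```

The ~32 GiB result matches the 2 x 16G redo log files suggested above.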
Adaptive Hash Index
InnoDB can make use of a special feature to index values that are accessed frequently, called the Adaptive Hash Index, or AHI for short. As the name suggests, these are hash indexes built in memory, on top of the B-tree table indexes we usually talk about, and they allow for very fast hashed lookups. But the AHI won’t be helpful in workloads where most of the searches are non-hash searches:
$ grep "hash searches" innodbstatus
19000.00 hash searches/s, 1955000.00 non-hash searches/s
In the excerpt above, we find that only 1% of the searches are “hashable”: in cases like this, disabling the AHI (innodb_adaptive_hash_index) can improve performance. The AHI has a bigger impact (be it positive or negative) on read-only workloads, so it may make sense to have it enabled on your replicas but not on the master – or vice versa. This highlights another interesting point about inspecting database servers: you will very likely find different variations of your workload if you split traffic into writes (master) and reads (replicas); you should analyze (and configure) your servers independently.
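That ~1% figure follows directly from the two rates in the excerpt; a quick awk check with those values hardcoded:

```shell
# Share of searches served by the adaptive hash index:
# hash searches / (hash + non-hash searches), rates from the excerpt above.
awk 'BEGIN {
  hash = 19000; nonhash = 1955000
  printf "hashable searches: %.1f%%\n", 100 * hash / (hash + nonhash)
}'
```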
In other cases, the rate of hashable searches can be predominant but the system may still suffer contention from this mechanism. This used to be the case in MySQL 5.6 and earlier versions, when “the adaptive hash index search system was protected by a single latch (btr_search_latch) which could become a point of contention”. This is apparent when we see many threads waiting on an RW-latch created in btr0sea.c in the upper part of the SEIS output, for example. This has been mitigated in MySQL 5.7 with the possibility of partitioning the AHI search mechanism (innodb_adaptive_hash_index_parts): “each index is bound to a specific partition, with each partition protected by a separate latch”.
If you are in doubt about the effectiveness of the AHI for your workload and can test operating with it disabled for comparison, that would be a simple path to pursue.
A Quick Note on Processlist
We mentioned the processlist samples briefly above: they contain basic information about ongoing connections to MySQL, such as where they come from and what they are currently doing, although not always in exact terms. There are moments when we need to inspect this file in search of a specific thread, as we did earlier, but we can also extract higher-level stats from these samples. The following command will count connections by Command:
$ cat processlist | grep Command: | sort | uniq -c | sort -rn
   2750 Command: Sleep
   1226 Command: Query
     36 Command: Connect
     29 Command: Binlog Dump
      2 Command: Execute
Please note you cannot simply take the overall raw numbers above at face value: each processlist file contains 29 samples of SHOW FULL PROCESSLIST; averaging these values does not provide an accurate view of the system at a given moment, but it does provide a pretty good general representation of it. Sorting by State is even better in this sense:
$ cat processlist | grep State: | sort | uniq -c | sort -rn
   2750 State:
    548 State: Opening tables
    369 State: statistics
    103 State: Sending data
     68 State: starting
     31 State: query end
     31 State: Creating sort index
     29 State: Waiting for the slave SQL thread to free enough relay log space
     27 State: update
     18 State: executing
     17 State: Master has sent all binlog to slave; waiting for more updates
     14 State: Sending to client
      7 State: updating
      6 State: Receiving from client
      6 State: preparing for alter table
      5 State: checking permissions
      4 State: Waiting for table metadata lock
      4 State: closing tables
      3 State: System lock
      1 State: Sending binlog event to slave
      1 State: preparing
      1 State: creating table
By looking at the excerpt above, I would be inclined to further examine the efficiency of the table cache and that of the redo log space.
One last point that almost didn’t make this document: the MyISAM storage engine has its days numbered. In MySQL 8.0, even the system tables are no longer MyISAM. Still, in certain situations you may resort to MyISAM for temporary tables (default_tmp_storage_engine). Just keep the following in mind: for obvious reasons, MyISAM tables are not featured in the output of SHOW ENGINE INNODB STATUS, so if you have a performance issue involving a query and you can’t find anything about it in SEIS, chances are the underlying tables are not InnoDB; look for clues about the query in SHOW PROCESSLIST instead.
Daemon Mode
You can have pt-stalk running in the background as a daemon, ready to trigger when a predefined condition is met. By default, a threshold for Threads_running is used, but you can also employ your own trigger conditions through a custom function!
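As a sketch of what that can look like (the threshold and cycle values below are example choices, not recommendations), the daemon form combines --daemonize with the trigger options:

```shell
# Run pt-stalk in the background, collecting data whenever
# Threads_running stays at or above 50 for 5 consecutive checks.
sudo pt-stalk --daemonize \
  --variable=Threads_running --threshold=50 --cycles=5 \
  --dest=/var/lib/pt-stalk \
  -- --user=root --password=<mysql-root-pass>
```

For custom conditions, the --function option lets you supply your own trigger logic; see the pt-stalk documentation for the plugin interface.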
In the Next Post …
The best complement to our beloved Percona Toolkit trifecta is none other than Percona Monitoring and Management (PMM). It continually collects MySQL status variables from your servers and plots the metrics in Grafana panels that are easy to interpret. Plus, it includes a very rich Query Analytics dashboard that can help you identify your top slow queries. In the final post of this series, we'll have a look at how we can obtain much of the same information we got from the Percona Toolkit tools (and sometimes more) from PMM.
Hi Fernando,
Very detailed and excellent explanation. Kudos.
Regarding ‘Contention at the Table Cache level’, you’ve stated that ‘this impacts the number of file descriptors used by MySQL‘ meaning that the same table opened in multiple instances will have different file descriptors for the same file.
I tried to recreate the above scenario by executing a ‘select * from tablename’ on a table which has 18 crore (180 million) records in one session, and another query, ‘select * from tablename order by id desc limit 1000’, on the same table in another session. I monitored the system calls using strace. From the strace output, we found that session 2 used the same file descriptor as session 1 to retrieve data while the query in session 1 was still running. Please note that the value of table_open_cache_instances was 16.
If a table is open, another session accessing the same table should open the table .ibd file using a new file descriptor. But that is not the case here.
When we say increasing table_open_cache_instances reduces contention, what contention do we exactly mean here? Table level? In the source code, it is mentioned that by increasing instances we can make sure to lock ‘m_lock’ instead of having to lock ‘LOCK_OPEN’. From my understanding, this lock is held for a very short span of time, and I don’t think it causes any overhead.
Please share your insights on this. Also, what do you think should be the ideal value for table_open_cache_instances?