Tuning Linux for MySQL Performance
Tuning MySQL for performance and reliability spans several layers: sizing and capacity planning of the hardware infrastructure, Linux kernel parameter configuration, MySQL system variable tuning, and application, SQL, and index optimization. We strongly recommend not changing any Linux kernel configuration parameter unless you fully understand how it influences the performance of Linux. In this blog post we explain how to tune Linux for MySQL performance. Please don’t treat this post as a run-book, cheat sheet, or checklist for tuning your production Linux servers; we have written it purely for knowledge sharing and education. Thanks for understanding!
What is swappiness on Linux?
First of all, the Linux swappiness value has nothing to do with how much RAM is used before swapping starts. Even today, many people are confused about whether any relationship exists between physical RAM and Linux swappiness, so let’s make it clear before proceeding further. Swapping is a technique Linux uses to write data from Random Access Memory (RAM) to the swap partition or swap file on disk in order to free up RAM. Linux has a configuration parameter called swappiness, and many books and blogs spread a wrong interpretation of it: that Linux sets a threshold for RAM usage, and swapping starts when the amount of used RAM hits that threshold. This is wrong. Here is how the Linux documentation explains it (you can read more on this topic at https://github.com/torvalds/linux/blob/v5.0/Documentation/sysctl/vm.txt#L809 ): “This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase aggressiveness, lower values decrease the amount of swap. A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.”
What do we at MinervaDB recommend about Linux swappiness for MySQL performance?
Keep swapping less aggressive by setting swappiness to “1”. Please do not set the value to “0”, which effectively disables swapping until memory is almost exhausted.
Monitor swappiness of your Linux system:
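One way to check the current setting is to read it straight from procfs. The snippet below is a minimal sketch: the helper name `current_swappiness` and its optional path argument are our own illustrative choices, not part of any standard tool.

```shell
# Print the current swappiness value.
# The optional argument overrides the procfs path; it is a hypothetical
# convenience so the helper can be tried against an ordinary file.
current_swappiness() {
    cat "${1:-/proc/sys/vm/swappiness}"
}

# Show the live value when procfs exposes it
if [ -r /proc/sys/vm/swappiness ]; then
    current_swappiness
fi

# To watch actual swap activity, sample vmstat once per second and look at
# the "si" (swapped in) and "so" (swapped out) columns:
#   vmstat 1 5
```

Sustained non-zero `si`/`so` values in `vmstat` are usually a stronger signal of memory pressure than the swappiness value itself.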
How to configure Linux swappiness to “1”?
# Make sure you are root and set swappiness to 1
echo 1 > /proc/sys/vm/swappiness
# Or, you can use sysctl to do the same
sysctl -w vm.swappiness=1
To apply the change permanently (only after benchmarking the new value and confirming a performance improvement), add the following line to /etc/sysctl.conf:
vm.swappiness = 1
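If you would rather script the change than edit the file by hand, something along these lines could work. This is only a sketch: `set_sysctl_entry` is a hypothetical helper name, it assumes GNU `sed` (for `-i -E`), and it does not handle values containing `|` or `&`.

```shell
# Set "key = value" in a sysctl-style config file: replace the existing
# entry for the key if there is one, otherwise append a new line.
set_sysctl_entry() {
    local file="$1" key="$2" value="$3"
    local pattern
    # Escape dots so the key is matched literally by grep and sed
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$file" 2>/dev/null; then
        sed -i -E "s|^[[:space:]]*${pattern}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Example (as root): persist the value, then reload /etc/sysctl.conf
#   set_sysctl_entry /etc/sysctl.conf vm.swappiness 1 && sysctl -p
```

Replacing an existing entry instead of blindly appending avoids ending up with two conflicting `vm.swappiness` lines in the same file.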
How does the Linux I/O scheduler influence MySQL performance?
Disk access is an expensive way to retrieve data. Flash and solid-state storage are no exception when compared to accessing data directly from RAM, and things get even slower when data is stored on spinning disks. The reason for high latency on spinning disks is that every read and write happens at a specific location on the platter, so the drive must move its heads and wait for the platter to rotate to that location. This process is called seeking, and it is resource intensive and time consuming. I/O schedulers optimize access by merging I/O requests for nearby locations on disk. By grouping requests located in similar sections of the disk, the drive does not have to seek as often, which improves the performance of disk I/O operations.
Modern Linux kernels support multiple I/O schedulers, each with its own pros and cons. We have copied more details below:
Completely Fair Queuing (CFQ)
The Completely Fair Queuing (CFQ) I/O scheduler works by creating a per-process I/O queue. Many Linux distributions have historically used CFQ by default, which gives input and output requests equal priority. This scheduler is efficient on systems running multiple tasks that need equal access to I/O resources. The main aim of the CFQ scheduler is to provide a fair allocation of disk I/O bandwidth to all processes that request I/O, which also makes CFQ less optimal for environments that need to prioritize one request type (such as READs) from a single process. Asynchronous requests from all processes are batched together according to their process’s I/O priority. The length of the time slice and the number of requests a queue is allowed to submit depend on the I/O priority of the given process.
slice_idle
This specifies how long CFQ should idle waiting for the next request on certain CFQ queues (for sequential workloads) and service trees (for random workloads) before the queue is expired and CFQ selects the next queue to dispatch from. By default slice_idle is a non-zero value, which means CFQ idles on queues and service trees. This can be very helpful on highly seeky media like single-spindle SATA/SAS disks, where it cuts down the overall number of seeks and improves throughput.
Setting slice_idle to 0 removes all idling at the queue and service tree level, and one should see improved overall throughput on faster storage devices such as multiple SATA/SAS disks in a hardware RAID configuration. The downside is that the isolation provided from WRITEs also goes down and the notion of I/O priority becomes weaker. So, depending on storage and workload, it might be useful to set slice_idle=0. In general, for SATA/SAS disks and software RAID of SATA/SAS disks, keeping slice_idle enabled should be useful. For any configuration with multiple spindles behind a single LUN (host-based hardware RAID controllers or storage arrays), setting slice_idle=0 might result in better throughput and acceptable latencies.
back_seek_max
This specifies, in Kbytes, the maximum “distance” for backward seeking. The distance is the amount of space from the current head location to the sectors that lie behind it. This parameter allows the scheduler to anticipate requests in the “backward” direction and consider them as the “next” request if they are within this distance of the current head location.
back_seek_penalty
This parameter is used to compute the cost of backward seeking. If the backward distance of a request is just 1/back_seek_penalty of the distance to a “front” request, then the seeking cost of the two requests is considered equivalent, so the scheduler will not bias toward one or the other (otherwise the scheduler biases toward the front request). The default value of back_seek_penalty is 2.
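While CFQ is the active scheduler for a device, these tunables appear as files under that device's `queue/iosched` directory in sysfs. The sketch below prints the three discussed above; `show_cfq_tunables` is a hypothetical helper name, `sda` is just an assumed example device, and the second argument (an alternative sysfs root) exists only to make the function easy to try without real hardware.

```shell
# Print selected CFQ tunables for a block device.
#   $1: device name, e.g. sda (an assumed example)
#   $2: sysfs root, defaults to /sys (hypothetical override for dry runs)
show_cfq_tunables() {
    local dev="$1" sysroot="${2:-/sys}"
    local dir="$sysroot/block/$dev/queue/iosched"
    local name
    for name in slice_idle back_seek_max back_seek_penalty; do
        # Files are only present while CFQ is the active scheduler
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}

# Example:
#   show_cfq_tunables sda
# Changing a value requires root, for instance:
#   echo 0 > /sys/block/sda/queue/iosched/slice_idle
```

Values written this way do not survive a reboot, so benchmark first and persist the setting separately only if it proves useful.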
Deadline
The goal of the Deadline scheduler is to guarantee a start service time for a request. It does this by imposing a deadline on all I/O operations to prevent starvation of requests. As I/O requests come in, they are assigned an expiry time (the deadline for that request). When the expiry time for a request is reached, the scheduler forces service of that request at its location on the disk. While it is doing this, any other requests within easy reach (without requiring too much head movement) are attempted as well. Where possible, the scheduler attempts to complete every I/O request before its expiry time is reached.
The deadline scheduler can be used in situations where the host is not concerned with “fairness” for all processes residing on the system. Rather, the concern is that I/O requests are not stalled for long periods.
The deadline scheduler can be considered the best choice for a host where one process dominates disk I/O. Most database servers, including MySQL, are a natural fit for this category.
NOOP
The NOOP scheduler simply handles requests in the order they were submitted and does nothing to change their order or priority. Because it makes no attempt to reduce seek time beyond simple request merging (which helps throughput), NOOP provides optimal throughput on storage systems that provide their own queuing, such as solid-state drives and intelligent RAID controllers with a built-in buffer cache.
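To see which scheduler a device is using, read its `queue/scheduler` file in sysfs; the active entry is the one in square brackets. The snippet below is a small sketch for extracting it. `active_scheduler` is a hypothetical helper name and `sda` is an assumed device. Note that on newer multi-queue kernels the listed names differ (for example `mq-deadline`, `kyber`, `bfq`, `none`).

```shell
# Extract the active scheduler from the sysfs format, where the active
# entry is wrapped in square brackets, e.g. "noop deadline [cfq]".
# Reads the scheduler line from standard input.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Examples:
#   active_scheduler < /sys/block/sda/queue/scheduler
# Switching scheduler at runtime requires root, for instance:
#   echo deadline > /sys/block/sda/queue/scheduler
```

A runtime change like this applies per device and is lost on reboot, which makes it convenient for benchmarking one scheduler against another before committing to a choice.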
Tuning Linux file systems for performance
XFS
XFS is a strong choice for MySQL workloads. It is built for optimal disk operation on large files and for streaming I/O performance. XFS imposes an arbitrary limit on the number of files that a file system can hold, but in general this limit is high enough that it will never be hit.
XFS provides a feature called direct I/O, which offers the semantics of a UNIX raw device inside the file system namespace. Reads and writes to a file opened for direct I/O bypass the kernel file cache and go directly from the user buffer to the underlying I/O hardware. Bypassing the file cache offers the application full control over the I/O request size and caching policy. Avoiding the copy into the kernel address space significantly reduces CPU utilization for large I/O requests. Thus direct I/O allows applications such as databases, which traditionally used raw devices, to operate within the file system hierarchy. Extent-based allocation reduces fragmentation and metadata size, and improves I/O performance by allowing fewer and larger I/O operations.
Ext4
Ext4 uses extents (as opposed to the traditional block mapping scheme used by ext2 and ext3), which improves performance and reduces metadata overhead for large files. The file-allocation algorithms attempt to spread files as evenly as possible among the cylinder groups and, when fragmentation is necessary, to keep the discontinuous file extents as close as possible to others in the same file to minimize head seek and rotational latency.
Ext4 uses 48-bit internal addressing, making it theoretically possible to allocate files up to 16 TiB on filesystems up to roughly 1,000,000 TiB (1 EiB). Early implementations of ext4 were still limited to 16 TiB filesystems by some userland utilities, but as of 2011, e2fsprogs has directly supported the creation of >16 TiB ext4 filesystems. As one example, Red Hat Enterprise Linux contractually supports ext4 filesystems only up to 50 TiB and recommends ext4 volumes no larger than 100 TiB.
Ext4 also doesn’t do enough to guarantee the integrity of your data. As big an advancement as journaling was back in the ext3 days, it does not cover a lot of the common causes of data corruption. If data is corrupted while already on disk, whether by faulty hardware, the impact of cosmic rays (yes, really), or simple degradation of data over time, ext4 has no way of either detecting or repairing such corruption.
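The ext4 size limits quoted above follow from simple arithmetic. As a sketch, assuming the common 4 KiB block size (other block sizes give different limits), 48-bit block addresses bound the filesystem size and 32-bit logical block numbers within a file bound the file size:

```shell
block_size=4096        # bytes per block, assuming 4 KiB blocks
tib=$((1 << 40))       # bytes per TiB

# 48-bit block addresses: up to 2^48 blocks per filesystem.
# Divide before multiplying to keep intermediate values small.
max_fs_tib=$(( (1 << 48) / tib * block_size ))

# 32-bit logical block numbers inside a file: up to 2^32 blocks per file
max_file_tib=$(( (1 << 32) * block_size / tib ))

echo "max filesystem size: ${max_fs_tib} TiB"   # 1048576 TiB, i.e. 1 EiB
echo "max file size: ${max_file_tib} TiB"       # 16 TiB
```

This is why 1 EiB is only approximately 1,000,000 TiB: the exact figure is 2^20 = 1,048,576 TiB.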
ZFS
ZFS is a combined file system and logical volume manager designed and implemented by a team at Sun Microsystems led by Jeff Bonwick and Matthew Ahrens. Oracle is the owner and custodian of ZFS, and it’s in a peculiar position with respect to Linux filesystems. ZFS has been (mostly) kept out of Linux due to CDDL incompatibility with Linux’s GPL license.
ZFS is similar to other storage management approaches, but in some ways it’s radically different. ZFS does not normally use the Linux Logical Volume Manager (LVM) or disk partitions, and it’s usually convenient to delete partitions and LVM structures prior to preparing media for a zpool.
ZFS can implement “deduplication” by maintaining a searchable index of block checksums and their locations. If a new block to be written matches an existing block within the index, the existing block is used instead, and space is saved. In this way, multiple files may share content by maintaining single copies of common blocks, from which they will diverge if any of their content changes.
Fragmentation in ZFS is a larger question, and it appears related more to remaining storage capacity than to rapid file growth and reduction. Performance of a heavily used dataset will begin to degrade when it is 50% full, and it will drop dramatically over 80% usage, when ZFS begins to use “best-fit” rather than “first-fit” to store new blocks. Regaining performance after dropping below 50% usage can involve dropping and resilvering physical disks in the containing vdev until all of the dataset’s blocks have migrated. Otherwise, the dataset should be completely unloaded and erased, then reloaded with content that does not exceed 50% usage (the zfs send and receive utilities are useful for this purpose). It is important to provide ample free disk space to datasets that will see heavy use.
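Since the usage levels mentioned above matter so much for ZFS, it can help to check pool capacity regularly. The following is a rough sketch, not a monitoring solution: `check_pool_usage` is a hypothetical helper, the 50% and 80% thresholds are taken from the discussion above, and the input is assumed to be the tab-separated output of `zpool list -H -o name,capacity`.

```shell
# Read "name<TAB>capacity%" lines from standard input and flag pools that
# have crossed the usage levels discussed above (50% and 80%).
check_pool_usage() {
    local name cap pct
    while read -r name cap; do
        pct=${cap%\%}    # strip the trailing percent sign
        if [ "$pct" -gt 80 ]; then
            echo "$name: ${pct}% used, CRITICAL"
        elif [ "$pct" -ge 50 ]; then
            echo "$name: ${pct}% used, WARNING"
        else
            echo "$name: ${pct}% used, ok"
        fi
    done
}

# Example on a host with ZFS installed:
#   zpool list -H -o name,capacity | check_pool_usage
```

Note that this looks at whole pools; per-dataset usage would need a different source, which is outside the scope of this sketch.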
Conclusion
Linux performance optimization requires a deep understanding of both the logical and physical components of your Linux infrastructure. There is definitely no one-size-fits-all configuration, so each component needs to be tuned for full-stack performance and scalability. This blog post is written purely for knowledge sharing; please don’t treat it as a run-book or guided strategy for configuring your Linux infrastructure for performance. Please engage professional Linux performance experts to tune Linux for performance, scalability, and reliability. Thanks for reading!