Speeding up slow disks with SSD caching

Afterburner

Article from Issue 187/2016

Flash memory is fast but also expensive. Caching with Flash provides a way out: A smaller and cheaper SSD can speed up the disk.

Hard disks are inexpensive, and they have huge capacities, but they are also slow. Solid state disks (SSDs) are fast, but smaller and more expensive. If you combine the advantages of a hard disk with an SSD-based cache, you pick up a large performance gain at a reasonable cost.

An application generally does not want all the data at once; most of the data sits unused most of the time. Caching lets you move the most frequently requested data to an exclusive, fast medium and leave the less frequently accessed data on the cheaper but slower background medium.

The Linux environment has several tools that provide the necessary software to support hard-disk caching. Does it help to use an SSD-based flash drive as a cache for a traditional hard disk? We decided to find out. This article explores the possibilities for caching with the Linux caching tools Enhance IO and dm-cache. If you are new to the topic of caching, and you would like some additional information on choices you might have to make, see the boxes titled "A Little Cache Theory" and "How Flash Works."

How Flash Works

The precursors of today's Flash memory appeared as early as the 1970s. The devices at the time stored computer microcode in ROM chips (read-only memory). These ROM chips could be neither erased nor overwritten. Thus, an update meant replacing the chips.

To simplify the procedure, scientists developed erasable programmable read-only memory, EPROM for short. This memory typically had a covered transparent window on the silicon chip. If you removed the label on the window and irradiated the module for around a quarter of an hour with UV light, you would erase the chip, and you would be able to rewrite it. This solution was significantly less expensive than a throw-away ROM, but it was still cumbersome.

The next generation brought a further improvement in the form of electrically erasable programmable read-only memory (EEPROM). This type of memory was erased by applying a voltage. Like its predecessors, EEPROM was used to store small amounts of data that needed to be preserved without power and did not change frequently. Like today's Flash memory, these memory modules already belonged to the non-volatile random access memory (NVRAM) class.

Flash memory, which followed EEPROMs, has a much higher storage density but relies on the same principle: It contains a floating-gate transistor for each bit. The floating gate is an electrically isolated electrode that can hold a charge.

A charge on the floating gate keeps the source-drain path of the transistor in a high-impedance state; that is, the transistor is non-conductive and blocks (Figure 1). Without a charge on the floating gate, the transistor conducts electricity between the source and drain instead. These two states represent the binary values 0 and 1.

Figure 1: Schematic of a floating-gate transistor.

A Little Cache Theory

If you want a cache to hold the most important data, you also need to define what is important. The cache's decision strategy sets the priority. Several models exist for setting priorities; all of them define which data the cache has to forget in favor of new entries. The cache delivers its data automatically and completely transparently whenever data is requested from the background medium. The most important decision strategies are:

  • FIFO (first in, first out): The entry that was written to the cache first also drops out of it first. This approach is a disadvantage if the cache is small, because entries then have to be evicted constantly to make room.
  • LFU (least frequently used): Whatever is least frequently requested is forgotten. This strategy is more efficient than FIFO when applications actually require certain entries significantly more often than others.
  • LRU (least recently used): This strategy keeps the entries in the cache that have been used recently and removes the oldest. This technique usually requires a number of bits to remember how old a particular entry is. Each hit in the cache updates the age of all the other entries. Variations on LRU include Pseudo-LRU (PLRU, which only needs one age bit) or segmented LRU (SLRU, which includes a protected segment from which the cache is not allowed to remove any entries).
  • MRU (most recently used): The opposite of LRU is also useful, if the likelihood that data will be accessed increases with the age of the data. This scenario occurs, for example, in sequential parsing of a data file. If the use case lends itself to a scenario where data that is just read won't be accessed again in the near future, it makes sense to forget the most recent entries in the cache first.
  • MQ (multi-queue): This technique maintains different queues with the LRU strategy, where each queue is associated with a particular access frequency. A history buffer remembers the access frequency of the last entries to have been removed for a certain time. Stochastic multi-queue (SMQ) is a variety of MQ.
  • RR (random replacement): Ditches an entry at random.
  • Application specific: The cache learns from the application, operating system, hypervisor, or database what is worth keeping and adjusts to patterns of user behavior.
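The difference between two of these strategies can be sketched in a few lines of Python. This is a minimal illustration, not the code of any of the tools discussed here; the function names and the access sequence are made up for the example.

```python
from collections import OrderedDict, deque


def fifo_hits(accesses, capacity):
    """Count cache hits under FIFO replacement."""
    cache = set()
    order = deque()          # insertion order, oldest on the left
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1        # a hit does not change the eviction order
            continue
        if len(cache) >= capacity:
            cache.discard(order.popleft())   # evict the oldest insert
        cache.add(block)
        order.append(block)
    return hits


def lru_hits(accesses, capacity):
    """Count cache hits under LRU replacement."""
    cache = OrderedDict()    # least recently used entry first
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)         # a hit refreshes the entry
            continue
        if len(cache) >= capacity:
            cache.popitem(last=False)        # evict least recently used
        cache[block] = True
    return hits


# Hypothetical workload: block 1 is hot, the other blocks are touched once.
accesses = [1, 2, 3, 1, 4, 1, 5, 1, 6, 1]
print("FIFO hits:", fifo_hits(accesses, capacity=3))
print("LRU hits: ", lru_hits(accesses, capacity=3))
```

With this sequence and three cache slots, LRU scores one more hit than FIFO, because FIFO evicts the hot block simply for having been inserted first.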

In addition to the decision-making strategy, each cache also selects a write strategy. Write options include:

  • Writethrough: The system immediately stores the block to be written in the cache, as well as on the background medium. However, the process may have to wait until the write to the slower medium has completed.
  • Writeback: The block to be written is first stored only in the cache, not on the background medium. The block only moves to the slow hard disk when the entry is displaced from the cache. This strategy avoids waiting times, but at the cost of temporary inconsistency: The medium behind the cache contains outdated data at times. The cache must be battery-buffered for this strategy; otherwise, a power failure almost inevitably leads to data loss.
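The following toy model may help to picture the two write strategies. It is only a sketch with invented names (CachedDisk, evict); real caches track dirty blocks in persistent metadata and flush them in the background.

```python
class CachedDisk:
    """Toy block device with a cache in front of a slow backing store."""

    def __init__(self, mode):
        if mode not in ("writethrough", "writeback"):
            raise ValueError("unknown mode: " + mode)
        self.mode = mode
        self.cache = {}      # block number -> data on the fast medium
        self.disk = {}       # block number -> data on the slow medium
        self.dirty = set()   # blocks newer in the cache than on disk

    def write(self, block, data):
        self.cache[block] = data
        if self.mode == "writethrough":
            self.disk[block] = data      # caller waits for the slow write
        else:
            self.dirty.add(block)        # slow write is deferred

    def evict(self, block):
        """Displace a block; dirty data is flushed to disk first."""
        if block in self.dirty:
            self.disk[block] = self.cache[block]
            self.dirty.discard(block)
        self.cache.pop(block, None)


wt = CachedDisk("writethrough")
wb = CachedDisk("writeback")
for dev in (wt, wb):
    dev.write(7, "new data")

print(wt.disk.get(7))   # the disk already holds the block
print(wb.disk.get(7))   # the disk is stale until the block is evicted
wb.evict(7)
print(wb.disk.get(7))
```

Everything in the dirty set is what would be lost if the cache disappeared before the flush, which is the inconsistency the Writeback entry describes.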

Another distinguishing feature for caches is how the cache addresses its entries. In direct-mapped caches, the address in the cache is derived directly from the address on the main storage medium, such as by using its least significant bits. Associative caches, on the other hand, use an algorithm to determine the location in the cache, for example, via a hash function. Direct mapping is faster, but two blocks can displace each other even if the rest of the cache is completely empty. Associative mapping is more flexible, but the computational effort is higher.
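A short sketch shows how two blocks can displace each other in a direct-mapped cache although free slots remain. The slot count and helper names are assumptions for illustration, and the associative variant simply searches for a free slot instead of using a hash function.

```python
NUM_SLOTS = 8   # hypothetical cache size in blocks (a power of two)


def direct_mapped_slot(block_addr):
    """The slot is fixed by the least significant bits of the address."""
    return block_addr % NUM_SLOTS


def insert_direct(cache, block_addr):
    """Place a block; return the block it displaced, if any."""
    slot = direct_mapped_slot(block_addr)
    displaced = cache.get(slot)
    cache[slot] = block_addr
    return displaced if displaced != block_addr else None


def insert_associative(cache, block_addr):
    """Fully associative: any free slot will do, so search for one."""
    for slot in range(NUM_SLOTS):
        if cache.get(slot) in (None, block_addr):
            cache[slot] = block_addr
            return None
    raise RuntimeError("cache full; a replacement strategy must pick a victim")


direct, assoc = {}, {}
# Blocks 3 and 11 share their three lowest bits (3 % 8 == 11 % 8).
print(insert_direct(direct, 3), insert_direct(direct, 11))
print(insert_associative(assoc, 3), insert_associative(assoc, 11))
print(len(direct), len(assoc))
```

The direct-mapped cache ends up holding one block while seven slots stay empty; the associative cache keeps both, at the price of a search on every insert.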

Caching Solutions on Linux

Linux offers a variety of solutions for hard disk caching. This article only considers caches for block devices, which aren't affected by the filesystem and know nothing about the nature of the applications. For simplicity, the tests in this article do not consider the case where the same blocks are cached in other parts of the I/O stack, say, by the hard disk itself or in RAM when using a buffer cache.

One family of possible caching solutions for Linux centers around Flashcache [1]. Flashcache implements an associative cache with a writeback policy and uses FIFO (by default) or LRU as a replacement strategy. For this article, I tested Enhance IO, developed by STEC Inc. [2], which is based on Flashcache. Unlike Flashcache, Enhance IO does not use the device mapper. Enhance IO can transparently set up caching for mounted block devices. The Enhance IO environment supports three write strategies: Read-only, Writethrough, and Writeback.

In Read-only mode, all write operations are fed directly to the hard disk. Reads first transfer the data from the disk to the SSD; if access to the same block occurs again, the block is then read from the SSD.

In Writethrough mode, reads are handled as in Read-only mode, but writes go to the HDD and the SSD in parallel. Subsequent reads only access the SSD. Writeback mode performs all read and write operations on the SSD; the writes reach the disk asynchronously.

The other caching solution I tested for this article is dm-cache [3], which is directly connected to the device mapper. The dm-cache method creates an LVM hybrid volume from three devices: the actual cache, a small device for metadata (both on SSD), and the hard disk. The caching strategy is multi-queue (MQ) or its stochastic variant (SMQ); the write strategy can be Writeback, Writethrough, or Passthrough.

Installation

Installing dm-cache or Enhance IO is not exactly rocket science. For Enhance IO, you can follow the example in Listing 1. First clone the Git repository, copy the command file for the CLI to /sbin, and copy the man page to the right place. Then, copy the directory containing the driver sources to /usr/src and rename it. Next, you need to install the Dynamic Kernel Module Support (DKMS) framework. Before doing so, add the following line to the dkms.conf configuration file in the source directory:

PACKAGE_VERSION="0.1"

Now the installer can draw on DKMS to compile and install the driver module (the three dkms commands in Listing 1).

Listing 1

Enhance IO Installation

jcb@localhost:~$ git clone https://github.com/STEC-inc/EnhanceIO
jcb@localhost:~$ cd EnhanceIO/
jcb@localhost:~/EnhanceIO$ sudo cp CLI/eio_cli /sbin/
jcb@localhost:~/EnhanceIO$ sudo chmod 700 /sbin/eio_cli
jcb@localhost:~/EnhanceIO$ sudo cp ./CLI/eio_cli.8 /usr/share/man/man8/
jcb@localhost:~/EnhanceIO$ cd Driver
jcb@localhost:~/EnhanceIO/Driver$ sudo cp -r enhanceio /usr/src
jcb@localhost:~/EnhanceIO/Driver$ sudo mv /usr/src/enhanceio /usr/src/enhanceio-0.1
jcb@localhost:~/EnhanceIO/Driver$ cd /usr/src/enhanceio-0.1
jcb@localhost:/usr/src/enhanceio-0.1$ sudo vi dkms.conf
jcb@localhost:/usr/src/enhanceio-0.1$ sudo dnf install dkms
jcb@localhost:/usr/src/enhanceio-0.1$ sudo dkms add -m enhanceio -v 0.1
jcb@localhost:/usr/src/enhanceio-0.1$ sudo dkms build -m enhanceio -v 0.1
jcb@localhost:/usr/src/enhanceio-0.1$ sudo dkms install -m enhanceio -v 0.1
[root@graphite enhanceio-0.1]# sudo eio_cli create -d /dev/mapper/testvol-data1 -s /dev/nvme0n1p2 -m wb -c enhanceio_cache

The final step in Listing 1 sets up the cache with eio_cli create. In this example, /dev/mapper/testvol-data1 is the LVM volume you wish to accelerate and /dev/nvme0n1p2 is the SSD. Intel kindly provided a fast PCI Express NVMe SSD from the 750 series (with a capacity of 1.2TB) for the tests.

dm-cache is also easy to install. Because the device mapper framework is part of the kernel, you won't need any extra software. To prepare for installation, partition the SSD into a large partition for the cache and a small one for the metadata device. You can calculate the size of the metadata partition (in bytes) with:

Metadata = 4194304 + (16 * cache size/block size)
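As a worked example, the formula can be evaluated in Python. The cache size of 1 TiB and the block size of 512 sectors are assumptions, chosen to match the 4194304 cache blocks reported in Listing 2; the function name is made up.

```python
SECTOR_BYTES = 512


def metadata_bytes(cache_bytes, block_bytes):
    """Metadata size per the formula: 4 MiB plus 16 bytes per cache block."""
    nr_blocks = cache_bytes // block_bytes
    return 4194304 + 16 * nr_blocks


# Assumed figures: a 1 TiB cache partition and 512-sector (256 KiB) blocks.
cache_bytes = 1024 ** 4
block_bytes = 512 * SECTOR_BYTES
size = metadata_bytes(cache_bytes, block_bytes)
print(size, "bytes =", size / 1024 ** 2, "MiB")
```

Under these assumptions, the result is 71303168 bytes, or 68 MiB.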

In this example, the metadata partition is around 70MB. You can set up the special LVM device with the dmsetup command:

dmsetup create dmcache --table '0 1366552543 cache /dev/nvme0n1p2 /dev/nvme0n1p1 /dev/sdb2 512 1 writeback default 0'

This cryptic command line lists the following: the start sector and the length (in sectors) of the new device, the target type (cache), the device name for the metadata device, the cache device, the data device, then the block size in sectors, the number of feature arguments, and the write strategy feature argument (Writeback, in this case). Then, it lists the caching policy and the number of policy arguments (here: zero).
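To keep the positional fields apart, it can help to assemble the table line from named values. The helper below is a hypothetical convenience, not part of dmsetup; it only builds the string that is then passed to dmsetup create --table.

```python
def cache_table(length, metadata_dev, cache_dev, origin_dev,
                block_sectors, features=("writeback",),
                policy="default", policy_args=(), start=0):
    """Assemble a device mapper table line for the cache target."""
    fields = [start, length, "cache",
              metadata_dev, cache_dev, origin_dev,
              block_sectors,
              len(features), *features,        # count, then the features
              policy,
              len(policy_args), *policy_args]  # count, then the arguments
    return " ".join(str(f) for f in fields)


table = cache_table(
    length=1366552543,              # length of the mapped device in sectors
    metadata_dev="/dev/nvme0n1p2",
    cache_dev="/dev/nvme0n1p1",
    origin_dev="/dev/sdb2",
    block_sectors=512,
)
print(table)
```

With the values from the example, the function reproduces the table string used in the dmsetup call.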

If this command fails with the hard-to-understand error message Invalid or incomplete multi-byte or wide characters, it is probably because the cache or the metadata partition contains old data. dmsetup does not like that. A remedy is:

dd if=/dev/zero of=/dev/nvme0n1p2
dd if=/dev/zero of=/dev/nvme0n1p1

To check on the status, you can call up the cache statistics for the two solutions after performing a number of writes and reads. For dm-cache, the figures are output without formatting, and the meaning of the values is documented only in the kernel's device mapper documentation and the source code. The output will look like Listing 2.

Listing 2

dmsetup status

[root@graphite jcb]# dmsetup status
dmcache: 0 1366552543 cache 8 12468/17920 512 4653/4194304 1488021 200791
   2189199 41931 0 4650 0 1 writeback 2
   migration_threshold 2048 mq 10
   random_threshold 4 sequential_threshold 512
   discard_promote_adjustment 1
   read_promote_adjustment 4 write_promote_adjustment 8
fedora-home: 0 199393280 linear

In the example shown in this listing, the first slash-separated pair of numbers is Used Metadata Blocks/Total Metadata Blocks; this pair is followed by the block size in the cache and the second slash-separated pair, Used Cache Blocks/Total Cache Blocks. The next four numbers are the values for Read Hits, Read Misses, Write Hits, and Write Misses. Things are easier with Enhance IO. The Enhance IO statistics are located in a file on the Proc filesystem and formatted in a table (see Listing 3).
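The dm-cache status line is easier to read once the fields have names. The following sketch parses the dmcache line from Listing 2 by position, following the field order described above; the function and key names are invented, and the parser ignores everything after the write misses.

```python
def parse_cache_status(line):
    """Split a dm-cache status line into named counters."""
    name, rest = line.split(":", 1)
    f = rest.split()
    # f[0], f[1], f[2] are start sector, length, and target name ("cache").
    used_md, total_md = (int(x) for x in f[4].split("/"))
    used_cache, total_cache = (int(x) for x in f[6].split("/"))
    return {
        "name": name,
        "metadata_block_sectors": int(f[3]),
        "metadata_used": used_md,
        "metadata_total": total_md,
        "cache_block_sectors": int(f[5]),
        "cache_used": used_cache,
        "cache_total": total_cache,
        "read_hits": int(f[7]),
        "read_misses": int(f[8]),
        "write_hits": int(f[9]),
        "write_misses": int(f[10]),
    }


sample = ("dmcache: 0 1366552543 cache 8 12468/17920 512 4653/4194304 "
          "1488021 200791 2189199 41931 0 4650 0 1 writeback 2 "
          "migration_threshold 2048 mq 10 random_threshold 4 "
          "sequential_threshold 512 discard_promote_adjustment 1 "
          "read_promote_adjustment 4 write_promote_adjustment 8")
status = parse_cache_status(sample)
reads = status["read_hits"] + status["read_misses"]
print("read hit ratio: %.3f" % (status["read_hits"] / reads))
print("cache blocks in use: %d of %d"
      % (status["cache_used"], status["cache_total"]))
```

In practice, the line would come from the output of dmsetup status rather than from a string literal.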

Listing 3

Enhance IO Statistics (Excerpt)

[root@graphite jcb]# cat /proc/enhanceio/enhanceio_cache/stats
reads                              1962
writes                          6268272
read_hits                           346
read_hit_pct                         17
write_hits                      1870824
write_hit_pct                        29
dirty_write_hits                 167399
dirty_write_hit_pct                   2
cached_blocks                     17664
rd_replace                           65
wr_replace                       394196
[...]
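Because the Enhance IO statistics are plain name/value pairs, they are easy to process in a script. The sketch below parses a few lines copied from Listing 3 and recomputes the hit percentages; the function names are made up, and truncating to an integer is an assumption that happens to reproduce the figures in the excerpt.

```python
def parse_eio_stats(text):
    """Turn 'name value' lines into a dict, skipping anything else."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def pct(hits, total):
    """Integer percentage, truncated (assumed to match the excerpt)."""
    return hits * 100 // total if total else 0


sample = """\
reads                              1962
writes                          6268272
read_hits                           346
write_hits                      1870824
dirty_write_hits                 167399
"""
s = parse_eio_stats(sample)
print("read_hit_pct ", pct(s["read_hits"], s["reads"]))
print("write_hit_pct", pct(s["write_hits"], s["writes"]))
```

On a live system, the text would be read from /proc/enhanceio/enhanceio_cache/stats instead of the sample string.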

Benchmarks

You might be wondering what the reward is for all this effort. To study the benefits of caching, we ran various benchmarks. First, we successively migrated the hard disk files of a virtual machine to a regular disk, a RAID device, devices with dm-cache or Enhance IO, and a plain vanilla SSD. We then booted the VM and measured the time in each case.

Figure 2 shows the results. The data comes from the log of the Bootchart Tools [4], showing the number of seconds from the beginning of the boot process to starting Xorg. dm-cache's bad performance is explained by the fact that the cache needs a long time to warm up. A few boot attempts are not enough to accurately measure the performance. For the FIO benchmark, we had to repeat the measurement more than 70 times before dm-cache produced stable results.
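Why a handful of boot runs says little about a cache that promotes blocks cautiously can be illustrated with a toy model. The promotion rule below (a block enters the cache after a fixed number of accesses) and all numbers are hypothetical; this is not dm-cache's actual policy.

```python
def warmup_hit_ratios(working_set, passes, promote_after, capacity):
    """Hit ratio per pass when blocks are promoted only after repeated use."""
    seen = {}        # block -> number of accesses while not cached
    cached = set()
    ratios = []
    for _ in range(passes):
        hits = 0
        for block in working_set:
            if block in cached:
                hits += 1
                continue
            seen[block] = seen.get(block, 0) + 1
            if seen[block] >= promote_after and len(cached) < capacity:
                cached.add(block)   # served from the cache from now on
        ratios.append(hits / len(working_set))
    return ratios


# Hypothetical numbers: 100 blocks read per boot, promotion after 3 accesses.
ratios = warmup_hit_ratios(range(100), passes=5, promote_after=3, capacity=100)
print(ratios)
```

In this model, the first three passes run entirely from the slow disk, and the cache only pays off from the fourth pass onward; a benchmark that stops earlier measures the disk, not the cache.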

Figure 2: Booting a VM as a benchmark. dm-cache has not yet warmed up.

In a second benchmark, we used Flexible I/O Tester (FIO, [5]) and let it work with a read-only, random-access workload. The tester first created 15 files with a size between 10 and 100GB (total size 96GB, which was eight times the size of the available RAM) and then read arbitrary 4KB blocks with up to 16 threads for several minutes. This test shows the impressive superiority of the SSD-based devices compared to hard drives (Figure 3).

Figure 3: An SSD, either alone or as a cache, is miles ahead of hard disk drives in random reads.

We did not expect a cache to achieve results a few percent better than the plain SSD; the difference stems from normal fluctuations in the measurements and from the many influences within the complex I/O stack. Other caches in faster RAM at the filesystem level also play a role. The result for the single disk is so poor, at less than 1MB/s, that it disappears into the Y-axis. A RAID is significantly faster but still several orders of magnitude slower than devices that do without time-consuming repositioning of the read head.

For a third benchmark, we used Sysbench [6], which processed a read-write online transaction processing (OLTP) mix in a MySQL database. The MySQL data directory was stored successively on the devices. Each measurement was repeated at least three times, and a mean value was computed. The number of database threads working in parallel grew in the course of the benchmark process.

As you would expect, the SSD is the most expensive, but it is also the fastest solution. The two caches come pretty close to the peak of their performance curve with 64 threads. The RAID's performance was passable but much slower. Finally, the hard disk drive was mercilessly outclassed (Figure 4).

Figure 4: Sysbench with various devices. Nothing can catch up with the plain vanilla SSD, and the single disk is well beaten in last place.
