With their amazing random read/write performance, which is about two orders of magnitude higher than that of traditional Hard Disk Drives (HDDs), Solid State Drives (SSDs) are perfect for I/O intensive workloads. However, the attractive numbers from the vendor spec sheets often list only peak performance. Achieving and maintaining the same level of performance in practice can be elusive.
Things seemed to run smoothly when we first began using SSDs in our production systems. For example, a single MySQL server could easily handle 15,000 transactions per second. Yet we soon started to see instabilities and determined that SSDs were the culprit. For instance, the random read I/O throughput of one of our servers was degraded to 50 MB/s compared to the original 450+ MB/s throughput number advertised by the manufacturer.
Why did the strong performance of the SSDs drop so much within only a few months of running the system with a random read and write workload? To understand the reason, we first need to understand how SSDs work internally. Not surprisingly, they are totally different from HDDs. One fundamental difference is how they write data. Before an SSD writes data, it has to erase the old data first, just like you erase the writing on a whiteboard to write something new. Even worse, imagine that you have to wipe the whole whiteboard clean even if you only want to write a single word. In an SSD, the unit for writing is called a page, whose size can be 2KB, 4KB, 8KB, or 16KB, while the unit for erasing is called an erase block, which can include 128 or 256 pages.
With these limitations, how do you update data efficiently in an SSD? The naive way is to update the data in place. To do this, an SSD has to back up the entire erase block somewhere else (e.g., in memory), erase the block, and then write back the updated block. As a result, even though you may only need to update 8KB of data, you might end up reading and then writing back 4MB of data (256 pages of 16KB each). This is called write amplification.
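The arithmetic behind that 4MB figure can be sketched in a few lines. This is a minimal illustration, assuming the largest page size and block size mentioned above (16KB pages, 256 pages per erase block); the function name is made up for this example.

```python
KIB = 1024

def in_place_write_amplification(update_bytes, page_bytes, pages_per_block):
    """Bytes physically rewritten divided by bytes the host asked to write,
    assuming a naive in-place update that rewrites one whole erase block."""
    block_bytes = page_bytes * pages_per_block
    if update_bytes <= 0 or update_bytes > block_bytes:
        raise ValueError("update must fit inside a single erase block")
    return block_bytes / update_bytes

# 16KB pages, 256 pages per erase block -> 4MB erase block
block_mib = 16 * KIB * 256 / (KIB * KIB)

# Updating 8KB rewrites the whole 4MB block.
wa_8k = in_place_write_amplification(8 * KIB, 16 * KIB, 256)

# Updating exactly one 16KB page rewrites all 256 pages.
wa_page = in_place_write_amplification(16 * KIB, 16 * KIB, 256)

print(block_mib, wa_8k, wa_page)  # 4.0 512.0 256.0
```

Measured in bytes, the 8KB update is amplified 512 times; measured in pages, updating one page costs 256 page writes.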
The graph below shows an example. In this simplified example, we assume that there are only five pages per erase block. To write to page B, we first copy the whole erase block to memory, erase the block, update in memory, and then write back (also called program) the updated block. As a result, even though we only mean to update a single page, all of the other pages also get read and then written back.
To avoid this performance penalty, SSD designers choose to write the new data to a new page and mark the old page as invalid, so that invalid pages can be erased later in a batch. I will cover how the drive remembers the mapping between logical and physical pages later in this post.
Since we write to new pages but delay the deletion of the old pages, we consume a considerable amount of extra space to store our data. For example, in the following graph page B has been rewritten twice. Since the old erase blocks still contain valid pages and have not yet been recycled, we have three different copies of page B on our disk. Similarly, we have two copies of page J. As a result, we can run out of physical disk space even though we have only written part of the disk capacity.
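The same bookkeeping can be shown with a toy model. This is only a sketch under simplifying assumptions: the class `AppendOnlyFlash` and its methods are invented for illustration, pages are identified by letters, and there is no garbage collection yet.

```python
class AppendOnlyFlash:
    """Toy model: pages are written sequentially, never overwritten in place."""

    def __init__(self, pages_per_block, num_blocks):
        self.capacity = pages_per_block * num_blocks
        self.physical = []   # physical page index -> (logical name, is_valid)
        self.mapping = {}    # logical name -> physical index of the live copy

    def write(self, name):
        if len(self.physical) >= self.capacity:
            raise RuntimeError("no clean pages left; garbage collection needed")
        old = self.mapping.get(name)
        if old is not None:
            self.physical[old] = (name, False)  # mark the old copy invalid
        self.mapping[name] = len(self.physical)
        self.physical.append((name, True))

    def copies(self, name):
        """Physical pages occupied by this logical page, valid or not."""
        return sum(1 for n, _ in self.physical if n == name)

    def valid_pages(self):
        return sum(1 for _, valid in self.physical if valid)


flash = AppendOnlyFlash(pages_per_block=5, num_blocks=3)
for name in "ABCDEFGHIJ" + "BJB":   # ten pages, then B, J, B rewritten
    flash.write(name)

print(flash.copies("B"), flash.copies("J"))       # 3 2
print(flash.valid_pages(), len(flash.physical))   # 10 13
```

Ten logical pages end up occupying thirteen physical pages, which is how the disk can fill up before the user has written its full capacity.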
To avoid this situation, an SSD has to recycle old pages periodically, a process called garbage collection, instead of waiting for all of the pages within an erase block to become invalid. To do so, it has to copy all of the still-valid pages to a new block before erasing the old block. These extra background I/O operations interfere with the foreground I/O operations requested by the user and hurt the performance of the disk. The frequency of garbage collection is determined by the number of clean blocks. Fewer clean blocks result in more frequent garbage collection and thus worse disk performance. Taken to the extreme, if the disk is filled with user data and there is no clean block left, it has to fall back to in-place updates, with writes amplified by up to 256x.
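One garbage collection pass can be sketched as follows. The greedy victim selection (reclaim the block with the fewest valid pages) is an assumption made for this example, not a description of any particular controller; the function name and data layout are also illustrative.

```python
def garbage_collect(blocks, spare):
    """Reclaim one erase block.

    `blocks` is a list of blocks, each a list of page names where None marks
    an invalid (stale) page and an empty list is an already-clean block.
    `spare` is the open block that receives the still-valid pages.
    Returns (victim_index, pages_copied).
    """
    used = [i for i, block in enumerate(blocks) if block]
    # Greedy policy (an assumption): the block with the fewest valid pages
    # is the cheapest one to reclaim.
    victim = min(used, key=lambda i: sum(p is not None for p in blocks[i]))
    moved = [p for p in blocks[victim] if p is not None]
    spare.extend(moved)      # extra background writes
    blocks[victim] = []      # erase the whole block at once
    return victim, len(moved)


blocks = [
    ["A", None, "C", "D", "E"],     # 4 valid pages
    ["F", "G", "H", "I", None],     # 4 valid pages
    [None, None, "B", None, "K"],   # 2 valid pages
]
spare = []
victim, copied = garbage_collect(blocks, spare)
print(victim, copied, spare)  # 2 2 ['B', 'K']
```

Reclaiming the five-page block costs two background page writes here. The more valid pages the victim still holds, the more copying each pass needs, which is why a nearly full drive spends so much time on garbage collection.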
Does this mean it will be fine if I only use, say, 70% of the disk capacity? Unfortunately, it is not that simple. Because of miscommunication between file systems and disks, disks often think that they are full even though they are not. Sounds weird, right? In fact, if you have ever recovered a deleted file from a disk, you know that file systems don’t really tell disks to eliminate any data when deleting a file. All they do is flip a bit to mark the file as deleted. This is not a problem for a traditional HDD since it updates data in place anyway. It is a completely different story for an SSD, since it has to keep copying these now-unused pages around during garbage collection as if they were still valid. Even worse, it will think that it has run out of clean blocks even though it has not.
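The mismatch between the two views can be illustrated with a toy model. Everything here is hypothetical: `DriveView`, `ToyFileSystem`, and the `discard` notification are names made up for this sketch, and real file systems and drives track far more state.

```python
class DriveView:
    """What a toy SSD believes about its logical pages."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.live = set()   # pages the drive must preserve

    def write(self, pages):
        self.live.update(pages)

    def discard(self, pages):
        """Hypothetical notification that these pages are no longer needed."""
        self.live.difference_update(pages)

    def utilization(self):
        return len(self.live) / self.total_pages


class ToyFileSystem:
    def __init__(self, drive):
        self.drive = drive
        self.files = {}   # file name -> set of logical page numbers

    def create(self, name, pages):
        self.files[name] = set(pages)
        self.drive.write(pages)

    def delete(self, name, notify_drive=False):
        pages = self.files.pop(name)   # metadata-only unless we notify
        if notify_drive:
            self.drive.discard(pages)

    def utilization(self):
        used = sum(len(pages) for pages in self.files.values())
        return used / self.drive.total_pages


drive = DriveView(total_pages=100)
fs = ToyFileSystem(drive)
fs.create("data.db", range(0, 70))
fs.create("old.log", range(70, 100))
fs.delete("old.log")   # the file system only updates its own metadata

print(fs.utilization(), drive.utilization())  # 0.7 1.0
```

The file system reports 70% usage while the drive still treats every page as live. Calling `fs.delete("old.log", notify_drive=True)` instead would bring the drive's view down to 70% as well.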
Now it is clear what happened to our SSDs. Over time, our daily production traffic caused the SSDs to become fuller from their own perspective. As a result, garbage collection was triggered more and more often until disk performance reached unacceptable levels. One natural way to solve the issue is to tell SSDs which data have been deleted. Modern operating systems support a command called TRIM that allows file systems to pass this information to the underlying disks. Unfortunately, TRIM support in RAID cards is rare. Also, TRIM is defined as a non-queued command, which might cause I/O stalls, so it has to be used with care. Even with TRIM, a disk can still fill up. This is where over-provisioning kicks in. The basic idea is to reserve a portion of the disk to accommodate the necessary background I/O, so that the disk never runs out of clean blocks. Most SSDs are over-provisioned at the factory, but this is not always the case. Other design optimizations also help performance. For example, most SSDs remember the mapping between logical and physical pages in a tree data structure. To stay space efficient, the tree itself needs expensive maintenance such as defragmentation, which worsens the situation even more. Given this, one solution is to adopt a simpler and more stable data structure.
With the above criteria in mind, we picked the Intel DC S3700 SSD for our new architecture. The S3700 is designed for data center workloads, with Intel’s third-generation controller and enterprise-level NAND flash memory. The drive is about 32% over-provisioned at the factory. That should be enough for most workloads, but you can always reserve more for your own purposes. In addition, Intel replaced the mapping tree with a flat 1:1 mapping table to further eliminate the performance penalty from metadata maintenance.
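To make the over-provisioning figure concrete, here is a small calculation. It assumes the common convention of expressing over-provisioning as spare capacity relative to user-visible capacity, and the capacities are hypothetical numbers chosen to reproduce a ratio of about 32%; they are not taken from a spec sheet.

```python
def over_provisioning(physical_gb, usable_gb):
    """Spare capacity as a fraction of the capacity exposed to the user."""
    if not 0 < usable_gb <= physical_gb:
        raise ValueError("usable capacity must be positive and fit on the drive")
    return (physical_gb - usable_gb) / usable_gb

# Hypothetical drive: 264 GB of flash, 200 GB exposed to the user.
factory = over_provisioning(264, 200)

# Reserving more: only partition 160 GB of the 200 GB and leave the rest unused.
extra = over_provisioning(264, 160)

print(f"{factory:.0%} {extra:.0%}")  # 32% 65%
```

Leaving part of the drive unpartitioned only helps if the drive knows those pages are unused, for example because they were never written.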
Designing and measuring the SSD array could be covered in another blog post. Below we show the benchmark results of our most recent SSD array configuration. The red and blue curves show the 4KB random read and random write IOPS (I/O operations per second), respectively, measured by fio (http://freecode.com/projects/fio). To test stability, we ran the tests long enough to cover the whole disk. As you can see, the performance is pretty stable across the whole process, especially for random writes. Even at the end of the test, when the disk was almost full, the write performance didn’t drop much. We are not sure what causes the stair shape of the random read curve; it is probably related to garbage collection. However, the performance remains sufficiently high and stable for the whole process. To back up the tests, we have been running the new arrays in production systems for several months and have been happy with their real-world performance.
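For readers who want to run a similar 4KB random I/O test, the sketch below builds a fio command line. The option values (queue depth, I/O engine, runtime) and the device path are assumptions for illustration, not the settings used for the results above. The script only prints the commands; it does not run them.

```python
import shlex

def fio_args(device, pattern, runtime_s=600):
    """Build a fio command for a 4KB random read or write test.

    All values below are illustrative assumptions. A randwrite test against
    a raw device destroys the data on it.
    """
    if pattern not in ("randread", "randwrite"):
        raise ValueError("pattern must be 'randread' or 'randwrite'")
    return [
        "fio",
        f"--name=4k-{pattern}",
        f"--filename={device}",
        f"--rw={pattern}",
        "--bs=4k",             # 4KB blocks, as in the benchmark above
        "--direct=1",          # bypass the page cache
        "--ioengine=libaio",   # Linux asynchronous I/O
        "--iodepth=32",        # assumed queue depth
        "--time_based",
        f"--runtime={runtime_s}",
    ]

# "/dev/sdX" is a placeholder, not a real device name.
for pattern in ("randread", "randwrite"):
    print(shlex.join(fio_args("/dev/sdX", pattern)))
```

To cover the whole disk as described above, the runtime would have to be long enough for the write test to fill the drive's capacity at least once.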
Interested in helping us scale our infrastructure to the next 30+ million users and beyond? Join our growing team!