SSDs are taking the storage industry by storm by filling the ever-widening latency gap between hard drives and other computing resources. Almost every major storage vendor has a flash product now, but what is interesting is how much their approaches differ. Many have rushed to market with flash as a read cache for disks. Others have used gold-plated SLC flash or even PCIe flash cards. Yet others have put together a tray of SSDs with an open-source file system. These early products are unable to deliver the full benefits of flash because they do not address the hard problems of flash, or are simply too expensive for mainstream applications. I stopped in at the Solid State Storage Symposium today, where many of these issues were hot topics. If you’re looking at any SSD storage systems, it’s important to understand the challenges.
SSDs behave very differently from hard disks. The main complexity lies in the Flash Translation Layer (FTL), which provides the magic that makes a bunch of flash chips usable in enterprise storage: wear leveling, ECC for data retention, page remapping, garbage collection (GC), write caching, managing internal mapping tables and so on. However, these internal tasks conflict with user requests and manifest as two main issues: latency spikes and limited durability.
The main appeal of SSD is its low latency, but that low latency is not delivered consistently. And while write latency can be masked with write-back caching, read latency cannot be hidden. Typical SSD latencies are a couple of hundred microseconds, but some accesses can be interrupted by device-internal tasks, and their latency can exceed tens of milliseconds or even seconds. That’s slower than a hard disk.
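To make the point about inconsistency concrete, here is a minimal Python sketch that summarizes a latency trace. The trace values are made up for illustration (they are not measurements from any drive), and `percentile` and `summarize` are hypothetical helper names, not part of any product.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(latencies_us):
    """Return mean, median, and 99.9th percentile latency in microseconds."""
    return {
        "mean": sum(latencies_us) / len(latencies_us),
        "p50": percentile(latencies_us, 50),
        "p99_9": percentile(latencies_us, 99.9),
    }

# Hypothetical trace: 998 reads at 200 us, plus 2 reads stalled for 20 ms.
trace = [200] * 998 + [20_000] * 2
stats = summarize(trace)
```

With this made-up trace the mean is about 240 microseconds and the median is 200, while the 99.9th percentile is 20,000. Averages hide the spikes, so tail percentiles are the numbers to look at.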
There are myriad flash-internal tasks that can contribute to latency, such as GC at inopportune times or stalling user IO to periodically persist internal metadata. What further complicates the situation is the lack of coordination across an array of devices. The most common way to use SSD is to configure a group of devices, typically in RAID-6. But each device is its own ecosystem, completely unaware of the others, so the performance of IOs to the array becomes even more unpredictable because internal tasks run uncoordinated across the devices.
Unless the storage subsystem understands the circumstances under which latency spikes occur and can manage or proactively schedule them across the entire array, latency will be inconsistent and vary widely.
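A rough way to see why an uncoordinated array is worse than a single device: suppose each device spends some fraction of its time busy with internal tasks, independently of the others. Both the independence and the busy fraction are simplifying assumptions for illustration, and `stripe_stall_probability` is a hypothetical name. A request that has to wait on every device in the group is then delayed whenever at least one of them is busy.

```python
def stripe_stall_probability(busy_fraction, devices):
    """Probability that at least one of `devices` is busy with an internal task.

    Assumes each device is busy independently for `busy_fraction` of the time,
    which is an illustrative model rather than measured behavior.
    """
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    if devices < 1:
        raise ValueError("devices must be at least 1")
    return 1.0 - (1.0 - busy_fraction) ** devices

single = stripe_stall_probability(0.01, 1)
array = stripe_stall_probability(0.01, 10)
```

Under these assumptions, a device that is busy 1% of the time turns into an array of ten that stalls such requests roughly 9.6% of the time. Scheduling internal work across the array is one way to pull that number back down.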
Figure 1: Latency plotted against time for a widely used SSD on the market. The workload is a combination of sequential writes and random reads, common with many log-structured file systems today. You can see the periodic latency spikes that disrupt normal access.
Another issue is endurance. Although flash is great for IOPS, it supports a limited number of write cycles compared with a hard disk. And while SLC flash drives have higher endurance than MLC, they are too expensive for mainstream applications, and may still require over-provisioning to control write amplification. MLC flash is much more cost-effective, but will quickly wear out if used naively. Its wear is proportional to the total amount of data written to it, both by the user and by internal drive activity such as GC, page remapping, wear leveling, or data movement for retention.
The additional data written internally for each user write is referred to as write amplification and is usually highly dependent on device usage patterns. It is possible to nearly eliminate write amplification by using the device in a way that hardly ever triggers GC, but the techniques are not widely understood and may be drive-specific. Techniques for total data reduction such as dedupe and compression are more widely known, but hard to implement efficiently with low latency. Similarly, building a file system that has a low metadata footprint and low IO overhead per user byte is also challenging but yields high benefit.
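Here is a back-of-the-envelope sketch of how write amplification eats into drive lifetime. It assumes perfect wear leveling and uses made-up numbers: the 400 GB capacity, 3,000 program/erase cycles, and 1 TB/day of user writes are hypothetical, not the rating of any specific drive. The function names are also just for illustration.

```python
GB = 10**9

def write_amplification(host_bytes, internal_bytes):
    """Total bytes written to flash divided by bytes the host asked to write."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return (host_bytes + internal_bytes) / host_bytes

def lifetime_days(capacity_bytes, pe_cycles, host_bytes_per_day, waf):
    """Rough days until the rated program/erase cycles are used up.

    Assumes wear is spread perfectly evenly across the whole device.
    """
    total_flash_writes = capacity_bytes * pe_cycles
    return total_flash_writes / (host_bytes_per_day * waf)

# Hypothetical drive and workload.
ideal = lifetime_days(400 * GB, 3000, 1000 * GB, waf=1.0)
# 3 bytes of internal writes for every byte of user data.
waf = write_amplification(host_bytes=100 * GB, internal_bytes=300 * GB)
naive = lifetime_days(400 * GB, 3000, 1000 * GB, waf=waf)
```

With these numbers, a write amplification of 4 cuts the estimate from 1,200 days to 300. The same arithmetic shows why dedupe and compression help: they shrink the user bytes that reach the device in the first place.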
A big challenge to designing a storage system that runs efficiently on flash is understanding drive geometry. Traditional file systems are highly tuned to the geometry of hard drives; this was possible because of the wealth of information available about hard disks. For SSD, however, the device geometry is highly virtualized and proprietary. This makes the task of reducing write amplification hard. For example, writing a whole program/erase block instead of random pages within a block can reduce amplification, but what are the boundaries of these program/erase blocks? How big are they, and where do they start?
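If you did know the geometry, lining writes up with it would be the easy part. The sketch below assumes a known block size and starting offset, which is exactly the information drives usually do not expose. The 2 MiB value is a placeholder and the function names are hypothetical.

```python
MIB = 1024 * 1024

def split_on_erase_blocks(offset, length, block_size, first_block_start=0):
    """Split a byte range into (start, length) pieces that never cross
    an erase block boundary."""
    pieces = []
    end = offset + length
    while offset < end:
        # Find the first block boundary strictly after the current offset.
        index = (offset - first_block_start) // block_size
        boundary = first_block_start + (index + 1) * block_size
        piece_end = min(boundary, end)
        pieces.append((offset, piece_end - offset))
        offset = piece_end
    return pieces

def is_full_block_write(offset, length, block_size, first_block_start=0):
    """True when the range starts on a boundary and covers whole blocks only."""
    aligned_start = (offset - first_block_start) % block_size == 0
    return length > 0 and aligned_start and length % block_size == 0

# Placeholder geometry: 2 MiB erase blocks starting at offset 0.
pieces = split_on_erase_blocks(1 * MIB, 3 * MIB, block_size=2 * MIB)
```

In this example a 3 MiB write starting at 1 MiB splits into a partial piece of 1 MiB and a full block of 2 MiB. Only the second piece is the kind of write that avoids extra work inside the drive; the hard part remains finding out the real block size and starting offset.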
Finally, one of the key requirements for enterprise storage is reliability. Given that write endurance is a challenge for SSD and that suboptimal usage patterns can reduce it further, it is important to predict when data is at risk. SSDs have failure modes that are different from hard disks. As SSDs wear out, they begin to experience more and more program failures, resulting in additional latency spikes. Furthermore, because efficient wear leveling spreads wear evenly across blocks, many blocks approach their limit at about the same time, so an SSD can wear out very quickly as it nears the end of its useful life.
Figure 2: This is a graph of write cycles vs. use of SSD spare pages. You can see the use of spares rising very slowly in the beginning and then exponentially toward the end.
It is important to observe and deduce when a device is vulnerable. This is not a simple matter of counting the number of bytes written to the device and comparing it with its rating. It means observing the device for various signs of failure and errors and taking action.
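As a sketch of what "observing the device" could look like, the code below watches how fast spare pages are being consumed rather than just counting bytes written. The input format, the thresholds, and the function names are all assumptions made for illustration; real drives report health in vendor-specific ways, and this is not how any particular product does it.

```python
def spare_growth_rate(samples):
    """Spare pages consumed per write cycle between the first and last sample."""
    (c0, s0), (c1, s1) = samples[0], samples[-1]
    if c1 <= c0:
        return 0.0
    return (s1 - s0) / (c1 - c0)

def is_vulnerable(history, total_spares, window=3, factor=5.0,
                  max_used_fraction=0.8):
    """Flag a device whose spare usage is high or suddenly accelerating.

    history: list of (write_cycles, spares_used) tuples, oldest first.
    All thresholds are illustrative defaults, not vendor guidance.
    """
    if not history:
        return False
    # Absolute check: most of the spare pool is already gone.
    if history[-1][1] / total_spares >= max_used_fraction:
        return True
    # Need enough samples to compare an older rate with a recent one.
    if len(history) < window + 2:
        return False
    baseline = spare_growth_rate(history[:-window])
    recent = spare_growth_rate(history[-(window + 1):])
    if baseline <= 0:
        return recent > 0
    return recent > factor * baseline

healthy = [(0, 0), (1000, 10), (2000, 20), (3000, 30), (4000, 40), (5000, 50)]
wearing = [(0, 0), (1000, 10), (2000, 20), (3000, 60), (4000, 180), (5000, 500)]
```

The `healthy` history grows at a steady rate and is not flagged. The `wearing` history has used only half of its spares, but the recent rate is far above the early baseline, which is the kind of acceleration described in Figure 2.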
The bottom line is that SSD is a game changer, but it needs to be implemented correctly in any storage system and won’t be as effective if it’s bolted onto an existing system. If you’re evaluating any SSD storage products, make sure you understand how the file system uses flash and manages performance, latency spikes, write endurance and reliability.
The Tintri OS has solved these problems with simple but efficient solutions. It allows you to use SSD as an integral part of your storage without compromising consistent performance or data reliability.