Facebook Analyzes SSD Failure Rates In The Data Center Environment

Researchers at Carnegie Mellon University and Facebook performed an extensive analysis of the effects of various factors on flash-based SSD reliability across a majority of the SSDs employed at Facebook, running production data center workloads. The goal was to examine SSD behavior under 24/7 data center operations. The research, entitled "A Large-Scale Study of Flash Memory Failures in the Field," was performed by Justin Meza and Onur Mutlu of Carnegie Mellon University and Qiang Wu and Sanjeev Kumar of Facebook.

The team analyzed a variety of internal and external characteristics of SSDs and examined how these characteristics affect the trends for uncorrectable errors.

Unfortunately, the study doesn't break results out by vendor; it instead classifies the SSDs by age. In addition, the team defines an SSD failure as an uncorrectable read error (URE) leading to data loss. That doesn't mean the SSD was dead, but the researchers did find that SSDs that had one URE were much more likely to have another.

Furthermore, since SSDs don't relay internal read errors that the controller can correct, the only read errors the study captured were those reported to the server, so the results reflect errors reported at the device level.

The SSDs examined were high-capacity multi-level cell (MLC) flash-based drives with capacities of 720 GB, 1.2 TB, and 3.2 TB. The technologies examined spanned two generations of PCIe, versions 1 and 2.

Findings

The study concluded that SSDs go through several distinct failure periods - early detection, early failure, usable life, and wearout - during their lifecycle, corresponding to the amount of data written to flash chips. Due to pools of flash blocks with different reliability characteristics, the failure rate in a population does not monotonically increase with respect to the amount of data written to flash chips.
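
To make the idea of a non-monotonic failure rate concrete, the sketch below groups a fleet of drives into buckets by data written and computes the fraction of drives in each bucket that reported an uncorrectable error. The drive records, the 10 TB bucket width, and the function names are illustrative assumptions; none of them come from the study.

```python
from collections import defaultdict


def failure_rate_by_bucket(drives, bucket_tb):
    """Map each data-written bucket to the fraction of drives with an error.

    drives: iterable of (terabytes_written, had_uncorrectable_error) pairs.
    bucket_tb: bucket width in terabytes (a hypothetical choice).
    """
    totals = defaultdict(int)
    failed = defaultdict(int)
    for written_tb, had_error in drives:
        bucket = int(written_tb // bucket_tb) * bucket_tb
        totals[bucket] += 1
        if had_error:
            failed[bucket] += 1
    return {b: failed[b] / totals[b] for b in sorted(totals)}


def is_monotonic_increasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical records, invented purely for illustration.
drives = [
    (2, False), (5, True), (8, False), (9, False),
    (12, True), (15, True), (18, False), (19, False),
    (22, False), (25, False), (27, True), (29, False),
    (31, True), (35, True), (38, True), (39, False),
]

rates = failure_rate_by_bucket(drives, bucket_tb=10)
print(rates)
print(is_monotonic_increasing(list(rates.values())))
```

In this made-up fleet the failure rate rises, dips, and rises again as more data is written, which is the general shape of trend the study describes.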

This is unlike the failure rate trends seen in raw flash chips. The researchers suggest that techniques should be designed to help reduce or tolerate errors throughout the SSD lifecycle. For example, additional error correction at the beginning of an SSD's life could help reduce the failure rates seen during the early detection period.

The study also found that read disturbance errors are not a predominant source of errors in the SSDs.

In addition, the researchers concluded that sparse data layout across an SSD's physical address space (e.g., non-contiguously allocated data) leads to high SSD failure rates; dense data layout (e.g., contiguous data) can also negatively impact reliability under certain conditions, likely due to adversarial access patterns.

Further research into flash write coalescing policies with information from the system level may help improve SSD reliability. For example, information about write access patterns from the operating system could potentially inform SSD controllers of non-contiguous data that is accessed very frequently, which may be one type of access pattern that adversely affects SSD reliability and is a candidate for storing in a separate write buffer.

Another important finding was the effect of temperature. Higher temperatures led to increased failure rates, but did so most noticeably for SSDs that do not employ throttling techniques, according to the study. In general, techniques like throttling, which may be employed to reduce SSD temperature, could be effective at reducing the failure rate of SSDs. The study also found that SSD temperature is correlated with the power used to transmit data across the PCIe bus, which can potentially be used as a proxy for temperature in the absence of SSD temperature sensors.
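
As a rough illustration of using bus power as a temperature proxy, the sketch below fits a straight line to paired power and temperature readings, then estimates temperature for a drive that reports only power. The readings, the linear model, and the function names are assumptions made for this example; the study reports a correlation, not this particular model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_temperature(power_watts, slope, intercept):
    """Estimate temperature (deg C) from bus power using the fitted line."""
    return slope * power_watts + intercept


# Hypothetical calibration readings from drives that do have sensors.
bus_power_watts = [8.0, 10.0, 12.0, 14.0]
temperature_c = [40.0, 45.0, 50.0, 55.0]

slope, intercept = fit_line(bus_power_watts, temperature_c)

# Estimate for a drive without a temperature sensor drawing 11 W.
estimate = estimate_temperature(11.0, slope, intercept)
print(slope, intercept, estimate)
```

A real deployment would need calibration data per drive model and would have to check how well a linear fit actually holds before relying on it.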

Last but not least, the study found that the amount of data reported to be written by the system software can overstate the amount of data actually written to flash chips, due to system-level buffering and wear reduction techniques. Techniques that simply reduce the rate of software-level writes may not reduce the failure rate of SSDs.

"Studies seeking to model the effects of reducing software-level writes on flash reliability should also consider how other aspects of SSD operation, such as system-level buffering and SSD controller wear leveling, affect the actual amount of data written to SSDs," the study reads.