Monitoring SSD longevity

Anyone who has investigated SSDs for server applications is aware that flash-based SSDs have a limited life span. NAND flash cells have a finite program/erase (P/E) cycle limit, and though manufacturers use ever more sophisticated wear-leveling algorithms, the fact remains that the heavier your write activity to a device, the faster you will approach this limit. It would therefore be a good idea to monitor for wear indicators in much the same way that we monitor spinning disks for signs of impending failure (reallocated sectors, etc.).

There is currently no industry standard for exposing the remaining life of an SSD. Intel provides a Media Wear-out Indicator in the form of a SMART attribute (0xE9, or 233 decimal) whose normalized value starts at 100 and counts down to 1 as the media's P/E cycles are used up. It’s a rough estimate of remaining life, but it’s better than nothing. Other vendors may do similar things, but a cursory Google search didn’t turn up anything.

Since we use Intel SSDs at $DAYJOB, I thought I’d put together a monitor of this attribute so we can track the endurance of our deployed drives. We monitor various local metrics on our systems with Resmon, which has a simple plugin architecture. After about a half-day’s work (most of which was testing on different systems), I had a Resmon module that produces output like:

    Available_Reservd_Space = 100
    Available_Reservd_Space_raw = 0
    End-to-End_Error = 100
    End-to-End_Error_raw = 0
    Erase_Fail_Count = 100
    Erase_Fail_Count_raw = 0
    Host_Reads_32MiB = 100
    Host_Reads_32MiB_raw = 1599
    Host_Writes_32MiB = 100
    Host_Writes_32MiB_raw = 5964
    Media_Wearout_Indicator = 100
    Media_Wearout_Indicator_raw = 0
    Power_Cycle_Count = 100
    Power_Cycle_Count_raw = 29
    Power_On_Hours = 100
    Power_On_Hours_raw = 4504
    Program_Fail_Count = 100
    Program_Fail_Count_raw = 0
    Reallocated_Sector_Ct = 100
    Reallocated_Sector_Ct_raw = 0
    Reported_Uncorrect = 100
    Reported_Uncorrect_raw = 0
    Reserve_Block_Count = 100
    Reserve_Block_Count_raw = 0
    Spin_Up_Time = 100
    Spin_Up_Time_raw = 0
    Start_Stop_Count = 100
    Start_Stop_Count_raw = 0
    Unsafe_Shutdown_Count = 100
    Unsafe_Shutdown_Count_raw = 25
    Workld_Host_Reads_Perc = 100
    Workld_Host_Reads_Perc_raw = 21
    Workld_Media_Wear_Indic = 100
    Workld_Media_Wear_Indic_raw = 111
    Workload_Minutes = 100
    Workload_Minutes_raw = 8115
    fw = 4PC10302
    model = INTEL SSDSA2CT040G3
    serial = CVPRxxxxxxxxx40AGN

The above is from an Intel 320 drive. The module simply runs the smartctl utility from the smartmontools project and extracts the desired information from the output.
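To make the extraction step concrete, here is a minimal sketch in Python of how the attribute table could be turned into name/value pairs like the ones above. It is not the Resmon module itself (which is Perl), just an illustration. The function name `parse_attributes` and the `SAMPLE` text are my own: the sample follows the column layout that `smartctl -A` prints for ATA drives, with VALUE and RAW_VALUE figures copied from the output above and the remaining columns filled with placeholder values.

```python
import re

# Illustrative sample only. VALUE and RAW_VALUE come from the output shown
# above; FLAG, WORST, THRESH, TYPE, UPDATED and WHEN_FAILED are placeholders.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       4504
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       29
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
"""

# One table row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the raw value (which may contain spaces on some drives).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\S+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_attributes(text):
    """Return {name: normalized_value, name + '_raw': raw_string}."""
    metrics = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank line, or anything that is not a table row
        name = match.group(2)
        metrics[name] = int(match.group(4))
        # Keep the raw value as a string, since its format is vendor-specific.
        metrics[name + "_raw"] = match.group(10).strip()
    return metrics


if __name__ == "__main__":
    for key, value in sorted(parse_attributes(SAMPLE).items()):
        print(f"{key} = {value}")
```

In real use the text would come from running `smartctl -A` against a device rather than from a hard-coded string. The raw value is deliberately left unparsed because some attributes report more than a bare integer there.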

As mentioned above, the attribute I care about is Media_Wearout_Indicator, and it’s the normalized value that counts down from 100. Now that this value is exposed via Resmon, I can monitor it in Circonus. Not only will I be able to watch the MWI value, I can also track the serial number and firmware strings for a given drive as text metrics for correlation with other trends.
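As a rough illustration of what an alert on this value might look like, the Python sketch below classifies a normalized MWI reading. The function name `wear_status` and the thresholds (warn at 20, critical at 10) are hypothetical choices of mine, not vendor guidance and not anything Circonus or Resmon prescribes; pick numbers that match your own replacement lead time.

```python
def wear_status(mwi, warn_at=20, crit_at=10):
    """Classify a normalized Media_Wearout_Indicator value.

    The value counts down from 100 toward 1, so lower means more worn.
    warn_at and crit_at are hypothetical thresholds for illustration.
    """
    if mwi <= crit_at:
        return "critical"
    if mwi <= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for reading in (100, 20, 10):
        print(reading, wear_status(reading))
```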

I made this a generic SMART module, so even if you don’t have any SSDs, you can monitor other important attributes such as reallocated sectors or temperature. The module pulls both the normalized and the raw values. SMART attributes are vendor-specific, so check on your drives’ specs to see which value is most meaningful for a given attribute. Note that in its current form, the SMART module only knows how to interpret typical output for ATA devices, which is presented in a tabular format that is relatively easy to parse. Output for SCSI/SAS devices is much more free-form and will not be picked up by this module.
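Host_Writes_32MiB is a case where the raw value is the useful one. Going by the attribute name, I am assuming each raw unit represents 32 MiB written by the host; check your drive's documentation before relying on that. Under that assumption, a short Python sketch converts the raw count from the output above into GiB and, using Power_On_Hours as an approximation of time in service, into an average daily write rate. The function names are my own.

```python
MIB_PER_UNIT = 32  # assumption taken from the attribute name Host_Writes_32MiB


def host_writes_gib(raw_units):
    """Total host writes in GiB, assuming 32 MiB per raw unit."""
    return raw_units * MIB_PER_UNIT / 1024


def gib_per_day(raw_units, power_on_hours):
    """Average GiB written per day of powered-on time."""
    return host_writes_gib(raw_units) / (power_on_hours / 24)


if __name__ == "__main__":
    # Raw values from the sample output: 5964 units over 4504 power-on hours.
    print(host_writes_gib(5964))
    print(gib_per_day(5964, 4504))
```

For the drive shown above this works out to about 186 GiB written in total, or just under 1 GiB per powered-on day.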

So far the module has been tested on OmniOS and Linux, but it should work on *BSD and other platforms supported by smartmontools. All you need is Resmon and a Perl interpreter!
