For NVMe, if you get the SMART data with smartmontools/smartctl, you can inspect Percentage Used.
"Percentage Used: Contains a vendor specific estimate of the percentage of life used for the Endurance Group based on the actual usage and the manufacturer’s prediction of NVM life. A value of 100 indicates that the estimated endurance of the NVM in the Endurance Group has been consumed, but may not indicate an NVM failure. The value is allowed to exceed 100."
For SATA/SAS SSDs, there is the "Media_Wearout_Indicator" attribute, which hasn't been a particularly reliable indicator in my experience.
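If you want to script this, here's a rough sketch of pulling both values out of smartctl's JSON output (assumes smartmontools 7.0+ for the --json flag; the device path is just an example):

    # Rough sketch: read SSD wear indicators from smartctl's JSON output.
    # Assumes smartmontools 7.0+ and enough privileges to query the device.
    import json
    import subprocess

    def read_wear(device):
        # check=False because smartctl encodes status bits in its exit code
        out = subprocess.run(["smartctl", "--json", "-a", device],
                             capture_output=True, text=True, check=False)
        data = json.loads(out.stdout)

        # NVMe: the health log reports Percentage Used directly.
        nvme_log = data.get("nvme_smart_health_information_log")
        if nvme_log is not None:
            print(f"{device}: Percentage Used = {nvme_log['percentage_used']}%")
            return

        # ATA SSDs: look for the Media_Wearout_Indicator attribute
        # (attribute names and IDs vary by vendor).
        for attr in data.get("ata_smart_attributes", {}).get("table", []):
            if attr["name"] == "Media_Wearout_Indicator":
                print(f"{device}: Media_Wearout_Indicator = {attr['value']}")
                return

        print(f"{device}: no recognized wear attribute found")

    read_wear("/dev/nvme0")  # or e.g. /dev/sda for a SATA SSD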
>if you get the SMART data with smartmontools/smartctl, you can inspect Percentage Used.
CrystalDiskInfo[1] can be used for this purpose over on Windows. Some vendor-provided utilities like Samsung Magician will also provide this data with appropriate drives.
My SSDs show SMART attributes, which can be used as a rough indicator of health, but really the only strategy I've found to work well for my peace of mind is to use redundancy.
Concretely, I use ZFS with two SSDs in a mirrored zpool. When one dies, even if it's sudden, I can just swap it out for another one and that's it.
My vulnerability window starts when the first SSD fails and closes when the mirror is rebuilt. If something bad happens to the other SSD during that time, I'm toast and I have to start restoring from backup.
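If it helps, here's a minimal sketch of watching for that window (assumes the standard zpool CLI; "tank" is a placeholder pool name and the alert is just a print):

    # Rough sketch: flag the vulnerability window by checking zpool health.
    # "tank" is a placeholder pool name; swap the print for mail/webhook/etc.
    import subprocess

    def pool_health(pool):
        out = subprocess.run(["zpool", "list", "-H", "-o", "health", pool],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()  # e.g. ONLINE, DEGRADED, FAULTED

    state = pool_health("tank")
    if state != "ONLINE":
        # One side of the mirror is gone (or worse): the window is open.
        # Replace the disk (zpool replace tank <old> <new>) and let it resilver.
        print(f"ALERT: pool 'tank' is {state}")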
In my experience with enterprise SSDs (which yeah aren't the same, but that's what I have to offer), SSDs with sequential serial numbers and identical on-times, in the same RAID array, can have wildly different actual endurance. Some storage servers I used to admin had SSDs that outlasted two successive replacements of a neighboring drive from the same original box, and that happened at least twice.
I stopped worrying about matched on-times for SSDs after that. HDD failures are still quite correlated (on the order of months), but if you're building the server you have to put the disks in it at some point.
One surprising feature of "enterprise" class drives is that they often fail faster than their consumer-class counterparts. The idea is that an enterprise-class drive will usually sit in a drive array, and the array is much happier with a hard, fast, final failure than with a drive that tries to limp along. The consumer with their single drive, on the other hand, is a lot happier when their failing drive gives them every chance to get the data off.
Oftentimes you'll find that consumer drives that are limping along do fail SMART checks and would absolutely flash a red light and send out a monitoring alert were they in an enterprise enclosure. While there is probably truth to what you're saying, I think enterprises are also just way more proactive about testing drives and calling them dead the instant the SMART checks fail, whereas consumers don't typically use CrystalDiskInfo.
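For what it's worth, that basic "would this flash a red light" check is easy to script on the consumer side too; a rough sketch using smartctl's overall health self-assessment (assumes smartmontools 7.0+; the device path is just an example):

    # Rough sketch: overall SMART health self-assessment for one drive.
    import json
    import subprocess

    def smart_passed(device):
        out = subprocess.run(["smartctl", "--json", "-H", device],
                             capture_output=True, text=True, check=False)
        return json.loads(out.stdout).get("smart_status", {}).get("passed", False)

    if not smart_passed("/dev/sda"):
        print("ALERT: drive is failing its SMART overall-health check")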
Can anyone recommend utilities that monitor and warn before SSD failures?