For NVMe, if you get the SMART data with smartmontools/smartctl, you can inspect Percentage Used.
"Percentage Used: Contains a vendor specific estimate of the percentage of life used for the Endurance Group based on the actual usage and the manufacturer’s prediction of NVM life. A value of 100 indicates that the estimated endurance of the NVM in the Endurance Group has been consumed, but may not indicate an NVM failure. The value is allowed to exceed 100."
For SATA/SAS SSDs, there is the "Media_Wearout_Indicator" attribute, which hasn't been a particularly reliable indicator in my experience.
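If you want to script this, here's a rough sketch of pulling both values out of smartctl's JSON output (assumes smartmontools 7.0+ for the --json flag; the device path is just an example):

    # Rough sketch: read SSD wear indicators from smartctl's JSON output.
    # Assumes smartmontools 7.0+ and enough privileges to query the device.
    import json
    import subprocess

    def read_wear(device):
        # check=False because smartctl encodes status bits in its exit code
        out = subprocess.run(["smartctl", "--json", "-a", device],
                             capture_output=True, text=True, check=False)
        data = json.loads(out.stdout)

        # NVMe: the health log reports Percentage Used directly.
        nvme_log = data.get("nvme_smart_health_information_log")
        if nvme_log is not None:
            print(f"{device}: Percentage Used = {nvme_log['percentage_used']}%")
            return

        # ATA SSDs: look for the Media_Wearout_Indicator attribute
        # (attribute names and IDs vary by vendor).
        for attr in data.get("ata_smart_attributes", {}).get("table", []):
            if attr["name"] == "Media_Wearout_Indicator":
                print(f"{device}: Media_Wearout_Indicator = {attr['value']}")
                return

        print(f"{device}: no recognized wear attribute found")

    read_wear("/dev/nvme0")  # or e.g. /dev/sda for a SATA SSD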
>if you get the SMART data with smartmontools/smartctl, you can inspect Percentage Used.
CrystalDiskInfo[1] can be used for this purpose over on Windows. Some vendor-provided utilities like Samsung Magician will also provide this data with appropriate drives.
My SSDs show SMART attributes, which can be used as a rough indicator of health, but really the only strategy I've found to work well for my peace of mind is to use redundancy.
Concretely, I use ZFS with two SSDs in a mirrored zpool. When one dies, even if it's sudden, I can just swap it out for another one and that's it.
My vulnerability window starts when the first SSD fails and closes when the mirror is rebuilt. If something bad happens to the other SSD during that time, I'm toast and I have to start restoring from backup.
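If it helps, here's a minimal sketch of watching for that window (assumes the standard zpool CLI; "tank" is a placeholder pool name and the alert is just a print):

    # Rough sketch: flag the vulnerability window by checking zpool health.
    # "tank" is a placeholder pool name; swap the print for mail/webhook/etc.
    import subprocess

    def pool_health(pool):
        out = subprocess.run(["zpool", "list", "-H", "-o", "health", pool],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()  # e.g. ONLINE, DEGRADED, FAULTED

    state = pool_health("tank")
    if state != "ONLINE":
        # One side of the mirror is gone (or worse): the window is open.
        # Replace the disk (zpool replace tank <old> <new>) and let it resilver.
        print(f"ALERT: pool 'tank' is {state}")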
In my experience with enterprise SSDs (which yeah aren't the same, but that's what I have to offer), SSDs with sequential serial numbers and identical on-times, in the same RAID array, can have wildly different actual endurance. Some storage servers I used to admin had SSDs that outlasted two successive replacements of a neighboring drive from the same original box, and that happened at least twice.
I stopped worrying about matched on-times for SSDs after that. HDD failures are still quite correlated (on the order of months), but if you're building the server you have to put the disks in it at some point.
One surprising feature of "enterprise" class drives is that they often fail faster than their consumer-class counterparts. The idea is that an enterprise-class drive will usually sit in a drive array, and the array is much happier with a hard, fast, final failure than with a drive that tries to limp along. The consumer with their single drive, on the other hand, is a lot happier when their failing drive gives them every chance to get the data off.
Oftentimes you'll find that consumer drives that are limping along do fail SMART checks and would absolutely flash a red light and send out a monitoring alert were they in an enterprise enclosure. While there is probably truth to what you're saying, I think enterprises are also just way more proactive about testing drives and calling them dead the instant the SMART checks fail, whereas consumers don't typically use CrystalDiskInfo.
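For what it's worth, that basic "would this flash a red light" check is easy to script on the consumer side too; a rough sketch using smartctl's overall health self-assessment (assumes smartmontools 7.0+; the device path is just an example):

    # Rough sketch: overall SMART health self-assessment for one drive.
    import json
    import subprocess

    def smart_passed(device):
        out = subprocess.run(["smartctl", "--json", "-H", device],
                             capture_output=True, text=True, check=False)
        return json.loads(out.stdout).get("smart_status", {}).get("passed", False)

    if not smart_passed("/dev/sda"):
        print("ALERT: drive is failing its SMART overall-health check")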
Can anyone recommend utilities that monitor and warn before SSD failures?