Disk failure rate spike

Hey everyone,

We’ve noticed an increase in a type of disk failure on some of the storage nodes that ultimately has a severe negative impact on storage performance. In particular, we observe that certain models of drives in certain manufacturing date ranges seem to be more prone to failure.

As a result, we’re looking a bit more closely at our logs to keep an eye on how widespread this is, but most of the older storage seems fine; it has tended towards some of the newer storage using both 2Tb and 4Tb drives. The 2Tb drives are the more surprising to us as the model line involved has generally been performing as expected, with many older storage units using the same drives without having these issues.

We are also engaging our vendor to see if this is something that they are seeing elsewhere, and making sure we keep a close eye on our stock of replacements to deal with these failures.