S.M.A.R.T, or not so smart…
A few months ago I did some research on hard disk S.M.A.R.T data. For those not in the loop, it stands for Self-Monitoring, Analysis, and Reporting Technology, and is basically a mechanism modern hard disks use to track their health. Wikipedia has a great write up on the concept, but the short version is that hard disks monitor things such as temperature, bad block relocation actions, unrecoverable read errors and the like.
The disks also track interesting things like power outages, shocks, and spin up times. There’s a whole list of them on the Wikipedia page, but of course, not all metrics are captured by all drives.
My research involved trying to find a method to predict in advance whether a hard disk was healthy enough to survive full disk encryption without breaking. The challenge was that any test to determine this was as likely to cause the drive to fail as encrypting it was. Either way you ended up with a broken hard disk and no user data.
There are two common causes of hard disk failure related to full disk encryption – thermal shock and undiscovered bad blocks.
Thermal problems are pretty easy to track using the S.M.A.R.T system. All the drives I’ve seen track “Temperature Difference from 100”, which gives a cross-manufacturer index of how close the drive is to its design thermal limit. You can get this by querying attribute 190 in the S.M.A.R.T data using either the vendor tools or WMI. Unfortunately there’s no clear indication that thermal problems have any bearing on drive lifetime.
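For the curious, here is a rough Python sketch of what "querying attribute 190" amounts to once you have the raw attribute table in hand. It assumes the commonly seen layout of that table: a 2-byte revision number followed by up to 30 entries of 12 bytes each (ID, 2 flag bytes, current value, worst value, 6 raw bytes, 1 reserved byte). On Windows that buffer is typically what the VendorSpecific property of the MSStorageDriver_ATAPISmartData class in the root\WMI namespace hands back, but fetching it is left out here, and the names parse_smart_table and temperature_from_190 are my own inventions. Not every vendor follows this layout, so treat it as an illustration rather than a reference.

```python
from typing import Dict, NamedTuple, Optional


class SmartAttribute(NamedTuple):
    attr_id: int   # attribute number, e.g. 5 or 190
    flags: int     # status flags
    value: int     # normalized current value
    worst: int     # worst normalized value seen so far
    raw: int       # vendor-specific raw value


def parse_smart_table(buf: bytes) -> Dict[int, SmartAttribute]:
    """Parse a raw S.M.A.R.T attribute table.

    Assumed layout (not guaranteed for every drive): 2-byte revision,
    then up to 30 entries of 12 bytes each.
    """
    attrs: Dict[int, SmartAttribute] = {}
    for slot in range(30):
        offset = 2 + slot * 12
        entry = buf[offset:offset + 12]
        if len(entry) < 12:
            break  # buffer shorter than expected, stop quietly
        attr_id = entry[0]
        if attr_id == 0:
            continue  # unused slot
        attrs[attr_id] = SmartAttribute(
            attr_id=attr_id,
            flags=int.from_bytes(entry[1:3], "little"),
            value=entry[3],
            worst=entry[4],
            raw=int.from_bytes(entry[5:11], "little"),
        )
    return attrs


def temperature_from_190(attrs: Dict[int, SmartAttribute]) -> Optional[int]:
    """Degrees Celsius implied by attribute 190, or None if the drive lacks it.

    Assumes the normalized value really is "100 minus temperature";
    some drives report the temperature in the raw field instead.
    """
    attr = attrs.get(190)
    if attr is None:
        return None
    return 100 - attr.value
```

Feed parse_smart_table the byte buffer from whichever tool you use to read the drive, then pass the result to temperature_from_190. A normalized value of 62, for example, would imply a drive running at 38 degrees.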
Undiscovered bad blocks are a much harder problem to work around. The theory is that because the average hard disk is much bigger than the working data set most users utilize, it’s possible for the drive to have lots of bad sectors that nothing has ever touched, and so nothing has ever discovered.
Normally, you’d never notice a bad sector on a disk – the firmware of the drive itself is responsible for noticing bad conditions and mitigates them by “remapping” the bad location to a reserved part of the drive. Again, Wikipedia has a good write up, but simply:
Normally this remapping would happen over time as more and more of the disk gets used, but of course with full disk encryption, the first thing we do is write to every sector of the drive to encrypt it. (Many people ask why we don’t just encrypt the sectors currently in use. The reason is that it opens the drive up to plaintext attacks, and also that knowing what’s really in use is quite hard. We want to encrypt files which have been deleted as well.)
Because of this, if there are undiscovered bad sectors on the disk they will get remapped as part of the encryption process. If there are too many to be invisibly remapped by the drive itself, it will start reporting them as bad to the OS, and that’s when things start breaking.
Many people have tried doing a Windows chkdsk prior to encrypting the disk, but of course, unless you do the bad block scan it’s pointless. Full disk encryption doesn’t really care how valid the file system is, because the file system sits way above the level we are working at. Of course, a bad block scan will cause the invisible remap of real bad blocks just like encryption would, so it’s prone to cause the same loss of data.
The only thing that seemed to help was to use the S.M.A.R.T counters which tell you how many bad blocks the drive has remapped (or rather, the portion of space left to map new blocks). If that starts changing during encryption, it’s a good indication that the process should be stopped so the user can back up their drive and replace it.
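To make that concrete, here is a minimal Python sketch of the "watch the counter while encrypting" idea. It is not how any particular encryption product does it. It assumes you have some way to read the raw value of attribute 5 (Reallocated Sectors Count) and that the raw value is a plain count of remapped sectors, which is common but vendor-dependent. The function watch_reallocations and the two callables it takes are hypothetical names for this example.

```python
import time
from typing import Callable

REALLOCATED_SECTORS = 5  # S.M.A.R.T attribute ID for the reallocated sectors count


def watch_reallocations(
    read_raw_count: Callable[[], int],
    still_encrypting: Callable[[], bool],
    max_new: int = 0,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the reallocation counter while encryption is running.

    read_raw_count   -- hypothetical callable returning the raw value of
                        attribute 5 for the drive being encrypted
    still_encrypting -- hypothetical callable, True while encryption runs
    max_new          -- how many new remaps to tolerate before giving up

    Returns True if encryption finished without the counter moving by
    more than max_new, False as soon as it does (time to stop and back up).
    """
    baseline = read_raw_count()  # remaps the drive had before we started
    while still_encrypting():
        if read_raw_count() - baseline > max_new:
            return False
        sleep(poll_seconds)
    return True
```

The default of max_new=0 takes the strictest reading of the rule above: any movement in the counter is treated as a reason to stop. Whether a small tolerance is safer in practice is a judgment call I can't settle from the data I had.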
Google performed a study of S.M.A.R.T data vs actual drive failures in 2007 and came to some interesting conclusions:
Work at Google on over 100,000 drives has shown little overall predictive value of S.M.A.R.T. status as a whole, but suggests that certain sub-categories of information which some S.M.A.R.T. implementations track do correlate with actual failure rates – specifically, in the 60 days following the first scan error on a drive, the drive is, on average, 39 times more likely to fail than it would have been had no such error occurred. Also, first errors in re-allocations, offline re-allocations and probational counts are strongly correlated to higher probabilities of failure.
They also found that, of the drives that failed in their population of over 100,000, 56% had no worrisome indicators in the core S.M.A.R.T data set, and even adding in the entire S.M.A.R.T data population, 36% still failed without any indication whatsoever. The one positive metric Google produced was the one quoted above: once a drive reported its first scan error, it was 39x more likely to fail in the next 60 days than a drive with no such error.
So despite having S.M.A.R.T, and even if you track all the metrics, you’ll still find that at least a third of your drive failures arrive without warning.
You can download a VBS Class to show you SMART data from CTOGoneWild