Why you should not use SMR disks for ZFS

26 Nov 2022
Last update 26 Nov 2022
Reading time 12 mins

TL;DR: SMR disks are evil in any ZFS zpool or under other copy-on-write filesystems such as btrfs. Never ever use them there. If you have some in your pools, replace them as soon as possible; do not wait for a failure or some later moment. They’re OK in other scenarios, such as ufs, ext4, NTFS or FAT partitions and applications that write only on really rare occasions, but not for ZFS, btrfs and other modern filesystems or in RAID setups.

What is shingled magnetic recording?
What’s the problem?
- Some tips during recovery
- The best way to recover?
When are SMR disks perfectly ok?

What is shingled magnetic recording?

So what is SMR anyway? It’s a way - or pattern - in which magnetic hard disks record their data. Currently one sees two categories of hard disks that write data onto their platters in different ways:

Conventional Magnetic Recording (CMR) is the traditional recording method. As of today this usually means perpendicular magnetic recording (PMR), which magnetizes the bits perpendicular to the disk surface (in contrast to the older longitudinal magnetic recording (LMR), which magnetized them in the plane of the disk). These disks write data directly to the sectors requested by the operating system without having to rewrite other sections of the disk. The exceptions are bad sector remapping, which the disk controller usually handles transparently to the host, and offsets added to keep the host away from reserved areas that might be used by firmware. The write head only ever overwrites a single track and never destroys or overwrites other data.
Shingled Magnetic Recording (SMR), on the other hand, writes data on overlapping neighboring tracks. This works because the write head is wider than the read head (or at least can be imagined to be): the tracks are packed more densely than with CMR, so the write head writes onto more than one track, while the read head is still narrow enough to read the tracks independently. When a write destroys a neighboring track, the disk has to rewrite that track as well. So a single write might cause a cascade of rewrites of the neighboring tracks and their neighbors until it reaches an unused area or the edge of the shingle zone, where no further overwriting is necessary. While the disk rewrites, it basically hangs and writes take longer to finish. In addition, the disk needs to know which sectors it actually has to rewrite, so like SSDs these disks depend on periodic TRIM commands sent by the operating system (i.e. SMR disks work really badly with legacy operating systems that do not support TRIM). Write performance degrades the more sectors are in use (or have ever been in use, if the operating system never sends TRIM). Instead of a few milliseconds, writes may take up to a few seconds (which will be the problem in the end) - on some occasions even tens of seconds.
To hide this rewriting, which might take way more time than just writing a single track or sector, these disks usually employ an outer region that uses CMR as a cache. Usually this region is at least 20 GBytes in size. Data is first written into the cache and moved into the SMR region when the host requests it, when the cache is full or, on drive-managed SMR disks, when the disk has some spare time. Until the cache is full, the write performance of an SMR disk appears comparable to that of a CMR disk, so one doesn’t notice anything. Once the cache is full, the disk usually rewrites all data from the cache into the SMR region, which can take tens of seconds during which the disk hangs before finishing the next write. This makes the bandwidth collapse into the kilobytes per second range and can even lead to the disk faulting.

Why do manufacturers use SMR instead of CMR when it’s so bad? As one can imagine, denser tracks mean less head movement in many cases, but most importantly you can fit more data on smaller disks. Sometimes you can even drop one platter from a drive, which reduces costs and, given the volume of disks manufactured, leads to huge profit gains. Thus you can sell a disk with the same capacity for a lower price or at a larger profit. Because of this, many manufacturers didn’t even tell customers which disks used SMR. This has improved over the last few months: manufacturers started to release lists of SMR disks after realizing that they cause failures in important setups, which didn’t make users really happy (it didn’t help that users saw great performance for the first period after buying the disk and then, surprisingly, performance collapsed and even major data loss happened).

What’s the problem?

Basically the long latency whenever the disk has to move data from the cache into the SMR region: this causes many rewrites and thus latencies of up to tens of seconds. When you start using SMR disks everything seems to work fine, and with usual workloads, sequential access or mainly-read workloads everything appears perfect. Writing in bursts and then pausing for a longer duration also seems fine, since all data lands in the CMR cache and is transferred into the shingle zones during idle time. With a modern copy-on-write filesystem everything looks great too - until some disk in the pool fails. Then a resilver process starts. Resilvering a modern filesystem causes many random writes while replaying the journals and rewriting nearly all sections of the disk. This rapidly fills up the write cache in the CMR region and then triggers rewrites that take many seconds and, if you’re unlucky, up to tens of seconds. When a disk gets too slow, the RAID logic of hardware RAID controllers and the pool management of ZFS (and other volume managers, of course) mark it as dead, since this is one of the behaviors a failing disk shows. Thus they eject the very disk you want to resilver onto, and as long as there is enough redundancy left they do not report any error. The real problem arises when there are multiple SMR disks in the system, which are still written to as usual during the resilver (and resilvering will take really long, or never finish because of periodic faults). Then the next disk might get kicked out while the process continues, and so on. This leads to an unavailable pool that falls back into this condition reboot after reboot.
And if you have bad luck and also have some failing sectors or a faulty disk, this might even lead to total data loss, even with modern filesystems, which are usually surprisingly robust - there are some tricks one can try, though.

And here’s the catch: You won’t notice this problem until you experience major disk problems. You may not even notice the bandwidth reduction that shows up first, when the disks get nearly full for the first time: the disk then has to rewrite more neighboring sectors and is less likely to stop at an empty sector while writing into the SMR region. And once the problems are there, a rebooting machine may hang for many hours while mounting the filesystem and replaying the most necessary parts of the intent log, and resilvering or scrubbing the whole pool may take weeks to months. Even mounting the pools again may take days (the longest I’ve ever seen was 9 days to reboot a production machine with a pool of just 12 TByte, just for mounting; the resilver dropped to less than 4 KByte/sec after about 20% and would thus have taken roughly 82 years to finish due to the speed decrease on the SMR drive).
If it’s urgent, you just have to buy CMR disks as replacements, mirror the old ones sector-wise and throw out the SMR disks all at once. This works reasonably well and comparably fast (a few days for ordering and cloning the disks) as long as there are no defective sectors on the disks, since one can then clone them in huge batches, which exploits the fast sequential reads and writes of disks. With defective sectors it gets a little more tricky: one has to copy them sector by sector, which is painfully slow (let’s say around 3-4 days for a 4 TByte disk as of today - and one has to do this with every SMR disk in one’s pool). Or you wait until the machine is up again and replace the disks the old-fashioned way during runtime, which basically works as long as the SMR disks are not written to - so if any defect occurs, one cannot do this in a reasonable way.
All in all: as soon as problems start you will have days to weeks of downtime on those machines, which is something you wanted to avoid in the first place. After all, there is a reason to use RAID-like solutions (and that’s not to replace backups but to keep the systems available and up even in case of hardware failure).
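
To get a feeling for numbers like the 82 years mentioned above, one can do the arithmetic directly. The following is a minimal sketch, not a measurement: it assumes that the whole 12 TByte have to be resilvered, that 80% of it is still outstanding and that the rate stays constant. The function name estimate_years is made up for this example.

```sh
#!/bin/sh
# estimate_years BYTES_LEFT BYTES_PER_SEC
# Prints the remaining transfer time in years (one decimal place),
# assuming a constant rate and a year of 365.25 days.
estimate_years() {
    awk -v left="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", left / rate / (365.25 * 24 * 3600) }'
}

# Example call (assumed inputs: 80% of 12 TByte left, 4000 bytes/sec):
#   estimate_years 9600000000000 4000
```

With these rounded inputs the result is roughly 76 years; a rate of about 3700 bytes per second (the text says "less than 4 KByte/sec") lands at roughly 82 years. Either way the order of magnitude is decades, not days.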

So short story even shorter: If you have SMR disks in your pools, replace them. If you’re building pools, don’t use SMR disks. Check whether your disks are SMR before you buy, or now if you don’t know. And if a manufacturer does not explicitly state which recording technique a disk uses (note that this might even differ between capacities within the same disk family of the same manufacturer), assume it’s SMR and don’t buy it.
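
Checking starts with knowing exactly which models are in the machine. The following sketch only collects device name, model string and serial number so that they can be compared by hand against the manufacturers' SMR lists; it does not detect SMR by itself. It assumes FreeBSD and the usual field layout of `geom disk list` (a "Geom name:" line followed by "descr:" and "ident:" lines); the function name list_disk_models is made up for this example.

```sh
#!/bin/sh
# list_disk_models: reads `geom disk list` output on stdin and prints
# one line per disk: device name, model string, serial (tab separated).
list_disk_models() {
    awk '
        /^Geom name:/ { name = $3 }
        /^ *descr:/   { sub(/^ *descr: */, ""); descr = $0 }
        /^ *ident:/   { sub(/^ *ident: */, ""); print name "\t" descr "\t" $0 }
    '
}

# Usage on a FreeBSD machine:
#   geom disk list | list_disk_models
```

The same list is handy later when matching serial numbers to device names before cloning disks.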

Some tips during recovery

So is there something one can do during recovery (resilvering) when one is unable to get rid of the SMR disks, or while doing the replacement? Indeed - one can increase the disk timeout threshold at which the operating system decides a disk is dead, provided one uses a JBOD and not a hardware RAID controller.

For FreeBSD for example one would set:

kern.cam.ada.default_timeout for ATA direct access devices
kern.cam.da.default_timeout for SCSI direct access devices
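
A minimal sketch of how raising both values for the duration of a recovery might look. The 300 seconds used as the default here are just an example value, and the function name set_cam_timeouts and the DRY_RUN switch are made up for illustration.

```sh
#!/bin/sh
# set_cam_timeouts [SECONDS]
# Sets both CAM default timeouts (FreeBSD) to the given number of seconds.
# With DRY_RUN=1 the sysctl commands are only printed, not executed.
set_cam_timeouts() {
    timeout="${1:-300}"   # example value: 300 seconds = 5 minutes
    for oid in kern.cam.ada.default_timeout kern.cam.da.default_timeout; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "sysctl ${oid}=${timeout}"
        else
            sysctl "${oid}=${timeout}"
        fi
    done
}

# Usage (as root; note down the current values first to restore them later):
#   sysctl kern.cam.ada.default_timeout kern.cam.da.default_timeout
#   set_cam_timeouts 300
```

Values set this way are lost on reboot, which is usually what one wants for a temporary recovery setting.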

Usually these timeouts are set to values like 30 seconds for ada and 60 seconds for da - way too short for a full SMR disk. It’s a good idea to increase the timeout values during recoveries to values as high as 5 minutes (300 seconds) to prevent the disks from being ejected. Then one can run a long scrub, resilver and replace procedure. This will take time (depending on the disk up to the range of 80 to 100 years - unfortunately not exaggerating), but it’s still one of the fastest ways to recover from SMR disks, usually within just a few weeks up to a few months. No other device is allowed to fail during this time, though, and performance will be crippled. One can usually only take this route to replace the SMR disks in a useful way as long as no device has shown the performance-degrading effects yet. But as soon as resilvering

The best way to recover?

There is only one really good way to recover once the problem has already shown up: buy a CMR disk of equal or larger size (sector-wise) for each SMR disk in your storage pool. Take your systems offline and clone the disks one by one. This is pretty simple:

Boot the system from any other medium without importing the pool.
Figure out the serial numbers of all disks and their matching device names.
Figure out the device names of all replacement disks.
Use a command like dd if=/dev/adaXX of=/dev/adaYY bs=1G for each and every disk in succession. For a typical 3 TB disk this will take 7 to 12 hours per disk. If you have multiple disks and are able to attach them all to the machine, launch multiple instances of dd - you are most likely limited by the disks and not the controllers during replacement.
Remove the SMR disks and re-import your pools. Everything should work smoothly from now on.
Get rid of the SMR disks (shredder, sell on eBay for the next person to make the same mistake, or use for slow archiving without a copy-on-write filesystem and without RAID, etc.).
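
The "launch multiple instances of dd" step above can be sketched as a small shell function. This is only an illustration under assumptions: the source:target pairs are placeholders you have to fill in yourself, the function name clone_all and the CLONE_BS variable are made up for this example, and the 1G default block size is simply the one from the dd command above. dd overwrites the target without asking, so double-check every pair before running it.

```sh
#!/bin/sh
# clone_all SRC:DST [SRC:DST ...]
# Starts one dd per pair in the background and waits for all of them.
# Returns non-zero if at least one dd failed.
clone_all() {
    pids=""
    for pair in "$@"; do
        # ${pair%%:*} is the part before the colon, ${pair#*:} the part after it
        dd if="${pair%%:*}" of="${pair#*:}" bs="${CLONE_BS:-1G}" &
        pids="$pids $!"
    done
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1
    done
    return "$rc"
}

# Usage (placeholder device names, SMR source first, CMR target second):
#   clone_all /dev/adaXX:/dev/adaYY /dev/adaZZ:/dev/adaWW
```

Waiting on each process individually makes a failed copy visible in the return code instead of silently ending up with a half-cloned disk.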

When are SMR disks perfectly ok?

Personally I’d say: Never. Avoid them.

But basically: whenever not many writes are performed in close succession, latency is not a problem and the disk is not used inside a disk array. So when one archives really slowly produced data on some disks, or runs only a single disk in one’s desktop workstation or notebook where one just writes small amounts of data, and one neither runs a robust copy-on-write filesystem nor wants to send larger bursts of data (like a full disk backup), SMR disks are a perfectly valid choice. So for many home uses and some non-critical office uses this won’t be a problem. But for server applications, reliable workstations, storage applications or as a backup target they’re usually a really bad idea.