Freezing FreeBSD systems - NCQ on SSDs - a short troubleshooting journey

08 Nov 2024 - tsp
Last update 12 Nov 2024
Reading time 5 mins

The problem and a first attempt to solve it
The cause of the problem: FLUSHCACHE and NCQ
An rc.d script to disable NCQ
Conclusion

The problem and a first attempt to solve it

Since I prefer FreeBSD as my main operating system on servers as well as on desktops, I’ve been running it on many different hardware configurations, and usually everything works smoothly (with far fewer problems than on other operating systems). It’s robust, easy to configure, consistent and efficient. Lately, though, a few newly built systems made from low-cost components started to freeze at random. First some applications stopped responding while the rest of the system was still usable, and at some point the machine froze completely. At first glance the freezes seemed related to browsers (like Firefox or Chromium): they never happened while no browser was running but could be reproduced reliably during browser usage, even on systems with a lot of memory, though the freezing frequency seemed to correlate with the available RAM. Since I also use FreeBSD at work and the hangs on my workstation there got a little too annoying, I decided to investigate further. The logs did not show anything after a reboot. Since the GUI usually froze completely, I logged in via ssh from another machine and displayed the contents of /var/log/messages in real time using tail -f:

sudo tail -f /var/log/messages

This also works when data is no longer flushed to disk but only kept in RAM, so one sees kernel messages as long as the filesystem driver itself still works. Across all freezes this finally and consistently showed timeouts and aborted FLUSHCACHE48 commands on my very cheap SSDs:

kernel: (ada0:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): CAM status: ATA Status Error
kernel: (ada0:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
kernel: (ada0:ahcich1:0:0:0): RES: 51 04 00 00 00 40 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): Retrying command, 0 more tries remain
kernel: (ada0:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): CAM status: ATA Status Error
kernel: (ada0:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
kernel: (ada0:ahcich1:0:0:0): RES: 51 04 00 00 00 40 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): Error 5, Retries exhausted
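To keep only the relevant lines in view while waiting for a freeze, the log stream can be filtered. This is a minimal sketch: the function name cam_errors is hypothetical and the pattern is an assumption derived from the messages shown above, so it may need adjusting for other controllers or drivers.

```sh
# Keep only CAM/ATA error lines from a log stream read on stdin.
# The pattern is an assumption based on the kernel messages shown above.
cam_errors() {
    grep -E 'FLUSHCACHE|CAM status|ATA status|Retries exhausted'
}

# Usage (runs until interrupted):
#   sudo tail -f /var/log/messages | cam_errors
```

Running this in the ssh session from the second machine keeps unrelated daemon messages out of the way.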

This seemed to indicate problems with the SSD, though inspecting the SMART parameters yielded no further insight. The behavior was also consistent across a small set of machines. I suspected the SSDs might simply be incredibly slow, as one knows from SMR harddisks (which you should not use for ZFS anyway). To mitigate this I increased the default timeout for the device, first to 60 seconds and in a last attempt even to 300 seconds:

kern.cam.ada.default_timeout=60

Unfortunately this did not resolve the problem; it just took longer until the system froze completely.
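For reference, a sketch of how such a timeout value can be inspected and applied. It assumes the sysctl is writable at runtime on the FreeBSD version in use; /etc/sysctl.conf as the place to persist it is likewise an assumption, not necessarily where the setting was placed originally.

```sh
# Show the current default command timeout (in seconds) for ada(4) disks
sysctl kern.cam.ada.default_timeout

# Change it at runtime (as root)
sysctl kern.cam.ada.default_timeout=60

# To keep the value across reboots, the same line can be added
# to /etc/sysctl.conf:
#   kern.cam.ada.default_timeout=60
```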

The cause of the problem: FLUSHCACHE and NCQ

In the end it turned out to be a problem with flushing the cache buffers in conjunction with native command queuing (NCQ). On many devices like cheap SSDs and SMR (Shingled Magnetic Recording) harddisks, NCQ can simply lead to multiple flush commands being enqueued that then exceed the timeout. That would be solvable by increasing the timeout even further. Unfortunately my devices still caused the same trouble, which may for example be caused by firmware bugs. On my particular systems and SSDs the problem could be solved by disabling NCQ, with all its consequences such as reduced throughput:

camcontrol negotiate ada0 -T disable

This disables tagged queuing (native command queuing) on the disk and immediately solved the freezing issues.
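To check what the disk is currently negotiated to, camcontrol negotiate can be called without any change flags, in which case it only prints the current settings. The helper below is a small sketch; the name ncq_status is hypothetical, and the exact output format depends on the FreeBSD version and controller.

```sh
# Print the current negotiation settings for each disk passed as argument.
# Without change flags, "camcontrol negotiate" only displays the settings.
ncq_status() {
    for dsk in "$@"; do
        camcontrol negotiate "$dsk"
    done
}

# Usage (as root):
#   ncq_status ada0 ada1
```

Whether a device supports NCQ at all is listed in the feature table printed by camcontrol identify ada0.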

An `rc.d` script to disable NCQ

Since this change does not persist across reboots, I wrote a small, hacky rc.d script in /usr/local/etc/rc.d/disablencq:

#!/bin/sh

# PROVIDE: disablencq
# REQUIRE: NETWORKING SERVERS

# Disable tagged queuing on listed disks
#
# disablencq_enable="YES"
#    Execute this script
# disablencq_disks="ada0 ada1 ada2"
#    List the disks to disable NCQ on

. /etc/rc.subr

name="disablencq"
rcvar=disablencq_enable
desc="Disable NCQ on some disks"
start_cmd="disablencq_start"

disablencq_start()
{
	# The list is intentionally unquoted so it is split on whitespace
	for dsk in ${disablencq_disks}; do
		camcontrol negotiate "${dsk}" -T disable
	done
}

load_rc_config $name
: ${disablencq_enable:="NO"}
: ${disablencq_disks:="ada0"}
run_rc_command "$1"

Now I was able to configure the problematic SSDs in /etc/rc.conf:

disablencq_enable="YES"
disablencq_disks="ada0 ada1"
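Before enabling the service it can be useful to preview which commands the disk list would produce. The following sketch repeats the loop from the script with an optional command prefix; the function name disable_ncq and the variable NCQ_RUN are hypothetical and not part of the script above.

```sh
# Same loop as in the script above, with an optional command prefix.
# NCQ_RUN is a hypothetical variable: set it to "echo" for a dry run.
disable_ncq() {
    for dsk in "$@"; do
        ${NCQ_RUN:-} camcontrol negotiate "${dsk}" -T disable
    done
}

# Dry run: prints one camcontrol command per disk and changes nothing
NCQ_RUN=echo disable_ncq ada0 ada1
```

Once the list looks right, the script has to be executable (for example chmod 555 /usr/local/etc/rc.d/disablencq) and can then be run without a reboot via service disablencq start.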

This is of course a hack, but it seems to work around problems in the firmware of some SSDs. Usually it’s a good idea to stay away from such SSDs, just as one should stay away from SMR harddisks.

Conclusion

If a system hangs arbitrarily and there seems to be no other cause, it may help to disable NCQ on SATA disks.
One has to be cautious when choosing SSDs or harddisks and should avoid SSDs with problematic firmware the same way as one should avoid SMR harddisks.
Disabling NCQ allows one to continue using cheaper SSDs (note: affiliate link, this page’s author profits from qualified purchases) in a stable fashion.
