Freezing FreeBSD systems - NCQ on SSDs - a short troubleshooting journey
08 Nov 2024 - tsp
Last update 12 Nov 2024
The problem and a first attempt to solve it
I prefer FreeBSD as my main operating system on servers as well as on desktop systems, so
I've been running it on many different hardware configurations. Usually everything works
smoothly, and with far fewer problems than with other operating systems. It's robust, easy
to configure, consistent and efficient. Lately, though, a few systems newly built out of low
cost components started to freeze randomly. First some applications stopped working while the
rest of the system was still usable, and at some point the machine froze completely. At first
glance the freezing seemed to be related to browser usage (Firefox or Chromium): it never
happened when no browser was running but could be reproduced reliably during browser usage,
even on systems with a lot of memory, though the freezing frequency seemed to correlate with
the available RAM. Since I also use FreeBSD at work and the hangs on my workstation there
got a little too annoying, I decided to investigate this further. After a reboot the log did
not show anything. Since the GUI usually froze totally, I logged in via ssh from another
machine and displayed the contents of /var/log/messages in real time using tail -f:
sudo tail -f /var/log/messages
This also works when data is not flushed to disk but only kept in RAM, so one will
see kernel messages as long as the filesystem driver itself works. Across all freezes this
finally and consistently showed timeouts on my very cheap SSDs:
kernel: (ada0:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): CAM status: ATA Status Error
kernel: (ada0:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
kernel: (ada0:ahcich1:0:0:0): RES: 51 04 00 00 00 40 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): Retrying command, 0 more tries remain
kernel: (ada0:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): CAM status: ATA Status Error
kernel: (ada0:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
kernel: (ada0:ahcich1:0:0:0): RES: 51 04 00 00 00 40 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): Error 5, Retries exhausted
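To check whether a machine shows the same pattern, the log can be scanned for exhausted retries per device. The following is only a minimal sketch: the function name count_cam_failures is made up for illustration, and it assumes that the messages look exactly like the excerpt above.

```sh
# Count CAM "Retries exhausted" errors per device in a log file.
# Prints one "<device>: <count>" line per affected device.
count_cam_failures() {
    logfile="${1:-/var/log/messages}"
    # Extract the device name (e.g. ada0) from lines such as
    # "(ada0:ahcich1:0:0:0): Error 5, Retries exhausted"
    sed -n 's/.*(\([a-z]*[0-9]*\):[^)]*): Error [0-9]*, Retries exhausted.*/\1/p' "$logfile" \
        | sort | uniq -c | awk '{ print $2 ": " $1 }'
}
```

Called without an argument it reads /var/log/messages; a rotated or copied log file can be passed as the first argument instead.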
This seemed to indicate problems with the SSD, though inspecting the SMART parameters yielded
no further insight. The behavior was also consistent over a small set of machines. Just in
case the SSDs were incredibly slow, as one knows from SMR harddisks (which you should not use
for ZFS anyway), I tried to mitigate the problem by increasing the default timeout for the
device, first to 60 seconds and in a last try even to 300 seconds.
kern.cam.ada.default_timeout=60
Unfortunately this did not resolve the problem - it just took longer till the system totally
froze.
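For reference, a sketch of how such a timeout could be applied on a running system. It assumes that kern.cam.ada.default_timeout can be written at runtime via sysctl(8); the helper name set_ada_timeout is hypothetical and not part of FreeBSD.

```sh
# Set the default ada(4) command timeout (in seconds) at runtime.
# Has to be run as root; the value is lost on reboot unless it is
# also made persistent.
set_ada_timeout() {
    secs="$1"
    # Reject empty or non-numeric input and a plain 0
    case "$secs" in
        ''|*[!0-9]*|0)
            echo "usage: set_ada_timeout <seconds>" >&2
            return 1
            ;;
    esac
    sysctl kern.cam.ada.default_timeout="$secs"
}
```

For example, set_ada_timeout 60 corresponds to the setting shown above. To keep the value across reboots, the same kern.cam.ada.default_timeout=60 line would presumably go into /etc/sysctl.conf.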
The cause of the problem: FLUSHCACHE and NCQ
In the end it turned out to be a problem with flushing the cache buffers in conjunction
with native command queuing. On many devices like cheap SSDs and SMR (Shingled Magnetic Recording)
harddisks, NCQ (native command queuing) can simply lead to multiple flush commands being
enqueued that then exceed the timeout - this would be solvable by increasing the timeout
even further. Unfortunately my devices still caused the same trouble, which may be caused for
example by firmware bugs. On my particular systems and my particular SSDs the problem could be
solved by disabling NCQ - of course with all consequences like reduced throughput:
camcontrol negotiate ada0 -T disable
This disabled tagged queuing (native command queuing) on the disk and immediately solved the
freezing issues.
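Before and after such a change it can be useful to look at what the disk and CAM report. The following sketch uses the identify, negotiate and tags subcommands of camcontrol(8). The function name show_ncq_state is made up, and the exact output differs between drives and FreeBSD versions, so none is shown here.

```sh
# Print what the disk reports about NCQ and what CAM currently uses.
show_ncq_state() {
    dsk="${1:-ada0}"
    # Capability as reported by the drive itself (IDENTIFY data)
    camcontrol identify "$dsk" | grep -i 'queuing'
    # Current negotiation settings (no option given: only display them)
    camcontrol negotiate "$dsk"
    # Number of tagged openings CAM currently allows for the device
    camcontrol tags "$dsk" -v
}
```

Note that the identify output describes what the drive supports, not whether queuing is currently in use, so it is not expected to change after NCQ has been disabled.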
An rc.d script to disable NCQ
Since this change does not persist across reboots, I decided to write a small hacked rc.d
script in /usr/local/etc/rc.d/disablencq:
#!/bin/sh

# PROVIDE: disablencq
# REQUIRE: NETWORKING SERVERS

# Disable tagged queuing on listed disks
#
# disablencq_enable="YES"
#     Execute this script
# disablencq_disks="ada0 ada1 ada2"
#     List the disks to disable NCQ on

. /etc/rc.subr

name="disablencq"
rcvar=disablencq_enable
desc="Disable NCQ on some disks"
start_cmd="disablencq_start"

disablencq_start()
{
    for dsk in ${disablencq_disks}; do
        camcontrol negotiate ${dsk} -T disable
    done
}

load_rc_config $name
: ${disablencq_enable:="NO"}
: ${disablencq_disks:="ada0"}

run_rc_command "$1"
Now I was able to configure the problematic SSDs in /etc/rc.conf:
disablencq_enable="YES"
disablencq_disks="ada0 ada1"
This of course is a hack - but it seems to circumvent problems with the
native firmware of some SSDs. Usually it's a good idea to stay away from
such SSDs, the same way as one should stay away from SMR harddisks.
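The loop in disablencq_start calls camcontrol for every configured name, even if a disk has been removed or renumbered in the meantime. A slightly more defensive variant could skip missing devices. This is only a sketch and not part of the original script: the function name disablencq_checked and the DEVDIR override are hypothetical.

```sh
# Variant of the loop from disablencq_start that skips missing disks.
# DEVDIR is a hypothetical override that only exists to make the
# function testable without real devices.
disablencq_checked() {
    devdir="${DEVDIR:-/dev}"
    for dsk in "$@"; do
        if [ -e "${devdir}/${dsk}" ]; then
            camcontrol negotiate "${dsk}" -T disable
        else
            echo "disablencq: ${dsk} not found, skipping" >&2
        fi
    done
}
```

Inside the rc.d script this could be called as disablencq_checked ${disablencq_disks}, left unquoted so that the list is split into one argument per disk.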
Conclusion
- If a system hangs arbitrarily and there seems to be no other cause, it
may help to disable NCQ on SATA disks.
- One has to be cautious when choosing SSDs or harddisks and avoid
SSDs with problematic firmware, the same way as one should avoid SMR
harddisks.
- Disabling NCQ allows one to continue using cheaper SSDs
(note: affiliate link, this page's author profits from qualified purchases) in
a stable fashion.