19 Oct 2025 - tsp
Last update 19 Oct 2025
3 mins
TL;DR
On a box with two NVIDIA GeForce RTX 3060 (12 GB) GPUs, Ollama auto-updated to 0.12.4 and model loads silently stopped working; upgrading to 0.12.5 did not solve the problem. Downgrading to Ollama 0.12.3 fixed it immediately. An NVIDIA driver upgrade from 572.82 to 581.29 did not change the behavior (other CUDA apps were fine).
When Ollama auto-updated on my small dual RTX 3060 headless setup, model loading suddenly stopped working - no errors, just silence and hanging clients. The GPUs were detected, yet nothing generated and tensor loading failed from one day to the next. Upgrading drivers and even moving from version 0.12.4 to 0.12.5 did not help, while other CUDA applications ran perfectly.
After a few hours of debugging, the fix turned out to be simple: rolling back to Ollama 0.12.3 instantly restored normal behavior. If you are seeing lines like llama_model_load: vocab only - skipping tensors and key with type not found ... general.alignment, this post walks through what happened and how to get your models running again without tearing your hair out.
Server was up and GPUs detected:
Listening on [::]:8182 (version 0.12.5)
inference compute ... CUDA0 / CUDA1 ... NVIDIA GeForce RTX 3060 ... total="12.0 GiB"
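A quick way to confirm which version is actually answering (note the non-default port 8182 from my setup; a stock install listens on 11434):
curl http://localhost:8182/api/version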
When a client hit /api/show (a metadata probe), the logs looked normal but gave the impression models don’t load:
POST "/api/show"
llama_model_load: vocab only - skipping tensors
This is expected for /api/show, but in my case real generations never kicked in either.
Extra noise that wasn’t fatal but confused debugging:
key with type not found key=general.alignment default=32
and many lines like:
load: control token: 128... '<|reserved_special_token_...|>' is not marked as EOG
I tried a driver update from 572.82 to 581.29. This yielded no change (at least OpenCL and CUDA in other apps remained OK).
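To quickly rule out the driver side, nvidia-smi should still list both cards together with the new driver version - in my case it did, which pointed the blame at Ollama itself:
nvidia-smi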
I didn’t dig deeper into whether it’s scheduling, multi-GPU runner selection, or a llama.cpp interface edge case in 0.12.4/0.12.5 - the point here is the quick fix.
Downgrade Ollama to 0.12.3. The models load and generate again.
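On Linux the official install script supports pinning a version via the OLLAMA_VERSION environment variable (on Windows, where the auto-updater lives, fetch the 0.12.3 installer from the GitHub releases page instead):
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.12.3 sh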
Hitting /api/show only opens the GGUF header and tokenizer. It prints:
llama_model_load: vocab only - skipping tensors
This is expected behaviour. You need /api/generate or /api/chat to actually allocate and load tensors into VRAM.
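So when checking whether a build is actually healthy, probe with a chat or generate call instead of a show call (model name again a placeholder):
curl http://localhost:8182/api/chat -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "ping"}], "stream": false}'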
The general.alignment default=32 line is a benign warning from newer llama.cpp when an optional GGUF key is absent.
If your dual-GPU RTX 3060 rig suddenly stops loading models after Ollama auto-updates and you see lines like:
llama_model_load: vocab only - skipping tensors
key with type not found key=general.alignment default=32
try rolling back to Ollama 0.12.3 first. That immediately restored normal model loading and inference for me, and it may solve the immediate headache while you work out how to upgrade without breaking everything.