Ollama models not loading after update (dual RTX 3060)

19 Oct 2025 - tsp
Last update 19 Oct 2025
Reading time 3 mins

TL;DR

On a box with two NVIDIA GeForce RTX 3060 (12 GB) GPUs, Ollama auto-updated to 0.12.4 and model loads silently stopped working; upgrading to 0.12.5 did not solve the problem. Downgrading to Ollama 0.12.3 fixed it immediately. An NVIDIA driver upgrade from 572.82 to 581.29 did not change the behavior (other CUDA applications were fine).

When Ollama auto-updated on my small dual RTX 3060 headless setup, model loading suddenly stopped working - no errors, just silence and hanging clients. The GPUs were detected, yet nothing generated and tensor loading failed from one day to the next. Upgrading drivers and even moving from version 0.12.4 to 0.12.5 did not help, while other CUDA applications ran perfectly.

After a few hours of debugging, the fix turned out to be simple: rolling back to Ollama 0.12.3 instantly restored normal behavior. If you are seeing lines like llama_model_load: vocab only - skipping tensors and key with type not found ... general.alignment, this post walks through what happened and how to get your models running again without tearing your hair out.

Setup

Two NVIDIA GeForce RTX 3060 cards with 12 GB of VRAM each in a small headless box, with Ollama serving its HTTP API on port 8182 and NVIDIA driver 572.82 (later upgraded to 581.29). Ollama had auto-updated itself to 0.12.4 and was subsequently moved to 0.12.5.

Symptoms and what I saw in the logs

Server was up and GPUs detected:

Listening on [::]:8182 (version 0.12.5)
inference compute ... CUDA0 / CUDA1 ... NVIDIA GeForce RTX 3060 ... total="12.0 GiB"

When a client hit /api/show (a metadata probe), the logs looked normal but gave the impression that models were not loading:

POST "/api/show"
llama_model_load: vocab only - skipping tensors

This is expected for /api/show, but in my case real generations never kicked in either.

Extra noise that wasn’t fatal but confused debugging:

key with type not found key=general.alignment default=32

and many lines like:

load: control token: 128... '<|reserved_special_token_...|>' is not marked as EOG

What I changed (that did not fix it)

I tried a driver update from 572.82 to 581.29. This yielded no change; at least OpenCL and CUDA in other applications remained fine.
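
If you want to rule out the driver and the cards themselves, a quick check independent of Ollama is enough. This is only a sketch (it is not from my original debugging notes) and assumes nvidia-smi is on the PATH, as it is with a standard driver install:

# Sanity check: both GPUs visible and running the expected driver version,
# completely independent of Ollama.
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,driver_version,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
# Expected: two lines, one per RTX 3060, each reporting the installed driver.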

Root cause (practical)

I didn’t dig deeper into whether it’s scheduling, multi-GPU runner selection, or a llama.cpp interface edge case in 0.12.4/0.12.5 - the point here is the quick fix.

Fix

Downgrade Ollama to 0.12.3. The models load and generate again.
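
To confirm the rollback actually took effect (the auto-updater can silently bump the version again), ask the running server itself. A minimal sketch, assuming the non-default port 8182 shown in the logs above; /api/version is part of the regular Ollama HTTP API:

# Ask the running Ollama server which build it is - useful after a downgrade,
# since the auto-updater may silently replace the binary again.
# Assumes the server listens on port 8182 as in the logs above (default is 11434).
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8182/api/version") as r:
    print(json.load(r)["version"])   # should print "0.12.3" after the rollback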

Notes that might help your debugging

Hitting /api/show only opens the GGUF header and tokenizer. It prints:

llama_model_load: vocab only - skipping tensors

This is expected behaviour. You need /api/generate or /api/chat to actually allocate and load tensors into VRAM.
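
A quick way to tell the two cases apart is to force a real load with a tiny non-streaming generation. Again only a sketch: it assumes the port 8182 from the logs above, and "llama3.1" is a placeholder you have to replace with a model that is actually pulled on your box:

# Force a real model load: unlike /api/show, /api/generate allocates tensors
# in VRAM before answering. If this hangs while /api/show responds fine, you
# are seeing the same behaviour described in this post.
# Assumptions: server on port 8182 (see logs above); replace "llama3.1" with
# a model that is actually installed on your machine.
import json
import urllib.request

req = urllib.request.Request(
    "http://127.0.0.1:8182/api/generate",
    data=json.dumps({"model": "llama3.1", "prompt": "Say hi.", "stream": False}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=300) as r:
    print(json.load(r)["response"])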

The general.alignment default=32 line is a benign warning from newer llama.cpp when an optional GGUF key is absent.

Conclusion

If your dual-GPU RTX 3060 rig suddenly stops loading models after Ollama auto-updates and you see lines like:

llama_model_load: vocab only - skipping tensors
key with type not found key=general.alignment default=32

try rolling back to Ollama 0.12.3 first. That immediately restored normal model loading and inference for me, and it may solve the immediate headache while you work out how to upgrade again without breaking everything.

