19 Oct 2025 - tsp
Last update 19 Oct 2025
3 mins
TL;DR
On a box with two NVIDIA GeForce RTX 3060 (12 GB) GPUs, Ollama auto-updated to 0.12.4 and model loads silently stopped working; upgrading to 0.12.5 did not solve the problem. Downgrading to Ollama 0.12.3 fixed it immediately. An NVIDIA driver upgrade from 572.82 to 581.29 did not change the behavior (other CUDA apps were fine).
When Ollama auto-updated on my small dual RTX 3060 headless setup, model loading suddenly stopped working - no errors, just silence and hanging clients. The GPUs were detected, yet nothing generated and tensor loading failed from one day to the next. Upgrading drivers and even moving from version 0.12.4 to 0.12.5 did not help, while other CUDA applications ran perfectly.
After a few hours of debugging, the fix turned out to be simple: rolling back to Ollama 0.12.3 instantly restored normal behavior. If you are seeing lines like llama_model_load: vocab only - skipping tensors and key with type not found ... general.alignment, this post walks through what happened and how to get your models running again without tearing your hair out.
Server was up and GPUs detected:
Listening on [::]:8182 (version 0.12.5)
inference compute ... CUDA0 / CUDA1 ... NVIDIA GeForce RTX 3060 ... total="12.0 GiB"
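A quick way to confirm which version is actually answering (note the non-default port 8182 from my setup; a stock install listens on 11434):
curl http://localhost:8182/api/version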
When a client hit /api/show (a metadata probe), the logs looked normal but gave the impression models don’t load:
POST "/api/show"
llama_model_load: vocab only - skipping tensors
This is expected for /api/show, but in my case real generations never kicked in either.
Extra noise that wasn’t fatal but confused debugging:
key with type not found key=general.alignment default=32
and many lines like:
load: control token: 128... '<|reserved_special_token_...|>' is not marked as EOG
I tried a driver update from 572.82 to 581.29. This yielded no change (at least OpenCL and CUDA in other apps remained OK).
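To quickly rule out the driver side, nvidia-smi should still list both cards together with the new driver version - in my case it did, which pointed the blame at Ollama itself:
nvidia-smi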
I didn’t dig deeper into whether it’s scheduling, multi-GPU runner selection, or a llama.cpp interface edge case in 0.12.4/0.12.5 - the point here is the quick fix.
Downgrade Ollama to 0.12.3. The models load and generate again.
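On Linux the official install script supports pinning a version via the OLLAMA_VERSION environment variable (on Windows, where the auto-updater lives, fetch the 0.12.3 installer from the GitHub releases page instead):
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.12.3 sh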
Hitting /api/show only opens the GGUF header and tokenizer. It prints:
llama_model_load: vocab only - skipping tensors
This is expected behaviour. You need /api/generate or /api/chat to actually allocate and load tensors into VRAM.
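So when checking whether a build is actually healthy, probe with a chat or generate call instead of a show call (model name again a placeholder):
curl http://localhost:8182/api/chat -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "ping"}], "stream": false}'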
The general.alignment default=32 line is a benign warning from newer llama.cpp when an optional GGUF key is absent.
If your dual-GPU RTX 3060 rig suddenly stops loading models after Ollama auto-updates and you see lines like:
llama_model_load: vocab only - skipping tensors
key with type not found key=general.alignment default=32
try rolling back to Ollama 0.12.3 first. That immediately restored normal model loading and inference for me, and it may solve the immediate headache while you work out how to upgrade without breaking everything.