09 Mar 2025 - tsp
Last update 09 Mar 2025
4 mins
Ollama is a lightweight inference engine for running large language models (LLMs) like LLaMA, LLaMA-Vision, and DeepSeek directly on your local machine. It enables seamless interaction with these models without relying on cloud-based APIs, making it an excellent choice for offline use, embedded systems, and privacy-sensitive applications. It offers a set of command-line utilities and an HTTP API that can be accessed from custom code or through its excellent Python client library.
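As a quick illustration of the HTTP API, a prompt can be sent to a locally running instance with curl (a minimal sketch, assuming the server is listening on its default port 11434 and the llama3.2 model is already available):

# Send a single non-streaming generation request to the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'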
Typically, running LLMs is most efficient on high-performance GPUs such as NVIDIA RTX 3060 (12GB VRAM) or A100 (48GB VRAM) (note: those links are affiliate links, this page's author profits from qualified purchases). However, in some cases inference on CPU-only headless systems is necessary, for example on low-power embedded devices without a GPU, on air-gapped or offline machines, or in privacy-sensitive deployments where data must not leave the host.
While CPU-based inference is significantly slower than GPU-accelerated inference, it remains viable for scenarios where real-time processing is not required.
Ollama supports Vulkan for acceleration when not using CUDA. This usually works well when running a graphical desktop environment like XFCE (X11 or Wayland). However, on systems without a GUI, Ollama currently ignores any flags and always tries to use the Vulkan runner for executing LLMs, causing it to fail with the error message:
Terminating due to uncaught exception 0x8b47020bfc0 of type vk::IncompatibleDriverError
Additionally, even when explicitly setting environment variables to force CPU-based execution, Ollama still defaults to the Vulkan runner, making it impossible to override without recompiling.
To work around this, we will compile the Ollama application while explicitly excluding the Vulkan runner. This ensures that the software will function correctly in a headless environment, though it also means that Vulkan will not be available even if a GUI is installed later. To accomplish this, we will compile Ollama from the FreeBSD Ports system instead of using the prebuilt package and make a small modification to the build configuration before compilation. This approach effectively bypasses a bug that prevents disabling Vulkan via runtime flags.
In this example, we are running Ollama on a headless FreeBSD 14.2 system powered by an Intel Alder Lake-N97 (4 cores) (note: this link is an affiliate link, the page's author profits from qualified purchases). This setup represents a realistic scenario for running LLM inference on low-power embedded hardware.
To compile Ollama on FreeBSD 14, we begin by extracting the port from the Ports Collection:
cd /usr/ports/misc/ollama
make extract
make patch
cd /usr/ports/misc/ollama – Navigate to the directory containing the Ollama port.
make extract – This step downloads the source code and extracts it into the work directory.
make patch – Applies FreeBSD-specific patches provided by the ports system.
After extracting the source, we need to modify a build script to disable Vulkan (GPU acceleration) since we are targeting CPU-only inference.
Open the following file:
/usr/ports/misc/ollama/work/github.com/ollama/ollama@v0.3.6/llm/generate/gen_bsd.sh
Inside this file, locate two occurrences of the following flag:
-DGGML_VULKAN=on
Change both occurrences to:
-DGGML_VULKAN=off
This ensures that the build process does not attempt to use Vulkan, which is unnecessary for CPU-only inference.
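If you prefer to make this change non-interactively, a single BSD sed invocation flips both occurrences in place (a sketch; verify that the versioned path matches the port on your system before running it):

# In-place edit with BSD sed (the empty '' is the required backup suffix argument)
sed -i '' 's/-DGGML_VULKAN=on/-DGGML_VULKAN=off/g' \
    /usr/ports/misc/ollama/work/github.com/ollama/ollama@v0.3.6/llm/generate/gen_bsd.sh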
Once the modification is complete, proceed with building and installing Ollama:
make
make install
This will compile the software and install it into the system.
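As a quick sanity check that the freshly built binary is the one found on your PATH, you can query its version:

ollama --version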
After installation, Ollama can be started with the following command:
env OLLAMA_NUM_PARALLEL=1 OLLAMA_DEBUG=1 LLAMA_DEBUG=1 ollama serve
This launches the Ollama inference server, ready to accept model execution requests.
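By default the server only listens on localhost. If the headless box should be reachable from other machines on your network, the listen address can be overridden through the OLLAMA_HOST environment variable; the addresses below are illustrative placeholders for your own setup:

# Bind the server to all interfaces instead of only 127.0.0.1
env OLLAMA_HOST=0.0.0.0:11434 OLLAMA_NUM_PARALLEL=1 ollama serve

# From another machine on the LAN (192.168.1.50 stands in for the headless
# system's address), list the models the server knows about:
curl http://192.168.1.50:11434/api/tags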
To run a model such as LLaMA 3.2, use:
ollama run llama3.2
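This opens an interactive prompt. For scripted or one-off use, a prompt can also be passed directly on the command line (example prompt chosen arbitrarily):

ollama run llama3.2 "Explain in one sentence what the FreeBSD ports system does."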
Since headless embedded systems often lack direct internet access, models can be transferred from another machine by copying the ~/.ollama directory:
rsync -av --progress ~/.ollama user@remote-system:/home/user/
This allows you to pre-download and store models on a networked system and then transfer them to your isolated embedded setup. The rsync command above is run on the machine that holds the downloaded models, with remote-system being the headless target.
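A typical workflow on the networked machine looks like this (a sketch; the model name matches the one used above):

# On the machine with internet access: fetch the model into ~/.ollama
ollama pull llama3.2
# Optionally verify which models are stored locally before syncing
ollama list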
By following these steps, you can successfully run Ollama on a headless CPU-only FreeBSD 14.2 system with an Intel Alder Lake-N97 processor, enabling local LLM inference without requiring GPU acceleration or an internet connection.
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/