09 Mar 2025 - tsp
Last update 09 Mar 2025
4 mins
Ollama is a lightweight inference engine for running large language models (LLMs) like LLaMA, LLaMA-Vision, and DeepSeek directly on your local machine. It enables seamless interaction with these models without relying on cloud-based APIs, making it an excellent choice for offline use, embedded systems, and privacy-sensitive applications. It offers a set of command-line utilities and an HTTP API that can be accessed from custom code or through its excellent Python client library.
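As a quick illustration of the HTTP API, a prompt can be sent to a locally running instance with curl (a minimal sketch, assuming the server is listening on its default port 11434 and the llama3.2 model is already available):

# Send a single non-streaming generation request to the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'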
Typically, running LLMs is most efficient on high-performance GPUs such as NVIDIA RTX 3060 (12GB VRAM) or A100 (48GB VRAM) (note: those links are affiliate links, this page's author profits from qualified purchases). However, in some cases inference on CPU-only headless systems is necessary, for example on low-power embedded devices without a GPU, on air-gapped or offline machines, or in privacy-sensitive deployments where data must not leave the host.
While CPU-based inference is significantly slower than GPU-accelerated inference, it remains viable for scenarios where real-time processing is not required.
Ollama supports Vulkan for acceleration when not using CUDA. This usually works well when running a graphical desktop environment like XFCE (X11 or Wayland). However, on systems without a GUI, Ollama currently ignores any flags and always tries to use the Vulkan runner for executing LLMs, causing it to fail with the error message:
Terminating due to uncaught exception 0x8b47020bfc0 of type vk::IncompatibleDriverError
Additionally, even when explicitly setting environment variables to force CPU-based execution, Ollama still defaults to the Vulkan runner, making it impossible to override without recompiling.
To work around this, we will compile the Ollama application while explicitly excluding the Vulkan runner. This ensures that the software will function correctly in a headless environment, though it also means that Vulkan will not be available even if a GUI is installed later. To accomplish this, we will compile Ollama from the FreeBSD Ports system instead of using the prebuilt package and make a small modification to the build configuration before compilation. This approach effectively bypasses a bug that prevents disabling Vulkan via runtime flags.
In this example, we are running Ollama on a headless FreeBSD 14.2 system powered by an Intel Alder Lake-N97 (4 cores) (note: this link is an affiliate link, the page's author profits from qualified purchases). This setup represents a realistic scenario for running LLM inference on low-power embedded hardware.
To compile Ollama on FreeBSD 14, we begin by extracting the port from the Ports Collection:
cd /usr/ports/misc/ollama
make extract
make patch
cd /usr/ports/misc/ollama – Navigate to the directory containing the Ollama port.
make extract – This step downloads the source code and extracts it into the work directory.
make patch – Applies FreeBSD-specific patches provided by the ports system.
After extracting the source, we need to modify a build script to disable Vulkan (GPU acceleration) since we are targeting CPU-only inference.
Open the following file:
/usr/ports/misc/ollama/work/github.com/ollama/ollama@v0.3.6/llm/generate/gen_bsd.sh
Inside this file, locate two occurrences of the following flag:
-DGGML_VULKAN=on
Change both occurrences to:
-DGGML_VULKAN=off
This ensures that the build process does not attempt to use Vulkan, which is unnecessary for CPU-only inference.
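If you prefer to make this change non-interactively, a single BSD sed invocation flips both occurrences in place (a sketch; verify that the versioned path matches the port on your system before running it):

# In-place edit with BSD sed (the empty '' is the required backup suffix argument)
sed -i '' 's/-DGGML_VULKAN=on/-DGGML_VULKAN=off/g' \
    /usr/ports/misc/ollama/work/github.com/ollama/ollama@v0.3.6/llm/generate/gen_bsd.sh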
Once the modification is complete, proceed with building and installing Ollama:
make
make install
This will compile the software and install it into the system.
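As a quick sanity check that the freshly built binary is the one found on your PATH, you can query its version:

ollama --version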
After installation, Ollama can be started with the following command:
env OLLAMA_NUM_PARALLEL=1 OLLAMA_DEBUG=1 LLAMA_DEBUG=1 ollama serve
This launches the Ollama inference server, ready to accept model execution requests.
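By default the server only listens on localhost. If the headless box should be reachable from other machines on your network, the listen address can be overridden through the OLLAMA_HOST environment variable; the addresses below are illustrative placeholders for your own setup:

# Bind the server to all interfaces instead of only 127.0.0.1
env OLLAMA_HOST=0.0.0.0:11434 OLLAMA_NUM_PARALLEL=1 ollama serve

# From another machine on the LAN (192.168.1.50 stands in for the headless
# system's address), list the models the server knows about:
curl http://192.168.1.50:11434/api/tags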
To run a model such as LLaMA 3.2, use:
ollama run llama3.2
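This opens an interactive prompt. For scripted or one-off use, a prompt can also be passed directly on the command line (example prompt chosen arbitrarily):

ollama run llama3.2 "Explain in one sentence what the FreeBSD ports system does."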
Since headless embedded systems often lack direct internet access, models can be transferred from another machine by copying the ~/.ollama directory:
rsync -av --progress ~/.ollama user@remote-system:/home/user/
This allows you to pre-download and store models on a networked system and then transfer them to your isolated embedded setup. The rsync command above is run on the machine that holds the downloaded models, with remote-system being the headless target.
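A typical workflow on the networked machine looks like this (a sketch; the model name matches the one used above):

# On the machine with internet access: fetch the model into ~/.ollama
ollama pull llama3.2
# Optionally verify which models are stored locally before syncing
ollama list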
By following these steps, you can successfully run Ollama on a headless CPU-only FreeBSD 14.2 system with an Intel Alder Lake-N97 processor, enabling local LLM inference without requiring GPU acceleration or an internet connection.
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/