mini-apigw: A Lightweight Gateway for Multi-Model AI Infrastructure

25 Oct 2025 - tsp
Last update 25 Oct 2025
Reading time 22 mins

Introduction - The Problem Nobody Wants to Admit

Modern AI applications often behave as if they are the only process in the universe. Every new AI service, notebook, or microservice starts up, loads its favorite model, and assumes that no one else dares to touch the GPU or system resources like RAM. The result is chaos: multiple workloads thrashing the same hardware. Applications today also very often lack proper error handling - if resources are exhausted or GPU buffers are lost due to competition, they simply crash, as if error handling had never been invented this century. It’s a silent epidemic of resource arrogance, reminiscent of the JavaEE era, when applications tended to hog all of a server’s RAM without actually needing it.

Commercial and open source solutions exist - Langfuse, Helicone, and open projects like TU Wien’s Aqueduct, which we use in front of vLLM clusters at the university. These gateways are capable, but they’re heavy: complex multi-component architectures with databases, dashboards, and web-based configuration layers. They’re great for large institutions but overkill for small labs, hobby projects, or local offline clusters where you just want control and load distribution without bureaucracy and administrative overhead.

That’s why mini-apigw exists: a tiny, transparent, locally controlled OpenAI-compatible gateway designed to bridge multiple model backends - OpenAI, Anthropic, Ollama, and soon vLLM as well as Fooocus - while adding governance and arbitration features missing from the modern AI ecosystem.

Motivation - Why I Built mini-apigw

I wanted a service that sits quietly between clients and model backends and fixes the problems most people don’t even notice until it’s too late (obviously I personally ran into them):

- Uncoordinated GPU access: multiple services loading models at the same time and crashing each other.
- No single front door that routes requests to different backends - cloud APIs and local inference servers - behind one API.
- No per-application access control, cost limits, or trace logging for small local setups.

In short: I wanted something that behaves like a modern version of an old-school UNIX daemon, one that abstracts the LLM and image generation services (and later on additional services) from the actual backends. No unnecessary web interface, no unnecessary moving parts. Configuration through files. Trace logs into files. Simplicity (as simple as possible, though not too simple).

Design Philosophy - Minimalism Meets Control

mini-apigw lives in ~/.config/mini-apigw/ and uses three JSON files:

- daemon.json - the daemon itself: listening sockets, logging, database, and timeouts
- apps.json - the applications that may access the gateway, their API keys, policies, and limits
- backends.json - the available backends, model aliases, and sequence groups

It can reload its configuration on SIGHUP like a traditional Unix daemon, report status via local-only admin endpoints, and forward all /v1/* calls exactly as the OpenAI API does - so existing clients need no modification except for the base URL.
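For example - using the PID file and socket locations from the deployment examples later in this post - a configuration reload and a status query might look like this (the exact output of the stats endpoint is up to the daemon):

$ kill -HUP $(cat /usr/local/etc/mini-apigw/mini-apigw.pid)
$ curl --unix-socket /var/run/mini-apigw.sock http://localhost/stats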

This minimalism also reduces the exposed surface dramatically. No admin web UI means no authentication endpoint to break into. It makes it unsuitable for large multi-user teams but perfect for controlled environments or single-node setups.

Architecture - A Single Front Door for Many Models

Clients ─ mini-apigw ─┬─ OpenAI
                      ├─ Anthropic
                      ├─ Ollama
                      └─ vLLM / custom backends

mini-apigw exposes a single /v1/ endpoint structure and routes each request to the appropriate backend according to the model name, aliases, or policy rules. Backends can be cloud APIs or local inference servers. The gateway supports both streaming and non-streaming responses and can transform metadata and logging as needed.
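To make the routing concrete, here is a simplified, hypothetical sketch of the resolution step - not the actual implementation: an alias is first mapped to a concrete model name, which then selects the first backend that lists the model for the requested capability (the data mirrors the backends.json example shown later):

# Hypothetical sketch of model-based routing, not the gateway's actual code.
aliases = {
    "llama3.2": "llama3.2:latest",
    "blogembed": "mxbai-embed-large:latest",
}
backends = [
    {"name": "openai-primary", "supports": {"chat": ["gpt-4o-mini"]}},
    {"name": "ollama-local", "supports": {
        "chat": ["llama3.2:latest"],
        "embeddings": ["mxbai-embed-large:latest"],
    }},
]

def route(model: str, capability: str) -> dict:
    model = aliases.get(model, model)  # resolve alias to concrete model name
    for backend in backends:
        if model in backend["supports"].get(capability, []):
            return backend
    raise LookupError(f"no backend serves {model!r} for {capability!r}")

print(route("llama3.2", "chat")["name"])         # ollama-local
print(route("blogembed", "embeddings")["name"])  # ollama-local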

The most important feature, however, is arbitration.

Arbitration - The Missing Layer in Today’s AI Ecosystem

Most AI frameworks today assume exclusive GPU access. In multi-user, multi-service setups, that assumption fails spectacularly. Two LLMs start loading, each thinks it owns the GPU, and everything crashes - sometimes even the whole system, given the state of current GPU computation frameworks.

mini-apigw introduces sequence groups - small arbitration queues per group of backends that serialize model loads and executions. The effect is similar to thread pools or database connection pools: predictable throughput, no GPU thrashing, and clean recovery.
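Conceptually, a sequence group behaves like a shared lock around the GPU. The following toy sketch (my own illustration, not the gateway’s actual code) serializes requests within one group using an asyncio.Lock, while backends without a group would proceed unhindered:

import asyncio

# One lock per sequence group: requests to backends sharing a group run
# strictly one after another; backends without a group are not serialized.
group_locks = {"local_gpu_01": asyncio.Lock()}

async def call_backend(backend: dict, payload: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for model load and inference
    return f"{backend['name']}: {payload}"

async def dispatch(backend: dict, payload: str) -> str:
    group = backend.get("sequence_group")
    if group is None:
        return await call_backend(backend, payload)
    async with group_locks[group]:  # serialize within the sequence group
        return await call_backend(backend, payload)

async def main():
    ollama = {"name": "ollama-local", "sequence_group": "local_gpu_01"}
    # Both requests target the same group, so they execute serially.
    results = await asyncio.gather(dispatch(ollama, "req1"), dispatch(ollama, "req2"))
    print(results)

asyncio.run(main())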

It’s a rediscovery of a concept that mainstream software engineering once valued deeply. JavaEE, for example, had sophisticated thread pools and resource managers, ensuring fairness and throughput under load. Modern AI software, in contrast, is a jungle of processes fighting for VRAM without a referee. mini-apigw brings that sanity back.

Quick Tour - Configuration and Usage

Installation of the application is simple thanks to PyPI:

$ pip install mini-apigw

One can also install the gateway editable in place from a clone of the GitHub repository for development:

$ git clone git@github.com:tspspi/mini-apigw.git
$ cd mini-apigw
$ pip install -e .

Start the gateway manually:

$ mini-apigw --config ~/.config/mini-apigw/

Unix domain socket & reverse proxy: In my personal deployment I usually run the daemon bound to a Unix domain socket and expose it through Apache (or another reverse proxy) using ProxyPass and ProxyPassReverse to the local socket. This has several operational advantages:

- No open TCP port on the host; access to the socket is governed by ordinary filesystem permissions.
- TLS termination, authentication, and IP restrictions for the admin endpoints are handled by a battle-tested reverse proxy (see the LocationMatch blocks below).
- The gateway slots into the same virtual host setup as everything else the proxy serves.

A typical configuration of the reverse proxy - in this case of Apache httpd - may look like the following:

<VirtualHost *:80>
	ServerName api.example.com
	ServerAdmin complains@example.com

	DocumentRoot /usr/www/www.example.com/www/

	ProxyPass	/	"unix:/var/run/mini-apigw.sock|http://localhost/"
	ProxyPassReverse /	"unix:/var/run/mini-apigw.sock|http://localhost/"

	<LocationMatch "^/(admin|stats)">
	        AuthType Basic
	        AuthName "mini-apigw admin"
	        AuthUserFile "/usr/local/etc/httpd/miniapigw-admin.htpasswd"
	        <RequireAll>
	                Require valid-user
	                Require ip 127.0.0.1 ::1 192.0.2.0/24
	        </RequireAll>
	</LocationMatch>
</VirtualHost>
<VirtualHost *:443>
    ServerName api.example.com
    ServerAdmin complains@example.com

    DocumentRoot /usr/www/www.example.com/www/

    ProxyPass	/	"unix:/var/run/mini-apigw.sock|http://localhost/"
    ProxyPassReverse /	"unix:/var/run/mini-apigw.sock|http://localhost/"

    SSLOptions +StdEnvVars
#	SSLVerifyClient optional
    SSLVerifyDepth 5
    SSLCertificateFile "/usr/www/www.example.com/conf/ssl.cert"
    SSLCertificateKeyFile "/usr/www/www.example.com/conf/ssl.key"
    SSLCertificateChainFile "/usr/www/www.example.com/conf/ssl.cert"
#   SSLCACertificateFile "/usr/www/www.example.com/conf/ca01_01.cert"

	<LocationMatch "^/(admin|stats)">
	        AuthType Basic
	        AuthName "mini-apigw admin"
	        AuthUserFile "/usr/local/etc/httpd/miniapigw-admin.htpasswd"
	        <RequireAll>
	                Require valid-user
	                Require ip 127.0.0.1 ::1 192.0.2.0/24
	        </RequireAll>
	</LocationMatch>
</VirtualHost>

mini-apigw can also listen directly on TCP (IPv6 by default, legacy IPv4 if configured) when that is preferred, but for controlled server deployments the Unix domain socket + reverse proxy pattern tends to make more sense from a system administration perspective.

Configuration

Daemon configuration (daemon.json)

The main daemon configuration daemon.json defines the listening endpoints (Unix domain socket and/or TCP port), the networks allowed to query the stats/admin endpoints, logging behavior, the accounting database, SIGHUP-based reload, and default timeouts:

{
  "listen": {
    "unix_socket" : "/usr/home/tsp/miniapigw/gw.sock",
    "port": 8080
  },
  "admin": {
    "stats_networks": ["127.0.0.1/32", "::1/128", "192.0.2.0/24" ]
  },
  "logging": {
    "level": "INFO",
    "redact_prompts": false,
    "access_log": true,
    "file" : "/var/log/miniapigw.log"
  },
  "database" : {
    "host" : "192.0.2.3",
    "database" : "database_name",
    "username" : "database_user",
    "password" : "database_secret"
  },
  "reload": {
    "enable_sighup": true
  },
  "timeouts": {
    "default_connect_s": 60,
    "default_read_s": 600
  }
}

Application configuration (apps.json)

The apps.json file contains configuration for the applications that can access the API gateway.

{
  "apps" : [
    {
      "app_id": "demo",
      "name": "Demo application",
      "api_keys": [
        "secretkey1",
        "secretkey2"
      ],
      "policy": {
        "allow": [ "llama3.2", "gpt-4o-mini", "gpt-oss", "llama3.2:latest", "text-embedding-3-small", "nomic-embed-text", "dall-e-3" ],
        "deny": []
      },
      "cost_limit": {
        "period": "day",
        "limit": 10.0
      },
      "trace": {
        "file": "/var/log/miniapigw/logs/demo.jsonl",
        "imagedir": "/var/log/miniapigw/logs/demo.images/",
        "includeprompts": true,
        "includeresponse": true,
        "includekeys": true
      }
    }
  ]
}
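To make the semantics of policy and cost_limit concrete, here is a hypothetical sketch of how such a check could be evaluated per request (field names follow the example above; the gateway’s actual enforcement logic may differ):

# Hypothetical per-request policy check mirroring the apps.json fields:
# the model must be allowed, must not be denied, and the accumulated
# spend for the current period must stay below the configured limit.
def check_request(app: dict, model: str, spent_this_period: float) -> bool:
    policy = app["policy"]
    if model in policy.get("deny", []):
        return False
    if model not in policy.get("allow", []):
        return False
    cost_limit = app.get("cost_limit")
    if cost_limit and spent_this_period >= cost_limit["limit"]:
        return False
    return True

demo = {
    "policy": {"allow": ["llama3.2", "gpt-4o-mini"], "deny": []},
    "cost_limit": {"period": "day", "limit": 10.0},
}
print(check_request(demo, "llama3.2", spent_this_period=2.5))   # True
print(check_request(demo, "dall-e-3", spent_this_period=2.5))   # False: not allowed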

Backend configuration (backends.json)

This is the place where one defines which backends are available and which sequence groups they belong to. In addition, aliases are defined here. The file is a single large JSON dictionary.

The aliases section is a simple dictionary mapping from an arbitrary string to an actual model name; the model name then selects the backend. In the following example some aliases are used to select model sizes or versions. In addition a transparent name called blogembed is defined. This is a technique that I also use on my personal gateway to select the embeddings used by the tools operating on this blog: all tools use the transparent name blogembed when querying the gateway. If I ever want to switch to a different embedding model, I just have to change the alias mapping. The tools detect the different size of the embeddings and regenerate their indices.

The next section is sequence_groups. This is a dictionary that contains one entry per so-called sequence group. All requests that go to backends belonging to the same sequence group are executed serially, never in parallel. Other requests may be processed in parallel.

The following list of backends is then the main configuration of the backends. As one can see, every backend has:

- a type (openai or ollama in the example below) and a unique name
- a base_url and, for cloud backends, an api_key
- an optional sequence_group and a concurrency limit
- a supports dictionary listing the models served per capability (chat, completions, embeddings, images)
- an optional cost table used for accounting and cost limits

{
  "aliases": {
    "llama3.2" : "llama3.2:latest",
    "gpt-oss" : "gpt-oss:20b",
    "llama3.2-vision" : "llama3.2-vision:latest",
    "blogembed" : "mxbai-embed-large:latest"
  },
  "sequence_groups": {
    "local_gpu_01": {
      "description": "Serialized work for local GPU tasks"
    }
  },
  "backends": [
    {
      "type": "openai",
      "name": "openai-primary",
      "base_url": "https://api.openai.com/v1",
      "api_key": "YOUROPENAI_PLATFORM_KEY",
      "concurrency": 1,
      "supports": {
        "chat": [ "gpt-4o-mini" ],
        "embeddings": [ "text-embedding-3-small" ],
        "images": [ "dall-e-3" ]
      },
      "cost": {
        "currency": "usd",
        "unit": "1k_tokens",
        "models": {
          "gpt-4o-mini": {"prompt": 0.002, "completion": 0.004},
          "text-embedding-3-small": {"prompt": 0.0001, "completion": 0.0}
        }
      }
    },
    {
      "type": "ollama",
      "name": "ollama-local",
      "base_url": "http://192.0.2.1:8182",
      "sequence_group": "local_gpu_01",
      "concurrency": 1,
      "supports": {
        "chat": ["llama3.2:latest", "gpt-oss:20b", "llama3.2-vision:latest"],
        "completions" : ["llama3.2:latest" ],
        "embeddings": ["nomic-embed-text", "mxbai-embed-large:latest"]
      },
      "cost": {
        "models": {
          "llama3.2:latest": {"prompt": 0.0, "completion": 0.0},
          "gpt-oss:20b": {"prompt": 0.0, "completion": 0.0},
          "nomic-embed-text": {"prompt": 0.0, "completion": 0.0}
        }
      }
    }
  ]
}

Then any OpenAI-compatible client can use it transparently:

from openai import OpenAI

# Point any OpenAI client at the gateway; the API key is one of the keys
# configured in apps.json.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="secretkey1")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum tunneling."}]
)
print(response.choices[0].message.content)

mini-apigw will automatically pick the right backend (Ollama in this case) and manage concurrency and logging.
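The transparent blogembed alias from backends.json works the same way - the client never learns which embedding model actually serves the request:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="secretkey1")

# "blogembed" is resolved by the gateway (to mxbai-embed-large:latest on the
# local Ollama backend in the example above); switching embeddings later only
# requires changing the alias mapping.
embedding = client.embeddings.create(model="blogembed", input="Hello, gateway!")
print(len(embedding.data[0].embedding))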

Creating and Using API keys

To ease the creation of API keys - these are just opaque bearer tokens, i.e. arbitrary strings - the mini-apigw client implements the token command. This creates a random access token that can then be used in the application configuration. At the time of writing, API tokens are treated as opaque sequences of bytes. In a later stage they will be JWTs that include permissions for the given clients to allow end-to-end authorization.
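Generating a key is therefore a single command; the printed token (an opaque random string) is added verbatim to the api_keys array of the corresponding app in apps.json:

$ mini-apigw token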

Note that API keys should never be sent over plain HTTP except on the local network or over the Unix domain socket. Always use HTTPS.

Starting and Stopping the Services, Reloading Configuration

Starting the service can be done via the mini-apigw command line interface with the start subcommand (or without any subcommand), or via an rc.d script in case one runs on FreeBSD. Stopping and reloading configuration can be done using two distinct mechanisms:

- via the stop and reload subcommands of the command line interface (the FreeBSD script below uses these), or
- via POSIX signals: SIGHUP triggers a configuration reload when enable_sighup is set in daemon.json, and the usual termination signals stop the daemon.

The rc.d script also supports checking the status of the daemon using the PID file.

#!/bin/sh
# PROVIDE: mini_apigw
# REQUIRE: LOGIN
# KEYWORD: shutdown

. /etc/rc.subr

name="mini_apigw"
rcvar="mini_apigw_enable"

load_rc_config $name

: ${mini_apigw_enable:="NO"}
: ${mini_apigw_command:="/usr/local/bin/mini-apigw"}
: ${mini_apigw_config_dir:="/usr/local/etc/mini-apigw"}
: ${mini_apigw_user:="mini-apigw"}
: ${mini_apigw_pidfile:="${mini_apigw_config_dir}/mini-apigw.pid"}
: ${mini_apigw_unix_socket:="${mini_apigw_config_dir}/mini-apigw.sock"}
: ${mini_apigw_flags:=""}
: ${mini_apigw_timeout:="10"}

command="${mini_apigw_command}"
pidfile="${mini_apigw_pidfile}"
required_files="${mini_apigw_config_dir}/daemon.json"
extra_commands="reload status"
start_cmd="${name}_start"
stop_cmd="${name}_stop"
reload_cmd="${name}_reload"
status_cmd="${name}_status"

mini_apigw_build_args()
{
	_subcmd="$1"
	shift
	_cmd="${command} ${_subcmd} --config-dir \"${mini_apigw_config_dir}\""
	if [ -n "${mini_apigw_unix_socket}" ]; then
		_cmd="${_cmd} --unix-socket \"${mini_apigw_unix_socket}\""
	fi
	for _arg in "$@"; do
		_cmd="${_cmd} ${_arg}"
	done
	if [ -n "${mini_apigw_flags}" ]; then
		_cmd="${_cmd} ${mini_apigw_flags}"
	fi
	echo "${_cmd}"
}

mini_apigw_run()
{
	_cmd=$(mini_apigw_build_args "$@")
	if [ "$(id -un)" = "${mini_apigw_user}" ]; then
		/bin/sh -c "${_cmd}"
	else
		su -m "${mini_apigw_user}" -c "${_cmd}"
	fi
}

mini_apigw_start()
{
	mini_apigw_run start
}

mini_apigw_stop()
{
	mini_apigw_run stop --timeout "${mini_apigw_timeout}"
}

mini_apigw_reload()
{
	mini_apigw_run reload --timeout "${mini_apigw_timeout}"
}

mini_apigw_status()
{
	if [ ! -f "${pidfile}" ]; then
		echo "${name} is not running"
		return 1
	fi
	_pid=$(cat "${pidfile}" 2>/dev/null)
	if [ -z "${_pid}" ]; then
		echo "${name} pidfile exists but is empty"
		return 1
	fi
	if kill -0 "${_pid}" 2>/dev/null; then
		echo "${name} running as pid ${_pid}"
		return 0
	fi
	echo "${name} pidfile exists but process not running"
	return 1
}

run_rc_command "$1"
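With the script installed as, e.g., /usr/local/etc/rc.d/mini_apigw, the usual FreeBSD service workflow applies:

$ sysrc mini_apigw_enable=YES
$ service mini_apigw start
$ service mini_apigw status
$ service mini_apigw reload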

Security Considerations

A few points already touched upon throughout this article deserve repeating:

- API keys are opaque bearer tokens. Never send them over plain HTTP; use HTTPS or the local Unix domain socket.
- The admin and stats endpoints should stay local-only: restrict them via stats_networks in daemon.json and, when running behind a reverse proxy, via authentication and IP restrictions as shown in the Apache example above.
- Trace logs can contain prompts, responses, and even API keys (includeprompts, includeresponse, includekeys); protect the log directories accordingly, or enable redact_prompts in daemon.json.
- The absence of an admin web UI keeps the attack surface small, but the JSON configuration files contain secrets (API keys, database credentials) and must only be readable by the daemon’s user.

Conclusion - Control, Simplicity, and Fairness

mini-apigw is a reminder that simplicity and control can coexist. It’s about reclaiming responsibility for resources and infrastructure in a world where every AI application assumes it is alone. It’s not a massive platform - it’s a scalpel: small, precise, and reliable. When others build towers of YAML and Kubernetes operators - or require loads of virtual environments and Docker containers to be deployed without any control over their content - sometimes all you need is a well-behaved little daemon that keeps the peace between your models.

This utility has been designed to solve a simple task in a simple environment. It will never scale to a huge cluster and it will not scale to any worldwide operation; it was never designed to do so. It’s there to solve a small local problem. And it has worked flawlessly so far.
