25 Oct 2025 - tsp
Last update 25 Oct 2025
22 mins
Modern AI applications often behave as if they were the only process in the universe. Every new AI service, notebook, or microservice starts up, loads its favorite model, and assumes that no one else dares to touch the GPU or system resources like RAM. The result is chaos: multiple workloads thrashing the same hardware. Applications today also very often lack proper error handling - if resources are exhausted or GPU buffers are lost due to competition, they simply crash as if error handling had never been invented this century. It’s a silent epidemic of resource arrogance. It is reminiscent of the days of Java EE applications that tended to hog all of a server’s RAM without actually needing it.
Commercial and open source solutions exist - Langfuse, Helicone, and open projects like TU Wien’s Aqueduct, which we use in front of vLLM clusters at the university. These gateways are capable, but they’re heavy: complex multi-component architectures with databases, dashboards, and web-based configuration layers. They’re great for large institutions but overkill for small labs, hobby projects, or local offline clusters where you just want control and load distribution without bureaucracy and administrative overhead.
That’s why mini-apigw exists: a tiny, transparent, locally controlled OpenAI-compatible gateway designed to bridge multiple model backends - OpenAI, Anthropic, Ollama, and soon vLLM as well as Fooocus - while adding governance and arbitration features missing from the modern AI ecosystem.

I wanted a service that sits quietly between clients and model backends and fixes the problems most people don’t even notice until it’s too late (obviously I personally ran into them).
In short: I wanted something that behaves like a modern version of an old-school UNIX daemon and abstracts the LLM and image generation services (and, later on, additional services) from the actual backends. No unnecessary web interface, no unnecessary moving parts. Configuration through files. Trace logs into files. Simplicity (as simple as possible, though not too simple).
mini-apigw lives in ~/.config/mini-apigw/ and uses three JSON files: daemon.json, apps.json and backends.json.
It can reload its configuration on SIGHUP like a traditional Unix daemon, report status via local-only admin endpoints, and forward all /v1/* calls exactly as the OpenAI API does - so existing clients need no modification except for the base URL.
This minimalism also reduces the exposed attack surface dramatically. No admin web UI means no authentication endpoint to break into. This makes it unsuitable for large multi-user teams but perfect for controlled environments or single-node setups.
Clients ─ mini-apigw ─┬─ OpenAI
                      ├─ Anthropic
                      ├─ Ollama
                      └─ vLLM / custom backends
mini-apigw exposes a single /v1/ endpoint structure and routes each request to the appropriate backend according to the model name, aliases, or policy rules. Backends can be cloud APIs or local inference servers. The gateway supports both streaming and non-streaming responses and can transform metadata and logging as needed.
The most important feature, however, is arbitration.
Most AI frameworks today assume exclusive GPU access. In multi-user, multi-service setups, that assumption fails spectacularly. Two LLMs start loading, each thinks it owns the GPU, and everything crashes - sometimes even the whole system, given the state of current GPU compute frameworks.
mini-apigw introduces sequence groups - small arbitration queues per group of backends that serialize model loads and executions. The effect is similar to thread pools or database connection pools: predictable throughput, no GPU thrashing, and clean recovery.
It’s a rediscovery of a concept that mainstream software engineering once valued deeply. Java EE, for example, had sophisticated thread pools and resource managers ensuring fairness and throughput under load. Modern AI software, in contrast, is a jungle of processes fighting for VRAM without a referee. mini-apigw brings that sanity back.
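Conceptually, a sequence group is nothing more than a shared lock (or single-slot queue) that every backend in the group has to hold while it loads a model or runs inference. The following asyncio sketch only illustrates the idea - it is not the gateway’s actual implementation - and borrows the group name local_gpu_01 from the backend configuration shown later:

import asyncio

# One lock per sequence group: every backend assigned to "local_gpu_01"
# has to hold the same lock while loading a model or generating.
SEQUENCE_GROUPS = {}

def group_lock(name):
    # Lazily create the per-group lock (illustrative only).
    if name not in SEQUENCE_GROUPS:
        SEQUENCE_GROUPS[name] = asyncio.Lock()
    return SEQUENCE_GROUPS[name]

async def run_on_backend(backend, sequence_group, payload):
    if sequence_group is None:
        return await backend(payload)          # ungrouped backends run freely
    async with group_lock(sequence_group):     # grouped requests are serialized
        return await backend(payload)

async def fake_local_llm(payload):
    await asyncio.sleep(1)                     # stands in for model load + generation
    return f"done: {payload}"

async def main():
    # Both requests target the same group, so they execute strictly one after the other.
    print(await asyncio.gather(
        run_on_backend(fake_local_llm, "local_gpu_01", "request A"),
        run_on_backend(fake_local_llm, "local_gpu_01", "request B"),
    ))

asyncio.run(main())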
Installation of the application is simple thanks to PyPI:
$ pip install mini-apigw
One can also install the gateway from the cloned GitHub repository as an editable in-place installation for development:
$ git clone git@github.com:tspspi/mini-apigw.git
$ cd mini-apigw
$ pip install -e .
Start the gateway manually:
$ mini-apigw --config ~/.config/mini-apigw/
Unix domain socket & reverse proxy: In my personal deployment I usually run the daemon bound to a Unix domain socket and expose it through Apache (or another reverse proxy) using ProxyPass and ProxyPassReverse to the local socket. This has several operational advantages. One of them is rate limiting: Apache’s mod_qos provides proven real-world solutions, which you can just drop in with your usual configuration without having to implement anything in the API gateway itself. A typical configuration of the reverse proxy - in this case Apache httpd - may look like the following:
<VirtualHost *:80>
    ServerName api.example.com
    ServerAdmin complains@example.com
    DocumentRoot /usr/www/www.example.com/www/

    ProxyPass / "unix:/var/run/mini-apigw.sock|http://localhost/"
    ProxyPassReverse / "unix:/var/run/mini-apigw.sock|http://localhost/"

    <LocationMatch "^/(admin|stats)">
        AuthType Basic
        AuthName "mini-apigw admin"
        AuthUserFile "/usr/local/etc/httpd/miniapigw-admin.htpasswd"
        <RequireAll>
            Require valid-user
            Require ip 127.0.0.1 ::1 192.168.1.0/24
        </RequireAll>
    </LocationMatch>
</VirtualHost>
<VirtualHost *:443>
    ServerName api.example.com
    ServerAdmin complains@example.com
    DocumentRoot /usr/www/www.example.com/www/

    ProxyPass / "unix:/var/run/mini-apigw.sock|http://localhost/"
    ProxyPassReverse / "unix:/var/run/mini-apigw.sock|http://localhost/"

    SSLEngine on
    SSLOptions +StdEnvVars
    # SSLVerifyClient optional
    SSLVerifyDepth 5
    SSLCertificateFile "/usr/www/www.example.com/conf/ssl.cert"
    SSLCertificateKeyFile "/usr/www/www.example.com/conf/ssl.key"
    SSLCertificateChainFile "/usr/www/www.example.com/conf/ssl.cert"
    # SSLCACertificateFile "/usr/www/www.example.com/conf/ca01_01.cert"

    <LocationMatch "^/(admin|stats)">
        AuthType Basic
        AuthName "mini-apigw admin"
        AuthUserFile "/usr/local/etc/httpd/miniapigw-admin.htpasswd"
        <RequireAll>
            Require valid-user
            Require ip 127.0.0.1 ::1 192.0.2.0/24
        </RequireAll>
    </LocationMatch>
</VirtualHost>
mini-apigw can also listen directly on TCP (IPv6 by default, legacy IPv4 if configured) when that is preferred, but for controlled server deployments the Unix domain socket + reverse proxy pattern tends to make more sense from a system administration perspective.
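When the daemon only listens on its Unix domain socket, it can still be queried without any reverse proxy in between. A small standard-library sketch - the socket path is the one from the Apache example above, the API key is the demo key from apps.json further below, and /v1/models is assumed to be forwarded like any other /v1/* route:

import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket using only the standard library."""
    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)

conn = UnixHTTPConnection("/var/run/mini-apigw.sock")
conn.request("GET", "/v1/models", headers={"Authorization": "Bearer secretkey1"})
resp = conn.getresponse()
print(resp.status, json.loads(resp.read()))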
Daemon configuration (daemon.json)
The main daemon configuration daemon.json defines:
- The listen configuration. It usually contains a unix_socket (and the port is unused). One can alternatively specify an ipv4 or ipv6 field containing the listen addresses.
- The admin section defines where the admin endpoint is located and who can access it.
- The logging configuration sets the log level and the log file, with syslog output as an option.
- The database configuration specifies an optional PostgreSQL database. If it is specified, an accounting log is written into the database.
- The reload option allows one to disable or enable reloading of configuration files using a SIGHUP handler.
{
    "listen": {
        "unix_socket" : "/usr/home/tsp/miniapigw/gw.sock",
        "port": 8080
    },
    "admin": {
        "stats_networks": ["127.0.0.1/32", "::1/128", "192.0.2.0/24" ]
    },
    "logging": {
        "level": "INFO",
        "redact_prompts": false,
        "access_log": true,
        "file" : "/var/log/miniapigw.log"
    },
    "database" : {
        "host" : "192.0.2.3",
        "database" : "database_name",
        "username" : "database_user",
        "password" : "database_secret"
    },
    "reload": {
        "enable_sighup": true
    },
    "timeouts": {
        "default_connect_s": 60,
        "default_read_s": 600
    }
}
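Since the daemon re-reads these files on SIGHUP, it can be worth checking an edited configuration before triggering a reload. A trivial sanity check along these lines (the path is the default configuration directory mentioned above) at least avoids reloading a file that does not even parse:

import json
import os

CONFIG = os.path.expanduser("~/.config/mini-apigw/daemon.json")

with open(CONFIG) as f:
    cfg = json.load(f)                     # raises ValueError on malformed JSON

# Minimal structural checks against the fields described above.
assert "listen" in cfg, "daemon.json should contain a listen section"
assert "unix_socket" in cfg["listen"] or "port" in cfg["listen"], \
    "listen needs either a unix_socket or a port"
print("daemon.json parses and has a listen section")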
Application configuration (apps.json)
The apps.json file contains the configuration for the applications that can access the API gateway:
- Every application has an app_id (this should be machine readable; I’d not recommend special characters there) as well as a name. The app_id has to be unique. The name can be any description of the application.
- The api_keys array contains a list of API keys. Those are transparent bearer tokens; at this moment they are not parsed by the gateway in any way. They also have to be uniquely assigned to one application (i.e. the same API key must not be assigned to different applications).
- The policy allows one to specify which models are allowed for this application (this can include alias definitions in the backend). If the allow whitelist is not used, a blacklist can be used via deny.
- The cost_limit enforces a rough resource cap on each application. This is particularly useful when designing automatic systems, to prevent them from running havoc and billing you thousands of EUR/USD on your credit card. It’s good to have safeguards in place.
- The trace configuration allows you to define a JSONL log file into which every request is logged. Depending on the three configuration options below, it logs different aspects of the requests. This allows one to trace what an application has been doing without having to implement logging in the application itself. In case imagedir is specified, all generated graphics from this application are also archived in the specified directory to keep a trace of what has been generated.
{
    "apps" : [
        {
            "app_id": "demo",
            "name": "Demo application",
            "api_keys": [
                "secretkey1",
                "secretkey2"
            ],
            "policy": {
                "allow": [ "llama3.2", "gpt-4o-mini", "gpt-oss", "llama3.2:latest", "text-embedding-3-small", "nomic-embed-text", "dall-e-3" ],
                "deny": []
            },
            "cost_limit": {
                "period": "day",
                "limit": 10.0
            },
            "trace": {
                "file": "/var/log/miniapigw/logs/demo.jsonl",
                "imagedir": "/var/log/miniapigw/logs/demo.images/",
                "includeprompts": true,
                "includeresponse": true,
                "includekeys": true
            }
        }
    ]
}
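The per-application trace file is plain JSONL - one JSON object per request - so it can be inspected without any special tooling. The exact fields depend on the includeprompts/includeresponse/includekeys switches above, so the sketch below only assumes one valid JSON object per line:

import json

TRACE = "/var/log/miniapigw/logs/demo.jsonl"    # path from the example above

entries = []
with open(TRACE) as f:
    for line in f:
        line = line.strip()
        if line:
            entries.append(json.loads(line))    # one traced request per line

print(f"{len(entries)} requests traced for this application")
if entries:
    # Show which fields the first record actually contains.
    print("fields in first record:", sorted(entries[0].keys()))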
Backend configuration (backends.json)
This is the place where one defines which backends are available and which sequence groups they belong to. In addition, aliases are defined here. The file is one huge JSON dictionary.
The aliases section is a simple dictionary mapping from an arbitrary string to an actual model name. The model name later on selects the backend. In the following example one can see that some aliases have been used to select model sizes or versions. In addition, a transparent name called blogembed has been used. This is a technique that I also use on my personal gateway to select the embeddings used by the tools operating on this blog. All tools use the transparent name blogembed when querying the gateway. If I ever want to switch to a different embedding, I just have to change the mapping in the alias; the tools detect the different size of the embeddings and regenerate their indices.
The next section is sequence_groups. This is a dictionary that contains one entry per so-called sequence group. All requests that go to backends belonging to the same sequence group are executed serially, never in parallel. Other requests may be processed in parallel.
The following list of backends is then the main configuration of the backends. As one can see, every backend has:
- A base_url, the api_key required to access the remote host, etc. For backends like Fooocus one will also be able to specify things like selected styles, the used models and refiners, and other parameters.
- A supports list that defines which models are exposed for the different operations. Those are exposed to the routing framework. The selection of the backends operates on the model names used here - a client requesting for example gpt-4o-mini for chat will be routed to the openai-primary backend, a client requesting llama3.2:latest for completion will be routed to ollama-local.
- A cost configuration that allows one to specify how much each of the requests costs per token. This is not fully implemented and is part of the safeguard against runaway applications.
{
    "aliases": {
        "llama3.2" : "llama3.2:latest",
        "gpt-oss" : "gpt-oss:20b",
        "llama3.2-vision" : "llama3.2-vision:latest",
        "blogembed" : "mxbai-embed-large:latest"
    },
    "sequence_groups": {
        "local_gpu_01": {
            "description": "Serialized work for local GPU tasks"
        }
    },
    "backends": [
        {
            "type": "openai",
            "name": "openai-primary",
            "base_url": "https://api.openai.com/v1",
            "api_key": "YOUROPENAI_PLATFORM_KEY",
            "concurrency": 1,
            "supports": {
                "chat": [ "gpt-4o-mini" ],
                "embeddings": [ "text-embedding-3-small" ],
                "images": [ "dall-e-3" ]
            },
            "cost": {
                "currency": "usd",
                "unit": "1k_tokens",
                "models": {
                    "gpt-4o-mini": {"prompt": 0.002, "completion": 0.004},
                    "text-embedding-3-small": {"prompt": 0.0001, "completion": 0.0}
                }
            }
        },
        {
            "type": "ollama",
            "name": "ollama-local",
            "base_url": "http://192.0.2.1:8182",
            "sequence_group": "local_gpu_01",
            "concurrency": 1,
            "supports": {
                "chat": ["llama3.2:latest", "gpt-oss:20b", "llama3.2-vision:latest"],
                "completions" : ["llama3.2:latest"],
                "embeddings": ["nomic-embed-text", "mxbai-embed-large:latest"]
            },
            "cost": {
                "models": {
                    "llama3.2:latest": {"prompt": 0.0, "completion": 0.0},
                    "gpt-oss:20b": {"prompt": 0.0, "completion": 0.0},
                    "nomic-embed-text": {"prompt": 0.0, "completion": 0.0}
                }
            }
        }
    ]
}
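The routing and cost rules this file describes boil down to a few lines: resolve the alias, pick the first backend whose supports list contains the resolved model for the requested operation, and (eventually) price the request from the per-1k-token cost table. The following is a simplified sketch of that logic against the example configuration above - not the gateway’s actual code:

import json
import os

# Load the example backends.json shown above.
with open(os.path.expanduser("~/.config/mini-apigw/backends.json")) as f:
    cfg = json.load(f)

def resolve_backend(model, operation):
    """Resolve aliases and pick the first backend supporting the model for this operation."""
    real = cfg["aliases"].get(model, model)       # e.g. "blogembed" -> "mxbai-embed-large:latest"
    for backend in cfg["backends"]:
        if real in backend["supports"].get(operation, []):
            return backend, real
    raise ValueError(f"no backend supports {real} for {operation}")

def estimate_cost(backend, model, prompt_tokens, completion_tokens):
    """Rough request cost from the per-1k-token prices (the basis of the cost_limit safeguard)."""
    prices = backend.get("cost", {}).get("models", {}).get(model)
    if not prices:
        return 0.0
    return prompt_tokens / 1000.0 * prices["prompt"] + completion_tokens / 1000.0 * prices["completion"]

backend, real = resolve_backend("gpt-4o-mini", "chat")
print(backend["name"], real)                      # openai-primary gpt-4o-mini
print(estimate_cost(backend, real, 1200, 300))    # 1.2 * 0.002 + 0.3 * 0.004 = 0.0036

backend, real = resolve_backend("blogembed", "embeddings")
print(backend["name"], real)                      # ollama-local mxbai-embed-large:latest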
Then any OpenAI-compatible client can use it transparently:
import openai
openai.api_base = "http://localhost:8080/v1"
openai.api_key = "sk-..."
response = openai.ChatCompletion.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum tunneling."}]
)
mini-apigw will automatically pick the right backend (Ollama in this case) and manage concurrency and logging.
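The snippet above uses the pre-1.0 interface of the openai package. With openai >= 1.0 the same request against the gateway looks like this (same base URL, with one of the application keys from apps.json):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",    # the gateway, not api.openai.com
    api_key="secretkey1",                   # an application key from apps.json
)

response = client.chat.completions.create(
    model="llama3.2",                       # alias, resolved by the gateway to llama3.2:latest
    messages=[{"role": "user", "content": "Explain quantum tunneling."}],
)
print(response.choices[0].message.content)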
To ease the creation of API keys - these are only transparent bearer tokens, so actually just arbitrary strings - the mini-apigw client implements the token command. This creates a random access token that can then be used in the application configuration. At the moment of writing the API tokens are treated as transparent sequences of bytes. In a later stage they will be JWTs that include permissions for the given clients to allow end-to-end authorization.
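Since the keys are just arbitrary strings for now, any source of sufficient randomness works equally well. Independent of the token subcommand, a one-liner like the following produces a usable key for the api_keys array:

import secrets

# 32 random bytes, URL-safe encoded - paste the output into apps.json.
print(secrets.token_urlsafe(32))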
Note that API keys should never be used over plain http except on the local network or over the Unix domain socket. Always use https.
Starting the service can be done via the mini-apigw command line interface using the start subcommand (or without any subcommand), or via an rc.init script in case one runs on FreeBSD. Stopping and reloading configuration can be done using two distinct mechanisms:
- Signals: on SIGHUP the daemon reloads its configuration from the JSON files, on SIGTERM the daemon shuts down.
- The command line interface’s stop and reload commands.
The rc.init script also supports checking the status of the daemon using the PID file.
#!/bin/sh
# PROVIDE: mini_apigw
# REQUIRE: LOGIN
# KEYWORD: shutdown
. /etc/rc.subr
name="mini_apigw"
rcvar="mini_apigw_enable"
load_rc_config $name
: ${mini_apigw_enable:="NO"}
: ${mini_apigw_command:="/usr/local/bin/mini-apigw"}
: ${mini_apigw_config_dir:="/usr/local/etc/mini-apigw"}
: ${mini_apigw_user:="mini-apigw"}
: ${mini_apigw_pidfile:="${mini_apigw_config_dir}/mini-apigw.pid"}
: ${mini_apigw_unix_socket:="${mini_apigw_config_dir}/mini-apigw.sock"}
: ${mini_apigw_flags:=""}
: ${mini_apigw_timeout:="10"}
command="${mini_apigw_command}"
pidfile="${mini_apigw_pidfile}"
required_files="${mini_apigw_config_dir}/daemon.json"
extra_commands="reload status"
start_cmd="${name}_start"
stop_cmd="${name}_stop"
reload_cmd="${name}_reload"
status_cmd="${name}_status"
mini_apigw_build_args()
{
    _subcmd="$1"
    shift
    _cmd="${command} ${_subcmd} --config-dir \"${mini_apigw_config_dir}\""
    if [ -n "${mini_apigw_unix_socket}" ]; then
        _cmd="${_cmd} --unix-socket \"${mini_apigw_unix_socket}\""
    fi
    for _arg in "$@"; do
        _cmd="${_cmd} ${_arg}"
    done
    if [ -n "${mini_apigw_flags}" ]; then
        _cmd="${_cmd} ${mini_apigw_flags}"
    fi
    echo "${_cmd}"
}

mini_apigw_run()
{
    _cmd=$(mini_apigw_build_args "$@")
    if [ "$(id -un)" = "${mini_apigw_user}" ]; then
        /bin/sh -c "${_cmd}"
    else
        su -m "${mini_apigw_user}" -c "${_cmd}"
    fi
}

mini_apigw_start()
{
    mini_apigw_run start
}

mini_apigw_stop()
{
    mini_apigw_run stop --timeout "${mini_apigw_timeout}"
}

mini_apigw_reload()
{
    mini_apigw_run reload --timeout "${mini_apigw_timeout}"
}

mini_apigw_status()
{
    if [ ! -f "${pidfile}" ]; then
        echo "${name} is not running"
        return 1
    fi
    _pid=$(cat "${pidfile}" 2>/dev/null)
    if [ -z "${_pid}" ]; then
        echo "${name} pidfile exists but is empty"
        return 1
    fi
    if kill -0 "${_pid}" 2>/dev/null; then
        echo "${name} running as pid ${_pid}"
        return 0
    fi
    echo "${name} pidfile exists but process not running"
    return 1
}
run_rc_command "$1"
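Outside the rc framework, the same reload can be triggered by signalling the daemon directly. A small sketch using the PID file path from the rc script variables above:

import os
import signal

PIDFILE = "/usr/local/etc/mini-apigw/mini-apigw.pid"   # ${mini_apigw_pidfile} default from above

with open(PIDFILE) as f:
    pid = int(f.read().strip())

os.kill(pid, signal.SIGHUP)   # SIGHUP: re-read daemon.json, apps.json and backends.json
print(f"sent SIGHUP to mini-apigw (pid {pid})")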
Note that the API keys are currently stored in plain text in the apps.json configuration file. This will be fixed in later iterations; it is of course bad design that has been chosen for simplicity for now. A quick fix later on will be to store just hashes there. This is on the ToDo list (and maybe already done at the moment you read this article). Also keep in mind: never use plain http over any public or not fully trusted network!
mini-apigw is a reminder that simplicity and control can coexist. It’s about reclaiming responsibility for resources and infrastructure in a world where every AI application assumes it is alone. It’s not a massive platform - it’s a scalpel: small, precise, and reliable.
When others build towers of YAML and Kubernetes operators - or require loads of virtual environments and Docker containers to be deployed without any control over their content - sometimes all you need is a well-behaved little daemon that keeps the peace between your models.
This utility has been designed to solve a given simple task in a simple environment. It will never scale to a huge cluster and it will not scale to any worldwide operation; it has never been designed to do so. It’s there to solve a small local problem. And it has worked flawlessly so far.
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/