25 Oct 2025 - tsp
Last update 25 Oct 2025
22 mins
Modern AI applications often behave as if they were the only process in the universe. Every new AI service, notebook, or microservice starts up, loads its favorite model, and assumes that no one else dares to touch the GPU or system resources like RAM. The result is chaos: multiple workloads thrashing the same hardware. Applications today also very often lack proper error handling - if resources are exhausted or GPU buffers are lost due to competition, they simply crash as if error handling had never been invented this century. It’s a silent epidemic of resource arrogance. It is reminiscent of the days of Java EE applications that tended to hog all of a server’s RAM without actually needing it.
Commercial and open source solutions exist - Langfuse, Helicone, and open projects like TU Wien’s Aqueduct, which we use in front of vLLM clusters at the university. These gateways are capable, but they’re heavy: complex multi-component architectures with databases, dashboards, and web-based configuration layers. They’re great for large institutions but overkill for small labs, hobby projects, or local offline clusters where you just want control and load distribution without bureaucracy and administrative overhead.
That’s why mini-apigw exists: a tiny, transparent, locally controlled OpenAI-compatible gateway designed to bridge multiple model backends - OpenAI, Anthropic, Ollama, and soon vLLM as well as Fooocus - while adding governance and arbitration features missing from the modern AI ecosystem.

I wanted a service that sits quietly between clients and model backends and fixes the problems most people don’t even notice until it’s too late (obviously I personally ran into them).
In short: I wanted something that behaves like a modern version of an old-school UNIX daemon and abstracts the LLM and image generation services (and, later on, additional services) from the actual backends. No unnecessary web interface, no unnecessary moving parts. Configuration through files. Trace logs into files. Simplicity (as simple as possible, though not too simple).
mini-apigw lives in ~/.config/mini-apigw/ and uses three JSON files: daemon.json, apps.json and backends.json.
It can reload its configuration on SIGHUP like a traditional Unix daemon, report status via local-only admin endpoints, and forward all /v1/* calls exactly as the OpenAI API does - so existing clients need no modification except for the base URL.
This minimalism also reduces the exposed attack surface dramatically. No admin web UI means no authentication endpoint to break into. This makes it unsuitable for large multi-user teams but perfect for controlled environments or single-node setups.
Clients ─ mini-apigw ─┬─ OpenAI
                      ├─ Anthropic
                      ├─ Ollama
                      └─ vLLM / custom backends
mini-apigw exposes a single /v1/ endpoint structure and routes each request to the appropriate backend according to the model name, aliases, or policy rules. Backends can be cloud APIs or local inference servers. The gateway supports both streaming and non-streaming responses and can transform metadata and logging as needed.
The most important feature, however, is arbitration.
Most AI frameworks today assume exclusive GPU access. In multi-user, multi-service setups, that assumption fails spectacularly. Two LLMs start loading, each thinks it owns the GPU, and everything crashes - sometimes even the whole system, given the state of current GPU compute frameworks.
mini-apigw introduces sequence groups - small arbitration queues per group of backends that serialize model loads and executions. The effect is similar to thread pools or database connection pools: predictable throughput, no GPU thrashing, and clean recovery.
It’s a rediscovery of a concept that mainstream software engineering once valued deeply. Java EE, for example, had sophisticated thread pools and resource managers ensuring fairness and throughput under load. Modern AI software, in contrast, is a jungle of processes fighting for VRAM without a referee. mini-apigw brings that sanity back.
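Conceptually, a sequence group is nothing more than a shared lock (or single-slot queue) that every backend in the group has to hold while it loads a model or runs inference. The following asyncio sketch only illustrates the idea - it is not the gateway’s actual implementation - and borrows the group name local_gpu_01 from the backend configuration shown later:

import asyncio

# One lock per sequence group: every backend assigned to "local_gpu_01"
# has to hold the same lock while loading a model or generating.
SEQUENCE_GROUPS = {}

def group_lock(name):
    # Lazily create the per-group lock (illustrative only).
    if name not in SEQUENCE_GROUPS:
        SEQUENCE_GROUPS[name] = asyncio.Lock()
    return SEQUENCE_GROUPS[name]

async def run_on_backend(backend, sequence_group, payload):
    if sequence_group is None:
        return await backend(payload)          # ungrouped backends run freely
    async with group_lock(sequence_group):     # grouped requests are serialized
        return await backend(payload)

async def fake_local_llm(payload):
    await asyncio.sleep(1)                     # stands in for model load + generation
    return f"done: {payload}"

async def main():
    # Both requests target the same group, so they execute strictly one after the other.
    print(await asyncio.gather(
        run_on_backend(fake_local_llm, "local_gpu_01", "request A"),
        run_on_backend(fake_local_llm, "local_gpu_01", "request B"),
    ))

asyncio.run(main())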
Installation of the application is simple thanks to PyPI:
$ pip install mini-apigw
One can also install the gateway from the cloned GitHub repository as an editable in-place installation for development:
$ git clone git@github.com:tspspi/mini-apigw.git
$ cd mini-apigw
$ pip install -e .
Start the gateway manually:
$ mini-apigw --config ~/.config/mini-apigw/
Unix domain socket & reverse proxy: In my personal deployment I usually run the daemon bound to a Unix domain socket and expose it through Apache (or another reverse proxy) using ProxyPass and ProxyPassReverse to the local socket. This has several operational advantages. One of them is rate limiting: Apache’s mod_qos provides proven real-world solutions, which you can just drop in with your usual configuration without having to implement anything in the API gateway itself. A typical configuration of the reverse proxy - in this case Apache httpd - may look like the following:
<VirtualHost *:80>
    ServerName api.example.com
    ServerAdmin complains@example.com
    DocumentRoot /usr/www/www.example.com/www/

    ProxyPass / "unix:/var/run/mini-apigw.sock|http://localhost/"
    ProxyPassReverse / "unix:/var/run/mini-apigw.sock|http://localhost/"

    <LocationMatch "^/(admin|stats)">
        AuthType Basic
        AuthName "mini-apigw admin"
        AuthUserFile "/usr/local/etc/httpd/miniapigw-admin.htpasswd"
        <RequireAll>
            Require valid-user
            Require ip 127.0.0.1 ::1 192.168.1.0/24
        </RequireAll>
    </LocationMatch>
</VirtualHost>
<VirtualHost *:443>
    ServerName api.example.com
    ServerAdmin complains@example.com
    DocumentRoot /usr/www/www.example.com/www/

    ProxyPass / "unix:/var/run/mini-apigw.sock|http://localhost/"
    ProxyPassReverse / "unix:/var/run/mini-apigw.sock|http://localhost/"

    SSLEngine on
    SSLOptions +StdEnvVars
    # SSLVerifyClient optional
    SSLVerifyDepth 5
    SSLCertificateFile "/usr/www/www.example.com/conf/ssl.cert"
    SSLCertificateKeyFile "/usr/www/www.example.com/conf/ssl.key"
    SSLCertificateChainFile "/usr/www/www.example.com/conf/ssl.cert"
    # SSLCACertificateFile "/usr/www/www.example.com/conf/ca01_01.cert"

    <LocationMatch "^/(admin|stats)">
        AuthType Basic
        AuthName "mini-apigw admin"
        AuthUserFile "/usr/local/etc/httpd/miniapigw-admin.htpasswd"
        <RequireAll>
            Require valid-user
            Require ip 127.0.0.1 ::1 192.0.2.0/24
        </RequireAll>
    </LocationMatch>
</VirtualHost>
mini-apigw can also listen directly on TCP (IPv6 by default, legacy IPv4 if configured) when that is preferred, but for controlled server deployments the Unix domain socket + reverse proxy pattern tends to make more sense from a system administration perspective.
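When the daemon only listens on its Unix domain socket, it can still be queried without any reverse proxy in between. A small standard-library sketch - the socket path is the one from the Apache example above, the API key is the demo key from apps.json further below, and /v1/models is assumed to be forwarded like any other /v1/* route:

import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket using only the standard library."""
    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)

conn = UnixHTTPConnection("/var/run/mini-apigw.sock")
conn.request("GET", "/v1/models", headers={"Authorization": "Bearer secretkey1"})
resp = conn.getresponse()
print(resp.status, json.loads(resp.read()))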
Daemon configuration (daemon.json)
The main daemon configuration daemon.json defines:
- The listen configuration. It usually contains a unix_socket (and the port is unused). One can alternatively specify an ipv4 or ipv6 field containing the listen addresses.
- The admin section defines where the admin endpoint is located and who can access it.
- The logging configuration sets the log level and the log file, with syslog output as an option.
- The database configuration specifies an optional PostgreSQL database. If it is specified, an accounting log is written into the database.
- The reload option allows one to disable or enable reloading of configuration files using a SIGHUP handler.
{
    "listen": {
        "unix_socket" : "/usr/home/tsp/miniapigw/gw.sock",
        "port": 8080
    },
    "admin": {
        "stats_networks": ["127.0.0.1/32", "::1/128", "192.0.2.0/24" ]
    },
    "logging": {
        "level": "INFO",
        "redact_prompts": false,
        "access_log": true,
        "file" : "/var/log/miniapigw.log"
    },
    "database" : {
        "host" : "192.0.2.3",
        "database" : "database_name",
        "username" : "database_user",
        "password" : "database_secret"
    },
    "reload": {
        "enable_sighup": true
    },
    "timeouts": {
        "default_connect_s": 60,
        "default_read_s": 600
    }
}
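Since the daemon re-reads these files on SIGHUP, it can be worth checking an edited configuration before triggering a reload. A trivial sanity check along these lines (the path is the default configuration directory mentioned above) at least avoids reloading a file that does not even parse:

import json
import os

CONFIG = os.path.expanduser("~/.config/mini-apigw/daemon.json")

with open(CONFIG) as f:
    cfg = json.load(f)                     # raises ValueError on malformed JSON

# Minimal structural checks against the fields described above.
assert "listen" in cfg, "daemon.json should contain a listen section"
assert "unix_socket" in cfg["listen"] or "port" in cfg["listen"], \
    "listen needs either a unix_socket or a port"
print("daemon.json parses and has a listen section")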
Application configuration (apps.json)
The apps.json file contains the configuration for the applications that can access the API gateway:
- Every application has an app_id (this should be machine readable; I’d not recommend special characters there) as well as a name. The app_id has to be unique. The name can be any description of the application.
- The api_keys array contains a list of API keys. Those are transparent bearer tokens; at this moment they are not parsed by the gateway in any way. They also have to be uniquely assigned to one application (i.e. the same API key must not be assigned to different applications).
- The policy allows one to specify which models are allowed for this application (this can include alias definitions in the backend). If the allow whitelist is not used, a blacklist can be used via deny.
- The cost_limit enforces a rough resource cap on each application. This is particularly useful when designing automatic systems, to prevent them from running havoc and billing you thousands of EUR/USD on your credit card. It’s good to have safeguards in place.
- The trace configuration allows you to define a JSONL log file into which every request is logged. Depending on the three configuration options below, it logs different aspects of the requests. This allows one to trace what an application has been doing without having to implement logging in the application itself. In case imagedir is specified, all generated graphics from this application are also archived in the specified directory to keep a trace of what has been generated.
{
    "apps" : [
        {
            "app_id": "demo",
            "name": "Demo application",
            "api_keys": [
                "secretkey1",
                "secretkey2"
            ],
            "policy": {
                "allow": [ "llama3.2", "gpt-4o-mini", "gpt-oss", "llama3.2:latest", "text-embedding-3-small", "nomic-embed-text", "dall-e-3" ],
                "deny": []
            },
            "cost_limit": {
                "period": "day",
                "limit": 10.0
            },
            "trace": {
                "file": "/var/log/miniapigw/logs/demo.jsonl",
                "imagedir": "/var/log/miniapigw/logs/demo.images/",
                "includeprompts": true,
                "includeresponse": true,
                "includekeys": true
            }
        }
    ]
}
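The per-application trace file is plain JSONL - one JSON object per request - so it can be inspected without any special tooling. The exact fields depend on the includeprompts/includeresponse/includekeys switches above, so the sketch below only assumes one valid JSON object per line:

import json

TRACE = "/var/log/miniapigw/logs/demo.jsonl"    # path from the example above

entries = []
with open(TRACE) as f:
    for line in f:
        line = line.strip()
        if line:
            entries.append(json.loads(line))    # one traced request per line

print(f"{len(entries)} requests traced for this application")
if entries:
    # Show which fields the first record actually contains.
    print("fields in first record:", sorted(entries[0].keys()))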
Backend configuration (backends.json)
This is the place where one defines which backends are available and which sequence groups they belong to. In addition, aliases are defined here. The file is one huge JSON dictionary.
The aliases section is a simple dictionary mapping from an arbitrary string to an actual model name. The model name later on selects the backend. In the following example one can see that some aliases have been used to select model sizes or versions. In addition, a transparent name called blogembed has been used. This is a technique that I also use on my personal gateway to select the embeddings used by the tools operating on this blog. All tools use the transparent name blogembed when querying the gateway. If I ever want to switch to a different embedding, I just have to change the mapping in the alias; the tools detect the different size of the embeddings and regenerate their indices.
The next section is sequence_groups. This is a dictionary that contains one entry per so-called sequence group. All requests that go to backends belonging to the same sequence group are executed serially, never in parallel. Other requests may be processed in parallel.
The following list of backends is then the main configuration of the backends. As one can see, every backend has:
- A base_url, the api_key required to access the remote host, etc. For backends like Fooocus one will also be able to specify things like selected styles, the used models and refiners, and other parameters.
- A supports list that defines which models are exposed for the different operations. Those are exposed to the routing framework. The selection of the backends operates on the model names used here - a client requesting for example gpt-4o-mini for chat will be routed to the openai-primary backend, a client requesting llama3.2:latest for completion will be routed to ollama-local.
- A cost configuration that allows one to specify how much each of the requests costs per token. This is not fully implemented and is part of the safeguard against runaway applications.
{
    "aliases": {
        "llama3.2" : "llama3.2:latest",
        "gpt-oss" : "gpt-oss:20b",
        "llama3.2-vision" : "llama3.2-vision:latest",
        "blogembed" : "mxbai-embed-large:latest"
    },
    "sequence_groups": {
        "local_gpu_01": {
            "description": "Serialized work for local GPU tasks"
        }
    },
    "backends": [
        {
            "type": "openai",
            "name": "openai-primary",
            "base_url": "https://api.openai.com/v1",
            "api_key": "YOUROPENAI_PLATFORM_KEY",
            "concurrency": 1,
            "supports": {
                "chat": [ "gpt-4o-mini" ],
                "embeddings": [ "text-embedding-3-small" ],
                "images": [ "dall-e-3" ]
            },
            "cost": {
                "currency": "usd",
                "unit": "1k_tokens",
                "models": {
                    "gpt-4o-mini": {"prompt": 0.002, "completion": 0.004},
                    "text-embedding-3-small": {"prompt": 0.0001, "completion": 0.0}
                }
            }
        },
        {
            "type": "ollama",
            "name": "ollama-local",
            "base_url": "http://192.0.2.1:8182",
            "sequence_group": "local_gpu_01",
            "concurrency": 1,
            "supports": {
                "chat": ["llama3.2:latest", "gpt-oss:20b", "llama3.2-vision:latest"],
                "completions" : ["llama3.2:latest"],
                "embeddings": ["nomic-embed-text", "mxbai-embed-large:latest"]
            },
            "cost": {
                "models": {
                    "llama3.2:latest": {"prompt": 0.0, "completion": 0.0},
                    "gpt-oss:20b": {"prompt": 0.0, "completion": 0.0},
                    "nomic-embed-text": {"prompt": 0.0, "completion": 0.0}
                }
            }
        }
    ]
}
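The routing and cost rules this file describes boil down to a few lines: resolve the alias, pick the first backend whose supports list contains the resolved model for the requested operation, and (eventually) price the request from the per-1k-token cost table. The following is a simplified sketch of that logic against the example configuration above - not the gateway’s actual code:

import json
import os

# Load the example backends.json shown above.
with open(os.path.expanduser("~/.config/mini-apigw/backends.json")) as f:
    cfg = json.load(f)

def resolve_backend(model, operation):
    """Resolve aliases and pick the first backend supporting the model for this operation."""
    real = cfg["aliases"].get(model, model)       # e.g. "blogembed" -> "mxbai-embed-large:latest"
    for backend in cfg["backends"]:
        if real in backend["supports"].get(operation, []):
            return backend, real
    raise ValueError(f"no backend supports {real} for {operation}")

def estimate_cost(backend, model, prompt_tokens, completion_tokens):
    """Rough request cost from the per-1k-token prices (the basis of the cost_limit safeguard)."""
    prices = backend.get("cost", {}).get("models", {}).get(model)
    if not prices:
        return 0.0
    return prompt_tokens / 1000.0 * prices["prompt"] + completion_tokens / 1000.0 * prices["completion"]

backend, real = resolve_backend("gpt-4o-mini", "chat")
print(backend["name"], real)                      # openai-primary gpt-4o-mini
print(estimate_cost(backend, real, 1200, 300))    # 1.2 * 0.002 + 0.3 * 0.004 = 0.0036

backend, real = resolve_backend("blogembed", "embeddings")
print(backend["name"], real)                      # ollama-local mxbai-embed-large:latest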
Then any OpenAI-compatible client can use it transparently:
import openai
openai.api_base = "http://localhost:8080/v1"
openai.api_key = "sk-..."
response = openai.ChatCompletion.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum tunneling."}]
)
mini-apigw will automatically pick the right backend (Ollama in this case) and manage concurrency and logging.
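The snippet above uses the pre-1.0 interface of the openai package. With openai >= 1.0 the same request against the gateway looks like this (same base URL, with one of the application keys from apps.json):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",    # the gateway, not api.openai.com
    api_key="secretkey1",                   # an application key from apps.json
)

response = client.chat.completions.create(
    model="llama3.2",                       # alias, resolved by the gateway to llama3.2:latest
    messages=[{"role": "user", "content": "Explain quantum tunneling."}],
)
print(response.choices[0].message.content)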
To ease the creation of API keys - these are only transparent bearer tokens, so actually just arbitrary strings - the mini-apigw client implements the token command. This creates a random access token that can then be used in the application configuration. At the moment of writing the API tokens are treated as transparent sequences of bytes. In a later stage they will be JWTs that include permissions for the given clients to allow end-to-end authorization.
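Since the keys are just arbitrary strings for now, any source of sufficient randomness works equally well. Independent of the token subcommand, a one-liner like the following produces a usable key for the api_keys array:

import secrets

# 32 random bytes, URL-safe encoded - paste the output into apps.json.
print(secrets.token_urlsafe(32))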
Note that API keys should never be used over plain http except on the local network or over the Unix domain socket. Always use https.
Starting the service can be done via the mini-apigw command line interface using the start subcommand (or without any subcommand), or via an rc.init script in case one runs on FreeBSD. Stopping and reloading configuration can be done using two distinct mechanisms:
- Signals: on SIGHUP the daemon reloads its configuration from the JSON files, on SIGTERM the daemon shuts down.
- The command line interface’s stop and reload commands.
The rc.init script also supports checking the status of the daemon using the PID file.
#!/bin/sh
# PROVIDE: mini_apigw
# REQUIRE: LOGIN
# KEYWORD: shutdown
. /etc/rc.subr
name="mini_apigw"
rcvar="mini_apigw_enable"
load_rc_config $name
: ${mini_apigw_enable:="NO"}
: ${mini_apigw_command:="/usr/local/bin/mini-apigw"}
: ${mini_apigw_config_dir:="/usr/local/etc/mini-apigw"}
: ${mini_apigw_user:="mini-apigw"}
: ${mini_apigw_pidfile:="${mini_apigw_config_dir}/mini-apigw.pid"}
: ${mini_apigw_unix_socket:="${mini_apigw_config_dir}/mini-apigw.sock"}
: ${mini_apigw_flags:=""}
: ${mini_apigw_timeout:="10"}
command="${mini_apigw_command}"
pidfile="${mini_apigw_pidfile}"
required_files="${mini_apigw_config_dir}/daemon.json"
extra_commands="reload status"
start_cmd="${name}_start"
stop_cmd="${name}_stop"
reload_cmd="${name}_reload"
status_cmd="${name}_status"
mini_apigw_build_args()
{
    _subcmd="$1"
    shift
    _cmd="${command} ${_subcmd} --config-dir \"${mini_apigw_config_dir}\""
    if [ -n "${mini_apigw_unix_socket}" ]; then
        _cmd="${_cmd} --unix-socket \"${mini_apigw_unix_socket}\""
    fi
    for _arg in "$@"; do
        _cmd="${_cmd} ${_arg}"
    done
    if [ -n "${mini_apigw_flags}" ]; then
        _cmd="${_cmd} ${mini_apigw_flags}"
    fi
    echo "${_cmd}"
}

mini_apigw_run()
{
    _cmd=$(mini_apigw_build_args "$@")
    if [ "$(id -un)" = "${mini_apigw_user}" ]; then
        /bin/sh -c "${_cmd}"
    else
        su -m "${mini_apigw_user}" -c "${_cmd}"
    fi
}

mini_apigw_start()
{
    mini_apigw_run start
}

mini_apigw_stop()
{
    mini_apigw_run stop --timeout "${mini_apigw_timeout}"
}

mini_apigw_reload()
{
    mini_apigw_run reload --timeout "${mini_apigw_timeout}"
}

mini_apigw_status()
{
    if [ ! -f "${pidfile}" ]; then
        echo "${name} is not running"
        return 1
    fi
    _pid=$(cat "${pidfile}" 2>/dev/null)
    if [ -z "${_pid}" ]; then
        echo "${name} pidfile exists but is empty"
        return 1
    fi
    if kill -0 "${_pid}" 2>/dev/null; then
        echo "${name} running as pid ${_pid}"
        return 0
    fi
    echo "${name} pidfile exists but process not running"
    return 1
}
run_rc_command "$1"
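Outside the rc framework, the same reload can be triggered by signalling the daemon directly. A small sketch using the PID file path from the rc script variables above:

import os
import signal

PIDFILE = "/usr/local/etc/mini-apigw/mini-apigw.pid"   # ${mini_apigw_pidfile} default from above

with open(PIDFILE) as f:
    pid = int(f.read().strip())

os.kill(pid, signal.SIGHUP)   # SIGHUP: re-read daemon.json, apps.json and backends.json
print(f"sent SIGHUP to mini-apigw (pid {pid})")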
Note that the API keys are currently stored in plain text in the apps.json configuration file. This will be fixed in later iterations; it is of course bad design that has been chosen for simplicity for now. A quick fix later on will be to store just hashes there. This is on the ToDo list (and maybe already done at the moment you read this article). Also keep in mind: never use plain http over any public or not fully trusted network!
mini-apigw is a reminder that simplicity and control can coexist. It’s about reclaiming responsibility for resources and infrastructure in a world where every AI application assumes it is alone. It’s not a massive platform - it’s a scalpel: small, precise, and reliable.
When others build towers of YAML and Kubernetes operators - or require loads of virtual environments and Docker containers to be deployed without any control over their content - sometimes all you need is a well-behaved little daemon that keeps the peace between your models.
This utility has been designed to solve a given simple task in a simple environment. It will never scale to a huge cluster and it will not scale to any worldwide operation; it has never been designed to do so. It’s there to solve a small local problem. And it has worked flawlessly so far.
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/