05 Jul 2026 - tsp
Last update 05 Jul 2026
19 mins
So we all know OpenAI's codex is totally awesome. Especially with GPT 5.5 it is amazing as soon as you know how to use it (which means: don’t tell it “write me an app”, but handle it like a senior developer would handle their juniors). But we also all know that any subscription runs out of credits very quickly, and when one uses an on-demand plan - for example via the platform API - one goes broke very quickly because it gets expensive. Especially with the looping /goal one can accumulate cost very fast.
In addition it is already very simple to run local large language models on consumer hardware, for example using ollama, even on small scale systems. These models are pretty amazing for small scale local jobs that usually do not require long term planning: knowledge graph extraction, embedding vectors, simple chats, interpreting your e-mails and similar applications. Running locally trades the continuous cost of subscriptions for upfront hardware cost and electricity cost.
The main problem with self hosted LLMs at the moment is the available VRAM (or RAM on shared memory systems like Apple Silicon or the new Ryzen AI series). Even on larger consumer systems one is pretty limited. To run a large model like GLM 5.2 on your own hardware in a reasonable way you would need around 8x H100 80GB or 4 to 8 H200 141GB GPUs, a high end x86 CPU with AVX512, more system memory than VRAM (i.e. we are talking about 1TB of RAM) - and of course a mainboard, power supplies and cooling facilities to run that system. Technically doable, but usually in the price range of around 200.000 - 300.000 Eur for the GPUs alone. This is not feasible for any private personal usage (at least if you are not a billionaire, and if you are, you will most likely know that this would be a total waste of money for personal use and would still fall back to a hosted cloud service).
Smaller models, though, work very well on consumer hardware. The 35B MoE models run on a single 24GB VRAM GPU, 70B models on somewhat larger GPUs. In my experience they are noticeably weaker at long-range planning, maintaining coherence over many reasoning steps, and robust recovery from mistakes (which is likely due to missing emergent properties of larger scale models). Still, many projects have shown that even small scale models can operate perfectly well for specific tasks when run in agentic frameworks that exploit planning and refinement loops. And that is exactly the idea behind using such models with orchestrators like codex. For more complex tasks a growing context can introduce problems with these models though, as we will see in the bugs section of this article.
TL;DR: You can use small scale models with codex and sometimes the output is acceptable. It's totally not comparable with the performance of frontier models like GPT5.4 or GPT5.5 though. If you want to have a robust and efficient assistant for software development, don’t bother using the ollama backend. If you can accept many quirks, runaway context and sometimes injected deletions, and want to play around, using a smaller model with codex might get interesting.
ollama Directly
I assume in the following that we are going to use ollama as the runtime for the large language models, since this is the easiest runtime available for end users. One could of course also use vLLM, which is the runtime of choice for distributed MoE models, or the llama.cpp server.
The best model that I have used so far on smaller scale systems like the dual 12GB RTX3060 setup was, in my opinion, qwen3.6:35b-a3b, which is also directly available in ollama's model library. This model is a mixture of experts model with only about 3B parameters active per token, while being a 35B model overall. Note that OpenAI themselves usually suggest (and target codex at) their own gpt-oss model. To install the qwen3.6-35B-A3B model on ollama one can simply pull it via
ollama pull qwen3.6:35b-a3b
Unfortunately the default parameters for this model are not really suited for operation with codex, especially due to the thresholds for repeated token output. The default parameters I found in the latest version I played with were:
Model
architecture qwen35moe
parameters 36.0B
context length 262144
embedding length 2048
quantization Q4_K_M
Capabilities
completion
vision
tools
thinking
Parameters
presence_penalty 1.5
repeat_penalty 1
temperature 1
top_k 20
top_p 0.95
min_p 0
License Apache License Version 2.0, January 2004
An explanation of the parameters can be found in the appendix. The main problems for useful execution with codex were:
- The temperature. For coding tasks it’s a good idea to keep the temperature low, since a high temperature flattens the probability distribution and lets the sampler get creative. This sometimes leads to swapped keywords (like continue instead of return) or similar mistakes.
- The context size (num_ctx). A huge context is good. But it also needs a huge amount of VRAM for the KV cache. One may want to limit the context window to 64k (65536) or 128k (131072) to keep the size of the KV cache manageable.
- The sampling filters top_k, top_p and min_p, whose defaults are also tuned for creativity and are not optimal for coding due to the probability of swapping keywords.

This led to a modified Modelfile for the model. I used the following parameters, already overriding the system prompt with some very simple instructions that turned out to be sufficient:
FROM qwen3.6:35b-a3b
PARAMETER num_ctx 65536
PARAMETER temperature 0.2
PARAMETER top_k 40
PARAMETER top_p 0.8
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.08
PARAMETER repeat_last_n 1024
PARAMETER presence_penalty 0
SYSTEM """
You are a coding agent. Be precise and concise.
Do not repeat yourself.
Do not emit long hidden reasoning.
When using tools, produce valid tool calls only.
When you are uncertain, summarize the uncertainty and stop rather than looping.
"""
This can then be used to create a new model identity:
ollama create qwen3.6:35b-a3b-codex -f Modelfile
This turned out to work way better for me than the unmodified model.
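Before pointing codex at the new model identity it can be useful to send it a single prompt by hand. The following is a minimal sketch, assuming ollama listens on its default address (127.0.0.1:11434, adjust this to your setup) and using ollama's `/api/generate` endpoint in non-streaming mode. The function names are made up for this illustration.

```python
import json
import urllib.request

# Assumption: ollama's default listen address. Change it if your instance
# runs on another host or port.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_request(prompt, model="qwen3.6:35b-a3b-codex", url=OLLAMA_URL):
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a chunk stream
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to ollama and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Write a Python function that reverses a string.")` should then return the answer of the tuned model, using the parameters and system prompt baked into the Modelfile.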
ollama Directly
So first: how can one use this when codex communicates directly with ollama via the responses API? One needs a profile configuration and a model catalog file, both described in the following.
The profile configuration resides in ~/.codex/NAME.config.toml (I personally used ~/.codex/qwen.config.toml). Note that the model_catalog_json path and name are arbitrary, the base_url of course has to point to the ollama instance at the correct port.
model_provider = "ollama-local"
model_context_window = 65536
model_auto_compact_token_limit = 56000
model_catalog_json = "/usr/home/USERNAME/.codex/qwen-models.json"
[model_providers.ollama-local]
name = "Ollama local"
base_url = "http://198.51.100.1:1234/v1"
wire_api = "responses"
requires_openai_auth = false
supports_websockets = false
stream_idle_timeout_ms = 3000000
stream_max_retries = 1
What these settings do is:
- Select the ollama-local provider and set the context window, which has to be consistent with the num_ctx from before.
- Reference the model_catalog_json that contains metadata about the model itself.

In addition, the metadata file specified under model_catalog_json is required to provide capability information about the model to the codex runtime. This provides codex with information about the properties of the template, the reasoning support of the model, the reasoning levels, again the context window configuration (one again has to ensure that this is consistent with the num_ctx of the model) as well as embedded tool support:
{
"models": [
{
"slug": "qwen3.6:35b-a3b-codex",
"display_name": "Qwen3.6 35B A3B Codex",
"description": "Local Qwen3.6 35B A3B via Ollama, tuned for Codex OSS.",
"provider": "ollama-local",
"visibility": "list",
"supported_in_api": true,
"priority": 100,
"default_reasoning_level": "low",
"supported_reasoning_levels": [
{
"effort": "low",
"description": "Fast local reasoning"
},
{
"effort": "medium",
"description": "Balanced local reasoning"
}
],
"supports_reasoning_summaries": false,
"default_reasoning_summary": "none",
"support_verbosity": true,
"default_verbosity": "low",
"shell_type": "shell_command",
"apply_patch_tool_type": "freeform",
"web_search_tool_type": "text_and_image",
"supports_parallel_tool_calls": false,
"supports_image_detail_original": false,
"context_window": 32768,
"max_context_window": 32768,
"effective_context_window_percent": 85,
"truncation_policy": {
"mode": "tokens",
"limit": 10000
},
"experimental_supported_tools": [],
"input_modalities": ["text"],
"supports_search_tool": false,
"base_instructions": "You are Codex, a coding agent. Work in short, precise steps. Use tools carefully. Do not repeat yourself. When tool calls are needed, emit valid tool calls only. If you are stuck, summarize the blocker and stop instead of looping."
}
]
}
Having those files in place it is possible to launch codex using
codex --oss -p qwen -m qwen3.6:35b-a3b-codex
or, setting the CODEX_OSS_BASE_URL via the environment instead:
env CODEX_OSS_BASE_URL="http://198.51.100.1:1234/v1" codex --oss -p qwen -m qwen3.6:35b-a3b-codex
This is already enough to use the models.
Unfortunately the following error appeared most of the time when something failed. The exact codex output is
stream disconnected before completion: stream closed before response.completed
This turned out to have several different causes:
The most common error is the last one: the model aborting due to repeated patterns. This happens especially when the context window reaches a larger size, causing Qwen's output to degenerate into a thesaurus-like continuation. This is one of the things that happen when running small large language models with a larger context. To approach those problems one:
This is one of the major drawbacks of small self hosted models in comparison to well tuned large scale cloud models.
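To make the idea of "repeated patterns" a bit more concrete, here is a minimal sketch of a watchdog that checks whether an output ends in the same short sequence over and over again. This is purely illustrative and not what ollama or codex do internally. The function name and thresholds are made up, and whitespace-separated words stand in for real tokens.

```python
def repeated_tail(tokens, min_ngram=3, max_ngram=20, min_repeats=4):
    """Return the n-gram the sequence ends with `min_repeats` times in a row.

    Returns None if no such repetition is found.
    """
    length = len(tokens)
    for n in range(min_ngram, max_ngram + 1):
        if length < n * min_repeats:
            break  # not enough tokens for this or any longer n-gram
        tail = tokens[length - n:]
        # Compare the last `min_repeats` consecutive windows of size n.
        if all(
            tokens[length - (k + 1) * n:length - k * n] == tail
            for k in range(min_repeats)
        ):
            return tail
    return None


healthy = "run the tests and then fix the failing assertion".split()
looping = "first we check".split() + "the same thing".split() * 5

print(repeated_tail(healthy))
print(repeated_tail(looping))
```

A caller could use such a check to abort a request and retry with a compacted context instead of waiting for the stream to die.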
Broken tool calls also happened from time to time when using qwen models in codex. At some point, after they have run for some time, they seem to stop producing proper tool calling output. Then the console gets flooded with output like
<function=exec_command>
<parameter=cmd>
python3 -m pytest tests/ -v --tb=short
</parameter>
</function>
</tool_call>
It's pretty obvious that this happens whenever the model stops emitting the initial <tool_call> tag - for whatever reason. Like most bugs, this also happens with a growing context.
It seems qwen - as soon as the context grows - loves to approach problems with a simple solution: it often suggests deleting the entire codebase over a simple indentation error. This is well known from smaller models of the past when the context grows. It behaves like many beginners in this case: it does not try to understand bugs but simply drops everything and wants to start over.
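The missing opening tag can be detected mechanically. The following is a minimal sketch of a hypothetical post-processing helper (it is not part of codex) that scans plain text output for function blocks in the format shown above and reports those that are not preceded by an opening <tool_call> tag.

```python
import re

FUNCTION_BLOCK = re.compile(
    r"<function=(?P<name>[\w.-]+)>(?P<body>.*?)</function>", re.DOTALL
)
PARAMETER = re.compile(
    r"<parameter=(?P<key>[\w.-]+)>\s*(?P<value>.*?)\s*</parameter>", re.DOTALL
)


def find_orphaned_tool_calls(text):
    """Return function blocks that lack their opening <tool_call> tag."""
    orphans = []
    for match in FUNCTION_BLOCK.finditer(text):
        before = text[:match.start()]
        # The block is properly opened only if the most recent tool_call
        # tag before it is an opening tag and not a closing one.
        opened = before.rfind("<tool_call>") > before.rfind("</tool_call>")
        if not opened:
            arguments = {
                m["key"]: m["value"] for m in PARAMETER.finditer(match["body"])
            }
            orphans.append({"name": match["name"], "arguments": arguments})
    return orphans


broken = (
    "<function=exec_command>\n"
    "<parameter=cmd>\n"
    "python3 -m pytest tests/ -v --tb=short\n"
    "</parameter>\n"
    "</function>\n"
    "</tool_call>"
)
print(find_orphaned_tool_calls(broken))
print(find_orphaned_tool_calls("<tool_call>\n" + broken))
```

The first call reports the exec_command block from the example output, the second one returns an empty list because the opening tag is present.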
temperature is the best-known parameter. The model computes probabilities
$$ P(t_i) $$
for each possible next token. Temperature rescales these probabilities before sampling:
$$ P'(t_i) \propto P(t_i)^{\frac{1}{T}} $$
where $T$ is the temperature.
At a low temperature ($T < 1$) the highest probability token almost always wins. For example, if the network outputs
| Probability | Token |
|---|---|
| 0.82 | return |
| 0.09 | yield |
| 0.05 | break |
| 0.04 | continue |
At temperature 0.1 the model will nearly always emit return, which is excellent for programming.
At $T = 1$ the distribution is unchanged. The model occasionally chooses the second or third best option (inserting yield or break instead of return).
At a high temperature ($T > 1$) the distribution becomes flatter, which is amazing for creative writing. Unlikely tokens may be sampled. Absolutely terrible for programming.
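The effect of the temperature on the example table can be reproduced in a few lines. This is a minimal sketch that applies the rescaling $P'(t_i) \propto P(t_i)^{1/T}$ directly to the four example probabilities and renormalises them. Real runtimes apply the temperature to the raw scores of the whole vocabulary, and the function name is made up for this illustration.

```python
def apply_temperature(probs, temperature):
    """Rescale probabilities as P'(t) proportional to P(t)^(1/T), then renormalise."""
    scaled = {token: p ** (1.0 / temperature) for token, p in probs.items()}
    total = sum(scaled.values())
    return {token: value / total for token, value in scaled.items()}


# The example distribution from the table above.
probs = {"return": 0.82, "yield": 0.09, "break": 0.05, "continue": 0.04}

for t in (0.1, 1.0, 2.0):
    rescaled = apply_temperature(probs, t)
    print(t, {token: round(p, 4) for token, p in rescaled.items()})
```

At a temperature of 0.1 return ends up with practically the entire probability mass, at 1.0 the table is reproduced unchanged, and at 2.0 return drops to about 0.56 while the alternatives gain accordingly.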
top_k limits the tokens that undergo probabilistic selection to the k most probable ones. If one applies top_k = 2 to the above example data set, the model will only choose from the two options return and yield.
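As a minimal sketch, this filter can be written as follows for the example distribution. The function name is made up, and the renormalisation of the remaining candidates is my assumption about how the pool is sampled afterwards.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    best = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in best)
    return {token: p / total for token, p in best}


probs = {"return": 0.82, "yield": 0.09, "break": 0.05, "continue": 0.04}
print(top_k_filter(probs, 2))
```

Only return and yield survive, with return now at roughly 0.90 of the remaining probability mass.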
top_p is a filter that does not fix the number of best candidates. Instead, candidates are selected by adding up their probabilities, starting with the most likely token: as long as the total probability in the selected pool stays below top_p, another token is added to the candidate list. This is also called nucleus sampling.
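The cumulative selection can be sketched like this, again using the example distribution from the temperature section. The function name is made up and the final renormalisation is an assumption of this illustration.

```python
def top_p_filter(probs, top_p):
    """Add tokens in order of probability until the pool reaches top_p."""
    pool, cumulative = [], 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        pool.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the pool now covers the requested probability mass
    return {token: p / cumulative for token, p in pool}


probs = {"return": 0.82, "yield": 0.09, "break": 0.05, "continue": 0.04}
for threshold in (0.8, 0.9, 0.95):
    print(threshold, list(top_p_filter(probs, threshold)))
```

With a threshold of 0.8 only return remains, 0.9 adds yield, and 0.95 also lets break into the pool.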
min_p drops every token whose probability is below $P_\mathrm{maxtoken} \cdot \mathrm{min}_p$, where $P_\mathrm{maxtoken}$ is the probability of the most likely token. In many cases this produces more stable results than filtering via top_p.
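A minimal sketch of this filter for the example distribution, using the value 0.05 from the Modelfile above. The function name is made up and the renormalisation is an assumption of this illustration.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the best probability."""
    threshold = max(probs.values()) * min_p
    kept = {token: p for token, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


probs = {"return": 0.82, "yield": 0.09, "break": 0.05, "continue": 0.04}
print(list(min_p_filter(probs, 0.05)))
```

The threshold is 0.82 * 0.05 = 0.041 here, so continue (0.04) is dropped while break (0.05) just stays in the pool.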
repeat_penalty is used as a divisor for the score of every token that has recently been emitted. This discourages the repeated emission of the same token without suppressing it completely. Typical values are in the range of 1 to 1.15. If the value is too high, legitimate repetition gets suppressed.
repeat_last_n is the sliding window that repeat_penalty is applied to. Larger values prevent long loops but can slightly reduce consistency because earlier identifiers are also penalized.
presence_penalty is similar to repeat_penalty, but it does not scale with how often a token was encountered: it is applied once to every token that appeared at least once in the sliding window. This reduces the probability of word repetition in text. Large presence penalties are usually undesirable for programming because identifiers and keywords often need to repeat consistently.
frequency_penalty is a penalty that scales with the number of times a token has already been used. Again, not optimal for code.
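How the three penalties differ is easiest to see side by side. The following is a minimal sketch that assumes the penalties act on the raw token scores (logits) in the way common open source samplers implement them: repeat_penalty divides positive scores and multiplies negative ones, presence_penalty is subtracted once, frequency_penalty is subtracted once per occurrence. The function name and the example scores are made up, and the sliding window of repeat_last_n is modelled by passing only the most recent tokens.

```python
from collections import Counter


def penalise(logits, recent_tokens, repeat_penalty=1.0,
             presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of the scores with the repetition penalties applied."""
    counts = Counter(recent_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token not in adjusted:
            continue
        value = adjusted[token]
        # Divide positive scores and multiply negative ones so that the
        # penalty always makes the token less likely.
        value = value / repeat_penalty if value > 0 else value * repeat_penalty
        value -= presence_penalty            # flat, applied once per token
        value -= frequency_penalty * count   # grows with every occurrence
        adjusted[token] = value
    return adjusted


logits = {"foo": 2.0, "bar": 1.0, "baz": -1.0}
history = ["foo", "foo", "baz", "qux"]
repeat_last_n = 3
window = history[-repeat_last_n:]  # only the most recent tokens count

print(penalise(logits, window, repeat_penalty=2.0))
print(penalise(logits, window, presence_penalty=0.5))
print(penalise(logits, window, frequency_penalty=0.5))
```

Tokens that do not occur in the window (bar in this example) keep their original score in all three cases.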
num_ctx is the maximum size of the attention window. Increasing the context length increases the short term memory of the model as well as its context dependent behaviour. Increasing the context length increases the KV cache memory approximately linearly, while the computational cost of standard attention increases roughly quadratically.
num_predict is the maximum number of tokens an LLM is allowed to generate in a single response. Reducing it prevents too much gibberish from being generated in a single response, while it of course also reduces the amount of information that an LLM can generate. It is primarily used as a safeguard to prevent infinite loops, cap costs and protect system resources.
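The linear growth of the KV cache can be estimated with a simple formula. The sketch below assumes standard full attention with 16 bit cache entries. The layer count, number of KV heads and head dimension are hypothetical placeholder values and not the real dimensions of the qwen model used in this article, and architectures that mix in other attention variants need less memory.

```python
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache: keys and values for every layer, head and position."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


# Hypothetical model dimensions, chosen only to illustrate the scaling.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 8, 128

for num_ctx in (65536, 131072):
    size = kv_cache_bytes(num_ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    print(f"num_ctx={num_ctx}: {size / 2**30:.1f} GiB")
```

For these placeholder values the cache takes 10.0 GiB at 64k and 20.0 GiB at 128k context, which shows why limiting num_ctx matters so much on cards with 12GB or 24GB of VRAM.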
It is entirely possible to run complex long range planning orchestrators and agent frameworks like codex using self hosted models. It takes some work to get them up and running, and neither the quality of the output nor the long range planning will be what one is used to from frontier models like GPT 5.4 or GPT 5.5. To achieve that level of reasoning and long term planning one would need a large scale model like GLM 5.2 on very expensive hardware, which is totally uneconomical for private use unless the privacy of running a local model is the primary reason. When one thinks about buying such hardware one also has to take into account that models will grow over time - one's hardware won't.
If, on the other hand, one wants to play around, needs agents that do long running jobs or wants to hammer an LLM with requests, running an LLM offline is often an interesting approach. The gains are obvious:
First the positive:
And the negative:
/goal__pycache__, etc.).
Overall the model was comparable to a junior developer who can draft a service shaped codebase quickly but does not reliably close the loop by running, testing and reconciling design assumptions with actual runtime behaviour.
The resulting code was way below production quality, with parts not working at all. LLMs on this scale - in contrast to large scale models like GPT5.4 or GPT5.5 - are merely some kind of auto-complete …
--oss mode.
codex
This article is tagged: Programming, Artificial Intelligence, System administration, Administration, Large Language Models, Machine learning
Dipl.-Ing. Thomas Spielauer, Wien (webcomplainsQu98equt9ewh@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/