Large Language Models (LLMs) like GPT-5 are powerful but prone to hallucinations: they confidently state plausible-sounding falsehoods. This is especially problematic in mathematics and theoretical physics, where rigor and truth are fundamental[1]. The idea behind grounding an LLM in mathematical truth is to have it work in tandem with a proof assistant (such as Coq or Lean). The LLM proposes definitions, lemmas, or proof steps, and the proof assistant checks each step for correctness. Any claim made by the LLM is then backed by a formal proof, which eliminates hallucinations because incorrect steps are rejected[1,2]. In essence, the proof assistant serves as an infallible judge: if the machine-checked proof script completes with Qed, the result is guaranteed true (assuming no bugs in the prover)[2]. This opens up a new way to explore mathematics and theoretical physics - the LLM can suggest creative ideas, new models, or conjectures, and the proof assistant ensures those ideas are rigorously verified. The approach came to my mind a few years ago, and when I checked - of course someone had already done it. Below I first survey how researchers have begun combining LLMs with proof assistants (covering both open and proprietary efforts, including some historical approaches). Then I provide a step-by-step example of using OpenAI's ChatGPT or Codex (GPT) together with Coq, showing how one might interactively generate and verify proofs. Finally, I present a way to send Codex off to research a topic while keeping track of formal proofs for its different claims - a technique I use on a day-to-day basis, including for exploring new approaches to various problems.

LLMs and Proof Assistants: Techniques and Examples (General Overview)
Connecting LLMs to proof assistants is an active research area that aims to leverage the generative power and generalization capabilities of LLMs while guaranteeing correctness via formal verification. Early foundational efforts laid the groundwork: researchers have long applied AI to theorem proving, with machine learning used in the 2000s to guide automated provers (for example, selecting relevant lemmas)[3], and systems like DeepMath (2016) exploring neural networks for premise selection[3]. However, those early systems did not use the kind of large-scale language models we have today; in contrast to other artificial intelligence approaches, large language models can explore new ideas and areas thanks to their generalization capabilities. A notable breakthrough came in 2020, when OpenAI introduced GPT-f, a transformer-based prover for the Metamath formal system[4]. GPT-f treated theorem proving as a language modeling task: it was trained to generate proofs as sequences of tokens. It managed to prove about 56% of the Metamath test theorems and found new, shorter proofs for some of them that were accepted into the Metamath library[4] - the first time a deep-learning system contributed proofs that human formalizers adopted as correct[4].
Following GPT-f, the focus expanded to richer proof assistants like Lean and Coq, which support higher-order logic and have large libraries of mathematical facts. In 2022, OpenAI and others extended these ideas, for example by training models on proofs from the Lean community. One OpenAI system, described in a February 2022 blog article and paper, used a GPT-3-style model (with 774M parameters) together with an expert-iteration loop to solve math competition problems formalized in Lean[5]. The model would search for proofs and learn from its successes, iteratively improving. This approach achieved state-of-the-art results on the MiniF2F benchmark, solving several Olympiad-level problems (such as inequalities) in Lean[5]. For example, it solved an Olympiad problem about triangle inequalities by ultimately finding the right sequence of tactics (including a call to nlinarith, a nonlinear arithmetic solver) to dispatch the goal[5]. This showed that, with LLM guidance, proof assistants can tackle nontrivial problems fully formally.
Other researchers tackled the challenge of translating informal human mathematics into formal proofs - a task known as autoformalization. In 2022, a team at Google demonstrated that Codex could translate about 25% of plain-English math competition problems perfectly into Isabelle/HOL formal statements[3]. The model was able to produce correct Isabelle code (definitions and lemma statements) for a significant subset of problems, capturing all the logical details[3]. It was further shown that these autoformalized statements could be used to train a theorem prover, boosting its success rate on benchmarks[3]. This work, together with related tools like LeanChat and LeanAide, hints at a future where mathematicians can state a conjecture in natural language and have an AI system propose a formalization in Lean or Coq[6].
Another important line of research seeks to combine natural language reasoning with formal proof. A prime example is the Draft-Sketch-Prove (DSP) pipeline. In this approach, an LLM first writes an informal proof in human style (a few paragraphs of reasoning in English/LaTeX) for a given problem[6]. Then another model or module translates this high-level proof into a formal proof sketch - an outline in Isabelle with some steps left for automation[6]. Finally, automated provers or tactics (such as Isabelle's Sledgehammer) attempt to fill in the low-level details to complete the formal proof[6]. Using this three-step approach with two different LLMs (Google's Minerva for the informal draft and OpenAI's Codex for the formal sketch), the authors managed to solve about 35-40% of a set of competition-level problems (under a budget of 100 attempts per problem)[6]. The success rate is not 100%, but it is far above random, demonstrating that LLMs can effectively bridge the gap between intuitive reasoning and formal verification.
Yet another noteworthy effort is Baldur[11], an approach that integrates an LLM with a repair loop to correct its proofs. Baldur used a model (based on Minerva) to generate candidate whole proofs in Isabelle/HOL, then checked these proofs and fed the failure information back to the model to refine its next attempt[6]. This back-and-forth allowed the system to reach roughly a 47% success rate on a benchmark of Isabelle proofs[6]. The key innovation was feeding the proof assistant's feedback (error messages, unfinished goals) back into the LLM, so it could adjust its strategy.
Most recently, the concept of an LLM-based agent that interacts with a proof environment has gained traction again. Instead of producing a static proof, the LLM acts as a decision-maker in a stepwise proving process: it observes the current proof state and decides on the next action (apply a certain tactic, backtrack, recall a lemma, etc.), then the proof assistant executes that action and returns the new state or any errors. COPRA (an in-context prover agent) exemplifies this approach: it uses GPT-4 as the brains controlling a proof search in Coq or Lean[7]. COPRA maintains a stateful backtracking search - if the LLM's chosen tactic leads to a dead end, the system can backtrack and try alternative strategies[7]. After each tactic execution, the resulting proof state (the remaining goals or any error message) is fed back into the LLM's prompt for the next step[7]. This loop between the LLM and the formal environment greatly reduces hallucinations: the LLM cannot simply make up a fact, because the proof assistant will immediately flag anything that is not actually proven or applicable[7]. COPRA showed significantly better performance than a single-pass GPT-4 on benchmarks - it found proofs more reliably and with fewer queries by leveraging the stepwise feedback[7]. In Coq, COPRA was evaluated on a set of theorems from the CompCert verification project and outperformed prior systems like Proverbot9001 (a 2020 learned Coq prover)[7]. This demonstrates the power of treating theorem proving like a dialogue: GPT-4 can steer Coq to a proof, learning from each success or failure along the way.
To put things in perspective, consider how far this synergy has come: a recent experiment used GPT-4 to attempt every exercise in a Coq textbook. Initially, GPT-4 solved roughly 80% of the problems with formal proofs. After upgrading to o1-preview, the success rate jumped to 99.4%, with the AI solving all but 4 of the 750+ problems[8]. Essentially, an advanced LLM combined with Coq was able to prove nearly every lemma in an undergraduate-level curriculum. The process was automated and scriptable: the model's proof outputs were checked by Coq, and any failure simply meant that attempt did not count. Research is now also aiming beyond pure math, into theoretical physics and other domains. A 2025 preprint introduced Ax-Prover, a multi-agent system using LLMs plus formal verifiers to tackle formalized problems in quantum theory and abstract algebra[9]. Ax-Prover, which relies on tool integration (via the Model Context Protocol) to interface the AI with proof checkers, was able to adapt to these new domains and outperform specialized provers trained only on narrow areas[9]. This suggests that LLM and proof-assistant systems could one day assist in deriving new results or checking conjectures in physics by exploring ideas in a formal, reliable way.
Conclusion
LLMs can generate conjectures and proof strategies, while proof assistants like Coq or Lean rigorously verify each step. We have seen open research breakthroughs such as Codex-based Isabelle autoformalization[6] and Lean GPT-f-style systems[5], as well as proprietary contributions (OpenAI's GPT-4, DeepMind's work, etc.), often combining ideas. The benefit is twofold: the LLM becomes more trustworthy (it cannot cheat the prover, so we only accept claims it has actually proved), and the LLM can greatly accelerate formalization work (finding proofs that would be tedious for humans). One gets the best of both worlds - intuitive code or math solutions with a machine-verifiable guarantee of correctness[2]. The next section illustrates how to practically set up such a workflow with Coq and OpenAI's ChatGPT, as well as in an automated workflow using Codex.
Tutorial: Guiding Coq with Codex (GPT) - A Step-by-Step Guide
The manual way
In this section we focus on using Coq (the proof assistant I personally prefer) in combination with an OpenAI LLM (GPT-5 or Codex). I assume you already have Coq installed and an interface to query the LLM (for example, the OpenAI API via a Python script, a tool like the Codex CLI that lets you interact with GPT, or simply the ChatGPT web interface). The goal is to have the LLM help generate a proof which Coq will check. Essentially, you act as a mediator between Coq and the LLM (unless you use an agent orchestrator that automates the loop). The process can be summarized as: prompt $\to$ propose proof $\to$ verify $\to$ refine, repeated until QED.
Here's how to do it:
State the Theorem in Coq: Begin by formulating the theorem or goal you want to prove in Coq's language. For example, let's say we want to prove a simple arithmetic property in Coq, like the fact that adding zero to a natural number yields the same number. In a Coq file (or Coq IDE), you might write:
Theorem plus_n_O : forall n : nat, n + 0 = n.
Proof.
(* proof steps will go here *)
This sets up a goal $\forall n, n + 0 = n$. Keep this context ready - you will feed the goal (and any relevant definitions or lemmas) to the LLM. If you are using an interactive setup, you can also just describe the problem in words (for example "Prove that for any natural number $n$, $n + 0 = n$ in Coq") and the model will translate it to Coq code in its response.
Prompt the LLM for a Proof: Next, send a prompt to the LLM asking for a proof of the stated theorem. If using Codex or GPT-4 via an API, you might provide the theorem statement (and maybe the opening Proof. line) and prompt something like "Complete the proof script". In a chat interface, you could simply ask: "How can I prove $\forall n : nat, n + 0 = n$ in Coq?". The LLM will then attempt to generate a sequence of Coq tactics that prove the goal. Modern LLMs have surprisingly good knowledge of Coq tactics and common strategies. For our example, it might produce something like:
induction n.
- (* base case *) reflexivity.
- (* inductive step *) simpl. rewrite IHn. reflexivity.
Qed.
This is a plausible proof: it performs induction on $n$, then handles the base case ($n = 0$) by reflexivity (since $0 + 0 = 0$ holds by computation), and for the inductive case it simplifies $S n + 0$ to $S (n + 0)$, then uses the inductive hypothesis IHn to replace $n + 0$ with $n$, and concludes by reflexivity. The exact output might differ in formatting, but the idea is that the LLM provides a candidate proof script. In more complex problems, the model may introduce lemmas, use tactics like apply, rewrite, or exists, or generate helper lemmas on the fly. In one example, ChatGPT was prompted to prove the equivalence of two definitions of the factorial, and the model itself introduced an auxiliary lemma and structured the proof with bullets correctly[2] - a sign that it can produce non-trivial proofs rather than just reproducing existing patterns.
Review and Run the Proof in Coq: If you are working in an interactive, non-automated environment, copy the proof script generated by the LLM into your Coq environment and end the proof with Qed. if the LLM has not already done so. Now let Coq check it. If the proof is correct, Coq accepts it and reports no errors, finishing the theorem. This is the ideal outcome - it means the LLM's suggestion was valid and we have formally verified it. You can now trust the result, since Coq's approval means every step was logically sound. With simple goals, GPT-based models often succeed in one shot; in our example, the one-liner induction n; [reflexivity | simpl; rewrite IHn; reflexivity]. would indeed solve the goal. In more complex cases, the first attempt might not work perfectly, which leads to the next step.
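Putting the two snippets together, the complete script that Coq checks end to end looks as follows (this simply restates the fragments above in one file):

Theorem plus_n_O : forall n : nat, n + 0 = n.
Proof.
  induction n.
  - (* base case *) reflexivity.
  - (* inductive step *) simpl. rewrite IHn. reflexivity.
Qed.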
If Coq Reports an Error, Feed it Back: It is common for the LLM's first proof attempt to contain a mistake - maybe a minor typo, a wrong tactic, or an unproven subgoal. In an interactive environment, read the error message Coq gives. For example, Coq might say something like "Error: Unable to unify $n + 0$ with $n$" or "unknown identifier IHn", depending on the situation. Take that feedback and go back to the LLM. A crucial part of this loop is to inform the LLM what went wrong[6]. You might tell it: "Coq says it cannot unify $n + 0$ with $n$ in the inductive step. Maybe it needs a rewrite". In a chat, you can simply paste the Coq error text. Modern LLMs are trained on lots of programming and proof data, so they often recognize Coq error messages and know how to fix them. It has been found that, with iterative hints, even GPT-3.5 could gradually correct proofs[6]. This back-and-forth is analogous to debugging a program: the LLM is the coder and Coq is the compiler/runtime giving feedback. By cycling this way (ask for a proof, get a suggestion, run it in Coq, report the error, get a refined suggestion), you converge towards a correct proof. These systems are far more powerful if you talk to them the right way, providing clear feedback on mistakes so the AI can adjust[6].
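To make the feedback loop concrete, here is a sketch of a naive first attempt that fails; the theorem name plus_n_O_try is purely illustrative, and the exact error wording varies between Coq versions:

(* A deliberately naive attempt: addition recurses on its first argument, so
   n + 0 does not reduce to n while n is a variable, and reflexivity fails. *)
Theorem plus_n_O_try : forall n : nat, n + 0 = n.
Proof.
  intros n.
  reflexivity.
  (* Coq rejects this tactic with an error along the lines of
     "Unable to unify n + 0 with n". Pasting that message back to the LLM
     typically makes it switch to the induction proof shown above. *)
Abort.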
Iterate Until the Proof Succeeds: Continue the refine-and-check loop. In each iteration the LLM incorporates the latest feedback; for example, if it missed a step in an inductive proof, it might insert that step when prompted with the error. Another example is shown in [2]. If the model gets confused or stuck, you can give it a hint ("Perhaps you should use induction on this variable" or "Remember the lemma plus_n_Sm: S (n + m) = n + S m"). You can also break a complex goal into smaller lemmas yourself and have the AI prove those one at a time, which is often easier. With GPT-4-level models this iteration usually converges quickly on standard problems - as evidenced by the near-100% success on the Software Foundations exercises when multiple attempts were allowed[8]. After a few rounds, you will likely get a proof script that Coq accepts, ending with Qed. The LLM and Coq together have now produced a verified proof.
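As an illustration of splitting a goal into smaller lemmas, here is a minimal sketch proving commutativity of addition via two helper lemmas. The names add_0_r, add_succ_r and add_comm are chosen for this example (the standard library already ships equivalents such as plus_n_O and plus_n_Sm):

(* Each helper lemma is easy to prove by induction on its own. *)
Lemma add_0_r : forall n : nat, n + 0 = n.
Proof. induction n; simpl; [reflexivity | rewrite IHn; reflexivity]. Qed.

Lemma add_succ_r : forall n m : nat, n + S m = S (n + m).
Proof. induction n; intros m; simpl; [reflexivity | rewrite IHn; reflexivity]. Qed.

Theorem add_comm : forall n m : nat, n + m = m + n.
Proof.
  induction n; intros m; simpl.
  - (* goal: m = m + 0 *) rewrite add_0_r. reflexivity.
  - (* goal: S (n + m) = m + S n *) rewrite IHn. rewrite add_succ_r. reflexivity.
Qed.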
Optional: Ask for a Human-Friendly Explanation: Once the formal proof is done, you might want insight into what was proven and how. You can prompt the LLM (outside of Coq) to explain the proof in simple terms or to generate a LaTeX-formatted write-up. For instance: "Explain in a few sentences why the proof works, as you would in a textbook". Since the LLM knows the formal proof script, it can describe the reasoning. You could also ask it to output a full LaTeX proof that mirrors the formal proof. This provides a parallel, readable narrative to the machine-checked proof. The combination of a rigorous Coq proof and a LaTeX explanation is a powerful way to document new mathematical results - the Coq proof guarantees correctness, and the LaTeX version makes it accessible to humans.
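For instance, a textbook-style LaTeX write-up of the plus_n_O proof, roughly as an LLM might produce it, could look like this (a sketch, not the output of any particular model):

\documentclass{article}
\usepackage{amsmath,amsthm}
\begin{document}
\begin{proof}[Proof that $n + 0 = n$]
We argue by induction on $n$.
\emph{Base case.} For $n = 0$ the claim reads $0 + 0 = 0$, which holds by the
definition of addition.
\emph{Inductive step.} Assume $n + 0 = n$. By the definition of addition,
$S(n) + 0 = S(n + 0)$, and by the induction hypothesis this equals $S(n)$.
Hence $S(n) + 0 = S(n)$, which completes the induction.
\end{proof}
\end{document}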
Exploiting Agentic Frameworks like Codex
In practice, you may integrate these steps into an automated workflow instead of working interactively. For example, the Codex CLI tool by OpenAI can act as an agent: it can take your prompt, call the LLM, write the answer into a file, run Coq on it by itself, and then feed any Coq errors back into the prompt - automatically cycling until the proof completes. The setup is surprisingly simple: you start your IDE or the agent, allow it to edit your workspace and execute Coq (ideally without asking for permission every time), and instruct it, with a well-written description, which problem to tackle, to write a formal proof in Coq into files, and to check it by running the Coq compiler. You also instruct it to iterate until the proof completes and covers all relevant aspects, without skipping anything.
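An instruction for such an agent might look roughly like the following; the file name theories/PlusNO.v and the exact wording are purely illustrative:

"Formalize and prove the following claim in Coq: for every natural number n, n + 0 = n. Write the proof into theories/PlusNO.v and compile the file with coqc. If compilation fails, read the error output, fix the script and try again. Iterate until coqc succeeds and every theorem ends with Qed. Do not use Admitted, admit, or new axioms."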
It is also possible to integrate external tools via the Model Context Protocol (MCP) - a proof checker, a solver, etc. - in a structured way[10]. For instance, an MCP-based Coq plugin could let the LLM query Coq for the list of current goals, or ask Coq to apply a tactic and return the result, instead of re-running the batch Coq utility from scratch every time, which improves performance and reduces computational cost. Using such an orchestrator is more advanced, but it can greatly streamline the process - essentially turning the interactive trial-and-error described above into an autonomous agent loop. In addition, one can include web search, reading and indexing papers, and writing structured documentation in the fully automated loop. Research prototypes like LeanDojo and ProofGPT have started providing such integrations for Lean, and similar efforts for Coq are underway (Proverbot9001 was an earlier attempt using learned tactics inside Coq; newer ones use LLMs). If you prefer not to set all this up, the simple manual loop with copy-paste described in steps 2-5 is enough to get started.
Tips and Best Practices
When using an LLM with Coq, keep these in mind:
- Provide Context: If your proof requires a certain library or prior lemmas, load them in Coq and also mention them to the LLM. For example, if proving something about lists, you might need to Require Import Coq.Lists.List. in Coq and tell the LLM "you can use list tactics like simpl, rewrite, etc.". A well-informed model (like GPT-4, which has seen a lot of Coq code) might automatically know common libraries, but it does not hurt to be explicit (see the sketch after this list).
- Watch Out for Subtle Issues: The LLM might produce a proof that "looks" right but actually is not (for example, using a lemma that has not been proven, or assuming what it needs to show). Coq will usually catch these as an error or an unfinished goal. Always rely on the Coq verdict, not the LLM's confidence. If Coq accepts the Qed, you are good[2].
- Iterative Refinement: This bears repeating - the strength of this approach lies in iteration. Even experts do not expect a complex proof to come out perfectly on the first try. Use Coq's feedback as a guide for the LLM. For instance, if Coq's error is "unfocused goals remain", the AI opened too many subgoals without closing them; you can tell it to focus on one goal at a time, using bullets or braces { ... } as in the sketch after this list. The model can then adjust its proof script accordingly.
- Use the AI as a Teacher: If you are an experimental physicist or otherwise a newcomer to formal methods, this setup can accelerate your learning. You can let the AI produce a proof, then study why that proof works. Users have found that AI-suggested proofs give insight, or at least starting points, for tricky problems. As this technology matures, it could even help formalize new physics theories by checking each step of a derivation for consistency. By combining intuitive reasoning (from you or the AI) with the rigorous bookkeeping of Coq, you ensure that no step is hiding an error.
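The following minimal sketch ties the first and third tips together: it loads the list library explicitly and uses bullets to keep each subgoal focused. The lemma name app_nil_r' is chosen here to avoid clashing with the standard library's app_nil_r:

Require Import Coq.Lists.List.
Import ListNotations.

(* Appending the empty list on the right leaves a list unchanged. *)
Lemma app_nil_r' : forall (A : Type) (l : list A), l ++ [] = l.
Proof.
  intros A l.
  induction l as [| x xs IH].
  - (* l = [] *) reflexivity.
  - (* l = x :: xs *) simpl. rewrite IH. reflexivity.
Qed.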
Final conclusion
In summary, using Coq with an LLM like GPT is a powerful and accessible way to explore and verify mathematical ideas. You get the creativity and heuristic guidance of the language model - including its ability to invent new ideas or approaches thanks to its generalization capabilities - and the unyielding correctness guarantees of the proof assistant. This method has already shown success on everything from simple arithmetic lemmas to competition-level math problems, and it is rapidly improving. With the steps above, you can experiment with your own proofs in a fully automated fashion. The prospect of discovering new solutions in math or theoretical physics with an AI assistant that proves each claim is already realistic and available[2,9]; it is most likely only a matter of weeks to months until we see the first formal proofs in those fields for newly generated theories invented by large language models.
References
- [1] C. Liu et al., Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification, Jun. 05, 2025, arXiv:2506.04592. doi: 10.48550/arXiv.2506.04592.
- [2] G. Millson, Adventures in AI-assisted proof generation.
- [3] Y. Wu et al., Autoformalization with Large Language Models, May 25, 2022, arXiv:2205.12615. doi: 10.48550/arXiv.2205.12615.
- [4] S. Polu and I. Sutskever, Generative Language Modeling for Automated Theorem Proving, Sep. 07, 2020, arXiv:2009.03393. doi: 10.48550/arXiv.2009.03393.
- [5] S. Polu, J. M. Han, K. Zheng, M. Baksys, I. Babuschkin, and I. Sutskever, Formal Mathematics Statement Curriculum Learning, Feb. 03, 2022, arXiv:2202.01344. doi: 10.48550/arXiv.2202.01344.
- [6] Stack Exchange discussion: Make ChatGPT write formal proof from natural language proof.
- [7] A. Thakur, Y. Wen, and S. Chaudhuri, A Language-Agent Approach to Formal Theorem-Proving.
- [8] S. Miller, From 80% to 99.4%: Improvements in Coq Proving via GPT Models.
- [9] Anonymous (double-blind preprint), Ax-Prover: Agentic LEAN Proving with LLMs and MCP-based Verifiers.
- [10] G. Baudart and E. J. G. Arias, Proof Assistant Assistants: From teaching Rocq to LLM-assisted proofs.
- [11] E. First, M. N. Rabe, T. Ringer, and Y. Brun, Baldur: Whole-Proof Generation and Repair with Large Language Models, 2023, arXiv:2303.04910. doi: 10.48550/arXiv.2303.04910.
This article is tagged: Large Language Models, Formal, Math, Programming, Artificial Intelligence, Physics, How stuff works, Machine learning