Our First Proof submissions
We ran an internal model on all 10 problems from First Proof, a research-level math challenge designed to test whether AI systems can produce correct, checkable proof attempts. Unlike short-answer or competition-style math, these problems require building end-to-end arguments in specialized domains, and correctness is hard to establish without expert review. The authors of the First Proof problems are leading experts in their respective fields, and at least a couple of the problems were open for years before the authors found solutions. An academic department with substantial overlap with the subject areas could conceivably solve many of the problems in one week.
We shared our proof attempts on Saturday, February 14, 2026 at 12:00 AM PT. Based on feedback from experts, we believe at least five of the model’s proof attempts (problems 4, 5, 6, 9, and 10) have a high chance of being correct, and several others remain under review. We initially believed our attempt for problem 2 was likely correct; based on the official First Proof commentary and further community analysis, we now believe it is incorrect. We’re grateful for the engagement and look forward to continued review. Our full set of proof attempts is publicly available: the preprint includes all ten proof attempts, plus a newly added appendix with prompt patterns and examples that aim to simulate our manual interactions with the models during the process.
We believe that tackling novel frontier research problems is perhaps the most important way to evaluate the capabilities of next-generation AI models. Benchmarks are useful, but they can miss some of the hardest parts of research: sustaining long chains of reasoning, choosing the right abstractions, handling ambiguity in problem statements, and producing arguments that survive expert scrutiny. Frontier challenges like First Proof help us stress-test those capabilities in settings where correctness is nontrivial to verify and the failure modes are informative.
“We’re currently training a new model for which a primary focus is increasing the level of rigor in its thinking, with the goal that the model can think continuously for many hours and remain highly confident in its conclusions. When the First Proof problems were announced, it seemed like the perfect testbed, so over the weekend I tried it out. Already it was able to solve two of the problems (#9 and #10). As it trained, it became increasingly capable, eventually solving–in our estimation–at least three more. We were particularly pleased when it solved #6 and then, two days later, #4, as those problems were from fields familiar to many of us. It’s pretty incredible to watch a model get tangibly smarter day by day.”
– James R. Lee (OpenAI Researcher, Reasoning)
We ran the model with limited human supervision. When prompting versions of the model over the course of training, we sometimes suggested retrying strategies that had appeared fruitful in earlier attempts. For some attempts, we asked the model to expand or clarify parts of a proof after receiving expert feedback, to make the reasoning easier to verify. We also facilitated a back-and-forth between this model and ChatGPT for verification, formatting, and style. For some problems, we present the best of a few attempts, selected by human judgment. This was a fast sprint, and our process was not as clean as we would like in a properly controlled evaluation. We look forward to discussions with the First Proof organizers about a more rigorous experiment and evaluation framework for future iterations.
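To make the shape of that workflow concrete, here is a minimal sketch of a best-of-a-few-attempts loop with a separate review pass and a final human selection step, under the assumption that each attempt is an independent model call. The function names, the Attempt container, and the example hints are hypothetical illustrations, not OpenAI’s actual tooling or API.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One end-to-end proof attempt paired with an independent critique of it."""
    proof: str
    critique: str


def draft_proof(problem: str, hint: str | None = None) -> str:
    # Hypothetical placeholder for a call to the proof-writing model; the hint
    # stands in for a retry suggestion carried over from an earlier attempt.
    return f"Proof attempt for {problem!r} (hint: {hint!r})"


def review_proof(problem: str, proof: str) -> str:
    # Hypothetical placeholder for a second model pass used for verification,
    # formatting, and style cleanup.
    return f"Critique of attempt on {problem!r}: plausible, but expand lemma 2."


def collect_attempts(problem: str, hints: list[str | None]) -> list[Attempt]:
    """Gather a few independent attempts; a human then selects the best one."""
    attempts = []
    for hint in hints:
        proof = draft_proof(problem, hint)
        critique = review_proof(problem, proof)
        attempts.append(Attempt(proof, critique))
    return attempts


if __name__ == "__main__":
    # Three attempts: one cold start and two seeded with retry suggestions.
    hints = [None, "retry with the compactness argument", "expand the boundary case"]
    for attempt in collect_attempts("Problem 4", hints):
        print(attempt.proof)
        print(attempt.critique)
```

The sketch is only meant to show where human judgment enters the loop: in carrying retry suggestions forward between attempts and in choosing among finished candidates.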
This work builds on earlier results from frontier reasoning models in math and science. In July 2025, we reached gold medal-level performance on the International Mathematical Olympiad with a general-purpose reasoning model (35/42 points). In November 2025, we shared “Early experiments in accelerating science with GPT‑5”, a set of case studies where GPT‑5 helped researchers make concrete progress across math, physics, biology, and other fields, along with the limitations we observed. And most recently, we reported a physics collaboration where GPT‑5.2 proposed a candidate expression for a gluon-amplitude formula that was then formally proved by an internal model and verified by the authors.
We look forward to deeper engagement with the community on how to evaluate research-grade reasoning, including expert feedback on these attempts, and we’re excited to make these new capabilities available in future public models.