Advancing science and math with GPT-5.2


One of our hopes for strong AI is that it will accelerate scientific research for the benefit of everyone, helping researchers explore more ideas, test them faster, and turn discoveries into impact. 


Over the past year, we’ve been working closely with scientists across math, physics, biology, and computer science to understand where AI can help, and where it still falls short. Last month, we published a paper compiling early case studies across math, physics, biology, computer science, astronomy, and materials science, showing how GPT‑5 has already begun contributing to real scientific work. With GPT‑5.2, we’re starting to see those gains become more consistent and more reliable.


Stronger performance where precision matters




GPT‑5.2 Pro and GPT‑5.2 Thinking are our strongest models yet for scientific and mathematical work.


Strong mathematical reasoning is a foundation for reliability in scientific and technical work. It enables models to follow multi-step logic, keep quantities consistent, and avoid subtle errors that can compound in real analyses—from simulations and statistics to forecasting and modeling. Improvements on benchmarks like FrontierMath reflect not a narrow skill, but stronger general reasoning and abstraction, capabilities that carry directly into scientific workflows such as coding, data analysis, and experimental design.


These capabilities are also closely tied to progress toward general intelligence. A system that can reliably reason through abstraction, maintain consistency across long chains of thought, and generalize across domains is exhibiting traits that are foundational to AGI—not task-specific tricks, but broad, transferable reasoning skills that matter across science, engineering, and real-world decision-making.


We believe GPT‑5.2 Pro and GPT‑5.2 Thinking are the world’s best models for assisting and accelerating scientists. On GPQA Diamond, a graduate-level Google-proof Q&A benchmark, GPT‑5.2 Pro achieves 93.2%, followed closely by GPT‑5.2 Thinking at 92.4%.




In GPQA Diamond, models answer multiple-choice questions about physics, chemistry, and biology. No tools were enabled and reasoning effort was set to maximum.



On FrontierMath (Tier 1–3), an evaluation of expert-level mathematics, GPT‑5.2 Thinking set a new state of the art, solving 40.3% of problems.




In FrontierMath, models solve expert-level mathematics problems. A Python tool was enabled and reasoning effort was set to maximum.



Case study



GPT‑5.2 is not only strong at graduate-level science problems. We now regularly see our frontier models contributing solutions to previously unsolved—and increasingly subtle—questions in mathematics and the sciences.

In this case study, we describe how GPT‑5.2 Pro helped resolve an open research problem in statistical learning theory, documented in a new paper, On Learning-Curve Monotonicity for Maximum Likelihood Estimators.

The question (“If you collect more data, do your results reliably get better?”) shows up any time you fit a model from data. You can draw a learning curve that tracks average error as you add more examples. In the best case, the curve is monotone. More data means less error, every step of the way. That is the behavior people hope for, and often assume.
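As one standard formalization (ours for illustration; the paper's own notation is not reproduced here), the learning curve maps the sample size n to the expected error of the fitted model, and monotonicity asks that the curve never increase:

```latex
L(n) \;=\; \mathbb{E}_{S_n}\!\left[\mathrm{err}\!\left(\hat{\theta}(S_n)\right)\right],
\qquad
L(n+1) \,\le\, L(n) \quad \text{for all } n,
```

where $S_n$ is a training set of $n$ i.i.d. examples and $\hat{\theta}$ is the learning rule, here a maximum likelihood estimator.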

But over the last few years, researchers have learned that this intuition can fail. A line of work kicked off by an open problem posed at the Conference on Learning Theory (COLT) in 2019 by Viering, Mey, and Loog showed that the answer is often no. Even very simple, well-behaved toy setups can have non-monotonic learning curves, where adding data increases expected error. That surprise triggered a wave of follow-up papers. They expanded the list of settings where these reversals happen and proposed increasingly elaborate methods designed to restore monotone behavior.
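To see concretely how more data can hurt, here is a minimal simulation sketch of one well-known instance of the phenomenon: minimum-norm least squares, whose expected error is known to spike as the sample size n approaches the feature dimension d. This example is ours for illustration; it is not one of the settings from the COLT 2019 problem or the follow-up work described above.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10                   # number of features (fixed)
w = rng.normal(size=d)   # true parameter vector

def expected_error(n: int, trials: int = 2000) -> float:
    """Monte-Carlo estimate of E[||w_hat - w||^2] for min-norm least squares."""
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        y = X @ w + rng.normal(size=n)                 # noisy labels
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-norm solution
        errs.append(float(((w_hat - w) ** 2).sum()))
    return float(np.mean(errs))

# Error climbs as n approaches d = 10, then falls: a non-monotone curve.
for n in [2, 5, 8, 12, 15, 20, 40]:
    print(f"n = {n:2d}  estimated error = {expected_error(n):8.2f}")
```

Running this shows the error rising toward the interpolation threshold n ≈ d and dropping only afterward, a non-monotone learning curve of exactly the kind described above.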

Still, one of the most basic cases remained unresolved. What happens in the cleanest textbook situation, where the statistical model is actually correct and the data follow the familiar bell curve pattern, with a known mean but unknown standard deviation? Researchers already knew that small changes to this setup could break monotonic behavior. But the answer remained unknown in this core case.
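To make the open case concrete: with the mean known (say zero) and the variance unknown, the maximum likelihood estimator from n samples is the average of the squared observations. The sketch below traces its learning curve using KL divergence as the error measure, one natural choice on our part (the paper's exact loss and argument are not reproduced here). A closed form is available because n times the estimator, divided by the true variance, follows a chi-squared distribution with n degrees of freedom:

```python
import numpy as np
from scipy.special import digamma

def kl_risk(n: int) -> float:
    """
    Exact E[ KL( N(0, s2) || N(0, s2_hat) ) ] for the MLE s2_hat with known
    mean and n i.i.d. samples. Uses n * s2_hat / s2 ~ chi-squared(n):
      E[log(s2_hat / s2)] = digamma(n / 2) + log 2 - log n
      E[s2 / s2_hat]      = n / (n - 2)   (finite only for n > 2)
    """
    return 0.5 * (digamma(n / 2) + np.log(2.0) - np.log(n) + n / (n - 2) - 1.0)

ns = range(3, 31)
risks = [kl_risk(n) for n in ns]
for n, r in zip(ns, risks):
    print(f"n = {n:2d}  expected KL risk = {r:.5f}")

# Monotone: each additional sample lowers the expected error.
assert all(a > b for a, b in zip(risks, risks[1:]))

# Quick Monte-Carlo sanity check at n = 10 (true variance 1).
rng = np.random.default_rng(0)
s2_hat = (rng.normal(size=(200_000, 10)) ** 2).mean(axis=1)
mc = float((0.5 * (np.log(s2_hat) + 1.0 / s2_hat - 1.0)).mean())
print(f"MC check at n = 10: {mc:.5f} vs exact {kl_risk(10):.5f}")
```

Under this choice of loss the printed curve decreases at every step, in line with the monotonicity the paper establishes for this setting, and in contrast to the least-squares sketch above.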

Our new paper demonstrates that in this clean setting, intuition prevails: learning is predictably improved by more data, rather than behaving in surprising or unstable ways. What makes this paper unusual is how the proof was obtained. The authors did not work out a strategy and then ask the model to fill in steps. They did not provide intermediate arguments or a proof outline. Instead, they asked GPT‑5.2 Pro to solve the open problem directly, and then carefully verified the proof, including review and validation by external subject-matter experts.

The authors then asked simple follow-up questions to see how far the idea could go. GPT‑5.2 Pro extended the result beyond the original problem to higher dimensional settings and other common statistical models. Throughout, the human role stayed focused on verification and clear writing, rather than supplying mathematical scaffolding.

Looking ahead




This result suggests a useful direction for how AI systems can support scientific research, particularly in domains with axiomatic theoretical foundations such as mathematics and theoretical computer science. In settings like these, frontier models can help explore proofs, test hypotheses, and identify connections that might otherwise take substantial human effort to uncover.


At the same time, these systems are not independent researchers. Expert judgment, verification, and domain understanding remain essential. Even highly capable models can make mistakes or rely on unstated assumptions. But they can also produce detailed, structured arguments that merit careful human study and refinement. Making reliable progress with AI therefore depends on workflows that keep validation, transparency, and collaboration firmly in the loop.


Viewed as a case study, this result illustrates an emerging mode of research practice. Models like GPT‑5.2 can serve as tools for supporting mathematical reasoning and accelerating early-stage exploration, while responsibility for correctness, interpretation, and context remains with human researchers. Used carefully, such systems may help streamline significant aspects of theoretical work without displacing the central role of human judgment in scientific inquiry.


