Introducing GPT-5.2-Codex


Today we’re releasing GPT‑5.2-Codex, the most advanced agentic coding model yet for complex, real-world software engineering. GPT‑5.2-Codex is a version of GPT‑5.2 further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.


As our models continue to advance along the intelligence frontier, we’ve observed that these improvements also translate to capability jumps in specialized domains such as cybersecurity. For example, just last week, a security researcher using GPT‑5.1-Codex-Max with Codex CLI found and responsibly disclosed a vulnerability in React that could lead to source code exposure.


GPT‑5.2-Codex has stronger cybersecurity capabilities than any model we’ve released so far. These advances can help strengthen cybersecurity at scale, but they also raise new dual-use risks that require careful deployment. While GPT‑5.2-Codex does not reach a ‘High’ level of cyber capability under our Preparedness Framework, we’re designing our deployment approach with future capability growth in mind.


We’re releasing GPT‑5.2-Codex today across all Codex surfaces for paid ChatGPT users, and working toward safely enabling access for API users in the coming weeks. In parallel, we’re piloting invite-only trusted access to upcoming capabilities and more permissive models for vetted professionals and organizations focused on defensive cybersecurity work. We believe this approach to deployment will balance accessibility with safety.


Pushing the frontier on real-world software engineering




GPT‑5.2-Codex builds on GPT‑5.2’s strengths in professional knowledge work and on GPT‑5.1-Codex-Max’s frontier agentic coding and terminal-use capabilities. It improves long-context understanding, tool-calling reliability, factuality, and native compaction, making it a more dependable partner for long-running coding tasks while remaining token-efficient in its reasoning.
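
To make the idea of native compaction concrete, here is a minimal, purely conceptual sketch of context compaction for a long-horizon agent session. It is not how Codex implements compaction; the token heuristic and the summary step are stand-ins for illustration only.

```python
# Conceptual sketch of context compaction for long-horizon agent sessions.
# NOT the mechanism used inside Codex; it only illustrates replacing older
# turns with a compact summary so a session can keep running within a budget.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. A real system would use a tokenizer.
    return max(1, len(text) // 4)

def compact_history(messages: list[dict], budget: int, keep_recent: int = 4) -> list[dict]:
    """Compact a chat history so its estimated token count stays under `budget`."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages

    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Stand-in for a model-generated summary of the older turns.
    summary = "Summary of earlier work: " + "; ".join(m["content"][:60] for m in older)
    return [{"role": "system", "content": summary}] + recent

if __name__ == "__main__":
    history = [{"role": "user", "content": f"step {i}: " + "x" * 400} for i in range(20)]
    compacted = compact_history(history, budget=500)
    print(len(history), "->", len(compacted), "messages")
```

In a real system the summary would itself be produced by the model and would preserve task-relevant state such as the current plan, key findings, and outstanding errors.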


GPT‑5.2-Codex achieves state-of-the-art performance on SWE-Bench Pro and Terminal-Bench 2.0, benchmarks designed to test agentic performance on a wide variety of tasks in realistic terminal environments. It is also much more effective and reliable at agentic coding in native Windows environments, building on capabilities introduced in GPT‑5.1-Codex-Max.


With these improvements, Codex is more capable at working in large repositories over extended sessions with full context intact. It can more reliably complete complex tasks like large refactors, code migrations, and feature builds — continuing to iterate without losing track, even when plans change or attempts fail.





In SWE-Bench Pro, a model is given a code repository and must generate a patch to solve a realistic software engineering task. Terminal-Bench 2.0 is a benchmark for testing AI agents in real terminal environments; tasks include compiling code, training models, and setting up servers.
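
As a rough illustration of the SWE-Bench-Pro-style setup described above, the sketch below applies a model-generated patch to a repository checkout and runs the task’s tests. The paths, patch file, and test command are hypothetical and not part of the benchmark’s actual harness.

```python
# Minimal sketch of a SWE-Bench-Pro-style evaluation step, assuming each task
# provides a repository checkout, a candidate patch, and a test command.
import subprocess
from pathlib import Path

def apply_and_test(repo: Path, patch_file: Path, test_cmd: list[str]) -> bool:
    """Apply a candidate patch to the repo and report whether the task's tests pass."""
    applied = subprocess.run(
        ["git", "apply", str(patch_file)], cwd=repo, capture_output=True
    )
    if applied.returncode != 0:
        return False  # Patch does not apply cleanly: task counted as failed.
    result = subprocess.run(test_cmd, cwd=repo, capture_output=True)
    return result.returncode == 0

# Example with hypothetical paths:
# passed = apply_and_test(Path("./task_repo"), Path("./candidate.patch"), ["pytest", "-q"])
```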



Stronger vision performance enables GPT‑5.2-Codex to more accurately interpret screenshots, technical diagrams, charts, and UI surfaces shared during coding sessions.


Codex can take design mocks and quickly translate them to functional prototypes, and you can pair with Codex to take these prototypes to production.
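
For readers who want to try something similar through the API once access is enabled, here is a hedged sketch of sending a design mock to a model via the OpenAI Responses API. The `gpt-5.2-codex` model identifier and the local file name are assumptions based on this post, not confirmed API details.

```python
# Hedged sketch: send a design screenshot to a model via the OpenAI Responses API.
# The model identifier below is an assumption; substitute a model available to you.
import base64
from openai import OpenAI

client = OpenAI()

with open("design_mock.png", "rb") as f:  # hypothetical local screenshot
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.responses.create(
    model="gpt-5.2-codex",  # assumed identifier based on this post
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text",
             "text": "Turn this design mock into a single-file HTML/CSS prototype."},
            {"type": "input_image",
             "image_url": f"data:image/png;base64,{image_b64}"},
        ],
    }],
)
print(response.output_text)
```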





Design mock

Prototype generated by GPT-5.2-Codex

Advancing the cyber frontier




When charting performance on one of our core cybersecurity evaluations over time, we see a sharp jump in capability starting with GPT‑5-Codex, another large jump with GPT‑5.1-Codex-Max, and now a third jump with GPT‑5.2-Codex. We expect upcoming AI models to continue on this trajectory. In preparation, we are planning and evaluating as though each new model could reach ‘High’ levels of cybersecurity capability, as measured by our Preparedness Framework. While GPT‑5.2-Codex has not yet reached a ‘High’ level of cyber capability, we are preparing for future models that cross that threshold. Given the increased cyber capabilities, we have added additional safeguards both in the model and in the product, which are outlined in the system card.




The Professional Capture-the-Flag (CTF) eval measures how often the model can solve advanced, multi-step real-world challenges (requiring professional-level cybersecurity skills) in a Linux environment.
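
As a simple illustration of what “how often” means here, the snippet below computes a solve rate over a set of challenges, counting a challenge as solved if any attempt recovered the flag. The scoring convention and challenge names are illustrative assumptions, not the eval’s actual methodology.

```python
# Illustrative solve-rate calculation for a CTF-style eval (assumed scoring rule:
# a challenge counts as solved if any attempt succeeded).
def solve_rate(results: dict[str, list[bool]]) -> float:
    """results maps challenge name -> list of per-attempt success booleans."""
    solved = sum(1 for attempts in results.values() if any(attempts))
    return solved / len(results) if results else 0.0

example = {
    "pwn-heap-01": [False, True],   # solved on the second attempt
    "web-ssrf-02": [False, False],  # never solved
    "rev-vm-03":   [True],
}
print(f"{solve_rate(example):.0%}")  # 67%
```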



Real-world cyber capabilities




Modern society runs on software, and its reliability depends on strong cybersecurity—keeping critical systems in banking, healthcare, communications, and essential services online, protecting sensitive data, and ensuring people can trust the software they rely on every day. Vulnerabilities can exist long before anyone knows about them, and finding, validating, and fixing them often depends on a community of engineers and independent security researchers equipped with the right tools.


On December 11, 2025, the React team published three security vulnerabilities affecting apps built with React Server Components. What made this disclosure notable was not only the vulnerabilities themselves, but how they were uncovered.


Andrew MacPherson, a principal security engineer at Privy (a Stripe company), was using GPT‑5.1-Codex-Max with Codex CLI and other coding agents to reproduce and study a different critical React vulnerability disclosed the week prior, known as React2Shell (CVE-2025-55182). His goal was to evaluate how well the model could assist with real-world vulnerability research.


He initially attempted several zero-shot analyses, prompting the model to examine the patch and identify the vulnerability it addressed. When that did not yield results, he shifted to a higher-volume, iterative prompting approach. When those approaches did not succeed, he guided Codex through standard defensive security workflows—setting up a local test environment, reasoning through potential attack surfaces, and using fuzzing to probe the system with malformed inputs. While attempting to reproduce the original React2Shell issue, Codex surfaced unexpected behaviors that warranted deeper investigation. Over the course of a single week, this process led to the discovery of previously unknown vulnerabilities, which were responsibly disclosed to the React team.
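
To give a flavor of the fuzzing step mentioned above, here is a generic sketch that mutates a valid seed input and flags unexpected crashes in a stand-in parser. It illustrates the technique only; it is not the researcher’s actual workflow, and the target function is not the React code involved.

```python
# Generic fuzzing illustration: probe a stand-in parser with malformed inputs
# and record any failure that is not the expected "malformed input" rejection.
import json
import random
import string

def target_parse(payload: str) -> object:
    """Stand-in for the component under test (e.g., a request deserializer)."""
    return json.loads(payload)

def mutate(seed: str, rng: random.Random) -> str:
    """Produce a malformed variant of a valid seed input."""
    chars = list(seed)
    for _ in range(rng.randint(1, 5)):
        pos = rng.randrange(len(chars))
        chars[pos] = rng.choice(string.printable)
    return "".join(chars)

def fuzz(seed: str, iterations: int = 10_000) -> list[str]:
    rng = random.Random(0)
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed, rng)
        try:
            target_parse(candidate)
        except json.JSONDecodeError:
            pass  # Expected rejection of malformed input.
        except Exception:
            crashes.append(candidate)  # Unexpected failure worth investigating.
    return crashes

if __name__ == "__main__":
    print(len(fuzz('{"user": {"id": 1, "name": "a"}}')), "unexpected failures")
```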










This demonstrates how advanced AI systems can materially accelerate defensive security work in widely used, real-world software. At the same time, capabilities that help defenders move faster can also be misused by bad actors.


As agentic systems become more capable in cybersecurity-relevant tasks, we are making it a core priority to ensure these advances are deployed responsibly—pairing every gain in capability with stronger safeguards, tighter access controls, and ongoing collaboration with the security community.


Empowering cyberdefense through trusted access




Security teams can run into restrictions when attempting to emulate threat actors, analyze malware to support remediation, or stress test critical infrastructure. We are developing a trusted access pilot to remove that friction for qualifying users and organizations and enable trusted defenders to use frontier AI cyber capabilities to accelerate cyberdefense.


Initially, the pilot program will be invite-only for vetted security professionals with a track record of responsible vulnerability disclosure and for organizations with a clear professional cybersecurity use case. Qualifying participants will get access to our most capable models for defensive use cases, enabling legitimate dual-use work.


If you’re a security professional or part of an organization doing ethical security work, such as vulnerability research or authorized red-teaming, we invite you to express interest in joining and to share feedback on what you’d like to see from the program.


Conclusion




GPT‑5.2-Codex represents a step forward in how advanced AI can support real-world software engineering and specialized domains like cybersecurity—helping developers and defenders tackle complex, long-horizon work, and strengthening the tools available for responsible security research.


By rolling GPT‑5.2-Codex out gradually, pairing deployment with safeguards, and working closely with the security community, we’re aiming to maximize defensive impact while reducing the risk of misuse. What we learn from this release will directly inform how we expand access over time as the software and cyber frontiers continue to advance.


