Introducing GPT-5.3-Codex-Spark

OpenAI News


Today, we’re releasing a research preview of GPT‑5.3-Codex-Spark, a smaller version of GPT‑5.3-Codex, and our first model designed for real-time coding. Codex-Spark marks the first milestone in our partnership with Cerebras, which we announced in January. Codex-Spark is optimized to feel near-instant when served on ultra-low latency hardware—delivering more than 1000 tokens per second while remaining highly capable for real-world coding tasks.


We’re sharing Codex-Spark on Cerebras as a research preview to ChatGPT Pro users so that developers can start experimenting early while we work with Cerebras to ramp up datacenter capacity, harden the end-to-end user experience, and deploy our larger frontier models.


Our latest frontier models have shown particular strengths in their ability to do long-running tasks, working autonomously for hours, days, or weeks without intervention. Codex-Spark is our first model designed specifically for working with Codex in real time—making targeted edits, reshaping logic, or refining interfaces and seeing results immediately. With Codex-Spark, Codex now supports both long-running, ambitious tasks and getting work done in the moment. We hope to learn from how developers use it and incorporate feedback as we continue to expand access.


At launch, Codex-Spark has a 128k context window and is text-only. During the research preview, Codex-Spark will have its own rate limits and usage will not count towards standard rate limits. However, when demand is high, you may see limited access or temporary queuing as we balance reliability across users.


Speed and intelligence




Codex-Spark is optimized for interactive work where latency matters as much as intelligence. You can collaborate with the model in real time, interrupting or redirecting it as it works, and rapidly iterate with near-instant responses. Because it’s tuned for speed, Codex-Spark keeps its default working style lightweight: it makes minimal, targeted edits and doesn’t automatically run tests unless you ask it to.



Example tasks: Build a snake game / Plan a project / Translate a file

Coding




Codex-Spark is a highly capable small model optimized for fast inference. On SWE-Bench Pro and Terminal-Bench 2.0, two benchmarks evaluating agentic software engineering capability, GPT‑5.3-Codex-Spark demonstrates strong performance while accomplishing the tasks in a fraction of the time compared to GPT‑5.3-Codex.




Duration is estimated as the sum of (1) output generation time (output tokens ÷ sampling speed), (2) prefill time (prefill tokens ÷ prefill speed), (3) total tool execution time, and (4) total network overhead.
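To make the arithmetic concrete, here is a minimal sketch of that estimate in Python; the function and parameter names are our own illustration, not part of the evaluation harness.

```python
def estimated_duration_s(
    output_tokens: int,
    prefill_tokens: int,
    sampling_speed_tps: float,   # output tokens generated per second
    prefill_speed_tps: float,    # prompt tokens processed per second
    tool_times_s: list[float],   # wall-clock time of each tool call
    network_overhead_s: float,   # total client/server network overhead
) -> float:
    """Estimated task duration: generation + prefill + tool execution + network."""
    generation_time = output_tokens / sampling_speed_tps             # (1)
    prefill_time = prefill_tokens / prefill_speed_tps                # (2)
    tool_time = sum(tool_times_s)                                    # (3)
    return generation_time + prefill_time + tool_time + network_overhead_s  # (4)
```

At the speeds quoted above, for example, 5,000 output tokens sampled at 1,000 tokens per second contribute roughly five seconds of generation time.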












Latency improvements for all models




As we trained Codex-Spark, it became apparent that model speed was just part of the equation for real-time collaboration—we also needed to reduce latency across the full request-response pipeline. We implemented end-to-end latency improvements in our harness that will benefit all models. Under the hood, we streamlined how responses stream from client to server and back, rewrote key pieces of our inference stack, and reworked how sessions are initialized so that the first visible token appears sooner and Codex stays responsive as you iterate. Through the introduction of a persistent WebSocket connection and targeted optimizations inside the Responses API, we reduced overhead per client/server roundtrip by 80%, per-token overhead by 30%, and time-to-first-token by 50%. The WebSocket path is enabled for Codex-Spark by default and will become the default for all models soon.
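For intuition, here is a minimal sketch of the persistent-connection pattern described above, written with the Python `websockets` package; the endpoint URL and message shapes are illustrative assumptions, not the actual Codex or Responses API wire protocol.

```python
import asyncio
import json

import websockets  # third-party package: pip install websockets


async def interactive_session(uri: str) -> None:
    # One long-lived connection: TCP/TLS setup and authentication are paid
    # once per session rather than once per request, cutting roundtrip overhead.
    async with websockets.connect(uri) as ws:
        for prompt in ["add a retry to the fetch call", "now make the backoff exponential"]:
            # Hypothetical request envelope; the real protocol differs.
            await ws.send(json.dumps({"type": "request", "input": prompt}))
            # Tokens stream back over the same connection as they are produced.
            async for frame in ws:
                event = json.loads(frame)
                if event.get("type") == "token":
                    print(event["text"], end="", flush=True)
                elif event.get("type") == "done":
                    print()
                    break


if __name__ == "__main__":
    asyncio.run(interactive_session("wss://example.invalid/realtime"))
```

Compared with opening a new HTTP connection for every request, reusing one connection keeps handshake and authentication costs out of the interactive loop, one typical source of per-roundtrip savings.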


Powered by Cerebras




Codex-Spark runs on Cerebras’ Wafer Scale Engine 3—a purpose-built AI accelerator for high-speed inference, giving Codex a latency-first serving tier. We partnered with Cerebras to add this low-latency path to the same production serving stack as the rest of our fleet, so it works seamlessly across Codex and sets us up to support future models.


“What excites us most about GPT-5.3-Codex-Spark is partnering with OpenAI and the developer community to discover what fast inference makes possible—new interaction patterns, new use cases, and a fundamentally different model experience. This preview is just the beginning.”
— Sean Lie, CTO and Co-Founder of Cerebras



GPUs remain foundational across our training and inference pipelines and deliver the most cost-effective tokens for broad usage. Cerebras complements that foundation by excelling at workflows that demand extremely low latency, tightening the end-to-end loop so Codex feels more responsive as you iterate. GPUs and Cerebras can be combined for single workloads to reach the best performance.
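As a loose illustration of combining the two within a single workload, the sketch below routes latency-sensitive interactive turns to a low-latency tier and background steps to a GPU tier; the tier names and the routing rule are hypothetical, not an actual OpenAI scheduling API.

```python
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    interactive: bool  # True if a user is actively waiting on the result


def pick_tier(step: Step) -> str:
    """Route interactive turns to the low-latency tier and the rest to GPUs."""
    return "low-latency-tier" if step.interactive else "gpu-tier"


# Example: an edit the user is watching vs. a background test run.
for step in (Step("apply the requested refactor", True),
             Step("run the full test suite", False)):
    print(f"{step.description} -> {pick_tier(step)}")
```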


Availability & details




Codex-Spark is rolling out today as a research preview for ChatGPT Pro users in the latest versions of the Codex app, CLI, and VS Code extension. Because it runs on specialized low-latency hardware, usage is governed by a separate rate limit that may be adjusted based on demand during the research preview. In addition, we are making Codex-Spark available in the API for a small set of design partners to understand how developers want to integrate it into their products. We’ll expand access over the coming weeks as we continue tuning our integration under real workloads.


Codex-Spark is currently text-only with a 128k context window and is the first in a family of ultra-fast models. As we learn more with the developer community about where fast models shine for coding, we’ll introduce even more capabilities, including larger models, longer context lengths, and multimodal input.


Codex-Spark includes the same safety training as our mainline models, including cyber-relevant training. We evaluated Codex-Spark as part of our standard deployment process, which includes baseline evaluations for cyber and other capabilities, and determined that it does not have a plausible chance of reaching our Preparedness Framework threshold for high capability in cybersecurity or biology.


What’s next




Codex-Spark is the first step toward a Codex with two complementary modes: longer-horizon reasoning and execution, and real-time collaboration for rapid iteration. Over time, the modes will blend—Codex can keep you in a tight interactive loop while delegating longer-running work to sub-agents in the background, or fanning out tasks to many models in parallel when you want breadth and speed, so you don’t have to choose a single mode up front.
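To sketch the fan-out half of that picture, the snippet below dispatches the same task to several models concurrently and collects every result; `run_codex_task` and the model names are hypothetical placeholders rather than a real Codex interface.

```python
import asyncio


async def run_codex_task(model: str, task: str) -> str:
    """Placeholder for a real call to a coding agent with the given model."""
    await asyncio.sleep(0.1)  # stand-in for model and tool latency
    return f"[{model}] draft for: {task}"


async def fan_out(task: str, models: list[str]) -> list[str]:
    # Launch one attempt per model and wait for all of them in parallel,
    # trading extra compute for breadth and wall-clock speed.
    return await asyncio.gather(*(run_codex_task(m, task) for m in models))


if __name__ == "__main__":
    for result in asyncio.run(fan_out("refactor the parser", ["fast-model", "frontier-model"])):
        print(result)
```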


As models become more capable, interaction speed becomes a clear bottleneck. Ultra-fast inference tightens that loop, making Codex feel more natural to use and expanding what’s possible for anyone turning an idea into working software.


