Introducing gpt-realtime and Realtime API updates

OpenAI News


Today we’re making the Realtime API generally available with new features that enable developers and enterprises to build reliable, production-ready voice agents. The API now supports remote MCP servers, image inputs, and phone calling through Session Initiation Protocol (SIP), making voice agents more capable through access to additional tools and context.


We’re also releasing our most advanced speech-to-speech model yet—gpt-realtime. The new model shows improvements in following complex instructions, calling tools with precision, and producing speech that sounds more natural and expressive. It’s better at interpreting system messages and developer prompts—whether that’s reading disclaimer scripts word-for-word on a support call, repeating back alphanumerics, or switching seamlessly between languages mid-sentence. We’re also releasing two new voices, Cedar and Marin, which are available exclusively in the Realtime API starting today.


Since we first introduced the Realtime API in public beta last October, thousands of developers have built with the API and helped shape the improvements we’re releasing today—optimized for reliability, low latency, and high quality to successfully deploy voice agents in production. Unlike traditional pipelines that chain together multiple models across speech-to-text and text-to-speech, the Realtime API processes and generates audio directly through a single model and API. This reduces latency, preserves nuance in speech, and produces more natural, expressive responses.



Zillow, T-Mobile, StubHub, Oscar Health, and Lemonade are already building with the Realtime API.






“The new speech-to-speech model in OpenAI's Realtime API shows stronger reasoning and more natural speech—allowing it to handle complex, multi-step requests like narrowing listings by lifestyle needs or guiding affordability discussions with tools like our BuyAbility score. This could make searching for a home on Zillow or exploring financing options feel as natural as a conversation with a friend, helping simplify decisions like buying, selling, and renting a home.”

– Josh Weisberg, Head of AI at Zillow

Introducing gpt-realtime




The new speech-to-speech model—gpt-realtime—is our most advanced, production-ready voice model. We trained the model in close collaboration with customers to excel at real-world tasks like customer support, personal assistance, and education—aligning the model to how developers build and deploy voice agents. The model shows improvements across audio quality, intelligence, instruction following, and function calling.

Audio quality



Natural-sounding conversation is critical for deploying voice agents in the real world. Models need to speak with the intonation, emotion, and pace of a human to create an enjoyable experience and encourage continuous conversation with users. We trained gpt-realtime to produce higher-quality speech that sounds more natural and can follow fine-grained instructions, such as “speak quickly and professionally” or “speak empathetically in a French accent.”
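Speaking-style directions like these are typically carried in the session's `instructions` field when the session is created. The snippet below is a minimal sketch of such a payload; the `audio`/`voice` nesting is an assumption about the session configuration shape, so check the Realtime API docs for the exact fields.

```python
# Minimal sketch of a Realtime session payload carrying a speaking-style
# direction in `instructions`. The exact set of supported fields is
# defined by the Realtime API docs; treat this as illustrative only.
import json

session = {
    "type": "realtime",
    "model": "gpt-realtime",
    "instructions": "Speak quickly and professionally. Keep answers brief.",
    # Assumed nesting for voice selection; verify against the API reference.
    "audio": {"output": {"voice": "marin"}},
}

payload = json.dumps({"session": session})
print(payload)
```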


We’re releasing two new voices in the API, Marin and Cedar, with the most significant improvements to natural-sounding speech. We’re also updating our existing eight voices to benefit from these improvements.


Voice sample - Marin



Voice sample - Cedar







Intelligence and comprehension



gpt-realtime shows higher intelligence and comprehends native audio with greater accuracy. The model can capture non-verbal cues (like laughs), switch languages mid-sentence, and adapt tone ("snappy and professional" vs. "kind and empathetic"). In internal evaluations, the model also detects alphanumeric sequences (such as phone numbers and VINs) more accurately in other languages, including Spanish, Chinese, Japanese, and French. On the Big Bench Audio eval measuring reasoning capabilities, gpt-realtime scores 82.8% accuracy, beating our previous model from December 2024, which scores 65.6%.




The Big Bench Audio⁠ benchmark is an evaluation dataset for assessing the reasoning capabilities of language models that support audio input. This dataset adapts questions from Big Bench Hard—chosen for its rigorous testing of advanced reasoning—into the audio domain.



Instruction following



When building a speech-to-speech application, developers give the model a set of instructions on how to behave: how to speak, what to say in certain situations, and what to do or avoid. We've focused our improvements on adherence to these instructions, so that even minor directions carry more signal for the model. On the MultiChallenge audio benchmark measuring instruction-following accuracy, gpt-realtime scores 30.5%, a significant improvement over our previous model from December 2024, which scores 20.6%.




MultiChallenge⁠ evaluates how well LLMs handle multi-turn conversations with humans. It focuses on four categories of realistic challenges that current frontier models struggle with. These challenges require models to combine instruction-following, context management, and in-context reasoning simultaneously. We converted an audio-friendly subset of the test questions from text-to-speech to create an audio version of this evaluation.



Function calling



To be useful in production, a voice agent built on a speech-to-speech model needs to call the right tools at the right time. We've improved function calling on three axes: calling relevant functions, calling functions at the appropriate time, and calling functions with appropriate arguments, resulting in higher accuracy. On the ComplexFuncBench audio eval measuring function-calling performance, gpt-realtime scores 66.5%, while our previous model from December 2024 scores 49.7%.


We’ve also made improvements to asynchronous function calling⁠. Long-running function calls will no longer disrupt the flow of a session—the model can continue a fluid conversation while waiting on results. This feature is available natively in gpt-realtime, so developers do not need to update their code.
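The app-side shape of this async pattern can be sketched with plain `asyncio`: the long-running tool call is scheduled in the background so the dialogue keeps flowing, and its result is fed back in when ready. Here `lookup_order` and the transcript handling are hypothetical stand-ins for your own tool and event-handling code, not part of the Realtime API itself.

```python
# Sketch of the asynchronous function-calling pattern: a slow tool call
# runs in the background while the assistant keeps the conversation going.
# `lookup_order` is a hypothetical stand-in for a real backend call.
import asyncio

async def lookup_order(order_id: str) -> str:
    await asyncio.sleep(0.1)  # stands in for a slow backend lookup
    return f"order {order_id}: shipped"

async def main() -> list[str]:
    transcript: list[str] = []
    # Kick off the long-running tool call without blocking the dialogue.
    pending = asyncio.create_task(lookup_order("A1234"))
    # The model can keep talking while the result is outstanding.
    transcript.append("assistant: One moment while I check on that order.")
    transcript.append("assistant: Is there anything else I can help with?")
    # When the tool finishes, its output is returned to the session.
    transcript.append(f"tool: {await pending}")
    return transcript

if __name__ == "__main__":
    for line in asyncio.run(main()):
        print(line)
```

With gpt-realtime the equivalent behavior is handled inside the session, so this orchestration is illustrative rather than required.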




ComplexFuncBench measures how well models handle challenging function-calling tasks. It evaluates performance across scenarios like multi-step calls, reasoning about constraints or implicit parameters, and handling very long inputs. We converted the original text prompts into speech to build this evaluation for our model.



New in the Realtime API




Remote MCP server support



You can enable MCP support in a Realtime API session by passing the URL of a remote MCP server into the session configuration. Once connected, the API automatically handles the tool calls for you, so there’s no need to wire up integrations manually.


This setup makes it easy to extend your agent with new capabilities—just point the session to a different MCP server, and those tools become available right away. To learn more about configuring MCP with Realtime, check out this guide⁠.


JavaScript

// POST /v1/realtime/client_secrets
{
  "session": {
    "type": "realtime",
    "tools": [
      {
        "type": "mcp",
        "server_label": "stripe",
        "server_url": "https://mcp.stripe.com",
        "authorization": "{access_token}",
        "require_approval": "never"
      }
    ]
  }
}







Image input



With image inputs now supported in gpt-realtime, you can add images, photos, and screenshots alongside audio or text to a Realtime API session. Now the model can ground the conversation in what the user is actually seeing, enabling users to ask questions like “what do you see?” or “read the text in this screenshot.”


Instead of treating an image like a live video stream, the system treats it more like adding a picture into the conversation. Your app can decide which images to share with the model and when to share them. This way, you stay in control of what the model sees and when it responds.


Check out our docs to get started with image input.


JavaScript

{
  "type": "conversation.item.create",
  "previous_item_id": null,
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_image",
        "image_url": "data:image/{format(example: png)};base64,{some_base64_image_bytes}"
      }
    ]
  }
}







Additional capabilities



We’ve added several other features to make the Realtime API easier to integrate and more flexible for production use.


  • Session Initiation Protocol (SIP) support: Connect your apps to the public phone network, PBX systems, desk phones, and other SIP endpoints with direct support in the Realtime API. Read about it in docs.⁠
  • Reusable prompts: You can now save and reuse prompts—consisting of developer messages, tools, variables, and example user/assistant messages—across Realtime API sessions, like in the Responses API. Learn more in docs.⁠

Safety & privacy




The Realtime API incorporates multiple layers of safeguards and mitigations to help prevent misuse. You can learn more about our safety approach and system card details in the beta announcement blog⁠. We employ active classifiers over Realtime API sessions, meaning certain conversations can be halted if they are detected as violating our harmful content guidelines. Developers can also easily add their own additional safety guardrails using the Agents SDK⁠.


Our usage policies⁠ prohibit repurposing or distributing outputs from our services for spam, deception, or other harmful purposes. Developers must also make it clear to end users when they’re interacting with AI, unless it’s already obvious from the context. The Realtime API uses preset voices to help prevent malicious actors from impersonating others.


The Realtime API fully supports EU Data Residency⁠ for EU-based applications and is covered by our enterprise privacy commitments⁠.


Pricing & availability




The Realtime API and the new gpt-realtime model are generally available to all developers starting today. We're reducing prices for gpt-realtime by 20% compared to gpt-4o-realtime-preview: $32 / 1M audio input tokens ($0.40 / 1M cached input tokens) and $64 / 1M audio output tokens (see detailed pricing). We've also added fine-grained control over conversation context, letting developers set intelligent token limits and truncate multiple turns at a time, significantly reducing cost for long sessions.
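As a back-of-the-envelope check on these rates, per-session audio cost is just a weighted sum over the three token buckets. The token counts in the example call are made up for illustration.

```python
# Published gpt-realtime rates: $32 per 1M audio input tokens,
# $0.40 per 1M cached input tokens, $64 per 1M audio output tokens.
INPUT_PER_M = 32.00
CACHED_INPUT_PER_M = 0.40
OUTPUT_PER_M = 64.00

def session_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one session's audio token usage."""
    return (
        input_tokens * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000

# e.g. 50k fresh input, 200k cached input, 30k output tokens
print(f"${session_cost(50_000, 200_000, 30_000):.2f}")  # → $3.60
```

Caching and context truncation both act on the input side, which is why aggressive context control matters most for long sessions.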


To get started, visit our Realtime API documentation⁠, test the new model in the Playground⁠, and view our Realtime API prompting guide⁠.


