Introducing AgentKit, new Evals, and RFT for agents

OpenAI News

Today we’re launching AgentKit, a complete set of tools for developers and enterprises to build, deploy, and optimize agents. Until now, building agents meant juggling fragmented tools—complex orchestration with no versioning, custom connectors, manual eval pipelines, prompt tuning, and weeks of frontend work before launch. With AgentKit, developers can now design workflows visually and embed agentic UIs faster using new building blocks like:

Agent Builder: a visual canvas for creating and versioning multi-agent workflows
Connector Registry: a central place for admins to manage how data and tools connect across OpenAI products
ChatKit: a toolkit for embedding customizable chat-based agent experiences in your product

We’re also expanding evaluation capabilities with new features like datasets, trace grading, automated prompt optimization, and third-party model support to measure and improve agent performance.

Since releasing the Responses API and Agents SDK in March, we’ve seen developers and enterprises build end-to-end agentic workflows for deep research, customer support, and more. Klarna built a support agent that handles two-thirds of all tickets, and Clay 10x’ed growth with a sales agent. AgentKit builds on the Responses API to help developers build agents more efficiently and reliably.

Design workflows with Agent Builder

As agent workflows grow more complex, developers need clearer visibility into how they work. Agent Builder provides a visual canvas for composing logic with drag-and-drop nodes, connecting tools, and configuring custom guardrails. It supports preview runs, inline eval configuration, and full versioning, which makes it well suited to fast iteration.

Builders can get started with a blank canvas or with prebuilt templates.

At Ramp, the team went from a blank canvas to a procurement agent in just a few hours:

"Agent Builder transformed what once took months of complex orchestration, custom code, and manual optimizations into just a couple of hours. The visual canvas keeps product, legal, and engineering on the same page, slashing iteration cycles by 70% and getting an agent live in two sprints rather than two quarters."

— Ramp

Similarly, LY Corporation—a leading Japanese technology and internet services company—built a work assistant agent with Agent Builder in less than two hours.

"Agent Builder allowed us to orchestrate agents in a whole new way, with engineers and subject matter experts collaborating all in one interface. We built our first multi-agentic workflow and ran it in less than two hours, dramatically accelerating the time to create and deploy agents."

— LY Corporation

We’re also launching a Connector Registry for enterprises to govern and maintain data across multiple workspaces and organizations. The Connector Registry consolidates data sources into a single admin panel across ChatGPT and the API. The registry includes all pre-built connectors, such as Dropbox, Google Drive, SharePoint, and Microsoft Teams, as well as third-party MCPs.

Developers can also enable Guardrails in Agent Builder. Guardrails is an open-source, modular safety layer that helps protect agents against unintended or malicious behavior. It can mask or flag PII, detect jailbreaks, and apply other safeguards, making it easier to build and deploy reliable, safe agents. Guardrails can be deployed standalone or via the guardrails library for Python and JavaScript.
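To make the idea of a safety layer concrete, here is a minimal Python sketch of the two checks named above: masking PII and flagging a likely jailbreak attempt. It does not use the Guardrails library or reflect its API. The function `apply_guardrails`, the `GuardrailResult` type, the regular expressions, and the hint phrases are all hypothetical and far too narrow for production use.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical patterns for illustration only; real PII detection needs much broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
# Hypothetical phrases; real jailbreak detection typically relies on a classifier.
JAILBREAK_HINTS = ("ignore previous instructions", "disregard your rules")


@dataclass
class GuardrailResult:
    text: str                                        # input with PII masked
    flags: List[str] = field(default_factory=list)   # names of the checks that fired

    @property
    def blocked(self) -> bool:
        # In this sketch only a jailbreak flag blocks the request; PII is masked and passed on.
        return "jailbreak" in self.flags


def apply_guardrails(text: str) -> GuardrailResult:
    """Mask emails and phone numbers, and flag suspicious instructions."""
    flags: List[str] = []
    masked, n_email = EMAIL_RE.subn("[EMAIL]", text)
    masked, n_phone = PHONE_RE.subn("[PHONE]", masked)
    if n_email or n_phone:
        flags.append("pii")
    lowered = text.lower()
    if any(hint in lowered for hint in JAILBREAK_HINTS):
        flags.append("jailbreak")
    return GuardrailResult(masked, flags)
```

A caller would run `apply_guardrails` on user input before it reaches the agent, forward `result.text`, and refuse the request when `result.blocked` is true.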

Embed agentic chat experiences with ChatKit

Deploying chat UIs for agents can be surprisingly complex: handling streaming responses, managing threads, showing the model thinking, and designing engaging in-chat experiences. ChatKit makes it simple to embed chat-based agents that feel native to your product. It can be embedded into apps or websites and customized to match your theme or brand.

"We saved over two weeks of time building a support agent for our Canva Developers community with ChatKit, and integrated it in less than an hour. This support agent will transform the way developers engage with our docs by turning it into a conversational experience, making it easy to build apps and integrations on Canva."

— Canva

ChatKit already powers a range of use cases, from internal knowledge assistants and onboarding guides to customer support and research agents. HubSpot’s customer support agent is one example.

Measure agent performance with new Evals capabilities

Building reliable, production-ready agents requires rigorous performance evaluations. Last year, we launched Evals to help developers test prompts and measure model behavior. We’re now adding four new capabilities that make it even easier to build evals:

Datasets: rapidly build agent evals from scratch and expand them over time with automated graders and human annotations.
Trace grading: run end-to-end assessments of agentic workflows and automate grading to pinpoint shortcomings.
Automated prompt optimization: generate improved prompts based on human annotations and grader outputs.
Third-party model support: evaluate models from other providers within the OpenAI Evals platform.
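The first two capabilities can be pictured with a small, self-contained Python sketch: a dataset of prompt and expected-answer pairs, an automated grader, and a trace check that locates the first failing step of a run. This is not the Evals platform API. Every name here (`Example`, `TraceStep`, `exact_match`, `run_eval`, `first_failing_step`) is hypothetical, and the "agent" is a toy lookup table.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Example:
    prompt: str
    expected: str


@dataclass
class TraceStep:
    tool: str   # tool the agent called at this step
    ok: bool    # whether that step produced a usable result


def exact_match(output: str, expected: str) -> float:
    """Automated grader: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(dataset: List[Example],
             agent: Callable[[str], str],
             grader: Callable[[str, str], float]) -> float:
    """Average grader score of the agent over the dataset."""
    if not dataset:
        return 0.0
    return sum(grader(agent(ex.prompt), ex.expected) for ex in dataset) / len(dataset)


def first_failing_step(trace: List[TraceStep]) -> Optional[int]:
    """Trace grading in miniature: index of the first failed step, or None."""
    for i, step in enumerate(trace):
        if not step.ok:
            return i
    return None


# Toy stand-in for an agent: a lookup table with one wrong answer.
answers = {"capital of France?": "Paris", "2 + 2?": "5"}
dataset = [Example("capital of France?", "paris"), Example("2 + 2?", "4")]
score = run_eval(dataset, lambda p: answers[p], exact_match)
```

The aggregate score says how often the agent is right, while the trace check says where in a multi-step run things went wrong. That distinction is what separates trace grading from grading final outputs alone.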

We’ve already seen major performance gains from customers using Evals.

"The evaluation platform cut development time on our multi-agent due diligence framework by over 50%, and increased agent accuracy 30%."

— Carlyle

Push agent performance with reinforcement fine-tuning

Reinforcement fine-tuning (RFT) lets developers customize our reasoning models. It is generally available on OpenAI o4-mini and in private beta for GPT‑5. We are working closely with dozens of customers to refine RFT for GPT‑5 before wider release.

Today, we’re introducing two new features in that RFT beta designed to push agent performance even further:

Custom tool calls: train models to call the right tools at the right time for better reasoning
Custom graders: set custom evaluation criteria for what matters most in your use case
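As a rough illustration of how these two features could fit together, a custom grader might reward a model for calling the expected tools in the expected order. The Python sketch below assumes a simple position-wise comparison. It is not the RFT grader format, which this announcement does not describe, and the function `tool_call_reward` is hypothetical.

```python
from typing import List


def tool_call_reward(predicted: List[str], reference: List[str]) -> float:
    """Reward in [0, 1]: position-wise tool-name matches divided by the longer sequence length.

    Extra or missing calls lower the reward because they lengthen the denominator.
    """
    if not predicted and not reference:
        return 1.0  # nothing was expected and nothing was called
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / max(len(predicted), len(reference))
```

For example, `tool_call_reward(["search", "fetch"], ["search", "fetch"])` scores a perfect run, while an unnecessary third call reduces the score to two thirds. A stricter or more forgiving criterion, such as one that ignores ordering, is an equally valid design choice depending on the use case.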

Pricing & availability

Starting today, ChatKit and the new Evals capabilities are generally available to all developers. Agent Builder is available in beta, and Connector Registry is beginning its beta rollout to some API, ChatGPT Enterprise, and Edu customers with a Global Admin Console (where Global Owners can manage domains, SSO, and multiple API orgs). The Global Admin Console is a prerequisite for enabling Connector Registry. All of these tools are included with standard API model pricing.

We plan to add a standalone Workflows API and agent deployment options to ChatGPT soon.

We can’t wait to see what you build.
