From model to agent: Equipping the Responses API with a computer environment

OpenAI News


We're currently in a shift from using models, which excel at particular tasks, to using agents capable of handling complex workflows. Prompting a model on its own only taps the intelligence it acquired in training; giving the model a computer environment unlocks a much wider range of use cases, like running services, requesting data from APIs, or generating more useful artifacts such as spreadsheets and reports.


A few practical problems emerge when you try to build agents: where to put intermediate files, how to avoid pasting large tables into a prompt, how to give the workflow network access without creating a security headache, and how to handle timeouts and retries without building a workflow system yourself.


Instead of putting it on developers to build their own execution environments, we built the necessary components to equip the Responses API⁠ with a computer environment to reliably execute real-world tasks.


OpenAI’s Responses API, together with the shell tool and a hosted container workspace, is designed to address these practical problems. The model proposes steps and commands; the platform runs them in an isolated environment with a filesystem for inputs and outputs, optional structured storage (like SQLite), and restricted network access. 


In this post, we’ll break down how we built a computer environment for agents and share some early lessons on how to use it for faster, more repeatable, and safer production workflows.


The shell tool




A good agent workflow starts with a tight execution loop: the model proposes an action, like reading files or fetching data from an API; the platform runs it; and the result feeds into the next step. We'll start with the shell tool, the simplest way to see this loop in action, and then cover the container workspace, networking, reusable skills, and context compaction.


To understand the shell tool, it's first useful to understand how a language model uses tools in general, to do things like call a function or interact with a computer. During training, a model is shown examples of how tools are used and the resulting effects, step by step. This helps the model learn when to use a tool and how to use it. Strictly speaking, though, "using a tool" means the model only proposes a tool call; it can't execute the call on its own.


The shell tool makes the model dramatically more powerful: it interacts with a computer through the command line to carry out a wide range of tasks, from searching for text to sending API requests. Built on familiar Unix tooling, our shell tool can do anything you'd expect, with utilities like grep, curl, and awk available out of the box.


Compared to our existing code interpreter, which only executes Python, the shell tool enables a much wider range of use cases, like running Go or Java programs or starting a Node.js server. This flexibility lets the model fulfill complex agentic tasks.


Orchestrating the agent loop




On its own, a model can only propose shell commands, but how are these commands executed? We need an orchestrator to get model output, invoke tools, and pass the tool response back to the model in a loop, until the task is complete.
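
The orchestration loop can be sketched in a few lines of Python. This is an illustrative harness, not the hosted implementation: `fake_model` is a stand-in for a real model, and commands run locally via `subprocess` (a POSIX shell is assumed) rather than in a managed container.

```python
import subprocess

def run_shell(command: str, timeout: int = 30) -> str:
    """Execute one proposed command and capture its output for the next turn."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def agent_loop(model_step, prompt: str, max_turns: int = 10) -> str:
    """Feed tool output back to the model until it stops proposing commands."""
    context = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        action = model_step(context)          # model proposes the next action
        if action["type"] == "final_answer":  # no more commands: task complete
            return action["content"]
        output = run_shell(action["command"])  # platform executes the command
        context.append({"role": "tool", "content": output})
    raise RuntimeError("agent did not finish within max_turns")

# A stand-in "model" for illustration: list files once, then answer.
def fake_model(context):
    if context[-1]["role"] == "user":
        return {"type": "shell", "command": "echo hello.txt"}
    return {"type": "final_answer",
            "content": f"Found: {context[-1]['content'].strip()}"}

answer = agent_loop(fake_model, "What files are here?")
```

The key property is that the model never executes anything itself; the orchestrator alternates between asking for the next action and reporting what happened.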


The Responses API is how developers interact with OpenAI models. When used with custom tools, the Responses API yields control back to the client, which needs its own harness for running the tools. However, the API can also orchestrate between the model and hosted tools out of the box.


When the Responses API receives a prompt, it assembles the model context: the user prompt, prior conversation state, and tool instructions. For shell execution to work, the prompt must mention using the shell tool, and the selected model must be trained to propose shell commands (GPT‑5.2 and later models are). With this context, the model decides the next action. If it chooses shell execution, it returns one or more shell commands to the Responses API. The API forwards those commands to the container runtime, streams back the shell output, and feeds it into the next request's context. The model can then inspect the results, issue follow-up commands, or produce a final answer. The Responses API repeats this loop until the model returns a completion without additional shell commands.


When the Responses API executes a shell command, it maintains a streaming connection to the container service. As output is produced, the API relays it to the model in near real time so the model can decide whether to wait for more output, run another command, or move on to a final response.
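
A minimal sketch of that streaming behavior, using a local subprocess in place of the hosted container service (POSIX shell assumed):

```python
import subprocess

def stream_shell(command: str):
    """Relay output lines as the command produces them, instead of waiting for exit."""
    proc = subprocess.Popen(
        command, shell=True, stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT, text=True,
    )
    for line in proc.stdout:  # each line can be forwarded to the model immediately
        yield line.rstrip("\n")
    proc.wait()

# Two lines arrive one at a time, as printf emits them.
lines = list(stream_shell("printf 'step 1\\nstep 2\\n'"))
```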





The Responses API streams shell command output


The model can propose multiple shell commands in one step, and the Responses API can execute them concurrently using separate container sessions. Each session streams output independently, and the API multiplexes those streams back into structured tool outputs as context. In other words, the agent loop can parallelize work, such as searching files, fetching data, and validating intermediate results.
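
The fan-out pattern can be sketched with Python's thread pool standing in for separate container sessions; the `echo` commands are placeholders for real work like searching, fetching, and validating.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run(command: str) -> str:
    """One session's worth of work: run a command and capture its output."""
    done = subprocess.run(command, shell=True, capture_output=True, text=True)
    return done.stdout.strip()

# Stand-ins for "search files", "fetch data", and "validate results".
commands = ["echo search", "echo fetch", "echo validate"]

# Execute concurrently and multiplex the results back into structured
# tool outputs keyed by the originating command.
with ThreadPoolExecutor(max_workers=len(commands)) as pool:
    outputs = dict(zip(commands, pool.map(run, commands)))
```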


When a command involves file operations or data processing, shell output can become very large, consuming context budget without adding useful signal. To control this, the model can specify an output cap per command. The Responses API enforces that cap and returns a bounded result that preserves both the beginning and end of the output while marking the omitted content. For example, you might bound the output to 1,000 characters, with the beginning and end preserved:


text at the beginning ... 1000 chars truncated ... text at the end
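
A possible implementation of that bounding rule, assuming a simple half-head, half-tail split (the service's exact split is not specified here):

```python
def bound_output(text: str, cap: int) -> str:
    """Keep the head and tail of oversized output and mark what was omitted."""
    if len(text) <= cap:
        return text
    omitted = len(text) - cap
    head = text[: cap // 2]
    tail = text[len(text) - (cap - cap // 2):]
    return f"{head} ... {omitted} chars truncated ... {tail}"

log = "x" * 5000                      # a long, noisy terminal log
bounded = bound_output(log, 1000)     # 500 chars of head, 500 of tail
```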


Together, concurrent execution and bounded output make the agent loop both fast and context-efficient so the model can keep reasoning over relevant results instead of getting overwhelmed by raw terminal logs.


When the context window gets full: compaction




One potential issue with agent loops is that tasks can run for a long time. Long-running tasks fill the context window, which is important for providing context across turns and across agents. Picture an agent calling a skill, getting a response, adding tool calls and reasoning summaries—the limited context window quickly fills up. To avoid losing the important context as the agent continues running, we need a way to keep the key details and remove anything extraneous. Instead of requiring developers to design and maintain custom summarization or state-carrying systems, we added native compaction in the Responses API, designed to align with how the model behaves and how it's been trained.


Our latest models are trained to analyze prior conversation state and produce a compaction item that preserves key prior state in an encrypted, token-efficient representation. After compaction, the next context window consists of this compaction item and high-value portions of the earlier window. This allows workflows to continue coherently across window boundaries, even in extended multi-step, tool-driven sessions. Codex relies on this mechanism to sustain long-running coding tasks and iterative tool execution without degrading quality.


Compaction is available either built-in on the server or through a standalone `/compact` endpoint. Server-side compaction lets you configure a threshold, and the system handles compaction timing automatically, eliminating the need for complex client-side logic. It allows a slightly larger effective input context window to tolerate small overages right before compaction, so requests near the limit can still be processed and compacted rather than rejected. As model training evolves, the native compaction solution evolves with it for every OpenAI model release.
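
In spirit, server-side threshold-triggered compaction behaves like the sketch below. The four-characters-per-token estimate and the `summarize` callback are stand-ins for illustration; the real mechanism is model-generated and encrypted.

```python
def estimate_tokens(messages) -> int:
    # Crude proxy: roughly 4 characters per token (assumption for illustration).
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(messages, threshold: int, summarize):
    """When the context estimate crosses the threshold, replace older turns
    with a single compaction item plus the most recent turns."""
    if estimate_tokens(messages) <= threshold:
        return messages
    compaction_item = {"role": "system", "content": summarize(messages[:-2])}
    return [compaction_item] + messages[-2:]

history = [{"role": "user", "content": "x" * 400} for _ in range(10)]
compacted = maybe_compact(
    history, threshold=500,
    summarize=lambda ms: f"[compacted {len(ms)} turns]",
)
```

The client never has to track the threshold itself; the service decides when to fold history into a compaction item.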


Codex helped us build the compaction system while serving as an early user of it. When one Codex instance hit a compaction error, we'd spin up a second instance to investigate. The result was that Codex got a native, effective compaction system just by working on the problem. This ability for Codex to inspect and refine itself has become an especially interesting part of working at OpenAI. Most tools only require the user to learn how to use them; Codex learns alongside us.


Container context




Now let’s cover state and resources. The container is not only a place to run commands but also the working context for the model. Inside the container, the model can read files, query databases, and access external systems under network policy controls.


File systems



The first part of container context is the file system for uploading, organizing, and managing resources. We built container and file⁠ APIs to give the model a map of available data and help it choose targeted file operations instead of performing broad, noisy scans.


A common anti-pattern is packing all input directly into prompt context. As inputs grow, overfilling the prompt becomes expensive and hard for the model to navigate. A better pattern is to stage resources in the container file system and let the model decide what to open, parse, or transform with shell commands. Much like humans, models work better with organized information.
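
For instance, rather than pasting a CSV into the prompt, you stage it in the workspace and let the model issue a targeted command. A local stand-in (a temporary directory instead of a hosted container, with `grep` assumed available):

```python
import pathlib
import subprocess
import tempfile

# Stage the resource in the workspace file system instead of the prompt.
workspace = pathlib.Path(tempfile.mkdtemp())
(workspace / "orders.csv").write_text(
    "id,region,total\n"
    "1,EU,100\n"
    "2,US,250\n"
    "3,EU,80\n"
)

# The model would propose a targeted command like this one, pulling only
# the rows it needs rather than reading the whole file into context.
hits = subprocess.run(
    ["grep", "EU", str(workspace / "orders.csv")],
    capture_output=True, text=True,
).stdout.strip().splitlines()
```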


Databases



The second part of container context is databases. In many cases, we suggest developers store structured data in a database such as SQLite and query it. Instead of copying an entire spreadsheet into the prompt, for example, you can give the model a description of the tables (what columns exist and what they mean) and let it pull only the rows it needs.


For example, if you ask, “Which products had declining sales this quarter?” the model can query just the relevant rows instead of scanning the whole spreadsheet. This is faster, cheaper, and scales better to larger datasets.
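
A concrete version of that question, using Python's built-in `sqlite3` with a hypothetical `sales` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, quarter TEXT, units INTEGER)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", "Q1", 120), ("widget", "Q2", 90),
     ("gadget", "Q1", 50), ("gadget", "Q2", 75)],
)

# "Which products had declining sales this quarter?" becomes one targeted
# query instead of a full-spreadsheet scan.
declining = [row[0] for row in con.execute(
    """
    SELECT prev.product
    FROM sales AS prev
    JOIN sales AS cur
      ON cur.product = prev.product
    WHERE prev.quarter = 'Q1' AND cur.quarter = 'Q2'
      AND cur.units < prev.units
    """
)]
```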


Network access 



The third part of the container context is network access, an essential part of agent workloads. Agent workflows may need to fetch live data, call external APIs, or install packages. At the same time, giving containers unrestricted internet access can be risky: it can expose information to external websites, unintentionally touch sensitive internal or third-party systems, or make credential leaks and data exfiltration harder to guard against.


To address these concerns without limiting agents' usefulness, we built hosted containers to use a sidecar egress proxy. All outbound network requests flow through a centralized policy layer that enforces allowlists and access controls while keeping traffic observable. For credentials, we use domain-scoped secret injection at egress. The model and container only see placeholders, while raw secret values stay outside model-visible context and only get applied for approved destinations. This reduces the risk of leakage while still enabling authenticated external calls.
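
The policy layer's behavior can be sketched as an allowlist check plus placeholder substitution. The hostnames, the `$SECRET` placeholder syntax, and the in-memory secret store below are illustrative assumptions, not the real proxy's interface:

```python
from urllib.parse import urlparse

ALLOWLIST = {"api.example.com"}           # hypothetical approved destination
SECRETS = {"api.example.com": "sk-real"}  # raw value stays outside model context

def egress(url: str, headers: dict) -> dict:
    """Sidecar-style check: enforce the allowlist, then swap the placeholder
    for the real secret only for approved hosts."""
    host = urlparse(url).hostname
    if host not in ALLOWLIST:
        raise PermissionError(f"egress to {host} is not allowed")
    resolved = {
        k: SECRETS[host] if v == "$SECRET" else v for k, v in headers.items()
    }
    return {"url": url, "headers": resolved}

# The model only ever sees the "$SECRET" placeholder.
req = egress("https://api.example.com/v1/data", {"Authorization": "$SECRET"})
```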



Container setup · Accessing network

Agent skills




Shell commands are powerful, but many tasks repeat the same multi-step patterns. Agents have to rediscover the workflow each run, replanning, reissuing commands, and relearning conventions, which leads to inconsistent results and wasted execution. Agent skills package those patterns into reusable, composable building blocks. Concretely, a skill is a folder bundle that includes a `SKILL.md` file (containing metadata and instructions) plus any supporting resources, such as API specs and UI assets.


This structure maps naturally to the runtime architecture we described earlier. The container provides persistent files and execution context, and the shell tool provides the execution interface. With both in place, the model can discover skill files using shell commands (`ls`, `cat`, etc.) when it needs to, interpret instructions, and run skill scripts all in the same agent loop.


We provide APIs⁠ to manage skills in the OpenAI platform. Developers upload and store skill folders as versioned bundles, which can later be retrieved by skill ID. Before sending the prompt to the model, the Responses API loads the skill and includes it in model context. This sequence is deterministic:


  1. Fetch skill metadata, including name and description.
  2. Fetch the skill bundle, copy it into the container, and unpack it.
  3. Update model context with skill metadata and the container path.

When deciding whether a skill is relevant, the model progressively explores its instructions, and executes its scripts through shell commands in the container.
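
The three deterministic steps above can be sketched as follows; the bundle shape and the `make-report` skill are hypothetical:

```python
import pathlib
import tempfile

def load_skill(bundle: dict, container_root: pathlib.Path) -> dict:
    """Deterministic skill setup: write the bundle into the container and
    return the metadata entry that gets added to model context."""
    skill_dir = container_root / bundle["name"]
    skill_dir.mkdir(parents=True)
    for relpath, content in bundle["files"].items():
        path = skill_dir / relpath
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)
    return {
        "name": bundle["name"],
        "description": bundle["description"],
        "path": str(skill_dir),  # the model explores from here with ls/cat
    }

bundle = {
    "name": "make-report",
    "description": "Builds a summary report from CSV inputs",
    "files": {"SKILL.md": "# make-report\nRun scripts/report.py on the input.\n"},
}
entry = load_skill(bundle, pathlib.Path(tempfile.mkdtemp()))
```

From here, the model treats the skill like any other files in the container: it reads `SKILL.md`, follows the instructions, and runs the bundled scripts through the shell tool.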


How agents are made




To put all the pieces together: the Responses API provides orchestration, the shell tool provides executable actions, the hosted container provides persistent runtime context, the skills layer provides reusable workflow logic, and compaction lets an agent run for a long time with the context it needs.


With these primitives, a single prompt can expand into an end-to-end workflow: discover the right skill, fetch data, transform it into local structured state, query it efficiently, and generate durable artifacts. 


The diagram below shows how this system works for creating a spreadsheet from live data.



Skill discovery · Planning and setup · Data acquisition · Data processing · Artifact generation

The Responses API orchestrates an agentic task


Make your own agent




For an in-depth example of combining the shell tool and computer environment for end-to-end workflows, see our developer blog post⁠ and cookbook⁠ walking through packaging a skill and executing it through the Responses API.


We’re excited to see what developers build with this set of primitives. Language models are meant to do more than generate text, images, and audio, and we’ll continue to evolve our platform to become more capable at handling complex, real-world tasks at scale.


