Continuously hardening ChatGPT Atlas against prompt injection

OpenAI News

ChatGPT Atlas 中的 Agent mode 是我们迄今推出的最通用的智能代理功能之一。在该模式下，浏览器代理会像你本人一样查看网页并在浏览器内执行操作、点击和键入，从而使 ChatGPT 能在同一空间、语境和数据下直接处理许多日常工作流程。

随着浏览器代理帮助用户完成更多任务，它也成为对手更有价值的攻击目标，因此 AI 安全变得尤为重要。早在推出 ChatGPT Atlas 之前，我们就持续构建并强化针对这种“浏览器内代理”范式的新兴威胁的防御措施。 prompt injection 是我们积极防护的最重要风险之一，旨在确保 ChatGPT Atlas 能在你的授权下安全运行。

作为该项工作的组成部分，我们近期向 Atlas 的浏览器代理推送了一次安全更新，包含经过对抗性训练的新模型和更严格的周边防护。此次更新是由我们内部自动化红队发现的一类新型 prompt injection 攻击促发的。

本文将解释基于网页的代理如何面临 prompt injection 风险，并揭示我们为持续发现新攻击并快速推出缓解措施而构建的快速响应闭环——以这次安全更新为例加以说明。

我们把 prompt injection 视为一个长期的 AI 安全挑战，需要像防范不断演进的网络诈骗那样持续加强防御。最近的快速响应周期作为这一长期工作的重要手段已展现早期成效：我们能在这些攻击广泛出现前于内部发现新策略。我们的长期愿景是充分利用（1）对自家模型的白盒访问、（2）对防御机制的深入理解以及（3）算力规模优势，始终领先外部攻击者——更早发现漏洞、更快发布补丁并持续收紧反馈闭环。配合前沿研究的新技术与对其他安全控制的更多投入，这一复合循环可以使攻击愈发困难且成本更高，从而在现实中显著降低 prompt injection 风险。最终，我们的目标是让你能够像信任一位能力强、具备安全意识的同事或朋友那样信任 ChatGPT 代理来使用你的浏览器。

将 prompt injection 视为代理安全的开放性挑战

prompt injection 攻击通过将恶意指令嵌入代理会处理的内容来针对 AI 代理。这类指令被精心设计以覆盖或重定向代理的行为——让代理遵循攻击者的意图而非用户的指令。

对于像 ChatGPT Atlas 这样的浏览器代理， prompt injection 带来了超越传统网页安全风险（如用户错误或软件漏洞）的新威胁向量。攻击者不用去钓鱼人或利用浏览器系统漏洞，而是直接针对运行于浏览器内的代理。

举个假设性例子，攻击者可能发送一封恶意邮件，诱导代理忽略用户请求，改为将敏感税务文件转发到攻击者控制的邮箱。如果用户要求代理查看未读邮件并总结要点，代理在执行流程时可能会处理到该恶意邮件；若其遵循了被注入的指令，就会偏离任务并错误泄露敏感信息。

这只是具体情形之一。浏览器代理之所以有用，正是因为其通用性，但这也扩大了风险面：代理可能在几乎无限的表面上遇到不受信任的指令——电子邮件与附件、日历邀请、共享文档、论坛、社交媒体帖子及任意网页。因为代理可以在浏览器中执行许多与用户相同的操作，成功攻击的影响同样可能非常广泛：转发敏感邮件、发起付款、编辑或删除云端文件等皆有可能。

我们已通过多层防护在抵御 prompt injection 方面取得进展，详见此前发布的说明。但 prompt injection 仍然是代理安全上的未决问题，我们预计未来数年都将持续投入解决。

通过端到端、高算力强化学习实现自动化 prompt injection 攻击发现

为强化防御，我们持续在生产环境中搜索针对代理系统的新型 prompt injection 攻击。发现这些攻击是构建稳健缓解措施的必要前提：它有助于理解现实风险、暴露防御薄弱环节并推动具体修补。

为实现规模化发现，我们构建了基于 LLM 的自动化攻击器，并训练它去寻找能成功攻破浏览器代理的 prompt injection 攻击。我们以强化学习进行端到端训练，使该攻击器能从自我成败中学习并提升红队技能。我们还让它能“先试后发”：在其链路式思考过程中，攻击器可以提出候选注入并将其发送给外部模拟器。该模拟器会对目标受害代理（防御方）遇到注入时的反事实执行进行演算，并返回受害代理的完整推理与动作轨迹。攻击器将该轨迹作为反馈，迭代改进攻击并重新运行模拟——在最终提交攻击前重复多次。这为攻击器提供了比单一成败信号更丰富的上下文内反馈，同时也放大了测试时的计算量。此外，内部对防御方推理轨迹的特权访问（这些轨迹不对外披露）赋予我们的内部攻击器不对称优势——提高了其击败外部对手的概率。

为什么采用强化学习（ RL ）？我们选用强化学习训练自动攻击器，原因有多方面：

优化长时序且非连续的攻击目标。我们的目标是寻找能诱使代理执行复杂对抗任务（如发邮件、银行交易）的 prompt injection，这类任务本质上是长时序的，需要多步推理与环境交互，并且成功信号稀疏且延迟。强化学习适合处理这种稀疏、延迟的奖励结构。
利用前沿 LLM 能力。我们直接把前沿 LLM 训练为自动红队，使攻击器能直接受益于模型在推理与规划方面的进步。随着基础模型能力提升，攻击器自然更强——这为在模型发展过程中持续给防御施压提供了可扩展路径。
放大算力并模拟适应性攻击者。强化学习适合把算力投入到大量采样和学习步骤上，以搜索攻击策略，同时也近似反映了人类攻击者迭代试错、从结果中学习并强化成功策略的适应性行为。

我们的自动攻击器能够端到端发现新颖且现实的 prompt injection 攻击。不同于以往大多数自动化红队工作仅能暴露诸如诱导特定输出字符串或触发代理意外单步工具调用的简单失效，我们用 RL 训练的攻击器可引导代理执行复杂且长时序的有害工作流，这类工作流可能跨越数十甚至上百步。同时，我们观察到的一些新型攻击策略在我们的人类红队行动或外部报告中未曾出现。

下面的演示展示了自动攻击器发现的一个具体 prompt injection 利用实例，我们据此进一步强化了 ChatGPT Atlas 的防御。攻击者在用户收件箱里投放一封含有注入指令的恶意邮件，指示代理向用户的 CEO 发送辞职信。随后，当用户要求代理草拟自动回复（out-of-office），代理在正常执行任务时读取了该邮件，将注入的指令视为权威并照做：自动回复没有被写出，代理反而替用户发送了辞职信。

演示要点： 1）请求代理管理邮件时，代理打开最新未读邮件； 2）邮件正文中隐藏或嵌入了欺骗性的“系统指令”； 3）若代理遵循这些注入指令，可能会在未经用户确认的情况下执行诸如发送辞职邮件等意外操作； 4）在我们发布安全更新后， Agent mode 成功检测到类似注入尝试，并在助手面板中给出提示，要求用户确认下一步操作。

Prompt injection 的本质使得要提供确定性安全保证变得困难，但通过扩大我们的自动化安全研究、对抗性测试并收紧快速响应闭环，我们能够在攻击流行于野外之前改进模型鲁棒性与防护能力。

我们分享此演示以帮助用户与研究者更好地理解这些攻击的本质——以及我们如何主动防御它们。我们认为这代表了自动化红队能达成的前沿成果，并对继续推进相关研究感到非常振奋。

以主动快速响应闭环来强化 ChatGPT Atlas

我们的自动化红队驱动起一条主动的快速响应闭环：当自动攻击器发现一类成功的 prompt injection 攻击时，它立刻为改进防御创建了具体目标。

对新发现攻击进行对抗性训练。我们持续用表现最好的自动攻击器对更新后的代理模型进行训练，优先针对当前代理易失手的攻击样式。目标是教会代理忽略对抗性指令、遵循用户意图，从而提升对新型 prompt injection 策略的抵抗力。此类训练把针对新颖、高强度攻击的鲁棒性直接“烧录”进模型检查点。例如，最近的自动化红队直接促成了一个新的、经对抗性训练的浏览器代理检查点，并已向所有 ChatGPT Atlas 用户推送。这最终有助于更好地保护用户免受新型攻击。

用攻击轨迹改进更广泛的防御栈。自动红队发现的许多攻击路径也揭示了模型之外可改进之处——例如监测手段、我们在模型上下文中放置的安全指令或系统级保护。这些发现帮助我们迭代完整的防御栈，而不仅仅是代理检查点。

应对现实中正在发生的攻击。该闭环还能更好地响应野外活跃攻击。当我们在全球范围内监测到潜在攻击时，可以把外部对手使用的技术和战术输入到此闭环中，模拟其活动并在整个平台上推动防御性改进。

展望：我们对代理安全的长期承诺

加强对代理的红队能力并用最强模型自动化部分工作——有助于通过放大发现到修复的闭环，使 Atlas 浏览器代理更为稳健。这一加固工作印证了安全领域的一条老道理：持续在真实系统上施加压力测试、对失败做出反应并发布具体修复，是走向更强保护的行之有效路径。

我们预期对手将持续适应。像网络诈骗与社会工程一样， prompt injection 不太可能被彻底“解决”。但我们乐观地认为，一个主动、高响应性的快速响应闭环能随着时间显著降低现实风险。通过将自动化攻击发现、对抗性训练和系统级防护结合起来，我们可以更早识别新攻击模式、更快弥补漏洞，并持续提高利用成本。

ChatGPT Atlas 的 Agent mode 功能强大——但同时也扩展了安全威胁面。对此权衡保持清醒认识是负责任建设的一部分。我们的目标是通过每一次迭代显著提升 Atlas 的安全性：提高模型鲁棒性、强化周边防御栈，并监测野外的新型滥用模式。

我们将持续在研究与部署上投入，开发更好的自动化红队方法、推出分层缓解措施并在实践中快速迭代。同时我们也会尽可能把经验分享给更广泛的社区。

关于安全使用代理的建议

在我们继续从系统层面强化 Atlas 的同时，用户也可以采取若干措施以降低使用代理的风险。

在可行时限制已登录访问。我们仍建议用户在使用 Agent in Atlas 时尽量使用未登录模式（当任务不需要访问你已登录的网站）或限制任务期间仅登录特定站点，以减少风险暴露。
仔细审查确认请求。对于某些后果严重的操作（如完成购买或发送邮件），代理在执行前会要求你确认。当代理请求确认时，请花时间核实该操作是否正确，以及所共享的信息是否适合在该语境下披露。
尽量给出明确指令。避免像“查看我的邮件并采取必要行动”这样过于宽泛的提示。过大的执行权限会让隐藏或恶意内容更容易影响代理，即便存在防护措施。更安全的做法是要求代理执行具体、范围明确的任务。虽然这不能完全消除风险，但会使攻击更难实施。

如果代理要成为日常任务中可信赖的伙伴，它们必须能抵御开放网络所允许的各种操纵。加固对 prompt injection 的防御是长期承诺，也是我们的优先事项之一。我们将很快分享更多相关工作进展。

Agent mode in ChatGPT Atlas is one of the most general-purpose agentic features we’ve released to date. In this mode, the browser agent views webpages and takes actions, clicks, and keystrokes inside your browser, just as you would. This allows ChatGPT to work directly on many of your day-to-day workflows using the same space, context, and data.

As the browser agent helps you get more done, it also becomes a higher-value target of adversarial attacks. This makes AI security especially important. Long before we launched ChatGPT Atlas, we’ve been continuously building and hardening defenses against emerging threats that specifically target this new “agent in the browser” paradigm. Prompt injection⁠ is one of the most significant risks we actively defend against to help ensure ChatGPT Atlas can operate securely on your behalf.

As part of this effort, we recently shipped a security update to Atlas’s browser agent, including a newly adversarially trained model and strengthened surrounding safeguards. This update was prompted by a new class of prompt-injection attacks uncovered through our internal automated red teaming.

In this post, we explain how prompt-injection risk can arise for web-based agents, and we share a rapid response loop we’ve been building to continuously discover new attacks and ship mitigations quickly—illustrated by this recent security update.

We view prompt injection as a long-term AI security challenge, and we’ll need to continuously strengthen our defenses against it (much like ever-evolving online scams that target humans). Our latest rapid response cycle is showing early promise as a critical tool on that journey: we’re discovering novel attack strategies internally before they show up in the wild. Our long-term vision is to fully leverage (1) our white-box access to our models, (2) deep understanding of our defenses, and (3) compute scale to stay ahead of external attackers—finding exploits earlier, shipping mitigations faster, and continuously tightening the loop. Combined with frontier research on new techniques to address prompt injection and increased investment in other security controls, this compounding cycle can make attacks increasingly difficult and costly, materially reducing real-world prompt-injection risk. Ultimately, our goal is for you to be able to trust a ChatGPT agent to use your browser the way you’d trust a highly competent, security-aware colleague or friend.

Prompt injection as an open challenge for agent security

A prompt injection attack targets AI agents by embedding malicious instructions into content the agent processes. Those instructions are crafted to override or redirect the agent’s behavior—hijacking it into following an attacker’s intent, rather than the user’s.

For a browser agent like the one inside ChatGPT Atlas, prompt injection adds a new threat vector beyond traditional web security risks (like user error or software vulnerabilities). Instead of phishing humans or exploiting system vulnerabilities of the browser, the attacker targets the agent operating inside it.

As a hypothetical example, an attacker could send a malicious email attempting to trick an agent to ignore the user’s request and instead forward sensitive tax documents to an attacker-controlled email address. If a user asks the agent to review unread emails and summarize key points, the agent may ingest that malicious email during the workflow. If it follows the injected instructions, it can go off-task—and wrongly share sensitive information.

This is just one specific scenario. The same generality that makes browser agents useful also makes the risks broader: the agent may encounter untrusted instructions across an effectively unbounded surface area—emails and attachments, calendar invites, shared documents, forums, social media posts, and arbitrary webpages. Since the agent can take many of the same actions a user can take in a browser, the impact of a successful attack can hypothetically be just as broad: forwarding a sensitive email, sending money, editing or deleting files in the cloud, and more.

We’ve made progress defending against prompt injection through multiple layers of safeguards, as we shared in an earlier post⁠. However, prompt injection remains an open challenge for agent security, and one we expect to continue working on for years to come.

Automated prompt injection attack discovery through end-to-end and high-compute reinforcement learning

To strengthen our defenses, we’ve been continuously searching for novel prompt injection attacks against agent systems in production. Finding these attacks is a necessary prerequisite for building robust mitigations: it helps us understand real-world risk, exposes gaps in our defenses, and drives concrete patches.

To do this at scale, we built an LLM-based automated attacker and trained it to hunt for prompt injection attacks that can successfully attack a browser agent. We trained this attacker end-to-end with reinforcement learning, so it learns from its own successes and failures to improve its red teaming skills. We also let it “try before it ships”, by which we mean: during its chain of thought reasoning, the attacker can propose a candidate injection and send it to an external simulator. The simulator runs a counterfactual rollout of how the targeted victim agent (the defender) would behave if it encountered the injection, and returns a full reasoning and action trace of the victim agent. The attacker uses that trace as feedback, iterates on the attack, and reruns the simulation—repeating this loop multiple times before committing to a final attack. This provides richer in-context feedback to the attacker than a single pass/fail signal. It also scales up the attacker’s test-time compute. Moreover, privileged access to the reasoning traces (that we don’t disclose to external users) of the defender gives our internal attacker an asymmetric advantage—raising the odds that it can outrun external adversaries.

Why reinforcement learning (RL)? We chose reinforcement learning to train the automated attacker for multiple reasons:

Optimizing long-horizon and non-continuous attacker objectives. Our goal is to search for prompt injection attacks that can trick the agent into executing sophisticated adversarial tasks (e.g., sending emails, bank transactions) that could occur in the real world. These adversarial tasks are inherently long-horizon, requiring many steps of reasoning and interaction with the environment, with sparse and delayed success signals. Reinforcement learning is well-suited to this sparse, delayed reward structure.
Leveraging frontier LLM capabilities. We trained frontier LLMs directly as auto-red-teamers, so the attacker benefits directly from improvements in reasoning and planning in frontier models. As base models get stronger, the attacker naturally becomes more capable as well—making this a scalable way to keep pressure on our defenses as our models evolve.
Scaling compute and mimicking adaptive attackers. Reinforcement learning is well suited for scaling computation spent on searching for attacks over large numbers of samplings and learning steps, and it also closely reflects how adaptive human attackers behave: iteratively trying strategies, learning from outcomes, and reinforcing successful behaviors.

Our automated attacker can discover novel, realistic prompt-injection attacks end-to-end. Unlike most prior automated red teaming work, which surfaced simple failures such as eliciting specific output strings or triggering an unintended single-step tool call from the agent, our RL-trained attacker can steer an agent into executing sophisticated, long-horizon harmful workflows that unfold over tens (or even hundreds) of steps. We also observed novel attack strategies that did not appear in our human red teaming campaign or external reports.

The demo below presents a concrete prompt injection exploit found by our automated attacker, which we then used to further harden the defenses of ChatGPT Atlas. The attacker seeds the user’s inbox with a malicious email containing a prompt injection that directs the agent to send a resignation letter to the user's CEO. Later, when the user asks the agent to draft an out-of-office reply, the agent encounters that email during normal task execution, treats the injected prompt as authoritative, and follows it. The out-of-office never gets written and the agent resigns on behalf of the user instead.

1. Asking agent for help managing email

2. Agent opens the latest unread email

3. The email has malicious instructions

4. Agent send unintended resignation email

5. Following our security update, agent mode successfully detects a prompt injection attempt

The nature of prompt injection makes deterministic security guarantees challenging, but by scaling our automated security research, adversarial testing, and tightening our rapid response loop, we are able to improve the model’s robustness and defenses - before waiting for an attack to occur in the wild.

We're sharing this demo to help users and researchers better understand the nature of these attacks—and how we are actively defending against them. We believe this represents the frontier of what automated red teaming can accomplish, and we are extremely excited to continue our research.

Hardening ChatGPT Atlas with a proactive rapid response loop

Our automated red teaming is driving a proactive rapid response loop: when the automated attacker discovers a new class of successful prompt injection attacks, it immediately creates a concrete target for improving our defenses.

Adversarially training against newly discovered attacks. We continuously train updated agent models against our best automated attacker—prioritizing the attacks where the target agents currently fail. The goal is to teach agents to ignore adversarial instructions and stay aligned with the user’s intent, improving resistance to newly discovered prompt-injection strategies. This “burns in” robustness against novel, high-strength attacks directly into the model checkpoint. For example, recent automated red teaming directly produced a new adversarially trained browser-agent checkpoint that has already been rolled out to all ChatGPT Atlas users. This ultimately helps better protect our users against new types of attacks.

Using attack traces to improve the broader defense stack. Many attack paths discovered by our automated red teamer also reveal opportunities for improvement outside of the model itself—such as in monitoring, safety instructions we put in the model’s context, or system-level safeguards. Those findings help us iterate on the full defense stack, not just the agent checkpoint.

Responding to active attacks. This loop can also help better respond to active attacks in the wild. As we look across our global footprint for potential attacks, we can take the techniques and tactics we observe external adversaries using, feed them into this loop, emulate their activity, and drive defensive change across our platform.

Outlook: our long-term commitment to agent security

Strengthening our ability to red team agents and using our most capable models to automate parts of that work—helps make the Atlas browser agent more robust by scaling the discovery-to-fix loop. This hardening effort reinforces a familiar lesson from security: a well-worn path to stronger protection is to continuously pressure-test real systems, react to failures, and ship concrete fixes.

We expect adversaries to keep adapting. Prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully “solved”. But we’re optimistic that a proactive, highly responsive rapid response loop can continue to materially reduce real-world risk over time. By combining automated attack discovery with adversarial training and system-level safeguards, we can identify new attack patterns earlier, close gaps faster, and continuously raise the cost of exploitation.

Agent mode in ChatGPT Atlas is powerful—and it also expands the security threat surface. Being clear-eyed about that tradeoff is part of building responsibly. Our goal is to make Atlas meaningfully more secure with every iteration: improving model robustness, strengthening the surrounding defense stack, and monitoring for emerging abuse patterns in the wild.

We’ll continue investing across research and deployment, developing better automated red teaming methods, rolling out layered mitigations, and iterating quickly as we learn. We’ll also share what we can with the broader community.

Recommendations for using agents safely

While we continue to strengthen Atlas at the system level, there are steps users can take to reduce risk when using agents.

Limit logged-in access when possible. We continue to recommend that users take advantage of logged-out mode⁠ when using Agent in Atlas whenever access to websites you’re logged in to isn’t necessary for the task at hand, or to limit access to specific sites you sign-in to during the task.

Carefully review confirmation requests. For certain consequential actions, such as completing a purchase or sending an email, agents are designed to ask for your confirmation before proceeding. When an agent asks you to confirm an action, take a moment to verify that the action is correct and that any information being shared is appropriate for that context.

Give agents explicit instructions when possible. Avoid overly broad prompts like “review my emails and take whatever action is needed.” Wide latitude makes it easier for hidden or malicious content to influence the agent, even when safeguards are in place. It’s safer to ask the agent to perform specific, well-scoped tasks. While this does not eliminate risk, it makes attacks harder to carry out.

If agents are to become trusted partners for everyday tasks, they must be resilient to the kinds of manipulation the open web enables. Hardening against prompt injection is a long-term commitment and one of our top priorities. We’ll be sharing more on this work soon.

Generated by RSStT. The copyright belongs to the original author.

Source

Continuously hardening ChatGPT Atlas against prompt injection

Report Page