Introducing gpt-oss-safeguard

OpenAI News

今天我们发布了研究预览版 gpt-oss-safeguard ，这是面向安全分类任务的开源权重推理模型，提供两种规模： gpt-oss-safeguard-120b 和 gpt-oss-safeguard-20b 。这两款模型是基于 gpt-oss 开源模型的微调版本，采用同样宽松的 Apache 2.0 许可证，任何人均可自由使用、修改和部署。模型已在 Hugging Face 上线，今天即可下载。

这类模型在推理时直接利用推理能力去解释开发者提供的政策——按开发者需求对用户信息、模型回复或整段对话进行分类。开发者始终决定采用何种政策，因此判定结果更贴合具体应用场景。模型产出链式思考（chain-of-thought），开发者可以审阅其推理过程，了解模型如何得出结论。此外，政策是在推理时提供的，而不是训练时内嵌到模型中，这使得开发者能够方便地迭代修订政策以提升效果。我们最初在内部开发并使用的这种方法，比传统通过大量标注样本训练分类器、间接推断决策边界的方法灵活得多。

gpt-oss-safeguard 允许开发者自行划定最适合其场景的政策界限。例如，游戏讨论论坛可以定义一套政策去判别讨论作弊的帖子；产品评论平台可以用自己的政策筛查可能是虚假的评论。

模型同时接受两类输入——一条政策和一段需按该政策分类的内容，输出内容所属类别及相应的推理过程。开发者可自行决定是否以及如何在自己的安全流程中利用这些结论。我们发现基于推理的方法在下列情形下特别有效：

潜在危害正在出现或演变，政策需要快速适应；
领域高度细微，较小的分类器难以胜任；
开发者没有足够样本为平台上的每类风险训练高质量分类器；
相比延迟，更看重产出高质量且可解释的标签。

此次发布 gpt-oss-safeguard 的预览版，目的是征求研究与安全社区的反馈，并继续迭代模型表现。在过去数月中，我们与 ROOST 合作，识别开发者的关键需求、测试模型并编写开发者文档。作为此次发布的一部分， ROOST 今天还将启动一个模型社区（Model Community），以探索用于保护在线空间的开源 AI 模型。我们同时发布了一份简要技术报告，详述该预览模型的安全性能。

体系级安全：安全分类器的角色

在安全方面，我们信奉多重防线（defense in depth）的理念。我们既训练模型以实现安全响应，也在系统中加设额外保护层来检测并按政策处理可能不安全的输入与输出。长期以来，能够在特定风险领域区分安全与不安全内容的安全分类器，一直是我们和其他大型语言模型的重要防线之一。

传统的安全分类器（例如通过 Moderation API 提供的那些）通常通过人工整理成千上万条在预定义安全政策下标注为安全或不安全的示例来开发。分类器从这些训练数据中学习区分安全与不安全输出。在这种传统做法中，分类器并未直接“看到”安全政策，而是试图从被标注为不安全的内容间找到相似性、以及不安全与安全内容间的差异，从而间接推断出背后的政策标准。

传统分类器在性能、延迟和运行成本上常有优势，但收集足够的训练示例既费时又昂贵，且一旦政策变更，通常需要重新训练分类器。

gpt-oss-safeguard 的不同之处在于，其推理能力允许开发者直接应用任意政策，包括自行撰写或外部引入的政策，推理还能帮助模型对新写的政策进行泛化。除了安全政策外， gpt-oss-safeguard 也能用于针对产品或平台特定需求的其他标签化任务。

我们内部如何使用安全推理

我们的主力推理模型现在直接学习我们的安全政策，并利用推理能力判断什么是安全——我们称之为“审慎对齐”（deliberative alignment）。这一方法比早期的安全训练手段有显著改进，使得随着能力提升，推理模型在多个维度上比不具推理能力的前代模型更安全。但推理的价值不仅在于训练模型本身，它也为多重防线带来新可能：基于推理的方法更灵活，不那么受限于此前训练的细节，尽管这有时会带来额外的算力和延迟开销，但其优势往往足以抵消这些成本。

gpt-oss-safeguard 是我们内部方法在开源权重上的实现，源于我们称为 Safety Reasoner 的内部工具。我们首先在政策标注任务上进行了强化微调，奖励模型与人类专家的正确判断一致，从而教会模型围绕政策进行推理。如今， Safety Reasoner 使我们能够在生产环境中动态更新安全政策，所需时间远少于重新训练分类器。这使得 Safety Reasoner 成为迭代部署的重要工具：在将新模型投产时，我们常从更严格的政策开始，并在必要时分配较高算力，让 Safety Reasoner 仔细执行这些政策；随着对生产中风险理解的加深，我们再调整政策。在部分近期上线中，用于安全推理的总算力占比一度高达 16%。

Safety Reasoner 已成为我们安全体系的核心组成部分。在图像生成与 Sora 2 中，它能对生成结果进行动态、分步评估，以实时识别并阻断不安全的生成。在生物和自伤等领域，我们会先用类似于 Moderation API 中使用的那类小而快、召回率高的分类器来筛选出属相关关注域的内容，然后用 Safety Reasoner 进行复审。 Safety Reasoner 会针对详细的分类法对模型输出进行判定，以决定最佳应对方式，作为覆盖 GPT‑5 与 ChatGPT Agent 等系统的多层防护的一部分。现在， gpt-oss-safeguard 将同样的方法向任何人开放。

gpt-oss-safeguard 的表现

我们在内部和外部评估集上对 gpt-oss-safeguard 进行了评测。

在内部评估中，我们在推理时同时向模型提供多条政策。对每个测试输入，我们检查 gpt-oss-safeguard 是否能在所含的所有政策下都做出正确分类。这是一项难度较大的任务：只有当模型在所有政策上都与金标准标签完全一致时，才被计为准确。

在多政策准确率上， gpt-oss-safeguard 与内部的 Safety Reasoner 均优于 gpt-5-thinking 和 gpt-oss 开源模型。考虑到 gpt-oss-safeguard 的模型规模较小，能够超过 gpt-5-thinking 的表现尤其令人意外。

我们还在随 2022 年研究论文一并发布的审查数据集（moderation dataset）和基于开源聊天机器人用户查询的公开基准 ToxicChat 上做了评测。

在 2022 年的审查评测集上，我们按内部安全政策（并对其中一条政策做了后续更改的适配）进行评估。 gpt-oss-safeguard 在该测试中略微优于其他被测模型，包括我们的内部 Safety Reasoner 与 gpt-5-thinking ，不过 Safety Reasoner 与 gpt-oss-safeguard 之间的差异并不具有统计学显著性。

在 ToxicChat 上，我们用从内部政策改编的一段简短手写提示进行评估。内部的 Safety Reasoner 再次优于 gpt-5-thinking ，而 gpt-5-thinking 与 Safety Reasoner 在该基准上又略优于 gpt-oss-safeguard-120b 和 gpt-oss-safeguard-20b 。考虑到模型体量较小，我们认为 gpt-oss-safeguard 在这类任务中仍具有实际使用优势。

局限性

gpt-oss-safeguard 存在两点具体局限。其一，我们观察到在拥有数万条高质量标注样本的情况下，为特定任务训练出的分类器在内容分类上仍可能优于直接从政策推理的 gpt-oss-safeguard 。对于更复杂的风险，若追求更高性能，投入时间训练专门的分类器仍可能更合适。

其二， gpt-oss-safeguard 在时间与算力上可能较为昂贵，难以在平台上对所有内容大规模展开。我们在内部通过几种方式应对这一点：一是先用更小更快的分类器筛选出需要评估的内容；二是在某些情形下以异步方式运行 Safety Reasoner ，以在保持可干预能力的同时提供低延迟的用户体验。

未来：与社区继续共建

gpt-oss-safeguard 是 OpenAI 首批与社区共同打造的开放安全模型。我们在早期测试中与 SafetyKit 、 ROOST 和 Discord 的信任与安全专家反复迭代。 ROOST 的首席技术官 Vinay Rao 表示：“ gpt-oss-safeguard 是首个采用‘自带政策与伤害定义’设计的开源推理模型。组织应有权自由研究、修改和使用关键的安全技术并进行创新。在我们的测试中，它在理解不同政策、解释推理过程以及在应用政策时表现出细腻判断方面都很出色，这将对构建者和安全团队带来帮助。”

我们将继续与社区迭代改进开源安全工具，包括通过 ROOST Model Community （ RMC ）推进。该社区汇聚安全从业者与研究人员，分享将开源 AI 模型纳入安全工作流的最佳实践，包括评估结果与模型反馈。访问 GitHub 上的 RMC 仓库可了解更多合作详情与参与方式。

想开始使用这些模型，可从 Hugging Face 下载 gpt-oss-safeguard 。

Today, we’re releasing a research preview of gpt-oss-safeguard, our open-weight reasoning models for safety classification tasks, available in two sizes: gpt-oss-safeguard-120b and gpt-oss-safeguard-20b. These models are fine-tuned versions of our gpt-oss⁠ open models and available under the same permissive Apache 2.0 license, allowing anyone to use, modify, and deploy them freely. Both models can be downloaded today from Hugging Face⁠.

The gpt-oss-safeguard models use reasoning to directly interpret a developer-provided policy at inference time—classifying user messages, completions, and full chats according to the developer’s needs. The developer always decides what policy to use, so responses are more relevant and tailored to the developer’s use case. The model uses chain-of-thought, which the developer can review to understand how the model is reaching its decisions. Additionally, the policy is provided during inference, rather than being trained into the model, so it is easy for developers to iteratively revise policies to increase performance. This approach, which we initially developed for internal use, is significantly more flexible than the traditional method of training a classifier to indirectly infer a decision boundary from a large number of labeled examples.

gpt-oss-safeguard enables developers to draw the policy lines that best fit their use case. For instance, a video gaming discussion forum might want to develop a policy to classify posts that discuss cheating in the game, or a product reviews site might want to use its own policy to screen reviews that appear likely to be fake.

The model takes two inputs at once—a policy and the content to classify under that policy—and outputs a conclusion about where the content falls, along with its reasoning. Developers decide how, if at all, to use those conclusions in their own safety pipelines. We’ve seen this reasoning-based approach perform especially well in situations where:

The potential harm is emerging or evolving, and policies need to adapt quickly.
The domain is highly nuanced and difficult for smaller classifiers to handle.
Developers don’t have enough samples to train a high-quality classifier for each risk on their platform.
Latency is less important than producing high-quality, explainable labels.

We’re releasing this preview of gpt-oss-safeguard to receive feedback from the research and safety community and iterate further on model performance. Over months, we worked on this open weight release with ROOST⁠ to identify developer’s critical needs, test the model and produce developer documentation. As part of this launch ROOST will be establishing a model community⁠, also launching today, to explore open AI models to protect online spaces. Alongside this release, we’re publishing a short technical report⁠ that details the safety performance of this preview model.

System-level safety: the role of safety classifiers

When it comes to safety, we believe in defense in depth⁠. We train our models to respond safely, and we implement additional layers of protection to detect and address potentially unsafe inputs and outputs under our policies. Safety classifiers, which distinguish safe from unsafe content in a particular risk area, have long been a primary layer of defense for our own and other large language models.

Traditional safety classifiers, such as those available via our Moderation API⁠, are developed by manually curating thousands of examples of safe and unsafe content, under pre-defined safety policies. From this training data, the classifier learns to distinguish safe from unsafe outputs. In this traditional approach, the classifier never actually sees the safety policy. Instead, it attempts to infer the underlying policy that was used to label the examples by finding similarities in the content labeled as unsafe and differences between the unsafe and safe content.

Traditional classifiers can have high performance, with low latency and operating cost. But gathering a sufficient quantity of training examples can be time-consuming and costly, and updating or changing the policy requires re-training the classifier.

gpt-oss-safeguard is different because its reasoning capabilities allow developers to apply any policy, including ones they write themselves or draw from other sources, and reasoning helps the models generalize over newly written policies. Beyond safety policies, gpt-oss-safeguard can be used to label content in other ways that are important to specific products and platforms.

How we use safety reasoning internally

Our primary reasoning models now learn our safety policies directly, and use their reasoning capabilities to reason about what’s safe. This approach, which we call deliberative alignment⁠, significantly improves on earlier safety training methods and makes our reasoning models safer on several axes than their non-reasoning predecessors, even as their capabilities increase. But reasoning isn’t only useful for training the models themselves. It also creates new possibilities for defense in depth. Reasoning-based approaches are more flexible and less limited by the details of their previous training, advantages that sometimes more than justify the additional compute cost and latency they involve.

gpt-oss-safeguard is an open-weight implementation of an approach we developed internally, in a tool we call Safety Reasoner. We began with reinforcement fine-tuning on policy labelling tasks, rewarding the model for mirroring correct judgments from human experts. This taught the model to reason about how the policy leads to its judgment. Today, Safety Reasoner enables us to dynamically update our safety policies in production in less time than it would take to retrain a classifier. This makes Safety Reasoner a key tool for iterative deployment⁠: when we deploy new models to production, we often start with more strict policies and use relatively large amounts of compute where needed to enable Safety Reasoner to carefully apply those policies. Then we adjust our policies as our understanding of the risks in production improves. In some of our recent launches, the fraction of total compute devoted to safety reasoning has ranged as high as 16%.

Safety Reasoner has become a core component of our safety stack. For image generation and Sora 2, it performs dynamic, step-wise evaluations of outputs to identify and block unsafe generations in real time. In domains such as biology and self-harm, we run models like those used in the Moderation API as small, fast, and high-recall classifiers to determine which content is within a domain of interest, and then use Safety Reasoner to review that content. Safety Reasoner classifies model outputs against a detailed taxonomy to determine how best to respond, forming part of our multi-layered safeguards across systems like GPT‑5 and ChatGPT Agent. And now, the gpt-oss-safeguard models make this same approach available to anyone.

How gpt-oss-safeguard performs

We evaluated the gpt-oss-safeguard models on both internal and external evaluation sets.

In the internal evaluation, we provided multiple policies simultaneously to gpt-oss-safeguard at inference time. For each test input, we evaluate whether gpt-oss-safeguard correctly classifies the text under all of the included policies. This is a challenging task—the model is counted as accurate only if it exactly matches the golden set labels for all the included policies.

Our gpt-oss-safeguard models and internal Safety Reasoner outperform gpt-5-thinking and the gpt-oss open models on multi-policy accuracy. The gpt-oss-safeguard models outperforming gpt-5-thinking is especially surprising given the former models’ small size.

We further evaluated these models on the moderation dataset⁠ we released with our 2022 research paper⁠ and on ToxicChat⁠, a public benchmark based on user queries to an open-source chatbot.

On the 2022 moderation evaluation set, we evaluated using our internal safety policies (adapting one policy to account for subsequent changes). gpt-oss-safeguard slightly outperformed all other tested models, including our internal Safety Reasoner and gpt-5-thinking, though this difference between Safety Reasoner and gpt-oss-safeguard is not statistically significant.

For ToxicChat, we evaluated using a short hand-written prompt adapted from some of our internal policies. Our internal Safety Reasoner again outperformed gpt-5-thinking, while gpt-5-thinking and Safety Reasoner both marginally outperform gpt-oss-safeguard-120b and gpt-oss-safeguard-20b. We expect that the relatively small size of gpt-oss-safeguard still makes it preferable for this type of task.

Limitations

There are two specific limitations of gpt-oss-safeguard. First, we have observed that classifiers trained on tens of thousands of high-quality labeled samples can still perform better at classifying content than gpt-oss-safeguard does when reasoning directly from the policy. Taking the time to train a dedicated classifier may be preferred for higher performance on more complex risks.

Second, gpt-oss-safeguard can be time and compute-intensive, which makes it challenging to scale across all platform content. Internally, we handle this in several ways with Safety Reasoner: (1) we use smaller and faster classifiers to determine which content to assess and (2) in some circumstances, we use Safety Reasoner asynchronously to provide a low-latency user experience while maintaining the ability to intervene if we detect unsafe content.

The road ahead: continuing to build with the community

gpt-oss-safeguard is OpenAI’s first set of open safety models built with the community. We’ve iterated on gpt-oss-safeguard with trust and safety specialists at SafetyKit, ROOST, and Discord as part of early testing. ROOST CTO Vinay Rao says, “gpt-oss-safeguard is the first open source reasoning model with a ‘bring your own policies and definitions of harm’ design. Organizations deserve to freely study, modify and use critical safety technologies and be able to innovate. In our testing, it was skillful at understanding different policies, explaining its reasoning, and showing nuance in applying the policies, which we believe will be beneficial to builders and safety teams.”

We’ll continue to iterate with the community to improve open safety tooling, including through the ROOST Model Community (RMC). The RMC brings together safety practitioners and researchers to share best practices for implementing open source AI models into safety workflows, including evaluation outcomes and model feedback. Visit the RMC GitHub repo⁠ to learn more about this partnership and how to get involved.

To start building with these models, download them from Hugging Face⁠.

Generated by RSStT. The copyright belongs to the original author.

Source