Strengthening our safety ecosystem with external testing

At OpenAI, we believe that independent, trusted third party assessments play a critical role in strengthening the safety ecosystem of frontier AI. Third party assessments are evaluations conducted on frontier models to confirm or provide additional evidence to claims about critical safety capabilities and mitigations. These evaluations help validate safety claims, protect against blind spots, and increase transparency around capabilities and risks. By inviting external experts to test our frontier models, we also aim to foster trust in the depth of our capability evaluations and safeguards, and help uplift the broader safety ecosystem.


Since the launch of GPT‑4, OpenAI has collaborated with a range of external partners to test and assess our models. Broadly, our third-party collaborations take three forms:


  • Independent evaluations of key frontier capability and risk areas such as biosecurity, cybersecurity, AI self-improvement, and scheming
  • Methodology reviews that assess how we evaluate and interpret risk
  • Subject-matter expert (SME) probing, where experts evaluate the model directly on real-world SME tasks and provide structured input into our assessment of its capabilities and associated safeguards

This blog outlines how we use each of these forms of external assessment, why they matter, how they have shaped deployment decisions, and the principles we use to structure these collaborations. In the spirit of transparency, we’re also sharing more about the confidentiality and publication terms that govern our collaborations with third party testers. 


Why is this important? 


Third party assessors add an independent layer of evaluation alongside our internal work, strengthening rigor and providing additional protection against self-confirmation bias. Their input provides additional evidence alongside our own assessments, helping to inform responsible deployment decisions for powerful systems.


We also see third party assessments as part of building a resilient safety ecosystem. Our teams conduct extensive internal testing across capability and risk areas, but independent organizations bring additional perspectives and methodological approaches. We work to support a diverse group of qualified assessor organizations who can regularly evaluate frontier models alongside us.


Finally, we aim to be transparent about how this input helps shape our safety process. We regularly make third party assessments public—for example, by including summaries of pre-deployment evaluations in system cards, and supporting assessor organizations in publishing more detailed work following confidentiality and accuracy review. This transparency builds trust by showing how external input shapes our capability evaluations and safeguards. 


Sustained relationships built on trusted access, transparency, and knowledge-sharing help the entire ecosystem stay ahead of emerging risks and foster adaptable, actionable assessments needed for stronger standards and more informed governance for frontier AI systems.


Independent evaluations by external labs


Starting with the launch of GPT‑4, we’ve supported independent evaluations on early model checkpoints prior to deployment. Since that time, we’ve expanded our work with a range of third party organizations that have deep expertise in evaluations for key frontier capability and risk areas. We scope independent lab work as open-ended testing where outside teams apply their own methods to come up with a claim or assessment related to a particular frontier capability. 


As an example, for GPT‑5, OpenAI coordinated a broad set of external capability assessments across key risk areas, including long-horizon autonomy; scheming, deception, and oversight subversion; wet-lab planning feasibility; and offensive cybersecurity.


These independent evaluations supplement the evaluations conducted in accordance with OpenAI’s Preparedness Framework, and include benchmarks such as METR’s time-horizon evaluation and SecureBio’s Virology Capabilities Troubleshooting (VCT) evaluation.
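For context on what one of these benchmarks measures: METR has publicly described its time-horizon metric as the human task duration at which a model’s success rate falls to 50%, estimated by fitting a logistic curve to success against log task length. The sketch below illustrates that calculation on invented, aggregated numbers; it is not METR’s code or data.

```python
# Illustrative 50%-time-horizon fit on hypothetical aggregated results:
# success rate on task buckets that take a human the given number of minutes.
import numpy as np
from scipy.optimize import curve_fit

human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
success_rate = np.array([0.95, 0.90, 0.85, 0.70, 0.60, 0.50, 0.35, 0.20, 0.10, 0.05])

def logistic(log2_minutes, horizon_log2, slope):
    # P(success) declines with log task length and crosses 0.5 at horizon_log2
    return 1.0 / (1.0 + np.exp(slope * (log2_minutes - horizon_log2)))

params, _ = curve_fit(logistic, np.log2(human_minutes), success_rate, p0=[5.0, 1.0])
horizon_log2, _slope = params
print(f"Estimated 50% time horizon: {2 ** horizon_log2:.0f} human-minutes")
```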


To support these evaluations, we provided secure access to early model checkpoints, select evaluation results to concretize the capability improvements we were seeing, zero-data retention where needed, and models with fewer mitigations. For example, organizations testing in the cybersecurity and biosecurity domains tested models both with and without safety mitigations to probe underlying capabilities. Several organizations also received direct chain-of-thought access to enable them to inspect the model’s reasoning traces. This increased transparency allowed assessors to identify cases of sandbagging or scheming behavior that might only be discernible through reading the chain of thought. Access was provided with security controls in place, and we continue to update those controls as model capabilities and testing needs evolve.
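As one concrete illustration of why this kind of paired access matters, the sketch below shows a check an assessor could run with it; `run_eval`, the checkpoint names, and the gap threshold are hypothetical, not a real OpenAI or assessor API.

```python
# Hypothetical sketch: evaluate the same task suite on a mitigated and a
# less-mitigated checkpoint, then flag large capability gaps for manual
# review of the reasoning traces (where sandbagging would show up).

def run_eval(checkpoint: str, task_id: str) -> dict:
    """Placeholder for an eval-harness call; assumed to return
    {"score": float in [0, 1], "trace": str}, where "trace" is the
    chain-of-thought available only under secure access."""
    raise NotImplementedError

def flag_capability_gaps(task_ids, mitigated="ckpt-mitigated",
                         raw="ckpt-raw", gap_threshold=0.3):
    flagged = []
    for task_id in task_ids:
        mitigated_run = run_eval(mitigated, task_id)
        raw_run = run_eval(raw, task_id)
        gap = raw_run["score"] - mitigated_run["score"]
        if gap >= gap_threshold:
            # A large gap may mean mitigations are working as intended, or
            # that the model is sandbagging; reading the trace is how a
            # human reviewer tells the difference.
            flagged.append({"task": task_id, "gap": gap,
                            "trace": mitigated_run["trace"]})
    return flagged
```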


Methodology review


In certain contexts, external assessors are well positioned to provide methodology review, offering additional perspective on the frameworks and evidence that frontier labs rely on to assess risk. For example, during the launch of gpt-oss, we used adversarial fine-tuning to estimate worst-case capabilities for open-weight models, as described in Estimating worst case frontier risks of open weight LLMs. The core safety question was whether a malicious actor could fine-tune the model to reach High capability in areas such as bio or cyber under our Preparedness Framework. Because this required resource-intensive adversarial fine-tuning, we invited third party assessors to review and make recommendations on our internal methods and results rather than repeat similar work.
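To make the worst-case framing concrete, the sketch below shows the general shape of such an estimate, assuming a placeholder fine-tuning harness, benchmark scorer, and threshold; none of the names or numbers reflect OpenAI’s actual pipeline.

```python
# Hypothetical sketch: estimate worst-case capability by trying several
# adversarial fine-tuning configurations and keeping the best score an
# "attacker" achieves. All functions and thresholds are placeholders.

HIGH_CAPABILITY_THRESHOLD = 0.8  # illustrative stand-in for a Preparedness-style bar

def fine_tune(base_checkpoint: str, config: dict) -> str:
    """Placeholder: fine-tunes the base model, returns a checkpoint id."""
    raise NotImplementedError

def score_on_benchmark(checkpoint: str, benchmark: str) -> float:
    """Placeholder: returns a score in [0, 1] on e.g. a bio or cyber eval."""
    raise NotImplementedError

def worst_case_estimate(base: str, configs: list[dict], benchmark: str) -> float:
    scores = [score_on_benchmark(fine_tune(base, cfg), benchmark) for cfg in configs]
    worst_case = max(scores)  # the adversary gets to pick their best attempt
    if worst_case >= HIGH_CAPABILITY_THRESHOLD:
        print("Estimated worst case crosses the High capability bar")
    return worst_case
```

Taking the maximum over elicitation attempts is what makes this a worst-case estimate: the question is not how the model behaves on average, but what a well-resourced adversary could extract from the open weights.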


This entailed a multi-week process in which we shared evaluation rollouts and details of our adversarial fine-tuning approach, and collected structured recommendations for improving the methodology and the worst-case evaluations. Feedback from the assessors led to changes in the final adversarial fine-tuning process and demonstrated the value of methodological confirmation. We recorded which recommendations we adopted in the paper and the system card for gpt-oss, and we provided rationales for those we did not adopt.


Here, methodology review was a better fit than independent evaluation: the evaluations involved running large-scale, worst-case experiments, which require infrastructure and technical expertise not commonly available outside of major AI labs. Independent evaluations would therefore have been unlikely to yield direct insight into worst-case scenarios, and it was more productive to focus external assessors on confirming our claims. External assessors reviewed the methods and evidence, highlighting decision-relevant gaps that were addressed as part of the recommendation feedback loop. We hope to extend this approach to other areas where access or infrastructure needs make it impractical for a third party to directly run evaluations itself, or where external evaluations may not yet exist.


Subject-matter expert (SME) probing


Another way we engage external experts is through subject-matter expert (SME) probing, where experts evaluate the model directly and provide structured input via surveys into our assessment of its capabilities. This is different from red teaming, which aims to stress-test specific safeguards. SME probing allows us to supplement Preparedness Framework evaluations with domain-specific insights that reflect expert judgment and real-world context that static evaluations alone may not capture. For example, we invited a panel of subject-matter experts to use a helpful-only model to try their own end-to-end bio scenarios for ChatGPT Agent and GPT‑5. They scored how much the model could uplift an expert like themselves compared to a less experienced novice, based on the usefulness of the guidance it provided in their scenarios. The goal was to gather additional input on how well the system could move a motivated novice materially closer to competent execution: SMEs stress-tested our “novice uplift” claims under realistic workflows they came up with and gave granular feedback on where the model provided material, step-level help versus less helpful summaries. This expert-probing exercise was included as part of the overall assessment for deployment of these models, and shared in the system cards for both launches.
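As an illustration of how structured survey responses like these can be turned into a decision-relevant signal, the sketch below aggregates hypothetical expert ratings into a per-scenario novice-versus-expert uplift gap; the field names and the 1-5 scale are invented, not the actual survey instrument.

```python
# Aggregate hypothetical SME survey rows into a per-scenario uplift gap.
from collections import defaultdict
from statistics import mean

# Each row: one expert's 1-5 rating of how much the model helped an expert
# like themselves vs. a motivated novice in a scenario they designed.
responses = [
    {"scenario": "protocol-planning", "expert_uplift": 2, "novice_uplift": 4},
    {"scenario": "protocol-planning", "expert_uplift": 1, "novice_uplift": 5},
    {"scenario": "troubleshooting", "expert_uplift": 3, "novice_uplift": 3},
]

by_scenario = defaultdict(list)
for row in responses:
    by_scenario[row["scenario"]].append(row)

for scenario, rows in by_scenario.items():
    gap = (mean(r["novice_uplift"] for r in rows)
           - mean(r["expert_uplift"] for r in rows))
    # A large positive gap is the "novice uplift" signal described above:
    # the model helps novices much more than it helps experts.
    print(f"{scenario}: novice-minus-expert uplift gap = {gap:+.1f} (1-5 scale)")
```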


What makes a third party assessment collaboration successful?


In the spirit of transparency, we’re sharing more about what third party assessors agree to when they work with us, and the principles that guide our collaborations:


  • Transparency with careful confidentiality bounds: Third party assessors sign non-disclosure agreements to enable sharing of confidential, non-public information to support their assessments. In the Appendix to this post, we include relevant excerpts from contracts with third party assessors that outline rights around publication and expectations for review. We operate with the principle of transparency and strive to enable publication that advances understanding of safety and related evaluations without compromising confidential information or intellectual property. As part of this, we review and approve publications from third-party assessments to ensure both confidentiality and factual accuracy. Over the past few years, several third party assessors have published their work alongside our own publication of assessment summaries in system cards. Examples of work published after our confidentiality and accuracy review include METR’s GPT‑5 report, Apollo Research’s report on OpenAI o1, and Irregular’s GPT‑5 assessment.
  • Thoughtful information disclosure and secure, sensitive access: By default, we provide information and access to models that are intended to be public or production-ready. When the evaluations necessitate it, we provide deeper access, such as to helpful-only models or non-public information. OpenAI has provided these forms of access where necessary for critical safety questions for third party assessors. Importantly, these types of sensitive access require strict security measures, and we continue to update those controls as model capabilities and testing needs evolve.
  • Balanced financial incentives: We believe that it is important to ensure that the third party assessment ecosystem is well-funded and sustainable. Because of that, we offer compensation to all of our third party assessors, and some choose to decline depending on their organizational philosophy around this. Forms of compensation include direct payment for work and/or subsidizing model use costs through API credits or otherwise. No payment is ever contingent on the results of a third party assessment.

Combined, these factors help third party assessments both protect sensitive information and foster transparency in AI safety, and create paths for third party assessors to be compensated for their time. 


Looking ahead


Looking ahead, we see a need to continue strengthening the ecosystem of organizations capable of conducting credible, decision-relevant assessments of frontier AI systems. Effective third party evaluation requires specialized expertise, stable funding, and methodological rigor. Continued investment in qualified assessor organizations, the advancement of measurement science, and security for sensitive access will be essential for ensuring that assessments can keep pace with advances in model capabilities. 


Third party assessments are one way we bring external perspective into our safety work, and they operate alongside other mechanisms. We also collaborate with external experts through structured red teaming efforts, collective alignment projects, work with the U.S. CAISI and UK AISI, and advisory groups such as our Global Physician Network and our Expert Council on Well-Being and AI to help guide our work on mental health and user well-being. These efforts contribute different forms of expertise and support a broader, more reliable foundation for assessing and governing advanced AI systems.


Appendix


The following are illustrative excerpts from our agreements with third parties collaborating with us on pre-deployment assessments. 


Research Publications: [...] Hereunder, Supplier hereby retains, or OpenAI licenses back to Supplier, as applicable, the right to use the Supplier Work Product created or discovered by Supplier for research, academic publication, scientific and/or educational purposes, provided such uses (a) are not commercial in nature, (b) do not disclose OpenAI’s Confidential Information (except as expressly permitted in advance by OpenAI in writing) and (c) are submitted to OpenAI for review and approval in writing prior to any publication or disclosure. OpenAI’s “Confidential Information” includes without limitation OpenAI’s Non-Public Models and outputs thereof, including any Supplier Work Product that was created or discovered through use of the Non-Public Models. “Non-Public Models” means OpenAI’s artificial intelligence and machine learning models, including versions and snapshots thereof, that have not been released to the general public at the time of Supplier’s proposed publication date.


Confidential Information. For purposes of this Agreement, “Confidential Information” means and will include: (i) any information, materials or knowledge regarding OpenAI and its business, financial condition, products, programming techniques, customers, suppliers, technology or research and development that is disclosed to Supplier or to which Supplier has or obtains access in connection with performing Services; (ii) the Supplier Work Product; and (iii) the terms and conditions of this Agreement. Confidential Information will not include any information that: (a) is or becomes part of the public domain through no fault of Supplier or any representative or agent of Supplier; (b) is demonstrated by Supplier to have been rightfully in Supplier’s possession at the time of disclosure, without restriction as to use or disclosure; or (c) Supplier rightfully receives from a third party who has the right to disclose it and who provides it without restriction as to use or disclosure. Supplier agrees to hold all Confidential Information in strict confidence, not to use it in any way, commercially or otherwise, other than to perform Services for OpenAI, and not to disclose it to others. Supplier further agrees to take all actions reasonably necessary to protect the confidentiality of all Confidential Information including, without limitation, implementing and enforcing procedures to minimize the possibility of unauthorized use or disclosure of Confidential Information.


Without granting any right or license, the Disclosing Party agrees that the foregoing shall not apply with respect to (a) any information after 2 years following the disclosure thereof, except for any information that is a trade secret, which shall remain subject to the confidentiality obligations of this Agreement for as long as it is a trade secret, (b) any information included in a Researcher’s noncommercial research or academic publication to the extent such information is either (i) approved in writing by OpenAI prior to publication or (ii) resulting from the version of OpenAI Technology that has been made generally available to the public by OpenAI (and not, for the avoidance of doubt, any information, results, or output from version of the OpenAI Technology that were not made generally available to the public); or (c) any information that the Receiving Party can document (i) is or becomes (through no improper action or inaction by the Receiving Party or any affiliate, agent, consultant or employee of the Receiving Party) generally available to the public, (ii) was in its possession or known by it without restriction prior to receipt from the Disclosing Party, (iii) was rightfully disclosed to it by a third party without restriction, or (iv) was independently developed without use of any Proprietary Information of the Disclosing Party by officers, directors, employees, consultants, representatives, advisors or affiliates of the Receiving Party who have had no access to any such Proprietary Information. The Receiving Party may make disclosures required by law or court order provided the Receiving Party uses diligent reasonable efforts to limit disclosure and to obtain confidential treatment or a protective order and allows the Disclosing Party to participate in the proceeding.