Anthropic News

Over the past year, we've collaborated with the US Center for AI Standards and Innovation (CAISI) and the UK AI Security Institute (AISI), government bodies established to measure and improve the security of AI systems. Our voluntary work together began as initial consultations and over time evolved into an ongoing partnership in which CAISI and AISI teams were given access to our systems at various stages of model development, enabling continuous testing.

Governments bring unique capabilities to this work, particularly deep expertise in national security areas like cybersecurity, intelligence analysis, and threat modeling, which, paired with their machine learning expertise, enables them to evaluate specific attack vectors and defense mechanisms. Their feedback helps us improve our security measures so our systems can withstand some of the most sophisticated attempts at misuse.

Working with independent external experts to identify vulnerabilities in AI systems is a core part of Anthropic’s Safeguards approach and is critical to preventing misuse of our models that could cause real-world harm.

Uncovering and addressing vulnerabilities

This collaboration has already led to key findings that helped us strengthen the tools we use to prevent malicious use of our models. As part of our respective agreements with CAISI and AISI, each organization evaluated several iterations of our Constitutional Classifiers—a defense system we use to spot and prevent jailbreaks—on models like Claude Opus 4 and 4.1 prior to deployment to help identify vulnerabilities and build robust safeguards.
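
For readers unfamiliar with classifier-based safeguards, the sketch below shows the general shape of such a system: the incoming prompt and the candidate response are each scored by a safety classifier, and anything above a tuned threshold is blocked. The function names, threshold, and structure are illustrative assumptions for this post, not a description of how Constitutional Classifiers are actually implemented.

    # Minimal sketch of a classifier-gated inference pipeline. All names
    # (score_harmfulness, generate, BLOCK_THRESHOLD) are illustrative
    # placeholders, not Anthropic's real components.
    from dataclasses import dataclass

    BLOCK_THRESHOLD = 0.8  # hypothetical tuned cutoff

    @dataclass
    class SafeguardResult:
        allowed: bool
        text: str
        input_score: float
        output_score: float

    def score_harmfulness(text: str) -> float:
        """Placeholder for a trained safety classifier returning a score in [0, 1]."""
        raise NotImplementedError

    def generate(prompt: str) -> str:
        """Placeholder for the underlying language model."""
        raise NotImplementedError

    def guarded_generate(prompt: str) -> SafeguardResult:
        # 1. Screen the incoming prompt before it reaches the model.
        in_score = score_harmfulness(prompt)
        if in_score >= BLOCK_THRESHOLD:
            return SafeguardResult(False, "Request declined.", in_score, 0.0)

        # 2. Generate a candidate response.
        candidate = generate(prompt)

        # 3. Screen the candidate response before returning it.
        out_score = score_harmfulness(candidate)
        if out_score >= BLOCK_THRESHOLD:
            return SafeguardResult(False, "Response withheld.", in_score, out_score)

        return SafeguardResult(True, candidate, in_score, out_score)

Many of the vulnerabilities described below are, at bottom, ways of making that scoring step see something other than the request's true content.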

Testing of Constitutional Classifiers. We gave CAISI and AISI access to several early versions of our Constitutional Classifiers, and we've continued to provide access to our latest systems as we've made improvements. Together, we stress-tested these classifiers, with government red-teamers identifying a range of vulnerabilities both before and after deployment, and our technical team using these findings to strengthen the safeguards. These vulnerabilities included, for example:

  • Uncovering prompt injection vulnerabilities. Government red-teamers identified weaknesses in our early classifiers via prompt injection attacks. Such attacks use hidden instructions to trick models into behavior that the system designer didn't intend. Testers discovered that specific annotations, like falsely claiming human review had occurred, could bypass classifier detection entirely. We have patched these vulnerabilities (a simplified illustration appears in the sketch after this list).
  • Stress-testing safeguard architectures. They developed a sophisticated universal jailbreak that encoded harmful interactions in ways that evaded our standard detection methods. Rather than simply patching this individual exploit, the discovery prompted us to fundamentally restructure our safeguard architecture to address the underlying vulnerability class.
  • Identifying cipher-based attacks. Red-teamers encoded harmful requests using ciphers, character substitutions, and other obfuscation techniques to evade our classifiers. These findings drove improvements to our detection systems, enabling them to recognize and block disguised harmful content regardless of encoding method (the sketch after this list also illustrates this kind of canonicalization).
  • Input and output obfuscation attacks. Testers discovered universal jailbreaks using sophisticated obfuscation methods tailored to our specific defenses, such as fragmenting harmful strings into seemingly benign components within a wider context. Identifying these blind spots enabled targeted improvements to our filtering mechanisms.
  • Automated attack refinement. The teams built a new automated system that progressively optimizes attack strategies. They recently used it to produce an effective universal jailbreak by iterating from a less effective one, and we are using this result to improve our safeguards.
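
The prompt-injection and obfuscation findings above share a theme: the classifier is defeated when it never sees the request in its plain form. As a rough illustration only, the sketch below shows one way a pre-classification screening pass might respond to two of those findings: it refuses to trust annotations that claim prior human review, and it canonicalizes simple character substitutions and fragmented words before scoring. The regular expressions, substitution table, and helper names are assumptions made for this example and do not describe Anthropic's actual classifiers.

    # Hypothetical pre-classification screening pass. All patterns and names
    # here are illustrative; real defenses cover far more encodings.
    import re
    import unicodedata

    # (1) Annotations that falsely claim prior approval. In the vulnerable
    # early classifiers described above, text like this could suppress
    # detection; here it is stripped and flagged rather than trusted.
    APPROVAL_CLAIM = re.compile(
        r"\[(?:note|system|reviewer)[^\]]*(?:approved|human review|already reviewed)[^\]]*\]",
        re.IGNORECASE,
    )

    # (2) A tiny example table of common character substitutions.
    SUBSTITUTIONS = str.maketrans({
        "0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s",
    })

    def strip_approval_claims(text: str) -> tuple[str, bool]:
        """Remove annotations claiming prior review; report whether any were found."""
        stripped, count = APPROVAL_CLAIM.subn("", text)
        return stripped, count > 0

    def canonicalize(text: str) -> str:
        """Undo simple obfuscation so the classifier sees the underlying request."""
        text = unicodedata.normalize("NFKC", text)    # fold confusable Unicode
        text = text.lower().translate(SUBSTITUTIONS)  # undo character substitutions
        # Re-join single characters separated by spaces or dots, e.g. "h a r m"
        # -> "harm". Deliberately crude; a real system handles many more encodings.
        text = re.sub(r"\b(\w)[\s._-](?=\w\b)", r"\1", text)
        return text

    def screen(text: str, score_harmfulness) -> tuple[float, bool]:
        """Worst-case score across raw and normalized views of the input,
        plus a flag if the input tried to claim prior human review."""
        stripped, claimed_review = strip_approval_claims(text)
        candidates = [text, stripped, canonicalize(stripped)]
        return max(score_harmfulness(c) for c in candidates), claimed_review

Scoring the maximum across raw and normalized views means an attacker cannot lower the score simply by encoding the request, which is the spirit of the detection improvements described above.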

Evaluation and risk methodology. Beyond identifying specific vulnerabilities, CAISI and AISI teams have helped strengthen our broader approach to security. Their external perspective on evidence requirements, deployment monitoring, and rapid response capabilities has been invaluable in pressure-testing our assumptions and identifying areas where additional evidence may be needed to support our threat models.

Key lessons for effective collaborations

Our experience has taught us several important lessons about how to engage effectively with government research and standards bodies to improve the safety and security of our models.

Comprehensive model access enhances red-teaming effectiveness. Our experience shows that giving government red-teamers deeper access to our systems enables more sophisticated vulnerability discovery. We provided several key resources:

  • Pre-deployment safeguard prototypes. Testers could evaluate and iterate on protection systems before they went live, identifying weaknesses ahead of deployment.
  • Multiple system configurations. We provided models across the protection spectrum, from completely unprotected versions to models with full safeguards. This approach lets testers first develop attacks against base models, then progressively refine techniques to bypass increasingly sophisticated defenses. Helpful-only model variants also enabled precise harmful output scoring and capability benchmarking.
  • Extensive documentation and internal resources. We provided trusted government red-teamers with our safeguard architecture details, documented vulnerabilities, safeguards reports, and granular content policy information (including specific prohibited requests and evaluation criteria). This transparency helped teams target high-value testing areas rather than searching blindly for weaknesses.
  • Real-time safeguards data. We gave government red-teamers direct access to classifier scores, which accelerated vulnerability discovery by letting testers refine their attack strategies and conduct more targeted exploratory research (the sketch after this list shows how such scores might feed an evaluation harness).
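
As a rough illustration of how the resources above might fit together, the following hypothetical harness runs the same probe prompts against configurations at different points on the protection spectrum, records the classifier scores exposed to trusted testers, and summarizes how often each configuration blocks the probes. The Endpoint interface, its fields, and the summary metric are assumptions made for this sketch, not a real API.

    # Hypothetical red-team evaluation harness. The Endpoint interface and its
    # fields are illustrative assumptions, not a description of any real API.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ProbeResult:
        configuration: str       # e.g. "unprotected", "classifier-guarded"
        prompt: str
        blocked: bool            # whether the safeguard refused or withheld output
        classifier_score: float  # score exposed to trusted red-teamers

    @dataclass
    class Endpoint:
        name: str
        query: Callable[[str], tuple[bool, float]]  # returns (blocked, classifier_score)

    def run_probes(endpoints: list[Endpoint], prompts: list[str]) -> list[ProbeResult]:
        """Send every probe prompt to every configuration and record the outcome."""
        results = []
        for endpoint in endpoints:
            for prompt in prompts:
                blocked, score = endpoint.query(prompt)
                results.append(ProbeResult(endpoint.name, prompt, blocked, score))
        return results

    def block_rates(results: list[ProbeResult]) -> dict[str, float]:
        """Fraction of probes blocked per configuration (higher = stronger safeguard)."""
        blocked: dict[str, int] = {}
        total: dict[str, int] = {}
        for r in results:
            total[r.configuration] = total.get(r.configuration, 0) + 1
            blocked[r.configuration] = blocked.get(r.configuration, 0) + int(r.blocked)
        return {name: blocked[name] / total[name] for name in total}

Comparing block rates and score distributions across configurations is one way a tester can tell whether a new attack idea is genuinely defeating the safeguards or merely exploiting the unprotected baseline.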

Iterative testing allows for complex vulnerability discovery. Though single evaluations provide value, sustained collaboration enables external teams to develop deep system expertise and uncover more complex vulnerabilities. During critical phases, we've maintained daily communication channels and frequent technical deep-dives with our partners.

Complementary approaches offer more robust security. CAISI and AISI evaluations work synergistically with our broader ecosystem. Public bug bounty programs generate high-volume, diverse vulnerability reports from a wide talent pool, while specialized expert teams can help uncover complex, subtle attack vectors that require deep technical knowledge to identify. This multi-layered strategy helps ensure we catch both common exploits and sophisticated edge cases.

Ongoing collaboration

Making powerful AI models secure and beneficial requires not just technical innovation but also new forms of collaboration between industry and government. Our experience demonstrates that public-private partnerships are most effective when technical teams work closely together to identify and address risk.

As AI capabilities advance, the role of independent evaluations of mitigations is increasingly important. We are heartened that other AI developers are also working with these government bodies, and encourage more companies to do so and share their own lessons more widely.

We extend our gratitude to the technical teams at both US CAISI and UK AISI for their rigorous testing, thoughtful feedback, and ongoing collaboration. Their work has materially improved the security of our systems and advanced the field of measuring AI safeguard effectiveness.



