Building Safeguards for Claude

Building Safeguards for Claude

Anthropic News

Claude赋能数百万用户应对复杂挑战,激发创造力,深化对世界的理解。我们希望放大人类潜能,同时确保模型能力被引导至有益的结果。这意味着不断完善对用户学习和解决问题的支持,同时防止可能造成现实世界伤害的滥用。

这正是我们的安全保障团队发挥作用的地方:我们识别潜在滥用,响应威胁,构建防御,帮助保持Claude既有用又安全。安全保障团队汇聚了政策、执法、产品、数据科学、威胁情报和工程领域的专家,他们了解如何构建稳健系统以及不法分子如何试图破坏这些系统。

我们在多个层面开展工作:制定政策、影响模型训练、测试有害输出、实时执行政策、识别新型滥用和攻击。此方法贯穿模型整个生命周期,确保Claude在训练和构建过程中具备现实世界有效的保护措施。


政策制定

安全保障团队设计了我们的《使用政策》(Usage Policy)——定义Claude应如何使用及不应如何使用的框架。该政策指导我们处理儿童安全、选举诚信、网络安全等关键领域,同时为Claude在医疗、金融等行业的使用提供细致指导。

政策制定和迭代过程由两大机制驱动:

  • 统一伤害框架(Unified Harm Framework):这是一个不断演进的框架,帮助团队从物理、心理、经济、社会和个人自主五个维度理解Claude使用可能带来的潜在伤害。该框架不是正式的评分系统,而是作为结构化视角,考虑滥用的可能性和规模,指导政策和执法程序的制定。
  • 政策脆弱性测试(Policy Vulnerability Testing):我们与外部领域专家合作,识别关注点,并通过对模型在挑战性提示下的输出进行压力测试,检验政策的有效性。合作伙伴包括反恐、激进化、儿童安全和心理健康领域的专家。例如,在2024年美国大选期间,我们与战略对话研究所(Institute for Strategic Dialogue)合作,了解Claude何时可能提供过时信息,并为寻求选举信息的用户添加指向权威资源(如TurboVote)的横幅提醒。

Claude的训练

安全保障团队与微调团队紧密合作,通过协作过程帮助防止Claude产生有害行为和回应。团队广泛讨论Claude应展现和避免的行为,指导训练中应构建的模型特质。

评估和检测流程也会识别潜在有害输出。发现问题时,我们会与微调团队合作,采取措施如更新训练中的奖励模型或调整已部署模型的系统提示。

我们还与领域专家合作,细化Claude对敏感领域的理解。例如,与在线危机支持领导者ThroughLine合作,深入理解模型在自残和心理健康相关情境中的应答方式。我们将这些见解反馈给训练团队,帮助Claude在这些对话中表现出细腻的回应,而非完全拒绝或误解用户意图。

通过这一协作过程,Claude学会了多项重要技能:拒绝协助非法有害活动,识别生成恶意代码、伪造内容或策划有害行为的企图,谨慎讨论敏感话题,并区分这些话题与实际伤害企图。


测试与评估

在发布新模型前,我们会评估其性能和能力,评估内容包括:

  • 安全评估:检测Claude在儿童剥削、自残等话题上是否遵守使用政策,测试包括明显违规、模糊情境和多轮对话,利用模型自动评分并辅以人工复核。
  • 风险评估:针对高风险领域(如网络危害、化学、生物、放射性、核武器及高爆炸物),与政府和私营部门合作进行AI能力提升测试,定义潜在威胁模型并评估安全保障措施的有效性。
  • 偏见评估:检查Claude在不同情境和用户间是否持续提供可靠、准确的回答。针对政治偏见,测试对立观点提示,评分其事实性、全面性、公正性和一致性。还测试涉及职业、医疗等话题,评估性别、种族、宗教等身份属性是否导致偏见输出。

这些严格的预发布测试帮助验证训练效果,判断是否需要额外的保护措施。例如,在预发布评估我们的计算机使用工具时,发现其可能被用于垃圾邮件生成和传播,因而开发了新的检测和执法机制,包括对滥用账户禁用该工具,以及防范提示注入的新保护措施。

评估结果会在每个新模型系列发布时通过系统卡片公开。


实时检测与执法

模型部署后,我们结合自动系统和人工审核,检测有害行为并执行使用政策。

检测和执法系统由一组特别微调或提示的Claude模型(称为“分类器”)驱动,实时检测特定类型的政策违规。我们可同时部署多个分类器,分别监控不同类型的危害,同时保持对话自然流畅。针对儿童性虐待材料(CSAM),我们还会将上传图片的哈希值与已知CSAM数据库比对。

分类器帮助我们决定是否采取执法行动,包括:

  • 响应引导:实时调整Claude对某些用户提示的理解和回应,防止有害输出。例如,检测到用户可能试图生成垃圾邮件或恶意软件时,自动在系统提示中添加额外指令引导回应,极少数情况下可完全阻止回应。
  • 账户执法措施:调查违规模式,可能对账户采取警告或严重时终止账户等措施。我们也有防范欺诈账户创建和服务滥用的防御机制。

构建这些执法系统面临巨大挑战,既需机器学习研究设计,也需工程实现。分类器需处理数万亿输入输出令牌,同时限制计算开销和对正常内容的误判。


持续监控与调查

我们还监控Claude的有害流量,超越单条提示和单个账户,了解特定危害的普遍性,识别更复杂的攻击模式,包括:

  • Claude洞察与观察:使用洞察工具(Clio)以隐私保护方式分析真实使用情况,将对话归类为高层次主题。相关研究(如Claude使用对情绪影响)为我们构建保护措施提供依据。
  • 分层摘要:为监控计算机使用能力或潜在有害网络使用,采用分层摘要技术,将单次交互浓缩为摘要,再分析摘要识别账户层面问题,帮助发现仅在整体上显现的违规行为,如自动化影响操作和大规模滥用。
  • 威胁情报:研究最严重的模型滥用,识别对抗性使用和现有检测系统可能遗漏的模式。通过比较异常账户活动与正常模式,结合外部威胁数据(开源仓库、行业报告)和内部系统交叉验证,监控不法分子可能活跃的社交媒体、消息平台和黑客论坛。我们在公开威胁情报报告中分享发现。

展望未来

保障AI安全使用的重要性超出任何单一组织的能力范围。我们积极寻求用户、研究人员、政策制定者和民间社会组织的反馈与合作。我们也基于公众反馈,开展持续的漏洞赏金计划,测试我们的安全防御。

为支持这项工作,我们正在积极招聘能够帮助解决这些问题的人才。如果你有兴趣加入我们的安全保障团队,欢迎访问我们的招聘页面了解详情。


以上为Anthropic关于Claude安全保障团队工作流程和策略的详细介绍。




Claude empowers millions of users to tackle complex challenges, spark creativity, and deepen their understanding of the world. We want to amplify human potential while ensuring our models’ capabilities are channeled toward beneficial outcomes. This means continuously refining how we support our users’ learning and problem-solving, while preventing misuse that could cause real-world harm.

This is where our Safeguards team comes in: we identify potential misuse, respond to threats, and build defenses that help keep Claude both helpful and safe. Our Safeguards team brings together experts in policy, enforcement, product, data science, threat intelligence, and engineering who understand how to build robust systems and how bad actors try to break them.

We operate across multiple layers: developing policies, influencing model training, testing for harmful outputs, enforcing policies in real-time, and identifying novel misuses and attacks. This approach spans the entire lifecycle of our models, ensuring Claude is trained and built with protections that are effective in the real world.

Figure 1: Safeguards’ approach to building effective protections throughout the lifecycle of our models


Policy development

Safeguards designs our Usage Policy—the framework that defines how Claude should and shouldn’t be used. The Usage Policy informs how we address critical areas like child safety, election integrity, and cybersecurity while providing nuanced guidance for Claude’s use in industries like healthcare and finance.

Two mechanisms guide our policy development and iteration process:

  • Unified Harm Framework: This evolving framework helps our team understand potentially harmful impacts from Claude use across five dimensions: physical, psychological, economic, societal, and individual autonomy. Rather than a formal grading system, the framework serves as a structured lens and considers the likelihood and scale of misuse when developing policies and enforcement procedures.
  • Policy Vulnerability Testing: We partner with external domain experts to identify areas of concern, and then stress-test these concerns against our policies by assessing the output of our models under challenging prompts. Our partners include experts in terrorism, radicalization, child safety, and mental health. The findings from these stress tests directly shape our policies, training, and detection systems. For example, during the 2024 U.S. election, we partnered with the Institute for Strategic Dialogue to understand when Claude might provide outdated information. We then added a banner pointing Claude.ai users seeking election information to authoritative sources like TurboVote.
Figure 2: Banner displayed during 2024 U.S. election cycle for accurate voting information as a result of our policy vulnerability testing with the Institute for Strategic Dialogue

Claude’s training

Safeguards works closely with our fine-tuning teams through a collaborative process to help prevent harmful behavior and responses from Claude. This involves extensive discussion about what behaviors Claude should and shouldn't exhibit, which helps to inform decisions about which traits to build into the model during training.

Our evaluation and detection processes also identify potentially harmful outputs. When issues are flagged, we can work with fine-tuning teams on solutions like updating our reward models during training or adjusting system prompts for deployed models.

We also work with domain specialists and experts to refine Claude’s understanding of sensitive areas. For example, we partner with ThroughLine, a leader in online crisis support, to develop a deep understanding of where and how models should respond in situations related to self-harm and mental health. We feed these insights back to our training team to help influence the nuance in Claude’s responses, rather than having Claude refuse to engage completely or misinterpreting a user’s intent in these conversations.

Through this collaborative process, Claude develops several important skills. It learns to decline assistance with harmful illegal activities, and it recognizes attempts to generate malicious code, create fraudulent content, or plan harmful activities. It learns how to discuss sensitive topics with care, and how to distinguish between these and attempts to cause actual harm.


Testing and evaluation

Before releasing a new model, we evaluate its performance and capabilities. Our evaluations include:

Figure 3: We test each model via safety evaluations, risk assessments, and bias evaluations prior to deployment

  • Safety evaluations: We assess Claude’s adherence to our Usage Policy on topics like child exploitation or self-harm. We test a variety of scenarios, including clear usage violations, ambiguous contexts, and extended multi-turn conversations. These evaluations leverage our models to grade Claude’s responses, with human review as an additional check for accuracy.
  • Risk assessments: For high-risk domains, such as those associated with cyber harm, or chemical, biological, radiological, and nuclear weapons and high-yield explosives (CBRNE), we conduct AI capability uplift testing in partnership with government entities and private industry. We define threat models that could arise from improved capabilities and assess the performance of our safeguards against these threat models.
  • Bias evaluations: We check whether Claude consistently provides reliable, accurate responses across different contexts and users. For political bias, we test prompts with opposing viewpoints and compare the responses, scoring them for factuality, comprehensiveness, equivalency, and consistency. We also test responses on topics such as jobs and healthcare to identify whether the inclusion of identity attributes like gender, race, or religion result in biased outputs.

This rigorous pre-deployment testing helps us verify whether training holds up under pressure, and signals whether we might need to build additional guardrails to monitor and protect against risks. During pre-launch evaluations of our computer use tool, we determined it could augment spam generation and distribution. In response, prior to launch we developed new detection methods and enforcement mechanisms, including the option to disable the tool for accounts showing signs of misuse, and new protections for our users against prompt injection.

The results of our evaluations are reported in our system cards, released with each new model family.

Real-time detection and enforcement

We use a combination of automated systems and human review to detect harm and enforce our Usage Policy once models are deployed.

Our systems for detection and enforcement are powered by a set of prompted or specially fine-tuned Claude models called "classifiers," which are designed to detect specific types of policy violations in real-time. We can deploy a number of different classifiers simultaneously, each monitoring for specific types of harm while the main conversation flows naturally. Along with our classifiers, we also employ specific detection for child sexual abuse material (CSAM), in which we compare hashes of uploaded images against databases of known CSAM on our first-party products.

These classifiers help us determine if and when we take enforcement actions including:

  • Response steering: We can adjust how Claude interprets and responds to certain user prompts in real-time to prevent harmful output. For example, if our classifier detects that a user may be attempting to generate spam or malware, we can automatically add additional instructions to Claude’s system prompt to steer its response. In a narrow set of cases, we can also stop Claude from responding entirely.
  • Account enforcement actions: We investigate patterns of violations and might take additional measures on the account level, including warnings or, in severe cases, account termination. We also have defenses for blocking fraudulent account creation and use of our services.

Building these enforcement systems represents a considerable challenge, both in terms of machine learning research needed to design them and engineering solutions required to implement them. For instance, our classifiers must be able to process trillions of input and output tokens, while simultaneously limiting both compute overhead and enforcement on benign content.

Ongoing monitoring and investigation

We also monitor harmful Claude traffic, going beyond single prompts and individual accounts, to understand the prevalence of particular harms and identify more sophisticated attack patterns. This work includes:

  • Claude insights and observations: Our insights tool helps us measure real-world use of Claude and analyze traffic in a privacy-preserving manner by grouping conversations into high-level topic clusters. Research informed by this work (like on the emotional impacts of Claude use) can inform the guardrails we build.
  • Hierarchical summarization: To monitor computer use capabilities or potential harmful cyber use, we use hierarchical summarization, a technique that condenses individual interactions into summaries and then analyzes these summaries to identify account-level concerns. This helps us spot behaviors that might appear violative only in aggregate, such as automated influence operations and other large-scale misuses.
  • Threat intelligence: We also study the most severe misuses of our models, identifying adversarial use and patterns that our existing detection systems might miss. We use methods like comparing indicators of abuse (such as unusual spikes in account activity) against typical account usage patterns to identify suspicious activity and cross-reference external threat data (like open source repositories or industry reporting) with our internal systems. We also monitor channels where bad actors might operate, including social media, messaging platforms, and hacker forums. We share our findings in our public threat intelligence reports.

Looking forward

Safeguarding AI use is too important for any one organization to tackle alone. We actively seek feedback and partnership from users, researchers, policymakers, and civil society organizations. We also build on feedback from the public, including via an ongoing bug bounty program for testing our defenses.

To support our work, we’re actively looking to hire people who can help us tackle these problems. If you’re interested in working on our Safeguards team, we encourage you to check out our Careers page.



Generated by RSStT. The copyright belongs to the original author.

Source

Report Page