Generated by RSStT
Anthropic News以下是该文档的中文翻译:
目前最流行的人工智能工具是能够响应特定问题或提示的助手。但我们现在正看到AI代理(AI agents)的出现,这类代理在被赋予目标后能够自主执行任务。可以把代理想象成一个虚拟的协作者,能够独立完成从头到尾的复杂项目,而你则可以专注于其他优先事项。
代理能够自主管理自己的流程和工具使用,控制如何完成任务,且需要最少的人类干预。例如,如果你让代理“帮我策划婚礼”,它可能会自主调研场地和供应商,比较价格和可用性,制定详细的时间表和预算。又比如,如果你让它“准备公司的董事会演示”,它可能会搜索你连接的Google Drive中的相关销售报告和财务文件,从多个电子表格中提取关键指标,并生成报告。
去年,我们推出了Claude Code,这是一个能够自主编写、调试和编辑代码的代理,广泛被软件工程师使用。许多公司也在使用我们的模型构建自己的代理。网络安全公司Trellix使用Claude来分类和调查安全问题。金融服务公司Block构建了一个代理,使非技术员工能够通过自然语言访问其数据系统,从而节省了工程师的时间。
可信赖代理的原则
代理的快速应用意味着开发者(如Anthropic)必须构建安全、可靠和值得信赖的代理。今天,我们分享一个负责任代理开发的早期框架。希望该框架能帮助建立新兴标准,提供适应不同场景的指导,并促进构建一个与人类价值观相一致的代理生态系统。
我们在开发代理时,遵循以下原则:
保持人类控制,同时赋能代理自主性
代理设计的核心矛盾是如何平衡代理自主性与人类监督。代理必须能够自主工作——正是它们的独立操作使其有价值。但人类应保留对目标实现方式的控制权,尤其是在做出重大决策之前。例如,一个帮助管理费用的代理可能发现公司在软件订阅上花费过多。在开始取消订阅或降级服务之前,公司很可能希望由人类批准。
在Claude Code中,人类可以随时停止Claude并重新指引其方向。默认情况下,Claude拥有只读权限,意味着它可以在初始化目录中分析和审查信息,无需人类批准,但在修改代码或系统之前必须获得人类批准。用户可以为他们信任Claude处理的常规任务授予持续权限。
随着代理变得更强大和普及,我们需要更强健的技术解决方案和直观的用户控制。自主性与监督之间的正确平衡因场景而异,可能需要内置和可定制的监督功能相结合。
代理行为的透明度
人类需要了解代理解决问题的过程。缺乏透明度时,人类让代理“减少客户流失”,却发现代理开始联系设施团队调整办公布局,可能会感到困惑。但如果设计良好,代理可以解释其逻辑:“我发现分配给嘈杂开放办公区销售代表的客户流失率高40%,因此我请求进行工作区噪音评估并建议调整工位以改善通话质量。”这也为人类纠正代理、核实数据或确保其使用最相关信息提供了机会。
在Claude Code中,Claude通过实时待办事项清单展示其计划行动,用户可以随时介入,询问或调整工作计划。挑战在于找到合适的信息量:信息太少,人类无法判断代理是否达成目标;信息太多,则可能被无关细节淹没。我们尝试取中间路线,但仍需不断迭代改进。
使代理符合人类价值观和期望
代理并不总是按人类意图行事。我们的研究表明,当AI系统自主追求目标时,有时会采取对系统合理但不符合人类期望的行动。例如,人类让代理“整理我的文件”,代理可能自动删除它认为的重复文件,并将文件移动到新的文件夹结构,远远超出简单整理,彻底重组用户系统。虽然这源于代理试图帮忙,但显示出代理缺乏适当的上下文,即使目标一致,也可能行为不当。
更令人担忧的是,代理可能以损害用户利益的方式追求目标。我们对极端场景的测试显示,AI系统自主追求目标时,有时会采取看似合理但违背人类意愿的行动。用户也可能无意中以导致意外结果的方式提示代理。
构建可靠的代理价值观对齐度量非常困难,因为难以同时评估恶意和无意的原因。但我们正积极探索解决方案。在此之前,透明度和控制原则尤为重要。
保护跨多次交互的隐私
代理可以在不同任务和交互间保留信息,这带来潜在隐私问题。代理可能不恰当地将敏感信息从一个场景带到另一个场景。例如,代理在帮助组织规划时可能了解到某部门的机密内部决策,随后在协助另一部门时无意中引用这些信息,暴露本应隔离的敏感事项。
代理使用的工具和流程也应设计有适当的隐私保护和控制。我们创建的开源模型上下文协议(Model Context Protocol,MCP)允许Claude连接其他服务,包含控制功能,用户可允许或阻止Claude访问特定工具和流程(称为“连接器”)。包括授予一次性或永久访问权限的选项。企业管理员也可设置组织内用户可连接的连接器。我们持续探索改进隐私保护工具的方法。
我们还列出了客户应采取的保护数据措施,如访问权限、身份验证和数据隔离。
保障代理交互的安全
代理系统应设计为保护敏感数据,防止与其他系统或代理交互时被滥用。由于代理被赋予特定目标,攻击者可能诱使代理忽视原始指令,泄露未授权信息,或执行非预期操作(称为“提示注入”)。攻击者也可能利用代理使用的工具或子代理的漏洞。
Claude已使用一套分类器系统检测并防范提示注入等滥用行为,此外还有多层安全措施。我们的威胁情报团队持续监控,评估并缓解新兴恶意行为。我们还为使用Claude的组织提供降低风险的指导。加入我们Anthropic审核的MCP目录的工具必须符合安全、可靠和兼容性标准。
当我们通过监控和研究发现新的恶意行为或漏洞时,会迅速应对并持续改进安全措施,以应对不断演变的威胁。
后续步骤
随着我们继续开发和改进代理,预计对其风险和权衡的理解也将不断发展。未来,我们计划修订和更新该框架,以反映最佳实践。
这些原则将指导我们当前和未来的代理开发工作,我们期待与其他公司和组织在此领域合作。代理在工作、教育、医疗和科学发现等方面具有巨大正面潜力,因此确保其构建达到最高标准至关重要。
以上即为全文中文翻译。若需进一步帮助,请告知!
The most popular AI tools today are assistants that respond to specific questions or prompts. But we’re now seeing the emergence of AI agents, which pursue tasks autonomously when given a goal. Think of an agent like a virtual collaborator that can independently handle complex projects from start to finish - all while you focus on other priorities.
Agents direct their own processes and tool usage, maintaining control over how they accomplish tasks with minimum human input. If you ask an agent to "help plan my wedding," it might autonomously research venues and vendors, compare pricing and availability, and create detailed timelines and budgets. Or if you ask it to “prepare my company’s board presentation", it might search through your connected Google Drive for relevant sales reports and financial documents, extract key metrics from multiple spreadsheets, and produce a report.
Last year, we introduced Claude Code, an agent that can autonomously write, debug, and edit code, and is used widely by software engineers. Many companies are also building their own agents using our models. Trellix, a cybersecurity firm, uses Claude to triage and investigate security issues. And Block, a financial services company, has built an agent that allows non-technical staff to access its data systems using natural language, saving its engineers time.
Principles for trustworthy agents
The rapid implementation of agents means it's crucial that developers like Anthropic build agents that are safe, reliable and trustworthy. Today, we're sharing an early framework for responsible agent development. We hope this framework can help establish emerging standards, offer adaptable guidance for different contexts, and contribute to building an ecosystem where agents align with human values.
We aim to adhere to the following principles when developing agents:

Keeping humans in control while enabling agent autonomy
A central tension in agent design is balancing agent autonomy with human oversight. Agents must be able to work autonomously—their independent operation is exactly what makes them valuable. But humans should retain control over how their goals are pursued, particularly before high-stakes decisions are made. For example, an agent helping with expense management might identify that the company is overspending on software subscriptions. Before it starts cancelling subscriptions or downgrading service tiers, the company would likely want a human to give approval.
In Claude Code, humans can stop Claude whenever they want and redirect its approach. It has read-only permissions by default, meaning it can analyze and review information within the directory it's initialized in without asking for human approval, but must ask for human approval before taking any actions that modify code or systems. Users can grant persistent permissions for routine tasks they trust Claude to handle.
As agents become more powerful and prevalent, we’ll need even more robust technical solutions and intuitive user controls. The right balance between autonomy and oversight varies dramatically across scenarios and likely includes a mix of built-in and customizable oversight features.
Transparency in agent behavior
Humans need visibility into agents’ problem-solving processes. Without transparency, a human asking an agent to "reduce customer churn" might be baffled when the agent starts contacting the facilities team about office layouts. But with good transparency design, the agent can explain its logic: "I found that customers assigned to sales reps in the noisy open office area have 40% higher churn rates, so I'm requesting workspace noise assessments and proposing desk relocations to improve call quality." This also provides an opportunity to nudge agents in the right direction, by fact-checking their data, or making sure they’re using the most relevant sources.
In Claude Code, Claude shows its planned actions through a real-time to-do checklist, and users can jump in at any time to ask about or adjust Claude’s workplan. The challenge is in finding the right level of detail. Too little information leaves humans unable to assess whether the agent is on track to achieve its goal. Too much can overwhelm them with irrelevant details. We try to take a middle ground but we’ll need to iterate on this further.

Aligning agents with human values and expectations
Agents don't always act as humans intend. Our research has shown that when AI systems pursue goals autonomously, they can sometimes take actions that seem reasonable to the system but aren't what humans actually wanted. If a human asks an agent to "organize my files," the agent might automatically delete what it considers duplicates and move files to new folder structures—going far beyond simple organization to completely restructuring the user's system. While this stems from the agent trying to be helpful, it demonstrates how agents may lack the context to act appropriately even when their goals do align.
More concerning are cases where agents pursue goals in ways that actively work against users' interests. Our testing of extreme scenarios has shown that when AI systems pursue goals autonomously, they can sometimes take action that seem reasonable to the system but violate what humans actually wanted. Users may also inadvertently prompt agents in ways that lead to unintended outcomes.
Building reliable measures of agents’ value alignment is challenging. It’s hard to evaluate both the malign and benign causes of the problem at once. But we’re actively figuring out how to resolve this problem. Until we have, the transparency and control principles above will be particularly important.
Protecting privacy across extended interactions
Agents can retain information across different tasks and interactions. This creates several potential privacy problems. Agents might inappropriately carry sensitive information from one context to another. For example, an agent might learn about confidential internal decisions from one department while helping with organizational planning, then inadvertently reference this information when assisting another department – exposing sensitive matters that should remain compartmentalized.
Tools and processes that agents utilize should also be designed with the appropriate privacy protections and controls. The open-source Model Context Protocol (MCP) we created, which allows Claude to connect to other services, includes controls to enable users to allow or prevent Claude from accessing specific tools and processes, or what we call “connectors” in a given task. This includes the option to grant one-time or permanent access to information. Enterprise administrators can also set which connectors users in their organizations can connect to. We continue to explore ways to improve our privacy protection tooling.
We’ve also outlined steps our customers should take to safeguard their data through measures like access permissions, authentication, and data segregation.
Securing agents’ interactions
Agent systems should be designed to safeguard sensitive data and prevent misuse when interacting with other systems or agents. Since agents are tasked with achieving specific goals, attackers could trick an agent into ignoring its original instructions, revealing unauthorized information, or performing unintended actions by making it seem necessary to do so for the agent’s objectives (also referred to as “prompt injection”). Or attackers could exploit vulnerabilities in the tools or sub-agents that agents use.
Claude already uses a system of classifiers to detect and guard against misuses such as prompt injections, in addition to several other layers of security. Our Threat Intelligence team conducts ongoing monitoring to assess and mitigate new or emerging forms of malicious behaviour. In addition, we provide guidance on how organizations using Claude can further decrease these risks. Tools added to our Anthropic-reviewed MCP directory must adhere to our security, safety, and compatibility standards.
When we discover new malicious behaviors or vulnerabilities through our monitoring and research, we strive to address them quickly and continuously improve our security measures to stay ahead of evolving threats.
Next steps
As we continue developing and improving our agents, we expect our understanding of their risks and trade-offs to also evolve. Over time, we’ll plan to revise and update this framework to reflect our view of best practices.
These principles will guide our current and future work on agent development, and we look forward to collaborating with other companies and organizations on this topic. Agents have tremendous potential for positive impacts in work, education, healthcare, and scientific discovery. That is why it is so important to ensure they are built to the highest standards.
Generated by RSStT. The copyright belongs to the original author.