Anthropic News

Today we're releasing Claude Opus 4.1, an upgrade to Claude Opus 4 on agentic tasks, real-world coding, and reasoning. We plan to release substantially larger improvements to our models in the coming weeks.

Opus 4.1 is now available to paid Claude users and in Claude Code. It's also available via our API, Amazon Bedrock, and Google Cloud's Vertex AI. Pricing is the same as for Opus 4.

Claude Opus 4.1

Opus 4.1 advances our state-of-the-art coding performance to 74.5% on SWE-bench Verified [https://www.swebench.com/]. It also improves Claude's in-depth research and data analysis skills, especially around detail tracking and agentic search.


GitHub notes that Claude Opus 4.1 improves across most capabilities relative to Opus 4, with particularly notable performance gains in multi-file code refactoring. Rakuten Group finds that Opus 4.1 excels at pinpointing exact corrections within large codebases without making unnecessary adjustments or introducing bugs, with their team preferring this precision for everyday debugging tasks. Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.


Getting started

We recommend upgrading from Opus 4 to Opus 4.1 for all uses. If you're a developer, simply use claude-opus-4-1-20250805 via the API. You can also explore our system card [https://www.anthropic.com/claude-opus-4-1-system-card], model page [https://www.anthropic.com/claude/opus], pricing page [https://www.anthropic.com/pricing#api], and docs [https://docs.anthropic.com/en/docs/about-claude/models/overview] to learn more.
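
As a quick example, here is a minimal sketch of calling the new model with the Anthropic Python SDK. It assumes the anthropic package is installed and an ANTHROPIC_API_KEY environment variable is set; the prompt is illustrative:

    import anthropic

    # The SDK reads ANTHROPIC_API_KEY from the environment by default.
    client = anthropic.Anthropic()

    message = client.messages.create(
        model="claude-opus-4-1-20250805",  # the Opus 4.1 model ID named above
        max_tokens=1024,
        messages=[{"role": "user", "content": "Refactor this function: ..."}],
    )
    print(message.content[0].text)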

As always, you can share feedback with us at feedback@anthropic.com; it helps us improve, especially as we continue to release new and more capable models.



Appendix

Data sources

  • OpenAI: o3 announcement [https://openai.com/index/introducing-o3-and-o4-mini/], o3 system card [https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf]
  • Gemini: 2.5 Pro model card [https://storage.googleapis.com/model-cards/documents/gemini-2.5-pro.pdf]
  • Claude: Sonnet 3.7 announcement [https://www.anthropic.com/news/claude-3-7-sonnet], Claude 4 announcement [https://www.anthropic.com/news/claude-4]

Benchmark reporting

Claude models are hybrid reasoning models. The benchmarks reported in this blog post show the highest scores achieved with or without extended thinking. We’ve noted below for each result whether extended thinking was used:

  • No extended thinking: SWE-bench Verified, Terminal-Bench
  • With extended thinking (up to 64K tokens): TAU-bench, GPQA Diamond, MMMLU, MMMU, AIME
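
For reference, a minimal sketch of what enabling extended thinking looks like through the Messages API's thinking parameter; the prompt, budget, and max_tokens values here are illustrative (the benchmarks above allowed thinking budgets up to 64K tokens):

    import anthropic

    client = anthropic.Anthropic()

    # Extended thinking is enabled by allocating a thinking-token budget.
    # max_tokens must exceed budget_tokens, since it covers both the
    # thinking blocks and the final answer.
    response = client.messages.create(
        model="claude-opus-4-1-20250805",
        max_tokens=20000,
        thinking={"type": "enabled", "budget_tokens": 16000},
        messages=[{"role": "user", "content": "A GPQA Diamond-style question..."}],
    )

    # Thinking and the final answer arrive as separate content blocks.
    for block in response.content:
        if block.type == "thinking":
            print("[thinking]", block.thinking[:200])
        elif block.type == "text":
            print(block.text)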

TAU-bench methodology

Scores were achieved with a prompt addendum to both the Airline and Retail Agent Policy instructions, directing Claude to better leverage its reasoning abilities with extended thinking during tool use. During multi-turn trajectories, the model is encouraged to write down its thoughts as it solves the problem, distinct from our usual thinking mode. To accommodate the additional steps this thinking incurs, the maximum number of steps (counted by model completions) was increased from 30 to 100; most trajectories completed in under 30 steps, and only one exceeded 50.
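
Purely to illustrate the step counting above, here is a hypothetical harness loop in which one model completion counts as one step; run_one_step and is_done are stand-ins, not the real TAU-bench harness:

    # Hypothetical trajectory loop: each model completion counts as one step,
    # mirroring the cap raised from 30 to 100 described above.
    MAX_STEPS = 100

    def run_trajectory(task, run_one_step, is_done):
        history = [{"role": "user", "content": task}]
        for step in range(1, MAX_STEPS + 1):
            completion = run_one_step(history)  # one completion == one step
            history.append({"role": "assistant", "content": completion})
            if is_done(completion):
                return history, step  # most trajectories finish in under 30 steps
        return history, MAX_STEPS     # cap reached without task completion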

SWE-bench methodology

For the Claude 4 family of models, we continue to use the same simple scaffold described in our prior releases [https://www.anthropic.com/engineering/swe-bench-sonnet], which equips the model with only two tools: a bash tool and a file-editing tool that operates via string replacements. We no longer include the third 'planning tool' [https://www.anthropic.com/engineering/claude-think-tool] used by Claude 3.7 Sonnet. For all Claude 4 models, we report scores on the full set of 500 problems; scores for OpenAI models are reported on a 477-problem subset [https://openai.com/index/gpt-4-1/].
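
To make the scaffold concrete, here is a hypothetical sketch of how a two-tool setup like this could be declared in the Messages API tool-use format; the names and schemas are illustrative assumptions, not the actual harness definitions:

    # Illustrative declarations for a bash tool and a string-replacement file
    # editor; these schemas are assumptions, not Anthropic's real scaffold.
    tools = [
        {
            "name": "bash",
            "description": "Run a shell command in the repository and return its output.",
            "input_schema": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
        {
            "name": "str_replace_edit",
            "description": "Replace old_str (which must occur exactly once) with new_str in the file at path.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "old_str": {"type": "string"},
                    "new_str": {"type": "string"},
                },
                "required": ["path", "old_str", "new_str"],
            },
        },
    ]
    # These would be passed as tools=tools to client.messages.create, with each
    # tool_use block the model emits dispatched to a local executor.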

