How evals drive the next chapter in AI for businesses
Over one million businesses around the world are leveraging AI to drive greater efficiency and value creation. But some organizations have struggled to get the results they are expecting. What is causing the gap?
At OpenAI, we are leveraging AI internally to achieve our ambitious goals. One key set of tools we use is evals: methods to measure and improve the ability of an AI system to meet expectations.
Similar to product requirement documents, evals make fuzzy goals and abstract ideas specific and explicit. Using evals strategically can make a customer-facing product or internal tool more reliable at scale, decrease high-severity errors, protect against downside risk, and give an organization a measurable path to higher ROI.
At OpenAI, our models are our products, so our researchers use rigorous frontier evals to measure how well the models perform in different domains. While frontier evals help us ship better models faster, they cannot reveal all the nuances required to ensure the model will perform on a specific workflow in a specific business setting. That is why internal teams have also created dozens of contextual evals designed to assess performance within a specific product or internal workflow. It is also why business leaders should learn how to create contextual evals specific to their organization’s needs and operating environment.
This is a primer for business leaders looking to apply evals in their organizations. Contextual evals, each crafted for a specific organization’s workflow or product, are an active area of development and definitive processes have yet to emerge. As a result, this article provides a broad framework that we have seen work across many situations. We expect this field to evolve and for more frameworks to emerge that address specific business contexts and goals. For example, an excellent eval for a cutting-edge, AI-enabled consumer product might require a different process than an eval for an internal automation based around a standard operating procedure. We believe that the framework presented below will serve as a collection of best practices in both cases, and will be a useful guide as you build evals tailored to your organization’s needs.
How evals work: Specify → Measure → Improve
1. Specify: Define what “great” means
Start with a small, empowered team that can write down the purpose of your AI system in plain terms, for example: “Convert qualified inbound emails into scheduled demos while staying on brand.”
This team should be a mix of individuals with technical and domain expertise (in the given example, you’d want sales experts on the team). They should be able to state the most important outcomes to measure, outline the workflow end-to-end, and identify each important decision point your AI system will encounter. For every step in that workflow, the team should define what success looks like and what to avoid. This process will create a mapping of dozens of example inputs (e.g. inbound emails) to the outputs they want the system to produce. The resulting golden set of examples should be a living, authoritative reference of your most skilled experts’ judgement and taste for what “great” looks like.
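To make this concrete, a golden set can start as nothing more than a small structured file that pairs example inputs with the outcomes your experts consider great. Below is a minimal sketch for the inbound-email example; the field names, entries, and schema are illustrative, not prescriptive.

```python
# golden_set.py — a minimal, illustrative "golden set" for the inbound-email example.
# Each entry pairs a realistic input with the outcome and notes your experts agreed on.
GOLDEN_SET = [
    {
        "id": "email-001",
        "input": "Hi, we're evaluating tools for our 40-person sales team. Can we see a demo next week?",
        "expected_action": "schedule_demo",
        "expected_tone": "on_brand",
        "notes": "Qualified lead: company size stated, clear buying intent.",
    },
    {
        "id": "email-002",
        "input": "Please remove me from your mailing list.",
        "expected_action": "unsubscribe",
        "expected_tone": "on_brand",
        "notes": "Must never be turned into a sales pitch.",
    },
]

if __name__ == "__main__":
    print(f"{len(GOLDEN_SET)} golden examples loaded")
```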
Do not get overwhelmed with a cold start or try to solve it all at once. The process is iterative and messy. Early prototyping can help immensely. Reviewing 50 to 100 outputs from an early version of the system will uncover how and when your system is failing. This “error analysis” will result in a taxonomy of different errors (and their frequencies) to track as your system improves.
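As a sketch of what error analysis can look like in practice, the snippet below tallies expert labels from an early review pass into a simple taxonomy with frequencies; the labels and counts are invented for illustration.

```python
# error_analysis.py — an illustrative tally of reviewed outputs into an error taxonomy.
from collections import Counter

# Labels an expert assigned while reviewing early outputs (values are made up).
reviewed_outputs = [
    "correct", "missed_qualification_signal", "correct", "off_brand_tone",
    "correct", "hallucinated_pricing", "missed_qualification_signal", "correct",
]

# Count only the failures to build the taxonomy to track as the system improves.
taxonomy = Counter(label for label in reviewed_outputs if label != "correct")
total = len(reviewed_outputs)

for error_type, count in taxonomy.most_common():
    print(f"{error_type}: {count}/{total} ({count / total:.0%})")
```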
This process is not purely technical—it’s cross-functional and centered on defining business goals and desired processes. Technical teams should not be asked in isolation to judge what best serves customers or the needs of other teams like product, sales, or HR. Consequently, domain experts, technical leads, and other key stakeholders should share ownership.
2. Measure: Test against real-world conditions
The next step is to measure. The goal of measurement is to reliably surface concrete examples of how and when the system is failing. To do that, create a dedicated test environment that closely mirrors real-world conditions—not just a demo or prompt playground. Evaluate performance against your golden set and error analysis under the same pressures and edge cases your system will actually face.
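One minimal way to structure such a test is a small harness that replays the golden set through the system under test and collects concrete failures. In the sketch below, `run_system` is a placeholder for however your real pipeline is invoked, and the comparison logic is intentionally simplistic.

```python
# run_eval.py — a minimal harness that replays the golden set through the system under test.
from typing import Callable

def evaluate(golden_set: list[dict], run_system: Callable[[str], dict]) -> dict:
    """Replay each golden example and compare the system's action to the expected one."""
    failures = []
    for example in golden_set:
        # run_system is your real pipeline; assumed to return e.g. {"action": "...", "reply": "..."}
        result = run_system(example["input"])
        if result.get("action") != example["expected_action"]:
            failures.append({
                "id": example["id"],
                "got": result.get("action"),
                "expected": example["expected_action"],
            })
    return {
        "total": len(golden_set),
        "passed": len(golden_set) - len(failures),
        "failures": failures,  # concrete examples of how and when the system fails
    }
```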
Rubrics can help bring concreteness to judging outputs from your system, but it is possible to over-emphasize superficial items at the expense of your overall goals. Further, some qualities are difficult or impossible to measure. In some cases, traditional business metrics will be important. In others, you’ll need to invent new metrics. Keep your subject matter experts in the loop throughout, and tightly align the process with your core objectives.
To actually test the system, use examples drawn from real-world situations whenever possible, and include or invent edge cases that are rare but costly if mishandled.
Some evals can be scaled through the use of an LLM grader, an AI model that grades outputs the same way an expert would, but it is still important to keep a human in the loop. Your domain experts need to regularly audit LLM graders for accuracy and should also directly review logs of your system's behavior.
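As an illustration of the LLM-grader pattern, the sketch below prompts a model with a rubric and the output to be judged, using the `openai` Python SDK; the rubric, prompt, and model name are placeholders you would replace with what your team specified, and grader scores still need to be audited against expert judgment.

```python
# llm_grader.py — an illustrative LLM grader; rubric, prompt, and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Score the reply from 1-5 on: (a) correctly handling the request,
(b) staying on brand. Respond with a single integer and one sentence of reasoning."""

def grade(email: str, reply: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you have validated as a grader
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Inbound email:\n{email}\n\nSystem reply:\n{reply}"},
        ],
    )
    return response.choices[0].message.content

# Keep a human in the loop: periodically compare grader scores to expert labels
# on a sample of cases, and review raw logs directly rather than trusting scores alone.
```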
Evals can help you decide when a system is ready to launch, but they do not stop at launch. You should continuously measure the quality of your system's real outputs generated from real inputs. As with any product, signals from your end-users (whether external or internal) are especially important and should be built into your eval.
3. Improve: Learn from errors
The last step is to set up a process for continuous improvement. Addressing problems uncovered by your eval can take on many forms: refining prompts, adjusting data access, updating the eval itself to better reflect your goals, and so forth. As you uncover new types of errors, add them to your error analysis and address them. Each iteration compounds upon the last: new criteria and clearer expectations of system behavior help reveal new edge cases and subtle, stubborn issues to correct.
To support this iteration, build a data flywheel. Log inputs, outputs, and outcomes; sample those logs on a schedule and automatically route ambiguous or costly cases to expert review. Add these expert judgements to your eval and error analysis, then use them to update prompts, tools, or models. Through this loop you will more clearly define your expectations for the system, align it tighter to those expectations, and identify additional relevant outputs and outcomes to track. Deploying this process at scale yields a large, differentiated, context-specific dataset that is hard to copy—a valuable asset your organization can leverage as you build the best product or process in your market.
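A minimal sketch of the routing step in such a flywheel is shown below; it assumes logs are already captured as records with a grader confidence score and a high-stakes flag, and the thresholds and field names are illustrative.

```python
# flywheel_sampling.py — illustrative routing of logged cases to expert review.
import random

def select_for_review(logs: list[dict], sample_rate: float = 0.05,
                      low_confidence: float = 0.6) -> list[dict]:
    """Route ambiguous or costly cases to experts, plus a random sample of everything else."""
    queue = []
    for entry in logs:
        ambiguous = entry.get("grader_confidence", 1.0) < low_confidence
        costly = entry.get("high_stakes", False)
        if ambiguous or costly or random.random() < sample_rate:
            queue.append(entry)
    return queue

# Expert judgments on the review queue feed back into the golden set and error taxonomy,
# which in turn drive updates to prompts, tools, or models.
```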
While evals create a systematic way to improve your AI system, new failure modes can arise. In practice, as models, data, and business goals evolve, evals must also be continuously maintained, expanded, and stress-tested.
For external-facing deployments, evals do not replace more traditional A/B tests and product experimentation. The two are complementary: each can guide the other and provide visibility into how the changes you make impact real-world performance.
What evals mean for business leaders
Every major technology shift reshapes operational excellence and competitive advantage. Frameworks like OKRs and KPIs have helped organizations orient themselves around “measuring what matters” for their business in the age of big data analytics. Evals are the natural extension of measurement for the age of AI.
Working with probabilistic systems requires new kinds of measurement and deeper consideration of trade-offs. Leaders must decide when precision is essential, when they can be more flexible, and how to balance velocity and reliability.
Evals are difficult to implement for the same reason that building great products is difficult: they require rigor, vision, and taste. If done well, evals become unique differentiators. In a world where information is freely available and expertise is democratized, your advantage hinges on how well your systems can execute inside your context. Robust evals create compounding advantages and institutional know-how as your systems improve.
At their core, evals are about a deep understanding of business context and objectives. If you cannot define what “great” means for your use case, you’re unlikely to achieve it. In this sense, evals highlight a key lesson of the AI era: management skills are AI skills. Clear goals, direct feedback, prudent judgment, and a clear understanding of your value proposition, strategy, and processes still matter, perhaps even more than ever.
As more best practices and frameworks emerge, we will be sharing them. In the meantime, we encourage you to experiment with evals and discover what processes work best for your needs. To get started, identify the problem to be solved and your domain expert, round up your small team, and, if you are building on our API, explore our Platform Docs.
Don’t hope for “great.” Specify it, measure it, and improve toward it.