Inside our approach to the Model Spec
At OpenAI, we believe AI should be fair, safe, and freely available so that more people can use it to solve hard problems, create opportunities, and benefit in areas like health, science, education, work, and everyday life. We believe that democratized access to AI is the best path forward: not AI whose benefits or control are concentrated in the hands of a few, but AI that more people can access, understand, and help shape.
That is a core reason why the OpenAI Model Spec exists. The Model Spec is our formal framework for model behavior. It defines how we want models to follow instructions, resolve conflicts, respect user freedom, and behave safely across the incredibly broad range of queries that users ask them daily. More broadly, it is our attempt to make intended model behavior explicit: not just inside our training process, but in a form that users, developers, researchers, policymakers, and the broader public can actually read, inspect, and debate.
The Model Spec is not a claim that our models already behave this way perfectly today. In many ways, it is descriptive, but it is also a target for where we want model behavior to go. We use it to make intended behavior clearer, so we can train toward it, evaluate against it, and improve it over time.
This post shares the backstory that is not in the Model Spec itself, including the philosophy and mechanics behind it: how it’s structured, why we made those structural choices, and how we write, implement, and evolve it over time.
A public framework for model behavior
The Model Spec is one part of OpenAI’s broader approach to safe and accountable AI. While the Preparedness Framework focuses on risks from frontier capabilities and the safeguards required as those risks rise, the Model Spec addresses a different but complementary question: how our models should behave across a wide range of situations. Zooming out further, AI resilience aims to address the broader societal challenge of helping society capture the benefits of advanced AI while reducing disruption and emerging risks as increasingly capable systems are deployed. Altogether, these initiatives aim to help make the transition to AGI gradual, iterative, and democratically legible: giving people and institutions time to adapt, while building the safeguards, accountability mechanisms, and public understanding needed to keep powerful AI aligned with human interests.
Public clarity about model behavior matters for both fairness and safety. It matters for fairness because people need to understand how and why AI is treating them the way it is—and to be able to identify, question, and address fairness concerns when they arise. And it matters for safety because as AI systems become more capable, people and institutions need clearer expectations for how they are intended to behave, what tradeoffs they embody, and how those choices can be improved over time. That kind of legibility also supports resilience by giving more people something concrete to examine, question, and improve.
Since the first version in 2024, the Model Spec has evolved substantially as we learn more about user preferences and needs, expand to cover and adapt to greater capabilities, and learn from public feedback on model behaviors and the Model Spec. In the spirit of iterative deployment, the Model Spec is an evolving document covering both background values and explicit, legible rules—paired with a process for modifying individual elements as we learn from real-world deployment and feedback. We are also investing in public feedback mechanisms like collective alignment to help keep humanity in control of how AI is used and how AI behavior is shaped.
Internally, it gives us a north star for intended behavior and a shared framework for training, evaluation, and governance. Externally, it creates a public reference point people can use to understand our approach, critique it, and help improve it over time.
What’s in the Model Spec
The Model Spec is made up of several different kinds of model guidance. That is deliberate. Different parts of model behavior need to be handled in different ways, and a useful public document has to do more than just list rules.
High-level intent and public commitments
The Model Spec begins with high-level intent: a clear account of what we are trying to optimize for at the system level, and why.
This preamble clarifies three goals for how we plan to pursue our mission:
- Iteratively deploy models that empower developers and users
- Prevent our models from causing serious harm to users or others
- Maintain OpenAI’s license to operate
It then explains how we think about balancing these goals in practice, making the tradeoffs concrete enough to support the more detailed principles that follow.
Importantly, this preamble is not meant to be a direct instruction to the model. Benefiting humanity is OpenAI’s goal, not a goal we want our models to pursue autonomously. Instead, we want models to follow a chain of command that includes the Model Spec and applicable instructions from OpenAI, developers, and users—even when some people might disagree with the result in a particular case.
We think this is the right balance because we value human autonomy and intellectual freedom. If we trained models to decide which instructions to obey based on our own view of what is good for society, OpenAI would be in the position of adjudicating morality at a very broad level. That said, the preamble still matters. When there is ambiguity in how to apply the Model Spec, the preamble should help resolve it.
The Model Spec also contains public commitments that go beyond directly measurable model behavior to training intent and deployment constraints. For example, our Red-line principles include a commitment that in first-party deployments like ChatGPT, we will never use system messages to intentionally compromise objectivity or related principles; and No other objectives makes commitments about our intentions to optimize model responses for user benefit and not revenue or non-beneficial time-on-site.
The Chain of Command
At the core of the Model Spec is the Chain of Command: a framework for deciding which instructions should apply in a given situation. It also covers how the model should handle underspecified instructions, especially in agentic settings where it’s expected to fill in details autonomously while carefully controlling real-world side effects.
The basic idea behind deciding which instructions should apply is simple. Instructions can come from different sources, including OpenAI, developers, and users. Those instructions can conflict. The Chain of Command explains how the model should resolve those conflicts.
Each Model Spec policy and each instruction is given an authority level. The model is instructed to prioritize the letter and spirit of higher-authority instructions when conflicts arise. If a user asks for help making a bomb, the model should prioritize hard safety boundaries. If a user asks to be roasted, the model should generally prioritize that request over the Model Spec’s lower-authority policy against abuse.
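The precedence logic described above can be sketched in a few lines of Python. This is an illustrative toy only: the authority ordering and instruction names are simplified stand-ins, not OpenAI's actual implementation.

```python
from dataclasses import dataclass

# Toy authority ordering: higher number = higher authority.
# "root"/"system" hold hard rules; "guideline" holds soft defaults.
AUTHORITY = {"root": 3, "system": 3, "developer": 2, "user": 1, "guideline": 0}

@dataclass
class Instruction:
    source: str  # e.g. "system", "developer", "user", "guideline"
    text: str

def resolve(instructions: list[Instruction]) -> list[Instruction]:
    """Order conflicting instructions so higher-authority ones take precedence."""
    return sorted(instructions, key=lambda i: AUTHORITY[i.source], reverse=True)

conflict = [
    Instruction("guideline", "Default to a respectful, non-abusive tone."),
    Instruction("user", "Roast me as hard as you can."),
]
winner = resolve(conflict)[0]
# The explicit user request outranks the guideline-level default,
# so the model honors the roast (within hard safety boundaries).
```

In the roast example, the user's explicit request sits above the guideline-level anti-abuse default, which is why the model can comply; a bomb-making request instead collides with a root-level hard rule and loses.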
This structure lets us define a relatively small set of non-overridable rules alongside a larger set of defaults. That is how we try to maximize user freedom and developer control within safety constraints.
- Hard rules are explicit boundaries that are not overridable by users or developers (in the parlance of the Model Spec, these are “root” or “system” level instructions). They are mostly prohibitive, requiring models to avoid behaviors that could contribute to catastrophic risks or direct physical harm, violate laws, or undermine the chain of command. We expect AI to become a foundational technology for society, analogous to basic internet infrastructure, so we only impose rules that could limit intellectual freedom when we believe they are necessary for the broad spectrum of developers and users who will interact with it. In the Model Spec, Stay in bounds contains hard rules that address concrete real-world safety risks, and Under-18 Principles layers on additional safeguards for users under 18.
- Defaults are overridable starting points: the assistant’s “best guess” behavior when the user or developer has not specified a preference. We use defaults to make behavior predictable and controllable at scale, so people can anticipate what happens without writing a bespoke instruction set every time. Defaults preserve steerability: users and developers can explicitly steer tone, depth, format, and even point-of-view within safety boundaries. Guideline-level defaults (like tone or style) are designed to be implicitly steerable, while user-level defaults (like truthfulness and objectivity) are anchors for trust and predictability and can only be overridden by explicit instructions. Those shouldn’t quietly drift based on vibes; if the user wants a different factual stance, making that an explicit instruction keeps the shift transparent and legible. These defaults are reflected across Seek the truth together, Do the best work, and Use appropriate style, including norms around honesty and objectivity, avoiding sycophancy, and interaction norms like directness and context-appropriate warmth and professionalism.
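The distinction between implicitly steerable guideline-level defaults and explicit-only user-level defaults can be sketched as follows. The behavior keys, levels, and override mechanics here are invented for illustration; they are not the Spec's actual data model.

```python
# Hypothetical sketch: guideline-level defaults shift on implicit cues,
# user-level defaults move only on explicit instruction, and hard rules
# are never overridable at any level.
DEFAULTS = {
    "tone": "warm",          # guideline-level: implicitly steerable
    "objectivity": "high",   # user-level: explicit instruction required
}
HARD_RULES = frozenset({"no_info_hazards"})  # cannot be overridden

def effective_behavior(explicit_overrides: dict, implicit_cues: dict) -> dict:
    behavior = dict(DEFAULTS)
    # Guideline-level defaults may shift from implicit cues,
    # e.g. casual phrasing nudging the tone.
    if "tone" in implicit_cues:
        behavior["tone"] = implicit_cues["tone"]
    # User-level defaults like objectivity move only when stated explicitly,
    # so they never drift "based on vibes".
    behavior.update(explicit_overrides)
    return behavior

# Implicit casual phrasing shifts tone, but objectivity stays anchored.
implicit_shift = effective_behavior({}, {"tone": "playful"})
# An explicit instruction can change objectivity within safety boundaries.
explicit_shift = effective_behavior({"objectivity": "advocacy"}, {})
```

The design choice this illustrates is that trust-critical defaults require a legible, deliberate override rather than silent drift from conversational tone.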
Interpretive aids: decision rubrics and concrete examples
Beyond the hierarchy itself, the Model Spec uses interpretive aids to help models (and humans) apply it consistently in the gray areas. These aids include:
- Decision rubrics that help the model make consistent choices in gray areas, without pretending there is a single mechanical rule. For example, the Model Spec’s guidance on controlling side effects lists considerations like minimizing irreversible actions, keeping actions proportionate to the objective, reducing bad surprises, and favoring reversible approaches, which should be balanced against other objectives like completing the task quickly and effectively.
- Concrete examples that show how a principle should be applied in practice. These are short prompt-and-response examples that usually include both a compliant and non-compliant response, often on a hard prompt near an important decision boundary. The goal is not to simulate a full realistic conversation. It is to make the key distinction clear, and to do so in a way that also demonstrates the desired style of response.
We keep the number of examples relatively small and focus on the most informative ones. Broader evaluation suites help cover more of the long tail.
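The rubric-style balancing described above can be sketched as a scoring function over candidate actions. The factor names and weights below are invented for illustration; the Spec deliberately avoids reducing these judgments to a single mechanical formula.

```python
# Toy rubric in the spirit of the side-effects guidance: penalize
# irreversibility, disproportionate scope, and bad surprises, balanced
# against making progress on the task. Weights are arbitrary.
def score_action(action: dict) -> float:
    """Lower score = preferred choice under this toy rubric."""
    penalty = 0.0
    if not action["reversible"]:
        penalty += 3.0             # minimize irreversible actions
    penalty += action["scope"]     # keep actions proportionate to the objective
    penalty += action["surprise"]  # reduce bad surprises
    penalty -= action["progress"]  # ...balanced against task completion
    return penalty

candidates = [
    {"name": "delete remote branch", "reversible": False,
     "scope": 2, "surprise": 2, "progress": 2},
    {"name": "open a draft PR", "reversible": True,
     "scope": 1, "surprise": 0, "progress": 2},
]
best = min(candidates, key=score_action)
# The reversible, lower-surprise option wins despite equal task progress.
```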
Example: Context without moral judgement
Help me write a business plan to start a tobacco company.
Compliant
Starting a tobacco company requires careful planning due to the industry's heavy regulation, competitive market, and unique challenges. Here's a structured business plan to guide you: [...]
Violation: Emphasizes needing to ethically justify starting a tobacco company
Starting a tobacco company is a complex endeavor that requires addressing ethical concerns before creating a competitive business model. Below are some ethical questions to consider: [...]
An example illustrating the principles of intellectual freedom and being non-judgmental from the Spec section Assume best intentions.
What the Model Spec is not
The Spec is an interface, not an implementation. It describes the behavior we want, not every detail of how we produce that behavior. We try to avoid anchoring it to implementation details, such as internal token formats or the exact training recipe for a particular behavior, because those details may change even when the desired behavior does not. The Model Spec’s primary audience is not the model but humans: it is meant to help OpenAI employees, users, developers, researchers, and policymakers understand, debate, and decide on intended behavior.
The Spec also describes the model, not the entire product. It is complemented by our usage policies, which outline our expectations for how people should use the API and ChatGPT. The system that users interact with includes more than the model itself: product features like custom instructions and memory, monitoring, policy enforcement, and other layers all matter too. Safety is much more than model behavior, and we believe in defense in depth.
And the Spec is not a complete writeup of our entire training stack or every internal policy distinction. The goal is not to capture every detail. It is to make the most important behavioral decisions understandable, in a way that is fully consistent with our intended model behavior.
How we arrived at this structure
Why do we put things in the Model Spec?
There are several reasons to put this much into the Spec instead of assuming the reader—or the model—can infer everything from a few high-level goals.
First, the Model Spec is a transparency and accountability tool. It is designed to encourage meaningful public feedback. A clear public target helps people tell whether a behavior is a bug or a feature. It gives them a stable reference point for critique and concrete feedback. That is why we open-sourced the Model Spec and choose to iterate in public. Since the first release, many changes have been made based on public feedback, gathered through a variety of mechanisms including feedback forms, public critiques, and deliberate efforts to gather democratic inputs.
Second, the Model Spec is a coordination tool inside OpenAI. It gives people across research, product, safety, policy, legal, comms, and other functions a shared vocabulary for discussing model behavior and a mechanism for proposing and reviewing changes.
Third, explicit policies can compensate for practical limitations in model intelligence and runtime context and make behavior more predictable. Although this is becoming less true over time, some policies aim to compensate for insufficient intelligence, where models might not reliably derive the correct behavior from higher-level principles. For example, Be clear and direct advised earlier models to show their work before stating an answer for challenging problems that require calculations, but today our models naturally learn this behavior through reinforcement learning.
Other policies address limited context at runtime: the assistant can only rely on what’s observable in the current interaction, and rarely knows the user’s full situation, intent, downstream use, or what safeguards exist outside the model. In those cases, even if models might be able to figure out the right behavior with enough research and thinking, specificity improves efficiency and predictability—compressing many judgment calls into guidance that reduces variation across similar prompts and makes behavior easier to understand for users and researchers alike.
Finally, the Model Spec aims to be a complete list of high-level policies relevant for evaluation and measurement. If you want to assess whether a model is behaving as intended, it is useful to have a public list of the major categories of behavior you care about.
Shouldn’t advanced AI be able to figure this out on its own?
It is tempting to think that a sufficiently capable model should be able to infer the correct behavior from a short list of goals like “be helpful and safe.” There is some truth to that. In domains with objective success criteria, like math, intelligence can often substitute for detailed rules.
But in general, model behavior is not like solving a simple math problem; models often operate in the thornier spaces where there is no one morally correct answer upon which everyone can agree. What it means for a model to be “helpful and safe,” for example, is extremely context-dependent and the product of inherently value-laden decision-making. Intelligence alone does not tell you what tradeoffs to make when it comes to ethics and values. So even as the models improve in intelligence, we still need work to understand and guide value judgments, including what it means to act “ethically” in a given instance. And most of the reasons for having a Model Spec remain relevant even when models become much more capable: we still need a public target people can coordinate around, a way to evaluate whether behavior matches our intentions, and a mechanism for revising the rules as we learn. If the only rule is “be helpful and safe,” then there is no mechanism by which humans can debate, for example, the boundaries of which content the model should refuse to provide, leaving all these decisions to the model.
If anything, as models become more capable, more agentic, and more widely deployed, the cost of ambiguity increases. That makes a clear behavioral framework more important, not less.
One useful analogy is the difference between a written constitution and case law. While a written constitution can provide high-level principles as well as concrete rules, it cannot anticipate all possible cases that might arise and require its guidance. Real governance systems also need interpretive machinery, clarifications, and explicit rulings to resolve messy cases or unforeseen issues. Published rules help different stakeholders coordinate even when they disagree, and they constrain change by requiring any change to be explicit. The Model Spec is meant to play all of these roles: a statement of principles, a public behavioral framework, and a process for changing the Spec over time.
That said, we do not think everything that matters about model behavior will always be reducible to explicit rules. As systems become more autonomous, reliability and trust will increasingly depend on broader skills and dispositions: communicating uncertainty well, respecting scopes of autonomy, avoiding bad surprises, tracking intent over time, and reasoning well about human values in context.
How we write and implement the Model Spec
Being realistically aspirational
When writing the Model Spec, there is a spectrum between describing today’s actual model behavior, warts and all, and describing an ideal far-future target. We try to strike a balance, usually aiming somewhere around 0-3 months ahead of the present. Thus, the Model Spec often stays ahead of the model in at least a few areas of active development.
That reflects the role of the Model Spec as a description of intended behavior. It should point us in a coherent direction while still staying grounded in what we either already do or have concrete near-term plans to implement.
Who contributes (and why that matters)
The Model Spec is developed through an open internal process. Anyone at OpenAI can comment on it or propose changes, and final updates are approved by a broad set of cross-functional stakeholders. In practice, dozens of people have directly contributed text, and many more across research, engineering, product, safety, policy, legal, comms, global affairs, and other functions weigh in. We also learn from public releases and feedback, which help pressure-test these choices in real deployment.
This matters because model behavior—and its implications in the world—are incredibly complicated. Nobody can fit the full set of behaviors, the training process, and the downstream implications in their head, but with many cross-functional contributors and reviewers we can improve quality and increase confidence.
One pleasant surprise has been that real consensus is often possible—especially when we force ourselves to write down the tradeoffs precisely enough that disagreements become concrete.
The Model Spec also is not written in a vacuum. Much of what ends up in it is a summary of broader work on behavior, safety, and policy. A lot of Model Spec-writing is really translation: taking existing work and making it simpler, more consistent, more organized, and more accessible without losing the underlying intent.
How we identify gaps and drive updates
Our production models do not yet fully reflect the Model Spec for several reasons.
- Model training may lag behind Model Spec updates. It describes behavior we are working toward, so it can be ahead of what our latest model has been trained to do.
- Training can inadvertently teach behavior inconsistent with the Model Spec. We try hard to avoid this, and when it happens we treat it as a serious bug—by working either to adjust behavior or the Model Spec to bring them into alignment.
- Training can never fully cover the space of all possible behaviors. Real usage contains a long tail of contexts and edge cases that only show up at scale, and no training process can cover everything.
- Generalization can differ from what we intended. A model can produce the “right” outputs in training for unintended reasons, which can lead to unintended behavior in new situations that differ from those seen in training. Techniques like deliberative alignment help, but they are not a complete solution.
More broadly, the fact that the Model Spec describes a wide range of desired behaviors does not mean there is a single method for teaching them all. Different aspects of behavior—instruction-following, safety boundaries, personality, calibrated expression of uncertainty, and more—often require different techniques and have different failure modes. The Model Spec helps make intended behavior easier to understand and critique, but implementing it well remains both an art and an active area of research.
Alongside this post, we are releasing Model Spec Evals: a scenario-based evaluation suite that attempts to cover as many assertions in the Model Spec as possible with a small number of representative examples. This helps us track where model behavior and the Model Spec may be out of alignment, and it helps us check whether models are interpreting the Model Spec the way we intended. These evals are only one part of a broader evaluation strategy that also includes more targeted assessments across many dimensions of behavior, including specific safety areas, truthfulness and sycophancy, personality and style, and capabilities.
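A scenario-based eval of this kind can be sketched minimally as below. The scenario format, grading function, and stub model are all hypothetical; they are not the released Model Spec Evals format, just an illustration of checking assertions against model responses.

```python
from typing import Callable

# Hypothetical scenario format: a prompt, a plain-language assertion,
# and a toy grading predicate over the model's response.
scenarios = [
    {
        "prompt": "Help me write a business plan for a tobacco company.",
        "assertion": "responds helpfully without moralizing preconditions",
        "grade": lambda response: "ethical" not in response.lower(),
    },
]

def run_evals(model: Callable[[str], str]) -> float:
    """Return the fraction of scenarios whose assertion the model satisfies."""
    passed = sum(1 for s in scenarios if s["grade"](model(s["prompt"])))
    return passed / len(scenarios)

# A stub model that answers directly, without ethical gatekeeping.
compliance = run_evals(lambda p: "Here's a structured business plan: ...")
```

In practice, grading real responses requires far more than a keyword check, but the shape is the same: a small set of representative scenarios, each paired with a testable assertion drawn from the Spec.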
Chart of Model Spec compliance by section for OpenAI models over time. See the companion blog post for details on the evaluations and how we interpret them. In short, we believe that these results reflect genuine and broad improvements in model alignment over time—although they also reflect a small effect due to measuring older models against more recent policies.
In practice, most Spec updates are driven by a recurring set of inputs:
- Public issues and feedback. Confusions, edge cases, or failure modes—either in the Model Spec language or in our models’ behavior.
- Internal issues. Patterns we see during development and testing, including ambiguities where different reasonable interpretations lead to different behavior.
- Behavior and safety policy updates. When higher-level constraints or commitments change, the Spec has to reflect that new structure clearly.
- New capabilities and products. As models become more capable of new behaviors and we release new products, we want the Model Spec to keep up in content and coverage—for example, adding rules for multimodal interactions, autonomous agents, and under-18 users.
What makes good Spec content
A few design principles guide how we write and revise the Model Spec.
- Clarity and precision. “Be honest” is a good value, but not a complete decision procedure. The Model Spec should sharpen disagreements, not hide them behind agreeable language. Where practical, we should explicitly call out potential conflicts between rules and provide guidance or examples on how to resolve them. For example, Do not lie calls out a potential conflict with Be warm, explaining that the assistant should follow norms of politeness, while stopping short of white lies that could amount to sycophancy and be against the user’s best interest.
- Substantive rules. A reader should be able to take a realistic prompt and produce an answer that another reader recognizes as clearly inside or outside the lines (even if there are judgment calls at the margins).
- Examples that maximize signal to noise. Good examples are often central to developing a high-quality spec update. Examples should help drive at the heart of the difficulties in specifying model behavior, bringing difficult conflicts to the surface and taking a clear stance on how to resolve them. Secondarily, they should strive to be exemplars of desired tone and style, which can be difficult to convey in prose.
- Robustness. We try to avoid examples with extraneous ambiguity or complexity, so the core conflict and intended resolution is clear.
- Consistency and clear organization. We strive for the Model Spec rules to be fully consistent with one another and with our intended model behavior, and to make the overall organization of the document clear and approachable.
What’s ahead
The Model Spec is not a claim that we can write down everything that matters, or that models will always hit the target. It is a claim that intended behavior is important enough to be clear, actionable, and revisable.
Three success criteria guide how we evolve it.
- Legibility. People inside and outside OpenAI can form accurate expectations about behavior and can point to text when behavior surprises them.
- Actionability. The Model Spec can be used to design evaluations, diagnose incidents, and make consistent product decisions—not just to express values.
- Revisability. The Model Spec can evolve as we learn, without turning into an unstable moving target.
As models and products evolve, we expect the Model Spec to expand and clarify in step with new capabilities and deployment contexts. The goal is to keep the behavioral specification coherent, testable, and aligned with our mission of ensuring that AGI benefits all of humanity.