Beyond rate limits: scaling access to Codex and Sora

In the past year, both Sora and Codex have seen rapid adoption, with usage quickly pushing beyond what we originally expected. We’ve seen a consistent pattern: users dive in, find real value, and then run into rate limits.

Rate limits can help smooth demand and ensure fair access; however, when users are getting value, hitting a hard stop can be frustrating. We wanted a way for users to keep going, while protecting system performance and user trust in our approach.

To solve this, we built a real‑time access engine that counts usage. One of the layers in that engine is the ability to purchase credits. When users exceed their rate limits, credits let them keep using our products by spending down their credit balance.

Underneath this is a complex system that fuses limits, real‑time usage tracking, and credit balances in a single access model. This post covers why scaling Sora and Codex required rethinking access control, how a provably correct, real-time system blends rate limits and credits per request, and how that foundation now unlocks additional access for both products.

Why existing access models fell short

Zooming out, traditional access models tend to force a choice:

Rate limits can be helpful at first, but leave users with a bad experience when they run out: “come back later”
Usage‑based billing is flexible, but leaves users paying from the first token—not ideal for supporting early exploration

For Sora and Codex, neither was sufficient on its own. If we simply raised rate limits, we’d lose important demand-smoothing and fairness controls and run out of capacity to serve everyone. If we relied entirely on asynchronous usage billing, we’d introduce lag, overages, or reconciliation issues—exactly the kinds of problems users notice when they’re most engaged.

What we needed instead was a single hybrid system combining real-time limits with pay-as-you-go access.

This system had to:

Enforce rate limits until they’re reached
Seamlessly transition to credits within the same request
Make that decision in real time
Be rigorously accurate and auditable when tracking credit consumption

Access as a waterfall, not a gate

One of the key conceptual shifts we made was modeling access as a decision waterfall. Instead of asking “is this allowed?”, we ask “how much is allowed, and from where?” When counting usage, the system evaluates the layers of that waterfall in sequence.

This model reflects how users actually experience the product. Rate limits, free tiers, credits, promotions, and enterprise entitlements are all just layers in the same decision stack. From a user’s perspective, they don’t “switch systems”—they just keep using Sora or Codex. That’s why credits feel invisible: they’re just another element in the waterfall.
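As a rough illustration of the waterfall idea, the sketch below walks an ordered list of layers and draws as much usage from each as it can cover. The `Layer` type, the `allocate` function, the layer names, and their ordering are all assumptions made for this example; the post does not spell out the actual sequence.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """One hypothetical layer in the waterfall, e.g. a rate limit or a credit balance."""
    name: str
    remaining: int  # units of usage this layer can still absorb


def allocate(requested: int, layers: list[Layer]) -> tuple[int, list[tuple[str, int]]]:
    """Walk the layers in order and draw as much as each one can cover.

    Returns (granted, breakdown), where breakdown lists (layer name, units drawn).
    Layers are mutated in place to reflect what was consumed.
    """
    breakdown = []
    outstanding = requested
    for layer in layers:
        if outstanding == 0:
            break
        drawn = min(outstanding, layer.remaining)
        if drawn > 0:
            layer.remaining -= drawn
            breakdown.append((layer.name, drawn))
            outstanding -= drawn
    return requested - outstanding, breakdown


# Assumed ordering: rate limit first, then promotional credits, then paid credits.
layers = [Layer("rate_limit", 3), Layer("promo_credits", 2), Layer("paid_credits", 10)]
granted, breakdown = allocate(7, layers)
print(granted, breakdown)
# 7 [('rate_limit', 3), ('promo_credits', 2), ('paid_credits', 2)]
```

The return value answers both questions at once: how much was allowed, and which layers it came from.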

Why we built this in‑house

We evaluated third‑party usage billing and metering platforms to handle credit consumption. They’re well‑suited for invoicing and reporting, but didn’t meet two critical requirements:

Real‑time correctness

When a user hits a limit and has credits available, the system must know immediately. Best‑effort or delayed counting shows up as surprise blocks, inconsistent balances, and incorrect charges. For interactive products like Sora and Codex, those failures become visible and frustrating.

Reconcilability and trust

We also needed to offer transparency into every outcome:

Why a request was allowed or blocked
How much usage it consumed
Which limits or balances were applied

This capability needed to be tightly integrated into our decision waterfall rather than solved in isolation in a separate usage billing platform that only saw one piece of what was happening. To let users access our products without compromising trust, we needed full control over correctness, timing, and observability. That pushed us toward an in‑house solution.

Building a high‑scale usage and balance system

To power this, we built a distributed usage and balance system designed specifically for synchronous access decisions.

At a high level, the system:

Tracks per‑user, per‑feature usage
Maintains rate‑limit windows
Maintains real‑time credit balances
Debits balances idempotently through a streaming async processor

Every request passes through a single evaluation path that makes a real‑time decision about how much usage is allowed by synchronously consuming from rate limits and, if needed, verifying sufficient credits; it then returns one definitive outcome while settling any credit debits asynchronously. This ensures consistent behavior across products and eliminates duplicated logic across teams.
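The single evaluation path might look roughly like the following sketch. It is not the production design: it assumes a fixed-window rate limit counted in requests, a flat credit `cost` per request, and an in-process queue standing in for the streaming async processor. `Account`, `Decision`, `evaluate`, and `settle` are hypothetical names.

```python
import queue
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical per-user, per-feature state."""
    limit: int              # requests allowed per window
    window_seconds: float
    credits: int            # last settled credit balance
    window_start: float = 0.0
    used_in_window: int = 0


@dataclass
class Decision:
    allowed: bool
    source: str  # "rate_limit", "credits", or "blocked"
    cost: int = 0


debit_queue = queue.Queue()  # stand-in for the streaming debit pipeline


def evaluate(user_id: str, acct: Account, cost: int, now: float) -> Decision:
    """Make one synchronous access decision; credit debits are only enqueued."""
    # Start a fresh window if the current one has expired.
    if now - acct.window_start >= acct.window_seconds:
        acct.window_start = now
        acct.used_in_window = 0
    # Step 1: consume from the rate limit synchronously.
    if acct.used_in_window < acct.limit:
        acct.used_in_window += 1
        return Decision(True, "rate_limit")
    # Step 2: verify credits against the last settled balance, then enqueue the debit.
    if acct.credits >= cost:
        debit_queue.put((user_id, cost))
        return Decision(True, "credits", cost)
    return Decision(False, "blocked")


def settle(accounts: dict[str, Account]) -> None:
    """Drain queued debits asynchronously with respect to the decisions above."""
    while not debit_queue.empty():
        user_id, cost = debit_queue.get()
        acct = accounts[user_id]
        # Clamp at zero: any overshoot is forgiven, a stand-in for an automatic refund.
        acct.credits -= min(cost, acct.credits)


accounts = {"u1": Account(limit=2, window_seconds=60.0, credits=5)}
for _ in range(4):
    print(evaluate("u1", accounts["u1"], cost=3, now=100.0))
settle(accounts)
print(accounts["u1"].credits)
```

In this run the first two requests are covered by the rate limit and the next two are both approved against the not-yet-settled balance of 5, so the second debit overshoots. Settlement clamps the balance at zero rather than billing the difference, which mirrors the trade-off between strict enforcement and delayed balance updates.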

A provably correct billing system

One of the key design principles of this system is that we must be able to prove that our billing is correct. This reflects the roots of our credit support, which originated with enterprise customers. The system maintains three separate datasets that all tie together:

Product usage events: What the user actually did
Monetization events: What we charge the user for their usage
Balance updates: How much we adjusted the user’s credit balance and why
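One way to picture the three datasets is as records that each point back to the record that produced them. The field names, the `CREDITS_PER_UNIT` price, and the `monetize` and `to_balance_update` helpers below are assumptions for illustration only, not the actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProductUsageEvent:
    """What the user actually did."""
    event_id: str
    user_id: str
    feature: str
    units: int
    covered_by_rate_limit: bool


@dataclass(frozen=True)
class MonetizationEvent:
    """What the user is charged for that usage."""
    event_id: str
    usage_event_id: str  # link back to the usage event
    user_id: str
    credits_charged: int


@dataclass(frozen=True)
class BalanceUpdate:
    """How much the balance was adjusted, and why."""
    update_id: str
    monetization_event_id: str  # link back to the monetization event
    user_id: str
    delta: int
    reason: str


CREDITS_PER_UNIT = 2  # hypothetical price, not taken from the post


def monetize(usage: ProductUsageEvent) -> Optional[MonetizationEvent]:
    """Usage inside the rate limit produces no charge; the usage event still exists."""
    if usage.covered_by_rate_limit:
        return None
    return MonetizationEvent(
        f"mon-{usage.event_id}", usage.event_id, usage.user_id,
        usage.units * CREDITS_PER_UNIT,
    )


def to_balance_update(m: MonetizationEvent) -> BalanceUpdate:
    """Turn a charge into a balance adjustment attributed to that charge."""
    return BalanceUpdate(
        f"bal-{m.event_id}", m.event_id, m.user_id,
        -m.credits_charged, f"debit for usage {m.usage_event_id}",
    )


usage = ProductUsageEvent("u1", "user_42", "video_generation", 5, False)
charge = monetize(usage)
update = to_balance_update(charge)
print(update.delta, update.monetization_event_id, charge.usage_event_id)
# -10 mon-u1 u1
```

Because every record carries the identifier of its predecessor, any balance change can be traced back to the user action behind it.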

These datasets aren’t a casual by-product; they actually drive the system, with each dataset triggering the next. Separating what occurred, any associated charges, and what we debited lets us independently audit, replay, and reconcile every layer. This is an intentional trade-off where we are prioritizing provable correctness, at the cost of credit balance updates being slightly delayed. How we accomplished this:

Product usage events are published for all user activity, whether it drives credit consumption or not. This provides an audit trail for user activity and allows us to explain why we charged, or didn’t charge, credits.
Every event carries a stable idempotency key, so retries, replays, or worker restarts can never double‑debit a balance, which prevents double‑charging. This also lets us run a batch reconciliation to verify our work offline.
We update balances asynchronously (but still in near real time) rather than synchronously, so that every adjustment is derived from recorded events and leaves an audit trail. We tolerate a small delay in updating the user’s balance so that we can prove that the system is functioning correctly and assure our users that we are not misbilling them. When that brief delay causes us to overshoot a user’s credit balance, we automatically refund the overshoot; we choose provable correctness and user trust over strict enforcement.
We decrease the Credit Balance and insert a Balance Update record in a single atomic database transaction. Balance updates are serialized per account, so concurrent requests can never race to spend the same credits. The Balance Update record contains both the debit amount and attribution back to the monetization event that triggered the update; wrapping both writes in a single database transaction guarantees we have an audit trail for every adjustment to the credit balance.
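The idempotent, atomic debit described above can be sketched with SQLite from Python's standard library. The table layout, column names, and the `apply_debit` and `reconcile` helpers are assumptions for this example; the post does not name the database or schema. SQLite allows only one writer at a time, so per-account serialization is not modeled separately here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE credit_balance (
        account_id TEXT PRIMARY KEY,
        credits    INTEGER NOT NULL
    );
    CREATE TABLE balance_update (
        idempotency_key       TEXT PRIMARY KEY,  -- rejects a second row for the same event
        account_id            TEXT NOT NULL,
        amount                INTEGER NOT NULL,
        monetization_event_id TEXT NOT NULL
    );
    INSERT INTO credit_balance VALUES ('acct_1', 100);
""")


def apply_debit(conn, key, account_id, amount, event_id) -> bool:
    """Debit at most once per idempotency key. Returns False on a duplicate delivery."""
    try:
        with conn:  # one transaction: commits on success, rolls back on any exception
            conn.execute(
                "INSERT INTO balance_update VALUES (?, ?, ?, ?)",
                (key, account_id, amount, event_id),
            )
            conn.execute(
                "UPDATE credit_balance SET credits = credits - ? WHERE account_id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The key already exists, so this event was applied earlier; nothing changed.
        return False


def reconcile(conn, account_id, opening_balance) -> bool:
    """Offline check: opening balance minus all recorded debits equals the stored balance."""
    (debited,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM balance_update WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    (current,) = conn.execute(
        "SELECT credits FROM credit_balance WHERE account_id = ?", (account_id,)
    ).fetchone()
    return opening_balance - debited == current


print(apply_debit(conn, "evt-1", "acct_1", 30, "evt-1"))  # True: first delivery
print(apply_debit(conn, "evt-1", "acct_1", 30, "evt-1"))  # False: retry is a no-op
print(reconcile(conn, "acct_1", 100))                     # True
```

Because the audit row and the balance change commit together or not at all, the reconciliation query can rebuild the balance from the audit rows alone.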

All of this rigor supports one objective: to make access simple and safe. When people are creating or coding, they shouldn’t have to wonder whether a request will go through, whether they’ll be overcharged, or whether their balance is accurate. By making usage, billing, and balances provably correct, we give users a system that doesn’t distract from their experience. That’s what lets us replace hard stops with continuous access, and it’s what makes credits usable in the middle of real work, not just on an invoice.

Architecture in service of momentum

The guiding principle behind our approach is protecting user momentum. Every architectural decision maps back to a user-facing outcome: real-time balances prevent unnecessary interruptions, atomic consumption prevents double-charging, and unified access logic ensures predictable behavior. The result is that people can work longer, explore more deeply, and take projects further without facing hard stops or premature plan changes.

When users are engaged, the system should help them continue, not get in the way. Limits and credits disappear into the background.

Building that experience required rethinking access, usage, and billing as a single system and building infrastructure that treats correctness as a first‑class product feature. The same foundation can extend to more products over time; Sora and Codex are just the beginning.
