Scaling PostgreSQL to power 800 million ChatGPT users
For years, PostgreSQL has been one of the most critical, under-the-hood data systems powering core products like ChatGPT and OpenAI’s API. As our user base grows rapidly, the demands on our databases have increased exponentially, too. Over the past year, our PostgreSQL load has grown by more than 10x, and it continues to rise quickly.
Our efforts to advance our production infrastructure to sustain this growth revealed a new insight: PostgreSQL can be scaled to reliably support much larger read-heavy workloads than many previously thought possible. The system (initially created by a team of scientists at the University of California, Berkeley) has enabled us to support massive global traffic with a single primary Azure PostgreSQL flexible server instance and nearly 50 read replicas spread over multiple regions globally. This is the story of how we’ve scaled PostgreSQL at OpenAI to support millions of queries per second for 800 million users through rigorous optimizations and solid engineering; we’ll also cover key takeaways we learned along the way.
Cracks in our initial design
After the launch of ChatGPT, traffic grew at an unprecedented rate. To support it, we rapidly implemented extensive optimizations at both the application and PostgreSQL database layers, scaled up by increasing the instance size, and scaled out by adding more read replicas. This architecture has served us well for a long time. With ongoing improvements, it continues to provide ample runway for future growth.
It may sound surprising that a single-primary architecture can meet the demands of OpenAI’s scale; however, making this work in practice isn’t simple. We’ve seen several SEVs caused by Postgres overload, and they often follow the same pattern: an upstream issue causes a sudden spike in database load, such as widespread cache misses from a caching-layer failure, a surge of expensive multi-way joins saturating CPU, or a write storm from a new feature launch. As resource utilization climbs, query latency rises and requests begin to time out. Retries then further amplify the load, triggering a vicious cycle with the potential to degrade the entire ChatGPT and API services.
Although PostgreSQL scales well for our read-heavy workloads, we still encounter challenges during periods of high write traffic. This is largely due to PostgreSQL’s multiversion concurrency control (MVCC) implementation, which makes it less efficient for write-heavy workloads. For example, when a query updates a tuple or even a single field, the entire row is copied to create a new version. Under heavy write loads, this results in significant write amplification. It also increases read amplification, since queries must scan through multiple tuple versions (dead tuples) to retrieve the latest one. MVCC introduces additional challenges such as table and index bloat, increased index maintenance overhead, and complex autovacuum tuning. (You can find a deep-dive on these issues in a blog I wrote with Prof. Andy Pavlo at Carnegie Mellon University called The Part of PostgreSQL We Hate the Most, cited in the PostgreSQL Wikipedia page.)
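To make the bloat problem concrete, here is a minimal monitoring sketch (not our production tooling) that lists the tables accumulating the most dead tuple versions, which is a quick way to see where autovacuum is falling behind. It assumes a reachable DSN and the psycopg2 driver.

```python
# Minimal sketch: report the tables with the most dead tuples (MVCC bloat).
# Assumes psycopg2 and a DSN with read access to pg_stat_user_tables.
import psycopg2

BLOAT_QUERY = """
    SELECT relname,
           n_live_tup,
           n_dead_tup,
           last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;
"""

def report_dead_tuples(dsn: str) -> None:
    """Print the tables accumulating the most dead tuple versions."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(BLOAT_QUERY)
        for relname, live, dead, last_av in cur.fetchall():
            print(f"{relname}: live={live} dead={dead} last_autovacuum={last_av}")
```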
Scaling PostgreSQL to millions of QPS
To mitigate these limitations and reduce write pressure, we’ve migrated, and continue to migrate, shardable (i.e. workloads that can be horizontally partitioned), write-heavy workloads to sharded systems such as Azure Cosmos DB, optimizing application logic to minimize unnecessary writes. We also no longer allow adding new tables to the current PostgreSQL deployment. New workloads default to the sharded systems.
Even as our infrastructure has evolved, PostgreSQL has remained unsharded, with a single primary instance serving all writes. The primary rationale is that sharding existing application workloads would be highly complex and time-consuming, requiring changes to hundreds of application endpoints and potentially taking months or even years. Since our workloads are primarily read-heavy, and we’ve implemented extensive optimizations, the current architecture still provides ample headroom to support continued traffic growth. While we’re not ruling out sharding PostgreSQL in the future, it’s not a near-term priority given the sufficient runway we have for current and future growth.
In the following sections, we’ll dive into the challenges we faced and the extensive optimizations we implemented to address them and prevent future outages, pushing PostgreSQL to its limits and scaling it to millions of queries per second (QPS).
Reducing load on the primary
Challenge: With only one writer, a single-primary setup can’t scale writes. Heavy write spikes can quickly overload the primary and impact services like ChatGPT and our API.
Solution: We minimize load on the primary as much as possible—both reads and writes—to ensure it has sufficient capacity to handle write spikes. Read traffic is offloaded to replicas wherever possible. However, some read queries must remain on the primary because they’re part of write transactions. For those, we focus on making them efficient and avoiding slow queries. For write traffic, we’ve migrated shardable, write-heavy workloads to sharded systems such as Azure Cosmos DB. Workloads that are harder to shard but still generate high write volume take longer to migrate, and that process is still ongoing. We also aggressively optimized our applications to reduce write load; for example, we’ve fixed application bugs that caused redundant writes and introduced lazy writes, where appropriate, to smooth traffic spikes. In addition, when backfilling table fields, we enforce strict rate limits to prevent excessive write pressure.
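As an illustration of the rate-limited backfill idea, here is a hedged sketch (not our actual job) that fills in a new field in small batches and sleeps between batches so the sustained write rate on the primary stays bounded. The table, column, and rate numbers are invented for the example.

```python
# Sketch of a rate-limited backfill: small batched updates with a capped rate.
import time
import psycopg2

BATCH_SIZE = 500          # rows per transaction (illustrative)
MAX_BATCHES_PER_SEC = 2   # caps the sustained write rate (illustrative)

def backfill(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        while True:
            cur.execute(
                """
                UPDATE conversations
                SET title_normalized = lower(title)
                WHERE id IN (
                    SELECT id FROM conversations
                    WHERE title_normalized IS NULL
                    LIMIT %s
                    FOR UPDATE SKIP LOCKED
                )
                """,
                (BATCH_SIZE,),
            )
            updated = cur.rowcount
            conn.commit()                 # one small transaction per batch
            if updated == 0:
                break                     # backfill complete
            time.sleep(1.0 / MAX_BATCHES_PER_SEC)
    conn.close()
```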
Query optimization
Challenge: We identified several expensive queries in PostgreSQL. In the past, sudden volume spikes in these queries would consume large amounts of CPU, slowing both ChatGPT and API requests.
Solution: A few expensive queries, such as those joining many tables together, can significantly degrade or even bring down the entire service. We need to continuously optimize PostgreSQL queries to ensure they’re efficient and avoid common Online Transaction Processing (OLTP) anti-patterns. For example, we once identified an extremely costly query that joined 12 tables; spikes in this query were responsible for past high-severity SEVs. We avoid complex multi-table joins whenever possible, and when joins are necessary, we’ve learned to consider breaking down the query and moving the complex join logic to the application layer instead. Many of these problematic queries are generated by Object-Relational Mapping frameworks (ORMs), so it’s important to carefully review the SQL they produce and ensure it behaves as expected. It’s also common to find sessions left idle inside open transactions, which block autovacuum; configuring timeouts like idle_in_transaction_session_timeout is essential to prevent this.
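The sketch below illustrates two of these practices under assumed table and column names: session timeouts that keep runaway statements and idle-in-transaction sessions in check, and one wide join replaced by two narrow indexed queries that are stitched together in the application.

```python
# Sketch only: timeouts at connect time, plus an application-layer join.
import psycopg2

def get_conn(dsn: str):
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        # Kill runaway statements and idle-in-transaction sessions early.
        cur.execute("SET statement_timeout = '5s'")
        cur.execute("SET idle_in_transaction_session_timeout = '30s'")
    conn.commit()
    return conn

def conversations_with_owners(conn, user_ids):
    """Two narrow queries instead of one multi-table join."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, owner_id, title FROM conversations WHERE owner_id = ANY(%s)",
            (list(user_ids),),
        )
        convs = cur.fetchall()
        cur.execute(
            "SELECT id, display_name FROM users WHERE id = ANY(%s)",
            (list(user_ids),),
        )
        names = dict(cur.fetchall())
    # Join in the application layer instead of in PostgreSQL.
    return [(cid, names.get(owner), title) for cid, owner, title in convs]
```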
Single point of failure mitigation
Challenge: If a read replica goes down, traffic can still be routed to other replicas. However, relying on a single writer means having a single point of failure—if it goes down, the entire service is affected.
Solution: Most critical requests only involve read queries. To mitigate the single point of failure in the primary, we offloaded those reads from the writer to replicas, ensuring those requests can continue to be served even if the primary goes down. While write operations would still fail, the impact is reduced; it’s no longer a SEV0 since reads remain available.
To mitigate primary failures, we run the primary in High-Availability (HA) mode with a hot standby, a continuously synchronized replica that is always ready to take over serving traffic. If the primary goes down or needs to be taken offline for maintenance, we can quickly promote the standby to minimize downtime. The Azure PostgreSQL team has done significant work to ensure these failovers remain safe and reliable even under very high load. To handle read replica failures, we deploy multiple replicas in each region with sufficient capacity headroom, ensuring that a single replica failure doesn’t lead to a regional outage.
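A simplified sketch of the failure-isolation idea follows; the DSNs and routing helpers are placeholders rather than our actual topology. Reads always go to replica connections, so a primary outage only fails the write path.

```python
# Sketch: reads are served from replicas, writes fail fast if the primary is down.
import random
import psycopg2

PRIMARY_DSN = "postgresql://primary.internal/app"        # hypothetical
REPLICA_DSNS = [
    "postgresql://replica-eastus.internal/app",           # hypothetical
    "postgresql://replica-westus.internal/app",
]

def run_read(sql: str, params=()):
    """Reads go to a replica, so they keep working if the primary is down."""
    dsn = random.choice(REPLICA_DSNS)
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()

def run_write(sql: str, params=()):
    """Writes go to the single primary and fail quickly if it is unavailable."""
    with psycopg2.connect(PRIMARY_DSN, connect_timeout=2) as conn:
        with conn.cursor() as cur:
            cur.execute(sql, params)
```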
Workload isolation
Challenge: We often encounter situations where certain requests consume a disproportionate amount of resources on PostgreSQL instances. This can lead to degraded performance for other workloads running on the same instances. For example, a new feature launch can introduce inefficient queries that heavily consume PostgreSQL CPU, slowing down requests for other critical features.
Solution: To mitigate the “noisy neighbor” problem, we isolate workloads onto dedicated instances to ensure that sudden spikes in resource-intensive requests don’t impact other traffic. Specifically, we split requests into low-priority and high-priority tiers and route them to separate instances. This way, even if a low-priority workload becomes resource-intensive, it won’t degrade the performance of high-priority requests. We apply the same strategy across different products and services as well, so that activity from one product does not affect the performance or reliability of another.
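The routing itself can be as simple as a static mapping from workload tier to a dedicated instance endpoint, as in this illustrative sketch (the products and endpoint names are made up):

```python
# Sketch: each (product, priority) tier gets its own dedicated instance endpoint.
from enum import Enum

class Priority(Enum):
    HIGH = "high"
    LOW = "low"

# Hypothetical dedicated endpoints, one per workload tier.
ENDPOINTS = {
    ("chatgpt", Priority.HIGH): "postgresql://pg-chatgpt-high.internal/app",
    ("chatgpt", Priority.LOW):  "postgresql://pg-chatgpt-low.internal/app",
    ("api",     Priority.HIGH): "postgresql://pg-api-high.internal/app",
    ("api",     Priority.LOW):  "postgresql://pg-api-low.internal/app",
}

def dsn_for(product: str, priority: Priority) -> str:
    """Pick the dedicated instance for this workload tier."""
    return ENDPOINTS[(product, priority)]
```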
Connection pooling
Challenge: Each instance has a maximum connection limit (5,000 in Azure PostgreSQL). It’s easy to run out of connections or accumulate too many idle ones. We’ve previously had incidents caused by connection storms that exhausted all available connections.
Solution: We deployed PgBouncer as a proxy layer to pool database connections. Running it in statement or transaction pooling mode allows us to efficiently reuse connections, greatly reducing the number of active client connections. This also cuts connection setup latency: in our benchmarks, the average connection time dropped from 50 milliseconds (ms) to 5 ms. Inter-region connections and requests can be expensive, so we co-locate the proxy, clients, and replicas in the same region to minimize network overhead and connection use time. Moreover, PgBouncer must be configured carefully. Settings like idle timeouts are critical to prevent connection exhaustion.
Each read replica has its own Kubernetes deployment running multiple PgBouncer pods. We run multiple Kubernetes deployments behind the same Kubernetes Service, which load-balances traffic across pods.
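From the application side, the change is mostly transparent: clients connect to the PgBouncer endpoint instead of PostgreSQL directly. The hedged sketch below (hypothetical DSN, psycopg2 driver) also shows the main caveat of transaction pooling: session-level state should be avoided, so settings are applied with SET LOCAL inside the transaction that borrows the server connection.

```python
# Sketch: connect through PgBouncer; keep settings transaction-scoped.
import psycopg2

# Hypothetical PgBouncer endpoint (6432 is PgBouncer's conventional port).
PGBOUNCER_DSN = "host=pgbouncer.internal port=6432 dbname=app user=chatgpt"

def fetch_recent_messages(conversation_id: int):
    conn = psycopg2.connect(PGBOUNCER_DSN)
    try:
        with conn:                        # one transaction == one pooled server connection
            with conn.cursor() as cur:
                cur.execute("SET LOCAL statement_timeout = '2s'")  # transaction-scoped
                cur.execute(
                    "SELECT id, body FROM messages WHERE conversation_id = %s "
                    "ORDER BY created_at DESC LIMIT 50",
                    (conversation_id,),
                )
                return cur.fetchall()
    finally:
        conn.close()
```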
Caching
Challenge: A sudden spike in cache misses can trigger a surge of reads on the PostgreSQL database, saturating CPU and slowing user requests.
Solution: To reduce read pressure on PostgreSQL, we use a caching layer to serve most of the read traffic. However, when cache hit rates drop unexpectedly, the burst of cache misses can push a large volume of requests directly to PostgreSQL. This sudden increase in database reads consumes significant resources, slowing down the service. To prevent overload during cache-miss storms, we implement a cache locking (and leasing) mechanism so that only a single reader that misses on a particular key fetches the data from PostgreSQL. When multiple requests miss on the same cache key, only one request acquires the lock and proceeds to retrieve the data and repopulate the cache. All other requests wait for the cache to be updated rather than all hitting PostgreSQL at once. This significantly reduces redundant database reads and protects the system from cascading load spikes.
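Here is a minimal sketch of the cache lock/lease pattern, assuming a Redis-style cache and the redis-py client (the post does not name our cache, so these are assumptions): only the request that wins the lease key queries PostgreSQL and repopulates the cache, while the others briefly poll the cache instead of stampeding the database.

```python
# Sketch: cache lock/lease to prevent a cache-miss stampede on PostgreSQL.
import json
import time
import redis

r = redis.Redis(host="cache.internal")    # hypothetical cache endpoint

def get_with_lease(key: str, load_from_postgres, ttl: int = 300):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    lease_key = f"lease:{key}"
    # SET NX acts as the lease: only one misser gets to hit the database.
    if r.set(lease_key, "1", nx=True, ex=10):
        try:
            value = load_from_postgres()
            r.set(key, json.dumps(value), ex=ttl)
            return value
        finally:
            r.delete(lease_key)

    # Everyone else waits for the winner to repopulate the cache.
    for _ in range(50):                    # ~5 seconds worst case
        time.sleep(0.1)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    return load_from_postgres()            # fall back rather than fail the request
```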
Scaling read replicas
Challenge: The primary streams Write Ahead Log (WAL) data to every read replica. As the number of replicas increases, the primary must ship WAL to more instances, increasing pressure on both network bandwidth and CPU. This causes higher and more unstable replica lag, which makes the system harder to scale reliably.
Solution: We operate nearly 50 read replicas across multiple geographic regions to minimize latency. However, with the current architecture, the primary must stream WAL to every replica. Although it currently scales well with very large instance types and high-network bandwidth, we can’t keep adding replicas indefinitely without eventually overloading the primary. To address this, we’re collaborating with the Azure PostgreSQL team on cascading replication, where intermediate replicas relay WAL to downstream replicas. This approach allows us to scale to potentially over a hundred replicas without overwhelming the primary. However, it also introduces additional operational complexity, particularly around failover management. The feature is still in testing; we’ll ensure it’s robust and can fail over safely before rolling it out to production.
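Replica lag is something we watch closely as we add replicas. Below is a hedged sketch of how it can be observed from the primary using the standard pg_stat_replication view; thresholds and alerting are omitted.

```python
# Sketch: report per-replica replay lag from the primary.
import psycopg2

LAG_QUERY = """
    SELECT application_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
           replay_lag
    FROM pg_stat_replication
    ORDER BY replay_lag_bytes DESC;
"""

def report_replica_lag(primary_dsn: str) -> None:
    with psycopg2.connect(primary_dsn) as conn, conn.cursor() as cur:
        cur.execute(LAG_QUERY)
        for name, lag_bytes, lag_time in cur.fetchall():
            print(f"{name}: {lag_bytes} bytes behind, replay_lag={lag_time}")
```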
Rate limiting
Challenge: A sudden traffic spike on specific endpoints, a surge of expensive queries, or a retry storm can quickly exhaust critical resources such as CPU, I/O, and connections, which causes widespread service degradation.
Solution: We implemented rate-limiting across multiple layers—application, connection pooler, proxy, and query—to prevent sudden traffic spikes from overwhelming database instances and triggering cascading failures. It’s also crucial to avoid overly short retry intervals, which can trigger retry storms. We also enhanced the ORM layer to support rate limiting and, when necessary, to fully block specific query digests. This targeted form of load shedding enables rapid recovery from sudden surges of expensive queries.
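As an illustration (not our actual ORM hook), per-digest load shedding can be as simple as a token bucket keyed by a normalized-query digest, plus a blocklist for digests that must be cut off entirely during an incident:

```python
# Sketch: per-query-digest token bucket with a hard blocklist (single-process).
import time
import hashlib

BLOCKED_DIGESTS: set[str] = set()               # populated during incidents
RATE_PER_SEC = 100                              # illustrative per-digest budget
_buckets: dict[str, tuple[float, float]] = {}   # digest -> (tokens, last_refill)

def digest(normalized_sql: str) -> str:
    return hashlib.sha256(normalized_sql.encode()).hexdigest()[:16]

def allow_query(normalized_sql: str) -> bool:
    d = digest(normalized_sql)
    if d in BLOCKED_DIGESTS:
        return False                            # hard block for known-bad query shapes
    tokens, last = _buckets.get(d, (RATE_PER_SEC, time.monotonic()))
    now = time.monotonic()
    tokens = min(RATE_PER_SEC, tokens + (now - last) * RATE_PER_SEC)
    if tokens < 1:
        _buckets[d] = (tokens, now)
        return False                            # shed load instead of hitting PostgreSQL
    _buckets[d] = (tokens - 1, now)
    return True
```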
Schema management
Challenge: Even a small schema change, such as altering a column type, can trigger a full table rewrite. We therefore apply schema changes cautiously—limiting them to lightweight operations and avoiding any that rewrite entire tables.
Solution: Only lightweight schema changes are permitted, such as adding or removing certain columns that do not trigger a full table rewrite. We enforce a strict 5-second timeout on schema changes. Creating and dropping indexes concurrently is allowed. Schema changes are restricted to existing tables. If a new feature requires additional tables, they must be created in sharded systems such as Azure Cosmos DB rather than PostgreSQL. When backfilling a table field, we apply strict rate limits to prevent write spikes. Although this process can sometimes take over a week, it ensures stability and avoids any production impact.
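The sketch below illustrates the kind of change this policy allows, under hypothetical table and index names: a short lock timeout (mirroring the 5-second limit described above) so a blocked ALTER gives up instead of queueing behind traffic, a nullable column addition that is metadata-only, and a concurrent index build, which must run outside a transaction block.

```python
# Sketch: a lightweight, low-risk schema change with a short lock timeout.
import psycopg2

def add_column_and_index(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True                # CREATE INDEX CONCURRENTLY can't run in a transaction
    with conn.cursor() as cur:
        cur.execute("SET lock_timeout = '5s'")
        # Adding a nullable column with no default is metadata-only: no table rewrite.
        cur.execute("ALTER TABLE conversations ADD COLUMN archived_at timestamptz")
        # Concurrent index build avoids blocking reads and writes on the table.
        cur.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_conversations_archived_at "
            "ON conversations (archived_at)"
        )
    conn.close()
```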
Results and the road ahead
This effort demonstrates that with the right design and optimizations, Azure PostgreSQL can be scaled to handle the largest production workloads. PostgreSQL handles millions of QPS for read-heavy workloads, powering OpenAI’s most critical products like ChatGPT and the API platform. We added nearly 50 read replicas, while keeping replication lag near zero, maintained low-latency reads across geo-distributed regions, and built sufficient capacity headroom to support future growth.
This scaling works while still minimizing latency and improving reliability. We consistently deliver low double-digit millisecond p99 client-side latency and five-nines availability in production. And over the past 12 months, we’ve had only one SEV-0 PostgreSQL incident; it occurred during the viral launch of ChatGPT ImageGen, when write traffic suddenly surged by more than 10x as over 100 million new users signed up within a week.
While we’re happy with how far PostgreSQL has taken us, we continue to push its limits to ensure we have sufficient runway for future growth. We’ve already migrated the shardable write-heavy workloads to our sharded systems, such as Azure Cosmos DB. The remaining write-heavy workloads are more challenging to shard—we’re actively migrating those as well to further offload writes from the PostgreSQL primary. We’re also working with Azure to enable cascading replication so we can safely scale to significantly more read replicas.
Looking ahead, we’ll continue to explore additional approaches to further scale, including sharded PostgreSQL or alternative distributed systems, as our infrastructure demands continue to grow.