How we built OWL, the new architecture behind our ChatGPT-ba…

OpenAI News



By Ken Rockot, Member of the Technical Staff and Ben Goodger, Head of Engineering, ChatGPT Atlas


Last week, we launched ChatGPT Atlas, a new way to browse the web with ChatGPT by your side. In addition to being a full-featured web browser, Atlas offers a glimpse into the future: a world where you can bring ChatGPT with you across the internet to ask questions, make suggestions, and complete tasks for you. In this post, we unpack one of the most complex engineering aspects of the product: how we turned ChatGPT into a browser that gets more useful as you go.


Making ChatGPT a true co-pilot for the web meant reimagining the entire architecture of a browser: separating Atlas from the Chromium runtime. This entailed developing a new way of integrating Chromium that allows us to deliver on our product goals: instant startup, responsiveness even as you open more tabs, and a strong foundation for agentic use cases.


Shaping the foundation












Chromium was a natural building block. It provides a state-of-the-art web engine with a robust security model, established performance credentials, and peerless web compatibility. Furthermore, it’s developed by a global community that continuously improves it. It’s a common go-to for modern desktop web browsers.


Rethinking the browser experience




Our talented design team had ambitious goals for our user experience, including rich animations and visual effects for features like Agent mode. This required our engineering team to leverage the most modern native frameworks for our UI (SwiftUI, AppKit and Metal), instead of simply reskinning the open source Chromium UX. As a result, Atlas’ UI is a comprehensive rebuild of the entire application UX.


We also had other product goals, like fast startup times and support for hundreds of tabs without a performance penalty. These goals were challenging to achieve with Chromium out of the box, which is opinionated about many details, from the boot sequence to the threading model and tab model. We considered making substantial changes here, but we wanted to keep our set of patches against Chromium targeted so we could quickly integrate new versions. To keep our development velocity high, we needed to come up with a different way to integrate and drive the Chromium runtime.


A litmus test for our technical investment was that it would not only enable faster experimentation, iteration, and delivery of new features, but also let us maintain a core part of OpenAI’s engineering culture: shipping on day one. Every new engineer makes and merges a small change in the afternoon of their first day. We needed to make sure this was possible even though Chromium can take hours to check out and build.


Our solution: OWL




Our answer to these challenges was to build a new architectural layer we call OWL: OpenAI’s Web Layer. OWL is our integration of Chromium, which entails running Chromium’s browser process outside of the main Atlas app process.










Think of it like this: Chromium revolutionized browsers by moving tabs into separate processes. We’re taking that idea further by moving Chromium itself out of the main application process and into an isolated service layer. This shift unlocks a cascade of benefits:


  • A simpler, modern app: Atlas is built almost entirely in SwiftUI and AppKit. One language, one tech stack, one clean codebase.
  • Faster startup: Chromium boots asynchronously in the background. Atlas doesn’t wait — pixels hit the screen nearly instantly.
  • Isolation from jank and crashes: Chromium is a powerful and complex web engine. If its main thread hangs, Atlas doesn’t. If it crashes, Atlas stays up.
  • Fewer merge headaches: Because we’re not building on as much of the Chromium open source UI, our diff against upstream Chromium is much smaller and easier to maintain.
  • Faster iteration: Most engineers never need to build Chromium locally. OWL ships internally as a prebuilt binary, so Atlas builds take minutes, not hours.

Because most engineers on our team aren’t regularly building Chromium from source, development can go much faster—even new team members can merge simple changes on their first afternoon.
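
The faster-startup benefit in the list above depends on the UI never blocking on the host. As a rough illustration only, here is a minimal TypeScript sketch of that idea. The `LazyHostConnection` class and its methods are hypothetical names invented for this example, not part of OWL: calls made before the host has booted are queued and replayed once it reports ready.

```typescript
// Hypothetical sketch: none of these names come from OWL itself.
type HostCall = () => void;

class LazyHostConnection {
  private ready = false;
  private pending: HostCall[] = [];

  // The UI may call this at any time; work is deferred until the host is up.
  run(call: HostCall): void {
    if (this.ready) {
      call();
    } else {
      this.pending.push(call);
    }
  }

  // Invoked once the background host process reports that it has booted.
  markReady(): void {
    this.ready = true;
    const queued = this.pending;
    this.pending = [];
    for (const call of queued) call();
  }

  get queuedCount(): number {
    return this.pending.length;
  }
}

const startupLog: string[] = [];
const hostConnection = new LazyHostConnection();

startupLog.push("window shown"); // the UI paints before the host exists
hostConnection.run(() => { startupLog.push("navigate"); });
const queuedBeforeReady = hostConnection.queuedCount;

hostConnection.markReady(); // host finished booting, queued work is replayed
hostConnection.run(() => { startupLog.push("reload"); });
```

In this sketch the window is recorded as shown before any host work runs, which mirrors the ordering the bullet describes.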


How OWL works




At a high level, the Atlas browser is the OWL Client, and the Chromium browser process is the OWL Host. They communicate over IPC, specifically Mojo, Chromium’s own message-passing system. We wrote custom Swift (and even TypeScript) bindings for Mojo, so our Swift app can call host-side interfaces directly.
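
To make the client/host split more concrete, the following TypeScript sketch shows the general shape of a typed message-passing proxy. It is an assumption-laden illustration, not Mojo's real API: `InMemoryPipe`, `WireMessage`, and `NavigationService` are hypothetical names, the pipe is an in-process stand-in for IPC, and calls are synchronous here to keep the example short, whereas Mojo calls are asynchronous.

```typescript
// Hypothetical sketch of a typed message-passing proxy; not Mojo's actual API.
interface WireMessage {
  method: string;
  args: unknown[];
}

// In-memory stand-in for an IPC pipe. Messages travel as JSON text so that
// nothing is shared by reference, as would be the case between two processes.
class InMemoryPipe {
  private receiver: ((wire: string) => string) | null = null;

  bind(receiver: (wire: string) => string): void {
    this.receiver = receiver;
  }

  send(wire: string): string {
    if (!this.receiver) throw new Error("pipe has no receiver");
    return this.receiver(wire);
  }
}

// The interface both sides agree on (in Mojo, a .mojom file plays this role).
interface NavigationService {
  navigate(url: string): string;
}

// Host side: decode each message and dispatch it to the implementation.
function bindHost(pipe: InMemoryPipe, impl: NavigationService): void {
  pipe.bind((wire) => {
    const msg = JSON.parse(wire) as WireMessage;
    if (msg.method === "navigate") {
      return JSON.stringify(impl.navigate(String(msg.args[0])));
    }
    throw new Error(`unknown method: ${msg.method}`);
  });
}

// Client side: looks like a local object, but only ever sends messages.
function makeClientProxy(pipe: InMemoryPipe): NavigationService {
  return {
    navigate(url: string): string {
      const msg: WireMessage = { method: "navigate", args: [url] };
      return JSON.parse(pipe.send(JSON.stringify(msg))) as string;
    },
  };
}

const demoPipe = new InMemoryPipe();
bindHost(demoPipe, { navigate: (url) => `loaded ${url}` });
const navClient = makeClientProxy(demoPipe);
const navigationResult = navClient.navigate("https://example.com");
```

The point of generated bindings is that application code only sees something like `navClient`, while serialization and dispatch stay hidden behind the interface.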


The OWL client library exposes a simple public Swift API, which abstracts several key concepts exposed by the host’s service layer:


  • Session: Configure and control the host globally
  • Profile: Manage browser state for a specific user profile
  • WebView: Control and embed individual web contents (e.g. render, input, navigate, zoom)
  • WebContentRenderer: Forward input events into Chromium’s rendering pipeline and receive feedback from the renderer
  • LayerHost/Client: Exchange compositing information between the UI and Chromium









There’s also a wide range of service endpoints for managing high-level features like bookmarks, downloads, extensions, and autofill.
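
One way to picture how the first three concepts might relate is sketched below in TypeScript. The class names, method names, and the ownership chain (a session hands out profiles, a profile creates web views) are all assumptions made for illustration; the post does not describe the actual Swift signatures.

```typescript
// Hypothetical shapes only; names and ownership are assumptions.
class SketchWebView {
  readonly profileName: string;
  url = "about:blank";

  constructor(profileName: string) {
    this.profileName = profileName;
  }

  navigate(url: string): void {
    this.url = url;
  }
}

class SketchProfile {
  readonly name: string;
  readonly webViews: SketchWebView[] = [];

  constructor(name: string) {
    this.name = name;
  }

  // Every web view created here is tied to this profile's browser state.
  createWebView(): SketchWebView {
    const view = new SketchWebView(this.name);
    this.webViews.push(view);
    return view;
  }
}

class SketchSession {
  private profiles = new Map<string, SketchProfile>();

  // Returns the existing profile for a name, creating it on first use.
  profile(name: string): SketchProfile {
    let found = this.profiles.get(name);
    if (!found) {
      found = new SketchProfile(name);
      this.profiles.set(name, found);
    }
    return found;
  }
}

const sketchSession = new SketchSession();
const sketchTab = sketchSession.profile("default").createWebView();
sketchTab.navigate("https://example.com");
```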


Rendering: Getting pixels across the process boundary



WebViews, which share a mutually exclusive presentation space in the client app, are swapped in and out of a shared compositing container. For example, a browser window often has a single shared container visible, and selecting a tab in the tab strip swaps that tab’s WebView into the container. On the Chromium side, this container corresponds to a gfx::AcceleratedWidget, which is ultimately backed by a CALayer. We expose that layer’s context ID to the client, where an NSView embeds it using the private CALayerHost API.
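
The swapping behavior can be modeled in a few lines. In this TypeScript sketch, `CompositingContainer` and `TabStrip` are hypothetical names, and `contextId` is a plain number standing in for the layer context ID mentioned above. The property worth noticing is that the container's context ID stays fixed while the presented web view changes.

```typescript
// Hypothetical sketch; contextId stands in for the layer's context ID.
class CompositingContainer {
  readonly contextId: number;
  private presentedId: string | null = null;

  constructor(contextId: number) {
    this.contextId = contextId;
  }

  // Swap a web view in; returns whichever one was swapped out, if any.
  present(webViewId: string): string | null {
    const previous = this.presentedId;
    this.presentedId = webViewId;
    return previous;
  }

  get presented(): string | null {
    return this.presentedId;
  }
}

class TabStrip {
  private container: CompositingContainer;
  private tabs: string[] = [];

  constructor(container: CompositingContainer) {
    this.container = container;
  }

  addTab(id: string): void {
    this.tabs.push(id);
  }

  // Selecting a tab swaps its web view into the shared container.
  select(id: string): void {
    if (this.tabs.indexOf(id) === -1) throw new Error(`unknown tab: ${id}`);
    this.container.present(id);
  }
}

const windowContainer = new CompositingContainer(42);
const tabStrip = new TabStrip(windowContainer);
tabStrip.addTab("tab-a");
tabStrip.addTab("tab-b");
tabStrip.select("tab-a");
tabStrip.select("tab-b");
```

Because the native view embeds the container rather than any one tab, a tab switch in this model never requires re-embedding.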










Special cases like <select> dropdowns or color pickers, which Chromium renders in separate popup widgets, use the same approach. They don’t have a content::WebContents, but they do have a content::RenderWidgetHostView with their own gfx::AcceleratedWidget, so the same delegated rendering model applies.


OWL internally keeps view geometry in sync with the Chromium side, so the GPU compositor can be updated accordingly and can always produce layer contents of the correct size and device scale.
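
A small TypeScript sketch of what keeping geometry in sync could involve: converting a view size in points to pixels using the device scale, and only notifying the other side when something changed. The names are hypothetical, and rounding up with `Math.ceil` is an assumption of this example rather than a statement about what OWL or Chromium does.

```typescript
// Hypothetical sketch; names and the rounding rule are assumptions.
interface ViewGeometry {
  widthPoints: number;
  heightPoints: number;
  deviceScale: number;
}

interface PixelSize {
  width: number;
  height: number;
}

// Convert a logical size to the pixel size the compositor should produce.
function toPixelSize(g: ViewGeometry): PixelSize {
  return {
    width: Math.ceil(g.widthPoints * g.deviceScale),
    height: Math.ceil(g.heightPoints * g.deviceScale),
  };
}

class GeometrySync {
  private last: ViewGeometry | null = null;
  sentCount = 0;

  // Returns true if an update was sent, false if nothing changed.
  update(g: ViewGeometry): boolean {
    const prev = this.last;
    if (
      prev &&
      prev.widthPoints === g.widthPoints &&
      prev.heightPoints === g.heightPoints &&
      prev.deviceScale === g.deviceScale
    ) {
      return false;
    }
    this.last = { ...g };
    this.sentCount += 1;
    return true;
  }
}

const retinaSize = toPixelSize({ widthPoints: 800, heightPoints: 600, deviceScale: 2 });

const geometrySync = new GeometrySync();
const firstSent = geometrySync.update({ widthPoints: 800, heightPoints: 600, deviceScale: 2 });
const repeatSent = geometrySync.update({ widthPoints: 800, heightPoints: 600, deviceScale: 2 });
const scaleChangeSent = geometrySync.update({ widthPoints: 800, heightPoints: 600, deviceScale: 1 });
```

The last call models a window moving to a display with a different scale: the size in points is unchanged, but the update must still be sent.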


We also reuse this technique to selectively project elements of Chromium’s own native Views UI into Atlas (this is also useful for bootstrapping features like permission prompts quickly without building replacements from scratch in SwiftUI). This technique borrows heavily from Chromium’s existing infrastructure for installable web apps on macOS.


Input events: Cracking and forwarding



Chromium UI translates platform events (like macOS NSEvents) into Blink’s WebInputEvent model before forwarding them to renderers. But since OWL runs Chromium in a hidden process, we do that translation ourselves within the Swift client library and forward already-translated events down to Chromium.










From there, they follow the same lifecycle that real input events would normally follow for web content. This includes returning events to the client whenever a page indicates that it didn’t handle them. When this happens, we re-synthesize an NSEvent and give the rest of the app a chance to handle the input.
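
The round trip described in this section (translate, forward, and hand unhandled events back to the app) can be sketched as follows. This TypeScript example uses simplified, hypothetical event shapes; `PlatformKeyEvent` and `WebKeyEvent` are stand-ins for NSEvent and Blink's WebInputEvent, not their real fields.

```typescript
// Hypothetical, simplified event shapes standing in for the real types.
interface PlatformKeyEvent {
  keyCode: number;
  characters: string;
  command: boolean;
}

interface WebKeyEvent {
  type: "keyDown";
  key: string;
  nativeKeyCode: number;
  metaKey: boolean;
}

// Client-side translation into the web event model.
function toWebEvent(e: PlatformKeyEvent): WebKeyEvent {
  return { type: "keyDown", key: e.characters, nativeKeyCode: e.keyCode, metaKey: e.command };
}

// Rebuild a platform event from a web event the page did not handle.
function toPlatformEvent(e: WebKeyEvent): PlatformKeyEvent {
  return { keyCode: e.nativeKeyCode, characters: e.key, command: e.metaKey };
}

type PageHandler = (e: WebKeyEvent) => boolean; // true means the page consumed it
type AppHandler = (e: PlatformKeyEvent) => void;

function dispatchKey(e: PlatformKeyEvent, page: PageHandler, app: AppHandler): "page" | "app" {
  const webEvent = toWebEvent(e);
  if (page(webEvent)) return "page";
  app(toPlatformEvent(webEvent)); // unhandled: the rest of the app gets a turn
  return "app";
}

const appReceived: PlatformKeyEvent[] = [];
const pageHandlesPlainKeys: PageHandler = (e) => !e.metaKey;
const recordInApp: AppHandler = (e) => { appReceived.push(e); };

const plainKeyResult = dispatchKey({ keyCode: 0, characters: "a", command: false }, pageHandlesPlainKeys, recordInApp);
const shortcutResult = dispatchKey({ keyCode: 17, characters: "t", command: true }, pageHandlesPlainKeys, recordInApp);
```

Here the plain keystroke is consumed by the page, while the Command-modified key comes back and is delivered to the app handler as a rebuilt platform event.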


Agent mode: Special cases



Atlas’ agentic browsing feature poses some unique challenges for our approaches to rendering, input event forwarding, and data storage.


Our computer use model expects a single image of the screen as input. But some UI elements, like <select> dropdowns, render outside the tab’s bounds in separate windows. In agent mode, we composite those popups back into the main page image at the correct coordinates so the model sees the full context in one frame.
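
As an illustration of compositing a popup back into the page image, here is a minimal TypeScript sketch that works on tiny grayscale buffers. `GrayImage` and `compositePopup` are hypothetical names, and clipping the popup to the page bounds is an assumption of this example; the post does not say how out-of-bounds regions are treated.

```typescript
// Hypothetical sketch using row-major grayscale buffers.
interface GrayImage {
  width: number;
  height: number;
  pixels: number[];
}

// Returns a new image with the popup drawn at (originX, originY) in page
// coordinates. The input page image is left untouched.
function compositePopup(page: GrayImage, popup: GrayImage, originX: number, originY: number): GrayImage {
  const out: GrayImage = { width: page.width, height: page.height, pixels: page.pixels.slice() };
  for (let y = 0; y < popup.height; y++) {
    for (let x = 0; x < popup.width; x++) {
      const px = originX + x;
      const py = originY + y;
      // Assumption: anything falling outside the page image is clipped.
      if (px < 0 || py < 0 || px >= page.width || py >= page.height) continue;
      out.pixels[py * page.width + px] = popup.pixels[y * popup.width + x];
    }
  }
  return out;
}

const blankPage: GrayImage = { width: 4, height: 4, pixels: new Array<number>(16).fill(0) };
const dropdown: GrayImage = { width: 2, height: 2, pixels: [9, 9, 9, 9] };
const singleFrame = compositePopup(blankPage, dropdown, 1, 2);
```

The resulting `singleFrame` contains both the page and the popup, which is the single-image input the model expects.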


For input, we apply the same principle: agent-generated events are routed directly to the renderer, never through the privileged browser layer. That preserves the sandbox boundary even under automated control. For example, we don’t want this class of events to synthesize keyboard shortcuts that make the browser do things unrelated to the web content being shown.
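
One way such a policy might be expressed is shown in the TypeScript sketch below. The names and the exact dispatch order are assumptions for illustration, not Atlas code: a key chord the page leaves unhandled may trigger a browser shortcut when it came from the user, but is simply dropped when it came from the agent.

```typescript
// Hypothetical sketch of the routing policy; not Atlas code.
type InputOrigin = "user" | "agent";

interface KeyChord {
  origin: InputOrigin;
  chord: string;
}

type ChordOutcome = "page" | "browser-shortcut" | "dropped";

function dispatchChord(
  input: KeyChord,
  pageHandled: boolean,
  shortcuts: Map<string, () => void>
): ChordOutcome {
  if (pageHandled) return "page";
  // Agent-generated input never reaches the privileged shortcut table.
  if (input.origin === "agent") return "dropped";
  const action = shortcuts.get(input.chord);
  if (action) {
    action();
    return "browser-shortcut";
  }
  return "dropped";
}

let newTabsOpened = 0;
const browserShortcuts = new Map<string, () => void>([
  ["Cmd+T", () => { newTabsOpened += 1; }],
]);

const userOutcome = dispatchChord({ origin: "user", chord: "Cmd+T" }, false, browserShortcuts);
const agentOutcome = dispatchChord({ origin: "agent", chord: "Cmd+T" }, false, browserShortcuts);
```

The same chord opens a tab for the user and does nothing for the agent, which is the distinction the paragraph above draws.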


Agent browsing can also run in an ephemeral "logged-out" context. Instead of sharing the user’s existing Incognito profile, which could leak state, we use Chromium’s StoragePartition infrastructure to spin up isolated, in-memory stores. Each agent session starts fresh, and when it ends, all cookies and site data are discarded. You can run multiple "logged-out" agent sessions, each one in its own browser tab, and each fully isolated from the others.
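
The isolation property can be modeled with a few lines of TypeScript. This sketch does not use or imitate Chromium's StoragePartition API; `EphemeralStore` and `AgentSessions` are hypothetical names chosen to show the lifecycle: each session gets a fresh in-memory store, stores are never shared, and ending a session discards its data.

```typescript
// Hypothetical sketch of per-session in-memory stores; not Chromium's API.
class EphemeralStore {
  private cookies = new Map<string, string>();

  setCookie(name: string, value: string): void {
    this.cookies.set(name, value);
  }

  getCookie(name: string): string | undefined {
    return this.cookies.get(name);
  }
}

class AgentSessions {
  private stores = new Map<number, EphemeralStore>();
  private nextId = 1;

  // Every session starts with a brand-new, empty store.
  start(): number {
    const id = this.nextId++;
    this.stores.set(id, new EphemeralStore());
    return id;
  }

  store(id: number): EphemeralStore {
    const found = this.stores.get(id);
    if (!found) throw new Error(`no such session: ${id}`);
    return found;
  }

  // Ending a session drops its store, and with it all cookies and site data.
  end(id: number): void {
    this.stores.delete(id);
  }

  get activeCount(): number {
    return this.stores.size;
  }
}

const agentSessions = new AgentSessions();
const firstSession = agentSessions.start();
const secondSession = agentSessions.start();

agentSessions.store(firstSession).setCookie("sid", "abc");
const seenBySecond = agentSessions.store(secondSession).getCookie("sid");

agentSessions.end(firstSession);
```

In this model `seenBySecond` is `undefined`, because the second session never sees state written by the first.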


A new way to use the web




None of this would be possible without the global Chromium community and their incredible work building a foundation for the modern web. OWL builds on that foundation in a new way: decoupling the engine from the app, blending a world-class web platform with modern native frameworks, and unlocking a faster, more flexible architecture.


By rethinking how a browser holds Chromium, we’re creating space for new kinds of experiences: smoother startups, richer UI, tighter integration with the rest of the OS, and a development loop that moves at the speed of ideas. If that sounds like your kind of challenge, check out our openings to work on Atlas, such as Software Engineer, Atlas; Software Engineer, iOS; and more.


Try Atlas at chatgpt.com/atlas.


