How We Used Codex to Ship Sora for Android in 28 Days

OpenAI News

In November, we launched the Sora Android app to the world, giving anyone with an Android device the ability to turn a short prompt into a vivid video. On launch day, the app reached #1 in the Play Store. Android users generated more than a million videos in the first 24 hours.


Behind the launch is a story: the initial version of Sora’s production Android app was built in 28 days, thanks to the same agent that’s available to any team or developer: Codex.


From October 8 to November 5, 2025, a lean engineering team, working alongside Codex and consuming roughly 5 billion tokens, shipped Sora for Android from prototype to global launch. Despite its scale, the app has a crash-free rate of 99.9 percent and an architecture we’re proud of. If you’re wondering whether we used a secret model, we used an early version of the GPT‑5.1-Codex model – the same version that any developer or business can use today via CLI, IDE extension, or web app.







Prompt: figure skater performs a triple axle with a cat on her head








Embracing Brooks’ Law: Staying nimble to move fast



When Sora launched on iOS, usage exploded. People immediately began generating a stream of videos. On Android, by contrast, we had only a small internal prototype and a mounting number of pre-registered users on Google Play.


A common response to a high-stakes, time-pressured launch is to pile on resources and add process. A production app of this scope and quality would typically involve many engineers working for months, slowed down by coordination.


American computer architect Fred Brooks famously warned that “adding more people to a late software project makes it later.” In other words, when trying to ship a complex project quickly, adding more engineers can often reduce efficiency by increasing communication overhead, task fragmentation, and integration costs. We leaned into this insight instead of ignoring it: we assembled a strong team of four engineers – all equipped with Codex to drastically increase each engineer’s impact.


Working this way, we shipped an internal build of Sora for Android to employees in 18 days and launched publicly 10 days later. We maintained a high bar on Android engineering practices, invested in maintainability, and held the app to the same reliability bar we would expect from a more traditional project. (We also continue to use Codex extensively today to evolve and bring new features to the app.)


Onboarding a new senior engineer



To make sense of how we worked with Codex, it helps to know where it shines and where it needs direction. Treating it like a newly hired senior engineer was a good approach. Codex’s capabilities meant we could spend more time directing and reviewing code than writing it ourselves.


Where Codex needs guidance


  1. Codex isn’t yet great at inferring what it hasn’t been told (e.g., your preferred architecture patterns, product strategy, real user behavior, and internal norms or shortcuts).
  2. Similarly, Codex couldn’t see the app actually run: It couldn’t open Sora on a device, notice that a scroll felt off, or sense that a flow was confusing. Only our team could cover these experiential tasks.
  3. Each instance requires onboarding. Sharing context with clear goals, constraints, and guidance on “how we do things” was essential to making Codex execute well.
  4. In the same vein, Codex struggled with deep architectural judgment: Left on its own, it might introduce an extra view model where we really wanted to extend an existing one or push logic into the UI layer that clearly belonged in a repository. Its instinct is to get something working, not to prioritize long‑term cleanliness.
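The layering concern in the last item can be sketched in plain Kotlin. This is a minimal illustration under stated assumptions: `Draft`, `DraftRepository`, and `DraftsViewModel` are hypothetical names, not taken from the Sora codebase, and a real Android app would likely build on the platform's view model and coroutine libraries rather than plain classes.

```kotlin
// Hypothetical names throughout; a sketch of layering, not Sora's actual code.
data class Draft(val id: String, val createdAtMs: Long)

// Repository layer: owns data rules such as "which drafts are still valid".
class DraftRepository(private val drafts: List<Draft>, private val maxAgeMs: Long) {
    fun activeDrafts(nowMs: Long): List<Draft> =
        drafts.filter { nowMs - it.createdAtMs <= maxAgeMs }
}

// Existing view model, extended with one more piece of UI state
// instead of adding a second view model for the same screen.
class DraftsViewModel(private val repository: DraftRepository) {
    var visibleDrafts: List<Draft> = emptyList()
        private set
    var emptyStateVisible: Boolean = true
        private set

    fun refresh(nowMs: Long) {
        visibleDrafts = repository.activeDrafts(nowMs)
        emptyStateVisible = visibleDrafts.isEmpty()
    }
}

// UI layer: renders state only; it never applies the expiry rule itself.
fun render(viewModel: DraftsViewModel): String =
    if (viewModel.emptyStateVisible) "No drafts"
    else viewModel.visibleDrafts.joinToString { it.id }
```

The expiry rule lives in one place, so a reviewer can reject any diff that re-implements the filter inside `render` or introduces a parallel view model for the same screen.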

We found it useful to have Codex create and maintain a generous number of AGENTS.md files throughout the codebase. This made it easy to apply the same guidance and best practices across sessions. For example, to ensure Codex wrote code that followed our style guidelines, we added the following to our top-level AGENTS.md:


## Formatting and static checks

- **Always run** `./gradlew detektFix` (or for the affected modules) **before committing**. CI will fail if formatting or detekt issues are present.






Where Codex excels


  1. Reading and understanding large codebases rapidly: Codex knows essentially all major programming languages, which makes it easier to leverage the same concepts across many platforms without complex abstractions.
  2. Testing coverage: Codex is (uniquely) enthusiastic about writing unit tests to cover a broad variety of cases. Not every test was deep, but having breadth of coverage was helpful in preventing regressions. 
  3. Applying feedback: In a similar vein, Codex is good at reacting to feedback. When CI failed, we could paste log output into a prompt and ask Codex to propose fixes.
  4. Massively parallel, disposable execution: Most won’t push the limits of the number of sessions they could actually run at any one time. It’s highly feasible to test multiple ideas in parallel and view code as disposable.
  5. Offering new perspective: In design discussions, we used Codex as a generative tool to explore potential failure points and discover new ways to solve a problem. For example, while we designed video player memory optimizations, Codex sifted through multiple SDKs to propose approaches we wouldn’t have had time to parse. The insights from Codex’s research proved invaluable in minimizing memory footprint in the final app.
  6. Enabling higher‑leverage work: In practice, we ended up spending more time reviewing and directing code than writing it ourselves. That said, Codex is very good at code review, too, often catching bugs before they’re merged, improving reliability.

Once we acknowledged these characteristics, our working model became more straightforward. We leaned on Codex to do a huge amount of heavy lifting inside well‑understood patterns and well‑bounded scopes, while our team focused on architecture, user experience, systemic changes, and final quality.


Laying the foundation by hand



Even the best new, senior hire doesn’t have the right vantage point for making long-term trade-offs right away. To leverage Codex and ensure its work was robust and maintainable, it was key that we oversaw the app’s systems design and key trade-offs ourselves. These included shaping the app’s architecture, modularization, dependency injection, and navigation; we also implemented authentication and base networking flows. 


From this foundation, we wrote a few representative features end‑to‑end. We used the rules we wanted the entire codebase to follow and documented project‑wide patterns as we went. Once we pointed Codex to these representative features, it was able to work more independently within our standards. For a project that we estimate was 85% written by Codex, a carefully planned foundation avoided costly backtracking and refactoring. It was one of the most important decisions we made.


The idea was not to make “something that works” as quickly as possible, but rather to make “something that gets how we want things to work.” There are many “correct” ways to write code. We didn’t need to tell Codex exactly what to do; we needed to show Codex what’s “correct” on our team. Once we had established our starting point and how we liked to build, Codex was ready to start.


To see what would happen, we did try prompting: “Build the Sora Android app based on the iOS code. Go,” but quickly aborted that path. While what Codex created technically worked, the product experience was sub-par. And without a clear understanding of endpoints, data, and user flows, Codex’s single-shot code was unreliable. (Even without using an agent, it’s risky to merge thousands of lines of code.)


We hypothesized Codex would thrive in a sandbox of well-written examples, and we were right. Asking Codex to “build this settings screen” with almost no context was unreliable. Asking Codex to “build this settings screen using the same architecture and patterns as this other screen you just saw” worked far better. Humans made the structural decisions and set the invariants; Codex then filled in large amounts of code inside that structure.


Planning with Codex before coding



Our next step in maximizing Codex’s potential was figuring out how to enable Codex to work for long periods of time (recently, more than 24 hours), unsupervised.


Early on in using Codex, we jumped to prompts like, “Here is the feature. Here are some files. Please build it.” That sometimes worked, but mostly produced code that technically compiled, while straying from our architecture and goals.


So we changed the workflow. For any non‑trivial change, we first asked Codex to help us understand how the system and code work. We’d ask it to read a set of related files and summarize how that feature works: for instance, how data flows from the API through the repository layer, the view model, and into the UI. Then we would correct or refine its understanding. (We might point out that a particular abstraction really belongs in a different layer, or that a given class exists only for offline mode and should not be extended.)
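To make the data flow named above concrete, here is a simplified, synchronous sketch of one feature passing from the API through a repository and view model into the UI. Every name here (`FeedApi`, `VideoDto`, `FeedRepository`, `FeedViewModel`, `FeedUiState`) is an assumption made for illustration, not the app's real code; a production app would typically make these calls asynchronous.

```kotlin
// Hypothetical names; a simplified sketch of one feature's data flow.

// 1. API layer: the raw wire model as returned by the backend.
data class VideoDto(val id: String, val title: String?, val durationSec: Int)

fun interface FeedApi {
    fun fetchFeed(): List<VideoDto> // may throw on network failure
}

// 2. Repository layer: maps wire models to domain models and applies defaults.
data class Video(val id: String, val title: String, val durationSec: Int)

class FeedRepository(private val api: FeedApi) {
    fun loadFeed(): List<Video> =
        api.fetchFeed().map { Video(it.id, it.title ?: "Untitled", it.durationSec) }
}

// 3. View model: turns repository results into explicit UI state.
sealed interface FeedUiState {
    object Loading : FeedUiState
    data class Loaded(val videos: List<Video>) : FeedUiState
    data class Failed(val message: String) : FeedUiState
}

class FeedViewModel(private val repository: FeedRepository) {
    var state: FeedUiState = FeedUiState.Loading
        private set

    fun load() {
        state = try {
            FeedUiState.Loaded(repository.loadFeed())
        } catch (e: Exception) {
            FeedUiState.Failed(e.message ?: "Unknown error")
        }
    }
}

// 4. UI: a pure function of state, with no data rules of its own.
fun render(state: FeedUiState): String = when (state) {
    FeedUiState.Loading -> "Loading..."
    is FeedUiState.Loaded ->
        state.videos.joinToString("\n") { "${it.title} (${it.durationSec}s)" }
    is FeedUiState.Failed -> "Error: ${state.message}"
}
```

A summary from Codex that matches this shape (wire model, mapping, state, rendering) is easy to check against the real files, and a mismatch at any of the four steps is a natural place to correct its understanding before it writes code.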


Much as you might engage a new, highly capable teammate, we worked with Codex to create a solid implementation plan. That plan often looked like a miniature design document directing which files should change, what new states should be introduced, and how logic should flow. Only then did we ask Codex to start applying the plan, one step at a time. One helpful tip: for very long tasks (where we hit the limit of our context window), we’d ask Codex to save its plan to a file, allowing us to apply the same direction across instances.


This extra planning loop turned out to be worth the time. It allowed us to let Codex run “unsupervised” for long stretches, because we knew its plans. It made code review easier, because we could check the implementation against our plan rather than reading a diff without context. And when something went wrong, we could debug the plan first and the code second. 


The dynamic felt similar to the way a good design document gives a tech lead confidence in a project. We weren’t just generating code: we were producing code that supported a shared roadmap.


Distributed engineering



At the peak of the project, we were often running multiple Codex sessions in parallel. One was working on playback, another on search, another on error handling, and sometimes another on tests or refactors. It felt less like using a tool and more like managing a team.


Each session would periodically report back to us with progress. One might say, “I’m done planning out this module; here’s what I propose,” while another would offer a large diff for a new feature. Each required attention, feedback, and review. It was uncannily similar to being a tech lead with several new engineers, all making progress, all needing guidance.


The result was a collaborative flow. Codex’s raw coding capability freed us from a lot of manual typing. We had more time to think about architecture, read pull requests carefully, and test out the app. 


At the same time, that extra speed meant we always had something waiting in our review queue. Codex didn’t get blocked by context switching, but we did. Our bottleneck in development shifted from writing code to making decisions, giving feedback, and integrating changes.


This is where Brooks’ insights land in a new way. You can’t simply add Codex sessions and expect linear speedups, any more than you can keep adding engineers to a project and expect the schedule to shrink linearly. Each additional “pair of hands,” even a virtual one, adds coordination overhead. We had become conductors of an orchestra rather than simply faster solo players.


Codex as a cross‑platform superpower



We started our project with a huge stepping stone: Sora had already shipped on iOS. We frequently pointed Codex at the iOS and backend codebases to help it understand key requirements and constraints. Throughout the project we joked that we had reinvented the idea of a cross‑platform framework. Forget React Native or Flutter; the future of cross‑platform is just Codex.


Beneath the quip are two principles:


  1. Logic is portable. Whether the code is written in Swift or Kotlin, the underlying application logic – data models, network calls, validation rules, business logic – is the same. Codex is very good at reading a Swift implementation and producing an equivalent in Kotlin that preserves semantics.
  2. Concrete examples provide powerful context. A fresh Codex session that can see “here is exactly how this works on iOS” and “here is the Android architecture” is far more effective than one that’s only working from natural language descriptions.
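As a rough illustration of the first principle, the sketch below shows a small validation rule in Swift (as a comment) next to a Kotlin translation. Both snippets are hypothetical: `PromptDraft` and the 500-character limit are invented for this example and do not come from the Sora apps.

```kotlin
// Hypothetical example; neither snippet is from the Sora codebase.
//
// Swift original, shown as a comment for comparison (assumes `import Foundation`):
//
//   struct PromptDraft {
//       let text: String
//       var isValid: Bool {
//           let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
//           return !trimmed.isEmpty && trimmed.count <= 500
//       }
//   }

// Kotlin translation of the same rule: non-blank and at most 500 characters.
data class PromptDraft(val text: String) {
    val isValid: Boolean
        get() {
            val trimmed = text.trim()
            return trimmed.isNotEmpty() && trimmed.length <= 500
        }
}
```

The two versions agree for plain ASCII input, but they are not identical in every case: Swift's `count` counts user-perceived characters, while Kotlin's `length` counts UTF-16 code units, so a prompt full of emoji could pass on one platform and fail on the other. Differences like this are the kind of detail worth checking when reviewing a translated implementation.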

Putting these principles to work, we made the iOS, backend, and Android repos available in the same environment. We gave Codex prompts like:


“Read these models and endpoints in the iOS code and then propose a plan to implement the equivalent behavior on Android using our existing API client and model classes.”


One small but useful trick was to detail in ~/.codex/AGENTS.md where local repos lived and what they contained. That made it easier for Codex to discover and navigate relevant code.


We were effectively doing cross-platform development through translation instead of shared abstraction. Because Codex handled most of the translation, we avoided doubling implementation costs.


The broader lesson is that for Codex, context is everything. Codex did its best work when it understood how the feature already worked in iOS, paired with an understanding of how our Android app was structured. When Codex lacked that context, it wasn’t “refusing to cooperate”; it was guessing. The more we treated it like a new teammate and invested in giving it the right inputs, the better it performed.


The software engineering of tomorrow, today



By the end of our four‑week sprint, using Codex stopped feeling like an experiment and became our default development loop. We used it to understand existing code, plan changes, and implement features. We reviewed its output the same way we’d review a teammate’s. It was simply how we shipped software.


It became clear that AI‑assisted development does not reduce the need for rigor; it increases it. As capable as Codex is, its objective is to get from A to B, now. This is why AI-assisted coding doesn’t work without humans. Software engineers can understand and apply the real-world constraints of systems, the best ways to architect software, and how to build with future development and product plans in mind. The superpowers of tomorrow’s software engineer will be deep systems understanding and the ability to work collaboratively with AI over long time horizons.


The most interesting parts of software engineering are building compelling products, designing scalable systems, writing complex algorithms, and experimenting with data, patterns, and code. However, the realities of software engineering of the past and present often lean more mundane: centering buttons, wiring endpoints, and writing boilerplate. Now, Codex makes it possible to focus on the most meaningful parts of software engineering and the reasons we love our craft.


Once Codex is set up in a context-rich environment where it understands your goals and how you like to build, any team can multiply its capabilities. Our launch retro isn’t a one‑size‑fits‑all recipe, and we’re not claiming to have solved AI‑assisted development. But we hope our experience makes it easier to find the best ways to empower Codex to empower you.


When Codex launched in a research preview seven months ago, software engineering looked very different. Through Sora, we got to explore the next chapter of engineering. As our models and harness keep improving, AI will become an increasingly indispensable part of building. 


What will you make with your own team of Codex?


Special thanks to the entire team that helped build Sora for Android.


