Harness engineering: leveraging Codex in an agent-first world
Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.
The product has internal daily users and external alpha testers. It ships, deploys, breaks, and gets fixed. What’s different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.
Humans steer. Agents execute.
We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude. We had weeks to ship what ended up being a million lines of code. To do that, we needed to understand what changes when a software engineering team’s primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work.
This post is about what we learned by building a brand new product with a team of agents—what broke, what compounded, and how to maximize our one truly scarce resource: human time and attention.
We started with an empty git repository
The first commit to an empty repository landed in late August 2025.
The initial scaffold—repository structure, CI configuration, formatting rules, package manager setup, and application framework—was generated by Codex CLI using GPT‑5, guided by a small set of existing templates. Even the initial AGENTS.md file that directs agents how to work in the repository was itself written by Codex.
There was no pre-existing human-written code to anchor the system. From the beginning, the repository was shaped by the agent.
Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged by a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and, surprisingly, throughput has increased as the team has grown to its current seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.
Throughout the development process, humans never directly contributed any code. This became a core philosophy for the team: no manually-written code.
Redefining the role of the engineer
The lack of hands-on human coding introduced a different kind of engineering work, focused on systems, scaffolding, and leverage.
Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified. The agent lacked the tools, abstractions, and internal structure required to make progress toward high-level goals. The primary job of our engineering team became enabling the agents to do useful work.
In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test, etc.), prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never “try harder.” Because the only way to make progress was to get Codex to do the work, human engineers stepped back and asked: “what capability is missing, and how do we make it both legible and enforceable for the agent?”
Humans interact with the system almost entirely through prompts: an engineer describes a task, runs the agent, and allows it to open a pull request. To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any feedback from humans or agents, and iterate in a loop until all agent reviewers are satisfied (effectively, this is a Ralph Wiggum loop). Codex uses our standard development tools directly (gh, local scripts, and repository-embedded skills) to gather context without humans copying and pasting into the CLI.
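To make the iterate-until-approved loop concrete, here is a minimal sketch written as a standalone TypeScript script that shells out to gh. The PR-number handling, polling interval, and addressFeedback placeholder are illustrative assumptions; in practice Codex drives this loop itself rather than a fixed script.

```typescript
// Hypothetical sketch: poll a PR until every reviewer is satisfied, dispatching
// a fresh agent run whenever a reviewer requests changes. The --json fields used
// here (reviewDecision, reviews) are standard gh CLI output.
import { execFileSync } from "node:child_process";

interface PrState {
  reviewDecision: "APPROVED" | "CHANGES_REQUESTED" | "REVIEW_REQUIRED" | "";
  reviews: { author: { login: string }; state: string }[];
}

function prState(pr: number): PrState {
  const out = execFileSync(
    "gh",
    ["pr", "view", String(pr), "--json", "reviewDecision,reviews"],
    { encoding: "utf8" },
  );
  return JSON.parse(out) as PrState;
}

async function driveToCompletion(pr: number): Promise<void> {
  for (;;) {
    const state = prState(pr);
    if (state.reviewDecision === "APPROVED") return;

    for (const review of state.reviews.filter((r) => r.state === "CHANGES_REQUESTED")) {
      // Placeholder: in the real workflow this is another Codex run that reads
      // the reviewer's comments, applies the change, pushes, and replies inline.
      await addressFeedback(pr, review.author.login);
    }
    await new Promise((resolve) => setTimeout(resolve, 60_000)); // wait for re-review
  }
}

async function addressFeedback(pr: number, reviewer: string): Promise<void> {
  console.log(`would spawn an agent run for PR #${pr} to address ${reviewer}'s feedback`);
}

driveToCompletion(Number(process.argv[2])).catch((err) => {
  console.error(err);
  process.exit(1);
});
```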
Humans may review pull requests, but aren’t required to. Over time, we’ve pushed almost all review effort towards being handled agent-to-agent.
Increasing application legibility
As code throughput increased, our bottleneck became human QA capacity. Because human time and attention are the fixed constraint, we’ve worked to give the agent more capability by making things like the application UI, logs, and app metrics directly legible to Codex.
For example, we made the app bootable per git worktree, so Codex could launch and drive one instance per change. We also wired the Chrome DevTools Protocol into the agent runtime and created skills for working with DOM snapshots, screenshots, and navigation. This enabled Codex to reproduce bugs, validate fixes, and reason about UI behavior directly.
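A minimal sketch of what that can look like from the agent's side, assuming a per-worktree app instance and a Chrome started with remote debugging enabled. puppeteer-core is one common CDP client; the ports, URL, and selector below are illustrative, not the team's actual skill implementation.

```typescript
// Hypothetical skill sketch: attach to a locally running Chrome over CDP,
// reproduce a UI flow against a per-worktree app instance, and leave behind
// artifacts (DOM snapshot, screenshot) the agent can inspect afterwards.
import { writeFileSync } from "node:fs";
import puppeteer from "puppeteer-core";

async function captureRepro(appUrl: string): Promise<void> {
  // Assumes Chrome was started with --remote-debugging-port=9222.
  const browser = await puppeteer.connect({ browserURL: "http://127.0.0.1:9222" });
  const page = await browser.newPage();

  await page.goto(appUrl, { waitUntil: "networkidle0" });
  await page.click('[data-testid="settings-save"]'); // hypothetical repro step

  writeFileSync("dom-snapshot.html", await page.content());
  await page.screenshot({ path: "repro.png", fullPage: true });

  await browser.disconnect();
}

captureRepro("http://127.0.0.1:3000").catch((err) => {
  console.error(err);
  process.exit(1);
});
```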
We did the same for observability tooling. Logs, metrics, and traces are exposed to Codex via a local observability stack that’s ephemeral for any given worktree. Codex works on a fully isolated version of that app—including its logs and metrics, which get torn down once that task is complete. Agents can query logs with LogQL and metrics with PromQL. With this context available, prompts like “ensure service startup completes in under 800ms” or “no span in these four critical user journeys exceeds two seconds” become tractable.
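For instance, a latency budget like the second prompt can reduce to a small check against the local Prometheus-compatible endpoint. This is a sketch under assumptions: the port, the histogram metric name, and the span name are hypothetical, while the /api/v1/query endpoint and histogram_quantile PromQL are standard Prometheus.

```typescript
const PROM_URL = "http://127.0.0.1:9090"; // per-worktree stack; port is an assumption

async function p95SpanSeconds(spanName: string): Promise<number> {
  // app_span_duration_seconds_bucket is a hypothetical histogram in seconds.
  const promql =
    `histogram_quantile(0.95, sum(rate(` +
    `app_span_duration_seconds_bucket{span_name="${spanName}"}[5m])) by (le))`;
  const res = await fetch(`${PROM_URL}/api/v1/query?query=${encodeURIComponent(promql)}`);
  const body = await res.json();
  return Number(body.data.result[0]?.value?.[1] ?? NaN);
}

// "No span in these critical user journeys exceeds two seconds."
p95SpanSeconds("checkout.submit").then((p95) => {
  if (!(p95 < 2)) throw new Error(`p95 for checkout.submit is ${p95}s; the budget is 2s`);
  console.log(`checkout.submit p95 = ${p95}s (within budget)`);
});
```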
We regularly see single Codex runs work on a single task for upwards of six hours (often while the humans are sleeping).
We made repository knowledge the system of record
Context management is one of the biggest challenges in making agents effective at large and complex tasks. One of the earliest lessons we learned was simple: give Codex a map, not a 1,000-page instruction manual.
We tried the “one big AGENTS.md” approach. It failed in predictable ways:
- Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs—so the agent either misses key constraints or starts optimizing for the wrong ones.
- Too much guidance becomes non-guidance. When everything is “important,” nothing is. Agents end up pattern-matching locally instead of navigating intentionally.
- It rots instantly. A monolithic manual turns into a graveyard of stale rules. Agents can’t tell what’s still true, humans stop maintaining it, and the file quietly becomes an attractive nuisance.
- It’s hard to verify. A single blob doesn’t lend itself to mechanical checks (coverage, freshness, ownership, cross-links), so drift is inevitable.
So instead of treating AGENTS.md as the encyclopedia, we treat it as the table of contents.
The repository’s knowledge base lives in a structured docs/ directory treated as the system of record. A short AGENTS.md (roughly 100 lines) is injected into context and serves primarily as a map, with pointers to deeper sources of truth elsewhere.
```
AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
```
In-repository knowledge store layout.
Design documentation is catalogued and indexed, including verification status and a set of core beliefs that define agent-first operating principles. Architecture documentation provides a top-level map of domains and package layering. A quality document grades each product domain and architectural layer, tracking gaps over time.
Plans are treated as first-class artifacts. Ephemeral lightweight plans are used for small changes, while complex work is captured in execution plans with progress and decision logs that are checked into the repository. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.
This enables progressive disclosure: agents start with a small, stable entry point and are taught where to look next, rather than being overwhelmed up front.
We enforce this mechanically. Dedicated linters and CI jobs validate that the knowledge base is up to date, cross-linked, and structured correctly. A recurring “doc-gardening” agent scans for stale or obsolete documentation that does not reflect the real code behavior and opens fix-up pull requests.
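A sketch of one such mechanical check, assuming the layout shown above: it walks docs/ and fails CI on any relative markdown link that no longer resolves, putting the remediation hint in the error text so it lands directly in agent context. The glob dependency and the exact rule are illustrative, not the team's actual linter.

```typescript
// Hypothetical docs linter: every relative markdown link under docs/ (plus the
// two top-level entry points) must resolve to a file that still exists.
import { existsSync, readFileSync } from "node:fs";
import path from "node:path";
import { globSync } from "glob";

const LINK_RE = /\]\(([^)#\s]+\.md)\)/g;

const files = globSync("docs/**/*.md").concat(["AGENTS.md", "ARCHITECTURE.md"]);
const errors: string[] = [];

for (const file of files) {
  const text = readFileSync(file, "utf8");
  for (const [, target] of text.matchAll(LINK_RE)) {
    if (target.startsWith("http")) continue; // external links handled elsewhere
    const resolved = path.resolve(path.dirname(file), target);
    if (!existsSync(resolved)) {
      errors.push(
        `${file}: broken link to ${target}. Update the path, or delete the ` +
          `reference if the target doc was intentionally removed.`,
      );
    }
  }
}

if (errors.length > 0) {
  console.error(errors.join("\n"));
  process.exit(1);
}
```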
Agent legibility is the goal
As the codebase evolved, Codex’s framework for design decisions needed to evolve, too.
Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility. In the same way teams aim to improve navigability of their code for new engineering hires, our human engineers’ goal was making it possible for an agent to reason about the full business domain directly from the repository itself.
From the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist. Knowledge that lives in Google Docs, chat threads, or people’s heads is not accessible to the system. Repository-local, versioned artifacts (e.g., code, markdown, schemas, executable plans) are all it can see.
We learned that we needed to push more and more context into the repo over time. That Slack discussion that aligned the team on an architectural pattern? If it isn’t discoverable to the agent, it’s illegible in the same way it would be unknown to a new hire joining three months later.
Giving Codex more context means organizing and exposing the right information so the agent can reason over it, rather than overwhelming it with ad-hoc instructions. In the same way you would onboard a new teammate on product principles, engineering norms, and team culture (emoji preferences included), giving the agent this information leads to better-aligned output.
This framing clarified many tradeoffs. We favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies often described as “boring” tend to be easier for agents to model due to composability, API stability, and representation in the training set. In some cases, it was cheaper to have the agent reimplement subsets of functionality than to work around opaque upstream behavior from public libraries. For example, rather than pulling in a generic p-limit-style package, we implemented our own map-with-concurrency helper: it’s tightly integrated with our OpenTelemetry instrumentation, has 100% test coverage, and behaves exactly the way our runtime expects.
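To make the tradeoff concrete, here is a minimal sketch of such a helper. The real one is described as wired into OpenTelemetry spans and fully tested; this version shows only the core concurrency pattern, and the names are illustrative.

```typescript
// Map over items with at most `limit` calls to `fn` in flight, preserving order.
export async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each worker claims the next unprocessed index until the input is exhausted,
  // so no more than `limit` invocations run concurrently.
  async function worker(): Promise<void> {
    while (true) {
      const index = next++;
      if (index >= items.length) return;
      results[index] = await fn(items[index], index);
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

// Usage: fetch at most four pages at a time, keeping results in input order.
// const pages = await mapWithConcurrency(urls, 4, (url) => fetch(url).then((r) => r.text()));
```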
Pulling more of the system into a form the agent can inspect, validate, and modify directly increases leverage—not just for Codex, but for other agents (e.g. Aardvark) that are working on the codebase as well.
Enforcing architecture and taste
Documentation alone doesn’t keep a fully agent-generated codebase coherent. By enforcing invariants, not micromanaging implementations, we let agents ship fast without undermining the foundation. For example, we require Codex to parse data shapes at the boundary, but are not prescriptive on how that happens (the model seems to like Zod, but we didn’t specify that specific library).
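A boundary parse might look like the following sketch. Zod happens to be what the model reaches for, as noted above, but the invariant is library-agnostic; the schema fields and endpoint here are hypothetical.

```typescript
// Parse unknown JSON at the boundary; everything past the parse is typed.
import { z } from "zod";

const AppSettings = z.object({
  theme: z.enum(["light", "dark"]),
  retentionDays: z.number().int().positive(),
});
type AppSettings = z.infer<typeof AppSettings>;

export async function fetchAppSettings(): Promise<AppSettings> {
  const res = await fetch("/api/app-settings"); // hypothetical endpoint
  return AppSettings.parse(await res.json());
}
```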
Agents are most effective in environments with strict boundaries and predictable structure, so we built the application around a rigid architectural model. Each business domain is divided into a fixed set of layers, with strictly validated dependency directions and a limited set of permissible edges. These constraints are enforced mechanically via custom linters (Codex-generated, of course!) and structural tests.
The rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.
This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.
In practice, we enforce these rules with custom linters and structural tests, plus a small set of “taste invariants.” For example, we statically enforce structured logging, naming conventions for schemas and types, file size limits, and platform-specific reliability requirements with custom lints. Because the lints are custom, we write the error messages to inject remediation instructions into agent context.
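Here is a sketch of what one such structural check could look like, assuming each layer lives in a directory named after it and may only import layers earlier in the Types → Config → Repo → Service → Runtime → UI chain. The regex-based import scan, the src/ layout, and the Provider remediation hint are simplifying assumptions; the real linters are more thorough.

```typescript
// Illustrative structural test: a file in a given layer may only import from
// layers earlier in the chain. A real linter would walk the TypeScript AST.
import { readFileSync } from "node:fs";
import { globSync } from "glob";

const LAYERS = ["types", "config", "repo", "service", "runtime", "ui"] as const;

const layerOf = (p: string): number =>
  LAYERS.findIndex((layer) => p.includes(`/${layer}/`));

const violations: string[] = [];
for (const file of globSync("src/**/*.ts")) {
  const from = layerOf(file);
  if (from === -1) continue;
  for (const [, imported] of readFileSync(file, "utf8").matchAll(/from ['"](\.[^'"]+)['"]/g)) {
    const to = layerOf(imported);
    if (to > from) {
      // The remediation hint is part of the error so it lands in agent context.
      violations.push(
        `${file}: a ${LAYERS[from]} module may not depend on ${LAYERS[to]}. ` +
          `Move the shared logic into an earlier layer or route it through a Provider.`,
      );
    }
  }
}

if (violations.length > 0) {
  console.error(violations.join("\n"));
  process.exit(1);
}
```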
In a human-first workflow, these rules might feel pedantic or constraining. With agents, they become multipliers: once encoded, they apply everywhere at once.
At the same time, we’re explicit about where constraints matter and where they do not. This resembles leading a large engineering platform organization: enforce boundaries centrally, allow autonomy locally. You care deeply about boundaries, correctness, and reproducibility. Within those boundaries, you allow teams—or agents—significant freedom in how solutions are expressed.
The resulting code does not always match human stylistic preferences, and that’s okay. As long as the output is correct, maintainable, and legible to future agent runs, it meets the bar.
Human taste is fed back into the system continuously. Review comments, refactoring pull requests, and user-facing bugs are captured as documentation updates or encoded directly into tooling. When documentation falls short, we promote the rule into code.
Throughput changes the merge philosophy
As Codex’s throughput increased, many conventional engineering norms became counterproductive.
The repository operates with minimal blocking merge gates. Pull requests are short-lived. Test flakes are often addressed with follow-up runs rather than blocking progress indefinitely. In a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive.
This would be irresponsible in a low-throughput environment. Here, it’s often the right tradeoff.
What “agent-generated” actually means
When we say the codebase is generated by Codex agents, we mean everything in the codebase.
Agents produce:
- Product code and tests
- CI configuration and release tooling
- Internal developer tools
- Documentation and design history
- Evaluation harnesses
- Review comments and responses
- Scripts that manage the repository itself
- Production dashboard definition files
Humans always remain in the loop, but work at a different layer of abstraction than we used to. We prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, we treat it as a signal: identify what is missing—tools, guardrails, documentation—and feed it back into the repository, always by having Codex itself write the fix.
Agents use our standard development tools directly. They pull review feedback, respond inline, push updates, and often squash and merge their own pull requests.
Increasing levels of autonomy
As more of the development loop was encoded directly into the system—testing, validation, review, feedback handling, and recovery—the repository recently crossed a meaningful threshold where Codex can end-to-end drive a new feature.
Given a single prompt, the agent can now:
- Validate the current state of the codebase
- Reproduce a reported bug
- Record a video demonstrating the failure
- Implement a fix
- Validate the fix by driving the application
- Record a second video demonstrating the resolution
- Open a pull request
- Respond to agent and human feedback
- Detect and remediate build failures
- Escalate to a human only when judgment is required
- Merge the change
This behavior depends heavily on the specific structure and tooling of this repository and should not be assumed to generalize without similar investment—at least, not yet.
Entropy and garbage collection
Full agent autonomy also introduces novel problems. Codex replicates patterns that already exist in the repository—even uneven or suboptimal ones. Over time, this inevitably leads to drift.
Initially, humans addressed this manually. Our team used to spend every Friday (20% of the week) cleaning up “AI slop.” Unsurprisingly, that didn’t scale.
Instead, we started encoding what we call “golden principles” directly into the repository and built a recurring cleanup process. These principles are opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs. For example: (1) we prefer shared utility packages over hand-rolled helpers to keep invariants centralized, and (2) we don’t probe data “YOLO-style”—we validate boundaries or rely on typed SDKs so the agent can’t accidentally build on guessed shapes. On a regular cadence, we have a set of background Codex tasks that scan for deviations, update quality grades, and open targeted refactoring pull requests. Most of these can be reviewed in under a minute and automerged.
This functions like garbage collection. Technical debt is like a high-interest loan: it’s almost always better to pay it down continuously in small increments than to let it compound and tackle it in painful bursts. Human taste is captured once, then enforced continuously on every line of code. This also lets us catch and resolve bad patterns on a daily basis, rather than letting them spread in the code base for days or weeks.
What we’re still learning
This strategy has so far worked well up through internal launch and adoption at OpenAI. Building a real product for real users helped anchor our investments in reality and guide us towards long-term maintainability.
What we don’t yet know is how architectural coherence evolves over years in a fully agent-generated system. We’re still learning where human judgment adds the most leverage and how to encode that judgment so it compounds. We also don’t know how this system will evolve as models continue to become more capable over time.
What’s become clear: building software still demands discipline, but the discipline shows up in the scaffolding more than in the code itself. The tooling, abstractions, and feedback loops that keep the codebase coherent are increasingly important.
Our most difficult challenges now center on designing environments, feedback loops, and control systems that help agents accomplish our goal: build and maintain complex, reliable software at scale.
As agents like Codex take on larger portions of the software lifecycle, these questions will matter even more. We hope that sharing some early lessons helps you reason about where to invest your effort so you can just build things.