Datadog uses Codex for system-level code review


Datadog runs one of the world’s most widely used observability platforms, helping companies monitor, troubleshoot, and secure complex distributed systems. When something breaks, customers depend on Datadog to surface issues fast, which means reliability has to be built in long before code ever reaches production.


For Datadog’s engineering teams, that makes code review a high-stakes moment. It’s not just about catching mistakes, but about understanding how changes ripple through interconnected systems—an area where traditional static analysis and rule-based tools often fall short.


To meet this challenge, Datadog’s AI Development Experience (AI DevX) team turned to Codex, the coding agent from OpenAI, which brings system-level reasoning into code review and surfaces risks humans can’t easily see at scale.


“Time savings are real and important,” says Brad Carter, who leads Datadog’s AI DevX team. “But preventing incidents is far more compelling at our scale.”


Bringing system-level context to code review with Codex




Effective code review at Datadog traditionally relied heavily on senior engineers—the people who understand the codebase, its history, and the architectural tradeoffs well enough to spot systemic risk. 


But that kind of deep context is hard to scale, and early AI code review tools didn’t solve this problem; many behaved like advanced linters, flagging surface-level issues while missing broader system nuances. Datadog’s engineers often found the suggestions too shallow or too noisy, and ignored them.


Datadog began piloting Codex by integrating it into live development workflows. In one of the company’s largest and most heavily used repositories, every pull request was automatically reviewed by Codex. Engineers reacted to comments from Codex with thumbs up or down and shared informal feedback across teams. Many noted that the Codex feedback was worth reading, unlike previous tools that produced noisy or shallow suggestions.
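
The article doesn’t detail the plumbing, but a minimal sketch of this kind of wiring, assuming a GitHub-hosted repository and the public GitHub REST API, might look like the following. The `run_codex_review` function is a hypothetical stand-in for however Codex is actually invoked, and everything Datadog-internal is elided:

```python
import os

import requests

GITHUB_API = "https://api.github.com"
HEADERS = {
    # Assumes a token is provided via the environment (e.g. by CI).
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}


def run_codex_review(diff: str) -> str:
    """Hypothetical stand-in for however the Codex review agent is invoked."""
    raise NotImplementedError


def review_pull_request(owner: str, repo: str, pr_number: int) -> None:
    """Fetch a PR's diff, run the agent on it, and post the findings as a review."""
    # Asking for the diff media type makes GitHub return the raw patch text.
    diff = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{pr_number}",
        headers={**HEADERS, "Accept": "application/vnd.github.diff"},
    ).text

    # Post the agent's findings back onto the PR as a non-blocking review comment.
    requests.post(
        f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{pr_number}/reviews",
        headers=HEADERS,
        json={"body": run_codex_review(diff), "event": "COMMENT"},
    ).raise_for_status()


def reaction_score(owner: str, repo: str, comment_id: int) -> int:
    """Tally the thumbs up/down reactions engineers left on one review comment."""
    reactions = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/pulls/comments/{comment_id}/reactions",
        headers=HEADERS,
    ).json()
    return sum(
        1 if r["content"] == "+1" else -1
        for r in reactions
        if r["content"] in ("+1", "-1")
    )
```

In a pilot like this, the per-comment reaction tally gives a rough signal-quality measure to complement the informal feedback shared across teams.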


Validating AI review against real incidents




To test whether AI‑assisted review could do more than point out style issues, Datadog built an incident replay harness.


Instead of using hypothetical scenarios, the team went back to historical incidents. They reconstructed pull requests that had contributed to incidents, ran Codex against each one as if it were part of the original review, then asked the engineers who owned those incidents whether feedback from Codex would have made a difference.
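
The article doesn’t describe the harness internals; a skeleton in that spirit might look like the sketch below, where `reconstruct_pull_request`, `run_codex_review`, and `owner_confirms_difference` are all hypothetical stand-ins for Datadog-internal systems, and the last step is a human survey rather than anything automatable:

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One historical incident and the pull request that contributed to it."""
    incident_id: str
    pr_number: int
    owner_handle: str  # engineer who owned the incident


def reconstruct_pull_request(pr_number: int) -> str:
    """Hypothetical: rebuild the diff and description as they existed at review time."""
    raise NotImplementedError


def run_codex_review(diff: str) -> str:
    """Hypothetical: run the agent against the diff as if it were the original reviewer."""
    raise NotImplementedError


def owner_confirms_difference(owner_handle: str, feedback: str) -> bool:
    """Hypothetical: ask the owning engineer whether this feedback would have mattered."""
    raise NotImplementedError


def replay(incidents: list[Incident]) -> float:
    """Return the fraction of incidents where the feedback would have made a difference."""
    confirmed = 0
    for incident in incidents:
        diff = reconstruct_pull_request(incident.pr_number)
        feedback = run_codex_review(diff)
        if owner_confirms_difference(incident.owner_handle, feedback):
            confirmed += 1
    return confirmed / len(incidents)  # roughly 0.22 in Datadog's evaluation
```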


The result: Codex found more than 10 cases, or roughly 22% of the incidents that Datadog examined, where engineers confirmed that the feedback Codex provided would have made a difference—more than any other tool evaluated.


Because these pull requests had already passed code review, the replay test showed that Codex surfaced risks reviewers hadn’t seen at the time, complementing human judgment rather than replacing it.


Delivering consistent, high-signal feedback




Datadog’s analysis showed that Codex consistently flagged issues that aren’t obvious from the immediate diff alone and can’t be caught by deterministic rules.


Engineers described Codex comments as more than “bot noise”:


  • Codex pointed out interactions with modules not touched in the diff
  • It identified missing test coverage in areas of cross‑service coupling
  • It highlighted API contract changes that carried downstream risk (sketched below)
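
As a hypothetical illustration of that last bullet, not an actual Datadog finding: a change that looks harmless inside its own diff can break an implicit contract relied on by code the diff never touches.

```python
# service_a.py -- the diff under review renames a response field.
def get_host_status(host_id: str) -> dict:
    # Before the change this returned {"host_id": ..., "status": "ok"}.
    # After the change, "status" becomes "state":
    return {"host_id": host_id, "state": "ok"}


# service_b.py -- NOT part of the diff, so a reviewer looking only at
# service_a's change never sees it. It still reads the old field name.
def is_healthy(host_id: str) -> bool:
    return get_host_status(host_id)["status"] == "ok"
```

Calling `is_healthy()` after the change raises `KeyError: 'status'`, the kind of cross-service break that only shows up when a reviewer reasons beyond the visible diff.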

“For me, a Codex comment feels like the smartest engineer I’ve worked with and who has infinite time to find bugs. It sees connections my brain doesn’t hold all at once.”
—Brad Carter, Engineering Manager at Datadog



That ability to connect review feedback to real reliability outcomes was what made Codex stand out in Datadog’s evaluation. Unlike static analysis tools, Codex compares the intent of the pull request with the submitted code changes, reasoning over the entire codebase and its dependencies and, where needed, executing code and tests to validate that behavior matches expectations.


“It was the first one that actually seemed to consider the diff in the larger context of the program,” says Carter. “That was novel and eye‑opening.”

For many engineers, that shift changed how they engaged with AI review altogether. “I started treating Codex comments like real code review feedback,” says Ted Wexler, Senior Software Engineer at Datadog. “Not something I’d skim or ignore, but something worth paying attention to.”


Focusing engineers on design over detection




Following the evaluation, Datadog deployed Codex more broadly across its engineering workforce. Today more than 1,000 engineers use it regularly. 


Feedback is largely surfaced organically rather than through formal in‑tool metrics. Engineers post to Slack about useful insights, constructive comments, and moments where Codex helped them think differently about a problem.


While the time savings are significant, teams consistently pointed to a more meaningful shift in how work gets done.


“Codex changed my mind about what code review should be. It’s not about replicating our best human reviewers. It’s about finding critical flaws and edge cases that humans struggle to see when reviewing changes in isolation.”
—Brad Carter, Engineering Manager at Datadog



Redefining code review around risk, not speed




The broader impact for Datadog was a change in how code review itself is defined. Rather than treating review as a checkpoint for catching errors or optimizing cycle time, the team now sees Codex as a core reliability system that acts as a partner:


  • Surfacing risk beyond what individual reviewers can hold in context
  • Highlighting cross-module and cross-service interactions
  • Increasing confidence in shipping at scale
  • Allowing human reviewers to focus on architecture and design

This shift aligns with how Datadog’s leaders frame engineering priorities, where reliability and trust matter as much as, if not more than, velocity.


“We are the platform companies rely on when everything else is breaking,” says Carter. “Preventing incidents strengthens the trust our customers place in us.”


