Why Codex Security Doesn’t Include a SAST Report
For decades, static application security testing (SAST) has been one of the most effective ways security teams scale code review.
But when we built Codex Security, we made a deliberate design choice: we didn’t start by importing a static analysis report and asking the agent to triage it. We designed the system to start with the repository itself—its architecture, trust boundaries, and intended behavior—and to validate what it finds before it asks a human to spend time on it.
The reason is simple: the hardest vulnerabilities usually aren’t dataflow problems. They happen when code appears to enforce a security check, but that check doesn’t actually guarantee the property the system relies on. In other words, the challenge isn’t just tracking how data moves through a program—it’s determining whether the defenses in the code really work.
The problem: SAST is optimized for dataflow
SAST is often framed as a clean pipeline: identify a source of untrusted input, track data through the program, and flag cases where that data reaches a sensitive sink without sanitization. It’s an elegant model, and it covers a lot of real bugs.
In practice, SAST has to make approximations to stay tractable at scale—especially in real codebases with indirection, dynamic dispatch, callbacks, reflection, and framework-heavy control flow. Those approximations aren’t a knock on SAST; they’re the reality of trying to reason about code without executing it.
That, by itself, is not why Codex Security doesn’t start with a SAST report.
The deeper issue is what happens after you successfully trace a source to a sink.
Where static analysis struggles: constraints and semantics
Even when static analysis correctly traces input across multiple functions and layers, it still has to answer the question that actually determines whether a vulnerability exists:
Did the defense really work?
Take a common pattern: code calls something like sanitize_html() before rendering untrusted content. A static analyzer can see that the sanitizer ran. What it usually can’t determine is whether that sanitizer is actually sufficient for the specific rendering context, template engine, encoding behavior, and downstream transformations involved.
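As a hedged illustration of this context problem (Python's stdlib html.escape stands in for the sanitizer, and the two rendering contexts are hypothetical), the exact same sanitizer call can be sufficient in one context and useless in another:

```python
import html

user = "x onmouseover=alert(1)"   # attacker-controlled; contains no <, >, &, or quotes
safe = html.escape(user)          # the "sanitizer ran", as far as dataflow is concerned
assert safe == user               # ...but it had nothing to escape

# Context A: element body -- HTML escaping is the right defense here.
page_a = f"<p>{safe}</p>"

# Context B: unquoted attribute value -- same sanitizer, now insufficient.
# The space splits the attribute, and the payload becomes a live event handler:
#   <a href=/next title=x onmouseover=alert(1)>next</a>
page_b = f"<a href=/next title={safe}>next</a>"
assert "onmouseover=alert(1)" in page_b
```

A source-to-sink view sees the same sanitized flow in both contexts; only the rendering semantics distinguish the safe case from the vulnerable one.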
That’s where things get tricky. The problem isn’t just whether data reaches a sink. It’s whether the checks in the code actually constrain the value in the way the system assumes they do.
Put differently: there’s a big difference between “the code calls a sanitizer” and “the system is safe.”
Example: validation before decoding
Here’s a pattern that shows up in real systems all the time.
A web application receives a JSON payload, extracts a redirect_url, validates it against an allowlist regex, URL-decodes it, and passes the result to a redirect handler.
A classic source-to-sink report can describe the flow:
untrusted input → regex check → URL decode → redirect
But the real question isn’t whether the check exists. It’s whether that check still constrains the value after the transformations that follow.
If the regex runs before decoding, does it actually constrain the decoded URL the way the redirect handler interprets it?
Answering that means reasoning about the entire transformation chain: what the regex allows, how decoding and normalization behave, how URL parsing treats edge cases, and how the redirect logic resolves schemes and authorities.
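A minimal Python sketch (the allowlist regex here is hypothetical, not from any particular framework) makes the ordering problem concrete: the check passes on the encoded value, and decoding then produces exactly the protocol-relative URL the check was meant to forbid:

```python
import re
from urllib.parse import unquote

# Hypothetical allowlist: a relative path that must not start with "//"
ALLOWLIST = re.compile(r"^/(?!/)[\w%./-]*$")

def redirect_target(raw):
    if not ALLOWLIST.match(raw):  # validation runs on the *encoded* value
        return None
    return unquote(raw)           # decoding runs *after* the check

assert redirect_target("//evil.com") is None             # the rejected shape
assert redirect_target("/%2Fevil.com") == "//evil.com"   # passes, then decodes into it

# Validating the decoded form restores the intended constraint
# (a real fix also has to consider double encoding and normalization):
def safe_redirect_target(raw):
    decoded = unquote(raw)
    return decoded if ALLOWLIST.match(decoded) else None

assert safe_redirect_target("/%2Fevil.com") is None
```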
Many of the vulnerabilities that matter in practice look like this: order-of-operations mistakes, partial normalization, parsing ambiguities, and mismatches between validation and interpretation. The dataflow is visible. The weakness is in how constraints propagate—or fail to propagate—through the transformation chain.
This isn’t just a theoretical pattern. In CVE-2024-29041, Express was affected by an open redirect issue where malformed URLs could bypass common allowlist implementations because of how redirect targets were encoded and then interpreted. The dataflow was straightforward. The harder question—and the one that determined whether the bug existed—was whether the validation still held after the transformation chain.
Our approach: start from behavior, then validate
Codex Security is built around a simple goal: reduce triage by surfacing issues with stronger evidence. In the product, that means using repo-specific context (including a threat model) and validating high-signal issues in an isolated environment before surfacing them.
When Codex Security encounters a boundary that looks like “validation” or “sanitization,” it doesn’t treat that as a checkbox. It tries to understand what the code is attempting to guarantee—and then it tries to falsify that guarantee.
In practice, that tends to look like a mix of:
- Reading the relevant code path with full repository context, the way a security researcher would, and looking for mismatches between intent and implementation. That includes reading comments, but the model does not take them at face value: if there really is a bug, adding //Halvar says: this is not a bug above your code does not confuse it.
- Reducing the problem to the smallest testable slice (for example, the transformation pipeline around a single input), so the system can reason about it without the rest of the program in the way. To do this, Codex Security pulls out tiny code slices and writes micro-fuzzers for them.
- Reasoning about constraints across transformations, rather than treating each check independently. Where appropriate, this includes formalizing the question as a satisfiability problem: the model has access to a Python environment with z3-solver and uses it when needed, just as a human would for a particularly complicated input-constraint problem. This is especially useful for integer overflows and similar bugs on non-standard architectures.
- Executing hypotheses in a sandboxed validation environment when possible, to distinguish "this could be a problem" from "this is a problem." There is no better proof than a full end-to-end PoC, with the code compiled in debug mode.
This is the key shift: instead of stopping at “a check exists,” the system pushes toward “the invariant holds (or it doesn’t), and here’s the evidence”. And the model chooses the best tool for that job.
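To illustrate the micro-fuzzing idea on the redirect slice described earlier: the sketch below (a deterministic stand-in that exhaustively enumerates a tiny alphabet rather than mutating inputs randomly; the regex is hypothetical) checks the invariant the caller actually relies on, not merely that a check ran:

```python
import re
from itertools import product
from urllib.parse import unquote

# Hypothetical slice under test: validate the encoded value, then decode.
ALLOWLIST = re.compile(r"^/(?!/)[\w%./-]*$")

def redirect_target(raw):
    if not ALLOWLIST.match(raw):
        return None
    return unquote(raw)

def micro_fuzz(alphabet="/%2F", max_len=4):
    """Enumerate short inputs and check the caller's invariant:
    an accepted redirect target must never be protocol-relative."""
    failures = []
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            raw = "".join(chars)
            out = redirect_target(raw)
            if out is not None and out.startswith("//"):
                failures.append((raw, out))
    return failures

failures = micro_fuzz()
assert failures  # e.g. "/%2F" passes validation but decodes to "//"
```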
Why we don’t seed Codex Security with a SAST report
A reasonable reaction is: why not do both? Start with a SAST report, then use the agent to reason deeper.
There are cases where precomputed findings are helpful—especially for narrow, known bug classes. But for an agent designed to discover and validate vulnerabilities in context, starting from a SAST report creates three predictable failure modes.
First, it can encourage premature narrowing. A findings list is a map of where a tool already looked. If you treat it as the starting point, you can bias the system toward spending disproportionate effort in the same regions, using the same abstractions, and missing classes of issues that don’t fit the tool’s worldview.
Second, it can introduce implicit judgments that are hard to unwind. Many SAST findings encode assumptions about sanitization, validation, or trust boundaries. If those assumptions are wrong—or just incomplete—feeding them into the reasoning loop can shift the agent from “investigate” to “confirm or dismiss,” which is not what we want the agent to do.
Third, it can make it harder to evaluate the reasoning system. If the pipeline starts with SAST output, it becomes difficult to separate what the agent discovered through its own analysis from what it inherited from another tool. That separation is important if you want to measure the system’s capabilities accurately, which is needed for the system to improve over time.
So we built Codex Security to begin where security research begins: from the code and the system’s intent, with validation used to raise the confidence bar before we interrupt a human.
SAST tools are still very important
SAST tools can be excellent at what they’re designed for: enforcing secure coding standards, catching straightforward source-to-sink issues, and detecting known patterns at scale with predictable tradeoffs. They can be a strong part of defense-in-depth.
This post is narrower: it’s about why an agent designed to reason about behavior and validate findings should not start its work anchored to a static findings list.
It’s also worth calling out a related limitation of pure source-to-sink thinking: not every vulnerability is a dataflow problem. Many real failures are state and invariant problems—workflow bypasses, authorization gaps, and “the system is in the wrong state” bugs. For these types of bugs, a tainted value does not reach a single “dangerous sink.” The risk is in what the program assumes will always be true.
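A hedged sketch of such a bug (the reset flow, names, and token check are invented for illustration): no tainted value reaches a dangerous sink, but a missing state check lets step three run without step two:

```python
# Hypothetical password-reset workflow. Intended order of steps:
#   request_reset -> verify_token -> set_password
sessions = {}
passwords = {"alice": "old-password"}

def request_reset(user):
    sessions[user] = "requested"

def verify_token(user, token):
    if token == "secret-token":  # stand-in for a real token check
        sessions[user] = "verified"

def set_password(user, new_pw):
    # BUG: any open session suffices; this should require
    # sessions[user] == "verified" before accepting the new password.
    if user in sessions:
        passwords[user] = new_pw
        return True
    return False

# Workflow bypass: the verification step is skipped entirely, yet step 3 succeeds.
request_reset("alice")
assert set_password("alice", "attacker-chosen")
assert passwords["alice"] == "attacker-chosen"
```

Every individual value here is "trusted" by a taint analysis; the vulnerability lives entirely in the state machine.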
Looking ahead
We expect the security tooling ecosystem to keep improving: static analysis, fuzzing, runtime guards, and agentic workflows will all have roles.
What we want Codex Security to be good at is the part that costs the most for security teams: turning “this looks suspicious” into “this is real, here’s how it fails, and here’s a fix that matches system intent.”
If you want to learn more about how Codex Security scans repositories, validates findings, and proposes fixes, see our documentation.