Inside OpenAI’s in-house data agent
Data powers how systems learn, how products evolve, and how companies make choices. But getting answers quickly, correctly, and with the right context is often harder than it should be. To make this easier as OpenAI scales, we built a bespoke in-house AI data agent that explores and reasons over our own platform.
Our agent is a custom internal-only tool (not an external offering), built specifically around OpenAI’s data, permissions, and workflows. We’re showing how we built and use it to help surface examples of the real, impactful ways AI can support day-to-day work across our teams. The OpenAI tools we used to build and run it (Codex, our GPT‑5 flagship model, the Evals API, and the Embeddings API) are the same tools we make available to developers everywhere.
Our data agent lets employees go from question to insight in minutes, not days. This lowers the bar for pulling data and running nuanced analysis across all functions, not just the data team. Today, teams across Engineering, Data Science, Go-To-Market, Finance, and Research at OpenAI lean on the agent to answer high-impact data questions. For example, it can help evaluate launches and assess business health, all through the intuitive format of natural language. The agent combines Codex-powered table-level knowledge with product and organizational context. Its continuously learning memory system means it also improves with every turn.
In this post, we’ll break down why we needed a bespoke AI data agent, what makes its code-enriched data context and self-learning so useful, and lessons we learned along the way.
Why we needed a custom tool
OpenAI’s data platform serves more than 3.5k internal users working across Engineering, Product, and Research, spanning over 600 petabytes of data across 70k datasets. At that size, simply finding the right table can be one of the most time-consuming parts of doing analysis.
As one internal user put it:
“We have a lot of tables that are fairly similar, and I spend tons of time trying to figure out how they’re different and which to use. Some include logged-out users, some don’t. Some have overlapping fields; it’s hard to tell what is what.”
Even with the correct tables selected, producing correct results can be challenging. Analysts must reason about table data and table relationships to ensure transformations and filters are applied correctly. Common failure modes—many-to-many joins, filter pushdown errors, and unhandled nulls—can silently invalidate results. At OpenAI’s scale, analysts should not have to sink time into debugging SQL semantics or query performance: their focus should be on defining metrics, validating assumptions, and making data-driven decisions.
[Figure: an example SQL statement running 180+ lines; it’s not easy to tell whether it joins the right tables and queries the right columns.]
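To make the first of those failure modes concrete, here is a minimal, hypothetical illustration (toy tables in pandas, not our actual schema) of how an unnoticed many-to-many join silently inflates a count:

```python
import pandas as pd

# Hypothetical data: one row per subscription, one row per login event.
subscriptions = pd.DataFrame({
    "user_id": [1, 1, 2],            # user 1 has two subscriptions
    "plan":    ["pro", "team", "pro"],
})
logins = pd.DataFrame({
    "user_id": [1, 1, 1, 2],         # user 1 logged in three times
    "day":     ["mon", "tue", "wed", "mon"],
})

# Joining on user_id is many-to-many: 2 subscription rows x 3 login rows
# for user 1 fan out into 6 rows, silently inflating any downstream count.
joined = subscriptions.merge(logins, on="user_id")
print(len(logins))   # 4 actual login events
print(len(joined))   # 7 rows after the join (6 for user 1 + 1 for user 2)

# Deduplicating (or aggregating before the join) restores the true count.
print(joined.drop_duplicates(["user_id", "day"]).shape[0])  # back to 4
```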
How it works
Let’s walk through what our agent is, how it curates context, and how it keeps self-improving.
Our agent is powered by GPT‑5.2 and is designed to reason over OpenAI’s data platform. It’s available wherever employees already work: as a Slack agent, through a web interface, inside IDEs, in the Codex CLI via MCP, and directly in OpenAI’s internal ChatGPT app through an MCP connector.
Users can ask complex, open-ended questions which would typically require multiple rounds of manual exploration. Take this example prompt, which uses a test data set: “For NYC taxi trips, which pickup-to-dropoff ZIP pairs are the most unreliable, with the largest gap between typical and worst-case travel times, and when does that variability occur?”
The agent handles the analysis end-to-end, from understanding the question to exploring the data, running queries, and synthesizing findings.
[Figure: the agent’s response to the question, with analysis, data, query, and graph views.]
One of the agent’s superpowers is how it reasons through problems. Rather than following a fixed script, the agent evaluates its own progress. If an intermediate result looks wrong (e.g., if it has zero rows due to an incorrect join or filter), the agent investigates what went wrong, adjusts its approach, and tries again. Throughout this process, it retains full context, and carries learnings forward between steps. This closed-loop, self-learning process shifts iteration from the user into the agent itself, enabling faster results and consistently higher-quality analyses than manual workflows.
[Figure: the agent’s reasoning to identify the most unreliable NYC taxi pickup–dropoff pairs.]
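The loop itself is conceptually simple. Below is a rough sketch of the pattern, with a hypothetical run_sql helper, placeholder prompts, and an assumed model name; it is not the agent’s actual implementation:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5"  # assumption: whichever reasoning model backs the agent

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer(question: str, run_sql, max_attempts: int = 3) -> str:
    """Generate SQL, sanity-check the result, and retry with the failure in context."""
    context = f"Question: {question}"
    for _ in range(max_attempts):
        sql = ask(f"{context}\nWrite a single SQL query that answers the question.")
        try:
            rows = run_sql(sql)  # hypothetical warehouse client
        except Exception as err:
            context += f"\nPrevious query failed with: {err}\n{sql}"
            continue
        if not rows:  # e.g., zero rows from a bad join or filter
            context += f"\nPrevious query returned zero rows; revisit joins and filters:\n{sql}"
            continue
        return ask(f"{context}\nQuery:\n{sql}\nRows:\n{rows[:20]}\nSummarize the findings.")
    return "Could not produce a reliable answer; flagging for manual review."
```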
The agent covers the full analytics workflow: discovering data, running SQL, and publishing notebooks and reports. It understands internal company knowledge, can web search for external information, and improves over time through learned usage and memory.
Context is everything
High-quality answers depend on rich, accurate context. Without context, even strong models can produce wrong results, such as vastly misestimating user counts or misinterpreting internal terminology.
The agent without memory, unable to query effectively.
The agent’s memory enables faster queries by locating the correct tables.
To avoid these failure modes, the agent is built around multiple layers of context that ground it in OpenAI’s data and institutional knowledge.
Layer #1: Table Usage
- Metadata grounding: The agent relies on schema metadata (column names and data types) to inform SQL writing and uses table lineage (e.g., upstream and downstream table relationships) to provide context on how different tables relate.
- Query inference: Ingesting historical queries helps the agent understand how to write its own queries and which tables are typically joined together.
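As a rough illustration of the query-inference idea (not the production pipeline), historical queries can be mined for which tables tend to be joined together; the sample queries below are made up:

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical sample of historical queries pulled from the warehouse query log.
history = [
    "SELECT ... FROM analytics.conversations c JOIN analytics.users u ON ...",
    "SELECT ... FROM analytics.conversations c JOIN analytics.messages m ON ...",
    "SELECT ... FROM analytics.users u JOIN analytics.subscriptions s ON ...",
]

TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

pair_counts = Counter()
for query in history:
    tables = sorted(set(TABLE_REF.findall(query)))
    pair_counts.update(combinations(tables, 2))

# The most frequently co-joined table pairs become hints in the agent's context.
for pair, count in pair_counts.most_common(5):
    print(pair, count)
```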
Layer #2: Human Annotations
- Curated descriptions of tables and columns provided by domain experts, capturing intent, semantics, business meaning, and known caveats that are not easily inferred from schemas or past queries.
Metadata alone isn’t enough. To really tell tables apart, you need to understand how they were created and where they originate.
Layer #3: Codex Enrichment
- By deriving a code-level definition of a table, the agent builds a deeper understanding of what the data actually contains.
- Nuances about what is stored in the table and how it is derived from an analytics event provide extra information. For example, it can give context on the uniqueness of values, how often the table data is updated, and the scope of the data (e.g., whether the table excludes certain fields, or what level of granularity it has).
- This provides enhanced usage context by showing how the table is used beyond SQL in Spark, Python, and other data systems.
- This means that the agent can distinguish between tables that look similar but differ in critical ways. For example, it can tell whether a table only includes first-party ChatGPT traffic. This context is also refreshed automatically, so it stays up to date without manual maintenance.
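A minimal sketch of what this enrichment step could look like, assuming a hypothetical load_pipeline_source helper that fetches the code defining a table (the real system relies on Codex crawling the codebase):

```python
import json
from openai import OpenAI

client = OpenAI()

ENRICHMENT_PROMPT = """You are documenting a data table from the code that produces it.
From the pipeline source below, describe: uniqueness of rows, update frequency,
what traffic or events are included or excluded, and known caveats.
Answer as JSON with keys: uniqueness, update_frequency, scope, caveats.

{source}"""

def enrich_table(table_name: str, load_pipeline_source) -> dict:
    """Derive code-level facts about a table from the pipeline that builds it."""
    source = load_pipeline_source(table_name)  # hypothetical: fetch the defining code
    resp = client.chat.completions.create(
        model="gpt-5",  # assumption
        messages=[{"role": "user", "content": ENRICHMENT_PROMPT.format(source=source)}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```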
Layer #4: Institutional Knowledge
- The agent can access Slack, Google Docs, and Notion, which capture critical company context such as launches, reliability incidents, internal codenames and tools, and the canonical definitions and computation logic for key metrics.
- These documents are ingested, embedded, and stored with metadata and permissions. A retrieval service handles access control and caching at runtime, enabling the agent to efficiently and safely pull in this information.
Layer #5: Memory
- When the agent is given corrections or discovers nuances about certain data questions, it's able to save these learnings for next time, allowing it to constantly improve with its users.
- As a result, future answers begin from a more accurate baseline rather than repeatedly encountering the same issues.
- The goal of memory is to retain and reuse non-obvious corrections, filters, and constraints that are critical for data correctness but difficult to infer from the other layers alone.
- For example, in one case, the agent didn’t know how to filter for a particular analytics experiment (the filter relied on matching a specific string defined in an experiment gate). Memory was crucial here to ensure it filtered correctly instead of falling back to fuzzy string matching.
- When you give the agent a correction or when it finds a learning from your conversation, it will prompt you to save that memory for next time.
- Memories can also be manually created and edited by users.
- Memories are scoped at the global and personal level, and the agent’s tooling makes it easy to edit them.
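A simplified sketch of how such a memory store might be shaped; the fields and scoping here are illustrative rather than the agent’s actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Memory:
    text: str                 # e.g., "Experiment X is gated on flag string 'exp_x_v2'"
    scope: str                # "global" (shared) or "personal" (single user)
    owner: str | None = None  # set for personal memories
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class MemoryStore:
    def __init__(self):
        self._memories: list[Memory] = []

    def save(self, text: str, scope: str = "personal", owner: str | None = None) -> Memory:
        memory = Memory(text=text, scope=scope, owner=owner)
        self._memories.append(memory)
        return memory

    def for_user(self, user: str) -> list[Memory]:
        """Memories injected into the agent's context for this user."""
        return [m for m in self._memories
                if m.scope == "global" or m.owner == user]

# When the agent receives a correction, it offers to persist it:
store = MemoryStore()
store.save("Filter trips to status = 'completed' for reliability metrics", scope="global")
```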
Layer #6: Runtime Context
- When no prior context exists for a table or when existing information is stale, the agent can issue live queries to the data warehouse to inspect and query the table directly. This allows it to validate schemas, understand the data in real-time, and respond accordingly.
- The agent is also able to talk to other Data Platform systems (metadata service, Airflow, Spark) as needed to get broader data context that exists outside the warehouse.
We run a daily offline pipeline that aggregates table usage, human annotations, and Codex-derived enrichment into a single, normalized representation. This enriched context is then converted into embeddings using the OpenAI embeddings API and stored for retrieval. At query time, the agent pulls only the most relevant embedded context via retrieval-augmented generation (RAG) instead of scanning raw metadata or logs. This makes table understanding fast and scalable, even across tens of thousands of tables, while keeping runtime latency predictable and low. Runtime queries are issued to our data warehouse live as needed.
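A condensed sketch of that flow, using the Embeddings API with an in-memory stand-in for the vector store; the table descriptions, model choice, and helper names are placeholders:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-large"  # assumption

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([item.embedding for item in resp.data])

# Offline (daily): normalize each table's usage stats, annotations, and
# Codex-derived enrichment into one document, then embed and store it.
table_context = {
    "analytics.chat_conversations": "Daily ChatGPT conversations; first-party traffic only; ...",
    "analytics.chat_conversations_all": "Includes logged-out and third-party traffic; ...",
}
table_names = list(table_context)
table_vectors = embed(list(table_context.values()))

# Online (query time): embed the question and retrieve the closest tables.
def relevant_tables(question: str, k: int = 5) -> list[str]:
    q = embed([question])[0]
    scores = table_vectors @ q / (
        np.linalg.norm(table_vectors, axis=1) * np.linalg.norm(q)
    )
    return [table_names[i] for i in np.argsort(-scores)[:k]]

print(relevant_tables("How many first-party ChatGPT conversations happened last week?"))
```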
Together, these layers ensure the agent’s reasoning is grounded in OpenAI’s data, code, and institutional knowledge, dramatically reducing errors and improving answer quality.
Built to think and work like a teammate
One-shot answers work when the problem is clear, but most questions aren’t. More often, arriving at the correct result requires back-and-forth refinement and some course correction.
The agent is built to behave like a teammate you can reason with: a conversational, always-on assistant that handles both quick answers and iterative exploration.
It carries over complete context across turns, so users can ask follow-up questions, adjust their intent, or change direction without restating everything. If the agent starts heading down the wrong path, users can interrupt mid-analysis and redirect it, just like working with a human collaborator who listens instead of plowing ahead.
When instructions are unclear or incomplete, the agent proactively asks clarifying questions. If no response is provided, it applies sensible defaults to make progress. For example, if a user asks about business growth with no date range specified, it may assume the last seven or 30 days. These priors allow it to stay responsive and non-blocking while still converging on the right outcome.
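For instance, a time-range prior might be applied along these lines (the regex and 30-day fallback are purely illustrative):

```python
import re
from datetime import date, timedelta

def resolve_date_range(question: str, default_days: int = 30) -> tuple[date, date]:
    """Use an explicit range if the user gave one; otherwise fall back to a default window."""
    match = re.search(r"last\s+(\d+)\s+days", question, re.IGNORECASE)
    days = int(match.group(1)) if match else default_days
    end = date.today()
    return end - timedelta(days=days), end

print(resolve_date_range("How is business growth trending?"))    # defaults to the last 30 days
print(resolve_date_range("Show signups over the last 7 days"))   # honors the explicit range
```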
The result is an agent that works well when you know exactly what you want (e.g., “Tell me about this table”) and just as well when you’re exploring (e.g., “I’m seeing a dip here, can we break this down by customer type and timeframe?”).
After rollout, we observed that users frequently ran the same analyses for routine repetitive work. To expedite this, the agent's workflows package recurring analyses into reusable instruction sets. Examples include workflows for weekly business reports and table validations. By encoding context and best practices once, workflows streamline repeat analyses and ensure consistent results across users.
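Conceptually, a workflow is a named, reusable bundle of instructions and context. A hypothetical shape, not the actual workflow format:

```python
from dataclasses import dataclass

@dataclass
class Workflow:
    name: str
    instructions: str      # the encoded context and best practices
    schedule: str | None   # e.g., a cron expression for recurring runs

weekly_business_report = Workflow(
    name="weekly-business-report",
    instructions=(
        "Summarize WAU, new signups, and revenue for the past week. "
        "Use the canonical metric definitions; compare against the prior week; "
        "flag any metric that moved more than 10%."
    ),
    schedule="0 9 * * MON",  # every Monday morning
)

table_validation = Workflow(
    name="table-validation",
    instructions=(
        "For the given table, check row counts vs. the 30-day average, "
        "null rates on key columns, and freshness of the latest partition."
    ),
    schedule=None,  # run on demand
)
```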
Moving fast without breaking trust
Building an always-on, evolving agent means quality can drift just as easily as it can improve. Without a tight feedback loop, regressions are inevitable and invisible. The only way to scale capability without breaking trust is through systematic evaluation.
In this section, we’ll discuss how we leverage OpenAI’s Evals API to measure and protect the agent’s response quality.
The agent’s evals are built on curated sets of question-answer pairs. Each question targets an important metric or analytical pattern we care deeply about getting right, paired with a manually authored “golden” SQL query that produces the expected result. For each eval, we send the natural language question to the agent’s query-generation endpoint, execute the generated SQL, and compare the output against the result of the expected SQL.
Evaluation doesn’t rely on naive string matching. Generated SQL can differ syntactically while still being correct, and result sets may include extra columns that don’t materially affect the answer. To account for this, we compare both the SQL and the resulting data, and feed these signals into OpenAI’s Evals grader. The grader produces a final score along with an explanation, capturing both correctness and acceptable variation.
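In outline, grading a single eval case looks roughly like this; the generate_sql and run_sql helpers are hypothetical, and in practice the comparison signals are fed to a grader via the Evals API rather than graded ad hoc:

```python
from openai import OpenAI

client = OpenAI()

def grade_case(question: str, golden_sql: str, generate_sql, run_sql) -> dict:
    """Compare the agent's SQL and results against a hand-written golden query."""
    generated_sql = generate_sql(question)   # hypothetical query-generation endpoint
    expected = run_sql(golden_sql)
    actual = run_sql(generated_sql)

    grading_prompt = (
        "You are grading a SQL answer. Judge whether the generated query and its "
        "results answer the question as well as the golden query. Ignore harmless "
        "differences such as column order, aliases, or extra columns.\n"
        f"Question: {question}\n"
        f"Golden SQL: {golden_sql}\nGolden rows: {expected[:20]}\n"
        f"Generated SQL: {generated_sql}\nGenerated rows: {actual[:20]}\n"
        "Reply with PASS or FAIL followed by a one-sentence explanation."
    )
    verdict = client.chat.completions.create(
        model="gpt-5",  # assumption
        messages=[{"role": "user", "content": grading_prompt}],
    ).choices[0].message.content
    return {"question": question, "verdict": verdict}
```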
These evals work like unit tests that run continuously during development and serve as canaries in production to surface regressions; this allows us to catch issues early and iterate confidently as the agent’s capabilities expand.
Agent security
Our agent plugs directly into OpenAI’s existing security and access-control model. It operates purely as an interface layer, inheriting and enforcing the same permissions and guardrails that govern OpenAI’s data.
All of the agent’s access is strictly pass-through, meaning users can only query tables they already have permission to access. When access is missing, it flags this or falls back to alternative datasets the user is authorized to use.
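In effect, every query the agent issues is wrapped in a check against the caller’s own grants, something like the sketch below (the user_can_read and run_sql helpers are hypothetical):

```python
class PermissionDenied(Exception):
    pass

def run_as_user(user: str, sql: str, referenced_tables: list[str],
                user_can_read, run_sql):
    """Pass-through access: the agent never has more privileges than the caller."""
    denied = [t for t in referenced_tables if not user_can_read(user, t)]
    if denied:
        # Surface the gap instead of silently querying; the agent can then
        # suggest alternative datasets the user is authorized to use.
        raise PermissionDenied(f"{user} lacks access to: {', '.join(denied)}")
    return run_sql(sql, on_behalf_of=user)
```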
Finally, it's built for transparency. Like any system, it can make mistakes. It exposes its reasoning process by summarizing assumptions and execution steps alongside each answer. When queries are executed, it links directly to the underlying results, allowing users to inspect raw data and verify every step of the analysis.
Lessons learned
Building our agent from scratch surfaced practical lessons about how agents behave, where they struggle, and what actually makes them reliable at scale.
Lesson #1: Less is More
Early on, we exposed our full tool set to the agent and quickly ran into problems with overlapping functionality. While this redundancy can be helpful for specific custom cases and is easy for a human to disambiguate when invoking tools manually, it’s confusing to agents. To reduce ambiguity and improve reliability, we restricted and consolidated certain tool calls.
Lesson #2: Guide the Goal, Not the Path
We also discovered that highly prescriptive prompting degraded results. While many questions share a general analytical shape, the details vary enough that rigid instructions often pushed the agent down incorrect paths. By shifting to higher-level guidance and relying on GPT‑5’s reasoning to choose the appropriate execution path, the agent became more robust and produced better results.
Lesson #3: Meaning Lives in Code
Schemas and query history describe a table’s shape and usage, but its true meaning lives in the code that produces it. Pipeline logic captures assumptions, freshness guarantees, and business intent that never surface in SQL or metadata. By crawling the codebase with Codex, our agent understands how datasets are actually constructed and is able to better reason about what each table actually contains. It can answer “what’s in here” and “when can I use it” far more accurately than from warehouse signals alone.
Same vision, new tools
We’re constantly working to improve our agent by increasing its ability to handle ambiguous questions, improving its reliability and accuracy with stronger validations, and integrating it more deeply into workflows. We believe it should blend naturally into how people already work, instead of functioning like a separate tool.
While our tooling will keep benefiting from underlying improvements in agent reasoning, validation, and self-correction, our team’s mission remains the same: seamlessly deliver fast, trustworthy data analysis across OpenAI’s data ecosystem.