Three lessons for creating a sustainable AI advantage
When ChatGPT launched in 2022, Intercom didn’t just watch the headlines—they mobilized. Within hours of GPT‑3.5's release, the customer service software company began experimenting, and just four months later launched Fin, their AI Agent that now resolves millions of customer queries each month.
That early momentum wasn’t an accident. As LLMs leapt forward, Intercom recognized that AI would reshape customer experience. Leadership acted quickly, spinning up a cross-functional task force, canceling non-AI projects, and committing $100 million to replatform the business around AI.
That decision sparked company-wide changes: reorganized product teams, a new AI-first helpdesk strategy, and a platform built to support Fin in handling high volumes and complex customer queries.
Below are three lessons from Intercom’s journey that any team—no matter where you’re starting—can put to work right now.
“AI-first has to be built in; you can’t bolt it on.”
Paul Adams, Chief Product Officer, Intercom
Lesson 1: Experiment early and often to build model fluency
Intercom tests models early and often, and learns deeply from the work.
The team began experimenting with generative models early, and their hands-on experience helped them map model limitations and spot opportunities. When GPT‑4 became available in early 2023, they were ready. Within four months, they launched Fin—and haven’t slowed down since.
“We were able to leverage GPT‑3.5 to have fluid conversations with glimpses of magic, but it wasn’t yet reliable enough to trust with our customers,” says Jordan Neill, SVP of Engineering. “Because we’d done the work, when GPT‑4 arrived, we knew it was ready, and we shipped Fin.”
That same fluency helped Intercom design Fin Tasks, a system that automates complex workflows like refunds and technical support. While the team initially planned for a reasoning-model-based stack, their evaluations showed GPT‑4.1 could handle the job on its own—with high reliability and lower latency.
Today, GPT‑4.1 powers a growing share of Intercom’s AI usage, including key logic inside Fin Tasks. The team also discovered that adding chain-of-thought prompting to non-reasoning queries closed performance gaps.
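The chain-of-thought technique mentioned above can be sketched as a thin prompt wrapper: a system instruction that nudges a non-reasoning model to write out intermediate steps before answering. The prompt wording, function name, and message shape below are illustrative assumptions, not Intercom's actual code.

```python
# Minimal sketch (assumed, not Intercom's implementation): prepend a
# chain-of-thought instruction to queries sent to a non-reasoning model.

COT_PREFIX = (
    "Think through the problem step by step before answering. "
    "Write out your reasoning, then give the final answer on its own line."
)

def with_chain_of_thought(user_query: str) -> list[dict]:
    """Build a chat-style message list that elicits step-by-step reasoning."""
    return [
        {"role": "system", "content": COT_PREFIX},
        {"role": "user", "content": user_query},
    ]

messages = with_chain_of_thought("Is this order eligible for a refund?")
print(messages[1]["content"])  # Is this order eligible for a refund?
```

The message list would then be passed to whatever chat-completion client the stack uses; the wrapper keeps the CoT nudge in one place so it can be A/B-tested or removed cleanly.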
Intercom’s takeaway: the better you know your models, the faster you can adapt as the state of the art evolves.
In Intercom’s evaluations, GPT‑4.1 showed the highest reliability in completing tasks while delivering a 20% cost reduction compared to GPT‑4o. Completeness numbers were averaged across 5 independent runs (using Pass@k); a result is only counted as “complete” if it is successful in all 5 runs, to reduce variance.
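The strict completeness rule described above (all 5 runs must succeed for a task to count) can be sketched in a few lines. The function name and data shape are illustrative assumptions; only the counting rule comes from the text.

```python
# Sketch of the all-k-runs-must-pass completeness metric described above.
# A task counts as "complete" only if every one of its k runs succeeded.

def completeness(results_per_task: dict[str, list[bool]], k: int = 5) -> float:
    """Fraction of tasks whose k independent runs all succeeded."""
    complete = sum(
        1 for runs in results_per_task.values()
        if len(runs) == k and all(runs)
    )
    return complete / len(results_per_task)

runs = {
    "refund_flow":    [True, True, True, True, True],   # all 5 pass -> complete
    "account_change": [True, True, False, True, True],  # one failure -> incomplete
}
print(completeness(runs))  # 0.5
```

Requiring all k runs to pass is a deliberately conservative estimator: it penalizes flaky behavior that a simple average success rate would hide.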
Lesson 2: Unlock speed with strong evaluations
To move fast, you have to measure what works—and why.
Intercom’s ability to adopt new models, modalities, and architectures quickly is rooted in their rigorous evaluation process. Every new OpenAI model—whether used for Fin Voice, powered by the Realtime API, or for Fin Tasks, powered by GPT‑4.1—is put through structured offline tests and live A/B trials to assess for instruction following, tool call accuracy, and overall coherence before deployment.
For example, the team benchmarks models against transcripts of actual support interactions, evaluating how well they handle multi-step instructions like refunds, maintain Fin’s brand voice, and execute function calls reliably. These results inform live A/B tests that compare resolution rates and customer satisfaction across models like GPT‑4 and GPT‑4.1.
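The offline benchmark loop described above has a simple shape: replay real support transcripts through a candidate model and score whether it produced the expected behavior (here, the expected function call). The transcript format, scoring rule, and all names below are assumptions for illustration, not Intercom's internal tooling.

```python
# Minimal sketch (assumed structure) of an offline eval harness that
# replays support transcripts and checks function-call accuracy.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    transcript: str     # a real support conversation
    expected_call: str  # the function the agent should invoke

def run_offline_eval(cases: list[Case], model: Callable[[str], str]) -> float:
    """Return the fraction of cases where the model chose the expected call."""
    hits = sum(1 for c in cases if model(c.transcript) == c.expected_call)
    return hits / len(cases)

# Stub "model" that always proposes a refund, for demonstration only.
def always_refund(_transcript: str) -> str:
    return "issue_refund"

cases = [
    Case("I want my money back", "issue_refund"),
    Case("Please change my email", "update_account"),
]
print(run_offline_eval(cases, always_refund))  # 0.5
```

Scores from a harness like this are what would gate a candidate model into the live A/B stage described in the text.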
This approach helped Intercom migrate from GPT‑4 to GPT‑4.1 in just days. After confirming improvements in instruction handling and function execution, they rolled out GPT‑4.1 across Fin Tasks and saw immediate gains in both performance and user satisfaction.
“When GPT‑4.1 dropped, we had eval results within 48 hours and a rollout plan right after,” says Pedro Tabacof, Principal Machine Learning Scientist at Intercom. “We immediately saw that GPT‑4.1 had a good mix of intelligence and latency for our customers’ needs.”
For Fin Voice, the same evaluation process helped Intercom validate new voice model snapshots and pinpoint improvements in latency, function execution, and script adherence: all essential to delivering human-quality phone support.
Intercom expanded their evals to capture the extra dimension that voice brings to interactions. They systematically assess Fin Voice for factors like personality, tone, interruption handling, and background noise to ensure high-quality customer experiences.
Lesson 3: Build long-term advantages with architectural flexibility
Intercom built for change from day one, designing an architecture flexible enough to evolve alongside the models it depends on.
Fin’s system is modular by design, supporting multiple modalities like chat, email, and voice, each with different tradeoffs for latency and complexity. The architecture allows Intercom to route queries to the best model for the job and swap models without reengineering the underlying system.
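Routing each modality to a model behind a single lookup, as described above, can be sketched as a config-driven router: swapping a model becomes a one-line config change rather than a code change. The model names and routing table are placeholder assumptions, not Intercom's actual configuration.

```python
# Sketch (assumed) of a model-agnostic router: modality -> model is pure
# config, so models can be swapped without touching calling code.

ROUTES = {
    "chat":  "gpt-4.1",        # latency-sensitive, high volume
    "email": "gpt-4.1",        # asynchronous, tolerates more latency
    "voice": "realtime-model", # placeholder name for a Realtime API snapshot
}

def route(modality: str) -> str:
    """Pick the configured model for a query's modality."""
    try:
        return ROUTES[modality]
    except KeyError:
        raise ValueError(f"no model configured for modality: {modality}")

print(route("chat"))  # gpt-4.1
```

Keeping the mapping in data rather than code is what made the GPT‑4-to-GPT‑4.1 migration described earlier a days-long rollout instead of a rewrite.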
That flexibility is deliberate, and constantly evolving. Fin’s architecture is now on its third major iteration, with the next one already in development. As models improve, the team adds complexity where needed to unlock new capabilities and simplifies where possible.
This adaptability proved critical with Fin Tasks. Initially, the team assumed they’d need reasoning-based models to support Fin Tasks—which enables Fin to resolve complex customer queries and execute multi-step processes like issuing refunds, making account changes, or troubleshooting technical issues.
But in testing, GPT‑4.1’s instruction-following capabilities outperformed expectations, delivering the same reliability at lower latency and cost.
“Honestly, I don’t think people talk about GPT‑4.1 enough,” says Pratik Bothra, Principal Machine Learning Engineer at Intercom. “We were genuinely surprised by the latency and cost profile. It lets us pivot our architecture and remove a lot of complexity.”
Fin AI Engine™
(The diagram shows a modular sub-agent architecture in which queries pass through six LLM-powered stages: vector search, custom chunking, custom reranking, refinement, generation, and validation, emphasizing retrieval, reranking, and multi-stage checks before a final answer is produced.)
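The staged Fin AI Engine pipeline can be sketched as a composition of stages, each receiving the previous stage's state. The stage functions below are placeholder stubs that collapse a few of the diagram's steps for brevity; none of this is Intercom's actual implementation.

```python
# Sketch (assumed) of a staged pipeline: each stage transforms a shared
# state dict and hands it to the next, mirroring the diagram's flow.

from typing import Callable

Stage = Callable[[dict], dict]

def pipeline(stages: list[Stage]) -> Stage:
    """Compose stages so each receives the previous stage's state."""
    def run(state: dict) -> dict:
        for stage in stages:
            state = stage(state)
        return state
    return run

# Placeholder stages following the diagram's order (retrieval through validation).
def vector_search(s): return {**s, "docs": ["kb_article_1"]}
def rerank(s):        return {**s, "docs": sorted(s["docs"])}
def generate(s):      return {**s, "draft": f"Answer based on {s['docs'][0]}"}
def validate(s):      return {**s, "final": s["draft"]}

fin_engine = pipeline([vector_search, rerank, generate, validate])
result = fin_engine({"query": "How do I reset my password?"})
print(result["final"])  # Answer based on kb_article_1
```

Because stages share only a state dict, any one of them (say, the generator's model) can be replaced independently, which is the architectural flexibility the lesson describes.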
Building connected customer experiences through unified data and workflow automation
The team is just getting started. Powered by advanced models and built on a modular, model-agnostic architecture, Intercom is expanding beyond customer support to power workflows across the business, delivering faster resolutions and better customer experiences:
- Support teams: Resolving the majority of inbound queries across chat, email, voice, and more with Fin AI Agent
- Ops teams: Automating complex workflows like refunds, account changes, and subscription updates with Fin Tasks
- Product teams: Using Intercom’s MCP Server, AI tools like ChatGPT can access customer conversations, tickets, and user data, helping teams across the business spot bugs, shape roadmaps, refine messaging, and prepare for QBRs
Intercom built a scalable AI platform by staying rigorous on evaluation, grounded in performance, and flexible in design—redefining support and offering lessons for any company building with AI.
Interested in learning more about ChatGPT for business?
Talk with our team: https://openai.com/contact-sales/