Scaling domain expertise in complex, regulated domains

OpenAI News

Traditional tax research starts by sifting through hundreds of sources before even beginning to interpret them. Tax professionals then spend hours parsing statutes, regulations, rulings, case law, and expert commentary to piece together how the rules interact and distill them into an answer.

Depending on the complexity of the question, this process can take hours, days, or even weeks, and it can still yield inconsistent or outdated results. Every missed nuance carries real costs.

Blue J (https://www.bluej.com/) was founded in 2015 by tax law professors and practitioners who saw the need for better tools in tax research. Building on early work using AI to predict the outcomes of tax law cases, the team went on to develop a variety of products to support tax and accounting firms of all sizes.

When ChatGPT launched, Blue J saw an opportunity to create something new: combining advanced reasoning over dense regulations with structured retrieval to deliver expert-grade tax answers in seconds, complete with inline citations and source lists professionals can trust.

“We moved quickly because we already knew the problem and how to solve it,” said Brett Janssen, CTO of Blue J. “OpenAI gave us the model quality we needed to make that expertise scalable.”

Within two years of launch, Blue J rolled out a tax research engine built on GPT‑4.1 across the U.S., Canada, and the U.K. The team launched its first product just six months after ChatGPT’s debut, then iterated rapidly, learning from feedback and building a robust evaluation framework.

Today, more than 70% of users log in weekly, and users disagree with fewer than 1 in 700 responses. Below, the Blue J team shares its lessons for breaking into a complex domain and scaling fast by making trust, accuracy, and speed non-negotiable.

Leveraging domain expertise to build the solution no one else can

Blue J’s tax research solution is built using a Retrieval-Augmented Generation (RAG) system, combining GPT‑4.1 with a proprietary library of millions of curated documents, including authoritative primary sources and expert commentary from sources like Tax Notes.

When a user asks a tax question such as “What is the SALT torpedo created by OBBBA and how can it be mitigated?”, Blue J instantly retrieves the source material and GPT‑4.1 synthesizes it into a clear, fully cited answer that feels more like guidance from a trusted colleague than a model output. It’s a system that only a team fluent in both tax and technology could build.
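
The retrieve-then-synthesize flow described above can be sketched in a few lines. This is a minimal illustration, not Blue J's implementation: the `Document` type, the keyword-overlap `retrieve` function, the placeholder `LIBRARY`, and the injected `generate` callable (standing in for a call to a model such as GPT‑4.1) are all assumptions made for the example. A production system would use a far richer retriever and a real model client.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Document:
    doc_id: str
    title: str
    text: str


# Placeholder library (hypothetical); a real one would hold curated primary
# sources and expert commentary.
LIBRARY = [
    Document("doc-1", "SALT deduction overview", "Placeholder commentary on the SALT deduction cap."),
    Document("doc-2", "Partnership allocations", "Placeholder commentary on partnership allocations."),
    Document("doc-3", "Estimated payments", "Placeholder notes on the timing of estimated payments."),
]


def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!;:()\"'").lower() for w in text.split()} - {""}


def retrieve(query: str, library: list[Document], k: int = 2) -> list[Document]:
    """Toy retriever: rank documents by word overlap with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(d.title + " " + d.text)), d) for d in library]
    scored = [(s, d) for s, d in scored if s > 0]      # drop non-matching documents
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps library order on ties
    return [d for _, d in scored[:k]]


def build_prompt(question: str, sources: list[Document]) -> str:
    """Number the sources so the model can cite them inline as [n]."""
    numbered = "\n".join(f"[{i}] {d.title}: {d.text}" for i, d in enumerate(sources, 1))
    return (
        "Answer using only the numbered sources and cite them inline as [n].\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )


def answer(question: str, library: list[Document],
           generate: Callable[[str], str], k: int = 2) -> tuple[str, list[Document]]:
    """Retrieve sources, then hand the grounded prompt to the model callable."""
    sources = retrieve(question, library, k)
    return generate(build_prompt(question, sources)), sources


if __name__ == "__main__":
    echo = lambda prompt: prompt  # stand-in for a real model call
    text, sources = answer("How does the SALT deduction cap work?", LIBRARY, generate=echo)
    print([d.doc_id for d in sources])
    print(text)
```

Returning the source list alongside the generated text is what makes a source list and inline citations possible: the numbers in the answer map back to specific retrieved documents.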

“We’ve tested a lot of models, and GPT‑4.1 is the one that consistently does what we need,” says Janssen. “It follows instructions, respects the context, and handles edge cases better than anything else we’ve seen.”

Scale trust through user feedback

In tax, even small mistakes can trigger audits, delay filings, or cost clients real money. Blue J’s team knew from experience that even rare edge cases could erode trust if not caught and fixed quickly. So from day one, they designed the product to capture feedback and improve at scale, guided by the expertise of their tax research team.

To make the system better with every use, Blue J introduced optional data sharing for customers who wanted to contribute to product improvement. Every Blue J answer includes a “disagree” button. When a user flags a response, it is systematically categorized by issue type, tax topic, and likely root cause.

The loop catches one-off errors and reveals patterns. If partnership tax queries underperform, the team investigates. If a prompt yields inconsistent completions, they tune it. GPT‑4.1 powers this triage layer, analyzing thousands of feedback points, clustering related issues, and helping Blue J’s product and tax research teams focus their efforts where they’ll have the biggest impact.
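
To make the pattern-finding step concrete, here is a minimal sketch of how categorized flags could be aggregated. It assumes the flags have already been labeled (in the article, that labeling is done by a GPT‑4.1 triage layer, which is not reproduced here). The `Flag` type, the label values, and the counts are invented for illustration and are not Blue J's data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    issue_type: str
    tax_topic: str
    root_cause: str


# Illustrative numbers only (hypothetical): answers served per topic and the flags received.
ANSWERS_BY_TOPIC = {"partnerships": 1400, "estates": 2100, "sales tax": 700}
FLAGS = [
    Flag("incomplete", "partnerships", "retrieval_miss"),
    Flag("incomplete", "partnerships", "retrieval_miss"),
    Flag("outdated", "partnerships", "retrieval_miss"),
    Flag("unclear", "partnerships", "prompt_ambiguity"),
    Flag("unclear", "estates", "prompt_ambiguity"),
]


def disagree_rate_by_topic(answers_by_topic: dict[str, int], flags: list[Flag]) -> dict[str, float]:
    """Fraction of answers flagged, per tax topic."""
    flagged = Counter(f.tax_topic for f in flags)
    return {t: flagged[t] / n for t, n in answers_by_topic.items() if n > 0}


def topics_to_investigate(rates: dict[str, float], threshold: float) -> list[str]:
    """Topics whose disagree rate exceeds the threshold, in alphabetical order."""
    return sorted(t for t, r in rates.items() if r > threshold)


def top_root_causes(flags: list[Flag], n: int = 3) -> list[tuple[str, int]]:
    """Most common root causes across all flags."""
    return Counter(f.root_cause for f in flags).most_common(n)


if __name__ == "__main__":
    rates = disagree_rate_by_topic(ANSWERS_BY_TOPIC, FLAGS)
    print(topics_to_investigate(rates, threshold=1 / 700))
    print(top_root_causes(FLAGS))
```

With a threshold of 1 in 700, a topic that draws 4 flags on 1,400 answers stands out, while a single flag on 2,100 answers does not. That is the difference between a pattern worth investigating and a one-off error.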

Because GPT‑4.1 responds consistently, even small improvements compound. The result is a system that learns from every interaction, driving faster iteration, higher-quality answers, and a product that evolves in lockstep with user needs. This flywheel helped Blue J bring their disagree rate down to fewer than 1 in every 700 answers.

Iterative product development also helps Blue J stay ahead of change. When a sweeping U.S. tax bill passed in 2025, the team had already spent six weeks mapping its impact across the codebase. Within hours of the bill being signed, users were seeing updated answers in production.

In a field where rules shift and precision is everything, this closed-loop system is what keeps Blue J fast, trusted, and ahead of the market.

Design evaluations that raise the bar, not just test it

Model quality is not just a feature; it’s a gating function. Blue J evaluates every new model release using a benchmark suite of over 350 prompts across U.S., Canadian, and U.K. tax law. Each model is tested for instruction adherence, source alignment, and answer clarity, ensuring improvements to prompts or retrieval logic are reinforced by predictable, real-world behavior.
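
A gating evaluation of this kind might look roughly like the sketch below. The three criteria names come from the article, but everything else is an assumption for illustration: the `Case` type, the simple heuristic scorers (citation present, citation within range, answer under a word limit), and the 0.95 bar are hypothetical stand-ins for whatever checks and thresholds Blue J actually uses.

```python
import re
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Case:
    prompt: str
    num_sources: int  # how many numbered sources the prompt supplies


def cited_indices(answer: str) -> list[int]:
    """Extract inline citation numbers such as [1] or [2]."""
    return [int(n) for n in re.findall(r"\[(\d+)\]", answer)]


def score(case: Case, answer: str) -> dict[str, float]:
    """Hypothetical heuristics, one per criterion named in the article."""
    cites = cited_indices(answer)
    in_range = bool(cites) and all(1 <= c <= case.num_sources for c in cites)
    return {
        "instruction_adherence": 1.0 if cites else 0.0,  # was asked to cite inline
        "source_alignment": 1.0 if in_range else 0.0,    # cites only supplied sources
        "answer_clarity": 1.0 if len(answer.split()) <= 150 else 0.0,
    }


def evaluate(model: Callable[[str], str], suite: list[Case]) -> dict[str, float]:
    """Mean score per criterion across the benchmark suite."""
    if not suite:
        raise ValueError("benchmark suite is empty")
    rows = [score(case, model(case.prompt)) for case in suite]
    return {name: mean(row[name] for row in rows) for name in rows[0]}


def passes_gate(results: dict[str, float], bar: float = 0.95) -> bool:
    """A candidate ships only if every criterion clears the bar."""
    return all(value >= bar for value in results.values())


if __name__ == "__main__":
    suite = [Case("Question A with two sources", 2), Case("Question B with two sources", 2)]
    candidate = lambda prompt: "The rule applies as described [1]."  # stand-in model
    results = evaluate(candidate, suite)
    print(results, passes_gate(results))
```

Requiring every criterion to clear the bar, rather than averaging them together, keeps a strong score on one dimension from masking a regression on another.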

“Despite testing them all, we’ve never shipped a non-OpenAI model,” says Janssen. “OpenAI models have consistently outperformed against our internal benchmarks, especially when it comes to instruction-following and delivering answers that meet our bar for real-world use.”

Leverage your domain expertise as your advantage

Customers rely on Blue J as their default tax research tool year-round, not just during tax season. More than 70% of users log in weekly, saving 2.7 hours per user per week on research and client communication, time they reinvest into higher-margin planning and advisory work.

Blue J earned that level of engagement by focusing on what they uniquely know best: the real-world needs of tax professionals. GPT‑4.1 amplifies their expertise with its structured outputs, consistent instruction-following, and reliable results. For founders, the takeaway is clear: leverage your specific expertise as your core advantage, and use the right models to scale.

Interested in learning more about ChatGPT for business?

Talk with our team: https://openai.com/contact-sales/
