Understanding AI and learning outcomes
Education is one of AI’s most promising frontiers. With tools like ChatGPT, personalized learning support can be available to any student, anywhere, at any time.
But the education sector is still early in its understanding of the impact of AI on learning outcomes. Last year, our team set out to study the use of tools like study mode and found promising gains in student performance. But our research also raised an important question: how can we assess how AI influences a learner's progress over time, not just on a final exam?
This is a broader ecosystem challenge. To date, most research methods focus on narrow performance signals, such as test scores, and lack the ability to assess how students actually learn with AI in real-world settings, or how that use shapes outcomes over time.
To address this gap, we developed the Learning Outcomes Measurement Suite, a framework created with Estonia’s University of Tartu and the SCALE Initiative at the Stanford Accelerator for Learning to support longitudinal measurement of learning outcomes across different educational contexts.
Extensive validation is underway through a randomized controlled trial. Building on prior collaborative studies, further research is planned with founding organizations in the Learning Lab, OpenAI’s learning research ecosystem, including researchers from Arizona State University, UCL Knowledge Lab, and MIT Media Lab.
Today, we’re sharing an overview of how the measurement suite works and why it matters. Over time, we intend to publish more research and release the measurement suite as a public resource for schools, universities, and education systems worldwide.
“This research allows us to learn quickly while also laying the groundwork for a deeper understanding of how AI can be thoughtfully integrated into schools in ways that truly matter. We want to understand how these tools can support rigorous academic learning while also cultivating higher-order thinking, creativity, curiosity, and students’ confidence in themselves as learners.” –Susanna Loeb, Professor of Education and Faculty Director, SCALE Initiative at Stanford University
Summary of takeaways
- Today’s research methods on the impact of AI on learning show promising signals about performance, but don’t capture the full picture of how AI affects learning outcomes over time.
- The Learning Outcomes Measurement Suite will, for the first time, provide a standard framework for longitudinal studies that help educators, researchers, and institutions understand how AI shapes learning and outcomes across different contexts.
- OpenAI’s Learning Lab is a new research ecosystem focused on advancing this work. OpenAI will publish findings alongside a range of partners as the field continues to develop.
Origins and early research
When students use AI tools to study and learn, it can mean many different things—from going to AI for quick answers to using it to work through problems step by step with tutor-like guidance. To encourage users to engage with ChatGPT in ways that support deeper understanding and skill-building, OpenAI introduced study mode last year. Under the hood, study mode is powered by custom system instructions we’ve written in collaboration with teachers, scientists, and pedagogy experts to reflect a core set of behaviors that support true learning, not just answers—using scaffolding, checks for understanding, and guided practice.
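To make the mechanism concrete, here is a minimal sketch of how pedagogically aligned system instructions can be layered onto a model call with the OpenAI Python SDK. The instruction text, model name, and helper function are illustrative assumptions; study mode’s actual instructions have not been published.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instructions in the spirit of study mode: scaffold, check
# understanding, and guide practice rather than handing over the answer.
STUDY_INSTRUCTIONS = """\
You are a patient tutor. Do not give final answers outright.
- Break problems into smaller steps (scaffolding).
- After each step, ask a short question to check understanding.
- Offer a guided practice problem before moving on.
"""

def tutor_reply(student_message: str) -> str:
    """Send one student turn through the tutoring instructions."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": STUDY_INSTRUCTIONS},
            {"role": "user", "content": student_message},
        ],
    )
    return response.choices[0].message.content

print(tutor_reply("Why does the action potential have a refractory period?"))
```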
To test whether this kind of pedagogically aligned AI interaction style translates into better learning outcomes, we ran a randomized study with over 300 college students preparing for neuroscience and microeconomics exams. While analysis is still underway, early results give us confidence that a pedagogically aligned AI interaction style, encouraged through features like study mode, can improve learning outcomes. But this research also surfaced an important reality: what really matters is whether the gains and associated productive behaviors remain durable over time.
Study design
Participants were assigned to one of three groups. A control group studied using traditional online resources such as Google Search and YouTube, with AI-generated overview features disabled; the two other groups were each given access to one of two study mode variants designed to guide students through the learning process in slightly different ways. Baseline quizzes and onboarding surveys were collected ahead of time to adjust for differences in prior coursework exposure, study habits, academic confidence, and familiarity with AI tools. Students completed timed study mode sessions before each exam, with the two study mode variants counterbalanced across subjects.
This setup was designed to reflect real-world study conditions rather than a tightly controlled lab environment. Participation was not tied to exam performance, and not all students used study mode to the same extent during the nominal 40-minute sessions. This allowed us to measure and report intention-to-treat (ITT) effects: the causal impact of being offered study mode under realistic rollout conditions, acknowledging that engagement varies in practice.
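For readers unfamiliar with the estimator, ITT compares outcomes by assigned group regardless of how much each student actually used the tool. Below is a minimal sketch of how such an effect might be estimated with statsmodels; the toy data and column names (exam_score, assigned_study_mode, baseline_quiz) are hypothetical, and the study’s actual model specification has not been published.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical analysis dataset: one row per student.
df = pd.DataFrame({
    "exam_score": [72, 85, 78, 90, 66, 81, 74, 88],
    "assigned_study_mode": [0, 1, 0, 1, 0, 1, 0, 1],  # 1 = offered study mode
    "baseline_quiz": [60, 70, 65, 75, 55, 68, 62, 72],
})

# ITT regression: outcome on assignment, adjusting for the baseline quiz.
# The coefficient on assigned_study_mode estimates the effect of being
# *offered* the tool, independent of how much each student used it.
model = smf.ols("exam_score ~ assigned_study_mode + baseline_quiz", data=df).fit()
print(model.summary().tables[1])
```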
Findings
We measured performance on each exam separately. In our randomized study, improvements were not uniform across subjects, and levels of engagement with study mode varied across participants.
- Neuroscience (primary ITT): We observed directionally positive differences for study mode relative to the control group, but the results were not statistically distinguishable from those of students studying with traditional online resources. Some onboarding and technical issues reduced time spent studying among students using study mode.
- Microeconomics (primary ITT): We observed meaningful gains in exam performance among students assigned access to study mode: scores roughly 15% higher than those of the no-AI control group.
Study mode (variants A & B) vs Control (no AI group): Adjusted mean exam scores
The effect remains consistent when we compare each study mode variant separately with the control.
While these results reflect real-world variation, they also highlight a deeper limitation in how learning outcomes are typically measured.
Most existing evaluation approaches rely on fixed interventions assessed over short time windows, using outcomes such as test scores or final essays as primary signals. These methods are not designed to capture the core mechanism through which AI affects learning in practice: ongoing, personalized interactions that evolve alongside a learner’s own strategies, preferences, and study habits. Nor do they surface whether improvements in one capability, such as short-term recall, may come alongside trade-offs in others, such as persistence, autonomous motivation, or creative problem solving. As a result, they miss the longitudinal cognitive effects that ultimately determine whether AI meaningfully improves learning.
Because learning environments differ widely across countries, curricula, and institutional goals, outcomes from one-off studies rarely generalize across systems. Measurement approaches must therefore be flexible enough for different education systems to define what success looks like in their context, evaluate AI against their own standards, and iterate accordingly.
Building a better measurement system
Based on what we learned from OpenAI’s study mode research, we have been building a structured measurement system to measure AI’s impact on learners at scale and to create a mechanism for improving models based on those outcomes. It is grounded in three signals: how the model behaves, how learners respond, and what measurable cognitive outcomes result over time. It includes the following components (a code sketch of how the classifier and grader stages might fit together appears after the list):
- System instructions to refine model behavior: natural-language instructions that change the model’s default behavior to better align with specific pedagogical approaches.
- Learning interaction classifiers: these automatically detect “learning moments” within real, de-identified, learner–model interactions and label salient characteristics such as engagement and error correction.
- Learning quality graders: these evaluate and score each of those learning moments by whether the learner achieved their objective and the degree to which the interaction followed strong pedagogical principles, including identification of failure modes.
- Longitudinal learning graders: these track changes in the same learner’s interactions with the model over time—including engagement, persistence, and metacognitive strategies—at the individual and cohort levels.
- Standardized cognitive and metacognitive measures: these are validated third-party instruments delivered via ChatGPT before, during, and after access to establish baselines and measure changes in foundational capabilities such as critical thinking, creativity, and memory.
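The sketch below illustrates, under heavy assumptions, how the classifier and grader stages might compose: a classifier flags learning moments in de-identified transcripts, and a grader scores each moment against a simple rubric. OpenAI has not published implementations of these components; every function, label, and rubric dimension here is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str  # "learner" or "model"
    text: str

@dataclass
class LearningMoment:
    turns: list[Turn]
    labels: dict[str, bool] = field(default_factory=dict)
    rubric_scores: dict[str, float] = field(default_factory=dict)

def classify_moments(transcript: list[Turn]) -> list[LearningMoment]:
    """Toy classifier: treat each learner question plus the model's reply
    as one candidate learning moment. A real system would use a trained
    or LLM-based classifier over de-identified interactions."""
    moments = []
    for i, turn in enumerate(transcript[:-1]):
        if turn.role == "learner" and "?" in turn.text:
            moment = LearningMoment(turns=[turn, transcript[i + 1]])
            moment.labels["engagement"] = len(turn.text.split()) > 5
            moments.append(moment)
    return moments

def grade_moment(moment: LearningMoment) -> LearningMoment:
    """Toy grader: score the moment on hypothetical rubric dimensions,
    e.g. whether the model scaffolded instead of answering outright."""
    reply = moment.turns[-1].text.lower()
    moment.rubric_scores["scaffolding"] = 1.0 if "step" in reply else 0.0
    moment.rubric_scores["checks_understanding"] = 1.0 if "?" in reply else 0.0
    return moment

transcript = [
    Turn("learner", "Can you help me understand marginal cost?"),
    Turn("model", "Sure. Step one: what happens to total cost if we make one more unit?"),
]
for m in classify_moments(transcript):
    print(grade_moment(m).rubric_scores)
```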
When combined, we refer to this measurement system as the Learning Outcomes Measurement Suite.
It produces important signals the education ecosystem can use: structured views of learning moments, dashboards showing how outcomes shift over time across cohorts, indicators of model performance against teaching and tutoring rubrics, and outcome measures aligned to standardized assessments and short learner questionnaires. Where available, it can incorporate partner-provided ground truth such as exam scores, classroom observations, or attendance. All data is de-identified.
It also enables our partners to understand the deeper cognitive impacts of using AI for learning over time, since the system can track effects on capabilities such as the following (a sketch of how such metrics might be computed appears after the list):
- Autonomous Motivation: the degree to which learners are shaping their own studies vs being directed by the model
- Productive Engagement: the frequency, variety, and quality of pedagogical interactions
- Task Persistence: the degree to which a learner sits with and pushes through cognitive challenges
- Metacognition: the frequency and quality of learners’ efforts to plan, reflect on, and monitor their approaches to studying
- Recall: the accuracy with which a learner can remember content from previous interactions
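As a minimal sketch of what longitudinal capability tracking could look like, the snippet below computes per-learner persistence and recall trends from labeled session data with pandas. The schema and metric definitions are hypothetical, not the suite’s actual measures.

```python
import pandas as pd

# Hypothetical de-identified session log: one row per study session,
# with labels produced upstream by the interaction classifiers/graders.
sessions = pd.DataFrame({
    "learner_id":   ["a1", "a1", "a1", "b2", "b2", "b2"],
    "week":         [1, 2, 3, 1, 2, 3],
    "persisted":    [0, 1, 1, 1, 1, 0],   # pushed through a hard problem
    "recall_score": [0.4, 0.6, 0.7, 0.8, 0.7, 0.9],
})

# Per-learner persistence rate across the observation window.
persistence = sessions.groupby("learner_id")["persisted"].mean()

# Week-over-week recall trend per learner (mean first difference).
recall_trend = (
    sessions.sort_values("week")
            .groupby("learner_id")["recall_score"]
            .apply(lambda s: s.diff().mean())
)

print(pd.DataFrame({"persistence_rate": persistence, "recall_trend": recall_trend}))
```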
This reflects our overall effort not simply to focus on narrow definitions of learning outcomes (rising test scores), but on the holistic capabilities that underpin learning. It also reflects our belief that there will be no silver bullet in terms of what to optimize for: systems and educators will need to be empowered to weigh trade-offs in alignment with pedagogical best practices.
Where we go from here
We are validating the Learning Outcomes Measurement Suite through large-scale studies before making it broadly available. This work is underway with the University of Tartu and Stanford’s SCALE Initiative across nation-scale partners like Estonia, where the measurement suite is being studied with nearly 20,000 students aged 16–18 over several months. Student use will happen in close collaboration with local leaders to ensure safety and alignment with local curricula.
“Estonia has always approached education not as static but as a system we continuously improve. With AI becoming part of that picture, the big question is how we measure AI's long-term impact on learning. That's what we're figuring out in collaboration with OpenAI. Students are keen to be involved in the development process, and many want to learn how to support learning with AI. It feels like a real turning point, and we’re excited to contribute methods that other education systems can reuse and build on.” –Jaan Aru, University of Tartu
This work builds on a broader body of collaborative research underway. In addition to the outcomes research being conducted through founding partners in the Learning Lab, OpenAI is supporting studies at the intersection of learning and labor, examining how AI shapes students’ academic pathways, career decisions, and the ways institutions can support responsible adoption. This research is happening across Bocconi University, Innova Schools, the Tuck School of Business at Dartmouth, San Diego State University, Stony Brook University, and others.
As we run longer-term studies on how students learn best with AI, we intend to share findings and work with the broader education ecosystem to ensure AI benefits learners everywhere.
Those interested in receiving updates on this work can sign up here.