Evaluating AI’s ability to perform scientific research tasks
Reasoning is at the core of scientific work. Beyond recalling facts, scientists generate hypotheses, test and refine them, and synthesize ideas across fields. As our models become more capable, the central question is how they can reason deeply to contribute to scientific research.
Over the last year, our models have reached major milestones, including achieving gold-medal performance at the International Math Olympiad and the International Olympiad in Informatics. In parallel, we’re starting to see our most capable models, such as GPT‑5, meaningfully accelerate real scientific workflows. Researchers are using these systems for tasks such as literature search across disciplines and languages, and for working through complex mathematical proofs. In many cases, the model shortens work that might have taken days or weeks down to hours. This progress is documented in our paper Early science acceleration experiments with GPT‑5, released in November 2025, which presents early evidence that GPT‑5 can measurably accelerate scientific workflows.
Introducing FrontierScience
As accelerating scientific progress is one of the most promising opportunities for AI to benefit humanity, we’re improving our models on difficult math and science tasks and working on the tools that will help scientists get the most from them.
When GPQA, a “Google-Proof” science benchmark of questions written by PhD experts, was released in November 2023, GPT‑4 scored 39%, below the expert baseline of 70%. Two years later, GPT‑5.2 scored 92%. As models’ reasoning and knowledge capabilities continue to scale, more difficult benchmarks will be important to measure and forecast models’ ability to accelerate scientific research. Prior scientific benchmarks largely focus on multiple-choice questions, are saturated, or are not centrally focused on science.
To bridge this gap, we’re introducing FrontierScience: a new benchmark built to measure expert-level scientific capabilities. FrontierScience is written and verified by experts across physics, chemistry, and biology, and consists of hundreds of questions designed to be difficult, original, and meaningful. FrontierScience includes two tracks of questions: Olympiad, which measures Olympiad-style scientific reasoning capabilities, and Research, which measures real-world scientific research abilities. Providing more insight into models’ scientific capabilities helps us track progress and advance AI-accelerated science.
In our initial evaluations, GPT‑5.2 is our top-performing model on FrontierScience-Olympiad (scoring 77%) and Research (scoring 25%), ahead of other frontier models. We’ve seen substantial progress on solving expert-level questions while leaving headroom for more progress, especially on open-ended research-style tasks. For scientists, this suggests that current models can already support parts of research that involve structured reasoning, while highlighting that significant work remains to improve their ability to carry out open-ended thinking. These results align with how scientists are already using today’s models: to accelerate research workflows while relying on human judgment for problem framing and validation, and increasingly to explore ideas and connections that would otherwise take much longer to uncover—including, in some cases, contributing new insights that experts then evaluate and test.
In the end, the most important benchmark for the scientific capabilities of AI is the novel discoveries it helps generate; those are what ultimately matter to science and society. FrontierScience sits upstream of that. It gives us a north star for expert-level scientific reasoning, letting us test models on a standardized set of questions, see where they succeed or fail, and identify where we need to improve them. FrontierScience is narrow and has limitations in key respects (for example, focusing on constrained, expert-written problems) and does not capture everything scientists do in their everyday work. But the field needs more difficult, original, and meaningful science benchmarks, and FrontierScience provides a step forward in this direction.
What FrontierScience measures and how we built it
The full FrontierScience evaluation spans over 700 textual questions (with 160 in the gold set) covering subfields across physics, chemistry, and biology. The benchmark is composed of an Olympiad and a Research split.

FrontierScience-Olympiad contains 100 questions designed by international olympiad medalists to assess scientific reasoning in a constrained, short-answer format. The Olympiad set was designed to contain theoretical questions at least as difficult as problems at international olympiad competitions.

FrontierScience-Research consists of 60 original research subtasks designed by PhD scientists (doctoral candidates, professors, or postdoctoral researchers) and graded using a 10-point rubric. The Research set was created to contain self-contained, multi-step subtasks at the level of difficulty that a PhD scientist might encounter during their research.
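To make the two-track structure concrete, here is a minimal sketch of how a FrontierScience task could be represented in code. The schema is our illustrative assumption (the field names, `Track` enum, and `RubricItem` type are invented), not the benchmark's actual data format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Track(Enum):
    OLYMPIAD = "olympiad"  # constrained, short-answer questions
    RESEARCH = "research"  # open-ended subtasks graded against a rubric


@dataclass
class RubricItem:
    criterion: str  # what the grader checks, e.g. a specific reasoning step
    points: float   # contribution toward the 10-point total


@dataclass
class FrontierScienceTask:
    task_id: str
    track: Track
    domain: str                      # "physics", "chemistry", or "biology"
    prompt: str
    short_answer: str | None = None  # Olympiad tasks: the gradable final answer
    rubric: list[RubricItem] = field(default_factory=list)  # Research tasks: items sum to 10
```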
Sample questions
Chemistry olympiad
B1 reacts with aqueous bromine (Br2) to form B2. B2 reacts with potassium nitrite (KNO2) to form B3. B3 is nitrated in nitric acid (HNO3) and sulfuric acid (H2SO4) to form B4.
- B1 contains a monosubstituted aromatic 5-membered heterocycle and has a molar mass of 96.08 g/mol. It may be produced by dehydrating 5-carbon sugars (e.g., xylose) with an acid catalyst.
- B2 has the molecular formula C4H2Br2O3 and contains a tetrasubstituted alkene with 2 substituents being bromines cis to each other.
- B3 is a dipotassium salt with a molar mass of 269.27 g/mol. It contains 1 hydrogen.
- B4 is an achiral pseudohalogen dimer with 2 carbons, no hydrogens and a molar mass of 300. g/mol.
When B4 decomposes in solution, it forms an intermediate B5 and 1 equivalent of dinitrogen tetroxide (N2O4) as a side product. Intermediate B5 can be trapped and detected as a Diels-Alder adduct.
Provide the structures of B1, B2, B3, B4, and B5 in the following format, "B1: X; B2: X; B3: X; B4: X; B5: X".
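Several of the clues above are molar-mass constraints, which can be checked mechanically. Below is a small, hypothetical helper (not part of the benchmark) that recomputes a molar mass from a molecular formula; as an example it uses C4H2Br2O3, the formula the question already states for B2, so it confirms a given rather than revealing an answer.

```python
import re

# Standard atomic weights (g/mol) for elements appearing in this question.
ATOMIC_WEIGHTS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "S": 32.06, "K": 39.098, "Br": 79.904,
}


def molar_mass(formula: str) -> float:
    """Sum atomic weights over a simple formula string such as 'C4H2Br2O3'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[element] * (int(count) if count else 1)
    return total


print(f"C4H2Br2O3: {molar_mass('C4H2Br2O3'):.1f} g/mol")  # ~257.9 g/mol
```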
Each task in FrontierScience is written and verified by a domain expert in physics, chemistry, or biology. For the Olympiad set, all experts were awarded a medal in at least one (and often multiple) international olympiad competitions. For the Research set, all experts hold a relevant PhD degree.
The Olympiad questions were created in collaboration with 42 former international medalists or national team coaches in the relevant domains, totaling 109 olympiad medals. The Research questions were created in collaboration with 45 qualified scientists and domain experts, all of whom were doctoral candidates, postdoctoral researchers, or professors. Their areas of expertise spanned an array of specialized and important scientific disciplines, from quantum electrodynamics to synthetic organic chemistry to evolutionary biology.
The task creation process for both sets included some selection against OpenAI internal models (e.g., discarding tasks that models successfully got right, so we expect the evaluation to be somewhat biased against these models relative to others). We open-source the Olympiad gold set of 100 questions and Research gold set of 60 questions, holding out the other questions to track contamination.
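That selection step can be pictured as a simple filter over candidate tasks. The sketch below is a hypothetical illustration (the `solves` helper, `model.answer` interface, and attempt count are invented), not the actual pipeline.

```python
def solves(model, task, attempts: int = 4) -> bool:
    """Placeholder check: does the model get the task right in any of a few attempts?"""
    return any(model.answer(task.prompt) == task.short_answer for _ in range(attempts))


def select_against(tasks, internal_model):
    """Keep only tasks the internal model fails, biasing the set toward hard problems.

    As noted in the text, filtering this way also biases the resulting
    evaluation against the filtering model relative to other models.
    """
    return [task for task in tasks if not solves(internal_model, task)]
```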
Tasks for both sets go through a four-stage pipeline: Creation, Review, Resolution, and Revision. At each stage, independent experts review one another’s tasks to verify that they meet criteria for factuality, gradability, objectivity, and difficulty.
How we grade model performance
The Olympiad set is gradable with a short answer (a number, an expression, or a fuzzy string match), which makes correctness straightforward to verify. However, this verifiability often trades off with the expressivity and open-endedness of the problem. For the Research set, we introduce a rubric-based grading architecture to handle more open-ended tasks. Each question includes a scoring rubric with multiple independent and objectively assessable items, totaling 10 points. The rubric assesses not only the accuracy of the final answer but also the correctness of intermediate reasoning steps, allowing for nuanced analysis of model performance and failures. A solution is considered “correct” if it’s awarded at least 7/10 rubric points.
Responses are evaluated by a model-based grader (GPT‑5) against either the short answer or the rubric criteria. While we’d ideally use an expert human to grade each response, this approach is not scalable, so we designed the rubric to be checkable using a model grader. We developed a verification pipeline to help ensure rubrics and questions were well-calibrated to difficulty and correctness.
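To illustrate the two grading modes, here is a minimal sketch contrasting a fuzzy short-answer check (Olympiad) with the 7-of-10 rubric threshold (Research). The normalization rule and 0.9 similarity threshold are illustrative assumptions; only the 7/10 cutoff comes from the text, and the real pipeline uses a GPT‑5 model grader rather than string similarity alone.

```python
from difflib import SequenceMatcher


def _normalize(s: str) -> str:
    return " ".join(s.lower().split())


def grade_olympiad(response: str, reference: str, threshold: float = 0.9) -> bool:
    """Olympiad track: fuzzy string match against the reference short answer."""
    ratio = SequenceMatcher(None, _normalize(response), _normalize(reference)).ratio()
    return ratio >= threshold


def grade_research(awarded_points: list[float]) -> bool:
    """Research track: a solution counts as correct at >= 7 of the 10 rubric points.

    `awarded_points` holds the per-item scores a grader (expert or model)
    assigned, e.g. 1.0 / 0.5 / 0.0 per item as in the sample rubric below.
    """
    return sum(awarded_points) >= 7.0


# Six full-credit items plus two half-credit items reach exactly 7.0 points.
print(grade_research([1.0] * 6 + [0.5, 0.5, 0.0, 0.0]))  # True
```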
Sample chemistry research problem
Phthalocyanines are important macrocyclic compounds widely utilized in material and life sciences due to their outstanding physicochemical properties, including intense absorption in the visible-near-IR region and high electron-accepting ability. However, unlike porphyrins, the structural diversity of phthalocyanines has been limited by traditional synthetic methods that rely on one-step condensation of isoindoline units. The development of methodologies that allow controlled modification of the phthalocyanine framework, particularly at the meso-positions, represents a significant advancement in expanding the structural and electronic diversity of these compounds.
Analyze how the modification of meso-nitrogen atoms in nickel(II) phthalocyanine affects the electronic structure, aromaticity, spectroscopic properties, and reactivity of the resulting macrocycles. Explain the synthetic methodology that enables these modifications, the relationship between π-electron count and observed properties, and how these systems compare to traditional phthalocyanines and related porphyrinoids.
Sample response
1. Introduction: Why Modify the meso-Nitrogens of Phthalocyanines?
Phthalocyanines (Pc) are a class of macrocyclic dyes known for their remarkable chemical, electronic, and optical properties. Their widespread use in materials science, catalysis, and biology stems from strong absorptions in the visible/near-IR regions, exceptional redox characteristics (i.e., high electron affinity), and rigid π-conjugation. Traditional synthetic methods, however, typically form the macrocycle in a single step from phthalonitrile or isoindoline precursors, restricting the ability to introduce substitutions selectively at key ring positions, especially the "meso" (bridging) nitrogens. Recently, methodologies allowing targeted modification of the meso-nitrogen atoms in Ni(II) phthalocyanines have emerged as a powerful means to expand the structural and electronic diversity of these macrocycles.
2. Effect on Electronic Structure and Aromaticity
- π-Electron Count and Ring Currents: Phthalocyanines are classical 18 π-electron aromatic systems, conferring high stability and planarity. Altering the meso-nitrogens through substitution or ring expansion can shift the effective π-electron count. For example, introducing additional unsaturated moieties or heteroatoms may expand the cyclic conjugation to 20 or 22 electrons, moving the molecule between Hückel aromaticity and other regimes that affect ring currents and magnetic properties.
- Charge Distribution and Frontier Orbitals: Meso-substitution can tune the electron density distribution around the macrocycle. Electron-donating groups can push electron density toward the ring center, lowering the energy of the lowest unoccupied molecular orbital (LUMO). Electron-withdrawing groups, conversely, stabilize the highest occupied molecular orbital (HOMO) and shift redox potentials toward more positive values, altering both the electrochemical profile and the Q- and B-band positions in the UV-vis spectrum.
3. Spectroscopic Consequences
- UV-Vis Absorption (Q and B Bands): The principal absorption features of phthalocyanines lie in the visible (Q-band, typically 600-700 nm) and near-UV (B-band, typically 300-400 nm) regions. Substitution that expands the ring conjugation or introduces strong electron-donating/withdrawing groups can:
- Shift the Q-band to longer wavelengths (bathochromic shift), reaching into the near-IR, which is highly desirable for optoelectronic and photodynamic applications.
- Alter relative intensities of these bands and merge or split them, reflecting changes in orbital symmetries and energies.
- NMR Spectroscopy and Aromatic Ring Currents: Modifications to the π-electron count and distribution are directly observed in 1H and 13C NMR chemical shifts. More highly conjugated (or expanded) aromatic rings exhibit distinct downfield shifts for protons located within induced ring currents, while any partial loss of aromaticity or incorporation of antiaromatic segments can cause atypical shielding/deshielding patterns.
4. Reactivity and Coordination Chemistry
Because phthalocyanines are often used as redox catalysts or sensors, the meso-nitrogen modifications can significantly influence reactivity:
- Electron-rich meso substituents facilitate nucleophilic or electrophilic attacks at the ring periphery, enabling site-selective functionalizations that are otherwise difficult.
(... shortened for the purposes of this figure)
Sample grading rubric
Analysis of Traditional Phthalocyanine Synthesis Limitations (1 point)
1.0 point: Correctly explains that traditional phthalocyanine synthesis involves one-step condensation with simultaneous formation of all meso-nitrogen bridges, providing limited control over substitution patterns at these positions.
0.5 point: Mentions limitations of traditional methods but without specific focus on meso-position control challenges.
0.0 point: Fails to identify key limitations of traditional synthetic approaches or provides incorrect analysis.
Thiolate-Mediated Tetramerization Process (1 point)
1.0 point: Correctly describes the thiolate-mediated reductive tetramerization and explains how counter cation size (K+ or Cs+ vs. Na+) affects selectivity between tetramer formation and direct macrocyclization.
0.5 point: Mentions thiolate-mediated tetramerization but without explaining factors controlling selectivity.
0.0 point: Incorrectly describes the oligomerization process or omits critical details about selectivity control.
Analysis of NMR Spectroscopic Features (1 point)
1.0 point: Correctly explains that upfield shifts in the 16π system indicate paratropic ring current (antiaromaticity), contrasts this with the broad signals in 17π systems due to paramagnetism, and connects these observations to the underlying electronic structures.
0.5 point: Identifies basic NMR patterns but without clear connection to ring currents or electronic structure.
0.0 point: Incorrectly interprets NMR data or fails to connect spectral features to electronic properties.
Electrochemical Property Analysis (1 point)
1.0 point: Correctly explains that the 16π system shows two reversible reductions reflecting conversion to 17π radical and 18π aromatic states, while 17π systems show narrow redox gaps due to facile interconversion between 16π, 17π, and 18π states, and relates these patterns to the underlying electronic structures.
0.5 point: Describes redox patterns without clearly connecting them to specific electronic state changes.
0.0 point: Incorrectly interprets electrochemical data or fails to connect redox behavior to electronic properties.
Analysis of Absorption Spectroscopy (1 point)
1.0 point: Correctly explains that the 16π system shows weak/broad absorption due to symmetry-forbidden HOMO-LUMO transitions in antiaromatic systems, while 17π systems show Q-like bands plus NIR-II absorptions characteristic of radical species, and contrasts these with typical phthalocyanine spectral features.
0.5 point: Describes absorption features but provides limited connection to underlying electronic structures.
0.0 point: Incorrectly interprets absorption data or fails to relate spectral features to electronic properties.
Reactivity Analysis of Antiaromatic System (1 point)
1.0 point: Correctly explains the high reactivity of the 16π system toward nucleophiles, details specific reactions with hydroxide (ring opening) and hydrazine (ring expansion), and explains how these transformations relieve antiaromatic destabilization.
0.5 point: Mentions reactivity but provides limited analysis of specific transformations or the driving forces behind them.
0.0 point: Incorrectly analyzes reactivity patterns or fails to connect them to the antiaromatic character of the 16π system.
(... and more)
Each task in the Research set is graded using a 10-point rubric that can be applied by an expert or a model grader. To scale our ability to evaluate models, we use another model to grade responses.
Model performance
We evaluated several frontier models on FrontierScience-Olympiad and FrontierScience-Research: GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. All reasoning models were evaluated at “high” reasoning effort, with the exception of GPT‑5.2, which was evaluated at “xhigh”. In our initial evaluations, GPT‑5.2 is our top-performing model on FrontierScience-Olympiad (scoring 77%) and Research (scoring 25%), ahead of other frontier models. Gemini 3 Pro is comparable to GPT‑5.2 on the Olympiad set (scoring 76%).
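As a rough sketch of what such an evaluation run can look like, the snippet below uses the OpenAI Python SDK’s Responses API. The model name, the “xhigh” effort value, and the `grade` callback are assumptions drawn from the text for illustration (accepted effort values vary by model and API version); this is not the harness we used.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def run_task(prompt: str, model: str = "gpt-5.2", effort: str = "xhigh") -> str:
    """Sample one response for a benchmark task at a given reasoning effort."""
    response = client.responses.create(
        model=model,
        reasoning={"effort": effort},
        input=prompt,
    )
    return response.output_text


def evaluate(tasks, grade) -> float:
    """Accuracy over a split: the fraction of tasks whose graded response is correct."""
    correct = sum(grade(run_task(task.prompt), task) for task in tasks)
    return correct / len(tasks)
```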
We’ve seen substantial progress on solving expert-level questions, but there is still more room to grow, especially on open-ended research-style tasks: in analyzing transcripts of failures, we found that frontier models sometimes make reasoning, logic, or calculation errors, misunderstand niche scientific concepts, or introduce factual inaccuracies.
We compare accuracies across several frontier models; GPT‑5.2 is our highest-performing model on both the FrontierScience-Research and FrontierScience-Olympiad sets.
We also compare accuracies across reasoning efforts for GPT‑5.2 and o3; longer thinking time generally leads to improved accuracy.
Limitations and what’s next
While FrontierScience represents a step forward in the difficulty of scientific benchmarks, it still has many limitations. FrontierScience is composed of questions with constrained problem statements, focusing on evaluating either the final answer (Olympiad) or the reasoning required to complete a research task (Research). In addition, using rubrics with multiple components on longer tasks is less objective than checking a final answer.
FrontierScience offers a higher resolution snapshot of models’ reasoning on difficult, expert-written questions, but not a full picture of how science gets done in practice. In particular, it does not assess a significant part of scientific research: how models generate genuinely novel hypotheses, or interact with multiple modalities, including video data and real experimental systems in the physical world.
Looking ahead, we expect progress in scientific reasoning to come from both better general-purpose reasoning systems and focused effort on improving scientific capabilities. FrontierScience is one tool among many, and as models improve, we plan to iterate on this benchmark, expand it to new domains, and pair it with more real-world evaluations that look at what these systems actually enable scientists to do. Benchmarks like FrontierScience help us understand the weaknesses of today’s AI systems so we can focus our work on making models reliable partners in scientific discovery.