Introducing IndQA

Our mission is to make AGI benefit all of humanity. If AI is going to be useful for everyone, it needs to work well across languages and cultures. About 80 percent of people worldwide do not speak English as their primary language, yet most existing benchmarks that measure non-English language capabilities fall short. 


Existing multilingual benchmarks like MMMLU are now saturated—top models cluster near high scores—which makes them less useful for measuring real progress. In addition, current benchmarks mostly focus on translation or multiple-choice tasks. They don’t adequately capture what really matters for evaluating an AI system’s language capabilities: understanding context, culture, history, and the things that matter to people where they live.


That’s why we built IndQA, a new benchmark designed to evaluate how well AI models understand and reason about questions that matter in Indian languages, across a wide range of cultural domains. While our aim is to create similar benchmarks for other languages and regions, India is an obvious starting point. India has about a billion people who don’t use English as their primary language, 22 official languages (including at least seven with over 50 million speakers), and is ChatGPT’s second largest market.  


This work is part of our ongoing commitment to improve our products and tools for Indian users, and to make our technology more accessible throughout the country.


How it works


IndQA evaluates knowledge and reasoning about Indian culture and everyday life in Indian languages. It spans 2,278 questions across 12 languages and 10 cultural domains, created in partnership with 261 domain experts from across India. Unlike benchmarks such as MMMLU and MGSM, it is designed to probe culturally nuanced, reasoning-heavy tasks that existing evaluations struggle to capture.


IndQA covers a broad range of culturally relevant topics, such as Architecture & Design, Arts & Culture, Everyday Life, Food & Cuisine, History, Law & Ethics, Literature & Linguistics, Media & Entertainment, Religion & Spirituality, and Sports & Recreation—with items written natively in Bengali, English, Gujarati, Hindi, Hinglish, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. We specifically added Hinglish given the prevalence of code-switching in conversations.


Each datapoint includes a culturally grounded prompt in an Indian language, an English translation for auditability, rubric criteria for grading, and an ideal answer that reflects expert expectations.
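
To make that structure concrete, here is a minimal sketch of what one datapoint might look like in code. The class and field names are illustrative assumptions, not the published IndQA schema.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One rubric line: what an ideal answer should include or avoid."""
    description: str
    weight: float  # point value reflecting the criterion's importance

@dataclass
class IndQADatapoint:
    """Hypothetical shape of one IndQA item; field names are illustrative."""
    language: str                    # e.g. "Bengali"
    domain: str                      # e.g. "Literature & Linguistics"
    prompt: str                      # culturally grounded question, written natively
    english_translation: str         # included for auditability
    criteria: list[RubricCriterion]  # weighted rubric used for grading
    ideal_answer: str                # expert-written reference answer
```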


IndQA uses a rubric-based approach. Each response is graded against criteria written by domain experts for that specific question. The criteria spell out what an ideal answer should include or avoid, and each one is given a weighted point value based on its importance. A model-based grader checks whether each criterion is met. The final score is the sum of the points for criteria satisfied out of the total possible.
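
As a rough illustration of that arithmetic—not OpenAI’s actual grader—the final score can be computed as the satisfied weight over the total weight. This sketch reuses the hypothetical RubricCriterion above and stubs the model-based grader as a judge callable.

```python
from typing import Callable

def rubric_score(
    response: str,
    criteria: list["RubricCriterion"],
    judge: Callable[[str, str], bool],  # model-based grader: (response, criterion text) -> met?
) -> float:
    """Fraction of weighted rubric points earned: satisfied weight / total weight."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if judge(response, c.description))
    return earned / total if total > 0 else 0.0
```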


How we built IndQA


  • Expert‑authored questions. We worked with partners to find experts in India across 10 domains, who drafted difficult, reasoning‑focused prompts tied to their regions and specialties. These experts are native‑level speakers of the relevant language (and English) and bring deep subject expertise.
  • Adversarial filtering. Each question was tested against OpenAI’s strongest models at the time of its creation: GPT‑4o, OpenAI o3, GPT‑4.5, and (partially, post public launch) GPT‑5. We kept only those questions where a majority of these models failed to produce acceptable answers, preserving headroom for progress (see the sketch after this list).
  • Detailed criteria. Along with every question, domain experts provided grading criteria, similar to an exam rubric for an essay question; these criteria are used to grade responses from candidate models.
  • Ideal answers + review. Experts added ideal answers and English translations, followed by peer review and iterative fixes until sign‑off.
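
The adversarial filtering step reduces to a simple majority-fail rule. Here is a minimal sketch, assuming a hypothetical is_acceptable grading call; the actual pipeline is not specified in the post.

```python
from typing import Callable

def keep_question(
    question: str,
    models: list[str],
    is_acceptable: Callable[[str, str], bool],  # hypothetical: (model, question) -> acceptable?
) -> bool:
    """Keep a question only if a majority of the test models fail to answer it acceptably."""
    failures = sum(1 for model in models if not is_acceptable(model, question))
    return failures > len(models) / 2

# Illustrative usage over a candidate pool (identifiers are assumptions, not OpenAI's):
# kept = [q for q in candidates if keep_question(q, ["gpt-4o", "o3", "gpt-4.5"], is_acceptable)]
```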

Example questions


Language: Bengali
Domain: Literature and linguistics

Prompt

‘দণ্ডক থেকে মরিচঝাঁপি’ উপন্যাসের লেখক নিম্নবর্ণের পুরুষ ও নারীদের দণ্ডকারন্যে পুনর্বাসন পরবর্তী জীবন কিভাবে দেখিয়েছেন? দণ্ডকারণ্যে পুনর্বাসন কি সরকারী উদাসীনতার ফল? পরিবর্তিত প্রাকৃতিক পরিবেশের সাথে উদ্বাস্তুরা কিভাবে মানিয়ে নিয়েছিল?

English Translation

How did the writer of the Bengali novel ‘Dandak Theke Marichjhanpi’ depict the post-rehabilitation lives of lower-caste men and women? Was the rehabilitation in Dandakaranya a result of governmental indifference? What was its relation to the new natural landscape?


Language: Bengali
Domain: Food and cuisine

Prompt

কোন পরিপ্রেক্ষিতে উনিশ শতকের শেষ দিক থেকে রান্নার বইগুলো বেরচ্ছিল ? প্রথম বাংলা রান্নার বইটির সাথে বিপ্রদাস মুখোপাধ্যায় রচিত বইটির পার্থক্য কোথায় ? বিপ্রদাসের উদ্যোগে প্রকাশিত পত্রিকাটি চলেছিল কতদিন ? বিপ্রদাস ও প্রজ্ঞা সুন্দরীর লেখা অনুসরণ করে দিঘাপতিয়া থেকে কোন বইটি বেরিয়েছিল ?

English Translation

In what context were cookbooks published from the end of the 19th century? What is the difference between the first Bengali cookbook and the book written by Bipradas Mukherjee? How long did the magazine published by Bipradas run? Which book was published by Dighapatiya following the writings of Bipradas and Pragya Sundari?


Improvements over time


We use IndQA to evaluate how recent frontier models perform and to chart progress over the last couple of years. With IndQA we can see that OpenAI’s models have improved significantly on Indian languages over time (with the caveats noted below), but still have substantial room for improvement. We look forward to improving performance and sharing results for future models.


We also stratify performance on IndQA by language and domain, comparing GPT‑5 Thinking High to other frontier models.


Caveats



Because questions are not identical across languages, IndQA is not a language leaderboard; cross‑language scores shouldn’t be interpreted as direct comparisons of language ability. Instead, we plan to use IndQA to measure improvement over time within a model family or configuration.


Additionally, because questions were filtered to those that GPT‑4o, OpenAI o3, GPT‑4.5, and (post public launch) GPT‑5 could not answer sufficiently, question selection is adversarial against these models. This potentially confounds the relative performance of GPT‑5 and could disadvantage all OpenAI models compared to non-OpenAI models.


The experts behind IndQA


We’re grateful to the 261 Indian experts—journalists, linguists, scholars, artists, and industry practitioners—who authored and reviewed questions for IndQA. A few examples of the experts we worked with include:


  • A Nandi Award-winning Telugu actor and screenwriter with over 750 films
  • A Marathi journalist and editor at Tarun Bharat
  • A scholar of Kannada linguistics and dictionary editor
  • An international chess grandmaster who coaches top-100 players
  • A Tamil writer, poet, and cultural activist advocating for social justice, caste equity, and literary freedom
  • An award-winning Punjabi music composer
  • A Gujarati heritage curator and conservation specialist
  • An award-winning Malayalam poet and performance artist
  • A professor of history specializing in Bengal's rich cultural heritage
  • A professor of architecture focusing on Odishan temple architecture

Next steps


We hope the release of IndQA will inform and inspire new benchmark creation from the research community. IndQA-style questions are especially valuable in languages and cultural domains that are poorly covered by existing AI benchmarks. Creating benchmarks similar to IndQA can help AI research labs learn more about the languages and domains where models struggle today, and provide a north star for future improvements.