MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data
Abstract Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. To support rigorous evaluation of mathematical reasoning in LLMs, we introduce the "MathOdyssey" dataset, a curated collection of 387 expert-generated mathematical problems spanning high school, university, and Olympiad-level topics. Each problem is accompanied by a detailed solution and categorized by difficulty level, subject area, and answer type. The dataset was developed through a rigorous multi-stage process involving contributions from subject experts, peer review, and standardized formatting. We provide detailed metadata and a standardized schema to facilitate consistent use in downstream applications. To demonstrate the dataset's utility, we evaluate several representative LLMs and report their performance across problem types. We release MathOdyssey as an open-access resource to enable reproducible and fine-grained assessment of mathematical capabilities in LLMs and to foster further research in mathematical reasoning and education.
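The abstract describes each dataset entry as carrying a problem, a detailed solution, and labels for difficulty level, subject area, and answer type. A minimal sketch of what such a record might look like, with a helper to filter by difficulty — note the field names and sample problems here are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical MathOdyssey-style records, based only on the fields the
# abstract names (problem, solution, difficulty, subject, answer type).
# Field names and sample content are assumptions for illustration.
records = [
    {
        "problem": "Find all real x such that x^2 - 5x + 6 = 0.",
        "solution": "Factor: (x - 2)(x - 3) = 0, so x = 2 or x = 3.",
        "difficulty": "High School",
        "subject": "Algebra",
        "answer_type": "numerical",
    },
    {
        "problem": "Evaluate the limit of (1 + 1/n)^n as n grows without bound.",
        "solution": "By the standard definition, the limit is e.",
        "difficulty": "University",
        "subject": "Calculus",
        "answer_type": "symbolic",
    },
]

def by_difficulty(recs, level):
    """Return the subset of records labeled with the given difficulty level."""
    return [r for r in recs if r["difficulty"] == level]

print(len(by_difficulty(records, "High School")))  # -> 1
```

A uniform record shape like this is what makes the fine-grained, per-category evaluation the abstract describes straightforward: accuracy can be aggregated over any label (difficulty, subject, or answer type) with the same filter.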
| Main Authors: | Meng Fang, Xiangpeng Wan, Fei Lu, Fei Xing, Kai Zou |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-08-01 |
| Series: | Scientific Data |
| Online Access: | https://doi.org/10.1038/s41597-025-05283-3 |
| ISSN: | 2052-4463 |
| Record ID: | doaj-art-07c2e740654e4ec48ff1dc01a4eab489 (DOAJ) |
| Author Affiliations: | Meng Fang (Department of Computer Science, University of Liverpool); Xiangpeng Wan (NetMind.AI); Fei Lu (Department of Mathematics, Johns Hopkins University); Fei Xing (Mathematica Policy Research); Kai Zou (NetMind.AI) |