System 2 Thinking in OpenAI’s o1-Preview Model: Near-Perfect Performance on a Mathematics Exam

The processes underlying human cognition are often divided into System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, large language models were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the o1 model series, designed to handle System 2-like reasoning. While OpenAI’s benchmarks are promising, independent validation is still needed. In this study, we tested the o1-preview model twice on the Dutch ‘Mathematics B’ final exam. It scored a near-perfect 76 and 74 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 62 out of 76, well above the Dutch students’ average of 40.63 points. Neither model had access to the exam figures. Since there was a risk of model contamination (i.e., the knowledge cutoff for o1-preview and GPT-4o was after the exam was published online), we repeated the procedure with a new Mathematics B exam that was published after the cutoff date. The results again indicated that o1-preview performed strongly (97.8th percentile), which suggests that contamination was not a factor. We also show that there is some variability in the output of o1-preview, which means that sometimes there is ‘luck’ (the answer is correct) or ‘bad luck’ (the output has diverged into something that is incorrect). We demonstrate that the self-consistency approach, where repeated prompts are given and the most common answer is selected, is a useful strategy for identifying the correct answer. It is concluded that while OpenAI’s new model series holds great potential, certain risks must be considered.
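
The self-consistency strategy mentioned in the abstract (prompt the model several times and keep the most common final answer) can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' actual pipeline: the query_model callable, the number of samples, and the way a final answer is extracted from the model output are all assumptions made for the example.

    from collections import Counter

    def self_consistent_answer(query_model, prompt, n_samples=5):
        """Ask the model n_samples times and return the most frequent final answer.

        `query_model` is assumed to be a callable that sends `prompt` to an LLM
        (e.g., o1-preview) and returns its final answer as a normalized string.
        """
        answers = [query_model(prompt) for _ in range(n_samples)]
        # Majority vote: the most common answer is kept, which filters out the
        # occasional run whose output has diverged into something incorrect.
        best_answer, _count = Counter(answers).most_common(1)[0]
        return best_answer

    # Hypothetical usage with the OpenAI Python client:
    # from openai import OpenAI
    # client = OpenAI()
    # def query_model(prompt):
    #     response = client.chat.completions.create(
    #         model="o1-preview",
    #         messages=[{"role": "user", "content": prompt}],
    #     )
    #     return response.choices[0].message.content.strip()
    # print(self_consistent_answer(query_model, "Question 5 of the exam: ..."))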

Bibliographic Details
Main Authors: Joost C. F. de Winter, Dimitra Dodou, Yke Bauke Eisma
Affiliation: Faculty of Mechanical Engineering, Delft University of Technology, 2628 CD Delft, The Netherlands
Format: Article
Language: English
Published: MDPI AG, 2024-10-01
Series: Computers, Vol. 13, Iss. 11, Art. 278
ISSN: 2073-431X
DOI: 10.3390/computers13110278
Subjects: large language models; reasoning; mathematics; chain of thought
Online Access: https://www.mdpi.com/2073-431X/13/11/278
Collection: DOAJ (OA Journals)
Record ID: doaj-art-000f94c5669f45fcb0ba85cc5d2f73ca