Evaluating the Intelligence of large language models: A comparative study using verbal and visual IQ tests

Large language models (LLMs) excel on many specialized benchmarks, yet their general-reasoning ability remains opaque. We therefore test 18 models – including GPT-4, Claude 3 and Gemini Pro – on a 14-section IQ suite spanning verbal, numerical and visual puzzles and add a “multi-agent reflection” variant in which one model answers while others critique and revise. Results replicate known patterns: a strong bias towards verbal vs numerical reasoning (GPT-4: 79% vs 53% accuracy), a pronounced modality gap (text-IQ ≈ 125 vs visual-IQ ≈ 103), and persistent failure on abstract arithmetic (≤ 20% on missing-number tasks). Scaling lifts mean IQ from 89 (tiny models) to 131 (large models), but gains are non-uniform, and reflection yields only modest extra points for frontier systems. Our contributions include: (1) proposing an evaluation framework for LLM “intelligence” using both verbal and visual IQ tasks; (2) analyzing how multi-agent setups with varying actor and critic sizes affect problem-solving performance; (3) analyzing how model size and multi-modality affect performance across diverse reasoning tasks; and (4) highlighting the value of IQ tests as a standardized, human-referenced benchmark that enables longitudinal comparison of LLMs’ cognitive abilities relative to human norms. We further discuss the limitations of IQ tests as an AI benchmark and outline directions for more comprehensive evaluation of LLM reasoning capabilities.
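The “multi-agent reflection” variant described above (one model answers while others critique and revise) can be sketched as a simple actor/critic loop. The sketch below is hypothetical scaffolding, not the paper's code: `actor_answer` and `critic_review` are stand-in stubs where real model API calls would go, and the round/critic counts are illustrative parameters.

```python
# Minimal sketch of an answer -> critique -> revise cycle, assuming one
# "actor" model and several "critic" models. The two stub functions below
# are hypothetical placeholders for calls to the LLMs under evaluation.

def actor_answer(question, feedback=None):
    # Stub: the actor proposes an answer, or revises it given critiques.
    if feedback:
        return f"revised answer to {question!r} given {len(feedback)} critiques"
    return f"initial answer to {question!r}"

def critic_review(question, answer):
    # Stub: a critic model comments on the actor's current answer.
    return f"critique of {answer!r}"

def reflect(question, n_critics=2, n_rounds=1):
    """Run one or more critique/revision rounds and return the final answer."""
    answer = actor_answer(question)
    for _ in range(n_rounds):
        feedback = [critic_review(question, answer) for _ in range(n_critics)]
        answer = actor_answer(question, feedback)
    return answer

print(reflect("Which number completes the series 2, 4, 8, ?"))
```

In a real evaluation, the actor and critics would be separate models of possibly different sizes, which is the axis the paper varies in contribution (2).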

Bibliographic Details
Main Authors: Sherif Abdelkarim, David Lu, Dora-Luz Flores, Susanne Jaeggi, Pierre Baldi
Format: Article
Language: English
Published: Elsevier, 2025-08-01
Series: Computers in Human Behavior: Artificial Humans
ISSN: 2949-8821
DOI: 10.1016/j.chbah.2025.100170
Collection: DOAJ
Subjects: Large language models; Intelligence Quotient; Artificial Intelligence
Online Access: http://www.sciencedirect.com/science/article/pii/S2949882125000544
Author affiliations:
Sherif Abdelkarim: University of California Irvine, 510 E Peltason Dr., Irvine, 92617, CA, USA
David Lu: University of California Irvine, 510 E Peltason Dr., Irvine, 92617, CA, USA
Dora-Luz Flores: Universidad Autónoma de Baja California, Mexicali, 21100, Baja California, Mexico
Susanne Jaeggi: University of California Irvine, 510 E Peltason Dr., Irvine, 92617, CA, USA; Northeastern University, 360 Huntington Ave, Boston, 02115, MA, USA (correspondence: Interdisciplinary Science and Engineering Complex, 805 Columbus Ave, Boston, MA 02120, USA)
Pierre Baldi: University of California Irvine, 510 E Peltason Dr., Irvine, 92617, CA, USA (corresponding author)