Evaluating the Intelligence of large language models: A comparative study using verbal and visual IQ tests
Large language models (LLMs) excel on many specialized benchmarks, yet their general-reasoning ability remains opaque. We therefore test 18 models – including GPT-4, Claude 3 and Gemini Pro – on a 14-section IQ suite spanning verbal, numerical and visual puzzles, and add a “multi-agent reflection” variant in which one model answers while others critique and revise. Results replicate known patterns: a strong bias towards verbal over numerical reasoning (GPT-4: 79% vs 53% accuracy), a pronounced modality gap (text-IQ ≈ 125 vs visual-IQ ≈ 103), and persistent failure on abstract arithmetic (≤ 20% on missing-number tasks). Scaling lifts mean IQ from 89 (tiny models) to 131 (large models), but gains are non-uniform, and reflection yields only modest extra points for frontier systems. Our contributions include: (1) proposing an evaluation framework for LLM “intelligence” using both verbal and visual IQ tasks; (2) analyzing how multi-agent setups with varying actor and critic sizes affect problem-solving performance; (3) analyzing how model size and multi-modality affect performance across diverse reasoning tasks; and (4) highlighting the value of IQ tests as a standardized, human-referenced benchmark that enables longitudinal comparison of LLMs’ cognitive abilities relative to human norms. We further discuss the limitations of IQ tests as an AI benchmark and outline directions for more comprehensive evaluation of LLM reasoning capabilities.
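The “multi-agent reflection” variant described above can be pictured as a simple actor–critic loop: one model drafts an answer, one or more critic models comment on it, and the actor revises in light of the critiques. The sketch below is purely illustrative and not the paper's implementation; `reflect`, `toy_actor`, and `toy_critic` are hypothetical names, and the callables stand in for real LLM API calls.

```python
from typing import Callable, List

def reflect(actor: Callable[[str], str],
            critics: List[Callable[[str, str], str]],
            question: str,
            rounds: int = 2) -> str:
    """One actor model answers; critic models comment; the actor revises.

    `actor` maps a prompt to an answer; each critic maps (question, answer)
    to a critique string. Both are stand-ins for LLM calls.
    """
    answer = actor(question)
    for _ in range(rounds):
        critiques = [critic(question, answer) for critic in critics]
        revision_prompt = (
            f"Question: {question}\n"
            f"Draft answer: {answer}\n"
            "Critiques:\n" + "\n".join(critiques) + "\n"
            "Revise your answer."
        )
        answer = actor(revision_prompt)
    return answer

# Toy stand-ins: the "actor" gives a wrong first answer and corrects
# itself once the prompt contains critiques; the "critic" flags the draft.
def toy_actor(prompt: str) -> str:
    return "4" if "Critiques:" in prompt else "5"

def toy_critic(question: str, answer: str) -> str:
    return "Looks wrong, recheck." if answer != "4" else "Looks right."
```

With these toys, `reflect(toy_actor, [toy_critic], "What is 2 + 2?")` returns the corrected answer after the critique round. The study's variant additionally varies actor and critic model sizes, which this minimal loop does not model.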
| Main Authors: | Sherif Abdelkarim, David Lu, Dora-Luz Flores, Susanne Jaeggi, Pierre Baldi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-08-01 |
| Series: | Computers in Human Behavior: Artificial Humans |
| Subjects: | Large language models; Intelligence Quotient; Artificial Intelligence |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2949882125000544 |
| Abstract: | Large language models (LLMs) excel on many specialized benchmarks, yet their general-reasoning ability remains opaque. We therefore test 18 models – including GPT-4, Claude 3 and Gemini Pro – on a 14-section IQ suite spanning verbal, numerical and visual puzzles, and add a “multi-agent reflection” variant in which one model answers while others critique and revise. Results replicate known patterns: a strong bias towards verbal over numerical reasoning (GPT-4: 79% vs 53% accuracy), a pronounced modality gap (text-IQ ≈ 125 vs visual-IQ ≈ 103), and persistent failure on abstract arithmetic (≤ 20% on missing-number tasks). Scaling lifts mean IQ from 89 (tiny models) to 131 (large models), but gains are non-uniform, and reflection yields only modest extra points for frontier systems. Our contributions include: (1) proposing an evaluation framework for LLM “intelligence” using both verbal and visual IQ tasks; (2) analyzing how multi-agent setups with varying actor and critic sizes affect problem-solving performance; (3) analyzing how model size and multi-modality affect performance across diverse reasoning tasks; and (4) highlighting the value of IQ tests as a standardized, human-referenced benchmark that enables longitudinal comparison of LLMs’ cognitive abilities relative to human norms. We further discuss the limitations of IQ tests as an AI benchmark and outline directions for more comprehensive evaluation of LLM reasoning capabilities. |
|---|---|
| Authors and Affiliations: | Sherif Abdelkarim, David Lu, and Pierre Baldi (corresponding author): University of California Irvine, 510 E Peltason Dr., Irvine, 92617, CA, USA. Dora-Luz Flores: Universidad Autónoma de Baja California, Mexicali, 21100, Baja California, Mexico. Susanne Jaeggi: University of California Irvine, and Northeastern University, 360 Huntington Ave, Boston, 02115, MA, USA (correspondence to: Interdisciplinary Science and Engineering Complex, 805 Columbus Ave, Boston, MA 02120, USA). |
| Series: | Computers in Human Behavior: Artificial Humans |
| ISSN: | 2949-8821 |
| DOI: | 10.1016/j.chbah.2025.100170 |
| Subjects: | Large language models; Intelligence Quotient; Artificial Intelligence |
| Collection: | DOAJ |
| URL: | http://www.sciencedirect.com/science/article/pii/S2949882125000544 |