Evaluating the Intelligence of large language models: A comparative study using verbal and visual IQ tests
Large language models (LLMs) excel on many specialized benchmarks, yet their general-reasoning ability remains opaque. We therefore test 18 models – including GPT-4, Claude 3 and Gemini Pro – on a 14-section IQ suite spanning verbal, numerical and visual puzzles, and add a “multi-agent reflection” variant in which one model answers while others critique and revise. Results replicate known patterns: a strong bias towards verbal over numerical reasoning (GPT-4: 79% vs 53% accuracy), a pronounced modality gap (text-IQ ≈ 125 vs visual-IQ ≈ 103), and persistent failure on abstract arithmetic (≤ 20% on missing-number tasks). Scaling lifts mean IQ from 89 (tiny models) to 131 (large models), but gains are non-uniform, and reflection yields only modest extra points for frontier systems. Our contributions include: (1) proposing an evaluation framework for LLM “intelligence” using both verbal and visual IQ tasks; (2) analyzing how multi-agent setups with varying actor and critic sizes affect problem-solving performance; (3) analyzing how model size and multi-modality affect performance across diverse reasoning tasks; and (4) highlighting the value of IQ tests as a standardized, human-referenced benchmark that enables longitudinal comparison of LLMs’ cognitive abilities relative to human norms. We further discuss the limitations of IQ tests as an AI benchmark and outline directions for more comprehensive evaluation of LLM reasoning capabilities.
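The “multi-agent reflection” variant described above can be pictured as a simple actor–critic loop: one model drafts an answer, one or more critic models comment on it, and the actor revises in light of the critiques. The sketch below is purely illustrative and not the paper's implementation; `reflect`, `toy_actor`, and `toy_critic` are hypothetical names, and the callables stand in for real LLM API calls.

```python
from typing import Callable, List

def reflect(actor: Callable[[str], str],
            critics: List[Callable[[str, str], str]],
            question: str,
            rounds: int = 2) -> str:
    """One actor model answers; critic models comment; the actor revises.

    `actor` maps a prompt to an answer; each critic maps (question, answer)
    to a critique string. Both are stand-ins for LLM calls.
    """
    answer = actor(question)
    for _ in range(rounds):
        critiques = [critic(question, answer) for critic in critics]
        revision_prompt = (
            f"Question: {question}\n"
            f"Draft answer: {answer}\n"
            "Critiques:\n" + "\n".join(critiques) + "\n"
            "Revise your answer."
        )
        answer = actor(revision_prompt)
    return answer

# Toy stand-ins: the "actor" gives a wrong first answer and corrects
# itself once the prompt contains critiques; the "critic" flags the draft.
def toy_actor(prompt: str) -> str:
    return "4" if "Critiques:" in prompt else "5"

def toy_critic(question: str, answer: str) -> str:
    return "Looks wrong, recheck." if answer != "4" else "Looks right."
```

With these toys, `reflect(toy_actor, [toy_critic], "What is 2 + 2?")` returns the corrected answer after the critique round. The study's variant additionally varies actor and critic model sizes, which this minimal loop does not model.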
| Main Authors: | Sherif Abdelkarim, David Lu, Dora-Luz Flores, Susanne Jaeggi, Pierre Baldi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-08-01 |
| Series: | Computers in Human Behavior: Artificial Humans |
| Subjects: | Large language models; Intelligence Quotient; Artificial Intelligence |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2949882125000544 |
| Abstract: | Large language models (LLMs) excel on many specialized benchmarks, yet their general-reasoning ability remains opaque. We therefore test 18 models – including GPT-4, Claude 3 and Gemini Pro – on a 14-section IQ suite spanning verbal, numerical and visual puzzles, and add a “multi-agent reflection” variant in which one model answers while others critique and revise. Results replicate known patterns: a strong bias towards verbal over numerical reasoning (GPT-4: 79% vs 53% accuracy), a pronounced modality gap (text-IQ ≈ 125 vs visual-IQ ≈ 103), and persistent failure on abstract arithmetic (≤ 20% on missing-number tasks). Scaling lifts mean IQ from 89 (tiny models) to 131 (large models), but gains are non-uniform, and reflection yields only modest extra points for frontier systems. Our contributions include: (1) proposing an evaluation framework for LLM “intelligence” using both verbal and visual IQ tasks; (2) analyzing how multi-agent setups with varying actor and critic sizes affect problem-solving performance; (3) analyzing how model size and multi-modality affect performance across diverse reasoning tasks; and (4) highlighting the value of IQ tests as a standardized, human-referenced benchmark that enables longitudinal comparison of LLMs’ cognitive abilities relative to human norms. We further discuss the limitations of IQ tests as an AI benchmark and outline directions for more comprehensive evaluation of LLM reasoning capabilities. |
|---|---|
| Authors and Affiliations: | Sherif Abdelkarim, David Lu, and Pierre Baldi (corresponding author): University of California Irvine, 510 E Peltason Dr., Irvine, 92617, CA, USA. Dora-Luz Flores: Universidad Autónoma de Baja California, Mexicali, 21100, Baja California, Mexico. Susanne Jaeggi: University of California Irvine, and Northeastern University, 360 Huntington Ave, Boston, 02115, MA, USA (correspondence to: Interdisciplinary Science and Engineering Complex, 805 Columbus Ave, Boston, MA 02120, USA). |
| Series: | Computers in Human Behavior: Artificial Humans |
| ISSN: | 2949-8821 |
| DOI: | 10.1016/j.chbah.2025.100170 |
| Subjects: | Large language models; Intelligence Quotient; Artificial Intelligence |
| Collection: | DOAJ |
| URL: | http://www.sciencedirect.com/science/article/pii/S2949882125000544 |