Evaluating the intelligence of large language models: A comparative study using verbal and visual IQ tests
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-08-01 |
| Series: | Computers in Human Behavior: Artificial Humans |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2949882125000544 |
| Summary: | Large language models (LLMs) excel on many specialized benchmarks, yet their general-reasoning ability remains opaque. We therefore test 18 models – including GPT-4, Claude 3 and Gemini Pro – on a 14-section IQ suite spanning verbal, numerical and visual puzzles, and add a “multi-agent reflection” variant in which one model answers while others critique and revise. Results replicate known patterns: a strong bias towards verbal over numerical reasoning (GPT-4: 79% vs 53% accuracy), a pronounced modality gap (text-IQ ≈ 125 vs visual-IQ ≈ 103), and persistent failure on abstract arithmetic (≤ 20% on missing-number tasks). Scaling lifts mean IQ from 89 (tiny models) to 131 (large models), but gains are non-uniform, and reflection yields only modest extra points for frontier systems. Our contributions include: (1) proposing an evaluation framework for LLM “intelligence” using both verbal and visual IQ tasks; (2) analyzing how multi-agent setups with varying actor and critic sizes affect problem-solving performance; (3) analyzing how model size and multi-modality affect performance across diverse reasoning tasks; and (4) highlighting the value of IQ tests as a standardized, human-referenced benchmark that enables longitudinal comparison of LLMs’ cognitive abilities relative to human norms. We further discuss the limitations of IQ tests as an AI benchmark and outline directions for more comprehensive evaluation of LLM reasoning capabilities. |
| ISSN: | 2949-8821 |
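
The summary describes a “multi-agent reflection” setup in which one actor model answers each IQ item while one or more critic models point out flaws and the actor revises. The sketch below illustrates that loop in Python under stated assumptions: the names (`reflect`, `ask_actor`, `ask_critics`), the prompts, and the fixed number of revision rounds are hypothetical placeholders rather than the paper's actual implementation, and the model-calling functions are left abstract so no particular API is assumed.

```python
from typing import Callable, List

def reflect(question: str,
            ask_actor: Callable[[str], str],
            ask_critics: List[Callable[[str], str]],
            rounds: int = 2) -> str:
    """Actor proposes an answer; critics comment; actor revises (hypothetical sketch)."""
    # Initial attempt by the actor model.
    answer = ask_actor(f"Solve the following IQ-test item:\n{question}")
    for _ in range(rounds):
        # Each critic reviews the current answer independently.
        critiques = [
            critic(f"Question:\n{question}\n\nProposed answer:\n{answer}\n"
                   "Point out any reasoning errors.")
            for critic in ask_critics
        ]
        # The actor revises in light of the collected critiques.
        answer = ask_actor(
            f"Question:\n{question}\n\nYour previous answer:\n{answer}\n\n"
            "Critiques:\n" + "\n".join(critiques) +
            "\n\nRevise if needed and state your final answer."
        )
    return answer
```

The IQ figures quoted in the summary (e.g. text-IQ ≈ 125, mean IQ rising from 89 to 131 with model size) presumably follow the standard deviation-IQ convention, IQ = 100 + 15·z, where z is the score's standard-normal position relative to human norms; the record itself does not state the paper's exact norming procedure.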