Efficient GPT-4V level multimodal large language model for deployment on edge devices


Bibliographic Details
Main Authors: Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Chi Chen, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Ronghua Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Nature Communications
Online Access:https://doi.org/10.1038/s41467-025-61040-5
Description
Summary: Multimodal large language models have revolutionized AI research and industry, paving the way toward the next milestone. However, their large sizes and high computational costs restrict deployment to cloud servers, limiting use in mobile, offline, energy-sensitive, or privacy-critical scenarios. We present MiniCPM-V, efficient models for edge devices that integrate advancements in architecture, training, and data. The 8B model outperforms GPT-4V, Gemini Pro, and Claude 3 across 11 public benchmarks, processes high-resolution images at any aspect ratio, achieves robust optical character recognition, exhibits low hallucination rates, and supports over 30 languages while running efficiently on mobile phones. This progress reflects a broader trend: the sizes of high-performing models are decreasing rapidly as edge computing capacity grows, enabling advanced multimodal models to operate locally on consumer hardware. Such developments unlock applications across diverse real-world scenarios, from enhanced mobile AI to privacy-preserving solutions, marking a critical step toward democratizing powerful multimodal intelligence.
ISSN: 2041-1723