A Multimodal Large Language Model Framework for Intelligent Perception and Decision-Making in Smart Manufacturing

In modern manufacturing, making accurate and timely decisions requires the ability to effectively handle multiple types of data. This paper presents a multimodal system designed specifically for smart manufacturing applications. The system combines various data sources including images, sensor data,...

Bibliographic Details
Main Authors: Tianyu Wang, Bowen Zhang, Daqi Jiang, Dong Li
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Sensors
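Subjects: multimodal large language model; smart manufacturing; semantic tokenization; Transformer model; decision-making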
Online Access: https://www.mdpi.com/1424-8220/25/10/3072
author Tianyu Wang
Bowen Zhang
Daqi Jiang
Dong Li
collection DOAJ
description In modern manufacturing, making accurate and timely decisions requires the ability to effectively handle multiple types of data. This paper presents a multimodal system designed specifically for smart manufacturing applications. The system combines various data sources including images, sensor data, and production records, using advanced multimodal large language models. This approach addresses common limitations of traditional single-modal methods, such as isolated data analysis and poor integration between different data types. Key contributions include a unified method for representing different data types, dynamic semantic tokenization for better data processing, strong alignment strategies across modalities, and a practical two-stage training method involving initial large-scale pretraining and later fine-tuning for specific tasks. Additionally, a novel Transformer-based model is introduced for generating both images and text, significantly improving real-time decision-making capabilities. Experiments on relevant industrial datasets show that this method consistently performs better than current state-of-the-art approaches in tasks like image–text retrieval and visual question answering. The results demonstrate the effectiveness and versatility of the proposed methods, offering important insights and practical solutions to enhance intelligent manufacturing, predictive maintenance, and anomaly detection, thus supporting the development of more efficient and reliable industrial systems.
format Article
id doaj-art-8d3a558968a34693a5efdbefde8e143c
institution OA Journals
issn 1424-8220
language English
publishDate 2025-05-01
publisher MDPI AG
series Sensors
spelling doaj-art-8d3a558968a34693a5efdbefde8e143c
2025-08-20T02:33:48Z
eng
MDPI AG
Sensors, ISSN 1424-8220
2025-05-01, Volume 25, Issue 10, Article 3072
DOI 10.3390/s25103072
A Multimodal Large Language Model Framework for Intelligent Perception and Decision-Making in Smart Manufacturing
Tianyu Wang (State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China)
Bowen Zhang (Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China)
Daqi Jiang (National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110819, China)
Dong Li (State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China)
https://www.mdpi.com/1424-8220/25/10/3072
multimodal large language model
smart manufacturing
semantic tokenization
Transformer model
decision-making
title A Multimodal Large Language Model Framework for Intelligent Perception and Decision-Making in Smart Manufacturing
topic multimodal large language model
smart manufacturing
semantic tokenization
Transformer model
decision-making
url https://www.mdpi.com/1424-8220/25/10/3072