Enabling High-Level Worker-Centric Semantic Understanding of Onsite Images Using Visual Language Models with Attention Mechanism and Beam Search Strategy
Visual information is becoming increasingly essential in construction management. However, a significant portion of this information remains underutilized by construction managers due to the limitations of existing image processing algorithms. These algorithms primarily rely on low-level visual features…
| Main Authors: | Hui Deng, Kejie Fu, Binglin Yu, Huimin Li, Rui Duan, Yichuan Deng, Jia-rui Lin |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-03-01 |
| Series: | Buildings |
| Subjects: | visual language model, construction scene, image scene understanding, image captioning, attention mechanism |
| Online Access: | https://www.mdpi.com/2075-5309/15/6/959 |
| Field | Value |
|---|---|
| _version_ | 1849343260743434240 |
| author | Hui Deng, Kejie Fu, Binglin Yu, Huimin Li, Rui Duan, Yichuan Deng, Jia-rui Lin |
| author_sort | Hui Deng |
| collection | DOAJ |
| description | Visual information is becoming increasingly essential in construction management. However, a significant portion of this information remains underutilized by construction managers due to the limitations of existing image processing algorithms, which rely primarily on low-level visual features and struggle to capture high-order semantic information, leaving a gap between computer-generated image semantics and human interpretation. Moreover, current research lacks a comprehensive justification for employing scene understanding algorithms to close this gap, and the absence of large-scale, high-quality open-source datasets remains a major obstacle to further research progress and algorithmic optimization in this field. To address these issues, this paper proposes a construction scene visual language model based on an attention mechanism and an encoder–decoder architecture, with the encoder built on ResNet101 and the decoder on an LSTM (long short-term memory) network. The attention mechanism and a beam search strategy make the model more accurate and generalizable. To verify the effectiveness of the proposed method, SODA-ktsh, a publicly available construction scene visual-language dataset covering 16 common construction scenes, is built and validated. The experimental results demonstrate that the proposed model achieves a BLEU-4 score of 0.7464, a CIDEr score of 5.0255, and a ROUGE_L score of 0.8106 on the validation set. These results indicate that the model effectively captures and accurately describes the complex semantic information present in construction images. Moreover, the model exhibits strong generalization, perceptual, and recognition capabilities, making it well suited for interpreting and analyzing intricate construction scenes. |
| format | Article |
| id | doaj-art-cc3030dbf4774e76bf99bc41634b92a8 |
| institution | Kabale University |
| issn | 2075-5309 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Buildings |
| doi | 10.3390/buildings15060959 |
| citation | Buildings, vol. 15, no. 6, art. 959, 2025-03-01 |
| author_affiliation | Hui Deng, Kejie Fu, Binglin Yu, Huimin Li, Rui Duan, Yichuan Deng: School of Civil Engineering and Transportation, South China University of Technology, Guangzhou 510641, China; Jia-rui Lin: Department of Civil Engineering, Tsinghua University, Beijing 100084, China |
| title | Enabling High-Level Worker-Centric Semantic Understanding of Onsite Images Using Visual Language Models with Attention Mechanism and Beam Search Strategy |
| topic | visual language model construction scene image scene understanding image captioning attention mechanism |
| url | https://www.mdpi.com/2075-5309/15/6/959 |
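The abstract describes an attention mechanism that lets the LSTM decoder focus on different regions of the ResNet101 feature map at each captioning step. As a hedged illustration only (the paper's exact formulation is not given in this record), a minimal NumPy sketch of additive soft attention might look like the following; the function name `soft_attention` and all weight matrices (`W_f`, `W_h`, `v`) are illustrative assumptions:

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """Additive (Bahdanau-style) soft attention, as commonly paired with a
    CNN encoder and an LSTM decoder in image captioning.

    features: (L, D) grid of encoder feature vectors (e.g. a conv feature map).
    hidden:   (H,) current decoder hidden state.
    Returns the context vector (D,) and the attention weights (L,).
    """
    # Score each of the L spatial regions against the decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v      # (L,)
    # Softmax over regions (subtract max for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                 # (L,), sums to 1
    # Context vector: attention-weighted sum of region features.
    context = weights @ features                             # (D,)
    return context, weights
```

At each decoding step the context vector is concatenated with the word embedding and fed to the LSTM, so the decoder conditions on whichever image regions scored highest for the current state.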
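The title also credits a beam search strategy for decoding captions. As a generic sketch (not the paper's implementation), beam search keeps the `beam_width` highest-scoring partial sequences at each step instead of committing greedily to the single best next token; `step_fn`, `toy_step`, and the token names below are illustrative assumptions:

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Generic beam search decoder.

    step_fn(seq) must return a dict mapping each candidate next token to its
    probability given the sequence so far. The search keeps the beam_width
    highest cumulative log-probability partial sequences at every step.
    """
    beams = [(0.0, [start_token])]            # (cumulative log-prob, sequence)
    completed = []
    for _ in range(max_len):
        candidates = []
        for log_p, seq in beams:
            if seq[-1] == end_token:
                completed.append((log_p, seq))  # finished: retire from the beam
            else:
                for tok, p in step_fn(seq).items():
                    candidates.append((log_p + math.log(p), seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if not beams:
            break
    completed.extend(b for b in beams if b[1][-1] == end_token)
    if not completed:                          # nothing finished within max_len
        completed = beams
    return max(completed, key=lambda c: c[0])[1]

# Toy next-token distribution where the greedy first choice ("a") is not on
# the best overall path: "<s> b </s>" has probability 0.36 vs. 0.33.
def toy_step(seq):
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"</s>": 0.55, "z": 0.45},
        "b": {"</s>": 0.9, "w": 0.1},
        "z": {"</s>": 1.0},
        "w": {"</s>": 1.0},
    }
    return table[seq[-1]]
```

With `beam_width=1` this degenerates to greedy decoding and returns the locally best but globally suboptimal caption, which is why a wider beam tends to raise n-gram metrics such as BLEU-4.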