Entity-level cross-modal fusion for multimodal Chinese agricultural diseases and pests named entity recognition

Named Entity Recognition (NER), as one of the popular directions in natu...

Bibliographic Details
Main Authors: Jingzhong Huang, Xia Hao, Yu Wang, Ruizhi Song, Zenan Mu, Wen Chu, Georgios Papadakis, Sijie Niu, Xuchao Guo
Format: Article
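Language: English
Published: Elsevier 2025-12-01
Series: Smart Agricultural Technology
Subjects: Chinese named entity recognition; Multimodal named entity recognition; Multimodal alignment; Dynamic attention mechanism; Dual stream encoder
Online Access: http://www.sciencedirect.com/science/article/pii/S2772375525004198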
_version_ 1849714010993197056
author Jingzhong Huang
Xia Hao
Yu Wang
Ruizhi Song
Zenan Mu
Wen Chu
Georgios Papadakis
Sijie Niu
Xuchao Guo
author_sort Jingzhong Huang
collection DOAJ
description Named Entity Recognition (NER), one of the most active research directions in natural language processing, plays a critical role in fields such as information extraction and agricultural knowledge graph construction. However, traditional single-modal methods based on pure text often face limitations in agricultural entity recognition, such as ambiguous text descriptions, limited context, and a lack of information fusion capabilities. This paper addresses these limitations by introducing an agricultural multimodal NER model that uses entity-level cross-modal alignment. First, we propose a Dual-Stream Entity-Level Feature Encoder. The text stream employs a Boundary-Middle (B-M) classification strategy to achieve fine-grained semantic unit segmentation, effectively addressing long-entity boundary ambiguity and parallel computing challenges. The visual stream focuses on region-of-interest detection to enhance multi-scale visual entity feature extraction. Second, we introduce a Dynamic Cross-modal Gated Attention (DCGA) mechanism that adaptively adjusts visual feature contributions through gating weights. This approach integrates cross-modal contrastive learning to strengthen semantic connections at the entity level between images and text. To validate the model's effectiveness, we constructed a multimodal NER dataset containing 12,074 sample pairs across 10 entity categories, covering 10 crops, 82 typical diseases/pests, and related agrochemical data.
The proposed method achieves a macro-average F1 score of 90.73% across 10 agricultural entity types, outperforming single-modal baselines by 5.96%, mainstream multimodal NER models by 3.06%, zero-shot GPT models by 11.41%, and fine-tuned multimodal large models by 2.1%. Comprehensive experimental results indicate that our multimodal collaborative learning framework effectively enhances agricultural entity recognition accuracy, providing reliable technical support for downstream applications such as agricultural knowledge graph construction and intelligent question answering. [Footnote 1: To improve clarity and accessibility for readers unfamiliar with the topic, we provide definitions of key terms used throughout the paper, along with relevant references for further reading, as shown in Table 5 in Appendix A.]
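The macro-average F1 reported in the description is the unweighted mean of the per-type F1 scores, so rare entity types weigh as much as frequent ones. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function names, the three entity type labels, and all counts are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one entity type from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    # Convention assumed here: a type with no predictions and no gold entities scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-type F1 scores; every entity type counts equally."""
    if not counts:
        return 0.0
    return sum(f1_score(*c) for c in counts.values()) / len(counts)


# Hypothetical (tp, fp, fn) counts for three illustrative entity types.
example_counts = {
    "crop": (90, 10, 10),       # per-type F1 = 0.9
    "disease": (40, 10, 10),    # per-type F1 = 0.8
    "pesticide": (5, 5, 5),     # per-type F1 = 0.5
}

if __name__ == "__main__":
    print(f"macro-F1 = {macro_f1(example_counts):.4f}")
```

With these made-up counts the rare "pesticide" type pulls the macro average down to about 0.73, whereas pooling all counts first (a micro average) would give about 0.84. That sensitivity to rare types is why macro averaging is a common choice when entity categories are imbalanced.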
format Article
id doaj-art-4cc53f53e3de46989a31c8407f019367
institution DOAJ
issn 2772-3755
language English
publishDate 2025-12-01
publisher Elsevier
record_format Article
series Smart Agricultural Technology
spelling doaj-art-4cc53f53e3de46989a31c8407f019367
2025-08-20T03:13:49Z
eng
Elsevier
Smart Agricultural Technology
2772-3755
2025-12-01
12
101188
10.1016/j.atech.2025.101188
Entity-level cross-modal fusion for multimodal Chinese agricultural diseases and pests named entity recognition
Jingzhong Huang, College of Information Science and Engineering, Shandong Agricultural University, Tai’an 271000, China
Xia Hao, College of Information Science and Engineering, Shandong Agricultural University, Tai’an 271000, China
Yu Wang, College of Information Science and Engineering, Shandong Agricultural University, Tai’an 271000, China
Ruizhi Song, College of Information Science and Engineering, Shandong Agricultural University, Tai’an 271000, China
Zenan Mu, College of Information Science and Engineering, Shandong Agricultural University, Tai’an 271000, China
Wen Chu, College of Information Science and Engineering, Shandong Agricultural University, Tai’an 271000, China
Georgios Papadakis, Digital Twin Agricultural Technology Research Center, Shandong Agricultural University, Tai'an 271018, China; Agricultural University of Athens, Dept of Natural Resources and Agricultural Engineering, Athens, Greece
Sijie Niu, School of Information Science and Technology, University of Jinan, Jinan 250022, China; Shandong Key Laboratory of Ubiquitous Intelligent Computing, Jinan 250022, China
Xuchao Guo (corresponding author), College of Information Science and Engineering, Shandong Agricultural University, Tai’an 271000, China
title Entity-level cross-modal fusion for multimodal Chinese agricultural diseases and pests named entity recognition
topic Chinese named entity recognition
Multimodal named entity recognition
Multimodal alignment
Dynamic attention mechanism
Dual stream encoder
url http://www.sciencedirect.com/science/article/pii/S2772375525004198