Entity-level cross-modal fusion for multimodal Chinese agricultural diseases and pests named entity recognition



Bibliographic Details
Main Authors: Jingzhong Huang, Xia Hao, Yu Wang, Ruizhi Song, Zenan Mu, Wen Chu, Georgios Papadakis, Sijie Niu, Xuchao Guo
Format: Article
Language: English
Published: Elsevier 2025-12-01
Series: Smart Agricultural Technology
Online Access: http://www.sciencedirect.com/science/article/pii/S2772375525004198
Description
Summary: Named Entity Recognition (NER), as one of the popular directions in natural language processing, plays a critical role in fields such as information extraction and agricultural knowledge graph construction. (Definitions of key terms used throughout the paper, along with references for further reading, are provided in Table 5 in Appendix A.) However, traditional single-modal methods based on pure text often face limitations in agricultural entity recognition, such as ambiguous text descriptions, limited context, and a lack of information-fusion capability. This paper overcomes those limitations by introducing an agricultural multimodal NER model that uses entity-level cross-modal alignment. First, we propose a Dual-Stream Entity-Level Feature Encoder. The text stream employs a Boundary-Middle (B-M) classification strategy to achieve fine-grained semantic-unit segmentation, effectively addressing long-entity boundary ambiguity and parallel-computing challenges. The visual stream focuses on region-of-interest detection to enhance multi-scale visual entity feature extraction. Second, we introduce a Dynamic Cross-modal Gated Attention (DCGA) mechanism that adaptively adjusts visual feature contributions through gating weights. This approach integrates cross-modal contrastive learning to strengthen entity-level semantic connections between images and text. To validate the model's effectiveness, we constructed a multimodal NER dataset containing 12,074 sample pairs across 10 entity categories, covering 10 crops, 82 typical diseases/pests, and related agrochemical data. The proposed method achieves a macro-average F1 score of 90.73 % across 10 agricultural entity types, outperforming single-modal baselines by +5.96 %, mainstream multimodal NER models by +3.06 %, zero-shot GPT models by +11.41 %, and fine-tuned multimodal large models by +2.1 %.
Comprehensive experimental results indicate that our multimodal collaborative learning framework effectively enhances agricultural entity recognition accuracy, providing reliable technical support for downstream applications such as agricultural knowledge graph construction and intelligent question answering.
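The DCGA mechanism described in the summary can be read as a gated residual fusion: text entity features attend over visual region features, and a learned sigmoid gate decides how much of the attended visual evidence is mixed back into the text representation. The abstract gives no implementation details, so the single-head scaled dot-product attention, the concatenation-based gate, and all names and parameter shapes below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_modal_fusion(text, visual, W_gate, b_gate):
    """Sketch of a dynamic cross-modal gated attention step (assumed form).

    text:   (T, D) entity-level text features
    visual: (R, D) region-of-interest visual features
    W_gate: (2D, D), b_gate: (D,) -- parameters of a hypothetical gating layer
    """
    d = text.shape[1]
    # Text queries attend over visual region keys/values.
    scores = text @ visual.T / np.sqrt(d)          # (T, R)
    attended = softmax(scores, axis=-1) @ visual   # (T, D)
    # Sigmoid gate on [text; attended] scales the visual contribution per entity.
    g = sigmoid(np.concatenate([text, attended], axis=-1) @ W_gate + b_gate)
    return text + g * attended                     # gated residual fusion
```

When the gate saturates near zero the model falls back to text-only features, which matches the stated goal of adaptively adjusting visual contributions rather than always fusing both modalities.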
ISSN: 2772-3755