Entity-level cross-modal fusion for multimodal Chinese agricultural diseases and pests named entity recognition
| Main Authors: | , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-12-01 |
| Series: | Smart Agricultural Technology |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2772375525004198 |
| Summary: | Named Entity Recognition (NER), one of the popular directions in natural language processing, plays a critical role in fields such as information extraction and agricultural knowledge graph construction. (To improve clarity and accessibility for readers unfamiliar with the topic, definitions of key terms used throughout the paper, along with relevant references for further reading, are provided in Table 5 in Appendix A.) However, traditional single-modal methods based on pure text often face limitations in agricultural entity recognition, such as ambiguous text descriptions, limited context, and a lack of information fusion capabilities. This paper addresses those limitations by introducing an agricultural multimodal NER model that uses entity-level cross-modal alignment. First, we propose a Dual-Stream Entity-Level Feature Encoder. The text stream employs a Boundary-Middle (B-M) classification strategy to achieve fine-grained semantic unit segmentation, addressing long-entity boundary ambiguity and parallel computing challenges. The visual stream focuses on region-of-interest detection to enhance multi-scale visual entity feature extraction. Second, we introduce a Dynamic Cross-modal Gated Attention (DCGA) mechanism that adaptively adjusts visual feature contributions through gating weights. This approach integrates cross-modal contrastive learning to strengthen semantic connections at the entity level between images and text. To validate the model's effectiveness, we constructed a multimodal NER dataset containing 12,074 sample pairs across 10 entity categories, covering 10 crops, 82 typical diseases/pests, and related agrochemical data. The proposed method achieves a macro-average F1 score of 90.73% across 10 agricultural entity types, outperforming single-modal baselines by 5.96%, mainstream multimodal NER models by 3.06%, zero-shot GPT models by 11.41%, and fine-tuned multimodal large models by 2.1%. Comprehensive experimental results indicate that our multimodal collaborative learning framework effectively enhances agricultural entity recognition accuracy, providing reliable technical support for downstream applications such as agricultural knowledge graph construction and intelligent question answering. |
|---|---|
| ISSN: | 2772-3755 |
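
The summary describes a gated attention mechanism that adaptively adjusts how much visual features contribute to each text entity. The record gives no equations, so the following is only a minimal sketch of generic gated cross-modal attention, not the paper's DCGA. The function name `gated_fusion`, the scalar sigmoid gate, and the parameters `gate_weights` and `gate_bias` are all hypothetical choices made for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_fusion(text_vec, region_vecs, gate_weights, gate_bias=0.0):
    """Fuse one text entity vector with a set of visual region vectors.

    Illustrative assumptions (not taken from the paper):
      1. Attention: scaled dot-product scores between text and each region.
      2. Gate: a scalar sigmoid over the concatenation [text; visual context].
      3. Output: text + gate * visual context.
    """
    d = len(text_vec)
    scores = [dot(text_vec, r) / math.sqrt(d) for r in region_vecs]
    attn = softmax(scores)
    # Attention-weighted average of the region vectors.
    context = [sum(a * r[i] for a, r in zip(attn, region_vecs)) for i in range(d)]
    # gate_weights must have length 2 * d (text part, then context part).
    gate_logit = dot(gate_weights, text_vec + context) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    fused = [t + gate * c for t, c in zip(text_vec, context)]
    return fused, gate, attn


# Toy example: a 2-d text entity vector and two candidate image regions.
text_vec = [1.0, 0.0]
region_vecs = [[1.0, 0.0], [0.0, 1.0]]
fused, gate, attn = gated_fusion(text_vec, region_vecs, [0.0, 0.0, 0.0, 0.0])
```

In this sketch a gate near 0 leaves the text representation almost unchanged, which is one way a model could suppress an irrelevant image, while a gate near 1 mixes in the full attended visual context. In a trained model the gate parameters would be learned rather than set by hand.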