PlanText: Gradually Masked Guidance to Align Image Phenotypes with Trait Descriptions for Plant Disease Texts

Bibliographic Details
Main Authors: Kejun Zhao, Xingcai Wu, Yuanyuan Xiao, Sijun Jiang, Peijia Yu, Yazhou Wang, Qi Wang
Format: Article
Language: English
Published: American Association for the Advancement of Science (AAAS) 2024-01-01
Series: Plant Phenomics
Online Access: https://spj.science.org/doi/10.34133/plantphenomics.0272
Description
Summary: Plant diseases are a critical driver of the global food crisis. The integration of advanced artificial intelligence technologies can substantially enhance plant disease diagnostics. However, early detection and the detection of complex cases remain challenging for current methods. Employing multimodal technologies, akin to medical artificial intelligence diagnostics that combine diverse data types, may offer a more effective solution. At present, plant disease research relies predominantly on single-modal data, which limits the scope for early and detailed diagnosis. Consequently, developing techniques for generating the text modality is essential for overcoming these limitations in plant disease recognition. To this end, we propose a method for aligning plant phenotypes with trait descriptions, which generates diagnostic text by progressively masking disease images. First, for training and validation, we annotate 5,728 disease phenotype images with expert diagnostic text and provide annotated text and trait labels for 210,000 disease images. Then, we propose the PhenoTrait text description model, which combines global and heterogeneous feature encoders with switching-attention decoders to produce accurate, context-aware output. Next, to generate a more phenotypically appropriate description, we embed image features into semantic structures in 3 stages, producing characterizations that preserve trait features. Finally, experimental results show that our model outperforms several frontier models, including the larger GPT-4 and GPT-4o, on multiple trait descriptions. Our code and dataset are available at https://plantext.samlab.cn/.
ISSN: 2643-6515
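
Note: The summary mentions progressively masking disease images but gives no details of how. As a rough illustration only, the sketch below shows one way a progressive (nested) patch-masking schedule could look. It is not the authors' implementation. The function names `progressive_mask_schedule` and `apply_mask`, the mask ratios, the patch count, and the fill value are all hypothetical choices made for this example.

```python
import random


def progressive_mask_schedule(num_patches, ratios=(0.25, 0.5, 0.75), seed=0):
    """Return one set of masked patch indices per stage.

    Assumption for illustration: masks are nested, so every stage hides
    all patches hidden by the previous stage plus some new ones.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # fixed random order in which patches get masked
    # Taking growing prefixes of the same order guarantees nesting.
    return [set(order[: int(num_patches * r)]) for r in ratios]


def apply_mask(patches, masked, fill=0.0):
    """Replace masked patches with a fill value (toy stand-in for image patches)."""
    return [fill if i in masked else p for i, p in enumerate(patches)]


# Toy "image": 16 patches, each represented by a single non-zero number.
patches = [float(i + 1) for i in range(16)]
stages = progressive_mask_schedule(len(patches))
for stage_index, masked in enumerate(stages, start=1):
    visible = apply_mask(patches, masked)
    print(f"stage {stage_index}: {len(masked)} of {len(patches)} patches masked")
```

Growing prefixes of a single shuffled order keep the masks nested, so each stage strictly extends the previous one. Whether the published method masks in this way is not stated in this record.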