E-CLIP: An Enhanced CLIP-Based Visual Language Model for Fruit Detection and Recognition

With the progress of agricultural modernization, intelligent fruit harvesting is gaining importance. While fruit detection and recognition are essential for robotic harvesting, existing methods suffer from limited generalizability, including adapting to complex environments and handling new fruit varieties. This problem stems from their reliance on unimodal visual data, which creates a semantic gap between image features and contextual understanding. To solve these issues, this study proposes a multi-modal fruit detection and recognition framework based on visual language models (VLMs).

Bibliographic Details
Main Authors: Yi Zhang, Yang Shao, Chen Tang, Zhenqing Liu, Zhengda Li, Ruifang Zhai, Hui Peng, Peng Song
Format: Article
Language: English
Published: MDPI AG, 2025-05-01
Series: Agriculture
Subjects: visual language models; contrast learning; smart agriculture
Online Access: https://www.mdpi.com/2077-0472/15/11/1173
collection DOAJ
description With the progress of agricultural modernization, intelligent fruit harvesting is gaining importance. While fruit detection and recognition are essential for robotic harvesting, existing methods suffer from limited generalizability, including adapting to complex environments and handling new fruit varieties. This problem stems from their reliance on unimodal visual data, which creates a semantic gap between image features and contextual understanding. To solve these issues, this study proposes a multi-modal fruit detection and recognition framework based on visual language models (VLMs). By integrating multi-modal information, the proposed model enhances robustness and generalization across diverse environmental conditions and fruit types. The framework accepts natural language instructions as input, facilitating effective human–machine interaction. Through its core module, Enhanced Contrastive Language–Image Pre-Training (E-CLIP), which employs image–image and image–text contrastive learning mechanisms, the framework achieves robust recognition of various fruit types and their maturity levels. Experimental results demonstrate the excellent performance of the model, achieving an F1 score of 0.752, and an mAP@0.5 of 0.791. The model also exhibits robustness under occlusion and varying illumination conditions, attaining a zero-shot mAP@0.5 of 0.626 for unseen fruits. In addition, the system operates at an inference speed of 54.82 FPS, effectively balancing speed and accuracy, and shows practical potential for smart agriculture. This research provides new insights and methods for the practical application of smart agriculture.
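The core mechanism the abstract names is image–text contrastive learning in the style of CLIP: matched image and text embeddings are pulled together while all other pairings in the batch act as negatives. The following is a minimal NumPy sketch of such a symmetric contrastive (InfoNCE) loss for illustration only; it is not the authors' E-CLIP implementation, and the function names, embedding sizes, and temperature value are assumptions.

```python
# Illustrative sketch of a CLIP-style symmetric image-text contrastive
# loss. Matched pairs sit on the diagonal of the similarity matrix;
# every other pair in the batch serves as a negative. Hypothetical
# names and temperature; not the paper's actual E-CLIP code.
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosines.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    labels = np.arange(len(logits))              # i-th image matches i-th text

    def cross_entropy(lg, lab):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lab)), lab].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))   # 4 image embeddings, dim 8
txt_emb = rng.normal(size=(4, 8))   # 4 text embeddings, dim 8
loss = clip_contrastive_loss(img_emb, txt_emb)
```

The paper's E-CLIP additionally uses an image–image contrastive term; under the same assumptions that would be the analogous loss computed between two image-embedding views rather than between image and text embeddings.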
id doaj-art-e4499c3eadf848cd8cc86ceee8427592
institution OA Journals
issn 2077-0472
doi 10.3390/agriculture15111173
citation Agriculture, vol. 15, no. 11, article 1173 (2025-05-01)
affiliation Yi Zhang, Ruifang Zhai, Hui Peng: College of Informatics, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
affiliation Yang Shao, Chen Tang, Zhenqing Liu, Zhengda Li, Peng Song: College of Plant Science and Technology, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
topic visual language models
contrast learning
smart agriculture
url https://www.mdpi.com/2077-0472/15/11/1173