NavBLIP: a visual-language model for enhancing unmanned aerial vehicles navigation and object detection

Introduction: In recent years, Unmanned Aerial Vehicles (UAVs) have increasingly been deployed in applications such as autonomous navigation, surveillance, and object detection. Traditional methods for UAV navigation and object detection have often relied on either handcrafted features or unimodal deep learning approaches. While these methods have seen some success, they frequently encounter limitations in dynamic environments, where robustness and computational efficiency become critical for real-time performance. Additionally, these methods often fail to effectively integrate multimodal inputs, which restricts their adaptability and generalization when facing complex and diverse scenarios.

Methods: To address these challenges, we introduce NavBLIP, a novel visual-language model specifically designed to enhance UAV navigation and object detection by utilizing multimodal data. NavBLIP incorporates transfer learning techniques along with a Nuisance-Invariant Multimodal Feature Extraction (NIMFE) module. The NIMFE module disentangles relevant features from intricate visual and environmental inputs, allowing UAVs to adapt swiftly to new environments and improving object detection accuracy. Furthermore, NavBLIP employs a multimodal control strategy that dynamically selects context-specific features to optimize real-time performance, ensuring efficiency in high-stakes operations.

Results and discussion: Extensive experiments on benchmark datasets such as RefCOCO, CC12M, and OpenImages show that NavBLIP outperforms existing state-of-the-art models in terms of accuracy, recall, and computational efficiency. An ablation study further highlights the contribution of the NIMFE and transfer learning components to the model's performance, underscoring NavBLIP's potential for real-time UAV applications where adaptability and computational efficiency are paramount.
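
The abstract outlines two core ideas: a Nuisance-Invariant Multimodal Feature Extraction (NIMFE) module that disentangles task-relevant features from visual and environmental inputs, and a control strategy that dynamically selects context-specific features. What follows is a minimal, hypothetical PyTorch sketch of how such components could be wired together. The class names (NIMFEBlock, ContextGatedFusion), feature dimensions, and gating design are illustrative assumptions and do not reflect the authors' implementation, which is described only in the linked article.

# Hypothetical sketch (not the authors' code): a nuisance-invariant multimodal
# feature extractor with a context-gated fusion head, loosely following the
# ideas summarized in the abstract. All names and dimensions are assumptions.
import torch
import torch.nn as nn


class NIMFEBlock(nn.Module):
    """Projects visual and environmental inputs into a shared space and
    suppresses nuisance channels with a learned soft mask."""

    def __init__(self, vis_dim=512, env_dim=64, hidden_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.env_proj = nn.Linear(env_dim, hidden_dim)
        # Soft mask in [0, 1] that down-weights nuisance feature channels.
        self.nuisance_mask = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim), nn.Sigmoid()
        )

    def forward(self, vis_feat, env_feat):
        v = torch.relu(self.vis_proj(vis_feat))
        e = torch.relu(self.env_proj(env_feat))
        mask = self.nuisance_mask(torch.cat([v, e], dim=-1))
        return v * mask  # nuisance-suppressed representation


class ContextGatedFusion(nn.Module):
    """Re-weights fused features from a context vector, mimicking the
    'context-specific feature selection' described in the abstract."""

    def __init__(self, hidden_dim=256, ctx_dim=32):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(ctx_dim, hidden_dim), nn.Sigmoid())
        self.head = nn.Linear(hidden_dim, 4)  # e.g. a box-regression head (assumed)

    def forward(self, fused_feat, context):
        return self.head(fused_feat * self.gate(context))


if __name__ == "__main__":
    nimfe = NIMFEBlock()
    fusion = ContextGatedFusion()
    vis = torch.randn(2, 512)   # image features from a vision backbone (assumed)
    env = torch.randn(2, 64)    # environmental/telemetry features (assumed)
    ctx = torch.randn(2, 32)    # navigation context vector (assumed)
    boxes = fusion(nimfe(vis, env), ctx)
    print(boxes.shape)          # torch.Size([2, 4])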

Bibliographic Details
Main Authors: Ye Li, Li Yang, Meifang Yang, Fei Yan, Tonghua Liu, Chensi Guo, Rufeng Chen
Author Affiliations: Department of Electrical Engineering, Baotou Iron and Steel Vocational Technical College, Baotou, China (Ye Li, Li Yang, Meifang Yang, Fei Yan, Tonghua Liu, Chensi Guo); Baotou Iron and Steel (Group) Co., Ltd., Baotou, China (Rufeng Chen)
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-01-01
Series: Frontiers in Neurorobotics
ISSN: 1662-5218
DOI: 10.3389/fnbot.2024.1513354
Subjects: UAV navigation; object detection; multimodal learning; transfer learning; computational efficiency
Online Access: https://www.frontiersin.org/articles/10.3389/fnbot.2024.1513354/full