NavBLIP: a visual-language model for enhancing unmanned aerial vehicles navigation and object detection

Introduction: In recent years, Unmanned Aerial Vehicles (UAVs) have increasingly been deployed in applications such as autonomous navigation, surveillance, and object detection. Traditional methods for UAV navigation and object detection have often relied on either handcrafted features or unimodal deep learning approaches. While these methods have seen some success, they frequently encounter limitations in dynamic environments, where robustness and computational efficiency become critical for real-time performance. Additionally, these methods often fail to effectively integrate multimodal inputs, which restricts their adaptability and generalization when facing complex and diverse scenarios.

Methods: To address these challenges, we introduce NavBLIP, a novel visual-language model specifically designed to enhance UAV navigation and object detection by utilizing multimodal data. NavBLIP incorporates transfer learning techniques along with a Nuisance-Invariant Multimodal Feature Extraction (NIMFE) module. The NIMFE module disentangles relevant features from intricate visual and environmental inputs, allowing UAVs to adapt swiftly to new environments and improving object detection accuracy. Furthermore, NavBLIP employs a multimodal control strategy that dynamically selects context-specific features to optimize real-time performance, ensuring efficiency in high-stakes operations.

Results and discussion: Extensive experiments on benchmark datasets such as RefCOCO, CC12M, and OpenImages show that NavBLIP outperforms existing state-of-the-art models in terms of accuracy, recall, and computational efficiency. An ablation study further highlights the contribution of the NIMFE and transfer learning components to the model's performance, underscoring NavBLIP's potential for real-time UAV applications where adaptability and computational efficiency are paramount.
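
The abstract outlines two core ideas: a Nuisance-Invariant Multimodal Feature Extraction (NIMFE) module that disentangles task-relevant features from visual and environmental inputs, and a control strategy that dynamically selects context-specific features. What follows is a minimal, hypothetical PyTorch sketch of how such components could be wired together. The class names (NIMFEBlock, ContextGatedFusion), feature dimensions, and gating design are illustrative assumptions and do not reflect the authors' implementation, which is described only in the linked article.

# Hypothetical sketch (not the authors' code): a nuisance-invariant multimodal
# feature extractor with a context-gated fusion head, loosely following the
# ideas summarized in the abstract. All names and dimensions are assumptions.
import torch
import torch.nn as nn


class NIMFEBlock(nn.Module):
    """Projects visual and environmental inputs into a shared space and
    suppresses nuisance channels with a learned soft mask."""

    def __init__(self, vis_dim=512, env_dim=64, hidden_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.env_proj = nn.Linear(env_dim, hidden_dim)
        # Soft mask in [0, 1] that down-weights nuisance feature channels.
        self.nuisance_mask = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim), nn.Sigmoid()
        )

    def forward(self, vis_feat, env_feat):
        v = torch.relu(self.vis_proj(vis_feat))
        e = torch.relu(self.env_proj(env_feat))
        mask = self.nuisance_mask(torch.cat([v, e], dim=-1))
        return v * mask  # nuisance-suppressed representation


class ContextGatedFusion(nn.Module):
    """Re-weights fused features from a context vector, mimicking the
    'context-specific feature selection' described in the abstract."""

    def __init__(self, hidden_dim=256, ctx_dim=32):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(ctx_dim, hidden_dim), nn.Sigmoid())
        self.head = nn.Linear(hidden_dim, 4)  # e.g. a box-regression head (assumed)

    def forward(self, fused_feat, context):
        return self.head(fused_feat * self.gate(context))


if __name__ == "__main__":
    nimfe = NIMFEBlock()
    fusion = ContextGatedFusion()
    vis = torch.randn(2, 512)   # image features from a vision backbone (assumed)
    env = torch.randn(2, 64)    # environmental/telemetry features (assumed)
    ctx = torch.randn(2, 32)    # navigation context vector (assumed)
    boxes = fusion(nimfe(vis, env), ctx)
    print(boxes.shape)          # torch.Size([2, 4])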

Bibliographic Details
Main Authors: Ye Li, Li Yang, Meifang Yang, Fei Yan, Tonghua Liu, Chensi Guo, Rufeng Chen
Author Affiliations: Department of Electrical Engineering, Baotou Iron and Steel Vocational Technical College, Baotou, China (Ye Li, Li Yang, Meifang Yang, Fei Yan, Tonghua Liu, Chensi Guo); Baotou Iron and Steel (Group) Co., Ltd., Baotou, China (Rufeng Chen)
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-01-01
Series: Frontiers in Neurorobotics
ISSN: 1662-5218
DOI: 10.3389/fnbot.2024.1513354
Subjects: UAV navigation; object detection; multimodal learning; transfer learning; computational efficiency
Online Access: https://www.frontiersin.org/articles/10.3389/fnbot.2024.1513354/full