Cross-Modal Collaboration and Robust Feature Classifier for Open-Vocabulary 3D Object Detection

Bibliographic Details
Main Authors: Hengsong Liu, Tongle Duan
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: Sensors
Subjects: 3D object detection; multi-sensor fusion; zero-shot learning; autonomous driving
Online Access: https://www.mdpi.com/1424-8220/25/2/553
author Hengsong Liu
Tongle Duan
collection DOAJ
description Multi-sensor fusion, such as LiDAR- and camera-based 3D object detection, is a key technology in autonomous driving and robotics. However, traditional 3D detection models are limited to recognizing predefined categories and struggle with unknown or novel objects. Given the complexity of real-world environments, research into open-vocabulary 3D object detection is essential. This paper therefore addresses two key issues in this area: how to localize novel objects, and how to classify them. We propose Cross-Modal Collaboration and a Robust Feature Classifier to improve localization accuracy and classification robustness for novel objects. Cross-Modal Collaboration performs collaborative localization between the LiDAR and the camera: 2D images provide preliminary regions of interest for novel objects in the 3D point cloud, while the 3D point cloud feeds more precise positional information back to the 2D images. Through iterative updates between the two modalities, the preliminary regions and the positional information are refined, yielding accurate localization of novel objects. The Robust Feature Classifier aims to classify novel objects accurately. To prevent them from being misidentified as background or assigned to incorrect categories, it maps the semantic vector of each new category to multiple sets of visual features that are distinguishable from the background, and clusters these visual features around their individual semantic vectors to maintain inter-class separability. Our method achieves state-of-the-art performance across various scenarios and datasets.
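
To make the Cross-Modal Collaboration idea concrete, below is a minimal geometric sketch of the 2D-3D alternation the abstract describes: a 2D region of interest selects LiDAR points via a frustum crop, the selected points yield a 3D box, and that box is projected back to tighten the 2D region. The pinhole projection, the axis-aligned boxes, and the names `project` and `localize` are illustrative assumptions, not the authors' implementation, which the abstract describes only at a high level.

```python
# Illustrative toy of the iterative 2D<->3D localization loop; the names
# and the purely geometric refinement are assumptions, not the paper's code.
import numpy as np

def project(points, K):
    """Project (N, 3) camera-frame points (z > 0) to (N, 2) pixel coords."""
    uv = (K @ points.T).T          # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]  # divide by depth

def localize(points, K, box2d, n_iters=3):
    """Alternate between a 2D region and the 3D points it contains.

    points: (N, 3) LiDAR points in the camera frame
    K:      (3, 3) camera intrinsic matrix
    box2d:  preliminary region of interest [u_min, v_min, u_max, v_max]
    """
    box3d = None
    for _ in range(n_iters):
        # 2D -> 3D: keep points whose projection falls inside the region
        # (a frustum crop standing in for image-guided 3D localization).
        uv = project(points, K)
        inside = ((uv[:, 0] >= box2d[0]) & (uv[:, 0] <= box2d[2]) &
                  (uv[:, 1] >= box2d[1]) & (uv[:, 1] <= box2d[3]))
        obj = points[inside]
        if obj.shape[0] == 0:
            break
        # Fit a 3D box to the cropped points (axis-aligned for simplicity).
        box3d = np.concatenate([obj.min(axis=0), obj.max(axis=0)])
        # 3D -> 2D: project the object points back to tighten the region
        # for the next iteration.
        uv_obj = project(obj, K)
        box2d = np.concatenate([uv_obj.min(axis=0), uv_obj.max(axis=0)])
    return box2d, box3d
```

In the full method the per-modality updates are presumably learned rather than a plain frustum crop; this sketch only reproduces the data flow (image region to point subset to 3D box and back).

A similar toy for the Robust Feature Classifier: each novel category's semantic vector is expanded into several visual-feature prototypes clustered tightly around it, and a region feature is labeled background unless it beats the background similarity by a margin. The perturbation-based semantic-to-visual mapping, the margin test, and all names here are hypothetical; the paper's actual mapping and clustering are not specified at this level of detail.

```python
# Illustrative toy of the classifier idea: per-class prototype sets kept
# separable from a background prototype. All names and values are assumptions.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def make_prototypes(semantic_vecs, n_proto=4, noise=0.05, seed=0):
    """Map each category's semantic vector to several visual prototypes,
    clustered around it so categories stay separable."""
    rng = np.random.default_rng(seed)
    protos, labels = [], []
    for c, s in enumerate(semantic_vecs):
        # Perturbed copies stand in for a learned one-to-many
        # semantic-to-visual mapping.
        p = normalize(s + noise * rng.standard_normal((n_proto, s.size)))
        protos.append(p)
        labels.extend([c] * n_proto)
    return np.vstack(protos), np.array(labels)

def classify(feat, protos, labels, bg_proto, margin=0.05):
    """Assign the nearest prototype's class, unless the feature does not
    clearly beat the background similarity (then return -1)."""
    feat = normalize(feat)
    sims = protos @ feat
    best = int(sims.argmax())
    if sims[best] < float(bg_proto @ feat) + margin:
        return -1  # background
    return int(labels[best])

# Toy usage: 3 novel categories with 16-dim embeddings.
rng = np.random.default_rng(1)
sem = normalize(rng.standard_normal((3, 16)))
protos, labels = make_prototypes(sem)
bg = normalize(rng.standard_normal(16))
print(classify(sem[1] + 0.01 * rng.standard_normal(16), protos, labels, bg))  # likely 1
```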
format Article
id doaj-art-5b907241b0ed40c89ebd3aafefb8078a
institution Kabale University
issn 1424-8220
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Sensors
doi 10.3390/s25020553
author_affiliations The 54th Research Institute, China Electronics Technology Group Corporation, College of Signal and Information Processing, Shijiazhuang 050081, China (both authors)
title Cross-Modal Collaboration and Robust Feature Classifier for Open-Vocabulary 3D Object Detection
topic 3D object detection
multi-sensor fusion
zero-shot learning
autonomous driving
url https://www.mdpi.com/1424-8220/25/2/553