Text-guided RGB-P grasp generation

Bibliographic Details
Main Authors: Van Duc Vu, Van Thiep Nguyen, Nam Hai Pham, Dinh-Cuong Hoang, Phan Xuan Tan
Format: Article
Language: English
Published: PeerJ Inc. 2025-08-01
Series: PeerJ Computer Science
Subjects:
Online Access: https://peerj.com/articles/cs-3060.pdf
_version_ 1849395336923054080
author Van Duc Vu
Van Thiep Nguyen
Nam Hai Pham
Dinh-Cuong Hoang
Phan Xuan Tan
author_facet Van Duc Vu
Van Thiep Nguyen
Nam Hai Pham
Dinh-Cuong Hoang
Phan Xuan Tan
author_sort Van Duc Vu
collection DOAJ
description In the field of robotics, object grasping is a complex and challenging task. Although state-of-the-art computer vision-based models have made significant progress in predicting grasps, the lack of semantic information from textual data makes them susceptible to ambiguities in object recognition. For example, when asked to grasp a specific object on a table with many objects, robots relying only on visual data can easily get confused and grasp the wrong object. To address this limitation, we propose a multimodal approach that seamlessly integrates 3D data (shape) and red-green-blue (RGB) images (color, texture) into a unified representation called red-green-blue and point cloud (RGB-P), while also incorporating semantic information from textual descriptions processed by a large language model (LLM) to enhance object disambiguation. This combination of data allows our model to accurately infer and grasp target objects based on natural language descriptions, overcoming the limitations of vision-only approaches. Our approach achieves superior performance, with an average precision (AP) of 53.2% on the GraspNet-1Billion dataset, significantly outperforming state-of-the-art methods. Additionally, we introduce an automated dataset creation pipeline that addresses the challenges of data collection and annotation. This pipeline leverages cutting-edge models: LLMs for text generation, Stable Diffusion for image synthesis, Depth Anything for depth estimation (using standard intrinsic parameters from the Kinect depth sensor to ensure geometric consistency), and GraspNet for grasp estimation. This automated process generates high-quality datasets with paired RGB-P images, textual descriptions, and potential grasp poses, significantly reducing manual effort and enabling large-scale data collection.
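The geometric-consistency step described above — back-projecting an estimated depth map through Kinect intrinsic parameters and pairing each 3D point with its pixel colour to form the RGB-P representation — can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and the intrinsic values are commonly used nominal Kinect defaults, assumed here since the record does not list the exact parameters.

```python
import numpy as np

# Nominal Kinect (640x480) intrinsics; illustrative assumption, not
# the exact values used in the paper.
FX, FY = 525.0, 525.0
CX, CY = 319.5, 239.5

def depth_to_point_cloud(depth, fx=FX, fy=FY, cx=CX, cy=CY):
    """Back-project an HxW depth map (metres) into an Nx3 point cloud
    via the pinhole model: X=(u-cx)*Z/fx, Y=(v-cy)*Z/fy, keeping only
    pixels with valid (positive) depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]

def fuse_rgb_p(rgb, depth, **intrinsics):
    """Pair each valid 3D point with its RGB colour, yielding an Nx6
    'RGB-P' array (x, y, z, r, g, b)."""
    pts = depth_to_point_cloud(depth, **intrinsics)
    colors = rgb.reshape(-1, 3)[depth.reshape(-1) > 0]
    return np.hstack([pts, colors])
```

In the full pipeline sketched by the abstract, `depth` would come from Depth Anything run on a Stable Diffusion image, and the resulting RGB-P array would be fed to GraspNet for grasp-pose annotation; the snippet covers only the geometric fusion step.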
format Article
id doaj-art-8dcfee8b4f254511bcb5f19c1d1a52e8
institution Kabale University
issn 2376-5992
language English
publishDate 2025-08-01
publisher PeerJ Inc.
record_format Article
series PeerJ Computer Science
spelling doaj-art-8dcfee8b4f254511bcb5f19c1d1a52e82025-08-20T03:39:40ZengPeerJ Inc.PeerJ Computer Science2376-59922025-08-0111e306010.7717/peerj-cs.3060Text-guided RGB-P grasp generationVan Duc Vu0Van Thiep Nguyen1Nam Hai Pham2Dinh-Cuong Hoang3Phan Xuan Tan4IT Department, FPT University, Ha Noi, VietnamIT Department, FPT University, Ha Noi, VietnamIT Department, FPT University, Ha Noi, VietnamIT Department, FPT University, Ha Noi, VietnamCollege of Engineering, Shibaura Institute of Technology, Tokyo, JapanIn the field of robotics, object grasping is a complex and challenging task. Although state-of-the-art computer vision-based models have made significant progress in predicting grasps, the lack of semantic information from textual data makes them susceptible to ambiguities in object recognition. For example, when asked to grasp a specific object on a table with many objects, robots relying only on visual data can easily get confused and grasp the wrong object. To address this limitation, we propose a multimodal approach that seamlessly integrates 3D data (shape) and red-green-blue (RGB) images (color, texture) into a unified representation called red-green-blue and point cloud (RGB-P), while also incorporating semantic information from textual descriptions processed by a large language model (LLM) to enhance object disambiguation. This combination of data allows our model to accurately infer and capture target objects based on natural language descriptions, overcoming the limitations of vision-only approaches. Our approach achieves superior performance, with an average precision (AP) of 53.2% on the GraspNet-1Billion dataset, significantly outperforming state-of-the-art methods. Additionally, we introduce an automated dataset creation pipeline that addresses the challenges of data collection and annotation. 
This pipeline leverages cutting-edge models: LLMs for text generation, Stable Diffusion for image synthesis, Depth Anything for depth estimation, using standard intrinsic parameters from the Kinect depth sensor to ensure geometric consistency, and GraspNet for grasp estimation. This automated process generates high-quality datasets with paired RGB-P, images, textual descriptions and potential grasp poses, significantly reducing the manual effort and enabling large-scale data collection.https://peerj.com/articles/cs-3060.pdfGrasp generationLarge language modelsComputer visionMulti-modalRobotics
spellingShingle Van Duc Vu
Van Thiep Nguyen
Nam Hai Pham
Dinh-Cuong Hoang
Phan Xuan Tan
Text-guided RGB-P grasp generation
PeerJ Computer Science
Grasp generation
Large language models
Computer vision
Multi-modal
Robotics
title Text-guided RGB-P grasp generation
title_full Text-guided RGB-P grasp generation
title_fullStr Text-guided RGB-P grasp generation
title_full_unstemmed Text-guided RGB-P grasp generation
title_short Text-guided RGB-P grasp generation
title_sort text guided rgb p grasp generation
topic Grasp generation
Large language models
Computer vision
Multi-modal
Robotics
url https://peerj.com/articles/cs-3060.pdf
work_keys_str_mv AT vanducvu textguidedrgbpgraspgeneration
AT vanthiepnguyen textguidedrgbpgraspgeneration
AT namhaipham textguidedrgbpgraspgeneration
AT dinhcuonghoang textguidedrgbpgraspgeneration
AT phanxuantan textguidedrgbpgraspgeneration