Text-guided RGB-P grasp generation
| Main Authors: | Van Duc Vu, Van Thiep Nguyen, Nam Hai Pham, Dinh-Cuong Hoang, Phan Xuan Tan |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | PeerJ Inc., 2025-08-01 |
| Series: | PeerJ Computer Science |
| Subjects: | Grasp generation; Large language models; Computer vision; Multi-modal; Robotics |
| Online Access: | https://peerj.com/articles/cs-3060.pdf |
| Field | Value |
|---|---|
| author | Van Duc Vu; Van Thiep Nguyen; Nam Hai Pham; Dinh-Cuong Hoang; Phan Xuan Tan |
| collection | DOAJ |
| description | In the field of robotics, object grasping is a complex and challenging task. Although state-of-the-art computer vision-based models have made significant progress in predicting grasps, the lack of semantic information from textual data makes them susceptible to ambiguities in object recognition. For example, when asked to grasp a specific object on a table with many objects, robots relying only on visual data can easily get confused and grasp the wrong object. To address this limitation, we propose a multimodal approach that seamlessly integrates 3D data (shape) and red-green-blue (RGB) images (color, texture) into a unified representation called red-green-blue and point cloud (RGB-P), while also incorporating semantic information from textual descriptions processed by a large language model (LLM) to enhance object disambiguation. This combination of data allows our model to accurately infer and capture target objects based on natural language descriptions, overcoming the limitations of vision-only approaches. Our approach achieves superior performance, with an average precision (AP) of 53.2% on the GraspNet-1Billion dataset, significantly outperforming state-of-the-art methods. Additionally, we introduce an automated dataset creation pipeline that addresses the challenges of data collection and annotation. This pipeline leverages cutting-edge models: LLMs for text generation, Stable Diffusion for image synthesis, Depth Anything for depth estimation, using standard intrinsic parameters from the Kinect depth sensor to ensure geometric consistency, and GraspNet for grasp estimation. This automated process generates high-quality datasets with paired RGB-P images, textual descriptions, and potential grasp poses, significantly reducing the manual effort and enabling large-scale data collection. |
| format | Article |
| id | doaj-art-8dcfee8b4f254511bcb5f19c1d1a52e8 |
| institution | Kabale University |
| issn | 2376-5992 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | PeerJ Inc. |
| record_format | Article |
| series | PeerJ Computer Science |
| doi | 10.7717/peerj-cs.3060 |
| citation | PeerJ Computer Science, vol. 11, e3060, 2025-08-01 |
| affiliations | Van Duc Vu, Van Thiep Nguyen, Nam Hai Pham, Dinh-Cuong Hoang: IT Department, FPT University, Ha Noi, Vietnam; Phan Xuan Tan: College of Engineering, Shibaura Institute of Technology, Tokyo, Japan |
| title | Text-guided RGB-P grasp generation |
| topic | Grasp generation; Large language models; Computer vision; Multi-modal; Robotics |
| url | https://peerj.com/articles/cs-3060.pdf |
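The description notes that the pipeline back-projects estimated depth maps with standard Kinect intrinsic parameters to keep the geometry consistent, fusing depth with RGB into the RGB-P representation. A minimal sketch of that back-projection step, assuming illustrative pinhole intrinsic values (`FX`, `FY`, `CX`, `CY` are placeholders, not parameters taken from the article):

```python
import numpy as np

# Hypothetical Kinect-style pinhole intrinsics (focal lengths and principal
# point, in pixels); the article's exact values are not given here.
FX, FY = 525.0, 525.0
CX, CY = 319.5, 239.5

def depth_to_rgbp(depth, rgb, fx=FX, fy=FY, cx=CX, cy=CY):
    """Back-project a depth map (meters, HxW) into a colored point cloud.

    Returns an (N, 6) array of [x, y, z, r, g, b] rows, one per pixel with
    valid (positive) depth: the fused shape-plus-appearance data this
    record calls RGB-P.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    valid = z > 0
    # Standard pinhole back-projection: x = (u - cx) * z / fx, etc.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1)[valid]          # (N, 3) geometry
    cols = rgb[valid].astype(np.float32) / 255.0       # (N, 3) color in [0, 1]
    return np.concatenate([pts, cols], axis=1)
```

Using the same intrinsics for every synthesized depth map is what keeps the generated point clouds mutually consistent, which matters when grasp poses estimated in one frame are reused as annotations across the dataset.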