Text-guided RGB-P grasp generation
| Main Authors: | Van Duc Vu, Van Thiep Nguyen, Nam Hai Pham, Dinh-Cuong Hoang, Phan Xuan Tan |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | PeerJ Inc., 2025-08-01 |
| Series: | PeerJ Computer Science |
| Subjects: | Grasp generation; Large language models; Computer vision; Multi-modal; Robotics |
| Online Access: | https://peerj.com/articles/cs-3060.pdf |
| Field | Value |
|---|---|
| author | Van Duc Vu; Van Thiep Nguyen; Nam Hai Pham; Dinh-Cuong Hoang; Phan Xuan Tan |
| collection | DOAJ |
| description | In the field of robotics, object grasping is a complex and challenging task. Although state-of-the-art computer vision-based models have made significant progress in predicting grasps, the lack of semantic information from textual data makes them susceptible to ambiguities in object recognition. For example, when asked to grasp a specific object on a table with many objects, robots relying only on visual data can easily get confused and grasp the wrong object. To address this limitation, we propose a multimodal approach that seamlessly integrates 3D data (shape) and red-green-blue (RGB) images (color, texture) into a unified representation called red-green-blue and point cloud (RGB-P), while also incorporating semantic information from textual descriptions processed by a large language model (LLM) to enhance object disambiguation. This combination of data allows our model to accurately infer and capture target objects based on natural language descriptions, overcoming the limitations of vision-only approaches. Our approach achieves superior performance, with an average precision (AP) of 53.2% on the GraspNet-1Billion dataset, significantly outperforming state-of-the-art methods. Additionally, we introduce an automated dataset creation pipeline that addresses the challenges of data collection and annotation. This pipeline leverages cutting-edge models: LLMs for text generation, Stable Diffusion for image synthesis, Depth Anything for depth estimation, using standard intrinsic parameters from the Kinect depth sensor to ensure geometric consistency, and GraspNet for grasp estimation. This automated process generates high-quality datasets with paired RGB-P images, textual descriptions, and potential grasp poses, significantly reducing the manual effort and enabling large-scale data collection. |
| format | Article |
| id | doaj-art-8dcfee8b4f254511bcb5f19c1d1a52e8 |
| institution | Kabale University |
| issn | 2376-5992 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | PeerJ Inc. |
| record_format | Article |
| series | PeerJ Computer Science |
| doi | 10.7717/peerj-cs.3060 |
| citation | PeerJ Computer Science, vol. 11, e3060, 2025-08-01 |
| affiliations | Van Duc Vu, Van Thiep Nguyen, Nam Hai Pham, Dinh-Cuong Hoang: IT Department, FPT University, Ha Noi, Vietnam; Phan Xuan Tan: College of Engineering, Shibaura Institute of Technology, Tokyo, Japan |
| title | Text-guided RGB-P grasp generation |
| topic | Grasp generation; Large language models; Computer vision; Multi-modal; Robotics |
| url | https://peerj.com/articles/cs-3060.pdf |
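The description notes that the pipeline back-projects estimated depth maps with standard Kinect intrinsic parameters to keep the geometry consistent, fusing depth with RGB into the RGB-P representation. A minimal sketch of that back-projection step, assuming illustrative pinhole intrinsic values (`FX`, `FY`, `CX`, `CY` are placeholders, not parameters taken from the article):

```python
import numpy as np

# Hypothetical Kinect-style pinhole intrinsics (focal lengths and principal
# point, in pixels); the article's exact values are not given here.
FX, FY = 525.0, 525.0
CX, CY = 319.5, 239.5

def depth_to_rgbp(depth, rgb, fx=FX, fy=FY, cx=CX, cy=CY):
    """Back-project a depth map (meters, HxW) into a colored point cloud.

    Returns an (N, 6) array of [x, y, z, r, g, b] rows, one per pixel with
    valid (positive) depth: the fused shape-plus-appearance data this
    record calls RGB-P.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    valid = z > 0
    # Standard pinhole back-projection: x = (u - cx) * z / fx, etc.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1)[valid]          # (N, 3) geometry
    cols = rgb[valid].astype(np.float32) / 255.0       # (N, 3) color in [0, 1]
    return np.concatenate([pts, cols], axis=1)
```

Using the same intrinsics for every synthesized depth map is what keeps the generated point clouds mutually consistent, which matters when grasp poses estimated in one frame are reused as annotations across the dataset.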