LSV-MAE: A Masked-Autoencoder Pre-Training Approach for Large-Scale 3D Point Cloud Data

Bibliographic Details
Main Authors: Nuo Cheng, Chuanyu Luo, Han Li, Sikun Ma, Shengguang Lei, Pu Li
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Subjects: Pre-training; autonomous driving; point cloud; transformer; KITTI; nuScenes
Online Access: https://ieeexplore.ieee.org/document/11105382/
Description: Masked language modeling (MLM) and masked image modeling (MIM) pre-training paradigms have achieved remarkable success in natural language processing (NLP) and computer vision (CV). However, extending MIM to large-scale outdoor point cloud data is challenging due to the inherent sparsity and wide spatial coverage of such data. To address this issue, we develop a masked-autoencoder pre-training model, LSV-MAE, which makes it possible to train detection models on large volumes of unlabeled point cloud data. Our approach pre-trains the backbone to reconstruct masked voxel features extracted by PointNN. To enhance the feature-extraction capability of the encoder, the point cloud is voxelized with different voxel sizes at different pre-training stages. Meanwhile, to avoid the effect of masking key points, the masked voxel features are re-integrated into the decoder during pre-training. To verify the proposed approach, experiments are conducted on well-known datasets, showing that our method not only avoids tedious labeling work but also improves detection accuracy by up to 18% compared with the same model without pre-training, across datasets of different sizes.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2025.3594614
Citation: IEEE Access, vol. 13, pp. 135708-135721, 2025
Author Affiliations:
Nuo Cheng (ORCID: 0000-0002-4748-4554), Process Optimization Group, Technische Universität Ilmenau, Ilmenau, Germany
Chuanyu Luo (ORCID: 0000-0001-8496-8550), Process Optimization Group, Technische Universität Ilmenau, Ilmenau, Germany
Han Li, LiangDao GmbH, Berlin, Germany
Sikun Ma, LiangDao GmbH, Berlin, Germany
Shengguang Lei, LiangDao GmbH, Berlin, Germany
Pu Li (ORCID: 0000-0001-6481-9961), Process Optimization Group, Technische Universität Ilmenau, Ilmenau, Germany
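The masked-voxel pre-training idea described in the abstract (voxelize the cloud, mask a subset of voxels, reconstruct their features, and vary the voxel size across stages) can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: mean-pooled point coordinates stand in for PointNN voxel features, the 60% mask ratio and the voxel sizes are assumed for illustration, and a trivial mean predictor stands in for the transformer encoder/decoder.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Group points into voxels of the given size; each voxel's feature
    is the mean of the points it contains (a stand-in for learned features)."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inv = np.unique(coords, axis=0, return_inverse=True)
    inv = inv.reshape(-1)
    counts = np.bincount(inv, minlength=len(uniq)).astype(float)
    feats = np.stack(
        [np.bincount(inv, weights=points[:, d], minlength=len(uniq)) / counts
         for d in range(points.shape[1])], axis=1)
    return uniq, feats

def mask_voxels(n_voxels, mask_ratio, rng):
    """Randomly split voxel indices into masked and visible sets."""
    n_mask = int(n_voxels * mask_ratio)
    perm = rng.permutation(n_voxels)
    return perm[:n_mask], perm[n_mask:]

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 20.0, size=(2000, 3))  # toy point cloud

# Different voxel sizes at different pre-training stages, per the abstract.
for voxel_size in (2.0, 1.0):
    coords, feats = voxelize(points, voxel_size)
    masked_idx, visible_idx = mask_voxels(len(coords), 0.6, rng)
    # A real encoder/decoder would go here; as a trivial stand-in "decoder",
    # predict every masked voxel feature as the mean of the visible features.
    pred = feats[visible_idx].mean(axis=0)
    loss = np.mean((feats[masked_idx] - pred) ** 2)
    print(f"voxel_size={voxel_size}: {len(coords)} voxels, "
          f"{len(masked_idx)} masked, recon MSE={loss:.3f}")
```

In an actual pre-training loop, the reconstruction loss would be backpropagated through the backbone, and (per the abstract) the masked voxel features would also be re-integrated into the decoder so that masking key points does not starve it of context.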