Leveraging two-dimensional pre-trained vision transformers for three-dimensional model generation via masked autoencoders


Bibliographic Details
Main Authors: Muhammad Sajid, Kaleem Razzaq Malik, Ateeq Ur Rehman, Tauqeer Safdar Malik, Masoud Alajmi, Ali Haider Khan, Amir Haider, Seada Hussen
Format: Article
Language:English
Published: Nature Portfolio 2025-01-01
Series:Scientific Reports
Subjects: 2D; Vision Transformers; 3D; Masked Autoencoders; 2D Semantics
Online Access:https://doi.org/10.1038/s41598-025-87376-y
collection DOAJ
description Abstract Although the Transformer architecture has become the de facto standard for natural language processing tasks, its applications in computer vision remain limited. In vision, attention is either used in conjunction with convolutional networks or substituted for individual convolutional components while the overall network design is kept intact. Differences between the two domains, such as the large variation in the scale of visual entities and the much finer granularity of pixels in images compared with words in text, make it difficult to transfer the Transformer from language to vision. Masked autoencoding is a promising self-supervised learning approach that has greatly advanced both computer vision and natural language processing. For robust 2D representations, pre-training on large image datasets has become standard practice. In contrast, the scarcity of 3D datasets, together with the high cost of processing 3D data, significantly impedes the learning of high-quality 3D features. We present a strong multi-scale MAE pre-training architecture that uses a pre-trained ViT and a 3D representation model derived from 2D images to enable self-supervised learning on 3D point clouds. We exploit this rich 2D knowledge to guide a 3D masked autoencoder, which uses an encoder-decoder architecture to reconstruct the masked point tokens through self-supervised pre-training. We first use pre-trained 2D models to obtain multi-view visual features of the input point cloud. Next, we introduce a 2D-guided masking strategy that keeps semantically significant point tokens visible. Extensive experiments demonstrate how effectively our method works with pre-trained models and how well it generalizes to a range of downstream tasks. In particular, our pre-trained model achieved 93.63% linear-SVM accuracy on ScanObjectNN and 91.31% on ModelNet40. Our approach demonstrates that a straightforward architecture based solely on standard Transformers can outperform specialized Transformer models trained with supervised learning.
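The semantics-preserving masking described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch in NumPy, not the authors' implementation: the grouping step uses random center sampling in place of farthest-point sampling, and the saliency scores are a stand-in for the multi-view features produced by a pre-trained 2D ViT; all function names and shapes are hypothetical.

```python
import numpy as np

def group_points(points, num_groups=8, group_size=4, seed=0):
    """Group a point cloud into local patches (point tokens).
    Centers are sampled randomly here; real pipelines typically use
    farthest-point sampling followed by k-nearest-neighbor grouping."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), num_groups, replace=False)]
    # Distance of every point to every center, then take the K nearest.
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :group_size]
    return points[idx], centers          # (G, K, 3), (G, 3)

def saliency_guided_mask(saliency, mask_ratio=0.5):
    """Mask the least-salient tokens so that semantically significant
    tokens (high 2D-projected saliency) stay visible to the encoder."""
    num_mask = int(len(saliency) * mask_ratio)
    order = np.argsort(saliency)         # ascending: least salient first
    mask = np.zeros(len(saliency), dtype=bool)
    mask[order[:num_mask]] = True        # True = masked, to be reconstructed
    return mask

points = np.random.default_rng(1).normal(size=(64, 3))
tokens, centers = group_points(points)
# Stand-in for per-token saliency projected from 2D ViT feature maps.
saliency = np.random.default_rng(2).random(len(centers))
mask = saliency_guided_mask(saliency, mask_ratio=0.5)
visible = tokens[~mask]   # fed to the encoder
masked = tokens[mask]     # targets the decoder must reconstruct
```

During pre-training, only the visible tokens pass through the encoder, and a lightweight decoder reconstructs the masked groups; because the mask is driven by 2D saliency rather than drawn uniformly at random, the encoder always sees the semantically important regions of the shape.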
id doaj-art-3143c9ff3a0846e5bbefb25e2bfc10b7
institution Kabale University
issn 2045-2322
Author affiliations:
Muhammad Sajid: Department of Computer Science, Air University
Kaleem Razzaq Malik: Department of Computer Science, Air University
Ateeq Ur Rehman: Computer Science and Engineering, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences
Tauqeer Safdar Malik: Department of Information & Communication Technology, Bahauddin Zakariya University
Masoud Alajmi: Department of Computer Engineering, College of Computers and Information Technology, Taif University
Ali Haider Khan: School of Software Engineering, Beijing University of Technology
Amir Haider: Department of Artificial Intelligence and Robotics, Sejong University
Seada Hussen: Department of Electrical Power, Adama Science and Technology University
topic 2D
Vision Transformers
3D
Masked Autoencoders
2D Semantics