Text this: Leveraging two-dimensional pre-trained vision transformers for three-dimensional model generation via masked autoencoders