Simplifying Masked Image Modeling With Symmetric Masking and Contrastive Learning

Bibliographic Details
Main Authors: Khanh-Binh Nguyen, Chae Jung Park
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11080374/
Description
Summary: Masked image modeling (MIM) has emerged as an effective self-supervised learning paradigm for pre-training Vision Transformers (ViTs) by reconstructing missing pixels from masked image regions. While prior approaches have demonstrated strong performance, they typically rely on random masking strategies and require extensive hyperparameter tuning, particularly in searching for optimal masking ratios, leading to increased training cost and reduced generalizability. In this work, we propose a simple yet powerful symmetric masking strategy that eliminates the need for such ratio exploration by maintaining a fixed 50% masking pattern. Based on this strategy, we introduce SymMIM, a novel MIM framework that integrates both reconstruction and contrastive learning objectives to jointly capture local and global representations. Despite its simplicity, SymMIM achieves state-of-the-art accuracy of 85.9% on ImageNet with ViT-Large and consistently outperforms prior methods across various downstream tasks, including image classification, semantic segmentation, object detection, and instance segmentation, all while requiring only a single-stage pre-training process.
ISSN: 2169-3536
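
Illustrative sketch (not part of the record): the abstract describes a fixed 50% symmetric masking pattern combined with reconstruction and contrastive objectives, without giving implementation details. The PyTorch sketch below shows one plausible way such complementary masks and an InfoNCE-style contrastive term could be written; the helper names (symmetric_masks, info_nce) and the temperature value of 0.2 are assumptions for illustration, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def symmetric_masks(batch_size: int, num_patches: int):
        # Randomly hide exactly half of the patches in view A; view B
        # hides the complementary half, so the two views together cover
        # every patch and no masking-ratio search is needed.
        order = torch.rand(batch_size, num_patches).argsort(dim=1)
        mask_a = torch.zeros(batch_size, num_patches, dtype=torch.bool)
        mask_a.scatter_(1, order[:, : num_patches // 2], True)
        return mask_a, ~mask_a

    def info_nce(z_a, z_b, tau: float = 0.2):
        # Contrastive (InfoNCE-style) term between the two views'
        # global embeddings; matching rows in the batch are positives.
        # The temperature tau = 0.2 is an illustrative choice.
        z_a = F.normalize(z_a, dim=1)
        z_b = F.normalize(z_b, dim=1)
        logits = z_a @ z_b.t() / tau
        labels = torch.arange(z_a.size(0), device=z_a.device)
        return F.cross_entropy(logits, labels)

In a full training step along these lines, each view would additionally incur a pixel-reconstruction loss on its hidden patches (for example, MSE against the original pixels), with the contrastive term added as a weighted auxiliary objective so that local and global representations are learned jointly.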