Spatial–Spectral Hierarchical Multiscale Transformer-Based Masked Autoencoder for Hyperspectral Image Classification

Bibliographic Details
Main Authors: Haipeng Liu, Zhen Ye, Wen-Shuai Hu, Zhan Cao, Wei Li
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Subjects: Feature fusion; grouped window attention; hierarchical vision transformer (HViT); hyperspectral image (HSI) classification; masked autoencoder (MAE); multiscale learning
Online Access: https://ieeexplore.ieee.org/document/11005553/
author Haipeng Liu
Zhen Ye
Wen-Shuai Hu
Zhan Cao
Wei Li
collection DOAJ
description Due to its excellent feature extraction capability, deep learning has become the mainstream approach for hyperspectral image (HSI) classification. The transformer, with its powerful long-range relationship modeling ability, has become a popular model; however, it usually requires a large amount of labeled data for parameter training, which may be costly and impractical for HSI classification. Therefore, building on self-supervised learning, this article proposes a spatial–spectral hierarchical multiscale transformer-based masked autoencoder (SSHMT-MAE) for HSI classification. First, after spatial–spectral feature embedding with a spatial–spectral feature extraction module, a grouped window attention module is introduced to process only the visible patches of HSIs during spatial–spectral reconstruction; this avoids the increased computational complexity caused by filling in invisible patches in a traditional masked autoencoder (MAE) and skips unnecessary computation for masked patches. Then, a spatial–spectral hierarchical transformer is designed to build a hierarchical MAE structure, followed by a cross-feature fusion module that extracts multiscale spatial–spectral fusion features. This design not only helps the whole model learn fine-grained local spatial–spectral features within each local region but also captures long-range dependencies between different regions, generating rich multiscale spatial–spectral features that combine high-level semantic and low-level detail information for HSI classification. Extensive experiments on five public HSI datasets demonstrate the superiority of the proposed SSHMT-MAE over several state-of-the-art methods.
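The description above states the mechanism only in prose. Below is a minimal, hedged sketch of the one concretely described idea: a masked autoencoder whose encoder processes only the visible patch tokens, so no computation is spent on masked ones. It is not the authors' implementation; the class name, token sizes, masking ratio, and the plain PyTorch transformer used in place of the paper's grouped window attention, hierarchical transformer, and cross-feature fusion modules are assumptions for illustration only.

# Hedged sketch (assumed names and sizes), not the SSHMT-MAE implementation.
import torch
import torch.nn as nn

class VisiblePatchMAE(nn.Module):
    """Toy masked autoencoder over spatial-spectral patch embeddings."""

    def __init__(self, num_patches=64, embed_dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # Encoder sees only visible tokens; decoder reconstructs all tokens.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, embed_dim)  # reconstruct embeddings

    def random_masking(self, x):
        # Keep a random subset of tokens; remember how to undo the shuffle.
        B, N, D = x.shape
        n_keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=x.device)
        ids_shuffle = noise.argsort(dim=1)        # random permutation
        ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation
        ids_keep = ids_shuffle[:, :n_keep]
        visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        return visible, ids_restore, n_keep

    def forward(self, patches):
        x = patches + self.pos_embed
        visible, ids_restore, n_keep = self.random_masking(x)
        latent = self.encoder(visible)            # visible tokens only
        # Re-insert mask tokens, un-shuffle, and decode the full sequence.
        B, N, D = x.shape
        mask_tokens = self.mask_token.expand(B, N - n_keep, D)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        return self.head(self.decoder(full))      # reconstructed embeddings

# Minimal usage: two cubes, each embedded as 64 patch tokens of dimension 128.
model = VisiblePatchMAE()
recon = model(torch.randn(2, 64, 128))
print(recon.shape)  # torch.Size([2, 64, 128])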
format Article
id doaj-art-840ac260ce1e4645b3d9a740a3e1cbfa
institution OA Journals
issn 1939-1404
2151-1535
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
spelling doaj-art-840ac260ce1e4645b3d9a740a3e1cbfa (record last updated 2025-08-20T02:25:45Z)
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (IEEE), ISSN 1939-1404, EISSN 2151-1535, 2025-01-01, vol. 18, pp. 12150-12165
DOI: 10.1109/JSTARS.2025.3565652; IEEE Xplore document 11005553
Haipeng Liu, School of Electronics and Control Engineering, Chang'an University, Xi'an, China
Zhen Ye (https://orcid.org/0000-0001-5410-863X), School of Electronics and Control Engineering, Chang'an University, Xi'an, China
Wen-Shuai Hu (https://orcid.org/0000-0002-4757-2765), School of Information and Electronics, Beijing Institute of Technology, Beijing, China
Zhan Cao, School of Electronics and Control Engineering, Chang'an University, Xi'an, China
Wei Li (https://orcid.org/0000-0001-7015-7335), School of Information and Electronics, Beijing Institute of Technology, Beijing, China
title Spatial–Spectral Hierarchical Multiscale Transformer-Based Masked Autoencoder for Hyperspectral Image Classification
topic Feature fusion
grouped window attention
hierarchical vision transformer (HViT)
hyperspectral image (HSI) classification
masked autoencoder (MAE)
multiscale learning
url https://ieeexplore.ieee.org/document/11005553/