Local-Global Feature Extraction Network With Dynamic 3-D Convolution and Residual Attention Transformer for Hyperspectral Image Classification
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10946673/ |
| Summary: | Currently, convolutional neural network (CNN) and transformer-based hyperspectral image (HSI) classification methods have attracted significant attention owing to their effective feature representation capabilities. However, CNN-based methods pay insufficient attention to valuable pixels in 3-D HSI samples and cannot adapt to variations across these samples. Transformer-based methods also suffer from high computational complexity and a tendency for the low-level spatial-spectral features of shallow attention layers to vanish as the number of attention layers increases. To address these issues, we propose a local-global feature extraction network with dynamic 3-D convolution and a residual attention transformer (LGDRNet). LGDRNet primarily consists of multiscale 3-D convolution, dynamic local feature extraction, residual global feature extraction, and feature fusion modules. Specifically, the multiscale 3-D convolution module extracts low-level multiscale spectral information. The dynamic local feature extraction module then applies dynamic 3-D convolution, which adapts to different samples and allows the network to focus on valuable pixels in 3-D samples. The residual global feature extraction module uses a convolutional projection unit and convolutional multihead self-attention to reduce computational complexity, and employs a residual attention connection so that the network effectively transmits and accumulates attention information across consecutive multihead attention layers, preventing the vanishing of shallow spatial-spectral features. Finally, the feature fusion module efficiently integrates local and global HSI information, improving subsequent classification performance. The proposed model achieves overall classification accuracies of 89.24%, 92.01%, and 94.53% on three benchmark datasets, respectively, outperforming state-of-the-art approaches with limited training samples. |
| ISSN: | 1939-1404, 2151-1535 |
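The abstract's dynamic convolution idea (a kernel that adapts per sample) can be illustrated with a minimal NumPy sketch. This is an assumption-laden stand-in, not the paper's method: it uses a CondConv-style attention-weighted mixture of expert kernels along a 1-D spectral axis instead of the paper's dynamic 3-D convolution, and all names (`dynamic_conv1d`, `experts`, `w_att`) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_conv1d(spectrum, experts, w_att):
    """Dynamic convolution along the spectral axis (1-D stand-in for the
    paper's dynamic 3-D convolution): the kernel applied to each sample is
    a per-sample, attention-weighted mix of K expert kernels."""
    # sample-conditioned attention: global average pooling -> expert logits
    logits = w_att @ np.array([spectrum.mean()])
    alpha = softmax(logits)                      # (K,) mixing weights, sum to 1
    kernel = (alpha[:, None] * experts).sum(0)   # aggregate one effective kernel
    return np.convolve(spectrum, kernel, mode="same")

# toy run: one pixel's 32-band spectral vector, K = 4 expert kernels of size 3
rng = np.random.default_rng(1)
spectrum = rng.normal(size=(32,))
experts = rng.normal(size=(4, 3))
w_att = rng.normal(size=(4, 1))
y = dynamic_conv1d(spectrum, experts, w_att)
```

Because `alpha` depends on the input, two different samples are filtered with two different effective kernels, which is the sense in which the convolution "adapts to different samples."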
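The residual attention connection described in the abstract can likewise be sketched. This is a minimal single-head NumPy illustration under stated assumptions: it accumulates raw attention scores across layers RealFormer-style (scores from the previous layer are added before the softmax), and replaces the paper's convolutional projections and multihead structure with plain linear maps; `residual_attention_layer` and the weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention_layer(x, w_q, w_k, w_v, prev_scores=None):
    """Single-head self-attention with a residual attention connection:
    raw scores from earlier layers are added to this layer's scores before
    the softmax, so shallow spatial-spectral attention is not lost as the
    attention stack deepens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)           # (tokens, tokens)
    if prev_scores is not None:
        scores = scores + prev_scores       # accumulate attention across layers
    out = softmax(scores) @ v
    return out, scores                      # pass raw scores to the next layer

# toy run: 6 tokens, 8-dim embeddings, 2 stacked attention layers
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
ws = [rng.normal(size=(8, 8)) * 0.1 for _ in range(6)]
h1, s1 = residual_attention_layer(x, *ws[:3])
h2, s2 = residual_attention_layer(h1, *ws[3:], prev_scores=s1)
```

The skip path carries pre-softmax scores rather than features, so each layer's attention map is anchored to the maps of all shallower layers.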