MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
| Main Authors: | , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10970012/ |
| Summary: | Remote sensing image classification is a popular yet challenging field. Many researchers have combined convolutional neural networks (CNNs) and Transformers for hyperspectral image (HSI) classification tasks. However, in traditional Transformers, shallow-level information does not propagate well to deeper layers, which can lead to spatial variations and overfitting. Moreover, traditional Transformer models use an external classification token (CLS token) that is randomly initialized and often struggles to generalize effectively. In this article, we combine the strengths of HSI and light detection and ranging (LiDAR) data, using LiDAR as an external CLS token, which significantly enhances classification accuracy and reliability. We propose a new multimodal cross-layer fusion transformer network (MCFTNet), integrating CNNs with the latest Transformer networks. It includes a CNN for extracting spatial features and a hybrid cross-patch attention mechanism for land cover classification, leveraging LiDAR data to generate the CLS token alongside HSI patch tokens. More importantly, to reduce the loss of valuable information during layer-by-layer propagation, we designed cross-layer skip connections. Through adaptive learning, these cross-layer fusions help address the vanishing-gradient problem in deep networks while preserving early-layer features, enabling the model to better integrate information from different layers and enhancing both its stability and performance. We carried out in-depth experiments on commonly used benchmark datasets, specifically the University of Houston dataset, the Trento dataset, and the University of Southern Mississippi Gulf Park dataset, comparing the proposed MCFTNet with state-of-the-art Transformer models, classical CNNs, and traditional classifiers. MCFTNet outperformed all of the compared methods. |
| ISSN: | 1939-1404, 2151-1535 |
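The summary describes two architectural ideas: a CLS token derived from LiDAR data rather than random initialization, and adaptive cross-layer skip connections that fuse outputs from multiple Transformer layers. The toy NumPy sketch below illustrates both ideas in isolation; it is not the authors' implementation, and every shape, name, and the softmax-weighted fusion rule are illustrative assumptions.

```python
import numpy as np

# Hedged sketch (not the paper's code): toy illustration of
#   1) a LiDAR-derived CLS token prepended to HSI patch tokens, and
#   2) adaptive cross-layer fusion via softmax-weighted summation,
# so early-layer features still reach the deep representation.
rng = np.random.default_rng(0)
d = 8                      # embedding dimension (assumed)
n_patches = 16             # HSI patches per sample (assumed)

# 1) Token sequence: project a LiDAR neighborhood to the embedding
#    dimension and use it as the CLS token instead of a random one.
lidar_patch = rng.standard_normal(4)          # toy LiDAR feature vector
W_lidar = rng.standard_normal((4, d))         # learnable projection (assumed)
cls_token = lidar_patch @ W_lidar             # shape (d,)
hsi_tokens = rng.standard_normal((n_patches, d))
tokens = np.vstack([cls_token, hsi_tokens])   # shape (n_patches + 1, d)

# 2) Stand-in "layer outputs" and an adaptive cross-layer fusion:
#    learnable scalars are softmax-normalized and weight each layer's
#    output before summation.
layer_outputs = [tokens + 0.1 * rng.standard_normal(tokens.shape)
                 for _ in range(3)]
alpha = np.array([0.2, 0.5, 1.0])             # learnable logits (assumed)
w = np.exp(alpha) / np.exp(alpha).sum()       # softmax fusion weights
fused = sum(wi * out for wi, out in zip(w, layer_outputs))

print(tokens.shape, fused.shape)
```

In a trained network the projection `W_lidar` and the logits `alpha` would be learned end to end; the softmax keeps the fusion weights positive and summing to one, which is one common way to realize the "adaptive learning" the abstract mentions.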