MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
| Main Authors: | , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10970012/ |
| Summary: | Remote sensing image classification is a popular yet challenging field. Many researchers have combined convolutional neural networks (CNNs) and Transformers for hyperspectral image (HSI) classification tasks. However, in traditional Transformers, shallow-level information does not propagate well to deeper layers, which can lead to spatial variations and overfitting. Moreover, traditional Transformer models use an external classification token (CLS token) that is randomly initialized and often struggles to generalize effectively. In this article, we combine the strengths of HSI and light detection and ranging (LiDAR) data, using LiDAR as an external CLS token, which significantly enhances classification accuracy and reliability. We propose a new multimodal cross-layer fusion transformer network (MCFTNet), integrating CNNs with the latest Transformer networks. It includes a CNN for extracting spatial features and a hybrid cross-patch attention mechanism for land cover classification, leveraging LiDAR data to generate the CLS token alongside HSI patch tokens. More importantly, to reduce the loss of valuable information during layer-by-layer propagation, we designed cross-layer skip connections. Through adaptive learning, these cross-layer fusions help address the vanishing-gradient problem in deep networks while preserving early-layer features, enabling the model to better integrate information from different layers and enhancing both its stability and performance. We carried out in-depth experiments on commonly used benchmark datasets, specifically the University of Houston dataset, the Trento dataset, and the University of Southern Mississippi Gulf Park dataset, comparing the proposed MCFTNet with state-of-the-art Transformer models, classical CNNs, and traditional classifiers. MCFTNet outperformed all of the compared methods. |
| ISSN: | 1939-1404, 2151-1535 |
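The summary describes two architectural ideas: a CLS token derived from LiDAR data rather than random initialization, and adaptive cross-layer skip connections that fuse outputs from multiple Transformer layers. The toy NumPy sketch below illustrates both ideas in isolation; it is not the authors' implementation, and every shape, name, and the softmax-weighted fusion rule are illustrative assumptions.

```python
import numpy as np

# Hedged sketch (not the paper's code): toy illustration of
#   1) a LiDAR-derived CLS token prepended to HSI patch tokens, and
#   2) adaptive cross-layer fusion via softmax-weighted summation,
# so early-layer features still reach the deep representation.
rng = np.random.default_rng(0)
d = 8                      # embedding dimension (assumed)
n_patches = 16             # HSI patches per sample (assumed)

# 1) Token sequence: project a LiDAR neighborhood to the embedding
#    dimension and use it as the CLS token instead of a random one.
lidar_patch = rng.standard_normal(4)          # toy LiDAR feature vector
W_lidar = rng.standard_normal((4, d))         # learnable projection (assumed)
cls_token = lidar_patch @ W_lidar             # shape (d,)
hsi_tokens = rng.standard_normal((n_patches, d))
tokens = np.vstack([cls_token, hsi_tokens])   # shape (n_patches + 1, d)

# 2) Stand-in "layer outputs" and an adaptive cross-layer fusion:
#    learnable scalars are softmax-normalized and weight each layer's
#    output before summation.
layer_outputs = [tokens + 0.1 * rng.standard_normal(tokens.shape)
                 for _ in range(3)]
alpha = np.array([0.2, 0.5, 1.0])             # learnable logits (assumed)
w = np.exp(alpha) / np.exp(alpha).sum()       # softmax fusion weights
fused = sum(wi * out for wi, out in zip(w, layer_outputs))

print(tokens.shape, fused.shape)
```

In a trained network the projection `W_lidar` and the logits `alpha` would be learned end to end; the softmax keeps the fusion weights positive and summing to one, which is one common way to realize the "adaptive learning" the abstract mentions.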