MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification

Remote sensing image classification is a popular yet challenging field. Many researchers have combined convolutional neural networks (CNNs) and Transformers for hyperspectral imaging (HSI) classification tasks. However, in traditional Transformers, shallow-level information does not propagate well t...

Bibliographic Details
Main Authors: Wei Huang, Tianren Wu, Xueyu Zhang, Liangliang Li, Ming Lv, Zhenhong Jia, Xiaobin Zhao, Hongbing Ma, Gemine Vivone
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
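Subjects: Classification; convolutional neural network (CNN); hyperspectral imaging (HSI); light detection and ranging (LiDAR); remote sensing (RS); Transformer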
Online Access: https://ieeexplore.ieee.org/document/10970012/
author Wei Huang
Tianren Wu
Xueyu Zhang
Liangliang Li
Ming Lv
Zhenhong Jia
Xiaobin Zhao
Hongbing Ma
Gemine Vivone
author_sort Wei Huang
collection DOAJ
description Remote sensing image classification is a popular yet challenging field. Many researchers have combined convolutional neural networks (CNNs) and Transformers for hyperspectral imaging (HSI) classification tasks. However, in traditional Transformers, shallow-level information does not propagate well to deeper layers, which can lead to spatial variations and overfitting. Moreover, traditional Transformer models use an external classification token (CLS token) that is randomly initialized and often struggles to generalize effectively. In this article, we combine the strengths of HSI and light detection and ranging (LiDAR) data, using LiDAR as an external CLS token, which significantly enhances classification accuracy and reliability. We propose a new multimodal cross-layer fusion transformer network (MCFTNet), integrating CNNs with the latest Transformer networks. It includes a CNN for extracting spatial features and a hybrid cross-patch attention mechanism for land cover classification, leveraging LiDAR data to generate CLS and HSI patch tokens. More importantly, to reduce the loss of valuable information during layer-by-layer propagation, we designed cross-layer skip connections. Through adaptive learning, these cross-layer fusions help address the vanishing gradient problem in deep networks while preserving early-layer features. This enables the model to better integrate information from different layers, enhancing both its stability and performance. We carried out in-depth experiments on commonly used benchmark datasets, specifically the University of Houston dataset, the Trento dataset, and the University of Southern Mississippi Gulf Park dataset. We compared the results of the proposed MCFTNet model with those obtained from state-of-the-art Transformer models, classical CNNs, and traditional classifiers. The MCFTNet model outperformed all of them.
format Article
id doaj-art-32d56b92fa544fe28fb1e0fc00c1e109
institution DOAJ
issn 1939-1404
2151-1535
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
spelling doaj-art-32d56b92fa544fe28fb1e0fc00c1e109 2025-08-20T03:14:28Z
eng
IEEE
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
1939-1404
2151-1535
2025-01-01
Volume 18, pages 12803-12818
DOI 10.1109/JSTARS.2025.3562477
Document 10970012
MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
Wei Huang (0) https://orcid.org/0009-0001-1078-7690
Tianren Wu (1)
Xueyu Zhang (2) https://orcid.org/0009-0009-6277-1381
Liangliang Li (3) https://orcid.org/0000-0001-7354-7494
Ming Lv (4)
Zhenhong Jia (5) https://orcid.org/0000-0002-5182-4929
Xiaobin Zhao (6) https://orcid.org/0000-0002-9828-1976
Hongbing Ma (7) https://orcid.org/0000-0002-1785-4024
Gemine Vivone (8) https://orcid.org/0000-0001-9542-0638
(0) School of Computer Science and Technology, Xinjiang University, Urumqi, China
(1) School of Computer Science and Technology, Xinjiang University, Urumqi, China
(2) School of Computer and Electronic Information, Guangxi University, Nanning, China
(3) School of Information and Electronics, Beijing Institute of Technology, Beijing, China
(4) School of Computer Science and Technology, Xinjiang University, Urumqi, China
(5) School of Computer Science and Technology, Xinjiang University, Urumqi, China
(6) School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China
(7) Department of Electronic Engineering, Tsinghua University, Beijing, China
(8) National Research Council, Institute of Methodologies for Environmental Analysis (CNR-IMAA), Tito, Italy
https://ieeexplore.ieee.org/document/10970012/
Classification
convolutional neural network (CNN)
hyperspectral imaging (HSI)
light detection and ranging (LiDAR)
remote sensing (RS)
Transformer
title MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
topic Classification
convolutional neural network (CNN)
hyperspectral imaging (HSI)
light detection and ranging (LiDAR)
remote sensing (RS)
Transformer
url https://ieeexplore.ieee.org/document/10970012/