MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification

Remote sensing image classification is a popular yet challenging field. Many researchers have combined convolutional neural networks (CNNs) and Transformers for hyperspectral imaging (HSI) classification tasks. However, in traditional Transformers, shallow-level information does not propagate well t...

Bibliographic Details
Main Authors:	Wei Huang, Tianren Wu, Xueyu Zhang, Liangliang Li, Ming Lv, Zhenhong Jia, Xiaobin Zhao, Hongbing Ma, Gemine Vivone
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Subjects:	Classification; convolutional neural network (CNN); hyperspectral imaging (HSI); light detection and ranging (LiDAR); remote sensing (RS); Transformer
Online Access:	https://ieeexplore.ieee.org/document/10970012/

_version_	1849711945954885632
author	Wei Huang; Tianren Wu; Xueyu Zhang; Liangliang Li; Ming Lv; Zhenhong Jia; Xiaobin Zhao; Hongbing Ma; Gemine Vivone
author_facet	Wei Huang; Tianren Wu; Xueyu Zhang; Liangliang Li; Ming Lv; Zhenhong Jia; Xiaobin Zhao; Hongbing Ma; Gemine Vivone
author_sort	Wei Huang
collection	DOAJ
description	Remote sensing image classification is a popular yet challenging field. Many researchers have combined convolutional neural networks (CNNs) and Transformers for hyperspectral imaging (HSI) classification tasks. However, in traditional Transformers, shallow-level information does not propagate well to deeper layers, which can lead to spatial variations and overfitting. Moreover, traditional Transformer models use an external classification token (CLS token) that is randomly initialized and often struggles to generalize effectively. In this article, we combine the strengths of HSI and light detection and ranging (LiDAR) data, using LiDAR as an external CLS token, which significantly enhances classification accuracy and reliability. We propose a new multimodal cross-layer fusion transformer network (MCFTNet), integrating CNNs with the latest Transformer networks. It includes a CNN for extracting spatial features and a hybrid cross-patch attention mechanism for land cover classification, leveraging LiDAR data to generate CLS and HSI patch tokens. More importantly, to reduce the loss of valuable information during layer-by-layer propagation, we designed cross-layer skip connections. Through adaptive learning, these cross-layer fusions help address the vanishing gradient problem in deep networks while preserving early-layer features. This enables the model to better integrate information from different layers, enhancing both its stability and performance. We carried out in-depth experiments on commonly used benchmark datasets, specifically the University of Houston dataset, the Trento dataset, and the University of Southern Mississippi Gulf Park dataset. We compared the results of the proposed MCFTNet model with those obtained from state-of-the-art Transformer models, classical CNNs, and traditional classifiers. The MCFTNet model outperformed all of them.
format	Article
id	doaj-art-32d56b92fa544fe28fb1e0fc00c1e109
institution	DOAJ
issn	1939-1404 2151-1535
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
spelling	doaj-art-32d56b92fa544fe28fb1e0fc00c1e109; 2025-08-20T03:14:28Z; eng; IEEE; IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing; ISSN 1939-1404, 2151-1535; 2025-01-01; vol. 18, pp. 12803-12818; DOI 10.1109/JSTARS.2025.3562477; IEEE Xplore document 10970012
	MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
	Wei Huang (https://orcid.org/0009-0001-1078-7690), School of Computer Science and Technology, Xinjiang University, Urumqi, China
	Tianren Wu, School of Computer Science and Technology, Xinjiang University, Urumqi, China
	Xueyu Zhang (https://orcid.org/0009-0009-6277-1381), School of Computer and Electronic Information, Guangxi University, Nanning, China
	Liangliang Li (https://orcid.org/0000-0001-7354-7494), School of Information and Electronics, Beijing Institute of Technology, Beijing, China
	Ming Lv, School of Computer Science and Technology, Xinjiang University, Urumqi, China
	Zhenhong Jia (https://orcid.org/0000-0002-5182-4929), School of Computer Science and Technology, Xinjiang University, Urumqi, China
	Xiaobin Zhao (https://orcid.org/0000-0002-9828-1976), School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China
	Hongbing Ma (https://orcid.org/0000-0002-1785-4024), Department of Electronic Engineering, Tsinghua University, Beijing, China
	Gemine Vivone (https://orcid.org/0000-0001-9542-0638), National Research Council, Institute of Methodologies for Environmental Analysis (CNR-IMAA), Tito, Italy
	https://ieeexplore.ieee.org/document/10970012/
spellingShingle	Wei Huang Tianren Wu Xueyu Zhang Liangliang Li Ming Lv Zhenhong Jia Xiaobin Zhao Hongbing Ma Gemine Vivone MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Classification convolutional neural network (CNN) hyperspectral imaging (HSI) light detection and ranging (LiDAR) remote sensing (RS) Transformer
title	MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
title_full	MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
title_fullStr	MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
title_full_unstemmed	MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
title_short	MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
title_sort	mcftnet multimodal cross layer fusion transformer network for hyperspectral and lidar data classification
topic	Classification; convolutional neural network (CNN); hyperspectral imaging (HSI); light detection and ranging (LiDAR); remote sensing (RS); Transformer
url	https://ieeexplore.ieee.org/document/10970012/
work_keys_str_mv	AT weihuang mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT tianrenwu mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT xueyuzhang mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT liangliangli mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT minglv mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT zhenhongjia mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT xiaobinzhao mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT hongbingma mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT geminevivone mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification
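
Illustrative note on the description field: the abstract names two mechanisms, a CLS token derived from LiDAR data rather than random initialization, and cross-layer skip connections whose fusion is learned adaptively. The following is a minimal sketch of how those two ideas could fit together, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (build_tokens, cross_layer_fuse, forward) are hypothetical, attention uses identity projections with a single head, there is no CNN front end, normalization, or MLP, and the gate logits that a real model would learn are fixed numbers here.

```python
import math
from typing import List

Vector = List[float]
Tokens = List[Vector]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def build_tokens(lidar_feature: Vector, hsi_patches: Tokens) -> Tokens:
    """Prepend a LiDAR-derived vector as the CLS token (assumed layout)."""
    dim = len(lidar_feature)
    if any(len(p) != dim for p in hsi_patches):
        raise ValueError("LiDAR feature and HSI patch tokens must share a dimension")
    return [list(lidar_feature)] + [list(p) for p in hsi_patches]


def self_attention(tokens: Tokens) -> Tokens:
    """Toy single-head scaled dot-product attention with identity projections."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)])
    return out


def cross_layer_fuse(layer_outputs: List[Tokens], gate_logits: List[float]) -> Tokens:
    """Blend the outputs of all earlier layers with softmax-normalized gates.

    In a trained model the gate logits would be learnable parameters;
    here they are plain numbers supplied by the caller.
    """
    if len(layer_outputs) != len(gate_logits):
        raise ValueError("need one gate logit per earlier layer output")
    weights = softmax(gate_logits)
    n_tokens = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, layer_outputs)) for d in range(dim)]
        for t in range(n_tokens)
    ]


def forward(lidar_feature: Vector, hsi_patches: Tokens, gates: List[List[float]]) -> Vector:
    """Run the toy stack and return the final CLS token.

    Layer i receives a gated mix of the input tokens and every earlier
    layer's output, so gates[i] must hold i + 1 logits.
    """
    history = [build_tokens(lidar_feature, hsi_patches)]
    for gate_logits in gates:
        fused = cross_layer_fuse(history, gate_logits)
        history.append(self_attention(fused))
    return history[-1][0]


if __name__ == "__main__":
    cls = forward([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]], [[0.0], [0.0, 0.0]])
    print(cls)
```

In this sketch the second layer sees an equal mix of the raw tokens and the first layer's output, which is one simple way early-layer features can reach deeper layers; a classifier head would then read the returned CLS vector.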

Similar Items