MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification
Remote sensing image classification is a popular yet challenging field. Many researchers have combined convolutional neural networks (CNNs) and Transformers for hyperspectral imaging (HSI) classification tasks. However, in traditional Transformers, shallow-level information does not propagate well t...
| Main Authors: | Wei Huang, Tianren Wu, Xueyu Zhang, Liangliang Li, Ming Lv, Zhenhong Jia, Xiaobin Zhao, Hongbing Ma, Gemine Vivone |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE 2025-01-01 |
| Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10970012/ |
| _version_ | 1849711945954885632 |
|---|---|
| author | Wei Huang Tianren Wu Xueyu Zhang Liangliang Li Ming Lv Zhenhong Jia Xiaobin Zhao Hongbing Ma Gemine Vivone |
| author_facet | Wei Huang Tianren Wu Xueyu Zhang Liangliang Li Ming Lv Zhenhong Jia Xiaobin Zhao Hongbing Ma Gemine Vivone |
| author_sort | Wei Huang |
| collection | DOAJ |
| description | Remote sensing image classification is a popular yet challenging field. Many researchers have combined convolutional neural networks (CNNs) and Transformers for hyperspectral imaging (HSI) classification tasks. However, in traditional Transformers, shallow-level information does not propagate well to deeper layers, which can lead to spatial variations and overfitting. Moreover, traditional Transformer models use an external classification token (CLS token) that is randomly initialized and often struggles to generalize effectively. In this article, we combine the strengths of HSI and light detection and ranging (LiDAR) data, using LiDAR as an external CLS token, which significantly enhances classification accuracy and reliability. We propose a new multimodal cross-layer fusion transformer network (MCFTNet), integrating CNNs with the latest Transformer networks. It includes a CNN for extracting spatial features and a hybrid cross-patch attention mechanism for land cover classification, leveraging LiDAR data to generate CLS and HSI patch tokens. More importantly, to reduce the loss of valuable information during the layer-by-layer propagation, we designed cross-layer skip connections. Through adaptive learning, these cross-layer fusions help address the gradient vanishing problem in deep networks while preserving early-layer features. This enables the model to better integrate information from different layers, enhancing both its stability and performance. We carried out in-depth experiments on commonly used benchmark datasets, specifically, the University of Houston dataset, the Trento dataset, and the University of Southern Mississippi Gulf Park dataset. We compared the results of the proposed MCFTNet model with those obtained from state-of-the-art Transformer models, classical CNNs, and traditional classifiers. As a result, the MCFTNet model outshone them all in terms of performance. |
| format | Article |
| id | doaj-art-32d56b92fa544fe28fb1e0fc00c1e109 |
| institution | DOAJ |
| issn | 1939-1404 2151-1535 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
| spelling | doaj-art-32d56b92fa544fe28fb1e0fc00c1e109; 2025-08-20T03:14:28Z; eng; IEEE; IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing; 1939-1404; 2151-1535; 2025-01-01; 18; 12803; 12818; 10.1109/JSTARS.2025.3562477; 10970012; MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification; Wei Huang (0, https://orcid.org/0009-0001-1078-7690); Tianren Wu (1); Xueyu Zhang (2, https://orcid.org/0009-0009-6277-1381); Liangliang Li (3, https://orcid.org/0000-0001-7354-7494); Ming Lv (4); Zhenhong Jia (5, https://orcid.org/0000-0002-5182-4929); Xiaobin Zhao (6, https://orcid.org/0000-0002-9828-1976); Hongbing Ma (7, https://orcid.org/0000-0002-1785-4024); Gemine Vivone (8, https://orcid.org/0000-0001-9542-0638); School of Computer Science and Technology, Xinjiang University, Urumqi, China; School of Computer Science and Technology, Xinjiang University, Urumqi, China; School of Computer and Electronic Information, Guangxi University, Nanning, China; School of Information and Electronics, Beijing Institute of Technology, Beijing, China; School of Computer Science and Technology, Xinjiang University, Urumqi, China; School of Computer Science and Technology, Xinjiang University, Urumqi, China; School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China; Department of Electronic Engineering, Tsinghua University, Beijing, China; National Research Council, Institute of Methodologies for Environmental Analysis (CNR-IMAA), Tito, Italy; Remote sensing image classification is a popular yet challenging field. Many researchers have combined convolutional neural networks (CNNs) and Transformers for hyperspectral imaging (HSI) classification tasks. However, in traditional Transformers, shallow-level information does not propagate well to deeper layers, which can lead to spatial variations and overfitting. Moreover, traditional Transformer models use an external classification token (CLS token) that is randomly initialized and often struggles to generalize effectively. In this article, we combine the strengths of HSI and light detection and ranging (LiDAR) data, using LiDAR as an external CLS token, which significantly enhances classification accuracy and reliability. We propose a new multimodal cross-layer fusion transformer network (MCFTNet), integrating CNNs with the latest Transformer networks. It includes a CNN for extracting spatial features and a hybrid cross-patch attention mechanism for land cover classification, leveraging LiDAR data to generate CLS and HSI patch tokens. More importantly, to reduce the loss of valuable information during the layer-by-layer propagation, we designed cross-layer skip connections. Through adaptive learning, these cross-layer fusions help address the gradient vanishing problem in deep networks while preserving early-layer features. This enables the model to better integrate information from different layers, enhancing both its stability and performance. We carried out in-depth experiments on commonly used benchmark datasets, specifically, the University of Houston dataset, the Trento dataset, and the University of Southern Mississippi Gulf Park dataset. We compared the results of the proposed MCFTNet model with those obtained from state-of-the-art Transformer models, classical CNNs, and traditional classifiers. As a result, the MCFTNet model outshone them all in terms of performance.; https://ieeexplore.ieee.org/document/10970012/; Classification; convolutional neural network (CNN); hyperspectral imaging (HSI); light detection and ranging (LiDAR); remote sensing (RS); Transformer |
| spellingShingle | Wei Huang Tianren Wu Xueyu Zhang Liangliang Li Ming Lv Zhenhong Jia Xiaobin Zhao Hongbing Ma Gemine Vivone MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Classification convolutional neural network (CNN) hyperspectral imaging (HSI) light detection and ranging (LiDAR) remote sensing (RS) Transformer |
| title | MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification |
| title_full | MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification |
| title_fullStr | MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification |
| title_full_unstemmed | MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification |
| title_short | MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification |
| title_sort | mcftnet multimodal cross layer fusion transformer network for hyperspectral and lidar data classification |
| topic | Classification convolutional neural network (CNN) hyperspectral imaging (HSI) light detection and ranging (LiDAR) remote sensing (RS) Transformer |
| url | https://ieeexplore.ieee.org/document/10970012/ |
| work_keys_str_mv | AT weihuang mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT tianrenwu mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT xueyuzhang mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT liangliangli mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT minglv mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT zhenhongjia mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT xiaobinzhao mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT hongbingma mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification AT geminevivone mcftnetmultimodalcrosslayerfusiontransformernetworkforhyperspectralandlidardataclassification |