Building Type Classification Using CNN-Transformer Cross-Encoder Adaptive Learning From Very High Resolution Satellite Images

Bibliographic Details
Main Authors:	Shaofeng Zhang, Mengmeng Li, Wufan Zhao, Xiaoqin Wang, Qunyong Wu
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Subjects:	Building type classification; CNN-transformer networks; cross-encoder; feature interaction; very high resolution remote sensing
Online Access:	https://ieeexplore.ieee.org/document/10756709/

author	Shaofeng Zhang, Mengmeng Li, Wufan Zhao, Xiaoqin Wang, Qunyong Wu
collection	DOAJ
description	Building type information indicates the functional properties of buildings and plays a crucial role in smart city development and urban socioeconomic activities. Existing methods for classifying building types often face challenges in accurately distinguishing buildings between types while maintaining well-delineated boundaries, especially in complex urban environments. This study introduces a novel framework, i.e., CNN-Transformer cross-attention feature fusion network (CTCFNet), for building type classification from very high resolution remote sensing images. CTCFNet integrates convolutional neural networks (CNNs) and Transformers using an interactive cross-encoder fusion module that enhances semantic feature learning and improves classification accuracy in complex scenarios. We develop an adaptive collaboration optimization module that applies human visual attention mechanisms to enhance the feature representation of building types and boundaries simultaneously. To address the scarcity of datasets in building type classification, we create two new datasets, i.e., the urban building type (UBT) dataset and the town building type (TBT) dataset, for model evaluation. Extensive experiments on these datasets demonstrate that CTCFNet outperforms popular CNNs, Transformers, and dual-encoder methods in identifying building types across various regions, achieving the highest mean intersection over union of 78.20% and 77.11%, F1 scores of 86.83% and 88.22%, and overall accuracy of 95.07% and 95.73% on the UBT and TBT datasets, respectively. We conclude that CTCFNet effectively addresses the challenges of high interclass similarity and intraclass inconsistency in complex scenes, yielding results with well-delineated building boundaries and accurate building types.
format	Article
id	doaj-art-979926b8605b48b8b79e2e0649306ebc
institution	Kabale University
issn	1939-1404 2151-1535
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
spelling	doaj-art-979926b8605b48b8b79e2e0649306ebc; 2024-12-11T00:00:38Z; eng; IEEE; IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing; ISSN 1939-1404, 2151-1535; 2025-01-01; vol. 18, pp. 976-994; DOI 10.1109/JSTARS.2024.3501678; IEEE Xplore document 10756709
	Shaofeng Zhang (https://orcid.org/0009-0001-0689-264X): Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Academy of Digital China, Fuzhou University, Fuzhou, China
	Mengmeng Li (https://orcid.org/0000-0002-9083-0475): Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Academy of Digital China, Fuzhou University, Fuzhou, China
	Wufan Zhao (https://orcid.org/0000-0002-0265-3465): Urban Governance and Design Thrust, Society Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
	Xiaoqin Wang: Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Academy of Digital China, Fuzhou University, Fuzhou, China
	Qunyong Wu: Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Academy of Digital China, Fuzhou University, Fuzhou, China
title	Building Type Classification Using CNN-Transformer Cross-Encoder Adaptive Learning From Very High Resolution Satellite Images
topic	Building type classification; CNN-transformer networks; cross-encoder; feature interaction; very high resolution remote sensing
url	https://ieeexplore.ieee.org/document/10756709/

Similar Items