CrossModalSync: joint temporal-spatial fusion for semantic scene segmentation in large-scale scenes

Abstract: Point cloud semantic segmentation has become a critical task for autonomous vehicles in recent years, owing to its ability to enable precise perception of dynamic and complex environments. However, in complex, dynamic scenes, cumulative errors and the “many-to-one” mapping problem challenge existing semantic segmentation methods and limit their accuracy and efficiency. To address these challenges, this paper introduces a new framework that balances accuracy and computational efficiency through temporal alignment (TA), projection multi-scale convolution (PMC), and priority point retention (PPR). First, combining TA and PMC captures inter-frame correlations, enriches local detail, reduces error accumulation, and preserves fine scene features. Second, the PPR mechanism retains critical three-dimensional information, resolving the information loss caused by the “many-to-one” mapping problem. Finally, fusing LiDAR and camera data provides complementary perspectives that further improve segmentation. The method achieves state-of-the-art performance on the SemanticKITTI and nuScenes benchmarks and is particularly effective at detecting occluded objects and dynamic entities.
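
The abstract only names the temporal alignment (TA) module, so the sketch below is a minimal illustration of what aligning consecutive LiDAR frames typically involves: warping the previous scan into the current frame using the relative ego pose. The function name and the pose convention (4x4 sensor-to-world matrices, as provided by SemanticKITTI and nuScenes) are assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of temporal frame alignment for LiDAR point clouds.
# The pose-based warp below is a standard alignment step, not the paper's
# exact TA design.
import numpy as np

def align_previous_frame(prev_points: np.ndarray,
                         pose_prev: np.ndarray,
                         pose_curr: np.ndarray) -> np.ndarray:
    """Warp points (N, 3) from the previous frame into the current frame.

    pose_prev, pose_curr: 4x4 ego poses (sensor-to-world) for each frame.
    """
    # Relative transform: previous sensor frame -> current sensor frame.
    rel = np.linalg.inv(pose_curr) @ pose_prev
    # Homogeneous coordinates so the 4x4 transform applies in one matmul.
    homog = np.hstack([prev_points, np.ones((prev_points.shape[0], 1))])
    return (rel @ homog.T).T[:, :3]

Likewise, the “many-to-one” mapping problem arises when a 3-D point cloud is projected onto a 2-D range image and several points collapse onto a single pixel. The sketch below shows the collision together with a simple nearest-point retention rule; the paper's priority point retention (PPR) criterion is not described in this record, so the heuristic here is purely illustrative, and the field-of-view parameters are typical values rather than the authors' settings.

# Hypothetical sketch of spherical (range-image) projection and the
# "many-to-one" collision it produces; keeping the nearest point per pixel
# is a common heuristic used here only as a stand-in for PPR.
import numpy as np

def spherical_project(points: np.ndarray, h: int = 64, w: int = 2048,
                      fov_up: float = 3.0, fov_down: float = -25.0):
    """Project (N, 3) LiDAR points onto an h x w range image.

    Returns per-point pixel coordinates and, for every occupied pixel,
    the index of the retained point (nearest point wins on collision).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)
    pitch = np.arcsin(z / depth)

    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = ((1.0 - (yaw / np.pi + 1.0) / 2.0) * w).astype(np.int32) % w
    v = ((1.0 - (pitch - fov_down_r) / (fov_up_r - fov_down_r)) * h)
    v = np.clip(v, 0, h - 1).astype(np.int32)

    # Many-to-one: several 3-D points can land on the same (v, u) pixel.
    retained = np.full((h, w), -1, dtype=np.int64)
    order = np.argsort(-depth)            # far points first, near points last
    retained[v[order], u[order]] = order  # nearest point overwrites the rest
    return u, v, retained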

Bibliographic Details
Main Authors: Shuyi Tan (College of Computer Science and Technology, Chongqing University of Posts and Telecommunications), Yi Zhang (Information Accessibility Engineering R&D Center, Chongqing University of Posts and Telecommunications), Yan Li (Department of Electrical and Computer Engineering, Inha University), Byeong-Seok Shin (Department of Electrical and Computer Engineering, Inha University)
Format: Article
Language: English
Published: Nature Portfolio, 2025-07-01
Series: Scientific Reports
ISSN: 2045-2322
Subjects: Temporal alignment; Multimodal fusion; Semantic segmentation; Autonomous perception
Online Access: https://doi.org/10.1038/s41598-025-08258-x