CrossModalSync: joint temporal-spatial fusion for semantic scene segmentation in large-scale scenes
Abstract: Owing to its ability to enable precise perception of dynamic and complex environments, point cloud semantic segmentation has become a critical task for autonomous vehicles in recent years. However, in complex, dynamic scenes, cumulative errors and the “many-to-one” mapping problem remain challenges for existing semantic segmentation methods, limiting their accuracy and efficiency. To address these challenges, this paper introduces a new framework that balances accuracy and computational efficiency through temporal alignment (TA), projection multi-scale convolution (PMC), and priority point retention (PPR). First, by combining TA and PMC, the framework captures inter-frame correlations, enriching local detail, reducing error accumulation, and preserving fine scene features. Second, the PPR mechanism retains critical three-dimensional information, resolving the information loss caused by the “many-to-one” mapping problem. Finally, multimodal fusion of LiDAR and camera data provides complementary perspectives, further enhancing segmentation performance. Our method achieves state-of-the-art performance on the SemanticKITTI and nuScenes benchmarks. Notably, the proposed framework excels at detecting occluded objects and dynamic entities.
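To make the “many-to-one” mapping problem concrete: when a LiDAR sweep is projected onto a 2D range image, several 3D points can land on the same pixel, and naive projection silently discards all but one of them. The sketch below illustrates a priority-based tie-break of the kind the abstract's PPR mechanism addresses. It is a minimal illustration only: the function name `project_with_priority`, the image size, the field-of-view defaults, and the nearest-point (depth-based) priority rule are all assumptions for demonstration, not the paper's actual PPR definition.

```python
# Illustrative sketch (not the paper's method): spherical range-image
# projection of a LiDAR point cloud with a priority tie-break when several
# 3D points fall onto the same 2D pixel -- the "many-to-one" mapping.
import numpy as np

def project_with_priority(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """points: (N, 4) array of x, y, z, remission. Returns an (H, W) index
    map in which each pixel keeps one point; here the assumed priority rule
    is "nearest point wins"."""
    fov_up_rad = np.radians(fov_up)
    fov_down_rad = np.radians(fov_down)
    fov = abs(fov_up_rad) + abs(fov_down_rad)

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1)

    # Spherical angles -> normalized image coordinates.
    yaw = -np.arctan2(y, x)
    pitch = np.arcsin(z / np.maximum(depth, 1e-8))
    u = 0.5 * (yaw / np.pi + 1.0)                # azimuth in [0, 1]
    v = 1.0 - (pitch + abs(fov_down_rad)) / fov  # elevation in [0, 1]

    col = np.clip(np.floor(u * W), 0, W - 1).astype(np.int32)
    row = np.clip(np.floor(v * H), 0, H - 1).astype(np.int32)

    # Priority retention: write points in order of *decreasing* depth, so the
    # nearest (assumed highest-priority) point is written last and overwrites
    # any other point sharing its pixel.
    order = np.argsort(depth)[::-1]
    index_map = np.full((H, W), -1, dtype=np.int64)
    index_map[row[order], col[order]] = order
    return index_map
```

The returned index map lets per-pixel 2D predictions be scattered back onto the retained 3D points; a real PPR mechanism would presumably rank points by a learned or semantic priority rather than raw depth, which is used here only to keep the example self-contained.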
| Main Authors: | Shuyi Tan, Yi Zhang, Yan Li, Byeong-Seok Shin |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
| Subjects: | Temporal alignment; Multimodal fusion; Semantic segmentation; Autonomous perception |
| Online Access: | https://doi.org/10.1038/s41598-025-08258-x |
| _version_ | 1849388267614502912 |
|---|---|
| author | Shuyi Tan; Yi Zhang; Yan Li; Byeong-Seok Shin |
| author_sort | Shuyi Tan |
| collection | DOAJ |
| description | Abstract: Owing to its ability to enable precise perception of dynamic and complex environments, point cloud semantic segmentation has become a critical task for autonomous vehicles in recent years. However, in complex, dynamic scenes, cumulative errors and the “many-to-one” mapping problem remain challenges for existing semantic segmentation methods, limiting their accuracy and efficiency. To address these challenges, this paper introduces a new framework that balances accuracy and computational efficiency through temporal alignment (TA), projection multi-scale convolution (PMC), and priority point retention (PPR). First, by combining TA and PMC, the framework captures inter-frame correlations, enriching local detail, reducing error accumulation, and preserving fine scene features. Second, the PPR mechanism retains critical three-dimensional information, resolving the information loss caused by the “many-to-one” mapping problem. Finally, multimodal fusion of LiDAR and camera data provides complementary perspectives, further enhancing segmentation performance. Our method achieves state-of-the-art performance on the SemanticKITTI and nuScenes benchmarks. Notably, the proposed framework excels at detecting occluded objects and dynamic entities. |
| format | Article |
| id | doaj-art-014cbfc3704e41748874a0c98b09d524 |
| institution | Kabale University |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | Author affiliations: Shuyi Tan, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications; Yi Zhang, Information Accessibility Engineering R&D Center, Chongqing University of Posts and Telecommunications; Yan Li, The Department of Electrical and Computer Engineering, Inha University; Byeong-Seok Shin, The Department of Electrical and Computer Engineering, Inha University |
| title | CrossModalSync: joint temporal-spatial fusion for semantic scene segmentation in large-scale scenes |
| topic | Temporal alignment; Multimodal fusion; Semantic segmentation; Autonomous perception |
| url | https://doi.org/10.1038/s41598-025-08258-x |