Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems
Urban documents like city planning reports and environmental data often feature complex charts and texts that require effective summarization tools, particularly in smart city management systems. These documents increasingly use graphical abstracts alongside textual summaries to enhance readability,...
Saved in:
| Main Authors: | Wenhui Yu, Gengshen Wu, Jungong Han |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Smart Cities |
| Subjects: | multi-task learning; multimodal learning; paper summarization; important image selection; cross-modal retrieval; smart city information management systems |
| Online Access: | https://www.mdpi.com/2624-6511/8/3/96 |
| _version_ | 1849433724361375744 |
|---|---|
| author | Wenhui Yu; Gengshen Wu; Jungong Han |
| author_facet | Wenhui Yu; Gengshen Wu; Jungong Han |
| author_sort | Wenhui Yu |
| collection | DOAJ |
| description | Urban documents like city planning reports and environmental data often feature complex charts and texts that require effective summarization tools, particularly in smart city management systems. These documents increasingly use graphical abstracts alongside textual summaries to enhance readability, making automated abstract generation crucial. This study explores the application of summarization technology using scientific paper abstract generation as a case. The challenge lies in processing the longer multimodal content typical in research papers. To address this, a deep multimodal-interactive network is proposed for accurate document summarization. This model enhances structural information from both images and text, using a combination module to learn the correlation between them. The integrated model aids both summary generation and significant image selection. For the evaluation, a dataset is created that encompasses both textual and visual components along with structural information, such as the coordinates of the text and the layout of the images. While primarily focused on abstract generation and image selection, the model also supports text–image cross-modal retrieval. Experimental results on the proprietary dataset demonstrate that the proposed method substantially outperforms both extractive and abstractive baselines. In particular, it achieves a Rouge-1 score of 46.55, a Rouge-2 score of 16.13, and a Rouge-L score of 24.95, improving over the best comparison abstractive model (Pegasus: Rouge-1 43.63, Rouge-2 14.62, Rouge-L 24.46) by approximately 2.9, 1.5, and 0.5 points, respectively. Even against strong extractive methods like TextRank (Rouge-1 30.93) and LexRank (Rouge-1 29.63), our approach shows gains of over 15 points in Rouge-1, underlining its effectiveness in capturing both textual and visual semantics. These results suggest significant potential for smart city applications—such as accident scene documentation and automated environmental monitoring summaries—where rapid, accurate processing of urban multimodal data is essential. |
| format | Article |
| id | doaj-art-cf31507e44d3476f9d3ae885cea113bd |
| institution | Kabale University |
| issn | 2624-6511 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Smart Cities |
| spelling | doaj-art-cf31507e44d3476f9d3ae885cea113bd2025-08-20T03:26:56ZengMDPI AGSmart Cities2624-65112025-06-01839610.3390/smartcities8030096Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management SystemsWenhui Yu0Gengshen Wu1Jungong Han2Faculty of Data Science, City University of Macau, Macao SAR, ChinaFaculty of Data Science, City University of Macau, Macao SAR, ChinaDepartment of Automation, Tsinghua University, Beijing 100084, ChinaUrban documents like city planning reports and environmental data often feature complex charts and texts that require effective summarization tools, particularly in smart city management systems. These documents increasingly use graphical abstracts alongside textual summaries to enhance readability, making automated abstract generation crucial. This study explores the application of summarization technology using scientific paper abstract generation as a case. The challenge lies in processing the longer multimodal content typical in research papers. To address this, a deep multimodal-interactive network is proposed for accurate document summarization. This model enhances structural information from both images and text, using a combination module to learn the correlation between them. The integrated model aids both summary generation and significant image selection. For the evaluation, a dataset is created that encompasses both textual and visual components along with structural information, such as the coordinates of the text and the layout of the images. While primarily focused on abstract generation and image selection, the model also supports text–image cross-modal retrieval. Experimental results on the proprietary dataset demonstrate that the proposed method substantially outperforms both extractive and abstractive baselines. In particular, it achieves a Rouge-1 score of 46.55, a Rouge-2 score of 16.13, and a Rouge-L score of 24.95, improving over the best comparison abstractive model (Pegasus: Rouge-1 43.63, Rouge-2 14.62, Rouge-L 24.46) by approximately 2.9, 1.5, and 0.5 points, respectively. Even against strong extractive methods like TextRank (Rouge-1 30.93) and LexRank (Rouge-1 29.63), our approach shows gains of over 15 points in Rouge-1, underlining its effectiveness in capturing both textual and visual semantics. These results suggest significant potential for smart city applications—such as accident scene documentation and automated environmental monitoring summaries—where rapid, accurate processing of urban multimodal data is essential.https://www.mdpi.com/2624-6511/8/3/96multi-task learningmultimodal learningpaper summarizationimportant image selectioncross-modal retrievalsmart city information management systems |
| spellingShingle | Wenhui Yu Gengshen Wu Jungong Han Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems Smart Cities multi-task learning multimodal learning paper summarization important image selection cross-modal retrieval smart city information management systems |
| title | Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems |
| title_full | Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems |
| title_fullStr | Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems |
| title_full_unstemmed | Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems |
| title_short | Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems |
| title_sort | deep multimodal interactive document summarization network and its cross modal text image retrieval application for future smart city information management systems |
| topic | multi-task learning multimodal learning paper summarization important image selection cross-modal retrieval smart city information management systems |
| url | https://www.mdpi.com/2624-6511/8/3/96 |
| work_keys_str_mv | AT wenhuiyu deepmultimodalinteractivedocumentsummarizationnetworkanditscrossmodaltextimageretrievalapplicationforfuturesmartcityinformationmanagementsystems AT gengshenwu deepmultimodalinteractivedocumentsummarizationnetworkanditscrossmodaltextimageretrievalapplicationforfuturesmartcityinformationmanagementsystems AT jungonghan deepmultimodalinteractivedocumentsummarizationnetworkanditscrossmodaltextimageretrievalapplicationforfuturesmartcityinformationmanagementsystems |
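The Rouge-1, Rouge-2, and Rouge-L figures quoted in the abstract are standard n-gram overlap metrics between a generated summary and a reference. As a rough illustration only (not the authors' evaluation code, which is not included in this record), a minimal Rouge-N computation over whitespace tokens can be sketched as:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, F1) for ROUGE-N between two token lists."""
    def ngrams(tokens, n):
        # Count each n-gram as a tuple so Counter intersection gives clipped overlap
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())          # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)  # fraction of reference covered
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return recall, precision, f1

# Hypothetical toy summaries, for illustration only
cand = "the model generates a concise multimodal summary".split()
ref = "the model produces a concise summary".split()
r, p, f = rouge_n(cand, ref, n=1)
```

In practice, published Rouge numbers such as those above are computed with a standard implementation (e.g. the `rouge-score` package) that also applies stemming and handles Rouge-L via longest common subsequence; this sketch only shows the core overlap arithmetic behind Rouge-N.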