Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems

Urban documents such as city planning reports and environmental data often feature complex charts and text that require effective summarization tools, particularly in smart city management systems. These documents increasingly use graphical abstracts alongside textual summaries to enhance readability, making automated abstract generation crucial. This study explores the application of summarization technology, using scientific paper abstract generation as a case study. The challenge lies in processing the long multimodal content typical of research papers. To address this, a deep multimodal-interactive network is proposed for accurate document summarization. The model enhances structural information from both images and text, using a combination module to learn the correlation between the two modalities. The integrated model supports both summary generation and significant image selection. For evaluation, a dataset is created that encompasses both textual and visual components along with structural information, such as text coordinates and image layout. While primarily focused on abstract generation and image selection, the model also supports text–image cross-modal retrieval. Experimental results on the proprietary dataset demonstrate that the proposed method substantially outperforms both extractive and abstractive baselines. In particular, it achieves ROUGE-1 46.55, ROUGE-2 16.13, and ROUGE-L 24.95, improving over the best-performing abstractive baseline (Pegasus: ROUGE-1 43.63, ROUGE-2 14.62, ROUGE-L 24.46) by approximately 2.9, 1.5, and 0.5 points, respectively. Even against strong extractive methods such as TextRank (ROUGE-1 30.93) and LexRank (ROUGE-1 29.63), the approach gains more than 15 ROUGE-1 points, underlining its effectiveness in capturing both textual and visual semantics. These results suggest significant potential for smart city applications, such as accident scene documentation and automated environmental monitoring summaries, where rapid, accurate processing of urban multimodal data is essential.
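The record does not specify the internal design of the combination module that learns text–image correlation. As a loose illustration only, the sketch below shows one generic way such a module can be built with cross-attention in PyTorch; the class name, dimensions, and pooling strategy are all assumptions for illustration, not the authors' architecture.

```python
# Generic sketch of a text-image "combination module" that learns
# cross-modal correlation via cross-attention. This is NOT the paper's
# architecture (which this record does not specify); all names and
# sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalCombiner(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Text tokens attend over image-region features, and vice versa.
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, n_tokens,  dim)  e.g. encoder outputs for the document
        # image: (batch, n_regions, dim)  e.g. projected figure/region features
        t_attended, _ = self.text_to_image(text, image, image)
        i_attended, _ = self.image_to_text(image, text, text)
        # Pool the image side so both parts share the text sequence length.
        pooled = i_attended.mean(dim=1, keepdim=True).expand_as(t_attended)
        return self.fuse(torch.cat([t_attended, pooled], dim=-1))

# Shape check with random features.
combiner = CrossModalCombiner()
out = combiner(torch.randn(2, 128, 512), torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 128, 512])
```

The fused representation could then feed both a summary decoder and an image-selection head, which is the multi-task pattern the abstract describes.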
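The reported numbers are ROUGE F-measures (e.g., ROUGE-1 46.55 corresponds to F1 ≈ 0.4655). A minimal sketch of how such scores are computed with the open-source `rouge-score` package follows; the reference and candidate strings are hypothetical placeholders, not data from the paper's proprietary dataset.

```python
# Minimal ROUGE evaluation sketch using the open-source `rouge-score`
# package (pip install rouge-score). The strings below are hypothetical
# placeholders, not data from the paper's dataset.
from rouge_score import rouge_scorer

reference = "the model generates abstracts and selects significant images"
candidate = "the model generates abstracts and picks important images"

# ROUGE-1/2 measure unigram/bigram overlap; ROUGE-L measures the longest
# common subsequence between the candidate and reference summaries.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    # Papers typically report the F-measure, scaled by 100 in tables.
    print(f"{name}: precision={result.precision:.4f} "
          f"recall={result.recall:.4f} f1={result.fmeasure:.4f}")
```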

Bibliographic Details
Main Authors: Wenhui Yu, Gengshen Wu (Faculty of Data Science, City University of Macau, Macao SAR, China); Jungong Han (Department of Automation, Tsinghua University, Beijing 100084, China)
Format: Article
Language: English
Published: MDPI AG, 2025-06-01
Series: Smart Cities, Vol. 8, No. 3, Article 96
ISSN: 2624-6511
DOI: 10.3390/smartcities8030096
Collection: DOAJ
Subjects: multi-task learning; multimodal learning; paper summarization; important image selection; cross-modal retrieval; smart city information management systems
Online Access: https://www.mdpi.com/2624-6511/8/3/96