PEGASUS-XL with saliency-guided scoring and long-input encoding for multi-document abstractive summarization

Abstract With the exponential growth of digital content, Multi-Document Summarization (MDS) has become increasingly critical for synthesizing dispersed information into coherent and contextually relevant summaries. This paper presents PEGASUS-XL, an enhanced abstractive summarization framework that...

Bibliographic Details
Main Authors: Rawan Alsultan, Alaa Sagheer, Hala Hamdoun, Lamya Alshamlan, Latifah Alfadhli
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
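Series: Scientific Reports
Subjects: Natural language processing; Abstractive summarization; Multi-document summarization; Saliency modeling; TF-IDF; SBERT embeddings
Online Access: https://doi.org/10.1038/s41598-025-11062-2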
author Rawan Alsultan
Alaa Sagheer
Hala Hamdoun
Lamya Alshamlan
Latifah Alfadhli
collection DOAJ
description Abstract With the exponential growth of digital content, Multi-Document Summarization (MDS) has become increasingly critical for synthesizing dispersed information into coherent and contextually relevant summaries. This paper presents PEGASUS-XL, an enhanced abstractive summarization framework that addresses key challenges in MDS, including salient content selection, redundancy reduction, factual consistency, and input length limitations. PEGASUS-XL is developed through a structured enhancement pipeline that integrates lexical-semantic saliency modeling with long-input encoding. It employs a hybrid scoring mechanism that combines TF-IDF and SBERT representations, modulated by a document-aware adaptive weighting scheme to dynamically balance lexical and semantic importance. To promote diversity and reduce redundancy, Maximal Marginal Relevance (MMR) is applied during content selection. To overcome the 1024-token limitation of standard Transformer models, Longformer is incorporated to enable efficient sparse attention over extended contexts. The vanilla PEGASUS model serves as the decoder and is fine-tuned on saliency-ranked, Longformer-encoded inputs to generate abstractive summaries. Extensive experiments on the Multi-News and XSum datasets demonstrate that PEGASUS-XL consistently outperforms strong baselines, including BART and PRIMERA, across multiple evaluation metrics (ROUGE, METEOR, BERTScore, and SBERT similarity). Ablation studies quantify the contribution of each component, and detailed error analysis identifies remaining issues such as factual drift and residual redundancy. Human evaluations further confirm that PEGASUS-XL produces summaries that are more coherent, informative, and faithful. Efficiency profiling shows that the framework achieves substantial quality gains without incurring disproportionate computational costs. Together, these contributions position PEGASUS-XL as a robust, scalable, and extensible solution for high-quality abstractive summarization in real-world multi-document scenarios.
format Article
id doaj-art-e2ead9ce80184e8c91974fd807dec30e
institution DOAJ
issn 2045-2322
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-e2ead9ce80184e8c91974fd807dec30e; 2025-08-20T03:04:34Z; eng; Nature Portfolio; Scientific Reports; ISSN 2045-2322; 2025-07-01; volume 15, issue 1, pages 1-28; DOI 10.1038/s41598-025-11062-2
Rawan Alsultan, Alaa Sagheer, Hala Hamdoun, Lamya Alshamlan, Latifah Alfadhli (all authors: Department of Computer Science, College of Computer Sciences and Information Technology, King Faisal University)
title PEGASUS-XL with saliency-guided scoring and long-input encoding for multi-document abstractive summarization
topic Natural language processing
Abstractive summarization
Multi-document summarization
Saliency modeling
TF-IDF
SBERT embeddings
url https://doi.org/10.1038/s41598-025-11062-2
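
The content-selection stage named in the abstract (a hybrid lexical-semantic saliency score followed by Maximal Marginal Relevance) can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the scoring formula, so the linear mix with a fixed weight `alpha`, the trade-off `lam`, and the function names are all assumptions, and the small hand-written score lists and vectors stand in for real TF-IDF scores and SBERT embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_scores(lexical, semantic, alpha):
    """Mix per-sentence lexical and semantic scores linearly.

    Assumption: alpha is a fixed constant here. The paper describes a
    document-aware adaptive weight, whose form the abstract does not specify.
    """
    return [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]


def mmr_select(scores, vectors, k, lam=0.7):
    """Greedy MMR: repeatedly pick the sentence with the best trade-off between
    its saliency score and its similarity to sentences already selected."""
    selected = []
    candidates = list(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lam * scores[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy input: sentence 1 is a near-duplicate of sentence 0; sentence 2 is
# less salient but covers different content.
lexical = [0.9, 0.8, 0.4]    # stand-in for normalized TF-IDF saliency
semantic = [0.9, 0.8, 0.4]   # stand-in for SBERT-based saliency
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # stand-in for SBERT embeddings

scores = hybrid_scores(lexical, semantic, alpha=0.5)
picked = mmr_select(scores, vectors, k=2, lam=0.5)
print(picked)
```

With `lam=0.5` the duplicate sentence is skipped in favour of the less salient but novel one; setting `lam=1.0` turns the redundancy penalty off and reduces the selection to plain top-k by saliency.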