PEGASUS-XL with saliency-guided scoring and long-input encoding for multi-document abstractive summarization

Bibliographic Details
Main Authors:	Rawan Alsultan, Alaa Sagheer, Hala Hamdoun, Lamya Alshamlan, Latifah Alfadhli
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-07-01
Series:	Scientific Reports
Subjects:	Natural language processing; Abstractive summarization; Multi-document summarization; Saliency modeling; TF-IDF; SBERT embeddings
Online Access:	https://doi.org/10.1038/s41598-025-11062-2

Description
Summary:	With the exponential growth of digital content, Multi-Document Summarization (MDS) has become increasingly critical for synthesizing dispersed information into coherent and contextually relevant summaries. This paper presents PEGASUS-XL, an enhanced abstractive summarization framework that addresses key challenges in MDS, including salient content selection, redundancy reduction, factual consistency, and input length limitations. PEGASUS-XL is developed through a structured enhancement pipeline that integrates lexical-semantic saliency modeling with long-input encoding. It employs a hybrid scoring mechanism that combines TF-IDF and SBERT representations, modulated by a document-aware adaptive weighting scheme to dynamically balance lexical and semantic importance. To promote diversity and reduce redundancy, Maximal Marginal Relevance (MMR) is applied during content selection. To overcome the 1024-token limitation of standard Transformer models, Longformer is incorporated to enable efficient sparse attention over extended contexts. The vanilla PEGASUS model serves as the decoder and is fine-tuned on saliency-ranked, Longformer-encoded inputs to generate abstractive summaries. Extensive experiments on the Multi-News and XSum datasets demonstrate that PEGASUS-XL consistently outperforms strong baselines, including BART and PRIMERA, across multiple evaluation metrics (ROUGE, METEOR, BERTScore, and SBERT similarity). Ablation studies quantify the contribution of each component, and detailed error analysis identifies remaining issues such as factual drift and residual redundancy. Human evaluations further confirm that PEGASUS-XL produces summaries that are more coherent, informative, and faithful. Efficiency profiling shows that the framework achieves substantial quality gains without incurring disproportionate computational costs. Together, these contributions position PEGASUS-XL as a robust, scalable, and extensible solution for high-quality abstractive summarization in real-world multi-document scenarios.
ISSN:	2045-2322
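
Illustrative note: the summary describes a hybrid lexical-semantic saliency score followed by MMR-based content selection. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes the per-sentence TF-IDF and SBERT saliency scores and the sentence embeddings are already computed and passed in as plain lists; the function names, the fixed blending weight `alpha` (the paper describes an adaptive, document-aware weight instead), and the trade-off parameter `lam` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_scores(lexical, semantic, alpha):
    """Blend lexical and semantic saliency per sentence.

    alpha is a fixed weight in [0, 1] here (an assumption made for this sketch).
    """
    return [alpha * lex + (1.0 - alpha) * sem for lex, sem in zip(lexical, semantic)]


def mmr_select(scores, vectors, k, lam=0.7):
    """Greedy Maximal Marginal Relevance selection.

    Repeatedly picks the candidate maximizing
    lam * saliency - (1 - lam) * (max similarity to already selected items).
    Returns the chosen indices in selection order.
    """
    selected = []
    candidates = list(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lam * scores[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    # Toy inputs: sentences 0 and 1 are near-duplicates, sentence 2 is distinct.
    lexical = [0.9, 0.8, 0.3]
    semantic = [0.7, 0.9, 0.5]
    vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    scores = hybrid_scores(lexical, semantic, alpha=0.5)
    print(mmr_select(scores, vectors, k=2, lam=0.5))
```

In this toy input, sentences 0 and 1 share the same embedding, so once one of them is selected the redundancy penalty favors the less salient but distinct sentence 2. Setting `lam=1.0` disables the penalty and reduces the selection to a plain saliency ranking.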

Similar Items