PEGASUS-XL with saliency-guided scoring and long-input encoding for multi-document abstractive summarization
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
| Subjects: | |
| Online Access: | https://doi.org/10.1038/s41598-025-11062-2 |
| Summary: | With the exponential growth of digital content, Multi-Document Summarization (MDS) has become increasingly critical for synthesizing dispersed information into coherent and contextually relevant summaries. This paper presents PEGASUS-XL, an enhanced abstractive summarization framework that addresses key challenges in MDS, including salient content selection, redundancy reduction, factual consistency, and input length limitations. PEGASUS-XL is developed through a structured enhancement pipeline that integrates lexical-semantic saliency modeling with long-input encoding. It employs a hybrid scoring mechanism that combines TF-IDF and SBERT representations, modulated by a document-aware adaptive weighting scheme to dynamically balance lexical and semantic importance. To promote diversity and reduce redundancy, Maximal Marginal Relevance (MMR) is applied during content selection. To overcome the 1024-token limitation of standard Transformer models, Longformer is incorporated to enable efficient sparse attention over extended contexts. The vanilla PEGASUS model serves as the decoder and is fine-tuned on saliency-ranked, Longformer-encoded inputs to generate abstractive summaries. Extensive experiments on the Multi-News and XSum datasets demonstrate that PEGASUS-XL consistently outperforms strong baselines, including BART and PRIMERA, across multiple evaluation metrics (ROUGE, METEOR, BERTScore, and SBERT similarity). Ablation studies quantify the contribution of each component, and detailed error analysis identifies remaining issues such as factual drift and residual redundancy. Human evaluations further confirm that PEGASUS-XL produces summaries that are more coherent, informative, and faithful. Efficiency profiling shows that the framework achieves substantial quality gains without incurring disproportionate computational costs. Together, these contributions position PEGASUS-XL as a robust, scalable, and extensible solution for high-quality abstractive summarization in real-world multi-document scenarios. |
| ISSN: | 2045-2322 |
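
The content-selection step described in the summary (a blended lexical and semantic saliency score, followed by MMR) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `hybrid_scores` and `mmr_select`, the fixed weight `alpha`, and the trade-off `lam` are hypothetical, and the record does not say how the document-aware adaptive weighting is computed. The per-sentence scores and the pairwise similarity matrix are assumed to have been computed beforehand (for example from TF-IDF weights and SBERT embeddings).

```python
from typing import List, Sequence


def hybrid_scores(lexical: Sequence[float], semantic: Sequence[float],
                  alpha: float) -> List[float]:
    """Blend per-sentence lexical and semantic scores.

    alpha is a hypothetical fixed weight on the lexical side; the paper
    describes an adaptive, document-aware weight whose formula is not
    given in this record.
    """
    if len(lexical) != len(semantic):
        raise ValueError("score lists must have the same length")
    return [alpha * lx + (1.0 - alpha) * sm for lx, sm in zip(lexical, semantic)]


def mmr_select(relevance: Sequence[float],
               similarity: Sequence[Sequence[float]],
               k: int, lam: float = 0.7) -> List[int]:
    """Greedy Maximal Marginal Relevance over sentence indices.

    Each step picks the candidate maximizing
    lam * relevance - (1 - lam) * (max similarity to anything already picked).
    """
    selected: List[int] = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr(i: int) -> float:
            # Redundancy is zero until at least one sentence has been picked.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: sentences 0 and 1 are near-duplicates, sentence 2 is distinct.
lexical = [0.9, 0.8, 0.3]
semantic = [0.9, 0.9, 0.5]
similarity = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
scores = hybrid_scores(lexical, semantic, alpha=0.5)
picked = mmr_select(scores, similarity, k=2, lam=0.5)
```

With this toy input, the second pick skips the near-duplicate sentence 1 in favour of the less relevant but novel sentence 2, which is the redundancy-reducing behaviour MMR is meant to provide.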