PEGASUS-XL with saliency-guided scoring and long-input encoding for multi-document abstractive summarization
Abstract With the exponential growth of digital content, Multi-Document Summarization (MDS) has become increasingly critical for synthesizing dispersed information into coherent and contextually relevant summaries. This paper presents PEGASUS-XL, an enhanced abstractive summarization framework that...
| Main Authors: | Rawan Alsultan, Alaa Sagheer, Hala Hamdoun, Lamya Alshamlan, Latifah Alfadhli |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
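| Subjects: | Natural language processing; Abstractive summarization; Multi-document summarization; Saliency modeling; TF-IDF; SBERT embeddings |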
| Online Access: | https://doi.org/10.1038/s41598-025-11062-2 |
| _version_ | 1849766494479581184 |
|---|---|
| author | Rawan Alsultan; Alaa Sagheer; Hala Hamdoun; Lamya Alshamlan; Latifah Alfadhli |
| author_sort | Rawan Alsultan |
| collection | DOAJ |
| description | Abstract With the exponential growth of digital content, Multi-Document Summarization (MDS) has become increasingly critical for synthesizing dispersed information into coherent and contextually relevant summaries. This paper presents PEGASUS-XL, an enhanced abstractive summarization framework that addresses key challenges in MDS, including salient content selection, redundancy reduction, factual consistency, and input length limitations. PEGASUS-XL is developed through a structured enhancement pipeline that integrates lexical-semantic saliency modeling with long-input encoding. It employs a hybrid scoring mechanism that combines TF-IDF and SBERT representations, modulated by a document-aware adaptive weighting scheme to dynamically balance lexical and semantic importance. To promote diversity and reduce redundancy, Maximal Marginal Relevance (MMR) is applied during content selection. To overcome the 1024-token limitation of standard Transformer models, Longformer is incorporated to enable efficient sparse attention over extended contexts. The vanilla PEGASUS model serves as the decoder and is fine-tuned on saliency-ranked, Longformer-encoded inputs to generate abstractive summaries. Extensive experiments on the Multi-News and XSum datasets demonstrate that PEGASUS-XL consistently outperforms strong baselines, including BART and PRIMERA, across multiple evaluation metrics (ROUGE, METEOR, BERTScore, and SBERT similarity). Ablation studies quantify the contribution of each component, and detailed error analysis identifies remaining issues such as factual drift and residual redundancy. Human evaluations further confirm that PEGASUS-XL produces summaries that are more coherent, informative, and faithful. Efficiency profiling shows that the framework achieves substantial quality gains without incurring disproportionate computational costs. Together, these contributions position PEGASUS-XL as a robust, scalable, and extensible solution for high-quality abstractive summarization in real-world multi-document scenarios. |
| format | Article |
| id | doaj-art-e2ead9ce80184e8c91974fd807dec30e |
| institution | DOAJ |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-e2ead9ce80184e8c91974fd807dec30e; 2025-08-20T03:04:34Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-07-01; 151128; 10.1038/s41598-025-11062-2; PEGASUS-XL with saliency-guided scoring and long-input encoding for multi-document abstractive summarization; Rawan Alsultan, Alaa Sagheer, Hala Hamdoun, Lamya Alshamlan, Latifah Alfadhli (all: Department of Computer Science, College of Computer Sciences and Information Technology, King Faisal University); https://doi.org/10.1038/s41598-025-11062-2; Natural language processing; Abstractive summarization; Multi-document summarization; Saliency modeling; TF-IDF; SBERT embeddings |
| title | PEGASUS-XL with saliency-guided scoring and long-input encoding for multi-document abstractive summarization |
| topic | Natural language processing; Abstractive summarization; Multi-document summarization; Saliency modeling; TF-IDF; SBERT embeddings |
| url | https://doi.org/10.1038/s41598-025-11062-2 |
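
The abstract in this record describes a hybrid TF-IDF/SBERT saliency score followed by Maximal Marginal Relevance (MMR) selection, but gives no formulas. The following is a minimal Python sketch of that general idea, not the authors' implementation. Several things here are assumptions made for illustration: the function names `hybrid_scores` and `mmr_select`, the parameters `alpha` and `lam`, the use of similarity to a centroid as the saliency signal, and a fixed `alpha` standing in for the paper's document-aware adaptive weighting. Sentence embeddings are passed in as plain lists of floats, standing in for vectors that an SBERT model would produce.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF dict per sentence (whitespace tokens, smoothed IDF)."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cosine_dense(a, b):
    """Cosine similarity between two dense vectors stored as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_scores(sentences, embeddings, alpha=0.5):
    """Blend lexical and semantic saliency: alpha * lexical + (1 - alpha) * semantic.

    Assumption: saliency is similarity to the centroid of all sentences, and
    alpha is a fixed constant rather than an adaptive, per-document weight.
    """
    vecs = tfidf_vectors(sentences)
    lex_centroid = {}
    for v in vecs:
        for t, w in v.items():
            lex_centroid[t] = lex_centroid.get(t, 0.0) + w
    dim = len(embeddings[0])
    sem_centroid = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    return [
        alpha * cosine_sparse(v, lex_centroid) + (1 - alpha) * cosine_dense(e, sem_centroid)
        for v, e in zip(vecs, embeddings)
    ]


def mmr_select(scores, embeddings, k, lam=0.7):
    """Greedy MMR: pick k indices, trading relevance against similarity to prior picks."""
    selected = []
    candidates = list(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max(
                (cosine_dense(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

In this sketch, `mmr_select(hybrid_scores(sentences, embeddings), embeddings, k)` returns the indices of `k` sentences in selection order. Setting `lam=1.0` reduces MMR to plain top-`k` ranking by score, while lower values penalize candidates that resemble sentences already chosen.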