Mathematical Model and Algorithm for Accurate Main Content Extraction From News Websites
Irrelevant elements like ads, menus, and footers in web pages hinder data extraction and reduce the performance of Retrieval-Augmented Generation (RAG) systems in Large Language Models (LLMs). This paper tackles the challenge of accurately identifying and extracting the main content from web pages t...
| Main Authors: | Hamza Salem, Hadi Salloum, Manuel Mazzara |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Online Access: | https://ieeexplore.ieee.org/document/10819347/ |
Similar Items
- Swamped with Too Many Articles? GraphRAG Makes Getting Started Easy
  by: Joëd Ngangmeni, et al.
  Published: (2025-03-01)
- Large language models for closed-library multi-document query, test generation, and evaluation
  by: Claire Randolph, et al.
  Published: (2025-08-01)
- Retrieval-Augmented Generation to Generate Knowledge Assets and Creation of Action Drivers
  by: Antony James, et al.
  Published: (2025-06-01)
- Enhancing the Precision and Interpretability of Retrieval-Augmented Generation (RAG) in Legal Technology: A Survey
  by: Mahd Hindi, et al.
  Published: (2025-01-01)
- Comparative Analysis of RAG-Based Open-Source LLMs for Indonesian Banking Customer Service Optimization Using Simulated Data
  by: Hendra Lijaya, et al.
  Published: (2025-07-01)