An Improved Reference Paper Collection System Using Web Scraping with Three Enhancements

Bibliographic Details
Main Authors: Tresna Maulana Fahrudin, Nobuo Funabiki, Komang Candra Brata, Inzali Naing, Soe Thandar Aung, Amri Muhaimin, Dwi Arman Prasetya
Format: Article
Language: English
Published: MDPI AG 2025-04-01
Series: Future Internet
Online Access: https://www.mdpi.com/1999-5903/17/5/195
Description
Summary: Access to academic papers has improved significantly with electronic publication on the internet, where <i>open access</i> has become common. At the same time, this has increased the literature-survey workload for researchers, who usually download PDF files manually and check their contents. To address this drawback, we previously proposed a <i>reference paper collection system</i> using <i>web scraping</i> technology and natural language models. However, that system often finds only a limited number of relevant reference papers and takes a long time to do so, since it relies on a single paper search website and runs on a single thread even on a multi-core CPU. In this paper, we present an improved <i>reference paper collection system</i> with three enhancements that address these limitations: (1) integrating the APIs of multiple paper search websites to retrieve article metadata, namely the <i>bulk search endpoint</i> in the <i>Semantic Scholar</i> API, the <i>article search endpoint</i> in the <i>DOAJ</i> API, and the <i>search and fetch endpoint</i> in the <i>PubMed</i> API; (2) running the program on multiple threads on a multi-core CPU; and (3) implementing <i>Dynamic URL Redirection</i>, <i>Regex-based URL Parsing</i>, and <i>HTML Scraping with URL Extraction</i> to check PDF file accessibility quickly, along with sentence embedding to assess relevance based on semantic similarity. For evaluation, we compare the number of obtained reference papers and the response time of the proposal, our previous work, and common literature search tools on five reference paper queries. The results show that, compared to our previous work, the proposal increases the number of relevant reference papers by 64.38% and reduces the time by 59.78% on average, while also outperforming common literature search tools in the number of reference papers obtained. These experiments demonstrate the effectiveness of the proposed system.
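The overall flow described in the summary (querying several paper search sources in parallel, merging their metadata, and ranking the results by similarity to the query) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The three `search_*` functions are hypothetical stubs that return placeholder records instead of calling the real Semantic Scholar, DOAJ, or PubMed APIs; the DOIs and titles are made-up sample data; and `toy_embed` is a simple word-count vector standing in for a real sentence-embedding model.

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three API clients. Each returns article
# metadata as dicts; a real client would send HTTP requests instead.
# All DOIs and titles below are placeholder data.
def search_semantic_scholar(query):
    return [{"doi": "10.1000/a", "title": "Web scraping for literature surveys"}]

def search_doaj(query):
    return [{"doi": "10.1000/b", "title": "Open access journal indexing"},
            {"doi": "10.1000/a", "title": "Web scraping for literature surveys"}]

def search_pubmed(query):
    return [{"doi": "10.1000/c", "title": "Clinical trial outcomes in cardiology"}]

SOURCES = [search_semantic_scholar, search_doaj, search_pubmed]

def collect(query):
    """Query every source on its own thread, then merge results by DOI."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        # pool.map preserves the order of SOURCES in its results.
        batches = list(pool.map(lambda source: source(query), SOURCES))
    merged = {}
    for batch in batches:
        for paper in batch:
            merged.setdefault(paper["doi"], paper)  # keep first copy of each DOI
    return list(merged.values())

def toy_embed(text):
    """Word-count vector; a placeholder for a sentence-embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[word] * v[word] for word in u)  # Counter yields 0 if missing
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, papers):
    """Sort papers by similarity between the query and each title."""
    q = toy_embed(query)
    return sorted(papers, key=lambda p: cosine(q, toy_embed(p["title"])),
                  reverse=True)

papers = collect("web scraping literature")
ranked = rank("web scraping literature", papers)
for paper in ranked:
    print(paper["doi"], paper["title"])
```

Threads suit this step because API requests are mostly spent waiting on the network. In a fuller version, the stubs would be replaced by real API calls and `toy_embed` by an actual sentence-embedding model, with the rest of the structure left unchanged.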
ISSN: 1999-5903