Using Crawlers for Targeted Data Extraction from a Local Multi-File Database

Bibliographic Details
Main Authors: Alexandru ANGHELESCU, Ciprian-Viorel STUPINEAN, Ariana-Anamaria CORDOȘ
Format: Article
Language: English
Published: Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca 2025-05-01
Series: Applied Medical Informatics
Subjects: Crawler; C#; Automated Data Extraction; Email; Data
Online Access: https://ami.info.umfcluj.ro/index.php/AMI/article/view/1116
author Alexandru ANGHELESCU
Ciprian-Viorel STUPINEAN
Ariana-Anamaria CORDOȘ
collection DOAJ
description Background: A crawler is a software program that extracts data automatically. This study aimed to demonstrate how a crawler can extract specific data from multiple, diverse .pdf files. Methods and Materials: A C# .NET 9 application was developed to process a folder (local database) containing specific .pdf files. The application read each file sequentially and extracted the relevant information. The crawler's ability to extract the e-mail addresses of corresponding authors from academic papers was evaluated, given that a single .pdf file could contain multiple articles. Where possible, the names of corresponding authors were extracted as well. Because the input data were .pdf files, the PdfPig library was used to access their content. The output was a CSV file containing all extracted e-mail addresses. The input dataset comprised 19 books of abstracts and 180 articles. Results: During testing, the application extracted 929 e-mail addresses and 77 names. However, owing to pattern inconsistencies, name extraction was possible only for articles, not for books of abstracts. Precision and accuracy were then evaluated. Of the 880 extracted lines, only 1 did not contain an e-mail address, but 34.55% required corrective action. In 213 instances text was attached to the e-mail address (e.g., country names such as Spain or Israel, or other words such as "keyword" or "abstract"), in 156 cases a country prefix was attached, and in 2 lines there were additional full stops at the beginning or end of the e-mail address. Discussion: Crawlers can be effective at extracting specific data from large databases of files at once. In medical research, this ability can improve productivity when collecting data for research purposes. On the other hand, it poses a risk when personal information, e-mail addresses in this case, becomes accessible for malicious purposes. Future work should explore compliance with data protection regulations, such as the GDPR, and methods to ensure responsible data use. Conclusion: Beyond their usefulness in extracting e-mail addresses, crawlers prove efficient in the data-gathering part of research. By using a crawler, a researcher working with a large database of studies to cite may save time by not having to extract the data manually.
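The corrective actions mentioned in the description (attached words, stray full stops) can be illustrated with a minimal sketch. This is not the authors' application, which was written in C# on .NET 9 and read the .pdf files with PdfPig. The sketch below is Python, assumes the text has already been extracted from the PDFs, and uses a regular expression, a word list (`ATTACHED_WORDS`), and function names that are all hypothetical. Country prefixes are not handled because the description does not specify their form.

```python
import csv
import io
import re

# Assumed, deliberately permissive pattern: local part, "@", dotted domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Hypothetical list of words that may end up glued to an address in PDF text.
ATTACHED_WORDS = ("Spain", "Israel", "Keywords", "Keyword", "Abstract")


def clean_candidate(candidate):
    """Strip stray full stops and one attached trailing word; None if invalid."""
    candidate = candidate.strip(".")
    for word in ATTACHED_WORDS:
        if candidate.endswith(word) and len(candidate) > len(word):
            candidate = candidate[: -len(word)]
            break
    candidate = candidate.strip(".")
    return candidate if EMAIL_RE.fullmatch(candidate) else None


def extract_emails(text):
    """Return cleaned addresses in order of appearance, without duplicates."""
    found = []
    seen = set()
    for match in EMAIL_RE.finditer(text):
        email = clean_candidate(match.group(0))
        if email is not None and email.lower() not in seen:
            seen.add(email.lower())
            found.append(email)
    return found


def to_csv(emails):
    """Serialize the addresses as a one-column CSV string with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["email"])
    for email in emails:
        writer.writerow([email])
    return buffer.getvalue()
```

Because the word list is matched case-sensitively and only at the end of a candidate, an address is altered only when a capitalized word from the list is glued directly to it; any real use would need a list derived from the actual corpus.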
format Article
id doaj-art-52cdf6f54b4841858e41c58c59e57dc0
institution OA Journals
issn 2067-7855
language English
publishDate 2025-05-01
publisher Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca
record_format Article
series Applied Medical Informatics
spelling doaj-art-52cdf6f54b4841858e41c58c59e57dc0; 2025-08-20T02:07:44Z; eng; Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca; Applied Medical Informatics; 2067-7855; 2025-05-01; 47, Suppl. 1
Using Crawlers for Targeted Data Extraction from a Local Multi-File Database
Alexandru ANGHELESCU (0), Ciprian-Viorel STUPINEAN (1), Ariana-Anamaria CORDOȘ (2)
0: West University of Timișoara
1: Romanian Society of Medical Informatics, 300222 Timișoara, Romania; Department of Computer Science, Faculty of Mathematics and Computer Science, Babeş-Bolyai University, 400084 Cluj-Napoca, Romania
2: Romanian Society of Medical Informatics, 300222 Timișoara, Romania; Department of Public Health, Faculty of Political, Administrative and Communication Sciences, Babeş-Bolyai University, 400084 Cluj-Napoca, Romania
https://ami.info.umfcluj.ro/index.php/AMI/article/view/1116
Crawler; C#; Automated Data Extraction; Email; Data
title Using Crawlers for Targeted Data Extraction from a Local Multi-File Database
topic Crawler
C#
Automated Data Extraction
Email
Data
url https://ami.info.umfcluj.ro/index.php/AMI/article/view/1116