Using Crawlers for Targeted Data Extraction from a Local Multi-File Database

Background: A crawler is a software program used to extract data in an automated manner. This study aimed to demonstrate how a crawler can extract specific data from multiple diverse .pdf files. Methods and Materials: To achieve this objective, a C# .NET9 application was developed, capable of proce...

Bibliographic Details
Main Authors:	Alexandru ANGHELESCU, Ciprian-Viorel STUPINEAN, Ariana-Anamaria CORDOȘ
Format:	Article
Language:	English
Published:	Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca 2025-05-01
Series:	Applied Medical Informatics
Subjects:	Crawler, C#, Automated Data Extraction, Email, Data
Online Access:	https://ami.info.umfcluj.ro/index.php/AMI/article/view/1116

_version_	1850218476496486400
author	Alexandru ANGHELESCU Ciprian-Viorel STUPINEAN Ariana-Anamaria CORDOȘ
author_facet	Alexandru ANGHELESCU Ciprian-Viorel STUPINEAN Ariana-Anamaria CORDOȘ
author_sort	Alexandru ANGHELESCU
collection	DOAJ
description	Background: A crawler is a software program that extracts data automatically. This study aimed to demonstrate how a crawler can extract specific data from multiple, diverse .pdf files. Methods and Materials: A C# .NET 9 application was developed to process a folder (local database) of .pdf files. The application read each file sequentially and extracted the relevant information. The crawler was evaluated on its ability to extract the email addresses of corresponding authors from academic papers, given that a single .pdf file could contain multiple articles. In addition to email addresses, the names of corresponding authors were extracted where possible. Because the input data were .pdf files, the PdfPig library was used to access their content. The output was a CSV file containing all extracted email addresses. The input dataset comprised 19 books of abstracts and 180 articles. Results: During testing, the application extracted 929 email addresses and 77 names. Owing to pattern inconsistencies, name extraction was possible only for articles, not for books of abstracts. Precision and accuracy were then evaluated. Of the 880 extracted lines, only 1 did not contain an email address, but 34.55% needed corrective action. In 213 instances, text was attached to the email address (e.g., country names such as Spain or Israel, or other words such as "keyword" or "abstract"); country prefixes were attached in 156 cases; and in 2 lines there were additional full stops at the beginning or end of the email address. Discussion: Crawlers can be effective at extracting specific data from large databases of files at once. In medical research, this ability can improve productivity when collecting data for research purposes. On the other hand, it poses a risk when personal information, email addresses in this case, becomes accessible for malicious purposes. Future work should explore compliance with data protection regulations, such as GDPR, and methods to ensure responsible data use. Conclusion: Beyond their usefulness for extracting email addresses, crawlers are efficient at the data-gathering stage of research. When working with a large database of studies to cite, a researcher may save time by delegating the data extraction to a crawler.
format	Article
id	doaj-art-52cdf6f54b4841858e41c58c59e57dc0
institution	OA Journals
issn	2067-7855
language	English
publishDate	2025-05-01
publisher	Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca
record_format	Article
series	Applied Medical Informatics
spelling	doaj-art-52cdf6f54b4841858e41c58c59e57dc0 | 2025-08-20T02:07:44Z | eng | Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca | Applied Medical Informatics | 2067-7855 | 2025-05-01 | 47 | Suppl. 1 | Using Crawlers for Targeted Data Extraction from a Local Multi-File Database | Alexandru ANGHELESCU [0] | Ciprian-Viorel STUPINEAN [1] | Ariana-Anamaria CORDOȘ [2] | [0] West University of Timișoara | [1] Romanian Society of Medical Informatics, 300222 Timișoara, Romania; Department of Computer Science, Faculty of Mathematics and Computer Science, Babeş-Bolyai University, 400084 Cluj-Napoca, Romania | [2] Romanian Society of Medical Informatics, 300222 Timișoara, Romania; Department of Public Health, Faculty of Political, Administrative and Communication Sciences, Babeş-Bolyai University, 400084 Cluj-Napoca, Romania | https://ami.info.umfcluj.ro/index.php/AMI/article/view/1116 | Crawler | C# | Automated Data Extraction | Email | Data
spellingShingle	Alexandru ANGHELESCU Ciprian-Viorel STUPINEAN Ariana-Anamaria CORDOȘ Using Crawlers for Targeted Data Extraction from a Local Multi-File Database Applied Medical Informatics Crawler C# Automated Data Extraction Email Data
title	Using Crawlers for Targeted Data Extraction from a Local Multi-File Database
title_full	Using Crawlers for Targeted Data Extraction from a Local Multi-File Database
title_fullStr	Using Crawlers for Targeted Data Extraction from a Local Multi-File Database
title_full_unstemmed	Using Crawlers for Targeted Data Extraction from a Local Multi-File Database
title_short	Using Crawlers for Targeted Data Extraction from a Local Multi-File Database
title_sort	using crawlers for targeted data extraction from a local multi file database
topic	Crawler, C#, Automated Data Extraction, Email, Data
url	https://ami.info.umfcluj.ro/index.php/AMI/article/view/1116
work_keys_str_mv	AT alexandruanghelescu usingcrawlersfortargeteddataextractionfromalocalmultifiledatabase AT ciprianviorelstupinean usingcrawlersfortargeteddataextractionfromalocalmultifiledatabase AT arianaanamariacordos usingcrawlersfortargeteddataextractionfromalocalmultifiledatabase
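
Illustrative sketch (not the authors' code): the description above outlines a pipeline that reads each file in a folder, extracts email addresses, and writes them to a CSV file, and it reports that many extracted lines needed correction because text was attached to the address. The Python sketch below mirrors those steps under stated assumptions. It assumes the text of each .pdf has already been extracted to a .txt file (the study used the C# PdfPig library for that step, which is not reproduced here). The regular expression, the review heuristic, and all function names are hypothetical and chosen only for illustration.

```python
import csv
import io
import re
from pathlib import Path

# Assumption: a deliberately simple email pattern, not the one used in the study.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

SAMPLE = (
    "Corresponding author: Jane Doe, jane.doe@example.org. Keywords: crawler\n"
    "Contact: j.smith@example.esSpain; JANE.DOE@example.org"
)


def extract_emails(text):
    """Return unique email-like strings in order of first appearance."""
    seen = set()
    found = []
    for match in EMAIL_RE.finditer(text):
        # Drop stray full stops at either end of the match.
        email = match.group(0).strip(".")
        key = email.lower()
        if key not in seen:
            seen.add(key)
            found.append(email)
    return found


def needs_review(email):
    """Hypothetical heuristic: flag addresses whose last domain label is not
    purely lowercase letters, which often means text got glued on
    (for example 'esSpain')."""
    last_label = email.rsplit(".", 1)[-1]
    return not (last_label.isalpha() and last_label.islower())


def crawl_folder(folder):
    """Read every .txt file in `folder` and collect (file, email, flag) rows."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for email in extract_emails(text):
            rows.append((path.name, email, needs_review(email)))
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file", "email", "needs_review"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    for address in extract_emails(SAMPLE):
        print(address, "REVIEW" if needs_review(address) else "ok")
```

The heuristic only catches attached text that starts with an uppercase letter; lowercase suffixes such as "abstract" would pass unflagged, so a manual check of the output, as performed in the study, would still be needed.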

Similar Items