An automated data collection process for constructing graph data relying on LLMs

This paper introduces a process designed to harvest data automatically from a variety of online sources. The core of this process lies in its data-handling techniques, which include drawing, cleaning, deduplicating, extracting, and categorizing raw data to convert unstructured data into...

Bibliographic Details
Main Authors: Ngoc Ton Ho, Hoang Son Nguyen, Ngoc Minh Chau Ngueyen, Pham Cuong Nguyen
Format: Article
Language: English
Published: Can Tho University Publisher 2024-10-01
Series: CTU Journal of Innovation and Sustainable Development
Subjects: Data collection process, graph data, large language model
Online Access: https://ctujs.ctu.edu.vn/index.php/ctujs/article/view/1148
author Ngoc Ton Ho
Hoang Son Nguyen
Ngoc Minh Chau Ngueyen
Pham Cuong Nguyen
collection DOAJ
description This paper introduces a process designed to harvest data automatically from a variety of online sources. The core of this process lies in its data-handling techniques, which include drawing, cleaning, deduplicating, extracting, and categorizing raw data to convert unstructured data into a structured format that is represented in and imported into a graph database. The data extraction step uses Large Language Models (LLMs) for Named Entity Recognition (NER). A case study on deploying course data collection illustrates the enhancements brought about by this automation, showing improvements in the accuracy, completeness, and timeliness of updates to the course data. An evaluation of the extraction and matching methods shows that the F1-score and precision rates are high. Overall, this study contributes to the advancement of the field by providing a methodology for automating the collection and processing of data from online sources, significantly improving the quality of the collected data.
id doaj-art-34c7165bdd4f4708829c0b34fe63d4ee
institution DOAJ
issn 2588-1418
2815-6412
volume 16
issue Special issue: ISDS
doi 10.22144/ctujoisd.2024.323
affiliation Faculty of Information Technology, University of Science, Ho Chi Minh City
title An automated data collection process for constructing graph data relying on LLMs
topic Data collection process, graph data, large language model