An automated data collection process for constructing graph data relying on LLMs
This paper introduces a process designed to harvest data automatically from a variety of online sources. The core of this process lies in its data-handling techniques, which include drawing, cleaning, deduplicating, extracting, and categorizing raw data to convert unstructured data into...
| Main Authors: | Ngoc Ton Ho, Hoang Son Nguyen, Ngoc Minh Chau Ngueyen, Pham Cuong Nguyen |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Can Tho University Publisher, 2024-10-01 |
| Series: | CTU Journal of Innovation and Sustainable Development |
| Subjects: | Data collection process, graph data, large language model |
| Online Access: | https://ctujs.ctu.edu.vn/index.php/ctujs/article/view/1148 |
| _version_ | 1849688848556097536 |
|---|---|
| author | Ngoc Ton Ho; Hoang Son Nguyen; Ngoc Minh Chau Ngueyen; Pham Cuong Nguyen |
| author_sort | Ngoc Ton Ho |
| collection | DOAJ |
| description | This paper introduces a process designed to harvest data automatically from a variety of online sources. The core of this process lies in its data-handling techniques, which include drawing, cleaning, deduplicating, extracting, and categorizing raw data to convert unstructured data into a structured format that is represented in and imported into a graph database. The data extraction step uses Large Language Models (LLMs) for Named Entity Recognition (NER). A case study on course data collection illustrates the enhancements brought about by this automation, showing improvements in the accuracy, completeness, and timeliness of updates to the course data. An evaluation of the extraction and matching methods shows high F1-score and precision. Overall, this study contributes a methodology for automating the collection and processing of data from online sources, significantly improving the quality of the collected data. |
| format | Article |
| id | doaj-art-34c7165bdd4f4708829c0b34fe63d4ee |
| institution | DOAJ |
| issn | 2588-1418 2815-6412 |
| language | English |
| publishDate | 2024-10-01 |
| publisher | Can Tho University Publisher |
| record_format | Article |
| series | CTU Journal of Innovation and Sustainable Development |
| spelling | doaj-art-34c7165bdd4f4708829c0b34fe63d4ee; 2025-08-20T03:21:50Z; eng; Can Tho University Publisher; CTU Journal of Innovation and Sustainable Development; 2588-1418; 2815-6412; 2024-10-01; 16; Special issue: ISDS; 10.22144/ctujoisd.2024.323; An automated data collection process for constructing graph data relying on LLMs; Ngoc Ton Ho; Hoang Son Nguyen; Ngoc Minh Chau Ngueyen; Pham Cuong Nguyen; Faculty of Information Technology, University of Science, Ho Chi Minh City; https://ctujs.ctu.edu.vn/index.php/ctujs/article/view/1148; Data collection process, graph data, large language model |
| title | An automated data collection process for constructing graph data relying on LLMs |
| topic | Data collection process, graph data, large language model |
| url | https://ctujs.ctu.edu.vn/index.php/ctujs/article/view/1148 |
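
The description above names cleaning, deduplicating, extracting, and converting to a graph-ready structure as steps of the process. The following is a minimal sketch of how such steps might be chained, not the authors' implementation. Every name here is an assumption made for illustration: `stub_extract` is a hypothetical stand-in for the LLM-based NER step, the `"Title | Provider"` input layout is invented, and the `OFFERED_BY` relation is an invented edge label rather than the paper's schema.

```python
import re
from typing import Callable, Iterable

# A graph edge as (subject, relation, object); an assumed representation.
Triple = tuple[str, str, str]


def clean(text: str) -> str:
    """Replace HTML-like tags with spaces and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()


def deduplicate(records: Iterable[str]) -> list[str]:
    """Drop records that match an earlier one after case-folding, keeping order."""
    seen: set[str] = set()
    unique: list[str] = []
    for record in records:
        key = record.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def stub_extract(record: str) -> dict[str, str]:
    """Hypothetical stand-in for LLM-based NER.

    Splits a "Title | Provider" string on the first "|". A real system
    would send the record to a language model here instead.
    """
    title, _, provider = record.partition("|")
    return {"course": title.strip(), "provider": provider.strip()}


def to_triples(entities: dict[str, str]) -> list[Triple]:
    """Turn extracted entities into edges; skip records missing either entity."""
    if not entities["course"] or not entities["provider"]:
        return []
    return [(entities["course"], "OFFERED_BY", entities["provider"])]


def run_pipeline(
    raw: Iterable[str],
    extract: Callable[[str], dict[str, str]] = stub_extract,
) -> list[Triple]:
    """Clean, deduplicate, extract, and convert raw records to triples."""
    cleaned = [c for c in (clean(r) for r in raw) if c]
    triples: list[Triple] = []
    for record in deduplicate(cleaned):
        triples.extend(to_triples(extract(record)))
    return triples
```

The resulting triples could then be loaded into a graph database by whatever import mechanism that database provides. The extractor is passed in as a parameter so that the stub can be swapped for a model-backed function without changing the rest of the sketch.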