DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset

In this paper, we propose and implement a systematic pipeline for the automatic classification of AI-related documents extracted from large-scale literature databases. This process results in the creation of an AI-related literature dataset named DeepDiveAI. The dataset construction pipeline integrates expert knowledge with the capabilities of advanced models, structured into two primary stages. In the first stage, expert-curated classification datasets are used to train a Long Short-Term Memory (LSTM) model, which performs coarse-grained classification of AI-related records from large-scale datasets. In the second stage, a large language model, specifically Qwen2.5 Plus, is employed to annotate a random 10% of the initially coarse set of classified AI-related records. These annotated records are subsequently used to train a Bidirectional Encoder Representations from Transformers (BERT) based binary classifier, further refining the coarse set to produce the final DeepDiveAI dataset. Evaluation results indicate that the proposed pipeline achieves both accuracy and efficiency in identifying AI-related literature from large-scale datasets.


Bibliographic Details
Main Authors: Xingzhou Liang, Xiaochen Zhou, Hui Zou, Yi Lu, Jingjing Qu
Format: Article
Language: English
Published: Tsinghua University Press, 2025-06-01
Series: Journal of Social Computing
Subjects:
Online Access: https://www.sciopen.com/article/10.23919/JSC.2025.0007
_version_ 1849426222135640064
author Xingzhou Liang
Xiaochen Zhou
Hui Zou
Yi Lu
Jingjing Qu
author_facet Xingzhou Liang
Xiaochen Zhou
Hui Zou
Yi Lu
Jingjing Qu
author_sort Xingzhou Liang
collection DOAJ
description In this paper, we propose and implement a systematic pipeline for the automatic classification of AI-related documents extracted from large-scale literature databases. This process results in the creation of an AI-related literature dataset named DeepDiveAI. The dataset construction pipeline integrates expert knowledge with the capabilities of advanced models, structured into two primary stages. In the first stage, expert-curated classification datasets are used to train a Long Short-Term Memory (LSTM) model, which performs coarse-grained classification of AI-related records from large-scale datasets. In the second stage, a large language model, specifically Qwen2.5 Plus, is employed to annotate a random 10% of the initially coarse set of classified AI-related records. These annotated records are subsequently used to train a Bidirectional Encoder Representations from Transformers (BERT) based binary classifier, further refining the coarse set to produce the final DeepDiveAI dataset. Evaluation results indicate that the proposed pipeline achieves both accuracy and efficiency in identifying AI-related literature from large-scale datasets.
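The two-stage construction described in the abstract can be sketched as orchestration code. This is a minimal, hedged sketch: the callables `is_ai_related`, `llm_label`, and `train_classifier`, and all the `toy_*` stand-ins, are hypothetical placeholders for the paper's LSTM, Qwen2.5 Plus, and BERT components; only the control flow (coarse filter, 10% sample for LLM annotation, train a refiner, re-score the coarse set) follows the text.

```python
import random


def stage1_coarse_filter(records, is_ai_related):
    """Stage 1: coarse-grained screen of the full corpus
    (an expert-data-trained LSTM classifier in the paper)."""
    return [r for r in records if is_ai_related(r)]


def stage2_refine(coarse_set, llm_label, train_classifier,
                  sample_frac=0.10, seed=0):
    """Stage 2: annotate a random ~10% sample with an LLM
    (Qwen2.5 Plus in the paper), train a binary classifier on the
    annotations (BERT in the paper), then re-score the coarse set."""
    rng = random.Random(seed)
    n = max(1, round(len(coarse_set) * sample_frac))
    annotated = [(r, llm_label(r)) for r in rng.sample(coarse_set, n)]
    clf = train_classifier(annotated)
    return [r for r in coarse_set if clf(r) == 1]


# Toy stand-ins so the pipeline runs end to end (keyword rules, not models).
def toy_coarse(r):            # plays the stage-1 LSTM
    return "learn" in r or "neural" in r


def toy_llm(r):               # plays the Qwen2.5 Plus annotator
    return 1 if "neural" in r else 0


def toy_train(annotated):     # plays BERT fine-tuning; a real version fits on `annotated`
    return lambda r: 1 if "neural" in r else 0


records = ["deep neural nets", "machine learning survey",
           "neural translation", "soil chemistry"]
coarse = stage1_coarse_filter(records, toy_coarse)
final = stage2_refine(coarse, toy_llm, toy_train)
```

The design point the sketch preserves is that the expensive LLM is called only on a small sample, while the cheap trained refiner re-scores everything.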
format Article
id doaj-art-836bcc6172e2486fbc8ad30eae3489f7
institution Kabale University
issn 2688-5255
language English
publishDate 2025-06-01
publisher Tsinghua University Press
record_format Article
series Journal of Social Computing
spelling doaj-art-836bcc6172e2486fbc8ad30eae3489f7
2025-08-20T03:29:31Z
eng
Tsinghua University Press
Journal of Social Computing
2688-5255
2025-06-01
Volume 6, Issue 2, pp. 158-169
DOI: 10.23919/JSC.2025.0007
DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset
Xingzhou Liang (Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China)
Xiaochen Zhou (University of Hong Kong, Hong Kong 999077, China)
Hui Zou (School of Cultural Heritage and Information Management, Shanghai University, Shanghai 200030, China)
Yi Lu (Department of Informatics, King's College London, London, WC2R 2LS, UK)
Jingjing Qu (Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China)
Online Access: https://www.sciopen.com/article/10.23919/JSC.2025.0007
Keywords: ai-related document; text classification; long short-term memory (lstm); bidirectional encoder representations from transformers (bert); large language model (llm)
spellingShingle Xingzhou Liang
Xiaochen Zhou
Hui Zou
Yi Lu
Jingjing Qu
DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset
Journal of Social Computing
ai-related document
text classification
long short-term memory (lstm)
bidirectional encoder representations from transformers (bert)
large language model (llm)
title DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset
title_full DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset
title_fullStr DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset
title_full_unstemmed DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset
title_short DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset
title_sort deepdiveai identifying ai related documents in large scale literature dataset
topic ai-related document
text classification
long short-term memory (lstm)
bidirectional encoder representations from transformers (bert)
large language model (llm)
url https://www.sciopen.com/article/10.23919/JSC.2025.0007
work_keys_str_mv AT xingzhouliang deepdiveaiidentifyingairelateddocumentsinlargescaleliteraturedataset
AT xiaochenzhou deepdiveaiidentifyingairelateddocumentsinlargescaleliteraturedataset
AT huizou deepdiveaiidentifyingairelateddocumentsinlargescaleliteraturedataset
AT yilu deepdiveaiidentifyingairelateddocumentsinlargescaleliteraturedataset
AT jingjingqu deepdiveaiidentifyingairelateddocumentsinlargescaleliteraturedataset