AMSunda: A novel dataset for Sundanese information retrieval

Information Retrieval is crucial in many areas, including Search Engines, Information Systems, and Databases. As an indigenous language, the Sundanese corpus from West Java in Indonesia suffers from limited data availability, especially for Information Retrieval tasks. Previous efforts to build the...

Bibliographic Details
Main Authors: Aries Maesya, Yulyani Arifin, Amalia Zahra, Widodo Budiharto
Format: Article
Language: English
Published: Elsevier 2025-08-01
Series: Data in Brief
Subjects:
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340925005232
collection DOAJ
Description: Information Retrieval is crucial in many areas, including Search Engines, Information Systems, and Databases. Sundanese, an indigenous language of West Java, Indonesia, suffers from limited data availability, especially for Information Retrieval tasks. Previous efforts to build Sundanese datasets focused mainly on text classification and generation, leaving information retrieval tasks underexplored. To address this gap, we introduce the AMSunda dataset, the first resource designed explicitly for fine-tuning and evaluating embedding models in the Sundanese language. The AMSunda dataset consists of two dataset types: (1) triplet data, each item containing a query passage, a positive response, and a negative response, intended for fine-tuning embedding models, and (2) BEIR-compatible data structured for evaluating embedding models on retrieval tasks. The dataset comprises 1499 documents generated using the GPT-4o-mini LLM, resulting in 7492 triplet passages and 7491 BEIR-format queries. This dataset enables further development of Sundanese-focused models in Information Retrieval.
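The two data types named in the description can be pictured with a minimal Python sketch. It is an illustration only, not the authors' pipeline: the triplet field names (`query`, `positive`, `negative`), the placeholder strings, the `q0`/`d0` identifiers, and the helper `triplets_to_beir` are all hypothetical. The output shapes follow the usual in-memory BEIR convention (a corpus keyed by document id with `title` and `text`, queries keyed by query id, and qrels mapping each query id to its relevant document ids).

```python
# Hypothetical triplet record; the field names are assumptions,
# not the actual AMSunda schema.
triplets = [
    {
        "query": "example query",
        "positive": "passage that answers the query",
        "negative": "passage on an unrelated topic",
    },
]


def triplets_to_beir(triplets):
    """Build BEIR-style corpus, queries and qrels dicts from triplets."""
    corpus, queries, qrels = {}, {}, {}
    doc_ids = {}  # passage text -> document id, so repeated passages are stored once
    for i, triplet in enumerate(triplets):
        query_id = f"q{i}"
        queries[query_id] = triplet["query"]
        qrels[query_id] = {}
        for text, relevant in ((triplet["positive"], True), (triplet["negative"], False)):
            if text not in doc_ids:
                doc_ids[text] = f"d{len(doc_ids)}"
                corpus[doc_ids[text]] = {"title": "", "text": text}
            if relevant:
                # Only the positive passage is recorded as relevant.
                qrels[query_id][doc_ids[text]] = 1
    return corpus, queries, qrels


corpus, queries, qrels = triplets_to_beir(triplets)
```

In this sketch the negative passage still enters the corpus, so a retrieval model has a distractor to rank below the positive passage; whether AMSunda's BEIR-format data was built this way is not stated in the record.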
Record ID: doaj-art-716fe4a981874ff4abcfb4dbf894981b
ISSN: 2352-3409
Volume: 61
Article number: 111796
DOI: 10.1016/j.dib.2025.111796
Author affiliations:
Aries Maesya (corresponding author), Yulyani Arifin, Amalia Zahra: Computer Science Department, BINUS Graduate Program, Doctor of Computer Science Program, Bina Nusantara University, Jakarta 11480, Indonesia
Widodo Budiharto: Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta 11480, Indonesia
Keywords: Sundanese language; Sundanese dataset; Information retrieval; Text embedding; Natural language processing