AMSunda: A novel dataset for Sundanese information retrieval
Information retrieval is crucial in many areas, including search engines, information systems, and databases. Sundanese, an indigenous language of West Java, Indonesia, suffers from limited data availability, especially for information retrieval tasks. Previous efforts to build the...
| Main Authors: | Aries Maesya, Yulyani Arifin, Amalia Zahra, Widodo Budiharto |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-08-01 |
| Series: | Data in Brief |
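| Subjects: | Sundanese language; Sundanese dataset; Information retrieval; Text embedding; Natural language processing |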
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340925005232 |
| _version_ | 1849699344147546112 |
|---|---|
| author | Aries Maesya; Yulyani Arifin; Amalia Zahra; Widodo Budiharto |
| collection | DOAJ |
| description | Information retrieval is crucial in many areas, including search engines, information systems, and databases. Sundanese, an indigenous language of West Java, Indonesia, suffers from limited data availability, especially for information retrieval tasks. Previous efforts to build Sundanese datasets focused mainly on text classification and generation, leaving information retrieval underexplored. To address this gap, we introduce AMSunda, the first resource designed explicitly for fine-tuning and evaluating embedding models in the Sundanese language. AMSunda consists of two dataset types: (1) triplet data, in which each record contains a query passage, a positive response, and a negative response, intended for fine-tuning embedding models; and (2) BEIR-compatible data structured for evaluating embedding models on retrieval tasks. The dataset comprises 1499 documents generated using the GPT-4o-mini LLM, resulting in 7492 triplet passages and 7491 BEIR-format queries. It enables further development of Sundanese-focused information retrieval models. |
| format | Article |
| id | doaj-art-716fe4a981874ff4abcfb4dbf894981b |
| institution | DOAJ |
| issn | 2352-3409 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Data in Brief |
| spelling | doaj-art-716fe4a981874ff4abcfb4dbf894981b; 2025-08-20T03:18:38Z; eng; Elsevier; Data in Brief; ISSN 2352-3409; 2025-08-01; volume 61; article 111796; DOI 10.1016/j.dib.2025.111796; AMSunda: A novel dataset for Sundanese information retrieval; Aries Maesya (Computer Science Department, BINUS Graduate Program - Doctor of Computer Science Program, Bina Nusantara University, Jakarta 11480, Indonesia; corresponding author); Yulyani Arifin (Computer Science Department, BINUS Graduate Program - Doctor of Computer Science Program, Bina Nusantara University, Jakarta 11480, Indonesia); Amalia Zahra (Computer Science Department, BINUS Graduate Program - Doctor of Computer Science Program, Bina Nusantara University, Jakarta 11480, Indonesia); Widodo Budiharto (Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta 11480, Indonesia); http://www.sciencedirect.com/science/article/pii/S2352340925005232; Sundanese language; Sundanese dataset; Information retrieval; Text embedding; Natural language processing |
| title | AMSunda: A novel dataset for Sundanese information retrieval |
| topic | Sundanese language; Sundanese dataset; Information retrieval; Text embedding; Natural language processing |
| url | http://www.sciencedirect.com/science/article/pii/S2352340925005232 |
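
The description in this record names two data layouts: triplets (query, positive, negative) for fine-tuning and BEIR-compatible data for evaluation. The Python sketch below illustrates how those two shapes relate in general. The `Triplet` class, its field names, the `triplets_to_beir` helper, and the sample strings are all hypothetical and are not taken from the AMSunda files. The record does not say how the BEIR split was actually derived, so this should be read as one plausible arrangement rather than the authors' procedure.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One fine-tuning example (hypothetical field names)."""
    query: str
    positive: str  # passage relevant to the query
    negative: str  # passage not relevant to the query


def triplets_to_beir(triplets):
    """Build BEIR-style corpus, queries, and qrels dicts from triplets."""
    corpus, queries, qrels = {}, {}, {}
    doc_ids = {}  # passage text -> corpus id, so repeated passages share one id

    def doc_id_for(text):
        if text not in doc_ids:
            doc_ids[text] = f"d{len(doc_ids)}"
            corpus[doc_ids[text]] = {"title": "", "text": text}
        return doc_ids[text]

    for i, t in enumerate(triplets):
        qid = f"q{i}"
        queries[qid] = t.query
        # Only the positive passage is judged relevant (score 1).
        qrels[qid] = {doc_id_for(t.positive): 1}
        # The negative passage stays in the corpus as a distractor.
        doc_id_for(t.negative)
    return corpus, queries, qrels


# Placeholder strings, not real AMSunda records.
examples = [
    Triplet("placeholder query A", "passage answering A", "passage about something else"),
    Triplet("placeholder query B", "passage answering B", "passage answering A"),
]
corpus, queries, qrels = triplets_to_beir(examples)

# BEIR keeps the corpus and queries as JSON Lines keyed by an "_id" field.
corpus_jsonl = "\n".join(
    json.dumps({"_id": doc_id, **doc}, ensure_ascii=False)
    for doc_id, doc in corpus.items()
)
print(corpus_jsonl)
```

Keeping negatives in the corpus without a relevance judgment mirrors the usual BEIR convention that unlisted query-document pairs count as non-relevant.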