Mixtec–Spanish Parallel Text Dataset for Language Technology Development

This article introduces a freely available Spanish–Mixtec parallel corpus designed to foster natural language processing (NLP) development for an indigenous language that remains digitally low-resourced. The dataset, comprising 14,587 sentence pairs, covers Mixtec variants from Guerrero (Tlacoachist...

Bibliographic Details
Main Authors:	Hermilo Santiago-Benito, Diana-Margarita Córdova-Esparza, Juan Terven, Noé-Alejandro Castro-Sánchez, Teresa García-Ramirez, Julio-Alejandro Romero-González, José M. Álvarez-Alvarado
Format:	Article
Language:	English
Published:	MDPI AG 2025-06-01
Series:	Data
Subjects:	Mixtec language; parallel corpus; low resource language; OCR
Online Access:	https://www.mdpi.com/2306-5729/10/7/94

_version_	1849733078588588032
author	Hermilo Santiago-Benito; Diana-Margarita Córdova-Esparza; Juan Terven; Noé-Alejandro Castro-Sánchez; Teresa García-Ramirez; Julio-Alejandro Romero-González; José M. Álvarez-Alvarado
author_sort	Hermilo Santiago-Benito
collection	DOAJ
description	This article introduces a freely available Spanish–Mixtec parallel corpus designed to foster natural language processing (NLP) development for an indigenous language that remains digitally low-resourced. The dataset, comprising 14,587 sentence pairs, covers Mixtec variants from Guerrero (Tlacoachistlahuaca, Northern Guerrero, and Xochapa) and Oaxaca (Western Coast, Southern Lowland, Santa María Yosoyúa, Central, Lower Cañada, Western Central, San Antonio Huitepec, Upper Western, and Southwestern Central). Texts are classified into four main domains: education, law, health, and religion. To compile these data, we conducted a two-phase collection process: first, an online search of government portals, religious organizations, and Mixtec language blogs; second, an on-site retrieval of physical texts from the library of the Autonomous University of Querétaro. Physical materials were then digitized by scanning and optical character recognition (OCR), followed by manual correction to fix character misreadings and remove duplicates or irrelevant segments. We conducted a preliminary evaluation of the collected data to validate their usability in automatic translation systems. In the Spanish-to-Mixtec direction, a fine-tuned GPT-4o-mini model yielded a BLEU score of 0.22 and a TER score of 122.86, while two fine-tuned open-source models, mBART-50 and M2M-100, yielded BLEU scores of 4.2 and 2.63 and TER scores of 98.99 and 104.87, respectively. All code demonstrating data usage, along with the final corpus itself, is publicly accessible via GitHub and Figshare. We anticipate that this resource will enable further research into machine translation, speech recognition, and other NLP applications while contributing to the broader goal of preserving and revitalizing the Mixtec language.
format	Article
id	doaj-art-7de6149f19f34ada807e1b34ca62b4ec
institution	DOAJ
issn	2306-5729
language	English
publishDate	2025-06-01
publisher	MDPI AG
record_format	Article
series	Data
spelling	doaj-art-7de6149f19f34ada807e1b34ca62b4ec; 2025-08-20T03:08:09Z; eng; MDPI AG; Data; 2306-5729; 2025-06-01; vol. 10, no. 7, art. 94; 10.3390/data10070094; Mixtec–Spanish Parallel Text Dataset for Language Technology Development; Hermilo Santiago-Benito (Facultad de Informática, Universidad Autónoma de Querétaro, Av. de las Ciencias S/N, Campus Juriquilla, Querétaro 76230, Mexico); Diana-Margarita Córdova-Esparza (Facultad de Informática, Universidad Autónoma de Querétaro, Av. de las Ciencias S/N, Campus Juriquilla, Querétaro 76230, Mexico); Juan Terven (Centro de Investigación en Ciencia Aplicada y Tecnología Avanzada, Unidad Querétaro, Instituto Politécnico Nacional, Cerro Blanco No. 141, Col. Colinas del Cimatario, Querétaro 76090, Mexico); Noé-Alejandro Castro-Sánchez (Centro Nacional de Investigación y Desarrollo Tecnológico, Tecnológico Nacional de México, Interior Internado Palmira S/N, Palmira, Cuernavaca 62493, Mexico); Teresa García-Ramirez (Centro de Investigación en Ciencia Aplicada y Tecnología Avanzada, Unidad Querétaro, Instituto Politécnico Nacional, Cerro Blanco No. 141, Col. Colinas del Cimatario, Querétaro 76090, Mexico); Julio-Alejandro Romero-González (Centro de Investigación en Ciencia Aplicada y Tecnología Avanzada, Unidad Querétaro, Instituto Politécnico Nacional, Cerro Blanco No. 141, Col. Colinas del Cimatario, Querétaro 76090, Mexico); José M. Álvarez-Alvarado (Facultad de Ingeniería, Universidad Autónoma de Querétaro, Querétaro 76010, Mexico); https://www.mdpi.com/2306-5729/10/7/94; Mixtec language; parallel corpus; low resource language; OCR
title	Mixtec–Spanish Parallel Text Dataset for Language Technology Development
topic	Mixtec language; parallel corpus; low resource language; OCR
url	https://www.mdpi.com/2306-5729/10/7/94
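
The description in this record mentions removing duplicate segments and reports TER scores above 100. The following is a minimal Python sketch of both ideas, written for illustration only; it is not the authors' code. The function names clean_pairs and shift_free_ter are hypothetical, the example tokens are placeholders rather than corpus data, and the TER function is a simplification: real TER also allows block shifts, so this version (plain word-level edit distance divided by reference length) can only overestimate it.

```python
import unicodedata


def clean_pairs(pairs):
    """Normalize whitespace/Unicode and drop empty or duplicate sentence pairs.

    Hypothetical helper: `pairs` is assumed to be an iterable of
    (spanish, mixtec) string tuples.
    """
    seen = set()
    cleaned = []
    for es, mix in pairs:
        # NFC stores accented and tone-marked characters in one canonical form.
        es = " ".join(unicodedata.normalize("NFC", es).split())
        mix = " ".join(unicodedata.normalize("NFC", mix).split())
        if not es or not mix:
            continue  # skip pairs with a missing side
        key = (es.casefold(), mix.casefold())
        if key in seen:
            continue  # skip duplicates, ignoring case and spacing
        seen.add(key)
        cleaned.append((es, mix))
    return cleaned


def shift_free_ter(hypothesis, reference):
    """Word-level edit distance over reference length, as a percentage.

    Simplified stand-in for TER (no block shifts). The value exceeds 100
    whenever more edits are needed than the reference has words.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(ref) + 1))  # distances from an empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete hypothesis word
                            curr[j - 1] + 1,      # insert reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens only; five wrong words against a two-word reference.
    print(shift_free_ter("p q r s t", "x y"))  # 250.0
```

The last call shows why a score such as 122.86 is possible: the number of edits is divided by the reference length, so a long, mostly wrong hypothesis yields more than 100.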

Similar Items