Mixtec–Spanish Parallel Text Dataset for Language Technology Development

This article introduces a freely available Spanish–Mixtec parallel corpus designed to foster natural language processing (NLP) development for an indigenous language that remains digitally low-resourced. The dataset, comprising 14,587 sentence pairs, covers Mixtec variants from Guerrero (Tlacoachist...

Bibliographic Details
Main Authors:	Hermilo Santiago-Benito, Diana-Margarita Córdova-Esparza, Juan Terven, Noé-Alejandro Castro-Sánchez, Teresa García-Ramirez, Julio-Alejandro Romero-González, José M. Álvarez-Alvarado
Format:	Article
Language:	English
Published:	MDPI AG 2025-06-01
Series:	Data
Subjects:	Mixtec language; parallel corpus; low-resource language; OCR
Online Access:	https://www.mdpi.com/2306-5729/10/7/94

Similar Items

Automatic grammatical tagger for a Spanish–Mixtec parallel corpus
by: Hermilo Santiago-Benito, et al.
Published: (2025-02-01)

Mixtec Sound Change Database 2.0: Integrating Tone Change
by: Sandra Auderset
Published: (2025-06-01)

Mixtec social memory in Late Renaissance Rome: Ulisse Aldrovandi, Tommaso de’ Cavalieri, and “the skull of an Indian king”
by: Davide Domenici
Published: (2024-12-01)

La classification de la diversité de maïs des Mixtèques et des Chatines de la Sierra Sur, Oaxaca Mexique
by: Quetzalcóatl Orozco-Ramírez, et al.
Published: (2021-11-01)

Harmony search for hyperparameters optimization of a low resource language transformer model trained with a novel parallel corpus Ocelotl Nahuatl – Spanish
by: Máximo Enrique Pacheco Martínez, et al.
Published: (2024-12-01)

La cuisine et sa ritualisation en pays mixtèque (Oaxaca, Mexique)
by: Esther Katz
Published: (2013-09-01)

Sexual and reproductive health awareness and practices among adolescents and adults in a rural farming community in Baja California, Mexico: a quantitative and qualitative cross-sectional study
by: Cristina Espinosa da Silva, et al.
Published: (2024-12-01)

IT-SR-NER: SERBIAN-ITALIAN PARALLEL CORPUS FOR LEARNING SERBIAN AS A FOREIGN LANGUAGE
by: Olja Perišić, et al.
Published: (2025-06-01)

Design of Automatic Scan Library Feature in Senayan Library Management System (SLiMS) Application
by: Salsabila Guspayane, et al.
Published: (2024-12-01)

THE USE OF ENGLISH NEGATION BY SPANISH STUDENTS OF ENGLISH: A LEARNER CORPUS-BASED STUDY
by: Araceli García Fuentes
Published: (2008-07-01)

Enhanced OCR Recognition for Madurese Text Documents: A Genetic Algorithm Approach with Tesseract 5.5
by: Muhammad Nazir Arifin, et al.
Published: (2025-08-01)

Disfrazados of San Juan Mixtepec, Oaxaca, Mexico
by: Ivy Rieger
Published: (2023-02-01)

Translation of Modal Verbs in Media Texts: Corpus-Based Approach
by: Ya. A. Volkova, et al.
Published: (2023-05-01)

A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language
by: Davlatyor Mengliev, et al.
Published: (2025-02-01)

KSTRV1: A scene text recognition dataset for central Kurdish in (Arabic-Based) script
by: Sardar Omar Salih, et al.
Published: (2025-06-01)

An annotated morphological dataset for Uzbek word forms: Towards rule-based and machine learning approaches
by: Nilufar Abdurakhmonova, et al.
Published: (2025-08-01)

Evidential English adverbials and their French equivalents in a specialised parallel corpus
by: Francisco J. Alonso-Almeida, et al.
Published: (2025-03-01)

A Novel Framework for Saraiki Script Recognition Using Advanced Machine Learning Models (YOLOv8 and CNN)
by: Hafiz Muhammad Raza Ur Rehman, et al.
Published: (2025-01-01)

Advances in Amazigh Language Technologies: A Comprehensive Survey Across Processing Domains
by: Oussama Akallouch, et al.
Published: (2025-07-01)

Phonetic minimization of the text corpus in Belarusian for the speech synthesis system training
by: S. I. Lysy
Published: (2019-03-01)

Pièges méthodologiques des corpus parallèles et comment les éviter
by: Olga Nádvorníková
Published: (2017-02-01)

AraEyebility: Eye-Tracking Data for Arabic Text Readability
by: Ibtehal Baazeem, et al.
Published: (2025-05-01)

USAGES AND TRANSLATIONS OF THE ITALIAN TANTO IN SPOKEN LANGUAGE
by: Patrizia Giampieri
Published: (2025-07-01)

Slovak nouns of short duration chvíľa, okamih vs. German Weile, Augenblick according to the parallel corpus
by: Daria Vashchenko
Published: (2023-12-01)

THE ROLE OF CORPUS LINGUISTICS IN CONTEMPORARY LINGUISTICS RESEARCH AND TRANSLATION STUDIES
by: Pei Haitong
Published: (2025-02-01)

Reading comprehension in Spanish language and literature manuals of the second cycle of Elementary Education
by: Beatriz Sánchez Hita
Published: (2016-01-01)

Expressions parenthétiques dans un corpus parallèle français-grec : les adverbiaux de conviction personnelle
by: Fryni Kakoyianni-Doa
Published: (2014-12-01)

Learning languages from parallel corpora
by: Johannes Graën
Published: (2022-12-01)

How easy are audio descriptions? Exploring the viability of hybrid access services across English, Spanish and Catalan
by: Blanca Arias-Badia, et al.
Published: (2025-07-01)

Tracing the scope of fear in corpus: similarities and differences in cross-domain/genre texts
by: Ignacio Rodríguez Sánchez, et al.
Published: (2024-12-01)

Automated compilation of Urdu poetry handwritten image datasets for optical character recognition
by: Irtaza Ijaz, et al.
Published: (2025-06-01)

Commentary on four studies for JALT vocabulary SIG
by: Yukio Tono
Published: (2018-12-01)

PROCESSING IMAGES OF SALES RECEIPTS FOR ISOLATING AND RECOGNISING TEXT INFORMATION
by: A. S. Nazdryukhin, et al.
Published: (2020-01-01)

Characteristics of Malay translated hadith corpus
by: Siti Syakirah Sazali, et al.
Published: (2022-05-01)

AsymGroup: Asymmetric Grouping and Communication Optimization for 2D Tensor Parallelism in LLM Inference
by: Ki Tae Kim, et al.
Published: (2025-01-01)

Coarse-Grained Column Agglomeration Parallel Algorithm for LU Factorization Using Multi-Threaded MATLAB
by: Osama Sabir, et al.
Published: (2025-01-01)

Dataset of vocabulary in Uzbek primary education: Extraction and analysis in case of the school corpus
by: Khabibulla Madatov, et al.
Published: (2025-04-01)

Comparing the Long-Term Stability of Titanium Clip Partial Prostheses with Other Titanium Partial and Total Ossicular Reconstruction Prostheses
by: Jasmine Leahy, et al.
Published: (2025-04-01)

ANÁLISE QUÍMICO-MINERALÓGICA DE OCRES E A BUSCA POR CORRELAÇÕES ARQUEOLÓGICAS COM OS PIGMENTOS DE PINTURAS RUPESTRES DO SÍTIO PEDRA DO CANTAGALO I
by: Heralda Kelis Sousa da Silva, et al.
Published: (2019-06-01)