Documenting Geographically and Contextually Diverse Language Data Sources

Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLMs). This quantitative approach has resulted in concerns about the rights of data subjects represented in data collections. This concern is exacerbated by a lack of docume...

Bibliographic Details
Main Authors: Angelina McMillan-Major, Francesco De Toni, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Daniel van Strien, Zeerak Talat, Yacine Jernite
Format: Article
Language: English
Published: Linköping University Electronic Press 2025-01-01
Series: Northern European Journal of Language Technology
Online Access: https://nejlt.ep.liu.se/article/view/5217