BERTinchamps: Cost-Effective In-House Training of Language Models in French

Bibliographic Details
Main Authors: Amaury Fierens, Sébastien Jodogne
Format: Article
Language: English
Published: Accademia University Press, 2024-12-01
Series: IJCoL
Online Access: https://journals.openedition.org/ijcol/1487
Summary: Many in-house applications are envisioned for Language Models (LMs) across various fields. In the medical domain, LMs could automate tasks such as summarizing a patient's health condition and codifying electronic health records. They also hold potential in the legal field and in journalism. While training LMs directly inside an institution is desirable for leveraging local data and addressing data privacy concerns, this process demands a costly and complex computational infrastructure. This paper explores the recent Cramming approach as a cost-effective way to locally train medium-sized LMs, in one day and using a single graphics processing unit (GPU). We show that the Cramming approach, originally designed for English, can be transposed to French, that the resulting models can be fine-tuned to domain-specific tasks in the French language, and that including in-house data during pre-training increases the performance of the models on journalism data. This research opens the path to the creation of medium-sized LMs that are tailored to the specific needs of institutions handling sensitive textual data in a language other than English.
ISSN: 2499-4553