Paraly: An (annotated) dataset for exploring the concept of paralysis (fr. ‘paralysie’) in a digital corpus of French Literature
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier 2025-06-01 |
| Series: | Data in Brief |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340925003099 |
| Summary: | The dataset consists of three corpora (full texts and metadata) of French literature of the 18th, 19th and 20th centuries containing figurative and concrete linguistic references (annotations) to the concept of paralysis. The texts originate from the collection “Les classiques de la littérature” [The Classics of Literature] maintained on Gallica, the public digital library of the National Library of France (BnF). The dataset contains the original OCR-ed texts with their metadata, the annotations of text excerpts containing the character sequence “paraly”, the annotation guidelines, a model and an application for automatic annotation, and the code for data collection, extraction, processing, and training. After collecting the data from Gallica, we used the open-source software CorpusExplorer to automatically annotate part-of-speech and lemma information for each text and to create a separate text corpus for each century. CorpusExplorer’s “Key Word in Context” (KWIC) feature was used to search for the character sequence “paraly” within each text and corpus. Based on the results, new century-specific corpora were created containing only the texts in which this sequence occurs. We cleaned the metadata tables with OpenRefine to ensure that the entries are consistent. The text passages containing the character sequence “paraly” were annotated manually. A multilabel classifier was then trained on the annotated data using the flair library, and an application with a graphical user interface was deployed on Hugging Face. The dataset facilitates a better understanding of the associated research project on the French literary and cultural history of paralysis. The methodological approach can be adapted to generate new research datasets tailored to other studies. |
| ISSN: | 2352-3409 |
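
The search-and-filter step described in the summary (find the character sequence “paraly” in context, then keep only the texts in which it occurs) can be illustrated with a minimal Python sketch. This is not the authors' code and does not use CorpusExplorer; the function names `kwic` and `filter_corpus`, the context width, and the three sample sentences are all assumptions made up for illustration.

```python
import re


def kwic(text, needle="paraly", width=30):
    """Return (left context, match, right context) for each occurrence of needle."""
    hits = []
    for m in re.finditer(re.escape(needle), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


def filter_corpus(corpus, needle="paraly"):
    """Keep only the texts that contain at least one occurrence of needle."""
    return {name: text for name, text in corpus.items() if kwic(text, needle)}


# Hypothetical sample texts, invented for this sketch (not taken from the dataset).
corpus = {
    "text_a": "La peur paralysait ses membres.",
    "text_b": "Il marchait dans le jardin.",
    "text_c": "Une paralysie soudaine le frappa.",
}

subset = filter_corpus(corpus)
for name, text in subset.items():
    for left, match, right in kwic(text, width=15):
        print(f"{name}: {left}[{match}]{right}")
```

Matching on the bare character sequence rather than on a lemma catches inflected and derived forms alike (“paralysie”, “paralysait”, “paralytique”), which is presumably why the passages still had to be annotated manually as figurative or concrete afterwards.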