Expériences sur l’analyse morphosyntaxique des corpus oraux avec l’annotateur multi-niveaux DisMo

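Annotating spoken corpora poses unique challenges stemming from the particular characteristics of spontaneous speech and its transcription. Automatic annotation tools need to adapt to these challenges. At the same time, it is desirable to define a “least common denominator” of written and spoken language corpora, to allow for comparisons between these two modalities, and apply an enriched annotation scheme for phenomena specific to spoken language. In this article, we present the approach implemented in the DisMo automatic annotator, which is specifically designed for spoken corpora, and which generates a multi-level annotation, including: part-of-speech tagging, lemmatisation, multi-word unit detection, detection and annotation of disfluencies and discourse markers, and chunking. We present our work on the French corpus of the Phonologie du Français Contemporain (PFC) project; this work allowed us to improve the tool. We discuss the theoretical and practical considerations that informed the choice of levels of annotation, types of phenomena detected, and tag sets, and we present a performance evaluation of the automatic annotation.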


Bibliographic Details
Main Authors: George Christodoulides, Giulia Barreca
Format: Article
Language: English
Published: Cercle linguistique du Centre et de l'Ouest - CerLICO 2017-02-01
Series: Corela
Subjects: exploitation of oral corpora; multilevel annotation; automatic annotation
Online Access: https://journals.openedition.org/corela/4867
author George Christodoulides
Giulia Barreca
collection DOAJ
description Annotating spoken corpora poses unique challenges stemming from the particular characteristics of spontaneous speech and its transcription. Automatic annotation tools need to adapt to these challenges. At the same time, it is desirable to define a “least common denominator” of written and spoken language corpora, to allow for comparisons between these two modalities, and apply an enriched annotation scheme for phenomena specific to spoken language. In this article, we present the approach implemented in the DisMo automatic annotator, which is specifically designed for spoken corpora, and which generates a multi-level annotation, including: part-of-speech tagging, lemmatisation, multi-word unit detection, detection and annotation of disfluencies and discourse markers, and chunking. We present our work on the French corpus of the Phonologie du Français Contemporain (PFC) project; this work allowed us to improve the tool. We discuss the theoretical and practical considerations that informed the choice of levels of annotation, types of phenomena detected, and tag sets, and we present a performance evaluation of the automatic annotation.
format Article
id doaj-art-e11d504cd5bb4f6fa1afdb4715847038
institution OA Journals
issn 1638-573X
language English
publishDate 2017-02-01
publisher Cercle linguistique du Centre et de l'Ouest - CerLICO
record_format Article
series Corela
spelling doaj-art-e11d504cd5bb4f6fa1afdb4715847038 | 2025-08-20T02:34:10Z | eng | Cercle linguistique du Centre et de l'Ouest - CerLICO | Corela | 1638-573X | 2017-02-01 | 21 | 10.4000/corela.4867 | Expériences sur l’analyse morphosyntaxique des corpus oraux avec l’annotateur multi-niveaux DisMo | George Christodoulides | Giulia Barreca | https://journals.openedition.org/corela/4867 | exploitation of oral corpora | multilevel annotation | automatic annotation
title Expériences sur l’analyse morphosyntaxique des corpus oraux avec l’annotateur multi-niveaux DisMo
topic exploitation of oral corpora
multilevel annotation
automatic annotation
url https://journals.openedition.org/corela/4867