Encoding polylexical units with TEI Lex-0: A case study

The modelling and encoding of polylexical units, i.e. recurrent sequences of lexemes that are perceived as independent lexical units, is a topic that has not been covered adequately and in sufficient depth by the Guidelines of the Text Encoding Initiative (TEI), a de facto standard for the digital...

Bibliographic Details
Main Authors: Toma Tasovac, Ana Salgado, Rute Costa
Format: Article
Language: English
Published: University of Ljubljana Press (Založba Univerze v Ljubljani) 2020-08-01
Series: Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
Subjects: TEI; lexicography; language resources; polylexical units; interoperability
Online Access: https://journals.uni-lj.si/slovenscina2/article/view/9157
_version_ 1850112996993400832
author Toma Tasovac
Ana Salgado
Rute Costa
author_facet Toma Tasovac
Ana Salgado
Rute Costa
author_sort Toma Tasovac
collection DOAJ
description The modelling and encoding of polylexical units, i.e. recurrent sequences of lexemes that are perceived as independent lexical units, is a topic that has not been covered in sufficient depth by the Guidelines of the Text Encoding Initiative (TEI), a de facto standard for the digital representation of textual resources in the scholarly research community. In this paper, we use the Dictionary of the Portuguese Academy of Sciences as a case study for presenting our ongoing work on encoding polylexical units using TEI Lex-0, an initiative aimed at simplifying and streamlining the encoding of lexical data with TEI in order to improve interoperability. We introduce the notion of macro- and microstructural relevance to differentiate between polylexicals that serve as headwords for their own independent dictionary entries and those that appear inside entries for different headwords. We develop the notion of lexicographic transparency to distinguish between units that are not accompanied by an explicit definition and those that are: the former are encoded as <form>-like constructs, whereas the latter become <entry>-like constructs, which can have further constraints imposed on them (sense numbers, domain labels, grammatical labels etc.). We codify the use of attributes on <gram> to encode different kinds of labels for polylexicals (implicit, explicit and normalised), concluding that the interoperability of lexical resources would be significantly improved if dictionary encoders had access to an expressive but relatively simple typology of polylexical units.
format Article
id doaj-art-277f3d9cbfaa40169aa1f926ad4da081
institution OA Journals
issn 2335-2736
language English
publishDate 2020-08-01
publisher University of Ljubljana Press (Založba Univerze v Ljubljani)
record_format Article
series Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
spelling doaj-art-277f3d9cbfaa40169aa1f926ad4da081
2025-08-20T02:37:16Z
eng
University of Ljubljana Press (Založba Univerze v Ljubljani)
Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
2335-2736
2020-08-01
Volume 8, issue 2
DOI 10.4312/slo2.0.2020.2.28-57
Encoding polylexical units with TEI Lex-0: A case study
Toma Tasovac (Belgrade Center for Digital Humanities, Serbia)
Ana Salgado (New University of Lisbon, CLUNL, Portugal)
Rute Costa (New University of Lisbon, CLUNL, Portugal)
https://journals.uni-lj.si/slovenscina2/article/view/9157
TEI
lexicography
language resources
polylexical units
interoperability
spellingShingle Toma Tasovac
Ana Salgado
Rute Costa
Encoding polylexical units with TEI Lex-0: A case study
Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
TEI
lexicography
language resources
polylexical units
interoperability
title Encoding polylexical units with TEI Lex-0: A case study
title_full Encoding polylexical units with TEI Lex-0: A case study
title_fullStr Encoding polylexical units with TEI Lex-0: A case study
title_full_unstemmed Encoding polylexical units with TEI Lex-0: A case study
title_short Encoding polylexical units with TEI Lex-0: A case study
title_sort encoding polylexical units with tei lex o a case study
topic TEI
lexicography
language resources
polylexical units
interoperability
url https://journals.uni-lj.si/slovenscina2/article/view/9157
work_keys_str_mv AT tomatasovac encodingpolylexicalunitswithteilexoacasestudy
AT anasalgado encodingpolylexicalunitswithteilexoacasestudy
AT rutecosta encodingpolylexicalunitswithteilexoacasestudy