Corpus-based measures discriminate inflection and derivation cross-linguistically

Bibliographic Details
Main Authors: Coleman Haley, Edoardo M. Ponti, Sharon Goldwater
Format: Article
Language: English
Published: Institute of Computer Science, Polish Academy of Sciences 2024-12-01
Series: Journal of Language Modelling
Online Access: https://jlm.ipipan.waw.pl/index.php/JLM/article/view/351
Description
Summary: In morphology, a distinction is commonly drawn between inflection and derivation. However, a precise definition of this distinction which reflects the way it manifests across languages remains elusive within linguistic theory, typically being based on subjective tests. In this study, we present 4 quantitative measures which use the statistics of a raw text corpus in a language to estimate to what extent a given morphological construction changes the form and distribution of lexemes. In particular, we measure both the average and the variance of this change across lexemes. Crucially, distributional information captures syntactic and semantic properties and can be operationalised by word embeddings. Based on a sample of 26 languages, we find that we can reconstruct 89±1% of the classification of constructions into inflection and derivation in UniMorph using our 4 measures, providing large-scale cross-linguistic evidence that the concepts of inflection and derivation are associated with measurable signatures in terms of form and distribution that behave consistently across a variety of languages. We also use our measures to identify in a quantitative way whether categories of inflection which have been considered noncanonical in the linguistic literature, such as inherent inflection or transpositions, appear so in terms of properties of their form and distribution. We find that while combining multiple measures reduces the amount of overlap between inflectional and derivational constructions, there are still many constructions near the model’s decision boundary between the two categories. This indicates a gradient, rather than categorical, distinction.
ISSN: 2299-856X, 2299-8470
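
Illustration: the summary describes four measures, namely the average and the variance, across lexemes, of how much a construction changes form and distribution. The Python sketch below shows one way such measures could be computed; it is not the authors' implementation. It assumes that form change is approximated by Levenshtein edit distance and distributional change by cosine distance between word embeddings; the summary does not specify either choice. The function names, the example word pairs, and the two-dimensional vectors are hypothetical toy values, not data from the article.

```python
from math import sqrt
from statistics import mean, pvariance


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; an assumed stand-in for 'change in form'."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cosine_distance(u, v) -> float:
    """1 - cosine similarity; an assumed stand-in for 'change in distribution'."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def construction_measures(pairs, embeddings):
    """Mean and variance of form and distributional change for one construction.

    pairs: list of (base_form, constructed_form) tuples, one per lexeme.
    embeddings: dict mapping each word form to a vector (list of floats).
    """
    form = [edit_distance(base, derived) for base, derived in pairs]
    dist = [cosine_distance(embeddings[base], embeddings[derived])
            for base, derived in pairs]
    return {
        "form_mean": mean(form),
        "form_var": pvariance(form),
        "dist_mean": mean(dist),
        "dist_var": pvariance(dist),
    }


# Hypothetical toy data: an English past-tense construction.
toy_pairs = [("walk", "walked"), ("jump", "jumped"), ("bake", "baked")]
toy_embeddings = {  # made-up 2-d vectors, for illustration only
    "walk": [1.0, 0.1], "walked": [0.9, 0.3],
    "jump": [0.2, 1.0], "jumped": [0.3, 0.9],
    "bake": [0.7, 0.7], "baked": [0.6, 0.8],
}
toy_measures = construction_measures(toy_pairs, toy_embeddings)
```

Under these assumptions, each construction is reduced to a four-number profile. Such profiles could then be passed to any standard classifier to separate inflection from derivation; the summary does not say which model the authors used.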