Multi-word units (and tokenization more generally): a multi-dimensional and largely information-theoretic approach

It has been argued that most of corpus linguistics involves one of four fundamental methods: frequency lists, dispersion, collocation, and concordancing. All these presuppose (if only implicitly) the definition of a unit: the element whose frequency in a corpus, in corpus parts, or around a search w...

Bibliographic Details
Main Author:	Stefan Th. Gries
Format:	Article
Language:	English
Published:	Université Jean Moulin - Lyon 3 2022-03-01
Series:	Lexis: Journal in English Lexicology
Subjects:	corpus linguistics; multi-word units; n-grams; frequency; dispersion; association
Online Access:	https://journals.openedition.org/lexis/6231

Similar Items

On using corpus frequency, dispersion, and chronological data to help identify useful collocations
by: James Rogers, et al.
Published: (2015-12-01)

CORPUS-BASED ANALYSIS OF TRANSITION WORDS
by: Satyawati Surya, M.Pd.
Published: (2023-11-01)

Blending creativity and productivity: on the issue of delimiting the boundaries of blends as a type of word formation
by: Natalia Beliaeva
Published: (2019-12-01)

An 81-million-word multi-genre corpus of Arabic books (Swedish National Data Service)
by: Andreas Hallberg
Published: (2025-06-01)

A Context-Preserving Tokenization Mismatch Resolution Method for Korean Word Sense Disambiguation Based on the Sejong Corpus and BERT
by: Hanjo Jeong
Published: (2025-03-01)

Advancing Arabic Word Embeddings: A Multi-Corpora Approach with Optimized Hyperparameters and Custom Evaluation
by: Azzah Allahim, et al.
Published: (2024-11-01)

Developing a discipline-specific corpus and high-frequency word list for science and engineering students in graduate school
by: Suwako Uehara, et al.
Published: (2022-12-01)

Big data mining and comparative analyses across lexica on the relationship between syllable complexity and word stress
by: Amanda Post da Silveira
Published: (2023-12-01)

The distribution of constituent words in nominal compounds and its impact on semantic interpretation: an empirical study
by: Annelen Brunner, et al.
Published: (2021-01-01)

Koncepcja Rozproszonego Korpusu Dyskursu Akademickiego
by: KAMIL WABNIC
Published: (2025-08-01)

The Use of Corpora in Word Formation Research
by: Pius ten Hacken, et al.
Published: (2014-01-01)

Key Words as Markers of the Communicative Behavior of the Discursive Personality of the Nominee to the USA Presidency (Based on the Genre of Pre-Election Debates)
by: L. A. Kochetova, et al.
Published: (2022-04-01)

Meaning Extensions of Grasp: A Corpus-Based Study
by: Marie Nordlund
Published: (2010-04-01)

Examining the word family through word lists
by: Dale Brown
Published: (2018-12-01)

Evaluating corpora with word lists and word difficulty
by: Brent A. Culligan
Published: (2019-12-01)

Zu kreativen Ausdrucksformen im deutschsprachigen öffentlichen Raum
by: Anna Dargiewicz
Published: (2024-01-01)

THE ROLE OF CORPUS LINGUISTICS IN CONTEMPORARY LINGUISTICS RESEARCH AND TRANSLATION STUDIES
by: Pei Haitong
Published: (2025-02-01)

The use and the frequency of Arabic loan words in the Turkish language
by: Vulović Aleksandra M.
Published: (2025-01-01)

Word Frequency List of Turkish Books on Wattpad
by: Feyza Tokat
Published: (2023-12-01)

A methodology for identification of the formulaic language most representative of high-frequency collocations
by: Chris Brizzard, et al.
Published: (2014-04-01)

UnifiedCut: A Simple and Efficient Neural Model for Thai, Burmese and Khmer Word Segmentation
by: Yonghua Wen, et al.
Published: (2024-12-01)

Linguistic and sociological perspective on the perception of profanity in Moscow in 2024
by: Ekaterina R. Dobrushina
Published: (2024-01-01)

When more is less: the impact of multimorphemic words on learning word meaning
by: Niveen Omar, et al.
Published: (2024-12-01)

The influence of nominal prefixes on the formation of compound words in Xitsonga
by: Respect Mlambo, et al.
Published: (2025-03-01)

Paving the road to hell: The Spanish word menas as a case study
by: David Bordonaba Plou, et al.
Published: (2021-09-01)

LISTENING IN L2 ROMANIAN: WHY FUNCTION WORDS GO UNNOTICED
by: Ioana-Silvia SONEA
Published: (2025-06-01)

Assessing the Translation of English Janus Words into Arabic
by: Essam Taher Muhammed
Published: (2024-06-01)

Where do new words like boobage, flamage, ownage come from? Tracking the history of ‑age words from 1100 to 2000 in the OED3
by: Chris A. Smith
Published: (2018-12-01)

Linguo-Conceptual Studies of Literary Text: Evolution of Theoretical and Methodological Approaches
by: I. V. Kononova, et al.
Published: (2023-06-01)

A Method of Word Sense Disambiguation with Restricted Boltzmann Machine
by: ZHANG Chun-xiang, et al.
Published: (2019-10-01)

Etymology and the Technique of Word Formation
by: Syeda Fasiha Abid, et al.
Published: (2025-06-01)

Meaning and Use of the Words DOBROVOLETS and VOLUNTEER in the Russian Language (according to the Russian National Corpus)
by: J. N. Ilyina
Published: (2019-10-01)

INFORMATION TECHNOLOGIES IN OPTIMIZING SCIENTIFIC RESEARCH IN THE SPHERE OF THEORETICAL AND APPLIED LINGUISTICS IN THE DIGITAL AGE
by: M. V. Kamensky
Published: (2022-02-01)

Teaching practices and perspectives regarding word counting units
by: Louis Lafleur
Published: (2023-12-01)

Research of Axiological Dominants in Press Release Genre based on Automatic Extraction of Key Words from Corpus
by: L. A. Kochetova, et al.
Published: (2019-06-01)

Word-Formation Representation of Concept FRIENDSHIP in Modern Russian
by: A. Aru
Published: (2021-08-01)

On creating a large-scale corpus-based academic multi-word unit resource
by: James Rogers
Published: (2020-12-01)

Marking the beginning and end of the Tatar word: The system of vowels
by: A.M. Galieva, et al.
Published: (2022-10-01)

Evaluating the efficacy of yes-no checklist tests to assess knowledge of multi-word lexical units
by: Raymond Stubbe, et al.
Published: (2018-12-01)

WORD ORDER AND SENTENCE STRESS IN ENGLISH UTTERANCES CONSISTING OF THE SAME SET OF WORDS
by: Amina E. Safarzade
Published: (2020-12-01)