Lexical variation in English language podcasts, editorial media, and social media

The study presented in this paper demonstrates how transcribed podcast material differs with respect to lexical content from other collections of English language data: editorial text, social media, both long form and microblogs, dialogue from movie scripts, and transcribed phone conversation...

Bibliographic Details
Main Author: Jussi Karlgren
Format: Article
Language: English
Published: Linköping University Electronic Press 2022-08-01
Series: Northern European Journal of Language Technology
Online Access: https://nejlt.ep.liu.se/article/view/3566
author Jussi Karlgren
collection DOAJ
description The study presented in this paper demonstrates how transcribed podcast material differs with respect to lexical content from other collections of English language data: editorial text, social media, both long form and microblogs, dialogue from movie scripts, and transcribed phone conversations. Most of the recorded differences are as might be expected, reflecting known or assumed differences between spoken and written language, between dialogue and soliloquy, and between scripted formal and unscripted informal language use. Most notably, podcast material, compared to the hitherto typical training sets from editorial media, is characterised by being in the present tense and by a much higher incidence of pronouns, interjections, and negations. These characteristics are, unsurprisingly, largely shared with social media texts. Where podcast material differs from social media material is in its attitudinal content, with many more amplifiers and much less negative attitude than in blog texts. This variation, besides being of philological interest, has ramifications for computational work. Information access for material which is not primarily topical should be designed to be sensitive to such variation that defines the data set itself and discriminates items within it. In general, the training set for a language model is a non-trivial parameter which is likely to show effects both expected and unexpected when the model is applied to data from other sources. The characteristics and provenance of data used to train a model should be listed on the label as a minimal form of downstream consumer protection.
format Article
id doaj-art-173dcde29fe243aba15b66ec2c9abbc1
institution Kabale University
issn 2000-1533
language English
publishDate 2022-08-01
publisher Linköping University Electronic Press
record_format Article
series Northern European Journal of Language Technology
spelling doaj-art-173dcde29fe243aba15b66ec2c9abbc1; 2025-01-22T15:25:18Z; eng; Linköping University Electronic Press; Northern European Journal of Language Technology; 2000-1533; 2022-08-01; 8; 1; 10.3384/nejlt.2000-1533.2022.3566; Lexical variation in English language podcasts, editorial media, and social media; Jussi Karlgren (Spotify); https://nejlt.ep.liu.se/article/view/3566
title Lexical variation in English language podcasts, editorial media, and social media
url https://nejlt.ep.liu.se/article/view/3566