Spelling correction with large language models to reduce measurement error in open-ended survey responses

Open-ended survey questions have a long history in public opinion research and are seeing a renewed interest as computing power and tools of text analysis proliferate. A major challenge in performing text analyses on open-ended responses is that the documents—especially if transcribed or collected through web surveys—may contain measurement error in the form of misspellings which are not easily corrected in a reliable and systematic manner. This paper provides evidence that large language models (LLMs), specifically OpenAI’s GPT-4o, offer a flexible, dependable, and low-cost solution to correcting misspellings in open-ended responses. We demonstrate the efficacy of this approach with open-ended responses about the Democratic and Republican parties from the 1996–2020 American National Election Studies, where GPT is shown to correct 85%–90% of misspellings identified by human coders in a sample of responses. Following spelling correction on ∼50,000 responses, we document several consequential changes to the data. First, we show that spelling correction reduces the number of unique and single-use tokens while increasing the number of words matched to a sentiment dictionary. Then, to highlight the potential benefits and limitations of spelling correction we show improved out-of-sample prediction accuracy from a text-based machine learning classifier. Finally, we show a significantly larger degree of emotionality is captured in the spelling-corrected texts, though the size of this measure’s relationship with a known correlate in political interest remains relatively unchanged. Our findings point to LLMs as an effective tool for reducing measurement error by correcting misspellings in open-ended survey responses.
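
The correction step described in the abstract can be sketched as a per-response call to OpenAI's chat completions endpoint. This is an illustrative guess, not the authors' code: the prompt wording, temperature, and the `build_messages`/`correct_spelling` helper names are assumptions; only the model name (gpt-4o) comes from the article.

```python
# Hypothetical sketch of LLM-based spelling correction for open-ended
# survey responses. Prompt wording and parameters are assumptions, not
# the authors' actual setup.
import json
import os
import urllib.request


def build_messages(response_text: str) -> list:
    """Build a chat prompt asking the model to fix misspellings only,
    leaving word choice, grammar, and meaning untouched."""
    system = (
        "You correct spelling in open-ended survey responses. "
        "Fix misspellings only; do not change word choice, grammar, "
        "or meaning. Return only the corrected text."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": response_text},
    ]


def correct_spelling(response_text: str, model: str = "gpt-4o") -> str:
    """Send one response to the OpenAI chat completions endpoint.
    Requires an OPENAI_API_KEY environment variable; not executed here."""
    payload = {
        "model": model,
        "temperature": 0,  # deterministic output for reproducibility
        "messages": build_messages(response_text),
    }
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

In practice each of the ∼50,000 responses would be passed through `correct_spelling` once, with the original text kept alongside the corrected version so downstream token counts and dictionary matches can be compared, as the paper does.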

Bibliographic Details
Main Authors: Maxwell B. Allamong, Jongwoo Jeong, Paul M. Kellstedt
Format: Article
Language: English
Published: SAGE Publishing, 2025-01-01
Series: Research & Politics
Online Access: https://doi.org/10.1177/20531680241311510
author Maxwell B. Allamong
Jongwoo Jeong
Paul M. Kellstedt
collection DOAJ
description Open-ended survey questions have a long history in public opinion research and are seeing a renewed interest as computing power and tools of text analysis proliferate. A major challenge in performing text analyses on open-ended responses is that the documents—especially if transcribed or collected through web surveys—may contain measurement error in the form of misspellings which are not easily corrected in a reliable and systematic manner. This paper provides evidence that large language models (LLMs), specifically OpenAI’s GPT-4o, offer a flexible, dependable, and low-cost solution to correcting misspellings in open-ended responses. We demonstrate the efficacy of this approach with open-ended responses about the Democratic and Republican parties from the 1996–2020 American National Election Studies, where GPT is shown to correct 85%–90% of misspellings identified by human coders in a sample of responses. Following spelling correction on ∼50,000 responses, we document several consequential changes to the data. First, we show that spelling correction reduces the number of unique and single-use tokens while increasing the number of words matched to a sentiment dictionary. Then, to highlight the potential benefits and limitations of spelling correction we show improved out-of-sample prediction accuracy from a text-based machine learning classifier. Finally, we show a significantly larger degree of emotionality is captured in the spelling-corrected texts, though the size of this measure’s relationship with a known correlate in political interest remains relatively unchanged. Our findings point to LLMs as an effective tool for reducing measurement error by correcting misspellings in open-ended survey responses.
format Article
id doaj-art-232b7ecf5fa545859c66f3ef012df420
institution DOAJ
issn 2053-1680
language English
publishDate 2025-01-01
publisher SAGE Publishing
record_format Article
series Research & Politics
title Spelling correction with large language models to reduce measurement error in open-ended survey responses
url https://doi.org/10.1177/20531680241311510