Spelling correction with large language models to reduce measurement error in open-ended survey responses
Open-ended survey questions have a long history in public opinion research and are seeing renewed interest as computing power and tools of text analysis proliferate. A major challenge in performing text analyses on open-ended responses is that the documents—especially if transcribed or collected through web surveys—may contain measurement error in the form of misspellings, which are not easily corrected in a reliable and systematic manner. This paper provides evidence that large language models (LLMs), specifically OpenAI’s GPT-4o, offer a flexible, dependable, and low-cost solution to correcting misspellings in open-ended responses. We demonstrate the efficacy of this approach with open-ended responses about the Democratic and Republican parties from the 1996–2020 American National Election Studies, where GPT is shown to correct 85%–90% of misspellings identified by human coders in a sample of responses. Following spelling correction on ∼50,000 responses, we document several consequential changes to the data. First, we show that spelling correction reduces the number of unique and single-use tokens while increasing the number of words matched to a sentiment dictionary. Then, to highlight the potential benefits and limitations of spelling correction, we show improved out-of-sample prediction accuracy from a text-based machine learning classifier. Finally, we show that a significantly larger degree of emotionality is captured in the spelling-corrected texts, though the size of this measure’s relationship with a known correlate in political interest remains relatively unchanged. Our findings point to LLMs as an effective tool for reducing measurement error by correcting misspellings in open-ended survey responses.
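The token-level diagnostics the abstract describes—counts of unique and single-use (hapax) tokens before and after spelling correction—can be sketched in plain Python. The sample responses and the toy `fixes` mapping below are illustrative assumptions, not data from the paper; in the study itself the correction step is performed by GPT-4o:

```python
from collections import Counter

def token_stats(responses):
    """Count total, unique, and single-use (hapax) tokens across responses."""
    tokens = [tok for text in responses for tok in text.lower().split()]
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {"total": len(tokens), "unique": len(counts), "single_use": hapaxes}

# Illustrative misspelled responses (not from the ANES data).
raw = [
    "the demacratic party helps the econemy",
    "the democratic party helps the economy",
]

# In the paper the correction is done by GPT-4o; a toy mapping stands in here.
fixes = {"demacratic": "democratic", "econemy": "economy"}
corrected = [" ".join(fixes.get(t, t) for t in text.split()) for text in raw]

print(token_stats(raw))        # misspelled variants inflate unique/single-use counts
print(token_stats(corrected))  # variants collapse onto shared tokens after correction
```

After correction, the misspelled variants merge with their correctly spelled counterparts, so the unique-token and hapax counts fall while the total token count stays the same—exactly the pattern the paper reports.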
| Main Authors: | Maxwell B. Allamong, Jongwoo Jeong, Paul M. Kellstedt |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SAGE Publishing, 2025-01-01 |
| Series: | Research & Politics |
| ISSN: | 2053-1680 |
| Online Access: | https://doi.org/10.1177/20531680241311510 |