Automatic Seed Word Selection for Topic Modeling
Topic modeling is widely used to uncover latent semantic topics from a corpus. However, topic models often struggle to identify minor topics due to their tendency to prioritize dominant patterns in the data. They are also hindered by polysemous words and general terms, which frequently appear in mul...
| Main Authors: | Dahyun Jeong, Jeongin Hwang, Yunjin Choi, Yoon-Yeong Kim |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Seed-guided topic modeling; seed words; automatic seed word selection |
| Online Access: | https://ieeexplore.ieee.org/document/10879013/ |
| _version_ | 1850189992180056064 |
|---|---|
| author | Dahyun Jeong; Jeongin Hwang; Yunjin Choi; Yoon-Yeong Kim |
| author_sort | Dahyun Jeong |
| collection | DOAJ |
| description | Topic modeling is widely used to uncover latent semantic topics from a corpus. However, topic models often struggle to identify minor topics due to their tendency to prioritize dominant patterns in the data. They are also hindered by polysemous words and general terms, which frequently appear in multiple contexts, making topic assignment difficult. Seed-guided topic modeling addresses these issues by incorporating prior knowledge through “seed words”. Existing approaches, however, primarily rely on supervised selection using label-dependent metrics or manual selection. Both are limited by scalability and susceptible to human bias, particularly when dealing with unstructured real-world data. As a result, the selection of seed words in unsupervised settings remains underexplored. To address these challenges, we propose an automated seed word selection process that identifies diverse and cohesive word sets based on inter-word relationships. We instantiate this process with SeedCapture, an algorithm that utilizes co-occurrence to capture meaningful word associations. Unlike prior methods, SeedCapture operates in a fully unsupervised manner, requiring no predefined labels or human intervention. SeedCapture requires minimal parameter tuning and is highly adaptable, enabling seamless integration into existing seed-guided topic models. Through extensive quantitative and qualitative evaluations across multiple datasets and topic models, we demonstrate that SeedCapture achieves results comparable to those obtained through supervised seed word selection. |
| format | Article |
| id | doaj-art-790437a990324eb285be562be77e7d5c |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-790437a990324eb285be562be77e7d5c; 2025-08-20T02:15:28Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 31269-31285; DOI 10.1109/ACCESS.2025.3540410; IEEE document 10879013; Automatic Seed Word Selection for Topic Modeling; Dahyun Jeong (Department of Statistics, University of Seoul, Seoul, South Korea); Jeongin Hwang (https://orcid.org/0009-0004-4545-568X; Department of Statistical Data Science, University of Seoul, Seoul, South Korea); Yunjin Choi (https://orcid.org/0000-0002-0194-5802; Department of Statistics, University of Seoul, Seoul, South Korea); Yoon-Yeong Kim (https://orcid.org/0000-0003-3696-6571; Department of Statistics, University of Seoul, Seoul, South Korea); https://ieeexplore.ieee.org/document/10879013/; Seed-guided topic modeling; seed words; automatic seed word selection |
| title | Automatic Seed Word Selection for Topic Modeling |
| topic | Seed-guided topic modeling; seed words; automatic seed word selection |
| url | https://ieeexplore.ieee.org/document/10879013/ |