Automatic Seed Word Selection for Topic Modeling

Topic modeling is widely used to uncover latent semantic topics from a corpus. However, topic models often struggle to identify minor topics due to their tendency to prioritize dominant patterns in the data. They are also hindered by polysemous words and general terms, which frequently appear in mul...

Bibliographic Details
Main Authors: Dahyun Jeong, Jeongin Hwang, Yunjin Choi, Yoon-Yeong Kim
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Seed-guided topic modeling; seed words; automatic seed word selection
Online Access: https://ieeexplore.ieee.org/document/10879013/
author Dahyun Jeong
Jeongin Hwang
Yunjin Choi
Yoon-Yeong Kim
collection DOAJ
description Topic modeling is widely used to uncover latent semantic topics from a corpus. However, topic models often struggle to identify minor topics due to their tendency to prioritize dominant patterns in the data. They are also hindered by polysemous words and general terms, which frequently appear in multiple contexts, making topic assignment difficult. Seed-guided topic modeling addresses these issues by incorporating prior knowledge through “seed words”. Existing approaches, however, primarily rely on supervised selection using label-dependent metrics or manual selection. Both are limited by scalability and susceptible to human bias, particularly when dealing with unstructured real-world data. As a result, the selection of seed words in unsupervised settings remains underexplored. To address these challenges, we propose an automated seed word selection process that identifies diverse and cohesive word sets based on inter-word relationships. We instantiate this process with SeedCapture, an algorithm that utilizes co-occurrence to capture meaningful word associations. Unlike prior methods, SeedCapture operates in a fully unsupervised manner, requiring no predefined labels or human intervention. SeedCapture requires minimal parameter tuning and is highly adaptable, enabling seamless integration into existing seed-guided topic models. Through extensive quantitative and qualitative evaluations across multiple datasets and topic models, we demonstrate that SeedCapture achieves results comparable to those obtained through supervised seed word selection.
format Article
id doaj-art-790437a990324eb285be562be77e7d5c
institution OA Journals
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-790437a990324eb285be562be77e7d5c
2025-08-20T02:15:28Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
13
31269
31285
10.1109/ACCESS.2025.3540410
10879013
Automatic Seed Word Selection for Topic Modeling
Dahyun Jeong (0)
Jeongin Hwang (1) https://orcid.org/0009-0004-4545-568X
Yunjin Choi (2) https://orcid.org/0000-0002-0194-5802
Yoon-Yeong Kim (3) https://orcid.org/0000-0003-3696-6571
Department of Statistics, University of Seoul, Seoul, South Korea
Department of Statistical Data Science, University of Seoul, Seoul, South Korea
Department of Statistics, University of Seoul, Seoul, South Korea
Department of Statistics, University of Seoul, Seoul, South Korea
https://ieeexplore.ieee.org/document/10879013/
Seed-guided topic modeling
seed words
automatic seed word selection
title Automatic Seed Word Selection for Topic Modeling
topic Seed-guided topic modeling
seed words
automatic seed word selection
url https://ieeexplore.ieee.org/document/10879013/
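
Illustrative sketch (not part of the catalog record, and not the SeedCapture algorithm itself): the description above says seed words are chosen as diverse and cohesive word sets using co-occurrence, without labels. The record gives no further detail, so the Python below only shows one plausible way such a selection could work. It assumes document-level co-occurrence scored with normalized pointwise mutual information (NPMI) and a simple greedy loop. The function names npmi_table, pair_score and select_seed_sets, the parameters num_topics, seeds_per_topic and min_df, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_table(docs, min_df=2):
    """Return document frequencies, the kept vocabulary, and NPMI per word pair."""
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]
    df = Counter(w for s in doc_sets for w in s)
    vocab = {w for w, c in df.items() if c >= min_df}  # drop rare words
    pair_df = Counter()
    for s in doc_sets:
        kept = sorted(w for w in s if w in vocab)
        pair_df.update(combinations(kept, 2))  # keys are (a, b) with a < b
    scores = {}
    for (a, b), c in pair_df.items():
        p_ab = c / n_docs
        if p_ab == 1.0:
            scores[(a, b)] = 1.0  # both words occur in every document
            continue
        pmi = math.log(p_ab / ((df[a] / n_docs) * (df[b] / n_docs)))
        scores[(a, b)] = pmi / -math.log(p_ab)
    return df, vocab, scores


def pair_score(scores, a, b):
    """NPMI of an unordered pair; pairs that never co-occur get -1."""
    return scores.get((a, b) if a < b else (b, a), -1.0)


def select_seed_sets(docs, num_topics=2, seeds_per_topic=3, min_df=2):
    """Greedy, label-free selection of one seed word list per topic."""
    df, vocab, scores = npmi_table(docs, min_df)
    used, seed_sets = set(), []
    for _ in range(num_topics):
        candidates = [w for w in vocab if w not in used]
        if not candidates:
            break
        # Diversity: the anchor is the candidate least associated with words
        # already chosen (ties: higher document frequency, then alphabetical).
        anchor = min(
            candidates,
            key=lambda w: (
                max((pair_score(scores, w, u) for u in used), default=-1.0),
                -df[w],
                w,
            ),
        )
        # Cohesion: fill the set with the anchor's strongest co-occurring words.
        rest = sorted(
            (w for w in candidates if w != anchor),
            key=lambda w: (-pair_score(scores, anchor, w), w),
        )
        seeds = [anchor] + rest[: seeds_per_topic - 1]
        used.update(seeds)
        seed_sets.append(seeds)
    return seed_sets


# Toy corpus of pre-tokenized documents (made up for illustration only).
toy_docs = [
    ["goal", "match", "team"],
    ["team", "goal", "coach"],
    ["match", "team", "goal"],
    ["stock", "market", "price"],
    ["market", "price", "bank"],
    ["stock", "price", "market"],
]
seed_sets = select_seed_sets(toy_docs)
print(seed_sets)
```

In this sketch, each resulting list could be passed as the seed words for one topic of a seed-guided topic model. The greedy anchor-then-neighbors design is only one of many ways to trade off diversity against cohesion, and the published method may differ substantially.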