ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform

Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, and they usually prefer a customized top-to-bottom design with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which instead follows a generalized design for distributed data-parallel platforms. The novelty of ZenLDA lies in three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm into a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently, reducing computation complexity; (3) it proposes a distributed LDA training framework that represents the corpus as a directed graph, with the parameters annotated as corresponding vertices, and implements ZenLDA and other well-known inference methods on top of Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among the well-known inference methods compared. ZenLDA also showed good scalability when dealing with large-scale topic models on the data-parallel platform. Overall, ZenLDA achieves computing performance comparable to, and even better than, state-of-the-art dedicated systems.

Bibliographic Details
Main Authors: Bo Zhao, Hucheng Zhou, Guoqiang Li, Yihua Huang
Format: Article
Language: English
Published: Tsinghua University Press, 2018-03-01
Series: Big Data Mining and Analytics
Subjects: latent dirichlet allocation; collapsed gibbs sampling; monte-carlo; graph computing; large-scale machine learning
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2018.9020006
_version_ 1832568926551146496
author Bo Zhao
Hucheng Zhou
Guoqiang Li
Yihua Huang
author_facet Bo Zhao
Hucheng Zhou
Guoqiang Li
Yihua Huang
author_sort Bo Zhao
collection DOAJ
description Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, and they usually prefer a customized top-to-bottom design with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which instead follows a generalized design for distributed data-parallel platforms. The novelty of ZenLDA lies in three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm into a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently, reducing computation complexity; (3) it proposes a distributed LDA training framework that represents the corpus as a directed graph, with the parameters annotated as corresponding vertices, and implements ZenLDA and other well-known inference methods on top of Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among the well-known inference methods compared. ZenLDA also showed good scalability when dealing with large-scale topic models on the data-parallel platform. Overall, ZenLDA achieves computing performance comparable to, and even better than, state-of-the-art dedicated systems.
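For readers of this record, a minimal worked illustration of what "decomposes the LDA inference formula" refers to may help. The following shows the standard collapsed Gibbs sampling conditional and a well-known SparseLDA-style three-term split as one example of such a decomposition, not necessarily the exact split used by ZenLDA. Here $n_{dk}$ is the document-topic count, $n_{kw}$ the topic-word count, $n_k$ the topic total, $\alpha$ and $\beta$ the Dirichlet hyperparameters, and $V$ the vocabulary size.

$$p(z_{di}=k \mid \mathbf{z}_{\neg di}, \mathbf{w}) \;\propto\; \left(n_{dk}^{\neg di} + \alpha\right)\frac{n_{kw}^{\neg di} + \beta}{n_{k}^{\neg di} + V\beta}$$

$$\left(n_{dk} + \alpha\right)\frac{n_{kw} + \beta}{n_{k} + V\beta} \;=\; \frac{\alpha\beta}{n_{k} + V\beta} \;+\; \frac{n_{dk}\,\beta}{n_{k} + V\beta} \;+\; \frac{\left(n_{dk} + \alpha\right)\,n_{kw}}{n_{k} + V\beta}$$

The first term is shared across all documents, the second is nonzero only for topics present in the current document, and the third only for topics in which the current word occurs; exploiting this sparsity is what makes decomposed sampling cheaper per token.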
format Article
id doaj-art-964e322084df4408ae89183ef6bfabdd
institution Kabale University
issn 2096-0654
language English
publishDate 2018-03-01
publisher Tsinghua University Press
record_format Article
series Big Data Mining and Analytics
spelling doaj-art-964e322084df4408ae89183ef6bfabdd 2025-02-02T23:47:25Z eng
Tsinghua University Press, Big Data Mining and Analytics, ISSN 2096-0654, 2018-03-01, Vol. 1, No. 1, pp. 57-74, doi:10.26599/BDMA.2018.9020006
ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
Bo Zhao (National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023, China)
Hucheng Zhou (Microsoft Research, Beijing 100080, China)
Guoqiang Li (Huawei Technologies Co., Ltd., Shenzhen 518129, China)
Yihua Huang (National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023, China)
Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, and they usually prefer a customized top-to-bottom design with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which instead follows a generalized design for distributed data-parallel platforms. The novelty of ZenLDA lies in three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm into a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently, reducing computation complexity; (3) it proposes a distributed LDA training framework that represents the corpus as a directed graph, with the parameters annotated as corresponding vertices, and implements ZenLDA and other well-known inference methods on top of Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among the well-known inference methods compared. ZenLDA also showed good scalability when dealing with large-scale topic models on the data-parallel platform. Overall, ZenLDA achieves computing performance comparable to, and even better than, state-of-the-art dedicated systems.
https://www.sciopen.com/article/10.26599/BDMA.2018.9020006
latent dirichlet allocation; collapsed gibbs sampling; monte-carlo; graph computing; large-scale machine learning
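The record above describes representing the corpus as a directed graph with the model parameters annotated on vertices, implemented on Spark. Below is a minimal, hypothetical Scala/GraphX sketch of that general idea only: the object and method names (CorpusGraphSketch, buildCorpusGraph), the bipartite document/word vertex-id convention, and the random topic initialization are illustrative assumptions, not ZenLDA's actual code or API.

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import scala.util.Random

// Hypothetical sketch (not ZenLDA's actual code): the corpus as a bipartite
// document-word graph whose vertices carry topic-count arrays.
object CorpusGraphSketch {
  val numTopics = 100  // assumed number of topics

  // Illustrative convention: documents keep their non-negative ids,
  // words are mapped to negative ids so the two vertex sets never collide.
  def wordVertexId(w: Long): VertexId = -(w + 1L)

  // tokens: one (docId, wordId) pair per token occurrence in the corpus.
  def buildCorpusGraph(tokens: RDD[(Long, Long)]): Graph[Array[Int], Int] = {
    // One directed edge per token, annotated with a random initial topic.
    val edges: RDD[Edge[Int]] = tokens.map { case (d, w) =>
      Edge(d, wordVertexId(w), Random.nextInt(numTopics))
    }

    // Every vertex starts with an all-zero topic-count array.
    val graph = Graph.fromEdges(edges, Array.fill(numTopics)(0))

    // Fold the initial edge topics into document-topic and word-topic counts.
    val counts = graph.aggregateMessages[Array[Int]](
      ctx => {
        val msg = Array.fill(numTopics)(0)
        msg(ctx.attr) = 1
        ctx.sendToSrc(msg)   // document side
        ctx.sendToDst(msg)   // word side
      },
      (a, b) => a.zip(b).map { case (x, y) => x + y }
    )

    // Annotate vertices with their aggregated counts (the model parameters).
    graph.outerJoinVertices(counts) { (_, zero, agg) => agg.getOrElse(zero) }
  }
}

In a full trainer, each sampling sweep would then be expressed as edge-wise topic resampling followed by another aggregateMessages pass to refresh the vertex counts; this sketch stops at graph construction.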
spellingShingle Bo Zhao
Hucheng Zhou
Guoqiang Li
Yihua Huang
ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
Big Data Mining and Analytics
latent dirichlet allocation
collapsed gibbs sampling
monte-carlo
graph computing
large-scale machine learning
title ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
title_full ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
title_fullStr ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
title_full_unstemmed ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
title_short ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
title_sort zenlda large scale topic model training on distributed data parallel platform
topic latent dirichlet allocation
collapsed gibbs sampling
monte-carlo
graph computing
large-scale machine learning
url https://www.sciopen.com/article/10.26599/BDMA.2018.9020006
work_keys_str_mv AT bozhao zenldalargescaletopicmodeltrainingondistributeddataparallelplatform
AT huchengzhou zenldalargescaletopicmodeltrainingondistributeddataparallelplatform
AT guoqiangli zenldalargescaletopicmodeltrainingondistributeddataparallelplatform
AT yihuahuang zenldalargescaletopicmodeltrainingondistributeddataparallelplatform