Global Dense Vector Representations for Words or Items Using Shared Parameter Alternating Tweedie Model

Bibliographic Details
Main Authors: Taejoon Kim (Department of Statistics and Biostatistics, California State University East Bay, Hayward, CA 94542, USA); Haiyan Wang (Department of Statistics, Kansas State University, Manhattan, KS 66506, USA)
Format: Article
Language: English
Published: MDPI AG, 2025-02-01
Series: Mathematics
ISSN: 2227-7390
DOI: 10.3390/math13040612
Collection: DOAJ
Subjects: NLP; word embedding; Tweedie distribution; high-dimensional co-occurrence matrix; matrix factorization; Adam
Online Access: https://www.mdpi.com/2227-7390/13/4/612
Description
In this article, we present a model for analyzing co-occurrence count data derived from practical fields, such as user–item or item–item data from online shopping platforms and co-occurring word–word pairs in sequences of text. Such data contain important information for developing recommender systems or for studying the relevance of items or words from non-numerical sources. Unlike traditional regression models, there are no observed covariates. Additionally, the co-occurrence matrix is typically of such high dimension that it does not fit into a computer's memory for modeling. We extract numerical data by defining windows of co-occurrence and using weighted counts on a continuous scale, with positive probability mass allowed at zero. We present the Shared Parameter Alternating Tweedie (SA-Tweedie) model and an algorithm to estimate its parameters. We introduce a learning-rate adjustment, used together with Fisher scoring in the inner loop, to help the algorithm stay on track with the optimizing direction. Gradient descent with the Adam update was also considered as an alternative estimation method. Simulation studies showed that our algorithm with Fisher scoring and learning-rate adjustment outperforms the other two methods. We applied SA-Tweedie to an English-language Wikipedia dump to obtain dense vector representations for WordPiece tokens. These embeddings were then used in a Named Entity Recognition (NER) task, where the SA-Tweedie embeddings significantly outperform GloVe, random, and BERT embeddings. A notable strength of SA-Tweedie is that its parameter count and training cost are only a tiny fraction of BERT's.
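
The "weighted counts on the continuous scale" mentioned in the abstract follow a standard construction for global embedding models: pairs that co-occur within a context window contribute a weight that decays with distance, so entries are continuous yet most pairs stay exactly zero. A minimal Python sketch, assuming the 1/distance weighting used by GloVe (the paper's exact weighting scheme is not given in the abstract):

    from collections import defaultdict

    def weighted_cooccurrence(tokens, window=5):
        """Distance-weighted co-occurrence counts over a token sequence.

        Each pair within `window` positions contributes 1/distance, so the
        counts are continuous while most pairs remain exactly zero.
        """
        counts = defaultdict(float)
        for i in range(len(tokens)):
            for d in range(1, window + 1):
                if i + d < len(tokens):
                    counts[(tokens[i], tokens[i + d])] += 1.0 / d
                    counts[(tokens[i + d], tokens[i])] += 1.0 / d
        return counts

    # Toy corpus: ("the", "cat") at distance 1 contributes weight 1.0,
    # ("the", "sat") at distance 2 contributes 0.5, and so on.
    print(weighted_cooccurrence("the cat sat on the mat".split(), window=2))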
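The estimation the abstract describes alternates between the two embedding matrices: with one factor held fixed, each row of the other solves a Tweedie GLM with log link, updated by Fisher scoring. The sketch below shows one such step for a single row; the step-halving rule is only a plausible stand-in for the paper's learning-rate adjustment (its exact form is not given in the abstract), and the sketch omits the parameter sharing that gives SA-Tweedie its name:

    import numpy as np

    def tweedie_deviance(y, mu, p=1.5):
        """Summed Tweedie deviance for power 1 < p < 2 (compound
        Poisson-gamma), which puts positive probability mass at y = 0."""
        return 2.0 * np.sum(y**(2 - p) / ((1 - p) * (2 - p))
                            - y * mu**(1 - p) / (1 - p)
                            + mu**(2 - p) / (2 - p))

    def fisher_step(y, V, u, p=1.5, lr=1.0, ridge=1e-6):
        """One Fisher-scoring step for coefficients u of a Tweedie GLM with
        log link: y is one row of the co-occurrence matrix, V the fixed
        factor. Step-halving keeps the update on a descent direction."""
        mu = np.exp(V @ u)
        score = V.T @ ((y - mu) * mu**(1 - p))        # d loglik / d u
        info = V.T @ ((mu**(2 - p))[:, None] * V)     # Fisher information
        info += ridge * np.eye(V.shape[1])            # stabilize the solve
        step = np.linalg.solve(info, score)
        dev0 = tweedie_deviance(y, mu, p)
        while lr > 1e-4:                              # learning-rate adjustment
            u_new = u + lr * step
            if tweedie_deviance(y, np.exp(V @ u_new), p) < dev0:
                return u_new
            lr *= 0.5
        return u

A full outer pass would apply fisher_step to every row of one factor with the other fixed, then swap the roles of the two factors.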
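The Adam alternative the abstract compares against can be read as full-batch gradient descent on the same deviance, with the factorization X ≈ Tweedie(mu = exp(U V^T)). A self-contained sketch; the dimensions and hyperparameters below are illustrative defaults, not the paper's:

    import numpy as np

    def adam_factorize(X, dim=50, p=1.5, lr=0.01, steps=500,
                       b1=0.9, b2=0.999, eps=1e-8, seed=0):
        """Fit mu = exp(U @ V.T) to X by Adam on the Tweedie deviance."""
        rng = np.random.default_rng(seed)
        n, m = X.shape
        U = rng.normal(scale=0.1, size=(n, dim))
        V = rng.normal(scale=0.1, size=(m, dim))
        mU, vU = np.zeros_like(U), np.zeros_like(U)
        mV, vV = np.zeros_like(V), np.zeros_like(V)
        for t in range(1, steps + 1):
            mu = np.exp(U @ V.T)
            G = 2.0 * mu**(1 - p) * (mu - X)   # d deviance / d (U V^T)
            gU, gV = G @ V, G.T @ U            # chain rule through U V^T
            mU = b1 * mU + (1 - b1) * gU       # first-moment estimates
            vU = b2 * vU + (1 - b2) * gU**2    # second-moment estimates
            mV = b1 * mV + (1 - b1) * gV
            vV = b2 * vV + (1 - b2) * gV**2
            # Bias-corrected Adam updates for both factors
            U -= lr * (mU / (1 - b1**t)) / (np.sqrt(vU / (1 - b2**t)) + eps)
            V -= lr * (mV / (1 - b1**t)) / (np.sqrt(vV / (1 - b2**t)) + eps)
        return U, V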