Measuring Semantic Stability: Statistical Estimation of Semantic Projections via Word Embeddings
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | Axioms |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2075-1680/14/5/389 |
| Summary: | We present a new framework to study the stability of semantic projections based on word embeddings. Roughly speaking, semantic projections are indices taking values in the interval [0, 1] that measure how terms share contextual meaning with the words of a given universe. Since there are many ways to define such projections, it is important to establish a procedure for verifying whether a group of them behaves similarly. Moreover, when fixing one particular projection, it is important to assess whether the average projections remain consistent when replacing the original universe with a similar one describing the same semantic environment. The aim of this paper is to address the lack of formal tools for assessing the stability of semantic projections (that is, their invariance under formal changes which preserve the underlying semantic context) across alternative but semantically related universes in word embedding models. To address these problems, we employ a combination of statistical and AI methods, including correlation analysis, clustering, chi-squared distance measures, weighted approximations, and Lipschitz-based estimators. The methodology provides theoretical guarantees under mild mathematical assumptions, ensuring bounded errors in projection estimations based on the assumption of Lipschitz continuity. We demonstrate the practical applicability of our approach through two case studies involving agricultural terminology across multiple data sources (DOAJ, Scholar, Google, and Arxiv). Our results show that semantic stability can be quantitatively evaluated and that the careful modeling of projection functions and universes is crucial for robust semantic analysis in NLP. |
|---|---|
| ISSN: | 2075-1680 |