Measuring Semantic Stability: Statistical Estimation of Semantic Projections via Word Embeddings

Bibliographic Details
Main Authors: Roger Arnau, Ana Coronado Ferrer, Álvaro González Cortés, Claudia Sánchez Arnau, Enrique A. Sánchez Pérez
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Axioms
Subjects:
Online Access: https://www.mdpi.com/2075-1680/14/5/389
Description
Summary: We present a new framework for studying the stability of semantic projections based on word embeddings. Roughly speaking, semantic projections are indices taking values in the interval [0, 1] that measure how much contextual meaning a term shares with the words of a given universe. Since there are many ways to define such projections, it is important to establish a procedure for verifying whether a group of them behaves similarly. Moreover, once a particular projection is fixed, it is important to assess whether the average projections remain consistent when the original universe is replaced with a similar one describing the same semantic environment. The aim of this paper is to address the lack of formal tools for assessing the stability of semantic projections (that is, their invariance under formal changes that preserve the underlying semantic context) across alternative but semantically related universes in word embedding models. To this end, we employ a combination of statistical and AI methods, including correlation analysis, clustering, chi-squared distance measures, weighted approximations, and Lipschitz-based estimators. Under mild mathematical assumptions, the methodology provides theoretical guarantees: assuming Lipschitz continuity, the errors in projection estimates are bounded. We demonstrate the practical applicability of our approach through two case studies involving agricultural terminology across multiple data sources (DOAJ, Scholar, Google, and arXiv). Our results show that semantic stability can be quantitatively evaluated and that the careful modeling of projection functions and universes is crucial for robust semantic analysis in NLP.
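Illustrative note: the summary says that Lipschitz continuity yields bounded errors in projection estimates, but this record does not give the construction. The following is a minimal sketch of one standard way such a bound can arise, the McShane-Whitney extension; whether the article uses this exact estimator is an assumption. The function names, the Euclidean metric, the toy 2-D vectors, and the Lipschitz constant are all hypothetical and chosen only for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def lipschitz_bounds(x, known_points, known_values, lipschitz_constant):
    """Lower and upper bounds on f(x) for an L-Lipschitz f with values in [0, 1].

    Assumption: known_values[i] = f(known_points[i]) and the data are
    consistent with the given constant; otherwise lower may exceed upper.
    """
    pairs = list(zip(known_points, known_values))
    # Each known point p caps f(x) at f(p) + L * d(x, p) (McShane bound).
    upper = min(v + lipschitz_constant * euclidean(x, p) for p, v in pairs)
    # Each known point p floors f(x) at f(p) - L * d(x, p) (Whitney bound).
    lower = max(v - lipschitz_constant * euclidean(x, p) for p, v in pairs)
    # Projections take values in [0, 1], so the bounds can be clipped.
    return max(lower, 0.0), min(upper, 1.0)


def lipschitz_estimate(x, known_points, known_values, lipschitz_constant):
    """Midpoint estimate of f(x) and the half-width that bounds its error."""
    lower, upper = lipschitz_bounds(
        x, known_points, known_values, lipschitz_constant
    )
    return (lower + upper) / 2.0, (upper - lower) / 2.0


# Hypothetical toy data: two embedded terms with known projection values.
points = [(0.0, 0.0), (1.0, 0.0)]
values = [0.2, 0.6]
estimate, max_error = lipschitz_estimate((0.5, 0.0), points, values, 0.5)
```

Under these assumptions the true value at the new point lies within `max_error` of `estimate`, and the error shrinks to zero at any point whose projection is already known.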
ISSN: 2075-1680