The Geometry of Concepts: Sparse Autoencoder Feature Structure

Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: (1) The “atomic” small-scale structure contains “crystals” whose faces are parallelograms or trapezoids, generalizing well-known examples such as (<i>man:woman::king:queen</i>). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently performed with linear discriminant analysis. (2) The “brain” intermediate-scale structure has significant spatial modularity; for example, math and code features form a “lobe” akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. (3) The “galaxy”-scale large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer.
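The abstract's first claim, that analogy parallelograms and function vectors sharpen once a global distractor direction is projected out, can be illustrated with a toy sketch. This is not the paper's code: the paper identifies distractor directions with linear discriminant analysis, whereas here the distractor axis and all vectors are hand-constructed assumptions for illustration.

```python
# Toy sketch (hypothetical, not the paper's code): check an analogy
# parallelogram before and after projecting out a known "distractor"
# direction (standing in for something like word length).

def sub(a, b): return [x - y for x, y in zip(a, b)]
def add(a, b): return [x + y for x, y in zip(a, b)]
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cos(a, b): return dot(a, b) / (dot(a, a) ** 0.5 * dot(b, b) ** 0.5)

def project_out(v, d):
    """Remove the component of v along direction d."""
    scale = dot(v, d) / dot(d, d)
    return [x - scale * y for x, y in zip(v, d)]

# Hand-built features: a shared "royal" offset plus a third axis that
# acts as the distractor and corrupts the parallelogram.
distractor = [0.0, 0.0, 1.0]
man   = [1.0, 0.0, 0.3]
woman = [1.0, 1.0, 0.5]
king  = [2.0, 0.0, 0.9]
queen = [2.0, 1.0, 0.1]

predicted = add(sub(king, man), woman)   # king - man + woman
before = cos(predicted, queen)

after = cos(project_out(predicted, distractor),
            project_out(queen, distractor))
print(f"cosine before: {before:.3f}, after: {after:.3f}")
```

By construction, the parallelogram is exact once the distractor axis is removed, so the cosine similarity rises to 1; in the paper's setting the distractor directions must first be discovered from the data.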

Saved in:
Bibliographic Details
Main Authors: Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark
Format: Article
Language:English
Published: MDPI AG 2025-03-01
Series:Entropy
Subjects:
Online Access:https://www.mdpi.com/1099-4300/27/4/344
DOI: 10.3390/e27040344
ISSN: 1099-4300
Citation: Entropy 2025, 27(4), 344
Affiliations:
Yuxiao Li: Beneficial AI Foundation (BAIF), Cambridge, MA 02139, USA
Eric J. Michaud: Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
David D. Baek: Institute for Artificial Intelligence and Fundamental Interaction, Cambridge, MA 02139, USA
Joshua Engels: Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Xiaoqing Sun: Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Max Tegmark: Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Keywords: sparse coding; mechanistic interpretability; neural networks; large language models; clustering
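The abstract's “galaxy”-scale claim concerns a power law of covariance eigenvalues. A minimal sketch of how such a slope might be measured, on synthetic eigenvalues rather than real SAE features (the exponent and spectrum here are assumptions for illustration):

```python
import math

# Hypothetical sketch: fit the power-law slope of an eigenvalue
# spectrum, as one might for an SAE feature point cloud's covariance.
# Synthetic eigenvalues lambda_i = i ** -2 stand in for real ones.

def power_law_slope(eigenvalues):
    """Least-squares slope of log(lambda_i) vs. log(i)."""
    xs = [math.log(i + 1) for i in range(len(eigenvalues))]
    ys = [math.log(v) for v in eigenvalues]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

eigs = [(i + 1) ** -2.0 for i in range(100)]
slope = power_law_slope(eigs)
print(f"fitted slope: {slope:.2f}")  # -2 by construction
```

On real features one would take the eigenvalues of the feature covariance matrix per layer and compare fitted slopes across layers, where the paper reports the steepest slope in middle layers.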
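The abstract also quantifies how “clustering entropy” depends on the layer. One plausible reading, sketched here as an assumption rather than the paper's definition, is the Shannon entropy of the cluster-occupancy distribution:

```python
import math
from collections import Counter

# Hypothetical sketch: Shannon entropy (in nats) of how features are
# distributed across clusters. Uniform occupancy maximizes entropy;
# concentration into few clusters lowers it.

def clustering_entropy(labels):
    """Entropy of the empirical cluster-occupancy distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

print(clustering_entropy([0, 1, 2, 3]))  # uniform: log(4)
print(clustering_entropy([0, 0, 0, 1]))  # concentrated: lower
```

Computed per layer over cluster assignments of the SAE features, this would give one number per layer whose trend could be compared across depth.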