Graph-based vision transformer with sparsity for training on small datasets from scratch
Abstract Vision Transformers (ViTs) have achieved impressive results in large-scale image classification. However, when training from scratch on small datasets, there is still a significant performance gap between ViTs and Convolutional Neural Networks (CNNs), which is attributed to the lack of inductive bias. To address this issue, we propose a Graph-based Vision Transformer (GvT) that utilizes graph convolutional projection and graph-pooling. In each block, queries and keys are calculated through graph convolutional projection based on the spatial adjacency matrix, while dot-product attention is used in another graph convolution to generate values. When using more attention heads, the queries and keys become lower-dimensional, making their dot product an uninformative matching function. To overcome this low-rank bottleneck in attention heads, we employ talking-heads technology based on bilinear pooled features and sparse selection of attention tensors. This allows interaction among filtered attention scores and enables each attention mechanism to depend on all queries and keys. Additionally, we apply graph-pooling between two intermediate blocks to reduce the number of tokens and aggregate semantic information more effectively. Our experimental results show that GvT produces comparable or superior outcomes to deep convolutional networks and surpasses vision transformers without pre-training on large datasets.
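The abstract only summarizes the mechanism, so a rough illustration may help: the PyTorch sketch below shows the general idea of computing queries and keys through a graph convolution over a spatial adjacency matrix, sparsely selecting attention scores, and mixing them across heads (talking heads), as described above. The 4-neighbour grid adjacency, the top-k sparsity, the layer dimensions, and all function and parameter names are illustrative assumptions; this is not the authors' implementation and it omits the bilinear pooling and the graph-pooling stage.

```python
# Minimal sketch (assumptions noted above): graph-convolutional Q/K projection,
# top-k sparse attention scores, and a talking-heads mixing layer across heads.
import torch
import torch.nn as nn


def grid_adjacency(h, w):
    """Row-normalized 4-neighbour adjacency (with self-loops) for an h x w token grid."""
    n = h * w
    a = torch.eye(n)
    for i in range(h):
        for j in range(w):
            u = i * w + j
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    a[u, ni * w + nj] = 1.0
    return a / a.sum(-1, keepdim=True)


class GraphAttention(nn.Module):
    def __init__(self, dim, heads=8, topk=16):
        super().__init__()
        self.heads, self.topk, self.dh = heads, topk, dim // heads
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim) for _ in range(3))
        self.mix = nn.Linear(heads, heads, bias=False)  # talking-heads mixing
        self.out = nn.Linear(dim, dim)

    def forward(self, x, adj):
        b, n, _ = x.shape
        # graph convolution: aggregate each token's spatial neighbours, then project
        xg = adj @ x
        split = lambda t: t.view(b, n, self.heads, self.dh).transpose(1, 2)
        q, k, v = split(self.wq(xg)), split(self.wk(xg)), split(self.wv(xg))
        scores = (q @ k.transpose(-2, -1)) / self.dh ** 0.5  # (b, heads, n, n)
        # sparse selection: replace all but the top-k scores per query with a large negative value
        kth = scores.topk(min(self.topk, n), dim=-1).values[..., -1:]
        scores = scores.masked_fill(scores < kth, -1e4)
        # talking heads: let each head's scores depend on every other head's filtered scores
        scores = self.mix(scores.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        attn = scores.softmax(dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(y)


# toy usage: an 8x8 token grid with 64-dimensional tokens
adj = grid_adjacency(8, 8)
tokens = torch.randn(2, 8 * 8, 64)
print(GraphAttention(dim=64)(tokens, adj).shape)  # torch.Size([2, 64, 64])
```

In the paper, the value branch, the bilinear-pooling-based head interaction, and the graph-pooling between intermediate blocks differ from this simplification; see the full text at the DOI below.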
| Main Authors: | Peng Li, Lu Huang, Jin Li, Haiyan Yan, Dongjing Shan |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
| Subjects: | Vision Transformer; Graph convolution; Self-attention; Graph-pooling; Image classification |
| Online Access: | https://doi.org/10.1038/s41598-025-10408-0 |
| author | Peng Li, Lu Huang, Jin Li, Haiyan Yan, Dongjing Shan |
|---|---|
| collection | DOAJ |
| description | Abstract Vision Transformers (ViTs) have achieved impressive results in large-scale image classification. However, when training from scratch on small datasets, there is still a significant performance gap between ViTs and Convolutional Neural Networks (CNNs), which is attributed to the lack of inductive bias. To address this issue, we propose a Graph-based Vision Transformer (GvT) that utilizes graph convolutional projection and graph-pooling. In each block, queries and keys are calculated through graph convolutional projection based on the spatial adjacency matrix, while dot-product attention is used in another graph convolution to generate values. When using more attention heads, the queries and keys become lower-dimensional, making their dot product an uninformative matching function. To overcome this low-rank bottleneck in attention heads, we employ talking-heads technology based on bilinear pooled features and sparse selection of attention tensors. This allows interaction among filtered attention scores and enables each attention mechanism to depend on all queries and keys. Additionally, we apply graph-pooling between two intermediate blocks to reduce the number of tokens and aggregate semantic information more effectively. Our experimental results show that GvT produces comparable or superior outcomes to deep convolutional networks and surpasses vision transformers without pre-training on large datasets. |
| format | Article |
| id | doaj-art-d43e5767b32a41de8f46330aa2ada150 |
| institution | Kabale University |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | Scientific Reports, Vol. 15, Iss. 1, Pp. 1-12 (2025-07-01), doi:10.1038/s41598-025-10408-0. Graph-based vision transformer with sparsity for training on small datasets from scratch. Peng Li (Emergency Department, Yantaishan Hospital); Lu Huang (Southwest Medical University); Jin Li (Southwest Medical University); Haiyan Yan (Yantai Municipal Health Service Center); Dongjing Shan (School of Medical Information and Engineering, Southwest Medical University) |
| title | Graph-based vision transformer with sparsity for training on small datasets from scratch |
| topic | Vision Transformer; Graph convolution; Self-attention; Graph-pooling; Image classification |
| url | https://doi.org/10.1038/s41598-025-10408-0 |