Dual-Alignment CLIP: Task-Specific Alignment of Text and Visual Features for Few-Shot Remote Sensing Scene Classification
Convolutional neural networks (CNNs) are widely adopted for remote sensing image scene classification. However, building large annotated remote sensing datasets is costly and time-consuming, which limits the applicability of CNNs in real-world scenarios. Inspired by the human ability to learn from limited examples, few-shot image classific...
Saved in:
| Main Authors: | Dongmei Deng, Ping Yao |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE 2025-01-01 |
| Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
| Subjects: | Contrastive vision-language pretraining (CLIP); few-shot learning (FSL); image classification; remote sensing |
| Online Access: | https://ieeexplore.ieee.org/document/11083761/ |
| _version_ | 1849392695603101696 |
|---|---|
| author | Dongmei Deng Ping Yao |
| author_facet | Dongmei Deng Ping Yao |
| author_sort | Dongmei Deng |
| collection | DOAJ |
| description | Convolutional neural networks (CNNs) are widely adopted for remote sensing image scene classification. However, building large annotated remote sensing datasets is costly and time-consuming, which limits the applicability of CNNs in real-world scenarios. Inspired by the human ability to learn from limited examples, few-shot image classification offers a promising solution by utilizing limited labeled data. Recently, contrastive vision-language pretraining (CLIP) has shown impressive few-shot image classification performance in downstream remote sensing tasks. However, existing CLIP-based methods still have two essential issues: 1) bias in text features; and 2) unreliable similarity in image features. To address these issues, we design a multilevel image–text feature alignment (MITA) component that aligns the multimodal embeddings with visual-guided text features at the instance, class, and random levels, and an image–image feature alignment (IIA) component that reliably measures the similarity between images by remapping visual features from the image–text alignment embedding space to an image–image alignment feature space. In addition, we build an adaptive knowledge fusion component to automatically fuse prior knowledge from the pretrained model with task-specific knowledge from the MITA and IIA modules. These components constitute the proposed dual-alignment CLIP (DA-CLIP) method, and extensive experiments on 12 remote sensing datasets validate its effectiveness. |
| format | Article |
| id | doaj-art-5a86ca38849347eab21ddefda0f01ac2 |
| institution | Kabale University |
| issn | 1939-1404 2151-1535 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
| spelling | doaj-art-5a86ca38849347eab21ddefda0f01ac2; 2025-08-20T03:40:43Z; eng; IEEE; IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing; ISSN 1939-1404, 2151-1535; 2025-01-01; vol. 18, pp. 19260–19272; doi:10.1109/JSTARS.2025.3590590; article 11083761; Dual-Alignment CLIP: Task-Specific Alignment of Text and Visual Features for Few-Shot Remote Sensing Scene Classification; Dongmei Deng (https://orcid.org/0009-0005-8822-5742), Ping Yao (https://orcid.org/0009-0008-7539-5066), Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China; https://ieeexplore.ieee.org/document/11083761/; keywords: Contrastive vision-language pretraining (CLIP); few-shot learning (FSL); image classification; remote sensing |
| spellingShingle | Dongmei Deng Ping Yao Dual-Alignment CLIP: Task-Specific Alignment of Text and Visual Features for Few-Shot Remote Sensing Scene Classification IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Contrastive vision-language pretraining (CLIP) few-shot learning (FSL) image classification remote sensing |
| title | Dual-Alignment CLIP: Task-Specific Alignment of Text and Visual Features for Few-Shot Remote Sensing Scene Classification |
| title_full | Dual-Alignment CLIP: Task-Specific Alignment of Text and Visual Features for Few-Shot Remote Sensing Scene Classification |
| title_fullStr | Dual-Alignment CLIP: Task-Specific Alignment of Text and Visual Features for Few-Shot Remote Sensing Scene Classification |
| title_full_unstemmed | Dual-Alignment CLIP: Task-Specific Alignment of Text and Visual Features for Few-Shot Remote Sensing Scene Classification |
| title_short | Dual-Alignment CLIP: Task-Specific Alignment of Text and Visual Features for Few-Shot Remote Sensing Scene Classification |
| title_sort | dual alignment clip task specific alignment of text and visual features for few shot remote sensing scene classification |
| topic | Contrastive vision-language pretraining (CLIP) few-shot learning (FSL) image classification remote sensing |
| url | https://ieeexplore.ieee.org/document/11083761/ |
| work_keys_str_mv | AT dongmeideng dualalignmentcliptaskspecificalignmentoftextandvisualfeaturesforfewshotremotesensingsceneclassification AT pingyao dualalignmentcliptaskspecificalignmentoftextandvisualfeaturesforfewshotremotesensingsceneclassification |
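The abstract above describes fusing prior knowledge from the frozen CLIP model with task-specific knowledge learned from a few labeled images. The sketch below is a minimal illustration of that general idea only: zero-shot image–text similarity combined with an image–image similarity over a few-shot support set, balanced by a learnable weight. All names, dimensions, and the linear remapping are assumptions introduced here for illustration; this is not the DA-CLIP implementation, which is not included in this record.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes; the paper's actual backbone and embedding dimensions
# are not given in this record.
embed_dim, n_classes, n_shots = 512, 10, 5
n_support = n_classes * n_shots

# Stand-ins for L2-normalized CLIP features (random here, for illustration only).
text_feats = F.normalize(torch.randn(n_classes, embed_dim), dim=-1)     # class-prompt embeddings
support_feats = F.normalize(torch.randn(n_support, embed_dim), dim=-1)  # few-shot labeled images
support_labels = torch.arange(n_classes).repeat_interleave(n_shots)
query_feats = F.normalize(torch.randn(4, embed_dim), dim=-1)            # images to classify

# Prior knowledge: zero-shot image–text similarity in the frozen CLIP space.
prior_logits = 100.0 * query_feats @ text_feats.t()

# Task-specific knowledge: image–image similarity against the support set.
# A learnable linear map stands in for the "remapping to an image–image
# alignment feature space" mentioned in the abstract (an assumption here).
remap = torch.nn.Linear(embed_dim, embed_dim, bias=False)
q = F.normalize(remap(query_feats), dim=-1)
s = F.normalize(remap(support_feats), dim=-1)
onehot = F.one_hot(support_labels, n_classes).float()
task_logits = (q @ s.t()) @ onehot  # per-class aggregated similarity

# Adaptive fusion: a learnable scalar balances prior and task-specific logits;
# in a real few-shot pipeline, alpha and the remapping would be trained on the
# support set (training loop omitted in this sketch).
alpha = torch.nn.Parameter(torch.tensor(0.5))
logits = alpha * task_logits + (1.0 - alpha) * prior_logits
print(logits.argmax(dim=-1))  # predicted class indices for the query images
```

The fixed scaling and random features are placeholders; the point of the sketch is only the structure of combining a frozen zero-shot branch with a small, trainable task-specific branch.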