Scene Graph and Natural Language-Based Semantic Image Retrieval Using Vision Sensor Data

Bibliographic Details
Main Authors:	Jaehoon Kim, Byoung Chul Ko
Format:	Article
Language:	English
Published:	MDPI AG 2025-05-01
Series:	Sensors
Subjects:	vision sensor; graph similarity learning; graph neural network; scene graph generation; semantic image retrieval; subgraph extraction
Online Access:	https://www.mdpi.com/1424-8220/25/11/3252

Description
Summary:	Text-based image retrieval is one of the most common approaches for searching images acquired from vision sensors such as cameras. However, this method suffers from limitations in retrieval accuracy, particularly when the query contains limited information or involves previously unseen sentences. These challenges arise because keyword-based matching fails to adequately capture contextual and semantic meanings. To address these limitations, we propose a novel approach that transforms sentences and images into semantic graphs and scene graphs, enabling a quantitative comparison between them. Specifically, we utilize a graph neural network (GNN) to learn features of nodes and edges and generate graph embeddings, enabling image retrieval through natural language queries without relying on additional image metadata. We introduce a contrastive GNN-based framework that matches semantic graphs with scene graphs to retrieve semantically similar images. In addition, we incorporate a hard negative mining strategy, allowing the model to effectively learn from more challenging negative samples. The experimental results on the Visual Genome dataset show that the proposed method achieves a top nDCG@50 score of 0.745, improving retrieval performance by approximately 7.7 percentage points compared to random sampling with full graphs. This confirms that the model effectively retrieves semantically relevant images by structurally interpreting complex scenes.
ISSN:	1424-8220
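
The summary reports retrieval quality as nDCG@50. As a reading aid, the following is a minimal sketch of how nDCG@k is commonly computed, assuming linear gain and a log2 rank discount. The function names and the graded relevance values are hypothetical; this record does not state how the authors define relevance or which nDCG variant they use.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results.

    Assumes linear gain (the relevance value itself) and a log2 discount,
    so the result at 0-based position `rank` is divided by log2(rank + 2).
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """Normalized DCG: DCG of the returned ranking divided by the DCG of
    the best possible ordering of the same relevance values."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        # No relevant item among the candidates: define the score as 0.
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # Hypothetical graded relevance of retrieved images, in ranked order.
    ranking = [3, 2, 3, 0, 1, 2]
    print(round(ndcg_at_k(ranking, k=5), 3))
```

A score of 1.0 means the retrieved images are already ordered from most to least relevant within the top k; lower values indicate that relevant images appear further down the ranking.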
