Detecting Black-Box Model Probing Attacks Through Probability Scores

In the black-box model probing attack, the attacker sends a series of model inference requests to a victim model to map out the classification boundary of the model. This attack is considered critical because it helps the attacker gain a better understanding of the model and launch follow-up attacks...

Bibliographic Details
Main Authors: Yongzhi Wang, Ahsan Habib, Likhitha Reddy Kesara, Brahmarshi Jasti, Renjie Hu, Tejasv Singh
Format: Article
Language: English
Published: IEEE 2025-01-01
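Series: IEEE Access
Subjects: Machine learning security; adversarial machine learning; model probing attacks; black-box attacks; transformer; LSTM
Online Access: https://ieeexplore.ieee.org/document/11029295/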
author Yongzhi Wang
Ahsan Habib
Likhitha Reddy Kesara
Brahmarshi Jasti
Renjie Hu
Tejasv Singh
collection DOAJ
description In the black-box model probing attack, the attacker sends a series of model inference requests to a victim model to map out the classification boundary of the model. This attack is considered critical because it helps the attacker gain a better understanding of the model and launch follow-up attacks. While most existing works defend against this attack through the input of the victim model, this paper addresses the problem by utilizing the output of the victim model, i.e., the sequence of probability scores generated through the model inquiries. Our research discovered that the sequences of top-2 probability scores generated by different model probing attacks form distinct patterns that can be detected through time-series classification methods, such as Transformer and Bidirectional Long Short-Term Memory (BiLSTM). Our experiments showed that both Transformer and BiLSTM can detect known and unknown black-box model probing attacks and their variants. Compared with existing defense methods, our classifiers can reduce the Attack Success Rate (ASR) of the OARS-enhanced attacks from 93% to 1.78%. To facilitate further studies on the black-box model probing attack, we collected the probability score sequences generated by 6 different black-box model probing attacks and released them as an open dataset: https://www.kaggle.com/datasets/drvoyager/model-probing-attack-dataset.
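The description above says the detector works on the sequence of top-2 probability scores returned across a series of queries. As a rough illustration of that feature only, here is a minimal Python sketch. It is not the authors' implementation: the function names `top2_scores` and `top2_sequence` and the sample `session` values are hypothetical, and the time-series classifier (Transformer or BiLSTM) that would consume the result is not shown.

```python
from typing import List, Sequence, Tuple


def top2_scores(probabilities: Sequence[float]) -> Tuple[float, float]:
    """Return the two largest class probabilities, largest first."""
    if len(probabilities) < 2:
        raise ValueError("need at least two class probabilities")
    first, second = sorted(probabilities, reverse=True)[:2]
    return first, second


def top2_sequence(
    query_outputs: Sequence[Sequence[float]],
) -> List[Tuple[float, float]]:
    """Map each query's probability vector to its top-2 scores, in query order."""
    return [top2_scores(p) for p in query_outputs]


# Hypothetical probability vectors returned for three consecutive queries
# (illustrative values, not taken from the paper or its dataset).
session = [
    [0.70, 0.20, 0.10],
    [0.55, 0.40, 0.05],
    [0.30, 0.45, 0.25],
]

features = top2_sequence(session)
print(features)
```

Each query contributes one (top-1, top-2) pair, so a session of n queries becomes an n-by-2 series that a sequence classifier could take as input.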
format Article
id doaj-art-82751f018f8b474099fea9c0511ac265
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-82751f018f8b474099fea9c0511ac265
2025-08-20T03:45:41Z
eng
IEEE
IEEE Access, ISSN 2169-3536
2025-01-01, volume 13, pages 103060-103076
DOI 10.1109/ACCESS.2025.3578565
IEEE Xplore document 11029295
Yongzhi Wang (https://orcid.org/0000-0002-7117-4097), Department of Computer Information Systems, California State Polytechnic University Pomona, Pomona, CA, USA
Ahsan Habib, Department of Computer Science, Texas A&M University-Corpus Christi, Corpus Christi, TX, USA
Likhitha Reddy Kesara, Department of Information Science Technology, University of Houston, Houston, TX, USA
Brahmarshi Jasti, Department of Information Science Technology, University of Houston, Houston, TX, USA
Renjie Hu (https://orcid.org/0000-0002-0496-6035), Department of Information Science Technology, University of Houston, Houston, TX, USA
Tejasv Singh, Department of Computer Science, Texas A&M University-Corpus Christi, Corpus Christi, TX, USA
title Detecting Black-Box Model Probing Attacks Through Probability Scores
topic Machine learning security
adversarial machine learning
model probing attacks
black-box attacks
transformer
LSTM
url https://ieeexplore.ieee.org/document/11029295/