Pushing the Practicality of Black-Box Audio Attacks against Speaker Recognition Models

11 Jun 2024


(1) Rui Duan, University of South Florida, Tampa, USA (email: ruiduan@usf.edu);

(2) Zhe Qu, Central South University, Changsha, China (email: zhe_qu@csu.edu.cn);

(3) Leah Ding, American University, Washington, DC, USA (email: ding@american.edu);

(4) Yao Liu, University of South Florida, Tampa, USA (email: yliu@cse.usf.edu).

Abstract and Intro

Background and Motivation

Parrot Training: Feasibility and Evaluation

PT-AE Generation: A Joint Transferability and Perception Perspective

Optimized Black-Box PT-AE Attacks

Experimental Evaluations

Related Work

Conclusion and References


Abstract—Audio adversarial examples (AEs) have posed significant security challenges to real-world speaker recognition systems. Most black-box attacks still require certain information from the speaker recognition model to be effective (e.g., repeated probing and knowledge of similarity scores). This work aims to push the practicality of black-box attacks by minimizing the attacker’s knowledge about a target speaker recognition model. Although it is not feasible for an attacker to succeed with completely zero knowledge, we assume that the attacker only knows a short speech sample (a few seconds long) of a target speaker. Without any probing to gain further knowledge about the target model, we propose a new mechanism, called parrot training, to generate AEs against the target model. Motivated by recent advancements in voice conversion (VC), we propose to use this one-short-sentence knowledge to generate more synthetic speech samples that sound like the target speaker, called parrot speech. Then, we use these parrot speech samples to train a parrot-trained (PT) surrogate model for the attacker. Under a joint transferability and perception framework, we investigate different ways to generate AEs on the PT model (called PT-AEs) so that they transfer well to a black-box target model while retaining good human perceptual quality. Real-world experiments show that the resultant PT-AEs achieve attack success rates of 45.8%–80.8% against open-source models in the digital-line scenario and 47.9%–58.3% against smart devices, including Apple HomePod (Siri), Amazon Echo, and Google Home, in the over-the-air scenario[1].


Adversarial speech attacks against speech recognition [28], [114], [72], [101], [105], [32], [43], [118] and speaker recognition [43], [29], [118] have become one of the most active research areas of machine learning in computer audio security. These attacks craft audio adversarial examples (AEs) that can spoof the speech classifier in either white-box [28], [114], [72], [52] or black-box settings [105], [32], [43], [118], [29], [74], [17]. Compared with white-box attacks that require full knowledge of a target audio classification model, black-box attacks do not assume full knowledge and have been investigated in the literature under different attack scenarios [29], [118]. Despite the substantial progress in designing black-box attacks, they can still be challenging to launch in real-world scenarios because the attacker is still required to gain information from the target model.

Generally, the attacker can use a query (or probing) process to gradually learn about the target model: repeatedly sending a speech signal to the target model, then measuring either the confidence level/prediction score [32], [43], [29] or the final output results [118], [113] of a classifier. The probing process usually requires a large number of interactions (e.g., over 1000 queries [113]), which can cost substantial labor and time. This may work in the digital line, such as interacting with local machine learning models (e.g., Kaldi toolkit [93]) or online commercial platforms (e.g., Microsoft Azure [12]). However, it can be even more cumbersome, if not impossible, to probe physical devices because today’s smart devices (e.g., Amazon Echo [2]) accept human speech over the air. Moreover, some internal knowledge of the target model still has to be assumed known to the attacker (e.g., access to the similarity scores of the target model [29], [113]). Two recent studies further limited the attacker’s knowledge to (i) [118] only knowing the target speaker’s one-sentence speech and requiring probing to get the target model’s hard-label (accept or reject) results (e.g., over 10,000 times) and (ii) [30] only knowing one-sentence speech for each speaker enrolled in the target model.

In this paper, we present a new, even more practical perspective for black-box attacks against speaker recognition. We first note that the most practical attack assumption is to let the attacker know nothing about the target model and never probe the model. However, such completely zero knowledge for the attacker is unlikely to lead to effective audio AEs. We have to assume some knowledge but keep it at the minimum level for the sake of attack practicality. Our work limits the attacker’s knowledge to only a one-sentence (or a few seconds) speech sample of her target speaker without knowing any other information about the target model. The attacker has neither knowledge of nor access to the internals of the target model. Moreover, she does not probe the classifier and needs no observation of the classification results (either soft or hard labels). To the best of our knowledge, our assumption of the attacker’s knowledge is the most restricted compared with prior work (in particular with the two recent attacks [118], [30]).

Centered around this one-sentence knowledge of the target speaker, our basic attack framework is to (i) propose a new training procedure, called parrot training, which generates a sufficient number of synthetic speech samples of the target speaker and uses them to construct a parrot-trained (PT) model for a further transfer attack, and (ii) systematically evaluate the transferability and perception of different AE generation mechanisms and create PT-model based AEs (PT-AEs) towards high attack success rates and good audio quality.

Our motivation behind parrot training is that recent advancements in the voice conversion (VC) domain have shown that one-shot speech methods [34], [77], [110], [31] are able to leverage semantic human speech features to generate speech samples that sound like a target speaker’s voice with different linguistic content. Based on the attacker’s one-sentence knowledge, we should be able to generate different synthetic speech samples of her target speaker and use them to build a PT model for speaker recognition. Our feasibility evaluations show that a PT model can perform similarly to a ground-truth trained (GT) model that uses the target speaker’s actual speech samples.
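As a rough illustration of how parrot speech could feed a surrogate training set, the following minimal Python sketch assumes a one-shot VC system exposed as a callable `convert(source, reference)`. That callable, the `Audio` type, the label names, and the choice to mix in other speakers' real clips are hypothetical placeholders for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Audio = List[float]  # placeholder waveform type (hypothetical)


def build_parrot_dataset(
    target_sample: Audio,
    source_utterances: Sequence[Audio],
    other_speakers: Dict[str, Sequence[Audio]],
    convert: Callable[[Audio, Audio], Audio],
    target_label: str = "target",
) -> List[Tuple[Audio, str]]:
    """Assemble (audio, label) pairs for training a surrogate speaker model.

    `convert` stands in for any one-shot voice conversion system: it takes a
    source utterance and the single known target sample, and returns the
    source content re-voiced to sound like the target (parrot speech).
    """
    # Parrot speech: every source utterance converted toward the target voice.
    dataset = [
        (convert(src, target_sample), target_label) for src in source_utterances
    ]
    # Assumption: other classes come from ordinary (non-synthetic) speech clips.
    for label, clips in other_speakers.items():
        dataset.extend((clip, label) for clip in clips)
    return dataset
```

The resulting list can be handed to whatever speaker recognition training routine the attacker chooses; the sketch only fixes the data flow from one target sentence to many labeled samples.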

The similarity between PT and GT models creates a new, interesting question of transferability: if we create a PT-AE from a PT model, can it perform similarly to an AE generated from the GT model (GT-AE) and transfer to a black-box target GT model? Transferability in adversarial machine learning is already an intriguing concept. It has been observed that the transferability depends on many aspects, such as model architecture, model parameters, training dataset, and attacking algorithms [79], [76]. Existing AE evaluations have been primarily focused on GT-AEs on GT models without involving synthetic data. As a result, we conduct a comprehensive study on PT-AEs in terms of their generation and quality.

• Quality: We first need to define a quality metric to quantify whether a PT-AE is good or not. There are two important factors of PT-AEs: (i) Transferability of PT-AEs to a black-box target model. We adopt the match rate, which has been comprehensively studied in the image domain [79], to measure the transferability. The match rate is defined as the percentage of PT-AEs that can still be misclassified as the same target label on a black-box GT model. (ii) The perception quality of audio AEs. We conduct a human study to let human participants rate the speech quality of AEs with different types of carriers on a unified perception-score scale from 1 (the worst) to 7 (the best), commonly used in speech evaluation studies [47], [108], [23], [19], [91], [36], and then build regression models to predict human scores of speech quality. However, these two factors are generally contradictory, as a high level of transferability likely results in poor perception quality. We then define a new metric called transferability-perception ratio (TPR) for PT-AEs generated using a specific type of carrier. This metric is based on their match rate and average perception score, and it quantifies the level of transferability a carrier type can achieve per unit of degradation in the human perception score. A high TPR can be interpreted as high transferability achieved at a relatively small cost of perception degradation.
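To make the two quantities concrete, here is a minimal Python sketch. The match rate follows the definition above. The exact TPR formula is not given in this section, so the sketch assumes one plausible form: match rate divided by the drop of the average perception score from the top of the 1 to 7 scale. The function names and that formula are assumptions for illustration only.

```python
from typing import Hashable, Sequence

MAX_SCORE = 7.0  # top of the 1-7 perception scale described in the text


def match_rate(
    surrogate_preds: Sequence[Hashable],
    target_preds: Sequence[Hashable],
    target_label: Hashable,
) -> float:
    """Fraction of AEs that succeed on the surrogate (PT) model and are also
    classified as `target_label` by the black-box (GT) model."""
    transferred = [
        gt == target_label
        for pt, gt in zip(surrogate_preds, target_preds)
        if pt == target_label  # only AEs that already fool the surrogate count
    ]
    if not transferred:
        return 0.0
    return sum(transferred) / len(transferred)


def tpr(rate: float, avg_perception: float, max_score: float = MAX_SCORE) -> float:
    """Assumed TPR form: transferability gained per unit of perception lost."""
    degradation = max_score - avg_perception
    if degradation <= 0:
        raise ValueError("average perception score must be below max_score")
    return rate / degradation
```

Under this assumed form, a carrier type with a match rate of 0.5 and an average perception score of 5.0 would get a TPR of 0.25, while the same match rate at a score of 3.0 would get 0.125, reflecting the higher perceptual cost.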

Notes for Table I: (i) Queries: the typical number of probes needed to interact with the black-box target model. (ii) Soft label: the confidence score [32] or prediction score [101], [105], [32], [29], [113] from the target model. (iii) Hard label: the accept or reject result [118], [74] from the target model. (iv) QFA2SR [30] requires a speech sample of each enrolled speaker in the target model. (v) Human perception: whether the human perception factor is integrated into the AE generation.

Under the TPR framework, we formulate a two-stage PT-AE attack that can be launched over the air against a black-box target model. In the first stage, we narrow down from a full set of carriers to a subset of candidates with high TPRs for the attacker’s target speaker. In the second stage, we adopt an ensemble learning-based formulation [76] that selects the best carrier candidates from the first stage and manipulates their auditory features to minimize a joint loss objective of attack effectiveness and human perception. Real-world experiments show that the proposed PT-AE attack achieves success rates of 45.8%–80.8% against open-source models in the digital-line scenario and 47.9%–58.3% against smart devices, including Apple HomePod (Siri), Amazon Echo, and Google Home, in the over-the-air scenario. Compared with two recent attack strategies, Smack [113] and QFA2SR [30], our strategy achieves improvements of 263.7% (attack success) and 10.7% (human perception score) over Smack, and 95.9% (attack success) and 44.9% (human perception score) over QFA2SR. Table I provides a comparison of the required knowledge between the proposed PT-AE attack and existing strategies.
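The two stages can be pictured with a small Python sketch. It is only a schematic under stated assumptions: the carrier names are hypothetical keys, stage one is reduced to a top-k ranking by TPR, and the joint objective is assumed to be the mean attack loss over an ensemble of surrogate models plus a weighted perception penalty. The paper's actual formulation is given in its later sections and may differ.

```python
from typing import Dict, List, Sequence


def select_candidates(tpr_by_carrier: Dict[str, float], k: int) -> List[str]:
    """Stage 1 (assumed form): keep the k carrier types with the highest TPR."""
    ranked = sorted(
        tpr_by_carrier, key=lambda name: tpr_by_carrier[name], reverse=True
    )
    return ranked[:k]


def joint_loss(
    surrogate_losses: Sequence[float],
    perception_penalty: float,
    weight: float = 1.0,
) -> float:
    """Stage 2 (assumed form): average attack loss across the surrogate
    ensemble plus a weighted penalty for perceptual degradation."""
    if not surrogate_losses:
        raise ValueError("need at least one surrogate model loss")
    attack_term = sum(surrogate_losses) / len(surrogate_losses)
    return attack_term + weight * perception_penalty
```

In such a setup, an optimizer would adjust the auditory features of each selected carrier to drive `joint_loss` down; the `weight` parameter (hypothetical here) controls how much perceptual quality is traded for attack effectiveness.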

Our major contributions can be summarized as follows. (i) We propose a new concept of the PT model and investigate state-of-the-art VC methods to generate parrot speech samples to build a surrogate model for an attacker who knows only one sentence of the target speaker’s speech. (ii) We propose a new TPR framework to jointly evaluate the transferability and perceptual quality of PT-AE generation with different types of carriers. (iii) We create a two-stage PT-AE attack strategy that has been shown to be more effective than existing attack strategies, while requiring the minimum level of the attacker’s knowledge.

This paper is available on arXiv under the CC0 1.0 DEED license.

[1] Our attack demo can be found at: https://sites.google.com/view/pt-attack-demo