Poster: SirenAttack: Generating Adversarial Audio for End-to-End Acoustic Systems
Tianyu Du∗, Shouling Ji∗†, Jinfeng Li∗, Qinchen Gu‡, Ting Wang§ and Raheem Beyah‡
∗ Institute of Cyberspace Research and College of Computer Science and Technology, Zhejiang University
Email: {zjradty, sji, lijinfeng0713}@zju.edu.cn
† Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies
‡ Georgia Institute of Technology
§ Lehigh University
Abstract: In this poster, we present SIRENATTACK, a new class of attacks to generate adversarial audios. Compared with existing attacks, SIRENATTACK stands out for three significant features: it is versatile, targeted, and evasive. Experimental results on a set of state-of-the-art deep learning-based acoustic systems demonstrate the versatility, effectiveness, and stealthiness of SIRENATTACK.
I. INTRODUCTION, PRELIMINARY AND ATTACK DESIGN
Nowadays, deep learning-based acoustic systems are ubiquitous in our everyday lives, ranging from smart locks on mobile phones to speech assistants on smart home devices and machine translation services in the cloud. However, deep neural networks (DNNs) are inherently vulnerable to adversarial inputs, which are maliciously crafted samples that trigger target models to misbehave [2]. Despite the plethora of work in the image domain, research on adversarial attacks in the audio domain is still limited, due to a number of non-trivial challenges. First, acoustic systems need to deal with information that changes over time, which is more complex than what image classification systems handle. Second, the audio sampling rate is usually very high, whereas an image has only hundreds or thousands of pixels in total. Therefore, it is harder to craft adversarial audios than adversarial images, since adding slight noise to an audio is less likely to impact the local features.
In this poster, we present SIRENATTACK, a new class of adversarial attacks against deep neural network-based acoustic systems. SIRENATTACK differs from prior work in three significant ways. It is (i) versatile: SIRENATTACK is applicable to a range of end-to-end acoustic systems under both white-box and black-box settings; (ii) targeted: SIRENATTACK generates adversarial audios that trigger target systems to misbehave in a highly predictable manner (e.g., misclassifying the adversarial audio into a specific class); and (iii) evasive: SIRENATTACK is able to generate adversarial audios that human listeners cannot distinguish from their benign counterparts.
SIRENATTACK is based on the Particle Swarm Optimization (PSO) algorithm [1]. PSO is a heuristic, stochastic algorithm that finds solutions to optimization problems by imitating the behavior of a swarm of birds. It can search a very large space of candidate solutions without requiring gradient information. At a high level, it solves an optimization problem by iteratively moving a population of candidate solutions (which we refer to as particles) around the search space according to their fitness values. The fitness value of a particle is the result of evaluating the objective function at that particle's position in the solution space. In each iteration, each particle's movement is influenced by its local best position P_best and is also guided toward the global best position G_best in the search space. This iterative process is expected to move the swarm toward the best solution. Once a termination criterion is met, G_best should hold the solution for a local minimum.
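The PSO loop described above can be sketched as follows. This is a minimal, illustrative version of a common textbook variant with an inertia weight, not the authors' implementation: the function name `pso_minimize`, the hyperparameters `w`, `c1`, `c2`, the search bounds, and the toy `sphere` objective are all assumptions chosen for the example.

```python
import random


def pso_minimize(objective, dim, n_particles=20, iters=100,
                 lo=-1.0, hi=1.0, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer (illustrative; hyperparameters are assumptions)."""
    rng = random.Random(seed)
    # Particles start at uniformly random positions with zero velocity.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # P_best: best position each particle has visited so far.
    pbest = [p[:] for p in pos]
    pbest_fit = [objective(p) for p in pos]
    # G_best: best position visited by any particle so far.
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity is pulled toward both P_best and G_best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Keep the particle inside the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            fit = objective(pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


def sphere(p):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in p)


best, best_fit = pso_minimize(sphere, dim=3)
```

Note that the optimizer only ever calls `objective` on candidate positions, which is why this family of methods suits a black-box setting where gradients of the victim model are unavailable.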
The detailed black-box attack is shown in Algorithm 1. To fool a machine learning model, the attack takes a legitimate audio x and the target output t as input. First, we initialize the epoch to zero and generate n randomized particle sequences (collectively referred to as seeds) from a uniform distribution (line 1). Then we run the PSO subroutine (line 3) with the target output t and the seeds. If any particle p_i produces the target output t when added to the original audio x, the attack succeeds (lines 4-5), and the particle p_i is the expected noise δ. Otherwise, we preserve the best particle, i.e., the one with the minimum fitness value in the current PSO run, as one of the seeds for the next PSO run (lines 7-8). The above steps iterate (lines 2-11) until the attack succeeds or the epoch count reaches epoch_max. If the attack succeeds, we obtain an adversarial audio x_adv that the victim model predicts as t.
Algorithm 1 SIRENATTACK under black-box settings
Input: Original audio x, target output t, n particles and epoch_max
Output: A targeted adversarial audio x_adv
1: Initialize epoch = 0 and seeds, and set Eq. (1) as the objective function;
2: while epoch < epoch_max do
3:     Run PSO subroutine with t and seeds;
4:     if any particle produces target output t during PSO then
5:         Solution is found. Exit.
6:     else
7:         Clear seeds;
8:         seeds ⊇ best particle that produces the minimum value of Eq. (1) from the current PSO run;
9:     end if
10:     epoch = epoch + 1;
11: end while
12: Get adversarial audio x_adv with target label t.
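The control flow of Algorithm 1 can be sketched in Python as below. This is only a sketch under stated assumptions: the names `siren_outer_loop`, `toy_predict`, `toy_fitness`, and `coordinate_search` are hypothetical, the noise bound `eps` is illustrative, and the remaining seeds are assumed to be re-drawn at random after each epoch. To keep the example self-contained, the PSO subroutine is replaced by a simple greedy coordinate search, which is a stand-in and not PSO.

```python
import random


def siren_outer_loop(x, target, predict, fitness, run_search,
                     n_particles=10, epoch_max=5, eps=0.1, seed=0):
    """Control flow of Algorithm 1; returns x_adv, or None if every epoch fails."""
    rng = random.Random(seed)

    def random_particle():
        return [rng.uniform(-eps, eps) for _ in x]

    def score(p):
        return fitness(x, p, target)

    # Line 1: epoch = 0 and n uniformly sampled seeds.
    seeds = [random_particle() for _ in range(n_particles)]
    for _epoch in range(epoch_max):                  # lines 2 and 10
        particles = run_search(seeds, score, eps)    # line 3
        for p in particles:                          # lines 4-5
            x_adv = [xi + pi for xi, pi in zip(x, p)]
            if predict(x_adv) == target:
                return x_adv                         # line 12
        # Lines 7-8: clear the seeds but keep the best particle of this run.
        best = min(particles, key=score)
        # Assumption: the remaining seeds are re-drawn uniformly at random.
        seeds = [best] + [random_particle() for _ in range(n_particles - 1)]
    return None


# --- Toy stand-ins (hypothetical, for illustration only) ---
def toy_predict(x_adv):
    """Two-class toy model: label 1 if the samples sum to a positive value."""
    return 1 if sum(x_adv) > 0 else 0


def toy_fitness(x, p, target):
    """Lower is better: rewards pushing the sum toward the target class."""
    s = sum(xi + pi for xi, pi in zip(x, p))
    return -s if target == 1 else s


def coordinate_search(seeds, score, eps, step=0.05, sweeps=4):
    """Stand-in for the PSO subroutine: greedy coordinate nudges (not PSO)."""
    out = []
    for p in seeds:
        p = p[:]
        for _ in range(sweeps):
            for d in range(len(p)):
                for delta in (step, -step):
                    cand = p[:]
                    cand[d] = max(-eps, min(eps, cand[d] + delta))
                    if score(cand) < score(p):
                        p = cand
        out.append(p)
    return out


x = [-0.01, -0.01]
x_adv = siren_outer_loop(x, 1, toy_predict, toy_fitness, coordinate_search)
```

Passing the search routine in as a callable keeps the outer loop independent of the optimizer, so a real PSO run could be substituted for the stand-in without changing the seed-handling logic.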
Fig. 1. Performance of SIRENATTACK for every {source, target} pair on the Speech Commands Dataset against the CNN model.

We would further emphasize two key aspects of our algorithm: (1) We modify the PSO algorithm to globally keep track of the current saved best particle throughout all PSO iterations instead of using the standard PSO algorithm. (2) During each iteration, PSO aims to minimize an objective function defined as g(x + p_i). We experimented with several definitions of g(·) and found the following to be the most effective:
$$g(x + p_i) = \max\Big(\max_{j \neq t} O(x + p_i)_j - O(x + p_i)_t,\ \kappa\Big) \qquad (1)$$
where O(x + p_i)_j is the confidence value of label j for input x + p_i. Minimizing this function moves the particles toward positions that maximize the probability of the target label t. The parameter κ controls the confidence of the misprediction: a smaller κ means that the resulting adversarial audio is predicted as t with higher confidence. We set κ = 0 for SIRENATTACK. In addition, this function can be used to conduct untargeted attacks with trivial modifications.
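As a small worked example of Eq. (1), the sketch below evaluates the objective on a vector of per-label confidence values. The function name `siren_objective` and the example confidence vectors are assumptions made for illustration; only the formula itself comes from the text.

```python
def siren_objective(confidences, target, kappa=0.0):
    """Eq. (1): max(max_{j != t} O_j - O_t, kappa) for one confidence vector."""
    # Highest confidence among all labels other than the target.
    other_best = max(c for j, c in enumerate(confidences) if j != target)
    return max(other_best - confidences[target], kappa)


# Target label 1 is not yet the top prediction: the objective is positive.
not_yet = siren_objective([0.75, 0.25, 0.0], target=1)   # 0.5
# Target label 1 is the top prediction: the value is clipped at kappa = 0.
reached = siren_objective([0.25, 0.75, 0.0], target=1)   # 0.0
# A negative kappa keeps rewarding a larger margin for the target label.
margin = siren_objective([0.25, 0.75, 0.0], target=1, kappa=-0.25)  # -0.25
```

With κ = 0 the objective stops decreasing as soon as the target label has the highest confidence, whereas a negative κ keeps pushing the search until the target leads by a margin of |κ|.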
II. EXPERIMENTS AND CONCLUSION
We conducted black-box attacks in four different scenarios: speech command recognition, speaker recognition, audio scene classification, and music genre classification. Due to the page limit, we show only part of the experimental results. For the speech command recognition task, we evaluated SIRENATTACK on the Speech Commands Dataset [4] against the CNN described in [3] and six other state-of-the-art speech command recognition models, i.e., VGG19, DenseNet, ResNet18, ResNeXt, WideResNet18 and DPN-92. In addition, we use the signal-to-noise ratio (SNR) to evaluate the audio quality, which is calculated as follows:
$$\mathrm{SNR(dB)} = 10 \log_{10}\left(\frac{P_x}{P_\delta}\right) \qquad (2)$$
where x is the original audio waveform, δ is the added noise, and P_x and P_δ are the power of the original signal and the noise signal, respectively.
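Eq. (2) can be computed directly from the two waveforms, as in the sketch below. It assumes that the power of a signal is its mean squared amplitude, which the text does not spell out; the function name `snr_db` and the toy waveforms are illustrative.

```python
import math


def snr_db(x, delta):
    """Eq. (2), assuming signal power is the mean squared amplitude."""
    p_x = sum(v * v for v in x) / len(x)
    p_delta = sum(v * v for v in delta) / len(delta)
    return 10.0 * math.log10(p_x / p_delta)


# Noise at one tenth of the signal amplitude has 1/100 of its power,
# which corresponds to an SNR of about 20 dB.
signal = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1]
snr = snr_db(signal, noise)
```

A higher SNR means the added noise is weaker relative to the original audio, so the perturbation is harder to hear.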
[Figure 2: (a) Waveform, amplitude vs. time (s); (b) Spectrogram, frequency (kHz) vs. time (s) with power (dB). Each panel shows the original audio and the adversarial audio.]
Fig. 2. Comparison of the waveform and spectrogram between an original audio (upper graphs) and the adversarial counterpart (lower graphs) with δ = 100. The original transcription is “restart the phone” while the adversarial transcription is “open the front door”.
TABLE III. TRANSFERABILITY EVALUATION: EXAMPLE RESULTS.
Number   Original   Adversarial   ASR Platforms     Results
1        stop       no            Sphinx            no
2        off        on            IBM               on
3        down       no            Sphinx, Wit.ai    no
4        down       no            Wit.ai, Bing      no
5        go         no            Wit.ai            no
6        go         yes           Sphinx            yes
7        left       yes           Wit.ai, IBM       yeah
8        right      on            Google, Bing      play
Table I shows the main experimental results with δ = 800 and epoch_max = 300. Fig. 1 shows the pair-to-pair success rate of SIRENATTACK and the average time it takes to generate an adversarial audio. Fig. 2 shows the waveform and spectrogram of an example original audio and its adversarial counterpart. Furthermore, we evaluated the transferability of the generated adversarial audios (against the VGG19 model) and show the results in Table II. Some successfully transferred examples are shown in Table III. From the above experimental results we can see that (i) SIRENATTACK is effective against all the target models, even when the models have high performance on the legitimate datasets; (ii) the average time to generate an adversarial audio is very short; (iii) the noise in the generated adversarial audios is nearly negligible; and (iv) the generated adversarial audios are transferable to some extent.
REFERENCES
[1] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” in Proceedings of the Sixth International Symposium on Micro Machine and Human Science (MHS’95). IEEE, 1995, pp. 39–43.
[2] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015, pp. 1–11.
[3] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015, pp. 1478–1482.
[4] P. Warden, “Speech commands: A public dataset for single-word speech recognition.” Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz, 2017.
1. Zhejiang University 2. Georgia Institute of Technology 3. Lehigh University
Introduction
• Nowadays, deep learning-based acoustic systems are ubiquitous in our everyday lives, ranging from smart locks on mobile phones to speech assistants on smart home devices. However, deep neural networks (DNNs) are inherently vulnerable to adversarial inputs, which are maliciously crafted samples that trigger target models to misbehave [1].
• We present SirenAttack, a new class of adversarial attacks against deep neural network-based acoustic systems. SirenAttack differs from prior work in three significant ways. It is (i) versatile: SirenAttack is applicable to a range of end-to-end acoustic systems under both white-box and black-box settings; (ii) targeted: SirenAttack generates adversarial audios that trigger target systems to misbehave in a highly predictable manner (e.g., misclassifying the adversarial audio into a specific class); and (iii) evasive: SirenAttack is able to generate adversarial audios that human listeners cannot distinguish from their benign counterparts.
[1] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” ICLR, 2015.
[2] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” MHS, 1995.
[3] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” INTERSPEECH, 2015, pp. 1478–1482.
• We modified the PSO to globally keep track of the current saved best particle throughout all iterations.
• Objective function:
[Fig. 1 panels: Pair-to-pair Success Rate; Average Time (min)]
[Fig. 2 panels: Waveform; Spectrogram]
• Particle Swarm Optimization (PSO) [2] solves an optimization problem by iteratively making a population of candidate solutions move around in the search space according to their fitness values.