
Poster: SirenAttack: Generating Adversarial Audio for End-to-End Acoustic Systems

Tianyu Du∗, Shouling Ji∗†, Jinfeng Li∗, Qinchen Gu‡, Ting Wang§ and Raheem Beyah‡

∗ Institute of Cyberspace Research and College of Computer Science and Technology, Zhejiang University, Email: {zjradty, sji, lijinfeng0713}@zju.edu.cn

† Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies

‡ Georgia Institute of Technology, Email: [email protected], [email protected]

§ Lehigh University, Email: [email protected]

Abstract—In this poster, we present SIRENATTACK, a new class of attacks to generate adversarial audios. Compared with existing attacks, SIRENATTACK stands out for three significant features: it is versatile, targeted, and evasive. Experimental results on a set of state-of-the-art deep learning-based acoustic systems demonstrate the versatility, effectiveness, and stealthiness of SIRENATTACK.

I. INTRODUCTION, PRELIMINARY AND ATTACK DESIGN

Nowadays deep learning-based acoustic systems are ubiquitous in our everyday lives, ranging from smart locks on mobiles to speech assistants on smart home devices and to machine translation services on clouds. However, deep neural networks (DNNs) are inherently vulnerable to adversarial inputs, which are maliciously crafted samples that trigger target models to misbehave [2]. Despite the plethora of work in the image domain, research on adversarial attacks in the audio domain is still limited, due to a number of non-trivial challenges. First, acoustic systems need to deal with information changes in the time dimension, which is more complex than what image classification systems handle. Second, the audio sampling rate is usually very high, whereas images have only hundreds or thousands of pixels in total. Therefore, it is harder to craft adversarial audios than adversarial images, since adding slight noise to audios is less likely to impact the local features.

In this poster, we present SIRENATTACK, a new class of adversarial attacks against deep neural network-based acoustic systems. Compared with prior work, SIRENATTACK departs in significant ways: versatile – SIRENATTACK is applicable to a range of end-to-end acoustic systems under both white-box and black-box settings; targeted – SIRENATTACK generates adversarial audios that trigger target systems to misbehave in a highly predictable manner (e.g., misclassifying the adversarial audio into a specific class); and evasive – SIRENATTACK is able to generate adversarial audios indistinguishable from their benign counterparts to human perception.

SIRENATTACK is based on the Particle Swarm Optimization (PSO) algorithm [1]. PSO is a heuristic and stochastic algorithm that finds solutions to optimization problems by imitating the behavior of a swarm of birds. It can search a very large space of candidate solutions without requiring gradient information. At a high level, it solves an optimization problem by iteratively moving a population of candidate solutions (which we refer to as particles) around the search-space according to their fitness values. The fitness value of a particle is the result of evaluating the objective function at that particle’s position in the solution space. In each iteration, each particle’s movement is influenced by its local best position P_best, and is meanwhile guided toward the global best position G_best in the search-space. This iterative process is expected to move the swarm toward the best solution. Once a termination criterion is met, G_best should hold the solution for a local minimum.
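The particle movement described above can be sketched as a small gradient-free PSO loop. This is an illustrative sketch only, not the authors' implementation: the function name `pso_minimize`, the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the swarm size, the iteration count and the search bound are all assumptions chosen for the example, and the toy objective is a sphere function rather than an acoustic model. The sketch uses the common PSO form in which the freshly updated velocity moves the particle.

```python
import numpy as np


def pso_minimize(objective, dim, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, bound=1.0, seed=0):
    """Minimize `objective` over [-bound, bound]^dim with a basic particle swarm.

    All hyperparameter defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Positions start uniformly inside the search box; velocities start at zero.
    pos = rng.uniform(-bound, bound, size=(n_particles, dim))
    vel = np.zeros_like(pos)

    # Each particle remembers its own best position (P_best) and fitness.
    pbest = pos.copy()
    pbest_fit = np.array([objective(p) for p in pos])
    # The swarm shares one global best position (G_best).
    g_idx = int(np.argmin(pbest_fit))
    gbest, gbest_fit = pbest[g_idx].copy(), float(pbest_fit[g_idx])

    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Each velocity is pulled toward P_best and G_best; no gradient is used.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, -bound, bound)

        fit = np.array([objective(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]

        g_idx = int(np.argmin(pbest_fit))
        if pbest_fit[g_idx] < gbest_fit:
            gbest, gbest_fit = pbest[g_idx].copy(), float(pbest_fit[g_idx])

    return gbest, gbest_fit


def sphere(p):
    """Toy objective with its minimum (0) at the origin."""
    return float(np.sum(p ** 2))


best, best_fit = pso_minimize(sphere, dim=3)
print(best, best_fit)
```

Because only objective values are queried, the same loop applies when the objective wraps a model whose internals are not accessible.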

The detailed black-box attack is shown in Algorithm 1. To fool a machine learning model, we feed it with a legitimate audio x and the target output t. First, we initialize the epoch to zero and generate n randomized particle sequences (collectively referred to as seeds) from a uniform distribution (line 1). Then we run the PSO subroutine (line 3) with the target output t and the seeds. If any particle p_i produces the target output t when added to the original audio x, the attack succeeds (lines 4-5), and the particle p_i is the expected noise δ. Otherwise, we preserve the best particle, i.e., the one with the minimum fitness value in the current PSO run, as one of the seeds for the next PSO run (lines 7-8). The above steps iterate (lines 2-11) until the attack succeeds or the epoch count reaches epoch_max. If the attack succeeds, we obtain an adversarial audio x_adv that is predicted as t by the victim model.

Algorithm 1 SIRENATTACK under black-box settings
Input: Original audio x, target output t, n particles and epoch_max
Output: A targeted adversarial audio x_adv
1: Initialize epoch = 0 and seeds and set Eq. (1) as the objective function;
2: while epoch < epoch_max do
3:     Run PSO subroutine with t and seeds;
4:     if any particle produces target output t during PSO then
5:         Solution is found. Exit.
6:     else
7:         Clear seeds;
8:         seeds ⊇ best particle that produces the minimum value of Eq. (1) from the current PSO run;
9:     end if
10:     epoch = epoch + 1;
11: end while
12: Get adversarial audio x_adv with target label t.
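The outer loop of Algorithm 1 can be sketched in Python as follows. This is a rough sketch under stated assumptions, not the authors' code: `toy_model` is a hypothetical three-class stand-in for a black-box acoustic model, `noise_bound` (the range the particles are drawn from and clipped to), the PSO coefficients, and the default swarm and iteration sizes are all invented for illustration. Comments map statements to the line numbers of Algorithm 1.

```python
import numpy as np


def objective(probs, t, kappa=0.0):
    """Eq. (1): best non-target confidence minus target confidence, floored at kappa."""
    others = np.delete(probs, t)
    return max(float(others.max() - probs[t]), kappa)


def pso_run(predict_proba, x, t, seeds, n_particles, n_iters, noise_bound, rng,
            w=0.7, c1=1.5, c2=1.5):
    """One PSO run over candidate noises. Returns (success, best_particle)."""
    pos = rng.uniform(-noise_bound, noise_bound, size=(n_particles, x.shape[0]))
    k = min(len(seeds), n_particles)
    pos[:k] = seeds[:k]                      # reuse particles saved by earlier runs
    vel = np.zeros_like(pos)
    pbest, pbest_fit = pos.copy(), np.full(n_particles, np.inf)
    gbest, gbest_fit = pos[0].copy(), np.inf

    for _ in range(n_iters):
        for i in range(n_particles):
            probs = predict_proba(x + pos[i])
            if int(np.argmax(probs)) == t:   # the model outputs the target
                return True, pos[i].copy()
            fit = objective(probs, t)
            if fit < pbest_fit[i]:
                pbest_fit[i], pbest[i] = fit, pos[i]
            if fit < gbest_fit:
                gbest_fit, gbest = fit, pos[i].copy()
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, -noise_bound, noise_bound)
    return False, gbest


def siren_attack_black_box(predict_proba, x, t, n_particles=25, epoch_max=30,
                           n_iters=30, noise_bound=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Line 1: epoch = 0 and seeds drawn from a uniform distribution.
    seeds = rng.uniform(-noise_bound, noise_bound, size=(n_particles, x.shape[0]))
    for epoch in range(epoch_max):                        # lines 2, 10, 11
        success, best = pso_run(predict_proba, x, t, seeds,
                                n_particles, n_iters, noise_bound, rng)  # line 3
        if success:                                       # lines 4-5
            return x + best                               # line 12
        seeds = best[None, :]                             # lines 7-8
    return None                                           # budget exhausted


def toy_model(audio):
    """Hypothetical 3-class softmax model, queried only through its outputs."""
    logits = 4.0 * audio
    e = np.exp(logits - logits.max())
    return e / e.sum()


x = np.array([1.0, 0.0, 0.0])                 # the toy model labels this input 0
x_adv = siren_attack_black_box(toy_model, x, t=2)
print(np.argmax(toy_model(x)), None if x_adv is None else np.argmax(toy_model(x_adv)))
```

The sketch queries the model only through `predict_proba`, which is what makes it a black-box procedure; the seed carried between runs is the single best particle of the previous run, with the rest of the swarm redrawn at random.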

We would further emphasize two key aspects of our algorithm: (1) We modify the PSO algorithm to globally keep track

TABLE I. PERFORMANCE OF THE BLACK-BOX ATTACK.

Model         Accuracy  Success Rate  SNR (dB)  Time (s)
CNN           96.10%    95.25%        22.36     100.69
VGG19         91.39%    88.10%        18.22     332.26
DenseNet      94.93%    86.90%        15.34     458.13
ResNet18      92.06%    87.35%        15.87     340.31
ResNeXt       94.28%    90.05%        17.03     317.92
WideResNet18  90.80%    89.25%        17.57     368.29
DPN92         95.20%    83.60%        14.04     462.58

[Figure 1: two heatmaps over the ten speech commands (right, off, yes, up, stop, on, left, down, no, go), with the target label on the x-axis and the original label on the y-axis. (a) Success Rate, color scale 0.0 to 1.0. (b) Time (min), color scale 0 to 4.]

Fig. 1. Performance of SIRENATTACK for every {source, target} pair on the Speech Commands Dataset against the CNN model.

of the current saved best particle throughout all PSO iterations instead of using the standard PSO algorithm. (2) During each iteration, PSO aims to minimize an objective function defined as g(x + p_i). We experimented with several definitions of g(·) and found the following to be the most effective:

$$g(x + p_i) = \max\left(\max_{j \neq t} O(x + p_i)_j - O(x + p_i)_t,\ \kappa\right) \qquad (1)$$

where O(x + p_i)_j is the confidence value of label j for input x + p_i. Minimizing this function moves the particles toward positions that maximize the probability of the target label t. In addition, we can control the confidence of the misprediction with the parameter κ: a smaller κ means that the found adversarial audio will be predicted as t with higher confidence. We set κ = 0 for SIRENATTACK, but a side benefit of this formulation is that it allows one to control the desired confidence. This function can also be used to conduct untargeted attacks with trivial modifications.
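To make the role of κ concrete, Eq. (1) can be evaluated on a few hand-picked confidence vectors. This is a minimal sketch: the function name `siren_objective` and the example confidence values are assumptions made for illustration, and the confidences would normally come from the victim model's output for x + p_i.

```python
import numpy as np


def siren_objective(confidences, target, kappa=0.0):
    """Eq. (1): max(max_{j != t} O_j - O_t, kappa) for one confidence vector."""
    conf = np.asarray(confidences, dtype=float)
    others = np.delete(conf, target)          # confidences of all labels j != t
    return max(float(others.max() - conf[target]), kappa)


# Target label (index 1) is not yet the top prediction: the value is positive.
print(siren_objective([0.75, 0.125, 0.125], target=1))               # 0.625
# Target label wins: with kappa = 0 the value is floored at 0.
print(siren_objective([0.125, 0.75, 0.125], target=1))               # 0.0
# A negative kappa lowers the floor, so a larger winning margin is still rewarded.
print(siren_objective([0.125, 0.75, 0.125], target=1, kappa=-0.5))   # -0.5
```

With κ = 0 the objective stops decreasing as soon as the target label becomes the top prediction, whereas a more negative κ keeps rewarding a larger margin in favor of t.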

II. EXPERIMENTS AND CONCLUSION

We conducted black-box attacks in four different scenarios: speech command recognition, speaker recognition, audio scene classification and music genre classification. Due to the page limit, we only show part of the experimental results. For the speech command recognition task, we evaluated SIRENATTACK on the Speech Commands Dataset [4] against the CNN described in [3] and six other state-of-the-art speech command recognition models, i.e., VGG19, DenseNet, ResNet18, ResNeXt, WideResNet18 and DPN-92. In addition, we use the signal-to-noise ratio (SNR) to evaluate the audio quality, which is calculated as follows:

$$\mathrm{SNR(dB)} = 10 \log_{10}\left(\frac{P_x}{P_\delta}\right) \qquad (2)$$

where x is the original audio waveform, δ is the added noise, and P_x and P_δ are the power of the original signal and the noise signal, respectively.
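Eq. (2) can be computed directly from the two waveforms. The sketch below assumes that "power" means the mean squared amplitude of a signal, which the text does not spell out; the function name `snr_db` and the synthetic signals are likewise illustrative.

```python
import numpy as np


def snr_db(x, delta):
    """Eq. (2): 10 * log10(P_x / P_delta), with power taken as mean squared amplitude."""
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=float)
    p_x = np.mean(x ** 2)          # power of the original signal (assumed definition)
    p_delta = np.mean(delta ** 2)  # power of the added noise
    if p_delta == 0.0:
        raise ValueError("noise power is zero, SNR is unbounded")
    return 10.0 * np.log10(p_x / p_delta)


# Synthetic example: a 1 s, 440 Hz tone at 16 kHz plus noise at one tenth of its amplitude.
time = np.arange(16000) / 16000.0
signal = np.sin(2.0 * np.pi * 440.0 * time)
noise = 0.1 * np.sin(2.0 * np.pi * 1000.0 * time)
print(snr_db(signal, noise))       # close to 20 dB
```

For scale, an SNR of 22.36 dB (the CNN row of Table I) corresponds to a signal power roughly 170 times the noise power, since 10^(22.36/10) ≈ 172.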

[Figure 2: (a) Waveform: amplitude versus time (s) for the original audio and the adversarial audio. (b) Spectrogram: frequency (kHz) versus time (s), with a power/decade (dB) color scale, for the original audio and the adversarial audio.]

Fig. 2. Comparison of the waveform and spectrogram between an original audio (upper graphs) and the adversarial counterpart (lower graphs) with δ = 100. The original transcription is “restart the phone” while the adversarial transcription is “open the front door”.

TABLE II. TRANSFERABILITY EVALUATION RESULTS.

              Sphinx  Google  Bing    Houndify  Wit.ai  IBM
Success Rate  39.60%  10.00%  14.00%  12.80%    21.20%  20.40%

TABLE III. TRANSFERABILITY EVALUATION: EXAMPLE RESULTS.

Number  Original  Adversarial  ASR Platforms   Results
1       stop      no           Sphinx          no
2       off       on           IBM             on
3       down      no           Sphinx, Wit.ai  no
4       down      no           Wit.ai, Bing    no
5       go        no           Wit.ai          no
6       go        yes          Sphinx          yes
7       left      yes          Wit.ai, IBM     yeah
8       right     on           Google, Bing    play

Table I shows the main experimental results with δ = 800 and epoch_max = 300. Fig. 1 shows the pair-to-pair success rate and the average time to generate an adversarial audio with SIRENATTACK. Fig. 2 shows the waveform and spectrogram of an example original audio and its adversarial counterpart. Furthermore, we evaluated the transferability of the generated adversarial audios (crafted against the VGG19 model) and show the results in Table II. Some successfully transferred examples are shown in Table III. From the above experimental results we can see that (i) SIRENATTACK is effective against all the targeted models, with success rates from 83.60% to 95.25% in Table I, even when the models have high accuracy on the legitimate datasets; (ii) the average time to generate an adversarial audio is short, from about 101 s to about 463 s depending on the model; (iii) the noise in the generated adversarial audios is almost negligible; and (iv) the generated adversarial audios are transferable to some extent, with success rates from 10.00% to 39.60% in Table II.

REFERENCES

[1] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” in Proceedings of the Sixth International Symposium on Micro Machine and Human Science (MHS’95). IEEE, 1995, pp. 39–43.

[2] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015, pp. 1–11.

[3] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015, pp. 1478–1482.

[4] P. Warden, “Speech commands: A public dataset for single-word speech recognition.” Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz, 2017.


SirenAttack: Generating Adversarial Audio for End-to-End Acoustic Systems

Tianyu Du1, Shouling Ji1, Jinfeng Li1, Qinchen Gu2, Ting Wang3, Raheem Beyah2

1. Zhejiang University 2. Georgia Institute of Technology 3. Lehigh University

Introduction

• Nowadays, deep learning-based acoustic systems are ubiquitous in our everyday lives, ranging from smart locks on mobiles to speech assistants on smart home devices. However, deep neural networks (DNNs) are inherently vulnerable to adversarial inputs, which are maliciously crafted samples to trigger target models to misbehave [1].

• We present SirenAttack, a new class of adversarial attacks against deep neural network-based acoustic systems. Compared with prior work, SirenAttack departs in significant ways: versatile – SirenAttack is applicable to a range of end-to-end acoustic systems under both white-box and black-box settings; targeted – SirenAttack generates adversarial audios that trigger target systems to misbehave in a highly predictable manner (e.g., misclassifying the adversarial audio into a specific class); and evasive – SirenAttack is able to generate adversarial audios indistinguishable from their benign counterparts to human perception.

SirenAttack

• Particle Swarm Optimization (PSO) [2] solves an optimization problem by iteratively making a population of candidate solutions move around in the search-space according to their fitness values.

• Update the i-th particle’s velocity:

$$v_i^k = w v_i^{k-1} + c_1 r_1 \left(\mathit{Pbest}_i - x_i^{k-1}\right) + c_2 r_2 \left(\mathit{Gbest} - x_i^{k-1}\right)$$

• Update the i-th particle’s position:

$$x_i^k = x_i^{k-1} + v_i^{k-1}$$

• We modified the PSO to globally keep track of the current saved best particle throughout all iterations.

• Objective function:

$$g(x + p_i) = \max\left(\max_{j \neq t} O(x + p_i)_j - O(x + p_i)_t,\ \kappa\right)$$

Attack Evaluation

Dataset: Speech Commands

Targeted Model: CNN [3], VGG19, DenseNet, ResNet18, ResNeXt, WideResNet18, DPN92

Evaluation Metric:

$$\mathrm{SNR(dB)} = 10 \log_{10}\left(\frac{P_x}{P_\delta}\right)$$

[Figures: Pair-to-pair Success Rate; Average Time (min); Waveform; Spectrogram of the original audio and the adversarial audio, frequency (kHz) versus time (s) with a power/decade (dB) color scale.]

Transferability

• Offline VGG19 → Online ASR Platforms

[Table: Successful Examples]

Reference

[1] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” ICLR, 2015.

[2] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” MHS, 1995.

[3] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” INTERSPEECH, 2015, pp. 1478–1482.
