Top Banner
Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign
33

Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Dec 18, 2015

Download

Documents

Patience Mason
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Extracting Noise-Robust Features from Audio Data

Chris Burges, John Platt,

Erin Renshaw, Soumya Jana*

Microsoft Research

*U. Illinois, Urbana/Champaign

Page 2: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Outline

• Goals

• De-equalization and Perceptual Thresholding

• Oriented PCA

• Distortion Discriminant Analysis

• Results and demo

Page 3: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Audio Feature Extraction Goals

• Map high dim space of audio samples to low dim feature space

• Must be robust to likely input distortions

• Must be informative (different audio segments map to distant points)

• Must be efficient (use small fraction of resources on typical PC)

Page 4: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Audio Fingerprinting as a Test Bed

• Train: Given an audio clip, extract one or more feature vectors

• Test: Compare feature vectors, computed from an audio stream, with database of stored vectors, to identify the clip

• Clip is not changed, unlike watermarking

What is audio fingerprinting?

Page 5: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Why audio fingerprinting?

• Enable user to identify streaming audio (e.g. radio) in real time

• Track music play for royalty assignment

• Identify audio segments in e.g. television or radio: “Did my commercial air?”

• Search unlabeled data, e.g. on web: “Did someone violate my copyright?”

Page 6: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Goals, cont.

• Feature extraction: current audio features are often hand-crafted. Can we learn the features we need?

• Database must be able to handle 1000s of requests, and contain millions of clips

Page 7: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Outline

• Goals

• De-equalization and Perceptual Thresholding

• Oriented PCA

• Distortion Discriminant Analysis

• Results and demo

Page 8: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Feature Extraction

Output 64 floats every half-frame

Downsample to 11.025 KHz mono

Compute Robust ProjectionsCollate Samples

(Repeat)

De-EqualizePerceptual Thresholding

Compute windowed FFTcoefficients, take magnitude

Page 9: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

De-Equalization

Goal: Remove slow variation in frequency space

MCLT

Magnitude r

Phase

DCT(log r)

Apply highpass mask

exp(Inverse DCT)

Normalize Power

1.0

0.0

Page 10: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

De-Equalization

Before After

De-equalize by flattening the log spectrum. But does thisdistort? Can reconstruct signal by adding phases back in:

Page 11: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Outline

• Goals

• De-equalization and Perceptual Thresholding

• Oriented PCA

• Distortion Discriminant Analysis

• Results and demo

Page 12: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Oriented PCA

• Want projection so that the variance of the projected signal is maximized, and MSE of projected noise simultaneously minimized. [Daimantaras & Kung, ’96]

• Maximize Generalized Rayleigh coefficient

where C1 is the signal covariance, C2 the noise correlation.

Page 13: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Properties of Oriented PCA

Page 14: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Outline

• Goals

• De-equalization and Perceptual Thresholding

• Oriented PCA

• Distortion Discriminant Analysis

• Results and demo

Page 15: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Distortion Discriminant Analysis

DDA is a convolutional neural network whoseweights are trained using OPCA:

2048

372ms, step every 186ms

6.1s, step every 186ms

64 64

32

64

64

OPCA

OPCA

Page 16: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

The Distortions

Original

3:1 compress above 30dB

Light Grunge

Distort Bass

“Mackie Mid boost” filter

Old Radio

Phone

RealAudio© Compander

Page 17: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

DDA Training Data

• 50 20s audio segments, catted (16.7min)• Compute distortions• Solve generalize eigenvector problem

(after preprocessing)• Subtract mean projections of train data,

norm noise to unit variance• Repeat• Use 10 20s segments as validation set to

compute scale factor for final projections

Page 18: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Outline

• Goals

• De-equalization and Perceptual Thresholding

• Oriented PCA

• Distortion Discriminant Analysis

• Results and demo

Page 19: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

OPCA vs. Bark Band Averaging

Compare with a ‘standard’ approach:

• 23 ms windows, same preprocessing.• Use 10 Bark bands from 510 Hz to 2.7 KHz.• Average over bands to get 10-dim features.

Page 20: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

OPCA vs. Bark Averaging

0 5 10 15 20 25 300

0.2

0.4

0.6

0.8

1

1.2

1.4

Light GrungeDistort Bass

Phone

OPCA Bark

Noi

se-t

o-S

igna

l Var

ianc

e R

atio

Old Radio

Page 21: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

OPCA vs. Bark Averaging, cont.

0 5 10 15 20 25 300

0.001

0.002

0.003

0.004

0.005

0.006

0.007

0.008

0.009

0.01

3 to 1 compress 30dBMackie MidReal Compander

OPCA Bark

No

ise

-to

-Sig

na

l Va

ria

nce

Ra

tio

Page 22: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Summed Distortions

0 2 4 6 810

0

0.5

1

1.5

2

2.5

OPCABark

Projection ID

Su

mm

ed

Va

ria

nce

Ra

tios

Page 23: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

DDA Eigenspectrum

0 50 100 150 200 2500

200

400

600

800

1000

1200

1400

Eigenvalue Index

Gen

eral

ized

Eig

enva

lues

(Sig

nal-t

o-no

ise

rat

ios)

1st Layer2nd Layer

.....

Page 24: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

DDA for Alignment Robustness

Train second layer with alignmentdistortion:

0 20 40 60 80 100 120 140 160 1800

0.005

0.01

0.015

0.02

0.025

0.03

0.035

0.04

No Shift TrainingWith Shift Training

Alignment shift / ms

Min

Tar

get

/Tar

get s

q. d

ist

Min

Ta

rge

t/Non

Tar

get

sq.

dis

t

Page 25: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Test Results – Large Test Set

• 500 songs, 36 hours of music• One stored fingerprint computed for every song,

about 10s in, random start point• Test alignment noise, false positive and negative

rates, same with distortions• Define distance between two clips as minimum

distance between fingerprint of first, and all traces of the second

• Approx. 3.5 108 chances for a false positive

Page 26: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Test Results: Positives

0 50 100 150 200 250 300 350 400 450 5000

0.005

0.01

0.015

0.02

0.025

0.03

372ms steps186ms steps

Fingerprints, sorted by distance

No

rma

lize

d d

ista

nce

sq

ua

red

Page 27: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Test Results: Negatives

Smallest ‘negative’ score: 0.14, largest ‘positive’ score: 0.026

0 10 20 30 40 50 60 70 80 90 1000

0.05

0.1

0.15

0.2

0.25

0.3

0.35

Smallest 100 of 249,500 target/nontarget values

No

rma

lize

d d

ista

nce

sq

ua

red

Page 28: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Fit with parametric log-logistic model

1( )

1 exp( log( ) )P d x

x

log1

Py

P

-2 -1.5 -1 -0.5 0 0.5 1 1.5-15

-10

-5

0

5

10

15

DataLinear Fit

log(x)

Gives 2.3 fp’sper song for 106

fingerprints

Page 29: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Tests for Other Distortions

• Take first 10 test clips

• Distort each in 7 ways

• Further distort with 2% pitch-preserving time compression (distortion not in training set)

Page 30: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Test Results on Distorted Data

0 10 20 30 40 50 60 70 80 90 1000.22

0.24

0.26

0.28

0.3

0.32

0.34

0.36

0.38

Smallest 100 of 39,920 target /non-target values

0 10 20 30 40 50 60 70 800

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

No

rmal

ized

dis

tan

ce s

qu

are

d

Time compressed distortions, not sortedAll target/target values, sorted

Page 31: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Computational Cost

On a 750 MHz PIII, computing features on a stream and searching for fingerprints, using simple sequential scan, for 10,000 audio clips, takes 12MB memory and 10% of CPU.

Page 32: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Conclusions

• DDA features work well for audio fingerprinting. How about other audio tasks?

• A layered approach is necessary to keep DDA computation feasible.

• This is a convolutional neural net, with weights chosen by OPCA and linear transfer functions.

• Demo

http://research.microsoft.com/~cburges/tech_reports/dda.pdf

Page 33: Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Demo (1 min. 14 sec)

• Repeated musical phrase

• Voice

• Clean clip C

• C + equalization

• C + equalization + reverb

• C + equalization + reverb + repeat echo