LabROSA Research Overview
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, New York, NY, USA
[email protected] · http://labrosa.ee.columbia.edu/
2014-06-12

1. Music  2. Environmental sound  3. Speech Enhancement
LabROSA
• Getting information from sound
[Diagram: Information Extraction, drawing on Machine Learning and Signal Processing; domains: Speech, Music, Environment; tasks: Recognition, Retrieval, Separation]
1. Music Audio Analysis
• Trained classifiers for low-level information
  • notes, chords, beats, section boundaries
• E.g. polyphonic transcription
  • feature-agnostic
  • needs training data
Poliner & Ellis ’06
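The frame-classifier idea can be sketched as follows. This is a minimal illustration on synthetic spectrogram frames with off-the-shelf logistic regression, not the actual Poliner & Ellis '06 system (which trained SVM note classifiers on real spectrogram features); the note positions and data sizes here are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for spectrogram frames: 200 frames x 64 freq bins.
# Each frame may contain "note 0" (energy bump at bin 10) and/or
# "note 1" (bump at bin 30), plus noise.
n_frames, n_bins = 200, 64
labels = rng.integers(0, 2, size=(n_frames, 2))      # per-frame note on/off
frames = 0.1 * rng.standard_normal((n_frames, n_bins))
frames[:, 10] += 3.0 * labels[:, 0]
frames[:, 30] += 3.0 * labels[:, 1]

# One independent binary classifier per note, trained on raw frames
# ("feature agnostic": the model sees the whole frame, no hand-picked
# features -- but it therefore needs labeled training data).
classifiers = [LogisticRegression(max_iter=1000).fit(frames, labels[:, k])
               for k in range(2)]

# Transcription = stack of per-note frame-wise decisions (a piano roll).
piano_roll = np.stack([clf.predict(frames) for clf in classifiers], axis=1)
accuracy = (piano_roll == labels).mean()
```

The same pattern extends to 88 piano notes, chord labels, or beat/no-beat decisions: one classifier per target, applied frame by frame.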
Million Song Dataset
• Industrial-scale database for music information research
• Many facets:
  • Echo Nest audio features + metadata
  • Echo Nest "taste profile" user–song listen counts
  • Second Hand Song covers
  • musiXmatch lyric BoW
  • last.fm tags
• Now with audio?
  • resolving artist / album / track / duration against what.cd
Bertin-Mahieux, McFee
MIDI-to-MSD
• MIDI aligned to audio yields a ready-made transcription
• Can we find matches in large databases?
Raffel
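Matching a MIDI file against candidate audio tracks is typically framed as alignment: warp the two feature sequences onto each other and score the residual cost. A self-contained dynamic time warping sketch on toy one-hot "chroma" vectors (illustrative only, not the actual MIDI-to-MSD pipeline):

```python
import numpy as np

def dtw(cost):
    """Dynamic time warping over a pairwise cost matrix; returns total
    alignment cost and the optimal path as (i, j) index pairs."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]

# Toy "chroma" sequences: the audio is a time-stretched copy of the MIDI.
midi = np.eye(12)[[0, 0, 4, 4, 7, 7, 0]]           # C C E E G G C
audio = np.eye(12)[[0, 0, 0, 4, 4, 7, 7, 7, 0]]    # same notes, stretched
cost = 1.0 - midi @ audio.T                        # cosine-style distance
total, path = dtw(cost)
```

A low total cost flags a likely match; at database scale one would prune candidates with cheap hashes before running DTW.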
Singing ASR
• Speech recognition adapted to singing
  • needs aligned data
• Extensive work to line up scraped "acapellas" with the full mix
  • including jumps!
McVicar
Block Structure RPCA
• RPCA separates vocals and background based on low-rank optimization
  • single trade-off parameter
  • adjust based on higher-level musical features?
Papadopoulos
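The decomposition can be sketched as follows. This is a simplified alternating-minimization variant of RPCA (the literature usually solves the constrained problem with inexact ALM; here both proximal subproblems are just alternated on a relaxed objective), with `lam` allowed to vary per frame as a stand-in for the A-RPCA-style adaptive regularization; the matrix sizes and the λ values are illustrative:

```python
import numpy as np

def svt(X, tau):
    """Singular-value thresholding: prox operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(X, tau):
    """Elementwise soft thresholding: prox operator of tau * l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, lam, n_iter=100):
    """Split a magnitude spectrogram M into low-rank background L and
    sparse vocals S by alternating exact minimizations of
        0.5*||M - L - S||_F^2 + ||L||_* + ||lam * S||_1.
    `lam` may be a scalar or a per-column (per-frame) array: raising it
    on frames known to be purely instrumental suppresses spurious
    "vocal" energy there, which is the A-RPCA idea in miniature."""
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S, 1.0)
        S = soft(M - L, lam)
    return L, S

rng = np.random.default_rng(1)
low_rank = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 60))
sparse = 10.0 * (rng.random((40, 60)) < 0.05)
M = low_rank + sparse

# Constant lambda on "vocal" frames, larger lambda on frames flagged
# as purely instrumental by some voice-activity front end.
lam = np.full(60, 0.2)
lam[40:] = 1.0
L, S = rpca(M, lam)
```

The single scalar trade-off of plain RPCA becomes a vector here, which is exactly the hook for driving it from higher-level musical features.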
Table 1. Sound excerpts used for the evaluation and proportion of purely-instrumental segments (P.I.), in % of the whole excerpt duration.

No.   Name                                              % P.I.
1     Beatles, Sgt Pepper's Lonely Hearts Club Band      49.3
2     Beatles, With A Little Help From My Friends        13.5
3     Beatles, She's Leaving Home                        24.6
4     Beatles, A Day in The Life                         35.6
5,6   Puccini piece for soprano and piano                24.7
7     Pink Noise Party, Their Shallow Singularity        42.1
8     Bob Marley, Is This Love                           37.2
9     Doobie Brothers, Long Train Running                65.6
10    Marvin Gaye, Heard It Through The Grapevine        30.2
11    The Eagles, Take It Easy                           35.5
12    The Police, Message in a Bottle                    24.9
mixture is computed using a window length of 1024 samples with 75% overlap at a sampling rate of 11.5 kHz. No post-processing (such as masking) is added.

4.2. Results and Discussion
Fig. 2. Separation performance of the leading singing voice with the baseline method, for various values of λ, for the song Their Shallow Singularity.
Fig. 3. Separation performance for the background (left) and the singing voice (right) via, from top to bottom, the SDR, SIR, SAR and NSDR measures for each song. Constant λ = 1 (∗), adaptive λ = (1, 5) with prior ground-truth (•) and estimated (◦) voice activity location.
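As a reference for these measures, a simplified energy-ratio version of SDR and NSDR can be computed as below. This collapses the full BSS-eval decomposition (which projects the error onto target, interference, and artifact subspaces to obtain SIR and SAR separately) into a single distortion term; the test signals are synthetic:

```python
import numpy as np

def sdr(reference, estimate):
    """Simplified signal-to-distortion ratio in dB: energy of the
    reference over energy of the estimation error."""
    err = estimate - reference
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(err**2))

def nsdr(reference, estimate, mixture):
    """Normalized SDR: improvement of the estimate over simply using
    the raw mixture as the estimate."""
    return sdr(reference, estimate) - sdr(reference, mixture)

# Toy example: a sinusoidal "voice", an accompaniment, and an estimate
# that removed most but not all of the accompaniment.
t = np.linspace(0.0, 2.0 * np.pi * 10, 1000, endpoint=False)
voice = np.sin(t)
mixture = voice + np.cos(t)
estimate = voice + 0.1 * np.cos(t)

voice_sdr = sdr(voice, estimate)
improvement = nsdr(voice, estimate, mixture)
```

Published results use the full BSS-eval toolkit (or mir_eval); this proxy is only meant to make the dB numbers in Fig. 3 concrete.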
• Global separation results. As illustrated by Fig. 2, the quality of the separation with the baseline method [18] depends on the value of the regularization parameter. Moreover, the value that leads to the best separation quality differs from one music excerpt to another. Thus, when automatically processing a collection of music tracks, the choice of this value results from a trade-off. We report here results obtained with the typical choice λv = 1. In A-RPCA, this regularization parameter is further adapted to the music content based on prior music information. In all experiments, for a given constant value λv in the baseline method, setting λnv > λv in Eq. (7) improves the results.⁶ Results of the separation obtained with various configurations of the proposed model are described in Fig. 3. Using a musically-informed adaptive regularization parameter improves the separation results for both the background and the leading voice components. Note that the larger the proportion of purely-instrumental segments in a piece (see Tab. 1), the larger the improvement (see in particular pieces 1, 7, 8 and 9 in Fig. 2), which is consistent with the goal of the proposed method.

⁶ For lack of space, we do not report all of the experiments obtained with various values of λ.
There is however one drawback: improved SDR (better overall separation performance) and SIR (better capability of removing music interference from the singing voice) with A-RPCA are obtained at the price of introducing more artifacts in the estimated voice (lower SARvoice). Listening tests reveal that in some segments processed by A-RPCA, as for instance segment [1−1.15]m in Fig. 4, one can hear some high-frequency isolated coefficients superimposed on the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as proposed in [20].
• Ground truth versus estimated voice activity location. Imperfect voice activity location information still allows an improvement, although to a lesser extent than with ground-truth voice activity information. The decrease in the results mainly comes from background segments classified as vocal segments.
Fig. 4. Separated voice for various values of λ for the Pink Noise Party song Their Shallow Singularity. From top to bottom: clean voice, constant λ = 1, constant λ = 5, adaptive λ = (1, 5).
• Local separation results. It is interesting to note that using an adaptive regularization parameter in a unified analysis of the whole piece is different from separately analyzing vocal and purely-instrumental segments with different but constant values of λ. This is illustrated in the dashed rectangle areas of Fig. 4. Moreover, local results⁷ with the unified analysis show not only that the sparse components (singing voice) are limited in purely-instrumental segments, but also that the energy of the music background is better attenuated in the resynthesized voice in vocal segments (better local SIRvoice).
5. CONCLUSION

We have explored an adaptive version of the RPCA technique that allows the processing of entire pieces of music including local variations in the music content. Music content information is incorporated in the decomposition to guide the selection of coefficients in the sparse and low-rank layers according to the semantic structure of the piece. We have focused on a simple criterion (voice activity information), but the method could be extended with other criteria (singer identification, vibrato saliency, etc.). The method could be improved by incorporating additional information to set the regularization parameters differently for each track, to better accommodate the varying contrast of foreground and background. The idea of an adaptive decomposition could also be improved with a more complex formulation of RPCA that incorporates additional constraints [20] or a learned dictionary [46].
⁷ Due to space constraints, local BSS-eval results are not reported.
Ordinal LDA Segmentation

• Low-rank decomposition of skewed self-similarity to identify repeats
• Learned weighting of multiple factors to segment
• Linear Discriminant Analysis between adjacent segments
McFee
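The "skewed self-similarity" step can be sketched as follows: shear the self-similarity matrix into time-lag coordinates, where a repeat with period p becomes a high, near-constant column at lag p — exactly the structure a low-rank decomposition isolates. This toy example uses random feature frames with a planted repeat, not real chroma:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy feature sequence: an 8-frame pattern repeated 4 times (period 8),
# unit-normalized so dot products behave like cosine similarity.
pattern = rng.standard_normal((8, 12))
feats = np.tile(pattern, (4, 1))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

n = len(feats)
sim = feats @ feats.T            # ordinary self-similarity matrix

# Skew into time-lag coordinates: lag_sim[i, l] = sim[i, i + l].
max_lag = n // 2
lag_sim = np.zeros((n, max_lag))
for lag in range(max_lag):
    idx = np.arange(n - lag)
    lag_sim[idx, lag] = sim[idx, idx + lag]

period_score = lag_sim[:n - 8, 8].mean()   # similarity at the true period
off_score = lag_sim[:n - 5, 5].mean()      # similarity at a wrong lag
```

In the full system this lag matrix is fed to a low-rank factorization to extract repeat structure, which then becomes one of the learned segmentation factors.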
2. Environmental Sound
• Extracting useful information from soundtracks
• e.g. TRECVID Multimedia Event Detection (MED)
  • "Making a Sandwich", "Getting a Vehicle Unstuck"
  • 100 examples; find matches in 100k videos
  • manual annotations for ~10 h
E009 Getting a Vehicle Unstuck
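A common baseline for soundtrack-level event detection is a bag-of-audio-words pipeline: quantize per-frame features against a learned codebook and describe each clip by a fixed-length histogram. A sketch on random stand-in features (the feature dimension, codebook size, and clip counts are all illustrative, and this is a generic recipe rather than the lab's exact MED system):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Stand-in for per-frame soundtrack features (e.g. MFCCs): one matrix
# of shape (n_frames, n_dims) per video clip, with varying lengths.
clips = [rng.standard_normal((rng.integers(50, 150), 13)) for _ in range(5)]

# Learn a small acoustic codebook over all training frames.
codebook = KMeans(n_clusters=16, n_init=5, random_state=0)
codebook.fit(np.vstack(clips))

def bag_of_audio_words(frames, codebook):
    """Quantize every frame to its nearest codeword and return a
    normalized histogram: one fixed-length vector per clip, ready for
    any standard classifier (SVM, logistic regression, ...)."""
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

vectors = np.array([bag_of_audio_words(c, codebook) for c in clips])
```

The histogram representation is what makes variable-length soundtracks comparable: a 30-second clip and a 10-minute video both map to the same 16-dimensional vector.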
Foreground Event Recognition
• Transients = foreground events?
• Onset detector finds energy bursts
  • best SNR
• PCA basis to represent each
  • 300 ms × auditory freq
• "bag of transients"
Cotton, Ellis, Loui ’11
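The pipeline can be sketched with a crude energy-based onset detector followed by PCA over fixed-duration patches. The threshold factor, frame sizes, and test signal below are all illustrative choices, not those of Cotton, Ellis & Loui '11 (which used an auditory-frequency representation rather than raw waveform patches):

```python
import numpy as np

def detect_onsets(x, sr, frame=256, hop=128, k=3.0):
    """Flag frames where short-time energy jumps well above the previous
    frame's energy: a crude onset / energy-burst detector."""
    n_frames = 1 + (len(x) - frame) // hop
    energy = np.array([np.sum(x[i * hop : i * hop + frame] ** 2)
                       for i in range(n_frames)])
    onsets = [i for i in range(1, n_frames)
              if energy[i] > k * (energy[i - 1] + 1e-8)
              and energy[i] > 0.01]
    return np.array(onsets) * hop / sr   # onset times in seconds

# Synthetic test signal: quiet background with two loud 100 ms bursts.
sr = 8000
t = np.arange(2 * sr) / sr
x = 0.001 * np.sin(2 * np.pi * 100 * t)
for start in (0.5, 1.3):
    i = int(start * sr)
    x[i : i + sr // 10] += np.sin(2 * np.pi * 1000 * t[: sr // 10])

onset_times = detect_onsets(x, sr)

# Each onset gets a fixed-duration (here 300 ms) patch; PCA over the
# collected patches gives a compact basis, and each event's projection
# coefficients become one entry in the "bag of transients".
patches = np.array([x[int(round(o * sr)) : int(round(o * sr)) + int(0.3 * sr)]
                    for o in onset_times])
patches -= patches.mean(axis=0)
_, _, components = np.linalg.svd(patches, full_matrices=False)
```

Triggering only on energy rises means each transient is captured once, at its highest-SNR moment, rather than throughout its decay.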