Post on 26-Jul-2020

Nonnegative Tensor Factorization for Source Separation of Loops in Audio

Jordan B. L. Smith National Institute of Advanced Industrial Science and Technology (AIST), Japan

Masataka Goto National Institute of Advanced Industrial Science and Technology (AIST), Japan

Laboratoire de signaux et systèmes (L2S) & IRCAM, Paris

Introduction

Extracting loops from music

• In some musical styles, songs are built from loops. E.g.:

1. Collection of loops: A, B, C, D (Drum, Melody, Bass, FX)

2. Loops arranged to make a song

[Figure: loop layout from 0:00 to 1:00, with one row per loop (A, B, C, D) marking the bars in which that loop plays]

3. Song mixed down to audio

→ composition process →

Audio examples (and test data) all borrowed from [López-Serrano et al. 2016]

• Goal: decompose the audio signal to recover:

  • the layout of the song

  • the source-separated loops

← decomposition procedure ←

• Two previous approaches that inspired us:

• Fingerprint-based loop detection [López-Serrano et al. 2016]

  Inputs: original loops (A, B, C, D) + mixed audio

  Output: map of loop activations

• Iterative NMF [Seetharaman & Pardo 2016]

  Input: mixed audio

  Output: separated tracks, one per loop (A, B, C, D)

  Assumption: loops are introduced additively

• Our proposed system:

  Input: mixed audio

  Outputs: map of loop activations + separated tracks, one per loop (A, B, C, D)

• We attempt to solve both problems in one step, without assumption of additive layout

• We do so by extending nonnegative matrix factorization (NMF) to handle periodicity

Source separation using NMF*

NMF can handle many types of repetition:

• Steady-state notes: NMF with harmonic templates

• Note sequences repeated in time: NMFD with time-evolving templates [Smaragdis 2004]

• Transposed notes: NMF2D with transposed harmonic templates [e.g., FitzGerald, Cranitch & Coyle 2008]

• Periodicity (especially at downbeats): ...no nonnegative approach! NB: REPET, a median-filtering approach [Rafii, Liutkus, & Pardo 2014]

Method

Nonnegative tensor factorization

• Step 1: estimate downbeats [madmom, Böck et al. 2016]

• Step 2: stack the 2D spectrograms into a 3D volume (a “spectral cube”)

Detour: understanding the spectral cube

[Figure: the spectral cube, with axes Frequency, Bar number (time in piece), and Time in bar]

Visualizing a 3D volume: CT scan

• Bottom to top: low frequency to high

• Back to front: beginning to end of piece

• Left to right: beginning to end of a bar
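The stacking in Step 2 could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `build_spectral_cube`, the argument `frames_per_bar`, and the nearest-frame resampling of each bar to a fixed length are all assumptions made here, and the downbeat positions are taken as given frame indices rather than computed with madmom.

```python
import numpy as np

def build_spectral_cube(spectrogram, downbeat_frames, frames_per_bar):
    """Stack per-bar slices of a (freq x time) spectrogram into a 3D array.

    Returns an array of shape (n_freq, frames_per_bar, n_bars):
    frequency, time in bar, bar number (time in piece).
    """
    n_freq = spectrogram.shape[0]
    n_bars = len(downbeat_frames) - 1
    cube = np.zeros((n_freq, frames_per_bar, n_bars))
    for bar in range(n_bars):
        start, end = downbeat_frames[bar], downbeat_frames[bar + 1]
        # Assumption: pick evenly spaced frames inside the bar so that
        # bars of slightly different lengths still line up.
        idx = np.linspace(start, end, frames_per_bar, endpoint=False).astype(int)
        cube[:, :, bar] = spectrogram[:, idx]
    return cube

# Toy example: 5 frequency bins, 3 bars of 4 frames each.
spec = np.arange(5 * 12, dtype=float).reshape(5, 12)
cube = build_spectral_cube(spec, downbeat_frames=[0, 4, 8, 12], frames_per_bar=4)
```

With equal-length bars, as in the toy example, each slice `cube[:, :, b]` is simply the spectrogram of bar `b`.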

• Step 3: use nonnegative tensor factorization (NTF) to model the spectral cube

Nonnegative matrix factorization

• NMF: X ≈ W ◦ H, where X is M × N, W is M × r, and H is r × N

• W = note templates

• H = activation functions

• Needs post-processing to separate sources:

  • which templates in W belong to the same source?

  • different sources could use the same harmonic components!
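For readers unfamiliar with NMF, here is a minimal sketch of the factorization X ≈ W ◦ H. The slides do not say which update rule is used, so this example assumes the standard multiplicative updates for the Euclidean (Frobenius) cost; the function name `nmf` and its parameters are illustrative only.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative (M x N) matrix X into W (M x rank) and H (rank x N)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, rank)) + eps   # note templates
    H = rng.random((rank, N)) + eps   # activation functions
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data: an exactly rank-3 nonnegative matrix.
rng = np.random.default_rng(1)
X = rng.random((20, 3)) @ rng.random((3, 30))
W, H = nmf(X, rank=3)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Nothing in W or H says which templates belong to which source, which is the post-processing problem noted above.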

Nonnegative tensor factorization

• Tucker decomposition: X ≈ C ◦ (W ◦ H ◦ D)

• W = note templates

• H = activation functions (time-in-bar)

• D = loop activation functions (time-in-piece)

• C = core tensor = recipe for each loop type

[Figure: Tucker decomposition of the spectral cube 𝓧, with dimensions labeled M, P, and Q]
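Written out entry by entry, the Tucker model can be read as a triple sum. The index layout below is an assumption chosen to match the slicing notation C[:,:,k] and D[k,:] used later in the slides: m indexes frequency, q time in bar, p bar number, and i, j, k index note templates, time-in-bar activations, and loop types.

```latex
\mathcal{X}(m, q, p) \;\approx\; \sum_{i} \sum_{j} \sum_{k}
    C(i, j, k)\, W(m, i)\, H(j, q)\, D(k, p)
```

Setting the number of loop types to one and D to a row of ones reduces this to a single shared bar pattern repeated in every bar.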

Interpreting the NTF model

• W, H, and D all musically intuitive:

[Figure: loop layout with one row per loop (A, B, C, D)]

• Loop template activations directly estimate layout of song

• Core tensor C = recipe for each loop type

• Pixel C(i, j, k) tells us to play note w_i with activation function h_j whenever loop d_k appears.

• Example loop recipe: (w_4, h_7) + (w_11, h_10) + (w_24, h_16)

• To recover entire spectrogram: C ◦ (W ◦ H ◦ D)

• To recover individual loop source: C[:,:,k] ◦ (W ◦ H ◦ D[k,:])
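The two reconstruction formulas can be illustrated with `numpy.einsum`. This is a sketch on random factors, not a fitted model: the shapes, the variable names, and the helper `reconstruct` are assumptions made for illustration, with D stored as (loop types × bars) to match the D[k,:] notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames, n_bars = 6, 8, 5        # spectral cube dimensions
n_notes, n_rhythms, n_loops = 4, 3, 2     # core tensor dimensions

W = rng.random((n_freq, n_notes))               # note templates
H = rng.random((n_rhythms, n_frames))           # time-in-bar activations
D = rng.random((n_loops, n_bars))               # time-in-piece loop activations
C = rng.random((n_notes, n_rhythms, n_loops))   # core tensor (loop recipes)

def reconstruct(C, W, H, D):
    """Return the (freq, time-in-bar, bar) cube modeled by C, W, H, D."""
    return np.einsum("ijk,mi,jq,kp->mqp", C, W, H, D)

# Entire spectrogram (as a spectral cube).
full = reconstruct(C, W, H, D)

# Individual loop sources: keep only slice k of the core and row k of D.
loops = [reconstruct(C[:, :, k:k + 1], W, H, D[k:k + 1, :])
         for k in range(n_loops)]
```

Because the model is linear in the loop index k, the per-loop cubes add up to the full reconstruction.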

Evaluation

• We used synthetic data [López-Serrano et al. 2016]

  • 7 sets of loops × 3 different layouts (arrangements)

• Algorithm output 1: separated signals

  • Evaluate quality with SDR, SIR, SAR (estimated source tracks vs. stem tracks)

• Algorithm output 2: loop layout

  • Evaluate accuracy with correlation (estimated map vs. ground truth map)
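The slides do not spell out how the layout correlation is computed. One plausible reading, sketched below, is a Pearson correlation between the flattened estimated and ground-truth maps; the function name `layout_correlation` is hypothetical, and the sketch assumes the rows of the two maps have already been matched loop to loop.

```python
import numpy as np

def layout_correlation(estimated, reference):
    """Pearson correlation between two (n_loops x n_bars) layout maps."""
    est = np.asarray(estimated, dtype=float).ravel()
    ref = np.asarray(reference, dtype=float).ravel()
    return float(np.corrcoef(est, ref)[0, 1])

# Toy ground truth: 3 loops over 4 bars (1 = loop active in that bar).
reference = np.array([[1, 1, 1, 1],
                      [0, 1, 1, 0],
                      [0, 0, 1, 1]])

# An estimate that is a scaled and offset copy of the ground truth.
estimated = 0.9 * reference + 0.05
score = layout_correlation(estimated, reference)

# An estimate that is active exactly where the ground truth is not.
inverted_score = layout_correlation(1 - reference, reference)
```

Correlation ignores overall scale and offset, so an activation map only needs the right shape, not the right amplitude, to score well.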

Good separation example

• When it works, it works

Collection of loops for genre “Acid”: Drum, Melody, Bass, FX

Extracted loops: 1, 2, 3, 4

Flawed separation example

Original tracks for genre “Brezo”

Source separated tracks

[Figure: two loop layouts, each with one row per loop (A, B, C, D); the second layout shows fewer activations in rows A and B]

[Figure: sequence of layout diagrams relating the two layouts through the steps “swap rows” and “substitute C=CA”]

[Figure: SDR, SIR, and SAR for the proposed method, [Seetharaman & Pardo 2016], and a performance ceiling]

• SDR: Our reconstruction quality is average. :-|

• SAR: We have more noisy artifacts. :-(

• SIR: We have less crosstalk than others! :-D

[Figure: correlation (0.0 to 1.0) between estimated and ground-truth layouts]

• We get very clean layouts! :-D

Conclusion

• Proposed method of decomposing audio into loops that:

  • Models periodicity using the spectral cube

  • Models source signals and song composition jointly

  • Uses a Tucker decomposition that is musically intuitive

• Weaknesses include:

  • Very conservative reconstructions don’t model the whole signal

  • Like NMFD, we cannot distinguish between algebraically equivalent decompositions

• Future work: searching for repetitions at multiple hierarchical time scales

Future work: hierarchical analysis

• Different loops in the song have different lengths and periods

• Spectral cubes with different periods highlight different consistent repetitions

[Figure: spectral cubes computed with periods of 2 beats, 1 downbeat, 2 downbeats, and 4 downbeats]

Thank you!

PS. Jordan is now at:
