Transcript
R. GRIBONVAL, London Workshop on Sparse Signal Processing, September 2016

Slide 1. SPARS 2017: Signal Processing with Adaptive Sparse Structured Representations
Lisbon, Portugal, June 5-8, 2017
Submission deadline: December 12, 2016
Notification of acceptance: March 27, 2017
Summer School: May 31-June 2, 2017 (tbc)
Workshop: June 5-8, 2017
spars2017.lx.it.pt
[Table 1: Comparison between our method and an EM algorithm; n = 20, k = 10, m = 1000.]

[Figure 3: Left: example of data and sketch for n = 2 (N = 1000, m = 60; sketch z ∈ R^m computed by the sketching operator M; ground truth vs. centroids estimated by the recovery algorithm, estimate Â). Right: reconstruction quality for n = 10 (panel title "n=10, Hell. for 80%": Hellinger distance reached in 80% of trials, plotted against sketch size m and k·n/m).]
Slide 12. Computational impact of sketching
Ph.D. A. Bourrier & N. Keriven
[Panels: computation time and memory vs. collection size N.]
[Figure 7: Time (top) and memory (bottom) usage of all algorithms on synthetic data with dimension n = 10, number of components K = 5 (left) or K = 20 (right), and number of frequencies m = 5(2n + 1)K, with respect to the number of items N in the database. Compared methods: sketching without distributed computing (CLOMP, CLOMPR, BS-CGMM, CHS) vs. EM; the memory panels contrast the sketch plus frequencies (compressive methods) with the full data (EM). Axes: time in seconds and memory in bytes vs. N from 10^2 to 10^6.]
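The memory gap in Figure 7 comes from the sketch size being independent of N. A minimal back-of-the-envelope check in Python, assuming complex double-precision sketch entries and real double-precision data (the storage layout is an assumption, not taken from the talk):

n, K, N = 10, 20, 10**6            # dimension, components, collection size
m = 5 * (2 * n + 1) * K            # number of frequencies, as in Figure 7: 2100
sketch_bytes = m * 16 + m * n * 8  # complex sketch entries + real frequency vectors
data_bytes = N * n * 8             # full collection kept in memory for EM
print(f"{sketch_bytes:.1e} vs {data_bytes:.1e}")  # 2.0e+05 vs 8.0e+07 bytes

These ~10^5 vs ~10^8 byte orders of magnitude match the memory panels of the figure.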
… distributions for further work. In this configuration, the speaker verification results will indeed be far from state-of-the-art but, as mentioned before, our goal is mainly to test our compressive approach on a different type of problem than that of GMM estimation on synthetic data, for which we have already observed excellent results.
In the GMM-UBM model, each speaker S is represented by one GMM $(\Theta_S, \alpha_S)$. The key point is the introduction of a model $(\Theta_{\mathrm{UBM}}, \alpha_{\mathrm{UBM}})$ that represents a "generic" speaker, referred to as the Universal Background Model (UBM). Given speech data $\mathcal{X}$ and a candidate speaker S, the statistic used for hypothesis testing is a likelihood ratio between the speaker and the generic model:

$$T(\mathcal{X}) = \frac{p_{\Theta_S, \alpha_S}(\mathcal{X})}{p_{\Theta_{\mathrm{UBM}}, \alpha_{\mathrm{UBM}}}(\mathcal{X})}. \quad (23)$$

If $T(\mathcal{X})$ exceeds a threshold $\tau$, the data $\mathcal{X}$ are considered as being uttered by the speaker S.
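A minimal numerical sketch of the likelihood-ratio test (23), assuming scikit-learn; the data and model sizes are illustrative, and the speaker model is refit from scratch here rather than adapted from the UBM by one M-step as in the actual procedure described just below:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.normal(size=(5000, 10))         # stand-in for many-speaker data
speaker_train = rng.normal(0.5, 1.0, (500, 10))  # stand-in for one speaker's data
test_frames = rng.normal(0.5, 1.0, (200, 10))    # utterance to verify

ubm = GaussianMixture(n_components=8, random_state=0).fit(background)
spk = GaussianMixture(n_components=8, random_state=0).fit(speaker_train)

# log T(X) = sum over frames of [log p_speaker - log p_UBM]
log_T = (spk.score_samples(test_frames) - ubm.score_samples(test_frames)).sum()
accept = log_T > 0.0   # threshold tau, expressed in the log domain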
The GMMs corresponding to each speaker must somehow be "comparable" to each other and to the UBM. Therefore, the UBM is learned prior to the individual speaker models, using a large database of speech data uttered by many speakers. Then, given training data $\mathcal{X}_S$ specific to one speaker, one M-step of the EM algorithm initialized with the UBM is used to adapt the UBM and derive the model $(\Theta_S, \alpha_S)$. We refer the reader to [51] for more details on this procedure.
In our framework, the EM or compressive estimation algorithms are used to learn the UBM.
5.2 Setup
The experiments were performed on the classical NIST05 speaker verification database. Both training and testing fragments are five-minute conversations between two speakers. The database contains approximately 650 speakers and 30,000 trials.
Slide 13. The Sketch Trick
Data distribution: X ~ p(x).
Diagram: in signal processing (inverse problems, compressive sensing), a signal x in signal space is mapped by M to an observation y in observation space; in machine learning (method of moments, compressive learning), a distribution p in probability space is mapped by M to a sketch z in sketch space.
Linear "projection":

$$z_\ell = \int h_\ell(x)\, p(x)\, dx = \mathbb{E}\, h_\ell(X) \approx \frac{1}{N} \sum_{i=1}^{N} h_\ell(x_i).$$

Nonlinear in the feature vectors, linear in the distribution p(x).
A finite-dimensional Mean Map Embedding, cf. Smola et al. 2007, Sriperumbudur et al. 2010.
Information preservation?
Slide 14. The Sketch Trick (continued)
Same diagram and sketch formula as Slide 13. Dimension reduction? (A minimal numerical sketch of the embedding follows.)
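A minimal numerical sketch of the empirical mean-map embedding above; the choice of feature functions h_ℓ (random cosines) and their random draw are assumptions made only for illustration:

import numpy as np

rng = np.random.default_rng(0)
N, n, m = 1000, 2, 60                  # collection size, dimension, sketch size
X = rng.normal(size=(N, n))            # the collection x_1, ..., x_N

W = rng.normal(size=(m, n))            # random directions (assumed Gaussian)
b = rng.uniform(0, 2 * np.pi, size=m)  # random phases
# z_l = (1/N) sum_i h_l(x_i) with h_l(x) = cos(<w_l, x> + b_l): nonlinear in
# each sample x_i, but linear in the empirical distribution of the collection.
z = np.cos(X @ W.T + b).mean(axis=0)   # z in R^m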
Compressive Learning (Heuristic) Examples
Slide 16. Compressive Machine Learning
Point cloud X = empirical probability distribution.
Reduce collection dimension ~ sketching:

$$z_\ell = \frac{1}{N} \sum_{i=1}^{N} h_\ell(x_i), \qquad 1 \le \ell \le m.$$

Sketching operator M, sketch z ∈ R^m.
Choosing an information-preserving sketch?
Slide 17. Example: Compressive K-means
Goal: find k centroids.
Standard approach: K-means.
Sketching approach: p(x) is spatially localized, so "incoherent" sampling is needed; choose Fourier sampling, i.e. sample the characteristic function, then choose the sampling frequencies (a minimal numerical illustration follows).
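A minimal illustration of the Fourier sampling step above: the sketch samples the empirical characteristic function at m frequencies. The Gaussian draw of the frequencies is a simplifying assumption, and the recovery of the centroids from z (by greedy algorithms such as CLOMP/CLOMPR, cf. Figure 7) is not shown:

import numpy as np

rng = np.random.default_rng(1)
centroids = np.array([[-2.0, 0.0], [3.0, 1.0]])   # k = 2 ground-truth centroids
labels = rng.integers(0, 2, size=1000)
X = centroids[labels] + 0.3 * rng.normal(size=(1000, 2))   # N = 1000 points

m = 60
Omega = rng.normal(size=(m, 2))                   # sampling frequencies omega_l
# z_l = (1/N) sum_i exp(1j * <omega_l, x_i>): empirical characteristic function
z = np.exp(1j * X @ Omega.T).mean(axis=0)         # sketch z in C^m
# All N points are now summarized by m = 60 complex numbers, whatever N is.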
[Embedded paper: "Compressive Gaussian Mixture Estimation", Anthony Bourrier (1,2), Rémi Gribonval (2), Patrick Pérez (1); affiliation 1: Technicolor, 975 Avenue des Champs Blancs, 35576 Cesson-Sévigné, France.]
Slide 31. Summary: Compressive K-means / GMM
✓ Dimension reduction
✓ Resource efficiency
[Figure 7 reproduced: time and memory usage vs. collection size N; see Slide 12.]
✓ In the pipe: information preservation (generalized RIP, "intrinsic dimension")
✓ Neural-net-like sketch z
• Challenge: provably good recovery algorithms?

Conclusion
Slide 33. Projections & Learning
Compressive sensing (signal processing): random projections of data items; M maps a signal x in signal space to an observation y in observation space. Effect: reduce the dimension of data items.
Compressive learning with sketches (machine learning): random projections of collections; M maps a distribution p in probability space to a sketch z in sketch space, nonlinear in the feature vectors but linear in their probability distribution. Effect: reduce the size of the collection.
Slide 34. Challenge: compress before learning?
Example: on the Amazon graph (10^6 edges), a 5x speedup (3 hours instead of 15 hours for k = 500 classes).
[Embedded slide: N. Tremblay, "Graph signal processing for clustering", Rennes, January 13, 2016. What's the point of using a graph? N points in d = 2 dimensions: the result with k-means (k = 2), then the result after creating a graph from the N points' interdistances and running the spectral clustering algorithm (with k = 2).]
Complexity: O(k^2 log^2 k + N(log N + k)) instead of O(k^2 N). A minimal illustration of the k-means vs. spectral clustering contrast follows.
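A minimal illustration of why a graph helps, on data where k-means fails (two concentric circles); scikit-learn is assumed, and this uses the standard spectral clustering algorithm, not the compressive variant discussed in the talk:

from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Graph built from inter-point distances (k-nearest-neighbor affinity),
# then spectral clustering on that graph:
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)
# k-means splits the circles with a straight line; spectral clustering
# recovers each circle as one cluster.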
Slide 35. Recent / ongoing work / challenges
• When is information preserved with sketches / projections? Bourrier et al., "Fundamental performance limits for ideal decoders in high-dimensional linear inverse problems", IEEE Transactions on Information Theory, 2014. Notion of instance-optimal decoders = uniform guarantees; fundamental role of a general Restricted Isometry Property.
• How to reconstruct: algorithm / decoder? Traonmilin & Gribonval, "Stable recovery of low-dimensional cones in Hilbert spaces: One RIP to rule them all", ACHA 2016. RIP guarantees for general (convex & nonconvex) regularizers.
• How to (maximally) reduce dimension? [Dirksen 2014]: given a random sub-gaussian linear form. Puy et al., "Recipes for stable linear embeddings from Hilbert spaces to ℝ^m", arXiv:1509.06947. Role of the covering dimension / Gaussian width of the normalized secant set.
• What is the achievable compression for learning tasks? Compressive statistical learning, work in progress with G. Blanchard, N. Keriven, Y. Traonmilin. Number of random moments = "intrinsic dimension" of PCA, k-means, dictionary learning, ... Statistical learning: risk minimization + generalization to future samples with the same distribution.
Guarantees for the decoder

$$\Delta(y) := \operatorname*{argmin}_{x \in \mathcal{H}} f(x) \quad \text{s.t.} \quad \|Mx - y\| \le \epsilon \,?$$
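With f(x) = ||x||_1 this decoder is basis pursuit denoising; a minimal sketch, assuming cvxpy is available (the sizes, sparsity level, and noise model are illustrative):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 100, 40, 1e-2
M = rng.normal(size=(m, n)) / np.sqrt(m)            # random Gaussian measurements
x_true = np.zeros(n)
x_true[rng.choice(n, size=5, replace=False)] = 1.0  # 5-sparse signal
noise = rng.normal(size=m)
y = M @ x_true + (0.5 * eps / np.linalg.norm(noise)) * noise  # ||noise|| < eps

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [cp.norm2(M @ x - y) <= eps])
prob.solve()
x_hat = x.value   # stable recovery of x_true under a RIP-type assumption on M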
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
TH###NKS# Lisbon, Portugal June 5-8, 2017
SPARS 2017 Signal Processing with Adaptive Sparse Structured Representations
Submission deadline: December 12, 2016 Notification of acceptance: March 27, 2017 Summer School: May 31-June 2, 2017 (tbc) Workshop: June 5-8, 2017