Igh maa-2015 nov

Post on 13-Apr-2017

Iterated geometric harmonics for missing data recovery

Jonathan A. Lindgren, Erin P. J. Pearse, and Zach Zhang
{jlindgre, epearse, zazhang}@calpoly.edu

California Polytechnic State University
San Luis Obispo, CA

Nov. 14, 2015

Motivation: the missing data problem

Introduction and background

The missing data problem

Missing data is often a problem. Data can be lost
- while recording measurements,
- during storage or transmission,
- due to equipment failure,
- ...

Existing techniques:
- require some records (rows) to be complete, or
- require some characteristics (columns) to be complete, or
- are based on linear regression.
(But data often has highly nonlinear internal structure!)

A dataset is a collection of vectors, stored as a matrix

The data is an n × p matrix. Each row is a vector of length p; one row is a record and each column is a parameter or coordinate.

[Figure: an n × p matrix with n records (rows) and p characteristics (columns); one row is marked "one record".]

EXAMPLES

- 36 photos, each of size 112 pixels × 92 pixels: $\{v_k\}_{k=1}^{36} \subseteq \mathbb{R}^{10{,}304}$. (Each photo is stored as a vector.)
- Results from a psychology experiment, a 50-question exam given to 200 people: $\{v_k\}_{k=1}^{200} \subseteq \mathbb{R}^{50}$.
- 3000 student records (SAT, ACT, GPA, class rank, etc.): $\{v_k\}_{k=1}^{3000} \subseteq \mathbb{R}^{20}$.
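As a small illustration of the first example, the sketch below flattens a stack of images into an n × p data matrix. The random array `photos` is a hypothetical stand-in for real image data; only the sizes (36 photos of 112 × 92 pixels) come from the slide.

```python
import numpy as np

# Hypothetical stand-in for 36 grayscale photos of 112 x 92 pixels.
rng = np.random.default_rng(0)
photos = rng.random((36, 112, 92))

# Flatten each photo into one row: n = 36 records, p = 112 * 92 coordinates.
data = photos.reshape(photos.shape[0], -1)

n, p = data.shape
print(n, p)
```

Each row of `data` is one record, and reshaping a row back to 112 × 92 recovers the corresponding photo.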

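Special case of the missing data problem

Suppose all missing data are in one column (□ marks a missing entry):

$$\begin{bmatrix} v_1 & \square \\ v_2 & f_2 \\ v_3 & \square \\ \vdots & \vdots \\ v_n & f_n \end{bmatrix}$$

Consider the last column as a function f : {1, 2, ..., n} → R.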

Out-of-sample extension of an empirical function

Idea: A function f is defined on a subset Γ of the dataset.

f : Γ → Y, where Γ ⊆ R^p is the set where the value of f is known.
Want to extend f to F : X → Y so that F|_Γ(x) = f(x), for x ∈ Γ.

Application: The data is a sample {(x, f(x))}_{x∈Γ}.

Example: X may be a collection of images or documents, Y = R.

Want to generalize to as-yet-unseen instances in X.

“function extension” ←→ “automated sorting”
⟹ machine learning/manifold learning

A solution: Geometric harmonics

The network model associated to a dataset

Similarities within data are modeled via nonlinearity

Introduce a nonlinear kernel function k to model the similarity between two vectors.

$$k(v,u) \begin{cases} \approx 0, & v \text{ and } u \text{ very different} \\ \approx 1, & v \text{ and } u \text{ very similar} \end{cases}$$

Two possible choices of such a kernel function:

$$k(v,u) = \begin{cases} e^{-\|v-u\|_2^2/\varepsilon} \\ |\mathrm{Corr}(v,u)|^m \end{cases}$$
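The two kernel choices above could be written as follows. This is a minimal sketch: the function names and the default values of `eps` and `m` are illustrative assumptions, and Corr is taken to be the Pearson correlation of the vector entries.

```python
import numpy as np

def gaussian_kernel(v, u, eps=1.0):
    """k(v, u) = exp(-||v - u||_2^2 / eps)."""
    diff = np.asarray(v, dtype=float) - np.asarray(u, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / eps))

def correlation_kernel(v, u, m=2):
    """k(v, u) = |Corr(v, u)|^m, with Corr assumed to be Pearson correlation."""
    corr = np.corrcoef(v, u)[0, 1]
    return float(abs(corr) ** m)

v = np.array([1.0, 2.0, 3.0, 4.0])
near = v + 0.01                          # almost the same vector
far = v + 10.0                           # far away in Euclidean distance
w = np.array([1.0, -1.0, -1.0, 1.0])     # uncorrelated with v

print(gaussian_kernel(v, near), gaussian_kernel(v, far))
print(correlation_kernel(v, 2 * v + 1), correlation_kernel(v, w))
```

The two kernels measure different things: `far` is very dissimilar to `v` under the Gaussian kernel, but it is perfectly correlated with `v`, so the correlation kernel would rate the pair as very similar.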

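Convert the dataset into a network

Goal: replace the original dataset in R^{n×p} with a similarity network.
- Network = connected weighted undirected graph.
- Similarity network = weights represent similarities.

Vector v_i ⟶ vertex v_i in the network.

[Figure: the kernel k maps the rows v1, v2, v3, v4 to a weighted graph with edges v1-v2 (weight 4), v1-v3 (weight 2), v2-v3 (weight 3), and v3-v4 (weight 1).]

Adjacency matrix, with rows and columns indexed by v1, ..., v4:

$$K = \begin{pmatrix} 0 & 4 & 2 & 0 \\ 4 & 0 & 3 & 0 \\ 2 & 3 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$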

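Efficiency gain: n × p data matrix ↦ n × n adjacency matrix

$$\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} \xrightarrow{\ k\ } K = \begin{pmatrix} 0 & 4 & 2 & 0 \\ 4 & 0 & 3 & 0 \\ 2 & 3 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad K_{i,j} := k(v_i, v_j)$$

Advantageous for high-dimensional datasets: p ≫ n.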
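To make the size reduction concrete, here is a sketch that builds the n × n matrix K from an n × p data matrix. It assumes the Gaussian kernel; the helper name, the random data, and the choice `eps = 2 * p` are illustrative only.

```python
import numpy as np

def gaussian_kernel_matrix(data, eps):
    """Return the n x n matrix K with K[i, j] = exp(-||v_i - v_j||^2 / eps)."""
    sq_norms = np.sum(data**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * data @ data.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / eps)

rng = np.random.default_rng(0)
n, p = 5, 1000                    # few records, many coordinates (p >> n)
data = rng.normal(size=(n, p))    # hypothetical dataset
K = gaussian_kernel_matrix(data, eps=2.0 * p)

print(data.shape, "->", K.shape)
```

Everything downstream works with the 5 × 5 matrix `K` rather than the 5 × 1000 data matrix. With this kernel the diagonal entries equal 1, whereas the small adjacency matrix in the example above uses a zero diagonal.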

Geometric harmonics

Coifman and Lafon introduced the machine learning tool “geometric harmonics” in 2005.

Idea: the eigenfunctions of a diffusion operator can be used to perform global analysis of the dataset and of functions on a dataset.

Geometric harmonics: construction and definition

For a matrix K with K_{u,v} = k(u, v), consider the integral operator

$$f \mapsto Kf \quad\text{by}\quad (Kf)(u) := \sum_{v\in\Gamma} K_{u,v}\, f(v), \qquad u \in X.$$

(“Restricted matrix multiplication”)

Diagonalize the restricted matrix [K]_{u,v∈Γ} via:

$$\sum_{v\in\Gamma} K_{u,v}\, \psi_j(v) = \lambda_j \psi_j(u), \qquad u \in \Gamma.$$

NOTE: k symmetric ⟹ K symmetric ⟹ {ψ_j} form an ONB.

[Nyström] Reverse this equation to define values off Γ:

$$\Psi_j(u) := \frac{1}{\lambda_j} \sum_{v\in\Gamma} K_{u,v}\, \psi_j(v), \qquad u \in X.$$

$\{\Psi_j\}_{j=1}^{n}$ are the geometric harmonics, where n = |Γ|.
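The construction can be sketched in a few lines of NumPy. The sketch assumes a Gaussian kernel and a small random point set; the function names and parameter values are hypothetical. In practice one would probably discard very small eigenvalues before dividing by λ_j, a step left out here.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, eps):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / eps) for rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / eps)

def geometric_harmonics(X, gamma_idx, eps):
    """Return (lam, psi, Psi): eigenpairs on Gamma and their extensions to X."""
    gamma = X[gamma_idx]
    K_gamma = gaussian_kernel_matrix(gamma, gamma, eps)  # restricted matrix on Gamma
    lam, psi = np.linalg.eigh(K_gamma)                   # columns of psi form an ONB
    K_x_gamma = gaussian_kernel_matrix(X, gamma, eps)    # K_{u,v}, u in X, v in Gamma
    Psi = (K_x_gamma @ psi) / lam                        # column j is divided by lam[j]
    return lam, psi, Psi

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))               # hypothetical dataset, 8 points in R^3
gamma_idx = np.array([0, 1, 2, 3, 4])     # rows that make up Gamma
lam, psi, Psi = geometric_harmonics(X, gamma_idx, eps=1.0)
```

Column j of `Psi` holds Ψ_j at every point of X; restricted to the rows in `gamma_idx` it reproduces ψ_j.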

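Geometric harmonics: the extension algorithm

For f : Γ → Y and n = |Γ|, define

$$F(x) = \sum_{j=1}^{n} \langle f, \psi_j \rangle_\Gamma\, \Psi_j(x), \qquad x \in X.$$

For x ∈ Γ, Ψ_j(x) = ψ_j(x), so

$$F(x) = \sum_{j=1}^{n} \langle f, \psi_j \rangle_\Gamma\, \Psi_j(x) = \sum_{j=1}^{n} \langle f, \psi_j \rangle_\Gamma\, \psi_j(x) = f(x),$$

since this is just the decomposition of f in the ONB $\{\psi_j\}_{j=1}^{n}$.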
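A minimal sketch of the extension formula, assuming a Gaussian kernel. The test curve, the function f(t) = t², the choice of Γ, and `eps` are all made up for illustration; they are not taken from the talk.

```python
import numpy as np

def extend_function(X, gamma_idx, f_gamma, eps):
    """Extend f (known on X[gamma_idx]) to all rows of X via geometric harmonics."""
    def kernel(A, B):
        sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
        return np.exp(-np.maximum(sq, 0.0) / eps)

    gamma = X[gamma_idx]
    lam, psi = np.linalg.eigh(kernel(gamma, gamma))  # ONB {psi_j} on Gamma
    coeffs = psi.T @ f_gamma                         # <f, psi_j>_Gamma
    Psi = (kernel(X, gamma) @ psi) / lam             # geometric harmonics on X
    return Psi @ coeffs                              # F(x) = sum_j <f, psi_j> Psi_j(x)

# Hypothetical data: points on a curve in R^2, with f a smooth function of position.
t = np.linspace(0.0, 1.0, 21)
X = np.column_stack([np.cos(t), np.sin(t)])
f = t**2
gamma_idx = np.arange(0, 21, 2)      # f is treated as known at every other point
F = extend_function(X, gamma_idx, f[gamma_idx], eps=0.02)

print(np.max(np.abs(F[gamma_idx] - f[gamma_idx])))
```

On Γ the printed difference is at the level of round-off, matching the identity F = f there. How well F approximates f away from Γ depends on the kernel and on `eps`.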

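Geometric harmonics: limitations

Geometric harmonics does not apply to missing data.

Consider f : Γ → R as an extra column with holes:

[Figure: the rows v1, v2, v3, ..., vn of the data matrix, with f appended as an extra column in which some entries are missing.]

Geometric harmonics requires the first p columns to be complete.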

Iterated geometric harmonics

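Iterated geometric harmonics: basic idea

Underlying assumption of geometric harmonics:
- Data are samples from a submanifold.

Restated as a continuity assumption:
- If p − 1 entries of u and v are very close, then so is the pth.

Idea: Consider the jth column to be a function of the others.

$$\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \longrightarrow \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2j} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nj} & \cdots & a_{np} \end{bmatrix}$$

Geometric harmonics can be applied to the jth column.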

Iterated geometric harmonics: the iteration scheme

1. Record locations of missing values in the dataset.
2. Stochastically impute missing values.
   - Drawn from N(µ, σ²), computed columnwise.
3. Iteration through columns.
   (a) Choose (at random) a column to update.
   (b) “Unlock” entries of column to be imputed.
   (c) Use geometric harmonics to update those entries.
       - Current column is treated as a function of the others.
       - New values are initially computed in terms of poor guesses.
       - Successive passes improve guesses.
   (d) Continue until all columns are updated.
4. Repeat iteration until updates cause negligible change.
   - Process typically stabilizes after about 4 cycles.
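The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel, marks missing entries with NaN, drops very small eigenvalues for numerical stability, assumes every column has at least one known value, and replaces the stopping test of step 4 with a fixed number of cycles. All function names and parameter values are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, eps):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / eps)

def gh_extend(others, known, f_known, eps, rel_tol=1e-8):
    """Geometric harmonics extension of f from rows `known` to all rows of `others`."""
    gamma = others[known]
    lam, psi = np.linalg.eigh(gaussian_kernel(gamma, gamma, eps))
    keep = lam > rel_tol * lam.max()      # assumption: drop tiny eigenvalues
    lam, psi = lam[keep], psi[:, keep]
    Psi = (gaussian_kernel(others, gamma, eps) @ psi) / lam
    return Psi @ (psi.T @ f_known)

def iterated_geometric_harmonics(data, eps, cycles=4, seed=0):
    """Fill NaN entries of `data` following steps 1-4 of the iteration scheme."""
    rng = np.random.default_rng(seed)
    filled = data.astype(float).copy()
    missing = np.isnan(filled)                             # step 1
    for j in range(filled.shape[1]):                       # step 2
        col = filled[~missing[:, j], j]
        filled[missing[:, j], j] = rng.normal(col.mean(), col.std(), missing[:, j].sum())
    for _ in range(cycles):                                # step 4, fixed cycle count
        for j in rng.permutation(filled.shape[1]):         # steps 3(a) and 3(d)
            holes = missing[:, j]                          # step 3(b)
            if not holes.any() or holes.all():
                continue
            others = np.delete(filled, j, axis=1)          # column j as a function of the rest
            F = gh_extend(others, ~holes, filled[~holes, j], eps)   # step 3(c)
            filled[holes, j] = F[holes]
    return filled

# Hypothetical test data: a smooth curve in R^3 with about 10% of entries removed.
t = np.linspace(0.0, 1.0, 40)
truth = np.column_stack([np.cos(t), np.sin(t), t])
holes = np.random.default_rng(1).random(truth.shape) < 0.1
damaged = truth.copy()
damaged[holes] = np.nan
restored = iterated_geometric_harmonics(damaged, eps=0.1)
```

Known entries are never modified; only the recorded holes are overwritten on each pass. A convergence test, for example on the size of the change in `filled` between cycles, could replace the fixed `cycles` argument.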

[Figure: image panels labeled "damaged", "restored", and "original" (70% data loss).]

Iterated geometric harmonics: applications

Iterated geometric harmonics requires a continuity assumption.
- Probably not well-suited to social network analysis, etc.

Iterated geometric harmonics requires multiple similar datapoints/records.
- Video footage is a natural application.
- 10–24 images per second, usually very similar.
- Applications for security, military, law enforcement.

Iterated geometric harmonics excels when p ≫ n.
- However, it has demonstrated good performance on low-dimensional time series.
- Example: San Diego weather data (next slide).

San Diego Airport weather data

n = 2000, p = 25

[Figure: two plots against the number of GH iterations, one of the L-2 error and one of the standard deviation, each with curves labeled 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4.]

Summary

Iterated Geometric Harmonics (IGH):
- Robust data reconstruction, even at high rates of data loss.
- Well suited to high-dimensional problems, p ≫ n.
- Relies on continuity assumptions on underlying data.
- Application to image reconstruction, video footage, etc.
- Patent pending (U.S. Patent Application No. 14/920,556).

Future work: noisy data.

Future work

From missing data to noisy data

Future work: noisy data

The problem of “noisy data” is more difficult:
- Before improving the data, bad values need to be located.

Current work: using Markov random fields to detect noise.
- Markov random fields: another graph-based tool for data analysis.

Future work: Markov random fields

[Figure: graphical model with a layer of original (noisy) data b1, ..., b6 and a layer of improved data a1, ..., a6, with edge weights w_ij and u_i.]

Minimize the energy functional:

$$E = \sum w_{ij} (a_i - a_j)^2 + \sum u_i (a_i - b_i)^2$$

where {b_i} are given, w_ij are tuned by user (and usually all equal), and u_i are tuned by user (and usually all equal).

Special case: minimize the energy functional

$$E = \sum (a_i - a_j)^2 + \lambda \sum (a_i - b_i)^2$$

where {b_i} are given, w_ij = u_i = 1, and λ is tuned by user.
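Because this energy is quadratic in the a_i, its minimizer can also be found by solving a linear system: setting the gradient to zero gives (L + λI)a = λb, where L is the graph Laplacian of the edges in the first sum. The sketch below uses that direct solve instead of simulated annealing, purely for illustration. The 6-node chain, the values of `b`, and `lam = 0.5` are hypothetical.

```python
import numpy as np

def energy(a, b, edges, lam):
    """E = sum over edges (a_i - a_j)^2 + lam * sum_i (a_i - b_i)^2."""
    smoothness = sum((a[i] - a[j]) ** 2 for i, j in edges)
    fidelity = lam * np.sum((a - b) ** 2)
    return smoothness + fidelity

def minimize_energy(b, edges, lam):
    """Solve grad E = 0, i.e. (L + lam * I) a = lam * b, with L the graph Laplacian."""
    n = len(b)
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return np.linalg.solve(L + lam * np.eye(n), lam * b)

# Hypothetical 6-node chain; the value at index 2 plays the role of a noisy entry.
b = np.array([1.0, 1.1, 5.0, 0.9, 1.0, 1.2])
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
a = minimize_energy(b, edges, lam=0.5)

print(energy(b, b, edges, 0.5), energy(a, b, edges, 0.5))
```

The second printed energy is no larger than the first, since `a` minimizes E. Larger values of `lam` keep `a` closer to the given data `b`.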

Markov random fields (MRF) use simulated annealing to solve:
- minimize E given {b_i}.
- Output: improved data {a_i}.

Our approach:
1. Apply MRF to find improved data {a_i}.
2. Compare {a_i} to original data {b_i}.
3. Label nodes with large values of |a_i − b_i| as missing data.
4. Apply IGH and obtain better improved data.
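Steps 1 to 3 of this approach might look like the sketch below. It is an illustration only: the smoothing step uses a direct linear solve of the quadratic energy with w_ij = u_i = 1 rather than simulated annealing, flagged values are marked with NaN, and the graph, the data, `lam`, and `threshold` are hypothetical. Step 4 would pass the marked data to a missing-data method such as IGH.

```python
import numpy as np

def smooth(b, edges, lam):
    """Minimizer of sum_edges (a_i - a_j)^2 + lam * sum_i (a_i - b_i)^2 (direct solve)."""
    n = len(b)
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return np.linalg.solve(L + lam * np.eye(n), lam * b)

def flag_noisy(b, edges, lam, threshold):
    """Return (marked, suspicious): data with suspect entries set to NaN, and the mask."""
    a = smooth(b, edges, lam)                  # step 1: improved data {a_i}
    suspicious = np.abs(a - b) > threshold     # steps 2-3: large |a_i - b_i|
    marked = b.astype(float).copy()
    marked[suspicious] = np.nan                # now a missing-data problem for step 4
    return marked, suspicious

# Hypothetical 6-node chain with one corrupted value at index 2.
b = np.array([1.0, 1.1, 5.0, 0.9, 1.0, 1.2])
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
marked, suspicious = flag_noisy(b, edges, lam=0.5, threshold=1.5)

print(suspicious)
```

The threshold is a tuning choice; set too low, it would also flag the neighbors of a corrupted value, which the smoothing pulls toward it.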

Theoretical underpinnings

Reproducing kernel Hilbert spaces

Under the hood: reproducing kernel Hilbert spaces

Suppose X ⊆ R^n and k : X × X → R is
- nonnegative: k(x, y) ≥ 0
- symmetric: k(x, y) = k(y, x)
- positive semidefinite: for any choice of $\{x_i\}_{i=1}^{m}$, $K_{i,j} = k(x_i, x_j)$ defines a positive semidefinite matrix.

[Aronszajn] There is a Hilbert space H of functions on X with
- $k_x := k(x, \cdot) \in H$, for x ∈ X
- $\langle k_x, f \rangle = f(x)$ (reproducing property)

In the discrete case, H is the closure of the set of functions $f = \sum_x a_x k_x$, with scalars $a_x$.

For Γ ⊆ X, the operator $K : L^2(\Gamma, \mu) \to H$ given by

$$(Kf)(x) = \int_\Gamma k(x, y)\, f(y)\, d\mu(y), \qquad x \in X,$$

turns out to have adjoint operator $K^* : H \to L^2(\Gamma, \mu)$ given by domain restriction:

$$K^* g(y) = g(y), \qquad y \in \Gamma,\ g \in H.$$

$K^*K$ is self-adjoint, positive, and compact.
- Its eigenvalues are discrete and non-negative.
- Since $K^*$ is restriction, eigs can be found by diagonalizing k on Γ.
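In the discrete setting, diagonalizing k on Γ amounts to an eigendecomposition of the kernel matrix restricted to Γ. The sketch below checks the three kernel properties and the non-negativity of the eigenvalues numerically, assuming a Gaussian kernel on a small random point set; the points and the scale parameter 2.0 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = rng.normal(size=(6, 4))    # hypothetical sample points making up Gamma

# Gaussian kernel restricted to Gamma: nonnegative, symmetric, positive semidefinite.
sq_dists = np.sum((gamma[:, None, :] - gamma[None, :, :]) ** 2, axis=2)
K_gamma = np.exp(-sq_dists / 2.0)

# Symmetric matrix, so eigvalsh applies; it returns real eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(K_gamma)
print(eigenvalues)
```

All printed eigenvalues are non-negative up to round-off, which is what allows the division by λ_j in the Nyström step whenever λ_j is not too small.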
