Shallow Learning with Kernels
for Dictionary-Free Magnetic Resonance Fingerprinting
Gopal Nataraj∗, Mingjie Gao∗, Jakob Assländer†, Clayton Scott∗, & Jeffrey A. Fessler∗
ISMRM Workshop on Magnetic Resonance Fingerprinting
∗Dept. of Electrical Engineering and Computer Science, University of Michigan
†Center for Biomedical Imaging, NYU School of Medicine
Problem Statement

Given: at every voxel, a measurement vector y = s(x) + ε

[Figure: MRF “component” images (scale −1 to 4 a.u.; more on these later) mapped by x̂(·) to estimated parameter maps, here T1 (600–2000 ms) and T2 (50–200 ms): y ↦ x̂(y)]

Task: design a fast voxel-by-voxel estimator x̂(·)
that scales well with the number of unknowns per voxel, L
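For contrast with the dictionary-free approach developed next, here is a minimal sketch of the conventional MRF estimator, dictionary matching: grid the latent parameters, simulate one normalized atom per grid point, and pick the best-correlated atom per voxel. The two-parameter signal model below is a toy stand-in for s (not an actual MRF sequence); note that the grid, and hence the dictionary, grows exponentially with L, which is exactly the scaling problem at issue.

```python
import numpy as np

def s(T1, T2, t=np.linspace(0.1, 3.0, 30)):
    """Toy 2-parameter signal model; a placeholder for the MRF sequence."""
    return np.exp(-t[None, :] / T2[:, None]) * (1 - np.exp(-t[None, :] / T1[:, None]))

# Dictionary: a dense grid over (T1, T2); size grows as grid^L with L unknowns.
T1g, T2g = np.meshgrid(np.linspace(0.5, 2.0, 100), np.linspace(0.05, 0.2, 100))
atoms = s(T1g.ravel(), T2g.ravel())
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)  # normalize for matching

def match(y):
    """Return the grid point whose normalized atom best correlates with y."""
    i = np.argmax(atoms @ y)
    return T1g.ravel()[i], T2g.ravel()[i]

# One noisy voxel: y = s(x) + eps, recovered up to grid resolution.
y = s(np.array([1.2]), np.array([0.1]))[0]
y += 0.01 * np.random.default_rng(1).normal(size=y.size)
print(match(y))  # close to (1.2, 0.1)
```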
Machine Learning at Different “Depths” for QMRI

Idea: “learn” separate scalar estimators x̂1(y), . . . , x̂L(y) from simulated training data

Deep Learning
• promising for QMRI [Cohen et al., 2017, Virtue et al., 2017]
• needs many training points to avoid overfitting
• trained via non-convex optimization
• limited theoretical basis

Shallow Learning
• simpler structure needs fewer training points
• fast training via convex optimization
Shallow Learning with Kernels for QMRI

Idea: “learn” separate scalar estimators x̂1(y), . . . , x̂L(y) from simulated training data

• sample (x1, ε1), . . . , (xN, εN) and simulate y1, . . . , yN via the signal model s
• design nonlinear functions x̂l(·) := ĝl(·) + b̂l that seek to map each yn to xl,n; fitting (gl, bl) by least squares alone is ill-posed, so we regularize:

$(\hat g_l, \hat b_l) \in \arg\min_{g_l \in \mathcal{G},\, b_l \in \mathbb{R}} \; \frac{1}{N} \sum_{n=1}^{N} \bigl( g_l(y_n) + b_l - x_{l,n} \bigr)^2 + \rho_l \, \lVert g_l \rVert_{\mathcal{G}}^2 \qquad (1)$

Solution: Parameter Estimation via Regression with Kernels (PERK)
[Nataraj et al., 2017b, arXiv:1710.02441]

• restrict the optimization to a certain rich function space G with kernel k
• the optimal ĝl ∈ G takes the form $\hat g_l(\cdot) = \sum_{n=1}^{N} a_{l,n} \, k(\cdot, y_n)$ [Schölkopf et al., 2001]

Fast, simple implementation: nonlinear lifting + high-dimensional linear regression (see the sketch below)
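A minimal sketch of this “lifting + linear regression” implementation, assuming a Gaussian kernel approximated with random Fourier features; the toy signal model, dimensions, and hyperparameters (Z, sigma, rho) here are illustrative choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_lift(Y, W, phase):
    """Lift D-dim measurements Y (N x D) to Z-dim random Fourier features."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(Y @ W + phase)

def perk_train(Y, x, sigma=1.0, rho=1e-3, Z=300):
    """Fit one scalar estimator x_hat_l(y) = w^T z(y) + b from
    simulated training pairs (Y: N x D measurements, x: N latents)."""
    N, D = Y.shape
    W = rng.normal(scale=1.0 / sigma, size=(D, Z))   # Gaussian-kernel frequencies
    phase = rng.uniform(0, 2 * np.pi, size=Z)
    Zmat = rff_lift(Y, W, phase)
    # Center features and labels so the bias b decouples from w:
    z_mean, x_mean = Zmat.mean(axis=0), x.mean()
    Zc, xc = Zmat - z_mean, x - x_mean
    # Ridge regression in the lifted space (convex, closed form):
    w = np.linalg.solve(Zc.T @ Zc / N + rho * np.eye(Z), Zc.T @ xc / N)
    b = x_mean - z_mean @ w
    return dict(W=W, phase=phase, w=w, b=b)

def perk_apply(model, Y):
    """Voxel-by-voxel prediction: one lifted inner product per voxel."""
    return rff_lift(Y, model["W"], model["phase"]) @ model["w"] + model["b"]

# Toy usage with a made-up 4-measurement signal model s(x):
x_train = rng.uniform(0.5, 2.0, size=5000)            # e.g., latent T1 (s)
Y_train = np.column_stack([1 - np.exp(-t / x_train) for t in (0.2, 0.5, 1.0, 2.0)])
Y_train += 0.01 * rng.normal(size=Y_train.shape)      # additive noise eps
model = perk_train(Y_train, x_train)
x_hat = perk_apply(model, Y_train)
print("train RMSE:", np.sqrt(np.mean((x_hat - x_train) ** 2)))
```

Training reduces to one Z × Z linear solve per latent parameter, so cost scales linearly with L rather than exponentially as in dictionary matching.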
PERK for Magnetic Resonance Fingerprinting (MRF)

To control the lifting dimension, it is desirable for y to be low-dimensional.

[Figure: spiral k-space readouts (kx, ky) acquired per flip (flip 1, . . . , flip 840), with the flip angle pattern over the 0–3500 ms acquisition [Assländer et al., 2017]; six compressed “component” images (scale −1 to 4 a.u.)]

Pipeline: data-sharing across flips; gridding; FFT; PCA, giving a temporal basis $\mathbf{V} \in \mathbb{C}^{840 \times 6}$ and compressed component images

$\hat{\mathbf{Y}} \in \arg\min_{\mathbf{Y} \in \mathbb{C}^{n_{\text{voxels}} \times 6}} \bigl\lVert \mathbf{k} - A\bigl( \mathbf{Y} \mathbf{V}^{\mathsf{H}} \bigr) \bigr\rVert_2^2$
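A sketch of how such a temporal basis V can be obtained, assuming PCA (via SVD) of an ensemble of simulated fingerprints; the fingerprint simulator, parameter ranges, and timing below are placeholders, not the pSSFP sequence of [Assländer et al., 2017].

```python
import numpy as np

rng = np.random.default_rng(0)
n_flips, n_keep = 840, 6

# Placeholder ensemble of simulated fingerprints (rows: parameter draws).
t = np.arange(n_flips) * 4e-3                      # ~4 ms per flip (assumed)
T1 = rng.uniform(0.5, 2.0, size=2000)
T2 = rng.uniform(0.05, 0.2, size=2000)
D = np.exp(-t[None, :] / T2[:, None]) * (1 - np.exp(-t[None, :] / T1[:, None]))

# PCA via SVD of the ensemble: V holds the top right singular vectors.
_, _, Vt = np.linalg.svd(D, full_matrices=False)
V = Vt[:n_keep].conj().T                           # V in C^{840 x 6}

# Compress any fingerprint y (840,) to 6 coefficients, and expand back:
y = D[0]
coeff = y @ V                                      # low-dimensional input to PERK
y_approx = coeff @ V.conj().T
print("relative error:", np.linalg.norm(y - y_approx) / np.linalg.norm(y))
```

The six coefficients per voxel then serve as the low-dimensional y fed to PERK, keeping the lifting dimension manageable.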