A Survey on Distance Metric Learning (Part 2)
Gerry Tesauro
IBM T.J. Watson Research Center

Dec 21, 2015

Transcript
Page 1:

A Survey on Distance Metric Learning (Part 2)

Gerry Tesauro

IBM T.J.Watson Research Center

Page 2:

Acknowledgement

• Lecture material shamelessly adapted from the following sources:
– Kilian Weinberger: “Survey on Distance Metric Learning” slides; IBM summer intern talk slides (Aug. 2006)
– Sam Roweis: slides (NIPS 2006 workshop on “Learning to Compare Examples”)
– Yann LeCun: talk slides (CVPR 2005, 2006)

Page 3:

Outline – Part 2

Neighbourhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis)

Metric Learning for Kernel Regression (Weinberger & Tesauro)

Metric learning for RL basis function construction (Keller et al.)

Similarity learning for image processing (LeCun et al.)

Page 4:

Neighborhood Component Analysis

(Goldberger et al. 2004)
Distance metric for visualization and kNN

Page 5:

Metric Learning for Kernel Regression

Weinberger & Tesauro, AISTATS 2007

Page 6:

Killing three birds with one stone: we construct a method for linear dimensionality reduction that generates a meaningful distance metric, optimally tuned for distance-based kernel regression.

Page 7:

Kernel Regression

• Given a training set {(x_j, y_j), j = 1,…,N}, where each x_j is an input vector and y_j is real-valued, estimate the value of a test point x_i by a weighted average of the samples:

    ŷ_i = Σ_j k_ij y_j / Σ_j k_ij

where k_ij = k_D(x_i, x_j) is a distance-based kernel function using distance metric D

Page 8:

Choice of Kernel

• Many functional forms for k_ij can be used in MLKR; our empirical work uses the Gaussian kernel

    k_ij = exp(−D_ij² / σ²)

where σ is a kernel width parameter (can set σ = 1 W.L.O.G. since we learn D)

• This yields a softmax regression estimate, similar to Roweis’ softmax classifier:

    ŷ_i = Σ_j y_j exp(−D_ij²) / Σ_j exp(−D_ij²)
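A minimal sketch of this softmax estimate in Python (function and variable names are illustrative, not from the paper); σ = 1 is absorbed into a learned linear map A, as noted above:

```python
import numpy as np

def kernel_regression_estimate(X, y, x_query, A):
    """Gaussian-kernel (softmax) regression estimate of y at x_query.

    k_j = exp(-||A (x_j - x_query)||^2);  yhat = sum_j k_j y_j / sum_j k_j.
    A is the learned linear map (sigma = 1 is absorbed into A).
    """
    diffs = (X - x_query) @ A.T       # A applied to each difference vector
    d2 = np.sum(diffs ** 2, axis=1)   # squared distances D_j^2
    k = np.exp(-d2)                   # Gaussian kernel weights
    return float(k @ y / k.sum())     # softmax-weighted average
```

With A = 0 every weight is 1 and the estimate collapses to the plain mean of y; with a well-scaled A, near neighbors dominate the average.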

Page 9:

Distance Metric for Nearest Neighbor Regression

Learn a linear transformation that allows us to estimate the value of a test point from its nearest neighbors

Page 10:

Mahalanobis Metric

The distance function is a pseudo-Mahalanobis metric (generalizes Euclidean distance)
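In code, the pseudo-metric induced by a linear map A can be sketched as follows (names are illustrative; the form D² = (x_i − x_j)ᵀ M (x_i − x_j) with M = AᵀA is the standard one for this family of metrics):

```python
import numpy as np

def pseudo_mahalanobis(xi, xj, A):
    """D(xi, xj) = sqrt((xi - xj)^T M (xi - xj)) with M = A^T A.

    M = A^T A is positive semidefinite, so D is a valid pseudo-metric;
    A = I recovers the ordinary Euclidean distance.
    """
    d = A @ (xi - xj)             # map the difference vector through A
    return float(np.sqrt(d @ d))  # Euclidean norm in the mapped space
```

Scaling A uniformly scales all distances, which is why the kernel width σ can be fixed to 1 once A is learned.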

Page 11:

General Metric Learning Objective

• Find a parameterized distance function D_θ that minimizes the total leave-one-out cross-validation loss

    L = Σ_i (ŷ_i − y_i)²

– e.g. params θ = elements A_ij of the A matrix

• Since we’re solving for A, not M, the optimization is non-convex → use gradient descent
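A sketch of this leave-one-out loss for the Gaussian-kernel case (zeroing the kernel matrix diagonal excludes each point from its own estimate; names are illustrative):

```python
import numpy as np

def loo_loss(X, y, A):
    """Leave-one-out MLKR loss L = sum_i (yhat_i - y_i)^2.

    yhat_i is the Gaussian-kernel estimate of y_i from all other points;
    setting k_ii = 0 implements the leave-one-out exclusion.
    """
    Z = X @ A.T                                             # map points through A
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2)                                         # Gaussian kernel matrix
    np.fill_diagonal(K, 0.0)                                # leave-one-out
    yhat = K @ y / K.sum(axis=1)                            # per-point estimates
    return float(np.sum((yhat - y) ** 2))
```

Because each ŷ_i omits the point itself, minimizing L over A does not degenerate into memorizing the training targets.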

Page 12:

Gradient Computation

    ∂L/∂A = 4A Σ_i Σ_j (ŷ_i − y_i)(ŷ_j − y_j) k_ij x_ij x_ijᵀ

where x_ij = x_i − x_j

For fast implementation:
– Don’t sum over all i-j pairs; only go up to ~1000 nearest neighbors for each sample i
– Maintain nearest neighbors in a heap-tree structure; update the heap tree every 15 gradient steps
– Ignore sufficiently small values of k_ij (< e^−34)
– Even better data structures: cover trees, k-d trees
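Putting the pieces together, here is a hedged end-to-end sketch of MLKR training. For brevity it uses a finite-difference gradient with a simple backtracking step instead of the analytic gradient, and sums over all pairs rather than the nearest-neighbor truncation described above; all names are illustrative:

```python
import numpy as np

def loo_estimates(X, y, A):
    """Leave-one-out Gaussian-kernel estimates yhat_i under the metric A."""
    Z = X @ A.T
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2)
    np.fill_diagonal(K, 0.0)          # exclude each point from its own estimate
    return K @ y / K.sum(axis=1)

def mlkr_fit(X, y, A0, lr=0.1, steps=50, eps=1e-5):
    """Minimize L(A) = sum_i (yhat_i - y_i)^2 by gradient descent.

    A finite-difference gradient stands in for the analytic one; a step
    is only accepted if it lowers the loss (simple backtracking).
    """
    def loss(A):
        r = loo_estimates(X, y, A) - y
        return float(r @ r)

    A = A0.astype(float).copy()
    for _ in range(steps):
        G = np.zeros_like(A)
        for idx in np.ndindex(*A.shape):   # central difference per matrix entry
            Ap, Am = A.copy(), A.copy()
            Ap[idx] += eps
            Am[idx] -= eps
            G[idx] = (loss(Ap) - loss(Am)) / (2 * eps)
        candidate = A - lr * G
        if loss(candidate) < loss(A):
            A = candidate                   # accept the descent step
        else:
            lr *= 0.5                       # backtrack on overshoot
    return A
```

On data where only some input dimensions carry signal, the learned A shrinks the noise dimensions, lowering the leave-one-out loss relative to the Euclidean metric.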

Page 13:

Learned Distance Metric example

(Figure: neighborhoods with orig. Euclidean D < 1 vs. learned D < 1)

Page 14:

“Twin Peaks” test (n = 8000)

Training:
– we added 3 dimensions with 1000% noise
– we rotated 5 dimensions randomly

Page 15:

Input Variance
(Figure: variance of each input dimension, split into noise and signal)

Page 16:

Test data


Page 17:


Test data

Page 18:

Output Variance
(Figure: variance of each output dimension, split into signal and noise)

Page 19:

DimReduction with MLKR
• FG-NET face data: 82 persons, 984 face images w/ age


Page 21:

DimReduction with MLKR

• Force A to be rectangular
• Project onto eigenvectors of A
• Allows visualization of data
• PowerManagement data (d = 21)
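The projection step might be sketched like this: take the leading eigenvectors of M = AᵀA from a learned square A (equivalently, force A rectangular from the start) and project into 2-D for plotting; names are illustrative:

```python
import numpy as np

def project_2d(X, A, out_dim=2):
    """Project data onto the leading eigenvectors of M = A^T A.

    The top eigenvectors capture the directions the learned metric
    stretches most, giving a low-dim view suitable for visualization.
    """
    M = A.T @ A
    w, V = np.linalg.eigh(M)                      # eigenvalues ascending
    top = V[:, np.argsort(w)[::-1][:out_dim]]     # leading eigenvectors
    return X @ top                                # (n, out_dim) coordinates
```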

Page 22:

Robot arm results (8- and 32-dim)
(Figure: regression error comparison)

Page 23:

© 2006 IBM Corporation


Unity Data Center Prototype

Objective: Learn long-range resource value estimates for each application manager

State Variables (~48):
– Arrival rate
– Response Time
– Queue Length
– iatVariance
– rtVariance

Action: # of servers allocated by Arbiter

Reward: SLA(Resp. Time)

(Figure: a Resource Arbiter allocates 8 xSeries servers among a Batch application manager and two Trade3 application managers, each running WebSphere 5.1 + DB2; every 5 sec each manager reports Value(#srvrs), derived from its demand (HTTP req/sec) and its SLA on response time, and the arbiter maximizes total SLA revenue.)

(Tesauro, AAAI 2005; Tesauro et al., ICAC 2006)

Page 24:


Power & Performance Management

Objective: Manage systems to multi-discipline objectives: minimize Resp. Time and minimize Power Usage

State Variables (21):

– Power Cap

– Power Usage

– CPU Utilization

– Temperature

– # of requests arrived

– Workload intensity (# Clients)

– Response Time

Action: Power Cap

Reward: SLA(Resp. Time) – Power Usage

(Kephart et al., ICAC 2007)

Page 25:


IBM Regression Results — Test Error

(Figure: bar chart of test error for MLKR vs. baselines on the data-center regression tasks)

Page 26:

Metric Learning for RL basis function construction (Keller et al. ICML 2006)

• RL Dataset of state-action-reward tuples {(s_i, a_i, r_i), i = 1,…,N}

Page 27:

Value Iteration

• Define an iterative “bootstrap” calculation:

    V_{k+1}(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V_k(s') ]

• Each round of VI must iterate over all states in the state space
• Try to speed this up using state aggregation (Bertsekas & Castanon, 1989)
• Idea: Use NCA to aggregate states:
– project states into a lower-dim representation; keep states with similar Bellman error close together
– use the projected states to define a set of basis functions {φ_i}
– learn a linear value function over the basis functions: V = Σ_i θ_i φ_i
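A tabular sketch of this bootstrap update (array shapes and the optional discount γ are my additions; the slide's form is undiscounted):

```python
import numpy as np

def value_iteration(P, R, gamma=1.0, n_iters=100):
    """V_{k+1}(s) = max_a sum_{s'} P[a,s,s'] * (R[a,s,s'] + gamma * V_k(s')).

    P: (A, S, S) transition probabilities; R: (A, S, S) rewards.
    Each sweep touches every state in the state space, which is the
    cost the NCA-based state aggregation above is trying to avoid.
    """
    V = np.zeros(P.shape[1])
    for _ in range(n_iters):
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)  # (A, S) action values
        V = Q.max(axis=0)                                       # greedy backup
    return V
```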

Page 28:

Chopra et al. 2005: Similarity metric for image verification

Problem: Given a pair of face images, decide if they are from the same person.

Page 29:

Chopra et al. 2005: Similarity metric for image verification

Too difficult for a linear mapping!

Problem: Given a pair of face images, decide if they are from the same person.

Page 30: