A Survey on Distance Metric Learning (Part 2)
Gerry Tesauro
IBM T.J. Watson Research Center

Dec 21, 2015

Transcript
Page 1:

A Survey on Distance Metric Learning (Part 2)

Gerry Tesauro

IBM T.J.Watson Research Center

Page 2:

Acknowledgement

• Lecture material shamelessly adapted from the following sources:
– Kilian Weinberger: “Survey on Distance Metric Learning” slides; IBM summer intern talk slides (Aug. 2006)
– Sam Roweis: slides (NIPS 2006 workshop on “Learning to Compare Examples”)
– Yann LeCun: talk slides (CVPR 2005, 2006)

Page 3:

Outline – Part 2

Neighbourhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis)

Metric Learning for Kernel Regression (Weinberger & Tesauro)

Metric learning for RL basis function construction (Keller et al.)

Similarity learning for image processing (LeCun et al.)

Page 4:

Neighborhood Component Analysis

(Goldberger et al. 2004)
Distance metric for visualization and kNN

Page 5:

Metric Learning for Kernel Regression

Weinberger & Tesauro, AISTATS 2007

Page 6:

Killing three birds with one stone: we construct a method for linear dimensionality reduction that generates a meaningful distance metric, optimally tuned for distance-based kernel regression.

Page 7:

Kernel Regression

• Given a training set {(x_j, y_j), j = 1,…,N}, where each x_j is an input vector and y_j is real-valued, estimate the value of a test point x_i by a weighted average of the samples:

    ŷ_i = Σ_j k_ij y_j / Σ_j k_ij

where k_ij = k_D(x_i, x_j) is a distance-based kernel function using distance metric D

Page 8:

Choice of Kernel

• Many functional forms for k_ij can be used in MLKR; our empirical work uses the Gaussian kernel

    k_ij = exp(−D_ij² / σ²)

where σ is a kernel width parameter (can set σ = 1 W.L.O.G. since we learn D)

• This yields a softmax regression estimate, similar to Roweis’ softmax classifier:

    ŷ_i = Σ_j y_j exp(−D_ij²) / Σ_j exp(−D_ij²)
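A minimal sketch of this softmax estimate in Python (function and variable names are illustrative, not from the paper); σ = 1 is absorbed into a learned linear map A, as noted above:

```python
import numpy as np

def kernel_regression_estimate(X, y, x_query, A):
    """Gaussian-kernel (softmax) regression estimate of y at x_query.

    k_j = exp(-||A (x_j - x_query)||^2);  yhat = sum_j k_j y_j / sum_j k_j.
    A is the learned linear map (sigma = 1 is absorbed into A).
    """
    diffs = (X - x_query) @ A.T       # A applied to each difference vector
    d2 = np.sum(diffs ** 2, axis=1)   # squared distances D_j^2
    k = np.exp(-d2)                   # Gaussian kernel weights
    return float(k @ y / k.sum())     # softmax-weighted average
```

With A = 0 every weight is 1 and the estimate collapses to the plain mean of y; with a well-scaled A, near neighbors dominate the average.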

Page 9:

Distance Metric for Nearest Neighbor Regression

Learn a linear transformation that allows us to estimate the value of a test point from its nearest neighbors

Page 10:

Mahalanobis Metric

The distance function is a pseudo-Mahalanobis metric (generalizes Euclidean distance)
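In code, the pseudo-metric induced by a linear map A can be sketched as follows (names are illustrative; the form D² = (x_i − x_j)ᵀ M (x_i − x_j) with M = AᵀA is the standard one for this family of metrics):

```python
import numpy as np

def pseudo_mahalanobis(xi, xj, A):
    """D(xi, xj) = sqrt((xi - xj)^T M (xi - xj)) with M = A^T A.

    M = A^T A is positive semidefinite, so D is a valid pseudo-metric;
    A = I recovers the ordinary Euclidean distance.
    """
    d = A @ (xi - xj)             # map the difference vector through A
    return float(np.sqrt(d @ d))  # Euclidean norm in the mapped space
```

Scaling A uniformly scales all distances, which is why the kernel width σ can be fixed to 1 once A is learned.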

Page 11:

General Metric Learning Objective

• Find a parameterized distance function D_θ that minimizes the total leave-one-out cross-validation loss

    L = Σ_i (ŷ_i − y_i)²

– e.g. params θ = elements A_ij of the A matrix

• Since we’re solving for A, not M, the optimization is non-convex → use gradient descent
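A sketch of this leave-one-out loss for the Gaussian-kernel case (zeroing the kernel matrix diagonal excludes each point from its own estimate; names are illustrative):

```python
import numpy as np

def loo_loss(X, y, A):
    """Leave-one-out MLKR loss L = sum_i (yhat_i - y_i)^2.

    yhat_i is the Gaussian-kernel estimate of y_i from all other points;
    setting k_ii = 0 implements the leave-one-out exclusion.
    """
    Z = X @ A.T                                             # map points through A
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2)                                         # Gaussian kernel matrix
    np.fill_diagonal(K, 0.0)                                # leave-one-out
    yhat = K @ y / K.sum(axis=1)                            # per-point estimates
    return float(np.sum((yhat - y) ** 2))
```

Because each ŷ_i omits the point itself, minimizing L over A does not degenerate into memorizing the training targets.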

Page 12:

Gradient Computation

    ∂L/∂A = 4A Σ_i Σ_j (ŷ_i − y_i)(ŷ_j − y_j) k_ij x_ij x_ijᵀ

where x_ij = x_i − x_j

For fast implementation:
– Don’t sum over all i-j pairs; only go up to ~1000 nearest neighbors for each sample i
– Maintain nearest neighbors in a heap-tree structure; update the heap tree every 15 gradient steps
– Ignore sufficiently small values of k_ij (< e^−34)
– Even better data structures: cover trees, k-d trees
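Putting the pieces together, here is a hedged end-to-end sketch of MLKR training. For brevity it uses a finite-difference gradient with a simple backtracking step instead of the analytic gradient, and sums over all pairs rather than the nearest-neighbor truncation described above; all names are illustrative:

```python
import numpy as np

def loo_estimates(X, y, A):
    """Leave-one-out Gaussian-kernel estimates yhat_i under the metric A."""
    Z = X @ A.T
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2)
    np.fill_diagonal(K, 0.0)          # exclude each point from its own estimate
    return K @ y / K.sum(axis=1)

def mlkr_fit(X, y, A0, lr=0.1, steps=50, eps=1e-5):
    """Minimize L(A) = sum_i (yhat_i - y_i)^2 by gradient descent.

    A finite-difference gradient stands in for the analytic one; a step
    is only accepted if it lowers the loss (simple backtracking).
    """
    def loss(A):
        r = loo_estimates(X, y, A) - y
        return float(r @ r)

    A = A0.astype(float).copy()
    for _ in range(steps):
        G = np.zeros_like(A)
        for idx in np.ndindex(*A.shape):   # central difference per matrix entry
            Ap, Am = A.copy(), A.copy()
            Ap[idx] += eps
            Am[idx] -= eps
            G[idx] = (loss(Ap) - loss(Am)) / (2 * eps)
        candidate = A - lr * G
        if loss(candidate) < loss(A):
            A = candidate                   # accept the descent step
        else:
            lr *= 0.5                       # backtrack on overshoot
    return A
```

On data where only some input dimensions carry signal, the learned A shrinks the noise dimensions, lowering the leave-one-out loss relative to the Euclidean metric.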

Page 13:

Learned Distance Metric example

(Figure: neighborhoods with orig. Euclidean D < 1 vs. learned D < 1)

Page 14:

“Twin Peaks” test (n = 8000)

Training:
– we added 3 dimensions with 1000% noise
– we rotated 5 dimensions randomly

Page 15:

Input Variance
(Figure: variance of each input dimension, split into noise and signal)

Page 16:

Test data


Page 17:


Test data

Page 18:

Output Variance
(Figure: variance of each output dimension, split into signal and noise)

Page 19:

DimReduction with MLKR
• FG-NET face data: 82 persons, 984 face images w/ age


Page 21:

DimReduction with MLKR

• Force A to be rectangular
• Project onto eigenvectors of A
• Allows visualization of data
• PowerManagement data (d = 21)
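The projection step might be sketched like this: take the leading eigenvectors of M = AᵀA from a learned square A (equivalently, force A rectangular from the start) and project into 2-D for plotting; names are illustrative:

```python
import numpy as np

def project_2d(X, A, out_dim=2):
    """Project data onto the leading eigenvectors of M = A^T A.

    The top eigenvectors capture the directions the learned metric
    stretches most, giving a low-dim view suitable for visualization.
    """
    M = A.T @ A
    w, V = np.linalg.eigh(M)                      # eigenvalues ascending
    top = V[:, np.argsort(w)[::-1][:out_dim]]     # leading eigenvectors
    return X @ top                                # (n, out_dim) coordinates
```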

Page 22:

Robot arm results (8- and 32-dim)
(Figure: regression error comparison)

Page 23:

© 2006 IBM Corporation


Unity Data Center Prototype

Objective: Learn long-range resource value estimates for each application manager

State Variables (~48):
– Arrival rate
– Response Time
– Queue Length
– iatVariance
– rtVariance

Action: # of servers allocated by Arbiter

Reward: SLA(Resp. Time)

(Figure: a Resource Arbiter allocates 8 xSeries servers among a Batch application manager and two Trade3 application managers, each running WebSphere 5.1 + DB2; every 5 sec each manager reports Value(#srvrs), derived from its demand (HTTP req/sec) and its SLA on response time, and the arbiter maximizes total SLA revenue.)

(Tesauro, AAAI 2005; Tesauro et al., ICAC 2006)

Page 24:


Power & Performance Management

Objective: Manage systems to multi-discipline objectives: minimize Resp. Time and minimize Power Usage

State Variables (21):

– Power Cap

– Power Usage

– CPU Utilization

– Temperature

– # of requests arrived

– Workload intensity (# Clients)

– Response Time

Action: Power Cap

Reward: SLA(Resp. Time) – Power Usage

(Kephart et al., ICAC 2007)

Page 25:


IBM Regression Results — Test Error

(Figure: bar chart of test error for MLKR vs. baselines on the data-center regression tasks)

Page 26:

Metric Learning for RL basis function construction (Keller et al. ICML 2006)

• RL Dataset of state-action-reward tuples {(s_i, a_i, r_i), i = 1,…,N}

Page 27:

Value Iteration

• Define an iterative “bootstrap” calculation:

    V_{k+1}(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V_k(s') ]

• Each round of VI must iterate over all states in the state space
• Try to speed this up using state aggregation (Bertsekas & Castanon, 1989)
• Idea: Use NCA to aggregate states:
– project states into a lower-dim representation; keep states with similar Bellman error close together
– use the projected states to define a set of basis functions {φ_i}
– learn a linear value function over the basis functions: V = Σ_i θ_i φ_i
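A tabular sketch of this bootstrap update (array shapes and the optional discount γ are my additions; the slide's form is undiscounted):

```python
import numpy as np

def value_iteration(P, R, gamma=1.0, n_iters=100):
    """V_{k+1}(s) = max_a sum_{s'} P[a,s,s'] * (R[a,s,s'] + gamma * V_k(s')).

    P: (A, S, S) transition probabilities; R: (A, S, S) rewards.
    Each sweep touches every state in the state space, which is the
    cost the NCA-based state aggregation above is trying to avoid.
    """
    V = np.zeros(P.shape[1])
    for _ in range(n_iters):
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)  # (A, S) action values
        V = Q.max(axis=0)                                       # greedy backup
    return V
```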

Page 28:

Chopra et al. 2005: Similarity metric for image verification

Problem: Given a pair of face images, decide if they are from the same person.

Page 29:

Chopra et al. 2005: Similarity metric for image verification

Too difficult for a linear mapping!

Problem: Given a pair of face images, decide if they are from the same person.

Page 30: