Page 1:

Introduction to Machine Learning, Lecture 16

Mehryar Mohri, Courant Institute and Google Research

[email protected]

Page 2:

Ranking

Page 3:

Motivation

Very large data sets:

• too large to display or process.

• limited resources, need priorities.

• ranking more desirable than classification.

Applications:

• search engines, information extraction.

• decision making, auctions, fraud detection.

Can we learn to predict ranking accurately?


Page 4:

Score-Based Setting

Single stage: learning algorithm

• receives labeled sample of pairwise preferences;

• returns a scoring function $h \colon U \to \mathbb{R}$.

Drawbacks:

• induces a linear ordering of the full set $U$.

• does not match a query-based scenario.

Advantages:

• efficient algorithms.

• good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).


Page 5:

Preference-Based Setting

Definitions:

• $U$: universe, full set of objects.

• $V$: finite query subset to rank, $V \subseteq U$.

• $\tau^*$: target ranking for $V$ (random variable).

Two stages: can be viewed as a reduction.

• learn preference function $h \colon U \times U \to [0, 1]$.

• given $V$, use $h$ to determine ranking $\sigma$ of $V$.

Running time: measured in terms of the number of calls to $h$.


Page 6:

Related Problem

Rank aggregation: given $n$ candidates and $k$ voters each giving a ranking of the candidates, find an ordering as close as possible to these.

• closeness measured in the number of pairwise misrankings.

• problem NP-hard even for $k = 4$ (Dwork et al., 2001).

Page 7:

This Talk

Score-based ranking

Preference-based ranking


Page 8:

Score-Based Ranking

Training data: sample of i.i.d. labeled pairs drawn from $U \times U$ according to some distribution $D$,
$$S = \big((x_1, x'_1, y_1), \ldots, (x_m, x'_m, y_m)\big) \in \big(U \times U \times \{-1, 0, +1\}\big)^m,$$
with
$$y_i = \begin{cases} +1 & \text{if } x'_i >_{\text{pref}} x_i \\ 0 & \text{if } x'_i =_{\text{pref}} x_i \text{ or no information} \\ -1 & \text{if } x'_i <_{\text{pref}} x_i. \end{cases}$$

Problem: find hypothesis $h \colon U \to \mathbb{R}$ in $H$ with small generalization error
$$R_D(h) = \Pr_{(x, x') \sim D}\big[f(x, x')\,(h(x') - h(x)) < 0\big].$$

Page 9:

Notes

Empirical error:
$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i (h(x'_i) - h(x_i)) < 0}.$$

The relation $x\,R\,x' \Leftrightarrow f(x, x') = 1$ may be non-transitive (it need not even be antisymmetric). The problem is therefore different from classification.
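To make this concrete, here is a minimal Python sketch (toy data and a hypothetical scorer, not from the lecture) computing this empirical pairwise misranking error:

```python
import numpy as np

def empirical_pairwise_error(h, pairs, labels):
    """Fraction of labeled pairs (x, x', y) misranked by the scorer h.

    A pair is misranked when y * (h(x') - h(x)) < 0; pairs with y = 0
    never contribute to the error.
    """
    margins = np.array([y * (h(xp) - h(x)) for (x, xp), y in zip(pairs, labels)])
    return float(np.mean(margins < 0))

# Toy usage with a linear scorer on 1-d points.
h = lambda x: 2.0 * x
pairs = [(0.1, 0.9), (0.4, 0.2), (0.5, 0.5)]
labels = [+1, +1, 0]          # x' preferred, x' preferred, no information
print(empirical_pairwise_error(h, pairs, labels))  # 1/3: only the second pair is misranked
```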

Page 10:

Distributional Assumptions

Distribution over points: $m$ points (literature).

• labels for pairs.

• squared number of examples $O(m^2)$.

Distribution over pairs: $m$ pairs.

• label for each pair received.

• independence assumption.

• same (linear) number of examples.

Page 11:

Boosting for Ranking

Use weak ranking algorithm and create stronger ranking algorithm.

Ensemble method: combine base rankers returned by weak ranking algorithm.

Finding simple relatively accurate base rankers often not hard.

How should base rankers be combined?


Page 12:

CD RankBoost

(Freund et al., 2003; Rudin et al., 2005)

$H \subseteq \{0, 1\}^X$. Here $\epsilon_t^0 + \epsilon_t^+ + \epsilon_t^- = 1$, with $\epsilon_t^s(h) = \Pr_{(x, x') \sim D_t}\big[\mathrm{sgn}\big(f(x, x')(h(x') - h(x))\big) = s\big]$.

RankBoost($S = ((x_1, x'_1, y_1), \ldots, (x_m, x'_m, y_m))$)
1   for $i \leftarrow 1$ to $m$ do
2       $D_1(x_i, x'_i) \leftarrow \frac{1}{m}$
3   for $t \leftarrow 1$ to $T$ do
4       $h_t \leftarrow$ base ranker in $H$ with smallest $\epsilon_t^- - \epsilon_t^+ = -\mathbb{E}_{i \sim D_t}\big[y_i\,(h_t(x'_i) - h_t(x_i))\big]$
5       $\alpha_t \leftarrow \frac{1}{2} \log \frac{\epsilon_t^+}{\epsilon_t^-}$
6       $Z_t \leftarrow \epsilon_t^0 + 2\big[\epsilon_t^+ \epsilon_t^-\big]^{\frac{1}{2}}$  (normalization factor)
7       for $i \leftarrow 1$ to $m$ do
8           $D_{t+1}(x_i, x'_i) \leftarrow \dfrac{D_t(x_i, x'_i)\,\exp\big(-\alpha_t y_i\,[h_t(x'_i) - h_t(x_i)]\big)}{Z_t}$
9   $\varphi_T \leftarrow \sum_{t=1}^{T} \alpha_t h_t$
10  return $h = \mathrm{sgn}(\varphi_T)$
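For concreteness, here is a minimal, runnable Python sketch of the update above (hypothetical helper names such as `rankboost` and `phi`; threshold base rankers are only an illustration, not from the lecture). The comments refer to the numbered lines of the pseudocode:

```python
import numpy as np

def rankboost(pairs, labels, base_rankers, T=20, eps=1e-12):
    """Boost pairwise base rankers on a sample of labeled pairs (x_i, x'_i, y_i).

    pairs:        list of (x, x') pairs
    labels:       array-like with y_i in {-1, 0, +1}
    base_rankers: callables h(x) -> {0, 1}
    Returns the weights alpha_t and the selected base rankers h_t.
    """
    y = np.asarray(labels, dtype=float)
    D = np.full(len(pairs), 1.0 / len(pairs))        # distribution over pairs (line 2)
    alphas, selected = [], []
    for _ in range(T):
        # line 4: base ranker with smallest eps_minus - eps_plus
        best = None
        for h in base_rankers:
            margins = y * np.array([h(xp) - h(x) for x, xp in pairs])
            diff = D[margins < 0].sum() - D[margins > 0].sum()
            if best is None or diff < best[0]:
                best = (diff, h, margins)
        _, h_t, margins = best
        eps_plus, eps_minus = D[margins > 0].sum(), D[margins < 0].sum()
        alpha = 0.5 * np.log((eps_plus + eps) / (eps_minus + eps))   # line 5
        D *= np.exp(-alpha * margins)                                # line 8
        D /= D.sum()                                                 # normalization Z_t
        alphas.append(alpha)
        selected.append(h_t)
    return alphas, selected

def phi(x, alphas, selected):
    """Final scoring function phi_T(x) = sum_t alpha_t h_t(x)."""
    return sum(a * h(x) for a, h in zip(alphas, selected))

# Toy usage: 1-d points, threshold base rankers.
pairs = [(0.1, 0.9), (0.2, 0.7), (0.8, 0.3), (0.5, 0.6)]
labels = [+1, +1, -1, 0]
stumps = [lambda x, t=t: 1 if x > t else 0 for t in (0.25, 0.5, 0.75)]
alphas, hs = rankboost(pairs, labels, stumps, T=5)
print(sorted([0.1, 0.3, 0.6, 0.9], key=lambda x: phi(x, alphas, hs), reverse=True))
```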

Page 13:

Notes

Distributions $D_t$ over pairs of sample points:

• originally uniform.

• at each round, the weight of a misclassified example is increased.

• observation: $D_{t+1}(x, x') = \dfrac{e^{-y[\varphi_t(x') - \varphi_t(x)]}}{|S| \prod_{s=1}^{t} Z_s}$, since
$$D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-y \alpha_t [h_t(x') - h_t(x)]}}{Z_t} = \frac{1}{|S|} \frac{e^{-y \sum_{s=1}^{t} \alpha_s [h_s(x') - h_s(x)]}}{\prod_{s=1}^{t} Z_s}.$$

Weight assigned to base classifier $h_t$: $\alpha_t$ directly depends on the accuracy of $h_t$ at round $t$.

Page 14:

Coordinate Descent RankBoost

Objective function: convex and differentiable.
$$F(\alpha) = \sum_{(x, x', y) \in S} e^{-y[\varphi_T(x') - \varphi_T(x)]} = \sum_{(x, x', y) \in S} \exp\Big(-y \sum_{t=1}^{T} \alpha_t [h_t(x') - h_t(x)]\Big).$$

[Figure: the exponential loss $e^{-x}$ upper bounds the 0-1 pairwise loss.]

Page 15:

• Direction: unit vector $e_t$ with
$$e_t = \operatorname*{argmin}_{t} \frac{dF(\alpha + \eta e_t)}{d\eta}\bigg|_{\eta=0}.$$

• Since $F(\alpha + \eta e_t) = \sum_{(x, x', y) \in S} e^{-y \sum_{s=1}^{T} \alpha_s [h_s(x') - h_s(x)]}\, e^{-y \eta [h_t(x') - h_t(x)]}$,
$$\frac{dF(\alpha + \eta e_t)}{d\eta}\bigg|_{\eta=0} = -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \exp\Big(-y \sum_{s=1}^{T} \alpha_s [h_s(x') - h_s(x)]\Big)$$
$$= -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)]\, D_{T+1}(x, x')\, m \prod_{s=1}^{T} Z_s = -\big[\epsilon_t^+ - \epsilon_t^-\big]\, m \prod_{s=1}^{T} Z_s.$$

Thus, the direction corresponds to the base classifier selected by the algorithm.

Page 16:

• Step size: obtained via
$$\frac{dF(\alpha + \eta e_t)}{d\eta} = 0$$
$$\Leftrightarrow -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \exp\Big(-y \sum_{s=1}^{T} \alpha_s [h_s(x') - h_s(x)]\Big)\, e^{-y [h_t(x') - h_t(x)]\eta} = 0$$
$$\Leftrightarrow -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)]\, D_{T+1}(x, x')\, m \prod_{s=1}^{T} Z_s\, e^{-y [h_t(x') - h_t(x)]\eta} = 0$$
$$\Leftrightarrow -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)]\, D_{T+1}(x, x')\, e^{-y [h_t(x') - h_t(x)]\eta} = 0$$
$$\Leftrightarrow -\big[\epsilon_t^+ e^{-\eta} - \epsilon_t^- e^{\eta}\big] = 0 \;\Leftrightarrow\; \eta = \frac{1}{2} \log \frac{\epsilon_t^+}{\epsilon_t^-}.$$

Thus, the step size matches the base classifier weight used in the algorithm.

Page 17:

L1 Margin Definitions

Definition: the margin of a pair $(x, x')$ with label $y \neq 0$ is
$$\rho(x, x') = \frac{y(\varphi(x') - \varphi(x))}{\sum_{t=1}^{T} \alpha_t} = \frac{y \sum_{t=1}^{T} \alpha_t [h_t(x') - h_t(x)]}{\|\alpha\|_1} = \frac{y\, \alpha \cdot \Delta h(x)}{\|\alpha\|_1}.$$

Definition: the margin of the hypothesis $h$ for a sample $S = ((x_1, x'_1, y_1), \ldots, (x_m, x'_m, y_m))$ is the minimum margin for pairs in $S$ with non-zero labels:
$$\rho = \min_{\substack{(x, x', y) \in S \\ y \neq 0}} \frac{y\, \alpha \cdot \Delta h(x)}{\|\alpha\|_1}.$$

Page 18:

Ranking Margin Bound

(Cortes and MM, 2011)

Theorem: let $H$ be a family of real-valued functions. Fix $\rho > 0$; then, for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample of size $m$, the following holds for all $h \in H$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\Big(\mathfrak{R}_m^{D_1}(H) + \mathfrak{R}_m^{D_2}(H)\Big) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Page 19:

RankBoost Margin

But, RankBoost does not maximize the margin.

Smooth-margin RankBoost (Rudin et al., 2005):
$$G(\alpha) = \frac{-\log F(\alpha)}{\|\alpha\|_1}.$$

Empirical performance not reported.

Page 20:

Ranking with SVMs

(see for example (Joachims, 2002))

Optimization problem: application of SVMs.
$$\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{subject to: } y_i\, \big(w \cdot \big[\Phi(x'_i) - \Phi(x_i)\big]\big) \geq 1 - \xi_i,\quad \xi_i \geq 0,\; \forall i \in [1, m].$$

Decision function:
$$h \colon x \mapsto w \cdot \Phi(x) + b.$$

Page 21:

Notes

The algorithm coincides with SVMs using the feature mapping
$$(x, x') \mapsto \Psi(x, x') = \Phi(x') - \Phi(x).$$

Can be used with kernels. The algorithm is directly based on the margin bound.
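One common way to implement this reduction in practice is to train a standard linear SVM on the pairwise difference features; the sketch below (scikit-learn, hypothetical toy data, not taken from the lecture) follows that idea:

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_rank_svm(X_first, X_second, y, C=1.0):
    """Pairwise ranking via an SVM on difference features Psi(x, x') = Phi(x') - Phi(x).

    X_first, X_second: arrays of shape (m, d) with the two elements of each pair
    y: array of shape (m,) with labels in {-1, +1} (pairs with y = 0 dropped beforehand)
    Returns a weight vector w; items are then scored by h(x) = w . x.
    """
    diffs = X_second - X_first                       # identity features Phi(x) = x here
    # no intercept: the offset cancels in h(x') - h(x)
    svm = LinearSVC(C=C, fit_intercept=False, loss="hinge")
    svm.fit(diffs, y)
    return svm.coef_.ravel()

# Toy usage: 2-d items, preference encoded by the first coordinate.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(100, 2)), rng.normal(size=(100, 2))
y = np.where(X2[:, 0] > X1[:, 0], 1, -1)
w = fit_rank_svm(X1, X2, y)
print(w)   # the first coordinate should dominate
```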

Page 22:

Bipartite Ranking

Training data:

• sample of negative points drawn according to $D_-$: $S_- = (x_1, \ldots, x_m) \in U^m$.

• sample of positive points drawn according to $D_+$: $S_+ = (x'_1, \ldots, x'_{m'}) \in U^{m'}$.

Problem: find hypothesis $h \colon U \to \mathbb{R}$ in $H$ with small generalization error
$$R_D(h) = \Pr_{x \sim D_-,\, x' \sim D_+}\big[h(x') < h(x)\big].$$

Page 23:

Notes

More efficient algorithm in this special case (Freund et al., 2003).

Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05).

• if constant base ranker used.

• relationship between objective functions.

Bipartite ranking results typically reported in terms of AUC.


Page 24:

ROC Curve

Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. false positive rate (FP).

• TP: % positive points correctly labeled positive.

• FP: % negative points incorrectly labeled positive.

(Egan, 1975)

[Figure: ROC curve, true positive rate vs. false positive rate; sweeping a threshold $\theta$ over the sorted scores $h(x_i)$ traces out the curve.]

Page 25:

Area under the ROC Curve (AUC)

Definition: the AUC is the area under the ROC curve. Measure of ranking quality.

(Hanley and McNeil, 1982)

Equivalently,
$$\mathrm{AUC}(h) = \frac{1}{m\,m'} \sum_{i=1}^{m} \sum_{j=1}^{m'} 1_{h(x'_j) > h(x_i)} = \Pr_{\substack{x' \sim \widehat{D}_+ \\ x \sim \widehat{D}_-}}\big[h(x') > h(x)\big] = 1 - \widehat{R}(h).$$

[Figure: ROC curve with the area under it shaded as the AUC.]
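To make the pairwise definition concrete, here is a small numpy sketch (hypothetical toy scores, not from the lecture) computing the AUC both directly from ordered positive/negative pairs and from the ROC curve via scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties are counted as one half, matching the usual ROC convention.
    """
    pos = np.asarray(scores_pos)[:, None]    # shape (m', 1)
    neg = np.asarray(scores_neg)[None, :]    # shape (1, m)
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

# Toy scores for positive and negative points.
scores_pos = np.array([0.9, 0.8, 0.4])
scores_neg = np.array([0.7, 0.3, 0.2, 0.1])
print(pairwise_auc(scores_pos, scores_neg))          # 11/12
# Same value obtained from the ROC curve itself:
labels = np.r_[np.ones_like(scores_pos), np.zeros_like(scores_neg)]
print(roc_auc_score(labels, np.r_[scores_pos, scores_neg]))
```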

Page 26:

AdaBoost and CD RankBoost

Objective functions: comparison.

$$F_{\mathrm{Rank}}(\alpha) = \sum_{(i,j) \in S_- \times S_+} \exp\big(-[f(x_j) - f(x_i)]\big) = \sum_{(i,j) \in S_- \times S_+} \exp\big(+f(x_i)\big)\exp\big(-f(x_j)\big) = F_-(\alpha)\, F_+(\alpha).$$

$$F_{\mathrm{Ada}}(\alpha) = \sum_{x_i \in S_- \cup S_+} \exp\big(-y_i f(x_i)\big) = \sum_{x_i \in S_-} \exp\big(+f(x_i)\big) + \sum_{x_i \in S_+} \exp\big(-f(x_i)\big) = F_-(\alpha) + F_+(\alpha).$$

Page 27:

AdaBoost and CD RankBoost

Property: AdaBoost (non-separable case) (Rudin et al., 2005).

• constant base learner $h = 1$: equal contribution of positive and negative points (in the limit).

• consequence: AdaBoost asymptotically achieves the optimum of the CD RankBoost objective.

Observations: if $F_+(\alpha) = F_-(\alpha)$,
$$d(F_{\mathrm{Rank}}) = F_+\, d(F_-) + F_-\, d(F_+) = F_+\big(d(F_-) + d(F_+)\big) = F_+\, d(F_{\mathrm{Ada}}).$$

Page 28:

Bipartite RankBoost - Efficiency

Decomposition of distribution: for $(x, x') \in (S_-, S_+)$,
$$D(x, x') = D_-(x)\, D_+(x').$$

Thus,
$$D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-\alpha_t [h_t(x') - h_t(x)]}}{Z_t} = \frac{D_{t,-}(x)\, e^{\alpha_t h_t(x)}}{Z_{t,-}} \cdot \frac{D_{t,+}(x')\, e^{-\alpha_t h_t(x')}}{Z_{t,+}},$$
with
$$Z_{t,-} = \sum_{x \in S_-} D_{t,-}(x)\, e^{\alpha_t h_t(x)}, \qquad Z_{t,+} = \sum_{x' \in S_+} D_{t,+}(x')\, e^{-\alpha_t h_t(x')}.$$
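To illustrate why this decomposition matters, here is a small sketch (hypothetical names, not from the lecture) that maintains the weights over negatives and positives separately, so one boosting round costs time linear in the number of points rather than quadratic in the number of pairs:

```python
import numpy as np

def bipartite_update(D_neg, D_pos, h_neg, h_pos, alpha):
    """One round of the bipartite RankBoost weight update.

    D_neg, D_pos: current weights over negative / positive points (each sums to 1)
    h_neg, h_pos: base-ranker values h_t(x) on the negatives and positives
    The pair weight D_{t+1}(x, x') is the product of the two returned factors,
    so it never has to be stored explicitly.
    """
    new_neg = D_neg * np.exp(alpha * h_neg)      # weight up high-scoring negatives
    new_pos = D_pos * np.exp(-alpha * h_pos)     # weight up low-scoring positives
    return new_neg / new_neg.sum(), new_pos / new_pos.sum()

# Toy round: 4 negatives, 3 positives, one base ranker and its weight alpha.
D_neg, D_pos = np.full(4, 0.25), np.full(3, 1 / 3)
h_neg, h_pos = np.array([1, 0, 0, 1]), np.array([1, 1, 0])
print(bipartite_update(D_neg, D_pos, h_neg, h_pos, alpha=0.5))
```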

Page 29:

Ranking ≠ Classification

Bipartite case: can we learn to rank by training a classifier on positive and negative sets?

• different objective functions: AUC vs. 0/1 loss.

• preliminary analysis (Cortes and MM, 2004): different results for imbalanced data sets, on average over all classifications.

• example, stochastic case:

[Example: small stochastic case over three points A, B, C with positive and negative labels.]

Page 30:

This Talk

Score-based ranking

Preference-based ranking


Page 31:

Preference-Based Setting

Definitions:

• $U$: universe, full set of objects.

• $V$: finite query subset to rank, $V \subseteq U$.

• $\tau^*$: target ranking for $V$ (random variable).

Two stages: can be viewed as a reduction.

• learn preference function $h \colon U \times U \to [0, 1]$.

• given $V$, use $h$ to determine ranking $\sigma$ of $V$.

Running time: measured in terms of the number of calls to $h$.


Page 32:

Preference-Based Ranking Problem

Training data: pairs $(V, \tau^*)$ sampled i.i.d. according to $D$ (subsets ranked by different labelers):
$$(V_1, \tau^*_1), (V_2, \tau^*_2), \ldots, (V_m, \tau^*_m), \qquad V_i \subseteq U.$$

Learn classifier (preference function) $h \colon U \times U \to [0, 1]$.

Problem: for any query set $V \subseteq U$, use $h$ to return a ranking $\sigma_{h,V}$ close to the target $\tau^*$ with small average error
$$R(h, \sigma) = \mathbb{E}_{(V, \tau^*) \sim D}\big[L(\sigma_{h,V}, \tau^*)\big].$$

Page 33:

Preference Function

$h(u, v)$ close to $1$ when $u$ is preferred to $v$, close to $0$ otherwise. For the analysis, $h(u, v) \in \{0, 1\}$.

Assumed pairwise consistent: $h(u, v) + h(v, u) = 1$.

May be non-transitive, e.g., $h(u, v) = h(v, w) = h(w, u) = 1$.

Output of a classifier or a 'black box'.

Page 34:

Loss Functions

Preference loss (for a fixed $(V, \tau^*)$):
$$L(h, \tau^*) = \frac{2}{n(n-1)} \sum_{u \neq v} h(u, v)\, \tau^*(v, u).$$

Ranking loss:
$$L(\sigma, \tau^*) = \frac{2}{n(n-1)} \sum_{u \neq v} \sigma(u, v)\, \tau^*(v, u).$$
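As a concrete check of these definitions, here is a small sketch (hypothetical encoding: rankings are given as position dictionaries, so that $\sigma(u, v) = 1$ exactly when $u$ is ranked above $v$) computing the ranking loss of a permutation against a target:

```python
from itertools import permutations

def ranking_loss(sigma, tau_star, items):
    """Pairwise ranking loss L(sigma, tau*) = 2/(n(n-1)) * sum_{u != v} sigma(u,v) tau*(v,u).

    sigma, tau_star: dicts mapping item -> rank position (0 = top).
    """
    n = len(items)
    total = sum(
        1
        for u, v in permutations(items, 2)          # ordered pairs u != v
        if sigma[u] < sigma[v] and tau_star[v] < tau_star[u]
    )
    return 2.0 * total / (n * (n - 1))

# Toy usage: target ranks a > b > c, predicted ranking swaps b and c.
tau = {"a": 0, "b": 1, "c": 2}
sig = {"a": 0, "c": 1, "b": 2}
print(ranking_loss(sig, tau, ["a", "b", "c"]))   # 1/3: one of the three pairs misordered
```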

Page 35:

(Weak) Regret

Preference regret:
$$R_{\mathrm{class}}(h) = \mathbb{E}_{V, \tau^*}\big[L(h|_V, \tau^*)\big] - \mathbb{E}_{V}\Big[\min_{h'} \mathbb{E}_{\tau^*|V}\big[L(h', \tau^*)\big]\Big].$$

Ranking regret:
$$R_{\mathrm{rank}}(A) = \mathbb{E}_{V, \tau^*, s}\big[L(A_s(V), \tau^*)\big] - \mathbb{E}_{V}\Big[\min_{\sigma \in S(V)} \mathbb{E}_{\tau^*|V}\big[L(\sigma, \tau^*)\big]\Big].$$

Page 36:

Deterministic Algorithm

(Balcan et al., 07)

Stage one: standard classification. Learn preference function $h \colon U \times U \to [0, 1]$.

Stage two: sort-by-degree using the comparison function $h$.

• sort by number of points ranked below.

• quadratic time complexity $O(n^2)$.
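A minimal sketch of this second stage (hypothetical names; assumes a binary, pairwise-consistent preference function $h$):

```python
def sort_by_degree(V, h):
    """Rank the items of V by how many other items each one beats under h.

    h(u, v) = 1 means u is preferred to v; every pair is queried once,
    hence the O(|V|^2) calls to h.
    """
    degree = {u: sum(h(u, v) for v in V if v is not u) for u in V}
    return sorted(V, key=lambda u: degree[u], reverse=True)

# Toy usage: preference induced by a numeric score.
items = [3, 1, 4, 1.5]
pref = lambda u, v: 1 if u > v else 0
print(sort_by_degree(items, pref))   # [4, 3, 1.5, 1]
```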

Page 37:

Randomized Algorithm

(Ailon & MM, 08)

Stage one: standard classification. Learn preference function $h \colon U \times U \to [0, 1]$.

Stage two: randomized QuickSort (Hoare, 61) using $h$ as the comparison function.

• comparison function non-transitive, unlike the textbook description.

• but, time complexity shown to be $O(n \log n)$ in general.

Page 38:

Randomized QS

[Figure: a random pivot $u$ is chosen; each $v$ with $h(v, u) = 1$ goes to the left recursion, and each $v$ with $h(u, v) = 1$ goes to the right recursion.]
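A minimal sketch of this randomized QuickSort reduction (hypothetical names; $h$ is the learned binary preference function and may be non-transitive, so the output is a random ranking rather than a canonical sorted order):

```python
import random

def quicksort_by_preference(V, h, rng=random):
    """Rank V with randomized QuickSort, using the preference function h as comparator."""
    items = list(V)
    if len(items) <= 1:
        return items
    i = rng.randrange(len(items))
    pivot, rest = items[i], items[:i] + items[i + 1:]
    left = [v for v in rest if h(v, pivot) == 1]      # v preferred to the pivot
    right = [v for v in rest if h(pivot, v) == 1]     # pivot preferred to v
    return quicksort_by_preference(left, h, rng) + [pivot] + quicksort_by_preference(right, h, rng)

# Toy usage with a transitive preference for readability.
pref = lambda u, v: 1 if u > v else 0
print(quicksort_by_preference([3, 1, 4, 1.5, 2], pref))   # [4, 3, 2, 1.5, 1]
```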

Page 39:

Deterministic Algo. - Bipartite Case ($V = V_+ \cup V_-$) (Balcan et al., 07)

Bounds: for the deterministic sort-by-degree algorithm

• expected loss:
$$\mathbb{E}_{V, \tau^*}\big[L(A(V), \tau^*)\big] \leq 2\, \mathbb{E}_{V, \tau^*}\big[L(h, \tau^*)\big].$$

• regret:
$$R_{\mathrm{rank}}(A(V)) \leq 2\, R_{\mathrm{class}}(h).$$

Time complexity: $\Omega(|V|^2)$.

Page 40:

Randomized Algo. - Bipartite Case ($V = V_+ \cup V_-$) (Ailon & MM, 08)

Bounds: for randomized QuickSort (Hoare, 61).

• expected loss (equality):
$$\mathbb{E}_{V, \tau^*, s}\big[L(Q^h_s(V), \tau^*)\big] = \mathbb{E}_{V, \tau^*}\big[L(h, \tau^*)\big].$$

• regret:
$$R_{\mathrm{rank}}(Q^h_s(\cdot)) \leq R_{\mathrm{class}}(h).$$

Time complexity:

• full set: $O(n \log n)$.

• top $k$: $O(n + k \log k)$.

Page 41:

Proof Ideas

QuickSort decomposition:
$$p_{uv} + \frac{1}{3} \sum_{w \notin \{u, v\}} p_{uvw}\big[h(u, w)\, h(w, v) + h(v, w)\, h(w, u)\big] = 1.$$

Bipartite property:
$$\tau^*(u, v) + \tau^*(v, w) + \tau^*(w, u) = \tau^*(v, u) + \tau^*(w, v) + \tau^*(u, w).$$

Page 42:

Lower Bound

Theorem: for any deterministic algorithm $A$, there is a bipartite distribution for which
$$R_{\mathrm{rank}}(A) \geq 2\, R_{\mathrm{class}}(h).$$

• thus, the factor of 2 is best in the deterministic case.

• randomization necessary for a better bound.

Proof: take the simple case $U = V = \{u, v, w\}$ and assume that $h$ induces a cycle.

• up to symmetry, $A$ returns $u, v, w$ or $w, v, u$.

Page 43:

Lower Bound

If $A$ returns $u, v, w$, then choose $\tau^*$ as the bipartite target separating $\{u, v\}$ from $\{w\}$.

If $A$ returns $w, v, u$, then choose $\tau^*$ as the bipartite target separating $\{w, v\}$ from $\{u\}$.

In both cases,
$$L[h, \tau^*] = \frac{1}{3}; \qquad L[A, \tau^*] = \frac{2}{3}.$$

Page 44:

Guarantees - General Case

Loss bound for QuickSort:
$$\mathbb{E}_{V, \tau^*, s}\big[L(Q^h_s(V), \tau^*)\big] \leq 2\, \mathbb{E}_{V, \tau^*}\big[L(h, \tau^*)\big].$$

Comparison with optimal ranking (see (CSS 99)):
$$\mathbb{E}_s\big[L(Q^h_s(V), \sigma_{\mathrm{optimal}})\big] \leq 2\, L(h, \sigma_{\mathrm{optimal}}), \qquad \mathbb{E}_s\big[L(h, Q^h_s(V))\big] \leq 3\, L(h, \sigma_{\mathrm{optimal}}),$$
where $\sigma_{\mathrm{optimal}} = \operatorname*{argmin}_{\sigma} L(h, \sigma)$.

Page 45:

Weight Function

Generalization:
$$\tau^*(u, v) = \sigma^*(u, v)\, \omega\big(\sigma^*(u), \sigma^*(v)\big).$$

Properties: needed for all previous results to hold,

• symmetry: $\omega(i, j) = \omega(j, i)$ for all $i, j$.

• monotonicity: $\omega(i, j), \omega(j, k) \leq \omega(i, k)$ for $i < j < k$.

• triangle inequality: $\omega(i, j) \leq \omega(i, k) + \omega(k, j)$ for all triplets $i, j, k$.

Page 46:

Weight Function - Examples

Kemeny:
$$\omega(i, j) = 1, \quad \forall\, i, j.$$

Top-$k$:
$$\omega(i, j) = \begin{cases} 1 & \text{if } i \leq k \text{ or } j \leq k; \\ 0 & \text{otherwise.} \end{cases}$$

Bipartite:
$$\omega(i, j) = \begin{cases} 1 & \text{if } i \leq k \text{ and } j > k; \\ 0 & \text{otherwise.} \end{cases}$$

$k$-partite: can be defined similarly.
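A small sketch of these weight functions (hypothetical helper names; positions are 1-based ranks), which can be combined with $\sigma^*$ as in the generalization on the previous slide:

```python
def kemeny_weight(i, j):
    """Every pair of positions matters equally."""
    return 1

def top_k_weight(i, j, k):
    """Only pairs involving at least one of the top-k positions matter."""
    return 1 if i <= k or j <= k else 0

def bipartite_weight(i, j, k):
    """Only pairs straddling the split between the top k and the rest matter."""
    return 1 if min(i, j) <= k < max(i, j) else 0

# Positions are 1-based ranks; e.g., with k = 2:
print(top_k_weight(1, 5, k=2), top_k_weight(3, 5, k=2))          # 1 0
print(bipartite_weight(1, 5, k=2), bipartite_weight(1, 2, k=2))  # 1 0
```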

Page 47:

(Strong) Regret Definitions

Ranking regret:
$$R_{\mathrm{rank}}(A) = \mathbb{E}_{V, \tau^*, s}\big[L(A_s(V), \tau^*)\big] - \min_{\sigma} \mathbb{E}_{V, \tau^*}\big[L(\sigma|_V, \tau^*)\big].$$

Preference regret:
$$R_{\mathrm{class}}(h) = \mathbb{E}_{V, \tau^*}\big[L(h|_V, \tau^*)\big] - \min_{h'} \mathbb{E}_{V, \tau^*}\big[L(h'|_V, \tau^*)\big].$$

All previous regret results hold if, for all $V_1, V_2 \supseteq \{u, v\}$,
$$\mathbb{E}_{\tau^*|V_1}\big[\tau^*(u, v)\big] = \mathbb{E}_{\tau^*|V_2}\big[\tau^*(u, v)\big]$$
(pairwise independence on irrelevant alternatives).

Page 48:

References

• Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. (2005). Generalization bounds for the area under the ROC curve. JMLR 6, 393–425.

• Agarwal, S., and Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT (pp. 32–47).

• Nir Ailon and Mehryar Mohri. An efficient reduction of ranking to classification. In Proceedings of COLT 2008. Helsinki, Finland, July 2008. Omnipress.

• Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., and Sorkin, G. B. (2007). Robust reductions from ranking to classification. In Proceedings of COLT (pp. 604–619). Springer.

• Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). Learning to order things. J. Artif. Intell. Res. (JAIR), 10, 243–270.

• Cossock, D., and Zhang, T. (2006). Subset ranking using regression. COLT (pp. 605–619).

Page 49:

References

• Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems (NIPS 2003), 2004. MIT Press.

• Cortes, C., Mohri, M., and Rastogi, A. (2007). An alternative ranking problem for search engines. Proceedings of WEA 2007 (pp. 1–21). Rome, Italy: Springer.

• Corinna Cortes and Mehryar Mohri. Confidence intervals for the area under the ROC curve. In Advances in Neural Information Processing Systems (NIPS 2004), 2005. MIT Press.

• Crammer, K., and Singer, Y. (2001). Pranking with ranking. Proceedings of NIPS 2001 (pp. 641–647). Vancouver, British Columbia, Canada. MIT Press.

• Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the Web. WWW 10, 2001. ACM Press.

• J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.

• Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. JMLR 4:933–969, 2003.

Page 50:

References

• J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.

• Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pages 115–132, 2000.

• Thorsten Joachims. Optimizing search engines using clickthrough data. Proceedings of the 8th ACM SIGKDD, pages 133–142, 2002.

• Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. San Francisco, California: Holden-Day.

• Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project, 1998.

• Cynthia Rudin, Corinna Cortes, Mehryar Mohri, and Robert E. Schapire. Margin-based ranking meets boosting in the middle. In Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT 2005), pages 63–78, 2005.