
Learning to Rank (part 1)

NESCAI 2008 Tutorial

Yisong Yue, Cornell University

Booming Search Industry

Goals for this Tutorial

Basics of information retrieval

What machine learning contributes

New challenges to address

New insights on developing ML algorithms

(Soft) Prerequisites

Basic knowledge of ML algorithms: Support Vector Machines, Neural Nets, Decision Trees, Boosting, etc.

Will introduce IR concepts as needed

Outline (Part 1)

Conventional IR Methods (no learning) 1970s to 1990s

Ordinal Regression 1994 onwards

Optimizing Rank-Based Measures 2005 to present

Outline (Part 2)

Effectively collecting training data, e.g., interpreting clickthrough data

Beyond independent relevance, e.g., diversity

Summary & Discussion

Disclaimer

This talk is very ML-centric: use IR methods to generate features; learn good ranking functions on the feature space; focus on optimizing cleanly formulated objectives; outperform traditional IR methods.

Information Retrieval is broader than the scope of this talk and deals with more sophisticated modeling questions; we will see more interplay between IR and ML in Part 2.

Brief Overview of IR

Predated the internet: "As We May Think" by Vannevar Bush (1945)

Active research topic by the 1960s: Vector Space Model (1970s), Probabilistic Models (1980s)

Introduction to Information Retrieval (2008) C. Manning, P. Raghavan & H. Schütze

Basic Approach to IR

Given query q and set of docs d1, … dn

Find documents relevant to q, typically expressed as a ranking on d1, …, dn

Similarity measure sim(a,b) → R. Sort documents by sim(q,di). Optimal if the relevances of documents are independent. [Robertson, 1977]

Vector Space Model

Represent documents as vectors, one dimension for each word; treat queries as short documents.

Similarity measure: cosine similarity = normalized dot product

cos(A, B) = (A · B) / (||A|| ||B||)

Cosine Similarity Example
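A minimal Python sketch of the idea (the toy documents and query are illustrative, not from the slides):

```python
import math
from collections import Counter

def cosine(a, b):
    # Normalized dot product between two sparse word-count vectors.
    dot = sum(v * b[w] for w, v in a.items() if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ["the cat sat on the mat", "dogs chase cats", "stock market news today"]
query = "cat on mat"

doc_vecs = [Counter(d.split()) for d in docs]
q_vec = Counter(query.split())

# Rank documents by descending cosine similarity to the query.
ranking = sorted(range(len(docs)), key=lambda i: -cosine(q_vec, doc_vecs[i]))
print(ranking)  # the first document is most similar to the query
```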

Other Methods

TF-IDF [Salton & Buckley, 1988]

Okapi BM25 [Robertson et al., 1995]

Language Models [Ponte & Croft, 1998] [Zhai & Lafferty, 2001]

Machine Learning

IR uses fixed models to define similarity scores

Many opportunities to learn models, given appropriate training data and an appropriate learning formulation.

Will mostly use SVM formulations as examples; the general insights are applicable to other techniques.

Training Data

Supervised learning problem

Document/query pairs embedded in a high-dimensional feature space

Labeled by relevance of doc to query: traditionally 0/1, more recently ordinal classes of relevance (0, 1, 2, 3, …)

Feature Space

Used to learn a similarity/compatibility function

Based on existing IR methods: can use raw values or transformations of raw values

Based on raw words: capture co-occurrence of words

Training Instances

Each training instance is a feature vector built from existing IR methods, e.g.:

x(q_i, d_i) = ( sim_IR(q_i, d_i),
                1[rank_IR(q_i, d_i) in top 5],
                1[rank_IR(q_i, d_i) in top 10],
                1[TF(q_i, d_i) > 0.05],
                1[TF(q_i, d_i) > 0.1],
                … )

plus features built from the raw words themselves (e.g., products q_w · d_w capturing word co-occurrence).
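A hedged sketch of how such a feature vector might be assembled; the inputs (sim_score, ir_rank, term_freq) stand in for whatever the underlying IR system produces:

```python
def make_features(sim_score, ir_rank, term_freq):
    """Build a query/document feature vector from existing IR scores.

    sim_score : similarity from a fixed IR model (e.g., cosine or BM25)
    ir_rank   : rank position the IR model assigns this document
    term_freq : a term-frequency statistic for the query terms in the doc
    """
    return [
        sim_score,                          # raw IR similarity
        1.0 if ir_rank <= 5 else 0.0,       # in top 5 of the IR ranking
        1.0 if ir_rank <= 10 else 0.0,      # in top 10 of the IR ranking
        1.0 if term_freq > 0.05 else 0.0,   # thresholded term frequency
        1.0 if term_freq > 0.10 else 0.0,
    ]

x = make_features(sim_score=0.42, ir_rank=7, term_freq=0.08)
print(x)  # [0.42, 0.0, 1.0, 1.0, 0.0]
```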

Learning Problem

Given training instances: (xq,d, yq,d) for q ∈ {1..N}, d ∈ {1..Nq}

Learn a ranking function f(xq,1, …, xq,Nq) → ranking

Typically decomposed into per-doc scores f(x) → R (doc/query compatibility); sort by the scores of all instances for a given q

How to Train?

Classification & regression: learn f(x) → R in conventional ways, then sort by f(x) over all docs for a query. Typically does not work well.

2 major problems:

Labels have an ordering: additional structure compared to multiclass problems

Severe class imbalance: most documents are not relevant

Conventional multiclass learning does not incorporate the ordinal structure of class labels

Not Relevant < Somewhat Relevant < Very Relevant


Ordinal Regression

Assume class labels are ordered; true since class labels indicate level of relevance

Learn hypothesis function f(x) → R such that the ordering of f(x) agrees with the label ordering. Ex: given instances (x, 1), (y, 1), (z, 2):

f(x) < f(z), f(y) < f(z); don't care about f(x) vs f(y)

Ordinal Regression

Compare with classification: similar to multiclass prediction, but classes have ordinal structure.

Compare with regression: doesn't necessarily care about the value of f(x), only that the ordering is preserved.

Ordinal Regression Approaches

Learn multiple thresholds

Learn multiple classifiers

Optimize pairwise preferences

Option 1: Multiple Thresholds

Maintain T thresholds (b1, … bT)

b1 < b2 < … < bT

Learn model parameters together with the thresholds (b1, …, bT)

Goal: the model predicts a score for each input example; minimize threshold violations of the predictions

Ordinal SVM Example

[Chu & Keerthi, 2005]

Ordinal SVM Formulation

argmin over w, b, ξ:   (1/2)||w||² + (C/N) Σ_i Σ_j ξ_{i,j}

Such that for each threshold j = 1..T:

w·x_i − b_j ≤ −1 + ξ_{i,j}   for all i with y_i ≤ j

w·x_i − b_j ≥ +1 − ξ_{i,j}   for all i with y_i > j

ξ_{i,j} ≥ 0

And also:  b_1 ≤ b_2 ≤ … ≤ b_T

[Chu & Keerthi, 2005]
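A minimal sketch of the prediction side only (not the Chu & Keerthi solver): once a score and ordered thresholds have been learned, the ordinal label is simply the number of thresholds the score exceeds.

```python
import bisect

def predict_ordinal(score, thresholds):
    # thresholds must satisfy b1 <= b2 <= ... <= bT;
    # the predicted class is the number of thresholds the score exceeds.
    return bisect.bisect_right(thresholds, score)

thresholds = [-1.0, 0.5, 2.0]   # learned jointly with the scoring model
for s in (-2.0, 0.0, 1.0, 3.0):
    print(s, "->", predict_ordinal(s, thresholds))
# -2.0 -> 0, 0.0 -> 1, 1.0 -> 2, 3.0 -> 3
```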

Learning Multiple Thresholds

Gaussian Processes [Chu & Ghahramani, 2005]

Decision Trees [Kramer et al., 2001]

Neural Nets: RankProp [Caruana et al., 1996]

SVMs & Perceptrons: PRank [Crammer & Singer, 2001], [Chu & Keerthi, 2005]

Option 2: Voting Classifiers

Use T different training sets

Classifier 1 predicts 0 vs 1,2,…,T; Classifier 2 predicts 0,1 vs 2,3,…,T; …; Classifier T predicts 0,1,…,T−1 vs T

Final prediction is a combination, e.g., the sum of predictions

Recent work: McRank [Li et al., 2007], [Qin et al., 2007]

Note: severe class imbalance; near-perfect performance by always predicting 0
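A sketch of the voting-classifier reduction, assuming scikit-learn's LogisticRegression purely as an illustrative base learner; classifier t answers "is the label greater than t?", and the binary votes are summed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ordinal_voters(X, y, num_classes):
    # Classifier t separates {0..t} from {t+1..T}.
    return [LogisticRegression().fit(X, (y > t).astype(int))
            for t in range(num_classes - 1)]

def predict_ordinal(voters, X):
    # The sum of binary votes gives an ordinal prediction in {0..T}.
    return sum(clf.predict(X) for clf in voters)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.digitize(X[:, 0], bins=[-1.0, 0.0, 1.0])   # synthetic ordinal labels 0..3

voters = fit_ordinal_voters(X, y, num_classes=4)
print(predict_ordinal(voters, X[:5]))
```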

Option 3: Pairwise Preferences

Most popular approach for IR applications

Learn model to minimize pairwise disagreements

%(Pairwise Agreements) = ROC-Area

(Example: a ranking with 2 pairwise disagreements)

Optimizing Pairwise Preferences

Consider instances (x1,y1) and (x2,y2)

Label order has y1 > y2

Create new training instance (x’, +1) where x’ = (x1 – x2)

Repeat for all instance pairs with label order preference

Optimizing Pairwise Preferences

Result: a new training set! Often represented implicitly

Has only positive examples

Mispredicting means that a lower-ordered instance received a higher score than a higher-ordered instance.
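A small sketch of the pairwise transformation; in practice the pairs are usually represented implicitly rather than materialized, since their number grows quadratically:

```python
import numpy as np

def pairwise_instances(X, y):
    """Return difference vectors x_i - x_j for all pairs with y_i > y_j (labeled +1)."""
    diffs = []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                diffs.append(X[i] - X[j])
    return np.array(diffs)

X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
y = np.array([2, 1, 0])
print(pairwise_instances(X, y))   # three positive difference instances
```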

Pairwise SVM Formulation

argmin over w, ξ:   (1/2)||w||² + (C/N) Σ_{i,j} ξ_{i,j}

Such that:

w·x_i − w·x_j ≥ 1 − ξ_{i,j}   for all pairs (i, j) with y_i > y_j

ξ_{i,j} ≥ 0

[Herbrich et al., 1999]

Can be reduced to O(n·log n) time [Joachims, 2005].

Optimizing Pairwise Preferences

Neural Nets: RankNet [Burges et al., 2005]

Boosting & Hedge-Style Methods: [Cohen et al., 1998], RankBoost [Freund et al., 2003], [Long & Servidio, 2007]

SVMs: [Herbrich et al., 1999], SVM-perf [Joachims, 2005], [Cao et al., 2006]

Rank-Based Measures

Pairwise preferences are not quite right: they assign equal penalty for errors no matter where they occur in the ranking.

People (mostly) care about the top of the ranking; the IR community uses rank-based measures which capture this property.

Rank-Based Measures

Binary relevance: Precision@K (P@K), Mean Average Precision (MAP), Mean Reciprocal Rank (MRR)

Multiple levels of relevance: Normalized Discounted Cumulative Gain (NDCG)

Precision@K

Set a rank threshold K

Compute % relevant in top K

Ignores documents ranked lower than K

Ex: Prec@3 = 2/3, Prec@4 = 2/4, Prec@5 = 3/5
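A sketch, assuming the ranked list is given as binary relevance labels (1 = relevant):

```python
def precision_at_k(ranked_rels, k):
    # Fraction of relevant documents among the top k positions.
    return sum(ranked_rels[:k]) / k

ranked = [1, 0, 1, 0, 1]          # relevance labels down the ranking
print(precision_at_k(ranked, 3))  # 2/3
print(precision_at_k(ranked, 4))  # 2/4
print(precision_at_k(ranked, 5))  # 3/5
```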

Mean Average Precision

Consider the rank position of each relevant doc: K1, K2, …, KR

Compute Precision@K for each K1, K2, … KR

Average precision = average of P@K

Ex: has AvgPrec of (1/3) · (1/1 + 2/3 + 3/5) ≈ 0.76

MAP is Average Precision averaged across multiple queries/rankings
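A sketch that reproduces the slide's example:

```python
def average_precision(ranked_rels):
    # Average of Precision@K taken at the rank of each relevant document.
    precisions, num_rel = [], 0
    for k, rel in enumerate(ranked_rels, start=1):
        if rel:
            num_rel += 1
            precisions.append(num_rel / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([1, 0, 1, 0, 1]))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
```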

Mean Reciprocal Rank

Consider the rank position, K, of the first relevant doc

Reciprocal Rank score = 1/K

MRR is the mean RR across multiple queries
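A sketch of Reciprocal Rank and its mean over several (made-up) query rankings:

```python
def reciprocal_rank(ranked_rels):
    # 1/K where K is the rank of the first relevant document (0 if none).
    for k, rel in enumerate(ranked_rels, start=1):
        if rel:
            return 1.0 / k
    return 0.0

rankings = [[0, 1, 0], [1, 0, 0], [0, 0, 0, 1]]
mrr = sum(reciprocal_rank(r) for r in rankings) / len(rankings)
print(mrr)  # (1/2 + 1/1 + 1/4) / 3 ≈ 0.58
```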

NDCG

Normalized Discounted Cumulative Gain; handles multiple levels of relevance

DCG: contribution of the ith rank position:  (2^{y_i} − 1) / log(i + 1)

Ex: has DCG score of  1/log(2) + 3/log(3) + 1/log(4) + 0/log(5) + 1/log(6) ≈ 5.45

NDCG is normalized DCG: the best possible ranking has score NDCG = 1
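A sketch matching the slide's example when the natural logarithm is used for the discount (base-2 logs are also common):

```python
import math

def dcg(rels):
    # Position i (1-indexed) contributes (2^y_i - 1) / log(i + 1).
    return sum((2 ** y - 1) / math.log(i + 1)
               for i, y in enumerate(rels, start=1))

def ndcg(rels):
    ideal = dcg(sorted(rels, reverse=True))   # best possible ordering
    return dcg(rels) / ideal if ideal > 0 else 0.0

rels = [1, 2, 1, 0, 1]          # graded relevance down the ranking
print(round(dcg(rels), 2))      # 5.45, as in the slide's example
print(round(ndcg(rels), 2))
```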

Optimizing Rank-Based Measures

Let's directly optimize these measures, as opposed to some proxy (pairwise prefs)

But…

The objective function no longer decomposes (pairwise prefs decomposed into individual pairs)

The objective function is flat or discontinuous

Discontinuity Example

NDCG is computed using rank positions; the ranking comes from the retrieval scores.

Doc  Retrieval Score  Rank  Relevance
D1   0.9              1     0
D2   0.6              2     1
D3   0.3              3     0

NDCG = 0.63

Slight changes to model parameters → slight changes to retrieval scores → no change to ranking → no change to NDCG

NDCG is discontinuous w.r.t. model parameters!

[Yue & Burges, 2007]

Optimizing Rank-Based Measures

Relaxed upper bounds: Structural SVMs for hinge-loss relaxation, e.g., SVM-map [Yue et al., 2007], [Chapelle et al., 2007]

Boosting for exponential-loss relaxation: [Zheng et al., 2007], AdaRank [Xu et al., 2007]

Smooth approximations for gradient descent: LambdaRank [Burges et al., 2006], SoftRank GP [Snelson & Guiver, 2007]

Structural SVMs

Let x denote the set of document/query examples for a query; let y denote a (weak) ranking.

Same objective function:

Constraints are defined for each incorrect labeling y’ over the set of documents x.

After learning w, a prediction is made by sorting on wᵀx_i

argmin over w, ξ ≥ 0:   (1/2)||w||² + (C/N) Σ_i ξ_i

∀ y' ≠ y:   wᵀΨ(x, y) ≥ wᵀΨ(x, y') + Δ(y, y') − ξ

[Tsochantaridis et al., 2007]

Structural SVMs for MAP

Minimize:   (1/2)||w||² + (C/N) Σ_i ξ_i

subject to:   ∀ y' ≠ y:   wᵀΨ(x, y) ≥ wᵀΨ(x, y') + Δ(y, y') − ξ

where   Ψ(x, y) = Σ_{i: rel} Σ_{j: ¬rel} y_ij (x_i − x_j)   ( y_ij ∈ {−1, +1} )

and   Δ(y, y') = 1 − Avgprec(y')

Sum of slacks upper bounds the MAP loss.

[Yue et al., 2007]

Too Many Constraints!

For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in the front, e.g.,

An incorrect labeling would be any other ranking, e.g.,

This ranking has an Average Precision of about 0.8, with Δ(y, y') ≈ 0.2

Intractable number of rankings, thus an intractable number of constraints!

Structural SVM Training

STEP 1: Solve the SVM objective function using only the current working set of constraints.

STEP 2: Using the model learned in STEP 1, find the most violated constraint from the exponential set of constraints.

STEP 3: If the constraint returned in STEP 2 is more violated than the most violated constraint in the working set by some small constant, add that constraint to the working set.

Repeat STEP 1-3 until no additional constraints are added. Return the most recent model that was trained in STEP 1.

STEP 1-3 is guaranteed to loop for at most a polynomial number of iterations. [Tsochantaridis et al., 2005]
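A schematic Python rendering of this loop; solve_qp, find_most_violated, and violation_of are placeholders for the quadratic-program solver and the loss-specific constraint oracle, not a real API:

```python
def cutting_plane_train(examples, solve_qp, find_most_violated, violation_of, eps=1e-3):
    """Schematic cutting-plane training loop for a structural SVM.

    solve_qp(constraints)            -> model fit to the working set   (STEP 1)
    find_most_violated(model, x, y)  -> worst labeling y' for (x, y)   (STEP 2)
    violation_of(model, x, y, y_bad) -> how strongly that constraint is violated
    """
    working_set = []
    while True:
        model = solve_qp(working_set)                      # STEP 1
        added = False
        for x, y in examples:
            y_bad = find_most_violated(model, x, y)        # STEP 2
            if violation_of(model, x, y, y_bad) > eps:     # STEP 3
                working_set.append((x, y, y_bad))
                added = True
        if not added:
            # Terminates after polynomially many iterations [Tsochantaridis et al., 2005].
            return model
```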

Illustrative Example

Original SVM problem: exponentially many constraints; most are dominated by a small set of "important" constraints.

Structural SVM approach: repeatedly finds the next most violated constraint… until the set of constraints is a good approximation.


Finding Most Violated Constraint

Required for structural SVM training. Depends on the structure of the loss function and of the joint discriminant. Efficient algorithms exist despite the intractable number of constraints.

More than one approach: [Yue et al., 2007], [Chapelle et al., 2007]

Gradient Descent

The objective function is discontinuous: difficult to define a smooth global approximation, and upper-bound relaxations (e.g., SVMs, Boosting) are sometimes too loose.

We only need the gradient! But the objective is discontinuous… so the gradient is undefined.

Solution: a smooth, local approximation of the gradient.

LambdaRank

Assume implicit objective function C

Goal: compute dC/dsi

si = f(xi) denotes score of document xi

Given the gradient on document scores, use the chain rule to compute the gradient on the model parameters (of f). [Burges et al., 2006]

Intuition: rank-based measures emphasize the top of the ranking, so higher-ranked docs should have larger derivatives (red arrows in the original figure); optimizing pairwise preferences emphasizes the bottom of the ranking (black arrows).

[Burges, 2007]

LambdaRank for NDCG

The pairwise derivative for pair i, j is:

λ_ij = ΔNDCG(i, j) · 1 / (1 + exp(s_i − s_j))

The total derivative for output s_i is:

∂C/∂s_i = Σ_{j ∈ D_i} λ_ij − Σ_{j: i ∈ D_j} λ_ji
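A sketch of accumulating these lambdas into per-document gradients for one query; delta_ndcg is any caller-supplied function returning the (absolute) NDCG change from swapping documents i and j, and the pairing convention is simplified relative to [Burges et al., 2006]:

```python
import math

def lambda_gradients(scores, relevances, delta_ndcg):
    """Accumulate dC/ds_i from pairwise lambdas for a single query."""
    n = len(scores)
    grad = [0.0] * n
    for i in range(n):
        for j in range(n):
            if relevances[i] > relevances[j]:
                # lambda_ij = ΔNDCG(i,j) / (1 + exp(s_i - s_j))
                lam = delta_ndcg(i, j) / (1.0 + math.exp(scores[i] - scores[j]))
                grad[i] += lam    # more relevant doc is pushed up
                grad[j] -= lam    # less relevant doc is pushed down
    return grad

scores = [0.9, 0.6, 0.3]
rels = [0, 2, 1]
# For illustration only, weight every pair equally (a real ΔNDCG depends on positions).
print(lambda_gradients(scores, rels, delta_ndcg=lambda i, j: 1.0))
```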

Properties of LambdaRank¹

There exists a cost function C if, for all i, j:   ∂λ_i/∂s_j = ∂λ_j/∂s_i

This amounts to the Hessian of C being symmetric.

If the Hessian is also positive semi-definite, then C is convex.

¹Subject to additional assumptions – see [Burges et al., 2006]

Summary (Part 1)

Machine learning is a powerful tool for designing information retrieval models

Requires clean formulation of objective

Advances: ordinal regression; dealing with severe class imbalances; optimizing rank-based measures via relaxations; gradient descent on non-smooth objective functions
