Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st , 2008 Yisong Yue Cornell University Joint work with: Thorsten Joachims (advisor), Filip Radlinski, Thomas Finley, Robert Kleinberg, Josef Broder
• Active Learning
  – Learning From Real Users
  – Multi-armed Bandit Problems
Supervised Learning
• Find a function from input space X to output space Y such that the prediction error is low.
Microsoft announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Microsoft officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters…
• Natural Language Parsing
  – Given a sequence of words x, predict the parse tree y.
  – Dependencies from structural constraints, since y has to be a tree.

[Figure: parse tree y (S → NP VP, with Det/N/V leaves) for the input sentence x = "The dog chased the cat"]
• Part-of-Speech Tagging
  – Given a sequence of words x, predict the sequence of tags y.
  – Dependencies from tag-tag transitions in a Markov model.
  – Similarly for other sequence labeling problems, e.g., RNA Intron/Exon Tagging.

[Figure: x = "The rain wet the cat", y = Det N V Det N]
Examples of Complex Output Spaces
• Multi-class Labeling
• Sequence Alignment
• Grammar Trees & POS Tagging
• Markov Random Fields
• Clustering
• Information Retrieval (Rankings)
– Average Precision & NDCG
– Listwise Approaches
– Diversity
– More Complex Goals
Information Retrieval
• Input: x (feature representation of a document/query pair)
• Conventional Approach
  – Real-valued retrieval function f(x)
  – Sort by f(xi) to obtain ranking
• Training Method
  – Human-labeled data (documents labeled by relevance)
  – Learn f(x) using a relatively simple criterion
  – Computationally convenient
  – Works pretty well (but we can do better)
Conventional SVMs
• Input: x (high dimensional point)
• Target: y (either +1 or -1)
• Prediction: sign(wTx)
• Training:

  \arg\min_{w,\xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\xi_i

  subject to:  \forall i: \; y_i(w^T x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0

• The sum of slacks \sum_i \xi_i upper bounds the accuracy loss
Pairwise Preferences SVM
  \arg\min_{w,\xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i,j}\xi_{i,j}

Such that:

  \forall (i,j) \text{ with } y_i > y_j: \; w^T x_i \ge w^T x_j + 1 - \xi_{i,j}
  \forall i,j: \; \xi_{i,j} \ge 0

Large Margin Ordinal Regression [Herbrich et al., 1999]
Can be reduced to O(n log n) time [Joachims, 2005]
Pairs can be reweighted to more closely model IR goals [Cao et al., 2006]
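As a concrete illustration of the pairwise formulation, the sketch below enumerates the preference pairs that supply the SVM's margin constraints from graded relevance labels. The function and variable names are illustrative, not from the talk.

```python
# Sketch: convert graded relevance labels into the pairwise
# preferences a ranking SVM trains on (one constraint per y_i > y_j).

def pairwise_preferences(relevance):
    """Return (i, j) index pairs where document i should outrank j."""
    return [(i, j)
            for i, ri in enumerate(relevance)
            for j, rj in enumerate(relevance)
            if ri > rj]

# Example: 4 documents with graded relevance labels
prefs = pairwise_preferences([2, 0, 1, 0])
# -> [(0, 1), (0, 2), (0, 3), (2, 1), (2, 3)]
```

Each pair (i, j) yields one margin constraint w^T x_i ≥ w^T x_j + 1 − ξ_{i,j}; the O(n²) pair count is what [Joachims, 2005] reduces to O(n log n).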
Mean Average Precision
• Consider the rank position of each relevant doc: K1, K2, … KR
• Compute Precision@K for each K1, K2, … KR
• Average Precision = average of those P@K values
• Ex: a ranking with relevant docs at ranks 1, 3, 5 has AvgPrec = (1/1 + 2/3 + 3/5)/3 ≈ 0.76
• MAP is Average Precision averaged across multiple queries/rankings
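The AvgPrec computation is easy to state in code; this small helper (my naming, not from the talk) reproduces the example above, where the relevant documents sit at ranks 1, 3, and 5.

```python
def average_precision(ranking):
    """ranking: 0/1 relevance labels, listed in ranked order."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # Precision@K at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

ap = average_precision([1, 0, 1, 0, 1])   # (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
```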
Optimization Challenges
• Rank-based measures are multivariate
  – Cannot decompose (additively) into document pairs
  – Need to exploit other structure
• Defined over rankings
  – Rankings do not vary smoothly
  – Discontinuous w.r.t. model parameters
  – Need some kind of relaxation/approximation
• Gaussian smoothing
  – SoftRank GP [Guiver & Snelson, 2008]
• Upper bound relaxations
  – Exponential Loss w/ Boosting
    • AdaRank [Xu et al., 2007]
  – Hinge Loss w/ Structural SVMs
    • [Chapelle et al., 2007]
    • SVM-map [Yue et al., 2007]
Structured Prediction
• Let x be a structured input (candidate documents)
• Let y be a structured output (ranking)
• Use a joint feature map \Psi(y, x) \in \mathbb{R}^F to encode the compatibility of predicting y for a given x.
  – Captures all the structure of the prediction problem
• Consider linear models: after learning w, we can make predictions via

  \hat{y} = \arg\max_y \; w^T \Psi(y, x)
Linear Discriminant for Ranking
• Let x = (x1, …, xn) denote candidate documents (features)
• Let y_jk ∈ {+1, −1} encode pairwise rank orders
• Feature map is a linear combination of documents:

  \Psi(y, x) = \sum_{j \in \mathrm{rel}} \sum_{k \in \mathrm{nonrel}} y_{jk} (x_j - x_k)

• Prediction \hat{y} = \arg\max_y w^T \Psi(y, x) is made by sorting on document scores w^T x_i
Linear Discriminant for Ranking
• Using pairwise preferences is common in IR
• So far, just reformulated using structured prediction notation:

  w^T \Psi(y, x) = \sum_{j \in \mathrm{rel}} \sum_{k \in \mathrm{nonrel}} y_{jk} \, w^T (x_j - x_k)

• But we won’t decompose into independent pairs
  – Treat the entire ranking as a structured object
  – Allows for optimizing average precision
Structural SVM
• Let x denote a structured input (candidate documents)
• Let y denote a structured output (ranking)
• Standard objective function:

  \min_{w,\xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_i \xi_i

• Constraints are defined for each incorrect labeling y' over the set of documents x:

  \forall i, \; \forall y' \ne y^{(i)}: \quad w^T \Psi(y^{(i)}, x^{(i)}) \ge w^T \Psi(y', x^{(i)}) + \Delta(y', y^{(i)}) - \xi_i
[Y, Finley, Radlinski, Joachims; SIGIR 2007]
Structural SVM for MAP
• Minimize

  \min_{w,\xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_i \xi_i

• subject to

  \forall i, \; \forall y' \ne y^{(i)}: \quad w^T \Psi(y^{(i)}, x^{(i)}) \ge w^T \Psi(y', x^{(i)}) + \Delta(y', y^{(i)}) - \xi_i

  where (y_jk ∈ {−1, +1})

  \Psi(y^{(i)}, x^{(i)}) = \sum_{j \in \mathrm{rel}} \sum_{k \in \mathrm{nonrel}} y_{jk} (x_j^{(i)} - x_k^{(i)})

  and  \Delta(y') = 1 - \mathrm{AvgPrec}(y')

• The sum of slacks is a smooth upper bound on MAP loss.
[Y, Finley, Radlinski, Joachims; SIGIR 2007]
Too Many Constraints!
• For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in the front, e.g.,
• An incorrect labeling would be any other ranking, e.g.,
• This ranking has Average Precision of about 0.8, i.e., Δ(y') ≈ 0.2
• Intractable number of rankings, thus an intractable number of constraints!
Structural SVM Training
Original SVM Problem• Intractable number of constraints
• Most are dominated by a small set of “important” constraints
Structural SVM Approach• Repeatedly finds the next most
violated constraint…
• …until set of constraints is a good approximation.
[Tsochantaridis et al., 2005]
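The constraint-generation loop on this slide can be written generically: a working set of constraints grows until no constraint is violated by more than a tolerance. The working-set solver and the most-violated-constraint oracle are problem-specific plug-ins; the stand-ins in the demo below are toy placeholders, not an actual SVM solver.

```python
# Sketch of the cutting-plane training loop for structural SVMs
# (Tsochantaridis et al., 2005). `solve` and `most_violated` are
# problem-specific; the toy stand-ins below are illustrative only.

def cutting_plane(solve, most_violated, examples, eps=1e-3, max_iters=100):
    """Grow a working set of constraints until none is violated by > eps."""
    working_set = []            # list of (example, y') constraints
    w = solve(working_set)      # current model
    for _ in range(max_iters):
        added = False
        for ex in examples:
            y_bad, violation = most_violated(w, ex)
            if violation > eps:
                working_set.append((ex, y_bad))
                added = True
        if not added:
            break               # all constraints satisfied: done
        w = solve(working_set)
    return w, working_set

# Toy stand-ins (purely illustrative): the "model" w is a scalar threshold
def toy_solve(ws):
    return max((ex for ex, _ in ws), default=0)

def toy_mvc(w, ex):
    return ex, ex - w           # "most violated labeling" and its violation

w, ws = cutting_plane(toy_solve, toy_mvc, [1, 2, 3])
```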
Finding Most Violated Constraint
• A constraint is violated when

  w^T \Psi(y, x) - w^T \Psi(y', x) < \Delta(y') - \xi_i

• Finding the most violated constraint reduces to

  \hat{y}' = \arg\max_{y'} \; w^T \Psi(y', x) + \Delta(y')

• Highly related to inference/prediction:

  \hat{y} = \arg\max_{y} \; w^T \Psi(y, x)
Finding Most Violated Constraint
• Observations:
  – MAP is invariant to the order of documents within a relevance class: swapping two relevant (or two non-relevant) documents does not change MAP.
  – The joint SVM score is optimized by sorting by document score w^T x_j
• Reduces to finding an interleaving between two sorted lists of documents:

  \hat{y}' = \arg\max_{y'} \sum_{j \in \mathrm{rel}} \sum_{k \in \mathrm{nonrel}} y'_{jk} \, (w^T x_j - w^T x_k) + \Delta(y')
Finding Most Violated Constraint

• Start with the perfect ranking
• Consider swapping adjacent relevant/non-relevant documents
• Find the best feasible ranking of the non-relevant document
• Repeat for the next non-relevant document
• Never want to swap past the previous non-relevant document
• Repeat until all non-relevant documents have been considered

  \hat{y}' = \arg\max_{y'} \sum_{j \in \mathrm{rel}} \sum_{k \in \mathrm{nonrel}} y'_{jk} \, (w^T x_j - w^T x_k) + \Delta(y')
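For very small candidate sets, the greedy interleaving above can be sanity-checked by brute force: enumerate every interleaving of the two score-sorted lists and maximize w^T Ψ(y', x) + Δ(y') directly. The document scores below are made up for illustration; the function names are my own.

```python
# Brute-force check of loss-augmented inference for MAP
# (Yue et al., SIGIR 2007): within each relevance class the order is
# fixed by score, so only the interleaving pattern needs enumerating.
from itertools import combinations

def avg_prec(labels):
    hits, ps = 0, []
    for k, rel in enumerate(labels, 1):
        if rel:
            hits += 1
            ps.append(hits / k)
    return sum(ps) / len(ps)

def most_violated(rel_scores, non_scores):
    """Enumerate interleavings of the two score-sorted lists."""
    rel = sorted(rel_scores, reverse=True)
    non = sorted(non_scores, reverse=True)
    n = len(rel) + len(non)
    best, best_obj = None, float("-inf")
    for rel_pos in combinations(range(n), len(rel)):
        labels = [1 if i in rel_pos else 0 for i in range(n)]
        ri, ni = iter(rel), iter(non)
        scores = [next(ri) if l else next(ni) for l in labels]
        # joint score: y'_jk * (s_j - s_k) over (relevant, non-relevant)
        # pairs, with y'_jk = +1 iff the relevant doc is ranked above
        joint = 0.0
        for j in range(n):
            for k in range(n):
                if labels[j] == 1 and labels[k] == 0:
                    joint += (1 if j < k else -1) * (scores[j] - scores[k])
        obj = joint + (1 - avg_prec(labels))   # add Delta(y')
        if obj > best_obj:
            best_obj, best = obj, labels
    return best, best_obj
```

With consistent scores the perfect ranking wins; with a misranked non-relevant document, the violated interleaving wins instead.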
Comparison with other SVM methods
[Figure: bar chart of Mean Average Precision (y-axis, roughly 0.10 to 0.30) across six datasets: TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, and TREC 9/10 Submissions without the best system. SVM-MAP is compared against SVM-ROC, SVM-ACC, SVM-ACC2, SVM-ACC3, and SVM-ACC4.]
Structural SVM for MAP
• Treats rankings as structured objects
• Optimizes a hinge-loss relaxation of MAP
  – Provably minimizes the empirical risk
  – Performance improvement over conventional SVMs
• Relies on a subroutine to find the most violated constraint:

  \hat{y}' = \arg\max_{y'} \; w^T \Psi(y', x) + \Delta(y')

  – Computationally compatible with the linear discriminant
Need for Diversity (in IR)
• Ambiguous Queries
  – Users with different information needs issuing the same textual query
  – E.g., “Jaguar”
  – Want at least one relevant result for each information need
• Learning Queries
  – User interested in “a specific detail or entire breadth of knowledge available” [Swaminathan et al., 2008]
  – Want results with high information diversity
[Figure: screenshots of results from 11/27/2007 for the query “Jaguar”, showing the top and bottom of the first page and result #18.]
Learning to Rank
• Current methods
  – Real-valued retrieval function f(q, d)
  – Sort by f(q, di) to obtain ranking
• Benefits:
  – Know how to perform learning
  – Can optimize for rank-based performance measures
  – Outperforms traditional IR models
• Drawbacks:
  – Cannot account for diversity
  – During prediction, considers each document independently
Example
• Choose K documents with maximal information coverage.
• For K = 3, the optimal set is {D1, D2, D10}
Diversity via Set Cover
• Documents cover information
  – Assume information is partitioned into discrete units.
• Documents overlap in the information covered.
• Selecting K documents with maximal coverage is a set cover problem
  – NP-complete in general
  – Greedy gives a (1 − 1/e) approximation [Khuller et al., 1997]
Diversity via Subtopics
• Current datasets use manually determined subtopic labels
  – E.g., “Use of robots in the world today”
    • Nanorobots
    • Space mission robots
    • Underwater robots
  – Manual partitioning of the total information
  – Relatively reliable
  – Used as training data
Weighted Word Coverage
• Use words to represent units of information
• More distinct words = more information
• Weight word importance
• Does not depend on human labeling
• Goal: select K documents which collectively cover as many distinct (weighted) words as possible
  – Greedy selection yields the (1 − 1/e) bound.
  – Need to find a good weighting function (learning problem).
Example

Document Word Counts:

          V1   V2   V3   V4   V5
    D1               X    X    X
    D2          X         X    X
    D3     X    X    X    X

Word Benefit:  V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5

Marginal Benefit:

            D1   D2   D3   Best
    Iter 1  12   11   10    D1
    Iter 2  --    2    3    D3
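A minimal greedy selection routine (my naming) reproduces the iterations in the table, assuming document word sets consistent with the stated marginal benefits:

```python
# Greedy weighted word coverage: each word v has benefit w(v),
# counted once no matter how many selected documents contain it.
# The word sets below are inferred from the example's marginal benefits.

def greedy_select(docs, benefit, k):
    """Pick k documents maximizing total benefit of covered words."""
    covered, chosen = set(), []
    for _ in range(k):
        best_doc, best_gain = None, -1
        for name, words in docs.items():
            if name in chosen:
                continue
            gain = sum(benefit[v] for v in words - covered)  # marginal benefit
            if gain > best_gain:
                best_doc, best_gain = name, gain
        chosen.append(best_doc)
        covered |= docs[best_doc]
    return chosen, covered

docs = {"D1": {"V3", "V4", "V5"},
        "D2": {"V2", "V4", "V5"},
        "D3": {"V1", "V2", "V3", "V4"}}
benefit = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
chosen, covered = greedy_select(docs, benefit, k=2)   # -> ["D1", "D3"]
```

Iteration 1 picks D1 (gain 12); iteration 2 picks D3, whose marginal gain of 3 (words V1, V2) beats D2's gain of 2.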
Related Work Comparison
• Essential Pages [Swaminathan et al., 2008]
  – Uses a fixed function of word benefit
  – Depends on word frequency in the candidate set
• Our goals
  – Automatically learn a word benefit function
    • Learn to predict set covers
    • Use training data
    • Minimize subtopic loss
  – No prior ML approach (to our knowledge)
Linear Discriminant
• x = (x1, x2, …, xn): candidate documents
• y: subset of x
• V(y): union of words from documents in y
• Discriminant Function:

  w^T \Psi(y, x) = \sum_{v \in V(y)} w^T \phi(v, x)

• φ(v, x): frequency features (e.g., ≥10%, ≥20%, etc.)
• Benefit of covering word v is then w^T φ(v, x)
• Prediction: \hat{y} = \arg\max_y \; w^T \Psi(y, x)
[Y, Joachims; ICML 2008]
Linear Discriminant
• Does NOT reward redundancy
  – Benefit of each word only counted once
• Greedy has a (1 − 1/e)-approximation bound
• Linear (joint feature space)
  – Allows for SVM optimization

  w^T \Psi(y, x) = \sum_{v \in V(y)} w^T \phi(v, x)
[Y, Joachims; ICML 2008]
More Sophisticated Discriminant
• Documents “cover” words to different degrees
  – A document with 5 copies of “Microsoft” might cover it better than another document with only 2 copies.
• Use multiple word sets, V1(y), V2(y), … , VL(y)
• Each Vi(y) contains only words satisfying certain importance criteria.
[Y, Joachims; ICML 2008]
More Sophisticated Discriminant
  \Psi(y, x) = \begin{pmatrix} \sum_{v \in V_1(y)} \phi_1(v, x) \\ \vdots \\ \sum_{v \in V_L(y)} \phi_L(v, x) \end{pmatrix}

• Separate φi for each importance level i.
• The joint feature map is the vector composition of all the φi.
• Prediction: \hat{y} = \arg\max_y \; w^T \Psi(y, x)
• Greedy has a (1 − 1/e)-approximation bound.
• Still uses a linear feature space.
[Y, Joachims; ICML 2008]
Weighted Subtopic Loss
• Example:
  – x1 covers t1
  – x2 covers t1, t2, t3
  – x3 covers t1, t3
• Motivation
  – Higher penalty for not covering popular subtopics
  – Mitigates effects of label noise in tail subtopics

        # Docs   Loss
    t1     3      1/2
    t2     1      1/6
    t3     2      1/3
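The loss weights in the table follow from normalizing subtopic popularity: each subtopic's miss penalty is proportional to how many documents cover it. A short sketch (hypothetical naming) reproduces them:

```python
# Weighted subtopic loss: penalty for missing subtopic t is
# (# docs covering t) / (total coverage count), so penalties sum to 1.
from fractions import Fraction

def subtopic_weights(coverage):
    """coverage: dict subtopic -> number of documents covering it."""
    total = sum(coverage.values())
    return {t: Fraction(n, total) for t, n in coverage.items()}

weights = subtopic_weights({"t1": 3, "t2": 1, "t3": 2})
# -> t1: 1/2, t2: 1/6, t3: 1/3, matching the table
```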
Structural SVM
• Input: x (candidate set of documents)
• Target: y (subset of x of size K)
• Same objective function:

  \min_{w,\xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_i \xi_i

• Constraints for each incorrect labeling y':

  \forall i, \; \forall y' \ne y^{(i)}: \quad w^T \Psi(y^{(i)}, x^{(i)}) \ge w^T \Psi(y', x^{(i)}) + \Delta(y', y^{(i)}) - \xi_i

• Score of the best y must be at least as large as that of any incorrect y' plus its loss
• Finding the most violated constraint is also a set cover problem
TREC Experiments
• TREC 6-8 Interactive Track Queries
• Documents labeled into subtopics
• 17 queries used
  – Considered only relevant docs
  – Decouples the relevance problem from the diversity problem
• Query: “High Energy Physics”
  – Return a blog feed?
  – Return the blog front page?
  – Return individual blog posts?
  – Optimizing for MAP? Diversity?
Active Learning
• Batch Learning
  – Learns a model using pre-collected training data
  – Assumes training data is representative of unseen data
  – Most-studied machine learning paradigm
    • Very successful in a wide range of applications
  – Includes most work on structured prediction
• Active Learning:
  – Can be applied directly to live users
    • Representative of real users
  – Removes the cost of human-labeled training data
    • Time / Money / Reliability
Implicit Feedback
• Users provide feedback while searching
  – What results they click on
  – How they reformulate queries
  – The length of time from issuing a query to clicking on a result
  – Geographical data
  – User-specific data
    • Personal search history
    • Age / gender / profession / etc.
Presentation Bias in Click Results
[Granka et al., 2004]
Biased Implicit Feedback
• Users are biased towards the top of rankings
  – Passive collection results in very biased training data
  – No feedback for relevant documents outside the top 10
  – Most prior work focuses on passive collection
• Our goals
  – Use active learning methods to gather unbiased implicit feedback
  – Still present good results to users while learning
    • Learn “on the fly”
Preferences Between Rankings
• Interleave two rankings into one ranking
  – Users click more on documents from the better ranking
• Avoid showing users poor results
  – Quickly determine bad rankings
  – Algorithm must learn “online”
• Formulate as a multi-armed bandit problem
Dueling Bandits Problem
• Given bandits (retrieval functions) r1, …, rN
• Each time step compares two bandits
• Comparison is noisy
  – Some probability of saying the worse bandit is better
  – Each comparison independent
• Choose the pair (r_t, r_t') to minimize regret:

  R_T = \sum_{t=1}^{T} \left[ P(r^* > r_t) + P(r^* > r_t') - 1 \right]

  (% of users who prefer the best bandit over the chosen ones)

[Broder, Kleinberg, Y; work in progress]
Regret Minimization
• If regret is sublinear in T (e.g., log T)
  – Then average regret (over time) tends to 0
• Want average regret to approach 0 quickly
  – R_T should be as small as possible
[Broder, Kleinberg, Y; work in progress]
Results
• Let ε be the distinguishability of the top two bandits (r*, r**):
  – P(r* > r**) = 1/2 + ε
• Known lower bound:

  R_T = \Omega\left( \frac{N}{\epsilon} \log T \right)

• Interleaved Filter achieves regret

  R_T = O\left( \frac{N}{\epsilon} \log T \right)

• Information-theoretically optimal (up to constant factors)
[Broder, Kleinberg, Y; work in progress]
Assumptions
• Strong Stochastic Transitivity
  – For three bandits r_i > r_j > r_k:  ε_{i,k} ≥ max(ε_{i,j}, ε_{j,k})
• Stochastic Triangle Inequality
  – For three bandits r_i > r_j > r_k:  ε_{i,k} ≤ K(ε_{i,j} + ε_{j,k})
• Satisfied by many standard generative models
  – E.g., Logistic / Bradley-Terry (K = 2)
Interleaved Filter

• Choose a candidate bandit at random
• Make noisy comparisons (Bernoulli trials) against all other bandits in turn
  – Maintain a mean and confidence interval for each
• ...until another bandit is better with confidence 1 − δ
• Repeat the process with the new candidate
  – Remove all empirically worse bandits
• Continue until 1 candidate is left
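The procedure above can be simulated in a few lines. This is a toy, reproducible sketch: the comparison model (a strictly better bandit always wins its duel) and the confidence-interval formula sqrt(log(1/δ)/t) are simplifying assumptions for illustration, not the algorithm's exact analysis.

```python
# Illustrative simulation of Interleaved Filter: the candidate is
# dueled against all remaining bandits until one beats it with
# confidence 1 - delta; that bandit becomes the new candidate and
# empirically worse bandits are pruned.
import math

def interleaved_filter(n, beats, delta=0.01, max_steps=10000):
    """beats(i, j) -> True if bandit i wins a single duel against j.
    Returns the index of the surviving bandit."""
    remaining = list(range(n))
    cand = remaining.pop(0)
    wins = {b: 0 for b in remaining}    # candidate's wins vs each rival
    plays = {b: 0 for b in remaining}
    for _ in range(max_steps):
        if not remaining:
            return cand
        for b in list(remaining):
            plays[b] += 1
            wins[b] += 1 if beats(cand, b) else 0
            mean = wins[b] / plays[b]
            ci = math.sqrt(math.log(1 / delta) / plays[b])
            if mean + ci < 0.5:         # b beats candidate confidently
                # prune bandits the candidate was empirically beating
                remaining = [r for r in remaining
                             if r != b and wins[r] / plays[r] < 0.5]
                cand = b
                wins = {r: 0 for r in remaining}
                plays = {r: 0 for r in remaining}
                break
            if mean - ci > 0.5:         # candidate confidently beats b
                remaining.remove(b)
    return cand

# Deterministic stand-in: lower index = better bandit, always wins
winner = interleaved_filter(5, beats=lambda i, j: i < j)
```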
Regret Analysis
• Stops comparing at 1 − δ confidence
  – Concludes one bandit is better
• An appropriate choice of δ (δ = 1/(N²T)) leads to a 1 − 1/T probability of finding r*
• Regret is 0 whenever we choose r*
  – Only accumulate regret while finding r*
Naïve Approach
• In the deterministic case, O(N) comparisons suffice to find the max
• Extend to the noisy case:
  – Maintain a current candidate
  – Run comparisons against 1 other bandit until 1 − δ confidence
  – Take the better bandit as the new candidate
  – Repeat until all bandits have been considered
• Problem:
  – If the current candidate is awful, many comparisons are needed to determine which awful bandit is better
  – Incur high regret for each comparison

  R_T = O\left( \frac{N^2}{\epsilon} \log T \right)
Naïve vs Interleaved Filter
• Naïve performs poorly due to matches between two awful bandits
  – Too many comparisons
  – Accumulates high regret
• Interleaved Filter bounds all matches using the current candidate vs the best
  – Stops when a better bandit is found
  – Regret bounded
Naïve vs Interleaved Filter
• But Naïve concentrates on only 2 bandits at any point in time
• Interleaved Filter compares 1 bandit vs the rest simultaneously
  – One might expect an N² blowup in regret... or at least N log N
• Yet the bounds are:

  Naïve:  R_T = O\left( \frac{N^2}{\epsilon} \log T \right)    Interleaved Filter:  R_T = O\left( \frac{N}{\epsilon} \log T \right)
Regret Analysis
• Define a round to be all the time steps for a particular candidate bandit
  – O(log N) rounds total w.h.p.
• Define a match to be all the comparisons between two bandits within a round
  – O(N) matches in each round
  – At most O(N log N) total matches
• At the end of each round:
  – Remove empirically inferior bandits
  – A “constant fraction” of bandits is removed after each round
Regret Analysis
• O(log N) rounds played
• A “constant fraction” of bandits removed at the end of each round
  – O(N) total matches w.h.p.
• Each match incurs regret O((1/ε) log T)
• Expected regret:

  E[R_T] = \left(1 - \frac{1}{T}\right) O\left(\frac{N}{\epsilon} \log T\right) + \frac{1}{T} \cdot O(T) = O\left(\frac{N}{\epsilon} \log T\right)
Dueling Bandits Problem
• Uses a natural (and simple) regret formulation

  R_T = \sum_{t=1}^{T} \left[ P(r^* > r_t) + P(r^* > r_t') - 1 \right]

  – Captures preference for the best possible retrieval function
  – Consistent with unbiased ranking preference feedback [Radlinski et al., 2008]
• An online/bandit formulation of finding the max with noisy comparisons
• Interleaved Filter achieves the best possible regret bound: logarithmic in T, linear in N

  R_T = O\left( \frac{N}{\epsilon} \log T \right)
Related Work
• Other forms of implicit feedback
  – Preferences between documents within a ranking
• Other active learning techniques
  – Bandit algorithm for minimizing abandonment [Radlinski et al., 2008]
  – Active exploration of pairwise document preferences [Radlinski et al., 2007]
  – These approaches cannot generalize across queries
• Most learning approaches use passive collection
  – Susceptible to presentation bias
Moving Forward
• Limitations
  – Assumes user preferences are static
    • Interleaved Filter first explores, then commits
  – Assumes a finite set of ranking functions
    • Should assume a continuous parameter space
• Future directions
  – Use Interleaved Filter as an optimization engine
    • Collect a finite sample from a continuous parameter space
  – Look at completely new problem formulations
  – Progress towards live user studies
Summary
• Structured prediction for complex retrieval problems
  – Rank-based performance measures
  – Diversity
  – Potentially much more!
• Active learning using unbiased implicit feedback
  – Learn directly from users (cheaper & more accurate)
  – Active learning for structured prediction models?