SIGIR Tutorial July 7th 2014
Grace Hui Yang
Marc Sloan
Jun Wang
Guest Speaker: Emine Yilmaz
Dynamic Information Retrieval Modeling
Age of Empires
Dynamic Information Retrieval
[Diagram: a user with an information need explores a space of documents, some of which have already been observed.]
Devise a strategy for helping the user explore the information space in order to learn which documents are relevant and which aren't, and satisfy their information need.
Evolving IR
Paradigm shifts in IR as new models emerge
e.g. VSM → BM25 → Language Model
Different ways of defining the relationship between query and document
Static → Interactive → Dynamic
Evolution in modeling user interaction with the search engine
Outline
Introduction
Static IR
Interactive IR
Dynamic IR
Theory and Models
Session Search
Reranking
Guest Talk: Evaluation
Conceptual Model – Static IR
[Diagram: Static IR → Interactive IR → Dynamic IR, with Static IR highlighted. No feedback from the user.]
Characteristics of Static IR
Does not learn directly from user
Parameters updated periodically
Static Information Retrieval Model
Learning to Rank
Commonly Used Static IR Models
BM25
PageRank
Language Model
Feedback in IR
Outline
Introduction
Static IR
Interactive IR
Dynamic IR
Theory and Models
Session Search
Reranking
Guest Talk: Evaluation
Conceptual Model – Interactive IR
[Diagram: Static IR → Interactive IR → Dynamic IR, with Interactive IR highlighted. Exploits user feedback.]
Interactive User Feedback
Like, dislike, pause, skip
Learn the user's taste interactively!
At the same time, provide good recommendations!
Interactive Recommender Systems
Example - Multi Page Search
Ambiguous Query
Example - Multi Page Search
Topic: Car
Example - Multi Page Search
Topic: Animal
Example – Interactive Search
Click on 'car' webpage
Example – Interactive Search
Click on 'Next Page'
Example – Interactive Search
Page 2 results: Cars
Example – Interactive Search
Click on 'animal' webpage
Example – Interactive Search
Page 2 results: Animals
Example – Dynamic Search
Topic: Guitar
Example – Dynamic Search
Diversified page 1
Topics: cars, animals, guitars
Toy Example
Multi-Page search scenario
User image searches for “jaguar”
Rank two of the four results over two pages:
(four candidate images)  r = 0.5   r = 0.51   r = 0.9   r = 0.49
Toy Example – Static Ranking
Ranked according to the PRP (Probability Ranking Principle):
Page 1: 1. r = 0.9, 2. r = 0.51
Page 2: 1. r = 0.5, 2. r = 0.49
Toy Example – Relevance Feedback
Interactive Search
Improve 2nd page based on feedback from 1st page
Use clicks as relevance feedback
Rocchio¹ algorithm on terms in the image webpage
$$w_{q'} = \alpha w_q + \frac{\beta}{|D_r|} \sum_{d \in D_r} w_d - \frac{\gamma}{|D_n|} \sum_{d \in D_n} w_d$$
The new query moves closer to relevant documents and away from non-relevant documents
¹ Rocchio, J. J., '71; Baeza-Yates & Ribeiro-Neto, '99
Toy Example – Relevance Feedback
Ranked according to PRP and Rocchio:
Page 1: 1. r = 0.9, 2. r = 0.51 (* marks a click)
Page 2: 1. r = 0.5, 2. r = 0.49, reordered using the click as relevance feedback
Toy Example – Relevance Feedback
No click when searching for animals:
Page 1: 1. r = 0.9, 2. r = 0.51 (no clicks)
Page 2: 1. ?, 2. ? (no feedback to learn from)
Toy Example – Value Function
Optimize both pages using dynamic IR
Bellman equation for value function
Simplified example:
$$V_t(\theta_t, \Sigma_t) = \max_{s_t}\left[\theta_{s_t} + E\left(V_{t+1}(\theta_{t+1}, \Sigma_{t+1}) \mid C_t\right)\right]$$
𝜃𝑡, Σ𝑡 = relevance and covariance of documents for page 𝑡
𝐶𝑡 = clicks on page 𝑡
𝑉𝑡 = ‘value’ of ranking on page 𝑡
Maximize value over all pages based on estimating feedback
Toy Example – Covariance
The covariance matrix represents similarity between the images:
$$\Sigma = \begin{pmatrix} 1 & 0.8 & 0.1 & 0 \\ 0.8 & 1 & 0.1 & 0 \\ 0.1 & 0.1 & 1 & 0.95 \\ 0 & 0 & 0.95 & 1 \end{pmatrix}$$
Toy Example – Myopic Value
For myopic ranking, $V_2 = 16.380$
Toy Example – Myopic Ranking
Page 2 ranking stays the same regardless of clicks
Toy Example – Optimal Value
For optimal ranking, $V_2 = 16.528$
Toy Example – Optimal Ranking
If the car is clicked, the Jaguar logo is more relevant on the next page
Toy Example – Optimal Ranking
In all other scenarios, rank the animal first on the next page
Interactive vs Dynamic IR
Interactive:
• Treats interactions independently
• Responds to immediate feedback
• Static IR used before feedback received
Dynamic:
• Optimizes over all interactions
• Long-term gains
• Models future user feedback
• Also used at the beginning of the interaction
Outline
Introduction
Static IR
Interactive IR
Dynamic IR
Theory and Models
Session Search
Reranking
Guest Talk: Evaluation
Conceptual Model – Dynamic IR
[Diagram: Static IR → Interactive IR → Dynamic IR, with Dynamic IR highlighted. Explores and exploits feedback.]
Characteristics of Dynamic IR
Rich interactions
Query formulation
Document clicks
Document examination
eye movement
mouse movements
etc.
Characteristics of Dynamic IR
Temporal dependency
[Diagram: an information need I drives a sequence of search iterations; at iteration i the user issues query qᵢ, the system returns ranked documents Dᵢ, and the user produces clicked documents Cᵢ, which feed into iteration i+1, …, up to iteration n.]
Characteristics of Dynamic IR
Overall goal
Optimize over all iterations for goal
IR metric or user satisfaction
Optimal policy
Dynamic IR
Dynamic IR explores actions
Dynamic IR learns from the user and adjusts its actions
May hurt performance in a single stage, but improves over all stages
Applications to IR
Dynamics found in lots of different aspects of IR
Dynamic Users
Users change behaviour over time, user history
Dynamic Documents
Information Filtering, document content change
Dynamic Queries
Changing query meaning, e.g. 'Twitter'
Dynamic Information Needs
Topic ontologies evolve over time
Dynamic Relevance
Seasonal/time of day change in relevance
User Interactivity in DIR
Modern IR interfaces
Facets
Verticals
Personalization
Responsive to particular user
Complex log data
Mobile
Richer user interactions
Ads
Adaptive targeting
Big Data
Data set sizes are always increasing
Computational footprint of learning to rank
Rich, sequential data
Example: complex user behaviour models found in log data take into account reading, skipping and re-reading behaviours, using a POMDP¹
¹ Yin He et al., '11
Online Learning to Rank
Learning to rank iteratively on sequential data
Clicks as implicit user feedback/preference
Often uses multi-armed bandit techniques
Examples: using click models to interpret clicks and a contextual bandit to improve learning¹; pairwise comparison of rankings using a duelling bandits formulation²
¹ Katja Hofmann et al., '11  ² Yisong Yue et al., '09
Evaluation
Use complex user interaction data to assess rankings
Compare ranking techniques in online testing
Minimise user dissatisfaction
Examples: modelling cursor activity and correlating it with eye tracking to validate good or bad abandonment¹; interleaving search results from two ranking algorithms to determine which is better²
¹ Jeff Huang et al., '11  ² Olivier Chapelle et al., '12
Filtering and News
Adaptive techniques to personalize information filtering or news recommendation
Understand the complex dynamics of real-world events in search logs
Examples: capturing temporal document change¹; using relevance feedback to adapt threshold sensitivity over time in information filtering to maximise overall utility²; detecting patterns and memes in news cycles and modeling how information spreads³
¹ Dennis Fetterly et al., '03  ² Stephen Robertson, '02  ³ Jure Leskovec et al., '09
Advertising
Behavioural targeting and personalized ads
Learn when to display new ads
Maximise profit from available ads
Examples: using a POMDP and ad correlation to find the optimal ad to display to a user¹; a dynamic click model that can interpret complex user behaviour in logs and apply the results to tail queries and unseen ads²
¹ Shuai Yuan et al., '12  ² Zeyuan Allen Zhu et al., '10
Outline
Introduction
Theory and Models
Session Search
Reranking
Guest Talk: Evaluation
Outline
Introduction
Theory and Models
Why not use supervised learning
Markov Models
Session Search
Reranking
Evaluation
Why not use Supervised Learning for Dynamic IR Modeling?
Lack of enough training data
Dynamic IR problems contain a sequence of dynamic interactions
E.g. a series of queries in a session
Rare to find repeated sequences (close to zero)
Even in large query logs (WSCD 2013 & 2014, query logs from Yandex)
The chance of finding repeated adjacent query pairs is also low:

Dataset   | Repeated Adjacent Query Pairs | Total Adjacent Query Pairs | Repeated Percentage
WSCD 2013 | 476,390                       | 17,784,583                 | 2.68%
WSCD 2014 | 1,959,440                     | 35,376,008                 | 5.54%
Our Solution
Try to find an optimal solution through a sequence of dynamic interactions
Trial and error: learn from repeated, varied attempts which are continued until success
No supervised learning
Trial and Error
q1 – "dulles hotels"
q2 – "dulles airport"
q3 – "dulles airport location"
q4 – "dulles metrostop"
Recap – Characteristics of Dynamic IR
Rich interactions
Query formulation, document clicks, document examination, eye movement, mouse movements, etc.
Temporal dependency
Overall goal
What is a Desirable Model for Dynamic IR?
Model interactions, which means it needs to have placeholders for actions;
Model the information need hidden behind user queries and other interactions;
Set up a reward mechanism to guide the entire search algorithm to adjust its retrieval strategies;
Represent Markov properties to handle the temporal dependency.
A model in a trial-and-error setting will do!
A Markov model will do!
Outline
Introduction
Theory and Models
Why not use supervised learning
Markov Models
Session Search
Reranking
Evaluation
Markov Process
Markov Property¹ (the "memoryless" property): for a system, its next state depends only on its current state:
$$\Pr(S_{i+1} \mid S_i, \ldots, S_0) = \Pr(S_{i+1} \mid S_i)$$
Markov Process: a stochastic process with the Markov property.
e.g. $s_0 \to s_1 \to \cdots \to s_i \to s_{i+1} \to \cdots$
¹ A. A. Markov, '06
Family of Markov Models
Markov Chain
Hidden Markov Model
Markov Decision Process
Partially Observable Markov Decision Process
Multi-armed Bandit
Markov Chain
Example: Google PageRank¹, a discrete-time Markov process (S, M)
State S – web page
Transition probability M
PageRank: how likely a random web surfer will land on a page
$$\mathrm{PageRank}(S) = \frac{1-\alpha}{N} + \alpha \sum_{Y \in \Pi_S} \frac{\mathrm{PageRank}(Y)}{L(Y)}$$
where N is the number of pages, L(Y) is the number of outlinks of Y, Π_S is the set of pages linked to S, and (1−α)/N is the random jump term.
[Figure: a small web graph over pages A–E with their PageRank values.]
The stable state distribution of such an MC is PageRank.
¹ L. Page et al., '99
Hidden Markov Model
A Markov chain whose states are hidden; observable symbols are emitted with some probability according to the states¹. (S, M, O, e)
[Diagram: hidden states s₀ → s₁ → s₂ → … with transition probabilities pᵢ; each state sᵢ emits observation oᵢ with emission probability eᵢ.]
sᵢ – hidden state; pᵢ – transition probability; oᵢ – observation; eᵢ – observation probability (emission probability)
¹ Leonard E. Baum et al., '66
An HMM example for IR
Construct an HMM for each document¹
[Diagram: hidden states alternate between "Document" and "General English"; each emits query terms tᵢ.]
sᵢ – "Document" or "General English"; pᵢ – a₀ or a₁; tᵢ – query term; eᵢ – P(t|D) or P(t|GE)
Document-to-query relevance:
$$P(D \mid q) \propto \prod_{t \in q} \left( a_0 P(t \mid GE) + a_1 P(t \mid D) \right)$$
¹ Miller et al., '99
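A small sketch of this per-document HMM scoring (the two-state mixture above); the tokenized inputs and the value of a₁ are assumptions:

```python
import math
from collections import Counter

def hmm_score(query, doc_tokens, collection_tokens, a1=0.8):
    """log P(D|q) up to a constant: each query term is emitted either by
    the 'Document' state (weight a1) or 'General English' (weight a0).
    Assumes every query term occurs somewhere in the collection."""
    a0 = 1.0 - a1
    d, c = Counter(doc_tokens), Counter(collection_tokens)
    score = 0.0
    for t in query:
        p_d = d[t] / len(doc_tokens)           # P(t | D)
        p_ge = c[t] / len(collection_tokens)   # P(t | GE)
        score += math.log(a0 * p_ge + a1 * p_d)
    return score
```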
Markov Decision Process
MDP extends MC with actions and rewards¹: (S, M, A, R, γ)
[Diagram: states s₀ → s₁ → s₂ → s₃ → …; at state sᵢ the agent takes action aᵢ, receives reward rᵢ, and transitions with probability pᵢ.]
sᵢ – state; aᵢ – action; rᵢ – reward; pᵢ – transition probability
¹ R. Bellman, '57
Definition of MDP
A tuple (S, M, A, R, γ)
S : state space
M: transition matrix
Ma(s, s') = P(s'|s, a)
A: action space
R: reward function
R(s,a) = immediate reward taking action a at state s
γ: discount factor, 0< γ ≤1
policy π
π(s) = the action taken at state s
Goal is to find an optimal policy π* maximizing the expected total rewards.
Policy
Policy: π(s) = a — according to which, select an action a at state s.
Example: π(s₀) = move right and up; π(s₁) = move right and up; π(s₂) = move right
[Slide adapted from Carlos Guestrin's ML lecture]
Value of Policy
Value: V^π(s) — the expected long-term reward starting from s.
Start from s₀ and follow π:
$$V^\pi(s_0) = E_\pi\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \gamma^3 R(s_3) + \gamma^4 R(s_4) + \cdots\right]$$
Future rewards are discounted by γ ∈ [0,1).
[Diagram, built up over three slides: taking action π(s₀) from s₀ may lead to several next states s₁, s₁′, s₁″, each with its own reward; from each of these, π leads on to s₂, s₂′, s₂″, and so on, so the value is an expectation over all such trajectories.]
[Slides adapted from Carlos Guestrin's ML lecture]
Computing the value of a policy
$$\begin{aligned}
V^\pi(s_0) &= E_\pi\left[R(s_0,a) + \gamma R(s_1,a) + \gamma^2 R(s_2,a) + \gamma^3 R(s_3,a) + \cdots\right] \\
&= E_\pi\left[R(s_0,a) + \gamma \sum_{t=1}^{\infty} \gamma^{t-1} R(s_t,a)\right] \\
&= R(s_0,a) + \gamma\, E_\pi\left[\sum_{t=1}^{\infty} \gamma^{t-1} R(s_t,a)\right] \\
&= R(s_0,a) + \gamma \sum_{s'} M_{\pi(s)}(s,s')\, V^\pi(s')
\end{aligned}$$
(the value function; s is the current state, s′ a possible next state)
Optimality — Bellman Equation
The Bellman equation¹ for an MDP is a recursive definition of the optimal (state-)value function V*(·):
$$V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V^*(s') \right]$$
Optimal policy:
$$\pi^*(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V^*(s') \right]$$
¹ R. Bellman, '57
Optimality — Bellman Equation
The Bellman equation can be rewritten in terms of the action-value function Q (the relationship between V and Q):
$$V^*(s) = \max_a Q(s,a)$$
$$Q(s,a) = R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V^*(s')$$
Optimal policy:
$$\pi^*(s) = \arg\max_a Q(s,a)$$
MDP algorithms
Model-based approaches:
Value Iteration
Policy Iteration
Modified Policy Iteration
Prioritized Sweeping
Model-free approaches:
Temporal Difference (TD) Learning
Q-Learning
All solve the Bellman equation for the optimal value V*(s) and the optimal policy π*(s).
[Bellman, '57; Howard, '60; Puterman and Shin, '78; Singh & Sutton, '96; Sutton & Barto, '98; Richard Sutton, '88; Watkins, '92]
[Slide adapted from Carlos Guestrin's ML lecture]
Value Iteration
Initialization: initialize V₀(s) arbitrarily
Loop (iteration i):
$$V_{i+1}(s) \leftarrow \max_a \left[ R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V_i(s') \right]$$
$$\pi(s) \leftarrow \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V_i(s') \right]$$
Stopping criterion: π(s) is good enough
¹ Bellman, '57
Greedy Value Iteration
Initialization: initialize V₀(s) arbitrarily
Iteration:
$$V_{i+1}(s) \leftarrow \max_a \left[ R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V_i(s') \right]$$
Stopping criterion: $\forall s\;\; |V_{i+1}(s) - V_i(s)| < \varepsilon$
Optimal policy:
$$\pi(s) \leftarrow \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V_i(s') \right]$$
¹ Bellman, '57
Greedy Value Iteration
Algorithm:
1. For each state s ∈ S: initialize V₀(s) arbitrarily
2. i ← 0
3. Repeat
   3.1 i ← i + 1
   3.2 For each s ∈ S: $V_i(s) \leftarrow \max_a [R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V_{i-1}(s')]$
   until $\forall s\;\; |V_i(s) - V_{i-1}(s)| < \varepsilon$
4. For each s ∈ S: $\pi(s) \leftarrow \arg\max_a [R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V_i(s')]$
Greedy Value Iteration — worked example
$$V(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V(s') \right]$$
$$M_{a_1} = \begin{pmatrix} 0.3 & 0.7 & 0 \\ 1.0 & 0 & 0 \\ 0.8 & 0.2 & 0 \end{pmatrix} \qquad M_{a_2} = \begin{pmatrix} 0 & 0 & 1.0 \\ 0 & 0.2 & 0.8 \\ 0 & 1.0 & 0 \end{pmatrix}$$
V⁽⁰⁾(S1) = max{R(S1,a1), R(S1,a2)} = 6
V⁽⁰⁾(S2) = max{R(S2,a1), R(S2,a2)} = 4
V⁽⁰⁾(S3) = max{R(S3,a1), R(S3,a2)} = 8
V⁽¹⁾(S1) = max{3 + 0.96·(0.3·6 + 0.7·4), 6 + 0.96·(1.0·8)} = max{3 + 0.96·4.6, 6 + 0.96·8.0} = max{7.416, 13.68} = 13.68
Greedy Value Iteration — worked example (continued)
$$V(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V(s') \right]$$

i   | V⁽ⁱ⁾(S1) | V⁽ⁱ⁾(S2) | V⁽ⁱ⁾(S3)
0   | 6        | 4        | 8
1   | 13.680   | 9.760    | 13.376
2   | 18.841   | 17.133   | 20.380
3   | 25.565   | 22.087   | 25.759
…   | …        | …        | …
200 | 168.039  | 165.316  | 168.793

Resulting policy: π(S1) = a2, π(S2) = a1, π(S3) = a1
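A runnable sketch of greedy value iteration on this example. R(S1,a1) = 3, R(S1,a2) = 6, γ = 0.96 and the per-state reward maxima (6, 4, 8) come from the slides; the individual rewards R(S2,a2) and R(S3,a2) are not given, so the values used below are assumptions (any sufficiently small values reproduce the table):

```python
import numpy as np

# Transition matrices from the slides: M[a][s, s'] = P(s' | s, a)
M = {"a1": np.array([[0.3, 0.7, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.8, 0.2, 0.0]]),
     "a2": np.array([[0.0, 0.0, 1.0],
                     [0.0, 0.2, 0.8],
                     [0.0, 1.0, 0.0]])}
# R[s, k] = R(s, a_k); R(S2,a2) and R(S3,a2) are assumed values
R = np.array([[3.0, 6.0],
              [4.0, 1.0],
              [8.0, 1.0]])
gamma, actions = 0.96, ["a1", "a2"]

V = R.max(axis=1)  # V0(s) = max_a R(s, a), as on the slides
for _ in range(200):
    Q = np.stack([R[:, k] + gamma * M[a] @ V
                  for k, a in enumerate(actions)], axis=1)
    V = Q.max(axis=1)

Q = np.stack([R[:, k] + gamma * M[a] @ V for k, a in enumerate(actions)], axis=1)
print(V)                                        # ≈ [168.04, 165.32, 168.79]
print([actions[k] for k in Q.argmax(axis=1)])   # ['a2', 'a1', 'a1']
```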
Policy Iteration
Initialization:
$V^{\pi_0}(s) \leftarrow 0$; π₀(s) ← arbitrary policy
Iteration (over i):
Policy evaluation:
$$V^{\pi_i}(s) \leftarrow R(s, \pi_i(s)) + \gamma \sum_{s'} M_{\pi_i(s)}(s,s')\, V^{\pi_i}(s')$$
Policy improvement:
$$\pi_{i+1}(s) \leftarrow \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V^{\pi_i}(s') \right]$$
Stopping criterion: the policy stops changing
¹ Howard, '60
Policy Iteration
Algorithm:
1. For each state s ∈ S: V(s) ← 0, π₀(s) ← arbitrary policy; i ← 0
2. Repeat
   2.1 Repeat (policy evaluation)
       For each s ∈ S:
           V′(s) ← V(s)
           $V(s) \leftarrow R(s, \pi_i(s)) + \gamma \sum_{s'} M_{\pi_i(s)}(s,s')\, V(s')$
       until $\forall s\;\; |V(s) - V'(s)| < \varepsilon$
   2.2 For each s ∈ S (policy improvement):
       $\pi_{i+1}(s) \leftarrow \arg\max_a [R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V(s')]$
   2.3 i ← i + 1
   until π_i = π_{i−1}
Modified Policy Iteration
The "policy evaluation" step in Policy Iteration is time-consuming, especially when the state space is large.
Modified Policy Iteration calculates an approximate policy evaluation by running just a few (k) iterations.
[Spectrum: Greedy Value Iteration (k = 1) — Modified Policy Iteration — Policy Iteration (k = ∞)]
Modified Policy Iteration
Algorithm:
1. For each state s ∈ S: V(s) ← 0, π₀(s) ← arbitrary policy; i ← 0
2. Repeat
   2.1 Repeat k times (approximate policy evaluation)
       For each s ∈ S: $V(s) \leftarrow R(s, \pi_i(s)) + \gamma \sum_{s'} M_{\pi_i(s)}(s,s')\, V(s')$
   2.2 For each s ∈ S:
       $\pi_{i+1}(s) \leftarrow \arg\max_a [R(s,a) + \gamma \sum_{s'} M_a(s,s')\, V(s')]$
   2.3 i ← i + 1
   until π_i = π_{i−1}
MDP algorithms (recap)
Model-based approaches: Value Iteration, Policy Iteration, Modified Policy Iteration, Prioritized Sweeping
Model-free approaches: Temporal Difference (TD) Learning, Q-Learning
[Bellman, '57; Howard, '60; Puterman and Shin, '78; Singh & Sutton, '96; Sutton & Barto, '98; Richard Sutton, '88; Watkins, '92]
[Slide adapted from Carlos Guestrin's ML lecture]
Temporal Difference Learning
Monte Carlo sampling can be used for model-free policy evaluation: estimate V^π(s) in the "policy evaluation" step by the average reward of trajectories from s. However, parts of the trajectories can be reused, so we estimate via an expectation over the next state:
$$V^\pi(s) \leftarrow r + \gamma\, E[V^\pi(s') \mid s, a]$$
The simplest estimate: $V^\pi(s) \leftarrow r + \gamma V^\pi(s')$
A smoothed version: $V^\pi(s) \leftarrow \alpha\left(r + \gamma V^\pi(s')\right) + (1-\alpha)\, V^\pi(s)$
TD-learning rule:
$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\left(r + \gamma V^\pi(s') - V^\pi(s)\right)$$
where r is the immediate reward, α is the learning rate, and $r + \gamma V^\pi(s') - V^\pi(s)$ is the temporal difference.
[Richard Sutton, '88; Singh & Sutton, '96; Sutton & Barto, '98]
Algorithm (Temporal Difference Learning):
1. For each state s ∈ S: initialize V^π(s) arbitrarily
2. For each episode:
   2.1 Initialize s
   2.2 Repeat
       2.2.1 take action a at state s according to π
       2.2.2 observe the immediate reward r and the next state s′
       2.2.3 $V^\pi(s) \leftarrow V^\pi(s) + \alpha\left(r + \gamma V^\pi(s') - V^\pi(s)\right)$
       2.2.4 s ← s′
       until s is a terminal state
Q-Learning
TD-learning rule:
$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\left(r + \gamma V^\pi(s') - V^\pi(s)\right)$$
Q-learning rule:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right)$$
with $V(s) = \max_a Q(s,a)$ and $\pi^*(s) = \arg\max_a Q^*(s,a)$, where
$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} M_a(s,s') \max_{a'} Q^*(s',a')$$
Q-Learning
Algorithm:
1. For each s ∈ S and a ∈ A: initialize Q₀(s,a) arbitrarily
2. i ← 0
3. For each episode:
   3.1 Initialize s
   3.2 Repeat
       3.2.1 i ← i + 1
       3.2.2 select an action a at state s according to Q_{i−1}
       3.2.3 take action a, observe the immediate reward r and the next state s′
       3.2.4 $Q_i(s,a) \leftarrow Q_{i-1}(s,a) + \alpha\left(r + \gamma \max_{a'} Q_{i-1}(s',a') - Q_{i-1}(s,a)\right)$
       3.2.5 s ← s′
       until s is a terminal state
4. For each s ∈ S: $\pi(s) \leftarrow \arg\max_a Q_i(s,a)$
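A generic tabular Q-learning sketch with an ε-greedy behaviour policy; the environment interface (reset/step/actions) is an assumption made for illustration:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning. `env` is assumed to expose reset() -> s,
    step(s, a) -> (r, s', done), and a list env.actions."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:                        # explore
                a = random.choice(env.actions)
            else:                                            # exploit
                a = max(env.actions, key=lambda x: Q[(s, x)])
            r, s2, done = env.step(s, a)
            best_next = max(Q[(s2, a2)] for a2 in env.actions)
            # move Q(s,a) toward r + gamma * max_a' Q(s',a')
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q, lambda s: max(env.actions, key=lambda x: Q[(s, x)])
```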
Apply an MDP to an IR Problem
We can model IR systems using a Markov Decision Process:
Is there a temporal component?
States – what changes with each time step?
Actions – how does your system change the state?
Rewards – how do you measure feedback or effectiveness in your problem at each time step?
Transition probability – can you determine this? If not, then a model-free approach is more suitable.
Apply an MDP to an IR Problem – Example
User agent in session search
States – user’s relevance judgement
Action – new query
Reward – information gained
Apply an MDP to an IR Problem – Example
The search engine's perspective:
What if we can't directly observe the user's relevance judgement?
Click ≠ relevance
Family of Markov Models
Markov Chain
Hidden Markov Model
Markov Decision Process
Partially Observable Markov Decision Process
Multi-armed Bandit
POMDP Model
[Diagram: as in an MDP, hidden states s₀ → s₁ → s₂ → s₃ → … with actions a₀, a₁, a₂ and rewards r₀, r₁, r₂, but the agent only receives observations o₁, o₂, o₃ and maintains a belief over the hidden states.]
¹ R. D. Smallwood et al., '73
POMDP Definition
A tuple (S, M, A, R, γ, O, Θ, B)
S: state space
M: transition matrix
A: action space
R: reward function
γ: discount factor, 0 < γ ≤ 1
O: observation set — an observation is a symbol emitted according to a hidden state
Θ: observation function — Θ(s,a,o) is the probability that o is observed when the system transitions into state s after taking action a, i.e. P(o|s,a)
B: belief space — a belief is a probability distribution over the hidden states
POMDP → Belief Update
The agent uses a state estimator to update its belief about the hidden states:
$$b' = SE(b, a, o')$$
$$b'(s') = P(s' \mid o', a, b) = \frac{P(s', o' \mid a, b)}{P(o' \mid a, b)} = \frac{\Theta(s', a, o') \sum_{s} M(s, a, s')\, b(s)}{P(o' \mid a, b)}$$
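A direct sketch of this state estimator; the two-state transition and observation numbers below are hypothetical:

```python
import numpy as np

def belief_update(b, a, o, M, Theta):
    """b' = SE(b, a, o) for a POMDP.
    M[a][s, s'] = P(s' | s, a); Theta[a][s', o] = P(o | s', a)."""
    # numerator: Theta(s', a, o) * sum_s M(s, a, s') * b(s)
    b_new = Theta[a][:, o] * (b @ M[a])
    return b_new / b_new.sum()   # normalize by P(o | a, b)

M = {0: np.array([[0.7, 0.3], [0.2, 0.8]])}       # hypothetical
Theta = {0: np.array([[0.9, 0.1], [0.3, 0.7]])}   # hypothetical
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, M=M, Theta=Theta))
```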
POMDP → Bellman Equation
The Bellman equation for a POMDP:
$$V(b) = \max_a \left[ r(b,a) + \gamma \sum_{o'} P(o' \mid a, b)\, V(b') \right]$$
A POMDP can be transformed into a continuous belief MDP (B, M′, A, r, γ):
B: the continuous belief space
M′: transition function $M'_a(b,b') = \sum_{o' \in O} \mathbf{1}_{a,o'}(b',b)\, \Pr(o' \mid a, b)$, where $\mathbf{1}_{a,o'}(b',b) = 1$ if $SE(b,a,o') = b'$ and 0 otherwise
A: action space
r: reward function $r(b,a) = \sum_{s \in S} b(s)\, R(s,a)$
The optimal policy of a POMDP = the optimal policy of its belief MDP¹
¹ L. Kaelbling et al., '98
Solving POMDPs – The Witness Algorithm
A variation of the value iteration algorithm
Policy Tree
• A policy tree of depth i is an i-step non-stationary policy
• As if we run value iteration until the ith iteration
[Diagram: the root of the tree is an action a(h) with i steps to go; each observation o₁ … o_k … o_l branches to a subtree rooted at an action with i−1 steps to go, and so on down to a single action with 1 step to go.]
Value of a Policy Tree
We can only determine the value of a policy tree h from some belief state b, because the agent never knows the exact state:
$$V_h(b) = \sum_{s \in S} b(s)\, V_h(s)$$
$$V_h(s) = R(s, a(h)) + \gamma \sum_{s' \in S} M_{a(h)}(s,s') \sum_{o_k \in O} \Theta(s', a(h), o_k)\, V_{o_k(h)}(s')$$
where a(h) is the action at the root node of h, and o_k(h) is the (i−1)-step subtree associated with o_k under the root node of h.
Idea of the Witness Algorithm
For each action a, compute Γᵢᵃ, the set of candidate i-step policy trees with action a at their roots.
The optimal value function at the ith step, Vᵢ*(b), is the upper surface of the value functions of all i-step policy trees.
Optimal value function
Geometrically, Vᵢ*(b) is piecewise linear and convex:
$$V_i^*(b) = \max_{h \in H} V_h(b)$$
An example for a two-state POMDP: the simplex constraint b(s₁) + b(s₂) = 1 makes the belief space one-dimensional.
[Figure: the value functions V_{h1}(b), …, V_{h5}(b) are lines over the belief interval; Vᵢ*(b) is their upper surface, and dominated policy trees can be pruned from the set of policy trees.]
Outline of the Witness Algorithm
Algorithm:
1. H₁ ← {}
2. i ← 1
3. Repeat
   3.1 i ← i + 1
   3.2 For each a in A: Γᵢᵃ ← witness(H_{i−1}, a)   (the inner loop)
   3.3 Prune ∪ₐ Γᵢᵃ to get Hᵢ
   until $\sup_b |V_i(b) - V_{i-1}(b)| < \varepsilon$
Inner Loop of the Witness Algorithm
1. Select a belief b arbitrarily. Generate a best i-step policy tree hᵢ. Add hᵢ to an agenda.
2. In each iteration:
   2.1 Select a policy tree h_new from the agenda.
   2.2 Look for a witness point b using Zₐ and h_new.
   2.3 If such a witness point b is found:
       2.3.1 Calculate the best policy tree h_best for b.
       2.3.2 Add h_best to Zₐ.
       2.3.3 Add all the alternative trees of h_best to the agenda.
   2.4 Else remove h_new from the agenda.
3. Repeat the above iteration until the agenda is empty.
Other Solutions
QMDP1
MC-POMDP (Monte Carlo POMDP)2
Grid Based Approximation3
Belief Compression4
……
¹ Thrun et al., '06  ² Thrun et al., '05  ³ Lovejoy, '91  ⁴ Roy, '03
Applying POMDP to Dynamic IR

POMDP | Dynamic IR
Environment | Documents
Agents | User, search engine
States | Queries, user's decision-making status, relevance of documents, etc.
Actions | Provide a ranking of documents; weigh terms in the query; add/remove/keep query terms; switch a search technology on or off; adjust parameters for a search technology
Observations | Queries, clicks, document lists, snippets, terms, etc.
Rewards | Evaluation measures (such as DCG, NDCG or MAP); clicking information
Transition matrix | Given in advance or estimated from training data
Observation function | Problem dependent; estimated from sample datasets
Session Search Example - States
Four states, combining the relevance decision with exploitation vs. exploration:
S_RT – Relevant & Exploitation
S_RR – Relevant & Exploration
S_NRT – Non-Relevant & Exploitation
S_NRR – Non-Relevant & Exploration
Example query transitions from q₀: scooter price → scooter stores; Hartford visitors → Hartford Connecticut tourism; Philadelphia NYC travel → Philadelphia NYC train; distance New York Boston → maps.bing.com
[J. Luo et al., '14]
Session Search Example – Actions (A_u, A_se)
User actions (A_u):
Add query terms (+Δq)
Remove query terms (−Δq)
Keep query terms (q_theme)
Clicked documents / SAT-clicked documents
Search engine actions (A_se):
Increase/decrease/keep term weights
Switch query expansion on or off
Adjust the number of top documents used in PRF
etc.
[J. Luo et al., '14]
Multi Page Search Example – States & Actions
State: relevance of documents
Action: ranking of documents
Observation: clicks
Belief: multivariate Gaussian
Reward: DCG over 2 pages
[Xiaoran Jin et al., '13]
Exercise
Family of Markov Models
Markov Chain
Hidden Markov Model
Markov Decision Process
Partially Observable Markov Decision Process
Multi-Armed Bandit
Multi Armed Bandits (MAB)
[Figure: a row of slot machines. "Which slot machine should I select in this round?" Each play yields a reward.]
Multi Armed Bandits (MAB)
[Figure: after a winning play — "I won! Is this the best slot machine?"]
MAB Definition
A tuple (S, A, R, B)
S: hidden reward distribution of each bandit
A: choose which bandit to play
R: reward for playing a bandit
B: belief space, our estimate of each bandit's distribution
Comparison with Markov Models
A single-state Markov Decision Process
No transition probability
Similar to a POMDP in that we maintain a belief state
Action = choose a bandit; it does not affect the state
Does not 'plan ahead' but intelligently adapts
Somewhere between interactive and dynamic IR
Markov Multi Armed Bandits
[Figure: each slot machine is its own Markov process (Markov Process 1 … Markov Process k). "Which slot machine should I select in this round?" Playing one yields a reward.]
Markov Multi Armed Bandits
[Figure: the same slot machines; selecting a machine is the action, which determines which Markov process advances and yields the reward.]
MAB Policy Reward
An MAB algorithm describes a policy π for choosing bandits
Maximise rewards from the chosen bandits over all time steps
Minimize regret:
$$\sum_{t=1}^{T} \left[ \mathrm{Reward}(a^*) - \mathrm{Reward}(a_{\pi(t)}) \right]$$
the cumulative difference between the optimal reward and the actual reward
Exploration vs Exploitation
Exploration: try out bandits to find which has the highest average reward (too much exploration leads to poor performance)
Exploitation: play bandits that are known to pay out higher reward on average
MAB algorithms balance exploration and exploitation:
Start by exploring more to find the best bandits
Exploit more as the best bandits become known
MAB – Index Algorithms
Gittins index¹
Play the bandit with the highest 'dynamic allocation index'
Modelled using an MDP, but suffers the 'curse of dimensionality'
ε-greedy²
Play the highest-reward bandit with probability 1 − ε; play a random bandit with probability ε
UCB (Upper Confidence Bound)³
Play the bandit i with the highest $\bar{x}_i + \sqrt{\dfrac{2 \ln t}{T_i}}$
The chance of playing infrequently played bandits increases over time
¹ J. C. Gittins, '89  ² Nicolò Cesa-Bianchi et al., '98  ³ P. Auer et al., '02
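A sketch of the UCB index policy just described; the Bernoulli payout probabilities are hypothetical:

```python
import math, random

def ucb1(pull, n_arms, horizon):
    """UCB: play each arm once, then pick the arm maximizing
    mean reward + sqrt(2 ln t / T_i). pull(i) returns a stochastic reward."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            i = t - 1                                   # initial round robin
        else:
            i = max(range(n_arms),
                    key=lambda j: sums[j] / counts[j]
                                  + math.sqrt(2 * math.log(t) / counts[j]))
        counts[i] += 1
        sums[i] += pull(i)
    return counts, sums

# Hypothetical Bernoulli bandits with payout probabilities 0.3, 0.5, 0.7
probs = [0.3, 0.5, 0.7]
counts, _ = ucb1(lambda i: float(random.random() < probs[i]), 3, 10000)
print(counts)   # the 0.7 arm should dominate as exploration tails off
```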
MAB use in IR
Choosing ads to display to users1
Each ad is a bandit
User click through rate is reward
Recommending news articles2
News article is a bandit
Similar to Information Filtering case
Diversifying search results3
Each rank position is an MAB dependent on higher ranks
Documents are bandits chosen by each rank
¹ Deepayan Chakrabarti et al., '09  ² Lihong Li et al., '10  ³ Radlinski et al., '08
MAB Variations
Contextual Bandits1
World has some context x ∈ X (e.g. user location)
Learn policy 𝜋: 𝑋 → 𝐴 that maps context to arms (online or offline)
Duelling Bandits2
Play two (or more) bandits at each time step
Observe relative reward rather than absolute
Learn order of bandits
Mortal Bandits3
Value of bandits decays over time
Exploitation > exploration
¹ Lihong Li et al., '10  ² Yisong Yue et al., '09  ³ Deepayan Chakrabarti et al., '09
Comparison of Markov Models
MC – a fully observable stochastic process
HMM – a partially observable stochastic process
MDP – a fully observable decision process
MAB – a decision process, either fully or partially observable
POMDP – a partially observable decision process

Model | Actions | Rewards | States
MC    | No      | No      | Observable
HMM   | No      | No      | Unobservable
MDP   | Yes     | Yes     | Observable
POMDP | Yes     | Yes     | Unobservable
MAB   | Yes     | Yes     | Fixed
Exercise
Outline
Introduction
Theory and Models
Session Search
Reranking
Guest Talk: Evaluation
TREC Session Tracks (2010-2012)
Given a series of queries {q₁, q₂, …, qₙ}, the top-10 retrieval results {D₁, …, D_{n−1}} for q₁ to q_{n−1}, and click information
The task is to retrieve a list of documents for the current/last query, qₙ
Relevance judgment is based on how relevant the documents are for qₙ, and how relevant they are for the information needs of the entire session (in the topic description)
No need to segment the sessions
TREC 2012 Session 6
1. pocono mountains pennsylvania
2. pocono mountains pennsylvania hotels
3. pocono mountains pennsylvania things to do
4. pocono mountains pennsylvania hotels
5. pocono mountains camelbeach
6. pocono mountains camelbeach hotel
7. pocono mountains chateau resort
8. pocono mountains chateau resort attractions
9. pocono mountains chateau resort getting to
10. chateau resort getting to
11. pocono mountains chateau resort directions
Information need: You are planning a winter vacation to the Pocono Mountains region in Pennsylvania in the US. Where will you stay? What will you do while there? How will you get there?
In a session, queries change constantly
Query change is an important form of feedback
We define query change as the syntactic editing changes between two adjacent queries:
$$\Delta q_i = q_i - q_{i-1}$$
It includes added terms (+Δq) and removed terms (−Δq); the unchanged/shared terms are called theme terms (q_theme).
Example:
q₁ = "bollywood legislation"
q₂ = "bollywood law"
Theme term = "bollywood"
Added (+Δq) = "law"
Removed (−Δq) = "legislation"
Where do these query changes come from?
Given the TREC Session settings, we consider two sources of query change:
the previous search results that a user viewed/read/examined
the information need
Example: Kurosawa → Kurosawa wife
'wife' is not in any previous results, but is in the topic description
However, knowing information needs before search is difficult to achieve
Previous search results can influence query change in quite complex ways
Merck lobbyists → Merck lobbying US policy
D₁ contains several mentions of 'policy', such as "A lobbyist who until 2004 worked as senior policy advisor to Canadian Prime Minister Stephen Harper was hired last month by Merck …"
These mentions are about Canadian policies, while the user adds US policy in q₂
Our guess is that the user might be inspired by 'policy' but prefers a different sub-concept than 'Canadian policy'
Therefore, among the added terms 'US policy', 'US' is the novel term and 'policy' is not, since it appeared in D₁. The two terms should be treated differently.
Applying MDP to Session Search
We propose to model session search as a Markov decision process (MDP)
Two agents: the user and the search engine
Environment: search results
States: queries
Actions:
User actions: add/remove/keep query terms
Search engine actions: increase/decrease/keep term weights
Search Engine Agent's Actions

Term    | ∈ D_{i−1}? | Action    | Example
q_theme | Y          | increase  | "pocono mountain" in s6
q_theme | N          | increase  | "france world cup 98 reaction" in s28: france world cup 98 reaction stock market → france world cup 98 reaction
+Δq     | Y          | decrease  | 'policy' in s37: Merck lobbyists → Merck lobbyists US policy
+Δq     | N          | increase  | 'US' in s37: Merck lobbyists → Merck lobbyists US policy
−Δq     | Y          | decrease  | 'reaction' in s28: france world cup 98 reaction → france world cup 98
−Δq     | N          | no change | 'legislation' in s32: bollywood legislation → bollywood law
Query Change retrieval Model (QCM)
The Bellman equation gives the optimal value for an MDP:
$$V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$$
The reward function is used as the document relevance score function, derived backwards from the Bellman equation:
$$\mathrm{Score}(q_i, d) = \underbrace{P(q_i \mid d)}_{\text{current reward/relevance}} + \gamma\, \underbrace{P(q_i \mid q_{i-1}, D_{i-1}, a)}_{\text{query transition model}}\; \underbrace{\max_{D_{i-1}} P(q_{i-1} \mid D_{i-1})}_{\text{maximum past relevance}}$$
Calculating the Transition Model
According to the query change and the search engine actions, the current reward/relevance score is:
$$\begin{aligned}
\mathrm{Score}(q_i, d) = \log P(q_i \mid d) &+ \alpha \sum_{t \in q_{theme}} \left[1 - P(t \mid d^*_{i-1})\right] \log P(t \mid d) \\
&- \beta \sum_{t \in +\Delta q,\; t \in d^*_{i-1}} P(t \mid d^*_{i-1}) \log P(t \mid d) \\
&+ \epsilon \sum_{t \in +\Delta q,\; t \notin d^*_{i-1}} idf(t) \log P(t \mid d) \\
&- \delta \sum_{t \in -\Delta q} P(t \mid d^*_{i-1}) \log P(t \mid d)
\end{aligned}$$
This increases weights for theme terms, decreases weights for old added terms, increases weights for novel added terms, and decreases weights for removed terms.
Maximizing the Reward Function
Generate a maximum-rewarded document, denoted d*_{i−1}, from D_{i−1}
That is, the document(s) most relevant to q_{i−1}
The relevance score can be calculated as
$$P(q_{i-1} \mid d_{i-1}) = 1 - \prod_{t \in q_{i-1}} \left\{1 - P(t \mid d_{i-1})\right\}$$
$$P(t \mid d_{i-1}) = \frac{\#(t, d_{i-1})}{|d_{i-1}|}$$
From several options, we choose to use only the document with the top relevance:
$$\max_{D_{i-1}} P(q_{i-1} \mid D_{i-1})$$
Scoring the Entire Session
The overall relevance score for a session of queries is aggregated recursively:
$$\begin{aligned}
\mathrm{Score}_{session}(q_n, d) &= \mathrm{Score}(q_n, d) + \gamma\, \mathrm{Score}_{session}(q_{n-1}, d) \\
&= \mathrm{Score}(q_n, d) + \gamma \left[\mathrm{Score}(q_{n-1}, d) + \gamma\, \mathrm{Score}_{session}(q_{n-2}, d)\right] \\
&= \sum_{i=1}^{n} \gamma^{\,n-i}\, \mathrm{Score}(q_i, d)
\end{aligned}$$
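The recursion unrolls into a simple discounted sum; a minimal sketch (the γ value and the per-query scores are illustrative):

```python
def session_score(scores, gamma=0.92):
    """Score_session(q_n, d) = sum_{i=1..n} gamma^(n-i) * Score(q_i, d),
    where scores = [Score(q_1, d), ..., Score(q_n, d)] for one document."""
    n = len(scores)
    return sum(gamma ** (n - i) * s for i, s in enumerate(scores, 1))

print(session_score([1.2, 0.7, 2.1]))  # hypothetical per-query QCM scores
```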
Experiments
TREC 2011-2012 query sets and datasets
ClueWeb09 Category B
Search Accuracy (TREC 2012)
nDCG@10 (official metric used in TREC)

Approach                | nDCG@10 | %chg    | MAP    | %chg
Lemur                   | 0.2474  | −21.54% | 0.1274 | −18.28%
TREC'12 median          | 0.2608  | −17.29% | 0.1440 | −7.63%
Our TREC'12 submission  | 0.3021  | −4.19%  | 0.1490 | −4.43%
TREC'12 best            | 0.3221  | 0.00%   | 0.1559 | 0.00%
QCM                     | 0.3353  | 4.10%†  | 0.1529 | −1.92%
QCM+Dup                 | 0.3368  | 4.56%†  | 0.1537 | −1.41%
Search Accuracy (TREC 2011)
nDCG@10 (official metric used in TREC)

Approach                | nDCG@10 | %chg    | MAP    | %chg
Lemur                   | 0.3378  | −23.38% | 0.1118 | −25.86%
TREC'11 median          | 0.3544  | −19.62% | 0.1143 | −24.20%
TREC'11 best            | 0.4409  | 0.00%   | 0.1508 | 0.00%
QCM                     | 0.4728  | 7.24%†  | 0.1713 | 13.59%†
QCM+Dup                 | 0.4821  | 9.34%†  | 0.1714 | 13.66%†
Our TREC'12 submission  | 0.4836  | 9.68%†  | 0.1724 | 14.32%†
Search Accuracy for Different Session Types
TREC 2012 sessions are classified by:
Product: Factual / Intellectual
Goal quality: Specific / Amorphous

Approach  | Intellectual | %chg   | Amorphous | %chg   | Specific | %chg   | Factual | %chg
TREC best | 0.3369       | 0.00%  | 0.3495    | 0.00%  | 0.3007   | 0.00%  | 0.3138  | 0.00%
Nugget    | 0.3305       | −1.90% | 0.3397    | −2.80% | 0.2736   | −9.01% | 0.2871  | −8.51%
QCM       | 0.3870       | 14.87% | 0.3689    | 5.55%  | 0.3091   | 2.79%  | 0.3066  | −2.29%
QCM+DUP   | 0.3900       | 15.76% | 0.3692    | 5.64%  | 0.3114   | 3.56%  | 0.3072  | −2.10%

QCM better handles sessions that demonstrate evolution and exploration, because it treats a session as a continuous process, studying changes across query transitions and modeling the dynamics.
Outline
Introduction
Theory and Models
Session Search
Reranking
Guest Talk: Evaluation
Multi Page Search
[Figure: a two-page search results interface, with ranked lists on page 1 and page 2.]
Relevance Feedback
No UI Changes
Interactivity is Hidden
Private, performed in browser
Relevance Feedback
Page 1:
• Diverse ranking
• Maximise learning potential
• Exploration vs exploitation
Page 2:
• Clickthroughs or explicit ratings
• Respond to feedback from page 1
• Personalized
Model
Prior: document relevances ~ $N(\theta_1, \Sigma_1)$
θ₁ – prior estimate of relevance
Σ₁ – prior estimate of covariance (document similarity, topic clustering)
Model
Rank action for page 1
Feedback from page 1:
$$\mathbf{r} \sim N(\theta_{s^1}, \Sigma_{s^1})$$
Model
Update estimates using r₁ (the standard conditional multivariate Gaussian), partitioning the prior over shown documents s′ and the rest:
$$\theta_1 = \begin{pmatrix} \theta_{\backslash s'} \\ \theta_{s'} \end{pmatrix}, \qquad \Sigma_1 = \begin{pmatrix} \Sigma_{\backslash s'} & \Sigma_{\backslash s', s'} \\ \Sigma_{s', \backslash s'} & \Sigma_{s'} \end{pmatrix}$$
$$\theta_2 = \theta_{\backslash s'} + \Sigma_{\backslash s', s'}\, \Sigma_{s'}^{-1} (r_1 - \theta_{s'})$$
$$\Sigma_2 = \Sigma_{\backslash s'} - \Sigma_{\backslash s', s'}\, \Sigma_{s'}^{-1}\, \Sigma_{s', \backslash s'}$$
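A sketch of this conditional-Gaussian belief update, reusing the toy example's prior relevances and covariance; the observed feedback vector r is hypothetical:

```python
import numpy as np

def gaussian_update(theta, Sigma, shown, r):
    """Condition N(theta, Sigma) over document relevances on observed
    feedback r for the documents in `shown` (an index list); returns the
    posterior mean and covariance over the remaining documents."""
    rest = [i for i in range(len(theta)) if i not in shown]
    S_rr = Sigma[np.ix_(rest, rest)]
    S_rs = Sigma[np.ix_(rest, shown)]
    S_ss = Sigma[np.ix_(shown, shown)]
    gain = S_rs @ np.linalg.inv(S_ss)
    theta2 = theta[rest] + gain @ (r - theta[shown])
    Sigma2 = S_rr - gain @ S_rs.T
    return theta2, Sigma2

theta = np.array([0.5, 0.51, 0.9, 0.49])
Sigma = np.array([[1, 0.8, 0.1, 0], [0.8, 1, 0.1, 0],
                  [0.1, 0.1, 1, 0.95], [0, 0, 0.95, 1]])
# Page 1 showed docs 2 and 1; clicked the first, skipped the second
print(gaussian_update(theta, Sigma, shown=[2, 1], r=np.array([1.0, 0.0])))
```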
Model
Rank using PRP
Model
Utility of a ranking (DCG over the two pages of M results each):
$$U = \lambda \sum_{j=1}^{M} \frac{\theta_{s_j^1}}{\log_2(j+1)} + (1-\lambda) \sum_{j=M+1}^{2M} \frac{\theta_{s_j^2}}{\log_2(j+1)}$$
Model – Bellman Equation
Optimize s¹ to improve U_{s²}:
$$V(\theta_1, \Sigma_1) = \max_{s^1}\left[ \lambda\, \theta_{s^1} \cdot W^1 + \int_{r} \max_{s^2} \left[(1-\lambda)\, \theta_{s^2} \cdot W^2\right] P(r)\, dr \right]$$
where W¹, W² are the DCG position-discount vectors.
𝜆
Balances exploration and exploitation in page 1
Tuned for different queries
Navigational
Informational
𝜆 = 1 for non-ambiguous search
Approximation
Monte Carlo sampling over S simulated feedback vectors r ∈ O:
$$V \approx \max_{s^1}\left[ \lambda\, \theta_{s^1} \cdot W^1 + \frac{1}{S} \sum_{r \in O} \max_{s^2} \left[(1-\lambda)\, \theta_{s^2} \cdot W^2\right] P(r) \right]$$
Sequential ranking decision
Experiment Data
Difficult to evaluate without access to live users
Simulated using 3 TREC collections and relevance judgements:
WT10G – explicit ratings
TREC8 – clickthroughs
Robust – difficult (ambiguous) search
User Simulation
Rank M documents
Simulated user clicks according to relevance judgements
Update page 2 ranking
Measure at page 1 and 2
Recall
Precision
nDCG
MRR
BM25 – prior ranking model
Investigating λ
Baselines
𝜆 determined experimentally
BM25
BM25 with conditional update (𝜆 = 1)
Maximum Marginal Relevance (MMR)
Diversification
MMR with conditional update
Rocchio
Relevance Feedback
Results
[Figures: result plots across the three collections and metrics.]
Similar results across data sets and metrics
2nd-page gain outweighs 1st-page losses
Outperformed Maximum Marginal Relevance when using MRR to measure diversity
BM25-U is simply the no-exploration case
Similar results when M = 5
Outline
Introduction
Theory and Models
Session Search
Reranking
Guest Talk: Evaluation
Dynamic Information Retrieval Evaluation
Emine Yilmaz
University College London
Information Retrieval Systems
Match information seekers with the information they seek
Retrieval Evaluation: Traditional View
[Figure]
Retrieval Evaluation: Dynamic View
[Figures, built up over three slides]
Different Approaches to Evaluation
Online Evaluation
Design interactive experiments
Use users’ actions to evaluate the quality
Inherently dynamic in nature
Offline Evaluation
Controlled laboratory experiments
The users’ interaction with the engine is only simulated
Recent work focused on dynamic IR evaluation
Online Evaluation
Standard click metrics
Clickthrough rate
Probability user skips over results they have considered (pSkip)
Most recently: Result interleaving
Click/no-click signals are used to evaluate
What is result interleaving?
A way to compare rankers online: given the two rankings produced by two methods, present a combination of the rankings to users
Team Draft Interleaving (Radlinski et al., 2008)
Interleaving two rankings:
Input: two rankings ("can be seen as teams who pick players")
Repeat:
  o Toss a coin to see which team (ranking) picks next
  o The winner picks their best remaining player (document)
  o The loser picks their best remaining player (document)
Output: one ranking (2 teams of 5)
Credit assignment: the ranking providing more of the clicked results wins
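A sketch of the team-draft procedure just described; the document IDs are arbitrary:

```python
import random

def team_draft(ranking_a, ranking_b):
    """Team Draft Interleaving (Radlinski et al., 2008): build a combined
    list; each position remembers which ranker contributed it."""
    combined, teams, used = [], [], set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(used) < len(all_docs):
        order = ["A", "B"] if random.random() < 0.5 else ["B", "A"]
        for team in order:
            ranking = ranking_a if team == "A" else ranking_b
            for doc in ranking:              # best remaining "player"
                if doc not in used:
                    combined.append(doc)
                    teams.append(team)
                    used.add(doc)
                    break
    return combined, teams

def credit(teams, clicked_positions):
    """The ranker providing more of the clicked results wins."""
    a = sum(1 for i in clicked_positions if teams[i] == "A")
    b = sum(1 for i in clicked_positions if teams[i] == "B")
    return "A" if a > b else "B" if b > a else "tie"

combined, teams = team_draft(["d1", "d2", "d3"], ["d3", "d4", "d1"])
print(combined, teams, credit(teams, clicked_positions=[1]))
```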
Team Draft Interleaving
Ranking A:
1. Napa Valley – The authority for lodging... www.napavalley.com
2. Napa Valley Wineries - Plan your wine... www.napavalley.com/wineries
3. Napa Valley College www.napavalley.edu/homex.asp
4. Been There | Tips | Napa Valley www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine www.napavintners.com
6. Napa Country, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley

Ranking B:
1. Napa Country, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley – The authority for lodging... www.napavalley.com
3. Napa: The Story of an American Eden... books.google.co.uk/books?isbn=...
4. Napa Valley Hotels – Bed and Breakfast... www.napalinks.com
5. NapaValley.org www.napavalley.org
6. The Napa Valley Marathon www.napavalleymarathon.org

Presented Ranking:
1. Napa Valley – The authority for lodging... www.napavalley.com
2. Napa Country, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden... books.google.co.uk/books?isbn=...
4. Napa Valley Wineries – Plan your wine... www.napavalley.com/wineries
5. Napa Valley Hotels – Bed and Breakfast... www.napalinks.com
6. Napa Valley College www.napavalley.edu/homex.asp
7. NapaValley.org www.napavalley.org

Each presented result is credited to the team (A or B) whose ranking contributed it; with the user's clicks landing on B's contributions, B wins!
Repeat over many different queries!
Offline Evaluation
Controlled laboratory experiments
The user's interaction with the engine is only simulated:
Ask experts to judge each query result
Predict how users behave when they search
Aggregate judgments to evaluate
Offline Evaluation
Until recently, metrics assumed that the user's information need was not affected by the documents read
E.g. Average Precision, NDCG, …
• Users are more likely to stop searching when they see a highly relevant document
• Lately: metrics that incorporate the effect of the relevance of documents seen by the user on user behavior
Based on devising more realistic user models
EBU, ERR [Yilmaz et al. CIKM10, Chapelle et al. CIKM09]
Modeling User Behavior
Cascade-based models
[Figure: a ranked result list for the query "black powder ammunition", ranks 1–10.]
• The user views search results from top to bottom
• At each rank i, the user has a certain probability of being satisfied
• The probability of satisfaction is proportional to the relevance grade of the document at rank i
• Once the user is satisfied with a document, he terminates the search
Rank Biased Precision
[Figure: user model — issue a query, view an item, then either stop or view the next item, stepping down ranks 1–10.]
Rank Biased Precision
[Example: a ranked list for "black powder ammunition".]
With persistence parameter p (the probability of continuing to the next rank):
$$\text{Total utility} = \sum_{i=1}^{\infty} rel_i\, p^{\,i-1}$$
$$\text{Expected num. docs examined} = \sum_{i=1}^{\infty} p^{\,i-1} = \frac{1}{1-p}$$
$$\mathrm{RBP} = \frac{\text{Total utility}}{\text{Num. docs examined}} = (1-p) \sum_{i=1}^{\infty} rel_i\, p^{\,i-1}$$
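A direct computation of RBP from a list of relevances; the persistence parameter p = 0.8 is illustrative:

```python
def rbp(rels, p=0.8):
    """RBP = (1 - p) * sum_{i>=1} rel_i * p^(i-1); `rels` are the
    (binary or graded) relevances of the ranked list, top first."""
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(rels))

print(rbp([1, 0, 1, 1, 0]))
```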
Expected Reciprocal Rank [Chapelle et al CIKM09]
[Figure: user model — at each rank the user asks "Relevant?" (no / somewhat / highly) and either stops, satisfied, or views the next item.]
Expected Reciprocal Rank [Chapelle et al CIKM09]
[Example: a ranked list for "black powder ammunition".]
φ(r): the utility of finding "the perfect document" at rank r; φ(r) = 1/r
$$\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r}\, \Pr(\text{user stops at position } r)$$
$$\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r}\, R_r \prod_{i=1}^{r-1} (1 - R_i)$$
where $R_i = \dfrac{2^{g_i} - 1}{2^{g_{max}}}$ is the probability of relevance of document i (and of stopping at it), and $g_i$ is the relevance grade of the i-th document.
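ERR computes in one pass down the ranking; a minimal sketch, with grades assumed to be on a 0..g_max scale (g_max = 4 here):

```python
def err(grades, g_max=4):
    """Expected Reciprocal Rank (Chapelle et al., CIKM'09).
    grades are relevance grades g_i; R_i = (2^g_i - 1) / 2^g_max."""
    total, p_continue = 0.0, 1.0
    for r, g in enumerate(grades, start=1):
        R = (2 ** g - 1) / 2 ** g_max
        total += p_continue * R / r   # stop here with prob p_continue * R
        p_continue *= (1 - R)
    return total

print(err([4, 0, 2, 1]))
```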
Session Evaluation
[Example session: Paris luxurious hotels → Paris Hilton → J Lo]
What is a good system?
Measuring "goodness":
The user steps down a ranked list of documents, observing each one until a decision point, and either (a) abandons the search, or (b) reformulates
While stepping down or sideways, the user accumulates utility
Evaluation over a single ranked list
[Figure: a session of reformulations — kenya cooking traditional swahili; kenya cooking traditional; kenya swahili traditional food recipes — each with its own ranked list.]
Session DCG [Järvelin et al ECIR 2008]
Each reformulation's ranked list is scored by DCG,
$$\mathrm{DCG}(RL_j) = \sum_{r=1}^{k} \frac{2^{rel(r)} - 1}{\log_b(r + b - 1)}$$
and later reformulations are discounted:
$$\mathrm{sDCG} = \frac{1}{\log_c(1 + c - 1)}\, \mathrm{DCG}(RL_1) + \frac{1}{\log_c(2 + c - 1)}\, \mathrm{DCG}(RL_2) + \cdots$$
Model-based measures
A probabilistic space of users following different paths:
Ω is the space of all paths
P(ω) is the probability of a user following a path ω in Ω
M_ω is a measure over a path ω
[Yang and Lad ICTIR 2009, Kanoulas et al. SIGIR 2011]
[Figure: for queries Q1, Q2, Q3 with ranked lists of relevant (R) and non-relevant (N) documents, the probability of a path = (1) the probability of reformulating at a given rank × (2) the probability of abandoning at a given reformulation.]
Expected Global Utility [Yang and Lad ICTIR 2009]
1. The user steps down ranked results one by one
2. Stops browsing documents based on a stochastic process that defines a stopping probability distribution over ranks, and reformulates
3. Gains something from relevant documents, accumulating utility
[Figure: (1) the probability of abandoning the session at reformulation i is geometric with parameter p_reform; (2) the probability of reformulating at rank j is geometric with parameter p_down.]
Expected Global Utility [Yang and Lad ICTIR 2009]
The probability of a user following a path ω:
P(ω) = P(r₁, r₂, ..., r_K)
where rᵢ is the stopping and reformulation point in list i
Assumption: stopping positions in each list are independent:
P(r₁, r₂, ..., r_K) = P(r₁) P(r₂) ... P(r_K)
Use a geometric distribution (as in RBP) to model the stopping and reformulation behaviour:
$$P(r_i = r) = (1-p)\, p^{\,r-1}$$
Conclusions
Recent focus on evaluating the dynamic nature of the search
process
Interleaving
New offline evaluation metrics
ERR, EBU
Session evaluation metrics
Outline
Introduction
Theory and Models
Session Search
Reranking
Guest Talk: Evaluation
Conclusion
Conclusions
Dynamic IR describes a new class of interactive model
Incorporates rich feedback and temporal dependency, and is goal oriented
The family of Markov models and multi-armed bandit theory are useful in building DIR models
Applicable to a range of IR problems
Useful in applications such as session search and evaluation
Dynamic IR Book
Published by Morgan & Claypool
'Synthesis Lectures on Information Concepts, Retrieval, and Services'
Due March/April 2015 (in time for SIGIR 2015)
Acknowledgment
We thank Dr. Emine Yilmaz for giving the guest talk.
We sincerely thank Dr. Xuchu Dong for his help in the preparation of this tutorial.
We also thank the following colleagues for their comments and suggestions:
Dr. Jamie Callan
Dr. Ophir Frieder
Dr. Fernando Diaz
Dr. Filip Radlinski
Thank You
References
Static IR
Modern Information Retrieval. R. Baeza-Yates and B. Ribeiro-Neto. Addison-Wesley, 1999.
The PageRank Citation Ranking: Bringing Order to the Web. Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. 1999.
Implicit User Modeling for Personalized Search. Xuehua Shen et al. CIKM, 2005.
A Short Introduction to Learning to Rank. Hang Li. IEICE Transactions 94-D(10): 1854-1862, 2011.
References
Interactive IR
Relevance Feedback in Information Retrieval. Rocchio, J. J. The SMART Retrieval System (pp. 313-23), 1971.
A study in interface support mechanisms for interactive information retrieval. Ryen W. White et al. JASIST, 2006.
Visualizing stages during an exploratory search session. Bill Kules et al. HCIR, 2011.
Dynamic Ranked Retrieval. Cristina Brandt et al. WSDM, 2011.
Structured Learning of Two-level Dynamic Rankings. Karthik Raman et al. CIKM, 2011.
References
Dynamic IR
A hidden Markov model information retrieval system. D. R. H. Miller, T. Leek, and R. M. Schwartz. SIGIR '99, pages 214-221.
Threshold setting and performance optimization in adaptive filtering. Stephen Robertson. JIR, 2002.
A large-scale study of the evolution of web pages. Dennis Fetterly et al. WWW, 2003.
Learning diverse rankings with multi-armed bandits. Filip Radlinski, Robert Kleinberg, Thorsten Joachims. ICML, 2008.
Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem. Yisong Yue et al. ICML, 2009.
Meme-tracking and the dynamics of the news cycle. Jure Leskovec et al. KDD, 2009.
References
Dynamic IR
Mortal multi-armed bandits. Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, Eli Upfal. NIPS, 2009.
A Novel Click Model and Its Applications to Online Advertising. Zeyuan Allen Zhu et al. WSDM, 2010.
A contextual-bandit approach to personalized news article recommendation. Lihong Li, Wei Chu, John Langford, Robert E. Schapire. WWW, 2010.
Inferring search behaviors using partially observable markov model with duration (POMD). Yin He et al. WSDM, 2011.
No Clicks, No Problem: Using Cursor Movements to Understand and Improve Search. Jeff Huang et al. CHI, 2011.
Balancing Exploration and Exploitation in Learning to Rank Online. Katja Hofmann et al. ECIR, 2011.
Large-Scale Validation and Analysis of Interleaved Search Evaluation. Olivier Chapelle et al. TOIS, 2012.
References
Dynamic IR
Using Control Theory for Stable and Efficient Recommender Systems. T. Jambor, J. Wang, N. Lathia. WWW '12, pages 11-20.
Sequential selection of correlated ads by POMDPs. Shuai Yuan et al. CIKM, 2012.
Utilizing query change for session search. D. Guan, S. Zhang, and H. Yang. SIGIR '13, pages 453-462.
Query Change as Relevance Feedback in Session Search (short paper). S. Zhang, D. Guan, and H. Yang. SIGIR, 2013.
Interactive exploratory search for multi page search results. X. Jin, M. Sloan, and J. Wang. WWW '13.
Interactive Collaborative Filtering. X. Zhao, W. Zhang, J. Wang. CIKM '13, pages 1411-1420.
Win-win search: Dual-agent stochastic game in session search. J. Luo, S. Zhang, and H. Yang. SIGIR '14.
References
Markov Processes
A markovian decision process. R. Bellman. Indiana University Mathematics Journal, 6:679-684, 1957.
Dynamic Programming. R. Bellman. Princeton University Press, Princeton, NJ, USA, first edition, 1957.
Dynamic Programming and Markov Processes. R. A. Howard. MIT Press, 1960.
Linear Programming and Sequential Decisions. Alan S. Manne. Management Science, 1960.
Statistical Inference for Probabilistic Functions of Finite State Markov Chains. Baum, Leonard E.; Petrie, Ted. The Annals of Mathematical Statistics 37, 1966.
References
Markov Processes
Learning to predict by the methods of temporal differences. Richard Sutton. Machine Learning 3, 1988.
Computationally feasible bounds for partially observed Markov decision processes. W. Lovejoy. Operations Research 39:162-175, 1991.
Q-Learning. Christopher J. C. H. Watkins, Peter Dayan. Machine Learning, 1992.
Reinforcement learning with replacing eligibility traces. Singh, S. P. & Sutton, R. S. Machine Learning, 22, pages 123-158, 1996.
Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto. MIT Press, 1998.
Planning and acting in partially observable stochastic domains. L. Kaelbling, M. Littman, and A. Cassandra. Artificial Intelligence, 101(1-2):99-134, 1998.
References
Markov Processes
Finding approximate POMDP solutions through belief compression. N. Roy. PhD Thesis, Carnegie Mellon, 2003.
VDCBPI: an approximate scalable algorithm for large scale POMDPs. P. Poupart and C. Boutilier. NIPS 2004, pages 1081-1088.
Finding Approximate POMDP Solutions Through Belief Compression. N. Roy, G. Gordon and S. Thrun. Journal of Artificial Intelligence Research, 23:1-40, 2005.
Probabilistic Robotics. S. Thrun, W. Burgard, D. Fox. MIT Press, 2005.
Anytime Point-Based Approximations for Large POMDPs. J. Pineau, G. Gordon and S. Thrun. Volume 27, pages 335-380, 2006.
Probabilistic Robotics. S. Thrun, W. Burgard, D. Fox. The MIT Press, 2006.
References
Markov Processes
The optimal control of partially observable Markov decision processes over a finite horizon. R. D. Smallwood, E. J. Sondik. Operations Research, 1973.
Modified Policy Iteration Algorithms for Discounted Markov Decision Problems. M. L. Puterman and M. C. Shin. Management Science 24, 1978.
An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains. A. A. Markov. Science in Context, 19:591-600, 2006.
Learning to Rank for Information Retrieval. Tie-Yan Liu. Springer Science & Business Media, 2011.
Finite-Time Regret Bounds for the Multiarmed Bandit Problem. Nicolò Cesa-Bianchi, Paul Fischer. ICML, pages 100-108, 1998.
Multi-armed bandit allocation indices. J. C. Gittins. Wiley, 1989.
Finite-time Analysis of the Multiarmed Bandit Problem. Peter Auer et al. Machine Learning 47(2-3), 2002.