Incorporating Clicks, Attention and Satisfaction into a SERP Evaluation Model



Aleksandr Chuklin¶,§ Maarten de Rijke§


¶ Google Research Europe, § University of Amsterdam


Background


Search Engine Result Page (SERP) Evaluation

Main problem

Combining the relevance of individual SERP items ($R_k$) into a whole-page metric.




Examples

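Precision at N:

\[ P@N = \frac{1}{N} \sum_{k=1}^{N} R_k \]

Discounted Cumulative Gain (DCG):

\[ DCG@N = \sum_{k=1}^{N} \frac{1}{\log_2(1 + k)} \cdot R_k \]

Model-Based Metrics (Chuklin et al. 2013):

\[ Utility@N = \sum_{k=1}^{N} P(C_k = 1) \cdot R_k \]

[Figure: a result list showing, from top to bottom, document 3, document 4, document 1, document 2, document 5.]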
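The three example metrics can be compared side by side in a few lines of Python. This is only a sketch: the relevance labels and click probabilities below are made-up values, and in a model-based metric the click probabilities would come from a trained click model rather than being typed in by hand.

```python
import math

def precision_at_n(rels, n):
    """P@N: average relevance of the top-n items."""
    return sum(rels[:n]) / n

def dcg_at_n(rels, n):
    """DCG@N with the 1 / log2(1 + k) discount, ranks k starting at 1."""
    return sum(r / math.log2(1 + k) for k, r in enumerate(rels[:n], start=1))

def utility_at_n(rels, click_probs, n):
    """Model-based utility: relevance weighted by the click probability P(C_k = 1)."""
    return sum(p * r for p, r in zip(click_probs[:n], rels[:n]))

# Hypothetical binary relevance labels and click probabilities for five items.
rels = [1, 0, 1, 1, 0]
click_probs = [0.6, 0.3, 0.2, 0.1, 0.05]

p5 = precision_at_n(rels, 5)
dcg5 = dcg_at_n(rels, 5)
u5 = utility_at_n(rels, click_probs, 5)
```

All three metrics share the same shape, a weighted sum of per-item relevance; they differ only in where the weights come from (uniform, a fixed rank discount, or a user model).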







Main Goal of This Paper

Better measure for SERP utility

Namely, improve this (Chuklin et al. 2013):

\[ \sum_{k=1}^{N} P(C_k = 1) \cdot R_k \]


Motivation


Complex Heterogeneous SERPs



Motivation 1: Non-Trivial Attention Patterns

[Figure: excerpt from Diaz et al. (2013). Figure 1 of that paper shows a module-level representation of mouse-tracking data for the query "hello world" in three panels, (a) Presentation, (b) Arrangement and (c) Mouse Data; the session sequence for this data would be [1, 3, 5, 6, 7, 6, 5, 3, 5]. Figure 3 of that paper shows the linear scan model and its relaxation.]

Image credits: F. Diaz, R.W. White, G. Buscher, and D. Liebling. Robust models of mouse movement on dynamic web search results pages. In CIKM, 2013. ACM Press.



Motivation 2: Satisfaction Without Clicks

High direct page utility (measured by DCG or ERR) leads to a higher abandonment rate (SERPs with no clicks).

[Figure; axis label: direct page utility.]

Image credits: from A. Chuklin and P. Serdyukov. Good abandonments in factoid queries. In WWW, 2012.



Problems of Existing Models and Evaluation Metrics

existing models mostly do not model non-trivial user attention patterns

existing models do not use explicit user satisfaction data












Model & Metric


Clicks + Attention + Satisfaction (CAS) Model

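[Diagram: graphical model of the CAS model with nodes SERP, $\vec{\varphi}_k$ (features), $E_k$ (examination) and $C_k$ (click) for each SERP item $k$, plus Utility and $S$ (satisfaction).]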







Click Model

Examination assumption: a click happens only when an item was examined and attractive:

\[ P(C_k = 1) = P(E_k = 1) \cdot P(C_k = 1 \mid E_k = 1) \]

N.B. Here we assume that $P(C_k = 1 \mid E_k = 1) = \alpha(\vec{R}_k)$, where $\vec{R}_k$ comes from the raters and $\alpha$ is a logistic function.
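A minimal sketch of the examination assumption follows. It assumes that the rater input for an item is a histogram of rating counts and that the logistic function is applied to a weighted sum of that histogram; the weights, bias and three-grade rating scale are hypothetical and stand in for parameters that would be learned from data.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def attractiveness(rating_hist, weights, bias):
    """alpha(R_k): logistic function of the rating histogram (hypothetical weights)."""
    return logistic(bias + sum(w * h for w, h in zip(weights, rating_hist)))

def click_probability(p_exam, rating_hist, weights, bias):
    """Examination assumption: P(C=1) = P(E=1) * P(C=1 | E=1)."""
    return p_exam * attractiveness(rating_hist, weights, bias)

# Hypothetical item rated by three raters on a three-grade scale:
# counts for (irrelevant, relevant, highly relevant).
hist = [0, 1, 2]
weights = [-1.0, 0.5, 1.0]
p_click = click_probability(p_exam=0.8, rating_hist=hist, weights=weights, bias=-1.0)
```

Because the click probability is a product, an item that is never examined can never be clicked, however relevant the raters judged it.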






Attention (Examination) Model

Logistic regression model:

\[ P(E_k = 1) = \varepsilon(\vec{\varphi}_k), \]

where $\vec{\varphi}_k$ is a vector of features for SERP item $k$.

Feature group   # of features   Features
rank            1               user-perceived rank of the SERP item (can be different from k)
CSS classes     10              SERP item type (Web, News, Weather, Currency, Knowledge Panel, etc.)
geometry        5               offset from the top, first or second column (binary), width (w), height (h), w × h
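The attention model can be sketched as a feature builder followed by a logistic regression. Everything concrete here is an assumption made for illustration: only three of the ten item types are listed, geometry is expressed as fractions of the page, and the weights and bias are invented rather than learned.

```python
import math

ITEM_TYPES = ["web", "news", "weather"]  # hypothetical subset of the 10 CSS classes

def features(rank, item_type, top, second_column, width, height):
    """Build the feature vector phi_k: rank, one-hot item type, geometry."""
    one_hot = [1.0 if item_type == t else 0.0 for t in ITEM_TYPES]
    geometry = [top, 1.0 if second_column else 0.0, width, height, width * height]
    return [float(rank)] + one_hot + geometry

def examination_probability(phi, weights, bias):
    """P(E_k = 1) as a logistic regression over phi_k."""
    z = bias + sum(w * x for w, x in zip(weights, phi))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical news item at perceived rank 1, near the top of the first column.
phi = features(rank=1, item_type="news", top=0.1, second_column=False,
               width=0.6, height=0.2)
weights = [-0.3, 0.0, 0.4, 0.8, -1.0, -0.5, 0.2, 0.5, 1.0]  # invented values
bias = 0.5
p_exam = examination_probability(phi, weights, bias)
```

With a negative weight on rank, the same item placed lower on the page receives a lower examination probability, while the type and geometry features let the model deviate from a purely rank-based scan.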








Satisfaction Model

in previous models, satisfaction comes only from clicked results;

in our model it also comes from the SERP items that simply attracted attention;

\[ P(S = 1) = \sigma(\tau_0 + U) = \sigma\left( \tau_0 + \sum_k P(E_k = 1)\, u_d(\vec{D}_k) + \sum_k P(C_k = 1)\, u_r(\vec{R}_k) \right) \]

where $\vec{D}_k$ and $\vec{R}_k$ are ratings assigned by the raters for direct snippet relevance and result relevance respectively. $u_d$ and $u_r$ are linear functions of rating histograms.
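The satisfaction probability can be sketched as below. The two-grade rating histograms, the weight vectors for the linear functions $u_d$ and $u_r$, and the value of $\tau_0$ are all hypothetical; the examination and click probabilities are given directly instead of being produced by the attention and click models.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear_utility(hist, weights):
    """u_d or u_r: a linear function of a rating histogram."""
    return sum(w * h for w, h in zip(weights, hist))

def satisfaction_probability(items, w_d, w_r, tau0):
    """P(S=1) = sigma(tau0 + sum_k P(E_k) u_d(D_k) + sum_k P(C_k) u_r(R_k))."""
    u = 0.0
    for item in items:
        u += item["p_exam"] * linear_utility(item["D"], w_d)
        u += item["p_click"] * linear_utility(item["R"], w_r)
    return sigmoid(tau0 + u)

# Histograms count ratings for (not relevant, relevant); all values are invented.
items = [
    {"p_exam": 0.9, "p_click": 0.0, "D": [0, 3], "R": [3, 0]},  # answer in snippet
    {"p_exam": 0.5, "p_click": 0.2, "D": [3, 0], "R": [1, 2]},  # useful after click
]
w_d, w_r, tau0 = [0.0, 0.5], [0.0, 0.4], -1.0
p_sat = satisfaction_probability(items, w_d, w_r, tau0)
```

The first item is never clicked in this example, yet it raises the predicted satisfaction through the attention term alone.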







The CAS Metric

Utility that determines the satisfaction probability:

\[ U = \underbrace{\sum_k P(E_k = 1)\, u_d(\vec{D}_k)}_{\text{NEW}} + \underbrace{\sum_k P(C_k = 1)\, u_r(\vec{R}_k)}_{\text{Chuklin et al. 2013}} \]

has an additional term

trained on mousing and satisfaction (in addition to clicks)
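To see what the additional term changes, the sketch below scores two invented single-item SERPs with the click term alone and with the full utility. The probabilities and per-item utilities are hypothetical numbers chosen to make the contrast visible, not values from the experiments.

```python
def cas_utility(items):
    """Return the (attention term, click term) of U for a list of SERP items."""
    attention = sum(it["p_exam"] * it["u_d"] for it in items)
    clicks = sum(it["p_click"] * it["u_r"] for it in items)
    return attention, clicks

# SERP A answers the query in the snippet, so it is rarely clicked.
serp_a = [{"p_exam": 0.9, "p_click": 0.1, "u_d": 1.0, "u_r": 1.0}]
# SERP B is only useful after a click.
serp_b = [{"p_exam": 0.9, "p_click": 0.5, "u_d": 0.0, "u_r": 1.0}]

att_a, clk_a = cas_utility(serp_a)
att_b, clk_b = cas_utility(serp_b)

click_only_prefers_b = clk_b > clk_a
full_utility_prefers_a = (att_a + clk_a) > (att_b + clk_b)
```

In this toy case the click-only utility ranks B above A, while the full utility ranks A above B, which is the behaviour one would want for SERPs that satisfy users without a click.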







Experimental Setup


Dataset

199 queries with explicit unambiguous feedback (satisfied / not satisfied);

1,739 rated results

direct snippet relevance (D)

result relevance (R)






Baselines and CAS Model Variants

UBM: model that agrees well with online team-draft experimental outcomes;

PBM: position-based model, a robust model with fewer parameters than UBM;

random: model that predicts click and satisfaction with fixed probabilities (learned from the data);

uUBM: from Chuklin et al. 2013. Similar to UBM, but parameters are trained on a different and much bigger dataset;

CASnod: a stripped-down version that does not use (D) labels;

CASnosat: a version of the CAS model that does not include the satisfaction term while optimizing the model;

CASnoreg: a version of the CAS model that does not use regularization while training. All other models were trained with L2-regularization.





Results


Is the New Metric Really New?
Correlation Between Metrics

Table: Correlation between metrics measured by average Pearson's correlation coefficient.

            CASnosat  CASnoreg  CAS    UBM    PBM    DCG    uUBM
CASnod      0.593     0.564     0.633  0.470  0.487  0.546  0.441
CASnosat    -         0.664     0.715  0.707  0.668  0.735  0.684
CASnoreg    -         -         0.974  0.363  0.379  0.417  0.341
CAS         -         -         -      0.377  0.394  0.440  0.360
UBM         -         -         -      -      0.814  0.972  0.882
PBM         -         -         -      -      -      0.906  0.965
DCG         -         -         -      -      -      -      0.943
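Each cell in a table like this is a Pearson correlation between the scores that two metrics assign to the same set of SERPs. A small self-contained sketch is given below; the score lists are invented, and how the scores were split and averaged in the experiments is not specified here.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores of three metrics on the same four SERPs.
metric_a = [0.2, 0.5, 0.9, 0.4]
metric_b = [0.4, 1.0, 1.8, 0.8]   # a rescaled copy of metric_a
metric_c = [0.9, 0.1, 0.3, 0.7]

r_ab = pearson(metric_a, metric_b)
r_ac = pearson(metric_a, metric_c)
```

Two metrics that differ only by scale correlate perfectly, so a low correlation indicates that a metric orders SERPs in a genuinely different way.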



Is the New Metric Measuring the Right Thing?
Metric Correlation with True Satisfaction

[Figure: Pearson correlation coefficient between different model-based metrics (CASnod, CASnosat, CASnoreg, CAS, UBM, PBM, random, DCG, uUBM) and the user-reported satisfaction; vertical axis from −0.2 to 0.6.]



Bonus Point
Log-Likelihood of Click Prediction

[Figure: log-likelihood of the click data for CASnod, CASnosat, CASnoreg, CAS, UBM, PBM, random and uUBM; vertical axis from −4.5 to −1.5.]

Note that uUBM was trained on a totally different dataset.
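As a rough illustration of the quantity being compared, the sketch below computes the log-likelihood of observed clicks on one SERP. It assumes that clicks on different items are treated as independent Bernoulli outcomes and that predicted probabilities lie strictly between 0 and 1; the normalization used for the reported numbers is not stated here, and the probabilities are invented.

```python
import math

def click_log_likelihood(click_probs, clicks):
    """Sum of log P(observed outcome) over the items of one SERP."""
    ll = 0.0
    for p, c in zip(click_probs, clicks):
        ll += math.log(p) if c else math.log(1.0 - p)
    return ll

# Observed clicks on a three-item SERP and two hypothetical models' predictions.
clicks = [1, 0, 0]
model_good = [0.6, 0.3, 0.1]
model_bad = [0.1, 0.3, 0.6]

ll_good = click_log_likelihood(model_good, clicks)
ll_bad = click_log_likelihood(model_bad, clicks)
```

The log-likelihood is never positive, and values closer to zero indicate a model whose predicted click probabilities agree better with the observed clicks.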


Summary


Summary

A model-based metric needs to model satisfaction explicitly and use it for training.

Direct snippet relevance (D) is essential for predicting satisfaction.

The CAS metric is quite different from the previously used metrics, making it an interesting addition to TREC.

When used as a model, CAS consistently predicts user satisfaction with a relatively small penalty in click prediction.










Acknowledgments

All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.



Evaluating the User Model
Log-Likelihood of Satisfaction Prediction

[Figure: log-likelihood of the satisfaction prediction for CASnod, CASnosat, CASnoreg, CAS, UBM, PBM, random and uUBM; vertical axis from −0.8 to −0.2.]

Some models have log-likelihood below −0.8, hence there are no boxes for them.



Analyzing the Attention Features

CASrank is the model that only uses the rank to predict attention;

CASnogeom only uses the rank and SERP item type information and does not use geometry;

CASnoclass does not use the CSS class features (SERP item type).

[Figure: three panels comparing CASrank, CASnogeom, CASnoclass, CASnod and CAS: Pearson correlation with satisfaction (vertical axis from −0.2 to 0.6), log-likelihood of clicks (vertical axis from −2.5 to −1.7) and log-likelihood of satisfaction (vertical axis from −0.65 to −0.20).]



Heterogeneous SERPs

12% of the SERPs in our data are heterogeneous and our metric does well for them.

Table: Pearson correlation between utility of heterogeneous SERP and user-reported satisfaction.

CAS      UBM        PBM       random  DCG       uUBM
0.60     0.38       -0.05     -0.39   0.24      -0.08

CASrank  CASnogeom  CASclass  CASnod  CASnosat  CASnoreg
0.15     -0.04      0.27      -0.04   0.48      0.67



Spammers

Some raters were filtered out as spammers, but there was still some natural disagreement:

Table: Filtered out workers and agreement scores for remaining workers.

Label   % of workers removed   % of ratings removed   Cohen's kappa   Krippendorff's alpha
(D)     32%                    27%                    0.339           0.144
(R)     41%                    29%                    0.348           0.117
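For readers unfamiliar with the agreement scores, Cohen's kappa for a pair of raters can be computed as in the sketch below. The two label lists are invented, and how agreement was aggregated across many workers for the table is not described here, so this shows only the standard two-rater definition.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's own label distribution.
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical binary relevance labels from two raters on eight items.
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1]
rater_2 = [1, 0, 0, 0, 1, 1, 1, 1]

kappa = cohens_kappa(rater_1, rater_2)
```

Kappa discounts the agreement expected by chance: here the raters agree on 6 of 8 items, but because both use the label 1 most of the time, the chance-corrected score is well below 0.75.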

