Searching Web Better. Dr Wilfred Ng, Department of Computer Science, The Hong Kong University of Science and Technology.
Jan 05, 2016

Page 1: Searching Web Better

1

Searching Web Better

Dr Wilfred NgDepartment of Computer Science

The Hong Kong University of Science and Technology

Page 2: Searching Web Better

2

Outline

Introduction Main Techniques (RSCF)

Clickthrough Data Ranking Support Vector Machine Algorithm Ranking SVM in Co-training Framework

The RSCF-based Metasearch Engine Search Engine Components Feature Extraction Experiments

Current Development

Page 3: Searching Web Better

3

Search Engine Adaptation

Google, MSNsearch, Wisenut, Overture, …

Domains: Computer Science, Finance, Social Science

Adapt the search engine by learning from implicit feedback: clickthrough data

(Query categories in the diagram: CS terms, Products, News)

Page 4: Searching Web Better

4

Clickthrough Data

Clickthrough data: data that indicates which links in the returned ranking results have been clicked by the users

Formally, a triplet (q, r, c): q – the input query; r – the ranking result presented to the user; c – the set of links the user clicked on

Benefits: can be obtained in a timely manner; no intervention in the user’s search activity
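The triplet can be represented directly; below is a minimal Python sketch (the class and field names are illustrative, not from the system described here):

```python
from dataclasses import dataclass

# A minimal sketch of the clickthrough triplet (q, r, c).
@dataclass
class Clickthrough:
    query: str      # q - the input query
    ranking: list   # r - links in the order presented to the user
    clicked: set    # c - the links the user clicked on

log = Clickthrough(
    query="support vector machine",
    ranking=["l1", "l2", "l3", "l4", "l5"],
    clicked={"l1", "l3"},
)
```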

Page 5: Searching Web Better

5

An Example of Clickthrough Data

(Figure: the returned ranking of links l1, l2, …, with labels “User’s input query” and “Clicked by the user”)

Page 6: Searching Web Better

6

Target Ranking (Preference Pairs Set)

Arising from l1: empty set
Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9
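The rule this table illustrates, per Joachims’ interpretation used on these slides (a clicked link is preferred over every unclicked link ranked above it), can be sketched as:

```python
def preference_pairs(ranking, clicked):
    """Joachims-style pairs: each clicked link is preferred over every
    unclicked link ranked above it, i.e. l_clicked <r l_unclicked."""
    pairs = []
    for i, li in enumerate(ranking):
        if li in clicked:
            pairs.extend((li, lj) for lj in ranking[:i] if lj not in clicked)
    return pairs

# the slide's example: links l1..l10, clicks on l1, l7, l10
r = [f"l{i}" for i in range(1, 11)]
pairs = preference_pairs(r, {"l1", "l7", "l10"})
# l1 contributes no pairs; l7 contributes 5; l10 contributes 7
```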

Page 7: Searching Web Better

7

An Example of Clickthrough Data

(Figure: as before, the returned links with “User’s input query” and “Clicked by the user” marked; links l1–l10 form the labelled data set and l11, l12, … the unlabelled data set)

Page 8: Searching Web Better

8

Arising from l1: empty set
Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9

Labelled data set: l1, l2, …, l10
Unlabelled data set: l11, l12, …

Target Ranking (Preference Pairs Set )

Page 9: Searching Web Better

9

The Ranking SVM Algorithm

Three links l1, l2, l3, each described by a feature vector

Target ranking: l1 <r’ l2 <r’ l3

Weight vector -- Ranker

Distance between two closest projected links

Cons: It needs a large set of labelled data

(Figure: the links l1, l2, l3 and their projections l1’, l2’, l3’ onto candidate weight vectors)
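A toy illustration of the idea (this is a stochastic-gradient sketch of the pairwise hinge loss, not the actual quadratic-program solver such as SVMlight that Ranking SVM implementations use):

```python
def rank_svm(pairs, lr=0.1, lam=0.01, epochs=500):
    """Toy Ranking SVM: for each pair (xa, xb) meaning "a should rank
    above b", push the weight vector w toward w.(xa - xb) >= 1."""
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for xa, xb in pairs:
            d = [a - b for a, b in zip(xa, xb)]
            margin = sum(wi * di for wi, di in zip(w, d))
            if margin < 1.0:   # hinge active: move toward satisfying the pair
                w = [wi + lr * (di - lam * wi) for wi, di in zip(w, d)]
            else:              # pair satisfied: only shrink (regularize) w
                w = [wi * (1.0 - lr * lam) for wi in w]
    return w

# target ranking l1 <r' l2 <r' l3 over two-feature links
x1, x2, x3 = (3.0, 1.0), (2.0, 1.0), (1.0, 0.0)
w = rank_svm([(x1, x2), (x2, x3), (x1, x3)])
scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in (x1, x2, x3)]
```

The learned w projects the links so that their scores reproduce the target order.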

Page 10: Searching Web Better

10

The Ranking SVM in Co-training Framework

Divide the feature vector into two subvectors

Two rankers are built over these two feature subvectors

Each ranker chooses several unlabelled preference pairs and adds them to the labelled data set

Rebuild each ranker from the augmented labelled data set

(Diagram: labelled preference feedback pairs P_l and unlabelled preference pairs P_u; rankers a_A and a_B are trained on P_l, select confident pairs from P_u, and feed the augmented pairs back into P_l)
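The co-training loop can be sketched generically; the helpers `split`, `train`, and `confidence` are hypothetical stand-ins for the feature-vector partition, Ranking SVM training, and pair-confidence scoring described above:

```python
def co_train(labelled, unlabelled, split, train, confidence, rounds=5, k=1):
    """split(item) -> (view_a, view_b); train(items) -> ranker;
    confidence(ranker, view) -> how sure the ranker is about the item."""
    L, U = list(labelled), list(unlabelled)
    for _ in range(rounds):
        if not U:
            break
        ranker_a = train([split(x)[0] for x in L])
        ranker_b = train([split(x)[1] for x in L])
        for ranker, view in ((ranker_a, 0), (ranker_b, 1)):
            # move this ranker's k most confident unlabelled items into L
            U.sort(key=lambda x: -confidence(ranker, split(x)[view]))
            L.extend(U[:k])
            del U[:k]
    return train(L)  # final ranker on the augmented labelled set

# toy usage: items are numbers, "training" just takes the mean,
# and confidence is closeness to that mean
split = lambda x: (x, x)
train = lambda xs: sum(xs) / len(xs)
confidence = lambda r, x: -abs(x - r)
final = co_train([1.0, 2.0], [1.5, 10.0, 1.2, 9.5],
                 split, train, confidence, rounds=1, k=1)
```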

Page 11: Searching Web Better

11

Some Issues

Guideline for partitioning the feature vector: after the partition, each subvector must be sufficient for the later ranking

Number of rankers: depends on the number of features

When to terminate the procedure? Prediction difference: indicates the ranking difference between the two rankers

After termination, train a final ranker on the augmented labelled data set

Page 12: Searching Web Better

12

Metasearch Engine

Receives a query from the user

Sends the query to multiple search engines

Combines the retrieved results from the underlying search engines

Presents a unified ranking result to the user

(Diagram: the user’s query goes to the metasearch engine, which forwards it to Search Engines 1…n, combines Retrieved Results 1…n, and returns a unified ranking result)
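As a minimal sketch of the combination step (plain round-robin merging with de-duplication; the RSCF engine instead re-ranks with its learned ranker):

```python
def merge_results(result_lists):
    """Interleave several engines' result lists rank by rank,
    keeping the first occurrence of each link."""
    seen, unified = set(), []
    for rank in range(max(map(len, result_lists))):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                unified.append(results[rank])
    return unified

merged = merge_results([["a", "b", "c"], ["b", "d"], ["a", "e"]])
```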

Page 13: Searching Web Better

13

Search Engine Components

Powered by Inktomi, relatively mature

One of the most powerful search engines nowadays

A new but growing search engine

Ranks links based on the prices paid by the sponsors on the links

Page 14: Searching Web Better

14

Feature Extraction

Ranking Features (12 binary features): Rank(E,T), where E ∈ {M, W, O} and T ∈ {1, 3, 5, 10}
(M: MSNsearch, W: Wisenut, O: Overture)
Indicate the ranking of the links in each underlying search engine

Similarity Features (4 features): Sim_U(q,l), Sim_T(q,t), Sim_C(q,a), Sim_G(q,a)
(URL, Title, Abstract Cover, Abstract Group)
Indicate the similarity between the query and the link

Page 15: Searching Web Better

15

Experiments

Experiment data: within the same domain – Computer science

Objectives:
Offline experiments – compared with RSVM
Online experiments – compared with Google

Page 16: Searching Web Better

16

Prediction Error: the difference between the ranker’s ranking and the target ranking

Target ranking: l1 <r’ l2, l1 <r’ l3, l2 <r’ l3
Projected ranking: l2 <r’ l1, l1 <r’ l3, l2 <r’ l3
Prediction error = 33%

(Figure: l1, l2, l3 and their projections l1’, l2’, l3’ onto the ranker)
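The computation on this slide, as a small sketch:

```python
def prediction_error(target_pairs, scores):
    """Fraction of target preference pairs (a, b) - meaning a should
    rank above b - that the ranker orders the wrong way round."""
    wrong = sum(1 for a, b in target_pairs if scores[a] <= scores[b])
    return wrong / len(target_pairs)

# target: l1 <r' l2, l1 <r' l3, l2 <r' l3; the projection swaps l1 and l2,
# so one of the three pairs is inverted
err = prediction_error([("l1", "l2"), ("l1", "l3"), ("l2", "l3")],
                       {"l1": 2.0, "l2": 3.0, "l3": 1.0})
```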

Page 17: Searching Web Better

17

Offline Experiment (Compared with RSVM)

(Charts: prediction error for 10, 30, and 60 training queries)

R: the ranker trained by the RSVM algorithm on the whole feature vector
A: the ranker trained by the RSCF algorithm on one feature subvector
B: the ranker trained by the RSCF algorithm on the other feature subvector

Prediction error rises again! The number of iterations in the RSCF algorithm is about four to five.

Page 18: Searching Web Better

18

Offline Experiment (Compared with RSVM)

R: the ranker trained by the RSVM algorithm
C: the final ranker trained by the RSCF algorithm

(Chart: overall comparison)

Page 19: Searching Web Better

19

Online Experiment (Compared with Google)

Experiment data: CS terms, e.g. radix sort, TREC collection, …

Experiment Setup:
Combine the results returned by RSCF and those by Google into one shuffled list
Present it to the users in a unified way and record the users’ clicks

Cases    More clicks on RSCF  More clicks on Google  Tie  No clicks  Total
Queries  26                   17                     13   2          58

Page 20: Searching Web Better

20

Experimental Analysis

Features     Weight    Features     Weight
Rank(M,1)    0.1914    Rank(W,1)    0.0184
Rank(M,3)    0.2498    Rank(W,3)    0.1014
Rank(M,5)    0.1152    Rank(W,5)   -0.3021
Rank(M,10)   0.2498    Rank(W,10)  -0.4367
Rank(O,1)   -0.1673    Sim_U(q,l)   0.5382
Rank(O,3)   -0.1229    Sim_T(q,t)   0.4928
Rank(O,5)   -0.4976    Sim_C(q,a)   0.4136
Rank(O,10)   0.4441    Sim_G(q,a)   0.5010


Page 23: Searching Web Better

23

Conclusion on RSCF

Search engine adaptation: the RSCF algorithm

Train on clickthrough data; apply RSVM in the co-training framework

The RSCF-based metasearch engine: offline experiments – better than RSVM; online experiments – better than Google

Page 24: Searching Web Better

24

Current Development

Feature extraction and division
Application in different domains
Search engine personalization
SpyNoby Project: personalized search engine with clickthrough analysis

Page 25: Searching Web Better

25

Modified Target Ranking for Metasearch Engines

If l1 and l7 are from the same underlying search engine, the preference pairs set arising from l1 should be l1 <r l2, l1 <r l3, l1 <r l4, l1 <r l5, l1 <r l6

Advantages: alleviates the penalty on high-ranked links; gives more credit to the ranking ability of the underlying search engines

Page 26: Searching Web Better

26

Modified Target Ranking

Arising from l1: l1 <r l2, l1 <r l3, l1 <r l4, l1 <r l5, l1 <r l6
Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9

Labelled data set: l1, l2, …, l10
Unlabelled data set: l11, l12, …

Page 27: Searching Web Better

27

RSCF-based Metasearch Engine - MEA

(Diagram: the user sends query q to MEA, which forwards it to its underlying search engines, each returning a top-30 result list, and presents a unified ranking result)

Page 28: Searching Web Better

28

RSCF-based Metasearch Engine - MEB User

MEB

query

1. ……2. …………………………30. ……

Unified Ranking

Result

q

qq

q

1. ……2. …………………………30. ……

1. ……2. …………………………30. ……

1. ……2. …………………………30. ……

q

Page 29: Searching Web Better

29

Generating Clickthrough Data

Probability of being clicked on:

    Pr(k) = k^(-ν) / H_ν(n),  where H_ν(n) = Σ_{i=1..n} 1 / i^ν

k: the ranking of the link in the metasearch engine
n: the number of all the links in the metasearch engine
ν: the skewness parameter in Zipf’s law
H_ν(n): the harmonic number

Judge each link’s relevance manually: if the link is irrelevant, it is not clicked on; if the link is relevant, it is clicked on with probability Pr(k)
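A sketch of this click-generation model (function names are illustrative):

```python
import random

def harmonic(n, v):
    """Generalized harmonic number H_v(n) = sum of 1/i**v for i = 1..n."""
    return sum(1.0 / i**v for i in range(1, n + 1))

def click_prob(k, n, v=1.0):
    """Zipf click probability for the link at rank k among n links."""
    return k**(-v) / harmonic(n, v)

def simulate_clicks(relevant, n, v=1.0, rng=random):
    """Irrelevant links are never clicked; a relevant link at rank k
    is clicked with probability Pr(k)."""
    return {k for k in relevant if rng.random() < click_prob(k, n, v)}
```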

Page 30: Searching Web Better

30

Feature Extraction

Ranking Features (binary features): Rank(E,T) – whether the link is ranked within S_T in E, where E ∈ {G, M, W, O} and T ∈ {1, 3, 5, 10, 15, 20, 25, 30}
S1 = {1}, S3 = {2, 3}, S5 = {4, 5}, S10 = {6, 7, 8, 9, 10}, …
(G: Google, M: MSNsearch, W: Wisenut, O: Overture)
Indicate the ranking of the links in each underlying search engine

Similarity Features (4 features): Sim_U(q,l), Sim_T(q,t), Sim_C(q,a), Sim_G(q,a)
Measure the similarity between the query and the link
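The bucketed Rank(E,T) features can be sketched as follows (the dictionary layout is illustrative):

```python
# Rank buckets S_T from the slide: a link at position p in engine E
# fires the binary feature Rank(E,T) for the bucket containing p.
BUCKETS = {1: {1}, 3: {2, 3}, 5: {4, 5}, 10: set(range(6, 11)),
           15: set(range(11, 16)), 20: set(range(16, 21)),
           25: set(range(21, 26)), 30: set(range(26, 31))}

def rank_features(positions):
    """positions: {engine: rank of the link in that engine, or None if
    the engine did not return it}.  Returns {(engine, T): 0 or 1}."""
    feats = {}
    for engine in ("G", "M", "W", "O"):
        p = positions.get(engine)
        for T, bucket in BUCKETS.items():
            feats[(engine, T)] = int(p is not None and p in bucket)
    return feats

f = rank_features({"G": 1, "M": 7, "W": None, "O": 4})
```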

Page 31: Searching Web Better

31

Experiments

Experiment data: three different domains – CS terms, News, E-shopping

Objectives:
Prediction Error – better than RSVM
Top-k Precision – adaptation ability

Page 32: Searching Web Better

32

Top-k Precision

Advantages: precision is easier to obtain than recall; users care only about the top-k links (k = 10)

Evaluation data: 30 queries in each domain
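Top-k precision itself is a one-liner; a small sketch:

```python
def top_k_precision(ranking, relevant, k=10):
    """Fraction of the first k returned links that are judged relevant."""
    top = ranking[:k]
    return sum(1 for link in top if link in relevant) / len(top)

# 3 of the top 5 links are relevant
p = top_k_precision(["l1", "l2", "l3", "l4", "l5"], {"l1", "l3", "l4"}, k=5)
```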

Page 33: Searching Web Better

33

Comparison of Top-k precision

(Charts: top-k precision in the CS terms, News, and E-shopping domains)

Page 34: Searching Web Better

34

Statistical Analysis

Hypothesis Testing (two-sample hypothesis testing about means): used to analyze whether there is a statistically significant difference between the means of two samples

Comparison between Engines   News  E-shopping  CS terms  Combined
MEA vs Google                ≈     >           ≈         >
MEA vs MSNsearch             ≈     >           ≈         >
MEA vs Overture              >     ≈           >         >
MEA vs Wisenut               ≈     >           >         >
MEB vs Google                ≈     >           ≈         >
MEB vs MSNsearch             ≈     ≈           >         >
MEB vs Overture              >     ≈           >         >
MEB vs Wisenut               ≈     >           >         >
MEA vs MEB                   ≈     ≈           ≈         ≈
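As a sketch of the underlying statistic (a two-sample Welch t statistic on made-up precision samples; the study’s actual numbers are not reproduced here):

```python
import statistics as st

def welch_t(xs, ys):
    """Two-sample t statistic for unequal variances (Welch): a large
    absolute value suggests the two means differ significantly."""
    mx, my = st.mean(xs), st.mean(ys)
    vx, vy = st.variance(xs), st.variance(ys)   # sample variances
    return (mx - my) / ((vx / len(xs) + vy / len(ys)) ** 0.5)

# illustrative per-query top-k precisions for two engines
a = [0.60, 0.62, 0.58, 0.61, 0.59]
b = [0.50, 0.52, 0.49, 0.51, 0.48]
t = welch_t(a, b)
```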

Page 35: Searching Web Better

35

Comparison Results

MEA can produce better search quality than Google
Google does not excel in every query category
MEA and MEB are able to adapt to bring out the strengths of each underlying search engine
MEA and MEB are better than, or comparable to, all their underlying search engine components in every query category

The RSCF-based metasearch engine: comparison of prediction error – better than RSVM; comparison of top-k precision – adaptation ability

Page 36: Searching Web Better

36

Spy Naïve Bayes – Motivation

The problem of Joachims’ method: strong assumptions; it excessively penalizes high-ranked links

(In the preference pairs, l1, l2, l3 are apt to appear on the right, as ln, while l7 and l10 appear on the left, as lp)

New interpretation of clickthrough data: clicked – positive (P); unclicked – unlabeled (U), containing both positive and negative samples

Goal: identify Reliable Negatives (RN) from U

Arising from l1: empty set
Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9

(Each pair has the form lp <r ln)

Page 37: Searching Web Better

37

Spy Naïve Bayes: Ideas

Standard naïve Bayes – classify positive and negative samples

One-step spy naïve Bayes: spying out RN from U
Put a small number of positive samples into U to act as “spies” (to scout the behavior of real positive samples in U)
Take U as negative samples to train a naïve Bayes classifier
Samples with lower probabilities of being positive are assigned to RN

Voting procedure: makes spying more robust
Run one-step SpyNB n times to get n sets RN_i
A sample appearing in at least m (m ≤ n) of the sets RN_i appears in the final RN
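A generic sketch of one-step SpyNB plus voting; `train_nb` is a hypothetical stand-in for fitting a naïve Bayes classifier and returning P(positive | sample):

```python
import random

def spy_rn(P, U, train_nb, spy_ratio=0.2, rng=random):
    """One-step SpyNB: move some positives into U as spies, train with
    U (plus spies) as negatives, and use the spies' scores to set the
    threshold below which U samples count as reliable negatives."""
    spies = rng.sample(P, max(1, int(len(P) * spy_ratio)))
    pos = [x for x in P if x not in spies]
    prob = train_nb(pos, U + spies)            # prob(x) = P(positive | x)
    threshold = min(prob(s) for s in spies)    # how real positives score in U
    return {x for x in U if prob(x) < threshold}

def spy_nb_vote(P, U, train_nb, n=10, m=8, seed=0):
    """Voting: a sample in at least m of n RN_i sets joins the final RN."""
    rng = random.Random(seed)
    votes = {x: 0 for x in U}
    for _ in range(n):
        for x in spy_rn(P, U, train_nb, rng=rng):
            votes[x] += 1
    return {x for x in U if votes[x] >= m}

# toy scorer instead of a real naive Bayes: positives cluster near 1.0
def toy_train_nb(pos, neg):
    mp = sum(pos) / len(pos)
    return lambda x: 1.0 / (1.0 + abs(x - mp))

P = [1.0, 1.1, 0.9, 1.05, 0.95]
U = [1.0, 5.0, 6.0]
rn = spy_nb_vote(P, U, toy_train_nb)
```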

Page 38: Searching Web Better

38

http://dleecpu1.cs.ust.hk:8080/SpyNoby/

Page 39: Searching Web Better

39

My publications

1. Wilfred NG. Book Review: An Introduction to Search Engines and Web Navigation. An International Journal of Information Processing & Management, pp. 290-292, 43(1) (2007).
2. Wilfred NG, Lin DENG and Dik-Lun LEE. Spying Out Real User Preferences in Web Searching. Accepted and to appear: ACM Transactions on Internet Technology, (2006).
3. Yiping KE, Lin DENG, Wilfred NG and Dik-Lun LEE. Web Dynamics and their Ramifications for the Development of Web Search Engines. Accepted and to appear: Computer Networks Journal - Special Issue on Web Dynamics, (2005).
4. Qingzhao TAN, Yiping KE and Wilfred NG. WUML: A Web Usage Manipulation Language For Querying Web Log Data. International Conference on Conceptual Modeling ER 2004, Lecture Notes in Computer Science Vol. 3288, Shanghai, China, pages 567-581, (2004).
5. Lin DENG, Xiaoyong CHAI, Qingzhao TAN, Wilfred NG and Dik-Lun LEE. Spying Out Real User Preferences for Metasearch Engine Personalization. ACM Proceedings of the WEBKDD Workshop on Web Mining and Web Usage Analysis 2004, Seattle, USA, (2004).
6. Qingzhao TAN, Xiaoyong CHAI, Wilfred NG and Dik-Lun LEE. Applying Co-training to Clickthrough Data for Search Engine Adaptation. 9th International Conference on Database Systems for Advanced Applications DASFAA 2004, Lecture Notes in Computer Science Vol. 2973, Jeju Island, Korea, pages 519-532, (2004).
7. Haofeng ZHOU, Yubo LOU, Qingqing YUAN, Wilfred NG, Wei WANG and Baile SHI. Refining Web Authoritative Resource by Frequent Structures. IEEE Proceedings of the International Database Engineering and Applications Symposium IDEAS 2003, Hong Kong, pages 236-241, (2003).
8. Wilfred NG. Capturing the Semantics of Web Log Data by Navigation Matrices. A Book Chapter in "Semantic Issues in E-Commerce Systems", Edited by R. Meersman, K. Aberer and T. Dillon, Kluwer Academic Publishers, pages 155-170, (2003).