Trust and Profit Sensitive Ranking for Web Databases and On-line Advertisements Raju Balakrishnan [email protected] (Arizona State University)

Jan 01, 2016

Page 1

Trust and Profit Sensitive Ranking for Web Databases and On-line Advertisements

Raju Balakrishnan [email protected] (Arizona State University)

Page 2

Agenda

Problem 1: Ranking the Deep Web
– Need for New Ranking.
– SourceRank: Agreement Analysis.
– Computing Agreement and Collusion.
– Results & System Implementation.

Problem 2: Ad-Ranking sensitive to Mutual Influences.
– Browsing model & Nature of Influence.
– Ranking Function and Generalizations.
– Results.

Page 3

Deep Web Integration Scenario

[Figure: a mediator forwards the user query to several web databases (Web DB) in the deep web and collects the answer tuples they return.]

Millions of sources containing structured tuples.

Uncontrolled collection of redundant information.

Search engines have nominal access. We don’t Google for a “Honda Civic 2008 Tampa”.

Page 4

Why Another Ranking?

Example Query: “Godfather Trilogy” on Google Base

Rankings are oblivious to result Importance & Trustworthiness

Importance: The search returns titles that match the query. None of the results is the classic Godfather.

Trustworthiness (bait and switch): The titles and cover image match exactly. Prices are low. Amazing deal! But when you proceed towards checkout you realize that the product is a different one! (Or when you open the mail package, if you are really unlucky.)

Page 5

Source Selection in the Deep Web

Problem: Given a user query, select a subset of sources to provide important and trustworthy answers.

Surface web search combines link analysis with Query-Relevance to consider trustworthiness and relevance of the results.

Unfortunately, deep web records do not have hyper-links.

Page 6

Source Agreement

Observations: Many sources return answers to the same query. Comparison of semantics of the answers is facilitated by structure of the tuples.

Idea: Compute importance and trustworthiness of sources based on the agreement of answers returned by different sources.

Page 7

Agreement Implies Trust & Importance.

Important results are likely to be returned by a large number of sources. e.g. For the query “Godfather” hundreds of sources return the classic “The Godfather” while a few sources return the little known movie “Little Godfather”.

Two independent sources are not likely to agree upon corrupt/untrustworthy answers. e.g. The wrong author of the book (e.g. Godfather author as “Nino Rota”) would not be agreed upon by other sources. As we know, truth is one (or a few), but lies are many.

Page 8

Which tire?

Agreement is not just for the search

Page 9

Agreement Implies Trust & Relevance

Probability of agreement of two independently selected irrelevant/false tuples is

$$P_a(f_1, f_2) = \frac{1}{|U|}$$

Probability of agreement of two independently picked relevant and true tuples is

$$P_a(r_1, r_2) = \frac{1}{|R_T|}$$

where $U$ is the set of all tuples and $R_T$ is the set of relevant and true tuples. Hence

$$|U| \gg |R_T| \Rightarrow P_a(r_1, r_2) \gg P_a(f_1, f_2)$$

Example values: 1/100k for false tuples against 1/3 for relevant and true tuples.

Page 10

Method: Sampling based Agreement

[Figure: example agreement graph over three sources S1, S2 and S3, with directed links weighted 0.14, 0.86, 0.78, 0.4, 0.6 and 0.22.]

Link semantics from Si to Sj with weight w: Si acknowledges w fraction of tuples in Sj. Since the weight is a fraction, links are asymmetric.

$$W(S_1 \rightarrow S_2) = \beta + (1 - \beta) \times \frac{A(R_1, R_2)}{|R_2|}$$

where $\beta$ induces the smoothing links to account for the unseen samples. $R_1$, $R_2$ are the result sets of $S_1$, $S_2$.

Agreement is computed using keyword queries.

Partial titles of movies/books are used as queries.

Mean agreement over all the queries is used as the final agreement.
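The link weight on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the default smoothing value 0.1, and the choice to smooth each query before averaging are hypothetical, and the agreement value A(R1, R2) is taken as already computed.

```python
def link_weight(agreement, result_set_size, beta=0.1):
    """Smoothed weight of the link S1 -> S2 for a single sampling query.

    agreement: A(R1, R2), the agreement between the two result sets.
    result_set_size: |R2|, the number of tuples returned by S2.
    beta: smoothing factor (0.1 is an arbitrary illustrative value).
    """
    if result_set_size == 0:
        return beta  # nothing sampled from S2: only the smoothing link remains
    return beta + (1.0 - beta) * agreement / result_set_size


def mean_link_weight(per_query, beta=0.1):
    """Average the link weight over all sampling queries.

    per_query: list of (agreement, result_set_size) pairs, one per query.
    Assumption: smoothing is applied per query and the results are averaged.
    """
    weights = [link_weight(a, size, beta) for a, size in per_query]
    return sum(weights) / len(weights)
```

For instance, `link_weight(5, 10)` combines the smoothing term 0.1 with 0.9 times the acknowledged fraction 0.5.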

Page 11

Method: Calculating SourceRank

How can I use the agreement graph for improved search?

• Source graph is viewed as a Markov chain, with edges as the transition probabilities between the sources.

• The prestige of sources considering the transitive nature of the agreement may be computed based on a Markov random walk.

SourceRank is equal to the stationary visit probability of the random walk on the database vertex.

This static SourceRank may be combined with a query-specific source-relevance measure for the final ranking.
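The random walk described above could be computed by power iteration, as in the following sketch. It is not the authors' implementation: the function name, the row normalization, the uniform fallback for sources without outgoing links, and the stopping rule are all assumptions made for illustration.

```python
def source_rank(weights, iterations=1000, tol=1e-12):
    """Stationary visit probabilities of a random walk on the agreement graph.

    weights[i][j] is the agreement weight of the link from source i to source j.
    """
    n = len(weights)
    # Normalize each row so that outgoing weights form transition probabilities.
    transition = []
    for row in weights:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            transition.append([1.0 / n] * n)  # assumed fallback: jump uniformly
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # One step of the walk: redistribute the current visit probabilities.
        new_rank = [sum(rank[i] * transition[i][j] for i in range(n))
                    for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank
```

Calling `source_rank` on a small weight matrix returns one score per source; the scores sum to one and can be used as the static prestige of each source.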

Page 12

Computing Agreement is Hard

Computing semantic agreement between two records is the record linkage problem, and is known to be hard.

Semantically same entities may be represented syntactically differently by two databases (non-common domains).

Example “Godfather” tuples from two web sources. Note that titles and castings are denoted differently.

Source 1: Godfather, The: The Coppola Restoration | James Caan / Marlon Brando more | $9.99

Source 2: The Godfather - The Coppola Restoration Giftset [Blu-ray] | Marlon Brando, Al Pacino | 13.99 USD

Page 13

Method: Computing Agreement

Agreement computation has three levels.

1. Comparing attribute values. SoftTFIDF with Jaro-Winkler as the similarity measure is used.

2. Comparing records. We do not assume predefined schema matching. This is an instance of a bipartite matching problem. Optimal matching is $O(v^3)$. Greedy matching, which is $O(v^2)$, is used: values are greedily matched against the most similar value in the other record. Attribute importance is weighted by IDF (e.g. the same title (Godfather) is more important than the same format (paperback)).

3. Comparing result sets. Using the record similarity computed above, result set similarities are computed using the same greedy approach.
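The greedy record comparison can be sketched as below. Two simplifications are assumed and should be read as such: `difflib.SequenceMatcher` from the Python standard library stands in for SoftTFIDF with Jaro-Winkler, and the IDF weighting of attributes is left out. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def value_similarity(a, b):
    """Stand-in string similarity in [0, 1]; the slides use SoftTFIDF with Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def greedy_record_similarity(record_a, record_b):
    """Greedily match each value of record_a to its most similar unused value of record_b.

    Records are plain lists of attribute values, since no schema matching is assumed.
    The scan over the remaining values for every value gives the quadratic cost.
    """
    if not record_a or not record_b:
        return 0.0
    remaining = list(record_b)
    total = 0.0
    for value in record_a:
        if not remaining:
            break
        best = max(remaining, key=lambda other: value_similarity(value, other))
        total += value_similarity(value, best)
        remaining.remove(best)  # each value may be matched only once
    # Normalize by the longer record so unmatched values lower the score.
    return total / max(len(record_a), len(record_b))
```

The same greedy pass could be reused one level up, with records in place of values and `greedy_record_similarity` in place of `value_similarity`, to compare two result sets.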

Page 14

Detecting Source Collusion

The sources may copy data from each other, or make mirrors, boosting SourceRank of the group.

Observation 1: Even non-colluding sources in the same domain may contain the same data. e.g. Movie databases may contain all Hollywood movies.

Observation 2: Top-k answers of even non-colluding sources may be similar. e.g. Answers to the query “Godfather” may contain all three movies in the Godfather trilogy.

Page 15

Source Collusion--Continued

Basic Method: If two sources return the same top-k answers to queries with a large number of answers (e.g. queries like “the” or “DVD”), they are likely to be colluding.

We compute the degree of collusion of sources as the agreement on large answer queries.

Words with the highest DF in the crawl are used as the queries.

The agreement between two databases is adjusted for collusion by multiplying by (1-collusion).
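A minimal sketch of the adjustment follows. The function names are hypothetical, and taking the mean over the large-answer queries is an assumption; the slide only states that collusion is the agreement on such queries and that agreement is multiplied by (1-collusion).

```python
def collusion_degree(agreements_on_large_answer_queries):
    """Degree of collusion, assumed here to be the mean agreement on high-DF queries."""
    values = list(agreements_on_large_answer_queries)
    return sum(values) / len(values)


def collusion_adjusted_agreement(agreement, collusion):
    """Scale the agreement by (1 - collusion); both inputs are assumed to lie in [0, 1]."""
    if not 0.0 <= collusion <= 1.0:
        raise ValueError("collusion must be in [0, 1]")
    return agreement * (1.0 - collusion)
```

Near-mirrors agree even on queries such as “the”, so their collusion is close to 1 and their adjusted agreement is close to 0, while independent sources keep most of their agreement.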

Page 16

Factal: Search based on SourceRank

http://factal.eas.asu.edu

“I personally ran a handful of test queries this way and got much better results [than Google Products] results using Factal” --- Anonymous WWW’11 Reviewer.

Page 17

Evaluation

Precision and DCG are compared with the following baseline methods:

1) CORI: Adapted from text database selection. The union of sample documents from the sources is indexed, and the sources with the highest number of term hits are selected [Callan et al. 1995].

2) Coverage: Adapted from relational databases. Mean relevance of the top-5 results to the sampling queries [Nie et al. 2004].

3) Google Products: The product search that is used over Google Base.

All experiments distinguish SourceRank from the baseline methods with 0.95 confidence levels.
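The two measures named on this slide can be sketched as follows. The slides do not say which DCG variant was used, so the common logarithmic discount (relevance divided by log2 of rank plus one) is assumed here, and the function names are hypothetical.

```python
import math


def precision_at_k(relevances, k):
    """Fraction of the top-k results judged relevant (relevance greater than 0)."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with the assumed log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))
```

Both functions take the relevance judgments of a result list in rank order, top result first.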

Page 18

Online Top-4 Sources-Movies

[Figure: bar chart of Precision and DCG (vertical axis 0 to 0.45) for Coverage, SourceRank, CORI, SR-Coverage and SR-CORI; annotated 29%.]

Though combinations are not our competitors, note that they are not better:

1. SourceRank implicitly considers query relevance, as selected sources fetch answers by query similarity. Combining again with query similarity may be an “overweighting”.

2. Search is vertical.

Page 19

Online Top-4 Sources-Books

[Figure: bar chart of Precision and DCG (vertical axis 0 to 0.35); annotated 48%.]

Page 20

Google Base Top-5 Precision-Books

[Figure: bar chart of top-5 precision (vertical axis 0 to 0.5) over 675 sources; annotated 24%.]

675 Google Base sources responding to a set of book queries are used as the book domain sources.

GBase-Domain is the Google Base searching only on these 675 domain sources.

Source selection by SourceRank (coverage) is followed by ranking by Google Base.

Page 21

Google Base Top-5 Precision-Movies

[Figure: bar chart of top-5 precision (vertical axis 0 to 0.25) for Gbase, Gbase-Domain, SourceRank and Coverage over 209 sources; annotated 25%.]

Page 22

Trustworthiness of Source Selection

Google Base Movies

[Figure: Decrease in Rank (%) (vertical axis -10 to 60) against Corruption Level (horizontal axis 0 to 0.9) for SourceRank, Coverage and CORI.]

1. Corrupted the results in the sample crawl by replacing attribute values not specified in the queries with random strings (since partial titles are the queries, we corrupted attributes except titles).

2. If the source selection is sensitive to corruption, the ranks should decrease with the corruption levels.

Every relevance measure based on query-similarity is oblivious to the corruption of attributes unspecified in queries.
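The corruption step of this experiment might look like the sketch below. Several details are assumptions for illustration only: the corruption level is read as the fraction of tuples corrupted, tuples are dictionaries with a `title` key, and random strings are eight lowercase letters. The function name is hypothetical.

```python
import random
import string


def corrupt_tuples(tuples, level, protected=("title",), seed=0):
    """Return a copy of the tuples with a fraction `level` of them corrupted.

    Attributes named in `protected` (those specified in the queries) are kept;
    every other attribute of a corrupted tuple is replaced by a random string.
    """
    rng = random.Random(seed)
    n_corrupt = round(level * len(tuples))
    chosen = set(rng.sample(range(len(tuples)), n_corrupt))
    corrupted = []
    for index, record in enumerate(tuples):
        record = dict(record)  # copy so the input list is left untouched
        if index in chosen:
            for attribute in record:
                if attribute not in protected:
                    record[attribute] = "".join(
                        rng.choices(string.ascii_lowercase, k=8))
        corrupted.append(record)
    return corrupted
```

A query-similarity measure that only looks at titles would score the corrupted and clean tuples identically, which is the point the slide makes about such measures.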

Page 23

Trustworthiness- Google Base Books

[Figure: Decrease in Rank (%) (vertical axis -5 to 45) against Corruption Level (horizontal axis 0 to 0.9) for SourceRank, Coverage and CORI.]

Page 24

Collusion: Ablation Study

[Figure: Collusion, Agreement and Adjusted Agreement (vertical axis 0 to 1) against Rank Correlation (horizontal axis 0 to 1).]

Two databases with the same one million tuples from IMDB are created.

The correlation between their ranking functions is progressively reduced.

Natural agreement will be preserved while catching near-mirrors.

Observations:
1. At high correlation the adjusted agreement is very low.
2. Adjusted agreement is almost the same as the pure agreement at low correlations.

Page 25

Computation Time

Random walk is known to be feasible at large scale.

Time to compute the agreements is evaluated against the number of sources.

Note that the computation is offline.

Easy to parallelize.

Page 26

Publications and Recognition

SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement. Raju Balakrishnan, Subbarao Kambhampati. WWW 2011 (Full Paper).

Factal: Integrating Deep Web Based on Trust and Relevance. Raju Balakrishnan, Subbarao Kambhampati. WWW 2011 (Demonstration).

SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement (Best Poster Award, WWW 2010). Raju Balakrishnan, Subbarao Kambhampati. WWW 2010, pages 1055-1056.

Page 27

Contributions

1. Agreement based trust assessment for the deep web

2. Agreement based relevance assessment for the deep web

3. Collusion detection between the web sources

4. Evaluations in Google Base sources and online web databases

Page 28

Agenda

Ranking the Deep Web
– Need for new ranking.
– SourceRank: Agreement Analysis.
– Computing Agreement and Collusion.
– Results & System Implementation.
Proposed Work: Ranking the Deep Web Results.

Problem 2: Ad-Ranking sensitive to Mutual Influences.
– Motivation and Problem Definition.
– Browsing model & Nature of Influence.
– Ranking Function & Generalization.
– Results.
Proposed Work: Mechanism Design & Evaluations.

A different aspect of ranking

Search engines generate their multi-billion dollar revenue by textual ads. The related problem of ranking ads is as important as the ranking of results.

Page 30

User’s Browsing Model

• User browses down starting at the first ad.

• At every ad $a$ he may:
– Click the ad with relevance probability $R(a) = P(click(a) \mid view(a))$.
– Abandon browsing with probability $\gamma(a)$.
– Go down to the next ad with probability $1 - (R(a) + \gamma(a))$.

• The process repeats for the ads below with a reduced probability.

If $a_2$ is similar to $a_1$, the residual relevance of $a_2$ decreases and the abandonment probability increases.
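Under this browsing model, the probability that the user views each ad can be computed in one pass down the list. The sketch below ignores ad similarity and assumes that at each ad the user either clicks (relevance probability), abandons (abandonment probability), or moves down with the remaining probability. The function name is hypothetical.

```python
def view_probabilities(ads):
    """View probability of each ad, given ads in display order (top first).

    ads: list of (relevance, abandonment) probability pairs.
    """
    probabilities = []
    reach = 1.0  # the first ad is always viewed
    for relevance, abandonment in ads:
        probabilities.append(reach)
        # The user continues only if he neither clicks nor abandons.
        reach *= 1.0 - (relevance + abandonment)
    return probabilities
```

Each entry is the product of the continue-probabilities of all ads placed above, which is why ads lower in the list are viewed with a reduced probability.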

Page 31

Mutual Influences

Three Manifestations of Mutual Influences on an ad $a$ are:

1. Similar ads placed above $a$. Reduces user’s residual relevance of $a$.

2. Relevance of other ads placed above $a$. User may click on above ads and may not view $a$.

3. Abandonment probability of other ads placed above $a$. User may abandon search and may not view $a$.

Page 32

Expected Profit Considering Ad Similarities

Considering bids ($\$(a_i)$), residual relevance ($R_r(a_i)$), abandonment probability ($\gamma(a_i)$), and similarities, the expected profit from a set of n results is

$$\text{Expected Profit} = \sum_{i=1}^{n} \$(a_i) R_r(a_i) \prod_{j=1}^{i-1} \left[ 1 - \left( R_r(a_j) + \gamma(a_j) \right) \right]$$

THEOREM: Ranking maximizing expected profit considering similarities between the results is NP-Hard.

Proof is a reduction of the independent set problem to choosing top-k ads considering similarities.

Even worse, constant ratio approximation algorithms are hard (unless NP = ZPP) for the diversity ranking problem.

Page 33

Expected Profit Considering other two Mutual Influences (2 and 3)

Dropping similarity, hence replacing Residual Relevance ($R_r(a_i)$) by Absolute Relevance ($R(a_i)$),

$$\text{Expected Profit} = \sum_{i=1}^{n} \$(a_i) R(a_i) \prod_{j=1}^{i-1} \left[ 1 - \left( R(a_j) + \gamma(a_j) \right) \right]$$

Ranking to maximize this expected utility is a sorting problem.
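The expected profit without similarities is a sum over positions of bid times relevance times view probability, where the view probability is the product of the continue-probabilities of the ads above. A sketch of that computation, with a hypothetical function name and tuple layout:

```python
def expected_profit(ads):
    """Expected profit of ads shown in the given order (top first).

    ads: list of (bid, relevance, abandonment) tuples, where relevance and
    abandonment are probabilities whose sum is at most 1.
    """
    profit = 0.0
    view = 1.0  # probability that the user reaches the current position
    for bid, relevance, abandonment in ads:
        profit += bid * relevance * view
        view *= 1.0 - (relevance + abandonment)
    return profit
```

As a worked case, two ads (2.0, 0.5, 0.1) and (4.0, 0.25, 0.25) in that order contribute 2.0 × 0.5 for the first and 4.0 × 0.25 × 0.4 for the second.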

Page 34

Optimal Ranking

Rank ads in the descending order of:

$$RF(a) = \frac{\$(a) R(a)}{R(a) + \gamma(a)}$$

The physical meaning of RF is the profit generated per unit of consumed view probability of ads.

Higher ads have more view probability. Placing ads that produce more profit per unit of consumed view probability higher up is intuitive.
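The claim that sorting by RF maximizes expected profit can be checked on a small instance by comparing against every permutation. The sketch below is illustrative only: the function names and the example ads are hypothetical, and ads with relevance plus abandonment equal to zero are not handled.

```python
from itertools import permutations


def expected_profit(ads):
    """Expected profit of (bid, relevance, abandonment) ads in display order."""
    profit, view = 0.0, 1.0
    for bid, relevance, abandonment in ads:
        profit += bid * relevance * view
        view *= 1.0 - (relevance + abandonment)
    return profit


def ranking_function(ad):
    """RF(a): profit per unit of view probability consumed by the ad."""
    bid, relevance, abandonment = ad
    return bid * relevance / (relevance + abandonment)


def rank_by_rf(ads):
    """Order ads by descending RF."""
    return sorted(ads, key=ranking_function, reverse=True)


def best_by_brute_force(ads):
    """Highest expected profit over all orderings (only feasible for few ads)."""
    return max(expected_profit(list(order)) for order in permutations(ads))
```

The sort costs O(n log n), while the brute-force check enumerates n! orderings, so the latter is only a sanity check for a handful of ads.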

Page 37

Quantifying Expected Profit

[Figure: Expected Profit (vertical axis 0 to 40, horizontal axis 0 to 1) for RF, Bid Amount x Relevance and Bid Amount; annotated 45.7% and 35.9%.]

Bid Amounts: Uniform random, $0 \le \$(a) \le 10$

Relevance: Uniform random

Abandonment probability: Uniform random

Number of Clicks: Zipf random with exponent 1.5

Proposed strategy gives maximum profit for the entire range.

Difference in profit between RF and competing strategy is significant.

Bid amount only strategy becomes optimal at $\gamma(a) = 0$.

Page 38

Publication and Recognition

Optimal Ad-Ranking for Profit Maximization. Raju Balakrishnan, Subbarao Kambhampati. WebDB 2008.

Yahoo! Research Key Scientific Challenge award for Computational Advertising, 2009-10.

Page 39

Overall Contributions

1. SourceRank based source selection sensitive to (1) trustworthiness and (2) importance of the deep web sources.

2. A method to assess the collusion of the deep web sources.

3. An optimal generalized ranking for ads and search results.

4. A ranking framework optimal with respect to the perceived relevance of search snippets, and abandonment probability.