
Web Searching & Ranking

Zachary G. Ives, University of Pennsylvania

CIS 455/555 – Internet and Web Systems

October 25, 2015
Some content based on slides by Marti Hearst, Ray Larson

Recall Where We Left Off

We were discussing information retrieval ranking models

The Boolean model captures some intuitions of what we want – AND, OR

But it’s too restrictive, and has no real ranking between returned answers


Vector Model

sim(q,dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| * |q|) = [Σi wij * wiq] / (|dj| * |q|)

Since wij > 0 and wiq > 0, 0 ≤ sim(q,dj) ≤ 1

A document is retrieved even if it matches the query terms only partially

[Figure: the angle θ between document vector dj and query vector q]
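Below is a minimal sketch of this cosine computation in Python, assuming documents and queries are already represented as sparse term-to-weight dictionaries (the function name and sample terms are illustrative, not from the slides):

```python
import math

def cosine_sim(doc_weights, query_weights):
    """Cosine similarity between a document and a query,
    each given as a dict mapping term -> weight (w_ij, w_iq)."""
    # Numerator: sum of w_ij * w_iq over terms shared by doc and query
    dot = sum(w * query_weights[t] for t, w in doc_weights.items()
              if t in query_weights)
    # Denominator: the vector norms |d_j| and |q|
    doc_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)

# A partial match still yields a nonzero score
print(cosine_sim({"olympics": 2.0, "sports": 1.0}, {"olympics": 1.0}))
```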

Weights in the Vector Model

sim(q,dj) = [Σi wij * wiq] / (|dj| * |q|)

How do we compute the weights wij and wiq?

A good weight must take into account two effects:
- quantification of intra-document contents (similarity): the tf factor, the term frequency within a document
- quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

wij = tf(i,j) * idf(i)

TF and IDF Factors

Let:
- N be the total number of docs in the collection
- ni be the number of docs which contain ki
- freq(i,j) be the raw frequency of ki within dj

A normalized tf factor is given by
  f(i,j) = freq(i,j) / maxl freq(l,j)
where the maximum is computed over all terms l which occur within the document dj

The idf factor is computed as
  idf(i) = log(N / ni)
The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
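A small sketch of these tf and idf formulas, assuming documents arrive as lists of tokens; the toy corpus is made up for illustration:

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute w_ij = f(i,j) * idf(i) for a list of tokenized documents."""
    N = len(docs)
    # n_i: number of documents containing term k_i
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        freq = Counter(doc)              # freq(i, j)
        max_freq = max(freq.values())    # max_l freq(l, j)
        weights.append({
            term: (f / max_freq) * math.log(N / doc_freq[term])
            for term, f in freq.items()
        })
    return weights

docs = [["olympics", "sports", "olympics"],
        ["sports", "news"],
        ["olympics", "schedule"]]
print(tf_idf_weights(docs))
```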

Vector Model Example II

[Figure: documents d1–d7 plotted in the space of index terms k1, k2, k3]

       k1  k2  k3  q·dj
  d1    1   0   1    4
  d2    1   0   0    1
  d3    0   1   1    5
  d4    1   0   0    1
  d5    1   1   1    6
  d6    1   1   0    3
  d7    0   1   0    2
  q     1   2   3
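As a quick check on the last column, a few lines of Python that recompute q · dj for these binary weights:

```python
# Term-weight vectors (k1, k2, k3) for documents d1..d7 and the query q
docs = {
    "d1": (1, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 1), "d4": (1, 0, 0),
    "d5": (1, 1, 1), "d6": (1, 1, 0), "d7": (0, 1, 0),
}
q = (1, 2, 3)

# q . dj, the unnormalized similarity used to rank documents here
scores = {d: sum(w * wq for w, wq in zip(vec, q)) for d, vec in docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # d5, d3, d1, d6, d7, ...
```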

Vector Model Example III

[Figure: documents d1–d7 plotted in the space of index terms k1, k2, k3]

       k1  k2  k3  q·dj
  d1    2   0   1    5
  d2    1   0   0    1
  d3    0   1   3   11
  d4    2   0   0    2
  d5    1   2   4   17
  d6    1   2   0    5
  d7    0   5   0   10
  q     1   2   3

Vector Model, Summarized

The best-known term-weighting schemes use tf-idf weights:
  wij = f(i,j) * log(N / ni)

For the query term weights, a suggestion is:
  wiq = (0.5 + 0.5 * freq(i,q) / maxl freq(l,q)) * log(N / ni)

This model is very good in practice:
- tf-idf works well with general collections
- Simple and fast to compute
- The vector model is usually as good as the known ranking alternatives
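A sketch of the suggested query-term weighting, with made-up collection statistics (N and the per-term document frequencies below are illustrative):

```python
import math
from collections import Counter

def query_weights(query_terms, N, doc_freq):
    """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i).

    query_terms: list of tokens in the query
    N: total number of documents; doc_freq: dict term -> n_i
    """
    freq = Counter(query_terms)
    max_freq = max(freq.values())
    return {
        t: (0.5 + 0.5 * f / max_freq) * math.log(N / doc_freq[t])
        for t, f in freq.items() if t in doc_freq
    }

# Example with made-up collection statistics
print(query_weights(["olympics", "sports"], N=1000,
                    doc_freq={"olympics": 50, "sports": 200}))
```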

Pros & Cons of Vector Model

Advantages:
- term weighting improves the quality of the answer set
- partial matching allows retrieval of docs that approximate the query conditions
- the cosine ranking formula sorts documents according to degree of similarity to the query

Disadvantages:
- assumes independence of index terms; not clear if this is a good or bad assumption


Comparison of Classic Models

Boolean model does not provide for partial matches and is considered to be the weakest classic model

Experiments indicate that the vector model outperforms the third alternative, the probabilistic model, in general

In practice, most text search systems use a variation of the vector model

Switching Our Sights to the Web

Web information retrieval is more heterogeneous in nature:
- No editor to control quality
- Deliberately misleading information ("web spam")
- Great variety in types of information: phone books, catalogs, technical reports, news, slide shows, …
- Many languages; partial duplication; jargon
- Diverse user goals
- Very short queries: ~2.35 words on average (Aug 2000; Google results)
- And much larger scale!

Handling Short Queries & Mixed-Quality Information

Human processing:
- Web directories: Yahoo, Open Directory, …
- Human-created answers: about.com, Search Wikia
- (Still not clear that automated question-answering works)

Capitalism: "paid placement"
- Advertisers pay to be associated with certain keywords

Clicks / page popularity: pages visited most often

Link analysis: use link structure to determine credibility

… combination of all?

Link Analysis for Starting Points: HITS (Kleinberg), PageRank (Google)

Assumptions:
- Credible sources will mostly point to credible sources
- Names of hyperlinks suggest meaning
- Ranking is a function of the query terms and of the hyperlink structure

An example of why this makes sense:
- The official Olympics site will be linked to by most high-quality sites about sports, the Olympics, etc.
- A spammer who adds "Olympics" to his/her web site probably won't have many links to it

Caveat: "search engine optimization"

Google's PageRank (Brin/Page 98)

Mine the structure of the web graph independently of the query!

Each web page is a node; each hyperlink is a directed edge

Assumes a random walk (surf) through the web:
- Start at a random page
- At each step, the surfer proceeds to a randomly chosen successor of the current page with probability d, or to a randomly chosen web page with probability 1 - d

The PageRank of a page p is the fraction of steps the surfer spends at p in the limit

Link Counts Aren't Everything…

[Figure: example link graph around the "A-Team" page, with in-links of very different credibility: a Hollywood "Series to Recycle" page, the Yahoo Directory, Wikipedia, Mr. T's page, a Team Sports page, and a Cheesy TV Shows page]

PageRank

  xi = Σj∈Bi xj / Nj

where:
- xi is the rank of page i and xj is the rank of page j
- Bi is the set of pages j that link to page i
- Nj is the number of links out from page j

The importance of page i is governed by the pages linking to it

Computing PageRank (Simple version)

Initialize so total rank sums to 1.0:
  xi(0) = 1/n

Iterate until convergence:
  xi(k+1) = Σj∈Bi xj(k) / Nj
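A minimal sketch of this naive iteration over an adjacency list, assuming no dead ends (the three-page graph below is illustrative; its ranks settle at 0.4, 0.2, and 0.4, matching the convergence slide that follows):

```python
def naive_pagerank(out_links, iterations=50):
    """out_links: dict page -> list of pages it links to (no dead ends)."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # x_i^(0) = 1/n

    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            share = rank[j] / len(targets)        # x_j^(k) / N_j
            for i in targets:
                new_rank[i] += share              # sum over j in B_i
        rank = new_rank
    return rank

# Three-page example; total rank stays 1.0 throughout
print(naive_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
# -> approximately {'a': 0.4, 'b': 0.2, 'c': 0.4}
```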

Computing PageRank (Step 0)

Initialize so total rank sums to 1.0:
  xi(0) = 1/n

[Figure: three-page example; each page starts with rank 0.33]

Computing PageRank (Step 1)

Propagate weights across out-edges:
  xi(k+1) = Σj∈Bi xj(k) / Nj

[Figure: three-page example; shares of 0.17, 0.17, 0.33, and 0.33 flow along the out-edges]

Computing PageRank (Step 2)

Compute weights based on in-edges:
  xi(1) = Σj∈Bi xj(0) / Nj

[Figure: three-page example; the new ranks are 0.17, 0.50, and 0.33]

Computing PageRank (Convergence)

  xi(k+1) = Σj∈Bi xj(k) / Nj

[Figure: three-page example; the ranks converge to 0.4, 0.2, and 0.4]

Naïve PageRank Algorithm Restated

Let:
- N(p) = number of outgoing links from page p
- B(p) = the set of pages with back-links to page p

Each page b distributes its importance to all of the pages it points to (so we scale by N(b))

Page p's importance is increased by the importance of its back set:

  PageRank(p) = Σb∈B(p) PageRank(b) / N(b)

In Linear Algebra Terms

Create an m x m matrix M to capture links:
  M(i, j) = 1 / nj if page i is pointed to by page j, where page j has nj outgoing links
          (and 0 otherwise)

Initialize all PageRanks to 1, then multiply by M repeatedly until all values converge:

  [PageRank(p1')]       [PageRank(p1)]
  [PageRank(p2')]  = M  [PageRank(p2)]
  [     ...     ]       [     ...    ]
  [PageRank(pm')]       [PageRank(pm)]

(This computes the principal eigenvector via power iteration)
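The same power iteration sketched with NumPy, using the three-page link matrix from the example on the next slide (page order Google, Yahoo, Amazon):

```python
import numpy as np

# Link matrix: M[i, j] = 1/n_j if page j links to page i
M = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5],
              [1.0, 0.5, 0.0]])

rank = np.ones(3)          # initialize all PageRanks to 1
for _ in range(50):        # power iteration
    rank = M @ rank
print(rank)                # approaches [1.0, 0.67, 1.33]
```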

A Brief Example

[Figure: Google links to Amazon; Yahoo links to Google and Amazon; Amazon links to Google and Yahoo]

  [g']     [0   0.5  0.5]   [g]
  [y']  =  [0   0    0.5] * [y]
  [a']     [1   0.5  0  ]   [a]

Total rank sums to the number of pages.

Running for multiple iterations (g, y, a):
  (1, 1, 1) → (1, 0.5, 1.5) → (1, 0.75, 1.25) → (1, 0.625, 1.375) → … → (1, 0.67, 1.33)

Oops #1 – PageRank Sinks: Dead Ends

[Figure: Google links to Yahoo and Amazon; Amazon links to Google and Yahoo; Yahoo has no outgoing links]

  [g']     [0    0   0.5]   [g]
  [y']  =  [0.5  0   0.5] * [y]
  [a']     [0.5  0   0  ]   [a]

Running for multiple iterations (g, y, a):
  (1, 1, 1) → (0.5, 1, 0.5) → (0.25, 0.5, 0.25) → … → (0, 0, 0)

Oops #2 – Hogging all the PageRank

[Figure: Google links to Yahoo and Amazon; Amazon links to Google and Yahoo; Yahoo links only to itself]

  [g']     [0    0   0.5]   [g]
  [y']  =  [0.5  1   0.5] * [y]
  [a']     [0.5  0   0  ]   [a]

Running for multiple iterations (g, y, a):
  (1, 1, 1) → (0.5, 2, 0.5) → (0.25, 2.5, 0.25) → … → (0, 3, 0)

Improved PageRank

Remove out-degree-0 nodes (or consider them to refer back to their referrers)

Add a decay factor to deal with sinks:
  PageRank(p) = d * Σb∈B(p) (PageRank(b) / N(b)) + (1 - d)

Intuition, in terms of the "random surfer": the surfer occasionally stops following the link sequence and jumps to a new random page, with probability 1 - d
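A sketch of the damped update rule, extending the earlier naive iteration; d = 0.8 here to match the next slide, dead ends are not handled, and the slides' convention that total rank sums to the number of pages is kept:

```python
def pagerank(out_links, d=0.8, iterations=50):
    """PageRank(p) = d * sum_{b in B(p)} PageRank(b)/N(b) + (1 - d).

    out_links: dict page -> list of successor pages (assumed non-empty).
    """
    pages = list(out_links)
    rank = {p: 1.0 for p in pages}                 # start every page at 1

    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}     # random-jump share
        for b, targets in out_links.items():
            share = d * rank[b] / len(targets)     # d * PageRank(b)/N(b)
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank

# The "hog" example: Yahoo links only to itself, yet no longer absorbs all rank
links = {"google": ["yahoo", "amazon"],
         "yahoo": ["yahoo"],
         "amazon": ["google", "yahoo"]}
print(pagerank(links))   # roughly google 0.33, yahoo 2.33, amazon 0.33
```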

Stopping the Hog

Same graph as before, but with decay factor d = 0.8:

  [g']             [0    0   0.5]   [g]   [0.2]
  [y']  =  0.8  *  [0.5  1   0.5] * [y] + [0.2]
  [a']             [0.5  0   0  ]   [a]   [0.2]

Running for multiple iterations (g, y, a):
  (0.6, 1.8, 0.6) → (0.44, 2.12, 0.44) → (0.38, 2.25, 0.38) → (0.35, 2.30, 0.35) → …

… though does this seem right?

Summary of Link Analysis

- Use back-links as a means of adjusting the "worthiness" or "importance" of a page
- Use an iterative process over matrix/vector values to reach a convergence point
- PageRank is query-independent and considered relatively stable
  - But vulnerable to SEO

Can We Go Beyond?

PageRank assumes a "random surfer" who starts at any node and estimates the likelihood that the surfer will end up at a particular page

A more general notion: label propagation
- Take a set of start nodes, each with a different label
- Estimate, for every node, the distribution of arrivals from each label
- In essence, this captures the relatedness or influence of nodes
- Used in YouTube video matching, schema matching, …
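One plausible reading of this idea is a per-label random walk with restarts (essentially personalized PageRank run once per labeled seed); the sketch below follows that interpretation and is not taken from the slides:

```python
def label_propagation(out_links, seeds, d=0.8, iterations=50):
    """For each labeled seed, run a damped walk that restarts at that seed,
    then report each node's arrival mass per label.

    out_links: dict node -> list of successors (assumed non-empty)
    seeds: dict label -> start node
    """
    dist = {}
    for label, start in seeds.items():
        mass = {p: 0.0 for p in out_links}
        mass[start] = 1.0
        for _ in range(iterations):
            new_mass = {p: 0.0 for p in out_links}
            new_mass[start] = 1.0 - d              # restart at the seed
            for b, targets in out_links.items():
                share = d * mass[b] / len(targets)
                for p in targets:
                    new_mass[p] += share
            mass = new_mass
        dist[label] = mass
    return dist

links = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
print(label_propagation(links, {"red": "a", "blue": "c"}))
```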


Overall Ranking Strategies in Web Search Engines

Everybody has their own "secret sauce" that uses:
- Vector model (TF/IDF)
- Proximity of terms
- Where terms appear (title vs. body vs. link)
- Link analysis
- Info from directories
- Page popularity
(The gorank.com "search engine optimization" site compares these factors)

Some alternative approaches:
- Some new engines (Vivisimo, Teoma, Clusty) try to do clustering
- A few engines (Dogpile, Mamma.com) try to do meta-search
