A Survey on Web Information Retrieval Technologies
Lan Huang
Computer Science Department
State University of New York, Stony Brook
Presented by Kajal Miyan
Michigan State University
Overview
● Web Information Retrieval Challenges
● Search Engines – Overview and Architecture (Google as Case Study)
● Various Algorithms
● Directories – Overview and Some Algorithms
● Estimating the Size of the Web
● Main Challenges
● Google File System
● Sponsored Search
● Discussion
Web Information Retrieval Challenges
● Bulk
● Dynamic Internet
● Heterogeneity
● Variety of languages
● Duplication
● High linkage
● Ill-formed queries
● Wide variance in users
● Specific behavior
The Goal
Evaluation
● Precision – the fraction of the retrieved documents that are relevant to the user's information need.
  Precision = |relevant documents ∩ retrieved documents| / |retrieved documents|
● Recall – the fraction of the documents relevant to the query that are successfully retrieved.
  Recall = |relevant documents ∩ retrieved documents| / |relevant documents|
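The two definitions above can be sketched directly as set operations; the document IDs below are made up for illustration.

```python
# Precision and recall over sets of document IDs (hypothetical IDs).
def precision(retrieved: set, relevant: set) -> float:
    # Fraction of retrieved documents that are relevant.
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved: set, relevant: set) -> float:
    # Fraction of relevant documents that were retrieved.
    return len(retrieved & relevant) / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant -> 0.5
print(recall(retrieved, relevant))     # 2 of 3 relevant were retrieved
```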
Current Goal
● Precision at Top 10 results &
● Precision at Top 10 result pages
Various SEs
● SEs
● Directories
● News SEs
● Meta Search Engines
● Social Search Engines
● Opinion, Forum, and Usenet SEs
Architecture (Sherman 2003)
[Diagram: the user's browser sends a query ("Eggs?") to the Query Server, which answers from an index built by the Indexer ("Eggs – 90%, Eggo – 81%, Ego – 40%, Huh? – 10%") over pages the Crawler fetched from the web, e.g. "All About Eggs" by S. I. Am.]
Crawling the web
● Robust – must avoid overloading websites and must deal with huge amounts of data
● Decides in what order to crawl pages
● Decides how frequently to revisit pages
● Rule of thumb – important pages first
Crawler
● Cho et al. '99 (spread the workload)
  – Allocate URLs into 500 queues
  – Allocation is based on a hash of the server name
  – Read one URL from each queue at a time
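The queue-allocation step above can be sketched as follows; the hash function and URL examples are illustrative choices, not taken from the survey.

```python
# Sketch of hash-based workload spreading: every URL is assigned to one of
# 500 queues by hashing its server (host) name, so all URLs from one host
# land in the same queue and per-host politeness is easy to enforce.
import hashlib
from urllib.parse import urlparse

NUM_QUEUES = 500

def queue_for(url: str) -> int:
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_QUEUES

# URLs from the same host always map to the same queue:
print(queue_for("http://example.com/a") == queue_for("http://example.com/b"))
```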
Inverted Indexes the IR Way
How Are Inverted Indexes Created?
Periodically rebuilt, static otherwise.
Documents are parsed to extract tokens. These are saved with the Document ID.
Doc 1: "Now is the time for all good men to come to the aid of their country"

Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight"
Term/Doc# pairs, in order of appearance:
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2
After all documents have been parsed the inverted file is sorted alphabetically.
Sorted Term/Doc# pairs:
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2
Multiple term entries for a single document are merged.
Within-document term frequency information is compiled.
Term/Doc#/Freq triples after merging:
a 2 1, aid 1 1, all 1 1, and 2 1, come 1 1, country 1 1, country 2 1, dark 2 1, for 1 1, good 1 1, in 2 1, is 1 1, it 2 1, manor 2 1, men 1 1, midnight 2 1, night 2 1, now 1 1, of 1 1, past 2 1, stormy 2 1, the 1 2, the 2 2, their 1 1, time 1 1, time 2 1, to 1 2, was 2 2
How Inverted Files are Created
Finally, the file can be split into:
– a Dictionary (Lexicon) file
– a Postings file
Dictionary/Lexicon entries are (Term, #Docs, Total Freq):
a 1 1, aid 1 1, all 1 1, and 1 1, come 1 1, country 2 2, dark 1 1, for 1 1, good 1 1, in 1 1, is 1 1, it 1 1, manor 1 1, men 1 1, midnight 1 1, night 1 1, now 1 1, of 1 1, past 1 1, stormy 1 1, the 2 4, their 1 1, time 2 2, to 1 2, was 1 2

Each dictionary entry points into the Postings file, which stores the (Doc#, Freq) pairs for that term, e.g. country → (1,1)(2,1); the → (1,2)(2,2); to → (1,2); was → (2,2).
Inverted indexes
● Permit fast search for individual terms.
● For each term, you get a list consisting of:
  – document ID
  – frequency of term in doc (optional)
  – position of term in doc (optional)
● These lists can be used to solve Boolean queries:
  country → d1, d2
  manor → d2
  country AND manor → d2
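The whole pipeline on the preceding slides — tokenize, build term → (doc, freq) postings, then intersect posting lists for a Boolean AND — can be sketched on the two example documents:

```python
# Minimal inverted index over the two slide documents, plus a Boolean AND
# answered by intersecting posting lists.
from collections import defaultdict

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

index = defaultdict(dict)  # term -> {doc_id: within-document frequency}
for doc_id, text in docs.items():
    for token in text.split():
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def boolean_and(*terms):
    # Intersect the sets of documents containing each term.
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(index["country"]))                 # [1, 2]
print(sorted(boolean_and("country", "manor")))  # [2]
print(index["the"][1])                          # "the" occurs twice in doc 1
```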
Inverted Indexes for Web Search Engines
● Inverted indexes are still used, even though the web is so huge.
● Some systems partition the index across machines: each machine handles a different part of the data.
● Other systems replicate the data across many machines and distribute queries among them.
● Most do a combination of these.
Google Architecture
● Repository
  – Contains the full HTML text
  – Compressed using zlib (RFC 1950)
  – Each entry is prefixed by docID, length, and URL
● Document Index
  – Each entry contains:
    • the current doc status (crawled?)
    • a pointer into the repository (if crawled)
    • a document checksum (binary search on checksums finds the docID)
    • various statistics
● Lexicon
  – Kept in main memory on a machine with 256 MB
  – Currently contains 14 million words
Google’s Indexing
● The Indexer converts each doc into a collection of "hit lists" and puts these into "barrels", sorted by docID. It also creates a database of "links".
  – Hit: <wordID, position in doc, font info, hit type>
  – Hit type: plain or fancy
  – Fancy hit: occurs in URL, title, anchor text, or meta tag
  – Optimized representation of hits (2 bytes each)
● The Sorter sorts each barrel by wordID to create the inverted index. It also creates a lexicon file.
  – Lexicon: <wordID, offset into inverted index>
  – The lexicon is mostly cached in memory
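The 2-bytes-per-hit representation can be sketched as bit packing. The field widths below (1 capitalization bit, 3 bits of font size, 12 bits of word position) follow the published description of a "plain" hit, but treat them as illustrative rather than definitive.

```python
# One plausible 2-byte packing of a plain hit:
# [cap: 1 bit][font size: 3 bits][position in doc: 12 bits]
def pack_hit(capitalized: bool, font_size: int, position: int) -> int:
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(h: int):
    # Recover the three fields from the 16-bit value.
    return bool(h >> 15), (h >> 12) & 0x7, h & 0xFFF

h = pack_hit(True, 3, 417)
print(h < 2**16)                       # fits in 2 bytes
print(unpack_hit(h) == (True, 3, 417)) # round-trips losslessly
```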
The in-memory Lexicon holds <wordid, #docs> entries that point into the Postings ("inverted barrels", on disk). Each barrel contains postings for a range of wordids.
Google’s Inverted Index
[Diagram: barrels i and i+1, each holding entries of the form <docid, #hits, hit, hit, …>; barrels are sorted by wordid, and the postings within a barrel by docid.]
Google crawler
– Maintains its own DNS cache
– Uses asynchronous I/O to manage events
– Runs 4 crawlers
● Both the URLserver and the crawlers are implemented in Python
● Each crawler keeps 300 connections open at once
● Throughput: >100 pages/s, roughly 600 KB/s
● How is the relevance of a page decided?
Content Relevance
• Phrase matching
• Synonyms
• URL analysis
• Date last updated
• Spell checking
• Home page detection
HTML Weighting
Class  Name        HTML tags
1      Plain Text  (none of the below)
2      Strong      STRONG, B, EM, I, U
3      List        DL, OL, UL
4      Header      H1, H2, H3, H4, H5, H6
5      Anchor      A
6      Title       TITLE
• Meta tag text is mostly ignored by search engines
Link-Based Metrics
• A link from A to B can be viewed as a recommendation, a vote or a citation.
• Links can be referential or informational.
• Links affect the ranking of web pages and thus have commercial value.
Citation and Linking
PageRank - Motivation
• The number of incoming links to a page is a measure of the page's importance and authority.
• Also take into account the quality of the recommendation: a page is more important if the sources of its incoming links are themselves important.
The Random Surfer
• Model the web as a Markov chain.
• A surfer randomly clicks on links; the probability of following any given outlink from page A is 1/m, where m is the number of outlinks from A.
• The surfer occasionally gets bored and is teleported to another web page B, where B is equally likely to be any page.
• Markov chain theory shows that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer visits that page.
PageRank (PR) - Definition
• W is a web page
• W1, …, Wn are the web pages that have a link to W
• O(Wi) is the number of outlinks from Wi
• T is the teleportation probability
• N is the size of the web
PR(W) = T/N + (1 − T) · [ PR(W1)/O(W1) + PR(W2)/O(W2) + … + PR(Wn)/O(Wn) ]
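A minimal power-iteration sketch of the formula above. The three-page link graph and the value of T are made up for illustration.

```python
# PR(W) = T/N + (1 - T) * sum over inlinks Wi of PR(Wi)/O(Wi),
# computed by repeated application until the scores settle.
def pagerank(links, T=0.15, iters=50):
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}          # start uniform
    for _ in range(iters):
        new = {}
        for p in pages:
            # Sum PR(q)/O(q) over every page q that links to p.
            inlink_sum = sum(pr[q] / len(links[q])
                             for q in pages if p in links[q])
            new[p] = T / N + (1 - T) * inlink_sum
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
print(max(pr, key=pr.get))  # C: it collects votes from both A and B
```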
HITS – Hubs and Authorities - Hyperlink-Induced Topic Search
[Figure: node A shown twice — on the left with many incoming links (an authority), on the right with many outgoing links (a hub).]
HITS Algorithm – Iterate until Convergence
A(p) = Σ over q ∈ B with q → p of H(q)
H(p) = Σ over q ∈ B with p → q of A(q)

• B is the base set
• q and p are web pages in B
• A(p) is the authority score for p
• H(p) is the hub score for p
HITS
● Algorithm Overview
● Input:
  – Q: a query string
  – SE: a text-based search engine
  – t: size of the root set
  – d: max number of "in" links
● The top t pages (highest-ranked) from the text-based search engine form the root set
● Output: a focused subset
( Q = “java”, SE = AltaVista, t = 3, d = 3)
Algorithm
● An Iterative Algorithm:
  [authority] weights vector x0 = (1, 1, 1, …, 1)
  [hub] weights vector y0 = (1, 1, 1, …, 1)
  for i = 1, 2, …, k:
    xi = update_authority(yi-1)
    yi = update_hub(xi)
    normalize(xi, yi)
  return (xk, yk)
Applications of HITS
• Search engine querying (speed is an issue)
• Finding web communities
• Finding related pages
• Populating categories in web directories
• Citation analysis
Why HITS Did Not Work Well
• Mutually reinforcing relationships between hosts
• Automatically generated links
• Non-relevant nodes
Topic drift fix – weight the edges:
• k edges from one host to a page → each gets authority weight 1/k
• l edges from a page to one host → each gets hub weight 1/l
Duplicate Elimination
Challenges
– Defining the notion of a replicated collection precisely
  • Slight differences exist between copies
– Finding an efficient algorithm to identify such collections and exploit this knowledge of replication
  • Hundreds of millions of pages
  • Subgraph isomorphism is NP-complete
Reasons for Inability to Detect Duplicates
● Update frequency
● Different formats
● Partial crawls
Some Solutions
● IR techniques for textual similarity
● Data mining techniques for clustering, and so on

A formal definition of similar collections follows, in terms of:
● similar pages
● similar links
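One common way to make "textual similarity" concrete (an illustrative choice, not necessarily the survey's method) is to compare documents by the Jaccard overlap of their word k-shingles:

```python
# Near-duplicate detection sketch: two documents are similar when their
# sets of k-word shingles overlap heavily (Jaccard similarity).
def shingles(text: str, k: int = 3) -> set:
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumps over the lazy cat"  # one word changed
sim = jaccard(shingles(d1), shingles(d2))
print(sim)  # 6 of 8 distinct shingles are shared -> 0.75
```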
Growing Similar Clusters
Directories vs. Search Engines
● Directories
  – Hand-selected sites
  – Search over the contents of the descriptions of the pages
  – Organized in advance into categories
● Search Engines
  – All pages in all sites
  – Search over the contents of the pages themselves
  – Organized in response to a query, by relevance rankings or other scores
Directories & Categorization
● Automatic categorization: TAPER
  – A taxonomy-and-path-enhanced retrieval system
  – Given:
    • a hypertext document corpus
    • a "small" set of pre-classified documents
  – Goal:
    • construct a classifier
    • apply it to new documents
● Manual categorization: OpenGrid and ODP
Good discriminating power: large interclass distance, small intraclass distance
Size of the Web
Typical questions:
– Which search engine has the largest coverage?
– How many pages are out there, and how many are indexed?
● Approach: measure search engine coverage and overlap through random queries
  – Allows a third party to measure relative sizes and overlaps of search engines
  – Given two search engines, E1 and E2, we can:
    • compute their relative sizes
    • compute the fraction of E1's database indexed by E2
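The arithmetic behind the relative-size estimate can be sketched as follows. If a fraction f12 of pages sampled from E1 is also indexed by E2, then |E1 ∩ E2| ≈ f12·|E1|; symmetrically |E1 ∩ E2| ≈ f21·|E2|, so |E1|/|E2| ≈ f21/f12. The sample fractions below are made up.

```python
# Relative size of two search engines from sampled overlap fractions.
def relative_size(frac_e1_in_e2: float, frac_e2_in_e1: float) -> float:
    # |E1 cap E2| ~ f12*|E1| ~ f21*|E2|  =>  |E1|/|E2| ~ f21/f12
    return frac_e2_in_e1 / frac_e1_in_e2

# E.g. 35% of E1's sampled pages are found in E2,
# while 70% of E2's sampled pages are found in E1:
print(relative_size(0.35, 0.70))  # E1 is about twice the size of E2
```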
Size Wars
August 2005: "We index 20 billion documents."
September 2005: "We index 8 billion documents, but our index is 3 times larger than our competition's."
So, who's right?
"We knew the web was big..." (7/25/2008 10:12:00 AM): Google claimed to have found 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once. (As of 10/30/2008 02:33:00 PM.)
Challenges
● Spam: text spam, link spam
● Cloaking
● Content quality
● Quality evaluation
● Web conventions (e.g. the META tag)
● Duplicate hosts
● Vaguely structured data
Google File System
● Goals: performance, scalability, reliability, availability
● Component failures are the norm
● Files are huge
● Most files are mutated by appending new data rather than overwriting existing data
● APIs are co-designed with the file system to increase flexibility
GFS
Sponsored Search
● More than 50% of users visit an SE every few days
● Over 13% of traffic to commercial sites is generated by SEs
● Over 40% of product searches on the web are initiated via SEs
● Pioneered by GoTo (later renamed Overture), then adopted by Google
Measurement and Pricing
● Cost per mille (CPM): price per 1000 impressions
● Cost per action (CPA)
● Cost per click (CPC)
● Click-through rate (CTR) = clicks / impressions
● Yahoo! ranks ads by CPC bid
● Google ranks ads by CTR × CPC
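The difference between the two ranking rules above can be shown with a toy comparison; the two ads and their numbers are made up.

```python
# Rank by raw bid (CPC) vs. by expected revenue per impression (CTR * CPC).
ads = [
    {"name": "ad1", "cpc": 2.00, "ctr": 0.01},  # high bid, rarely clicked
    {"name": "ad2", "cpc": 1.00, "ctr": 0.05},  # lower bid, clicked often
]

by_bid = max(ads, key=lambda a: a["cpc"])
by_revenue = max(ads, key=lambda a: a["ctr"] * a["cpc"])

print(by_bid["name"])      # ad1 wins on raw bid
print(by_revenue["name"])  # ad2 wins on expected revenue: 0.05 > 0.02
```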
Discussion
● A large field with great challenges
● Google really lived up to its name: a googol is 10^100!
● Highly profitable
● The Internet today is a big brain...
● Can we contribute?