A Survey on Web Information Retrieval Technologies
Lan Huang
Computer Science Department
State University of New York, Stony Brook
Presented by Kajal Miyan
Michigan State University
Overview
● Web Information Retrieval Challenges
● Search Engines – Overview and Architecture (Google as Case Study)
● Various Algorithms
● Directories – Overview and Some Algorithms
● Estimating the Size of the Web
● Main Challenges
● Google File System
● Sponsored Search
● Discussion
Web Information Retrieval Challenges
● Bulk
● Dynamic Internet
● Heterogeneity
● Variety of languages
● Duplication
● High linkage
● Ill-formed queries
● Wide variance in users
● Specific behavior
The Goal
Evaluation
● Precision – the fraction of the retrieved documents that are relevant to the user's information need.
  Precision = |relevant documents ∩ retrieved documents| / |retrieved documents|
● Recall – the fraction of the documents relevant to the query that are successfully retrieved.
  Recall = |relevant documents ∩ retrieved documents| / |relevant documents|
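The two definitions above can be sketched directly as set operations; the document IDs below are made up for illustration.

```python
# Precision and recall over sets of document IDs (hypothetical IDs).
def precision(retrieved: set, relevant: set) -> float:
    # Fraction of retrieved documents that are relevant.
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved: set, relevant: set) -> float:
    # Fraction of relevant documents that were retrieved.
    return len(retrieved & relevant) / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant -> 0.5
print(recall(retrieved, relevant))     # 2 of 3 relevant were retrieved
```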
Current Goal
● Precision at Top 10 results &
● Precision at Top 10 result pages
Various SEs
● SEs
● Directories
● News SEs
● Meta Search Engines
● Social Search Engines
● Opinion, Forum, and Usenet SEs
Architecture (Sherman 2003)
[Diagram: the user's browser sends a query ("Eggs?") to the Query Server, which answers from an index built by the Indexer ("Eggs – 90%, Eggo – 81%, Ego – 40%, Huh? – 10%") over pages the Crawler fetched from the web, e.g. "All About Eggs" by S. I. Am.]
Crawling the web
● Robust – must avoid overloading websites and must deal with huge amounts of data
● Decides in what order to crawl pages
● Decides how frequently to revisit pages
● Rule of thumb – important pages first
Crawler
● Cho et al. '99 (spread the workload)
  – Allocate URLs into 500 queues
  – Allocation is based on a hash of the server name
  – Read one URL from each queue at a time
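The queue-allocation step above can be sketched as follows; the hash function and URL examples are illustrative choices, not taken from the survey.

```python
# Sketch of hash-based workload spreading: every URL is assigned to one of
# 500 queues by hashing its server (host) name, so all URLs from one host
# land in the same queue and per-host politeness is easy to enforce.
import hashlib
from urllib.parse import urlparse

NUM_QUEUES = 500

def queue_for(url: str) -> int:
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_QUEUES

# URLs from the same host always map to the same queue:
print(queue_for("http://example.com/a") == queue_for("http://example.com/b"))
```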
Inverted Indexes the IR Way
How Are Inverted Indexes Created?
Periodically rebuilt, static otherwise.
Documents are parsed to extract tokens. These are saved with the Document ID.
Doc 1: "Now is the time for all good men to come to the aid of their country"

Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight"
Term/Doc# pairs, in order of appearance:
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2
After all documents have been parsed the inverted file is sorted alphabetically.
Sorted Term/Doc# pairs:
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2
Multiple term entries for a single document are merged.
Within-document term frequency information is compiled.
Term/Doc#/Freq triples after merging:
a 2 1, aid 1 1, all 1 1, and 2 1, come 1 1, country 1 1, country 2 1, dark 2 1, for 1 1, good 1 1, in 2 1, is 1 1, it 2 1, manor 2 1, men 1 1, midnight 2 1, night 2 1, now 1 1, of 1 1, past 2 1, stormy 2 1, the 1 2, the 2 2, their 1 1, time 1 1, time 2 1, to 1 2, was 2 2
How Inverted Files are Created
Finally, the file can be split into:
– a Dictionary (Lexicon) file
– a Postings file
Dictionary/Lexicon entries are (Term, #Docs, Total Freq):
a 1 1, aid 1 1, all 1 1, and 1 1, come 1 1, country 2 2, dark 1 1, for 1 1, good 1 1, in 1 1, is 1 1, it 1 1, manor 1 1, men 1 1, midnight 1 1, night 1 1, now 1 1, of 1 1, past 1 1, stormy 1 1, the 2 4, their 1 1, time 2 2, to 1 2, was 1 2

Each dictionary entry points into the Postings file, which stores the (Doc#, Freq) pairs for that term, e.g. country → (1,1)(2,1); the → (1,2)(2,2); to → (1,2); was → (2,2).
Inverted indexes
● Permit fast search for individual terms.
● For each term, you get a list consisting of:
  – document ID
  – frequency of term in doc (optional)
  – position of term in doc (optional)
● These lists can be used to solve Boolean queries:
  country → d1, d2
  manor → d2
  country AND manor → d2
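The whole pipeline on the preceding slides — tokenize, build term → (doc, freq) postings, then intersect posting lists for a Boolean AND — can be sketched on the two example documents:

```python
# Minimal inverted index over the two slide documents, plus a Boolean AND
# answered by intersecting posting lists.
from collections import defaultdict

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

index = defaultdict(dict)  # term -> {doc_id: within-document frequency}
for doc_id, text in docs.items():
    for token in text.split():
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def boolean_and(*terms):
    # Intersect the sets of documents containing each term.
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(index["country"]))                 # [1, 2]
print(sorted(boolean_and("country", "manor")))  # [2]
print(index["the"][1])                          # "the" occurs twice in doc 1
```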
Inverted Indexes for Web Search Engines
● Inverted indexes are still used, even though the web is so huge.
● Some systems partition the index across machines: each machine handles a different part of the data.
● Other systems replicate the data across many machines and distribute queries among them.
● Most do a combination of these.
Google Architecture
● Repository
  – Contains the full HTML text
  – Compressed using zlib (RFC 1950)
  – Each entry is prefixed by docID, length, and URL
● Document Index
  – Each entry contains:
    • the current doc status (crawled?)
    • a pointer into the repository (if crawled)
    • a document checksum (binary search on checksums finds the docID)
    • various statistics
● Lexicon
  – Kept in main memory on a machine with 256 MB
  – Currently contains 14 million words
Google’s Indexing
● The Indexer converts each doc into a collection of "hit lists" and puts these into "barrels", sorted by docID. It also creates a database of "links".
  – Hit: <wordID, position in doc, font info, hit type>
  – Hit type: plain or fancy
  – Fancy hit: occurs in URL, title, anchor text, or meta tag
  – Optimized representation of hits (2 bytes each)
● The Sorter sorts each barrel by wordID to create the inverted index. It also creates a lexicon file.
  – Lexicon: <wordID, offset into inverted index>
  – The lexicon is mostly cached in memory
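The 2-bytes-per-hit representation can be sketched as bit packing. The field widths below (1 capitalization bit, 3 bits of font size, 12 bits of word position) follow the published description of a "plain" hit, but treat them as illustrative rather than definitive.

```python
# One plausible 2-byte packing of a plain hit:
# [cap: 1 bit][font size: 3 bits][position in doc: 12 bits]
def pack_hit(capitalized: bool, font_size: int, position: int) -> int:
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(h: int):
    # Recover the three fields from the 16-bit value.
    return bool(h >> 15), (h >> 12) & 0x7, h & 0xFFF

h = pack_hit(True, 3, 417)
print(h < 2**16)                       # fits in 2 bytes
print(unpack_hit(h) == (True, 3, 417)) # round-trips losslessly
```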
The in-memory Lexicon holds <wordid, #docs> entries that point into the Postings ("inverted barrels", on disk). Each barrel contains postings for a range of wordids.
Google’s Inverted Index
[Diagram: barrels i and i+1, each holding entries of the form <docid, #hits, hit, hit, …>; barrels are sorted by wordid, and the postings within a barrel by docid.]
Google crawler
– Maintains its own DNS cache
– Uses asynchronous I/O to manage events
– Runs 4 crawlers
● Both the URLserver and the crawlers are implemented in Python
● Each crawler keeps 300 connections open at once
● Throughput: >100 pages/s, roughly 600 KB/s
● How is the relevance of a page decided?
Content Relevance
• Phrase matching
• Synonyms
• URL analysis
• Date last updated
• Spell checking
• Home page detection
HTML Weighting
Class  Name        HTML tags
1      Plain Text  (none of the below)
2      Strong      STRONG, B, EM, I, U
3      List        DL, OL, UL
4      Header      H1, H2, H3, H4, H5, H6
5      Anchor      A
6      Title       TITLE
• Meta tag text is mostly ignored by search engines
Link-Based Metrics
• A link from A to B can be viewed as a recommendation, a vote or a citation.
• Links can be referential or informational.
• Links affect the ranking of web pages and thus have commercial value.
Citation and Linking
PageRank - Motivation
• The number of incoming links to a page is a measure of the page's importance and authority.
• Also take into account the quality of the recommendation: a page is more important if the sources of its incoming links are themselves important.
The Random Surfer
• Model the web as a Markov chain.
• A surfer randomly clicks on links; the probability of following any given outlink from page A is 1/m, where m is the number of outlinks from A.
• The surfer occasionally gets bored and is teleported to another web page B, where B is equally likely to be any page.
• Markov chain theory shows that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer visits that page.
PageRank (PR) - Definition
• W is a web page
• W1, …, Wn are the web pages that have a link to W
• O(Wi) is the number of outlinks from Wi
• T is the teleportation probability
• N is the size of the web
PR(W) = T/N + (1 − T) · [ PR(W1)/O(W1) + PR(W2)/O(W2) + … + PR(Wn)/O(Wn) ]
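A minimal power-iteration sketch of the formula above. The three-page link graph and the value of T are made up for illustration.

```python
# PR(W) = T/N + (1 - T) * sum over inlinks Wi of PR(Wi)/O(Wi),
# computed by repeated application until the scores settle.
def pagerank(links, T=0.15, iters=50):
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}          # start uniform
    for _ in range(iters):
        new = {}
        for p in pages:
            # Sum PR(q)/O(q) over every page q that links to p.
            inlink_sum = sum(pr[q] / len(links[q])
                             for q in pages if p in links[q])
            new[p] = T / N + (1 - T) * inlink_sum
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
print(max(pr, key=pr.get))  # C: it collects votes from both A and B
```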
HITS – Hubs and Authorities - Hyperlink-Induced Topic Search
[Figure: node A shown twice — on the left with many incoming links (an authority), on the right with many outgoing links (a hub).]
HITS Algorithm – Iterate until Convergence
A(p) = Σ over q ∈ B with q → p of H(q)
H(p) = Σ over q ∈ B with p → q of A(q)

• B is the base set
• q and p are web pages in B
• A(p) is the authority score for p
• H(p) is the hub score for p
HITS
● Algorithm Overview
● Input:
  – Q: a query string
  – SE: a text-based search engine
  – t: size of the root set
  – d: max number of "in" links
● The top t pages (highest-ranked) from the text-based search engine form the root set
● Output: a focused subset
( Q = “java”, SE = AltaVista, t = 3, d = 3)
Algorithm
● An Iterative Algorithm:
  [authority] weights vector x0 = (1, 1, 1, …, 1)
  [hub] weights vector y0 = (1, 1, 1, …, 1)
  for i = 1, 2, …, k:
    xi = update_authority(yi-1)
    yi = update_hub(xi)
    normalize(xi, yi)
  return (xk, yk)
Applications of HITS
• Search engine querying (speed is an issue)
• Finding web communities
• Finding related pages
• Populating categories in web directories
• Citation analysis
Why HITS Did Not Work Well
• Mutually reinforcing relationships between hosts
• Automatically generated links
• Non-relevant nodes
Topic drift fix – weight the edges:
• k edges from one host to a page → each gets authority weight 1/k
• l edges from a page to one host → each gets hub weight 1/l
Duplicate Elimination
Challenges
– Defining the notion of a replicated collection precisely
  • Slight differences exist between copies
– Finding an efficient algorithm to identify such collections and exploit this knowledge of replication
  • Hundreds of millions of pages
  • Subgraph isomorphism is NP-complete
Reasons for Inability to Detect Duplicates
● Update frequency
● Different formats
● Partial crawls
Some Solutions
● IR techniques for textual similarity
● Data mining techniques for clustering, and so on

A formal definition of similar collections follows, in terms of:
● similar pages
● similar links
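One common way to make "textual similarity" concrete (an illustrative choice, not necessarily the survey's method) is to compare documents by the Jaccard overlap of their word k-shingles:

```python
# Near-duplicate detection sketch: two documents are similar when their
# sets of k-word shingles overlap heavily (Jaccard similarity).
def shingles(text: str, k: int = 3) -> set:
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumps over the lazy cat"  # one word changed
sim = jaccard(shingles(d1), shingles(d2))
print(sim)  # 6 of 8 distinct shingles are shared -> 0.75
```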
Growing Similar Clusters
Directories vs. Search Engines
● Directories
  – Hand-selected sites
  – Search over the contents of the descriptions of the pages
  – Organized in advance into categories
● Search Engines
  – All pages in all sites
  – Search over the contents of the pages themselves
  – Organized in response to a query, by relevance rankings or other scores
Directories & Categorization
● Automatic categorization: TAPER
  – A taxonomy-and-path-enhanced retrieval system
  – Given:
    • a hypertext document corpus
    • a "small" set of pre-classified documents
  – Goal:
    • construct a classifier
    • apply it to new documents
● Manual categorization: OpenGrid and ODP
Good discriminating power: large interclass distance, small intraclass distance
Size of the Web
Typical questions:
– Which search engine has the largest coverage?
– How many pages are out there, and how many are indexed?
● Approach: measure search engine coverage and overlap through random queries
  – Allows a third party to measure relative sizes and overlaps of search engines
  – Given two search engines, E1 and E2, we can:
    • compute their relative sizes
    • compute the fraction of E1's database indexed by E2
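The arithmetic behind the relative-size estimate can be sketched as follows. If a fraction f12 of pages sampled from E1 is also indexed by E2, then |E1 ∩ E2| ≈ f12·|E1|; symmetrically |E1 ∩ E2| ≈ f21·|E2|, so |E1|/|E2| ≈ f21/f12. The sample fractions below are made up.

```python
# Relative size of two search engines from sampled overlap fractions.
def relative_size(frac_e1_in_e2: float, frac_e2_in_e1: float) -> float:
    # |E1 cap E2| ~ f12*|E1| ~ f21*|E2|  =>  |E1|/|E2| ~ f21/f12
    return frac_e2_in_e1 / frac_e1_in_e2

# E.g. 35% of E1's sampled pages are found in E2,
# while 70% of E2's sampled pages are found in E1:
print(relative_size(0.35, 0.70))  # E1 is about twice the size of E2
```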
Size Wars
August 2005: "We index 20 billion documents."
September 2005: "We index 8 billion documents, but our index is 3 times larger than our competition's."
So, who's right?
"We knew the web was big..." (7/25/2008 10:12:00 AM): Google claimed to have found 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once. (As of 10/30/2008 02:33:00 PM.)
Challenges
● Spam: text spam, link spam
● Cloaking
● Content quality
● Quality evaluation
● Web conventions (e.g. the META tag)
● Duplicate hosts
● Vaguely structured data
Google File System
● Goals: performance, scalability, reliability, availability
● Component failures are the norm
● Files are huge
● Most files are mutated by appending new data rather than overwriting existing data
● APIs are co-designed with the file system to increase flexibility
GFS
Sponsored Search
● More than 50% of users visit an SE every few days
● Over 13% of traffic to commercial sites is generated by SEs
● Over 40% of product searches on the web are initiated via SEs
● Pioneered by GoTo (later renamed Overture), then adopted by Google
Measurement and Pricing
● Cost per mille (CPM): price per 1000 impressions
● Cost per action (CPA)
● Cost per click (CPC)
● Click-through rate (CTR) = clicks / impressions
● Yahoo! ranks ads by CPC bid
● Google ranks ads by CTR × CPC
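The difference between the two ranking rules above can be shown with a toy comparison; the two ads and their numbers are made up.

```python
# Rank by raw bid (CPC) vs. by expected revenue per impression (CTR * CPC).
ads = [
    {"name": "ad1", "cpc": 2.00, "ctr": 0.01},  # high bid, rarely clicked
    {"name": "ad2", "cpc": 1.00, "ctr": 0.05},  # lower bid, clicked often
]

by_bid = max(ads, key=lambda a: a["cpc"])
by_revenue = max(ads, key=lambda a: a["ctr"] * a["cpc"])

print(by_bid["name"])      # ad1 wins on raw bid
print(by_revenue["name"])  # ad2 wins on expected revenue: 0.05 > 0.02
```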
Discussion
● A large field with great challenges
● Google really lived up to its name: a googol is 10^100!
● Highly profitable
● The Internet today is a big brain...
● Can we contribute?