Web search engines
Rooted in Information Retrieval (IR) systems
• Prepare a keyword index for the corpus
• Respond to keyword queries with a ranked list of documents
ARCHIE
• Earliest application of rudimentary IR systems to the Internet
• Title search across sites serving files over FTP
Relationships between terms and documents
• Documents containing the word Java
• Documents containing the word Java but not the word coffee
Proximity queries
• Documents containing the phrase Java beans or the term API
• Documents where Java and island occur in the same sentence
Mining the Web Chakrabarti and Ramakrishnan 3
Document preprocessing
Tokenization
• Filtering away tags
• Tokens regarded as nonempty sequences of characters, excluding spaces and punctuation
• Each token represented by a suitable integer, tid, typically 32 bits
• Optional: stemming/conflation of words
• Result: document (did) transformed into a sequence of integers (tid, pos)
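A minimal sketch of this pipeline; the regex tokenizer and the growing `lexicon` dict (standing in for the token-to-tid table) are illustrative assumptions, not the book's implementation:

```python
import re

def tokenize(did, text, lexicon):
    """Strip tags, split on spaces/punctuation, map each token to an
    integer tid, and emit (tid, pos) pairs for document did."""
    text = re.sub(r"<[^>]*>", " ", text)             # filter away tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # nonempty runs of characters
    posting = []
    for pos, tok in enumerate(tokens):
        tid = lexicon.setdefault(tok, len(lexicon))  # a 32-bit integer in practice
        posting.append((tid, pos))
    return did, posting

lexicon = {}
did, posting = tokenize(7, "<b>Java</b> beans, Java API", lexicon)
```

With stemming switched off, repeated words map to the same tid but keep distinct positions.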
Storing tokens
Straightforward implementation using a relational database
• Example figure
• Space overhead scales to almost 10 times the raw text
Accesses to the table show a common pattern:
• reduce the storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples
• Indexing = transposing the document-term matrix
Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as "document/position") may be implemented using a B-tree or a hash table.
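A minimal in-memory sketch of the positional variant, with a plain dict standing in for the on-disk B-tree or hash table:

```python
from collections import defaultdict

def build_index(docs):
    """Transpose (did -> token sequence) into the positional inverted
    index: term -> sorted list of (did, pos) tuples."""
    index = defaultdict(list)
    for did, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].append((did, pos))
    for postings in index.values():
        postings.sort()                 # postings kept in (did, pos) order
    return dict(index)

docs = {1: ["java", "beans"], 2: ["java", "api", "java"]}
index = build_index(docs)
# index["java"] -> [(1, 0), (2, 0), (2, 2)]
```

Dropping the `pos` component gives the simpler, offset-free variant.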
Storage
For dynamic corpora
• Berkeley DB storage manager
• Can frequently add, modify and delete documents
For static collections
• Index compression techniques (to be discussed)
Stopwords
Function words and connectives
• Appear in a large number of documents and are of little use in pinpointing documents
Indexing stopwords
• Stopwords not indexed, to reduce index space and improve performance
• Replace stopwords with a placeholder (to remember the offset)
Issues
• Queries containing only stopwords are ruled out
• Polysemous words that are stopwords in one sense but not in others, e.g., can as a verb vs. can as a noun
Stemming
Conflating words to help match a query term with a morphological variant in the corpus.
Remove inflections that convey parts of speech, tense and number
• E.g.: university and universal both stem to universe
Techniques
• D+ and D- generated automatically
• E.g.: in the Cornell SMART system, the top 10 documents reported by the first round of query execution are included in D+
• The weight on D- is typically set to 0, i.e., D- is not used
Not a commonly available feature
• Web users want instant gratification
• System complexity: executing the second-round query is slower and more expensive for major search engines
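The D+/D- construction above corresponds to Rocchio-style query expansion, q' = alpha*q + beta*centroid(D+) - gamma*centroid(D-); a toy sketch over dense term-weight vectors (the alpha and beta values below are illustrative, with gamma = 0 as on the slide):

```python
def rocchio(q, d_plus, d_minus, alpha=1.0, beta=0.75, gamma=0.0):
    """One round of Rocchio feedback on dense term-weight vectors.
    With gamma = 0, D- is ignored entirely."""
    n = len(q)
    def centroid(ds):
        return [sum(d[i] for d in ds) / len(ds) if ds else 0.0
                for i in range(n)]
    cp, cm = centroid(d_plus), centroid(d_minus)
    return [alpha * q[i] + beta * cp[i] - gamma * cm[i] for i in range(n)]

# Query over a 2-term vocabulary; D+ = top documents from the first round
q2 = rocchio([1.0, 0.0], [[0.0, 2.0], [0.0, 4.0]], [])
```

The expanded query q2 picks up weight on the second term, which the original query never mentioned: this is why a second, slower query round is needed.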
Ranking by odds ratio
R: Boolean random variable which represents the relevance of document d w.r.t. query q.
Ranking documents by their odds ratio for relevance:
$$\frac{\Pr(R\mid q,d)}{\Pr(\bar{R}\mid q,d)} = \frac{\Pr(R,q,d)/\Pr(q,d)}{\Pr(\bar{R},q,d)/\Pr(q,d)} = \frac{\Pr(d\mid R,q)\,\Pr(R\mid q)}{\Pr(d\mid \bar{R},q)\,\Pr(\bar{R}\mid q)}$$
Approximating the probability of d by the product of the probabilities of the individual terms in d:
$$\Pr(d\mid R,q) \approx \prod_{t}\Pr(x_t\mid R,q), \qquad \Pr(d\mid \bar{R},q) \approx \prod_{t}\Pr(x_t\mid \bar{R},q)$$
Approximately, with $a_{t,q} = \Pr(x_t{=}1\mid R,q)$ and $b_{t,q} = \Pr(x_t{=}1\mid \bar{R},q)$:
$$\frac{\Pr(R\mid q,d)}{\Pr(\bar{R}\mid q,d)} \propto \prod_{t\in d\cap q}\frac{a_{t,q}\,(1-b_{t,q})}{b_{t,q}\,(1-a_{t,q})}$$
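In log space the final product becomes a sum of per-term weights over the terms the document shares with the query. A minimal sketch, assuming the probabilities a and b have already been estimated (the dictionaries below are illustrative):

```python
import math

def odds_ratio_score(doc_terms, query_terms, a, b):
    """Log odds-ratio score under the binary independence model:
    sum over t in (d intersect q) of log[a_t(1-b_t) / (b_t(1-a_t))]."""
    score = 0.0
    for t in set(doc_terms) & set(query_terms):
        score += math.log(a[t] * (1 - b[t]) / (b[t] * (1 - a[t])))
    return score

# Illustrative estimates: a[t] = Pr(x_t=1 | R,q), b[t] = Pr(x_t=1 | not-R,q)
a = {"java": 0.8, "island": 0.5}
b = {"java": 0.2, "island": 0.5}
s = odds_ratio_score(["java", "island", "coffee"], ["java", "island"], a, b)
```

A term that is equally likely in relevant and irrelevant documents (here "island") contributes nothing to the score.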
Bayesian Inferencing
Bayesian inference network for relevance ranking. A document is relevant to the extent that setting its corresponding belief node to true lets us assign a high degree of belief in the node corresponding to the query.
Manual specification of mappings from terms to approximate concepts.
Bayesian Inferencing (contd.)
Four layers
1. Document layer
2. Representation layer
3. Query concept layer
4. Query
Each node is associated with a random Boolean variable, reflecting belief.
Directed arcs signify that the belief of a node is a function of the beliefs of its immediate parents (and so on up the network).
Bayesian Inferencing systems
Layers 2 and 3 are the same as in basic vector-space IR systems
Verity's Search97
• Allows administrators and users to define hierarchies of concepts in files
Estimation of the relevance of a document d w.r.t. the query q
• Set the belief of the corresponding node to 1
• Set all other document beliefs to 0
• Compute the belief of the query
• Rank documents in decreasing order of the belief they induce in the query
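A toy sketch of this ranking loop; the noisy-OR combination of parent beliefs and the product over query terms are illustrative modeling assumptions (real systems such as Search97 define their own combination functions):

```python
def noisy_or(parent_beliefs, weights):
    """Belief in a node as a noisy-OR over its parents' beliefs."""
    p_off = 1.0
    for belief, w in zip(parent_beliefs, weights):
        p_off *= 1.0 - w * belief
    return 1.0 - p_off

def rank_by_query_belief(docs, term_links, query_terms):
    """Turn each document node on alone, propagate belief to the
    representation (term) layer via noisy-OR, and take the product over
    query terms as the belief induced in the query node."""
    scores = {}
    for d in docs:
        doc_belief = [1.0 if dd == d else 0.0 for dd in docs]
        term_belief = {
            t: noisy_or(doc_belief, [links.get(dd, 0.0) for dd in docs])
            for t, links in term_links.items()
        }
        q = 1.0
        for t in query_terms:
            q *= term_belief.get(t, 0.0)
        scores[d] = q
    return scores

docs = ["d1", "d2"]
term_links = {"java": {"d1": 0.9, "d2": 0.3}, "api": {"d1": 0.5}}
scores = rank_by_query_belief(docs, term_links, ["java", "api"])
```

Only d1 activates both query terms, so it induces the higher belief in the query node.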
Other issues
Spamming
• Adding popular query terms to a page unrelated to those terms
• E.g.: adding "Hawaii vacation rental" to a page about "Internet gambling"
• Suffered a setback with the advent of hyperlink-based ranking
Titles, headings, meta tags and anchor-text
• TFIDF framework treats all terms the same
• Search engines assign weightage to text occurring in tags and meta-tags
• Using anchor-text on pages u which link to v
• Anchor-text on u offers valuable editorial judgment about v as well
Other issues (contd.)
Including phrases to rank complex queries
• Operators to specify word inclusions and exclusions
• With operators and phrases, queries/documents can no longer be treated as ordinary points in vector space
Dictionary of phrases
• Could be cataloged manually
• Could be derived from the corpus itself using statistical techniques
• Two separate indices: one for single terms and another for phrases
Corpus-derived phrase dictionary
Two terms $t_1$ and $t_2$
Null hypothesis = occurrences of $t_1$ and $t_2$ are independent
To the extent the pair violates the null hypothesis, it is likely to be a phrase
• Measuring violation with the likelihood ratio of the hypothesis
• Pick phrases that violate the null hypothesis with large confidence
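A standard choice for the violation measure is Dunning's binomial log-likelihood ratio; the slide does not name a specific statistic, so the following is one reasonable instantiation:

```python
import math

def log_l(k, n, p):
    """Binomial log-likelihood, guarding the p = 0 and p = 1 edges."""
    def term(c, x):
        return c * math.log(x) if c else 0.0
    return term(k, p) + term(n - k, 1 - p)

def llr(c12, c1, c2, n):
    """Dunning log-likelihood ratio that the bigram t1 t2 violates the
    null hypothesis of independent occurrence.
    c12: count of the pair, c1/c2: counts of each term, n: total bigrams."""
    p = c2 / n                    # null: Pr(t2 | anything) is the same
    p1 = c12 / c1                 # alt: Pr(t2 | t1)
    p2 = (c2 - c12) / (n - c1)    # alt: Pr(t2 | not t1)
    return 2 * (log_l(c12, c1, p1) + log_l(c2 - c12, n - c1, p2)
                - log_l(c12, c1, p) - log_l(c2 - c12, n - c1, p))
```

A pair that co-occurs exactly as often as independence predicts scores near zero; a strongly associated pair scores high and is a phrase candidate.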
Approximate string matching
Non-uniformity of word spellings
• dialects of English
• transliteration from other languages
Two ways to reduce this problem:
1. Aggressive conflation mechanism to collapse variant spellings into the same token
2. Decompose terms into a sequence of q-grams, i.e., sequences of q characters
Approximate string matching (contd.)
1. Aggressive conflation mechanism to collapse variant spellings into the same token
• E.g.: Soundex: takes phonetics and pronunciation details into account
• Used with great success in indexing and searching last names in census and telephone directory data
2. Decompose terms into a sequence of q-grams, i.e., sequences of q characters (q small, typically 2 to 4)
• Check for similarity in the q-grams
• Looking up the inverted index becomes a two-stage affair:
• A smaller index of q-grams is consulted to expand each query term into a set of slightly distorted query terms
• These terms are then submitted to the regular index
• Used by Google for spelling correction
• The idea has also been adopted for eliminating near-duplicate pages
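The two-stage q-gram lookup can be sketched as follows; the `#` padding and the 0.3 overlap threshold are illustrative assumptions:

```python
def qgrams(term, q=3):
    """Decompose a term into its set of q-grams, padding the ends."""
    padded = f"#{term}#"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def expand(query_term, vocabulary, q=3, threshold=0.3):
    """First stage of the two-stage lookup: use q-gram overlap to expand
    a query term into slightly distorted variants present in the corpus."""
    qg = qgrams(query_term, q)
    out = []
    for v in vocabulary:
        vg = qgrams(v, q)
        if len(qg & vg) / len(qg | vg) >= threshold:
            out.append(v)
    return out

cands = expand("colour", ["color", "colour", "java"])
```

The expanded set (here both spellings of "colour") is then submitted to the regular inverted index as an OR-query.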
Meta-search systems
• Take the search engine to the document
• Forward queries to many geographically distributed repositories
• Each has its own search service
• Consolidate their responses
• Advantages
• Perform non-trivial query rewriting
• Suit a single user query to many search engines with different query syntax
• Surprisingly small overlap between crawls
• Consolidating responses
• Function goes beyond just eliminating duplicates
• Search services do not provide standard ranks which can be combined meaningfully
Similarity search
• Cluster hypothesis
• Documents similar to relevant documents are also likely to be relevant
• Handling "find similar" queries
• Replication or duplication of pages
• Mirroring of sites
Document similarity
• Jaccard coefficient of similarity between documents $d_1$ and $d_2$
• T(d) = set of tokens in document d
$$r'(d_1,d_2) = \frac{|T(d_1)\cap T(d_2)|}{|T(d_1)\cup T(d_2)|}$$
• Symmetric, reflexive, not a metric
• Forgives any number of occurrences and any permutations of the terms
• $1 - r'(d_1,d_2)$ is a metric
Estimating Jaccard coefficient with random permutations
1. Generate a set of m random permutations $\pi$
2. for each permutation $\pi$ do
3.   compute $\pi(T(d_1))$ and $\pi(T(d_2))$
4.   check if $\min \pi(T(d_1)) = \min \pi(T(d_2))$
5. end for
6. if equality was observed in k cases, estimate $r'(d_1,d_2) = k/m$
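The estimator can be sketched directly; explicit permutations are materialized here for clarity, whereas production systems use hash functions instead:

```python
import random

def minhash_estimate(t1, t2, universe, m=200, seed=42):
    """Estimate r'(d1,d2) as k/m, where k counts the random permutations
    under which T(d1) and T(d2) have the same minimum element."""
    rng = random.Random(seed)
    k = 0
    items = sorted(universe)
    for _ in range(m):
        order = rng.sample(items, len(items))        # one random permutation
        rank = {tok: i for i, tok in enumerate(order)}
        if min(rank[t] for t in t1) == min(rank[t] for t in t2):
            k += 1
    return k / m

t1, t2 = {"a", "b", "c"}, {"b", "c", "d"}
est = minhash_estimate(t1, t2, t1 | t2, m=500)   # true Jaccard is 2/4 = 0.5
```

The estimate is unbiased: the probability that both sets share the same minimum under a random permutation is exactly the Jaccard coefficient.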
Fast similarity search with random permutations
1. for each random permutation $\pi$ do
2.   create a file $f_\pi$
3.   for each document d do
4.     write out $\big(s = \min \pi(T(d)),\, d\big)$ to $f_\pi$
5.   end for
6.   sort $f_\pi$ using key s: this results in contiguous blocks with fixed s containing all associated d
7.   create a file $g_\pi$
8.   for each pair $(d_1,d_2)$ within a run of $f_\pi$ having a given s do
9.     write out a document-pair record $(d_1,d_2)$ to $g_\pi$
10.  end for
11.  sort $g_\pi$ on key $(d_1,d_2)$
12. end for
13. merge $g_\pi$ for all $\pi$ in order, counting the number of entries for each $(d_1,d_2)$
Eliminating near-duplicates via shingling
• “Find-similar” algorithm reports all duplicate/near-duplicate pages
• Eliminating duplicates
• Maintain a checksum with every page in the corpus
• Eliminating near-duplicates
• Represent each document as a set T(d) of q-grams (shingles)
• Find the Jaccard similarity $r'(d_1,d_2)$ between $d_1$ and $d_2$
• Eliminate the pair from step 9 if it has similarity above a threshold
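The shingling pipeline above can be sketched end to end; word 4-grams and the 0.9 threshold are illustrative choices, not prescribed by the slide:

```python
def shingles(text, q=4):
    """Represent a document as its set T(d) of word q-grams (shingles)."""
    words = text.split()
    return {tuple(words[i:i + q]) for i in range(max(len(words) - q + 1, 1))}

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def near_duplicates(pages, threshold=0.9, q=4):
    """Flag page pairs whose shingle sets have Jaccard similarity
    above the threshold."""
    sh = {u: shingles(text, q) for u, text in pages.items()}
    urls = sorted(sh)
    return [(u, v) for i, u in enumerate(urls) for v in urls[i + 1:]
            if jaccard(sh[u], sh[v]) > threshold]

pages = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "completely different text here entirely now",
}
dups = near_duplicates(pages)
```

At scale the all-pairs loop is replaced by the random-permutation candidate generation of the previous slides.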
Detecting locally similar sub-graphs of the Web
• Similarity search and duplicate elimination on the graph structure of the web
• To improve the quality of hyperlink-assisted ranking
• Approach 1
1. Start the process with textual duplicate detection
• cleaned URLs are listed and sorted to find duplicates/near-duplicates
• each set of equivalent URLs is assigned a unique token ID
• each page is stripped of all text and represented as a sequence of outlink IDs
2. Continue using the link-sequence representation
3. Until no further collapse of multiple URLs is possible
• Approach 2 [Bottom-up approach]
1. identify single nodes which are near-duplicates (using text shingling)
2. extend single-node mirrors to two-node mirrors
3. continue on to larger and larger subgraphs which are likely mirrors of one another
Detecting mirrored sites (contd.)
• Approach 3 [Step before fetching all pages]
• Uses regularity in URL strings to identify host-pairs which are mirrors
• Preprocessing
• Hosts are represented as sets of positional bigrams:
• Convert host and path to all lowercase characters
• Let any punctuation or digit sequence be a token separator
• Tokenize the URL into a sequence of tokens (e.g., www6.infoseek.com gives www, infoseek, com)
• Eliminate stop terms such as htm, html, txt, main, index, home, bin, cgi
• Form positional bigrams from the token sequence
• Two hosts are said to be mirrors if
• A large fraction of paths are valid on both web sites
• These common paths link to pages that are near-duplicates