Week 10: What are we searching for?
The College of Saint Rose
CIS 521 / MBA 541 – Introduction to Internet Development
David Goldschmidt, Ph.D.
Selected material from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0.
What is search?
What are we searching for?
How many searches are processed per day?
What is the average number of words in text-based searches?
Finding things
Applications and varieties of search:
  Web search
  Site search
  Vertical search
  Enterprise search
  Desktop search
  Peer-to-peer search
Acquisition and indexing
where do we search next?
how do we acquire new documents?
Text transformation
how do we best convert documents to their index terms?
how do we make acquired documents searchable?
User interaction and querying
Measures of success (i)
Relevance
  Search results contain the information the searcher was looking for
  Problems with vocabulary mismatch
  ▪ Homonyms (e.g. “Jersey shore”)
User relevance
  Search results relevant to one user may be completely irrelevant to another user
Measures of success (ii)
Precision
  Proportion of retrieved documents that are relevant
  How precise were the results?
Recall (and coverage)
  Proportion of relevant documents that were actually retrieved
  Did we retrieve all of the relevant documents?
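These two proportions are easy to compute once we know both sets. A minimal Python sketch, with made-up document IDs just for illustration:

# Precision and recall for a single query (illustrative IDs only)
retrieved = {"doc1", "doc2", "doc3", "doc4"}   # what the engine returned
relevant  = {"doc2", "doc4", "doc5"}           # what the user actually wanted

hits = retrieved & relevant                    # relevant documents we retrieved

precision = len(hits) / len(retrieved)         # 2/4 = 0.50
recall    = len(hits) / len(relevant)          # 2/3 ≈ 0.67

print(f"precision={precision:.2f} recall={recall:.2f}")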
Measures of success (iii)
Timeliness and freshness
  Search results contain information that is current and up-to-date
Performance
  Users expect subsecond response times
Media
  Users increasingly use cellphones and other mobile devices
Measures of success (iv)
Scalability
  Designs that work must perform equally well as the system grows and expands
  ▪ Increased numbers of documents, users, etc.
Flexibility (or adaptability)
  Tune search engine components to keep up with the changing landscape
Spam-resistance
Information retrieval (IR)
Gerard Salton (1927-1995)
  Pioneer in information retrieval
  Defined information retrieval as “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information”
  This was 1968 (before the Internet and Web!)
(Un)structured information
Structured information:
  Often stored in a database
  Organized via predefined tables, columns, etc.
  e.g. “Select all accounts with balances less than $200”
Unstructured information:
  Document text (headings, words, phrases)
  Images, audio, video (often relies on textual tags)
account number | balance
7004533711     | $498.19
7004533712     | $781.05
7004533713     | $147.15
7004533714     | $195.75
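The “balances less than $200” example maps directly onto a structured query. A minimal sketch using Python's built-in sqlite3 module, with the table contents copied from the example above:

import sqlite3

# Build the example accounts table in an in-memory database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (account_number TEXT, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [
    ("7004533711", 498.19),
    ("7004533712", 781.05),
    ("7004533713", 147.15),
    ("7004533714", 195.75),
])

# "Select all accounts with balances less than $200"
for row in con.execute("SELECT * FROM accounts WHERE balance < 200"):
    print(row)   # ('7004533713', 147.15) and ('7004533714', 195.75)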
Processing text
Search and IR have largely focused on processing text and documents.
Search typically uses the statistical properties of text:
  Word counts
  Word frequencies
  But ignores linguistic features (noun, verb, etc.)
Image search?
Image search currently relies on textual tags, and is therefore just another form of text-based search.
[Example: a photo tagged “Edie; little girl”, “kid’s laptop”, “bare foot”, “drink; sippy cup”]
Uniform Resource Locator (URL)
A URL identifies a resource on the Web, and consists of:
  A scheme or protocol (e.g. http, https)
  A hostname (e.g. academic2.strose.edu)
  A resource (e.g. /math_and_science/goldschd)
e.g. http://cs.strose.edu/courses-ug.html
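Python's standard urllib.parse module splits a URL into exactly these parts; a quick sketch using the example URL above:

from urllib.parse import urlparse

url = "http://cs.strose.edu/courses-ug.html"
parts = urlparse(url)

print(parts.scheme)   # 'http'             (the scheme/protocol)
print(parts.netloc)   # 'cs.strose.edu'    (the hostname)
print(parts.path)     # '/courses-ug.html' (the resource)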
GET and POST requests
When a client requests a Web page, the client uses either a GET or POST request (followed by the protocol and a blank line). We hopefully receive a 200 response.

GET / HTTP/1.0
GET /subdir/stuff.html HTTP/1.0
GET /images/icon.png HTTP/1.0
GET /docs/paper.pdf HTTP/1.0

HTTP/1.1 200 OK
Date: current date and time
Last-Modified: last modified date and time
etc.
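A minimal sketch of issuing a GET request with Python's http.client and reading the status line; the hostname is the example from the slide and may not respond today, and the exact headers returned vary by server:

import http.client

# Open a connection and send a GET request for the site's root page
conn = http.client.HTTPConnection("cs.strose.edu")   # example host from the slide
conn.request("GET", "/")

resp = conn.getresponse()
print(resp.status, resp.reason)          # hopefully: 200 OK
print(resp.getheader("Last-Modified"))   # may be None if the server omits it
conn.close()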
Politeness and robots.txt
Web crawlers adhere to a politeness policy: GET requests are sent only every few seconds or minutes.
A robots.txt file specifies what crawlers are allowed to crawl.
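Python ships a robots.txt parser in urllib.robotparser; a sketch of how a polite crawler might consult it before fetching a page (the URLs and the "MyCrawler" user-agent name are illustrative):

from urllib import robotparser

# Fetch and parse the site's robots.txt (illustrative URL)
rp = robotparser.RobotFileParser()
rp.set_url("http://cs.strose.edu/robots.txt")
rp.read()

# Ask whether our crawler may fetch a given page
if rp.can_fetch("MyCrawler", "http://cs.strose.edu/courses-ug.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")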
Sitemaps
Each URL in a sitemap may specify a priority (the default priority is 0.5).
Without a sitemap, some URLs might not be discovered by the crawler.
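A sketch of reading a tiny, made-up sitemap with Python's xml.etree; real sitemaps use the namespace shown and can list thousands of URLs:

import xml.etree.ElementTree as ET

# A tiny made-up sitemap for illustration
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://cs.strose.edu/</loc><priority>0.8</priority></url>
  <url><loc>http://cs.strose.edu/courses-ug.html</loc></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for url in ET.fromstring(sitemap).findall("sm:url", ns):
    loc = url.findtext("sm:loc", namespaces=ns)
    pri = url.findtext("sm:priority", default="0.5", namespaces=ns)  # 0.5 if omitted
    print(loc, pri)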
Text transformation
Find/Replace
The simplest approach is find, which requires no text transformation.
  Useful in user applications, but not in search (why?)
  An optional transformation handled during the find operation: case sensitivity
Text statistics (i)
English documents are predictable:
  The top two most frequently occurring words are “the” and “of” (10% of word occurrences)
  The top six most frequently occurring words account for 20% of word occurrences
  The top fifty most frequently occurring words account for 50% of word occurrences
  Given all unique words in a (large) document, approximately 50% occur only once
Text statistics (ii)
Zipf’s law: rank words in order of decreasing frequency. The rank (r) of a word times its frequency (f) is approximately equal to a constant (k):

  r × f = k

In other words, the frequency of the rth most common word is inversely proportional to r.

George Kingsley Zipf (1902-1950)
Text statistics (iii)
The probability of occurrence (Pr) of a word is the word frequency divided by the total number of words in the document. Revise Zipf’s law as:

  r × Pr = c    (for English, c ≈ 0.1)
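A minimal Python sketch of checking this: rank the words of any text by frequency and see whether r × Pr hovers near a constant (about 0.1 for a large English corpus). The path "corpus.txt" is a placeholder for whatever text you try it on:

from collections import Counter

def zipf_check(text):
    words = text.lower().split()
    total = len(words)
    counts = Counter(words).most_common()   # sorted by decreasing frequency
    for rank, (word, freq) in enumerate(counts[:10], start=1):
        pr = freq / total                   # probability of occurrence
        print(f"{rank:2d}  {word:12s}  r*Pr = {rank * pr:.3f}")

# On a large English corpus, the r*Pr column stays near c ≈ 0.1
zipf_check(open("corpus.txt").read())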
Text statistics (iv)
Verify Zipf’s law using the AP89 dataset, a collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):

  Total documents:               84,678
  Total word occurrences:        39,749,179
  Vocabulary size:               198,763
  Words occurring > 1000 times:  4,169
  Words occurring once:          70,064
For each document we process, the goal is to isolate each word occurrence. This is called tokenization or lexical analysis.
We might also recognize various types of content, including:
  Metadata (i.e. invisible <meta> tags)
  Images and video (via textual tags)
  Document structure (sections, tables, etc.)
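A minimal tokenizer sketch in Python; real lexical analysis handles many more cases (hyphens, apostrophes, markup), but this shows the core step of isolating word occurrences:

import re

def tokenize(text):
    # Split on runs of non-alphanumeric characters to isolate word occurrences
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]

print(tokenize("Search Engines: Information Retrieval in Practice (2010)"))
# ['Search', 'Engines', 'Information', 'Retrieval', 'in', 'Practice', '2010']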
Text normalization
Before we tokenize the given sequence of characters, we might normalize the text by:
  Converting to lowercase
  Omitting punctuation and special characters
  Omitting words less than 3 characters long
  Omitting HTML/XML/other tags
What do we do with numbers?
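A sketch applying exactly the steps listed above; the 3-character cutoff and tag stripping follow the slide, and this version happens to keep digits, so the question about numbers stands:

import re

def normalize(text):
    text = re.sub(r"<[^>]+>", " ", text)          # omit HTML/XML/other tags
    text = text.lower()                           # convert to lowercase
    words = re.split(r"[^a-z0-9]+", text)         # omit punctuation/special chars
    return [w for w in words if len(w) >= 3]      # omit words under 3 characters

print(normalize("<b>The CAT sat</b> on the mat, obviously!"))
# ['the', 'cat', 'sat', 'the', 'mat', 'obviously']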
Stopping and stopwords (i)
Certain function words (e.g. “the” and “of”) are typically ignored during text processing. These are called stopwords, because they are “stopped” (discarded) when encountered.
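Stopping is then just a set-membership filter over the token stream; a sketch with a tiny illustrative stopword list (real lists run to hundreds of words):

# Tiny illustrative stopword list; production lists are much longer
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "structure", "of", "information"]))
# ['structure', 'information']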
PageRank
Browse the Web as a random surfer:
  Choose a random number r between 0 and 1
  If r < λ, then go to a random page; else follow a random link from the current page
  Repeat!
The PageRank of page A, denoted PR(A), is the probability that this “random surfer” will be looking at that page.
PageRank (iii)
Jumping to a random page avoids getting stuck in:
  Pages that have no links
  Pages that only have broken links
  Pages that loop back to previously visited pages
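A sketch of simulating the random surfer on a tiny made-up link graph; λ is the probability of jumping to a random page, and the fraction of visits each page receives approximates its PageRank:

import random
from collections import Counter

# Tiny made-up link graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end
pages = list(links)
lam = 0.15            # probability lambda of jumping to a random page
visits = Counter()

page = random.choice(pages)
for _ in range(100_000):
    visits[page] += 1
    # Jump randomly with probability lambda, or whenever the page has no links
    if random.random() < lam or not links[page]:
        page = random.choice(pages)
    else:
        page = random.choice(links[page])

for p in pages:       # visit fractions approximate PR(p)
    print(p, visits[p] / 100_000)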
Link quality (and avoiding spam)
A cycle tends to negate the effectiveness of the PageRank algorithm.
Retrieval models (i)
A retrieval model is a formal (mathematical) representation of the process of matching a query and a document
Forms the basis of ranking results
[Figure: the retrieval model matches the user’s query terms against the document collection (doc 123, doc 234, doc 257, doc 345, doc 455, doc 567, doc 678, doc 789, doc 881, doc 913, doc 972) to decide which documents to return]
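The simplest possible retrieval model is term overlap: score each document by how many query terms it contains, then rank by score. A sketch, with a made-up mini-collection and query:

def score(query_terms, doc_terms):
    # Count how many query terms appear in the document
    return len(set(query_terms) & set(doc_terms))

docs = {   # made-up mini-collection
    "doc123": "abraham lincoln civil war president".split(),
    "doc234": "stovepipe hats for tall guys".split(),
    "doc345": "jersey shore beach report".split(),
}
query = "abraham lincoln".split()

ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranked)   # doc123 ranks first: it matches both query terms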
Retrieval models (ii)
Goal: retrieve exactly the documents that users want (whether they know it or not!).
A good retrieval model finds documents that are likely to be considered relevant by the user submitting the query (i.e. user relevance).
A good retrieval model also often considers topical relevance.
Topical relevance
Given a query, topical relevance identifies documents judged to be on the same topic, even though keyword-based document scores might show a lack of relevance!
[Figure: for the query “Abraham Lincoln”, topically relevant documents include the Civil War, U.S. Presidents, stovepipe hats, and tall guys with beards]
User relevance
User relevance is difficult to quantify because of each user’s subjectivity.
Humans often have difficulty explaining why one document is more relevant than another.
Humans may disagree about a given document’s relevance in relation to the same query.