60-538: Information Retrieval
September 4, 2014
Outline
1 what is IR
2 course schedule
3 grading scheme
IR not long ago
now IR is mostly about search engines
there are many search engines ...
IR is more than web search
These days we frequently think first of web search, but there are many other cases:
digital library search
E-mail search, searching your desktop and laptop computers
Corporate knowledge bases, local business search, expert search
Legal information retrieval, patent search
news search
image and video search
(micro-)blog search
product search, federated search
social search, community Q&A, question-answering
recommender systems
opinion mining
definition of information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
From the IIR book.
Structured vs. unstructured data
[Figure panels: in the 90's; today]
other definitions
Information retrieval (IR) is the science and practice of matching information seekers with the information they seek.
Gerard Salton, 1968:
Information retrieval is a field concerned with the structure, analysis, organization, storage, and retrieval of information.
The search task
Given a query and a corpus, find relevant items
query: the user's expression of their information need
corpus: a repository of retrievable items
relevance: satisfaction of the user's information need
Why is IR fascinating?
Information retrieval is an uncertain process
users don’t know what they want
users don’t know how to convey what they want
computers can’t elicit information like a librarian
computers can’t understand natural language text
the search engine can only guess what is relevant
the search engine can only guess if a user is satisfied
over time, we can only guess how users adjust their short- and long-term behavior
classic search model
A query is an impoverished description of the user's information need
Highly ambiguous to anyone other than the user
Retrieval Model
A formal method that predicts the degree of relevance of a document to a query
taxonomy of IR models
Document property: text, links, multimedia
IR models: Boolean, vector, probabilistic
Semistructured text: proximal nodes, XML-based
Web: PageRank, hubs and authorities (HITS)
Multimedia: image retrieval, audio, video
Set theoretic: fuzzy, extended Boolean, set-based
Algebraic: generalized vector, LSI, NN
Probabilistic: BM25, language models, Bayesian networks
Boolean Retrieval Model
The user describes their information need using Boolean constraints (e.g., AND, OR, and AND NOT)
The burden is on the user to formulate a good boolean query
Example
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One choice: use the grep command in Unix.
grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia
Why is that not the answer?
Slow (for large corpora)
NOT Calpurnia is non-trivial
Other operations (e.g., find the word Romans near countrymen) not feasible
Ranked retrieval (best documents to return) is not supported
so we need to index the text
what is an index
index construction process
Initial stages of text processing
Tokenization: cut character sequence into word tokens
Deal with "John's", a state-of-the-art solution
Normalization: map text and query term to same form
You want U.S.A. and USA to match
Stemming: we may wish different forms of a root to match
authorize, authorization
Stop words: we may omit very common words (or not)
the, a, to, of
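The four stages above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline of any particular search engine: the regular expression, the stop-word set, the suffix list, and the function names (`tokenize`, `normalize`, `toy_stem`, `preprocess`) are all assumptions made for this example. A real system would use a proper stemmer such as Porter's.

```python
import re

# Toy resources (assumptions for this sketch); real systems use larger lists.
STOP_WORDS = {"the", "a", "to", "of"}
SUFFIXES = ("ization", "ation", "ize", "ing", "s")


def tokenize(text):
    """Cut a character sequence into word tokens, keeping internal . ' - marks."""
    return re.findall(r"[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*\.?", text)


def normalize(token):
    """Map a token to one canonical form: lowercase, periods removed."""
    return token.lower().replace(".", "")


def toy_stem(token):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, drop_stop_words=True):
    """Tokenize, normalize, optionally drop stop words, then stem."""
    terms = []
    for token in tokenize(text):
        term = normalize(token)
        if drop_stop_words and term in STOP_WORDS:
            continue
        terms.append(toy_stem(term))
    return terms


print(tokenize("John's state-of-the-art solution"))
print(preprocess("The authorization of U.S.A."))
print(preprocess("authorize USA"))
```

With these toy rules, "U.S.A." and "USA" map to the same term, and "authorize" and "authorization" share one stem. The same `preprocess` function has to be applied to both documents and queries, otherwise the terms will not match.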
postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc. frequency information is added.
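A minimal in-memory sketch of these three steps, assuming documents arrive as lists of already-processed terms keyed by an integer document ID. The function name `build_inverted_index` and the sample documents are illustrative only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs maps doc_id -> list of terms; returns (dictionary, postings).

    postings:   term -> sorted list of doc IDs containing the term
    dictionary: term -> document frequency (length of the postings list)
    """
    doc_sets = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            # A set merges multiple entries of a term in the same document.
            doc_sets[term].add(doc_id)
    postings = {term: sorted(ids) for term, ids in doc_sets.items()}
    dictionary = {term: len(ids) for term, ids in postings.items()}
    return dictionary, postings


docs = {
    1: ["i", "did", "enact", "julius", "caesar", "i", "was", "killed"],
    2: ["so", "let", "it", "be", "with", "caesar"],
}
dictionary, postings = build_inverted_index(docs)
print(dictionary["caesar"], postings["caesar"])
```

Keeping each postings list sorted by document ID is what makes the linear-time merge used in query processing possible.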
query processing
Consider processing the query:
Brutus AND Caesar
Locate Brutus in the Dictionary;
Retrieve its postings.
Locate Caesar in the Dictionary;
Retrieve its postings.
Merge the two postings (intersect the document sets):
brutus 1 2 4 11 31 45 173 174
caesar 1 2 4 5 6 16 57 132
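The merge step can be sketched as a two-pointer walk over the two sorted postings lists. The function name `intersect` is chosen for this example; the lists are the ones shown above.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # [1, 2, 4]
```

Each pointer only moves forward, so the merge takes time proportional to the sum of the two list lengths. This depends on both lists being sorted.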
Outline
1 what is IR
2 course schedule
3 grading scheme
tentative schedule
boolean model
text transformation
surface web and deep web crawling. near-duplicate detection
build a search engine using Lucene
statistical properties and laws of language
vector space model
evaluation
link analysis and PageRank
clustering and classification: Naive Bayes, SVM, LSI
sentiment analysis
Text Book
[IIR] Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008. (book website)
Other reference books:
[SE] Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.
[MIR] Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
[MMD] Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman, 2013.
IIR 02: The term vocabulary and postings lists
Phrase queries: “Stanford University”
Proximity queries: Gates near Microsoft
We need an index that captures position information for phrase queries and proximity queries.
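One way to capture position information is a positional index that stores, for each term and document, the list of token positions. The sketch below is an illustration under simple assumptions (whitespace-tokenized, lowercased toy documents); the names `build_positional_index` and `within` are hypothetical, and the nested loops favor clarity over the linear-time merge a real engine would use.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs maps doc_id -> list of terms; returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def within(index, term1, term2, k, ordered=False):
    """Doc IDs where term2 occurs within k positions of term1.

    With ordered=True, term2 must come after term1, so k=1 matches the
    two-word phrase "term1 term2".
    """
    docs1 = index.get(term1, {})
    docs2 = index.get(term2, {})
    hits = []
    for doc_id in sorted(set(docs1) & set(docs2)):
        for p1 in docs1[doc_id]:
            if ordered:
                found = any(0 < p2 - p1 <= k for p2 in docs2[doc_id])
            else:
                found = any(0 < abs(p2 - p1) <= k for p2 in docs2[doc_id])
            if found:
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "stanford university is in california".split(),
    2: "the university near stanford".split(),
    3: "gates founded microsoft".split(),
}
index = build_positional_index(docs)
print(within(index, "stanford", "university", 1, ordered=True))  # phrase query
print(within(index, "gates", "microsoft", 2))                    # proximity query
```

The phrase query matches only document 1, while an unordered window of 2 around "stanford" also matches document 2.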
IIR 04: Index construction
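[Figure: distributed index construction. In the map phase, the master assigns splits to parsers, which write segment files partitioned by term range (a-f, g-p, q-z). In the reduce phase, the master assigns each term range to an inverter, which produces the postings.]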
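The data flow of this map/reduce scheme can be imitated in a single process. The sketch below is only a simulation: there is no real distribution, the "master" is an ordinary function, and the names `parse`, `invert`, and `distributed_index` are assumptions for this example. It also assumes every term starts with a letter a-z.

```python
from collections import defaultdict

# Term ranges handled by the three inverters.
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_of(term):
    """Index of the term range that a term's first letter falls into."""
    first = term[0].lower()
    for i, (low, high) in enumerate(PARTITIONS):
        if low <= first <= high:
            return i
    raise ValueError(f"term {term!r} does not start with a letter a-z")


def parse(split):
    """Map phase: turn one split into per-partition lists of (term, doc_id)."""
    segments = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(pairs):
    """Reduce phase: collect the pairs of one partition into postings lists."""
    doc_sets = defaultdict(set)
    for term, doc_id in pairs:
        doc_sets[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(doc_sets.items())}


def distributed_index(splits):
    """Stand-in for the master: run parsers, then one inverter per partition."""
    all_segments = [parse(split) for split in splits]
    index = {}
    for p in range(len(PARTITIONS)):
        pairs = [pair for segments in all_segments for pair in segments[p]]
        index.update(invert(pairs))
    return index


splits = [
    [(1, "caesar came")],
    [(2, "caesar died"), (3, "brutus quit")],
]
print(distributed_index(splits))
```

Because each term belongs to exactly one partition, the inverters never need to coordinate with each other, which is what makes the reduce phase easy to parallelize.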
statistical properties of text
[Figure: log-log plot of collection frequency (log10 cf) against rank (log10 rank), both axes running from 0 to 7]
Zipf's law, Heaps' law, power law.
the mechanism: Yule process, Preferential attachment
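A rank-frequency plot like the one on this slide can be reproduced for any tokenized collection. The sketch below ranks terms by collection frequency and fits a least-squares line in log-log space; under Zipf's law the slope should be close to -1. The function names and the synthetic data are assumptions for this example, and real text only follows the law approximately.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, collection frequency) pairs, most frequent term first."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(points):
    """Least-squares slope of log10(cf) against log10(rank)."""
    xs = [math.log10(rank) for rank, _ in points]
    ys = [math.log10(cf) for _, cf in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


print(rank_frequency("a b a c a b".split()))

# Synthetic frequencies that follow cf = 1000 / rank exactly.
ideal = [(rank, 1000.0 / rank) for rank in range(1, 51)]
print(round(loglog_slope(ideal), 6))
```

To check a real corpus, pass its token list to `rank_frequency` and the result to `loglog_slope`.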
IIR 06: Scoring, term weighting and the vector space model
Ranking search results
Boolean queries only give inclusion or exclusion of documents.
For ranked retrieval, we measure the proximity between the query and each document.
One formalism for doing this: the vector space model
Key challenge in ranked retrieval: evidence accumulation for a term in a document
1 vs. 0 occurrences of a query term in the document
3 vs. 2 occurrences of a query term in the document
Usually: more is better
But by how much?
Need a scoring function that translates frequency into score or weight
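One common answer to "by how much?" is log-scaled term frequency, w = 1 + log10(tf) for tf > 0 and 0 otherwise, combined with cosine similarity as the proximity measure. The sketch below uses that weighting as an assumption; it is one choice among several and leaves out idf. The function names are illustrative.

```python
import math
from collections import Counter


def log_tf(tf):
    """Log-scaled term frequency: 0 for absent terms, 1 + log10(tf) otherwise."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def weight_vector(terms):
    """Sparse vector mapping each term to its log-scaled tf weight."""
    return {term: log_tf(tf) for term, tf in Counter(terms).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


query = weight_vector(["brutus", "caesar"])
doc1 = weight_vector("brutus caesar caesar".split())
doc2 = weight_vector("calpurnia caesar".split())
print(cosine(query, doc1), cosine(query, doc2))
```

Under this weighting, going from 0 to 1 occurrence adds a full point of weight, while going from 2 to 3 adds much less, which matches the intuition that more is better but with diminishing returns.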
Language models
assign a probability to a sequence of n words by means of a probability distribution.
How to compute this joint probability:
$$P(\text{its}, \text{water}, \text{is}, \text{so}, \text{transparent}, \text{that}) \tag{1}$$
$$P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i)\ ? \tag{2}$$
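The product in (2) equals the joint probability only if the words are assumed to be independent, which is the unigram model. In general, the chain rule conditions each word on all the words before it. A minimal sketch of the unigram case, with maximum-likelihood estimates from a toy token list and no smoothing, follows. The function names and the sample data are assumptions for this example.

```python
from collections import Counter


def train_unigram(tokens):
    """Maximum-likelihood unigram model: P(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def sequence_probability(model, words):
    """P(w1 ... wn) as the product of P(wi), i.e. assuming independence."""
    prob = 1.0
    for word in words:
        prob *= model.get(word, 0.0)  # unseen words get probability 0
    return prob


model = train_unigram("a b a c".split())
print(sequence_probability(model, ["a", "b"]))
print(sequence_probability(model, ["b", "a"]))
```

Both calls give the same value, which shows what the independence assumption throws away: word order. An unseen word drives the whole product to zero, which is why practical models add smoothing.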
Text classification & Naive Bayes
Text classification = assigning documents automatically to predefined classes
Examples:
positive/negative reviews
spam
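As an illustration of the positive/negative review example, here is a small multinomial Naive Bayes classifier with add-one smoothing, written from scratch. The training data, function names, and the choice to skip words never seen in training are assumptions for this sketch.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns the counts NB needs."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, term_counts, vocab


def classify_nb(model, tokens):
    """Return the class with the highest log prior plus log likelihood."""
    class_doc_counts, term_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_doc_counts):
        score = math.log(class_doc_counts[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (term, class) pairs above zero.
            likelihood = (term_counts[label][token] + 1) / (total_terms + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("great fun film".split(), "positive"),
    ("great acting".split(), "positive"),
    ("boring dull film".split(), "negative"),
    ("dull plot".split(), "negative"),
]
model = train_nb(training)
print(classify_nb(model, "great film".split()))
print(classify_nb(model, "dull film".split()))
```

Scores are summed in log space so that long documents do not underflow to zero.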
Support vector machines
large margin around decision boundary
clustering
flat clustering
Hierarchical clustering
Single-link and complete-link clustering
Centroid and group-average agglomerative clustering (GAAC)
Bisecting K-means
How to label clusters automatically
Latent Semantic Indexing
how to find semantically related documents?
matrix decomposition
SVD
Crawling
Surface web and deep web crawling
Sampling the hidden web
Link analysis / PageRank
which web page is more important?
who is in a community?
PageRank and HITS algorithms
graph analysis and mining. Modularity maximization algorithms.
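To make "which web page is more important?" concrete, here is a small power-iteration sketch of PageRank. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the three-page graph are all assumptions chosen for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps page -> list of pages it links to; returns page -> rank."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page receives the teleportation share first.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print({page: round(score, 3) for page, score in ranks.items()})
```

In this toy graph, page c is linked from both a and b and ends up with the highest rank, while b, which nobody links to, keeps only the teleportation share.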
Sentiment analysis
Examples
Movie: is this review positive or negative?
Products: what do people think about the new iPhone?
Public sentiment: how is consumer confidence? Is despair increasing?
Politics: what do people think about this candidate or issue?
Prediction: predict election outcomes or market trends from sentiment
Many other names
Opinion extraction
Opinion mining
Sentiment mining
Subjectivity analysis
Outline
1 what is IR
2 course schedule
3 grading scheme
project
crawling
Twitter
high-quality CS papers in Google Scholar
clustering and classification
marking scheme
project: 50%
project presentation at the end of the term
project report
final exam: 40%
cover lecture materials
class participation: 10%
be active in class
open source search engines
Lemur
C++
used by IR researchers
compare advanced IR techniques
Lucene
Java-based
relatively simple IR techniques
Galago
Java-based
used by the book [SE] Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.