60-538: Information Retrieval

September 4, 2014

Source: jlu.myweb.cs.uwindsor.ca/538/538intro.pdf

what is IR · course schedule · grading scheme

Outline

1 what is IR

2 course schedule

3 grading scheme

IR not long ago

now IR is mostly about search engines

there are many search engines ...

IR is more than web search

These days we frequently think first of web search, but there are many other cases:

digital library search

e-mail search, searching your desktop and laptop computers

corporate knowledge bases, local business search, expert search

legal information retrieval, patent search

news search

image and video search

(micro-)blog search

product search, federated search

social search, community Q&A, question-answering

recommender systems

opinion mining

definition of information retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). From the IIR book.

Structured vs. unstructured data

[Figure: structured vs. unstructured data, in the 90's and today]

other definitions

Information retrieval (IR) is the science and practice of matching information seekers with the information they seek.

Gerard Salton, 1968:

Information retrieval is a field concerned with the structure, analysis, organization, storage, and retrieval of information.

The search task

Given a query and a corpus, find relevant items

query: the user's expression of their information need

corpus: a repository of retrievable items

relevance: satisfaction of the user's information need

Why is IR fascinating?

Information retrieval is an uncertain process

users don’t know what they want

users don’t know how to convey what they want

computers can’t elicit information like a librarian

computers can’t understand natural language text

the search engine can only guess what is relevant

the search engine can only guess if a user is satisfied

over time, we can only guess how users adjust their short- and long-term behavior

classic search model

A query is an impoverished description of the user's information need

Highly ambiguous to anyone other than the user

Retrieval Model

A formal method that predicts the degree of relevance of a document to a query

taxonomy of IR models

Document property: text, links, multimedia

IR models: Boolean, vector, probabilistic

Semistructured text: proximal nodes, XML-based

Web: PageRank, hubs and authorities (HITS)

Multimedia: image retrieval, audio, video

Set theoretic: fuzzy, extended Boolean, set-based

Algebraic: generalized vector, LSI, NN

Probabilistic: BM25, language models, Bayesian networks

Boolean Retrieval Model

The user describes their information need using Boolean constraints (e.g., AND, OR, and AND NOT)

The burden is on the user to formulate a good boolean query

Example

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

One choice: use the grep command in Unix.

grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia

Why is that not the answer?

Slow (for large corpora)

NOT Calpurnia is non-trivial

Other operations (e.g., find the word Romans near countrymen) are not feasible

Ranked retrieval (best documents to return) is not supported

so we need to index the text

what is an index

index construction process

Initial stages of text processing

Tokenization: cut character sequence into word tokens

Deal with "John's", a state-of-the-art solution

Normalization: map text and query term to same form

You want U.S.A. and USA to match

Stemming: we may wish different forms of a root to match

authorize, authorization

Stop words: we may omit very common words (or not)

the, a, to, of
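
The four stages can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production pipeline: the function names are made up for this example, the stop list holds only the four words from the slide, and the stemmer is a crude suffix table rather than a real algorithm such as Porter's.

```python
import string

STOP_WORDS = {"the", "a", "to", "of"}  # only the slide's examples; real lists are longer

# Hypothetical, deliberately tiny suffix table (not the Porter stemmer).
SUFFIX_RULES = [("ization", "ize"), ("izing", "ize"), ("ized", "ize")]

def tokenize(text):
    """Split on whitespace and trim punctuation at token edges only,
    so "John's" and "state-of-the-art" survive as single tokens."""
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok]

def normalize(token):
    """Map variants to one form: lowercase and drop periods (U.S.A. -> usa)."""
    return token.lower().replace(".", "")

def stem(token):
    """Rewrite a known suffix so different forms of a root match."""
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix):
            return token[: -len(suffix)] + replacement
    return token

def preprocess(text, drop_stop_words=True):
    """Run tokenization, normalization, stemming and optional stop-word removal."""
    terms = []
    for token in tokenize(text):
        term = stem(normalize(token))
        if drop_stop_words and term in STOP_WORDS:
            continue
        terms.append(term)
    return terms
```

The same `preprocess` function would have to be applied to both documents and queries, otherwise terms such as U.S.A. and USA would not meet in the index.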

postings

Multiple term entries in a single document are merged.

Split into Dictionary and Postings.

Document frequency information is added.
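
A minimal sketch of that step, assuming each document has already been reduced to a list of terms and fits in memory. The name `build_index` and the dict-based layout are illustrative choices, not something the slides prescribe.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict mapping docID -> list of terms.

    Returns (dictionary, postings):
      dictionary maps term -> document frequency,
      postings   maps term -> sorted list of docIDs.
    """
    doc_sets = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            # A set merges multiple entries of a term in a single document.
            doc_sets[term].add(doc_id)
    postings = {term: sorted(ids) for term, ids in doc_sets.items()}
    # Document frequency = length of the postings list.
    dictionary = {term: len(ids) for term, ids in postings.items()}
    return dictionary, postings
```

Keeping each postings list sorted by docID is what makes the linear-time merge used in query processing possible.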

query processing

Consider processing the query:

Brutus AND Caesar

Locate Brutus in the Dictionary;

Retrieve its postings.

Locate Caesar in the Dictionary;

Retrieve its postings.

Merge the two postings (intersect the document sets):

brutus 1 2 4 11 31 45 173 174

caesar 1 2 4 5 6 16 57 132
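
The merge step can be sketched as a two-pointer walk over the two sorted lists. The function name `intersect` is an illustrative choice; the postings are the ones shown above.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID.

    Each step advances at least one pointer, so the cost is
    O(len(p1) + len(p2)).
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # [1, 2, 4]
```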

tentative schedule

boolean model

text transformation

surface web and deep web crawling. near-duplicate detection

build a search engine using Lucene

statistical properties and laws of language

vector space model

evaluation

link analysis and PageRank

clustering and classification. Naive Bayes, SVM, LSI.

sentiment analysis

Text Book

IIR Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008. Book website.

Other reference books:

SE Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.

MIR Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.

MMD Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman, 2013.

IIR 02: The term vocabulary and postings lists

Phrase queries: “Stanford University”

Proximity queries: Gates near Microsoft

We need an index that captures position information for phrase queries and proximity queries.
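
One way such an index could look, as a sketch: each term maps to the documents it occurs in, and each document to the list of positions. The function names and the brute-force position comparison are assumptions made for brevity; a real system would merge the position lists in linear time.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: dict docID -> list of terms. Returns term -> {docID: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term][doc_id].append(pos)
    return index

def phrase(index, t1, t2):
    """docIDs in which t2 occurs immediately after t1 (two-word phrase query)."""
    hits = []
    docs1, docs2 = index.get(t1, {}), index.get(t2, {})
    for doc_id in sorted(docs1.keys() & docs2.keys()):
        following = set(docs2[doc_id])
        if any(pos + 1 in following for pos in docs1[doc_id]):
            hits.append(doc_id)
    return hits

def near(index, t1, t2, k):
    """docIDs in which t1 and t2 occur within k positions of each other."""
    hits = []
    docs1, docs2 = index.get(t1, {}), index.get(t2, {})
    for doc_id in sorted(docs1.keys() & docs2.keys()):
        # Brute-force comparison of all position pairs (fine for a sketch).
        if any(abs(p1 - p2) <= k for p1 in docs1[doc_id] for p2 in docs2[doc_id]):
            hits.append(doc_id)
    return hits
```

A plain docID-only postings list cannot answer either query, because it records that both words occur in a document but not where.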

IIR 04: Index construction

[Figure: distributed index construction. A master assigns splits to parsers (map phase); each parser writes segment files partitioned by term range (a-f, g-p, q-z); the master assigns each partition to an inverter (reduce phase), which produces the postings.]
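
The data flow in this figure (parsers in a map phase writing term-partitioned segment files, inverters in a reduce phase turning each partition into postings) can be imitated in a single process. This is only a sequential sketch: the function names are hypothetical, there is no master or parallelism, and the "other" partition for terms outside a-z is an assumption of this example.

```python
from collections import defaultdict

PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]  # term ranges from the figure

def partition_of(term):
    """Name of the term-range partition a term belongs to."""
    for lo, hi in PARTITIONS:
        if lo <= term[0] <= hi:
            return f"{lo}-{hi}"
    return "other"  # assumption: catch-all for terms not starting with a-z

def parse(split):
    """Map phase: turn one split of (docID, text) pairs into segment files,
    i.e. lists of (term, docID) pairs keyed by partition."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments

def invert(pairs):
    """Reduce phase: sort one partition's pairs and group them into postings."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)

def build(splits):
    """Run every parser, then one inverter per partition, sequentially."""
    segment_files = [parse(split) for split in splits]
    names = {name for segs in segment_files for name in segs}
    index = {}
    for name in names:
        pairs = [pair for segs in segment_files for pair in segs.get(name, [])]
        index.update(invert(pairs))
    return index
```

Because partitions are disjoint term ranges, the inverters never touch the same term, which is what lets the reduce phase run independently per partition.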

statistical properties of text

[Figure: log-log plot of collection frequency against rank; x-axis: log10 rank, y-axis: log10 cf]

Zipf's law, Heaps' law, power law.

the mechanism: Yule process, Preferential attachment
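
For reference, the two laws are usually written as follows. These are the standard textbook forms rather than anything stated on the slide: $\mathrm{cf}_i$ is the collection frequency of the $i$-th most frequent term, $M$ the vocabulary size, $T$ the number of tokens in the collection, and $c$, $k$, $b$ are collection-dependent constants.

```latex
\mathrm{cf}_i \propto \frac{1}{i}
\quad\Longleftrightarrow\quad
\log_{10} \mathrm{cf}_i = \log_{10} c - \log_{10} i
\qquad \text{(Zipf's law)}

M = k \, T^{b}
\qquad \text{(Heaps' law)}
```

On a plot of log10 cf against log10 rank, Zipf's law therefore predicts a straight line with slope -1; both laws are instances of a power law.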

IIR 06: Scoring, term weighting and the vector space model

Ranking search results

Boolean queries only give inclusion or exclusion of documents.

For ranked retrieval, we measure the proximity between the query and each document.

One formalism for doing this: the vector space model

Key challenge in ranked retrieval: evidence accumulation for a term in a document

1 vs. 0 occurrences of a query term in the document

3 vs. 2 occurrences of a query term in the document

Usually: more is better

But by how much?

Need a scoring function that translates frequency into score or weight
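
One common answer to "by how much?" is log-scaled term frequency, combined with cosine similarity between query and document vectors. The sketch below assumes that particular weighting and leaves out idf; the function names are illustrative.

```python
import math
from collections import Counter

def tf_weight(tf):
    """Log-scaled term frequency: 0 stays 0, and each extra
    occurrence adds less than the previous one."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def to_vector(terms):
    """Sparse vector of tf weights, stored as term -> weight."""
    return {term: tf_weight(tf) for term, tf in Counter(terms).items()}

def cosine(v1, v2):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def rank(query_terms, docs):
    """docs: dict docID -> list of terms. Returns docIDs, best match first."""
    q = to_vector(query_terms)
    scores = {doc_id: cosine(q, to_vector(terms)) for doc_id, terms in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With this weighting the step from 0 to 1 occurrence is worth 1.0, while the step from 2 to 3 is worth about 0.18, which matches the intuition that more is better but with diminishing returns.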

Language models

assign a probability to a sequence of n words by means of a probability distribution.

How to compute this joint probability:

$$P(\text{its}, \text{water}, \text{is}, \text{so}, \text{transparent}, \text{that}) \tag{1}$$

$$P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i)\ ? \tag{2}$$
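
The product of single-word probabilities is exact only if the words are assumed independent (a unigram model); in general the chain rule conditions each word on all the words before it. A minimal unigram sketch, assuming maximum-likelihood estimates from a toy corpus and no smoothing (function names are illustrative):

```python
from collections import Counter

def train_unigram(tokens):
    """Maximum-likelihood unigram model: P(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def sequence_probability(model, words):
    """Product of per-word probabilities under the independence assumption.
    Unseen words get probability 0 because no smoothing is applied."""
    prob = 1.0
    for word in words:
        prob *= model.get(word, 0.0)
    return prob
```

Under this model "its water" and "water its" receive the same probability, which shows what the independence assumption throws away: word order.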

Text classification & Naive Bayes

Text classification = assigning documents automatically to predefined classes

Examples:

positive/negative reviews

spam
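
A compact sketch of a multinomial Naive Bayes classifier for such a task, assuming documents are already term lists and using add-one (Laplace) smoothing. The function names and the tiny review data in the usage note are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (list_of_terms, class_label) pairs."""
    class_counts = Counter()            # number of documents per class
    term_counts = defaultdict(Counter)  # term occurrences per class
    vocab = set()
    for terms, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(terms)
        vocab.update(terms)
    return class_counts, term_counts, vocab

def classify_nb(model, terms):
    """Return the class with the highest log prior + summed log likelihoods."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / n_docs)  # log prior
        total = sum(term_counts[label].values())
        for term in terms:
            if term in vocab:  # terms never seen in training are ignored
                # add-one smoothing keeps unseen (term, class) pairs above zero
                score += math.log((term_counts[label][term] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, after training on `(["great", "fun", "great"], "positive")` and `(["boring", "bad"], "negative")`, the call `classify_nb(model, ["great"])` picks the class whose training documents contained "great" more often.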

Support vector machines

large margin around decision boundary

clustering

flat clustering

Hierarchical clustering

Single-link and complete-link clustering

Centroid and group-average agglomerative clustering (GAAC)

Bisecting K-means

How to label clusters automatically

Latent Semantic Indexing

how to find semantically related documents?

matrix decomposition

SVD
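
In the usual formulation (stated here as background, with symbols chosen for this note), the term-document matrix $C$ is factored by SVD and then truncated to its $k$ largest singular values, where $k$ is a parameter chosen by the user:

```latex
C = U \Sigma V^{T},
\qquad
C_k = U_k \Sigma_k V_k^{T}
```

Here $U_k$ and $V_k$ keep the first $k$ columns of $U$ and $V$, and $\Sigma_k$ keeps the $k$ largest singular values. Documents are then compared in the resulting $k$-dimensional space, where documents that share few literal terms can still end up close together.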

Crawling

Surface web and deep web crawling

Sampling the hidden web

Link analysis / PageRank

which web page is more important?

who is in a community?

PageRank and HITS algorithms

graph analysis and mining. Modularity maximization algorithms.
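
A small power-iteration sketch of PageRank, to make "which web page is more important?" concrete. The damping factor 0.85, the fixed iteration count, and the choice to spread a dangling page's rank evenly are common conventions assumed here, and the function name is illustrative.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: dict mapping page -> list of pages it links to.
    Returns dict page -> score; scores sum to 1."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small share from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank
```

A page scores highly when many pages, or a few high-scoring pages, link to it; a page with no in-links keeps only its random-jump share.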

Sentiment analysis

Examples

Movie: is this review positive or negative?

Products: what do people think about the new iPhone?

Public sentiment: how is consumer confidence? Is despair increasing?

Politics: what do people think about this candidate or issue?

Prediction: predict election outcomes or market trends from sentiment

Many other names

Opinion extraction

Opinion mining

Sentiment mining

Subjectivity analysis

project

crawling

Twitter

high-quality CS papers in Google Scholar

clustering and classification

marking scheme

project: 50%

project presentation at the end of the term

project report

final exam: 40%

covers lecture materials

class participation: 10%

be active in class

open source search engines

Lemur

C++

used by IR researchers

compare advanced IR techniques

Lucene

Java-based

relatively simple IR techniques

Galago

Java-based

used by the book [SE] Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.