Page 1: Ir

AG Corporate Semantic Web

Freie Universität Berlin

http://www.inf.fu-berlin.de/groups/ag-csw/

Information Retrieval (IR)

Mohammed Al-Mashraee

Corporate Semantic Web (AG-CSW)

Institute for Computer Science, Freie Universität Berlin

[email protected]

http://www.inf.fu-berlin.de/groups/ag-csw/

Page 2: Ir


Agenda

Introduction / Motivation

Data structures and general representations

IR Definition

IR Models: Set theoretic / Boolean

Weighting Methods

Algebraic / Vector

IR Evaluation

Page 3: Ir


Introduction

IR System

(Figure: a query string and a document corpus are fed into the IR system,
which returns a ranked list of documents: 1. Doc1, 2. Doc2, 3. Doc3, ...)

Page 4: Ir


Introduction - IR Tasks

Given:
• A corpus of textual natural-language documents.
• A user query in the form of a textual string.

Find:
• A ranked set of documents that are relevant to the query.

Page 5: Ir


Introduction

Motivation

These days we frequently think first of web

search, but there are many other cases:

• E-mail search

• Searching your laptop

• Corporate knowledge bases

• Legal information retrieval

Page 6: Ir


Introduction - Motivation

Unstructured (text) vs. structured (database) data

in the mid-nineties

Page 7: Ir


Introduction - Motivation

Unstructured (text) vs. structured (database) data today

Page 8: Ir

Data structures and

general representations

Page 9: Ir


Data structures and representations

[Almashraee, 2013]

Data representation is the process of organizing data so that it can be
accessed and manipulated efficiently. There are different structures:

Database schema structure

Semi-structured data

Semantic representation / RDF representation

Feature vector representation

Page 10: Ir


Data structures and representations

Database schema structure
A structured data representation. Provides convenient access to and
manipulation of the data stored in its schema, e.g., Oracle, MySQL, etc.

Semi-structured data representation
Allows direct storage and manipulation of the data, but offers limited
querying ability, e.g., XML.

Page 11: Ir


Data structures and representations

Semantic representation / RDF representation
A more recent, structured data representation. It is typically used by
Semantic Web applications to interpret their information and store it as
triples in (Subject, Predicate, Object) format.

Page 12: Ir


Data structures and representations

Feature vector representation
The most commonly used representation, in which extracted features are
presented as a vector. This representation allows different methods (such as
information retrieval, support vector machines, Naïve Bayes, association rule
mining, decision trees, hidden Markov models, maximum entropy models, etc.)
to build useful models for solving related problems.

Page 13: Ir


IR vs. databases:

Structured vs unstructured data

Structured data tends to refer to information

in “tables”

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match

(for text) queries, e.g.,

Salary < 60000 AND Manager = Smith

Page 14: Ir


Unstructured data

Typically refers to free text

Allows

o Keyword queries including operators

o More sophisticated “concept” queries e.g.,

• find all web pages dealing with drug abuse

Classic model for searching text documents

Page 15: Ir


More General

Structured vs. Unstructured data

Search vs. Discovery

Page 16: Ir


Definition

[Manning et al. 2008]

Information Retrieval (IR)

Finding material (usually documents) of an

unstructured nature (usually text) that satisfies

an information need from within large

collections (usually stored on computers)

Page 17: Ir


Representation of Documents

[Paschke notes]

Set of terms T = {t1,…,tn}

Each document dj is represented as a vector

of weighted terms:

dj=(w1,j,…,wn,j)

wi,j is a weight for the term ti in the

document dj

Set of documents D

Similarity measure sim describes the

similarity of a document to the query

Page 18: Ir

IR Models

Page 19: Ir


IR Models

Boolean models/Set theoretic

Vector space models (Statistical/Algebraic)

Probabilistic models

Page 20: Ir

Boolean models/

Set theoretic

Page 21: Ir


Set theoretic / Boolean Retrieval

Documents are represented as vectors of index terms:
true if the term occurs in the document, false otherwise.
The weight wi,j is 0 or 1 (a Boolean truth weight),
interpreted as a Boolean variable.

Queries are represented as Boolean expressions:
terms are queries, and
(q1 AND q2), (q1 OR q2), (NOT q1) are queries.

A document is relevant if the query expression and the
document's term assignment together evaluate to true.

The similarity measure is also Boolean.

Page 22: Ir


Example 1: Set theoretic/Boolean Retrieval

T = ("today", "is", "Monday", "lecture", "no")

Documents: d1: "today is Monday", d2: "today is lecture", d3: "Monday is lecture"

      today  is  Monday  lecture  no
d1      1     1     1       0      0
d2      1     1     0       1      0
d3      0     1     1       1      0

q:    is   Monday AND lecture   today OR Monday   NOT lecture
d1     1            0                  1               1
d2     1            0                  1               0
d3     1            1                  1               0
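A minimal sketch in Python (not part of the original slides) that reproduces Example 1: it builds the binary incidence vectors and evaluates the Boolean queries from the table above. The document strings and helper names are illustrative assumptions.

```python
# Boolean retrieval over the binary incidence vectors of Example 1.
docs = {
    "d1": "today is Monday",
    "d2": "today is lecture",
    "d3": "Monday is lecture",
}
vocab = ["today", "is", "Monday", "lecture", "no"]

# Binary incidence vectors: 1 if the term occurs in the document, else 0.
incidence = {
    name: {t: int(t.lower() in text.lower().split()) for t in vocab}
    for name, text in docs.items()
}

def AND(t1, t2, d): return incidence[d][t1] & incidence[d][t2]
def OR(t1, t2, d):  return incidence[d][t1] | incidence[d][t2]
def NOT(t, d):      return 1 - incidence[d][t]

for d in docs:  # reproduces the last three columns of the query table above
    print(d, AND("Monday", "lecture", d), OR("today", "Monday", d), NOT("lecture", d))
```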

Page 23: Ir


Example 2: Set theoretic/Boolean Retrieval

Binary incidence matrix:

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                   1             0          0       0        1
Brutus              1                   1             0          1       0        0
Caesar              1                   1             0          1       1        1
Calpurnia           0                   1             0          0       0        0
Cleopatra           1                   0             0          0       0        0
mercy               1                   0             1          1       1        1
worser              1                   0             1          1       1        0

Page 24: Ir


Example 2: Set theoretic/Boolean Retrieval (continued)

Binary incidence matrix (same as on the previous slide).

A document is represented by a binary vector ∈ {0, 1}^|V|.
The size of the vector depends on the size of the vocabulary (dictionary), |V|.

Page 25: Ir

Weighting Schemes

Page 26: Ir


Weighting Schemes

Weighting schemes are used to score documents with respect to a particular
query in order to rank the documents returned:

• Bag-of-words model

• Term frequency (tf) model

• Document frequency (df)

• Inverse document frequency (idf)

• Term frequency – Inverse document frequency

(tf-idf)

Page 27: Ir


Weighting Schemes

Term-document count matrices
• The number of occurrences of a term in a document is considered.
• A document is represented by a vector of natural numbers ∈ ℕ^|V|.
• The size of the vector depends on the size of the vocabulary (dictionary), |V|.

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             157                  73            0          0       0        0
Brutus               4                 157            0          1       0        0
Caesar             232                 227            0          2       1        1
Calpurnia            0                  10            0          0       0        0
Cleopatra           57                   0            0          0       0        0
mercy                2                   0            3          5       5        1
worser               2                   0            1          1       1        0

Page 28: Ir


Bag of words model

Vector representation doesn’t consider

the ordering of words in a document

E.g.,

John is quicker than Mary

and

Mary is quicker than John

have the same vectors
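As a quick illustration (a sketch, not from the slides), the two sentences above yield identical bag-of-words count vectors once word order is discarded:

```python
# Bag-of-words: count vectors ignore word order.
from collections import Counter

s1 = "John is quicker than Mary"
s2 = "Mary is quicker than John"

vocab = sorted(set(s1.lower().split()) | set(s2.lower().split()))
vec = lambda s: [Counter(s.lower().split())[t] for t in vocab]

print(vec(s1) == vec(s2))  # True: the two sentences have the same vector
```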

Page 29: Ir


Term Frequency (tf) weighting

The term frequency tft,d of term t in document d is

defined as the number of times that t occurs in d.

tf is used to compute the query-document match scores.

Raw term frequency is not what we want: • A document with 10 occurrences of the term is more

relevant than a document with 1 occurrence of the term.

• But not 10 times more relevant.

Relevance does not increase proportionally with term

frequency (relevance goes up but not linearly).

Frequency in IR denotes the count of a word in the

document

Page 30: Ir


Log-frequency weighting

The log frequency weight of term t in document d is:

w_t,d = 1 + log10(tf_t,d)  if tf_t,d > 0,  and 0 otherwise

0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Score for a document-query pair: sum over terms t in both query q and
document d:

Score(q,d) = Σ_{t ∈ q∩d} (1 + log10 tf_t,d)

The score is 0 if none of the query terms is present in the document.
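A small sketch of the log-frequency weight and the resulting query-document score as defined above; the toy term counts passed in are illustrative assumptions.

```python
import math

def log_freq_weight(tf):
    """w_t,d = 1 + log10(tf_t,d) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    """Sum of log-frequency weights over terms occurring in both query and document."""
    return sum(log_freq_weight(doc_tf.get(t, 0)) for t in query_terms)

print(log_freq_weight(10))                                      # 2.0, as on the slide
print(score({"monday", "lecture"}, {"monday": 3, "today": 1}))  # 1 + log10(3) ≈ 1.48
```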

Page 31: Ir


Inverse Document Frequency (idf)

Another score used for ranking the matches of documents

to a query

Idea: Terms that appear in many different documents are

less indicative of overall topic

Rare terms are more informative than frequent terms.

df_t is the document frequency of t: the number of documents that contain t
• df_t is an inverse measure of the informativeness of t
• df_t ≤ N

We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.

Page 32: Ir


Inverse Document Frequency (idf)

We define the idf (inverse document frequency) of t by

idf_t = log10(N / df_t)

df_t = document frequency of term t
     = number of documents containing term t

idf_t = inverse document frequency of term t

(N: total number of documents)

Page 33: Ir


Inverse Document Frequency (idf)

Example:

• Suppose N = 1 million (total number of documents), and idf_t = log10(N/df_t):

term        df_t        idf_t
calpurnia           1     6
animal            100     4
sunday          1,000     3
fly            10,000     2
under         100,000     1
the         1,000,000     0
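The idf values in the table can be re-computed in a few lines (a sketch assuming idf_t = log10(N/df_t) with N = 1,000,000, as above):

```python
import math

N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, dft in df.items():
    print(term, math.log10(N / dft))  # 6, 4, 3, 2, 1, 0
```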

Page 34: Ir


Tf-idf Weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

Score(q,d) = Σ_{t ∈ q∩d} tf-idf_t,d

A term occurring frequently in the document but rarely in the rest of the
collection is given high weight.

Experimentally, tf-idf has been found to work well and is the best known
weighting scheme in information retrieval:

• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
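A minimal sketch of this tf-idf weight, combining the log-frequency tf component with the log10 idf component defined earlier; the sample counts in the call are illustrative.

```python
import math

def tf_idf(tf, df, N):
    """w_t,d = (1 + log10 tf_t,d) * log10(N / df_t); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf(tf=10, df=100, N=1_000_000))  # (1 + 1) * 4 = 8.0
```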

Page 35: Ir


Tf-idf Weighting

Binary → count → weight matrix

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            5.25                3.18            0         0        0       0.35
Brutus            1.21                6.1             0         1        0       0
Caesar            8.59                2.54            0         1.51     0.25    0
Calpurnia         0                   1.54            0         0        0       0
Cleopatra         2.85                0               0         0        0       0
mercy             1.51                0               1.9       0.12     5.25    0.88
worser            1.37                0               0.11      4.15     0.25    1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Page 36: Ir


Tf-idf Weighting

Example

Given a document containing terms with given frequencies:

A(3), B(2), C(1)

Assume collection contains 10,000 documents and

document frequencies of these terms are:

A(50), B(1300), C(250)

Then (tf here is the term count divided by the maximum term count in the document, and idf = log2(N/df)):

A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6

B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0

C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
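The example can be checked with a short script (a sketch assuming, as above, tf normalized by the maximum count and idf with log base 2):

```python
import math

N = 10_000
counts = {"A": 3, "B": 2, "C": 1}     # term frequencies in the document
df = {"A": 50, "B": 1_300, "C": 250}  # document frequencies in the collection
max_count = max(counts.values())

for term, c in counts.items():
    tf = c / max_count
    idf = math.log2(N / df[term])
    print(term, round(idf, 1), round(tf * idf, 1))  # A: 7.6, 7.6  B: 2.9, 2.0  C: 5.3, 1.8
```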

Page 37: Ir

Vector space model

Page 38: Ir


Vector space model

Documents are represented as vectors

Queries are also represented as vectors

Now we have a |V|-dimensional vector space

Terms are axes of the space

Documents are points or vectors in this space

(Figure: D1, D2, and the query Q drawn as vectors in a three-dimensional term
space with axes T1, T2, T3.)

Example:

D1 = 2T1 + 3T2 + 5T3

D2 = 3T1 + 7T2 + T3

Q = 0T1 + 0T2 + 2T3

Page 39: Ir


Vector space model

How to measure the similarity between

Di and Q?

Page 40: Ir


Vector space model

Similarity Measure Documents are ranked according to their

proximity (similarity) to the query in a given

space.

A similarity measure is a function that

computes the degree of similarity between

two vectors.

• Scalar Product (Inner Product)

• Cosine measure

Page 41: Ir


Vector space model

Similarity measure

Inner Product

Similarity between the vectors for document dj and query q can be computed as
the vector inner product (dot product), rather than as the distance between
the end points of the two vectors:

sim(dj, q) = dj • q = Σ_{i=1}^{t} w_i,j × w_i,q

where w_i,j is the weight of term i in document j and w_i,q is the weight of
term i in the query.

Example:
D1 = 2T1 + 3T2 + 5T3,  D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
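A sketch of the inner-product similarity, reproducing the example above with D1, D2, and Q written as plain weight vectors over (T1, T2, T3):

```python
def inner_product(d, q):
    """sim(d, q) = sum_i w_i,d * w_i,q."""
    return sum(wd * wq for wd, wq in zip(d, q))

D1 = [2, 3, 5]   # 2*T1 + 3*T2 + 5*T3
D2 = [3, 7, 1]   # 3*T1 + 7*T2 + 1*T3
Q  = [0, 0, 2]   # 0*T1 + 0*T2 + 2*T3

print(inner_product(D1, Q))  # 10
print(inner_product(D2, Q))  # 2
```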

Page 42: Ir


Vector space model

Why Inner Product is not a good solution

for vector similarities

The Euclidean distance between q and

d2 is large even though the distribution

of terms in the query q and the

distribution of terms in the document

d2 are very similar

Page 43: Ir


Vector space model

Similarity measure

Cosine measure Compute the weight for D and Q

Normalize the length of vectors for D and Q

Compute the cosine similarity

Page 44: Ir


Vector space model

Cosine measure

Length normalization

• A vector can be length normalized by

dividing each of its components by its length

• Long and short documents now have

comparable weights

Page 45: Ir


Vector space model

Cosine measure

Page 46: Ir


Vector space model

Cosine measure

Cosine similarity

cos(q, d) = (q • d) / (|q| |d|)
          = (q/|q|) • (d/|d|)
          = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) × √(Σ_{i=1}^{|V|} d_i²) )

q • d is the dot product; q/|q| and d/|d| are unit vectors.

q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document

cos(q,d) is the cosine similarity of q and d ... or, equivalently, the cosine
of the angle between q and d.
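A small sketch of this cosine similarity; the D1, D2, Q vectors reused from the earlier inner-product example are illustrative stand-ins for tf-idf weight vectors.

```python
import math

def cosine(q, d):
    """cos(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norm if norm else 0.0

print(cosine([2, 3, 5], [0, 0, 2]))  # ~0.81
print(cosine([3, 7, 1], [0, 0, 2]))  # ~0.13
```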

Page 47: Ir


Vector space model

Cosine for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product
(scalar product):

cos(q, d) = q • d = Σ_{i=1}^{|V|} q_i d_i     for q, d length-normalized.

Page 48: Ir


Cosine similarity amongst 3 documents

Example: How similar are the novels SaS (Sense and Sensibility), PaP (Pride
and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Note: To simplify this example, we don't do idf weighting.

[Manning et al. 2008, Sec. 6.3]

Page 49: Ir


3 documents example contd.

Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
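The whole three-novel computation can be reproduced with a short script (a sketch following the steps above: log-frequency weighting, length normalization, then a dot product):

```python
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_weights(tf_vec):
    return {t: (1 + math.log10(tf)) if tf > 0 else 0.0 for t, tf in tf_vec.items()}

def normalize(w):
    length = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / length for t, x in w.items()}

vecs = {name: normalize(log_weights(c)) for name, c in counts.items()}

def cos(a, b):
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```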

Page 50: Ir

Evaluation

Page 51: Ir


(Figure: the entire document collection partitioned by relevance and retrieval.)

                 retrieved                 not retrieved
relevant         retrieved & relevant      not retrieved but relevant
irrelevant       retrieved & irrelevant    not retrieved & irrelevant

Precision and Recall

Measure     Formula
Precision   TP / (TP + FP)
Recall      TP / (TP + FN)

Page 52: Ir


Precision and Recall

Precision

The ability to retrieve top-ranked documents that

are mostly relevant.

Recall

The ability of the search to find all of the relevant

items in the corpus.

Page 53: Ir


Precision and Recall

Example

Assume the following:

• A database contains 80 records on a particular topic

• A search was conducted on that topic and 60 records were retrieved.

• Of the 60 records retrieved, 45 were relevant.

Calculate the precision and recall scores for the search

Using the following designations:

• A = The number of relevant records retrieved,

• B = The number of relevant records not retrieved, and

• C = The number of irrelevant records retrieved.

In this example A = 45, B = 35 (80-45) and C = 15 (60-45).

Recall = (45 / (45 + 35)) * 100% => 45/80 * 100% = 56%

Precision = (45 / (45 + 15)) * 100% => 45/60 * 100% = 75%
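A tiny sketch re-computing these precision and recall values from the counts in the example:

```python
relevant_in_db = 80        # relevant records in the database
retrieved = 60             # records retrieved by the search
relevant_retrieved = 45    # A: true positives

false_negatives = relevant_in_db - relevant_retrieved  # B = 35
false_positives = retrieved - relevant_retrieved       # C = 15

recall = relevant_retrieved / (relevant_retrieved + false_negatives)
precision = relevant_retrieved / (relevant_retrieved + false_positives)

print(f"Recall:    {recall:.0%}")     # 56%
print(f"Precision: {precision:.0%}")  # 75%
```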


Page 54: Ir


Thank you.

Questions?

Page 55: Ir


References

Manning, Christopher, et al. Introduction to Information Retrieval, 2008.

Baeza-Yates, Ricardo, et al. Modern Information Retrieval, 1999.

Adrian Paschke, "Web Based Information Systems" - lecture notes, FU Berlin.

Rohit Kate, "Natural Language Processing" - lecture notes, University of Wisconsin-Milwaukee.
