Web- and Multimedia-based Information Systems Lecture 2

Transcript
Page 1: Web- and Multimedia-based Information Systems Lecture 2.

Web- and Multimedia-based Information Systems

Lecture 2

Page 2: Web- and Multimedia-based Information Systems Lecture 2.

Vector Model

Non-binary weights
Degree of similarity
Result ranking possible
Fast & good results

Page 3: Web- and Multimedia-based Information Systems Lecture 2.

Vector Model

Document Vector with weights for every index term

Query Vector with weights for every index term

The vectors have the dimension of the total number of index terms in the collection

Page 4: Web- and Multimedia-based Information Systems Lecture 2.

Documents in Vector Space

[Figure: documents D1 to D11 plotted as points in the vector space spanned by the index terms t1, t2 and t3]

Page 5: Web- and Multimedia-based Information Systems Lecture 2.

Vector Model

Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t

The weight of the term is stored in each position

D_i = (w_{d_{i,1}}, w_{d_{i,2}}, \ldots, w_{d_{i,t}})

Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t})

w = 0 if the term is absent from the document
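As an illustration of the vector representation above (not from the slides), a document or query can be held as a plain array of term weights indexed by term position; the class and method names here are hypothetical.

// Minimal sketch, assuming a fixed vocabulary of t index terms.
// A document (or query) is a weight vector of dimension t; absent terms keep weight 0.
public class WeightVector {
    private final double[] weights; // weights[k] = weight of index term k

    public WeightVector(int vocabularySize) {
        this.weights = new double[vocabularySize]; // Java initializes entries to 0.0
    }

    public void set(int termIndex, double weight) { weights[termIndex] = weight; }

    public double get(int termIndex) { return weights[termIndex]; }

    public int dimension() { return weights.length; }
}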

Page 6: Web- and Multimedia-based Information Systems Lecture 2.

Vector Model

Cosine of the angle between the vectors is taken as the similarity measure

Sorting/ranking of results
Threshold for results
More precise answers with more relevant documents at the top

Page 7: Web- and Multimedia-based Information Systems Lecture 2.

Similarity Function

sim(D_i, D_j) = \cos(D_i, D_j) = \frac{D_i \cdot D_j}{|D_i| \, |D_j|}

sim(D_i, D_j) = \frac{\sum_{k=1}^{t} w_{ik} \, w_{jk}}{\sqrt{\sum_{k=1}^{t} w_{ik}^2} \cdot \sqrt{\sum_{k=1}^{t} w_{jk}^2}}
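A minimal sketch of the cosine similarity reconstructed above, assuming both weight vectors share the same dimension t; the class and method names are illustrative, not from the lecture.

// Cosine similarity between two weight vectors of equal dimension:
// the dot product divided by the product of the vector lengths.
public class CosineSimilarity {
    public static double similarity(double[] di, double[] dj) {
        double dot = 0.0, normI = 0.0, normJ = 0.0;
        for (int k = 0; k < di.length; k++) {
            dot   += di[k] * dj[k];
            normI += di[k] * di[k];
            normJ += dj[k] * dj[k];
        }
        if (normI == 0.0 || normJ == 0.0) {
            return 0.0; // at least one vector contains no terms
        }
        return dot / (Math.sqrt(normI) * Math.sqrt(normJ));
    }
}

A similarity of 1 means the vectors point in the same direction; 0 means the two vectors share no weighted terms.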

Page 8: Web- and Multimedia-based Information Systems Lecture 2.

Vector Model Index Terms Weighting

Binary weights
Raw term weights
Term frequency x inverse document frequency

Page 9: Web- and Multimedia-based Information Systems Lecture 2.

Binary Weights

Only the presence (1) or absence (0) of a term is included in the vector

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1

Page 10: Web- and Multimedia-based Information Systems Lecture 2.

Raw Term Weights

The frequency of occurrence for the term in each document is included in the vector

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
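A sketch (not from the slides) of how both weighting schemes can be derived from a tokenized document, assuming a term-to-column mapping for the vocabulary; the names are illustrative.

import java.util.Map;

// Sketch: raw term weights are plain occurrence counts per vocabulary term;
// binary weights just record presence (1) or absence (0).
public class TermWeights {
    public static int[] rawWeights(String[] tokens, Map<String, Integer> vocabIndex) {
        int[] tf = new int[vocabIndex.size()];
        for (String token : tokens) {
            Integer k = vocabIndex.get(token); // column of this term, if it is in the vocabulary
            if (k != null) {
                tf[k]++;
            }
        }
        return tf;
    }

    public static int[] binaryWeights(int[] rawWeights) {
        int[] binary = new int[rawWeights.length];
        for (int k = 0; k < rawWeights.length; k++) {
            binary[k] = rawWeights[k] > 0 ? 1 : 0;
        }
        return binary;
    }
}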

Page 11: Web- and Multimedia-based Information Systems Lecture 2.

Term frequency x Inverse document frequency

w_{ik} = tf_{ik} \cdot \log(N / n_k)

where
– T_k = term k in document D_i
– tf_{ik} = frequency of term T_k in document D_i
– idf_k = inverse document frequency of term T_k in the collection C
– N = total number of documents in the collection C
– n_k = number of documents in C that contain T_k
– idf_k = \log(N / n_k)
– tf_{ik} = freq_{ik} / \max_l freq_{il} (frequency normalized by the most frequent term in D_i)
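A sketch of the tf-idf weighting as reconstructed above, assuming the raw frequencies are given as a documents-by-terms matrix and that tf is normalized by the most frequent term of each document; the class name and layout are assumptions.

// Sketch: w[i][k] = (freq[i][k] / maxFreq_i) * log10(N / n_k).
public class TfIdf {
    public static double[][] weights(int[][] freq) {
        int numDocs = freq.length;         // N: total number of documents
        int numTerms = freq[0].length;     // t: number of index terms (assumes at least one document)
        int[] docFreq = new int[numTerms]; // n_k: documents containing term k
        for (int[] doc : freq) {
            for (int k = 0; k < numTerms; k++) {
                if (doc[k] > 0) docFreq[k]++;
            }
        }
        double[][] w = new double[numDocs][numTerms];
        for (int i = 0; i < numDocs; i++) {
            int maxFreq = 0;
            for (int k = 0; k < numTerms; k++) {
                maxFreq = Math.max(maxFreq, freq[i][k]);
            }
            if (maxFreq == 0) continue; // empty document: all weights stay 0
            for (int k = 0; k < numTerms; k++) {
                if (freq[i][k] == 0) continue;
                double tf = (double) freq[i][k] / maxFreq;
                double idf = Math.log10((double) numDocs / docFreq[k]);
                w[i][k] = tf * idf;
            }
        }
        return w;
    }
}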

Page 12: Web- and Multimedia-based Information Systems Lecture 2.

IDF Example

IDF provides high values for rare words and low values for common words

(base-10 logarithm)

idf = \log(10000 / 1) = 4
idf = \log(10000 / 20) = 2.698
idf = \log(10000 / 5000) = 0.301
idf = \log(10000 / 10000) = 0

Page 13: Web- and Multimedia-based Information Systems Lecture 2.

Probabilistic Model

Based on probability
For every document, a probability is calculated for:
– the document being relevant to the query
– the document being irrelevant to the query
Documents that are more likely relevant than irrelevant are ranked in decreasing order of relevance

Page 14: Web- and Multimedia-based Information Systems Lecture 2.

Text Operations in Detail

Goal: automated generation of index terms
All terms conveying meaning vs. space requirements
Rules for extraction from documents (a sketch follows below)
– Rules for division of terms
  Punctuation
  Dashes
– List of stop words
  Articles, prepositions, conjunctions
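A minimal sketch of these extraction rules: split on punctuation, dashes and whitespace, lowercase the tokens, and drop stop words. The stop-word set here is a tiny placeholder, not the lecture's list.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch: candidate index terms are produced by splitting on everything
// that is not a letter or digit and removing a (placeholder) stop-word list.
public class TermExtractor {
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "and", "or", "of", "in", "to");

    public static List<String> extractTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}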

Page 15: Web- and Multimedia-based Information Systems Lecture 2.

Word-oriented Reduction Schemes

Lemmatisation
Smaller term lists
Generalization of terms
Methods
– Reduction to the infinitive
– Reduction to a stem

Algorithmic methods exist for English
German:
– Biggest problems: prefixes & compound words
– Only possible with dictionaries
  Explicit listing of all forms
  Or rules to derive forms

Page 16: Web- and Multimedia-based Information Systems Lecture 2.

Stemming

Different methods
Most efficient: affix removal
– Porter algorithm
– To be implemented later
– Series of rules to strip suffixes (a sketch follows below)
  s -> nil
  sses -> ss
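A sketch of rule-based suffix stripping in the spirit of the two rules above; it covers only Porter's first plural rules and is not the full Porter algorithm that the assignment implements later.

// Sketch: apply suffix rules in order; the first matching rule wins.
// Rules shown: sses -> ss, ies -> i, ss -> ss, s -> nil.
public class SuffixStripper {
    public static String stem(String term) {
        if (term.endsWith("sses")) {
            return term.substring(0, term.length() - 2); // caresses -> caress
        }
        if (term.endsWith("ies")) {
            return term.substring(0, term.length() - 2); // ponies -> poni
        }
        if (term.endsWith("ss")) {
            return term;                                 // caress stays caress
        }
        if (term.endsWith("s")) {
            return term.substring(0, term.length() - 1); // cats -> cat
        }
        return term;
    }
}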

Page 17: Web- and Multimedia-based Information Systems Lecture 2.

Word Type Index Term Selection

Nouns usually convey most meaning
Elimination of other word types
Clustering of compounds (e.g. "computer science")
– Noun groups
– Maximum distance between terms

Page 18: Web- and Multimedia-based Information Systems Lecture 2.

Thesauri

"Treasury of words"
For every entry
– Definition
– Synonyms

Useful within a specific knowledge domain where a controlled vocabulary can easily be obtained

Difficult with a large and dynamic document collection such as the web

Page 19: Web- and Multimedia-based Information Systems Lecture 2.

Creation of Inverted List

Create vocabulary
Note document and position within the document for each term
Sort list (first by terms, then by positions)
Split terms & positions (a sketch of these steps follows below)
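A minimal sketch of this construction, assuming documents arrive as lists of already-extracted terms; the class, record and method names are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: an inverted list maps each term to its postings, i.e.
// (document id, position in document) pairs. The TreeMap keeps the
// vocabulary sorted by term; postings are appended in document/position order.
public class InvertedIndex {
    public record Posting(int docId, int position) {}

    private final Map<String, List<Posting>> postings = new TreeMap<>();

    public void addDocument(int docId, List<String> terms) {
        for (int pos = 0; pos < terms.size(); pos++) {
            postings.computeIfAbsent(terms.get(pos), t -> new ArrayList<>())
                    .add(new Posting(docId, pos));
        }
    }

    public List<Posting> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }
}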

Page 20: Web- and Multimedia-based Information Systems Lecture 2.

Basic Query

Isolate the terms of the query
Get the pointer to the positions for every term
Conduct set operations (see the sketch below)
Get the result documents and present them
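A sketch of the set-operation step for a conjunctive (AND) query, assuming the hypothetical InvertedIndex sketched on the previous page; other Boolean operators would use union or difference instead.

import java.util.HashSet;
import java.util.Set;

// Sketch: AND-query evaluation as a set intersection over the document
// ids found in each query term's posting list.
public class BasicQuery {
    public static Set<Integer> andQuery(InvertedIndex index, String... queryTerms) {
        Set<Integer> result = null;
        for (String term : queryTerms) {
            Set<Integer> docsWithTerm = new HashSet<>();
            for (InvertedIndex.Posting p : index.lookup(term)) {
                docsWithTerm.add(p.docId());
            }
            if (result == null) {
                result = docsWithTerm;          // first term initializes the result set
            } else {
                result.retainAll(docsWithTerm); // AND = set intersection
            }
        }
        return result == null ? Set.of() : result;
    }
}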

Page 21: Web- and Multimedia-based Information Systems Lecture 2.

Advanced Query Functionality

Comparison operators for metadata
Strings of multiple terms
More general: take distance and order of terms into account
Truncation (wildcards)

Page 22: Web- and Multimedia-based Information Systems Lecture 2.

Information Retrieval System Evaluation

Functionality analysis
Performance
– Time
– Space
Retrieval performance
– Batch vs. interactive mode

Page 23: Web- and Multimedia-based Information Systems Lecture 2.

Retrieval Performance Measures

Recall
– The fraction of the relevant documents that have been retrieved

Precision
– The fraction of the retrieved documents that are relevant

Page 24: Web- and Multimedia-based Information Systems Lecture 2.

Precision vs. Recall

The user usually does not inspect all results
Example: relevant documents R = {d2, d5}
Result ranking returned by the system: 1. d1  2. d5  3. d2

After the second result, recall is 50% and precision is also 50%

After the third result, recall is 100% and precision is 67% (a sketch of the computation follows below)
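A small sketch of this computation: precision and recall after the top k results have been inspected, using the slide's ranking and relevant set.

import java.util.List;
import java.util.Set;

// Sketch: recall@k = relevant hits in top k / all relevant documents;
// precision@k = relevant hits in top k / k.
public class PrecisionRecall {
    static double recallAt(List<String> ranking, Set<String> relevant, int k) {
        long hits = ranking.subList(0, k).stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    static double precisionAt(List<String> ranking, Set<String> relevant, int k) {
        long hits = ranking.subList(0, k).stream().filter(relevant::contains).count();
        return (double) hits / k;
    }

    public static void main(String[] args) {
        List<String> ranking = List.of("d1", "d5", "d2");
        Set<String> relevant = Set.of("d2", "d5");
        // prints k=2: recall=0.50 precision=0.50 and k=3: recall=1.00 precision=0.67
        for (int k = 2; k <= 3; k++) {
            System.out.printf("k=%d: recall=%.2f precision=%.2f%n",
                    k, recallAt(ranking, relevant, k), precisionAt(ranking, relevant, k));
        }
    }
}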

Page 25: Web- and Multimedia-based Information Systems Lecture 2.

Programming Assignment

Page 26: Web- and Multimedia-based Information Systems Lecture 2.

Programming Assignment

A different part each week
A web search engine

Page 27: Web- and Multimedia-based Information Systems Lecture 2.

WWW Search Engine

[Architecture diagram: a Robot requests files from WWW servers and stores the retrieved documents in a DB; an Indexer builds the Index from the DB; the Search Engine evaluates queries against the Index; a WWW client submits a query through a WWW server and receives the result list]

Page 28: Web- and Multimedia-based Information Systems Lecture 2.

Assignment Part 1

Program a web robot that:
– starts at a user-defined URL
– navigates the web via hypertext links
– speaks HTTP (see RFC 1945)
– stores the path it took (URLs), preferably in a tree-like data structure
– stores the result code & important header fields for every request to disk, in a format suitable for further processing

Page 29: Web- and Multimedia-based Information Systems Lecture 2.

Assignment Part 1 (cont.)

Implementation in Java
Pure TCP socket communication (see the sketch below)
No need to save documents in this assignment
The robot shall identify itself via the HTTP User-Agent header
Extensibility is required for future assignments
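A minimal sketch of the kind of raw-socket HTTP/1.0 request this implies (no HTTP library); the host, path and User-Agent string below are placeholders, not part of the assignment text.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch: open a TCP socket, send a GET request with a User-Agent header,
// and read back the status line and response headers (which the robot
// would store to disk for further processing).
public class RobotRequest {
    public static void main(String[] args) throws Exception {
        String host = "www.example.org";   // placeholder start host
        String path = "/";
        try (Socket socket = new Socket(host, 80);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.ISO_8859_1));
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1))) {

            // HTTP/1.0 request: request line, headers, then an empty line (CRLF).
            out.print("GET " + path + " HTTP/1.0\r\n");
            out.print("Host: " + host + "\r\n");
            out.print("User-Agent: LectureRobot/0.1\r\n"); // the robot identifies itself
            out.print("\r\n");
            out.flush();

            String line;
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                System.out.println(line); // status line and header fields
            }
        }
    }
}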

Page 30: Web- and Multimedia-based Information Systems Lecture 2.

Example HTTP session

telnet www 80                                  <- open a TCP connection
GET / HTTP/1.0                                 <- HTTP request
                                               <- empty line (CRLF) ends the request
HTTP/1.0 200 Document follows                  <- status line, then response headers
Date: Tue, 10 Sep 1996 14:34:06 GMT
Server: NCSA/1.4.2
Content-type: image/gif
Last-modified: Tue, 10 Sep 1996 13:25:26 GMT
Content-length: 9755
                                               <- empty line (CRLF) ends the headers
<HTML>                                         <- start of content