Information Retrieval Document Parsing

Information Retrieval Document Parsing. Basic indexing pipeline Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend.

Dec 23, 2015


Barbra Mathews
Transcript
Page 1:

Information Retrieval

Document Parsing

Page 2:

Basic indexing pipeline

Documents to be indexed: Friends, Romans, countrymen.

Tokenizer

Token stream: Friends Romans Countrymen

Linguistic modules

Modified tokens: friend roman countryman

Indexer

Inverted index
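
The pipeline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LEMMAS` table is a hypothetical stand-in for the linguistic modules (a real system would use a stemmer or lemmatizer), and the function names are invented for this example.

```python
import re
from collections import defaultdict

# Hypothetical lookup standing in for the "linguistic modules" stage.
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Tokenizer: split raw text into a stream of alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Linguistic modules: lower-case, then map to a base form if known."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)

def build_inverted_index(docs):
    """Indexer: map each modified token to the sorted ids of its documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Friends, Romans, countrymen."}
index = build_inverted_index(docs)
```

Each stage is a separate function so that any one of them (for instance the normalizer) can be swapped without touching the others.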

Page 3:

Parsing a document

What format is it in? PDF/Word/Excel/HTML?

What language is it in? What character set is in use?

Plain ASCII, UTF-8, UTF-16,…

Each of these is a classification problem, with many complications…
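
As a rough illustration of the character-set question, the sketch below guesses an encoding from raw bytes. It is a simplification, not a real classifier: it only checks byte-order marks and then tries strict ASCII and UTF-8 decoding. The function name `guess_encoding` and the `"unknown"` label are assumptions made for this example.

```python
import codecs

# Byte-order marks checked first; this short list is an illustrative assumption.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_encoding(data: bytes) -> str:
    """Return a plausible encoding name for data, or "unknown"."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Try the most restrictive encoding first, then the more permissive one.
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return "unknown"
```

Many byte sequences are valid in several encodings at once, which is one of the complications that make this a classification problem rather than a lookup.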

Page 4:

Tokenization: Issues

Chinese/Japanese: no spaces between words

Not always guaranteed a unique tokenization

Dates/amounts in multiple formats

フォーチュン 500 社は情報不足のため時間あた $500K( 約 6,000 万円 )

Katakana Hiragana Kanji “Romaji”

What about DNA sequences? ACCCGGTACGCAC...

Definition of tokens = what you can search!
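
One crude way to cut text without spaces is to split wherever the script changes. The sketch below does that using Unicode code-point ranges for Hiragana, Katakana and the main CJK ideograph block. It is an assumption-laden illustration (the ranges are not exhaustive and the label names are invented here), and it does not resolve the ambiguity inside a run of Kanji.

```python
from itertools import groupby

def script_of(ch):
    """Classify a character by script using a few Unicode block ranges."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalnum():
        return "ascii"
    return "other"

def script_runs(text):
    """Group consecutive characters of the same script into candidate tokens."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]

runs = script_runs("フォーチュン500社は")
```

Here `runs` holds one Katakana run, the number 500, one Kanji character and one Hiragana character, each of which becomes a candidate token.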

Page 5:

Case folding

Reduce all letters to lower case

Exception: upper case (in mid-sentence?)

e.g., General Motors; USA vs. usa

Morgen will ich in MIT … Is this the German “mit”?
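
A minimal sketch of the mid-sentence exception, assuming the input is already tokenized and that `.`, `!` and `?` end a sentence. It lower-cases only the first token of each sentence and leaves mid-sentence capitals untouched. The helper name `fold_case` is hypothetical.

```python
SENTENCE_END = {".", "!", "?"}

def fold_case(tokens):
    """Lower-case sentence-initial tokens; keep mid-sentence capitalization."""
    folded = []
    sentence_start = True
    for token in tokens:
        if token in SENTENCE_END:
            folded.append(token)
            sentence_start = True
        else:
            folded.append(token.lower() if sentence_start else token)
            sentence_start = False
    return folded
```

With this heuristic "General Motors" and "MIT" survive when they occur mid-sentence, but the heuristic cannot tell whether a sentence-initial "MIT" is the institute or the German preposition.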

Page 6:

Stemming

Reduce terms to their “roots”

Language dependent

e.g., automate(s), automatic, automation all reduced to automat.

e.g., casa, casalinga, casata, casamatta, casolare, casamento, casale, rincasare, case reduced to cas

Page 7:

Porter’s algorithm

Commonest algorithm for stemming English

Conventions + 5 phases of reductions

Phases applied sequentially

Each phase consists of a set of commands

Sample convention: of the rules in a compound command, select the one that applies to the longest suffix.

Full morphological analysis: modest benefit!

sses → ss, ies → i, ational → ate, tional → tion
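
The longest-suffix convention can be sketched with the four sample rules from the slide. This is not Porter's algorithm itself: the real algorithm has five phases and extra conditions on the remaining stem, all omitted here. The function name is an assumption for this example.

```python
# The four sample rewrite rules from the slide: (suffix, replacement).
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_longest_suffix_rule(word, rules=RULES):
    """Apply the single rule whose suffix is the longest one matching word."""
    matching = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matching:
        return word
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl
```

For "relational" both `ational` and `tional` match; the convention picks `ational`, giving "relate" rather than "relation".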

Page 8:

Thesauri

Handle synonyms and homonyms

Hand-constructed equivalence classes

e.g., car = automobile

e.g., macchina = automobile = spider

List of words important for a given domain

For each word it specifies a list of correlated words (usually synonyms, polysemic terms, or phrases for complex concepts).

Co-occurrence pattern: BT (broader term), NT (narrower term)

Vehicle (BT) → Car → Fiat 500 (NT)

How to use it in a search engine (SE)?
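
One possible answer, offered here only as an assumption and not as the slides' own method, is query expansion: each query term is replaced by its whole equivalence class before the index lookup. The class table and the function name below are made up for the illustration.

```python
# Hand-constructed equivalence classes (illustrative data only).
EQUIVALENCE_CLASSES = [{"car", "automobile"}]

def expand_query(terms, classes=EQUIVALENCE_CLASSES):
    """Return the query terms plus every synonym from their equivalence class."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        for cls in classes:
            if term in cls:
                expanded |= cls
    return sorted(expanded)
```

The alternative is to expand at indexing time, storing each document under every member of the class. That trades a larger index for faster queries.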

Page 9:
Page 10:

Dmoz Directory

Page 11:

Yahoo! Directory

Page 12:

Information Retrieval

Statistical Properties of Documents

Page 13:

Statistical properties of texts

Tokens are not distributed uniformly; they follow the so-called “Zipf Law”

Few tokens are very frequent

A middle-sized set has medium frequency

Many are rare

The first 100 tokens sum up to 50% of the text

Many of these tokens are stopwords
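
The "first 100 tokens cover 50% of the text" claim can be checked on any corpus with a short script. The sketch below computes the share of all occurrences covered by the k most frequent tokens. The function name and the toy sentence are assumptions for illustration only.

```python
from collections import Counter

def top_k_coverage(tokens, k):
    """Fraction of all token occurrences covered by the k most frequent tokens."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(k))
    return covered / len(tokens)

tokens = "the cat and the dog and the bird".split()
coverage = top_k_coverage(tokens, 2)
```

In this toy sentence the two most frequent tokens ("the", "and") account for 5 of 8 occurrences.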

Page 14:

The Zipf Law, in detail

The k-th most frequent term has frequency approximately 1/k; equivalently, the product of the frequency (f) of a token and its rank (r) is almost a constant:

$f = c\,|T| / r$, that is, $r \cdot f = c\,|T|$

General Law: $f = c\,|T| / r^{z}$

Sum after the k-th element is $\le f_k\, k/(z-1)$

For the initial top elements it is a constant
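
The tail bound can be checked numerically. The sketch below assumes an ideal power law $f_r = \text{scale} / r^{z}$ with $z > 1$, where `scale` plays the role of $c\,|T|$, and compares the sum of frequencies after rank k with $f_k\, k/(z-1)$. All function names are hypothetical.

```python
def zipf_frequency(rank, z, scale=1.0):
    """Ideal frequency of the token at the given rank: scale / rank**z."""
    return scale / rank ** z

def tail_sum(k, z, max_rank):
    """Sum of the ideal frequencies for ranks k+1 .. max_rank."""
    return sum(zipf_frequency(r, z) for r in range(k + 1, max_rank + 1))

def tail_bound(k, z):
    """Upper bound f_k * k / (z - 1) on the sum after rank k (needs z > 1)."""
    return zipf_frequency(k, z) * k / (z - 1)
```

The bound follows from comparing the sum with the integral of $x^{-z}$ from k to infinity, which equals $k^{1-z}/(z-1) = f_k\, k/(z-1)$ when `scale` is 1.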

Page 15:

An example of “Zipf curve”

Page 16:

Zipf’s law log-log plot

Page 17:

Consequences of Zipf Law

There exist a few very frequent tokens that do not discriminate. These are the so-called “stop words”

English: to, from, on, and, the, ...

Italian: a, per, il, in, un,…

There exist many tokens that occur once in a text and thus discriminate poorly (errors?).

English: Calpurnia

Italian: Precipitevolissimevolmente (or paklo)

Words with medium frequency = words that discriminate
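
These three groups can be separated with a simple frequency cut. The sketch below labels a token "rare" if it occurs once, "stop" if it covers at least a given fraction of the text, and "medium" otherwise. The threshold `stop_fraction=0.2` is an arbitrary assumption chosen for a toy example; a real collection would need a far smaller, tuned value.

```python
from collections import Counter

def split_by_frequency(tokens, stop_fraction=0.2):
    """Partition the vocabulary into stop, medium and rare terms."""
    counts = Counter(tokens)
    total = len(tokens)
    bands = {"stop": set(), "medium": set(), "rare": set()}
    for term, freq in counts.items():
        if freq == 1:
            bands["rare"].add(term)       # occurs once: poor discriminator
        elif freq / total >= stop_fraction:
            bands["stop"].add(term)       # very frequent: stop-word candidate
        else:
            bands["medium"].add(term)     # medium frequency: discriminating
    return bands

tokens = ["the"] * 10 + ["of"] * 9 + ["index"] * 3 + ["query"] * 2 + ["calpurnia"]
bands = split_by_frequency(tokens)
```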

Page 18:

Other statistical properties of texts

The number of distinct tokens grows as $|T|^{\beta}$ where $\beta < 1$ (the so-called “Heaps Law”)

Hence the token length is $\Omega(\log |T|)$

Interesting words are the ones with medium frequency (Luhn)
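
Vocabulary growth is easy to measure directly. The sketch below records the number of distinct tokens seen after each token of the stream. Plotting the result on a log-log scale for a real collection is one way to look for the sublinear growth that Heaps Law describes. The function name is an assumption.

```python
def vocabulary_growth(tokens):
    """Return the number of distinct tokens seen after each position."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

growth = vocabulary_growth("a b a c b d".split())
```

The curve never decreases and flattens as more of the stream repeats tokens already seen.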

Page 19:

Frequency vs. Term significance (Luhn)