Special Topics on Information Retrieval Manuel Montes y Gómez http://ccc.inaoep.mx/~mmontesg/ [email protected] University of Alabama at Birmingham, Fall 2010.

Feb 10, 2016

Page 1: Special Topics on Information Retrieval

Special Topics on Information Retrieval

Manuel Montes y Gómez
http://ccc.inaoep.mx/~mmontesg/

[email protected]

University of Alabama at Birmingham, Fall 2010.

Page 2: Special Topics on Information Retrieval

Beyond word-based representations

Page 3: Special Topics on Information Retrieval

Content of the section

• Language ambiguity and IR
• Indexing with parts of speech
  – POS tagging
• Indexing with senses
  – Approaches for word sense disambiguation
• Concept indexing
  – DOR and TCOR representations
  – Random indexing

Page 4: Special Topics on Information Retrieval

Language ambiguity

• Ambiguity is a condition where information can be understood or interpreted in more than one way.
• Context may play a role in resolving ambiguity.
• Different kinds of ambiguity:
  – Lexical: words may have different meanings.
  – Syntactic: a sentence can be parsed in more than one way (or words may have two parts of speech).
  – Semantic: words or concepts have an inherently diffuse meaning based on informal usage.

Page 5: Special Topics on Information Retrieval

Examples of ambiguity

• Lexical:
  – “Plants/N need light and water” vs. “Each one plant/V one”
  – “The fisherman jumped off the bank and into the water” vs. “The bank down the street was robbed!”
• Syntactic:
  – “He ate the cookies on the couch”
    • Was he seated on the couch, or were the cookies there?

Page 6: Special Topics on Information Retrieval

Ambiguity and IR – looking for what?

• “Paris Hilton”
  – Really interested in the Hilton Hotel in Paris?
• “Tiger Woods”
  – Searching for something about wildlife or the famous golf player?
• Conclusion: “simple word matching fails”.

Page 7: Special Topics on Information Retrieval

Ambiguity and IR – two problems

• Most IR models represent documents as a “bag of words”
  – There is no information on the words’ positions.
• Two main problems:
  – Synonymy: many ways to refer to the same object, e.g. car and automobile
    • leads to poor recall
  – Polysemy: most words have more than one distinct meaning, e.g. model, bank, chip
    • leads to poor precision

Page 8: Special Topics on Information Retrieval

Example: Vector Space Model (taken from Lillian Lee)

Document 1: auto, engine, bonnet, tyres, lorry, boot
Document 2: car, emissions, hood, make, model, trunk
Document 3: make, hidden, Markov, model, emissions, normalize

• Synonymy: documents 1 and 2 will have a small cosine but are related.
• Polysemy: documents 2 and 3 will have a large cosine but are not truly related.
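The effect can be checked with a minimal sketch. It assumes raw term counts as weights and lower-cased words; the function name `cosine` and the variable names are illustrative, not part of the original slides.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The three documents of the example, as bags of words.
d1 = Counter("auto engine bonnet tyres lorry boot".split())
d2 = Counter("car emissions hood make model trunk".split())
d3 = Counter("make hidden markov model emissions normalize".split())

sim_synonymy = cosine(d1, d2)  # related documents, but no shared word
sim_polysemy = cosine(d2, d3)  # unrelated documents, three shared words
print(sim_synonymy, sim_polysemy)
```

The related pair shares no word, so its cosine is zero, while the unrelated pair shares three of six words and scores about 0.5.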

Page 9: Special Topics on Information Retrieval

First idea: indexing with POS tags

[Figure: document × term–POS matrix. Columns: the whole vocabulary of the collection with POS tags (w1|t1, w1|t2, …, e.g. Plant|NN, Plant|VB, …, wn|tm); rows: documents d1 … dm; cell wi,j: weight indicating the contribution of term–POS pair j in document i.]

• Simple and nice idea, but how to determine the POS tag of each word of a given document?
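A minimal sketch of such an index, assuming the documents are already tagged and using raw counts as weights (the slides do not fix a weighting scheme). The tagged sentences and the helper `term_pos_index` are hypothetical.

```python
from collections import Counter

# Hypothetical, already-tagged documents as (word, tag) pairs (Penn Treebank style tags).
tagged_docs = {
    "d1": [("the", "DT"), ("plant", "NN"), ("needs", "VBZ"), ("water", "NN")],
    "d2": [("we", "PRP"), ("will", "MD"), ("plant", "VB"), ("trees", "NNS")],
}

def term_pos_index(tagged_docs):
    """Build a document x (term|POS) count matrix."""
    vocab = sorted({f"{w.lower()}|{t}" for doc in tagged_docs.values() for w, t in doc})
    rows = {}
    for doc_id, doc in tagged_docs.items():
        counts = Counter(f"{w.lower()}|{t}" for w, t in doc)
        rows[doc_id] = [counts[term] for term in vocab]
    return vocab, rows

vocab, rows = term_pos_index(tagged_docs)
for doc_id, row in rows.items():
    print(doc_id, dict(zip(vocab, row)))
```

Here "plant|NN" and "plant|VB" become two separate columns, which is the whole point of the representation.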

Page 10: Special Topics on Information Retrieval

Part-Of-Speech tagging
(based on material from Dana S. Nau of the University of Maryland and Huong LeThanh of the Dresden University of Technology)

• Part of Speech (POS) tagging is the problem of assigning each word in a sentence the part of speech that it assumes in that sentence.
  – Input: a string of words + a tag set
  – Output: a single best tag for each word
• Example (from Penn Treebank):
  – The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Page 11: Special Topics on Information Retrieval


Brown/Penn Treebank tags

Page 12: Special Topics on Information Retrieval

Main approaches

• Rule-based POS tagging
  – e.g., ENGTWOL [Voutilainen, 1995]
• Transformation-based tagging
  – e.g., Brill’s tagger [Brill, 1995]
• Stochastic (probabilistic) tagging
  – e.g., TnT [Brants, 2000]
    • Requires a training corpus (e.g., the Brown Corpus)
    • Based on the probability of a certain tag occurring, given information from the word and the previous tags.

Page 13: Special Topics on Information Retrieval

Very first approach

• Assign each word its most likely POS tag
  – If w has tags t1, …, tk, then we can use
  – P(ti|w) = c(w,ti) / (c(w,t1) + … + c(w,tk)), where
    • c(w,ti) = number of times w/ti appears in the corpus
• Success: 91% for English
  – For instance, heat is used more often as a noun than as a verb.
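The baseline can be sketched in a few lines. The tiny tagged corpus below is made up for illustration, and falling back to NN for unknown words is an assumption of this sketch, not something the slides specify.

```python
from collections import Counter, defaultdict

# Hypothetical tagged corpus: (word, tag) pairs.
corpus = [
    ("the", "DT"), ("heat", "NN"), ("is", "VBZ"), ("on", "IN"),
    ("heat", "VB"), ("the", "DT"), ("oven", "NN"),
    ("the", "DT"), ("heat", "NN"), ("rises", "VBZ"),
]

# c(w, t): number of times word w appears with tag t.
counts = defaultdict(Counter)
for word, tag in corpus:
    counts[word][tag] += 1

def tag_probabilities(word):
    """P(t | w) = c(w, t) / sum over all tags t' of c(w, t')."""
    c = counts.get(word, Counter())
    total = sum(c.values())
    return {t: n / total for t, n in c.items()}

def most_likely_tag(word, default="NN"):
    """Most frequent tag of the word; `default` for unseen words (an assumption)."""
    if word not in counts:
        return default
    return counts[word].most_common(1)[0][0]

print(tag_probabilities("heat"), most_likely_tag("heat"))
```

In this toy corpus "heat" is tagged NN twice and VB once, so the baseline always tags it as a noun.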

Page 14: Special Topics on Information Retrieval

HMM tagging

• An HMM simplifying assumption: the tagging problem can be solved by looking at nearby words and tags.

$t_i = \arg\max_{j} P(t_j \mid t_{i-1}) \, P(w_i \mid t_j)$

  – $P(t_j \mid t_{i-1})$: previous tag sequence (tag co-occurrence)
  – $P(w_i \mid t_j)$: word (lexical) likelihood

Page 15: Special Topics on Information Retrieval

Example

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

• Suppose we have tagged all but race
  – Look at just the preceding word (bigram):
  – to/TO race/??? NN or VB?
• Choose the tag with the greater of the two probabilities:
  – P(VB|TO)P(race|VB) or P(NN|TO)P(race|NN)
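The decision for "race" can be sketched as follows. The probability values are invented for illustration only (they are not corpus estimates), and `best_tag` is a hypothetical helper.

```python
# Illustrative probabilities only, made up for this sketch.
transition = {("TO", "VB"): 0.80, ("TO", "NN"): 0.001}        # P(tag | previous tag)
emission = {("race", "VB"): 0.0001, ("race", "NN"): 0.0005}   # P(word | tag)

def best_tag(word, prev_tag, candidates):
    """Pick the tag maximizing P(tag | prev_tag) * P(word | tag)."""
    scores = {
        t: transition.get((prev_tag, t), 0.0) * emission.get((word, t), 0.0)
        for t in candidates
    }
    return max(scores, key=scores.get), scores

tag, scores = best_tag("race", "TO", ["VB", "NN"])
print(tag, scores)
```

With these numbers the transition term dominates: "race" is individually more likely as a noun, yet VB wins because a verb is far more likely after TO.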

Page 16: Special Topics on Information Retrieval

Does indexing with POS work?

• Improves precision but reduces recall.
• Conclusion: annotating POS does not seem worthwhile as a standalone indexing strategy, even if tagging is performed manually.
• Example:
  – Query: “talented baseball player”
  – Document: “is one of the top talents of the time”
  – The adjective “talented” and the noun “talents” end up as different index terms, so the match is lost.

Page 17: Special Topics on Information Retrieval

Second idea: motivation

• Using single words as index terms generally has good exhaustivity, but poor specificity due to word ambiguity.
• Some word associations have a meaning totally different from the “sum” of the meanings of the words that compose them.
  – Hot + dog ≠ “hot dog”
• To remedy this problem: use index terms more complex than single words, such as phrases.
  – Distinguish the two meanings by using phrasal index terms such as “bank of the Seine” and “bank of Japan”

Page 18: Special Topics on Information Retrieval

Second idea: indexing with phrases

[Figure: document × phrase matrix. Columns: phrases extracted from the collection (p1, p2, …, e.g. “Information retrieval”, “Manuel Montes”, “Brown sugar”, …, pn); rows: documents d1 … dm; cell wi,j: weight indicating the contribution of phrase j in document i.]

• Here the questions are: which kinds of word sequences are relevant phrases, and how can they be extracted?

Page 19: Special Topics on Information Retrieval

Syntactical phrases as index terms

Example sentence: “This apple pie looks good and is a real treat”

• adjective-noun relation (real-treat)
• noun-noun relation (apple-pie)
• subject-verb relation (pie-looks)
• verb-object relation (is-treat)
• The complication is that they must be extracted from the POS-tagged text or from the syntactic tree.

Page 20: Special Topics on Information Retrieval

Named entities as index terms

• Proper names in texts
  – Three universally accepted categories: person, location and organisation
  – Other categories: date/time expressions, measures (percent, money, weight, etc.), email addresses, etc.
• One problem: they can also be ambiguous!
  – George Bush: person or location?
  – Mexico: geo-political organization or location?
• How to detect named entities?

Page 21: Special Topics on Information Retrieval

Named entity recognition

• Two tasks: identification and classification
• Two main approaches:
  – Knowledge-based
    • Rule based; developed by experienced language engineers; make use of human intuition
    • Names often have internal structure and style.
  – Learning-based
    • Use statistics or machine learning methods
    • Require large amounts of labeled documents
    • Typical features are: capitalisation, numeric symbols, punctuation marks, position in the sentence, and the words themselves.
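As a rough illustration of the learning-based route, the typical features listed above could be extracted per token as below. The feature names and the sample sentence are assumptions of this sketch; a real system would feed such dictionaries to a classifier trained on labeled documents.

```python
import string

def token_features(tokens, i):
    """Surface features of token i, following the feature types named on the slide."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalised": tok[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "has_punct": any(ch in string.punctuation for ch in tok),
        "is_first": i == 0,  # position in the sentence
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

tokens = "Mr. Bush visited Mexico in 2006 .".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
for f in features:
    print(f)
```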

Page 22: Special Topics on Information Retrieval

N-grams as index terms

• An n-gram is a subsequence of n items from a given sequence.
• N-grams are easily computed.
• Combining n-grams of different sizes produces great flexibility at search time.
• The main problem is the high dimensionality.

How to reduce dimensionality? How to select only the most useful n-grams?
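A minimal sketch of n-gram extraction, assuming contiguous n-grams over either words or characters; the function name `ngrams` is illustrative.

```python
def ngrams(items, n):
    """All contiguous subsequences of length n, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "this apple pie looks good".split()
bigrams = ngrams(words, 2)              # word bigrams
char_trigrams = ngrams("retrieval", 3)  # character trigrams

# Combining several sizes gives flexibility, at the cost of many more index terms.
combined = [g for n in (1, 2, 3) for g in ngrams(words, n)]
print(len(words), len(combined))
```

Even for this five-word sentence the combined index holds 12 terms instead of 5, which hints at the dimensionality problem.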

Page 23: Special Topics on Information Retrieval

Maximal Frequent Sequences as index terms

• Sequences of words that are frequent in the document collection and that are not contained in any other longer frequent sequence.
  – A sequence is considered to be frequent if it appears in at least σ documents.
• Their main strength is that they form a very compact index
  – Avoids storing the numerous least significant phrases
• The extraction of MFS is commonly based on a combination of bottom-up and greedy methods
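The definition can be illustrated with a brute-force sketch. It makes two simplifying assumptions that are not in the slides: only contiguous sequences are considered, and their length is capped by `max_len`. It enumerates all candidates rather than using the bottom-up and greedy extraction mentioned above, so it only suits toy collections.

```python
from collections import defaultdict

def frequent_sequences(docs, sigma, max_len=5):
    """Contiguous word sequences that occur in at least sigma documents."""
    doc_ids = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                doc_ids[tuple(words[i:i + n])].add(doc_id)
    return {seq for seq, ids in doc_ids.items() if len(ids) >= sigma}

def contains(long_seq, short_seq):
    """True if short_seq occurs contiguously inside long_seq."""
    n = len(short_seq)
    return any(long_seq[i:i + n] == short_seq for i in range(len(long_seq) - n + 1))

def maximal_frequent_sequences(docs, sigma, max_len=5):
    """Frequent sequences not contained in any longer frequent sequence."""
    frequent = frequent_sequences(docs, sigma, max_len)
    return {
        s for s in frequent
        if not any(len(t) > len(s) and contains(t, s) for t in frequent)
    }

docs = [
    "the bank of japan raised rates",
    "rates set by the bank of japan",
    "the bank of the seine",
]
mfs = maximal_frequent_sequences(docs, sigma=2)
print(mfs)
```

Eleven sequences are frequent here, but only two are maximal ("the bank of japan" and "rates"), which shows why the resulting index is compact.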

Page 24: Special Topics on Information Retrieval

Does indexing with phrases work?

• Early results were very promising. However, the constant growth of test collections caused a drastic fall in the quality of the results.
• A conclusion of research works is that phrases improve results at low levels of recall.
• The recommendation is to consider phrases as supplementary terms of the vector space
  – Terms + phrases as index terms

Page 25: Special Topics on Information Retrieval


Third idea: motivation

• Traditional IR approaches are highly dependent on term-matching

• Term matching is affected by the synonymy and polysemy phenomena.

• Need to capture the concepts instead of only the words

• Solution: indexing by senses!

Page 26: Special Topics on Information Retrieval

What is word sense?

• A word sense is one of the meanings of a word.
• Words have different meanings depending on their context.
• Example:
  – We went to see a play at the theater
  – The children went out to play in the park

A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human.

04/22/2023 NILESH.A.SHEWALE

Page 27: Special Topics on Information Retrieval

Third idea: indexing by senses

• How to construct this index? How to determine the sense of each word from the document collection?

[Figure: document × word-sense matrix. Columns: all different word senses from the target collection (w11, w12, …, e.g. Bank (institution), Bank (hill), …, pn1 … pnm); rows: documents d1 … dm; cell wi,j: weight indicating the contribution of the word-sense j in document i.]

Page 28: Special Topics on Information Retrieval

Word sense disambiguation

• The task of selecting a sense for a word from a set of predefined possibilities.
  – The sense inventory usually comes from a dictionary or thesaurus.
• A related task is word sense discrimination: the task of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.

Page 29: Special Topics on Information Retrieval

The WSD process

• Choose a sense inventory
  – Dictionary or thesaurus where word senses are explicitly indicated.
• Design/apply a disambiguation procedure
  – Two main approaches: Knowledge-Based and Machine Learning
• Evaluate the performance of the procedure
  – Using a manually labeled corpus
  – Using the most frequent sense as baseline

Page 30: Special Topics on Information Retrieval

Approaches for WSD

• Knowledge-Based Approaches
  – Rely on knowledge resources like WordNet, thesauri, etc.
  – May use grammar rules for disambiguation.
  – May use hand-coded rules for disambiguation.
• Machine Learning Based Approaches
  – Rely on corpus evidence.
  – Train a model using a tagged (and untagged) corpus.
  – Probabilistic/statistical models.

Page 31: Special Topics on Information Retrieval

Knowledge resources

• Dictionaries in machine-readable form (MRD)
  – Oxford English Dictionary, Collins, Longman Dictionary of Contemporary English
• Thesauri – add synonymy information
  – Roget’s Thesaurus
• Semantic networks – add more relations
  – WordNet, EuroWordNet

Page 32: Special Topics on Information Retrieval

Henrik Bulskov, November 3rd, 2006

WordNet

• A large lexical database organized in terms of meanings.
  – Includes nouns, adjectives, adverbs, and verbs
  – Synonymous words are grouped into synsets
• Example:
  – {car, auto, automobile, machine, motorcar}

Page 33: Special Topics on Information Retrieval


Wordnet example

Page 34: Special Topics on Information Retrieval

Lesk algorithm

• Identify senses of words in a context using definition overlap.
  – Identify simultaneously the correct senses for all words in the context
• Algorithm:
  1. Retrieve from the MRD all sense definitions of the words to be disambiguated
  2. Determine the definition overlap for all possible sense combinations
  3. Choose the senses that lead to the highest overlap

Page 35: Special Topics on Information Retrieval

Example

• Disambiguate “PINE CONE”
  – PINE
    1. kinds of evergreen tree with needle-shaped leaves
    2. waste away through sorrow or illness
  – CONE
    1. solid body which narrows to a point
    2. something of this shape whether solid or hollow
    3. fruit of certain evergreen trees

Definition overlap for each sense combination:
Pine#1 ∩ Cone#1 = 0    Pine#2 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1    Pine#2 ∩ Cone#2 = 0
Pine#1 ∩ Cone#3 = 2    Pine#2 ∩ Cone#3 = 0
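The overlap counts for this example can be reproduced with a short sketch. The stop-word list and the very crude stemmer are assumptions chosen for this example only (they make "shaped" match "shape" and "trees" match "tree"); a real implementation would use a proper stemmer or lemmatizer.

```python
import re
from itertools import product

STOPWORDS = {"a", "of", "or", "to", "this", "which", "with", "through", "whether"}

def crude_stem(word):
    """Hypothetical, very rough stemmer: 'shaped' -> 'shape', 'trees' -> 'tree'."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-1]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def content_words(definition):
    """Set of stemmed, non-stop-word tokens of a definition."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    return {crude_stem(t) for t in tokens if t not in STOPWORDS}

senses = {
    "pine": [
        "kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness",
    ],
    "cone": [
        "solid body which narrows to a point",
        "something of this shape whether solid or hollow",
        "fruit of certain evergreen trees",
    ],
}

def lesk_pair(word_a, word_b):
    """Score every sense combination by definition overlap; return the best one."""
    overlaps = {}
    for (i, def_a), (j, def_b) in product(
        enumerate(senses[word_a], 1), enumerate(senses[word_b], 1)
    ):
        overlaps[(i, j)] = len(content_words(def_a) & content_words(def_b))
    best = max(overlaps, key=overlaps.get)
    return best, overlaps

best, overlaps = lesk_pair("pine", "cone")
print(best, overlaps)
```

The best combination is sense 1 of "pine" with sense 3 of "cone", which share "evergreen" and "tree".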

Page 36: Special Topics on Information Retrieval

Disadvantages of Lesk algorithm

• Too many combinations need to be evaluated; this is a problem with long sentences.
  – A simplified version compares the dictionary definition of an ambiguous word with the terms contained in its neighborhood.
• Not enough overlapping words between definitions
  – Extend definitions by using information such as synonyms, different derivatives, or words from the definitions of the words that appear in the definitions.

Page 37: Special Topics on Information Retrieval

WSD using the conceptual density

• Select a sense based on the relatedness of that word-sense to the context.
  – Relatedness is measured in terms of conceptual density (in a structured hierarchical semantic net)
• Idea: if all words in the context are strong indicators of a particular concept, then that concept will have a higher density.

Page 38: Special Topics on Information Retrieval


Example of the conceptual density

• The dots in the figure represent the senses of the word to be disambiguated or the senses of the words in context.

• The CD formula will yield highest density for the sub-hierarchy containing more senses.

• The sense of W contained in the sub-hierarchy with the highest CD will be chosen.

Page 39: Special Topics on Information Retrieval

Supervised approach for WSD

• Induces a classifier from manually sense-tagged text using machine learning techniques.
• Resources:
  – Sense-tagged text
  – Dictionary (implicit source of sense inventory)
  – Syntactic analysis (POS tagger, chunker, parser, …)
• Reduces WSD to a classification problem
• A target word is assigned the most appropriate sense from a given set of possibilities, based on the context in which it occurs

Page 40: Special Topics on Information Retrieval

Supervised methodology

1. Create a sample of training data where a given target word is manually annotated with its senses.
2. Select a set of features with which to represent context information.
3. Convert sense-tagged training instances to feature vectors.
4. Apply a machine learning algorithm to induce a classifier.
5. Convert a held-out sample of test data into feature vectors.
6. Apply the classifier to test instances to assign a sense tag.
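The six steps can be sketched end to end on a toy problem. Everything specific here is an assumption of the sketch: the target word "play", the sense labels, the four training sentences, bag-of-words context features, and the choice of a multinomial Naive Bayes classifier with add-one smoothing (the slides do not name a learning algorithm).

```python
import math
from collections import Counter, defaultdict

# Step 1: hypothetical sense-tagged training instances for the target word "play".
train = [
    ("we went to see a play at the theater", "drama"),
    ("the actors rehearsed the play on stage", "drama"),
    ("the children went out to play in the park", "game"),
    ("kids play football in the garden", "game"),
]

# Steps 2-3: represent each context as a bag of words (the target word is dropped).
def features(context):
    return Counter(w for w in context.lower().split() if w != "play")

# Step 4: induce a multinomial Naive Bayes classifier.
def train_nb(instances):
    sense_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for context, sense in instances:
        feats = features(context)
        sense_counts[sense] += 1
        word_counts[sense].update(feats)
        vocab.update(feats)
    return sense_counts, word_counts, vocab

def classify(context, model):
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    scores = {}
    for sense, n in sense_counts.items():
        denom = sum(word_counts[sense].values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total)
        for word, freq in features(context).items():
            if word in vocab:  # words never seen in training are ignored
                score += freq * math.log((word_counts[sense][word] + 1) / denom)
        scores[sense] = score
    return max(scores, key=scores.get)

model = train_nb(train)

# Steps 5-6: convert held-out instances to features and assign a sense tag.
held_out = ["a new play opened at the theater", "the children play in the garden"]
predictions = [classify(c, model) for c in held_out]
print(predictions)
```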

Page 41: Special Topics on Information Retrieval

Some interesting data

• High polysemy: especially verbs.
• Imbalanced training sets: most examples are from the first sense.
• Current methods: explore semi-supervised machine learning approaches.

Sense   n-secmic Nouns   Average number of examples
1       9082             13.51
2       1368             4.61
3       544              3.68
4       228              3.55
5       117              3.24
6       59               2.74
7       43               3.52
8       22               3.13
9       8                3.17
10      4                2.33
>10     11               1.75

Page 42: Special Topics on Information Retrieval

Does indexing with senses work?

• How much can WSD help improve IR effectiveness? Open question
  – Weiss: 1%; Voorhees’ method: negative
  – Krovetz and Croft, Sanderson: only useful for short queries
  – Schütze and Pedersen’s approaches and Gonzalo’s experiment: positive result
• WSD must be accurate to be useful for IR
• It seems that it can be more useful as a visualization strategy.

Page 43: Special Topics on Information Retrieval

Fourth idea: motivation

• The bag-of-words representation ignores all semantic or conceptual information.
  – It simply looks at the surface word forms
• Words (forms) are very ambiguous.
  – Polysemy and synonymy are big problems
• It is necessary to have representations at the concept level.
  – “Concept” is related to “sense”, but from a practical (usage) point of view.

Page 44: Special Topics on Information Retrieval

Fourth idea: concept-based representations

• In IR, documents are represented by the words occurring in them.
  – The semantics of a document is conveyed by the words that occur in it.
• Can the semantics of a word be conveyed by the documents in which it occurs?
• Basis of a representation called:
  – Document Occurrence Representation (DOR)

Page 45: Special Topics on Information Retrieval

Document Occurrence Representation

• Intuitions about the weights:
  – The more frequently ti occurs in dj, the more important dj is for characterizing the semantics of ti.
  – The more distinct words dj contains, the smaller its contribution to characterizing the semantics of ti.

[Figure: word × document matrix. Rows: all words from the collection (t1 … tm); columns: all documents from the collection (d1 … dn); cell wi,j: weight indicating the contribution of document j for the semantics of term i.]
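The slides give only the intuitions, not a formula. The sketch below assumes one tf-idf-like weighting that is consistent with both of them: w(ti, dj) = (1 + log tf(ti, dj)) · log(|V| / |distinct words in dj|), and 0 when ti does not occur in dj. The formula, the toy documents and the name `dor_matrix` are assumptions of this sketch.

```python
import math
from collections import Counter

def dor_matrix(docs):
    """Word x document weights under the assumed tf-idf-like DOR weighting."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = {}
    for term in vocab:
        row = []
        for c in counts:
            if c[term] == 0:
                row.append(0.0)
            else:
                # Frequency factor times a penalty for documents with many distinct words.
                row.append((1 + math.log(c[term])) * math.log(len(vocab) / len(c)))
        matrix[term] = row
    return vocab, matrix

docs = ["brain brain theory", "brain theory automata networks", "automata models"]
vocab, matrix = dor_matrix(docs)
for term in vocab:
    print(term, [round(w, 3) for w in matrix[term]])
```

In the first document "brain" (two occurrences) gets a larger weight than "theory" (one occurrence), and "theory" weighs less in the second document, which contains more distinct words.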

Page 46: Special Topics on Information Retrieval

Representing documents by DOR

• DOR is a word representation, not a document representation.
• The representation of a document is obtained by summing the vectors of its words.
  – Queries are represented in the same way: sum of the vectors of their words.

[Figure: the word representation (word–document matrix, rows t1 … tm, columns d1 … dn) is turned by SUM into the index for IR (document–document matrix, rows d1 … dn, columns d1 … dn).]
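The summation step can be sketched as follows. The word vectors are hypothetical DOR weights over a three-document collection, and `sum_vectors` is an illustrative helper; words without a vector are simply skipped, which is an assumption of this sketch.

```python
def sum_vectors(words, word_vectors):
    """Represent a document (or query) as the sum of the vectors of its words."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for w in words:
        if w in word_vectors:  # unknown words contribute nothing
            total = [t + x for t, x in zip(total, word_vectors[w])]
    return total

# Hypothetical DOR word vectors: one weight per document of the collection.
word_vectors = {
    "brain": [1.5, 0.2, 0.0],
    "theory": [0.9, 0.2, 0.0],
    "automata": [0.0, 0.2, 0.9],
}

doc_vector = sum_vectors("brain theory brain".split(), word_vectors)
query_vector = sum_vectors(["automata"], word_vectors)
print(doc_vector, query_vector)
```

Because each word vector has one component per document, the resulting document and query vectors live in a document-dimensional space, as in the document–document matrix of the figure.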

Page 47: Special Topics on Information Retrieval

Alternative representation

• In WSD, words are represented by the terms occurring in their context.
  – The semantics (meaning) of a word is conveyed by the words commonly co-occurring with it.
• Basis of a representation called:
  – Term Co-Occurrence Representation (TCOR)

Page 48: Special Topics on Information Retrieval

Term Co-Occurrence Representation

• Intuitions about the weights:
  – The more often ti and tj co-occur, the more important tj is for characterizing the semantics of ti.
  – The more distinct words tj co-occurs with, the smaller its contribution to characterizing the semantics of ti.

[Figure: word × word matrix. Rows and columns: all words from the collection (t1 … tm); cell wi,j: weight indicating the co-occurrence of words i and j.]
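As with DOR, the slides state intuitions rather than a formula. This sketch assumes that two words co-occur when they appear in the same document, and uses w(ti, tj) = (1 + log n(ti, tj)) · log(|V| / N(tj)), where n(ti, tj) is the number of documents containing both words and N(tj) is the number of distinct words tj co-occurs with. Both the co-occurrence window and the formula are assumptions.

```python
import math
from collections import defaultdict
from itertools import permutations

def tcor_matrix(docs):
    """Word x word weights under the assumed tf-idf-like TCOR weighting."""
    cooc = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for d in docs:
        words = set(d.lower().split())
        vocab |= words
        for a, b in permutations(words, 2):
            cooc[a][b] += 1  # number of documents in which a and b co-occur
    vocab = sorted(vocab)
    neighbours = {t: len(cooc[t]) for t in vocab}  # distinct co-occurring words
    matrix = {}
    for ti in vocab:
        matrix[ti] = {}
        for tj in vocab:
            n = cooc[ti].get(tj, 0)
            if n == 0:
                matrix[ti][tj] = 0.0
            else:
                matrix[ti][tj] = (1 + math.log(n)) * math.log(len(vocab) / neighbours[tj])
    return vocab, matrix

docs = ["brain theory", "brain theory automata", "automata models networks"]
vocab, matrix = tcor_matrix(docs)
print({tj: round(w, 3) for tj, w in matrix["brain"].items()})
```

For "brain", the word "theory" (two shared documents, few neighbours) weighs more than "automata" (one shared document, many neighbours).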

Page 49: Special Topics on Information Retrieval

Representing documents by TCOR

• TCOR, like DOR, is a word representation, not a document representation.
• The representation of a document is obtained by summing the vectors of its words.
  – Queries are represented in the same way: sum of the vectors of their words.

[Figure: the word representation (word–word matrix, rows and columns t1 … tm) is turned by SUM into the index for IR (document–word matrix, rows d1 … dn, columns t1 … tm).]

Page 50: Special Topics on Information Retrieval

Other bag-of-concepts representations

• Standard BoW representations are usually refined before being used:
  – Feature selection: remove some words based on statistical measures
  – Feature extraction: artificial features are created from the originals using distributional clustering of words or factor analytic methods.
• The problem with these approaches is that they are computationally expensive.
  – Random indexing is a simple approach to generate BoC representations

Page 51: Special Topics on Information Retrieval

Random indexing

• Random Indexing is a vector space methodology that accumulates context vectors for words based on co-occurrence data
  – First step: a unique random representation known as an “index vector” is assigned to each context (document, paragraph or sentence)

[Figure: documents D1, D2, …, Dn, each with an index vector (IV) over positions 0 … k, with k << n, showing +1 and −1 entries.]

Page 52: Special Topics on Information Retrieval

Random Indexing (2)

  – Second step: index vectors are used to produce context vectors by scanning through the text

    D1: Towards an Automata Theory of Brain
    D2: From Automata Theory to Brain Theory

[Figure: the index vectors of D1 and D2 (+1 and −1 entries over positions 0 … k) are added to give the context vector for “brain” (entries 1, 1, −1, −1).]

  – Third step: build document vectors from their words’ context vectors.

    di: “From Automata Theory to Brain Theory” → CV1 CV2 CV3 CV2

    di will be represented as the weighted sum of these vectors:

    a1CV1 + a2CV2 + a3CV3 + a2CV2, where a1, a2, a3 are idf values
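The three steps can be sketched as follows. Several details are assumptions of this sketch rather than statements from the slides: index vectors are sparse with a fixed number of +1/−1 entries and zeros elsewhere, a document's index vector is added once per occurrence of a word, and the idf-like weights are hypothetical values given by hand.

```python
import random
from collections import defaultdict

def index_vector(k, nonzeros, rng):
    """Sparse random vector of dimension k with a few +1/-1 entries."""
    vec = [0] * k
    for n, pos in enumerate(rng.sample(range(k), nonzeros)):
        vec[pos] = 1 if n % 2 == 0 else -1
    return vec

def random_indexing(docs, k=20, nonzeros=4, seed=0):
    rng = random.Random(seed)
    # Step 1: one random index vector per context (here: per document).
    index_vectors = [index_vector(k, nonzeros, rng) for _ in docs]
    # Step 2: a word's context vector accumulates the index vectors
    # of the documents it occurs in (once per occurrence).
    context = defaultdict(lambda: [0] * k)
    for iv, doc in zip(index_vectors, docs):
        for word in doc.lower().split():
            context[word] = [c + x for c, x in zip(context[word], iv)]
    return index_vectors, dict(context)

def document_vector(doc, context, weights):
    """Step 3: weighted sum of the context vectors of the document's words."""
    k = len(next(iter(context.values())))
    vec = [0.0] * k
    for word in doc.lower().split():
        if word in weights and word in context:
            vec = [v + weights[word] * c for v, c in zip(vec, context[word])]
    return vec

docs = ["Towards an Automata Theory of Brain", "From Automata Theory to Brain Theory"]
index_vectors, context = random_indexing(docs)

# Hypothetical idf-like weights; words without a weight are ignored.
weights = {"automata": 1.0, "theory": 0.5, "brain": 1.2}
doc_vec = document_vector(docs[1], context, weights)
print(context["brain"])
```

The context vector of "brain" is the sum of the index vectors of D1 and D2, as in the figure. The dimension k stays fixed no matter how many documents are added, which is what keeps the approach cheap.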

Page 53: Special Topics on Information Retrieval

Do concept-based representations work?

• Useful solutions for a number of conceptual matching problems
  – Capture key relationship information, including causal, goal-oriented, and taxonomic information.
• Not too much work in IR
  – Recent experiments demonstrate that TCOR, DOR and random indexing results outperform those from traditional VSM; in CLEF collections the improvement has been around 7%.
• The most widely used approach is the one based on Latent Semantic Indexing
  – But it is computationally expensive