Special Topics on Information Retrieval Manuel Montes y Gómez http://ccc.inaoep.mx/~mmontesg/ [email protected] University of Alabama at Birmingham, Fall 2010.
Feb 10, 2016
Beyond word-based representations
Content of the section
• Language ambiguity and IR
• Indexing with parts of speech
  – POS tagging
• Indexing with senses
  – Approaches for word sense disambiguation
• Concept indexing
  – DOR and TCOR representations
  – Random indexing
Language ambiguity
• Ambiguity is a condition where information can be understood or interpreted in more than one way.
• Context may play a role in resolving ambiguity.
• Different kinds of ambiguity:
  – Lexical: words may have different meanings
  – Syntactic: a sentence can be parsed in more than one way (or words may have more than one part of speech).
  – Semantic: words or concepts have an inherently diffuse meaning based on informal usage
Examples of ambiguity
• Lexical:
  – “Plants/N need light and water” vs. “Each one plant/V one”
  – “The fisherman jumped off the bank and into the water” vs. “The bank down the street was robbed!”
• Syntactic:
  – He ate the cookies on the couch
    • Was he seated on the couch, or were the cookies there?
Ambiguity and IR – looking for what?
• “Paris Hilton”
  – Really interested in the Hilton Hotel in Paris?
• “Tiger Woods”
  – Searching for something about wildlife or the famous golf player?
• Conclusion: simple word matching fails.
Ambiguity and IR – two problems
• Most IR models represent documents as a “bag of words”
  – There is no information on the words’ positions.
• Two main problems:
  – Synonymy: many ways to refer to the same object, e.g. car and automobile
    • leads to poor recall
  – Polysemy: most words have more than one distinct meaning, e.g. model, bank, chip
    • leads to poor precision
Example: Vector Space Model (taken from Lillian Lee)
• Document 1: auto, engine, bonnet, tyres, lorry, boot
• Document 2: car, emissions, hood, make, model, trunk
• Document 3: make, hidden, Markov, model, emissions, normalize
• Synonymy: documents 1 and 2 will have a small cosine but are related
• Polysemy: documents 2 and 3 will have a large cosine but are not truly related
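The effect can be checked with a minimal sketch of the cosine measure over the three example documents. It assumes raw term counts as weights (no tf-idf), and the function name `cosine` is only illustrative.

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between two bag-of-words documents (lists of terms)."""
    a, b = Counter(doc_a), Counter(doc_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The three documents of the example (assumption: raw counts as weights).
doc1 = ["auto", "engine", "bonnet", "tyres", "lorry", "boot"]
doc2 = ["car", "emissions", "hood", "make", "model", "trunk"]
doc3 = ["make", "hidden", "markov", "model", "emissions", "normalize"]

synonymy_cos = cosine(doc1, doc2)  # related documents, no shared words
polysemy_cos = cosine(doc2, doc3)  # unrelated documents, three shared words
print(synonymy_cos, polysemy_cos)
```

Documents 1 and 2 share no term, so their cosine is zero although both are about cars; documents 2 and 3 share three of six terms, so their cosine is about 0.5 although they are unrelated.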
First idea: indexing with POS tags
• Document–term matrix:
  – Columns: the whole vocabulary of the collection with POS tags (w1t1, w1t2, …, Plant|NN, Plant|VB, …, wntm)
  – Rows: the documents d1, d2, …, dm
  – Cell wi,j: weight indicating the contribution of term-POS j in document i.
• Simple and nice idea, but how to determine the POS tag of each word of a given document?
Part-Of-Speech tagging
(based on material from Dana S. Nau of University of Maryland and Huong LeThanh of the Dresden University of Technology)
• Part of Speech (POS) tagging is the problem of assigning each word in a sentence the part of speech that it assumes in that sentence.
  – Input: a string of words + a tag set
  – Output: a single best tag for each word
• Example (from Penn Treebank):
  – The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Brown/Penn Treebank tags
Main approaches
• Rule-based POS tagging
  – e.g., ENGTWOL [Voutilainen, 1995]
• Transformation-based tagging
  – e.g., Brill’s tagger [Brill, 1995]
• Stochastic (probabilistic) tagging
  – e.g., TnT [Brants, 2000]
    • Requires a training corpus (the Brown Corpus)
    • Based on the probability of a certain tag occurring, given information from the word and previous tags.
Very first approach
• Assign each word its most likely POS tag
  – If w has tags t_1, …, t_k, then we can use
  – P(t_i | w) = c(w, t_i) / (c(w, t_1) + … + c(w, t_k)), where
    • c(w, t_i) = number of times w/t_i appears in the corpus
• Success: 91% for English
  – For instance, heat is used more often as a noun than as a verb.
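A minimal sketch of this baseline, assuming a tiny hand-made tagged corpus (the data is hypothetical) and a default tag of NN for unseen words (also an assumption, not part of the slides):

```python
from collections import Counter, defaultdict

def train_most_likely_tag(tagged_corpus):
    """Count c(w, t) for every (word, tag) pair in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return counts

def tag_probability(counts, word, tag):
    """P(t | w) = c(w, t) / sum over all tags t' of c(w, t')."""
    tags = counts.get(word.lower(), Counter())
    total = sum(tags.values())
    return tags[tag] / total if total else 0.0

def most_likely_tag(counts, word, default="NN"):
    """Pick the tag with the highest count; unseen words get `default` (assumption)."""
    tags = counts.get(word.lower())
    return tags.most_common(1)[0][0] if tags else default

# Hypothetical toy corpus, for illustration only.
toy_corpus = [("the", "DT"), ("heat", "NN"), ("rose", "VBD"),
              ("heat", "NN"), ("heat", "VB"), ("the", "DT"), ("oven", "NN")]
counts = train_most_likely_tag(toy_corpus)
print(most_likely_tag(counts, "heat"), tag_probability(counts, "heat", "NN"))
```

In this toy corpus, heat is tagged NN twice and VB once, so the baseline always tags it as a noun, whatever the context.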
HMM tagging
• An HMM simplifying assumption: the tagging problem can be solved by looking at nearby words and tags.
• t_i = argmax_j P(t_j | t_{i-1}) P(w_i | t_j)
  – P(t_j | t_{i-1}): previous tag sequence (tag co-occurrence)
  – P(w_i | t_j): word (lexical) likelihood
Example
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
• Suppose we have tagged all but race
  – Look at just the preceding word (bigram):
  – to/TO race/??? NN or VB?
• Choose the tag with the greater of the two probabilities:
  – P(VB|TO)P(race|VB) or P(NN|TO)P(race|NN)
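The comparison can be sketched as follows. The probability values are illustrative assumptions (the slides give no numbers), and `choose_tag` is a hypothetical helper name.

```python
def choose_tag(prev_tag, word, candidates, transition, emission):
    """Score each candidate tag t as P(t | prev_tag) * P(word | t) and keep the best."""
    scores = {
        tag: transition.get((prev_tag, tag), 0.0) * emission.get((word, tag), 0.0)
        for tag in candidates
    }
    return max(scores, key=scores.get), scores

# Illustrative values only (assumption): in practice they are estimated from a tagged corpus.
transition = {("TO", "VB"): 0.83, ("TO", "NN"): 0.00047}        # P(tag | previous tag)
emission = {("race", "VB"): 0.00012, ("race", "NN"): 0.00057}   # P(word | tag)

best_tag, scores = choose_tag("TO", "race", ["VB", "NN"], transition, emission)
print(best_tag, scores)
```

With these values the verb reading wins: race is slightly more likely as a noun in isolation, but a noun almost never follows TO.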
Does indexing with POS work?
• Improves precision but reduces recall.
• Conclusion: annotating POS does not seem worthwhile as a standalone indexing strategy, even if tagging is performed manually.
• Example:
  – Query: “talented baseball player”
  – Document: “is one of the top talents of the time”
Second idea: motivation
• Using single words as index terms generally has good exhaustivity, but poor specificity due to word ambiguity.
• Some word associations have a meaning totally different from the “sum” of the meanings of the words that compose them.
  – Hot + dog ≠ “hot dog”
• To remedy this problem: use index terms more complex than single words, such as phrases.
  – Distinguish the two meanings by using phrasal index terms such as “bank of the Seine” and “bank of Japan”
Second idea: indexing with phrases
• Document–phrase matrix:
  – Columns: phrases p1, p2, …, pn extracted from the collection (e.g., “Information retrieval”, “Manuel Montes”, “Brown sugar”)
  – Rows: the documents d1, d2, …, dm
  – Cell wi,j: weight indicating the contribution of phrase j in document i.
• Here the questions are: which kinds of word sequences are relevant phrases? How to extract them?
Syntactical phrases as index terms
This apple pie looks good and is a real treat
• adjective-noun relation (real-treat)
• noun-noun relation (apple-pie)
• subject-verb relation (pie-looks)
• verb-object relation (is-treat)
• The complication is that they are extracted from the POS tagged text or from the syntactic tree.
Named entities as index terms
• Proper names in texts
  – Three universally accepted categories: person, location and organisation
  – Other categories: date/time expressions, measures (percent, money, weight, etc.), email addresses, etc.
• One problem: they can also be ambiguous!
  – George Bush: person or location?
  – Mexico: geo-political organization or location?
• How to detect named entities?
Named entity recognition
• Two tasks: identification and classification
• Two main approaches:
  – Knowledge-based
    • Rule-based; developed by experienced language engineers; makes use of human intuition
    • Names often have internal structure and style.
  – Learning-based
    • Uses statistics or machine learning methods
    • Requires large amounts of labeled documents
    • Typical features are: capitalisation, numeric symbols, punctuation marks, position in the sentence and the words.
N-grams as index terms
• An n-gram is a subsequence of n items from a given sequence
• N-grams are easily computed
• Combining n-grams of different sizes produces great flexibility at search time.
• The main problem is the high dimensionality.
  – How to reduce dimensionality? How to select only the most useful n-grams?
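A minimal sketch of word n-gram extraction, assuming whitespace tokenization and sizes 1 to 3 (both are illustrative choices):

```python
def word_ngrams(tokens, n):
    """All contiguous subsequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_index_terms(tokens, sizes=(1, 2, 3)):
    """Combine n-grams of several sizes into one list of index terms."""
    terms = []
    for n in sizes:
        terms.extend(word_ngrams(tokens, n))
    return terms

tokens = "this apple pie looks good".split()
terms = ngram_index_terms(tokens)
print(len(terms), terms)
```

Five tokens already yield 5 + 4 + 3 = 12 index terms, which illustrates how quickly the dimensionality grows over a whole collection.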
Maximal Frequent Sequences as index terms
• Sequences of words that are frequent in the document collection and that are not contained in any other longer frequent sequence.
  – A sequence is considered to be frequent if it appears in at least σ documents.
• Their main strength is that they form a very compact index
  – Avoids storing the numerous least significant phrases
• The extraction of MFS is commonly based on a combination of bottom-up and greedy methods
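The definition can be illustrated with a brute-force sketch. It makes two simplifying assumptions that are not in the slides: only contiguous word sequences are considered, and all candidates up to `max_len` words are enumerated exhaustively instead of using the bottom-up and greedy strategy. The toy documents are hypothetical.

```python
from collections import defaultdict

def frequent_sequences(docs, sigma, max_len=5):
    """Contiguous word sequences that occur in at least `sigma` documents."""
    doc_ids = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                doc_ids[tuple(tokens[i:i + n])].add(doc_id)
    return {seq for seq, ids in doc_ids.items() if len(ids) >= sigma}

def contains(longer, shorter):
    """True if `shorter` appears as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def maximal_frequent_sequences(docs, sigma, max_len=5):
    """Frequent sequences not contained in any longer frequent sequence."""
    frequent = frequent_sequences(docs, sigma, max_len)
    return {s for s in frequent
            if not any(len(t) > len(s) and contains(t, s) for t in frequent)}

docs = [
    "the bank of japan raised rates".split(),
    "rates at the bank of japan".split(),
    "the bank of the seine".split(),
]
mfs = maximal_frequent_sequences(docs, sigma=2)
print(mfs)
```

Eleven sequences are frequent with σ = 2, but only two are maximal ("the bank of japan" and "rates"), which is the source of the compact index.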
Does indexing with phrases work?
• Early results were very promising. However, the constant growth of test collections caused a drastic fall in the quality of the results.
• A conclusion of research works is that phrases improve results at low levels of recall.
• The recommendation is to consider phrases as supplementary terms of the vector space
  – Terms + phrases as index terms
Third idea: motivation
• Traditional IR approaches are highly dependent on term-matching
• Term matching is affected by the synonymy and polysemy phenomena.
• Need to capture the concepts instead of only the words
• Solution: indexing by senses!
What is word sense?
• A word sense is one of the meanings of a word.
• Words have different meanings depending on their context.
• Example:
  – We went to see a play at the theater
  – The children went out to play in the park
• A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human
Third idea: indexing by senses
• How to construct this index? How to determine the sense of each word from the document collection?
• Document–sense matrix:
  – Columns: all different word senses from the target collection (labels w11, w12, …, pn1, pnm; e.g., Bank (institution), Bank (hill))
  – Rows: the documents d1, d2, …, dm
  – Cell wi,j: weight indicating the contribution of the word-sense j in document i.
Word sense disambiguation
• The task of selecting a sense for a word from a set of predefined possibilities.
  – The sense inventory usually comes from a dictionary or thesaurus.
• A related task is word sense discrimination: the task of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
The WSD process
• Choose a sense inventory
  – Dictionary or thesaurus where word senses are explicitly indicated.
• Design/apply a disambiguation procedure
  – Two main approaches: Knowledge-Based and Machine Learning
• Evaluate the performance of the procedure
  – Using a manually labeled corpus
  – Using the most frequent sense as baseline
Approaches for WSD
• Knowledge-Based Approaches
  – Rely on knowledge resources like WordNet, thesauri, etc.
  – May use grammar rules for disambiguation.
  – May use hand-coded rules for disambiguation.
• Machine Learning Based Approaches
  – Rely on corpus evidence.
  – Train a model using a tagged (and untagged) corpus.
  – Probabilistic/statistical models.
Knowledge resources
• Dictionaries in machine-readable form (MRD)
  – Oxford English Dictionary, Collins, Longman Dictionary of Contemporary English
• Thesauri: add synonymy information
  – Roget’s Thesaurus
• Semantic networks: add more relations
  – WordNet, EuroWordNet
WordNet
• A large lexical database organized in terms of meanings.
  – Includes nouns, adjectives, adverbs, and verbs
  – Synonym words are grouped into synsets
• Example:
  – {car, auto, automobile, machine, motorcar}
WordNet example
Lesk algorithm
• Identify senses of words in a context using definition overlap.
  – Identify simultaneously the correct senses for all words in context
• Algorithm:
  1. Retrieve from the MRD all sense definitions of the words to be disambiguated
  2. Determine the definition overlap for all possible sense combinations
  3. Choose the senses that lead to the highest overlap
Example
• Disambiguate “PINE CONE”
  – PINE
    1. kinds of evergreen tree with needle-shaped leaves
    2. waste away through sorrow or illness
  – CONE
    1. solid body which narrows to a point
    2. something of this shape whether solid or hollow
    3. fruit of certain evergreen trees
• Definition overlaps:

           Cone#1   Cone#2   Cone#3
  Pine#1     0        1        2
  Pine#2     0        0        0
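The overlap counts of the PINE CONE example can be reproduced with a small sketch of the Lesk procedure for two words. The stop-word list and the crude suffix stripping in `normalize` are assumptions made here so that “trees” matches “tree” and “shaped” matches “shape”; the slides do not say how words are normalized.

```python
from itertools import product

# Assumption: a hand-picked stop-word list, enough for these definitions.
STOPWORDS = {"a", "of", "or", "to", "this", "which", "with", "through", "whether"}

def normalize(word):
    """Very crude stemmer (assumption): 'trees' -> 'tree', 'shaped' -> 'shape'."""
    if word.endswith("ed"):
        return word[:-1]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def content_words(definition):
    tokens = definition.lower().replace("-", " ").split()
    return {normalize(t) for t in tokens if t not in STOPWORDS}

def overlap(def_a, def_b):
    """Number of content words shared by two definitions."""
    return len(content_words(def_a) & content_words(def_b))

def lesk_pair(senses_a, senses_b):
    """Return (sense of a, sense of b, overlap) for the best sense combination."""
    pairs = product(enumerate(senses_a, 1), enumerate(senses_b, 1))
    return max(((i, j, overlap(a, b)) for (i, a), (j, b) in pairs),
               key=lambda item: item[2])

pine = ["kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness"]
cone = ["solid body which narrows to a point",
        "something of this shape whether solid or hollow",
        "fruit of certain evergreen trees"]

best = lesk_pair(pine, cone)
print(best)
```

The best combination is Pine#1 with Cone#3, which share “evergreen” and “tree”.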
Disadvantages of Lesk algorithm
• Too many combinations need to be evaluated; this is a problem with long sentences.
  – A simplified version compares the dictionary definition of an ambiguous word with the terms contained in its neighborhood.
• Not enough overlapping words between definitions
  – Extend definitions using information such as synonyms, different derivatives, or words from the definitions of the words that appear in the definitions.
WSD using the conceptual density
• Select a sense based on the relatedness of that word-sense to the context.
  – Relatedness is measured in terms of conceptual density (in a structured hierarchical semantic net)
• Idea: if all words in the context are strong indicators of a particular concept then that concept will have a higher density.
Example of the conceptual density
• The dots in the figure represent the senses of the word to be disambiguated or the senses of the words in context.
• The CD formula will yield highest density for the sub-hierarchy containing more senses.
• The sense of W contained in the sub-hierarchy with the highest CD will be chosen.
Supervised approach for WSD
• Induces a classifier from manually sense-tagged text using machine learning techniques.
• Resources:
  – Sense-tagged text
  – Dictionary (implicit source of sense inventory)
  – Syntactic analysis (POS tagger, chunker, parser, …)
• Reduces WSD to a classification problem
  – A target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs
Supervised methodology
1. Create a sample of training data where a given target word is manually annotated with its senses.
2. Select a set of features with which to represent context information.
3. Convert sense-tagged training instances to feature vectors.
4. Apply a machine learning algorithm to induce a classifier.
5. Convert a held-out sample of test data into feature vectors.
6. Apply the classifier to test instances to assign a sense tag.
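The six steps can be sketched for the target word play. Everything specific here is an assumption made for illustration: the four training sentences, the sense labels “drama” and “recreation”, bag-of-words context features, and a multinomial Naive Bayes classifier with add-one smoothing as the learning algorithm.

```python
import math
from collections import Counter, defaultdict

def featurize(context, target):
    """Steps 2, 3 and 5: bag of context words, excluding the target word."""
    return Counter(w for w in context.lower().split() if w != target)

def train_naive_bayes(instances, target):
    """Step 4: collect the counts needed by a multinomial Naive Bayes classifier."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for context, sense in instances:
        features = featurize(context, target)
        sense_counts[sense] += 1
        word_counts[sense].update(features)
        vocabulary.update(features)
    return sense_counts, word_counts, vocabulary

def classify(model, context, target):
    """Step 6: assign the sense with the highest log-probability."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[sense].values()) + len(vocabulary)
        for word, freq in featurize(context, target).items():
            if word in vocabulary:  # words never seen in training are ignored
                score += freq * math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Step 1: hypothetical sense-tagged training sample for "play".
training = [
    ("we went to see a play at the theater", "drama"),
    ("the actors rehearsed the play on stage", "drama"),
    ("the children went out to play in the park", "recreation"),
    ("kids play football in the park after school", "recreation"),
]
model = train_naive_bayes(training, target="play")
sense_a = classify(model, "a new play opened at the theater", "play")
sense_b = classify(model, "children play in the park", "play")
print(sense_a, sense_b)
```

A real system would use far more training instances per sense and richer features (POS tags, collocations, syntactic relations).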
Some interesting data
• High polysemy: especially verbs.
• Imbalanced training sets: Most examples are from the first sense.
• Current methods: explore semi-supervised machine learning approaches.
Sense   n-secmic Nouns   Average number of examples
1       9082             13.51
2       1368             4.61
3       544              3.68
4       228              3.55
5       117              3.24
6       59               2.74
7       43               3.52
8       22               3.13
9       8                3.17
10      4                2.33
>10     11               1.75
Does indexing with senses work?
• How much can WSD help improve IR effectiveness? Open question
  – Weiss: 1%; Voorhees’ method: negative
  – Krovetz and Croft, Sanderson: only useful for short queries
  – Schütze and Pedersen’s approaches and Gonzalo’s experiment: positive result
• WSD must be accurate to be useful for IR
• It seems that it can be more useful as a visualization strategy.
Fourth idea: motivation
• Bag of words representation ignores all semantic or conceptual information.
  – It simply looks at the surface word forms
• Words (forms) are very ambiguous.
  – Polysemy and synonymy are big problems
• It is necessary to have representations at the concept level.
  – “Concept” is related to “sense”, but from a practical (usage) point of view.
Fourth idea: concept-based representations
• In IR, documents are represented by the words occurring in them.
  – The semantics of a document is conveyed by the words that occur in it.
• Can the semantics of a word be conveyed by the documents in which it occurs?
• Basis of a representation called:
  – Document Occurrence Representation (DOR)
Document Occurrence Representation
• Intuitions about the weights:
  – The more frequently ti occurs in dj, the more important dj is for characterizing the semantics of ti
  – The more distinct words dj contains, the smaller its contribution to characterizing the semantics of ti.
• Word–document matrix:
  – Columns: all documents from the collection (d1, d2, …, dn)
  – Rows: all words from the collection (t1, t2, …, tm)
  – Cell wi,j: weight indicating the contribution of document j to the semantics of term i.
Representing documents by DOR
• DOR is a word representation, not a document representation.
• The representation of a document is obtained as the sum of the vectors of its words.
  – Queries are represented in the same way: the sum of the vectors of their words.
• Word representation: word–document matrix (rows t1, …, tm; columns d1, …, dn)
• SUM of the word vectors of each document
• Index for IR: document–document matrix (rows d1, …, dn; columns d1, …, dn)
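A minimal sketch of DOR follows. The slides only state the two intuitions, so the weighting used here is an assumption chosen to follow them: w(ti, dj) = (1 + log tf(ti, dj)) · log(|T| / Nj), where |T| is the vocabulary size and Nj is the number of distinct words in dj. The toy documents and function names are hypothetical.

```python
import math
from collections import Counter

def dor_word_vectors(docs):
    """One vector per word, with one dimension per document of the collection."""
    vocabulary = {t for doc in docs for t in doc}
    vectors = {t: [0.0] * len(docs) for t in vocabulary}
    for j, doc in enumerate(docs):
        counts = Counter(doc)
        if not counts:
            continue
        # Second intuition: documents with many distinct words contribute less.
        doc_factor = math.log(len(vocabulary) / len(counts))
        for term, freq in counts.items():
            # First intuition: the more often the term occurs, the larger the weight.
            vectors[term][j] = (1.0 + math.log(freq)) * doc_factor
    return vectors

def sum_of_word_vectors(tokens, word_vectors, dims):
    """Represent a document or a query as the sum of the vectors of its words."""
    total = [0.0] * dims
    for token in tokens:
        for k, weight in enumerate(word_vectors.get(token, ())):
            total[k] += weight
    return total

docs = [["car", "engine", "car"],
        ["automobile", "engine"],
        ["bank", "money", "loan", "interest"]]
word_vectors = dor_word_vectors(docs)
doc_index = [sum_of_word_vectors(d, word_vectors, len(docs)) for d in docs]
query = sum_of_word_vectors(["car"], word_vectors, len(docs))
print(query, doc_index[1])
```

The query “car” and the second document share no word, yet both have a non-zero weight in the first dimension (through “engine”), so their cosine is no longer zero.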
Alternative representation
• In WSD, words are represented by the terms occurring in their context.
  – The semantics (meaning) of a word is conveyed by the words commonly co-occurring with it.
• Basis of a representation called:
  – Term Co-Occurrence Representation (TCOR)
Term Co-Occurrence Representation
• Intuitions about the weights:
  – The more often words ti and tj co-occur, the more important tj is for characterizing the semantics of ti
  – The more distinct words tj co-occurs with, the smaller its contribution to characterizing the semantics of ti.
• Word–word matrix:
  – Rows and columns: all words from the collection (t1, t2, …, tm)
  – Cell wi,j: weight indicating the co-occurrence of words i and j
Representing documents by TCOR
• TCOR, like DOR, is a word representation, not a document representation.
• The representation of a document is obtained as the sum of the vectors of its words.
  – Queries are represented in the same way: the sum of the vectors of their words.
• Word representation: word–word matrix (rows t1, …, tm; columns t1, …, tm)
• SUM of the word vectors of each document
• Index for IR: document–word matrix (rows d1, …, dn; columns t1, …, tm)
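A minimal sketch of TCOR follows. The weighting is an assumption chosen to follow the two intuitions: w(ti, tj) = (1 + log c(ti, tj)) · log(|T| / Nj), where c(ti, tj) is the number of documents in which both words occur and Nj is the number of distinct words that tj co-occurs with. Co-occurrence is taken at the document level and a word is not counted as co-occurring with itself; both choices, the toy documents and the function names are illustrative.

```python
import math
from collections import Counter
from itertools import permutations

def tcor_word_vectors(docs):
    """One vector per word, with one dimension per word of the vocabulary."""
    vocabulary = sorted({t for doc in docs for t in doc})
    index = {t: k for k, t in enumerate(vocabulary)}
    cooc = Counter()
    for doc in docs:
        for ti, tj in permutations(set(doc), 2):
            cooc[(ti, tj)] += 1          # documents in which ti and tj co-occur
    partners = Counter(tj for (_, tj) in cooc)  # distinct words tj co-occurs with
    vectors = {t: [0.0] * len(vocabulary) for t in vocabulary}
    for (ti, tj), n_docs in cooc.items():
        weight = (1.0 + math.log(n_docs)) * math.log(len(vocabulary) / partners[tj])
        vectors[ti][index[tj]] = weight
    return vocabulary, vectors

def sum_of_word_vectors(tokens, word_vectors, dims):
    total = [0.0] * dims
    for token in tokens:
        for k, weight in enumerate(word_vectors.get(token, ())):
            total[k] += weight
    return total

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [["car", "engine", "wheel"],
        ["automobile", "engine", "wheel"],
        ["bank", "money", "loan"]]
vocabulary, vectors = tcor_word_vectors(docs)
doc_index = [sum_of_word_vectors(d, vectors, len(vocabulary)) for d in docs]
print(cosine(vectors["car"], vectors["automobile"]), cosine(vectors["car"], vectors["bank"]))
```

Here car and automobile never occur together, but they co-occur with the same words (engine, wheel), so their vectors point in the same direction, while car and bank remain orthogonal.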
Other bag-of-concepts representations
• Standard BoW representations are usually refined before being used:
  – Feature selection: remove some words based on statistical measures
  – Feature extraction: artificial features are created from the originals using distributional clustering of words or factor analytic methods.
• The problem with these approaches is that they are computationally expensive.
  – Random indexing is a simple approach to generate BoC representations
Random indexing
• Random Indexing is a vector space methodology that accumulates context vectors for words based on co-occurrence data
  – First step: a unique random representation known as an “index vector” is assigned to each context (document, paragraph or sentence)
  – Figure: each document D1, D2, …, Dn is assigned an index vector (IV) of k dimensions (k << n) that holds a few 1 and -1 entries.
Random Indexing (2)
  – Second step: index vectors are used to produce context vectors by scanning through the text
    • D1: Towards an Automata Theory of Brain
    • D2: From Automata Theory to Brain Theory
    • The context vector for brain is the sum of the index vectors of D1 and D2, the documents where it occurs.
  – Third step: build document vectors from their words’ context vectors.
    • di: “From Automata Theory to Brain Theory” → CV1 CV2 CV3 CV2
    • di will be represented as the weighted sum of these vectors: a1CV1 + a2CV2 + a3CV3 + a2CV2, where a1, a2, a3 are idf-values
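The three steps can be sketched as follows. The dimensionality k = 20, the four non-zero entries per index vector, the fixed seed, the idf formula log(N / df) and the third document are all assumptions made for illustration; D1 and D2 are the titles used in the example.

```python
import math
import random
from collections import Counter

def make_index_vector(k, nonzeros, rng):
    """First step: sparse random vector with a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * k
    for pos in rng.sample(range(k), nonzeros):
        vec[pos] = rng.choice((1, -1))
    return vec

def random_indexing(docs, k=20, nonzeros=4, seed=0):
    """Second step: a word's context vector sums the index vectors of its contexts."""
    rng = random.Random(seed)
    index_vectors = [make_index_vector(k, nonzeros, rng) for _ in docs]
    context_vectors = {}
    for iv, doc in zip(index_vectors, docs):
        for word in doc:  # one addition per occurrence of the word
            cv = context_vectors.setdefault(word, [0] * k)
            for pos, value in enumerate(iv):
                cv[pos] += value
    return index_vectors, context_vectors

def idf_weights(docs):
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(len(docs) / df[w]) for w in df}

def document_vector(tokens, context_vectors, idf, k):
    """Third step: idf-weighted sum of the context vectors of the document's words."""
    vec = [0.0] * k
    for word in tokens:
        for pos, value in enumerate(context_vectors.get(word, ())):
            vec[pos] += idf.get(word, 0.0) * value
    return vec

docs = ["towards an automata theory of brain".split(),
        "from automata theory to brain theory".split(),
        "random projections of sparse vectors".split()]  # third document is hypothetical
k = 20
index_vectors, context_vectors = random_indexing(docs, k=k)
idf = idf_weights(docs)
d2_vector = document_vector(docs[1], context_vectors, idf, k)
print(context_vectors["brain"])
```

Since brain occurs once in D1 and once in D2, its context vector is exactly the sum of their two index vectors; “theory” occurs twice in D2, so its context vector enters the document vector twice, as in a2CV2 above.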
Do concept-based representations work?
• Useful solutions for a number of conceptual matching problems
  – Capture key relationship information, including causal, goal-oriented, and taxonomic information.
• Not too much work in IR
  – Recent experiments demonstrate that TCOR, DOR and random indexing outperform the traditional VSM; on CLEF collections the improvement has been around 7%.
• The most used approach is the one based on Latent Semantic Indexing
  – But it is computationally expensive