2003.10.30 - SLIDE 1 IS 202 – FALL 2003
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2003
http://www.sims.berkeley.edu/academics/courses/is202/f03/
SIMS 202:
Information Organization
and Retrieval
Lecture 18: Statistical Properties of Texts and Vector Representation
2003.10.30 - SLIDE 2 IS 202 – FALL 2003
Lecture Overview
• Review
  – Boolean Searching
  – Content Analysis
• Statistical Properties of Text
  – Zipf Distribution
  – Statistical Dependence
• Indexing and Inverted Files
• Vector Representation
• Term Weights
• Vector Matching
Credit for some of the slides in this lecture goes to Marti Hearst
2003.10.30 - SLIDE 3 IS 202 – FALL 2003
Lecture Overview
• Review
  – Boolean Searching
  – Content Analysis
• Statistical Properties of Text
  – Zipf Distribution
  – Statistical Dependence
• Indexing and Inverted Files
• Vector Representation
• Term Weights
• Vector Matching
Credit for some of the slides in this lecture goes to Marti Hearst
2003.10.30 - SLIDE 4 IS 202 – FALL 2003
Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
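These Boolean operators map directly onto set operations over posting lists. A minimal sketch in Python, using a hypothetical toy index (the terms and document IDs are invented for illustration):

```python
# Toy Boolean retrieval: each term maps to the set of documents containing it.
# (Hypothetical index, for illustration only.)
index = {
    "cat":    {1, 2, 3},
    "dog":    {2, 3, 4},
    "collar": {3, 5},
    "leash":  {4, 5},
}

# AND -> set intersection, OR -> set union
cat_and_dog = index["cat"] & index["dog"]
cat_or_dog = index["cat"] | index["dog"]
combined = (index["cat"] & index["dog"]) | (index["collar"] & index["leash"])

print(cat_and_dog)  # {2, 3}
print(cat_or_dog)   # {1, 2, 3, 4}
print(combined)     # {2, 3, 5}
```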
2003.10.30 - SLIDE 5 IS 202 – FALL 2003
Boolean Logic
[Venn diagrams: shaded regions for combinations of the sets A and B (A AND B, A OR B, and their complements).]
De Morgan’s Laws:
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
2003.10.30 - SLIDE 6 IS 202 – FALL 2003
Boolean Logic
[Venn diagram: three terms t1, t2, t3 divide the document space into eight minterm regions m1–m8; documents D1–D11 fall into these regions.]
Each minterm is one of the eight possible conjunctions of t1, t2, t3 and their negations, from m1 = NOT t1 AND NOT t2 AND NOT t3 through m8 = t1 AND t2 AND t3.
2003.10.30 - SLIDE 7 IS 202 – FALL 2003
Boolean Systems
• Most of the commercial database search systems that pre-date the WWW are based on Boolean search
  – Dialog, Lexis-Nexis, etc.
• Most Online Library Catalogs are Boolean systems
  – E.g., MELVYL
• Database systems use Boolean logic for searching
• Many of the search engines sold for intranet search of web sites are Boolean
2003.10.30 - SLIDE 8 IS 202 – FALL 2003
Content Analysis
• Automated transformation of raw text into a form that represents some aspect(s) of its meaning
• Including, but not limited to:
  – Automated Thesaurus Generation
  – Phrase Detection
  – Categorization
  – Clustering
  – Summarization
2003.10.30 - SLIDE 9 IS 202 – FALL 2003
Techniques for Content Analysis
• Statistical
  – Single Document
  – Full Collection
• Linguistic
  – Syntactic
  – Semantic
  – Pragmatic
• Knowledge-Based (Artificial Intelligence)
• Hybrid (Combinations)
2003.10.30 - SLIDE 10 IS 202 – FALL 2003
Text Processing
• Standard steps:
  – Recognize document structure
    • Titles, sections, paragraphs, etc.
  – Break into tokens
    • Usually space and punctuation delineated
    • Special issues with Asian languages
  – Stemming/morphological analysis
  – Store in inverted index (to be discussed later)
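The tokenization step can be sketched in a few lines; this is a simplistic space-and-punctuation splitter (the regex and sample sentence are illustrative, not the course's actual code):

```python
import re

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Statistical Properties of Text: the Zipf distribution."))
# ['statistical', 'properties', 'of', 'text', 'the', 'zipf', 'distribution']
```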
2003.10.30 - SLIDE 11 IS 202 – FALL 2003
Techniques for Content Analysis
• Statistical
  – Single Document
  – Full Collection
• Linguistic
  – Syntactic
  – Semantic
  – Pragmatic
• Knowledge-Based (Artificial Intelligence)
• Hybrid (Combinations)
2003.10.30 - SLIDE 12
Document Processing Steps
From “Modern IR” Textbook
2003.10.30 - SLIDE 13 IS 202 – FALL 2003
Errors Generated by Porter Stemmer
Too Aggressive            Too Timid
organization / organ      european / europe
policy / police           cylinder / cylindrical
execute / executive       create / creation
arm / army                search / searcher
From Krovetz ‘93
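To see how a rule-based stemmer can be "too aggressive", here is a deliberately crude suffix stripper (NOT the real Porter algorithm; the suffix list is invented for illustration). Like Porter, it conflates the unrelated words "policy" and "police":

```python
# A deliberately crude suffix stripper -- not the actual Porter algorithm.
# The suffix list below is invented for illustration.
SUFFIXES = ["ization", "ational", "ation", "izer", "y", "e"]

def crude_stem(word):
    for suffix in SUFFIXES:
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("organization"))  # organ  (too aggressive, like Porter)
print(crude_stem("policy"))        # polic
print(crude_stem("police"))        # polic  (conflated with "policy")
```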
2003.10.30 - SLIDE 14 IS 202 – FALL 2003
Lecture Overview
• Review
  – Boolean Searching
  – Content Analysis
• Statistical Properties of Text
  – Zipf Distribution
  – Statistical Dependence
• Indexing and Inverted Files
• Vector Representation
• Term Weights
• Vector Matching
Credit for some of the slides in this lecture goes to Marti Hearst
(see http://elib.cs.berkeley.edu/docfreq/docfreq.html)
2003.10.30 - SLIDE 21 IS 202 – FALL 2003
Word Frequency vs. Resolving Power
The most frequent words are not the most descriptive
(from van Rijsbergen 79)
2003.10.30 - SLIDE 22 IS 202 – FALL 2003
Statistical Independence
• Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together
P(x) · P(y) = P(x, y)
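The independence test P(x)·P(y) = P(x, y) is easy to check numerically. A sketch on a tiny invented collection (the documents and words are hypothetical):

```python
# Estimate word-occurrence probabilities from a toy document collection
# (invented data, for illustration only).
docs = [
    {"cat", "dog"}, {"cat"}, {"dog"}, {"cat", "dog"},
    {"fish"}, {"cat", "fish"}, {"dog"}, {"cat", "dog"},
]
N = len(docs)
p_cat = sum("cat" in d for d in docs) / N             # 5/8
p_dog = sum("dog" in d for d in docs) / N             # 5/8
p_both = sum({"cat", "dog"} <= d for d in docs) / N   # 3/8

# Independent iff P(cat) * P(dog) equals P(cat, dog):
print(p_cat * p_dog, p_both)  # 0.390625 0.375 -> not independent
```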
2003.10.30 - SLIDE 23 IS 202 – FALL 2003
Lexical Associations
• Subjects write first word that comes to mind
  – doctor/nurse; black/white (Palermo & Jenkins 64)
• Text corpora can yield similar associations
• One measure: Mutual Information (Church and Hanks 89)
• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)

I(x, y) = log2( P(x, y) / (P(x) · P(y)) )
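Mutual information can be estimated directly from corpus counts, with P approximated as f/N. Plugging in the "Honorary"/"Doctor" counts from the next slide gives a value near the published 11.3 (Church and Hanks count co-occurrence within a word window, so their figure differs slightly from this naive estimate):

```python
import math

def mutual_information(f_xy, f_x, f_y, N):
    """I(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with P estimated as f / N."""
    return math.log2((f_xy / N) / ((f_x / N) * (f_y / N)))

# Counts for "Honorary" (111), "Doctor" (621), co-occurring 12 times,
# in an AP corpus of N = 15 million words:
print(mutual_information(12, 111, 621, 15_000_000))  # ~11.35
```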
2003.10.30 - SLIDE 24 IS 202 – FALL 2003
Interesting Associations with “Doctor”
I(x,y)  f(x,y)  f(x)   x         f(y)   y
11.3    12      111    Honorary  621    Doctor
11.3    8       1105   Doctors   44     Dentists
10.7    30      1105   Doctors   241    Nurses
9.4     8       1105   Doctors   154    Treating
9.0     6       275    Examined  621    Doctor
8.9     11      1105   Doctors   317    Treat
8.7     25      621    Doctor    1407   Bills
AP Corpus, N=15 million, Church & Hanks 89
2003.10.30 - SLIDE 25 IS 202 – FALL 2003
Un-Interesting Associations with “Doctor”
I(x,y)  f(x,y)  f(x)    x       f(y)   y
0.96    6       621     doctor  73785  with
0.95    41      284690  a       1105   doctors
0.93    12      84716   is      1105   doctors
These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun
AP Corpus, N=15 million, Church & Hanks 89
2003.10.30 - SLIDE 26 IS 202 – FALL 2003
Content Analysis Summary
• Content Analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
  – Word frequencies have a Zipf distribution
  – Word co-occurrences exhibit dependencies
2003.10.30 - SLIDE 27 IS 202 – FALL 2003
Lecture Overview
• Review
  – Boolean Searching
  – Content Analysis
• Statistical Properties of Text
  – Zipf Distribution
  – Statistical Dependence
• Indexing and Inverted Files
• Vector Representation
• Term Weights
• Vector Matching
Credit for some of the slides in this lecture goes to Marti Hearst
2003.10.30 - SLIDE 28 IS 202 – FALL 2003
Inverted Indexes
• We have seen “Vector files” conceptually
  – An Inverted File is a vector file “inverted” so that rows become columns and columns become rows
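A minimal sketch of building an inverted file in Python: each term maps to a postings list of (document id, term frequency) pairs (the two sample documents are invented for illustration):

```python
from collections import Counter, defaultdict

# Toy collection (invented for illustration).
docs = {
    1: "nova galaxy heat nova",
    2: "film role film",
}

# Inverted file: term -> postings list of (doc_id, term_frequency).
inverted = defaultdict(list)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        inverted[term].append((doc_id, tf))

print(inverted["nova"])  # [(1, 2)]
print(inverted["film"])  # [(2, 2)]
```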
• Statistical Properties of Text
  – Zipf Distribution
  – Statistical Dependence
• Indexing and Inverted Files
• Vector Representation
• Term Weights
• Vector Matching
Credit for some of the slides in this lecture goes to Marti Hearst
2003.10.30 - SLIDE 33 IS 202 – FALL 2003
Document Vectors
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
  – A vector is like an array of floating-point numbers
  – Has direction and magnitude
  – Each vector holds a place for every term in the collection
  – Therefore, most vectors are sparse
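"Bag of words" vectors can be sketched directly: one slot per vocabulary term, zero for absent terms, which is why most entries are zero in a large collection (the two sample texts are invented for illustration):

```python
from collections import Counter

# Toy collection (invented for illustration).
collection = {
    "A": "nova galaxy heat nova",
    "B": "galaxy galaxy nova",
}
# One vector position per term in the whole collection's vocabulary.
vocabulary = sorted({t for text in collection.values() for t in text.split()})

def to_vector(text):
    counts = Counter(text.split())
    return [counts.get(term, 0) for term in vocabulary]  # 0 if term absent

print(vocabulary)                  # ['galaxy', 'heat', 'nova']
print(to_vector(collection["A"]))  # [1, 1, 2]
print(to_vector(collection["B"]))  # [2, 0, 1]
```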
2003.10.30 - SLIDE 34 IS 202 – FALL 2003
Vector Space Model
• Documents are represented as vectors in term space
  – Terms are usually stems
  – Documents represented by binary or weighted vectors of terms
• Queries represented the same as documents
• Query and document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents
2003.10.30 - SLIDE 35 IS 202 – FALL 2003
Vector Representation
• Documents and Queries are represented as vectors
• Position 1 corresponds to term 1, position 2 to term 2, position t to term t
• The weight of the term is stored in each position

D_i = (w_{di1}, w_{di2}, ..., w_{dit})
Q = (w_{q1}, w_{q2}, ..., w_{qt})
w = 0 if a term is absent
2003.10.30 - SLIDE 36 IS 202 – FALL 2003
Document Vectors
ID  nova  galaxy  heat  h'wood  film  role  diet  fur
A   10    5       3
B   5     10
C                       10      8     7
D                       9       10    5
E                                           10    10
F                                           9     10
G   5     7                           9
H         6       10                  2     8
I                       7       5           1     3
“Nova” occurs 10 times in text A
“Galaxy” occurs 5 times in text A
“Heat” occurs 3 times in text A
(Blank means 0 occurrences.)
2003.10.30 - SLIDE 37 IS 202 – FALL 2003
Document Vectors
ID  nova  galaxy  heat  h'wood  film  role  diet  fur
A   10    5       3
B   5     10
C                       10      8     7
D                       9       10    5
E                                           10    10
F                                           9     10
G   5     7                           9
H         6       10                  2     8
I                       7       5           1     3
“Hollywood” occurs 7 times in text I
“Film” occurs 5 times in text I
“Diet” occurs 1 time in text I
“Fur” occurs 3 times in text I
2003.10.30 - SLIDE 38 IS 202 – FALL 2003
Document Vectors
ID  nova  galaxy  heat  h'wood  film  role  diet  fur
A   10    5       3
B   5     10
C                       10      8     7
D                       9       10    5
E                                           10    10
F                                           9     10
G   5     7                           9
H         6       10                  2     8
I                       7       5           1     3
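Using the raw counts in this table as term weights, the similarity of two rows can be computed with the cosine measure introduced later in the lecture; for texts A and B (which share the terms nova and galaxy):

```python
import math

# Raw term counts for texts A and B over (nova, galaxy, heat);
# blank cells in the table are zeros.
A = [10, 5, 3]
B = [5, 10, 0]

dot = sum(a * b for a, b in zip(A, B))
norm = math.sqrt(sum(a * a for a in A)) * math.sqrt(sum(b * b for b in B))
print(round(dot / norm, 2))  # 0.77
```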
2003.10.30 - SLIDE 39 IS 202 – FALL 2003
We Can Plot the Vectors
[2D plot with axes “Star” and “Diet”, showing a doc about astronomy, a doc about movie stars, and a doc about mammal behavior at different positions.]
2003.10.30 - SLIDE 40 IS 202 – FALL 2003
Documents in 3D Space
Primary assumption of the Vector Space Model: Documents that are “close together” in space are similar in meaning
• tf*idf measure:
  – Term frequency (tf)
  – Inverse document frequency (idf)
• A way to deal with some of the problems of the Zipf distribution
• Goal: Assign a tf*idf weight to each term in each document
2003.10.30 - SLIDE 48 IS 202 – FALL 2003
tf*idf

w_ik = tf_ik · log(N / n_k)

where:
T_k = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C
N = total number of documents in the collection C
n_k = the number of documents in C that contain T_k
idf_k = log(N / n_k)
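A direct transcription of the tf*idf weight w_ik = tf_ik · log(N/n_k) in Python (the slides do not fix the log base; base 10 is assumed here, consistent with the IDF examples on the next slide):

```python
import math

def tfidf(tf_ik, N, n_k):
    """w_ik = tf_ik * log10(N / n_k)."""
    return tf_ik * math.log10(N / n_k)

# A term occurring 3 times in a document, and in 20 of 10000 documents:
print(round(tfidf(3, 10000, 20), 2))  # 8.1
# A term occurring in every document gets zero weight:
print(tfidf(5, 10000, 10000))         # 0.0
```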
2003.10.30 - SLIDE 49 IS 202 – FALL 2003
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
For a collection of 10000 documents (N = 10000):
log(10000 / 10000) = 0
log(10000 / 5000) = 0.301
log(10000 / 20) = 2.698
log(10000 / 1) = 4
2003.10.30 - SLIDE 50 IS 202 – FALL 2003
Lecture Overview
• Review
  – Boolean Searching
  – Content Analysis
• Statistical Properties of Text
  – Zipf Distribution
  – Statistical Dependence
• Indexing and Inverted Files
• Vector Representation
• Term Weights
• Vector Matching
Credit for some of the slides in this lecture goes to Marti Hearst
2003.10.30 - SLIDE 51 IS 202 – FALL 2003
Similarity Measures
Simple matching (coordination level match):  |Q ∩ D|
Dice’s Coefficient:  2|Q ∩ D| / (|Q| + |D|)
Jaccard’s Coefficient:  |Q ∩ D| / |Q ∪ D|
Cosine Coefficient:  |Q ∩ D| / (|Q|^(1/2) · |D|^(1/2))
Overlap Coefficient:  |Q ∩ D| / min(|Q|, |D|)
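For binary (unweighted) term sets, these coefficients are one-liners in Python; a sketch with invented query and document term sets:

```python
# Set-based similarity coefficients for binary term vectors.
def dice(Q, D):    return 2 * len(Q & D) / (len(Q) + len(D))
def jaccard(Q, D): return len(Q & D) / len(Q | D)
def cosine(Q, D):  return len(Q & D) / (len(Q) ** 0.5 * len(D) ** 0.5)
def overlap(Q, D): return len(Q & D) / min(len(Q), len(D))

# Invented example sets:
Q = {"cat", "dog", "collar"}
D = {"dog", "collar", "leash", "bone"}
print(jaccard(Q, D))  # 2 shared terms out of 5 total -> 0.4
print(overlap(Q, D))  # 2 / min(3, 4)
```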
2003.10.30 - SLIDE 52 IS 202 – FALL 2003
tf*idf Normalization
• Normalize the term weights (so longer vectors are not unfairly given more weight)
  – Normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive

w_ik = [ tf_ik · log(N / n_k) ] / sqrt( Σ_{k=1..t} (tf_ik)² · [log(N / n_k)]² )
2003.10.30 - SLIDE 53 IS 202 – FALL 2003
Vector Space Similarity
• Now, the similarity of two documents is:

sim(D_i, D_j) = Σ_{k=1..t} w_ik · w_jk

• This is also called the cosine, or normalized inner product
  – The normalization was done when weighting the terms
2003.10.30 - SLIDE 54 IS 202 – FALL 2003
Vector Space Similarity Measure
• Combine tf and idf into a similarity measure

D_i = (w_{di1}, w_{di2}, ..., w_{dit})
Q = (w_{q1}, w_{q2}, ..., w_{qt})
w = 0 if a term is absent

If term weights are normalized:
sim(Q, D_i) = Σ_{j=1..t} w_{qj} · w_{dij}

Otherwise normalization and the similarity comparison are combined:
sim(Q, D_i) = Σ_{j=1..t} w_{qj} · w_{dij} / sqrt( Σ_{j=1..t} (w_{qj})² · Σ_{j=1..t} (w_{dij})² )
2003.10.30 - SLIDE 55 IS 202 – FALL 2003
Computing Similarity Scores
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
Q = (0.4, 0.8)
cos α1 = 0.74
cos α2 = 0.98
[2D plot with axes from 0.2 to 1.0 showing the vectors Q, D1, and D2; α1 is the angle between Q and D1, α2 the angle between Q and D2.]
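The two scores on this slide can be reproduced with the cosine formula from the previous slide; the computed values come out to about 0.73 and 0.98, agreeing with the slide's rounded figures:

```python
import math

def cos_sim(u, v):
    """Cosine of the angle between two 2D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

print(round(cos_sim(Q, D1), 2))  # 0.73 (the slide shows 0.74)
print(round(cos_sim(Q, D2), 2))  # 0.98
```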
2003.10.30 - SLIDE 56 IS 202 – FALL 2003
What’s Cosine Anyway?
“One of the basic trigonometric functions encountered in trigonometry. Let theta be an angle measured counterclockwise from the x-axis along the arc of the unit circle. Then cos(theta) is the horizontal coordinate of the arc endpoint. As a result of this definition, the cosine function is periodic with period 2pi.”
Clustering and re-clustering is entirely automated
2003.10.30 - SLIDE 70 IS 202 – FALL 2003
Clustering Result Sets
• Advantages:
  – See some main themes
• Disadvantage:
  – Many ways documents could group together are hidden
• Thinking point: What is the relationship to classification systems and facets?
2003.10.30 - SLIDE 71 IS 202 – FALL 2003
Dan Perkel on Cooper
• Are the problems that Cooper lays out the most pressing ones that web users face today? If not, what are some more pressing problems? Who are Cooper’s users?
• Regardless of answer to previous question, how adequate are his solutions? Where are strengths and weaknesses?
2003.10.30 - SLIDE 72 IS 202 – FALL 2003
Simon King on Hearst
• Prof. Hearst mentions "an algorithm called TextTiling that automatically splits long documents into multi-paragraph subtopical units." Sounds nice, but what if the termsets/concepts you're searching on just happen to appear on opposite sides of one of the boundaries that TextTiling created? In plans for future work she mentions using an inverse distance measure rather than a fixed proximity constraint. This is good unless your search terms appear at the end of one section of a document and the beginning of the next (they're not separated by many words, but may not be related within the document.) Is one of these approaches clearly better than the other?
2003.10.30 - SLIDE 73 IS 202 – FALL 2003
Simon King on Hearst
• Is there any reason that the optimal query size for Hearst's queries seems to be two or three concepts? Is this due to the way we write and think -- can't discuss more than a couple ideas at a time? Or is there some other reason?
2003.10.30 - SLIDE 74 IS 202 – FALL 2003
Sean Savage on MIR 7
• Considering these trends:
  – the proliferation of networked, mobile devices used at the front end in everyday information retrieval scenarios, and
  – the increase in cheap processing power and memory on the back end;
• And considering these facts:
  – mobile devices possess very limited input and output capabilities compared to those of desktop machines; and
  – most usage scenarios beyond the desktop involve significant constraints on the amount of time and attention that users can devote to these devices.
2003.10.30 - SLIDE 75 IS 202 – FALL 2003
Sean Savage on MIR 7
• Should we now focus most of our development resources in the realm of large-scale text transformations on improving the quality of search results (i.e., striving to improve precision and recall by pre-processing text, and by using categorization hierarchies at the front end to guide users in focusing queries), as opposed to directing those resources towards even more effective compression techniques to reduce query response times?
2003.10.30 - SLIDE 76 IS 202 – FALL 2003
Sean Savage on MIR 7
• Given the trends and facts above, should those in the IR community who work on text compression now focus on compressing small batches of text to be transmitted more efficiently across wireless networks, rather than on compressing the gigantic collections residing in databases, which this chapter seems to chiefly address?
2003.10.30 - SLIDE 77 IS 202 – FALL 2003
Next Time
• Avi Rappoport of Searchtools.com on “Implementing Web Site Search Engines”
  – For discussion, please prepare by looking at some web sites with search capabilities (but NOT Ebay, Amazon, Google, Yahoo, or AllTheWeb) and find one that you like and one that you don’t
• Ray will be away from Tuesday to Friday next week; Marc will be in town, but at a conference all next week