Content Analysis and Statistical Properties of Text
Ray Larson & Marti Hearst
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
9/11/2000
Today
• Overview of Content Analysis
• Text Representation
• Statistical Characteristics of Text Collections
• Zipf distribution
• Statistical dependence
Content Analysis
• Automated transformation of raw text into a form that represents some aspect(s) of its meaning
• Including, but not limited to:
Zoom in on the Knee of the Curve (rank, frequency, stemmed term):

43 6 approach
44 5 work
45 5 variabl
46 5 theori
47 5 specif
48 5 softwar
49 5 requir
50 5 potenti
51 5 method
52 5 mean
53 5 inher
54 5 data
55 5 commit
56 5 applic
57 4 tool
58 4 technolog
59 4 techniqu
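The list above pairs ranks and frequencies with stemmed terms (e.g. "variabl", "theori"). A minimal sketch of producing such a ranked list, using a crude illustrative suffix-stripper rather than a real stemmer such as Porter's:

```python
from collections import Counter

def crude_stem(word):
    """A very crude suffix-stripping stemmer, illustrative only;
    real systems use e.g. the Porter stemmer."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_frequencies(text):
    """Stem every token, then return (rank, frequency, stem) triples
    ordered by descending frequency, like the list above."""
    stems = [crude_stem(w) for w in text.lower().split()]
    ranked = Counter(stems).most_common()
    return [(rank, freq, stem) for rank, (stem, freq) in enumerate(ranked, 1)]
```

For example, `stem_frequencies("work works working theory")` groups the three "work" variants under one stem.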
Zipf Distribution
• The Important Points:
  – a few elements occur very frequently
  – a medium number of elements have medium frequency
  – many elements occur very infrequently
Zipf Distribution
• The product of the frequency of words (f) and their rank (r) is approximately constant
  – Rank = order of words' frequency of occurrence
• Another way to state this is with an approximately correct rule of thumb:
  – Say the most common term occurs C times
  – The second most common occurs C/2 times
  – The third most common occurs C/3 times
  – …

  f = C × (1/r), where C ≈ N/10 and N is the number of words in the collection
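The rule of thumb above can be checked on any tokenized text by multiplying each word's rank by its frequency; if the distribution is Zipfian, the products are roughly constant. A minimal sketch (the toy corpus is illustrative):

```python
from collections import Counter

def zipf_check(text):
    """Rank words by frequency and report rank * frequency,
    which the rule of thumb predicts is roughly constant."""
    counts = Counter(text.lower().split())
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

# Toy corpus: "the" is the most common term, occurring C = 4 times;
# the rule of thumb predicts the rank-r word occurs about C / r times.
toy = "the cat sat on the mat the cat ran the dog sat"
for rank, word, freq, product in zipf_check(toy):
    print(rank, word, freq, product)
```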
Zipf Distribution (linear and log scale)
[Figure: the same rank-frequency data plotted on linear and log-log axes.]
What Kinds of Data Exhibit a Zipf Distribution?
• Words in a text collection
  – Virtually any language usage
• Library book checkout patterns
• Incoming Web page requests (Nielsen)
• Outgoing Web page requests (Cunha & Crovella)
• Document size on the Web (Cunha & Crovella)
Related Distributions / "Laws"
• Bradford’s Law of Scattering
• Lotka’s Law of Productivity
• De Solla Price’s Urn Model for “Cumulative Advantage Processes”
Consequences of Zipf
• There are always a few very frequent tokens that are not good discriminators.
  – Called "stop words" in IR
  – Usually correspond to the linguistic notion of "closed-class" words
    • English examples: to, from, on, and, the, ...
    • Grammatical classes that don't take on new members.
• There are always a large number of tokens that occur once and can mess up algorithms.
• Medium-frequency words are the most descriptive.
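A minimal sketch of filtering such stop words before indexing (the stop list here is illustrative, not a standard IR list):

```python
# Illustrative stop list of closed-class English words; real systems
# use curated lists of a few hundred entries.
STOP_WORDS = {"to", "from", "on", "and", "the", "a", "of", "in"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the cat sat on the mat".split()))
# → ['cat', 'sat', 'mat']
```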
Word Frequency vs. Resolving Power (from van Rijsbergen 79)
The most frequent words are not the most descriptive.
Statistical Independence vs. Statistical Dependence
• How likely is a red car to drive by, given we've seen a black one?
• How likely is the word "ambulance" to appear, given that we've seen "car accident"?
• Colors of cars driving by are independent (although more frequent colors are more likely)
• Words in text are not independent (although again more frequent words are more likely)
Statistical Independence
Two events x and y are statistically independent if the product of their probabilities of happening individually equals their probability of happening together:

P(x) P(y) = P(x, y)
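A quick simulation illustrating the definition: for two independent fair coin flips, the product of the marginal probabilities matches the joint probability (the simulation itself is illustrative):

```python
import random

# Estimate P(x), P(y), and P(x, y) for two independent events:
# two independent fair coin flips per trial.
random.seed(0)
n = 100_000
flips = [(random.random() < 0.5, random.random() < 0.5) for _ in range(n)]

p_x = sum(a for a, _ in flips) / n   # P(first flip is heads)
p_y = sum(b for _, b in flips) / n   # P(second flip is heads)
p_xy = sum(a and b for a, b in flips) / n  # P(both heads)

print(p_x * p_y, p_xy)  # for independent events these agree closely
```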
Statistical Independence and Dependence
• What are examples of things that are statistically independent?
• What are examples of things that are statistically dependent?
Lexical Associations
• Subjects write first word that comes to mind– doctor/nurse; black/white (Palermo & Jenkins 64)
• Text corpora yield similar associations
• One measure: Mutual Information (Church and Hanks 89):

  I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]

• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
Statistical Independence
• Compute for a window of words:

  w1        w11       w21
  a b c d e f g h i j k l m n o p

  P(x) P(y) = P(x, y) if independent
  P(x) = f(x) / N

  We'll approximate P(x, y) as follows:

  P(x, y) = f_w(x, y) / N

  f_w(x, y) = (1/|w|) Σ_{i=1}^{N} f_{w_i}(x, y)

  where
    |w|          = length of window (say 5)
    w_i          = words within window starting at position i
    f_{w_i}(x,y) = number of times x and y co-occur in w_i
    N            = number of words in the collection
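The windowed estimate above can be sketched as follows; this is an illustrative implementation under simplified assumptions (co-occurrence counted once per window), not Church and Hanks' exact procedure:

```python
import math
from collections import Counter

def mutual_information(tokens, x, y, window=5):
    """Estimate I(x, y) = log2( P(x, y) / (P(x) P(y)) ), approximating
    P(x, y) by the fraction of sliding windows containing both words."""
    n = len(tokens)
    f = Counter(tokens)
    # Count windows of the given length in which x and y co-occur.
    co = 0
    for i in range(n - window + 1):
        w = tokens[i:i + window]
        if x in w and y in w:
            co += 1
    if co == 0 or f[x] == 0 or f[y] == 0:
        return float("-inf")  # never co-occur (or never occur)
    p_x = f[x] / n
    p_y = f[y] / n
    p_xy = co / n
    return math.log2(p_xy / (p_x * p_y))
```

Words that co-occur more often than chance predicts get a positive score, echoing the "doctor"/"nurses" associations on the next slide.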
Interesting Associations with "Doctor" (AP Corpus, N = 15 million, Church & Hanks 89)

I(x,y)  f(x,y)  f(x)   x         f(y)   y
11.3    12      111    Honorary  621    Doctor
11.3    8       1105   Doctors   44     Dentists
10.7    30      1105   Doctors   241    Nurses
9.4     8       1105   Doctors   154    Treating
9.0     6       275    Examined  621    Doctor
8.9     11      1105   Doctors   317    Treat
8.7     25      621    Doctor    1407   Bills
Un-Interesting Associations with "Doctor" (AP Corpus, N = 15 million, Church & Hanks 89)

I(x,y)  f(x,y)  f(x)     x       f(y)    y
0.96    6       621      doctor  73785   with
0.95    41      284690   a       1105    doctors
0.93    12      84716    is      1105    doctors

These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.
Document Vectors
• Documents are represented as "bags of words"
• Represented as vectors when used computationally
  – A vector is like an array of floating point numbers
  – Has direction and magnitude
  – Each vector holds a place for every term in the collection
  – Therefore, most vectors are sparse
9/11/2000 Information Organization and Retrieval
Document Vectors — one location for each word.
(Terms: nova, galaxy, heat, h'wood, film, role, diet, fur; blank means 0 occurrences.)

A: 10 5 3
B: 5 10
C: 10 8 7
D: 9 10 5
E: 10 10
F: 9 10
G: 5 7 9
H: 6 10 2 8
I: 7 5 1 3

"Nova" occurs 10 times in text A; "Galaxy" occurs 5 times in text A; "Heat" occurs 3 times in text A.
Document Vectors — one location for each word (same table as above).

"Hollywood" occurs 7 times in text I; "Film" occurs 5 times in text I; "Diet" occurs 1 time in text I; "Fur" occurs 3 times in text I.
Document Vectors — one location for each word (same table as above); the row labels A-I are document ids.
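A minimal sketch of building such term-frequency vectors over a shared vocabulary (the toy documents are illustrative, echoing the slide's terms):

```python
def build_vectors(docs):
    """Return (vocabulary, term-frequency vectors), one vector per
    document, with one position per term in the collection.
    Positions a document never uses stay 0.0, so vectors are sparse."""
    vocab = sorted({t for doc in docs for t in doc.split()})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0.0] * len(vocab)
        for t in doc.split():
            v[index[t]] += 1.0
        vectors.append(v)
    return vocab, vectors

docs = ["nova nova galaxy heat", "hollywood film film role"]
vocab, vecs = build_vectors(docs)
```

A dense list is used here for clarity; a real engine would use a sparse representation since most entries are zero.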
We Can Plot the Vectors
[Figure: documents plotted on "Star" and "Diet" axes: a doc about astronomy, a doc about movie stars, and a doc about mammal behavior.]
Documents in 3D Space
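Once documents are vectors in a shared space, geometric closeness can stand in for topical similarity; one common measure (not named on these slides) is the cosine of the angle between vectors. A minimal sketch with toy 2-D vectors on the "star"/"diet" axes from the plot above:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means the same
    direction, 0.0 means orthogonal (no shared terms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Illustrative counts on the (star, diet) axes.
astronomy   = [10.0, 0.0]  # many "star", no "diet"
movie_stars = [8.0, 1.0]
mammals     = [0.0, 9.0]   # about diet, not stars

print(cosine(astronomy, movie_stars), cosine(astronomy, mammals))
```

As the plot suggests, the astronomy document points in nearly the same direction as the movie-stars document but is orthogonal to the mammal-behavior one.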
Content Analysis Summary
• Content Analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
  – Word frequencies have a Zipf distribution
  – Word co-occurrences exhibit dependencies
• Text documents are transformed to vectors
  – Pre-processing includes tokenization, stemming,