Statistical Methods for Corpus Analysis Xiaofei Lu APLNG 596D July 14, 2009.
Statistical Methods for Corpus Analysis
Xiaofei Lu
APLNG 596D
July 14, 2009
2
Overview
Describing data Comparing groups Describing relationships
3
Basic concepts
Probability experiments – jargon Experiment: a situation for which the outcomes
occur randomly Sample space (Ω): the set of all possible outcomes Outcome (w): a point in the sample space Event: a subset of the sample space
4
Example 1
Experiment: toss a fair die 6 outcomes: 1,2,3,4,5,6 Sample space = Ω = {1,2,3,4,5,6} An event is any subset of the sample space
“an even number is rolled”: A = {2,4,6}, P(A) = 1/2 “ a 3 is rolled”: B = {3}, P(B) = 1/6
5
Example 2
Experiment: toss 2 fair dice Outcomes: ordered pairs (x, y); x and y are
results of the 1st and 2nd toss respectively Sample space = Ω = set of such ordered pairs =
{(x,y)|x = 1, 2,…, or 6 and y = 1,2,…, or 6} An event is any subset of Ω, e.g., “sum is 7”
A = {(x,y)|x+y=7}
= {(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)}
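The two-dice sample space is small enough to check by brute-force enumeration; a minimal Python sketch (not part of the original slides):

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair dice: all ordered pairs (x, y).
omega = list(product(range(1, 7), repeat=2))
assert len(omega) == 36

# The event "sum is 7" is the subset of outcomes whose coordinates add to 7.
A = [(x, y) for (x, y) in omega if x + y == 7]
p_A = Fraction(len(A), len(omega))
print(sorted(A))  # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
print(p_A)        # 1/6
```

Using `Fraction` keeps the probability exact rather than a rounded float.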
6
More jargon
A=“sum is 7”; B=“first toss is an odd number” Union of two events
The event C that either A or B occurs or both occur Intersection of two events
The event C that both A and B occur, C=A∩B Complement of an event
The event that A does not occur Disjoint events
Two events with no common outcome
7
Independence
Two events A and B are independent if knowing that one has occurred gives no information about whether the other has occurred
P(A∩B) = P(A)P(B) Outcomes of two successive tosses of an unbiased coin
P(2 heads)=P(A=1∩B=1)=P(A=1)×P(B=1)=1/4
8
Random variable
Random variable (X) Essentially a random number Formally a function from Ω to the real numbers
Discrete random variable A random variable that can take on only a finite or a
countably infinite number of possible values
9
Example 3
Experiment: toss a biased coin 3 times Bias: P(Heads) = 0.6 Ω = {hhh, hht, htt, hth, ttt, tth, thh, tht} X = total number of heads in the 3 tosses X is a r.v., a function from Ω to the real
numbers with possible values (x) 0, 1, 2, 3
10
Example 3 (cont.)
P(x=0)=0.064 P(x=1)=0.288 P(x=2)=0.432 P(x=3)=0.216
w    X(w)  P{w}
HHH  3     (0.6)³
HHT  2     (0.6)²(0.4)
HTH  2     (0.6)²(0.4)
HTT  1     (0.6)(0.4)²
THH  2     (0.6)²(0.4)
THT  1     (0.6)(0.4)²
TTH  1     (0.6)(0.4)²
TTT  0     (0.4)³
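The probability mass function in Example 3 can be reproduced by enumerating the eight outcomes and multiplying per-toss probabilities (independence); a short sketch, not from the slides:

```python
from itertools import product

# Three tosses of a coin biased with P(heads) = 0.6.
p_h = 0.6
pmf = {x: 0.0 for x in range(4)}
for w in product("HT", repeat=3):
    p = 1.0
    for toss in w:                # independent tosses: probabilities multiply
        p *= p_h if toss == "H" else 1 - p_h
    pmf[w.count("H")] += p        # X(w) = number of heads in the sequence

for x in range(4):
    print(x, round(pmf[x], 3))   # 0 0.064, 1 0.288, 2 0.432, 3 0.216
```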
11
Random variable (cont.)
Continuous random variable A random variable that can take an uncountably
infinite number of possible values, e.g., height Defined over an interval of values, e.g., (0,2], and
represented by the area under a curve The probability of observing any single value is
equal to 0
12
Probability distribution
Describes the possible values of a random variable and their probabilities
Probability mass function (discrete) Probability density function (continuous)
13
Descriptive vs. inferential statistics
Descriptive statistics Summarize important properties of observed data Measures of central tendency Measures of variability
Inferential statistics The use of statistics to make inferences concerning
some unknown aspect of a population Hypothesis testing
14
Measures of central tendency
The most typical score for a data set The mode
The most frequently obtained score in a data set, (2, 4, 4, 7, 8)
The median Central score in sample with an odd number of
items, (2, 4, 4, 7, 8) Average of two central scores in sample with an
even number of items (2, 4, 4, 7, 8, 100)
15
Measures of central tendency (cont.)
The mean The average of all scores in a data set, (2,4,4,7,8)
Disadvantage of the mean Affected by extreme values (2,4,4,7,100) What is a more suitable measure in such cases?
16
Measures of variability
Statistical dispersion in a r.v. or probability distribution
Range Highest value minus lowest value: (2,4,4,7,8) Affected by extreme scores: (2,4,4,7,100)
Inter-quartile range: difference between The value ¼ of the way from the top, and The value ¼ of the way from the bottom
Semi inter-quartile range: ½ of the IQR
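The central-tendency and dispersion measures above are all in Python's standard `statistics` module; a minimal sketch using the slide's example data (quartile cut points depend on the interpolation method, so the IQR here follows `statistics.quantiles`' default "exclusive" method):

```python
import statistics

data = [2, 4, 4, 7, 8]
print(statistics.mode(data), statistics.median(data), statistics.mean(data))

# Median with an even number of items averages the two central scores.
print(statistics.median([2, 4, 4, 7, 8, 100]))  # 5.5

# Range and (semi) inter-quartile range as simple dispersion measures.
rng = max(data) - min(data)                  # 8 - 2 = 6
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1
semi_iqr = iqr / 2
```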
17
Measures of variability (cont.)
The variance Considers distance of every data item from mean Population variance
Sample variance: (n-1) indicates degree of freedom
18
Measures of variability (cont.)
The standard deviation The most common measure of statistical dispersion Standard deviation of a random variable
Sample standard deviation: N-1 indicates d.o.f.
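The population/sample distinction (dividing by N versus N−1 degrees of freedom) is exactly the difference between `pvariance`/`pstdev` and `variance`/`stdev` in the standard library; a quick check with the earlier example data:

```python
import statistics

data = [2, 4, 4, 7, 8]
mean = statistics.mean(data)                 # 5

pop_var = statistics.pvariance(data)         # divides by N
samp_var = statistics.variance(data)         # divides by N - 1 (d.o.f.)
pop_sd = statistics.pstdev(data)
samp_sd = statistics.stdev(data)

# Both agree with the definitional sums of squared deviations.
assert pop_var == sum((x - mean) ** 2 for x in data) / len(data)
assert samp_var == sum((x - mean) ** 2 for x in data) / (len(data) - 1)
```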
19
Shape of a distribution
Asymmetrical distribution Positively (or right) skewed distribution Negatively (or left) skewed distribution
Symmetrical distribution Normal distribution (single modal peak)
mode=median=mean Assumed by many statistical tests in corpus linguistics
Bimodal distribution
20
Normal distribution
A statistical distribution N(μ, σ) with the following probability density function
Parameters: mean μ and standard deviation σ e is a mathematical constant Density is bell-shaped, symmetric about μ Standard normal distribution: μ=0, σ=1
21
Central limit theorem
The theorem When samples are repeatedly drawn from a
population, the means of the samples will be normally distributed around the population mean
This occurs even if the distribution of the data in the population is not normal
This makes the normal distribution important The distribution of IQ scores
22
Properties of the normal curve
Shape of curve defined by μ and σ Important property
For any normal curve, if we draw a vertical line through it at any number of standard deviations away from the mean, the proportions of the area under the curve are always the same
See here
23
The z score
A measure of how far a given value is from the mean, expressed as a number of s.d.’s
How probable a z score is for any test Measured by proportion of the total area under the
tail of the curve which lies beyond a given z value Consult the z score table
z = (x − μ) / σ
24
Example 4
Mean frequency of there in a 1000-word sample written by a given author is 10, σ = 4
A sample contains 17 occurrences of there z score = (17-10)/4= 1.75 The area beyond the z score of 1.75 is 0.0401, or
4.01% of the total area under the curve The probability of seeing a sample with more than 17
occurrences of there is 4.01% or less
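Instead of consulting a z score table, the upper-tail area can be computed from the error function in Python's `math` module; a sketch reproducing Example 4:

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Example 4: population mean 10, s.d. 4, observed 17 occurrences of 'there'.
z = (17 - 10) / 4                 # 1.75
p = upper_tail(z)
print(round(z, 2), round(p, 4))   # 1.75 0.0401
```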
25
Hypothesis testing
Using descriptive statistics as evidence for or against experimental hypotheses
The null hypothesis H0
There is no difference between the sample value and the population from which it was drawn
The alternative hypothesis H1
There is a significant difference between the sample value and the population from which it was drawn
Goal: to reject H0 with a certain level of significance (e.g., 5%)
26
Hypothesis testing (cont.)
Use of statistical tests Estimates the probability that the claims are wrong Enables us to claim statistical significance for our
results and have confidence in our claims One and two-tailed tests
One-tailed: likely direction of difference known Two-tailed: nature of the difference not specified
If using z-score, proportions in Appendix 1 must be doubled
Comparing Groups
Xiaofei Lu
APLNG 596D
28
Outline
Basic concepts Parametric comparisons of two groups Non-parametric comparisons of two groups Comparisons between three or more groups
29
Basic concepts
Types of scales of measurement Independent and dependent variables Parametric and non-parametric tests Population mean Between-groups and repeated measures design One-sample and two-sample studies
30
Types of scales of measurement
Ratio scale: units on the scale are the same Measurement in meters
Interval scale: the zero point is arbitrary Centigrade scale of temperature
Ordinal scale: records order only Ranks in a contest
Nominal scale: categorical data Part-of-speech categories
31
Independent and dependent variables
Independent variables: what do I change? Dependent variables: what do I observe? Controlled variables: what do I keep the
same?
32
Two examples
Effect of education on income Independent variable: academic degree of the
individual Dependent variable: level of income of the
individual measured in monetary units Effect of sentence complexity on recall
Independent variable: sentence complexity Dependent variable: amount of sentence correctly
recalled
33
Parametric tests
Dependent variables are ratio-/interval- scored Observations should be independent Often assumes normal distribution of data
Mean an appropriate measure of central tendency Standard deviation an appropriate measure of
variability Works with any distribution with parameters
34
Non-parametric tests
Do not assume normal distribution of data Best for small samples with no normal distribution
Work with rank-ordered scales and frequencies
35
Population mean
Sampling distribution of means A distribution made up of group means Describes a symmetric curve Group means within a population closer to each
other than individual scores to group mean Population mean
The average of a group of means
36
Experimental design
Between-groups design Data comes from two different groups
Repeated measures design Data is the result of two or more measures taken
from the same group
37
One-sample and two-sample studies
One-sample studies Compare group mean with population mean Determine whether group mean differs from
population mean Two-sample studies
Compare means from two different groups (experimental and control group)
Determine whether these means differ for reasons other than pure chance
38
Parametric comparison of two groups
The t test for independent samples The matched pairs t test
39
The t test for independent samples
Tests difference between two groups Normally-distributed interval data Mean and standard deviation good measures of
central tendency and variability Especially useful for small samples (N<30)
40
One-sample t test
H0: no significant difference between group mean and population mean
Computing the t statistic (in SPSS)
Standard error of the means s: standard deviation of the sample group n: sample size
t_obs = (x̄ − μ) / (s / √n)
41
Corpus linguistics example
A balanced corpus Mean verbs per sentence: 2.5; s.d. = 1.2
A 100-sentence specialized subcorpus Mean verbs per sentence: 3.5; s.d. = 1.6
t statistic: (3.5-2.5)/(1.6/10)=6.25
42
Corpus linguistics example (cont.)
Consult the t table Two-tailed test (non-directional) Degree of freedom: (n-1) = (100-1) = 99
use next lower value – 90
Significance level: go with 0.05 or 0.01 Critical value: 1.987 (for 0.05) or 2.632 (for 0.01) Observed value of t (6.25) is greater than 2.632 Can reject H0 at the 1 percent significance level
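The one-sample t statistic for this subcorpus example is a one-line computation; a minimal sketch with the slide's figures:

```python
import math

# One-sample t for the specialized subcorpus against the balanced corpus.
pop_mean = 2.5       # balanced corpus: mean verbs per sentence
sample_mean = 3.5    # 100-sentence subcorpus mean
s = 1.6              # subcorpus standard deviation
n = 100

t_obs = (sample_mean - pop_mean) / (s / math.sqrt(n))
print(round(t_obs, 2))  # 6.25
```

Comparing 6.25 against the critical value 2.632 (df 90 row, p=0.01) rejects H0, as on the slide.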
43
Two-sample t test
H0: difference between 2 groups expected for any 2 means in a population due to chance
Show that the difference falls in the extreme left or right tail of the t distribution
Standard error of differences between the mean
t = (x̄_e − x̄_c) / √(s_e²/n_e + s_c²/n_c)
44
Corpus linguistics example
Number of errors of a specific type in each of 15 equal-length essays
Control group: 8 essays produced by students learning by traditional methods
Experimental group: 7 essays produced by students learning by a novel method
45
Corpus linguistics example (cont.)
t=(6-3)/sqrt((2.27*2.27/7)+(2.21*2.21/8))=2.585 Degree of freedom = (8-1)+(7-1)=13 Critical value of t for a two-tailed test at the 5 percent
significance level for 13 d.o.f. is 2.16 Observed t is greater than 2.16; difference is
significant
n Mean Standard deviation
Control 8 6 2.21
Experimental 7 3 2.27
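The independent-samples t statistic for the essay-error example follows directly from the table's summary figures; a sketch, not part of the slides:

```python
import math

# Summary figures from the table: n, mean, standard deviation.
n_c, mean_c, sd_c = 8, 6, 2.21   # control group
n_e, mean_e, sd_e = 7, 3, 2.27   # experimental group

# Standard error of the difference between the two means.
se_diff = math.sqrt(sd_e**2 / n_e + sd_c**2 / n_c)
t = (mean_c - mean_e) / se_diff
print(round(t, 3))  # 2.585
```

Since 2.585 exceeds the critical value 2.16 (two-tailed, 13 d.o.f., p=0.05), the difference is significant.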
46
Some caveats
The matched pairs t test should be used for repeated measures designs (correlated samples)
A non-parametric test should be used if data is very skewed and not normally-distributed
A parametric test for comparing 3 or more groups should be used to cross-compare groups
47
The matched pairs t test
Comparing paired or correlated samples Not independent but closer to each than random
samples A feature observed under 2 different conditions
Same students tested before and after taking class Pairs of subjects matched according to any
characteristic Studying husbands and wives rather than random
samples
48
The matched pairs t test (cont.)
di denotes the difference between the ith pair N denotes the number of pairs of observations
t = Σdi / √[(N·Σdi² − (Σdi)²) / (N − 1)]
49
Corpus linguistics example
Lengths of the vowels produced by 10 speakers in two different consonant environments
t = -2.95; d.o.f. = 9
Critical value of t for a two-tailed test at the 2 percent significance level for 9 d.o.f. is 2.821
|t| = 2.95 exceeds 2.821, so the difference is significant at the 2 percent level
ID E1 E2 d
1 22 26 -4
2 18 22 -4
3 26 27 -1
4 17 15 2
5 19 24 -5
6 23 27 -4
7 15 17 -2
8 16 20 -4
9 19 17 2
10 25 30 -5
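The matched-pairs t statistic can be recomputed from the difference column of the table above; a minimal sketch:

```python
import math

# Differences d = E1 - E2 for the ten speakers in the table.
d = [-4, -4, -1, 2, -5, -4, -2, -4, 2, -5]
N = len(d)
sum_d = sum(d)                   # -25
sum_d2 = sum(x * x for x in d)   # 127

# Matched-pairs t: t = sum(d) / sqrt((N*sum(d^2) - sum(d)^2) / (N - 1))
t = sum_d / math.sqrt((N * sum_d2 - sum_d**2) / (N - 1))
print(round(t, 2))  # -2.95
```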
50
Non-parametric comparisons of two groups
Used in two-sample studies where the assumptions of the t test do not hold
Between-group design (independent samples) The Wilcoxon rank sums test
Repeated measures design (correlated/paired samples) The Wilcoxon matched pairs signed rank test
51
The Wilcoxon rank sums test
Also known as the Mann-Whitney U test Useful for comparing ordinal rating scales
Combine and rank scores for two groups Calculate the sum of ranks in the smaller group (R1) Calculate the sum of ranks in the larger group (R2) U = the smaller of U1 and U2
U1 = N1N2 + N1(N1 + 1)/2 − R1
U2 = N1N2 + N2(N2 + 1)/2 − R2
52
The Wilcoxon rank sums test (cont.)
If N1 ≥ 20 and N2 ≥ 20, can compute z score Let N = N1 + N2
z = (2R1 − N1(N + 1)) / √(N1N2(N + 1)/3)
53
Corpus linguistics example
Questionnaire distributed to 2 student groups Group 1: Computer-taught Group 2: Classroom-taught Question: ‘How hard/useful did you find the task?’ Answer: Likert scale (1-5), 1=very hard; 5=very easy
Data processing Aggregate scores found for each subject Combined scores from 2 groups ranked Average scores given to tied ranks
54
Corpus linguistics example (cont.)
H0: no difference between 2 groups
Calculate level of significance here
Cannot reject H0
G2 Rank G1 Rank
14 1 10 6
12 2.5 10 6
12 2.5 10 6
11 4 8 10.5
9 8.5 7 12
9 8.5 6 13
8 10.5 - -
R2 37.5 R1 53.5
U1 = 6×7 + 6(6 + 1)/2 − 53.5 = 9.5
U2 = 6×7 + 7(7 + 1)/2 − 37.5 = 32.5
U = min(U1, U2) = 9.5
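The whole rank-sums procedure, including average ranks for ties, can be sketched in a few lines of Python using the questionnaire scores from the table (ranking from highest score down, as the table does):

```python
def avg_ranks(values):
    """Rank values from highest to lowest, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

g1 = [10, 10, 10, 8, 7, 6]               # Group 1: computer-taught
g2 = [14, 12, 12, 11, 9, 9, 8]           # Group 2: classroom-taught
ranks = avg_ranks(g1 + g2)
R1 = sum(ranks[: len(g1)])               # 53.5
R2 = sum(ranks[len(g1):])                # 37.5

N1, N2 = len(g1), len(g2)
U1 = N1 * N2 + N1 * (N1 + 1) / 2 - R1    # 9.5
U2 = N1 * N2 + N2 * (N2 + 1) / 2 - R2    # 32.5
U = min(U1, U2)
```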
55
The Wilcoxon matched pairs signed ranks test
Used on interval level of measurement Ranks differences between pairs of observations Considers both direction and degree of difference
Procedure Obtain matched pairs of scores Calculate difference for each pair Rank differences according to absolute magnitude Find the sum of negative and positive ranks
56
The Wilcoxon matched pairs signed ranks test (cont.)
Consult a significance table W = smaller of the sum of negative/positive ranks N = number of pairs with a difference W should be smaller than or equal to critical value
If N ≥ 25, can compute z score
z = (W − N(N + 1)/4) / √(N(N + 1)(2N + 1)/24)
57
Corpus linguistics example
# of errors in translating 2 passages into French
W=6.5 Sum of positive ranks: 6.5 Sum of negative ranks: 38.5
N=9 Critical value is 5 (2-
tailed, p=0.05, N=9) W=6.5>5, H0 holds
Subj A B A-B Rank
1 8 10 -2 -4.5
2 7 6 +1 +2
3 4 4 0 -
4 2 5 -3 -7.5
5 4 7 -3 -7.5
6 10 11 -1 -2
7 17 15 +2 +4.5
8 3 6 -3 -7.5
9 2 3 -1 -2
10 11 14 -3 -7.5
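The signed-ranks bookkeeping for this example (drop zero differences, rank by absolute size with ties averaged, sum positive and negative ranks separately) can be sketched as:

```python
# Differences A - B from the translation-errors table; zero differences are dropped.
d = [-2, 1, 0, -3, -3, -1, 2, -3, -1, -3]
d = [x for x in d if x != 0]
N = len(d)                               # 9 pairs with a difference

# Rank by absolute magnitude, averaging tied ranks.
order = sorted(range(N), key=lambda i: abs(d[i]))
ranks = [0.0] * N
i = 0
while i < N:
    j = i
    while j + 1 < N and abs(d[order[j + 1]]) == abs(d[order[i]]):
        j += 1
    for k in range(i, j + 1):
        ranks[order[k]] = (i + j) / 2 + 1   # average 1-based rank for the tie group
    i = j + 1

pos = sum(r for r, x in zip(ranks, d) if x > 0)   # 6.5
neg = sum(r for r, x in zip(ranks, d) if x < 0)   # 38.5
W = min(pos, neg)
```

W = 6.5 exceeds the critical value 5, so H0 is retained, as on the slide.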
58
Comparisons between three or more groups
Analysis of variance (ANOVA) A method of testing for significant differences
between means of more than 2 samples H0: samples taken from populations with same mean;
no significant difference between samples Samples not from same population if
Between-groups variance significantly greater than within-groups variance
59
ANOVA
Between-groups variance Sum of squared difference between each sample
mean and overall mean weighted by sample size Normalized by degree of freedom (what is it?)
Within-groups variance Sum of squared difference between each score in
each sample and the corresponding sample mean Normalized by degree of freedom (what is it?)
60
ANOVA (cont.)
Consult an ANOVA significance table Degree of freedom in numerator
# of groups -1 Degree of freedom in denominator
# of data items in all groups - # of groups F value smaller than critical value: H0 holds
The F ratio = Between-groups variance / Within-groups variance
61
Corpus linguistics example
# of words 3 poets fit into a heroic couplet
3 samples with 5 couplets each
Overall mean: 240/15 = 16
Bgv = (5×0 + 5×1 + 5×1)/(3-1) = 5
Wgv = [(1+4+1+1+1) + (1+1+1+1+4) + (0+1+4+1+0)]/(15-3) = 1.833
F = 5/1.833 = 2.73
Critical value is 3.89 (Df 2&12, p≤0.10, 2-tailed); H0 holds
S1 S2 S3
1 17 16 17
2 18 14 18
3 15 14 15
4 15 14 18
5 15 17 17
mean 16 15 17
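The between- and within-groups variances for the couplet data can be recomputed directly from the table; a minimal sketch of the one-way ANOVA arithmetic:

```python
# Words per couplet for the three poets (columns S1-S3 of the table).
samples = [
    [17, 18, 15, 15, 15],   # S1, mean 16
    [16, 14, 14, 14, 17],   # S2, mean 15
    [17, 18, 15, 18, 17],   # S3, mean 17
]
k = len(samples)
n_total = sum(len(s) for s in samples)
grand_mean = sum(sum(s) for s in samples) / n_total            # 240/15 = 16

# Between-groups: squared gaps of sample means from the grand mean,
# weighted by sample size, normalized by k - 1 degrees of freedom.
between = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2
              for s in samples) / (k - 1)

# Within-groups: squared gaps of each score from its own sample mean,
# normalized by n_total - k degrees of freedom.
within = sum((x - sum(s) / len(s)) ** 2
             for s in samples for x in s) / (n_total - k)

F = between / within
print(round(F, 2))  # 2.73
```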
Describing Relationships
Xiaofei Lu
APLNG 596D
63
Outline
The chi-square test Correlation
64
The chi-square test
Dealing with nominal data Facts that can be sorted into categories Measured as frequencies
Significant differences between frequencies? Chi-square test: a non-parametric test of
relationship between frequencies Compare observed frequencies with those expected
on the basis of some theoretical model
65
The chi-square test (cont.)
Observed value (O) and expected value (E) O: Actual frequency in a cell E: Expected frequency in a cell
Computing the chi-square
X² = Σ (O − E)² / E
66
General caveats
Use of chi-square test is inappropriate if Any expected frequency is below 1; or E<5 in more than 20% of the cells
Yates’ correction factor Applicable if df = 1 If O>E, O=O-0.5; if O<E, O=O+0.5
67
One-way design
Compare relation of frequencies for one variable Df = (# of cells) - 1 E = (sum of frequencies in all cells)/(# of cells)
68
Example 1: one-way design
Toss a coin 100 times; H0: Coin is fair
Chi Square Table Critical value is 3.84 (2-tailed, df=1, p=0.05) 0.36 < 3.84; cannot reject H0
X² = (53 − 50)²/50 + (47 − 50)²/50 = 0.36
Heads Tails Total
Observed 53 47 100
Expected 50 50 100
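The one-way chi-square for the coin example is a direct translation of the formula; a quick sketch:

```python
# Observed and expected frequencies from the table above.
observed = [53, 47]
expected = [50, 50]

# X^2 = sum of (O - E)^2 / E over the cells.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 0.36
```

0.36 is below the critical value 3.84 (df=1, p=0.05), so H0 cannot be rejected.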
69
Two-way design
Compare relation of frequencies for two variables Df = (# of columns -1)× (# of rows -1)
Contingency table Tests whether two characteristics are independent
or associated Classifies experiment outcomes according to two
criteria
items of totalgrand
alcolumn tot totalrow E
70
Two-by-two contingency table
A occurs A does not occur
B occurs a c
B does not occur b d
))()()((
)2|(| 22
dbcadcba
NbcadNX
dcbaN
Shortcut for a two-by-two contingency table
Drop the N/2 term to compute X² without Yates's correction
Can also compute the chi-square using the normal method
71
Example 2: 2-by-2 contingency table
Male Female
Believe in CMC romance 36 14
Don’t believe in CMC romance 30 25
N = 36 + 14 + 30 + 25 = 105
X² = 105 × (36×25 − 14×30)² / [(36+30)(14+25)(36+14)(30+25)] = 3.418
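The 2-by-2 shortcut for the CMC-romance table can be sketched in Python (here without Yates's correction, so the |ad − bc| − N/2 term reduces to ad − bc, matching the calculation above):

```python
# Cells from the table: a, c = first row (Male, Female); b, d = second row.
a, c = 36, 14    # believe in CMC romance
b, d = 30, 25    # don't believe in CMC romance
N = a + b + c + d                                  # 105

# Shortcut formula, Yates's correction omitted.
chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2, 3))  # 3.418
```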
72
What is correlation?
Degree to which two variables are related Positive: high values of X associated with high
values of Y Negative: high values of X associated with low
values of Y Correlation coefficient: -1 to +1
+1: perfect positive correlation 0: no correlation -1: perfect negative correlation
73
Pearson’s correlation coefficient
Assumptions: X and Y are Interval or ratio-type data (continuous) Independent Normally distributed In a linear relationship
Useful terms to know Correlation is covariance of standardized variables
74
Pearson’s correlation coefficient (cont.)
Computing the coefficient
r = Cov(X,Y)/(s_x·s_y) = [nΣx_iy_i − (Σx_i)(Σy_i)] / √([nΣx_i² − (Σx_i)²][nΣy_i² − (Σy_i)²])
Standard error of estimation
s_est = s_y·√(1 − r_xy²)
Partitioning the sums of squares
r_xy² gives the proportion of variance in one variable accounted for by the other variable (not due to chance)
75
Example 3
Correlation between X and Y X: Number of salespeople Y: Total number of sales r=0.921, N=5
Significance of the correlation coefficient Significance table (critical value=0.878, N=5,
p=0.05, 2-tailed) The t test (if N≥6)
76
Spearman’s rank correction coefficient
Used with ordinal data that can be ranked X ordinal & Y continuous: convert Y to ranked data
Computing Spearman’s rank correlation coefficient
di = rank of yi − rank of xi
ρ = 1 − 6Σdi² / (N(N² − 1))
77
Example 4
Correlation between X and Y X: rating of product quality (1-4, 4 best) Y: perceived reputation of company (1-3, 3 best) ρ=0.830, N=7
Significance of the correlation coefficient Table (critical value=0.786, N=7, p=0.05, 2-tailed) The t test (if N≥30)
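The no-ties rank-difference formula above is equally short in code. The slides do not give the raw ratings for Example 4, so the ranks below are hypothetical, for illustration only:

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rho from rank differences (no-ties formula)."""
    N = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (N * (N**2 - 1))

# Hypothetical ranks (not the slide's product-quality/reputation data).
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]
rho = spearman_rho(rank_x, rank_y)
```

Identical rankings give ρ = 1.0; fully reversed rankings give ρ = −1.0.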
78
Resources
Resources to help you learn and use SPSS
What statistical analysis should I use?
Statistical analyses using SPSS