Statistical Methods for Corpus Analysis Xiaofei Lu APLNG 596D July 14, 2009.
Statistical Methods for Corpus Analysis
Xiaofei Lu
APLNG 596D
July 14, 2009
2
Overview
Describing data Comparing groups Describing relationships
3
Basic concepts
Probability experiments – jargon Experiment: a situation for which the outcomes
occur randomly Sample space (Ω): the set of all possible outcomes Outcome (w): a point in the sample space Event: a subset of the sample space
4
Example 1
Experiment: toss a fair die 6 outcomes: 1,2,3,4,5,6 Sample space = Ω = {1,2,3,4,5,6} An event is any subset of the sample space
“an even number is rolled”: A = {2,4,6}, P(A) = 1/2 “ a 3 is rolled”: B = {3}, P(B) = 1/6
5
Example 2
Experiment: toss 2 fair dice Outcomes: ordered pairs (x, y); x and y are
results of the 1st and 2nd toss respectively Sample space = Ω = set of such ordered pairs =
{(x,y)|x = 1, 2,…, or 6 and y = 1,2,…, or 6} An event is any subset of Ω, e.g., “sum is 7”
A = {(x,y)|x+y=7}
= {(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)}
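The two-dice sample space is small enough to check by brute-force enumeration; a minimal Python sketch (not part of the original slides):

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair dice: all ordered pairs (x, y).
omega = list(product(range(1, 7), repeat=2))
assert len(omega) == 36

# The event "sum is 7" is the subset of outcomes whose coordinates add to 7.
A = [(x, y) for (x, y) in omega if x + y == 7]
p_A = Fraction(len(A), len(omega))
print(sorted(A))  # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
print(p_A)        # 1/6
```

Using `Fraction` keeps the probability exact rather than a rounded float.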
6
More jargon
A=“sum is 7”; B=“first toss is an odd number” Union of two events
The event C that either A or B occurs or both occur Intersection of two events
The event C that both A and B occur, C=A∩B Complement of an event
The event that A does not occur Disjoint events
Two events with no common outcome
7
Independence
Two events A and B are independent if knowing that one has occurred gives no information about whether the other has occurred
P(A∩B) = P(A)P(B) Outcomes of two successive tosses of an unbiased coin
P(2 heads)=P(A=1∩B=1)=P(A=1)×P(B=1)=1/4
8
Random variable
Random variable (X) Essentially a random number Formally a function from Ω to the real numbers
Discrete random variable A random variable that can take on only a finite or a
countably infinite number of possible values
9
Example 3
Experiment: toss a biased coin 3 times Bias: P(Heads) = 0.6 Ω = {hhh, hht, htt, hth, ttt, tth, thh, tht} X = total number of heads in the 3 tosses X is a r.v., a function from Ω to the real
numbers with possible values (x) 0, 1, 2, 3
10
Example 3 (cont.)
P(x=0)=0.064 P(x=1)=0.288 P(x=2)=0.432 P(x=3)=0.216
w    X(w)  P{w}
HHH  3     (0.6)³
HHT  2     (0.6)²(0.4)
HTH  2     (0.6)²(0.4)
HTT  1     (0.6)(0.4)²
THH  2     (0.6)²(0.4)
THT  1     (0.6)(0.4)²
TTH  1     (0.6)(0.4)²
TTT  0     (0.4)³
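The probability mass function in Example 3 can be reproduced by enumerating the eight outcomes and multiplying per-toss probabilities (independence); a short sketch, not from the slides:

```python
from itertools import product

# Three tosses of a coin biased with P(heads) = 0.6.
p_h = 0.6
pmf = {x: 0.0 for x in range(4)}
for w in product("HT", repeat=3):
    p = 1.0
    for toss in w:                # independent tosses: probabilities multiply
        p *= p_h if toss == "H" else 1 - p_h
    pmf[w.count("H")] += p        # X(w) = number of heads in the sequence

for x in range(4):
    print(x, round(pmf[x], 3))   # 0 0.064, 1 0.288, 2 0.432, 3 0.216
```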
11
Random variable (cont.)
Continuous random variable A random variable that can take an uncountably
infinite number of possible values, e.g., height Defined over an interval of values, e.g., (0,2], and
represented by the area under a curve The probability of observing any single value is
equal to 0
12
Probability distribution
Describes the possible values of a random variable and their probabilities
Probability mass function (discrete) Probability density function (continuous)
13
Descriptive vs. inferential statistics
Descriptive statistics Summarize important properties of observed data Measures of central tendency Measures of variability
Inferential statistics The use of statistics to make inferences concerning
some unknown aspect of a population Hypothesis testing
14
Measures of central tendency
The most typical score for a data set The mode
The most frequently obtained score in a data set, (2, 4, 4, 7, 8)
The median Central score in sample with an odd number of
items, (2, 4, 4, 7, 8) Average of two central scores in sample with an
even number of items (2, 4, 4, 7, 8, 100)
15
Measures of central tendency (cont.)
The mean The average of all scores in a data set, (2,4,4,7,8)
Disadvantage of the mean Affected by extreme values (2,4,4,7,100) What is a more suitable measure in such cases?
16
Measures of variability
Statistical dispersion in a r.v. or probability distribution
Range Highest value minus lowest value: (2,4,4,7,8) Affected by extreme scores: (2,4,4,7,100)
Inter-quartile range: difference between The value ¼ of the way from the top, and The value ¼ of the way from the bottom
Semi inter-quartile range: ½ of the IQR
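The central-tendency and dispersion measures above are all in Python's standard `statistics` module; a minimal sketch using the slide's example data (quartile cut points depend on the interpolation method, so the IQR here follows `statistics.quantiles`' default "exclusive" method):

```python
import statistics

data = [2, 4, 4, 7, 8]
print(statistics.mode(data), statistics.median(data), statistics.mean(data))

# Median with an even number of items averages the two central scores.
print(statistics.median([2, 4, 4, 7, 8, 100]))  # 5.5

# Range and (semi) inter-quartile range as simple dispersion measures.
rng = max(data) - min(data)                  # 8 - 2 = 6
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1
semi_iqr = iqr / 2
```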
17
Measures of variability (cont.)
The variance Considers distance of every data item from mean Population variance
Sample variance: (n-1) indicates degree of freedom
18
Measures of variability (cont.)
The standard deviation The most common measure of statistical dispersion Standard deviation of a random variable
Sample standard deviation: N-1 indicates d.o.f.
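The population/sample distinction (dividing by N versus N−1 degrees of freedom) is exactly the difference between `pvariance`/`pstdev` and `variance`/`stdev` in the standard library; a quick check with the earlier example data:

```python
import statistics

data = [2, 4, 4, 7, 8]
mean = statistics.mean(data)                 # 5

pop_var = statistics.pvariance(data)         # divides by N
samp_var = statistics.variance(data)         # divides by N - 1 (d.o.f.)
pop_sd = statistics.pstdev(data)
samp_sd = statistics.stdev(data)

# Both agree with the definitional sums of squared deviations.
assert pop_var == sum((x - mean) ** 2 for x in data) / len(data)
assert samp_var == sum((x - mean) ** 2 for x in data) / (len(data) - 1)
```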
19
Shape of a distribution
Asymmetrical distribution Positively (or right) skewed distribution Negatively (or left) skewed distribution
Symmetrical distribution Normal distribution (single modal peak)
mode=median=mean Assumed by many statistical tests in corpus linguistics
Bimodal distribution
20
Normal distribution
A statistical distribution N(μ, σ) with the following probability density function
Parameters: mean μ and standard deviation σ e is a mathematical constant Density is bell-shaped, symmetric about μ Standard normal distribution: μ=0, σ=1
21
Central limit theorem
The theorem When samples are repeatedly drawn from a
population, the means of the samples will be normally distributed around the population mean
This occurs even if the distribution of the data in the population is not normal
This makes the normal distribution important The distribution of IQ scores
22
Properties of the normal curve
Shape of curve defined by μ and σ Important property
For any normal curve, if we draw a vertical line through it at any number of standard deviations away from the mean, the proportions of the area under the curve are always the same
See here
23
The z score
A measure of how far a given value is from the mean, expressed as a number of s.d.’s
How probable a z score is for any test Measured by proportion of the total area under the
tail of the curve which lies beyond a given z value Consult the z score table
z = (x − μ) / σ
24
Example 4
Mean frequency of there in a 1000-word sample written by a given author is 10, σ = 4
A sample contains 17 occurrences of there z score = (17-10)/4= 1.75 The area beyond the z score of 1.75 is 0.0401, or
4.01% of the total area under the curve The probability of seeing a sample with more than 17
occurrences of there is 4.01% or less
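Instead of consulting a z score table, the upper-tail area can be computed from the error function in Python's `math` module; a sketch reproducing Example 4:

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Example 4: population mean 10, s.d. 4, observed 17 occurrences of 'there'.
z = (17 - 10) / 4                 # 1.75
p = upper_tail(z)
print(round(z, 2), round(p, 4))   # 1.75 0.0401
```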
25
Hypothesis testing
Using descriptive statistics as evidence for or against experimental hypotheses
The null hypothesis H0
There is no difference between the sample value and the population from which it was drawn
The alternative hypothesis H1
There is a significant difference between the sample value and the population from which it was drawn
Goal: to reject H0 with a certain level of significance (e.g., 5%)
26
Hypothesis testing (cont.)
Use of statistical tests Estimates the probability that the claims are wrong Enables us to claim statistical significance for our
results and have confidence in our claims One and two-tailed tests
One-tailed: likely direction of difference known Two-tailed: nature of the difference not specified
If using z-score, proportions in Appendix 1 must be doubled
Comparing Groups
Xiaofei Lu
APLNG 596D
28
Outline
Basic concepts Parametric comparisons of two groups Non-parametric comparisons of two groups Comparisons between three or more groups
29
Basic concepts
Types of scales of measurement Independent and dependent variables Parametric and non-parametric tests Population mean Between-groups and repeated measures design One-sample and two-sample studies
30
Types of scales of measurement
Ratio scale: units on the scale are the same Measurement in meters
Interval scale: the zero point is arbitrary Centigrade scale of temperature
Ordinal scale: records order only Ranks in a contest
Nominal scale: categorical data Part-of-speech categories
31
Independent and dependent variables
Independent variables: what do I change? Dependent variables: what do I observe? Controlled variables: what do I keep the
same?
32
Two examples
Effect of education on income Independent variable: academic degree of the
individual Dependent variable: level of income of the
individual measured in monetary units Effect of sentence complexity on recall
Independent variable: sentence complexity Dependent variable: amount of sentence correctly
recalled
33
Parametric tests
Dependent variables are ratio-/interval- scored Observations should be independent Often assumes normal distribution of data
Mean an appropriate measure of central tendency Standard deviation an appropriate measure of
variability Works with any distribution with parameters
34
Non-parametric tests
Do not assume normal distribution of data Best for small samples with no normal distribution
Work with rank-ordered scales and frequencies
35
Population mean
Sampling distribution of means A distribution made up of group means Describes a symmetric curve Group means within a population closer to each
other than individual scores to group mean Population mean
The average of a group of means
36
Experimental design
Between-groups design Data comes from two different groups
Repeated measures design Data is the result of two or more measures taken
from the same group
37
One-sample and two-sample studies
One-sample studies Compare group mean with population mean Determine whether group mean differs from
population mean Two-sample studies
Compare means from two different groups (experimental and control group)
Determine whether these means differ for reasons other than pure chance
38
Parametric comparison of two groups
The t test for independent samples The matched pairs t test
39
The t test for independent samples
Tests difference between two groups Normally-distributed interval data Mean and standard deviation good measures of
central tendency and variability Especially useful for small samples (N<30)
40
One-sample t test
H0: no significant difference between group mean and population mean
Computing the t statistic (in SPSS)
Standard error of the means s: standard deviation of the sample group n: sample size
t_obs = (x̄ − μ) / (s / √n)
41
Corpus linguistics example
A balanced corpus Mean verbs per sentence: 2.5; s.d. = 1.2
A 100-sentence specialized subcorpus Mean verbs per sentence: 3.5; s.d. = 1.6
t statistic: (3.5-2.5)/(1.6/10)=6.25
42
Corpus linguistics example (cont.)
Consult the t table Two-tailed test (non-directional) Degree of freedom: (n-1) = (100-1) = 99
use next lower value – 90
Significance level: go with 0.05 or 0.01 Critical value: 1.987 (for 0.05) or 2.632 (for 0.01) Observed value of t (6.25) is greater than 2.632 Can reject H0 at the 1 percent significance level
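The one-sample t statistic for this subcorpus example is a one-line computation; a minimal sketch with the slide's figures:

```python
import math

# One-sample t for the specialized subcorpus against the balanced corpus.
pop_mean = 2.5       # balanced corpus: mean verbs per sentence
sample_mean = 3.5    # 100-sentence subcorpus mean
s = 1.6              # subcorpus standard deviation
n = 100

t_obs = (sample_mean - pop_mean) / (s / math.sqrt(n))
print(round(t_obs, 2))  # 6.25
```

Comparing 6.25 against the critical value 2.632 (df 90 row, p=0.01) rejects H0, as on the slide.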
43
Two-sample t test
H0: difference between 2 groups expected for any 2 means in a population due to chance
Show that the difference falls in the extreme left or right tail of the t distribution
Standard error of differences between the mean
t = (x̄_e − x̄_c) / √(s_e²/n_e + s_c²/n_c)
44
Corpus linguistics example
Number of errors of a specific type in each of 15 equal-length essays
Control group: 8 essays produced by students learning by traditional methods
Experimental group: 7 essays produced by students learning by a novel method
45
Corpus linguistics example (cont.)
t=(6-3)/sqrt((2.27*2.27/7)+(2.21*2.21/8))=2.585 Degree of freedom = (8-1)+(7-1)=13 Critical value of t for a two-tailed test at the 5 percent
significance level for 13 d.o.f. is 2.16 Observed t is greater than 2.16; difference is
significant
n Mean Standard deviation
Control 8 6 2.21
Experimental 7 3 2.27
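The independent-samples t statistic for the essay-error example follows directly from the table's summary figures; a sketch, not part of the slides:

```python
import math

# Summary figures from the table: n, mean, standard deviation.
n_c, mean_c, sd_c = 8, 6, 2.21   # control group
n_e, mean_e, sd_e = 7, 3, 2.27   # experimental group

# Standard error of the difference between the two means.
se_diff = math.sqrt(sd_e**2 / n_e + sd_c**2 / n_c)
t = (mean_c - mean_e) / se_diff
print(round(t, 3))  # 2.585
```

Since 2.585 exceeds the critical value 2.16 (two-tailed, 13 d.o.f., p=0.05), the difference is significant.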
46
Some caveats
The matched pairs t test should be used for repeated measures designs (correlated samples)
A non-parametric test should be used if data is very skewed and not normally-distributed
A parametric test for comparing 3 or more groups should be used to cross-compare groups
47
The matched pairs t test
Comparing paired or correlated samples Not independent but closer to each than random
samples A feature observed under 2 different conditions
Same students tested before and after taking class Pairs of subjects matched according to any
characteristic Studying husbands and wives rather than random
samples
48
The matched pairs t test (cont.)
di denotes the difference between the ith pair N denotes the number of pairs of observations
t = Σdi / √[(N·Σdi² − (Σdi)²) / (N − 1)]
49
Corpus linguistics example
Lengths of the vowels produced by 10 speakers in two different consonant environments
t = -2.95; d.o.f. = 9
Critical value of t for a two-tailed test at the 2 percent significance level for 9 d.o.f. is 2.821
|t| = 2.95 exceeds 2.821, so the difference is significant at the 2 percent level
ID E1 E2 d
1 22 26 -4
2 18 22 -4
3 26 27 -1
4 17 15 2
5 19 24 -5
6 23 27 -4
7 15 17 -2
8 16 20 -4
9 19 17 2
10 25 30 -5
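The matched-pairs t statistic can be recomputed from the difference column of the table above; a minimal sketch:

```python
import math

# Differences d = E1 - E2 for the ten speakers in the table.
d = [-4, -4, -1, 2, -5, -4, -2, -4, 2, -5]
N = len(d)
sum_d = sum(d)                   # -25
sum_d2 = sum(x * x for x in d)   # 127

# Matched-pairs t: t = sum(d) / sqrt((N*sum(d^2) - sum(d)^2) / (N - 1))
t = sum_d / math.sqrt((N * sum_d2 - sum_d**2) / (N - 1))
print(round(t, 2))  # -2.95
```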
50
Non-parametric comparisons of two groups
Used in two-sample studies where the assumptions of the t test do not hold
Between-group design (independent samples) The Wilcoxon rank sums test
Repeated measures design (correlated/paired samples) The Wilcoxon matched pairs signed rank test
51
The Wilcoxon rank sums test
Also known as the Mann-Whitney U test Useful for comparing ordinal rating scales
Combine and rank scores for two groups Calculate the sum of ranks in the smaller group (R1) Calculate the sum of ranks in the larger group (R2) U = the smaller of U1 and U2
U1 = N1N2 + N1(N1 + 1)/2 − R1
U2 = N1N2 + N2(N2 + 1)/2 − R2
52
The Wilcoxon rank sums test (cont.)
If N1 ≥ 20 and N2 ≥ 20, can compute z score Let N = N1 + N2
z = (2R1 − N1(N + 1)) / √(N1N2(N + 1)/3)
53
Corpus linguistics example
Questionnaire distributed to 2 student groups Group 1: Computer-taught Group 2: Classroom-taught Question: ‘How hard/useful did you find the task?’ Answer: Likert scale (1-5), 1=very hard; 5=very easy
Data processing Aggregate scores found for each subject Combined scores from 2 groups ranked Average scores given to tied ranks
54
Corpus linguistics example (cont.)
H0: no difference between 2 groups
Calculate level of significance here
Cannot reject H0
G2 Rank G1 Rank
14 1 10 6
12 2.5 10 6
12 2.5 10 6
11 4 8 10.5
9 8.5 7 12
9 8.5 6 13
8 10.5 - -
R2 37.5 R1 53.5
U1 = 6×7 + 6(6 + 1)/2 − 53.5 = 9.5
U2 = 6×7 + 7(7 + 1)/2 − 37.5 = 32.5
U = min(U1, U2) = 9.5
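The whole rank-sums procedure, including average ranks for ties, can be sketched in a few lines of Python using the questionnaire scores from the table (ranking from highest score down, as the table does):

```python
def avg_ranks(values):
    """Rank values from highest to lowest, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

g1 = [10, 10, 10, 8, 7, 6]               # Group 1: computer-taught
g2 = [14, 12, 12, 11, 9, 9, 8]           # Group 2: classroom-taught
ranks = avg_ranks(g1 + g2)
R1 = sum(ranks[: len(g1)])               # 53.5
R2 = sum(ranks[len(g1):])                # 37.5

N1, N2 = len(g1), len(g2)
U1 = N1 * N2 + N1 * (N1 + 1) / 2 - R1    # 9.5
U2 = N1 * N2 + N2 * (N2 + 1) / 2 - R2    # 32.5
U = min(U1, U2)
```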
55
The Wilcoxon matched pairs signed ranks test
Used on interval level of measurement Ranks differences between pairs of observations Considers both direction and degree of difference
Procedure Obtain matched pairs of scores Calculate difference for each pair Rank differences according to absolute magnitude Find the sum of negative and positive ranks
56
The Wilcoxon matched pairs signed ranks test (cont.)
Consult a significance table W = smaller of the sum of negative/positive ranks N = number of pairs with a difference W should be smaller than or equal to critical value
If N ≥ 25, can compute z score
z = (W − N(N + 1)/4) / √(N(N + 1)(2N + 1)/24)
57
Corpus linguistics example
# of errors in translating 2 passages into French
W=6.5 Sum of positive ranks: 6.5 Sum of negative ranks: 38.5
N=9 Critical value is 5 (2-
tailed, p=0.05, N=9) W=6.5>5, H0 holds
Subj A B A-B Rank
1 8 10 -2 -4.5
2 7 6 +1 +2
3 4 4 0 -
4 2 5 -3 -7.5
5 4 7 -3 -7.5
6 10 11 -1 -2
7 17 15 +2 +4.5
8 3 6 -3 -7.5
9 2 3 -1 -2
10 11 14 -3 -7.5
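The signed-ranks bookkeeping for this example (drop zero differences, rank by absolute size with ties averaged, sum positive and negative ranks separately) can be sketched as:

```python
# Differences A - B from the translation-errors table; zero differences are dropped.
d = [-2, 1, 0, -3, -3, -1, 2, -3, -1, -3]
d = [x for x in d if x != 0]
N = len(d)                               # 9 pairs with a difference

# Rank by absolute magnitude, averaging tied ranks.
order = sorted(range(N), key=lambda i: abs(d[i]))
ranks = [0.0] * N
i = 0
while i < N:
    j = i
    while j + 1 < N and abs(d[order[j + 1]]) == abs(d[order[i]]):
        j += 1
    for k in range(i, j + 1):
        ranks[order[k]] = (i + j) / 2 + 1   # average 1-based rank for the tie group
    i = j + 1

pos = sum(r for r, x in zip(ranks, d) if x > 0)   # 6.5
neg = sum(r for r, x in zip(ranks, d) if x < 0)   # 38.5
W = min(pos, neg)
```

W = 6.5 exceeds the critical value 5, so H0 is retained, as on the slide.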
58
Comparisons between three or more groups
Analysis of variance (ANOVA) A method of testing for significant differences
between means of more than 2 samples H0: samples taken from populations with same mean;
no significant difference between samples Samples not from same population if
Between-groups variance significantly greater than within-groups variance
59
ANOVA
Between-groups variance Sum of squared difference between each sample
mean and overall mean weighted by sample size Normalized by degree of freedom (what is it?)
Within-groups variance Sum of squared difference between each score in
each sample and the corresponding sample mean Normalized by degree of freedom (what is it?)
60
ANOVA (cont.)
Consult an ANOVA significance table Degree of freedom in numerator
# of groups -1 Degree of freedom in denominator
# of data items in all groups - # of groups F value smaller than critical value: H0 holds
The F ratio = Between-groups variance / Within-groups variance
61
Corpus linguistics example
# of words 3 poets fit into a heroic couplet
3 samples with 5 couplets each
Overall mean: 240/15 = 16
Bgv = (5×0 + 5×1 + 5×1)/(3-1) = 5
Wgv = [(1+4+1+1+1) + (1+1+1+1+4) + (0+1+4+1+0)]/(15-3) = 1.833
F = 5/1.833 = 2.73
Critical value is 3.89 (Df 2&12, p≤0.10, 2-tailed); H0 holds
S1 S2 S3
1 17 16 17
2 18 14 18
3 15 14 15
4 15 14 18
5 15 17 17
mean 16 15 17
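The between- and within-groups variances for the couplet data can be recomputed directly from the table; a minimal sketch of the one-way ANOVA arithmetic:

```python
# Words per couplet for the three poets (columns S1-S3 of the table).
samples = [
    [17, 18, 15, 15, 15],   # S1, mean 16
    [16, 14, 14, 14, 17],   # S2, mean 15
    [17, 18, 15, 18, 17],   # S3, mean 17
]
k = len(samples)
n_total = sum(len(s) for s in samples)
grand_mean = sum(sum(s) for s in samples) / n_total            # 240/15 = 16

# Between-groups: squared gaps of sample means from the grand mean,
# weighted by sample size, normalized by k - 1 degrees of freedom.
between = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2
              for s in samples) / (k - 1)

# Within-groups: squared gaps of each score from its own sample mean,
# normalized by n_total - k degrees of freedom.
within = sum((x - sum(s) / len(s)) ** 2
             for s in samples for x in s) / (n_total - k)

F = between / within
print(round(F, 2))  # 2.73
```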
Describing Relationships
Xiaofei Lu
APLNG 596D
63
Outline
The chi-square test Correlation
64
The chi-square test
Dealing with nominal data Facts that can be sorted into categories Measured as frequencies
Significant differences between frequencies? Chi-square test: a non-parametric test of
relationship between frequencies Compare observed frequencies with those expected
on the basis of some theoretical model
65
The chi-square test (cont.)
Observed value (O) and expected value (E) O: Actual frequency in a cell E: Expected frequency in a cell
Computing the chi-square
X² = Σ (O − E)² / E
66
General caveats
Use of chi-square test is inappropriate if Any expected frequency is below 1; or E<5 in more than 20% of the cells
Yates’ correction factor Applicable if df = 1 If O>E, O=O-0.5; if O<E, O=O+0.5
67
One-way design
Compare relation of frequencies for one variable Df = (# of cells) - 1 E = (sum of frequencies in all cells)/(# of cells)
68
Example 1: one-way design
Toss a coin 100 times; H0: Coin is fair
Chi Square Table Critical value is 3.84 (2-tailed, df=1, p=0.05) 0.36 < 3.84; cannot reject H0
X² = (53 − 50)²/50 + (47 − 50)²/50 = 0.36
Heads Tails Total
Observed 53 47 100
Expected 50 50 100
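The one-way chi-square for the coin example is a direct translation of the formula; a quick sketch:

```python
# Observed and expected frequencies from the table above.
observed = [53, 47]
expected = [50, 50]

# X^2 = sum of (O - E)^2 / E over the cells.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 0.36
```

0.36 is below the critical value 3.84 (df=1, p=0.05), so H0 cannot be rejected.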
69
Two-way design
Compare relation of frequencies for two variables Df = (# of columns -1)× (# of rows -1)
Contingency table Tests whether two characteristics are independent
or associated Classifies experiment outcomes according to two
criteria
items of totalgrand
alcolumn tot totalrow E
70
Two-by-two contingency table
A occurs A does not occur
B occurs a c
B does not occur b d
))()()((
)2|(| 22
dbcadcba
NbcadNX
dcbaN
Shortcut for a two-by-two contingency table
Drop the N/2 term to compute X² without Yates's correction
Can also compute the chi-square using the normal method
71
Example 2: 2-by-2 contingency table
Male Female
Believe in CMC romance 36 14
Don’t believe in CMC romance 30 25
N = 36 + 14 + 30 + 25 = 105
X² = 105 × (36×25 − 14×30)² / [(36+30)(14+25)(36+14)(30+25)] = 3.418
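The 2-by-2 shortcut for the CMC-romance table can be sketched in Python (here without Yates's correction, so the |ad − bc| − N/2 term reduces to ad − bc, matching the calculation above):

```python
# Cells from the table: a, c = first row (Male, Female); b, d = second row.
a, c = 36, 14    # believe in CMC romance
b, d = 30, 25    # don't believe in CMC romance
N = a + b + c + d                                  # 105

# Shortcut formula, Yates's correction omitted.
chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2, 3))  # 3.418
```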
72
What is correlation?
Degree to which two variables are related Positive: high values of X associated with high
values of Y Negative: high values of X associated with low
values of Y Correlation coefficient: -1 to +1
+1: perfect positive correlation 0: no correlation -1: perfect negative correlation
73
Pearson’s correlation coefficient
Assumptions: X and Y are Interval or ratio-type data (continuous) Independent Normally distributed In a linear relationship
Useful terms to know Correlation is covariance of standardized variables
74
Pearson’s correlation coefficient (cont.)
Computing the coefficient
r = Cov(X,Y)/(s_x·s_y) = [nΣx_iy_i − (Σx_i)(Σy_i)] / √([nΣx_i² − (Σx_i)²][nΣy_i² − (Σy_i)²])
Standard error of estimation
s_est = s_y·√(1 − r_xy²)
Partitioning the sums of squares
r_xy² gives the proportion of variance in one variable accounted for by the other variable (not due to chance)
75
Example 3
Correlation between X and Y X: Number of salespeople Y: Total number of sales r=0.921, N=5
Significance of the correlation coefficient Significance table (critical value=0.878, N=5,
p=0.05, 2-tailed) The t test (if N≥6)
76
Spearman’s rank correction coefficient
Used with ordinal data that can be ranked X ordinal & Y continuous: convert Y to ranked data
Computing Spearman’s rank correlation coefficient
di = rank of yi − rank of xi
ρ = 1 − 6Σdi² / (N(N² − 1))
77
Example 4
Correlation between X and Y X: rating of product quality (1-4, 4 best) Y: perceived reputation of company (1-3, 3 best) ρ=0.830, N=7
Significance of the correlation coefficient Table (critical value=0.786, N=7, p=0.05, 2-tailed) The t test (if N≥30)
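The no-ties rank-difference formula above is equally short in code. The slides do not give the raw ratings for Example 4, so the ranks below are hypothetical, for illustration only:

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rho from rank differences (no-ties formula)."""
    N = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (N * (N**2 - 1))

# Hypothetical ranks (not the slide's product-quality/reputation data).
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]
rho = spearman_rho(rank_x, rank_y)
```

Identical rankings give ρ = 1.0; fully reversed rankings give ρ = −1.0.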
78
Resources
Resources to help you learn and use SPSS
What statistical analysis should I use?
Statistical analyses using SPSS