
May 19, 2018


Unit 3: Descriptive Statistics for Continuous Data
Statistics for Linguists with R – A SIGIL Course

Designed by Marco Baroni (1) and Stefan Evert (2)

(1) Center for Mind/Brain Sciences (CIMeC), University of Trento, Italy

(2) Corpus Linguistics Group, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

http://SIGIL.r-forge.r-project.org/

Copyright © 2007–2015 Baroni & Evert

SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 1 / 40

Outline

Introduction
  Categorical vs. numerical variables
  Scales of measurement

Descriptive statistics
  Characteristic measures
  Histogram & density
  Random variables & expectations

Continuous distributions
  The shape of a distribution
  The normal distribution (Gaussian)


Introduction: Categorical vs. numerical variables

Reminder: the library metaphor

- In the library metaphor, we took random samples from an infinite population of tokens (words, VPs, sentences, ...)
- Relevant property is a binary (or categorical) classification
  - active vs. passive VP or sentence (binary)
  - instance of lemma TIME vs. some other word (binary)
  - subcategorisation frame of verb token (itr, tr, ditr, p-obj, ...)
  - part-of-speech tag of word token (50+ categories)
- Characterisation of population distribution is straightforward
  - binomial: true proportion π = 10% of passive VPs, or relative frequency of TIME, e.g. π = 2000 pmw
  - alternatively: specify redundant proportions (π, 1 − π), e.g. passive/active VPs (.1, .9) or TIME/other (.002, .998)
  - multinomial: multiple proportions π_1 + π_2 + ... + π_K = 1, e.g. (π_noun = .28, π_verb = .17, π_adj = .08, ...)
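As a small illustration of these population parameters, the following R sketch converts the per-million-words frequency into a proportion and checks that each set of proportions sums to 1. The variable names are made up for illustration, and the multinomial vector lumps all categories elided on the slide into an assumed `other` entry.

```r
# Relative frequency of TIME: 2000 occurrences per million words (pmw)
pi.time <- 2000 / 1e6
binom.time <- c(TIME = pi.time, other = 1 - pi.time)  # redundant pair (pi, 1 - pi)

# Passive vs. active VPs with true proportion pi = 10%
binom.vp <- c(passive = 0.1, active = 0.9)

# Multinomial: only the first three proportions come from the slide;
# "other" is an assumed catch-all for the categories the slide leaves out
pi.pos <- c(noun = 0.28, verb = 0.17, adj = 0.08)
pi.pos <- c(pi.pos, other = 1 - sum(pi.pos))

print(binom.time)
print(binom.vp)
print(sum(pi.pos))
```

The redundant second proportion carries no extra information in the binomial case, which is why a single parameter π is enough to describe the population.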


Numerical properties

In many other cases, the properties of interest are numerical:

Population census

height   weight   shoes   sex
178.18   69.52    39.5    f
160.10   51.46    37.0    f
150.09   43.05    35.5    f
182.24   63.21    46.0    m
169.88   63.04    43.5    m
185.22   90.59    46.5    m
166.89   47.43    43.0    m
162.58   54.13    37.0    f

Wikipedia articles

tokens   types   TTR     avg len.
696      251     2.773   4.532
228      126     1.810   4.488
390      174     2.241   4.251
455      176     2.585   4.412
399      214     1.864   4.301
297      148     2.007   4.399
755      275     2.745   3.861
299      171     1.749   4.524
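To make the relation between the columns concrete, here is a minimal R sketch that stores the eight Wikipedia sample rows in a data frame and recomputes the TTR column as tokens divided by types. The data frame and column names (`wiki`, `ttr.check`) are chosen for illustration only and are not part of the SIGIL package.

```r
# The eight sample rows from the "Wikipedia articles" table
wiki <- data.frame(
  tokens = c(696, 228, 390, 455, 399, 297, 755, 299),
  types  = c(251, 126, 174, 176, 214, 148, 275, 171),
  ttr    = c(2.773, 1.810, 2.241, 2.585, 1.864, 2.007, 2.745, 1.749)
)

# Recompute the ratio of tokens to types and compare with the tabulated TTR
wiki$ttr.check <- round(wiki$tokens / wiki$types, 3)
print(wiki)

# TRUE if every recomputed value agrees with the table up to rounding
print(all(abs(wiki$ttr.check - wiki$ttr) < 0.0015))
```

Note that the ratio here is tokens per type, so larger values indicate more repetition of the same word types.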


Descriptive vs. inferential statistics

Two main tasks of “classical” statistical methods (numerical data):

1. Descriptive statistics
   - compact description of the distribution of a (numerical) property in a very large or infinite population
   - often by characteristic parameters such as mean, variance, ...
   - this was the original purpose of statistics in the 19th century

2. Inferential statistics
   - infer (aspects of) population distribution from a comparatively small random sample
   - accurate estimates for level of uncertainty involved
   - often by testing (and rejecting) some null hypothesis H0
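The difference between the two tasks can be sketched in R with a simulated population. Everything in this example is an assumption made for illustration: the population size, the normal distribution with mean 170 and standard deviation 10, the sample size of 100, and the seed.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical "population": 100,000 simulated body heights in cm
population <- rnorm(100000, mean = 170, sd = 10)

# Descriptive statistics: summarise the full population directly
mu <- mean(population)

# Inferential statistics: estimate mu from a comparatively small random sample
smp <- sample(population, 100)
x.bar <- mean(smp)

cat("population mean:", mu, "\n")
cat("sample estimate:", x.bar, "\n")
```

The sample estimate will typically be close to, but not identical with, the population mean; quantifying that uncertainty is the job of inferential statistics.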


Introduction: Scales of measurement

Statisticians distinguish 4 scales of measurement

Categorical data

1. Nominal scale: purely qualitative classification
   - male vs. female, passive vs. active, POS tags, subcat frames
2. Ordinal scale: ordered categories
   - school grades A–E, social class, low/medium/high rating

Numerical data

3. Interval scale: meaningful comparison of differences
   - temperature (°C), plausibility & grammaticality ratings
4. Ratio scale: comparison of magnitudes, absolute zero
   - time, length/width/height, weight, frequency counts

Additional dimension: discrete vs. continuous numerical data
- discrete: frequency counts, rating (1, ..., 7), shoe size, ...
- continuous: length, time, weight, temperature, ...
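One possible way to mirror the four scales in R is sketched below, using unordered factors, ordered factors and numeric vectors. The example values and variable names are invented for illustration; R itself does not distinguish interval from ratio scale, so that distinction lives only in the comments.

```r
# Nominal scale: unordered factor, only equality comparisons make sense
sex <- factor(c("f", "m", "m", "f"))

# Ordinal scale: ordered factor, ranks can be compared but not subtracted
rating <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"), ordered = TRUE)
print(rating[2] > rating[3])   # "high" ranks above "medium"

# Interval scale: differences are meaningful, ratios are not (no absolute zero)
temp.c <- c(10, 20)
print(diff(temp.c))            # a difference of 10 degrees Celsius

# Ratio scale: absolute zero, so ratios of magnitudes are meaningful
freq <- c(50L, 100L)           # discrete frequency counts
print(freq[2] / freq[1])       # the second item is twice as frequent
```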


Quiz

Which scale of measurement / data type is it?

- subcategorisation frame
- reaction time (in psycholinguistic experiment)
- familiarity rating on scale 1, ..., 7
- room number
- grammaticality rating: “*”, “??”, “?” or “ok”
- magnitude estimation of plausibility (graphical scale)
- frequency of passive VPs in text
- relative frequency of passive VPs
- token-type-ratio (TTR) and average word length (Wikipedia)

☞ in this unit: continuous numerical variables on ratio scale


Descriptive statistics: Characteristic measures

The task

- Census data from small country of Ingary with m = 502,202 inhabitants. The following properties were recorded:
  - body height in cm
  - weight in kg
  - shoe size in Paris points (Continental European system)
  - sex (male, female)
- Frequency statistics for m = 1,429,649 Wikipedia articles:
  - token count
  - type count
  - token-type ratio (TTR)
  - average word length (across tokens)

☞ Describe / summarise these data sets (continuous variables)

> library(SIGIL)
> FakeCensus <- simulated.census()
> WackypediaStats <- simulated.wikipedia()


Characteristic measures: central tendency

- How would you describe body heights with a single number?

mean $\mu = \frac{x_1 + \dots + x_m}{m} = \frac{1}{m} \sum_{i=1}^{m} x_i$

- Is this intuitively sensible? Or are we just used to it?

> mean(FakeCensus$height)
[1] 170.9781
> mean(FakeCensus$weight)
[1] 65.28917
> mean(FakeCensus$shoe.size)
[1] 41.49712
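The definition of the mean can be checked by hand on a small vector. This sketch uses the eight sample heights from the census table in the introduction rather than the full `FakeCensus` data, so it runs without the SIGIL package; the variable names are illustrative.

```r
# Eight sample heights (cm) from the "Population census" table
height <- c(178.18, 160.10, 150.09, 182.24, 169.88, 185.22, 166.89, 162.58)

m <- length(height)
mu.formula <- sum(height) / m   # (x_1 + ... + x_m) / m, as in the definition
mu.builtin <- mean(height)      # R's built-in arithmetic mean

print(c(formula = mu.formula, builtin = mu.builtin))
```

Because these are only eight rows, the result differs from the population mean of 170.9781 reported for the full census.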


Characteristic measures: variability (spread)

- Average weight of 65.3 kg not very useful if we have to design an elevator for 10 persons or a chair that doesn’t collapse: We need to know if everyone weighs close to 65 kg, or whether the typical range is 40–100 kg, or whether it is even larger.
- Measure of spread: minimum and maximum, here 30–196 kg
- We’re more interested in the “typical” range of values without the most extreme cases
- Average variability based on error $x_i - \mu$ for each individual shows how well the mean $\mu$ describes the entire population

Page 36: Unit 3: Descriptive Statistics for Continuous Data ...Baroni&Evert) 3a.ContinuousData: Description sigil.r-forge.r-project.org 4/40. ... Descriptive statistics I compactdescriptionofthedistributionofa(numerical)

Descriptive statistics Characteristic measures

Characteristic measures: variability (spread)

I Average weight of 65.3 kg not very useful if we have to designan elevator for 10 persons or a chair that doesn’t collapse:We need to know if everyone weighs close to 65 kg, or whetherthe typical range is 40–100 kg, or whether it is even larger.

I Measure of spread: minimum and maximum, here 30–196 kgI We’re more interested in the “typical” range of values without

the most extreme casesI Average variability based on error xi − µ for each individual

shows how well the mean µ describes the entire population

$\frac{1}{m} \sum_{i=1}^{m} (x_i - \mu) = 0$ (positive and negative errors cancel out)


$\frac{1}{m} \sum_{i=1}^{m} |x_i - \mu|$ is mathematically inconvenient


variance $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2$
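The three candidate measures of average variability can be compared on a toy example. This is only a sketch: the vector `weight` holds made-up values (an assumption, not FakeCensus data), and the variable names are chosen for illustration.

```r
# Hypothetical weights in kg (illustrative values, not FakeCensus data)
weight <- c(52, 61, 65, 70, 78, 94)
mu  <- mean(weight)
err <- weight - mu                # individual errors x_i - mu

mean.error   <- mean(err)         # 0 up to rounding: useless as a measure of spread
mean.abs.err <- mean(abs(err))    # average absolute error
variance     <- mean(err^2)       # sigma^2 with denominator m

print(c(mean.error, mean.abs.err, variance))
```

Squaring the errors avoids the cancellation seen in the plain average, at the price of giving more weight to large errors.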


Characteristic measures: variability (spread)

variance $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2$

☞ Do you remember how to calculate this in R?
- height: $\mu = 171.00$, $\sigma^2 = 199.50$, $\sigma = 14.12$
- weight: $\mu = 65.29$, $\sigma^2 = 306.72$, $\sigma = 17.51$
- shoe size: $\mu = 41.50$, $\sigma^2 = 21.70$, $\sigma = 4.66$

- Mean and variance are not on a comparable scale → standard deviation (s.d.) $\sigma = \sqrt{\sigma^2}$
- NB: still gives more weight to larger errors!
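One possible way to compute these quantities in R is sketched below. Note that R's built-in `var()` and `sd()` divide by m − 1 (sample estimates), whereas the formula above divides by m, so the built-in result has to be rescaled. The vector `shoe` is made up for illustration and is not taken from FakeCensus.

```r
# Hypothetical shoe sizes (illustrative values, not FakeCensus data)
shoe <- c(38, 40, 41, 42, 42, 43, 45)
m <- length(shoe)

pop.var <- mean((shoe - mean(shoe))^2)   # sigma^2 as defined above (denominator m)
pop.sd  <- sqrt(pop.var)                 # sigma, on the same scale as the mean

# var() uses denominator m - 1, so rescale to recover the population variance
pop.var.from.var <- var(shoe) * (m - 1) / m

print(c(pop.var, pop.var.from.var, pop.sd))
```

For a population as large as a census the factor (m − 1) / m is practically 1, so the difference only matters for small data sets.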


Characteristic measures: higher moments

- Mean based on $(x_i)^1$ is also known as a “first moment”, variance based on $(x_i)^2$ as a “second moment”
- The third (standardised) moment is called skewness

  $\gamma_1 = \frac{1}{m} \sum_{i=1}^{m} \left( \frac{x_i - \mu}{\sigma} \right)^3$

  and measures the asymmetry of a distribution
- The fourth moment (kurtosis) measures “bulginess”
- How useful are these characteristic measures?
  - Given the mean, s.d., skewness, ..., can you tell how many people are taller than 190 cm, or how many weigh ≈ 100 kg?
  - Such measures are mainly used for computational efficiency, and even this required an elaborate procedure in the 19th century
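The skewness formula can be written as a small R function. The helper `std.moment()` below is a hypothetical name introduced for this sketch, and the two data sets are simulated (a Gaussian and an exponential sample), not taken from the course data.

```r
# Standardised k-th moment with denominator m, following the formula for gamma_1
std.moment <- function(x, k) {
  mu    <- mean(x)
  sigma <- sqrt(mean((x - mu)^2))
  mean(((x - mu) / sigma)^k)
}

set.seed(42)
sym  <- rnorm(10000)   # symmetric sample: skewness close to 0
skew <- rexp(10000)    # right-skewed sample: skewness clearly positive

print(c(std.moment(sym, 3), std.moment(skew, 3)))   # third standardised moment
print(c(std.moment(sym, 4), std.moment(skew, 4)))   # fourth standardised moment
```

Even with both values in hand, the questions raised above (how many people are taller than 190 cm?) cannot be answered without further assumptions about the shape of the distribution.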


Descriptive statistics Histogram & density

The shape of a distribution: discrete data

Discrete numerical data can be tabulated and plotted

[Figure: bar plot of shoe size (x-axis, 30–54) against proportion of population (y-axis, 0.00–0.06)]


The shape of a distribution: histogram for continuous data

Continuous data must be collected into bins → histogram

[Figure: two histograms of body height (x-axis, 120–220) with frequency on the y-axis (0–60000); the bars of the first histogram are labelled with their bin counts]

- No two people have exactly the same body height, weight, ...
- Frequency counts (= y-axis scale) depend on number of bins


The shape of a distribution: histogram for continuous data

Continuous data must be collected into bins → histogram

[Figure: two histograms of body height (x-axis, 120–220) with density on the y-axis (0.000–0.025)]

- Density scale is comparable for different numbers of bins
- Area of histogram bar ≡ relative frequency in population
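The difference between the frequency and the density scale can be checked with `hist()`. The sketch below uses simulated heights (an assumption; the parameters 171 and 14 merely imitate the values reported earlier) and suppresses plotting so that only the numbers are compared.

```r
set.seed(1)
height <- rnorm(5000, mean = 171, sd = 14)   # simulated heights, not FakeCensus data

# breaks = <number> is only a suggestion; hist() picks "pretty" bin limits
h.coarse <- hist(height, breaks = 10, plot = FALSE)
h.fine   <- hist(height, breaks = 40, plot = FALSE)

# Frequency counts shrink as the bins get narrower ...
print(c(max(h.coarse$counts), max(h.fine$counts)))
# ... but the density scale stays comparable
print(c(max(h.coarse$density), max(h.fine$density)))

# Area of each bar = relative frequency, so the areas sum to 1
area.coarse <- sum(h.coarse$density * diff(h.coarse$breaks))
area.fine   <- sum(h.fine$density * diff(h.fine$breaks))
print(c(area.coarse, area.fine))
```

To draw a histogram on the density scale, `hist(height, freq = FALSE)` can be used.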


Refining histograms: the density function

[Figure: histogram of body height (x-axis, 120–220) on the density scale (y-axis, 0.000–0.025)]

- Contour of histogram = density function
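As a rough illustration of this idea, the sketch below compares the bar heights of a fine-grained histogram with a known density curve. It assumes a simulated Gaussian population (mean 171, s.d. 14, chosen for illustration), so that the true density is available through `dnorm()`.

```r
set.seed(2)
height <- rnorm(100000, mean = 171, sd = 14)   # simulated population

# 1 cm bins covering the full range of the data
breaks <- seq(floor(min(height)), ceiling(max(height)), by = 1)
h <- hist(height, breaks = breaks, plot = FALSE)

# Density of the simulated distribution at the bin midpoints
true.density <- dnorm(h$mids, mean = 171, sd = 14)

# Largest gap between histogram contour and density curve
max.gap <- max(abs(h$density - true.density))
print(max.gap)
```

The gap is small compared with the peak density of roughly 0.028, so the histogram contour traces the density function closely.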


Descriptive statistics Random variables & expectations

Formal mathematical notation

- Population $\Omega = \{\omega_1, \omega_2, \dots, \omega_m\}$ with $m \approx \infty$
  - item $\omega_k$ = person, Wikipedia article, word (lexical RT), ...
- For each item, we are interested in several properties (e.g. height, weight, shoe size, sex) called random variables (r.v.)
  - height $X : \Omega \to \mathbb{R}^+$ with $X(\omega_k)$ = height of person $\omega_k$
  - weight $Y : \Omega \to \mathbb{R}^+$ with $Y(\omega_k)$ = weight of person $\omega_k$
  - sex $G : \Omega \to \{0, 1\}$ with $G(\omega_k) = 1$ iff $\omega_k$ is a woman
  ☞ formally, a r.v. is a (usually real-valued) function over $\Omega$
- Mean, variance, etc. computed for each random variable:

  $\mu_X = \frac{1}{m} \sum_{\omega \in \Omega} X(\omega) =: \mathrm{E}[X]$ (expectation)

  $\sigma_X^2 = \frac{1}{m} \sum_{\omega \in \Omega} \bigl( X(\omega) - \mu_X \bigr)^2 =: \mathrm{Var}[X] = \mathrm{E}[(X - \mu_X)^2]$ (variance)
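In R, a population can be pictured as a data frame with one row per item and one column per random variable. The sketch below is only an analogy under that assumption: the data frame `Omega`, its values and the helper functions `E()` and `Var()` are hypothetical and introduced here for illustration.

```r
# A tiny hypothetical population Omega: each row is one item omega_k
Omega <- data.frame(
  height = c(160, 172, 181, 168),   # X(omega_k) in cm
  weight = c(55, 70, 88, 62),       # Y(omega_k) in kg
  sex    = c(1, 0, 0, 1)            # G(omega_k) = 1 iff omega_k is a woman
)

E   <- function(values) mean(values)                # (1/m) * sum over Omega
Var <- function(values) E((values - E(values))^2)   # E[(X - mu_X)^2]

mu.X    <- E(Omega$height)
var.X   <- Var(Omega$height)
p.woman <- E(Omega$sex)   # expectation of a 0/1 variable = proportion of women

print(c(mu.X, var.X, p.woman))
```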


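Working with random variables

- $X'(\omega) := \bigl( X(\omega) - \mu \bigr)^2$ defines new r.v. $X' : \Omega \to \mathbb{R}$
  ☞ any function $f(X)$ of a r.v. is itself a random variable
- The expectation is a linear functional on r.v.:
  - $\mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y]$ for $X, Y : \Omega \to \mathbb{R}$
  - $\mathrm{E}[r \cdot X] = r \cdot \mathrm{E}[X]$ for $r \in \mathbb{R}$
  - $\mathrm{E}[a] = a$ for constant r.v. $a \in \mathbb{R}$ (additional property)
- These rules enable us to simplify the computation of $\sigma_X^2$:

  $\sigma_X^2 = \mathrm{Var}[X] = \mathrm{E}[(X - \mu_X)^2] = \mathrm{E}[X^2 - 2\mu_X X + \mu_X^2] = \mathrm{E}[X^2] - 2\mu_X \underbrace{\mathrm{E}[X]}_{=\mu_X} + \mu_X^2 = \mathrm{E}[X^2] - \mu_X^2$

- Random variables and probabilities: r.v. $X$ describes outcome of picking a random $\omega \in \Omega$ → sampling distribution

  $\Pr(a \le X \le b) = \frac{1}{m} \bigl| \{ \omega \in \Omega \mid a \le X(\omega) \le b \} \bigr|$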
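These rules can be checked numerically. The sketch below treats two simulated vectors as the values of X and Y over a population of 1000 items; the data and the helper `E()` are assumptions made for illustration.

```r
set.seed(3)
X <- rnorm(1000, mean = 171, sd = 14)   # simulated heights
Y <- rnorm(1000, mean = 65, sd = 17)    # simulated weights
E <- function(values) mean(values)      # expectation over the population

# Linearity of the expectation
lin.sum   <- c(E(X + Y), E(X) + E(Y))
lin.scale <- c(E(2.5 * X), 2.5 * E(X))

# Shortcut for the variance: E[X^2] - mu_X^2
var.def      <- E((X - E(X))^2)
var.shortcut <- E(X^2) - E(X)^2

# Probability as the proportion of items with 160 <= X <= 180
pr.160.180 <- mean(X >= 160 & X <= 180)

print(rbind(lin.sum, lin.scale, c(var.def, var.shortcut)))
print(pr.160.180)
```

Each row of the printed matrix should contain two equal numbers, up to floating-point rounding.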


A justification for the mean

- $\sigma_X^2$ tells us how well the r.v. $X$ is characterised by $\mu_X$
- More generally, $\mathrm{E}[(X - a)^2]$ tells us how well $X$ is characterised by some real number $a \in \mathbb{R}$
- The best single value we can give for $X$ is the one that minimises the average squared error:

  $\mathrm{E}[(X - a)^2] = \mathrm{E}[X^2] - 2a \underbrace{\mathrm{E}[X]}_{=\mu_X} + a^2$

- It is easy to see that a minimum is achieved for $a = \mu_X$: completing the square gives $\mathrm{E}[(X - a)^2] = \sigma_X^2 + (a - \mu_X)^2$
  ☞ The quadratic error term in our definition of $\sigma_X^2$ guarantees that there is always a unique minimum. This would not have been the case e.g. with $|X - a|$ instead of $(X - a)^2$.
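The argument can be illustrated numerically. In this sketch the vector `x` is made up, `sq.err()` and `abs.err()` are hypothetical helper names, and `optimize()` is used as one possible way to search for the minimising value of a.

```r
x <- c(52, 61, 65, 70, 78, 94)   # hypothetical values of X, mean = 70

sq.err  <- function(a) mean((x - a)^2)    # E[(X - a)^2]
abs.err <- function(a) mean(abs(x - a))   # E[|X - a|]

# Squared error: numerical search finds a unique minimum at the mean
best.sq <- optimize(sq.err, interval = range(x))$minimum
print(c(best.sq, mean(x)))

# Absolute error: every a between the two middle values (65 and 70) is equally good
print(c(abs.err(65), abs.err(67.5), abs.err(70)))
```

With an even number of values, the absolute error is flat between the two middle values, so it does not single out one best value.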


How to compute the expectation of a discrete variable

- Population distribution of a discrete variable is fully described by giving the relative frequency of each possible value $t \in \mathbb{R}$:

  $\pi_t = \Pr(X = t)$

  $\mathrm{E}[X] = \sum_{\omega \in \Omega} \frac{X(\omega)}{m} = \underbrace{\sum_t \sum_{X(\omega) = t}}_{\text{group by value of } X} \frac{t}{m} = \sum_t t \sum_{X(\omega) = t} \frac{1}{m} = \sum_t t \cdot \frac{|\{X(\omega) = t\}|}{m} = \sum_t t \cdot \pi_t = \sum_t t \cdot \Pr(X = t)$

- The second moment $\mathrm{E}[X^2]$ needed for $\mathrm{Var}[X]$ can also be obtained in this way from the population distribution:

  $\mathrm{E}[X^2] = \sum_t t^2 \cdot \Pr(X = t)$
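The grouping step corresponds to tabulating the data. The following sketch uses a made-up vector `shoe` (an assumption, not FakeCensus data) and computes the expectation and variance from the table of relative frequencies alone.

```r
# Hypothetical shoe sizes for a small population
shoe <- c(38, 40, 40, 41, 42, 42, 42, 43, 45, 45)

pi.t   <- table(shoe) / length(shoe)   # relative frequencies pi_t = Pr(X = t)
values <- as.numeric(names(pi.t))      # the distinct values t
probs  <- as.numeric(pi.t)

E.X   <- sum(values * probs)           # sum_t t * Pr(X = t)
E.X2  <- sum(values^2 * probs)         # sum_t t^2 * Pr(X = t)
Var.X <- E.X2 - E.X^2                  # E[X^2] - mu_X^2

print(c(E.X, Var.X))
```

The results agree with `mean(shoe)` and `mean((shoe - mean(shoe))^2)` computed directly from the individual values.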


How to compute the expectation of a continuous variable

- Population distribution of continuous variable can be described by its density function $g : \mathbb{R} \to [0, \infty]$
  - keep in mind that $\Pr(X = t) = 0$ for almost every value $t \in \mathbb{R}$: nobody is exactly 172.3456789 cm tall!

Area under density curve between $a$ and $b$ = proportion of items $\omega \in \Omega$ with $a \le X(\omega) \le b$.

$\Pr(a \le X \le b) = \int_a^b g(t) \, dt$

[Figure: density curve with $a$ and $b$ marked on the x-axis]

Same reasoning as for discrete variable leads to:

$\mathrm{E}[X] = \int_{-\infty}^{+\infty} t \cdot g(t) \, dt$ and $\mathrm{E}[f(X)] = \int_{-\infty}^{+\infty} f(t) \cdot g(t) \, dt$
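These integrals can be approximated with R's `integrate()`. The sketch assumes a Gaussian density with illustrative parameters (mean 171, s.d. 14) as the density g, and replaces the infinite limits by a wide finite range, which is a numerical convenience rather than part of the definition.

```r
# Assumed density: Gaussian with mu = 171 and sigma = 14 (illustrative parameters)
g <- function(t) dnorm(t, mean = 171, sd = 14)

# Pr(160 <= X <= 180): area under the density curve between a and b
pr.ab <- integrate(g, lower = 160, upper = 180)$value

# The infinite limits are replaced by mu +/- 10 sigma; the density beyond is negligible
lo <- 171 - 10 * 14
hi <- 171 + 10 * 14

# E[X] and E[f(X)] with f(t) = (t - 171)^2, i.e. the variance
E.X   <- integrate(function(t) t * g(t), lower = lo, upper = hi)$value
Var.X <- integrate(function(t) (t - 171)^2 * g(t), lower = lo, upper = hi)$value

print(c(pr.ab, E.X, Var.X))
```

The numerical results should be close to 171 for the expectation and 196 (= 14²) for the variance.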


Continuous distributions The shape of a distribution

Different types of continuous distributions

[Figures: five density plots, each with vertical markers for µ and µ ± σ; the last three also mark the median]

- symmetric, bell-shaped
- symmetric, bulgy
- skewed (median ≠ mean)
- complicated ...
- bimodal (mean & median misleading)


Continuous distributions The normal distribution (Gaussian)

The Gaussian distribution

- In many real-life data sets, the distribution has a typical “bell-shaped” form known as a Gaussian (or normal) distribution

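• Idealised density function is given by a simple equation:

  $$ g(t) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(t-\mu)^2}{2\sigma^2}} $$

  with parameters µ ∈ ℝ (location) and σ > 0 (width)

[Figure: bell-shaped curve of g(t) against t, centred at µ, with the intervals µ ± σ and µ ± 2σ marked]

• Notation: X ∼ N(µ, σ²) if a random variable (r.v.) has such a distribution
• No coincidence: E[X] = µ and Var[X] = σ² (→ homework ;-)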
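The relation between the parameters and the moments can be checked numerically before attempting the proof. The following is a minimal sketch, not a derivation: the values of `mu` and `sigma` are arbitrary illustrative choices, and the integration limits µ ± 10σ are an assumption made so that the neglected tails are negligible.

```r
mu <- 100     # arbitrary illustrative location parameter
sigma <- 15   # arbitrary illustrative width parameter

# Gaussian density g(t), written out from the formula on the slide
g <- function(t) exp(-(t - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))

# g(t) should agree with R's built-in dnorm()
grid <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 9)
max_diff <- max(abs(g(grid) - dnorm(grid, mean = mu, sd = sigma)))

# Expectation and variance as integrals over the density
# (finite limits: the mass outside mu +/- 10 sigma is negligible)
lo <- mu - 10 * sigma
hi <- mu + 10 * sigma
total <- integrate(g, lo, hi)$value                             # should be close to 1
EX    <- integrate(function(t) t * g(t), lo, hi)$value          # should be close to mu
VarX  <- integrate(function(t) (t - mu)^2 * g(t), lo, hi)$value # should be close to sigma^2

print(c(total = total, EX = EX, VarX = VarX))
```

Numerical agreement for one choice of parameters is only a sanity check; the homework still asks for the general argument.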

Important properties of the Gaussian distribution

• Distribution is well-behaved: symmetric, and most values are relatively close to the mean µ (within 2 standard deviations)

  $$ \Pr(\mu - 2\sigma \le X \le \mu + 2\sigma) = \int_{\mu-2\sigma}^{\mu+2\sigma} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(t-\mu)^2}{2\sigma^2}} \, dt \approx 95.5\% $$

• 68.3% are within range µ − σ ≤ X ≤ µ + σ (one s.d.)

• The central limit theorem explains why this particular distribution is so widespread (sum of independent effects)

☞ Mean and standard deviation are meaningful characteristics if distribution is Gaussian or near-Gaussian
  • completely determined by these parameters
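The percentages above can be reproduced in R with the cumulative distribution function `pnorm()`, and the central limit theorem can be illustrated informally by simulation. This is a sketch under stated assumptions: `mu` and `sigma` are arbitrary, and the choice of 12 uniform "effects" per sum, the number of replications and the random seed are illustrative only.

```r
mu <- 100     # arbitrary illustrative parameters
sigma <- 15

# Pr(mu - k*sigma <= X <= mu + k*sigma) from the cumulative distribution function
coverage <- function(k) {
  pnorm(mu + k * sigma, mean = mu, sd = sigma) -
    pnorm(mu - k * sigma, mean = mu, sd = sigma)
}
p1 <- coverage(1)   # approx. 0.683 (one s.d.)
p2 <- coverage(2)   # approx. 0.955 (two s.d.)

# The same two-s.d. probability by numerical integration of the density
p2_int <- integrate(dnorm, mu - 2 * sigma, mu + 2 * sigma,
                    mean = mu, sd = sigma)$value

# Central limit theorem, informally: each value is a sum of 12 independent
# uniform "effects"; the standardised sums behave almost like a Gaussian
set.seed(123)  # arbitrary seed, for reproducibility only
sums <- replicate(10000, sum(runif(12)))
z <- (sums - mean(sums)) / sd(sums)
within2 <- mean(abs(z) <= 2)   # proportion within 2 s.d., to compare with p2

print(round(c(p1 = p1, p2 = p2, p2_int = p2_int, within2 = within2), 4))
```

The coverage probabilities do not depend on the particular values of µ and σ, which is one way of seeing that the distribution is completely determined by these two parameters.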

Assessing normality

• Many hypothesis tests and other statistical techniques assume that random variables follow a Gaussian distribution

• If this normality assumption is not justified, a significant test result may well be entirely spurious.

• It is therefore important to verify that sample data come from such a Gaussian or near-Gaussian distribution

• Method 1: Comparison of histograms and density functions

• Method 2: Quantile-quantile plots

Assessing normality: Histogram & density function

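Plot histogram and estimated density:
> hist(x, freq=FALSE)    # histogram on the density scale
> lines(density(x))      # kernel density estimate

Compare best-matching Gaussian distribution:
> xG <- seq(min(x), max(x), length.out=100)   # grid over the observed range
> yG <- dnorm(xG, mean(x), sd(x))             # Gaussian with sample mean and s.d.
> lines(xG, yG, col="red")

Substantial deviation → not normal (problematic)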

[Figure: histogram (y-axis: Density, 0.000–0.030) over the range 100–220, with µ and µ ± σ marked; legend: estimated density, normal approximation]

[Figure: histogram (y-axis: Density, 0.000–0.030) over the range 0–150, with µ and µ ± σ marked; legend: estimated density, normal approximation]
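The histogram comparison can be tried out on simulated data where the true shape is known. The sketch below wraps the slide's commands in a helper function; the name `compare_to_gaussian`, the simulated samples, the seed and the returned "relative gap" score are all assumptions introduced for illustration. The score is a crude informal number, not a formal normality test.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulated stand-ins for real data (assumptions for illustration)
x_norm <- rnorm(2000, mean = 150, sd = 15)        # near-Gaussian sample
x_skew <- rlnorm(2000, meanlog = 4, sdlog = 0.8)  # right-skewed sample

# Histogram with estimated density (black) and best-matching Gaussian (red).
# Returns the largest gap between the two curves, relative to the peak of
# the fitted Gaussian (an ad-hoc mismatch score, hypothetical helper).
compare_to_gaussian <- function(x, main = "") {
  d <- density(x)
  hist(x, freq = FALSE, main = main)
  lines(d)
  xG <- seq(min(x), max(x), length.out = 100)
  lines(xG, dnorm(xG, mean(x), sd(x)), col = "red")
  legend("topright", legend = c("estimated density", "normal approximation"),
         col = c("black", "red"), lty = 1)
  gap <- max(abs(d$y - dnorm(d$x, mean(x), sd(x))))
  gap / dnorm(mean(x), mean(x), sd(x))
}

gap_norm <- compare_to_gaussian(x_norm, "simulated Gaussian sample")
gap_skew <- compare_to_gaussian(x_skew, "simulated log-normal sample")
print(c(gap_norm = gap_norm, gap_skew = gap_skew))
```

For the skewed sample the red curve should visibly miss the peak of the estimated density, which is the kind of substantial deviation the slide describes as problematic.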

Assessing normality: Quantile-quantile plots

Quantile-quantile plots are better suited for small samples:

> qqnorm(x)
> qqline(x, col="red")

If distribution is near-Gaussian, points should follow red line.

One-sided deviation → skewed distribution

[Figure: normal quantile-quantile plot; x-axis: Theoretical Quantiles (−3 to 3), y-axis: Sample Quantiles (140–200)]

[Figure: normal quantile-quantile plot; x-axis: Theoretical Quantiles (−2 to 2), y-axis: Sample Quantiles (40–120)]
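To see what a quantile-quantile plot actually displays, the coordinates can be computed by hand: the sorted sample values are plotted against the corresponding quantiles of the standard normal distribution. The sketch below uses a small simulated right-skewed sample; the sample itself and the seed are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
x <- rexp(30, rate = 1 / 50)   # small right-skewed sample (illustrative)

# Coordinates of the Q-Q plot, computed by hand
n <- length(x)
theoretical <- qnorm(ppoints(n))  # quantiles of the standard normal distribution
observed    <- sort(x)            # empirical quantiles of the sample

# qqnorm() returns the same coordinates (in the original order of x)
qq <- qqnorm(x, plot.it = FALSE)
same_x <- isTRUE(all.equal(sort(qq$x), theoretical))
same_y <- isTRUE(all.equal(sort(qq$y), observed))

# Manual version of the plot, with the usual reference line
plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(x, col = "red")   # line through the first and third quartiles
```

Because every data point appears individually, no binning or smoothing is involved, which is one reason the plot remains informative for small samples.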

Playtime!

• Take random samples of n items each from the census and wikipedia data sets (e.g. n = 100)

> library(corpora)
> Survey <- sample.df(FakeCensus, n, sort=TRUE)

• Plot histograms and estimated density for all variables
• Assess normality of the underlying distributions
  • by comparison with Gaussian density function
  • by inspection of quantile-quantile plots
  ☞ Can you make them look like the figures in the slides?

• Plot histograms for all variables in the full data sets (and estimated density functions if you’re patient enough)
  • What kinds of distributions do you find?
  • Which variables can meaningfully be described by mean µ and standard deviation σ?
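One possible way to organise the exercise is to loop over the numeric columns of the sample and draw both diagnostic plots for each. The sketch below is only a starting point: it uses an invented stand-in data frame so that it runs without the course data, and the column names and distributions are hypothetical, not those of the census or wikipedia data sets.

```r
# Stand-in for the Survey sample from the exercise; column names and
# distributions are invented for illustration only.
set.seed(7)  # arbitrary seed
Survey <- data.frame(
  height = rnorm(100, mean = 170, sd = 10),
  income = rlnorm(100, meanlog = 10, sdlog = 0.5),
  sex    = factor(sample(c("f", "m"), 100, replace = TRUE))
)

# Only numerical variables can be assessed for normality
numeric_vars <- names(Survey)[sapply(Survey, is.numeric)]

old_par <- par(mfrow = c(length(numeric_vars), 2))  # one row of plots per variable
for (v in numeric_vars) {
  x <- Survey[[v]]
  # Method 1: histogram, estimated density and best-matching Gaussian
  hist(x, freq = FALSE, main = v, xlab = v)
  lines(density(x))
  xG <- seq(min(x), max(x), length.out = 100)
  lines(xG, dnorm(xG, mean(x), sd(x)), col = "red")
  # Method 2: quantile-quantile plot
  qqnorm(x, main = paste("Q-Q plot:", v))
  qqline(x, col = "red")
}
par(old_par)  # restore the previous plot layout
```

With the real data, replacing the stand-in by the `Survey` sample drawn above should be the only change needed, assuming the variables of interest are stored as numeric columns.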
