MEASURES OF CENTRALITY
Jan 05, 2016
Last lecture summary
• Which graphs did we meet?
  • scatter plot (bodový graf)
  • bar chart (sloupcový graf)
  • histogram
  • pie chart (koláčový graf)
• How do they work, what are their advantages and/or disadvantages?
SDA women – histogram of heights 2014
n = 48 or N = 48
bin size = 3.8
Distributions
negatively skewed (skewed to the left), e.g., life expectancy
symmetric, e.g., body height
positively skewed (skewed to the right), e.g., income
http://turnthewheel.org/free-textbooks/street-smart-stats/
STATISTICS IS BEAUTIFUL
new stuff
Life expectancy data
• Watch TED talk by Hans Rosling, Gapminder Foundation:
http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html
STATISTICS IS DEEP
UC Berkeley
Though the data are fake, the paradox is the same
Simpson’s paradox
www.udacity.com – Introduction to statistics
Male
         Applied  Admitted  Rate [%]
MAJOR A  900      450       50
MAJOR B  100      10        10
Female
         Applied  Admitted  Rate [%]
MAJOR A  100      80        80
MAJOR B  900      180       20
Gender bias
What do you think, is there a gender bias?
Who do you think is favored? Male or female?
Gender bias
Male
         Applied  Admitted  Rate [%]
MAJOR A  900      450       50
MAJOR B  100      10        10
Both     1000     460       46
Female
         Applied  Admitted  Rate [%]
MAJOR A  100      80        80
MAJOR B  900      180       20
Both     1000     260       26
Gender bias
         Rate [%]
         male  female
MAJOR A  50    80
MAJOR B  10    20
Both     46    26
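The arithmetic behind the paradox can be checked with a short sketch. It assumes only the applied/admitted counts from the tables above; the names `male`, `female`, `rate` and `overall_rate` are illustrative, not part of the lecture.

```python
# (applied, admitted) per major, copied from the tables above
male = {"MAJOR A": (900, 450), "MAJOR B": (100, 10)}
female = {"MAJOR A": (100, 80), "MAJOR B": (900, 180)}


def rate(applied, admitted):
    """Admission rate in percent."""
    return 100 * admitted / applied


def overall_rate(group):
    """Admission rate in percent over all majors pooled together."""
    applied = sum(a for a, _ in group.values())
    admitted = sum(b for _, b in group.values())
    return rate(applied, admitted)


# Within each major the female rate is higher ...
for major in male:
    print(major, rate(*male[major]), rate(*female[major]))

# ... yet the pooled rate is higher for males.
print("Both", overall_rate(male), overall_rate(female))
```

The reversal comes from the unequal group sizes: most men applied to the major with the high admission rate, most women to the major with the low one.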
Statistics is ambiguous
• This example illustrates how ambiguous statistics can be.
• How you choose to graph your data may strongly influence what people believe to be the case.
“I never believe in statistics I didn’t doctor myself.”
“Nikdy nevěřím statistice, kterou si sám nezfalšuji.”
Who said that?
Winston Churchill (commonly attributed; the attribution is disputed)
What is statistics?
• Statistics – the science of collecting, organizing, summarizing, analyzing and interpreting data
• Goal – use imperfect information (our data) to infer facts, make predictions, and make decisions
• Descriptive statistics – describing and summarizing data with numbers or pictures
• Inferential statistics – making conclusions or decisions based on data
Variables
• variable – a value or characteristic that can vary from individual to individual
  • example: favorite color, age
• How are variables classified?
• quantitative variable – numerical values, often with units of measurement, arising from a "how much/how many" question
  • example: age, annual income, number of children
  • continuous (spojitá proměnná), example: height, weight
  • discrete (diskrétní proměnná), example: number of children
• continuous variables can be discretized
Variables
• categorical (qualitative) variables
  • categories that have no particular order
  • example: favorite color, gender, nationality
• ordinal
  • they are not numerical but their values have a natural order
  • example: temperature low/medium/high
Variables
variable (proměnná)
  quantitative (kvantitativní)
    continuous (spojitá)
    discrete (diskrétní)
  categorical (kategorická)
    ordinal (ordinální)
Choosing a profession
Chemistry: 50 000 – 60 000
Geography: 40 000 – 55 000
www.udacity.com – Statistics
Choosing a profession
• We made an interval estimate.
• But ideally we want one number that describes the entire dataset. This allows us to quickly summarize all our data.
Choosing a profession
1. The value at which frequency is highest.
2. The value where frequency is lowest.
3. Value in the middle.
4. Biggest value of x-axis.
5. Mean
Chemistry Geography
Three big M’s
• The value at which frequency is highest is called the mode. i.e. the most common value is the mode.
• The value in the middle of the distribution is called the median.
• The mean is the arithmetic average ("average" is a synonym).
Chemistry Geography
Quick quiz
• What is the mode in our data?
2 5 6 5 2 6 9 8 5 2 3 5
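The quiz can be checked by counting how often each value occurs. This is a minimal sketch using Python's standard library; the helper name `modes` is an assumption made for illustration. It returns a list because a dataset may have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


data = [2, 5, 6, 5, 2, 6, 9, 8, 5, 2, 3, 5]
print(Counter(data))  # frequency of each value
print(modes(data))    # the most common value(s)
```

The same function works for categorical data, for example a list of favorite colors given as strings.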
Mode in negatively skewed distribution
Mode in uniform distribution
Multimodal distribution
Mode in categorical data
More of mode
True or False?
1. The mode can be used to describe any type of data we have, whether it’s numerical or categorical.
2. All scores in the dataset affect the mode.
3. If we take a lot of samples from the same population, the mode will be the same in each sample.
4. There is an equation for the mode.
• Ad 3.
  • http://onlinestatbook.com/stat_sim/sampling_dist/
  • http://www.shodor.org/interactivate/activities/Histogram/ - the mode changes as you change the bin size.
• Because 3. is not true, we can’t use the mode to learn something about our population. The mode depends on how you present the data.
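The dependence of the mode on the bin size can be illustrated with a small sketch. It reuses the twelve quick-quiz values; the function `modal_bin` and the choice of bins starting at 0 are assumptions made for this example only.

```python
from collections import Counter


def modal_bin(values, bin_size, start=0):
    """Return (lower, upper) of the histogram bin [lower, upper) with the most values."""
    counts = Counter((v - start) // bin_size for v in values)
    index, _ = counts.most_common(1)[0]
    lower = start + index * bin_size
    return lower, lower + bin_size


data = [2, 5, 6, 5, 2, 6, 9, 8, 5, 2, 3, 5]
for size in (1, 3, 4):
    print("bin size", size, "-> modal bin", modal_bin(data, size))
```

The same data give a different modal bin for each bin size, so a mode read off a histogram describes the presentation as much as the data.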
Life expectancy data
www.coursera.org – Statistics: Making Sense of Data
Minimum
Sierra Leone
minimum = 47.8
Maximum
Japan
maximum = 83.4
Life expectancy data
all countries
Life expectancy data
Countries ranked from 1 to 197; the middle one (rank 99) is Egypt.
median = 73.2 (half of the values are larger, half are smaller)
Life expectancy data
Minimum = 47.8
Maximum = 83.4
Median = 73.2
Q1
Countries ranked from 1 to 197; rank 50 (¼ of the way) is Sao Tomé & Príncipe.
1st quartile = 64.7 (¼ of the values are smaller, ¾ are larger)
Q3
Countries ranked from 1 to 197; rank 148 (¾ of the way) is the Netherlands Antilles.
3rd quartile = 76.7 (¾ of the values are smaller, ¼ are larger)
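The ranks 50, 99 and 148 can be reproduced with a short sketch. It assumes the linear-interpolation convention for quantiles (the default in R and NumPy), which is one of several conventions in use; the function name `quantile_rank` is illustrative.

```python
def quantile_rank(n, p):
    """1-based rank of the p-quantile among n sorted values.

    Assumes the linear-interpolation convention: rank = 1 + (n - 1) * p.
    A fractional rank means interpolating between two neighbouring values.
    """
    return 1 + (n - 1) * p


n = 197  # number of countries in the life expectancy data
for name, p in [("Q1", 0.25), ("median", 0.50), ("Q3", 0.75)]:
    print(name, quantile_rank(n, p))
```

With 197 countries all three ranks are whole numbers, so each quartile is the value of a single country and no interpolation is needed.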
Life expectancy data
Minimum = 47.8
Maximum = 83.4
Median = 73.2
1st quartile = 64.7
3rd quartile = 76.7
Box Plot
Box plot
Labelled parts: minimum, 1st quartile, median, 3rd quartile, maximum
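Modified box plot
IQR = interquartile range (3rd quartile minus 1st quartile)
Whiskers reach at most 1.5 x IQR beyond the box; points farther out are drawn individually as outliers.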
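The 1.5 x IQR rule can be sketched in a few lines. The example plugs in the quartiles, minimum and maximum of the life expectancy data from the earlier slides; the function names `fences` and `is_outlier` are assumptions made for illustration.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper limits beyond which a point counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if value lies more than k * IQR outside the box."""
    lower, upper = fences(q1, q3, k)
    return value < lower or value > upper


# Life expectancy data: 1st quartile = 64.7, 3rd quartile = 76.7
lower, upper = fences(64.7, 76.7)
print(round(lower, 1), round(upper, 1))

# Check the extreme values: minimum = 47.8, maximum = 83.4
print(is_outlier(47.8, 64.7, 76.7), is_outlier(83.4, 64.7, 76.7))
```

With these five numbers neither the minimum nor the maximum falls outside the limits, so by this rule the whiskers would run all the way to the extremes.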
Quartiles, median – how to do it?
79, 68, 88, 69, 90, 74, 87, 93, 76
Find min, max, median, Q1, Q3 in these data. Then, draw the box plot.
Another example
Min. 1st Qu. Median 3rd Qu. Max.
68.00 75.00 81.00 88.50 93.00
78, 93, 68, 84, 90, 74
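The summary above can be reproduced with Python's standard library. This is a sketch that assumes the "inclusive" quantile method of `statistics.quantiles` (Python 3.8+), which uses linear interpolation and matches the numbers shown; other quartile conventions can give slightly different Q1 and Q3. The function name `five_number_summary` is illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using linear interpolation."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Data from this slide
print(five_number_summary([78, 93, 68, 84, 90, 74]))

# Data from the previous exercise
print(five_number_summary([79, 68, 88, 69, 90, 74, 87, 93, 76]))
```

These five numbers are exactly what is needed to draw the box plot: the box spans Q1 to Q3, the line inside marks the median, and the whiskers reach the minimum and maximum.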
Percentiles
age [years]
http://www.rustovyhormon.cz/on-line-rustove-grafy
3rd M – Mean
• Another measure of the center of the data: mean (average)
• Mathematical notation:
  • $\Sigma$ … Greek letter capital sigma
  • means SUM in mathematics
• Data values: $x_1, x_2, \ldots, x_n$
• Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Salaries of 25 players of an American soccer team (New York Red Bulls) in 2012.
33 750
33 750
33 750
33 750
44 000
44 000
44 000
44 000
45 566
65 000
95 000
103 500
112 495
138 188
141 666
181 500
185 000
190 000
194 375
195 000
205 000
292 500
301 999
4 600 000
5 600 000
median = 112 495
mean = 518 311
Mean is not a robust statistic.
Median is a robust statistic.
Robust statistic
10% trimmed mean … eliminate the upper and lower 10% of the data, then average the rest
The trimmed mean is more robust than the mean.
Trimmed mean
median = 112 495
mean = 518 311
10% trimmed mean = 128 109
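The three numbers can be recomputed from the 25 salaries. This is a minimal sketch; the helper `trimmed_mean` is illustrative and assumes that the number of values dropped from each end is 10% of the sample size rounded down (here 2 of 25), which reproduces the value on the slide.

```python
import statistics

salaries = [
    33750, 33750, 33750, 33750, 44000, 44000, 44000, 44000, 45566,
    65000, 95000, 103500, 112495, 138188, 141666, 181500, 185000,
    190000, 194375, 195000, 205000, 292500, 301999, 4600000, 5600000,
]


def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values dropped per end, rounded down
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


print("median      ", statistics.median(salaries))
print("mean        ", statistics.mean(salaries))
print("trimmed mean", trimmed_mean(salaries, 0.10))
```

The two multi-million salaries pull the mean far above what a typical player earns, while the median and the trimmed mean barely notice them.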