measures of centrality


Last lecture summary

• Which graphs did we meet?
  • scatter plot (bodový graf)
  • bar chart (sloupcový graf)
  • histogram
  • pie chart (koláčový graf)
• How do they work, what are their advantages and/or disadvantages?

SDA women – histogram of heights 2014

n = 48 or N = 48

bin size = 3.8

Distributions

negatively skewed (skewed to the left)

positively skewed (skewed to the right)

http://turnthewheel.org/free-textbooks/street-smart-stats/

negatively skewed: e.g., life expectancy
symmetric: e.g., body height
positively skewed: e.g., income

STATISTICS IS BEAUTIFUL

new stuff

Life expectancy data

• Watch the TED talk by Hans Rosling, Gapminder Foundation:

http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html




STATISTICS IS DEEP

UC Berkeley

Though the data are fake, the paradox is the same

Simpson’s paradox

www.udacity.com – Introduction to statistics

Male

         Applied  Admitted  Rate [%]
MAJOR A      900       450
MAJOR B      100        10


Male

         Applied  Admitted  Rate [%]
MAJOR A      900       450        50
MAJOR B      100        10        10


Female

         Applied  Admitted  Rate [%]
MAJOR A      100        80
MAJOR B      900       180


Female

         Applied  Admitted  Rate [%]
MAJOR A      100        80        80
MAJOR B      900       180        20

Gender bias

What do you think, is there a gender bias?

Who do you think is favored? Male or female?




Gender bias


                Applied  Admitted  Rate [%]
Both (male)        1000       460        46
Both (female)      1000       260        26


Gender bias

Rate [%]
           male  female
MAJOR A      50      80
MAJOR B      10      20
Both         46      26
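The rates on these slides can be recomputed from the applied/admitted counts. The following is a minimal Python sketch that uses only the (fake) numbers from the slides; the dictionary layout and the helper name `rate` are illustrative choices, not part of the lecture.

```python
# Applied / admitted counts as given on the slides (the data are fake).
admissions = {
    "male":   {"MAJOR A": (900, 450), "MAJOR B": (100, 10)},
    "female": {"MAJOR A": (100, 80),  "MAJOR B": (900, 180)},
}


def rate(applied, admitted):
    """Admission rate in percent."""
    return 100 * admitted / applied


# Rate within each major, per gender.
per_major = {
    gender: {major: rate(*counts) for major, counts in majors.items()}
    for gender, majors in admissions.items()
}

# Rate over both majors together, per gender.
overall = {
    gender: rate(
        sum(applied for applied, _ in majors.values()),
        sum(admitted for _, admitted in majors.values()),
    )
    for gender, majors in admissions.items()
}

for gender in admissions:
    print(gender, per_major[gender], "Both:", overall[gender])
```

Women have the higher rate within each major but the lower rate overall, because 900 of the 1000 women applied to MAJOR B, the major with the lower admission rate.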


Statistics is ambiguous

• This example illustrates how ambiguous statistics can be.

• The way you choose to graph your data may strongly influence what people believe to be the case.

“I never believe in statistics I didn’t doctor myself.”
“Nikdy nevěřím statistice, kterou si sám nezfalšuji.”

Who said that?

Winston Churchill


What is statistics?

• Statistics – the science of collecting, organizing, summarizing, analyzing and interpreting data
• Goal – use imperfect information (our data) to infer facts, make predictions, and make decisions

• Descriptive statistics – describing and summarising data with numbers or pictures

• Inferential statistics – making conclusions or decisions based on data

Variables

• variable – a value or characteristic that can vary from individual to individual
  • example: favorite color, age

• How are variables classified?

• quantitative variable – numerical values, often with units of measurement; they arise from a how much/how many question
  • example: age, annual income, number of children
  • continuous (spojitá proměnná), example: height, weight
  • discrete (diskrétní proměnná), example: number of children

• continuous variables can be discretized
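As a small illustration of discretizing a continuous variable, the sketch below groups body heights into 10 cm classes. The heights are hypothetical values made up for this example (they are not lecture data), and the helper name `discretize` and the class width are assumptions.

```python
# Hypothetical body heights in cm (a continuous variable); not lecture data.
heights_cm = [158.2, 163.7, 171.4, 166.0, 180.9, 175.3]


def discretize(value, width=10):
    """Return the lower edge of the width-wide class that contains value."""
    return int(value // width) * width


classes = [discretize(h) for h in heights_cm]
print(classes)  # [150, 160, 170, 160, 180, 170]
```

After this step each height is represented only by its class, so the variable takes a limited set of values.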

Variables

• categorical (qualitative) variables
  • categories that have no particular order
  • example: favorite color, gender, nationality
• ordinal
  • they are not numerical but their values have a natural order
  • example: temperature low/medium/high

variable (proměnná)

quantitative (kvantitativní)

categorical (kategorická)

continuous (spojitá)

discrete (diskrétní)

ordinal (ordinální)

Variables

Choosing a profession

Chemistry: 50 000 – 60 000
Geography: 40 000 – 55 000

www.udacity.com – Statistics

Choosing a profession

• We made an interval estimate.
• But ideally we want one number that describes the entire dataset. This allows us to quickly summarize all our data.


Choosing a profession

1. The value at which frequency is highest.

2. The value where frequency is lowest.

3. The value in the middle.

4. The largest value on the x-axis.

5. Mean

Chemistry Geography


Three big M’s

• The value at which frequency is highest is called the mode, i.e. the mode is the most common value.

• The value in the middle of the distribution is called the median.

• The mean is the arithmetic mean (“average” is a synonym).

Chemistry Geography


Quick quiz

• What is the mode in our data?

2 5 6 5 2 6 9 8 5 2 3 5
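One way to check the quiz answer is to count how often each value occurs. This is a minimal Python sketch using the standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

data = [2, 5, 6, 5, 2, 6, 9, 8, 5, 2, 3, 5]

counts = Counter(data)            # value -> frequency
highest = max(counts.values())    # the largest frequency
# A dataset can have several modes, so collect every value with that frequency.
modes = sorted(value for value, count in counts.items() if count == highest)

print(sorted(counts.items()))  # [(2, 3), (3, 1), (5, 4), (6, 2), (8, 1), (9, 1)]
print(modes)                   # [5]
```

The value 5 occurs four times, more often than any other value, so it is the mode.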


Mode in negatively skewed distribution


Mode in uniform distribution


Multimodal distribution


Mode in categorical data


More on the mode

True or False?

1. The mode can be used to describe any type of data we have, whether it’s numerical or categorical.

2. All scores in the dataset affect the mode.

3. If we take a lot of samples from the same population, the mode will be the same in each sample.

4. There is an equation for the mode.

• Ad 3.
  • http://onlinestatbook.com/stat_sim/sampling_dist/
  • http://www.shodor.org/interactivate/activities/Histogram/ - the mode changes as you change the bin size.

• Because 3. is not true, we can’t use the mode to learn something about our population. The mode depends on how you present the data.
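The dependence of the mode on the bin size can be tried out on the quick-quiz numbers. The sketch below is an illustration only: the function name `modal_bin` is hypothetical, bins are assumed to start at 0, and ties are resolved in favour of the lowest bin.

```python
from collections import Counter

data = [2, 5, 6, 5, 2, 6, 9, 8, 5, 2, 3, 5]


def modal_bin(values, width):
    """Return (lower, upper) edges of the most populated bin of the given width.

    Bins are [0, width), [width, 2*width), ...; ties go to the lowest bin.
    """
    counts = Counter((v // width) * width for v in values)
    best = max(counts.values())
    lower = min(edge for edge, count in counts.items() if count == best)
    return lower, lower + width


for width in (1, 3, 4):
    print(width, modal_bin(data, width))
# 1 (5, 6)
# 3 (3, 6)
# 4 (4, 8)
```

The same twelve numbers give a different most frequent bin for each bin width, which is the effect the Shodor histogram applet shows interactively.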



Life expectancy data

www.coursera.org – Statistics: Making Sense of Data

Minimum

Sierra Leone

minimum = 47.8


Maximum

Japan

maximum = 83.4



all countries, ranked from 1 to 197

Egypt: position 99 (the middle), 73.2

half smaller, half larger



Minimum = 47.8

Maximum = 83.4

Median = 73.2


Q1

countries ranked from 1 to 197

Sao Tomé & Príncipe: position 50 (¼ way)

1st quartile = 64.7


Q1

¼ smaller, ¾ larger

1st quartile = 64.7


Q3

countries ranked from 1 to 197

Netherlands Antilles: position 148 (¾ way)

3rd quartile = 76.7


Q3

3rd quartile = 76.7

¾ smaller ¼ larger



Minimum = 47.8

Maximum = 83.4

Median = 73.2

1st quartile = 64.7

3rd quartile = 76.7
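The slides place the 1st quartile, the median and the 3rd quartile at positions 50, 99 and 148 among the 197 ranked countries. One common convention that reproduces exactly these positions is 1 + (n − 1)·p; that the lecture uses this rule is an assumption, and the function name `position` is illustrative.

```python
def position(n, fraction):
    """1-based position lying `fraction` of the way through n ranked values."""
    return 1 + (n - 1) * fraction


n = 197  # number of countries in the life expectancy data
for name, fraction in [("1st quartile", 0.25), ("median", 0.5), ("3rd quartile", 0.75)]:
    print(name, position(n, fraction))
# 1st quartile 50.0
# median 99.0
# 3rd quartile 148.0
```

When the position is not a whole number, the value is usually interpolated between the two neighbouring ranked values; the details differ between textbooks and software.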


Box Plot


Box plot

minimum

1st quartile

median

3rd quartile

maximum

Modified box plot

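IQR = interquartile range (3rd quartile − 1st quartile)

whiskers reach at most 1.5 × IQR beyond the box

points farther out are drawn individually as outliers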
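Applied to the life expectancy summary from the earlier slides, the 1.5 × IQR rule can be sketched as follows. The variable names are illustrative, and the maximum of 83.4 is taken from the summary slides.

```python
# Summary values of the life expectancy data from the slides.
minimum, q1, q3, maximum = 47.8, 64.7, 76.7, 83.4

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this count as outliers
upper_fence = q3 + 1.5 * iqr     # values above this count as outliers

print(f"IQR = {iqr:.1f}")
print(f"fences = {lower_fence:.1f} and {upper_fence:.1f}")

# If even the extremes lie inside the fences, no value is an outlier.
has_outliers = minimum < lower_fence or maximum > upper_fence
print("outliers present:", has_outliers)
```

With an IQR of about 12.0 the fences lie near 46.7 and 94.7, so even Sierra Leone (47.8) is not flagged as an outlier under this rule.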

Quartiles, median – how to do it?

79, 68, 88, 69, 90, 74, 87, 93, 76

Find min, max, median, Q1, Q3 in these data. Then, draw the box plot.


Another example

 Min.  1st Qu.  Median  3rd Qu.   Max.
68.00    75.00   81.00    88.50  93.00

78, 93, 68, 84, 90, 74
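The summary for these six values can be reproduced with Python's standard `statistics` module (Python 3.8 or newer). The choice `method="inclusive"` is an assumption: it was picked because it gives the same quartiles as the table on this slide, while other quartile conventions give slightly different numbers. The function name `five_number_summary` is illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), interpolating linearly between ranks."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


print(five_number_summary([78, 93, 68, 84, 90, 74]))
# (68, 75.0, 81.0, 88.5, 93)
```

The same function applied to the nine values of the previous exercise returns (68, 74.0, 79.0, 88.0, 93); a textbook rule that splits the data into halves without the median would report different quartiles.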

Percentiles

age [years]

http://www.rustovyhormon.cz/on-line-rustove-grafy

3rd M – Mean

• Mathematical notation: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• $\Sigma$ … Greek letter capital sigma
• means SUM in mathematics

• Another measure of the center of the data: mean (average)

• Data values: $x_1, x_2, \ldots, x_n$

Salaries of 25 players of an American soccer team (New York Red Bulls) in 2012.

   33 750     33 750     33 750     33 750     44 000
   44 000     44 000     44 000     45 566     65 000
   95 000    103 500    112 495    138 188    141 666
  181 500    185 000    190 000    194 375    195 000
  205 000    292 500    301 999  4 600 000  5 600 000

median = 112 495

mean = 518 311

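The mean is not a robust statistic: the two multi-million salaries pull it far above what a typical player earns.

The median is a robust statistic: it barely reacts to a few extreme values.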
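The difference in robustness can be checked on the salary data. This minimal sketch uses Python's standard `statistics` module and compares both measures with and without the two highest salaries; dropping those two values is an illustrative experiment, not a step from the lecture.

```python
import statistics

# The 25 salaries from the slide.
salaries = [
    33_750, 33_750, 33_750, 33_750, 44_000,
    44_000, 44_000, 44_000, 45_566, 65_000,
    95_000, 103_500, 112_495, 138_188, 141_666,
    181_500, 185_000, 190_000, 194_375, 195_000,
    205_000, 292_500, 301_999, 4_600_000, 5_600_000,
]

mean_all = statistics.mean(salaries)
median_all = statistics.median(salaries)

# Illustrative experiment: leave out the two highest salaries.
without_top_two = sorted(salaries)[:-2]
mean_rest = statistics.mean(without_top_two)
median_rest = statistics.median(without_top_two)

print(f"all 25 players:  mean = {mean_all:,.0f}, median = {median_all:,}")
print(f"without top two: mean = {mean_rest:,.0f}, median = {median_rest:,}")
```

Removing two values out of 25 moves the mean from about 518 000 to about 120 000, while the median only shifts from 112 495 to 103 500.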

Robust statistic

10% trimmed mean … eliminate the upper and lower 10% of the data, then average the rest

The trimmed mean is more robust than the mean.

Trimmed mean

   33 750     33 750     33 750     33 750     44 000
   44 000     44 000     44 000     45 566     65 000
   95 000    103 500    112 495    138 188    141 666
  181 500    185 000    190 000    194 375    195 000
  205 000    292 500    301 999  4 600 000  5 600 000

median = 112 495

mean = 518 311

10% trimmed mean = 128 109
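A 10% trimmed mean of the salary data can be sketched as below. With 25 values, 10% is 2.5 values per end; rounding this down to 2 is an assumption, chosen because it reproduces the 128 109 shown on the slide. The function name `trimmed_mean` is illustrative.

```python
# The 25 salaries from the slide.
salaries = [
    33_750, 33_750, 33_750, 33_750, 44_000,
    44_000, 44_000, 44_000, 45_566, 65_000,
    95_000, 103_500, 112_495, 138_188, 141_666,
    181_500, 185_000, 190_000, 194_375, 195_000,
    205_000, 292_500, 301_999, 4_600_000, 5_600_000,
]


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end.

    The number of dropped values per end is rounded down (an assumption).
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # values removed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


print(trimmed_mean(salaries, 0.10))  # 128109.0
```

Here the two lowest and the two highest salaries are dropped and the remaining 21 values are averaged.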
