There are several TYPES of variables that reflect characteristics of the data:
Ratio, Interval, Ordinal, Nominal
Ratio scale
constant size interval between adjacent values on the measurement scale
existence of a meaningful zero point
Interval scale
constant size interval between adjacent values on the measurement scale
no true zero value
[Figure: interval-scale examples, including compass directions (N, E, S, W)]
Ordinal scale
data that convey only relative magnitude
Tall, Medium, Short
Dark, Medium, Light
Nominal scale
data in which there is no meaningful numerical information
Single, Married, Divorced, Widowed
Another useful classification
Continuous: data can take on any value
Eg. height, 150 to 210 cm range: Bill - 174.25 cm
Discrete: data can take on only certain values
Eg. # of hands, 0 to 3 range: Bill - 2 hands
2 more important issues with data
Accuracy: how close is a measured value to the real value
Precision: how close repeated measurements are to one another
Let’s say Bill’s real height is 174.25 cm.
Accurate, Precise: 174.25, 174.25, 174.25, 174.25, 174.25, 174.25
Not Accurate, Not Precise: 172, 178, 171, 174, 182, 168
Not Accurate, Precise: 170.25, 170.25, 170.25, 170.25, 170.25, 170.25
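One way to put numbers on these two ideas, as a minimal sketch: accuracy is summarised here as the mean absolute error from the true value, and precision as the sample standard deviation of the repeated measurements. Both summaries and the function names are illustrative assumptions, not definitions from the slides.

```python
from statistics import mean, stdev

TRUE_HEIGHT = 174.25  # Bill's real height in cm


def mean_abs_error(measurements, true_value):
    """Average distance of the measurements from the true value (accuracy)."""
    return mean(abs(m - true_value) for m in measurements)


def spread(measurements):
    """Sample standard deviation of the repeated measurements (precision)."""
    return stdev(measurements)


accurate_precise = [174.25] * 6
not_accurate_not_precise = [172, 178, 171, 174, 182, 168]
not_accurate_precise = [170.25] * 6

for label, data in [
    ("Accurate, Precise", accurate_precise),
    ("Not Accurate, Not Precise", not_accurate_not_precise),
    ("Not Accurate, Precise", not_accurate_precise),
]:
    # Small error = accurate; small spread = precise.
    print(label, mean_abs_error(data, TRUE_HEIGHT), spread(data))
```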
Frequency Distribution
occurrence of the various values observed for the variable
raw frequency: counts
relative frequency: counts divided by total number of observations
Name Height (cm) Hair Colour
Anne 168 Brown
Rishi 178 Black
Bill 183 Brown
Cristin 172 Brown
Rich 175 Black
Variable: Hair Colour
Sample size = 5
Frequency of Black Hair = 2
Frequency of Brown Hair = 3
Must add to 5
Relative Frequency of Black Hair = 2/5 = 0.4
Relative Frequency of Brown Hair = 3/5 = 0.6
Must add to 1
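The hair-colour counts can be reproduced with a short sketch. The variable names are illustrative, and the list simply transcribes the Hair Colour column of the table.

```python
from collections import Counter

# Hair colour for Anne, Rishi, Bill, Cristin, Rich (from the table)
hair = ["Brown", "Black", "Brown", "Brown", "Black"]

n = len(hair)                      # sample size
freq = Counter(hair)               # raw frequency: counts per colour
rel_freq = {colour: count / n for colour, count in freq.items()}

print(freq)       # raw frequencies add to n
print(rel_freq)   # relative frequencies add to 1
```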
Variable: Height
Sample size = 5
Frequency of 168 cm = 1
Frequency of 172 cm = 1
Frequency of 175 cm = 1
Frequency of 178 cm = 1
Frequency of 183 cm = 1
Relative Frequency of 168 cm = 1/5 = 0.2
Relative Frequency of 172 cm = 1/5 = 0.2
Relative Frequency of 175 cm = 1/5 = 0.2
Relative Frequency of 178 cm = 1/5 = 0.2
Relative Frequency of 183 cm = 1/5 = 0.2
Make categories
Eg. Number above and number below mid-point of range
Range: Maximum - Minimum
183 cm - 168 cm = 15 cm
Mid-point: half way between Min and Max
= Min + (Range / 2)
= 168 cm + 7.5 cm
= 175.5 cm
Frequency of Heights Below 175.5 cm = 3
Frequency of Heights Above 175.5 cm = 2
Relative Frequency of Heights Below 175.5 cm = 3/5 = 0.6
Relative Frequency of Heights Above 175.5 cm = 2/5 = 0.4
Could make THREE categories
Divide range by 3: 15 cm / 3 = 5 cm
Category 1: 168 cm to 168 cm + 5 cm --> 168 cm to 173 cm
Category 2: 174 cm to 174 cm + 5 cm --> 174 cm to 179 cm
Category 3: 180 cm to 180 cm + 5 cm --> 180 cm to 185 cm
Frequency of Heights in 168 cm to 173 cm = 2
Frequency of Heights in 174 cm to 179 cm = 2
Frequency of Heights in 180 cm to 185 cm = 1
Relative Frequency of Heights in 168 cm to 173 cm = 2/5 = 0.4
Relative Frequency of Heights in 174 cm to 179 cm = 2/5 = 0.4
Relative Frequency of Heights in 180 cm to 185 cm = 1/5 = 0.2
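The two groupings of the height data can be sketched as follows. The helper `count_in_categories` is a hypothetical name, and the three categories are the 168-173, 174-179 and 180-185 cm ranges derived above, treated as inclusive at both ends because heights are recorded to the nearest cm.

```python
heights = [168, 178, 183, 172, 175]  # Anne, Rishi, Bill, Cristin, Rich (cm)
n = len(heights)

# Two categories: below and above the mid-point of the range
value_range = max(heights) - min(heights)        # 183 - 168 = 15
mid_point = min(heights) + value_range / 2       # 168 + 7.5 = 175.5
below = sum(h < mid_point for h in heights)
above = sum(h > mid_point for h in heights)


def count_in_categories(values, categories):
    """Count the values falling in each inclusive (low, high) category."""
    return [sum(low <= v <= high for v in values) for low, high in categories]


# Three categories, each 5 cm wide
categories = [(168, 173), (174, 179), (180, 185)]
freq = count_in_categories(heights, categories)
rel_freq = [f / n for f in freq]

print(below, above)
print(freq, rel_freq)
```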
Mother’s age and baby’s birth weight (g) data from Massachusetts (each pair: age, weight)
19 2523; 33 2551; 20 2557; 21 2594; 18 2600; 21 2622; 22 2637; 17 2637; 29 2663; 26 2665; 19 2722
19 2733; 22 2750; 30 2750; 18 2769; 18 2769; 15 2778; 25 2782; 20 2807; 28 2821; 32 2835; 31 2835
36 2836; 28 2863; 25 2877; 28 2877; 17 2906; 29 2920; 26 2920; 17 2920; 17 2920; 24 2948; 35 2948
25 2977; 25 2977; 29 2977; 19 2977; 27 2992; 31 3005; 33 3033; 21 3042; 19 3062; 23 3062; 21 3062
18 3076; 18 3076; 32 3080; 19 3090; 24 3090; 22 3090; 22 3100; 23 3104; 22 3132; 30 3147; 19 3175
16 3175; 21 3203; 30 3203; 20 3203; 17 3225; 17 3225; 23 3232; 24 3232; 28 3234; 26 3260; 20 3274
24 3274; 28 3303; 20 3317; 22 3317; 22 3317; 31 3321; 23 3331; 16 3374; 16 3374; 18 3402; 25 3416
32 3430; 20 3444; 23 3459; 22 3460; 32 3473; 30 3475; 20 3487; 23 3544; 17 3572; 19 3572; 23 3586
36 3600; 22 3614; 24 3614; 21 3629; 19 3629; 25 3637; 16 3643; 29 3651; 29 3651; 19 3651; 19 3651
30 3699; 24 3728; 19 3756; 24 3770; 23 3770; 20 3770; 25 3790; 30 3799; 22 3827; 18 3856; 16 3860
32 3860; 18 3884; 29 3884; 33 3912; 20 3940; 28 3941; 14 3941; 28 3969; 25 3983; 16 3997; 20 3997
26 4054; 21 4054; 22 4111; 25 4153; 31 4167; 35 4174; 19 4238; 24 4593; 45 4990; 28 709; 29 1021
34 1135; 25 1330; 25 1474; 27 1588; 23 1588; 24 1701; 24 1729; 21 1790; 32 1818; 19 1885; 25 1893
16 1899; 25 1928; 20 1928; 21 1928; 24 1936; 21 1970; 20 2055; 25 2055; 19 2082; 19 2084; 26 2084
24 2100; 17 2125; 20 2126; 22 2187; 27 2187; 20 2211; 17 2225; 25 2240; 20 2240; 18 2282; 18 2296
20 2296; 21 2301; 26 2325; 31 2353; 15 2353; 23 2367; 20 2381; 24 2381; 15 2381; 23 2395; 30 2410
22 2410; 17 2414; 23 2424; 17 2438; 26 2442; 20 2450; 26 2466; 14 2466; 28 2466; 14 2495; 23 2495
17 2495; 21 2495
Range of the Birth Weight data:
Minimum: 709 g
Maximum: 4990 g
Difference: 4281 g
Let’s say we want to look at the distribution of data across 10 categories.
Each category would span 428.1 g, but for convenience we’ll round to 430 g.
Also, instead of starting our first category at 709 g, we’ll use 700 g.
Category  Range (g)   Freq.  Rel. Freq.
1         700-1130    3      0.015873016
2         1131-1560   3      0.015873016
3         1561-1990   14     0.074074074
4         1991-2420   29     0.153439153
5         2421-2850   34     0.17989418
6         2851-3280   44     0.232804233
7         3281-3710   33     0.174603175
8         3711-4140   23     0.121693122
9         4141-4570   4      0.021164021
10        4571-5000   2      0.010582011
Previous breakdown is OK as long as I have measured weight to the nearest gram.
BUT, if I’ve measured to the nearest 0.1 gram
--> my categories may miss some observations (e.g. 1130.4 g falls between 700-1130 and 1131-1560)
So need to adjust…
Category  Measured to the nearest gram  Measured to the nearest 0.1 gram
1         700-1130                      700-1130.9
2         1131-1560                     1131-1560.9
3         1561-1990                     1561-1990.9
4         1991-2420                     1991-2420.9
5         2421-2850                     2421-2850.9
6         2851-3280                     2851-3280.9
7         3281-3710                     3281-3710.9
8         3711-4140                     3711-4140.9
9         4141-4570                     4141-4570.9
10        4571-5000                     4571-5000.9
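A minimal sketch of how such category limits could be generated for a given measurement resolution. The function name and parameters are hypothetical; the rule assumed here is that each category after the first starts on a whole gram, and each upper limit sits one resolution step below the next lower limit, so that no recordable value falls between categories.

```python
def category_ranges(start, width, k, resolution):
    """Return k (lower, upper) limits with no gaps at the given resolution."""
    ranges = []
    for i in range(k):
        # The first category starts at `start`; later ones start 1 g past
        # the previous whole-gram boundary.
        lower = start if i == 0 else start + i * width + 1
        # Upper limit is one resolution step below the next lower limit.
        upper = round(start + (i + 1) * width + 1 - resolution, 3)
        ranges.append((lower, upper))
    return ranges


nearest_gram = category_ranges(start=700, width=430, k=10, resolution=1)
nearest_tenth = category_ranges(start=700, width=430, k=10, resolution=0.1)

for cat, (g, t) in enumerate(zip(nearest_gram, nearest_tenth), start=1):
    print(cat, g, t)
```

An alternative that avoids the issue altogether is to use half-open intervals (lower limit included, upper limit excluded), which is what most histogram routines do.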
Histogram - graphical representation of a frequency distribution
[Figure: histogram of hair colour; x-axis: Hair colour (Brown Hair, Black Hair); y-axis: Frequency]
[Figure: Frequency distribution of neonatal birth weight; x-axis: Birth Weight Category (1-10); y-axis: Frequency]
[Figure: Frequency distribution of neonatal birth weight; x-axis: Birth Weight Category (1-10); y-axis: Relative Frequency]
Category  Range (g)   Mid-point
1         700-1130    915
2         1131-1560   1346
3         1561-1990   1776
4         1991-2420   2206
5         2421-2850   2636
6         2851-3280   3066
7         3281-3710   3496
8         3711-4140   3926
9         4141-4570   4356
10        4571-5000   4786
[Figure: Frequency distribution of neonatal birth weight; x-axis: Birth Weight Category Mid-point; y-axis: Frequency]
Category  Range (g)   Freq.  Rel. Freq.  Cum. Freq.
1         700-1130    3      0.0158      0.0158
2         1131-1560   3      0.0158      0.0317
3         1561-1990   14     0.07407     0.1058
4         1991-2420   29     0.15343     0.2592
5         2421-2850   34     0.17989     0.4391
6         2851-3280   44     0.23280     0.6719
7         3281-3710   33     0.17460     0.8465
8         3711-4140   23     0.12169     0.9682
9         4141-4570   4      0.02116     0.9894
10        4571-5000   2      0.01058     1.0
Cumulative Frequency - Cum. Freq. at any category is equal to the frequency at that category plus the frequency in each previous category.
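The running total described here can be sketched with `itertools.accumulate`, using the ten birth-weight category frequencies from the table. Variable names are illustrative.

```python
from itertools import accumulate

# Frequency in each of the 10 birth-weight categories
freq = [3, 3, 14, 29, 34, 44, 33, 23, 4, 2]
n = sum(freq)

# Each entry = frequency of that category + all previous categories
cum_freq = list(accumulate(freq))

# Same idea on the relative scale; the last entry must equal 1
cum_rel_freq = [c / n for c in cum_freq]

for cat, (c, r) in enumerate(zip(cum_freq, cum_rel_freq), start=1):
    print(cat, c, round(r, 4))
```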
[Figure: Frequency distribution of neonatal birth weight; x-axis: Birth Weight Category (1-10); y-axis: Cumulative Frequency]
Measures of Central Tendency
Mean: Average
Median: Middle Value
Mode: Most Frequent Value
These generally tell you where the majority of the observations lie. Each one tells something slightly different.
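The Mean:
The mean is calculated by summing the observed values and dividing the sum by the total number of observations.
Population Mean = $\mu$
Sample Mean = $\bar{X}$
A die has 6 sides: 1 dot, 2, 3, 4, 5, and 6 dots.
$\mu = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5$ dots
$\bar{X} = \frac{2 + 3 + 4}{3} = 3$ dots
$\mu = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N}$
$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$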
Name     Observation (i)  Height ($X_i$)
Rishi    1                172
Anne     2                185
Bill     3                132
Cristin  4                191
Rich     5                205
n = 5
$\sum X_i = 885$
$\bar{X} = \frac{\sum X_i}{n} = \frac{885}{5} = 177$
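As a quick check of the arithmetic, a minimal sketch of the mean as "sum divided by number of observations". The function name is illustrative.

```python
def mean(values):
    """Sum the observed values and divide by the number of observations."""
    values = list(values)
    return sum(values) / len(values)


# Population: all six faces of a die
print(mean(range(1, 7)))        # 3.5

# Sample of five heights from the table
heights = [172, 185, 132, 191, 205]
print(sum(heights), mean(heights))   # 885 177.0
```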
For the Massachusetts birth weight data shown earlier:
n = 189
$\sum_{i=1}^{189} X_i = 556540$
$\bar{X} = \frac{\sum X_i}{n} = \frac{556540}{189} = 2944.656$
Another way to calculate the mean
Suppose you had a frequency distribution for the number of cancerous moles on people who regularly visit Club Med
# cancerous moles (X)  Frequency (f)  f * X
0                      8              0
1                      4              4
2                      8              16
3                      10             30
4                      2              8
5                      1              5
$n = \sum f = 33$
$\sum X = \sum f \cdot X = 63$
$\bar{X} = \frac{\sum f \cdot X}{\sum f} = \frac{63}{33} = 1.909$
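The frequency-table version of the mean can be sketched as a weighted sum. The lists transcribe the moles table; the variable names are illustrative.

```python
moles = [0, 1, 2, 3, 4, 5]      # X: number of cancerous moles
freq = [8, 4, 8, 10, 2, 1]      # f: number of people with that many moles

n = sum(freq)                                       # n = sum of f
total = sum(f * x for f, x in zip(freq, moles))     # sum of X = sum of f*X
mean_moles = total / n

print(n, total, round(mean_moles, 3))   # 33 63 1.909
```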
The Mode: the most frequently occurring value in a set of measurements
[Figure: Frequency distribution of neonatal birth weight; x-axis: Birth Weight Category (1-10); y-axis: Frequency]
For the grouped birth weight data, the category with the highest frequency is category 6 (2851-3280 g, frequency 44).
Mid-point is 3065.5 --> report the MODE as 3065.5
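A minimal sketch of reporting the mode of grouped data as the mid-point of the most frequent category. The helper `category_range` is hypothetical and rebuilds the category limits from the 700 g start and 430 g width used for the birth weight categories.

```python
freq = [3, 3, 14, 29, 34, 44, 33, 23, 4, 2]   # birth-weight category counts
START, WIDTH = 700, 430                        # first lower limit, category width (g)


def category_range(i):
    """Lower and upper limit (g) of category i, counting from 0."""
    lower = START if i == 0 else START + i * WIDTH + 1
    upper = START + (i + 1) * WIDTH
    return lower, upper


# Index of the category with the highest frequency (the modal category)
modal_index = max(range(len(freq)), key=freq.__getitem__)
lower, upper = category_range(modal_index)
mode_estimate = (lower + upper) / 2

print(modal_index + 1, (lower, upper), mode_estimate)   # 6 (2851, 3280) 3065.5
```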
The Median: the middle measurement of a set of data
--> data must be ordered
Heights (cm): 178, 143, 123, 189, 187, 205, 168, 173, 198
Ordered Heights (cm): 123, 143, 168, 173, 178, 187, 189, 198, 205
Observation (X): 1, 2, 3, 4, 5, 6, 7, 8, 9
Median is 178 cm
Heights (cm): 178, 143, 123, 189, 187, 205, 168, 173, 198, 162
Ordered Heights (cm): 123, 143, 162, 168, 173, 178, 187, 189, 198, 205
Observation (X): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Middle observation is 5.5 --> median is midway between observation 5 and observation 6
Median is (173+178)/2 = 175.5
General formula for Median:
If n is an odd number:
$M = X_{(n+1)/2} = X_{(9+1)/2} = X_{5} = 178$
If n is an even number:
$M = X_{(n+1)/2} = X_{(10+1)/2} = X_{5.5} = \frac{X_5 + X_6}{2} = \frac{173 + 178}{2} = 175.5$
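The odd and even cases can be sketched in one function. Note that positions in the formula count from 1, while Python lists are indexed from 0, so observation (n+1)/2 is at index n // 2 when n is odd.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)          # data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # observation (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2   # midway between n/2 and n/2 + 1


odd_heights = [178, 143, 123, 189, 187, 205, 168, 173, 198]
even_heights = odd_heights + [162]

print(median(odd_heights))    # 178
print(median(even_heights))   # 175.5
```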
# cancerous moles (X)  Frequency (f)  Cumulative Frequency
0                      8              8
1                      4              12
2                      8              20
3                      10             30
4                      2              32
5                      1              33
$M = X_{(n+1)/2} = X_{17} = 2$
Ordered observations: 0 0 0 0 0 0 0 0 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 5
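A minimal sketch of finding the median directly from the frequency table, without writing out all 33 observations: the median is the first value whose cumulative frequency reaches position (n+1)/2. This version assumes n is odd, as it is here.

```python
from itertools import accumulate

moles = [0, 1, 2, 3, 4, 5]      # X
freq = [8, 4, 8, 10, 2, 1]      # f

cum_freq = list(accumulate(freq))
n = cum_freq[-1]
position = (n + 1) // 2         # 17th ordered observation (n is odd)

# First value whose cumulative frequency reaches the middle position
median_moles = next(x for x, c in zip(moles, cum_freq) if c >= position)

print(cum_freq, position, median_moles)
```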
Category  Range (g)   Freq.  Cum. Freq.
1         700-1130    3      3
2         1131-1560   3      6
3         1561-1990   14     20
4         1991-2420   29     49
5         2421-2850   34     83
6         2851-3280   44     127
7         3281-3710   33     160
8         3711-4140   23     183
9         4141-4570   4      187
10        4571-5000   2      189
$M = X_{(n+1)/2} = X_{190/2} = X_{95}$, which falls in category 6 (2851-3280)
Median = (lower limit of class) + ((0.5*n - cum. freq. of the previous class) / # obs in interval) * (interval size)
= 2851 + ((0.5*189 - 83)/44) * 430
= 2851 + ((94.5 - 83)/44) * 430
= 2963.4
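The interpolation formula can be sketched as a small function. The name `grouped_median` and its parameters are illustrative; the lower limits are rebuilt from the 700 g start and 430 g width of the birth weight categories.

```python
def grouped_median(lower_limits, freqs, width):
    """Interpolated median for data grouped into equal-width classes."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("no observations")
    half = 0.5 * n
    cum_before = 0                      # cum. freq. of the previous class
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cum_before + f >= half:
            # The median lies (half - cum_before) / f of the way into this class
            return lower + (half - cum_before) / f * width
        cum_before += f


freq = [3, 3, 14, 29, 34, 44, 33, 23, 4, 2]
lower_limits = [700] + [700 + i * 430 + 1 for i in range(1, 10)]

print(round(grouped_median(lower_limits, freq, 430), 1))   # 2963.4
```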
[Figure: Frequency distribution of neonatal birth weight; x-axis: Birth Weight Category (1-10); y-axis: Frequency]
[Figure: Symmetrical, unimodal distribution; Mean, Mode and Median at the same position]
[Figure: Symmetrical, bimodal distribution; Mean and Median at the centre, one Mode at each peak]
[Figure: Asymmetric distribution; labelled, left to right: Mode, Median, Mean]
[Figure: Asymmetric distribution; labelled, left to right: Mean, Median, Mode]
Measures of Dispersion and Variability
[Figure: Frequency distribution of neonatal birth weight; x-axis: Birth Weight Category (1-10); y-axis: Frequency]
[Figure: Birth Weight (g) with the Mean, Maximum and Minimum marked; the Range spans Minimum to Maximum]
[Figure: Birth Weight (g) with the Mean, Maximum and Minimum marked, showing the Deviation of Observation i from the Mean]
Average Deviation from the Mean
--> on average, how much do the individual observations differ from the mean?
$\frac{\sum_{i=1}^{n} (X_i - \bar{X})}{n}$
i    $X_i$   $X_i - \bar{X}$
1    1.2     -0.6   (1.2 - 1.8 = -0.6)
2    1.4     -0.4
3    1.6     -0.2
4    1.8     0.0
5    2.0     0.2
6    2.2     0.4
7    2.4     0.6
$\sum X = 12.6$, n = 7, $\bar{X} = \frac{12.6}{7} = 1.8$
$\frac{\sum_{i=1}^{7} (X_i - \bar{X})}{7} = \frac{0}{7} = 0$
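Average Absolute Deviation from the Mean
--> on average, how much do the individual observations differ from the mean?
$\frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$
i    $X_i$   $X_i - \bar{X}$   $|X_i - \bar{X}|$
1    1.2     -0.6              0.6   (|1.2 - 1.8| = 0.6)
2    1.4     -0.4              0.4
3    1.6     -0.2              0.2
4    1.8     0.0               0.0
5    2.0     0.2               0.2
6    2.2     0.4               0.4
7    2.4     0.6               0.6
Sum  12.6    0.0               2.4
n = 7, $\bar{X} = \frac{12.6}{7} = 1.8$
$\frac{\sum_{i=1}^{7} |X_i - \bar{X}|}{7} = \frac{2.4}{7} = 0.34$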
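A short sketch of why the plain average deviation is not useful (it is always zero, up to floating-point rounding) while the average absolute deviation is. It uses the seven observations from the table; variable names are illustrative.

```python
x = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4]
n = len(x)
x_bar = sum(x) / n

# Signed deviations cancel out, so their average is zero
mean_deviation = sum(xi - x_bar for xi in x) / n

# Absolute deviations do not cancel
mean_abs_deviation = sum(abs(xi - x_bar) for xi in x) / n

print(round(x_bar, 2), round(mean_deviation, 10), round(mean_abs_deviation, 2))
```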
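Sum of Squared Deviations
$SS = \sum_{i=1}^{n} (X_i - \bar{X})^2$
“Sum of Squares”
i    $X_i$   $X_i - \bar{X}$   $|X_i - \bar{X}|$   $(X_i - \bar{X})^2$
1    1.2     -0.6              0.6                 0.36   ((-0.6)^2 = 0.36)
2    1.4     -0.4              0.4                 0.16
3    1.6     -0.2              0.2                 0.04
4    1.8     0.0               0.0                 0
5    2.0     0.2               0.2                 0.04
6    2.2     0.4               0.4                 0.16
7    2.4     0.6               0.6                 0.36
Sum  12.6    0.0               2.4                 1.12
n = 7, $\bar{X} = \frac{12.6}{7} = 1.8$, mean absolute deviation = 0.34
$\sum_{i=1}^{n} (X_i - \bar{X})^2 = 1.12$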
Variance
--> mean sum of squares
Sample: $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Population: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
For the same seven observations (sum of squares = 1.12, n = 7):
$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{1.12}{6} = 0.1867$
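The sum of squares and the two variance formulas can be checked with a minimal sketch. The population figure treats the seven values as if they were a whole population, purely to illustrate the different divisor; `statistics.variance` is used only as a cross-check.

```python
from statistics import variance

x = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4]
n = len(x)
x_bar = sum(x) / n

ss = sum((xi - x_bar) ** 2 for xi in x)   # sum of squared deviations

sample_variance = ss / (n - 1)            # s^2: divide by n - 1
population_variance = ss / n              # sigma^2: divide by N

print(round(ss, 2), round(sample_variance, 4), round(population_variance, 4))
print(round(variance(x), 4))              # library value for s^2
```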
Standard Deviation
Sample: $s = \sqrt{s^2}$
Population: $\sigma = \sqrt{\sigma^2}$
Coefficient of Variation
$V = \frac{s}{\bar{X}}$
--> allows comparison of variability among samples measured in different units or scales.
s expressed as a % of the mean
Mean Deviation  Variance  Standard deviation  CV
0.34            0.1867    0.43                0.24
0.26            0.1367    0.37                0.21
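A minimal sketch of the standard deviation and coefficient of variation, assuming the seven observations 1.2 to 2.4 from the worked variance example; it reproduces the first row of the summary (0.43 and 0.24).

```python
from statistics import mean, stdev

x = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4]

s = stdev(x)            # sample standard deviation = sqrt(s^2)
cv = s / mean(x)        # coefficient of variation: s relative to the mean

print(round(s, 2), round(cv, 2), f"{cv:.0%}")
```

Because the CV is unitless, it can be compared between samples measured in different units or scales.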
Standard Error of the Mean
Recall: $\bar{x}$ and s are estimates of μ and σ
How good are these measures?
Need level of uncertainty (due to sampling error) in the mean:
$SE_{\bar{x}} = s / \sqrt{n}$
Confidence Intervals
SE = measure of how far $\bar{x}$ is likely to be from μ
$\bar{x}$ ± 2 * SE = approximate 95% confidence interval
I.e. μ is within 2 * SE of $\bar{x}$ about 95% of the time
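A minimal sketch of the standard error and the "2 * SE" interval. The function name and the `multiplier` parameter are hypothetical, and the sample is the seven observations 1.2 to 2.4 from the variance example, reused here only for illustration. The multiplier of 2 is the rule of thumb from the slide; for small samples a t-based multiplier would be wider.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(values, multiplier=2.0):
    """Return the mean, its standard error, and mean -/+ multiplier * SE."""
    n = len(values)
    x_bar = mean(values)
    se = stdev(values) / sqrt(n)          # SE = s / sqrt(n)
    return x_bar, se, (x_bar - multiplier * se, x_bar + multiplier * se)


x = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4]
x_bar, se, (low, high) = mean_with_ci(x)

print(f"mean = {x_bar:.2f}, SE = {se:.4f}, approx. 95% CI = ({low:.2f}, {high:.2f})")
```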
Reporting variability about the mean.
Text
In a table as in previous slide.
Or, for example, in a manuscript, I might write:
The mean (± 95% CI) for the random samples of 100, 50, 25 and 10 was 24.84079 (±0.1816), 24.91241 (±0.31996), 24.86719 (±0.40142) and 25.16212 (±0.859), respectively. You are not restricted to using confidence intervals when reporting variability about the mean, i.e. I could have used mean ± std dev, or mean ± std error.
Graphically: Box Plot or Box and Whisker Plot
[Figure: Neonate Weight (g) by Type of Mother (Non-smokers, Smokers); y-axis 2550 to 3250 g; showing Mean, Standard Error and 95% CI]
[Figure: the same plot showing only Mean and 95% CI]
[Figure: the same plot with the y-axis running from 0 to 4000 g; showing Mean and 95% CI]