Math 3339 Section 21155 - SR 117 - MW 1 - 2:30pm
Bekki George: [email protected]
University of Houston
Sections 2.1-2.3
Office Hours: Mondays 11am - 12:30pm, Tuesdays 3-4pm (also available by appointment)
Office: 206 PGH
Course webpage: www.casa.uh.edu
Measures of Location
Last week we discussed mean and median.
R commands:
Entering a list: >list_name=c(1,2,3)
>mean(list_name)
>median(list_name)
Other Percentiles
The median is also a percentile value - the 50th percentile.
Percentiles: Suppose we have 10 data measurements, arranged in order from smallest to largest as $x_1, x_2, \ldots, x_{10}$. Each data measurement accounts for 10% of the data taken.
We say that $x_1$ represents the first 10 percent of the data, so we call it the 5th percentile. So, the 10 data measurements mark the 5th, 15th, 25th, ..., 95th percentiles.
In general, if you have $n$ data measurements, $x_1$ represents the $\frac{100(1-0.5)}{n}$th percentile, $x_2$ represents the $\frac{100(2-0.5)}{n}$th percentile, and $x_i$ represents the $\frac{100(i-0.5)}{n}$th percentile.
This is useful if you wish to calculate the percentile rank of a known measurement.
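The rule above can be sketched in R as follows. The function name `percentile_rank` is our own choice for illustration and is not a base R function.

```r
# Percentile represented by the i-th smallest of n measurements: 100 * (i - 0.5) / n
# (percentile_rank is a hypothetical helper name, not part of base R)
percentile_rank <- function(i, n) {
  100 * (i - 0.5) / n
}

# With 10 measurements, the ordered values mark the 5th, 15th, ..., 95th percentiles
ranks10 <- percentile_rank(1:10, 10)
print(ranks10)
```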
Other Percentiles
If you are looking for the measurement that has a desired percentile rank, the $100 \cdot P$th percentile is the measurement with rank $nP + 0.5$.
For example, in a collection of 30 data measurements, which measurement represents the 25th percentile?
The 25th percentile is called the first quartile, Q1. It represents the first 1/4 of the data. Similarly, the 50th and 75th percentiles are the second and third quartiles, Q2 and Q3, respectively.
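As a quick check of the rank rule for the 30-measurement question, here is a minimal R sketch. The helper name `percentile_position` is hypothetical. When the rank is not a whole number, one common convention (assumed here, not stated on the slides) is to interpolate between the two neighbouring ordered values.

```r
# Rank of the measurement at the 100*P-th percentile among n ordered values
# (percentile_position is a hypothetical helper name)
percentile_position <- function(p, n) {
  n * p + 0.5
}

# 25th percentile of 30 measurements: 30 * 0.25 + 0.5
pos_q1 <- percentile_position(0.25, 30)
print(pos_q1)
```

With these numbers the rank is 8, so the 8th ordered measurement represents the 25th percentile.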
Example
A manufacturer claims that his fabric consists of 80 percent cotton. To check his claim, we take a small swatch from each bolt of fabric and determine its cotton content. The results of 25 such measurements are as follows:
77 81 76 76 79 79 80 77 89 77 78 85 80
75 79 88 81 78 82 80 76 83 81 85 79
Determine the percentile of the 4th order statistic.
Determine the 50th percentile.
Determine the first and third quartiles.
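One way to work these three questions by hand in R is sketched below, using the $100(i-0.5)/n$ and $nP + 0.5$ rules from the earlier slides. The helper `value_at_rank` is a hypothetical name, and linear interpolation at fractional ranks is an assumed convention.

```r
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)
n <- length(fabric)
x <- sort(fabric)

# Percentile of the 4th order statistic: 100 * (i - 0.5) / n
pct_4th <- 100 * (4 - 0.5) / n

# Value at a (possibly fractional) rank r, interpolating between neighbours
# (value_at_rank is a hypothetical helper; interpolation is an assumption)
value_at_rank <- function(x_sorted, r) {
  lo <- floor(r)
  hi <- ceiling(r)
  x_sorted[lo] + (r - lo) * (x_sorted[hi] - x_sorted[lo])
}

p50 <- value_at_rank(x, n * 0.50 + 0.5)   # rank 13
q1  <- value_at_rank(x, n * 0.25 + 0.5)   # rank 6.75
q3  <- value_at_rank(x, n * 0.75 + 0.5)   # rank 19.25
print(c(pct_4th = pct_4th, p50 = p50, q1 = q1, q3 = q3))
```

Under these assumptions the 4th order statistic sits at the 14th percentile, the 50th percentile is 79, Q1 is 77, and Q3 is 81.25.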
R commands:
>quantile(fabric)
>quantile(fabric,.95)
>quantile(fabric,c(.3,.6,.9))
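R's `quantile()` supports several interpolation rules through its `type` argument, and the default is `type = 7`. The sketch below compares the default with `type = 5`, which places the i-th ordered value at probability $(i-0.5)/n$; the choice of `type = 5` here is our suggestion for illustration, not something the slides prescribe.

```r
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

# Default rule (type = 7)
q_default <- quantile(fabric, c(0.25, 0.50, 0.75))

# type = 5: the i-th ordered value is treated as the 100 * (i - 0.5) / n percentile
q_type5 <- quantile(fabric, c(0.25, 0.50, 0.75), type = 5)

print(q_default)
print(q_type5)
```

The two rules can give slightly different answers for the same data, so it is worth stating which rule was used when reporting quartiles.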
Mode
The mode is the value that occurs most often in the data list.
Trimmed Mean
Trimmed means are obtained by finding the mean of the values of the data excluding a given percentage of the largest and smallest values. For example, the 5% trimmed mean is the mean of the values of the data excluding the largest 5% of the values and the smallest 5% of the values.
R commands:
>mean(fabric, trim=0.1)
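Base R has no built-in function for the statistical mode, so a small sketch using `table()` is shown below, together with a by-hand version of the 10% trimmed mean for comparison with `mean(fabric, trim = 0.1)`. The variable names are ours.

```r
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

# Mode: tabulate the values and keep the most frequent one(s)
counts <- table(fabric)
mode_value <- as.numeric(names(counts)[counts == max(counts)])

# 10% trimmed mean by hand: drop floor(0.1 * n) values from each end, then average
n <- length(fabric)
k <- floor(0.1 * n)
trimmed_by_hand <- mean(sort(fabric)[(k + 1):(n - k)])
trimmed_builtin <- mean(fabric, trim = 0.1)

print(mode_value)
print(c(by_hand = trimmed_by_hand, builtin = trimmed_builtin))
```

If several values tie for the highest count, `mode_value` contains all of them.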
Grouped Data
Example 2.1. The data below are 50 measured reaction times in response to a sensory stimulus, arranged in increasing order. A frequency table is shown below the data.
0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80 0.88 1.02 1.08 1.12 1.13
1.17 1.21 1.23 1.35 1.41 1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72
1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20 2.29 2.32 2.39 2.47 2.60 2.86 3.43
3.43 3.77 3.97 4.54 4.73
How would we estimate the mean if only the table were given?
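A common approach is to treat every value in a class as if it sat at the class midpoint and take the frequency-weighted average of the midpoints. The sketch below assumes classes of width 0.5 from 0 to 5; these class limits are our assumption, since the frequency table itself is not reproduced here.

```r
times <- c(0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
           0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
           1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
           1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
           2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73)

# Assumed class limits (0, 0.5], (0.5, 1], ..., (4.5, 5]
breaks <- seq(0, 5, by = 0.5)
freq <- as.vector(table(cut(times, breaks)))

# Class midpoints and the frequency-weighted mean of the midpoints
mids <- (head(breaks, -1) + tail(breaks, -1)) / 2
grouped_mean <- sum(mids * freq) / sum(freq)

print(c(grouped = grouped_mean, exact = mean(times)))
```

The grouped estimate cannot differ from the exact mean by more than half a class width, because each value is at most that far from its class midpoint.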
Histograms
R commands:
>hist(fabric)
Histograms show grouped data only. You cannot retrieve individual data from a histogram.
Stem Plots
R commands:
>stem(fabric)
>stem(fabric, scale = .5)
Frequency Distributions
A frequency distribution is a tabular summary of data showing the frequency (or number) of items in each of several non-overlapping classes. The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.
Cumulative Frequency
The cumulative frequency distribution shows the number of items with values less than or equal to the upper limit of each class. The cumulative relative frequency distribution shows the proportion of items with values less than or equal to the upper limit of each class.
A cumulative frequency plot of the percentages (also called an ogive) can be used to view the total number of events that occurred up to a certain value.
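As an illustration, the sketch below builds frequency, relative frequency, and cumulative frequency columns for the fabric data and draws a simple ogive. The class limits (width 4, starting at 74) are an arbitrary choice made for this example.

```r
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

# Assumed non-overlapping classes: (74,78], (78,82], (82,86], (86,90]
breaks <- seq(74, 90, by = 4)
freq <- table(cut(fabric, breaks))

rel_freq <- freq / sum(freq)        # proportion of items in each class
cum_freq <- cumsum(freq)            # items at or below each upper class limit
cum_rel_freq <- cumsum(rel_freq)    # proportion at or below each upper class limit

print(data.frame(class = names(freq),
                 freq = as.vector(freq),
                 rel_freq = as.vector(rel_freq),
                 cum_freq = as.vector(cum_freq),
                 cum_rel_freq = as.vector(cum_rel_freq)))

# Ogive: cumulative percent plotted against the class limits, starting at 0%
plot(breaks, c(0, 100 * cum_rel_freq), type = "b",
     xlab = "Cotton content (%)", ylab = "Cumulative percent")
```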
Cumulative Frequency
Example: Here is an ogive for Hudson Auto Repair’s cost of parts sold:
[Figure: ogive of cost of parts sold]
The Five Number Summary
R commands:
>fivenum(fabric)
>summary(fabric)
Difference between fivenum and summary:
Fivenum: fivenum() returns Tukey’s five number summary, i.e.
[Minimum] [Lower hinge] [Median] [Upper hinge] [Maximum]
Lower hinge = median of the lower half of the data
Upper hinge = median of the upper half of the data
(When the number of values is odd, fivenum() counts the median itself in both halves.)
Summary: summary() gives the following summary statistics:
[Minimum] [1st Quartile] [Median] [Mean] [3rd Quartile] [Maximum]
Boxplots
Making a Boxplot from the Five Number Summary
1. Order the values in the data set in ascending order (least to greatest).
2. Find and label the median.
3. Of the lower half (less than the median - do not include the median), find and label Q1.
4. Of the upper half (greater than the median - do not include the median), find and label Q3.
5. Label the minimum and maximum.
6. Draw and label the scale on an axis.
7. Plot the five number summary.
8. Sketch a box starting at Q1 to Q3.
9. Sketch a segment within the box to represent the median.
10. Connect the min and max to the box with line segments.
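Steps 1 through 5 can be sketched in R as below, using the fabric data. The variable names are ours. Note that `boxplot()` computes its box from Tukey's hinges, so its Q3 may differ slightly from the value found by these steps; `range = 0` is used so the whiskers run to the minimum and maximum as in step 10.

```r
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

x <- sort(fabric)                          # step 1: ascending order
n <- length(x)
med <- median(x)                           # step 2: median

# Halves on either side of the median position (the median itself is left out
# when n is odd; for even n the data simply split in two)
lower_half <- x[1:floor(n / 2)]
upper_half <- x[(ceiling(n / 2) + 1):n]

q1 <- median(lower_half)                   # step 3
q3 <- median(upper_half)                   # step 4
five <- c(min = x[1], Q1 = q1, median = med, Q3 = q3, max = x[n])   # step 5
print(five)

# Steps 6-10: let R draw the plot, with whiskers extended to the extremes
boxplot(fabric, horizontal = TRUE, range = 0)
```

For this data set the by-hand Q3 is 81.5, while `fivenum(fabric)` reports an upper hinge of 81, which shows how the two conventions can disagree.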
Calculating Outlier Boundaries
Follow the formula (Q1 − 1.5(IQR), Q3 + 1.5(IQR))
Hint: you need to know what Q1 and Q3 are numerically.
Steps:
1. Find Q1 and Q3.
2. Calculate the IQR (Q3 − Q1).
3. Multiply the IQR by 1.5.
4. Subtract 1.5(IQR) from Q1; this is the lower bound.
5. Add 1.5(IQR) to Q3; this is the upper bound.
6. Write the outlier boundaries in interval notation.
Example: Given the five number summary below, determine the outlier boundaries, and give the box plot:
min   Q1     median   Q3   max
79    84.5   88.5     93   111
Will this data have outliers?
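The outlier-boundary steps applied to this five number summary can be sketched in R as follows (variable names are ours):

```r
# Five number summary from the example
five <- c(min = 79, Q1 = 84.5, median = 88.5, Q3 = 93, max = 111)

iqr <- five[["Q3"]] - five[["Q1"]]      # IQR = Q3 - Q1
lower <- five[["Q1"]] - 1.5 * iqr       # lower outlier boundary
upper <- five[["Q3"]] + 1.5 * iqr       # upper outlier boundary
print(c(IQR = iqr, lower = lower, upper = upper))

# The data have outliers if the min or max falls outside (lower, upper)
has_low_outlier <- five[["min"]] < lower
has_high_outlier <- five[["max"]] > upper
print(c(low = has_low_outlier, high = has_high_outlier))
```

Here the boundaries are (71.75, 105.75). The minimum of 79 lies inside them, but the maximum of 111 lies above the upper boundary, so there is at least one high outlier.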
Measures of Variability
Measures of Variability - Dispersion (spread)
1. The simplest way to measure dispersion is the range. This is the difference between the smallest and largest measurements. Drawback of the range: sensitivity to outliers.
2. Another method is the interquartile range, IQR = Q3 − Q1. This is not sensitive to outliers, but still has some drawbacks as a measure of dispersion.
3. The most common measure is the sample standard deviation. Roughly speaking, standard deviation is the average distance values fall from the mean (center of graph).
4. The coefficient of variation measures relative variability. This is used for variables that have only positive values.
Population Variance
The population variance is defined as
$$\sigma^2 = \frac{1}{N}\left[(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_N-\mu)^2\right] = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$$
where $\mu$ is the population mean, and the population standard deviation is given by $\sigma = \sqrt{\sigma^2}$, the square root of the population variance.
Sample Variance
The sample variance is defined as
$$s^2 = \frac{1}{n-1}\left[(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2\right] = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$
and the sample standard deviation is given by $s$, the square root of the sample variance. The reason for modifying the definition for the sample variance has to do with its properties as an estimate of the population variance.
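R's `var()` and `sd()` use the sample (n − 1) definition. The sketch below checks this against the formula and also computes the divide-by-n version, treating the 25 fabric measurements as if they were a whole population purely for illustration.

```r
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)
n <- length(fabric)

# Squared deviations from the mean
dev_sq <- (fabric - mean(fabric))^2

sample_var <- sum(dev_sq) / (n - 1)   # matches var(fabric)
pop_var <- sum(dev_sq) / n            # divide by n instead (population formula)

print(c(sample_var = sample_var, var_builtin = var(fabric), pop_var = pop_var))
print(c(sample_sd = sqrt(sample_var), sd_builtin = sd(fabric)))
```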
Example:
Distance from the mean is sometimes measured in standard deviations. For example, if $\bar{x} = 20$ and $s = 4$ then a measurement of 12 would be 2 standard deviations from the mean.
What interval of measurements from the above scenario would be within 2 standard deviations from the mean?
Within 1.5 standard deviations?
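A short R sketch of these interval calculations, with a hypothetical helper `within_k_sd` that returns the endpoints $\bar{x} \pm k s$:

```r
xbar <- 20
s <- 4

# Number of standard deviations a measurement of 12 lies from the mean
z_12 <- (12 - xbar) / s

# Interval of measurements within k standard deviations of the mean
# (within_k_sd is a hypothetical helper name)
within_k_sd <- function(xbar, s, k) {
  c(lower = xbar - k * s, upper = xbar + k * s)
}

int_2 <- within_k_sd(xbar, s, 2)
int_1.5 <- within_k_sd(xbar, s, 1.5)
print(z_12)
print(int_2)
print(int_1.5)
```

With these values, measurements within 2 standard deviations lie between 12 and 28, and those within 1.5 standard deviations lie between 14 and 26.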
The Coefficient of Variation
The coefficient of variation measures relative variability.
$$\mathrm{cv}(x) = \frac{\sigma(x)}{\mu(x)} \quad\text{or}\quad \mathrm{cv}(x) = \frac{s}{\bar{x}}$$
The advantage of the CV is that it is unitless. This allows CVs to be compared to each other in ways that other measures cannot be. It is used frequently in regression analysis.
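Base R has no dedicated coefficient-of-variation function that we rely on here, so the sketch below defines a small helper (the name `cv` is ours) from the sample formula $s/\bar{x}$ and illustrates the unitless property by rescaling the data.

```r
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

# Sample coefficient of variation: s / xbar (cv is a hypothetical helper name)
cv <- function(x) {
  sd(x) / mean(x)
}

cv_percent <- cv(fabric)            # cotton content measured in percent
cv_fraction <- cv(fabric / 100)     # same data expressed as a fraction
print(c(cv_percent = cv_percent, cv_fraction = cv_fraction))
```

Changing the unit of measurement by a positive scale factor multiplies both the standard deviation and the mean by that factor, so the ratio is unchanged.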
Jointly Distributed Variables
Side by Side Boxplots
R commands:
>boxplot(payroll$payroll ~ payroll$industry)
>boxplot(test_vs_grade$Test ~ test_vs_grade$Grade)
Scatterplots
With scatter plots, we plot ordered pairs of our data.
R commands:
>plot(y ~ x)
Check the box under Packages for UsingR. Let’s open the data set urchin.growth.
>plot(size ~ age, data=urchin.growth)
Covariance and Correlation
If x and y are jointly distributed numeric variables, we define their covariance as
$$\mathrm{cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu(x))(y_i - \mu(y))$$
for a population and
$$\mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$
for a sample.
Covariance is a measure of how changes in one variable are associated with changes in a second variable.
Covariance and Correlation
If x and y are jointly distributed numeric variables, we define their correlation as
$$\mathrm{cor}(x, y) = r = \frac{\mathrm{cov}(x, y)}{\mathrm{sd}(x) \cdot \mathrm{sd}(y)}$$
The correlation is always such that $-1 \le \mathrm{cor}(x, y) \le 1$.
The correlation indicates the strength and direction of a linear relationship.
If $|\mathrm{cor}(x, y)| = 1$ then there is a perfect linear relationship between x and y.
R commands:
>cor(x,y)
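The sketch below checks the sample covariance and correlation formulas against R's `cov()` and `cor()`. The x and y values are made-up numbers chosen only for illustration.

```r
# Hypothetical paired data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
n <- length(x)

# Sample covariance from the definition (divides by n - 1, like cov())
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation as covariance divided by the product of standard deviations
cor_by_hand <- cov_by_hand / (sd(x) * sd(y))

print(c(cov_by_hand = cov_by_hand, cov_builtin = cov(x, y)))
print(c(cor_by_hand = cor_by_hand, cor_builtin = cor(x, y)))

# A perfect linear relationship gives a correlation of 1 or -1
cor_up <- cor(x, 2 * x + 1)
cor_down <- cor(x, -3 * x + 7)
print(c(cor_up = cor_up, cor_down = cor_down))
```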