
Math 3339 - Section 21155 - SR 117 - MW 1 - 2:30pm

Bekki George: [email protected]

University of Houston

Sections 2.1-2.3


Office Hours: Mondays 11am - 12:30pm, Tuesdays 3-4pm (also available by appointment)

Office: 206 PGH

Course webpage: www.casa.uh.edu


Measures of Location

Last week we discussed mean and median.

R commands:

Entering a list: >list_name=c(1,2,3)

>mean(list_name)

>median(list_name)



Other Percentiles

The median is also a percentile value: the 50th percentile.

Percentiles: Suppose we have 10 data measurements, arranged in order from smallest to largest as $x_1, x_2, \ldots, x_{10}$. Each data measurement accounts for 10% of the data taken.

We say that $x_1$ represents the first 10 percent of the data. Its position is taken to be the middle of that range, so we call it the 5th percentile. So, the 10 data measurements mark the 5th, 15th, 25th, ..., 95th percentiles.

In general, if you have $n$ data measurements, $x_1$ represents the $\frac{100(1 - 0.5)}{n}$th percentile, $x_2$ represents the $\frac{100(2 - 0.5)}{n}$th percentile, and $x_i$ represents the $\frac{100(i - 0.5)}{n}$th percentile.

This is useful if you wish to calculate the percentile rank of a known measurement.



Other Percentiles

If you are looking for the measurement that has a desired percentile rank, the $100 \cdot P$th percentile is the measurement with rank $nP + 0.5$.

For example, in a collection of 30 data measurements, which measurement represents the 25th percentile?

The 25th percentile is called the first quartile, Q1. It represents the first 1/4 of the data. Similarly, the 50th and 75th percentiles are the second and third quartiles, Q2 and Q3, respectively.
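The rank rule can be checked with a short R sketch. The function names percentile_rank and percentile_of_rank are made up for illustration; they are not built-in R functions.

```r
# Rank of the measurement at the 100*P-th percentile: rank = n*P + 0.5
# (percentile_rank is a hypothetical helper name)
percentile_rank <- function(n, P) n * P + 0.5

# Percentile represented by the i-th ordered measurement: 100*(i - 0.5)/n
# (percentile_of_rank is a hypothetical helper name)
percentile_of_rank <- function(i, n) 100 * (i - 0.5) / n

r <- percentile_rank(30, 0.25)    # rank for the 25th percentile when n = 30
p <- percentile_of_rank(r, 30)    # converting the rank back to a percentile
r
p
```

With n = 30 and P = 0.25 the rank is 8, so the 8th smallest measurement represents the 25th percentile, and converting rank 8 back gives 25.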


Example

A manufacturer claims that his fabric consists of 80 percent cotton. To check his claim, we take a small swatch from each bolt of fabric and determine its cotton content. The results of 25 such measurements are as follows:

77 81 76 76 79 79 80 77 89 77 78 85 80
75 79 88 81 78 82 80 76 83 81 85 79

Determine the percentile of the 4th order statistic.
Determine the 50th percentile.
Determine the first and third quartiles.
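One way to work these three questions in R, as a sketch. It applies the (i − 0.5)/n and nP + 0.5 rules from the notes. The helper names pct_of_rank and value_at_pct are hypothetical, and linear interpolation between neighbors when the rank is not a whole number is an assumption; the notes do not say how to handle fractional ranks.

```r
# Cotton percentages from the example (25 measurements)
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)
x <- sort(fabric)     # order statistics x[1] <= ... <= x[n]
n <- length(x)

# Percentile represented by the i-th order statistic (hypothetical helper)
pct_of_rank <- function(i, n) 100 * (i - 0.5) / n

# Measurement at the 100*P-th percentile (hypothetical helper).
# Assumes 1 <= n*P + 0.5 <= n and interpolates linearly between
# neighboring order statistics when the rank is fractional.
value_at_pct <- function(x, P) {
  r  <- length(x) * P + 0.5
  lo <- floor(r)
  hi <- ceiling(r)
  x[lo] + (r - lo) * (x[hi] - x[lo])
}

pct_4th <- pct_of_rank(4, n)       # percentile of the 4th order statistic
q2 <- value_at_pct(x, 0.50)        # 50th percentile (median)
q1 <- value_at_pct(x, 0.25)        # first quartile
q3 <- value_at_pct(x, 0.75)        # third quartile
c(pct_4th = pct_4th, q1 = q1, q2 = q2, q3 = q3)
```

Under these assumptions the 4th order statistic (76) sits at the 14th percentile, the median is 79, Q1 is 77, and Q3 is 81.25. R's quantile(fabric, 0.75, type = 5) uses the same (i − 0.5)/n convention, while the default type = 7 uses a different rule and can give slightly different quartiles.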



R commands:

>quantile(fabric)

>quantile(fabric,.95)

>quantile(fabric,c(.3,.6,.9))



Mode

The mode is the value that occurs most often in the data list.
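Base R has no built-in function for this kind of mode, so a frequency table is one way to find it. The sketch below uses the fabric data from the earlier example and returns every value that ties for the highest count.

```r
# Cotton percentages from the fabric example
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

counts <- table(fabric)                                    # frequency of each distinct value
modes  <- as.numeric(names(counts)[counts == max(counts)]) # value(s) with the top count
modes
max(counts)
```

For this data the mode is 79, which occurs 4 times.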

Trimmed Mean

Trimmed means are obtained by finding the mean of the values of the data excluding a given percentage of the largest and smallest values. For example, the 5% trimmed mean is the mean of the values of the data excluding the largest 5% of the values and the smallest 5% of the values.

R commands:

>mean(fabric, trim=0.1)
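To see what trim=0.1 does, the sketch below repeats the calculation by hand on the fabric data. With 25 values, 10% is 2.5 observations, and R rounds this down, so 2 values are dropped from each end.

```r
# Cotton percentages from the fabric example
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

n <- length(fabric)
k <- floor(n * 0.1)                    # observations dropped from each end
x <- sort(fabric)
manual  <- mean(x[(k + 1):(n - k)])    # mean of the middle n - 2k values
builtin <- mean(fabric, trim = 0.1)    # R's trimmed mean
c(k = k, manual = manual, builtin = builtin)
```

Both calculations average the same 21 middle values, so they agree.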



Grouped Data

Example 2.1. The data below are 50 measured reaction times in response to a sensory stimulus, arranged in increasing order. A frequency table is shown below the data.

0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80 0.88 1.02 1.08 1.12 1.13
1.17 1.21 1.23 1.35 1.41 1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72
1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20 2.29 2.32 2.39 2.47 2.60 2.86 3.43
3.43 3.77 3.97 4.54 4.73

How would we estimate the mean if only the table were given?
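A common approach is to replace every value by the midpoint of its class and take a frequency-weighted average. The sketch below illustrates this in R. The class boundaries (width 1, from 0 to 5) are an assumption made for illustration, since the frequency table from the slide is not reproduced here.

```r
# Reaction times from Example 2.1 (50 measurements)
times <- c(0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
           0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
           1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
           1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
           2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73)

# Assumed classes (0,1], (1,2], ..., (4,5]; the slide's table may differ
breaks <- seq(0, 5, by = 1)
freq <- as.vector(table(cut(times, breaks)))         # class frequencies
mid  <- (head(breaks, -1) + tail(breaks, -1)) / 2    # class midpoints

# Grouped-data estimate of the mean: midpoints weighted by frequency
grouped_mean <- sum(mid * freq) / sum(freq)
data.frame(midpoint = mid, frequency = freq)
grouped_mean
```

With these assumed classes the frequencies are 11, 22, 11, 4, 2 and the estimate is 89/50 = 1.78. Because each value lies within half a class width of its midpoint, the estimate cannot be off by more than 0.5 from the true mean here.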



Histograms

R commands:

>hist(fabric)

Histograms show grouped data only. You cannot retrieve individual data from a histogram.

Stem Plots

R commands:

>stem(fabric)

>stem(fabric, scale = .5)



Frequency Distributions

A frequency distribution is a tabular summary of data showing the frequency (or number) of items in each of several non-overlapping classes. The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.



Cumulative Frequency

The cumulative frequency distribution shows the number of items with values less than or equal to the upper limit of each class. The cumulative relative frequency distribution shows the proportion of items with values less than or equal to the upper limit of each class. A cumulative frequency plot of the percentages (also called an ogive) can be used to view the total number of events that occurred up to a certain value.
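As a sketch, all four kinds of frequency can be tabulated in R with cut(), table(), and cumsum(). The class limits below (width 4, from 74 to 90) are chosen only for illustration and are applied to the fabric data from the earlier example.

```r
# Cotton percentages from the fabric example
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

# Assumed classes (74,78], (78,82], (82,86], (86,90]
breaks <- seq(74, 90, by = 4)
freq         <- table(cut(fabric, breaks))   # frequency of each class
rel_freq     <- freq / sum(freq)             # relative frequency
cum_freq     <- cumsum(freq)                 # cumulative frequency
cum_rel_freq <- cumsum(rel_freq)             # cumulative relative frequency

data.frame(class        = names(freq),
           freq         = as.vector(freq),
           rel_freq     = as.vector(rel_freq),
           cum_freq     = as.vector(cum_freq),
           cum_rel_freq = as.vector(cum_rel_freq))
```

The last cumulative frequency equals the sample size (25) and the last cumulative relative frequency equals 1. Plotting cum_rel_freq against the upper class limits gives an ogive.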



Cumulative Frequency

Example: Here is an ogive for Hudson Auto Repair’s cost of parts sold:



The Five Number Summary

R commands:

>fivenum(fabric)

>summary(fabric)

Difference between fivenum and summary:

Fivenum: fivenum() returns Tukey’s five number summary, i.e.
[Minimum] [Lower hinge] [Median] [Upper hinge] [Maximum]
Lower hinge = median of the values to the left of the median
Upper hinge = median of the values to the right of the median
(When the number of values is odd, fivenum() counts the median itself in both halves.)

Summary: summary() gives the following summary statistics:
[Minimum] [1st Quartile] [Median] [Mean] [3rd Quartile] [Maximum]
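The hinges and the quartiles often coincide, but not always. The sketch below runs both functions on the fabric data and on a small made-up sample (the vector small is hypothetical) where they differ.

```r
# Cotton percentages from the fabric example
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)
f_fabric <- fivenum(fabric)   # min, lower hinge, median, upper hinge, max
s_fabric <- summary(fabric)   # min, 1st quartile, median, mean, 3rd quartile, max

# Hypothetical six-value sample where hinges and quartiles disagree
small   <- c(1, 2, 3, 4, 5, 6)
f_small <- fivenum(small)     # hinges are 2 and 5
s_small <- summary(small)     # quartiles are 2.25 and 4.75

f_fabric
s_fabric
f_small
s_small
```

For the fabric data both functions report 77 and 81 for the lower and upper values. For the six-value sample, fivenum() gives hinges of 2 and 5, while summary() interpolates to quartiles of 2.25 and 4.75.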


Boxplots

Making a Boxplot from the Five Number Summary

1. Order the values in the data set in ascending order (least to greatest).
2. Find and label the median.
3. Of the lower half (the values less than the median; do not include the median), find and label Q1.
4. Of the upper half (the values greater than the median; do not include the median), find and label Q3.
5. Label the minimum and maximum.
6. Draw and label the scale on an axis.
7. Plot the five number summary.
8. Sketch a box starting at Q1 to Q3.
9. Sketch a segment within the box to represent the median.
10. Connect the min and max to the box with line segments.
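R can draw the same picture directly. The sketch below uses the fabric data from the earlier example. Setting range = 0 makes the whiskers run all the way to the minimum and maximum, as in step 10; by default R stops the whiskers at the outlier boundaries and plots points beyond them individually. Note that R builds the box from Tukey's hinges, which for this data equal the Q1 and Q3 found by the steps above.

```r
# Cotton percentages from the fabric example
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

# range = 0: whiskers extend to the data extremes (min and max)
b <- boxplot(fabric, horizontal = TRUE, range = 0,
             xlab = "Cotton content (%)")
b$stats    # the five values the plot is drawn from
```

Here b$stats holds 75, 77, 79, 81, 89: the minimum, Q1, median, Q3, and maximum.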


Boxplots

Example:


Boxplots

Calculating Outlier Boundaries

Follow the formula (Q1 − 1.5(IQR), Q3 + 1.5(IQR)).
Hint: you need to know what Q1 and Q3 are numerically.

Steps:
1. Find Q1 and Q3.
2. Calculate the IQR (Q3 − Q1).
3. Multiply the IQR by 1.5.
4. Subtract 1.5(IQR) from Q1; this is the lower bound.
5. Add 1.5(IQR) to Q3; this is the upper bound.
6. Write the outlier boundaries in interval notation.
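The steps translate into a few lines of R. This sketch applies them to the fabric data from the earlier example and takes Q1 and Q3 from fivenum(), which is an assumption; a different quartile rule could shift the boundaries slightly.

```r
# Cotton percentages from the fabric example
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

q   <- fivenum(fabric)        # using the hinges as Q1 and Q3 (assumption)
q1  <- q[2]
q3  <- q[4]
iqr <- q3 - q1                # step 2
lower <- q1 - 1.5 * iqr       # steps 3-4: lower bound
upper <- q3 + 1.5 * iqr       # step 5: upper bound

# Values outside (lower, upper) are flagged as outliers
outliers <- fabric[fabric < lower | fabric > upper]
c(lower = lower, upper = upper)
outliers
```

For this data Q1 = 77 and Q3 = 81, so the IQR is 4 and the boundaries are (71, 87). The measurements 88 and 89 fall above the upper bound.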


Boxplots

Example: Given the five number summary below, determine the outlier boundaries, and give the box plot:

min   Q1     median   Q3   max
79    84.5   88.5     93   111

Will this data have outliers?
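One way to work this out, following the outlier-boundary steps with the values from the five number summary:

```latex
\begin{align*}
\text{IQR} &= Q_3 - Q_1 = 93 - 84.5 = 8.5 \\
1.5 \cdot \text{IQR} &= 1.5 \cdot 8.5 = 12.75 \\
\text{lower bound} &= Q_1 - 1.5 \cdot \text{IQR} = 84.5 - 12.75 = 71.75 \\
\text{upper bound} &= Q_3 + 1.5 \cdot \text{IQR} = 93 + 12.75 = 105.75
\end{align*}
```

The outlier boundaries are (71.75, 105.75). The minimum, 79, lies inside this interval, but the maximum, 111, lies above 105.75, so the data has at least one outlier on the high side.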


Measures of Variability

Measures of Variability - Dispersion (spread)

1. The simplest way to measure dispersion is range. This is the difference between the smallest and largest measurements. Drawbacks of range: sensitivity to outliers.

2. Another method is interquartile range, IQR = Q3 − Q1. This is not sensitive to outliers, but still has some drawbacks as a measure of dispersion.

3. The most common measure is sample standard deviation. Roughly speaking, standard deviation is the average distance values fall from the mean (center of graph).

4. The coefficient of variation measures relative variability. This is used for variables that have only positive values.



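Population Variance

The population variance is defined as

$$\sigma^2 = \frac{1}{N}\left[(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2\right] = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

where $\mu$ is the population mean, and the population standard deviation is given by $\sigma = \sqrt{\sigma^2}$, the square root of the population variance.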



Sample Variance

The sample variance is defined as

$$s^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right] = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

and the sample standard deviation is given by $s$, the square root of the sample variance. The reason for modifying the definition for the sample variance has to do with its properties as an estimate of the population variance: dividing by $n - 1$ rather than $n$ makes $s^2$ an unbiased estimator of $\sigma^2$.
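The two definitions differ only in the divisor, which a short R sketch makes concrete. It uses the fabric data from the earlier example. R's var() and sd() use the n − 1 divisor; the divide-by-n version here treats the 25 values as if they were the whole population, purely for comparison.

```r
# Cotton percentages from the fabric example
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

n    <- length(fabric)
xbar <- mean(fabric)
ss   <- sum((fabric - xbar)^2)   # sum of squared deviations from the mean

s2     <- ss / (n - 1)           # sample variance
s      <- sqrt(s2)               # sample standard deviation
sigma2 <- ss / n                 # divide-by-n version, for comparison only

c(s2 = s2, var = var(fabric), s = s, sd = sd(fabric), sigma2 = sigma2)
```

The hand calculation matches var(fabric) and sd(fabric), and the divide-by-n value is always a little smaller than the sample variance.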



Example:

Distance from the mean is sometimes measured in standard deviations. For example, if $\bar{x} = 20$ and $s = 4$, then a measurement of 12 would be 2 standard deviations from the mean.

What interval of measurements from the above scenario would be within 2 standard deviations from the mean?

Within 1.5 standard deviations?
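These intervals have the form mean ± k·s. A small R sketch, where within_k_sd is a hypothetical helper name:

```r
xbar <- 20   # sample mean from the example
s    <- 4    # sample standard deviation from the example

# Interval of measurements within k standard deviations of the mean
# (within_k_sd is a hypothetical helper name)
within_k_sd <- function(xbar, s, k) {
  c(lower = xbar - k * s, upper = xbar + k * s)
}

z_12  <- (12 - xbar) / s            # distance of 12 from the mean, in SDs
int_2 <- within_k_sd(xbar, s, 2)    # within 2 standard deviations
int_15 <- within_k_sd(xbar, s, 1.5) # within 1.5 standard deviations
z_12
int_2
int_15
```

A measurement of 12 is 2 standard deviations below the mean. The intervals are 12 to 28 for 2 standard deviations and 14 to 26 for 1.5 standard deviations.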



Example:



The Coefficient of Variation

The coefficient of variation measures relative variability.

$$cv(x) = \frac{\sigma(x)}{\mu(x)} \quad \text{or} \quad cv(x) = \frac{s}{\bar{x}}$$

The advantage of the CV is that it is unitless. This allows CVs to be compared to each other in ways that other measures cannot be. It is used frequently in regression analysis.
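Base R has no built-in CV function, so the sketch below defines one (the name cv is our own) and uses the fabric data to illustrate that rescaling the units leaves the CV unchanged.

```r
# Cotton percentages from the fabric example
fabric <- c(77, 81, 76, 76, 79, 79, 80, 77, 89, 77, 78, 85, 80,
            75, 79, 88, 81, 78, 82, 80, 76, 83, 81, 85, 79)

# Sample coefficient of variation: s / x-bar (cv is a user-defined helper)
cv <- function(x) sd(x) / mean(x)

cv_percent    <- cv(fabric)         # data recorded as percentages
cv_proportion <- cv(fabric / 100)   # same data recorded as proportions
c(cv_percent = cv_percent, cv_proportion = cv_proportion)
```

Both versions give the same CV, because dividing the data by 100 scales the standard deviation and the mean by the same factor.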


Jointly Distributed Variables

Side by Side Boxplots

R commands:

>boxplot(payroll$payroll ~ payroll$industry)

>boxplot(test_vs_grade$Test ~ test_vs_grade$Grade)



Scatterplots

With scatter plots, we plot ordered pairs of our data. R commands:

>plot(y ~ x)

Check the box under Packages for UsingR. Let’s open the data set urchin.growth.

>plot(size ~ age, data=urchin.growth)



Covariance and Correlation

If x and y are jointly distributed numeric variables, we define their covariance as

$$\mathrm{cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu(x))(y_i - \mu(y))$$

for a population and

$$\mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

for a sample.

Covariance is a measure of how changes in one variable are associated with changes in a second variable.
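A short R sketch of both formulas, using a small made-up data set (the vectors x and y are hypothetical). R's cov() uses the sample (n − 1) divisor.

```r
# Hypothetical paired data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Sum of products of deviations from the means
sp <- sum((x - mean(x)) * (y - mean(y)))

cov_sample <- sp / (n - 1)   # sample covariance
cov_pop    <- sp / n         # population formula, treating the data as a population
c(cov_sample = cov_sample, cov_builtin = cov(x, y), cov_pop = cov_pop)
```

Here the sum of products is 6, so the sample covariance is 1.5 (matching cov(x, y)) and the population formula gives 1.2. The positive sign reflects that y tends to increase as x increases.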



Covariance and Correlation

If x and y are jointly distributed numeric variables, we define their correlation as

$$\mathrm{cor}(x, y) = r = \frac{\mathrm{cov}(x, y)}{\mathrm{sd}(x) \cdot \mathrm{sd}(y)}$$

The correlation is always such that $-1 \le \mathrm{cor}(x, y) \le 1$.

The correlation indicates the strength and direction of a linear relationship.

If $|\mathrm{cor}(x, y)| = 1$ then there is a perfect linear relationship between x and y.

R commands:

>cor(x,y)
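As a check on the definition, the sketch below computes the correlation from the covariance and standard deviations and compares it with cor(). The vectors x and y are made-up illustration data.

```r
# Hypothetical paired data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

r_manual  <- cov(x, y) / (sd(x) * sd(y))   # correlation from its definition
r_builtin <- cor(x, y)

# Exactly linear relationships give a correlation of +1 or -1
r_up   <- cor(x, 2 * x + 1)    # increasing line
r_down <- cor(x, -3 * x + 7)   # decreasing line

c(r_manual = r_manual, r_builtin = r_builtin, r_up = r_up, r_down = r_down)
```

The manual and built-in values agree (about 0.77 for this data), and the two exactly linear cases give 1 and −1.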

