Math 120 Notes – Chapter 3 – Numerically Summarizing Data Math 203 Section 3… · 2020-01-03 · Math 120 Notes – Chapter 3 – Numerically Summarizing Data Section 3.1 –

Math 120 Notes – Chapter 3 – Numerically Summarizing Data

Section 3.1 – Measures of Central Tendency

Objectives:

1. Determine the arithmetic mean of a variable from raw data

2. Determine the median of a variable from raw data

3. Explain what it means for a statistic to be resistant

4. Determine the mode of a variable from raw data

Objective – Determine the Arithmetic Mean of a Variable from Raw Data

- Arithmetic mean – a.k.a. the average – computed by adding all the values of the variable in the data set and dividing by the number of observations

- Population arithmetic mean, µ (pronounced “mew”), is computed using all the individuals in a population (this is a parameter)

o If x1, x2, …, xN are the N observations of a variable from a population, then the population mean, µ, is

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$$

- Sample arithmetic mean, x̄ (pronounced “x-bar”), is computed using sample data (this is a statistic)

o If x1, x2, …, xn are the n observations of a variable from a sample, then the sample mean, x̄, is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

*Note: You will have to do this by hand on the exam


HillMath120,Page 2

Example: Computing a Population Mean and a Sample Mean

The following data represent the travel times (in minutes) to work for all seven employees of a start-up web

development company.

23, 36, 23, 18, 5, 26, 43

a) Compute the population mean of this data

b) Use the sample of 4 employees given below to compute the sample mean.

23, 36, 5, 26

c) Use the sample of 4 employees given below to compute the sample mean.

36, 23, 43, 26

d) What do you notice about the three different means found in parts (a) – (c)?
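To check the hand computations, the three means can be sketched in Python. This is only an illustration: the helper name `mean` and the variable names are our own and are not part of the notes.

```python
def mean(values):
    """Arithmetic mean: add up the values, then divide by how many there are."""
    return sum(values) / len(values)

population = [23, 36, 23, 18, 5, 26, 43]   # all seven employees
sample_1 = [23, 36, 5, 26]                 # sample from part (b)
sample_2 = [36, 23, 43, 26]                # sample from part (c)

mu = mean(population)      # population mean (a parameter)
xbar_1 = mean(sample_1)    # sample mean (a statistic)
xbar_2 = mean(sample_2)    # a different sample gives a different statistic

print(round(mu, 2), xbar_1, xbar_2)   # 24.86 22.5 32.0
```

The parameter stays fixed, while the statistic changes with the sample that happens to be drawn.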


Objective – Determine the Median of a Variable from Raw Data

- Median – the value of a variable that lies in the middle of the data when arranged in ascending order

o The median separates the bottom 50% of the data from the top 50%

Steps in Finding the Median of a Data Set

1. Arrange the data in ascending order (smallest to biggest).

2. Determine the number of observations.

3. Determine the observation in the middle of the data set.

a. If the number of observations is odd, then the median is the data value that is exactly in the middle of the data set.

b. If the number of observations is even, then the median is the mean of the two middle observations in the data set.

*Note: You will have to do this by hand on the exam

Example: Computing a Median of a Data Set with an Odd Number of Observations

Determine the median travel time for all seven employees of the company from the previous example.

23, 36, 23, 18, 5, 26, 43


Example: Computing a Median of a Data Set with an Even Number of Observations

1. Determine the median of the first sample data from the previous example.

23, 36, 5, 26

2. Determine the median of the second sample data from the previous example.

36, 23, 43, 26

3. What do you notice about the medians found in the previous problems, as compared to the means?
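The steps for finding the median can be sketched in Python as a way to check work done by hand. The function name `median` is our own choice; Python's standard library has an equivalent `statistics.median`.

```python
def median(values):
    """Median by hand: sort, count, then pick the middle."""
    ordered = sorted(values)          # Step 1: ascending order
    n = len(ordered)                  # Step 2: number of observations
    mid = n // 2
    if n % 2 == 1:                    # Step 3a: odd n, take the single middle value
        return ordered[mid]
    # Step 3b: even n, take the mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([23, 36, 23, 18, 5, 26, 43]))  # 23
print(median([23, 36, 5, 26]))              # 24.5
print(median([36, 23, 43, 26]))             # 31.0
```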

- Resistant – a numerical summary (mean, median, etc.) of data is resistant if extreme values (very large or small) relative to the data do not affect its value substantially


Now we are going to find the mean and median on our calculators:

First, to input the values into L1 (or any list) hit:

Stat, highlight Edit, Enter, then hit enter after inputting each number:

23, 36, 23, 18, 5, 26, 43

To find the mean hit:

Stat, right arrow once to Calc, Enter to execute 1-Var Stats. Then hit 2nd, L1 (the #1), Enter

Note: To clear a list, arrow to the top of the list until the list name (such as L1) is highlighted, then hit the Clear button followed by Enter. Make sure to hit Clear, not Delete.


Relation Between the Mean, Median, and Distribution Shape

Objective – Determine the Mode of a Variable from Raw Data

- Mode – the most frequent observation of a variable that occurs in the data set

o A set of data can have no mode, one mode, or more than one mode

o If all observations have the same frequency, then the data set has no mode.

o If there are two observations that occur with the highest frequency, then the data is bimodal.

o If there are more than two observations that occur with the highest frequency, then the data is multimodal.
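The rules for the mode can be sketched in Python. This is a minimal illustration under our own conventions: the function name `modes` is hypothetical, it returns a list so that bimodal and multimodal data are handled, and an empty list stands for "no mode".

```python
from collections import Counter

def modes(values):
    """Return the list of modes; an empty list means the data set has no mode."""
    counts = Counter(values)                 # frequency of each observation
    highest = max(counts.values())
    # Several distinct values that all share the same frequency: no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([23, 36, 23, 18, 5, 26, 43]))   # [23]   (one mode)
print(modes([1, 1, 2, 2, 3]))               # [1, 2] (bimodal)
print(modes([4, 7, 9]))                     # []     (no mode)
```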


Example: Finding the Mode of a Data Set

A sample of 30 registered voters was surveyed in which the respondents were asked, “Do you consider

your political views to be conservative, moderate, or liberal?” The results of the survey are shown in the

table.

Determine the mode political view.


Summary of the Measures of Central Tendency


Section 3.2 – Measures of Dispersion

Objectives

1. Determine the range of a variable from raw data

2. Determine the standard deviation of a variable from raw data

3. Determine the variance of a variable from raw data

4. Use the Empirical Rule to describe data that are bell shaped

5. Use Chebyshev’s Inequality to describe any data set

Quick example discussing Dispersion:

To order food at a McDonald’s restaurant, one must choose from multiple lines, while at Wendy’s

Restaurant, one enters a single line.

The following data represent the wait time (in minutes) in line for a simple random sample of 30 customers

at each restaurant during the lunch hour.

The mean wait time at both restaurants is 1.39 minutes


Looking at the graphs, McDonald’s wait time appears to be more dispersed.

Objective – Determine the range of a variable from raw data

- Range (R) – the difference between the largest data value and the smallest data value of a variable (the difference between the max and the min)

o That is, Range = R = Largest Data Value – Smallest Data Value

Example: Finding the Range of a Set of Data

The following data represent the travel times (in minutes) to work for all seven employees of a start-up web

development company. Find the range of the data.

23, 36, 23, 18, 5, 26, 43

Range =
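As a quick check on the hand computation, the range can be sketched in Python with the built-in `max` and `min` functions. The variable names are our own.

```python
travel_times = [23, 36, 23, 18, 5, 26, 43]

# Range = largest data value - smallest data value
data_range = max(travel_times) - min(travel_times)
print(data_range)   # 38
```

Only the two most extreme observations enter the calculation, so a single unusual value can change the range a great deal.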


*We will have to do this by hand and on calculators on exams (and show how you did it)

Objective – Determine the Standard Deviation of a Variable from Raw Data

- Population standard deviation (σ) is given by the formula:

$$\sigma = \sqrt{\frac{(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_N-\mu)^2}{N}} = \sqrt{\frac{\sum (x_i-\mu)^2}{N}}$$

where x1, x2, . . . , xN are the N observations in the population and µ is the population mean.

Example: Computing a Population Standard Deviation

Back to the travel times (in minutes) to work for all seven employees of a start-up web development company. Compute the population standard deviation of this data.

23, 36, 23, 18, 5, 26, 43
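The population standard deviation formula can be traced step by step in Python, mirroring the table of deviations you would build by hand. This is a sketch for checking work; the variable names are our own.

```python
import math

travel_times = [23, 36, 23, 18, 5, 26, 43]   # the whole population, N = 7
N = len(travel_times)
mu = sum(travel_times) / N                   # population mean

# Squared deviation about the mean for each observation
squared_deviations = [(x - mu) ** 2 for x in travel_times]

# sigma = square root of (sum of squared deviations / N)
sigma = math.sqrt(sum(squared_deviations) / N)
print(round(sigma, 2))   # 11.36
```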


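- Sample standard deviation (s) is given by the formula:

$$s = \sqrt{\frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1}} = \sqrt{\frac{\sum (x_i-\bar{x})^2}{n-1}}$$

where x1, x2, . . . , xn are the n observations in the sample and x̄ is the sample mean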

Example: Computing a Sample Standard Deviation

Take a sample of 4 of the travel times to work from the previous example. Compute the sample standard

deviation of this data.

Sample:

*We will have to do this by hand and on calculators on exams (and show how you did it)

Careful!!! It’s similar to the formula for population standard deviation, right?


Using your Calculator to find Population and Sample Standard Deviations

We can use our calculators to find population and sample standard deviations by executing the 1-Var Stats

program on the data and knowing how to interpret the results.

Let’s use the travel times of the start-up company data: 23, 36, 23, 18, 5, 26, 43

Input the data into L1 (if you don’t have it already), execute 1-Var Stats L1 (I hope you remember how).

The population standard deviation is σx =

The sample standard deviation is sx =
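For comparison with the calculator output, Python's standard `statistics` module reports the same two quantities. The sketch below assumes the travel-time data; `pstdev` divides by N (population) and `stdev` divides by n – 1 (sample).

```python
import statistics

travel_times = [23, 36, 23, 18, 5, 26, 43]

sigma_x = statistics.pstdev(travel_times)   # population SD, divides by N
s_x = statistics.stdev(travel_times)        # sample SD, divides by n - 1

print(round(sigma_x, 2), round(s_x, 2))     # 11.36 12.27
```

The sample value is always the larger of the two for the same data, because it divides the same sum of squared deviations by a smaller number.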


Extra Example: Computing a Population and Sample Standard Deviation

Given are the scores of 6 different Statistics students’ first exam:

81, 68, 91, 72, 55, 70

a) Treating the scores as a population, find the standard deviation of the test scores.

b) Now treat the scores as a sample, and find the standard deviation of the test scores.


- Degrees of freedom – we call n – 1 the degrees of freedom because the first n – 1 observations have freedom to be whatever value they wish, but the nth value has no freedom

o The nth value must be whatever value forces the sum of the deviations about the mean to equal zero (it cancels out all the deviations)

Example: Comparing Standard Deviations

Recall the Wendy’s and McDonald’s data. Which data has a larger standard deviation and how do you

know?

Remember we said that the McDonald’s data is more dispersed.

Overall Idea: More dispersed data results in a larger standard deviation.


Objective – Determine the Variance of a Variable from Raw Data

- Variance – the square of the standard deviation

o The population variance is σ²

o The sample variance is s²

Example: Computing a Population Variance

We previously computed the standard deviation of the travel times to work for all seven employees of the

start-up web development company.

23, 36, 23, 18, 5, 26, 43

Recall that the population standard deviation was σ = 11.36 minutes

So the population variance is σ² =

If we were told to treat the data as a sample, the sample standard deviation would be

s = 12.27 minutes

So the sample variance is s² =
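The two variances can be checked with Python's `statistics` module. This sketch computes them from the raw data rather than from the rounded standard deviations, so the results can differ slightly from squaring 11.36 and 12.27 by hand.

```python
import statistics

travel_times = [23, 36, 23, 18, 5, 26, 43]

pop_var = statistics.pvariance(travel_times)   # sigma squared, divides by N
samp_var = statistics.variance(travel_times)   # s squared, divides by n - 1

print(round(pop_var, 2), round(samp_var, 2))   # 128.98 150.48
print(round(11.36 ** 2, 2), round(12.27 ** 2, 2))   # 129.05 150.55 (from rounded SDs)
```

The units of a variance are the square of the original units (here, minutes squared), which is one reason the standard deviation is easier to interpret.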


Objective – Use the Empirical Rule to Describe Data That Are Bell Shaped

The Empirical Rule: 68, 95, 99.7

If a distribution is roughly bell shaped, then

o Approx. 68% of the data will lie within 1 standard deviation of the mean

o Approx. 95% of the data will lie within 2 standard deviations of the mean

o Approx. 99.7% of the data will lie within 3 standard deviations of the mean

Note: We can also use the Empirical Rule based on sample data with x̄ used in place of µ and s used in place of σ.


Example: Using the Empirical Rule

The following data represent the serum HDL cholesterol of the 54 female patients of a family doctor.

41 48 43 38 35 37 44 44 44

62 75 77 58 82 39 85 55 54

67 69 69 70 65 72 74 74 74

60 60 60 61 62 63 64 64 64

54 54 55 56 56 56 57 58 59

45 47 47 48 48 50 52 52 53

We are told that the data has a bell-shaped distribution. Also, the population mean, µ, is 57.4 and the population standard deviation, σ, is 11.7.

a) According to the Empirical Rule, determine the percentage of all patients that have serum HDL within 3 standard deviations of the mean.

b) According to the Empirical Rule, determine the percentage of all patients that have serum HDL between 34 and 69.1.

c) According to the Empirical Rule, determine the percentage of all patients that have serum HDL greater than 69.1.
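To see where cutoffs such as 34 and 69.1 come from, the Empirical Rule intervals for this example can be sketched in Python. The function name `empirical_intervals` is our own, and rounding to one decimal place is a display choice that matches the data.

```python
def empirical_intervals(mean, sd):
    """Intervals within 1, 2, and 3 standard deviations of the mean,
    paired with the approximate percentage the Empirical Rule assigns."""
    percents = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (round(mean - k * sd, 1), round(mean + k * sd, 1), pct)
            for k, pct in percents.items()}

intervals = empirical_intervals(57.4, 11.7)
for k, (low, high, pct) in intervals.items():
    print(f"within {k} SD: {low} to {high}, about {pct}%")
```

Here 34 is 2 standard deviations below the mean and 69.1 is 1 standard deviation above it, so part (b) combines half of the 95% band with half of the 68% band.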


Extra Example: Using the Empirical Rule

A random sample of 30 statistics exams from a previous semester has a mean score, x̄, of 72.1 with a standard deviation of 4.3. The distribution of exam scores is known to be approximately bell-shaped.

Determine the following:

a) The approximate percentage of exams with scores greater than 80.7.

b) About 95% of scores will lie between what two scores?

c) About 68% of scores will lie between what two scores?


Objective – Use Chebyshev’s Inequality to Describe Any Set of Data

Chebyshev’s Inequality:

For any data set or distribution, at least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the observations lie within k standard deviations of the mean, where k is any number greater than 1.

Note: We can also use Chebyshev’s Inequality based on sample data.

Example: Using Chebyshev’s Theorem

Using the data from the previous example, use Chebyshev’s Theorem to determine the following:

a) The minimum percentage of exam scores within 3 standard deviations of the mean, 72.1.

b) The minimum percentage of exam scores between 63.5 and 80.7.
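Chebyshev’s Inequality is a one-line formula, so it is easy to sketch in Python. The function name `chebyshev_min_percent` is our own; the mean 72.1 and standard deviation 4.3 come from the exam-score example.

```python
def chebyshev_min_percent(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Inequality requires k > 1")
    return (1 - 1 / k ** 2) * 100

mean, sd = 72.1, 4.3
k = (80.7 - mean) / sd     # how many standard deviations 80.7 is above the mean

print(round(k, 2))                          # 2.0
print(round(chebyshev_min_percent(3), 1))   # 88.9
print(round(chebyshev_min_percent(2), 1))   # 75.0
```

Unlike the Empirical Rule, these percentages are guaranteed minimums for any distribution, which is why they are smaller than 99.7% and 95%.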


Objective – Compute the Weighted Mean

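- Weighted mean (x̄w) – the mean found by multiplying each value of the variable by its corresponding weight, adding these products, and dividing this sum by the sum of the weights:

$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$$

where wi is the weight of the ith observation and xi is the value of the ith observation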

Example: Computing a Weighted Mean

Bob goes to the “Buy the Weigh” Nut store and creates his own bridge mix. He combines 1 pound of

raisins, 2 pounds of chocolate covered peanuts, and 1.5 pounds of cashews.

The raisins cost $1.25 per pound, the chocolate covered peanuts cost $3.25 per pound, and the cashews cost

$5.40 per pound. What is the cost per pound of this mix?

Gathering the data:

Raisins:

Peanuts:

Cashews:

*Again we will have to do this by hand and on calculators on exams (and show/explain how you did)
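The bridge mix example can be checked with a short Python sketch. The function name `weighted_mean` is our own; the prices per pound are the values and the pounds purchased are the weights.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value), divided by the sum of the weights."""
    total = sum(w * x for x, w in zip(values, weights))
    return total / sum(weights)

price_per_pound = [1.25, 3.25, 5.40]   # raisins, peanuts, cashews
pounds = [1, 2, 1.5]                   # the weights

cost_per_pound = weighted_mean(price_per_pound, pounds)
print(round(cost_per_pound, 2))   # 3.52
```

The plain average of the three prices would be $3.30, which ignores how much of each item went into the mix.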


Section 3.4 - Measures of Position and Outliers

Objectives:

1. Determine and interpret z-scores

2. Interpret percentiles

3. Determine and interpret quartiles

4. Determine and interpret the interquartile range

5. Check a set of data for outliers

Objective – Determine and interpret z-scores

- z-score – measures how many standard deviations a data value is above or below the mean.

o The z-score of a data value can be found using the following formulas. There is both a population z-score and a sample z-score formula:

Population z-score: $z = \frac{x - \mu}{\sigma}$

Sample z-score: $z = \frac{x - \bar{x}}{s}$

o A negative z-score would indicate that the data value is below the mean.

o A positive z-score would indicate that the data value is above the mean.

o Z-scores are typically rounded to two decimal places


Examples: Using Z-Scores

1. You are filling out an application for college. The application requests either your ACT score or your

SAT I score. You scored a 32 on the ACT and a 635 on the SAT I. On the ACT exam, the mean score

is 30 with a standard deviation of 4, while the SAT I has a mean score of 505 with a standard deviation

of 109. Which test score should you provide on your application? Why?

2. A highly selective boarding school will only admit students who place at least 1.5 standard deviations

above the mean on a standardized test that has a mean of 200 and a standard deviation of 26. What is

the minimum score that an applicant must make on the test to be accepted?


Example: Using Z-Scores

Sports enthusiasts love to debate who is a “better” player when a direct comparison cannot occur. For

example, in 2010, Josh Johnson of the Florida Marlins had the lowest earned-run average (ERA is the

mean number of runs yielded per nine innings pitched) of any starting pitcher in the National League, with

an ERA of 2.30. Meanwhile, Clay Buchholz of the Boston Red Sox finished 2nd in the American League

with an ERA of 2.33.

In the National League, the mean ERA in 2010 was 3.622 and the standard deviation was 0.743. In the

American League, the mean ERA in 2010 was 3.929 and the standard deviation was 0.775. Which player

had the better year relative to his peers? Why?

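Answer: Buchholz had the better year relative to his peers, based on his z-score. His ERA was just over 2 standard deviations below the American League mean (z ≈ –2.06), while Johnson’s ERA was less than 2 standard deviations below the National League mean (z ≈ –1.78). A lower ERA is better, so the more negative z-score indicates the stronger season.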
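The z-score comparisons in these examples can be sketched in Python. The function name `z_score` is our own; the means and standard deviations are the ones given in the problems.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

print(round(z_score(32, 30, 4), 2))             # ACT: 0.5
print(round(z_score(635, 505, 109), 2))         # SAT I: 1.19
print(round(z_score(2.30, 3.622, 0.743), 2))    # Johnson: -1.78
print(round(z_score(2.33, 3.929, 0.775), 2))    # Buchholz: -2.06

# Working backward: the score that sits 1.5 standard deviations above a mean of 200
print(200 + 1.5 * 26)                           # 239.0
```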


Objective 2 - Interpret Percentiles

- The kth percentile – a value such that approximately k percent of the observations in a data set are less than or equal to that value.

Example: Interpret a Percentile

The Graduate Record Examination (GRE) is a test required for admission to many U.S. graduate schools.

The University of Pittsburgh Graduate School of Public Health requires a GRE score no less than the 70th

percentile for admission into their Human Genetics MPH or MS program. Interpret this admissions

requirement. (Source: http://www.publichealth.pitt.edu/interior.php?pageID=101.)

Interpretation: In order to be admitted to this program, an applicant must score as high as or higher than 70% of the people who take the GRE. Put another way, the individual’s score must be in the top 30%.

Another (quick) Example: Interpret a Percentile

Sticking with the GRE idea, let’s say that Stanford requires a GRE score no less than the 85th percentile for

admission into their M.B.A. program. In order to be admitted, an applicant would have to score in the top

what percentage on the test?


Objective 3 - Determine and Interpret Quartiles

- Quartiles divide data sets into fourths, or four equal parts

• The 1st quartile, Q1, is equivalent to the 25th percentile.

o Q1 divides the bottom 25% of the data from the top 75%.

• The 2nd quartile, Q2, is equivalent to the median, or the 50th percentile.

o Q2 divides the bottom 50% of the data from the top 50% of the data.

• The 3rd quartile, Q3, is equivalent to the 75th percentile.

o Q3 divides the bottom 75% of the data from the top 25% of the data.

• The 4th quartile is just the maximum value (we don’t really use a Q4).

Finding Quartiles

Step 1 Arrange the data in ascending order.

Step 2 Determine the median, M, or second quartile, Q2.

Step 3 Divide the data set into halves: the observations below (to the left of) M and the observations above M.

- The first quartile, Q1, is the median of the bottom half.

- The third quartile, Q3, is the median of the top half.

*Note: If the number of observations is odd, do not include the median when determining Q1 and Q3 by hand.


Example: Finding and Interpreting Quartiles

A group of BYU students collected data on the speed of vehicles traveling through a construction zone on a

state highway, where the posted speed was 25 mph. The recorded speed of 14 randomly selected vehicles is

given below:

20, 40, 24, 40, 27, 39, 28, 38, 29, 36, 30, 34, 32, 33

Find and interpret the quartiles for speed in the construction zone.

Interpretation:

• Q1: Approximately 25% of the speeds are less than or equal to 28 mph, and 75% of the speeds are greater than 28 mph.

• M (Q2): Approximately 50% of the speeds are less than or equal to 32.5 mph, and 50% of the speeds are greater than 32.5 mph.

• Q3: Approximately 75% of the speeds are less than or equal to 38 mph, and 25% of the speeds are greater than 38 mph.
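The by-hand quartile steps can be sketched in Python. The function names `median` and `quartiles` are our own. Note that software packages (including Python's `statistics.quantiles`) may use other quartile conventions and can give slightly different values, so this sketch follows the halves method from these notes.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1, M, Q3 by the halves method; M is left out of both halves when n is odd."""
    ordered = sorted(values)               # Step 1: ascending order
    n = len(ordered)
    lower = ordered[: n // 2]              # observations below M
    upper = ordered[(n + 1) // 2 :]        # observations above M
    return median(lower), median(ordered), median(upper)

speeds = [20, 40, 24, 40, 27, 39, 28, 38, 29, 36, 30, 34, 32, 33]
print(quartiles(speeds))   # (28, 32.5, 38)
```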


Objective 4 – Determine and Interpret the Interquartile Range

- Interquartile range (IQR) – the range of the middle 50% of the observations in a data set. That is, the IQR is the difference between Q3 and Q1 and is found using the formula:

IQR = Q3 – Q1

Example: Determining and Interpreting the Interquartile Range

Determine and interpret the interquartile range of the speed data in the last example.

Answer: We already found Q1 = 28 and Q3 = 38.

Thus, IQR = Q3 – Q1 = 38 – 28 = 10 mph

Interpretation: The range of the middle 50% of the speed of cars traveling through the construction zone is 10 mph.


Extra Example: Mean, Median, and IQR

Now suppose a 15th car travels through the construction zone at 100 miles per hour. Find the mean, median, and interquartile range, and discuss how this value impacts them.

New data set: 20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40, 100

Answer: Notice that the mean speed increased from 32.1 (found previously) to

However, the median speed only increased from 32.5 to

and the IQR only increased from 10 to

Thus, is the IQR sensitive or resistant to outliers?

Summary: Which measure do I rely on to represent data?


Objective 5 - Check a Set of Data for Outliers

Checking for Outliers by Using Quartiles:

Step 1 Determine the first and third quartiles of the data.

Step 2 Compute the interquartile range.

Step 3 Determine the fences. Fences serve as cutoff points for determining outliers.

Lower Fence = Q1 – 1.5(IQR)

Upper Fence = Q3 + 1.5(IQR)

Step 4 Conclusion: If a data value is less than the lower fence or greater than the upper fence, it is

considered an outlier.

Example: Checking a Set of Data for Outliers

Check the speed data for outliers. The original data set is provided below:

20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40

Step 1: (we previously found) Q1 = and Q3 =

Step 2: (we previously found) The interquartile range is

Step 3: Thus the fences are

Lower Fence: Q1 – 1.5(IQR) =

Upper Fence: Q3 + 1.5(IQR) =

Step 4: Conclusion:
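The four steps for checking outliers can be sketched in Python. The function names `median`, `fences`, and `outliers` are our own, and the quartiles are found with the halves method from these notes (the median is left out of both halves when n is odd).

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def fences(values):
    """Lower and upper fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])          # Step 1: first quartile
    q3 = median(ordered[(n + 1) // 2 :])    # Step 1: third quartile
    iqr = q3 - q1                           # Step 2: interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Step 3: fences

def outliers(values):
    """Step 4: values below the lower fence or above the upper fence."""
    low, high = fences(values)
    return [x for x in values if x < low or x > high]

speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
print(fences(speeds))             # (13.0, 53.0)
print(outliers(speeds))           # []
print(outliers(speeds + [100]))   # [100]
```

The last line reuses the 100 mph car from the earlier extra example: the fences are recomputed for the 15 observations, and 100 falls above the upper fence.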


Extra Practice: Finding the Mean and Standard Deviation of Grouped Data

The following data represent SAT Mathematics scores for 2010. Approximate the mean and standard deviation of the scores both by hand and using your calculator.

SAT Math Score    Frequency

200–299           36,305

300–399           193,968

400–499           459,010

500–599           467,855

600–699           286,518

700–800           104,334

Source: The College Board


Section 3.5 – The Five-Number Summary and Boxplots

Objectives:

1. Compute the five-number summary

2. Draw and interpret boxplots

Objective 1 – Compute the Five-Number Summary

- – the minimum, Q1, median, Q3, and maximum values of a set of data

*Min = smallest data value; Max = largest data value

We organize the Five-Number summary as follows:

Example: Obtaining the Five-Number Summary

Every six months, the United States Federal

Reserve Board conducts a survey of credit card

plans in the U.S.

The following data are the interest rates charged

by 10 credit card issuers randomly selected.

Determine the five-number summary of the data

by hand.


Now, using your calculator, input the data (in any list) and execute the 1-Var Stats program. Scroll down to obtain the Five-Number Summary.

*Note: Input values without percent signs, i.e. 6.5 rather than 6.5%

Five Number Summary:

Objective 2 – Draw and Interpret Boxplots

Example: Drawing a Boxplot

Back to the interest rates charged by the 10 credit card issuers in a previous example. Below is the data (in ascending order). Construct a boxplot.

6.5%, 9.9%, 12.0%, 13.0%, 13.3%, 13.9%, 14.3%, 14.4%, 14.4%, 14.5%

To construct a boxplot of the data using a calculator:

Enter the data into any list (you already have our data in L4)

Hit: 2nd, Stat Plot (Y= button),

Enter (to enter plot 1),

Highlight “On” (by hitting Enter)

Arrow down to Type, Arrow Right to boxplot with

outliers (bottom left figure) and hit Enter

Xlist: L4

Freq: 1

Mark: hit enter on any icon


Now hit Zoom,

arrow down to ZoomStat,

Enter

Once you have the graph, hit:

Trace, then arrow right and left to see key values such as the Five-Number Summary:

6.5, 12.0, 13.6, 14.4, 14.5 (in percents)

The Trace function also provides you the value of the

- Whiskers – a line from Q1 to the smallest data value that is larger than the lower fence, and a line from Q3 to the largest data value that is smaller than the upper fence

Now construct the boxplot by hand. Our calculator does not generate Fences for us, so we have to

remember to draw these ourselves! Remember (from 3.4) Fences serve as cutoff points for determining

outliers.

Again, the formulas for fences are: Lower Fence: Q1 – 1.5(IQR)

Upper Fence: Q3 + 1.5(IQR)
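Before drawing the boxplot by hand, the five-number summary and fences for the interest rate data can be checked with a Python sketch. The function names are our own, and the quartiles use the halves method from Section 3.4.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[(n + 1) // 2 :])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

rates = [6.5, 9.9, 12.0, 13.0, 13.3, 13.9, 14.3, 14.4, 14.4, 14.5]   # in percent
minimum, q1, m, q3, maximum = five_number_summary(rates)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = [r for r in rates if r < lower_fence or r > upper_fence]

print([round(v, 1) for v in (minimum, q1, m, q3, maximum)])   # [6.5, 12.0, 13.6, 14.4, 14.5]
print(round(lower_fence, 1), round(upper_fence, 1))           # 8.4 18.0
print(flagged)                                                # [6.5]
```

Since 6.5 falls below the lower fence, it is marked as an outlier on the boxplot, and the left whisker stops at 9.9, the smallest value that is larger than the lower fence.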


Discussion Example: Use a boxplot to describe the Shape of a Distribution

From the interest rate example, can you describe the shape of the distribution using the boxplot:

The interest rate boxplot indicates that the distribution is
