Page 1: An Introduction to Elementary Statistics[1]

AN INTRODUCTION TO ELEMENTARY STATISTICS
by
Lauretta J. Fox

Contents of Curriculum Unit 86.05.03:
Introduction
Definitions of Statistical Terms
Exercises
Frequency Distributions
Dot Diagrams
Histograms
Frequency Polygon
Cumulative Frequency Polygon
Measures of Central Tendency
Measures of Dispersion

Introduction

Statistics have become an important part of everyday life. We are confronted by them in newspapers and magazines, on television and in general conversations. We encounter them when we discuss the cost of living, unemployment, medical breakthroughs, weather predictions, sports, politics and the state lottery. Although we are not always aware of it, each of us is an informal statistician. We are constantly gathering, organizing and analyzing information and using this data to make judgments and decisions that will affect our actions.

In this unit of study we will try to improve the students' understanding of the elementary topics included in statistics. The unit will begin by discussing terms that are commonly used in statistics. It will then proceed to explain and construct frequency distributions, dot diagrams, histograms, frequency polygons and cumulative frequency polygons. Next, the unit will define and compute measures of central tendency including the mean, median and mode of a set of numbers. Measures of dispersion including range and standard deviation will be discussed. Following the explanation of each topic, a set of practice exercises will be included.

There are several basic objectives for this unit of study. Upon completion of the unit, the student will be able to:
- define basic terms used in statistics.
- compute simple measures of central tendency.
- compute measures of dispersion.
- construct tables and graphs that display measures of central tendency.

The material developed here may be used as a whole unit, or parts of it may be extracted and taught in various courses. The elementary concepts may be incorporated into general mathematics classes in grades seven to twelve, and the more difficult parts may be used in advanced algebra classes in the high school. Depending upon the amount of material used, several days or several weeks may be allotted to teach the unit.

Definitions of Statistical Terms

Statistics is a branch of mathematics in which groups of measurements or observations are studied. The subject is divided into two general categories: descriptive statistics and inferential statistics. In descriptive statistics one deals with methods used to collect, organize and analyze numerical facts. Its primary concern is to describe information gathered through observation in an understandable and usable manner. Similarities and patterns among people, things and events in the world around us are emphasized. Inferential statistics takes data collected from relatively small groups of a population and uses inductive reasoning to make generalizations, inferences and predictions about a wider population.

Throughout the study of statistics certain basic terms occur frequently. Some of the more commonly used terms are defined below:

A population is a complete set of items that is being studied. It includes all members of the set. The set may refer to people, objects or measurements that have a common characteristic. Examples of a population are all high school students, all cats, all scholastic aptitude test scores.

A relatively small group of items selected from a population is a sample. If every member of the population has an equal chance of being selected for the sample, it is called a random sample. Examples of a sample are all algebra students at Central High School, or all Siamese cats.

Data are numbers or measurements that are collected. Data may include numbers of individuals that make up the census of a city, ages of pupils in a certain class, temperatures in a town during a given period of time, sales made by a company, or test scores made by ninth graders on a standardized test.

Variables are characteristics or attributes that enable us to distinguish one individual from another. They take on different values when different individuals are observed. Some variables are height, weight, age and price. Variables are the opposite of constants, whose values never change.

Exercises:

1.) Tell whether each of the following is a variable or a constant:
a.) Scores obtained on a final examination by members of a statistics class.
b.) The cost of clothing purchased each year by secretaries.
c.) The number of days in the month of June.
d.) The time it takes to do grocery shopping.
e.) The age at which one may become a voter in the United States of America.

2.) Fill in the missing word to make a true statement.
a.) ____ are measurements obtained by observation.
b.) A ____ is a complete set of items.
c.) ____ ____ takes data collected from a small group and makes predictions about a wider population.
d.) When every member of a set has an equal chance of being selected as part of a sample, the sample is called a ____ ____.
e.) Characteristics that vary from one individual to another are ____.
f.) The study that deals with methods of collecting, organizing and analyzing data is ____ ____.

Frequency Distributions

Groups of data have little value until they have been placed in some kind of order. Usually measurements are arranged in ascending or descending order. Such a group is an array or distribution. A frequency distribution is a table in which measurements are tallied and the frequency, or total number of times that each item occurs, is recorded.

Example 1: The frequency distribution below shows data obtained in a survey asking a group of people to name their favourite among several kinds of cars. Use the table to answer the following questions:
a.) How many people are included in the sample?
b.) What percent of the people surveyed preferred Chevrolets?
c.) What is the ratio of people who prefer Oldsmobiles to those who prefer Buicks?
d.) If the number of Subarus were increased by three, what would the percent of increase be?

(figure available in print form)

Solution:
a.) 25 + 18 + 15 + 12 + 10 = 80
80 people are included in the survey.
b.) 25 ÷ 80 = .3125 = 31.25%
31.25% of the people surveyed preferred Chevrolets.
c.) 15:10 = 3:2
The ratio of people who prefer Oldsmobiles to those who prefer Buicks is 3:2.
d.) 3:18 = x:100
1:6 = x:100
6x = 100
x = 16 2/3
The percent of increase is 16 2/3%.
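The arithmetic in this solution can be checked with a few lines of Python. This is only a sketch: the table itself is not reproduced here, so the frequencies are the five values quoted in the solution, the pairing of each value with a make is inferred from the worked answers, and the variable names are illustrative.

```python
from fractions import Fraction

# Frequencies quoted in the worked solution (the printed table is not shown).
chevrolet, subaru, oldsmobile, buick = 25, 18, 15, 10
other = 12  # fifth frequency in the sum; its label is not given in the text

total = chevrolet + subaru + oldsmobile + other + buick   # a.) sample size
pct_chevrolet = 100 * chevrolet / total                   # b.) percent of sample
ratio = Fraction(oldsmobile, buick)                       # c.) ratio in lowest terms
pct_increase = Fraction(3, subaru) * 100                  # d.) increase of 3 on 18

print(total)                                      # 80
print(pct_chevrolet)                              # 31.25
print(f"{ratio.numerator}:{ratio.denominator}")   # 3:2
print(pct_increase)                               # 50/3, that is 16 2/3
```

Using `Fraction` keeps the ratio and the percent of increase exact, which matches the mixed-number answer 16 2/3 in the text.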

When the number of measurements in a survey is large, or when the range, that is, the difference between the highest and lowest measurements in the survey, is great, it is usually more efficient to arrange the data in intervals and show the number of items within each group. The number of intervals used in a frequency distribution may vary. However, it has been found that ten to twenty intervals are most practical. The following steps may be used to set up a frequency distribution:

1.) Select an appropriate number of intervals for the given data.
2.) Find the difference between the highest and lowest measurements in the data. Add one to the result and divide the sum by the number of intervals. If the quotient is not an integer, round it to the nearest odd integer. This will be the size or width of each interval and will be designated by the symbol w.
3.) The lowest number in the bottom interval will be the lowest measurement in the given data. Add (w - 1) to this measurement to obtain the highest number in the bottom interval. The next interval begins at the integer following the highest number in the bottom group. Continue in this manner for each successive higher interval until every measurement has been placed in its proper group.
4.) After the intervals have been established, a tally mark is placed by the interval for each measurement in the group. The frequency, or number of measurements in each interval, is indicated with a numeral.
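The four steps above can be sketched in Python. This is a minimal illustration rather than a definitive implementation: it assumes integer measurements, the function names `interval_width` and `frequency_distribution` are invented for the example, and the sample data are made up.

```python
def interval_width(scores, n_intervals):
    """Step 2: width w, rounded to the nearest odd integer when not whole."""
    quotient = (max(scores) - min(scores) + 1) / n_intervals
    if quotient == int(quotient):
        return int(quotient)
    # Odd integers are 2k + 1, so round (quotient - 1) / 2 to the nearest k.
    return 2 * round((quotient - 1) / 2) + 1


def frequency_distribution(scores, n_intervals):
    """Steps 3 and 4: (low, high, frequency) rows, lowest interval first."""
    w = interval_width(scores, n_intervals)
    rows = []
    low = min(scores)
    while low <= max(scores):
        high = low + (w - 1)
        freq = sum(1 for s in scores if low <= s <= high)  # the tally
        rows.append((low, high, freq))
        low = high + 1          # next interval starts at the following integer
    return rows


sample = [12, 15, 21, 30, 18, 25, 12, 27, 33, 19]  # made-up measurements
for low, high, freq in frequency_distribution(sample, 5):
    print(f"{low}-{high}: {freq}")
```

Because the width is rounded, the number of rows produced can differ slightly from the number of intervals requested in step 1.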

Example 2: Make a frequency distribution of the following scores obtained by 40 students on a mathematics test.

86  82  56  73  87  89  72  86  88  76
72  69  84  85  62  97  70  78  84  93
70  60  91  76  83  94  65  72  92  81
98  78  88  76  96  89  90  83  74  80

Solution: Use ten intervals.
Highest Score - Lowest Score = 98 - 56 = 42
(42 + 1) ÷ 10 = 43 ÷ 10 = 4.3
Round to 5. The size of each interval is 5.

Scores    Tally        Frequency
96-100    |||          3
91-95     ||||         4
86-90     ||||| |||    8
81-85     ||||| ||     7
76-80     ||||| |      6
71-75     |||||        5
66-70     |||          3
61-65     ||           2
56-60     ||           2

Although it is not necessary, it is often helpful for use in further analysis to have additional information in a frequency distribution. This additional information may include the midpoint of each interval, the percentage of the numbers in the frequency column relative to the total frequencies, the cumulative frequency or successive summation of entries in the frequency column, and the percentage of the cumulative frequency.

Example 3: In the frequency distribution for Example 2 find (a) the midpoint of each interval; (b) the percentage of each frequency relative to the total frequencies; (c) the cumulative frequency; and (d) the percentage of cumulative frequency relative to the total frequencies.

Solution:
(a) Since the width of each interval is 5, the third score is the midpoint of the interval. For example, the lowest interval contains the scores 56, 57, 58, 59, 60. 58 is the midpoint of this interval.
(b) To find the percentage of each frequency, divide the frequency by the total number of measurements and change the resulting decimal to a percent. The frequency of the lowest interval is 2. The total number of measurements is 40. 2 ÷ 40 = .05 = 5%
(c) The cumulative frequency at any interval may be obtained by successively adding the frequencies of all the groups from the lowest interval up to and including the given interval. The cumulative frequency of the interval 76-80 is 2 + 2 + 3 + 5 + 6 = 18.
(d) To obtain the percentage of cumulative frequency relative to the total of the frequencies, divide the cumulative frequency by the total number of measurements. Change the resulting decimal to a percent. The percentage of the cumulative frequency in the interval 76-80 is 18 ÷ 40 = .45 = 45%. This figure may also be found by adding the percentage of frequency of all groups from the lowest up to and including the given interval.

Scores    Midpoint   Frequency   % of Frequency   Cumulative Frequency   % of Cumulative Frequency
96-100    98         3           7.5              40                     100.0
91-95     93         4           10.0             37                     92.5
86-90     88         8           20.0             33                     82.5
81-85     83         7           17.5             25                     62.5
76-80     78         6           15.0             18                     45.0
71-75     73         5           12.5             12                     30.0
66-70     68         3           7.5              7                      17.5
61-65     63         2           5.0              4                      10.0
56-60     58         2           5.0              2                      5.0
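The extra columns of this table can be produced mechanically from the intervals and their frequencies. The sketch below is one possible way to do it; the variable names and the dictionary layout are choices made for this illustration only.

```python
# Intervals and frequencies from the table above, lowest interval first.
rows = [(56, 60, 2), (61, 65, 2), (66, 70, 3), (71, 75, 5), (76, 80, 6),
        (81, 85, 7), (86, 90, 8), (91, 95, 4), (96, 100, 3)]

total = sum(freq for _, _, freq in rows)
running = 0
summary = []
for low, high, freq in rows:
    running += freq                          # (c) cumulative frequency
    summary.append({
        "interval": f"{low}-{high}",
        "midpoint": (low + high) / 2,        # (a) midpoint of the interval
        "pct": 100 * freq / total,           # (b) percent of total frequency
        "cum": running,
        "cum_pct": 100 * running / total,    # (d) percent of cumulative frequency
    })

for r in summary:
    print(r["interval"], r["midpoint"], r["pct"], r["cum"], r["cum_pct"])
```

The row for 76-80, for instance, comes out with midpoint 78, 15 percent of the scores, cumulative frequency 18 and cumulative percent 45, in agreement with the worked solution.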

Exercises:

1.) Ask the students in each of your classes which of the following colors they prefer: red, blue, yellow, green, brown, or purple. Construct a frequency distribution to display the results of your survey.
a.) How many people are included in the sample?
b.) What percent of the people surveyed prefer yellow? red? purple?
c.) What is the ratio of people who prefer green to those who prefer blue?
d.) What is the most popular color?
e.) What is the least popular color?
f.) If the number of people who prefer red were decreased by 2, what would be the percent of decrease?

2.) Tally the following scores in a frequency distribution. Do not use grouping.

84  98  92  88  91  91  85  80  84  93
92  80  91  84  87  85  84  80  87  95

3.) Make a frequency distribution of the following scores obtained by a basketball team.

72   104  95   93   96   76   105  100
88   62   79   78   87   78   89   81
110  68   96   106  80   87   86   84
102  84   96   88   82   83   92   87
87   85   108  90   94   98   78   80

a.) Use ten intervals and display the midpoint of each interval.
b.) Calculate the percentage of frequency of each interval.
c.) Find the cumulative frequency for each interval.
d.) Calculate the percentage of each cumulative frequency relative to the total of the frequencies.

Dot Diagrams

Many people find it easier to obtain information from pictures than from written material. Statisticians display mathematical relationships with diagrams and graphs. From these pictures numerical data can be summarized clearly and easily.

When the data of a frequency distribution have not been grouped in intervals, they can be represented on a dot diagram. A dot diagram illustrates the pattern of a distribution. It clearly shows whether the data are spread out evenly or if they tend to cluster about any point. To construct a dot diagram, list the measurements, from lowest to highest, horizontally across the bottom of the graph. On the left side, vertically list the frequencies or number of times that the measurements occur. For each time a measurement occurs, place a dot in the column above the measurement.

Example: Construct a dot diagram to represent the following distribution of daily temperature highs in twenty-four cities in the United States.

67  68  69  70  70  71  71  71
72  72  72  74  74  74  74  76
76  76  76  80  80  80  84  85
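A dot diagram for these temperatures can be roughed out in plain text with Python. This is a sketch under two assumptions that are not in the original: asterisks stand in for dots, and every whole degree between the lowest and highest temperature gets a column so that gaps in the data stay visible.

```python
from collections import Counter

temps = [67, 68, 69, 70, 70, 71, 71, 71, 72, 72, 72, 74,
         74, 74, 74, 76, 76, 76, 76, 80, 80, 80, 84, 85]

counts = Counter(temps)                            # frequency of each measurement
values = list(range(min(temps), max(temps) + 1))   # horizontal axis, with gaps

lines = []
for level in range(max(counts.values()), 0, -1):   # top row first
    cells = [" * " if counts[v] >= level else "   " for v in values]
    lines.append(f"{level} |" + "".join(cells))
lines.append("  +" + "---" * len(values))
lines.append("   " + "".join(f"{v:3d}" for v in values))
print("\n".join(lines))
```

Each row corresponds to one frequency level on the vertical scale, so the tallest columns (74 and 76 degrees, four dots each) show where the temperatures cluster.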

Solution: (figure available in print form)

Exercises:

1.) Twenty workers were rated on a scale of 1 to 10 for efficiency. Construct a dot diagram to represent the following ratings: 7, 8, 9, 4, 5, 5, 7, 10, 6, 8, 7, 7, 5, 6, 9, 6.

2.) Draw a dot diagram to represent the following scores received on a spelling test: 98, 100, 78, 75, 68, 62, 75, 80, 82, 94, 80, 72, 75, 85, 85, 80, 70, 82, 78, 78, 72, 70, 90, 65.

3.) The distribution of heights of fifteen children is given below. Show the distribution on a dot diagram.

Height in Inches   Frequency
56                 2
58                 3
60                 7
62                 2
64                 1

Histograms

A frequency distribution can be represented graphically on a histogram. A histogram is a bar graph on which the bars are adjacent to each other with no space between them. To construct a histogram, arrange the data in equal intervals. Represent the frequencies along the vertical axis and the scores along the horizontal axis. The true limits of any interval extend one half unit beyond the endpoints established for the interval and are represented in this manner on the horizontal axis. For example, the true limits of the interval 76-80 are 75.5 and 80.5. To get the proper perspective, the vertical axis should be approximately three-fourths as long as the horizontal axis.

Example: Illustrate the following set of measurements on a histogram:

72  82  56  73  87  89  72  86  88  76
86  69  84  85  62  97  70  78  84  93
70  60  91  76  83  94  65  72  92  81
98  78  88  76  96  89  90  83  74  80
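The true limits described above, and the bar heights they go with, can be computed directly from these measurements. The sketch below assumes intervals of width 5 starting at the lowest score, and it draws each bar sideways with `#` characters; both choices are made for illustration only.

```python
scores = [72, 82, 56, 73, 87, 89, 72, 86, 88, 76,
          86, 69, 84, 85, 62, 97, 70, 78, 84, 93,
          70, 60, 91, 76, 83, 94, 65, 72, 92, 81,
          98, 78, 88, 76, 96, 89, 90, 83, 74, 80]
width = 5  # assumed interval size

bars = []
low = min(scores)
while low <= max(scores):
    high = low + width - 1
    freq = sum(1 for s in scores if low <= s <= high)
    # True limits reach half a unit past the stated endpoints,
    # so neighbouring bars touch with no space between them.
    bars.append((low - 0.5, high + 0.5, freq))
    low = high + 1

for lower, upper, freq in bars:
    print(f"{lower:5.1f} to {upper:5.1f} | " + "#" * freq)
```

The upper true limit of one bar equals the lower true limit of the next (for example 80.5), which is why the bars of a histogram are adjacent.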

Solution:

Scores    Frequency
96-100    3
91-95     4
86-90     8
81-85     7
76-80     6
71-75     5
66-70     3
61-65     2
56-60     2

(figure available in print form)

Exercises:

1.) Construct a histogram for the following scores earned by a group of high school students on a Scholastic Aptitude Examination.

Score Number of Students

400-449 20

450-499 35

500-549 50

550-599 50

600-649 40

650-699 20

700-749 10

2.) The weights of 40 football players are as follows:

210 181 192 164 170 186 205 194

178 161 175 195 172 188 196 182

206 188 165 202 178 163 190 198

187 198 174 172 183 208 185 162

203 172 196 184 185 176 197 184

a.) Construct a frequency distribution for the given data. b.) Make a histogram for the given data.

Frequency Polygon

A frequency polygon is a line graph which can be used to represent the frequency of a set of numbers. It is formed by connecting a series of points. The abscissa of each point is the midpoint of the interval in which the point lies. The ordinate of each point is the frequency for the interval. The polygon is closed at each end by drawing a line from the endpoints to the horizontal axis at the midpoint of the next interval.

Example: Illustrate the following data on a frequency polygon:

Scores    Midpoint   Frequency
96-100    98         3
91-95     93         3
86-90     88         4
81-85     83         6
76-80     78         8
71-75     73         5
66-70     68         3
61-65     63         2
56-60     58         2
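The vertices of the polygon for this table can be listed in a few lines of Python. The sketch only builds the list of points; the names `points` and `polygon` are illustrative, and drawing the graph itself is left to whatever plotting tool is at hand.

```python
# (midpoint, frequency) pairs from the table above, lowest interval first.
points = [(58, 2), (63, 2), (68, 3), (73, 5), (78, 8),
          (83, 6), (88, 4), (93, 3), (98, 3)]
width = 5  # distance between neighbouring midpoints

# Close the polygon on the horizontal axis at the midpoints of the
# empty intervals just below (51-55) and just above (101-105) the data.
first_x = points[0][0] - width
last_x = points[-1][0] + width
polygon = [(first_x, 0)] + points + [(last_x, 0)]

print(polygon[0], polygon[-1])   # (53, 0) (103, 0)
```

Connecting the entries of `polygon` in order with straight line segments gives the closed figure described in the text.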

Solution: (figure available in print form)

Exercises:

1.) The following table shows the weekly wages earned by workers in a local hospital:

Number of People   11     11     15     18     13     12     10
Weekly Wage        $140   $200   $180   $160   $190   $150   $170

a.) Draw a histogram for the given data.
b.) Construct a frequency polygon for the given data.

2.) A baseball team made the following number of hits in a recent game:

Inning           1   2   3   4   5   6   7   8   9
Number of Hits   1   4   2   3   3   5   3   2   1

a.) Draw a dot diagram for the given data.
b.) Make a histogram for the given data.
c.) Construct a frequency polygon for the given data.

3.) The students in an English class received the following scores on a test:

60  95  85  100  81
56  87  80  62   75
73  64  69  86   93
82  77  91  58   69
76  94  72  88   78

a.) Make a frequency distribution for the given scores.
b.) Draw a histogram to represent the scores.
c.) Construct a frequency polygon for the given data.

Cumulative Frequency Polygon

Another method of graphical representation is the cumulative frequency polygon. The cumulative frequency polygon is a line graph which is used to picture cumulative frequencies of a set of numbers. The abscissa of each point is the upper limit of an interval in a frequency distribution. The ordinate of each point is the corresponding cumulative frequency. The graph starts at a frequency of zero for a group below the lowest interval in the distribution.

Example: Construct a cumulative frequency polygon to represent the following scores obtained by 40 students on a mathematics test.

86  82  56  73  87  89  72  86  88  76
72  69  84  85  62  97  70  78  84  93
70  60  91  76  83  94  65  72  92  81
98  78  88  76  96  89  90  83  74  80

Solution: Make a frequency distribution for the scores, then draw the graph.

Scores    Frequency   Cumulative Frequency   % of Cumulative Frequency
96-100    3           40                     100.0
91-95     4           37                     92.5
86-90     8           33                     82.5
81-85     7           25                     62.5
76-80     6           18                     45.0
71-75     5           12                     30.0
66-70     3           7                      17.5
61-65     2           4                      10.0
56-60     2           2                      5.0
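The points of the cumulative frequency polygon for this table, and the way the graph is read to find the score at a given percent, can be sketched in Python. Several things here are assumptions made for the illustration: the stated upper limits (60, 65, ...) are used as abscissas, the starting point is placed at 55, the helper name `score_at_percent` is invented, and straight-line interpolation stands in for reading the drawn graph. Values read by eye from a hand-drawn figure are approximate and may differ a little from the interpolated ones.

```python
# (upper limit, cumulative frequency) from the table, lowest interval first.
# The extra first point starts the graph at zero, one interval below the data.
vertices = [(55, 0), (60, 2), (65, 4), (70, 7), (75, 12),
            (80, 18), (85, 25), (90, 33), (95, 37), (100, 40)]
total = vertices[-1][1]


def score_at_percent(percent):
    """Score at which the cumulative percent reaches the given value."""
    target = total * percent / 100
    for (x0, c0), (x1, c1) in zip(vertices, vertices[1:]):
        if c0 <= target <= c1:
            # Slide along the segment between two neighbouring vertices.
            return x0 + (target - c0) * (x1 - x0) / (c1 - c0)
    raise ValueError("percent must be between 0 and 100")


print(score_at_percent(62.5))          # 85.0, exactly at a vertex
print(round(score_at_percent(50), 1))  # 81.4
```

The first call lands on a vertex of the polygon, since 62.5 percent is the cumulative percent listed for the interval 81-85; the second falls between two vertices and is interpolated.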

(figure available in print form)

For some purposes the cumulative frequency polygon is very valuable. On the right side of the polygon is a scale of percent that parallels the scale of cumulative frequency. On the percent scale you read 25 corresponding to an abscissa of 72. This means that 25% of the scores were 72 or lower. The figure 72 is called the 25th percentile. The nth percentile is that score below which n percent of the scores in the distribution will fall.

To find the score that corresponds to a percentile on the graph, draw a horizontal line through the desired percent to intersect the cumulative frequency polygon. From the point of intersection draw a vertical line to the x-axis. The score at the point of intersection of the vertical line and the x-axis corresponds to the required percentile. The fiftieth percentile is the median or middle score in a set of measurements. The 25th percentile is called the lower quartile, and the 75th percentile is the upper quartile.

Exercises:

1.) During one week a dealer sold the following number of cars: Monday 12, Tuesday 15, Wednesday 5, Thursday 6, Friday 10, Saturday 12.
a.) Construct a histogram to represent the given data.
b.) Make a frequency polygon to represent the given data.
c.) Draw a cumulative frequency polygon to represent the given data.

2.) The heights in inches of 50 high school students are:

60  68  74  79  62  75  60  65  61  64
71  72  63  66  71  60  60  73  63  65
73  68  76  75  62  76  72  70  69  62
78  71  68  62  74  69  67  70  61  63
72  67  71  68  62  60  70  69  65  64

a.) Group the data into a frequency table.
b.) Construct a histogram to represent the data.
c.) Construct a cumulative frequency polygon.
d.) Find the median height. Find the upper and lower quartiles.
e.) Determine the 80th percentile.

3.) Forty students have the following IQ scores:

120 100 115 126 82 108 114 95

150 92 140 88 98 116 134 138

98 87 110 92 106 96 126 102

80 82 100 128 110 100 118 84

88 98 94 85 124 90 80 112

a.) Group the data into a frequency table.
b.) Construct a cumulative frequency polygon.
c.) Determine the median IQ score and the 70th percentile.

Measures of Central Tendency

When statisticians study a group of measurements, they try to determine which measure is most representative of the group. The score about which most of the other scores tend to cluster is a measure of central tendency. Three measures of central tendency are the mode, the median and the mean.

The mode of a set of numbers is the element that appears most frequently in the set. There can be more than one mode in a set of numbers. A set that has two modes is bimodal, and one that has three modes is trimodal. If no element of a set appears more often than any other element, the set has no mode. The mode is an important measure for business people. It tells them what items are most popular with consumers.

Example 1: Find the mode of the following set of numbers: 34, 26, 30, 34, 28, 32, 32, 34, 33, 31, 33, 30.
Solution:

Element   Frequency
26        1
28        1
30        2
31        1
32        2
33        2
34        3

The number 34 occurs most frequently, hence 34 is the mode of the set.

Example 2: Find the mode of the following set of numbers: 13, 17, 14, 20, 18
Solution:

Element   Frequency
13        1
17        1
14        1
20        1
18        1

No number appears more often than any other number in the set. The set has no mode.

Example 3: Find the mode of the following set of numbers: 1, 2, 2, 3, 4, 4, 5
Solution:

Element   Frequency
1         1
2         2
3         1
4         2
5         1

The numbers 2 and 4 each appear twice. The set has two modes: 2 and 4.

Another measure of central tendency is the median. When the elements of a set of numbers have been arranged in ascending order, the number in the middle of the set is the median of the set. The median divides the set of data into two equal parts. On a cumulative frequency polygon the median is the 50th percentile. To determine which element of a set is the middle number, use the following formula:

Middle Number = (Total Number of Elements + 1) ÷ 2

If the set contains an even number of elements, the median is the average of the two middle numbers.

Example 1: The weights of nine children are as follows: 99, 98, 73, 81, 79, 86, 90, 94, 71. Find the median weight.
Solution: Arrange the weights in order from lowest to highest: 71, 73, 79, 81, 86, 90, 94, 98, 99
(9 + 1) ÷ 2 = 10 ÷ 2 = 5
The fifth number of the set is the middle number. The median weight is 86.

Example 2: Ten students received the following scores on an examination: 96, 68, 78, 82, 87, 74, 80, 70, 86, 84. Find the median score.
Solution: Arrange the scores in ascending order: 68, 70, 74, 78, 80, 82, 84, 86, 87, 96.
(10 + 1) ÷ 2 = 11 ÷ 2 = 5.5
The two middle numbers of the set are the fifth and sixth numbers: 80 and 82.
(80 + 82) ÷ 2 = 162 ÷ 2 = 81
The median score is 81.

A third, and most widely used, measure of central tendency is the arithmetic mean. The arithmetic mean is the average of a set of numbers. It is usually denoted by the symbol $\bar{x}$. To calculate the arithmetic mean of a set of numbers, add the members of the set and divide the sum by the number of items in the set.

Example: Find the arithmetic mean of the following set of numbers: 25, 15, 20, 30, 10.
Solution: (25 + 15 + 20 + 30 + 10) ÷ 5 = 100 ÷ 5 = 20
The arithmetic mean of the set is 20.

Sometimes an item appears more than once in a set of measures. To find the arithmetic mean of a set of measures when some items occur several times, multiply each item in the set by its frequency and divide the sum of these products by the total number of items in the set.

Example: Find the arithmetic mean of the following numbers: 28, 24, 22, 24, 26, 26, 22, 24, 22, 28, 30, 24.
Solution:

Item Frequency Product

22 3 66

24 4 96

26 2 52

28 2 56

30 1 30

Sum of Products = 66 + 96 + 52 + 56 + 30 = 300
Total Number of Items = 3 + 4 + 2 + 2 + 1 = 12
Sum of Products ÷ Total Number of Items = 300 ÷ 12 = 25
The arithmetic mean is 25.

When the data have been arranged in intervals in a frequency distribution, the arithmetic mean is obtained in the following manner:

1.) Multiply the midpoint of each interval by the frequency of the interval.
2.) Find the sum of the products obtained in step 1.
3.) Divide the sum obtained in step 2 by the total number of items in the distribution.

The formula used to find the arithmetic mean is:

$\bar{x} = \frac{1}{n} \sum_{i} x_i f_i$

$\bar{x}$ = arithmetic mean
$x_i$ = midpoint of the interval
$f_i$ = frequency of the interval
$n$ = number of items in the distribution
$\sum$ = sum, taken over all the intervals
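The three measures of central tendency described in this section can be sketched as small Python functions. These are illustrations under stated assumptions, not standard definitions: the function names are invented here, `modes` returns an empty list for a set with no mode, `median` follows the (n + 1) ÷ 2 position rule from the text, and `grouped_mean` applies the formula above to a list of midpoints (or repeated items) and their frequencies.

```python
from collections import Counter


def modes(values):
    """All most-frequent elements; an empty list means the set has no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                        # every element is equally frequent
    return sorted(v for v, c in counts.items() if c == top)


def median(values):
    """Middle number located at position (n + 1) / 2 of the sorted list."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2    # 1-based; ends in .5 when n is even
    lower = ordered[int(position) - 1]
    upper = ordered[int(position + 0.5) - 1]
    return (lower + upper) / 2           # the same number twice when n is odd


def grouped_mean(midpoints, freqs):
    """Sum of midpoint times frequency, divided by the number of items."""
    n = sum(freqs)
    return sum(x * f for x, f in zip(midpoints, freqs)) / n


print(modes([1, 2, 2, 3, 4, 4, 5]))                          # [2, 4]
print(median([96, 68, 78, 82, 87, 74, 80, 70, 86, 84]))      # 81.0
print(grouped_mean([22, 24, 26, 28, 30], [3, 4, 2, 2, 1]))   # 25.0
```

The three calls reuse data from the worked examples in this section, so the printed values can be compared with the answers obtained by hand.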

Example: Find the arithmetic mean for the following distribution:

Scores    Midpoint   Frequency   $x_i f_i$
96-100    98         3           294
91-95     93         4           372
86-90     88         8           704
81-85     83         7           581
76-80     78         6           468
71-75     73         5           365
66-70     68         3           204
61-65     63         2           126
56-60     58         2           116

Solution:
$n$ = 3 + 4 + 8 + 7 + 6 + 5 + 3 + 2 + 2 = 40
$\sum_{i} x_i f_i$ = 294 + 372 + 704 + 581 + 468 + 365 + 204 + 126 + 116 = 3230
$\bar{x} = \frac{1}{n} \sum_{i} x_i f_i = \frac{1}{40} \times 3230 = \frac{3230}{40} = 80.75$
The arithmetic mean of the distribution is 80.75.

Exercises:

1.) Ten employees of a department store earn the following weekly wages: $200, $150, $160, $125, $160, $150, $180, $130, $170, $150
a.) Find the average weekly income.
b.) What is the median wage?
c.) Find the mode.

2.) Write mean, median, or mode to complete the sentence.

a.) 7, 13, 8, 5, 9, 12. The ____ is 9.

b.) 6, 2, 4, 7, 6, 3. The ____ is 6.

c.) 18, 10, 21, 17, 12. The ____ is 17.

d.) 8, 3, 9, 4, 10, 14. The ____ is 8.

e.) 13, 11, 8, 15, 9, 10. The ____ is 10.5.

3.) Find the mean, the median and the mode for each set of numbers. a.) 72, 68, 56, 65, 72, 56, 68. b.) 13, 19, 12, 18, 24, 10. c.) 125, 132, 120, 118, 128, 126, 120. d.) 8, 4, 6, 4, 10, 4, 10.

4.) Find the arithmetic mean of the following numbers:

Number Frequency

32 4

36 2

38 6

40 8

5.) The salaries of thirty people are listed below.

$12,500   $23,900   $18,750   $24,000   $14,000
$18,750   $11,570   $25,000   $ 9,200   $15,000
$24,000   $22,000   $20,500   $12,500   $17,300
$10,980   $15,550   $18,750   $18,000   $16,200
$ 8,750   $12,500   $10,980   $13,000   $19,850
$32,000   $13,000   $22,000   $35,000   $21,000

a.) Arrange the salaries in intervals and make a frequency table for the set of data.
b.) What is the mode of the salaries?
c.) What is the median salary?
d.) What is the mean salary?

Measures of Dispersion

Measures of central tendency very often present an incomplete picture of data. In order to evaluate more completely any group of scores it is necessary to measure the spread or dispersion of the data being studied. One way to indicate the spread of scores is by the range of scores. The range of a set of numbers is the difference between the highest and lowest numbers of the set. To find the range of a set of numbers, use the following formula:

Range = Highest Number - Lowest Number

Example: What is the range of the following set of numbers? 3, 1, 6, 12, 9, 8, 10, 15
Solution: The highest number in the set is 15. The lowest number in the set is 1. 15 - 1 = 14. The range of the set is 14.

Another way of indicating the dispersion of scores is in terms of their deviations from the mean. This method is known as standard deviation and tells how scores tend to scatter about the mean of a set of data. If the standard deviation is small, the scores tend to cluster closely about the mean. If the standard deviation is large, there is a wide scattering of scores about the mean. Standard deviation is represented by the symbol s and may be computed by the formula:

Standard Deviation = $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$

where $x$ is a score, $\bar{x}$ is the mean, $n$ is the number of scores, and $\sum$ means "the sum of".

Six steps are used to find standard deviation:
1.) List each score ($x$) in the set of data.
2.) Compute the mean ($\bar{x}$) for the data.
3.) Subtract the mean from each score ($x - \bar{x}$). The result is the deviation of each score from the mean.
4.) Square the deviations.
5.) Find the average of the squares of the deviations by dividing the sum of the squares of the deviations by the number of scores in the distribution.
6.) Take the square root of this average. The result is the standard deviation.
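The six steps above translate almost line for line into Python. The following is a sketch: the function name `standard_deviation` is chosen for this illustration, and the scores 2 through 8 are used as sample input.

```python
from math import sqrt


def standard_deviation(scores):
    """Standard deviation by the six steps, dividing by the number of scores."""
    n = len(scores)                             # step 1: the listed scores
    mean = sum(scores) / n                      # step 2: the mean
    deviations = [x - mean for x in scores]     # step 3: deviations from the mean
    squares = [d ** 2 for d in deviations]      # step 4: squared deviations
    mean_square = sum(squares) / n              # step 5: average of the squares
    return sqrt(mean_square)                    # step 6: square root


print(standard_deviation([2, 3, 4, 5, 6, 7, 8]))   # 2.0
```

For these scores the mean is 5, the squared deviations are 9, 4, 1, 0, 1, 4, 9, their average is 28 ÷ 7 = 4, and the square root is 2. Note that step 5 divides by n, the full number of scores; Python's `statistics.pstdev` follows the same convention, whereas `statistics.stdev` divides by n - 1 and gives a slightly larger value.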
Example: Compute the standard deviation for the scores: 2, 3, 4, 5, 6, 7, 8

Solution: The mean is (2 + 3 + 4 + 5 + 6 + 7 + 8) / 7 = 5. The deviations from the mean are -3, -2, -1, 0, 1, 2, 3, and their squares are 9, 4, 1, 0, 1, 4, 9. The sum of the squares is 28, so the average of the squares is 28 / 7 = 4. The standard deviation is the square root of 4, which is 2.

The standard deviation is a number that is used to compare scores in a distribution. If the mean of a group of test scores is 75, and the standard deviation is 10, a person who receives a score of 85 is one standard deviation above the mean. If the mean of another group of test scores is 80, and the standard deviation is 3, a person who receives a score of 83 is one standard deviation above the mean. This person has done equally well, with respect to the other class members, as the person who received 85 on the first test.

Exercises:
1.) Compute the range for the following sets of scores:
a.) 24, 15, 19, 29, 24, 22
b.) 113, 98, 107, 102, 123, 110
c.) 72.9, 75.6, 74.3, 86.1, 80, 82.7
d.) 56, 72, 98, 64, 87, 91, 22
2.) Compute the standard deviations for the following sets of scores:
a.) 26, 18, 19, 29, 20, 26
b.) 111, 98, 107, 103, 126
c.) 72.9, 75.6, 74.3, 86.1, 80, 82.7
3.) On an arithmetic test the mean was 78 and the standard deviation was 8. How many standard deviations from the mean was each of the following scores? 86, 74, 94, 80, 98, 70, 62
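The range formula and the six-step procedure for the standard deviation can be sketched in Python. This is only an illustration: the function names are made up here, and the standard deviation divides by the number of scores n, as the steps in this unit describe.

```python
import math

def value_range(scores):
    """Range = highest number - lowest number."""
    return max(scores) - min(scores)

def standard_deviation(scores):
    """Standard deviation by the six steps above (divides by n)."""
    n = len(scores)
    mean = sum(scores) / n                            # step 2
    squared_devs = [(x - mean) ** 2 for x in scores]  # steps 3 and 4
    return math.sqrt(sum(squared_devs) / n)           # steps 5 and 6

print(value_range([3, 1, 6, 12, 9, 8, 10, 15]))   # 14
print(standard_deviation([2, 3, 4, 5, 6, 7, 8]))  # 2.0
```

The same two functions can be used to check answers to the exercises above.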

Bar Charts, Frequency Distributions, and Histograms

Frequency Distributions, Bar Graphs, and Circle Graphs

The frequency of a particular event is the number of times that the event occurs. The relative frequency is the proportion of observed responses in the category.

Example: We asked the students what country their car is from (or whether they have no car) and made a tally of the answers. Then we computed the frequency and relative frequency of each category. The relative frequency is computed by dividing the frequency by the total number of respondents. The following table summarizes the results.

Country Frequency Relative Frequency

US 6 0.3

Japan 7 0.35

Europe 2 0.1

Korea 1 0.05

None 4 0.2

Total 20 1

Below is a bar graph for the car data. The height of each bar represents the frequency of its category, and the widths of the bars are always the same. When the bars are arranged in order of decreasing frequency, such a bar graph is called a Pareto chart.

We make a circle graph, often called a pie chart, of this data by placing wedges in the circle whose sizes are proportional to the frequencies. Below is a circle graph that shows this data.

To find the angle of each slice, we use the formula

$$\text{Angle} = \frac{\text{Frequency}}{\text{Total}} \times 360$$

For example, to find the angle for US cars we have

$$\text{Angle} = \frac{6}{20} \times 360 = 108 \text{ degrees}$$
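As a rough sketch, the relative frequencies and pie-chart angles for the car survey can be computed in a few lines of Python. The variable names are chosen for illustration only; the counts are the ones in the table above.

```python
# Frequencies from the car survey table
car_counts = {"US": 6, "Japan": 7, "Europe": 2, "Korea": 1, "None": 4}

total = sum(car_counts.values())

# Relative frequency = frequency / total; angle = (frequency / total) * 360
relative = {country: count / total for country, count in car_counts.items()}
angles = {country: count / total * 360 for country, count in car_counts.items()}

for country, count in car_counts.items():
    print(f"{country:7s} {count:2d} {relative[country]:.2f} {angles[country]:6.1f}")
```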

Histograms

Histograms are bar graphs whose vertical coordinate is the frequency count and whose horizontal coordinate corresponds to a numerical interval.

Example: The depth of clarity of Lake Tahoe was measured at several different places, with the results in inches as follows: 15.4, 16.7, 16.9, 17.0, 20.2, 25.3, 28.8, 29.1, 30.4, 34.5, 36.7, 39.1, 39.4, 39.6, 39.8, 40.1, 42.3, 43.5, 45.6, 45.9, 48.3, 48.5, 48.7, 49.0, 49.1, 49.3, 49.5, 50.1, 50.2, 52.3

We use a frequency distribution table with class intervals of length 5.

Class Interval   Frequency   Relative Frequency   Cumulative Relative Frequency
15 -< 20         4           0.129                0.129
20 -< 25         1           0.032                0.161
25 -< 30         3           0.097                0.258
30 -< 35         2           0.065                0.323
35 -< 40         6           0.194                0.516
40 -< 45         3           0.097                0.613
45 -< 50         9           0.290                0.903
50 -< 55         3           0.097                1.000
Total            31          1.000
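The relative and cumulative relative frequency columns follow mechanically from the frequency column. A minimal Python sketch, starting from the frequencies as printed in the table (the variable names are illustrative):

```python
from itertools import accumulate

# Class intervals and frequencies copied from the table
intervals = ["15-<20", "20-<25", "25-<30", "30-<35",
             "35-<40", "40-<45", "45-<50", "50-<55"]
freqs = [4, 1, 3, 2, 6, 3, 9, 3]

total = sum(freqs)
rel = [f / total for f in freqs]   # relative frequency of each class
cum = list(accumulate(rel))        # running total of the relative frequencies

for label, f, r, c in zip(intervals, freqs, rel, cum):
    print(f"{label:8s} {f:2d} {r:.3f} {c:.3f}")
```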

Below is the graph of the histogram

The Shape of a Histogram

A histogram is unimodal if there is one hump, bimodal if there are two humps, and multimodal if there are many humps. A histogram that is not symmetric is called skewed. If the upper tail is longer than the lower tail, it is positively skewed (skewed right). If the upper tail is shorter than the lower tail, it is negatively skewed (skewed left).

Unimodal, Symmetric, Nonskewed

Nonsymmetric, Skewed Right

Bimodal

Mean, Mode, Median, and Standard Deviation

The Mean and Mode

The sample mean is the average and is computed as the sum of all the observed outcomes from the sample divided by the total number of observations. We use $\bar{x}$ as the symbol for the sample mean. In math terms,

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $n$ is the sample size and the $x_i$ correspond to the observed values.

Example: Suppose you randomly sampled six acres in the Desolation Wilderness for a non-indigenous weed and came up with the following counts of this weed in this region: 34, 43, 81, 106, 106 and 115. We compute the sample mean by adding and dividing by the number of samples, 6.

(34 + 43 + 81 + 106 + 106 + 115) / 6 = 485 / 6 = 80.83

We can say that the sample mean of non-indigenous weed is 80.83. The mode of a set of data is the number with the highest frequency. In the above example 106 is the mode, since it occurs twice and the rest of the outcomes occur only once. The population mean is the average of the entire population and is usually impossible to compute. We use the Greek letter $\mu$ for the population mean.
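For readers who want to check the arithmetic, Python's standard `statistics` module computes both quantities directly. The variable names below are illustrative only.

```python
import statistics

# Weed counts from the six sampled acres
weed_counts = [34, 43, 81, 106, 106, 115]

sample_mean = statistics.mean(weed_counts)  # sum of the values divided by n
sample_mode = statistics.mode(weed_counts)  # most frequent value

print(round(sample_mean, 2), sample_mode)
```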

Median and Trimmed Mean

One problem with using the mean is that it often does not depict the typical outcome. If there is one outcome that is very far from the rest of the data, then the mean will be strongly affected by this outcome. Such an outcome is called an outlier. An alternative measure is the median. The median is the middle score. If we have an even number of observations, we take the average of the two middle scores. The median is better for describing the typical value. It is often used for income and home prices.

Example: Suppose you randomly selected 10 house prices in the South Lake Tahoe area. You are interested in the typical house price. In units of $100,000 the prices were

2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8

If we computed the mean, we would say that the average house price is $744,000. Although this number is correct, it does not reflect the price of available housing in South Lake Tahoe. A closer look at the data shows that the house valued at 40.8 x $100,000 = $4.08 million skews the data. Instead, we use the median. Since there is an even number of outcomes, we take the average of the middle two:

(3.7 + 4.1) / 2 = 3.9

The median house price is $390,000. This better reflects what house shoppers should expect to spend.

There is an alternative value that is also resistant to outliers. This is the trimmed mean, which is the mean computed after discarding the outliers, for example the top 5% and the bottom 5% of the data. We can use the trimmed mean if we are concerned with outliers skewing the data; however, the median is used more often since more people understand it.

Example: At a ski rental shop, data was collected on the number of rentals on each of ten consecutive Saturdays: 44, 50, 38, 96, 42, 47, 40, 39, 46, 50. To find the sample mean, add them and divide by 10:

(44 + 50 + 38 + 96 + 42 + 47 + 40 + 39 + 46 + 50) / 10 = 49.2

Notice that the mean value is not a value of the sample. To find the median, first sort the data:

38, 39, 40, 42, 44, 46, 47, 50, 50, 96

Notice that there are two middle numbers, 44 and 46. To find the median we take the average of the two.

Median = (44 + 46) / 2 = 45

Notice also that the mean is larger than all but three of the data points. The mean is influenced by outliers while the median is robust.
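The contrast between the mean and the median is easy to reproduce. The following sketch uses Python's `statistics` module on the two data sets above; the variable names are illustrative.

```python
import statistics

# House prices in units of $100,000; 40.8 is the outlier
prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]
mean_price = statistics.mean(prices)      # pulled upward by the outlier
median_price = statistics.median(prices)  # average of the two middle values

# Ski rentals on ten consecutive Saturdays
rentals = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
mean_rentals = statistics.mean(rentals)
median_rentals = statistics.median(rentals)

print(round(mean_price, 2), round(median_price, 2))
print(mean_rentals, median_rentals)
```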

Variance, Standard Deviation and Coefficient of Variation

The mean, mode, median, and trimmed mean do a nice job of telling where the center of the data set is, but often we are interested in more. For example, a pharmaceutical engineer develops a new drug that regulates sugar in the blood. Suppose she finds out that the average sugar content after taking the medication is the optimal level. This does not mean that the drug is effective. There is a possibility that half of the patients have dangerously low sugar content while the other half have dangerously high content. Instead of the drug being an effective regulator, it is a deadly poison. What the engineer needs is a measure of how far the data is spread apart. This is what the variance and standard deviation do. First we show the formulas for these measurements. Then we will go through the steps on how to use the formulas.

We define the variance to be

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

and the standard deviation to be

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Variance and Standard Deviation: Step by Step

1.) Calculate the mean, $\bar{x}$.
2.) Write a table that subtracts the mean from each observed value.
3.) Square each of the differences.
4.) Add this column.
5.) Divide by n - 1, where n is the number of items in the sample. This is the variance.
6.) To get the standard deviation, take the square root of the variance.

Example: The owner of the Ches Tahoe restaurant is interested in how much people spend at the restaurant. He examines 10 randomly selected receipts for parties of four and writes down the following data.

44, 50, 38, 96, 42, 47, 40, 39, 46, 50

He calculated the mean by adding and dividing by 10 to get $\bar{x} = 49.2$. Below is the table for getting the standard deviation:

x       x - 49.2    (x - 49.2)^2
44      -5.2        27.04
50      0.8         0.64
38      -11.2       125.44
96      46.8        2190.24
42      -7.2        51.84
47      -2.2        4.84
40      -9.2        84.64
39      -10.2       104.04
46      -3.2        10.24
50      0.8         0.64
Total               2599.60

Now

2599.6 / (10 - 1) = 288.8

Hence the variance is about 289 and the standard deviation is the square root of 289, which is 17. Since the standard deviation can be thought of as measuring how far the data values lie from the mean, we take the mean and move one standard deviation in either direction. The mean for this example was about 49.2 and the standard deviation was 17. We have:

49.2 - 17 = 32.2 and 49.2 + 17 = 66.2

What this means is that most of the patrons probably spend between $32.20 and $66.20.
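The table computation can be checked with a short Python sketch. It rebuilds the sum of squared deviations by hand and compares it with `statistics.variance` and `statistics.stdev`, which both use the n - 1 divisor. Variable names are illustrative.

```python
import statistics

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

mean = statistics.mean(receipts)

# Column of squared deviations, as in the step-by-step table
squared_devs = [(x - mean) ** 2 for x in receipts]
sum_of_squares = sum(squared_devs)

variance = sum_of_squares / (len(receipts) - 1)  # divide by n - 1
std_dev = variance ** 0.5

print(round(sum_of_squares, 2), round(variance, 1), round(std_dev, 1))

# The library functions give the same sample variance and standard deviation
print(statistics.variance(receipts), statistics.stdev(receipts))
```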

The sample standard deviation will be denoted by $s$ and the population standard deviation will be denoted by the Greek letter $\sigma$. The sample variance will be denoted by $s^2$ and the population variance will be denoted by $\sigma^2$. The variance and standard deviation describe how spread out the data is. If the data all lies close to the mean, then the standard deviation will be small, while if the data is spread out over a large range of values, $s$ will be large. Having outliers will increase the standard deviation.

One of the flaws of the standard deviation is that it depends on the units that are used. One way of handling this difficulty is the coefficient of variation, which is the standard deviation divided by the mean, times 100%:

$$CV = \frac{s}{\bar{x}} \times 100\%$$

In the above example, it is

$$\frac{17}{49.2} \times 100\% = 34.6\%$$

This tells us that the standard deviation of the restaurant bills is 34.6% of the mean.

Chebyshev's Theorem

A mathematician named Chebyshev came up with bounds on how much of the data must lie close to the mean. In particular, for any positive k, the proportion of the data that lies within k standard deviations of the mean is at least

$$1 - \frac{1}{k^2}$$

(The bound is only informative when k is greater than 1.) For example, if k = 2 this number is

$$1 - \frac{1}{2^2} = 0.75$$

This tells us that at least 75% of the data lies within two standard deviations of the mean. In the above example, we can say that at least 75% of the diners spent between 49.2 - 2(17) = 15.2 and 49.2 + 2(17) = 83.2 dollars.
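A small Python sketch of the bound and the matching interval, using the restaurant figures (mean 49.2, standard deviation 17). The function names are invented for this illustration.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    return 1 - 1 / k ** 2

def chebyshev_interval(mean, std_dev, k):
    """Interval extending k standard deviations on either side of the mean."""
    return mean - k * std_dev, mean + k * std_dev

print(chebyshev_bound(2))  # 0.75
low, high = chebyshev_interval(49.2, 17, 2)
print(round(low, 1), round(high, 1))
```

Trying k = 3 gives a bound of about 0.89, so at least 89% of any data set lies within three standard deviations of its mean.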

Mean and Standard Deviation for Grouped Data

Calculating the Mean from a Frequency Distribution

Since calculating the mean and standard deviation is tedious, we can save some of this work when we have a frequency distribution. Suppose we were interested in how many children are in statistics students' families. We come up with the frequency distribution table below.

Number of Children   1    2    3    4    5    6    7
Frequency            5    12   8    3    0    0    1

Notice that since there are 29 respondents, calculating the mean by listing every response would be very tedious. Instead, we see that there are five ones, 12 twos, 8 threes, 3 fours, and 1 seven. Hence the total count of children is

1(5) + 2(12) + 3(8) + 4(3) + 7(1) = 72

Now divide by the number of respondents to get the mean.

$$\bar{x} = \frac{72}{29} = 2.5$$

Extending the Frequency Distribution Table

Just as with the mean formula, there is an easier way to compute the standard deviation given a frequency distribution table. We extend the table as follows:

Number of Children (x)   Frequency (f)   xf    x^2 f
1                        5               5     5
2                        12              24    48
3                        8               24    72
4                        3               12    48
5                        0               0     0
6                        0               0     0
7                        1               7     49
Totals                   29              72    222

Next we calculate

$$SS_x = \sum x^2 f - \frac{(\sum xf)^2}{n} = 222 - \frac{(72)^2}{29} = 43.24$$

Now finally apply the formula

$$s = \sqrt{\frac{SS_x}{n - 1}}$$

to get

$$s = \sqrt{\frac{43.24}{28}} = 1.24$$
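The grouped-data shortcut can be sketched in Python as follows. It assumes the sample formula with n - 1 in the denominator, as defined earlier in these notes; the variable names are illustrative.

```python
import math

# Frequency distribution: number of children (x) and frequency (f)
children = [1, 2, 3, 4, 5, 6, 7]
freq = [5, 12, 8, 3, 0, 0, 1]

n = sum(freq)                                            # number of respondents
sum_xf = sum(x * f for x, f in zip(children, freq))      # total of the xf column
sum_x2f = sum(x * x * f for x, f in zip(children, freq)) # total of the x^2 f column

mean = sum_xf / n
ss_x = sum_x2f - sum_xf ** 2 / n   # sum of squared deviations
s = math.sqrt(ss_x / (n - 1))      # sample standard deviation

print(n, sum_xf, sum_x2f)
print(round(mean, 2), round(ss_x, 2), round(s, 2))
```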

Weighted Averages

Sometimes instead of the simple mean, we want to weight certain outcomes higher than others. For example, for your statistics class, the following point values are given:

Homework = 150
Midterm = 450
Project = 100
Final = 300

Suppose that you received an 88% on your homework, a 97% on your midterms, a 98% on your project and a 78% on your final. What is your average for your class?

To compute the weighted average, we use the formula

$$\text{Weighted Average} = \frac{\sum xw}{\sum w}$$

We have

$$\sum xw = .88(150) + .97(450) + .98(100) + .78(300) = 900.5$$

and

$$\sum w = 150 + 450 + 100 + 300 = 1000$$

Now divide to get your weighted average

$$\frac{900.5}{1000} = .9005$$

You squeaked by with an "A".
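The weighted average can be sketched in Python using the scores that appear in the computation (0.88, 0.97, 0.98 and 0.78). The dictionary keys are illustrative labels.

```python
# Point values (weights) for each category
weights = {"homework": 150, "midterm": 450, "project": 100, "final": 300}

# Scores as used in the worked computation
scores = {"homework": 0.88, "midterm": 0.97, "project": 0.98, "final": 0.78}

weighted_sum = sum(scores[k] * weights[k] for k in weights)  # sum of x * w
total_weight = sum(weights.values())                         # sum of w
weighted_average = weighted_sum / total_weight

print(round(weighted_sum, 1), total_weight, round(weighted_average, 4))
```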

Probability

For an experiment, we define an event to be any collection of possible outcomes. A simple event is an event that consists of exactly one outcome.

or: means the union, i.e. either can occur
and: means the intersection, i.e. both must occur

Two events are mutually exclusive if they cannot occur simultaneously. In a Venn diagram, we can tell that two events are mutually exclusive if their regions do not intersect.

When all simple events are equally likely, we define the probability of an event E to be

$$P(E) = \frac{\text{number of simple events within } E}{\text{total number of possible outcomes}}$$

We have the following:

P(E) is always between 0 and 1.
The sum of the probabilities of all simple events must be 1.
P(E) + P(not E) = 1
If E and F are mutually exclusive, then P(E or F) = P(E) + P(F)

The Difference Between And and Or

If E and F are events, then we use the terminology E and F to mean all outcomes that belong to both E and F. We use the terminology E or F to mean all outcomes that belong to either E or F.

Example: Below is an example of two sets, A and B, graphed in a Venn diagram.

The green area represents A and B while all areas with color represent A or B

Example: Our women's volleyball team is recruiting new members. Suppose that a person inquires about the team. Let E be the event that the person is female. Let F be the event that the person is a student. Then E and F represents the qualifications for being a member of the team. Note that E or F is not enough.

Definition of Conditional Probability

$$P(E|F) = \frac{P(E \text{ and } F)}{P(F)}$$

We read the left-hand side as "the probability of event E given event F". We call two events independent if

$$P(E|F) = P(E)$$

Equivalently, we can say that E and F are independent if

$$P(E \text{ and } F) = P(E)P(F)$$

Example: Consider rolling two dice. Let E be the event that the first die is a 3, and let F be the event that the sum of the dice is an 8. Then E and F means that we rolled a 3 and then we rolled a 5. This probability is 1/36, since there are 36 possible pairs and only one of them is (3,5).

We have P(E) = 1/6. Note that (2,6), (3,5), (4,4), (5,3), and (6,2) give F. Hence P(F) = 5/36.

We have P(E) P(F) = (1/6)(5/36) = 5/216, which is not 1/36. We can conclude that E and F are not independent.
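The dice example can be verified by listing all 36 outcomes. The sketch below uses exact fractions so that the comparison P(E and F) = P(E)P(F) is not blurred by rounding; the helper names are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability = outcomes in the event / total outcomes."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def first_is_3(r):
    return r[0] == 3

def sum_is_8(r):
    return r[0] + r[1] == 8

p_e = prob(first_is_3)
p_f = prob(sum_is_8)
p_e_and_f = prob(lambda r: first_is_3(r) and sum_is_8(r))

independent = (p_e_and_f == p_e * p_f)
print(p_e, p_f, p_e_and_f, p_e * p_f, independent)
```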

A Counting Rule

For two events, E and F, we always have

P(E or F) = P(E) + P(F) - P(E and F)

Example: Find the probability of selecting either a heart or a face card from a 52 card deck.

Solution: We let

E = the event that a heart is selected
F = the event that a face card is selected

Then P(E) = 1/4 and P(F) = 3/13 (Jack, Queen, or King out of 13 choices), and P(E and F) = 3/52. The formula gives

P(E or F) = 1/4 + 3/13 - 3/52 = 22/52 = 42%
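One way to convince yourself of the rule is to build the deck and count. In this Python sketch the card labels are illustrative; the count of cards that are hearts or face cards is compared with the value given by the formula.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

hearts = {card for card in deck if card[1] == "hearts"}
faces = {card for card in deck if card[0] in {"J", "Q", "K"}}

# Direct count of cards in E or F
p_direct = Fraction(len(hearts | faces), len(deck))

# P(E) + P(F) - P(E and F)
p_rule = (Fraction(len(hearts), len(deck))
          + Fraction(len(faces), len(deck))
          - Fraction(len(hearts & faces), len(deck)))

print(p_direct, p_rule, float(p_direct))
```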

Trees and Counting

Using Trees

We have seen that probability is defined by

$$P(E) = \frac{\text{Number in } E}{\text{Number in the Sample Space}}$$

Although this formula appears simple, counting the number in each can prove to be a challenge. Visual aids will help us immensely.

Example: A native flowering plant has several varieties. The color of the flower can be red, yellow, or white. The stems can be long or short, and the leaves can be thorny, smooth, or velvety. Show all varieties.

Solution: We use a tree diagram. A tree diagram is a diagram that branches out and ends in leaves that correspond to the final variety. The picture below shows this.

To read this tree diagram, we begin from the start, then move along the branches collecting words until we get to the end. For example,

Always taking the upper path leads to the selection of a red long thorny plant.

Always taking the lower path leads to a white short velvety plant. We can count the total number of leaves (path endings) and get that there are 18 possible varieties.

Counting the leaves that came from long stems tell us that there are 9 possible long stemmed varieties.
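The leaves of the tree can also be listed programmatically. The following Python sketch uses `itertools.product` to walk every branch; the list order (first option = upper branch) is an assumption made for illustration.

```python
from itertools import product

colors = ["red", "yellow", "white"]
stems = ["long", "short"]
leaves = ["thorny", "smooth", "velvety"]

# Each tuple is one path through the tree: (color, stem, leaf)
varieties = list(product(colors, stems, leaves))
long_stemmed = [v for v in varieties if v[1] == "long"]

print(len(varieties), len(long_stemmed))
print(varieties[0])   # always taking the first branch
print(varieties[-1])  # always taking the last branch
```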

ExampleA committee of three republican senators and four democratic senators is selected to investigate corporate securities fraud. Out of this committee two members are to be selected at random for a subcommittee on the energy sector.

A. What is the probability that both members will be republican?
B. What is the probability that both members will be democrat?
C. What is the probability of one of each?

Solution: We write a tree diagram.

In this tree diagram, D represents democrat and R represents republican. The probabilities are given in the diagram.

To answer part A, we need to find P(first is R and second is R). This corresponds to the bottom leaf. As we travel to the bottom leaf, we pick up the two numbers

P(R and R) = (3/7)(1/3) = 1/7

To answer part B, we need to find P(first is D and second is D). This corresponds to the top leaf. We have

P(D and D) = (4/7)(1/2) = 2/7

To answer part C, we add the two middle leaves

P((D and R) or (R and D)) = (4/7)(1/2) + (3/7)(2/3) = 2/7 + 2/7 = 4/7
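The three answers can be cross-checked without a tree by listing every possible pair of committee members. This Python sketch is an alternative counting approach rather than the tree method itself; the names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Three republicans and four democrats
committee = ["R"] * 3 + ["D"] * 4

# Every unordered pair of distinct members is equally likely
pairs = list(combinations(range(len(committee)), 2))

def prob(pattern):
    """Probability that the selected pair has the given party makeup."""
    target = sorted(pattern)
    hits = sum(1 for i, j in pairs
               if sorted((committee[i], committee[j])) == target)
    return Fraction(hits, len(pairs))

p_rr = prob("RR")
p_dd = prob("DD")
p_one_each = prob("DR")

print(p_rr, p_dd, p_one_each)
```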

Permutations

Example: Suppose that 40 women try out for the newest play, which has an all-women cast of seven. You are the director. How many choices do you have?

Solution: The way to work this problem out is to consider the main role first. You have 40 choices for the main role. For the lead supporting actor there are 39 left to select from. For the next role there are 38 to select from. Following this pattern, and considering that there are seven in the cast, gives a total number of choices of

40 · 39 · 38 · 37 · 36 · 35 · 34

We could multiply these all out; however, there is an easier way. We can write

$$40 \cdot 39 \cdot 38 \cdot 37 \cdot 36 \cdot 35 \cdot 34 \cdot \frac{33 \cdot 32 \cdot 31 \cdots 2 \cdot 1}{33 \cdot 32 \cdot 31 \cdots 2 \cdot 1} = \frac{40!}{(40 - 7)!}$$

This expression has a special notation. We write

$$P_{40,7} = \frac{40!}{(40 - 7)!} = 93{,}963{,}542{,}400$$

We can see that there are plenty of choices. In general we write

$$P_{n,r} = \frac{n!}{(n - r)!}$$

We call this a permutation.
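The count can be confirmed in Python, either with the factorial formula or with `math.perm` (available in Python 3.8 and later). The variable names are illustrative.

```python
import math

# Number of ways to fill 7 distinct roles from 40 actors
by_formula = math.factorial(40) // math.factorial(40 - 7)  # n! / (n - r)!
casting_choices = math.perm(40, 7)                         # same quantity

print(f"{casting_choices:,}")
```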

Combinations

Example: How many 5 card poker hands are there?

Solution: We can solve this in a similar way to the prior question. We are selecting 5 cards out of 52 total. Unfortunately this is not quite a permutation since, for example, the hand

2H 3H 4H 5H 6H

is the same as the hand

3H 5H 2H 6H 4H

where "H" means hearts. That is, the order in which the cards are dealt does not matter. The number of ways of ordering 5 cards is 5! (five choices for the first card, four left for the second, three left for the third, two for the fourth, and one for the fifth). We divide by this number to get our solution

$$\frac{52!}{(52 - 5)! \, 5!}$$

We write this with the notation

$$C_{52,5} = 2{,}598{,}960$$

In general, we have

$$C_{n,k} = \frac{n!}{(n - k)! \, k!}$$

and call this a combination.

Example: The following was taken from the California state lottery web site:

"SuperLottoPlus is your chance to win millions of dollars! The jackpot ranges from $7 million to $50 million or more. The jackpot rolls over and grows whenever there is no winner. All you have to do is pick five numbers from 1 to 47 and one MEGA number from 1 to 27 and match them to the numbers drawn by the Lottery every Wednesday and Saturday."

What is the probability of winning the lottery?

Solution: There is only one element in the event space: your numbers. For the sample space, first they pick 5 numbers from 47. There are

$$C_{47,5} = 1{,}533{,}939$$

ways of doing this. Next they select a number from 1 to 27. There are 27 ways of doing this. We multiply to get

1,533,939 x 27 = 41,416,353

So your chances are worse than one in forty million.
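Both combination counts can be checked with Python's `math.comb` (Python 3.8 and later). The variable names are illustrative.

```python
import math

poker_hands = math.comb(52, 5)            # 5 cards chosen from 52, order ignored

five_number_picks = math.comb(47, 5)      # five numbers from 1 to 47
lottery_outcomes = five_number_picks * 27 # times 27 possible MEGA numbers

print(f"{poker_hands:,}")
print(f"{five_number_picks:,}", f"{lottery_outcomes:,}")
print(1 / lottery_outcomes)               # probability of winning with one ticket
```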

Probability Distributions

Random Variables

A variable whose value depends upon a chance experiment is called a random variable. Suppose that a person is asked who that person is closest to: their mother or their father. The random variable of this experiment is the Boolean variable whose possibilities are {Mother, Father}.

A continuous random variable is a variable whose possible outcomes are part of a continuous data set. For example, the random variable that represents the height of the next person who walks in the room is a continuous random variable, while the random variable that represents the number rolled on a six sided die is not a continuous random variable. A random variable that is not continuous is called a discrete random variable.

Probability Distributions

Example: Suppose we toss two dice. We will make a table of the probabilities for the sum of the dice. The possibilities are 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.

Probability Distribution Table

x      2      3      4      5      6      7      8      9      10     11     12
P(x)   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
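The table can be generated by counting how many of the 36 equally likely rolls produce each sum. A minimal Python sketch, with illustrative names:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count the number of (die 1, die 2) pairs that give each sum
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability of each sum, kept as an exact fraction
distribution = {total: Fraction(count, 36) for total, count in sorted(counts.items())}

for total, count in sorted(counts.items()):
    print(f"{total:2d}  {count}/36")
```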

Exercise: Suppose that you buy a raffle ticket for $5. If 1,000 tickets are sold and there are 10 third place winners of $25, three second place winners of $100 and 1 grand prize winner of $2,000, construct a probability distribution table. Do not forget that if you have the $25 ticket, you will have won $20.

Expected Value (Mean)

Example: Insurance

When we buy insurance in blackjack, we lose the insurance bet if the dealer does not have blackjack and win twice the bet if the dealer does have blackjack. Suppose you have $20 wagered, you hold a king and a 9, and the dealer shows an ace. Should you buy insurance for $10?

Solution: We construct a probability distribution table.

x      P(x)
-10    34/49
20     15/49

(There are 49 cards that haven't been seen; 15 of them are tens, jacks, queens, or kings, and the other 34 are not.)

We define the expected value by

$$\mu = \sum x P(x)$$

We calculate:

-10(34/49) + 20(15/49) = -40/49

Hence the expected value is negative, so we should not buy insurance.

What if I am playing with my wife? My cards are a 2 and a 6, and my wife's are a 7 and a 4. Should I buy insurance? We have:

x      P(x)
-10    31/47
20     16/47

We calculate:

-10(31/47) + 20(16/47) = 10/47 = 0.21

Hence my expected value is positive, so I should buy insurance.

Standard Deviation

We compute the standard deviation for a probability distribution function the same way that we compute the standard deviation for a sample, except that after squaring $x - \mu$, we multiply by P(x). Also, we do not need to divide by n - 1. Consider the second insurance example:

x      P(x)     x - mu     (x - mu)^2
-10    31/47    -10.21     104
20     16/47    19.79      392

Hence the variance is

104(31/47) + 392(16/47) = 202

and the standard deviation is the square root, that is, 14.2.
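The expected value and standard deviation of a small probability table can be computed with two short functions. This Python sketch represents each table as a list of (x, P(x)) pairs with exact fractions; the function and variable names are illustrative.

```python
import math
from fractions import Fraction

def expected_value(dist):
    """Sum of x * P(x) over the table."""
    return sum(x * p for x, p in dist)

def variance(dist):
    """Sum of (x - mu)^2 * P(x) over the table."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist)

first_hand = [(-10, Fraction(34, 49)), (20, Fraction(15, 49))]
second_hand = [(-10, Fraction(31, 47)), (20, Fraction(16, 47))]

ev_first = expected_value(first_hand)
ev_second = expected_value(second_hand)
var_second = variance(second_hand)
sd_second = math.sqrt(var_second)

print(ev_first, ev_second)
print(round(float(var_second), 1), round(sd_second, 1))
```

Because the sketch does not round $x - \mu$ before squaring, its variance differs from the hand computation only in the decimal places.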

Combining Distributions

If we have two distributions with independent random variables x and y, and if a and b are constants, then for L = a + bx and W = ax + by we have

$$\mu_L = a + b\mu_x$$
$$\sigma_L^2 = b^2 \sigma_x^2$$
$$\sigma_L = |b| \sigma_x$$
$$\mu_W = a\mu_x + b\mu_y$$
$$\sigma_W^2 = a^2 \sigma_x^2 + b^2 \sigma_y^2$$

Example: Gamblers who played both blackjack and craps were studied, and it was found that the average amount of blackjack playing per weekend was 7 hours with a standard deviation of 3 hours. The average amount of craps play was 4 hours with a standard deviation of 2 hours.

A. What is the mean and standard deviation for the total amount of gaming?

Solution: Here a and b are 1 and 1. The mean is just 7 + 4 = 11 and the standard deviation is just

$$\sqrt{3^2 + 2^2} = \sqrt{13} \approx 3.6$$

B. If each player spends about $100 per hour on blackjack and $200 per hour on craps, what will be the mean and standard deviation for the amount of money that the casino wins per person?

Solution: Here a and b are 100 and 200. The mean is 100(7) + 200(4) = 1,500 and the standard deviation is

$$\sqrt{100^2 \cdot 3^2 + 200^2 \cdot 2^2} = \sqrt{250{,}000} = 500$$

C. If the players spend $150 on the hotel, find the mean and standard deviation of the total amount of money that the players spend.

Solution: Here L = 150 + x, where x is the result from part B. Hence the mean is 150 + 1500 = 1,650 and the standard deviation is the same as in part B, since the coefficient of x is 1.
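The rules for W = ax + by can be wrapped in one small function. The Python sketch below assumes, as the rules require, that the hours of blackjack and craps are independent; the function name is invented for this illustration.

```python
import math

def combine(a, mu_x, sd_x, b, mu_y, sd_y):
    """Mean and standard deviation of W = a*x + b*y for independent x and y."""
    mean = a * mu_x + b * mu_y
    sd = math.sqrt(a ** 2 * sd_x ** 2 + b ** 2 * sd_y ** 2)
    return mean, sd

# Part A: total hours of play (a = b = 1)
hours_mean, hours_sd = combine(1, 7, 3, 1, 4, 2)

# Part B: money spent at $100 and $200 per hour
money_mean, money_sd = combine(100, 7, 3, 200, 4, 2)

# Part C: adding a constant $150 shifts the mean but not the spread
total_mean, total_sd = 150 + money_mean, money_sd

print(hours_mean, round(hours_sd, 2))
print(money_mean, money_sd)
print(total_mean, total_sd)
```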

The Binomial Distribution

There is a type of distribution that occurs so frequently that it has a special name. We call a distribution a binomial distribution if all of the following are true:

There are a fixed number of trials, n, which are all independent.

The outcomes are Boolean, such as true or false, yes or no, success or failure.

The probability of success is the same for each trial.

For a binomial distribution with n trials, probability of success p, and probability of failure q = 1 - p, we have

$$P(r \text{ successes}) = C_{n,r} \, p^r q^{n-r}$$

Example: Suppose that each time you take a free throw shot, you have a 25% chance of making it. You take 15 shots.

A. What is the probability of making exactly 5 of them?

Solution: We have n = 15, r = 5, p = .25, q = .75. Compute

$$C_{15,5} \, (.25)^5 (.75)^{10} = 0.165$$

There is a 16.5 percent chance of making exactly 5 shots.

B. What is the probability of making fewer than 3 shots?

Solution: The possible outcomes that will make this happen are 2 shots, 1 shot, and 0 shots. Since these are mutually exclusive, we can add these probabilities.

$$C_{15,2} \, (.25)^2 (.75)^{13} + C_{15,1} \, (.25)^1 (.75)^{14} + C_{15,0} \, (.25)^0 (.75)^{15} = .156 + .067 + .013 = 0.236$$

There is a 24 percent chance of sinking fewer than 3 shots.
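The binomial formula translates almost word for word into Python. In this sketch the function name `binomial_pmf` is an illustrative choice, and `math.comb` supplies the combination count.

```python
import math

def binomial_pmf(r, n, p):
    """P(r successes in n trials) = C(n, r) * p^r * q^(n - r), with q = 1 - p."""
    q = 1 - p
    return math.comb(n, r) * p ** r * q ** (n - r)

# Part A: exactly 5 of 15 free throws at a 25% success rate
p_exactly_5 = binomial_pmf(5, 15, 0.25)

# Part B: fewer than 3, i.e. 0, 1, or 2 made shots
p_fewer_than_3 = sum(binomial_pmf(r, 15, 0.25) for r in range(3))

print(round(p_exactly_5, 3), round(p_fewer_than_3, 3))
```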

The Normal Distribution

There is a special distribution that we will use just about every day for the next month. It is a distribution for a continuous random variable that has the following properties:

1.) It is symmetric about the mean.
2.) It approaches the horizontal axis on both the left and right side without touching; that is, the x-axis is an asymptote.
3.) It is bell shaped, with transition points one standard deviation from the mean.
4.) Approximately 68% of the data points lie within one standard deviation of the mean.
5.) Approximately 95% of the data points lie within two standard deviations of the mean.
6.) Approximately 99.7% of the data points lie within three standard deviations of the mean.

You can play with the graphs by going to http://www-stat.stanford.edu/~naras/jsm/NormalDensity/NormalDensity.html

Example: You are the manager at a new toy store and want to determine how many Monopoly games to stock in your store. The mean number of Monopoly games that sell per month is 22 with a standard deviation of 6. Assume that this distribution is normal.

A. What is the probability that next month you will sell between 10 and 34 games?

Solution: We notice that 10 = 22 - 2(6) and 34 = 22 + 2(6). We want to know the probability that the outcome lies within two standard deviations of the mean. Property 5 says that this is about 95%.

B. If you stock 45 games, should you feel secure about not running out?

Solution: Since three standard deviations above the mean is 22 + 3(6) = 40 and 45 is above that, there is a less than 0.3% chance of running out. You should feel very secure.
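The rule-of-thumb answers can be compared with exact normal probabilities. This Python sketch uses `statistics.NormalDist` from the standard library (Python 3.8 and later); treating monthly sales as exactly normal is the example's own assumption.

```python
from statistics import NormalDist

# Monthly Monopoly sales: mean 22, standard deviation 6
sales = NormalDist(mu=22, sigma=6)

# Part A: probability of selling between 10 and 34 games
p_between = sales.cdf(34) - sales.cdf(10)

# Part B: probability that demand exceeds a stock of 45 games
p_run_out = 1 - sales.cdf(45)

print(round(p_between, 4))
print(p_run_out)
```

The exact value for part A is slightly above 95%, which is why the property is stated as "approximately 95%".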

Control Charts

We often want to determine if things are beginning to stray from the norm as time goes on.

Example: It has been determined that the mean number of errors that medical staff at a hospital makes is 0.002 per hour with a standard deviation of 0.0003. The medical board wanted to determine if long working hours were related to mistakes. During the day, the medical staff was observed to see when they made mistakes. The table illustrates the findings.

Hours Worked        1        2        3        4        5        6        7        8        9        10
Mistakes per Hour   0.0019   0.0022   0.0015   0.0017   0.0020   0.0022   0.0018   0.0028   0.0019   0.0027

It is difficult to see a trend from just looking at the table. Instead, we will create a chart that better illustrates the trends. We call the system out of control if at least one of the following three events occurs:

Out of Control Signal 1: At least one point falls beyond the $3\sigma$ level.
Out of Control Signal 2: A run of nine consecutive points is on the same side of the center line (usually the mean).
Out of Control Signal 3: At least two of three consecutive points lie beyond the $2\sigma$ level on the same side of the center line (usually the mean).

For our example we have

$\mu + \sigma$ = 0.002 + 0.0003 = 0.0023
$\mu - \sigma$ = 0.002 - 0.0003 = 0.0017
$\mu + 2\sigma$ = 0.002 + 0.0006 = 0.0026
$\mu - 2\sigma$ = 0.002 - 0.0006 = 0.0014
$\mu + 3\sigma$ = 0.002 + 0.0009 = 0.0029
$\mu - 3\sigma$ = 0.002 - 0.0009 = 0.0011

We now graph the points on a control chart.

We can see that two of the last three data points lie more than two standard deviations above the mean. This gives out of control warning signal 3. The information should make the hospital administration wary about long hours.
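The three out-of-control signals can be checked mechanically. The Python sketch below is one possible reading of the rules as stated here; the function names are invented, and a point exactly on the center line is treated as being on neither side.

```python
MEAN, SD = 0.002, 0.0003
mistakes = [0.0019, 0.0022, 0.0015, 0.0017, 0.0020,
            0.0022, 0.0018, 0.0028, 0.0019, 0.0027]

def side_beyond(x, k):
    """+1 if x is more than k SDs above the mean, -1 if below, otherwise 0."""
    if x > MEAN + k * SD:
        return 1
    if x < MEAN - k * SD:
        return -1
    return 0

def signal_1(data):
    """At least one point beyond the 3-sigma level."""
    return any(side_beyond(x, 3) != 0 for x in data)

def signal_2(data, run=9):
    """A run of nine consecutive points on the same side of the center line."""
    sides = [side_beyond(x, 0) for x in data]
    return any(abs(sum(sides[i:i + run])) == run
               for i in range(len(sides) - run + 1))

def signal_3(data):
    """At least two of three consecutive points beyond 2 sigma on the same side."""
    for i in range(len(data) - 2):
        flags = [side_beyond(x, 2) for x in data[i:i + 3]]
        if flags.count(1) >= 2 or flags.count(-1) >= 2:
            return True
    return False

print(signal_1(mistakes), signal_2(mistakes), signal_3(mistakes))
```

For the hospital data only the third signal fires, triggered by hours 8 and 10.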

Definition of the Standard Normal Distribution

The Standard Normal distribution follows a normal distribution and has mean 0 and standard deviation 1

Notice that the distribution is perfectly symmetric about 0. If a distribution is normal but not standard, we can convert a value so that it can be looked up in the standard normal distribution table by first finding how many standard deviations away the number is from the mean.

The z-score

The number of standard deviations from the mean is called the z-score and can be found by the formula

$$z = \frac{x - \mu}{\sigma}$$

Example: Find the z-score corresponding to a raw score of 132 from a normal distribution with mean 100 and standard deviation 15.

Solution: We compute

$$z = \frac{132 - 100}{15} = 2.133$$

Example: A z-score of 1.7 was found from an observation coming from a normal distribution with mean 14 and standard deviation 3. Find the raw score.

Solution: We have

$$1.7 = \frac{x - 14}{3}$$

To solve this we just multiply both sides by the denominator 3:

(1.7)(3) = x - 14
5.1 = x - 14
x = 19.1
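Both directions of the conversion fit in two one-line functions. A minimal Python sketch with illustrative function names:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def raw_score(z, mean, sd):
    """Invert the z-score formula: x = mean + z * sd."""
    return mean + z * sd

print(round(z_score(132, 100, 15), 3))
print(round(raw_score(1.7, 14, 3), 1))
```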

The z-score and Area Often we want to find the probability that a z-score will be less than a given value, greater than a given value, or in between two values. To accomplish this, we use the table from the textbook and a few properties about the normal distribution. Example

Find P(z < 2.37) Solution We use the table. Notice the picture on the table has shaded region corresponding to the area to the left (below) a z-score. This is exactly what we want. Below are a few lines of the table.

z     .00     .01     .02     .03     .04     .05     .06     .07     .08     .09
2.2   .9861   .9864   .9868   .9871   .9875   .9878   .9881   .9884   .9887   .9890
2.3   .9893   .9896   .9898   .9901   .9904   .9906   .9909   .9911   .9913   .9916
2.4   .9918   .9920   .9922   .9925   .9927   .9929   .9931   .9932   .9934   .9936

The rows correspond to the ones and tenths digits of the z-score and the columns correspond to the hundredths digit. For our problem we want the row 2.3 (from 2.37) and the column .07 (from 2.37). The number in the table that matches this is .9911. Hence P(z < 2.37) = .9911

Example

Find P(z > 1.82) Solution In this case, we want the area to the right of 1.82. This is not what is given in the table. We can use the identity P(z > 1.82) = 1 - P(z < 1.82) reading the table gives P(z < 1.82) = .9656 Our answer is P(z > 1.82) = 1 - .9656 = .0344

Example

Find P(-1.18 < z < 2.1) Solution Once again, the table does not exactly handle this type of area. However, the area between -1.18 and 2.1 is equal to the area to the left of 2.1 minus the area to the left of -1.18. That is P(-1.18 < z < 2.1) = P(z < 2.1) - P(z < -1.18) To find P(z < 2.1) we rewrite it as P(z < 2.10) and use the table to get P(z < 2.10) = .9821. The table also tells us that P(z < -1.18) = .1190 Now subtract to get P(-1.18 < z < 2.1) = .9821 - .1190 = .8631
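The three kinds of area (left of a value, right of a value, and between two values) can be reproduced without a printed table. This Python sketch uses `statistics.NormalDist`, whose default is the standard normal distribution with mean 0 and standard deviation 1; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_less = Z.cdf(2.37)                    # area to the left of 2.37
p_greater = 1 - Z.cdf(1.82)             # area to the right of 1.82
p_between = Z.cdf(2.10) - Z.cdf(-1.18)  # area between -1.18 and 2.10

print(round(p_less, 4), round(p_greater, 4), round(p_between, 4))
```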

Area Under the Normal Curve and the Binomial Distribution

The General Normal Distribution and Area

Typically the probability distribution does not follow the standard normal distribution, but does follow a general normal distribution. When this is the case, we compute the z-score first to convert it into a standard normal distribution. Then we can use the table.

Example: The Tahoe Natural Coffee Shop morning customer load follows a normal distribution with mean 45 and standard deviation 8. Determine the probability that the number of customers tomorrow will be less than 42.

Solution: We first convert the raw score to a z-score. We have

$$z = \frac{42 - 45}{8} = -0.375$$

Next, we use the table to find the probability. The table gives .3520. (We have rounded the z-score to -0.38.) We can conclude that

P(x < 42) = .352

That is, there is about a 35% chance that there will be fewer than 42 customers tomorrow.

Example A study was done to determine the stress levels that students have while taking exams. The stress level was found to be normally distributed with a mean stress level of 8.2 and a standard deviation of 1.34. What is the probability that at your next exam, you will have a stress level between 9 and 10? Solution We want P(9 < x < 10). We compute the z-scores for each of these:

z9 = (9 - 8.2)/1.34 = 0.60

z10 = (10 - 8.2)/1.34 = 1.34

Now we want P(0.60 < z < 1.34). This is the "in between" type, hence we subtract: P(0.60 < z < 1.34) = P(z < 1.34) - P(z < 0.60). We use the table to get P(0.60 < z < 1.34) = .9099 - .7257 = .1842. We conclude that there is about an 18 percent chance that the stress level will be between nine and ten.
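The convert-to-z-then-look-up procedure can also be sketched in code. This is only an illustration, assuming Python 3.8+ and its statistics.NormalDist class, which handles the conversion to a z-score internally; the variable names are made up for this sketch.

```python
from statistics import NormalDist

# Coffee shop example: mean 45 customers, standard deviation 8.
coffee = NormalDist(mu=45, sigma=8)
p_fewer_than_42 = coffee.cdf(42)

# Exam stress example: mean 8.2, standard deviation 1.34.
stress = NormalDist(mu=8.2, sigma=1.34)
p_between_9_and_10 = stress.cdf(10) - stress.cdf(9)

print(round(p_fewer_than_42, 3))
print(round(p_between_9_and_10, 3))
```

The results differ slightly from the table answers (.352 and .1842) because the table method rounds each z-score to two decimal places first.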

Example Suppose that your wife is pregnant and due in 100 days. Suppose that the number of days until the birth is approximately normally distributed with mean 100 and standard deviation 8. You have a business trip and will return in 85 days, and you have to leave on another business trip in 107 days.

What is the probability that the birth will occur before your second trip?

What is the probability that the birth will occur after you return from your first business trip?

What is the probability that you will be there for the birth?

You are able to cancel your second business trip, and your boss tells you that you can return home from your first trip so that there is a 99% chance that you will make it back for the birth. When must you return home?

Solution:

We want P(x < 107). We compute the z-score:

z = (107 - 100)/8 = 0.875

Rounding to 0.88, we compute P(z < .88).

The table on the inside front cover gives us P(z < .88) = .8106. Hence there is about an 81% chance that the baby will be born before the second business trip.

We want P(x > 85). We compute the z-score:

z = (85 - 100)/8 = -1.875

Rounding to -1.88, we compute P(z > -1.88). The table on the inside front cover gives us P(z < -1.88) = .0301. We want the complement of this area, hence P(z > -1.88) = 1 - .0301 = .9699. Hence there is about a 97% chance that the baby will be born after the first business trip.

Now we want P(85 < x < 107). We see from the picture that this is the middle region. We have P(85 < x < 107) = P(x < 107) - P(x < 85). We have already computed these: P(x < 107) is about 81% and P(x < 85) = 1 - P(x > 85) is about 3%. Hence P(85 < x < 107) = 81% - 3% = 78%. There is about a 78% chance that you will make it to the birth.

This problem asks us to work out the math backwards. We are given the probability and we want the raw score. First, we realize that if there is a 99% chance that we will make it on time, then there is a 1% chance that we will not. Next, we use the table in reverse. That is, we seek a z-score that gives .01 as the probability.

z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
-2.4   .0082  .0080  .0078  .0075  .0073  .0071  .0069  .0068  .0066  .0064
-2.3   .0107  .0104  .0102  .0099  .0096  .0094  .0091  .0089  .0087  .0084
-2.2   .0139  .0136  .0132  .0129  .0125  .0122  .0119  .0116  .0113  .0110

We search for the probability value that is closest to .01 and find .0102 and .0099. Since .0099 is the closest to .01, we use this value. The corresponding z-score is -2.33. Now we find the x that produces this z. We have

-2.33 = (x - 100)/8

Multiply both sides by 8 to get -18.64 = x - 100. Add 100 to both sides to get x = 81.36. We must return from our business trip within 81 days.
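Reading the table in reverse corresponds to evaluating the inverse of the cumulative distribution. A minimal sketch, assuming Python 3.8+ and the inv_cdf method of statistics.NormalDist (the variable names are illustrative):

```python
from statistics import NormalDist

# z-score whose left-tail area is .01 (the "table in reverse" step).
z_cut = NormalDist().inv_cdf(0.01)

# Same question asked directly of the birth-date distribution
# (mean 100 days, standard deviation 8 days).
birth = NormalDist(mu=100, sigma=8)
return_day = birth.inv_cdf(0.01)

print(round(z_cut, 2))
print(round(return_day, 2))
```

The unrounded z-score is slightly different from the table value -2.33, so the computed day differs from 81.36 in the second decimal place; the conclusion of returning within 81 days is the same.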

Using the Normal Distribution to Approximate the Binomial Distribution

The Binomial Distribution is easy to calculate as long as we only need a few values. However, if we need many values, the computation can be extremely tedious.

Example

Suppose you throw a die 1000 times. What is the probability of having it roll a 6 fewer than 160 times?

Solution

The horrible way of figuring this out is to calculate C(1000, r) (1/6)^r (5/6)^(1000 - r) for every r between 0 and 159 and add the results. We have better things to do with our time than do this. Instead we will approximate the answer. As you may have already guessed, this distribution is very close to being normal. We give the following theorem.

Theorem: Normal Approximation to the Binomial Distribution

If a binomial distribution with probability of success p, probability of failure q, and n trials is such that

np > 5 and nq > 5

then the distribution can be approximated by a normal distribution with mean

μ = np

and standard deviation

σ = √(npq)

Now we can continue with our example. We have np = (1000)(1/6) = 166.67 > 5 and nq = (1000)(5/6) = 833.33 > 5. Thus we can use the normal distribution. We have μ = np = (1000)(1/6) = 166.67 and npq = (1000)(1/6)(5/6) = 138.89. Taking a square root gives σ = 11.79. Now we can compute the z-score, since we want P(x < 160). We have

z = (160 - 166.67)/11.79 = -0.57

Now we use the table to find the probability. We get .2843. Thus there is about a 28% chance that we will roll a six fewer than 160 times.
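With a computer, the "horrible way" is no longer horrible, so the approximation can be compared against the exact binomial sum. The sketch below assumes Python 3.8+ (for math.comb and statistics.NormalDist); it uses exact integer arithmetic for the sum, and the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 1000, 1 / 6
q = 1 - p

# Exact answer: sum of C(1000, r) (1/6)^r (5/6)^(1000 - r) for r = 0..159.
# Each term equals C(1000, r) * 5^(1000 - r) / 6^1000, so the sum is done in
# integers and divided once at the end.
numerator = sum(comb(n, r) * 5 ** (n - r) for r in range(160))
exact = numerator / 6 ** n

# Normal approximation from the theorem (no continuity correction).
mu = n * p
sigma = sqrt(n * p * q)
approx = NormalDist(mu, sigma).cdf(160)

print(round(exact, 4), round(approx, 4))
```

The approximate value is close to the table answer .2843; the remaining gap to the exact sum is what the continuity correction is meant to reduce.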

Continuity Correction

We can achieve a slightly more accurate approximation with what is called the continuity correction. We looked at P(x < 160). However, because the number of sixes is a whole number, this is the same as P(x ≤ 159), or P(x < 159.5), or any such fraction between 159 and 160. When using the normal distribution to approximate the binomial distribution, we correct by this .5 value, placing the cutoff halfway between the two whole numbers.

Example

Each year a squirrel has a 35% chance of surviving the winter. Suppose in a patch of land there are 200 squirrels. What is the probability that between 65 and 80 of these squirrels will survive the winter?

Solution

We first check np = (200)(.35) = 70 > 5 and nq = (200)(.65) = 130 > 5. Thus we can use the normal curve approximation. We have μ = np = (200)(.35) = 70 and npq = (200)(.35)(.65) = 45.5. Taking a square root gives σ = 6.7. Instead of using P(65 ≤ x ≤ 80), we use the continuity correction and find P(64.5 < x < 80.5). We compute the two z-scores:

z = (64.5 - 70)/6.7 = -.82

and

z = (80.5 - 70)/6.7 = 1.57

Now we use the table to find the probabilities. We get .2061 and .9418. Since we want the middle area, we subtract these: .9418 - .2061 = .7357. Thus there is about a 74% chance that there will be between 65 and 80 surviving squirrels.
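The steps of the squirrel example (check np and nq, compute the mean and standard deviation, widen the interval by .5 on each side) can be sketched as a small function. This is an illustration only, assuming Python 3.8+; the function name normal_approx_between is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_between(n, p, low, high):
    """Approximate P(low <= r <= high) for a binomial count r,
    using the normal curve with the continuity correction."""
    q = 1 - p
    if n * p <= 5 or n * q <= 5:
        raise ValueError("np and nq must both exceed 5")
    dist = NormalDist(n * p, sqrt(n * p * q))
    return dist.cdf(high + 0.5) - dist.cdf(low - 0.5)


squirrels = normal_approx_between(200, 0.35, 65, 80)
print(round(squirrels, 4))
```

The result is within about .01 of the table answer .7357; the difference comes from rounding σ to 6.7 and the z-scores to two decimals in the hand calculation.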

Homework Handout for Using the Normal Distribution to Approximate the Binomial Distribution

1. The distributions below are binomial with number of trials n and probability of success p. Determine which of the distributions can be approximated by the normal distribution. If it can be approximated by the normal distribution, give the mean and standard deviation of the normal approximation.

   a. n = 75, p = 0.8
   b. n = 30, p = 0.9
   c. n = 50, p = 0.06

2. The distributions below are binomial with number of trials n and probability of success p. Determine which of the distributions can be approximated by the normal distribution. If it can be approximated by the normal distribution, give the mean and standard deviation of the normal approximation.

   a. n = 100, p = 0.3
   b. n = 20, p = 0.1
   c. n = 24, p = 0.95
   d. n = 22, p = 0.48

3. For each of the following probability statements involving the binomial distribution and variable r number of successes, write down the corresponding probability statement that uses the normal distribution variable X. Make sure you use the continuity correction.

   a. P(r < 18)
   b. P(r > 59)
   c. P(r ≤ 22)
   d. P(r ≥ 37)
   e. P(11 ≤ r ≤ 19)

4. For each of the following probability statements involving the binomial distribution and variable r number of successes, write down the corresponding probability statement that uses the normal distribution variable X. Make sure you use the continuity correction.

   a. P(r < 25)
   b. P(r > 17)
   c. P(r < 80)
   d. P(r > 35)
   e. P(35 < r < 45)

5. Each of the probability statements involves the binomial distribution. Use the continuity correction to write a probability statement that uses the normal distribution.

   a. The probability that more than 29 of the students woke up before 8:00 AM.
   b. The probability that at least 22 of the cars are hybrid vehicles.
   c. The probability that fewer than 41 of the customers returned the next week.
   d. The probability that at most 17 of the women wear makeup.
   e. The probability that between 14 and 26 of the cats fall on their feet.

6. Each of the probability statements involves the binomial distribution. Use the continuity correction to write a probability statement that uses the normal distribution.

   a. The probability that the number of yes votes is more than 18.
   b. The probability that the number of people with blue eyes is at least 22.
   c. The probability that fewer than 50 of the rodents survive.
   d. The probability that at most 32 of the penguins jump into the water.
   e. The probability that between 21 and 28 of the students pass the exam.

For exercises 7 through 9, use the normal distribution to approximate the binomial distribution.

7. Suppose that 20% of all college students are vegetarians. If 80 students are randomly selected, what is the probability that

   a. fewer than 13 of them are vegetarians?
   b. more than 14 of them are vegetarians?
   c. at least 20 of them are vegetarians?
   d. at most 17 of them are vegetarians?
   e. between 13 and 16 of them are vegetarians?

   Explain why it was ok to use the normal approximation to the binomial distribution for these calculations.

8. Suppose that 84% of all college students travel during winter break. If 45 students are randomly selected, what is the probability that

   a. fewer than 40 of them travel during winter break?
   b. more than 35 of them travel during winter break?
   c. at least 38 of them travel during winter break?
   d. at most 32 of them travel during winter break?
   e. between 37 and 41 of them travel during winter break?

   Explain why it was ok to use the normal approximation to the binomial distribution for these calculations.

9. Only 7% of all people who receive CPR survive. If a researcher randomly studied 120 cases of people who received CPR, what is the probability that

   a. fewer than 12 of them survived?
   b. more than 7 of them survived?
   c. at least 10 of them survived?
   d. at most 9 of them survived?
   e. between 8 and 13 of them survived?

Solution to Problem 1

The distributions below are binomial with number of trials n and probability of success p. Determine which of the distributions can be approximated by the normal distribution. If it can be approximated by the normal distribution, give the mean and standard deviation of the normal approximation.

n = 75, p = 0.8

Since np = (75)(0.8) = 60 and nq = (75)(1 - 0.8) = 15 are both greater than 5, the binomial distribution can be approximated by the normal distribution. The normal approximation has mean μ = np = 60 and standard deviation σ = √(npq) = √((75)(0.8)(0.2)) = √12 ≈ 3.46.

n = 30, p = 0.9

Since np = (30)(0.9) = 27 and nq = (30)(1 - 0.9) = 3 are not both greater than 5, the binomial distribution cannot be approximated by the normal distribution.

n = 50, p = 0.06

Since np = (50)(0.06) = 3 and nq = (50)(1 - 0.06) = 47 are not both greater than 5, the binomial distribution cannot be approximated by the normal distribution.

Solution to Problem 3

For each of the following probability statements involving the binomial distribution and variable r number of successes, write down the corresponding probability statement that uses the normal distribution variable X. Make sure you use the continuity correction.

P(r < 18)

Since r < 18 and r ≤ 17 are the same for this binomial variable, the continuity correction gives P(x < 17.5).

P(r > 59)

Since r > 59 and r ≥ 60 are the same for this binomial variable, the continuity correction gives P(x > 59.5).

P(r ≤ 22)

Since r ≤ 22 and r < 23 are the same for this binomial variable, the continuity correction gives P(x < 22.5).

P(r ≥ 37)

Since r ≥ 37 and r > 36 are the same for this binomial variable, the continuity correction gives P(x > 36.5).

P(11 ≤ r ≤ 19)

Since 11 ≤ r and 10 < r are the same and r ≤ 19 and r < 20 are the same for this binomial variable, the continuity correction gives P(10.5 < x < 19.5).

Solution to Problem 5

Each of the probability statements involves the binomial distribution. Use the continuity correction to write a probability statement that uses the normal distribution.

The probability that more than 29 of the students woke up before 8:00 AM.

In symbols this is P(r > 29), which is the same as P(r ≥ 30). The continuity correction gives P(x > 29.5).

The probability that at least 22 of the cars are hybrid vehicles.

In symbols this is P(r ≥ 22). The continuity correction gives P(x > 21.5).

The probability that fewer than 41 of the customers returned the next week.

In symbols this is P(r < 41). The continuity correction gives P(x < 40.5).

The probability that at most 17 of the women wear makeup.

In symbols this is P(r ≤ 17). The continuity correction gives P(x < 17.5).

The probability that between 14 and 26 of the cats fall on their feet.

In symbols this is P(14 ≤ r ≤ 26). The continuity correction gives P(13.5 < x < 26.5).

Solution to Problem 7

Suppose that 20% of all college students are vegetarians. If 80 students are randomly selected, what is the probability that

Notice first that np = (80)(0.2) = 16 and nq = (80)(0.8) = 64 are both greater than 5. Hence we can use the normal distribution to approximate the binomial distribution. We get:

μ = np = 16 and σ = √(npq) = √((80)(0.2)(0.8)) = √12.8 ≈ 3.58

fewer than 13 of them are vegetarians?

We can write this as P(r < 13). Using the continuity correction and the normal distribution, we can approximate this probability as P(X < 12.5). Now convert this to a statement using the standard normal distribution:

z = (12.5 - μ)/σ = (12.5 - 16)/3.58 = -0.98

The probability statement becomes: P(z < -0.98). Now we can go to the normal distribution table to get that the probability is 0.1635.

more than 14 of them are vegetarians?

We can write this as P(r > 14). Using the continuity correction and the normal distribution, we can approximate this probability as P(X > 14.5). Now convert this to a statement using the standard normal distribution:

z = (14.5 - 16)/3.58 = -0.42

The probability statement becomes: P(z > -0.42). Since we want the area to the right of -0.42 and the table gives us the area to the left, we use the rule of complements and subtract from 1.

P(z > -0.42) = 1 - P(z < -0.42) = 1 - 0.3372 = 0.6628

at least 20 of them are vegetarians?

We can write this as P(r ≥ 20). Using the continuity correction and the normal distribution, we can approximate this probability as P(X > 19.5). Now convert this to a statement using the standard normal distribution:

z = (19.5 - 16)/3.58 = 0.98

The probability statement becomes: P(z > 0.98). Since we want the area to the right of 0.98 and the table gives us the area to the left, we use the rule of complements and subtract from 1.

P(z > 0.98) = 1 - P(z < 0.98) = 1 - 0.8365 = 0.1635

at most 17 of them are vegetarians?

We can write this as P(r ≤ 17). Using the continuity correction and the normal distribution, we can approximate this probability as P(X < 17.5). Now convert this to a statement using the standard normal distribution:

z = (17.5 - 16)/3.58 = 0.42

The probability statement becomes: P(z < 0.42). Now we can go to the normal distribution table to get that the probability is 0.6628.

Between 13 and 16 of them are vegetarians?

We can write this as P(13 ≤ r ≤ 16). Using the continuity correction and the normal distribution, we can approximate this probability as P(12.5 < X < 16.5). Now convert this to a statement using the standard normal distribution. Compute the two z-scores:

z = (12.5 - 16)/3.58 = -0.98

and

z = (16.5 - 16)/3.58 = 0.14

The probability statement becomes: P(-0.98 < z < 0.14). The area under the normal curve between z = -0.98 and 0.14 is the area to the left of 0.14 minus the area to the left of -0.98. That is

P(-0.98 < z < 0.14) = P(z < 0.14) - P(z < -0.98)

= 0.5557 - 0.1635

= 0.3922

The Central Limit Theorem

A Review of Terminology

We begin our journey into inferential statistics. Most of the time the population mean and population standard deviation are impossible or too expensive to determine exactly. Two of the major tasks of a statistician are to get an approximation to the mean and to analyze how accurate the approximation is. The most common way of accomplishing this task is by using sampling techniques. Out of the entire population the researcher obtains a (hopefully random) sample and uses the sample to make inferences about the population. From the sample the statistician computes several numbers such as the sample size, the sample mean, and the sample standard deviation. The numbers that are computed from the sample are called statistics.

Example

How many cups of coffee do you drink each week? If we asked this question to two different five person groups, we would probably get two different sample means and two different sample standard deviations. Choosing different samples from the same population will produce different statistics. The distribution of a statistic over all possible samples is called the sampling distribution.

The Five Dice Experiment:

Consider the distribution of rolling a die. It is uniform (flat) between 1 and 6. If we roll five dice, we can compute the pdf of the mean. We will see that the distribution becomes more like a normal distribution.
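The five dice experiment can be simulated in a few lines. The sketch below is an illustration under stated assumptions: it uses Python's random module with a fixed seed, 20,000 repetitions (an arbitrary choice), and a hypothetical helper named sample_means.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def sample_means(num_dice, num_trials, seed=0):
    """Roll num_dice dice num_trials times and record the mean of each roll."""
    rng = random.Random(seed)
    return [
        sum(rng.randint(1, 6) for _ in range(num_dice)) / num_dice
        for _ in range(num_trials)
    ]


means = sample_means(5, 20000)

# A single die has mean 3.5 and standard deviation sqrt(35/12).
single_die_sd = sqrt(35 / 12)

print(round(mean(means), 3))              # should be near 3.5
print(round(pstdev(means), 3))            # should be near single_die_sd / sqrt(5)
print(round(single_die_sd / sqrt(5), 3))
```

A histogram of the values in means would show the flat single-die distribution giving way to a bell shape, which is what the experiment is meant to demonstrate.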

The Central Limit Theorem

Let x̄ denote the mean of a random sample of size n from a population having mean μ and standard deviation σ. Let

μ_x̄ = mean value of x̄ and σ_x̄ = the standard deviation of x̄

then

μ_x̄ = μ

σ_x̄ = σ/√n

When the population distribution is normal, so is the distribution of x̄ for any n.

For large n, the distribution of x̄ is approximately normal regardless of the population distribution.

Rule of thumb: n > 30 is large.

Example: Suppose that we play a slot machine such that you can either double your bet or lose your bet. If there is a 45% chance of winning, then the expected value for a dollar wager is

μ = 1(.45) + (-1)(.55) = -.1

We can compute the standard deviation:

x       p(x)    (x - μ)²    p(x)(x - μ)²
1       .45     1.21        .545
-1      .55     .81         .446
Total                       .991

So the standard deviation is

σ = √.991 = .995

If we throw 100 silver dollars into the slot machine, then we expect to average a loss of ten cents with a standard deviation of

σ_x̄ = .995/√100 = .0995

Notice that the standard deviation is very small. This is why the casinos are assured to make money. Now let us find the probability that the gambler does not lose any money, that is the mean is greater than or equal to 0.

We first compute the z-score. We have

z = (0 - (-.1))/.0995 = 1.01

Now we go to the table to find the associated probability. We get .8438. Since we want the area to the right, we subtract from 1 to get P(z > 1.01) = 1 - P(z < 1.01) = 1 - .8438 = .1562. There is about a 16% chance that the gambler will not lose.
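The slot machine calculation chains together an expected value, a standard deviation, the Central Limit Theorem, and a normal probability. A minimal sketch of that chain, assuming Python 3.8+ (the variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

# Payoff for a one dollar wager and its probability.
outcomes = {1: 0.45, -1: 0.55}

mu = sum(x * p for x, p in outcomes.items())
variance = sum(p * (x - mu) ** 2 for x, p in outcomes.items())
sigma = sqrt(variance)

# Central Limit Theorem: the mean of n plays has standard deviation sigma / sqrt(n).
n = 100
sigma_mean = sigma / sqrt(n)

# Probability that the average result over 100 plays is at least 0.
p_no_loss = 1 - NormalDist(mu, sigma_mean).cdf(0)

print(round(mu, 2), round(sigma_mean, 4), round(p_no_loss, 4))
```

The probability printed is close to the table answer .1562; the small difference comes from rounding the z-score to 1.01 in the hand calculation.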

Sampling Distributions for Proportions

The last example was a special case of proportions, that is, Boolean data. From now on, we can use the following theorem.

The Central Limit Theorem for Proportions

Let p be the probability of success and q be the probability of failure. The sampling distribution of the sample proportion p̂ for samples of size n is approximately normal with mean

μ_p̂ = p

and standard deviation

σ_p̂ = √(pq/n)

Example

The new Endeavor SUV has been recalled because 5% of the cars experience brake failure. The Tahoe dealership has sold 200 of these cars. What is the probability that fewer than 4% of the cars from Tahoe experience brake failure?

Solution

We have p = .05, q = .95, and n = 200. We have

μ_p̂ = p = .05 and σ_p̂ = √((.05)(.95)/200) = .0154

Since 4% of 200 cars is 8 cars, we want to find P(x < 8). Using the continuity correction, we find instead P(x < 7.5). This is equivalent to P(p̂ < 7.5/200) = P(p̂ < .0375). We find the z-score:

z = (.0375 - .05)/.0154 = -.81

The table gives a probability of .2090. We can conclude that there is about a 21% chance that fewer than 4% of the cars will experience brake failure.

Control Charts for Proportions

A while back we discussed how to construct a control chart. For proportions, we can use the same tool, remembering that the Central Limit Theorem tells us how to find the mean and standard deviation.

Example

Heavenly Ski resort conducted a study of falls on its advanced run over twelve consecutive ten minute periods. At each ten minute interval there were 40 boarders on the run. The data is shown below:

Time    1     2     3     4     5     6     7     8     9     10    11    12
r       14    18    11    16    19    22    6     12    13    16    9     17
r/40    .35   .45   .275  .4    .475  .55   .15   .3    .325  .4    .225  .425

Make a P-Chart and list any out of control signals by type (I, II, III).

Solution

First we find p̄ by dividing the total number of falls by the total number of skiers:

p̄ = 173/((12)(40)) = .36

Now we compute the mean and standard deviation:

μ = p̄ = .36 and σ = √((.36)(.64)/40) = .08

Two and three standard deviations above and below the mean are

.36 - (2)(.08) = .20     .36 - (3)(.08) = .12
.36 + (2)(.08) = .52     .36 + (3)(.08) = .60

Now we can use this data as before to construct a control chart and determine any out of control signals.

Notice that no nine consecutive points lie on one side of the center line, no two of three consecutive points lie on the same side beyond the two standard deviation limits (above 0.52 or below 0.20), and no points lie below 0.12 or above 0.60. Hence this data is in control.
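The control limits and the three checks can be sketched in code. This is an illustration, not the construction from the earlier control chart discussion: the function names are hypothetical, and the three rules are written exactly as they are worded in this solution (one point beyond three standard deviations, nine consecutive points on one side of the center line, two of three consecutive points beyond two standard deviations on the same side).

```python
from math import sqrt

falls = [14, 18, 11, 16, 19, 22, 6, 12, 13, 16, 9, 17]
n = 40  # boarders per ten minute period

proportions = [r / n for r in falls]
p_bar = sum(falls) / (n * len(falls))      # center line
sigma = sqrt(p_bar * (1 - p_bar) / n)      # standard deviation of a proportion


def beyond_three_sigma(values):
    """Time periods (1-based) whose proportion is more than 3 sigma from center."""
    return [i + 1 for i, v in enumerate(values) if abs(v - p_bar) > 3 * sigma]


def run_of_nine(values):
    """True if nine consecutive points lie on the same side of the center line."""
    for start in range(len(values) - 8):
        window = values[start:start + 9]
        if all(v > p_bar for v in window) or all(v < p_bar for v in window):
            return True
    return False


def two_of_three_beyond_two_sigma(values):
    """True if two of three consecutive points lie beyond 2 sigma on one side."""
    for start in range(len(values) - 2):
        window = values[start:start + 3]
        above = sum(v > p_bar + 2 * sigma for v in window)
        below = sum(v < p_bar - 2 * sigma for v in window)
        if above >= 2 or below >= 2:
            return True
    return False


print(round(p_bar, 2), round(sigma, 2))
print(beyond_three_sigma(proportions), run_of_nine(proportions),
      two_of_three_beyond_two_sigma(proportions))
```

The code keeps the unrounded values of p̄ and σ, so its limits differ slightly from the rounded ones above, but the conclusion for this data is the same.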

Sample Proportions and Point Estimation

Sample Proportions

Let p be the proportion of successes of a sample from a population whose total proportion of successes is π, and let μ_p be the mean of p and σ_p be its standard deviation. Then

The Central Limit Theorem For Proportions

μ_p = π

σ_p = √(π(1 - π)/n)

For n large, p is approximately normal.

Example Consider the next census. Suppose we are interested in the proportion of Americans that are below the poverty level. Instead of attempting to find all Americans, Congress has proposed to perform statistical sampling. We can concentrate on 10,000 randomly selected people from 1000 locations. We can determine the proportion of people below the poverty level in this sample. Suppose this proportion is .08. Then the mean for the sampling distribution is μ_p = .08 and the standard deviation is σ_p = √((.08)(.92)/10,000) ≈ .0027

Point Estimations

A Point Estimate is a statistic that gives a plausible estimate for the value in question.

Example

x̄ is a point estimate for μ
s is a point estimate for σ

A point estimate is unbiased if its mean equals the value that it is estimating.

Confidence Intervals for a Mean

Point Estimations

Usually, we do not know the population mean and standard deviation. Our goal is to estimate these numbers. The standard way to accomplish this is to use the sample mean and standard deviation as a best guess for the true population mean and standard deviation. We call this "best guess" a point estimate. A Point Estimate is a statistic that gives a plausible estimate for the value in question.

Example

x̄ is a point estimate for μ
s is a point estimate for σ

A point estimate is unbiased if its mean equals the value that it is estimating.

Confidence Intervals

We are not only interested in finding the point estimate for the mean, but also in determining how accurate the point estimate is. The Central Limit Theorem plays a key role here. We assume that the sample standard deviation is close to the population standard deviation (which will almost always be true for large samples). Then the Central Limit Theorem tells us that the standard deviation of the sampling distribution is

σ_x̄ = σ/√n

We will be interested in finding an interval around x̄ such that there is a large probability that the actual mean falls inside of this interval. This interval is called a confidence interval and the large probability is called the confidence level.

Example

Suppose that we check for clarity in 50 locations in Lake Tahoe and discover that the average depth of clarity of the lake is 14 feet. Suppose that we know that the standard deviation for the entire lake's depth is 2 feet. What can we conclude about the average clarity of the lake with a 95% confidence level?

Solution

We can use x̄ to provide a point estimate for μ. How accurate is x̄ as a point estimate? We construct a 95% confidence interval for μ as follows. We draw the picture and realize that we need to use the table to find the z-score associated to the probability of .025 (there is .025 to the left and .025 to the right).

We arrive at z = -1.96. Now we solve for x:

-1.96 = (x - 14)/(2/√50) = (x - 14)/0.28

Hence x - 14 = -0.55. We say that 0.55 is the margin of error. We have that a 95% confidence interval for the mean clarity is (13.45, 14.55). In other words, we are 95% confident that the mean clarity is between 13.45 and 14.55 feet. In general, if zc is the z value associated with c%, then a c% confidence interval for the mean is

x̄ ± zc σ/√n
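The general formula can be turned into a short function. This is a sketch under the same assumption as the example, namely that the population standard deviation is known; it assumes Python 3.8+ and the function name mean_confidence_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(x_bar, sigma, n, confidence):
    """Interval x_bar +/- z_c * sigma / sqrt(n) for a known population sigma."""
    # z_c leaves (1 - confidence) / 2 in each tail, e.g. .025 for 95%.
    z_c = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_c * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


low, high = mean_confidence_interval(14, 2, 50, 0.95)
print(round(low, 2), round(high, 2))
```

For the Lake Tahoe data this reproduces the interval found by hand after rounding to two decimals.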

Confidence Interval for a Mean When the Population Standard Deviation is Unknown

When the population is normal or if the sample size is large, then the sampling distribution will also be normal, but the use of s to replace σ is not that accurate. The smaller the sample size, the worse the approximation will be. Hence we can expect that some adjustment will be made based on the sample size. The adjustment we make is that we do not use the normal curve for this approximation. Instead, we use the Student t distribution that is based on the sample size. We proceed as before, but we change the table that we use. This distribution looks like the normal distribution, but as the sample size decreases it spreads out. For large n it nearly matches the normal curve. We say that the distribution has n - 1 degrees of freedom.

Example

Suppose that we conduct a survey of 19 millionaires to find out what percent of their income the average millionaire donates to charity. We discover that the mean percent is 15 with a standard deviation of 5 percent. Find a 95% confidence interval for the mean percent. Assume that the distribution of all charity percents is approximately normal.

Solution

We use the formula:

x̄ ± tc s/√n

(Notice the t instead of the z and s instead of σ.) We get

15 ± tc (5/√19)

Since n = 19, there are 18 degrees of freedom. Using the table in the back of the book, we have that tc = 2.10. Hence the margin of error is

2.10 (5)/√19 = 2.4

We can conclude with 95% confidence that the millionaires donate between 12.6% and 17.4% of their income to charity.

Handout on Finding the Sample Size Needed for a Confidence Interval for a Single Population Mean

Suppose you want to construct a 95% confidence interval for the mean age that people have their wisdom teeth removed. If you know that the standard deviation is 1.8 and you want a margin of error of no more than plus or minus 0.2 years, at least how many people must you survey?

Solution

We use the formula:

n = (zc σ/E)²

Now write down the cast of characters:

σ = 1.8
E = 0.2
c = 0.95
zc = 1.96

Plug in to get

n = ((1.96)(1.8)/0.2)² = 311.17

Next note that we can't survey a fractional number of people. For sample size calculations we always round up. We can conclude that in order to construct this 95% confidence interval with a margin of error of no more than plus or minus 0.2 years we need to survey at least 312 people.

You have just conducted a preliminary survey of 22 students asking them how many times a week they eat in fast food restaurants. The standard deviation for this survey was 2.8. If you want to construct a 95% confidence interval for the mean number of times per week students eat in fast food restaurants and have a margin of error no more than plus or minus 0.05, at least how many additional students must you survey?

Solution

We use the formula:

n = (zc σ/E)²

Now write down the cast of characters:

σ = 2.8
E = 0.05
c = 0.95
zc = 1.96

Plug in to get

n = ((1.96)(2.8)/0.05)² = 12,047.26

Next note that we can't survey a fractional number of people. For sample size calculations we always round up. We need to survey a total of 12,048 students. Since we have already surveyed 22 students, we subtract 22 from the total to get

12,048 - 22 = 12,026

We can conclude that in order to construct this 95% confidence interval with a margin of error of no more than plus or minus 0.05 times per week we need to survey at least 12,026 more students.

True or false: Increasing the sample size while holding the level of confidence fixed will decrease the margin of error for a confidence interval.

Solution

We examine the formula:

E = zc σ/√n

Now increasing the sample size n increases the denominator. When the denominator gets larger, the fraction gets smaller. Hence, the margin of error will decrease when the sample size increases. The statement is true.

Confidence Intervals For Proportions and Choosing the Sample Size

A Large Sample Confidence Interval for a Population Proportion

Recall that a confidence interval for a population mean is given by

Confidence Interval for a Population Mean

x̄ ± zc s/√n

We can make a similar construction for a confidence interval for a population proportion. Instead of x̄, we can use p̂, and instead of s, we use √(p̂(1 - p̂)). Hence, we can write the confidence interval for a large sample proportion as

Confidence Interval for a Population Proportion

p̂ ± zc √(p̂(1 - p̂)/n)

where the margin of error is E = zc √(p̂(1 - p̂)/n).

Example

1000 randomly selected Americans were asked if they believed the minimum wage should be raised. 600 said yes. Construct a 95% confidence interval for the proportion of Americans who believe that the minimum wage should be raised.

Solution: We have p̂ = 600/1000 = .6, zc = 1.96, and n = 1000. We calculate:

.6 ± 1.96 √((.6)(.4)/1000) = .6 ± .03

Hence we can conclude, with 95% confidence, that between 57 and 63 percent of all Americans agree with the proposal. In other words, with a margin of error of .03, 60% agree.
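The large sample interval for a proportion can be sketched the same way as the interval for a mean. The function name proportion_confidence_interval is hypothetical, and Python 3.8+ is assumed.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence):
    """Large sample interval p_hat +/- z_c * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    z_c = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_c * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = proportion_confidence_interval(600, 1000, 0.95)
print(round(low, 2), round(high, 2))
```

For the minimum wage survey this gives the same 57 to 63 percent range after rounding.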

Calculating n for Estimating a Mean

Example

Suppose that you were interested in the average number of units that students take at a two year college to get an AA degree. Suppose you wanted to find a 95% confidence interval with a margin of error of .5, knowing σ = 10. How many people should we ask?

Solution

Solving for n in

Margin of Error = E = zc σ/√n

we have

√n = zc σ/E

Squaring both sides, we get

n = (zc σ/E)²

We use the formula:

n = ((1.96)(10)/.5)² = 1536.64

Rounding up, we should ask at least 1537 people.

Example

A Subaru dealer wants to find out the age of their customers (for advertising purposes). They want the margin of error to be 3 years. If they want a 90% confidence interval, how many people do they need to know about?

Solution: We have E = 3 and zc = 1.65, but there is no way of finding σ exactly. They use the following reasoning: most car customers are between 16 and 68 years old, hence the range is

Range = 68 - 16 = 52

The range covers about four standard deviations, hence one standard deviation is about 52/4 = 13. We can now calculate n:

n = ((1.65)(13)/3)² = 51.1

Hence the dealer should survey at least 52 people.
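The sample size formula for a mean, together with the always-round-up rule, fits in one line of code. A minimal sketch follows; the function name sample_size_for_mean is hypothetical, and the z value is passed in so that the table values used in these notes (1.96, 1.65) can be supplied directly.

```python
from math import ceil


def sample_size_for_mean(z_c, sigma, margin):
    """Smallest whole n with z_c * sigma / sqrt(n) <= margin."""
    return ceil((z_c * sigma / margin) ** 2)


print(sample_size_for_mean(1.96, 1.8, 0.2))   # wisdom teeth handout problem
print(sample_size_for_mean(1.96, 10, 0.5))    # AA degree units example
print(sample_size_for_mean(1.65, 13, 3))      # Subaru dealer example
```

Using a more precise z value (for example 1.645 instead of 1.65 for 90%) can change the answer by one, which is why the z value is left as an explicit input here.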

Finding n to Estimate a Proportion

Example

Suppose that you are in charge of determining whether dropping a computer will damage it. You want to find the proportion of computers that break. If you want a 90% confidence interval for this proportion, with a margin of error of 4%, how many computers should you drop?

Solution

The formula states that

E = zc √(p(1 - p)/n)

Squaring both sides, we get that

E² = zc² p(1 - p)/n

Multiplying by n, we get

nE² = zc² [p(1 - p)]

Dividing by E², we get

n = zc² p(1 - p)/E²

This is the formula for finding n. Since we do not know p, we use .5 (a conservative estimate):

n = (1.65)² (.5)(.5)/(.04)² = 425.4

We round 425.4 up for greater accuracy. We will need to drop at least 426 computers. This could get expensive.

Suppose you want to construct a 95% confidence interval for the proportion of college students who floss daily. You want a margin of error of no more than plus or minus 3%. According to the ADA, 12% of all Americans floss daily. If you think that college students' flossing habits are similar to the general population of Americans, how many college students should you survey?

Solution

We use the formula:

n = zc² p(1 - p)/E²

We have

p = 0.12
E = 0.03
c = 0.95
zc = 1.96

Putting these together gives

n = (1.96)² (0.12)(0.88)/(0.03)² = 450.75

We always round up to determine the sample size needed. We can conclude that 451 college students need to be surveyed.

Suppose you want to construct a 90% confidence interval for the proportion of college students that want to work in the health care profession after graduation. If you want a margin of error of no more than 5%, how many college students must you survey?

Solution Since there is no preliminary estimate for the population proportion, we use the formula:

$n = \frac{1}{4}(z_c / E)^2$

We have:

E = 0.05, c = 0.90, $z_c$ = 1.645

Plug these numbers into the formula to get:

$n = \frac{1}{4}(1.645 / 0.05)^2 \approx 270.6$

We always round up, so we can conclude that we must survey 271 college students.
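The three proportion examples above all use the same formula, $n = z_c^2 \, p(1 - p) / E^2$, with $p = 0.5$ as the fallback when no preliminary estimate exists. A minimal sketch follows; the function name and the default argument are illustrative assumptions.

```python
import math

def sample_size_for_proportion(z_c, margin, p=0.5):
    """Sample size for estimating a proportion.

    p defaults to 0.5, the conservative choice when no estimate is available.
    """
    return math.ceil(z_c ** 2 * p * (1 - p) / margin ** 2)

n_computers = sample_size_for_proportion(1.65, 0.04)        # no estimate of p
n_floss = sample_size_for_proportion(1.96, 0.03, p=0.12)    # ADA estimate of 12%
n_health = sample_size_for_proportion(1.645, 0.05)          # no estimate of p

print(n_computers, n_floss, n_health)
```

Because $p(1 - p)$ is largest at $p = 0.5$, the default never underestimates the required sample size.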

Estimating Differences

Difference Between Means

I surveyed 50 people from a poor area of town and 70 people from an affluent area of town about their feelings towards minorities. I counted the number of negative comments made. I was interested in comparing their attitudes. The average number of negative comments in the poor area was 14 and in the affluent area was 12. The standard deviations were 5 and 4 respectively. Let's determine a 95% confidence interval for the difference in mean negative comments. First, we need some formulas.

Theorem

The distribution of the difference of means $\bar{x}_1 - \bar{x}_2$ has mean

$\mu_1 - \mu_2$

and standard deviation

$\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$

For our investigation, we use $s_1$ and $s_2$ as point estimates for $\sigma_1$ and $\sigma_2$. We have

$\bar{x}_1 = 14$, $\bar{x}_2 = 12$, $s_1 = 5$, $s_2 = 4$, $n_1 = 50$, $n_2 = 70$

Now calculate

$\bar{x}_1 - \bar{x}_2 = 14 - 12 = 2$

$s = \sqrt{5^2/50 + 4^2/70} = \sqrt{0.5 + 0.229} \approx 0.85$

The margin of error is

$E = z_c s = (1.96)(0.85) \approx 1.7$

The confidence interval is $2 \pm 1.7$ or [0.3, 3.7]

We can conclude that the difference between the mean number of negative comments made in the poor area and in the affluent area is between 0.3 and 3.7. Note: To calculate the degrees of freedom, we can take the smaller of the two numbers $n_1 - 1$ and $n_2 - 1$. So in the prior example, a better estimate would use 49 degrees of freedom. The t-table gives a value of 2.014 for the 95% confidence level, and the margin of error is $E = t_c s = (2.014)(0.85) = 1.7119$, which still rounds to 1.7. This example demonstrates that for large samples the t-table and the z-table give practically the same results.
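As a rough check of the arithmetic, the interval for a difference of two means can be computed directly. This is a sketch only; the function name `diff_means_interval` is hypothetical, and the critical value is passed in by hand from the z-table.

```python
import math

def diff_means_interval(x1, s1, n1, x2, s2, n2, critical_value):
    """Confidence interval for mu1 - mu2 using sample standard deviations."""
    diff = x1 - x2
    std_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    margin = critical_value * std_error
    return diff - margin, diff + margin

# Poor area: mean 14, s = 5, n = 50. Affluent area: mean 12, s = 4, n = 70.
low, high = diff_means_interval(14, 5, 50, 12, 4, 70, 1.96)
print(round(low, 1), round(high, 1))
```

Passing 2.014 instead of 1.96 as the critical value gives the t-based interval discussed in the note.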

Small Samples With Pooled Standard Deviations (Optional)

When the sample size is small, we can still run the statistics provided the distributions are approximately normal. If in addition we know that the two standard deviations are approximately equal, then we can pool the data together to produce a pooled standard deviation. We have the following theorem.

Pooled Estimate of $\sigma$

$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$

with $n_1 + n_2 - 2$ degrees of freedom

You've gotta love the beautiful formula!


Note After finding the pooled estimate we have that a confidence interval is given by

$(\bar{x}_1 - \bar{x}_2) \pm t_c \, s_p \sqrt{1/n_1 + 1/n_2}$

Example What is the difference between commuting patterns for students and professors? 11 students and 14 professors took part in a study to find mean commuting distances. The mean number of miles traveled by students was 5.6 and the standard deviation was 2.8. The mean number of miles traveled by professors was 14.3 and the standard deviation was 9.1. Construct a 95% confidence interval for the difference between the means. What assumption have we made?

Solution We have

$\bar{x}_1 = 5.6$, $\bar{x}_2 = 14.3$, $s_1 = 2.8$, $s_2 = 9.1$, $n_1 = 11$, $n_2 = 14$

The pooled standard deviation is

$s_p = \sqrt{\frac{(10)(2.8)^2 + (13)(9.1)^2}{23}} \approx 7.09$

The point estimate for the difference of the means is 14.3 - 5.6 = 8.7 and

$\sqrt{1/11 + 1/14} \approx 0.403$

Use the t-table to find $t_c$ for a 95% confidence interval with 23 degrees of freedom and find $t_c = 2.07$.

$8.7 \pm (2.07)(7.09)(.403) = 8.7 \pm 5.9$

The range of values is [2.8, 14.6]. The difference in average miles driven by students and professors is between 2.8 and 14.6. We have assumed that the standard deviations are approximately equal and the two distributions are approximately normal.
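The pooled calculation for the commuting example can be sketched as follows. The helper name `pooled_sd` is illustrative, and the critical value 2.07 is typed in from the t-table rather than computed.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation with n1 + n2 - 2 degrees of freedom."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

s_p = pooled_sd(2.8, 11, 9.1, 14)

t_c = 2.07                       # t-table value, 95% confidence, 23 degrees of freedom
point_estimate = 14.3 - 5.6      # professors minus students
margin = t_c * s_p * math.sqrt(1 / 11 + 1 / 14)

low, high = point_estimate - margin, point_estimate + margin
print(round(s_p, 2), round(low, 1), round(high, 1))
```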

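Difference Between Proportions

So far, we have discussed the difference between two means (both large and small samples). Our next task is to estimate the difference between two proportions. We have the following theorem.

Theorem

The distribution of the difference of sample proportions $\hat{p}_1 - \hat{p}_2$ has mean $p_1 - p_2$ and standard deviation $\sqrt{p_1 q_1 / n_1 + p_2 q_2 / n_2}$, where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$.

And a confidence interval for the difference of proportions is

Confidence Interval for the difference of Proportions

$(\hat{p}_1 - \hat{p}_2) \pm z_c \sqrt{\hat{p}_1 \hat{q}_1 / n_1 + \hat{p}_2 \hat{q}_2 / n_2}$

Note: in order for this to be valid, we need all four of the quantities $p_1 n_1$, $p_2 n_2$, $q_1 n_1$, $q_2 n_2$ to be greater than 5.

Example 300 men and 400 women were asked how they felt about taxing Internet sales. 75 of the men and 90 of the women agreed with having a tax. Find a confidence interval for the difference in proportions.

Solution We have

$p_1 = 75/300 = .25$, $q_1 = .75$, $n_1 = 300$

$p_2 = 90/400 = .225$, $q_2 = .775$, $n_2 = 400$

We can calculate

$\sqrt{(.25)(.75)/300 + (.225)(.775)/400} \approx 0.033$

The difference in sample proportions, women minus men, is .225 - .25 = -.025. With a margin of error of about .06, we can conclude that the difference in opinions is between -8.5% and 3.5%.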

Hypothesis Testing

Whenever we have a decision to make about a population characteristic, we make a hypothesis. Some examples are: $\mu > 3$ or $\mu \neq 5$. Suppose that we want to test the hypothesis that $\mu \neq 5$. Then we can think of our opponent suggesting that $\mu = 5$. We call the opponent's hypothesis the null hypothesis and write:

H0: $\mu = 5$

and our hypothesis the alternative hypothesis and write

H1: $\mu \neq 5$

For the null hypothesis we always use equality, since we are comparing with a previously determined mean. For the alternative hypothesis, we have the choices: $<$, $>$, or $\neq$.

Procedures in Hypothesis Testing When we test a hypothesis we proceed as follows:

Formulate the null and alternative hypotheses.

Choose a level of significance.

Determine the sample size. (Same as confidence intervals)

Collect data.

Calculate z (or t) score.

Utilize the table to determine if the z score falls within the acceptance region.

Decide to

Reject the null hypothesis and therefore accept the alternative hypothesis or

Fail to reject the null hypothesis and therefore state that there is not enough evidence to suggest the truth of the alternative hypothesis.

Errors in Hypothesis Tests

We define a type I error as the event of rejecting the null hypothesis when the null hypothesis was true. The probability of a type I error ($\alpha$) is called the significance level. We define a type II error (with probability $\beta$) as the event of failing to reject the null hypothesis when the null hypothesis was false.

Example Suppose that you are a lawyer who is trying to establish that a company has been unfair to minorities with regard to salary increases. Suppose the mean salary increase per year is 8%. You set the hypotheses to be

H0: $\mu = .08$

H1: $\mu < .08$

Q. What is a type I error? A. We put sanctions on the company, when they were not being discriminatory.

Q. What is a type II error? A. We allow the company to go about its discriminatory ways.

Note: Larger $\alpha$ results in a smaller $\beta$, and smaller $\alpha$ results in a larger $\beta$.

Hypothesis Testing For a Population Mean

The Idea of Hypothesis Testing

Suppose we want to show that only children have a higher average cholesterol level than the national average. It is known that the mean cholesterol level for all Americans is 190. Construct the relevant hypothesis test:

H0: $\mu = 190$

H1: $\mu > 190$

We test 100 only children and find that $\bar{x} = 198$, and suppose we know the population standard deviation $\sigma = 15$. Do we have evidence to suggest that only children have a higher average cholesterol level than the national average? We have

$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{198 - 190}{15 / \sqrt{100}} \approx 5.33$

z is called the test statistic. Since z is so high, a sample mean this large would be extremely unlikely if H0 were true, so we decide to reject H0 and accept H1. Therefore, we can conclude that only children have a higher average cholesterol level than the national average.

Rejection Regions

Suppose that $\alpha = .05$. We can draw the appropriate picture and find the z-scores that leave an area of .025 in each tail, namely $z = -1.96$ and $z = 1.96$. We call the outside regions the rejection regions.

We call the two tails (the blue areas in the picture) the rejection region since if the value of z falls in these regions, we can say that the null hypothesis is very unlikely, so we can reject the null hypothesis.

Example 50 smokers were questioned about the number of hours they sleep each day. We want to test the hypothesis that smokers need less sleep than the general public, which needs an average of 7.7 hours of sleep. We follow the steps below.

Compute a rejection region for a significance level of .05. If the sample mean is 7.5 and the population standard deviation is 0.5, what can you conclude?

Solution

First, we write down the null and alternative hypotheses

H0: $\mu = 7.7$

H1: $\mu < 7.7$

This is a left tailed test. The z-score that corresponds to .05 is -1.645. The critical region is the area that lies to the left of -1.645. If the z-value is less than -1.645 then we will reject the null hypothesis and accept the alternative hypothesis. If it is greater than -1.645, we will fail to reject the null hypothesis and say that the test was not statistically significant. We have

$z = \frac{7.5 - 7.7}{0.5 / \sqrt{50}} \approx -2.83$

Since -2.83 is to the left of -1.645, it is in the critical region. Hence we reject the null hypothesis and accept the alternative hypothesis. We can conclude that smokers need less sleep.
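The left-tailed test above can be sketched with Python's standard library, which provides the normal distribution through `statistics.NormalDist`. The function name `z_statistic` and the variable names are illustrative choices.

```python
import math
from statistics import NormalDist

def z_statistic(sample_mean, mu_0, sigma, n):
    """Test statistic for a mean when the population standard deviation is known."""
    return (sample_mean - mu_0) / (sigma / math.sqrt(n))

alpha = 0.05
z_critical = NormalDist().inv_cdf(alpha)   # left-tail cutoff, about -1.645
z = z_statistic(7.5, 7.7, 0.5, 50)         # smokers' sleep example

reject_null = z < z_critical               # left-tailed rejection rule
print(round(z, 2), round(z_critical, 3), reject_null)
```

For a right-tailed test the cutoff would be `NormalDist().inv_cdf(1 - alpha)` and the comparison would be reversed.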

p-values

There is another way to interpret the test statistic. In hypothesis testing, we make a yes or no decision without discussing borderline cases. For example with $\alpha = .06$, a two tailed test will indicate rejection of H0 for a test statistic of z = 2 or for z = 6, but z = 6 is much stronger evidence than z = 2. To show this difference we write the p-value, which is the lowest significance level such that we will still reject H0. For a two tailed test, we use twice the table value to find p, and for a one tailed test, we use the table value.

Example: Suppose that we want to test the hypothesis with a significance level of .05 that the climate has changed since industrialization. Suppose that the mean temperature throughout history is 50 degrees. During the last 40 years, the mean temperature has been 51 degrees and suppose the population standard deviation is 2 degrees. What can we conclude? We have

H0: $\mu = 50$

H1: $\mu \neq 50$

We compute the z score:

$z = \frac{51 - 50}{2 / \sqrt{40}} \approx 3.16$

The table gives us .9992 so that

p = (1 - .9992)(2) = .0016, or about .002

Since .002 < .05 we can conclude that there has been a change in temperature. Note that small p-values will result in a rejection of H0 and large p-values will result in failing to reject H0.
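A two-tailed p-value can also be computed without a table. The sketch below uses `statistics.NormalDist` for the normal cumulative distribution; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Climate example: H0 mean 50, sample mean 51, sigma 2, n = 40
z = (51 - 50) / (2 / math.sqrt(40))

# Two-tailed test: double the area beyond |z| in one tail
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2), round(p_value, 4))
```

The unrounded result is slightly below the hand calculation because the table value .9992 is itself rounded.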

Hypothesis Testing for a Proportion and for a Mean with Unknown Population Standard Deviation

Small Sample Hypothesis Tests For a Normal Population

When we have a small sample from a normal population, we use the same method as for a large sample except we use the t statistic instead of the z-statistic. Hence, we need to find the degrees of freedom (n - 1) and use the t-table in the back of the book.

Example Is the temperature required to damage a computer on the average less than 110 degrees? Because of the price of testing, twenty computers were tested to see what minimum temperature will damage the computer. The damaging temperature averaged 109 degrees with a standard deviation of 3 degrees. Assume that the distribution of all computers' damaging temperatures is approximately normal. (Use $\alpha = .05$.)

We test the hypothesis

H0: $\mu = 110$

H1: $\mu < 110$

We compute the t statistic:

$t = \frac{109 - 110}{3 / \sqrt{20}} \approx -1.49$

This is a one tailed test, so we can go to our t-table with 19 degrees of freedom to find that

$t_c = 1.73$

Since -1.49 > -1.73, we see that the test statistic does not fall in the critical region. We fail to reject the null hypothesis and conclude that there is insufficient evidence to suggest that the temperature required to damage a computer on the average is less than 110 degrees.
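The t statistic is computed the same way as z, with the sample standard deviation in place of $\sigma$. In this sketch the critical value is copied from the t-table (the standard library has no t distribution), and the names are illustrative.

```python
import math

def t_statistic(sample_mean, mu_0, s, n):
    """One-sample t statistic with n - 1 degrees of freedom."""
    return (sample_mean - mu_0) / (s / math.sqrt(n))

t = t_statistic(109, 110, 3, 20)   # computer temperature example
t_critical = -1.73                 # t-table, one tail, alpha = .05, 19 degrees of freedom

reject_null = t < t_critical       # left-tailed rejection rule
print(round(t, 2), reject_null)
```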

Hypothesis Testing for a Population Proportion

We have seen how to conduct hypothesis tests for a mean. We now turn to proportions. The process is completely analogous, although we will need to use the standard deviation formula for a proportion.

Example Suppose that you interview 1000 exiting voters about who they voted for for governor. Of the 1000 voters, 550 reported that they voted for the Democratic candidate. Is there sufficient evidence to suggest that the Democratic candidate will win the election at the .01 level?

H0: p = .5

H1: p > .5

Since it is a large sample we can use the central limit theorem to say that the distribution of proportions is approximately normal. We compute the test statistic:

$z = \frac{.55 - .5}{\sqrt{(.5)(.5)/1000}} \approx 3.16$

Notice that in this formula, we have used the hypothesized proportion rather than the sample proportion. This is because if the null hypothesis is correct, then .5 is the true proportion and we are not making any approximations. We compute the rejection region using the z-table. We find that $z_c = 2.33$. The picture shows us that 3.16 is in the rejection region. Therefore we reject H0 and can conclude that the Democratic candidate will win, with a p-value of .0008.

Example 1500 randomly selected pine trees were tested for traces of the Bark Beetle infestation. It was found that 153 of the trees showed such traces. Test the hypothesis that more than 10% of the Tahoe trees have been infested. (Use a 5% level of significance.)

Solution The hypothesis is

H0: p = .1

H1: p > .1

We have that

$\hat{p} = 153/1500 = 0.102$

Next we compute the z-score

$z = \frac{0.102 - 0.1}{\sqrt{(0.1)(0.9)/1500}} \approx 0.26$

Since we are using a 5% level of significance with a one tailed test, we have $z_c = 1.645$. The rejection region is shown in the picture. We see that 0.26 does not lie in the rejection region, hence we fail to reject the null hypothesis. We say that there is insufficient evidence to make a conclusion about the percentage of infested pines being greater than 10%.
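Both proportion examples use the same test statistic, with the hypothesized proportion in the standard deviation. A minimal sketch, with a hypothetical function name:

```python
import math

def proportion_z(successes, n, p_0):
    """z statistic for H0: p = p_0, using p_0 (not p-hat) in the standard deviation."""
    p_hat = successes / n
    return (p_hat - p_0) / math.sqrt(p_0 * (1 - p_0) / n)

z_voters = proportion_z(550, 1000, 0.5)   # exit poll example, critical value 2.33
z_trees = proportion_z(153, 1500, 0.1)    # bark beetle example, critical value 1.645

print(round(z_voters, 2), z_voters > 2.33)
print(round(z_trees, 2), z_trees > 1.645)
```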

Exercises

Suppose 40% of the nation is registered Republican. Does the Tahoe environment reflect the national proportion? Test the hypothesis that Tahoe residents differ from the rest of the nation in their affiliation, if of 200 locals surveyed, 75 are registered Republican.

If 10% of California residents are vegetarians, test the hypothesis that people who gamble are less likely to be vegetarians. Of the 120 people polled, 10 claimed to be vegetarian.

Difference Between Means

Hypothesis Testing of the Difference Between Two Means

Do employees perform better at work with music playing? The music was turned on during the working hours of a business with 45 employees. Their productivity level averaged 5.2 with a standard deviation of 2.4. On a different day the music was turned off and there were 40 workers. The workers' productivity level averaged 4.8 with a standard deviation of 1.2. What can we conclude at the .05 level?

Solution We first develop the hypotheses

H0: $\mu_1 - \mu_2 = 0$

H1: $\mu_1 - \mu_2 > 0$

Next we need to find the standard deviation. Recall from before, we had that the mean of the difference is

$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$

and the standard deviation is

$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$

We can substitute the sample means and sample standard deviations for a point estimate of the population means and standard deviations. We have

$\bar{x}_1 - \bar{x}_2 = 5.2 - 4.8 = 0.4$

and

$s = \sqrt{2.4^2/45 + 1.2^2/40} \approx 0.405$

Now we can calculate the t-score. We have

$t = \frac{0.4}{0.405} \approx 0.988$

To calculate the degrees of freedom, we can take the smaller of the two numbers $n_1 - 1$ and $n_2 - 1$. So in this example we use 39 degrees of freedom. The t-table gives a value of 1.690 for a one tailed test at the .05 level. Since the t-score 0.988 is smaller than 1.690, we fail to reject the null hypothesis and state that there is insufficient evidence to make a conclusion about employees performing better at work with music playing.
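The music example can be checked with a short sketch. The function name is illustrative, and the critical value 1.690 is typed in from the t-table.

```python
import math

def two_sample_t(x1, s1, n1, x2, s2, n2):
    """t statistic for H0: mu1 - mu2 = 0 without pooling the variances."""
    std_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (x1 - x2) / std_error

t = two_sample_t(5.2, 2.4, 45, 4.8, 1.2, 40)
t_critical = 1.690                 # t-table, one tail, alpha = .05, 39 degrees of freedom

reject_null = t > t_critical       # right-tailed rejection rule
print(round(t, 3), reject_null)
```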


Hypothesis Testing For a Difference Between Means for Small Samples Using Pooled Standard Deviations (Optional)

Recall that for small samples we need to make the following assumptions:

Random unbiased sample.

Both population distributions are normal.

The two standard deviations are equal.

If we know $\sigma$, then the sampling standard deviation is:

$\sigma \sqrt{1/n_1 + 1/n_2}$

If we do not know $\sigma$ then we use the pooled standard deviation.

$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$

Putting this together with hypothesis testing we can find the t-statistic.

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{1/n_1 + 1/n_2}}$

and use $n_1 + n_2 - 2$ degrees of freedom.

Example Nine dogs and ten cats were tested to determine if there is a difference in the average number of days that the animal can survive without food. The dogs averaged 11 days with a standard deviation of 2 days while the cats averaged 12 days with a standard deviation of 3 days. What can be concluded? (Use $\alpha = .05$.)

Solution We write:

H0: $\mu_{dog} - \mu_{cat} = 0$

H1: $\mu_{dog} - \mu_{cat} \neq 0$

We have:

$n_1 = 9$, $n_2 = 10$, $\bar{x}_1 = 11$, $\bar{x}_2 = 12$, $s_1 = 2$, $s_2 = 3$

so that

$s_p = \sqrt{\frac{(8)(2)^2 + (9)(3)^2}{17}} \approx 2.58$

and

$t = \frac{11 - 12}{2.58 \sqrt{1/9 + 1/10}} \approx -0.84$

The t-critical value corresponding to $\alpha = .05$ with 10 + 9 - 2 = 17 degrees of freedom is 2.11, which is greater than |t| = .84. Hence we fail to reject the null hypothesis and conclude that there is not sufficient evidence to suggest that there is a difference between the mean starvation time for cats and dogs.
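A sketch of the pooled two-sample test for the dogs and cats example follows. The function name is hypothetical and the critical value 2.11 is taken from the t-table.

```python
import math

def pooled_t(x1, s1, n1, x2, s2, n2):
    """Pooled-variance t statistic for H0: mu1 - mu2 = 0."""
    s_p = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (x1 - x2) / (s_p * math.sqrt(1 / n1 + 1 / n2))

t = pooled_t(11, 2, 9, 12, 3, 10)   # dogs first, cats second
t_critical = 2.11                   # t-table, two tails, alpha = .05, 17 degrees of freedom

reject_null = abs(t) > t_critical   # two-tailed rejection rule
print(round(t, 2), reject_null)
```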

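Hypothesis Testing for a Difference Between Proportions

Inferences on the Difference Between Population Proportions

If two samples are counted independently of each other we use the test statistic:

$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{pq/n_1 + pq/n_2}}$

where $\hat{p}_1 = r_1/n_1$ and $\hat{p}_2 = r_2/n_2$ are the sample proportions,

$p = \frac{r_1 + r_2}{n_1 + n_2}$

and

$q = 1 - p$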

Example Is the severity of the drug problem in high school the same for boys and girls? 85 boys and 70 girls were questioned and 34 of the boys and 14 of the girls admitted to having tried some sort of drug. What can be concluded at the .05 level?

Solution The hypotheses are

H0: $p_1 - p_2 = 0$

H1: $p_1 - p_2 \neq 0$

We have

$\hat{p}_1 = 34/85 = 0.4$, $\hat{p}_2 = 14/70 = 0.2$, $p = 48/155 = 0.31$, $q = 0.69$

Now compute the z-score

$z = \frac{0.4 - 0.2}{\sqrt{(0.31)(0.69)/85 + (0.31)(0.69)/70}} \approx 2.68$

Since we are using a significance level of .05 and it is a two tailed test, the critical value is 1.96. Clearly 2.68 is in the critical region, hence we can reject the null hypothesis and accept the alternative hypothesis and conclude that gender does make a difference for drug use. Notice that the P-value

P = 2(1 - .9963) = 0.0074

is less than .05. This is yet another way to see that we reject the null hypothesis.
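The two-proportion test can be sketched as below, using the pooled proportion from both samples. The function and variable names are illustrative, and `statistics.NormalDist` supplies the normal distribution for the P-value.

```python
import math
from statistics import NormalDist

def two_proportion_z(r1, n1, r2, n2):
    """z statistic for H0: p1 - p2 = 0 using the pooled proportion."""
    p_hat_1, p_hat_2 = r1 / n1, r2 / n2
    p = (r1 + r2) / (n1 + n2)      # pooled estimate of the common proportion
    q = 1 - p
    return (p_hat_1 - p_hat_2) / math.sqrt(p * q / n1 + p * q / n2)

z = two_proportion_z(34, 85, 14, 70)             # boys first, girls second
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed

print(round(z, 2), round(p_value, 4))
```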

Paired Differences Paired Data: Hypothesis Tests

Example Is success determined by genetics? The best such survey is one that investigates identical twins who have been reared in two different environments, one that is nurturing and one that is non-nurturing. We could measure the difference in high school GPAs between each pair. This is better than just pooling each group individually. Our hypotheses are

H0: $\mu_d = 0$

H1: $\mu_d > 0$

where $\mu_d$ is the mean of the differences between the matched pairs. We use the test statistic

$t = \frac{\bar{x}_d}{s_d / \sqrt{n}}$

where $\bar{x}_d$ is the sample mean of the differences and $s_d$ is the standard deviation of the differences. We use n - 1 degrees of freedom, where n is the number of pairs.

Paired Differences: Confidence Intervals

To construct a confidence interval for the difference of the means we use:

$\bar{x}_d \pm t_c \, s_d / \sqrt{n}$

Example Suppose that ten identical twins were reared apart and the mean difference between the high school GPA of the twin brought up in wealth and the twin brought up in poverty was 0.07. If the standard deviation of the differences was 0.5, find a 95% confidence interval for the difference. Assume the distribution of GPA's is approximately normal.

Solution With 9 degrees of freedom the t-table gives $t_c = 2.262$. We compute

$0.07 \pm 2.262 (0.5 / \sqrt{10}) = 0.07 \pm 0.36$

or [-0.29, 0.43]. We are 95% confident that the mean difference in GPA is between -0.29 and 0.43. Notice that 0 falls in this interval, hence we would fail to reject the null hypothesis at the 0.05 level.
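The paired interval for the twins example can be reproduced with a few lines. This is a sketch; the critical value 2.262 (95% confidence, 9 degrees of freedom) is copied from a t-table rather than computed.

```python
import math

mean_diff = 0.07     # mean of the paired GPA differences
s_d = 0.5            # standard deviation of the differences
n_pairs = 10
t_c = 2.262          # t-table, 95% confidence, n_pairs - 1 = 9 degrees of freedom

margin = t_c * s_d / math.sqrt(n_pairs)
low, high = mean_diff - margin, mean_diff + margin

contains_zero = low <= 0 <= high   # True means we fail to reject H0 at the .05 level
print(round(low, 2), round(high, 2), contains_zero)
```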

Correlation

Residuals

Suppose that the average lifespan for people who smoke is:

Packs Per Week Life Span

1 72

2 70

3 69

5 68

We can calculate the least squares regression line:

$\hat{y} = 72.34 - 0.94x$

We define the first residual to be the difference between the first lifespan and the first estimated lifespan:

72 - (72.34 - 0.94(1)) = 0.60

the second residual as:

70 - (72.34 - 0.94(2)) = -0.46

the third as:

69 - (72.34 - 0.94(3)) = -0.52

and the fourth as

68 - (72.34 - 0.94(5)) = 0.36

In general, the residual is

$y_i - \hat{y}_i = y_i - (a + b x_i)$

Coefficient of determination: $r^2$

We define the coefficient of determination as an indication of how linear the data is. $r^2$ has the following properties:

Properties of the Coefficient of Determination

$r^2$ is between 0 and 1.

If $r^2 = 1$ then all points lie on a line. (perfectly linear)

If $r^2 = 0$ then the regression line is a useless indicator for predicting y values.

Construction To compute $r^2$, do the following:

Compute the sum of the squares of the residuals: SSResid

Compute $\sum y^2$ and $(\sum y)^2$. We say that

$SSTo = \sum y^2 - (\sum y)^2 / n$

Compute

$1 - SSResid / SSTo$

This is $r^2$. If we multiply $r^2$ by 100%, we arrive at the percent of the observed variation attributable to the linear relationship.
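The construction steps above can be sketched in Python for the smoking data. The code fits the least squares line from the four data points directly and then applies $1 - SSResid/SSTo$; all variable names are illustrative.

```python
xs = [1, 2, 3, 5]        # packs per week
ys = [72, 70, 69, 68]    # life span
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope b and intercept a
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residuals and the two sums of squares
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
ss_resid = sum(e ** 2 for e in residuals)
ss_to = sum(y ** 2 for y in ys) - sum(ys) ** 2 / n

r_squared = 1 - ss_resid / ss_to
print(round(a, 2), round(b, 2), round(r_squared, 3))
```

For a least squares line the residuals sum to zero, which gives a quick check that the fitted coefficients are right.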

Correlation: r

If we want to determine not just whether the variables are linearly related, but also whether there is a positive relationship or a negative relationship (b > 0 or b < 0), and want the calculation unitless, we compute Pearson's correlation coefficient r

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$

The square of the correlation coefficient is equal to the coefficient of determination $r^2$.

If r < 0 then they are negatively correlated.

If r > 0 then they are positively correlated.

We say that the correlation is

strong if |r| > .8

moderate if .5 < |r| < .8 and

weak otherwise.

Correlation does not imply causation. For example there may be a strong correlation between grayness in hair and wrinkles, but having gray hair does not cause one to have wrinkles.
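Pearson's r and the strength labels above can be sketched as follows, using the smoking data from the Residuals section. The function names are hypothetical, and treating |r| exactly equal to .8 as moderate is an assumption, since the text leaves that boundary open.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def strength(r):
    """Label a correlation using the cutoffs .8 and .5 from the text."""
    size = abs(r)
    if size > 0.8:
        return "strong"
    if size > 0.5:
        return "moderate"
    return "weak"

r = pearson_r([1, 2, 3, 5], [72, 70, 69, 68])
print(round(r, 3), strength(r))
```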

Analyzing the Regression Line

Estimating Sigma

The correlation provides us with an estimate of how linear the data is. We would also like to know how close the data are to the regression line. We use a measurement $s_e$ which is a point estimate for the standard deviation of the residuals. If $s_e$ is large then the points lie far from the line and if it is small then the points are close to the line. We have an empirical rule that says that: approximately 95% of the points lie within $2s_e$ of the line.

The point estimate for $\alpha$ is a and the point estimate for $\beta$ is b. Some assumptions that we make on the error e from $y = \alpha + \beta x + e$ are

e has mean value 0.

e has standard deviation $\sigma$ which does not depend on x.

The distribution of e is normal.

The e's for different x's are independent of one another.

A point estimate for $\sigma^2$ is given by

$s_e^2 = \frac{SSResid}{n - 2}$

and the point estimate for $\sigma$ is its square root.


Inferences on the Slope Suppose that the equation of the regression line calculated from the data is y = a + bx.

Can we trust this b? In other words, if the true equation of the regression line is $y = \alpha + \beta x$, is b a good point estimate for $\beta$? We can estimate the standard deviation of b by the formula

$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$

The t statistic is

$t = \frac{b - \beta}{s_b}$

with n - 2 degrees of freedom

We can form a confidence interval for $\beta$ as

$b \pm t_c \, s_b$

To interpret this confidence interval, for example, we can say that we are 95% confident that the true slope of the regression line is between two and three. If the slope of the regression line is 0 then the regression line is useless. Hence it is typical to test the hypothesis

H0: $\beta = 0$

Ha: $\beta \neq 0$

We use the t statistic

$t = \frac{b - 0}{s_b}$

and proceed as usual.

Example Suppose that we have computed the regression line that corresponds to education (years of college) vs. income as

$\hat{y} = 15,000 + 5x$

with 200 data points and $s_b = 2$. Use $\alpha = .05$. Then we have

H0: $\beta = 0$

Ha: $\beta \neq 0$

and

t = 5/2 = 2.5

giving a p-value between .01 and .02. Since p < $\alpha$ we can reject H0 and accept Ha and conclude that the regression line is useful for predicting the income based on college years. We can make a 95% confidence interval for the slope:

$5 \pm 1.96(2)$ or [1.08, 8.92]
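The slope test and interval for the education example reduce to a few arithmetic steps. This sketch uses the z-value 1.96 for the interval, as the text does for 200 data points; the variable names are illustrative.

```python
b = 5          # estimated slope from the regression line
s_b = 2        # estimated standard deviation of the slope
z_c = 1.96     # large-sample critical value for 95% confidence

t = (b - 0) / s_b                        # test statistic for H0: beta = 0
low, high = b - z_c * s_b, b + z_c * s_b

slope_is_useful = not (low <= 0 <= high)  # interval excludes 0
print(t, round(low, 2), round(high, 2), slope_is_useful)
```

An interval that excludes 0 leads to the same decision as rejecting H0 at the .05 level in the two tailed test.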

Testing if There is a Correlation

We have talked about the correlation being weak, moderate, or strong; however, with a small sample this may not be reliable. Next we will create a hypothesis on whether there is a correlation between the two variables. If there is no correlation then the correlation coefficient will be 0. Otherwise it will not be 0. We can also test to see if there is a positive or negative correlation. As you may guess, the difference in the test for a correlation, a positive correlation, or a negative correlation will be whether we use a two tailed test, a right tailed test, or a left tailed test. We will use the Greek letter $\rho$, pronounced "rho", for the population correlation and r for the sample correlation. The test statistic will be given by

$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$

Notice that this is a "t" statistic. We have

degrees of freedom = n - 2

Notice that the larger the sample size (with the same r), the larger the t value. Also, a larger r will produce a larger t value.

Example A study was done to see if there is a positive correlation between the number of times per month that college students call home and the amount of money that their parents contribute towards their education. 175 students were surveyed and the correlation was found to be 0.18. What can be said at the 0.05 level of significance?

Solution First we write down the null and alternative hypotheses:

H0: $\rho = 0$

H1: $\rho > 0$

We compute the test statistic

$t = \frac{0.18 \sqrt{173}}{\sqrt{1 - 0.18^2}} \approx 2.41$

Since the sample size is large, we can use the normal distribution (z-table) to approximate the P-value. Notice also that this is a right tailed test so we need to subtract the table value from 1. We have

P = 1 - .9920 = 0.008

Since P is less than 0.05, we can conclude that there is a positive correlation between the number of times per month that students call home and the amount of money that their parents contribute towards their education.

Remark: We were able to conclude that there is strong statistical evidence of a positive correlation. On the other hand the correlation of 0.18 is a weak correlation. Try not to confuse strong evidence to show a correlation with a strong correlation. Also we cannot conclude that calling parents frequently will induce parents to send more money. We have established correlation, not causation.

Remark: If the correlation is 0, then so is the slope. It turns out that the test statistic for the slope is the same as the test statistic for the correlation. Computers will usually provide the P-value for testing the slope. This is the same as the P-value for testing the correlation.
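The correlation test for the calling-home example can be sketched as below. As in the text, the normal distribution is used to approximate the P-value because the sample is large; the function name is an illustrative choice.

```python
import math
from statistics import NormalDist

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = correlation_t(0.18, 175)

# Right-tailed test; normal approximation to the t distribution for large n
p_value = 1 - NormalDist().cdf(t)

print(round(t, 2), round(p_value, 3))
```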

Inferences on the Regression Line

Estimating the Mean Value of y for a Particular Value of x

Suppose that you own a pizza restaurant and are interested in sending out menus to local residents. You research what your 200 competitors have done to find the relationship between number of mailings and amount of pizzas bought per week. You find that the equation of the regression line is y = 100 + .2x. You calculate $s_e$ to be 4 and $s_{a + bx}$ to be 3. There are two things you are interested in:

Next week you plan an advertising blitz of 1000 mailings. How many pizzas do you expect to sell and what is a 95% confidence interval for this estimate.

After next week you will be consistently sending out 300 mailings. Over the next several years, what do you expect the average number of pizzas sold will be?

Solutions: We will use the main theorem that states that an unbiased estimate for the value of y given a fixed value of x is

a + bx

For a single predicted value, the standard deviation is

$\sqrt{s_e^2 + s_{a + bx}^2} = \sqrt{4^2 + 3^2} = 5$

Hence we predict that we will sell about 100 + .2(1000) = 300 pizzas. We construct a confidence interval as

$300 \pm 1.96(5) = 300 \pm 9.8$

Hence we expect between 290 and 310 pizzas to be sold.

Now we are looking for the average y given a fixed value of x. As before, we use the regression line for an unbiased estimator for the mean. Hence we can expect to sell 100 + .2(300) = 160 pizzas. For a confidence interval we use the standard deviation $s_{a + bx}$ to get the confidence interval:

$160 \pm 1.96(3)$

We can conclude with 95% confidence that we will average between 154 and 166 pizzas over the years.
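The two pizza intervals differ only in which standard deviation they use. The sketch below assumes, as in the worked solution, that a single week's prediction combines $s_e$ and $s_{a + bx}$ as $\sqrt{s_e^2 + s_{a+bx}^2}$, while the long-run mean uses $s_{a + bx}$ alone; the names are illustrative.

```python
import math

def predicted_pizzas(mailings):
    """Regression line from the example: y = 100 + .2x."""
    return 100 + 0.2 * mailings

s_e = 4          # standard deviation of the residuals
s_mean = 3       # s_{a+bx}, standard deviation of the estimated mean
z_c = 1.96

# One specific week with 1000 mailings: wider interval
blitz = predicted_pizzas(1000)
s_single = math.sqrt(s_e ** 2 + s_mean ** 2)
blitz_interval = (blitz - z_c * s_single, blitz + z_c * s_single)

# Long-run average with 300 mailings: narrower interval
steady = predicted_pizzas(300)
steady_interval = (steady - z_c * s_mean, steady + z_c * s_mean)

print([round(v) for v in blitz_interval], [round(v) for v in steady_interval])
```

The interval for a single week is wider because it must cover the scatter of individual weeks around the line as well as the uncertainty in the line itself.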

Inferences on r

Recall that if there is no relationship between x and y, then the correlation $\rho = 0$ and using the regression line is useless. If there is no relationship then we call the variables independent. We can test to see if the correlation is 0 with the following:

H0: $\rho = 0$

Ha: $\rho \neq 0$

We use the Greek letter $\rho$ to indicate the population correlation. The test statistic we use is

$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$

Notice that when $r^2$ is close to 0, the test statistic is smaller. Remark: This test is equivalent to the test for $\beta = 0$.

Example 45 people were questioned to see if the distance they lived from work was correlated to the distance they lived from their parents. It was found that r = -.8. We calculate that

$t = \frac{-0.8 \sqrt{43}}{\sqrt{1 - 0.64}} \approx -8.74$

Since this is off the chart, we can conclude that there is a correlation between distance from work and distance from parents' house.

Goodness of Fit

Before the Gondola was in operation, Heavenly tracked its skiers and boarders and found the following

Type Percent of Skiers

Beginner 30%

Intermediate 40%

Advanced 20%

Expert 10%

With the new gondola in place the ski resort wants to determine if the distribution has changed. They tracked 2000 skiers and boarders and came up with the following

Type Observed Count

Beginner 590

Intermediate 860

Advanced 400

Expert 150

What can be concluded (use $\alpha = .05$)?

Solution We first determine the null and alternative hypotheses

H0: The new population of skiers and boarders follows the same distribution as the old distribution of skiers.

H1: The new population of skiers and boarders does not follow the same distribution as the old distribution of skiers.

Next we compute the expected counts by multiplying the sample size 2000 by the expected percent. For each row we also compute $(O - E)^2 / E$, where O is the observed count and E is the expected count.

Type Observed Count Expected Count $(O - E)^2 / E$

Beginner 590 600 0.167

Intermediate 860 800 4.5

Advanced 400 400 0

Expert 150 200 12.5

Now add up the total of the last column to get

$\chi^2 = 0.167 + 4.5 + 0 + 12.5 = 17.17$

The number of degrees of freedom is

df = n - 1 = 4 - 1 = 3

where n is the number of rows in the table.

We use the table for the Chi Square distribution. The critical value that corresponds to a level of significance of .05 with 3 degrees of freedom is 7.81. Since

17.17 > 7.81

we can reject the null hypothesis and accept the alternative hypothesis and conclude that the distribution of skiers and boarders has changed.
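The goodness of fit statistic for the ski resort data can be computed in a few lines. This is a sketch with illustrative variable names; the critical value 7.81 is typed in from the Chi Square table.

```python
observed = [590, 860, 400, 150]          # beginner, intermediate, advanced, expert
old_proportions = [0.30, 0.40, 0.20, 0.10]

total = sum(observed)
expected = [total * p for p in old_proportions]

# Sum of (O - E)^2 / E over the four categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

degrees_of_freedom = len(observed) - 1
critical_value = 7.81                    # Chi Square table, alpha = .05, 3 degrees of freedom

reject_null = chi_square > critical_value
print(round(chi_square, 2), degrees_of_freedom, reject_null)
```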

Problem 1 Categorize these measurements associated with Lake Tahoe according to level: nominal, ordinal, interval, or ratio.

A. The depth of clarity of the lake.
B. The time at which the first boat passes through from the Keys.
C. The type of fish that is the first to be caught in the morning.

Solution
A. Ratio. It makes sense to say that the depth has decreased by a factor of two.
B. Interval. It makes sense to say that today's time was 15 minutes later than yesterday's, but it does not make sense to say that the time was twice yesterday's.
C. Nominal. Possible outcomes are Cutthroat, Kokanee, Rainbow, etc., which are qualitative.

Problem 2 You are interested in finding out the average number of classes that students take at LTCC. Since you can't find out this information for every student, you have all instructors who teach at 10:00 AM survey all of their classes.

A. Is this a random sample? Explain your reasoning.
B. What type of sampling is this?

Solution
A. No. For example, none of the students will be night students, who are less likely to take a full load.
B. This is cluster sampling since all of one area are surveyed.

Problem 3 A. You did research to determine the number of snow boarders who rode Sierra each Wednesday in January. You came up with the following data.

Week 1 2 3 4

No. of Boarders 800 600 400 900

Construct a time plot and a bar chart for this data.

Solution

Problem 4 You are interested in the distribution of tree size in the Lake Tahoe basin. You take a random sample of 24 trees and measure their diameters (in inches). Below is the data that you collected.

0 3 4 4 6 7 8 8 9 10 10 10 11 12 14 16 17 19 24 24 29 30 34 39

A. Make a frequency table and histogram for this data. Use 4 class intervals.

Solution
We make the class intervals 0-9, 10-19, 20-29, 30-39.

Class       0-9   10-19   20-29   30-39
Frequency   9     9       3       3
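The frequency table above can be checked with a few lines of Python. This is only a sketch: the function name `frequency_table` and the constant `CLASS_WIDTH` are illustrative choices, and the class width of 10 is taken from the intervals used in the solution.

```python
from collections import Counter

# Tree diameters (inches) from Problem 4.
diameters = [0, 3, 4, 4, 6, 7, 8, 8, 9, 10, 10, 10, 11, 12, 14, 16,
             17, 19, 24, 24, 29, 30, 34, 39]

CLASS_WIDTH = 10  # gives the classes 0-9, 10-19, 20-29, 30-39


def frequency_table(data, width):
    """Count how many values fall in each class of the given width."""
    counts = Counter((value // width) * width for value in data)
    return {f"{low}-{low + width - 1}": counts[low] for low in sorted(counts)}


table = frequency_table(diameters, CLASS_WIDTH)
for label, freq in table.items():
    print(f"{label:>6}  {freq}")
```

The same dictionary could be used to draw the histogram, since each class label maps directly to a bar height.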

B. Make a Stem and Leaf Display of this data.

Solution
0 || 0 3 4 4 6 7 8 8 9
1 || 0 0 0 1 2 4 6 7 9
2 || 4 4 9
3 || 0 4 9

C. Describe the distribution of the data using the language of statistics.

Solution
The distribution is skewed to the right.

D. If a 25th tree was found to have a diameter of 48 inches, would the standard deviation increase or decrease? (Answer this without calculating.)

Solution
It would increase. The standard deviation measures the spread of the data, and adding a 48 would increase the spread. (48 is clearly very far from the mean.)

E. In what percentile is the tree that has a diameter of 7 inches?

Solution
There are 24 in the sample and there are 6 at or below the number 7. We compute

(6/24)(100%) = 25th percentile

Problem 5 Four students were asked to rate their statistics course on a scale of one to ten. The results were

3 3 8 10

Find the mean, median, mode, and standard deviation.

Solution
For the mean we compute

(3 + 3 + 8 + 10)/4 = 6

For the median, there are four in the sample. Hence we take the mean of the two middle numbers

(3 + 8)/2 = 5.5

The mode is the most frequent value, which is 3. Finally we compute the standard deviation.

x       x - x̄   (x - x̄)²
3       -3      9
3       -3      9
8       2       4
10      4       16
Total           38
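As a cross-check on Problem 5, the four summary statistics can be computed with Python's standard `statistics` module. This is a sketch rather than part of the original solution; note that `variance` and `stdev` use the sample (n - 1) denominator, which matches the 38/3 used here.

```python
import statistics

# Course ratings from Problem 5.
ratings = [3, 3, 8, 10]

mean = statistics.mean(ratings)          # (3 + 3 + 8 + 10) / 4
median = statistics.median(ratings)      # mean of the two middle values
mode = statistics.mode(ratings)          # most frequent value
variance = statistics.variance(ratings)  # sample variance, divides by n - 1
std_dev = statistics.stdev(ratings)      # square root of the sample variance

print(mean, median, mode, round(variance, 2), round(std_dev, 2))
```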

The variance is 38/3 = 12.67. Take the square root to get the standard deviation of 3.56. In summary, we have

x̄ = 6    Median = 5.5    Mode = 3    s = 3.56

Problem 6 Who visits Lake Tahoe? Last weekend, 1000 people were observed entering the basin. Their age distribution is given below.

Age                  0 - 29   30 - 59   60 - 89
Number of Visitors   500      300       200

Estimate the mean and standard deviation for the age of the visitors.

Solution
First we compute the mean. The midpoints of the intervals are 14.5, 44.5, and 74.5.

x̄ = [500(14.5) + 300(44.5) + 200(74.5)]/1000 = 35.5

We extend the table to find the standard deviation.

Class     Frequency f   Midpoint x   x - x̄   (x - x̄)²   (x - x̄)² f
0 - 29    500           14.5         -21     441        220,500
30 - 59   300           44.5         9       81         24,300
60 - 89   200           74.5         39      1521       304,200
Totals    1000                                          549,000
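The grouped-data estimate in the table above can be reproduced in code. The sketch below assumes, as the solution does, that every visitor in a class sits at the class midpoint; the function name `grouped_mean_and_std` is an illustrative choice.

```python
import math

# (lower bound, upper bound, frequency) for each age class in Problem 6.
age_classes = [(0, 29, 500), (30, 59, 300), (60, 89, 200)]


def grouped_mean_and_std(groups):
    """Estimate the mean and sample standard deviation from class midpoints."""
    n = sum(freq for _, _, freq in groups)
    midpoints = [((low + high) / 2, freq) for low, high, freq in groups]
    mean = sum(x * freq for x, freq in midpoints) / n
    sum_sq = sum((x - mean) ** 2 * freq for x, freq in midpoints)
    return mean, math.sqrt(sum_sq / (n - 1))


mean_age, std_age = grouped_mean_and_std(age_classes)
print(f"mean = {mean_age:.1f}, s = {std_age:.1f}")
```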

The variance is

s² = 549000/999 = 550

For the standard deviation, we take the square root

s = 23

The mean is 35.5 and the standard deviation is 23.

Problem 7 Last year, 2,000,000 Kokanee Salmon hatched in Taylor Creek. Of those, only 20,000 survived to one year. 12,000 of the survivors were female.

A. What is the estimated probability that a hatched egg will live for at least one year?

Solution
20000/2000000 = .01

B. What is the estimated probability that a Kokanee that lives to one year will be female?

Solution
12000/20000 = .6

C. What is the estimated probability that a hatched egg will live for at least one year and will be female?

Solution
12000/2000000 = .006

D. Estimate the probability that a hatched egg will die before it reaches one year old.

Solution
We use the property of complements

P(Not living 1 year) = 1 - P(living 1 year) = 1 - .01 = .99

Problem 8 You roll two fair six sided dice.

A. What is the probability that the sum of the two dice is four?

Solution
The sample space has 36 elements (six possibilities for each die). You can get a four by

1 and 3, 3 and 1, and 2 and 2

hence

P(Sum = 4) = 3/36 = 1/12

B. What is the probability that the sum of the two dice is a nine given that the first die was a six?

Solution
Since we are given the value of the first die, we only need to determine the second die. There are 6 values for the second die. The only way to roll a sum of nine while the first die shows a six is for the second die to be a 3. We have

P(Sum = 9 | First = 6) = 1/6

C. What is the probability that the sum of the two dice is larger than 3?

Solution
The easiest way to do this one is to use complements.

P(Sum > 3) = 1 - P(Sum ≤ 3) = 1 - P(Sum = 3) - P(Sum = 2)

The only ways to roll a sum of 3 are 1 and 2, and 2 and 1, hence

P(Sum = 3) = 2/36

The only way to roll a sum of 2 is 1 and 1, hence

P(Sum = 2) = 1/36

We compute

1 - P(Sum = 3) - P(Sum = 2) = 1 - 2/36 - 1/36 = 33/36 = 11/12

Problem 9 In your collection of nine pens you know that three are out of ink. Since you are in a rush to get to your midterm, you randomly select two of the pens (a blue one and a black one).

A. Draw a tree diagram for the outcomes of this experiment. Show the probabilities of each stage on the appropriate branch.

Solution
Let
G = Good pen with ink
B = Bad pen without ink
The branches of the tree diagram carry the following probabilities.

First pen: G with probability 6/9 = 2/3, B with probability 3/9 = 1/3.
Second pen after G: G with probability 5/8, B with probability 3/8.
Second pen after B: G with probability 6/8 = 3/4, B with probability 2/8 = 1/4.

B. Use the tree diagram to determine the probability that exactly one of the pens is not out of ink.

Solution
We see that there are two leaves corresponding to exactly one of the pens being out of ink. We can first select a good one then a bad one, or we can select a bad one then a good one.

P(First is good and second is bad) = (2/3)(3/8) = 1/4
P(First is bad and second is good) = (1/3)(3/4) = 1/4

We add these probabilities to get

P(exactly one good) = 1/4 + 1/4 = 1/2

Problem 10 Determine the probability of getting a Royal Flush in Poker (5 card stud). (Recall that a Royal Flush is Ace, King, Queen, Jack and Ten all in the same suit. Also recall that there are 52 cards in a deck and a poker hand has five cards.)

Solution
There are four ways to get a Royal Flush (one for each suit). We use combinations to find the total number of 5 card hands. We have

C(52,5) = 2,598,960

The probability is

P(Royal Flush) = 4/2,598,960 = .0000015

Problem 11 You are the manager of the chicken and steak house restaurant which has two items on the menu: the chicken dinner and the steak dinner. 30% of your customers order steak. If you have 15 customers this evening, what is the probability that exactly 5 of them will order steak?

Solution
This is a binomial distribution. We want the probability of 5 successes out of 15 trials. We have

P(5 successes) = C(15,5) (.3)^5 (.7)^10 = .206
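The counting arguments in Problems 10 and 11 can be verified with `math.comb` from the Python standard library. This is a minimal sketch; the helper name `binomial_pmf` is illustrative, not something the course defines.

```python
from math import comb


def binomial_pmf(successes, trials, p):
    """Probability of exactly `successes` successes in `trials` independent trials."""
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)


# Problem 10: four royal flushes among all five-card hands.
total_hands = comb(52, 5)
p_royal_flush = 4 / total_hands

# Problem 11: exactly 5 of 15 customers order steak when 30% do on average.
p_five_steaks = binomial_pmf(5, 15, 0.3)

print(total_hands, p_royal_flush, round(p_five_steaks, 3))
```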

Practice Midterm II Key

Problem 1
The South Lake Tahoe Police Department has charted the number of arrests over the past twelve months. The number of arrests per month follows a Normal distribution with mean 72 and standard deviation 12. Construct a control chart and determine any out of control signals.

Month     1    2    3    4    5    6    7     8    9    10    11   12
Arrests   61   74   73   98   76   80   110   75   81   100   51   40

Solution
We first find

μ + σ = 84    μ + 2σ = 96    μ + 3σ = 108
μ - σ = 60    μ - 2σ = 48    μ - 3σ = 36

Now we sketch the control chart.
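The two signals discussed in this solution (a point more than three standard deviations from the mean, and nine points in a row on the same side of the mean) can also be found programmatically. The sketch below is one possible way to do it; the function names are illustrative and only these two rules are checked.

```python
MEAN, SD = 72, 12
arrests = [61, 74, 73, 98, 76, 80, 110, 75, 81, 100, 51, 40]


def beyond_three_sigma(data, mean, sd):
    """Return the 1-based months whose value is more than 3 SD from the mean."""
    return [month for month, x in enumerate(data, start=1)
            if abs(x - mean) > 3 * sd]


def longest_run_one_side(data, mean):
    """Length of the longest run of consecutive points on one side of the mean."""
    best = run = 0
    previous_side = 0
    for x in data:
        side = (x > mean) - (x < mean)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == previous_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        previous_side = side
        best = max(best, run)
    return best


outliers = beyond_three_sigma(arrests, MEAN, SD)
longest_run = longest_run_one_side(arrests, MEAN)
print(outliers, longest_run)
```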

We can see that there are two out of control signals. The first signal is that the seventh month was three standard deviations above the mean. The second signal is that there are nine data points in a row on the same side of the mean line.

Problem 2
Stellar Jay hatchlings have a body weight that is approximately normally distributed with a mean of 3.4 ounces and a standard deviation of 0.3 ounces.

A. Convert the following x-intervals into z-intervals.

x > 3.0    x < 6.0

Solution
We find the z-score for each of these. For the first one, we get

z = (3.0 - 3.4)/0.3 = -1.33

The z-interval is z > -1.33. For the second one we find the z-score

z = (6.0 - 3.4)/0.3 = 8.67

The z-interval is z < 8.67.

B. Convert the following z-interval into an x-interval.

-1.2 < z < 0.6

Solution
For this one, we go backwards to find the raw score. First for z = -1.2, we have

-1.2 = (x - 3.4)/0.3

Multiply both sides by 0.3 to get

-.36 = x - 3.4

Add 3.4 to both sides to get

x = 3.04

Now we will work on z = 0.6.

0.6 = (x - 3.4)/0.3

Multiply both sides by 0.3 to get

0.18 = x - 3.4

Add 3.4 to both sides to get

x = 3.58

Finally, we put this all together to get

3.04 < x < 3.58

Problem 3
Your tire company's snow and mud tires have an average lifetime of 80,000 miles with a standard deviation of 10,000 miles. Answer the following assuming the distribution is normal.

A. If the current guarantee for the tires is 65,000 miles, about what percentage of the tires will wear out before the guarantee expires?

Solution
We first compute the z-score for this information. We have

z = (65,000 - 80,000)/10,000 = -1.5

We want the probability P(z < -1.5). We go to the table to get a probability of 0.0668. Compare our graph with the table graph and notice that this is the probability that we want. We can say that about 6.7% of the tires will wear out before the guarantee expires.

B. You want to reconsider the guarantee so that about 98% last past the guarantee period. What should you set as the guarantee period on your tires?

Solution
This problem asks us to go backwards. We want to find y such that P(x > y) = 0.98. The picture shows that we need to subtract from 1, that is, we first want to find the z-score that corresponds to a probability of .02. The table gives us

z = -2.05

Next we work backwards to find the raw score. We have

-2.05 = (x - 80,000)/10,000

Multiply both sides by 10,000 to get

-20,500 = x - 80,000

Add 80,000 to both sides to get

x = 59,500

We should offer a 59,500 mile guarantee so that about 98% of the tires will last past the guarantee period.

Problem 4
The Lake Tahoe Visitor's Authority has determined that 65% of the tourists who come to the Lake Tahoe area to go snowboarding are from the Bay Area. The Boarder Motel has all of its 35 rooms booked during this weekend.

A. What is the probability that between 20 and 25 of the rooms host Bay Area visitors?

Solution
We first find the mean and standard deviation. The formula gives

μ = np = (35)(.65) = 22.75

and

σ = √(npq) = √((35)(.65)(.35)) = 2.82

The continuity correction tells us that we want to find

P(19.5 < x < 25.5)

Now we are ready to compute the z-scores. We have

z = (19.5 - 22.75)/2.82 = -1.15

and

z = (25.5 - 22.75)/2.82 = 0.98

The table gives us the corresponding probabilities

.1251 and .8365

We want P(-1.15 < z < 0.98). The picture indicates that we need to subtract these probabilities. We get

.8365 - .1251 = .7114

We can conclude that there is about a 71% chance that between 20 and 25 of the visitors will be from the Bay Area.

B. Why is your estimate valid?

Solution
We compute np and nq. We have

np = 35(.65) = 22.75    nq = 35(.35) = 12.25

Since both of these values are greater than 5, we can conclude that the distribution is approximately normal and our calculations are valid.

Problem 5
Explain what the difference is between a sampling distribution and the distribution of a sample.

Solution
A sampling distribution is the distribution of a statistic (such as the sample mean) over all possible samples of a fixed size taken from a population, while the distribution of a sample is the set of results from the one sample that was actually taken.

Problem 6
It is known that the mean number of houses a Trick-Or-Treater visits is 46 and the standard deviation is 8.

A. Assuming that the distribution is approximately normal, what is the probability that your seven year old neighbor will visit fewer than 42 houses on Halloween?

Solution
We want P(x < 42). We convert this to a z-score to get

z = (42 - 46)/8 = -0.5

The corresponding probability that we get from the table is

.3085

Since our graph matches the table graph, we can conclude that there is about a 31% chance that the child will visit fewer than 42 houses.

B. 25 children were randomly selected and observed. What is the probability that their mean number of visits is between 48 and 55?

Solution
We compute the z-scores using the sampling distribution. We have

z = (48 - 46)/(8/√25) = 1.25

and

z = (55 - 46)/(8/√25) = 5.63

The probability that corresponds with 1.25 is .8944. Notice that 5.63 is off the chart. Thus, the probability that corresponds with 5.63 is 1. Since we want the "in between" area, we take

1 - .8944 = .1056

We can conclude that there is about an 11% chance that the mean number of visits for these 25 children will be between 48 and 55.

Problem 7
Do you favor allowing pilots to carry a gun in the cockpit? 74% of Americans are in favor of allowing pilots to carry a gun in the cockpit.

A. 80 passengers board a plane heading toward New York. What is the probability that more than 75% of them favor allowing the pilot to carry a gun?

Solution
We want P(p > 0.75); however, we must first use the continuity correction. We need

P(p > 0.75 + 0.5/80) = P(p > .756)

We find the z-score

z = (.756 - .74)/√((.74)(.26)/80) = 0.33

The probability that corresponds with 0.33 is .6293. Since we want the right hand side, we must subtract from 1. We get

1 - .6293 = .3707

We can conclude that there is a 37% chance that more than 75% of the passengers will favor the pilot carrying a gun.

B. Is the normal approximation to the proportion p = r/n valid? Explain.

Solution
We need to determine if np and nq are greater than 5. We have

np = (80)(.74) = 59.2    nq = (80)(.26) = 20.8

Since these are both greater than 5, we can conclude that the normal approximation is valid.

Problem 8
The manager of Wasabi restaurant tallied the number of customers that he received over a 50 day period. He found that the mean number per day for this period was 45 with a standard deviation of 8. Construct a 95% confidence interval for the true mean. Write a sentence that explains your findings.

Solution
We find the margin of error first. We have

x̄ = 45    s = 8    z.95 = 1.96    n = 50

We have

E = 1.96(8)/√50 = 2.22

The confidence interval is

45 ± 2.22

or

[42.78, 47.22]

We are 95% confident that the true mean number of customers that visit Wasabi restaurant per day lies between 42.78 and 47.22.

Problem 9
Thirteen brown bears in the Sierra Nevada Mountains were captured and released for a research project. Their mean weight was found to be 320 pounds with a standard deviation of 23 pounds.

A. Determine a 95% confidence interval for the mean weight of brown bears in the Sierra Nevada Mountains.

Solution
Notice that the sample size is small. We use tc instead of zc. There are

13 - 1 = 12 degrees of freedom

The table gives

tc = 2.179

Now we can compute the margin of error

E = 2.179(23)/√13 = 13.9

The confidence interval is

320 ± 13.9

or

[306.1, 333.9]

B. What assumption do you need for your answer in part A to be valid?

Solution
We assumed that the bears were randomly selected and that the weight distribution is approximately normal.

C. Write a sentence that explains your findings.

Solution
We are 95% confident that the mean weight for all Sierra Nevada brown bears is between 306.1 and 333.9 pounds.

Problem 10
A psychologist is doing research on blindly following orders. 200 volunteers were ordered to push a button that would inflict 50 volts of electricity into a laboratory animal. 35 of them refused to push the button. Construct a 90% confidence interval for the true proportion of people who will refuse to zap the animal. Write a sentence that explains your findings.

Solution
We are asked to find a confidence interval for a proportion. We have

p = 35/200 = 0.175    q = 1 - .175 = .825    n = 200    zc = 1.645

We plug these numbers into the formula for the margin of error

E = 1.645 √((.175)(.825)/200) = .044

The confidence interval is

.175 ± .044

or

[.131, .219]

We are 90% confident that between 13% and 22% of all people will refuse to zap the animal when ordered.

Problem 11
Nationally, 2% of the population carry a venereal disease. You are interested in constructing a 95% confidence interval for the proportion of carriers in the Tahoe Basin. How many people will you need to test if you want a margin of error of 1%?

Solution
We need to find n. Since we have information about the proportion, we can use p = .02. We have E = .01. We find

n = p(1 - p)(zc / E)² = (.02)(.98)(1.96 / .01)² = 752.95

We round up, so we need to test 753 people.

Problem 12
A study was done to compare the pass rates of Caucasians and Latinos in their statistics class. 192 Caucasians and 83 Latinos were considered. 135 of the Caucasians and 70 of the Latinos passed the course. Find a 95% confidence interval for p1 - p2 and explain in a complete sentence what it means.

Solution
We have

p1 = 135/192 = .70    p2 = 70/83 = .84    n1 = 192    n2 = 83

We calculate

E = 1.96 √((.70)(.30)/192 + (.84)(.16)/83) = .10

and conclude that a 95% confidence interval is

.14 ± .1

or

[.04, .24]

(Notice that we put the larger one first.) We can conclude with a confidence level of 95% that the pass rates of Caucasians and Latinos differ by as little as 4% and as much as 24%.

Problem 13
Suppose that you were told that the mean number of hours that people watch television per day is 3.4. You believe that this number must be lower and want to challenge this statement. What would you use for the null hypothesis and the alternative hypothesis? Sketch a picture that displays the critical region.

Solution
We have

H0: μ = 3.4    H1: μ < 3.4

Math 201 Practice Midterm 3

Please work out each of the given problems. Credit will be based on the steps that you show towards the final answer. Show your work.

Problem 1
A study was done to determine whether LTCC transfer students had a lower retention rate than the average US student retention rate of 71%. Of the 121 LTCC transfer students tracked, 84 eventually received their bachelor's degree. What can you conclude with a level of significance of 5%? Give the P-value and interpret what it means.

Solution
We form the hypotheses

H0: p = .71    H1: p < .71

Since the level of significance is 5% and this is a left tailed test, we use -1.645 for zc. The observed proportion is

p = 84/121 = .69

Now we compute the z-score. We have

z = (.69 - .71)/√((.71)(.29)/121) = -0.48

This does not lie in the critical region, hence we fail to reject the null hypothesis and conclude that there is insufficient evidence that the retention rate is lower for LTCC transfer students than for the average student. More resources are needed to further investigate the situation. We look at the table for the P-value. The table gives a value of .3156. This is a large P-value, requiring a level of significance of 32% or more in order to reject the null hypothesis.

Problem 2
Thirteen males and fourteen females participated in a study of grip and leg strength. Right leg strength (in Newtons) was recorded for each participant, resulting in the table below. Is there a difference between strength in men and women? Use a 5% level of significance. Give the P-value and interpret what it means.

Gender   n    x̄      s
Male     13   2127   513
Female   14   1643   446

Solution
This is a hypothesis test for the difference between means. We have

H0: μ1 - μ2 = 0    H1: μ1 - μ2 ≠ 0

Since these are small samples, we calculate the standard error

√(513²/13 + 446²/14) = 186

so that

t = (2127 - 1643)/186 = 2.60

With 12 degrees of freedom, the P-value is between 0.02 and 0.05. This means that if it were true that the mean strength was the same for men and women, then there would be between a 2% and 5% chance that if we randomly selected 13 men and 14 women, we would get a difference that was at least as large as we got from this survey. Since the P-value is less than 0.05, we reject the null hypothesis and accept the alternative hypothesis. We conclude that there is a difference between men's and women's leg strength.

Problem 3
Is one ski resort better than another? Data was collected to determine whether the ski resort that was visited had a bearing on how much enjoyment the skier had. The following table shows the data that was collected. What can you conclude at the 5% level?

                  Bored   Had an OK time   Had a Great Time   The Best Experience Ever
Heavenly          7       25               42                 4
Sierra-at-Tahoe   5       20               30                 1
Kirkwood          9       12               30                 15

Solution
We first state the null and alternative hypotheses

H0: Ski resort and enjoyment are independent
H1: Ski resort and enjoyment are dependent

Next we complete the contingency table by filling in the required expected frequencies. Each cell shows the observed count O, the expected count E, and the cell number.

                  Bored               Had an OK time        Had a Great Time      The Best Experience Ever   Row Total
Heavenly          O = 7, E = 8 (#1)   O = 25, E = 22 (#2)   O = 42, E = 40 (#3)   O = 4, E = 8 (#4)          78
Sierra-at-Tahoe   O = 5, E = 6 (#5)   O = 20, E = 16 (#6)   O = 30, E = 29 (#7)   O = 1, E = 6 (#8)          56
Kirkwood          O = 9, E = 7 (#9)   O = 12, E = 19 (#10)  O = 30, E = 34 (#11)  O = 15, E = 7 (#12)        66
Column Total      21                  57                    102                   20                         200

Now we use a table to compute the χ² statistic.

Cell   O    E    (O - E)   (O - E)²   (O - E)²/E
1      7    8    -1        1          0.125
2      25   22   3         9          0.409
3      42   40   2         4          0.1
4      4    8    -4        16         2
5      5    6    -1        1          0.167
6      20   16   4         16         1
7      30   29   1         1          0.034
8      1    6    -5        25         4.167
9      9    7    2         4          0.571
10     12   19   -7        49         2.579
11     30   34   -4        16         0.471
12     15   7    8         64         9.143
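The expected frequencies and the χ² statistic for this contingency table can be reproduced in code. The sketch below follows the worked solution in rounding each expected count to the nearest whole number before computing (O - E)²/E; that rounding is a choice made to match the table, and unrounded expected counts would give a slightly different statistic. The function name is illustrative.

```python
# Observed counts: Bored, OK time, Great time, Best experience ever.
observed = {
    "Heavenly": [7, 25, 42, 4],
    "Sierra-at-Tahoe": [5, 20, 30, 1],
    "Kirkwood": [9, 12, 30, 15],
}


def chi_square_independence(rows, round_expected=True):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    matrix = list(rows.values())
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for row, row_total in zip(matrix, row_totals):
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            if round_expected:
                expected = round(expected)  # matches the rounded E column above
            statistic += (obs - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


chi_sq, deg_freedom = chi_square_independence(observed)
print(round(chi_sq, 3), deg_freedom)
```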

We now add the numbers from the last column to get 20.766. The contingency table is of size 3 x 4, so the number of degrees of freedom is

(3 - 1)(4 - 1) = 6 degrees of freedom

For α = .05, we use the χ² table and get 12.59. Since 20.766 > 12.59, we can conclude that the ski resort and how much enjoyment a skier experiences are not independent. It does matter which ski resort a skier decides to ski at.

Problem 4
You are the owner of an automobile dealership and have done research on the relationship between the cost of the clothes (x) that a potential buyer wears and the price of the car (y) that the person will buy. 45 different respondents participated in the study. The average customer comes in wearing a $120 outfit. You have found that the equation of the regression line is

y = 8000 + 50x

and that

Se = 1000, SSx = 12,000, and n = 45

A. A man walks into your dealership sporting a $200 outfit. What is your prediction for the price of the car that this man will buy?

Solution
We just plug 200 into the regression equation to get

8,000 + 50(200) = 18,000

We can predict that the man will buy an $18,000 car.

B. Find a 95% confidence interval for the price of the car that the man will buy.

Solution
We find E

E = 1.96(1000) √(1 + 1/45 + (200 - 120)²/12,000) = 2445

A 95% confidence interval is given by

18,000 ± 2445

We are 95% confident that the price of the car that this man will buy is between $15,555 and $20,445.

Problem 5
You are the owner of the Tahoe Inn Motel and are interested in how the price per room is related to the number of units that are occupied. Below is the SPSS readout produced from motels throughout the Tahoe area.

A. What is the equation of the regression line? Interpret the slope of the regression line for this study. Interpret the y-intercept.

Solution
The equation is

y = 91.4 - 0.52x

The slope tells us that for every $1 that the price is raised, we expect to lose .52 occupied units. The y-intercept tells us that if we allow people to stay in our rooms for free, then we can expect about 91 of our rooms to be occupied.

B. Use your regression line to provide a point estimate for the number of units occupied when the price per room is $100.

Solution
We plug 100 into the equation

y = 91.4 - .52(100) = 39.4

C. What is the correlation coefficient? Interpret this coefficient.

Solution
The correlation coefficient is r = -0.69. We can say that there is a moderate negative correlation between the price per room and occupancy rate.

D. Construct a possible scatterplot for this data and explain, using a complete sentence or two, your reasoning in constructing the scatterplot the way you did.

Solution
There is a general trend downward, but the data do not perfectly fit the regression line.

Simple linear regression results:
Dependent Variable: Units
Independent Variable: Price
Sample size: 26
Correlation coefficient: -0.69
Estimate of sigma: 20.581867

Parameter   Estimate     Std. Err.     DF   T-Stat      P-Value
Intercept   91.39109     6.9450126     24   13.159241   <0.0001
Slope       -0.5198245   0.111319035   24   -4.669682   <0.0001
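The readout above contains everything needed to reproduce the point estimate from part B and to see where the T-Stat column comes from (estimate divided by standard error). The following sketch uses the unrounded coefficients from the readout; the constant and function names are illustrative.

```python
# Coefficients copied from the regression readout.
INTERCEPT = 91.39109
SLOPE = -0.5198245
SLOPE_STD_ERR = 0.111319035


def predicted_units(price):
    """Point estimate of occupied units at a given room price (dollars)."""
    return INTERCEPT + SLOPE * price


estimate_at_100 = predicted_units(100)
slope_t_stat = SLOPE / SLOPE_STD_ERR  # test statistic for H0: slope = 0

print(round(estimate_at_100, 1), round(slope_t_stat, 2))
```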

Problem 6
Do students do better on exams if they meditate for the hour just before the exam? At a large university the average score on the first exam is 82%. 38 students volunteered to go through an hour of meditation before their first exam. The meditators averaged 84% on the exam with a standard deviation of 5%. What can you conclude at a .05 level? Give the P-value and interpret what it means.

Solution
The appropriate hypotheses are

H0: μ = 82    H1: μ > 82

We have

x̄ = 84    s = 5    n = 38

Since we are at a .05 level, the z-critical value is 1.645. We compute the z-score

z = (84 - 82)/(5/√38) = 2.47

2.47 falls in the critical region. We can reject the null hypothesis and accept the alternative hypothesis. We can conclude that students who meditate before the exam perform better. We find the P-value by looking at -2.47 in the table. We get

P = .0068

This is a very small P-value, meaning that the data would have been significant even at a smaller level of significance (such as .01).

Problem 7
A medical researcher is concerned that a new medication has a side effect of raising the variance of the salt content in the blood. For 25 days the blood salinity of a patient who was not on the medication was tested. She calculated the variance as 0.06 percent. Then the patient began taking the medication and the blood salinity was tested for the next 13 days. The variance over these 13 days was found to be 0.15 percent. Use a level of significance of 0.05 to test the claim that the variance of the blood salinity is greater while on medication.

Solution
We set up the null and alternative hypotheses:

H0: σ1² = σ2²
H1: σ1² < σ2²

The test statistic is given by

F = 0.15/0.06 = 2.50

The numerator degrees of freedom is 13 - 1 = 12. The denominator degrees of freedom is 25 - 1 = 24.

We now go to the table and see that

0.025 < P-Value < 0.05

In particular the P-value is less than the level of significance, so we reject the null hypothesis and conclude that the variance in the salinity is greater with the medication than without the medication.

Problem 8
A researcher is interested in determining whether there is a difference between the mean amount of money spent on textbooks in the fall at the three California public university systems. Ten randomly selected students from UC campuses, 8 randomly selected students from Cal State campuses and 12 randomly selected students from community colleges were surveyed. Below is the StatCrunch readout for this survey.

Analysis of Variance results:
Data stored in separate columns.

Column means

Column      n    Mean        Std. Error
UC          10   234.9       23.159088
Cal State   8    201.75      11.846865
Comm Col    12   183.16667   13.26983

ANOVA table

Source       df   SS         MS          F-Stat      P-value
Treatments   2    14740.9    7370.45     2.5071433   0.1003
Error        27   79374.07   2939.7803
Total        29   94114.97

A. What assumptions have we made about the data to apply a single-factor ANOVA test?

Solution
Since the data comes from independent random samples, we need assume only that each group of data came from a normal distribution, and that all the groups came from distributions with about the same standard deviation.

B. What can be concluded at the 0.05 level of significance?

Solution
Since the P-value of 0.1003 is greater than the level of significance of 0.05, we do not have significant evidence to conclude that there is a difference in the mean amount of money spent on textbooks by students at the three institutions.
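The F-Stat in the ANOVA table is the ratio of the two mean squares, each of which is a sum of squares divided by its degrees of freedom. The sketch below recomputes those entries from the SS and df columns of the readout; the function name is illustrative, and the P-value is not recomputed here because that would need an F distribution table or a statistics library.

```python
def anova_f_statistic(ss_treatments, df_treatments, ss_error, df_error):
    """Return (MS treatments, MS error, F) from sums of squares and their df."""
    ms_treatments = ss_treatments / df_treatments
    ms_error = ss_error / df_error
    return ms_treatments, ms_error, ms_treatments / ms_error


# Values copied from the ANOVA table in Problem 8.
ms_treat, ms_err, f_stat = anova_f_statistic(14740.9, 2, 79374.07, 27)
print(round(ms_treat, 2), round(ms_err, 4), round(f_stat, 4))
```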

Math 201 Practice Final

Please work out each of the given problems. Credit will be based on the steps that you show towards the final answer. Show your work.

Problem 1
Match the following hypotheses and estimates with the appropriate test statistic or confidence interval. Explain your reasoning.

i) A confidence interval for a population mean.
ii) A confidence interval for a population proportion.
iii) A confidence interval for the difference between two independent population means.
iv) A confidence interval for the difference between two population proportions.
v) A confidence interval for paired differences (dependent samples).
vi) A confidence interval for the value of y given a value of x using a regression line.
vii) A hypothesis test for a population mean.
viii) A hypothesis test for a population proportion.
ix) A hypothesis test for the difference between two independent population means.
x) A hypothesis test for the difference between two population proportions.
xi) A hypothesis test for paired differences (dependent samples).
xii) Chi squared test for independence.
xiii) Chi squared test for goodness of fit.

A. Are automobile prices higher in South Lake Tahoe than in Sacramento? Fifty Subaru Legacys from the South Tahoe dealership and fifty from the Sacramento dealership were sold last month and recorded.

Solution
ix. There are two samples, each with continuous data, and they cannot be paired.

B. Does the color of the paper used for a final exam influence performance? 200 students were randomly given the same test on blue, red, and white paper. The number of A's, B's, C's, D's and F's for each color were tabulated.

Solution
xii. There are multiple types and the counts are taken of each.

C. Is honey a better medicine for small wounds than conventional salves? Currently 9% of the wounds that are treated with conventional salves end up infected. 150 wounds in a study group were treated with honey.

Solution
viii. There is only one sample and the data is Boolean.

D. How much of the food that you buy ends up being thrown out? A refrigerator was monitored that had 45 perishable items.

Solution
ii. The data is Boolean (either spoils or does not spoil) and there is only one sample taken.

E. How long can you expect to live if your cholesterol level is 230? Data has been taken from 45,000 people with varying levels of cholesterol.

Solution
vi. We are given a value for x (230) and are interested in y.

F. How much better has the NASDAQ done than the Dow Jones Industrial Average this year? The daily point gains and losses have been charted since January 2.

Solution
v. We have two sets of data that can be paired by date.

G. What are the low and high estimates for the number of Kokanee salmon that will run in Trout Creek this fall? Data has been collected over the last forty years.

Solution
i. There is one sample of a continuous random variable.

Problem 2
Your business is being investigated about unfair promotion practices with regard to race. Your policy is to promote 20% of your employees. Your current staff consists of 200 Caucasians, 50 Hispanics, 30 African Americans, and 20 classified as other. Below is a table that shows the number of employees that were promoted last year.

Caucasian   Hispanic   African American   Other
50          6          3                  1

What can be concluded at the 5% level?

Solution
We perform a Chi square goodness of fit test. Our hypotheses are

H0: The population fits the given distribution
H1: The population has a different distribution

We create the table

Item               O    E    (O - E)²   (O - E)²/E
Caucasian          50   40   100        2.5
Hispanic           6    10   16         1.6
African American   3    6    9          1.5
Other              1    4    9          2.25
Total              60   60              7.85
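The goodness of fit table can be rebuilt from the staff counts and the 20% promotion policy stated in the problem. This is a sketch only; the function name `goodness_of_fit` is illustrative, and the critical value 7.81 is the table value for 3 degrees of freedom at the 5% level quoted elsewhere in this solution.

```python
# Staff sizes and promotions by group, from Problem 2.
staff = {"Caucasian": 200, "Hispanic": 50, "African American": 30, "Other": 20}
promoted = {"Caucasian": 50, "Hispanic": 6, "African American": 3, "Other": 1}
PROMOTION_RATE = 0.20
CRITICAL_VALUE = 7.81  # chi-square table value, df = 3, alpha = .05


def goodness_of_fit(observed, group_sizes, rate):
    """Chi-square statistic comparing observed promotions with rate * group size."""
    statistic = 0.0
    for group, obs in observed.items():
        expected = rate * group_sizes[group]
        statistic += (obs - expected) ** 2 / expected
    return statistic


chi_sq_fit = goodness_of_fit(promoted, staff, PROMOTION_RATE)
reject_null = chi_sq_fit > CRITICAL_VALUE
print(round(chi_sq_fit, 2), reject_null)
```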

There are 4 - 1 = 3 degrees of freedom. We go to the table and find that the Chi square critical value is 7.81. Since

7.85 > 7.81

we reject the null hypothesis and conclude that there is sufficient evidence that the true promotion practices differ from what is claimed.

Problem 3
A certain model of car comes in a two-door version, a four-door version, and a hatchback version. Each version can be equipped with either an automatic transmission or a manual transmission. The accompanying table gives the relevant proportions.

    TD     FD     HB
A   0.32   0.27   0.18
M   0.08   0.04   0.11

A customer who has purchased one of these cars is randomly selected.

A. What is the probability that the customer purchased a car with an automatic transmission? A four-door car?

Solution
P(A) = 0.32 + 0.27 + 0.18 = 0.77
P(FD) = 0.27 + 0.04 = 0.31

There is a 77% chance that the customer purchased a car with automatic transmission and a 31% chance that the customer purchased a car with four doors.

B. Given that the customer purchased a four-door car, what is the probability that it has an automatic transmission?

Solution
We compute

P(A|FD) = P(A and FD)/P(FD) = 0.27/0.31 = 0.87

If we know that the customer purchased a four-door car, then there is an 87% chance that this car had automatic transmission.

C. Given that the customer did not purchase a hatchback, what is the probability that the car has a manual transmission?

Solution
We compute

P(M|not HB) = P(M and not HB)/P(not HB) = (0.08 + 0.04)/(0.08 + 0.04 + 0.32 + 0.27) = 0.17

If we know that the customer did not purchase a hatchback, then there is a 17% chance that this car had manual transmission.

D. If 8 cars were sold, what is the probability that exactly 6 of them were two-doors with automatic transmission?

Solution
We solve this using the binomial distribution formula. We get

C(8,6) (0.32)^6 (0.68)^2 = 0.0139


There is a 1.4% chance that exactly six of the cars were two-door cars with automatic transmission.

Problem 4

You want to construct a confidence interval for the percent of registered voters who are planning on voting for George Bush for president for his second term. You want to have a margin of error of 0.03.

A. How many registered voters should you survey (use α = 0.05)?

Solution

Since we do not have a preliminary estimate, we use the formula

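n = (1/4)(z_c / E)^2

where E is the margin of error and z_c = 1.96 for α = 0.05. We get

n = (1/4)(1.96 / 0.03)^2 = 1067.11

which we round up to the next whole number.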

We should survey 1068 registered voters.

B. Suppose that you conducted this survey (as in part A) and found that 52% of the respondents intended to vote for George Bush. Construct the appropriate 90% confidence interval. Interpret this interval. How would the Bush campaign react to this confidence interval?

Solution

We have

z_c = 1.645    n = 1,068    p = 0.52    q = 0.48

The 90% confidence interval is

p ± z_c sqrt(pq / n) = 0.52 ± 1.645 sqrt((0.52)(0.48) / 1068) = 0.52 ± 0.025

that is, from about 0.495 to 0.545.
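As a check on Problem 4, the sample size and the 90% interval can be computed with a short sketch. The helper `z_critical` is an illustrative assumption. It uses the standard library's exact normal quantiles (about 1.95996 and 1.64485) rather than the rounded table values 1.96 and 1.645, which does not change the rounded results here.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value of the standard normal for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Part A: with no preliminary estimate, p*q is replaced by its largest value, 1/4.
margin = 0.03
n = ceil(0.25 * (z_critical(0.95) / margin) ** 2)

# Part B: 90% confidence interval around the sample proportion 0.52.
p_hat = 0.52
half_width = z_critical(0.90) * sqrt(p_hat * (1 - p_hat) / n)
interval = (p_hat - half_width, p_hat + half_width)

print(n)
print([round(bound, 3) for bound in interval])
```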

We can be 90% confident that between 49% and 55% of the voters intend to vote for Bush for a second term. Since this interval contains numbers less than 50%, he should attempt to woo more voters.

Problem 5

For each of the following, answer true or false. If true, explain why. If false, explain why or provide a counterexample.

A. To provide a confidence interval for a population proportion, if the sample size is 18, it is appropriate to use a t-statistic since the z-statistic is used only for large samples.

Solution

False, the t-statistic cannot be used for Boolean data.

B. For a large sample, of the mean, the median, and the standard deviation, only two of these will be highly affected by the addition of an extreme outlier.

Solution

True, the median is not highly affected by extreme outliers.

C. No matter how the population data is distributed, the distribution of all possible samples of size 500 will be approximately normal with approximately the same mean and standard deviation as the population mean and standard deviation.

Solution

False, the standard deviation of the sampling distribution is equal to the standard deviation of the original distribution divided by the square root of 500.

Problem 6

Data was collected to study the effect of alcohol on reaction time. Forty participants were given various amounts of alcohol and then took a test to see how many milliseconds it took to press a button upon seeing headlights. The scatter diagram is shown below.

A. Give an approximate equation of the regression line. Interpret the slope and the y-intercept.

Solution

First we eyeball the line. Then we find two points on the line.


The y-intercept is about 25 and the slope is about

m = (65 - 25) / (50 - 0) = 0.8

The equation is y = 25 + 0.8x.

The y-intercept tells us that without drinking any alcohol, the reaction time is about 25 milliseconds. The slope tells us that for every ounce of alcohol a person drinks, reaction time goes up by about 0.8 milliseconds.

B. Give an approximation of the correlation coefficient. Explain using a complete sentence why you chose this number.

Solution

The correlation is probably around 0.8 since the data generally follow a linear model, but not perfectly. Since the slope is positive, so is the correlation coefficient.

Problem 7

Twenty-five students took the first midterm exam. The number of minutes that each student took is shown below.


35, 45, 48, 50, 50, 52, 60, 61, 64, 70, 72, 75, 78, 78, 81, 83, 84, 87, 88, 88, 89, 90, 90, 90, 90

A. Construct a stem and leaf diagram for this data.

Solution

We make a stem and leaf diagram with the stems representing the tens digit and the leaves representing the ones digit.

3 || 5
4 || 5 8
5 || 0 0 2
6 || 0 1 4
7 || 0 2 5 8 8
8 || 1 3 4 7 8 8 9
9 || 0 0 0 0

B. Construct a histogram for this data using 5 classes.

Solution

We find the class width by taking the range, dividing by 5, and raising the result to the next whole number.

(90 - 35) / 5 + 1 = 12

Next we make a frequency distribution table.

Class Interval   Frequency
35 - 46          2
47 - 58          4
59 - 70          4
71 - 82          5
83 - 94          10
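The class width and the frequency counts can be reproduced with a short sketch. The variable names are illustrative. The width rule follows the one stated in the solution: the range divided by the number of classes, raised to the next whole number.

```python
# Exam times in minutes, copied from the problem statement.
times = [35, 45, 48, 50, 50, 52, 60, 61, 64, 70, 72, 75, 78, 78, 81,
         83, 84, 87, 88, 88, 89, 90, 90, 90, 90]

num_classes = 5

# Class width: range divided by the number of classes, raised to the next whole number.
width = (max(times) - min(times)) // num_classes + 1

# Build consecutive, non-overlapping class intervals and count the times in each.
frequency = {}
for i in range(num_classes):
    low = min(times) + i * width
    high = low + width - 1
    frequency[(low, high)] = sum(low <= t <= high for t in times)

for (low, high), count in frequency.items():
    print(f"{low} - {high}: {count}")
```

Because each interval starts one above the previous upper limit, no value can fall into two classes, and the five counts add up to the 25 students.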

The histogram is shown below

C. You took one hour to complete the exam. What is your percentile?

Solution

We are looking for the percentile that corresponds to 60 minutes. There are 6 times below 60. We calculate

6/25 x 100% = 24%

You are in the 24th percentile.

D. What is the mode?

Solution

The mode is the value that occurs the most, namely 90.

E. If the student who took 35 minutes for the exam is disregarded, would the standard deviation decrease, increase, or stay the same? Explain.

Solution

The standard deviation would decrease, since 35 is the value farthest from the mean and removing it leaves the remaining data closer to the mean.

Problem 8

A study was done to see the relationship between the type of town and the quality of service. An index was used with 100 meaning average service and larger numbers meaning above-average service. The table below displays the results of the survey.

           Police   Fire Department   Libraries   Schools
Urban      110      115               140         82
Rural      132      128               130         84
Suburban   95       102               118         92

List the factors and the number of levels of each factor.

Solution

There are 2 factors: type of town, with three levels, and type of public service, with four levels.

Assume that there is no interaction between the factors. Determine if there is a difference in population mean index based on town type. Use a level of significance of 0.05. Make sure you state the null and alternative hypotheses and state your conclusions using a complete sentence in the context of the question.

Solution

H0: There is no difference in mean public service rating based on town type.
H1: At least two town types have different mean public service ratings.

Since the P-value is 0.18, which is greater than 0.05, we fail to reject the null hypothesis. We do not have sufficient evidence to conclude that the mean public service rating differs by town type. That is, it is plausible that all town types have the same mean public service rating.

Determine if there is a difference in population mean index based on public service. Use a level of significance of 0.05. Make sure you state the null and alternative hypotheses and state your conclusions using a complete sentence in the context of the question.

Solution

H0: There is no difference in mean rating based on the type of public service.
H1: At least two public services have different mean ratings.

Since the P-value is 0.02, which is less than 0.05, we reject the null hypothesis. We have sufficient evidence that at least two public services have different mean ratings.
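The two F statistics behind these P-values can be sketched as follows, assuming a two-factor design with one observation per cell and no interaction, as the problem states. The variable names are illustrative. The sketch stops at the F statistics, since converting them to P-values requires an F table or a statistics package that is not shown here.

```python
# Service index by town type (rows) and public service (columns), copied from the table.
services = ["Police", "Fire Department", "Libraries", "Schools"]
ratings = {
    "Urban":    [110, 115, 140, 82],
    "Rural":    [132, 128, 130, 84],
    "Suburban": [95, 102, 118, 92],
}

rows = list(ratings.values())
r, c = len(rows), len(services)
grand_mean = sum(map(sum, rows)) / (r * c)
row_means = [sum(row) / c for row in rows]
col_means = [sum(col) / r for col in zip(*rows)]

# Sums of squares for a two-factor design with one observation per cell.
ss_town = c * sum((m - grand_mean) ** 2 for m in row_means)
ss_service = r * sum((m - grand_mean) ** 2 for m in col_means)
ss_total = sum((x - grand_mean) ** 2 for row in rows for x in row)
ss_error = ss_total - ss_town - ss_service

# Degrees of freedom and the error mean square.
df_town, df_service = r - 1, c - 1
df_error = df_town * df_service
ms_error = ss_error / df_error

# F statistic for each factor: factor mean square divided by error mean square.
f_town = (ss_town / df_town) / ms_error
f_service = (ss_service / df_service) / ms_error

print(f"F(town type) = {f_town:.2f} on ({df_town}, {df_error}) degrees of freedom")
print(f"F(public service) = {f_service:.2f} on ({df_service}, {df_error}) degrees of freedom")
```

The much larger F statistic for public service than for town type is consistent with the solution, which finds a significant difference only among the public services.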
