
NOTICE Please note that the purpose of these slides is not a substitute for the reading and interacting of your material. It’s intended purpose is to.

Dec 21, 2015

Transcript

NOTICE

Please note that these slides are not a substitute for reading and interacting with your material. Their intended purpose is to be a quick reference for studying concepts, as well as to present the material from a different angle that might help you better understand the statistics.


For the Student:

How do insurance companies determine premium rates for different age groups and sexes? What information does the government use to decide who gets taxed how much? How are you going to determine which vehicle is the safest to drive? The answer to each of these questions relies heavily on statistics. Unfortunately, a great many of the statistics in circulation have been badly manipulated and will give you unreliable results.

Statistics is everywhere and affects virtually everyone. The question to ask yourself now is: are you going to become another “victim of statistics”? Whether your future career is in education, politics, or firefighting, making decisions based on statistics will be inevitable, and there will be consequences. Statistics is not hard; it only takes a little time and patience to gain a true understanding of what your information really means.


To the Student

Here is a little hint to keep yourself from being overwhelmed while first learning statistics: it is not vital to memorize all of the equations. Although memorizing them can help, it is better to understand what the equations mean, why they are used, and when to use them. Some of the equations, especially any equation that contains a summation symbol (Σ), require adding up a long series of numbers. Practically speaking, you should use a calculator or computer to evaluate such equations. Focus on understanding what the computer is doing, so that the number that pops out is more than a number to you, because that number means something. Don’t memorize, just recognize.


Graphs and Summaries

A. One Categorical Variable
  1. Graphs
  2. Summaries

B. One Numerical Variable
  1. Graphs
    a. Stemplot
    b. Histogram
    c. Boxplot
    d. Normal Quantile plot (Q-plot)
    e. Shapes
      i. Symmetric
        1) Normal
        2) Uniform
      ii. Skewed
        1) Left
        2) Right
  2. Summaries
    a. Locations
      i. Mean
      ii. Median
      iii. Mode
      iv. Min and Max
      v. Quartiles
      vi. Comparisons of Mean and Median
      vii. Z-scores and the Empirical Rule
    b. Spreads
      i. Variance
      ii. Standard Deviation
      iii. Range
      iv. Interquartile Range (IQR)
  3. Transformations
    a. Shift changes
      i. Centers
      ii. Spreads
    b. Scale changes
      i. Centers
      ii. Spreads


Beginning Definitions

Variable- the overall characteristic of interest that we want to understand.

e.g. Percent of people in America who use deodorant

e.g. Average debt of a college graduate from Texas A&M

Individual- a single member of the group being studied, for whom the variable takes one value.

e.g. Bob, an American who does not use deodorant

e.g. Jill, a college graduate of Texas A&M with $10,000 in debt


Variable Types

• Categorical (Qualitative)

-Nominal, e.g. Colors {red, blue, green}

-Ordinal, e.g. Strength {weak, moderate, strong}

• Numerical (Quantitative)

-Discrete, e.g. Number of children in a family

-Continuous, e.g. Amount of water the average house uses

{Depending on the context, certain discrete variables can be treated as continuous for practical purposes, and continuous data can be made discrete.}


Distribution

-Shape, e.g. unimodal, bimodal, multimodal, symmetric, skewed right, skewed left

-Center, e.g. Mean, Median

-Spread, e.g. Range, Standard Deviation, Variance, Interquartile Range


Categorical Graphs (Nominal or Ordinal)

• Pie Charts

• Bar Graphs


Pie Charts (Counts and Percents)

[Figure: two pie charts of Country of Origin (American, European, Japanese).]

Pie Graph with Percents: American 62.47%, European 18.02%, Japanese 19.51%

Pie Graph with Counts: American n=253, European n=73, Japanese n=79


Bar Graphs

[Figure: two bar graphs of Country of Origin (American, European, Japanese).]

Bar Graph with Counts (y-axis: Count, 50 to 250): American n=253, European n=73, Japanese n=79

Bar Graph with Percents (y-axis: Percent, 0% to 60%): American 62%, European 18%, Japanese 20%
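The percents on the pie and bar graphs can be recovered from the counts. The sketch below is only an illustration of that arithmetic; the variable names are my own, and the counts are the ones printed on the graphs.

```python
# Counts read off the "Country of Origin" graphs on the slides.
counts = {"American": 253, "European": 73, "Japanese": 79}

total = sum(counts.values())  # total number of observations

# Percent for each category = count / total * 100.
percents = {origin: 100 * n / total for origin, n in counts.items()}

for origin, pct in percents.items():
    print(f"{origin}: n={counts[origin]}, {pct:.2f}%")
```

Rounded to two decimals these match the pie graph labels, and rounded to whole numbers they match the bar graph labels.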


Numerical Graphs (Univariate)

• Stemplots

• Histograms

• Boxplots

• Normal Quantile Plots (Q-plots)


Stemplots

Stemplot of Scores

3 | 178
4 | 567
5 | 09
6 | 3789
7 | 013355789
8 | 00124588
9 | 7

Back to back stemplot

 boys | stem | girls
   18 |  3   | 7
   67 |  4   | 5
    0 |  5   | 9
    7 |  6   | 389
13379 |  7   | 0558
 1488 |  8   | 0025
      |  9   | 7
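As a rough sketch of how a stemplot is assembled, the code below splits each score into a tens digit (the stem) and a ones digit (the leaf). The score list is read back from the "Stemplot of Scores" display; the function name `stemplot` is my own and not part of the slides.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines using the tens digit as stem and the ones digit as leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    lines = []
    # Walk every stem from smallest to largest so empty stems still get a row.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | {''.join(leaves[stem])}")
    return lines

# Scores read back from the stemplot on the slide.
scores = [31, 37, 38, 45, 46, 47, 50, 59, 63, 67, 68, 69,
          70, 71, 73, 73, 75, 75, 77, 78, 79,
          80, 80, 81, 82, 84, 85, 88, 88, 97]

for line in stemplot(scores):
    print(line)
```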


Histogram

[Figure: two histograms titled "Total Calories per bar of Common Candy" with Calories (100 to 400) on the x-axis. The first shows Frequency (0 to 14) on the y-axis; the second shows Density (0.000 to 0.010).]

Note that these are analogous to counts and percents with bar charts.
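To make the frequency/density analogy concrete, here is a small sketch. The calorie values are hypothetical (the slide does not list its data), and the 50-calorie bin width is an assumption. Frequency is the count in each bin, while density divides that count by n times the bin width so that the bar areas add up to 1.

```python
# Hypothetical calorie values, made up for illustration only.
calories = [110, 140, 160, 180, 200, 210, 220, 230, 240, 250, 260, 280, 310, 370]

bin_width = 50                                  # assumed bin width
bin_starts = list(range(100, 400, bin_width))   # bins [100,150), ..., [350,400)
n = len(calories)

# Frequency: how many values fall in each bin.
frequencies = [sum(start <= c < start + bin_width for c in calories)
               for start in bin_starts]

# Density: frequency rescaled so that the total bar area equals 1.
densities = [f / (n * bin_width) for f in frequencies]

for start, f, d in zip(bin_starts, frequencies, densities):
    print(f"[{start}, {start + bin_width}): frequency={f}, density={d:.4f}")
```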


Boxplots

Boxplots are drawn from the 5-number summary (minimum, 1st quartile, median, 3rd quartile, maximum), which defines the box and whiskers unless outliers are present. If there are outliers at the low end, the lower whisker stops at the smallest value that is not an outlier; likewise, if there are outliers at the high end, the upper whisker stops at the largest value that is not an outlier.

Outlier? A number is considered an outlier if it lies more than 1.5 times the IQR (Interquartile Range) below the 1st quartile or above the 3rd quartile.

[Figure: boxplot titled "Calories in Common Candies", axis from 100 to 350.]
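The outlier rule can be sketched in a few lines. Because the candy data are not listed, this example borrows the ten student heights used on the later summary slides. The helper `quartiles` is my own and follows the hand method described on the quartiles slide (split the ordered data at the median).

```python
from statistics import median

# Ordered heights from the summary slides (not the candy data).
heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]

def quartiles(data):
    """1st and 3rd quartiles as medians of the lower and upper halves."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half + len(s) % 2]   # the median is included when n is odd
    upper = s[half:]
    return median(lower), median(upper)

q1, q3 = quartiles(heights)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr      # anything below this is an outlier
high_fence = q3 + 1.5 * iqr     # anything above this is an outlier

outliers = [x for x in heights if x < low_fence or x > high_fence]
inside = [x for x in heights if low_fence <= x <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("fences:", low_fence, high_fence)
print("outliers:", outliers)
print("whiskers end at:", whisker_low, whisker_high)
```

With these heights the value 62 falls below the lower fence, so the lower whisker would stop at 65 and 62 would be drawn as a separate point.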


Normal Plots (aka Q-plots)

Q-plots are used to judge how reasonable it is to assume that a sample comes from a normal distribution. If it does, the plotted points should fall roughly along a straight diagonal line or, when the Q-plot includes a Q-line, follow that line “closely”. Unfortunately, there is no clear rule for declaring a set of data normal or not. It takes practice examining patterns in Q-plots to recognize “close calls”, but if the data are strongly skewed, the departure from the line is very easy to see.

[Figure: normal Q-plot titled "Calories in Common Candies"; x-axis: Theoretical Quantiles (-2 to 2), y-axis: Sample Quantiles (100 to 350).]
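For readers curious what software plots in a Q-plot, here is a minimal sketch. It pairs each ordered sample value with the standard normal quantile for its rank. The plotting position (i - 0.5) / n is one common convention and is an assumption here, since the slides do not say which one their software used; the heights are borrowed from the later summary slides.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Pair each ordered sample value with a standard normal quantile for its rank."""
    s = sorted(sample)
    n = len(s)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (i - 0.5) / n is an assumed plotting position; other conventions exist.
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(s, start=1)]

heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]
points = normal_qq_points(heights)

for theoretical, observed in points:
    print(f"theoretical quantile {theoretical:6.2f}  sample quantile {observed}")
```

Plotting the first coordinate against the second gives the Q-plot; roughly straight points suggest that a normal model is reasonable.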


Shapes-Symmetric-Normal

The blue histograms are samples from a population of test grades with a mean of 65 and a standard deviation of 10. Notice that the histogram of the larger sample looks more like the density curve of a normal distribution (the red line).

[Figure: two density histograms of test grade (x-axis 30 to 100, y-axis Density 0.00 to 0.04), titled "100 Samples ~ Normal(65,10)" and "1000 Samples ~ Normal(65,10)".]
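A simulation along the lines of this slide might look like the sketch below. It draws 100 and 1000 values from Normal(65, 10) and reports how close each sample's summaries come to the population values. The seed and the choice to report the share of grades between 55 and 75 are my own; the exact numbers depend on the random draws.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)  # arbitrary fixed seed so the run is repeatable
population = NormalDist(mu=65, sigma=10)
expected_share = population.cdf(75) - population.cdf(55)  # about 68%

results = {}
for n in (100, 1000):
    grades = [random.gauss(65, 10) for _ in range(n)]
    share = sum(55 <= g <= 75 for g in grades) / n
    results[n] = (mean(grades), stdev(grades), share)
    print(f"n={n}: mean={results[n][0]:.2f}, sd={results[n][1]:.2f}, "
          f"share within 55-75={share:.1%} (normal curve: {expected_share:.1%})")
```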


Shapes-Symmetric-Normal

[Figure: boxplots titled "100 Samples ~N(65,10)" and "1000 Samples ~N(65,10)", and normal plots titled "QQplot of 100 Samples" and "QQplot of 1000 Samples" (x-axis: Theoretical Quantiles, y-axis: Sample Quantiles).]


Shapes-Symmetric-Uniform

[Figure: for 100 and for 1000 samples from Uniform(40,95): a density histogram of test grade (x-axis 30 to 100, y-axis Density 0.00 to 0.04), a boxplot, and a QQplot (x-axis: Theoretical Quantiles, y-axis: Sample Quantiles).]


Shapes-Skewed- Right and Left

The other major pattern to recognize is skew. Think of a skewer on a barbecue grill with everything pushed to one end of the stick. The pattern in graphs is similar: if the majority of the data lies on the left, with a long tail stretching to the right, then the graph is right skewed, and vice versa.

[Figure: two density curves labeled "Left Skewed" and "Right Skewed".]


Shapes- Skewed Left

[Figure: for 100 and for 1000 samples of "Average Speed of Stock Cars": a density histogram (y-axis Density, 0.000 to 0.015), a boxplot, and a QQplot (x-axis: Theoretical Quantiles, y-axis: Sample Quantiles). The histogram panels are titled "Skewed right" in the original slide, while the boxplot and QQplot panels, like the slide itself, are titled "Skewed left".]


Shapes-Skewed Right

[Figure: for 100 and for 1000 samples of "Costs of Meals at Restaurants", all panels titled "Skewed right": a density histogram (y-axis Density, 0.000 to 0.015), a boxplot, and a QQplot (x-axis: Theoretical Quantiles, y-axis: Sample Quantiles).]


Summaries Locations - Mean

Heights of students 71 70 68 69 68 65 72 69 71 62

\[ \bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + \cdots + x_n}{n} = \frac{71 + 70 + 68 + 69 + 68 + 65 + 72 + 69 + 71 + 62}{10} = 68.5 \]
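The same calculation in code, as a quick check of the arithmetic (the variable names are my own):

```python
# Heights of students from the slide.
heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]

# Mean = sum of the observations divided by how many there are.
mean_height = sum(heights) / len(heights)

print(mean_height)  # 68.5
```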


Summaries Location-Median

Heights of students 71 70 68 69 68 65 72 69 71 62

Ordered heights 62 65 68 68 69 69 70 71 71 72

Median = \( \tilde{x} = 69 \)

If the number of observations is even, the median is the average of the middle two numbers. If the number of observations is odd, the median is the middle number of the ordered data.

Heights of male students 65 68 70 71 72

Median = \( \tilde{x} = 70 \)
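The even/odd rule can be written out directly. This is a sketch with a hand-written `median` function (Python's `statistics.median` does the same job); it is applied to both data sets from the slide.

```python
def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: the middle number
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the middle two

heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]
male_heights = [65, 68, 70, 71, 72]

print(median(heights))       # 69.0
print(median(male_heights))  # 70
```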


Summaries Location-Mode, Min, Max

Ordered heights 62 65 68 68 69 69 70 71 71 72

Mode = 68, 69 & 71

The mode is the most common number. If several numbers tie for most common, then there is more than one mode. Here 68, 69, and 71 each appear twice.

Min = 62

Max = 72
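A quick way to check for tied modes is `statistics.multimode`, which returns every value that shares the highest count. The sketch below applies it to the ordered heights.

```python
from statistics import multimode

ordered_heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]

modes = multimode(ordered_heights)   # all values tied for most frequent
smallest = min(ordered_heights)
largest = max(ordered_heights)

print(modes)              # [68, 69, 71]
print(smallest, largest)  # 62 72
```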


Summaries Location- Quartiles

Ordered heights 62 65 68 68 69 69 70 71 71 72

1st Quartile = 68

3rd Quartile = 71

To find the 1st and 3rd Quartiles, consider the data to the left and to the right of the median separately. The median is the 2nd Quartile. The 1st Quartile is the middle number (or the average of the two middle numbers if that half has an even count) between the minimum and the median. The 3rd Quartile is calculated the same way, using the half between the median and the maximum.

Technical note: Include the median in each half when finding the 1st or 3rd Quartile if the number of observations is odd.
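The hand method above, including the technical note, can be sketched as follows. The function name `quartiles` is my own. Statistical software often uses other quartile conventions that interpolate between values, so its answers may differ slightly from this method.

```python
from statistics import median

def quartiles(data):
    """1st and 3rd quartiles as medians of the lower and upper halves of the data."""
    s = sorted(data)
    half = len(s) // 2
    if len(s) % 2 == 0:
        lower, upper = s[:half], s[half:]
    else:
        # Odd count: the median itself is included in both halves.
        lower, upper = s[:half + 1], s[half:]
    return median(lower), median(upper)

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
male_heights = [65, 68, 70, 71, 72]

print(quartiles(heights))       # (68, 71)
print(quartiles(male_heights))  # (68, 71)
```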


Comparing Means and Medians

Notice the blue and red lines on the distribution graphs. The blue line represents the mean and the red line represents the median. The graphs demonstrate that when data become skewed, the mean is affected more than the median: it is pulled toward the long tail. The bottom graph shows that the mean and median are about the same for a normal distribution.

[Figure: three density curves with the mean and median marked: "Left Skewed", "Right Skewed", and "Normal Distribution" (median and mean the same).]
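A tiny numerical illustration of the same point, using hypothetical meal costs that I made up for this sketch: one very expensive meal creates a right skew, pulling the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical meal costs in dollars, made up for illustration.
meal_costs = [8, 9, 10, 10, 11, 12, 12, 14, 15, 95]

# The same data without the one extreme value in the right tail.
without_extreme = meal_costs[:-1]

print("with the $95 meal:   ", mean(meal_costs), median(meal_costs))
print("without the $95 meal:", round(mean(without_extreme), 2), median(without_extreme))
```

Removing the single extreme value moves the mean from 19.6 to about 11.2, while the median only moves from 11.5 to 11.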


Z - Scores

Suppose we are given a set of data that has a normal distribution and we already know its mean and standard deviation. We want to find exactly how many standard deviations a certain value lies from the mean. That number is called a z-score. The equation is:

\[ z = \frac{x - \mu}{\sigma} \]

Why is the z-score useful to us? If we compare our z-score to the 68-95-99.7 rule, we can learn roughly what percentage of values are greater than or less than our value. Suppose we had a z-score of 1.5. Since 1.5 lies between 1 and 2, between 68% and 95% of values are closer to the mean than ours, so fewer than 32% of values lie as far from the mean as ours does. Now suppose our value had a z-score of -2.5, meaning that it is 2 and 1/2 standard deviations to the left of the mean. This score lies between the 95% and 99.7% bands, which means that fewer than 5% but more than 0.3% of values lie at least that far from the mean. We can look up our z-score in a table of “Standard Normal Probabilities” to find the exact chance of being so lucky.
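Instead of a printed table, the standard normal probabilities can be computed. The sketch below uses the test-grade population from the earlier slides (mean 65, standard deviation 10); the grade of 80 is a hypothetical value chosen so that the z-score comes out to 1.5.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

std_normal = NormalDist()  # mean 0, standard deviation 1

# Hypothetical grade of 80 in a Normal(65, 10) population.
z = z_score(80, mu=65, sigma=10)
share_below = std_normal.cdf(z)   # share of values below z = 1.5

# For z = -2.5: share of values at least 2.5 standard deviations
# from the mean, counting both tails.
share_as_extreme = 2 * std_normal.cdf(-2.5)

print(f"z = {z}, share below = {share_below:.4f}")
print(f"share at least 2.5 sd from the mean = {share_as_extreme:.4f}")
```

The second result falls between 0.3% and 5%, which agrees with the bounds given by the 68-95-99.7 rule.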


Z-Scores

Based on the standard deviation, z-scores are used to determine how far away a value is from the mean. A z-score of 1 corresponds to one standard deviation above the mean. The 68-95-99.7 rule is helpful in determining what the value of a z-score really means. Figure 2 is a density curve demonstrating what is meant by the 68-95-99.7 rule. The blue area contains 68% of the data; the blue ends where z = 1 and z = -1. The red plus the blue contains 95% of the data, with the outer edges at z = 2 and z = -2. Likewise, adding the green brings the total to 99.7% of the data. If we had a z-score of 0.5, we would know that our number is somewhere in the blue. A z-score of 2.5 would lie somewhere in the green.

[Figure 2: normal density curve over z-scores. Blue: 68%. Blue & Red: 95%. Blue, Red & Green: 99.7%.]
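The three percentages in the rule are rounded areas under the standard normal curve. A short check, written as a sketch with my own variable names:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area between -k and +k standard deviations for k = 1, 2, 3.
shares = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```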


When to use 68-95-99.7 rule

When do we use the Empirical Rule? It is better to make a decision based on graphs (histograms, boxplots, Q-plots), but if all we are given is tabular output like that on this slide, we can still notice some features of the distribution by looking at the frequency column. The tallies need to be fairly low at the top and bottom of this column, with the data building up near the middle. That is what we have in this example. If this pattern is apparent, the next step is to compare the standard deviation of the data with the interval that holds the middle 68% of the data. If the data are normal, then the interval within one standard deviation of the mean, which has length 2s, should contain about 68% of our data. According to the table, 68% of the data lie between 5 and 14, for a length of 9. The standard deviation is approximately 4.7, so the empirical rule says we expect this length to be about 2 × 4.7 = 9.4. Since 9 is close to 9.4, we conclude that it is reasonable to treat the data as having a Normal distribution.

Statistics: NORMAL
N (Valid): 50
N (Missing): 0
Mean: 9.4200
Std. Deviation: 4.6995
Percentiles: 25th = 6.0000, 50th = 9.0000, 75th = 12.2500


Empirical Rule usage

Once again, the two things we need to check are:

-pattern of the tallies

-68% Interval

Here the tallies at the top and bottom of the frequency column appear to be the same as or larger than the tallies in the center. But to be safe, we compare the 68% interval with the standard deviation. The lower bound of the interval is between -1 and 0, and the upper bound is between 19 and 20. Therefore the length of the interval is between 19 and 21. With the empirical rule we would expect this length to be around 2 × 8.34 = 16.68. Because 16.68 is clearly smaller than either 19 or 21, we conclude that the data are not normal.

Statistics: UNIFORM
N (Valid): 50
N (Missing): 0
Mean: 10.4400
Std. Deviation: 8.3426
Percentiles: 25th = 3.7500, 50th = 10.5000, 75th = 16.5000
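The comparison made on these two slides can be sketched as a ratio: the observed length of the interval holding the middle 68% of the data, divided by the 2s that the empirical rule predicts. The function name is my own, and the slides give no numeric cutoff for "close", so the code only reports the ratios and leaves the judgment to the reader.

```python
def ratio_to_empirical_rule(interval_length, std_dev):
    """Observed 68% interval length divided by the 2 * s predicted by the rule."""
    return interval_length / (2 * std_dev)

# NORMAL output: 68% of the data lie between 5 and 14; s = 4.6995.
normal_ratio = ratio_to_empirical_rule(14 - 5, 4.6995)

# UNIFORM output: interval length somewhere between 19 and 21; s = 8.3426.
uniform_ratio_low = ratio_to_empirical_rule(19, 8.3426)
uniform_ratio_high = ratio_to_empirical_rule(21, 8.3426)

print(f"NORMAL:  ratio = {normal_ratio:.2f}")
print(f"UNIFORM: ratio between {uniform_ratio_low:.2f} and {uniform_ratio_high:.2f}")
```

A ratio near 1 is consistent with the empirical rule; the UNIFORM data give a ratio noticeably above 1 for either interval length.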


Spreads- Variance

Variance is a number that describes how much the data “varies”. There are two different formulas below. The first is used when the data are the whole population, so the population mean μ is known. The second is used for a sample; it divides by n - 1 because deviations measured from the sample mean tend to be smaller than deviations from the true population mean, so dividing by n would tend to underestimate the population variance. However, as n gets large there is very little difference between the two equations.

\[ \sigma^2 = \frac{(x_1 - \mu)^2 + \cdots + (x_n - \mu)^2}{n} = \frac{\sum (x_i - \mu)^2}{n} \]

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]


Spreads- Standard Deviation

The Standard Deviation is just the square root of the variance. A value that lies exactly one standard deviation above the mean has a z-score of 1. Once again, the difference between the two formulas below is whether the data are a sample from a population or the whole population.

\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \qquad \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}} \]


Spread- Calculation of variance and standard deviation

Heights of male students 65 68 70 71 72

\[ \bar{x} = 69.2 \]

\[ s^2 = \frac{(65 - 69.2)^2 + (68 - 69.2)^2 + (70 - 69.2)^2 + (71 - 69.2)^2 + (72 - 69.2)^2}{5 - 1} \]

\[ s^2 = 7.7 \]

\[ s = 2.77 \]
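The calculation above can be retraced step by step. This sketch spells out the sample formula (dividing by n - 1) instead of calling a library routine; the variable names are my own.

```python
from math import sqrt

male_heights = [65, 68, 70, 71, 72]

n = len(male_heights)
x_bar = sum(male_heights) / n                      # sample mean

# Squared distance of each height from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in male_heights]

sample_variance = sum(squared_deviations) / (n - 1)   # divide by n - 1 for a sample
sample_sd = sqrt(sample_variance)

print(f"mean = {x_bar:.1f}, s^2 = {sample_variance:.1f}, s = {sample_sd:.2f}")
```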


Summaries-Range and IQR

Ordered heights 62 65 68 68 69 69 70 71 71 72

1st Quartile = 68

3rd Quartile = 71

Range = Max - Min = 72 - 62 = 10

Inter-Quartile-Range (IQR) = 3rd Quartile - 1st Quartile = 71 - 68 = 3


Transformations

A transformation applies the same function to every value in a data set. For example, if we add a number a to every observation, we get a transformed data set that is shifted by a units. If we multiply or divide every observation by the same number, then the data set will have a new scale.

If you are given a mean, \( \bar{x} \) (or \( \mu \)), and a standard deviation, \( s \) (or \( \sigma \)), and want to convert your data so that you have a new mean, \( \bar{x}_{new} \) (or \( \mu_{new} \)), and a new standard deviation, \( s_{new} \) (or \( \sigma_{new} \)), all you need to remember is what shift and scale changes affect. In our linear transformation formula, a is the shift and b is the scale:

\[ x_{new} = a + b\,x \]


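Transformation

Standard deviations are affected only by scale changes, but means are affected by both shift and scale changes. This means that:

\[ \bar{x}_{new} = \text{shift} + \text{scale} \cdot \bar{x} \]

\[ s_{new} = \text{scale} \cdot s \]

For example, suppose College Station has an average annual temperature of 72 degrees Fahrenheit with a standard deviation of 10 degrees. We want to know what these statistics are in Celsius. The formula for Celsius is:

\[ \text{Celsius} = \frac{5}{9}(\text{Fahrenheit} - 32) = -\frac{160}{9} + \frac{5}{9}\,\text{Fahrenheit} \]

so the scale is 5/9 and the shift is -160/9, or about -17.8:

\[ \bar{x}_{new} = -\frac{160}{9} + \frac{5}{9}(72) \]

\[ s_{new} = \frac{5}{9}(10) \]

\[ \bar{x}_{Celsius} = 22.2 \]

\[ s_{Celsius} = 5.556 \]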
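As a check on the temperature example, the sketch below applies the shift and scale rules to a mean and standard deviation. The function name `transform_summary` is my own, and taking the absolute value of the scale is an addition of mine so that a negative scale could not produce a negative standard deviation.

```python
def transform_summary(mean, sd, shift, scale):
    """New mean and standard deviation after x_new = shift + scale * x."""
    new_mean = shift + scale * mean
    new_sd = abs(scale) * sd   # the shift does not affect spread
    return new_mean, new_sd

# Celsius = (5/9) * (Fahrenheit - 32) = -160/9 + (5/9) * Fahrenheit
scale = 5 / 9
shift = -32 * 5 / 9

mean_c, sd_c = transform_summary(mean=72, sd=10, shift=shift, scale=scale)

print(f"mean = {mean_c:.1f} C, standard deviation = {sd_c:.3f} C")
```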


Transformations- Shifts

Suppose we discover that a set of measurements was off by 3 inches because someone was measuring from the top of the shoe to the head. Obviously the recorded heights would not be the true heights of the subjects. If we assume every subject’s shoes had the same height of 3 inches, then we can fix the data with the equation:

\[ x_{new} = x_i + 3 \]

Ordered heights 62 65 68 68 69 69 70 71 71 72

Shifted heights 65 68 71 71 72 72 73 74 74 75

Notice what this does to the following statistics.

\[ \bar{x}_{new} = \bar{x} + 3 \]

\[ s^2_{new} = s^2 \]

\[ \min x_{new} = \min x + 3 \]

\[ Q1_{new} = Q1 + 3 \]

\[ \tilde{x}_{new} = \tilde{x} + 3 \]

\[ Q3_{new} = Q3 + 3 \]

\[ \max x_{new} = \max x + 3 \]

\[ \text{range} = 10 \qquad \text{IQR} = 3 \]

What we see from this is that a shift adds or subtracts the same amount from every statistic that is not a measure of spread. The statistics that describe the spread (e.g. s², range, and IQR) are not affected by the shift.
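The effect of the shift can be verified numerically. This is a sketch using Python's `statistics` module on the ordered heights from the slide.

```python
from statistics import mean, median, variance

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
shifted = [x + 3 for x in heights]   # add the 3-inch shoe height back

# Location statistics move by exactly 3.
print(mean(heights), "->", mean(shifted))
print(median(heights), "->", median(shifted))
print(min(heights), "->", min(shifted))
print(max(heights), "->", max(shifted))

# Spread statistics do not change.
print(variance(heights), "->", variance(shifted))
print(max(heights) - min(heights), "->", max(shifted) - min(shifted))
```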


Transformations - Scale

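Let us go back to our original subjects, for whom we have heights. Suppose that instead of inches we wanted to know how tall everyone was in cm. Since 1 inch = 2.54 cm, our new data are as follows:

Ordered heights 62 65 68 68 69 69 70 71 71 72

Heights in cm 157.48 165.10 172.72 172.72 175.26 175.26 177.80 180.34 180.34 182.88

\[ \bar{x}_{new} = 174 \]

\[ s_{new} = 7.69 \]

\[ \text{Range}_{new} = 25.4 \]

\[ \text{IQR}_{new} = 7 \]

\[ \min x_{new} = 157.48 \]

\[ Q1_{new} = 172.7 \]

\[ \tilde{x}_{new} = 175.3 \]

\[ Q3_{new} = 179.7 \]

\[ \max x_{new} = 182.88 \]

The values shown for Q3 and the IQR appear to come from software that interpolates quartiles. With the hand method from the quartiles slide, Q3 would be 71 × 2.54 = 180.34 and the IQR would be 3 × 2.54 = 7.62.

Unlike with the shift, notice that every single one of these statistics is affected by the scale change.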
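A short sketch confirming that a scale change multiplies both the center and the spread by the same factor. It uses the heights from the slide and Python's `statistics` module; the variable names are my own.

```python
from statistics import mean, stdev

CM_PER_INCH = 2.54

heights_in = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
heights_cm = [x * CM_PER_INCH for x in heights_in]

mean_cm = mean(heights_cm)
sd_cm = stdev(heights_cm)                      # sample standard deviation
range_cm = max(heights_cm) - min(heights_cm)

print(f"mean: {mean(heights_in):.2f} in -> {mean_cm:.2f} cm")
print(f"sd:   {stdev(heights_in):.2f} in -> {sd_cm:.2f} cm")
print(f"range: {max(heights_in) - min(heights_in)} in -> {range_cm:.1f} cm")
```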
