1. Displaying Distributions with Graphsmsgilfordmath.weebly.com/uploads/2/7/7/5/27753231/all... · 2018. 9. 10. · 3 Statistics Chap 01.docx the 3rd quartile, is also the 75th percentile.

1 Statistics Chap 01.docx

Statistics

Chapter 1

Exploring Data

1. Displaying Distributions with Graphs:

Graphs for categorical variables –

Pie charts – must include all the categories that make up the whole. Pie charts are used

to show each category’s relation (%) to the whole.

Bar graphs – compares any set of quantities that are measured in the same units,

includes but is not limited to % of a whole.

Graphs of distributions of quantitative variables –

Stem plots – displays the shape of the distribution and the actual data values in the

graph. Better for smaller data sets. The leaf is the left-most digit; the stem is the digits

to the left of the stem. Write the stems in a vertical column with the leaves in ascending

order from the stem. A back-to-back stem plot compares 2 distributions using the same

stem. Leaves are ordered from the stem to the left for the left data set and to the right

for the right data set. Stem plots are best suited to smaller data sets. For large numbers,

the leaf is sometimes trimmed by using only the left digit (e.g. 34850 has a stem of 3

and a leaf of 4,) Stems may be split to contain leaves 0-4 and 5-9. Splitting the stem

stretches the distribution.

Histograms –.breaks range of values of a variable into classes and displays only the

count or the percent of the observations that fall into each class. Classes must be of

equal width. Typically 5 to 20 classes will fit most data sets. A frequency histogram

will have the same shape of a relative frequency (%) histogram given the same class

width. Only the label of the y-axis will be different. Unlike bar graphs, histograms can

have no space between bars and no blank bars.

Examining Distributions – look for the overall pattern and for striking deviations from that

pattern. The pattern of a distribution is described by its shape, center, and spread. An

outlier is an individual value(s) that fall outside the overall pattern. One measure of

center is the midpoint where ½ the data items are above that value and ½ the data items

are below that value. Range (largest value – smallest value) is one measure of the

spread of a distribution. Stem plots and histograms display the shape of a distribution.

A one-peaked distribution is called unimodal. A distribution is symmetric if the data

values smaller and larger than its midpoint are mirror images of each other. It is

skewed right if the right tail is much longer than the left tail.

Outliers – Data points far beyond the rest of the data. Outliers should be investigated. The

omission of outliers should be justified by the investigation.

Relative Frequency and Cumulative Frequency – a relative frequency histogram converts the

quantitative data on the y-axis to percentages. A relative cumulative frequency graph is an

ogive. To construct an ogive, create a relative frequency histogram (establish classes of equal

widths and tabulate the data within classes.) An ogive displays the percent of data above or

below a given data value (probability distribution.) It shows the relative standing of data items.

Time plots – plots each observation against the time at which it was measured. Time is plotted

on the horizontal axis. Connect the dots to clearly display changes over time.


Recap of graphs:

Graph Purpose Data Type Strength

Pie Chart Shows composition of the total Categorical variables Quick visibility of

categorical % to total

Bar Graph Shows quantitative data on the

categorical variables. Does not

necessarily sum to 100%

Categorical variables Quick visibility of

quantitative variable

Stem Plot

(stem and leaf)

Distribution of data includes

actual data items

Quantitative variables

Small data set

Center, Spread, Shape,

(Mode, median, range,

symmetry)

Histogram Distribution of data - frequency Quantitative variables

Large data set



symmetry)

Relative

Frequency

Histogram

Distribution of data - percent Quantitative variables

Large data set



symmetry)

Ogive Histogram of Cumulative

Relative Frequency – displays

standing (probability distribution)


Large data set

Relative standing (% of

data above or below,

percentile)

Time Plot Displays the effect of time on a

quantity


Large or small data

set

Trends of data over

time

2. Describing Distributions with Numbers:

Measures of Center: Mean and Median

Mean is the average: the mean of a sample is 𝑥 = 𝑥𝑖

𝑛. The mean of a population is

= 𝑥𝑖

𝑁 . The mean is sensitive to extreme values; it is not a resistant measure of center.

Median (denoted by M or Q2) is the midpoint of the data set so that ½ the data items are above

the median and ½ the data items are below the median. To find the median order the data set

from low to high. The median is the data item in the position 𝑛+1

2 . (e.g. For a data set of 7

items the median is the 4th data item. For a data set of 8 items the median is ½ way between

the 4th and the data items.) The median is more resistant to extreme values than the mean.

Measures of Spread: Range, Percentile, Quartiles

Range is the distance between the most extreme values (i.e. highest – lowest). One weakness

in range is that the extreme values may be outliers.

Percentile – a data item in the pth percentile means that there are p percent of the data at or

below that data item. The median is the 50th percentile also known as the 2

nd Quartile (Q2).

The Quartiles, Q1 and Q3: Q1, the first quartile, is also the 25th percentile. Q1 is the median of

the lower ½ of the distribution that is bisected by Q2, the median of the entire distribution. Q3,


the 3rd

quartile, is also the 75th percentile. Q3 is the median of the upper ½ of the distribution

that is bisected by Q2, the median of the entire distribution. When finding Q1 and Q3, the

median itself is part of neither half. Caution: some software calculates the quartiles differently

(e.g. TPS3e pg 77-78.).

Examples:

Data Set A:

19 21 22 23 24 25 26 27 28 29 30 31 32

Data Set B:

19 21 22 23 24 25 26 27 28 29 30 31

Data Set C:

19 21 22 23 24 25 26 27 28 29 30

Data Set D:

19 21 22 23 24 25 26 27 28 29

Measures of Center and Spread: the 5-Number Summary (minimum, Q1, M, Q3, and

maximum). These 5 values are plotted on a box plot (box and whisker). The median, M or

Q2, shows the center of the distribution. The quartiles show the spread of the center ½ of the

data but do not give any information on extreme values. The max and min values show the

spread of the entire data set.

Example for Data Set B

19 21 22 23 24 25 26 27 28 29 30 31

Use the box plot to locate median, the spread of the entire distribution (extreme values), and

the spread of the center ½ of the data set. Note: the median is not necessarily in the center of

the box.

The Interquartile Range (IQR) is the distance of the center ½ of the data. IQR = (Q3 – Q1).

The interquartile range is used to identify outliers.

Outlier – a data item below 1.5*IQR from Q1 or above 1.5*IQR from Q3.

Q1 – 1.5*IQR establishes the minimum value.

Q3 + 1.5*IQR establishes the maximum value.

TI-84 2nd

statplot to access statistics graphs.


Variance (s2) and Standard Deviation (s) are measures of spread (i.e. how far the observations

are from their mean.) Variance =

2

2

1

ix xs

n

. Standard Deviation =

2

1

ix xs

n

.

Note that the standard deviation is the square root of the variance. Variance is the average

squared deviation from the mean. Standard deviation is the square root of the average squared

deviation from the mean. S and s2 will be large for disperse data sets (data items far from the

mean) and small for compact data sets (data items clustered about the mean.).

Properties of Standard Deviation:

S measures spread about the mean and should be used only when the center is measured by the

mean.

S = 0 only when there is no spread or variability. This happens only when all observations

have the same value. As the observations become more spread out about their mean, s

gets larger.

S, like 𝑥 , is not resistant. A few outliers can make s very large.

Choosing Measures of Center and Spread:

The 5-number summary is usually better than the mean and standard deviation for describing a

skewed distribution or a distribution with strong outliers. Use 𝑥 and s only for reasonably

symmetric distributions that are free of outliers. Because you need to determine if the data set

is skewed, always plot the data!

Changing the Unit of Measure:

Linear transformations of a data set:

𝑥𝑛𝑒𝑤 = 𝑎 + 𝑏𝑥 where a shifts the data up (a postitive) or down (a negative) and b changes the

unit of measure.

𝑥𝑛𝑒𝑤 = 𝑎 + 𝑏𝑥

Impact on Measures of

Center (𝑥 , Q1, M, Q3) Spread (IQR, s)

±a ± No change

×b × ×

Comparing Distributions:

Use side-by-side bar graphs for categorical variables.

Use back-to-back stemplots and boxplots for quantitative variables.

Use the graphs to interpret shape, center, spread, and outliers.

Calculate mean and median for center. Calculate 5-number summary, standard deviation, and

outliers for spread.

Citing your calculations, write your conclusions about the data addressing shape, center,

spread, and outliers. Make sure your writing is in context with the specific situation of the

problem.

1 Statistics Chap 02

Statistics

Chapter 2

Location in a Distribution

1. Measures of Relative Standing and Density Curves:

Relative Standing:

Z-score: a standard score, the number of standard deviations and the direction a data item is

from the mean in a given distribution.

𝑧 =𝑥−𝑥

𝑠. 𝑧 =

𝑥−𝜇

𝜎

Percentiles: percent of the observations less than (or equal to) the given observation. Some

definitions of percentile calculate only data less than the given data item.

Chebyshev’s Inequality:In any distribution (i.e. even skewed ones), the percent of

observations falling within k standard deviations from the mean is at least 1 −1

𝑘2 .

Density Curves

A density curve is a curve that

Is always on or above the horizontal axis AND

Has exactly 1 underneath it (i.e. the area under the curve represents 100% of the data)

A density curve shows the proportion of data either above, below, or between given data

values.

Do the following when analyzing univariate data (i.e. one variable):

1. Plot the data (histogram or stemplot).

2. Evaluate the shape, center, spread and outliers.

3. Calculate a numerical summary to describe center and spread. Use mean and standard

deviation for symmetrical distributions. Use median and IQR for skewed distributions.

4. With large data sets, smooth the histogram with a continuous curved line. This curve is

a mathematical model for the distribution, an overall description that ignores minor

irregularities.

The median of a density curve is the point that separates the distribution into 2 equal areas

The mean of a density curve balances the distribution. The mean will be to the right of the

median in a right skewed distribution (right tail longer.) The mean will be to the left of the

median in a left skewed distribution (left tail longer.)

Mean

Standard

Deviation

Data Set Observations 𝑥 s

Density Curve 𝜇 𝜎

Uniform Distribution – The frequency is flat. The shape is rectangular. The area is still 1. See

exercise 2.10 page 128.


2. Normal Distributions:

Normal curves are symmetric, single-peaked (unimodal), and bell-shaped. Normal curves are:

Good descriptions for some distributions of real data.

Good approximations to the results of many kinds of chance outcomes.

Statistical inference procedures based on normal distributions work well for roughly

other symmetric distributions.

FYI the equation for a normal density curve is

21

2

2

1

2

x

e

The inflection points of the curve occur at ±1𝜎.

𝑁 𝜇,𝜎 is the notation for a normal distribution with mean at and standard deviation

at 𝜎.

The 68-95-99.7 Rule (Empirical Rule):

The Standard Normal Distribution:

Mean is 0 and standard deviation is 1. The standard normal table gives area

(probability) below (to the left of) a given z-score. Therefore, to obtain the area above (to the

right of) a given z-score, calculate 1 P z . To obtain the area between 2 z-scores calculate

R LP z P z.

Normal Distribution Calculations:

Follow these steps to solve normal distribution problems:

1. Draw a picture and shade according to the wording of the problem. Record and and

the position of the given data value(s).

2. Convert data score(s) to z-scores.

3. Use the table or a calculator (normalcdf) in combination with the shading to determine area

(probability).

4. Write a conclusion in context with the particulars of the given problem.

Sometimes in a given problem, the probability (area) is known and the goal is to find the data

score associated with the given probability. Use the table in reverse to find the z-score, then

solve for the unknown data score.

Assessing Normality:

Method 1 (histogram)

1. Draw a histogram or stemplot. Is the curve unimodal, symmetrical, and bell-shaped?


2. Mark off the points 𝑥 , 𝑥 ± 𝑠, and 𝑥 ± 2𝑠 on the horizontal axis.

3. Compare the counts of the distribution with the Empirical Rule (68-95-99.7).

Method 2 (normal probability plot)

1. Sort the data from low to high. Record the percentiles for each data point.

2. Find the z-scores for each of the percentiles from step 1.

3. Plot the z-scores against the data scores. If the distribution is normal, the plots will form a

straight line. Systematic deviations indicate that the distribution is non-normal. Outliers

will lie far from the overall pattern.

4. In a right skewed distribution, the largest observations fall distinctly above a line drawn

through the main body of points.

5. Left skewness is when the smallest observations fall below the line.


Statistics

Chapter 3

Examining Relationships (Bivariate Data)

Introduction: To understand a statistical relationship between two variables, measure both variables on

the same individuals. Caution: the relationship between two variables can be strongly

influenced by other variables that are lurking in the background. Categorical variables often

are present that have an influence on the relationships. Identify the explanatory variable (x-

axis, input, independent) and the response variable (y-axis, output, dependent).

1. Scatterplots and Correlation:

Scatterplots:

The relationship between 2 quantitative variables measured on the same individuals. Each

point represents an individual with the coordinates of the point (explanatory value, response

value). To graph a scatterplot, do the following:

a. The explanatory variable is on the x-axis, the response variable on the y-axis.

b. Label both axes!!!

c. Scale the intervals on the axes so that the intervals are uniform.

d. Use a large enough grid so that the details can be readily identified.

To interpret a scatterplot, identify patterns and deviations from the patterns.

a. Overall pattern and striking deviations from the pattern.

b. Direction (positive or negative), Form (linear, curved), Strength (how closely thr

points follow a clear form)

c. Outlier (individual value that falls outside the overall pattern of the relationship.)

To add a categorical variable to a scatterplot, use a different plotting color for each category.

Correlation::

A scatterplot can be misleading in determining direction, form, and strength if the scales are not

properly set. Therefore, the calculation of the correlation coefficient, r, is used to interpret the

direction and strength of a linear relationship between 2 quantifiable variables.

1

i i

x y

x x y y

s sr

n or more simply A positive r indicates a positively

(upward) sloping line and a negative r indicates a negatively (downward) sloping line. The

value of r will be 1,1 . The closer r is to either – 1 or 1, the stronger the linear relationship.

The closer r is to 0, the weaker the relationship between the 2 variables.

Features of correlation:

Correlation makes no distinction between explanatory and response variables.

R does not change if the units of measure change for either x or y or both. The

correlation, r, has no unit of measure itself.

Positive r indicates positive association between the variables, and negative r indicates

negative association.

The correlation coefficient, r, will always be between – 1 and 1. 1 1r .

Correlation of bivariate data:

Correlation requires that both variables are quantitative.

Correlation only describes the strength of linear relationships.

Correlation, r, is not resistant to extreme values.

Correlation is not a complete summary of bivariate data. Use also ,


2. Least-Squares Regression Line and Residuals:

Regression requires an explanatory and a response variable. A regression line describes how a

response variable, y, changes as an explanatory variable, x, changes. Regression lines are used

to predict the value of y for a given value of x. The value of the slope is no indication of the

strength of the relationship. The slope will change if the units of measure of the variable(s)

changes. The strength of the linear relationship is the value of r. Extrapolation is the use of a

regression line for predicting y for values of x outside the data range. Extrapolation is often not

accurate!

The least-squares regression line is the line that minimizes the sum of the squared vertical

distances of the y values of data points from the y values of the line. The equation of the least-

squares line is

, where is the y-intercept and b is the slope. Alternate form: .

The slope, b, is calculated .

The regression line goes through the point .

Remember, is the predicted value for y, given an input value of x. It is better to use values of

x within the range of the x data values, interpolation, rather than outside the range,

extrapolation.

Residuals are the differences between the observed value of the response variable and the

expected value of the response variable as predicted by the regression line.

Residual /. The sum of the least-squares residuals is always zero. A residual plot is a

scatterplot of the regression residuals against either the x values or the predicted y values .

Residual plots show how well a regression line fits the data.

Features of residual plots:

Residual plots should show no obvious pattern. A curved pattern would indicate that

the relationship is not linear. A “fan shaped” pattern indicates the prediction value is

less accurate for x values at the larger residual side of the plot.

The residuals should be relatively small in size.

The standard deviation of the residuals, , measures the typical prediction

error for the regression line.

The coefficient of determination, , is the fraction of the variation in the y values that is

explained by the least squares regression line (that is due to the linear relationship of x and y.)

Facts about least squares linear regression:

The distinction between explanatory and response variables is essential.

There is a close connection between correlation and slope. . A change in one

standard deviation of x corresponds to a change of r standard deviations in y.

The least squares regression line always passes through the point .

The correlation, r, describes the strength of the linear relationship.

The square of the correlation, r2 or the coefficient of determination, is the fraction of the

variation that is explained by the least squares regression line. It is the proportion of the

variation that is due to the linear relationship of x and y.


3. Correlation and Regression Wisdom:

Overall facts:

Correlation and regression describe only linear relationships.

Extrapolation (using values outside the domain of the data) often produces unreliable

predictions.

Correlation is not resistant.

Outliers – observations that lie outside the overall pattern (beyond 1.5IQR). Outliers in the y

value produce large residuals. Outliers in the x value may or may not produce large residuals.

Remember, by definition, residuals are . Outliers in the y direction are not necessarily

influential. Outliers in the x direction are often influential.

Influential observations – observations that markedly affect the calculation of the regression

line. Points that are outliers in the x direction are often influential and therefore affect the

regression line (pull the regression line toward the influential point.)

Lurking variables – variable(s) that are neither the explanatory nor the response variable, yet

they may influence the relationship between the explanatory and response variables. Lurking

variables can create a correlation or they can hide a correlation.

Correlations based on averages often produce too high an r value to be useful in analyzing

individuals.

4. Chapter Summary:

a. Plot the data on a scatterplot

b. Evaluate the form, direction and strength from the scatterplot

c. Calculate numerical summaries (e.g. ).

d. Calculate a regression line .

e. Calculate the strength of the linear relationship, and how well the regression line fits the

data (e.g. residuals and r2.)


Statistics Chapter 4

Exponential and Power Relationships on Bivariate Data

1. Transforming to Achieve Linearity: The original data plot may not show a linear relationship; however, a function of the data may be linearly related. Taking a function on the data (e.g. log or square root) is called transforming the data. It is also known as reexamining the data. Transforming is changing the units of measure on the original data. Common transformations in statistics use linear functions, exponential (using both positive and negative exponents), or logarithms. Steps to follow for bivariate relationships 1. For a linear relationship:

a. Creating a scatterplot of the data (x, y). What is the form? b. LinReg on (x, y) and noting the equation and the values of r and r2.

i. LinReg L1, L2, Y1 c. Plot the residuals. If there is a pattern, then a linear relationship is NOT the best fit.

i. L3 = Y1(L1) to put y into L3

ii. L4 is the residual. Residual = y y− = L2 – L3. iii. The scatterplot for the residual is L1, L4.

2. For an exponential relationship: a. Create a scatterplot of (x, ln(y)). What is the form?

i. One method of identifying an exponential relationship is if 1

n

n

yy −

is constant.

b. LinReg on (x, ln(y)) and noting the equation and the values of r and r2. i. LinReg L1, L3, Y2

c. Plot the residuals. See the table below for the residual plot. If there is a pattern, then an exponential relationship is NOT the best fit.

i. x y Ln(y) ( )ln y ( )ln( ) lny y−

L1 L2 L3 L4 L5 Data Data L3=ln(y) L4=Y2(L1) L3-L4

ii. The scatterplot of the residual for the exponential linear regression line is L1, L5 d. Transform the equation from part b to exponential form.

i. The linear equation will be in the form ln( )y a bx= +

ii. Transform as follows:

( )

a bx

a bx

xa b

y ey e e

y e e

+=

= ⋅

= ⋅


3. For a power relationship: a. Follow the same steps as for an exponential EXCEPT you will also need to take a logarithm

of the explanatory value (x) as well as the response variable (y). b. LinReg on (ln(x), ln(y)) and noting the equation and the values of r and r2.

i. LinReg L3, L4, Y3 where L3 = ln(L1) and L4 = ln(L2) c. Plot the residuals. See the table below for the residual plot. If there is a pattern, then a

power relationship is NOT the best fit. i.

x y Ln(x) Ln(y) ( )ln y ( )ln( ) lny y−

L1 L2 L3 L4 L5 L6 Data Data L3=ln(L1) L4=ln(L2) L5=Y3(L3) L4-L5

ii. The scatterplot of the residual for the power linear regression line is L3, L6 d. Transform the equation from part b to power form.

i. The linear equation will be in the form · ii. Transform as follows: ·

Handy hints: Write down the definitions of the contents of L1 through L6. Write down the definitions of Scatterplots 1 through 4. These steps are especially important for calculation and plotting of residuals. Recap:

Power Graph Linear Form Power or Exponent Form ∞, 0

·

0

·

0,1

·

1,∞

·

For exponential growth, each succeeding y value grows by a fixed % of the previous y value for uniform increments of x. If a variable grows exponentially, then its log grows linearly.


2. Relationships Between Categorical Variables: Always sum the rows and columns for categorical data. The marginal distribution shows the % of a variable to the TOTAL distribution. A conditional distribution shows % of a subset of the data to the total of the subset (ie you are considering only a particular row or column.) Simpson’s Paradox – an association that holds true for all of several groups can reverse direction when the data are combined to form a single group. (example of the effect of lurking variables.)

3. Establishing Causation: A strong association is not enough to prove cause and effect. Even when direct causation is present, it is rarely a complete explanation of an association between 2 variables. Even well established causal relationships may not generalize to other settings (eg rats to people.) A common response to a third variable may explain changes in 2 associated variables. 2 variables are confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory or lurking. If an experiment is not possible, then look for the following to establish a relationship: 1. A strong association 2. Consistent association 3. Larger values of the response variable are associated with stronger responses. 4. Alleged cause precedes the effect in time. 5. Alleged cause is plausible.


Statistics

Chapter 5

Data Production

1. Designing Samples:

Population – the entire group of individuals

Sample – part of the population

Sampling – studying a part in order to gain information about the whole.

Census – studying every individual in the population.

Type of Sample Description Reliability

Voluntary Response Individuals decide

whether to participate in

a general appeal for data

Weak (biased)

Convenience Select individuals based

on those who are easiest

to contact

Weak (biased)

Systematic Select every nth item Weak if not done properly.

Subject to variations in the

sequence of the data set.

Simple Random Samples

(SRS)

selecting n individuals

from a population in

such a way that every set

of n individuals has an

equal chance of being

selected.

Strong (objective – no bias)

Biased sampling method – those that systematically favor certain outcomes. They have weak

reliability.

Random samples may be chosen by using a systematic random sampling technique:

Random digits using computer software

TI-84 (Math, PRB, 5:randInt(1,ending number)

Random digits using a Table where:

each digit is equally likely to be chosen and

the digits are independent of one another (no pattern to the digits)

Steps to choose an SRS:

1. Label – assign a unique numerical value to each individual in the population

2. Table – use a Random Digit Table to select labels at random

3. Stopping rule – establish the total sample size

4. Identify sample – link the numerical values selected to the individuals.

Pg 326 Case Closed

A probability sample is a sample chosen by chance. It must be known the probability for

each possible sample.

The use of chance to select the sample is the essential principle of statistical sampling.


A stratified random sample – the population is divided into its important characteristics called

strata. Then choose a SRS from each strata and combine these SRS to form the full sample.

A cluster sample – the population is divided into groups or clusters. Some clusters are

randomly selected, then all the individuals in the selected clusters are chosen for the sample.

Multistage sample – several stages of selection are used to divide the population. Each stage

may be an SRS, strata, or cluster. The final stage results in clusters from which the sample is

selected.

Areas of bias in sampling despite good sampling methods:

Undercoverage – some groups in the population are omitted from the sampling process (e.g.

homeless people).

Nonresponse – individuals chosen for the sample are unavailable or refuse to participate.

Response bias – the nature of the question (e.g. illegal activities or estimates by the

respondent) or the behavior of the interviewer may result in bias by the respondent.

Wording of the questions – the wording or the order of the questions may result in a biased

response. Wording biases may result from omission of information (covert) or by including

biased wording (overt, an error of commission.)

Larger random samples produce results that are closer to the population than smaller samples.

2. Designing Experiments:

Experiment – a study that includes a treatment (a specific experimental condition) to

individuals (experimental units or subjects) in order to observe the response.

The explanatory variables (factors) are the inputs and the response variables are the outputs.

The explanatory variables may have different values (levels.)

Example 5.14 and Figure 5.3 pg 355.

A lurking (confounding) variable is neither the explanatory variable nor the response variable

yet the lurking variable may influence the interpretation of the results of a study.

Placebo – “dummy” or fake treatment used to disguise the treatment from the control group

and the treatment group.

The control group receives the placebo.

The 1st basic principle of statistical design of experiments is control! Comparison of several

treatments in the same environment is the simplest form of control. Control is the overall

effort to minimize variability in the way the experimental units (individuals or subjects) are

obtained (sample) and treated.

The 2nd

basic principle of statistical design of experiments is replication! Replication – use

enough subjects to reduce chance variation between groups (i.e. the larger the sample size the

better.)


The 3rd

basic principle of statistical design of experiments is randomization!

Randomization – rely on chance to assign individuals (experimental units) to the control and

treatment groups. Goal is to eliminate bias. Do not rely on the characteristics of the

individuals or the judgment of the designer of the experiment.

Recap of the basic principles of experimental design:

1. Control the effects of lurking variables on the response. (Use a control group and at least

one treatment group.)

2. Replicate each treatment on many individuals to reduce chance variation in the results.

3. Randomize – use impersonal chance to assign individuals to treatments. A completely

randomized design assigns all individuals to all groups randomly.

Statistically significant – an observed effect so large that it would rarely occur by chance.

Do example 5.19 pg 362 for using Table B for random assignment.

Block – a group of individuals that are known before the experiment to be similar in some

variable so that the response to the treatments will be systematically affected. Blocks are

another form of control. They control the effects of some outside variables by bringing those

variables into the experiment to form the blocks. (e.g. separate blocks for males and females

where gender may impact the effect of the treatment.) Blocking allows separate conclusions

about each block. Form blocks based on the most important unavoidable sources of variability

among individuals. Randomization averages out the effects of the remaining variation

enabling an unbiased comparison of the treatments.

Block Design – individuals are randomly assigned to treatment groups within each block.

Control what is possible, block what is not controllable, randomize the rest and replicate!.

Often, matching the subjects in various ways can produce more precise results than simple

randomization.

Matched pairs – a type of block design that compares just two treatments. The subjects are

matched in pairs because matched subjects are more similar than unmatched subjects.

Therefore, comparing responses between matched pairs is more efficient than comparing

responses of randomly assigned subjects. Example 5.23 pg 368.

Double Blind Experiment – neither the subjects nor those who measure the response variable

know which treatment a subject received.

A potential weakness of any experiment is lack of realism. The ability to apply the

conclusions of an experiment to a real setting may be limited.


Statistics

Chapter 6

Probability – Simulations, Randomness

Introduction: the three types of probability are relative frequency, theoretical model, and simulation.

1. Simulations:

A simulation is an imitation of chance behavior, based on a model that accurately reflects the

phenomenon. Do the following steps:

a. Describe the random phenomenon

b. State the assumptions

c. Assign digits to represent the outcome

d. Simulate many independent trials (repetitions)

Using the TI-84:

randInt(1st, last, size) L1:sortA(L1): (L1≤value) L2 : sum(L2)

See Example 6.8 on page 401

2. Probability Models:

Chance behavior is unpredictable in the short run but has a regular and predictable pattern in the long

run. Probability is empirical based on observations of many trials.

A phenomenon is random if individual outcomes are uncertain but there is a regular distribution of

outcomes in a large number of trials.

𝑃 𝐸 =𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑖𝑚𝑒𝑠 𝑒𝑣𝑒𝑛𝑡 𝑜𝑐𝑐𝑢𝑟𝑠 𝑖𝑛 𝑎 𝑙𝑟𝑎𝑔𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑟𝑖𝑎𝑙𝑠

𝑛𝑢𝑚𝑏𝑒 𝑟 𝑜𝑓 𝑡𝑟𝑖𝑎𝑙𝑠. This is long term relative frequency.

Probability models have independent trials, empirical trials, and many trials.

Probability models contain:

a. Sample space: the set of all possible outcomes

b. Event: any individual outcome (this is a subset of the sample space)

c. Probability model: a mathematical description of random phenomena consisting of:

i. Sample space

ii. Method of assigning probabilities to events.

The Multiplication Principle: the number of ways for more than one event. 𝑛1 ∙ 𝑛2 ∙ 𝑛3 𝑒𝑡𝑐 where each

“n” represents the number of ways for that particular event. Example: 3 types of bread, 4 types of meat

and 2 types of cheese will produce 24 different sandwiches.

Independent events: the occurrence of one event has no impact on the probability of another event (e.g.

selection with replacement.)

Dependent events: the occurrence of one event has an impact on the probability of another event (e.g.

selection without replacement.)

Probability Rules:

a. 𝑃 𝐴 = [0,1] where 0 is an impossibility and 1 is a certainty. 0 ≤ 𝑃 𝐴 ≤ 1

b. In a given sample space, the sum of all probabilities is 1. 𝑃 𝑆 = 1

c. For event A, 𝑃 𝐴 =𝑐𝑜𝑢𝑛𝑡 𝑜𝑓 𝑜𝑢𝑡𝑐𝑜𝑚𝑒𝑠 𝑜𝑓 𝐴

𝑐𝑜𝑢𝑛𝑡 𝑜𝑓 𝑜𝑢𝑡𝑐𝑜𝑚𝑒𝑠 𝑜𝑓 𝑆

d. For 𝑃 𝐴 𝑜𝑟 𝐵 = 𝑃 𝐴 ∪ 𝐵

i. If A and B are mutually exclusive, disjoint, then

𝑃 𝐴 𝑜𝑟 𝐵 = 𝑃 𝐴 ∪ 𝐵 = 𝑃 𝐴 + 𝑃 𝐵

ii. If A and B are NOT mutually exclusive, joint, then

𝑃 𝐴 𝑜𝑟 𝐵 = 𝑃 𝐴 ∪ 𝐵 = 𝑃 𝐴 + 𝑃 𝐵 − 𝑃 𝐴 𝑎𝑛𝑑 𝐵


e. The complement of event A, denoted as 𝑃 𝑛𝑜𝑡 𝐴 = 𝑃 𝐴′ = 𝑃 𝐴𝑐 = 1 − 𝑃 𝐴 f. For 𝑃 𝐴 𝑎𝑛𝑑 𝐵 = 𝑃 𝐴 ∩ 𝐵

i. If A and B are independent, then

𝑃 𝐴 𝑎𝑛𝑑 𝐵 = 𝑃 𝐴 ∩ 𝐵 = 𝑃 𝐴 ∙ 𝑃 𝐵 ii. If A and B are dependent, then

𝑃 𝐴 𝑎𝑛𝑑 𝐵 = 𝑃 𝐴 ∩ 𝐵 = 𝑃 𝐴 ∙ 𝑃 𝐵|𝐴 iii. Disjoint events are not independent

iv. Independence can NOT be shown on a Venn diagram.

Benford’s Law of 1st digit in a legitimate record (0 cannot be the 1

st digit ∴ 𝑃 0 = 0

1 2 3 4 5 6 7 8 9

.301 .176 .125 .097 .079 .067 .058 .051 .046

3. Conditional Probability:

Expanded version of AND (∩) probabilities

𝑃 𝐴 𝑎𝑛𝑑 𝐵 = 𝑃 𝐴 ∙ 𝑃 𝐵|𝐴 = 𝑃 𝐵 𝑎𝑛𝑑 𝐴 = 𝑃 𝐵 ∙ 𝑃 𝐴|𝐵 𝑃 𝐴 𝑎𝑛𝑑 𝐵 𝑎𝑛𝑑 𝐶 = 𝑃 𝐴 ∙ 𝑃 𝐵|𝐴 ∙ 𝑃 𝐶|(𝐴 𝑎𝑛𝑑 𝐵)


Statistics

Chapter 7

Random Variables

Random variable: variable whose value is a numerical outcome of a random phenomenon.

1. Discrete and Continuous Random Variables:

Discrete random variables have distinct values (usually whole numbers). Increments between the stated

values are excluded.

The probability distribution of discrete random variables have 2 features:

1. 0 ≤ 𝑃(𝑥) ≤ 1 and

2. 𝑃 𝑆 = 1 (i.e. the probability of the entire sample space is exactly 1.

Continuous random variables take all values within an interval. The probability distribution is a density

curve. Probability is calculated as the area under the curve within the stated x-interval. The probability

of exactly a given x-value is 0. The area under the entire density curve = 1. If a density curve is

normally distributed, use the techniques from chapter 2 to find probabilities (z-scores or normalcdf).

2. Means and Variances of Random Variables:

The mean of a discrete random variable is the mean of a probability distribution based on many trials.

The mean is also called the expected value.

𝜇𝑥 = 𝑥1 ∙ 𝑝1 + 𝑥2 ∙ 𝑝2 +⋯+ 𝑥𝑘 ∙ 𝑝𝑘

The variance of a discrete random variable is:

𝜎𝑥2 = 𝑥1 − 𝜇𝑥

2 ∙ 𝑝1 + 𝑥2 − 𝜇𝑥 2 ∙ 𝑝2 +⋯+ 𝑥𝑘 − 𝜇𝑥

2 ∙ 𝑝𝑘 Take the square root of the above calculation to get the standard deviation of a discrete random variable.

Sampling distributions are probability distributions of random variables.

The Law of Large Numbers: as the number of observations increases, 𝑥 approaches 𝜇. The size

required to approach the mean is influenced by the variability in the distribution (i.e. the larger 𝜎, the

larger the number of observations needed to approach 𝜇.

Rules for Means:

a. 𝜇𝑎+𝑏𝑥 = 𝑎 + 𝑏𝜇𝑥 , (given that X is a random variable, and a and b are constants.)

b. 𝜇𝑋±𝑌 = 𝜇𝑋 ± 𝜇𝑌 , (given that X and Y are random variables.)

Rules for Variances:

a. 𝜎𝑎+𝑏𝑋2 = 𝑏2 ∙ 𝜎2𝑋 (given that X is a random variable, and a and b are constants.)

b. 𝜎𝑋±𝑌2 = 𝜎2𝑋 + 𝜎2𝑌 (given that X and Y are INDEPENDENT random variables.) Note

that whether one is adding or subtracting X and Y, ALWAYS ADD their respective variances!

Combining Normal Random Variables: any linear combination of INDEPENDENT normal random

variables is also normally distributed.


Statistics

Chapter 8

Binomial and Geometric Distributions

1. Binomial Distributions:

4 conditions for a binomial distribution:

a. Only 2 outcomes possible. success and failure. The variable is

the number of success.

(x)

b. Fixed number of trials (n)

c. All observations are independent

d. Probabilities are constant (p)

i. If n is small compared to N, p is practically constant;

therefore can assume a binomial distribution in such

cases.

ii. For notation, p is the probability of success and q is the

probability of failure; therefore, q = 1 – p .

A binomial distribution refers to x number of successes with the parameters n and p. A common

notation is X is B(n,p). This is a type of discrete probability distribution.

The probability of x successes is P(x) = nCxpxq

n-x.

Note: !

! !n x

nC

n x x. This represents the number of ways to get x successes in n trials.

TI 83/84 hit the value for n then Math Prb 3 for combination then hit the value for x.

TI 83/84 function for binomial P(x) is binomialpdf(n,p,x). Hit 2nd

Vars (for Dist) to get to binomial,

geometric, and poisson probabilities. Hit A for binomialpdf.

Example: P(2) given n = 5 and p = .25.

2 3

5 2

5!.25 .75 .26367

3!2!C

Binomialpdf(5,.25,2) = .26367

For cumulative probabilities you can either calculate each value separately, then add them (see example

8.7) OR use the function binomialcdf(n,p,x) which gives values at and below ( ) the stated x value.

Hit 2nd

Vars (for Dist), then hit B for binomialcdf.

Mean and Standard Deviation for binomial discrete distributions: np

npq

Using a normal distribution to approximate a binomial distribution:

Use normal when both : 10

10

np

nq


Using the normal distribution to approximate a binomial is

Most Accurate Least Accurate

.5

.5

p

q

0

1

p

q OR

1

0

p

q AND the distribution is skewed

Continuity Correction Factor (CCF) – this topic is not in your book

When using the normal distribution to approximate a binomial distribution, the accuracy is increased

when you use the continuity correction factor ( .5 ). Below is the justification and examples.

Example A:

The binomial discrete value of 3 resides between 2.5 and 3.5 on a continual normal distribution.

Using TI 83/84: Normalcdf(2.5,3.5,mu, sigma)

Using z-scores: P(Z<3.5) – P(Z<2.5)

P.S. you would rarely use the normal to approximate a binomial for probabilities equal to a given value.

This example is merely to show the justification of the .5 CCF.

Example B:

To find a value greater than 2, find values greater than 2.5 (because the value of 2 is from 1.5 to 2.5).

Greater than: you will add .5 to the binomial value.

Example C:

To find a value less than 4, find values less than 3.5 (because the value of 4 is from 3.5 to 4.5).

Less than: you will subtract .5 from the binomial value.

Note: if a problem asks for an “exact” probability, use the binomial calculation. If a problem tells you to

“approximate” the probability it means to use the normal distribution as an approximation.

3P x


Simulation of binomial distributions:

a. Generate random binomial values (0 and 1 for failure and success respectively) for a given

probability for a given number of trials. Store in List1, then sum List1 to get the number of

successes that were randomly generated according to your parameters.

b. 1 11, , :randBin p n L sum L

c. Do Step “a” several times.

d. Average the ratio of number of successes divided by number of trials and compare that value to

the binomial probability.

e. As you increase the number of trials, the generated probability should come closer to the

binomial probability.

3. Geometric Distributions:

4 conditions for a geometric distribution:

a. Only 2 outcomes possible. success and failure.

b. NOT a fixed number of trials. The variable is the number of

trials required to get the first success.

(x)

c. All observations are independent

d. Probabilities are constant (p)

i. If n is small compared to N, p is practically constant;

therefore can assume a binomial distribution in such

cases.

ii. For notation, p is the probability of success and q is the

probability of failure; therefore, q = 1 – p .

The probability of it taking x trials in order to get the first sucess is P(x) = qx-1

p

Mean and Standard Deviation for geometric discrete distributions:

2

2

2

1

p

q

p

qq

p p

nP x n q

Using TI 83/84 geometpdf(p,x) for P(x) or geometcdf(p,n) for cumulative probabilities.


Statistics

Chapter 9

Sampling Distributions

1. Sampling Distributions:

Knowledge of how sampling distributions are constructed gives more understanding into inferences

about a population based on a sample.

Population

(Usually not known)

Fixed

Sample

(Usually used to infer

population parameters)

Varies

Characteristics: Parameter Statistic

Mean µ 𝑥 Proportion p 𝑝 Standard Deviation σ s

Sampling variability – the value of a statistic varies in repeated random sampling. The variability

decreases as the sample size increases (i.e. there is an inverse relationship between sample size and

variability.)

Sampling distribution of a statistic – the distribution of values of a statistic (e.g. mean, proportion,

standard deviation) of all possible samples of the same size from the same population.

Simplest Example:

Assume a population of only 3 data items that have the values of 1, 2, and 3. If we take a sample size of

2 items with replacement, there are a total of 9 different possible samples that could be taken:

Sample Mean of sample

𝑥 1,1 1

1,2 1.5

1,3 2

2,1 1.5

2,2 2

2,3 2.5

3,1 2

3,2 2.5

3,3 3

Total 18

Number of samples taken 9

Mean of sample means 𝜇𝑥 2

Mean of population 𝜇 2


Histogram of the sample means:

Describing a sampling distribution – when sampling is done randomly, the shape and center of the

sampling distribution will approximate the true population shape and center but the spread of the

sampling distribution (e.g. variance or standard deviation) will decrease with a larger sample size (n).

Unbiased statistic (estimator) – a statistic is unbiased if the mean of its sampling distribution is equal to

the true value of the parameter (population) being estimated.

The variability of a statistic – is the spread of its sampling distribution, determined by the sampling

design and the sample size. A large sample size produces a smaller spread. Provided the population is

much larger than the sample size (N≥10n), the spread of the sampling distribution is the same for any

population size (N).

Example: comparing 2 different populations and 2 different samples (of the same size)

Population A = 1,000,000 Population B = 500,000

Sample size = 1,000 Sample size = 1,000

Variability is the approximately the same !!!

The goal in sampling is to have low bias and low variability. Bias (sampling distribution center ≠

population center) is usually a result of poor sampling techniques while variability (sampling

distribution has a large spread) is usually a result of too small a sample size. Note the analogy of the 4

targets on page 576.

2. Sampling Proportions:

Sample proportion problems are generally used with categorical variables.

𝜇𝑝 = 𝑝 Therefore, 𝑝 is an unbiased estimator of p.

𝜎𝑝 = 𝑝𝑞

𝑛 Therefore, the standard deviation is smaller as n increases.

Checklist:

a. SRS

b. :N≥10n

c. Use the normal distribution to approximate the binomial when BOTH 𝑛𝑝 ≥ 10 𝐴𝑁𝐷 𝑛𝑞 ≥ 10

0

0.5

1

1.5

2

2.5

3

3.5

1 1.5 2 2.5 3

Fre

qu

en

cy

Sample Mean

Distribution of Sample Means


To solve sampling proportion problems:

1. Identify p, n, and 𝑝 . 2. Verify that N≥10n (if this is not true, the sample size is too close to the size of the population so

analyze the parameters themselves)

3. Verify that BOTH 𝑛𝑝 ≥ 10 𝐴𝑁𝐷 𝑛𝑞 ≥ 10, then use a normal curve calculation to estimate the

binomial data.

4. Establish mean and standard deviation for the normal curve by finding 𝜇𝑝 = 𝑝 and 𝜎𝑝 = 𝑝𝑞

𝑛.

5. Evaluate the probabilities based on z-scores (normal distribution).

3. Sampling Means:

Sample means are usually used with quantitative variables. Means of random samples are less variable

than individual observations. Means of random samples are more normal than individual observations.

𝜇𝑥 = 𝜇 Therefore, 𝑥 is an unbiased estimator of 𝜇.

𝜎𝑥 =𝜎

𝑛 Therefore, the standard deviation is smaller as n increases. (e.g. if the sample size

is quadrupled, the standard deviation is halved.

The sampling distribution will be normally distributed if the population is also normally distributed.

Checklist:

a. SRS

b. :N≥10n

c. Population is normal, then n can be any size

d. Population is not normal, n should be “large” usually defined as (𝑛 ≥ 30)

Central Limit Theorem – With a large sample size, the sampling distribution of the sample mean, 𝑥 , is

close to a normal distribution regardless if the population is normally distributed or not.

Recap:

Population Shape Sampling Distribution Shape

Normal Normal regardless of sample size

Any shape (normal or skewed) Small n: shape of the population

Any shape (normal or skewed) Large n: normal (𝑛 ≥ 30)

Simulations of Sampling Distributions:

Assume population that’s normally distributed.

Example:

𝜇 = 64.5 and 𝜎 = 2.5.

1. Create a list in L1.

2. Math Prb #6 to get randNorm(64.5,2.5,100) to generate a list of 100 items with the given mean and

standard deviation.

3. Plot a histogram on L1 data (Plot 1)

4. Plot a boxplot above the histogram on L1 data (Plot 2)

5. Compare 1-Var Stats on L1 with the population mean, median, and standard deviation.

6. Repeat step 2 in L2, L3 and L4.

7. Calculate averages of the 1-Var Stats for lists 1-4. Compare to population parameters.

8. You should verify that the averages of the lists approaches the population parameters.


Statistics

Chapter 10

Inference Means and Proportions

Statistical Inference: methods to draw conclusions about a population based on sample data. Follow the

schema below when inferring data from a sample to draw conclusions about the population.

A. Identify the population and the parameter of interest.

B. Verify conditions to be met for a confidence interval:

1. SRS. This criterion justifies the assumption that the sample is representative of the population.

2. N>=10n (for independence of sample that is really without replacement). This criterion justifies

the use of the standard deviation calculations , σ

n and

pq

n for mean and proportions respectively.

3. See the table below for conditions based on the normality or lack thereof of the population.

These criteria justify the use of a normal curve in the analysis.

Means Proportions

Pop is

normal

Pop is not normal or is unknown Pop is

normal

Pop is not normal or is

unknown

Go to

calculations 𝑛 ≥ 30 𝑛 < 30

Go to

calculations 𝑛𝑝 ≥ 10 AND

𝑛𝑞 ≥ 10

𝑛𝑝 < 10

OR

𝑛𝑞 < 10

Go to

calculations

Sample

distribution

is normal

(boxplot no

outliers) or

(normality

plot)

Sample

distribution

not normal

Go to

calculations

STOP or

proceed

with caveat

sample

may not be

normal

Go to

calculations

STOP or

proceed

with caveat

sample

may not be

normal

All of the conditions should be met to provide the most reliable interval for the population parameters.

Any weaknesses should be noted.

1. Confidence Interval for Means ( known)

Confidence Level: Use critical Z scores based on the level of confidence.

Common Z-scores

Confidence Level Z-score

99% 2.575

95% 1.96

90% 1.645

For other confidence intervals:

Find the z-score for the 4 digit probability

calculated from 1−𝐶

2


Confidence Interval = 𝑠𝑎𝑚𝑝𝑙𝑒 𝑠𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐 ± 𝑚𝑎𝑟𝑔𝑖𝑛 𝑜𝑓 𝑒𝑟𝑟𝑜𝑟

𝑥 ± 𝐸

𝑥 ± 𝑍∗𝜎

𝑛

Conclusion: We are C% confident that the actual population parameter (be specific with context) lies

within the interval [ , ]. Make connections to the data calculations and assumptions and state the

context of the problem. The confidence is in the method of constructing the interval, not the interval

itself.

To find the sample size required to generate a given margin of error with a given confidence level and a

given standard deviation, solve the equation for n. Always round n UP!

2. Confidence Interval for Means ( unknown):

You do not know the population standard deviation, . Therefore, use “s”, the sample standard

deviation because is unknown. Instead of a z-score, use a “t” score! Use row for the degrees of

freedom, df, defined as (n-1). Use the column with the percentage of data in 1-tail based on the given

confidence interval.

Confidence Interval = 𝑠𝑎𝑚𝑝𝑙𝑒 𝑠𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐 ± 𝒔𝒕𝒂𝒏𝒅𝒂𝒓𝒅 𝑒𝑟𝑟𝑜𝑟

𝑥 ± 𝐸

𝑥 ± 𝑡∗𝑠

𝑛


given standard deviation, solve the equation for n. Always round n UP!

Notes on t-scores:and degrees of freedom

a. The t-distribution is

i. Symmetrical

ii. Unimodal (one peak)

iii. Bell-shaped

b. The t-distribution has a larger spread (variance) than the normal distribution and the peak is

lower. This is because there is larger variance in a sample than its population.

c. As the degrees of freedom increase, the closer the t-score will be to the z-score for a given

confidence level.

d. If the degrees of freedom are not on the table, round down to the closest df. This will produce a

wider interval that will capture the true population mean.

e. The interval is correct when the population is normal and is approximately correct for large

values of n when the population is not normal.

An interval is called robust if the calculations are fairly accurate despite an unmet condition (i.e. the

confidence interval is still fairly accurate.)

Outliers will cause calculations to be inaccurate because 𝑥 and s are not resistant to outliers (i.e. outliers

influence 𝑥 and therefore s.)


Practical guidelines for a confidence interval with unknown:

a. SRS is more important than a normal population

b. For 𝑛 < 15 and normal sample distribution (based on boxplot with no outliers or normal

quantile plot that is linear), use the formula x ± t∗s

n, otherwise, STOP.

c. For 𝑛 ≥ 15, use for formula unless there are outliers.

d. For 𝑛 > 30, use the formula even if sample is skewed.

3. Confidence Interval for Proportions:

Use Z-scores for proportion intervals. Follow the schema for all intervals.

Confidence Interval = 𝑠𝑎𝑚𝑝𝑙𝑒 𝑠𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐 ± 𝑚𝑎𝑟𝑔𝑖𝑛 𝑜𝑓 𝑒𝑟𝑟𝑜𝑟

𝑝 ± 𝐸

𝑝 ± 𝑧∗ 𝑝 𝑞

𝑛


given standard deviation, solve the equation for n. Always round n UP!. If you do not know 𝑝 and you

want to find the sample size for a given margin of error, estimate 𝑝 𝑎𝑛𝑑 𝑞 as both 0.5. This will ensure

the largest interval to capture the population parameter.

Recap of formulas for the intervals:

Means Proportions

Population is normal Population not normal

or unknown

𝑥 ± 𝑍∗𝜎

𝑛 𝑥 ± 𝑡∗

𝜎

𝑛

𝑝 ± 𝑧∗ 𝑝 𝑞

𝑛

1 Statistics Chap 11-12

Statistics

Chapter 11-12

Inference: Testing Claim (1 sample) Means and Proportions

1. Basics (11.1 and 11.2)

Significance Test – assess evidence about some claim (hypothesis) about a population parameter.

This process is done indirectly. Rather than test the claim itself (when does one reach the burden of

proof?), you establish a null hypothesis and gather evidence to reject or fail to reject the null hypothesis.

The null hypothesis is set equal to a given value. The alternate hypothesis is the claim usually about the

population itself, either < , > , or the value stated in the null hypothesis. .

Basic logic – an outcome, that would rarely occur if the claim were true, is good evidence that the claim

is false.

General Schema:

A. Identify the population and claim about the parameter of interest.

B. Write the null and alternate hypotheses.

C. Verify conditions to be met (same as for a confidence interval) :

1. SRS. This criterion justifies the assumption that the sample is representative of the population.

2. N>=10n (for independence of sample that is really without replacement). This criterion justifies

the use of the standard deviation calculations , σ

n and

pq

n for mean and proportions respectively.

3. See the table below for conditions based on the normality or lack thereof of the population.

These criteria justify the use of a normal curve in the analysis.

Means Proportions

Pop is

normal

Pop is not normal or is unknown Pop is

normal

Pop is not normal or is

unknown

Go to

calculations 𝑛 ≥ 30 𝑛 < 30

Go to

calculations 𝑛𝑝 ≥ 10 AND

𝑛𝑞 ≥ 10

𝑛𝑝 < 10

OR

𝑛𝑞 < 10

Go to

calculations

Sample

distribution

is normal

(boxplot no

outliers) or

(normality

plot)

Sample

distribution

not normal

Go to

calculations

STOP or

proceed

with caveat

sample

may not be

normal

Go to

calculations

STOP or

proceed

with caveat

sample

may not be

normal


D. Perform calculations

a. Test statistic

i. For means: use either 𝑍𝑡𝑒𝑠𝑡 =𝑥 −𝜇0

𝜎

𝑛

or 𝑡𝑡𝑒𝑠𝑡 =𝑥 −𝜇0

𝑠

𝑛

ii. For proportions use 𝑍 =𝑝 −𝑝0

𝑝0𝑞0

𝑛

b. P-value – the p-value is the area for the test statistic. Remember to multiply the area by 2 for

a 2 tailed-test.

E. Interpret results

a. Interpret p-value or make a decision about H0 using statistical significance ( value).

i. If the p-value < , then reject H0. This means the test statistic is inside the critical

area!!!

ii. If the p-value > , then fail to reject H0. This means the test statistic is outside the

critical area!!!

b. Write conclusion using connection to your calculations and context of the particulars of the

scenario.

Hypotheses:

a. H0 –. Tests are conducted to find the strength of evidence AGAINST the null hypothesis. The

null hypothesis always has an element of equality. It is usually phrased “no change” or “no

effect” from historical values.

b. Halt – alternate hypothesis is the claim itself. It may be one or two sided.

i. One sided (either . < or >).

ii. Two sided (≠). This includes both greater than and less than.

Calculations:

a. Compare the value of the parameter in the null hypothesis with the estimate of the parameter from

the sample data.

b. Values of the estimate that are far from the parameter in the direction specified in Halt, is evidence

AGAINST Ho.

c. Standardize the estimate. 𝑡𝑒𝑠𝑡 𝑠𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐 =𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒 −ℎ𝑦𝑝𝑜𝑡 ℎ𝑒𝑠𝑖𝑠 𝑣𝑎𝑙𝑢𝑒

𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 𝑜𝑓 𝑡ℎ𝑒 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒’

Test statistic 𝑍 =𝑥 −𝜇0

𝜎

𝑛

P-Values:

H0 states the null hypothesis that we are seeking evidence against (in order to accept the claim about the

population parameter.) The test statistic (Z-score or t-score) measures how much the sample data

diverges from H0. If the test statistic is large AND in the direction suggested by Halt, we have data that

are unlikely if H0 were true.

P-value is the probability associated with the test statistic (Z-score or t-score). It is computed assuming

H0 is true, that observed outcomes would take a value as extreme or more extreme than actually

observed. The smaller the p-value, the stronger the evidence AGAINST H0 (observed is unlikely to

occur if H0 were true; therefore, H0 is rejected as evidence that the claim is true.)


Statistical Significance:

If the p-value is ≤ 𝛼, then the data are statistically significant at the given 𝛼 level. The data value is

unlikely to occur by chance if H0 were true; therefore, reject H0. It is VERY important to establish

levels BEFORE collecting data.

Interpretation:

Interpret the p-value and/or make a decision about H0 (reject or fail to reject) using the given level.

State your conclusion, referencing a connection to your calculations, and provide context to the specifics

of the given problem.

2. 11.3 Use and Abuse of Tests

The purpose of a significance test – degree of evidence provided by the sample against the null

hypothesis. Issues to consider:

1. How small should p-value be to show convincing evidence against H0?

2. How plausible is H0?

3. What are the consequences of rejecting H0?

When taking a large sample, even small deviations from H0 may be significant.

Take note in your conclusions:

1. Small sample sizes

2. Hawthorne effect (an effect not considered in the study may be the factor that influenced the

effect detected, not the factor being studied)

3. In multiple analyses, the laws of chance may result in an effect in an analysis but not the

other analyses.

Basic guidelines:

1. Establish a hypothesis

2. Design a study to search specifically for evidence of an effect on the hypothesis

3. If the result is statistically significant, then there is evidence to make a conclusion about the

hypothesis

3. 11.4 Using Inference to Make a Decision:

Truth

Decision

Reject H0 Fail to reject H0

H0 is true Type 1 error = correct

H0 is false correct Type 2 error =

Power = 1 –

The calculation of the power of a test requires the calculation of a Type 2 error (). The calculation of

, is dependent on the estimated value of the alternate value of 0.

The power of a test is the probability that for a given level, will reject H0 when H0 is false (i.e. the Halt

is true). Power measures the ability to detect an alternate hypothesis. A high power is desirable ( ≥80%

for an level of .05).


4 ways to increase power (i.e. decrease , a Type 2 error for a given level)

1. Increase sample size n (this decreases both and )

2. Increase ( and are inversely related)

3. Increase the difference between alt and 0

4. Decrease standard deviation


1. Displaying Distributions with Graphsmsgilfordmath.weebly.com/uploads/2/7/7/5/27753231/all... · 2018. 9. 10. · 3 Statistics Chap 01.docx the 3rd quartile, is also the 75th percentile.

Documents