Statistics
Chapter 1
Exploring Data
1. Displaying Distributions with Graphs:
Graphs for categorical variables –
Pie charts – must include all the categories that make up the whole. Pie charts are used
to show each category’s relation (%) to the whole.
Bar graphs – compare any set of quantities that are measured in the same units,
including but not limited to percents of a whole.
Graphs of distributions of quantitative variables –
Stem plots – display the shape of the distribution and the actual data values in the
graph; better for smaller data sets. The leaf is the right-most digit; the stem is the
digits to the left of the leaf. Write the stems in a vertical column with the leaves in
ascending order out from the stem. A back-to-back stem plot compares 2 distributions
using the same stems: leaves run out from the stem to the left for the left data set and
to the right for the right data set. For large numbers, the leaf is sometimes trimmed by
using only its left digit (e.g. 34,850 has a stem of 3 and a leaf of 4). Stems may be
split so that one stem holds leaves 0-4 and the next holds leaves 5-9; splitting the
stems stretches the distribution.
Histograms – break the range of values of a variable into classes and display only the
count or the percent of the observations that fall into each class. Classes must be of
equal width. Typically 5 to 20 classes will fit most data sets. A frequency histogram
will have the same shape as a relative frequency (%) histogram given the same class
width; only the label of the y-axis will be different. Unlike bar graphs, histograms
have no space between bars (a gap appears only when a class has a zero count).
Examining Distributions – look for the overall pattern and for striking deviations from that
pattern. The pattern of a distribution is described by its shape, center, and spread. An
outlier is an individual value that falls outside the overall pattern. One measure of
center is the midpoint where ½ the data items are above that value and ½ the data items
are below that value. Range (largest value – smallest value) is one measure of the
spread of a distribution. Stem plots and histograms display the shape of a distribution.
A one-peaked distribution is called unimodal. A distribution is symmetric if the data
values smaller and larger than its midpoint are mirror images of each other. It is
skewed right if the right tail is much longer than the left tail.
Outliers – Data points far beyond the rest of the data. Outliers should be investigated. The
omission of outliers should be justified by the investigation.
Relative Frequency and Cumulative Frequency – a relative frequency histogram converts the
counts on the y-axis to percentages. A relative cumulative frequency graph is an ogive. To
construct an ogive, first build a relative frequency table (establish classes of equal widths and
tabulate the data within classes), then accumulate the relative frequencies from the lowest class
upward and plot the running total at the upper boundary of each class. An ogive displays the
percent of data at or below a given data value, so it shows the relative standing of data items.
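As a concrete illustration (not from the text), a minimal Python sketch with made-up data that tabulates equal-width classes and accumulates relative frequencies the way an ogive does:

```python
# Sketch (hypothetical data): class counts and cumulative relative
# frequencies, the values an ogive plots.
data = [62, 65, 67, 68, 70, 71, 71, 73, 74, 76, 78, 81]
lo, width, k = 60, 5, 5              # classes 60-64, 65-69, ..., 80-84

counts = [0] * k
for x in data:
    counts[(x - lo) // width] += 1   # tabulate each class (equal widths)

cum = 0.0
for i, c in enumerate(counts):
    cum += c / len(data)             # running total of relative frequency
    print(f"class {lo + i*width}-{lo + (i+1)*width - 1}: "
          f"rel freq {c/len(data):.2f}, cum rel freq {cum:.2f}")
```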
Time plots – plots each observation against the time at which it was measured. Time is plotted
on the horizontal axis. Connect the dots to clearly display changes over time.
Recap of graphs:
Pie Chart – Purpose: shows composition of the total. Data type: categorical variables.
Strength: quick visibility of each category's % of the total.
Bar Graph – Purpose: shows quantitative data on categorical variables; does not necessarily
sum to 100%. Data type: categorical variables. Strength: quick visibility of the quantitative
variable.
Stem Plot (stem and leaf) – Purpose: distribution of data, including the actual data items.
Data type: quantitative variables, small data sets. Strength: center, spread, shape (mode,
median, range, symmetry).
Histogram – Purpose: distribution of data as frequencies. Data type: quantitative variables,
large data sets. Strength: center, spread, shape (mode, median, range, symmetry).
Relative Frequency Histogram – Purpose: distribution of data as percents. Data type:
quantitative variables, large data sets. Strength: center, spread, shape (mode, median, range,
symmetry).
Ogive – Purpose: histogram of cumulative relative frequency; displays standing (probability
distribution). Data type: quantitative variables, large data sets. Strength: relative standing
(% of data above or below, percentile).
Time Plot – Purpose: displays the effect of time on a quantity. Data type: quantitative
variables, large or small data sets. Strength: trends of data over time.
2. Describing Distributions with Numbers:
Measures of Center: Mean and Median
Mean is the average: the mean of a sample is $\bar{x} = \frac{\sum x_i}{n}$. The mean of a population is
$\mu = \frac{\sum x_i}{N}$. The mean is sensitive to extreme values; it is not a resistant measure of center.
Median (denoted by M or Q2) is the midpoint of the data set so that ½ the data items are above
the median and ½ the data items are below the median. To find the median, order the data set
from low to high; the median is the data item in position $\frac{n+1}{2}$. (e.g. For a data set of 7
items the median is the 4th data item. For a data set of 8 items the median is ½ way between
the 4th and 5th data items.) The median is more resistant to extreme values than the mean.
Measures of Spread: Range, Percentile, Quartiles
Range is the distance between the most extreme values (i.e. highest – lowest). One weakness
in range is that the extreme values may be outliers.
Percentile – a data item in the pth percentile means that there are p percent of the data at or
below that data item. The median is the 50th percentile, also known as the 2nd Quartile (Q2).
The Quartiles, Q1 and Q3: Q1, the first quartile, is also the 25th percentile. Q1 is the median of
the lower ½ of the distribution, which is bisected by Q2, the median of the entire distribution.
Q3, the 3rd quartile, is also the 75th percentile. Q3 is the median of the upper ½ of the
distribution. When finding Q1 and Q3, the median itself is part of neither half. Caution: some
software calculates the quartiles differently (e.g. TPS3e pg 77-78).
Examples:
Data Set A:
19 21 22 23 24 25 26 27 28 29 30 31 32
Data Set B:
19 21 22 23 24 25 26 27 28 29 30 31
Data Set C:
19 21 22 23 24 25 26 27 28 29 30
Data Set D:
19 21 22 23 24 25 26 27 28 29
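For instance, a minimal Python sketch of this quartile convention applied to Data Sets A and B above (recall the caution that software may compute quartiles differently):

```python
# Sketch: median and quartiles with the median left out of both halves,
# as described above. Software may use a different convention.
def median(v):
    v = sorted(v)
    n, mid = len(v), len(v) // 2
    return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2

def five_number(v):
    v = sorted(v)
    n = len(v)
    half = n // 2                    # drops the median itself when n is odd
    return v[0], median(v[:half]), median(v), median(v[n - half:]), v[-1]

set_a = [19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
set_b = [19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
print("A:", five_number(set_a))      # (19, 22.5, 25, 29.5, 32)
print("B:", five_number(set_b))      # (19, 22.5, 25.5, 28.5, 31)
```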
Measures of Center and Spread: the 5-Number Summary (minimum, Q1, M, Q3, and
maximum). These 5 values are plotted on a box plot (box and whisker). The median, M or
Q2, shows the center of the distribution. The quartiles show the spread of the center ½ of the
data but do not give any information on extreme values. The max and min values show the
spread of the entire data set.
Example for Data Set B
19 21 22 23 24 25 26 27 28 29 30 31
Use the box plot to locate median, the spread of the entire distribution (extreme values), and
the spread of the center ½ of the data set. Note: the median is not necessarily in the center of
the box.
The Interquartile Range (IQR) is the distance of the center ½ of the data. IQR = (Q3 – Q1).
The interquartile range is used to identify outliers.
Outlier – a data item more than 1.5×IQR below Q1 or more than 1.5×IQR above Q3.
Q1 − 1.5×IQR establishes the lower fence: any value below it is an outlier.
Q3 + 1.5×IQR establishes the upper fence: any value above it is an outlier.
TI-84: press 2nd STAT PLOT to access statistics graphs.
Variance (s²) and Standard Deviation (s) are measures of spread (i.e. how far the observations
are from their mean). Variance: $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$. Standard Deviation: $s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$.
Note that the standard deviation is the square root of the variance. Variance is the average
squared deviation from the mean; standard deviation is the square root of that average. Both s
and s² will be large for disperse data sets (data items far from the mean) and small for compact
data sets (data items clustered about the mean).
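A minimal Python sketch of these definitions on hypothetical data (Python's built-in statistics.stdev would give the same s):

```python
# Sketch: sample variance and standard deviation from the definitions
# above (data are hypothetical).
from math import sqrt

data = [4, 7, 8, 9, 12]
n = len(data)
xbar = sum(data) / n                               # mean
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)  # variance s^2
s = sqrt(s2)                                       # standard deviation s
print(f"mean = {xbar}, s^2 = {s2}, s = {s:.3f}")   # mean = 8.0, s ~ 2.915
```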
Properties of Standard Deviation:
s measures spread about the mean and should be used only when the center is measured by the
mean.
s = 0 only when there is no spread or variability, which happens only when all observations
have the same value. As the observations become more spread out about their mean, s
gets larger.
s, like $\bar{x}$, is not resistant. A few outliers can make s very large.
Choosing Measures of Center and Spread:
The 5-number summary is usually better than the mean and standard deviation for describing a
skewed distribution or a distribution with strong outliers. Use $\bar{x}$ and s only for reasonably
symmetric distributions that are free of outliers. Because you need to determine if the data set
is skewed, always plot the data!
Changing the Unit of Measure:
Linear transformations of a data set:
$x_{new} = a + bx$, where a shifts the data up (a positive) or down (a negative) and b changes the
unit of measure.
Impact of $x_{new} = a + bx$ on measures of center and spread:
Adding a: measures of center ($\bar{x}$, Q1, M, Q3) shift by a; measures of spread (IQR, s) do not change.
Multiplying by b: measures of center are multiplied by b; measures of spread are multiplied by b.
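A quick Python check of these rules, using the familiar Celsius-to-Fahrenheit transformation (a = 32, b = 1.8) on hypothetical readings:

```python
# Sketch: x_new = a + b*x. Center shifts by a and scales by b; spread
# ignores a and scales by b.
import statistics as st

c = [2, 4, 4, 6, 9]                     # hypothetical Celsius readings
a, b = 32, 1.8
f = [a + b * x for x in c]              # Fahrenheit

print(st.mean(f), a + b * st.mean(c))   # both 41.0: center shifts and scales
print(st.stdev(f), b * st.stdev(c))     # equal: spread scales, no shift
```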
Comparing Distributions:
Use side-by-side bar graphs for categorical variables.
Use back-to-back stemplots and boxplots for quantitative variables.
Use the graphs to interpret shape, center, spread, and outliers.
Calculate mean and median for center. Calculate 5-number summary, standard deviation, and
outliers for spread.
Citing your calculations, write your conclusions about the data addressing shape, center,
spread, and outliers. Make sure your writing is in context with the specific situation of the
problem.
Statistics
Chapter 2
Location in a Distribution
1. Measures of Relative Standing and Density Curves:
Relative Standing:
Z-score: a standard score, the number of standard deviations and the direction a data item is
from the mean in a given distribution.
$z = \frac{x - \bar{x}}{s}$ for a sample; $z = \frac{x - \mu}{\sigma}$ for a population.
Percentiles: percent of the observations less than (or equal to) the given observation. Some
definitions of percentile calculate only data less than the given data item.
Chebyshev's Inequality: in any distribution (i.e. even skewed ones), the proportion of
observations falling within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$.
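A small Python sketch of both ideas, with hypothetical values for the mean and standard deviation:

```python
# Sketch: z-score and Chebyshev's bound (all values hypothetical).
xbar, s = 80, 6                # mean and standard deviation
x = 92
z = (x - xbar) / s             # 2 standard deviations above the mean
print(f"z = {z}")

k = 2
bound = 1 - 1 / k**2           # Chebyshev: holds for ANY distribution
print(f"at least {bound:.0%} of observations lie within {k} s.d. of the mean")
```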
Density Curves
A density curve is a curve that
Is always on or above the horizontal axis AND
Has area exactly 1 underneath it (i.e. the area under the curve represents 100% of the data)
A density curve shows the proportion of data either above, below, or between given data
values.
Do the following when analyzing univariate data (i.e. one variable):
1. Plot the data (histogram or stemplot).
2. Evaluate the shape, center, spread and outliers.
3. Calculate a numerical summary to describe center and spread. Use mean and standard
deviation for symmetrical distributions. Use median and IQR for skewed distributions.
4. With large data sets, smooth the histogram with a continuous curved line. This curve is
a mathematical model for the distribution, an overall description that ignores minor
irregularities.
The median of a density curve is the point that separates the area under the curve into 2 equal halves.
The mean of a density curve balances the distribution. The mean will be to the right of the
median in a right skewed distribution (right tail longer.) The mean will be to the left of the
median in a left skewed distribution (left tail longer.)
Notation for mean and standard deviation: a data set of observations uses $\bar{x}$ and s; a
density curve (an idealized model) uses $\mu$ and $\sigma$.
Uniform Distribution – The frequency is flat. The shape is rectangular. The area is still 1. See
exercise 2.10 page 128.
2. Normal Distributions:
Normal curves are symmetric, single-peaked (unimodal), and bell-shaped. Normal curves are:
Good descriptions for some distributions of real data.
Good approximations to the results of many kinds of chance outcomes.
Statistical inference procedures based on normal distributions work well for other
roughly symmetric distributions.
FYI, the equation for a normal density curve is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
The inflection points of the curve occur at $\mu \pm 1\sigma$.
$N(\mu, \sigma)$ is the notation for a normal distribution with mean $\mu$ and standard deviation $\sigma$.
The 68-95-99.7 Rule (Empirical Rule): in a normal distribution, approximately 68% of the
observations fall within $1\sigma$ of the mean, approximately 95% within $2\sigma$, and
approximately 99.7% within $3\sigma$.
The Standard Normal Distribution:
Mean is 0 and standard deviation is 1. The standard normal table gives the area
(probability) below (to the left of) a given z-score. Therefore, to obtain the area above (to the
right of) a given z-score, calculate $1 - P(Z < z)$. To obtain the area between 2 z-scores,
calculate $P(Z < z_R) - P(Z < z_L)$.
Normal Distribution Calculations:
Follow these steps to solve normal distribution problems:
1. Draw a picture and shade according to the wording of the problem. Record $\mu$ and $\sigma$ and
the position of the given data value(s).
2. Convert data score(s) to z-scores.
3. Use the table or a calculator (normalcdf) in combination with the shading to determine area
(probability).
4. Write a conclusion in context with the particulars of the given problem.
Sometimes in a given problem, the probability (area) is known and the goal is to find the data
score associated with the given probability. Use the table in reverse to find the z-score, then
solve for the unknown data score.
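For readers working outside the calculator, a sketch of these steps using scipy.stats.norm, whose cdf and ppf play the roles of normalcdf and invNorm (all values here are hypothetical):

```python
# Sketch: the normal-calculation steps with scipy.stats.norm.
from scipy.stats import norm

mu, sigma = 100, 15
x = 130
z = (x - mu) / sigma                      # step 2: convert to a z-score

print(norm.cdf(z))                        # area below x
print(1 - norm.cdf(z))                    # area above x
print(norm.cdf(1) - norm.cdf(-1))         # area between z = -1 and z = 1

# Reverse problem: area known, data score unknown.
z90 = norm.ppf(0.90)                      # z-score with area 0.90 below it
print(mu + z90 * sigma)                   # data score at the 90th percentile
```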
Assessing Normality:
Method 1 (histogram)
1. Draw a histogram or stemplot. Is the curve unimodal, symmetrical, and bell-shaped?
2. Mark off the points $\bar{x}$, $\bar{x} \pm s$, and $\bar{x} \pm 2s$ on the horizontal axis.
3. Compare the counts of the distribution with the Empirical Rule (68-95-99.7).
Method 2 (normal probability plot)
1. Sort the data from low to high. Record the percentiles for each data point.
2. Find the z-scores for each of the percentiles from step 1.
3. Plot the z-scores against the data scores. If the distribution is normal, the plots will form a
straight line. Systematic deviations indicate that the distribution is non-normal. Outliers
will lie far from the overall pattern.
4. In a right skewed distribution, the largest observations fall distinctly above a line drawn
through the main body of points.
5. In a left skewed distribution, the smallest observations fall distinctly below the line.
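A minimal sketch of Method 2 in Python, assuming scipy and matplotlib are available; the plotting position (i − 0.5)/n for the percentile of the i-th ordered value is one common choice (the text does not specify one):

```python
# Sketch: a normal probability plot built by hand, following Method 2.
# Data are hypothetical.
import matplotlib.pyplot as plt
from scipy.stats import norm

data = sorted([4.1, 4.4, 4.6, 4.8, 5.0, 5.1, 5.3, 5.6, 5.9, 6.5])
n = len(data)
percentiles = [(i - 0.5) / n for i in range(1, n + 1)]  # step 1
z = [norm.ppf(p) for p in percentiles]                  # step 2

plt.scatter(data, z)               # step 3: roughly linear => near normal
plt.xlabel("data value")
plt.ylabel("expected z-score")
plt.show()
```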
Statistics
Chapter 3
Examining Relationships (Bivariate Data)
Introduction: To understand a statistical relationship between two variables, measure both variables on
the same individuals. Caution: the relationship between two variables can be strongly
influenced by other variables that are lurking in the background. Categorical variables often
are present that have an influence on the relationships. Identify the explanatory variable (x-
axis, input, independent) and the response variable (y-axis, output, dependent).
1. Scatterplots and Correlation:
Scatterplots:
The relationship between 2 quantitative variables measured on the same individuals. Each
point represents an individual with the coordinates of the point (explanatory value, response
value). To graph a scatterplot, do the following:
a. The explanatory variable is on the x-axis, the response variable on the y-axis.
b. Label both axes!!!
c. Scale the intervals on the axes so that the intervals are uniform.
d. Use a large enough grid so that the details can be readily identified.
To interpret a scatterplot, identify patterns and deviations from the patterns.
a. Overall pattern and striking deviations from the pattern.
b. Direction (positive or negative), Form (linear, curved), Strength (how closely the
points follow a clear form)
c. Outlier (individual value that falls outside the overall pattern of the relationship.)
To add a categorical variable to a scatterplot, use a different plotting color for each category.
Correlation:
A scatterplot can be misleading in determining direction, form, and strength if the scales are not
properly set. Therefore, the calculation of the correlation coefficient, r, is used to interpret the
direction and strength of a linear relationship between 2 quantitative variables.
$r = \frac{1}{n-1}\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$, or more simply $r = \frac{1}{n-1}\sum z_x z_y$ (the average product of
z-scores). A positive r indicates a positively (upward) sloping line and a negative r indicates a
negatively (downward) sloping line. The value of r lies in $[-1, 1]$. The closer r is to either −1
or 1, the stronger the linear relationship. The closer r is to 0, the weaker the relationship
between the 2 variables.
Features of correlation:
Correlation makes no distinction between explanatory and response variables.
r does not change if the units of measure change for either x or y or both. The
correlation, r, has no unit of measure itself.
Positive r indicates positive association between the variables, and negative r indicates
negative association.
The correlation coefficient, r, will always be between −1 and 1: $-1 \le r \le 1$.
Correlation of bivariate data:
Correlation requires that both variables are quantitative.
Correlation only describes the strength of linear relationships.
Correlation, r, is not resistant to extreme values.
Correlation is not a complete summary of bivariate data. Report also the means and
standard deviations of both variables ($\bar{x}$, $s_x$, $\bar{y}$, $s_y$).
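A short Python sketch of the defining formula, computing r as the sum of z-score products divided by n − 1 for a hypothetical data set:

```python
# Sketch: r from the formula above (data are hypothetical).
from math import sqrt

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sx = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
sy = sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))

r = sum((x - xbar) / sx * ((y - ybar) / sy)
        for x, y in zip(xs, ys)) / (n - 1)
print(f"r = {r:.3f}")               # unitless; always between -1 and 1
```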
2. Least-Squares Regression Line and Residuals:
Regression requires an explanatory and a response variable. A regression line describes how a
response variable, y, changes as an explanatory variable, x, changes. Regression lines are used
to predict the value of y for a given value of x. The value of the slope is no indication of the
strength of the relationship. The slope will change if the units of measure of the variable(s)
changes. The strength of the linear relationship is the value of r. Extrapolation is the use of a
regression line for predicting y for values of x outside the data range. Extrapolation is often not
accurate!
The least-squares regression line is the line that minimizes the sum of the squared vertical
distances of the y values of data points from the y values of the line. The equation of the least-
squares line is $\hat{y} = a + bx$, where a is the y-intercept and b is the slope. Alternate form:
$\hat{y} - \bar{y} = r\frac{s_y}{s_x}(x - \bar{x})$.
The slope is calculated as $b = r\frac{s_y}{s_x}$, and the intercept as $a = \bar{y} - b\bar{x}$.
The regression line goes through the point $(\bar{x}, \bar{y})$.
Remember, $\hat{y}$ is the predicted value for y, given an input value of x. It is better to use values of
x within the range of the x data values (interpolation) rather than outside the range
(extrapolation).
Residuals are the differences between the observed value of the response variable and the
value predicted by the regression line: residual = $y - \hat{y}$. The sum of the least-squares
residuals is always zero. A residual plot is a scatterplot of the regression residuals against
either the x values or the predicted y values.
Residual plots show how well a regression line fits the data.
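A minimal Python sketch of these formulas on hypothetical data, confirming that the residuals sum to zero (statistics.correlation requires Python 3.10+):

```python
# Sketch: slope, intercept, and residuals from the formulas above.
import statistics as st

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

r = st.correlation(xs, ys)
b = r * st.stdev(ys) / st.stdev(xs)     # slope b = r * sy/sx
a = st.mean(ys) - b * st.mean(xs)       # line passes through (x-bar, y-bar)

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"y-hat = {a:.2f} + {b:.2f}x")
print("residuals:", residuals, "sum:", round(sum(residuals), 10))  # sum = 0
```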
Features of residual plots:
Residual plots should show no obvious pattern. A curved pattern would indicate that
the relationship is not linear. A “fan shaped” pattern indicates the prediction value is
less accurate for x values at the larger residual side of the plot.
The residuals should be relatively small in size.
The standard deviation of the residuals, $s = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}$, measures the typical prediction
error for the regression line.
The coefficient of determination, $r^2$, is the fraction of the variation in the y values that is
explained by the least-squares regression line (that is, due to the linear relationship of x and y).
Facts about least squares linear regression:
The distinction between explanatory and response variables is essential.
There is a close connection between correlation and slope: $b = r\frac{s_y}{s_x}$. A change of one
standard deviation in x corresponds to a change of r standard deviations in y.
The least squares regression line always passes through the point $(\bar{x}, \bar{y})$.
The correlation, r, describes the strength of the linear relationship.
The square of the correlation, r2 or the coefficient of determination, is the fraction of the
variation that is explained by the least squares regression line. It is the proportion of the
variation that is due to the linear relationship of x and y.
3. Correlation and Regression Wisdom:
Overall facts:
Correlation and regression describe only linear relationships.
Extrapolation (using values outside the domain of the data) often produces unreliable
predictions.
Correlation is not resistant.
Outliers – observations that lie outside the overall pattern (beyond 1.5×IQR). Outliers in the y
value produce large residuals. Outliers in the x value may or may not produce large residuals.
Remember, by definition, residuals are $y - \hat{y}$. Outliers in the y direction are not necessarily
influential. Outliers in the x direction are often influential.
Influential observations – observations that markedly affect the calculation of the regression
line. Points that are outliers in the x direction are often influential and therefore affect the
regression line (pull the regression line toward the influential point.)
Lurking variables – variable(s) that are neither the explanatory nor the response variable, yet
they may influence the relationship between the explanatory and response variables. Lurking
variables can create a correlation or they can hide a correlation.
Correlations based on averaged data are often higher than correlations for individuals, so they
are not reliable for analyzing individuals.
4. Chapter Summary:
a. Plot the data on a scatterplot
b. Evaluate the form, direction and strength from the scatterplot
c. Calculate numerical summaries (e.g. $\bar{x}$, $s_x$, $\bar{y}$, $s_y$, and r).
d. Calculate a regression line $\hat{y} = a + bx$.
e. Calculate the strength of the linear relationship, and how well the regression line fits the
data (e.g. residuals and r2.)
Statistics Chapter 4
Exponential and Power Relationships on Bivariate Data
1. Transforming to Achieve Linearity:
The original data plot may not show a linear relationship; however, a function of the data may
be linearly related. Applying a function to the data (e.g. log or square root) is called
transforming, or re-expressing, the data. Transforming changes the units of measure on the
original data. Common transformations in statistics use linear functions, powers (both positive
and negative exponents), or logarithms.
Steps to follow for bivariate relationships:
1. For a linear relationship:
a. Create a scatterplot of the data (x, y). What is the form?
b. LinReg on (x, y), noting the equation and the values of r and r².
i. LinReg L1, L2, Y1
c. Plot the residuals. If there is a pattern, then a linear relationship is NOT the best fit.
i. L3 = Y1(L1) to put $\hat{y}$ into L3.
ii. L4 is the residual: residual = $y - \hat{y}$ = L2 − L3.
iii. The scatterplot of the residuals is L1, L4.
2. For an exponential relationship:
a. Create a scatterplot of (x, ln(y)). What is the form?
i. One way to identify an exponential relationship: the ratio of successive y values,
$\frac{y_{n+1}}{y_n}$, is constant for uniform increments of x.
b. LinReg on (x, ln(y)), noting the equation and the values of r and r².
i. LinReg L1, L3, Y2
c. Plot the residuals. If there is a pattern, then an exponential relationship is NOT the best fit.
i. Lists: L1 = x (data), L2 = y (data), L3 = ln(y), L4 = $\widehat{\ln y}$ = Y2(L1),
L5 = residual = $\ln(y) - \widehat{\ln y}$ = L3 − L4.
ii. The scatterplot of the residuals for the exponential regression is L1, L5.
d. Transform the equation from part b to exponential form (see the sketch below).
i. The linear equation will be in the form $\ln y = a + bx$.
ii. Transform as follows: $y = e^{a+bx} = e^a \cdot e^{bx} = e^a \cdot (e^b)^x$.
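A Python sketch of this exponential procedure on hypothetical data, with the least-squares fit done via b = r·s_y/s_x in place of the calculator's LinReg (statistics.correlation requires Python 3.10+):

```python
# Sketch: exponential fit by transforming to (x, ln y), fitting a line,
# then back-transforming. Data are hypothetical.
from math import exp, log
import statistics as st

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.2, 7.9, 16.5, 31.8]        # ratios of successive y ~ constant
ln_y = [log(y) for y in ys]

r = st.correlation(xs, ln_y)
b = r * st.stdev(ln_y) / st.stdev(xs)
a = st.mean(ln_y) - b * st.mean(xs)

print(f"ln(y-hat) = {a:.3f} + {b:.3f}x")            # linear form
print(f"y-hat = {exp(a):.3f} * {exp(b):.3f}^x")     # exponential form
```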
3. For a power relationship:
a. Follow the same steps as for an exponential relationship EXCEPT you will also need to take
a logarithm of the explanatory variable (x) as well as the response variable (y).
b. LinReg on (ln(x), ln(y)), noting the equation and the values of r and r².
i. LinReg L3, L4, Y3, where L3 = ln(L1) and L4 = ln(L2)
c. Plot the residuals. If there is a pattern, then a power relationship is NOT the best fit.
i. Lists: L1 = x (data), L2 = y (data), L3 = ln(x), L4 = ln(y), L5 = $\widehat{\ln y}$ = Y3(L3),
L6 = residual = $\ln(y) - \widehat{\ln y}$ = L4 − L5.
ii. The scatterplot of the residuals for the power regression is L3, L6.
d. Transform the equation from part b to power form (see the sketch below).
i. The linear equation will be in the form $\ln y = a + b \ln x$.
ii. Transform as follows: $y = e^{a + b \ln x} = e^a \cdot x^b$.
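The same idea for the power case, again on hypothetical data:

```python
# Sketch: power fit by transforming to (ln x, ln y) and back-transforming
# to y-hat = e^a * x^b. Data are hypothetical (roughly y = 3x^2).
from math import exp, log
import statistics as st

xs = [1, 2, 3, 4, 5]
ys = [3.1, 12.2, 26.8, 49.5, 74.0]
ln_x = [log(x) for x in xs]
ln_y = [log(y) for y in ys]

r = st.correlation(ln_x, ln_y)
b = r * st.stdev(ln_y) / st.stdev(ln_x)
a = st.mean(ln_y) - b * st.mean(ln_x)

print(f"ln(y-hat) = {a:.3f} + {b:.3f} ln(x)")       # linear form
print(f"y-hat = {exp(a):.3f} * x^{b:.3f}")          # power form
```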
Handy hints: write down the definitions of the contents of L1 through L6 and the definitions of
scatterplots 1 through 4. These notes are especially important when calculating and plotting
residuals.
Recap: a power model has linear form $\ln y = \ln a + b \ln x$ and power form $y = a \cdot x^b$; the
shape of the power graph depends on whether the exponent b is negative, between 0 and 1, or
greater than 1. An exponential model has linear form $\ln y = \ln a + bx$ and exponential form
$y = a \cdot b^x$.
For exponential growth, each succeeding y value grows by a fixed % of the previous y value for uniform increments of x. If a variable grows exponentially, then its log grows linearly.
2. Relationships Between Categorical Variables:
Always sum the rows and columns for categorical data. The marginal distribution shows the %
of a variable relative to the TOTAL distribution. A conditional distribution shows the % of a
subset of the data relative to the total of that subset (i.e. you are considering only a particular
row or column).
Simpson's Paradox – an association that holds for each of several groups can reverse direction
when the data are combined into a single group (an example of the effect of lurking variables).
3. Establishing Causation:
A strong association is not enough to prove cause and effect. Even when direct causation is
present, it is rarely a complete explanation of an association between 2 variables. Even well-
established causal relationships may not generalize to other settings (e.g. rats to people). A
common response to a third variable may explain changes in 2 associated variables. 2 variables
are confounded when their effects on a response variable cannot be distinguished from each
other. The confounded variables may be either explanatory or lurking.
If an experiment is not possible, then look for the following to establish causation:
1. A strong association.
2. A consistent association.
3. Larger values of the alleged cause are associated with stronger responses (a dose-response
relationship).
4. The alleged cause precedes the effect in time.
5. The alleged cause is plausible.
Statistics
Chapter 5
Data Production
1. Designing Samples:
Population – the entire group of individuals
Sample – part of the population
Sampling – studying a part in order to gain information about the whole.
Census – studying every individual in the population.
Types of samples:
Voluntary Response – individuals decide whether to participate in a general appeal for data.
Reliability: weak (biased).
Convenience – select the individuals who are easiest to contact.
Reliability: weak (biased).
Systematic – select every nth item.
Reliability: weak if not done properly; subject to variations in the sequence of the data set.
Simple Random Sample (SRS) – select n individuals from a population in such a way that
every set of n individuals has an equal chance of being selected.
Reliability: strong (objective – no bias).
Biased sampling method – those that systematically favor certain outcomes. They have weak
reliability.
Random samples may be chosen by using a systematic random sampling technique:
Random digits using computer software
TI-84: MATH → PRB → 5:randInt(1, ending number)
Random digits using a Table where:
each digit is equally likely to be chosen and
the digits are independent of one another (no pattern to the digits)
Steps to choose an SRS:
1. Label – assign a unique numerical value to each individual in the population
2. Table – use a Random Digit Table to select labels at random
3. Stopping rule – establish the total sample size
4. Identify sample – link the numerical values selected to the individuals.
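A minimal Python sketch of these steps, with random.sample standing in for the random digit table (the population labels are hypothetical):

```python
# Sketch: choosing an SRS of n = 5 from a labeled population.
import random

population = [f"student_{i:02d}" for i in range(1, 41)]  # step 1: label 01-40
srs = random.sample(population, 5)  # steps 2-4: every set of 5 equally likely
print(srs)
```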
Pg 326 Case Closed
A probability sample is a sample chosen by chance; the probability of each possible sample
must be known.
The use of chance to select the sample is the essential principle of statistical sampling.
A stratified random sample – the population is divided into groups of similar individuals,
called strata, based on important characteristics. Then choose an SRS from each stratum and
combine these SRSs to form the full sample.
A cluster sample – the population is divided into groups or clusters. Some clusters are
randomly selected, then all the individuals in the selected clusters are chosen for the sample.
Multistage sample – several stages of selection are used to divide the population. Each stage
may be an SRS, strata, or cluster. The final stage results in clusters from which the sample is
selected.
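A sketch of a stratified random sample in Python, with hypothetical strata and sizes (a cluster sample would instead randomly select whole groups and keep every member):

```python
# Sketch: stratified random sample -- an SRS from each stratum, combined.
import random

strata = {
    "freshmen":   [f"F{i:03d}" for i in range(120)],
    "sophomores": [f"S{i:03d}" for i in range(90)],
}
sample = []
for name, members in strata.items():
    sample += random.sample(members, 10)   # SRS within each stratum
print(len(sample), "selected; first few:", sample[:4])
```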
Areas of bias in sampling despite good sampling methods:
Undercoverage – some groups in the population are omitted from the sampling process (e.g.
homeless people).
Nonresponse – individuals chosen for the sample are unavailable or refuse to participate.
Response bias – the nature of the question (e.g. illegal activities or estimates by the
respondent) or the behavior of the interviewer may result in bias by the respondent.
Wording of the questions – the wording or the order of the questions may result in a biased
response. Wording biases may result from omission of information (covert) or by including
biased wording (overt, an error of commission.)
Larger random samples produce results that are closer to the population than smaller samples.
2. Designing Experiments:
Experiment – a study that includes a treatment (a specific experimental condition) to
individuals (experimental units or subjects) in order to observe the response.
The explanatory variables (factors) are the inputs and the response variables are the outputs.
The explanatory variables may have different values (levels.)
Example 5.14 and Figure 5.3 pg 355.
A lurking (confounding) variable is neither the explanatory variable nor the response variable
yet the lurking variable may influence the interpretation of the results of a study.
Placebo – “dummy” or fake treatment used to disguise the treatment from the control group
and the treatment group.
The control group receives the placebo.
The 1st basic principle of statistical design of experiments is control! Comparison of several
treatments in the same environment is the simplest form of control. Control is the overall
effort to minimize variability in the way the experimental units (individuals or subjects) are
obtained (sample) and treated.
The 2nd basic principle of statistical design of experiments is replication! Replication – use
enough subjects to reduce chance variation between groups (i.e. the larger the sample size the
better).
The 3rd basic principle of statistical design of experiments is randomization!
Randomization – rely on chance to assign individuals (experimental units) to the control and
treatment groups. Goal is to eliminate bias. Do not rely on the characteristics of the
individuals or the judgment of the designer of the experiment.
Recap of the basic principles of experimental design:
1. Control the effects of lurking variables on the response. (Use a control group and at least
one treatment group.)
2. Replicate each treatment on many individuals to reduce chance variation in the results.
3. Randomize – use impersonal chance to assign individuals to treatments. A completely
randomized design assigns all individuals to treatment groups entirely at random.
Statistically significant – an observed effect so large that it would rarely occur by chance.
Do example 5.19 pg 362 for using Table B for random assignment.
Block – a group of individuals that are known before the experiment to be similar in some
variable so that the response to the treatments will be systematically affected. Blocks are
another form of control. They control the effects of some outside variables by bringing those
variables into the experiment to form the blocks. (e.g. separate blocks for males and females
where gender may impact the effect of the treatment.) Blocking allows separate conclusions
about each block. Form blocks based on the most important unavoidable sources of variability
among individuals. Randomization averages out the effects of the remaining variation
enabling an unbiased comparison of the treatments.
Block Design – individuals are randomly assigned to treatment groups within each block.
Control what is possible, block what is not controllable, randomize the rest, and replicate!
Often, matching the subjects in various ways can produce more precise results than simple
randomization.
Matched pairs – a type of block design that compares just two treatments. The subjects are
matched in pairs because matched subjects are more similar than unmatched subjects.
Therefore, comparing responses between matched pairs is more efficient than comparing
responses of randomly assigned subjects. Example 5.23 pg 368.
Double Blind Experiment – neither the subjects nor those who measure the response variable
know which treatment a subject received.
A potential weakness of any experiment is lack of realism. The ability to apply the
conclusions of an experiment to a real setting may be limited.
Statistics
Chapter 6
Probability – Simulations, Randomness
Introduction: the three types of probability are relative frequency, theoretical model, and simulation.
1. Simulations:
A simulation is an imitation of chance behavior, based on a model that accurately reflects the
phenomenon. Do the following steps:
a. Describe the random phenomenon
b. State the assumptions
c. Assign digits to represent the outcome
d. Simulate many independent trials (repetitions)
Using the TI-84:
randInt(first, last, size)→L1 : SortA(L1) : (L1 ≤ value)→L2 : sum(L2)
See Example 6.8 on page 401
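A rough Python translation of that calculator line, simulating (as an assumed example) 100 rolls of a die and counting rolls of 1 or 2:

```python
# Sketch: the calculator line above in Python -- simulate 100 rolls of a
# die and count those at or below 2 (the cutoff is an assumed example).
import random

rolls = [random.randint(1, 6) for _ in range(100)]  # randInt(1, 6, 100) -> L1
hits = [1 if r <= 2 else 0 for r in rolls]          # (L1 <= 2) -> L2
print(sum(hits), "of", len(rolls), "rolls were 1 or 2")  # sum(L2)
```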
2. Probability Models:
Chance behavior is unpredictable in the short run but has a regular and predictable pattern in the long
run. Probability is empirical: it is based on observations of many trials.
A phenomenon is random if individual outcomes are uncertain but there is a regular distribution of
outcomes in a large number of repetitions.