-
Normal Distribution
The normal distribution is a density curve defined by the following formula. It is completely determined by two parameters: the mean (μ) and the standard deviation (σ).
A density function describes the overall pattern of a distribution. The total area under the curve is always 1.0. The normal distribution is symmetrical, and its mean, median, and mode are all the same.
-
Normal Distribution
The total area under the curve is 1. If we know μ and σ, we know everything about the normal distribution.
-
Normal Distribution
The 68-95-99.7 Rule: in the normal distribution with mean μ and standard deviation σ:
68% of the observations fall within σ of the mean μ.
95% of the observations fall within 2σ of the mean μ.
99.7% of the observations fall within 3σ of the mean μ.
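The 68-95-99.7 rule can be checked empirically by simulation. The sketch below (in Python, for illustration; the sample size of 100,000 is an arbitrary choice) draws standard normal values and counts the fraction within 1, 2, and 3 standard deviations of the mean.

```python
import random

def fraction_within(xs, k):
    """Fraction of values lying within k standard deviations of the mean
    (here 0, since the sample is drawn from a standard normal)."""
    return sum(abs(x) <= k for x in xs) / len(xs)

random.seed(42)  # fixed seed so the run is reproducible
xs = [random.gauss(0, 1) for _ in range(100_000)]

for k in (1, 2, 3):
    print(f"within {k} sd: {fraction_within(xs, k):.3f}")
```

With a large sample the three printed fractions land close to 0.68, 0.95, and 0.997.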
-
Normal Density Plot
[Figure: density plot of a sample of 100 observations from a normal distribution with mean 0 and standard deviation 1, with the 68% and 95% regions marked.]
-
Normal Distribution
Standardizing and z-Scores: if x is an observation from a distribution that has mean μ and standard deviation σ, the standardized value of x is

z = (x − μ) / σ

A standardized value is often called a z-score. If x is a normal variable with mean μ and standard deviation σ, then z is a standard normal variable with mean 0 and standard deviation 1.
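Standardization is a one-line computation. A minimal Python sketch (the observation 130 and the parameters μ = 100, σ = 15 are illustrative values, not from the slides):

```python
def z_score(x, mu, sigma):
    """How many standard deviations the observation x lies from the mean mu."""
    return (x - mu) / sigma

# Example: an observation of 130 from a distribution with mean 100, sd 15
print(z_score(130, 100, 15))  # 2.0 -> two standard deviations above the mean
```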
-
Normal Distribution
Let x1, x2, ..., xn be n random variables, each normal with mean μ and standard deviation σ. Then their sum Σxi is also normal, with mean nμ and standard deviation σ√n. The distribution of the mean x̄ is also normal, with mean μ and standard deviation σ/√n. The standardized score of the mean is

z = (x̄ − μ) / (σ/√n)

The mean of this standardized random variable is 0 and its standard deviation is 1.
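The σ/√n scaling of the sample mean can be verified by simulation. The Python sketch below (the values μ = 5, σ = 2, n = 25 and the 20,000 repetitions are arbitrary illustrative choices) draws many samples of size n and inspects the distribution of their means.

```python
import random
import statistics

random.seed(1)  # reproducible draws
mu, sigma, n = 5.0, 2.0, 25

# Draw 20,000 samples of size n and record each sample mean
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(20_000)]

print(round(statistics.fmean(means), 2))  # close to mu = 5.0
print(round(statistics.stdev(means), 2))  # close to sigma / sqrt(n) = 2/5 = 0.4
```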
-
Assessing the normality of data
Most statistical methods assume that data come from a normal population, so it is important to check the normality of the data.
Normal quantile plots: if the points on a normal quantile plot lie close to the diagonal line, the plot indicates that the data are normal; otherwise, it indicates a departure from normality. Points far away from the overall pattern indicate outliers. Minor wiggles can be overlooked. We will see normal quantile plots in the next two slides.
The Shapiro-Wilk W statistic, the Kolmogorov-Smirnov (K-S) test, and others are used for testing the normality of data. To perform a K-S test for normality in SPSS: Analyze > Nonparametric Tests > 1-Sample K-S, then choose OK after selecting variable(s). To perform the Shapiro-Wilk test of normality in R:
> shapiro.test(x)
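To show the idea behind the K-S test, here is a hand-rolled sketch in Python (rather than the R/SPSS commands above) of the one-sample K-S statistic against the standard normal: the largest gap between the empirical CDF of the sample and the normal CDF. This illustrates only the statistic, not the full test with its p-value.

```python
import random
from statistics import NormalDist

def ks_statistic(xs):
    """One-sample Kolmogorov-Smirnov statistic against N(0, 1):
    the largest vertical gap between the empirical CDF and the normal CDF."""
    n = len(xs)
    cdf = NormalDist().cdf
    d = 0.0
    for i, x in enumerate(sorted(xs), start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x; check both sides
        d = max(d, abs(i / n - f), abs(f - (i - 1) / n))
    return d

random.seed(0)
sample = [random.gauss(0, 1) for _ in range(100)]
print(round(ks_statistic(sample), 3))  # small value -> consistent with normality
```

A small statistic (relative to the critical value, roughly 1.36/√n at the 5% level) is consistent with normality; clearly non-normal data yield a large one.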
-
Normal quantile plot
[Figure: q-q plot of 100 sample observations from a normal distribution with mean 0 and standard deviation 1.]
-
Normal quantile plot
This plot shows a minor deviation from the diagonal line. We can ignore these minor wiggles and still assume that the data come from a normal distribution.
-
Population and Sample
Population: the entire collection of individuals or measurements about which information is desired. For example, if we want to know the average height of 5-year-old children in the USA, then all 5-year-old children in the USA are our population.
Sample: a subset of the population selected for study.
Methods of sampling: random sampling, stratified sampling, systematic sampling, cluster sampling, multistage sampling, area sampling, quota sampling, etc. We will discuss only random sampling.
Random Sample: a simple random sample of size n from a population is a subset of n elements from that population, chosen in such a way that every possible unit of the population has the same chance of being selected.
Example: consider a population of 5 numbers (1, 2, 3, 4, 5). How many random samples (without replacement) of size 2 can we draw from this population? (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5)
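The ten samples listed above can be enumerated mechanically; a short Python sketch, for illustration:

```python
from itertools import combinations

population = [1, 2, 3, 4, 5]
# All samples of size 2 drawn without replacement (order ignored)
samples = list(combinations(population, 2))

print(len(samples))  # 10 possible samples
print(samples)
```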
-
Population and Sample
Why do we need randomness in sampling? It reduces the possibility of subjective and other biases. The mean and variance of a random sample are unbiased estimates of the population mean and variance, respectively.
The population mean of the five numbers in the previous slide is 3. The averages of the 10 samples of size 2 are 1.5, 2, 2.5, 3, 2.5, 3, 3.5, 3.5, 4, 4.5. The mean of these 10 averages is (1.5 + 2 + 2.5 + 3 + 2.5 + 3 + 3.5 + 3.5 + 4 + 4.5)/10 = 3, which is the same as the population mean.
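The unbiasedness claim can be verified by computing the mean of every size-2 sample and averaging them; a Python sketch:

```python
from itertools import combinations
from statistics import fmean

population = [1, 2, 3, 4, 5]

# Mean of each of the 10 possible size-2 samples
sample_means = [fmean(s) for s in combinations(population, 2)]

print(sample_means)         # 1.5, 2.0, 2.5, ... for the 10 samples
print(fmean(sample_means))  # 3.0, equal to the population mean
```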
-
Parameter and Statistic
Parameter: any statistical characteristic of a population. The population mean, population median, and population standard deviation are examples of parameters.
Statistic: any statistical characteristic of a sample used as an estimate of a population parameter, such as the sample mean, sample median, sample standard deviation, etc.
Statistical issue: describing the population through a census, or making inference from a sample by estimating the value of the parameter using a statistic.
-
Census and Inference
Census: complete enumeration of population units.
Statistical Inference: we sample the population (in a manner that ensures the sample correctly represents the population), take measurements on our sample, and infer (or generalize) back to the population.
Example: we may want to know the average height of all adults (over 18 years old) in the U.S. Our population is then all adults over 18 years of age. If we were to take a census, we would measure every adult and then compute the average. By using statistics, we can take a random sample of adults over 18 years of age, measure their average height, and then infer that the average height of the total population is ``close to'' the average height of our sample.
-
Univariate, Bivariate, and Multivariate Data
Depending on how many variables we measure on the individuals or objects in our sample, we will have one of the following three types of data sets:
Univariate: measurements made on only one variable per observation.
Bivariate: measurements made on two variables per observation.
Multivariate: measurements made on more than two variables per observation.
-
Proportion
Proportion: in many cases, it is appropriate to summarize a group of independent observations by the number of observations in the group that represent one of two outcomes.
Consider a variable X with two outcomes, 1 and 0, for the happening and not happening of some event, respectively. Let p be the probability that the event happens; then p = Prob(X = 1).
Suppose we want to estimate the proportion of the patients coming to duPont having some particular disease. To estimate this population proportion, we take a sample of size n and examine whether each patient has that particular disease. If x patients in the sample have the disease, the estimated proportion is

p̂ = x / n
-
Proportion
For large n, the sampling distribution of p̂ is approximately normal with mean p (the population proportion) and standard deviation

√( p(1 − p) / n )

If the probability that an event happens is p, then the probability that the same event does not happen is 1 − p, and the total probability is 1.
What is the difference between a proportion and a sample mean? If X takes the two values 0 or 1 and p is the proportion of the event happening (X = 1), then the sample proportion is the same as the sample mean. For example, the mean of the data 1, 0, 1, 1, 0 is (1 + 0 + 1 + 1 + 0)/5 = 3/5, and the proportion is also 3/5.
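The equivalence of the sample proportion and the sample mean of 0/1 data can be checked directly; a Python sketch using the five values from the slide:

```python
from statistics import fmean

data = [1, 0, 1, 1, 0]  # 1 = event happened, 0 = it did not

p_hat = sum(data) / len(data)  # sample proportion: successes / sample size
print(p_hat, fmean(data))      # both 0.6, i.e. 3/5
```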
-
Binomial Distribution
Consider an experiment with two outcomes, success (S) and failure (F), for each subject, and let the experiment be performed on n subjects. One possible sequence of S and F is SSFSFFFSSFSF, where there are x successes out of n trials. Then the probability distribution of x can be written as

P(x) = C(n, x) p^x (1 − p)^(n − x),  x = 0, 1, ..., n

where p is the probability of success on a single trial and C(n, x) is the number of ways to choose which x of the n trials are successes.
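The binomial probability formula translates directly into code; a Python sketch (the values n = 10, p = 0.5, x = 3 are illustrative):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x): choose which x of the n trials are successes,
    times the probability of any one such sequence."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of exactly 3 successes in 10 trials with p = 0.5
print(round(binom_pmf(3, 10, 0.5), 4))  # 120/1024, about 0.1172
```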
-
Binomial Distribution
The mean and variance of x are np and np(1 − p). If p = 1/2, the binomial distribution is symmetric.
-
Test of hypothesis
A statistical hypothesis is a statement or assertion about the population distribution or a population parameter that we want to verify on the basis of information available from a sample. A test of hypothesis is a two-action decision problem after the sample values have been obtained, the two actions being the acceptance or rejection of the hypothesis under consideration.
Null Hypothesis: the statement being tested in a test of significance. Usually the null hypothesis is a statement of no effect or no difference, i.e., a statement that nothing beyond chance is at work. H0 is the abbreviated form of the null hypothesis.
Alternative Hypothesis: the complement of the null hypothesis. Ha is the abbreviated form of the alternative hypothesis.
-
Test of hypothesis
Test statistic: the statistic used to test the significance of a hypothesis.
Two types of errors: Type I error and Type II error.
Type I error (α): the probability of rejecting H0 when it is true (Ha is false).
Type II error (β): the probability of accepting H0 when Ha is true.

                                Truth about the population
Decision based on sample        H0 true             Ha true
Reject H0                       Type I error        Correct decision
Accept H0                       Correct decision    Type II error
-
Test of hypothesis
Our aim is to make inference controlling both Type I and Type II errors. But reducing one increases the other, and the consequence of a Type I error is usually more severe than that of a Type II error. That is why we choose a test that minimizes the Type II error (maximizes the power of the test) while keeping the Type I error at a fixed low level.
Level of Significance: the probability of a Type I error (α) is known as the level of significance.
Power (1 − β): the probability of rejecting H0 when the alternative hypothesis is true, at a fixed level of significance (α). The power of a test is a function of the sample size and the parameter of interest; we calculate power for a particular value of the parameter in the alternative hypothesis. Increasing the sample size increases the power of a test.
Increasing Power: if the power is too small, we can do the following to increase it:
Increase the sample size
Increase the significance level (α)
Reduce the variability
-
Test of hypothesis
P-value: the probability, assuming H0 is true, that the test statistic would take a value as extreme as or more extreme than that actually observed is called the P-value of the test. The smaller the P-value, the stronger the evidence against H0 provided by the data.
Statistical Significance: if the P-value is smaller than α, we say that the test is statistically significant at level α. A result is statistically significant if it is unlikely to have happened by chance.
-
Point estimation and Confidence Interval
Two types of parameter estimation:
Point Estimation: a single value, calculated from sample data, that estimates the parameter. Unbiasedness, consistency, efficiency, and sufficiency are some of the criteria that should be satisfied by a good estimator. The sample mean is a point estimate of the population mean.
Interval Estimation: a range calculated from sample data, together with a confidence level that the population parameter lies within that range, is called an interval estimate of the parameter.
Confidence Interval: an interval computed from sample data that contains the true value of the parameter with a certain level of confidence. With a 95% confidence interval for a mean, we mean that the intervals computed from 95% of all samples of the same size will contain the true population mean. A confidence interval for the mean can be written as on the next slide.
-
Confidence Interval
Confidence Interval = Point Estimate ± Margin of Error
Margin of error: the amount of sampling error allowed.
95% confidence interval for the mean = x̄ ± z0.975 σ/√n (σ known) or x̄ ± t0.975,(n−1) s/√n (σ unknown), where s and σ are the sample and population standard deviations, respectively.
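The z-interval formula above can be sketched in Python (the data values and σ = 0.5 below are made-up illustrative numbers):

```python
from math import sqrt
from statistics import NormalDist, fmean

def z_confidence_interval(data, sigma, level=0.95):
    """CI for the mean when the population sd sigma is known (z interval):
    xbar +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    m = fmean(data)
    margin = z * sigma / sqrt(len(data))
    return m - margin, m + margin

# Hypothetical sample with assumed known sigma = 0.5
data = [4.1, 5.2, 4.8, 5.5, 4.9, 5.1, 4.7, 5.0]
lo, hi = z_confidence_interval(data, sigma=0.5)
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```

For the t interval one would replace the z critical value with t0.975,(n−1) and σ with the sample standard deviation s.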
-
Sampling Distribution
Standard Error: the standard deviation of a statistic. Let x1, x2, ..., xn be a sample of size n. Then the standard deviation of the sample mean, σ/√n, is a standard error; s/√n is termed the estimated standard error of the sample mean.
Sampling distribution: the distribution of a statistic.
Commonly used statistics: the Z-statistic, t-statistic, chi-square statistic, and F-statistic are some of the commonly used statistics.
-
Z-Score
Z-score: the statistic z = (x̄ − μ)/(σ/√n) is distributed as standard normal with mean 0 and standard deviation 1.
Assumptions: the sample is large and the population standard deviation is known.
Applications: 1. Test of hypothesis for a single sample mean. 2. Comparing the means of two groups or populations.
-
Power Calculation and Sample Size Determination
Sample size calculation is an important part of a research study. If the sample size is too small, even a well-designed study may fail to detect important effects or associations. If the sample size is too large, the study will be expensive and difficult to manage. Power analysis can provide the optimum sample size needed to answer a particular research question with a certain level of accuracy.
Calculating a sample size can be a difficult and cumbersome task, as it involves rigorous mathematical derivation. There are many software packages for sample size calculation, including PASS, Power and Precision, and PS. These packages are not very costly, and some of them are free.
-
Example: Power calculation / sample size determination
A research team is planning a study to examine whether a 6-month exercise program increases the total body bone mineral content (TBBMC) of young women. Based on the results of a previous study, they are willing to assume that σ = 2 for the percent change in TBBMC over the 6-month period. A change in TBBMC of 1% would be considered important, and the researchers would like a reasonable chance of detecting a change this large or larger. Are 25 subjects enough for this project?
To answer this question we need to calculate the power of the test, which involves the following three steps:
State H0 and Ha, and the significance level α.
Find the value of the sample mean (x̄) that will lead to rejecting H0.
Calculate the probability of observing these values of x̄ when the alternative is true.
-
Example: Power calculation / sample size determination
Step 1: H0: mean percent change μ = 0; Ha: μ > 0. A 5% level of significance (α) will be used.
Step 2: The z-test rejects H0 at α = 0.05 when z = (x̄ − 0)/(2/√25) ≥ 1.645, i.e., reject H0 when x̄ ≥ 0.658.
Step 3: Prob(x̄ ≥ 0.658 when μ = 1) = Prob(Z ≥ (0.658 − 1)/(2/√25)) = Prob(Z ≥ −0.855) ≈ 0.80, so the power is about 80% and 25 subjects are adequate.
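The three steps of this calculation can be sketched in Python using the numbers from the example (σ = 2, n = 25, α = 0.05, true mean 1 under the alternative):

```python
from math import sqrt
from statistics import NormalDist

sigma, n, alpha = 2.0, 25, 0.05
se = sigma / sqrt(n)  # standard error of the mean: 2/5 = 0.4

# Step 2: the cutoff for xbar above which H0 is rejected
z_crit = NormalDist().inv_cdf(1 - alpha)  # about 1.645
cutoff = z_crit * se                      # about 0.658

# Step 3: probability of xbar >= cutoff when the true mean is 1
power = 1 - NormalDist(mu=1.0, sigma=se).cdf(cutoff)
print(round(cutoff, 3), round(power, 2))  # cutoff near 0.658, power near 0.80
```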
-
Useful Website(s)
For the basic ideas and statistical definitions, the following website may be useful:
http://www.cas.lancs.ac.uk/glossary_v1.1/main.html