Sampling Error
When we take a sample, our results will not exactly equal the
correct results for the whole population. That is, our results will be
subject to errors.
Sampling error: A sample is a subset of a population. Because of
this property of samples, results obtained from them cannot reflect
the full range of variation found in the larger group (population).
This type of error, arising from the sampling process itself, is called
sampling error, which is a form of random error.
Sampling error can be reduced by increasing the size of the
sample. When n = N (the sample is the whole population), sampling error = 0.
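The effect of sample size on sampling error can be illustrated with a small simulation. The sketch below assumes a hypothetical population of 1,000 integer values (not taken from the text) and a helper named average_sampling_error, both invented for illustration. It measures how far the sample mean falls from the population mean, on average, as n grows.

```python
import random
from statistics import mean

def average_sampling_error(population, n, repeats, rng):
    """Average absolute gap between the sample mean and the population mean."""
    mu = mean(population)
    gaps = []
    for _ in range(repeats):
        # Simple random sample of n distinct members of the population.
        sample = rng.sample(population, n)
        gaps.append(abs(mean(sample) - mu))
    return mean(gaps)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical population: 1,000 integer measurements (illustrative only).
population = [rng.randint(50, 150) for _ in range(1000)]

errors = {n: average_sampling_error(population, n, 200, rng)
          for n in (10, 100, 500, 1000)}
for n, err in errors.items():
    print(f"n = {n:4d}  average sampling error = {err:.3f}")
```

With n = N every sample is the whole population, so the gap is exactly zero; for smaller n the average gap should shrink as n increases.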
Non-sampling error (bias)
It is a type of systematic error in the design or conduct of
a sampling procedure which results in distortion of the
sample, so that it is no longer representative of the
reference population.
We can eliminate or reduce the non-sampling error (bias)
by careful design of the sampling procedure and not by
increasing the sample size.
Sources of non-sampling error:
Accessibility bias, volunteer bias, etc.
The best-known source of bias is non-response: the failure to obtain
information on some of the subjects included in the sample to be
studied.
Non-response results in significant bias when the following two
conditions are both fulfilled:
1. When non-respondents constitute a significant proportion of the
sample (about 15% or more)
2. When non-respondents differ significantly from respondents.
There are several ways to deal with this problem and reduce
the possibility of bias:
1. Pre-test the data collection tools (questionnaire).
2. If non-response is due to absence of the subjects, make repeated
attempts to contact study subjects who
were absent at the time of the initial visit.
3. Include additional people in the sample, so that non-respondents
who were absent during data collection can be
replaced (make sure that their absence is not related to the
topic being studied).
ESTIMATION
The sample from a population is used to provide the estimates
of the population parameters.
A parameter is a numerical descriptive measure of a population
(μ is an example of a parameter).
A statistic is a numerical descriptive measure of a sample (X̄ is
an example of a statistic).
To each sample statistic there corresponds a population
parameter.
We use X̄, S², S, p, etc. to estimate μ, σ², σ, P (or π), etc.
Sample statistic                  Corresponding population parameter
X̄ (sample mean)                   μ (population mean)
S² (sample variance)              σ² (population variance)
S (sample standard deviation)     σ (population standard deviation)
p (sample proportion)             P or π (population proportion)
Sampling Distribution of Means
A sampling distribution is the frequency distribution of a statistic
(here, the sample mean) over repeated samples, and it
has its own mean and standard deviation.
Steps:
1. Obtain a sample of n observations selected completely at
random from a large population. Determine their mean
and then replace the observations in the population.
2. Obtain another random sample of n observations from
the population, determine their mean and again replace
the observations.
3. Repeat the sampling procedure indefinitely, calculating the
mean of the random sample of n each time and subsequently
replacing the observations in the population.
4. The result is a series of means of samples of size n. If each
mean in the series is now treated as an individual observation
and arrayed in a frequency distribution, one determines the
sampling distribution of means of samples of size n.
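The steps above can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the population is a hypothetical set of 5,000 simulated values, the function name sampling_distribution_of_means is invented here, and "indefinitely" is replaced by a finite 2,000 repetitions.

```python
import random
from collections import Counter
from statistics import mean

def sampling_distribution_of_means(population, n, num_samples, rng):
    """Draw num_samples random samples of size n and return their means."""
    means = []
    for _ in range(num_samples):
        # rng.sample leaves the population unchanged, so every observation
        # is available again for the next sample (the "replace" step).
        sample = rng.sample(population, n)
        means.append(mean(sample))
    return means

rng = random.Random(1)  # fixed seed so the run is repeatable
# Hypothetical population of 5,000 values; the numbers are illustrative only.
population = [rng.gauss(100, 15) for _ in range(5000)]

means = sampling_distribution_of_means(population, n=25, num_samples=2000, rng=rng)

# Array the means in a frequency distribution, one class per whole unit.
frequency = Counter(round(m) for m in means)
for value in sorted(frequency):
    print(f"{value:4d} {'*' * (frequency[value] // 10)}")
```

The printed rows form a rough text histogram of the sample means, which is the sampling distribution for samples of size 25 from this population.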
Because the scores (X̄s) in the sampling distribution of
means are themselves means (of individual samples), we
shall use the notation σ_X̄ for the standard deviation of the
distribution.
Standard error of mean (SEM): The standard deviation
of the sampling distribution of means is called the
standard error of the mean.
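Formula: $\sigma_{\bar{X}} = \sqrt{\dfrac{\sum_i (\bar{x}_i - \mu)^2}{N}}$, where each $\bar{x}_i$ is a sample mean and N is the number of sample means.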
Properties of sampling distribution
1. The mean of the sampling distribution of means is the
same as the population mean, μ .
2. The SD of the sampling distribution of means is σ / √n
3. The shape of the sampling distribution of means is
approximately a normal curve, regardless of the shape of
the population distribution and provided n is large
enough (Central limit theorem).
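These three properties can be checked numerically. The sketch below assumes a hypothetical, strongly right-skewed population (simulated exponential values, not data from the text) and samples of n = 40. It compares the mean and SD of the sample means with μ and σ/√n, and counts how many sample means fall within 1.96 standard errors of μ, as a normal curve would predict.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(2)  # fixed seed so the run is repeatable
# Hypothetical, strongly right-skewed population (exponential, mean about 1).
population = [rng.expovariate(1.0) for _ in range(20000)]
mu = fmean(population)
sigma = pstdev(population)

n = 40
sample_means = [fmean(rng.sample(population, n)) for _ in range(4000)]

mean_of_means = fmean(sample_means)
sd_of_means = pstdev(sample_means)
predicted_se = sigma / n ** 0.5

# Under a normal curve, about 95% of values lie within 1.96 SD of the mean.
within = sum(abs(m - mu) <= 1.96 * predicted_se for m in sample_means)
share_within = within / len(sample_means)

print(f"population mean = {mu:.3f}, mean of sample means = {mean_of_means:.3f}")
print(f"sigma / sqrt(n) = {predicted_se:.3f}, SD of sample means = {sd_of_means:.3f}")
print(f"share of means within 1.96 SE of mu = {share_within:.3f}")
```

Even though the population itself is far from normal, the share within 1.96 standard errors should come out close to 0.95.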
Confidence interval
Interval Estimation (large samples)
A point estimate does not give any indication of how far
away the parameter lies. A more useful method of
estimation is to compute an interval which has a high
probability of containing the parameter.
An interval estimate is a statement that a population
parameter has a value lying between two specified limits.
A confidence interval provides an indication of how close the sample
estimate is likely to be to the true population value.
It gives an estimated range of values, calculated from a given set of
sample data, which is likely to include the true value of the unknown
population parameter with a certain confidence (probability).
Consider the standard normal distribution and the statement
Pr(-1.96 ≤ Z ≤ 1.96) = 0.95. It means that 95% of the area under the
standard normal curve lies between -1.96 and +1.96.
Formula:
Pr( X̄ - 1.96(σ/√n) ≤ μ ≤ X̄ + 1.96(σ/√n) ) = 0.95
The range X̄ - 1.96(σ/√n) to X̄ + 1.96(σ/√n) is called the
95% confidence interval;
X̄ - 1.96(σ/√n) is the lower confidence limit while X̄ +
1.96(σ/√n) is the upper confidence limit.
A few things to remember
At 90%, the corresponding Z score to be used in the
formula is 1.645 (often rounded to 1.64)
At 95%, the corresponding Z score to be used in the
formula is 1.96
At 99%, the corresponding Z score is 2.58
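The Z scores listed above can be recovered from the standard normal distribution. A minimal sketch using Python's standard library follows; the function name z_for_confidence is a hypothetical helper introduced here.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided critical value z such that Pr(-z <= Z <= z) = level."""
    tail = (1 - level) / 2          # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.3f}")
```

To three decimals this gives 1.645, 1.960 and 2.576, which round to the values commonly quoted in tables.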
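Confidence interval for                   Formula
Population mean                           $\bar{x} \pm z_{\text{value}} \cdot \dfrac{s}{\sqrt{n}}$
Population proportion                     $p \pm z_{\text{value}} \cdot \sqrt{\dfrac{pq}{n}}$
Difference in population means            $(\bar{x}_1 - \bar{x}_2) \pm z_{\text{value}} \cdot \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
Difference in population proportions      $(p_1 - p_2) \pm z_{\text{value}} \cdot \sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$
Here q = 1 - p.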
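The four large-sample interval formulas (mean, proportion, difference in means, difference in proportions) can be written as small functions. This is a sketch only: the function and parameter names are invented for illustration, z defaults to 1.96 (95% confidence), and the numbers in the sample call are made up rather than taken from the text.

```python
from math import sqrt

def ci_mean(xbar, s, n, z=1.96):
    """CI for a population mean: xbar +/- z * s / sqrt(n)."""
    margin = z * s / sqrt(n)
    return xbar - margin, xbar + margin

def ci_proportion(p, n, z=1.96):
    """CI for a population proportion: p +/- z * sqrt(p*q/n), with q = 1 - p."""
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

def ci_diff_means(xbar1, s1, n1, xbar2, s2, n2, z=1.96):
    """CI for a difference in means: (xbar1 - xbar2) +/- z * sqrt(s1^2/n1 + s2^2/n2)."""
    margin = z * sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = xbar1 - xbar2
    return diff - margin, diff + margin

def ci_diff_proportions(p1, n1, p2, n2, z=1.96):
    """CI for a difference in proportions: (p1 - p2) +/- z * sqrt(p1*q1/n1 + p2*q2/n2)."""
    margin = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - margin, diff + margin

# Illustrative call with made-up numbers: mean 120, SD 15, n = 225.
low, high = ci_mean(xbar=120, s=15, n=225)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")  # (118.04, 121.96)
```

Passing z=2.58 or z=1.645 gives the 99% or 90% interval from the same functions.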
Problem 1: Suppose X̄ = 50, SD = 10 and n = 100. What is the 99%
confidence interval?
SE = 10/√100 = 1
CI lower: X̄ - 2.58(σ/√n) = 50 - 2.58 = 47.42
CI upper: X̄ + 2.58(σ/√n) = 50 + 2.58 = 52.58
Example 1
The mean fasting blood sugar of a group of 70 individuals
was found to be 115 with an SD of 12.56. Find the 95% and
99% CIs for the population mean.
Solution:
SE (mean) = 12.56/√70
= 1.50
95% CI = 115 ± 1.96 (1.50)
= (112.06, 117.94)
99% CI = 115 ± 2.58 (1.50)
= (111.13, 118.87)
Interpretation of 95% CI:
Probabilistic interpretation:
In repeated sampling, approximately 95 percent of the intervals constructed will include the population mean.
Practical interpretation:
One can say with 95 percent confidence that the population mean fasting blood sugar is between 112.06 and 117.94.
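The probabilistic interpretation can be illustrated by simulation. The sketch below assumes, purely for illustration, that the true population is normal with mean 115 and SD 12.56 (values borrowed from the example; the real population is of course unknown). It draws many samples of 70, builds a 95% interval from each, and counts how often the interval contains the assumed population mean.

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(3)  # fixed seed so the run is repeatable
# Assumed "true" population for the simulation: normal with mean 115 and
# SD 12.56. These values are borrowed from the example for illustration only.
MU, SIGMA, N, Z = 115.0, 12.56, 70, 1.96

def interval_covers_mu():
    """Draw one sample of size N and report whether its 95% CI contains MU."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    xbar = fmean(sample)
    se = stdev(sample) / sqrt(N)  # standard error from the sample SD
    return xbar - Z * se <= MU <= xbar + Z * se

trials = 2000
coverage = sum(interval_covers_mu() for _ in range(trials)) / trials
print(f"Share of intervals containing mu: {coverage:.3f}")
```

The share should land near 0.95. Any single interval either contains μ or it does not; the 95% describes the long-run behaviour of the procedure.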
Example 2
In a study, it was found that 129 out of 150 lung carcinoma patients were smokers. Find the 95% and 99% CIs for the proportion of smokers among lung cancer patients.
Solution:
p = 129/150 = 0.86
SE (proportion) = √((0.86)(0.14)/150)
= 0.028
95% CI = 0.86 ± 1.96 (0.028)
= (0.805, 0.915)
Similarly, 99% CI = 0.86 ± 2.58 (0.028) = (0.788, 0.932)
Example 3
In a study to assess the effect of anabolic steroids on weight gain, the following data were observed:
Group      n     Mean weight gain     SD of weight gain
Study      50    3.7                  21.2
Control    50    3.1                  9
Find the 95% CI for the difference in mean weight gain.
Solution:
SE (diff. in means) = √(21.2²/50 + 9²/50)
= 3.257
95% CI = (3.7 - 3.1) ± (1.96 × 3.257)
= (-5.78, 6.98)
Example 4
In a study to assess the effect of BCG, the following data were observed:
BCG            n       TB developed    Disease rate
Vaccinated     2500    22              0.88%
Unvaccinated   3000    90              3.00%
Find the 99% CI for the difference in proportions.
Solution:
Actual difference between proportions = 3.00% - 0.88% = 2.12%, i.e. 0.0212
SE (diff. in proportions) = √((0.0088)(0.9912)/2500 + (0.03)(0.97)/3000) = 0.00363 (i.e. 0.363%)
99% CI = 0.0212 ± 2.58 (0.00363) = (0.0118, 0.0306)
Taking the difference the other way round (vaccinated minus unvaccinated) gives (-0.0306, -0.0118).
Factors affecting the width of a confidence interval
Variation in the data (standard deviation): the larger the SD,
the wider the confidence interval.
Sample size: as n increases, the confidence interval becomes narrower.
Level of confidence: the higher the confidence level (90%, 95%, 99%), the wider the confidence interval.
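Each of these three factors can be seen directly in the width of the interval for a mean, which is 2 × z × SD/√n. The sketch below uses a hypothetical helper, ci_width, and made-up baseline values (SD = 10, n = 100, 95% confidence), changing one factor at a time.

```python
from math import sqrt

def ci_width(sd, n, z):
    """Full width of the interval xbar +/- z * sd / sqrt(n)."""
    return 2 * z * sd / sqrt(n)

# Made-up baseline: SD = 10, n = 100, 95% confidence (z = 1.96).
baseline = ci_width(sd=10, n=100, z=1.96)
wider_sd = ci_width(sd=20, n=100, z=1.96)      # more variation in the data
larger_n = ci_width(sd=10, n=400, z=1.96)      # four times the sample size
higher_conf = ci_width(sd=10, n=100, z=2.58)   # 99% instead of 95%

print(f"baseline (SD=10, n=100, 95%): {baseline:.2f}")
print(f"larger SD (SD=20):            {wider_sd:.2f}")
print(f"larger sample (n=400):        {larger_n:.2f}")
print(f"higher confidence (99%):      {higher_conf:.2f}")
```

Doubling the SD doubles the width, while quadrupling n only halves it, because the width depends on √n.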