PROBABILITY & STATISTICAL INFERENCE LECTURE 4 MSc in Computing (Data Analytics)
Feb 23, 2016
Lecture Outline
• Recap of Statistical Inference
• Central Limit Theorem
• Confidence Intervals
• Section Takeaways
Statistical Analysis Process
[Diagram: a representative sample is drawn from the population; a sample statistic describes the sample and is used to make inferences about the population]
Populations vs. Samples
How do Irish voters intend voting in the next election?
The voting population of Ireland: 2,680,000 [1]
A sample of 1,008 adults was taken and surveyed for their voting intention in the next election [2]
[1] Source: http://www.nationmaster.com/graph/dem_pre_ele_vot_age_pop-presidential-elections-voting-age-population
[2] Source: http://redcresearch.ie/wp-content/uploads/2012/01/Report.pdf
Populations vs. Samples
How do Irish voters intend voting in the next election?
1,008 voters were asked how they intended to vote in the next election:
• Fine Gael: 30%
• Labour: 14%
• Fianna Fail: 18%
• Sinn Fein: 17%
• Other: 21%
Populations vs. Samples
The term population is used in statistics to represent all possible measurements or outcomes that are of interest to us in a particular study or piece of analysis.
In the example, the population of interest was the voting intentions of all voters in Ireland.
The term sample refers to a subset of the population that is selected for analysis.
In the example, the polling company selected a sample of 1,008 voters.
Sampling
In choosing a sample it is important that it is representative of the population.
No bias should exist in the sample.
There are a number of sampling methods available to ensure that your data is representative.
A simple random sample is the most straightforward of these methods.
Statistical Inference
The statistical methods used to draw conclusions about populations based on the statistics describing a sample are known as statistical inference.
We want to make decisions based on evidence from a sample, i.e. extrapolate from sample evidence to a general population.
To make such decisions we need to be able to quantify our (un)certainty about how good or bad our sample information is.
Statistical Inference
Statistical Inference is divided into two major areas:
• Parameter Estimation: This is where sample statistics are used to estimate population parameters.
• Hypothesis Testing: A statistical hypothesis is a statement about the parameters of one or more populations. Hypothesis testing tests whether a hypothesis is supported by the data collected.
Population Statistics – Point Estimation
The population mean is denoted by μ (mu). In general, given a sufficiently large sample, we use the sample mean x̄ as a point estimate of μ.
The population variance is denoted by σ² (sigma-squared). In general, given a sufficiently large sample, we use the sample variance s² as a point estimate of σ².
Population Statistics – Point Estimation
An estimate of the proportion, p, of items in a population that belong to a class of interest is calculated as:
$$\hat{p} = \frac{c}{n}$$
where c is the number of items in a random sample of size n that belong to the class of interest.
This is known as the sample proportion.
Central Limit Theorem
Demonstration
Central Limit Theorem Explained by Example
The distribution shown is a Poisson distribution with λ = 3.
This could represent the distribution of the number of clicks on a particular link in one second.
Draw 200 samples from this distribution, each with a large sample size, and calculate the mean of each sample.
Central Limit Theorem
What has happened?
As the sample sizes increased, the shape of the histogram of means tended towards a normal distribution.
As the sample sizes increased, the spread (standard deviation) of the sample means decreased.
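The experiment described above can be sketched in Python. This is only an illustration under stated assumptions: the sample sizes 5, 30 and 100 are chosen here for demonstration (the slides do not state them), the seed is arbitrary, and `poisson_draw` and `sample_means` are hypothetical helper names. Poisson values are generated with Knuth's multiplication method so that only the standard library is needed.

```python
import math
import random
import statistics

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_means(rng, lam, sample_size, n_samples=200):
    """Means of n_samples independent samples, each of size sample_size."""
    return [
        statistics.fmean(poisson_draw(rng, lam) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

rng = random.Random(2016)  # arbitrary fixed seed so the run is repeatable
lam = 3                    # Poisson rate from the example
results = {}
for n in (5, 30, 100):     # illustrative sample sizes (an assumption)
    means = sample_means(rng, lam, n)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(f"n={n:3d}  mean of means={results[n][0]:.3f}  "
          f"sd of means={results[n][1]:.3f}  "
          f"sigma/sqrt(n)={math.sqrt(lam / n):.3f}")
```

For a Poisson distribution the variance equals λ, so the spread of the sample means should sit close to √(3/n) and shrink as n grows, while their average stays near 3.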
Central Limit Theorem
These histograms are pictures of the sampling distribution of the mean.
This phenomenon occurs for any population distribution that has a finite mean and variance.
The result is called the Central Limit Theorem (CLT); its proof involves some fairly non-trivial mathematics.
Definition: Central Limit Theorem continued…
The sampling distribution of the mean has an average value equal to μ (the population mean).
The sampling distribution of the mean has a standard deviation of
$$\frac{\sigma}{\sqrt{n}}$$
where σ is the population standard deviation and n is the sample size taken.
This value is called the standard error of the mean.
The sampling distribution of the mean will be approximately a Normal distribution if the sample size is large.
Central Limit Theorem - Definition
If a random sample is taken from a population, where:
• each member of the sample can be considered to be independent of the others,
• they are all members of the same population, and
• that population has a mean value μ and a standard deviation σ,
then:
The central limit theorem states that, given a distribution with a mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with a mean μ and a variance σ²/n as n, the sample size, increases.
This is a non-mathematical definition of the Central Limit Theorem (CLT).
The Distribution of the Sample Means
[Figure: normal curve for the distribution of the sample means, labelled with x̄ and s/√n]
Confidence Intervals
How can we use the CLT?
The Central Limit Theorem avoids the necessity of specifying a complete statistical model for all the sampled data.
All we have to do is specify a probability model for the sample mean.
For any sample mean calculated from a large, independent random sample taken from any population with a mean μ and standard deviation σ, we know from the CLT that this sample mean is a random variable from an (approximately) Normal distribution with mean μ and standard deviation σ/√n.
Practical use for the CLT continued…
Take a single sample and calculate x̄.
This is an estimate of μ, the true (but unknown) population mean.
But how good is this estimate?
We assume that x̄ is not exactly μ, but is somewhere near it. How near is it likely to be?
Confidence Intervals Introduction
We would like to make probability statements as to how close x̄ is likely to be to μ.
If the sample size is sufficiently large, then the estimate x̄ can be considered as a random variable from a Normal distribution, so probability statements are possible.
This is how we use the CLT in practical data analysis.
Confidence Intervals Introduction
For a Normal distribution, we know that 95% of values will be within 1.96 standard deviations of the mean μ.
[Figure: normal curve with 95% in the shaded area]
So, given one estimate x̄, we can say with 95% confidence that this estimate is within 1.96 standard errors of the actual population mean μ.
• We can turn this knowledge on its head: given x̄, we can be 95% confident that the true mean μ is within 1.96 standard errors of it.
Confidence Interval
From this we can specify a range of values within which we are 95% confident that the population mean (μ) lies.
This is called a confidence interval.
95% Confidence Interval for a population mean (from a large enough sample):
$$\bar{x} \pm 1.96 \times \text{standard error} = \bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}$$
Remarkably, this result holds for samples of size 30 or more. So a large sample, in this context, is a sample of 30 or more.
Example
One sample of size 30 from the electronic components yields a sample mean x̄ = 5,873 hours. We know σ = 3,959, so a 95% confidence interval would be:
$$\bar{x} \pm 1.96 \times \text{standard error} = \bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}} = 5873 \pm 1.96 \frac{3959}{\sqrt{30}} = 5873 \pm 1417 = 4456 \text{ to } 7290$$
Interpretation: we would say that the average lifetime of all components (μ) is between 4,456 and 7,290 hours with 95% confidence.
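The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: `mean_ci` is a hypothetical helper name introduced here, and the inputs are the figures quoted in the example (x̄ = 5,873, σ = 3,959, n = 30).

```python
import math

def mean_ci(sample_mean, sd, n, z=1.96):
    """Return (lower, upper) for sample_mean ± z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Figures from the electronic component example
lower, upper = mean_ci(sample_mean=5873, sd=3959, n=30)
print(f"95% CI: {lower:.0f} to {upper:.0f} hours")
```

Passing a different `z` gives an interval at another confidence level without changing the rest of the calculation.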
Confidence Intervals
Why is this any good?
Before: one estimate, x̄ = 5,873, but no idea of how good or bad it was, i.e. how close to μ it was likely to be.
Now: 95% confident that μ is between 4,456 and 7,290 hours.
So, using the CLT leads to confidence intervals that enable us to estimate a population parameter with a stated level of confidence.
In other words, it gives us an objective measure of the amount of information our sample contains about the likely location of μ.
Problem with σ
All of the above assumes that the population standard deviation (i.e. σ) is known.
In practice this is not known (just like μ). So we need to estimate σ as well. We get this estimate from the standard deviation of the sample, given that the sample is large enough.
The sample standard deviation is called ‘s’.
Estimate σ by s:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
General Confidence Interval for μ (Large Samples)
The general formula is:
$$CI_{1-\alpha} = \bar{x} \pm z_{1-\alpha/2} \frac{s}{\sqrt{n}}$$
Where:
• α is a value between 0 and 1,
• (1−α)×100% is the confidence level you want,
• Z_{1-α/2} is a value from the Normal distribution table.
• Example: for a 95% CI, α = 0.05, (1−α)×100% = 95% and Z_{1-α/2} = 1.96.
Z-Values
The values of Z_{1-α/2} for other confidence levels are given in standard tables.

Confidence Level   α/2              Z_{1-α/2}
90%                0.05 (5%)        1.6449
95%                0.025 (2.5%)     1.96
99%                0.005 (0.5%)     2.5758
99.9%              0.0005 (0.05%)   3.2905

Example
Using these we get the following results for the electronic component example:

Confidence Level   Z_{1-α/2}   CI
90%                1.6449      4684 to 7062
95%                1.96        4456 to 7290
99%                2.5758      4011 to 7735
99.9%              3.2905      3495 to 8251

Note: as α gets smaller the CI gets wider. Also, as n gets bigger the CI narrows, so big samples lead to more precise estimates (i.e. narrower confidence intervals).
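Instead of a printed table, the Z values can be computed from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+); `z_value` is a hypothetical helper name, and the inputs are the electronic component figures (x̄ = 5,873, s = 3,959, n = 30).

```python
import math
from statistics import NormalDist

def z_value(confidence):
    """Z_{1-alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

x_bar, s, n = 5873, 3959, 30        # electronic component example
standard_error = s / math.sqrt(n)

intervals = {}
for confidence in (0.90, 0.95, 0.99, 0.999):
    z = z_value(confidence)
    intervals[confidence] = (x_bar - z * standard_error,
                             x_bar + z * standard_error)
    lower, upper = intervals[confidence]
    print(f"{confidence:.1%}  z={z:.4f}  CI: {lower:.0f} to {upper:.0f}")
```

The loop makes the trade-off visible: asking for more confidence (smaller α) always widens the interval for the same data.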
What CIs and sample sizes should I use?
You can’t control s: it is inherent in the data (population).
You can’t control x̄ either.
You can control Z_{1-α/2}, but in practice scientific convention sets this to reflect 90%, 95% or 99% confidence, with 95% being the accepted default.
You can choose n, but resources may limit you.
There is a whole topic called sample size determination, which you may want to review before collecting data or starting research.
Confidence Interval Assumptions
Sample size is 40 or greater.
Experimental units are independent of each other.
Experimental units were randomly sampled.
The independence assumption requires that the value of the variable for one experimental unit should not tell us anything about the value for another.
Randomness is required to avoid systematic bias in selection.
Exercise Complete Exercise 1 & 2
Calculation of CIs for small samples
What about small samples?
In the case of CIs about a mean we can use the Student t distribution.
The process turns out to be very similar, but we can no longer rely on the CLT.
History of the Student t test
William Gosset used the publishing pseudonym ‘Student’.
He derived the correct sampling distribution for the mean of small samples (here, n < 40); it is known as the ‘t distribution’.
In his honour, it is often called the ‘Student t’ distribution.
Gosset was a chief brewer for Guinness.
The mathematical details are complicated, but it turns out that we perform exactly the same calculations as before, with the one change that the t distribution is used instead of the normal distribution.
Assumptions
Student’s result only referred to a mean where the population was normally distributed with some mean μ and finite standard deviation σ.
This is in contrast to the CLT for large samples, which requires no such assumption about normality.
The t-test also requires the assumption of independence in the sample.
Statistical Model for mean from small samples
The experimental units are independently sampled from a population with mean = μ and standard deviation = σ.
The population is normally distributed (we don’t need this with large samples).
So, to use the t-test for a small sample, you need to establish that the data is sampled from a population that is normally distributed. You could look at the histogram of the sample and see if it is symmetric and bell shaped, or use other methods.
The t-Statistic
If the assumptions are met, the statistic:
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
can be shown to be distributed according to a (Student) t-distribution.
The t-distribution has one parameter, called ‘degrees of freedom’ (df).
The t-Distribution
The t-distribution itself is bell shaped and symmetric, just like the normal distribution, but is ‘flatter’.
There are many t distributions, one for each sample size.
The rule used is: for a sample of size n, use the t distribution with degrees of freedom = n − 1.
Example: if the sample size is 15, then use a t distribution with degrees of freedom 15 − 1 = 14.
Note that degrees of freedom is often abbreviated to df.
[Figure: density curves for Normal(0,1), t(df=4) and t(df=1), plotted for x from -4 to 4 with densities from 0.0 to 0.4]
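The t-Distribution
The t probability density function with k degrees of freedom:
$$f(x) = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{\pi k}\,\Gamma\left(\frac{k}{2}\right)} \cdot \frac{1}{\left[\left(x^2/k\right) + 1\right]^{(k+1)/2}}$$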
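To see how the t density compares with the normal density, the standard form of the t probability density function with k degrees of freedom can be evaluated directly. This is a small sketch using only the standard library; `t_pdf` and `normal_pdf` are hypothetical helper names, and df = 1 and df = 4 are used because those are the curves named in the figure.

```python
import math

def t_pdf(x, k):
    """Density of the t distribution with k degrees of freedom at x."""
    coefficient = math.gamma((k + 1) / 2) / (
        math.sqrt(math.pi * k) * math.gamma(k / 2)
    )
    return coefficient * (1 + x * x / k) ** (-(k + 1) / 2)

def normal_pdf(x):
    """Density of the standard Normal(0, 1) distribution at x."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for x in (0.0, 2.0, 4.0):
    print(f"x={x:.0f}  t(df=1)={t_pdf(x, 1):.4f}  "
          f"t(df=4)={t_pdf(x, 4):.4f}  Normal={normal_pdf(x):.4f}")
```

At x = 0 the t curves are lower than the normal curve, and in the tails they are higher: this is the sense in which the t-distribution is ‘flatter’, and the effect is strongest for small df.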
General Confidence Interval for μ (Small Samples)
The general formula is:
$$CI_{1-\alpha} = \bar{x} \pm t_{(n-1,\,1-\alpha/2)} \frac{s}{\sqrt{n}}$$
where (1−α)×100% is the confidence level you want and t_{(n-1, 1-α/2)} is a value from the t distribution with df = n−1 and a specified α level.
What is t_{(n-1, 1-α/2)}?
A value from the t distribution with n−1 df such that 100(1−α)% of values lie within that range around the mean.
How do you find t_{(n-1, 1-α/2)}?
From a table specifically designed to give it to you, or use a computer.

Confidence Level   α/2              t(df=1)   t(df=10)   t(df=30)
90%                0.05 (5%)        6.314     1.812      1.697
95%                0.025 (2.5%)     12.71     2.228      2.042
99%                0.005 (0.5%)     63.66     3.169      2.750
99.9%              0.0005 (0.05%)   636.6     4.587      3.646

Note: as α gets smaller the CI gets wider; as df gets smaller the CI gets wider.
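Example
Internal temperature of autoclaved aerated concrete used in building. An engineer recorded the following data:
23.01, 22.22, 22.04, 22.62, 22.59
95% CI for the population mean? Here n = 5, x̄ ≈ 22.5, s = 0.3783 and t_{(4, 0.975)} = 2.776, so:
$$CI_{1-\alpha} = \bar{x} \pm t_{(n-1,\,1-\alpha/2)} \frac{s}{\sqrt{n}} = 22.5 \pm 2.776 \frac{0.3783}{\sqrt{5}} = 22.5 \pm 0.4696 = (22.03, 22.97)$$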
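The concrete temperature example can be reproduced step by step in Python. This is a minimal sketch using only the standard library, so the critical value 2.776 for df = 4 at the 95% level is typed in from a standard t table rather than computed (an assumption of this sketch; statistical packages can compute it directly).

```python
import math
import statistics

temperatures = [23.01, 22.22, 22.04, 22.62, 22.59]

n = len(temperatures)
x_bar = statistics.fmean(temperatures)   # sample mean
s = statistics.stdev(temperatures)       # sample standard deviation (divides by n - 1)

t_crit = 2.776                           # t value for df = n - 1 = 4, 95% CI, from a t table
margin = t_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"mean={x_bar:.3f}  s={s:.4f}  margin={margin:.4f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Using the normal value 1.96 in place of 2.776 would give a narrower interval that understates the uncertainty in a sample of only five readings.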
Exercise Answer Questions 3-6
Confidence Intervals for Proportions (Large Samples)
Proportions (including percentages) are often a statistic of interest.
Think of the proportion of defective items on a production line, the proportion of people who respond favourably to a survey question, or the proportion of successes versus failures in some experiment.
Proportions are also covered by the CLT: remember that a proportion is a kind of average (the mean of values coded 1 for items in the class of interest and 0 otherwise).
Confidence Intervals for Proportions (Large Samples)
Take a sample of size n of electronic components coming off a production line, and test each one for defects. The statistic of interest is the proportion of defectives produced by the production process.
The estimated proportion from the sample is
$$\hat{p} = \frac{\text{No. of defectives in the sample}}{n \text{ (the total sample size)}}$$
where p̂ (p-hat) is the symbol used for the estimated proportion from the sample.
Confidence Intervals for Proportions (Large Samples)
If the sample size is sufficiently large and we repeat the experiment a large number of times, then:
• The sampling distribution of the proportion will be approximately normally distributed, by the CLT.
• The mean of this distribution will be p, i.e. the 'true' population proportion.
• The standard deviation of the sampling distribution of the proportion, called the standard error of the proportion, is estimated by
$$\text{S.E. of proportion} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Example: A pharmaceutical company produces 400,000 capsules per day of a particular drug. They test 200 of the capsules for defects (too much/little active compound). If the population p = 0.05 and they take 10,000 repeated samples, this is the histogram of sample proportions they would get:
[Figure: histogram of the 10,000 sample proportions]
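The repeated sampling in this example can be imitated with a short simulation. The sketch below makes two assumptions that are not in the slides: each tested capsule is treated as an independent draw with defect probability 0.05 (reasonable because 200 is tiny compared with 400,000), and the seed is arbitrary. `sample_proportion` is a hypothetical helper name.

```python
import math
import random
import statistics

rng = random.Random(4)               # arbitrary fixed seed for repeatability
p, n, n_samples = 0.05, 200, 10_000  # figures from the capsule example

def sample_proportion(rng, p, n):
    """Proportion of defectives in one simulated sample of n capsules."""
    defectives = sum(rng.random() < p for _ in range(n))
    return defectives / n

proportions = [sample_proportion(rng, p, n) for _ in range(n_samples)]

mean_p = statistics.fmean(proportions)
sd_p = statistics.stdev(proportions)
theoretical_se = math.sqrt(p * (1 - p) / n)

print(f"mean of sample proportions = {mean_p:.4f} (population p = {p})")
print(f"sd of sample proportions   = {sd_p:.4f} (sqrt(p(1-p)/n) = {theoretical_se:.4f})")
```

The simulated proportions centre on p and their standard deviation is close to the standard error formula, which is what the histogram in the example illustrates.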
Sample Size
How big does the sample have to be for the CLT to work with proportions?
The rule is different from the rule for means. Do the following test.
A rule of thumb: the sample size is big enough if
1. np > 5 and
2. n(1−p) > 5
General Confidence Interval Formula for a Population Proportion (Large Sample)
$$CI_{1-\alpha} = \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where (1−α)×100% is the confidence level and Z_{1-α/2} is a value from the standard normal distribution such that 100(1−α)% of values of a standard normal distribution lie within that range around the mean.
So the Z_{1-α/2} values used for a population proportion are the same as those used for a population mean.
Example
How many voters will give F.F. a first preference in the next general election?
There are 2 different estimates:
• Researcher A (10 people) => 40%
• Researcher B (100 people) => 25%
How much 'better' is estimate B than estimate A?
Step one: can we use the formula for large samples?
1. Researcher A: np = 10 × 0.4 = 4 => 4 is not greater than 5, therefore you cannot use the large-sample method.
2. Researcher B: np = 100 × 0.25 = 25 and n(1−p) = 100 × (1 − 0.25) = 75 => both figures are greater than 5, therefore you can use the large-sample method.
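Example Continued
Researcher B: 95% Confidence Interval
$$CI_{1-\alpha} = \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
$$CI_{95} = 0.25 \pm 1.96 \sqrt{\frac{0.25 \times 0.75}{100}}$$
$$CI_{95} \approx 0.25 \pm 1.96 \times 0.04$$
$$CI_{95} \approx 0.25 \pm 0.08$$
$$CI_{95} = 0.17 \text{ to } 0.33$$
So, the 95% CI is 17% to 33%.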
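Both steps of this example, the rule-of-thumb check and the large-sample interval, can be sketched in Python. `large_sample_ok` and `proportion_ci` are hypothetical helper names introduced for illustration; the inputs are the two researchers' figures.

```python
import math

def large_sample_ok(p_hat, n, threshold=5):
    """Rule of thumb: both n*p and n*(1-p) must exceed the threshold."""
    return n * p_hat > threshold and n * (1 - p_hat) > threshold

def proportion_ci(p_hat, n, z=1.96):
    """Large-sample CI: p_hat ± z * sqrt(p_hat * (1 - p_hat) / n)."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error

for name, p_hat, n in (("A", 0.40, 10), ("B", 0.25, 100)):
    if large_sample_ok(p_hat, n):
        lower, upper = proportion_ci(p_hat, n)
        print(f"Researcher {name}: 95% CI {lower:.2f} to {upper:.2f}")
    else:
        print(f"Researcher {name}: sample too small for the large-sample method")
```

Keeping the check separate from the interval calculation makes it harder to apply the normal approximation to a sample like Researcher A's by accident.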
Example Continued
NB: In fact we can get a 95% CI for Researcher A's findings using small sample theory (exact CI). This is available in SAS and other software.
Exact CIs are often based on direct use of probability models.
The method is based directly on calculations for the binomial distribution (see lecture 3).
What do we have to do?
Using the CLT, we found that the 95% CI was composed of the set of values for the mean such that a hypothesis test would not reject the null hypothesis for any of those values at the α = 0.05 level.
Using SAS we can calculate a 95% CI for Researcher A:
CI95 for Researcher A = 12% to 74%, which is too wide to be informative anyway!
If we use the same technique for Researcher B we get:
CI95 for Researcher B = 17% to 35%, which is virtually the same as before using the CLT.
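One common way to build an exact interval from the binomial distribution is the Clopper-Pearson method. The slides do not say which exact method the SAS output uses, so treating it as Clopper-Pearson is an assumption of this sketch. The limits are found by bisection using only the standard library (Python 3.8+ for `math.comb`); `binom_cdf`, `bisect_root` and `exact_ci` are hypothetical helper names.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def bisect_root(g, iterations=60):
    """Root of an increasing function g on (0, 1), found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(c, n, alpha=0.05):
    """Clopper-Pearson interval for c successes out of n trials."""
    # Lower limit: the p at which P(X >= c) equals alpha/2
    lower = 0.0 if c == 0 else bisect_root(
        lambda p: (1 - binom_cdf(c - 1, n, p)) - alpha / 2)
    # Upper limit: the p at which P(X <= c) equals alpha/2
    upper = 1.0 if c == n else bisect_root(
        lambda p: alpha / 2 - binom_cdf(c, n, p))
    return lower, upper

for name, c, n in (("A", 4, 10), ("B", 25, 100)):
    lower, upper = exact_ci(c, n)
    print(f"Researcher {name}: exact 95% CI {lower:.1%} to {upper:.1%}")
```

Under this assumption the results should agree, after rounding, with the intervals quoted for the two researchers.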
Exact CI and tests for population proportions
These work for small samples as well as large samples.
With large samples they will give essentially the same results as the CLT.
They must be used for small samples, however.
They are based on the binomial probability distribution.
Difference between Exact and CLT based methods
When sample sizes are ‘large’ they will give essentially the same results, but exact tests can be very hard to compute even with modern PCs.
When sample sizes are small, exact methods must be used.
The CIs from small samples tend to be very wide: there is no substitute for collecting as much high quality data as you can manage.
Exercise Answer Questions 7-9