STAT6101 CHEATSHEET
SAMPLE STATISTICS
                     Sample    Population
Size                 n         N
Mean                 x̄         μ
Standard Deviation   s         σ
Variance             s²        σ²
Proportion           p         P
Skewness:
Positively skewed: mean > median
Negatively skewed: mean < median
FIVE FIGURE SUMMARY:
Min, Q1, Median (Q2), Q3, Max
The median may be a better indicator of the most typical value if a set of scores
has an outlier. An outlier is an extreme value that differs greatly from other
values.
However, when the sample size is large and does not include outliers, the mean
score usually provides a better measure of central tendency.
Normal distribution: summarise with the mean and SD
Skewed distribution: summarise with the median and IQR
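A rough illustration of the points above: the sketch below computes a five-figure summary and shows how a single outlier moves the mean far more than the median. The scores are made up. Quartile conventions differ between textbooks and software, and `method="inclusive"` in Python's `statistics.quantiles` is only one of them, so Q1 and Q3 may differ slightly from a hand calculation.

```python
import statistics

def five_figure_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

scores = [52, 55, 58, 60, 61, 63, 65, 67, 70]   # hypothetical scores
with_outlier = scores + [250]                    # same scores plus one extreme value

summary = five_figure_summary(scores)
iqr = summary[3] - summary[1]                    # interquartile range = Q3 - Q1

print("summary:", summary, "IQR:", iqr)
print("no outlier:   mean", statistics.mean(scores), "median", statistics.median(scores))
print("with outlier: mean", statistics.mean(with_outlier), "median", statistics.median(with_outlier))
```

The mean jumps from about 61 to about 80 when the outlier is added, while the median only moves from 61 to 62.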
BIVARIATE DATA
Bivariate data. When we conduct a study that examines the relationship between two variables, we are
working with bivariate data. Suppose we conducted a study to see if there were a relationship between the
height and weight of high school students. Since we are working with two variables (height and weight), we
would be working with bivariate data.
CORRELATION COEFFICIENT
ASSOCIATIONS
Positive association indicates that if x increases, then y increases.
Negative association indicates that if x increases, then y decreases.
- Value of a correlation coefficient ranges between -1 and 1
- The greater the absolute value of a correlation coefficient, the stronger the linear relationship
SPEARMAN CORRELATION COEFFICIENT:
- Rank x in order
- Rank y in order
- Find difference between ranks of x and ranks of y
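- Apply: $r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the two ranks of observation i and n is the number of pairs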
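The steps above can be sketched as follows. This assumes the usual rank-difference formula, 1 - 6 Σd² / (n(n² - 1)), and assumes there are no tied values (ties need average ranks, which this sketch does not handle). The data and the names `hours` and `marks` are made up.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman coefficient via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

hours = [2, 4, 5, 7, 9]         # hypothetical x values
marks = [50, 58, 55, 70, 90]    # hypothetical y values
r_s = spearman(hours, marks)
print(round(r_s, 3))
```

Here only two neighbouring ranks are swapped, so the sum of squared differences is 2 and the coefficient is 0.9.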
PEARSON CORRELATION COEFFICIENT:
$R_{xy} = \frac{C_{xy}}{\sqrt{C_{xx} C_{yy}}}$
- Measures the strength and direction of the linear association between two variables
- Strength is given by the absolute value
- Influenced by remote points and variable
- Cannot make any inference based on Rxy alone.
- See booklet for further properties.
- Zero correlation means no LINEAR relationship; the variables could still have a strong non-linear relationship.
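A minimal sketch of the Pearson calculation. It assumes Cxy is the sum of products of deviations from the means (the booklet may scale it differently, but any common scaling cancels in the ratio). The second data set shows a correlation of zero even though y depends exactly on x.

```python
import math

def pearson(x, y):
    """R_xy = C_xy / sqrt(C_xx * C_yy), with C as sums of products of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    c_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    c_xx = sum((a - mean_x) ** 2 for a in x)
    c_yy = sum((b - mean_y) ** 2 for b in y)
    return c_xy / math.sqrt(c_xx * c_yy)

x = [-2, -1, 0, 1, 2]
linear = [2 * a + 1 for a in x]   # exact straight line
curved = [a ** 2 for a in x]      # exact parabola, symmetric about 0

print(pearson(x, linear))   # perfect positive linear association
print(pearson(x, curved))   # no linear association despite exact dependence
```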
NORMAL DISTRIBUTION
Symmetric shape: mean = median
- Bell shaped
- μ and σ are its parameters
- f(x), distribution of a variable for a population
- area under f(x) = 1
CURVE DEPENDS ON TWO FACTORS: THE MEAN AND THE STANDARD DEVIATION.
- Mean determines the location of the centre
- Standard deviation determines the height and width of the graph
o When SD is large, the curve is short and wide.
o When SD is small, the curve is tall and narrow.
- All normal distributions look like a symmetric, bell-shaped curve.
About 68% of the area under the curve falls within 1 standard deviation of the mean.
About 95% of the area under the curve falls within 2 standard deviations of the mean.
About 99.7% of the area under the curve falls within 3 standard deviations of the mean.
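These three percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library in place of a table; the helper name `area_within` is only illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The same areas hold for any normal distribution, because standardizing does not change the number of standard deviations from the mean.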
STANDARD NORMAL DISTRIBUTION
The standard normal distribution is a special case of the normal distribution. It is the distribution that occurs
when a normal random variable has a mean of zero and a standard deviation of one.
STANDARD SCORE (AKA, Z SCORE)
The normal random variable of a standard normal distribution is called a standard score or a z-score. Every
normal random variable X can be transformed into a z score via the following equation:
z = (X - μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard deviation of X.
PROBABILITY IN NORMAL DISTRIBUTION
P(Z < z): look up in statistical tables
If the value is not on the standard scale, standardize first with z = (X - μ) / σ
Three rules:
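The three rules themselves did not survive in this transcript. The sketch below assumes they are the usual table identities (complement, symmetry, and difference for an interval) and uses `NormalDist().cdf` in place of the printed table. The values μ = 170 and σ = 10 are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 170.0, 10.0      # hypothetical population mean and SD
std_normal = NormalDist()    # standard normal, plays the role of the table

def phi(z):
    """Table lookup P(Z < z)."""
    return std_normal.cdf(z)

x = 185.0
z = (x - mu) / sigma         # standardize: z = (X - mu) / sigma

p_below = phi(z)             # P(X < 185) = P(Z < z)
p_above = 1 - phi(z)         # complement: P(Z > z) = 1 - P(Z < z)
p_far_below = 1 - phi(z)     # symmetry:   P(Z < -z) = 1 - P(Z < z)
p_between = phi(z) - phi(-z) # interval:   P(a < Z < b) = P(Z < b) - P(Z < a)

print(z, round(p_below, 4), round(p_above, 4), round(p_between, 4))
```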
THE Z SCORE (STANDARD SCORE)
Indicates how many standard deviations an element is from the mean.
For example, a z-score of 1 represents an element that is 1 standard deviation greater than the mean.
PERCENTAGE POINTS
E.g. the span exceeded by only 5% of men, or the delivery time required to ensure that 95% of packages are delivered before the
guaranteed time.
P(X > k) = 0.05
P(X < k) = 0.95
$P\left(Z < \frac{k - \mu}{\sigma}\right) = 0.95$
If the required probability falls between two table entries, take the average of the two corresponding z-values.
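A small sketch of the delivery-time example, with a made-up mean of 3 days and SD of 0.5 days. `NormalDist().inv_cdf` stands in for reading the table backwards. In a printed table 0.95 typically falls between the entries for z = 1.64 and z = 1.65, whose average is 1.645.

```python
from statistics import NormalDist

mu, sigma = 3.0, 0.5   # hypothetical delivery time: mean 3 days, SD 0.5 days

# Want k with P(X < k) = 0.95, i.e. P(Z < (k - mu) / sigma) = 0.95.
z_95 = NormalDist().inv_cdf(0.95)   # upper 5% point of the standard normal
k = mu + z_95 * sigma               # undo the standardization

print(round(z_95, 3), round(k, 2))
```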
CENTRAL LIMIT THEOREM
- Central limit theorem: the sampling distribution of a statistic (like a sample mean) will follow a normal
distribution, as long as the sample size is sufficiently large.
- When we know the standard deviation of the population we can compute a z-score and use the normal
distribution to evaluate probabilities with the sample mean.
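A simulation sketch of the theorem, not a proof. It draws repeated samples from a skewed population (exponential with mean 1 and SD 1, chosen only for illustration) and looks at the sample means. The sample size, number of samples and seed are arbitrary.

```python
import random
import statistics

random.seed(0)          # fixed seed so the run is repeatable
n = 50                  # size of each sample
num_samples = 2000      # number of repeated samples

# Exponential population with rate 1: skewed, mean 1, standard deviation 1.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

centre = statistics.mean(sample_means)    # expected to be near the population mean
spread = statistics.stdev(sample_means)   # expected to be near sigma / sqrt(n)

print(round(centre, 3), round(spread, 3), round(1 / n ** 0.5, 3))
```

The sample means cluster around 1 with a spread close to 1/√50, even though the population itself is far from normal.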
STUDENT'S T DISTRIBUTION
The t distribution (aka Student's t-distribution) is a probability distribution that is used to estimate population
parameters when the sample size is small and/or when the population variance is unknown.
- Use it when the sample size is small and we don't know the standard deviation of the population.
- Use the t-statistic (t score)
PROPERTIES OF THE T DISTRIBUTION
The t distribution has the following properties:
The mean of the distribution is equal to 0 and it is symmetric
The variance is equal to v / ( v - 2 ), where v is the degrees of freedom (see last section) and v >2.
The variance is always greater than 1, although it is close to 1 when there are many degrees of freedom. With
infinite degrees of freedom, the t distribution is the same as the standard normal distribution.
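A quick numerical look at the variance formula v / (v - 2), showing that it is always above 1 and approaches 1 as the degrees of freedom grow. The function name `t_variance` is only illustrative.

```python
def t_variance(v):
    """Variance of the t distribution with v degrees of freedom (requires v > 2)."""
    if v <= 2:
        raise ValueError("v / (v - 2) only applies when v > 2")
    return v / (v - 2)

for v in (3, 5, 10, 30, 100, 1000):
    print(v, round(t_variance(v), 4))
```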
WHEN TO USE THE T DISTRIBUTION
The t distribution can be used with any statistic having a bell-shaped distribution (i.e., approximately normal).
The central limit theorem states that the sampling distribution of a statistic will be normal or nearly normal, if any of
the following conditions apply.
The population distribution is normal.
The sampling distribution is symmetric, unimodal, without outliers, and the sample size is 15 or less.
The sampling distribution is moderately skewed, unimodal, without outliers, and the sample size is between
16 and 40.
The sample size is greater than 40, without outliers.
The t distribution should not be used with small samples from populations that are not approximately normal.
CHI SQUARE DISTRIBUTION
The chi-square distribution has the following properties:
The mean of the distribution is equal to the number of degrees of freedom: μ = v.
The variance is equal to two times the number of degrees of freedom: σ² = 2 * v
When the degrees of freedom are greater than or equal to 2, the maximum value for Y occurs when Χ² = v - 2.
As the degrees of freedom increase, the chi-square curve approaches a normal distribution.
The chi-square distribution is constructed so that the total area under the curve is equal to 1. The area under the
curve between 0 and a particular chi-square value is a cumulative probability associated with that chi-square value.
Cumulative probability: probability that the value of a random variable falls within a specified range
BINOMIAL DISTRIBUTION
Properties of Binomial:
The experiment consists of n repeated trials.
Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a
failure.
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.
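The number of successes is a random variable that follows a Binomial distribution.
In the formulas below, p is the probability of success on each trial.
Mean: $np$
Variance: $np(1 - p)$
Standard Deviation: $\sqrt{np(1 - p)}$
Probability: $P(X = r) = \binom{n}{r} p^r (1 - p)^{n - r}$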
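A short sketch of the binomial probability and its mean and variance, using `math.comb` for the binomial coefficient. The values n = 10 and p = 0.3 are made up.

```python
import math

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ Binomial(n, p)."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 10, 0.3                    # hypothetical: 10 trials, success probability 0.3
mean = n * p                      # mean = np
variance = n * p * (1 - p)        # variance = np(1 - p)
sd = math.sqrt(variance)          # standard deviation = sqrt(np(1 - p))

pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]
print(round(pmf[3], 4), round(sum(pmf), 6), round(mean, 2), round(variance, 2), round(sd, 3))
```

The probabilities over r = 0, ..., n add up to 1, and the probability-weighted average of r matches np.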
POISSON DISTRIBUTION
Events occur at a rate of μ per unit.
X: number of random events that occur in a particular unit. X has a Poisson distribution with mean μ if:
$P(X = r) = \frac{\mu^r e^{-\mu}}{r!}$
for r = 0, 1, 2, etc.
Use a calculator or Table 7.
Find the mean: multiply each observed number of events by its frequency, add these up, and divide by the total time or space you're observing for.
Features:
1. Discrete variables
2. Events that happen at random in time and space
3. Events happen at rate μ per unit (constant interval)
4. Independent events
5. No. of events follows a Poisson distribution
6. Only one parameter: μ
The variance of the Poisson distribution is equal to μ.
Hence: $\sigma = \sqrt{\mu}$
Limiting case: X is binomial with index n and success probability P = μ/n. As n becomes large (and P small), the
distribution of X tends to Poisson with mean μ = nP
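A sketch of the Poisson probability and of the limiting case, assuming the rate symbol is written `mu` and taking a made-up rate of 2 events per unit. The loop compares Binomial(n, mu/n) with Poisson(mu) at r = 3 for increasing n.

```python
import math

def poisson_pmf(r, mu):
    """P(X = r) for X ~ Poisson(mu): mu^r * e^(-mu) / r!."""
    return mu ** r * math.exp(-mu) / math.factorial(r)

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ Binomial(n, p)."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

mu = 2.0   # hypothetical rate: 2 events per unit
print(round(poisson_pmf(0, mu), 4), round(poisson_pmf(3, mu), 4))

# Limiting case: Binomial(n, mu / n) approaches Poisson(mu) as n grows.
for n in (10, 100, 10000):
    gap = abs(binomial_pmf(3, n, mu / n) - poisson_pmf(3, mu))
    print(n, round(gap, 6))
```

The gap shrinks as n increases, which is the sense in which the binomial "tends to" the Poisson.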
Approximate to normal:
STATISTICAL INFERENCE
1. To reach conclusions about a population on the basis of the information contained in a sample drawn from
that population.
2. Two types of inference: Estimation and Hypothesis testing
ESTIMATION AND SAMPLING DISTRIBUTION
If the P value is very small, this is evidence that the null hypothesis is not correct.
In chi-square tests, 0.05 is the cut-off point.
SAMPLING DISTRIBUTIONS OF THE MEAN:
1. Always calculate standard error of the mean
2. Samples must all have the same sample size
Assumptions:
o The samples were all gathered randomly and the resulting distribution is normal
o This distribution has a mean of μ and a standard error of the mean of $\sigma / \sqrt{n}$
E.g. if I select a sample of people from a population and compute the mean age of the sample, what
is the probability that the mean age of the sample will be less than 40 years?
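The age question can be worked through as below. The population mean (42), population SD (12) and sample size (36) are hypothetical numbers chosen for illustration; they are not given in the sheet.

```python
import math
from statistics import NormalDist

mu, sigma = 42.0, 12.0    # hypothetical population mean age and SD
n = 36                    # hypothetical sample size

standard_error = sigma / math.sqrt(n)     # standard error of the mean
z = (40.0 - mu) / standard_error          # z-score of a sample mean of 40
p_less_than_40 = NormalDist().cdf(z)      # P(sample mean < 40)

print(standard_error, z, round(p_less_than_40, 4))
```

With these numbers the standard error is 2, the z-score is -1, and the probability is roughly 0.16.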
POINT ESTIMATES AND INTERVAL ESTIMATES
Point estimate. A point estimate of a population parameter is a single value of a statistic. For example, the
sample mean x̄ is a point estimate of the population mean μ. Similarly, the sample proportion p is a point
estimate of the population proportion P.
Interval estimate. An interval estimate is defined by two numbers, between which a population parameter is
said to lie. For example, a < μ < b is an interval estimate of the population mean μ. It indicates that the
population mean is greater than a but less than b.
CONDITIONS
We can specify the sampling distribution of the mean whenever two conditions are met:
The population is normally distributed, or the sample size is sufficiently large.
The population standard deviation σ is known.
ESTIMATE:
1. σ known:
   a. $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
   b. Find z value
   c. Calculate standard error of the mean = $\sigma / \sqrt{n}$ and mean of x̄ = μ
   d. Express as a confidence interval
      i. P(-1.96 < Z < 1.96) = 0.95
      ii. $P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} < 1.96\right) = 0.95$
      iii. μ lies in $\bar{x} \pm 1.96 \, \sigma / \sqrt{n}$ with 95% confidence
2. σ not known:
   a. $T = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
   b. Student t-distribution with n - 1 degrees of freedom
   c. Find t value
   d. Calculate estimated standard error of the mean = $s / \sqrt{n}$
   e. Find a 95% confidence interval
      i. μ lies in $\bar{x} \pm t_{n-1,\,0.025} \, s / \sqrt{n}$ with 95% confidence
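Both cases can be sketched on one small data set. The data, and the "known" σ of 0.3, are hypothetical. The t value 2.262 is the table entry for 9 degrees of freedom and an upper tail of 0.025; it is hard-coded here because the standard library has no t distribution, so change it if the sample size changes.

```python
import math
import statistics

sample = [4.9, 5.1, 5.3, 4.7, 5.0, 5.2, 4.8, 5.4, 5.1, 4.5]   # hypothetical data
n = len(sample)
x_bar = statistics.mean(sample)

# Case 1: sigma known (assumed here to be 0.3), use the normal value 1.96.
sigma = 0.3
half_width_z = 1.96 * sigma / math.sqrt(n)
ci_z = (x_bar - half_width_z, x_bar + half_width_z)

# Case 2: sigma not known, use s and the table value t_{n-1, 0.025}.
s = statistics.stdev(sample)      # sample SD with divisor n - 1
t_9 = 2.262                       # t table: 9 degrees of freedom, upper 2.5% point
half_width_t = t_9 * s / math.sqrt(n)
ci_t = (x_bar - half_width_t, x_bar + half_width_t)

print([round(v, 3) for v in ci_z])
print([round(v, 3) for v in ci_t])
```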
CONFIDENCE INTERVALS
FOR THE NORMAL LINEAR REGRESSION:
HYPOTHESIS TESTS
For a two-sided t-test, always multiply the one-tail probability by 2:
P-value = 2 × P(T > |t|)
Interpreting the P-value:
P > 0.05: little or no evidence against H0, do not reject the null hypothesis
0.01 < P < 0.05: fairly strong evidence against H0, reject H0
P < 0.01: strong evidence against H0
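ONE-SAMPLE T-TEST
- Hypotheses: μ = μ0 and μ ≠ μ0
- Data: random sample from a Normal distribution
- Test statistic: $T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$; if H0 is true, t should be close to 0
- D.F. = n − 1
TWO-SAMPLE T-TEST
- Hypotheses: μx = μy (that is, μx − μy = 0) and μx ≠ μy
- Data: independent samples assumed to have the same SD; the sample sizes do not have to be the same
- Test statistic: $T = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}}$
- D.F. = (nx − 1) + (ny − 1)
- Estimate of the difference: μx − μy is estimated by x̄ − ȳ
- Standard error of the estimate: $\mathrm{SE} = \sigma \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}$
- If σ is unknown, use the pooled estimate $s_p = \sqrt{\frac{(n_x - 1) s_x^2 + (n_y - 1) s_y^2}{(n_x - 1) + (n_y - 1)}}$; the SE is the same as before, with $s_p$ in place of σ
- Assumptions: same standard deviation (the means might be different); all observations are independent and normal
MATCHED PAIRS T-TEST
- Hypotheses: μ = 0 and μ ≠ 0, where μ is the mean difference
- Data: values are dependent (paired), same n
- Test statistic: $T = \frac{\bar{x}}{s / \sqrt{n}}$, with x̄ and s computed from the differences
- D.F. = n − 1
- Assumptions: observations are dependent on each other; the differences are assumed to be a random sample from a normal distribution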
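A minimal sketch of the two-sample (pooled) and matched-pairs statistics, using made-up data. It only computes t and the degrees of freedom; the P-value still has to be read from the t table. The function names are illustrative, and the two-sample version uses the difference of the sample means in the numerator.

```python
import math
import statistics

def pooled_t(x, y):
    """Two-sample t statistic with pooled SD; returns (t, degrees of freedom)."""
    nx, ny = len(x), len(y)
    sx2, sy2 = statistics.variance(x), statistics.variance(y)
    df = (nx - 1) + (ny - 1)
    sp = math.sqrt(((nx - 1) * sx2 + (ny - 1) * sy2) / df)   # pooled SD
    se = sp * math.sqrt(1 / nx + 1 / ny)                     # SE of the difference
    return (statistics.mean(x) - statistics.mean(y)) / se, df

def paired_t(before, after):
    """Matched-pairs t statistic on the differences; returns (t, degrees of freedom)."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n)), n - 1

group_x = [12, 15, 11, 14, 13]       # hypothetical independent samples
group_y = [10, 9, 12, 11, 8, 10]
print(pooled_t(group_x, group_y))

before = [70, 65, 80, 75]            # hypothetical paired measurements
after = [74, 66, 85, 77]
print(paired_t(before, after))
```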
HYPOTHESIS TEST FOR TWO-SAMPLE T-TEST
P-value = 2 × P(T > |t|). Tables give the one-tail probability as an interval for a given t, so multiply both ends of that interval by two.
The P-value is the probability of getting a result as extreme as (or more extreme than) that observed, if H0 is true
ASSUMPTIONS FOR DIFFERENCE BETWEEN PROPORTIONS (MEANS ETC)
The size of each population is large relative to the sample drawn from the population. That is, N1 is large
relative to n1, and N2 is large relative to n2. (In this context, populations are considered to be large if they are
at least 10 times bigger than their sample.)
The samples are independent; that is, observations in population 1 are not affected by observations in
population 2, and vice versa.
The set of differences between sample means is normally distributed. This will be true if each population is
normal or if the sample sizes are large. (Based on the central limit theorem, sample sizes of 40 are large
enough).
CHI SQUARE TEST:
Assumptions: Random, Independent, Large enough sample
CHI SQUARE GOODNESS OF FIT:
H0: the data are consistent with the specified distribution
This procedure only works if the expected numbers are not too small (all larger than 5 is large enough);
otherwise categories have to be grouped.
- Measure of how well the set of observed frequencies agrees with the set of expected frequencies (see the
booklet for the equation)
- Degrees of freedom: number of categories − 1
- If any of the parameters have to be estimated: degrees of freedom = number of categories − 1 − number of estimated parameters
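A sketch of a goodness-of-fit calculation. The equation is not reproduced in this sheet, so the usual statistic, the sum of (observed − expected)² / expected, is assumed. The counts are made up (120 rolls of a die, H0: the die is fair), and 11.070 is the table value for 5 degrees of freedom at the 0.05 cut-off.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 16, 25, 19, 20]        # hypothetical counts from 120 die rolls
expected = [sum(observed) / 6] * 6         # fair die: equal expected count per face

# The procedure needs every expected count to be larger than 5.
assert all(e > 5 for e in expected)

statistic = chi_square_statistic(observed, expected)
df = len(observed) - 1                     # categories - 1, no estimated parameters
critical_5pct = 11.070                     # chi-square table: 5 d.f., upper 5% point

print(round(statistic, 2), df, statistic > critical_5pct)
```

Here the statistic is well below the table value, so these counts give no evidence against H0 at the 0.05 level.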