CE 459 Statistics
Assistant Prof. Muhammet Vefa AKPINAR
08.10.2011 2
Lecture Notes
What is Statistics
Frequency Distribution
Descriptive Statistics
Normal Probability Distribution
Sampling Distribution of the Mean
Simple Linear Regression & Correlation
Multiple Regression & Correlation
INTRODUCTION
Criticism
There is a general perception that statistical knowledge is all-too-frequently intentionally misused, by finding ways to interpret the data that are favorable to the presenter.
(A famous quote, variously attributed, but thought to be from Benjamin Disraeli is: "There are three types of lies - lies, damn lies, and statistics.") Indeed, the well-known book How to Lie with Statistics by Darrell Huff discusses many cases of deceptive uses of statistics, focusing on misleading graphs. By choosing (or rejecting, or modifying) a certain sample, results can be manipulated; throwing out outliers is one means of doing so. This may be the result of outright fraud or of subtle and unintentional bias on the part of the researcher.
WHAT IS STATISTICS?
Definition
Statistics is a group of methods used to collect, analyze, present, and interpret data and to make decisions.
What is Statistics ?
American Heritage Dictionary defines statistics as: "The mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling."
The Merriam-Webster's Collegiate Dictionary definition is: "A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data."
The word statistics is also the plural of statistic (singular), which refers to the result of applying a statistical algorithm to a set of data, as in employment statistics, accident statistics, etc.
In applying statistics to a scientific, industrial, or societal problem, one begins with a process or population to be studied. This might be a population of people in a country, of crystal grains in a rock, or of goods manufactured by a particular factory during a given period.
For practical reasons, rather than compiling data about an entire population, one usually instead studies a chosen subset of the population, called a sample.
Data are collected about the sample in an observational or experimental setting. The data are then subjected to statistical analysis, which serves two related purposes: description and inference.
Descriptive statistics and Inferential statistics.
Statistical data analysis can be subdivided into Descriptive statistics and Inferential statistics.
Descriptive statistics is concerned with exploring, visualising, and summarizing data without fitting the data to any models. This kind of analysis is used to explore the data in the initial stages of data analysis. Since no models are involved, it cannot be used to test hypotheses or to make testable predictions. Nevertheless, it is a very important part of analysis that can reveal many interesting features in the data.
Descriptive statistics can be used to summarize the data, either numerically or graphically, to describe the sample. Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs.
Inferential statistics is the next stage in data analysis and involves the identification of a suitable model. The data is then fit to the model to obtain an optimal estimation of the model's parameters. The model then undergoes validation by testing either predictions or hypotheses of the model. Models based on a unique sample of data can be used to infer generalities about features of the whole population.
Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), forecasting of future observations, descriptions of association (correlation), or modeling of relationships (regression).
Other modeling techniques include ANOVA, time series, and data mining.
Population and sample.
Population
A population consists of all elements (individuals, items, or objects) whose characteristics are being studied. The population that is being studied is also called the target population.
Sample
A portion of the population selected for study is referred to as a sample.
Measures of Central Tendency
Mean
Sum of all measurements divided by the number of measurements.
Median:
A number such that at most half of the measurements are below it and at most half of the measurements are above it.
Mode:
The most frequent measurement in the data.
The central tendency of a dataset, i.e. the centre of a frequency distribution, is most commonly measured by the 3 Ms:
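$\bar{y}$ = arithmetic mean = average
Mean
The sample mean ($\bar{y}$) is the arithmetic average of a data set.
It is used to estimate the population mean ($\mu$).
It is calculated by taking the sum of the observed values ($y_i$) divided by the number of observations ($n$):
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{y_1 + y_2 + \dots + y_n}{n}$$
Historical Transmogrifier Average Unit Production Costs
System   $K
1        22.2
2        17.3
3        11.8
4        9.6
5        8.8
6        7.6
7        6.8
8        3.2
9        1.7
10       1.6
$$\bar{y} = \frac{22.2 + 17.3 + \dots + 1.6}{10} = \$9.06\text{K}$$
Residual: $y_i - \bar{y}$, the distance of each observation $y_i$ from the mean $\bar{y} = 9.06$.
[Figure: the ten costs $y_i$ plotted around the mean line $\bar{y} = 9.06$, with the residual $y_i - \bar{y}$ marked]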
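The calculation of the sample mean for the ten unit production costs can be sketched in a few lines of Python. This is only an illustration; the names `costs`, `sample_mean`, and `residuals` are chosen here and do not come from the notes.

```python
# Unit production costs in $K for the ten systems listed in the notes.
costs = [22.2, 17.3, 11.8, 9.6, 8.8, 7.6, 6.8, 3.2, 1.7, 1.6]


def sample_mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


y_bar = sample_mean(costs)

# Residual of each observation: its distance from the mean.
residuals = [y - y_bar for y in costs]

print(f"mean = {y_bar:.2f} $K")
print("residuals:", [round(r, 2) for r in residuals])
```

The residuals always add up to zero (up to floating-point rounding), which is why they cannot be averaged directly to measure spread.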
The Mode
The mode, symbolized by Mo, is the most frequently occurring score value. If the scores for a given sample distribution are:
32 32 35 36 37 38 38 39 39 39 40 40 42 45
then the mode would be 39 because a score of 39 occurs 3 times, more than any other score.
A distribution may have more than one mode if the two most frequently occurring scores occur the same number of times. For example, if the earlier score distribution were modified as follows:
32 32 32 36 37 38 38 39 39 39 40 40 42 45
then there would be two modes, 32 and 39. Such distributions are called bimodal. The frequency polygon of a bimodal distribution is presented below.
Example of Mode
Measurements (x): 3, 5, 1, 1, 4, 7, 3, 8, 3
Mode: 3
Notice that it is possible for a data set not to have any mode.
Mode
The Mode is the value of the data set that occurs most frequently
Example:
1, 2, 4, 5, 5, 6, 8
Here the Mode is 5, since 5 occurred twice and no other value occurred more than once
Data sets can have more than one mode, while the mean and median have one unique value
Data sets can also have NO mode, for example:
1, 3, 5, 6, 7, 8, 9
Here, no value occurs more frequently than any other, therefore no mode exists
You could also argue that this data set contains 7 modes since each value occurs as frequently as every other
Example of Mode
Measurements (x): 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
In this case the data have two modes: 5 and 7.
Both measurements are repeated twice.
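A minimal sketch of how the mode can be found by counting, assuming Python's standard `collections.Counter`. The helper name `modes` is an illustrative choice. It returns every value tied for the highest count, so a data set with no repeated value returns all of its values (the "7 modes" reading mentioned above) rather than "no mode".

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


one_mode = modes([3, 5, 1, 1, 4, 7, 3, 8, 3])      # 3 occurs three times
two_modes = modes([3, 5, 5, 1, 7, 2, 6, 7, 0, 4])  # 5 and 7 both occur twice
no_repeats = modes([1, 3, 5, 6, 7, 8, 9])          # every value occurs once

print(one_mode, two_modes, no_repeats)
```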
Median
Computation of Median
When there is an odd number of numbers, the median is simply the middle number. For example, the median of 2, 4, and 7 is 4. When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5.
Example of Median
Measurements (x)   Ranked measurements (x)
3                  0
5                  1
5                  2
1                  3
7                  4
2                  5
6                  5
7                  6
0                  7
4                  7
Sum: 40            40
Median: (4+5)/2 = 4.5
Notice that only the two central values are used in the computation.
The median is not sensitive to extreme values.
Median rim diameter (cm)
unit 1   unit 2
9.7      9.0
11.5     11.2
11.6     11.3
12.1     11.7
12.4     12.2
12.6     12.5
12.9     13.2    <-- median
13.1     13.8
13.5     14.0
13.6     15.5
14.8     15.6
16.3     16.2
26.9     16.4
Median
The Median is the middle observation of an ordered (from low to high) data set
Examples:
1, 2, 4, 5, 5, 6, 8
Here, the middle observation is 5, so the median is 5
1, 3, 4, 4, 5, 7, 8, 8
Here, there is no “middle” observation so we take the average of the two observations at the center
$$\text{Median} = \frac{4 + 5}{2} = 4.5$$
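The two cases (odd and even number of observations) can be written as a short function. This is a sketch for illustration; the function name `median` is chosen here, and Python's standard library offers an equivalent `statistics.median`.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median([1, 2, 4, 5, 5, 6, 8])       # middle observation
even_case = median([1, 3, 4, 4, 5, 7, 8, 8])   # average of 4 and 5

# Rim diameters (cm) of unit 1 from the notes; the extreme value 26.9
# does not pull the median upward the way it would pull the mean.
unit_1 = [9.7, 11.5, 11.6, 12.1, 12.4, 12.6, 12.9,
          13.1, 13.5, 13.6, 14.8, 16.3, 26.9]
unit_1_median = median(unit_1)

print(odd_case, even_case, unit_1_median)
```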
[Figure: positions of the mode, median, and mean on a frequency curve; in a symmetric distribution, Mode = Median = Mean]
Dispersion Statistics
The Mean, Median and Mode by themselves are not sufficient descriptors of a data set
Example:
Data Set 1: 48, 49, 50, 51, 52
Data Set 2: 5, 15, 50, 80, 100
Note that the Mean and Median for both data sets are identical, but the data sets are glaringly different!
The difference is in the dispersion of the data points
Dispersion Statistics we will discuss are:
Range
Variance
Standard Deviation
Range
The Range is simply the difference between the smallest and largest observation in a data set
Example
Data Set 1: 48, 49, 50, 51, 52
Data Set 2: 5, 15, 50, 80, 100
The Range of data set 1 is 52 - 48 = 4
The Range of data set 2 is 100 - 5 = 95
So, while both data sets have the same mean and median, the dispersion of the data, as depicted by the range, is much smaller in data set 1
Deviation score
A deviation score measures how far each point in a frequency distribution lies above or below the mean of the entire dataset:
deviation score = $X - \bar{X}$
where: X = raw score, $\bar{X}$ = the mean
Note that if you add all the deviation scores for a dataset together, the sum is always zero.
Variance
The Variance, $s^2$, represents the amount of variability of the data relative to their mean
As shown below, the variance is the “average” of the squared deviations of the observations about their mean
$$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$$
The Variance, $s^2$, is the sample variance, and is used to estimate the actual population variance, $\sigma^2$
$$\sigma^2 = \frac{\sum (y_i - \mu)^2}{N}$$
Standard Deviation
The Variance is not a “common sense” statistic because it describes the data in terms of squared units
The Standard Deviation, s, is simply the square root of the variance
$$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$$
The Standard Deviation, s, is the sample standard deviation, and is used to estimate the actual population standard deviation, $\sigma$
$$\sigma = \sqrt{\frac{\sum (y_i - \mu)^2}{N}}$$
Standard Deviation
The sample standard deviation, s, is measured in the same units as the data from which the standard deviation is being calculated
System    FY97$K   $y_i - \bar{y}$   $(y_i - \bar{y})^2$
1         22.2     13.1      172.7
2         17.3     8.2       67.9
3         11.8     2.7       7.5
4         9.6      0.5       0.3
5         8.8      -0.3      0.1
6         7.6      -1.5      2.1
7         6.8      -2.3      5.1
8         3.2      -5.9      34.3
9         1.7      -7.4      54.2
10        1.6      -7.5      55.7
Average   9.06
$$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{172.7 + 67.9 + \dots + 55.7}{10 - 1} = \frac{399.8}{9} = 44.4\ (\$K^2)$$
$$s = \sqrt{s^2} = \sqrt{44.4\ (\$K^2)} = 6.67\ (\$K)$$
This number, $6.67K, represents the average estimating error for predicting subsequent observations
In other words: On average, when estimating the cost of transmogrifiers that belong to the same population as the ten systems above, we would expect to be off by $6.67K
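As a cross-check of the hand calculation, the sample variance and standard deviation of the ten costs can be computed with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1. The variable names are illustrative only.

```python
import statistics

# Unit production costs in $K for the ten systems in the table.
costs = [22.2, 17.3, 11.8, 9.6, 8.8, 7.6, 6.8, 3.2, 1.7, 1.6]

y_bar = statistics.mean(costs)

# Sum of squared deviations about the mean, written out explicitly.
sum_sq = sum((y - y_bar) ** 2 for y in costs)

s2 = sum_sq / (len(costs) - 1)    # sample variance, in ($K)^2
s = s2 ** 0.5                     # sample standard deviation, in $K

print(f"sum of squares = {sum_sq:.1f}")
print(f"variance = {s2:.1f} ($K)^2, standard deviation = {s:.2f} $K")
```

The explicit formula and the library functions agree, so either can be used; the explicit version is shown because it mirrors the table columns.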
Variance and the closely-related standard deviation
The variance and the closely-related standard deviation are measures of how spread out a distribution is. In other words, they are measures of variability.
In order to define the amount of deviation of a dataset from the mean, calculate the mean of all the squared deviation scores, i.e. the variance.
The variance is computed as the average squared deviation of each number from its mean.
For example, for the numbers 1, 2, and 3, the mean is 2 and the variance is:
$$\sigma^2 = \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = \frac{2}{3} \approx 0.667$$
The variance in a population is:
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
The variance in a sample is:
$$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$
where μ is the population mean, $\bar{X}$ is the sample mean, and N is the number of scores.
The standard deviation is the square root of the variance.
Variance and Standard Deviation
Example of Mean
Measurements (x)   Deviation (x - mean)
3                  -1
5                  1
5                  1
1                  -3
7                  3
2                  -2
6                  2
7                  3
0                  -4
4                  0
Sum: 40            0
MEAN = 40/10 = 4
Notice that the sum of the “deviations” is 0.
Notice that every single observation enters into the computation of the mean.
Example of Variance
Measurements (x)   Deviations (x - mean)   Square of deviations
3                  -1                      1
5                  1                       1
5                  1                       1
1                  -3                      9
7                  3                       9
2                  -2                      4
6                  2                       4
7                  3                       9
0                  -4                      16
4                  0                       0
Sum: 40            0                       54
Variance = 54/9 = 6
It is a measure of “spread”.
Notice that the larger the deviations (positive or negative), the larger the variance.
The standard deviation
It is defined as the square root of the variance
In the previous example
Variance = 6
Standard deviation = Square root of the variance = Square root of 6 = 2.45
Observed Vehicle velocity
Velocity (km/h)
67 73 81 72 76 75 85 77 68 84
76 93 73 79 88 73 60 93 71 59
74 62 95 78 63 72 66 78 82 75
96 70 89 61 75 95 66 79 83 71
76 65 71 75 65 80 73 57 88 78
Mean, Median, Standard Deviation
Valid N   Range   Mean    Median   Minimum   Maximum   Variance   Std. Dev.
50        39      75.62   75       57        96        96.362     9.816458
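The summary statistics for the 50 observed vehicle speeds can be reproduced with Python's standard `statistics` module. This is a sketch for checking the table, not the software used to produce it; the variable names are illustrative.

```python
import statistics

# The 50 observed vehicle speeds (km/h) listed in the notes.
speeds = [
    67, 73, 81, 72, 76, 75, 85, 77, 68, 84,
    76, 93, 73, 79, 88, 73, 60, 93, 71, 59,
    74, 62, 95, 78, 63, 72, 66, 78, 82, 75,
    96, 70, 89, 61, 75, 95, 66, 79, 83, 71,
    76, 65, 71, 75, 65, 80, 73, 57, 88, 78,
]

n = len(speeds)
data_range = max(speeds) - min(speeds)
mean = statistics.mean(speeds)
median = statistics.median(speeds)
variance = statistics.variance(speeds)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(speeds)       # sample standard deviation

print(n, data_range, mean, median, min(speeds), max(speeds))
print(f"variance = {variance:.3f}, std dev = {std_dev:.4f}")
```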
Frequency Table
Class   Class interval   Class midpoint   Frequency   Relative freq. (%)   Cumulative freq.   Relative cumulative freq. (%)
1       50 < x <= 55     52.5             0           0                    0                  0
2       55 < x <= 60     57.5             3           6                    3                  6
3       60 < x <= 65     62.5             5           10                   8                  16
4       65 < x <= 70     67.5             5           10                   13                 26
5       70 < x <= 75     72.5             14          28                   27                 54
6       75 < x <= 80     77.5             10          20                   37                 74
7       80 < x <= 85     82.5             5           10                   42                 84
8       85 < x <= 90     87.5             3           6                    45                 90
9       90 < x <= 95     92.5             4           8                    49                 98
10      95 < x <= 100    97.5             1           2                    50                 100
Frequency Table
A cumulative frequency distribution is a plot of the number of observations falling in or below an interval. The graph shown here is a cumulative frequency distribution of the scores on a statistics test.
A frequency table is constructed by dividing the scores into intervals and counting the number of scores in each interval. The actual number of scores as well as the percentage of scores in each interval are displayed. Cumulative frequencies are also usually displayed.
The X-axis shows various intervals of vehicle speed.
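The construction of a frequency table (count the scores in each interval, then accumulate) can be sketched as follows. The function name `frequency_table` and its parameters are illustrative assumptions; the intervals follow the convention used in the table, lower < x <= upper.

```python
# The 50 observed vehicle speeds (km/h) listed in the notes.
speeds = [
    67, 73, 81, 72, 76, 75, 85, 77, 68, 84,
    76, 93, 73, 79, 88, 73, 60, 93, 71, 59,
    74, 62, 95, 78, 63, 72, 66, 78, 82, 75,
    96, 70, 89, 61, 75, 95, 66, 79, 83, 71,
    76, 65, 71, 75, 65, 80, 73, 57, 88, 78,
]


def frequency_table(values, start, width, n_classes):
    """Build rows of (lower, upper, midpoint, freq, rel %, cum freq, cum %)."""
    total = len(values)
    cumulative = 0
    rows = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        # An observation belongs to the class if lower < x <= upper.
        freq = sum(1 for v in values if lower < v <= upper)
        cumulative += freq
        rows.append((lower, upper, (lower + upper) / 2, freq,
                     100 * freq / total, cumulative, 100 * cumulative / total))
    return rows


table = frequency_table(speeds, start=50, width=5, n_classes=10)
for row in table:
    print(row)
```

The frequency column of this table gives the bar heights of the histogram, and the class midpoints give the x-positions of the points in a frequency polygon.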
Selecting the Interval Size
In order to find a starting interval size, the first step is to find the range of the data by subtracting the smallest score from the largest. In the case of the example data, the range was 96 - 57 = 39. The range is then divided by the number of desired intervals, with a suggested starting number of intervals being ten (10). In the example, the result would be 39/10 = 3.9; the frequency table for these data instead works from the rounded limits 50 and 100, giving 50/10 = 5. The nearest odd integer value is used as the starting point for the selection of the interval size.
Histogram
A histogram is constructed from a frequency table. The intervals are shown on
the X-axis and the number of scores in each interval is represented by the
height of a rectangle located above the interval. A histogram of the vehicle
speed from the dataset is shown below. The shapes of histograms will vary
depending on the choice of the size of the intervals.
[Figure: histogram of VAR1 (vehicle speed) with the expected normal curve; x-axis: upper boundaries (x <= boundary), 50 to 100; y-axis: number of observations, 0 to 16]
There are many different-shaped frequency distributions:
A frequency polygon is a graphical display of a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a point located above the middle of the interval. The points are connected so that together with the X-axis they form a polygon.
Spread, Dispersion, Variability
A variable's spread is the degree to which scores on the variable differ from each other. If every score on the variable were about equal, the variable would have very little spread. There are many measures of spread. The distributions shown below have the same mean but differ in spread: The distribution on the bottom is more spread out. Variability and dispersion are synonyms for spread.
Skew
Further Notes
When the Mean is greater than the Median the data distribution is skewed to the Right.
When the Median is greater than the Mean the data distribution is skewed to the Left.
When Mean and Median are very close to each other the data distribution is approximately symmetric.
The distribution shown below has a positive skew. The mean is larger than the median.
If a test was very difficult and almost everyone in the class did very poorly on it, the resulting distribution would most likely be positively skewed.
The Effect of Skew on the Mean and Median
The distribution shown below has a negative skew. The mean is smaller than the median.
Probability
Likelihood or chance of occurrence. The probability of an event is the theoretical relative frequency of the event in a model of the population.
Normal Distribution or Normal Curve
The normal distribution is probably one of the most important and widely used continuous distributions. A variable that follows it is known as a normal random variable, and its probability distribution is called a normal distribution.
The normal distribution is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions. In general, the normal distribution provides a good model for a random variable.
In a normal distribution:
About 68% of values fall within ±1 SD of the mean
About 95% of values fall within ±2 SD of the mean (more precisely, ±1.96 SD)
About 99.7% of values fall within ±3 SD of the mean
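These percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean equals erf(k/√2), where erf is the error function available as `math.erf` in Python. The helper name `prob_within` is an illustrative choice.

```python
import math


def prob_within(k):
    """P(-k < Z < k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))


for k in (1, 1.96, 2, 3):
    print(f"within ±{k} SD: {prob_within(k):.4f}")
```

The output shows why 1.96, not 2, is the multiplier used for exactly 95%: ±2 SD actually covers slightly more than 95% of the area.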
The normal distribution function
The normal distribution function is determined by the following formula:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
Where;
μ: mean, σ: standard deviation, e: Euler's number (2.71...), π: constant Pi (3.14...)
Characteristics of the Normal Distribution:
It is bell shaped and is symmetrical about its mean. It is asymptotic to the x-axis, i.e., it extends indefinitely in either direction from the mean.
Normal distributions are symmetric, with scores more concentrated in the middle than in the tails.
It is a family of curves, i.e., every unique pair of mean and standard deviation defines a different normal distribution. Thus, the normal distribution is completely described by two parameters: mean and standard deviation.
There is a strong tendency for the variable to take a central value. It is unimodal, i.e., values mound up only in the center of the curve.
The frequency of deviations falls off rapidly as the deviations become larger.
The total area under the curve sums to 1; the area of the distribution on each side of the mean is 0.5.
The Area Under the Curve Between any Two Scores is a PROBABILITY
The probability that a random variable will have a value between any two points is equal to the area under the curve between those points.
Positive and negative deviations from this central value are equally likely.
Examples of normal distributions
Notice that they differ in how spread out they are. The area under each curve is the same. The height of a normal distribution can be specified mathematically in terms of two parameters: the mean (μ) and the standard deviation (σ). The two parameters, μ and σ, each change the distribution in a different manner.
Changes in μ without changes in σ
Changes in μ, without changes in σ, result in moving the distribution to the right or left, depending upon whether the new value of μ is larger or smaller than the previous value, but do not change the shape of the distribution.
Changes in the value of σ
Changes in the value of σ change the shape of the distribution without affecting the midpoint, because σ affects the spread or the dispersion of scores. The larger the value of σ, the more dispersed the scores; the smaller the value, the less dispersed. The distribution below demonstrates the effect of increasing the value of σ:
THE STANDARD NORMAL CURVE
The standard normal curve is a member of the family of normal curves with μ = 0.0 and σ = 1.0.
Note that integral calculus is used to find the area under the normal distribution curve. However, this can be avoided by transforming any normal distribution to fit the standard normal distribution. This conversion is done by rescaling the normal distribution axis from its true units (time, weight, dollars, and so on) to a standard measure called the Z score or Z value.
Standard Scores (z Scores)
A Z score is the number of standard deviations that a value, X, is away from the mean.
Standard scores are therefore useful for comparing datapoints in different distributions.
If the value of X is greater than the mean, the Z score is positive; if the value of X is less than the mean, the Z score is negative. The Z score equation is as follows:
$$z = \frac{X - \mu}{\sigma}$$
where z is the z-score for the value of X, μ is the mean, and σ is the standard deviation
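A minimal sketch of the conversion to a Z score. The function name `z_score` is illustrative, and the example reuses the vehicle speed data from earlier in these notes (mean 75.62 km/h, standard deviation 9.816 km/h) purely as an illustration, treating the sample values as if they were μ and σ.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Fastest and slowest observed vehicles in the speed data set.
z_fastest = z_score(96, 75.62, 9.816)
z_slowest = z_score(57, 75.62, 9.816)

print(f"z for 96 km/h: {z_fastest:.2f}")
print(f"z for 57 km/h: {z_slowest:.2f}")
```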
Table of the Standard Normal (z) Distribution
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
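Each entry in this table is the area under the standard normal curve between the mean (z = 0) and the given z. As a sketch, the same values can be computed from the error function, since that area equals 0.5·erf(z/√2). The function name `area_mean_to_z` is an illustrative choice.

```python
import math


def area_mean_to_z(z):
    """Area under the standard normal curve between 0 and z (the table entry)."""
    return 0.5 * math.erf(z / math.sqrt(2))


# Row 1.0 of the table: z = 1.00, 1.01, ..., 1.09
row = [round(area_mean_to_z(1.0 + j / 100), 4) for j in range(10)]
print(row)

# The cumulative probability P(Z <= z) is the table entry plus 0.5 for z >= 0.
print(round(0.5 + area_mean_to_z(1.62), 4))
```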
Three areas on a standard normal curve
[Figure: shaded areas under the standard normal curve: total area from -infinity to Z; total area from Z to +infinity; area from -Z to +Z; area from -infinity to -Z plus +Z to +infinity; total area from -infinity to Z-1.5]
Columns of the table below, from left to right:
1. Z
2. Area under curve from negative infinity to Z
3. Area under curve from Z to positive infinity
4. Area under curve from -Z to +Z
5. Area under curve (negative infinity to -Z) plus (+Z to positive infinity)
6. Column 5 converted into PPM
7. Area under curve from negative infinity to Z-1.5
0,000 0,50000000000000 0,50000000000000 0,00000000000000 1,00000000000000 1.000.000,00000000 0,06680720126886
0,100 0,53982783727702 0,46017216272298 0,07965567455403 0,92034432544597 920.344,32544597 0,08075665923377
0,200 0,57925970943909 0,42074029056091 0,15851941887818 0,84148058112182 841.480,58112182 0,09680048458561
0,300 0,61791142218894 0,38208857781106 0,23582284437788 0,76417715562212 764.177,15562212 0,11506967022170
0,400 0,65542174161031 0,34457825838969 0,31084348322063 0,68915651677937 689.156,51677937 0,13566606094638
0,500 0,69146246127400 0,30853753872600 0,38292492254801 0,61707507745200 617.075,07745200 0,15865525393145
0,600 0,72574688224992 0,27425311775008 0,45149376449983 0,54850623550017 548.506,23550017 0,18406012534675
0,700 0,75803634777692 0,24196365222308 0,51607269555384 0,48392730444617 483.927,30444617 0,21185539858339
0,800 0,78814460141659 0,21185539858341 0,57628920283319 0,42371079716681 423.710,79716681 0,24196365222306
0,900 0,81593987465323 0,18406012534677 0,63187974930647 0,36812025069354 368.120,25069354 0,27425311775006
1,000 0,84134474606854 0,15865525393146 0,68268949213707 0,31731050786293 317.310,50786293 0,30853753872598
1,100 0,86433393905361 0,13566606094639 0,72866787810722 0,27133212189278 271.332,12189278 0,34457825838967
1,200 0,88493032977829 0,11506967022171 0,76986065955657 0,23013934044343 230.139,34044343 0,38208857781104
1,300 0,90319951541439 0,09680048458562 0,80639903082877 0,19360096917123 193.600,96917123 0,42074029056089
1,400 0,91924334076622 0,08075665923378 0,83848668153245 0,16151331846755 161.513,31846755 0,46017216272296
The area between Z-scores of -1.00 and +1.00 is .68 or 68%.
The area between Z-scores of -2.00 and +2.00 is .95 or 95%.
Exercise 1
An industrial sewing machine uses ball bearings that are targeted to have a diameter of 0.75 inch. The specification limits under which the ball bearing can operate are 0.74 inch (lower) and 0.76 inch (upper). Past experience has indicated that the actual diameter of the ball bearings is approximately normally distributed with a mean of 0.753 inch and a standard deviation of 0.004 inch.
For this problem, note that "Target" = .75, and "Actual mean" = .753.
What is the probability that a ball bearing will be between the target and the actual mean?
Z = (0.75 - 0.753)/0.004 = -0.75
P(-0.75 < Z < 0) = .2734
What is the probability that a ball bearing will be between the lower specification limit and the target?
Z = (0.74 - 0.753)/0.004 = -3.25
P(-3.25 < Z < -0.75) = .49942 - .2734 = .22602
What is the probability that a ball bearing will be above the upper specification limit?
Z = (0.76 - 0.753)/0.004 = 1.75
P(Z > 1.75) = .5 - .4599 = .0401
What is the probability that a ball bearing will be below the lower specification limit?
P(Z < -3.25) = .5 - .49942 = .00058
Above which value in diameter will 93% of the ball bearings be?
The value asked for here will be the 7th percentile, since 93% of the ball bearings will have diameters above that. So we will look up .4300 in the Z-table in a "backwards" manner. The closest area to this is .4306, which corresponds to a Z-value of 1.48. Because the 7th percentile lies below the mean, the Z score is -1.48.
-1.48 = (X - 0.753)/0.004, so -0.00592 = X - 0.753 and X = 0.74708
So 0.74708 in. is the value that 93% of the diameters are above.
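The answers to Exercise 1 can be cross-checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch for verification, not part of the original exercise; the variable names are illustrative. Small differences from the table-based answers come from rounding Z to two decimals.

```python
from statistics import NormalDist

# Actual ball bearing diameters: mean 0.753 inch, standard deviation 0.004 inch.
diameter = NormalDist(mu=0.753, sigma=0.004)
target, lower, upper = 0.75, 0.74, 0.76

p_target_to_mean = diameter.cdf(0.753) - diameter.cdf(target)
p_lower_to_target = diameter.cdf(target) - diameter.cdf(lower)
p_above_upper = 1 - diameter.cdf(upper)
p_below_lower = diameter.cdf(lower)

# Diameter exceeded by 93% of the bearings, i.e. the 7th percentile.
x_93_above = diameter.inv_cdf(0.07)

print(f"{p_target_to_mean:.4f} {p_lower_to_target:.4f}")
print(f"{p_above_upper:.4f} {p_below_lower:.5f}")
print(f"{x_93_above:.5f}")
```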
Exercise 2
Graduate Management Aptitude Test (GMAT) scores are widely used by graduate schools of business as an entrance requirement. Suppose that in one particular year, the mean score for the GMAT was 476, with a standard deviation of 107. Assuming that the GMAT scores are normally distributed, answer the following questions:
Question 1
What is the probability that a randomly selected score from this GMAT falls between 476 and 650 (476 <= x <= 650)? The following figure shows a graphic representation of this problem.
Answer: Z = (650 - 476)/107 = 1.62. The Z value of 1.62 indicates that the GMAT score of 650 is 1.62 standard deviations above the mean. The standard normal table gives the probability of a value falling between 650 and the mean. The whole number and tenths place portion of the Z score appear in the first column of the table. Across the top of the table are the values of the hundredths place portion of the Z score. Thus the answer is that 0.4474 or 44.74% of the scores on the GMAT fall between a score of 476 and 650.
Question 2
What is the probability of receiving a score greater than 750 on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e., P(X >= 750) = ?
Answer: This problem asks for the area of the upper tail of the distribution. The Z score is Z = (750 - 476)/107 = 2.56. From a standard normal table, the area for Z = 2.56 is 0.4948. This is the probability of a GMAT score between 476 and 750. The upper tail is therefore 0.5 - 0.4948 = 0.0052 or 0.52%. Note that P(X >= 750) is the same as P(X > 750) because, in a continuous distribution, the area under an exact number such as X = 750 is zero.
Question 3
What is the probability of receiving a score of 540 or less on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e., P(X <= 540) = ?
We are asked to determine the area under the curve for all values less than or equal to 540.
The Z score is (540 - 476)/107 = 0.6. From the table, the area for Z = 0.60 is 0.2257, which is the probability of getting a score between the mean 476 and 540.
The answer to this problem is: 0.5 + 0.2257 = 0.7257, or about 73%. Graphic representation of this problem.
Question 4
What is the probability of receiving a score between 330 and 440 on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e., P(330 < X < 440) = ?
The two values fall on the same side of the mean.
The Z scores are: Z1 = (330 - 476)/107 = -1.36, and Z2 = (440 - 476)/107 = -0.34.
The probability associated with Z = -1.36 is 0.4131.
The probability associated with Z = -0.34 is 0.1331.
The answer to this problem is: 0.4131 - 0.1331 = 0.28 or 28%.
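The four GMAT answers can be checked without a table by using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a verification sketch with illustrative variable names; the results differ slightly from the table-based answers because the table method rounds Z to two decimals.

```python
from statistics import NormalDist

# GMAT scores: mean 476, standard deviation 107, assumed normally distributed.
gmat = NormalDist(mu=476, sigma=107)

p_476_to_650 = gmat.cdf(650) - gmat.cdf(476)   # Question 1
p_above_750 = 1 - gmat.cdf(750)                # Question 2 (upper tail)
p_at_most_540 = gmat.cdf(540)                  # Question 3 (everything below 540)
p_330_to_440 = gmat.cdf(440) - gmat.cdf(330)   # Question 4 (same side of the mean)

for label, p in [("Q1", p_476_to_650), ("Q2", p_above_750),
                 ("Q3", p_at_most_540), ("Q4", p_330_to_440)]:
    print(f"{label}: {p:.4f}")
```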
Standard Error (SE)
Any statistic can have a standard error. Each sampling distribution has a standard error.
Standard errors are important because they reflect how much sampling fluctuation a statistic will show, i.e. how good an estimate of the population the sample statistic is.
How good an estimate of the population mean is a sample mean? One way to determine this is to repeat the experiment many times and to determine the mean of the means. However, this is tedious and frequently impossible.
SE refers to the variability of the sample statistic, a measure of spread for random variables
The inferential statistics involved in the construction of confidence intervals (CI) and significance testing are based on standard errors.
Standard Error of the Mean, SEM, σM
The standard deviation of the sampling distribution of the mean is called the standard error of the mean:
$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$
where σ is the standard deviation of the population and N is the sample size.
The size of the standard error of the mean is therefore inversely proportional to the square root of the sample size.
Note:
The standard error of any statistic depends on the sample size - in general, the larger the sample size the smaller the standard error.
Note that the spread of the sampling distribution of the mean decreases as the sample size increases.
Notice that the mean of the distribution is not affected by sample size.
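The effect of sample size on the standard error of the mean can be seen in a short sketch using the usual relation σ/√N. The function name `standard_error_of_mean` is illustrative, and the population standard deviation of 36 is borrowed from the children's weight example in these notes.

```python
import math


def standard_error_of_mean(sd, n):
    """Standard deviation of the sampling distribution of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


# Population standard deviation of 36, increasing sample sizes.
for n in (9, 36, 144):
    print(f"N = {n:3d}: standard error = {standard_error_of_mean(36, n):.1f}")
```

Quadrupling the sample size only halves the standard error, which is what "inversely proportional to the square root of the sample size" means in practice.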
Comparing the Averages of Two Independent Samples
Is there "grade inflation" at KTU? How does the average GPA of KTU students today compare with, say, 10 years ago?
Suppose a random sample of 100 student records from 10 years ago yields a sample average GPA of 2.90 with a standard deviation of .40.
A random sample of 100 current students today yields a sample average of 2.98 with a standard deviation of .45.
The difference between the two sample means is 2.98 - 2.90 = .08. Is this proof that GPAs are higher today than 10 years ago?
First we need to account for the fact that 2.98 and 2.90 are not the true averages, but are computed from random samples. Therefore, .08 is not the true difference, but simply an estimate of the true difference.
Can this estimate miss by much? Fortunately, statistics has a way of measuring the expected size of the "miss" (or error of estimation). For our example, it is .06 (we show how to calculate this later). Therefore, we can state the bottom line of the study as follows: "The average GPA of KTU students today is .08 higher than 10 years ago, give or take .06 or so."
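The notes defer the calculation of the .06 figure. One standard formula that reproduces it, assuming the two samples are independent, is the standard error of the difference between two means, √(s₁²/n₁ + s₂²/n₂). The sketch below uses that formula; whether it is the method intended later in the course is an assumption, and the function name `se_difference` is illustrative.

```python
import math


def se_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


# 10 years ago: n = 100, sd = 0.40; today: n = 100, sd = 0.45
se = se_difference(0.40, 100, 0.45, 100)
difference = 2.98 - 2.90

print(f"difference = {difference:.2f}, give or take {se:.2f}")
```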
Overview of Confidence Intervals
Once the population is specified, the next step is to take a random sample from it. In this example, let's say that a sample of 10 students were drawn and each student's memory tested. The way to estimate the mean of all high school students would be to compute the mean of the 10 students in the sample. Indeed, the sample mean is an unbiased estimate of μ, the population mean.
Clearly, if you already knew the population mean, there would be no need for a confidence interval.
We are interested in the mean weight of 10-year old kids living in Turkey. Since it would have been impractical to weigh all the 10-year old kids in Turkey, you took a sample of 16 and found that the mean weight was 90 pounds. This sample mean of 90 is a point estimate of the population mean.
A point estimate by itself is of limited usefulness because it does not reveal the uncertainty associated with the estimate; you do not have a good sense of how far this sample mean may be from the population mean. For example, can you be confident that the population mean is within 5 pounds of 90? You simply do not know.
Confidence intervals provide more information than point estimates.
An example of a 95% confidence interval is shown below:
72.85 < μ < 107.15
There is good reason to believe that the population mean lies between these two bounds of 72.85 and 107.15 since 95% of the time confidence intervals contain the true mean.
If repeated samples were taken and the 95% confidence interval computed for each sample, 95% of the intervals would contain the population mean. Naturally, 5% of the intervals would not contain the population mean.
It is natural to interpret a 95% confidence interval as an interval with a 0.95 probability of containing the population mean.
The wider the interval, the more confident you are that it contains the parameter. The 99% confidence interval is therefore wider than the 95% confidence interval and extends from 4.19 to 7.61.
Example
Assume that the weights of 10-year old children are normally distributed with a mean of 90 and a standard deviation of 36. What is the sampling distribution of the mean for a sample size of 9?
The sampling distribution of the mean is normal with a mean of 90 and a standard deviation of 36/√9 = 36/3 = 12. Note that the standard deviation of a sampling distribution is its standard error.
95% of the sample means fall between:
90 - (1.96)(12) = 66.48
90 + (1.96)(12) = 113.52
The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean; 12 is the standard error of the mean.
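The arithmetic of this example can be checked with a short sketch. The function name is illustrative; the only formula used is standard error = σ/√n.

```python
import math

def sampling_distribution(mu, sigma, n):
    """Return the mean and standard error of the sample mean for samples of size n."""
    return mu, sigma / math.sqrt(n)

mean, std_error = sampling_distribution(mu=90, sigma=36, n=9)

z = 1.96  # 95% of a normal distribution lies within 1.96 SDs of its mean
lower = mean - z * std_error
upper = mean + z * std_error
print(f"standard error = {std_error}; 95% of sample means lie in [{lower:.2f}, {upper:.2f}]")
```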
The figure shows that 95% of the means are no more than 23.52 units (1.96 × 12) from the mean of 90.
Now consider the probability that a sample mean computed in a random sample is within 23.52 units of the population mean of 90. Since 95% of the distribution is within 23.52 of 90, the probability that the mean from any given sample will be within 23.52 of 90 is 0.95.
This means that if we repeatedly compute the mean (M) from a sample, and create an interval ranging from M - 23.52 to M + 23.52, this interval will contain the population mean 95% of the time.
Notice that you need to know the standard deviation (σ) in order to estimate the mean. This may sound unrealistic, and it is. However, computing a confidence interval when σ is known is easier than when σ has to be estimated, and serves a pedagogical purpose.
Suppose the following five numbers were sampled from a normal distribution with a standard deviation of 2.5: 2, 3, 5, 6, and 9. To compute the 95% confidence interval, start by computing the mean and standard error:
M = (2 + 3 + 5 + 6 + 9)/5 = 5
σM = σ/√N = 2.5/√5 = 1.118
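Z.95 is the number of standard deviations from the mean of a normal distribution needed to contain 0.95 of the area; its value is 1.96.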
If you had wanted to compute the 99% confidence interval, you would have set the shaded area to 0.99 and the result would have been 2.58.
The confidence interval can then be computed as follows:
Lower limit = 5 - (1.96)(1.118)= 2.81
Upper limit = 5 + (1.96)(1.118)= 7.19
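The same interval can be computed in a few lines. This is a sketch under the stated assumption that σ = 2.5 is known; `z_interval` is an illustrative name, and the z value comes from the standard library's `statistics.NormalDist` rather than a printed table.

```python
import math
from statistics import NormalDist, mean

def z_interval(data, sigma, confidence=0.95):
    """Confidence interval for the mean when the population SD (sigma) is known."""
    m = mean(data)
    std_error = sigma / math.sqrt(len(data))
    # Two-sided interval: leave (1 - confidence)/2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * std_error, m + z * std_error

sample = [2, 3, 5, 6, 9]
low95, high95 = z_interval(sample, sigma=2.5)
low99, high99 = z_interval(sample, sigma=2.5, confidence=0.99)
print(f"95%: {low95:.2f} to {high95:.2f}")
print(f"99%: {low99:.2f} to {high99:.2f}")
```

The 99% interval comes out wider than the 95% interval, as the text describes.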
Estimating the Population Mean Using Intervals
Estimate the average GPA of the population of approximately 23000 KTU undergraduates. For n = 25 randomly selected students, the sample average is 3.05.
Consider estimating the population average μ.
Now chances are the true average is not equal to 3.05.
The true KTU average GPA is between 1.00 and 4.00, and with high confidence between (2.50, 3.50); but what level of confidence do we have that it is between, say, (2.75, 3.25) or (2.95, 3.15)?
Even better, can we find an interval (a, b) which will contain μ with 95% certainty?
Example:
Given the following GPA for 6 students: 2.80, 3.20, 3.75, 3.10, 2.95, 3.40
Calculate a 95% confidence interval for the population mean GPA.
Determining Sample Size for Estimating the Mean
Suppose we want to estimate the average GPA of KTU undergraduates this school year. Historically, the SD of student GPA is known to be σ ≈ .30 (the value consistent with the margin of error quoted below, since 1.96 × .30/√25 ≈ .12).
If a random sample of size n = 25 yields a sample mean of 3.05, then the population mean μ is estimated as lying within the interval 3.05 ± .12
with 95% confidence. The plus-or-minus quantity .12 is called the margin of error of the sample mean associated with a 95% confidence level. It is also correct to say "we are 95% confident that μ is within .12 of the sample mean 3.05".
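A sketch of the margin-of-error calculation and of the sample size needed for a target margin. It assumes σ = 0.30, a value chosen only because it reproduces the .12 margin quoted for n = 25; the function names and the target margin of .05 are illustrative.

```python
import math

Z_95 = 1.96  # z value for a 95% confidence level

def margin_of_error(sigma, n, z=Z_95):
    """Margin of error of a sample mean: z * sigma / sqrt(n)."""
    return z * sigma / math.sqrt(n)

def required_sample_size(sigma, margin, z=Z_95):
    """Smallest n whose margin of error does not exceed the target margin."""
    return math.ceil((z * sigma / margin) ** 2)

SIGMA = 0.30  # assumed SD of GPA, not stated explicitly in the notes

moe = margin_of_error(SIGMA, n=25)
n_needed = required_sample_size(SIGMA, margin=0.05)
print(f"margin of error with n=25: {moe:.2f}")
print(f"sample size for a margin of .05: {n_needed}")
```

Because the margin shrinks with √n, cutting the margin roughly in half requires about four times as many students.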
Confidence Interval for μ, Standard Deviation Estimated
It is very rare for a researcher wishing to estimate the mean of a population to already know its standard deviation. Therefore, the construction of a confidence interval almost always involves the estimation of both μ and σ. When σ is known, the confidence interval is M - zσM ≤ μ ≤ M + zσM.
Whenever the standard deviation is estimated (NOT KNOWN), the t rather than the normal (z) distribution should be used. The confidence interval for μ when σ is estimated is: M - t sM ≤ μ ≤ M + t sM, where M is the sample mean, sM is an estimate of σM (the standard error), and t depends on the degrees of freedom and the level of confidence.
confidence interval on the mean:
More generally, the formula for the 95% confidence interval on the mean is:
Lower limit = M - (t)(sm) Upper limit = M + (t)(sm)
where;
M is the sample mean, t is the t for the confidence level desired (0.95 in the above example), and sm is the estimated standard error of the mean.
A comparison of the t and normal distribution
Figure: the t distribution with 4 df (in blue) compared with the standard normal distribution (in red).
Finding t-values
Find the t-value such that the area under the t distribution to the right of the t-value is 0.2 assuming 10 degrees of freedom. That is, find t0.20 with 10 degrees of freedom.
Upper tail probability p (area under the right side)
Example:
P[t(2) > 2.92] = 0.05
P[-2.92 < t(2) < 2.92] = 0.9
Confidence level (central area): 50% 60% 70% 80% 90% 95% 96% 98% 99% 99.5% 99.8% 99.9%
Upper tail probability p: 0.25 0.2 0.15 0.1 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
df
1 1.000 1.376 1.963 3.078 6.314 12.706 15.895 31.821 63.657 127.32 318.30 636.61
2 0.817 1.061 1.386 1.886 2.920 4.303 4.849 6.965 9.925 14.089 22.327 31.599
3 0.765 0.979 1.250 1.638 2.353 3.182 3.482 4.541 5.841 7.453 10.215 12.924
4 0.741 0.941 1.190 1.533 2.132 2.776 2.999 3.747 4.604 5.598 7.173 8.610
5 0.727 0.920 1.156 1.476 2.015 2.571 2.757 3.365 4.032 4.773 5.893 6.869
6 0.718 0.906 1.134 1.440 1.943 2.447 2.612 3.143 3.707 4.317 5.208 5.959
7 0.711 0.896 1.119 1.415 1.895 2.365 2.517 2.998 3.499 4.029 4.785 5.408
8 0.706 0.889 1.108 1.397 1.860 2.306 2.449 2.896 3.355 3.833 4.501 5.041
9 0.703 0.883 1.100 1.383 1.833 2.262 2.398 2.821 3.250 3.690 4.297 4.781
10 0.700 0.879 1.093 1.372 1.812 2.228 2.359 2.764 3.169 3.581 4.144 4.587
11 0.697 0.876 1.088 1.363 1.796 2.201 2.328 2.718 3.106 3.497 4.025 4.437
12 0.696 0.873 1.083 1.356 1.782 2.179 2.303 2.681 3.055 3.428 3.930 4.318
13 0.694 0.870 1.079 1.350 1.771 2.160 2.282 2.650 3.012 3.372 3.852 4.221
14 0.692 0.868 1.076 1.345 1.761 2.145 2.264 2.624 2.977 3.326 3.787 4.140
15 0.691 0.866 1.074 1.341 1.753 2.131 2.249 2.602 2.947 3.286 3.733 4.073
Abbreviated t table
df 0.95 0.99
2 4.303 9.925
3 3.182 5.841
4 2.776 4.604
5 2.571 4.032
8 2.306 3.355
10 2.228 3.169
20 2.086 2.845
50 2.009 2.678
100 1.984 2.626
Example
Assume that the following five numbers are sampled from a normal distribution: 2, 3, 5, 6, and 9 and that the standard deviation is not known. The first steps are to compute the sample mean and variance: M = 5 and s² = 7.5. The estimated standard error is sM = √(s²/N) = √(7.5/5) = 1.225.
df = N - 1 = 4
From the t table, the value of t for the 95% interval with 4 df is 2.776.
Lower limit = 5 - (2.776)(1.225) = 1.60
Upper limit = 5 + (2.776)(1.225) = 8.40
Example
Suppose a researcher were interested in estimating the mean reading speed (number of words per minute) of high-school graduates and computing the 95% confidence interval. A sample of 6 graduates was taken and the reading speeds were: 200, 240, 300, 410, 450, and 600. For these data, M = 366.6667, sM = 60.9736, df = 6 - 1 = 5, and t = 2.571.
The lower limit is: M - (t)(sM) = 209.904
The upper limit is: M + (t)(sM) = 523.430
The 95% confidence interval is: 209.904 ≤ μ ≤ 523.430
Thus, the researcher can conclude based on the rounded off 95% confidence interval that the mean reading speed of high-school graduates is between 210 and 523.
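Both worked examples (the five numbers and the reading speeds) can be checked with one small function. This is a sketch: the t values are copied from the abbreviated t table in these notes instead of being computed, and `T_95` and `t_interval_95` are illustrative names.

```python
import math
from statistics import mean, variance

# Two-sided 95% t values by degrees of freedom, copied from the abbreviated t table
T_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 8: 2.306, 10: 2.228}

def t_interval_95(data):
    """95% confidence interval for the mean when sigma is estimated from the sample."""
    n = len(data)
    m = mean(data)
    std_error = math.sqrt(variance(data) / n)  # variance() uses the n - 1 divisor
    t = T_95[n - 1]                            # df = N - 1
    return m - t * std_error, m + t * std_error

low_five, high_five = t_interval_95([2, 3, 5, 6, 9])
low_read, high_read = t_interval_95([200, 240, 300, 410, 450, 600])
print(f"five numbers:   {low_five:.2f} to {high_five:.2f}")
print(f"reading speeds: {low_read:.3f} to {high_read:.3f}")
```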
Homework 1
The mean time difference for all 47 subjects is 16.362 seconds and the standard deviation is 7.470 seconds. The standard error of the mean is 1.090.
A t table shows the critical value of t for 47 - 1 = 46 degrees of freedom is 2.013 (for a 95% confidence interval). The confidence interval is computed as follows:
Lower limit = 16.362 - (2.013)(1.090)= 14.17 Upper limit = 16.362 + (2.013)(1.090)= 18.56
Therefore, the interference effect (difference) for the whole population is likely to be between 14.17 and 18.56 seconds.
Homework 2
The pasteurization process reduces the amount of bacteria found in dairy products, such as milk. The following data represent the counts of bacteria in pasteurized milk (in CFU/mL) for a random sample of 12 pasteurized glasses of milk.
Construct a 95% confidence interval for the bacteria count.
NOTE: Each observation is in tens of thousands. So, 9.06 represents 9.06 × 10^4.
Prediction with Regression Analysis
The relationship(s) between values of the response variable and corresponding values of the predictor variable(s) is (are) not deterministic.
Thus the value of y is estimated given the value of x. The estimated value of the dependent variable is denoted ŷ, and the population slope and intercept are usually denoted β1 and β0.
Linear Regression
The idea is to fit a straight line through data points
Linear Regression - Indicates that the relationship(s) between the dependent variable and the independent variable(s) is (are) linear.
Can extend to multiple dimensions
Correlation analysis is applied to independent factors: if X increases, what will Y do (increase, decrease, or perhaps not change at all)?
In regression analysis a unilateral response is assumed: changes in X result in changes in Y, but changes in Y do not result in changes in X.
Figure: Regression Plot of m1 (vertical axis) against vwmkt (horizontal axis)
m1 = 0.0095937 + 0.880436 vwmkt
S = 0.0590370, R-Sq = 31.3%, R-Sq(adj) = 30.8%
Linear regression means a regression that is linear in the parameters
A linear regression can be non-linear in the variables
Example: Y = β0 + β1X²
Some non-linear regression models can be transformed into a linear regression model
(e.g., Y = aX^bZ^c can be transformed into ln Y = ln a + b*ln X + c*ln Z)
Example
Given one variable
Goal: Predict Y
Example: Given Years of Experience, Predict Salary
Questions: When X=10, what is Y?
When X=25, what is Y?
This is known as regression
X (years)   Y (salary, $1,000)
3 30
8 57
9 64
13 72
3 36
6 43
11 59
21 90
1 20
For the example data:
ŷ = 23.2 + 3.5x (slope = 3.5, intercept = 23.2)
For x = 10 years, the prediction of y (salary) is:
23.2 + 35 = 58.2 K dollars/year.
Figure: Linear Regression Example. Linear Regression: Y=3.5*X+23.2 (horizontal axis: Years; vertical axis: Salary)
Y = β0 + β1X
β1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
β0 = ȳ - β1x̄
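A sketch of the least-squares calculation applied to the years/salary rows listed above: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept is ȳ minus slope times x̄. `fit_line` is an illustrative name. With the nine rows shown in the table, the fit comes out near slope 3.46 and intercept 23.5, close to but not identical with the rounded line Y = 3.5*X + 23.2 quoted in the notes.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Rows from the example table: years of experience and salary in $1,000
years = [3, 8, 9, 13, 3, 6, 11, 21, 1]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20]

b0, b1 = fit_line(years, salary)
predicted = b0 + b1 * 10  # predicted salary for 10 years of experience
print(f"salary = {b0:.2f} + {b1:.2f} * years; at 10 years: {predicted:.1f}")
```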
Regression Error
We can also write a regression equation slightly differently: y = a + bx + e
The term e is also called the residual; it is the difference between our estimate of the value of the dependent variable y and the actual value of the dependent variable y.
Unless we have perfect prediction, many of the y values will fall off of the line. The added e in the equation refers to this fact. It would be incorrect to write the equation without the e, because it would suggest that the y scores are completely accounted for by just knowing the slope, x values, and the intercept. Almost always, that is not true. There is some error in prediction, so we need to add an e for error variation into the equation.
The actual values of y can be accounted for by the regression line equation (y=a+bx) plus some degree of error in our prediction (the e's).
r correlation coefficient
The correlation between X and Y is expressed by the correlation coefficient r:
r = Σ(xi - x̄)(yi - ȳ) / √[Σ(xi - x̄)² · Σ(yi - ȳ)²]
xi = data X, x̄ = mean of data X
yi = data Y, ȳ = mean of data Y
-1 ≤ r ≤ 1
r = 1 perfect positive linear correlation between two variables
r = 0 no linear correlation (maybe other correlation)
r = -1 perfect negative linear correlation
Notice that for the perfect correlation, there is a perfect line of points. They do not deviate from that line.
least squares
The principle is to establish a statistical linear relationship between two sets of corresponding data by fitting the data to a straight line by means of the "least squares" technique.
The resulting line takes the general form: y = bx + a
a = intercept of the line with the y-axis
b = slope (tangent)
a = 0, b = 1: perfect positive correlation without bias
a ≠ 0: systematic discrepancy (bias, error) between X and Y
b ≠ 1: proportional response or difference between X and Y
Example
Each point represents one student with a certain grade on the exam, x, and time on the exam, y. The scatter plot reveals that, in general, longer times on the exam tend to be associated with higher grades.
r = 0.64
ID  Grade on Exam (x)  Time on Exam (y)  X-Xavg  Y-Yavg  (X-Xavg)*(Y-Yavg)  (X-Xavg)²
1 88 60 8.6 18.55 159.53 73.96
2 96 53 16.6 11.55 191.73 275.56
3 72 22 -7.4 -19.45 143.93 54.76
4 78 44 -1.4 2.55 -3.57 1.96
5 65 34 -14.4 -7.45 107.28 207.36
6 80 47 0.6 5.55 3.33 0.36
7 77 38 -2.4 -3.45 8.28 5.76
8 83 50 3.6 8.55 30.78 12.96
9 79 51 -0.4 9.55 -3.82 0.16
10 68 35 -11.4 -6.45 73.53 129.96
11 84 46 4.6 4.55 20.93 21.16
12 76 36 -3.4 -5.45 18.53 11.56
13 92 48 12.6 6.55 82.53 158.76
r correlation
The Pearson r can be positive or negative, ranging from -1.0 to 1.0. If the correlation is 1.0, the longer the amount of time spent on the exam, the higher the grade will be, without any exceptions. An r value of -1.0 indicates a perfect negative correlation: without an exception, the longer one spends on the exam, the poorer the grade.
If r=0, there is no linear relationship between the two variables. When r=0, on average, longer time spent on the exam does not result in any higher or lower grade. Most often r is somewhere in between -1.0 and +1.0.
ID  Grade on Exam (x)  x²  Time on Exam (y)  y²  xy
1 88 7744 60 3600 5280
2 96 9216 53 2809 5088
3 72 5184 22 484 1584
4 78 6084 44 1936 3432
5 65 4225 34 1156 2210
6 80 6400 47 2209 3760
7 77 5929 38 1444 2926
8 83 6889 50 2500 4150
9 79 6241 51 2601 4029
10 68 4624 35 1225 2380
11 84 7056 46 2116 3864
12 76 5776 36 1296 2736
13 92 8464 48 2304 4416
14 80 6400 43 1849 3440
15 67 4489 40 1600 2680
16 78 6084 32 1024 2496
17 74 5476 27 729 1998
18 73 5329 41 1681 2993
19 88 7744 39 1521 3432
20 90 8100 43 1849 3870
Σ 1588 127454 829 35933 66764
ID  Grade on Exam (x)  Time on Exam (y)  X-Xavg  Y-Yavg  (X-Xavg)*(Y-Yavg)  (X-Xavg)²  (Y-Yavg)²
1 88 60 8.6 18.55 159.53 73.96 344.1025
2 96 53 16.6 11.55 191.73 275.56 133.4025
3 72 22 -7.4 -19.45 143.93 54.76 378.3025
4 78 44 -1.4 2.55 -3.57 1.96 6.5025
5 65 34 -14.4 -7.45 107.28 207.36 55.5025
6 80 47 0.6 5.55 3.33 0.36 30.8025
7 77 38 -2.4 -3.45 8.28 5.76 11.9025
8 83 50 3.6 8.55 30.78 12.96 73.1025
9 79 51 -0.4 9.55 -3.82 0.16 91.2025
10 68 35 -11.4 -6.45 73.53 129.96 41.6025
11 84 46 4.6 4.55 20.93 21.16 20.7025
12 76 36 -3.4 -5.45 18.53 11.56 29.7025
13 92 48 12.6 6.55 82.53 158.76 42.9025
14 80 43 0.6 1.55 0.93 0.36 2.4025
15 67 40 -12.4 -1.45 17.98 153.76 2.1025
16 78 32 -1.4 -9.45 13.23 1.96 89.3025
17 74 27 -5.4 -14.45 78.03 29.16 208.8025
18 73 41 -6.4 -0.45 2.88 40.96 0.2025
19 88 39 8.6 -2.45 -21.07 73.96 6.0025
20 90 43 10.6 1.55 16.43 112.36 2.4025
Total 1588 829 941.4 1366.8 1570.95
Average 79.4 41.45
r = 0.6424
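The value of r can be reproduced from the 20 (grade, time) pairs in the table. A minimal sketch using the deviation sums; `pearson_r` is an illustrative name.

```python
import math

grades = [88, 96, 72, 78, 65, 80, 77, 83, 79, 68,
          84, 76, 92, 80, 67, 78, 74, 73, 88, 90]
times = [60, 53, 22, 44, 34, 47, 38, 50, 51, 35,
         46, 36, 48, 43, 40, 32, 27, 41, 39, 43]

def pearson_r(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(grades, times)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```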
r² square of the correlation coefficient
r² is the proportion of the sum of squares explained in one-variable regression,
R² is the proportion of the sum of squares explained in multiple regression.
Is an R-Square < 1.00 Good or bad?
This is both a statistical and a philosophical question; It is quite rare, especially in the social sciences, to get an R-square that is really high (e.g., 98%).
The goal is NOT to get the highest R-square per se. Instead, the goal is to develop a model that is both statistically and theoretically sound, creating the best fit with existing data.
Do you want just the best fit, or a model that theoretically/conceptually makes sense? Yes, you might get a good fit with nonsensical explanatory variables. But, this opens you to spurious/intervening relationships. THEREFORE: hard to use model for explanation.
Why might an R-Square be less than 1.00?
• underdetermined model (need more variables)
• nonlinear relationships
• measurement error
• sampling error
• not fully predictable/explainable even with all data available; there is a certain amount of unexplainable chaos/static/randomness in the universe (which may be reassuring)
• the unit of analysis is too aggregated (e.g., you are predicting mean housing values for a city -- you might get better results with predicting individual housing prices, or neighborhood housing prices).
Adjusted R2 (R-square)
What is an "Adjusted" R-Square? The Adjusted R-Square takes into account not only how much of the variation is explained, but also the impact of the degrees of freedom. It "adjusts" for the number of variables used. That is, look at the adjusted R-Square to see how adding another variable to the model both increases the explained variance and lowers the degrees of freedom. Adjusted R2 = 1 - (1 - R2)((n - 1)/(n - k - 1)), where n is the number of observations and k is the number of predictors. As the number of variables in the model increases, the gap between the R-square and the adjusted R-square will increase. This serves as a disincentive to simply throwing a huge number of variables into the model to increase the R-square.
This adjusted value for R-square will be equal to or smaller than the regular R-square. The adjusted R-square adjusts for a bias in R-square. R-square tends to overestimate the variance accounted for compared to an estimate that would be obtained from the population. There are two reasons for the overestimate: a large number of predictors and a small sample size.
So, with a large sample and with few predictors, adjusted R-square should be very similar to the R-square value. Researchers and statisticians differ on whether to use the adjusted R-square. It is probably a good idea to look at it to see how much your R-square might be inflated, especially with a small sample and many predictors.
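The formula is easy to check numerically. The sketch below plugs in R Square = 0.861 with 30 observations and 2 predictors (figures that appear in a regression output later in these notes), then shows how the gap widens when the same R-square is obtained with more predictors. The function name and the 10-predictor case are illustrative.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

adj_two = adjusted_r_squared(0.861, n=30, k=2)    # two predictors
adj_many = adjusted_r_squared(0.861, n=30, k=10)  # same R2, ten predictors (hypothetical)
print(f"k = 2:  adjusted R2 = {adj_two:.3f}")
print(f"k = 10: adjusted R2 = {adj_many:.3f}")
```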
Example
Suppose we have collected the following sample of 6 observations on age and income:
Find the estimated regression line for the sample of six observations we have collected on age and income:
Which is the independent variable and which is the dependent variable for this problem?
Cautions About Simple Linear Regression
• Correlation and regression describe only linear relations
• Correlation and least-squares regression line are not resistant to outliers
• Predictions outside the range of observed data are often inaccurate
• Correlation and regression are powerful tools for describing relationship between two variables, but be aware of their limitations
Multiple Prediction
Regression analysis allows us to use more than one independent variable to predict values of y. Take the fat intake and blood cholesterol level study as an example. If we want to predict cholesterol as accurately as possible, we need to know more about diet than just how much fat intake there is.
On the island of Crete, they consume a lot of olive oil, so their fat intake is high. This, however, seems to have no dramatic effect on cholesterol (at least the bad cholesterol, the LDLs). They also consume very little cholesterol in their diet, which consists more of fish than high cholesterol foods like cheese and beef (hopefully this won't be considered libelous in Texas). So, to improve our prediction of blood cholesterol levels, it would be helpful to add another predictor, dietary cholesterol.
From Bivariate to Multiple regression: what changes?
potentially more explanatory power with more variables.
the ability to control for other variables, and to see the interaction of the various explanatory variables (partial correlations and multicollinearity).
harder to visualize: the fit is no longer a line but a surface in a space of three or more dimensions.
R² is no longer simply the square of the correlation statistic r.
From Two to Three Dimensions
With simple regression (one predictor) we had only the x-axis and the y-axis. Now we need an axis for x1, x2, and y.
Y' = A + b1X1 + b2X2 + ...
where Y' is the predicted score, X1 is the score on the first predictor variable, X2 is the score on the second, etc. The Y intercept is A. The regression coefficients (b1, b2, etc.) are analogous to the slope in simple regression.
If we want to predict these points, we now need a regression plane rather than just a regression line. That looks something like this:
More than one prediction attribute
X1, X2
For example,
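X1='years of experience'
X2='age'
Y='salary'
Y = α + β1x1 + β2x2
Figure: Response Surface over the axes x1, x2 and y (β0 = 10). For the point (xi1, xi2), the observed value yi differs from E(yi) on the surface by the error εi.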
The parameters β0, β1, β2,… , βk are called partial regression coefficients.
β1 represents the change in y corresponding to a unit increase in x1, holding all the other predictors constant.
A similar interpretation can be made for β2, β3, ……, βk
Regression Statistics
Multiple R 0.995
R Square 0.990
Adjusted R Square 0.989
Standard Error 0.008
Observations 30
ANOVA
Source  df  SS  MS  F  Significance F
Regression  4  0.164  0.041  628.372  0.000
Residual  25  0.002  0.000
Total  29  0.165
Variable  Coefficients  Standard Error  t Stat  P-value
Intercept  0.500  0.008  60.294  0.000
Percent of Gross Hhd Income Spent on rent  -0.399  0.016  -24.610  0.000
percent 2-parent families  -0.288  0.015  -19.422  0.000
Police Anti-Drug Program?  -0.004  0.004  -1.238  0.227
Active Tenants Group? (1 = yes; 0 = no)  -0.102  0.004  -28.827  0.000
Controlling also for this new variable, the police anti-drug program is no longer statistically significant, and instead the presence of the active tenants group makes the dramatic difference (and look at that great R square!). However, we are not quite done…
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.928
R Square 0.861
Adjusted R Square 0.850
Standard Error 0.030
Observations 30
ANOVA
df SS MS F Significance F
Regression 2 0.149 0.074 83.484 0.000
Residual 27 0.024 0.001
Total 29 0.173
Variable  Coefficients  Standard Error  t Stat  P-value  BETA
Intercept  0.36582  0.017  20.908  0.000
percent 2-parent families  -0.2565  0.051  -5.017  0.000  -0.362
Active Tenants Group? (1 = yes; 0 = no)  -0.1246  0.011  -11.347  0.000  -0.821
Since the police variable now has a statistically insignificant t-score, we remove it
from the model. (We also remove the income variable, since it also becomes
insignificant after we remove the police variable.) We are left with two independent
variables: percent of 2-parent families and active tenants group.
Stepwise Regression Algorithms
• Backward Elimination
• Forward Selection
• Stepwise Selection
Backward Elimination
1. Fit the model containing all (remaining) predictors.
2. Test each predictor variable, one at a time, for a significant relationship with y.
3. Identify the variable with the largest p-value. If p > α, remove this variable from the model, and return to (1.).
4. Otherwise, stop and use the existing model.
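The backward elimination loop can be sketched as follows. To keep the example self-contained, the model fitting is hidden behind a caller-supplied function `p_values_for` (a hypothetical callback that refits the model and returns one p-value per predictor); in practice a regression package would supply it. The toy p-values are illustrative only, apart from 0.227, which echoes the police variable in the output shown earlier.

```python
def backward_elimination(predictors, p_values_for, alpha=0.05):
    """Drop the least significant predictor until every p-value is <= alpha."""
    model = list(predictors)
    while model:
        p_values = p_values_for(model)                     # step 1-2: refit and test
        worst = max(model, key=lambda name: p_values[name])  # step 3: largest p-value
        if p_values[worst] > alpha:
            model.remove(worst)                            # remove and go back to step 1
        else:
            break                                          # step 4: keep the existing model
    return model

def toy_p_values(model):
    """Stand-in for a real fit; the numbers are made up for illustration."""
    base = {"two_parent": 0.001, "tenants": 0.001, "police": 0.227}
    # "rent" only looks significant while "police" is still in the model
    base["rent"] = 0.001 if "police" in model else 0.40
    return {name: base[name] for name in model}

kept = backward_elimination(["rent", "two_parent", "police", "tenants"], toy_p_values)
print(kept)
```

Forward selection follows the same pattern with the test reversed: start from an empty model and add the candidate with the smallest p-value while p < α.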
Forward Selection
1. Fit all models with one (more) predictor.
2. Test each of these predictor variables, for a significant relationship with y.
3. Identify the variable with the smallest p-value. If p < α, add this variable to the model, and return to (1.).
4. Otherwise, stop and use the existing model.
Stepwise Selection
• The Stepwise Selection method is basically Forward Selection with Backward Elimination added in at every step.
Stepwise Selection
1. Fit all models with one (more) predictor.
2. Test each of these predictor variables, for a significant relationship with y.
3. Identify the variable with the smallest p-value. If p < α, add this variable to the model, and return to (1.).
4. Now, for the model being considered, test each predictor variable, one at a time, for a significant relationship with y.
5. Identify the variable with the largest p-value. If p > α, remove this variable from the model, and return to (1.).
6. Otherwise, stop and use the existing model.
Linear regression
Review
Multiple Regression Models
Chapter Topics
The Multiple Regression Model
Contribution of Individual Independent Variables
Coefficient of Determination
Categorical Explanatory Variables
Transformation of Variables
Violations of Assumptions
Qualitative Dependent Variables
Multiple Regression Models
Diagram: types of multiple regression models
Linear: Linear, Dummy Variable, Interaction
Non-Linear: Polynomial, Square Root, Log, Reciprocal, Exponential
Linear Multiple Regression Model
Additional Assumption for Multiple Regression
No exact linear relation exists between any subset of explanatory variables (that is, no perfect "multicollinearity")
The Multiple Regression Model
Relationship between 1 dependent & 2 or more independent variables is a linear function
Population model: Yi = β0 + β1X1i + β2X2i + … + βpXpi + εi
β0 = population Y-intercept; β1, …, βp = population slopes; εi = random error
Sample model: Yi = b0 + b1X1i + b2X2i + … + bpXpi + ei
Yi = dependent (response) variable for sample; X1i, …, Xpi = independent (explanatory) variables for sample model
Population Multiple Regression Model
Bivariate model
Figure: response plane μYX = β0 + β1X1i + β2X2i over the axes X1, X2 and Y, with intercept β0. For the point (X1i, X2i), the observed Y is Yi = β0 + β1X1i + β2X2i + εi.
Sample Multiple Regression Model
Bivariate model
Figure: fitted response plane Ŷi = b0 + b1X1i + b2X2i over the axes X1, X2 and Y, with intercept b0. For the point (X1i, X2i), the observed Y is Yi = b0 + b1X1i + b2X2i + ei.
Parameter Estimation
Linear Multiple Regression Model
Oil (Gal)  Temp (°F)  Insulation (in)
275.30 40 3
363.80 27 3
164.30 40 10
40.80 73 6
94.30 64 6
230.90 34 6
366.70 9 6
300.60 8 10
237.80 23 10
121.40 63 3
31.40 65 10
203.50 41 6
441.10 21 3
323.00 38 3
52.50 58 10
Multiple Regression Model: Example
Develop a model for estimating heating oil used for a single family home in the month of January based on average temperature (°F) and amount of insulation in inches.
Interpretation of Estimated Coefficients
Slope (bP)
Estimated Y changes by bP for each 1 unit increase in XP, holding all other variables constant (ceteris paribus)
Example: If b1 = -2, then fuel oil usage (Y) is expected to decrease by 2 gallons for each 1 degree increase in temperature (X1), given the inches of insulation (X2)
Y-Intercept (b0)
Average value of Y when all XP = 0
Sample Regression Model: Example
Coefficients
Intercept  562.1510092
X Variable 1  -5.436580588
X Variable 2  -20.01232067
Ŷi = 562.151 - 5.437X1i - 20.012X2i
For each degree increase in temperature, the average amount of heating oil used is decreased by 5.437 gallons, holding insulation constant.
For each increase in one inch of insulation, the use of heating oil is decreased by 20.012 gallons, holding temperature constant.
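A sketch of how these coefficients could be reproduced from the 15 observations in the heating-oil table, assuming ordinary least squares solved through the normal equations (X'X)b = X'y. The helper names `solve_linear_system` and `fit_ols` are illustrative and not from the notes; a statistics package or spreadsheet would normally do this step.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Return [b0, b1, ..., bp] for predictors given as rows of X values."""
    x = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

oil = [275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70, 300.60,
       237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50]
temp = [40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58]
insulation = [3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10]

b0, b1, b2 = fit_ols(list(zip(temp, insulation)), oil)
print(f"oil = {b0:.3f} + ({b1:.3f}) * temp + ({b2:.3f}) * insulation")
```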
Evaluating the Model
Evaluating Multiple Regression Model Steps
Examine variation measures
Test parameter significance
Overall model
Portions of model
Individual coefficients
Variation Measures
Coefficient of Multiple Determination
r²Y.12..P = Explained variation / Total variation = SSR / SST
Proportion of variation in Y 'explained' by all X variables taken together
r² = 0: all the variables taken together do not explain variation in Y
Adjusted Coefficient of Multiple Determination
NOT simply the proportion of variation in Y 'explained' by all X variables taken together
Reflects
Sample size
Number of independent variables
Smaller than r²Y.12..P
Sometimes used to compare models
Simple and Multiple Regression Compared:Example
Two simple regressions:
ABSENCES = α + β1 AUTONOMY
ABSENCES = α + β2 SKILLVARIETY
Multiple Regression:
ABSENCES = α + β1 AUTONOMY + β2 SKILLVARIETY
Overlap in Explanation
SIMPLE REGRESSION: AUTONOMY
Multiple R 0.169171
R Square 0.028619
Adjusted R Square 0.027709
Standard Error 12.443
Observations 1069
ANOVA
Source  df  SS  MS  F  Significance F
Regression  1  4867.198  4867.198  31.43612  2.62392E-08
Residual  1067  165201.7  154.8282
Total  1068  170068.9
SIMPLE REGRESSION: SKILL VARIETY
Multiple R 0.193838
R Square 0.037573
Adjusted R Square 0.036671
Standard Error 12.38552
Observations 1069
ANOVA
Source  df  SS  MS  F  Significance F
Regression  1  6390.011  6390.011  41.6556  1.64882E-10
Residual  1067  163678.9  153.401
Total  1068  170068.9
MULTIPLE REGRESSION
Multiple R 0.231298
R Square 0.053499
Adjusted R Square 0.051723
Standard Error 12.28837
Observations 1069
ANOVA
Source  df  SS  MS  F
Regression  2  9098.483  4549.242  30.1266
Residual  1066  160970.4  151.0041
Total  1068  170068.9
OVERLAP
0.06619206 SUM OF SIMPLE R2
0.05349881 MULTIPLE R2
0.01269325 OVERLAP ATTRIBUTED TO BOTH
11257.2098 SUM OF REGRESSION SUM OF SQUARES
9098.4831 REGRESSION SUM OF SQUARES
2158.72671 OVERLAP
Testing Parameters
Test for Overall Significance: Example Solution
H0: β1 = β2 = … = βp = 0
H1: At least one βi ≠ 0
α = .05
df = 2 and 12
Critical Value(s): F = 3.89
Test Statistic: F = 168.47
Decision: Reject at α = 0.05
Conclusion: There is evidence that at least one independent variable affects Y
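The overall F statistic is the regression mean square divided by the residual mean square. A sketch using the sums of squares from the ANOVA output of the heating-oil model (SSR = 228014.6263, SSE = 8120.6030, p = 2 predictors, n = 15 observations); the function name is illustrative, and the critical value 3.89 is taken from the example rather than computed.

```python
def overall_f(ssr, sse, p, n):
    """F = MSR / MSE with p and (n - p - 1) degrees of freedom."""
    msr = ssr / p
    mse = sse / (n - p - 1)
    return msr / mse

f_stat = overall_f(ssr=228014.6263, sse=8120.6030, p=2, n=15)

F_CRITICAL = 3.89  # upper 5% point of F with 2 and 12 df, as quoted in the example
reject_h0 = f_stat > F_CRITICAL
print(f"F = {f_stat:.2f}, reject H0: {reject_h0}")
```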
Test for Significance: Individual Variables
• Shows if there is a linear relationship between the variable Xi and Y
• Use t test statistic
• Hypotheses:
H0: βi = 0 (No linear relationship)
H1: βi ≠ 0 (Linear relationship between Xi and Y)
t Test Statistic Excel Output: Example
Variable  Coefficients  Standard Error  t Stat
Intercept  562.151009  21.09310433  26.65094
X Variable 1 (Temperature)  -5.4365806  0.336216167  -16.1699
X Variable 2 (Insulation)  -20.012321  2.342505227  -8.54313
The t Stat column holds the t test statistics for X1 (Temperature) and X2 (Insulation):
t = bi / Sbi
t Test: Example Solution
Does temperature have a significant effect on monthly consumption of heating oil? Test at α = 0.05.
H0: β1 = 0
H1: β1 ≠ 0
df = 12
Critical Value(s): t = ±2.1788 (rejection regions of .025 in each tail)
Test Statistic: t = -16.1699
Decision: Reject H0 at α = 0.05
Conclusion: There is evidence of a significant effect of temperature on oil consumption.
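Each t statistic is simply the coefficient divided by its standard error. The sketch below recomputes both t values from the Excel output and compares them with the two-sided critical value 2.1788 quoted for 12 df; the function name is illustrative.

```python
def t_statistic(coefficient, std_error):
    """t = b / S_b for testing H0: beta = 0."""
    return coefficient / std_error

T_CRITICAL = 2.1788  # two-sided 5% critical value for 12 df, as quoted in the example

t_temp = t_statistic(-5.4365806, 0.336216167)  # X1, temperature
t_ins = t_statistic(-20.012321, 2.342505227)   # X2, insulation

for name, t in (("temperature", t_temp), ("insulation", t_ins)):
    decision = "reject H0" if abs(t) > T_CRITICAL else "do not reject H0"
    print(f"{name}: t = {t:.4f} -> {decision}")
```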
Example: Analysis of job earnings
What is the impact of employer tenure
(ERTEN), unemployment (UNEM) and
education (EDU) on job earnings (JEARN)?
Example: Analysis of job earnings
Correlations
Results: Anova
Results
Testing Model Portions
Examines the contribution of a set of X variables to the relationship with Y
Null hypothesis:
Variables in set do not improve significantly the model when all other variables are included
Alternative hypothesis:
At least one variable is significant
Only one-tail test
Requires comparison of two regressions
One regression includes everything
One regression includes everything except the portion to be tested.
Testing Model Portions Test Statistic
Test H0: β1 = β2 = 0 in a 3 variable model
F = {[SSR(X1, X2, X3) - SSR(X3)] / k} / MSE(X1, X2, X3)
SSR(X1, X2, X3) and MSE(X1, X2, X3): from ANOVA section of regression for Ŷi = b0 + b1X1i + b2X2i + b3X3i
SSR(X3): from ANOVA section of regression for Ŷi = b0 + b3X3i
Testing Portions of Model: SSR
Contribution of X1 and X2 given X3 has been included:
SSR(X1 and X2 | X3) = SSR(X1, X2 and X3) - SSR(X3)
SSR(X1, X2 and X3): from ANOVA section of regression for Ŷi = b0 + b1X1i + b2X2i + b3X3i
SSR(X3): from ANOVA section of regression for Ŷi = b0 + b3X3i
Partial F Test For Contribution of Set of X variables
Hypotheses:
H0 : Variables Xi... do not significantly improve the model given all others included
H1 : Variables Xi... significantly improve the model given all others included
Test Statistic:
F = [SSR(Xi... | all others) / k] / MSE
With df = k and (n - p - 1)
k = # of variables tested
Testing Portions of Model: Example
Test at the α = .05 level to determine if the variable of average temperature significantly improves the model given that insulation is included.
H0: X1 does not improve model (X2 included)
H1: X1 does improve model
α = .05, df = 1 and 12
Critical Value = 4.75
ANOVA (For X1 and X2)
Source  SS  MS
Regression  228014.6263  114007.313
Residual  8120.603016  676.716918
Total  236135.2293
ANOVA (For X2)
Source  SS
Regression  51076.47
Residual  185058.8
Total  236135.2
F = SSR(X1 | X2) / MSE = (228,015 - 51,076) / 676.717 = 261.47
Conclusion: Reject H0. X1 does improve model
Do I need to do this for one variable?
• The F test for the inclusion of a single variable after all other variables are included in the model is IDENTICAL to the t test of the slope for that variable
• The only reason to do an F test is to test several variables together.
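Both points can be checked numerically with the heating-oil figures: the partial F for temperature given insulation, and its agreement with the square of the t statistic for the temperature slope. A minimal sketch; `partial_f` is an illustrative name and all inputs are copied from the ANOVA and t outputs in the example.

```python
def partial_f(ssr_full, ssr_reduced, k, mse_full):
    """F = [(SSR_full - SSR_reduced) / k] / MSE_full for k tested variables."""
    return ((ssr_full - ssr_reduced) / k) / mse_full

f_temp = partial_f(
    ssr_full=228014.6263,   # regression SS with X1 and X2
    ssr_reduced=51076.47,   # regression SS with X2 only
    k=1,                    # one variable (temperature) is being tested
    mse_full=676.716918,    # residual mean square of the full model
)
t_temp = -16.1699           # t statistic for the temperature slope

print(f"partial F = {f_temp:.2f}, t squared = {t_temp ** 2:.2f}")
```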
Example: Collinear Variables
20,000 executives in 439 corporations; dependent variable = base pay + bonus

                         Individual simple    Multiple regression:
                         regression: R2       contribution to R2
Company dummies          .33                  .08
Occupational dummies     .52                  .022
Position in hierarchy    .69                  .104
Human capital variables  .28                  .032
Shared                                        .632
TOTAL                                         .87
Backup
Multiple Regression
The value of outcome variable depends on several explanatory variables.
F-test. To judge whether the explanatory variables in the model adequately describe the outcome variable.
t-test. Applies to each individual explanatory variable. A significant t indicates that the explanatory variable has an effect on the outcome variable while controlling for the other X's.
T-ratio. To judge the relative importance of the explanatory variable.
Problem of Multicollinearity
When explanatory variables are correlated there is difficulty in interpreting the effect of explanatory variables on the outcome.
Check by:
Correlation coefficient matrix (see next slide).
F-test significant with insignificant t.
Large changes occur in the regression coefficients when variables are added or deleted.
Variance inflation factor (VIF). A VIF greater than 4 or 5 indicates multicollinearity.
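As a rough illustration of the variance inflation check, the sketch below uses the special case of two explanatory variables, where the factor reduces to 1 / (1 - r²) with r the correlation between them. The data and the names `x1`, `x2`, `pearson_r` and `vif_two_predictors` are invented for this example; with more than two variables, r² would be replaced by the R² from regressing each variable on all the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical explanatory variables (not from the lecture data).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]

r = pearson_r(x1, x2)
vif = vif_two_predictors(x1, x2)
print(f"r = {r:.3f}, VIF = {vif:.2f}")
print("Multicollinearity suspected" if vif > 4 else "VIF below the 4-5 rule of thumb")
```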
Example of a Matrix Plot
This matrix plot comprises several scatter plots to provide visual information as to whether variables are correlated
The arrow points at a scatter plot where two explanatory variables are strongly correlated
Selecting the most Economic Model
The purpose is to find the smallest number of explanatory variables which make the maximum contribution to the outcome.
After excluding variables that may be causing multicollinearity, examine the table of t-ratios in the full model. Those variables with a significant t are included in the sub-set.
In the Analysis of Variance table examine the column headed SEQ SS (sequential sums of squares). Check that the candidate variables are indeed making a sizable contribution to the Regression Sum of Squares.
Stepwise Regression Analysis
Stepwise finds the explanatory variable with the highest R2 to start with. It then checks each of the remaining variables until two variables with highest R2 are found. It then repeats the process until three variables with highest R2 are found, and so on.
The overall R2 gets larger as more variables are added.
Stepwise may be useful in the early exploratory stage of data analysis, but it should not be relied upon for the confirmatory stage.
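The procedure described above can be sketched as a forward-only selection that adds, at each step, the variable giving the highest R². This is a simplified illustration, not the routine of any particular statistics package: it has no significance test and no removal step, the data are invented, and the helper names (`solve`, `r_squared`, `forward_selection`) are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= factor * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def r_squared(columns, y):
    """R2 of the least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(row[a] * row[b] for row in rows) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in rows]
    y_mean = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - sse / sst


def forward_selection(predictors, y, max_vars):
    """Add one variable at a time, always the one that maximises R2."""
    chosen, history = [], []
    remaining = dict(predictors)
    while remaining and len(chosen) < max_vars:
        best = max(
            remaining,
            key=lambda name: r_squared(
                [predictors[c] for c in chosen] + [predictors[name]], y
            ),
        )
        chosen.append(best)
        del remaining[best]
        history.append((best, r_squared([predictors[c] for c in chosen], y)))
    return history


# Hypothetical data: y is close to 2 * a, so "a" should be picked first.
predictors = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [3, 1, 4, 1, 5, 9],
    "c": [1, 0, 1, 0, 1, 0],
}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

history = forward_selection(predictors, y, max_vars=3)
for name, r2 in history:
    print(f"added {name}: R2 = {r2:.4f}")
```

The printed R² never decreases as variables are added, which is why R² alone cannot decide when to stop.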
Is the Model Adequate?
Judged by the following:
R2 value. Increase in R2 on adding another variable gives a useful hint
Adjusted R2 is a more sensitive measure.
Smallest value of s (standard deviation).
C-p statistic (Cp). Choose a model with a small Cp whose value is close to p (the number of parameters in the model).
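Several of these measures can be computed directly from an ANOVA table. The sketch below uses the sums of squares of the two-variable heating-oil model shown earlier in these notes. The sample size n = 15 is an assumption inferred from the 12 residual degrees of freedom with two explanatory variables, and the variable names are illustrative.

```python
import math

# Sums of squares from the ANOVA table of the model with X1 and X2.
ss_regression = 228014.6263
ss_residual = 8120.603016
ss_total = 236135.2293

p = 2    # number of explanatory variables
n = 15   # assumption: residual df = n - p - 1 = 12 implies n = 15

r2 = ss_regression / ss_total
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
s = math.sqrt(ss_residual / (n - p - 1))   # standard deviation about the regression

print(f"R2 = {r2:.4f}")
print(f"Adjusted R2 = {adj_r2:.4f}")
print(f"s = {s:.2f}")
```

Adjusted R² is always a little below R² because it penalises each added variable through the (n - 1)/(n - p - 1) factor.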
Confidence Interval Estimate For The Slope
Provide the 95% confidence interval for the population slope β1 (the effect of temperature on oil consumption).
$b_1 \pm t_{n-p-1} S_{b_1}$

               Coefficients    Lower 95%       Upper 95%
Intercept      562.151009      516.1930837     608.108935
X Variable 1   -5.4365806      -6.169132673    -4.7040285
X Variable 2   -20.012321      -25.11620102    -14.90844

$-6.169 \le \beta_1 \le -4.704$
The average consumption of oil is reduced by between 4.7 and 6.17 gallons for each 1°F increase in temperature, in houses with the same insulation.