Topic 10: Simple Linear Regression Analysis

INTRODUCTION
In Topic 8, we learned a method to visually check the relationship between two variables using the two-way scatter plot, and another method to measure the strength of this relationship using correlation. If a relationship exists, we would like to know the meaning of that relationship. Once we have expressed the relationship as an equation, we will be able to predict the value of one variable given the value of the other. The statistical method used to examine a linear relationship between two variables is called Simple Linear Regression. Only quantitative variables are considered in this case.
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain regression concepts;
2. Construct a simple linear regression model and identify the assumptions made;
3. Prove mathematically, using the least squares estimation method, how a regression model is constructed;
4. Identify inferential concepts for the regression parameters;
5. Use appropriate methods to evaluate data suitability in fitting a regression model;
6. Use regression analysis for prediction and variable estimation.
Regression analysis deals with finding the best relationship between a dependent variable Y and an independent variable X, quantifying the strength of that relationship, and using methods that allow prediction of the response values (Y) given values of the regressor (X). The value of the y variable can only be determined if the values of the independent variables (denoted by x1, x2, ..., xk, where k is the number of independent variables) are known.
Examples of dependent variables are the amount of electrical consumption in a house, the profit made by a company, students' final examination grades, the selling price of a house and so on. These are considered dependent variables because their values depend on other variables. For example, the amount of electrical consumption in a house depends on the outside temperature during that day. If the temperature for the day were high, the occupants of the house would most probably turn on their air conditioner or fan to cool themselves down. Hence, we can say that temperature is an independent variable since it is a factor that influences the amount of electrical consumption in a house. Another possible variable is the number of electrical appliances in a house: the more it has, the greater the amount of electrical consumption.
Regression analysis is used to determine the mathematical relationship between these variables through a linear equation termed the regression model. From the model, we can predict the y value for a given value of x.
SELF-CHECK
Try to think of the independent variables for the following dependent variables:
A simple linear regression model involves only one independent variable, that is, the k = 1 case. Multiple linear regression is employed for cases involving more than one independent variable (k > 1). A simple linear regression model is written as

y = β₀ + β₁x + ε (10.1)
Here ε refers to the random variable for errors (residuals). Errors exist because the relationship between variables is imperfect and measurements are rarely made without error. To better understand errors, let us look at the following example:
A property development manager would like to estimate the selling price of each house that will be built. He knows that the cost of building a house is RM90 per square foot and the land price is RM25,000 for an area of 4,500 square feet. Hence, the manager can estimate the selling price using the equation below:

y = 25,000 + 90x (10.2)
where y = selling price and x = house size in square feet. If the house is 2,000
square feet, the price would be RM205,000, that is
y = 25,000 + 90(2,000) = 205,000
However, this is only an estimated price; the actual price (based on observation) would be between RM180,000 and RM250,000. For this reason, to reflect the actual situation, another simple linear regression model replaces the previous one, that is:

y = 25,000 + 90x + ε

where ε is a random variable for errors representing all other variables which are not considered in equation (10.2). In other words, the selling price for the same house size will also differ due to other factors such as location, number of bedrooms, number of toilets and other unknown factors.
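The deterministic part of the example can be sketched as a small helper; the function name is mine, not from the text, and ε is the random component the model adds on top of this point estimate.

```python
# A sketch of the deterministic part of the manager's model, y = 25,000 + 90x
# (equation 10.2). The function name is illustrative, not from the text.
def predict_price(size_sqft: float) -> float:
    """Selling price: RM25,000 land component plus RM90 per square foot."""
    return 25_000 + 90 * size_sqft

print(predict_price(2_000))  # 205000, as in the worked example
```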
The simple linear regression model y = β₀ + β₁x + ε is a population model, and the regression coefficients β₀ and β₁ are population parameters. It is difficult to obtain these population parameter values, so sample data is collected to estimate them. The estimation model is as shown below:

ŷ = β̂₀ + β̂₁x (10.3)

Here, ŷ is the predicted (fitted) value of y, β̂₀ is the estimate of the population parameter β₀ and β̂₁ is the estimate of the population parameter β₁. The estimation model (10.3) is a linear equation with the parameter β̂₁ as the regression slope and the parameter β̂₀ as the y-intercept, which is the value of y when x is zero (refer to Figure 10.1). However, in most cases the y value at x = 0 does not carry any significant meaning, and at times x = 0 cannot even occur. The slope of a straight line is a fixed value that describes the change (increase or decrease) in the y value for a one-unit change in the x value.
Figure 10.1: Estimation model
Errors (refer to Figure 10.1) are obtained from the difference between the observed y values and the fitted y values. They are denoted by εᵢ for i = 1, 2, ..., n, and the formula is:

εᵢ = yᵢ − ŷᵢ (10.4)

The residual εᵢ is a random variable. To determine whether a calculated simple linear regression is a good estimate for the population, we need to ensure that the random variable εᵢ satisfies a few conditions. The assumptions made on the random variable εᵢ are:

(a) εᵢ is normally distributed; that is, εᵢ ~ N(0, σ²), i = 1, 2, ..., n;
(b) the mean of εᵢ is zero; that is, E(εᵢ) = 0, i = 1, 2, ..., n;
(c) the standard deviation of εᵢ is σ; that is, σ(εᵢ) = σ, i = 1, 2, ..., n, a fixed value;
(d) εᵢ for any y value is independent of εᵢ for other values of y.
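A minimal simulation sketch can make the assumptions concrete: each simulated y is the deterministic line plus an independent N(0, σ²) error with the same fixed σ for every x. All parameter values here are illustrative, not from the text.

```python
# Simulating y = b0 + b1*x + eps with errors satisfying assumptions (a)-(d):
# independent draws from N(0, sigma^2) with the same fixed sigma for every x.
# b0, b1 and sigma are illustrative values, not from the text.
import random

random.seed(42)
b0, b1, sigma = 2.0, 0.5, 1.0
xs = [float(x) for x in range(1, 21)]
ys = [b0 + b1 * x + random.gauss(0.0, sigma) for x in xs]

# The residuals around the true line should average out near zero (assumption (b)).
eps = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(round(sum(eps) / len(eps), 3))
```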
Assumption (a) is made to facilitate the inferential procedures (hypothesis tests and confidence intervals) on the significance of the relationship between x and y, as displayed by the fitted line. Assumptions (b) and (c) refer to the linearity of a regression model. Suppose we have the population regression model below:

y = β₀ + β₁x + ε (10.5)

For each x value, y is normally distributed with mean

E(y) = β₀ + β₁x (10.6)

and standard deviation

σ(y) = σ. (10.7)

Observe from equation (10.6) that the mean E(y) depends on x, but the standard deviation does not depend on x. This is because σ is fixed for all x values. The visual display of a simple linear regression is shown in Figure 10.2 below.
When the straight line fails to capture all the data points (x, y) on the graph, what must we do to obtain the best straight line? This best straight line refers to the fitted line drawn in the two-way scatter plot that best represents the relationship between the two variables. The fitted line is a straight line that is close to the points (x, y), chosen so that the errors between the estimated points on the line and the actual observed points are minimised. However, the error sum Σεᵢ does not represent the distance between the actual and fitted points. Let us look at an example to show why Σ(yᵢ − ŷᵢ) is not suitable to represent the distance between the actual and fitted points.
EXERCISE 10.1
Given the regression equation ŷ = –12.84 + 36.18x, state the values of β̂₀ and β̂₁ and explain both values. Next, calculate the residuals using the following data:
With reference to Figures 10.3(a) and 10.3(b), we can see that the positions of the two data sets [data (a) and (b)] are different. The total error is zero for both data (a) and data (b), and this always holds. This would suggest that the data points of (a) and (b) lie equally far from the regression line. However, from both graphs in Figure 10.3, we can see that this is not true. The data points differ in their positions relative to the regression line: the points of data (b) are closer to the regression line than the points of data (a). Hence, Σεᵢ is not suitable as a selection criterion.
So, how can we solve this problem? It can be solved if we square each error before summing. The following table shows the values of Σ(yᵢ − ŷᵢ)² for data (a) and (b).

Data (a): εᵢ² = (yᵢ − ŷᵢ)²      Data (b): εᵢ² = (yᵢ − ŷᵢ)²
(8 – 6)² = 4                     (7 – 6)² = 1
(1 – 5)² = 16                    (6 – 5)² = 1
(6 – 4)² = 4                     (2 – 4)² = 4
Σεᵢ² = 24                        Σεᵢ² = 6
Based on the Σ(yᵢ − ŷᵢ)² values for both data sets, the total sum of squares for data (b) is smaller than that for data (a). This confirms that the points of data (b) are nearer to the regression line, and that this line is the best fitted line. This method of obtaining the best fitted line by minimising the sum of squared errors is known as the least squares method.
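The comparison above can be checked numerically; this sketch recomputes the residual sums and the sums of squared residuals for both data sets, showing that the plain sums are both zero while the squared sums (24 vs 6) distinguish them.

```python
# Recomputing the comparison between data (a) and data (b): the plain residual
# sums are both zero, but the sums of squared residuals (24 vs 6) differ.
observed = {"a": [8, 1, 6], "b": [7, 6, 2]}
fitted = [6, 5, 4]  # fitted values on the regression line, same for both sets

results = {}
for name, obs in observed.items():
    resid = [y - f for y, f in zip(obs, fitted)]
    results[name] = (sum(resid), sum(e * e for e in resid))

print(results)  # {'a': (0, 24), 'b': (0, 6)}
```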
To fit the regression line, we need to obtain the estimates of the regression coefficients β₀ and β₁. Using the least squares method, the formulas for the regression coefficients are:

β̂₁ = (Σxᵢyᵢ − n·x̄·ȳ) / (Σxᵢ² − n·x̄²)

β̂₀ = ȳ − β̂₁x̄
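The least squares formulas can be sketched directly in code. The sample points below are illustrative and chosen to lie exactly on y = 3 + 2x, so the estimates are easy to check by eye.

```python
# A direct implementation of the least squares estimates
#   b1 = (sum(x*y) - n*xbar*ybar) / (sum(x^2) - n*xbar^2),  b0 = ybar - b1*xbar.
# The sample data is illustrative only.
def least_squares(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
         (sum(x * x for x in xs) - n * xbar ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

# Points lying exactly on y = 3 + 2x recover the coefficients.
b0, b1 = least_squares([1, 2, 3, 4], [5, 7, 9, 11])
print(b0, b1)  # 3.0 2.0
```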
We can perform either a one-sided test (β₁ > 0 or β₁ < 0) or a two-sided test (β₁ ≠ 0) to determine whether there is enough evidence to conclude that a linear relationship exists (that is, that the population β₁ is not zero). Hence, we test the hypotheses set out below.
s(β̂₁) is the standard deviation of β̂₁. The formula for the standard deviation of β̂₁ is:

s(β̂₁) = s / √(Σxᵢ² − n·x̄²)

where

s² = (Σyᵢ² − β̂₀Σyᵢ − β̂₁Σxᵢyᵢ) / (n − 2)

Apart from hypothesis testing, we can also construct a confidence interval for β₁. A confidence interval provides a range that contains the value of the population parameter at a certain α level. Based on the (two-sided) T test statistic that follows the t distribution with n − 2 degrees of freedom, we can construct a (1 − α)100% confidence interval as below:

β̂₁ − t(α/2, n−2)·s(β̂₁) ≤ β₁ ≤ β̂₁ + t(α/2, n−2)·s(β̂₁)
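Putting the pieces together, here is a sketch of the interval computation. The data values are illustrative, and the critical value t(0.025, 2) = 4.303 is taken from a t-table rather than computed.

```python
# Sketch of the (1 - alpha)100% confidence interval for b1:
#   b1_hat +/- t(alpha/2, n-2) * s(b1_hat),
# with s(b1_hat) = s / sqrt(sum(x^2) - n*xbar^2) and
#      s^2 = (sum(y^2) - b0_hat*sum(y) - b1_hat*sum(x*y)) / (n - 2).
# Data and the t critical value are illustrative.
import math

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.1, 4.3, 5.8, 8.2]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxy = sum(x * y for x, y in zip(xs, ys))
b1 = (sxy - n * xbar * ybar) / (sum(x * x for x in xs) - n * xbar ** 2)
b0 = ybar - b1 * xbar
s2 = (sum(y * y for y in ys) - b0 * sum(ys) - b1 * sxy) / (n - 2)
se_b1 = math.sqrt(s2) / math.sqrt(sum(x * x for x in xs) - n * xbar ** 2)

t_crit = 4.303  # t(0.025, n-2) for n = 4, from a t-table
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(b1, 3), round(lower, 3), round(upper, 3))
```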
Worked Example 10.2
Based on the data in Example 10.1, show that at the 0.05 significance level there is a significant linear relationship between x and y.

Hypotheses:
H₀: β₁ = 0
H₁: 1. β₁ > 0
    2. β₁ < 0
    3. β₁ ≠ 0

Test statistic:
T = (β̂₁ − β₁) / s(β̂₁) = β̂₁ / s(β̂₁)

Test result: T follows the t distribution with v = n − 2 degrees of freedom at significance level α.
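A sketch of the resulting decision rule for the two-sided alternative; the estimate, its standard deviation and the critical value below are illustrative numbers, not from the text.

```python
# Sketch of the two-sided slope test: T = b1_hat / s(b1_hat), rejecting
# H0: b1 = 0 when |T| exceeds t(alpha/2, n-2). All numbers are illustrative;
# the critical value t(0.025, 10) = 2.228 is taken from a t-table.
b1_hat = 1.98   # hypothetical estimated slope
se_b1 = 0.45    # hypothetical standard deviation s(b1_hat)
t_crit = 2.228  # t(0.025, n-2) for n = 12

T = b1_hat / se_b1
reject = abs(T) > t_crit
print(round(T, 2), reject)  # 4.4 True
```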
Transformation is important if the regression model has a non-linear form. The linearity of any model can be verified by drawing a two-way scatter plot; a linear regression model will display a linear function, which is in straight-line form. A common transformation function is the logarithm or the inverse, applied either to x or to y. The following are a few examples of transformations that change some non-linear functions into their linear form.
Table 10.1: Some Transformations

Functional Form Relating y to x     Transformation             Linear Regression Model Form
Exponent:   y = β₀e^(β₁x)           y* = ln y                  y* = ln β₀ + β₁x
Power:      y = β₀x^(β₁)            y* = log y; x* = log x     y* = log β₀ + β₁x*
Inverse:    y = β₀ + β₁(1/x)        x* = 1/x                   y = β₀ + β₁x*
Hyperbolic: y = x/(β₀ + β₁x)        y* = 1/y; x* = 1/x         y* = β₀x* + β₁
A two-way scatter plot is very useful to ascertain whether a model has a linear or non-linear form. Hence, it is good to know the shapes of the Exponential, Power, Inverse and Hyperbolic functions (refer to Figure 10.8). Observe Figures 10.1 and 10.8 for the chosen transformations.
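The exponent row of Table 10.1 can be checked with a short sketch: data generated exactly from y = β₀e^(β₁x) becomes a straight line after taking y* = ln y, so an ordinary least squares fit on (x, ln y) recovers β₁ and ln β₀. The parameter values are illustrative.

```python
# Illustrating the exponent-row transformation from Table 10.1: data from
# y = b0 * exp(b1 * x) becomes linear after y* = ln y, so a least squares fit
# on (x, ln y) recovers ln b0 and b1. Values are illustrative.
import math

b0, b1 = 2.0, 0.3
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [b0 * math.exp(b1 * x) for x in xs]   # exact exponential data
ystar = [math.log(y) for y in ys]          # y* = ln y is now linear in x

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ystar) / n
slope = (sum(x * y for x, y in zip(xs, ystar)) - n * xbar * ybar) / \
        (sum(x * x for x in xs) - n * xbar ** 2)
intercept = ybar - slope * xbar  # estimates ln b0
print(round(slope, 6), round(intercept, 6))
```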
EXERCISE 10.6
Draw a two-way scatter plot for each of the following regression models, then perform a transformation on each model to obtain its linear regression model.
One of the reasons to build a linear regression model is to predict variable values at future x values. For example, refer to the property development manager's problem of estimating the selling price (in RM) of each house built (refer to Section 10.2). We use the regression model

ŷ = 25,000 + 90x,   x_a ≤ x ≤ x_b (10.11)

where y = selling price and x = house size (in square feet). The x values lie between x_a and x_b. If we would like to predict the selling price of a house whose built-up area is 2,000 square feet, where 2,000 > x_a, we can use the regression model with x = 2,000. Based on the regression equation, the manager can predict that the selling price of each 2,000-square-foot house is RM205,000.
However, this selling price is a point estimate, and it does not tell us how close that value is to the actual selling price. In other words, is the estimated value close to the actual value or very different? This relates to the reliability of the prediction. To obtain information on the position of estimated values relative to actual values, we need to use intervals. There are two types of interval used: the prediction interval for an individual value of the dependent variable y, and the estimation interval for the estimated mean value of y.
10.6.1 Prediction Interval for an Individual Value of y
The prediction interval is used to predict a particular value of the dependent variable y, given a specific value of the independent variable x, even when this x value is outside the range of observed x values, that is, x < x_a or x > x_b. The term "prediction interval" is used rather than "confidence interval" because a population parameter is not being estimated in this case; instead, the response or performance of a single individual is being predicted.
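As a sketch of how such an interval is computed in practice, the code below uses the standard textbook prediction-interval formula ŷ₀ ± t(α/2, n−2)·s·√(1 + 1/n + (x₀ − x̄)²/Sxx), where Sxx = Σ(xᵢ − x̄)². This formula, the data values and the t critical value are assumptions for illustration, not taken from the text above.

```python
# Sketch of the standard prediction interval for an individual y at x0:
#   yhat0 +/- t(alpha/2, n-2) * s * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx).
# The formula is the usual textbook one; data and t(0.025, 2) = 4.303 are
# illustrative values.
import math

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.1, 4.3, 5.8, 8.2]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

x0 = 2.5
yhat0 = b0 + b1 * x0
half = 4.303 * s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
print(round(yhat0, 3), round(yhat0 - half, 3), round(yhat0 + half, 3))
```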
ACTIVITY 10.4
What are other situations that require prediction or estimation?