(2) Regression

Apr 10, 2018

Sanjeev Nawani
Transcript
  • 8/8/2019 (2) Regression

    Graphical and Numerical Description of Interval/Ratio Data: Scatter Plot and Linear Regression.

    DR. TAREK TAWFIK

    9/15/2010 Dr Tarek Amin 2

    Rationale

    The data for the two variables under investigation have been collected at the interval/ratio level, which usually means a large number of values. Cross tabulations are therefore not a convenient means of describing the distribution. The equivalent technique is the use of a scatter plot.

    Scatter plot

    A scatter plot displays the joint distribution for two continuous variables.

    Coordinates on a scatter plot indicate the value each case takes for each of the two variables.

    The relation between unemployment and crime rate

    City (n = 5)   Unemployment rate, X (%)   Crime rate, Y
    A              25                         17
    B              13                         15
    C              5                          10
    D              10                         5
    E              2                          4

    X (unemployment rate) is the independent variable; Y (crime rate) is the dependent variable.

    Scatter plot

    What can this graph tell us?

    [Figure: scatter plot "Relation between unemployment and crime rate" for cities a to e, with unemployment rate (%) on the horizontal axis, crime rate on the vertical axis, and a linear trend line.]

    Linear Regression

    Regression analysis is simply the task of fitting a line through a scatter plot of cases that best fits the data.

    Any line can be expressed in a mathematical formula: Y = a + bX

    Where:

    Y is the dependent variable.

    X is the independent variable.

    a is the Y-intercept (the value of Y when X is zero).

    b is the slope of the regression line; its sign indicates a negative or positive association.

    The line formula

    The previous formula shows that a line is defined by two factors:

    The starting point along the vertical axis, a.

    The slope of the line from this point, b.

    The sign of b indicates the direction of the slope: positive, negative, or zero when no association exists between the two variables.

    Scatter plots and Correlations

    [Figure: three lines exhibiting positive, negative and no correlation.]

    Identifying the regression line

    To identify the line that best fits the scatter plot we combine two characteristics:

    The point of origin along the Y-axis. On its own this is not enough to distinguish the line from the multitude of lines that can start from the same point.

    The slope of the line. On its own this does not distinguish the line from all the other lines with the same slope that could occupy the space.

    Criteria to identify the best fit line

    If we specify both the point of origin on the Y-axis and the slope of the line from that point, we are able to identify uniquely any line within the space, including the line of best fit.

    [Figure: straight lines with the same value for a but different values for b.]

    [Figure: straight lines with the same value for b but different values for a.]

    Unemployment and crime rate (Y = 5 + 0.6X)

    The value for a (5) is the point on the Y-axis where the line begins; this is the number of crimes we expect to find in a city with an unemployment rate of zero.

    The + sign means that the line has a positive slope, which indicates a positive correlation between these two variables.

    The value of 0.6 for b is the slope, or coefficient, of the regression line: by how much crime will increase if unemployment increases by one percentage point.

    The slope of any line, b = rise/run

    [Figure: "Relation between unemployment and crime rate", showing the line Y = 5 + 0.6X with unemployment rate (%) on the horizontal axis and crime rate on the vertical axis. A rise of 3 over a run of 5 gives b = 3/5 = 0.6.]
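
    The reading of a and b above can be checked with a few lines of Python. This is only a sketch: the names line, rise and run are chosen for illustration and do not come from the slides.

```python
# Illustrative line from the slide: Y = 5 + 0.6X.
# The function and variable names are assumptions made for this sketch.

def line(x, a=5.0, b=0.6):
    """Return the value of Y on the line Y = a + bX."""
    return a + b * x

rise, run = 3, 5        # read off the plot
slope = rise / run      # b = rise / run

intercept = line(0)                      # value of Y when X is zero
change_per_point = line(11) - line(10)   # change in Y for a one-point rise in X

print(f"slope = {slope}, intercept = {intercept}")
print(f"change in Y per unit of X = {change_per_point:.1f}")
```

    Evaluating the line at X = 0 returns the intercept a, and moving X by one unit changes Y by b, which is the interpretation given on the slide.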

    THE RESIDUAL ERROR

    Is the difference between the observed value of the dependent variable (crime rate) and the value of the dependent variable predicted by a regression line.

    In general no straight line will pass through all the points in a scatter plot. In fact, a good line might not touch any of the points: there will be a gap between each point and the regression line. Unless a point falls exactly on the line there will be a residual value.

    Residual error

    [Figure: the line Y = 5 + 0.6X with unemployment rate (%) on the horizontal axis and crime rate on the vertical axis. At X = 10, City D has an actual value of 5 and an expected value of 11, so the error is -6.]

    For City D, with an unemployment rate of 10%, the line predicts that the number of crimes will be Y = 5 + 0.6X = 5 + 0.6(10) = 11.

    Instead there were only 5 crimes, so the error is:

    e = Y_actual - Y_expected = 5 - 11 = -6
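
    The same subtraction can be repeated for every city. The sketch below assumes the five-city data from the earlier table and the illustrative line Y = 5 + 0.6X; the names cities, expected and residuals are hypothetical.

```python
# (unemployment rate X in %, crime rate Y) for the five cities on the slides.
cities = {"A": (25, 17), "B": (13, 15), "C": (5, 10), "D": (10, 5), "E": (2, 4)}

def expected(x, a=5.0, b=0.6):
    """Crime rate predicted by the illustrative line Y = a + bX."""
    return a + b * x

# Residual = actual value minus expected value, city by city.
residuals = {city: y - expected(x) for city, (x, y) in cities.items()}

for city, e in residuals.items():
    print(f"City {city}: residual = {e:+.1f}")
```

    A negative residual means the city lies below the line (fewer crimes than expected) and a positive residual means it lies above.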

    Ordinary least squares (OLS) regression

    The best line is the one that makes the residuals as small as possible (minimizing residuals).

    Ordinary least squares regression is a rule that tells us to draw the line through a scatter plot that minimizes the sum of the squared residuals.

    The OLS regression line must pass through the point whose coordinates are the means of the independent and dependent variables, (X̄, Ȳ).
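
    Both properties can be checked numerically on the five-city data. The following is a minimal sketch, not the author's procedure: it fits the OLS line from deviations about the means, compares its sum of squared residuals with that of the illustrative line Y = 5 + 0.6X and of a few nearby lines, and confirms that the fitted line passes through the point of means. All names are assumptions.

```python
x = [25, 13, 5, 10, 2]   # unemployment rate (%)
y = [17, 15, 10, 5, 4]   # crime rate
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

def ssr(a, b):
    """Sum of squared residuals for the line Y = a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# OLS slope and intercept from deviations about the means.
b_ols = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
a_ols = mean_y - b_ols * mean_x

ssr_ols = ssr(a_ols, b_ols)
ssr_illustrative = ssr(5.0, 0.6)

# Any small shift of the intercept or slope should make the fit worse.
nearby = [ssr(a_ols + da, b_ols + db)
          for da in (-0.5, 0, 0.5) for db in (-0.05, 0, 0.05)
          if (da, db) != (0, 0)]

print(f"SSR of OLS line: {ssr_ols:.2f}")
print(f"SSR of Y = 5 + 0.6X: {ssr_illustrative:.2f}")
print("OLS line at mean X equals mean Y:",
      abs(a_ols + b_ols * mean_x - mean_y) < 1e-9)
```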

    Formula for the slope of the regression line, b

    $$b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}$$

    Easier formula

    $$b = \frac{n \sum_i X_i Y_i - \sum_i X_i \sum_i Y_i}{n \sum_i X_i^2 - \left(\sum_i X_i\right)^2}$$

    Calculations

    City   Unemployment rate X   Crime rate Y   Xi²    Yi²    XiYi
    A      25                    17             625    289    425
    B      13                    15             169    225    195
    C      5                     10             25     100    50
    D      10                    5              100    25     50
    E      2                     4              4      16     8
    Sum    ΣX = 55               ΣY = 51        923    655    728
    Mean   11                    10.2

    b = [5(728) - (55)(51)] / [5(923) - (55)²] = 835/1590 = +0.53

    This is called the regression coefficient.
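
    The column totals and the slope can be reproduced in a few lines. This sketch assumes the five-city data as tabulated on the slides; it computes b with the computational formula and, as a cross-check, from deviations about the means. Variable names are illustrative only.

```python
x = [25, 13, 5, 10, 2]   # unemployment rate (%), cities A to E
y = [17, 15, 10, 5, 4]   # crime rate, cities A to E
n = len(x)

# Column totals from the calculation table.
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Computational ("easier") formula for the slope.
b_shortcut = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Cross-check: the same slope from deviations about the means.
mean_x, mean_y = sum_x / n, sum_y / n
b_deviation = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
               / sum((xi - mean_x) ** 2 for xi in x))

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(f"b = {b_shortcut:.2f}")
```

    The two formulas are algebraically equivalent, so they agree up to floating-point rounding.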

    The regression coefficient

    Indicates by how many units the dependent variable will change, given a one-unit change in the independent variable.

    An increase of 1% in the unemployment rate is associated with an increase of 0.53 in the crime rate.

    Prediction (determination of Y)

    $$a = \bar{Y} - b\bar{X}$$

    a = 10.2 - 0.53(11) = 4.4

    The line of best fit: Y = 4.4 + 0.53X

    $$Y = a + bX$$

    If we have another city with an unemployment rate of 18%, what is the best guess for its crime rate?

    Y = 4.4 + 0.53(18) = 13.9
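
    As a sketch of the same prediction step, the helper below (a hypothetical fit_line function, not something defined in the slides) computes b and a from the five-city data and then evaluates the line at X = 18. The slides round b to 0.53 before computing a; the code keeps full precision and arrives at the same rounded values.

```python
def fit_line(x, y):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n   # a = mean(Y) - b * mean(X)
    return a, b

x = [25, 13, 5, 10, 2]   # unemployment rate (%)
y = [17, 15, 10, 5, 4]   # crime rate

a, b = fit_line(x, y)
predicted = a + b * 18   # best guess for a city with 18% unemployment

print(f"Y = {a:.1f} + {b:.2f}X")
print(f"Predicted crime rate at X = 18: {predicted:.1f}")
```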

    Pearson's product moment correlation coefficient

    The value of b does not indicate the strength of the relationship, because units of measurement vary from one situation to another.

    To overcome this we convert b into a standardized measure of correlation called the product moment correlation coefficient, Pearson's r, which ranges from -1 to +1 regardless of the units of measurement.

    Formula and calculation of r

    $$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\left[\sum_i (X_i - \bar{X})^2\right]\left[\sum_i (Y_i - \bar{Y})^2\right]}}$$

    r = 0.81
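
    The value of r can be reproduced from the five-city data. The sketch below uses a hypothetical pearson_r helper built on deviations about the means, and also re-expresses unemployment as a proportion instead of a percentage to illustrate that r does not depend on the units of measurement.

```python
import math

def pearson_r(x, y):
    """Pearson's product moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [25, 13, 5, 10, 2]   # unemployment rate (%)
y = [17, 15, 10, 5, 4]   # crime rate

r = pearson_r(x, y)

# Same data with unemployment expressed as a proportion (0.25 instead of 25).
r_rescaled = pearson_r([xi / 100 for xi in x], y)

print(f"r = {r:.2f}")
print(f"r after rescaling X = {r_rescaled:.2f}")
```

    Rescaling X would change the slope b by a factor of 100, but leaves r unchanged.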

    The Coefficient of Determination

    The predictive ability of the regression line will be affected by how tightly the scores are packed, or how widely they are dispersed, around the line.

    [Figure: two scatter plots labelled a and b.]

    We can predict with greater confidence in a than in b. Therefore we need some measure of how much of the variation in the dependent variable is explained by a regression line: the coefficient of determination, r².

    The coefficient of determination

    It is a PRE (proportional reduction in error) measure of the amount of variation explained by a regression line, and therefore gives a sense of how much confidence we should place in the accuracy of our prediction.

    r² = 0.65 indicates that the least squares regression line explains 65% of the variance of the dependent variable (crime rate), relative to the prediction error left by the horizontal no-relation line, which predicts the mean crime rate for every city.
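
    The PRE reading of r² can be made concrete with the five-city data. In this sketch, e1 is the squared error when every city is predicted by the horizontal line at the mean crime rate, and e2 is the squared error left by the least squares line; the names e1, e2 and r_squared are assumptions made for illustration.

```python
x = [25, 13, 5, 10, 2]   # unemployment rate (%)
y = [17, 15, 10, 5, 4]   # crime rate
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares slope and intercept.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# E1: error when predicting the mean for every city (horizontal line).
e1 = sum((yi - mean_y) ** 2 for yi in y)

# E2: error left over after using the regression line.
e2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Proportional reduction in error.
r_squared = (e1 - e2) / e1

print(f"E1 = {e1:.1f}, E2 = {e2:.1f}, r squared = {r_squared:.2f}")
```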

    Multiple Regression

    A real estate agent wants to explore the factors affecting the selling price of a house. The agent collects data on house size and selling price for 12 houses.

    There is a relationship between the selling price and the house size; does this hold true for the 12 houses?

    House size (squares)   Selling price ($,000)
    20                     260
    15                     240
    20                     245
    13                     210
    18                     230
    14                     242
    28                     295
    16                     235
    24                     287
    20                     252
    23                     270
    25                     275

    Continued

    Conducting a simple regression analysis using the method of OLS produces the following results:

    Y = 157 + 4.88X

    r = 0.92

    r² = 0.85

    There is a positive relationship between house size and selling price.

    For every one-square increase in house size the selling price increases by $4,880.

    The relationship is strong and highly reliable for making predictions.

    The variation in house size does not perfectly predict selling price: the coefficient of determination is high (0.85), but not equal to 1. Therefore other factors also affect the sale price of houses in this sample.
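
    These results can be reproduced from the 12-house table. The following is a sketch under the assumption that sizes and prices pair up row by row as listed on the slides; the variable names are illustrative.

```python
import math

size = [20, 15, 20, 13, 18, 14, 28, 16, 24, 20, 23, 25]                # squares
price = [260, 240, 245, 210, 230, 242, 295, 235, 287, 252, 270, 275]   # $,000

n = len(size)
mean_x, mean_y = sum(size) / n, sum(price) / n

# Sums of squares and cross-products of deviations about the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(size, price))
sxx = sum((x - mean_x) ** 2 for x in size)
syy = sum((y - mean_y) ** 2 for y in price)

b = sxy / sxx                    # slope: $,000 per extra square
a = mean_y - b * mean_x          # intercept
r = sxy / math.sqrt(sxx * syy)   # Pearson's r

print(f"Y = {a:.0f} + {b:.2f}X")
print(f"r = {r:.2f}, r squared = {r ** 2:.2f}")
```

    Because price is recorded in thousands of dollars, a slope of about 4.88 corresponds to roughly $4,880 per additional square.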

    Scatter plot

    [Figure: scatter plot of the house data with the regression line.]

    Not all the data points lie right on the regression line.

    The actual selling price = a + b (house size) + e (error term)

    Why multiple regression?

    - We have three houses in the sample with equal sizes but different selling prices. Why? It may be due to random factors, or to other factors that regularly impact on the prices of houses, such as the age of the house.

    - There may be a negative relation between the age of the house and the selling price. To investigate this we use multiple regression.

    - Multiple regression investigates the relationship between two or more independent variables and a single dependent variable.

    Multi-collinearity

    Multiple regression assumes that the independent variables are independent of each other.

    [Figure: diagram relating House size, Age and Price.]
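
    One simple way to look for multi-collinearity is to correlate the independent variables with each other. The sketch below does this for house size and age, using the 12-house values given in the deck's multiple regression table; it is a rough screening check under that assumption, not a full diagnostic, and the helper name pearson_r is hypothetical.

```python
import math

def pearson_r(u, v):
    """Pearson's r between two equally long lists of numbers."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    suu = sum((ui - mean_u) ** 2 for ui in u)
    svv = sum((vi - mean_v) ** 2 for vi in v)
    return suv / math.sqrt(suu * svv)

size = [20, 15, 20, 13, 18, 14, 28, 16, 24, 20, 23, 25]   # squares
age = [5, 12, 9, 15, 9, 7, 1, 12, 2, 5, 5, 5]             # years

r_size_age = pearson_r(size, age)
print(f"Correlation between house size and age: {r_size_age:.2f}")
```

    A correlation between predictors that is far from zero suggests that the two variables carry overlapping information, so their separate coefficients should be interpreted with care.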

    Multiple regression

    Age in years   House size (squares)   Selling price ($,000)
    5              20                     260
    12             15                     240
    9              20                     245
    15             13                     210
    9              18                     230
    7              14                     242
    1              28                     295
    12             16                     235
    2              24                     287
    5              20                     252
    5              23                     270
    5              25                     275

    Selling price Y = a + b1 (house size) + b2 (age) + e
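
    The slides do not show how b1 and b2 are computed. As one possible sketch, the code below solves the two least squares normal equations on mean-centred data (Cramer's rule for a 2x2 system) for the 12 houses. The function and variable names are assumptions, and no fitted values are quoted here because the deck does not report them.

```python
size = [20, 15, 20, 13, 18, 14, 28, 16, 24, 20, 23, 25]                # squares
age = [5, 12, 9, 15, 9, 7, 1, 12, 2, 5, 5, 5]                          # years
price = [260, 240, 245, 210, 230, 242, 295, 235, 287, 252, 270, 275]   # $,000

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

s11, s22, s12 = cross(size, size), cross(age, age), cross(size, age)
s1y, s2y, syy = cross(size, price), cross(age, price), cross(price, price)

# Normal equations: s11*b1 + s12*b2 = s1y and s12*b1 + s22*b2 = s2y.
det = s11 * s22 - s12 ** 2   # zero only if size and age are perfectly collinear
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = mean(price) - b1 * mean(size) - b2 * mean(age)

residuals = [p - (a + b1 * s + b2 * g) for s, g, p in zip(size, age, price)]
r_squared = 1 - sum(e ** 2 for e in residuals) / syy

print(f"price = {a:.1f} + {b1:.2f}(size) + {b2:.2f}(age)")
print(f"R squared = {r_squared:.2f}")
```

    For more than two independent variables the same normal equations are usually solved with a linear algebra routine instead of by hand.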

    Multiple regression

    To weigh the influence of each independent variable on the dependent variable, we calculate the regression coefficient and the partial correlation for each independent variable on the dependent variable.

    Interpretation of multiple regression

    Regression coefficient: allows us to make predictions for the dependent variable based on the values of the independent variables, in terms of the original units of measurement.

    Standardized coefficient: allows us to distinguish the relative importance of each independent variable in determining the value of the dependent variable.

    R: indicates the strength of the relationship.

    R-squared: indicates the amount of variation in the dependent variable explained by the combination of independent variables, and so whether that combination is a good predictor of the dependent variable.

    Stepwise Regression

    - It allows us to determine which combination of possible independent variables best explains the dependent variable.

    - It does this by adding in and taking out variables from the calculations according to whether each makes a statistically significant change to the value of R-squared.

    Thank you