Page 1:

CS7267 MACHINE LEARNING

LINEAR REGRESSION

Mingon Kang, Ph.D.

Computer Science, Kennesaw State University

* Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington

Page 2:

Correlation (r)

Measures the linear association between two variables

Shows both the nature (direction) and the strength of the relationship between two variables

Correlation lies between -1 and +1

Zero correlation indicates that there is no linear relationship between the variables

Pearson correlation coefficient: the most familiar measure of dependence between two quantities

Page 3:

Correlation (r)

Page 4:

Correlation (r)

corr(X, Y) = cov(X, Y) / (σ_X σ_Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X σ_Y)

where E is the expected value operator, cov(·,·) denotes covariance, corr(·,·) is a widely used alternative notation for the correlation coefficient, and μ and σ are the mean and standard deviation of each variable

Reference: https://en.wikipedia.org/wiki/Correlation_and_dependence
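
This formula maps directly onto NumPy. Below is a minimal sketch (not part of the original slides); the function name pearson_r and the sample data are made up for illustration.

import numpy as np

def pearson_r(x, y):
    """corr(X, Y) = cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # center X
    yc = y - y.mean()   # center Y
    # sum of products over the product of root sums of squares
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
print(pearson_r(x, y))          # close to +1: strong positive linear association
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in gives the same value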

Page 5:

Coefficient of Determination (r²)

The coefficient of determination lies between 0 and 1

Represented by r², the square of the correlation coefficient

A measure of how well the regression line represents the data

If r = 0.922, then r² ≈ 0.85

This means that 85% of the total variation in y can be explained by the linear relationship between x and y in linear regression

The other 15% of the total variation in y remains unexplained
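
The slide's arithmetic can be reproduced in a few lines. This is an illustrative sketch (the data is made up) computing r² both from the residuals and as the square of r; for simple linear regression the two agree.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)      # least-squares slope and intercept
y_hat = b0 + b1 * x                   # predictions from the regression line

ss_res = np.sum((y - y_hat) ** 2)     # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
print(1.0 - ss_res / ss_tot)          # r^2 from residuals

r = np.corrcoef(x, y)[0, 1]
print(r ** 2)                         # same value, as the square of r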

Page 6:

Linear Regression

(Figure: samples with ONE independent variable; samples with TWO independent variables)

Page 7:

Linear Regression

(Figure: samples with ONE independent variable; samples with TWO independent variables)

Page 8:

Linear Regression

How to represent the data as a vector/matrix

We assume a model:

y = b0 + bX + ε,

where b0 and b are the intercept and slope, known as coefficients or parameters, and ε is the error term (typically assumed to satisfy ε ~ N(0, σ²))
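
To make the assumed model concrete, here is a small simulation sketch (not from the slides; all constants are made-up illustration values):

import numpy as np

rng = np.random.default_rng(0)
n = 100
b0, b1, sigma = 1.5, 2.0, 0.5         # true intercept, slope, noise level

x = rng.uniform(0.0, 10.0, size=n)    # one independent variable
eps = rng.normal(0.0, sigma, size=n)  # zero-mean Gaussian error term
y = b0 + b1 * x + eps                 # observed responses

print(np.corrcoef(x, y)[0, 1])        # strong linear association, by construction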

Page 9:

Linear Regression

Simple linear regression

A single independent variable is used to predict the response

Multiple linear regression

Two or more independent variables are used to predict the response

Page 10:

Linear Regression

How to represent the data as a vector/matrix

Include a bias constant (intercept) in the input vector:

X ∈ ℝ^(n×(p+1)), y ∈ ℝ^n, b ∈ ℝ^(p+1), and e ∈ ℝ^n

y = X·b + e

X = {1, x_1, x_2, ..., x_p}, b = {b_0, b_1, b_2, ..., b_p}'

y = {y_1, y_2, ..., y_n}', e = {e_1, e_2, ..., e_n}'

· is the dot product, and ' denotes transpose

Equivalent to y_i = 1·b_0 + x_i1·b_1 + x_i2·b_2 + ... + x_ip·b_p + e_i (1 ≤ i ≤ n)
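
In NumPy this representation amounts to prepending a column of ones to the raw inputs so that the intercept b_0 is absorbed into b. A small sketch with made-up values:

import numpy as np

X_raw = np.array([[2.0, 3.0],
                  [1.0, 5.0],
                  [4.0, 2.0]])            # n = 3 samples, p = 2 variables
n, p = X_raw.shape

X = np.hstack([np.ones((n, 1)), X_raw])   # X in R^(n x (p+1))
b = np.array([0.5, 1.0, -2.0])            # b = {b_0, b_1, b_2}'

y = X @ b                                 # y = X . b (error term omitted here)
print(y)                                  # each y_i = b_0 + x_i1*b_1 + x_i2*b_2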

Page 11:

Linear Regression

Find the optimal coefficient vector b that makes the predictions most similar to the observations:

[ y_1 ]   [ 1  x_11 ... x_1p ] [ b_0 ]   [ e_1 ]
[  ⋮  ] = [ ⋮    ⋮   ⋱    ⋮  ] [  ⋮  ] + [  ⋮  ]
[ y_n ]   [ 1  x_n1 ... x_np ] [ b_p ]   [ e_n ]

Page 12:

Ordinary Least Squares (OLS)

y = Xb + e

Estimates the unknown parameters (b) in the linear regression model

Minimizes the sum of the squares of the differences between the observed responses and those predicted by the linear function:

Sum of squared errors = Σ_{i=1}^{n} (y_i - x_i·b)²

Page 13:

Ordinary Least Squares (OLS)

Page 14:

Optimization

We need to minimize the error:

min_b J(b) = Σ_{i=1}^{n} (y_i - x_i·b)²

To obtain the optimal set of parameters (b), the derivative of the error w.r.t. each parameter must be zero.

Page 15:

Optimization

๐ฝ = ๐žT๐ž = ๐ฒ โˆ’ ๐—๐› โ€ฒ ๐ฒ โˆ’ ๐—๐›= ๐ฒโ€ฒ โˆ’ ๐›โ€ฒ๐—โ€ฒ ๐ฒ โˆ’ ๐—๐›= ๐ฒโ€ฒ๐ฒ โˆ’ ๐ฒโ€ฒ๐—๐› โˆ’ ๐›โ€ฒ๐—โ€ฒ๐ฒ + ๐›โ€ฒ๐—โ€ฒ๐—๐›= ๐ฒโ€ฒ๐ฒ โˆ’ ๐Ÿ๐›โ€ฒ๐—โ€ฒ๐ฒ + ๐›โ€ฒ๐—โ€ฒ๐—๐›

๐œ•๐žโ€ฒ๐ž

๐œ•๐›= โˆ’2๐—โ€ฒ๐ฒ + 2๐—โ€ฒ๐—๐› = 0

๐—โ€ฒ๐— ๐› = ๐—โ€ฒ๐ฒ๐› = (๐—โ€ฒ๐—)โˆ’1๐—โ€ฒ๐ฒ

Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
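
A minimal sketch of the closed-form estimate in NumPy (not course code; the synthetic data is illustrative). Solving the normal equations (X'X)b = X'y is preferable to forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # bias column + predictors
b_true = np.array([1.0, 2.0, -0.5])
y = X @ b_true + rng.normal(0.0, 0.1, size=n)              # responses with small noise

b_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X'X) b = X'y
print(b_hat)                               # close to b_true

b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # same problem, solved via SVD
print(b_lstsq)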

Page 16:

Linear regression for classification

For binary classification:

Encode the class labels as y ∈ {0, 1} or {-1, 1}

Apply OLS

Check which class the prediction is closer to

If class 1 is encoded as +1 and class 2 as -1:

class 1 if f(x) ≥ 0

class 2 if f(x) < 0

Linear models are NOT optimized for classification; logistic regression is the better tool (a small demonstration follows below)
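
The recipe above can be demonstrated end to end. A hedged sketch on synthetic two-class data; the {-1, +1} encoding and the threshold at 0 follow the slide, everything else is illustrative:

import numpy as np

rng = np.random.default_rng(2)
n = 100
X_raw = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),   # class 2 samples
                   rng.normal(+1.0, 1.0, size=(n // 2, 2))])  # class 1 samples
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])            # {-1, +1} labels

X = np.hstack([np.ones((n, 1)), X_raw])  # add the intercept column
b = np.linalg.solve(X.T @ X, X.T @ y)    # fit by OLS

f = X @ b                                # real-valued predictions
labels = np.where(f >= 0, 1, -1)         # class 1 if f(x) >= 0, else class 2
print((labels == y).mean())              # training accuracy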

Page 17:

Assumptions in Linear Regression

The predictor is linear in the independent variables

Usually a good approximation, especially for high-dimensional data

The error has a normal distribution, with mean zero and constant variance

Important for hypothesis tests

The independent variables are independent of each other

Otherwise a multicollinearity problem arises: two or more predictor variables are highly correlated

Such redundant variables should be removed (a simple check is sketched below)
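
One simple way to spot multicollinearity (an illustrative check, not a rule prescribed by the slides) is to inspect the pairwise correlations among the predictors:

import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = 0.98 * x1 + 0.02 * rng.normal(size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                     # an unrelated predictor
X_raw = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X_raw, rowvar=False)  # predictor-by-predictor correlations
print(np.round(corr, 2))                 # |corr(x1, x2)| near 1 flags trouble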