CSE217 INTRODUCTION TO DATA SCIENCE Spring 2019 Marion Neumann LECTURE 4: REGRESSION

May 20, 2020


CSE217 INTRODUCTION TO DATA SCIENCE

Spring 2019 Marion Neumann

LECTURE 4: REGRESSION


RECAP: DATA SCIENCE

…solving problems with data…

[Diagram labels: scientific or business problem; data problem; collect & understand data; clean & format data; use data to create solution]

…which step is most exciting?

Machine Learning


RECAP: ML

• data: anything you can measure or record

• model: specification of a (mathematical) relationship between different variables

• evaluation: how well does the model work?

…creating and using models that learn from data…

RECAP: ML WORKFLOW

• Training phase, test phase, and evaluation phase

→ turn to your neighbor

• by taking turns, explain what happens in the
  • training phase
  • test phase
  • evaluation phase

• carefully define what kinds of data are used in each phase

[Diagram labels: data, program, output; data, output, ground truth, performance measure]

PROPERTY SALES DATA

Goal: predict how much my house is worth

• features (input variables)
  size (in sq. ft): o numeric o categorical o binary
  neighborhood: o numeric o categorical o binary
  # bedrooms: o numeric o categorical o binary
  # bathrooms: o numeric o categorical o binary
  pool: o numeric o categorical o binary
  age (in years): o numeric o categorical o binary
  renovated: o numeric o categorical o binary

• house price = target variable: o numeric o categorical o binary
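As a rough illustration of why the feature types matter, here is a minimal sketch of turning one house record into a numeric feature vector. The field names, the example values, and the list of neighborhoods are all made up for illustration, and the typing used (counts and measurements as numeric, yes/no fields as binary, neighborhood as categorical with a one-hot encoding) is one plausible choice, not the only one.

```python
# Hypothetical house record; field names and values are invented for illustration.
house = {
    "size_sqft": 1850.0,
    "neighborhood": "Clayton",
    "bedrooms": 3,
    "bathrooms": 2,
    "pool": False,
    "age_years": 42,
    "renovated": True,
}

# Assumed list of possible neighborhoods (the categories of the categorical feature).
NEIGHBORHOODS = ["Central West End", "Clayton", "University City"]


def encode(record, neighborhoods=NEIGHBORHOODS):
    """Turn one record into a flat list of floats."""
    # Numeric features are used as they are.
    numeric = [
        float(record["size_sqft"]),
        float(record["bedrooms"]),
        float(record["bathrooms"]),
        float(record["age_years"]),
    ]
    # Binary features become 0.0 or 1.0.
    binary = [
        1.0 if record["pool"] else 0.0,
        1.0 if record["renovated"] else 0.0,
    ]
    # The categorical feature becomes one 0/1 entry per category (one-hot).
    one_hot = [1.0 if record["neighborhood"] == n else 0.0 for n in neighborhoods]
    return numeric + binary + one_hot


features = encode(house)
print(features)
```

The house price itself is not part of the feature vector; it is the target the model is supposed to predict.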

How can this data help?

PREDICTING HOUSE PRICES

• target (house price) is a real number


How much is my house worth?

Look at Zillow!


LINEAR REGRESSION MODEL
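The figure for this slide did not survive in the transcript. As a reminder of what a linear regression model usually looks like, here is a sketch in assumed notation: the weights $w_0, \dots, w_d$ and features $x_1, \dots, x_d$ are names chosen here, not taken from the slide.

```latex
\hat{y} = f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d
```

With a single feature such as house size, this reduces to a straight line with intercept $w_0$ and slope $w_1$.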


TRAINING: MINIMIZE ERROR

PDSH p. 391

Linear Regression

math & statistics
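To make "minimize error" concrete, here is a minimal sketch of training a one-feature linear model by least squares, using the standard closed-form solution for the slope and intercept. The function name and the toy data (sizes in sq. ft, prices in thousands of dollars) are made up for illustration and are not from the lecture.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy training data, invented for illustration.
sizes = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [200.0, 290.0, 410.0, 500.0]

intercept, slope = fit_simple_linear_regression(sizes, prices)
print(intercept, slope)
```

Libraries such as scikit-learn solve the same kind of problem for many features at once; the hand-written version only shows where the numbers come from.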


PREDICTION: USE MODEL

PDSH p. 391

Linear Regression
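Once a model is trained, prediction just means plugging new inputs into it. The sketch below uses NumPy's `np.polyfit` with degree 1 as a stand-in for a linear regression fit; it is not the code from PDSH, and the toy data and the `predict` helper are assumptions made for illustration.

```python
import numpy as np

# Toy training data, invented for illustration (sq. ft, thousands of dollars).
sizes = np.array([1000.0, 1500.0, 2000.0, 2500.0])
prices = np.array([200.0, 290.0, 410.0, 500.0])

# Training: a degree-1 polynomial fit is a straight-line least-squares fit.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(sizes, prices, deg=1)


def predict(size_sqft):
    """Apply the fitted line to one size or an array of sizes."""
    return slope * np.asarray(size_sqft, dtype=float) + intercept


# Prediction: use the model on a house that was not in the training data.
my_house = predict(1800.0)
print(float(my_house))
```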


HOW ABOUT MORE COMPLEX MODELS?

PDSH p. 393

Linear Regression

Error on training set: linear model >> quadratic >> 6th-order polynomial (← error is zero!)

Is the model with zero (training) error the best?
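The training-error comparison can be reproduced with a small experiment. This is a sketch on synthetic data, not the data behind the slide: seven noisy points are generated from a straight line, and polynomials of degree 1, 2, and 6 are fitted to them. With seven points, a degree-6 polynomial has enough coefficients to pass through every point, so its training error is essentially zero even though the underlying relationship is linear.

```python
import numpy as np

# Synthetic training set (an assumption for illustration): a line plus noise.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 6.0, 7)
y_train = 2.0 * x_train + 1.0 + rng.normal(scale=1.0, size=x_train.size)


def training_rmse(degree):
    """Fit a polynomial of the given degree and return its RMSE on the training set."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    residuals = np.polyval(coeffs, x_train) - y_train
    return float(np.sqrt(np.mean(residuals ** 2)))


errors = {degree: training_rmse(degree) for degree in (1, 2, 6)}
for degree, error in errors.items():
    print(f"degree {degree}: training RMSE = {error:.6f}")
```

A low training error only says the model reproduces the points it has already seen; it says nothing yet about new houses.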


EVALUATION FOR REGRESSION

• Training Error vs. Test Error

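• Error measures:
  • RMSE: root mean squared error
  • MAE: mean absolute error

$$\mathrm{RMSE}(\hat{\mathbf{y}}, \mathbf{y}_{*}) = \sqrt{\frac{1}{n} \sum_{i} (\hat{y}_i - y_i)^2}$$

$$\mathrm{MAE}(\hat{\mathbf{y}}, \mathbf{y}_{*}) = \frac{1}{n} \sum_{i} |\hat{y}_i - y_i|$$

$\hat{\mathbf{y}} = f(X_{*})$: predictions for test data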
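Both error measures are short enough to write by hand. The sketch below uses plain Python; the function names and the three example values are made up for illustration.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error: square the errors, average, take the root."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)


def mae(predictions, targets):
    """Mean absolute error: average of the absolute errors."""
    n = len(targets)
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / n


# Hypothetical predictions and ground truth for three test houses.
y_hat = [3.0, 5.0, 7.0]
y_true = [2.0, 5.0, 9.0]

print(mae(y_hat, y_true))   # 1.0
print(rmse(y_hat, y_true))  # about 1.291
```

The errors here are 1, 0, and 2. RMSE comes out larger than MAE because squaring gives the largest error more weight.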


MACHINE LEARNING WORKFLOW

• Training Phase, Test Phase, Evaluation Phase
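The three phases can be sketched end to end in a few lines. Everything here is an assumption made for illustration: the synthetic data, the 30/10 split, and the use of `np.polyfit` as the training step. The point is only which data each phase touches.

```python
import numpy as np

# Synthetic data set (invented): 40 houses with size and price.
rng = np.random.default_rng(217)
sizes = rng.uniform(800.0, 3000.0, size=40)
prices = 0.2 * sizes + 10.0 + rng.normal(scale=20.0, size=40)

# Split the data once, before any training happens.
indices = rng.permutation(40)
train_idx, test_idx = indices[:30], indices[30:]

# Training phase: uses training inputs AND training targets.
slope, intercept = np.polyfit(sizes[train_idx], prices[train_idx], deg=1)

# Test phase: uses test inputs only; the model produces predictions.
predictions = slope * sizes[test_idx] + intercept

# Evaluation phase: compares predictions with the test targets (ground truth).
test_rmse = float(np.sqrt(np.mean((predictions - prices[test_idx]) ** 2)))
print(f"test RMSE: {test_rmse:.2f}")
```

The test targets are never shown to the model during training, which is what makes the test error a fair estimate of how well the model generalizes.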


SUMMARY & READING

• Learning from Data requires a lot of math!

• Regression models are used to predict real-valued targets.

• We need a test set to evaluate how well our model generalizes.

• DSFS
  • Ch11: ML (p142-144)
  • Ch14: Simple Linear Regression (p173-176)

• PDSH Ch5: ML – Linear Regression (p390-394)

• LINEAR REGRESSION BY HAND
  https://www.wired.com/2011/01/linear-regression-by-hand/

SciKitLearn

understand the model

use the model in practice