Page 1
Linear Regression
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Page 2
Regression
Given:
– Data $X = \left\{ x^{(1)}, \dots, x^{(n)} \right\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \left\{ y^{(1)}, \dots, y^{(n)} \right\}$ where $y^{(i)} \in \mathbb{R}$

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020, with linear regression and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013).]
Page 3
Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)

Based on slide by Jeff Howbert
Page 4
Linear Regression
• Hypothesis:
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$$
Assume $x_0 = 1$
• Fit model by minimizing the sum of squared errors

[Figure: 1-D data with a fitted line; vertical bars mark the residuals being minimized.]

Figures are courtesy of Greg Shakhnarovich
Page 5
Least Squares Linear Regression
• Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
• Fit by solving $\min_{\theta} J(\theta)$
Page 6
Intuition Behind Cost Function
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$.
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
Based on example by Andrew Ng
Page 7
Intuition Behind Cost Function
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$.
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

[Figures: left, $h_\theta(x)$ plotted against x (for fixed θ, this is a function of x); right, $J(\theta_1)$ plotted against $\theta_1$ (a function of the parameter).]

Based on example by Andrew Ng
Page 8
Intuition Behind Cost Function

[Figures: left, the data and the hypothesis line for θ = [0, 0.5]; right, the corresponding point on $J(\theta_1)$.]

$$J([0, 0.5]) = \frac{1}{2 \times 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$$

Based on example by Andrew Ng
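A quick numerical check of this computation, as a sketch in Python/NumPy. The three data points (1, 1), (2, 2), (3, 3) are assumed from the figure; they are consistent with the residuals shown above.

import numpy as np

# Assumed example data (from the figure): x = 1, 2, 3 and y = 1, 2, 3
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(theta0, theta1, x, y):
    # J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2
    n = len(x)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * n)

print(cost(0.0, 0.5, x, y))   # ~0.583, matching J([0, 0.5]) above
print(cost(0.0, 0.0, x, y))   # ~2.333, matching J([0, 0]) on the next slide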
Page 9
Intuition Behind Cost Function

[Figures: left, the data and the hypothesis line for θ = [0, 0]; right, the corresponding point on $J(\theta_1)$.]

$$J([0, 0]) \approx 2.333$$

$J(\theta)$ is convex

Based on example by Andrew Ng
Page 10
Intuition Behind Cost Function
[Figure: the cost $J(\theta_0, \theta_1)$ plotted over both parameters.]
Slide by Andrew Ng
Page 11
Intuition Behind Cost Function
[Figures: left, $h_\theta(x)$ for a fixed θ (a function of x); right, $J(\theta_0, \theta_1)$ (a function of the parameters).]
Slide by Andrew Ng
Page 17
Basic Search Procedure
• Choose an initial value for θ
• Until we reach a minimum:
  – Choose a new value for θ to reduce $J(\theta)$

[Figure: the surface $J(\theta_0, \theta_1)$ over $(\theta_0, \theta_1)$.]

Figure by Andrew Ng

Since the least squares objective function is convex, we don't need to worry about local minima.
Page 18
Gradient Descent
• Initialize θ
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{(simultaneous update for } j = 0 \dots d\text{)}$$
where α is the learning rate (small), e.g., α = 0.05

[Figure: $J(\theta)$ plotted against θ, illustrating the descent steps.]
Page 19
Gradient Descent
• Initialize θ
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{(simultaneous update for } j = 0 \dots d\text{)}$$

For Linear Regression:
$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \, \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
&= \frac{\partial}{\partial \theta_j} \, \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
\end{aligned}$$
Page 23
Gradient Descent for Linear Regression
• Initialize θ
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \qquad \text{(simultaneous update for } j = 0 \dots d\text{)}$$
• To achieve the simultaneous update:
  – At the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$
  – Use this stored value in the update step loop
• Assume convergence when $\left\lVert \theta_{\text{new}} - \theta_{\text{old}} \right\rVert_2 < \epsilon$

L2 norm: $\lVert v \rVert_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \dots + v_{|v|}^2}$
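A minimal sketch of this procedure in Python/NumPy (the function name and defaults are illustrative, not from the slides; X is assumed to already include the leading column of ones for x0):

import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    # Batch gradient descent for linear regression.
    # X: (n, d+1) design matrix with a leading column of ones; y: (n,) targets.
    n, d1 = X.shape
    theta = np.zeros(d1)
    for _ in range(max_iters):
        h = X @ theta                     # compute h_theta(x^(i)) once per iteration
        grad = (X.T @ (h - y)) / n        # (1/n) * sum_i (h - y) * x_j, for all j at once
        theta_new = theta - alpha * grad  # simultaneous update of all theta_j
        if np.linalg.norm(theta_new - theta) < eps:  # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta

For example, on the toy data from the earlier slides, gradient_descent(np.c_[np.ones(3), [1.0, 2.0, 3.0]], np.array([1.0, 2.0, 3.0])) converges toward θ ≈ [0, 1].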
Page 24
Gradient Descent
[Figures: left, the data and the current hypothesis h(x) = −900 − 0.1x (for fixed θ, this is a function of x); right, the cost as a function of the parameters, with the current θ marked.]
Slide by Andrew Ng
Page 33
Choosing α

α too small: slow convergence.
α too large: increasing value of $J(\theta)$
• May overshoot the minimum
• May fail to converge
• May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
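A sketch of this diagnostic (assuming a simple gradient-descent loop like the one earlier; the α values and toy data are illustrative):

import numpy as np

def gd_cost_history(X, y, alpha, iters=100):
    # Run gradient descent and record J(theta) at every iteration
    n = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        r = X @ theta - y
        history.append((r @ r) / (2 * n))      # J(theta) for the current theta
        theta = theta - alpha * (X.T @ r) / n  # gradient descent step
    return theta, history

X = np.c_[np.ones(3), [1.0, 2.0, 3.0]]
y = np.array([1.0, 2.0, 3.0])
for alpha in (0.01, 0.05, 0.5):
    _, hist = gd_cost_history(X, y, alpha)
    # J should decrease every iteration; if it grows, alpha is too large for this data
    print(alpha, hist[0], hist[-1])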
Page 34
Extending Linear Regression to More Complex Models
• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs
    • e.g., log, exp, square root, square, etc.
  – Polynomial transformations
    • example: y = b0 + b1 × x + b2 × x^2 + b3 × x^3
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables
    • example: x3 = x1 × x2

This allows use of linear regression techniques to fit non-linear datasets (see the sketch below).
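As a sketch of the idea (the data and feature choices are made up for illustration), the transformed inputs simply become extra columns of the design matrix, after which ordinary linear regression applies unchanged:

import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 1.0, 1.5, 2.0])

# Design matrix with a bias column, polynomial terms of x1,
# and an interaction feature x3 = x1 * x2.
X = np.column_stack([
    np.ones_like(x1),    # x0 = 1 (bias)
    x1, x1**2, x1**3,    # polynomial transformation of x1
    x2,                  # original quantitative input
    x1 * x2,             # interaction between variables
])
# X can now be passed to any linear regression fitting method.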
Page 35
Linear Basis Function Models
• Generally,
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
where $\phi_j(x)$ is a basis function.
• Typically, $\phi_0(x) = 1$, so that $\theta_0$ acts as a bias.
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)
Page 36
Linear Basis Function Models
• Polynomial basis functions:
  – These are global; a small change in x affects all basis functions.
• Gaussian basis functions:
  – These are local; a small change in x only affects nearby basis functions. μj and s control location and scale (width).

Based on slide by Christopher Bishop (PRML)
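The Gaussian basis formula itself appears only in the slide's figure; the sketch below assumes the standard form from PRML, φj(x) = exp(−(x − μj)² / (2s²)), applied to scalar inputs:

import numpy as np

def gaussian_basis(x, centers, s):
    # Map scalar inputs x to Gaussian basis features.
    # centers: the mu_j (locations); s: shared width (scale).
    x = np.asarray(x, dtype=float).reshape(-1, 1)          # shape (n, 1)
    mu = np.asarray(centers, dtype=float).reshape(1, -1)   # shape (1, m)
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))           # shape (n, m), local features

# Example: 9 basis functions spread over [0, 1]
Phi = gaussian_basis(np.linspace(0, 1, 50), np.linspace(0, 1, 9), s=0.1)
X = np.c_[np.ones(len(Phi)), Phi]   # prepend phi_0(x) = 1 so theta_0 acts as the bias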
Page 37
Linear Basis Function Models
• Sigmoidal basis functions (formula shown in figure):
  – These are also local; a small change in x only affects nearby basis functions. μj and s control location and scale (slope).

Based on slide by Christopher Bishop (PRML)
Page 38
Example of Fitting a Polynomial Curve with a Linear Model
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$$
Page 39
Linear Basis Function Models
• Basic Linear Model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$$
• Generalized Linear Model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick – more on that when we cover support vector machines
  – Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton
Page 40
Linear Algebra Concepts
• A vector in $\mathbb{R}^d$ is an ordered set of d real numbers
  – e.g., v = [1, 6, 3, 4] is in $\mathbb{R}^4$
  – "[1, 6, 3, 4]" is a column vector: $\begin{pmatrix} 1 \\ 6 \\ 3 \\ 4 \end{pmatrix}$
  – as opposed to a row vector: $\begin{pmatrix} 1 & 6 & 3 & 4 \end{pmatrix}$
• An m-by-n matrix is an object with m rows and n columns, where each entry is a real number:
$$A = \begin{pmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \dots & a_{mn} \end{pmatrix}$$

Based on slides by Joseph Bradley
Page 41
Linear Algebra Concepts
• Transpose: reflect a vector/matrix across its diagonal:
$$\begin{pmatrix} a \\ b \end{pmatrix}^{T} = \begin{pmatrix} a & b \end{pmatrix} \qquad \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{T} = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$$
  – Note: $(Ax)^T = x^T A^T$ (we'll define multiplication soon...)
• Vector norms:
  – The Lp norm of $v = (v_1, \dots, v_k)$ is $\left( \sum_i |v_i|^p \right)^{1/p}$
  – Common norms: L1, L2
  – L∞ $= \max_i |v_i|$
• The length of a vector v is L2(v)

Based on slides by Joseph Bradley
Page 42
Linear Algebra Concepts
• Vector dot product:
$$u \cdot v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \cdot \begin{pmatrix} v_1 & v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$$
  – Note: the dot product of u with itself is length(u)² $= \lVert u \rVert_2^2$
• Matrix product:
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$
$$AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$$

Based on slides by Joseph Bradley
Page 43
Linear Algebra Concepts
• Vector products:
  – Dot product:
$$u \cdot v = u^T v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$$
  – Outer product:
$$u v^T = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \begin{pmatrix} v_1 & v_2 \end{pmatrix} = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}$$

Based on slides by Joseph Bradley
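A small NumPy sketch of these operations (the vectors and matrices are arbitrary examples):

import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

print(u @ v)                            # dot product: u1*v1 + u2*v2 = 11.0
print(u @ u, np.linalg.norm(u) ** 2)    # dot product of u with itself = length(u)^2
print(np.outer(u, v))                   # outer product u v^T, a 2x2 matrix

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(A @ B)                            # matrix product
print(A.T)                              # transpose
print(A @ v, v @ A.T)                   # (Av)^T = v^T A^T (identical as 1-D arrays)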
Page 44
Vectorization
• Benefits of vectorization:
  – More compact equations
  – Faster code (using optimized matrix libraries)
• Consider our model:
$$h(x) = \sum_{j=0}^{d} \theta_j x_j$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \qquad x^{T} = \begin{bmatrix} 1 & x_1 & \dots & x_d \end{bmatrix}$$
• We can write the model in vectorized form as $h(x) = \theta^{T} x$
Page 45
Vectorization
• Consider our model for n instances:
$$h\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}
\qquad
X = \begin{bmatrix}
1 & x_1^{(1)} & \dots & x_d^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & \dots & x_d^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & \dots & x_d^{(n)}
\end{bmatrix} \in \mathbb{R}^{n \times (d+1)}$$
• We can write the model in vectorized form as $h_\theta(x) = X\theta$
Page 46
Vectorization
• For the linear regression cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^2$$
• Let
$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times 1}$$
• Then, with $X \in \mathbb{R}^{n \times (d+1)}$ and $\theta \in \mathbb{R}^{(d+1) \times 1}$:
$$J(\theta) = \frac{1}{2n} \left( X\theta - y \right)^{T} \left( X\theta - y \right)$$
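A sketch of the vectorized cost in NumPy, checked against the explicit sum (the data is made up):

import numpy as np

def cost_loop(X, y, theta):
    # J(theta) computed with an explicit sum over instances
    n = len(y)
    return sum((theta @ X[i] - y[i]) ** 2 for i in range(n)) / (2 * n)

def cost_vectorized(X, y, theta):
    # J(theta) = 1/(2n) * (X theta - y)^T (X theta - y)
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0]]
y = np.array([1.5, 2.0, 3.5, 4.0])
theta = np.array([0.2, 0.9])
print(cost_loop(X, y, theta), cost_vectorized(X, y, theta))   # identical values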
Page 47
Closed Form Solution
• Instead of using GD, solve for the optimal θ analytically
  – Notice that the solution is where $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Derivation:
$$J(\theta) = \frac{1}{2n} \left( X\theta - y \right)^{T} \left( X\theta - y \right)
\;\propto\; \theta^{T} X^{T} X \theta - y^{T} X \theta - \theta^{T} X^{T} y + y^{T} y
\;\propto\; \theta^{T} X^{T} X \theta - 2\,\theta^{T} X^{T} y + y^{T} y$$
(The second step uses the fact that $y^{T} X \theta$ is a 1×1 scalar, so it equals its transpose $\theta^{T} X^{T} y$.)

Take the derivative, set it equal to 0, and solve for θ:
$$\frac{\partial}{\partial \theta} \left( \theta^{T} X^{T} X \theta - 2\,\theta^{T} X^{T} y + y^{T} y \right) = 0$$
$$(X^{T} X)\,\theta - X^{T} y = 0$$
$$(X^{T} X)\,\theta = X^{T} y$$
$$\theta = (X^{T} X)^{-1} X^{T} y$$
Page 48
Closed Form Solution
• Can obtain θ by simply plugging X and y into
$$\theta = (X^{T} X)^{-1} X^{T} y$$
where
$$X = \begin{bmatrix}
1 & x_1^{(1)} & \dots & x_d^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & \dots & x_d^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & \dots & x_d^{(n)}
\end{bmatrix}
\qquad
y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
• If $X^{T} X$ is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In Python, numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that d ≤ n
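A sketch of the closed form in NumPy, using the pseudo-inverse as suggested above (the example data is illustrative):

import numpy as np

def fit_closed_form(X, y):
    # theta = (X^T X)^{-1} X^T y, computed with the pseudo-inverse for robustness
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

X = np.c_[np.ones(5), np.arange(5.0)]    # n = 5 instances, bias column plus 1 feature
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
theta = fit_closed_form(X, y)
print(theta)                             # [intercept, slope]

(np.linalg.lstsq(X, y, rcond=None) solves the same least squares problem directly.)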
Page 49
Gradient Descent vs. Closed Form Solution

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if n is large
  – Computing $(X^{T}X)^{-1}$ is roughly cubic in the number of features, and forming $X^{T}X$ takes time linear in n
Page 50
Improving Learning: Feature Scaling
• Idea: Ensure that features have similar scales
• Makes gradient descent converge much faster

[Figures: contours of $J$ over $(\theta_1, \theta_2)$ before feature scaling and after feature scaling.]
Page 51
Feature Standardization
• Rescales features to have zero mean and unit variance
  – Let μj be the mean of feature j:
$$\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$$
  – Replace each value with:
$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j} \qquad \text{for } j = 1 \dots d \text{ (not } x_0\text{!)}$$
    • sj is the standard deviation of feature j
    • Could also use the range of feature j (maxj − minj) for sj
• Must apply the same transformation to instances for both training and prediction
• Outliers can cause problems
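A sketch of feature standardization (column 0 of X is assumed to be the all-ones bias column and is left untouched; the training-set statistics are reused at prediction time, as required above):

import numpy as np

def standardize_fit(X_train):
    # Compute per-feature mean and standard deviation on the training set (skip the bias column)
    mu = X_train[:, 1:].mean(axis=0)
    s = X_train[:, 1:].std(axis=0)
    return mu, s

def standardize_apply(X, mu, s):
    # Apply the *training* transformation to any data (train or test)
    X = X.copy()
    X[:, 1:] = (X[:, 1:] - mu) / s
    return X

X_train = np.c_[np.ones(4), [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]]
mu, s = standardize_fit(X_train)
X_train_std = standardize_apply(X_train, mu, s)   # each feature now has zero mean, unit variance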
Page 52
Quality of Fit

[Figures: three fits of Productivity vs. Time Spent – underfitting (high bias), correct fit, and overfitting (high variance).]

Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ... but fails to generalize to new examples

Based on example by Andrew Ng
Page 53
Regularization
• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of θj
  – Can incorporate into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model selection)
Page 54
Regularization
• Linear regression objective function:
$$J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$$
  – λ is the regularization parameter (λ ≥ 0)
  – No regularization on θ0!
Page 55
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Note that
$$\sum_{j=1}^{d} \theta_j^2 = \lVert \theta_{1:d} \rVert_2^2$$
  – This is the squared magnitude of the feature coefficient vector!
• We can also think of this as:
$$\sum_{j=1}^{d} (\theta_j - 0)^2 = \lVert \theta_{1:d} - \vec{0} \rVert_2^2$$
• L2 regularization pulls the coefficients toward 0
Page 56
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• What happens as $\lambda \to \infty$?

[Figure: Productivity vs. Time Spent on Work with the polynomial fit $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$.]
Page 57
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• What happens as $\lambda \to \infty$? The penalty drives $\theta_1, \dots, \theta_4$ to 0, leaving only the unregularized $\theta_0$:
$$\theta_0 + \underbrace{\theta_1}_{\approx 0}\, x + \underbrace{\theta_2}_{\approx 0}\, x^2 + \underbrace{\theta_3}_{\approx 0}\, x^3 + \underbrace{\theta_4}_{\approx 0}\, x^4$$

[Figure: Productivity vs. Time Spent on Work; the fit flattens toward the constant $\theta_0$ (underfitting).]
Page 58
Regularized Linear Regression
• Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Fit by solving $\min_{\theta} J(\theta)$
• Gradient update:
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) \qquad \left(\text{i.e., } \theta_0 - \alpha \tfrac{\partial}{\partial \theta_0} J(\theta)\right)$$
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j \qquad \left(\text{i.e., } \theta_j - \alpha \tfrac{\partial}{\partial \theta_j} J(\theta)\right)$$
The final term, $-\alpha \lambda \theta_j$, comes from the regularization.
Page 59
Regularized Linear Regression
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$$
• We can rewrite the gradient step as:
$$\theta_j \leftarrow \theta_j \left( 1 - \alpha \lambda \right) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
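A sketch of this regularized update in NumPy (θ0 is excluded from the penalty, as above; the function name is illustrative):

import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    # One gradient step for regularized linear regression.
    # For j >= 1 this equals theta_j * (1 - alpha*lam) - alpha * (1/n) * sum_i (h - y) * x_j.
    n = len(y)
    grad = (X.T @ (X @ theta - y)) / n        # unregularized gradient, all j at once
    theta_new = theta - alpha * grad
    theta_new[1:] -= alpha * lam * theta[1:]  # shrinkage term; theta_0 is not regularized
    return theta_new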
Page 60
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
$$\theta = \left( X^{T} X + \lambda \begin{bmatrix}
0 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & \dots & 0 \\
0 & 0 & 1 & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \dots & 1
\end{bmatrix} \right)^{-1} X^{T} y$$
Page 61
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
$$\theta = \left( X^{T} X + \lambda \begin{bmatrix}
0 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & \dots & 0 \\
0 & 0 & 1 & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \dots & 1
\end{bmatrix} \right)^{-1} X^{T} y$$
• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Can prove that for λ > 0, the inverse in the equation above always exists
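A sketch of this regularized closed form in NumPy (the modified identity matrix has a 0 in the top-left entry so that θ0 is not regularized; the example data is illustrative):

import numpy as np

def fit_regularized_closed_form(X, y, lam):
    # theta = (X^T X + lam * D)^{-1} X^T y, where D is the identity with D[0, 0] = 0
    d1 = X.shape[1]
    D = np.eye(d1)
    D[0, 0] = 0.0                                 # no regularization on theta_0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

X = np.c_[np.ones(5), np.arange(5.0)]
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
print(fit_regularized_closed_form(X, y, lam=0.0))   # lam = 0 recovers ordinary least squares
print(fit_regularized_closed_form(X, y, lam=1.0))   # larger lam shrinks the slope toward 0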