Page 1
Linear Regression
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Page 2
Regression
Given:
– Data $X = \left\{ x^{(1)}, \dots, x^{(n)} \right\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \left\{ y^{(1)}, \dots, y^{(n)} \right\}$ where $y^{(i)} \in \mathbb{R}$

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020, with linear regression and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013).]
Page 3
Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)

Based on slide by Jeff Howbert
Page 4
Linear Regression
• Hypothesis:
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$$
Assume $x_0 = 1$
• Fit model by minimizing the sum of squared errors

[Figure: 1-D data with a fitted line; vertical bars mark the residuals being minimized.]

Figures are courtesy of Greg Shakhnarovich
Page 5
Least Squares Linear Regression
• Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
• Fit by solving $\min_{\theta} J(\theta)$
Page 6
Intuition Behind Cost Function
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$.
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
Based on example by Andrew Ng
Page 7
Intuition Behind Cost Function
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$.
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

[Figures: left, $h_\theta(x)$ plotted against x (for fixed θ, this is a function of x); right, $J(\theta_1)$ plotted against $\theta_1$ (a function of the parameter).]

Based on example by Andrew Ng
Page 8
Intuition Behind Cost Function

[Figures: left, the data and the hypothesis line for θ = [0, 0.5]; right, the corresponding point on $J(\theta_1)$.]

$$J([0, 0.5]) = \frac{1}{2 \times 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$$

Based on example by Andrew Ng
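A quick numerical check of this computation, as a sketch in Python/NumPy. The three data points (1, 1), (2, 2), (3, 3) are assumed from the figure; they are consistent with the residuals shown above.

import numpy as np

# Assumed example data (from the figure): x = 1, 2, 3 and y = 1, 2, 3
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(theta0, theta1, x, y):
    # J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2
    n = len(x)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * n)

print(cost(0.0, 0.5, x, y))   # ~0.583, matching J([0, 0.5]) above
print(cost(0.0, 0.0, x, y))   # ~2.333, matching J([0, 0]) on the next slide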
Page 9
Intuition Behind Cost Function

[Figures: left, the data and the hypothesis line for θ = [0, 0]; right, the corresponding point on $J(\theta_1)$.]

$$J([0, 0]) \approx 2.333$$

$J(\theta)$ is convex

Based on example by Andrew Ng
Page 10
Intuition Behind Cost Function
[Figure: the cost $J(\theta_0, \theta_1)$ plotted over both parameters.]
Slide by Andrew Ng
Page 11
Intuition Behind Cost Function
[Figures: left, $h_\theta(x)$ for a fixed θ (a function of x); right, $J(\theta_0, \theta_1)$ (a function of the parameters).]
Slide by Andrew Ng
Page 17
Basic Search Procedure
• Choose an initial value for θ
• Until we reach a minimum:
  – Choose a new value for θ to reduce $J(\theta)$

[Figure: the surface $J(\theta_0, \theta_1)$ over $(\theta_0, \theta_1)$.]

Figure by Andrew Ng

Since the least squares objective function is convex, we don't need to worry about local minima.
Page 18
Gradient Descent
• Initialize θ
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{(simultaneous update for } j = 0 \dots d\text{)}$$
where α is the learning rate (small), e.g., α = 0.05

[Figure: $J(\theta)$ plotted against θ, illustrating the descent steps.]
Page 19
Gradient Descent
• Initialize θ
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{(simultaneous update for } j = 0 \dots d\text{)}$$

For Linear Regression:
$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \, \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
&= \frac{\partial}{\partial \theta_j} \, \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
\end{aligned}$$
Page 23
Gradient Descent for Linear Regression
• Initialize θ
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \qquad \text{(simultaneous update for } j = 0 \dots d\text{)}$$
• To achieve the simultaneous update:
  – At the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$
  – Use this stored value in the update step loop
• Assume convergence when $\left\lVert \theta_{\text{new}} - \theta_{\text{old}} \right\rVert_2 < \epsilon$

L2 norm: $\lVert v \rVert_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \dots + v_{|v|}^2}$
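A minimal sketch of this procedure in Python/NumPy (the function name and defaults are illustrative, not from the slides; X is assumed to already include the leading column of ones for x0):

import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    # Batch gradient descent for linear regression.
    # X: (n, d+1) design matrix with a leading column of ones; y: (n,) targets.
    n, d1 = X.shape
    theta = np.zeros(d1)
    for _ in range(max_iters):
        h = X @ theta                     # compute h_theta(x^(i)) once per iteration
        grad = (X.T @ (h - y)) / n        # (1/n) * sum_i (h - y) * x_j, for all j at once
        theta_new = theta - alpha * grad  # simultaneous update of all theta_j
        if np.linalg.norm(theta_new - theta) < eps:  # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta

For example, on the toy data from the earlier slides, gradient_descent(np.c_[np.ones(3), [1.0, 2.0, 3.0]], np.array([1.0, 2.0, 3.0])) converges toward θ ≈ [0, 1].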
Page 24
Gradient Descent
[Figures: left, the data and the current hypothesis h(x) = −900 − 0.1x (for fixed θ, this is a function of x); right, the cost as a function of the parameters, with the current θ marked.]
Slide by Andrew Ng
Page 33
Choosing α

α too small: slow convergence.
α too large: increasing value of $J(\theta)$
• May overshoot the minimum
• May fail to converge
• May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
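A sketch of this diagnostic (assuming a simple gradient-descent loop like the one earlier; the α values and toy data are illustrative):

import numpy as np

def gd_cost_history(X, y, alpha, iters=100):
    # Run gradient descent and record J(theta) at every iteration
    n = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        r = X @ theta - y
        history.append((r @ r) / (2 * n))      # J(theta) for the current theta
        theta = theta - alpha * (X.T @ r) / n  # gradient descent step
    return theta, history

X = np.c_[np.ones(3), [1.0, 2.0, 3.0]]
y = np.array([1.0, 2.0, 3.0])
for alpha in (0.01, 0.05, 0.5):
    _, hist = gd_cost_history(X, y, alpha)
    # J should decrease every iteration; if it grows, alpha is too large for this data
    print(alpha, hist[0], hist[-1])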
Page 34
Extending Linear Regression to More Complex Models
• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs
    • e.g., log, exp, square root, square, etc.
  – Polynomial transformations
    • example: y = b0 + b1 × x + b2 × x^2 + b3 × x^3
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables
    • example: x3 = x1 × x2

This allows use of linear regression techniques to fit non-linear datasets (see the sketch below).
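As a sketch of the idea (the data and feature choices are made up for illustration), the transformed inputs simply become extra columns of the design matrix, after which ordinary linear regression applies unchanged:

import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 1.0, 1.5, 2.0])

# Design matrix with a bias column, polynomial terms of x1,
# and an interaction feature x3 = x1 * x2.
X = np.column_stack([
    np.ones_like(x1),    # x0 = 1 (bias)
    x1, x1**2, x1**3,    # polynomial transformation of x1
    x2,                  # original quantitative input
    x1 * x2,             # interaction between variables
])
# X can now be passed to any linear regression fitting method.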
Page 35
Linear Basis Function Models
• Generally,
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
where $\phi_j(x)$ is a basis function.
• Typically, $\phi_0(x) = 1$, so that $\theta_0$ acts as a bias.
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)
Page 36
Linear Basis Function Models
• Polynomial basis functions:
  – These are global; a small change in x affects all basis functions.
• Gaussian basis functions:
  – These are local; a small change in x only affects nearby basis functions. μj and s control location and scale (width).

Based on slide by Christopher Bishop (PRML)
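The Gaussian basis formula itself appears only in the slide's figure; the sketch below assumes the standard form from PRML, φj(x) = exp(−(x − μj)² / (2s²)), applied to scalar inputs:

import numpy as np

def gaussian_basis(x, centers, s):
    # Map scalar inputs x to Gaussian basis features.
    # centers: the mu_j (locations); s: shared width (scale).
    x = np.asarray(x, dtype=float).reshape(-1, 1)          # shape (n, 1)
    mu = np.asarray(centers, dtype=float).reshape(1, -1)   # shape (1, m)
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))           # shape (n, m), local features

# Example: 9 basis functions spread over [0, 1]
Phi = gaussian_basis(np.linspace(0, 1, 50), np.linspace(0, 1, 9), s=0.1)
X = np.c_[np.ones(len(Phi)), Phi]   # prepend phi_0(x) = 1 so theta_0 acts as the bias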
Page 37
Linear Basis Function Models
• Sigmoidal basis functions (formula shown in figure):
  – These are also local; a small change in x only affects nearby basis functions. μj and s control location and scale (slope).

Based on slide by Christopher Bishop (PRML)
Page 38
Example of Fitting a Polynomial Curve with a Linear Model
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$$
Page 39
Linear Basis Function Models
• Basic Linear Model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$$
• Generalized Linear Model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick – more on that when we cover support vector machines
  – Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton
Page 40
Linear Algebra Concepts
• A vector in $\mathbb{R}^d$ is an ordered set of d real numbers
  – e.g., v = [1, 6, 3, 4] is in $\mathbb{R}^4$
  – "[1, 6, 3, 4]" is a column vector: $\begin{pmatrix} 1 \\ 6 \\ 3 \\ 4 \end{pmatrix}$
  – as opposed to a row vector: $\begin{pmatrix} 1 & 6 & 3 & 4 \end{pmatrix}$
• An m-by-n matrix is an object with m rows and n columns, where each entry is a real number:
$$A = \begin{pmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \dots & a_{mn} \end{pmatrix}$$

Based on slides by Joseph Bradley
Page 41
Linear Algebra Concepts
• Transpose: reflect a vector/matrix across its diagonal:
$$\begin{pmatrix} a \\ b \end{pmatrix}^{T} = \begin{pmatrix} a & b \end{pmatrix} \qquad \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{T} = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$$
  – Note: $(Ax)^T = x^T A^T$ (we'll define multiplication soon...)
• Vector norms:
  – The Lp norm of $v = (v_1, \dots, v_k)$ is $\left( \sum_i |v_i|^p \right)^{1/p}$
  – Common norms: L1, L2
  – L∞ $= \max_i |v_i|$
• The length of a vector v is L2(v)

Based on slides by Joseph Bradley
Page 42
Linear Algebra Concepts
• Vector dot product:
$$u \cdot v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \cdot \begin{pmatrix} v_1 & v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$$
  – Note: the dot product of u with itself is length(u)² $= \lVert u \rVert_2^2$
• Matrix product:
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$
$$AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$$

Based on slides by Joseph Bradley
Page 43
Linear Algebra Concepts
• Vector products:
  – Dot product:
$$u \cdot v = u^T v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$$
  – Outer product:
$$u v^T = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \begin{pmatrix} v_1 & v_2 \end{pmatrix} = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}$$

Based on slides by Joseph Bradley
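A small NumPy sketch of these operations (the vectors and matrices are arbitrary examples):

import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

print(u @ v)                            # dot product: u1*v1 + u2*v2 = 11.0
print(u @ u, np.linalg.norm(u) ** 2)    # dot product of u with itself = length(u)^2
print(np.outer(u, v))                   # outer product u v^T, a 2x2 matrix

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(A @ B)                            # matrix product
print(A.T)                              # transpose
print(A @ v, v @ A.T)                   # (Av)^T = v^T A^T (identical as 1-D arrays)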
Page 44
Vectorization
• Benefits of vectorization:
  – More compact equations
  – Faster code (using optimized matrix libraries)
• Consider our model:
$$h(x) = \sum_{j=0}^{d} \theta_j x_j$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \qquad x^{T} = \begin{bmatrix} 1 & x_1 & \dots & x_d \end{bmatrix}$$
• We can write the model in vectorized form as $h(x) = \theta^{T} x$
Page 45
Vectorization
• Consider our model for n instances:
$$h\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}
\qquad
X = \begin{bmatrix}
1 & x_1^{(1)} & \dots & x_d^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & \dots & x_d^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & \dots & x_d^{(n)}
\end{bmatrix} \in \mathbb{R}^{n \times (d+1)}$$
• We can write the model in vectorized form as $h_\theta(x) = X\theta$
Page 46
Vectorization
• For the linear regression cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^2$$
• Let
$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times 1}$$
• Then, with $X \in \mathbb{R}^{n \times (d+1)}$ and $\theta \in \mathbb{R}^{(d+1) \times 1}$:
$$J(\theta) = \frac{1}{2n} \left( X\theta - y \right)^{T} \left( X\theta - y \right)$$
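A sketch of the vectorized cost in NumPy, checked against the explicit sum (the data is made up):

import numpy as np

def cost_loop(X, y, theta):
    # J(theta) computed with an explicit sum over instances
    n = len(y)
    return sum((theta @ X[i] - y[i]) ** 2 for i in range(n)) / (2 * n)

def cost_vectorized(X, y, theta):
    # J(theta) = 1/(2n) * (X theta - y)^T (X theta - y)
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0]]
y = np.array([1.5, 2.0, 3.5, 4.0])
theta = np.array([0.2, 0.9])
print(cost_loop(X, y, theta), cost_vectorized(X, y, theta))   # identical values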
Page 47
Closed Form Solution
• Instead of using GD, solve for the optimal θ analytically
  – Notice that the solution is where $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Derivation:
$$J(\theta) = \frac{1}{2n} \left( X\theta - y \right)^{T} \left( X\theta - y \right)
\;\propto\; \theta^{T} X^{T} X \theta - y^{T} X \theta - \theta^{T} X^{T} y + y^{T} y
\;\propto\; \theta^{T} X^{T} X \theta - 2\,\theta^{T} X^{T} y + y^{T} y$$
(The second step uses the fact that $y^{T} X \theta$ is a 1×1 scalar, so it equals its transpose $\theta^{T} X^{T} y$.)

Take the derivative, set it equal to 0, and solve for θ:
$$\frac{\partial}{\partial \theta} \left( \theta^{T} X^{T} X \theta - 2\,\theta^{T} X^{T} y + y^{T} y \right) = 0$$
$$(X^{T} X)\,\theta - X^{T} y = 0$$
$$(X^{T} X)\,\theta = X^{T} y$$
$$\theta = (X^{T} X)^{-1} X^{T} y$$
Page 48
Closed Form Solution
• Can obtain θ by simply plugging X and y into
$$\theta = (X^{T} X)^{-1} X^{T} y$$
where
$$X = \begin{bmatrix}
1 & x_1^{(1)} & \dots & x_d^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & \dots & x_d^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & \dots & x_d^{(n)}
\end{bmatrix}
\qquad
y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
• If $X^{T} X$ is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In Python, numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that d ≤ n
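A sketch of the closed form in NumPy, using the pseudo-inverse as suggested above (the example data is illustrative):

import numpy as np

def fit_closed_form(X, y):
    # theta = (X^T X)^{-1} X^T y, computed with the pseudo-inverse for robustness
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

X = np.c_[np.ones(5), np.arange(5.0)]    # n = 5 instances, bias column plus 1 feature
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
theta = fit_closed_form(X, y)
print(theta)                             # [intercept, slope]

(np.linalg.lstsq(X, y, rcond=None) solves the same least squares problem directly.)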
Page 49
Gradient Descent vs. Closed Form Solution

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if n is large
  – Computing $(X^{T}X)^{-1}$ is roughly cubic in the number of features, and forming $X^{T}X$ takes time linear in n
Page 50
Improving Learning: Feature Scaling
• Idea: Ensure that features have similar scales
• Makes gradient descent converge much faster

[Figures: contours of $J$ over $(\theta_1, \theta_2)$ before feature scaling and after feature scaling.]
Page 51
Feature Standardization
• Rescales features to have zero mean and unit variance
  – Let μj be the mean of feature j:
$$\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$$
  – Replace each value with:
$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j} \qquad \text{for } j = 1 \dots d \text{ (not } x_0\text{!)}$$
    • sj is the standard deviation of feature j
    • Could also use the range of feature j (maxj − minj) for sj
• Must apply the same transformation to instances for both training and prediction
• Outliers can cause problems
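A sketch of feature standardization (column 0 of X is assumed to be the all-ones bias column and is left untouched; the training-set statistics are reused at prediction time, as required above):

import numpy as np

def standardize_fit(X_train):
    # Compute per-feature mean and standard deviation on the training set (skip the bias column)
    mu = X_train[:, 1:].mean(axis=0)
    s = X_train[:, 1:].std(axis=0)
    return mu, s

def standardize_apply(X, mu, s):
    # Apply the *training* transformation to any data (train or test)
    X = X.copy()
    X[:, 1:] = (X[:, 1:] - mu) / s
    return X

X_train = np.c_[np.ones(4), [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]]
mu, s = standardize_fit(X_train)
X_train_std = standardize_apply(X_train, mu, s)   # each feature now has zero mean, unit variance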
Page 52
Quality of Fit

[Figures: three fits of Productivity vs. Time Spent – underfitting (high bias), correct fit, and overfitting (high variance).]

Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ... but fails to generalize to new examples

Based on example by Andrew Ng
Page 53
Regularization
• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of θj
  – Can incorporate into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model selection)
Page 54
Regularization
• Linear regression objective function:
$$J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$$
  – λ is the regularization parameter (λ ≥ 0)
  – No regularization on θ0!
Page 55
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Note that
$$\sum_{j=1}^{d} \theta_j^2 = \lVert \theta_{1:d} \rVert_2^2$$
  – This is the squared magnitude of the feature coefficient vector!
• We can also think of this as:
$$\sum_{j=1}^{d} (\theta_j - 0)^2 = \lVert \theta_{1:d} - \vec{0} \rVert_2^2$$
• L2 regularization pulls the coefficients toward 0
Page 56
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• What happens as $\lambda \to \infty$?

[Figure: Productivity vs. Time Spent on Work with the polynomial fit $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$.]
Page 57
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• What happens as $\lambda \to \infty$? The penalty drives $\theta_1, \dots, \theta_4$ to 0, leaving only the unregularized $\theta_0$:
$$\theta_0 + \underbrace{\theta_1}_{\approx 0}\, x + \underbrace{\theta_2}_{\approx 0}\, x^2 + \underbrace{\theta_3}_{\approx 0}\, x^3 + \underbrace{\theta_4}_{\approx 0}\, x^4$$

[Figure: Productivity vs. Time Spent on Work; the fit flattens toward the constant $\theta_0$ (underfitting).]
Page 58
Regularized Linear Regression
• Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Fit by solving $\min_{\theta} J(\theta)$
• Gradient update:
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) \qquad \left(\text{i.e., } \theta_0 - \alpha \tfrac{\partial}{\partial \theta_0} J(\theta)\right)$$
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j \qquad \left(\text{i.e., } \theta_j - \alpha \tfrac{\partial}{\partial \theta_j} J(\theta)\right)$$
The final term, $-\alpha \lambda \theta_j$, comes from the regularization.
Page 59
Regularized Linear Regression
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$$
• We can rewrite the gradient step as:
$$\theta_j \leftarrow \theta_j \left( 1 - \alpha \lambda \right) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
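A sketch of this regularized update in NumPy (θ0 is excluded from the penalty, as above; the function name is illustrative):

import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    # One gradient step for regularized linear regression.
    # For j >= 1 this equals theta_j * (1 - alpha*lam) - alpha * (1/n) * sum_i (h - y) * x_j.
    n = len(y)
    grad = (X.T @ (X @ theta - y)) / n        # unregularized gradient, all j at once
    theta_new = theta - alpha * grad
    theta_new[1:] -= alpha * lam * theta[1:]  # shrinkage term; theta_0 is not regularized
    return theta_new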
Page 60
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
$$\theta = \left( X^{T} X + \lambda \begin{bmatrix}
0 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & \dots & 0 \\
0 & 0 & 1 & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \dots & 1
\end{bmatrix} \right)^{-1} X^{T} y$$
Page 61
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
$$\theta = \left( X^{T} X + \lambda \begin{bmatrix}
0 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & \dots & 0 \\
0 & 0 & 1 & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \dots & 1
\end{bmatrix} \right)^{-1} X^{T} y$$
• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Can prove that for λ > 0, the inverse in the equation above always exists
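A sketch of this regularized closed form in NumPy (the modified identity matrix has a 0 in the top-left entry so that θ0 is not regularized; the example data is illustrative):

import numpy as np

def fit_regularized_closed_form(X, y, lam):
    # theta = (X^T X + lam * D)^{-1} X^T y, where D is the identity with D[0, 0] = 0
    d1 = X.shape[1]
    D = np.eye(d1)
    D[0, 0] = 0.0                                 # no regularization on theta_0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

X = np.c_[np.ones(5), np.arange(5.0)]
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
print(fit_regularized_closed_form(X, y, lam=0.0))   # lam = 0 recovers ordinary least squares
print(fit_regularized_closed_form(X, y, lam=1.0))   # larger lam shrinks the slope toward 0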