Unsupervised learning (part 1), Lecture 19
David Sontag, New York University
Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore
Bayesian networks enable use of domain knowledge

Will my car start this morning?

Heckerman et al., Decision-Theoretic Troubleshooting, 1995
Bayesian networks (Reference: Chapter 3)

A Bayesian network is specified by a directed acyclic graph G = (V, E) with:

1. One node i ∈ V for each random variable X_i
2. One conditional probability distribution (CPD) per node, p(x_i | x_Pa(i)), specifying the variable's probability conditioned on its parents' values

Corresponds 1-1 with a particular factorization of the joint distribution:

    p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))

Powerful framework for designing algorithms to perform probability computations
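To make the factorization concrete, here is a minimal sketch in Python. The three-node network (Cloudy -> Sprinkler, Cloudy -> Rain) and all CPD values are made up for illustration; the joint is just the product of one CPD entry per node.

# Minimal sketch: evaluating the joint of a toy Bayesian network via its
# factorization p(c, s, r) = p(c) * p(s | c) * p(r | c). All numbers invented.
p_cloudy = {True: 0.5, False: 0.5}                    # p(c)
p_sprinkler = {True: {True: 0.1, False: 0.9},         # p(s | c), outer key is c
               False: {True: 0.5, False: 0.5}}
p_rain = {True: {True: 0.8, False: 0.2},              # p(r | c), outer key is c
          False: {True: 0.2, False: 0.8}}

def joint(c, s, r):
    """One factor per node, each conditioned on its parents' values."""
    return p_cloudy[c] * p_sprinkler[c][s] * p_rain[c][r]

print(joint(True, False, True))   # 0.5 * 0.9 * 0.8 = 0.36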
Bayesian networks enable use of domain knowledge

What is the differential diagnosis?

Beinlich et al., The ALARM Monitoring System, 1989
Bayesian networks are generative models

• Can sample from the joint distribution, top-down
• Suppose Y can be "spam" or "not spam", and X_i is a binary indicator of whether word i is present in the e-mail
• Let's try generating a few emails! (see the sketch below)
• Often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions

[Figure: naive Bayes structure, label Y with features X_1, X_2, X_3, ..., X_n]
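A minimal sketch of this top-down (ancestral) sampling, assuming a tiny made-up vocabulary and made-up probabilities:

import random

# Sample the root Y first, then each word indicator X_i given Y.
p_spam = 0.3                                     # P(Y = spam), invented
p_word_given_y = {                               # P(X_i = 1 | Y), invented
    "viagra":  {"spam": 0.30, "not spam": 0.001},
    "meeting": {"spam": 0.02, "not spam": 0.20},
    "friend":  {"spam": 0.10, "not spam": 0.15},
}

def sample_email():
    y = "spam" if random.random() < p_spam else "not spam"
    words = [w for w, p in p_word_given_y.items() if random.random() < p[y]]
    return y, words

for _ in range(3):
    print(sample_email())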
Inference in Bayesian networks

• Computing marginal probabilities in tree-structured Bayesian networks is easy
  – The algorithm called "belief propagation" generalizes what we showed for hidden Markov models to arbitrary trees
• Wait... this isn't a tree! What can we do?

[Figures: chain-structured network with hidden states X_1 ... X_6 and observations Y_1 ... Y_6; network with label Y and features X_1, X_2, X_3, ..., X_n]
Inference in Bayesian networks

• In some cases (such as this) we can transform this into what is called a "junction tree", and then run belief propagation
Approximate inference

• There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these
• Markov chain Monte Carlo algorithms repeatedly sample assignments for estimating marginals
• Variational inference algorithms (deterministic) find a simpler distribution which is "close" to the original, then compute marginals using the simpler distribution
Maximum likelihood estimation in Bayesian networks

Suppose that we know the Bayesian network structure G.

Let θ_{x_i | x_pa(i)} be the parameter giving the value of the CPD p(x_i | x_pa(i)).

Maximum likelihood estimation corresponds to solving:

    max_θ (1/M) Σ_{m=1}^M log p(x^m; θ)

subject to the non-negativity and normalization constraints.

This is equal to:

    max_θ (1/M) Σ_{m=1}^M log p(x^m; θ) = max_θ (1/M) Σ_{m=1}^M Σ_{i=1}^N log p(x_i^m | x_pa(i)^m; θ)
                                        = max_θ Σ_{i=1}^N (1/M) Σ_{m=1}^M log p(x_i^m | x_pa(i)^m; θ)

The optimization problem decomposes into an independent optimization problem for each CPD! Has a simple closed-form solution.
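The closed-form solution is just a normalized count for each CPD entry. A minimal sketch, assuming a made-up two-node network A -> B with binary data:

import collections

# ML estimate of each CPD = empirical counts, normalized per parent configuration.
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]   # samples (a, b), invented

counts_a = collections.Counter(a for a, _ in data)
counts_ab = collections.Counter(data)

theta_a = {a: counts_a[a] / len(data) for a in (0, 1)}               # p(a)
theta_b_given_a = {(b, a): counts_ab[(a, b)] / counts_a[a]
                   for a in (0, 1) for b in (0, 1)}                  # p(b | a)

print(theta_a)             # {0: 0.5, 1: 0.5}
print(theta_b_given_a)     # e.g. p(b=1 | a=1) = 2/3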
Returning to clustering...

• Clusters may overlap
• Some clusters may be "wider" than others
• Can we model this explicitly?
• With what probability is a point from a cluster?
Probabilistic Clustering

• Try a probabilistic model!
  • allows overlaps, clusters of different size, etc.
• Can tell a generative story for data
  – P(Y) P(X|Y)
• Challenge: we need to estimate model parameters without labeled Ys

Y     X1     X2
??    0.1    2.1
??    0.5   -1.1
??    0.0    3.0
??   -0.1   -2.0
??    0.2    1.5
...   ...    ...
Gaussian Mixture Models

[Figure: three clusters with means μ_1, μ_2, μ_3]

• P(Y): There are k components
• P(X|Y): Each component generates data from a multivariate Gaussian with mean μ_i and covariance matrix Σ_i

Each data point assumed to have been sampled from a generative process:

1. Choose component i with probability P(y = i)   [Multinomial]
2. Generate data point ~ N(μ_i, Σ_i)

    P(X = x_j | Y = i) = 1 / ((2π)^(m/2) |Σ_i|^(1/2)) · exp( −(1/2) (x_j − μ_i)^T Σ_i^{-1} (x_j − μ_i) )

By fitting this model (unsupervised learning), we can learn new insights about the data
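A minimal sketch of this generative process in Python (NumPy); the component weights, means, and covariances below are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.3, 0.2])                           # P(y = i)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])]

def sample_point():
    i = rng.choice(len(weights), p=weights)                   # step 1: pick a component
    return i, rng.multivariate_normal(means[i], covs[i])      # step 2: Gaussian draw

def gaussian_density(x, mu, sigma):
    """The multivariate normal density from the slide, evaluated at x."""
    m = len(x)
    diff = x - mu
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

y, x = sample_point()
print(y, x, gaussian_density(x, means[y], covs[y]))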
Multivariate Gaussians

Σ ∝ identity matrix (same Gaussian density P(X = x_j | Y = i) as above)

Multivariate Gaussians

Σ = diagonal matrix: the X_i are independent, à la Gaussian Naive Bayes (same Gaussian density as above)
Multivariate Gaussians

Σ = arbitrary (semidefinite) matrix:
  – specifies rotation (change of basis)
  – eigenvalues specify relative elongation
(same Gaussian density as above)

Multivariate Gaussians

[Figure: the covariance matrix Σ measures the degree to which the x_i vary together; each eigenvalue λ of Σ gives the elongation along the corresponding eigenvector]
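A minimal sketch of this geometric reading, assuming a made-up rotation angle and made-up eigenvalues: build Σ = R diag(λ) R^T and recover the eigenvalues.

import numpy as np

theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # rotation / change of basis
lam = np.array([4.0, 0.5])                          # eigenvalues: relative elongation

sigma = R @ np.diag(lam) @ R.T                      # Sigma = R diag(lambda) R^T

print(np.linalg.eigh(sigma)[0])                     # recovers [0.5, 4.0]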
Modelling eruption of geysers

Old Faithful data set

[Figure: scatter plot of time to eruption vs. duration of last eruption]

Modelling eruption of geysers

Old Faithful data set

[Figure: fit with a single Gaussian vs. a mixture of two Gaussians]
Marginal distribution for mixtures of Gaussians

    p(x) = Σ_{k=1}^K π_k N(x | μ_k, Σ_k)        (here K = 3; π_k is the mixing coefficient and N(x | μ_k, Σ_k) the k-th component)

Marginal distribution for mixtures of Gaussians

[Figure: the resulting marginal density]
Learning mixtures of Gaussians

[Figures: original data with labels (hypothesized); observed data (y missing); inferred y's (learned model)]

Shown is the posterior probability Pr(Y = i | x) that a point was generated from the i-th Gaussian.
ML estimation in the supervised setting

• Univariate Gaussian

• Mixture of Multivariate Gaussians: the ML estimate for each of the multivariate Gaussians is given by:

    μ_k^ML = (1/n_k) Σ_{j: y_j = k} x_j        Σ_k^ML = (1/n_k) Σ_{j: y_j = k} (x_j − μ_k^ML)(x_j − μ_k^ML)^T

  i.e. just sums over the x_j generated from the k'th Gaussian (n_k of them)
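A minimal sketch of these supervised estimates, reusing the toy feature values from the earlier table with made-up labels y:

import numpy as np

X = np.array([[0.1, 2.1], [0.5, -1.1], [0.0, 3.0], [-0.1, -2.0], [0.2, 1.5]])
y = np.array([0, 1, 0, 1, 0])                    # labels are observed here (invented)

def ml_estimates(X, y, k):
    Xk = X[y == k]                               # only points generated by the k'th Gaussian
    mu = Xk.mean(axis=0)
    diff = Xk - mu
    sigma = diff.T @ diff / len(Xk)              # (1/n_k) * sum of outer products
    return mu, sigma

for k in (0, 1):
    print(k, *ml_estimates(X, y, k))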
What about with unobserved data?

• Maximize marginal likelihood:
  – argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{k=1}^K P(Y_j = k, x_j)
• Almost always a hard problem!
  – Usually no closed-form solution
  – Even when log P(X,Y) is convex, log P(X) generally isn't...
  – Many local optima
Expectation Maximization

1977: Dempster, Laird, & Rubin
The EM Algorithm

• A clever method for maximizing marginal likelihood:
  – argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{k=1}^K P(Y_j = k, x_j)
  – Based on coordinate descent. Easy to implement (e.g., no line search, learning rates, etc.)
• Alternate between two steps:
  – Compute an expectation
  – Compute a maximization
• Not magic: still optimizing a non-convex function with lots of local optima
  – The computations are just easier (often, significantly so)
EM: Two Easy Steps

Objective: argmax_θ log ∏_j Σ_{k=1}^K P(Y_j = k, x_j; θ) = Σ_j log Σ_{k=1}^K P(Y_j = k, x_j; θ)

Data: {x_j | j = 1..n}

• E-step: Compute expectations to "fill in" missing y values according to current parameters, θ
  – For all examples j and values k for Y_j, compute: P(Y_j = k | x_j; θ)
• M-step: Re-estimate the parameters with "weighted" MLE estimates
  – Set θ_new = argmax_θ Σ_j Σ_k P(Y_j = k | x_j; θ_old) log P(Y_j = k, x_j; θ)

Particularly useful when the E and M steps have closed form solutions
Gaussian Mixture Example: Start

[Figure sequence: the fitted mixture after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]
EM for GMMs: only learning means (1D)

Iterate: On the t'th iteration let our estimates be λ_t = {μ_1(t), μ_2(t), ..., μ_K(t)}

E-step: Compute "expected" classes of all data points

    P(Y_j = k | x_j, μ_1...μ_K) ∝ exp( −(1/(2σ²)) (x_j − μ_k)² ) P(Y_j = k)

M-step: Compute most likely new μs given class expectations

    μ_k = Σ_{j=1}^m P(Y_j = k | x_j) x_j  /  Σ_{j=1}^m P(Y_j = k | x_j)
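A minimal sketch of these two updates in NumPy, with σ fixed, K = 2, and made-up 1D data and initial means:

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])   # toy 1D data
K, sigma = 2, 1.0
mu = np.array([-1.0, 1.0])                      # initial means
prior = np.full(K, 1.0 / K)                     # P(Y_j = k), kept uniform here

for _ in range(20):
    # E-step: responsibilities P(Y_j = k | x_j), normalized over k
    resp = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma ** 2) * prior
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted means
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)                                       # close to the true means (-2, 3)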
What if we do hard assignments?

Iterate: On the t'th iteration let our estimates be λ_t = {μ_1(t), μ_2(t), ..., μ_K(t)}

E-step: Compute "expected" classes of all data points

    P(Y_j = k | x_j, μ_1...μ_K) ∝ exp( −(1/(2σ²)) (x_j − μ_k)² ) P(Y_j = k)

M-step: Compute most likely new μs given class expectations, replacing the soft weights P(Y_j = k | x_j) with hard assignments:

    μ_k = Σ_{j=1}^m δ(Y_j = k, x_j) x_j  /  Σ_{j=1}^m δ(Y_j = k, x_j)

δ represents hard assignment to the "most likely" or nearest cluster

Equivalent to the k-means clustering algorithm!!!
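A minimal sketch of the hard-assignment variant (k-means) on the same kind of made-up 1D data:

import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])
mu = np.array([-1.0, 1.0])                       # initial means, invented

for _ in range(10):
    # E-step (hard): delta picks the nearest mean for every point
    assign = np.argmin((x[:, None] - mu[None, :]) ** 2, axis=1)
    # M-step: mean of the points assigned to each cluster
    mu = np.array([x[assign == k].mean() for k in range(len(mu))])

print(mu)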
E.M. for General GMMs

Iterate: On the t'th iteration let our estimates be

    λ_t = {μ_1(t), μ_2(t), ..., μ_K(t), Σ_1(t), Σ_2(t), ..., Σ_K(t), p_1(t), p_2(t), ..., p_K(t)}

E-step: Compute "expected" classes of all data points for each class

    P(Y_j = k | x_j; λ_t) ∝ p_k(t) p(x_j; μ_k(t), Σ_k(t))

  p_k(t) is shorthand for the estimate of P(y = k) on the t'th iteration; p(x_j; μ_k(t), Σ_k(t)) evaluates the probability of a multivariate Gaussian at x_j

M-step: Compute weighted MLE for μ given expected classes above

    μ_k(t+1) = Σ_j P(Y_j = k | x_j; λ_t) x_j  /  Σ_j P(Y_j = k | x_j; λ_t)

    Σ_k(t+1) = Σ_j P(Y_j = k | x_j; λ_t) [x_j − μ_k(t+1)] [x_j − μ_k(t+1)]^T  /  Σ_j P(Y_j = k | x_j; λ_t)

    p_k(t+1) = Σ_j P(Y_j = k | x_j; λ_t) / m        (m = # training examples)
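A minimal sketch of all three M-step updates on made-up 2D data; it also tracks the log-likelihood, which EM never decreases (see the next slides):

import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 100),
               rng.multivariate_normal([4, 4], [[1.0, 0.5], [0.5, 1.0]], 100)])
n, d = X.shape
K = 2
mu = X[rng.choice(n, K, replace=False)]           # initial means: random data points
cov = np.array([np.eye(d)] * K)
p = np.full(K, 1.0 / K)

def gaussian(X, mu, cov):
    """Multivariate normal density evaluated at every row of X."""
    diff = X - mu
    sol = np.linalg.solve(cov, diff.T).T
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * np.sum(diff * sol, axis=1)) / norm

prev_ll = -np.inf
for _ in range(50):
    # E-step: responsibilities P(Y_j = k | x_j; lambda_t)
    dens = np.column_stack([p[k] * gaussian(X, mu[k], cov[k]) for k in range(K)])
    ll = np.log(dens.sum(axis=1)).sum()
    assert ll >= prev_ll - 1e-6                   # each iteration improves the likelihood
    prev_ll = ll
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted MLE for mu, Sigma, and the mixing weights p
    nk = resp.sum(axis=0)
    mu = (resp.T @ X) / nk[:, None]
    for k in range(K):
        diff = X - mu[k]
        cov[k] = (resp[:, k][:, None] * diff).T @ diff / nk[k]
    p = nk / n

print(mu, p, sep="\n")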
The general learning problem with missing data

• Marginal likelihood: X is observed, Z (e.g. the class labels Y) is missing:
  – l(θ : Data) = log ∏_j P(x_j; θ) = Σ_j log Σ_z P(x_j, z; θ)
• Objective: Find argmax_θ l(θ : Data)
• Assuming hidden variables are missing completely at random (otherwise, we should explicitly model why the values are missing)
Properties of EM

• One can prove that:
  – EM converges to a local maximum
  – Each iteration improves the log-likelihood
• How? (Same as k-means)
  – Likelihood objective instead of k-means objective
  – M-step can never decrease likelihood
EM pictorially

Derivation of EM algorithm

[Figure: the likelihood L(θ) and the lower bound l(θ|θ_n) plotted against θ, marking θ_n, θ_{n+1}, L(θ_n) = l(θ_n|θ_n), l(θ_{n+1}|θ_n), and L(θ_{n+1})]
Figure 2: Graphical interpretation of a single iteration of the EM algorithm: The function l(θ|θ_n) is bounded above by the likelihood function L(θ). The functions are equal at θ = θ_n. The EM algorithm chooses θ_{n+1} as the value of θ for which l(θ|θ_n) is a maximum. Since L(θ) ≥ l(θ|θ_n), increasing l(θ|θ_n) ensures that the value of the likelihood function L(θ) is increased at each step.
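For reference, the lower bound used here (introduced earlier in Borman's tutorial via Jensen's inequality) is

    l(θ|θ_n) ≜ L(θ_n) + Δ(θ|θ_n) = L(θ_n) + Σ_z P(z|X, θ_n) ln [ P(X|z, θ) P(z|θ) / ( P(z|X, θ_n) P(X|θ_n) ) ]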
We have now a function, l(θ|θ_n), which is bounded above by the likelihood function L(θ). Additionally, observe that,
    l(θ_n|θ_n) = L(θ_n) + Δ(θ_n|θ_n)
               = L(θ_n) + Σ_z P(z|X, θ_n) ln [ P(X|z, θ_n) P(z|θ_n) / ( P(z|X, θ_n) P(X|θ_n) ) ]
               = L(θ_n) + Σ_z P(z|X, θ_n) ln [ P(X, z|θ_n) / P(X, z|θ_n) ]
               = L(θ_n) + Σ_z P(z|X, θ_n) ln 1
               = L(θ_n),                                                    (16)
so for θ = θ_n the functions l(θ|θ_n) and L(θ) are equal.

Our objective is to choose a value of θ so that L(θ) is maximized. We have shown that the function l(θ|θ_n) is bounded above by the likelihood function L(θ) and that the values of the functions l(θ|θ_n) and L(θ) are equal at the current estimate for θ = θ_n. Therefore, any θ which increases l(θ|θ_n) will also increase L(θ). In order to achieve the greatest possible increase in the value of L(θ), the EM algorithm calls for selecting θ such that l(θ|θ_n) is maximized. We denote this updated value as θ_{n+1}. This process is illustrated in Figure (2).
(Figure from tutorial by Sean Borman)
[Figure: the likelihood objective and the lower bound at iteration n]
What you should know

• Mixture of Gaussians
• EM for mixture of Gaussians:
  – How to learn maximum likelihood parameters in the case of unlabeled data
  – Relation to K-means
    • Two-step algorithm, just like K-means
    • Hard / soft clustering
    • Probabilistic model
• Remember, EM can get stuck in local minima
  – And empirically it DOES