Page 1
Bayesian Clustering with the Dirichlet Process: Issues with priors and interpreting MCMC
Shane T. Jensen
Department of Statistics
The Wharton School, University of Pennsylvania
[email protected]
Collaborative work with J. Liu, L. Dicker, and G. Tuteja
Shane T. Jensen, May 13, 2006
Page 2
Introduction
• Bayesian non-parametric or semi-parametric models are very useful in many applications
• Non-parametric: random variables are realizations from an unspecified probability distribution, e.g.,
$X_i \sim F(\cdot), \quad i = 1, \dots, n$
• The $X_i$'s can be observed data, latent variables, or unknown parameters (often in a hierarchical setting)
• Prior distributions for $F(\cdot)$ play an important role in non-parametric modeling
Page 3
Dirichlet Process Priors
• A commonly used prior distribution for an unknown probability distribution is the Dirichlet process
$F(\cdot) \sim DP(\theta, F_0)$
• $F_0$ is a probability measure
– can represent prior belief in the form of $F$
• $\theta$ is a weight parameter
– can represent degree of belief in the prior form $F_0$
• Ferguson (1973, 1974); Antoniak (1974); many others
• An important consequence of the Dirichlet process is that it induces a discretized posterior distribution
Page 4
Consequence of DP priors
• Ferguson, 1974: using a Dirichlet process $DP(\theta, F_0)$ prior for $F(\cdot)$ results in a posterior that mixes $F_0$ with point masses at the observations $X_i$:
$F(\cdot) \mid X_1, \dots, X_n \sim DP\left(\theta + n,\ \frac{\theta F_0 + \sum_{i=1}^{n} \delta(X_i)}{\theta + n}\right)$
• For density estimation, discreteness may be a problem: convolutions with kernel functions can be used to produce a continuous density estimate
• In other applications, discreteness is not a disadvantage!
Page 5
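Clustering with a DP prior
• The point mass component of the posterior leads to a random partition of our variables
• Consider a new variable $X_{n+1}$ and let $X^*_1, \dots, X^*_C$ be the unique values of $X_{1:n} = (X_1, \dots, X_n)$. Then,
$P(X_{n+1} = X^*_c \mid X_{1:n}) = \frac{N_c}{\theta + n}, \quad c = 1, \dots, C$
$P(X_{n+1} = \text{new} \mid X_{1:n}) = \frac{\theta}{\theta + n}$
• $N_c$ = size of cluster $c$: the number of values in $X_{1:n}$ that equal $X^*_c$
“Rich get richer”: will return to this...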
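The predictive rule above can be evaluated directly from the current cluster sizes. The following is a minimal sketch, not code from the talk; the function name `dp_predictive` and its arguments are hypothetical.

```python
def dp_predictive(cluster_sizes, theta):
    """Predictive probabilities for a new observation under a DP prior.

    cluster_sizes: list of N_c, the sizes of the C existing clusters.
    theta: DP weight parameter (assumed > 0).
    Returns (join, new): join[c] = N_c / (theta + n), new = theta / (theta + n).
    """
    n = sum(cluster_sizes)
    join = [size / (theta + n) for size in cluster_sizes]
    new = theta / (theta + n)
    return join, new


# Two clusters of sizes 3 and 1: the larger cluster is three times as likely
# to receive the new observation ("rich get richer").
join, new = dp_predictive([3, 1], theta=1.0)
print(join, new)
```

The probabilities of joining each existing cluster and of opening a new one always sum to one, since the cluster sizes add up to n.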
Page 6
Motivating Application: TF motifs
• Genes are regulated by transcription factor (TF) proteins that bind to the DNA sequence near the gene
• TF proteins can selectively control only certain target genes by only binding to the “same” sequence, called a motif
• The motif sites are highly conserved but not identical, so we use a matrix description of the motif appearance
Frequency Matrix - $X_i$ (rows: nucleotides; columns: motif positions)
     1     2     3     4     5     6
A  0.05  0.02  0.85  0.02  0.21  0.06
C  0.04  0.02  0.03  0.93  0.05  0.06
G  0.06  0.94  0.06  0.04  0.70  0.11
T  0.85  0.02  0.06  0.01  0.04  0.77
[Figure: sequence logo of the motif]
Page 7
Collections of TF motifs
• Large databases contain motif information on many TFs but with a large amount of redundancy
– TRANSFAC and JASPAR are the largest (100's in each)
• Want to cluster motifs together to either reduce redundancy in databases or match new motifs to a database
• Nucleotide conservation varies both within a single motif (between positions) and between different motifs
[Figure: sequence logos for Tal1beta-E47S and AGL3]
Page 8
Motif Clustering with DP prior
• Hierarchical model with levels for both within-unit and between-unit variability in discovered motifs
– Observed count matrix $Y_i$ is a product multinomial realization of frequency matrix $X_i$
– Unknown $X_i$'s share unknown distribution $F(\cdot)$
• Dirichlet process $DP(\theta, F_0)$ prior for $F(\cdot)$ leads to a posterior mixture of $F_0$ and point masses at each $X_i$
• Our prior measure $F_0$ in this application is a product Dirichlet distribution
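To make the two levels of this hierarchy concrete, the sketch below draws a frequency matrix from a product Dirichlet and then a count matrix from the corresponding product multinomial. It is an illustration under assumptions not stated in the slides: a symmetric Dirichlet parameter `alpha`, a fixed number of binding sites `n_sites` per motif, and the hypothetical function names shown.

```python
import random


def sample_frequency_matrix(width, alpha=1.0, rng=random):
    """Draw X ~ product Dirichlet: one Dirichlet(alpha, ..., alpha) per position.

    Each column is a probability vector over the nucleotides (A, C, G, T).
    The symmetric parameter alpha is an assumption made for this sketch.
    """
    columns = []
    for _ in range(width):
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(4)]
        total = sum(gammas)
        columns.append([g / total for g in gammas])
    return columns


def sample_count_matrix(freq, n_sites, rng=random):
    """Draw Y | X ~ product multinomial: independent multinomial per position."""
    counts = []
    for column in freq:
        draws = rng.choices(range(4), weights=column, k=n_sites)
        counts.append([draws.count(k) for k in range(4)])
    return counts


rng = random.Random(0)
X = sample_frequency_matrix(width=6, rng=rng)
Y = sample_count_matrix(X, n_sites=20, rng=rng)
print(Y)
```

Each position is treated independently, which is what "product" refers to in both the prior measure and the likelihood.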
Page 9
Benefits and Issues with DP prior
• Allows an unknown number of clusters without the need to model the number of clusters directly
– No real prior knowledge about the number of clusters in our application
• However, with the DP there are implicit assumptions about the number of clusters (and their size distribution)
• “Rich get richer” property influences the prior predictive number of clusters and cluster size distribution
– How influential is this property in an application?
Page 10
Benefits and Issues with MCMC
• DP-based model is easy to implement via Gibbs sampling
– $p(X_i \mid X_{-i})$ has the same choice structure as $p(X_{n+1} \mid X_{1:n})$
– $X_i$ is either sampled into one of the current clusters defined by $X_{-i}$ or sampled from $F_0$ to form a new cluster
• Alternative is a direct model on the number of clusters and then use something like Reversible Jump MCMC
• Mixing can be an issue with the Gibbs sampler
– collapsed Gibbs sampler: integrate out $X_i$ and deal directly with clustering indicators
– split/merge moves to speed up mixing: lots of great work by R. Neal, D. Dahl and others
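One sweep of a collapsed Gibbs sampler over the clustering indicators can be sketched as follows. This is an illustrative sketch and not the implementation behind the talk: it assumes all count matrices have the same width, a symmetric product Dirichlet base measure with hypothetical parameter `a`, and it recomputes cluster totals from scratch for clarity rather than speed. The names `log_pred` and `gibbs_sweep` are made up for this example.

```python
import math
import random


def log_pred(y, s, a=1.0):
    """Log Dirichlet-multinomial predictive of count matrix y given the summed
    counts s of a cluster, with X integrated out. Multinomial coefficients are
    dropped because they are identical for every candidate cluster."""
    total = 0.0
    for y_col, s_col in zip(y, s):
        post = [a + sk for sk in s_col]
        total += math.lgamma(sum(post)) - math.lgamma(sum(post) + sum(y_col))
        total += sum(math.lgamma(p + yk) - math.lgamma(p)
                     for p, yk in zip(post, y_col))
    return total


def gibbs_sweep(Y, z, theta, a=1.0, rng=random):
    """Resample each indicator z[i] given all others (updates z in place)."""
    width, depth = len(Y[0]), len(Y[0][0])
    for i, y in enumerate(Y):
        z[i] = None  # remove observation i from its cluster
        labels = sorted({c for c in z if c is not None})
        options, log_w = [], []
        for c in labels:
            members = [Y[j] for j in range(len(Y)) if z[j] == c]
            s = [[sum(m[p][k] for m in members) for k in range(depth)]
                 for p in range(width)]
            # existing cluster: weight proportional to N_c * predictive
            options.append(c)
            log_w.append(math.log(len(members)) + log_pred(y, s, a))
        # new cluster: weight proportional to theta * marginal under F0
        empty = [[0] * depth for _ in range(width)]
        options.append(max(labels, default=-1) + 1)
        log_w.append(math.log(theta) + log_pred(y, empty, a))
        top = max(log_w)
        weights = [math.exp(w - top) for w in log_w]
        z[i] = rng.choices(options, weights=weights)[0]
    return z


# Toy data: four one-position motifs, two A-rich and two G-rich.
Y = [[[9, 0, 0, 1]], [[8, 1, 0, 1]], [[0, 0, 10, 0]], [[1, 0, 9, 0]]]
z = [0, 0, 0, 0]
rng = random.Random(1)
for _ in range(20):
    gibbs_sweep(Y, z, theta=1.0, rng=rng)
print(z)
```

The common denominator θ + n − 1 of the prior weights cancels when the weights are normalized, so it is omitted. Cluster labels produced this way are arbitrary integers, which is part of why summaries of the output need to be label-invariant.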
Page 11
Main Issue 1: Posterior Inference from MCMC
• However, posterior inference based on Gibbs sampling output also has issues
• Need to infer a set of clusters from sampled partitions, but we have a label switching problem (Stephens, 1999)
• cluster labels are exchangeable for a particular partition
• usual summaries such as the posterior mean can be misleading mixtures over these exchangeable labelings
• need summaries that are uninfluenced by labeling
Page 12
Posterior Inference Options
• Option 1: clusters defined by last partition visited
– sampled partition produced at end of Gibbs chain
– surprisingly popular, e.g. Latent Dirichlet Allocation models
• Option 2: clusters defined by MAP partition
– sampled partition with highest posterior density
– simple and popular
• Option 3: clusters defined by threshold on pairwise posterior probabilities $P_{ij}$
– frequency of iterations with motifs $i$ and $j$ in the same cluster
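Option 3 can be sketched in a few lines. The pairwise probabilities depend only on whether two items share a label, so they are unaffected by label switching. The slides do not say how clusters are formed from the thresholded probabilities; the sketch below assumes connected components of the graph with edges where $P_{ij}$ exceeds the threshold, and the function names are hypothetical.

```python
def pairwise_probs(partitions):
    """P[i][j] = fraction of sampled partitions with items i and j together.

    partitions: list of label vectors, one per MCMC iteration.
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for z in partitions:
        for i in range(n):
            for j in range(n):
                if z[i] == z[j]:
                    counts[i][j] += 1
    total = len(partitions)
    return [[c / total for c in row] for row in counts]


def threshold_clusters(P, threshold=0.5):
    """Assumed rule: connected components of the graph {P[i][j] > threshold}."""
    n = len(P)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and P[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# The first two samples are the same partition with swapped labels.
samples = [[0, 0, 1], [1, 1, 0], [0, 1, 1]]
P = pairwise_probs(samples)
print(threshold_clusters(P, threshold=0.5))
```

Averaging the raw labels of these three samples would be meaningless, whereas the pairwise summary treats the first two samples as identical.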
Page 13
Main Issue 2: Implicit DP Assumptions
• DP has an implicit “rich get richer” property: easy to see from the predictive distribution:
$P(X_{n+1} \text{ joins cluster } c) = \frac{N_c}{\theta + n}, \quad c = 1, \dots, C$
$P(X_{n+1} \text{ forms new cluster}) = \frac{\theta}{\theta + n}$
• Chinese restaurant process: new customer chooses a table
– sits at a current table with probability $\propto N_c$, the number of customers already sitting there
– sits at an entirely new table with probability $\propto \theta$
Page 14
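Alternative Priors for Clustering
• Uniform Prior: socialism, no one gets rich
$P(X_{n+1} \text{ joins cluster } c) = \frac{1}{\theta + C}, \quad c = 1, \dots, C$
$P(X_{n+1} \text{ forms new cluster}) = \frac{\theta}{\theta + C}$
• Pitman-Yor Prior: rich get richer, but charitable
$P(X_{n+1} \text{ joins cluster } c) = \frac{N_c - \alpha}{\theta + n}, \quad c = 1, \dots, C$
$P(X_{n+1} \text{ forms new cluster}) = \frac{\theta + C \cdot \alpha}{\theta + n}$
• $0 \le \alpha < 1$ is often called the “discount factor”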
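The three predictive rules can be compared by simulating partitions sequentially, one observation at a time. The sketch below is an illustration with hypothetical function names; it assumes θ > 0 and, for Pitman-Yor, a new-cluster weight of (θ + Cα)/(θ + n) so that the probabilities sum to one.

```python
import random


def predictive_weights(sizes, theta, prior="dp", alpha=0.0):
    """Probabilities of joining each existing cluster, then of a new cluster."""
    n, C = sum(sizes), len(sizes)
    if prior == "dp":
        join = [s / (theta + n) for s in sizes]
        new = theta / (theta + n)
    elif prior == "uniform":
        join = [1.0 / (theta + C)] * C
        new = theta / (theta + C)
    elif prior == "py":
        join = [(s - alpha) / (theta + n) for s in sizes]
        new = (theta + C * alpha) / (theta + n)
    else:
        raise ValueError("unknown prior: " + prior)
    return join + [new]


def simulate_partition(n, theta, prior="dp", alpha=0.0, rng=random):
    """Seat n observations one by one and return the final cluster sizes."""
    sizes = []
    for _ in range(n):
        w = predictive_weights(sizes, theta, prior, alpha)
        c = rng.choices(range(len(w)), weights=w)[0]
        if c == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[c] += 1     # join an existing cluster
    return sizes


rng = random.Random(0)
for prior, alpha in [("dp", 0.0), ("uniform", 0.0), ("py", 0.5)]:
    sizes = simulate_partition(500, theta=1.0, prior=prior, alpha=alpha, rng=rng)
    print(prior, len(sizes), max(sizes))
```

Setting α = 0 in the Pitman-Yor rule recovers the DP rule, which is a quick consistency check on the formulas.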
Page 15
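Asymptotic Comparison of Priors
• Number of clusters $C_n$ is clearly a function of sample size $n$
• How does $C_n$ grow as $n \to \infty$?
DP Prior: $E(C_n) \approx \theta \cdot \log(n)$
Pitman-Yor Prior: $E(C_n) \approx K(\theta, \alpha) \cdot n^{\alpha}$
Uniform Prior: $E(C_n) \approx K(\theta) \cdot n^{1/2}$
• DP prior shows the slowest growth in the number of clusters $C_n$
• Interestingly, Pitman-Yor can lead to either faster or slower growth vs. Uniform, depending on $\alpha$
• Also working on results for the distribution of cluster sizes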
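For the DP prior the expectation can be computed exactly at any finite n, because observation i + 1 opens a new cluster with probability θ/(θ + i), so E(C_n) is the sum of these probabilities for i = 0, ..., n − 1. The sketch below (hypothetical function name) compares that exact sum with the θ·log(n) approximation.

```python
import math


def expected_clusters_dp(n, theta):
    """Exact E(C_n) under a DP prior: sum of new-cluster probabilities."""
    return sum(theta / (theta + i) for i in range(n))


for n in (100, 10_000, 1_000_000):
    exact = expected_clusters_dp(n, theta=1.0)
    approx = 1.0 * math.log(n)
    print(n, round(exact, 3), round(approx, 3))
```

The approximation captures the growth rate; at finite n the exact value differs from it by a term that does not grow with n.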
Page 16
Finite Sample Comparison of Priors
• Y = $C_n$ vs. X = $n$ for different values of $\theta$
[Figure: three panels ($\theta = 1$, $\theta = 10$, $\theta = 100$) plotting the expected number of clusters against n = number of observations (n from 1e+02 to 5e+04, log-scaled axes) for the DP, uniform (UN), and Pitman-Yor (PY; α = 0.25, 0.5, 0.75) priors]
Page 17
Simulation Study of Motif Clustering
• Evaluation of different priors and modes of inference in the context of the motif clustering application
• Simulated realistic collections of motifs (known partitions)
• Different simulation conditions to vary clustering difficulty:
– high to low within-cluster similarity
– high to low between-cluster similarity
• Success measured by Jaccard similarity between true partition $z$ and inferred partition $\hat{z}$
$J(z, \hat{z}) = \frac{TP}{TP + FP + FN}$
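The Jaccard similarity between two partitions is usually computed over pairs of items, and the sketch below assumes that reading: TP counts pairs clustered together in both partitions, FP pairs together only in the inferred partition, and FN pairs together only in the true partition. The function name is hypothetical.

```python
from itertools import combinations


def jaccard(z_true, z_hat):
    """Pair-counting Jaccard index between two label vectors."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(z_true)), 2):
        together_true = z_true[i] == z_true[j]
        together_hat = z_hat[i] == z_hat[j]
        if together_true and together_hat:
            tp += 1
        elif together_hat:
            fp += 1
        elif together_true:
            fn += 1
    denom = tp + fp + fn
    # Convention assumed here: two all-singleton partitions agree perfectly.
    return tp / denom if denom else 1.0


print(jaccard([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels swapped
print(jaccard([0, 0, 1, 1], [0, 0, 0, 1]))
```

Because only co-membership is compared, the index is unaffected by how the clusters are labeled.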
Page 18
Simulation Comparison of Inference Alternatives
[Figure: Jaccard index (y-axis) against increasing clustering difficulty (x-axis) for the MAP partition and for pairwise probability thresholds Prob > 0.5 and Prob > 0.25]
• MAP partition consistently inferior to pairwise probabilities
• Posterior probabilities incorporate uncertainty across iterations
Page 19
Simulation Comparison of Prior Alternatives
[Figure: Jaccard index (y-axis) against increasing clustering difficulty (x-axis) for the Uniform, PY 0.25, PY 0.5, PY 0.75, and DP priors]
• Not much difference in general between priors
• Uniform does a little worse in most situations
Page 20
Real Data Results: Clustering JASPAR database
• Tree based on pairwise posterior probabilities:
[Figure: dendrogram of JASPAR motifs with height 1 − Prob(Clustering) on a 0.0 to 1.0 scale; leaves are labeled by species, TF family, and JASPAR ID, e.g. Homo.sapiens−NUCLEAR−MA0065]
• Post-processed MAP partition to remove weak relationships; the result is then very similar to thresholded posterior probabilities
Page 21
Comparing Priors: Clustering JASPAR database
[Figure: four histograms (y-axis: frequency) showing the number of clusters and the average cluster size under the uniform (Unif) and DP priors]
• Very little difference between using DP and uniform prior
• Likelihood is dominating any prior assumption on the partition
Page 22
Summary
• Non-parametric Bayesian approaches based on the Dirichlet process can be very useful for clustering applications
• Issues with MCMC inference: popular MAP partitions seem inferior to partitions based on posterior probabilities
• Issues with implicit DP assumptions: alternative priors give quite different prior partitions
• Posterior differences between priors are small in our motif application, but can be larger in other applications
• Jensen and Liu, JASA (forthcoming) plus other manuscripts soon available on my website
http://stat.wharton.upenn.edu/~stjensen