Shane T. Jensen Issues with priors and interpreting MCMC ...junliu/Workshops/... · Dirichlet Process Priors • A commonly-used prior distribution for an unknown probability distribution

Bayesian Clustering with the Dirichlet Process: Issues with priors and interpreting MCMC

Shane T. Jensen

Department of Statistics

The Wharton School, University of Pennsylvania

[email protected]

Collaborative work with J. Liu, L. Dicker, and G. Tuteja

Shane T. Jensen, May 13, 2006

Introduction

• Bayesian non-parametric or semi-parametric models are very useful in many applications

• Non-parametric: random variables are realizations from an unspecified probability distribution, e.g.,

$$X_i \sim F(\cdot), \quad i = 1, \ldots, n$$

• The $X_i$'s can be observed data, latent variables, or unknown parameters (often in a hierarchical setting)

• Prior distributions for $F(\cdot)$ play an important role in non-parametric modeling


Dirichlet Process Priors

• A commonly-used prior distribution for an unknown probability distribution is the Dirichlet process

$$F(\cdot) \sim DP(\theta, F_0)$$

• $F_0$ is a probability measure

– can represent prior belief in the form of $F$

• $\theta$ is a weight parameter

– can represent degree of belief in the prior form $F_0$

• Ferguson (1973, 1974); Antoniak (1974); many others

• An important consequence of the Dirichlet process is that it induces a discretized posterior distribution


Consequence of DP priors

• Ferguson, 1974: using a Dirichlet process $DP(\theta, F_0)$ prior for $F(\cdot)$ results in a posterior mixture of $F_0$ and point masses at the observations $X_i$:

$$F(\cdot) \mid X_1, \ldots, X_n \sim DP\left(\theta + n, \; \frac{\theta F_0 + \sum_{i=1}^{n} \delta(X_i)}{\theta + n}\right)$$

• For density estimation, discreteness may be a problem: convolutions with kernel functions can be used to produce a continuous density estimate

• In other applications, discreteness is not a disadvantage!
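The mixture structure is perhaps easiest to see in the posterior mean of $F$. The slides do not write it out, so the expression below is a sketch of the standard conjugacy result rather than a formula from the talk:

```latex
E\left[ F(\cdot) \mid X_1, \ldots, X_n \right]
  = \frac{\theta}{\theta + n} \, F_0(\cdot)
  + \frac{1}{\theta + n} \sum_{i=1}^{n} \delta(X_i)
```

The weight $\theta / (\theta + n)$ on $F_0$ shrinks as $n$ grows, which is one way to read $\theta$ as the degree of belief in the prior form $F_0$.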


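Clustering with a DP prior

• The point mass component of the posterior leads to a random partition of our variables

• Consider a new variable $X_{n+1}$ and let $X^*_1, \ldots, X^*_C$ be the unique values of $X_{1:n} = (X_1, \ldots, X_n)$. Then,

$$P(X_{n+1} = X^*_c \mid X_{1:n}) = \frac{N_c}{\theta + n}, \quad c = 1, \ldots, C$$

$$P(X_{n+1} = \text{new} \mid X_{1:n}) = \frac{\theta}{\theta + n}$$

• $N_c$ = size of cluster $c$: the number of values in $X_{1:n}$ that equal $X^*_c$

“Rich get richer”: will return to this...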
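The predictive rule on this slide can be applied one variable at a time to draw a random partition from the prior. The following is a minimal sketch under that reading; the function name `sample_crp_partition` and the `seed` argument are illustrative choices, not part of the talk.

```python
import random

def sample_crp_partition(n, theta, seed=0):
    """Draw a partition of n items by applying the DP predictive rule sequentially."""
    rng = random.Random(seed)
    labels = []  # labels[i] = cluster index assigned to item i
    sizes = []   # sizes[c] = N_c, the current size of cluster c
    for _ in range(n):
        # Existing cluster c has weight N_c; a new cluster has weight theta.
        # random.choices normalizes the weights, giving denominators theta + (items so far).
        weights = sizes + [theta]
        c = rng.choices(range(len(weights)), weights=weights)[0]
        if c == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[c] += 1     # join an existing cluster
        labels.append(c)
    return labels, sizes

labels, sizes = sample_crp_partition(n=50, theta=1.0)
print(len(sizes), "clusters with sizes", sorted(sizes, reverse=True))
```

Because larger clusters carry larger weights, repeated draws tend to show a few big clusters and many small ones, which is the "rich get richer" behavior the slide refers to.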


Motivating Application: TF motifs

• Genes are regulated by transcription factor (TF) proteins that bind to the DNA sequence near the gene

• TF proteins can selectively control only certain target genes by binding only to the “same” sequence, called a motif

• The motif sites are highly conserved but not identical, so we use a matrix description of the motif appearance

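Frequency Matrix $X_i$ (rows: nucleotides; columns: motif positions)

Pos     1     2     3     4     5     6
A    0.05  0.02  0.85  0.02  0.21  0.06
C    0.04  0.02  0.03  0.93  0.05  0.06
G    0.06  0.94  0.06  0.04  0.70  0.11
T    0.85  0.02  0.06  0.01  0.04  0.77

[Figure: sequence logo of the motif]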
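To make the matrix description concrete, the sketch below reads off the most likely nucleotide at each position and computes a per-position information content, the quantity a sequence logo usually displays. The uniform background and the absence of any small-sample correction are assumptions made here; the slide does not say how its logo was drawn.

```python
import math

# Frequency matrix from the slide: one list per nucleotide, one entry per position.
FREQ = {
    "A": [0.05, 0.02, 0.85, 0.02, 0.21, 0.06],
    "C": [0.04, 0.02, 0.03, 0.93, 0.05, 0.06],
    "G": [0.06, 0.94, 0.06, 0.04, 0.70, 0.11],
    "T": [0.85, 0.02, 0.06, 0.01, 0.04, 0.77],
}

def consensus(freq):
    """Most probable nucleotide at each position."""
    width = len(freq["A"])
    return "".join(max(freq, key=lambda base: freq[base][j]) for j in range(width))

def information_content(freq):
    """Bits per position: 2 minus the Shannon entropy (uniform background assumed)."""
    width = len(freq["A"])
    bits = []
    for j in range(width):
        entropy = -sum(freq[b][j] * math.log2(freq[b][j]) for b in freq if freq[b][j] > 0)
        bits.append(2.0 - entropy)
    return bits

print(consensus(FREQ))  # TGACGT
print([round(b, 2) for b in information_content(FREQ)])
```

Positions with one dominant nucleotide score close to 2 bits, while less conserved positions score lower, which is the within-motif variation in conservation that the clustering model has to accommodate.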


Collections of TF motifs

• Large databases contain motif information on many TFs but with a large amount of redundancy

– TRANSFAC and JASPAR are the largest (hundreds of motifs in each)

• Want to cluster motifs together to either reduce redundancy in databases or match new motifs to a database

• Nucleotide conservation varies both within a single motif (between positions) and between different motifs

[Figure: sequence logos for the motifs Tal1beta-E47S and AGL3]


Motif Clustering with DP prior

• Hierarchical model with levels for both within-unit and between-unit variability in discovered motifs

– Observed count matrix $Y_i$ is a product multinomial realization of frequency matrix $X_i$

– Unknown $X_i$'s share unknown distribution $F(\cdot)$

• Dirichlet process $DP(\theta, F_0)$ prior for $F(\cdot)$ leads to a posterior mixture of $F_0$ and point masses at each $X_i$

• Our prior measure $F_0$ in this application is a product Dirichlet distribution
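With a product multinomial likelihood and a product Dirichlet $F_0$, the frequency matrix can be integrated out in closed form, one Dirichlet-multinomial term per motif position. The sketch below computes that log marginal. The prior parameters `beta` (set to 1 for every nucleotide), the toy count matrices, and the function names are assumptions for illustration; the slides do not give the values used in the talk.

```python
import math

def log_marginal_column(counts, beta):
    """Log Dirichlet-multinomial marginal for one position (counts for A, C, G, T).

    The multinomial coefficient is dropped: it depends only on the data,
    not on how motifs are grouped, so it cancels when partitions are compared.
    """
    out = math.lgamma(sum(beta)) - math.lgamma(sum(beta) + sum(counts))
    for y, b in zip(counts, beta):
        out += math.lgamma(b + y) - math.lgamma(b)
    return out

def log_marginal_motif(count_matrix, beta=(1.0, 1.0, 1.0, 1.0)):
    """Sum over positions, reflecting the product structure of the model."""
    return sum(log_marginal_column(col, beta) for col in count_matrix)

# Two hypothetical two-position motifs, stored as lists of [A, C, G, T] counts.
y1 = [[9, 0, 1, 0], [0, 10, 0, 0]]
y2 = [[8, 1, 1, 0], [1, 9, 0, 0]]
pooled = [[a + b for a, b in zip(c1, c2)] for c1, c2 in zip(y1, y2)]

# Motifs that share one frequency matrix contribute their pooled counts;
# motifs in separate clusters contribute separate marginals.
print("same cluster:     ", log_marginal_motif(pooled))
print("separate clusters:", log_marginal_motif(y1) + log_marginal_motif(y2))
```

Comparing the two printed values is the basic evidence calculation behind placing two motifs in the same cluster or in different ones.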


Benefits and Issues with DP prior

• Allows an unknown number of clusters without the need to model the number of clusters directly

– No real prior knowledge about the number of clusters in our application

• However, with the DP there are implicit assumptions about the number of clusters (and their size distribution)

• “Rich get richer” property influences the prior predictive number of clusters and cluster size distribution

– How influential is this property in an application?


Benefits and Issues with MCMC

• DP-based model is easy to implement via Gibbs sampling

– $p(X_i \mid X_{-i})$ has the same choice structure as $p(X_{n+1} \mid X_{1:n})$

– $X_i$ is either sampled into one of the current clusters defined by $X_{-i}$ or sampled from $F_0$ to form a new cluster

• Alternative is a direct model on the number of clusters, and then use something like Reversible Jump MCMC

• Mixing can be an issue with the Gibbs sampler

– collapsed Gibbs sampler: integrate out $X_i$ and deal directly with clustering indicators

– split/merge moves to speed up mixing: lots of great work by R. Neal, D. Dahl and others
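A collapsed Gibbs sweep over the clustering indicators might look like the sketch below. It is a toy version under stated assumptions, not the sampler used in the talk: each item is a single-position count vector, $F_0$ is a Dirichlet with all parameters equal to 1, and the names `log_marginal`, `pooled`, and `gibbs_sweep` are illustrative.

```python
import math
import random

def log_marginal(counts, beta=(1.0, 1.0, 1.0, 1.0)):
    """Log Dirichlet-multinomial marginal of pooled counts (A, C, G, T),
    dropping multinomial coefficients, which cancel in the Gibbs weights."""
    out = math.lgamma(sum(beta)) - math.lgamma(sum(beta) + sum(counts))
    for y, b in zip(counts, beta):
        out += math.lgamma(b + y) - math.lgamma(b)
    return out

def pooled(data, members):
    """Add up the count vectors of the listed items."""
    total = [0, 0, 0, 0]
    for j in members:
        total = [t + y for t, y in zip(total, data[j])]
    return total

def gibbs_sweep(labels, data, theta, rng):
    """One sweep: resample each cluster indicator given all the others."""
    for i in range(len(labels)):
        # Group every item except i by its current label.
        clusters = {}
        for j, c in enumerate(labels):
            if j != i:
                clusters.setdefault(c, []).append(j)
        keys, log_w = [], []
        for c, members in clusters.items():
            base = pooled(data, members)
            with_i = [b + y for b, y in zip(base, data[i])]
            # Existing cluster: N_c times the predictive density of item i.
            keys.append(c)
            log_w.append(math.log(len(members)) + log_marginal(with_i) - log_marginal(base))
        # New cluster: theta times the marginal of item i under F_0.
        keys.append(max(labels) + 1)
        log_w.append(math.log(theta) + log_marginal(data[i]))
        top = max(log_w)  # subtract the maximum before exponentiating, for stability
        weights = [math.exp(w - top) for w in log_w]
        labels[i] = rng.choices(keys, weights=weights)[0]
    return labels

data = [[9, 0, 1, 0], [8, 1, 1, 0], [0, 10, 0, 0], [1, 9, 0, 0]]  # hypothetical counts
labels = [0, 0, 0, 0]
rng = random.Random(1)
for _ in range(20):
    labels = gibbs_sweep(labels, data, theta=1.0, rng=rng)
print(labels)
```

The label values themselves carry no meaning from one iteration to the next; only the grouping they induce does.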


Main Issue 1: Posterior Inference from MCMC

• However, posterior inference based on Gibbs sampling output also has issues

• Need to infer a set of clusters from sampled partitions, but we have a label switching problem (Stephens, 1999)

• cluster labels are exchangeable for a particular partition

• usual summaries such as the posterior mean can be misleading mixtures of these exchangeable labelings

• need summaries that are uninfluenced by labeling


Posterior Inference Options

• Option 1: clusters defined by last partition visited

– sampled partition produced at end of Gibbs chain

– surprisingly popular, e.g. Latent Dirichlet Allocation models

• Option 2: clusters defined by MAP partition

– sampled partition with highest posterior density

– simple and popular

• Option 3: clusters defined by threshold on pairwise posterior probabilities $P_{ij}$

– frequency of iterations with motifs $i$ and $j$ in the same cluster
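Option 3 can be sketched as follows: estimate each $P_{ij}$ as the fraction of sampled partitions in which $i$ and $j$ share a cluster, then link pairs whose probability exceeds a cutoff. Turning the linked pairs into clusters by taking connected components is an assumption made here for illustration; the slides do not specify that step, and the function names are hypothetical.

```python
def pairwise_probs(samples):
    """samples: list of label lists, one per retained Gibbs iteration."""
    n = len(samples[0])
    probs = [[0.0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    probs[i][j] += 1.0
    return [[v / len(samples) for v in row] for row in probs]

def threshold_clusters(probs, cutoff=0.5):
    """Link i and j when probs[i][j] > cutoff; return connected components as labels."""
    n = len(probs)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if probs[i][j] > cutoff:
                parent[find(i)] = find(j)
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]

# The first two samples are the same partition with the labels switched.
samples = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
P = pairwise_probs(samples)
print(threshold_clusters(P, cutoff=0.5))  # [0, 0, 1, 1]
```

Because $P_{ij}$ depends only on whether two items share a label, not on which label, this summary is unaffected by label switching.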


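Main Issue 2: Implicit DP Assumptions

• DP has an implicit “rich get richer” property: easy to see from the predictive distribution:

$$P(X_{n+1} \text{ joins cluster } c) = \frac{N_c}{\theta + n}, \quad c = 1, \ldots, C$$

$$P(X_{n+1} \text{ forms new cluster}) = \frac{\theta}{\theta + n}$$

• Chinese restaurant process: new customer chooses a table

– sits at current table $c$ with probability $\propto N_c$, the number of customers already sitting there

– sits at an entirely new table with probability $\propto \theta$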


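Alternative Priors for Clustering

• Uniform Prior: socialism, no one gets rich

$$P(X_{n+1} \text{ joins cluster } c) = \frac{1}{\theta + C}, \quad c = 1, \ldots, C$$

$$P(X_{n+1} \text{ forms new cluster}) = \frac{\theta}{\theta + C}$$

• Pitman-Yor Prior: rich get richer, but charitable

$$P(X_{n+1} \text{ joins cluster } c) = \frac{N_c - \alpha}{\theta + n}, \quad c = 1, \ldots, C$$

$$P(X_{n+1} \text{ forms new cluster}) = \frac{\theta + C \cdot \alpha}{\theta + n}$$

• $0 \le \alpha < 1$ is often called the “discount factor”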
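The three predictive rules can be put side by side in a few lines. This is a sketch with a hypothetical function name; for Pitman-Yor the new-cluster numerator is taken as $\theta + C \cdot \alpha$, the value for which the probabilities sum to one.

```python
def predictive_probs(sizes, theta, prior="dp", alpha=0.0):
    """Return (join, new): join[c] is the probability of joining cluster c,
    new is the probability of forming a new cluster."""
    n = sum(sizes)
    C = len(sizes)
    if prior == "dp":
        join = [N / (theta + n) for N in sizes]
        new = theta / (theta + n)
    elif prior == "uniform":
        join = [1.0 / (theta + C) for _ in sizes]
        new = theta / (theta + C)
    elif prior == "py":
        join = [(N - alpha) / (theta + n) for N in sizes]
        new = (theta + C * alpha) / (theta + n)
    else:
        raise ValueError("prior must be 'dp', 'uniform', or 'py'")
    return join, new

sizes = [5, 1]  # one large cluster and one singleton
for prior, alpha in [("dp", 0.0), ("uniform", 0.0), ("py", 0.5)]:
    join, new = predictive_probs(sizes, theta=1.0, prior=prior, alpha=alpha)
    print(prior, [round(p, 3) for p in join], round(new, 3))
```

With sizes (5, 1) and $\theta = 1$, the DP gives the large cluster five times the weight of the singleton, the uniform prior treats both clusters equally, and Pitman-Yor moves some mass from existing clusters to the new-cluster option.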


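Asymptotic Comparison of Priors

• Number of clusters $C_n$ is clearly a function of sample size $n$

• How does $C_n$ grow as $n \to \infty$?

$$\text{DP Prior:} \quad E(C_n) \approx \theta \cdot \log(n)$$

$$\text{Pitman-Yor Prior:} \quad E(C_n) \approx K(\theta, \alpha) \cdot n^{\alpha}$$

$$\text{Uniform Prior:} \quad E(C_n) \approx K(\theta) \cdot n^{1/2}$$

• DP prior shows slowest growth in number of clusters $C_n$

• Interestingly, Pitman-Yor can lead to either faster or slower growth vs. Uniform, depending on $\alpha$

• Also working on results for distribution of cluster sizes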
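For the DP prior the expected number of clusters is available exactly, since the $(i+1)$-th variable forms a new cluster with probability $\theta / (\theta + i)$. The sketch below sums these probabilities and prints them next to the $\theta \cdot \log(n)$ approximation; the function name and the grid of $n$ values are choices made here for illustration.

```python
import math

def expected_clusters_dp(n, theta):
    """Exact E(C_n) under the DP prior: sum of new-cluster probabilities."""
    return sum(theta / (theta + i) for i in range(n))

theta = 1.0
for n in (100, 500, 5000, 50000):
    exact = expected_clusters_dp(n, theta)
    approx = theta * math.log(n)
    print(f"n={n:6d}  exact={exact:8.3f}  theta*log(n)={approx:8.3f}")
```

The approximation captures the logarithmic growth rate; the two columns differ by a term that stays bounded as $n$ increases.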


Finite Sample Comparison of Priors

• Y = $C_n$ vs. X = $n$ for different values of $\theta$

[Figure: expected number of clusters (y-axis) vs. n = number of observations (x-axis), in three panels for θ = 1, θ = 10, and θ = 100, with curves for DP, UN, PY (α = 0.25), PY (α = 0.5), and PY (α = 0.75)]


Simulation Study of Motif Clustering

• Evaluation of different priors and modes of inference in the context of the motif clustering application

• Simulated realistic collections of motifs (known partitions)

• Different simulation conditions to vary clustering difficulty:

– high to low within-cluster similarity

– high to low between-cluster similarity

• Success measured by Jaccard similarity between true partition $z$ and inferred partition $\hat{z}$

$$J(z, \hat{z}) = \frac{TP}{TP + FP + FN}$$
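A common way to evaluate this index for partitions is to count pairs of items, and the sketch below assumes that convention: a pair is a true positive if it is clustered together in both partitions, a false positive if together only in the inferred partition, and a false negative if together only in the true partition. The slides do not define TP, FP, and FN, so this reading is an assumption.

```python
from itertools import combinations

def jaccard(true_labels, inferred_labels):
    """Pair-counting Jaccard index between two partitions of the same items."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_inferred = inferred_labels[i] == inferred_labels[j]
        if same_true and same_inferred:
            tp += 1
        elif same_inferred:
            fp += 1
        elif same_true:
            fn += 1
    total = tp + fp + fn
    # If no pair is co-clustered in either partition, treat them as identical.
    return tp / total if total else 1.0

z_true = [0, 0, 0, 1, 1]
z_hat = [0, 0, 1, 1, 1]
print(jaccard(z_true, z_hat))  # 2 TP, 2 FP, 2 FN, so 1/3
```

Like the pairwise posterior probabilities, this measure depends only on co-membership, so it is unaffected by how the clusters are labeled.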


Simulation Comparison of Inference Alternatives

[Figure: Jaccard index (y-axis) vs. increasing clustering difficulty (x-axis) for three inference methods: MAP, Prob > 0.5, and Prob > 0.25]

• MAP partition is consistently inferior to partitions based on pairwise probabilities

• Posterior probabilities incorporate uncertainty across iterations


Simulation Comparison of Prior Alternatives

[Figure: Jaccard index (y-axis) vs. increasing clustering difficulty (x-axis) for five priors: Uniform, PY 0.25, PY 0.5, PY 0.75, and DP]

• Not much difference in general between priors

• Uniform does a little worse in most situations


Real Data Results: Clustering JASPAR database

• Tree based on pairwise posterior probabilities:

[Figure: dendrogram of JASPAR motifs, with leaves labeled by species, TF family, and JASPAR ID (e.g., Homo.sapiens−NUCLEAR−MA0065); the height axis is 1 − Prob(Clustering), from 0.0 to 1.0]

• We post-processed the MAP partition to remove weak relationships; the result is then very similar to the partition from thresholded posterior probabilities


Comparing Priors: Clustering JASPAR database

[Figure: four histograms (frequency on the y-axis) showing the number of clusters and the average cluster size under the uniform (Unif) prior and under the DP prior]

• Very little difference between using DP and uniform prior

• Likelihood is dominating any prior assumption on the partition


Summary

• Non-parametric Bayesian approaches based on the Dirichlet process can be very useful for clustering applications

• Issues with MCMC inference: popular MAP partitions seem inferior to partitions based on posterior probabilities

• Issues with implicit DP assumptions: alternative priors give quite different prior partitions

• Posterior differences between priors are small in our motif application, but can be larger in other applications

• Jensen and Liu, JASA (forthcoming) plus other manuscripts soon available on my website

http://stat.wharton.upenn.edu/~stjensen

