Clustering on Clusters 2049: Massively Parallel Algorithms for Clustering Graphs and Vectors
Grigory Yaroslavtsev, http://grigory.us
• Algorithm design for massively parallel computing
  – Blog: http://grigory.us/blog/mapreduce-model/
• MPC algorithms for graphs
  – Connectivity
  – Correlation clustering
• MPC algorithms for vectors
  – K-means
  – Single-linkage clustering
• Open problems and directions
Clustering on Clusters 2049: Overview

           | Graphs                 | Vectors
  Basic    | Connectivity           | K-means
  Advanced | Correlation Clustering | Single-Linkage Clustering (MST)
Cluster Computation (à la BSP)
• Input: size n (e.g. n = billions of edges in a graph)
• M machines, S space (RAM) each
  – Constant overhead in RAM: M · S = O(n)
  – S = n^(1−ε), e.g. ε = 0.1 or ε = 0.5 (M = S = O(√n))
• Output: solution to a problem (often of size O(n))
  – Doesn't fit in local RAM (S ≪ n)

[Diagram: input of size n distributed across M machines with S space each, producing the output.]

• Computation/communication proceeds in R rounds:
  – Every machine performs a near-linear-time computation, at most O(S^(1+o(1))) time per round ⇒ total user time O(S^(1+o(1)) · R)
  – Every machine sends/receives at most S bits of information ⇒ total communication O(n · R)
• Goal: minimize R. Ideally: R = constant.
MapReduce-style computations
What I won't discuss today
• PRAMs (shared memory, multiple processors), see e.g. [Karloff, Suri, Vassilvitskii '10]
  – Computing XOR requires Ω̃(log n) rounds in CRCW PRAM
  – Can be done in O(log_S n) rounds of MapReduce
• Pregel-style systems, Distributed Hash Tables (see e.g. Ashish Goel's class notes and papers)
• Lower-level implementation details (see e.g. the Rajaraman-Leskovec-Ullman book)
Models of parallel computation
• Bulk-Synchronous Parallel model (BSP) [Valiant '90]
  Pro: most general, generalizes all other models
  Con: many parameters, hard to design algorithms
• Massive Parallel Computation (MPC) [Andoni, Onak, Nikolov, Y. '14] [Feldman, Muthukrishnan, Sidiropoulos, Stein, Svitkina '07; Karloff, Suri, Vassilvitskii '10; Goodrich, Sitchinava, Zhang '11; …; Beame, Koutris, Suciu '13]
  Pros:
  • Inspired by modern systems (Hadoop, MapReduce, Dryad, Spark, Giraph, …)
  • Few parameters, simple to design algorithms
  • New algorithmic ideas, robust to the exact model specification
  • # of rounds is an information-theoretic measure ⇒ can prove unconditional results
  Con: sometimes not enough to model more complex behavior
Business perspective
• Pricing:
  – https://cloud.google.com/pricing/
  – https://aws.amazon.com/pricing/
• ~Linear with space and time usage
  – 100 machines: 5K$/year
  – 10,000 machines: 0.5M$/year
• You pay a lot more for using provided algorithms
  – https://aws.amazon.com/machine-learning/pricing/
Part 1: Clustering Graphs
• Applications:
  – Community detection
  – Fake account detection
  – Deduplication
  – Storage localization
  – …
Problem 1: Connectivity
• Input: n edges of a graph (arbitrarily partitioned between machines)
• Output: is the graph connected? (or: # of connected components)
• Question: how many rounds does it take?
  1. O(1)
  2. O(log n)
  3. O(n)
  4. O(2^n)
  5. Impossible
• Version of Boruvka's algorithm:
  – All vertices assigned to different components
  – Repeat O(log |V|) times:
    • Each component chooses a neighboring component
    • All pairs of chosen components get merged
• How to avoid chaining?
  – If the graph of components is bipartite and only one side gets to choose, then there is no chaining
  – Randomly assign components to the sides
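As a concrete illustration (a sequential sketch, not from the talk; function names are mine), the random side assignment can be simulated as follows: each round, every component flips a coin, and only "tail" components choose a neighboring "head" component, so no chains of merges can form.

```python
import random

def boruvka_connectivity(n, edges, seed=0):
    """Sequential sketch of the randomized Boruvka-style merging:
    each round, components flip a coin; only 'tail' components choose
    a neighboring 'head' component to merge into (no chaining)."""
    rng = random.Random(seed)
    comp = list(range(n))  # comp[v] = id of v's current component

    def neighbors_of_components():
        # For each component, record some neighboring component (if any).
        nbr = {}
        for u, v in edges:
            cu, cv = comp[u], comp[v]
            if cu != cv:
                nbr.setdefault(cu, cv)
                nbr.setdefault(cv, cu)
        return nbr

    while True:
        nbr = neighbors_of_components()
        if not nbr:  # no edges between distinct components remain
            break
        # Randomly split components into two sides; only side 0 chooses.
        side = {c: rng.randrange(2) for c in set(comp)}
        merged = {}
        for c, d in nbr.items():
            if side[c] == 0 and side[d] == 1:  # tail c merges into head d
                merged[c] = d
        if merged:
            comp = [merged.get(comp[v], comp[v]) for v in range(n)]
    return comp
```

Since heads never merge into anything within a round, each relabeling step is a single hop, which is exactly the no-chaining property.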
Algorithm for Connectivity
Algorithms for Graphs: Graph Density
• Dense: S ≫ |V|, e.g. S ≥ |V|^(3/2)
• Semi-dense: S = Θ(|V|)
• Sparse: S ≪ |V|, e.g. S ≤ |V|^(1/2)
Algorithms for Graphs: Graph Density
• Dense: S ≫ |V|, e.g. S ≥ |V|^(3/2)
  – Linear sketching: one round, see [McGregor '14]
    • Workshop at Berkeley tomorrow: http://caml.indiana.edu/linear-sketching-focs.html
  – "Filtering" [Karloff, Suri, Vassilvitskii, SODA'10; Ene, Im, Moseley, KDD'11; Lattanzi, Moseley, Suri, Vassilvitskii, SPAA'11; Suri, Vassilvitskii, WWW'11], …
Algorithms for Graphs: Graph Density
• Semi-dense graphs: S = Θ(|V|) [Avdyukhin, Y.]
  – Run Boruvka's algorithm for O(√(log |V|)) rounds
  – # of vertices reduces down to |V| / 2^(√(log |V|))
  – Repeat O(√(log |V|)) times:
    • Compute a spanning tree of locally stored edges
    • Put 2^(√(log |V|)) such trees per machine
Algorithms for Graphs: Graph Density
• Sparse: S ≪ |V|, e.g. S ≤ |V|^(1/2)
• Sparse graph problems appear hard
  – Big open question: connectivity in o(log |V|) rounds?
  – Probably no: [Roughgarden, Vassilvitskii, Wang '16]
• "One Cycle vs. Two Cycles" problem
  – Distinguish one cycle from two in o(log |V|) rounds?
Other Connectivity Algorithms
• [Rastogi, Machanavajjhala, Chitnis, Das Sarma '13] (D = graph diameter):

  Algorithm           | MR rounds             | Communication per round
  Hash-Min            | D                     | O(|V| + |E|)
  Hash-to-All         | log D                 | O(|V|² + |E|)
  Hash-to-Min         | O(log |V|) for paths  | Õ(|V| + |E|) for paths
  Hash-Greater-to-Min | O(log n)              | O(|V| + |E|)
Graph-Dependent Connectivity Algorithms?
• Big question: connectivity in O(log D) rounds with Õ(|V| + |E|) communication per round?
• [Rastogi et al. '13] conjectured that Hash-to-Min can achieve this
• [Avdyukhin, Y. '17]: Hash-to-Min takes Ω(D) rounds
• Open problem: better connectivity algorithms if we parametrize by graph expansion?
• Other work: [Kiveris et al. '14]
What about clustering?
• ≈ the same ideas work for Single-Linkage Clustering
• Using connectivity as a primitive one can preserve cuts in graphs [Benczur, Karger '98]
  – Construct a graph with O(n log n) edges
  – All cut sizes are preserved within a factor of 2
• Allows running clustering algorithms whose objectives use cuts on this sparse graph
Single-Linkage Clustering
• [Zahn '71] Clustering via Minimum Spanning Tree: for k clusters, remove the k − 1 longest edges from the MST
• Maximizes the minimum intercluster distance [Kleinberg, Tardos]
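A minimal sequential sketch of Zahn's method (illustrative only; function names are mine): build an MST with Prim's algorithm over the implicit complete distance graph, then drop the k − 1 longest MST edges and read off the forest components as clusters.

```python
def single_linkage(points, k):
    """Zahn's method: MST on the complete distance graph (Prim),
    then drop the k-1 longest MST edges; the resulting forest
    components are the k single-linkage clusters."""
    n = len(points)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # Prim's MST over the implicit complete graph.
    in_tree = [False] * n
    best = [(float('inf'), -1)] * n  # (cost to reach tree, parent)
    best[0] = (0.0, -1)
    mst_edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            mst_edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    # Keep the n-k shortest MST edges; their components are the clusters.
    mst_edges.sort()
    label = list(range(n))
    def find(x):
        while label[x] != x:
            label[x] = label[label[x]]
            x = label[x]
        return x
    for _, u, v in mst_edges[:n - k]:
        label[find(u)] = find(v)
    return [find(i) for i in range(n)]
```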
Part 2: Clustering Vectors
• Input: v_1, …, v_n ∈ ℝ^d
  – Feature vectors in ML, word embeddings in NLP, etc.
  – (Implicit) weighted graph of pairwise distances
• Applications: same as before + data visualization
Large geometric graphs
• Graph algorithms: dense graphs vs. sparse graphs
  – Dense: S ≫ |V|
  – Sparse: S ≪ |V|
• Our setting:
  – Dense graphs, sparsely represented: O(n) space
  – Output doesn't fit on one machine (S ≪ n)
• Today: (1 + ε)-approximate MST [Andoni, Onak, Nikolov, Y.]
  – d = 2 (easy to generalize)
  – R = log_S n = O(1) rounds (S = n^Ω(1))
O(log n)-approximate MST in R = O(log n) rounds
• Assume points have integer coordinates {0, …, Δ}, where Δ = O(n²)
• Impose an O(log n)-depth quadtree
• Bottom-up: for each cell in the quadtree
  – compute optimum MSTs in subcells
  – use only one representative from each cell on the next level
• Wrong representative: O(1)-approximation per level
εL-nets
• An εL-net for a cell C with side length L: a collection S of vertices in C such that every vertex is at distance ≤ εL from some vertex in S
• Fact: can efficiently compute an εL-net of size O(1/ε²)
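For d = 2 the fact above can be realized with a simple grid construction; the sketch below (names are mine) snaps points to a subgrid of side εL/√2 and keeps one representative per nonempty subcell, giving an εL-net of size O(1/ε²).

```python
def epsilon_net(points, L, eps):
    """Grid-based eps*L-net for 2-D points in a cell of side length L:
    snap to a subgrid of side eps*L/sqrt(2) and keep one representative
    per nonempty subcell. Any two points in the same subcell are within
    eps*L of each other, so every point is covered; there are O(1/eps^2)
    subcells, hence O(1/eps^2) representatives."""
    s = eps * L / 2 ** 0.5
    reps = {}
    for (x, y) in points:
        cell = (int(x // s), int(y // s))
        reps.setdefault(cell, (x, y))  # first point seen in the subcell
    return list(reps.values())
```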
Bottom-up: for each cell in the quadtree
  – compute optimum MSTs in subcells
  – use an εL-net from each cell on the next level
• Idea: pay only O(εL) for an edge cut by a cell with side L
• Randomly shift the quadtree: Pr[edge of length ℓ is cut by a cell of side L] ∼ ℓ/L; charge the errors to the cut edges
Randomly shifted quadtree
• Top cell shifted by a random vector in [0, L]²
Impose a randomly shifted quadtree (top cell length 2Δ)
Bottom-up: for each cell in the quadtree
  – compute optimum MSTs in subcells
  – use an εL-net from each cell on the next level

[Figure: a "BadCut" example where the algorithm pays 5 instead of 4; Pr[BadCut] = Ω(1).]
(1 + ε)-MST in R = O(log n) rounds
• Idea: only use short edges inside the cells
Impose a randomly shifted quadtree (top cell length 2Δ/ε)
Bottom-up: for each node (cell) in the quadtree
  – compute optimum minimum spanning forests in subcells, using edges of length ≤ εL
  – use only an ε²L-net from each cell on the next level

[Figure: now Pr[BadCut] = O(ε), since L = Ω(1/ε).]
(1 + ε)-MST in R = O(1) rounds
• O(log n) rounds ⇒ O(log_S n) = O(1) rounds
  – Flatten the tree: (√M × √M)-grids instead of (2 × 2)-grids at each level, where √M = n^Ω(1)
Impose a randomly shifted (√M × √M)-tree
Bottom-up: for each node (cell) in the tree
  – compute optimum MSTs in subcells via edges of length ≤ εL
  – use only an ε²L-net from each cell on the next level
Single-Linkage Clustering [Y., Vadapalli]
• Q: Can we get single-linkage clustering from a (1 + ε)-MST?
• A: No: a fixed edge can be arbitrarily distorted
• Idea:
  – Run O(log n) times & collect all (1 + ε)-MST edges
  – Compute an MST of these edges using Boruvka
  – Use this MST for k-single-linkage clustering for all k
• Overall: O(log n) rounds of MPC instead of O(1)
• Q: Is this actually necessary?
• A: Most likely yes, i.e. yes, assuming sparse connectivity is hard
Single-Linkage Clustering [Y., Vadapalli]
• Conjecture 1: sparse connectivity requires Ω(log |V|) rounds
• Conjecture 2: "1 cycle vs. 2 cycles" requires Ω(log |V|) rounds
• Under ℓ_p distances:

  Distance | Approximation | Hardness under Conjecture 1 | Hardness under Conjecture 2
  Hamming  | exact         | 2                           | 3
  ℓ_1      | (1 + ε)       | 2                           | 3
  ℓ_2      | (1 + ε)       | 1.41 − ε                    | 1.84 − ε
  ℓ_∞      | (1 + ε)       | 2                           |
Thanks! Questions?
• Slides will be available on http://grigory.us
• More about algorithms for massive data: http://grigory.us/blog/
• More in the classes I teach
(1 + ε)-MST in R = O(1) rounds
Theorem: let l = # of levels in a random tree P. Then
  E_P[ALG] ≤ (1 + O(εld)) · OPT
Proof (sketch):
• Δ_P(u, v) = side length of the cell that first partitions (u, v)
• New weights: w_P(u, v) = ‖u − v‖_2 + εΔ_P(u, v)
• ‖u − v‖_2 ≤ E_P[w_P(u, v)] ≤ (1 + O(εld)) · ‖u − v‖_2
• Our algorithm implements Kruskal's algorithm for the weights w_P
Technical Details
(1 + ε)-MST:
  – "Load balancing": partition the tree into parts of the same size
  – Almost linear time locally: approximate nearest-neighbor data structure [Indyk '99]
  – Dependence on the dimension d: the size of the ε-net is O((d/ε)^d)
  – Generalizes to bounded doubling dimension
Algorithm for Connectivity: Setup
Data: n edges of an undirected graph.
Notation:
• π(v) ≡ unique id of v
• Γ(S) ≡ set of neighbors of a subset S of vertices
Labels:
• The algorithm assigns a label ℓ(v) to each v
• L_v ≡ the set of vertices with the label ℓ(v) (invariant: a subset of the connected component containing v)
Active vertices:
• Some vertices will be called active (exactly one per L_v)
Algorithm for Connectivity
• Mark every vertex as active and let ℓ(v) = π(v).
• For phases i = 1, 2, …, O(log n) do:
  – Call each active vertex a leader with probability 1/2. If v is a leader, mark all vertices in L_v as leaders.
  – For every active non-leader vertex w, find the smallest leader (by π) vertex w⋆ in Γ(L_w).
  – Mark w passive, relabel each vertex with label w by w⋆.
• Output: set of connected components based on ℓ.
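The phases above can be simulated sequentially as follows (a sketch with names of my choosing, not the distributed implementation): each label class flips a leader coin, and every non-leader class relabels itself to the smallest neighboring leader label.

```python
import random

def mpc_connectivity(n, edges, seed=0):
    """Sequential simulation of the leader-contraction connectivity
    algorithm: each label class becomes a leader with prob. 1/2;
    every non-leader class relabels to the smallest adjacent leader."""
    rng = random.Random(seed)
    label = list(range(n))  # pi(v) = v: unique ids double as initial labels
    # Run phases until every edge has both endpoints in one label class.
    while any(label[u] != label[v] for u, v in edges):
        is_leader = {l: rng.randrange(2) == 1 for l in set(label)}
        target = {}  # non-leader label -> smallest neighboring leader label
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                la, lb = label[a], label[b]
                if la != lb and not is_leader[la] and is_leader[lb]:
                    target[la] = min(target.get(la, lb), lb)
        label = [target.get(l, l) for l in label]
    return label
```

At termination each connected component carries a single label, which is exactly the output the distributed version extracts from the label data structure.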
Algorithm for Connectivity: Analysis
• If ℓ(u) = ℓ(v) then u and v are in the same connected component.
• Claim: labels are unique within each component with high probability after O(log N) phases.
• For every connected component, the # of active vertices reduces by a constant factor in every phase:
  – Half of the active vertices are declared non-leaders.
  – Fix an active non-leader vertex v.
  – If there are at least two different labels in the connected component of v, then there is an edge (v′, u) such that ℓ(v) = ℓ(v′) and ℓ(v′) ≠ ℓ(u).
  – u is marked as a leader with probability 1/2 ⇒ half of the active non-leader vertices will change their label.
  – Overall, we expect 1/4 of the labels to disappear.
  – After O(log N) phases the # of active labels in every connected component drops to one with high probability.
Algorithm for Connectivity: Implementation Details
• Distributed data structure of size O(|V|) to maintain labels, ids, leader/non-leader status, etc.
  – O(1) rounds per phase to update the data structure
• Edges stored locally with all auxiliary info
  – Between phases: use the distributed data structure to update local info on edges
• For every active non-leader vertex w, find the smallest leader (w.r.t. π) vertex w⋆ ∈ Γ(L_w)
  – Each (non-leader, leader) edge sends an update to the distributed data structure
• Much faster with a Distributed Hash Table service (DHT) [Kiveris, Lattanzi, Mirrokni, Rastogi, Vassilvitskii '14]
Problem 3: K-means
• Input: v_1, …, v_n ∈ ℝ^d
• Find k centers c_1, …, c_k
• Minimize the sum of squared distances to the closest center:
  Σ_{i=1}^{n} min_{j=1,…,k} ‖v_i − c_j‖_2²
• ‖v_i − c_j‖_2² = Σ_{t=1}^{d} (v_{it} − c_{jt})²
• NP-hard
K-means++ [Arthur, Vassilvitskii '07]
• C = {c_1, …, c_t} (current collection of centers)
• d²(v, C) = min_{j=1,…,t} ‖v − c_j‖_2²
K-means++ algorithm (gives an O(log k)-approximation):
• Pick c_1 uniformly at random from the data
• Pick centers c_2, …, c_k sequentially from the distribution where point v has probability d²(v, C) / Σ_{i=1}^{n} d²(v_i, C)
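A short Python sketch of this seeding step (illustrative; names are mine, and this is the seeding only, not a full k-means solver):

```python
import random

def kmeans_pp_seed(points, k, seed=0):
    """K-means++ seeding: first center uniform, each next center drawn
    with probability proportional to d^2(v, C), the squared distance
    to the closest center chosen so far."""
    rng = random.Random(seed)
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    centers = [rng.choice(points)]
    cost = [d2(p, centers[0]) for p in points]  # cost[i] = d^2(v_i, C)
    for _ in range(k - 1):
        # Sample index i with probability cost[i] / sum(cost).
        i = rng.choices(range(len(points)), weights=cost)[0]
        centers.append(points[i])
        cost = [min(c, d2(p, points[i])) for c, p in zip(cost, points)]
    return centers
```

A chosen center has cost 0 afterwards, so it cannot be re-sampled; the d² weighting is what makes far-away clumps likely to receive a center.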
K-means|| [Bahmani et al. '12]
• Pick C = {c_1} uniformly at random from the data
• Initial cost: ψ = Σ_{i=1}^{n} d²(v_i, c_1)
• Do O(log ψ) times:
  – Add O(k) centers from the distribution where point v has probability d²(v, C) / Σ_{i=1}^{n} d²(v_i, C)
• Solve k-means for these O(k log ψ) points locally
• Thm: if the final step gives an α-approximation ⇒ O(α)-approximation overall
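An illustrative sequential sketch of the oversampling loop (parameter names such as `ell` are my choice; the real algorithm draws each round's centers in parallel across machines, and a final local step would reduce the candidates to k):

```python
import math
import random

def kmeans_parallel_seed(points, k, ell=None, seed=0):
    """K-means|| oversampling sketch: start from one uniform center,
    then for O(log psi) rounds add ~ell = O(k) centers per round,
    each point joining independently with prob. ell * d^2(v,C)/phi(C)."""
    rng = random.Random(seed)
    ell = ell or 2 * k  # oversampling factor, O(k)
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    centers = [rng.choice(points)]
    cost = [d2(p, centers[0]) for p in points]
    psi = sum(cost)  # initial cost
    for _ in range(max(1, int(math.log2(psi + 2)))):
        phi = sum(cost)
        if phi == 0:
            break
        new = [p for p, c in zip(points, cost)
               if rng.random() < min(1.0, ell * c / phi)]
        centers.extend(new)
        cost = [min(c, min((d2(p, q) for q in new), default=c))
                for c, p in zip(cost, points)]
    return centers  # O(k log psi) candidates; reduce to k locally
```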
Problem 2: Correlation Clustering
• Inspired by machine learning
• Practice: [Cohen, McCallum '01; Cohen, Richman '02]
• Theory: [Bansal, Blum, Chawla '04]
Correlation Clustering: Example
• Minimize the # of incorrectly classified pairs: # of covered non-edges + # of non-covered edges

[Figure: 4 incorrectly classified pairs = 1 covered non-edge + 3 non-covered edges]
Approximating Correlation Clustering
• Minimize the # of incorrectly classified pairs
  – ≈ 20000-approximation [Bansal, Blum, Chawla '04]
  – [Demaine, Emanuel, Fiat, Immorlica '04], [Charikar, Guruswami, Wirth '05], [Ailon, Charikar, Newman '05], [Williamson, van Zuylen '07], [Ailon, Liberty '08], …
  – ≈ 2-approximation [Chawla, Makarychev, Schramm, Y. '15]
• Maximize the # of correctly classified pairs
  – (1 − ε)-approximation [Bansal, Blum, Chawla '04]
Correlation Clustering
One of the most successful clustering methods:
• Only uses qualitative information about similarities
• # of clusters unspecified (selected to best fit the data)
• Applications: document/image deduplication (data from crowds or black-box machine learning)
• NP-hard [Bansal, Blum, Chawla '04], admits simple approximation algorithms with good provable guarantees
Correlation Clustering
More:
• Survey [Wirth]
• KDD'14 tutorial: "Correlation Clustering: From Theory to Practice" [Bonchi, Garcia-Soriano, Liberty], http://francescobonchi.com/CCtuto_kdd14.pdf
• Wikipedia article: http://en.wikipedia.org/wiki/Correlation_clustering
Data-Based Randomized Pivoting
3-approximation (in expectation) [Ailon, Charikar, Newman]
Algorithm:
• Pick a random pivot vertex p
• Make a cluster {p} ∪ N(p), where N(p) is the set of neighbors of p
• Remove the cluster from the graph and repeat

[Figure: 8 incorrectly classified pairs = 2 covered non-edges + 6 non-covered edges]
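A sequential sketch of the pivot algorithm on the "+" edges (function and parameter names are mine):

```python
import random

def pivot_correlation_clustering(n, pos_edges, seed=0):
    """Ailon-Charikar-Newman pivot: repeatedly pick a uniformly random
    remaining vertex, cluster it with its remaining positive neighbors,
    and remove the cluster from the graph."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in pos_edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(range(n))
    clusters = []
    while remaining:
        p = rng.choice(sorted(remaining))  # random pivot
        cluster = {p} | (adj[p] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```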
Parallel Pivot Algorithm
• (3 + ε)-approximation in O(log² n / ε) rounds [Chierichetti, Dalvi, Kumar, KDD'14]
• Algorithm: while the graph is not empty:
  – D = current maximum degree
  – Activate each node independently with probability ε/D
  – Deactivate nodes connected to other active nodes
  – The remaining active nodes are pivots
  – Create a cluster around each pivot as before
  – Remove the clusters
Parallel Pivot Algorithm: Analysis
• Fact: the maximum degree halves after (1/ε) · log n rounds ⇒ the algorithm terminates in O(log² n / ε) rounds
• Fact: the activation process induces a close-to-uniform marginal distribution over the pivots ⇒ an analysis similar to the sequential pivot algorithm gives a (3 + ε)-approximation
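The round structure above can be sketched as follows (a sequential simulation with names of my choosing; a real MPC run would execute each round across machines):

```python
import random

def parallel_pivot_round(adj, remaining, eps, rng):
    """One round: activate nodes with prob. eps/D, deactivate any node
    adjacent to another active node; survivors act as pivots in parallel."""
    D = max((len(adj[v] & remaining) for v in remaining), default=0)
    if D == 0:  # only isolated vertices remain: each is its own cluster
        return [{v} for v in remaining], set()
    active = {v for v in remaining if rng.random() < eps / D}
    pivots = {v for v in active if not (adj[v] & active)}
    clusters, clustered = [], set()
    for p in sorted(pivots):
        # Pivots are non-adjacent, so each pivot starts its own cluster;
        # a vertex adjacent to two pivots joins the first one processed.
        c = ({p} | (adj[p] & remaining)) - clustered
        clusters.append(c)
        clustered |= c
    return clusters, remaining - clustered

def parallel_pivot(n, pos_edges, eps=0.5, seed=0):
    """Driver: repeat rounds until every vertex is clustered."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in pos_edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining, clusters = set(range(n)), []
    while remaining:
        new, remaining = parallel_pivot_round(adj, remaining, eps, rng)
        clusters += new
    return clusters
```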