Clustering on Clusters 2049: Massively Parallel Algorithms for Clustering Graphs and Vectors
Grigory Yaroslavtsev, http://grigory.us
• Algorithm design for massively parallel computing
  – Blog: http://grigory.us/blog/mapreduce-model/
• MPC algorithms for graphs
  – Connectivity
  – Correlation clustering
• MPC algorithms for vectors
  – K-means
  – Single-linkage clustering
• Open problems and directions
Clustering on Clusters 2049: Overview

           | Graphs                 | Vectors
  Basic    | Connectivity           | K-means
  Advanced | Correlation Clustering | Single-Linkage Clustering (MST)
Cluster Computation (à la BSP)
• Input: size n (e.g. n = billions of edges in a graph)
• M machines, S space (RAM) each
  – Constant overhead in RAM: M · S = O(n)
  – S = n^(1−ε), e.g. ε = 0.1 or ε = 0.5 (M = S = O(√n))
• Output: solution to a problem (often of size O(n))
  – Doesn't fit in local RAM (S ≪ n)

[Diagram: input of size n distributed across M machines with S space each, producing the output.]

• Computation/communication proceeds in R rounds:
  – Every machine performs a near-linear-time computation, at most O(S^(1+o(1))) time per round ⇒ total user time O(S^(1+o(1)) · R)
  – Every machine sends/receives at most S bits of information ⇒ total communication O(n · R)
• Goal: minimize R. Ideally: R = constant.
MapReduce-style computations
What I won't discuss today
• PRAMs (shared memory, multiple processors), see e.g. [Karloff, Suri, Vassilvitskii '10]
  – Computing XOR requires Ω̃(log n) rounds in CRCW PRAM
  – Can be done in O(log_S n) rounds of MapReduce
• Pregel-style systems, Distributed Hash Tables (see e.g. Ashish Goel's class notes and papers)
• Lower-level implementation details (see e.g. the Rajaraman-Leskovec-Ullman book)
Models of parallel computation
• Bulk-Synchronous Parallel model (BSP) [Valiant '90]
  Pro: most general, generalizes all other models
  Con: many parameters, hard to design algorithms
• Massive Parallel Computation (MPC) [Andoni, Onak, Nikolov, Y. '14] [Feldman, Muthukrishnan, Sidiropoulos, Stein, Svitkina '07; Karloff, Suri, Vassilvitskii '10; Goodrich, Sitchinava, Zhang '11; …; Beame, Koutris, Suciu '13]
  Pros:
  • Inspired by modern systems (Hadoop, MapReduce, Dryad, Spark, Giraph, …)
  • Few parameters, simple to design algorithms
  • New algorithmic ideas, robust to the exact model specification
  • # of rounds is an information-theoretic measure ⇒ can prove unconditional results
  Con: sometimes not enough to model more complex behavior
Business perspective
• Pricing:
  – https://cloud.google.com/pricing/
  – https://aws.amazon.com/pricing/
• ~Linear with space and time usage
  – 100 machines: 5K$/year
  – 10,000 machines: 0.5M$/year
• You pay a lot more for using provided algorithms
  – https://aws.amazon.com/machine-learning/pricing/
Part 1: Clustering Graphs
• Applications:
  – Community detection
  – Fake account detection
  – Deduplication
  – Storage localization
  – …
Problem 1: Connectivity
• Input: n edges of a graph (arbitrarily partitioned between machines)
• Output: is the graph connected? (or: # of connected components)
• Question: how many rounds does it take?
  1. O(1)
  2. O(log n)
  3. O(n)
  4. O(2^n)
  5. Impossible
• Version of Boruvka's algorithm:
  – All vertices assigned to different components
  – Repeat O(log |V|) times:
    • Each component chooses a neighboring component
    • All pairs of chosen components get merged
• How to avoid chaining?
  – If the graph of components is bipartite and only one side gets to choose, then there is no chaining
  – Randomly assign components to the sides
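As a concrete illustration (a sequential sketch, not from the talk; function names are mine), the random side assignment can be simulated as follows: each round, every component flips a coin, and only "tail" components choose a neighboring "head" component, so no chains of merges can form.

```python
import random

def boruvka_connectivity(n, edges, seed=0):
    """Sequential sketch of the randomized Boruvka-style merging:
    each round, components flip a coin; only 'tail' components choose
    a neighboring 'head' component to merge into (no chaining)."""
    rng = random.Random(seed)
    comp = list(range(n))  # comp[v] = id of v's current component

    def neighbors_of_components():
        # For each component, record some neighboring component (if any).
        nbr = {}
        for u, v in edges:
            cu, cv = comp[u], comp[v]
            if cu != cv:
                nbr.setdefault(cu, cv)
                nbr.setdefault(cv, cu)
        return nbr

    while True:
        nbr = neighbors_of_components()
        if not nbr:  # no edges between distinct components remain
            break
        # Randomly split components into two sides; only side 0 chooses.
        side = {c: rng.randrange(2) for c in set(comp)}
        merged = {}
        for c, d in nbr.items():
            if side[c] == 0 and side[d] == 1:  # tail c merges into head d
                merged[c] = d
        if merged:
            comp = [merged.get(comp[v], comp[v]) for v in range(n)]
    return comp
```

Since heads never merge into anything within a round, each relabeling step is a single hop, which is exactly the no-chaining property.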
Algorithm for Connectivity
Algorithms for Graphs: Graph Density
• Dense: S ≫ |V|, e.g. S ≥ |V|^(3/2)
• Semi-dense: S = Θ(|V|)
• Sparse: S ≪ |V|, e.g. S ≤ |V|^(1/2)
Algorithms for Graphs: Graph Density
• Dense: S ≫ |V|, e.g. S ≥ |V|^(3/2)
  – Linear sketching: one round, see [McGregor '14]
    • Workshop at Berkeley tomorrow: http://caml.indiana.edu/linear-sketching-focs.html
  – "Filtering" [Karloff, Suri, Vassilvitskii, SODA'10; Ene, Im, Moseley, KDD'11; Lattanzi, Moseley, Suri, Vassilvitskii, SPAA'11; Suri, Vassilvitskii, WWW'11], …
Algorithms for Graphs: Graph Density
• Semi-dense graphs: S = Θ(|V|) [Avdyukhin, Y.]
  – Run Boruvka's algorithm for O(√(log |V|)) rounds
  – # of vertices reduces down to |V| / 2^(√(log |V|))
  – Repeat O(√(log |V|)) times:
    • Compute a spanning tree of locally stored edges
    • Put 2^(√(log |V|)) such trees per machine
Algorithms for Graphs: Graph Density
• Sparse: S ≪ |V|, e.g. S ≤ |V|^(1/2)
• Sparse graph problems appear hard
  – Big open question: connectivity in o(log |V|) rounds?
  – Probably no: [Roughgarden, Vassilvitskii, Wang '16]
• "One Cycle vs. Two Cycles" problem
  – Distinguish one cycle from two in o(log |V|) rounds?
Other Connectivity Algorithms
• [Rastogi, Machanavajjhala, Chitnis, Das Sarma '13] (D = graph diameter):

  Algorithm           | MR rounds             | Communication per round
  Hash-Min            | D                     | O(|V| + |E|)
  Hash-to-All         | log D                 | O(|V|² + |E|)
  Hash-to-Min         | O(log |V|) for paths  | Õ(|V| + |E|) for paths
  Hash-Greater-to-Min | O(log n)              | O(|V| + |E|)
Graph-Dependent Connectivity Algorithms?
• Big question: connectivity in O(log D) rounds with Õ(|V| + |E|) communication per round?
• [Rastogi et al. '13] conjectured that Hash-to-Min can achieve this
• [Avdyukhin, Y. '17]: Hash-to-Min takes Ω(D) rounds
• Open problem: better connectivity algorithms if we parametrize by graph expansion?
• Other work: [Kiveris et al. '14]
What about clustering?
• ≈ the same ideas work for Single-Linkage Clustering
• Using connectivity as a primitive one can preserve cuts in graphs [Benczur, Karger '98]
  – Construct a graph with O(n log n) edges
  – All cut sizes are preserved within a factor of 2
• Allows running clustering algorithms whose objectives use cuts on this sparse graph
Single-Linkage Clustering
• [Zahn '71] Clustering via Minimum Spanning Tree: for k clusters, remove the k − 1 longest edges from the MST
• Maximizes the minimum intercluster distance [Kleinberg, Tardos]
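A minimal sequential sketch of Zahn's method (illustrative only; function names are mine): build an MST with Prim's algorithm over the implicit complete distance graph, then drop the k − 1 longest MST edges and read off the forest components as clusters.

```python
def single_linkage(points, k):
    """Zahn's method: MST on the complete distance graph (Prim),
    then drop the k-1 longest MST edges; the resulting forest
    components are the k single-linkage clusters."""
    n = len(points)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # Prim's MST over the implicit complete graph.
    in_tree = [False] * n
    best = [(float('inf'), -1)] * n  # (cost to reach tree, parent)
    best[0] = (0.0, -1)
    mst_edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            mst_edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    # Keep the n-k shortest MST edges; their components are the clusters.
    mst_edges.sort()
    label = list(range(n))
    def find(x):
        while label[x] != x:
            label[x] = label[label[x]]
            x = label[x]
        return x
    for _, u, v in mst_edges[:n - k]:
        label[find(u)] = find(v)
    return [find(i) for i in range(n)]
```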
Part 2: Clustering Vectors
• Input: v_1, …, v_n ∈ ℝ^d
  – Feature vectors in ML, word embeddings in NLP, etc.
  – (Implicit) weighted graph of pairwise distances
• Applications: same as before + data visualization
Large geometric graphs
• Graph algorithms: dense graphs vs. sparse graphs
  – Dense: S ≫ |V|
  – Sparse: S ≪ |V|
• Our setting:
  – Dense graphs, sparsely represented: O(n) space
  – Output doesn't fit on one machine (S ≪ n)
• Today: (1 + ε)-approximate MST [Andoni, Onak, Nikolov, Y.]
  – d = 2 (easy to generalize)
  – R = log_S n = O(1) rounds (S = n^Ω(1))
O(log n)-approximate MST in R = O(log n) rounds
• Assume points have integer coordinates {0, …, Δ}, where Δ = O(n²)
• Impose an O(log n)-depth quadtree
• Bottom-up: for each cell in the quadtree
  – compute optimum MSTs in subcells
  – use only one representative from each cell on the next level
• Wrong representative: O(1)-approximation per level
εL-nets
• An εL-net for a cell C with side length L: a collection S of vertices in C such that every vertex is at distance ≤ εL from some vertex in S
• Fact: can efficiently compute an εL-net of size O(1/ε²)
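For d = 2 the fact above can be realized with a simple grid construction; the sketch below (names are mine) snaps points to a subgrid of side εL/√2 and keeps one representative per nonempty subcell, giving an εL-net of size O(1/ε²).

```python
def epsilon_net(points, L, eps):
    """Grid-based eps*L-net for 2-D points in a cell of side length L:
    snap to a subgrid of side eps*L/sqrt(2) and keep one representative
    per nonempty subcell. Any two points in the same subcell are within
    eps*L of each other, so every point is covered; there are O(1/eps^2)
    subcells, hence O(1/eps^2) representatives."""
    s = eps * L / 2 ** 0.5
    reps = {}
    for (x, y) in points:
        cell = (int(x // s), int(y // s))
        reps.setdefault(cell, (x, y))  # first point seen in the subcell
    return list(reps.values())
```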
Bottom-up: for each cell in the quadtree
  – compute optimum MSTs in subcells
  – use an εL-net from each cell on the next level
• Idea: pay only O(εL) for an edge cut by a cell with side L
• Randomly shift the quadtree: Pr[edge of length ℓ is cut by a cell of side L] ∼ ℓ/L; charge the errors to the cut edges
Randomly shifted quadtree
• Top cell shifted by a random vector in [0, L]²
Impose a randomly shifted quadtree (top cell length 2Δ)
Bottom-up: for each cell in the quadtree
  – compute optimum MSTs in subcells
  – use an εL-net from each cell on the next level

[Figure: a "BadCut" example where the algorithm pays 5 instead of 4; Pr[BadCut] = Ω(1).]
(1 + ε)-MST in R = O(log n) rounds
• Idea: only use short edges inside the cells
Impose a randomly shifted quadtree (top cell length 2Δ/ε)
Bottom-up: for each node (cell) in the quadtree
  – compute optimum minimum spanning forests in subcells, using edges of length ≤ εL
  – use only an ε²L-net from each cell on the next level

[Figure: now Pr[BadCut] = O(ε), since L = Ω(1/ε).]
(1 + ε)-MST in R = O(1) rounds
• O(log n) rounds ⇒ O(log_S n) = O(1) rounds
  – Flatten the tree: (√M × √M)-grids instead of (2 × 2)-grids at each level, where √M = n^Ω(1)
Impose a randomly shifted (√M × √M)-tree
Bottom-up: for each node (cell) in the tree
  – compute optimum MSTs in subcells via edges of length ≤ εL
  – use only an ε²L-net from each cell on the next level
Single-Linkage Clustering [Y., Vadapalli]
• Q: Can we get single-linkage clustering from a (1 + ε)-MST?
• A: No: a fixed edge can be arbitrarily distorted
• Idea:
  – Run O(log n) times & collect all (1 + ε)-MST edges
  – Compute an MST of these edges using Boruvka
  – Use this MST for k-single-linkage clustering for all k
• Overall: O(log n) rounds of MPC instead of O(1)
• Q: Is this actually necessary?
• A: Most likely yes, i.e. yes, assuming sparse connectivity is hard
Single-Linkage Clustering [Y., Vadapalli]
• Conjecture 1: sparse connectivity requires Ω(log |V|) rounds
• Conjecture 2: "1 cycle vs. 2 cycles" requires Ω(log |V|) rounds
• Under ℓ_p distances:

  Distance | Approximation | Hardness under Conjecture 1 | Hardness under Conjecture 2
  Hamming  | exact         | 2                           | 3
  ℓ_1      | (1 + ε)       | 2                           | 3
  ℓ_2      | (1 + ε)       | 1.41 − ε                    | 1.84 − ε
  ℓ_∞      | (1 + ε)       | 2                           |
Thanks! Questions?
• Slides will be available on http://grigory.us
• More about algorithms for massive data: http://grigory.us/blog/
• More in the classes I teach
(1 + ε)-MST in R = O(1) rounds
Theorem: let l = # of levels in a random tree P. Then
  E_P[ALG] ≤ (1 + O(εld)) · OPT
Proof (sketch):
• Δ_P(u, v) = side length of the cell that first partitions (u, v)
• New weights: w_P(u, v) = ‖u − v‖_2 + εΔ_P(u, v)
• ‖u − v‖_2 ≤ E_P[w_P(u, v)] ≤ (1 + O(εld)) · ‖u − v‖_2
• Our algorithm implements Kruskal's algorithm for the weights w_P
Technical Details
(1 + ε)-MST:
  – "Load balancing": partition the tree into parts of the same size
  – Almost linear time locally: approximate nearest-neighbor data structure [Indyk '99]
  – Dependence on the dimension d: the size of the ε-net is O((d/ε)^d)
  – Generalizes to bounded doubling dimension
Algorithm for Connectivity: Setup
Data: n edges of an undirected graph.
Notation:
• π(v) ≡ unique id of v
• Γ(S) ≡ set of neighbors of a subset S of vertices
Labels:
• The algorithm assigns a label ℓ(v) to each v
• L_v ≡ the set of vertices with the label ℓ(v) (invariant: a subset of the connected component containing v)
Active vertices:
• Some vertices will be called active (exactly one per L_v)
Algorithm for Connectivity
• Mark every vertex as active and let ℓ(v) = π(v).
• For phases i = 1, 2, …, O(log n) do:
  – Call each active vertex a leader with probability 1/2. If v is a leader, mark all vertices in L_v as leaders.
  – For every active non-leader vertex w, find the smallest leader (by π) vertex w⋆ in Γ(L_w).
  – Mark w passive, relabel each vertex with label w by w⋆.
• Output: set of connected components based on ℓ.
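The phases above can be simulated sequentially as follows (a sketch with names of my choosing, not the distributed implementation): each label class flips a leader coin, and every non-leader class relabels itself to the smallest neighboring leader label.

```python
import random

def mpc_connectivity(n, edges, seed=0):
    """Sequential simulation of the leader-contraction connectivity
    algorithm: each label class becomes a leader with prob. 1/2;
    every non-leader class relabels to the smallest adjacent leader."""
    rng = random.Random(seed)
    label = list(range(n))  # pi(v) = v: unique ids double as initial labels
    # Run phases until every edge has both endpoints in one label class.
    while any(label[u] != label[v] for u, v in edges):
        is_leader = {l: rng.randrange(2) == 1 for l in set(label)}
        target = {}  # non-leader label -> smallest neighboring leader label
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                la, lb = label[a], label[b]
                if la != lb and not is_leader[la] and is_leader[lb]:
                    target[la] = min(target.get(la, lb), lb)
        label = [target.get(l, l) for l in label]
    return label
```

At termination each connected component carries a single label, which is exactly the output the distributed version extracts from the label data structure.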
Algorithm for Connectivity: Analysis
• If ℓ(u) = ℓ(v) then u and v are in the same connected component.
• Claim: labels are unique within each component with high probability after O(log N) phases.
• For every connected component, the # of active vertices reduces by a constant factor in every phase:
  – Half of the active vertices are declared non-leaders.
  – Fix an active non-leader vertex v.
  – If there are at least two different labels in the connected component of v, then there is an edge (v′, u) such that ℓ(v) = ℓ(v′) and ℓ(v′) ≠ ℓ(u).
  – u is marked as a leader with probability 1/2 ⇒ half of the active non-leader vertices will change their label.
  – Overall, we expect 1/4 of the labels to disappear.
  – After O(log N) phases the # of active labels in every connected component drops to one with high probability.
Algorithm for Connectivity: Implementation Details
• Distributed data structure of size O(|V|) to maintain labels, ids, leader/non-leader status, etc.
  – O(1) rounds per phase to update the data structure
• Edges stored locally with all auxiliary info
  – Between phases: use the distributed data structure to update local info on edges
• For every active non-leader vertex w, find the smallest leader (w.r.t. π) vertex w⋆ ∈ Γ(L_w)
  – Each (non-leader, leader) edge sends an update to the distributed data structure
• Much faster with a Distributed Hash Table service (DHT) [Kiveris, Lattanzi, Mirrokni, Rastogi, Vassilvitskii '14]
Problem 3: K-means
• Input: v_1, …, v_n ∈ ℝ^d
• Find k centers c_1, …, c_k
• Minimize the sum of squared distances to the closest center:
  Σ_{i=1}^{n} min_{j=1,…,k} ‖v_i − c_j‖_2²
• ‖v_i − c_j‖_2² = Σ_{t=1}^{d} (v_{it} − c_{jt})²
• NP-hard
K-means++ [Arthur, Vassilvitskii '07]
• C = {c_1, …, c_t} (current collection of centers)
• d²(v, C) = min_{j=1,…,t} ‖v − c_j‖_2²
K-means++ algorithm (gives an O(log k)-approximation):
• Pick c_1 uniformly at random from the data
• Pick centers c_2, …, c_k sequentially from the distribution where point v has probability d²(v, C) / Σ_{i=1}^{n} d²(v_i, C)
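A short Python sketch of this seeding step (illustrative; names are mine, and this is the seeding only, not a full k-means solver):

```python
import random

def kmeans_pp_seed(points, k, seed=0):
    """K-means++ seeding: first center uniform, each next center drawn
    with probability proportional to d^2(v, C), the squared distance
    to the closest center chosen so far."""
    rng = random.Random(seed)
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    centers = [rng.choice(points)]
    cost = [d2(p, centers[0]) for p in points]  # cost[i] = d^2(v_i, C)
    for _ in range(k - 1):
        # Sample index i with probability cost[i] / sum(cost).
        i = rng.choices(range(len(points)), weights=cost)[0]
        centers.append(points[i])
        cost = [min(c, d2(p, points[i])) for c, p in zip(cost, points)]
    return centers
```

A chosen center has cost 0 afterwards, so it cannot be re-sampled; the d² weighting is what makes far-away clumps likely to receive a center.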
K-means|| [Bahmani et al. '12]
• Pick C = {c_1} uniformly at random from the data
• Initial cost: ψ = Σ_{i=1}^{n} d²(v_i, c_1)
• Do O(log ψ) times:
  – Add O(k) centers from the distribution where point v has probability d²(v, C) / Σ_{i=1}^{n} d²(v_i, C)
• Solve k-means for these O(k log ψ) points locally
• Thm: if the final step gives an α-approximation ⇒ O(α)-approximation overall
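An illustrative sequential sketch of the oversampling loop (parameter names such as `ell` are my choice; the real algorithm draws each round's centers in parallel across machines, and a final local step would reduce the candidates to k):

```python
import math
import random

def kmeans_parallel_seed(points, k, ell=None, seed=0):
    """K-means|| oversampling sketch: start from one uniform center,
    then for O(log psi) rounds add ~ell = O(k) centers per round,
    each point joining independently with prob. ell * d^2(v,C)/phi(C)."""
    rng = random.Random(seed)
    ell = ell or 2 * k  # oversampling factor, O(k)
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    centers = [rng.choice(points)]
    cost = [d2(p, centers[0]) for p in points]
    psi = sum(cost)  # initial cost
    for _ in range(max(1, int(math.log2(psi + 2)))):
        phi = sum(cost)
        if phi == 0:
            break
        new = [p for p, c in zip(points, cost)
               if rng.random() < min(1.0, ell * c / phi)]
        centers.extend(new)
        cost = [min(c, min((d2(p, q) for q in new), default=c))
                for c, p in zip(cost, points)]
    return centers  # O(k log psi) candidates; reduce to k locally
```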
Problem 2: Correlation Clustering
• Inspired by machine learning
• Practice: [Cohen, McCallum '01; Cohen, Richman '02]
• Theory: [Bansal, Blum, Chawla '04]
Correlation Clustering: Example
• Minimize the # of incorrectly classified pairs: # of covered non-edges + # of non-covered edges

[Figure: 4 incorrectly classified pairs = 1 covered non-edge + 3 non-covered edges]
Approximating Correlation Clustering
• Minimize the # of incorrectly classified pairs
  – ≈ 20000-approximation [Bansal, Blum, Chawla '04]
  – [Demaine, Emanuel, Fiat, Immorlica '04], [Charikar, Guruswami, Wirth '05], [Ailon, Charikar, Newman '05], [Williamson, van Zuylen '07], [Ailon, Liberty '08], …
  – ≈ 2-approximation [Chawla, Makarychev, Schramm, Y. '15]
• Maximize the # of correctly classified pairs
  – (1 − ε)-approximation [Bansal, Blum, Chawla '04]
Correlation Clustering
One of the most successful clustering methods:
• Only uses qualitative information about similarities
• # of clusters unspecified (selected to best fit the data)
• Applications: document/image deduplication (data from crowds or black-box machine learning)
• NP-hard [Bansal, Blum, Chawla '04], admits simple approximation algorithms with good provable guarantees
Correlation Clustering
More:
• Survey [Wirth]
• KDD'14 tutorial: "Correlation Clustering: From Theory to Practice" [Bonchi, Garcia-Soriano, Liberty], http://francescobonchi.com/CCtuto_kdd14.pdf
• Wikipedia article: http://en.wikipedia.org/wiki/Correlation_clustering
Data-Based Randomized Pivoting
3-approximation (in expectation) [Ailon, Charikar, Newman]
Algorithm:
• Pick a random pivot vertex p
• Make a cluster {p} ∪ N(p), where N(p) is the set of neighbors of p
• Remove the cluster from the graph and repeat

[Figure: 8 incorrectly classified pairs = 2 covered non-edges + 6 non-covered edges]
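A sequential sketch of the pivot algorithm on the "+" edges (function and parameter names are mine):

```python
import random

def pivot_correlation_clustering(n, pos_edges, seed=0):
    """Ailon-Charikar-Newman pivot: repeatedly pick a uniformly random
    remaining vertex, cluster it with its remaining positive neighbors,
    and remove the cluster from the graph."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in pos_edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(range(n))
    clusters = []
    while remaining:
        p = rng.choice(sorted(remaining))  # random pivot
        cluster = {p} | (adj[p] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```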
Parallel Pivot Algorithm
• (3 + ε)-approximation in O(log² n / ε) rounds [Chierichetti, Dalvi, Kumar, KDD'14]
• Algorithm: while the graph is not empty:
  – D = current maximum degree
  – Activate each node independently with probability ε/D
  – Deactivate nodes connected to other active nodes
  – The remaining active nodes are pivots
  – Create a cluster around each pivot as before
  – Remove the clusters
Parallel Pivot Algorithm: Analysis
• Fact: the maximum degree halves after (1/ε) · log n rounds ⇒ the algorithm terminates in O(log² n / ε) rounds
• Fact: the activation process induces a close-to-uniform marginal distribution over the pivots ⇒ an analysis similar to the sequential pivot algorithm gives a (3 + ε)-approximation
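The round structure above can be sketched as follows (a sequential simulation with names of my choosing; a real MPC run would execute each round across machines):

```python
import random

def parallel_pivot_round(adj, remaining, eps, rng):
    """One round: activate nodes with prob. eps/D, deactivate any node
    adjacent to another active node; survivors act as pivots in parallel."""
    D = max((len(adj[v] & remaining) for v in remaining), default=0)
    if D == 0:  # only isolated vertices remain: each is its own cluster
        return [{v} for v in remaining], set()
    active = {v for v in remaining if rng.random() < eps / D}
    pivots = {v for v in active if not (adj[v] & active)}
    clusters, clustered = [], set()
    for p in sorted(pivots):
        # Pivots are non-adjacent, so each pivot starts its own cluster;
        # a vertex adjacent to two pivots joins the first one processed.
        c = ({p} | (adj[p] & remaining)) - clustered
        clusters.append(c)
        clustered |= c
    return clusters, remaining - clustered

def parallel_pivot(n, pos_edges, eps=0.5, seed=0):
    """Driver: repeat rounds until every vertex is clustered."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in pos_edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining, clusters = set(range(n)), []
    while remaining:
        new, remaining = parallel_pivot_round(adj, remaining, eps, rng)
        clusters += new
    return clusters
```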