Page 1
Graham Cormode, [email protected]
Nick Duffield, Texas A&M University, [email protected]

Sampling for Big Data
Page 2
Big Data
◊ "Big" data arises in many forms:
– Physical measurements: from science (physics, astronomy)
– Medical data: genetic sequences, detailed time series
– Activity data: GPS location, social network activity
– Business data: customer behavior tracking at fine detail
◊ Common themes:
– Data is large, and growing
– There are important patterns and trends in the data
– We don't fully know where to look or how to find them
Page 3
Why Reduce?
◊ Although "big" data is about more than just the volume… most big data is big!
◊ It is not always possible to store the data in full
– Many applications (telecoms, ISPs, search engines) can't keep everything
◊ It is inconvenient to work with data in full
– Just because we can, doesn't mean we should
◊ It is faster to work with a compact summary
– Better to explore data on a laptop than a cluster
Page 4
Why Sample?
◊ Sampling has an intuitive semantics
– We obtain a smaller dataset with the same structure
◊ Estimating on a sample is often straightforward
– Run the analysis on the sample that you would on the full data
– Some rescaling/reweighting may be necessary
◊ Sampling is general and agnostic to the analysis to be done
– Other summary methods only work for certain computations
– Though sampling can be tuned to optimize some criteria
◊ Sampling is (usually) easy to understand
– So prevalent that we have an intuition about sampling
Page 5
Alternatives to Sampling
◊ Sampling is not the only game in town
– Many other data reduction techniques by many names
◊ Dimensionality reduction methods
– PCA, SVD, eigenvalue/eigenvector decompositions
– Costly and slow to perform on big data
◊ "Sketching" techniques for streams of data
– Hash-based summaries via random projections
– Complex to understand and limited in function
◊ Other transform/dictionary-based summarization methods
– Wavelets, Fourier Transform, DCT, Histograms
– Not incrementally updatable, high overhead
◊ All worthy of study – in other tutorials
Page 6
Health Warning: contains probabilities
◊ Will avoid detailed probability calculations; aim to give high-level descriptions and intuition
◊ But some probability basics are assumed
– Concepts of probability, expectation, variance of random variables
– Allude to concentration of measure (Exponential/Chernoff bounds)
◊ Feel free to ask questions about technical details along the way
Page 7
Outline
◊ Motivating application: sampling in large ISP networks
◊ Basics of sampling: concepts and estimation
◊ Stream sampling: uniform and weighted case
– Variations: Concise sampling, sample and hold, sketch guided
BREAK
◊ Advanced stream sampling: sampling as cost optimization
– VarOpt, priority, structure aware, and stable sampling
◊ Hashing and coordination
– Bottom-k, consistent sampling and sketch-based sampling
◊ Graph sampling
– Node, edge and subgraph sampling
◊ Conclusion and future directions
Page 8
Sampling as a Mediator of Constraints
[Diagram: Sampling mediates between Data Characteristics (Heavy Tails, Correlations), Query Requirements (Ad Hoc, Accuracy, Aggregates, Speed), and Resource Constraints (Bandwidth, Storage, CPU)]
Page 9
Motivating Application: ISP Data
◊ Will motivate many results with application to ISPs
◊ Many reasons to use such examples:
– Expertise: tutors from telecoms world
– Demand: many sampling methods developed in response to ISP needs
– Practice: sampling widely used in ISP monitoring, built into routers
– Prescience: ISPs were first to hit many "big data" problems
– Variety: many different places where sampling is needed
◊ First, a crash-course on ISP networks…
Page 10
Structure of Large ISP Networks
[Diagram: Peering with other ISPs; Access Networks: Wireless, DSL, IPTV; City-level Router Centers; Backbone Links; Downstream ISP and business customers; Service and Datacenters; Network Management & Administration]
Page 11
Measuring the ISP Network: Data Sources
[Diagram annotations over the network: Peering, Access, Router Centers, Backbone, Business, Datacenters, Management]
– Link Traffic Rates: aggregated per router interface
– Traffic Matrices: flow records from routers
– Loss & Latency: active probing
– Loss & Latency: round trip to edge
– Protocol Monitoring: routers, wireless
– Status Reports: device failures and transitions
– Customer Care Logs: reactive indicators of network performance
Page 12
Why Summarize (ISP) Big Data?
◊ When transmission bandwidth for measurements is limited
– Not such a big issue in ISPs with in-band collection
◊ Typically raw accumulation is not feasible (even for nation states)
– High rate streaming data
– Maintain historical summaries for baselining, time series analysis
◊ To facilitate fast queries
– When infeasible to run exploratory queries over full data
◊ As part of hierarchical query infrastructure:
– Maintain full data over limited duration window
– Drill down into full data through one or more layers of summarization
Sampling has proved to be a flexible method to accomplish this
Page 13
Data Scale: Summarization and Sampling
Page 14
Traffic Measurement in the ISP Network
[Diagram: Access, Router Centers, Backbone, Business, Datacenters, Management; Traffic Matrices built from flow records exported by routers]
Page 15
Massive Dataset: Flow Records
◊ IP Flow: set of packets with common key observed close in time
◊ Flow Key: IP src/dst address, TCP/UDP ports, ToS, … [64 to 104+ bits]
◊ Flow Records:
– Protocol-level summaries of flows, compiled and exported by routers
– Flow key, packet and byte counts, first/last packet time, some router state
– Realizations: Cisco NetFlow, IETF standards
◊ Scale: 100s of terabytes of flow records are generated daily in a large ISP
◊ Used to manage network over range of timescales:
– Capacity planning (months), …, detecting network attacks (seconds)
◊ Analysis tasks
– Easy: time series of predetermined aggregates (e.g. address prefixes)
– Hard: fast queries over exploratory selectors, history, communications subgraphs
[Figure: packets grouped into flow1 … flow4 along a time axis]
Page 16
Flows, Flow Records and Sampling
◊ Two types of sampling used in practice for internet traffic:
1. Sampling packet stream in router prior to forming flow records
□ Limits the rate of lookups of packet key in flow cache
□ Realized as Packet Sampled NetFlow (more later…)
2. Downstream sampling of flow records in collection infrastructure
□ Limits transmission bandwidth, storage requirements
□ Realized in ISP measurement collection infrastructure (more later…)
◊ Two cases illustrative of general property
– Different underlying distributions require different sample designs
– Statistical optimality sometimes limited by implementation constraints
□ Availability of router storage, processing cycles
Page 17
Abstraction: Keyed Data Streams
◊ Data Model: objects are keyed weights
– Objects (x, k): weight x; key k
□ Example 1: objects = packets, x = bytes, k = key (source/destination)
□ Example 2: objects = flows, x = packets or bytes, k = key
□ Example 3: objects = account updates, x = credit/debit, k = account ID
◊ Stream of keyed weights, {(xi, ki) : i = 1, 2, …, n}
◊ Generic query: subset sums
– X(S) = Σi∈S xi for S ⊂ {1, 2, …, n}, i.e. total weight of index subset S
– Typically S = S(K) = {i : ki ∈ K}: objects with keys in K
□ Example 1, 2: X(S(K)) = total bytes to given IP dest address / UDP port
□ Example 3: X(S(K)) = total balance change over set of accounts
◊ Aim: compute a fixed-size summary of the stream that can be used to estimate arbitrary subset sums with known error bounds
Page 18
Inclusion Sampling and Estimation
◊ Horvitz-Thompson Estimation:
– Object of size xi sampled with probability pi
– Unbiased estimate x'i = xi/pi (if sampled), 0 if not sampled: E[x'i] = xi
◊ Linearity:
– Estimate of subset sum = sum of matching estimates
– Subset sum X(S) = Σi∈S xi is estimated by X'(S) = Σi∈S x'i
◊ Accuracy:
– Exponential bounds: Pr[|X'(S) − X(S)| > δX(S)] ≤ exp[−g(δ)X(S)]
– Confidence intervals: X(S) ∈ [X−(ε), X+(ε)] with probability 1 − ε
◊ Future proof:
– Don't need to know queries at time of sampling
□ "When/where did that suspicious UDP port first become so active?"
□ "Which is the most active IP address within that anomalous subnet?"
– Retrospective estimate: subset sum over relevant key set
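The estimator above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's implementation; the function names (`poisson_sample`, `estimate_subset_sum`) are hypothetical.

```python
import random

def poisson_sample(stream, p_of):
    """Poisson sampling: keep object (x, k) independently with probability p_of(x)."""
    return [(x, k, p_of(x)) for (x, k) in stream if random.random() < p_of(x)]

def estimate_subset_sum(sample, key_pred):
    """Horvitz-Thompson: sum x/p over sampled items whose key matches the predicate."""
    return sum(x / p for (x, k, p) in sample if key_pred(k))
```

Any subset sum over keys can be estimated after the fact, which is exactly the "future proof" property: the query predicate is supplied only at estimation time.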
Page 19
Independent Stream Sampling
◊ Bernoulli Sampling
– IID sampling of objects with some probability p
– Sampled weight x has HT estimate x/p
◊ Poisson Sampling
– Weight xi sampled with probability pi; HT estimate xi/pi
◊ When to use Poisson vs. Bernoulli sampling?
– Elephants and mice: Poisson allows probability to depend on weight…
◊ What is the best choice of probabilities for a given stream {xi}?
Page 20
Bernoulli Sampling
◊ The easiest possible case of sampling: all weights are 1
– N objects, and want to sample k from them uniformly
– Each possible subset of k should be equally likely
◊ Uniformly sample an index from N (without replacement) k times
– Some subtleties: truly random numbers from [1…N] on a computer?
– Assume that random number generators are good enough
◊ Common trick in DB: assign a random number to each item and sort
– Costly if N is very big, but so is random access
◊ Interesting problem: take a single linear scan of data to draw sample
– Streaming model of computation: see each element once
– Application: IP flow sampling, too many (for us) to store
– (For a while) common tech interview question
Page 21
Reservoir Sampling
"Reservoir sampling" described by [Knuth 69, 81]; enhancements [Vitter 85]
◊ Fixed-size k uniform sample from arbitrary-size N stream in one pass
– No need to know stream size in advance
– Include first k items w.p. 1
– Include item n > k with probability pn = k/n
□ Pick j uniformly from {1, 2, …, n}
□ If j ≤ k, swap item n into location j in reservoir, discard replaced item
◊ Neat proof shows the uniformity of the sampling method:
– Let Sn = sample set after n arrivals
– New item n: selection probability Pr[n ∈ Sn] = pn := k/n
– Previously sampled item m (< n), by induction:
m ∈ Sn−1 w.p. pn−1 ⇒ m ∈ Sn w.p. pn−1 · (1 − pn/k) = pn
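The one-pass algorithm above is short enough to give in full; a minimal Python sketch (function name hypothetical):

```python
import random

def reservoir_sample(stream, k):
    """One-pass uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # first k items kept with probability 1
        else:
            j = random.randrange(n)     # uniform in {0, ..., n-1}
            if j < k:                   # happens with probability k/n
                reservoir[j] = item     # evict a uniformly chosen resident
    return reservoir
```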
Page 22
Reservoir Sampling: Skip Counting
◊ Simple approach: check each item in turn
– O(1) per item
– Fine if computation time < interarrival time
– Otherwise build up computation backlog O(N)
◊ Better: "skip counting"
– Find random index m(n) of next selection > n
– Distribution: Pr[m(n) ≤ m] = 1 − (1 − pn+1)(1 − pn+2)…(1 − pm)
◊ Expected number of selections from stream is k + Σk<m≤N pm = k + Σk<m≤N k/m = O(k(1 + ln(N/k)))
◊ Vitter '85 provided an algorithm with this average running time
Page 23
Reservoir Sampling via Order Sampling
◊ Order sampling a.k.a. bottom-k sample, min-hashing
◊ Uniform sampling of stream into reservoir of size k
◊ Each arrival n: generate one-time random value rn ∈ U[0,1]
– rn also known as hash, rank, tag…
◊ Store the k items with the smallest random tags
Example tags: 0.391 0.908 0.291 0.555 0.619 0.273
– Each item has the same chance of the least tag, so uniform
– Fast to implement via priority queue
– Can run on multiple input streams separately, then merge
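The priority-queue implementation and the merge step can be sketched as follows; this is an illustrative sketch (names hypothetical), keeping the k smallest tags via a max-heap of negated tags:

```python
import heapq
import random

def order_sample(stream, k):
    """Bottom-k order sample: keep the k items with the smallest random tags."""
    heap = []  # max-heap via negated tags; root holds the largest retained tag
    for item in stream:
        tag = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (-tag, item))
        elif tag < -heap[0][0]:              # smaller than largest retained tag
            heapq.heapreplace(heap, (-tag, item))
    return sorted((-negtag, item) for negtag, item in heap)  # (tag, item) pairs

def merge_samples(a, b, k):
    """Samples of separate streams merge by keeping the k smallest tags overall."""
    return sorted(a + b)[:k]
```

Because each stream's sample retains its tags, merging per-stream samples gives exactly the bottom-k sample of the combined stream.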
Page 24
Handling Weights
◊ So far: uniform sampling from a stream using a reservoir
◊ Extend to non-uniform sampling from weighted streams
– Easy case: k = 1
– Sampling probability p(n) = xn/Wn where Wn = Σi=1..n xi
◊ k > 1 is harder
– Can have elements with large weight: would be sampled with prob 1?
◊ A number of different weighted order-sampling schemes have been proposed to realize desired distributional objectives
– Rank rn = f(un, xn) for some function f and un ∈ U[0,1]
– k-mins sketches [Cohen 1997], Bottom-k sketches [Cohen Kaplan 2007]
– [Rosen 1972], Weighted random sampling [Efraimidis Spirakis 2006]
– Order PPS Sampling [Ohlsson 1990, Rosen 1997]
– Priority Sampling [Duffield Lund Thorup 2004], [Alon+DLT 2005]
Page 25
Weighted Random Sampling
◊ Weighted random sampling [Efraimidis Spirakis 06] generalizes min-wise
– For each item draw rn uniformly at random in range [0,1]
– Compute the 'tag' of an item as rn^(1/xn)
– Keep the items with the k largest tags
– Can prove the correctness of the exponential sampling distribution
◊ Can also make efficient via skip counting ideas
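The Efraimidis-Spirakis scheme drops into the same heap-based reservoir pattern; a minimal sketch (function name hypothetical):

```python
import heapq
import random

def weighted_sample(stream, k):
    """Efraimidis-Spirakis: tag each (item, weight) with u**(1/w); keep k largest tags."""
    heap = []  # min-heap of (tag, item): root is the smallest retained tag
    for item, weight in stream:
        tag = random.random() ** (1.0 / weight)   # heavier items get tags nearer 1
        if len(heap) < k:
            heapq.heappush(heap, (tag, item))
        elif tag > heap[0][0]:
            heapq.heapreplace(heap, (tag, item))
    return [item for _, item in heap]
```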
Page 26
Priority Sampling
◊ Each item xi given priority zi = xi/ri with ri uniform random in (0,1]
◊ Maintain reservoir of k+1 items (xi, zi) of highest priority
◊ Estimation
– Let z* = (k+1)st highest priority
– Top-k priority items: weight estimate x'i = max{xi, z*}
– All other items: weight estimate zero
◊ Statistics and bounds
– x'i unbiased; zero covariance: Cov[x'i, x'j] = 0 for i ≠ j
– Relative variance for any subset sum ≤ 1/(k−1) [Szegedy, 2006]
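An offline sketch of the scheme (a streaming version would keep only the k+1 highest-priority items in a heap); the function name is hypothetical:

```python
import random

def priority_sample(stream, k):
    """Priority sampling: z_i = x_i / r_i, r_i uniform in (0,1]; keep k highest priorities."""
    items = [(x / (1.0 - random.random()), x) for x in stream]  # 1 - random() lies in (0, 1]
    items.sort(reverse=True)                                    # highest priority first
    if len(items) <= k:
        return [(x, x) for _, x in items]                       # no threshold needed: exact
    z_star = items[k][0]                                        # (k+1)st highest priority
    return [(x, max(x, z_star)) for _, x in items[:k]]          # (weight, HT estimate)
```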
Page 27
Priority Sampling in Databases
◊ One-Time Sample Preparation
– Compute priorities of all items, sort in decreasing priority order
□ No discard
◊ Sample and Estimate
– Estimate any subset sum X(S) = Σi∈S xi by X'(S) = Σi∈S' x'i for some S' ⊂ S
– Method: select items in decreasing priority order
◊ Two variants: bounded variance or complexity
1. S' = first k items from S: relative variance bounded ≤ 1/(k−1)
□ x'i = max{xi, z*} where z* = (k+1)st highest priority in S
2. S' = items from S in first k: execution time O(k)
□ x'i = max{xi, z*} where z* = (k+1)st highest priority overall
[Alon et al., 2005]
Page 28
Making Stream Samples Smarter
◊ Observation: we see the whole stream, even if we can't store it
– Can keep more information about sampled items if repeated
– Simple information: if item sampled, count all repeats
◊ Counting Samples [Gibbons & Matias 98]
– Sample new items with fixed probability p, count repeats as ci
– Unbiased estimate of total count: 1/p + (ci − 1)
◊ Sample and Hold [Estan & Varghese 02]: generalize to weighted keys
– New key with weight b sampled with probability 1 − (1−p)^b
◊ Lower variance compared with independent sampling
– But sample size will grow as p·n
◊ Adaptive sample and hold: reduce p when needed
– "Sticky sampling": geometric decreases in p [Manku, Motwani 02]
– Much subsequent work tuning the decrease in p to maintain sample size
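A minimal sketch of counting samples with its unbiased estimator (names hypothetical); note that once a key is held, every later repeat is counted exactly:

```python
import random

def counting_sample(stream, p):
    """Counting samples: sample a new key w.p. p; once held, count every repeat."""
    counts = {}
    for key in stream:
        if key in counts:
            counts[key] += 1                 # already held: count all repeats
        elif random.random() < p:
            counts[key] = 1                  # newly sampled key
    return counts

def estimate_count(counts, key, p):
    """Unbiased estimate of a key's total count: 1/p + (c - 1) if held, else 0."""
    c = counts.get(key)
    return 0.0 if c is None else 1.0 / p + (c - 1)
```

The 1/p term accounts, in expectation, for the repeats missed before the key was first sampled.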
Page 29
Sketch Guided Sampling
◊ Go further: avoid sampling the heavy keys as much
– Uniform sampling will pick from the heavy keys again and again
◊ Idea: use an oracle to tell when a key is heavy [Kumar Xu 06]
– Adjust sampling probability accordingly
◊ Can use a "sketch" data structure to play the role of oracle
– Like a hash table with collisions, tracks approximate frequencies
– E.g. (Counting) Bloom Filters, Count-Min Sketch
◊ Track probability with which key is sampled, use HT estimators
– Set probability of sampling key with (estimated) weight w as 1/(1 + εw) for parameter ε: decreases as w increases
– Decreasing ε improves accuracy, increases sample size
Page 30
Challenges for Smart Stream Sampling
◊ Current router constraints
– Flow tables maintained in fast, expensive SRAM
□ To support per-packet key lookup at line rate
◊ Implementation requirements
– Sample and Hold: still needs per-packet lookup
– Sampled NetFlow: (uniform) sampling reduces lookup rate
□ Easier to implement despite inferior statistical properties
◊ Long development times to realize new sampling algorithms
◊ Similar concerns affect sampling in other applications
– Processing large amounts of data needs awareness of hardware
– Uniform sampling means no coordination needed in distributed setting
Page 31
Future for Smarter Stream Sampling
◊ Software Defined Networking
– Current: proprietary software running on special vendor equipment
– Future: open software and protocols on commodity hardware
◊ Potentially offers flexibility in traffic measurement
– Allocate system resources to measurement tasks as needed
– Dynamic reconfiguration, fine-grained tuning of sampling
– Stateful packet inspection and sampling for network security
◊ Technical challenges:
– High rate packet processing in software
– Transparent support from commodity hardware
– OpenSketch: [Yu, Jose, Miao, 2013]
◊ Same issues in other applications: use of commodity programmable HW
Page 32
Stream Sampling: Sampling as Cost Optimization
Page 33
Matching Data to Sampling Analysis
◊ Generic problem 1: counting objects: weight xi = 1
– Bernoulli (uniform) sampling with probability p works fine
– Estimated subset count X'(S) = #{samples in S}/p
– Relative Variance(X'(S)) = (1/p − 1)/X(S)
□ given p, get any desired accuracy for large enough S
◊ Generic problem 2: xi in Pareto distribution, a.k.a. 80-20 law
– Small proportion of objects possess a large proportion of total weight
□ How best to sample objects to accurately estimate weight?
– Uniform sampling?
□ likely to omit heavy objects ⇒ big hit on accuracy
□ making selection set S large doesn't help
– Select m largest objects?
□ biased, and smaller objects systematically ignored
Page 34
Heavy Tails in the Internet and Beyond
◊ File sizes in storage
◊ Bytes and packets per network flow
◊ Degree distributions in web graph, social networks
Page 35
Non-Uniform Sampling
◊ Extensive literature: see book by [Tillé, "Sampling Algorithms", 2006]
◊ Predates "Big Data"
– Focus on statistical properties, not so much computational
◊ IPPS: Inclusion Probability Proportional to Size
– Variance Optimal for HT estimation
– Sampling probabilities for multivariate version: [Chao 1982, Tillé 1996]
– Efficient stream sampling algorithm: [Cohen et al. 2009]
Page 36
Costs of Non-Uniform Sampling
◊ Independent sampling from n objects with weights {x1, …, xn}
◊ Goal: find the "best" sampling probabilities {p1, …, pn}
◊ Horvitz-Thompson: unbiased estimation of each xi by x'i = xi/pi if i selected, 0 otherwise
◊ Two costs to balance:
1. Estimation Variance: Var(x'i) = xi²(1/pi − 1)
2. Expected Sample Size: Σi pi
◊ Minimize Linear Combination Cost: Σi (xi²(1/pi − 1) + z² pi)
– z expresses relative importance of small sample vs. small variance
Page 37
Minimal Cost Sampling: IPPS
IPPS: Inclusion Probability Proportional to Size
◊ Minimize Cost Σi (xi²(1/pi − 1) + z² pi) subject to 1 ≥ pi ≥ 0
◊ Solution: pi = pz(xi) = min{1, xi/z}
– small objects (xi < z) selected with probability proportional to size
– large objects (xi ≥ z) selected with probability 1
– Call z the "sampling threshold"
– Unbiased estimator xi/pi = max{xi, z}
◊ Perhaps reminiscent of importance sampling, but not the same:
– make no assumptions concerning the distribution of the x
[Plot: pz(x) rises linearly from 0, reaching 1 at x = z, then stays at 1]
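For a fixed threshold z the scheme is a few lines; a minimal sketch (function name hypothetical), returning (index, HT estimate) pairs:

```python
import random

def ipps_sample(weights, z):
    """IPPS / threshold sampling: include weight x w.p. min(1, x/z); HT estimate max(x, z)."""
    sample = []
    for i, x in enumerate(weights):
        p = min(1.0, x / z)
        if random.random() < p:
            sample.append((i, max(x, z)))   # items with x >= z are always kept, estimate = x
    return sample
```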
Page 38
Error Estimates and Bounds
◊ Variance based:
– HT sampling variance for single object of weight xi
□ Var(x'i) = xi²(1/pi − 1) = xi²(1/min{1, xi/z} − 1) ≤ z·xi
– Subset sum X(S) = Σi∈S xi is estimated by X'(S) = Σi∈S x'i
□ Var(X'(S)) ≤ z·X(S)
◊ Exponential bounds
– E.g. Pr[X'(S) = 0] ≤ exp(−X(S)/z)
◊ Bounds are simple and powerful
– depend only on subset sum X(S), not individual constituents
Page 39
Sampled IP Traffic Measurements
◊ Packet Sampled NetFlow
– Sample packet stream in router to limit rate of key lookup: uniform 1/N
– Aggregate sampled packets into flow records by key
◊ Model: packet stream of (key, byte size) pairs {(bi, ki)}
◊ Packet sampled flow record (b, k) where b = Σ{bi : i sampled ∧ ki = k}
– HT estimate b·N of total bytes in flow
◊ Downstream sampling of flow records in measurement infrastructure
– IPPS sampling, probability min{1, b·N/z}
◊ Chained variance bound for any subset sum X of flows
– Var(X') ≤ (z + N·bmax)·X where bmax = maximum packet byte size
– Regardless of how packets are distributed amongst flows
[Duffield, Lund, Thorup, IEEE ToIT, 2004]
Page 40
Estimation Accuracy in Practice
◊ Estimate any subset sum comprising at least some fraction f of weight
◊ Suppose: sample size m
◊ Analysis: typical estimation error ε (relative standard deviation) obeys ε ≤ 1/√(f·m)
◊ Example: m = 2^16 samples, the storage needed for aggregates over 16-bit address prefixes
– Estimate fraction f = 0.1% with typical relative error 12%
□ But sampling gives more flexibility to estimate traffic within aggregates
[Plot: relative standard deviation ε vs. fraction f, for m = 2^16 samples]
Page 41
Heavy Hitters: Exact vs. Aggregate vs. Sampled
◊ Sampling does not tell you where the interesting features are
– But does speed up the ability to find them with existing tools
◊ Example: Heavy Hitter Detection
– Setting: flow records reporting a 10GB/s traffic stream
– Aim: find Heavy Hitters = IP prefixes comprising ≥ 0.1% of traffic
– Response time needed: 5 minutes
◊ Compare:
– Exact: 10GB/s × 5 minutes yields upwards of 300M flow records
– Aggregate: 64k aggregates over 16-bit prefixes: no deeper drill-down possible
– Sampled: 64k flow records: any aggregate ≥ 0.1% accurate to 10%
Page 42
Cost Optimization for Sampling
Several different approaches optimize for different objectives:
1. Fixed Sample Size IPPS Sample
– Variance Optimal (VarOpt) sampling: minimal variance unbiased estimation
2. Structure Aware Sampling
– Improve estimation accuracy for subnet queries using topological cost
3. Fair Sampling
– Adaptively balance sampling budget over subpopulations of flows
– Uniform estimation accuracy regardless of subpopulation size
4. Stable Sampling
– Increase stability of sample set by imposing cost on changes
Page 43
IPPS Stream Reservoir Sampling
◊ Each arriving item:
– Provisionally include item in reservoir
– If m+1 items, discard 1 item randomly
□ Calculate threshold z to sample m items on average: z solves Σi pz(xi) = m
□ Discard item i with probability qi = 1 − pz(xi) = 1 − min{1, xi/z}
□ Adjust the m surviving xi with Horvitz-Thompson: x'i = xi/pi = max{xi, z}
◊ Efficient implementation:
– Computational cost O(log m) per item, amortized cost O(log log m)
[Cohen, Duffield, Lund, Kaplan, Thorup; SODA 2009, SIAM J. Comput. 2011]
[Example figure, m = 9: on arrival of x10, recalculate threshold z to solve Σi=1..10 min{1, xi/z} = 9; discard one item with probabilities qi = 1 − min{1, xi/z}; adjust surviving weights to x'i = max{xi, z}]
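The threshold step above, solving Σi min{1, xi/z} = m, can be sketched naively as below. This O(n log n) version recomputes from scratch, unlike the O(log m)-per-item reservoir algorithm of the cited paper; the function name is hypothetical.

```python
def varopt_threshold(weights, m):
    """Solve sum_i min(1, x_i/z) = m for z, given more than m positive weights."""
    xs = sorted(weights, reverse=True)
    suffix = sum(xs)                  # running sum of xs[t:]
    for t in range(m):
        z = suffix / (m - t)          # candidate: first t items included w.p. 1
        if xs[t] <= z:                # consistent: remaining items all fall below z
            return z
        suffix -= xs[t]
    raise ValueError("need more than m positive weights")
```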
Page 44
Structure (Un)Aware Sampling
◊ Sampling is oblivious to structure in keys (IP address hierarchy)
– Estimation disperses the weight of discarded items to surviving samples
◊ Queries are structure aware: subset sums over related keys (IP subnets)
– Accuracy on the LHS subtree is decreased by discarding weight on the RHS
[Figure: binary tree over key prefixes: ∅; 0, 1; 00, 01, 10, 11; 000 … 111]
Page 45
Localizing Weight Redistribution
◊ Initial weight set {xi : i ∈ S} for some S ⊂ Ω
– E.g. Ω = possible IP addresses, S = observed IP addresses
◊ Attribute "range cost" C({xi : i ∈ R}) for each weight subset R ⊆ S
– Possible factors for range cost:
□ Sampling variance
□ Topology, e.g. height of lowest common ancestor
– Heuristic: R* = nearest neighbor pair {xi, xj} of minimal xi·xj
◊ Sample k items from S:
– Progressively remove one item from the subset with minimal range cost:
– While (|S| > k)
□ Find R* ⊆ S of minimal range cost
□ Remove a weight from R* with VarOpt
[Cohen, Cormode, Duffield; PVLDB 2011]
[Figure: binary tree over key prefixes; no change outside the subtree below the closest ancestor]
– Order of magnitude reduction in average subnet error vs. VarOpt
Page 46
Fair Sampling Across Subpopulations
◊ Analysis queries often focus on specific subpopulations
– E.g. networking: different customers, user applications, network paths
◊ Wide variation in subpopulation size
– 5 orders of magnitude variation in traffic on interfaces of access router
◊ If uniform sampling across subpopulations:
– Poor estimation accuracy on subset sums within small subpopulations
[Figure: color = subpopulation; interesting items occur in proportion to subpopulation size. Under uniform sampling across subpopulations, it is difficult to track the proportion of interesting items within small subpopulations.]
Page 47
Fair Sampling Across Subpopulations
◊ Minimize relative variance by sharing budget m over subpopulations
– Total n objects in subpopulations n1, …, nd with Σi ni = n
– Allocate budget mi to each subpopulation ni with Σi mi = m
◊ Minimize average population relative variance R = const · Σi 1/mi
◊ Theorem:
– R minimized when {mi} are the Max-Min Fair share of m under demands {ni}
◊ Streaming
– Problem: don't know subpopulation sizes {ni} in advance
◊ Solution: progressive fair sharing as reservoir sample
– Provisionally include each arrival
– Discard 1 item as VarOpt sample from any maximal subpopulation
◊ Theorem [Duffield; Sigmetrics 2012]:
– Max-Min Fair at all times; equality in distribution with VarOpt samples {mi from ni}
Page 48
Stable Sampling
◊ Setting: sampling a population over successive periods
◊ Sample independently at each time period?
– Cost associated with sample churn
– Time series analysis of set of relatively stable keys
◊ Find sampling probabilities through cost minimization
– Minimize Cost = Estimation Variance + z · E[#Churn]
◊ Size m sample with maximal expected churn D
– weights {xi}, previous sampling probabilities {pi}
– find new sampling probabilities {qi} to minimize cost of taking m samples
– Minimize Σi xi²/qi subject to 1 ≥ qi ≥ 0, Σi qi = m and Σi |pi − qi| ≤ D
[Cohen, Cormode, Duffield, Lund 13]
Page 49
Summary of Part 1
◊ Sampling as a powerful, general summarization technique
◊ Unbiased estimation via Horvitz-Thompson estimators
◊ Sampling from streams of data
– Uniform sampling: reservoir sampling
– Weighted generalizations: sample and hold, counting samples
◊ Advances in stream sampling
– The cost principle for sample design, and IPPS methods
– Threshold, priority and VarOpt sampling
– Extending the cost principle:
□ structure aware, fair sampling, stable sampling, sketch guided
Page 50
Outline
◊ Motivating application: sampling in large ISP networks
◊ Basics of sampling: concepts and estimation
◊ Stream sampling: uniform and weighted case
– Variations: Concise sampling, sample and hold, sketch guided
BREAK
◊ Advanced stream sampling: sampling as cost optimization
– VarOpt, priority, structure aware, and stable sampling
◊ Hashing and coordination
– Bottom-k, consistent sampling and sketch-based sampling
◊ Graph sampling
– Node, edge and subgraph sampling
◊ Conclusion and future directions
Page 51
Data Scale: Hashing and Coordination
Page 52
Sampling from the Set of Items
◊ Sometimes need to sample from the distinct set of objects
– Not influenced by the weight or number of occurrences
– E.g. sample from the distinct set of flows, regardless of weight
◊ Need a sampling method that is invariant to duplicates
◊ Basic idea: build a function to determine what to sample
– A "random" function f(k) → R
– Use f(k) to make a sampling decision: consistent decision for same key
Page 53
Permanent Random Numbers
◊ Often convenient to think of f as giving "permanent random numbers"
– Permanent: assigned once and for all
– Random: treat as if fully randomly chosen
◊ The permanent random number is used in multiple sampling steps
– Same "random" number each time, so consistent (correlated) decisions
◊ Example: use PRNs to draw a sample of s from N via order sampling
– If s << N, small chance of seeing same element in different samples
– Via PRN, stronger chance of seeing same element
□ Can track properties over time, gives a form of stability
◊ Easiest way to generate PRNs: apply a hash function to the element id
– Ensures PRN can be generated with minimal coordination
– Explicitly storing a random number for all observed keys does not scale
Page 54
Hash Functions
Many possible choices of hashing functions:
◊ Cryptographic hash functions: SHA-1, MD5, etc.
– Results appear "random" for most tests (using seed/salt)
– Can be slow for high speed / high volume applications
– Full power of cryptographic security not needed for most statistical purposes
□ Although possibly some trade-offs in robustness to subversion if not used
◊ Heuristic hash functions: srand(), mod
– Usually pretty fast
– May not be random enough: structure in keys may cause collisions
◊ Mathematical hash functions: universal hashing, k-wise hashing
– Have precise mathematical properties on probabilities
– Can be implemented to be very fast
Page 55
Mathematical Hashing
◊ k-wise independence: Pr[h(x1)=y1 ∧ h(x2)=y2 ∧ … ∧ h(xt)=yt] = 1/R^t
– Simple function: ct·x^t + ct−1·x^(t−1) + … + c1·x + c0 mod P
– For fixed prime P, randomly chosen c0 … ct
– Can be made very fast (choose P to be a Mersenne prime to simplify mods)
◊ (Twisted) tabulation hashing [Thorup Patrascu 13]
– Interpret each key as a sequence of short characters, e.g. 8 × 8 bits
– Use a "truly random" look-up table for each character (so 8 × 256 entries)
– Take the exclusive-OR of the relevant table values
– Fast, and fairly compact
– Strong enough for many applications of hashing (hash tables etc.)
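Both constructions are short; a minimal sketch (names hypothetical, untwisted tabulation for 32-bit keys rather than the twisted variant of the cited paper):

```python
import random

P = (1 << 61) - 1  # Mersenne prime 2^61 - 1: "mod P" is cheap in optimized code

def make_kwise_hash(t):
    """Degree-(t-1) polynomial mod a Mersenne prime: a t-wise independent family."""
    coeffs = [random.randrange(P) for _ in range(t)]
    def h(x):
        acc = 0
        for c in coeffs:            # Horner's rule: ((c_t x + c_{t-1}) x + ...) mod P
            acc = (acc * x + c) % P
        return acc
    return h

def make_tabulation_hash():
    """Simple tabulation hash for 32-bit keys: 4 tables of 256 random 32-bit entries."""
    tables = [[random.getrandbits(32) for _ in range(256)] for _ in range(4)]
    def h(x):
        out = 0
        for i in range(4):
            out ^= tables[i][(x >> (8 * i)) & 0xFF]  # XOR per-byte table lookups
        return out
    return h
```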
Page 56
Bottom-k Sampling
◊ Sample from the set of distinct keys
– Hash each key using an appropriate hash function
– Keep information on the keys with the s smallest hash values
– Think of as order sampling with PRNs…
◊ Useful for estimating properties of the support set of keys
– Evaluate any predicate on the sampled set of keys
◊ Same concept, several different names:
– Bottom-k sampling, min-wise hashing, k-minimum values
Example hash values (repeated keys get repeated values): 0.391 0.908 0.291 0.391 0.391 0.273
Page 57
Subset Size Estimation from Bottom-k
◊ Want to estimate the fraction t = |A|/|D|
– D is the observed set of data
– A is an arbitrary subset given later
– E.g. fraction of customers who are sports fans from midwest aged 18-35
◊ Simple algorithm:
– Run bottom-k to get sample set S, estimate t' = |A∩S|/s
– Error decreases as 1/√s
– Analysis due to [Thorup 13]: simple hash functions suffice for big enough s
Page 58
Similarity Estimation
◊ How similar are two sets, A and B?
◊ Jaccard coefficient: |A∩B|/|A∪B|
– 1 if A, B identical, 0 if they are disjoint
– Widely used, e.g. to measure document similarity
◊ Simple approach: sample an item uniformly from A and B
– Probability of seeing same item from both: |A∩B|/(|A| × |B|)
– Chance of seeing same item too low to be informative
◊ Coordinated sampling: use same hash function to sample from A, B
– Probability that same item is sampled: |A∩B|/|A∪B|
– Repeat: the average number of agreements gives Jaccard coefficient
– Concentration: (additive) error scales as 1/√s
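The "repeat with s hash functions" step can be sketched as min-hash signatures; this is an illustrative sketch (names hypothetical) using simple linear hashes rather than the approximately min-wise families discussed next:

```python
import random

def make_minhash_sigs(num_hashes):
    """Build s coordinated samplers h_i(x) = (a_i*x + b_i) mod P, P = 2^61 - 1."""
    P = (1 << 61) - 1
    params = [(random.randrange(1, P), random.randrange(P)) for _ in range(num_hashes)]
    def signature(items):
        # For each hash function, record the item achieving the minimum hash value
        return [min((a * hash(x) + b) % P for x in items) for a, b in params]
    return signature

def jaccard_estimate(sig_a, sig_b):
    """Fraction of coordinated samples (min-hashes) that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```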
Page 59
Technical Issue: Min-wise Hashing
◊ For the analysis to work, the hash function must be fully random
– All possible permutations of the input are equally likely
– Unrealistic in practice: the description of such a function is huge
◊ "Simple" hash functions don't work well
– Universal hash functions are too skewed
◊ Need hash functions that are "approximately min-wise"
– Probability of sampling a subset is almost uniform
– Tabulation hashing is a simple way to achieve this
Page 60
Bottom-k Hashing for F0 Estimation
◊ F0 is the number of distinct items in the stream
– a fundamental quantity with many applications
– E.g. number of distinct flows seen on a backbone link
◊ Let m be the domain of stream elements: each data item is in [1…m]
◊ Pick a random (pairwise independent) hash function h: [m] → [R]
◊ Apply bottom-k sampling under hash function h
– Let vs = s'th smallest (distinct) value of h(i) seen
◊ If n = F0 < s, give exact answer, else estimate F'0 = sR/vs
– vs/R ≈ fraction of hash domain occupied by the s smallest
[Figure: hash domain [0, R] with vs marking the s'th smallest hash value]
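The estimator is a direct corollary of bottom-k sampling; a minimal sketch (function name hypothetical, using a linear hash in place of a formally pairwise-independent family):

```python
import random

def f0_estimate(stream, s):
    """KMV / bottom-k distinct count: F0 ~ s*R / (s-th smallest hash value)."""
    R = (1 << 61) - 1
    a, b = random.randrange(1, R), random.randrange(R)
    h = lambda x: (a * x + b) % R
    smallest = sorted({h(x) for x in stream})[:s]  # set() dedupes repeated items
    if len(smallest) < s:                          # fewer than s distinct: exact
        return len(smallest)
    return s * R / smallest[-1]                    # s-th smallest value is v_s
```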
Page 61
Analysis of F0 Algorithm
◊ Can show that it is unlikely to have an overestimate
– Would require too many items hashed below a fixed value
– Can treat each event of an item hashing too low as independent
◊ Similar outline to show it is unlikely to have an underestimate
◊ (Relative) error scales as 1/√s
◊ Space cost:
– Store s hash values, so O(s log m) bits
– Can improve to O(s + log m) with additional hashing tricks
– See also "Streamed Approximate Counting of Distinct Elements", KDD'14
[Figure: hash domain [0, R] showing the overestimate threshold sR/((1+ε)n) and vs]
Page 62
Consistent Weighted Sampling
◊ Want to extend bottom-k results when data has weights
◊ Specifically, two datasets A and B where each element has a weight
– Weights are aggregated: we see the whole weight of an element together
◊ Weighted Jaccard: want the probability that the same key is chosen by both to be Σi min(A(i), B(i)) / Σi max(A(i), B(i))
◊ Sampling method should obey uniformity and consistency
– Uniformity: element i picked from A with probability proportional to A(i)
– Consistency: if i is picked from A, and B(i) > A(i), then i also picked for B
◊ Simple solution: assuming integer weights, treat weight A(i) as A(i) unique (different) copies of element i, apply bottom-k
– Limitations: slow, unscalable when weights can be large
– Need to rescale fractional weights to integral multiples
Page 63
Consistent Weighted Sampling
◊ Efficient sampling distributions exist achieving uniformity and consistency
◊ Basic idea: consider a weight w as w/Δ different elements
– Compute the probability that any of these achieves the minimum value
– Study the limiting distribution as Δ → 0
◊ Consistent Weighted Sampling [Manasse, McSherry, Talwar 07], [Ioffe 10]
– Use hash of item to determine which points are sampled via a careful transform
– Many details needed to contain bit-precision, allow fast computation
◊ Other combinations of key weights are possible [Cohen Kaplan Sen 09]
– Min of weights, max of weights, sum of (absolute) differences
Page 64
Trajectory Sampling
◊ Aims [Duffield Grossglauser 01]:
– Probe packets at each router they traverse
– Collate reports to infer link loss and latency
– Need to sample; independent sampling is no use
◊ Hash-based sampling:
– All routers/packets: compute hash h of invariant packet fields
– Sample if h ∈ some H and report to collector; tune sample rate with |H|
– Use high entropy packet fields as hash input, e.g. IP addresses, ID field
– Hash function choice trades off speed, uniformity & security
◊ Standardized in Internet Engineering Task Force (IETF)
– Service providers need consistency across different vendors
– Several hash functions standardized, extensible
– Same issues arise in other big data ecosystems (apps and APIs)
Page 65
Hash Sampling in Network Management
◊ Many different network subsystems used to provide service
– Monitored through event logs, passive measurement of traffic & protocols
– Need cross-system sample that captures full interaction between network and a representative set of users
◊ Ideal: hash-based selection based on a common identifier
◊ Administrative challenges! Organizational diversity
◊ Timeliness challenge:
– Selection identifier may not be present at a measurement location
– Example: common identifier = anonymized customer id
□ Passive traffic measurement based on IP address
□ Mapping of IP address to customer ID not available remotely
□ Attribution of traffic IP address to a user difficult to compute at line speed
Page 66
Advanced Sampling from Sketches
◊ Difficult case: inputs with positive and negative weights
◊ Want to sample based on the overall frequency distribution
– Sample from support set of n possible items
– Sample proportional to (absolute) total weights
– Sample proportional to some function of weights
◊ How to do this sampling effectively?
– Challenge: may be many elements with positive and negative weights
– Aggregate weights may end up zero: how to find the non-zero weights?
◊ Recent approach: L0 sampling
– L0 sampling enables novel "graph sketching" techniques
– Sketches for connectivity, sparsifiers [Ahn, Guha, McGregor 12]
Page 67
L0 Sampling
◊ L0 sampling: sample item i with prob ≈ fi⁰/F₀
– i.e., sample (near) uniformly from items with non-zero frequency
◊ General approach [Frahling, Indyk, Sohler 05; C., Muthu, Rozenbaum 05]:
– Sub-sample all items (present or not) with probability p
– Generate a sub-sampled vector of frequencies fp
– Feed fp to a k-sparse recovery data structure
□ Allows reconstruction of fp if F₀ < k
– If fp is k-sparse, sample from reconstructed vector
– Repeat in parallel for exponentially shrinking values of p
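The level-based recipe above can be sketched end to end. This toy version (class name, k, and level count are illustrative) keeps an explicit dictionary of surviving net frequencies per level in place of a real k-sparse recovery sketch, so it demonstrates the sampling logic — consistent sub-sampling, cancellation of positive and negative updates, picking at the sparsest useful level — but not the sublinear space of the real data structure.

```python
import hashlib, random

class L0Sampler:
    """Toy L0 sampler over a stream of (item, weight) updates, where
    weights may be negative and can cancel. Items are consistently
    sub-sampled at levels p = 1, 1/2, 1/4, ... via a fixed hash; at
    query time we use a level whose surviving support is small."""
    def __init__(self, levels=32, k=8, seed=0):
        self.levels, self.k, self.seed = levels, k, seed
        self.freq = [dict() for _ in range(levels)]  # level -> net frequency

    def _level(self, item):
        # The item survives at levels 0..j, where j is set by its hash,
        # so inclusion is consistent across all updates to the item.
        h = int.from_bytes(
            hashlib.sha256(f"{self.seed}:{item}".encode()).digest()[:8], "big")
        j = 0
        while j + 1 < self.levels and (h >> j) & 1:
            j += 1
        return j

    def update(self, item, weight):
        for j in range(self._level(item) + 1):
            f = self.freq[j]
            f[item] = f.get(item, 0) + weight
            if f[item] == 0:
                del f[item]          # cancelled out: leaves the support

    def sample(self, rng=random):
        # Use the sparsest non-empty level with at most k survivors
        for j in reversed(range(self.levels)):
            support = sorted(self.freq[j])
            if 0 < len(support) <= self.k:
                return rng.choice(support)
        return None
```

Note how an item whose updates sum to zero silently disappears from every level — exactly the "aggregate weights may end up zero" difficulty that rules out naive sampling.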
Page 68
Sampling Process
◊ Exponential set of probabilities, p = 1, ½, ¼, 1/8, 1/16 … 1/U
– Let N = F₀ = |{i : fi ≠ 0}|
– Want there to be a level where k-sparse recovery will succeed
– At level p, expected number of items selected S is Np
– Pick level p so that k/3 < Np ≤ 2k/3
◊ Chernoff bound: with probability exponential in k, 1 ≤ S ≤ k
– Pick k = O(log 1/δ) to get 1−δ probability
[Figure: sub-sampling levels from p = 1 down to p = 1/U, each feeding a k-sparse recovery structure]
Page 69
Hash-based sampling summary
◊ Use hash functions for sampling where some consistency is needed
– Consistency over repeated keys
– Consistency over distributed observations
◊ Hash functions have duality of random and fixed
– Treat as random for statistical analysis
– Treat as fixed for giving consistency properties
◊ Can become quite complex and subtle
– Complex sampling distributions for consistent weighted sampling
– Tricky combination of algorithms for L0 sampling
◊ Plenty of scope for new hashing-based sampling methods
Page 70
Data Scale: Massive Graph Sampling
Page 71
Massive Graph Sampling
◊ “Graph Service Providers”
– Search providers: web graphs (billions of pages indexed)
– Online social networks
□ Facebook: ~10⁹ users (nodes), ~10¹² links
– ISPs: communications graphs
□ From flow records: node = src or dst IP, edge if traffic flows between them
◊ Graph service provider perspective
– Already have all the data, but how to use it?
– Want a general purpose sample that can:
□ Quickly provide answers to exploratory queries
□ Compactly archive snapshots for retrospective queries & baselining
◊ Graph consumer perspective
– Want to obtain a realistic subgraph directly or via crawling/API
Page 72
Retrospective analysis of ISP graphs
◊ Node = IP address
◊ Directed edge = flow from source node to destination node
[Figure: botnet attack graph showing compromise, control, and flooding stages]
• Hard to detect against background
• Known attacks can be detected:
– Signature matching based on partial graphs, flow features, timing
• Unknown attacks are harder to spot:
– exploratory & retrospective analysis
– preserve accuracy if sampling?
Page 73
Goals for Graph Sampling
Crudely divide into three classes of goal:
1. Study local (node or edge) properties
– Average age of users (nodes), average length of conversation (edges)
2. Estimate global properties or parameters of the network
– Average degree, shortest path distribution
3. Sample a “representative” subgraph
– Test new algorithms and learning more quickly than on full graph
◊ Challenges: what properties should the sample preserve?
– The notion of “representative” is very subjective
– Can list properties that should be preserved (e.g. degree dbn, path length dbn), but there are always more…
Page 74
Models for Graph Sampling
Many possible models, but reduce to two for simplicity
(see tutorial by Hasan, Ahmed, Neville, Kompella in KDD 13)
◊ Static model: full access to the graph to draw the sample
– The (massive) graph is accessible in full to make the small sample
◊ Streaming model: edges arrive in some arbitrary order
– Must make sampling decisions on the fly
◊ Other graph models capture different access scenarios
– Crawling model: e.g. exploring the (deep) web, API gives node neighbours
– Adjacency list streaming: see all neighbours of a node together
Page 75
Node and Edge Properties
◊ Gross over-generalization: node and edge properties can be solved using previous techniques
– Sample nodes/edges (in a stream)
– Handle duplicates (same edge many times) via hash-based sampling
– Track properties of sampled elements
□ E.g. count the degree of sampled nodes
◊ Some challenges, e.g. how to sample a node proportional to its degree?
– If degree is known (precomputed), then use these as weights
– Else, sample edges uniformly, then sample each end with probability ½
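The edge-based trick in the last bullet can be sketched directly: a node of degree d lies on d of the |E| edges, so picking a uniform edge and then a uniform endpoint returns it with probability d/(2|E|). (Illustrative helper for a simple undirected edge list.)

```python
import random

def sample_node_by_degree(edges, rng=random):
    """Sample a node with probability proportional to its degree,
    without precomputed degrees: pick an edge uniformly at random,
    then one of its two endpoints with probability 1/2 each."""
    u, v = rng.choice(edges)
    return u if rng.random() < 0.5 else v
```

On the star graph with edges (0,1), (0,2), (0,3), the centre has half the total degree, so it is returned about half the time.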
Page 76
Induced subgraph sampling
◊ Node-induced subgraph
– Pass 1: sample a set of nodes (e.g. uniformly)
– Pass 2: collect all edges incident on sampled nodes
– Can collapse into a single streaming pass
– Can’t know in advance how many edges will be sampled
◊ Edge-induced subgraph
– Sample a set of edges (e.g. uniformly in one pass)
– Resulting graph tends to be sparse, disconnected
◊ Edge-induced variant [Ahmed, Neville, Kompella 13]:
– Take second pass to fill in edges on sampled nodes
– Hack: combine passes to fill in edges on current sample
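The single-pass collapse of node-induced sampling can be sketched with hash-based node selection, so each edge is tested as it streams by without storing the node sample in advance. (Illustrative helper; keeping edges with at least one sampled endpoint just swaps the `and` for `or`.)

```python
import hashlib

def node_induced_sample(edge_stream, p, seed=0):
    """One-pass node-induced subgraph sampling over a stream of edges:
    a node counts as sampled iff its hash falls below p, and an edge
    is kept iff both endpoints are sampled. Hedged sketch."""
    def picked(node):
        h = int.from_bytes(
            hashlib.sha256(f"{seed}:{node}".encode()).digest()[:8], "big")
        return h < p * 2**64
    return [(u, v) for (u, v) in edge_stream if picked(u) and picked(v)]
```

Because node membership is a fixed function of the node id, seeing the same edge many times in the stream yields the same decision every time — the duplicate-handling point from the previous slide.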
Page 77
HT Estimators for Graphs
◊ Can construct HT estimators from uniform vertex samples [Frank 78]
– Evaluate the desired function on the sampled graph (e.g. average degree)
◊ For functions of edges (e.g. number of edges satisfying a property):
– Scale up accordingly, by N(N−1)/(k(k−1)) for sample size k on graph size N
– Variance of estimates can also be bounded in terms of N and k
◊ Similar for functions of three edges (triangles) and higher:
– Scale up by (N choose 3)/(k choose 3) ≈ 1/p³ to get unbiased estimator
– High variance, so other sampling schemes have been developed
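The edge-count scaling above can be checked exactly on a toy graph (the small example graph below is an assumption for illustration): averaging the HT estimate over every possible k-subset of vertices recovers the true edge count, which is what unbiasedness means.

```python
from itertools import combinations

def ht_edge_estimate(sampled_nodes, edges, N):
    """Horvitz-Thompson estimate of the total edge count from a uniform
    vertex sample: an edge survives iff both endpoints are sampled,
    which happens with probability k(k-1)/(N(N-1)), so the observed
    edge count is scaled up by the inverse of that probability."""
    k = len(sampled_nodes)
    s = set(sampled_nodes)
    observed = sum(1 for (u, v) in edges if u in s and v in s)
    return observed * N * (N - 1) / (k * (k - 1))

# Unbiasedness check: averaging the estimator over every possible
# k-subset of nodes recovers the true edge count (up to float rounding).
nodes = range(5)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (1, 3)]
ests = [ht_edge_estimate(s, edges, 5) for s in combinations(nodes, 3)]
print(sum(ests) / len(ests))   # the true edge count, 6
```

Any single sample can be far off (the estimator takes values in coarse multiples of the scale factor), which is the variance issue the slide flags.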
Page 78
Graph Sampling Heuristics
“Heuristics”, since few formal statistical properties are known
◊ Breadth-first sampling: sample a node, then its neighbours…
– Biased towards high-degree nodes (more chances to reach them)
◊ Snowball sampling: generalize BF by picking many initial nodes
– Respondent-driven sampling: weight the snowball sample to give statistically sound estimates [Salganik, Heckathorn 04]
◊ Forest-fire sampling: generalize BF by picking only a fraction of neighbours to explore [Leskovec, Kleinberg, Faloutsos 05]
– With probability p, move to a new node and “kill” current node
◊ No “one true graph sampling method”
– Experiments show different preferences, depending on graph and metric [Leskovec, Faloutsos 06; Hasan, Ahmed, Neville, Kompella 13]
– None of these methods are “streaming friendly”: they require a static graph
□ Hack: apply them to the stream of edges as-is
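A minimal sketch of the forest-fire idea (a simplified reading of the above; the adjacency-dict input, the burst-stopping rule, and the restart-on-burnout policy are illustrative choices, not the exact published procedure):

```python
import random

def forest_fire_sample(adj, n_sample, p=0.5, rng=random):
    """Forest-fire-style node sampling sketch: ignite a random seed and
    burn a geometrically-distributed number of unvisited neighbours of
    each burning node, re-igniting at a fresh seed if the fire dies out
    before n_sample nodes are burned. `adj`: node -> list of neighbours."""
    burned = set()
    nodes = sorted(adj)
    while len(burned) < n_sample:
        seed = rng.choice(nodes)          # (re)ignite at a random node
        burned.add(seed)
        frontier = [seed]
        while frontier:
            u = frontier.pop()
            fresh = [v for v in adj[u] if v not in burned]
            rng.shuffle(fresh)
            # Burn neighbours until the first coin-flip failure
            for v in fresh:
                if len(burned) >= n_sample or rng.random() > p:
                    break
                burned.add(v)
                frontier.append(v)
            if len(burned) >= n_sample:
                return burned
    return burned
```

As the slide notes, the output depends heavily on p and on the graph: small p behaves like many restarts (closer to uniform seeds), large p like breadth-first burning with its high-degree bias.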
Page 79
Random Walk Sampling
◊ Random walks have proven very effective for many graph computations
– PageRank for node importance, and many variations
◊ Random walk is a natural model for sampling a node
– Perform a “long enough” random walk to pick a node
– How long is “long enough” (for mixing of the RW)?
– Can get “stuck” in a subgraph if graph not well-connected
– Costly to perform multiple random walks
– Highly non-streaming friendly, but suits graph crawling
◊ Multidimensional Random Walks [Ribeiro, Towsley 10]
– Pick k random nodes to initialize the sample
– Pick a random edge from the union of edges incident on the sample
– Can be viewed as a walk on a high-dimensional extension of the graph
– Outperforms running k independent random walks
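A minimal endpoint-of-a-walk sampler, assuming an adjacency-dict representation (a hedged sketch: choosing `steps` large enough for mixing is exactly the hard open question noted above, and on a connected non-bipartite undirected graph the endpoint converges to a degree-proportional, not uniform, distribution):

```python
import random

def random_walk_sample(adj, steps, rng=random):
    """Pick a node by running a simple random walk for a fixed number
    of steps and returning the endpoint. After mixing, the endpoint is
    distributed roughly proportional to degree, so a degree correction
    (e.g. Metropolis-Hastings) is needed if a uniform node is wanted."""
    node = rng.choice(sorted(adj))
    for _ in range(steps):
        node = rng.choice(adj[node])
    return node
```

This suits the crawling model well: each step only needs the neighbour list of the current node, exactly what a crawl or API call returns.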
Page 80
Subgraph estimation: counting triangles
◊ Hot topic: sample-based triangle counting
– Triangles: simplest non-trivial representation of node clustering
□ Regard as prototype for more complex subgraphs of interest
– Measure of “clustering coefficient” in graph, parameter in graph models…
◊ Uniform sampling performs poorly:
– Chance that randomly sampled edges happen to form a subgraph is ≈ 0
◊ Bias the sampling so that the desired subgraph is preferentially sampled
Page 81
Subgraph Sampling in Streams
Want to sample one of the T triangles in a graph
◊ [Buriol et al 06]: sample an edge uniformly, then pick a node
– Scan for the edges that complete the triangle
– Probability of sampling a triangle is T/(|E|(|V|−2))
◊ [Pavan et al 13]: sample an edge, then sample an incident edge
– Scan for the edge that completes the triangle
– (After bias correction) probability of sampling a triangle is T/(|E|Δ)
□ Δ = max degree, considerably smaller than |V| in most graphs
◊ [Jha et al. KDD 2013]: sample edges, then sample pairs of incident edges
– Scan for edges that complete “wedges” (edge pairs incident on a vertex)
◊ Advert: Graph Sample and Hold [Ahmed, Duffield, Neville, Kompella, KDD 2014]
– General framework for subgraph counting, e.g. triangle counting
– Similar accuracy to previous state of the art, but using smaller storage
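The Buriol et al. scheme can be simulated offline as below (a hedged sketch, not their one-pass streaming implementation). One detail to note when inverting the success probability: a given triangle can be reached through any of its three edges, hence the division by 3 in the rescaling used here.

```python
import random

def buriol_triangle_estimate(edges, nodes, trials, rng=random):
    """Triangle-count estimator in the style of Buriol et al.: per
    trial, sample an edge (u, v) uniformly and a third node w uniformly
    from the remaining nodes, then check whether edges (u, w) and
    (v, w) both exist. The success rate, rescaled, estimates T."""
    edge_set = {frozenset(e) for e in edges}
    hits = 0
    for _ in range(trials):
        u, v = rng.choice(edges)
        w = rng.choice([x for x in nodes if x not in (u, v)])
        if frozenset((u, w)) in edge_set and frozenset((v, w)) in edge_set:
            hits += 1
    # A triangle is hit via any of its 3 edges, so a trial succeeds
    # with probability 3T / (|E| (|V|-2)); invert to estimate T.
    return hits / trials * len(edges) * (len(nodes) - 2) / 3
```

On the complete graph K4 every trial succeeds, and the estimator returns the true count of 4 triangles exactly; on sparse graphs the tiny success probability is why many parallel estimators (or the Pavan et al. refinement) are needed.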
Page 82
Graph Sampling Summary
◊ Sampling a representative graph from a massive graph is difficult!
◊ Current state of the art:
– Sample nodes/edges uniformly from a stream
– Heuristic sampling from static/streaming graph
◊ Sampling enables subgraph sampling/counting
– Much effort devoted to triangles (smallest non-trivial subgraph)
◊ “Real” graphs are richer
– Different node and edge types, attributes on both
– Just scratching the surface of sampling realistic graphs
Page 83
Current Directions in Sampling
Page 84
Outline
◊ Motivating application: sampling in large ISP networks
◊ Basics of sampling: concepts and estimation
◊ Stream sampling: uniform and weighted case
– Variations: concise sampling, sample and hold, sketch guided
BREAK
◊ Advanced stream sampling: sampling as cost optimization
– VarOpt, priority, structure aware, and stable sampling
◊ Hashing and coordination
– Bottom-k, consistent sampling and sketch-based sampling
◊ Graph sampling
– Node, edge and subgraph sampling
◊ Conclusion and future directions
Page 85
Role and Challenges for Sampling
◊ Matching
– Sampling mediates between data characteristics and analysis needs
– Example: sample from power-law distribution of bytes per flow…
□ but also make accurate estimates from samples
□ simple uniform sampling misses the large flows
◊ Balance
– Weighted sampling across key-functions: e.g. customers, network paths, geolocations
□ cover small customers, not just large
□ cover all network elements, not just highly utilized
◊ Consistency
– Sample all views of same event, flow, customer, network element
□ across different data sets, at different times
□ independent sampling ⇒ small intersection of views
Page 86
Sampling and Big Data Systems
◊ Sampling is still a useful tool in cluster computing
– Reduce the latency of experimental analysis and algorithm design
◊ Sampling as an operator is easy to implement in MapReduce
– For uniform or weighted sampling of tuples
◊ Graph computations are a core motivator of big data
– PageRank as a canonical big computation
– Graph-specific systems emerging (Pregel, LFGraph, GraphLab, Giraph…)
– But… sampling primitives not yet prevalent in evolving graph systems
◊ When to do the sampling?
– Option 1: sample as an initial step in the computation
□ Fold sample into the initial “Map” step
– Option 2: sample to create a stored sample graph before computation
□ Allows more complex sampling, e.g. random walk sampling
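Option 1 can be sketched as a map-side filter. This is a hedged, framework-agnostic sketch (the generator shape and the weight convention are illustrative, not any specific MapReduce API): sampled records carry an inverse-probability weight so that downstream aggregates remain unbiased, Horvitz-Thompson style.

```python
import hashlib

def map_with_sampling(record, key, rate):
    """Sampling folded into the 'Map' step: emit a record only if a
    hash of its key falls below the sampling rate, paired with a
    1/rate weight for rescaling downstream estimates. Hash-based
    selection keeps decisions consistent for repeated keys."""
    h = int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:8], "big")
    if h < rate * 2**64:
        yield key, (record, 1.0 / rate)
```

A reducer then sums weights (rather than counts) to estimate totals over the unsampled data.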
Page 87
Sampling + KDD
◊ The interplay between sampling and data mining is not well understood
– Need an understanding of how ML/DM algorithms are affected by sampling
– E.g. how big a sample is needed to build an accurate classifier?
– E.g. what sampling strategy optimizes cluster quality?
◊ Expect results to be method specific
– i.e. “IPPS + k-means” rather than “sample + cluster”
Page 88
Sampling and Privacy
◊ Current focus on privacy-preserving data mining
– Deliver promise of big data without sacrificing privacy?
– Opportunity for sampling to be part of the solution
◊ Naïve sampling provides “privacy in expectation”
– Your data remains private if you aren’t included in the sample…
◊ Intuition: uncertainty introduced by sampling contributes to privacy
– This intuition can be formalized with different privacy models
◊ Sampling can be analyzed in the context of differential privacy
– Sampling alone does not provide differential privacy
– But applying a DP method to sampled data does guarantee privacy
– A tradeoff between sampling rate and privacy parameters
□ Sometimes, a lower sampling rate improves overall accuracy
Page 89
Advert: Now Hiring…
◊ Nick Duffield, Texas A&M
– PhDs in big data, graph sampling
◊ Graham Cormode, University of Warwick, UK
– PhDs in big data summarization (graphs and matrices, funded by MSR)
– Postdocs in privacy and data modeling (funded by EC, AT&T)
Page 90
GrahamCormode,[email protected]
NickDuffield,TexasA&[email protected]
Sampling for Big Data