Clustering Geometric Data Streams
Jiří Skála, Ivana Kolingerová
ZČU/FAV/KIV, 2007
Jan 15, 2016

Talk outline
motivation
background
existing solution
improvements
experiments & observations
conclusion

Motivation
why data streams?
geometric models growing larger
  – Stanford's Michelangelo project (David 28 mil. vertices, St. Matthew 187 mil. vertices)
  – 187·10^6 points · 3 coordinates · 8 bytes ≈ 4.5 GB
must be processed out-of-core
why clustering?
  – use hierarchical clustering to create a multiresolution model
  – various levels of detail (LOD) in different parts
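The size estimate for the St. Matthew model can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function name and its default arguments are illustrative and not part of the original talk.

```python
def raw_size_bytes(num_points, coords_per_point=3, bytes_per_coord=8):
    """Raw size of a point set stored as double-precision coordinates."""
    return num_points * coords_per_point * bytes_per_coord


# St. Matthew: 187 million vertices, 3 coordinates each, 8 bytes per coordinate
st_matthew_bytes = raw_size_bytes(187_000_000)
print(st_matthew_bytes / 1e9, "GB")
```

The result is roughly 4.5 GB for the coordinates alone, before any connectivity or attribute data.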

Background – data stream
ordered set of data
data coming online or stored on HDD
too large to fit in main memory
  – viewed only in order; random access extremely inefficient or even impossible
  – processed in one or very few linear scans

Background – clustering
grouping similar elements together
  – vertices, DB entries, documents
  – similarity most often measured as Euclidean distance
k-means, k-median clustering
facility location
  – clients and facilities
  – facility cost
$$C = k \cdot fc + \sum_{i=0}^{N-1} \lVert p_i - f_{p_i} \rVert$$
[Figure: two panels, k-means and k-median]
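The facility location cost can be sketched in Python as follows. It assumes that k is the number of open facilities, fc the cost of one facility, and that each point pays its Euclidean distance to the facility serving it. The function and argument names are illustrative, not taken from the talk.

```python
import math


def facility_location_cost(points, facilities, assignment, fc):
    """Total cost: k * fc plus the distance of every point to its facility.

    points     -- list of coordinate tuples
    facilities -- list of facility coordinate tuples
    assignment -- assignment[i] is the index into `facilities` serving points[i]
    fc         -- cost of opening one facility (assumed equal for all)
    """
    k = len(facilities)
    service = sum(math.dist(p, facilities[a]) for p, a in zip(points, assignment))
    return k * fc + service
```

For example, three points served by two facilities with fc = 2 pay 2 · 2 for the facilities plus the distances of the points to them.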

Facility location
no data streams yet
introduced by Charikar and Guha, 1999
initial solution iteratively refined by local improvements – local search algorithm
initial solution
  – points taken in random order
  – first point always a facility
  – others become a facility with probability p = d / fc (if d / fc > 1 then p := 1)
  – otherwise connect to closest existing facility
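The initial-solution steps above can be sketched as below. The slide does not define d; this sketch assumes it is the distance from the current point to its closest already opened facility. Points are unweighted, the input is assumed non-empty, and all names are illustrative.

```python
import math
import random


def initial_solution(points, fc, rng=None):
    """Return (facilities, assignment) as indices into `points`."""
    if rng is None:
        rng = random.Random()
    order = list(range(len(points)))
    rng.shuffle(order)                      # points taken in random order

    first = order[0]
    facilities = [first]                    # first point is always a facility
    assignment = {first: first}

    for i in order[1:]:
        nearest = min(facilities, key=lambda f: math.dist(points[i], points[f]))
        d = math.dist(points[i], points[nearest])   # assumed meaning of d
        p = min(1.0, d / fc)                # if d / fc > 1 then p := 1
        if rng.random() < p:
            facilities.append(i)            # point becomes a facility
            assignment[i] = i
        else:
            assignment[i] = nearest         # connect to closest facility
    return facilities, assignment
```

Points far from every open facility (relative to fc) are likely to open a new one, so a small fc yields many facilities and a large fc yields few.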

Facility location
pick a point at random (new facility candidate)
compute function gain
  – pay for opening a facility
  – inspect all points and compare distance to facility
  – inspect facilities and determine whether they can be closed
if gain > 0 then perform reassignments & closures
repeated m log m times?
$$\mathit{gain} = -fc + \sum_{>0} \text{dist. spare} + \sum_{>0} \left( fc - \text{dist. expenses} \right)$$
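A minimal sketch of the gain computation for unweighted points follows. It assumes the usual reading of the three steps: the candidate costs fc to open, every point closer to the candidate than to its current facility contributes the distance it spares, and a facility is closed when fc exceeds the extra distance paid by moving its remaining points to the candidate. Names and data layout are illustrative.

```python
import math


def gain(points, assignment, facilities, candidate, fc):
    """Evaluate opening `candidate` (an index into `points`) as a facility.

    assignment[j] -- index of the facility point currently serving points[j]
    facilities    -- set of indices of currently open facilities
    Returns (gain, points_to_reassign, facilities_to_close).
    """
    x = points[candidate]
    total = 0.0 if candidate in facilities else -fc   # pay for opening

    reassign = []
    expenses = {f: 0.0 for f in facilities if f != candidate}
    for j, p in enumerate(points):
        current = assignment[j]
        d_old = math.dist(p, points[current])
        d_new = math.dist(p, x)
        if d_new < d_old:
            total += d_old - d_new          # distance spared by reassigning j
            reassign.append(j)
        elif current != candidate:
            # extra distance if j's facility were closed and j moved to x
            expenses[current] += d_new - d_old

    close = [f for f, e in expenses.items() if fc - e > 0]
    total += sum(fc - expenses[f] for f in close)
    return total, reassign, close
```

If the returned gain is positive, the listed reassignments and closures are performed; otherwise the candidate is discarded.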

Facility location
[Figure: two panels, "New facility candidate" and "After reassignments & closures"]

Data stream clustering
proposed by Guha et al., 2000
data stream processed in blocks
  – clustering within each block
  – cluster centers given weight and passed to higher level
  – when higher level full, clustered again
  – distances multiplied by point weights
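The block hierarchy can be sketched as below. The clustering of a single block is left to a caller-supplied function, here called cluster_block, which is a hypothetical placeholder for the facility location step. It is assumed to return fewer weighted centers than the block size m. The toy weighted_centroid function exists only to make the sketch runnable.

```python
def stream_cluster(stream, m, cluster_block):
    """Process `stream` in blocks of size m and return the per-level buffers.

    cluster_block -- hypothetical callable: takes a list of (point, weight)
                     pairs and returns a shorter list of (center, weight) pairs
    """
    levels = [[]]
    for point in stream:
        levels[0].append((point, 1.0))       # raw points carry weight 1
        lvl = 0
        while len(levels[lvl]) >= m:         # level full: cluster it
            centers = cluster_block(levels[lvl])
            levels[lvl] = []
            if lvl + 1 == len(levels):
                levels.append([])
            levels[lvl + 1].extend(centers)  # pass weighted centers upward
            lvl += 1
    return levels


def weighted_centroid(block):
    """Toy stand-in for 1-D data: collapse a block into one weighted center."""
    w = sum(weight for _, weight in block)
    c = sum(p * weight for p, weight in block) / w
    return [(c, w)]


levels = stream_cluster(range(8), 4, weighted_centroid)
```

A real cluster_block would multiply distances by the point weights, as the last bullet above states.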

Data stream clustering
[Figure: input data stream, the first level, the second level]
time for video
$$l = \frac{\log(N/m)}{\log(m/k)}$$

Improvements
limiting the search space
  – inspect only points whose reassignment can improve the solution
  – i.e., those assigned to facilities within 2 fc radius
  – does not work for weighted points
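One possible reading of the restriction above is sketched here. It assumes the 2 fc radius is measured from the new facility candidate and that points are unweighted. The function name and data layout are illustrative.

```python
import math


def points_to_inspect(points, assignment, facilities, candidate, fc):
    """Indices of points whose facility lies within 2 * fc of the candidate.

    assignment[j] -- index of the facility point currently serving points[j]
    facilities    -- set of indices of currently open facilities
    """
    x = points[candidate]
    near = {f for f in facilities if math.dist(points[f], x) <= 2 * fc}
    return [j for j in range(len(points)) if assignment[j] in near]
```

Only the returned points would then be passed to the gain computation, instead of the whole block.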

Improvements
modification from k-median to facility location
choosing the facility cost
  – equal to the diagonal of bounding box
weight normalization
  – we need to keep weights around 1, i.e., average weight equal to 1
  – divide weights by their average
$$\frac{1}{k} \sum_{i=0}^{k-1} c\, w_i = 1 \quad\Rightarrow\quad c = \frac{k}{\sum_{i=0}^{k-1} w_i}$$
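Weight normalization amounts to scaling all weights by one constant c so that their average becomes 1. A small sketch, with an illustrative function name and assuming positive weights:

```python
def normalize_weights(weights):
    """Scale weights so that their average equals 1."""
    k = len(weights)
    c = k / sum(weights)          # c = k / sum of w_i
    return [c * w for w in weights]


normalized = normalize_weights([2.0, 4.0, 6.0])
```

Here the average weight is 4, so every weight is divided by 4.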

Experiments – setting the facility cost
high setting
  – aggressive clustering
  – low number of large clusters
low setting
  – moderate clustering
  – many small clusters
set facility cost equal to diagonal of bounding box
affects memory and running time
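The bounding-box diagonal used as the facility cost can be computed as in this sketch. It assumes an axis-aligned bounding box of a non-empty point list; the function name is illustrative.

```python
import math


def bounding_box_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box of `points`."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    return math.dist(lo, hi)


fc = bounding_box_diagonal([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 1.0, 12.0)])
```

Multiplying the result by a factor such as 2 or 1/2 gives the higher and lower settings discussed above.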

Experiments – setting the facility cost
[Figure: four panels labeled diagonal, 2 diagonal, 1/2 diagonal, 1/4 diagonal]

Experiments – input point distribution
many authors rely on data being ordered
  – usually true
  – presented algorithm can handle unordered data as well
there may be a problem with few outliers

Experiments – input point distribution
[Figures on two slides, each with panels: 1st block, 2nd block, 3rd block, higher level]

Experiments – block size
affects memory requirements
  – required memory: $l\,m = m\,\frac{\log(N/m)}{\log(m/k)}$
somewhat affects clustering result
affects running time
  – required iterations: $\frac{N}{m-k}\, m \log m$
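The effect of the block size m can be explored numerically with the sketch below. It assumes memory is counted in stored points (l levels of m points each), that each block takes m log m local search iterations, and that the number of blocks over all levels sums to N / (m − k). The logarithm base for the iteration count is not given on the slide; the natural logarithm is used here as an assumption. Function names are illustrative.

```python
import math


def required_memory(N, m, k):
    """Points held in memory: number of levels times block size."""
    levels = math.log(N / m) / math.log(m / k)
    return levels * m


def required_iterations(N, m, k):
    """Total local search iterations over all blocks on all levels."""
    blocks = N / (m - k)              # N/m + N*k/m**2 + ... as a geometric sum
    return blocks * m * math.log(m)   # natural log is an assumption


memory_points = required_memory(1_000_000, 1000, 10)
```

With N = 10^6, m = 1000 and k = 10 there are 1.5 levels on average, that is about 1500 points in memory.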

Experiments – number of iterations
m log m iterations necessary for a constant-factor approximation
for large blocks running time grows unpleasantly
0.1 m iterations seem to be enough; for data with clusters even less
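The two iteration budgets can be compared with a short sketch. The logarithm base is again assumed to be natural, and the function name and the fraction parameter are illustrative.

```python
import math


def iteration_budgets(m, fraction=0.1):
    """Return (m log m, fraction * m) as whole iteration counts."""
    full = math.ceil(m * math.log(m))   # natural log is an assumption
    reduced = round(fraction * m)
    return full, reduced


full, reduced = iteration_budgets(1640)
```

For a block of 1640 points the reduced budget is 164 iterations, far fewer than the m log m budget.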

Experiments – number of iterations
[Figure: 1640 points, two panels labeled 6560 iterations and 164 iterations]

Conclusion
modified data stream approach to facility location
introduced facility weight normalization
improvement to limit the number of points inspected
experiments
  – discussion of parameter settings
  – description of algorithm behavior

References
M. Charikar, S. Guha, Improved Combinatorial Algorithms for the Facility Location and k-Median Problems. Proc. 40th Sympos. on Foundations of Computer Science, 1999, pp. 378-388.
S. Guha, N. Mishra, R. Motwani, L. O'Callaghan, Clustering Data Streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.
L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, R. Motwani, Streaming-Data Algorithms for High-Quality Clustering. In Proceedings of IEEE International Conference on Data Engineering, March 2002.
S. Guha, A. Meyerson, N. Mishra, R. Motwani, L. O'Callaghan, Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng. 15, 3 (2003), 515-528.