Clustering Geometric Data Streams
Jiří Skála, Ivana Kolingerová
ZČU/FAV/KIV, 2007


Talk outline

motivation
background
existing solution
improvements
experiments & observations
conclusion


Motivation

why data streams?
geometric models growing larger
– Stanford's Michelangelo project (David 28 mil. vertices, St. Matthew 187 mil. vertices)
– 187·10^6 points · 3 coordinates · 8 bytes ≈ 4.5 GB
must be processed out-of-core
why clustering?
– use hierarchical clustering to create multiresolution model
– various LOD in different parts


Background – data stream

ordered set of data
data coming online or stored on HDD
too large to fit in main memory
– viewed only in order; random access extremely inefficient or even impossible
– processed in one or very few linear scans
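A minimal sketch of what "one linear scan" can look like in practice, assuming the stream is any Python iterable of coordinate tuples. The name `read_blocks` and the `block_size` parameter are illustrative and do not come from the talk.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[float, ...]


def read_blocks(stream: Iterable[Point], block_size: int) -> Iterator[List[Point]]:
    """Yield consecutive blocks of at most `block_size` points.

    The stream is consumed strictly in order and only once, so at most
    one block has to be held in main memory at a time.
    """
    it = iter(stream)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block
```

Because the function only ever calls `next` on the underlying iterator, it works equally well for data arriving online and for a file read sequentially from disk.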


Background – clustering

grouping similar elements together
– vertices, DB entries, documents
– similarity most often measured as Euclidean distance
k-means, k-median clustering
facility location
– clients and facilities
– facility cost

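$$C = k \cdot fc + \sum_{i=0}^{N-1} \lVert p_i - f(p_i) \rVert$$

(C: total cost, k: number of facilities, fc: facility cost, f(p_i): facility serving point p_i)

[Figure: k-means and k-median]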
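The facility location cost can be sketched in a few lines. This is an illustration only: it assumes every point is served by its nearest open facility and that distances are Euclidean; the function name `facility_location_cost` is hypothetical.

```python
import math


def facility_location_cost(points, facilities, fc):
    """Total cost: k * fc plus the distance of each point to its facility.

    Assumption: each point is served by the nearest open facility.
    `points` and `facilities` are sequences of coordinate tuples,
    `fc` is the (uniform) cost of opening one facility.
    """
    assignment_cost = sum(
        min(math.dist(p, f) for f in facilities) for p in points
    )
    return len(facilities) * fc + assignment_cost
```

With a high `fc` the first term dominates and few facilities are worth opening; with a low `fc` many small clusters become cheaper.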


Facility location

no data streams yet
introduced by Charikar and Guha, 1999
initial solution iteratively refined by local improvements – local search algorithm
initial solution
– points taken in random order
– first point always a facility
– others become facility with probability p = d / fc (if d / fc > 1 then p := 1)
– otherwise connect to closest existing facility
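The initial solution can be sketched as follows. The slide does not define d; this sketch assumes d is the distance from the point to the closest facility opened so far. The function name and the `rng` parameter are illustrative.

```python
import math
import random


def initial_solution(points, fc, rng=None):
    """Build an initial facility location solution.

    Returns (facilities, assignment): a list of point indices opened as
    facilities and a dict mapping each point index to its facility index.
    Assumption: d is the distance to the closest existing facility.
    """
    if not points:
        return [], {}
    rng = rng or random.Random()
    order = list(range(len(points)))
    rng.shuffle(order)                      # points taken in random order

    first = order[0]
    facilities = [first]                    # first point always a facility
    assignment = {first: first}

    for idx in order[1:]:
        nearest = min(facilities, key=lambda f: math.dist(points[idx], points[f]))
        d = math.dist(points[idx], points[nearest])
        p = min(1.0, d / fc)                # if d / fc > 1 then p := 1
        if rng.random() < p:
            facilities.append(idx)          # open a new facility here
            assignment[idx] = idx
        else:
            assignment[idx] = nearest       # connect to closest facility
    return facilities, assignment
```

Points far from every existing facility are almost certain to open a new one, while points close to a facility usually just connect to it.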



pick a point at random (new facility candidate)
compute function gain
– pay for opening a facility
– inspect all points and compare distance to facility
– inspect facilities and determine whether they can be closed
if gain > 0 then perform reassignments & closures
repeated m log m times?

$$\mathit{gain} = -fc + \sum \text{dist. spare} + \sum \left( fc - \text{dist. expenses} \right)$$

[Figures: new facility candidate; after reassignments & closures]
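One possible reading of the gain computation, written as a simplified sketch for unweighted points. It is not the authors' implementation: the data layout (`assignment` as a dict of indices), the function name, and the closure rule (close a facility when its cost exceeds the extra distance paid by members that move farther away) are assumptions based on the bullets of this slide.

```python
import math


def gain(points, assignment, facilities, candidate, fc):
    """Return (gain, reassign, closed) for opening a facility at `candidate`.

    points     -- list of coordinate tuples
    assignment -- dict: point index -> index of the facility serving it
    facilities -- list of point indices that are currently open
    """
    if candidate in facilities:
        return 0.0, {}, set()               # already open, nothing to gain
    x = points[candidate]
    total = -fc                             # pay for opening a facility
    reassign = {}                           # point index -> new facility
    closed = set()                          # facilities that would close

    for f in facilities:
        members = [i for i, a in assignment.items() if a == f]
        spare, expenses, closer = 0.0, 0.0, []
        for i in members:
            delta = math.dist(points[i], points[f]) - math.dist(points[i], x)
            if delta > 0:                   # candidate is closer: distance spared
                spare += delta
                closer.append(i)
            else:                           # candidate is farther: extra expense
                expenses -= delta
        if fc - expenses > 0:               # closing f saves more than it costs
            total += spare + fc - expenses
            closed.add(f)
            reassign.update((i, candidate) for i in members)
        else:
            total += spare
            reassign.update((i, candidate) for i in closer)
    return total, reassign, closed
```

A caller would apply `reassign` and `closed` only when the returned gain is positive, and repeat the step with a new random candidate.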


Data stream clustering

proposed by Guha et al., 2000
data stream processed in blocks
– clustering within each block
– cluster centers given weight and passed to higher level
– when higher level full, clustered again
– distances multiplied by point weights
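The block hierarchy can be sketched as below. The clustering step itself is passed in as a callable, so the sketch shows only how weighted centers travel upwards. Both `stream_cluster` and the toy `single_centroid` stand-in are hypothetical names; a real clustering routine (for example the local search for facility location) would replace the stand-in and must return fewer than `m` centers per block.

```python
def stream_cluster(stream, m, cluster):
    """Process a stream in blocks of `m` items, level by level.

    `cluster` takes a list of (point, weight) pairs and returns a shorter
    list of (center, weight) pairs. Returns the buffers left at each level.
    """
    levels = [[]]

    def push(level, item):
        if level == len(levels):
            levels.append([])
        levels[level].append(item)
        if len(levels[level]) == m:         # level full: cluster it
            centers = cluster(levels[level])
            levels[level] = []
            for c in centers:               # weighted centers go one level up
                push(level + 1, c)

    for p in stream:
        push(0, (p, 1.0))                   # raw points carry weight 1
    return levels


def single_centroid(block):
    """Toy stand-in: collapse a block into one weighted centroid."""
    total = sum(w for _, w in block)
    dim = len(block[0][0])
    center = tuple(sum(p[d] * w for p, w in block) / total for d in range(dim))
    return [(center, total)]
```

For example, `stream_cluster(((float(i),) for i in range(9)), 3, single_centroid)` clusters three first-level blocks and then one second-level block, leaving a single center of weight 9 on the third level.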


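Data stream clustering

[Figure: input data stream, the first level, the second level]

time for video

$$l = \frac{\log(N/m)}{\log(m/k)}$$

(l: number of levels, N: number of points in the stream, m: block size, k: number of cluster centers per block)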


Improvements

limiting the search space
– inspect only points whose reassignment can improve the solution
– i.e., those assigned to facilities within 2 fc radius
– does not work for weighted points
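A sketch of the search-space restriction for unweighted points. It assumes the 2 fc radius is measured from the new facility candidate to the existing facilities; the slide does not spell this out, and the function name is illustrative.

```python
import math


def points_to_inspect(points, assignment, facilities, candidate, fc):
    """Return indices of points worth inspecting for a given candidate.

    Only points assigned to facilities that lie within distance 2 * fc
    of the candidate are returned (assumed meaning of the 2 fc radius).
    `assignment` maps point index -> facility index.
    """
    near = {
        f for f in facilities
        if math.dist(points[f], points[candidate]) <= 2 * fc
    }
    return sorted(i for i, a in assignment.items() if a in near)
```

Points served by distant facilities are skipped entirely, which is where the saving comes from when the data contains many well-separated clusters.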


Improvements

modification from k-median to facility location
choosing the facility cost
– equal to the diagonal of bounding box
weight normalization
– we need to keep weights around 1, i.e., average weight equal to 1
– divide weights by their average

$$\frac{1}{k} \sum_{i=0}^{k-1} c\, w_i = 1 \quad\Rightarrow\quad c = k \left( \sum_{i=0}^{k-1} w_i \right)^{-1}$$
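Dividing the weights by their average is a one-liner; the sketch below only makes the scaling factor c explicit. The function name is illustrative.

```python
def normalize_weights(weights):
    """Scale weights so that their average equals 1.

    c = k / sum(w_i), which is the same as dividing by the average weight.
    """
    c = len(weights) / sum(weights)
    return [c * w for w in weights]
```

For instance, weights `[1, 2, 3]` have average 2, so they become `[0.5, 1.0, 1.5]`, which again average to 1.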


Experiments – setting the facility cost

high setting
– aggressive clustering
– low number of large clusters
low setting
– moderate clustering
– many small clusters
set facility cost equal to diagonal of bounding box
affects memory and running time
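A sketch of deriving the facility cost from the data, assuming an axis-aligned bounding box of the points passed in. Which set of points the box is taken over (a block or the whole model) is not stated on the slide, and the function name is illustrative.

```python
import math


def bounding_box_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box of `points`."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return math.dist(lo, hi)
```

The result can be used directly as `fc`, or scaled (for example by 2, 1/2, or 1/4) to trade cluster count against cluster size.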


Experiments – setting the facility cost

[Figure: clustering results for facility cost set to diagonal, 2 × diagonal, 1/2 diagonal, and 1/4 diagonal]


Experiments – input point distribution

many authors rely on data being ordered
– usually true
– presented algorithm can handle unordered data as well
there may be a problem with few outliers

[Figures (two slides): 1st block, 2nd block, 3rd block, higher level]


Experiments – block size

affects memory requirements
– required memory: $m \cdot l = m \, \frac{\log(N/m)}{\log(m/k)}$
somewhat affects clustering result
affects running time
– required iterations: $\frac{N}{m-k} \, m \log m$
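To get a feel for how the block size drives these quantities, the estimates can be evaluated directly. The sketch assumes the number of levels is l = log(N/m) / log(m/k), the memory estimate is m · l points, and the iteration estimate is N/(m − k) · m log m. The base of the logarithm in m log m is not given on the slide, so the natural logarithm is used here as an assumption. Function names are illustrative.

```python
import math


def levels(N, m, k):
    """Number of levels for N points, block size m, k centers per block."""
    return math.log(N / m) / math.log(m / k)


def required_memory(N, m, k):
    """Estimated number of points held in memory: m points per level."""
    return m * levels(N, m, k)


def required_iterations(N, m, k):
    """Estimated local search iterations over all blocks on all levels.

    About N / (m - k) blocks are clustered in total, each with m log m
    iterations (natural logarithm assumed).
    """
    return N / (m - k) * m * math.log(m)
```

Larger blocks reduce the number of levels, but the per-block factor m log m grows, which matches the observation that running time becomes unpleasant for large blocks.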


Experiments – number of iterations

m log m iterations necessary for a constant-factor approximation
for large blocks running time grows unpleasantly
0.1 m iterations seem to be enough; for data with clusters even less


Experiments – number of iterations

[Figure: 1640 points clustered with 6560 iterations and with 164 iterations]


Conclusion

modified data stream approach to facility location
introduced facility weight normalization
improvement to limit the number of points inspected
experiments
– discussion of parameter settings
– description of algorithm behavior


References

M. Charikar, S. Guha. Improved Combinatorial Algorithms for the Facility Location and k-Median Problems. Proc. 40th Sympos. on Foundations of Computer Science, 1999, pp. 378-388.

S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. Clustering Data Streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.

L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, R. Motwani. Streaming-Data Algorithms for High-Quality Clustering. In Proceedings of IEEE International Conference on Data Engineering, March 2002.

S. Guha, A. Meyerson, N. Mishra, R. Motwani, L. O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE Trans. Knowl. Data Eng. 15, 3 (2003), 515-528.
