
Clustering Geometric Data Streams Jiří Skála Ivana Kolingerová ZČU/FAV/KIV2007.

Jan 15, 2016

Transcript
Page 1

Clustering Geometric Data Streams

Jiří Skála, Ivana Kolingerová

ZČU/FAV/KIV, 2007

Page 2

Talk outline

motivation
background
existing solution
improvements
experiments & observations
conclusion

Page 3

Motivation

why data streams?
geometric models growing larger
  – Stanford’s Michelangelo project (David 28 mil. vertices, St. Matthew 187 mil. vertices)
  – 187·10⁶ points · 3 coordinates · 8 bytes ≈ 4.5 GB
must be processed out-of-core
why clustering?
  – use hierarchical clustering to create a multiresolution model
  – various LOD in different parts

Page 4

Background – data stream

ordered set of data
data coming online or stored on HDD
too large to fit in main memory
  – viewed only in order; random access extremely inefficient or even impossible
  – processed in one or very few linear scans
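The single-scan constraint can be illustrated with a short Python sketch. It assumes, purely for illustration, that the stream is an iterable of coordinate tuples; `read_blocks` and `block_size` are hypothetical names, not taken from the talk.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[float, ...]


def read_blocks(stream: Iterable[Point], block_size: int) -> Iterator[List[Point]]:
    """Yield consecutive blocks of at most block_size points in one linear scan."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    it = iter(stream)
    while True:
        # Only one block is held in memory at a time; the stream is never rewound.
        block = list(islice(it, block_size))
        if not block:
            return
        yield block
```

Because the function only ever advances the iterator, it works equally for data arriving online and for a file read sequentially from disk.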

Page 5

Background – clustering

grouping similar elements together
  – vertices, DB entries, documents
  – similarity most often measured as Euclidean distance
k-means, k-median clustering
facility location
  – clients and facilities
  – facility cost

$$C = k \cdot fc + \sum_{i=0}^{N-1} \lVert p_i - f_{p_i} \rVert$$

[Figure panels: k-means, k-median]
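As a rough illustration of the facility location objective, the sketch below adds the cost of the open facilities to the sum of client distances. It assumes Euclidean distance and that every point is served by its nearest facility; the function and parameter names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def facility_location_cost(points: Sequence[Point],
                           facilities: Sequence[Point],
                           fc: float) -> float:
    """Return k * fc plus the distance of every point to its nearest facility."""
    if not facilities:
        raise ValueError("at least one facility is required")
    # math.dist (Python 3.8+) is the Euclidean distance between two points.
    service = sum(min(math.dist(p, f) for f in facilities) for p in points)
    return len(facilities) * fc + service
```

Raising `fc` makes each additional facility more expensive, so a minimizer of this cost opens fewer facilities.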

Page 6

Facility location

no data streams yet
introduced by Charikar and Guha, 1999
initial solution iteratively refined by local improvements – local search algorithm
initial solution
  – points taken in random order
  – first point always a facility
  – others become facility with probability p = d / fc
      if d / fc > 1 then p := 1
  – otherwise connect to closest existing facility
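The initial-solution steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `d` is taken to be the distance from the point to its closest existing facility, points are unweighted, and all names are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def initial_solution(points: Sequence[Point], fc: float,
                     rng: Optional[random.Random] = None
                     ) -> Tuple[List[Point], List[Tuple[Point, int]]]:
    """Return (facilities, assignment), where assignment pairs each point with a facility index."""
    if fc <= 0:
        raise ValueError("facility cost must be positive")
    rng = rng if rng is not None else random.Random()
    order = list(points)
    if not order:
        return [], []
    rng.shuffle(order)                      # points taken in random order
    facilities = [order[0]]                 # first point is always a facility
    assignment = [(order[0], 0)]
    for p in order[1:]:
        nearest = min(range(len(facilities)),
                      key=lambda j: math.dist(p, facilities[j]))
        d = math.dist(p, facilities[nearest])   # assumption: d = distance to closest facility
        prob = min(1.0, d / fc)                 # p := 1 when d / fc > 1
        if rng.random() < prob:
            facilities.append(p)                # p becomes a new facility
            assignment.append((p, len(facilities) - 1))
        else:
            assignment.append((p, nearest))     # connect to closest existing facility
    return facilities, assignment
```

Passing a seeded `random.Random` makes runs repeatable, which helps when comparing parameter settings.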

Page 7

Facility location

pick a point at random (new facility candidate)
compute function gain
  – pay for opening a facility
  – inspect all points and compare distance to facility
  – inspect facilities and determine whether they can be closed
if gain > 0 then perform reassignments & closures
repeated m log m times?

$$gain = -fc + \sum_{\text{dist. spare} > 0} \text{dist. spare} + \sum_{fc - \text{dist. expenses} > 0} \left( fc - \text{dist. expenses} \right)$$
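One possible reading of the gain computation is sketched below: pay the facility cost for the candidate, add the distance spared by points that would move closer, and for each existing facility add its cost minus the extra distance of moving its remaining points, when that difference is positive. This assumes unweighted points and Euclidean distance; the function and parameter names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def gain(candidate: Point, points: Sequence[Point], assigned: Sequence[int],
         facilities: Sequence[Point], fc: float) -> float:
    """Estimate the cost reduction from opening `candidate` as a new facility.

    assigned[i] is the index in `facilities` of the facility serving points[i].
    """
    total = -fc                                # pay for opening a facility
    extra = [0.0] * len(facilities)            # cost of moving the remaining points away
    for p, a in zip(points, assigned):
        d_old = math.dist(p, facilities[a])
        d_new = math.dist(p, candidate)
        if d_new < d_old:
            total += d_old - d_new             # distance spared by reassignment
        else:
            extra[a] += d_new - d_old          # expense if facility a were closed
    for expense in extra:
        if fc - expense > 0:                   # closing this facility pays off
            total += fc - expense
    return total
```

A positive return value means the reassignments and closures would lower the total cost, matching the `gain > 0` test on the slide.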

Page 8

Facility location

[Figures: new facility candidate; after reassignments & closures]

Page 9

Data stream clustering

proposed by Guha et al., 2000
data stream processed in blocks
  – clustering within each block
  – cluster centers given weight and passed to higher level
  – when higher level full, clustered again
  – distances multiplied by point weights
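The block-and-level scheme can be sketched as a small driver that is independent of the clustering routine. In this sketch, `cluster` is a hypothetical callback that takes weighted points and returns fewer weighted centers (weighting distances by point weight is left to that callback). The final step, which clusters all leftover points together, is a simplification and not something the slides specify.

```python
from typing import Callable, Iterable, List, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, float]          # (coordinates, weight)
Clusterer = Callable[[List[Weighted]], List[Weighted]]


def cluster_stream(stream: Iterable[Point], m: int, cluster: Clusterer) -> List[Weighted]:
    """Cluster a stream block by block; `cluster` must return fewer centers than it is given."""
    if m < 2:
        raise ValueError("block size m must be at least 2")
    levels: List[List[Weighted]] = [[]]

    def push(level: int, item: Weighted) -> None:
        if level == len(levels):
            levels.append([])
        levels[level].append(item)
        if len(levels[level]) >= m:             # level full: cluster it
            block, levels[level] = levels[level], []
            centers = cluster(block)
            if len(centers) >= len(block):
                raise ValueError("clusterer must reduce the number of points")
            for center in centers:              # weighted centers go one level up
                push(level + 1, center)

    for p in stream:
        push(0, (p, 1.0))                       # raw points start with weight 1

    # Simplification: cluster whatever is left on all levels in one final pass.
    leftovers = [wp for level in levels for wp in level]
    return cluster(leftovers) if leftovers else []
```

At most `m` points are held per level, so memory grows with the number of levels rather than with the stream length.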

Page 10

Data stream clustering

[Figure: input data stream, the first level, the second level]

time for video

$$l = \frac{\log(N/m)}{\log(m/k)}$$
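A quick numeric check of the level count l = log(N/m) / log(m/k) may help. The sketch assumes N is the stream length, m the block size and k the number of centers kept per block; the function name is hypothetical.

```python
import math


def num_levels(n: int, m: int, k: int) -> float:
    """Return log(n / m) / log(m / k), the number of levels for the given sizes."""
    if not (n > m > k >= 1):
        raise ValueError("expected n > m > k >= 1")
    return math.log(n / m) / math.log(m / k)
```

For example, `num_levels(1_000_000, 1000, 10)` evaluates log(1000) / log(100), which is 1.5, so two levels suffice for a million points under these settings.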

Page 11

Improvements

limiting the search space
  – inspect only points whose reassignment can improve the solution
  – i.e., those assigned to facilities within 2 fc radius
  – does not work for weighted points
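The search-space restriction might look like the filter below. It assumes the 2 fc radius is measured from the candidate to each existing facility and that the bound is inclusive; both are interpretations, and the names are hypothetical. As the slide notes, this shortcut applies to unweighted points only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def points_to_inspect(candidate: Point, assigned: Sequence[int],
                      facilities: Sequence[Point], fc: float) -> List[int]:
    """Return indices of points served by facilities within 2 * fc of the candidate."""
    near = {j for j, f in enumerate(facilities)
            if math.dist(candidate, f) <= 2 * fc}
    # Points served by more distant facilities are skipped entirely.
    return [i for i, a in enumerate(assigned) if a in near]
```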

Page 12

Improvements

modification from k-median to facility location
choosing the facility cost
  – equal to the diagonal of bounding box
weight normalization
  – we need to keep weights around 1, i.e., average weight equal to 1
  – divide weights by their average

$$\frac{1}{k}\sum_{i=0}^{k-1} w_i = c \quad\Rightarrow\quad \frac{1}{k}\sum_{i=0}^{k-1} \frac{w_i}{c} = 1$$
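Weight normalization amounts to dividing every weight by the average weight, so that the average becomes 1. A minimal sketch, with a hypothetical function name:

```python
from typing import List, Sequence


def normalize_weights(weights: Sequence[float]) -> List[float]:
    """Divide each weight by the average weight so the result averages to 1."""
    if not weights:
        return []
    avg = sum(weights) / len(weights)
    if avg <= 0:
        raise ValueError("average weight must be positive")
    return [w / avg for w in weights]
```

For example, weights `[2, 4, 6]` have average 4 and normalize to `[0.5, 1.0, 1.5]`.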

Page 13

Experiments – setting the facility cost

high setting
  – aggressive clustering
  – low number of large clusters
low setting
  – moderate clustering
  – many small clusters
set facility cost equal to diagonal of bounding box
affects memory and running time
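The bounding-box diagonal can be computed in a single pass over the points, which suits streamed data. The sketch below assumes an axis-aligned bounding box; whether the box covers the whole data set or a single block is not stated on the slide, and the function name is hypothetical.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, ...]


def bounding_box_diagonal(points: Iterable[Point]) -> float:
    """Return the diagonal length of the axis-aligned bounding box of the points."""
    it = iter(points)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("at least one point is required") from None
    lo, hi = list(first), list(first)
    for p in it:
        for i, x in enumerate(p):
            if x < lo[i]:
                lo[i] = x
            if x > hi[i]:
                hi[i] = x
    # Distance between the minimum and maximum corners of the box.
    return math.dist(lo, hi)
```

The returned value, or a multiple of it such as one half or two, can then be used as the facility cost when comparing settings.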

Page 14

Experiments – setting the facility cost

[Figures: facility cost set to diagonal, 2 diagonal, 1/2 diagonal, 1/4 diagonal]

Page 15

Experiments – input point distribution

many authors rely on data being ordered
  – usually true
  – presented algorithm can handle unordered data as well
there may be a problem with few outliers

Page 16

Experiments – input point distribution

[Figures: 1st block, 2nd block, 3rd block, higher level]

Page 17

Experiments – input point distribution

[Figures: 1st block, 2nd block, 3rd block, higher level]

Page 18

Experiments – block size

affects memory requirements
  – required memory: $m \cdot l = m \cdot \frac{\log(N/m)}{\log(m/k)}$
somewhat affects clustering result
affects running time
  – required iterations: $\frac{N}{m-k} \cdot m \log m$

Page 19

Experiments – number of iterations

m log m iterations necessary for a constant-factor approximation
for large blocks running time grows unpleasantly
0.1 m iterations seem to be enough; for data with clusters even less

Page 20

Experiments – number of iterations

[Figures: 1640 points clustered with 6560 iterations and with 164 iterations]

Page 21

Conclusion

modified data stream approach to facility location
introduced facility weight normalization
improvement to limit the number of points inspected
experiments
  – discussion of parameter settings
  – description of algorithm behavior

Page 22

References

M. Charikar, S. Guha. Improved Combinatorial Algorithms for the Facility Location and k-Median Problems. Proc. 40th Sympos. on Foundations of Computer Science, 1999, pp. 378–388.

S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. Clustering Data Streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.

L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, R. Motwani. Streaming-Data Algorithms for High-Quality Clustering. In Proceedings of IEEE International Conference on Data Engineering, March 2002.

S. Guha, A. Meyerson, N. Mishra, R. Motwani, L. O'Callaghan. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng. 15, 3 (2003), 515–528.