Very large data sets. Pasi Fränti. Clustering methods: Part 10. Speech and Image Processing Unit, School of Computing, University of Eastern Finland. 5.5.2014
Page 1: Very large data sets

Very large data sets

Pasi Fränti

Clustering methods: Part 10

Speech and Image Processing Unit

School of Computing

University of Eastern Finland

5.5.2014

Page 2: Very large data sets

Methods for large data sets

• Birch

• Clarans

• On-line EM

• Scalable EM

• GMG

Let’s study GMG (no material for the others)

Page 3: Very large data sets

Gradual model generator (GMG) [Kärkkäinen & Fränti, 2007: Pattern Recognition]

[Figure: GMG block diagram with the labels Data, Buffer, Model, Model size reduction, Model generation, Generated model, Selection, Post-processing, Output models]

Page 4: Very large data sets

Goal of the GMG algorithm

[Figure: two panels labeled EM and GMG]

Page 5: Very large data sets

Contours of probability density distributions

[Figure: two panels labeled EM and GMG]

Page 6: Very large data sets

Model update

[Figure: two panels labeled Before update and After update]

• New data points are mapped to the model immediately when they are input.

• Points that are too far from every model component remain in the buffer.

• Buffered points are re-tested when new model components are created.
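The update rule on this slide can be illustrated with a minimal Python sketch. It is not the implementation from the GMG paper: the squared Mahalanobis distance test, the `max_dist_sq` threshold, the dictionary layout of a component, and the function names `route_point` and `retest_buffer` are all assumptions made for illustration, and covariances are kept fixed to keep the example short.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of x from a Gaussian component."""
    d = x - mean
    return float(d @ np.linalg.solve(cov, d))

def route_point(x, components, buffer, max_dist_sq=9.0):
    """Map x to its nearest component, or buffer it if every component is too far.

    The threshold max_dist_sq is a hypothetical parameter, not a value from the slides.
    Returns the index of the updated component, or None if x was buffered.
    """
    best, best_d = None, np.inf
    for i, c in enumerate(components):
        d = mahalanobis_sq(x, c["mean"], c["cov"])
        if d < best_d:
            best, best_d = i, d
    if best is None or best_d > max_dist_sq:
        buffer.append(x)          # too far from any model: stays in the buffer
        return None
    c = components[best]
    c["n"] += 1
    c["mean"] = c["mean"] + (x - c["mean"]) / c["n"]   # running-mean update only
    return best

def retest_buffer(components, buffer, max_dist_sq=9.0):
    """Re-test buffered points, e.g. after new components have been created."""
    pending = list(buffer)
    buffer.clear()
    for x in pending:
        route_point(x, components, buffer, max_dist_sq)

# Usage: one component near the origin, one far-away point that must wait.
components = [{"mean": np.zeros(2), "cov": np.eye(2), "n": 10}]
buffer = []
near = route_point(np.array([0.5, 0.5]), components, buffer)
far = route_point(np.array([10.0, 10.0]), components, buffer)
buffered_before = len(buffer)
components.append({"mean": np.array([10.0, 10.0]), "cov": np.eye(2), "n": 1})
retest_buffer(components, buffer)
```

In the usage lines, the far point is buffered first and is absorbed only after a component is added near it, which mirrors the third bullet.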

Page 7: Very large data sets

Generating new components

[Figure: two panels labeled Data in buffer and Selected points and a new component]

• When the buffer is full, selected points are used to generate new components.

• The most compact k-neighborhood is selected as the seed for a new component.
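As a rough illustration of seed selection, the Python sketch below picks the buffered point whose k nearest neighbors lie closest to it and fits a Gaussian to that neighborhood. The compactness measure (sum of Euclidean distances to the k nearest neighbors) and the names `most_compact_neighborhood` and `seed_component` are assumptions; the slides do not define how compactness is measured.

```python
import numpy as np

def most_compact_neighborhood(points, k):
    """Indices of the most compact k-neighborhood (seed point plus its k nearest neighbors).

    Compactness is taken here as the sum of distances from a point to its
    k nearest neighbors; this definition is an assumption for illustration.
    """
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))          # full pairwise distance matrix
    order = np.argsort(dist, axis=1)[:, :k + 1]      # each row: the point and its k nearest
    spread = np.take_along_axis(dist, order, axis=1).sum(axis=1)
    seed = int(np.argmin(spread))
    return order[seed]

def seed_component(points, k):
    """Fit a Gaussian (mean, covariance, count) to the most compact neighborhood."""
    idx = most_compact_neighborhood(points, k)
    members = np.asarray(points, dtype=float)[idx]
    component = {
        "mean": members.mean(axis=0),
        "cov": np.cov(members, rowvar=False),
        "n": len(members),
    }
    return component, idx

# Usage: three tightly grouped points among scattered ones.
buffered = [[0.50, 0.50], [0.51, 0.50], [0.50, 0.51],
            [0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
component, idx = seed_component(buffered, k=2)
```

The pairwise distance matrix costs O(n²) memory, which is acceptable here only because the buffer is bounded in size.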

Page 8: Very large data sets

Example

[Figure: plot with both axes ranging from 0 to 1]

Page 9: Very large data sets

Example

[Figure: plot with both axes ranging from 0 to 1]

Page 10: Very large data sets

Example

Page 11: Very large data sets

Example

Page 12: Very large data sets

Example

Page 13: Very large data sets

Example

Page 14: Very large data sets

Post-processing

[Figure: one panel labeled Model before processing; both axes range from 0 to 1]

Page 15: Very large data sets

Post-processing

[Figure: two panels labeled Model before processing and Updated model; both axes range from 0 to 1]

Page 16: Very large data sets

Post-processing

[Figure: two panels labeled Model before processing and Updated model + data; both axes range from 0 to 1]

Page 17: Very large data sets

Literature

1. I. Kärkkäinen, P. Fränti, Gradual Model Generator for Single-Pass Clustering, Pattern Recognition 40(3) (2007) 784-795.

2. P. Bradley, U. Fayyad, C. Reina, Clustering Very Large Databases Using EM Mixture Models, Proc. of the 15th Int. Conf. on Pattern Recognition, vol. 2, 2000, pp. 76-80.

3. R. Ng, J. Han, CLARANS: A Method for Clustering Objects for Spatial Data Mining, IEEE Trans. Knowledge & Data Engineering 14(5) (2002) 1003-1016.

4. M. Sato, S. Ishii, On-line EM Algorithm for the Normalized Gaussian Network, Neural Computation 12(2) (2000) 407-432.

5. T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: A New Data Clustering Algorithm and Its Applications, Data Mining and Knowledge Discovery 1(2) (1997) 141-182.