OPTIMCLASS: Simultaneous identification of optimal clustering method and optimal number of clusters in vegetation classification studies Tichy Lubomír 1 , Chytry Milan 1 , Botta-Dukát Zoltán 2 , Hájek Michal 1 ; Talbot Stephen S. 3 1 Masaryk University, Brno, Czech Republic 2 Hungarian Academy of Sciences, Vácrátot, Hungary 3 U.S. Fish and Wildlife Service, Anchorage, USA

Jan 19, 2016

Page 1

OPTIMCLASS: Simultaneous identification of optimal clustering method and optimal number of clusters in vegetation classification studies

Tichy Lubomír (1), Chytry Milan (1), Botta-Dukát Zoltán (2), Hájek Michal (1), Talbot Stephen S. (3)

(1) Masaryk University, Brno, Czech Republic
(2) Hungarian Academy of Sciences, Vácrátot, Hungary
(3) U.S. Fish and Wildlife Service, Anchorage, USA

Page 2

Why do we need a method for identification of optimal clustering algorithm and optimal number of clusters?

The same dataset

Page 3

Why do we need a method for identification of optimal clustering algorithm and optimal number of clusters?

- A huge variety of clustering methods produce “reasonable” results.

- The clustering method and the number of clusters are usually selected subjectively, based on empirical experience.

Methods published:

Most algorithms identify the optimal partition mathematically, without considering ecological interpretation.

Page 4

The Method

A posteriori description of phytosociological tables is based on diagnostic species.

Diagnostic species describe a cluster. Therefore, the number of diagnostic species determines whether the classified table can be sufficiently interpreted.

Species 1   98788   12112   3.211
Species 2   51123   1223.   11132
Species 3   23132   .....   .....
Species 4   ..2.4   112..   1..5.
Species 5   .....   .1.1.   1.213

Page 5

The Method

The same dataset:

Page 6

The Method

Measure of the classification quality: the total sum of diagnostic species

Fisher's exact test calculates the probability of the observed occurrence of a species across clusters under a right-tailed hypothesis.

– The measure reduces the importance of very small clusters.

– Easy interpretation: the more diagnostic species in the dataset, the better the description of the clusters.
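
A minimal sketch of how such a measure could be computed, not the authors' implementation. It assumes presence/absence data, treats a species as diagnostic for a cluster when the right-tailed Fisher probability falls below a chosen threshold, and sums those species over all clusters. The function names, the default threshold of 10^-3, and the reading of the small example table from the earlier slide (three clusters of five plots each, a dot meaning absence) are assumptions made for illustration.

```python
from math import comb


def fisher_right_tail(n_total, n_cluster, n_species, n_joint):
    """Right-tailed Fisher's exact test probability.

    Probability of finding the species in at least n_joint plots of the
    cluster if its n_species occurrences were spread at random over the
    n_total plots, n_cluster of which belong to the cluster.
    """
    upper = min(n_cluster, n_species)
    tail = sum(
        comb(n_species, k) * comb(n_total - n_species, n_cluster - k)
        for k in range(n_joint, upper + 1)
    )
    return tail / comb(n_total, n_cluster)


def count_diagnostic_species(presence, labels, threshold=1e-3):
    """Sum, over all clusters, the species whose probability is below threshold."""
    n_total = len(labels)
    total = 0
    for occurrences in presence.values():
        n_species = sum(occurrences)
        for cluster in set(labels):
            n_cluster = labels.count(cluster)
            n_joint = sum(
                1 for present, lab in zip(occurrences, labels)
                if present and lab == cluster
            )
            p = fisher_right_tail(n_total, n_cluster, n_species, n_joint)
            if p < threshold:
                total += 1
    return total


# Example table from the slide; the column grouping is an assumption.
rows = {
    "Species 1": "98788 12112 3.211",
    "Species 2": "51123 1223. 11132",
    "Species 3": "23132 ..... .....",
    "Species 4": "..2.4 112.. 1..5.",
    "Species 5": "..... .1.1. 1.213",
}
presence = {
    name: [ch != "." for ch in row.replace(" ", "")]
    for name, row in rows.items()
}
labels = [1] * 5 + [2] * 5 + [3] * 5

n_diag = count_diagnostic_species(presence, labels)
print(n_diag)
```

On this toy table only Species 3 passes the 10^-3 threshold, for the first cluster. Repeating the count for each candidate partition gives one point on a curve of diagnostic species against number of clusters.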

Page 7

The Method

Test on three different datasets

Southern Siberia, Sayan Mountains (310 plots; forest, steppe and tundra vegetation)

Central Europe, Carpathians (241 plots; mire vegetation)

Alaska, Kenai Peninsula (171 plots; wetlands)

Page 8

The Method

Classifications tested

Flexible beta clustering, Ward's clustering, UPGMA (PC-ORD)

– Cover transformations (percentages, log percentages, Braun-Blanquet, presence/absence)

– Distance measures (Bray-Curtis, Manhattan, Euclidean)

Ordinal cluster analysis (SYN-TAX)

– Distance measures (Kruskal-Wallis, Kendall, Gower-Podani coefficient)

Modified TWINSPAN classification (JUICE)

– The sequence of splits in divisive classification is determined by the internal heterogeneity of clusters. Therefore, any number of clusters is possible.

– Three modifications of pseudospecies cut levels
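
To make the cover transformations and the Bray-Curtis, Manhattan and Euclidean distance measures named on this slide concrete, here is a small sketch in plain Python. It is illustrative only: the function names are hypothetical, `log(x + 1)` is an assumed form of the log transformation, the two plots hold made-up cover percentages, and the Braun-Blanquet transformation is left out because the slide does not give its class limits.

```python
from math import log, sqrt


def log_transform(cover):
    # Assumed form: log(x + 1), so that absent species stay at zero.
    return [log(c + 1.0) for c in cover]


def presence_absence(cover):
    return [1.0 if c > 0 else 0.0 for c in cover]


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def bray_curtis(a, b):
    # Bray-Curtis dissimilarity: sum|x - y| / sum(x + y).
    denominator = sum(x + y for x, y in zip(a, b))
    if denominator == 0:
        return 0.0  # two empty plots are treated as identical
    return manhattan(a, b) / denominator


# Hypothetical percentage covers of four species in two plots.
plot_a = [60.0, 5.0, 0.0, 1.0]
plot_b = [10.0, 5.0, 20.0, 0.0]

for name, transform in [
    ("percentages", lambda cover: cover),
    ("log percentages", log_transform),
    ("presence/absence", presence_absence),
]:
    a, b = transform(plot_a), transform(plot_b)
    print(name, euclidean(a, b), manhattan(a, b), bray_curtis(a, b))
```

The loop shows how the same pair of plots yields different dissimilarities under each transformation, which is why each combination of transformation, distance measure and clustering method counts as a separate classification.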

Page 9

Results: Sayan Mountains, Siberia (310 plots, 1036 species)

[Figure: three panels labelled Probability = 10^-3, Probability = 10^-6 and Probability = 10^-9. x-axis: No. of clusters (0 to 50); y-axis: No. of diagnostic species (ranges 0 to 900, 0 to 300 and 0 to 150 across the panels).]

Page 10

Results: Sayan Mountains, Siberia (310 plots, 1036 species)

Untransformed cover data

[Figure: number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 11

Results: Sayan Mountains, Siberia (310 plots, 1036 species)

Euclidean distance measure

[Figure: number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 12

Results: Sayan Mountains, Siberia (310 plots, 1036 species)

Manhattan distance measure

[Figure: number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 13

Results: Sayan Mountains, Siberia (310 plots, 1036 species)

Bray-Curtis distance measure

[Figure: number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 14

Results: Sayan Mountains, Siberia (310 plots, 1036 species)

UPGMA

[Figure: number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 15

Results: Sayan Mountains, Siberia (310 plots, 1036 species)

Ward's method

[Figure: number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 16

Results: Sayan Mountains, Siberia (310 plots, 1036 species)

Flexible beta -0.25

[Figure: number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 17

Results: Sayan Mountains, Siberia (310 plots, 1036 species)

Ordinal cluster analyses (SYN-TAX)

[Figure: number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 18

Results: Sayan Mountains, Siberia (310 plots, 1036 species)

Modified TWINSPAN

[Figure: number of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 19

The Method

Test on three different datasets

Southern Siberia, Sayan Mountains (310 plots; forest, steppe and tundra vegetation)

Central Europe, Carpathians (241 plots; mire vegetation)

Alaska, Kenai Peninsula (171 plots; wetlands)

Similar results:

Page 20

Conclusions

Classifications based on transformed cover values give better results than percentage covers.

Euclidean distance - slightly poorer results than Manhattan or Bray-Curtis distances.

UPGMA clustering method - poorer results than Ward’s and Flexible beta methods.

No significant difference between ordinal cluster analysis proposed by Podani (SYN-TAX 2000) and other clustering methods.

Modified TWINSPAN – performs well with small numbers of clusters.

Page 21
Page 22
Page 23

Modified TWINSPAN classification

[Figure: number of diagnostic species occurrences (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 24

Modified TWINSPAN classification

[Figure: sum of diagnostic species (y-axis, 0 to 300) against number of clusters (x-axis, 0 to 50).]

Page 25

Modified TWINSPAN classification

[Figure: number of clusters with more than 4 diagnostic species (y-axis, 0 to 20) against number of clusters (x-axis, 0 to 50).]
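
The quantity on the vertical axis of this last plot, the number of clusters with more than 4 diagnostic species, can be sketched as follows. This is an illustrative reading of the axis label, not the authors' code: the function names are hypothetical, the per-cluster counts are made-up numbers, and picking the partition that maximises the count is an assumption about how such a curve would be used.

```python
def clusters_above(diag_counts, minimum=4):
    """Number of clusters that have more than `minimum` diagnostic species."""
    return sum(1 for n in diag_counts if n > minimum)


def best_partition(counts_by_k, minimum=4):
    """Number of clusters whose partition maximises clusters_above().

    Ties go to the partition listed first.
    """
    return max(counts_by_k, key=lambda k: clusters_above(counts_by_k[k], minimum))


# Hypothetical diagnostic species counts per cluster, keyed by number of clusters.
counts_by_k = {
    2: [12, 9],
    3: [10, 7, 5],
    4: [9, 6, 5, 5],
    5: [8, 5, 4, 2, 1],
}

curve = {k: clusters_above(v) for k, v in counts_by_k.items()}
best_k = best_partition(counts_by_k)
print(curve, best_k)
```

Unlike the total sum of diagnostic species, this count is not raised by one cluster with very many diagnostic species; it only rises when an additional cluster clears the minimum.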