Page 1

Penalized and weighted K-means for clustering with noise and prior information incorporation

George C. Tseng
Department of Biostatistics
Department of Human Genetics
University of Pittsburgh

Page 2

Outline

- Intro of cluster analysis
  - Model-based clustering
  - Heuristic methods: hierarchical clustering, K-means & K-medoids, …
- A motivating example (yeast cell cycle microarray data)
- Penalized weighted K-means
  - Penalty term and weights
  - Some properties
  - Estimate parameters (k and λ)
- Applications
  - Simulation
  - Yeast cell cycle microarray data
  - CID fragmentation patterns in MS/MS
- Discussion

Page 3

Intro. of cluster analysis

Cluster analysis: Data $X = \{x_i, i = 1, \ldots, n\}$, each object $x_i \in R^p$. Given a dissimilarity measure $d(x_i, x_j)$, assign the $n$ objects into $k$ disjoint clusters, i.e. $C = \{C_1, \ldots, C_k\}$ with $\bigcup_{j=1}^{k} C_j = X$.

[Scatter plot of example data in the (x, y) plane.]

Page 4

Intro. of cluster analysis

Cluster analysis:

1. Estimate the number of clusters k.
2. Decide which clustering method to use.
3. Evaluate and re-validate the clustering results.

Cluster analysis has a long history in the statistics, computer science and applied mathematics literature.

Page 5

Intro. of cluster analysis

Model-based clustering:

1. Mixture maximum likelihood (ML):

$$L(\pi, \mu, \Sigma) = \sum_{i=1}^{n} \log\left[\sum_{j=1}^{k} \pi_j\, f(x_i; \mu_j, \Sigma_j)\right]$$

2. Classification maximum likelihood (CML):

$$L(C, \mu, \Sigma) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \log f(x_i; \mu_j, \Sigma_j), \qquad C = \{C_1, \ldots, C_k\},\ \bigcup_{j=1}^{k} C_j = X$$

Page 6

Intro. of cluster analysis

Model selection for model-based clustering: use the Bayesian Information Criterion (BIC) to determine k and $\Sigma_j$:

$$2 \log p(x \mid M) + \mathrm{const} \approx 2\, l_M(x, \hat{\theta}) - m_M \log(n) \equiv \mathrm{BIC}$$

$p(x \mid M)$ is the (integrated) likelihood of the data for the model M.
$l_M(x, \hat{\theta})$ is the maximized mixture log-likelihood for the model.
$m_M$ is the number of independent parameters to be estimated in the model.
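As a quick illustration, a minimal sketch of BIC-based selection of k, assuming Python with scikit-learn (the toy data and the use of `GaussianMixture` are illustrative, not the talk's software):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian clusters in R^2.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

for k in range(1, 6):
    gm = GaussianMixture(n_components=k, covariance_type="full",
                         random_state=0).fit(X)
    # Note: sklearn's bic() is -2*loglik + m_M*log(n) (lower is better),
    # i.e. the negative of the slide's BIC convention.
    print(k, round(gm.bic(X), 1))
```

On this toy data, k = 2 should attain (roughly) the smallest value.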

Page 7

Intro. of cluster analysis

Hierarchical clustering:

Page 8

Intro. of cluster analysis

Hierarchical clustering:

Iteratively agglomerate the nearest nodes to form a bottom-up tree.

Single linkage: shortest distance between points in the two nodes.

Complete linkage: largest distance between points in the two nodes.

Note: clusters can be obtained by cutting the hierarchical tree.
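A minimal sketch of these linkages using SciPy (toy data; SciPy is an assumption, not the talk's software):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Single linkage: merge nodes by the shortest inter-point distance.
Z_single = linkage(X, method="single")
# Complete linkage: merge nodes by the largest inter-point distance.
Z_complete = linkage(X, method="complete")

# "Cutting the tree": force the dendrogram into 2 flat clusters.
labels = fcluster(Z_complete, t=2, criterion="maxclust")
print(labels)
```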

Page 9

Intro. of cluster analysis

K-means criterion: minimize the within-group sum-squared dispersion to obtain C:

$$W_{K\text{-}means}(C; k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{x}^{(j)} \rVert^2,$$

where $\bar{x}^{(j)}$ is the center of cluster $C_j$.

K-medoids criterion:

$$W_{K\text{-}medoids}(C; k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} d(x_i, x^{(j)}),$$

where $x^{(j)}$ is the median point (medoid) in cluster $C_j$.

Page 10

Intro. of cluster analysis

$$W_{K\text{-}means}(C; k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{x}^{(j)} \rVert^2$$

Proposition: K-means is a special case of CML under a Gaussian model of identical spherical clusters.

K-means vs. CML: with density $f(x; \mu_j, \sigma^2 I)$, $\log f(x_i; \mu_j, \sigma^2 I) = -\lVert x_i - \mu_j \rVert^2 / (2\sigma^2) + \text{const}$, so maximizing the classification likelihood over $C$ (with $\hat{\mu}_j = \bar{x}^{(j)}$) is equivalent to minimizing $W_{K\text{-}means}(C; k)$.
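A minimal NumPy sketch of Lloyd's algorithm, the standard iteration that locally minimizes $W_{K\text{-}means}$ (illustrative; not the talk's implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm for the K-means criterion W(C; k)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    W = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, centers, W
```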

Page 11

A motivating example: Yeast cell cycle microarray data

K-means cluster sizes: C1 = 356, C2 = 300, C3 = 267, C4 = 538, C5 = 202.

Clusters contain many false positives because the algorithm has to assign all genes into clusters.

Many genes are noise (scattered) genes!!

Page 12

A motivating example: Yeast cell cycle microarray data

Traditional: Assign all genes into clusters.

[Two scatter plots of the data in the (x, y) plane.]

Question: Can we allow a set of scattered (noise) genes?

Page 13

A motivating example: Yeast cell cycle microarray data

Cluster sizes, K-means vs. penalized K-means:

K-means:            C1 = 356, C2 = 300, C3 = 267, C4 = 538, C5 = 202
Penalized K-means:  C1 = 58,  C2 = 31,  C3 = 39,  C4 = 101, C5 = 71,  S = 1276

Page 14

A motivating example: Yeast cell cycle microarray data

Prior information: Six groups of validated cell cycle genes.

Page 15

A motivating example: Yeast cell cycle microarray data

Goal 1: Allow a set of scattered genes without being clustered.

Goal 2: Incorporate prior information in cluster formation.

Page 16

PW-Kmeans

Formulation: Assign n objects into k clusters and a possible noise set, i.e. $C = \{C_1, \ldots, C_k, S\}$. Extend the K-means criterion to:

$$W(C; k, \lambda) = \sum_{j=1}^{k} \sum_{x_i \in C_j} w(x_i; P)\, d(x_i, C_j) + \lambda \lvert S \rvert$$

$d(x_i, C_j)$: dispersion of point $x_i$ in $C_j$.
$\lvert S \rvert$: number of objects in the noise set S.
$w(x_i; P)$: weight function to incorporate prior information P.
$\lambda$: a tuning parameter.

Page 17

PW-Kmeans

How does it work?

Penalty term λ: assigns outlying objects of a cluster to the noise set S.

Weighting term w: utilizes prior knowledge of preferred or prohibited patterns P.
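A minimal sketch of how the penalty operates, for the unweighted case (w ≡ 1) with squared-Euclidean dispersion — an illustrative NumPy implementation, not the author's code: in each assignment step, a point whose dispersion to its nearest center exceeds λ costs less as a noise point, so it is moved to S.

```python
import numpy as np

def penalized_kmeans(X, k, lam, n_iter=100, seed=0):
    """Penalized K-means sketch: minimizes
    sum_j sum_{x in C_j} ||x - mean_j||^2 + lam * |S|,
    so a point joins the noise set S (label -1) exactly when its
    squared distance to the nearest center exceeds lam."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -2)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        new = np.where(d2.min(axis=1) > lam, -1, d2.argmin(axis=1))
        if np.array_equal(new, labels):
            break
        labels = new
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    W = (sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
         + lam * np.sum(labels == -1))
    return labels, centers, W
```

Setting λ = ∞ recovers plain K-means, matching the special-case remark on Page 19.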

Page 18

PW-Kmeans

Proposition: Denote by $(C^*(\lambda), S^*(\lambda)) = (C_1^*(\lambda), \ldots, C_k^*(\lambda), S^*(\lambda))$ the minimizer of the criterion given $\lambda$ and $k$. Then:

1. If $k_1 > k_2$ (with $\lambda$ fixed), $W(C^*; k_1, \lambda) < W(C^*; k_2, \lambda)$.
2. If $\lambda_1 > \lambda_2$ (with $k$ fixed), $\lvert S^*(\lambda_1) \rvert \le \lvert S^*(\lambda_2) \rvert$.
3. If $\lambda_1 > \lambda_2$ (with $k$ fixed), $W(C^*(\lambda_1); k, \lambda_1) > W(C^*(\lambda_2); k, \lambda_2)$.

Page 19

PW-Kmeans

K-means and K-medoids are two special cases of the generalized PW-Kmeans form (take $w(\cdot\,;\cdot) = 1$ and $\lambda = \infty$):

$$W_{K\text{-}means}(C; k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{x}^{(j)} \rVert^2$$

$$W_{K\text{-}medoids}(C; k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} d(x_i, x^{(j)})$$

Page 20

PW-Kmeans: Relation to classification likelihood

K-means loss function:

$$W_{K\text{-}means}(C; k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{x}^{(j)} \rVert^2$$

Classification likelihood (Gaussian model):

$$L(C, \mu, \sigma) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \log \phi(x_i; \mu_j, \sigma^2 I)$$

Page 21

PW-Kmeans: Relation to classification likelihood

Penalized K-means loss function:

$$W(C; k, \lambda) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{x}^{(j)} \rVert^2 + \lambda \lvert S \rvert$$

Classification likelihood (Gaussian model, with the noise set uniformly distributed over a space V):

$$L_C = \sum_{j=1}^{k} \sum_{x_i \in C_j} \log \phi(x_i; \mu_j, \sigma^2 I) + \sum_{x_i \in S} \log \frac{1}{\lvert V \rvert}$$

V is the space where the noise set is uniformly distributed.

Page 22

PW-Kmeans

Estimate k and λ (Tibshirani et al. 2001)

Page 23

PW-Kmeans: Estimate k and λ by prediction strength

Split the data into a training set $X_{tr}$ and a test set $X_{te}$.

$C(X, k)$: the clustering operation, clustering $X$ into $k$ clusters.

$D[C(X_{tr}, k), X_{te}]$: an $n_{te} \times n_{te}$ matrix denoting "co-memberships": entry $(i, i')$ is 1 if $x_i$ and $x_{i'}$ are in the same cluster.

Let $C(X_{te}, k) = \{A_1, \ldots, A_k\}$ be the test cluster sets, with sizes $n_1, \ldots, n_k$.

$$ps(k) = \min_{1 \le j \le k} \frac{1}{n_j (n_j - 1)} \sum_{\substack{i \ne i' \\ i, i' \in A_j}} D[C(X_{tr}, k), X_{te}]_{i i'}$$

For each test cluster, we compute the proportion of observation pairs in that cluster that are also assigned to the same cluster by the training-set centroids.
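A minimal sketch of prediction strength, with scikit-learn's KMeans standing in for the clustering operation C(·, k) (the train/test split and Euclidean K-means are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def prediction_strength(X_tr, X_te, k, seed=0):
    """ps(k) of Tibshirani et al. (2001), as defined above."""
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_tr)
    km_te = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_te)
    # Co-membership matrix D[C(X_tr, k), X_te]: test points classified
    # by the *training* centroids.
    z = km_tr.predict(X_te)
    D = (z[:, None] == z[None, :])
    ps = 1.0
    for j in range(k):
        idx = np.where(km_te.labels_ == j)[0]
        n_j = len(idx)
        if n_j < 2:
            continue
        # Proportion of pairs in test cluster A_j that also co-cluster
        # under the training centroids (diagonal excluded).
        prop = (D[np.ix_(idx, idx)].sum() - n_j) / (n_j * (n_j - 1))
        ps = min(ps, prop)
    return ps
```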

Page 24

Simulation: Penalized K-means (no weight term)

λ is inversely related to the number of noise genes |S|.

Page 25

Simulation: Estimate k and λ

Page 26

Applications I: Yeast cell cycle microarray data

Prior information: Six groups of validated cell cycle genes.

Page 27

Applications I: Yeast cell cycle microarray data

Prior information: Six groups of validated cell cycle genes.

8 histone genes are tightly coregulated in S phase.

Page 28

Applications I: Yeast cell cycle microarray data

Penalized K-means cluster sizes: C1 = 58, C2 = 31, C3 = 39, C4 = 101, C5 = 71, S = 1276.

The 8 histone genes are left in the noise set S without being clustered.

Page 29

Applications I: Yeast cell cycle microarray data

Penalized weighted K-means

Prior knowledge of p pathways: the weight is designed as a transformation of the logistic function.
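The transcript does not preserve the exact weight formula, so the following is only a hypothetical illustration of the stated idea: a logistic transformation of the distance to the nearest prior (pathway) pattern, giving objects that match the prior a weight near 0 (their dispersion is heavily discounted, pulling them into clusters) and unrelated objects a weight near 1 (plain K-means behavior).

```python
import numpy as np

def weight(x, prior_patterns, beta=5.0):
    """Hypothetical logistic-based weight w(x; P).

    prior_patterns: expression patterns from the p known pathways.
    beta: sharpness of the logistic transition (assumed, not from the talk).
    Returns a value in [0, 1): 0 when x coincides with a prior pattern,
    approaching 1 as x moves away from all of them."""
    d = min(np.linalg.norm(x - p) for p in prior_patterns)
    # Logistic function rescaled from (1/2, 1) to (0, 1).
    return 2.0 / (1.0 + np.exp(-beta * d)) - 1.0
```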

Page 30

Applications I: Yeast cell cycle microarray data

Design of weight function

Page 31

Applications I: Yeast cell cycle microarray data

Take three randomly selected histone genes as prior information P, then perform penalized weighted K-means.

Penalized weighted K-means cluster sizes: C1 = 112, C2 = 158, C3 = 88, C4 = 139, C5 = 57, S = 1109.

The 8 histone genes are now in cluster C3.

Page 32

Applications I: Yeast cell cycle microarray data

Annotation prediction from clusters (unannotated genes are marked in the figure).

p-value calculation (the null is the hypergeometric distribution):

$$p = \sum_{x \ge d(F)} \frac{\binom{D(F)}{x}\binom{G - D(F)}{n(C) - x}}{\binom{G}{n(C)}}$$

G: total of 1663 genes
D(F): # of genes in the functional category (23 + 5 = 28)
n(C): # of genes in the cluster (4 + 23 + 71 = 98)
d(F): # of genes in both the cluster and the functional category (23)
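This tail probability can be checked directly with SciPy's hypergeometric distribution (sf(d − 1) gives P(X ≥ d)):

```python
from scipy.stats import hypergeom

G, D_F, n_C, d_F = 1663, 28, 98, 23
# X ~ Hypergeometric(population M=G, successes n=D_F, draws N=n_C);
# the p-value is P(X >= d_F).
p = hypergeom.sf(d_F - 1, M=G, n=D_F, N=n_C)
print(p)  # very small, as expected when 23 of 28 land in one cluster
```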

Page 33

Applications I: Yeast cell cycle microarray data

Evaluation of annotation prediction: given a p-value threshold δ (here δ = 0.01), we can compute:

Predictions made: 42 + 98 + 98 = 238
Accuracy: (10 + 4 + 23) / (42 + 98 + 98) = 37 / 238 = 15.55%

Varying δ gives varying "Predictions made" and "Accuracy".

Page 34

Applications I: Yeast cell cycle microarray data

Evaluation of annotation prediction, $\delta = 10^{-4}, \ldots, 10^{-20}$; compared against the accuracy of a random guess.

Page 35

Applications I: Yeast cell cycle microarray data

Conclusions from the evaluation of annotation prediction:

• P-Kmeans is generally better than K-means.

• P-Kmeans makes fewer predictions than K-means but produces much higher accuracy.

• A smaller λ results in smaller clusters and fewer predictions made, but with better accuracy.

Page 36

Applications II: CID fragmentation pattern in MS/MS

[Workflow figure: proteins are digested by an enzyme into peptides (e.g. SIYDGK, FWSEFR, TLLHPYK), separated by HPLC, and analyzed by MS and then MS/MS with collision-induced dissociation (CID); the resulting abundance vs. m/z spectra are fed to a peptide sequencing algorithm.]

Page 37

Applications II: CID fragmentation pattern in MS/MS

Collision-Induced Dissociation (CID): collision energy breaks a peptide at different backbone positions (e.g. LITSHLVDTDPEVD | SIIKDEIER vs. LITSHL | VDTDPEVDSIIKDEIER), producing fragment ions such as b10 and y13 in the % relative abundance vs. m/z spectrum.

The abundances of such cleavages are recorded as intensities.

Page 38

Applications II: CID fragmentation pattern in MS/MS

One single peptide: AAAMDAQAEAK. For each ordered amino-acid pair (1st, 2nd), count the cleavage sites and record their intensities:

1st  2nd  count  intensities
A    A    2      0, 0.381
A    E    1      0.031
A    M    1      0.514
A    Q    1      0.096
(all other pairs shown: count 0)

All intensities are normalized to [0, 1].

Assume intensities measure the probability of dissociation.
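A minimal sketch of building such a pair table from one peptide and its per-bond cleavage intensities (the intensity values below are toy numbers chosen to reproduce the A-row entries above; illustrative only):

```python
from collections import defaultdict

def pair_table(peptide, intensities):
    """Group cleavage intensities by the (1st, 2nd) residue pair flanking
    each backbone bond; expects len(intensities) == len(peptide) - 1."""
    table = defaultdict(list)
    for i, inten in enumerate(intensities):
        table[(peptide[i], peptide[i + 1])].append(inten)
    return table

# Toy normalized intensities for the 10 bonds of AAAMDAQAEAK.
t = pair_table("AAAMDAQAEAK",
               [0.0, 0.381, 0.514, 0.7, 0.6, 0.096, 0.3, 0.031, 0.2, 0.1])
print(t[("A", "A")])  # -> [0.0, 0.381]
print(t[("A", "M")])  # -> [0.514]
```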

Page 39

Applications II: CID fragmentation pattern in MS/MS

For a specific set of peptides (720 peptides), pooling over all occurrences gives 20 × 20 = 400 independent distributions, each with zero or multiple (up to hundreds of) observations:

1st 2nd count  intensities
A   A   188    0 0 0 0 0 0 0 0 0 0 0
A   C   1      0.105
A   D   91     0 0 0 0 0 0 0 0 0 0 0
A   E   94     0 0 0 0 0 0 0 0 0 0 0
A   F   41     0 0 0 0 0 0 0 0 0.008 0.008 0.01
A   G   129    0 0 0 0 0 0 0 0 0 0 0
A   H   0
A   I   69     0 0 0 0 0 0 0 0 0 0 0.019
A   K   5      0 0.01 0.013 0.122 0.139
A   L   137    0 0 0 0 0 0 0 0 0 0 0
A   M   28     0.004 0.02 0.023 0.034 0.034 0.06 0.062 0.068 0.093 0.101 0.122
A   N   52     0 0 0 0 0.006 0.022 0.025 0.025 0.033 0.039 0.046
A   P   152    0 0 0 0 0.012 0.029 0.041 0.042 0.049 0.059 0.068
A   Q   58     0 0 0 0 0 0 0 0 0 0 0
A   R   0
A   S   59     0 0 0 0 0 0.01 0.021 0.031 0.038 0.04 0.049
A   T   78     0 0 0 0 0 0 0 0 0 0.013 0.018
A   V   88     0 0 0 0 0 0 0 0 0 0 0
A   W   11     0 0 0 0 0.029 0.149 0.205 0.321 0.428 0.454 1
A   Y   26     0 0 0 0 0.008 0.011 0.047 0.063 0.065 0.078 0.079
C   A   0
…

Page 40

Applications II: CID fragmentation pattern in MS/MS

[Workflow figure repeated from Page 36.]

Page 41

Applications II: CID fragmentation pattern in MS/MS

Current protein identification algorithms assume a completely random dissociation probability pattern.

This, however, is found not to be true: the dissociation pattern depends on the charge state and the peptide sequence motif.

Page 42

Applications II: CID fragmentation pattern in MS/MS

A–X pairs (low intensities) vs. D–X pairs (high intensities):

1st 2nd count  intensities
A   A   188    0 0 0 0 0 0 0 0 0 0
A   C   1      0.105
A   D   91     0 0 0 0 0 0 0 0 0 0
A   E   94     0 0 0 0 0 0 0 0 0 0
A   F   41     0 0 0 0 0 0 0 0 0.008 0.008
A   G   129    0 0 0 0 0 0 0 0 0 0
A   H   0
A   I   69     0 0 0 0 0 0 0 0 0 0
A   K   5      0 0.01 0.013 0.122 0.139
A   L   137    0 0 0 0 0 0 0 0 0 0
A   M   28     0.004 0.02 0.023 0.034 0.034 0.06 0.062 0.068 0.093 0.101
A   N   52     0 0 0 0 0.006 0.022 0.025 0.025 0.033 0.039
A   P   152    0 0 0 0 0.012 0.029 0.041 0.042 0.049 0.059
A   Q   58     0 0 0 0 0 0 0 0 0 0
A   R   0

D   A   73     0.018 0.026 0.032 0.046 0.046 0.093 0.108 0.146 0.155 0.173 0.193
D   C   1      0
D   D   38     0 0.012 0.056 0.097 0.097 0.118 0.135 0.142 0.156 0.182 0.197
D   E   53     0 0 0 0.021 0.026 0.035 0.05 0.073 0.08 0.085 0.095
D   F   38     0 0.003 0.044 0.1 0.119 0.214 0.232 0.283 0.41 0.468 0.507
D   G   44     0.024 0.128 0.128 0.226 0.239 0.247 0.383 0.395 0.491 0.529 0.693
D   H   0
D   I   38     0.029 0.054 0.063 0.128 0.15 0.173 0.229 0.233 0.257 0.268 0.284
D   K   0
D   L   76     0 0 0 0 0 0.01 0.057 0.064 0.113 0.114 0.126
D   M   15     0 0.147 0.212 0.376 0.419 0.709 0.806 0.841 0.885 0.887 0.947
D   N   18     0.047 0.108 0.232 0.442 0.458 0.506 0.508 0.575 0.585 0.844 0.904
D   P   63     0.122 0.444 0.481 0.61 0.631 0.675 0.753 0.882 0.883 1 1
D   Q   14     0 0.098 0.163 0.228 0.262 0.297 0.421 0.459 0.463 0.55 0.667
D   R   0

A–X: low. D–X: high.

Page 43

Applications II: CID fragmentation pattern in MS/MS

Visualization of a distribution: ten concentric donuts represent the 5%, 15%, …, 95% percentiles; values are shown by a color gradient.

Page 44

Applications II: CID fragmentation pattern in MS/MS

The dissociation pattern depends on the charge state and the peptide sequence motif.

Page 45

Applications II: CID fragmentation pattern in MS/MS

Distances cannot be defined for most pairs of peptides (more than 95% missing values).

Distance between a peptide and a set of peptides can be defined.

K-means and PW-Kmeans are applicable while most other clustering methods fail.

Page 46

Applications II: CID fragmentation pattern in MS/MS

[Figure: distribution visualizations for peptides of the form […P…R]+ and […P…R]2+.]

Page 47

• Intensity data of each peptide contain >95% missing values; most clustering methods would not work.

• Dissimilarity between two peptides cannot be defined.

• Fortunately, the dissimilarity between one peptide and a set of peptides can be calculated, so penalized K-means can be used, as sketched below.
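A hypothetical sketch of why peptide-to-cluster dissimilarity remains computable (the talk does not give its exact formula): compare a peptide to a cluster's coordinate-wise means over only the jointly observed coordinates, which the cluster, pooling many peptides, covers far more densely than any single peptide does.

```python
import numpy as np

def dispersion_to_cluster(x, cluster):
    """Hypothetical stand-in for d(x_i, C_j) with missing data (NaN):
    mean squared deviation of x from the cluster's coordinate-wise
    means, over coordinates observed both in x and in the cluster."""
    center = np.nanmean(cluster, axis=0)      # per-coordinate cluster mean
    mask = ~np.isnan(x) & ~np.isnan(center)   # jointly observed coordinates
    if not mask.any():
        return np.inf                         # nothing comparable
    return float(np.mean((x[mask] - center[mask]) ** 2))
```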

Page 48

Discussion

Model-based: Gaussian mixture model, Bayesian clustering.

Heuristic: K-medoids, K-means, PW-Kmeans, SOM, hierarchical clustering, CLICK.

Tight clustering (re-evaluation machinery using re-sampling techniques).

Page 49

Conclusion

Penalized and weighted K-means (PW-Kmeans) provides a general and flexible way to cluster complex data.

The penalty term allows a noise set to remain unclustered, avoiding information dilution by noise.

The weights can be designed to incorporate prior biological pathway information; this is similar to a Bayesian approach but avoids specific modelling.

In situations with many missing values (the MS/MS example), most methods are hard to pursue, but P-Kmeans worked well.

Page 50

Acknowledgement

MS/MS data: collaboration with Yingying Huang from Vicki Wysocki's lab at the University of Arizona.

Discussion and comments from Haiyan Huang, Eleanor Feingold and Wing Wong.