Fuzzy K means
Dec 21, 2015
Fuzzy K means
● A gene can be assigned to several clusters
● Each gene is assigned to a cluster with a membership value between 0 and 1
● The membership values of a gene add up to one
● Genes with lower membership values are not well represented by the cluster centroid
● Expression patterns of genes with high membership values are close to the cluster centroid
Centroid
• During the centroid refinement in each clustering cycle, new centroids were calculated as the weighted mean of all the gene-expression patterns in the data set.
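A minimal sketch of a membership-weighted centroid update. It assumes squared memberships (as in standard fuzzy c-means with fuzzifier 2) and an optional per-gene weight; the function name `update_centroids` and the array layout are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def update_centroids(X, M, gene_weights=None):
    """Weighted-mean centroid update.

    X: (n_genes, n_conditions) expression matrix.
    M: (n_genes, k) membership matrix.
    Returns a (k, n_conditions) array of centroids.
    """
    n_genes = X.shape[0]
    if gene_weights is None:
        gene_weights = np.ones(n_genes)
    # Assumption: memberships are squared, as in standard fuzzy c-means.
    coeff = (M ** 2) * gene_weights[:, None]          # (n_genes, k)
    # Each centroid is the coefficient-weighted mean of all gene patterns.
    return (coeff.T @ X) / coeff.sum(axis=0)[:, None]
```

With hard (0/1) memberships this reduces to the ordinary per-cluster mean, which is a quick sanity check on the update.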
Membership Function
• Each gene’s membership m (a continuous variable from 0 to 1) is defined as:
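One common way to obtain memberships with these properties is the standard fuzzy c-means rule with fuzzifier 2, where membership is proportional to the inverse squared distance to each centroid. The sketch below assumes distance = 1 - Pearson correlation; that choice, and the names `pearson_distance` and `memberships`, are assumptions rather than the slide's own formula.

```python
import numpy as np

def pearson_distance(X, V):
    """(n_genes, k) matrix of 1 - Pearson correlation between rows of X and rows of V.

    Assumes no row is constant (a constant row has an undefined correlation).
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    Vc = V - V.mean(axis=1, keepdims=True)
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    Vn = Vc / np.linalg.norm(Vc, axis=1, keepdims=True)
    return 1.0 - Xn @ Vn.T

def memberships(X, V, eps=1e-12):
    """Memberships in [0, 1] that sum to one for every gene."""
    d2 = pearson_distance(X, V) ** 2 + eps   # eps avoids division by zero
    inv = 1.0 / d2
    return inv / inv.sum(axis=1, keepdims=True)
```

A gene that sits close to one centroid gets a membership near 1 for it; a gene equally far from all centroids gets roughly 1/k for each.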
Fuzzy K means
• The gene weight is (only in the second and third rounds) empirically defined as:
Where the similarity term is the Pearson correlation between Xi and Xn, and C is the correlation cutoff.
Fuzzy K means
• In each clustering cycle, the centroids were iteratively refined until the average change was <0.001.
• Around 85% of the centroids stabilized within approximately 15 iterations; some centroids required more, about 40-60 iterations, before stabilizing.
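The refinement loop described above can be sketched as follows. This is an illustration under stated assumptions: "average change" is taken to be the mean absolute change in centroid coordinates, memberships use the inverse squared (1 - Pearson correlation) rule, and the names `refine_centroids`, `tol`, and `max_iter` are hypothetical.

```python
import numpy as np

def _unit_rows(A):
    """Center each row and scale it to unit length (rows must not be constant)."""
    A = A - A.mean(axis=1, keepdims=True)
    return A / np.linalg.norm(A, axis=1, keepdims=True)

def refine_centroids(X, V, tol=1e-3, max_iter=100):
    """Alternate membership and centroid updates until centroids stop moving."""
    V = np.asarray(V, dtype=float).copy()
    for iteration in range(1, max_iter + 1):
        # Squared Pearson distance between every gene and every centroid.
        d2 = (1.0 - _unit_rows(X) @ _unit_rows(V).T) ** 2 + 1e-12
        M = (1.0 / d2) / (1.0 / d2).sum(axis=1, keepdims=True)
        coeff = M ** 2
        V_new = (coeff.T @ X) / coeff.sum(axis=0)[:, None]
        # Assumption: "average change" = mean absolute coordinate change.
        change = np.abs(V_new - V).mean()
        V = V_new
        if change < tol:
            break
    return V, M, iteration
```

The returned iteration count makes it easy to check how many passes each run needed before stabilizing.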
Fuzzy K means
• After each clustering cycle, each centroid was compared to all other centroids in the set, and centroid pairs with a correlation >0.9 were replaced by their average.
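A minimal sketch of this duplicate-removal step. The merge order is an assumption (the most correlated pair is merged first, and the check is repeated until no pair exceeds the cutoff); the function name `merge_duplicate_centroids` is hypothetical.

```python
import numpy as np

def merge_duplicate_centroids(V, cutoff=0.9):
    """Replace centroid pairs with Pearson correlation > cutoff by their average."""
    V = [np.asarray(v, dtype=float) for v in V]
    while len(V) > 1:
        corr = np.corrcoef(np.vstack(V))
        np.fill_diagonal(corr, -np.inf)          # ignore self-correlation
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] <= cutoff:
            break                                # no remaining duplicates
        merged = (V[i] + V[j]) / 2.0
        # Drop the pair and append their average.
        V = [v for idx, v in enumerate(V) if idx not in (i, j)] + [merged]
    return np.vstack(V)
```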
Visualization Tools
Cells respond to environment
Heat
Food supply
Responds to environmental conditions
Various external messages
Genome is fixed – Cells are dynamic
• A genome is static
– Every cell in our body has a copy of the same genome
• A cell is dynamic
– Responds to external conditions
– Saccharomyces cerevisiae cells follow a cell cycle of division by budding
• Cells differentiate during development
Gene regulation
• Gene regulation is responsible for the dynamic behavior of the cell
• Gene expression varies according to:
– Cell type
– External conditions
Transcription Factors Binding to DNA
• Transcription regulation:
• Certain transcription factors bind DNA
• Binding recognizes DNA substrings:
• Regulatory motifs
Regulation of Genes
Gene
Regulatory Element
RNA polymerase (Protein)
Transcription Factor (Protein)
DNA
Regulation of Genes
Gene
RNA polymerase
Transcription Factor (Protein)
Regulatory Element
DNA
Regulation of Genes
Gene
RNA polymerase
Transcription Factor
Regulatory Element
DNA
New protein
The Challenges of Gene Expression Data
• Many genes have expression data patterns that are similar to multiple, distinct gene groups.
Results of Clustering Gene Expression
• CLUSTER is simple and easy to use
• De facto standard for microarray analysis
• Limitations:
– Hierarchical clustering, like other clustering methods in general, is not robust
– Genes may belong to more than one cluster
• A gene can be coexpressed with different gene groups in response to different conditions.
Saccharomyces cerevisiae
• The yeast Saccharomyces cerevisiae possesses sophisticated mechanisms to choreograph the expression of its 6200 genes in order to thrive, or at least to survive, in a wide range of environmental conditions.
• The gene expression of 40 Yap1p targets: these genes were coordinately induced in response to a subset of the conditions shown here (labeled in red).
What is a microarray
What is a microarray (2)
• A 2D array of DNA sequences from thousands of genes
• Each spot has many copies of the same gene
• Allow mRNAs from a sample to hybridize
• Measure number of hybridizations per spot
Goal of Microarray Experiments
• Measure level of gene expression across many different conditions:
– Expression matrix M: {genes} × {conditions}:
Mij = |gene i| in condition j
• Deduce gene function
– Genes with similar function are expressed under similar conditions
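As a small illustration of the expression matrix, the sketch below stores genes as rows and conditions as columns and compares gene rows by Pearson correlation. All gene names, condition names, and values are invented for illustration only.

```python
import numpy as np

# Toy values, invented for illustration (rows = genes, columns = conditions).
genes = ["geneA", "geneB", "geneC"]
conditions = ["heat", "starvation", "oxidative"]
M = np.array([
    [2.0, 0.1, 1.5],    # geneA
    [1.8, 0.0, 1.4],    # geneB: pattern similar to geneA
    [-1.0, 2.2, -0.5],  # geneC: opposite pattern
])

# Pearson correlation between every pair of gene rows.
gene_corr = np.corrcoef(M)
```

Here `gene_corr[0, 1]` is close to 1 (similar expression across conditions) while `gene_corr[0, 2]` is negative, which is the kind of signal used to group genes with similar function.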
Fuzzy K-Means clustering
• Each gene can belong to many clusters
• Soft (fuzzy) assignment of genes to clusters
– Each gene has 1.0 membership units, allocated amongst clusters based on correlation with means
• Cluster means are calculated by taking the weighted average of all the genes in the cluster
Fuzzy K-Means clustering
Algorithm:
• Use PCA to initialize cluster means
• 3 iterations of fuzzy k-means clustering, find k/3 clusters per iteration
– In each iteration, start with brand new clusters and initializations
• And a few more heuristic tricks
Initialization
• Use PCA to find a few eigenvectors for initialization
• These features capture the directions of maximum variance
• Must be orthonormal
Example
Initialization
• k/3 centroids defined from the first k/3 eigenvectors
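A minimal sketch of PCA-based initialization, assuming the eigenvectors are taken in condition space (one value per condition) after centering each condition. The function name `pca_initial_centroids` and the centering choice are assumptions, not details given on the slides.

```python
import numpy as np

def pca_initial_centroids(X, n_centroids):
    """Return the first n_centroids principal directions of X as initial centroids.

    X: (n_genes, n_conditions). The returned rows are orthonormal.
    """
    Xc = X - X.mean(axis=0, keepdims=True)            # center each condition
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_centroids]
```

For a target of k clusters over three rounds, this would be called with `n_centroids = k // 3` in each round.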
Example
• First iteration of clustering
Iteration of the approach
• Remove genes that have a Pearson correlation with a particular cluster greater than 0.7
– Intuition: the strong signal from these genes has been accounted for
• Repeat
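The gene-removal step between rounds can be sketched as below, assuming "correlation with a cluster" means Pearson correlation with that cluster's centroid. The function name `remove_explained_genes` is hypothetical.

```python
import numpy as np

def remove_explained_genes(X, V, cutoff=0.7):
    """Drop genes whose best Pearson correlation with any centroid exceeds cutoff.

    X: (n_genes, n_conditions); V: (k, n_conditions).
    Returns the remaining rows of X and the boolean keep-mask.
    """
    k = V.shape[0]
    # Correlate centroids (first k rows) against genes (remaining rows).
    corr = np.corrcoef(np.vstack([V, X]))[:k, k:]     # (k, n_genes)
    keep = corr.max(axis=0) <= cutoff
    return X[keep], keep
```

The genes that remain are then clustered again in the next round with fresh centroids.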
Removing Duplicate Centroids
• Centroids with Pearson correlation > 0.9 will be averaged.
• Allows selecting a large initial number of clusters, since duplicates will be removed
Repeat 3 times
Output
1) Cluster means
2) Gene assignments to clusters
• Regulatory systems that govern the expression of overlapping sets of genes in yeast.
Fuzzy K means ADVANTAGES
• The method can present overlapping clusters, revealing distinct features of each gene’s function and regulation.
• The resulting implications can be used to assign refined hypothetical functions to uncharacterized gene products and to suggest additional cellular roles for well-studied proteins.
Fuzzy K means ADVANTAGES
• It presents more comprehensive groups of conditionally coregulated genes.
• It elucidates the environmental conditions that trigger changes in gene expression.
• It requires no a priori information about the dataset.
Fuzzy K means DISADVANTAGES
• Assignment of genes to a cluster requires a user-defined cutoff, and selecting a meaningful cutoff is a challenge.
• Fuzzy K means failed to identify a small number of groups that were identified by hierarchical clustering.
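To illustrate why the cutoff matters, the sketch below assigns genes to clusters from a membership matrix, assuming the cutoff is applied to membership values. The function name `assign_genes` and the toy memberships are illustrative only.

```python
import numpy as np

def assign_genes(M, cutoff):
    """For each cluster, list the indices of genes with membership >= cutoff."""
    return [np.flatnonzero(M[:, j] >= cutoff) for j in range(M.shape[1])]

# Toy memberships (rows = genes, columns = clusters), invented for illustration.
M = np.array([[0.6, 0.4],
              [0.1, 0.9],
              [0.5, 0.5]])

loose = assign_genes(M, 0.4)     # gene 0 and gene 2 land in both clusters
strict = assign_genes(M, 0.55)   # gene 2 is assigned to no cluster
```

A low cutoff produces large, overlapping clusters; a high cutoff leaves ambiguous genes unassigned.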
My opinion
• The unique advantages of fuzzy K means clustering make the technique a valuable tool for gene expression analysis. Its flexibility can be used to reveal more complex correlations between gene expression patterns, promoting refined hypotheses of the role and regulation of gene expression changes.
• To overcome these limitations, combining hierarchical clustering with fuzzy K means can be useful.
Thank you !