Introduction to Bioinformatics - Tutorial no. 12: Expression Data Analysis (Clustering, GEO, EPClust)

Post on 21-Dec-2015


Application of Microarrays

We only know the function of about 20% of the 30,000 genes in the human genome

Gene exploration: faster and better

Applications: evolution, behavior, cancer research

Microarray Analysis

Unsupervised Grouping: Clustering

Pattern discovery via grouping similarly expressed genes together

The three techniques most often used: k-means clustering, hierarchical clustering, and Kohonen self-organizing feature maps

Hierarchical Agglomerative Clustering (Michael Eisen, 1998)

Cluster (algorithm), TreeView (visualization)

Hierarchical Agglomerative Clustering

Step 1: Compute a similarity score between all pairs of genes (Pearson correlation or Euclidean distance)

Step 2: Find the two most similar genes and replace them with a node that contains their average. This builds a tree of genes

Step 3: Repeat until a single node remains
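The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Cluster program: the function names (pearson, euclidean, agglomerate), the toy gene profiles, and the use of a plain unweighted average for the merged node are all assumptions made for the example. It also assumes no profile is constant, since the Pearson correlation is undefined in that case.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles (similarity)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither profile is constant

def euclidean(x, y):
    """Euclidean distance between two expression profiles (dissimilarity)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerate(profiles):
    """Build a tree of genes as nested tuples, using Pearson similarity.

    profiles: dict mapping gene name -> list of expression values.
    """
    nodes = dict(profiles)  # node label (name or tuple) -> profile
    while len(nodes) > 1:
        labels = list(nodes)
        # Steps 1 and 2: score every pair and pick the most similar one.
        a, b = max(
            ((p, q) for i, p in enumerate(labels) for q in labels[i + 1:]),
            key=lambda pq: pearson(nodes[pq[0]], nodes[pq[1]]),
        )
        # Replace the pair with a node holding the average profile
        # (unweighted average: a simplification for this sketch).
        merged = [(u + v) / 2 for u, v in zip(nodes[a], nodes[b])]
        del nodes[a], nodes[b]
        nodes[(a, b)] = merged
        # Step 3: the loop repeats until one node (the root) remains.
    return next(iter(nodes))

# Hypothetical toy data: two rising profiles and two falling profiles.
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.5, 4.0, 2.0],
}
tree = agglomerate(genes)
print(tree)
```

To cluster on Euclidean distance instead, the same loop would pick the pair with the smallest euclidean value (min instead of max).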

[Figure: Agglomerative Hierarchical Clustering, five data points labeled 1 to 5 and the tree built over them]

Distance between joined clusters

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage: average distance between all pairs

or distance between cluster centers
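The linkage rules listed above differ only in how the pairwise distances between two clusters are summarized. A small sketch, assuming Euclidean distance between points and clusters given as lists of coordinate tuples (the function names and the example clusters are illustrative only):

```python
import math
from itertools import product

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    """Distance between the closest pair, one point from each cluster."""
    return min(euclidean(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """Distance between the farthest pair, one point from each cluster."""
    return max(euclidean(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(euclidean(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

def centroid_linkage(c1, c2):
    """Distance between the cluster centers (coordinate-wise means)."""
    center1 = [sum(col) / len(c1) for col in zip(*c1)]
    center2 = [sum(col) / len(c2) for col in zip(*c2)]
    return euclidean(center1, center2)

# Hypothetical example clusters in two dimensions.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, rule(cluster_a, cluster_b))
```

For any two clusters the single-linkage value is never larger than the average-linkage value, which in turn is never larger than the complete-linkage value.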

Dendrogram

The dendrogram induces a linear ordering of the data points
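One way to see this ordering is to read the leaves of the tree from left to right. The sketch below assumes the dendrogram is stored as nested two-element tuples with string leaf labels; that representation and the helper names leaf_order and flip are assumptions for illustration. Note that the ordering is not unique: swapping the two children of any internal node gives another ordering that is equally consistent with the same tree.

```python
def leaf_order(tree):
    """Return the leaves of a nested-tuple dendrogram from left to right."""
    if not isinstance(tree, tuple):
        return [tree]  # a single leaf (gene label)
    left, right = tree
    return leaf_order(left) + leaf_order(right)

def flip(tree):
    """Swap the two children at the root; the tree itself is unchanged."""
    left, right = tree
    return (right, left)

# Hypothetical dendrogram over five genes.
tree = ((("g1", "g2"), "g3"), ("g4", "g5"))
print(leaf_order(tree))
print(leaf_order(flip(tree)))
```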

Results of Clustering Gene Expression

CLUSTER is simple and easy to use

De facto standard for microarray analysis

Limitations:

Hierarchical clustering in general is not robust

Genes may belong to more than one cluster

K-Means Clustering Algorithm

Randomly initialize k cluster means

Iterate: assign each gene to the nearest cluster mean, then recompute the cluster means

Stop when the clustering converges

Notes: Really fast. Genes are partitioned into clusters. How do we select k?
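The algorithm above can be sketched as follows. This is a minimal version under stated assumptions: points are tuples of numbers, distance is Euclidean, the random initialization picks k of the data points as starting means (one common choice, not necessarily the one EPClust uses), and convergence means the assignments stop changing. The initial_means, max_iter, and seed parameters are hypothetical conveniences added for the example.

```python
import math
import random

def kmeans(points, k, initial_means=None, max_iter=100, seed=0):
    """Partition points into k clusters; returns (assignment, means)."""
    if initial_means is None:
        # Random initialization: use k distinct data points as the means.
        means = random.Random(seed).sample(points, k)
    else:
        means = list(initial_means)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest cluster mean.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, means[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recompute each cluster mean from its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old mean if a cluster is empty
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, means

# Hypothetical toy data: two well-separated groups of expression profiles.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
assignment, means = kmeans(points, k=2)
print(assignment, means)
```

Because the result depends on the starting means, it is common to run the algorithm several times with different seeds. The question of how to select k is not addressed by this sketch.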

K-Means Algorithm (illustrated step by step)

Randomly initialize clusters

Assign data points to nearest clusters

Recalculate clusters

Repeat … until convergence

EPClust Input (1)

Expression data matrix

Extra annotation for gene rows

Method of tabulation

Name for further analysis

EPClust Input (2)

Method of measuring distance between gene rows

Cluster hierarchically

Number k of means

Cluster into k means

GEO: Gene Expression Omnibus

NCBI database for gene expression data

Founded at end of 2000

Querying GEO

Browse records

Search for entries containing a gene

Search for experiments

Search with Entrez

SGD – Expression database

http://db.yeastgenome.org/cgi-bin/expression/expressionConnection.pl

Two labs are running experiments on the APO1 gene. Suggest a method that would allow them to compare their results.

Gene grouping

Relative values

Explain how microarrays can be used as a basis for diagnostics

        Sample 1  Sample 2  Sample 3  Sample 4  Sample 5
Gen1       +         -         -         +         +
Gen2       +         +         -         +         -
Gen3       -         +         +         +         -
Gen4       +         +         +         -         -
Gen5       -         -         +         -         +

Explain how microarrays can be used as a basis for diagnostics (samples reordered)

        Sample 1  Sample 2  Sample 4  Sample 3  Sample 5
Gen1       +         -         +         -         +
Gen2       +         +         +         -         -
Gen3       -         +         +         +         -
Gen4       +         +         -         +         -
Gen5       -         -         -         +         +