Gene Expression Data Analysis

Analysis of Gene Expression Data

_______________________

Jhoirene B. Clemente

Algorithms and Complexity Lab

University of the Philippines Diliman

Overview

● Definitions
● Clustering of Gene Expression Data
● Visualizations of Gene Expression Data

Definitions

Gene

Basic unit of heredity in a living organism. It is normally a stretch of DNA that codes for a type of protein or for an RNA chain that has a function in the organism.

Gene Expression Data

The expression levels of genes in an individual, measured through microarray technology.

Gene    Gene Expression
a
b
c
...
n

1 sample, n genes

Gene    Sample 1    Sample 2    ...    Sample m
a
b
c
...
n

n genes (rows), m samples (columns): an (n x m) data matrix

Clustering

Clustering is the unsupervised classification of patterns (including observations, data sets, and feature vectors) into groups called clusters, such that objects in the same cluster are similar to each other while objects in different clusters are as dissimilar as possible.

Image source: ima.umn.edu

Cluster Analysis

Preprocessing
● Filtering
● Normalization

Clustering

Analysis

Clustering

Partitional

● K-means Algorithm
● X-means Algorithm

Hierarchical

Clustering

Given the (n x m) data matrix, we can

● Cluster the set of genes
● Cluster the set of samples
● Cluster the set of genes and samples simultaneously
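A minimal sketch of how the same (n x m) matrix yields two different point sets, assuming the matrix is stored as a list of rows. The gene names, sample names, and values are made up for illustration. Clustering genes and samples simultaneously needs a dedicated method and is not sketched here.

```python
# Toy (n x m) data matrix: rows are genes, columns are samples.
# All names and values here are made up for illustration.
genes = ["a", "b", "c"]
samples = ["Sample 1", "Sample 2"]
data = [
    [0.2, -0.5],   # gene a
    [0.9, 0.1],    # gene b
    [-0.3, 0.4],   # gene c
]


def transpose(matrix):
    """Swap rows and columns: an (n x m) matrix becomes (m x n)."""
    return [list(column) for column in zip(*matrix)]


# Clustering genes: each point is one row (an m-dimensional gene profile).
gene_points = data

# Clustering samples: each point is one column (an n-dimensional sample profile).
sample_points = transpose(data)
```

Any clustering routine that accepts a list of points can then be run on `gene_points` or on `sample_points` without further changes.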

Data Set

The data set is time-series gene expression data from a synchronized population of yeast.

Preprocessing

Filtering
● Removed genes not involved in cell cycle regulation
● Removed genes belonging to more than one group

Normalization
● All gene expression values range from -1.0 to 1.0.
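The two preprocessing steps could be sketched as follows. The slides do not say how the filtering was implemented or which normalization formula was used, so both functions are assumptions: the first expects a hypothetical mapping from gene name to its set of group labels, and the second uses per-gene min-max scaling as one common way to bring values into the range -1.0 to 1.0.

```python
def keep_single_group_genes(group_membership):
    """Keep genes annotated with exactly one group.

    `group_membership` is a hypothetical dict: gene name -> set of group labels.
    Genes with no group (not involved in the cell cycle) and genes with more
    than one group are both dropped.
    """
    return [gene for gene, groups in group_membership.items() if len(groups) == 1]


def scale_to_unit_range(profile):
    """Min-max scale one gene's expression profile into [-1.0, 1.0].

    Assumption: scaling is done per gene; the original normalization
    procedure is not described in the slides.
    """
    lo, hi = min(profile), max(profile)
    if hi == lo:
        # A constant profile has no spread; map it to the middle of the range.
        return [0.0 for _ in profile]
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in profile]
```

For example, `scale_to_unit_range([2.0, 4.0, 6.0])` maps the smallest value to -1.0 and the largest to 1.0.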

Data Set

Data matrix (384 genes and 17 samples) with 5 classifications. Groupings are based on cell cycle phase activation.

● Group 1: Resting Phase
● Group 2: First Growth Phase
● Group 3: Synthesis Phase
● Group 4: Second Growth Phase
● Group 5: Cell Division

Clustering of genes

K-means Algorithm

Given n data points in $\mathbb{R}^d$:

1. Choose k initial centers, one for each of the k clusters
2. Assign each data point to the cluster with the nearest center (using Euclidean distance, Manhattan distance, etc.)
3. Recompute the k centers from the points assigned to them
4. Repeat steps 2 and 3 until convergence

Here k = 5, since we want to approximate the 5 biological classifications.
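The four k-means steps can be sketched in plain Python as below. This is an illustrative version under stated assumptions, not the implementation used for the results: the center update takes the mean of the assigned points, convergence means the assignments stop changing, and the function names and the `max_iter` cap are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Manhattan distance, usable in place of `euclidean`."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kmeans(points, centers, distance=euclidean, max_iter=100):
    """Run k-means from the given initial centers (step 1 is done by the caller)."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to the nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: distance(p, centers[j]))
            for p in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's center where it is
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, centers = kmeans(points, centers=points[:2])
```

For the yeast data, `points` would hold the 384 gene profiles (17 values each) and five initial centers would be passed in.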

Clustering of genes

Initialization

1. Choose the first k centers that will maximize the distance between the clusters
2. Sort the distances between all the data points and then choose the k initial points at constant intervals from the sorted list
3. Use the first k points in the data set as the first k centers
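The three initialization options could look roughly like this. The slides leave details open, so each function is one possible reading: `farthest_first` is a greedy heuristic for spreading the centers apart, and `evenly_spaced_by_distance` assumes the sort is by distance from the data mean. All function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_first(points, k):
    """Option 1 (one reading): greedily pick centers that are far apart.

    Starts from the first point, then repeatedly adds the point whose
    distance to its nearest chosen center is largest.
    """
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def evenly_spaced_by_distance(points, k):
    """Option 2 (assumed reading): sort points by distance from the data
    mean, then take k points at constant intervals from the sorted list."""
    mean = [sum(col) / len(points) for col in zip(*points)]
    ranked = sorted(points, key=lambda p: euclidean(p, mean))
    step = len(ranked) // k
    return [list(ranked[i * step]) for i in range(k)]


def first_k(points, k):
    """Option 3: use the first k points in the data set as the centers."""
    return [list(p) for p in points[:k]]


toy = [[0.0], [1.0], [2.0], [10.0]]
```

Each function returns k starting centers that a k-means routine can refine. Different starting centers can lead to different final clusters.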

Clustering of genes

Using k-means clustering, with k = 5

Clustering of genes

● Clustering may suggest possible roles for genes with unknown functions.

● Clustering the samples or experiments may shed light on new subtypes of diseases.

● Clustering may help identify which type of treatment is suited to a specific type of cancer.

● Clustering can support the building of genetic networks.

Visualization

● Vector Fusion
● Non-metric Multidimensional Scaling (nMDS)
● Principal Components Analysis (PCA)

Vector fusion

A visualization technique that uses the single point broken line parallel algorithm.

nMDS visualization

Input: dissimilarity matrix $\Delta = [\delta_{ij}]$ (actual distances)

● In nMDS, only the rank order of the entries is assumed to contain the significant information.
● Thus, the purpose of the non-metric MDS algorithm is to find a configuration of points whose distances reflect the rank order of the data as closely as possible.
● The transformation uses a non-parametric function f (monotone regression):

$\hat{d}_{ij} = f(\delta_{ij})$ (pseudo-distance)
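The monotone regression step can be illustrated with the pool-adjacent-violators procedure, a standard way to compute a least-squares non-decreasing fit. The slides do not say which procedure was used, so this is a sketch under that assumption, covering only how pseudo-distances are obtained from the current configuration's distances. The function names and the flat-list layout (one entry per pair i, j) are hypothetical.

```python
def monotone_regression(values):
    """Least-squares non-decreasing fit to `values` (pool adjacent violators)."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while a block's mean exceeds the mean of the next one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def pseudo_distances(dissimilarities, distances):
    """Pseudo-distances that follow the rank order of the dissimilarities.

    Both arguments are flat lists with one entry per pair (i, j), in the
    same pair order. The pairs are ranked by dissimilarity, and the
    configuration distances are then replaced by their monotone fit.
    """
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    fitted = monotone_regression([distances[i] for i in order])
    result = [0.0] * len(distances)
    for rank, i in enumerate(order):
        result[i] = fitted[rank]
    return result


# Four pairs: the second and third configuration distances are out of rank order.
d_hat = pseudo_distances([1.0, 2.0, 3.0, 4.0], [0.5, 1.5, 1.0, 2.0])
```

Here the two out-of-order distances (1.5 and 1.0) are pooled to their mean, so the pseudo-distances never decrease as the dissimilarity rank increases.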

PCA

Vector fusion visualization (figure)

nMDS visualization (figures)

References

Clemente J., Salido J.A. (2010). "Non-Metric Multidimensional Scaling and Vector Fusion Visualization of Cell Cycle Independent Gene Expressions for Gene Function Analysis". Published in the conference proceedings of the National Conference on Information Technology for Education (NCITE) 2010 and the Philippine IT Journal, Feb 2011 issue.

Clemente J. (2010). "Cluster Analysis for Identifying Genes Highly Correlated with a Phenotype". Undergraduate thesis, Department of Computer Science, University of the Philippines Diliman.

https://docs.google.com/leaf?id=0BxITsxc0CZ9PMmUyNWRiMTgtYzFhNS00YWQxLWEyZTMtOGI1NGQzNjllZmE5&hl=en&authkey=CPrHvIoE

https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0BxITsxc0CZ9PZWFkMDc1YzUtZmNjYS00MzNhLWI3MTItMzQ4NTdkNTI2NzFk&authkey=CP2Mx74N&hl=en

Thank you for Listening