Aspects of microarray gene expression analysis Project 786 - 102 Spring 2002 Antoaneta Vladimirova.


Parts of the talk:

1. Why microarray expression experiments?

2. Basic steps of the microarray experiment

3. Data collection and normalization

4. Analysis of expression data - Clustering algorithms

5. “Extraction of correlated gene clusters by multiple graph comparison” - an algorithm that allows integration of the expression data with other existing biological knowledge

Studies in biology until now:

whole---> parts

Assumption: knowledge derived from the parts will enable us to understand the whole

organism-->organs-->cells-->molecules

Consequences of this approach:
- incomplete knowledge
- isolated studies: individual genes or gene products
- databases: inconsistent annotations, lack of integration

Reductionism in Biology

Biological information flow:

Genome: collection of all genes (DNA) of an organism; DNA replication copies the genetic information

Transcription (gene expression): RNA - a messenger molecule from genetic information to functional unit

Protein: gene product; carries the function of the gene

In general: DNA ----> RNA ----> protein

Why study gene expression?

* every cell in an organism has the same set of genes. So, what makes the liver cells and the brain cells different?

* cells in different tissues or in different stages of development express different sets of genes and consequently have different characteristics

* gene expression process is the intermediary process between the maintenance of the genetic information in the form of DNA in the chromosomes and the production of protein which carries most of the functions in a cell

* our interest is in understanding the functions, properties and inter-relations among proteins; however, studying gene expression is technologically more accessible and cheaper, and is assumed to give a good approximation of the quantity of the corresponding protein product

Why design and analyze microarray experiments?

* allow simultaneous assessment of multiple genes

* generate expression levels of thousands of genes in parallel

* the expression level of a gene is taken as an approximation of its protein level and, in turn, of the function the gene product carries

* expression level changes due to environmental conditions, developmental stage, diseased state

* genes that share expression patterns are assumed to be co-regulated and to be functionally related

* gene expression data might eventually allow us to reverse-engineer (reconstruct) gene regulation networks and biological processes

* advances in technology allow us to move on to a synthetic approach:

* take all pieces together, integrate vs. disassemble the biological system.

* towards reconstruction of the whole cell/organism

* need to study gene/gene product not in isolation, but relevant to all other genes/products and the environment - networks of components and interactions between them

Synthetic approach

Microarray System

Adapted from “Ratio-based decisions and the quantitative analysis of cDNA microarray images” - Chen, Dougherty and Bittner (1997), J Biomed Opt 2(4)

a. Oligonucleotide array synthesized in situ with photochemical technology by Affymetrix.
b. Oligonucleotide array synthesized in situ with ink-jet technology (Rosetta Inpharmatics).
c. DNA microarray printed on a glass slide (Corning, Inc).

Adapted from “Biomedical Discovery Review with DNA Arrays” - Richard A. Young

Microarray images

Color-coded expression

* Each dot on the microarray is read through two independent channels (green and red)

* Green color - query expression is lower than the control expression: Query signal of gene x / Control signal of gene x < 1

* Red color - query expression is higher than the control expression: Query signal of gene x / Control signal of gene x > 1

* Yellow color - query expression and control expression are equal: Query signal of gene x / Control signal of gene x = 1

* Black color - neither control nor query bound to the slide

Why is signal normalization necessary?

Assumptions:

* the quantity of initial RNA from both samples is equal
* some genes are up-regulated, others are down-regulated, but overall these changes should balance out so that the total quantity from each sample that hybridizes to the array is equal; therefore the total intensity read through the red and green channels should be the same

The relative fluorescence intensities need to be normalized because:

* we need to adjust for differences in labeling and detection efficiencies of the different fluorescent labels
* we need to adjust for differences in the quantity of initial RNA isolated from the query and control samples
* we need to compensate for experimental variability

* a normalization factor is computed and applied for each gene
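
The balance assumption above suggests one simple way to obtain a normalization factor. The sketch below is only an illustration under that assumption: it computes a single global factor that equalizes the total intensity of the two channels and applies it to every gene. The function names and the intensity values are hypothetical, and real pipelines may use other schemes.

```python
def total_intensity_factor(query, control):
    """Factor that rescales the control channel so that both channels
    have the same total intensity (the balance assumption)."""
    total_query = sum(query)
    total_control = sum(control)
    if total_control == 0:
        raise ValueError("control channel has zero total intensity")
    return total_query / total_control


def normalized_ratios(query, control):
    """Q/C for each gene after rescaling the control channel.
    Assumes every control intensity is non-zero."""
    factor = total_intensity_factor(query, control)
    return [q / (c * factor) for q, c in zip(query, control)]


# Hypothetical raw intensities for four genes (illustrative values only):
# the query channel is uniformly twice as bright as the control channel.
query = [200.0, 400.0, 100.0, 300.0]
control = [100.0, 200.0, 50.0, 150.0]

factor = total_intensity_factor(query, control)
ratios = normalized_ratios(query, control)
print(factor, ratios)
```

In this toy case the whole difference between the channels is a uniform brightness offset, so after normalization no gene appears differentially expressed.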

How to interpret the raw microarray data?

[Figure: n experimental conditions (condition 1, e.g. nutrient withdrawal; condition i, e.g. gene disruption; condition n, e.g. drug treatment), each measured on its own microarray (Microarray 1 ... Microarray i ... Microarray n) covering gene 1 to gene m.]

Fluorescence intensities are translated to a ratio Q/C

Data is organized into an Expression matrix

[Expression matrix: rows are gene 1, gene 2, gene 3, ..., gene m; columns are exp. condition 1, 2, 3, ..., (n - 1), n. The row for gene 1 reads 0.55, 0.40, 2.34, ..., 0.12, 0.77; the remaining entries are only partly legible (.37, .59, .19).]

Gene 1 is differentially expressed in exp. condition 3


Expression signal is represented as a ratio

* Red color -> Q/C > 1. If Q/C is in the range of 1.5 - 2.0, the particular gene in the query cell is considered up-regulated. In theory, Q/C can go to infinity.

* Green color -> Q/C < 1. The particular gene in the query cell is considered down-regulated. In theory, Q/C will range from zero to one.

[Figure: ratio scale running from 0 (green) through 1 to +infinity (red).]

The raw ratio scale is asymmetric: down-regulated genes are squeezed between 0 and 1, while up-regulated genes range from 1 to infinity. To correct for that, log2(Q/C) is used.
If Q/C = 2 ==> log2(Q/C) = 1
If Q/C = 1 ==> log2(Q/C) = 0
If Q/C = 1/2 ==> log2(Q/C) = -1
Over-expressed and inhibited gene values are then distributed symmetrically around 0
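
The log transformation can be checked with a few lines of Python. This is a minimal sketch; the function name `log2_ratio` is chosen here for illustration, and both signals are assumed to be strictly positive.

```python
import math


def log2_ratio(query, control):
    """log2(Q/C): positive for up-regulation, negative for down-regulation.
    Assumes both signals are strictly positive."""
    return math.log2(query / control)


# A two-fold increase and a two-fold decrease sit at equal distance from 0.
for q, c in [(2.0, 1.0), (1.0, 1.0), (1.0, 2.0)]:
    print(f"Q/C = {q / c:.2f} -> log2(Q/C) = {log2_ratio(q, c):+.1f}")
```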

Expression vectors and expression space


           Exp. 1   Exp. 2
Gene 1     0.55     0.33
Gene 2     0.45     0.55
Gene 3     0.23     0.76
Gene 4     0.24     0.34
Gene 5     0.11     0.77
Gene 6     0.67     0.45
Gene 7     0.9      0.33
Gene 8     0.4      0.12
Gene 9     0.77     0.37
Gene 10    0.02     0.33


We want to group (cluster) the expression vectors based on their “similarity”

Assumption: Genes in the same group are functionally related

Expression matrix

2D expression space

Experiment vectors and experiment space

We want to group (cluster) the expression vectors based on their “similarity”

Assumption: Genes in the same group are functionally related

Expression matrix

5D experiment space

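[Figure: the two experiment vectors drawn from the origin (0, 0, 0, 0, 0) in 5D experiment space: Exp. 1 = (.55, .45, .23, .24, .11) and Exp. 2 = (.44, .33, .37, .29, .88); regions labeled Cluster 1 and Cluster 2.]

          Gene 1   Gene 2   Gene 3   Gene 4   Gene 5
Exp. 1    0.55     0.45     0.23     0.24     0.11
Exp. 2    0.44     0.33     0.37     0.29     0.88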

How to define similarity between expression vectors?

Distance measure:the distance between two objects (e.g. expression vectors)

Metric:
1. $d_{ij}$ must be positive or zero ($d_{ij} \ge 0$)
2. Must be symmetric ($d_{ij} = d_{ji}$)
3. An object is zero distance from itself ($d_{ii} = 0$)
4. When considering three objects i, j and k, the distance from i to k is always less than or equal to the sum of the distance from i to j and the distance from j to k: $d_{ik} \le d_{ij} + d_{jk}$ (the triangle rule)

Example: the Euclidean distance between two 3D points X(x1, x2, x3) and Y(y1, y2, y3) is
$d_{12} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$
For an n-dimensional space:
$d_{12} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Semi-metric: obeys the first three rules but not the triangle rule

[Figure: triangle with vertices i, j and k and sides $d_{ij}$, $d_{jk}$ and $d_{ik}$.]
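
As a small illustration of the Euclidean distance and of the four metric rules, the sketch below computes distances between three expression vectors taken from the 2D expression matrix (Gene 1, Gene 2 and Gene 3). The function name `euclidean` is chosen here for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points of the same dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Expression vectors (Exp. 1, Exp. 2) of three genes.
gene1 = (0.55, 0.33)
gene2 = (0.45, 0.55)
gene3 = (0.23, 0.76)

d12 = euclidean(gene1, gene2)
d23 = euclidean(gene2, gene3)
d13 = euclidean(gene1, gene3)

# Spot-check the metric rules on these three points.
print(d12 >= 0, d12 == euclidean(gene2, gene1), euclidean(gene1, gene1) == 0)
print(d13 <= d12 + d23)  # triangle rule
```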

Clustering analysis of gene expression

Idea: cluster together genes with similar expression patterns

Underlying assumptions:
* Genes that share expression patterns are co-regulated and participate in functionally related processes
* Unknown genes that are clustered together with known genes might have similar or related functions

Categories of clustering methods:

I. Unsupervised
II. Supervised

A. Agglomerative
B. Divisive

Agglomerative: Start with individual gene clusters and gradually accommodate more genes in a cluster; clusters are eventually joined in one huge cluster; usually represented by a tree structure resembling the phylogenetic trees

Divisive: Start with all points in one cluster and gradually form new clusters and distribute the data points among them

Unsupervised: No prior knowledge is assumed when forming the clusters

Supervised: Existing biological knowledge is used to guide the clustering process

Two major clustering algorithm categories

* Hierarchical clustering

* k-means clustering

* Principal Component Analysis

* Supervised clustering (classifiers)

Clustering methods to be discussed:

Hierarchical Clustering

* One of the most frequently used techniques
* simple and can be easily visualized as a tree similar to the phylogenetic trees
* an agglomerative approach: single expression profiles are joined to form groups, and the process is repeated until all expression profiles have been joined in one cluster
* first, the pair-wise distances are calculated for all the genes to be clustered; initially each gene is a cluster itself

          Gene 1   Gene 2   Gene 3   Gene 4   Gene 5
Gene 1    0
Gene 2    0.45     0
Gene 3    0.23     0.76     0
Gene 4    0.24     0.34     0.44     0
Gene 5    0.11     0.77     0.36     0.77     0

* The distance matrix is searched for the clusters with the minimum distance between them
* The two selected clusters are joined to form a new cluster, which now contains 2 objects

[Figure: clusters Cl1 to Cl5; Cl1 and Cl5 are the closest pair (distance 0.11) and are joined into Cluster A.]

* The process is repeated and Cl2 and Cl4 are joined into Cluster B
* The distances are recalculated and clusters are formed until only one cluster is left that accommodates all the objects

Hierarchical Clustering

* The distances are recalculated from this new cluster to the rest of the clusters; the distance matrix now has one fewer dimension (or cluster)

[Figure: Cl2, Cl3, Cl4 and Cluster A.]

             Gene 2   Gene 3   Gene 4   Cluster A
Gene 2       0
Gene 3       0.76     0
Gene 4       0.34     0.44     0
Cluster A    0.56     0.56     0.45     0

             Gene 3   Cluster A   Cluster B
Gene 3       0
Cluster A    0.56     0
Cluster B    0.44     0.67        0

[Figure: Cluster B, Cl3 and Cluster A.]

Building the hierarchical tree:

[Figure: partial tree in which Cl 1 and Cl 5 join to form Cl A, and Cl 2 and Cl 4 join to form Cl B.]

Building the hierarchical tree:

             Cluster A   Cluster C
Cluster A    0
Cluster C    0.56        0

             Cluster D
Cluster D    0

[Figure: Cl3 joins Cluster B (Cl2, Cl4) to form Cluster C; Cluster A and Cluster C then join to form Cluster D, which contains all objects.]

[Figure: the complete tree. Cl 1 and Cl 5 form Cl A; Cl 2 and Cl 4 form Cl B; Cl B and Cl 3 form Cl C; Cl A and Cl C form Cl D.]

Hierarchical Clustering - Tree Representation Example (partial tree)

Hierarchical clustering of gene expression matrices. The image shows an average linkage (UPGMA) clustering of 505 yeast genes during three different cell cycle studies with a total of 60 different time points analyzed. The color image on the left shows the numerical values encoded by color according to the method introduced by Mike Eisen. Red is used to represent the positive values and green the negative values. Blue shows the missing values in the respective experiments. The clustering and the image are produced using WWW-based tools in Expression Profiler (http://www.ebi.ac.uk/microarray/). The interface is interactive and further information about the genes in each subtree is available by clicking on the respective nodes in the tree.

Adapted from “Gene expression data analysis” - A. Brazma and J. Vilo (2000), FEBS Letters

Hierarchical Clustering Algorithms

* Single-linkage clustering: the distance between two clusters i and j is calculated as the minimum distance between a member of cluster i and a member of cluster j

* Complete-linkage clustering: the distance between two clusters i and j is calculated as the maximum distance between a member of cluster i and a member of cluster j

* Average-linkage clustering: the distance between two clusters i and j is calculated as the average of the distances between members of cluster i and members of cluster j

[Figure: pairs of clusters i and j illustrating the single-, complete- and average-linkage distances.]
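
The three linkage rules and the agglomerative loop can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation behind the slides: the function names are hypothetical, and the input is the five-gene distance matrix shown earlier. Note that the recalculated distances in the worked example are illustrative figures, so the merge order produced by a specific linkage rule may differ from the tree drawn there.

```python
from itertools import combinations


def cluster_distance(a, b, dist, linkage):
    """Distance between clusters a and b (tuples of item names)."""
    pairwise = [dist[frozenset((x, y))] for x in a for y in b]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(items, dist, linkage="average"):
    """Repeatedly merge the two closest clusters until one is left.
    Returns the merge history as (cluster_a, cluster_b, distance)."""
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda p: cluster_distance(p[0], p[1], dist, linkage))
        d = cluster_distance(a, b, dist, linkage)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


genes = ["g1", "g2", "g3", "g4", "g5"]
# Pair-wise distances copied from the five-gene distance matrix.
pairs = {
    ("g1", "g2"): 0.45, ("g1", "g3"): 0.23, ("g1", "g4"): 0.24,
    ("g1", "g5"): 0.11, ("g2", "g3"): 0.76, ("g2", "g4"): 0.34,
    ("g2", "g5"): 0.77, ("g3", "g4"): 0.44, ("g3", "g5"): 0.36,
    ("g4", "g5"): 0.77,
}
dist = {frozenset(k): v for k, v in pairs.items()}

merges = agglomerate(genes, dist, "complete")
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

Changing the `linkage` argument to `"single"` or `"average"` changes which clusters look closest after the first merge, which is why the three methods can produce different trees from the same data.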

K-means Clustering

* Partitions data in groups with similar expression

* the number of clusters must be known in advance, or k must be chosen arbitrarily; objects are partitioned into a fixed number of clusters, such that clusters are internally similar but externally dissimilar

* because the initial partition is random, the same k might produce slightly different clustering results each time

* the process is conceptually simple, but can be computationally intensive

* first, the objects are randomly partitioned into k user-specified clusters

* an average expression vector is computed which represents each cluster and is used to compute the distances between each point and each average cluster vector

* if a given object is closer to a different cluster than to the one it is assigned to, it is re-assigned to the closest cluster and the average expression vector for the clusters is recalculated.

* the process is repeated until no re-assignments are necessary.
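
The steps above can be sketched as follows. This is a minimal illustration, not a production implementation: the function names, the `seed` parameter and the toy points are assumptions made here, and the sketch re-assigns all objects in one pass before recomputing the average vectors (a common batch variant of the procedure described).

```python
import math
import random


def mean_vector(points):
    """Average expression vector of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(points, k, seed=0):
    """Return (labels, centroids) once no re-assignments are necessary."""
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    # Random initial partition; round-robin keeps every cluster non-empty.
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k
    centroids = [None] * k
    while True:
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = mean_vector(members)
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            return labels, centroids
        labels = new_labels


# Hypothetical 2D expression vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2, seed=1)
print(labels)
```

Which group receives label 0 depends on the random initial partition, so only the grouping itself, not the label values, is meaningful.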

K-means Clustering

[Figure: five objects, labeled 1 to 5, in 2D expression space.]

* Let’s have m objects in n-dimensional space (e.g. five genes in 2D expression space)
* Let k = 2
* Let’s partition arbitrarily into 2 clusters
* Then calculate the average expression vector for each cluster
* Calculate distances from each object to the average vector

[Figure: the five objects after the initial partition.]

* Re-assign object 3 to Cluster A
* Recalculate average expression vectors for the clusters
* Re-calculate the distances from all objects to all average expression vectors
* No further re-assignments are necessary
* These are the final 2 clusters

[Figure: the five objects in their final 2 clusters.]

Principal Component Analysis (PCA)

* Principal Component Analysis (closely related to, and often computed by, singular value decomposition) is a mathematical technique that picks up patterns in data while reducing dimensionality

* reduction of dimensionality might be necessary when some of the data contain redundant information, e.g. if a group of experiments are more closely related than initially expected

* “projects” complex data onto a reduced, easily visualized space

* analogy: a 3D cloud of data points that is rotated so that one can see it from different perspectives; some views might allow a better separation of the data into groups than other views; PCA finds the best views to separate the data

* in most implementations of PCA it is difficult to define the precise boundaries of distinct clusters in the data, or to define genes (or experiments) belonging to each cluster

* however, when combined with another clustering technique such as k-means, it becomes a very powerful technique
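
To make the idea of "finding the best view" concrete, the sketch below extracts only the first principal component of a small data set. It uses power iteration on the covariance matrix, which is a technique swapped in here to keep the example dependency-free; it is not the method used for the figures, and library implementations typically rely on singular value decomposition. The function name and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix (dims x dims).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Fixed start vector; assumed not orthogonal to the leading direction.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            raise ValueError("data has no variance along the start vector")
        vec = [v / norm for v in nxt]
    # Project each centered point onto the component (1D coordinates).
    scores = [sum(c[d] * vec[d] for d in range(dims)) for c in centered]
    return vec, scores


# Hypothetical data lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(rows)
print(direction, scores)
```

Because the toy points lie on a line, a single component captures all of the variance, so the 2D data can be replaced by the 1D `scores` without losing information.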

Analysis of a demonstration data set

* the performance of the various algorithms is compared
* the analysis can help to provide an understanding of how the data are handled and interpreted by the different methods

A synthetic gene-expression data set. This data set provides an opportunity to evaluate how various clustering algorithms reveal different features of the data. A. Nine distinct gene-expression patterns were created with log2(ratio) expression measures defined for ten experiments. B. For each expression pattern, 50 additional genes were generated, representing variations on the basic patterns.

Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol. 2

Hierarchical Clustering Algorithms

Genes in the demonstration data set were subjected to a. average-linkage, b. complete-linkage and c. single-linkage hierarchical clustering using a Euclidean distance metric and gene-expression families (A–J) that were color coded for comparison. Genes that are up-regulated appear in red, and those that are down-regulated appear in green, with the relative log2(ratio) reflected by the intensity of the color. This method of clustering groups genes by reordering the expression matrix, which allows patterns to be easily visualized.

Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol. 2

Hierarchical Clustering and PCA

Principal component analysis. The same demonstration data set was analyzed using a. hierarchical (average-linkage) clustering and b. principal component analysis using Euclidean distance, to show how each treats the data, with genes color coded on the basis of hierarchical clustering results for comparison.

Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol. 2

Data Filtering by Mean Centering

* Why filtering? To enhance certain features of the patterns we are looking for
* Mean centering removes “constant” expression by subtracting each gene's average across all experiments from each of its data points
* genes with similar changes relative to their baseline expression pattern are grouped
* A, B and C have “constant” expression - grouped together (originally B were up-regulated and C were down-regulated)
* D and G are grouped together - expression changes in the same fashion (up and down)
* E and F are grouped together - expression changes in the same fashion (down and up)

Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol. 2
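
A minimal sketch of mean centering for a single gene follows. The three profiles are hypothetical log2(ratio) values made up for illustration: one constantly up-regulated gene, one constantly down-regulated gene and one gene whose expression changes across experiments.

```python
def mean_center(profile):
    """Subtract the gene's average across all experiments from each value."""
    mean = sum(profile) / len(profile)
    return [value - mean for value in profile]


# Hypothetical log2(ratio) profiles across four experiments.
constant_up = [3.0, 3.0, 3.0, 3.0]
constant_down = [-2.0, -2.0, -2.0, -2.0]
changing = [0.0, 2.0, 0.0, -2.0]

for name, profile in [("constant_up", constant_up),
                      ("constant_down", constant_down),
                      ("changing", changing)]:
    print(name, mean_center(profile))
```

After centering, the two constant profiles become identical (all zeros), which is why families with constant expression end up grouped together regardless of their original level.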

The effect of Data Filtering by Mean Centering

The effect of data filtering.

Application of various data filters or changes in the distance metric can change the results derived from any clustering algorithm. a. Mean centering of the data removes ‘constant’ expression, which reveals changes in expression patterns for the nine gene families across the ten experiments. The changes can be seen in the results of b. principal component analysis and c. average-linkage hierarchical clustering.

Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol. 2

Supervised clustering (classifiers)

* Supervised methods can be applied if one has some previous knowledge of which genes are expected to cluster together

* Support Vector Machine (SVM) - a widely used technique

* SVM uses a training set of genes known to be related e.g. functionally; the training set is provided as positive members and genes known not to be related are used as negative examples

* this training of the SVM allows it to distinguish between members and non-members of the group based on the expression data; SVM uses existing biological relationships to determine expression features that are characteristic for a group

Supervised clustering (classifiers)

* the SVM is then used to recognize and classify the genes in the data set into the established groups on the basis of their expression

* the SVM can also identify genes in the training set that are outliers or that have been previously assigned to the incorrect class

* an application of potentially great impact is classification of samples from patients affected by some disease; if there is information on expression patterns that is already correlated with survival data or disease-stage or disease-type, that can be applied to train the SVM to classify samples for cancer diagnostics, for example. In many cases samples look the same histologically, but their “expression fingerprint” is different. A certain “expression fingerprint” might be correlated with different rates of progression of the disease or to its response to treatment with various drugs

Clustering/Classification of expression data

Problems:
* how to normalize expression values?
* what distance metric to use?
* results are very much dependent on the approach taken for the analysis

Algorithm limitations:
* take into account only expression profiles
* do not incorporate all the biological information out there
* clustering unrelated data will still produce clusters!
* there is no “best” or “correct” clustering technique - the results have to be evaluated in the context of the existing biological knowledge

“Extraction of Correlated Gene Clusters by Multiple Graph Comparison”

Akihiro Nakaya, Susumu Goto, Minoru Kanehisa

Bioinformatics Center, Kyoto University, Japan

Genome Informatics 12: 44-53 (2001)

Graphs

[Figure: example graphs whose vertices (nodes), labeled A to F, are connected by edges.]

G = (V, E)

G - graph
V - vertices
E - edges

Comparisons of Graphs (common sub-graph)

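[Figure: a sub-graph (A, B, C) common to two graphs; panels compare a genome with a linear genome (A B C) and a genome with a pathway (nodes A to G).]

Comparisons of Graphs

[Figure: the same kind of comparison for other graph pairs; panels compare a cluster with a linear genome (A B C) and a pathway with a pathway (nodes A to G).]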

The KEGG Databases

Database     Data object (graph)   Node                         Edge                              Content
GENES        Genome                Gene                         Adjacency                         Gene catalogs of completely sequenced genomes and some partial genomes
SSDB         Protein Universe      Protein                      Sequence similarity               Ortholog/paralog relations of all protein-coding genes in complete genomes
PATHWAY      Network               Gene product or subnetwork   Generalized protein interaction   Generalized protein interaction network (pathways and complexes) involving various cellular processes
LIGAND       Chemical Universe     Compound                     Reaction                          Chemical compounds and chemical reactions that are relevant to cellular processes
EXPRESSION   Transcriptome         Gene                         Expression similarity             Microarray gene expression profiles
BRITE        Proteome              Protein                      Direct interaction                Protein-protein interactions and relations

Pathway Database

* network of gene products (nodes) with three types of interactions or relations (edges)

- enzyme-enzyme relations (catalyzing successive reaction steps in the metabolic pathway)
- protein-protein interactions (e.g. binding, phosphorylation)
- gene expression relations (transcription factors and target gene products)

* 5761 entries (as of Sept 2001)
- 201 reference pathway diagrams
- 83 ortholog group tables
- 960 enzyme-enzyme relations

One of the really fundamental problems in biology:

* there is a fraction of genes with known functions, however, the majority of genes have not been assigned a function even if the particular genome has been already sequenced.

* how to find gene functions, or genes/gene products with related functions, from all the information obtained from sequencing, expression profiling or protein-protein interaction assays?

* Techniques:

- Clustering of expression microarray data
- Classification of expression microarray data
- Multiple graph comparison

Goal:
* extract a set of correlated genes with respect to multiple biological features

Method:
Relationships among genes on a specific feature are encoded as a graph structure where nodes correspond to genes (or gene products). This might suggest a functional link between genes.

Graphs compared: Genome (gene cluster), Pathway (enzyme cluster), Expression (co-expressed genes)

Correlated Gene Clusters

* if all or most of the genes preserve their mutual relationships in multiple graphs, the biological relevance among these genes is considered to be supported with high probability

* can be used to characterize, classify or predict activities of genes

* finding clusters in different graphs is actually finding common sub-graphs among them

* this belongs to the category of NP-complete problems (non-deterministic polynomial time complete), a class of extremely hard problems with enormous computational complexity

* real problems are solved by heuristic algorithms

Algorithm

Heuristics:

* given the correspondences of nodes (vertices) in two graphs, we want to identify whether the two graphs contain locally related regions

* when two graphs are viewed as being linked by correspondences (additional edges), then the problem becomes finding clusters of those correspondences

[Figure: graphs G1 and G2 linked by correspondences; a clustering algorithm groups the correspondences into clusters C1 and C2.]

Algorithm

* if the set contains n correspondences (virtual edges), the problem is to cluster these n data points according to a certain distance measure
* each data point represents a correspondence between a node in G1 and a node in G2
* the distance between two data points i and j may be defined by two distances

d1(i, j) - for the shortest path between nodes v1i and v1j in graph G1

d2(i, j) - for the shortest path between nodes v2i and v2j in graph G2

G1 = (V1, E1) G2 = (V2, E2)

[Figure: nodes v1i and v1j in G1 linked to nodes v2i and v2j in G2; each link is a correspondence (binary relationship, virtual edge).]

Algorithm

* first, each correspondence is considered as an individual cluster
* initially there are n clusters

[Figure: graphs G1 = (V1, E1) and G2 = (V2, E2) with the correspondences v1i-v2i and v1j-v2j, shown before clustering and as separate clusters C1 and C2.]

* then single linkage clustering is performed, using the following criterion to decide whether to merge two clusters Ci and Cj:

$$d(i, j) = \begin{cases} 1 & \text{if } \min\{d_1(r, s) \mid r \in C_i,\ s \in C_j\} \le 1 + Gap_1 \text{ and } \min\{d_2(r', s') \mid r' \in C_i,\ s' \in C_j\} \le 1 + Gap_2 \\ 0 & \text{otherwise} \end{cases}$$

where Gap1 and Gap2 are non-negative gap parameters

Algorithm

* if d(i, j) = 1, the clusters Ci and Cj are merged

[Figure: graphs G1 = (V1, E1) and G2 = (V2, E2); the correspondences v1i-v2i and v1j-v2j, previously in clusters C1 and C2, are merged.]
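
The two-graph procedure can be sketched as below. This is an illustrative reading of the merging criterion, not the authors' code: graphs are assumed to be undirected and unweighted (so shortest paths are hop counts found by breadth-first search), and the function names, the toy genome and the toy pathway are all hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_paths(graph, source):
    """Breadth-first search: hop counts from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def cluster_correspondences(g1, g2, pairs, gap1=0, gap2=0):
    """Single-linkage clustering of correspondences (node in G1, node in G2)."""
    d1 = {u: shortest_paths(g1, u) for u, _ in pairs}
    d2 = {v: shortest_paths(g2, v) for _, v in pairs}
    inf = float("inf")

    def mergeable(ci, cj):
        # Both minimum distances must be within 1 + gap for d(i, j) = 1.
        near1 = min(d1[r[0]].get(s[0], inf) for r in ci for s in cj)
        near2 = min(d2[r[1]].get(s[1], inf) for r in ci for s in cj)
        return near1 <= 1 + gap1 and near2 <= 1 + gap2

    clusters = [[p] for p in pairs]  # each correspondence starts alone
    merged = True
    while merged:
        merged = False
        for ci, cj in combinations(clusters, 2):
            if mergeable(ci, cj):
                clusters = [c for c in clusters if c is not ci and c is not cj]
                clusters.append(ci + cj)
                merged = True
                break
    return clusters


# Hypothetical linear genome a-b-c-d-e and pathway A-B-C-D plus an isolated Z.
genome = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
          "d": ["c", "e"], "e": ["d"]}
pathway = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "Z": []}
pairs = [("a", "A"), ("b", "B"), ("c", "C"), ("e", "Z")]

clusters = cluster_correspondences(genome, pathway, pairs)
print(clusters)
```

In this toy input the genes a, b and c are adjacent on the genome and their products are adjacent in the pathway, so their correspondences merge into one cluster, while (e, Z) stays alone.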

Algorithm

* extend the problem to finding a correlation of sub-graphs in more than two graphs (additional graphs provide information about gene-gene relations that cannot be found in the two graphs)

* correlated gene clusters are connected by links (hyperedges) that link genes from the corresponding clusters

* the distance between hyperedges reflects the shortest path length between the nodes in the graphs

* correlated gene clusters: we can find sets of tightly coupled nodes in the graphs by gathering hyperedges based on their distance

Algorithm

[Figure: three graphs G1 (Genome), G2 (Pathway) and G3 (Similarity); clusters C1 and C2 have node sets c11, c12, c13 and c21, c22, c23 in the respective graphs.]

Input datasets:
* n graphs G = {G1, …, Gn}
* m hyperedges H = {h1, …, hm}
* a hyperedge over the n graphs is denoted by an n-tuple $h_i = (x_{1,i_1}, \ldots, x_{n,i_n})$
* the kth element $h_i^k = x_{k,i_k}$ is the node of $G_k$ that constitutes the hyperedge ($1 \le k \le n$) (assume a hyperedge has exactly n nodes)

Algorithm

[Figure: graphs G1, G2 and G3 with hyperedge clusters C1 and C2 and their node sets c11, c12, c13 and c21, c22, c23.]

set of hyperedges C1: $C_1 = \{h_{s_1}, \ldots, h_{s_p}\}$

set of hyperedges C2: $C_2 = \{h_{t_1}, \ldots, h_{t_q}\}$

set of kth elements of hyperedges in C1: $C_1^k = \{h_{s_1}^k, \ldots, h_{s_p}^k\}$

set of kth elements of hyperedges in C2: $C_2^k = \{h_{t_1}^k, \ldots, h_{t_q}^k\}$

* d(x, y) is the length of the shortest path between nodes x and y in graph Gs (can be calculated by Dijkstra’s algorithm)

* distance $dis(C_1^s, C_2^s) = \max\{d(x, y) \mid x \in C_1^s,\ y \in C_2^s\}$ for complete linkage clustering

* distance between two sets of hyperedges C1 and C2:

$D(C_1, C_2) = \sum_{s=1}^{n} dis(C_1^s, C_2^s)$

[Figure: graphs G1, G2 and G3 with clusters C1 and C2; x and y mark a pair of nodes whose shortest path is measured; hyperedges h1, h2, h3 belong to C1 and h4, h5, h6 to C2.]

In our case: H = {h1, h2, h3, h4, h5, h6}, C1 = {h1, h2, h3}, C2 = {h4, h5, h6}

$D(C_1, C_2) = \sum_{s=1}^{3} dis(C_1^s, C_2^s) = dis(C_1^1, C_2^1) + dis(C_1^2, C_2^2) + dis(C_1^3, C_2^3) = 8 + 8 + 8 = 24$
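
The distance D between two sets of hyperedges can be sketched as follows. The graphs of the worked example are not recoverable from the slide, so this sketch uses three small hypothetical path graphs with unit edge weights instead; the helper names (`dijkstra`, `path_graph`, `cluster_distance`) are chosen here for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Shortest path lengths from source; graph maps node -> {neighbor: weight}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


def cluster_distance(graphs, c1, c2):
    """D(C1, C2): sum over the graphs of the complete-linkage distance
    between the kth elements of the hyperedges in C1 and in C2."""
    total = 0
    for k, graph in enumerate(graphs):
        worst = 0
        for h in c1:
            reach = dijkstra(graph, h[k])
            for g in c2:
                worst = max(worst, reach.get(g[k], float("inf")))
        total += worst
    return total


def path_graph(prefix, n):
    """Hypothetical helper: path prefix0 - prefix1 - ... with unit weights."""
    graph = {f"{prefix}{i}": {} for i in range(n)}
    for i in range(n - 1):
        graph[f"{prefix}{i}"][f"{prefix}{i + 1}"] = 1
        graph[f"{prefix}{i + 1}"][f"{prefix}{i}"] = 1
    return graph


graphs = [path_graph("a", 4), path_graph("b", 4), path_graph("c", 4)]
h1, h2, h3 = ("a0", "b0", "c0"), ("a1", "b1", "c1"), ("a3", "b3", "c3")
c1, c2 = [h1, h2], [h3]

print(cluster_distance(graphs, c1, c2))
```

In each toy graph the farthest pair between the two clusters is three edges apart, so each graph contributes 3 and the three contributions add up, mirroring the structure of the 8 + 8 + 8 calculation.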

Clustering of hyperedges

* using the distance D we cluster the hyperedges into an initial set of clusters, each of which consists of a single hyperedge C only

* we iterate the procedure to pick two clusters between which the distance is the smallest

* merge them into a new cluster (hierarchical clustering using distance D)

* in order to merge, D must be under a given threshold pi for graph Gi

* if the path length between two nodes x and y is larger than pi, set d(x, y) to infinity to eliminate that path; clusters at infinite distance are not merged

* when no pair of clusters is left at a finite distance, the clustering is done

Visualizing the clusters

* if the clusters were visualized in 2D, then the distance limit pi will correspond to a radius pi within which nodes of one graph can be clustered (to avoid merging distant genes in the same graph)

* only nodes within circles that intersect can potentially form a bigger cluster

[Figure: clusters C1, C2 and C3 with circles of radius pi drawn around nodes x, y and z.]

Set distance to infinity between C1 and C3 and C1 and C2

Initial clusters: C1, C2, C3

Merge C1 and C2

There are no more clusters to join since all the distances between clusters are now infinity

The final clusters: C1, C2

Homologous gene clusters in the genomes of E. coli and H. influenzae

Applications of the algorithm

* recent high-throughput technologies provide vast amounts of biological data, which contain unknown, hypothetical or erroneous relationships among genes

* standard approaches cluster data according to only one biological parameter (e.g. microarray data are clustered by expression patterns only); they may uncover links between known and unknown genes

* advantage of the correlated gene clusters: incorporate in the analysis multiple biological criteria (graphs); if relationships among genes/gene products cannot be explained or do not make sense in a single dataset, multiple datasets will increase the likelihood of deducing the potentially biologically significant relationships. The algorithm, alternatively, can emphasize a relationship that might have been uncovered by clustering techniques

* next step - find relations among genes in the correlated gene clusters

Summary:

* Microarray system basics

* Data collection, normalization, similarity measures

* Expression matrix and expression vectors

* Analysis of expression data - Clustering algorithms

- Hierarchical clustering
- k-means clustering
- Principal Component Analysis
- Supervised clustering

* “Extraction of correlated gene clusters by multiple graph comparison”
