Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute

Dec 13, 2015


Page 1

Data Mining the Yeast Genome Expression and Sequence Data

Alvis Brazma

European Bioinformatics Institute

Page 2

Why the yeast is interesting to the industry

Easy to work with

(First) fully sequenced eukaryotic model organism

30% of genes have analogs in human

Most known human disease genes have homologues in the yeast

For the food industry, interesting in itself

Page 3

Genetic networks

(Figure: DNA with promoter1 gene1, promoter2 gene2, promoter3 gene3, promoter4 gene4; transcription to RNA; translation to proteins; transcription factors)

Page 4

Mining the Yeast Expression Data

The long term goals:

» reconstructing the gene regulation networks and relating them to metabolic pathways

Short term goals:

» correlating gene expression profiles with gene functional classes and using this for prediction of gene functions

» correlating gene expression profiles with promoter regions

Page 5

Yeast microarray

Page 6

Yeast gene expression during diauxic shift (DeRisi et al.)

Yeast cells from an exponentially growing yeast culture were inoculated into fresh medium and, after some initial period, were harvested at seven 2-hour intervals. Their mRNA was isolated, and fluorescently labeled cDNA was prepared. Two different fluorescent labels were used: one for the cells harvested at each of the successive time-points, the other for the cells harvested at the first time-point (reference measurement). The cDNA from each time-point, together with the reference cDNA, was hybridized to the microarray with approximately 6400 DNA sequences representing ORFs of the yeast genome. Measurements of the relative fluorescence intensity for each element reflect the relative abundance of the corresponding mRNA.
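
As a minimal sketch of how such relative measurements can be turned into numbers for analysis, the snippet below computes the logarithm of the sample-to-reference intensity ratio for one array element. The function name, the example intensities and the choice of base 2 are assumptions made for illustration; the slides do not state them.

```python
import math

def log_ratio(sample_intensity, reference_intensity, base=2.0):
    """Log of the sample/reference fluorescence ratio for one array element."""
    if sample_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log(sample_intensity / reference_intensity, base)

# Hypothetical intensities for one ORF at the seven time-points (sample
# channel) against the first-time-point reference channel.
sample = [1000.0, 1100.0, 950.0, 1000.0, 2000.0, 4000.0, 500.0]
reference = [1000.0] * 7
profile = [log_ratio(s, r) for s, r in zip(sample, reference)]
```

With a logarithmic scale, a doubling and a halving of abundance are symmetric around zero, which is convenient for the discretization described later in the talk.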

Page 7

Visualizing the data (expression profile of the “first” 250 genes)

Page 8

Average expression level of genes at the respective time-points

Page 9

Three approaches

Finding correlations between gene expression profiles and their functional classes

Building decision trees for predicting gene functional classes from their expression data

In silico discovery of putative transcription factor binding sites in the regions upstream of genes with similar expression profiles (to appear in Genome Research, Dec. 1998)

Page 10

Gene distribution across the functional classes

Page 11

Energy gene subclasses in the yeast (less frequent ones merged into one)

Page 12

Gene expression for energy genes during the diauxic shift at the seven time-points

Page 13

Expression profiles of respiration genes

Page 14

Expression profiles of fermentation genes

Page 15

Average expression levels at the 7 time-points and for energy class genes during diauxic shift

Page 16

Average expression levels at all time-points and for all energy classes
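
The plots on these slides are not reproduced in the transcript. As a rough sketch of how per-class averages of this kind could be computed, the snippet below groups expression profiles by functional class and averages them time-point by time-point. The gene names, class labels and values are hypothetical.

```python
from collections import defaultdict

def average_profiles(profiles, gene_class):
    """Mean expression value per functional class at each time-point."""
    grouped = defaultdict(list)
    for gene, profile in profiles.items():
        grouped[gene_class[gene]].append(profile)
    # zip(*rows) walks the profiles column by column, i.e. per time-point.
    return {
        cls: [sum(column) / len(column) for column in zip(*rows)]
        for cls, rows in grouped.items()
    }

# Hypothetical genes with three time-points each.
profiles = {
    "geneA": [0.0, 0.0, 1.0],
    "geneB": [0.0, 2.0, 3.0],
    "geneC": [0.0, -1.0, -2.0],
}
gene_class = {
    "geneA": "respiration",
    "geneB": "respiration",
    "geneC": "fermentation",
}
averages = average_profiles(profiles, gene_class)
```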

Page 17

Energy classes distribution

Page 18

Energy classes distribution

Page 19

Decision tree for respiration genes

Page 20

Decision tree for fermentation

Page 21

Tricarboxylic acid, respiration and reserves decision tree
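
The decision trees themselves are not reproduced in the transcript, and the talk credits dedicated data mining software for them. Purely to illustrate the idea of predicting a functional class from expression values, the sketch below finds a one-level tree (a decision stump): the single time-point and threshold that best separate genes of one class from the rest. The function name and the data are hypothetical, and this is not the procedure used in the talk.

```python
def best_stump(profiles, labels):
    """Return (errors, time_point, threshold, above_is_positive) for the
    one-level decision tree with the fewest misclassified genes."""
    best = None
    for t in range(len(profiles[0])):
        for threshold in sorted({p[t] for p in profiles}):
            for above_is_positive in (True, False):
                # A gene is predicted positive when its value at time-point t
                # falls on the "positive" side of the threshold.
                errors = sum(
                    ((p[t] >= threshold) == above_is_positive) != label
                    for p, label in zip(profiles, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, t, threshold, above_is_positive)
    return best

# Hypothetical two-time-point profiles; True marks genes of the class of interest.
profiles = [[0.1, 1.5], [0.0, 2.0], [0.2, -0.5], [-0.1, 0.0]]
labels = [True, True, False, False]
stump = best_stump(profiles, labels)
```

A full decision tree would apply the same kind of split recursively to the genes on each side of the threshold.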

Page 22

Clustering the gene expression profiles by discretization of gene expression measurement space

(Figure: logarithm of expression ratio, on a scale from -2 to 2, plotted against time points 1 to 7)

Corresponding discrete pattern: 000012-1

Put the genes mapping to the same discrete pattern in a cluster
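
The discretization step can be sketched as follows. The slide does not give the thresholds, so this example assumes equal-width bins of width `step` around zero (values are truncated toward zero) and clamps the result to the range -2 to 2 shown in the figure. The function names and the example profile are hypothetical.

```python
def discretize(profile, step=1.0, max_level=2):
    """Map each log-ratio to an integer level in [-max_level, max_level]."""
    levels = []
    for value in profile:
        level = int(value / step)  # int() truncates toward zero
        level = max(-max_level, min(max_level, level))
        levels.append(level)
    return levels

def pattern_string(levels):
    """Write the levels in the compact notation used on the slide."""
    return "".join(str(level) for level in levels)

# Hypothetical log-ratio profile over the seven time-points.
profile = [0.1, 0.3, -0.2, 0.0, 1.2, 2.5, -1.4]
pattern = pattern_string(discretize(profile))
```

For this assumed profile the result is the pattern shown on the slide, 000012-1. Changing `step` changes which genes share a pattern, so the threshold choice directly shapes the clusters.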

Page 23

Organization of a typical yeast promoter

(Figure labels: URS, URS, TATA, I, Coding Region, RNA; distances 40 - 120 bp, 20 - 700 bp, 40 - 60 bp)

Page 24

In silico discovery of transcription factor binding sites from expression data

Take data from gene expression level measurements (from DNA array technologies) ->

Cluster together genes with similar expression profiles ->

Take sequences upstream from the genes in each cluster ->

Look for sequence patterns overrepresented in a cluster

Page 25

Clustering genes by similar expression profiles

Put in each cluster all genes that map to the same discrete pattern

Different thresholds give different clustering systems

We obtained 32 different clusters containing from 10 to 77 genes and 11 clusters containing at least 25 genes
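
A minimal sketch of this grouping step, assuming each gene has already been mapped to a discrete pattern (represented here as a tuple of integer levels). The gene names and the `min_size` cutoff are hypothetical.

```python
from collections import defaultdict

def cluster_by_pattern(gene_patterns, min_size=1):
    """Group genes sharing the same discrete pattern; drop small clusters."""
    clusters = defaultdict(list)
    for gene, pattern in gene_patterns.items():
        clusters[pattern].append(gene)
    return {
        pattern: sorted(genes)
        for pattern, genes in clusters.items()
        if len(genes) >= min_size
    }

# Hypothetical genes and their discrete patterns.
gene_patterns = {
    "geneA": (0, 0, 0, 0, 1, 2, -1),
    "geneB": (0, 0, 0, 0, 1, 2, -1),
    "geneC": (0, 0, 0, 0, -1, -2, -2),
    "geneD": (0, 0, 0, 0, 1, 2, -1),
}
clusters = cluster_by_pattern(gene_patterns, min_size=2)
```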

Page 26

Hypothesis to test

Genes with similar expression profiles may be regulated by similar expression mechanisms and thus may contain similar transcription factor binding sites

Page 27

Discovering regulatory elements in gene upstream sequences

Take the sequences of a certain length (e.g., 300 bp) upstream of all genes with a certain expression profile

Look for a priori unknown sequence patterns that are over-represented in these regions (taking into account the other upstream regions as background)
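
As a simplified illustration of over-representation, the sketch below counts, for every substring of a fixed length, how many sequences in a set contain it, once for the cluster and once for the background. Restricting the search to exact substrings of one length is an assumption made to keep the example short, and the sequences are invented.

```python
from collections import Counter

def sequences_containing(sequences, k):
    """For every substring of length k, count the sequences containing it."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each sequence is counted once per substring.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

# Hypothetical upstream regions (real ones would be e.g. 300 bp long).
cluster = ["ACCCCTA", "GGCCCCT", "TTTTTTT"]
background = ["ACGTACG", "TTTTTTT", "GACCCCT", "AAAAAAA"]

cluster_counts = sequences_containing(cluster, 5)
background_counts = sequences_containing(background, 5)
```

Here CCCCT occurs in two of three cluster sequences but only one of four background sequences, which is the kind of imbalance the search is looking for.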

Page 28

Pattern discovery in biosequences

Group together sequences thought to have common biological (structural, functional) properties, ignoring the purely sequence (syntactic) properties

Study the purely syntactic properties of these sequences ignoring their biological (semantic) properties.

Page 29

Problem of “noise”

Gene expression measurement accuracy is about a factor of 2 (in 95% of cases)

Clusters are very dependent on the clustering method or thresholds

The same expression profile does not necessarily mean the same regulation mechanism

Page 30

Dealing with noise

One cannot look for patterns common to all strings in the set, only for patterns overrepresented in the set

Looking for sets of patterns covering the set

Use of “negative” or background sequences

Page 31

More powerful algorithms than the currently existing ones are needed

We used such a new, more powerful algorithm, based on a suffix-tree representation of the sequence space (implemented by Jaak Vilo at Helsinki University)

We looked systematically for all patterns discriminating the upstream regions in the clusters from randomly selected upstream regions
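
The algorithm used in the talk is not described in the transcript. As a much simpler stand-in that shows why a suffix-based index helps, the sketch below builds a depth-limited, uncompressed suffix trie in which every node records which sequences pass through it. The number of sequences containing any substring up to `max_depth` can then be read off by walking the trie. All names and data are hypothetical, and this is not a full suffix tree.

```python
def build_suffix_trie(sequences, max_depth):
    """Depth-limited suffix trie; each node records which sequences reach it."""
    root = {"children": {}, "seqs": set()}
    for idx, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:start + max_depth]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "seqs": set()}
                )
                node["seqs"].add(idx)
    return root

def support(trie, pattern):
    """Number of sequences containing `pattern` (length <= max_depth)."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return 0
    return len(node["seqs"])

# Hypothetical upstream regions.
trie = build_suffix_trie(["ACCCCTA", "GGCCCCT", "TTTTTTT"], max_depth=5)
```

Enumerating the nodes of such an index visits every substring that actually occurs, together with its support, without testing candidate patterns one at a time.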

Page 32

Use of negative sequences

Looking for patterns that are overrepresented in the sequences upstream from genes in a cluster in comparison to all other upstream sequences

Page 33

The rating function

Given two sets S+ and S- and a pattern P, return rating R(S+, S-, P)

Two rating functions that we used:

» ratio: number of sequences in S+ matching P divided by number of sequences in S- matching P

» probability that the pattern can occur in S+ “by chance” assuming that the occurrences in S- are “by chance” and using binomial distribution
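
The two rating functions can be sketched as below. The slide does not give the exact formula for the second one, so this version makes an assumption: the per-sequence match probability is estimated from S-, and the rating is the binomial upper-tail probability of seeing at least the observed number of matches in S+. The handling of a zero count in S- is also an assumption.

```python
from math import comb

def ratio_rating(pos_matches, neg_matches):
    """Number of S+ sequences matching P divided by the number in S-."""
    if neg_matches == 0:
        # Assumed convention: unbounded rating when S- has no match at all.
        return float("inf") if pos_matches > 0 else 0.0
    return pos_matches / neg_matches

def binomial_rating(pos_matches, pos_total, neg_matches, neg_total):
    """P(X >= pos_matches) for X ~ Binomial(pos_total, p), where
    p = neg_matches / neg_total is estimated from the background set."""
    p = neg_matches / neg_total
    return sum(
        comb(pos_total, k) * p ** k * (1 - p) ** (pos_total - k)
        for k in range(pos_matches, pos_total + 1)
    )

# Hypothetical counts: P matches 3 of 4 sequences in S+ and 10 of 100 in S-.
ratio = ratio_rating(3, 10)
chance = binomial_rating(3, 4, 10, 100)
```

With these counts the background match probability is 0.1 and the tail probability comes to 0.0037. A smaller value means the pattern is less likely to be over-represented in S+ by chance.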

Page 34

The sequence pattern discovery experiment

We ran the algorithm on upstream sequences (length 2 * 300) of all the 32 gene clusters

Each cluster produced hundreds of overrepresented patterns

The problem of validation

Page 35

Some discovered sequence patterns from clusters of upstream sequences

Clusters with an increase in the expression level after time-point 6:

CCCCT - known to be a stress responsive motif

Clusters with a decrease in the expression level after time-point 6:

ATCC..T..A - RAP1 protein

ATC..TAC - RAP1, REB1, BAF1

ATTTCA…T - GA-BF protein
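
Patterns of this form can be matched against sequences with ordinary regular expressions, assuming that "." stands for any single nucleotide (the slides do not define the notation). The sequences below are invented for illustration.

```python
import re

def count_matching(pattern, sequences):
    """Number of sequences containing at least one match of the pattern."""
    regex = re.compile(pattern)
    return sum(1 for seq in sequences if regex.search(seq))

# Hypothetical upstream fragments.
sequences = [
    "GGATCCAATGGAGG",  # contains ATCC..T..A
    "TTATCGGTACTT",    # contains ATC..TAC
    "ACGTACGTACGT",    # contains neither pattern
]

rap1_like = count_matching("ATCC..T..A", sequences)
second = count_matching("ATC..TAC", sequences)
```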

Page 36

Statistical validation of the discovered patterns

For each cluster, choose a random set containing the same number of upstream regions

Run the pattern discovery algorithm on the random regions set in addition to the cluster

Compare the scores of the discovered patterns from the cluster and random set
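
The validation procedure can be sketched as follows. The discovery step is replaced here by a toy stand-in (the largest number of sequences sharing one 5-letter substring), since the actual algorithm and scores are not given in the transcript. Function names, the number of trials and the sequences are hypothetical.

```python
import random

def top_support(sequences, k=5):
    """Toy discovery step: largest number of sequences sharing one k-mer."""
    support = {}
    for seq in sequences:
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            support[kmer] = support.get(kmer, 0) + 1
    return max(support.values())

def control_scores(all_upstream, set_size, discover, trials=20, seed=0):
    """Scores from running `discover` on random sets of the same size."""
    rng = random.Random(seed)
    return [discover(rng.sample(all_upstream, set_size)) for _ in range(trials)]

# Hypothetical data: the three cluster regions share CCCCT, the others do not.
cluster = ["ACCCCTAG", "GGCCCCTA", "TCCCCTGG"]
others = ["ACGTACGT", "TTGACATG", "GATTACAG", "CATGCATG", "AGAGTCTC"]

cluster_score = top_support(cluster)
controls = control_scores(cluster + others, len(cluster), top_support)
# Fraction of random sets that score at least as well as the real cluster.
fraction_as_good = sum(c >= cluster_score for c in controls) / len(controls)
```

If random sets rarely reach the cluster's score, the patterns found in the cluster are unlikely to be an artifact of the search itself.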

Page 37

Conclusions

The discovered patterns are in accordance with the existing knowledge

Transcription factor binding sites can be discovered in silico from gene expression data

More refined and validated gene expression measurements are needed

Page 38

Acknowledgements

Inge Jonassen (Bergen)

Jaak Vilo, Esko Ukkonen (Helsinki)

Alistair Ewing, Neil Skilling (Quadstone Ltd - developers of Decisionhouse data mining software)

BIOVIS and BIOSTANDARDS projects from the EU at EBI