Jaak Vilo
DNA expression data analysis 1
Extracting information from microarray data
Jaak Vilo
European Bioinformatics Institute EMBL-EBI
http://www.ebi.ac.uk/microarray/
http://ep.ebi.ac.uk
Lausanne, 1.03.2001
Microarray Experiment
[Figure: microarray experiment. Labels: High glucose, Low glucose, RT-PCR (one per sample), LASER, DNA “Chip”]
Gene expression data
Treated sample labeled red (Cy5)
Control data labeled green (Cy3)
Cluster of co-expressed genes, pattern discovery in regulatory regions
[Figure labels: Expression profiles; Upstream regions; Retrieve; basepairs; Pattern over-represented in cluster]
Problem of “noise”
✦ Gene expression measurement accuracy (accurate to within about a factor of 2 in 95% of cases)
✦ Clustering result depends on the choice of method and the parameters used in each
✦ Does co-expression mean co-regulation?
What questions we asked
✦ How to perform systematic discovery?
✦ How to assess the quality of the predictions?
✦ Do “better” clusters give “better” signals?
✦ Is the method scalable to larger genomes?
✦ Want to discover something unique for each cluster, not just features common to upstreams
Cluster and pattern “strengths”
✦ Cluster strength: average silhouette value
How well each object is classified into its own cluster. Use the average distances within the cluster and compare them to the next closest cluster for each object. Value between -1 and +1 (+1 = well classified).
✦ Pattern strength: binomial distribution
Given the probability of “tails” on a coin, how probable is it to observe k or more “tails” out of n trials? Number of tails = number of pattern occurrences.
Silhouette value (Rousseeuw 1987)
* Assign “goodness” to each clustered object
* Average silhouette over each cluster or over the clustering
* Not a silver bullet
a_i = average distance from object i to the members of the same cluster
b_i = average distance from object i to the members of the closest cluster

$$ S_i = \frac{b_i - a_i}{\max(b_i, a_i)} $$
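As an illustration of the silhouette formula, here is a minimal sketch in plain Python. It assumes Euclidean distance and that each object is a tuple of expression values; the function names are hypothetical, and the slides do not say which distance measure or implementation was actually used. Objects in single-member clusters are given the value 0 by convention.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_values(points, labels):
    """Return the silhouette value S_i for every clustered object."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            values.append(0.0)  # convention: singleton or only one cluster
            continue
        # a_i: average distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b_i: smallest average distance to the members of another cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def average_silhouette(points, labels):
    """Average silhouette over the whole clustering."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)
```

A clustering that separates two tight, distant groups gets an average close to +1, while a labelling that mixes the groups gets a negative average.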
Computational experiment: clustering
✦ Yeast Saccharomyces cerevisiae, 6221 genes, 80 expression conditions for each (from P. Brown’s lab)
✦ No single best clustering method: K-means, vary K ∈ 2..1000, repeat 10x for each K. Total: ~1000 x K-means
✦ Calculated average “Silhouette” values for all clusters
✦ Select all unique clusters of size 20..100 (~52,000)
✦ Could combine several methods, several distance measures
✦ Upstream sequences of length 600 bp from ORF start
✦ Analyze all upstreams of all 52K clusters with SPEXS looking for substrings only (one weekend, ~10 PCs)
✦ Extract all patterns from upstreams of all clusters with probability less than 1% (binomial distribution, background probability is calculated simultaneously from all 6221 upstreams)
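The step "select all unique clusters of size 20..100" can be sketched as follows. This is only an illustration, assuming each K-means run is available as a list of clusters and each cluster as a collection of gene identifiers; the function name and data layout are hypothetical and not taken from the original pipeline.

```python
def select_unique_clusters(clusterings, min_size=20, max_size=100):
    """Collect the distinct clusters within a size range from many clusterings.

    `clusterings` is an iterable of clusterings; each clustering is a list of
    clusters, and each cluster is an iterable of gene identifiers.
    """
    unique = set()
    for clustering in clusterings:
        for cluster in clustering:
            members = frozenset(cluster)  # order-independent, hashable
            if min_size <= len(members) <= max_size:
                unique.add(members)
    return unique
```

Using frozensets means that the same gene set found by two different runs (for different K, or different random starts) is kept only once.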
Pattern selection criteria: binomial distribution
Background (ALL upstream sequences): π occurs in 5 out of 25, p = 0.2
Cluster: π occurs 3 times
P(π,6) is the probability of having 3 or more matches in 6 sequences
P(π,6) = 0.0989
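The worked example above can be reproduced with a short sketch. It assumes that an "occurrence" means a sequence containing the pattern at least once and that patterns are plain substrings; the function names are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def pattern_probability(pattern, cluster_seqs, background_seqs):
    """Probability of seeing the pattern at least this often in the cluster.

    The background probability p is the fraction of all upstream sequences
    that contain the pattern (assumption: one count per sequence).
    """
    p = sum(pattern in seq for seq in background_seqs) / len(background_seqs)
    k = sum(pattern in seq for seq in cluster_seqs)
    return binomial_tail(k, len(cluster_seqs), p)


# Slide example: p = 5/25 = 0.2, 3 or more matches in 6 sequences
print(round(binomial_tail(3, 6, 0.2), 4))  # 0.0989
```

A pattern would then be reported for a cluster when this probability falls below the chosen threshold (1% in the experiment described earlier).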
Pattern vs cluster “strength”
[Figure: the pattern probability vs. the average silhouette for the cluster]
[Figure: the same for randomised clusters]
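The slides do not say how the randomised clusters were built. One common control, shown here purely as an assumption, is to replace every cluster with a random gene set of the same size drawn from the whole genome, and then rerun the same pattern discovery on these sets. The function name is hypothetical.

```python
import random


def randomised_clusters(clusters, all_genes, seed=None):
    """Return random gene sets with the same sizes as the given clusters.

    `all_genes` must be a sequence (for example a list of ORF names).
    """
    rng = random.Random(seed)  # local generator, reproducible with a seed
    return [rng.sample(all_genes, len(cluster)) for cluster in clusters]
```

If real clusters yield much less probable patterns than such size-matched random sets, the patterns are less likely to be artefacts of cluster size alone.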
The most improbable patterns from the best clusters
Pattern Probability Cluster Occurrences Total nr of K
Rpn4p acts as a transcription factor by binding to PACE, a nonamer box found upstream of 26S proteasomal and other genes in yeast.
Mannhaupt G, Schnall R, Karpov V, Vetter I, Feldmann H
Adolf-Butenandt-Institut der Ludwig-Maximilians-Universitat Munchen, Germany.

We identified a new, unique upstream activating sequence (5’-GGTGGCAAA-3’) in the promoters of 26 out of the 32 proteasomal yeast genes characterized to date, which we propose to call proteasome-associated control element. By using the one-hybrid method, we show that the factor binding to the proteasome-associated control element is Rpn4p, a protein containing a C2H2-type finger motif and two acidic domains. ...
YOR261C  YOR261C  RPN8   protein degradation             26S proteasome regulatory subunit     S0005787  1
YDL020C  YDL020C  RPN4   protein degradation, ubiquitin  26S proteasome subunit                S0002178  1
YDL007W  YDL007W  RPT2   protein degradation             26S proteasome subunit                S0002165  1
YDL147W  YDL147W  RPN5   protein degradation             26S proteasome subunit                S0002306  1
YOL038W  YOL038W  PRE6   protein degradation             20S proteasome subunit (alpha4)       S0005398  1
YKL145W  YKL145W  RPT1   protein degradation, ubiquitin  26S proteasome subunit                S0001628  1
YDL097C  YDL097C  RPN6   protein degradation             26S proteasome regulatory subunit     S0002255  1
YDR394W  YDR394W  RPT3   protein degradation             26S proteasome subunit                S0002802  1
YBR173C  YBR173C  UMP1   protein degradation, ubiquitin  20S proteasome maturation factor      S0000377  1
YER012W  YER012W  PRE1   protein degradation             20S proteasome subunit C11(beta4)     S0000814  1
YPR108W  YPR108W  RPN7   protein degradation             26S proteasome regulatory subunit     S0006312  1
YOR117W  YOR117W  RPT5   protein degradation             26S proteasome regulatory subunit     S0005643  1
YJL001W  YJL001W  PRE3   protein degradation             20S proteasome subunit (beta1)        S0003538  1
YPR103W  YPR103W  PRE2   protein degradation             20S proteasome subunit (beta5)        S0006307  1
YOR157C  YOR157C  PUP1   protein degradation             20S proteasome subunit (beta2)        S0005683  1
YGL048C  YGL048C  RPT6   protein degradation             26S proteasome regulatory subunit     S0003016  1
YHR200W  YHR200W  RPN10  protein degradation             26S proteasome subunit                S0001243  1
YML092C  YML092C  PRE8   protein degradation             20S proteasome subunit Y7 (alpha2     S0004557  1
YIL075C  YIL075C  RPN2   tRNA processing                 26S proteasome subunit)               S0001337  1
YMR314W  YMR314W  PRE5   protein degradation             20S proteasome subunit(alpha6)        S0004931  1
YGR253C  YGR253C  PUP2   protein degradation             20S proteasome subunit(alpha5)        S0003485  1
YGR135W  YGR135W  PRE9   protein degradation             20S proteasome subunit Y13 (alpha3)   S0003367  1
YFR004W  YFR004W  RPN11  transcription                   putative global regulator             S0001900  1
YOR259C  YOR259C  RPT4   protein degradation             26S proteasome regulatory subunit     S0005785  1
YFR052W  YFR052W  RPN12  protein degradation             26S proteasome regulatory subunit     S0001948  1
YFR050C  YFR050C  PRE4   protein degradation             proteasome subunit, B type            S0001946  1
YGL011C  YGL011C  SCL1   protein degradation             20S proteasome subunit YC7ALPHA/Y8    S0002979  1
YDR427W  YDR427W  RPN9   protein degradation             26S proteasome regulatory subunit     S0002835  1
YOR362C  YOR362C  PRE10  protein degradation             20S proteasome subunit C1 (alpha7)    S0005889  1
YBL041W  YBL041W  PRE7   protein degradation             20S proteasome subunit                S0000137  1
YER021W  YER021W  RPN3   protein degradation             26S proteasome regulatory subunit     S0000823  1
YER094C  YER094C  PUP3   protein degradation             20S proteasome subunit (beta3         S0000896  1
YGR270W  YGR270W  YTA7   protein degradation             26S proteasome subunit; ATPase        S0003502  1
YHR027C  YHR027C  RPN1   protein degradation             26S proteasome regulatory subunit     S0001069  1
YER047C  YER047C  SAP1   mating type switching           AAA family protein                    S0000849  1
YGR232W  YGR232W  unknown                                unknown                               S0003464  1
✦ Analyze multiple sets of sequences simultaneously
✦ Restrict search to most frequent patterns only (in each set)
✦ Report most frequent patterns, patterns over- or underrepresented in selected subsets, or patterns significant by various statistical criteria, e.g. by binomial distribution
ALIGNMENT: based on pattern AGT
-----AGTGACA
---ACAGTGACA
----CAGTGACA
----CAGTGAC-
---ACAGTGAC-
---GCAGTGA--
---GCAGTG---
--AGCAGTGA--
-TTACAGTG---
TTTACAGTGA--
-TTACAGTGA--
--TACAGTG---
---ACAGTGA--
--TACAGTGA--
TTTACAGT----
TTTACAGTG---
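An alignment of this kind can be produced by padding each discovered pattern so that a shared core lines up in the same columns. The sketch below is an assumption about how such a display could be built, not the tool's actual procedure; the function name is hypothetical, and it aligns on the first occurrence of the core in each pattern.

```python
def align_on_core(patterns, core):
    """Pad patterns with '-' so that the first occurrence of `core` lines up."""
    offsets = [p.find(core) for p in patterns]
    if any(offset < 0 for offset in offsets):
        raise ValueError("every pattern must contain the core")
    left = max(offsets)
    # Left-pad so the core starts in the same column for every row
    padded = ["-" * (left - o) + p for p, o in zip(patterns, offsets)]
    width = max(len(row) for row in padded)
    # Right-pad so all rows have equal length
    return [row.ljust(width, "-") for row in padded]


for row in align_on_core(["AGTGACA", "ACAGTGACA", "TTTACAGTGA", "GCAGTG"], "AGT"):
    print(row)
```

For these four patterns the rows come out as `-----AGTGACA`, `---ACAGTGACA`, `TTTACAGTGA--` and `---GCAGTG---`, matching the corresponding rows of the alignment above.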
SCPD
Of 1498 patterns:
  315 patterns match sites
  73 patterns are matched by some site
  1134 patterns neither match nor are matched by any site
Of 109 factors with a total of 799 sites:
  85 factors are matched by some of the patterns
  19 factors match some patterns
  24 factors neither match nor are matched
Of 498 unique sites out of a total of 799 sites:
  238 sites are matched by some of the patterns
  21 sites match some patterns
  252 sites neither match nor are matched

TRANSFAC
Of 1498 patterns:
  297 patterns match sites
  61 patterns are matched by some site
  1174 patterns neither match nor are matched by any site
Of 351 DB entries with a total of 359 sites:
  205 factors are matched by some of the patterns
  22 factors match some patterns
  134 factors neither match nor are matched
Of 334 unique sites out of a total of 359 sites:
  198 sites are matched by some of the patterns
  16 sites match some patterns
  127 sites neither match nor are matched
PATMATCH
✦ Match your patterns against sequences
✦ Sequences - extracted from “GENOMES”
✦ Visualise matches along the sequence
✦ Visualise pattern by pattern if sequence has a match
✦ Order sequences according to hierarchical clustering order from EPCLUST
✦ Show clustering and upstream next to each other
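The matching and display steps listed above can be sketched in a few lines. This is not PATMATCH itself: the function names, the text rendering, and the example sequences are made up for illustration, and the pattern is assumed to be a plain string or a regular expression.

```python
import re


def match_positions(pattern, sequences):
    """Map sequence name -> start positions of all (overlapping) matches."""
    regex = re.compile(f"(?=({pattern}))")  # lookahead finds overlapping hits
    hits = {}
    for name, seq in sequences.items():
        positions = [m.start() for m in regex.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits


def render_track(seq_length, positions, pattern_length):
    """Draw a sequence as '-' with '#' marking matched positions."""
    track = ["-"] * seq_length
    for start in positions:
        for i in range(start, min(start + pattern_length, seq_length)):
            track[i] = "#"
    return "".join(track)


# Hypothetical upstream sequences, listed in a given clustering order
upstreams = {"geneA": "TTGGTGGCAAAT", "geneB": "ACACACACACAC"}
order = ["geneB", "geneA"]
hits = match_positions("GGTGGCAA", upstreams)
for name in order:
    if name in hits:  # show a sequence only if it has a match
        print(name, render_track(len(upstreams[name]), hits[name], 8))
```

Iterating over `order` rather than over the dictionary keeps the sequences in the order produced by the clustering, so matches can be read next to the expression profiles.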
GGTGGCAA - proteasome associated control element
PATMATCH - combine pattern matching with expression data
Global and Local Data Mining
Secondary Data Mining
✦ Find global structure by clustering
✦ Find local structure by pattern discovery
✦ Summarize the findings to the size feasible for humans to interpret