TRAWLER: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation
Laurence Ettwiller, Benedict Paten, Mirana Ramialison, Ewan Birney & Joachim Wittbrodt
Supplementary figures and text:
Supplementary Figure 1. Details of Trawler’s procedure.
Supplementary Figure 2. Details of the assessment of Trawler.
Supplementary Figure 3. Analysis of the secondary motifs.
Supplementary Figure 4. Analysis of the secondary motifs.
Supplementary Figure 5. Instances of a secondary motif over-representation.
Supplementary Table 1. ChIP experiments used in this study.
Supplementary Table 2. Comparison of motifs found by Trawler and other motif discovery algorithms on the mammalian data set.
Supplementary Note
Supplementary Methods
Supplementary Figure 1. Details of Trawler’s procedure.
(a) Example of the distribution of z-scores in the sample set (red) and the random set (black). The sequences used here are derived from the E2F1-E2F4 data sets. Only motifs that score above the upper limit of the score distribution in the random sequences are considered significant. (b) Motif clustering procedure. For each pair of motifs, the overlap on the sample sequences is calculated (as a percentage) and motifs that overlap 70 percent of the time or more are linked to form an undirected graph. All motifs from a connected sub-graph are part of the same family and are further resolved into cluster(s). In the Myog example, one family is found that can be further resolved into two clusters. (c) Same as (a), but the sequences used are randomly picked human sequences (1 kb upstream of randomly picked genes).
Supplementary Figure 2. Details of the assessment of Trawler.
(a) Left graph: assessment of Trawler’s performance relative to the other available tools (correlation coefficient by species). Right graph: different measures of accuracy (sSn, site sensitivity; sPPV, site positive predictive value; sASP, overall average site performance). Trawler was run blindly without prior knowledge of the true motifs. (b) Detailed table of Fig. 2 showing, for each pull-down experiment, the ability of individual programs to uncover the correct binding site (BS) in yeast. For each individual ChIP experiment, the success or failure of 7 different algorithms including Trawler is shown. The results from the 6 other algorithms come from Harbison et al.
[Panel (b) table, second part: the per-algorithm success/failure cells did not survive extraction. Experiments listed: Rpn4, Sfp1, Sip4, Skn7, Snt2, Sok2, Spt23, Stb1, Stb4, Stb5, Sum1, Ste12, Sut1, Swi4, Swi6, Tec1, Tye7, Ume6, Yap1, Yap7, Ydr026C.]
The Rpn4 motif occurs multiple times per sequence.
Rpn4 motif
Sfp1 motif
The Sfp1 motif occurs multiple times per sequence.
The Abf1 motif as described by Harbison et al. is found over-represented together with the pA-pT motif and an unknown motif. The pA-pT motif also co-occurs with the Abf1 motif at a sequence level.
[Panel (b) table, first part: the per-algorithm success/failure cells did not survive extraction, and the names of the first rows are missing. Experiments listed: Fkh1, Fkh2, Gat1, Gcn4, Gln3, Hap1, Hsf1, Ino2, Ino4, Leu3, Mbp1, Mcm1, Met4, Msn2, Nrg1, Pdr1, Phd1, Pho2, Pho4, Rap1, Rcs1, Rds1, Reb1, Rfx1.]
The Aft2 motif occurs multiple times per sequence.
Aft2 motif
unknown motif 1
unknown motif 1
Bas1 motif
The Bas1 motif occurs multiple times per sequence.
Cad1 motif
Cbf1 motif
Cin5 motif
The Cad1 motif does not occur often multiple times per sequence.
The Cbf1 motif occurs multiple times per sequence.
The Cin5 motif occurs multiple times per sequence.
The Dal82 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif is also found in the Pdr1 dataset and seems to correspond to the Reb1 or Nrg1 motif.
Dal82 motif
Reb1/Nrg1 motif
Dig1 motif
Tec1 motif
The Dig1 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif corresponds to the Tec1 motif. Additionally, the Tec1 motif co-occurs with the Dig1 motif at a sequence level.
Fhl1 motif
The Fhl1 motif occurs multiple times per sequence. The Fhl1 motif is the same as the Rap1 motif.
The Fkh1 motif occurs multiple times per sequence.
The Fkh2 motif occurs multiple times per sequence.
Fkh1 motif
Fkh2 motif
Gat1 motif
Gcn4 motif
Gln3 motif
The Gcn4 motif occurs multiple times per sequence.
The Gln3 motif occurs multiple times per sequence.
The Gat1 motif does not occur often multiple times per sequence.
Hap1 motif
Hsf1 motif
The Hap1 motif occurs multiple times per sequence.
The Hsf1 motif occurs multiple times per sequence.
Ino2 motif unknown motif 1 unknown motif 2
Ino4 motif
The Ino2 motif is found over-represented with 2 unknown motifs but they do not often co-occur with Ino2.
The Ino4 motif occurs multiple times per sequence.
The Leu3 motif does not occur often multiple times per sequence.
The Mbp1 motif occurs multiple times per sequence.
Leu3 motif
Mbp1 motif
unknown motif 1
unknown motif 1 unknown motif 2 unknown motif 3
Mcm1 motif
Met4 motif
Pho4/Cbf1 motif
The Mcm1 motif is found over-represented with one other unknown motif but it does not often co-occur with the Mcm1 motif.
The Met4 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif corresponds to the Pho4/Cbf1 motif. Additionally, the Pho4/Cbf1 motif co-occurs with the Met4 motif at a sequence level.
Msn2 motif
Nrg1 motif
The Msn2 motif is found over-represented with 3 other unknown motifs but they do not often co-occur with Msn2 motif.
The Nrg1 motif occurs multiple times per sequence.
The Pdr1 motif is found over-represented with 2 other unknown motifs and a motif that corresponds to the Nrg1/ but they do not often co-occur with the Pdr1 motif.
Phd1 motif
unknown motif 1
The Phd1 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif is unknown. This unknown motif does not co-occur with Met4 motif at a sequence level.
Pho2 motif
unknown motif 1
Pho4/Cbf1 motif
The Pho2 motif as described by Harbison et al. is found over-represented together with the Pho4/Cbf1 motif and an unknown motif. The Pho4/Cbf1 motif also co-occurs with the Pho2 motif at a sequence level.
Pho4 motif
The Pho4 motif occurs multiple times per sequence.
Rap1 motif
The Rap1 motif occurs multiple times per sequence. The Rap1 motif is the same as the Fhl1 motif.
The Rcs1 motif occurs multiple times per sequence.
The Rds1 motif occurs multiple times per sequence.
The Reb1 motif occurs multiple times per sequence.
Rcs1 motif
Rds1 motif
Reb1 motif
Rfx1 motif
The Rfx1 motif does not occur often multiple times per sequence.
The Skn7 motif occurs multiple times per sequence.
The Snt2 motif occurs multiple times per sequence.
The Sok2 motif is found over-represented with 3 other unknown motifs, two of which seem to be variants of the Sok2 motif.
The Spt23 motif is found over-represented with 3 other unknown motifs, one of which (motif 1) seems to be a variant of the Spt23 motif.
The Stb1 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif is unknown and does not often co-occur with the Stb1 motif.
The Stb4 motif occurs multiple times per sequence.
The Stb5 motif as described by Harbison et al. is the fourth best motif found by Trawler. The other motifs are unknown and do not often co-occur with the Stb5 motif.
The Ste12 motif as described by Harbison et al. is the second best motif found by Trawler. The first motif corresponds to the Tec1 motif. Additionally, the Tec1 motif co-occurs with the Ste12 motif at a sequence level (same scenario as Dig1).
The Sum1 motif occurs multiple times per sequence.
The Sut1 motif occurs multiple times per sequence.
The Swi4 motif occurs multiple times per sequence.
The Swi6 motif occurs multiple times per sequence.
The Tec1 motif as described by Harbison et al. is found over-represented together with the Ste12/Dig1 motif and an unknown motif. The Ste12/Dig1 motif also co-occurs with the Tec1 motif at a sequence level.
The Tye7 motif is found over-represented with one other unknown motif but it does not often co-occur with the Tye7 motif at a sequence level.
The Ume6 motif occurs multiple times per sequence.
The Yap7 motif occurs multiple times per sequence.
The Ydr026C motif occurs multiple times per sequence.
The Yap1 motif does not occur often multiple times per sequence.
Supplementary Figures 3–4. Analysis of the secondary motifs.
In yeast, of the 54 data sets that have been correctly analysed by Trawler, 19 have secondary motifs (35.2%). The total number of secondary motifs is 35 for the yeast data sets.
No co-occurrence of the NFYA motif with the E2F1-E2F4 motif.
The POU5F1 (OCT4) and SOX2 motifs occur multiple times; the other over-represented motifs do not co-occur with either POU5F1 (OCT4) or SOX2.
Supplementary Figure 4. Analysis of the secondary motifs.
For the mammalian data set, 4 data sets out of 9 (44 %) have secondary motifs. The total number of secondary motifs is 13.
[Figure 5 panels a-d: images not recoverable. Panel labels: E2F known PWM (TRANSFAC ID: M00050); NFYA known PWM (TRANSFAC ID: M00185); over-represented motif family 4; over-represented motif family 8; Tec1 binding site as described by Harbison et al.; Tec1 binding site found by Trawler; Ste12 binding site as described by Harbison et al.; Ste12 binding site found by Trawler.]
Supplementary Figure 5. Instances of a secondary motif over-representation.
(a) In yeast, the Ste12 BS is found over-represented together with the Tec1 BS in the Tec1 ChIP experiments. Both PWMs found by Trawler are compared with the PWMs previously described. (b) Example of co-occurrence at the sequence level of the Tec1 and Ste12 BSs in a region, conserved between S. paradoxus, S. mikatae and S. bayanus, of the promoter of the glutamine-fructose-6-phosphate amidotransferase gene (YKL104C). (c) In human, the NFYA binding motif is found over-represented together with the E2F1-E2F4 binding motif in the E2F1-E2F4 ChIP experiment. The PWMs found by Trawler are compared with the PWMs previously described. (d) Example of co-occurrence of E2F1-E2F4 and NFYA BSs in the intergenic region upstream of the human CDC25A gene. Both sites are conserved from human to opossum. See also Supplementary Notes for more details.
Supplementary Table 1. ChIP experiments used in this study.
Transcription factor(s) | Species | Platform | Reference in suppl. notes | Tissues/growth
52 datasets | D. melanogaster, S. cerevisiae, M. musculus, H. sapiens | - | Tompa et al. 1 | -
203 yeast transcriptional regulators | S. cerevisiae | PCR product array | Harbison et al. 4 | Different conditions and media
E2F1 and E2F4 | H. sapiens | 1.5K and Affymetrix DNA microarrays | Ren et al. 3 | Cell culture (WI-38 cells)
SOX2 | H. sapiens | Agilent array | Boyer et al. 10 | Human ES cell
POU5F1 (OCT4) | H. sapiens | Agilent array | Boyer et al. 10 | Human ES cell
NANOG | H. sapiens | Agilent array | Boyer et al. 10 | Human ES cell
Myod1 | M. musculus | In-house primer arrays | Cao et al. 1 | Mouse embryo fibroblast 12
Myog | M. musculus | In-house primer arrays | Cao et al. 1 | Mouse embryo fibroblast 12
NF-kappa-B | H. sapiens | PCR product array | Schreiber et al. 13 | Human U937 cells
CREB1 | H. sapiens | PCR product array | Zhang et al. 14 | HEK293T cells and hepatocytes
HNF4A and ONECUT1 | H. sapiens | PCR product array | Odom et al. 15 | Hepatocytes and pancreatic islets
NOTCH1 | H. sapiens | PCR product array | Palomero et al. 16 | T-ALL cell
Supplementary Table 2. Comparison of motifs found by Trawler and other motif discovery algorithms on the mammalian data set.
Columns: dataset | literature | Trawler | Weeder large | Weeder small | AlignACE n=10 | AlignACE n=5 | Meme | Motifcut
[Most cells of this table are sequence logos that did not survive extraction. Only the text cells are listed below, in their original order; their column positions cannot be recovered.]
Creb: program stopped; program stopped; program stopped
E2f:
Hnf4:
Hnf6: ATCGAT; NO MOTIF FOUND
Myod: NO MOTIF FOUND; program stopped
Myog: CAGCTG; NO MOTIF FOUND
Nfkb: -
Notch: TGGGAA; NO MOTIF FOUND
Oct4: ATTTGCAT; program stopped; program stopped; program stopped; program stopped
Sox2: C[AT]TTGTT; program stopped; program stopped; program stopped; program stopped
The heights of the sequence logos should not be compared from one motif to another, since the scale was reduced where multiple motifs are shown. "Program stopped" indicates experiments cancelled after more than 4 days of running without producing an output.
Supplementary Notes
Example of clustering of discrete motifs
The clustering of over-represented motifs in the Myog data set illustrates the clustering procedure (Supplementary Fig. 1). One family has been found that can be further resolved into two clusters. The resulting matrices differ in the definition of the flanking sequences while the core motif remains essentially the same. Interestingly, if the same set of motifs is clustered using randomly picked genomic sequences instead of the Myog pulled-down loci, no clear cluster can be seen. Thus, only in the case of the Myog pulled-down loci is the family resolved into two distinct position weight matrices (PWMs); with the random sequences the family is resolved into only one PWM. This result suggests that a clear distinction at a sequence level between the two PWMs is specific to the Myog pulled-down loci, hinting at a biological significance. Indeed, it has been shown that the related TF Myod1 also binds a large fraction of Myog-bound loci 1. The two PWMs found may represent distinct binding sites for the respective TFs. The different clusters uncovered by Trawler provide directions that can be readily taken up and addressed by the experimental biologist.
Examples of alignments
Alignments have been performed for all the promoter-based chromatin immunoprecipitation (ChIP) experiments studied and the motifs found were mapped to these alignments. The results can be found at http://ani.embl.de/trawler/result_paper/. To systematically investigate whether potential binding sites are conserved in the immunoprecipitated sequences, pairwise alignments between the reference species and other species at appropriate evolutionary distances were carried out using Blastz 2. Taking the human E2F1-E2F4 ChIP data set 3 as an example, the E2F1-E2F4 and NFYA motifs found by Trawler were mapped back to the resulting multiple alignments and a conservation score was assigned to each motif (see Supplementary Methods for details). The distribution of conservation scores for each motif in the sample was plotted and compared with the distribution of conservation for the same motifs in upstream sequences of randomly picked genes. To rule out systematic biases due to a high overall conservation of the sequences in the sample, random motifs were created and their distribution of conservation scores was plotted in the same way (see Fig. A below).
Figure A. Distribution of absolute conservation scores for 5 different motif families in (a) random and (b) sample sequences (from E2F1-E2F4 chromatin IP). Two families correspond to E2F1-E2F4 and NFYA binding sites; the 3 other families are random motifs with similar occurrences to the E2F1-E2F4 and NFYA motifs.
Sites corresponding to the E2F1-E2F4 and NFYA motifs in the sample sequences show a pattern of conservation scores very different from that in the randomly picked sequences, with a bimodal distribution of conservation scores present only in the sample sequences. Indeed, a higher number of sites (10.1% of the E2F1-E2F4 sites and 38.6% of the NFYA sites) have an absolute conservation score of 4 or above in the sample sequences. Conversely, the random motifs, despite having a slight tendency to be more conserved in the sample sequences, do not show such bimodal distributions. This result shows that a substantial number of E2F1-E2F4-like and NFYA-like sites are conserved and are likely under purifying selection in the sample sequences, providing an independent validation of Trawler’s prediction. Similar results were obtained with the other data sets tested (data not shown). Conversely, immunoprecipitated sequences also contain a substantial proportion of sites that are not conserved and may either be non-functional or be fast-evolving sites. In the example of the E2F1-E2F4 ChIP sequences, non-conserved motifs matching the E2F1-E2F4 PWM account for 13 percent of all the sites, highlighting the importance of distinguishing the most likely functional sites. Considering these findings, we included a last step in the pipeline that consists of sorting the sites according to the conservation score to retain sites likely to be functional. Thus, both at the level of the abstract motif description and at the level of sites, conservation across evolutionary time allows the assessment of functionality.
Details on Trawler’s performance on the yeast data set from Harbison et al.
We compared the results from Trawler with the high-confidence motifs found in other studies 4 5. Sixty-five motifs were previously found by combining the results of six different motif discovery methods 5 with conservation across yeast species. Fifty-four of these motifs (83%) were found by Trawler alone (Fig. 2b, Fig. S3 and Fig. S4). The total number of families found by Trawler is 112 (corresponding to 263 PWMs), so at least 48.2% of the families are true positives (20.5% in terms of PWMs). The remaining 58 families (209 PWMs), which did not match a previously found PWM, may thus represent additional binding sites. On average Trawler found 1.7 families per data set (112/65) or 4 PWMs per data set (263/65). For comparison, the average number of PWMs per data set is 40 for Converge, 2.479 for Kellis, 35 for Mdscan, 6 for MEME and MEME_c and 170.44 for AlignACE (from the online supporting files of Harbison et al. v24, http://fraenkel.mit.edu/Harbison/release_v24/); see Table A (below) for details. Weeder was not included in the comparison by Harbison et al.
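The proportions quoted in this paragraph follow directly from the four counts it gives. As a quick check, the short Python sketch below recomputes them; the variable names are illustrative only, and the count of unmatched PWMs assumes, as the text does, one matching PWM per recovered motif.

```python
# Counts quoted in the paragraph above (yeast data sets of Harbison et al.).
known_motifs = 65   # experiments with a previously described motif
recovered = 54      # of those, motifs recovered by Trawler
families = 112      # motif families reported by Trawler
pwms = 263          # PWMs reported by Trawler


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


summary = {
    "recovered_pct": pct(recovered, known_motifs),
    "family_true_positive_pct": pct(recovered, families),
    "pwm_true_positive_pct": pct(recovered, pwms),
    # Assumes one matching family (and one matching PWM) per recovered motif.
    "unmatched_families": families - recovered,
    "unmatched_pwms": pwms - recovered,
    "families_per_data_set": round(families / known_motifs, 1),
    "pwms_per_data_set": round(pwms / known_motifs, 1),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```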
Algorithm | number of real motifs (PWM) found | total experiments | percentage of positive motifs (PWM) | average motif (PWM) number | total motifs (PWM)
Converge | 38 | 65 | 58.5 | 40 | 2600
AlignACE | 38 | 65 | 58.5 | 170.4 | 11076
Kellis | 27 | 65 | 41.5 | 2.4 | 156
Mdscan | 41 | 65 | 63.0 | 35 | 2275
Meme | 36 | 65 | 55.4 | 6 | 390
Meme_c | 38 | 65 | 58.5 | 6 | 390
Trawler | 54 | 65 | 83.1 | 4 | 263
Table A: Summary of the performance of Trawler and other algorithms on the yeast data sets. Only experiments where a known binding site has been found by at least one algorithm (Harbison et al. v24) are used (65 experiments in total).
Example of co-occurrences
The loci bound by Abf1 in yeast contain, in addition to the canonical Abf1 motif, the polyA-polyT motif, which is over-represented and co-occurs with it. Abf1 is a multifunctional global regulator with possible chromatin reorganizing and DNA bending activities, whereas the polyA-polyT motif has been shown to induce an intrinsic curvature of the DNA 6. One possible function of this motif is to bring other regulatory elements into proximity of the Abf1 binding site 7. We also found additional over-represented PWMs specific to given cell states in yeast. For example, the loci bound by Dig1 and Ste12 contain, in addition to the canonical Dig1 and Ste12 sites, over-represented and co-occurring Tec1 binding sites (Supplementary Fig. 5a) under conditions of filamentous growth in haploid cells 8. This result is in accordance with previous studies showing that genes involved in filamentous growth are bound by the Tec1/Ste12/Dig1 complex. The co-occurrence of Ste12 and Tec1 binding sites can also be detected at the sequence level in many loci involved in filamentous growth, including the upstream region of the gfa1 gene, which encodes an enzyme involved in the first step of the chitin biosynthesis pathway (Supplementary Fig. 5b). Over-representation of additional motifs can also be detected in some ChIP experiments from vertebrates. For example, in the E2F1-E2F4 data set 3 analysed, we additionally found the binding site of NFYA (Atf6) over-represented and co-occurring (Supplementary Fig. 5c). Co-occurrences of E2F and NFYA have been previously noted for a limited number of cases in the promoters of cell cycle genes 9. Here we uncover the co-occurrence throughout a large number (95%) of E2F1-E2F4 target genes (Table B below). For instance, it is found upstream of the gene coding for the M-phase inducer phosphatase 1 (Supplementary Fig. 5d). Sequences pulled down in the SOX2 ChIP experiment in ES cells also show over-representation of the POU5F1 (OCT4) binding site.
This result is in accordance with a previous study 10 that shows a notable overlap between the target genes of POU5F1 (OCT4) and SOX2. Furthermore our data suggest that the POU5F1 immuno-precipitated sequences are also enriched for a PWM that corresponds to Ap1 and Np-y binding sites.
ENSG00000198901 PRC1 Protein regulator of cytokinesis 1
Table B. E2F1-E2F4 bound genes with both the E2F1-E2F4 and NFYA binding sites within 1 kb upstream of the transcription start site in Homo sapiens.
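To make the notion of co-occurrence at the sequence level concrete, the following Python sketch counts, among sequences carrying one motif, the fraction that also carry a second motif on either strand. It is only an illustration: Trawler works with PWMs, whereas this sketch uses IUPAC consensus words, and the two words (`TTTSSCGC` as an E2F-like consensus, `CCAAT` for the NF-Y box), the function names and the toy sequences are all assumptions made for the example.

```python
import re

# IUPAC codes used here, expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYWSMKN", "TGCAYRWSKMN")


def revcomp(word):
    """Reverse complement of an IUPAC word."""
    return word.translate(COMPLEMENT)[::-1]


def has_site(seq, word):
    """True if the IUPAC word matches the sequence on either strand."""
    for w in (word, revcomp(word)):
        pattern = "".join(IUPAC[c] for c in w)
        if re.search(pattern, seq):
            return True
    return False


def cooccurrence_fraction(seqs, word_a, word_b):
    """Fraction of sequences containing word_a that also contain word_b."""
    with_a = [s for s in seqs if has_site(s, word_a)]
    if not with_a:
        return 0.0
    both = sum(1 for s in with_a if has_site(s, word_b))
    return both / len(with_a)


# Hypothetical toy promoters, not taken from the data sets of the paper.
promoters = [
    "GGTTTCGCGCAAACCAATGG",  # both words, forward strand
    "ACGTTTGGCGCTTATTGGCT",  # E2F-like word plus CCAAT on the reverse strand
    "ACGTACGTACGTACGTACGT",  # neither word
    "TTTCCCGCAAAAAAAAAAAA",  # E2F-like word only
]
fraction = cooccurrence_fraction(promoters, "TTTSSCGC", "CCAAT")
print(f"{fraction:.2f}")
```

In this toy input three sequences carry the first word and two of them also carry the second one.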
CpG enrichment
When analysing the entire promoter (8 kb upstream, 2 kb downstream of the TSS) of the target genes of SOX2, POU5F1 and NANOG we found that these regions are strongly enriched in CpG dinucleotides, a very low complexity signal, linked to accessible chromatin. Indeed, it has been shown that CpG dinucleotides can be methylated and consequently tend to undergo deamination, resulting in a cytosine to thymine transition 11. The CpG status in different parts of the genome therefore reflects the landscape of methylation in the germ line cells. CpG methylation in the promoter regions has been shown to be associated with the repression of transcription 11. Hypomethylation of CpG in the promoters of POU5F1, SOX2 and NANOG target genes thus suggests that their promoters are active in the germline. This finding needs further investigation to address the causality of the binding of SOX2, POU5F1 and NANOG to hypomethylated promoters.
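The text does not state which statistic was used to measure CpG enrichment. One widely used measure is the observed/expected CpG ratio, (number of CpG x sequence length) / (number of C x number of G). The Python sketch below computes it; treating it as the measure used in this analysis is an assumption, and the function name is hypothetical.

```python
def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Returns 0.0 when the sequence has no C or no G.
    """
    seq = seq.upper()
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact number.
    cpg = seq.count("CG")
    return cpg * len(seq) / (c * g)


print(cpg_observed_expected("CGCGCGCG"))  # CpG-rich toy sequence
print(cpg_observed_expected("CCGG"))      # as many CpG as expected by chance
print(cpg_observed_expected("ATATATAT"))  # no C or G
```

A ratio well below 1 is the usual signature of CpG depletion through methylation and deamination, whereas values near or above 1 indicate that CpG dinucleotides have been retained.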
References:
1. Cao, Y. et al. Global and gene-specific analyses show distinct roles for Myod and Myog at a common set of promoters. Embo J 25, 502-11 (2006).
2. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res 13, 103-7 (2003).
3. Ren, B. et al. E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev 16, 245-56 (2002).
4. Harbison, C. T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99-104 (2004).
5. MacIsaac, K. D. et al. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 7, 113 (2006).
6. Hagerman, P. J. Sequence-directed curvature of DNA. Annu Rev Biochem 59, 755-81 (1990).
7. Jensen, L. J. & Knudsen, S. Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics 16, 326-33 (2000).
8. Lorenz, M. C., Cutler, N. S. & Heitman, J. Characterization of alcohol-induced filamentous growth in Saccharomyces cerevisiae. Mol Biol Cell 11, 183-99 (2000).
9. Matuoka, K. & Yu Chen, K. Nuclear factor Y (NF-Y) and cellular senescence. Exp Cell Res 253, 365-71 (1999).
10. Boyer, L. A. et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122, 947-56 (2005).
11. Kass, S. U., Pruss, D. & Wolffe, A. P. How does DNA methylation repress transcription? Trends Genet 13, 444-9 (1997).
12. Bergstrom, D. A. et al. Promoter-specific regulation of MyoD binding and signal transduction cooperate to pattern gene expression. Mol Cell 9, 587-600 (2002).
13. Schreiber, J. et al. Coordinated binding of NF-kappaB family members in the response of human cells to lipopolysaccharide. Proc Natl Acad Sci U S A 103, 5899-904 (2006).
14. Zhang, X. et al. Genome-wide analysis of cAMP-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. Proc Natl Acad Sci U S A 102, 4459-64 (2005).
15. Odom, D. T. et al. Control of pancreas and liver gene expression by HNF transcription factors. Science 303, 1378-81 (2004).
16. Palomero, T. et al. NOTCH1 directly regulates c-MYC and activates a feed-forward-loop transcriptional network promoting leukemic cell growth. Proc Natl Acad Sci U S A 103, 18261-6 (2006).
Supplementary Methods
Motif description
To perform a thorough search for motifs judged significant by our objective function
(see below for definition) the algorithm previously published 1 was used with minor
changes. The input to the initial deterministic pattern scan is a set of foreground
sequences, containing the potential signals of interest, and a set of random
background sequences. As the composition of the background sequence set is of great
importance, the user may supply this, otherwise pre-created sets of randomly sampled
genomic sequences are available. The initial pattern scan explores all words
containing up to a maximum number of degenerate IUPAC nucleotide characters,
excluding the three-fold degenerate characters (HBVD). To approximate the
difference in information content between the four-fold degenerate character (N) and
the two-fold degenerate characters (RYWSMK), each two-fold degenerate character was
scored as one and N as two; the resulting degeneracy sum is thus the
limiting factor in pattern enumeration. The pattern search is not limited by word
length, but enumeration is stopped when the total number of occurrences (including
overlaps) falls below a prescribed threshold.
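The degeneracy scoring described above can be sketched in a few lines of Python. This is an illustration rather than Trawler's implementation: the function names and the `max_degeneracy` parameter are hypothetical, and occurrences are counted with a regular expression on one strand only instead of the suffix tree used by the pipeline.

```python
import re

# IUPAC characters allowed in a pattern, as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "N": "[ACGT]",
}


def degeneracy_sum(word):
    """Score two-fold degenerate characters as 1 and N as 2."""
    score = 0
    for ch in word:
        if ch in "ACGT":
            continue
        if ch in "RYWSMK":
            score += 1
        elif ch == "N":
            score += 2
        else:
            # Three-fold degenerate characters (H, B, V, D) are excluded.
            raise ValueError(f"character not allowed in a pattern: {ch!r}")
    return score


def is_allowed(word, max_degeneracy):
    """True if the word stays within the (hypothetical) degeneracy limit."""
    return degeneracy_sum(word) <= max_degeneracy


def count_occurrences(word, seqs):
    """Number of matches of the word in the sequences, overlaps included."""
    # The lookahead makes the match zero-width, so overlapping hits are kept.
    pattern = re.compile("(?=" + "".join(IUPAC[c] for c in word) + ")")
    return sum(len(pattern.findall(s)) for s in seqs)


print(degeneracy_sum("CANNTG"))                       # two N characters
print(is_allowed("TTTSSCGC", max_degeneracy=2))       # two S characters
print(count_occurrences("CANNTG", ["CAGCTGCACGTG"]))  # matches at 0 and 6
```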
Over-representation assessment
The over-representation of these discrete pattern motifs (discrete-motifs) is assessed
by comparing the numbers of occurrences of motifs in the sample sequences versus a
background sequence set composed of a much larger number of randomly picked
sequences. This is done by using the standard normal approximation to the binomial
to calculate the z-score. The function does not correct for the effects of overlapping
instances; experiments showed that this has no substantial effect on the results, while
making the score much faster to compute (data not shown). To ascertain a sensible
threshold for z-score significance, a randomisation procedure can be run repeatedly
(default 20 runs) and plotted, and the mean score plus two standard deviations taken
as the default threshold. In each run, random sequences are sampled with replacement
and used in place of the foreground sequences. The given pattern set is then evaluated
and the highest scoring pattern taken as the result of the run. Though this has certain
heuristic qualities, as an approximation scheme it behaves robustly.
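A minimal Python sketch of the two calculations described above is given below. It makes several assumptions that the text does not spell out: the binomial is taken over the number of candidate positions in the foreground, the per-position probability is estimated from the background counts, and the sample standard deviation is used for the threshold. All function and parameter names are hypothetical.

```python
import math
import statistics


def z_score(fg_count, fg_positions, bg_count, bg_positions):
    """z-score of a motif count using the normal approximation to the binomial.

    fg_count / bg_count: motif occurrences in foreground / background.
    fg_positions / bg_positions: candidate positions in each set (assumed).
    """
    p = bg_count / bg_positions          # per-position probability
    expected = fg_positions * p          # binomial mean
    sd = math.sqrt(fg_positions * p * (1.0 - p))
    if sd == 0.0:
        return 0.0
    return (fg_count - expected) / sd


def default_threshold(best_scores):
    """Mean plus two standard deviations of the best score of each random run."""
    return statistics.mean(best_scores) + 2 * statistics.stdev(best_scores)


# 30 hits in 10,000 foreground positions versus 100 in 100,000 background ones.
print(round(z_score(30, 10_000, 100, 100_000), 2))
# Best z-scores from three hypothetical randomised runs (the default is 20).
print(default_threshold([2.0, 3.0, 4.0]))
```

Motifs whose z-score exceeds the value returned by `default_threshold` would then be kept as significant.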
Pattern enumeration is performed depth first using a suffix tree. The suffix tree
implementation is based upon a memory-efficient scheme 2. This scheme has the
advantage that large volumes of sequence data (more than 100 megabases) can be
stored on current desktop machines with only 1.5 gigabytes of memory. The output of
this deterministic pattern scan is then the complete set of significant patterns.
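The depth-first enumeration with an occurrence cut-off can be illustrated with the Python sketch below. It deliberately swaps the suffix tree for plain lists of start positions, which is far less memory efficient, and it enumerates exact words only (no degenerate characters); the function name and arguments are hypothetical.

```python
def enumerate_words(seqs, min_occurrences):
    """Depth-first enumeration of exact words seen at least min_occurrences times.

    Returns a dict mapping each retained word to its number of occurrences
    (overlapping occurrences are counted).
    """
    results = {}

    def extend(word, starts):
        # starts: (sequence index, start offset) of every occurrence of `word`.
        by_base = {}
        for i, pos in starts:
            end = pos + len(word)
            if end < len(seqs[i]) and seqs[i][end] in "ACGT":
                by_base.setdefault(seqs[i][end], []).append((i, pos))
        for base in sorted(by_base):
            hits = by_base[base]
            # Stop extending a branch once it falls below the threshold.
            if len(hits) >= min_occurrences:
                results[word + base] = len(hits)
                extend(word + base, hits)

    all_starts = [(i, p) for i, s in enumerate(seqs) for p in range(len(s))]
    extend("", all_starts)
    return results


words = enumerate_words(["ACGACG", "TACGT"], min_occurrences=3)
for word, count in sorted(words.items()):
    print(word, count)
```

Because the occurrences of an extended word are always a subset of those of its prefix, pruning a branch as soon as it drops below the threshold cannot miss any longer word that would pass it.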
Clustering step
To decompose the set of overlapping and redundant putative motifs into a set of
position weight matrices (PWMs) a greedy cluster finding algorithm was employed
upon an undirected graph of motifs (the vertices) and their similarities (edges). Edges
are created in an all-against-all procedure according to the degree of sequence overlap
(which must be more than 70 percent) of motif instances in the sample sequences.
Hits in both strands are analysed to take into account reverse-complemented motifs.
Each connected subgraph forms a family and the families are further resolved into
clusters using essentially two methods: [1] the most connected node and its directly
connected nodes are removed from the graph to form one cluster. The operation is
iterated until no cluster can be found. Clusters from the same initial subgraph
constitute a family. [2] The percentage overlap for all the motifs constituting a family
is calculated as described above. The resulting 2D matrix is clustered using the
following distance function: correlation, absolute value of the correlation, uncentered
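The graph-based part of the clustering step (edge creation, families as connected subgraphs, and method [1]) can be sketched in Python as follows. This is an illustration under stated assumptions, not Trawler's code: the overlap percentages are given as a precomputed dictionary, ties between equally connected nodes are broken alphabetically, and the iteration continues until no node is left, so an isolated motif ends up as its own cluster. All names are hypothetical.

```python
def build_graph(motifs, overlap, threshold=70.0):
    """Undirected graph linking motifs whose instances overlap > threshold percent.

    overlap: dict mapping (motif_a, motif_b) to the percentage overlap.
    The Methods text says "more than 70 percent"; the caption of
    Supplementary Fig. 1 says "70 percent or more".
    """
    graph = {m: set() for m in motifs}
    for (a, b), percent in overlap.items():
        if percent > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def families(graph):
    """Connected subgraphs of the motif graph, one set of motifs per family."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(component)
    return result


def greedy_clusters(graph, family):
    """Method [1]: repeatedly remove the most connected node and its neighbours."""
    remaining = set(family)
    clusters = []
    while remaining:
        # Degree is counted among the nodes still present in the family.
        hub = max(sorted(remaining), key=lambda n: len(graph[n] & remaining))
        cluster = {hub} | (graph[hub] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters


# Hypothetical overlap percentages between six discrete motifs.
motifs = ["m1", "m2", "m3", "m4", "m5", "m6"]
overlap = {
    ("m1", "m2"): 90.0, ("m1", "m3"): 80.0, ("m2", "m3"): 60.0,
    ("m3", "m4"): 75.0, ("m4", "m5"): 85.0, ("m5", "m6"): 10.0,
}
graph = build_graph(motifs, overlap)
fams = families(graph)
clusters = greedy_clusters(graph, fams[0])
print(fams)
print(clusters)
```

In this toy input the first family of five motifs is resolved into two clusters, mirroring the one-family, two-cluster outcome described for the Myog data set.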
For Fig. 1b, a randomized data set composed of the same number of sequences as in
the sample set was produced by randomly picking sequences from the background.
Trawler was run with the default parameters.
Myod/Myog: The mouse genes targeted by either Myod or Myog in both MDER and
C2C12 cells are listed in
(http://www.nature.com/emboj/journal/v25/n3/extref/7600958s3.pdf). The 750 bp
upstream and 250 bp downstream sequences of the annotated transcription start site
(TSS) (EnsEMBL version 38 9) were repeat and exon masked.
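As an illustration of how such a window around the TSS might be derived, the Python sketch below computes strand-aware coordinates. The coordinate convention is an assumption (1-based inclusive positions, with the TSS counted as the first downstream base), the function name is hypothetical, and repeat and exon masking are not shown.

```python
def promoter_window(tss, strand, upstream=750, downstream=250):
    """Return (start, end), 1-based inclusive, of the window around a TSS.

    The TSS itself is counted as the first downstream base (an assumption),
    so the window spans upstream + downstream bases unless it is clipped
    at the start of the chromosome.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream - 1
    elif strand == "-":
        # On the reverse strand, upstream lies at higher coordinates.
        start, end = tss - downstream + 1, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 1), end


print(promoter_window(10_000, "+"))
print(promoter_window(10_000, "-"))
print(promoter_window(100, "+"))  # clipped at position 1
```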
HNF4A and ONECUT1 (HNF6): All the data were downloaded from
(http://jura.wi.mit.edu/young_public/autoregulation/downloaddata.html). The 750 bp
upstream and 250 bp downstream sequences of the annotated TSS (EnsEMBL version
38 9) were repeat and exon masked.
POU5F1 (OCT4), SOX2 and NANOG analysis: Three data sets coming from the
same experiment 10 have been analysed. First, the exact loci bound by the
transcription factor (TF) were extracted from MacIsaac et al. 11. Second, the
entire promoter regions (10 kb) that include the bound loci were also analysed
using the data from 10.
NOTCH1: The target genes were taken from the supporting Table 1 of 12 and the 3 kb
upstream human sequences were retrieved from EnsEMBL 41 (repeat masked
sequences). The background corresponds to 3 kb sequences upstream of 2000
randomly picked genes. Trawler was run with the default parameters (except for a
minimum motif occurrence of 20).
CREB1: The data were downloaded from 16. The first 200 genes recorded in their
Supplementary Table 6 were used. The 750 bp upstream and 250 bp downstream
sequences of the annotated TSS (EnsEMBL version 38 9) were repeat and exon
masked.
NF-kappa-B: The data were downloaded from (http://web.wi.mit.edu/young/nfkb/).
The 700 bp upstream and 200 bp downstream sequences of the annotated TSS
(EnsEMBL version 38 9) were repeat and exon masked.
The regions used in the Trawler analysis correspond to regions described in the ChIP
of the corresponding analysis. The background corresponds to the upstream region of
appropriate length of 2000 randomly picked genes. Trawler was run with the default
parameters.
References:
1. Ettwiller, L. et al. The discovery, positioning and verification of a set of transcription-associated motifs in vertebrates. Genome Biol 6, R104 (2005).
2. Kurtz, S. Reducing the space requirements of suffix trees. Software-Practice and Experience 29, 1149-1171 (1999).
3. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8 (1998).
4. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res 13, 103-7 (2003).
5. Clamp, M., Cuff, J., Searle, S. M. & Barton, G. J. The Jalview Java alignment editor. Bioinformatics 20, 426-7 (2004).
6. Crooks, G. E., Hon, G., Chandonia, J. M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res 14, 1188-90 (2004).
7. Tompa, M. et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23, 137-44 (2005).
8. Ren, B. et al. E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev 16, 245-56 (2002).
9. Birney, E. et al. Ensembl 2006. Nucleic Acids Res 34, D556-61 (2006).
10. Boyer, L. A. et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122, 947-56 (2005).
11. Macisaac, K. D. et al. A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data. Bioinformatics 22, 423-9 (2006).
12. Palomero, T. et al. NOTCH1 directly regulates c-MYC and activates a feed-forward-loop transcriptional network promoting leukemic cell growth. Proc Natl Acad Sci U S A 103, 18261-6 (2006).