Conserved introns reveal novel transcripts in eukaryotic genomes

Michael Hiller 1,2, Sven Findeiß 3,4, Sandro Lein 5, Manja Marz 3, Claudia Nickel 5, Dominic Rose 3, Christine Schulz 6, Rolf Backofen 1, Sonja J. Prohaska 3,4,7, Gunter Reuter 5 and Peter F. Stadler 3,4,6,7,8

1) Bioinformatics Group, Albert-Ludwigs-University Freiburg, Germany
2) Department of Developmental Biology, Stanford University, USA
3) Bioinformatics Group, University of Leipzig, Germany
4) Interdisciplinary Center of Bioinformatics, University of Leipzig, Germany
5) Institute of Genetics, Martin Luther University Halle-Wittenberg, Germany
6) RNomics Group, Fraunhofer Institut für Zelltherapie und Immunologie, Germany
7) Institut für Theoretische Chemie und Molekulare Strukturbiologie, University of Vienna, Austria
8) Santa Fe Institute, Santa Fe, USA

email: [email protected] – www: http://www.bioinf.uni-leipzig.de

1. Introduction & Outline

Introduction
• Most of a eukaryotic genome is transcribed, producing large numbers of non-coding RNAs (ncRNAs), a heterogeneous class of essential transcripts that exert their function at the RNA level without ever being translated into protein.
• A subclass is spliced, capped, and polyadenylated like mRNAs and is therefore called messenger-like non-coding RNAs (mlncRNAs). Examples: Xist and H19 (gene regulators).
• Unlike protein-coding gene finding, ncRNA gene finding based solely on sequence data is a challenging problem: there are no start or stop codons, no discernible open reading frames, poor sequence conservation, and, for long ncRNAs, usually not even structure conservation.

Outline
• Novel genome-wide comparative genomics approach.
• Search for conserved introns in eukaryotic genomes.
• Capable of identifying novel transcripts/genes.
• Idea: gene finding based on intron prediction.
• Intron detection makes it possible to
  – extend or revise existing annotation.
  – identify novel protein-coding genes.
  – identify novel mlncRNAs.

2. Overview

The idea
A functional pair of donor (5') and acceptor (3') splice sites will be retained over long evolutionary time scales only if
• the locus is transcribed into a functional transcript and
• accurate intron removal is necessary to produce a functional transcript.

The data
Two screens; all input data are available at the UCSC Genome Browser:
• 15 insects, already published, see [1] (12 Drosophila genomes, mosquito, beetle, honeybee)
• 44 vertebrates (human → teleosts, lamprey)

The plan
Insects: focus on short conserved introns (40-81 nt)
• Apply intronscan (preliminary filter) → build alignments → evaluate characteristic intron evolution → train support vector machine (SVM) → classify candidate set of novel introns
Vertebrates: focus on general, independent splice-site prediction (vertebrates have only few short introns)
• Apply MaxEntScan (preliminary filter) → compile set of real (positive) and "pseudo" (false) donor/acceptor splice sites → evaluate characteristic splice-site evolution → train SVM → classify candidate set of novel splice sites

3. Insects – Methods

[Figure: workflow of the insect screen. A) Predict introns in individual insect genomes using intronscan (1,398,939 predicted introns). B) Retain orthologous intronscan predictions (498,231 predictions with orthologs; species shown: D.ere, D.mel, D.moj, + 12 insects). C) Evaluate characteristic intron evolution from the distributions of training samples: donor score variation, acceptor score variation, intron length variation, conservation scores, splice-site substitution scores. Train an SVM with these 5 discriminative features and apply it to 342,785 predictions that overlap no protein-coding gene.]
[Figure, continued: 369 conserved introns predicted in D. melanogaster (positive vs. negative predictions on + and - strand). ROC curve of the independent test set (true positive rate vs. false positive rate): AUC = 0.983.]

4. Insects – Splice-site evolution

Nucleotide frequencies differ at splice-site positions.

[Figure: per-species nucleotide frequency (A, C, G, T) relative to D.mel, in percent (axis -20 to +20), at donor (5') splice-site positions +3 to +6 and acceptor (3') splice-site positions -7 to -3. Species: D.sim, D.sec, D.yak, D.ere, D.ana, D.pse, D.per, D.wil, D.moj, D.vir, D.gri, A.gam, T.cas, A.mel. Axis direction: less frequent ←|→ more frequent (compared to D.mel).]

E.g., Apis prefers A over G (donor +3) and T over C (acceptor -3).

Learn log-odds substitution scores:

$$\forall x, y \in \{A, T, C, G\},\ x \neq y:\quad \log_2 \frac{\mathrm{freq}_{pos}(x \to y)}{\mathrm{freq}_{neg}(x \to y)}$$

→ substitution matrix

[Figure: densities of the sum of substitution scores and of the average conservation score (PhastCons, 0 to 1) for the intron regions +8...+20 and -20...-8, shown for positive and negative training samples. Values annotated in the panels: average = 0.002, sum = 31.6, sum = -3.7, average = 0.92. Example alignments across the Drosophila species accompany the plots.]
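The log-odds substitution scores of Section 4 can be illustrated with a minimal sketch. It assumes that substitutions are tallied as (reference base, other-species base) pairs at splice-site positions, that freq is the relative frequency of a substitution among all observed substitutions, and that a pseudocount avoids division by zero. The function names, the pseudocount and the input layout are hypothetical and are not taken from the poster.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

BASES = "ACGT"
Sub = Tuple[str, str]  # (base in reference species, base in other species)


def substitution_frequencies(pairs: Iterable[Sub],
                             pseudocount: float = 1.0) -> Dict[Sub, float]:
    """Relative frequency of every substitution x->y with x != y.

    The pseudocount is an assumption of this sketch; it keeps unseen
    substitutions from producing log2(0) or a division by zero.
    """
    counts = Counter((x, y) for x, y in pairs
                     if x != y and x in BASES and y in BASES)
    keys = [(x, y) for x in BASES for y in BASES if x != y]
    total = sum(counts[k] + pseudocount for k in keys)
    return {k: (counts[k] + pseudocount) / total for k in keys}


def log_odds_matrix(pos_pairs: Iterable[Sub], neg_pairs: Iterable[Sub],
                    pseudocount: float = 1.0) -> Dict[Sub, float]:
    """log2(freq_pos(x->y) / freq_neg(x->y)) for all x != y."""
    f_pos = substitution_frequencies(pos_pairs, pseudocount)
    f_neg = substitution_frequencies(neg_pairs, pseudocount)
    return {k: math.log2(f_pos[k] / f_neg[k]) for k in f_pos}


def score_site(matrix: Dict[Sub, float], reference: str, other: str) -> float:
    """Sum the log-odds scores over all mismatching aligned positions.

    Columns with gaps or other non-ACGT symbols are skipped.
    """
    return sum(matrix[(x, y)] for x, y in zip(reference, other)
               if x != y and (x, y) in matrix)
```

A substitution that is more common in real splice sites than in pseudo sites gets a positive score, so a candidate whose changes resemble those of real introns accumulates a positive sum.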
[Figure, continued: multiple alignments of two example candidates, each starting with the donor GT and ending with the acceptor AG. Panels labeled A and B show substitution scores and PhastCons conservation scores (0 to 1) over positions +8...+20 and -20...-8, placed against the distributions of positive and negative training samples. One candidate is classified as a real intron (SVM probability 0.999), the other as a false prediction (SVM probability 0.001).]

5. Insects – Results

[Figure: UCSC Genome Browser views of predicted introns on chr3R, chr2R, chr3L, chr3L and chrX (windows of 300 to 600 bp). Tracks: predicted intron, Conservation, FlyBase Protein-Coding Genes, FlyBase Noncoding Genes, D. melanogaster mRNAs from GenBank, D. melanogaster ESTs That Have Been Spliced. Labeled gene: CG14614.]
[Figure, continued: panels A to E with spliced D. melanogaster ESTs and mRNAs from GenBank; labeled genes include pncr009:3L-RA and dally.]

Panels: A) Predicted intron located in a 5' UTR (- strand). B) Predicted intron belonging to an antisense transcript of dally. C) EST-confirmed intron prediction. D) Predicted, EST-confirmed intron that revises the current FlyBase annotation. E) Clustered predictions at a putative novel protein-coding gene (blastx hits in several species).

• Area under the ROC curve: 0.983; at p > 0.95: 80% TP at 0.12% FP
• 369 predictions outside of known protein-coding genes (p > 0.95)
• 131 introns confirmed by ESTs or FlyBase transcripts, 238 unconfirmed
• After discarding novel protein-coding candidates: 129 novel mlncRNAs

6. Insects – Exp. verification
• RT-PCR on 5 different developmental stages of D. melanogaster: embryo, larva, pupa, male, female
• 18/29 (62%) experimentally validated: mlncRNAs 7/12; introns in putative coding transcripts 11/17

7. Vertebrates – Refinements

Meet increasing requirements
• Vertebrate introns ≠ insect introns (2% vs. 54% short introns)
• Rather than predicting complete introns, we switch to individual splice-site prediction → new (SVM) features are needed to distinguish real from false splice sites. We propose:
  (1) The human MaxEntScan splice-site score.
  (2-4) Three log-odds substitution score variants: $s_{tree}$, $s_{pair}$, $s_{median}$.
  (5) The total number of species in an alignment.
  (6) The total number of species with conserved GT/AG dinucleotides and a MaxEntScan score >= 0.
  (7) The slope of a regression line fitted to the splice-site PhastCons sequence conservation profile over [-20,+20].
  (8) The average GC content.
  (9) The mean pairwise identity.
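Features (7) to (9) are simple alignment statistics, and a minimal sketch may make them concrete. It assumes the PhastCons profile is a list of per-position scores over the window [-20,+20] and the alignment is a list of equal-length, gapped rows. The gap handling and all function names are assumptions of this sketch, not details given on the poster.

```python
from itertools import combinations
from typing import Sequence


def regression_slope(profile: Sequence[float]) -> float:
    """Least-squares slope of the scores against their position index.

    The slope does not depend on where the x axis starts, so indices
    0..n-1 give the same result as consecutive positions from -20 on.
    """
    n = len(profile)
    if n < 2:
        raise ValueError("need at least two positions to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(profile) / n
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(profile))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    return sxy / sxx


def gc_content(row: str) -> float:
    """Fraction of G/C among the A/C/G/T characters of one row."""
    bases = [b for b in row.upper() if b in "ACGT"]
    return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0


def average_gc_content(rows: Sequence[str]) -> float:
    """Mean GC content over all rows (assumes at least one row)."""
    return sum(gc_content(r) for r in rows) / len(rows)


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical columns, ignoring columns with a gap."""
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if x != "-" and y != "-"]
    return sum(x == y for x, y in columns) / len(columns) if columns else 0.0


def mean_pairwise_identity(rows: Sequence[str]) -> float:
    """Average identity over all pairs of rows (assumes at least two rows)."""
    pairs = list(combinations(rows, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```

With a profile in exon-to-intron order, a real donor site would be expected to show conservation falling off into the intron, which a negative slope captures in a single number. Whether the poster's profile is oriented that way is not stated.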
Improve log-odds substitution scores
• Reconstruct ancestral sequences for each splice-site region using prequel and learn splice-site substitution patterns for each edge e of the 44-species tree (E denotes the set of edges):

$$s_{tree} = \sum_{e \in E} \log_2 \frac{f_{pos}(x \to y) \,/\, \sum_{n \in A} f_{pos}(x \to n)}{f_{neg}(x \to y) \,/\, \sum_{n \in A} f_{neg}(x \to n)}$$

8. Vertebrates – Results
• 2 models; AUC: ∼0.93 (donor), ∼0.94 (acceptor)
• Intron candidates: arbitrarily defined as adjacent donor/acceptor pairs at a distance <= 5000 nt on the same strand
• chr21: 886 pairs (p > 0.9), 105 with a typical PhastCons profile/basin, 16 manually chosen for experimental validation (ongoing work)

[Figure: UCSC Genome Browser views on chr22, including intron:chr22:18111410-18112940 (1 kb scale) and a window at chr22:22066000-22066150 (100 bases scale). Tracks: chr22 ACC, chr22 DO, chr22 phast-filtered introns, ENCODE Affymetrix/CSHL Subcellular RNA Localization by Tiling Array (GM128 and K562 cell fractions), CONTRAST, SGP, Geneid and Genscan Gene Predictions, Vertebrate Multiz Alignment & Conservation (44 Species), Primate, Placental Mammal and Vertebrate Conservation by PhastCons, Repeating Elements by RepeatMasker.]
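The intron-candidate definition in the Vertebrates results (adjacent donor/acceptor pairs at most 5000 nt apart on the same strand) can be sketched as follows. The poster does not spell out what "adjacent" means. This sketch assumes it means a predicted donor directly followed, in the direction of transcription, by a predicted acceptor with no other predicted site in between. The `Site` record and the function name are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple

MAX_DISTANCE = 5000  # nt, the cutoff stated on the poster


class Site(NamedTuple):
    pos: int     # genomic coordinate of the predicted splice site
    strand: str  # "+" or "-"
    kind: str    # "donor" or "acceptor"


def adjacent_pairs(sites: Iterable[Site],
                   max_distance: int = MAX_DISTANCE) -> List[Tuple[Site, Site]]:
    """Return donor/acceptor pairs that are neighbours on the same strand.

    Assumption of this sketch: two sites are "adjacent" when no other
    predicted site on that strand lies between them.
    """
    sites = list(sites)
    pairs = []
    for strand in "+-":
        # Order sites along the direction of transcription:
        # ascending on the + strand, descending on the - strand.
        ordered = sorted((s for s in sites if s.strand == strand),
                         key=lambda s: s.pos, reverse=(strand == "-"))
        for first, second in zip(ordered, ordered[1:]):
            if (first.kind == "donor" and second.kind == "acceptor"
                    and abs(second.pos - first.pos) <= max_distance):
                pairs.append((first, second))
    return pairs
```

Filtering the SVM predictions by probability (the poster uses p > 0.9) before pairing would reduce the number of spurious neighbours. That filter is left out here to keep the sketch short.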
[Figure, continued: UCSC Genome Browser views of the predicted introns intron:chr22:22066024-22066128 and intron:chr22:22659660-22661320, the latter shown at base resolution at both ends (5 and 10 bases scale) next to the labeled transcript ENST00000405781. Tracks: chr22 phast-filtered introns, ENCODE Gencode Gene Annotations, Ensembl Gene Predictions, Swiss Institute of Bioinformatics Gene Predictions from mRNA and ESTs, SGP, Geneid and Genscan Gene Predictions, Vertebrate Multiz Alignment & Conservation (44 Species), Primate, Placental Mammal and Vertebrate Conservation by PhastCons, Repeating Elements by RepeatMasker.]

Acknowledgements

For contributing:
• Micha, Sven, Sandro, Manja, Claudia, Christine, Rolf, Sonja, Gunter and Peter

For funding:
• German Research Foundation (STA 850/7-1 and Hi 1423/2-1)
• Graduiertenkolleg Wissensrepräsentation of the University of Leipzig
• European Network of Excellence "The Epigenome"
• 6th Framework Programme of the European Union (SYNLET)

References
[1] M. Hiller, S. Findeiß, S. Lein, M. Marz, C. Nickel, D. Rose, C. Schulz, R. Backofen, S. J. Prohaska, G. Reuter, P. F. Stadler, Conserved introns reveal novel transcripts in Drosophila melanogaster, Genome Res. 19 (2009) 1289–1300.

Printed by Universitätsrechenzentrum Leipzig