Göteborg 01/12/2006
EnsEMBL and the process of genebuild
Julio Fernández Banet ([email protected])
Wellcome Trust Sanger Institute
EnsEMBL Group (Genebuild Team)
01-Dec-2006

Overview
• What is Ensembl?
  – Ensembl project
  – Open source
  – The genome browser
• Genebuild
  – Automatic vs. Manual annotation
  – Traditional Genebuild
  – Special cases

Ensembl - Project
• Joint project
  – EMBL - European Bioinformatics Institute (EBI)
  – Wellcome Trust Sanger Institute
• Produce accurate, automatic genome annotation
• Integrate external (distributed) biological data
• Focused on selected eukaryotic genomes
• Presentation of the analysis to all via the Web at http://www.ensembl.org
• Open distribution of the analysis to the community
• Development of open, collaborative software (databases and APIs)

Species in Ensembl v41

APIs
• Used to retrieve data from and to store data in Ensembl databases.
• Ensembl Perl API:
  – Written in object-oriented Perl
  – Foundation for the Ensembl Pipeline and the Ensembl Web interface
• Ensembl Java API:
  – Written in Java, but similar in layout to the Perl API
  – Foundation for Apollo
  – No longer supported; development has stopped

Open source open standards
• Object model
  – Standard interface makes it easy for others to build custom applications on top of Ensembl data
• Open discussion of design ([email protected])
• Most major pharma and many academics are represented on the mailing list, and code is being actively developed externally
• Ensembl locally (free for all)
  – Both industry & academia

Ensembl – Open source

Genome browser
• Browse genes in genomic context
• Display features in and around a particular gene
• Explore larger chromosome regions
• Search and retrieve information on a gene- and

ncRNA (tRNAs)
• Low sequence identity
• Families share conserved secondary structure
• Pipeline Rfam Scan
• Run BLAST and Infernal at the genomic level

ncRNAs (miRNA)
• Highly conserved across species
• Identified using BLAST of genomic sequence against miRBase precursors

Traditional Genebuild (Conclusions)
• Developed originally for human
• Exploits rich human-specific resources
• Protein, cDNA based
• Compute was a Really Big Issue in the past
• As we moved beyond “just” building human and mouse, scaling became a big issue.
• Increased build automation crucial: genebuild was pipelined after being a set of dodgy scripts for far too long.

Genebuild Issues
Data availability
• Targeted build most useful in mouse, rat, human
• Similarity build more important in other species
Structural issues
• Zebrafish: many similar genes near each other; genome from different haplotypes
• C. briggsae: very dense genome; short introns
• Mosquito: many single-exon genes; genes within genes
Solution: Configuration files provide flexibility

Genebuild Issues
Gene level
• Proteins from very distant organisms can skew similarity build
• Spindly exons
• Non-consensus splice sites
• Targeted protein fragments masking similarity build

Problem: Spindly genes
No Miniseqs -> Use Fullseqs:
• Compute expensive
• Reduces gene merging

Problem: Non-consensus splice sites
• Non-consensus splice sites common
• Excess of alternative transcripts
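
What counts as a non-consensus splice site can be made concrete with a small check. The sketch below assumes the canonical intron boundaries GT (donor) and AG (acceptor) on the forward strand and treats everything else as non-consensus. The function name `is_consensus_intron` and the 0-based, end-exclusive coordinates are illustrative choices, not part of the Ensembl code.

```python
def is_consensus_intron(sequence: str, intron_start: int, intron_end: int) -> bool:
    """Return True if sequence[intron_start:intron_end] reads GT...AG.

    Coordinates are 0-based and end-exclusive; forward strand only.
    """
    intron = sequence[intron_start:intron_end].upper()
    if len(intron) < 4:
        return False  # too short to hold both a donor and an acceptor site
    return intron.startswith("GT") and intron.endswith("AG")


# Toy sequence: exon "ATGGCC", intron "GTAAGTTTTCAG", exon "GGATAA".
toy = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
print(is_consensus_intron(toy, 6, 18))  # intron with GT...AG boundaries
print(is_consensus_intron(toy, 7, 18))  # start shifted by one base
```

A build that accepts many introns failing such a check will tend to produce extra, poorly supported alternative transcripts, which is the problem the slide points at.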

Problem: Targeted protein fragments

More improvements
Homology Build:
• Used to rescue fragmented genes
• Compara homology pipeline: human, mouse, rat, dog and chicken.
• Exonerate used to align orthologs
Incremental Build:
1. Targeted genes, Similarity genes, Homology genes
2. EST genes
3. Genes from the previous Build
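
One way to read the incremental build order is as a priority merge: gene sets are considered in the listed order, and a gene from a later set is kept only if it does not overlap a gene accepted from an earlier set. That reading is an assumption about how the priorities are applied; the tuple layout and the names `overlaps` and `incremental_merge` are hypothetical.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int, str]  # (chromosome, start, end, source label)


def overlaps(a: Gene, b: Gene) -> bool:
    """True if two genes share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def incremental_merge(gene_sets: List[List[Gene]]) -> List[Gene]:
    """Merge gene sets given in priority order, highest priority first."""
    accepted: List[Gene] = []
    for gene_set in gene_sets:
        # Compare only against higher-priority sets, so genes within one
        # set never exclude each other.
        kept = [g for g in gene_set if not any(overlaps(g, a) for a in accepted)]
        accepted.extend(kept)
    return accepted


protein_based = [("1", 100, 500, "targeted"), ("1", 2000, 2600, "similarity")]
est_genes = [("1", 400, 900, "est"), ("1", 3000, 3500, "est")]
previous_build = [("1", 3400, 3600, "previous"), ("2", 100, 200, "previous")]

for gene in incremental_merge([protein_based, est_genes, previous_build]):
    print(gene)
```

In this toy input the EST gene at 400-900 is dropped because a targeted gene already covers that region, and the previous-build gene at 3400-3600 is dropped in favour of an EST gene.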

Low Coverage Genebuild
• Genomes come in lots of scaffolds (cow 3x had 800K contigs in 450K scaffolds).
• None of our traditional approaches are much help. Applying the normal genebuild with low coverage cutoffs yields at best a set of gene fragments.
• Approach: use whole-genome alignment (WGA) to infer gene-scaffold assemblies
• New method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)
• Projection of genes from reference species.

Low Coverage Genebuild: From start to finish
(Starting point: a core database with repeat features)
(1) Raw alignment generation - BLASTZ
(2) Alignment chaining - Jim Kent’s axtChain
(3) Best-in-genome alignment filtering (“net”)
(4) Gene-scaffold inference and annotation projection
(5) Merging and extension of gene-scaffold
(6) Loading gene-scaffold assembly and annotation
(7) Run rest of analysis on revised top-level sequence regions
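
Step (4) can be pictured as follows: the scaffolds that align to one reference gene are ordered by where they align along the reference, then joined with runs of N to form a gene-scaffold. The sketch below is a simplification under stated assumptions: one alignment block per scaffold, everything on the forward strand, and an arbitrary pad length. The names `AlignmentBlock` and `build_gene_scaffold` are hypothetical and do not come from the Ensembl pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlignmentBlock:
    scaffold: str   # name of the low-coverage scaffold
    ref_start: int  # start of the aligned region on the reference
    ref_end: int    # end of the aligned region on the reference


def build_gene_scaffold(blocks: List[AlignmentBlock],
                        scaffold_seqs: Dict[str, str],
                        pad: int = 10) -> str:
    """Join scaffolds in reference order, separated by runs of N."""
    ordered = sorted(blocks, key=lambda b: b.ref_start)
    gap = "N" * pad
    return gap.join(scaffold_seqs[b.scaffold] for b in ordered)


seqs = {"scaf_A": "ACGTACGT", "scaf_B": "TTGGCCAA", "scaf_C": "GGGCCC"}
blocks = [
    AlignmentBlock("scaf_B", 500, 900),
    AlignmentBlock("scaf_A", 100, 450),
    AlignmentBlock("scaf_C", 950, 1200),
]
print(build_gene_scaffold(blocks, seqs, pad=4))
```

A real implementation would also have to handle reverse-strand scaffolds, scaffolds with several blocks, and conflicts between genes that claim the same scaffold; none of that is modelled here.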

Method overview
[Figure: diagram with the labels Human Chr., Human gene, Projection, Low coverage Scaffold and Superscaffold (scaffolds separated by NNNN gaps); rows labelled Human and Cow]

Projection (Build over gaps)
[Figure: Human and Cow rows, with gaps in the cow sequence shown as runs of N]

Projection (Filtering)
• Filter out transcripts with less than 50% identity or 50% coverage
[Figure: Human and Cow rows, with a gap in the cow sequence shown as a run of N]
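
The filtering rule can be sketched as a simple threshold test. The 50% cutoffs come from the slide; the record layout and the names `ProjectedTranscript` and `passes_filter` are assumptions, and identity and coverage are taken as percentages already computed upstream.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProjectedTranscript:
    name: str
    percent_identity: float  # identity to the source transcript, 0-100
    percent_coverage: float  # share of the source transcript covered, 0-100


def passes_filter(t: ProjectedTranscript,
                  min_identity: float = 50.0,
                  min_coverage: float = 50.0) -> bool:
    """Keep a transcript only if it meets both thresholds."""
    return (t.percent_identity >= min_identity
            and t.percent_coverage >= min_coverage)


candidates: List[ProjectedTranscript] = [
    ProjectedTranscript("tr1", 92.0, 88.0),
    ProjectedTranscript("tr2", 45.0, 97.0),  # identity too low
    ProjectedTranscript("tr3", 81.0, 30.0),  # coverage too low
]
kept = [t.name for t in candidates if passes_filter(t)]
print(kept)
```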

Other cases (Chimp genebuild)
• Little chimp protein, cDNA and EST information available.
• 6X high-coverage genome
• Based on low coverage projection code.
• Takes advantage of human-chimp genome similarity.
• Human has the best annotated and studied genome.

Alternative Genebuild
[Figure: flow diagram with the labels Human transcripts, Projection, Projected transcripts, Human transcripts, Exonerate, Exonerated transcripts, Transcript merge, Human + Chimp cDNAs, Exonerate, Aligned cDNAs, transcripts with UTRs, Genebuilder, Core Ensembl gene set, Pseudogenes, Final set + pseudogenes]

Transcript Merge
• Preference for projected transcripts over Exonerate.
[Figure: Projection and Exonerate models against the Human transcript, with a gap shown as a run of N]

Transcript merge
• Exonerate gene models selected where no projection was obtained.
• No human-chimp alignment for the region where the gene resides in human.
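
The selection rule on these two slides amounts to a per-gene fallback: use the projected model when one exists, otherwise the Exonerate model. A minimal sketch, assuming models are keyed by the human gene they derive from; the dictionary layout and the name `merge_transcripts` are hypothetical.

```python
from typing import Dict


def merge_transcripts(projected: Dict[str, str],
                      exonerate: Dict[str, str]) -> Dict[str, str]:
    """Prefer projected models; fall back to Exonerate where none exists."""
    merged = dict(exonerate)  # start from the fallback set
    merged.update(projected)  # projected models override on shared keys
    return merged


projected = {"GENE_A": "projected_model_A"}
exonerate = {"GENE_A": "exonerate_model_A", "GENE_B": "exonerate_model_B"}
print(merge_transcripts(projected, exonerate))
```

Here GENE_B stands for a gene in a region with no human-chimp alignment, so only the Exonerate model is available for it.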

Other cases (EST Genebuild)
• Ciona intestinalis and Ciona savignyi: few protein sequences but lots of ESTs and cDNAs
• Genebuilder not making best use of this resource
• TranscriptCoalescer developed

EST Genebuild
• TranscriptCoalescer tests conserved intron boundaries to join clusters together conservatively.
• Use ab initio data to confirm gene structures
• Included in Feb release
[Figure: EST alignments and the resulting TranscriptCoalescer gene, annotated "Genebuild found no genes in this area."]
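
The idea of joining EST alignments only when they agree on intron boundaries can be illustrated with a toy clustering. This is not the TranscriptCoalescer algorithm itself. It is a sketch that assumes each alignment is reduced to its set of introns, and that two alignments belong together when they share at least one identical intron. All names are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Intron = Tuple[int, int]  # (start, end) of an intron on one strand


def cluster_by_shared_introns(alignments: List[FrozenSet[Intron]]) -> List[Set[int]]:
    """Group alignment indices that are linked by an identical intron."""
    clusters: List[Set[int]] = []
    intron_sets: List[Set[Intron]] = []  # union of introns per cluster
    for index, introns in enumerate(alignments):
        # Find every existing cluster sharing an intron with this alignment.
        hits = [i for i, seen in enumerate(intron_sets) if seen & introns]
        members, merged_introns = {index}, set(introns)
        for i in reversed(hits):  # pop from the end so indices stay valid
            members |= clusters.pop(i)
            merged_introns |= intron_sets.pop(i)
        clusters.append(members)
        intron_sets.append(merged_introns)
    return clusters


ests = [
    frozenset({(100, 200)}),
    frozenset({(100, 200), (300, 400)}),
    frozenset({(300, 400)}),
    frozenset({(1000, 1100)}),
    frozenset({(105, 200)}),  # boundary differs by 5 bases, so not joined
]
print(cluster_by_shared_introns(ests))
```

Requiring an exact boundary match is what keeps the joining conservative in this sketch: the last alignment overlaps the first cluster but stays separate because its intron start differs.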

Summary
• Ensembl: Open source genome browser and annotation tool.
• Automatic genome annotation
  – Traditional genebuild
  – Low Coverage genebuild
  – Alternative genebuild
  – EST-based genebuild

Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino
VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
Zebrafish Annotation: Kerstin Howe, Mario Caccamo, Ian Sealy
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Fernández Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerk Vogel, Kevin Howe, Felix Kokocinski, Simon White
Comparative Genomics: Abel Ureta-Vidal, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Javier Herrero Sánchez, Albert Vilella
Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA), Matt Wood
Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
Distributed Annotation System (DAS): Andreas Kähäri, Eugene Kulesha
BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haldar
Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl