The Ensembl Gene Set: The “Genebuild”

21 April 2008

Outline

The GeneBuild (determining the Ensembl gene set)

What does it mean for the scientist? ‘Annotation pipeline’ vs ‘manual curation’

Pseudogenes

ncRNAs

The CCDS project

Introduction

What is available?

I) Sequence assemblies from genome sequencing efforts

Gene Sequencing: the Assembly

http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html

This approach generates clones, unlike newer sequencing methods.

Clones Available

H: tilepath clones (used in the assembly)

Ciona intestinalis: shotgun assembly

ContigView: Clones and Contigs

[Screenshot of ContigView with labelled tracks: contigs; clones (plate/well numbers); Ensembl transcripts]

Task:

View the tilepath clone in ContigView for the region containing the human BRCA2 gene.

Hint: Start with a search for the BRCA2 gene.
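
The same starting point, a search for the gene by name, can also be sketched programmatically. The snippet below is a minimal sketch that assumes the present-day public Ensembl REST service at rest.ensembl.org and its `lookup/symbol` endpoint, neither of which is part of these slides. The helper names `lookup_url`, `region_string` and `fetch_gene` are hypothetical.

```python
import json
import urllib.request

# Assumption: the public Ensembl REST service (not mentioned in the slides).
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """Build the REST URL that looks up a gene by its symbol."""
    return (
        f"{REST_SERVER}/lookup/symbol/{species}/{symbol}"
        "?content-type=application/json"
    )


def region_string(gene):
    """Format a lookup record as 'chromosome:start-end' for use in a region search."""
    return f"{gene['seq_region_name']}:{gene['start']}-{gene['end']}"


def fetch_gene(symbol, species="homo_sapiens"):
    """Fetch the lookup record for a gene symbol (needs network access)."""
    with urllib.request.urlopen(lookup_url(symbol, species)) as response:
        return json.load(response)
```

Calling `region_string(fetch_gene("BRCA2"))` should give the region to paste into the browser's location box. The tilepath clones themselves are still viewed in the browser.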

The Ensembl Geneset

How does Ensembl use mRNA and protein information along with the sequence assembly to define distinct genes on the genome?

[Diagram labels: protein; sequence assembly; Ensembl geneset]

Once the Assembly is Imported…

Proteins and mRNAs are aligned to the assembly.

These sequences have been submitted to databases such as UniProt/Swiss-Prot (manually curated) and RefSeq (partially manually curated).

The Biological Evidence

All Ensembl gene predictions are based on experimental evidence:

UniProt/Swiss-Prot: a manually curated database, and therefore of the highest accuracy

NCBI RefSeq: a partially manually curated database

UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features

EMBL / GenBank / DDBJ: primary nucleotide sequence repositories
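
One way to picture how the curation level of each source could be used is to sort candidate evidence so that the most heavily curated records come first. This is a minimal sketch, not the Genebuild's actual logic: the numeric ranks simply follow the order of the slide, and the accessions are hypothetical placeholders.

```python
# Curation descriptions come from the slide; the numeric ranks are an
# illustrative assumption (lower number = more heavily curated).
CURATION_RANK = {
    "UniProt/Swiss-Prot": 0,   # manually curated
    "NCBI RefSeq": 1,          # partially manually curated
    "UniProt/TrEMBL": 2,       # automatic translations of EMBL CDS features
    "EMBL/GenBank/DDBJ": 3,    # primary nucleotide sequence repositories
}


def rank_evidence(records):
    """Sort (accession, source) pairs so the most curated sources come first."""
    return sorted(records, key=lambda record: CURATION_RANK[record[1]])


# Hypothetical accessions, for illustration only.
evidence = [
    ("ACC_NUCLEOTIDE", "EMBL/GenBank/DDBJ"),
    ("ACC_TREMBL", "UniProt/TrEMBL"),
    ("ACC_SWISSPROT", "UniProt/Swiss-Prot"),
    ("ACC_REFSEQ", "NCBI RefSeq"),
]

for accession, source in rank_evidence(evidence):
    print(f"{source:20} {accession}")
```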

Database Relationship

[Diagram relating: individual labs’ submissions; EMBL-Bank, GenBank and DDBJ; NCBI RefSeq; UniProt (Swiss-Prot and TrEMBL)]

Genebuild

[Diagram labels: sequence (assembly); proteins (e.g. Swiss-Prot); mRNA; EST; EMBL-Bank, GenBank and DDBJ; manual annotation (HAVANA); EST genes; Ensembl]

Why do I want to know?…

What is an Ensembl gene based on?

Ensembl genes may be based on multiple proteins/mRNAs.

Task

Look at the evidence for the human EPO gene.

What was this gene based on?

Hint: Go to Exon Information from the GeneView page

EPO gene supporting evidence

Species-Specific GeneBuilds

Pan troglodytes genes are built by projection from human genes.

Zebrafish has many gene duplications.

Homo sapiens genes must have protein evidence, not just mRNA.

Task

When was the chimpanzee (Pan troglodytes) Genebuild performed?

Can you find information as to how genes were annotated?

Hint: Look on the chimpanzee index page

External Gene Set: VEGA/Havana

Human, zebrafish, mouse and dog

Havana transcripts in blue or gold…

What are Havana transcripts?

Havana and Ensembl match

When Havana (manual curation) and Ensembl (automatic methods) predict the same transcript, base pair for base pair, the transcripts are merged and coloured gold.
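
The merge rule above can be sketched as an exact comparison of two transcript models. This is a minimal sketch under stated assumptions: the `Transcript` class and the `merge_if_identical` function are hypothetical, a transcript is reduced to its exon coordinates on the same assembly, and the coordinates are made up for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Transcript:
    source: str                          # "havana" or "ensembl"
    chromosome: str
    strand: int                          # +1 or -1
    exons: Tuple[Tuple[int, int], ...]   # (start, end) pairs on the same assembly
    colour: Optional[str] = None


def merge_if_identical(havana: Transcript, ensembl: Transcript) -> Optional[Transcript]:
    """Return one merged, gold transcript if both models match exactly, else None."""
    same = (
        havana.chromosome == ensembl.chromosome
        and havana.strand == ensembl.strand
        and havana.exons == ensembl.exons
    )
    if not same:
        return None
    return replace(havana, source="havana+ensembl", colour="gold")


# Hypothetical coordinates, for illustration only.
manual = Transcript("havana", "7", 1, ((100, 200), (300, 450)))
automatic = Transcript("ensembl", "7", 1, ((100, 200), (300, 450)))
shifted = Transcript("ensembl", "7", 1, ((100, 200), (301, 450)))

print(merge_if_identical(manual, automatic))   # merged transcript, colour "gold"
print(merge_if_identical(manual, shifted))     # None: a single base of difference blocks the merge
```

Comparing exon coordinates is enough here because both models sit on the same assembly, so identical coordinates mean identical bases.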

Manually-curated gene sets in Ensembl

Vega (Havana): Homo sapiens, Danio rerio, Mus musculus and Canis familiaris

WormBase: Caenorhabditis elegans

FlyBase: Drosophila melanogaster

SGD: Saccharomyces cerevisiae

What Can Go Wrong?

I) A gap in the assembly

Gene might not be found in Ensembl

II) Fused genes

BLAST hit (Swiss-Prot entry)

Gene might be associated with two names

Outline

The genome sequence

The Genebuild

‘Manual curation’ by Havana

Other:

EST gene set

Pseudogenes

ncRNAs

Expressed Sequence Tags vs ‘cDNA’

ESTs are annotated separately. Why?

mRNA and cDNA used in the GeneBuild: sequenced to a high standard, often complete.

EST: lower-quality sequence.

‘One-shot’ sequencing of cDNA from the 5’ and 3’ ends creates the EST sequence.

ESTs are only 500-800 nucleotides long.

Low-quality fragments: sequence error of ~2%.

BUT ESTs confer useful expression information:

Discovery of new genes, especially in diseased organisms

Tissue type

Timing/developmental stage

Sampling of more transcripts and variants
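
To put the ~2% error rate in context, the expected number of miscalled bases in a single EST can be worked out from the figures above. This is a minimal sketch that treats errors as independent and uniformly distributed along the read, which is a simplifying assumption and not something the slides state.

```python
def expected_errors(length_nt, error_rate=0.02):
    """Expected number of miscalled bases in a read of the given length."""
    return length_nt * error_rate


def error_free_probability(length_nt, error_rate=0.02):
    """Chance that a read has no errors at all, assuming independent errors."""
    return (1.0 - error_rate) ** length_nt


for length in (500, 800):
    print(
        f"{length} nt: about {expected_errors(length):.0f} expected errors, "
        f"error-free probability {error_free_probability(length):.2e}"
    )
```

At 500-800 nucleotides this comes to roughly 10 to 16 expected errors per EST, which is one reason to keep ESTs out of the main gene set while still using them for expression information.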

Where Can I See This EST Geneset? ContigView

Choose EST genes

EST track

Pseudogenes: ‘False’ Genes

Unprocessed: produced by gene duplication and rearrangement

Processed: produced by reverse transcription of mRNA and re-integration into the genome

[Diagram: mRNA with poly-A tail (AAAAAA) and the resulting processed pseudogene (AAAAAA)]

ncRNAs (non-coding RNAs)

What types are in Ensembl?

tRNA (transfer RNA)

rRNA (ribosomal RNA)

scRNA (small cytoplasmic)

snRNA (small nuclear)

snoRNA (small nucleolar)

miRNA (microRNA)

ncRNAs (2 types)

I) RNA with low homology can be identified through conserved secondary structure (search the genome using an Rfam pattern)

II) High sequence conservation (miRNA)

BLAST alignment

‘RNAfold’ applied to make sure the sequences can fold (hairpin)
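
The idea behind the hairpin check can be illustrated with a toy test of whether the two ends of a candidate sequence can base-pair. This is only a sketch and not what a folding program does: a real folding tool predicts a minimum free energy structure, whereas the hypothetical `can_form_hairpin` below just checks a fixed-length stem, allowing Watson-Crick and G-U wobble pairs. The stem and loop lengths are arbitrary assumptions.

```python
# Allowed base pairs: Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_form_hairpin(sequence, stem=4, min_loop=3):
    """True if the first `stem` bases can pair with the last `stem` bases."""
    rna = sequence.upper().replace("T", "U")
    if len(rna) < 2 * stem + min_loop:
        return False  # too short to hold a stem and a loop
    five_prime_arm = rna[:stem]
    three_prime_arm = rna[-stem:][::-1]  # reversed, because the strands pair antiparallel
    return all((a, b) in PAIRS for a, b in zip(five_prime_arm, three_prime_arm))


# Toy sequences, for illustration only.
print(can_form_hairpin("GGGGAAAACCCC"))  # True: GGGG pairs with CCCC around an AAAA loop
print(can_form_hairpin("AAAAAAAAAAAA"))  # False: A cannot pair with A
```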

ncRNAs… where can I see them?

Find them in ContigView, or use BioMart.

Summary – Ensembl Genes

All Ensembl genes are based on biological evidence (protein and mRNA).

One Ensembl gene may come from proteins and mRNAs in various databases.

Havana (manually curated) genes are incorporated into the Ensembl geneset; for human, they are merged.

The CCDS set strives for consensus coding sequences across databases.

Pseudogenes and ncRNAs are annotated, along with a separate EST gene set.

For more on GeneBuild:

Help and Documentation

(About Ensembl)

http://www.ensembl.org/info/about/docs/genome_annotation.html