The Ensembl Gene set The “Genebuild” 21 April 2008.

Post on 27-Dec-2015



Outline

The GeneBuild (determining the Ensembl gene set)

What does it mean for the scientist? ‘Annotation pipeline’ vs ‘manual curation’

Pseudogenes

ncRNAs

The CCDS project

Introduction

What is available?

I) Sequence assemblies from genome sequencing efforts

Gene Sequencing - the Assembly

http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html

This approach generates clones, in contrast to newer sequencing methods.

Clones Available

H: tilepath clones (used in the assembly)

Ciona intestinalis: shotgun assembly

ContigView: Clones and Contigs

[Figure: ContigView display with tracks for contigs, clones (plate/well numbers) and Ensembl transcripts]

Task:

View the tilepath clone in ContigView for the region containing the human BRCA2 gene.

Hint: Start with a search for the BRCA2 gene.

The Ensembl Geneset

How does Ensembl use mRNA and protein information along with the sequence assembly to define distinct genes on the genome?

[Figure: protein, sequence assembly, Ensembl geneset]

Once the Assembly is Imported…

Proteins and mRNAs are aligned to the assembly.

These sequences have been submitted to databases such as UniProt/Swiss-Prot (manually curated) and RefSeq (partially manually curated).

The Biological Evidence

All Ensembl gene predictions are based on experimental evidence:

UniProt/Swiss-Prot: a manually curated database, and therefore of highest accuracy

NCBI RefSeq: a partially manually curated database

UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features

EMBL / GenBank / DDBJ: primary nucleotide sequence repositories
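
The curation levels listed for these databases suggest a simple way to order supporting evidence when inspecting a gene. The sketch below is illustrative only: the numeric ranks and the `best_evidence` helper are assumptions made for this example and do not describe the Ensembl pipeline itself.

```python
# Curation level per source, as described on the slide.
# The numeric ranks are an assumption for illustration (lower = more curated).
CURATION_RANK = {
    "UniProt/Swiss-Prot": 0,  # manually curated
    "NCBI RefSeq": 1,         # partially manually curated
    "UniProt/TrEMBL": 2,      # automatically annotated
}


def best_evidence(sources):
    """Return the most highly curated source name, or None if none is known."""
    known = [s for s in sources if s in CURATION_RANK]
    if not known:
        return None
    return min(known, key=CURATION_RANK.get)


print(best_evidence(["UniProt/TrEMBL", "NCBI RefSeq"]))
```

A reader checking the supporting evidence of a gene could use an ordering like this to decide which record to read first.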

Database Relationship

[Figure: diagram relating individual labs' submissions, EMBL-Bank / GenBank / DDBJ, NCBI RefSeq, and UniProt (Swiss-Prot and TrEMBL)]

Genebuild

[Figure: Genebuild overview with the following labels: sequence (assembly), proteins (e.g. Swiss-Prot), mRNA, EST, EMBL-Bank / GenBank / DDBJ, manual annotation (HAVANA), EST genes, Ensembl]

Why do I want to know?…

What is an Ensembl gene based on?

Ensembl genes may be based on multiple proteins/mRNAs.


Task

Look at the evidence for the human EPO gene.

What was this gene based on?

Hint: Go to Exon Information from the GeneView page


EPO gene supporting evidence

Species-Specific GeneBuilds

Pan troglodytes genes are built by projection from human genes.

Zebrafish has many gene duplications.

Homo sapiens genes must have protein evidence, not just mRNA.
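
The human rule (a gene needs protein evidence, not only mRNA) can be pictured as a filter over candidate gene models. This is a minimal sketch under stated assumptions: the `GeneModel` class, the evidence tuples and all names and accessions are hypothetical and are not taken from the Ensembl code base.

```python
from dataclasses import dataclass, field


@dataclass
class GeneModel:
    name: str
    # Each evidence item is a (kind, accession) pair; kind is "protein" or "mRNA".
    evidence: list = field(default_factory=list)


def has_protein_evidence(model):
    """True if at least one piece of supporting evidence is a protein."""
    return any(kind == "protein" for kind, _ in model.evidence)


def filter_human_models(models):
    """Keep only the models supported by protein evidence."""
    return [m for m in models if has_protein_evidence(m)]


# Hypothetical candidate models with placeholder accessions.
models = [
    GeneModel("model_A", [("protein", "prot_1"), ("mRNA", "mrna_1")]),
    GeneModel("model_B", [("mRNA", "mrna_2")]),
]
kept = filter_human_models(models)
print([m.name for m in kept])  # → ['model_A']
```

Note that one model can carry several pieces of evidence, which mirrors the earlier point that an Ensembl gene may be based on multiple proteins and mRNAs.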


Task

When was the chimpanzee (Pan troglodytes) Genebuild performed?

Can you find information as to how genes were annotated?

Hint: Look on the chimpanzee index page

External Gene Set: VEGA/Havana

Human, zebrafish, mouse and dog

Havana transcripts in blue or gold…

What are Havana transcripts?

Havana and Ensembl match

When Havana (manual curation) and Ensembl (automatic methods) predict the same transcript, base pair for base pair, the transcripts are merged and coloured gold.
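
The merge condition is an exact match, which can be illustrated by comparing exon coordinates. The sketch below assumes a transcript is represented as a list of (start, end) exon pairs; the function names and coordinates are hypothetical and chosen only for illustration.

```python
def transcripts_match(exons_a, exons_b):
    """True only if both transcripts have identical exon coordinates."""
    return sorted(exons_a) == sorted(exons_b)


def display_state(havana_exons, ensembl_exons):
    """Return how the pair would be shown under the exact-match rule."""
    if transcripts_match(havana_exons, ensembl_exons):
        return "gold (merged)"
    return "not merged"


# Hypothetical exon coordinates.
havana = [(100, 200), (300, 450)]
ensembl_same = [(100, 200), (300, 450)]
ensembl_diff = [(100, 200), (300, 451)]  # one base pair longer

print(display_state(havana, ensembl_same))
print(display_state(havana, ensembl_diff))
```

A difference of a single base pair in any exon boundary is enough to prevent the merge in this sketch.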

Manually-curated gene sets in Ensembl

Vega (Havana): Homo sapiens, Danio rerio, Mus musculus and Canis familiaris

WormBase: Caenorhabditis elegans

FlyBase: Drosophila melanogaster

SGD: Saccharomyces cerevisiae

What Can Go Wrong?

I) A gap in the assembly: the gene might not be found in Ensembl.

II) Fused genes: the gene might be associated with two names.

[Figure: fused gene example showing a BLAST hit (Swiss-Prot entry)]

Outline

The genome sequence

The Genebuild

‘Manual curation’ by Havana

Other:

EST gene set

Pseudogenes

ncRNAs

Expressed Sequence Tags vs ‘cDNA’

ESTs are annotated separately. Why?

mRNA and cDNA used in the GeneBuild: sequenced to a high standard, often complete.

EST: lower-quality sequence.

‘One shot’ sequencing of cDNA from the 5’ and 3’ ends creates the EST sequence. ESTs are only 500-800 nucleotides long and are low-quality fragments, with a sequence error rate of ~2%.

BUT ESTs confer useful expression information:

Discovery of new genes, especially in diseased organisms

Tissue type

Timing/developmental stage

Sampling of more transcripts and variants
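
To put the quoted EST figures in perspective, the expected number of erroneous bases per read follows directly from the read length and the ~2% error rate. The short calculation below uses only those two numbers; the function name is an assumption for this example.

```python
def expected_errors(length_nt, error_rate=0.02):
    """Expected number of erroneous bases in a read of the given length."""
    return length_nt * error_rate


# EST lengths quoted above: 500 to 800 nucleotides.
for length in (500, 800):
    print(length, "nt:", expected_errors(length), "expected errors")
```

At that rate a single EST would be expected to carry roughly 10 to 16 wrong bases, which helps explain why ESTs are kept apart from the higher-quality mRNA and cDNA evidence.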

Where Can I See This EST Geneset? ContigView

Choose EST genes

EST track

Pseudogenes: ‘False’ Genes

Unprocessed: produced by gene duplication and rearrangement

Processed: produced by reverse transcription of mRNA and re-integration into the genome

[Figure: an mRNA with a poly-A tail (AAAAAA) and the resulting processed pseudogene]

ncRNAs (non coding RNAs)

What types are in Ensembl?

tRNA (transfer RNA)

rRNA (ribosomal RNA)

scRNA (small cytoplasmic)

snRNA (small nuclear)

snoRNA (small nucleolar)

miRNA (microRNA)

ncRNAs (2 types)

I) RNAs with low sequence homology can be identified through conserved secondary structure (the genome is searched using Rfam patterns).

II) RNAs with high sequence conservation (miRNA) are found by BLAST alignment; ‘RNA fold’ is then applied to make sure the sequences can fold into a hairpin.
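
The idea behind the hairpin check can be shown with a toy example. The code below is not the ‘RNA fold’ program and performs no energy calculation: it is a simplified stand-in that only tests whether the 5’ arm of a sequence can base pair with the 3’ arm around a central loop. The function names, the loop size and the 80% threshold are assumptions made for this sketch.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_pairs(seq, loop=4):
    """Return (paired, stem_length) when seq folds back around a central loop."""
    seq = seq.upper().replace("T", "U")
    stem_len = (len(seq) - loop) // 2
    if stem_len <= 0:
        return 0, 0
    five_arm = seq[:stem_len]
    three_arm = seq[-stem_len:][::-1]  # read the 3' arm from the end inwards
    paired = sum((a, b) in PAIRS for a, b in zip(five_arm, three_arm))
    return paired, stem_len


def can_form_hairpin(seq, loop=4, min_fraction=0.8):
    """True if at least min_fraction of the stem positions can base pair."""
    paired, stem_len = stem_pairs(seq, loop)
    return stem_len > 0 and paired / stem_len >= min_fraction


print(can_form_hairpin("GGGAAACUUUUGUUUCCC"))  # stem GGGAAAC pairs with GUUUCCC
print(can_form_hairpin("A" * 18))              # no complementary arms
```

A real structure prediction also considers bulges, internal loops and folding energy, none of which this sketch attempts.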

ncRNAs… where can I see them?

Find them in ContigView, or use BioMart.

Summary – Ensembl Genes

All Ensembl genes are based on biological evidence (protein and mRNA).

One Ensembl gene may come from proteins and mRNAs in various databases.

Havana (manually curated) genes are incorporated into the Ensembl gene set; for human, they are merged with the Ensembl predictions.

The CCDS set strives for consensus coding sequences across databases.

Pseudogenes and RNAs are annotated, along with a separate EST gene set.

For more on GeneBuild:

Help and Documentation

(About Ensembl)

http://www.ensembl.org/info/about/docs/genome_annotation.html
