When is a genome finished?

May 11, 2015

Keith Bradnam

A retrospective look at the state of many famous modern genome sequences, and a cautionary tale about the dangers of assuming that a genome sequence and/or its annotations are finished.
Transcript
Page 1: When is a genome finished?

When is a genome finished?

Keith Bradnam

These slides and notes are licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

A talk given to the UC Davis Bits & Bites club, based on an earlier lecture I had given at UC Davis.

Keith Bradnam, March 2011

Page 2: When is a genome finished?

Part 1 - the sequence

We can think of ‘genome completion’ as referring to the sequence and/or the set of gene annotations. Let’s start with the sequence.

Page 3: When is a genome finished?

A brief history of genomics

1971 Wu & Taylor determine the first ever DNA sequence (all 12 bp of it!)

1977 Sanger et al. sequence the first ever (DNA-based) virus genome - 5,375 bp

1995 First complete bacterial genome sequence (Haemophilus influenzae) - 1.83 Mb

1996 First complete eukaryotic genome (Saccharomyces cerevisiae) - 12 Mb

1998 First animal genome (Caenorhabditis elegans) - 100 Mb

It took 18 years from the discovery of the structure of DNA before anyone could sequence any of it. The first DNA sequence was from the end of bacteriophage lambda (written up in a 20-page paper). The first genome was actually an RNA viral genome, determined in 1976 by Fiers et al. The 1980s and 1990s saw the start of widespread DNA sequencing for genes of interest in species of interest. Moving to eukaryotic genome sequencing means determining multiple chromosomes and tackling bigger repeats (more assembly problems).

Page 4: When is a genome finished?

Figure: bar chart of complete and incomplete genome projects for Bacteria, Archaea and Eukaryotes (y-axis 0 to 6,000). Source: genomesonline.org. Totals labelled on the slide: 3,077 and 7,732.

Genomesonline.org tries to track all of the major genome projects out there. A lot of them are flagged as incomplete, and maybe some of those will never reach ‘completion’ status.

Page 5: When is a genome finished?

CAP criteria

Sydney Brenner

1) Complete
2) Accessible
3) Permanent

The great biologist may have won a Nobel prize for his work on development, he may have postulated the very existence of mRNA, and he may have co-discovered the triplet code ... but he also came up with the CAP criteria.

These criteria could pertain to any large-scale academic project, but they were conceived with reference to genome sequencing projects.

Page 6: When is a genome finished?

2000 - ‘working draft’ announced

2001 - ‘working draft’ published

2003 - ‘Finished’ version announced

2006 - Last chromosome finished

Ns make up ~9% of current genome

Homo sapiens

So it’s finished now right?

The human genome has been finished on several different dates, depending on how you define ‘finished’. Ns (unknown bases) still account for 9% of the 3.1 Gbp genome.
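As a rough illustration of how a figure like the 9% above could be checked, here is a minimal sketch that counts N characters in FASTA-formatted text. The function name `n_fraction` and the toy sequence are made up for illustration and are not from the talk; a real genome file would be streamed from disk rather than held in one string.

```python
def n_fraction(fasta_text: str) -> float:
    """Return the fraction of bases that are N (unknown) in FASTA text."""
    total = 0
    unknown = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        # Skip blank lines and header lines (which start with '>')
        if not line or line.startswith(">"):
            continue
        total += len(line)
        unknown += line.upper().count("N")
    return unknown / total if total else 0.0


# Toy example (not real data): 20 bases, 4 of them unknown
toy = ">chr_example\nACGTNNACGT\nACGTNNACGT\n"
print(f"{n_fraction(toy):.1%}")  # prints 20.0%
```

Note that this only measures Ns inside the downloaded sequence; it says nothing about parts of the genome that are missing from the file altogether.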

Page 7: When is a genome finished?

So it’s finished now right?

2000 - genome published

~175 Mb genome

Drosophila melanogaster

Ns make up ~4% of current genome

Drosophila has a much smaller genome, but a third of it is the harder-to-sequence heterochromatin. This was the subject of a separate genome project that didn’t finish until 2007.

The genome still has many Ns.

Page 8: When is a genome finished?

Published 2000

115 Mb sequenced, 125 Mb genome

As of 2007... 119 Mb sequenced, 157 Mb genome

Arabidopsis thaliana

Ns make up ~0.2% of current genome

As of 2012... 119 Mb sequenced, 135 Mb genome

Many published genome sizes are based on estimates, which can be wrong. As more and more of the Arabidopsis genome was sequenced, the estimate of its size had to be revised. So between 2000 and 2007 more sequence was produced but, paradoxically, the genome became less complete.

This illustrates the difficulty of estimating genome size. The latest figures suggest that the genome is smaller again. Note that much of this missing genome is not present as Ns in the sequence you download. But the part you can download still has many unknown bases.
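The paradox is easy to see with a little arithmetic. The sketch below uses only the three pairs of numbers quoted on the slide; the function name `percent_complete` is just an illustrative choice.

```python
# (year, Mb sequenced, estimated genome size in Mb), as quoted on the slide
arabidopsis = [
    (2000, 115, 125),
    (2007, 119, 157),
    (2012, 119, 135),
]


def percent_complete(sequenced_mb: float, estimated_mb: float) -> float:
    """Sequenced fraction of the estimated genome size, as a percentage."""
    return 100.0 * sequenced_mb / estimated_mb


for year, seq, est in arabidopsis:
    print(f"{year}: {seq} Mb of {est} Mb = {percent_complete(seq, est):.1f}% complete")
# 2000: 115 Mb of 125 Mb = 92.0% complete
# 2007: 119 Mb of 157 Mb = 75.8% complete
# 2012: 119 Mb of 135 Mb = 88.1% complete
```

The amount of sequence never went down, but the apparent completeness swings by more than 15 percentage points purely because the denominator is an estimate.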

Page 9: When is a genome finished?

1998 - ‘finished’ genome published

Genome size: 97 Mb, revised to 100 Mb

2002 - last gap closed

Caenorhabditis elegans

Genome information for species such as C. elegans is curated by model organism databases (MODs), which ensure that the work goes on long after the initial publication announcing a ‘finished’ genome.

Genome size was revised from 97 Mb to 100 Mb not long after publication.

Page 10: When is a genome finished?

Where’s my gene???

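Figure: one end of C. elegans chromosome X, with regions labelled by the year they were finished (1997, 2000, 2001, 2002).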

People will often know from traditional genetic experiments that their gene of interest is definitely present in a genome... however, it might not be present in the published genome sequence. The figure shows when regions at one end of C. elegans chromosome X were finished. The last 20 kbp region wasn’t finished until four years after the genome was published in 1998. This region contained predicted genes... maybe scientists were working on these genes while waiting for the sequence.

Page 11: When is a genome finished?

Caenorhabditis elegans

So it’s finished now right?

1998 - ‘finished’ genome published

Genome size: 97 Mb, revised to 100 Mb

2002 - last gap closed

2004 - last N removed

Unlike the previous genomes, C. elegans has no Ns (but this took 6 years after publication to achieve).

Page 12: When is a genome finished?

Worm genome progress

Figure: C. elegans genome size in bp (y-axis 0 to 100,000,000) plotted against date (Jan-91 to May-06).

At a gross level, it looks like the worm genome did not change much after the year 2000....

Page 13: When is a genome finished?

Worm genome progress

Figure: C. elegans genome size in bp (y-axis 100,220,000 to 100,280,000) plotted against date (Sep-01 to Oct-05). Annotation on the slide: 66 nt added May 2010.

Here is a zoomed-in view of the years 2001–2005: there are still lots of sequence changes happening. The last change on this graph represents a very small addition of 66 bp to the genome. Maybe this change will not make any difference to anyone in the world, but it still makes the genome sequence more accurate and closer to the biological truth.

Not many genome projects are this devoted!

Page 14: When is a genome finished?

Published 1997

12 Mb genome

No gaps, no Ns

So it’s finished now right?

Saccharomyces cerevisiae

1,653 genome changes made since 1997

Last change made in February 2011

Like C. elegans, yeast is a species which benefits from coordinated efforts to finish the genome.

In February 2011, the yeast genome sequence underwent corrections that affected 194 proteins. This happened in a – by today’s standards – tiny genome which has been studied and curated for 15 years! What hope for larger, more complex genomes?

Page 15: When is a genome finished?

Part 2 - annotations

Maybe you don’t care about the state of the genome, as long as you have all of the genes present.

Page 16: When is a genome finished?

C. elegans annotations

Figure: numbers of genes and proteins (y-axis 19,000 to 25,000) by year (1998, then 2003 to 2011), with the genome publication marked.

Since publication, the number of protein-coding loci in C. elegans has risen by about 1,500 genes. But the number of proteins that might arise from alternatively spliced products is much, much higher and shows no signs of slowing down.

Page 17: When is a genome finished?

C. elegans annotations

Figure: numbers of genes, proteins and RNA genes (y-axis 0 to 25,000) by year (1998, then 2003 to 2011), with the genome publication marked.

When we consider RNA genes, it is surprising that there are now more RNA genes than protein-coding genes. How many more species have similar secrets in their genomes that have yet to be discovered, mostly because of our historical focus on protein-coding genes?

Page 18: When is a genome finished?

Core genes

You can identify ‘core’ genes that are highly conserved and that should be present in all species.

Our group identified a set of 458 core genes from 6 reference genomes:

Homo sapiens
Caenorhabditis elegans
Drosophila melanogaster
Arabidopsis thaliana
Saccharomyces cerevisiae
Schizosaccharomyces pombe

We can then test whether these are all present in any ‘finished’ genome.

Our lab developed a set of 458 ‘core genes’ that we believe should be present in every (complete) eukaryotic genome.

In the past we’ve discovered that many published genomes are missing some of these genes from the genome sequence, even though they should be there. E.g. chicken has missing core genes even though those genes are represented by chicken EST sequences.

Page 19: When is a genome finished?

Ciona intestinalis

Version   N50         Core genes
v1.95     234,500     444
v2.0      2,571,800   425

Sometimes genomes get updates and assemblies are given a new version number. This might be associated with an increase in average scaffold size, but sometimes the number of core genes gets reduced.
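To make the trade-off concrete, here is a small sketch that compares the two Ciona intestinalis assemblies using the numbers from the table and the 458-gene core set. The `Assembly` class and `compare` function are illustrative names only, and treating N50 as a length in bp is an assumption, since the slide gives no unit.

```python
from dataclasses import dataclass

CORE_GENE_TOTAL = 458  # size of the core gene set described in the talk


@dataclass
class Assembly:
    version: str
    n50: int         # N50 as given in the table (bp is an assumption)
    core_genes: int  # number of core genes found in this assembly


def compare(old: Assembly, new: Assembly) -> str:
    """Summarise how N50 and core gene content changed between versions."""
    n50_fold = new.n50 / old.n50
    old_pct = 100 * old.core_genes / CORE_GENE_TOTAL
    new_pct = 100 * new.core_genes / CORE_GENE_TOTAL
    return (f"{old.version} -> {new.version}: N50 x{n50_fold:.1f}, "
            f"core genes {old_pct:.1f}% -> {new_pct:.1f}%")


ciona_old = Assembly("v1.95", 234_500, 444)
ciona_new = Assembly("v2.0", 2_571_800, 425)
print(compare(ciona_old, ciona_new))
# v1.95 -> v2.0: N50 x11.0, core genes 96.9% -> 92.8%
```

An eleven-fold gain in N50 looks like a clear improvement, yet 19 core genes that were found in the older version are no longer found in the newer one.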

Page 20: When is a genome finished?

Caenorhabditis sp. PS1010

Version   N50      Core genes
v4        9,446    454
v5        64,074   428

It is easy to measure things like N50, but harder to measure which genes are present (though people can use our free CEGMA tool!).
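For readers who have not met N50 before, here is a minimal sketch of the usual definition: sort the contig or scaffold lengths from longest to shortest, and report the length at which the running total first reaches half of the assembly. The function name and the toy lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


toy_lengths = [100, 80, 60, 40, 20]  # toy data, total 300
print(n50(toy_lengths))  # prints 80
```

The calculation needs nothing but sequence lengths, which is why it is so easy to report, and also why it cannot tell you whether any particular gene made it into the assembly.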

Page 21: When is a genome finished?

S. cerevisiae

Genome sequence corrections in Feb 2011 changed 194 protein sequences.

Last correction to gene structure due to mis-annotation was in Jan 2010

So just 13 years to produce a stable gene set!

Even in a simpler genome, the work of annotation goes on.

Bear in mind that many model organism databases split genes into different categories based on evidence.

Page 22: When is a genome finished?

Conclusions

Page 23: When is a genome finished?

‘Finished’ eukaryotic genome sequences are not finished!

except maybe yeast

Not that this necessarily matters. 1% of a genome is better than no genome at all. At some level, the law of diminishing returns sets in. Ideally, we could produce a metric of ‘useful papers published per person-hour of curator time spent on a model organism database’.

Just be aware that the genome you download today may change in the future, and your results might not be easily reproducible by someone using a different version.
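One way to guard against this, offered here as a suggestion rather than anything from the talk, is to record exactly which file you analysed. The sketch below stores a SHA-256 checksum alongside a source and version label. The function name, the record fields, and the labels `example-database` and `release-X` are all hypothetical placeholders.

```python
import hashlib
import json
import os
import tempfile
from datetime import date


def genome_provenance(path, source, version, chunk_size=1 << 20):
    """Describe a downloaded genome file so others can confirm they have the same one."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large genome files need not fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return {
        "file": os.path.basename(path),
        "source": source,    # hypothetical label, e.g. the database name
        "version": version,  # hypothetical label, e.g. a release identifier
        "sha256": sha.hexdigest(),
        "retrieved": date.today().isoformat(),
    }


# Toy demonstration: a temporary file stands in for a real download
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
    tmp.write(b">chr_example\nACGTACGT\n")
record = genome_provenance(tmp.name, source="example-database", version="release-X")
os.remove(tmp.name)
print(json.dumps(record, indent=2))
```

Publishing a record like this with your results lets someone else check whether a difference in their output comes from a different genome version rather than a different method.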

Page 24: When is a genome finished?

CAP criteria

Sydney Brenner

1) Complete
2) Accessible
3) Permanent

Clearly they are not all complete.

As for accessibility, it is not always easy to get hold of large datasets. Bandwidth is a particular problem (it can be almost impossible to download GenBank from east coast to west coast using FTP). Also, online journals often end up breaking links to supplemental material.

For the most part, they are permanent, though the raw, unassembled read data are not always kept.

Page 25: When is a genome finished?

The End