Genome Biology and Biotechnology


3. The genome structures of vertebrates

Prof. M. Zabeau
Department of Plant Systems Biology
Flanders Interuniversity Institute for Biotechnology (VIB)
University of Gent

International course 2005

The Genome Sequences of vertebrates

¤ Fish genomes: “compact” vertebrate genomes
  – Fugu rubripes (2002)
  – Tetraodon nigroviridis (2004)
¤ Bird genome: interesting evolutionary intermediate
  – Chicken - Gallus gallus (2004)
¤ Rodent genomes: the model organisms for the human
  – Mouse - Mus musculus (2002)
  – Rat - Rattus norvegicus (2004)
¤ Primate genomes: our closest relatives
  – Chimpanzee
¤ Human genome
  – Draft genome sequence (2001)
  – Finished genome sequence (2004)

Vertebrate evolution

[Figure: vertebrate evolutionary tree with divergence times of 310 MY and 450 MY marked. Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)]

Whole-Genome Shotgun Assembly and Analysis of the Genome of Fugu rubripes

¤ Paper presents
  – Low-quality draft genome sequence of Fugu rubripes
  – the sequence provided a valuable reference for annotating the human and mouse genomes
    • Small genome (350 Mb versus 3000 Mb)
    • low repetitive DNA content

Aparicio et al., Science, 297, 1301-1310 (2002)

The Fugu Genome Sequence

¤ The draft sequence covers a total of 332.5 Mb
  – Highly fragmented sequence (~30 Mb unassembled sequences)
  – The total genome size is estimated at ~365 Mb
¤ The number of predicted genes: 31,059
  – similar to the number of human genes predicted from the draft sequence
¤ Repetitive sequences
  – Density of <15%, far below the 35 to 45% observed in mammals
    • Transposable elements are still very active


Protein-coding genes

¤ The gene-containing fraction is ~108 Mb (30%)
  – The average gene density: one gene per 10.9 kb
  – The Fugu genome is compact because introns are shorter than in the human genome
    • Genome contains ~500 large introns (>10 kb), compared with >12,000 large introns in human
    • Genes are scaled in proportion to the compact genome size
  – The number of introns is roughly the same as in human
    • Both gain and loss of introns in the Fugu lineage are observed
¤ The compactness of the Fugu genome is accounted for by
  – Low abundance of repeated sequences
  – The small size of introns and intergenic regions


Comparison of Fugu and Human Proteomes

¤ 75% of predicted human proteins have a strong match to Fugu

Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype

¤ Paper presents
  – High-quality draft genome sequence of Tetraodon nigroviridis, with long-range linkage and chromosome anchoring
    • freshwater puffer fish with the smallest known vertebrate genome

Jaillon et al., Nature 431, 946 - 957 (2004)

The Tetraodon genome sequence

¤ The draft genome sequence (8.3x coverage) spans 342 Mb
  – Largest scaffolds were mapped onto the chromosomes
    • Draft is much less fragmented than that of Fugu
¤ Genome landscape
  – Transposable elements are very rare (<4,000 copies)
    • Fewer than in Fugu (15% of the genome)
¤ Estimated 20,000–25,000 protein-coding genes
  – Very similar to the recent (2004) human gene count
  – Much lower than reported for Fugu (the current Fugu estimate is also lower)
  – Gene ontology (GO) classifications show only subtle differences between fish and mammals
    • Improved fish gene catalogue aids human gene predictions


Evidence For Whole-genome Duplication

¤ Duplicated genes cluster on paralogous chromosomes
  – paralogous chromosomes arising from whole-genome duplication each contain one member of duplicated gene pairs in the same order


Evidence For Whole-genome Duplication

¤ Blocks of doubly conserved synteny
  – The synteny map typically associates two regions in Tetraodon with one region in human
  – Figure legend: Tni = Tetraodon; Hsa = human
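
The pattern described above can be illustrated with a small sketch. It assumes a hypothetical input format, a list of (human gene, Tetraodon chromosome) pairs ordered along one human region, and reports whether the orthologues fall mainly on two Tetraodon chromosomes that alternate along the region. The function name, the threshold and the toy data are illustrative assumptions, not the method used by Jaillon et al.

```python
from collections import Counter

def doubly_conserved_synteny(human_region, min_genes_per_chrom=2):
    """Summarise how one human region maps onto Tetraodon chromosomes.

    human_region: list of (human_gene, tetraodon_chromosome) pairs ordered
    along the human chromosome; the chromosome is None when no orthologue
    was found. Returns None unless two chromosomes each carry at least
    min_genes_per_chrom orthologues.
    """
    counts = Counter(chrom for _, chrom in human_region if chrom is not None)
    top_two = counts.most_common(2)
    if len(top_two) < 2:
        return None
    (chrom_a, n_a), (chrom_b, n_b) = top_two
    if n_b < min_genes_per_chrom:
        return None
    # Count how often consecutive orthologues switch between the two
    # chromosomes: many switches means the two copies are interleaved.
    hits = [c for _, c in human_region if c in (chrom_a, chrom_b)]
    switches = sum(1 for x, y in zip(hits, hits[1:]) if x != y)
    return {"chromosomes": (chrom_a, chrom_b),
            "counts": (n_a, n_b),
            "switches": switches}

# Toy data (hypothetical gene and chromosome names).
toy_region = [("g1", "Tni1"), ("g2", "Tni7"), ("g3", "Tni1"),
              ("g4", None), ("g5", "Tni7"), ("g6", "Tni1")]
result = doubly_conserved_synteny(toy_region)
print(result)
```

A region whose orthologues all sit on a single Tetraodon chromosome returns None, which corresponds to ordinary (single) conserved synteny.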


Ancestral genome of bony vertebrates

¤ The patterns of doubly conserved synteny are consistent with
  – 12 ancestral chromosomes, which have rearranged to form the present-day chromosomes of human and fish

[Figure panels: Human, Fish]

Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution

¤ Paper presents
  – a draft genome sequence of the red jungle fowl Gallus gallus
  – The first genome of a non-mammalian amniote
    • provides a new perspective on vertebrate genome evolution
  – The evolutionary distance between chicken and human provides an excellent signal-to-noise ratio to detect functional elements
    • 310 MY since the divergence of birds and mammals

International Chicken Genome Sequencing Consortium, Nature 432, 695 - 716 (2004)


The chicken genome sequence

¤ The draft genome sequence (6.6x coverage) spans 1,050 Mb
  – Draft represents ~96% of the euchromatic part of the genome
  – 23,212 chicken mRNAs and 485,000 ESTs
¤ Chicken genome is 3x smaller than mammalian genomes, reflecting substantially fewer
  – interspersed repeats
    • transposable elements make up <9% of the genome, markedly lower than the 40–50% observed in mammalian genomes
  – Pseudogenes
    • 51 retrotransposed genes vs. >15,000 in mammalian genomes
  – segmental duplications
    • Limited to very small (<10 kb) intrachromosomal duplications


Gene content of the chicken genome

¤ Protein-coding genes
  – Predict 20,000 to 23,000 protein-coding genes
  – Matches the current (2004) estimate for mammalian genomes
¤ Non-coding RNA genes
  – 571 ncRNA genes from >20 gene families
    • Fewer than in human: in human, many ncRNA genes are pseudogenes
  – Syntenic relationships for non-coding RNA genes differ from those of protein-coding genes
    • implies a novel mode of evolution for some ncRNA genes
    • Only certain ncRNA genes are in regions of conserved synteny
      – microRNAs (miRNAs) and small nucleolar RNAs (snoRNAs) found in introns of protein-coding genes

Evolutionary conservation of gene components

¤ Sequence conservation of chicken and human orthologs
  – highest in protein-coding exons
  – minimal in introns
  – Significant in the 5' and 3' flanking and untranslated regions

[Figure labels: 5' UTR, exon, 3' UTR]



Conservation of vertebrate protein content

¤ 60% of chicken genes have a single human orthologue
  – also have a single orthologue in the Fugu genome
  – Represent a conserved core present in most vertebrates

Conserved core orthologues in vertebrates

¤ Core orthologues conserved in vertebrates have
  – Highly conserved protein sequences, indicating that
    • They have been subject to purifying selection



Expansion of multigene families

¤ Expansion and contraction of multigene families were
  – major factors in the independent evolution of mammals and birds


Chromosomal dynamics in the vertebrates

¤ Maps of conserved synteny: orthologous chromosomal segments with conserved gene order show
  – slow rate of rearrangement in the human lineage
    • 3-fold higher rate in the rodent lineage
  – The human genome is closer to the chicken in terms of synteny


Ancestral mammalian genome

¤ Long blocks of conserved chicken–human synteny
  – Entire chromosomes
¤ Genome rearrangements
  – Many intrachromosomal rearrangements
  – Few translocations between chromosomes
  – Chicken has a number of micro-chromosomes


Conserved sequences in chicken and human

¤ High substitution rates between human and chicken
  – Can be used to detect functionally conserved sequences
¤ 70 Mb (2.5%) of human sequence aligns with chicken
  – 44% is in protein-coding regions (exons)
  – 56% is non-coding: intronic (25%) and intergenic (31%)
¤ Conserved non-coding segments occur clustered and far from genes
  – Identified 57 segments with an average length of 1.1 Mb
    • gene poor, G+C poor and have no interspersed repeats
  – the functional significance of these sequences is completely unknown
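
As a quick consistency check on the alignment breakdown above, the following sketch converts the stated shares into megabases. It takes only the slide's figures as input (70 Mb aligned; 44% exonic, 25% intronic, 31% intergenic); the intronic and intergenic shares add up to 56% non-coding, so the three categories account for the whole aligned sequence. Variable names are illustrative.

```python
ALIGNED_MB = 70.0  # human sequence that aligns with chicken (from the slide)

shares = {
    "protein-coding exons": 0.44,
    "intronic": 0.25,
    "intergenic": 0.31,
}

# The three categories should account for all of the aligned sequence.
total_share = sum(shares.values())

# Non-coding share = intronic + intergenic.
noncoding_share = shares["intronic"] + shares["intergenic"]

# Convert each share into megabases of aligned human sequence.
aligned_mb_by_class = {name: round(ALIGNED_MB * s, 1) for name, s in shares.items()}

for name, mb in aligned_mb_by_class.items():
    print(f"{name}: {mb} Mb")
print(f"non-coding share: {noncoding_share:.0%}")
```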


Conclusion

¤ The chicken genome sequence
  – is a key resource for comparative genomics
    • to distinguish derived or ancient features of mammalian biology
      – mammalian innovation and adaptation
    • conserved non-coding sequences in particular
  – Provides a framework for discovering the functional polymorphisms underlying
    • interesting quantitative traits to further exploit the genetic potential of the chicken

Initial sequencing and comparative analysis of the mouse genome

¤ Paper presents
  – Draft genome sequence of the mouse
  – comparative analysis of the mouse and human genomes
    • 75 MY since the divergence of rodents and primates
    • The two genome sequences diverge by nearly one substitution for every two nucleotides
  – the insights that can be gleaned from the two sequences

Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)

The Mouse Genome Project

¤ The laboratory mouse is an experimental model system
  – for studying human disease and mammalian biology
¤ The Mouse Genome Project
  – International collaboration of centres in the US and the UK
  – Adopted a mixed strategy for the draft genome sequencing
    • a BAC-based physical map of the mouse genome
    • The initial draft genome sequence was generated by
      – WGS sequencing to ~7-fold coverage
      – Hierarchical shotgun sequencing of BAC clones
  – The finished sequence should be completed in 2005
    • using the BAC clones for directed finishing


The draft mouse genome sequence

¤ The euchromatic mouse genome is estimated at ~2.5 Gb
  – The draft genome sequence covers ~96% of the genome
¤ Generation of the draft genome sequence
  – Sequencing
    • 41.4 million paired-end sequence reads derived from various clone types
  – Assembly
    • represents ~7.7-fold sequence coverage
    • 224,713 sequence contigs
    • total of 7,418 supercontigs
    • The 200 largest supercontigs span more than 98% of the assembled sequence
  – Anchoring to chromosomes
    • Anchored all supercontigs >500 kb with the mouse genetic map
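
The relationship between read count, read length and fold coverage is simple enough to sketch: coverage = (number of reads x mean read length) / genome size. The back-of-envelope calculation below solves this for the mean read length implied by the slide's figures. It assumes, as a simplification not stated in the source, that all 41.4 million reads contribute and that the ~2.5 Gb euchromatic genome is the denominator, so the result is only a rough indication.

```python
def fold_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Average number of times each base is sequenced: N * L / G."""
    return n_reads * mean_read_len_bp / genome_size_bp

def implied_read_length(coverage, n_reads, genome_size_bp):
    """Solve coverage = N * L / G for the mean read length L."""
    return coverage * genome_size_bp / n_reads

# Figures taken from the slide.
N_READS = 41.4e6    # paired-end sequence reads
GENOME_BP = 2.5e9   # estimated euchromatic mouse genome
COVERAGE = 7.7      # reported fold coverage

read_len = implied_read_length(COVERAGE, N_READS, GENOME_BP)
print(f"implied mean read length: {read_len:.0f} bp")
```

A mean read length of a few hundred bases is what one would expect for the capillary sequencing of that period, so the three reported numbers are mutually consistent under these assumptions.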


The draft mouse genome sequence

¤ The euchromatic mouse genome is estimated at ~2.5 Gb
  – The draft genome sequence covers ~96% of the genome
¤ Comparative analysis of human and mouse genomes
  – The mouse genome is about 14% smaller than the human genome
  – High degree of synteny
    • >90% of the two genomes can be partitioned into corresponding regions of conserved synteny
  – At the nucleotide level, approximately 40% of the human genome can be aligned to the mouse genome
    • these alignments represent orthologous sequences conserved from the common ancestor


Synteny between mouse and human

¤ Regions containing orthologous sequence pairs define
  – Syntenic segments as regions in which
    • Orthologous sequence pairs are in the same order on a chromosome in both species
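
The definition above can be turned into a minimal sketch: walk along the human genome and start a new segment whenever the next orthologous pair changes chromosome in either species or breaks the running order in mouse. The tuple layout and the toy coordinates are assumptions made for illustration; the published analysis additionally tolerates small local disruptions and applies a minimum segment size, which this sketch does not.

```python
def syntenic_segments(anchors):
    """Group orthologous pairs into collinear segments.

    anchors: list of (human_chrom, human_pos, mouse_chrom, mouse_pos).
    A segment is a run of anchors, consecutive along the human chromosome,
    that stay on one mouse chromosome and keep a single orientation
    (mouse positions all increasing or all decreasing).
    """
    anchors = sorted(anchors, key=lambda a: (a[0], a[1]))
    segments, current, direction = [], [], 0
    for a in anchors:
        if current:
            prev = current[-1]
            same_chroms = a[0] == prev[0] and a[2] == prev[2]
            step = (a[3] > prev[3]) - (a[3] < prev[3])  # +1, -1 or 0
            if same_chroms and step != 0 and direction in (0, step):
                direction = step
                current.append(a)
                continue
            # Order or chromosome changed: close the running segment.
            segments.append(current)
            current, direction = [], 0
        current.append(a)
    if current:
        segments.append(current)
    return segments

# Toy anchors (hypothetical coordinates).
toy_anchors = [
    ("1", 100, "4", 500), ("1", 200, "4", 600), ("1", 300, "4", 700),
    ("1", 400, "11", 50), ("1", 500, "11", 40),   # inverted run on mouse chr 11
    ("2", 100, "4", 900),
]
segments = syntenic_segments(toy_anchors)
print([len(s) for s in segments])
```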


Synteny between mouse and human

¤ Conservation of orthologous sequence pairs shows
  – Each genome can be parsed into a total of 342 conserved syntenic segments
    • The segments vary greatly in length, from 303 kb to 64.9 Mb
    • In total, about 90.2% of the human genome and 93.3% of the mouse genome reside within conserved syntenic segments
  – The segments can be aggregated into a total of 217 conserved syntenic blocks
¤ The syntenic block and segment sizes are
  – consistent with the random breakage model of genome evolution
  – the minimal number of rearrangements needed to 'transform' one genome into the other is 295


[Figure: Blocks of conserved synteny in the human and mouse genomes]


Repetitive sequences in human and mouse

¤ The most prevalent feature of mammalian genomes is their high content of repetitive sequences
  – Most of which are interspersed repeats representing 'fossils' of transposable elements
¤ The repetitive sequences in mouse and human differ
  – Only 37.5% of the mouse genome is transposon-derived
  – ~46% of the human genome is transposon-derived
  – Insertions of transposable elements occurred in the last 150–200 million years
    • The most notable difference is the rate of transposition over time
      – in mouse the rate has remained fairly constant
      – in human the rate increased to a peak at ~40 Myr, and then plummeted


[Figure: Age distribution of interspersed repeats in the mouse and human genomes; panels: Human, Mouse]


Protein-coding genes in mouse and human

¤ Human and mouse gene catalogues
  – The current human gene catalogue (Ensembl build 29) contains 22,808 predicted genes
  – The current mouse gene catalogue contains 22,011 predicted genes
¤ Comparative analysis of protein-coding genes shows
  – 80% of the mouse genes have orthologues in the human genome
    • The proportion of mouse/human genes without any homologue is <1%
  – Many local gene family expansions have occurred in the mouse lineage
    • Most seem to involve genes for reproduction, immunity and olfaction
  – The rate of protein evolution
    • Most proteins evolve at a fairly constant rate
    • Certain proteins evolve much more rapidly: positive selection
      – Proteins implicated in reproduction, host defence and immune response seem to be under positive selection, which drives their rapid evolution

Conclusions

¤ The mouse genome provides a powerful resource to unravel the secrets of the human genome
  – Demonstrates the power of comparative genomics in identifying relevant genetic elements
  – These findings inspired additional animal genome sequencing projects to fully exploit the power of comparative genomics
    • As illustrated in: Thomas et al., Nature 424, 788 - 793 (2003)
  – The sequence provides a comprehensive framework for functional genomics approaches to unravel gene functions in both human and mouse

Genome sequence of the Brown Norway rat yields insights into mammalian evolution

¤ Paper presents
  – a high-quality 'draft' sequence covering >90% of the genome
  – a three-way comparison with the human and mouse genomes to study mammalian genome evolution
    • Rat - mouse common ancestor: 12–24 Myr
    • Rodent - human common ancestor: 75 Myr

RGSPC, Nature 428, 493 - 521 (2004)

The rat genome sequence

¤ The draft genome sequence covers 2.75 Gb
  – A 'combined' sequencing strategy using
    • WGS sequencing and light sequence coverage of BACs
    • Sequential assembly of 'enriched BACs' (eBACs) joined into bactigs, superbactigs and ultrabactigs


Rat – mouse – human genome sequences

¤ Sequence elements in human, mouse and rat genomes
  – 40% align in all 3 species
    • 'ancestral core' of 1 Gb
    • 95% of the exons and regulatory regions
  – 28% aligns only with mouse
    • rodent-specific repeats
  – 29% does not align
    • rat-specific repeats


Evolution of genes

¤ Estimate that 90% of rat genes possess
  – strict orthologues in both mouse and human
    • Intronic structures are well conserved
  – Most of the non-orthologous genes
    • Arose by expansions of gene families in the different lineages
    • Rapidly evolving genes
  – Rat-specific genes comprise novel genes for “life style”
    • pheromones, immunity, chemosensation, detoxification, proteolysis


Rat – mouse – human synteny

¤ orthologous chromosome segments
  – 105 mouse–rat segments
  – 278 human–rat segments
  – 280 human–mouse segments


Rat – mouse – human genome rearrangements

¤ Reconstruction of the ancestral mammalian genome
  – Identified a total of 353 rearrangements
    • 247 between the murid ancestor and human
    • 50 from the murid ancestor to mouse
    • 56 from the murid ancestor to rat
  – much higher (3x) rearrangement rate in the rodent than in the human lineage

[Figure: tree with branch labels 247, 50 and 56]

The Human Genome

¤ The human genome project was launched in 1990
  – Phase I: generation of genetic and physical maps (1990-1995)
    • Demonstration that large scale sequencing is feasible: yeast, worm
  – Phase II: large scale sequencing (1995-2005)
    • Pilot phase: finished sequence with 99.99% accuracy and no gaps of the human chromosomes 21 and 22 (published in 2000 and 1999, respectively)
    • Draft phase: draft sequence covering >90% of the genome completed in June 2000 (published in 2001) – took ~1 year
    • Finishing phase: “finished” sequence covering 99% of the genome, completed in spring 2004 – took ~3 years
    • Aftermath: no end point projected
      – closing the last couple of hundred gaps
      – Sequencing the centromeres

Reprinted from: International Human Genome Sequencing Consortium, Nature 409, 860 (2001)

The Human Genome Sequences

¤ Draft genome sequences (2001)
  – International Human Genome Sequencing Consortium (collaboration of 20 public sequencing centers in 6 countries)
    • Used a hierarchical shotgun sequencing strategy
    • Sequence published in: Nature 409, 860 (2001)
  – Celera Genomics - private initiative
    • Used a whole-genome shotgun approach
    • Assembly of the sequence combined their whole-genome shotgun data and the public genome sequence data
    • Sequence published in: Venter et al., Science, 291, 1304 (2001)
¤ Finished genome sequence (2004)
  – International Human Genome Sequencing Consortium
    • Sequence published in: Nature 431, 931 - 945 (2004)

Finished Human Genome Sequence

¤ Finishing process: complex iterative process
  – Resolving problematic sequences
    • From single nucleotide errors and gaps to the integrity of whole chromosomes
  – The finishing process involved two distinct components
    • producing finished maps consisting of continuous and accurate paths of overlapping large-insert clones
    • producing finished clone sequences, consisting of continuous and accurate nucleotide sequences for each clone
      – generated shotgun sequence of ~59,000 BACs comprising a total (redundant) sequence length of 5.8 Gb
      – Assembled sequences of ~46,000 BACs

Reprinted from: International Human Genome Sequencing Consortium, Nature 431, 931 - 945 (2004)

Finished Human Genome Sequence

¤ Finished genome sequence
  – Build 35 comprises 2,851 Mbp
  – Interrupted by “only” 341 gaps
    • 308 gaps in the euchromatic sequence: totalling ~28 Mb
    • 33 heterochromatic gaps (including 24 centromeres): totalling ~198 Mb
  – The total human genome size is estimated at ~3,080 Mb
¤ Comparison with draft sequence
  – Substantially fewer gaps (341 versus 147,821)
  – More accurate and complete sequence: error rate ~1 per 10^5 bases
    • Confirmed local order and orientation of the sequences
    • Corrected artefactual duplications resulting from mixups
    • Verified most of the sequence with
      – BAC cloned overlap sequence, paired end sequence reads from fosmids, draft chimpanzee genome sequence
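
The genome-size estimate on this slide is the sum of the finished sequence and the estimated gap sizes, which the following sketch reproduces from the slide's own figures. It also derives the fraction of the euchromatic genome that is finished, which comes out at about 99%. Variable names are illustrative.

```python
FINISHED_MB = 2851             # Build 35 finished sequence
EUCHROMATIC_GAPS_MB = 28       # 308 euchromatic gaps
HETEROCHROMATIC_GAPS_MB = 198  # 33 heterochromatic gaps, incl. centromeres

# Total genome size = finished sequence + all estimated gaps.
total_genome_mb = FINISHED_MB + EUCHROMATIC_GAPS_MB + HETEROCHROMATIC_GAPS_MB

# Euchromatic genome = finished sequence + euchromatic gaps only.
euchromatic_mb = FINISHED_MB + EUCHROMATIC_GAPS_MB
euchromatic_coverage = FINISHED_MB / euchromatic_mb

print(f"total genome size: {total_genome_mb} Mb")
print(f"euchromatic coverage: {euchromatic_coverage:.1%}")
```

The sum lands within a few megabases of the ~3,080 Mb quoted on the slide; the gap sizes are themselves estimates.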


Finished Human Genome Sequence

¤ Importance of a completely finished genome sequence
  – Accurate reference for identifying genetic variation in the human population
    • Error rate of 10^-5 << frequency of SNP of 10^-3
  – Identification of segmental duplications
    • Estimated to cover >5% of the genome sequence
      – Located primarily in the pericentromeric and subtelomeric regions
      – much higher than in mouse and rat
    • Great medical interest: predisposes to deletion or rearrangement
      – Williams syndrome region (7q)
      – Charcot–Marie–Tooth region (17p)
      – DiGeorge syndrome region (22q)
    • Many remaining gaps involve unresolved segmental duplications
  – Correct identification of all protein-coding gene structures
    • ~60% of the gene models were corrected compared to the draft

Reprinted from: International Human Genome Sequencing Consortium, Nature 431, 931 - 945 (2004)

Estimates of the Number of Human Genes

¤ Reassociation kinetics (60s and 70s)
  – Early studies estimated the mRNA complexity of typical vertebrate tissues to be 10,000–20,000, and were extrapolated to suggest around 40,000 genes for the entire genome
¤ Estimates from approximate gene and genome sizes
  – Calculation based on the size of a typical gene (3 x 10^4 bp) and the size of the genome (3 x 10^9 bp) yielded 100,000 genes (W. Gilbert, pers. comm.)
¤ Number of CpG islands associated with known genes
  – An estimate of 70,000–80,000 genes was made
¤ Estimates based on ESTs
  – Estimates based on ESTs varied widely, from 35,000 to 120,000 genes
  – Discrepancy results from contaminating genomic sequences and multiple ESTs from single genes
¤ Whole-genome shotgun sequence from the pufferfish
  – Suggested around 30,000 human genes

Reprinted from: International Human Genome Sequencing Consortium, Nature 409, 860 (2001)
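
The size-based estimate above is a single division, sketched below with the slide's figures. Dividing genome size by gene size implicitly assumes that genes tile the genome end to end; the second function makes that assumption visible by computing the fraction of a 3 x 10^9 bp genome that a given number of 3 x 10^4 bp genes would occupy. The function name is an illustrative choice.

```python
GENOME_BP = 3 * 10**9        # approximate human genome size (from the slide)
TYPICAL_GENE_BP = 3 * 10**4  # assumed size of a typical gene (from the slide)

# Naive estimate: how many typical genes fit if they tile the genome.
naive_gene_count = GENOME_BP // TYPICAL_GENE_BP

def genic_fraction(n_genes, gene_bp=TYPICAL_GENE_BP, genome_bp=GENOME_BP):
    """Fraction of the genome occupied by n_genes genes of size gene_bp."""
    return n_genes * gene_bp / genome_bp

print(f"naive estimate: {naive_gene_count:,} genes")
print(f"genome share of 25,000 such genes: {genic_fraction(25_000):.0%}")
```

With a gene count in the 20,000 to 25,000 range, genes of this size would span only about a fifth to a quarter of the genome, which is why the tiling assumption overshoots.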

Identification of Protein Coding Genes

¤ Draft genome sequence
  – Initial estimate: 30,000–35,000 genes
¤ Finished genome sequence
  – The human gene catalogue contains 22,287 gene loci consisting of 19,438 known genes and 2,188 predicted genes
  – Current estimate: 20,000–25,000 genes
    • 25,000 is an upper limit, the actual number may be ~23,000
    • Consistent with gene counts in other vertebrates: fish and chicken


Basic Characteristics of Gene Structures

¤ Mean and median values of gene structures
  – Based on the draft sequence
    • In particular, the UTRs in the RefSeq database are incomplete


Protein Coding Genes

¤ General features of human genes
  – average coding length of about 1,400 bp
    • Similar to all eukaryotic organisms
  – average genomic extent of about 30 kb
    • Much larger than in lower eukaryotes
  – The variation in gene and intron size
    • GC-rich regions: gene-dense with many compact genes
    • AT-rich regions: gene-poor with genes containing large introns
¤ Known and predicted exons: ~231,000
  – 1.2% of the human genome
  – Average of 10.4 exons per gene
¤ Pseudogenes
  – Current estimates: 20,000 processed and unprocessed pseudogenes
    • The total number of pseudogenes is thus likely to exceed the total number of functional genes
    • Only those of recent origin can be identified with confidence
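
The exon figures above can be cross-checked with a few lines of arithmetic. The sketch below combines the exon count with two numbers quoted elsewhere in these slides, the 22,287 gene loci of the finished catalogue and the 2,851 Mb of Build 35; pairing them this way is an assumption made for illustration, so the derived mean exon length is only indicative.

```python
N_EXONS = 231_000     # known and predicted exons
N_GENE_LOCI = 22_287  # gene loci in the finished-sequence catalogue
EXON_SHARE = 0.012    # exons as a fraction of the genome (1.2%)
GENOME_MB = 2851      # Build 35 finished sequence, in Mb

# Average number of exons per gene locus.
exons_per_gene = N_EXONS / N_GENE_LOCI

# Total exonic sequence and the mean exon length it implies.
exonic_mb = EXON_SHARE * GENOME_MB
mean_exon_bp = exonic_mb * 1e6 / N_EXONS

print(f"exons per gene: {exons_per_gene:.1f}")
print(f"exonic sequence: {exonic_mb:.1f} Mb")
print(f"implied mean exon length: {mean_exon_bp:.0f} bp")
```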


Basic Characteristics of Gene Structures

¤ High variation in overall intron size
  – distribution has very long tails
    • Many genes are over 100 kb long
  – Largest gene: dystrophin gene (DMD), 2.4 Mb
  – longest known coding sequence: titin gene, 80,780 bp, 178 exons


Comparison with fly, worm and yeast

¤ Apparent homologues of human proteins
  – 40% to 60% of the yeast, worm and fly proteomes
¤ Human genes differ from those in worm and fly
  – Spread out over much larger regions of genomic DNA
  – Have a substantially larger number of exons
    • 4.5 to 5 in fly and worm compared to 10.4 in human
  – Are used to construct more alternative transcripts
    • Larger number of proteins in human than in the worm or fly
¤ Increased complexity of the proteome
  – Complexity of the human proteome is a consequence of large-scale protein innovation
    • Multi-domain proteins with multiple functions, and domain architectures


Protein Coding Gene Evolution in Human

¤ Gene birth in the human lineage
  – gene duplications that arose after divergence from the mouse
    • Identified 1,183 gene clusters containing 3,300 recently duplicated genes (with a peak 3–4 million years ago), enriched in genes with
      – immune function
      – olfactory function
      – reproductive functions
  – Duplicated genes are the raw material for adaptive evolution:
    • extra copies are able to undergo functional divergence in response to positive selection
¤ Gene death in the human lineage
  – Recently inactivated genes include genes involved in olfactory function


Future Perspectives

¤ Vertebrate genome sequencing projects ongoing or planned (currently totaling 25)
  – Fish: zebrafish, salmon, tilapia, stickleback and Japanese medaka
  – Amphibians: Xenopus laevis and X. tropicalis
  – Birds: turkey
  – Mammals: ~15 additional species
    • cow, pig, cat, dog, horse, rabbit, guinea pig, elephant, kangaroo, shrew, ...
  – Primates: chimp, orangutan, baboon and rhesus monkey
¤ Source: GOLD™ Genomes OnLine Database
  – http://www.genomesonline.org/

Recommended reading

¤ Genome sequences
  – The sequencing of the human genome
    • International Human Genome Sequencing Consortium, Nature 409, 860 (2001)
    • International Human Genome Sequencing Consortium, Nature 431, 931 - 945 (2004)
  – The sequencing of the mouse genome
    • Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
  – Chicken genome sequence
    • International Chicken Genome Sequencing Consortium, Nature 432, 695 - 716 (2004)

Further reading

¤ Vertebrate genome sequences
  – Fish genome sequences
    • Aparicio et al., Science, 297, 1301-1310 (2002)
    • Jaillon et al., Nature 431, 946 - 957 (2004)
  – Rat genome sequence
    • RGSPC, Nature 428, 493 - 521 (2004)
  – Human genome sequence - Celera Genomics - private initiative
    • Venter et al., Science, 291, 1304 (2001)
