Genome Biology and Biotechnology
3. The genome structures of vertebrates
Prof. M. Zabeau
Department of Plant Systems Biology
Flanders Interuniversity Institute for Biotechnology (VIB)
University of Gent
International course 2005
The Genome Sequences of vertebrates
Whole-Genome Shotgun Assembly and Analysis of the Genome of Fugu rubripes
¤ Paper presents
  – Low-quality draft genome sequence of Fugu rubripes
  – The sequence provided a valuable reference for annotating the human and mouse genomes
    • Small genome (350 Mb versus 3000 Mb)
    • Low repetitive DNA content
Aparicio et al., Science, 297, 1301-1310 (2002)
The Fugu Genome Sequence
¤ The draft sequence covers a total of 332.5 Mb
  – Highly fragmented sequence (~30 Mb unassembled sequences)
  – The total genome size is estimated at ~365 Mb
¤ The number of predicted genes: 31,059
  – Similar to the number of human genes predicted from the draft sequence
¤ Repetitive sequences
  – Density of <15%, far below the 35 to 45% observed in mammals
    • Transposable elements are still very active
Reprinted from: Aparicio et al., Science, 297, 1301-1310 (2002)
Protein-coding genes
¤ The gene-containing fraction is ~108 Mb (30%)
  – The average gene density: one gene per 10.9 kb
  – The Fugu genome is compact because introns are shorter than in the human genome
    • Genome contains ~500 large introns (>10 kb) compared with >12,000 large introns in human
    • Genes are scaled in proportion to the compact genome size
  – The number of introns is roughly the same as in human
    • Both gain and loss of introns in the Fugu lineage are observed
¤ The compactness of the Fugu genome is accounted for by
  – Low abundance of repeated sequences
  – The small size of introns and intergenic regions
Reprinted from: Aparicio et al., Science, 297, 1301-1310 (2002)
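The gene density quoted for Fugu can be related to the genome sizes and the gene count given on these slides. The sketch below is a back-of-envelope check only: the slides do not say which genome size the 10.9 kb figure was computed from, and the function and constant names are illustrative, not taken from the paper.

```python
# Back-of-envelope gene density for Fugu, using only figures quoted on the slides.
ASSEMBLED_MB = 332.5          # draft sequence covered
ESTIMATED_GENOME_MB = 365.0   # estimated total genome size
PREDICTED_GENES = 31_059      # predicted gene count
QUOTED_KB_PER_GENE = 10.9     # density quoted on the slide


def kb_per_gene(genome_mb: float, genes: int) -> float:
    """Average kb of genome per gene, assuming genes are spread evenly."""
    return genome_mb * 1000.0 / genes


def implied_span_mb(kb_per_gene_value: float, genes: int) -> float:
    """Genome span (Mb) that a given density and gene count would imply."""
    return kb_per_gene_value * genes / 1000.0


if __name__ == "__main__":
    print(f"assembled: {kb_per_gene(ASSEMBLED_MB, PREDICTED_GENES):.1f} kb per gene")
    print(f"estimated: {kb_per_gene(ESTIMATED_GENOME_MB, PREDICTED_GENES):.1f} kb per gene")
    print(f"span implied by quoted density: "
          f"{implied_span_mb(QUOTED_KB_PER_GENE, PREDICTED_GENES):.1f} Mb")
```

Under these assumptions the quoted density corresponds to a span that lies between the assembled and the estimated genome size.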
Comparison of Fugu and Human Proteomes
¤ 75% of predicted human proteins have a strong match to Fugu
Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early
The Tetraodon genome sequence
¤ The draft genome sequence (8.3x) spans 342 Mb
  – Largest scaffolds were mapped onto the chromosomes
    • Draft is much less fragmented than that of Fugu
¤ Genome landscape
  – Transposable elements are very rare (<4000 copies)
    • Fewer than Fugu (15% of the genome)
¤ Estimated 20,000–25,000 protein-coding genes
  – Very similar to the recent (2004) human gene count
  – Much lower than reported for Fugu (the current Fugu count is also lower)
  – Gene ontology (GO) classification shows only subtle differences between fish and mammals
    • Improved fish gene catalogue aids human gene predictions
Ancestral genome of bony vertebrates
¤ The patterns of doubly conserved synteny are consistent with
  – 12 ancestral chromosomes which have rearranged to form
  – the present day chromosomes of human and fish
[Figure panels: Human, Fish]
Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution
¤ Paper presents
  – A draft genome sequence of the red jungle fowl Gallus gallus
  – The first genome of a non-mammalian amniote
    • Provides a new perspective on vertebrate genome evolution
  – The evolutionary distance between chicken and human provides an excellent signal-to-noise ratio to detect functional elements
    • 310 million years since the divergence of birds and mammals
    • Fewer than in human: many ncRNA genes are pseudogenes
  – Syntenic relationships for non-coding RNA genes differ from those of protein-coding genes
    • Implies a novel mode of evolution for some ncRNA genes
    • Only certain ncRNA genes are in regions of conserved synteny
      – microRNAs (miRNAs) and small nucleolar RNAs (snoRNAs) found in introns of protein-coding genes
Evolutionary conservation of gene components
¤ Sequence conservation of chicken and human orthologs
  – highest in protein-coding exons
  – minimal in introns
  – significant in the 5' and 3' flanking and untranslated
Conserved sequences in chicken and human
¤ High substitution rates between human and chicken
  – Can be used to detect functionally conserved sequences
¤ 70 Mb (2.5%) of human sequence aligns with chicken
  – 44% is in protein-coding regions (exons)
  – 56% is non-coding: intronic (25%) and intergenic (31%)
¤ Conserved non-coding segments occur clustered and far from genes
  – Identified 57 segments with average length of 1.1 Mb
    • Gene poor, G+C poor and have no interspersed repeats
  – The functional significance of these sequences is
The draft mouse genome sequence
¤ The euchromatic mouse genome is estimated at ~2.5 Gb
  – The draft genome sequence covers ~96% of the genome
¤ Generation of the draft genome sequence
  – Sequencing
    • 41.4 million paired-end sequence reads derived from various clone types
  – Assembly
    • represents ~7.7-fold sequence coverage
    • 224,713 sequence contigs
    • total of 7,418 supercontigs
    • The 200 largest supercontigs span more than 98% of the assembled sequence
  – Anchoring to chromosomes
    • Anchored all supercontigs >500 kb with the mouse genetic map
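Fold coverage in a shotgun project is the total number of sequenced bases divided by the genome size. The sketch below applies that relation to the mouse figures on this slide (41.4 million reads, ~7.7-fold coverage, ~2.5 Gb genome) to back out the mean read length they imply. That read length is a derived estimate and is not stated on the slide, and the function names are illustrative.

```python
# Coverage arithmetic for the mouse draft, using figures quoted on the slide.
N_READS = 41.4e6      # sequence reads
GENOME_BP = 2.5e9     # estimated euchromatic genome size
COVERAGE = 7.7        # quoted fold coverage


def fold_coverage(n_reads: float, mean_read_bp: float, genome_bp: float) -> float:
    """Fold coverage = total sequenced bases / genome size."""
    return n_reads * mean_read_bp / genome_bp


def implied_read_length(n_reads: float, coverage: float, genome_bp: float) -> float:
    """Mean read length (bp) needed to reach a given coverage."""
    return coverage * genome_bp / n_reads


if __name__ == "__main__":
    read_bp = implied_read_length(N_READS, COVERAGE, GENOME_BP)
    print(f"implied mean read length: {read_bp:.0f} bp")
    print(f"coverage at that length: {fold_coverage(N_READS, read_bp, GENOME_BP):.1f}x")
```

A mean read length of a few hundred bases is what one would expect from the sequencing technology of that period, so the three quoted numbers are mutually consistent under this simple model.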
Repetitive sequences in human and mouse
¤ The most prevalent feature of mammalian genomes is their high content of repetitive sequences
  – Most of which are interspersed repeats representing 'fossils' of transposable elements
¤ The repetitive sequences in mouse and human differ
  – Only 37.5% of the mouse genome is transposon-derived
  – ~46% of the human genome is transposon-derived
  – Insertions of transposable elements occurred in the last 150–200 million years
    • The most notable difference is the rate of transposition over time
      – in mouse the rate has remained fairly constant
      – in human the rate increased to a peak at ~40 Myr, and
The rat genome sequence
¤ The draft genome sequence covers 2.75 Gb
  – A 'combined' sequencing strategy using
    • WGS sequencing and light sequence coverage of BACs
    • Sequential assembly of 'enriched BACs' (eBACs) joined into
    • 308 gaps in the euchromatic sequence: totalling ~28 Mb
    • 33 heterochromatic gaps (including 24 centromeres): total ~198 Mb
  – The total human genome size is estimated at ~3,080 Mb
¤ Comparison with draft sequence
  – Substantially fewer gaps (341 versus 147,821)
  – More accurate and complete sequence: error rate ~1 per 10^5
    • Confirmed local order and orientation of the sequences
    • Corrected artefactual duplications resulting from mixups
    • Verified most of the sequence with
      – BAC cloned overlap sequence, paired end sequence reads from fosmids, draft chimpanzee genome sequence
Reprinted from: International Human Genome Sequencing Consortium Nature 431, 931 - 945 (2004)
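The gap figures quoted for the finished human sequence can be cross-checked against each other. The sketch below adds up the euchromatic and heterochromatic gaps and subtracts the gap sizes from the estimated total genome size. The amount of sequenced DNA and the fold reduction in gaps are derived here, not quoted on the slide, and all names are illustrative.

```python
# Tally of the gap figures quoted for the finished human genome sequence.
EUCHROMATIC_GAPS = 308
EUCHROMATIC_GAP_MB = 28
HETEROCHROMATIC_GAPS = 33
HETEROCHROMATIC_GAP_MB = 198
TOTAL_GENOME_MB = 3080
DRAFT_GAPS = 147_821


def total_gaps() -> int:
    """Number of gaps remaining in the finished sequence."""
    return EUCHROMATIC_GAPS + HETEROCHROMATIC_GAPS


def total_gap_mb() -> int:
    """Estimated amount of DNA (Mb) still sitting in gaps."""
    return EUCHROMATIC_GAP_MB + HETEROCHROMATIC_GAP_MB


def sequenced_mb() -> int:
    """Genome size minus gaps: the amount of DNA actually in the sequence."""
    return TOTAL_GENOME_MB - total_gap_mb()


if __name__ == "__main__":
    print(f"gaps: {total_gaps()} (draft had {DRAFT_GAPS})")
    print(f"fold reduction in gaps: {DRAFT_GAPS / total_gaps():.0f}x")
    print(f"sequenced: {sequenced_mb()} Mb "
          f"({100 * sequenced_mb() / TOTAL_GENOME_MB:.1f}% of the genome)")
```

The two gap classes sum to the 341 gaps used in the comparison with the draft, which suggests the figures on the slide are internally consistent.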
Finished Human Genome Sequence
¤ Importance of a completely finished genome sequence
  – Accurate reference for identifying genetic variation in the human population
    • Error rate of 10^-5 << frequency of SNPs of 10^-3
  – Identification of segmental duplications
    • Estimated to cover >5% of the genome sequence
      – Located primarily in the pericentromeric and subtelomeric regions
      – much higher than in mouse and rat
    • Great medical interest: predisposes to deletion or rearrangement
      – Williams syndrome region (7q)
      – Charcot–Marie–Tooth region (17p)
      – DiGeorge syndrome region (22q)
    • Many remaining gaps involve unresolved segmental duplications
  – Correct identification of all protein-coding gene structures
    • ~60% of the gene models were corrected compared to the draft
Reprinted from: International Human Genome Sequencing Consortium Nature 431, 931 - 945 (2004)
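The argument that a finished sequence is an accurate reference for variation rests on comparing two per-base rates: about one sequencing error per 100,000 bases against about one SNP per 1,000 bases. The sketch below works that comparison through for a window of sequence. It treats both rates as simple per-base frequencies, which is a simplifying assumption, and the names are illustrative.

```python
# Expected sequencing errors versus SNPs in a window of finished sequence.
ERROR_ONE_IN = 100_000   # error rate of 10^-5, i.e. one error per 100,000 bases
SNP_ONE_IN = 1_000       # SNP frequency of 10^-3, i.e. one SNP per 1,000 bases


def expected_events(window_bp: int, one_in: int) -> float:
    """Expected number of events in a window, given a rate of one per `one_in` bases."""
    return window_bp / one_in


def snp_to_error_ratio() -> float:
    """How many true SNPs are expected for every sequencing error."""
    return ERROR_ONE_IN / SNP_ONE_IN


if __name__ == "__main__":
    window = 1_000_000  # 1 Mb
    print(f"errors per Mb: {expected_events(window, ERROR_ONE_IN):.0f}")
    print(f"SNPs per Mb:   {expected_events(window, SNP_ONE_IN):.0f}")
    print(f"SNPs per sequencing error: {snp_to_error_ratio():.0f}")
```

Under these assumptions, roughly one apparent difference in a hundred would be a sequencing error rather than a real variant.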
Estimates of the Number of Human Genes
¤ Reassociation kinetics (1960s and 1970s)
  – Early estimates put the mRNA complexity of typical vertebrate tissues at 10,000–20,000, and were extrapolated to suggest around 40,000 genes for the entire genome
¤ Estimates from approximate gene and genome sizes
  – Calculation based on the size of a typical gene (3 x 10^4 bp) and the size of the genome (3 x 10^9 bp) yielded 100,000 genes (W. Gilbert, pers. comm.)
¤ Number of CpG islands associated with known genes
  – An estimate of 70,000–80,000 genes was made
¤ Estimates based on ESTs
  – Estimates based on ESTs varied widely, from 35,000 to 120,000 genes
  – Discrepancy results from contaminating genomic sequences and multiple ESTs from single genes
¤ Whole-genome shotgun sequence from the pufferfish
  – Suggested around 30,000 human genes
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)
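The size-based estimate on this slide is a single division: genome size over typical gene size. The sketch below reproduces it and shows the hidden assumption, namely that the whole genome is tiled with genes of typical size. The `genic_fraction` parameter is a hypothetical knob added for illustration and is not part of the original calculation.

```python
# Gene-count estimate from approximate gene and genome sizes.
GENOME_BP = 3 * 10**9        # approximate human genome size
TYPICAL_GENE_BP = 3 * 10**4  # approximate size of a typical gene


def genes_from_sizes(genome_bp: int, gene_bp: int, genic_fraction: float = 1.0) -> float:
    """Estimated gene count if `genic_fraction` of the genome is occupied by genes.

    genic_fraction=1.0 reproduces the original back-of-envelope estimate.
    """
    return genome_bp * genic_fraction / gene_bp


if __name__ == "__main__":
    print(f"whole genome tiled with genes: {genes_from_sizes(GENOME_BP, TYPICAL_GENE_BP):,.0f}")
    # Hypothetical: only a quarter of the genome occupied by genes.
    print(f"a quarter of the genome genic: "
          f"{genes_from_sizes(GENOME_BP, TYPICAL_GENE_BP, 0.25):,.0f}")
```

Lowering the assumed genic fraction pulls the estimate down towards the later, sequence-based counts, which illustrates why the 100,000 figure was an overestimate.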
Identification of Protein Coding Genes
¤ Draft genome sequence
  – The human gene catalogue contains 22,287 gene loci consisting of 19,438 known genes and 2,188 predicted genes
  – Current estimate: 20,000–25,000 genes
    • 25,000 is an upper limit, the actual number may be ~23,000
    • Consistent with gene counts in other vertebrates: fish and chicken
Reprinted from: International Human Genome Sequencing Consortium Nature 431, 931 - 945 (2004)
Basic Characteristics of Gene Structures
¤ Mean and median values of gene structures
  – Based on the draft sequence
    • In particular, the UTRs in the RefSeq database are incomplete
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)
Protein Coding Genes
¤ General features of human genes
  – average coding length of about 1,400 bp
    • Similar to all eukaryotic organisms
  – average genomic extent of about 30 kb
    • Much larger than in lower eukaryotes
  – The variation in gene and intron size
    • GC-rich regions: gene-dense with many compact genes
    • AT-rich regions: gene-poor with genes containing large introns
¤ Known and predicted exons: ~231,000
  – 1.2% of the human genome
  – Average of 10.4 exons per gene
¤ Pseudogenes
  – Current estimates: 20,000 processed and unprocessed pseudogenes
    • The total number of pseudogenes is thus likely to exceed the total number of functional genes
    • Only those of recent origin can be identified with confidence
Reprinted from: International Human Genome Sequencing Consortium Nature 431, 931 - 945 (2004)
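The exon figures on this slide (about 231,000 exons, 1.2% of the genome, 10.4 exons per gene) imply a gene count and a mean exon length. The sketch below derives both as a consistency check. It assumes the ~3,080 Mb total genome size quoted for the finished sequence as the denominator for the 1.2%, which the slide does not state, and the names are illustrative.

```python
# Consistency check on the exon statistics for the human genome.
EXONS = 231_000           # known and predicted exons
EXONS_PER_GENE = 10.4     # average exons per gene
EXON_FRACTION = 0.012     # exons as a fraction of the genome (1.2%)
GENOME_MB = 3080.0        # assumption: total genome size from the finished sequence


def implied_gene_count(exons: int, exons_per_gene: float) -> float:
    """Gene count implied by the exon total and the per-gene average."""
    return exons / exons_per_gene


def exonic_mb(fraction: float, genome_mb: float) -> float:
    """Total exonic sequence in Mb."""
    return fraction * genome_mb


def mean_exon_bp(exonic_mb_value: float, exons: int) -> float:
    """Mean exon length in bp."""
    return exonic_mb_value * 1e6 / exons


if __name__ == "__main__":
    genes = implied_gene_count(EXONS, EXONS_PER_GENE)
    total = exonic_mb(EXON_FRACTION, GENOME_MB)
    print(f"implied gene count: {genes:,.0f}")
    print(f"exonic sequence: {total:.1f} Mb")
    print(f"mean exon length: {mean_exon_bp(total, EXONS):.0f} bp")
```

The implied gene count falls inside the 20,000–25,000 range given for the current estimate.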
Basic Characteristics of Gene Structures
¤ High variation in overall intron size
  – distribution has very long tails
    • Many genes are over 100 kb long
  – Largest gene: dystrophin gene (DMD) 2.4 Mb
  – longest known coding sequence: titin gene 80,780 bp, 178 exons
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)
Comparison with fly, worm and yeast
¤ Apparent homologues of human proteins
  – 40% to 60% of the yeast, worm and fly proteomes
¤ Human genes differ from those in worm and fly
  – Spread out over much larger regions of genomic DNA
  – Have a substantially larger number of exons
    • 4.5 to 5 in fly and worm compared to 10.4 in human
  – Are used to construct more alternative transcripts
    • Larger number of proteins in human than in the worm or fly
¤ Increased complexity of the proteome
  – Complexity of the human proteome is a consequence of large-scale protein innovation
    • Multi-domain proteins with multiple functions, and domain architectures
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)
Protein Coding Gene Evolution in Human
¤ Gene birth in the human lineage
  – gene duplications that arose after divergence from the mouse
    • Identified 1,183 gene clusters containing 3,300 recently duplicated genes (with a peak 3–4 million years ago) enriched in genes with