Genome Biology and Biotechnology
5. The genome structures of plants
Prof. M. Zabeau
Department of Plant Systems Biology
Flanders Interuniversity Institute for Biotechnology (VIB)
University of Gent
International course 2005
Sequenced genomes of invertebrates and plants
¤ Genome sequencing in progress
  – Poplar (draft sequence completed)
  – Medicago (in progress)
  – Tomato (in progress)
  – Maize (started)
Phylogeny of the flowering plants
Monocots
Dicots
~250 MY
Analysis of the genome sequence of the flowering plant Arabidopsis thaliana
¤ Plants and animals evolved independently from unicellular eukaryotes, representing contrasting life forms
  – The worm and fly genomes revealed the common genetic basis of developmental and physiological processes in multicellular organisms
  – The genome sequence of a plant provides a glimpse of the genetic basis of differences between plants and other eukaryotes
  – The Arabidopsis genome sequence is one of the most accurately sequenced genomes (error rate < 1 in 100,000)
The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
The Arabidopsis Genome Sequence
¤ The complete genome size is estimated at ~125 Mb
  – The total length of the sequenced regions is 115.4 Mb
  – The unsequenced centromere and rRNA repeat (chr. 2 & 4) regions are estimated at 10 Mb
¤ General features such as gene density and repeat distribution are very consistent across the five chromosomes
Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
Representation of the Arabidopsis Chromosomes
Chr. 1: 29.1 Mb
Chr. 2: 19.6 Mb
Chr. 3: 23.2 Mb
Chr. 4: 17.5 Mb
Chr. 5: 26.0 Mb
Figure landmarks: telomere, centromere, rDNA repeat
Figure tracks (density): protein genes, ESTs, transposons, mitochondrial/chloroplast, RNA genes
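The five chromosome lengths above can be checked against the genome totals quoted on the previous slide. The following is a small arithmetic sketch; the variable names are illustrative, and the 125 Mb figure is the estimate given earlier in these slides.

```python
# Lengths of the sequenced regions per chromosome, in Mb, as listed on the slide.
chromosome_mb = {
    "Chr. 1": 29.1,
    "Chr. 2": 19.6,
    "Chr. 3": 23.2,
    "Chr. 4": 17.5,
    "Chr. 5": 26.0,
}
GENOME_ESTIMATE_MB = 125.0  # estimated complete genome size quoted in the slides

sequenced_mb = sum(chromosome_mb.values())
unsequenced_mb = GENOME_ESTIMATE_MB - sequenced_mb
fraction_sequenced = sequenced_mb / GENOME_ESTIMATE_MB

print(f"Sequenced:          {sequenced_mb:.1f} Mb")
print(f"Unsequenced:        {unsequenced_mb:.1f} Mb")
print(f"Fraction sequenced: {fraction_sequenced:.1%}")
```

The sum comes to about 115.4 Mb, leaving just under 10 Mb for the unsequenced centromere and rRNA repeat regions.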
Representation of Arabidopsis Chromosome 1
Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
Pericentromeric region
Coding Gene Content
¤ AGI annotation predicted 25,498 genes
  – Non-homogeneous annotation: performed by different groups
¤ Re-annotation estimates 28,000 to 29,000 genes
  – Larger than C. elegans (19,099) and D. melanogaster (13,601)
  – Larger gene set results from numerous gene duplications
¤ MIPS classification of Arabidopsis proteins in 12 functional categories (cf. yeast)
  – ~70% classified according to sequence similarity to proteins of known function in all organisms
    • 9% experimentally characterized
  – ~30% could not be assigned to functional categories
    • Representing 10,000 “unknown genes”
Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
Functional Analysis of Arabidopsis Genes
Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
Comparison of Functional Categories
¤ Comparison of Arabidopsis genes with those of the other completely sequenced genomes reveals:
  – High conservation of eukaryotic gene function
    • >50% of the genes involved in protein synthesis have counterparts in the other eukaryotic genomes
  – Independent evolution of many plant gene families
    • Transcription factors: only 8–23% of Arabidopsis proteins involved in transcription have related genes in other eukaryotic genomes
  – Acquisition of bacterial genes
    • From the cyanobacterial ancestor of the plastid: on the order of 1,000 genes have been transferred over time from the organelle to the nuclear genome
      – Genes with high similarity to Synechocystis
Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
RNA Gene Content
¤ rRNA genes
  – Nucleolar organizers (NORs) on chromosomes 2 and 4 contain
    • 350–400 repeats of 10 kb encoding the 18S, 5.8S and 25S rRNA genes, comprising 3.5–4.0 Mb
¤ 5S rRNA genes
  – Tandem arrays in the centromeric regions of chr. 3, 4 and 5
¤ Spliceosomal RNAs, small nucleolar RNAs (snoRNAs)
  – Several copies occur dispersed on all chromosomes
Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
Genome Duplication in Arabidopsis
¤ The Arabidopsis genome exhibits traces of extensive duplications
  – >75% of the Arabidopsis genes are duplicated
    • The fact that most genes are duplicated explains the higher gene number than in other organisms
  – Segmental duplications
    • Segmental duplications were first described in yeast
    • Identified 24 large duplicated segments of > 100 kb
      – These duplicated regions encompass 58% of the genome
  – Tandem gene arrays
    • Tandem arrays of genes are common in all genomes
    • 1,528 tandem arrays containing 4,140 individual genes
      – 17% of all genes of Arabidopsis are arranged in tandem arrays
Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
Genome Organization and Duplication
¤ First analysis of segmental duplications
  – Detection of collinear clusters of genes using TBLASTX
    • This approach detects the “obvious” duplications
  – The proportion of homologous genes in each duplicated segment varies widely
    • Extensive loss or gain of genes after the segmental duplication occurred
  – Sequence conservation/divergence of the duplicated genes varies greatly
    • Duplications vary in age
      – Suggesting several different large-scale duplication events
    • Duplications occurred between 75 and 200 million years ago
      – Earliest duplication coincides with the radiation of the flowering land plants
Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
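The idea of a collinear cluster can be illustrated with a toy sketch. This is not the procedure used in the paper: it assumes the homologous gene pairs have already been found (for example by a similarity search), considers only same-orientation runs, and uses a simple greedy chain. The function name and the `max_gap` and `min_pairs` parameters are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (gene index on segment A, gene index on segment B)


def collinear_blocks(pairs: List[Pair], max_gap: int = 5, min_pairs: int = 3) -> List[List[Pair]]:
    """Greedily chain homologous gene pairs into collinear runs.

    A run is extended while both gene indices increase and neither jumps
    by more than max_gap; runs shorter than min_pairs are discarded.
    """
    blocks: List[List[Pair]] = []
    current: List[Pair] = []
    for i, j in sorted(pairs):
        if current:
            prev_i, prev_j = current[-1]
            if 0 < i - prev_i <= max_gap and 0 < j - prev_j <= max_gap:
                current.append((i, j))
                continue
            # The run is broken: keep it only if it is long enough.
            if len(current) >= min_pairs:
                blocks.append(current)
        current = [(i, j)]
    if len(current) >= min_pairs:
        blocks.append(current)
    return blocks


# Made-up example: four pairs in conserved order, then three scattered hits.
pairs = [(1, 101), (2, 102), (4, 105), (5, 106), (20, 300), (21, 50), (22, 301)]
blocks = collinear_blocks(pairs)
for block in blocks:
    print(f"block of {len(block)} pairs: {block[0]} .. {block[-1]}")
```

Gaps inside a run (gene 3 on segment A, genes 103 and 104 on segment B in the example) correspond to genes without a retained partner, which is how gene loss lowers the proportion of homologous genes in a duplicated segment.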
Overall View of the Duplicated Regions
Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
Implications of Genomic Duplications
¤ What does the duplication in the Arabidopsis genome tell us about the evolution of the species?
  – Polyploidy occurs widely in plants but not in animals
    • The hypothesis is that Arabidopsis had one or more tetraploid ancestors
  – The majority of the Arabidopsis genome is represented in duplicated segments
    • Suggests that the duplicated segments arose from whole genome duplications
  – The long period of time (75 to 200 My) provided ample opportunity for the divergence of the functions of the duplicated genes
  – Duplicated genes often have redundant functions
    • The majority of insertion mutants in Arabidopsis have no obvious phenotypic effect
Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
The Origin of Genomic Duplications
¤ First detailed analysis of the duplications:
  – Vision et al., Science 290: 2114 (2000)
  – Identified 103 duplicated segments with ≥7 matching ORFs
    • 81% of the Arabidopsis genes fall within at least one block
  – The ages of the duplicated blocks were estimated from the average extent of amino acid substitution
  – The number of duplication events was estimated from the distribution of the estimated block ages
    • A single polyploidization event will produce a unimodal distribution of ages with homogeneity among blocks
    • Independent duplication events will produce a multimodal distribution
Reprinted from:
Age Classes of Duplicated Blocks
¤ Distribution of divergence suggests 4 duplication events
  – Classes C through F yield age estimates of 100, 140, 170, and 200 Mya
    • Age class C, the most recent, comprises 50% of the duplicated segments
    • Age class F predates the divergence of monocots and dicots, 180 to 220 Mya
Reprinted from: Vision et al., Science 290: 2114 (2000)
The Origin of Genomic Duplications
¤ Recent study of the Arabidopsis genome duplications
  – Simillion et al., PNAS 99: 13627 (2002)
  – More refined algorithms detect degenerated block duplications
    • Degeneration results from extensive gene loss and subsequent reshuffling of gene order
    • Algorithms detect hidden duplications missed in earlier studies
  – Study revealed a much larger number of duplications
    • 304 nonhidden duplications and 53 hidden duplications
      – Comprising 82% of all genes in Arabidopsis
      – >70% of the genes are lost from the duplicated segments
Nonhidden and Hidden Duplications
Nonhidden
Hidden
Reprinted from: Simillion et al., PNAS 99: 13627 (2002)
Multiplication levels of the Duplications
¤ Chromosomal segments exhibit multiple duplications
  – Multiplication numbers vary from 5 to 8
Reprinted from: Simillion et al., PNAS 99: 13627 (2002)
Conclusions
¤ High multiplication levels
  – Suggest multiple rounds of whole genome duplication
  – Observed many duplications with multiplication levels of 5–8
    • Indicating a maximum of three rounds of duplications
¤ Dating based on silent substitutions
  – Accurate for the youngest duplication
    • Dated 75 million years ago
  – Less reliable for the two older age classes
    • Dated 163 and 221 million years ago
¤ Results suggest three whole genome duplication or polyploidization events
  – The oldest one may have occurred before the monocot/dicot split
Reprinted from: Simillion et al., PNAS 99: 13627 (2002)
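The step from multiplication levels of 5 to 8 to three rounds of duplication follows from doubling: n rounds of whole genome duplication can produce at most 2^n copies of a segment. A minimal sketch of that arithmetic follows; the function name is hypothetical, and gene or segment loss after each duplication is ignored.

```python
def min_duplication_rounds(multiplication_level: int) -> int:
    """Smallest n with 2**n >= multiplication_level (pure doubling model)."""
    if multiplication_level < 1:
        raise ValueError("multiplication level must be at least 1")
    # For m >= 1, (m - 1).bit_length() equals ceil(log2(m)) without float rounding.
    return (multiplication_level - 1).bit_length()


for level in range(5, 9):
    print(f"multiplication level {level}: at least {min_duplication_rounds(level)} rounds")
```

Under this simple model every level from 5 to 8 needs three rounds, and three rounds are enough to reach the highest observed level of 8.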
The grass genomes
¤ Grasses are the primary food source
  – Wheat, rice, maize, barley, sorghum…
¤ Grass genomes vary widely in size
Species   Genome size (Mb)   Ploidy
Rice      430                diploid
Sorghum   735                diploid
Maize     2,360              allotetraploid
Barley    4,900              diploid
Wheat     17,000             hexaploid
Reprinted from: Moore et al., Curr. Biol. 5: 737–739 (1995)
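To make the size range concrete, the table values can be expressed relative to rice. This is only a restatement of the table as ratios; the variable names are illustrative.

```python
# Genome sizes in Mb, taken from the table above.
genome_size_mb = {
    "Rice": 430,
    "Sorghum": 735,
    "Maize": 2360,
    "Barley": 4900,
    "Wheat": 17000,
}

# Size of each genome as a multiple of the rice genome.
relative_to_rice = {
    species: size / genome_size_mb["Rice"]
    for species, size in genome_size_mb.items()
}

for species, ratio in relative_to_rice.items():
    print(f"{species:8s} {ratio:5.1f}x rice")
```

With these figures maize is roughly 5.5 times the size of rice, and wheat close to 40 times.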
Macro synteny of the grass genomes
Reprinted from: Paterson et al., PNAS 101: 9903-9908 (2004)
Gene duplication in rice
Duplicated segments
Genome duplication in rice
¤ Extensive gene duplication
  – 9 duplicated blocks account for 62% of the rice genes
    • Blocks have retained 16% to 25% of the duplicate copies
  – Retention of duplicated gene copies is greater than predicted
    • Suggests that gene loss is not random
¤ Phylogenetic dating of the genome duplication
  – Ks values suggest a single duplication event
    • Except the chromosome 11–12 duplication, which was more recent
  – The Ks peak for the rice duplicates corresponds to 70 MY
  – The time of divergence of the cereals is estimated at 50 MYA
  – A polyploidization event occurred 70 MY ago
    • Before the divergence of the major cereals
Reprinted from: Paterson et al., PNAS 101: 9903–9908 (2004)
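Converting a Ks peak into an age relies on a molecular clock: two duplicate copies each accumulate synonymous substitutions at rate r, so T = Ks / (2r). The sketch below assumes a rate of 6.5e-9 synonymous substitutions per site per year and uses an illustrative Ks value chosen to land near the 70 MY quoted above. Neither number is taken from the slides, and the function name is hypothetical.

```python
def divergence_time_my(ks: float, rate_per_site_per_year: float) -> float:
    """Age of a duplication in millions of years, T = Ks / (2 * r).

    The factor 2 accounts for substitutions accumulating independently
    in both duplicate copies.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    return ks / (2.0 * rate_per_site_per_year) / 1e6


ASSUMED_RATE = 6.5e-9    # assumption: synonymous substitutions per site per year
ILLUSTRATIVE_KS = 0.91   # assumption: illustrative value, not read from the paper

age_my = divergence_time_my(ILLUSTRATIVE_KS, ASSUMED_RATE)
print(f"Ks = {ILLUSTRATIVE_KS} -> about {age_my:.0f} MY")
```

Because the estimate scales inversely with the assumed rate, a different clock calibration shifts all dates proportionally, which is one reason older duplications are harder to date reliably.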
Genomic Duplications in Angiosperm Evolution
Reprinted from: Paterson et al., PNAS 101: 9903-9908 (2004)
monocots
dicots
Comparison of rice and grass genomes
¤ Synteny between rice and Arabidopsis
  – Limited to relatively short segments comprising few genes
    • Successive rounds of genome duplications in the two lineages (Arabidopsis 2; rice 1) have blurred the ancestral synteny
¤ Macro synteny of the grass genomes is confirmed at the sequence level
  – 98% of the genes found in the different grasses have a rice homolog
    • Rice is a model system for the larger cereal genomes
Micro synteny of the grass genomes
¤ Collinear arrangement of genes is interrupted by
  – Intergenic retrotransposon blocks
Reprinted from: Ramakrishna et al., Genetics 162: 1389 (2002)
The maize genome
¤ Large (2,365 Mb) and complex genome
  – Unusually high repetitive DNA content (>80%)
¤ Stepwise sequencing approach designed to meet the challenge
  – Sequencing the gene-rich fraction
    • Enrichment of Gene-Coding Sequences by Genome Filtration
      – Whitelaw et al., Science 302: 2118–2120 (2003)
  – High-resolution physical map of 300,000 BAC clones
    • BAC end sequencing: completed
      – Sequence composition and genome organization of maize
        • Messing et al., PNAS 101: 14349–14354 (2004)
    • BAC skim sequencing: in progress
      – Low-pass sequencing of minimal tiling path BACs
¤ Expect the complete genome sequence by 2007
  – Martienssen et al., Curr. Opin. Plant Biol. 7: 102–107 (2004)
Structure of the Maize genome
¤ The maize genome is 6 times larger than that of rice
  – ~60% of the genome comprises highly repetitive sequences
    • >90% are LTR retrotransposons inserted in the last 3 to 6 MY
  – 10–100 kb tracts of nested insertions separate genic regions
Reprinted from: SanMiguel et al., Nat. Genet. 20: 43 (1998)
Duplicated genes in maize
¤ A conservative estimate predicts 59,000 genes
  – A very large fraction of duplicated genes
¤ Two interesting aspects of the gene organization
  – Although the genome was duplicated only 5–10 My ago, the tetraploidization was followed by a heavy loss of duplicate genes
    • <50% of the duplicates are retained (cf. yeast)
  – Tandem gene amplification is unusually high
    • ~1/3 of the genes consist of tandemly arrayed gene families
¤ The maize genome illustrates the exceptional dynamics of genome evolution in plants
Reprinted from: Messing et al., PNAS 101: 14349–14354 (2004)
Origin of rice, maize and sorghum
Genome duplication
Enrichment of Gene-Coding Sequences in Maize by Genome Filtration
¤ Paper presents
  – Two methodologies that enrich for genic sequences for sequencing complex genomes
    • Methylation filtering
    • High C0t selection
  – Combination of the two techniques resulted in a six-fold reduction in the effective genome size
  – Powerful technologies for sequencing repeat-rich genomes
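As a rough sense of scale, the six-fold reduction can be applied to the maize genome size quoted earlier in these slides (2,365 Mb). This is back-of-the-envelope arithmetic under that assumption, not a figure reported in the paper.

```python
MAIZE_GENOME_MB = 2365   # maize genome size as quoted in the slides
REDUCTION_FACTOR = 6     # six-fold reduction from combining both filtration methods

effective_genome_mb = MAIZE_GENOME_MB / REDUCTION_FACTOR
print(f"Effective genome size after filtration: about {effective_genome_mb:.0f} Mb")
```

That puts the filtered sequencing target at just under 400 Mb, in the same range as the 430 Mb rice genome listed in the grass genome table.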