R E P O R T | CROP GENOME ANALYSIS VOLUME 01 | APRIL 2012 | 1
INTRODUCTION
Plants existed long before humans and animals appeared. Through intelligence and
trial-and-error learning, humans gradually developed tools for survival, from the crudest
technology of early human existence to the ever-evolving sophistication of our
post-modern era. This technological evolution was accompanied by an agricultural
evolution of arguably greater significance, because sustaining life (and therefore society)
requires food, which comes mainly from plants [1]. Humans learned the value of plants and,
through experience, selected those that were beneficial in a process called domestication [2].
Plants whose economic importance has led to their cultivation are generally
called crops [3]. Crops are valued for their use as food, medicine,
materials, industrial feedstock, landscaping, and more. To enhance crop quality and yield,
breeders and scientists have worked hard to develop methods that address these
objectives. Such methods depend on a proper understanding of the anatomy, physiology,
genetics, and ultimately all the plant mechanisms responsible for growth and
development, as well as the environmental factors that affect them. In short,
the more we know about a plant and the intricate interplay of the
biological factors that limit or promote its performance, the easier it is to
manipulate particular parameters to obtain the desired phenotype.
We have come a long way in understanding the biological factors and mechanisms
that contribute to plant growth and development, from identifying the hereditary molecules
to the isolation of the first gene [4], to the recent studies of the genome, the
transcriptome, the proteome, and the metabolome. Despite these astonishing
1 Laboratory of Functional Crop Genomics and Biotechnology, Department of Plant Science, College of Agriculture and Life Sciences, Seoul National University, 151-742 Seoul, Korea
2Plant Biotechnology Institute, Department of Life Science, Sahmyook University, 139-742 Seoul, Korea
Email: [email protected]
노이완
R E P O R T
The plant genome: a socio-economic implication
Nomar Espinosa Waminal 1,2
Abstract | Crop genomics has received much attention recently, especially since the completion of
the Arabidopsis thaliana genome sequencing project. Coupled with ever-advancing DNA
sequencing technologies, genomics offers extensive resources to the scientific community.
With several plant genomes now sequenced, novel knowledge has
emerged about plant mechanisms for disease resistance and the biosynthesis underlying
desired traits, areas that directly affect the economics of agriculture. Here, a detailed
review of the genome sequencing and assembly of three socio-economically important plant
species is presented.
advances in the study of plants, we still have a
long way to go in using this vast amount of information to improve
human health and lifestyle and to address recent international concerns about global
warming.
The concerted efforts of scientists from various fields have contributed enormously to
the understanding of the plant genome, its structure, and its function. Unlocking the
genomic DNA sequence and understanding the interplay between DNA and other
biomolecules have a profound impact on downstream research and applications of this
information. The range of downstream applications of a genome
sequence is so wide that any list is bound to be incomplete.
Nevertheless, some of the apparent direct and indirect contributions of a genome
sequence to the scientific community include (i) access to the relatively complete gene
catalogue of a species, the regulatory elements that control gene function, and a
foundation for understanding genome variation; (ii) understanding the structure,
function, and evolution of organisms; (iii) understanding biochemical pathways; (iv)
development of molecular markers to speed up genetic analysis, gene discovery, and
breeding programs for crop improvement; and (v) a framework for further
structural and functional genomics studies of model plants, essential food crops, animal
feed, and energy crops. To date, about 26 plant genomes have been
sequenced [5].
The choice of which plant genome to sequence first is influenced by several factors, such as
sequencing cost, genome size, and genome complexity. These factors weigh more heavily
on the decision than the direct economic significance of the species.
Zea mays L. is an example: it could have been
sequenced after Arabidopsis and rice [6] but was instead sequenced later, after Vitis
vinifera [7] and Populus trichocarpa [8], because of the large amount of repetitive elements in
its genome [9]. Highly repetitive elements make genome assembly computationally
difficult and error-prone [9], especially with next-generation
sequencing (NGS) technologies that produce short reads [10]. However, scientists have
developed approaches that combine long reads (Sanger sequencing) with
NGS reads to produce more reliable results [11]. To date, several important crop species
have been sequenced using whole genome shotgun (WGS) or BAC-by-BAC
sequencing approaches (Table 1), and this number is increasing rapidly as
sequencing technologies and bioinformatics tools are refined.
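The difficulty that repeats pose for short reads can be illustrated with a minimal Python sketch. It is a toy model only, not the procedure of any cited study: the sequences and the names `placements`, `REPEAT`, and `GENOME` are hypothetical, and exact string matching stands in for real read alignment. A read that lies entirely inside a repeat matches every copy of that repeat, whereas a read long enough to reach unique flanking sequence matches only one location.

```python
def placements(genome: str, read: str) -> int:
    """Count the positions at which `read` matches `genome` exactly."""
    width = len(read)
    return sum(
        1
        for start in range(len(genome) - width + 1)
        if genome[start:start + width] == read
    )


# Hypothetical toy data: one 10-bp repeat unit present in two copies,
# separated and flanked by unique sequence.
REPEAT = "GATTACAGGC"
GENOME = "TTGCA" + REPEAT + "CCATGTAGT" + REPEAT + "CATTG"

short_read = REPEAT[1:9]            # 8 bp, lies entirely inside the repeat
long_read = "GCA" + REPEAT + "CCA"  # 16 bp, spans the first copy and its unique flanks

for name, read in [("short", short_read), ("long", long_read)]:
    count = placements(GENOME, read)
    print(f"{name} read ({len(read)} bp): {count} possible placement(s)")
```

In this toy case the short read has two equally valid placements and the long read has one, which is the intuition behind combining longer Sanger reads with short NGS reads.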
Growing knowledge of the plant genome, greatly spurred by
the completion of the Arabidopsis thaliana genome sequence in 2000 [5], has been accompanied by
the incremental evolution of DNA sequencing technologies. The Sanger method
dominated the DNA sequencing industry for nearly two decades and contributed
greatly to the sequencing of many genomes, including the monumental completion of the human
Table 1. Overview of plant genomes that have been sequenced (Adapted from Trends in Plant Science February 2011, Vol. 16, No. 2 and List of sequenced eukaryotic genomes. (2012, April 20). In Wikipedia, April 29, 2012)

| Organism* | Relevance | Genome (Mb) | Chrom. no. (n) | Predicted genes | Sequencing strategy | Organization | Year of completion |
|---|---|---|---|---|---|---|---|
| Dicots | | | | | | | |
| Arabidopsis lyrata | Model plant | 207 | 8 | 32,670 | WGS | DOE-JGI and Max Planck Institute for Developmental Biology | 2011 |
| Arabidopsis thaliana | Model plant | 119 | 5 | 25,498, 27,400, 31,670 | BAC-by-BAC | Arabidopsis Genome Initiative | 2000 |
| Brassica rapa | Crop and model organism | 284 | 10 | 41,174 | WGS | multicenter collaboration | 2011 |
| Cannabis sativa | Hemp and marijuana production | 534 | 10 | 30,074 | WGS | multicenter collaboration | 2011 |
| Cucumis sativus | Vegetable crop | 367 | 7 | 26,682 | WGS | Chinese Academy of Agricultural Sciences, Beijing | 2009 |
| Fragaria vesca | Fruit crop | 280 | 7 | 25,050 | WGS | multicenter collaboration | 2011 |
| Glycine max | Protein and oil crop | 1,100 | 20 | 46,430 | WGS | Purdue University | 2010 |
| Jatropha curcas | Biodiesel crop | 410 | 11 | 40,929 | BAC-by-BAC, WGS | Kazusa DNA Research Institute | 2010 |
| Lotus japonicus | Model legume | 417 | | 30,799 | BAC-by-BAC | multicenter collaboration | 2008 |
| Malus domestica | Fruit tree | 927 | | 57,000 | BAC-by-BAC | International consortium | 2010 |
| Medicago truncatula | Model organism for legume biology | 375 | | 62,388 | BAC-by-BAC | multicenter collaboration | 2011 |
| Populus trichocarpa | Carbon sequestration, model tree, timber | 550 | 19 | 45,555 | WGS | The International Poplar Genome Consortium | 2006 |
| Ricinus communis | Oilseed crop | 320 | | 31,237 | WGS | multicenter collaboration | 2010 |
| Solanum tuberosum | Crop plant | 844 | 12 | 39,031 | WGS | multicenter collaboration | 2011 |
| Thellungiella parvula | Arabidopsis relative with high salt tolerance | 140 | | 28,901 | WGS | multicenter collaboration | 2011 |
| Theobroma cacao | Flavoring crop | 430 | 10 | 28,798 | WGS | CIRAD, multiple institutions (separate project, Mars Inc., USDA) | 2010 |
| Vitis vinifera | Fruit crop | 490 | | 30,434 | WGS | The French-Italian Public Consortium for Grapevine Genome Characterization | 2007 |
| Monocots | | | | | | | |
| Brachypodium distachyon | Model monocot (grass) | 272 | 5 | 26,500 | WGS | The International Brachypodium Initiative | 2010 |
| Oryza sativa ssp indica | Crop and model organism | 420 | 12 | 32-50,000 | WGS | Beijing Genomics Institute, Zhejiang University and the Chinese Academy of Sciences | 2002 |
| Oryza sativa ssp japonica | Crop and model organism | 466 | 12 | 58,000 | BAC-by-BAC | Syngenta and Myriad Genetics | 2002 |
| Oryza glaberrima | West African species of cultivated rice that was domesticated independently of Asian rice | 316 | 12 | ND† | BAC pooling and WGS | Arizona Genomics Institute | 2010 |
| Phoenix dactylifera | Fruit tree (palm) | 658 | 36 | >25,000 | WGS | Genomics Core, Qatar | 2011 |
| Sorghum bicolor | Crop plant | 730 | 10 | 27,640 | WGS | Multiple institutions | 2009 |
| Zea mays | Cereal crop | 2,800 | 10 | 63,300 | BAC-by-BAC | NSF | 2009 |

†ND: no data
*species in bold font are discussed in detail in this review
genome [12]. However, limitations of this technology (chiefly low throughput and high cost)
have fueled the need for more advanced sequencing technologies that
can produce enormous amounts of sequence data in less time and at lower cost. The
result is a shift from the traditional ‘first-generation’ technology
of automated Sanger sequencing to the more advanced ‘next-generation’ sequencing [12].
Recent development and refinement of these technologies have initiated
‘third-generation’ sequencing, with high accuracy, longer read lengths, very high coverage, and
fast data acquisition [13].
Here, I review the genome sequencing results for three recently sequenced crops:
Cucumis sativus, Jatropha curcas, and Theobroma cacao, covering the sequencing
strategies used, genomic structure and arrangement, novel biosynthetic pathways,
species-specific genes, implications for species evolution, and other areas of functional genomics.
REVIEWED GENOMES
Cucumis sativus L. (cucumber) belongs to the family Cucurbitaceae which includes
many economically significant species. It has served as a model system for sex
determination studies [14]. The cucurbits are also plant models for vascular biology studies
because xylem and phloem sap are easily collected for long-distance signaling studies.
Jatropha curcas L. (jatropha) belongs to the Euphorbiaceae family. It has
potential for various uses, including biofuels, because of its high oil yield per unit area, which
is second only to that of oil palm. This holds great promise for reducing the problems caused
by the continued consumption of fossil fuels, chiefly global warming.
Theobroma cacao L. (cocoa tree), the Criollo variety, is an important crop for producing
chocolate products. However, fine-cocoa production accounts for less than 5% of
production globally. This is mainly because fine-flavor cocoa varieties are susceptible
to fungal, oomycete, and viral diseases and to insect pests.
Breeding of improved Criollo varieties is needed for sustained
production of fine-flavor cocoa.
Despite the great economic significance of these crops, as partially described above,
genomic resources for them have been limited, which hinders research
toward their respective objectives. Unlocking their genomic sequences should
open new frontiers for understanding their structure, function, and the control of
desired traits. Consequently, independent organizations have initiated and completed the
sequencing of their genomes (Table 1).
STRUCTURAL GENOMICS
Sequencing and assembly
The whole genome shotgun (WGS) sequencing approach was used for all three
genomes, with different platforms and methods used to improve the quality of the
assembled sequences. The cucumber genome was sequenced by WGS with a combination of
Sanger and NGS (GA by Illumina) sequencing. This combination
produced longer contig and scaffold N50 values than separate assemblies of the
reads from either technology (Table 2). For J. curcas genome sequencing, a
combination of BAC end sequencing and shotgun sequencing was employed using the
Table 2. Genome assembly statistics of C. sativus

| Assembly | Contig N50 (kb) | Contig total (Mb) | Scaffold N50 (kb) | Scaffold total (Mb) |
|---|---|---|---|---|
| Sanger | 2.6 | 204 | 19 | 238 |
| Illumina GA | 12.5 | 190 | 172 | 200 |
| Sanger + Illumina GA | 19.8 | 226.5 | 1,140 | 243.5 |

conventional Sanger method and the NGS (GS-FLX by Roche/454 and GA by Illumina). For
N50: the sequence size above which half of the total length of the sequence set can be found.
T. cacao, the sequencing strategy used was WGS incorporating Sanger and NGS platforms
(GS-FLX by Roche/454 and GA by Illumina). Different software was used to assemble the
genomic sequences of the three species (Table 3). The combination of the conventional
Table 3. Summary of information about the genome, sequencing strategy and genome assembly statistics of the three reviewed species

| | Jatropha curcas | Theobroma cacao | Cucumis sativus |
|---|---|---|---|
| Genome size (Mb) | ~410 [15] | ~430 [18] | ~367 [11] |
| Ploidy, chrom. no. | 2n=2x=22 | 2n=2x=20 | 2n=2x=14 |
| Date of completion | 2010 | 2010 | 2009 |
| Date of publication, Journal | 2011, DNA Research | 2011, Nature Genetics | 2009, Nature Genetics |
| Sequencing/Funding Institution | Kazusa DNA Research Institute Foundation, Japan | The International Cocoa Genome Sequencing Consortium-ICGS, coordinated by CIRAD | Chinese Academy of Agricultural Sciences, Beijing, China |
| Sequencing strategy | BAC-by-BAC and WGS | WGS | WGS |
| Sequencing method | Sanger: for shotgun libraries and BAC ends; NGS: GS-FLX (Roche, USA), GA II (Illumina, USA) | Sanger: for BAC ends only; NGS: GS-FLX, GA II | Sanger: for BAC, plasmid, and fosmid sequencing; NGS: GA II |
| Assembly program | PCAP.rep and MIRA | Newbler version 2.3, SOAP | RePS2 |
| Total length of assembled genome | 285.9 Mb | 326.9 Mb | 243.5 Mb |
| Percent of genome covered by the assembly | ~70% (if based on ~410 Mb genome); ~75% (if based on ~380 Mb genome) | ~76% (based on genotype B97-61/B2, 430 Mb) | ~66% (based on ‘Chinese long’ inbred line 9930) |
| Coverage depth of raw data | ND | 61.1x | Total: 72.2x |
| Gene space covered | 95% | 97.8% | 96.8% |
| Sequence anchored to chromosomes | ND | 67% | 72.8% |
| Contig: total number | 120,586 | 25,912 | 62,410 |
| Contig: total length (Mb) | 276.7 | 291.4 | 226.4 |
| Contig: average length (kb) | 2.3 | 11.2 | ND |
| Contig: longest (kb) | 29.7 | 190 | ND |
| Contig: N50 (kb) | 3.8 | 19.8 | 19.8 |
| Scaffold: total number | 15,300 | 4,792 | 47,837 |
| Scaffold: total length (Mb) | 129.3 | 326.9 | 243.5 |
| Scaffold: average length (kb) | 8.4 | 68.2 | ND |
| Scaffold: longest (kb) | 56 | 3,145 | ND |
| Scaffold: N50 (kb) | ND | 473.8 | 1,144 |

ND: no data

Sanger and NGS strategies proved superior to using either technology
alone because each compensates for the shortcomings of the other, allowing
high-quality sequences to be obtained at lower cost and in less time; this has
made the combined approach popular for sequencing eukaryotes [15].
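The two summary statistics used throughout this section, N50 and the percentage of the genome covered by the assembly, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names `n50` and `percent_covered` and the toy contig lengths are hypothetical, the N50 convention used is the common "smallest length such that sequences at least that long hold half of the total", and the Mb values are those listed in Table 3.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are taken from longest to shortest; the N50 is the length of
    the sequence at which the running total first reaches half of the
    summed length of the whole set.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def percent_covered(assembled_mb, genome_mb):
    """Assembled length as a percentage of the estimated genome size."""
    return 100 * assembled_mb / genome_mb


# Hypothetical contig lengths (kb), chosen only to make the arithmetic easy to follow.
toy_contigs_kb = [80, 70, 50, 40, 30, 20, 10]
print("toy N50 (kb):", n50(toy_contigs_kb))

# (assembled length, estimated genome size) in Mb, as listed in Table 3.
TABLE3_MB = {
    "J. curcas": (285.9, 410),
    "T. cacao": (326.9, 430),
    "C. sativus": (243.5, 367),
}
for species, (assembled, genome) in TABLE3_MB.items():
    print(f"{species}: {percent_covered(assembled, genome):.1f}% of the genome assembled")
```

For the toy set the total is 300 kb, so the two longest contigs (80 + 70 kb) already reach half of it and the N50 is 70 kb. A larger N50 therefore indicates that more of the assembly sits in long pieces, which is why the combined Sanger and Illumina assembly in Table 2 is considered the better one.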
Linkage analysis
The consensus genetic map of T. cacao was created using two mapping populations,
while 77 recombinant inbred lines from an inter-subspecific cross between Gy14 and
PI183967 were used for cucumber. No linkage analysis data were provided for J.
curcas. About the same percentage of molecular markers was aligned to the newly
assembled genomic sequences of C. sativus and T. cacao (Table 4). Interestingly,
in cucumber, regions of recombination suppression were found after comparing
Gy14: a North American processing market-type cucumber cultivar.
PI183967: an accession of C. sativus var. hardwickii originating from India.
the genetic and physical maps. These comprise two 10-Mb regions at either end of
chromosome 4, a 20-Mb region on chromosome 5, and an 8-Mb region on chromosome 7
(Fig. 1a). Further FISH analysis revealed a segmental inversion on chromosome 5 between
Gy14 and PI183967 (Fig. 1b). This chromosomal inversion helps explain the recombination
suppression in these regions and adds insight into cucumber evolution during
domestication.
Table 4. Summary of the linkage analysis

| Species | Mapping population | Total length (cM) | No. mol. markers | Aligned markers | Anchored sequences to chromosomes (%) |
|---|---|---|---|---|---|
| T. cacao | 2 | 750.6 | 1,259 | 1,192 (94%) | 67 |
| C. sativus | 77 | 581 | 1,885 | 1,763 (93.5%) | 72.8 |

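The statement that about the same percentage of markers was aligned in both species can be checked directly from the marker counts in Table 4. The short Python sketch below is only a worked calculation; the names `LINKAGE` and `aligned_percent` are hypothetical and the counts are copied from the table.

```python
# (total molecular markers, markers aligned to the assembly), from Table 4.
LINKAGE = {
    "T. cacao": (1259, 1192),
    "C. sativus": (1885, 1763),
}


def aligned_percent(total_markers, aligned_markers):
    """Share of genetic-map markers that could be placed on the assembly."""
    return 100 * aligned_markers / total_markers


for species, (total, aligned) in LINKAGE.items():
    print(f"{species}: {aligned_percent(total, aligned):.1f}% of markers aligned")
```

Both values come out within about one percentage point of each other, in line with the figures given in Table 4.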
Figure 1. The integrated genetic and physical maps of cucumber. (a) Genetic distance vs. physical distance of the seven
cucumber chromosomes. The brackets denote the regions of recombination suppression. (b) Detection of segmental
inversion on chromosome 5 between Gy14 and PI183967 through FISH (12-7 and 12-2 are fosmid clones used as probes).
Bar = 5μm.
FISH: fluorescence in situ hybridization, a cytogenetic technique that is used to detect and localize the presence or absence of specific DNA sequences on chromosomes.
Repetitive sequences
Relative to genome size, J. curcas has the highest proportion of transposable elements (36.6%
of the 410 Mb genome), followed by C. sativus (24% of the 367 Mb genome) and T. cacao (24% of
the 430 Mb genome) (Table 5). In all three species, Class I transposable elements
(retrotransposons) make up the majority of the repeat sequences in the genome. In J. curcas,
Table 5. Summary of the transposable elements identified in the three species

| Element | C. sativus: number of elements | C. sativus: fraction of the genome (%) | J. curcas: number of elements | J. curcas: fraction of the genome (%) | T. cacao: number of elements | T. cacao: fraction of the genome (%) |
|---|---|---|---|---|---|---|
| Class I | 119,339 | 12.16 | 113,047 | 29.91 | 49,942 | ND |
| LTR: all families* | (91,109) | (10.43) | | | | |
| LTR: copia | | | 31,740 | 8.03 | 18,060 | |
| LTR: gypsy | | | 67,658 | 19.6 | 12,622 | |
| LTR: other | | | 13,454 | 2.23 | 19,260 | |
| Others | 20,119 | 1.75 | 195 | 0.05 | | |
| Class II | 16,972 | 1.24 | 25,977 | 2.04 | 21,882 | ND |
| Others | 135,464 | 11.64 | 28,069 | 5.22 | ND | ND |
| TOTAL | 266,232 | 24.01 | 152,805 | 36.6 | 67,575 | ~24 |

*Values in parentheses denote total value of all LTR families.

there are more gypsy-type than copia-type retrotransposons, the opposite of the pattern
in T. cacao. In T. cacao, a copia-like LTR retrotransposon named Gaucho, 11,297 bp long and
repeated approximately 1,100 times, was identified and, when hybridized through FISH, was
found to occupy most of the interstitial regions (regions between centromeres and
telomeres) of the chromosome arms (Fig. 2b). Additionally, a 212-bp repeat named ThCen
was confirmed by FISH analysis to be centromere-specific (Fig. 2a), and
it may have contributed to genome size variation in T. cacao.
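To put the repeat fractions in absolute terms, the percentages can be converted to megabases, and the J. curcas Class I breakdown in Table 5 can be checked for internal consistency. The Python sketch below is a worked calculation only: the names are hypothetical, the fractions are applied to the estimated genome sizes as the text above does, and the T. cacao value is approximate because the table lists it as ~24.

```python
# Estimated genome size (Mb) and repeat fraction (%), from the text and Table 5.
GENOME_MB = {"J. curcas": 410, "C. sativus": 367, "T. cacao": 430}
REPEAT_PERCENT = {"J. curcas": 36.6, "C. sativus": 24.01, "T. cacao": 24.0}  # T. cacao listed as ~24


def repeat_mb(species):
    """Approximate amount of transposable-element sequence in Mb."""
    return GENOME_MB[species] * REPEAT_PERCENT[species] / 100


for species in GENOME_MB:
    print(f"{species}: about {repeat_mb(species):.0f} Mb of transposable elements")

# J. curcas Class I families (number of elements), from Table 5.
J_CURCAS_CLASS_I = {"LTR: copia": 31740, "LTR: gypsy": 67658, "LTR: other": 13454, "Others": 195}
print("J. curcas Class I total:", sum(J_CURCAS_CLASS_I.values()))
```

Under these assumptions J. curcas carries roughly 150 Mb of transposable elements, against roughly 88 Mb in C. sativus and 103 Mb in T. cacao, and the four J. curcas Class I families sum to the 113,047 elements reported for Class I as a whole.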
Figure 2. FISH analysis of T. cacao repetitive sequences. (a) T. cacao chromosomes counterstained
with DAPI (blue) with ThCen (red) used as probe. (b) hybridization of ThCen (red) and Gaucho LTR
retrotransposon probes (green).
Gene content
Not all RNA-encoding genes were reported for the three species (Table 6). Because of the
limitations of the sequencing methods, the number of genes, especially ribosomal
RNA-encoding genes, may be largely underestimated. In T. cacao, only six fragments of rRNA
genes were recovered, far fewer than the average number of repeats found in most
Table 6. Summary of RNA-coding genes in the three species
eukaryotes, as can be observed through FISH analysis using rDNA as probes [16].
MicroRNAs (miRNAs) are short non-coding RNAs that regulate gene expression
transcriptionally or post-transcriptionally. Many miRNAs have roles in plant development
and stress response. In T. cacao, most of the predicted miRNAs have homologous
transcription factor sequences, suggesting that miRNAs are major regulators of gene
expression in T. cacao.
Three gene-prediction methods, namely ab initio prediction, cDNA-EST evidence, and
homology searches, were used with gene-finder software and public databases to identify
protein-coding genes in the three species (Table 7). Comparison of the gene families with those of other
sequenced genomes revealed 682 T. cacao-specific and 4,362 C. sativus-specific gene
families, while 1,529 genes were found to be specific to the family Euphorbiaceae, to which J.
curcas belongs.
Table 7. Summary of the gene-prediction analysis of the three sequenced genomes
FUNCTIONAL GENOMICS
Disease resistance-related genes
Resistance genes (R genes) are subdivided into two classes: the nucleotide-binding
site leucine-rich repeat (NBS-LRR) class and the receptor protein kinase (RPK) class
[17]. A total of 297, 92, and 61 NBS genes were identified in T. cacao, J. curcas,
and C. sativus, respectively. Three major resistance gene families were identified in T. cacao:
NBS, LRR-RLK, and NPR1, all of which have been mapped onto the ten chromosomes.
C. sativus may use alternative mechanisms to confer resistance to
pathogens. The relatively small number of NBS genes in its genome, compared with the
two other species and with Arabidopsis (200), poplar (398), and rice (600) [8], is compensated
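Table 6 values:

| RNA type | C. sativus | J. curcas | T. cacao |
|---|---|---|---|
| rRNA | 292 | ND | 6 |
| tRNA | 699 | 597 | 473 |
| miRNA | 171 | ND | 83 |
| snoRNA | 238 | ND | ND |
| snRNA | 192 | 65 | ND |
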
Table 7 values:

| Species | Gene-prediction methods | Protein-coding region search programs | Similarity searches database | No. of predicted genes | Mean coding sequence size (bp) | Mean exon size (bp) | Mean intron size (bp) | Mean exons per gene |
|---|---|---|---|---|---|---|---|---|
| C. sativus | ab initio, homology search, cDNA-EST | GlimmerHMM, Genscan, Augustus, BGF, SNAP | Arabidopsis, Papaya, Poplar, Grapevine, Rice | 26,682 | 1,046 | 238 | 483 | 4.39 |
| J. curcas | ab initio, homology search, cDNA-EST | GeneMark.hmm, Genscan | Uniref, TrEMBL | 40,929 | 3,064 | 227 | 356 | ND |
| T. cacao | ab initio, homology search, cDNA-EST | EUGene, SpliceMachine | Swiss-Prot, TAIR, Malvaceae, GenBank, Glycine max, T. cacao EST | 28,798 | 3,346 | 231 | 6,319 | 5.03 |

by the expansion of its lipoxygenase (LOX) pathways, which produce short-chain aldehydes
and alcohols involved in plant defense. In addition, the eukaryotic translation
initiation factors confer recessive resistance to plant viral infections. Three EIF4E and EIF4G
genes that encode the eIF4E and eIF4G proteins, respectively, have been identified in the C.
sativus genome, another mechanism that may compensate for its reduced NBS-R-mediated pathogen
resistance.
Genes for desired traits in respective species
One of the foremost goals of genome sequencing is the elucidation of putative gene
families responsible for the quality traits that contribute to a species’
economic significance. In the case of T. cacao, gene families directly responsible
for the fine quality and high yield of cocoa are of great interest. For J. curcas, gene families
responsible for oil production are highly valued. For C. sativus, a
model for plant vascular biology and the study of sex expression, among others, gene
families responsible for these traits are of great interest.
Genes for triacylglycerol (TAG) biosynthesis are relevant to biodiesel production, and J. curcas
can biosynthesize and accumulate considerable amounts of TAGs in its seeds. To
improve Jatropha oil quality for biodiesel, fatty acid synthesis can be modified
by altering the genes involved, as predicted in the newly
sequenced genome. Moreover, J. curcas is known to produce tumor-promoting phorbol
esters. Lowering the expression of the genes responsible for phorbol
ester production in high oil-yielding lines will promote the safe use of J. curcas for biodiesel
production.
Oils, proteins, starch, flavonoids, alkaloids, and terpenoids are principal components
affecting the flavor and quality of cocoa. The unique fatty acid profile of cocoa butter
enhances the aroma of chocolates and confectioneries. A total of 84 orthologous
genes potentially involved in lipid biosynthesis were discovered, along with 96 genes
involved in flavonoid biosynthesis and 57 genes that encode terpene synthases.
Cucurbits are known for producing cucurbitacin, a secondary
metabolite that repels insects but also attracts specific insects for pollination. Four oxidosqualene
cyclase (OSC) genes responsible for cucurbitacin production were identified in
C. sativus. Moreover, 137 cucumber genes related to the biosynthesis of ethylene, a
compound that stimulates femaleness in cucumber, have been identified. Auxin also
regulates sex expression, and six auxin-related genes were identified in the C. sativus
genome. Finally, three short-chain dehydrogenase/reductase genes homologous to
the ts2 sex-determination gene in maize (Zea mays) have been identified.
The discovery of these genes will surely spur downstream studies directed to the
improvement of varieties that would eventually address social and economic issues. A
relevant example would be the issue of global warming.
Genome evolution and comparative genomics
Eudicots are known to have undergone paleo-hexaploidization followed by
lineage-specific whole genome duplication (WGD) events. It has been suggested that the T.
cacao genome underwent 11 major chromosome fusions from the 21 chromosomes of the
paleo-hexaploid ancestor to produce the present 10 chromosomes (Fig. 3). In contrast,
collinear gene-order analysis of C. sativus revealed no recent WGD but some
segmental duplication events. Additionally, comparative genomics between C. sativus
and its close relative C. melo (melon) suggests possible pairwise fusions
among ten ancestral chromosomes to form five (chromosomes
1, 2, 3, 5, and 6) of the seven present chromosomes of C. sativus (Fig. 4). In T. cacao,
seven blocks of duplicated genes were characterized after alignment of its gene models
onto its genome (Fig. 5).
Figure 3. Evolutionary model of T. cacao. The eudicot ancestor chromosomes are presented in
seven colors. The several lineage-specific shuffling events have shaped the present eudicot
genomes. R: rounds of WGD, F: chromosomal fusions.
Although attributing genomic evolutionary history to common ancestry
may be enticing, perhaps because chromosomal segments are
rearranged after breeding, as in the case of C. sativus, I suggest that alternative approaches
be considered. Given that nobody has lived a thousand years, let alone a
million, and that homologous segments do not always mean common ancestry but
may alternatively mean common design and function, I recommend further unbiased research
on genomic history to unlock the mechanisms that underlie the
control of genomic fusion, inversion, translocation, etc., and to test to what extent these
events limit the sustenance of life. Do these similarities really mean common ancestry, or
common functions and/or regulatory mechanisms?
Figure 4. Comparative genomics between melon and cucumber, showing chromosomes 1, 2, 3, 5, and 6 of cucumber largely syntenic to
two chromosomes of melon.
Microsynteny with other genomes
As expected, a greater degree of synteny (53% of the assembled scaffolds)
was observed between J. curcas and Ricinus communis, both in the Euphorbiaceae family,
whereas less synteny was observed with distantly related (or
functionally less related) species such as Glycine max (11%) and
Arabidopsis thaliana (16%). Meanwhile, 54% of the BAC
sequences of melon were aligned to C. sativus. Moreover,
628, 540, 1,106, 772, and 795 syntenic blocks were identified
between C. sativus and A. thaliana, Carica papaya, Populus
trichocarpa, Vitis vinifera, and Oryza sativa, respectively.
The highly syntenic genomes of C. sativus and C. melo
will help the genetic analysis of C. melo, now that the
genome of C. sativus has been sequenced. They will
also help advance studies of phylogenetic relationships. Collectively, syntenic
relationships among dicots, and among plants in a broader sense, will help in gene
prediction and eventually aid in understanding the relationship between sequence
similarity and function, and the limitations of the theory of ancestry as the sole
explanation for sequence similarity.
CONCLUSION
Recent advances in sequencing technologies have revolutionized our experimental
approaches to the study of plants. They have also shifted major scientific questions from ‘How do we
sequence a genome?’ to ‘Which platform is best for sequencing a particular
genome of interest?’ They have allowed scientists to study crops holistically at the genomic,
transcriptomic, proteomic, and metabolomic levels, and will aid in crop
improvement and in understanding phylogenies and metabolic pathways, among others. The
direct and indirect consequences of genome sequencing ultimately come down to
improvements in people's economic conditions and lifestyle that will hopefully be well
distributed globally. The doors opened to science by these advanced
technologies are many. The main goal should be to use these technologies for the
benefit of both humans and Mother Earth, and not solely for humans.
Figure 5. Duplicated gene segments of
T. cacao. The seven colors represent the
seven ancestral eudicot linkage groups.
REFERENCES (Those in bold font refer to the main articles for the three species reviewed here.)
1. Yang TS. 2012. Plant and Culture:Another Interpretation of Human History. Journal of Jishou
University(Social Sciences)33(1): 1-7.
2. Hirst KK. Plant Domestication: Table of Dates and Places.
http://archaeology.about.com/od/domestications/a/plant_domestic.htm
3. Crop. 2012. March 30. In Wikipedia, The Free Encyclopedia. Retrieved 04:36, April 27, 2012, from
http://en.wikipedia.org/w/index.php?title=Crop&oldid=484675363
4. Shapiro J, Machattie L, Eron L, Ihler G, Ippen K and Beckwith J. 1969. Isolation of Pure lac Operon
DNA. Nature 224, 768 – 774.
5. Feuillet C, Leach JE, Rogers J, Schnable PS and Eversole K. 2011. Crop genome sequencing:
lessons and rationales. Trends in Plant Science 16:77-88.
6. Messing J, Bharti AK, Karlowski WM, Gundlach H, Kim HR, Yu Y, Wei F, Fuks G, Soderlund CA,
Mayer KFX, and Wing RA. 2004. Sequence composition and genome organization of maize.
PNAS 101: 14349–14354.
7. Jaillon O, Aury JM, Noel B, et al. 2007. The grapevine genome sequence suggests ancestral
hexaploidization in major angiosperm phyla. Nature 449 (7161): 463–467.
8. Tuskan GA, Difazio S, Jansson S, et al. 2006. The genome of black cottonwood, Populus
trichocarpa (Torr. & Gray). Science 313 (5793): 1596–604.
9. Haberer G, Young S, Bharti AK, Gundlach H, Raymond C, Fuks G, Butler E, Wing RA, Rounsley S,
Birren B, Nusbaum C, Mayer KF, and Messing J. 2005. Structure and architecture of the maize
genome. Plant Physiol. 139:1612-1624.
10. Schatz MC et al. 2010. Assembly of large genomes using second-generation sequencing. Genome
Res. 20, 1165–1173.
11. Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, et al., 2009. The genome of the cucumber,
Cucumis sativus L. Nature Genetics 41:1275–1281.
12. Metzker ML. 2010. Sequencing technologies—the next generation. Nature Reviews: Genetics
11:31-46.
13. Hayden EC. 2009. Genome sequencing: the third generation. Nature 457:768-769.
14. Tanurdzic M and Banks JA. 2004. Sex-determining mechanisms in land plants. Plant Cell 16, S61–
S71.
15. Sato S et al., 2011. Sequence analysis of the genome of an oil-bearing tree, Jatropha curcas
L. DNA Research 18:65-76.
16. Waminal NE, Kim NS and Kim HH. 2011. Dual‐color FISH karyotype analyses using rDNAs in
three Cucurbitaceae species. Genes and Genomics. 33: 517-524.
17. Afzal AJ, Wood AJ and Lightfoot DA. 2008. Plant receptor-like serine threonine kinases: roles in
signaling and plant defense. Mol. Plant Microbe Interact. 21, 507–517.
18. Argout X, et al., 2011. The genome of Theobroma cacao. Nature Genetics 43:101-108.