
Analysis of the genome sequence of the flowering plant Arabidopsis thaliana

The Arabidopsis Genome Initiative*

* Authorship of this paper should be cited as ‘The Arabidopsis Genome Initiative’. A full list of contributors appears at the end of this paper.

The flowering plant Arabidopsis thaliana is an important model system for identifying genes and determining their functions. Here we report the analysis of the genomic sequence of Arabidopsis. The sequenced regions cover 115.4 megabases of the 125-megabase genome and extend into centromeric regions. The evolution of Arabidopsis involved a whole-genome duplication, followed by subsequent gene loss and extensive local gene duplications, giving rise to a dynamic genome enriched by lateral gene transfer from a cyanobacterial-like ancestor of the plastid. The genome contains 25,498 genes encoding proteins from 11,000 families, similar to the functional diversity of Drosophila and Caenorhabditis elegans, the other sequenced multicellular eukaryotes. Arabidopsis has many families of new proteins but also lacks several common protein families, indicating that the sets of common proteins have undergone differential expansion and contraction in the three multicellular eukaryotes. This is the first complete genome sequence of a plant and provides the foundations for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify genes for crop improvement.

The plant and animal kingdoms evolved independently from unicellular eukaryotes and represent highly contrasting life forms. The genome sequences of C. elegans (ref. 1) and Drosophila (ref. 2) reveal that metazoans share a great deal of genetic information required for developmental and physiological processes, but these genome sequences represent a limited survey of multicellular organisms. Flowering plants have unique organizational and physiological properties in addition to ancestral features conserved between plants and animals. The genome sequence of a plant provides a means for understanding the genetic basis of differences between plants and other eukaryotes, and provides the foundation for detailed functional characterization of plant genes.

Arabidopsis thaliana has many advantages for genome analysis, including a short generation time, small size, large number of offspring, and a relatively small nuclear genome. These advantages promoted the growth of a scientific community that has investigated the biological processes of Arabidopsis and has characterized many genes (ref. 3). To support these activities, an international collaboration (the Arabidopsis Genome Initiative, AGI) began sequencing the genome in 1996. The sequences of chromosomes 2 and 4 have been reported (refs 4, 5), and the accompanying Letters describe the sequences of chromosomes 1 (ref. 6), 3 (ref. 7) and 5 (ref. 8).

Here we report analysis of the completed Arabidopsis genome sequence, including annotation of predicted genes and assignment of functional categories. We also describe chromosome dynamics and architecture, the distribution of transposable elements and other repeats, the extent of lateral gene transfer from organelles, and the comparison of the genome sequence and structure to that of other Arabidopsis accessions (distinctive lines maintained by single-seed descent) and plant species. This report is the summation of work by experts interested in many biological processes selected to illuminate plant-specific functions including defence, photomorphogenesis, gene regulation, development, metabolism, transport and DNA repair.

The identification of many new members of receptor families, cellular components for plant-specific functions, genes of bacterial origin whose functions are now integrated with typical eukaryotic components, independent evolution of several families of transcription factors, and suggestions of as yet uncharacterized metabolic pathways are a few more highlights of this work. The implications of these discoveries are not only relevant for plant biologists, but will also affect agricultural science, evolutionary biology, bioinformatics, combinatorial chemistry, functional and comparative genomics, and molecular medicine.

Overview of sequencing strategy

We used large-insert bacterial artificial chromosome (BAC), phage (P1) and transformation-competent artificial chromosome (TAC) libraries (refs 9–12) as the primary substrates for sequencing. Early stages of genome sequencing used 79 cosmid clones. Physical maps of the genome of accession Columbia were assembled by restriction fragment ‘fingerprint’ analysis of BAC clones (ref. 13), by hybridization (ref. 14) or polymerase chain reaction (PCR) (ref. 15) of sequence-tagged sites and by hybridization and Southern blotting (ref. 16). The resulting maps were integrated (http://nucleus/cshl.org/arabmaps/) with the genetic map and provided a foundation for assembling sets of contigs into sequence-ready tiling paths. End sequence (http://www.tigr.org/tdb/at/abe/bac_end_search.html) of 47,788 BAC clones was used to extend contigs from BACs anchored by marker content and to integrate contigs.

Ten contigs representing the chromosome arms and centromeric heterochromatin were assembled from 1,569 BAC, TAC, cosmid and P1 clones (average insert size 100 kilobases (kb)). Twenty-two PCR products were amplified directly from genomic DNA and sequenced to link regions not covered by cloned DNA or to optimize the minimal tiling path. Telomere sequence was obtained from specific yeast artificial chromosome (YAC) and phage clones, and from inverse polymerase chain reaction (IPCR) products derived from genomic DNA. Clone fingerprints, together with BAC end sequences, were generally adequate for selection of clones for sequencing over most of the genome. In the centromeric regions, these physical mapping methods were supplemented with genetic mapping to identify contig positions and orientation (ref. 17).
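The idea of a minimal tiling path can be illustrated with a greedy interval-cover sketch. This is not the procedure the sequencing centres used (clone selection relied on fingerprints, BAC end sequences and genetic mapping); it only shows how a small set of overlapping clones can be chosen once their coordinates are known. The function name, clone names and coordinates are hypothetical.

```python
def minimal_tiling_path(clones, start, end):
    """Pick a small set of overlapping clones that covers [start, end).

    clones: list of (name, begin, stop) tuples in chromosome coordinates.
    Greedy rule: among clones starting at or before the position covered
    so far, always take the one that reaches furthest.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path, pos, i = [], start, 0
    while pos < end:
        best = None
        # Scan every clone that overlaps or abuts the region covered so far.
        while i < len(ordered) and ordered[i][1] <= pos:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= pos:
            raise ValueError(f"no clone extends coverage past position {pos}")
        path.append(best[0])
        pos = best[2]
    return path


# Hypothetical clone names and coordinates (kb), for illustration only.
clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 150),
          ("E", 120, 260), ("D", 170, 300)]
print(minimal_tiling_path(clones, 0, 300))  # ['A', 'B', 'D']
```

A gap in coverage raises an error, which loosely mirrors the situation in which PCR products had to be amplified from genomic DNA to link regions not covered by cloned DNA.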

796 NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com © 2000 Macmillan Magazines Ltd

Selected clones were sequenced on both strands and assembled using standard techniques. Comparison of independently derived sequence of overlapping regions and independent reassembly of sequenced clones revealed accuracy rates between 99.99 and 99.999%. Over half of the sequence differences were between genomic and BAC clone sequence. All available sequenced genetic markers were integrated into sequence assemblies to verify sequence contigs (refs 4–8). The total length of sequenced regions, which extend from either the telomeres or ribosomal DNA repeats to the 180-base-pair (bp) centromeric repeats, is 115,409,949 bp (Table 1). Estimates of the unsequenced centromeric and rDNA repeat regions measure roughly 10 megabases (Mb), yielding a genome size of about 125 Mb, in the range of the 50–150 Mb haploid content estimated by different methods (ref. 18). In general, features such as gene density, expression levels and repeat distribution are very consistent across the five chromosomes (Fig. 1), and these are described in detail in reports on individual chromosomes (refs 4–8) and in the analysis of centromere, telomere and rDNA sequences.
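The genome-size estimate and the accuracy range quoted above are simple arithmetic, sketched here for reference. The 10 Mb figure is the rough estimate given in the text, so the results are approximate; the variable and function names are illustrative.

```python
from fractions import Fraction

SEQUENCED_BP = 115_409_949    # total sequenced length (Table 1)
UNSEQUENCED_BP = 10_000_000   # rough estimate for centromeric and rDNA repeats

genome_bp = SEQUENCED_BP + UNSEQUENCED_BP
print(f"estimated genome size: {genome_bp / 1e6:.1f} Mb")
print(f"fraction sequenced: {SEQUENCED_BP / genome_bp:.1%}")


def bases_per_error(accuracy_percent):
    """Average number of bases per error, for an accuracy given as a percent string."""
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return 1 / error_rate


for acc in ("99.99", "99.999"):
    print(f"{acc}% accuracy: about one error per {int(bases_per_error(acc)):,} bases")
```

At the quoted accuracy range, a clone with a 100-kb insert would be expected to carry between roughly one and ten discrepancies.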

We used tRNAscan-SE 1.21 (ref. 19) and manual inspection to identify 589 cytoplasmic transfer RNAs, 27 organelle-derived tRNAs and 13 pseudogenes, more than in any other genome sequenced to date. All 46 tRNA families needed to decode all possible 61 codons were found, defining the completeness of the functional set. Several highly amplified families of tRNAs were found on the same strand (ref. 6); excluding these, each amino acid is decoded by 10–41 tRNAs.
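The figure of 61 codons follows from the standard genetic code: 64 possible nucleotide triplets minus the three stop codons. A minimal check, assuming the standard code:

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code

all_codons = ["".join(triplet) for triplet in product("ACGT", repeat=3)]
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons), len(sense_codons))  # 64 61
```

Fewer tRNA families (46) than sense codons (61) suffice because a single anticodon can pair with more than one codon through wobble base pairing.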

The spliceosomal RNAs (U1, U2, U4, U5, U6) have all been experimentally identified in Arabidopsis. The previously identified sequences for all RNAs were found in the genome, except for U5 where the most similar counterpart was 92% identical. Between 10 and 16 copies of each small nuclear RNA (snRNA) were found across all chromosomes, dispersed as singletons or in small groups.

The small nucleolar RNAs (snoRNAs) consist of two subfamilies, the C/D box snoRNAs, which include 36 Arabidopsis genes, and the H/ACA box snoRNAs, for which no members have been identified in Arabidopsis. U3 is the most numerous of the C/D box snoRNAs, with eight copies found in the genome. We identified forty-five additional C/D box snoRNAs using software (www.rna.wustl.edu/snoRNAdb/) that detects snoRNAs that guide ribose methylation of ribosomal RNA.

A combination of algorithms, all optimized with parameters based on known Arabidopsis gene structures, was used to define gene structure. We used similarities to known protein and expressed sequence tag (EST) sequence to refine gene models. Eighty per cent of the gene structures predicted by the three centres involved were completely consistent, 93% of ESTs matched gene models, and less than 1% of ESTs matched predicted non-coding regions, indicating that most potential genes were identified. The sensitivity and selectivity of the gene prediction software used in this report have been comprehensively and independently assessed (ref. 20).

[Figure 1 image not reproduced. Recoverable labels: scale bar, 100 kb; pseudo-colour spectrum from high density to low density; Chr. 1, 29.1 Mb; Chr. 2, 19.6 Mb; Chr. 3, 23.2 Mb; Chr. 4, 17.5 Mb; Chr. 5, 26.0 Mb; tracks shown for each chromosome: Genes, ESTs, TEs, MT/CP, RNAs.]

Figure 1 Representation of the Arabidopsis chromosomes. Each chromosome is represented as a coloured bar. Sequenced portions are red, telomeric and centromeric regions are light blue, heterochromatic knobs are shown black and the rDNA repeat regions are magenta. The unsequenced telomeres 2N and 4N are depicted with dashed lines. Telomeres are not drawn to scale. Images of DAPI-stained chromosomes were kindly supplied by P. Fransz. The frequency of features was given pseudo-colour assignments, from red (high density) to deep blue (low density). Gene density (‘Genes’) ranged from 38 per 100 kb to 1 gene per 100 kb; expressed sequence tag matches (‘ESTs’) ranged from more than 200 per 100 kb to 1 per 100 kb. Transposable element densities (‘TEs’) ranged from 33 per 100 kb to 1 per 100 kb. Mitochondrial and chloroplast insertions (‘MT/CP’) were assigned black and green tick marks, respectively. Transfer RNAs and small nucleolar RNAs (‘RNAs’) were assigned black and red tick marks, respectively.

The 25,498 predicted genes (Table 1) form the largest gene set published to date: C. elegans (ref. 1) has 19,099 genes and Drosophila (ref. 2) 13,601 genes. Arabidopsis and C. elegans have similar gene density, whereas Drosophila has a lower gene density; Arabidopsis also has a significantly greater extent of tandem gene duplications and segmental duplications, which may account for its larger gene set.

The rDNA repeat regions on chromosomes 2 and 4 were not sequenced because of their known repetitive structure and content. The centromeric regions are not completely sequenced owing to large blocks of monotonic repeats such as 5S rDNA and 180-bp repeats. The sequence continues to be extended further into centromeric and other regions of complex sequence.

Characterization of the coding regions

To assess the similarities and differences of the Arabidopsis gene complement compared with other sequenced eukaryotic genomes, we assigned functional categories to the complete set of Arabidopsis genes. For chromosome 4 genes and the yeast genome, predicted functions were previously manually assigned (refs 5, 21). All other predicted proteins were automatically assigned to these functional categories (ref. 22), assuming that conserved sequences reflect common functional relationships.

The functions of 69% of the genes were classified according to sequence similarity to proteins of known function in all organisms; only 9% of the genes have been characterized experimentally (Fig. 2a). Generally similar proportions of gene products were predicted to be targeted to the secretory pathway and mitochondria in Arabidopsis and yeast, and up to 14% of the gene products are likely to be targeted to the chloroplast (Table 1). The significant proportion of genes with predicted functions involved in metabolism, gene regulation and defence is consistent with previous analyses (ref. 5). Roughly 30% of the 25,498 predicted gene products (Fig. 2a), comprising both plant-specific proteins and proteins with similarity to genes of unknown function from other organisms, could not be assigned to functional categories.

Table 1 Summary statistics of the Arabidopsis genome

(a) The DNA molecules

Feature                                Chr. 1       Chr. 2       Chr. 3       Chr. 4       Chr. 5        Total
Length (bp)                        29,105,111   19,646,945   23,172,617   17,549,867   25,953,409  115,409,949
Top arm (bp)                       14,449,213    3,607,091   13,590,268    3,052,108   11,132,192
Bottom arm (bp)                    14,655,898   16,039,854    9,582,349   14,497,759   14,803,217
Base composition (%GC)
  Overall                                33.4         35.5         35.4         35.5         34.5
  Coding                                 44.0         44.0         44.3         44.1         44.1
  Non-coding                             32.4         32.9         33.0         32.8         32.5
Number of genes                         6,543        4,036        5,220        3,825        5,874       25,498
Gene density (kb per gene)                4.0          4.9          4.5          4.6          4.4
Average gene length (bp)                2,078        1,949        1,925        2,138        1,974
Average peptide length (aa)               446          421          424          448          429
Exons
  Number                               35,482       19,631       26,570       20,073       31,226      132,982
  Total length (bp)                 8,772,559    5,100,288    6,654,507    5,150,883    7,571,013   33,249,250
  Average per gene                        5.4          4.9          5.1          5.2          5.3
  Average size (bp)                       247          259          250          256          242
Introns
  Number                               28,939       15,595       21,350       16,248       25,352      107,484
  Total length (bp)                 4,828,766    2,768,430    3,397,531    3,030,649    4,030,045   18,055,421
  Average size (bp)                       168          177          159          186          159
Number of genes with ESTs (%)            60.8         56.9         59.8         61.4         61.4
Number of ESTs                         30,522       14,989       20,732       16,605       22,885      105,733

(b) The proteome

Classification/function                               Chr. 1          Chr. 2          Chr. 3          Chr. 4          Chr. 5           Total
Total proteins                                         6,543           4,036           5,220           3,825           5,874          25,498
With INTERPRO domains                          4,194 (64.1%)   1,205 (29.9%)   2,989 (57.8%)   1,545 (40.4%)   3,136 (53.4%)  13,069 (51.3%)
Genes containing at least one TM domain        2,334 (35.7%)   1,322 (32.8%)   1,615 (30.9%)   1,402 (36.7%)   1,940 (33.0%)   8,613 (33.8%)
Genes containing at least one SCOP domain      2,513 (38.4%)   1,424 (35.3%)   1,664 (31.9%)   1,304 (34.1%)   2,121 (36.1%)   9,026 (35.4%)
With putative signal peptides
  Secretory pathway                            1,242 (19.0%)     675 (16.7%)     877 (17.0%)     659 (17.2%)   1,014 (17.3%)   4,467 (17.6%)
    >0.95 specificity                          1,146 (17.5%)     632 (15.7%)     813 (15.7%)     632 (16.5%)     964 (16.4%)   4,167 (16.4%)
  Chloroplast                                    866 (13.2%)     535 (13.2%)     754 (14.6%)     532 (13.9%)     887 (15.1%)   3,574 (14.0%)
    >0.95 specificity                             602 (9.2%)      290 (7.2%)      420 (8.1%)      298 (7.8%)      475 (8.1%)    2,085 (8.2%)
  Mitochondria                                   901 (13.8%)     425 (10.5%)     554 (10.7%)     390 (10.2%)     627 (10.7%)   2,897 (11.4%)
    >0.95 specificity                             113 (1.7%)       49 (1.2%)       63 (1.2%)       59 (1.5%)       65 (1.1%)      349 (1.4%)
Functional classification
  Cellular metabolism                          1,188 (22.7%)     620 (23.3%)     745 (22.8%)     588 (22.9%)     868 (21.1%)   4,009 (22.5%)
  Transcription                                  880 (16.8%)     474 (17.8%)     566 (17.3%)     335 (13.1%)     763 (18.6%)   3,018 (16.9%)
  Plant defence                                  640 (12.2%)     276 (10.4%)     354 (10.8%)     295 (11.5%)     490 (11.9%)   2,055 (11.5%)
  Signalling                                     573 (11.0%)     296 (11.1%)     356 (10.9%)      210 (8.2%)     420 (10.2%)   1,855 (10.4%)
  Growth                                         542 (10.4%)      263 (9.9%)     357 (10.9%)     448 (17.5%)     469 (11.4%)   2,079 (11.7%)
  Protein fate                                    520 (9.9%)     273 (10.2%)      314 (9.6%)     264 (10.3%)      395 (9.6%)    1,766 (9.9%)
  Intracellular transport                         435 (8.3%)      214 (8.9%)      269 (8.2%)      220 (8.6%)      334 (8.1%)    1,472 (8.3%)
  Transport                                       236 (4.5%)      139 (5.2%)      155 (4.7%)      113 (4.4%)      206 (5.0%)      849 (4.8%)
  Protein synthesis                               216 (4.1%)      111 (4.2%)      148 (4.5%)       90 (3.5%)      165 (4.0%)      730 (4.1%)
  Total                                                5,230           2,666           3,264           2,563           4,110          17,833

The features of Arabidopsis chromosomes 1–5 and the complete nuclear genome are listed. Specialized searches used the following programs and databases: INTERPRO (ref. 23); transmembrane (TM) domains by ALOM2 (unpublished); SCOP domain database (ref. 121); functional classification by the PEDANT analysis system (ref. 22). Signal peptide prediction (secretory pathway, targeted to chloroplast or mitochondria) was performed using TargetP (ref. 122) and http://www.cbs.dtu.dk/services/TargetP/.
* Default value.
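As a quick consistency check on Table 1, the ‘Average per gene’ exon row can be recomputed from the gene and exon counts of each chromosome. The counts below are copied from the table; the dictionary names are illustrative.

```python
# Gene and exon counts per chromosome, copied from Table 1(a).
genes = {"Chr. 1": 6_543, "Chr. 2": 4_036, "Chr. 3": 5_220,
         "Chr. 4": 3_825, "Chr. 5": 5_874}
exons = {"Chr. 1": 35_482, "Chr. 2": 19_631, "Chr. 3": 26_570,
         "Chr. 4": 20_073, "Chr. 5": 31_226}

# Average number of exons per gene, rounded to one decimal as in the table.
exons_per_gene = {chrom: round(exons[chrom] / genes[chrom], 1) for chrom in genes}
print(exons_per_gene)

# Column totals for the whole genome.
print(sum(genes.values()), sum(exons.values()))
```

The recomputed averages (5.4, 4.9, 5.1, 5.2 and 5.3) agree with the published row, and the gene counts sum to the 25,498 genes reported for the genome.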

To compare the functional categories in more detail, we compared data from the complete genomes of Escherichia coli (ref. 23), Synechocystis sp. (ref. 24), Saccharomyces cerevisiae (ref. 21), C. elegans (ref. 1) and Drosophila (ref. 2), and a non-redundant protein set of Homo sapiens, with the Arabidopsis genome data (Fig. 2b), using a stringent BLASTP threshold value of E < 10^-30. The proportion of Arabidopsis proteins having related counterparts in eukaryotic genomes varies by a factor of 2 to 3 depending on the functional category. Only 8–23% of Arabidopsis proteins involved in transcription have related genes in other eukaryotic genomes, reflecting the independent evolution of many plant transcription factors. In contrast, 48–60% of genes involved in protein synthesis have counterparts in the other eukaryotic genomes, reflecting highly conserved gene functions. The relatively high proportion of matches between Arabidopsis and bacterial proteins in the categories ‘metabolism’ and ‘energy’ reflects both the acquisition of bacterial genes from the ancestor of the plastid and high conservation of sequences across all species. Finally, a comparison between unicellular and multicellular eukaryotes indicates that Arabidopsis genes involved in cellular communication and signal transduction have more counterparts in multicellular eukaryotes than in yeast, reflecting the need for sets of genes for communication in multicellular organisms.
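The per-category comparison described above reduces to counting, for each functional subset, the proteins whose best BLASTP hit against a reference proteome passes the E-value threshold. A minimal sketch, assuming the best E-value per protein has already been extracted from the BLASTP output; the function name, protein IDs and E-values are hypothetical.

```python
E_VALUE_CUTOFF = 1e-30  # stringent threshold quoted in the text


def conserved_fraction(best_evalues, cutoff=E_VALUE_CUTOFF):
    """Fraction of proteins whose best hit in a reference proteome beats the cutoff.

    best_evalues: dict mapping protein ID -> best BLASTP E-value against the
    reference proteome, or None when no hit was reported.
    """
    if not best_evalues:
        return 0.0
    matched = sum(1 for e in best_evalues.values() if e is not None and e < cutoff)
    return matched / len(best_evalues)


# Hypothetical protein IDs and E-values, for illustration only.
transcription_vs_yeast = {"p1": 1e-45, "p2": 3e-12, "p3": None, "p4": 2e-31, "p5": 0.5}
print(conserved_fraction(transcription_vs_yeast))  # 0.4
```

Running this once per functional category and reference genome would give one value per bar of a chart like Fig. 2b.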

[Figure 2 image not reproduced. Panel a, category labels: Unclassified; Metabolism; Transcription; Cell growth, cell division and DNA synthesis; Cell rescue, defence, cell death, ageing; Cellular communication/signal transduction; Protein destination; Intracellular transport; Cellular biogenesis; Transport facilitation; Energy; Protein synthesis; Ionic homeostasis. Panel b, x axis: Metabolism; Energy; Cell growth, cell division and DNA synthesis; Transcription; Protein synthesis; Protein destination; Transport facilitation; Intracellular transport; Cellular biogenesis; Cellular communication/signal transduction; Cell rescue, defence, cell death, ageing; Ionic homeostasis; Classification not yet clear-cut; Unclassified. Panel b, y axis: 0 to 0.7. Panel b, series: E. coli, Synechocystis, S. cerevisiae, C. elegans, Drosophila, Human.]

Figure 2 Functional analysis of Arabidopsis genes. a, Proportion of predicted Arabidopsis genes in different functional categories. b, Comparison of functional categories between organisms. Subsets of the Arabidopsis proteome containing all proteins that fall into a common functional class were assembled. Each subset was searched against the complete set of translations from Escherichia coli, Synechocystis sp. PCC6803, Saccharomyces cerevisiae, Drosophila, C. elegans and a Homo sapiens non-redundant protein database. The percentage of Arabidopsis proteins in a particular subset that had a BLASTP match with E ≤ 10^-30 to the respective reference genome is shown. This reflects the measure of sequence conservation of proteins within this particular functional category between Arabidopsis and the respective reference genome. y axis, 0.1 = 10%.

Pronounced redundancy in the Arabidopsis genome is evident in segmental duplications and tandem arrays, and many other genes with high levels of sequence conservation are also scattered over the genome. Sequence similarity exceeding a BLASTP value of E < 10^-20 and extending over at least 80% of the protein length was used as the criterion to identify protein families (Table 2). A total of 11,601 protein types were identified. Thirty-five per cent of the predicted proteins are unique in the genome, and the proportion of proteins belonging to families of more than five members is substantially higher in Arabidopsis (37.4%) than in Drosophila (12.1%) or C. elegans (24.0%). The absolute number of Arabidopsis gene families and singletons (types) is in the same range as the other multicellular eukaryotes, indicating that a proteome of 11,000–15,000 types is sufficient for a wide diversity of multicellular life. The proportion of gene families with more than two members is considerably more pronounced in Arabidopsis than in other eukaryotes (Fig. 3). As segmental duplication is responsible for 6,303 gene duplications (see below), the extent of tandem gene duplications accounts for a significant proportion of the increased family size. These features of the Arabidopsis genome, and presumably of other plant genomes, may indicate more relaxed constraints on genome size in plants, or a more prominent role of unequal crossing over to generate new gene copies.
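One way to turn the pairwise criterion (BLASTP E < 10^-20 over at least 80% of the protein length) into families is single-linkage clustering with a union-find structure. The text does not state which clustering procedure was used, so single linkage is an assumption here; the function name, protein IDs and hit values are hypothetical.

```python
def build_families(proteins, hits, max_evalue=1e-20, min_coverage=0.8):
    """Group proteins into families by single-linkage clustering (an assumption).

    hits: iterable of (query, subject, evalue, coverage), where coverage is the
    aligned fraction of the protein length (0 to 1).
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the family representative, shortening the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for query, subject, evalue, coverage in hits:
        if query != subject and evalue < max_evalue and coverage >= min_coverage:
            parent[find(query)] = find(subject)  # merge the two families

    families = {}
    for p in proteins:
        families.setdefault(find(p), []).append(p)
    return list(families.values())


# Hypothetical proteins and hits, for illustration only.
proteins = ["a", "b", "c", "d", "e", "f"]
hits = [("a", "b", 1e-50, 0.90),   # passes both filters
        ("b", "c", 1e-25, 0.85),   # passes both filters
        ("d", "e", 1e-40, 0.50),   # coverage too low
        ("e", "f", 1e-5, 0.95)]    # E-value too weak
families = build_families(proteins, hits)
print(sorted(len(f) for f in families))  # [1, 1, 1, 3]
```

In this toy case there are four protein types: one family of three members and three singletons.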

INTERPRO (ref. 25) analysis of conserved protein domains in the predicted gene products from Arabidopsis, S. cerevisiae, C. elegans and Drosophila revealed more informative differences. Statistically over-represented domains, and those that are absent from the Arabidopsis genome, indicate domains that may have been gained or lost during the evolution of plants (Supplementary Information Table 1). Proteins containing the Pro-Pro-Arg repeat, which is involved in RNA stabilization and RNA processing, are over-represented as compared to yeast, fly and worm; 400 proteins containing this signature were detected in Arabidopsis compared with only 10 in total in yeast, Drosophila and C. elegans. Protein kinases and associated domains, 169 proteins containing a disease resistance protein signature, and the Toll/IL-1R (TIR) domain, a component of pathogen recognition molecules (ref. 26), are also relatively abundant. This suggests that pathways transducing signals in response to pathogens and diverse environmental cues are more abundant in plants than in other organisms.

The RING zinc finger domain is relatively over-represented in Arabidopsis compared with yeast, Drosophila and C. elegans, whereas the F-box domain is over-represented as compared with yeast and Drosophila only. These domains are involved in targeting proteins to the proteasome (ref. 27) and ubiquitinylation (ref. 28) pathways of protein degradation, respectively. In plants many processes such as hormone and defence responses, light signalling, and circadian rhythms and pattern formation use F-box function to direct negative regulators to the ubiquitin degradation pathway. This mode of regulation appears to be more prevalent in plants and may account for a higher representation of the F box than in Drosophila and for the over-representation of the ubiquitin domain in the Arabidopsis genome. RING finger domain proteins in general have a role in ubiquitin protein ligases, indicating that proteasome-mediated degradation is a more widespread mode of regulation in plants than in other kingdoms.

Most functions identified by protein domains are conserved in similar proportions in the Arabidopsis, S. cerevisiae, Drosophila and C. elegans genomes, pointing to many ubiquitous eukaryotic pathways. These are illustrated by comparing the list of human disease genes29 to the complete Arabidopsis gene set using BLASTP. Out of 289 human disease genes, 139 (48%) had hits in Arabidopsis using a BLASTP threshold E < 10⁻¹⁰. Sixty-nine (24%) exceeded an E < 10⁻⁴⁰ threshold, and 26 (9.3%) had scores better than E < 10⁻¹⁰⁰ (Table 3). There are at least 17 human disease genes more similar to Arabidopsis genes than to yeast, Drosophila or C. elegans genes (Table 3).
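The threshold counts in this paragraph can be illustrated with a minimal sketch that tallies how many query genes have a best hit below each E-value cutoff. The function name, the dictionary layout and the toy E values are assumptions made for this example; they are neither the BLASTP output format nor the data behind Table 3.

```python
def count_hits_below(best_evalues, thresholds):
    """Return {threshold: (count, percent)} for best-hit E values below each threshold.

    best_evalues maps a query gene name to the E value of its best hit,
    or to None when no hit was reported (hypothetical layout).
    """
    total = len(best_evalues)
    summary = {}
    for threshold in thresholds:
        count = sum(
            1 for e in best_evalues.values() if e is not None and e < threshold
        )
        percent = 100.0 * count / total if total else 0.0
        summary[threshold] = (count, percent)
    return summary


# Toy input: five invented query genes, not data from the paper.
toy_best_evalues = {
    "query1": 1e-272,
    "query2": 3e-55,
    "query3": 2e-12,
    "query4": 0.5,
    "query5": None,
}
toy_summary = count_hits_below(toy_best_evalues, [1e-10, 1e-40, 1e-100])
for threshold, (count, percent) in toy_summary.items():
    print(f"E < {threshold:g}: {count} of {len(toy_best_evalues)} ({percent:.0f}%)")
```

Because each stricter threshold is nested inside the looser one, the counts can only decrease as the cutoff tightens, which matches the pattern of the three figures quoted in the text.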

This analysis shows that, although numerous families of proteinsare shared between all eukaryotes, plants contain roughly 150unique protein families. These include transcription factors, struc-tural proteins, enzymes and proteins of unknown function. Mem-bers of the families of genes common to all eukaryotes haveundergone substantial increases or decreases in their size inArabidopsis. Finally, the transfer of a relatively small number ofcyanobacteria-related genes from a putative endosymbiotic ances-tor of the plastid has added to the diversity of protein structuresfound in plants.

Genome organization and duplication

The Arabidopsis genome sequence provides a complete view of chromosomal organization and clues to its evolutionary history. Gene families organized in tandem arrays of two or more units have been described in C. elegans1 and Drosophila2. Analysis of the Arabidopsis genome revealed 1,528 tandem arrays containing 4,140 individual genes, with arrays ranging up to 23 adjacent members (Fig. 3). Thus 17% of all genes of Arabidopsis are arranged in tandem arrays.
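The legend of Figure 3 states that arrays were called from BLASTP similarity with one unrelated gene tolerated among cluster members. The sketch below shows one plausible reading of that rule, assuming that similar genes are linked when at most one other gene lies between them and that an array is a connected group of such links. The function name, the input layout and the gene names are hypothetical, and this is not presented as the procedure used for the published counts.

```python
def find_tandem_arrays(gene_order, similar_pairs, max_unrelated=1):
    """Group genes into tandem arrays.

    gene_order    : gene names in chromosomal order (one chromosome).
    similar_pairs : iterable of (gene_a, gene_b) pairs that passed the
                    similarity threshold (assumed to be precomputed).
    max_unrelated : number of intervening genes tolerated between two
                    linked members.
    """
    similar = {frozenset(pair) for pair in similar_pairs}
    parent = list(range(len(gene_order)))  # union-find over gene positions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    window = max_unrelated + 1
    for i, gene in enumerate(gene_order):
        # Only look at neighbours close enough to count as "in tandem".
        for j in range(i + 1, min(i + window + 1, len(gene_order))):
            if frozenset((gene, gene_order[j])) in similar:
                parent[find(j)] = find(i)

    groups = {}
    for i, gene in enumerate(gene_order):
        groups.setdefault(find(i), []).append(gene)
    # An array needs at least two members.
    return [members for members in groups.values() if len(members) >= 2]


# Toy chromosome with invented gene names.
toy_order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
toy_pairs = [("g1", "g2"), ("g2", "g4"), ("g6", "g7"), ("g1", "g7")]
toy_arrays = find_tandem_arrays(toy_order, toy_pairs)
toy_fraction = sum(len(a) for a in toy_arrays) / len(toy_order)
print(toy_arrays, f"{toy_fraction:.0%} of genes in arrays")
```

In the toy input, g1 and g7 are similar but too far apart to be linked, so they end up in different arrays; g2 and g4 are linked across the single unrelated gene g3.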

Large segmental duplications were identified either by directly aligning chromosomal sequences or by aligning proteins and searching for tracts of conserved gene order. All five chromosomes were aligned to each other in both orientations using MUMmer (ref. 30), and the results were filtered to identify all segments at least 1,000 bp in length with at least 50% identity (Supplementary Information Fig. 1). These revealed 24 large duplicated segments of 100 kb or larger, comprising 65.6 Mb or 58% of the genome. The only duplicated segment in the centromeric regions was a 375-kb segment on chromosome 4. Many duplications appear to have undergone further shuffling, such as local inversions after the duplication event.
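The filtering step described here (segments of at least 1,000 bp with at least 50% identity) can be sketched as follows. The record type and its field names are hypothetical and do not reproduce MUMmer's output format; how the length of a match was measured in the original analysis is not stated, so taking the shorter of the two matched intervals is an assumption.

```python
from typing import NamedTuple


class AlignedSegment(NamedTuple):
    """One pairwise match between two chromosomal regions (hypothetical record)."""
    chrom_a: str
    start_a: int
    end_a: int
    chrom_b: str
    start_b: int  # start_b > end_b encodes a match in reversed orientation
    end_b: int
    percent_identity: float


def segment_length(seg):
    """Length in bp of the shorter of the two matched intervals."""
    len_a = abs(seg.end_a - seg.start_a) + 1
    len_b = abs(seg.end_b - seg.start_b) + 1
    return min(len_a, len_b)


def filter_segments(segments, min_length=1000, min_identity=50.0):
    """Keep segments that pass both the length and the identity cutoff."""
    return [
        seg
        for seg in segments
        if segment_length(seg) >= min_length
        and seg.percent_identity >= min_identity
    ]


# Toy matches with invented coordinates.
toy_segments = [
    AlignedSegment("chr1", 100, 1500, "chr2", 9000, 7600, 72.0),  # kept (reversed)
    AlignedSegment("chr1", 5000, 5400, "chr3", 100, 500, 95.0),   # too short
    AlignedSegment("chr4", 1, 2000, "chr5", 1, 2000, 42.0),       # identity too low
]
toy_kept = filter_segments(toy_segments)
print(len(toy_kept), "segment(s) kept")
```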

We used TBLASTX (ref. 5) to identify collinear clusters of genes residing in large duplicated chromosomal segments. The duplicated regions encompass 67.9 Mb, 60% of the genome, slightly more than was


Figure 3 Distribution of tandemly repeated gene arrays in the Arabidopsis genome. Tandemly repeated gene arrays were identified using the BLASTP program with a threshold of E < 10⁻²⁰. One unrelated gene among cluster members was tolerated. The histogram gives the number of clusters in the genome containing 2 to n similar gene units in tandem.

[Histogram. x axis: number of tandemly repeated genes per gene array (2, 3, 4, 5, 6, 7, 8, 9, 10, 11–15, 16–20, 21–23). y axis: number of arrays (0 to 1,200). Bar labels: 1,052, 249, 108, 57, 36, 20, 18, 17, 15, 6, 2, 2.]

Table 2 Proportion of genes in different organisms present as either singletons or in paralogous families

                 No. of singletons and            Gene families containing
Organism         distinct gene families  Unique   2 members  3 members  4 members  5 members  >5 members
H. influenzae    1,587                   88.8%    6.8%       2.3%       0.7%       0.0%       1.4%
S. cerevisiae    5,105                   71.4%    13.8%      3.5%       2.2%       0.7%       8.4%
D. melanogaster  10,736                  72.5%    8.5%       3.4%       1.9%       1.6%       12.1%
C. elegans       14,177                  55.2%    12.0%      4.5%       2.7%       1.6%       24.0%
Arabidopsis      11,601                  35.0%    12.5%      7.0%       4.4%       3.6%       37.4%

The number of genes in the genomes of Haemophilus influenzae, S. cerevisiae, Drosophila, C. elegans and Arabidopsis that are present either as singletons or in gene families with two or more members are listed. To be grouped in a gene family, two genes had to show similarity exceeding a BLASTP value E < 10⁻²⁰ and a FASTA alignment over at least 80% of the protein length. In column 1, the number of genes that are unique plus the number of gene families are listed. Columns 2 to 6 give the percentage of genes present as singletons or in gene families of n members.
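The grouping rule in the Table 2 legend (a BLASTP E value below 10⁻²⁰ together with an alignment covering at least 80% of the protein length) can be turned into families in several ways. The sketch below assumes single-linkage clustering, in which any qualifying pair joins two genes and families are the resulting connected groups; the paper does not say which linkage was used. The tuple layout, function names and gene names are hypothetical.

```python
from collections import Counter


def cluster_families(genes, hits, max_evalue=1e-20, min_coverage=0.80):
    """Single-linkage families from pairwise hits.

    hits is an iterable of (gene_a, gene_b, evalue, coverage) tuples, where
    coverage is the aligned fraction of the protein length (assumed layout).
    Returns a list of families, singletons included.
    """
    parent = {gene: gene for gene in genes}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path compression
            gene = parent[gene]
        return gene

    for gene_a, gene_b, evalue, coverage in hits:
        if evalue < max_evalue and coverage >= min_coverage:
            parent[find(gene_a)] = find(gene_b)

    families = {}
    for gene in genes:
        families.setdefault(find(gene), []).append(gene)
    return list(families.values())


def percent_genes_by_family_size(families):
    """Percentage of genes in families of size 1, 2, 3, 4, 5 and >5."""
    n_genes = sum(len(family) for family in families)
    genes_by_size = Counter()
    for family in families:
        label = str(len(family)) if len(family) <= 5 else ">5"
        genes_by_size[label] += len(family)
    return {label: 100.0 * n / n_genes for label, n in genes_by_size.items()}


# Toy proteome of ten invented genes.
toy_genes = [f"g{i}" for i in range(1, 11)]
toy_hits = [
    ("g1", "g2", 1e-50, 0.90),
    ("g2", "g3", 1e-30, 0.85),
    ("g4", "g5", 1e-25, 0.95),
    ("g6", "g7", 1e-40, 0.50),  # fails the coverage cutoff
    ("g8", "g9", 1e-10, 0.90),  # fails the E-value cutoff
]
toy_families = cluster_families(toy_genes, toy_hits)
toy_percentages = percent_genes_by_family_size(toy_families)
print(len(toy_families), "singletons plus families;", toy_percentages)
```

In this sketch the length of the returned list plays the role of column 1 of Table 2 (singletons plus distinct families), and the size-1 percentage corresponds to the "Unique" column.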


found in the DNA-based alignment (Fig. 4), and these data extend earlier findings4,5,31. The extent of sequence conservation of the duplicated genes varies greatly, with 6,303 (37%) of the 17,193 genes in the segments classified as highly conserved (E < 10⁻³⁰) and a further 1,705 (10%) showing less significant similarity up to E < 10⁻⁵. The proportion of homologous genes in each duplicated segment also varies widely, between 20% and 47% for the highly conserved class of genes. In many cases, the number of copies of a gene and its counterpart differ (for example, one copy on one chromosome and multiple copies on the other; see Supplementary Information Fig. 2); this could be due to either tandem duplication or gene loss after the segmental duplication.

What does the duplication in the Arabidopsis genome tell us about the ancestry of the species? Polyploidy occurs widely in plants and is proposed to be a key factor in plant evolution32. As the majority of the Arabidopsis genome is represented in duplicated (but not triplicated) segments, it appears most likely that Arabidopsis, like maize, had a tetraploid ancestor33. A comparative sequence analysis of Arabidopsis and tomato estimated that a duplication occurred ~112 Myr ago to form a tetraploid34. The degrees of conservation of the duplicated segments might be due to divergence from an ancestral autotetraploid form, or might reflect differences present in an allotetraploid ancestor. It is also possible, however, that several independent segmental duplication events took place instead of tetraploid formation and stabilization.

The diploid genetics of Arabidopsis and the extensive divergence of the duplicated segments have masked its evolutionary history. The determination of Arabidopsis gene functions must therefore be pursued with the potential for functional redundancy taken into account. The long period of time over which genome stabilization has occurred has, however, provided ample opportunity for the divergence of the functions of genes that arose from duplications.

Comparative analysis of Arabidopsis accessions

Comparing the multiple accessions of Arabidopsis allows us to identify commonly occurring changes in genome microstructure. It also enables the development of new molecular markers for genetic mapping. High rates of polymorphism between Arabidopsis accessions, including both DNA sequence and copy number of tandem arrays, are prevalent at loci involved in disease resistance35. This has been observed for other plant species, and such loci are thought to serve as templates for illegitimate recombination to create new pathogen response specificities36. We carried out a comparative analysis between 82 Mb of the genome sequence of Arabidopsis accession Columbia (Col-0) and 92.1 Mb of non-redundant low-pass (twofold redundant) sequence data of the genomic DNA of accession Landsberg erecta (Ler). We identified two classes of differences between the sequences: single nucleotide polymorphisms (SNPs), and insertion–deletions (InDels). As we used high stringency criteria, our results represent a minimum estimate of numbers of polymorphisms between the two genomes.

In total, we detected 25,274 SNPs, representing an average density of 1 SNP per 3.3 kb. Transitions (A/T–G/C) represented 52.1% of the SNPs, and transversions accounted for the remainder: 17.3% for A/T–T/A, 22.7% for A/T–C/G and 7.9% for C/G–G/C. In total, we detected 14,570 InDels at an average spacing of 6.1 kb. They ranged from 2 bp to over 38 kilobase-pairs, although 95% were smaller than 50 bp. Only 10% of the InDels were co-located with simple sequence repeats identified with the program Sputnik. An analysis of 416 relative insertions greater than 250 bp in Col-0 showed that 30% matched transposon-related proteins, indicating that a substantial proportion of the large InDels are the result of transposon insertion or excision. Many InDels contained entire active genes not related to transposons. Half of such genes absent from corresponding positions in the Col-0 sequence were found elsewhere on the genome of Ler. This indicates that genes have been transferred to new genomic locations.
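The four SNP classes reported here are strand-symmetric: an A to G change on one strand is a T to C change on the other, so both fall under A/T–G/C. The sketch below shows how a pair of bases could be assigned to these classes and tallied. The class labels follow the text, while the function names and the toy SNP list are assumptions for illustration.

```python
from collections import Counter

# Unordered base pairs mapped to the strand-symmetric classes used in the text.
SNP_CLASSES = {
    frozenset("AG"): "transition A/T-G/C",
    frozenset("CT"): "transition A/T-G/C",
    frozenset("AT"): "transversion A/T-T/A",
    frozenset("AC"): "transversion A/T-C/G",
    frozenset("GT"): "transversion A/T-C/G",
    frozenset("CG"): "transversion C/G-G/C",
}


def classify_snp(base_1, base_2):
    """Return the class of a SNP given the base seen in each accession."""
    key = frozenset((base_1.upper(), base_2.upper()))
    if key not in SNP_CLASSES:
        raise ValueError(f"not a SNP between two of A, C, G, T: {base_1!r}, {base_2!r}")
    return SNP_CLASSES[key]


def class_percentages(snps):
    """Percentage of SNPs falling in each class."""
    counts = Counter(classify_snp(a, b) for a, b in snps)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# Toy list of (base in accession 1, base in accession 2); invented values.
toy_snps = [
    ("A", "G"), ("C", "T"), ("T", "A"), ("G", "T"),
    ("C", "G"), ("g", "a"), ("A", "C"), ("T", "C"),
]
toy_snp_percentages = class_percentages(toy_snps)
print(toy_snp_percentages)
```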

Gene structures are often affected by small InDels and SNPs. The positions of SNPs and InDels were mapped relative to 87,427 exons and 70,379 introns annotated in the Col-0 sequence. SNPs were found in exons, introns and intergenic regions at frequencies of 1 SNP per 3.1, 2.2 and 3.5 kb, respectively. The frequencies for InDels were 1 per 9.3, 3.1 and 4.3 kb, respectively. Polymorphisms were detected in 7% of exons, and alter the spliced sequences of 25% of the predicted genes. For InDels in exons, insertion lengths divisible by three are prevalent for small insertions (< 50 bp), indicating that many proteins can withstand small insertions or deletions of amino acids without loss of function.
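The observation about insertion lengths divisible by three rests on a simple check: an exonic InDel whose length is a multiple of three adds or removes whole codons and leaves the downstream reading frame intact. A minimal sketch of that check follows; the function names and the toy lengths are assumptions, and the 50-bp cutoff simply mirrors the "small insertions" limit quoted in the text.

```python
def is_frame_preserving(indel_length):
    """True when an InDel of this length (bp) keeps the reading frame."""
    return indel_length % 3 == 0


def fraction_frame_preserving(lengths, max_length=50):
    """Fraction of small InDels (0 < length < max_length) that keep the frame."""
    small = [n for n in lengths if 0 < n < max_length]
    if not small:
        return 0.0
    return sum(is_frame_preserving(n) for n in small) / len(small)


# Toy exonic InDel lengths in bp; invented values.
toy_lengths = [3, 6, 2, 9, 51, 12, 4]
toy_fraction_in_frame = fraction_frame_preserving(toy_lengths)
print(f"{toy_fraction_in_frame:.2f} of small InDels preserve the frame")
```

If InDel lengths were unconstrained, about one third would be expected to be multiples of three, so a clearly higher fraction in exons is what points to selection against frameshifts.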

Our analyses show that sequence polymorphisms between accessions of Arabidopsis are common, and that they occur in both coding and non-coding regions. We found evidence for the relocation of genes in the genome, and for changes in the complement of transposable elements. The data presented here are available at http://www.arabidopsis.org/cereon/.


Figure 4 Segmentally duplicated regions in the Arabidopsis genome. Individual chromosomes are depicted as horizontal grey bars (with chromosome 1 at the top), centromeres are marked black. Coloured bands connect corresponding duplicated segments. Similarity between the rDNA repeats is excluded. Duplicated segments in reversed orientation are connected with twisted coloured bands. The scale is in megabases.

[Chromosome diagram; scale marks at 5, 10, 15, 20, 25 and 30 Mb.]


Comparison of Arabidopsis and other plant genera

Comparative genetic mapping can reveal extensive conservation of genome organization between closely related species37,38. The comparative analysis of plant genome microstructure reveals much about the evolution of plant genomes and provides unprecedented opportunities for crop improvement by establishing the detailed structures of, and relationships between, the genomes of crops and Arabidopsis.

The lineages leading to Arabidopsis and Capsella rubella (shepherd's purse) diverged between 6.2 and 9.8 Myr ago, and the gene content and genome organization of C. rubella are very similar to those of Arabidopsis39, including the large-scale duplications. Alignment of Arabidopsis complementary DNA and EST sequences with genomic DNA sequences of Arabidopsis and C. rubella showed conservation of exon length and intron positions. Coding sequences predicted from these alignments differed from the annotated Arabidopsis gene sequences in two out of five cases.

The ancestral lineages of Arabidopsis and the Brassica (cabbage and mustard) genera diverged 12.2–19.2 Myr ago40. Brassica genes show a high level of nucleotide conservation with their Arabidopsis orthologues, typically more than 85% in coding regions40. The structure of Brassica genomes resembles that of Arabidopsis, but with extensive triplication and rearrangement41, and extensive divergence of microstructure (Supplementary Information Fig. 3). The divergence between the genomes of Arabidopsis and Brassica oleracea is in striking contrast to that observed between Arabidopsis and C. rubella, although the time since divergence is only twofold greater. This accelerated rate of change in triplicated segments of the genome of B. oleracea indicates that polyploidy fosters rapid chromosomal evolution.

The Arabidopsis and tomato lineages diverged roughly 150 Myr ago, and comparative sequence analysis of segments of their genomes has revealed complex relationships34. Four regions of the Arabidopsis genome are related to each other and to one region in the tomato genome, suggesting that two rounds of duplication may have occurred in the Arabidopsis lineage. The extensive duplication described here supports the proposal that the more recent of these duplications, estimated to have occurred ~112 Myr ago, was the result of a polyploidization event. The lineages of Arabidopsis and rice diverged ~200 Myr ago42. Three regions of the genome of Arabidopsis were related to each other and to one region in the rice genome, providing further evidence for multiple duplication events43,44.

The frequent occurrence of tandem gene duplications and the apparent deletion of single genes, or small groups of adjacent genes, from duplicated regions suggests that unequal crossing over may be a key mechanism affecting the evolution of plant genome microstructure. However, the segmental inversions and gene translocations in the genomes of both rice and B. oleracea that are not found in Arabidopsis indicate that additional mechanisms may be involved40.

Integration of the three genomes in the plant cell

The three genomes in the plant cell, those of the nucleus, the plastids (chloroplasts) and the mitochondria, differ markedly in gene number, organization and stability. Plastid genes are densely packed in an order highly conserved in all plants45, whereas mitochondrial genes46 are widely dispersed and subjected to extensive recombination.

Organellar genomes are remnants of independent organisms: plastids are derived from the cyanobacterial lineage and mitochondria from the α-Proteobacteria. The remaining genes in plastids include those that encode subunits of the photosystem and the electron transport chain, whereas the genes in mitochondria encode essential subunits of the respiratory chain. Both organelles contain sets of specific membrane proteins that, together with housekeeping proteins, account for 61% of the genes in the chloroplast and 88% in the mitochondrion (Table 4). The remainder are involved in transcription and translation.

The number of proteins encoded in the nucleus likely to be found


Table 3 Arabidopsis genes with similarities to human disease genes

Human disease gene              E value        Gene code     Arabidopsis hit
Darier–White, SERCA             5.9 × 10⁻²⁷²   T27I1_16      Putative calcium ATPase
Xeroderma Pigmentosum, D-XPD    7.2 × 10⁻²²⁸   F15K9_19      Putative DNA repair protein
Xeroderma pigment, B-ERCC3      9.6 × 10⁻²¹⁴   AT5g41360     DNA excision repair cross-complementing protein
Hyperinsulinism, ABCC8          7.1 × 10⁻¹⁸⁸   F20D22_11     Multidrug resistance protein
Renal tubul. acidosis, ATP6B1   1.0 × 10⁻¹⁸²   AT4g38510     Probable H+-transporting ATPase
HDL deficiency 1, ABCA1         2.4 × 10⁻¹⁸¹   At2g41700     Putative ABC transporter
Wilson, ATP7B                   7.6 × 10⁻¹⁸¹   AT5g44790     ATP-dependent copper transporter
Immunodeficiency, DNA Ligase 1  8.2 × 10⁻¹⁷²   T6D22_10      DNA ligase
Stargardt's, ABCA4              2.8 × 10⁻¹⁶⁸   At2g41700     Putative ABC transporter
Ataxia telangiectasia, ATM      3.1 × 10⁻¹⁶⁸   AT3g48190     Ataxia telangiectasia mutated protein AtATM
Niemann–Pick, NPC1              1.2 × 10⁻¹⁶⁶   F7F22_1       Niemann–Pick C disease protein-like protein
Menkes, ATP7A                   1.1 × 10⁻¹⁵³   F2K11_17      ATP-dependent copper transporter, putative
HNPCC*, MLH1                    1.5 × 10⁻¹⁵⁰   AT4g09140     MLH1 protein
Deafness, hereditary, MYO15     2.7 × 10⁻¹⁵⁰   At2g31900     Putative unconventional myosin
Fam, cardiac myopathy, MYH7     6.5 × 10⁻¹⁴⁷   T1G11_14      Putative myosin heavy chain
Xeroderma Pigmentosum, F-XPF    1.4 × 10⁻¹⁴⁶   AT5g41150     Repair endonuclease (gb|AAF01274.1)
G6PD deficiency, G6PD           7.6 × 10⁻¹³⁷   AT5g40760     Glucose-6-phosphate dehydrogenase
Cystic fibrosis, ABCC7          2.3 × 10⁻¹³⁵   AT3g62700     ABC transporter-like protein
Glycerol kinase defic, GK       7.9 × 10⁻¹³⁵   T21F11_21     Putative glycerol kinase
HNPCC, MSH3                     6.6 × 10⁻¹³⁴   AT4g25540     Putative DNA mismatch repair protein
HNPCC, PMS2                     5.1 × 10⁻¹²⁸   AT4g02460     No title
Zellweger, PEX1                 4.1 × 10⁻¹²⁵   AT5g08470     Putative protein
HNPCC, MSH6                     9.6 × 10⁻¹²²   AT4g02070     G/T DNA mismatch repair enzyme
Bloom, BLM                      4.4 × 10⁻¹⁰⁹   T19D16_15     DNA helicase isolog
Finnish amyloidosis, GSN        2.2 × 10⁻¹⁰⁷   AT5g57320     Villin
Chediak–Higashi, CHS1           5.8 × 10⁻⁹⁹    F10O3_11      Putative transport protein
Xeroderma Pigmentosum, G-XPG    7.1 × 10⁻⁸⁹    AT3g28030     Hypothetical protein
Bare lymphocyte, ABCB3          1.3 × 10⁻⁸⁴    AT5g39040     ABC transporter-like protein
Citrullinemia, type I, ASS      3.2 × 10⁻⁸³    AT4g24830     Argininosuccinate synthase-like protein
Coffin–Lowry, RPS6KA3           5.2 × 10⁻⁸¹    AT3g08720     Putative ribosomal-protein S6 kinase (ATPK19)
Keratoderma, KRT9               8.5 × 10⁻⁸¹    AT3g17050     Unknown protein
Myotonic dystrophy, DM1         1.4 × 10⁻⁷⁶    At2g20470     Putative protein kinase
Bartter's, SLC12A1              1.6 × 10⁻⁷⁵    F26G16_9      Cation-chloride co-transporter, putative
Dents, CLCN5                    3.3 × 10⁻⁷⁴    AT5g26240     CLC-d chloride channel protein
Diaphanous 1, DAPH1             1.9 × 10⁻⁷³    68069_m00158  Hypothetical protein
AKT2                            6.9 × 10⁻⁷²    AT3g08730     Putative ribosomal-protein S6 kinase (ATPK6)


in organelles was predicted using default settings on TargetP (Table 1). Many nuclear gene products that are targeted to either (or both) organelles were originally encoded in the organelle genomes and were transferred to the nuclear genome during evolutionary history. A large number also appear to be of eukaryotic origin, with functions such as protein import components, which were probably not required by the free-living ancestors of the endosymbionts.

To identify nuclear genes of possible organellar ancestry, we compared all predicted Arabidopsis proteins to all proteins from completed genomes including those from plastids and mitochondria (Supplementary Information Table 2). This search identified proteins encoded by the Arabidopsis nuclear genome that are most similar to proteins encoded by other species' organelle genomes (14 mitochondrial and 44 plastid). These represent organelle-to-nuclear gene transfers that have occurred sometime after the divergence of the organelle-containing lineages47. There is a great excess of nuclear encoded proteins most similar to proteins from the cyanobacterium Synechocystis (Supplementary Information Fig. 4; 806 Arabidopsis predicted proteins matching 404 different Synechocystis proteins, providing further evidence of a genome duplication). These 806 Arabidopsis predicted proteins, and many others of greatly diverse function, are possibly of plastid descent. Through searches against proteins from other cyanobacteria (with incompletely sequenced genomes), we identified 69 additional genes of possibly plastid descent. Only 25% of these putatively plastid-derived proteins displayed a target peptide predicted by TargetP, indicating potential cytoplasmic functions for most of these genes.

The difference between predicted plastid-targeted and predicted plastid-derived genes points to several factors: probable overestimation by ab initio targeting prediction methods and their lack of resolution with respect to destination organelles; possible extensive divergence of some endosymbiont-derived genes in the nuclear genome; the co-opting of nuclear genes for targeting to organelles; and cytoplasmic functions for cyanobacteria-derived proteins. Clearly, more refined tools and extensive experimentation are required to catalogue plastid proteins.

The transfer of genes between genomes still continues (Supplementary Information Table 3). Plastid DNA insertions in the nucleus (17 insertions totalling 11 kb) contain full-length genes encoding proteins or tRNAs, fragments of genes and an intron as well as intergenic regions. Subsequent reshuffling in the nucleus is illustrated by the atpH gene, which was originally transferred completely, but is now in two pieces separated by 2 kb. The 13 small mitochondrial DNA insertions total 7 kb in addition to the large insertion close to the centromere of chromosome 2 (ref. 3). The high level of recombination in the mitochondrial genome may account for these events.

Transposable elements

Transposons, which were originally identified in maize by Barbara McClintock, have been found in all eukaryotes and prokaryotes. A subset of transposons replicate through an RNA intermediate (class I), whereas others move directly through a DNA form (class II). Transposons are further classified by similarity either between their mobility genes or between their terminal and/or internal motifs, as well as by the size and sequence of their target site. Internally deleted elements can often be mobilized in trans by fully functional elements.

Transposons in Arabidopsis account for at least 10% of the genome, or about one-fifth of the intergenic DNA. The Arabidopsis genome has a wealth of class I (2,109) and II (2,203) elements, including several new groups (1,209 elements; Supplementary Information Table 4). Mobile histories for many elements were obtained by identifying regions of the genome with significant similarity to 'empty' target sites (RESites) thus providing high-resolution information concerning the termini and target site duplications48,49. These regions were readily detected because of the propensity of transposons to integrate into repeats and because of duplications in the genome sequence. In several cases, genes appear to have been included as 'passengers' in transposable units48. In some cases, shared sequence similarity, coding capacity and RESites attest to recent activity of transposable elements in the Arabidopsis genome. Only about 4% of the complete elements identified correspond to an EST, however, suggesting that most are not transcribed.

Transposable elements found in many other plant genomes are well represented in Arabidopsis, including copia- and gypsy-like long terminal repeat (LTR) retrotransposons, long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), hobo/Activator/Tam3 (hAT)-like elements, CACTA-like elements and miniature inverted-repeat transposable elements (MITEs). Although usually small in size, some larger Tourist-like MITEs contain open reading frames (ORFs) with similarity to the transposases of bacterial insertion sequences48. Basho and many Mutator-like elements (MULEs), first discovered in the Arabidopsis sequence, represent structurally unique transposons48–50. Basho elements have a target site preference for mononucleotide 'A' and wide distribution among plants48,51. MULEs exhibit a high level of sequence diversity and members of most groups lack long terminal inverted repeats (TIRs). Phylogenetic analysis of the Arabidopsis MURA-like transposases suggests that TIR-containing MULEs are more closely related to one another than to MULEs lacking TIRs49,52.

For many plants with large genomes, class I retrotransposons contribute most of the nucleotide content53. In the small Arabidopsis genome, class I elements are less abundant and primarily occupy the centromere. In contrast, Basho elements and class II transposons such as MITEs and MULEs predominate on the periphery of pericentromeric domains (Fig. 5). Among class II transposons, MULEs and CACTA elements are clustered near centromeres and heterochromatic knobs, whereas MITEs and hAT elements have a less pronounced bias. The distribution pattern of transposable elements observed in Arabidopsis may reflect different types of pericentromeric heterochromatin regions and may be similar to those found in animals.

Numerous centromeric satellite repeats are located between each chromosome arm and have not yet been sequenced, but are represented in part by unanchored BAC contigs (R. Martienssen and M. Marra, unpublished data). End sequence suggests that these domains contain many more class I than class II elements, consistent with the distribution reported here (K. Lemcke and R. Martienssen, unpublished data). We do not know the significance of the apparent paucity of elements in telomeric regions and in the region flanking the rDNA repeats on chromosome 4 (but not on chromosome 2).

Overall, transposon-rich regions are relatively gene-poor and have lower rates of recombination and EST matches, indicating a correlation between low gene expression, high transposon density and low recombination51. The role of transposons in genome


Table 4 General features of genes encoded by the three genomes in Arabidopsis

                                      Nucleus/cytoplasm       Plastid    Mitochondria
Genome size                           125 Mb                  154 kb     367 kb
Genome equivalent/cell                2                       560        26
Duplication                           60%                     17%        10%
Number of protein genes               25,498                  79         58
Gene order                            Variable, but syntenic  Conserved  Variable
Density (kb per protein gene)         4.5                     1.2        6.25
Average coding length                 1,900 nt                900 nt     860 nt
Genes with introns                    79%                     18.4%      12%
Genes/pseudogenes                     1/0.03                  1/0        1/0.2–0.5
Transposons (% of total genome size)  14%                     0%         4%


organization and chromosome structure can now be addressed in a model organism known to undergo DNA methylation and other forms of chromatin modification thought to regulate transposition52.

rDNA, telomeres and centromeres

Nucleolar organizers (NORs) contain arrays of unit repeats encoding the 18S, 5.8S and 25S ribosomal RNA genes and are transcribed by RNA polymerase I. Together with 5S RNA, which is transcribed by RNA polymerase III, these rRNAs form the structural and catalytic cores of cytoplasmic ribosomes. In Arabidopsis, the NORs juxtapose the telomeres of chromosomes 2 and 4, and comprise uninterrupted 18S, 5.8S and 25S units all orientated on the chromosomes in the same direction54. In contrast, the 5S rRNA genes are localized to heterogeneous arrays in the centromeric regions of chromosomes 3, 4 and 5 (ref. 55; and Fig. 6). Both NORs are roughly 3.5–4.0 megabase-pairs and comprise ~350–400 highly methylated rRNA gene units, each ~10 kb (ref. 54). The sequence between the euchromatic arms and NORs has been determined. Elsewhere in the genome, only one other 18S, 5.8S, 25S rRNA gene unit was identified in centromere 3. Although minor variations in sequence length and composition occur in the NOR repeats, these variants are highly clustered, supporting a model of sequence maintenance through concerted evolution55.

Arabidopsis telomeres are composed of CCCTAAA repeats and average ~2–3 kb (ref. 56). For TEL4N (telomere 4 North), consensus repeats are adjacent to the NOR; the remaining telomeres are typically separated from coding sequences by repetitive subtelomeric regions measuring less than 4 kb. Imperfect telomere-like arrays of up to 24 kb are found elsewhere in the genome, particularly near centromeres. These arrays might affect the expression of nearby genes and may have resulted from ancient rearrangements, such as inversions of the chromosome arms.
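As a small illustration of what a run of consensus telomere repeats looks like computationally, the sketch below finds the longest perfect tandem run of the CCCTAAA unit, or of its reverse complement TTTAGGG, in a DNA string. It is an assumption-laden toy: the function names and the test sequence are invented, only perfect copies are counted, and the imperfect telomere-like arrays mentioned above would need an approximate-matching method that is not shown here.

```python
import re

TELOMERE_UNIT = "CCCTAAA"  # repeat unit given in the text


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_tandem_run(sequence, unit=TELOMERE_UNIT):
    """Largest number of consecutive perfect copies of unit on either strand."""
    sequence = sequence.upper()
    best = 0
    for motif in (unit, reverse_complement(unit)):
        # (?:motif)+ matches one or more back-to-back copies of the motif.
        for match in re.finditer(f"(?:{motif})+", sequence):
            best = max(best, len(match.group(0)) // len(motif))
    return best


# Toy sequence: four forward copies, a spacer, then two reverse-strand copies.
toy_sequence = "ACGT" + "CCCTAAA" * 4 + "GG" + "TTTAGGG" * 2
toy_run = longest_tandem_run(toy_sequence)
print(toy_run, "consecutive copies")
```

With a 7-bp unit, a telomere of 2 to 3 kb corresponds to roughly 290 to 430 copies.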

Centromere DNA mediates chromosome attachment to the meiotic and mitotic spindles and often forms dense heterochromatin. Genetic mapping of the regions that confer centromere function provided the markers necessary to precisely place BAC clones at individual centromeres17; 69 clones were targeted for sequencing, resulting in over 5 Mb of DNA sequence from the centromeric regions. The unsequenced regions of centromeres are composed primarily of long, homogeneous arrays that were characterized previously with physical57 and genetic mapping17 and contain over 3 Mb of repetitive arrays, including the 180-bp repeats and 5S rDNA51 (Fig. 6).

Arabidopsis centromeres, like those of many higher eukaryotes, contain numerous repetitive elements including retroelements, transposons, microsatellites and middle repetitive DNA17. These repeats are rare in the euchromatic arms and often most abundant in pericentromeric DNA. The repeats, affinity for DNA-binding dyes, dense methylation patterns and inhibition of homologous recombination indicate that the centromeric regions are highly heterochromatic, and such regions are generally viewed as very poor environments for gene expression. Unexpectedly, we found at least 47 expressed genes encoded in the genetically defined centromeres of Arabidopsis (http://preuss.bsd.uchicago.edu/arabidopsis.genome.html). In several cases, these genes reside on islands of unique sequence flanked by repetitive arrays, such as 180-bp or 5S rDNA repeats. Among the genes encoded in the centromeres are members of 11 of the 16 functional categories that comprise the proteome. The centromeres are not subject to recombination; consequently, genes residing in these regions probably exhibit unique patterns of molecular evolution.

The function of higher eukaryotic centromeres may be specified by proteins that bind to centromere DNA, by epigenetic modifications, or by secondary or higher order structures. A pairwise comparison of the non-repetitive portions of all five centromeres showed they share limited (1–7%) sequence similarity. Forty-one families of small, conserved centromere sequences (AtCCS, see http://preuss.bsd.uchicago.edu/arabidopsis.genome.html) are enriched in the centromeric and pericentromeric regions and differ from sequences found in the centromeres of other eukaryotes. Molecular and genetic assays will be required to determine whether these conserved motifs nucleate Arabidopsis centromere activity. Apart from the AtCCS sequences, most centromere DNA is not shared between chromosomes, complicating efforts to derive clear evolutionary relationships. In contrast, genetic and cytological assays indicate that homologous centromeres are highly conserved among Arabidopsis accessions, albeit subject to rearrangements such as inversions to form knobs5,58,59 and insertions4. Further investigation of centromere DNA promises to yield information on the evolutionary forces that act in regions of limited recombination, as well as an improved understanding of the role of DNA sequence patterns in chromosome segregation.

Membrane transport

Transporters in the plasma and intracellular membranes of Arabidopsis are responsible for the acquisition, redistribution and compartmentalization of organic nutrients and inorganic ions, as well as for the efflux of toxic compounds and metabolic end products, energy and signal transduction, and turgor generation. Previous genomic analyses of membrane transport systems in S. cerevisiae and C. elegans led to the identification of over 100 distinct families of membrane transporters60,61. We compared membrane transport processes between Arabidopsis, animals, fungi and prokaryotes, and identified over 600 predicted membrane transport systems in Arabidopsis (http://www-biology.ucsd.edu/~ipaulsen/transport/), a similar number to that of C. elegans (~700 transporters) and over twofold greater than either S. cerevisiae or E. coli (~300 transporters).

Figure 5 Distribution of class I, II and Basho transposons in Arabidopsis chromosomes. The frequency of class I retroelements (green), class II DNA transposons (blue) and Basho elements (purple) are shown at 100-kb intervals along the five chromosomes (a–e) of Arabidopsis. [Figure panels: Chr. 1 to Chr. 5; x axis, Position (Mb); y axis, Frequency; legend: Class I, Class II, Basho.]

NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com

© 2000 Macmillan Magazines Ltd

We compared the transporter complement of Arabidopsis, C. elegans and S. cerevisiae in terms of energy coupling mechanisms (Fig. 7a). Unlike animals, which use a sodium ion P-type ATPase pump to generate an electrochemical gradient across the plasma membrane, plants and fungi use a proton P-type ATPase pump to form a large membrane potential (-250 mV)62. Consequently, plant secondary transporters are typically coupled to protons rather than to sodium63. Compared with C. elegans, Arabidopsis has a surprisingly high percentage of primary ATP-dependent transporters (12% and 21% of transporters, respectively), reflecting increased numbers of P-type ATPases involved in metal ion transport and ABC ATPases proposed to be involved in sequestering unusual metabolites and drugs in the vacuole or in other intracellular compartments. These processes may be necessary for pathogen defence and nutrient storage.

About 15% of the transporters in Arabidopsis are channel proteins, five times more than in any single-celled organism but half the number in C. elegans (Fig. 7b). Almost half of the Arabidopsis channel proteins are aquaporins, and Arabidopsis has 10-fold more major intrinsic protein (MIP) family water channels than any other sequenced organism. This abundance emphasizes the importance of hydraulics in a wide range of plant processes, including sugar and nutrient transport into and out of the vasculature, opening of stomatal apertures, cell elongation and epinastic movements of leaves and stems. Although Arabidopsis has a diverse range of metal cation transporters, C. elegans has more, many of which function in cell–cell signalling and nerve signal transduction. Arabidopsis also possesses transporters for inorganic anions such as phosphate, sulphate, nitrate and chloride, as well as metal cation channels that serve in signal transduction or cell homeostasis. Compared with other sequenced organisms, Arabidopsis has 10-fold more predicted peptide transporters, primarily of the proton-dependent oligopeptide transport (POT) family, emphasizing the importance of peptide transport or indicating that there is broader substrate specificity than previously realized. There are nearly 1,000 Arabidopsis genes encoding Ser/Thr protein kinases, suggesting that peptides may have an important role in plant signalling64.

Virtually no transporters for carboxylates, such as lactate and pyruvate, were identified in the Arabidopsis genome. About 12% of the transporters were predicted to be sugar transporters, mostly consisting of paralogues of the MFS family of hexose transporters. Notably, S. cerevisiae, C. elegans and most prokaryotes use APC family transporters as their principal means of amino-acid transport, but Arabidopsis appears to rely primarily on the AAAP family of amino-acid and auxin transporters. More than 10% of the transporters in Arabidopsis are homologous to drug efflux pumps; these probably represent transporters involved in the sequestration into vacuoles of xenobiotics, secondary metabolites, and breakdown products of chlorophyll.
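To give a feel for the proportions quoted above, the following Python sketch converts the reported percentages into approximate numbers of Arabidopsis transporters. It is illustrative arithmetic only: the total of 600 is an assumed lower bound (the text says "over 600"), the drug efflux share is given only as "more than 10%", and the names `ARABIDOPSIS_TOTAL`, `REPORTED_SHARES` and `approx_count` are introduced here for the example and do not come from the analysis itself.

```python
# Illustrative arithmetic only; not part of the published analysis.
# Assumed lower bound: the text reports "over 600" predicted transport systems.
ARABIDOPSIS_TOTAL = 600

# Approximate shares of Arabidopsis transporters quoted in the text.
REPORTED_SHARES = {
    "primary ATP-dependent": 0.21,
    "channel proteins": 0.15,
    "sugar transporters": 0.12,
    "drug efflux homologues": 0.10,  # text says "more than 10%"
}


def approx_count(total: int, share: float) -> int:
    """Return the rounded number of transporters implied by a fractional share."""
    return round(total * share)


if __name__ == "__main__":
    for label, share in REPORTED_SHARES.items():
        count = approx_count(ARABIDOPSIS_TOTAL, share)
        print(f"{label}: roughly {count} of {ARABIDOPSIS_TOTAL}")
```

Because the total is a lower bound and the percentages are rounded, the resulting counts should be read as rough orders of magnitude rather than as gene counts.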

Surprisingly, Arabidopsis has close homologues of the human ABC TAP transporters of antigenic peptides for presentation to the major histocompatibility complex (MHC). In Arabidopsis, these transporters may be involved in peptide efflux, or more speculatively, in some form of cell-recognition response. Arabidopsis also has 10-fold more members of the multi-drug and toxin extrusion (MATE) family than any other sequenced organism; in bacteria, these transporters function as drug efflux pumps. Curiously, Arabidopsis has several homologues of the Drosophila RND transporter family Patched protein, which functions in segment polarity, and more than ten homologues of the Drosophila ABC family eye pigment transporters. In plants, these are presumably involved in intracellular sequestration of secondary metabolites.

DNA repair and recombination

DNA repair and recombination pathways have many functions in different species such as maintaining genomic integrity, regulating mutation rates, chromosome segregation and recombination, genetic exchange within and between populations, and immune system development. Comparing the Arabidopsis genome with other species65 indicates that Arabidopsis has a similar set of DNA repair and recombination (RAR) genes to most other eukaryotes. The pathways represented include photoreactivation, DNA ligation, non-homologous end joining, base excision repair, mismatch excision repair, nucleotide excision repair and many aspects of DNA recombination (Supplementary Information Table 5). The Arabidopsis RAR genes include homologues of many DNA repair genes that are defective in different human diseases (for example, hereditary breast cancer and non-polyposis colon cancer, xeroderma pigmentosum and Cockayne's syndrome).

One feature that sets Arabidopsis apart from other eukaryotes is the presence of additional homologues of many RAR genes. This is seen for almost every major class of DNA repair, including recombination (four RecA), DNA ligation (four DNA ligase I), photoreactivation (one class II photolyase and five class I photolyase homologues) and nucleotide excision repair (six RPA1, two RPA2, two Rad25, three TFB1 and four Rad23). This is most striking for genes with probable roles in base excision repair. Arabidopsis encodes 16 homologues of DNA base glycosylases (enzymes that recognize abnormal DNA bases and cleave them from the sugar-phosphate backbone), more than any other species known. This includes several homologues of each of three families of alkylation damage base glycosylases: two of the S. cerevisiae MPG; six of the E. coli TagI; and two of the E. coli AlkA. Arabidopsis also encodes three homologues of the apurinic-apyrimidinic (AP) endonuclease Xth. AP endonucleases continue the base excision repair started by glycosylases by cleaving the DNA backbone at abasic sites.

Figure 6 Predicted centromere composition. Genetically defined centromere boundaries are indicated by filled circles; fully and partially assembled BAC sequences are represented by solid and dashed black lines, respectively. Estimates of repeat sizes within the centromeres were derived from consideration of repeat copy number, physical mapping and cytogenetic assays. [Figure panels: CEN1 to CEN5 with BAC clone labels; key: Mitochondrial, 180 bp, 160 bp, 5S rDNA; scale bar, 100 kb.]

Evolutionary analysis indicates that some of the extra copies of RAR genes in Arabidopsis originated through relatively recent gene duplications, because many of the sets of genes are more closely related to each other than to their homologues in any other species. As duplication is frequently accompanied by functional divergence, the duplicate (paralogous) genes may have different repair specificities or may have evolved functions that are outside RAR functions (as is the case for two of the five class I photolyase homologues, which function as blue-light receptors). In most cases, it is not known whether the paralogous gene copies have different functions. The presence of multiple paralogues might also allow functional redundancy or a greater repair or recombination capacity.

The multiplicity of RAR genes in Arabidopsis is also partly due to the transfer of genes from the organellar genomes to the nucleus. Repair gene homologues that appear to be of chloroplast origin (Supplementary Information Tables 2 and 5) include the recombination proteins RecA, RecG and SMS, two class I photolyase homologues, Fpg, two MutS2 proteins, and the transcription-repair coupling factor Mfd. Two of these (RecA and Fpg) are involved in RAR functions in the plastid, suggesting that the others may be as well. The finding of an Mfd orthologue of cyanobacterial descent is surprising. In E. coli, Mfd couples nucleotide excision repair carried out by UvrABC to transcription, leading to the rapid repair of DNA damage on the transcribed strand of transcribed genes66. The absence of orthologues of UvrABC in Arabidopsis renders the function of Mfd difficult to predict. The presence of Mfd but not UvrABC has been reported for only one other species, a bacterial endosymbiont of the pea aphid.

Other nuclear-encoded Arabidopsis DNA repair gene homologues are evolutionarily related to genes from α-Proteobacteria, and thus may be of mitochondrial descent. In particular, the six homologues of the alkyl-base glycosylase TagI appear to be the result of a large expansion in plants after transfer from the mitochondrial genome. Whether any of these TagI homologues function in the repair and maintenance of mitochondrial DNA has not been determined. More detailed phylogenetic analysis may reveal additional Arabidopsis RAR genes to be of organellar ancestry.

There are some notable absences of proteins important for RAR in other species, including alkyltransferases, MSH4, RPA3 and many components of TFIIH (TFB2, TFB3, TFB4, CCL1, Kin28). Nevertheless, Arabidopsis shows many similarities to the set of DNA repair genes found in other eukaryotes, and therefore offers an experimental system for determining the functions of many of these proteins, in part through characterization of mutants defective in DNA repair67.

Gene regulation

Eukaryotic gene expression involves many nuclear proteins that modulate chromatin structure, contribute to the basal transcription machinery, or mediate gene regulation in response to developmental, environmental or metabolic cues. As predicted by sequence similarity, more than 3,000 such proteins may be encoded by the Arabidopsis genome, suggesting that it has a comparable complexity of gene regulation to other eukaryotes. Arabidopsis has an additional level of gene regulation, however, with DNA methylation potentially mediating gene silencing and parental imprinting.

Plants have evolved several variations on chromatin remodelling proteins, such as the family of HD2 histone deacetylases68. Although Arabidopsis possesses the usual number of SNF2-type chromatin remodelling ATPases, which regulate the expression of nearly all genes, there are significant structural differences between yeast and metazoan SNF2-type genes and their orthologues in Arabidopsis. DDM1, a member of the SNF2 superfamily, and MOM1, a gene with similarity to the SNF2 family, are involved in transcriptional gene silencing in Arabidopsis. MOM1 has no clear orthologue in fungal or metazoan genomes.

Consistent with its methylated DNA, Arabidopsis possesses eight DNA methyltransferases (DMTs). Two of the three types are orthologous to mammalian DMT69 whereas one, chromomethyltransferase70, is unique to plants. No DMTs are found in yeast or C. elegans, although two DMT-like genes are found in Drosophila71. Arabidopsis also encodes eight proteins with methyl-DNA-binding domains (MBDs). Despite lacking methylated DNA, Drosophila encodes four MBD proteins and C. elegans has two. These differences in chromatin components are likely to reflect important differences in chromatin-based regulatory control of gene expression in eukaryotes (Supplementary Information Table 6; http://Ag.Arizona.Edu/chromatin/chromatin.html).

The Arabidopsis genome encodes transcription machinery for the three nuclear DNA-dependent RNA polymerase systems typical of eukaryotes (Supplementary Information Table 6). Transcription by RNA polymerases II and III appears to involve the same machinery as is used in other eukaryotes; however, most transcription factors for RNA polymerase I are not readily identified. Only two polymerase I regulators (other than polymerase subunits and TATA-binding protein) are apparent in Arabidopsis, namely homologues of yeast RRN3 and mouse TTF-1. All eukaryotes examined to date have distinct genes for the largest and second largest subunits of polymerase I, II and III. Unexpectedly, Arabidopsis has two genes encoding a fourth class of largest subunit and second-largest subunit (Supplementary Information Fig. 5). It will be interesting to determine whether the atypical subunits comprise a polymerase that has a plant-specific function. Four genes encoding single-subunit plastid or mitochondrial RNA polymerases have been identified in Arabidopsis (Supplementary Information Table 6). Genes for the bacterial β-, β′- and α-subunits of RNA polymerase are also present, as are homologues of various σ-factors, and these proteins may regulate chloroplast gene expression. Mutations in the Sde-1 gene, encoding RNA-dependent RNA polymerase (RdRp), lead to defective post-transcriptional gene silencing72. We also identified five more closely related RdRp genes.

Our analysis, using both similarity searches and domain matches, has identified 1,709 proteins with significant similarity to known classes of plant transcription factors classified by conserved DNA-binding domains. This analysis used a consistent conservative threshold that probably underestimates the size of families of diverse sequence. This class of protein is the least conserved among all classes of known proteins, showing only 8–23% similarity to transcription factors in other eukaryotes (Fig. 2b). This reduced similarity is due to the absence of certain classes of transcription factors in Arabidopsis and large numbers of plant-specific transcription factors. We did not detect any members of several widespread families of transcription factors, such as the REL (Rel-like DNA-binding domain) homology region proteins, nuclear steroid receptors and forkhead-winged helix and POU (Pit-1, Oct- and Unc-86) domain families of developmental regulators. Conversely, of 29 classes of Arabidopsis transcription factors, 16 appear to be unique to plants (Supplementary Information Table 6). Several of these, such as the AP2/EREBP-RAV, NAC and ARF-AUX/IAA families, contain unique DNA-binding domains, whereas others contain plant-specific variants of more widespread domains, such as the DOF and WRKY zinc-finger families and the two-repeat MYB family.

Functional redundancy among members of large families of closely related transcription factors in Arabidopsis is a significant potential barrier to their characterization73. For example, in the SHATTERPROOF and SEPALLATA families of MADS box transcription factors, all genes must be defective to produce visible mutant phenotypes74,75. These functionally redundant genes are found on the segmental duplications described above. Our analyses, together with the significant sequence similarity found in large families of transcription factors such as the R2R3-repeat MYB and WRKY families, suggest that strategies involving overexpression will be important in determining the functions of members of transcription factor families.

Arabidopsis has twice as many transcription factors as have been identified in Drosophila29 and more than three times as many as in C. elegans1. The significantly greater extent of segmental chromosomal and local tandem duplications in the Arabidopsis genome generates larger gene families, including transcription factors. The partly overlapping functions defined for a few transcription factors are also likely to be much more widespread, implicating many sequence-related transcription factors in the same cellular processes. Finally, the expanded number of genes involved in metabolism, defence and environmental interaction in Arabidopsis (Fig. 2a), which have few counterparts in Drosophila and C. elegans, all require additional numbers and classes of transcription factors to integrate gene function in response to a vast range of developmental and environmental cues.

Cellular organization

Plant cells differ from animal cells in many features such as plastids, vacuoles, Golgi organization, cytoskeletal arrays, plasmodesmata linking cytoplasms of neighbouring cells, and a rigid polysaccharide-rich extracellular matrix, the cell wall. Because the cell wall maintains the position of a cell relative to its neighbours, both changes in cell shape and organized cell divisions, involving cytoskeleton reorganization and membrane vesicle targeting, have major roles in plant development. Plant cytokinesis is also unique in that the partitioning membrane is formed de novo by vesicle fusion. We compared the Arabidopsis genome with those of C. elegans, Drosophila and yeast to glimpse the genetic basis of plant-cell-specific features.

The principal components of the plant cytoskeleton are microtubules (MTs) and actin filaments (AFs); intermediate filaments (IFs) have not been described in plants. Arabidopsis appears to lack genes for cytokeratin or vimentin, the main components of animal IFs, but has several variants of actin, α- and β-tubulin. The Arabidopsis genome also encodes homologues of chaperones that mediate the folding of tubulin and actin polypeptides in yeast and animal cells, such as the prefoldin and cytosolic chaperonin complexes and tubulin-folding cofactors. The dynamic stability of MTs and AFs is influenced by MT-associated proteins and actin-binding proteins, respectively, several of which are encoded by Arabidopsis genes. These include the MT-severing ATPase katanin, AF-cross-linking/bundling proteins, such as fimbrins and villins, and AF-disassembling proteins, such as profilin and actin-depolymerizing factor/cofilin. The Arabidopsis proteome appears to lack homologues of proteins that, in animal cells, link the actin cytoskeleton across the plasma membrane to the extracellular matrix, such as integrin, talin, spectrin, α-actinin, vitronectin or vinculin. This apparent lack of 'anchorage' proteins is consistent with the different composition of the cell wall and with a prominence of cortical MTs at the expense of cortical AFs in plant cells.

Plant-specific cytoskeletal arrays include interphase cortical MTs mediating cell shape, the preprophase band marking the cortical site of cell division, and the phragmoplast assisting in cytokinesis76. Although plant cells lack structural counterparts of the yeast spindle pole body and the animal centrosome, Arabidopsis has homologues of core components of the MT-nucleating γ-tubulin ring complex, such as γ-tubulin, Spc97/hGCP2 and Spc98/hGCP3. Arabidopsis has numerous motor molecules, both kinesins and dyneins with associated dynactin complex proteins, which are presumably involved in the dynamic organization of MTs and in transporting cargo along MT tracks. There are also myosin motors that may be involved in AF-supported organelle trafficking. Essential features of the eukaryotic cytoskeleton appear to be conserved in Arabidopsis.

The Arabidopsis genome encodes homologues of proteins involved in vesicle budding, including several ARFs and ARF-related small G-proteins, large but not small ARF GEFs (adenosine ribosylation factor on guanine nucleotide exchange factor), adapter proteins, and coat proteins of the COP and non-COP types. Arabidopsis also has homologues of proteins involved in vesicle docking and fusion, including SNAP receptors (SNAREs), N-ethylmaleimide-sensitive factor (NSF) and Cdc48-related ATPases, accessory proteins such as Sec1 and soluble NSF attachment protein (SNAP), and Rab-type GTPases. The large number of Arabidopsis SNAREs can be grouped by sequence similarity to yeast and animal counterparts involved in specific trafficking pathways, and some have been localized to the trans-Golgi and the pre-vacuolar pathway77. Arabidopsis also has a receptor for retention of proteins in the endoplasmic reticulum, a cargo receptor for transport to the vacuole and several phragmoplastins related to animal dynamin GTPases. Thus, plant cells appear to use the same basic machinery for vesicle trafficking as yeast and animal cells.

Animal cells possess many functionally diverse small G-proteins of the Ras superfamily involved in signal transduction, AF reorganization, vesicle fusion and other processes. Surprisingly, Arabidopsis appears to lack genes for G-proteins of the Ras, Rho, Rac and Cdc42 subfamilies but has many Rab-type G-proteins involved in vesicle fusion and several Rop-type G-proteins, one of which has a role in actin organization of the tip-growing pollen tube78. The significance of this divergent amplification of different subfamilies of small G-proteins in plants and animals remains to be determined.

Arabidopsis possesses cyclin-dependent kinases (CDKs), including a plant-specific Cdc2b kinase expressed in a cell-cycle-dependent manner, several cyclin subtypes, including a D-type cyclin that mediates cytokinin-stimulated cell-cycle progression79, a retinoblastoma-related protein and components of the ubiquitin-dependent proteolytic pathway of cyclin degradation. In yeast and animal cells, chromosome condensation is mediated by condensins, sister chromatids are held together by cohesins such as Scc1, and metaphase–anaphase transition is triggered by separin/Esp1 endopeptidase proteolysis of Scc1 on APC-mediated degradation of its inhibitor, securin/Pds1. Related proteins are encoded by the Arabidopsis genome. Thus, the basic machinery of cell-cycle progression, genome duplication and segregation appears to be conserved in plants. By contrast, entry into M phase, M-phase progression and cytokinesis seem to be modified in plant cells. Arabidopsis does not appear to have homologues of Cdc25 phosphatase, which activates Cdc2 kinase at the onset of mitosis, or of polo kinase, which regulates M-phase progression in yeast and animals. Conversely, plant-specific mitogen-activated protein (MAP) kinases appear to be involved in cytokinesis.

Figure 7 Comparison of the transport capabilities of Arabidopsis, C. elegans and S. cerevisiae. Pie charts show the percentage of transporters in each organism according to bioenergetics (a) and substrate specificity (b). [Figure legend, panel a: Channels; Secondary transport; Primary transport; Uncharacterized. Panel b: Cations (inorganic); Anions (inorganic); Water; Sugars and derivatives; Carboxylates; Amino acids; Amines, amides and polyamines; Peptides; Bases and derivatives; Vitamins and cofactors; Drugs and toxins; Macromolecules; Unknown.]

Cytokinesis partitions the cytoplasm of the dividing cell. Yeast and animal cells expand the membrane from the surface towards the centre in a cleavage process supported by septins and a contractile ring of actin and type II myosin. By contrast, plant cytokinesis starts in the centre of the division plane and progresses laterally. A transient membrane compartment, the cell plate, is formed de novo by fusion of Golgi-derived vesicles trafficking along the phragmoplast MTs80. Consistent with the unique mode of plant cytokinesis, Arabidopsis appears to lack genes for septins and type II myosin. Conversely, cell-plate formation requires a cytokinesis-specific syntaxin that has no close homologue in yeast and animals. Although syntaxin-mediated membrane fusion occurs in animal cytokinesis and cellularization, the vesicles are delivered to the base of the cleavage furrow. Thus, the plant-specific mechanism of cell division is linked to conserved eukaryotic cell-cycle machinery.

Two main conclusions are suggested by this comparative analysis. First, Arabidopsis and other eukaryotic cells have common features related to intracellular activities, such as vesicle trafficking, cytoskeleton and cell cycle. Second, evolutionarily divergent features, such as organization of the cytoskeleton and cytokinesis, appear to relate to the plant cell wall.

Development

The regulation of development in Arabidopsis, as in animals, involves cell–cell communication, hierarchies of transcription factors, and the regulation of chromatin state; however, there is no reason to suppose that the complex multicellular states of plant and animal development have evolved by elaborating the same general processes during the 1.6 billion years since the last common unicellular ancestor of plants and animals81,82. Our genome analyses reflect the long, independent evolution of many processes contributing to development in the two kingdoms.

Plants and animals have converged on similar processes of pattern formation, but have used and expanded different transcription factor families as key causal regulators. For example, segmentation in insects and differentiation along the anterior–posterior and limb axes in mammals both involve the spatially specific activation of a series of homeobox gene family members. The pattern of activation is causal in the later differentiation of body and limb axis regions. In plants the pattern of floral whorls (sepals, petals, stamens, carpels) is also established by the spatially specific activation of members of a family of transcription factors, but in this instance the family is the MADS box family. Plants also have homeobox genes and animals have MADS box genes, implying that each lineage invented separately its mechanism of spatial pattern formation, while converging on actions and interactions of transcription factors as the mechanism. Other examples show even greater divergence of plant and animal developmental control. Examples are the AP2/EREBP and NAC families of transcription factors, which have important roles in flower and meristem development; both families are so far found only in plants (Supplementary Information Table 6).

A similar story can be told for cell–cell communication. Plants do not seem to have receptor tyrosine kinases, but the Arabidopsis genome has at least 340 genes for receptor Ser/Thr kinases, belonging to many different families, defined by their putative extracellular domains (Supplementary Information Table 7). Several families have members with known functions in cell–cell communication, such as the CLV1 receptor involved in meristem cell signalling, the S-glycoprotein homologues involved in signalling from pollen to stigma in self-incompatible Brassica species, and the BRI1 receptor necessary for brassinosteroid signalling83. Animals also have receptor Ser/Thr kinases, such as the transforming growth factor-β (TGF-β) receptors, but these act through SMAD proteins that are absent from Arabidopsis. The leucine-rich repeat (LRR) family of Arabidopsis receptor kinases shares its extracellular domain with many animal and fungal proteins that do not have associated kinase domains, and there are at least 122 Arabidopsis genes that code for LRR proteins without a kinase domain. Other Arabidopsis receptor kinase families have extracellular domains that are unfamiliar in animals. Thus, evolution is modular, and the plant and animal lineages have expanded different families of receptor kinases for a similar set of developmental processes.

Several Arabidopsis genes of developmental importance appear to be derived from a cyanobacteria-like genome (Supplementary Information Table 2), with no close relationship to any animal or fungal protein. One salient example is the family of ethylene receptors; another gene family of apparent chloroplast origin is the phytochromes, light receptors involved in many developmental decisions (see below). Whereas the land plant phytochromes show clear homology to the cyanobacterial light receptors, which are typical prokaryotic histidine kinases, the plant phytochromes are histidine kinase paralogues with Ser/Thr specificity84. Similarly to the ethylene receptors, the proteins that act downstream of plant phytochrome signalling are not found in cyanobacteria, and thus it appears that a bacterial light receptor entered the plant genome through horizontal transfer, altered its enzymatic activity, and became linked to a eukaryotic signal transduction pathway. This infusion of genes from a cyanobacterial endosymbiont shows that plants have a richer heritage of ancestral genes than animals, and unique developmental processes that derive from horizontal gene transfer.

Signal transduction

Being generally sessile organisms, plants have to respond to local environmental conditions by changing their physiology or redirecting their growth. Signals from the environment include light and pathogen attack, temperature, water, nutrients, touch and gravity. In addition to local cellular responses, some stimuli are communicated across the plant body, with plant hormones and peptides acting as secondary messengers. Some hormones, such as auxin, are taken up into the cell, whereas others, such as ethylene and brassinosteroids, and the peptide CLV3, act as ligands for receptor kinases on the plasma membrane. No matter where the signal is perceived by the cell, it is transduced to the nucleus, resulting in altered patterns of gene expression.

Comparative genome analysis between Arabidopsis, C. elegans and Drosophila supports the idea that plants have evolved their own pathways of signal transduction85. None of the components of the widely adopted signalling pathways found in vertebrates, flies or worms, such as Wingless/Wnt, Hedgehog, Notch/lin12, JAK/STAT, TGF-β/SMADs, receptor tyrosine kinase/Ras or the nuclear steroid hormone receptors, is found in Arabidopsis. By contrast, brassinosteroids are ligands of the BRI1 Ser/Thr kinase, a member of the largest recognizable class of transmembrane sensors encoded by 340 receptor-like kinase (RLK) genes in the Arabidopsis genome (Supplementary Information Table 7). With a few notable exceptions, such as CLV1, the types of ligands sensed by RLKs are completely unknown, providing an enormous future challenge for plant biologists. G-protein-coupled receptors (GPCRs)/seven-transmembrane proteins are an abundant class of proteins in mammalian genomes, instrumental in signal transduction. INTERPRO detected 27 GPCR-related domains in Arabidopsis (Supplementary Information Table 1), although there is no direct experimental evidence for these. Arabidopsis contains a family of 18 seven-transmembrane proteins of the mildew resistance (MlO) class, several of which are involved in defence responses. Notably, only single Gα (GPA1) and Gβ (AGB1) subunits are found in Arabidopsis, both previously known86.

Although cyclic GMP has been proposed to be involved in signal transduction in Arabidopsis⁸⁷, a protein containing a guanylate cyclase domain was not identified in our analyses. Nevertheless, cyclic nucleotide-binding domains were detected in various proteins, indicating that cNMPs may have a role in plant signal transduction. Thus, although cNMP-binding domains appear to have been conserved during evolution, cNMP synthesis in Arabidopsis may have evolved independently.

We were unable to identify a protein with significant similarity to known Gγ subunits, but recent biochemical studies suggest that a protein with this functional capacity is likely to be present in plant cells (H. Ma, personal communication). Therefore, there is potential for the formation of only a single heterotrimeric G-protein complex; however, its functional interaction with any of the potential GPCR-related proteins remains to be determined.

Modules of cellular signal pathways from bacteria and animals have been combined and new cascades have been innovated in plants. A pertinent example is the response to the gaseous plant hormone ethylene⁸⁸. Ethylene is perceived and its signal transmitted by a family of receptors related to bacterial-type two-component histidine kinases (HKs). In bacteria, yeast and plants, these proteins sense many extracellular signals and function in a His-to-Asp phosphorelay network⁸⁹. In turn, these proteins physically interact with the genetically downstream protein CTR1, a Raf/MAPKKK-related kinase, revealing the juxtaposition of bacterial-type two-component receptors and animal-type MAP kinase cascades. Unlike animals, however, Arabidopsis does not seem to have a Ras protein to activate the MAP kinase cascade. MAP kinases are found in abundance in Arabidopsis: we identified ~20, a higher number than in any other eukaryote. As potentially counteracting components, we found ~70 putative PP2C protein phosphatases. Although this group is largely uncharacterized functionally, several members are related to ABI1/ABI2, key negative regulators in the signalling pathway for the plant hormone abscisic acid. Additional components of the His-to-Asp phosphorelay system were also found in Arabidopsis, including authentic response regulators (ARRs), pseudoresponse regulators (PRRs) and phosphotransfer intermediate proteins (HPts)⁹⁰. We found 11 HKs in the proteome (3 new), 16 RRs (2 new) and 8 PRRs (2 new). The biological roles of most ARRs, PRRs and HPts are largely unknown, but several have been found to have diverse functions in plants, including transcriptional activation in response to the plant hormone cytokinin⁹¹, and as components of the circadian clock⁹².

Plants seem to have evolved unique signalling pathways by combining a conserved MAP kinase cascade module with new receptor types. In many cases, however, the ligands are unknown. Conversely, some known signalling molecules, such as auxin, are still in search of a receptor. Auxin signalling may represent yet another plant-specific mode of signalling, with protein degradation through the ubiquitin-proteasome pathway preceding altered gene expression. With many Arabidopsis genes encoding components of the ubiquitin-proteasome pathway, elimination of negative regulators may be a more widespread phenomenon in plant signalling.

Recognizing and responding to pathogens
Plants are constantly exposed to pests, parasites and pathogens and have evolved many defences. In mammals, polymorphism for parasite recognition encoded in the MHC genes contributes to resistance. In plants, disease resistance (R) genes that confer parasite recognition are also extremely polymorphic. This polymorphism has been proposed to restrict parasites, and its absence may explain the breakdown of resistance in crop monocultures⁹³. In contrast to MHC genes, plant resistance genes are found at several loci, and the complete genome sequence enables analysis of their complement and structure. Parasite recognition by resistance genes triggers defence mechanisms through various signalling molecules, such as protein kinases and adapter proteins, ion fluxes, reactive oxygen intermediates and nitric oxide. These halt pathogen colonization through transcriptional activation of defence genes and a form of programmed cell death called the hypersensitive response⁹⁴. The Arabidopsis genome contains diverse resistance genes distributed at many loci, along with components of signalling pathways, and many other genes whose role in disease resistance has been inferred from mutant phenotypes.

Most resistance genes encode intracellular proteins with a nucleotide-binding (NB) site typical of small G proteins, and carboxy-terminal LRRs⁹⁵. Their amino termini carry either a TIR domain or a putative coiled coil (CC). There are 85 TIR–NB–LRR resistance genes at 64 loci, and 36 CC–NB–LRR resistance genes at 30 loci. Some NB–LRR resistance genes express neither obvious TIR nor CC domains at their N termini. This potential class is present seven times, at six loci. There are 15 truncated TIR–NB genes, which lack an LRR, at 10 loci, often adjacent to full TIR–NB–LRR genes. There are also six CC–NB genes, at five loci. These truncated products may function in resistance. Intriguingly, two TIR–NB–LRR genes carry a WRKY domain, found in transcription factors that are implicated in plant defence, and one of these also encodes a protein kinase domain.

Resistance gene evolution may involve duplication and divergence of linked gene families³⁶; however, most (46) resistance genes are singletons; 50 are in pairs, 21 are in 7 clusters of 3 family members, with single clusters of 4, 5, 7, 8 and 9 members, respectively. Of the non-singletons, ~60% of pairs are in direct repeats, and ~40% are in inverted repeats. Resistance genes are unevenly distributed between chromosomes, with 49 on chromosome 1; 2 on chromosome 2; 16 on chromosome 3; 28 on chromosome 4; and 55 on chromosome 5.
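As a quick consistency check, the cluster-size breakdown and the per-chromosome breakdown given in this paragraph can be tallied independently. The sketch below is only illustrative arithmetic on the numbers quoted in the text; the variable and function names are ours, and it assumes that "50 are in pairs" means 50 genes arranged as 25 pairs.

```python
# Counts transcribed from the paragraph; names are illustrative, not from the paper.
# cluster size -> number of clusters of that size
# (assumption: "50 are in pairs" = 25 pairs; "21 are in 7 clusters of 3" = 7 triplets)
cluster_sizes = {1: 46, 2: 25, 3: 7, 4: 1, 5: 1, 7: 1, 8: 1, 9: 1}

# chromosome number -> resistance genes reported on that chromosome
genes_per_chromosome = {1: 49, 2: 2, 3: 16, 4: 28, 5: 55}


def total_from_clusters(sizes):
    """Total genes implied by a {cluster size: number of clusters} table."""
    return sum(size * count for size, count in sizes.items())


def total_from_chromosomes(per_chromosome):
    """Total genes implied by a {chromosome: gene count} table."""
    return sum(per_chromosome.values())


by_cluster = total_from_clusters(cluster_sizes)
by_chromosome = total_from_chromosomes(genes_per_chromosome)
share_on_chr5 = genes_per_chromosome[5] / by_chromosome

print(by_cluster, by_chromosome, round(share_on_chr5, 3))
```

Under these assumptions both breakdowns give the same total, and chromosomes 1 and 5 together carry roughly two-thirds of the resistance genes.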

In other plant species, resistance genes encode both transmembrane receptors for secreted pathogen products and protein kinases, and some other classes are also found. The Cf genes in tomato encode extracellular LRRs with a transmembrane domain and short cytoplasmic domain. Mutation in an Arabidopsis homologue, CLAVATA2, results in enlarged meristems, but to date no resistance function has been assigned to the 30 Arabidopsis CLV2 homologues. CLAVATA1, a transmembrane LRR kinase, is also required for meristem function. Xa21, a rice LRR-kinase, confers Xanthomonas resistance, and the Arabidopsis FLS2 LRR kinase confers recognition of flagellin. It has been proposed that CLV1 and CLV2 function as a heterodimer; perhaps this is also true for Xa21, FLS2 and Cf proteins. There are 174 LRR transmembrane kinases in Arabidopsis, with only FLS2 assigned a role in resistance. A unique resistance gene, beet Hs1pro-1, which confers nematode resistance, has two Arabidopsis homologues.

The tomato Pto Ser/Thr kinase acts as a resistance protein in conjunction with an NB–LRR protein, so similar kinases might do the same for Arabidopsis NB–LRR proteins. There are 860 Ser/Thr kinases in the Arabidopsis sequence. Fifteen of these share 50% identity over the Pto-aligned region. The Toll pathway in Drosophila and mammals regulates innate immune responses through LRR/TIR domain receptors that recognize bacterial lipopolysaccharides⁹⁶. Pto is highly homologous to Drosophila PELLE and mammalian IRAK protein kinases that mediate the TIR pathway.
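The "50% identity over the Pto-aligned region" criterion can be illustrated with a small percent-identity calculation. This is a sketch under stated assumptions, not the procedure used for the analysis: the text does not say how identity was computed, so the function below uses one common convention (matches divided by columns in which both sequences have a residue), and the two toy sequences are invented for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator. Other
    conventions (for example, dividing by the full alignment length) give
    different values for the same alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == gap or residue_b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if residue_a == residue_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical aligned fragments, not real Pto or Arabidopsis sequences.
reference_fragment = "GKGGFGEVYK"
candidate_fragment = "GEG-FGKVYR"

identity = percent_identity(reference_fragment, candidate_fragment)
meets_threshold = identity >= 50.0
print(round(identity, 1), meets_threshold)
```

Because the choice of denominator changes the result, any threshold such as 50% is only meaningful alongside the convention used to compute it.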

Additional genes required for resistance have been defined by our analysis of the genome sequence. The ndr1 mutation defines a gene required by the CC–NB–LRR genes RPS2 and RPM1. NDR1 is 1 of 28 Arabidopsis genes that are similar both to each other and to the tobacco HIN1 gene that is transcriptionally induced early during the hypersensitive response. EDS1 is a gene required for TIR–NB–LRR function, and like PAD4, encodes a protein with a putative lipase motif. EDS1, PAD4 and a third gene comprise the EDS1/PAD4 family. The NPR1/NIM1/SAI1 gene is required for systemic acquired resistance, and we found five additional NPR1 homologues. Recessive mutations at both the barley Mlo and Arabidopsis LSD1 loci confer broad-spectrum resistance and derepress a cell-death program. There are at least 18 Mlo family members that resemble heterotrimeric GPCRs in Arabidopsis, and only two LSD1 homologues.

One of the earliest responses to pathogen recognition is the production of reactive oxygen intermediates. This involves a specialized respiratory burst oxidase protein that transfers an electron across the plasma membrane to make superoxide. Arabidopsis encodes eight apparently functional gp91 homologues, called Atrboh genes. Unlike gp91, they all carry an ~300 amino-acid N-terminal extension carrying an EF-hand Ca²⁺-binding domain. In mammals, activation of the respiratory oxidative burst complex in the neutrophil, which includes gp91, requires the action of Rac proteins. As no Rac or Ras proteins are found in Arabidopsis, members of the large rop family of G proteins may carry this out. Similarly, we did not detect any Arabidopsis homologues of other mammalian respiratory burst oxidase components (p22, p47, p67, p40).

There are no clear homologues of many mammalian defence and cell-death control genes. Although nitric oxide production is involved in plant defence, there is no obvious homologue of nitric oxide synthase. Also absent are apparent homologues of the REL domain transcription factors involved in innate immunity in both Drosophila and mammals. We found no similarity to proteins involved in regulating apoptosis in animal cells, such as classical caspases, bcl2/ced9 and baculovirus p35. There are, however, 36 cysteine proteases. There are also eight homologues of a newly defined metacaspase family⁹⁷, two of which, along with LSD1, have a clear GATA-type zinc-finger.

Photomorphogenesis and photosynthesis
Because nearly all plants are sessile and most depend on photosynthesis, they have evolved unique ways of responding to light. Light serves as an energy source, as well as a trigger and modulator of complex developmental pathways, including those regulated by the circadian clock. Light is especially important during seedling emergence, where it stimulates chlorophyll production, leaf development, cotyledon expansion, chloroplast biogenesis and the coordinated induction of many nuclear- and chloroplast-encoded genes, while at the same time inhibiting stem growth. The goal of this process, called photomorphogenesis, is the establishment of a body plan that allows the plant to be an efficient photosynthetic machine under varying light conditions⁹⁸. The signal transduction cascade leading to light-induced responses begins with the activation of photoreceptors. Next, the light signal is transduced via positively and negatively acting nuclear and cytoplasmic proteins, causing activation or derepression of nuclear and chloroplast-encoded photosynthetic genes and enabling the plant to establish optimal photoautotrophic growth. Although genetic and biochemical studies have defined many of the components in this process, the genome sequence provides an opportunity to identify comprehensively Arabidopsis genes involved in photomorphogenesis and the establishment of photoautotrophic growth. We identified at least 100 candidate genes involved in light perception and signalling, and 139 nuclear-encoded genes that potentially function in photosynthesis.

The roles of only 35 of the 100 candidate photomorphogenic genes have been described (Supplementary Information Table 8). All of the light photoreceptors had been discovered previously, including five red/far-red absorbing phytochromes (PHYA–E), two blue/ultraviolet-A absorbing cryptochromes (CRY1 and CRY2), one blue-absorbing phototropin (NPH1) and one NPH1-like protein (NPL1). In contrast, we uncovered many new proteins similar to the photomorphogenesis regulators COP/DET/FUS, PKS1, PIF3, NDPK2, SPA1, FAR1, GIGANTEA, FIN219, HY5, CCA1, ATHB-2, ZEITLUPE, FKF1, LKP1, NPH3 and RPT2.

Both the phytochromes and NPH1 contain chromophores for light sensing coupled to kinase domains for signal transmission. Phytochromes have an N-terminal chromophore-binding domain, two PAS domains, and a C-terminal Ser/Thr kinase domain⁹⁹, whereas NPH1 has two LOV domains (members of the PAS domain superfamily) for flavin mononucleotide binding and a C-terminal Ser/Thr kinase domain¹⁰⁰. PAS domains potentially sense changes in light, redox potential and oxygen energy levels, as well as mediating protein–protein interactions⁹⁹,¹⁰⁰. We searched for uncharacterized proteins with the combination of a kinase domain and either a phytochrome chromophore-binding site or PAS domains. Although we found no new phytochrome-like genes, we did identify four predicted proteins that contain PAS and kinase domains (Supplementary Information Fig. 6). These proteins share 80% amino-acid identity, but, unlike NPH1 and NPL1, have only one PAS domain. The combination of potential signal sensing and transmitting domains makes it tempting to speculate that these proteins may be receptors for light or other signals.
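The domain-combination search described above amounts to a filter over per-protein domain annotations. The following is a minimal sketch of that idea, assuming domain annotations are already available as a simple mapping; the protein identifiers, domain labels and the `candidate_receptors` helper are hypothetical and do not reflect the actual tools or data used in the analysis.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical annotations: protein id -> ordered list of domain labels.
annotations: Dict[str, List[str]] = {
    "protA": ["chromophore_binding", "PAS", "PAS", "kinase"],  # phytochrome-like layout
    "protB": ["PAS", "PAS", "kinase"],                          # NPH1-like layout (two PAS/LOV)
    "protC": ["PAS", "kinase"],                                 # single PAS plus kinase
    "protD": ["kinase"],                                        # kinase only
    "protE": ["PAS"],                                           # sensor domain only
}

# Hypothetical set of proteins that are already characterized and so excluded.
characterized: Set[str] = {"protA", "protB"}


def candidate_receptors(
    annotations: Dict[str, List[str]], characterized: Set[str]
) -> List[Tuple[str, int]]:
    """Return (protein id, number of PAS domains) for uncharacterized proteins
    that combine a kinase domain with a PAS or chromophore-binding domain."""
    hits = []
    for protein_id, domains in annotations.items():
        if protein_id in characterized:
            continue
        has_kinase = "kinase" in domains
        has_sensor = "PAS" in domains or "chromophore_binding" in domains
        if has_kinase and has_sensor:
            hits.append((protein_id, domains.count("PAS")))
    return sorted(hits)


hits = candidate_receptors(annotations, characterized)
print(hits)
```

Reporting the PAS count alongside each hit mirrors the distinction drawn in the text between proteins with one PAS domain and those, such as NPH1, with two.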

Our screen included searches for components of photosynthetic reaction centres and light-harvesting complexes, enzymes involved in CO₂ fixation and enzymes in pigment biosynthesis. We identified 11 core proteins of photosystem I, including the eukaryotic-specific components PsaG and PsaH¹⁰¹, and 8 photosystem II proteins, including a single member (psbW) of the photosystem II core. We also found 26 proteins similar to the chlorophyll-a/b binding proteins (8 Lhca and 18 Lhcb). Of the seven subunits of the cytochrome b₆f complex (PetA–D, PetG, PetL, PetM), only one (PetC) was found in the nuclear genome, whereas the remainder are probably encoded in the chloroplast. Similarly, of the nine subunits of the chloroplast ATP synthase complex, three are encoded in the nucleus, including the II-, γ- and δ-subunits; the remaining subunits (I, III, IV, α, β, ε) are encoded in the chloroplast¹⁰². Ten genes were related to the soluble components of the electron transfer chain, including two plastocyanins, five ferredoxins and three ferredoxin/NADP oxidoreductases. Forty genes are predicted to have a role in CO₂ fixation, including all of the enzymes in the Calvin–Benson cycle. For pigment biosynthesis, 16 genes in chlorophyll biosynthesis and 31 genes in carotenoid biosynthesis were found (Supplementary Information Table 8). Our analyses have identified several potential components of the light perception pathway, and have revealed the complex distribution of components of the photosynthetic apparatus between nuclear and plastid genomes.
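The split of subunits between nuclear and chloroplast genomes described above can be written down as a small table and summarized. The sketch below simply encodes the subunit lists given in the text (Greek letters spelled out); the dictionary layout and function name are ours, and the chloroplast assignments for cytochrome b6f follow the text's "probably encoded in the chloroplast".

```python
# Subunit assignments transcribed from the paragraph; the structure is illustrative.
complexes = {
    "cytochrome b6f": {
        "nuclear": ["PetC"],
        # described in the text as "probably" chloroplast-encoded
        "chloroplast": ["PetA", "PetB", "PetD", "PetG", "PetL", "PetM"],
    },
    "chloroplast ATP synthase": {
        "nuclear": ["II", "gamma", "delta"],
        "chloroplast": ["I", "III", "IV", "alpha", "beta", "epsilon"],
    },
}


def nuclear_fraction(assignment):
    """Fraction of a complex's subunits that are nuclear-encoded."""
    n_nuclear = len(assignment["nuclear"])
    n_total = n_nuclear + len(assignment["chloroplast"])
    return n_nuclear / n_total


summary = {
    name: (
        len(parts["nuclear"]) + len(parts["chloroplast"]),
        round(nuclear_fraction(parts), 2),
    )
    for name, parts in complexes.items()
}
print(summary)
```

The subunit totals recovered this way (seven and nine) agree with the totals stated in the text, and in both complexes only a minority of subunits is nuclear-encoded.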

Metabolism
Arabidopsis is an autotrophic organism that needs only minerals, light, water and air to grow. Consequently, a large proportion of the genome encodes enzymes that support metabolic processes, such as photosynthesis, respiration, intermediary metabolism, mineral acquisition, and the synthesis of lipids, fatty acids, amino acids, nucleotides and cofactors¹⁰³. With respect to these processes, Arabidopsis appears to contain a complement of genes similar to those in the photoautotrophic cyanobacterium Synechocystis⁴⁵, but, whereas Synechocystis generally has a single gene encoding an enzyme, Arabidopsis frequently has many. For example, Arabidopsis has at least seven genes for the glycolytic enzyme pyruvate kinase, with an additional five for pyruvate kinase-like proteins. Whatever the reason for this high level of redundancy, it varies from gene to gene in the same pathway; the 11 enzymes of glycolysis are encoded by up to 51 genes that are present in as few as one or as many as eight copies. Similarly, of the 59 genes encoding proteins involved in glycerolipid metabolism, 39 are represented by more than one gene¹⁰⁴. Genome duplication and expansion of gene families by tandem duplication have contributed to this diversity.

This high degree of apparent structural redundancy does not necessarily imply functional redundancy. For instance, although there are seven genes for serine hydroxymethyltransferase, a mutation in the gene for the mitochondrial form completely blocks the photorespiratory pathway¹⁰⁵. Although there are 12 genes for cellulose synthase, mutations in at least 2 of the 12 confer distinct phenotypes because of tissue-specific gene expression¹⁰⁶.

The metabolome of Arabidopsis differs from that of cyanobacteria, or of any other organism sequenced to date, by the presence of many genes encoding enzymes for pathways that are unique to vascular plants. In particular, although relatively little is known about the enzymology of cell-wall metabolism, more than 420 genes could be assigned probable roles in pathways responsible for the synthesis and modification of cell-wall polymers. Twelve genes encode cellulose synthase, and 29 other genes encode 6 families of structurally related enzymes thought to synthesize other major polysaccharides¹⁰⁶. Roughly 52 genes encode polygalacturonases, 20 encode pectate lyases and 79 encode pectin esterases, indicating a massive investment in modifying pectin. Similarly, the presence of 39 β-1,3-glucanases, 20 endoxyloglucan transglycosylases, 50 cellulases and other hydrolases, and 23 expansins reflects the importance of wall remodelling during growth of plant cells. Excluding ascorbate and glutathione peroxidases, there are 69 genes with significant similarity to known peroxidases and 15 laccases (diphenol oxidases). Their presence in such abundance indicates the importance of oxidative processes in the synthesis of lignin, suberin and other cell-wall polymers.

The high degree of apparent redundancy in the genes for cell-wall metabolism might reflect differences in substrate specificity by some of the enzymes. It is already known that cell types have different wall compositions, which may require that the relevant enzymes be subject to cell-type-specific transcriptional regulation. Of the 40 or so cell types that plants make, almost all can be identified by unique features of their cell wall¹⁰⁷. A large number of genes involved in wall metabolism have yet to be defined. Although more than 60 genes for glycosyltransferases can be found in the genome sequence, most of these are probably involved in protein glycosylation or metabolite catabolism and do not seem to be adequate to account for the polysaccharide complexity of the wall. For instance, at least 21 enzymes are required just to produce the linkages of the pectic polysaccharide RGII, and none of these enzymes has been identified at present. Thus, if these and related enzymes involved in the synthesis of other cell-wall polymers are also represented by multiple genes, a substantial number of the genes of currently unknown function may be involved in cell-wall metabolism.

Higher plants collectively synthesize more than 100,000 secondary metabolites. Because flowering plants are thought to have similar numbers of genes, it is apparent that a great deal of enzyme creation took place during the evolution of higher plants. An important factor in the rapid evolution of metabolic complexity is the large family of cytochrome P450s that are evident in Arabidopsis (Supplementary Information Table 1). These enzymes represent a superfamily of haem-containing proteins, most of which catalyse NADPH- and O₂-dependent hydroxylation reactions. Plant P450s participate in myriad biochemical pathways including those devoted to the synthesis of plant products, such as phenylpropanoids, alkaloids, terpenoids, lipids, cyanogenic glycosides and glucosinolates, and plant growth regulators, such as gibberellins, jasmonic acid and brassinosteroids. Whereas Arabidopsis has ~286 P450 genes, Drosophila has 94, C. elegans has 73 and yeast has only 3. This low number in yeast indicates that there are few reactions of basic metabolism that are catalysed by P450s. It seems likely that many animal P450s are involved in detoxification of compounds from food plant sources. The role of endogenous enzymes is poorly understood; only a few dozen P450 enzymes from plants have been characterized to any extent. The discrepancy between the number of known P450-catalysed reactions and the number of genes suggests that Arabidopsis produces a relatively large number of metabolites that have yet to be identified.
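To put the P450 gene counts quoted above on a common scale, the Arabidopsis count can be expressed as a fold difference relative to each of the other organisms. This is illustrative arithmetic only; it uses the approximate Arabidopsis figure (~286) as if it were exact, and the function name is ours.

```python
# P450 gene counts as quoted in the text (the Arabidopsis value is approximate).
p450_counts = {"Arabidopsis": 286, "Drosophila": 94, "C. elegans": 73, "yeast": 3}


def fold_excess(counts, reference):
    """Ratio of the reference organism's count to each other organism's count."""
    reference_count = counts[reference]
    return {
        organism: reference_count / count
        for organism, count in counts.items()
        if organism != reference
    }


ratios = fold_excess(p450_counts, "Arabidopsis")
for organism, ratio in ratios.items():
    print(f"{organism}: {ratio:.1f}-fold fewer P450 genes than Arabidopsis")
```

On these figures Arabidopsis carries roughly three to four times as many P450 genes as either animal, and nearly a hundred times as many as yeast.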

In addition to the large number of cytochrome P450s, Arabidopsis has many other genes that suggest the existence of pathways or processes that are not currently known. For instance, the presence of 19 genes with similarity to anthranilate N-hydroxycinnamoyl/benzoyl transferase is currently inexplicable. This enzyme is involved in the synthesis of dianthramide phytoalexins in Caryophyllaceae and Gramineae. No phytoalexins of this class have been described in Arabidopsis as yet. Similarly, the presence of 12 genes with sequence similarity to the berberine bridge enzyme, ((S)-reticuline:oxygen oxidoreductase (methylene-bridge-forming); EC 1.5.3.9), and 13 genes with similarity to tropinone reductase, suggests that Arabidopsis may have the ability to produce alkaloids. In other plants, the berberine bridge enzyme transforms reticuline into scoulerine, a biosynthetic precursor to a multitude of species-specific protopine, protoberberine and benzophenanthridine alkaloids. The discovery of these and many other intriguing genes in the Arabidopsis genome has created a wealth of new opportunities to understand the metabolic and structural diversity of higher plants.

Concluding remarks
The twentieth century began with the rediscovery of Mendel's rules of inheritance in pea¹⁰⁸, and it ends with the elucidation of the complete genetic complement of a model plant, Arabidopsis. The analysis of the completed sequence of a flowering plant reported here provides insights into the genetic basis of the similarities and differences of diverse multicellular organisms. It also creates the potential for direct and efficient access to a much deeper understanding of plant development and environmental responses, and permits the structure and dynamics of plant genomes to be assessed and understood.

Arabidopsis, C. elegans and Drosophila have a similar range of 11,000–15,000 different types of proteins, suggesting this is the minimal complexity required by extremely diverse multicellular eukaryotes to execute development and respond to their environment. We account for the larger number of gene copies in Arabidopsis compared with these other sequenced eukaryotes with two possible explanations. First, independent amplification of individual genes has generated tandem and dispersed gene families to a greater extent in Arabidopsis, and unequal crossing over may be the predominant mechanism involved. Second, ancestral duplication of the entire genome and subsequent rearrangements have resulted in segmental duplications. The pattern of these duplications suggests an ancient polyploidy event, and mutant analysis indicates that at least some of the many duplicate genes are functionally redundant. Their occurrence in a functionally diploid genetic model came as a surprise, and is reminiscent of the situation in maize, an ancient segmental allotetraploid. The remarkable degree of genome plasticity revealed in the large-scale duplications may be needed to provide new functions, as alternative promoters and alternative splicing appear to be less widely used in plants than they are in animals. Apart from duplicated segments, the overall chromosome structure of Arabidopsis closely resembles that of Drosophila; transposons and other repetitive sequences are concentrated in the heterochromatic regions surrounding the centromere, whereas the euchromatic arms are largely devoid of repetitive sequences. Conversely, most protein-coding genes reside in the euchromatin, although a number of expressed genes have been identified in centromeric regions. Finally, Arabidopsis is the first methylated eukaryotic genome to be sequenced, and will be invaluable in the study of epigenetic inheritance and gene regulation.

Unlike most animals, plants generally do not move, they can perpetuate indefinitely, they reproduce through an extended haploid phase, and they synthesize all their metabolites. Our comparison of Arabidopsis, bacterial, fungal and animal genomes starts to define the genetic basis for these differences between plants and other life forms. Basic intracellular processes, such as translation or vesicle trafficking, appear to be conserved across kingdoms, reflecting a common eukaryotic heritage. More elaborate intercellular processes, including physiology and development, use different sets of components. For example, membrane channels, transporters and signalling components are very different in plants and animals, and the large number of transcription factors unique to plants contrasts with the conservation of many chromatin proteins across the three eukaryotic kingdoms. Unexpected differences between seemingly similar processes include the absence of intracellular regulators of cell division (Cdc25) and apoptosis (Bcl-2). On the other hand, DNA repair appears more highly conserved between plants and mammals than within the animal kingdom, perhaps reflecting common factors such as DNA methylation. Our analysis also shows that many genes of the endosymbiotic ancestor of the plastid have been transferred to the nucleus, and the products of this rich prokaryotic heritage contribute to diverse functions such as photoautotrophic growth and signalling.

The sequence reported here changes the fundamental nature of plant genetic analysis. Forward genetics is greatly simplified as mutations are more conveniently isolated molecularly, but at the same time extensive gene duplications mean that functional redundancy must be taken into account. At a biochemical level, the specificity conferred by nucleotide sequence, and the completeness of the survey allow complex mixtures of RNA and protein to be resolved into their individual components using micro-arrays and mass spectrometry. This specificity can also be used in the parallel analysis of genome-wide polymorphisms and quantitative traits in natural populations¹⁰⁹. Looking ahead, the challenge of determining the function of the large set of predicted genes, many of which are plant-specific, is now a clear priority, and multinational programs have been initiated to accomplish this goal using site-selected mutagenesis among the necessary tools¹¹⁰. Finally, productive paths of crop improvement, based on enhanced knowledge of Arabidopsis gene function, will help meet the challenge of sustaining our food supply in the coming years.

Note added in proof: at the time of publication 17 centromeric BACs and 5 sequence gaps in chromosome arms are being sequenced.

Methods
The three centres used similar annotation approaches involving in silico gene-finding methods, comparison to EST and protein databases, and manual reconciliation of that data. Gene finding involved three steps: (1) analysis of BAC sequences using a computational gene finder; (2) alignment of the sequence to the protein and EST databases; (3) assignment of functions to each of the genes. Genscan¹¹¹, GeneMark.HMM¹¹², Xgrail¹¹³, Genefinder (P. Green, unpublished software) and GlimmerA¹¹⁴ were used to analyse BAC sequences. All of these systems were specially trained for Arabidopsis genes. Splice sites were predicted using NetGene2¹¹⁵, Splice Predictor¹¹⁶ and GeneSplicer (M. Pertea and S. Salzberg, unpublished software). For the second step, BACs were aligned to ESTs and to the Arabidopsis gene index¹¹⁷ using programs such as DDS/GAP2¹¹⁸ or BLASTN¹¹⁹. Segmental duplications were analysed and displayed using a modified version of DIALIGN2 (ref. 120).

Received 20 October; accepted 15 November 2000.

1. The C. elegans Sequencing Consortium. Sequence and analysis of the genome of C. elegans. Science

282, 2012±2018 (1998).

2. Adams, M. D. The genome sequence of Drosophila melanogaster. Science 287, 2185±2195 (2000).

3. Meinke, D. W., Cherry, J. M., Dean, C., Rounsley, S. D. & Koornneef, M. Arabidopsis thaliana: a

model plant for genome analysis. Science 282, 662±665 (1998).

4. Lin, X. et al. Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402,

761±768 (1999).

5. Mayer, K. et al. Sequence and analysis of chromsome 4 of the plant Arabidopsis thaliana. Nature 402,

769±777 (1999).

6. Theologis, A. et al. Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana. Nature

408, 816±820 (2000).

7. Salanoubat, M. et al. Sequence and analysis of chromosome 3 of the plant Arabidopsis thaliana.

Nature 408, 820±822 (2000).

8. Tabata, S. et al. Sequence and analysis of chromosome 5 of the plant Arabidopsis thaliana. Nature

408, 820±822 (2000).

9. Choi, S. D., Creelman, R., Mullet, J. & Wing, R. A. Construction and characterisation of a bacterial artificial chromosome library from Arabidopsis thaliana. Weeds World 2, 17–20 (1995).

10. Mozo, T., Fischer, S., Shizuya, H. & Altmann, T. Construction and characterization of the IGF Arabidopsis BAC library. Mol. Gen. Genet. 258, 562–570 (1998).

11. Liu, Y.-G., Mitsukawa, N., Vazquez-Tello, A. & Whittier, R. F. Generation of a high-quality P1 library of Arabidopsis suitable for chromosome walking. Plant J. 7, 351–358 (1995).

12. Liu, Y.-G. et al. Complementation of plant mutants with large genomic DNA fragments by a transformation-competent artificial chromosome vector accelerates positional cloning. Proc. Natl Acad. Sci. USA 96, 6535–6540 (1999).

13. Marra, M. et al. A map for sequence analysis of the Arabidopsis thaliana genome. Nature Genet. 22, 265–270 (1999).

14. Mozo, T. et al. A complete BAC-based physical map of the Arabidopsis thaliana genome. Nature Genet. 22, 271–275 (1999).

15. Sato, S. et al. Structural analysis of Arabidopsis thaliana chromosome 5. I. Sequence features of the 1.6 Mb regions covered by twenty physically assigned P1 clones. DNA Res. 4, 215–230 (1997).

16. Bent, E., Johnson, S. & Bancroft, I. BAC representation of two low-copy regions of the genome of Arabidopsis thaliana. Plant J. 13, 849–855 (1998).

17. Copenhaver, G. P. et al. Genetic definition and sequence analysis of Arabidopsis centromeres. Science 286, 2468–2474 (1999).

18. Meyerowitz, E. M. & Somerville, C. R. Arabidopsis (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 1994).

19. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).

20. Pavy, N. et al. Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15, 887–900 (1999).

21. Mewes, H. W. et al. Overview of the yeast genome. Nature 387 (Suppl.), 7–65 (1997).

22. Frishman, D. et al. Functional and structural genomics using PEDANT. Bioinformatics (in the press).

23. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462 (1997).

24. Kotani, H. & Tabata, S. Lessons from the sequencing of the genome of a unicellular cyanobacterium, Synechocystis sp. PCC6803. Annu. Rev. Plant Physiol. Plant Mol. Biol. 49, 151–171 (1998).

25. Apweiler, R. et al. INTERPRO (http://www.ebi.ac.uk/interpro/). Collaborative Computer Project 11 Newsletter no. 10 (Cambridge, 2000).

26. Bent, A. F. et al. RPS2 of Arabidopsis thaliana: a leucine-rich repeat class of plant disease resistance genes. Science 265, 1856–1860 (1994).

27. Skowyra, D. et al. F box proteins are receptors that recruit phosphorylated substrates to the SCF ubiquitin-ligase complex. Cell 91, 209–219 (1997).

28. Joazeiro, C. A. P. & Weissman, A. M. RING finger proteins: mediators of ubiquitin ligase activity. Cell 102, 549–552 (2000).

29. Rubin, G. M. et al. Comparative genomics of the eukaryotes. Science 287, 2204–2215 (2000).

30. Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).

31. Blanc, G. et al. Extensive duplication and reshuffling in the Arabidopsis genome. Plant Cell 12, 1093–1102 (2000).

32. Wendel, J. F. Genome evolution in polyploids. Plant Mol. Biol. 42, 225–249 (2000).

33. Gaut, B. S. & Doebley, J. F. DNA sequence evidence for the segmental allotetraploid origin of maize. Proc. Natl Acad. Sci. USA 94, 6809–6814 (1997).

34. Ku, H.-M., Vision, T., Liu, J. & Tanksley, S. D. Comparing sequenced segments of the tomato and Arabidopsis genomes: Large-scale duplication followed by selective gene loss creates a network of synteny. Proc. Natl Acad. Sci. USA 97, 9121–9126 (2000).

35. Noel, L. et al. Pronounced intraspecific haplotype divergence at the RPP5 complex disease resistance locus of Arabidopsis. Plant Cell 11, 2099–2111 (1999).

36. Ellis, J., Dodds, P. & Pryor, T. Structure, function, and evolution of plant disease resistance genes. Trends Plant Sci. 3, 278–284 (2000).

37. Tanksley, S. D. et al. High density molecular linkage maps of the tomato and potato genomes. Genetics 132, 1141–1160 (1992).

38. Moore, G., Devos, K. M., Wang, Z. & Gale, M. D. Grasses, line up and form a circle. Curr. Biol. 5, 737–739 (1995).

39. Acarkan, A., Rossberg, M., Koch, M. & Schmidt, R. Comparative genome analysis reveals extensive conservation of genome organisation for Arabidopsis thaliana and Capsella rubella. Plant J. 23, 55–62 (2000).

40. Cavell, A., Lydiate, D., Parkin, I., Dean, C. & Trick, M. A 30 centimorgan segment of Arabidopsis thaliana chromosome 4 has six collinear homologues within the Brassica napus genome. Genome 41, 62–69 (1998).

41. O'Neill, C. & Bancroft, I. Comparative physical mapping of segments of the genome of Brassica oleracea var alboglabra that are homologous to sequenced regions of the chromosomes 4 and 5 of Arabidopsis thaliana. Plant J. 23, 233–243 (2000).

42. Wolfe, K. H., Gouy, M., Yang, Y.-W., Sharp, P. M. & Li, W.-H. Date of the monocot-dicot divergence estimated from the chloroplast DNA sequence data. Proc. Natl Acad. Sci. USA 86, 6201–6205 (1989).

43. van Dodeweerd, A.-M. et al. Identification and analysis of homologous segments of the genomes of rice and Arabidopsis thaliana. Genome 42, 887–892 (1999).

44. Mayer, K. Sequence level analysis of homologous segments of the genomes of rice and Arabidopsis thaliana. Genome Res. (submitted).

45. Sato, S. Complete structure of the chloroplast genome of Arabidopsis thaliana. DNA Res. 6, 283–290 (1999).

46. Unseld, M., Marienfeld, J., Brandt, P. & Brennicke, A. The mitochondrial genome in Arabidopsis thaliana contains 57 genes in 366,924 nucleotides. Nature Genet. 15, 57–61 (1997).

47. Palmer, J. D. et al. Dynamic evolution of plant mitochondrial genomes: mobile genes and introns and highly variable mutation rates. Proc. Natl Acad. Sci. USA 97, 6960–6966 (2000).

48. Le, Q.-H. et al. Transposon diversity in Arabidopsis thaliana. Proc. Natl Acad. Sci. USA 97, 7376–7381 (2000).

49. Yu, Z., Wright, S. & Bureau, T. Mutator-like elements (MULEs) in Arabidopsis thaliana: Structure, diversity and evolution. Genetics (in the press).

50. Feschotte, C. & Mouches, C. Evidence that a family of miniature inverted-repeat transposable elements (MITEs) from the Arabidopsis thaliana genome has arisen from a pogo-like DNA transposon. Mol. Biol. Evol. 17, 730–737 (2000).

51. Martienssen, R. Transposons, DNA methylation and gene control. Trends Genet. 14, 263–264 (1998).

52. Singer, T., Yordan, C. & Martienssen, R. Robertson's Mutator transposons in Arabidopsis are regulated by the chromatin-remodeling gene Decrease in DNA Methylation (DDM1). Genes Dev. (in the press).

53. SanMiguel, P. et al. Nested retrotransposons in the intergenic regions of the maize genome. Science 274, 765–768 (1996).

54. Copenhaver, G. P. & Pikaard, C. S. Two-dimensional RFLP analyses reveal megabase-sized clusters of rRNA gene variants in Arabidopsis thaliana, suggesting local spreading of variants as the mode for gene homogenization during concerted evolution. Plant J. 9, 273–282 (1996).

55. Fransz, P. et al. Cytogenetics for the model system Arabidopsis thaliana. Plant J. 13, 867–876 (1998).

56. Richards, E. J. & Ausubel, F. M. Isolation of a higher eukaryotic telomere from Arabidopsis thaliana. Cell 53, 127–136 (1988).

57. Round, E. K., Flowers, S. K. & Richards, E. J. Arabidopsis thaliana centromere regions: genetic map positions and repetitive DNA structure. Genome Res. 7, 1045–1053 (1997).

58. The CSHL/WUGSC/PEB Arabidopsis Sequencing Consortium. The complete sequence of a heterochromatic island from a higher eukaryote. Cell 100, 377–386 (2000).

59. Fransz, P. F. et al. Integrated cytogenetic map of chromosome arm 4S of A. thaliana: Structural organization of heterochromatic knob and centromere region. Cell 100, 367–376 (2000).

60. Paulsen, I. T., Nguyen, L., Sliwinski, M. K., Rabus, R. & Saier, M. H. Jr Microbial genome analyses: comparative transport capabilities in eighteen prokaryotes. J. Mol. Biol. 301, 75–101 (2000).

61. Paulsen, I. T., Sliwinski, M. K., Nelissen, B., Goffeau, A. & Saier, M. H. Jr Unified inventory of established and putative transporters encoded within the complete genome of Saccharomyces cerevisiae. FEBS Lett. 430, 116–125 (1998).

62. Hirsch, R. E., Lewis, B. D., Spalding, E. P. & Sussman, M. R. A role for the AKT1 potassium channel in plant nutrition. Science 280, 918–921 (1998).

63. Slayman, C. L. & Slayman, C. W. Depolarization of the plasma membrane of Neurospora during active transport of glucose: evidence for a proton-dependent cotransport system. Proc. Natl Acad. Sci. USA 71, 1935–1939 (1974).

64. Ryan, C. A. & Pearce, G. Systemin: a polypeptide signal for plant defensive genes. Annu. Rev. Cell. Dev. Biol. 14, 1–17 (1998).

65. Eisen, J. A. & Hanawalt, P. C. A phylogenomic study of DNA repair genes, proteins, and processes. Mutat. Res. 435, 171–213 (1999).

66. Selby, C. P. & Sancar, A. Structure and function of transcription-repair coupling factor. Structural domains and binding properties. J. Biol. Chem. 270, 4882–4889 (1995).

67. Britt, A. B. Molecular genetics of DNA repair in higher plants. Trends Plant Sci. 4, 20–25 (1999).

68. Dangl, M. Response to Aravind, L. & Koonin, E. V. Second Family of Histone Deacetylases. Science 280, 1167 (1998).

69. Cao, X. et al. Conserved plant genes with similarity to mammalian de novo DNA methyltransferases. Proc. Natl Acad. Sci. USA 97, 4979–4984 (2000).

70. Henikoff, S. & Comai, L. A DNA methyltransferase homologue with a chromodomain exists in multiple polymorphic forms in Arabidopsis. Genetics 149, 307–318 (1998).

71. Hung, M.-S. et al. Drosophila proteins related to vertebrate DNA (5-cytosine) methyltransferases. Proc. Natl Acad. Sci. USA 96, 11940–11945 (1999).

72. Dalmay, T., Hamilton, A. J., Rudd, S., Angell, S. & Baulcombe, D. C. An RNA-dependent-RNA polymerase in Arabidopsis is required for post transcriptional gene silencing mediated by a transgene but not by a virus – the truth. Cell 101, 543–553 (2000).

73. Riechmann, J. L. & Ratcliffe, O. J. A genomic perspective on plant transcription factors. Curr. Opin. Plant Biol. 3, 423–434 (2000).

74. Liljegren, S. J. et al. SHATTERPROOF MADS-box genes control seed dispersal in Arabidopsis. Nature 404, 766–770 (2000).

75. Pelaz, S. et al. B and C floral organ identity functions require SEPALLATA MADS-box genes. Nature 405, 200–203 (2000).

76. Canaday, J., Stoppin-Mellet, V., Mutterer, J., Lambert, A. M. & Schmit, A. C. Higher plant cells: gamma-tubulin and microtubule nucleation in the absence of centrosomes. Microsc. Res. Technol. 49, 487–495 (2000).

77. Bassham, D. C. & Raikhel, N. V. Unique features of the plant vacuolar sorting machinery. Curr. Opin. Cell Biol. 12, 491–495 (2000).

78. Zheng, Z. L. & Yang, Z. The Rop GTPase switch turns on polar growth in pollen. Trends Plant Sci. 5, 298–303 (2000).

79. den Boer, B. G. & Murray, J. A. Triggering the cell cycle in plants. Trends Cell Biol. 10, 245–250 (2000).

80. Heese, M., Mayer, U. & Jurgens, G. Cytokinesis in flowering plants: cellular process and developmental integration. Curr. Opin. Plant Biol. 1, 486–491 (1998).

81. Meyerowitz, E. M. Plants, animals, and the logic of development. Trends Genet. 15, M65–M68 (1999).

82. Wang, D. Y. C. et al. Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. R. Soc. Lond. B 266, 163–171 (1999).

83. Torii, K. Receptor kinase activation and signal transduction in plants: an emerging picture. Curr. Opin. Plant Biol. 3, 362–367 (2000).

84. Yeh, K. C. & Lagarias, J. C. Eukaryotic phytochromes: Light-regulated serine/threonine protein kinases with histidine kinase ancestry. Proc. Natl Acad. Sci. USA 95, 13976–13981 (1998).

85. McCarty, D. R. & Chory, J. Conservation and innovation in plant signaling pathways. Cell 103, 201–211 (2000).

86. Weiss, C. A., Garnaat, C., Mukai, K., Hu, Y. & Ma, H. Molecular cloning of cDNAs from maize and Arabidopsis encoding a G protein beta subunit. Proc. Natl Acad. Sci. USA 91, 9554–9558 (1994).

87. Bowler, C. et al. Cyclic GMP and calcium mediate phytochrome phototransduction. Cell 77, 73–81 (1994).

88. Stepanova, A. & Ecker, J. R. Ethylene signaling: from mutants to molecules. Curr. Opin. Plant Biol. 3, 353–360 (2000).

89. Urao, T., Yamaguchi-Shinozaki, K. & Shinozaki, K. Two-component systems in plant signal transduction. Trends Plant Sci. 5, 67–74 (2000).

90. Makino, S. et al. Genes encoding pseudo-response regulators: Insight into His-to-Asp phosphorelay and circadian rhythm in Arabidopsis thaliana. Plant Cell Physiol. 41, 791–803 (2000).

91. D'Agostino, I. B. & Kieber, J. J. Phosphorelay signal transduction: the emerging family of plant response regulators. Trends Biochem. Sci. 24, 452–456 (1999).

92. Strayer, C. et al. Cloning of the Arabidopsis clock gene TOC1, an autoregulatory response regulator homologue. Science 289, 768–771 (2000).

93. Stahl, E. A. & Bishop, J. G. Plant-pathogen arms races at the molecular level. Curr. Opin. Plant Biol. 3, 299–304 (2000).

94. McDowell, J. M. & Dangl, J. L. Signal transduction in the plant innate immune response. Trends Biochem. Sci. 25, 79–82 (2000).

95. Van der Biezen, E. A. & Jones, J. D. Plant disease-resistance proteins and the gene-for-gene concept. Trends Biochem. Sci. 23, 454–456 (1998).

96. Belvin, M. P. & Anderson, K. V. A conserved signaling pathway: the Drosophila toll-dorsal pathway. Annu. Rev. Cell. Dev. Biol. 12, 393–416 (1996).

97. Uren, A. G. et al. Identification of paracaspases and metacaspases: Two ancient families of caspase-like proteins, one of which plays a key role in MALT lymphoma. Mol. Cell 6, 961–967 (2000).

98. Fankhauser, C. & Chory, J. Light control of plant development. Annu. Rev. Cell. Dev. Biol. 13, 203–229 (1997).

99. Briggs, W. R. & Huala, E. Blue-light photoreceptors in higher plants. Annu. Rev. Cell. Dev. Biol. 15, 33–62 (1999).

100. Christie, J. M., Salomon, M., Nozue, K., Wada, M. & Briggs, W. R. LOV (light, oxygen, or voltage) domains of the blue-light photoreceptor phototropin (nph1): binding sites for the chromophore flavin mononucleotide. Proc. Natl Acad. Sci. USA 96, 8779–8783 (1999).

101. Golbeck, J. H. Structure and function of photosystem I. Annu. Rev. Plant Physiol. Plant Mol. Biol. 43, 293–324 (1992).

102. Maier, R. M., Neckermann, K., Igloi, G. L. & Kossel, H. Complete sequence of the maize chloroplast genome: gene content, hotspots of divergence and fine tuning of genetic information by transcript editing. J. Mol. Biol. 251, 614–628 (1995).

103. Buchanan, B. B., Gruissem, W. & Jones, R. L. in Biochemistry and Molecular Biology of Plants 1367 (Am. Soc. Plant Physiol., Rockville, Maryland, 2000).

104. Mekhedov, S., Martínez de Ilárduya, O. & Ohlrogge, J. Toward a functional catalog of the plant genome. A survey of genes for lipid biosynthesis. Plant Physiol. 122, 389–401 (2000).

105. Somerville, C. R. & Ogren, W. L. Photorespiration deficient mutants of Arabidopsis thaliana lacking mitochondrial serine transhydroxymethylase activity. Plant Physiol. 67, 666–671 (1981).

106. Richmond, T. & Somerville, C. R. The cellulose synthase superfamily. Plant Physiol. 124, 495–499 (2000).

107. Carpita, N. & Vergara, C. A recipe for cellulose. Science 279, 672–673 (1998).

108. De Vries, H. Sur la loi de disjonction des hybrides. C. R. Acad. Sci. Paris 130, 845–847 (1900).

109. Alonso-Blanco, C. & Koornneef, M. Naturally occurring variation in Arabidopsis: an underexploited resource for plant genetics. Trends Plant Sci. 5, 22–29 (2000).

110. Chory, J. Functional genomics and the virtual plant. A blueprint for understanding how plants are built and how to improve them. Plant Physiol. 123, 423–425 (2000).

111. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).

112. Lukashin, A. V. & Borodovsky, M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 26, 1107–1115 (1998).

113. Uberbacher, E. C. & Mural, R. J. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl Acad. Sci. USA 88, 11261–11265 (1991).

114. Salzberg, S. L., Pertea, M., Delcher, A. L., Gardner, M. J. & Tettelin, H. Interpolated Markov models for eukaryotic gene finding. Genomics 59, 24–31 (1999).

115. Hebsgaard, S. M. et al. Splice site prediction in Arabidopsis thaliana DNA by combining local and global sequence information. Nucleic Acids Res. 24, 3439–3452 (1996).

116. Brendel, V. & Kleffe, J. Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA. Nucleic Acids Res. 26, 4748–4757 (1998).

117. Quackenbush, J., Liang, F., Holt, I., Pertea, G. & Upton, J. The TIGR gene indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Res. 28, 141–145 (2000).

118. Huang, X., Adams, M. D., Zhou, H. & Kerlavage, A. R. A tool for analyzing and annotating genomic sequences. Genomics 46, 37–45 (1997).

119. Altschul, S. F. et al. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

120. Morgenstern, B. DIALIGN2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211–218 (1999).

121. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995).

122. Emanuelsson, O., Nielsen, H., Brunak, S. & von Heijne, G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300, 1005–1016 (2000).

Supplementary information is available on Nature's World-Wide Web site (http://www.nature.com) or as paper copy from the London editorial office of Nature.

Acknowledgements

This work was supported by the National Science Foundation (NSF) Cooperative Agreements (funded by the NSF, the US Department of Agriculture (USDA) and the US Department of Energy (DOE)), the Kazusa DNA Research Institute Foundation, and by the European Commission. Additional support from the USDA, Ministère de la Recherche, GSF-Forschungszentrum f. Umwelt u. Gesundheit, BMBF (Bundesministerium f. Bildung, Forschung und Technologie), the BBSRC (Biotechnology and Biological Sciences Research Council) and the Plant Research International, Wageningen, is also gratefully acknowledged. The authors wish to thank E. Magnien, D. Nasser and J. D. Watson for their continual support and encouragement.

Correspondence and requests for materials should be addressed to The Arabidopsis Genome Initiative (e-mail: [email protected] or [email protected]).

The Arabidopsis Genome Initiative

Three groups contributed to the work reported here. The Genome Sequencing groups, arranged here in order of sequence contribution, sequenced and annotated assigned chromosomal regions. The Genome Analysis group carried out the analyses described. The Contributing Authors interpreted the genome analyses, incorporating other data and analyses, with respect to selected biological topics.


Genome Sequencing Groups

Samir Kaul, Hean L. Koo, Jennifer Jenkins, Michael Rizzo, Timothy Rooney, Luke J. Tallon, Tamara Feldblyum, William Nierman, Maria-Ines Benito, Xiaoying Lin, Christopher D. Town, J. Craig Venter & Claire M. Fraser

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA

Satoshi Tabata, Yasukazu Nakamura, Takakazu Kaneko, Shusei Sato, Erika Asamizu, Tomohiko Kato, Hirokazu Kotani & Shigemi Sasamoto

Kazusa DNA Research Institute, 1532-3 Yana, Kisarazu, Chiba 292, Japan

Joseph R. Ecker1*†, Athanasios Theologis2*, Nancy A. Federspiel3*†, Curtis J. Palm3, Brian I. Osborne2, Paul Shinn1, Aaron B. Conway3, Valentina S. Vysotskaia2, Ken Dewar1, Lane Conn3, Catherine A. Lenz2, Christopher J. Kim1, Nancy F. Hansen3, Shirley X. Liu2, Eugen Buehler1, Hootan Altafi3, Hitomi Sakano2, Patrick Dunn1, Bao Lam3, Paul K. Pham2, Qimin Chao1, Michelle Nguyen3, Guixia Yu2, Huaming Chen1, Audrey Southwick3, Jeong Mi Lee2, Molly Miranda3, Mitsue J. Toriumi2 & Ronald W. Davis3

1, Plant Science Institute, Department of Biology, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; 2, Plant Gene Expression Center/USDA-U.C. Berkeley, 800 Buchanan Street, Albany, California 94710, USA; 3, Stanford Genome Technology Center, 855 California Avenue, Palo Alto, California 94304, USA. * These authors contributed equally to this work. † Present addresses: The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, California 92037, USA (J.R.E.); Exelixis, Inc., 170 Harborway, P.O. Box 511, South San Francisco, California 94083-0511, USA (N.A.F.)

European Union Chromosome 4 and 5 Sequencing Consortium: R. Wambutt1, G. Murphy2, A. Düsterhöft3, W. Stiekema4, T. Pohl5, K.-D. Entian6, N. Terryn7 & G. Volckaert8

1, AGOWA GmbH, Glienicker Weg 185, D-12489 Berlin, Germany; 2, John Innes Centre, Colney Lane, Norwich NR4 7UH, UK; 3, QIAGEN GmbH, Max-Volmer-Str. 4, D-40724 Hilden, Germany; 4, Greenomics, Plant Research International, Droevendaalsesteeg 1, NL 6700 AA Wageningen, The Netherlands; 5, GATC GmbH, Fritz-Arnold Strasse 23, D-78467 Konstanz, Germany; 6, SRD GmbH, Oberurseler Str. 43, Oberursel 61440, Germany; 7, Department for Plant Genetics, (VIB), University of Gent, K.L. Ledeganckstraat 35, B-9000 Gent, Belgium; 8, Katholieke Universiteit Leuven, Laboratory of Gene Technology, Kardinaal Mercierlaan 92, B-3001 Leuven, Belgium

European Union Chromosome 3 Sequencing Consortium: M. Salanoubat1, N. Choisne1, M. Rieger2, W. Ansorge3, M. Unseld4, B. Fartmann5, G. Valle6, F. Artiguenave1, J. Weissenbach1 & F. Quetier1

1, Genoscope and CNRS FRE2231, 2 rue G. Crémieux, 91057 Evry Cedex, France; 2, Genotype GmbH, Angelhofweg 39, D-69259 Wilhelmsfeld, Germany; 3, European Molecular Biology Laboratory, Biochemical Instrumentation Program, Meyerhofstr. 1, D-69117 Heidelberg, Germany; 4, LION Bioscience AG, Im Neuenheimer Feld 515-517, 69120 Heidelberg, Germany; 5, MWG-Biotech AG, Anzinger Strasse 7a, 85560 Ebersberg, Germany; 6, CRIBI, Università di Padova, via G. Colombo 3, Padova 35131, Italy

The Cold Spring Harbor and Washington University Genome Sequencing Center Consortium: Richard K. Wilson1, Melissa de la Bastide2, M. Sekhon1, Emily Huang2, Lori Spiegel2, Lidia Gnoj2, K. Pepin1, J. Murray1, D. Johnson1, Kristina Habermann2, Neilay Dedhia2, Larry Parnell2, Raymond Preston2, L. Hillier1, Ellson Chen3, M. Marra2, Robert Martienssen4 & W. Richard McCombie2

1, Washington University Genome Sequencing Center, Washington University in St Louis School of Medicine, 4444 Forest Park Blvd., St. Louis, Missouri 63108, USA; 2, Lita Annenberg Hazen Genome Center, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA; 3, Celera Genomics, 850 Lincoln Center Drive, Foster City, California 94494, USA; 4, Plant Biology Group, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA

Genome Analysis Group

Klaus Mayer1*, Owen White2*, Michael Bevan3, Kai Lemcke1, Todd H. Creasy2, Cord Bielke2, Brian Haas1, Dirk Haase1, Rama Maiti2, Stephen Rudd1, Jeremy Peterson2, Heiko Schoof1, Dimitrij Frishman1, Burkhard Morgenstern1, Paulo Zaccaria1, Maria Ermolaeva2, Mihaela Pertea2, John Quackenbush2, Natalia Volfovsky2, Dongying Wu2, Todd M. Lowe4, Steven L. Salzberg2 & Hans-Werner Mewes1

1, GSF-Forschungszentrum f. Umwelt u. Gesundheit, Munich Information Center for Protein Sequences, am Max-Planck-Institut f. Biochemie, Am Klopferspitz 18a, D-82152, Germany; 2, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA; 3, Molecular Genetics Department, John Innes Centre, Colney Lane, Norwich NR4 7UH, UK; 4, Dept Genetics, Stanford University Medical School, Stanford, California 94305-5120, USA. * These authors contributed equally to this work

Contributing Authors

Comparative analysis of the genomes of A. thaliana accessions. S. Rounsley, D. Bush, S. Subramaniam, I. Levin & S. Norris

Cereon Genomics LLC, 45 Sidney St, Cambridge, Massachusetts 02139, USA

Comparative analysis of the genomes of A. thaliana and other genera. R. Schmidt1, A. Acarkan1 & I. Bancroft2

1, Max-Delbrück-Laboratorium in der Max-Planck-Gesellschaft, Carl-von-Linné-Weg 10, 50829 Cologne, Germany; 2, Brassicas and Oilseeds Research Department, John Innes Centre, Norwich NR4 7UJ, UK


Integration of the three genomes in the plant cell: the extent of protein and nucleic acid traffic between nucleus, plastids and mitochondria. F. Quetier1, A. Brennicke2 & J. A. Eisen3

1, Genoscope, Centre Nationale de Sequencage, 2 rue Gaston Cremieux, CP 5706, 91057 Evry Cedex, France; 2, Molekulare Botanik, Universität Ulm, 89069 Ulm, Germany; 3, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA

Transposable elements. T. Bureau1, B.-A. Legault1, Q.-H. Le1, N. Agrawal1, Z. Yu1 & R. Martienssen2

1, McGill University, Dept of Biology, 1205 rue Dr Penfield, Montreal, Quebec, H3A 1B1, Canada; 2, Plant Biology Group, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA

rDNA, telomeres and centromeres. G. P. Copenhaver1, S. Luo1, C. S. Pikaard2 & D. Preuss1

1, Howard Hughes Medical Institute, The University of Chicago, 1103 East 57th Street, Chicago, Illinois, USA; 2, Biology Department, Washington University in St Louis, St Louis, Missouri 63130, USA

Membrane transport. I. T. Paulsen1 & M. Sussman2

1, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA; 2, University of Wisconsin Biotechnology Center, 425 Henry Mall, Madison, Wisconsin 53706, USA

DNA repair and recombination. A. B. Britt1 & J. A. Eisen2

1, Section of Plant Biology, University of California, Davis, California 95616, USA; 2, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA

Gene regulation. D. A. Selinger1, R. Pandey1, D. W. Mount2, V. L. Chandler1, R. A. Jorgensen1 & C. Pikaard3

1, Department of Plant Sciences, University of Arizona, 303 Forbes Hall; and 2, Department of Molecular and Cellular Biology, University of Arizona, Tucson, Arizona 85721, USA; 3, Biology Department, Washington University in St Louis, St Louis, Missouri 63130, USA

Cellular organization. G. Juergens

Entwicklungsgenetik, ZMBP-Centre for Plant Molecular Biology, auf der Morgenstelle 1, Tuebingen D-72076, Germany

Development. E. M. Meyerowitz

Division of Biology, California Institute of Technology, Pasadena, California 91125, USA

Signal transduction. J. R. Ecker1 & A. Theologis2

1, The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, California 92037, USA; 2, Plant Gene Expression Center/USDA-UC Berkeley, 800 Buchanan Street, Albany, California 94710, USA

Recognition of and response to pathogens. J. Dangl1 & J. D. G. Jones2

1, Biology Department, Coker Hall, University of North Carolina, Chapel Hill, North Carolina 27599, USA; 2, Sainsbury Laboratory, John Innes Centre, Colney Lane, Norwich NR4 7UJ, UK

Photomorphogenesis and photosynthesis. M. Chen & J. Chory

Howard Hughes Medical Institute and Plant Biology Laboratory, The Salk Institute, 10010 North Torrey Pines Road, La Jolla, California 92037, USA

Metabolism. C. Somerville

Carnegie Institution, 260 Panama Street, Stanford, California 94305, USA

© 2000 Macmillan Magazines Ltd