Transcriptome analysis of drought-tolerant CAM plants, Agave deserti and Agave tequilana

Stephen M. Gross 1,2, Jeffrey A. Martin 1,2, June Simpson 3, Zhong Wang 1,2, and Axel Visel 1,2
1. DOE Joint Genome Institute, Walnut Creek, CA
2. Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA
3. CINVESTAV, Irapuato, MX

ABSTRACT
Agaves are succulent monocotyledonous plants native to hot and arid environments of North America. Because of their adaptations to this environment, including crassulacean acid metabolism (CAM, a water-efficient form of photosynthesis), and because ethanol production technologies already exist, agaves have gained attention both as potential lignocellulosic bioenergy feedstocks and as models for exploring plant responses to abiotic stress. However, the lack of comprehensive Agave sequence datasets limits the scope of investigations into the molecular-genetic basis of Agave traits. Here, we present comprehensive, high-quality de novo transcriptome assemblies of two Agave species, A. tequilana and A. deserti, from short-read RNA-seq data. Our analyses support the completeness and accuracy of the de novo transcriptome assemblies, with each species having approximately 35,000 protein-coding genes. Comparison of agave proteomes to those of additional plant species identifies biological functions of gene families displaying sequence divergence in agave species. Additionally, we use RNA-seq data to gain insights into biological functions along the A. deserti juvenile leaf proximal-distal axis. Our work presents a foundation for further investigation of agave biology and for the improvement of agaves for bioenergy development.

[Figure 4 panels. (A) BLAST one-to-one RBH protein comparison: 14,709 shared, 20,377 A. deserti only, 20,161 A. tequilana only. (B) OrthoMCL protein family comparison: 13,388 shared, 2835 A. deserti only, 2948 A. tequilana only. (C) OrthoMCL Plant OGs per species: B. distachyon 16618, O. sativa 19643, S. bicolor 18218, Z. mays 20681, A. deserti 16223, A. tequilana 16336.]

[Figure 5 panels. (A) Leaf sections 1–4, from proximal (base) to distal (tip). (B) 79645 of 88718 loci clustered; y-axis: z-scaled RPKM over sections 1–4. Proximal expression: Cluster A (22249 loci), Cluster B (12063 loci). Medial expression: Cluster C (11789 loci), Cluster D (7539 loci). Distal expression: Cluster E (17579 loci), Cluster F (8426 loci). Enriched GO terms shown: photosynthesis; regulation of gene expression; translation; cellular protein modification; DNA metabolism; vesicle-mediated transport. (C, D) Heatmap scale: relative composite RPKM value normalized to section with highest expression (0.0–1.0). Row groups: transcription factors (Class I, Class II, MADS-box, GRAS, YABBY, MYB, bHLH, Zn finger, KNOX); hormones (auxin biosynthesis and transport; CK, GA, ETH, BR, and ABA biosynthesis); cell wall & stomata development (cell wall biosynthesis, cellulose biosynthesis, lignin biosynthesis, stomata development, cutin & suberin biosynthesis); photosynthesis (antenna proteins, photosystem II, photosystem I, cytochrome b6f & ATP synthase, Calvin cycle, C4 dicarboxylic acid cycles (dark, light, CAM), chlorophyll biosynthesis).]

[Figure 2 panels. (C) x-axis: number of reads; y-axis: probability of observing a unique 25-mer; by species dataset. (D) x-axis: contig GC content; y-axis: density; A. tequilana and A. deserti contigs vs. removed contigs. (E) x-axis: transcript length (nt); y-axis: density; by species. (F) x-axis: locus RPKM; y-axis: density; by locus coding potential (coding, non-coding).]

[Figure 3 panels. (A) PacBio subreads || Rnnotator (n = 4767); GenBank || Rnnotator (n = 82); McKain et al. || Rnnotator (n = 12,972). (B) A. deserti || A. tequilana and A. tequilana || A. deserti; axis: number of transcripts (thousands). (C) Fraction of A. tequilana transcript aligning to A. deserti transcript, and the reverse, against number of transcripts (thousands). Alignment classes: containment (one sequence contains the other), overlap, unaligned. Alignment properties: indels, no indels.]

AGAVE TRANSCRIPTOME ASSEMBLIES FROM DEEP RNA-seq

COMPARISON OF AGAVE DE NOVO ASSEMBLIES

FIGURE 3: Comparison of the de novo Agave transcriptome assemblies. (A) Comparisons of the A. tequilana de novo Rnnotator assembly to error-corrected Pacific Biosciences subreads, 82 GenBank A. tequilana sequences, and an additional A. tequilana dataset from McKain et al. 2012.
[4] (B) Comparisons between the A. tequilana and A. deserti de novo Rnnotator assemblies. (C) Histograms of the fraction of aligned sequence lengths between A. deserti and A. tequilana. The symbol || separates the query sequence dataset from the subject sequence dataset. The total number of sequences (n) is noted in each bar chart; the number of sequences in each alignment class is noted above its bar.

FIGURE 4: Proteomic comparison of agaves to other plant species. (A) Venn diagram of BLASTP-based one-to-one reciprocal best hit proteins shared between A. deserti and A. tequilana. (B) Venn diagram of OrthoMCL-defined protein families shared between agaves. (C) Edwards-Venn diagram of OrthoMCL-defined plant orthologous-group protein families (Plant OGs) shared between agave and 4 additional monocotyledonous plant species. The shape and color used for each species are shown at the right, with the total number of Plant OGs within each species.

FIGURE 5: Transcriptomic analysis of the A. deserti leaf proximal-distal axis. (A) One of the A. deserti leaves used for analysis, indicating proximal-distal (PD) sections 1–4. (B) Six major K-means clusters of gene expression along the PD axis. Clusters are manually grouped by highest expression in proximal, medial, or distal tissues. Blue lines connect mean z-scaled RPKM values, shaded areas represent the 25th and 75th percentiles, and red lines indicate the standard error at each mean. Green text beneath each cluster gives the description of the most significantly enriched GO term in that cluster. (C, D) Heatmaps of composite gene expression for the indicated biological processes along the leaf PD axis.

PROTEOMIC ANALYSES SUPPORT COMPREHENSIVE AGAVE TRANSCRIPTOME ASSEMBLIES

PROFILING OF THE A. DESERTI LEAF HIGHLIGHTS REGIONS CRITICAL TO DEVELOPMENT AND PHOTOSYNTHESIS

FIGURE 2: A. tequilana, A. deserti, and their respective transcriptomes. (A) Cultivated A. tequilana in Jalisco, Mexico. (B) A. deserti (foreground) in natural habitat, Riverside County, California, USA.
(C) Plot of the fraction of unique 25-mers over indicated read depth (log2 scale). (D) Density plot of GC content of agave transcript contigs vs. contigs from contaminating and commensal organisms. (E) Density plots of A. deserti and A. tequilana transcript lengths. Note the log10 scale. Peaks at 150 and 250 nt represent single reads or paired-end reads, respectively, that were not assembled into larger contigs. (F) Density plot of locus RPKM values for coding (dark shading) and non-coding (light shading) loci.

OVERVIEW OF AGAVE TRANSCRIPTOME ASSEMBLIES
Species                    A. tequilana    A. deserti
Total sequencing           293.5 Gbp       184.7 Gbp
No. of loci                139,525         88,718
No. transcript contigs     204,530         128,869
N50 length                 1387 bp         1323 bp
Sum assembled length       204.9 Mbp       125.0 Mbp
No. protein-coding loci    34,870          35,086

CAM PHOTOSYNTHESIS, ARID ENVIRONMENTS, AND BIOENERGY
Agave species are adapted to their native habitat in arid regions of Mexico and the United States. Agave thus holds promise as a biofuel feedstock [1,2], capable of growing on marginal lands where other proposed bioenergy plants cannot. The ability of agaves to withstand hot and arid conditions relies upon crassulacean acid metabolism (CAM), a specialized form of photosynthesis that allows agaves to keep leaf stomata (pores) closed during the hot day, minimizing water loss through evapotranspiration.

[Figure 1 panels. (A) Photograph. (B) Map legend: Agave; semi-arid regions. (C) CAM diagram labels: NIGHT, DAY, CO2, C3, C4, vacuole, chloroplast, Calvin cycle, light, sugar.]

FIGURE 1: Agaves and CAM biology. (A) Agave tequilana cultivated in Mexico. (B) Semi-arid regions of the United States (brown) are unsuitable for cultivation of other bioenergy plants, which require more temperate regions (green). Most Agave species are adapted to semi-arid regions in Mexico and the extreme southwestern USA (purple). (C) Crassulacean acid metabolism (CAM). CO2 enters plant cells at night, joins with a 3-carbon molecule (C3), and is stored in the vacuole as a 4-carbon molecule (C4).
During the day, C4 molecules diffuse out of the vacuole, and CO2 is released and assimilated into sugar in the chloroplast.

Comparison of inputs (water and nitrogen) and outputs (biomass and ethanol) of agaves and other biofuel feedstock species. Though agaves are harvested at several years of age, their annualized growth rate is on par with Miscanthus. Table is modified from reference [2].

                  Inputs                                                            Outputs
Feedstock         Water (cm yr⁻¹)   Drought tolerance   Nitrogen (kg ha⁻¹ yr⁻¹)   Dry biomass (Mg ha⁻¹ yr⁻¹)   Ethanol (liters yr⁻¹)
Corn grain        50–80             low                 90–120                    7–10                         2900
Corn stover       50–80             low                 90–120                    3–6                          900
Miscanthus        75–120            low                 0–15                      15–40                        4600–12,400
Poplar coppice    70–105            moderate            0–50                      5–11                         1500–3400
Agave spp.        30–80             high                0–12                      10–34                        3000–10,500

ACKNOWLEDGEMENTS AND CITATIONS
This work performed at the U.S. Department of Energy Joint Genome Institute was supported in part by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH112.
[1] Davis, A. S. et al. The global potential for Agave as a biofuel feedstock. GCB Bioenergy 3, 68–78 (2011).
[2] Somerville, C. et al. Feedstocks for lignocellulosic biofuels. Science 329, 790–792 (2010).
[3] Martin, J. et al. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-seq reads. BMC Genomics 11, 663 (2010).
[4] McKain, M. et al. Phylogenomic analysis of transcriptome data elucidates co-occurrence of a paleopolyploid event and the origin of bimodal karyotypes in Agavoideae (Asparagaceae). Am J Bot 99(2), 397–406 (2012).

To provide sequence resources for the Agave research community, we built de novo transcriptomes of Agave tequilana and Agave deserti from deep Illumina RNA-seq data. Sequences were assembled by Rnnotator [3], a de novo transcriptome assembly pipeline.

Analysis of assembled contigs suggests the Agave de novo assemblies are comprehensive and accurate.
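Summary statistics of the kind reported for these assemblies, such as the N50 length and contig GC content, can be computed directly from contig sequences. The following is a minimal Python sketch of the two definitions, not the Rnnotator implementation; the function names and the toy contigs are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def gc_fraction(seq):
    """Return the fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


if __name__ == "__main__":
    # Hypothetical toy contigs, not sequences from the Agave assemblies.
    toy_contigs = ["ATGCGC", "ATAT", "GGGCCCAT", "AT"]
    lengths = [len(contig) for contig in toy_contigs]
    print("N50:", n50(lengths))  # N50: 6
    for contig in toy_contigs:
        print(contig, gc_fraction(contig))
```

In the toy example the contigs sum to 20 bases; the two longest (8 and 6 bases) already cover at least half of that total, so the N50 is 6.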
Proteome comparisons between Agave species and additional monocot species suggest that the majority of Agave proteins are conserved across taxa. We can also identify protein families specific to agaves.

Agaves spend the majority of their lives as compact rosettes; leaves are therefore important organs in which to study Agave developmental and bioenergetic processes.
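Expression along the leaf proximal-distal axis is reported as RPKM, z-scaled per locus for clustering and scaled to the highest-expressing section in the heatmaps (Figure 5). The sketch below illustrates these three transformations in Python. It assumes the standard RPKM definition and the sample standard deviation for z-scaling, since the poster does not state which variant was used; the function names and read counts are hypothetical.

```python
from statistics import mean, stdev


def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def z_scale(values):
    """Center values on their mean and divide by the sample standard deviation.
    Assumption: sample (n - 1) standard deviation; a flat profile maps to zeros."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def relative_to_max(values):
    """Scale values so that the section with the highest expression equals 1.0."""
    top = max(values)
    if top == 0:
        return [0.0 for _ in values]
    return [v / top for v in values]


if __name__ == "__main__":
    # Hypothetical read counts for one 2000-nt locus in leaf sections 1-4,
    # with 10 million mapped reads per section.
    counts = [100, 200, 400, 800]
    profile = [rpkm(c, 2000, 10_000_000) for c in counts]
    print(profile)                   # [5.0, 10.0, 20.0, 40.0]
    print(relative_to_max(profile))  # [0.125, 0.25, 0.5, 1.0]
    print(z_scale(profile))
```

Z-scaling puts loci with very different absolute expression on a common scale, so that clustering groups them by the shape of their profile along the axis rather than by overall abundance.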