Genomic Sequence Diversity and Population Structure of Saccharomyces cerevisiae Assessed by RAD-seq Gareth A. Cromie 1* , Katie E. Hyma 2* , Catherine L. Ludlow 1 , Cecilia Garmendia-Torres 1 , Teresa L Gilbert ‡ , Patrick May 1,3 , Angela A. Huang 4 , Aimée M. Dudley 1,5 , Justin C. Fay 6 1 Institute for Systems Biology, Seattle, WA, USA 2 Bioinformatics Facility (CBSU), Institute for Biotechnology, Cornell University, Ithaca, NY, USA 3 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg 4 University of Pennsylvania, PA, USA 5 Corresponding author 6 Department of Genetics, Washington University, St. Louis, MO, USA * These authors contributed equally to this work. ‡ Deceased Correspondence: Aimée M. Dudley ([email protected]) Institute for Systems Biology 401 Terry Avenue North Seattle, WA 98109 Tel: (206) 732-1214 Fax: (206) 732-1299
NC_001133.8, NC_001133.8, NC_001133.8, NC_001133.8) and generating consensus
reduced-genome sequences for each strain. The tagged reads were split into strain
pools by their 4-base prefix barcodes. Reads with Ns or with Phred quality scores less
than 20 in the barcode sequence were removed. Any reads with more than 2 Ns outside
the barcode were also removed. Reads were aligned to the S288c reference using BWA
(V0.5.8, [32]) with 6 or fewer mismatches tolerated. Samtools (V0.1.8, [33]) was then
used to generate a pileup from the aligned reads using the “pileup” command and the
“-c” parameter. Base calls were retained if they had a consensus quality greater than 20.
Positions with root mean squared (RMS) mapping qualities less than 15 and
insertion/deletion polymorphisms were ignored. After filtering there was an average of
209,765 bp for each strain. Sequences from each strain were combined into a multiple
sequence alignment via their common alignment to the S288c genome. Sites with more
than 10% missing data were removed, resulting in a multiple sequence alignment of
116,880 base pairs.
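The read-level filters described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the Phred+33 quality encoding, exact barcode matching, and trimming of the barcode after assignment are all assumptions.

```python
from collections import defaultdict

BARCODE_LEN = 4      # 4-base prefix barcode, as described in the text
MIN_BARCODE_Q = 20   # reads with a barcode base below this Phred score are removed
MAX_N_OUTSIDE = 2    # reads with more than this many Ns outside the barcode are removed


def keep_read(seq, qual, phred_offset=33):
    """Return True if a read passes the barcode and N-content filters.

    phred_offset=33 is an assumption; older Illumina output used an offset of 64.
    """
    if len(seq) < BARCODE_LEN or len(qual) != len(seq):
        return False
    if "N" in seq[:BARCODE_LEN].upper():
        return False
    if any(ord(c) - phred_offset < MIN_BARCODE_Q for c in qual[:BARCODE_LEN]):
        return False
    return seq[BARCODE_LEN:].upper().count("N") <= MAX_N_OUTSIDE


def split_by_barcode(reads, barcode_to_strain, phred_offset=33):
    """Group passing reads by strain and trim the barcode (trimming is assumed)."""
    pools = defaultdict(list)
    for seq, qual in reads:
        if not keep_read(seq, qual, phred_offset):
            continue
        # Exact barcode matching is an assumption; unknown barcodes are skipped.
        strain = barcode_to_strain.get(seq[:BARCODE_LEN].upper())
        if strain is not None:
            pools[strain].append((seq[BARCODE_LEN:], qual[BARCODE_LEN:]))
    return dict(pools)
```

Here `reads` is any iterable of (sequence, quality string) pairs and `barcode_to_strain` is a hypothetical dictionary mapping each 4-base barcode to a strain name.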
Whole Genome Sequencing Alignment
Previously generated whole genome sequences (WGS) were incorporated into the
RAD-seq dataset for population genetic analysis. For genomes with an S288c NCBI
coordinate system, sequences were extracted directly based on S288c reference
coordinates. For genomes using an alternative coordinate system (SGRP), BLAT was
used to convert from the S288c NCBI reference coordinates to the alternative coordinate
system prior to extracting sequences. For assembled genomes without S288c
alignments, coordinates were obtained by BLAST. A FASTA file of the S288c reference
sequence was generated for each contiguous segment in the multiple sequence
alignment. The resulting files were used to query each genome assembly using BLAST.
When quality scores were available, sites with sequence quality less than 20 were
converted to “N” either prior to the BLAST search or following sequence retrieval.
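The quality masking step can be illustrated with a short sketch. The function name and the representation of quality scores as a list of integers are assumptions made for illustration; only the threshold of 20 comes from the text.

```python
def mask_low_quality(seq, quals, min_quality=20):
    """Replace bases whose quality is below min_quality with 'N'.

    seq is a string of bases and quals a same-length sequence of integer
    quality scores (this input layout is assumed, not taken from the study).
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality scores differ in length")
    return "".join(
        base if q >= min_quality else "N" for base, q in zip(seq, quals)
    )
```

For example, `mask_low_quality("ACGT", [30, 19, 20, 5])` keeps the first and third bases and masks the other two.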
Duplicated Strains
Some strains were sequenced by both WGS and RAD-seq. For duplicate strains with
pairwise divergence less than 0.0005 substitutions per site, excluding singleton alleles
(i.e., found in only 1 strain), only the RAD-seq data were retained for analysis. For
duplicate strains that exceeded the threshold, both RAD-seq and WGS data were
retained and strain names were labeled with an "r" and "g", respectively.
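One possible reading of the duplicate-strain rule is sketched below. The text does not spell out how singleton alleles were excluded, so the treatment here is an assumption: a difference is ignored when either allele at that site occurs in only one sequence of the alignment, and such sites still count toward the number of compared sites. The function names and the dictionary layout of the alignment are hypothetical.

```python
from collections import Counter

THRESHOLD = 0.0005  # substitutions per site, from the text


def divergence_excluding_singletons(alignment, name_a, name_b, missing="N"):
    """Pairwise divergence between two aligned sequences, ignoring singletons.

    alignment maps sequence names to equal-length aligned strings.
    """
    seq_a, seq_b = alignment[name_a], alignment[name_b]
    compared = differences = 0
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a == missing or b == missing:
            continue  # skip sites not called in both sequences
        compared += 1
        if a != b:
            counts = Counter(s[i] for s in alignment.values() if s[i] != missing)
            # Count the difference only if neither allele is a singleton.
            if counts[a] > 1 and counts[b] > 1:
                differences += 1
    return differences / compared if compared else float("nan")


def retain_both_datasets(alignment, rad_name, wgs_name):
    """True if both RAD-seq and WGS data would be kept for a duplicated strain.

    The text does not say what happens at exactly the threshold; >= is assumed.
    """
    d = divergence_excluding_singletons(alignment, rad_name, wgs_name)
    return d >= THRESHOLD
```

When `retain_both_datasets` returns False, only the RAD-seq sequence would be kept under this reading.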
Population Analysis
Neighbor-joining phylogenetic tree construction was carried out using MEGA [34]
(V5.0), based on P-distance with pairwise deletion. Population structure was inferred
using InStruct [26]. Because InStruct failed to converge using all sites, it was instead run
on 759 sites with allele frequency greater than or equal to 10%. Polymorphic sites were
made biallelic by treating third alleles as missing data. InStruct was run with the
parameters "-u 40000 -b 20000 -t 10 -c 10 -sl 0.95 -a 0 -g 1 -r 1000 -p 2 -v 2" with K
(number of populations) ranging from 3 to 15. While the lowest deviance information
criterion (DIC) was obtained from a chain with K = 15, there was substantial variation
among independent chains. We chose K = 9 as the optimal model to work with based on
the average DIC for K = 10 being nearly identical to that of K = 9 and subsequent drops
in DIC for larger values of K being small compared to the standard deviation in DIC
among chains (Table S3). Consensus population assignments for K = 8, 9 and 10 were
obtained for the five chains with the highest likelihood using CLUMPP (V1.1.2) [35] with
parameters “-m 3 -w 0 -s 2” and with greedy option = 2 and repeats = 10,000. The
similarity among the five chains (H') was 0.995 for K = 9. Compared to K = 9,
populations 6 (African, S. E. Asia/Palm, Cacao, Fruit) and 7 (Israel/Soil) were merged for
K = 8, and a new population was inferred within populations 3 (Asian/Food, Drink) and 6
(African, S. E. Asia/Palm, Cacao, Fruit) for K = 10 (Figure S1). Multidimensional scaling
was performed on all 5,868 sites and 262 strains using the identity by state distance
between each pair of strains and the "cmdscale" function in R with three dimensions.
Hierarchical clustering of either sites or strains was performed using the "hclust" function
in R with complete linkage and the Euclidean distance of identity by state.
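The p-distance with pairwise deletion that underlies the neighbor-joining tree can be illustrated as follows. This sketch is not the MEGA implementation: the function names are hypothetical, and treating every character other than A, C, G and T (including ambiguity codes) as missing is an assumption.

```python
def p_distance(seq_a, seq_b, valid="ACGT"):
    """Proportion of differing sites among sites called in both sequences.

    Pairwise deletion: a site is dropped only for the pair in which at
    least one of the two sequences lacks a valid base at that site.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in valid or b not in valid:
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else float("nan")


def distance_matrix(alignment):
    """Symmetric matrix of p-distances for a dict of name -> aligned sequence."""
    names = list(alignment)
    matrix = [[0.0] * len(names) for _ in names]
    for i, name_i in enumerate(names):
        for j in range(i + 1, len(names)):
            d = p_distance(alignment[name_i], alignment[names[j]])
            matrix[i][j] = matrix[j][i] = d
    return names, matrix
```

Because sites are dropped per pair rather than across the whole alignment, each pairwise distance can be based on a different number of sites.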
Acknowledgements
We thank Meridith Blackwell, Andreas Hellström, Eviatar Nevo, Mat Goddard and Lene Jesperson for providing strains. We thank Eric Jeffery for help with the manuscript, Adrian Scott for help with the figures, and Scott Bloom for assistance with Illumina sequencing. A.M.D. is funded by a strategic partnership between ISB and the University of Luxembourg. J.C.F. is supported by National Institutes of Health grant GM080669.
Figure Legends
Figure 1. Neighbor-joining tree of the 262 S. cerevisiae strains based on multiple alignment of 116,880 bases. Branch lengths are proportional to sequence divergence measured as P-distance. Scale bar indicates 5 polymorphisms/10 kb of sequence. Geographical and environmental clusters of strains are named and are indicated by black-outlined/grey-filled ovals. Colored ovals with numbering refer to strain populations identified in Figure 2. Seven strains widely used in the laboratory are labeled.
Figure 2. Clustered genotypes with inferred population structure and membership. Sites were clustered by complete hierarchical clustering using the Euclidean distance of allele sharing (identity by state). Strains were grouped by population structure and memberships inferred using InStruct. Minor alleles are shown in red, heterozygous sites in yellow, common alleles in black, and missing data in gray. Populations are labeled by the most common source and/or geographic location from which they were originally isolated.
Figure 3. Coincidence of admixture between pairs of populations. Each bar shows the number of strains with at least 20% ancestry from a reference population (bar labels) and at least 20% ancestry from another population (indicated by color in the legend). For comparison, grey filled circles show the number of strains with more than 80% ancestry from each population.
Figure 4. Relatedness among strains and the inferred populations to which they belong. The first and second principal coordinates (A) and the first and third principal coordinates (B) obtained from multidimensional scaling. Each circle shows a strain with color indicating the population contributing the largest proportion of ancestry and size indicating the proportion of ancestry from that population (see legend). Circles ringed in black show strains with more than 20 heterozygous sites.
Figure 5. Subpopulations defined by clustering of low frequency alleles. Two-dimensional hierarchical clustering of low frequency sites and strains. InStruct assignments are shown on the left, clustered genotypes are shown in the middle, with minor alleles in red, heterozygous sites in yellow, common alleles in black, and missing data in gray. Selected subpopulations are labeled on the right.
Supporting Information
Table S1. Strains used in this study, with population assignments inferred by InStruct.
Table S2. Populations inferred using InStruct and summary statistics.
Table S3. Fit of the population structure model as a function of the number of populations.
Figure S1. Population ancestry of strains inferred by InStruct. Populations are color-coded and the proportion of population ancestry assigned to each strain is indicated by bar height. Strain ancestry is shown assuming 8, 9 and 10 populations (K), with the order of strains based on K = 9 and color-coding of major populations matching that of K = 9.
Supplemental Dataset 1. Matrix of polymorphic sites. The matrix consists of 5,868 biallelic sites (columns) and 262 strains (rows) with column labels indicating the chromosome number and position separated by a period. Genotypes are represented by 0 or 2 for homozygotes, 1 for heterozygotes and -9 for missing data. Entries are comma delimited.
Supplemental Dataset 2. Neighbor-joining tree of 262 S. cerevisiae strains based on multiple alignment of 116,880 bases in Newick format. This tree is a version of Figure 1 that includes strain labels and the maximum group membership from Figure 2.
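The layout described for Supplemental Dataset 1 could be read with a sketch like the one below. It assumes that the first row holds the column labels and that the first column of every row holds the strain name; neither detail is stated in the description, and the function names are hypothetical.

```python
import csv
import io

MISSING = -9                       # missing-data code, from the description
ALLOWED = {0, 1, 2, MISSING}       # 0 or 2 homozygous, 1 heterozygous


def parse_genotype_matrix(handle):
    """Parse a comma-delimited strain-by-site genotype matrix.

    Returns (sites, genotypes): sites is a list of (chromosome, position)
    tuples and genotypes maps strain name -> list of integer codes.
    """
    reader = csv.reader(handle)
    header = next(reader)
    sites = []
    for label in header[1:]:
        chrom, pos = label.strip().split(".")  # "chromosome.position"
        sites.append((chrom, int(pos)))
    genotypes = {}
    for row in reader:
        if not row:
            continue
        values = [int(v) for v in row[1:]]
        if len(values) != len(sites):
            raise ValueError(f"row for {row[0]!r} has the wrong number of sites")
        if any(v not in ALLOWED for v in values):
            raise ValueError(f"unexpected genotype code in row {row[0]!r}")
        genotypes[row[0].strip()] = values
    return sites, genotypes


def heterozygous_count(values):
    """Number of heterozygous sites (code 1) for one strain."""
    return sum(v == 1 for v in values)


example = io.StringIO("strain,1.1234,1.5678,2.42\nA,0,2,1\nB,-9,1,1\n")
sites, genotypes = parse_genotype_matrix(example)
```

The small `example` table is made up for illustration and is not taken from the dataset.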
References
1. Goffeau, A., et al., Life with 6000 genes. Science, 1996. 274(5287): p. 546, 563-7.
2. Botstein, D. and G.R. Fink, Yeast: an experimental organism for 21st Century biology. Genetics, 2011. 189(3): p. 695-704.
3. Wenger, J.W., K. Schwartz, and G. Sherlock, Bulk segregant analysis by high-throughput sequencing reveals a novel xylose utilization gene from Saccharomyces cerevisiae. PLoS Genet, 2010. 6(5): p. e1000942.
4. Sinha, H., et al., Sequential elimination of major-effect contributors identifies additional quantitative trait loci conditioning high-temperature growth in yeast. Genetics, 2008. 180(3): p. 1661-70.
5. McGovern, P.E., et al., Fermented beverages of pre- and proto-historic China. Proc Natl Acad Sci U S A, 2004. 101(51): p. 17593-8.
6. Gerke, J., et al., Gene-environment interactions at nucleotide resolution. PLoS Genet, 2010. 6(9).
7. Gerke, J., K. Lorenz, and B. Cohen, Genetic interactions between transcription factors cause natural variation in yeast. Science, 2009. 323(5913): p. 498-501.
8. Ehrenreich, I.M., et al., Dissection of genetically complex traits with extremely large pools of yeast segregants. Nature, 2010. 464(7291): p. 1039-42.
9. Ehrenreich, I.M., et al., Genetic architecture of highly complex chemical resistance traits across four yeast strains. PLoS Genet, 2012. 8(3): p. e1002570.
10. Brem, R.B. and L. Kruglyak, The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci U S A, 2005. 102(5): p. 1572-7.
11. Nieduszynski, C.A. and G. Liti, From sequence to function: Insights from natural variation in budding yeasts. Biochim Biophys Acta, 2011. 1810(10): p. 959-66.
12. Fay, J.C. and J.A. Benavides, Evidence for domesticated and wild populations of Saccharomyces cerevisiae. PLoS Genet, 2005. 1(1): p. 66-71.
13. Wang, Q.M., et al., Surprisingly diverged populations of Saccharomyces cerevisiae in natural environments remote from human activity. Mol Ecol, 2012. 21(22): p. 5404-17.
14. Stefanini, I., et al., Role of social wasps in Saccharomyces cerevisiae ecology and evolution. Proc Natl Acad Sci U S A, 2012. 109(33): p. 13398-403.
15. Aa, E., et al., Population structure and gene evolution in Saccharomyces cerevisiae. FEMS Yeast Res, 2006. 6(5): p. 702-15.
16. Liti, G., et al., Population genomics of domestic and wild yeasts. Nature, 2009. 458(7236): p. 337-41.
17. Schacherer, J., et al., Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature, 2009. 458(7236): p. 342-5.
18. Ezov, T.K., et al., Molecular-genetic biodiversity in a natural population of the yeast Saccharomyces cerevisiae from "Evolution Canyon": microsatellite polymorphism, ploidy and controversial sexual status. Genetics, 2006. 174(3): p. 1455-68.
19. Goddard, M.R., et al., A distinct population of Saccharomyces cerevisiae in New Zealand: evidence for local dispersal by insects and human-aided global dispersal in oak barrels. Environ Microbiol, 2010. 12(1): p. 63-73.
20. Schuller, D., et al., Genetic diversity and population structure of Saccharomyces cerevisiae strains isolated from different grape varieties and winemaking regions. PLoS One, 2012. 7(2): p. e32507.
21. Legras, J.L., et al., Bread, beer and wine: Saccharomyces cerevisiae diversity reflects human history. Mol Ecol, 2007. 16(10): p. 2091-102.
22. Baird, N.A., et al., Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One, 2008. 3(10): p. e3376.
23. Miller, M.R., et al., Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res, 2007. 17(2): p. 240-8.
24. Hyma, K.E. and J.C. Fay, Mixing of vineyard and oak-tree ecotypes of Saccharomyces cerevisiae in North American vineyards. Mol Ecol, 2013.
25. Winzeler, E.A., et al., Genetic diversity in yeast assessed with whole-genome oligonucleotide arrays. Genetics, 2003. 163(1): p. 79-89.
26. Gao, H., S. Williamson, and C.D. Bustamante, A Markov chain Monte Carlo approach for joint inference of population structure and inbreeding rates from multilocus genotype data. Genetics, 2007. 176(3): p. 1635-51.
27. Warringer, J., et al., Trait variation in yeast is defined by population history. PLoS Genet, 2011. 7(6): p. e1002111.
28. Doniger, S.W., et al., A catalog of neutral and deleterious polymorphism in yeast. PLoS Genet, 2008. 4(8): p. e1000183.
29. Odds, F.C. and R. Bernaerts, CHROMagar Candida, a new differential isolation medium for presumptive identification of clinically important Candida species. J Clin Microbiol, 1994. 32(8): p. 1923-9.
30. Boekhout, T. and V. Robert, Yeasts in Food. 2003, Cambridge, England: Woodhead Publishing Ltd.
31. Lorenz, K. and B.A. Cohen, Small- and large-effect quantitative trait locus interactions underlie variation in yeast sporulation efficiency. Genetics, 2012. 192(3): p. 1123-32.
32. Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754-60.
33. Li, H., et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078-9.
34. Tamura, K., et al., MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol, 2011. 28(10): p. 2731-9.
35. Jakobsson, M. and N.A. Rosenberg, CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics, 2007. 23(14): p. 1801-6.