
Supporting Information Appendix - PNAS · Cui et al., Historical variations in mutation rate, Supporting Information

Apr 30, 2020


Cui et al., Historical variations in mutation rate, Supporting Information 


Supporting Information Appendix

Content

SI Materials and Methods

SI References

SI Figures

    Fig. S1, Fig. S2, Fig. S3, Fig. S4, Fig. S5, Fig. S6, Fig. S7, Fig. S8, Fig. S9

SI Tables

    Table S1, Table S2, Table S3


SI Materials and Methods

Strain collection. We sequenced the genomes of 107 Chinese isolates selected to represent the maximum diversity of biovar, host and date of isolation of Y. pestis in China (1-3). These were supplemented with three isolates from the Former Soviet Union (FSU), six from Mongolia, one from Myanmar, and one vaccine strain (EV76) from Madagascar, for a total of 118 strains sequenced for this study (Dataset S1). Thirty of the 118 isolates had previously been SNP-typed in a separate global study (4), and two (K11973002 and E1979001) had been sequenced by traditional Sanger sequencing (5).

Sequencing and assembly. Whole genome sequencing was performed on an Illumina GA II (Illumina Inc., U.S.A.) using multiplexed paired-end libraries with an average insert size of 500 bp. A base-calling pipeline (Solexa Pipeline version 1.0) was used to process the raw fluorescent images and call sequences. For each strain, raw short-read sequences were assembled de novo using SOAPdenovo (6). To evaluate the single-base accuracy of the assembled genome sequences, all raw reads were mapped onto the scaffolds using SOAPaligner (7). Detailed assembly results are provided in Dataset S2.

Core and pan-genome construction. We compared all 118 de novo assembled genomes in this study with 15 published genomes (Dataset S1 and SI Appendix, Fig. S9) to identify the genomic content shared by all 133 genomes. Briefly, we mapped all the assembled contigs to the completed reference genome sequence (strain CO92, accession number NC_003143) to delineate shared regions with identity ≥ 90% and e-value < 1e-5 according to BLASTn (8). We then compared the raw reads of each strain to the core-genome using SOAPaligner (7), and unmapped reads were collected and assembled to construct the accessory genome. For the 15 published genomes, regions that mapped to the core-genome were removed, and the remaining sequences were combined to obtain a set of redundant Strain Specific Sequences (SSSs). SSSs were aligned to each other using BLAT (9) and grouped into sets of sequences with ≥ 90% identity and match length ≥ 85% in pair-wise comparisons. Groups of similar sequences were merged; all but the longest sequence in each group were considered redundant and were therefore removed.
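The grouping of redundant SSSs described above can be sketched as single-linkage clustering over pairwise hits, keeping the longest member of each group. This is a minimal illustration, not the authors' script: the function name `dereplicate`, the input layout, and in particular the reading of "match length ≥ 85%" as a fraction of the shorter sequence are assumptions made for the example.

```python
from collections import defaultdict


def dereplicate(lengths, hits, min_identity=90.0, min_match_frac=0.85):
    """Group similar sequences and keep the longest member of each group.

    lengths: dict mapping sequence name -> length in bp
    hits:    iterable of (name_a, name_b, percent_identity, match_length_bp),
             e.g. parsed from pairwise BLAT output (hypothetical layout)
    """
    parent = {name: name for name in lengths}

    def find(name):
        # Follow parent links to the group root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b, identity, match_len in hits:
        # ASSUMPTION: match length is measured relative to the shorter sequence.
        shorter = min(lengths[a], lengths[b])
        if identity >= min_identity and match_len / shorter >= min_match_frac:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in lengths:
        groups[find(name)].append(name)

    # Keep only the longest sequence of each group; the rest are redundant.
    return sorted(max(members, key=lengths.get) for members in groups.values())
```

For example, with three sequences of 1000, 900 and 500 bp, a 95% identity hit covering 880 bp between the first two merges them, while a 300 bp hit between the second and third does not reach the 85% threshold.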

Identification and filtration of SNPs. The processes for SNP identification and filtration are summarized in Fig. S9. Briefly, SNPs were identified by aligning the contigs of each strain to the genome of CO92 using MUMmer 3.0 (10). The MUMmer results for each strain were filtered to remove SNPs that might be unreliable, according to the following criteria: 1) quality score < 20 (average base-calling error rate greater than 0.01); 2) covered by < 10 paired-end reads; 3) located in repetitive regions (Dataset S7); 4) not identified by BLAT searches of the contigs of each strain against the core-genome sequences.
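The four filtering criteria can be expressed as a single predicate. The sketch below is illustrative only; the record type `CandidateSNP` and its field names are hypothetical and do not correspond to any file format used in the study.

```python
from dataclasses import dataclass


@dataclass
class CandidateSNP:
    position: int         # coordinate on the CO92 reference
    quality: int          # quality score of the called base
    read_pairs: int       # number of paired-end reads covering the site
    in_repeat: bool       # True if the site lies in a repetitive region
    blat_confirmed: bool  # True if the BLAT search also identified the SNP


def is_reliable(snp, min_quality=20, min_read_pairs=10):
    """Return True only if the SNP passes all four criteria."""
    return (
        snp.quality >= min_quality            # criterion 1
        and snp.read_pairs >= min_read_pairs  # criterion 2
        and not snp.in_repeat                 # criterion 3
        and snp.blat_confirmed                # criterion 4
    )


def filter_snps(snps):
    """Keep the SNPs that survive every filter."""
    return [snp for snp in snps if is_reliable(snp)]
```

A SNP is discarded as soon as it fails any one criterion, which matches the "remove SNPs that might be unreliable" wording above.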

Phylogenetic analysis. The ancestral state for all SNPs was equated to the corresponding nucleotide in Y. pseudotuberculosis IP32953, and all alternative nucleotides were scored as derived. After excluding 28 homoplastic SNPs (SI Appendix, Table S2), 2,298 non-homoplastic SNPs were imported into Bionumerics 5.1 (Applied Maths, Belgium) as a binary character set, consisting of “0” for the ancestral state and “1” for the derived state. This character set was then used in Bionumerics 5.1 to construct a minimum spanning tree (MSTree) with hypothetical intermediate nodes. The MSTree was fully parsimonious for all 133 Y. pestis isolates. A maximum likelihood tree was constructed on the basis of concatenated SNPs using the software PHYML with the HKY model (11) and 500 bootstrap replications.
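The binary character coding can be illustrated in a few lines. This is a sketch under the assumption that each strain is represented by a string holding its nucleotide at every SNP position, in the same order as the ancestral (outgroup) nucleotides; the function names are hypothetical, and the tree construction itself was done in Bionumerics and is not reproduced here.

```python
def encode_binary(ancestral, strain):
    """Code each SNP as '0' (ancestral state) or '1' (derived state)."""
    if len(ancestral) != len(strain):
        raise ValueError("both profiles must cover the same SNP positions")
    return "".join("0" if s == a else "1" for a, s in zip(ancestral, strain))


def snp_distance(profile_a, profile_b):
    """Number of SNP positions at which two binary profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same length")
    return sum(1 for x, y in zip(profile_a, profile_b) if x != y)
```

The pairwise counts returned by `snp_distance` are the kind of distance on which a minimum spanning tree over binary profiles is built.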

Nomenclature of populations. Previous studies subdivided the genealogy of Y. pestis into branches 0, 1 and 2 (4, 12), each of which contains multiple populations. The populations were designated according to the branch plus a mnemonic for the biovar (Mnemonics: ANT, Antiqua; PE, Pestoides (including Microtus isolates); MED, Medievalis; ORI, Orientalis; IN, intermediate strains between 1.ANT and 1.ORI) followed by a number for sequential populations, e.g. 1.ORI2. We identified two novel branches in this paper, and designate them as branches 3 and 4. Each unique node (genome) in the MSTree was assigned a unique identifier consisting of the population designation plus a lower case letter indicating the node. The population designations are congruent with prior assignments (4), except that new isolates on branch 3 were assigned to populations 3.ANT1 and 3.ANT2 and the sole isolate on branch 4 was assigned to 4.ANT. The node designations of all strains are listed in Dataset S1.

Confidence interval (CI) estimation of the mean and variance in the distance between the tips and MRCA of the MSTree. We first created a bootstrap dataset by resampling the tip-to-MRCA distances with replacement, and then calculated and saved the mean and variance of the distances in that dataset. This procedure was repeated for 10,000 iterations. Finally, the CIs were derived from the quantiles of the bootstrap distributions.
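A percentile bootstrap of this kind can be sketched as follows. The code is an illustration rather than the authors' implementation: the function name, the 95% level, the fixed seed and the simple index-based quantile are all assumptions.

```python
import random
import statistics


def bootstrap_ci(distances, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CIs for the mean and variance of a sample.

    Returns ((mean_low, mean_high), (var_low, var_high)).
    """
    rng = random.Random(seed)
    n = len(distances)
    means, variances = [], []
    for _ in range(n_boot):
        # Resample the tip-to-MRCA distances with replacement.
        sample = [rng.choice(distances) for _ in range(n)]
        means.append(statistics.fmean(sample))
        variances.append(statistics.variance(sample))

    def interval(values):
        # Read the CI bounds off the sorted bootstrap replicates.
        ordered = sorted(values)
        low = ordered[int((1 - level) / 2 * (len(ordered) - 1))]
        high = ordered[int((1 + level) / 2 * (len(ordered) - 1))]
        return low, high

    return interval(means), interval(variances)
```

Calling `bootstrap_ci(distances)` with the list of tip-to-MRCA distances returns one interval for the mean and one for the variance.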

Testing for heterogeneity in substitution rates. To characterize the differences in substitution rate across the tree, we used the software package BEAST v1.7.1 (13-14) to implement a Bayesian relaxed clock model in which the substitution rate is allowed to vary in an uncorrelated fashion on each branch of the phylogeny, and in which each rate is assumed to be drawn from a lognormal distribution. The variance of this distribution is estimated from the data, and provides a measure of the extent of the rate variation present. We analyzed the core genomes (~3.53 Mb) from 134 isolates. The Angola strain was excluded on the basis of initial analyses, which showed that it had an anomalously high rate of evolution that would not be adequately modeled by the lognormal approach. All strains were placed as dated tips according to their dates of sampling (see Dataset S1); strains with unknown dates were assigned a date of 1990, and the two Black Death strains were placed at 1348, the date of burial of the skeletons (15). The analyses used the HKY model of nucleotide substitution with four discrete gamma-distributed rate categories and, as a prior on the relative node ages, we assumed a coalescent process with a constant population size across the tree. The priors for all other parameters were the default settings. This estimate is based on the discovery of mutations in endemic isolates from Madagascar (4). We calculated a posterior distribution of all parameters using two separate MCMCs, each run for 10,000,000 generations. The burn-in phase of the distribution was manually removed by inspecting the traces of parameters for the different chains. Convergence and mixing of these chains were confirmed in the program Tracer v1.5 (13).

Comparing the number of SNPs per gene to theoretical expectations

To detect genes under natural selection, we compared the distribution of the number of SNPs per gene to theoretical expectations. The probability of observing a polymorphism at a given nucleotide was computed as $m = n/N$, where $n$ was the total number of SNPs and $N$ was the sum of all gene lengths. The expected number of SNPs for a gene $i$ of length $L_i$ was then computed as $mL_i$. We compared the observed numbers of SNPs to theoretical expectations using a chi-square goodness-of-fit test. The test statistic was computed as:

$$\chi^2_{obs} = \sum_{i=1}^{K} \frac{(n_i - mL_i)^2}{mL_i}, \tag{1.1}$$

where $K$ was the number of genes. The chi-square statistic was compared to theoretical expectations using both parametric and non-parametric approaches. The parametric distribution was defined as a chi-square distribution with $K-1 = 2821$ degrees of freedom (Fig. 4A). Because the expected numbers of SNPs were lower than typically required for a chi-square test ($mL_i$ should be greater than 5 for all genes), we also derived a reference distribution using simulations. For each gene, we drew a simulated number of SNPs $n'_i$ from a binomial distribution $B(L_i, m)$. A reference chi-square statistic was computed by replacing $n_i$ with $n'_i$ in Equation 1.1. We repeated this operation 1,000 times to obtain a non-parametric reference distribution of chi-square values (Fig. 4A). The parametric and non-parametric distributions were very similar, and both tests revealed a clear deviation from theoretical expectations ($\chi^2_{obs} = 3582.48$, $p_{param} = 4.16 \times 10^{-21}$; $p_{non\text{-}param} = 9.9 \times 10^{-4}$).
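The simulation-based reference distribution can be sketched with the standard library alone. This is an illustrative reading of the text, not the authors' code: the function names are hypothetical, the binomial draw uses geometric waiting times as one convenient exact method, and the "+1" form of the empirical p-value is an assumption (the source does not state which estimator was used).

```python
import math
import random


def chi_square_stat(counts, lengths, m):
    """Equation 1.1: sum over genes of (n_i - m*L_i)^2 / (m*L_i)."""
    return sum((n - m * L) ** 2 / (m * L) for n, L in zip(counts, lengths))


def draw_binomial(trials, p, rng):
    """Exact Binomial(trials, p) draw via geometric gaps between successes."""
    if p <= 0.0:
        return 0
    if p >= 1.0:
        return trials
    log_q = math.log1p(-p)
    successes, position = 0, 0
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1]
        position += int(math.log(u) / log_q) + 1
        if position > trials:
            return successes
        successes += 1


def simulate_reference(lengths, m, reps=1000, seed=0):
    """Chi-square values for SNP counts simulated as n'_i ~ B(L_i, m)."""
    rng = random.Random(seed)
    return [
        chi_square_stat([draw_binomial(L, m, rng) for L in lengths], lengths, m)
        for _ in range(reps)
    ]


def empirical_p_value(observed, reference):
    """ASSUMPTION: p = (1 + number of simulated values >= observed) / (reps + 1)."""
    exceed = sum(1 for value in reference if value >= observed)
    return (exceed + 1) / (len(reference) + 1)
```

In use, `m` is computed once from the observed data as the total SNP count divided by the summed gene lengths, and the observed statistic from `chi_square_stat` is then compared with the list returned by `simulate_reference`.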

We assessed which genes deviated most from theoretical expectations by examining their respective contributions to the chi-square statistic (Fig. 4B). The contribution of a gene i, expressed as a percentage, was computed as:

$$100 \times \frac{(n_i - mL_i)^2}{\chi^2_{obs}\, mL_i} \tag{1.2}$$

We used a sequential cumulative Jackknife procedure to identify genes with the most unlikely number of SNPs. This procedure can be described as follows:

i) Increment the number of dependent tests, $z$, by one.

ii) Remove $x$, the gene with the largest contribution (Equation 1.2), from the dataset.

iii) Update the values of $m$ and $K$, and compute a new chi-square statistic (Equation 1.1).

iv) Compare the chi-square statistic to a chi-square distribution with $K-1$ degrees of freedom and a Bonferroni-corrected type one error defined as $\alpha' = \alpha / z$, where $\alpha$ is the target type one error (set to 0.01) and $\alpha'$ is the effective threshold value. If $p(\chi^2_{obs} \leq \chi^2_{(K-1)}) < \alpha'$, accept $x$ as an outlier and go back to step i); otherwise, stop the procedure.
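Steps i) to iv) can be sketched as a loop. The code below follows a literal reading of the procedure and is not the authors' implementation. Two points are assumptions: the function names are hypothetical, and the chi-square tail probability is obtained with the Wilson-Hilferty normal approximation (swapped in here to avoid external dependencies) rather than an exact chi-square distribution.

```python
import math


def chi2_sf(x, df):
    """Approximate upper-tail chi-square probability (Wilson-Hilferty)."""
    if x <= 0:
        return 1.0
    spread = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - spread)) / math.sqrt(spread)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def chi_square_terms(counts, lengths, genes):
    """Per-gene terms of Equation 1.1, with m recomputed on the given genes."""
    m = sum(counts[g] for g in genes) / sum(lengths[g] for g in genes)
    return {g: (counts[g] - m * lengths[g]) ** 2 / (m * lengths[g]) for g in genes}


def sequential_jackknife(counts, lengths, alpha=0.01):
    """Return indices of genes accepted as outliers, in order of removal."""
    genes = list(range(len(counts)))
    outliers = []
    z = 0
    while len(genes) > 2 and sum(counts[g] for g in genes) > 0:
        z += 1                                        # step i
        terms = chi_square_terms(counts, lengths, genes)
        x = max(genes, key=terms.get)                 # step ii
        remaining = [g for g in genes if g != x]
        if sum(counts[g] for g in remaining) == 0:
            break                                     # no SNPs left to test
        stat = sum(chi_square_terms(counts, lengths, remaining).values())  # step iii
        p = chi2_sf(stat, len(remaining) - 1)         # step iv
        if p >= alpha / z:
            break                                     # remainder fits: stop
        outliers.append(x)
        genes = remaining
    return outliers
```

Under this literal reading, a gene is accepted only while the dataset left after its removal still deviates significantly, so the procedure stops at the first removal that leaves a dataset consistent with expectations.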


SI References

1. Cui Y, et al. (2008) Insight into microevolution of Yersinia pestis by clustered regularly interspaced short palindromic repeats. PLoS One 3(7):e2652.

2. Li Y, et al. (2008) Different region analysis for genotyping Yersinia pestis isolates from China. PLoS One 3(5):e2166.

3. Li Y, et al. (2009) Genotyping and phylogenetic analysis of Yersinia pestis by MLVA: insights into the worldwide expansion of Central Asia plague foci. PLoS One 4(6):e6000.

4. Morelli G, et al. (2010) Yersinia pestis genome sequencing identifies patterns of global phylogenetic diversity. Nat Genet 42(12):1140-1143.

5. Eppinger M, et al. (2009) Draft genome sequences of Yersinia pestis isolates from natural foci of endemic plague in China. J Bacteriol 191(24):7628-7629.

6. Haensch S, et al. (2010) Distinct clones of Yersinia pestis caused the black death. PLoS Pathog 6(10):e1001134.

7. Li R, et al. (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15):1966-1967.

8. Altschul SF, Gish W, Miller W, Myers EW, & Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403-410.

9. Kent WJ (2002) BLAT--the BLAST-like alignment tool. Genome Res 12(4):656-664.

10. Kurtz S, et al. (2004) Versatile and open software for comparing large genomes. Genome Biol 5(2):R12.

11. Guindon S & Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52(5):696-704.

12. Achtman M, et al. (2004) Microevolution and history of the plague bacillus, Yersinia pestis. Proc Natl Acad Sci U S A 101(51):17837-17842.

13. Drummond AJ & Rambaut A (2007) BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 7:214.

14. Drummond AJ, Suchard MA, Xie D, & Rambaut A (2012) Bayesian Phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol 29(8):1969-1973.

15. Bos KI, et al. (2011) A draft genome of Yersinia pestis from victims of the Black Death. Nature 478(7370):506-510.

SI Figures

[Figure S1: four box and whisker panels, images not reproduced. x-axis in all panels: number of strains (1 to 120). y-axes: number of genes (A, B); core genome size in Mb (C); pan genome size in Mb (D).]

Figure S1 Size of the core- (A, C) and pan-genome (B, D) of Y. pestis versus numbers of genomic sequences. Size estimates are indicated by box and whisker plots, in which the box encloses the second and third quartiles (25-75%) of the data. The median is indicated by a thin red line within each box. The whiskers indicate the top and bottom quartiles except for outliers, which are indicated by dots. Ranges of data represent the results of 50 independent runs, except panel B, which is based on 200 iterations. (A, B) Number of genes included in the core- and pan-genome. (C, D) Size of the core and pan-genome.

[Figure S2: tree images not reproduced. Population legend: 2.MED, 2.ANT, 1.ORI, 1.IN, 1.ANT, 4.ANT, 3.ANT, 0.ANT, 0.PE. Scale bars: 0.1 and 0.001. The MRCA is marked in both panels; 0.PE2, 0.PE3, 0.PE4 and 0.PE7 are labelled on individual branches.]

Figure S2 Maximum likelihood tree of four Y. pseudotuberculosis and 133 Y. pestis strains based on 32,664 SNPs. The tree was constructed using PHYML with 500 bootstrap replicates. Numbers near internal nodes are bootstrap percentages, and the scale bars represent substitutions per SNP site. Left: The phylogeny of Y. pseudotuberculosis and a collapsed phylogeny of Y. pestis. PB1, IP32953, YPIII and IP31758 are Y. pseudotuberculosis strains with publicly available genome sequences. Right: The expanded phylogeny of 133 Y. pestis strains. The colored bars on the right side of the cladogram show population designations (see legend at the bottom). For 0.PE, individual populations are indicated on each branch in red. The branch leading to strain Angola (0.PE3) was truncated due to its unusually long branch length (548 SNPs).

[Figure S3: minimum spanning tree images not reproduced. Panels A and B. Legend: Branch 0, Branch 1, Branch 2, Branch 3, Branch 4, ancient genomes, MRCA of Y. pestis sample.]

Figure S3 Minimum spanning tree of relationships between 133 core genomes of Y. pestis based on 2298 SNPs. In both panels, each circle represents a node in the tree, which corresponds to a single genome with one exception (see below). (A) The branch lengths are proportional to the actual number of SNPs between nodes. (B) The branch lengths are proportional to the natural logarithm of the numbers of SNPs between nodes. Theoretical internal nodes (N01- N54) are indicated where required by the branching structure. Circles are colored by population assignments (2.MED1, 2.MED2, etc.) and nodes contain node-specific lower case letters (e.g. 2.MED1a, 2.MED1b). Numbers superimposed on branches indicate the number of SNPs between two nodes. The exceptional node, 0.ANT2a, corresponds to two indistinguishable core genomes from strains B42003004 and 2330. These were isolated 25 days apart from Marmota baibacina in the same location in Wenquan, Xinjiang province. MRCA, most recent common ancestor.

[Figure S4: scatter plot not reproduced. x-axis: pairwise numbers of different SNPs (MSTree), 0 to 800. y-axis: pairwise patristic differences between nodes (ML tree), 0 to 0.4. R^2 = 0.997.]

Figure S4 Relationship of pairwise patristic differences between nodes of the MSTree and the MLTree. The patristic distances between all pairs of strains are shown with distances from the MSTree on the x-axis and the MLTree on the y-axis. The two sets of distances are tightly correlated (R^2 = 0.997).

[Figure S5: images not reproduced. Panel A axes: scaled longitude (x), scaled latitude (y). Panel B axis: numbers of isolates. Populations shown: 0.ANT1, 0.ANT2, 0.ANT3, 0.PE4, 0.PE7, 1.ANT1, 1.IN1, 1.IN2, 1.IN3, 1.ORI1, 1.ORI2, 1.ORI3, 2.ANT1, 2.ANT2, 2.ANT3, 2.MED1, 2.MED2, 2.MED3, 3.ANT1, 3.ANT2, 4.ANT1.]

Figure S5. Spatial and host distribution of sampled Y. pestis isolates. A. Spatial distribution of the sources of isolates according to longitude and latitude that were scaled to a mean value of zero and standard deviation of 1.0 to account for differences between latitude and longitude. Ellipses encompass 95% of the dispersion (inertia) of each group and the geographic sources of singletons are indicated by points. B. Numbers of isolates per bacterial population and host genus. Non-rodent mammalian hosts are indicated in a grey font.

[Figure S6: scatter plot not reproduced. x-axis: year of isolation (1945 to 2005). y-axis: SNPs to MRCA (100 to 300). Legend grouped by Branch 0, Branch 1, Branch 2, and Branches 3 & 4.]

Figure S6. SNPs to the MRCA in MSTree versus year of isolation for individual strains. The plot excludes five isolates lacking information on the date of sampling as well as the vaccine strain EV76, which would be off scale.

[Figure S7: density plot not reproduced. x-axis: log rate distribution standard deviation (0.0 to 1.4). y-axis: density (0 to 5). Curves: prior, posterior.]

Figure S7. Prior and posterior probability density functions for the standard deviation parameter of the lognormal rate distribution. The prior and posterior probability distributions differ, and the posterior distribution (estimated from the data) contains no weight around zero. This provides strong support for variable substitution rates across the tree and the rejection of a strict clock model.

Figure S8. Maximum clade credibility tree of the core genomes of Y. pestis showing 95% Highest Posterior Density confidence intervals of dates on each of the nodes (blue bars). The tree is scaled in years before present, and the solid red and solid black lines represent the dates of Justinian’s pandemic (541 AD) and the Black Death (1347 AD). The strain Angola was excluded from this estimation because its anomalously fast rate could not be fit by the lognormal model, but the arrow represents its placement according to the Maximum Likelihood tree (Fig. S2). This provides strong evidence that this strain diverged significantly earlier than Justinian’s pandemic.

[Figure S9: flowchart not reproduced. Boxes: raw reads of Samples 1 to 118; 14 published genomes; contigs; reference genome CO92; core-genome; accessory-genome; repeat region; raw SNPs; clean SNPs. Circled numbers in the flowchart refer to the steps below.]

Define the core- and pan-genome:

① Assembly;

② Alignment of contigs to the reference genome by BLASTn yields a core-genome as the region shared by all genomes;

③ Raw reads of each strain are compared to the core-genome;

④ Unmapped reads are assembled into the accessory genome;

SNP calling:

⑤ Align contigs of each strain individually to the core-genome by MUMmer;

⑥ Accumulate MUMmer results for each strain;

⑦ Define the quality score and number of reads for each SNP;

⑧ Define repetitive DNA within the CO92 genome;

⑨ Verify the nucleotide of each SNP using BLAT;

⑩ Filter out all SNPs that are covered by <10 paired-end reads, have a quality score <20, lie within a repetitive region, or did not map to the core-genome.

Figure S9. A pipeline for defining the core- and pan-genome and calling SNPs.

Supplementary Tables

Table S1. The general features of the core- and pan-genome from 133 Y. pestis genomes

                                        Core-genome  Accessory genome  Pan-genome
Total size (bp)                         3,534,940    1,921,302         5,456,242
Blocks                                  990          3301              4291
G+C content                             48.31%       47.78%            48.15%
Chromosomal genes                       3450         1249              4699
Average chromosomal gene length (bp)    877          669               822
Plasmid genes                           0            451               451
Average plasmid gene length (bp)        -            719               719
Total length of plasmid sequences (bp)  0            421,422           421,422

Table S2. Homoplastic SNPs that were excluded from the MSTree

SNP ID  CO92 Pos  Gene ID   Mutation type  Anc Nt  Der Nt  Anc AA  Der AA  Branches
s3691   98534     NA        intergenic     A       G       NA      NA      MRCA_N01, and reversal mutation in N42_N46
s3700   579383    NA        intergenic     A       G       NA      NA      all, except N21_0.PE4i
s3703   793697    NA        intergenic     A       G       NA      NA      N18_0.PE4c, N60_1.ORI1d
s3707   1956130   NA        intergenic     T       C       NA      NA      N16_0.PE2a, N03_N04
s3714   3371854   NA        intergenic     T       C       NA      NA      MRCA_N01, and reversal mutation in N02_0.PE3a
s3716   4499729   NA        intergenic     G       T       NA      NA      N16_0.PE2b, N63_1.IN3h
s3717   4499733   NA        intergenic     A       G       NA      NA      N63_1.IN3i, N36_2.ANT3g, N45_2.MED2e,
s3692   196232    YPO0176   nsyn           T       G       C       G       N01_N16, N02_0.PE3a
s3693   228268    YPO0222a  syn            T       G       P       P       N05_0.ANT2a, N06_0.ANT3a, N21_0.PE4i, N06_N07
s3694   325708    YPO0319   nsyn           A       C       V       G       1.IN3b_1.IN3e, N63_1.IN3i
s3695   358874    YPO0348   nsyn           C       A       L       F       N02_0.PE3a, N40_2.ANT1c, N37_2.ANT3j, N07_4.ANT1a
s3696   358875    YPO0348   nsyn           A       G       L       S       N01_N16, 1.IN2m_1.IN2n, N58_1.IN3a, 1.IN3b_1.IN3f, N61_1.ORI1b, N34_2.ANT3e, N37_N38, N26_3.ANT1b, N27_N28, N27_3.ANT2e
s3697   358876    YPO0348   nsyn           A       C       L       V       MRCA_N01, and reversal mutation in N17_N18, N52_2.MED3i
s3698   430520    YPO0412   syn            A       C       G       G       N58_1.IN3b, N51_2.MED3d
s3699   461550    YPO0440   nsyn           G       A       C       Y       MRCA_N01, and reversal mutation in 2.MED3a_2.MED3b, N52_2.MED3i
s3701   759027    YPO0698   nsyn           T       G       L       V       1.IN2m_1.IN2q, 1.IN3b_1.IN3e
s3702   784090    YPO0725   nsyn           T       G       V       G       N14_1.ORI1c, N07_N09
s3704   1547235   YPO1376   nsyn           G       A       A       T       N02_0.PE3a, N03_N04
s3705   1792465   YPO1571   nsyn           C       T       T       I       N17_N18, and reversal mutation in N21_0.PE4i
s3706   1805037   YPO1581   nsyn           C       A       W       L       N16_0.PE2a, N17_N18
s3708   2484439   YPO2210   syn            C       T       V       V       N02_0.PE3a, MRCA_0.PE7a
s3709   2484441   YPO2210   nsyn           C       T       V       M       MRCA_N01, and reversal mutation in N10_N55
s3710   2586102   YPO2299   nsyn           A       C       T       P       1.ORI2f_1.ORI2i, 3.ANT1b_3.ANT1c
s3711   3073160   YPO2742   nonsense       C       A       S       *       N18_N19, and reversal mutation in N21_0.PE4i
s3712   3247320   YPO2904   nsyn           A       C       S       A       N01_N16, N02_0.PE3a
s3713   3367341   YPO3013   nsyn           C       T       G       D       N17_N18, and reversal mutation in N21_0.PE4h, 0.PE4d_0.PE4e
s3715   3587339   YPO3223   nsyn           A       C       W       G       N16_0.PE2a, N43_2.MED2b
s3718   4628490   YPO4107   syn            A       C       G       G       N12_1.IN2e, N63_1.IN3i

Abbreviations:

syn: synonymous mutation; nsyn: nonsynonymous mutation; nonsense: mutation results in premature stop codon; intergenic: mutation in intergenic region.

Anc: ancestral; Der: derived; Nt: nucleotide; AA: amino acid; NA: not applicable.

 

Table S3. Features of highly variable genes

Gene ID   Type**                         Product                                                      Length (bp)  # of SNPs  # of nsSNPs  # of sSNPs
YPO0348*  multiple SNPs and homoplasies  aspartate ammonia-lyase                                      1437         8          8            0
YPO0821   multiple SNPs                  hypothetical protein                                         195          3          2            1
YPO2210*  multiple SNPs and homoplasies  hypothetical protein                                         129          4          2            2
YPO2387   multiple SNPs                  DNA-binding transcriptional repressor PurR                   1026         5          5            0
YPO2639   multiple SNPs                  integrase (partial)                                          186          2          2            0
YPO3041   multiple SNPs                  nitrate/nitrite response regulator protein NarP              630          4          4            0
YPO3561   multiple SNPs                  stringent starvation protein A                               642          4          4            0
YPO0698   homoplasies                    outer membrane usher protein                                 2481         6          5            1
YPO1376   homoplasies                    putative cell division protein                               3918         6          6            0
YPO0176   homoplasies                    hypothetical protein                                         408          1          1            0
YPO0222a  homoplasies                    30S ribosomal protein S14                                    306          1          0            1
YPO0319   homoplasies                    quinone oxidoreductase, NADPH-dependent                      984          3          1            2
YPO0412   homoplasies                    putative ABC transporter ATP-binding protein                 1584         3          1            2
YPO0440   homoplasies                    purine nucleoside phosphorylase                              720          1          1            0
YPO0725   homoplasies                    flagellar hook protein FlgE                                  1242         1          1            0
YPO1571   homoplasies                    hypothetical protein                                         1431         4          4            0
YPO1581   homoplasies                    alpha-galactosidase                                          2127         2          1            1
YPO2299   homoplasies                    putative methylated-DNA--protein-cysteine methyltransferase  519          2          1            1
YPO2904   homoplasies                    putative 3-phenylpropionic acid transporter                  1158         1          1            0
YPO3013   homoplasies                    sulfate/thiosulfate transporter permease subunit             876          1          1            0
YPO3223   homoplasies                    DNA-binding transcriptional regulator Crl                    402          2          2            0
YPO4107   homoplasies                    Xanthine/uracil permeases family protein                     1329         2          1            1

* Genes that contain both multiple SNPs and homoplasies.

** multiple SNPs: seven genes with an unexpected number of SNPs, as shown in Fig. 4; homoplasies: genes with homoplastic SNPs, as shown in Table S2.