Genome Res. Published online October 4, 2012.
Access the most recent version at doi: 10.1101/gr.143198.112
A calibrated human Y-chromosomal phylogeny based on resequencing
Wei Wei1,2, Qasim Ayub1, Yuan Chen1, Shane McCarthy1, Yiping Hou2, Ignazio Carbone3, Yali Xue1, Chris Tyler-Smith1
1 The Wellcome Trust Sanger Institute, Hinxton, Cambs. CB10 1SA, UK
2 Department of Forensic Genetics, School of Basic Science and Forensic Medicine, Sichuan University (West China University of Medical Sciences), Chengdu, 610041, Sichuan, PR China
3 Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC 27695-7244, USA
Keywords: human Y chromosome; whole-genome resequencing;
targeted resequencing;
phylogenetic tree; male expansions
Abstract
We have identified variants present in high-coverage complete
sequences of 36 diverse
human Y chromosomes from Africa, Europe, South Asia, East Asia
and the Americas
representing eight major haplogroups. After restricting our
analysis to 8.97 Mb of unique
male-specific Y sequence, we identified 6,662 high-confidence
variants including SNPs,
MNPs and indels. We constructed phylogenetic trees using these
variants, or subsets of
them, and recapitulated the known structure of the tree.
Assuming a male mutation rate of
$1 \times 10^{-9}$ per bp per year, the time depth of the tree (haplogroups A3-R)
was about 101-115
thousand years, and the lineages found outside Africa dated to
57-74 thousand years, both
as expected. In addition, we dated a striking Paleolithic male
lineage expansion to 41-52
thousand years ago and the node representing the major European
Y lineage, R1b, to 4-13
thousand years ago, supporting a Neolithic origin for these
modern European Y
chromosomes. In all, we provide a nearly 10-fold increase in the
number of Y markers with
phylogenetic information, and novel historical insights derived
from placing them on a
calibrated phylogenetic tree.
Introduction
The human Y chromosome offers a unique perspective on human
genetics because of its
male-line inheritance and the very high resolution of its
haplotype tree (Jobling and Tyler-
Smith 2003). Insights it has provided include a recent African
origin for paternal lineages
(Cruciani et al. 2011, Hammer 1995), a single expansion of
modern humans out of Africa
(Underhill et al. 2000), evidence for Paleolithic movement back
to Africa (Scozzari et al.
1999) and many examples of tracing events within historical
times such as the Mongol
(Zerjal et al. 2003) and Phoenician legacies (Zalloua et al.
2008) or the descendants of
Thomas Jefferson (Foster et al. 1998). These conclusions have
been reached after
genotyping small numbers of known markers, or limited
resequencing. Despite the valuable
insights that have been obtained, there have been other areas of
Y-chromosomal study
where uncertainty or debate has persisted for decades, such as
the understanding of the
Cold Spring Harbor Laboratory Press on November 1, 2012 -
Published by genome.cshlp.orgDownloaded from
http://genome.cshlp.org/http://www.cshlpress.com
causal mutations underlying Y-linked spermatogenic failure
(Tyler-Smith and Krausz 2009),
the role of Y-chromosomal variation in phenotypes such as
coronary artery disease
(Charchar et al. 2012), or the origins of European male
lineages. In the last case, opinions
have included a Paleolithic origin for the major lineage (Semino
et al. 2000), a Neolithic
origin for the equivalent lineage (Balaresque et al. 2010), and
the view that reliable age
estimates for this lineage are currently impossible (Busby et
al. 2011).
Studies of mitochondrial DNA, often regarded as a female-line
counterpart of the Y, have
benefited substantially from developments in data acquisition,
moving from early RFLP
typing (Cann et al. 1987) and control-region resequencing
(Vigilant et al. 1991) to large-scale
complete sequencing of chosen lineages or populations (Behar et
al. 2008, Gunnarsdottir et
al. 2011). It seemed likely that complete sequencing of Y
chromosomes might lead to similar
or even greater benefits. Although its 3,000-fold greater length
has made this more difficult,
current technologies allow an enormous increase in sequence data
collection. This is,
however, accompanied by a cost: short reads are generated and
these cannot be mapped
accurately if they are derived from repetitive regions. Since
the Y is richer in repeated
sequences than any other chromosome, this is particularly
problematic for sequencing the
Y, but can be overcome by concentrating on unique regions. This
approach has allowed the
sequencing of a pair of related Y chromosomes with sufficient
accuracy to provide a
measurement of the mutation rate (Xue et al. 2009), and lower
coverage sequencing of 77 Y
chromosomes to discover 2,870 high-confidence Y-SNPs, 74% new
(The 1000 Genomes
Project Consortium 2010). Although low-coverage sequencing is an
efficient way to discover
variants, the interpretation of the resulting data is
complicated by the incomplete
ascertainment of variants in any single sample, and thus
reliable information about some
features, such as the time depth of the phylogeny, is difficult
to extract. Sequencing
technologies have now developed further and allow
moderately-sized samples to be
sequenced at high coverage from either the complete genome or
targeted regions
(Drmanac et al. 2010, Hu et al. 2012), which should permit more
complete ascertainment of
variants and simplify interpretation.
Here, we have analysed a dataset of 36 Y chromosome sequences in
order to explore how
effectively complete sequence data from the Y can be used to
construct and calibrate a
phylogeny, and the insights that may result from such an
analysis.
Results
High-coverage sequences from 36 males were used, 35 released by
Complete Genomics and
an additional sequence from a haplogroup A individual (the most
basal haplogroup branch)
generated for this study. In all, there were nine males from
Africa, 15 from Europe (including
a three-generation family containing eight males), three from
South Asia, two from East Asia
and seven from the Americas. SNP genotyping data available from
the HapMap3 Project
(Altshuler et al. 2010) revealed that haplogroups A, D, E, G, I,
N, Q and R were represented,
with many subdivisions of some of these, particularly E and R.
Thus, despite the small
number of individuals, there was good geographical
representation of global populations
and of the haplogroup tree. After QC and validation, we
extracted 6,662 high-confidence
variants (i.e. sites that differ from the Y chromosome reference
sequence), including both
SNPs and indels, from 8.97 Mb of unique Y sequence (Table S1,
S2). The variants were
distributed evenly along the target regions of the chromosome
(Figure 1), and increase the
number of high-quality variants with phylogenetic information by
almost 10-fold.
We could assign ancestral states to 6,271 of the variants, and
then constructed a rooted
parsimony-based phylogenetic tree containing all 6,662 variants
(tree 1, Supplemental
Figure 1), and additional trees containing subsets of these,
consisting of SNPs only (tree 2,
Supplemental Figure 2), or SNPs with ancestral state
information, no recurrent mutations
and high coverage in all individuals (tree 3, Figure 2). The
branch leading to the haplogroup
A individual was shorter than any other, most likely reflecting
the low-coverage sequencing
of parts of this chromosome and consequent under-calling of
variants; this effect was
largely eliminated in tree 3, where only high-coverage regions
were used (Figure 2). The
topology of all three trees was very similar, differing only by
the lack of resolution of the
haplogroup G and I branch order in the most highly-filtered
tree, because the relevant SNPs
lay outside the region used. The structure recapitulated the
known phylogeny of the Y
chromosome (Karafet et al. 2008) and identified new splits
within the E1b1a8a and I1*
branches. In addition, the new markers allowed all unrelated
chromosomes to be
distinguished.
5,865 (88.0%) of the variants identified were SNPs, 56 (0.8%)
MNPs, and 741 (11.1%) indels.
Based on the phylogeny, we found that 172 (2.9%) SNPs, 5 (9%)
MNPs and 85 (11.5%) indels
showed evidence of recurrent mutation, either reverting to the
ancestral state, recurring in
more than one location on the tree, or showing more than two
alleles. Indels, most of which
lay in mononucleotide runs or short tandem repeats, were highly
enriched for this behavior
(P
size and migration rate (GENETREE-2 and BEAST; details in
Methods) (Bahlo and Griffiths
2000, Drummond et al. 2012). The other two estimates were based
on the phylogeny (Rho-1
and Rho-2) (Forster et al. 1996, Saillard et al. 2000). These
times were all broadly consistent
and are relevant to debates about human expansions and
migrations, so are discussed
further below.
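As a simple arithmetic check, the variant-class percentages and recurrence fractions quoted earlier in this Results section can be recomputed from the raw counts. The Python sketch below is illustrative only and was not part of the original analysis; the counts are taken from the text, while the variable and function names are hypothetical.

```python
# Counts taken from the text: variants per class, and how many of each
# class showed evidence of recurrent mutation on the phylogeny.
counts = {"SNP": 5865, "MNP": 56, "indel": 741}
recurrent = {"SNP": 172, "MNP": 5, "indel": 85}

total = sum(counts.values())  # all high-confidence variants


def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Share of each class among all variants, and share of each class
# that is recurrent (hypothetical helper names).
class_share = {k: pct(v, total) for k, v in counts.items()}
recurrent_share = {k: pct(recurrent[k], counts[k]) for k in counts}

for k in counts:
    print(f"{k}: {class_share[k]}% of variants, {recurrent_share[k]}% recurrent")
```

The recomputed values agree with the rounded figures in the text and make the several-fold higher recurrence fraction of indels relative to SNPs easy to see.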
Discussion
Many aspects of the approach adopted in this study, including
complete sequencing of Y
chromosomes, using data from public sequence resources,
generating additional data from
lineages of particular interest, and filtering of the
Y-chromosomal regions and variants, are
likely to become standard in future analyses of this kind. The
current study addresses several relevant methodological topics, and also
shows how a calibrated Y-chromosomal phylogeny can inform our
understanding of recent human evolution. We consider
issues arising in these areas in this Discussion.
Next-generation sequence data are error-prone, with errors
contributed by both base
calling and mapping. The former can largely be overcome by high
coverage, and so are
minimized in this study; in order to exclude the latter, we
stringently excluded repeated
regions, which are most prone to mapping errors. This resulted
in a high-quality dataset, but
at the cost of not detecting and using all variants. Future
studies should investigate the
possibility of extracting reliable variant calls from additional
regions of the chromosome,
and this will be facilitated by the longer reads expected as
sequencing technologies
improve. An additional source of biological error is the
mutations that occur somatically in
the donor or during cell culture, relevant here since all
sequences were derived from
lymphoblastoid cell lines. We can estimate this number from the
sequences of the three-
generation family. The grandfather and father carry 13 and 11
specific variants respectively,
two of which are absent from the grandfather, but present in the
father and transmitted to
all his sons, and thus likely to represent in vivo de novo
mutations, while the remaining 22
are likely to be somatic (Supplementary Figure 1). This
observation of two germline
mutations in two transmissions of 8.97 Mb is consistent with the
expectation of ~0.6
mutations in two transmissions (0.3 variants observed per
meiosis in 10.5 Mb (Xue et al.
2009)). In addition, the sons carry from zero to 17
individual-specific variants each. If we
assume that all the non-transmitted variants are somatic, we can
estimate an upper limit to
the number of somatic variants at 8/individual (3
SNPs/individual). These somatic variants
are thus highly enriched for indels compared with total variants
(42/67 compared with
699/6,595, P
positions. When multiple individuals within a haplogroup were
sequenced, the shared
variants already provide useful markers. For example, the
geographical origin (south or west
Asia) and time depth of haplogroup R1a are disputed and few
useful markers within this
haplogroup have previously been available (Underhill et al.
2010); typing the 173 shared
variants identified here in additional geographically diverse
R1a samples would be of great
interest.
Although this study was not aimed at investigating Y gene
function, it was striking that a
variant predicted to be highly damaging to protein structure was
discovered in the USP9Y
gene. USP9Y loss of function has been associated with variable
phenotypes ranging from
azoospermia to oligoasthenoteratozoospermia and normal sperm
production (Tyler-Smith
and Krausz 2009). The transmission of this variant through three
generations demonstrates
that it is compatible with male fertility, providing further
evidence for the phenotypic
diversity linked to variation in this gene.
A novel aspect of using extensive sequence data is that it is
possible to investigate the time
depth of the entire Y-chromosomal phylogeny represented by the
samples, or of any subset
of lineages. Times are determined in mutational steps, and in
order to convert these into a
more useful unit such as years, a calibration metric has to be
applied. We used a calibration
based on direct measurement of the Y-chromosomal SNP mutation
rate both in years and
generations from a deep-rooting family (Xue et al. 2009) since
this requires the minimum
number of assumptions and has already been adopted in the
literature (Cruciani et al.
2011). The measurement does, however, have wide confidence
intervals since only a small
number of mutations were observed, and these confidence
intervals were not included in
our consideration of times, which used the point estimate. Two
lines of reasoning suggest
that we may have more confidence in the point estimate than
simple consideration of the
number of mutations might suggest. First, it is consistent with
other direct measurements of
the human mutation rate, allowing for the expected higher
mutation rate on the Y because
of its permanent location in the mutation-prone male germ line
(e.g. Roach et al. 2010).
Second, it is consistent with the rate inferred from
human-chimpanzee comparisons of the
same sections of the Y chromosome: $1.3 \times 10^{-9}$
mutations/nucleotide/year for a 6.5 million
year Y-chromosomal divergence time (Scally et al. 2012, Xue et
al. 2009). Nevertheless,
additional measurements of mutation rate are urgently needed to
improve calibration.
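The conversion from mutational steps to years described above can be illustrated with a short Python sketch. It assumes the per-generation rate and the 30 years/generation figure given in the Methods; the function names are hypothetical, and treating one lineage over the full 8.97 Mb region is a simplification chosen for illustration.

```python
YEARS_PER_GENERATION = 30      # generation time stated in the Methods
RATE_PER_GENERATION = 3.0e-8   # mutations/nucleotide/generation (Xue et al. 2009)


def per_year(rate_per_generation, years_per_generation=YEARS_PER_GENERATION):
    """Convert a per-generation mutation rate into a per-year rate."""
    return rate_per_generation / years_per_generation


def steps_to_years(mutational_steps, region_bp, rate_per_year):
    """Years needed for one lineage to accumulate `mutational_steps`
    mutations across a region of `region_bp` nucleotides."""
    return mutational_steps / (rate_per_year * region_bp)


direct_rate = per_year(RATE_PER_GENERATION)  # about 1.0e-9 per nucleotide per year
chimp_rate = 1.3e-9                          # human-chimpanzee rate quoted in the text

# One mutational step over the 8.97 Mb unique region, under each calibration.
years_direct = steps_to_years(1, 8.97e6, direct_rate)
years_chimp = steps_to_years(1, 8.97e6, chimp_rate)
print(round(years_direct), round(years_chimp))
```

Under the direct calibration a single mutational step over 8.97 Mb corresponds to roughly 111 years, and the human-chimpanzee rate would shorten every estimate by a factor of 1.3.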
We based our time estimates solely on SNPs, because indel
mutation rates are poorly
known and likely to be complex. We used either the complete set
of SNPs (with the rho
estimator), or a reduced set where coverage was high in all
samples, ancestral state was
known and recurrent mutations were absent, and related
individuals were excluded,
as required by GENETREE and also used with BEAST and rho. As a
result, we obtained five
estimates for each time, which were similar (Table 1). Point
estimates of the TMRCA for the
complete set of chromosomes examined were 101-115 KYA. This is
consistent with the
published estimate of 105 KYA for haplogroup A3 (Cruciani et al.
2011). We identified three
additional nodes in the tree as being of particular interest.
The first of these was DR,
corresponding to the expansion of Y chromosomes following the
out-of-Africa migration
(Jobling and Tyler-Smith 2003, Underhill et al. 2000). The time
of 57-74 KYA is
consistent with abundant non-genetic evidence, for example of
the first colonization of
Australia around 50 KYA (Roberts et al. 1990). The agreement of
these two times with
previous work strengthens our confidence in the remaining
estimates. The second internal
node examined was the multifurcation of haplogroups I, G and NR
(i.e. K), the
representation in this reduced set of individuals of a larger
multifurcation also involving
several F sublineages, H and J (Karafet et al. 2008). The
minimal resolution of the lineages,
even in a phylogeny based on sequencing of 8.97 Mb, implies a
rapid expansion, which we
date here at 41-52 KYA. This could correspond to an expansion
into the interior of Europe
and Asia following adaptation to these novel environments soon
after the initial rapid
coastal migration. The third internal node was that of R1b, a
well-documented expansion in
Europe, but with a much-debated time depth. Here, we estimate a
time of 4.3-13 KYA, the
most uncertain of the dates. Despite the range of estimates, all
these dates favor a Neolithic
(Balaresque et al. 2010) more than a Paleolithic (Semino et al.
2000) or Mesolithic expansion
of this lineage. The three haplogroup I individuals,
representing the next most frequent
haplogroup in Europe, show signs of an expansion at
approximately the same time (Figure
2), although the number of individuals is too low to present any
clear conclusion about
whether or not this lineage was influenced by the same
demographic events as R1b.
Nevertheless, the rapid expansion of R1b (and possibly I1) in
Europe contrasts with the less
starlike expansion of E1b1a in Africa, which has been associated
with the spread of farming,
ironworking and Bantu languages in Africa over the last 5,000
years (Berniell-Lee et al.
2009). Both R1b and E1b1a samples are from a mixture of
indigenous donors (from Europe
and Africa, respectively) and admixed American donors, so
sampling strategy does not
provide an obvious explanation for the difference. Instead, the
different phylogenetic
structure, with far more resolution of the individual E1b1a
branches, may reflect expansion
starting from a larger and more diverse population, and thus
retaining more ancestral
diversity.
In conclusion, our study identifies the methodological steps
necessary to obtain reliable
biological insights from current next-generation sequence data,
and the novel information
that is available. It reveals, for example, how rapid some
expansions of the Y phylogeny
were, so that even extensive sequence data do not resolve all
multifurcations. During the
expansion of the six unrelated R1b lineages examined, only one
mutation has arisen, despite
a mutation rate of 0.3/generation/10 Mb. The study also poses
challenges for the field,
among which are (1) integrating sequence data generated by
different technologies with
different error modes on different samples; (2) the question of
whether an inferred
ancestral reference sequence would be more useful than the
current hybrid of modern
sequences; (3) the need to develop a compact and useful
nomenclature system that can
accommodate extensive sequence information, for example based on
major haplogroups
and significant sub-clusters that provide a memorable but not
complete indication of the
location in the phylogeny; and (4) the difficulty of
understanding the archaeological events
associated with the male expansions detected, and the standards
of evidence that should be
required in order to accept links.
Methods
Definition of Y-specific unique regions
We identified unique regions within the male-specific part of
the Y chromosome reference
sequence (GRCh37
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/)
where we expected read mapping and variant detection to escape
complications introduced
by repeated sequences. This was achieved by excluding the
pseudoautosomal,
heterochromatic, X-transposed and ampliconic segments (Skaletsky
et al. 2003), leaving
nine separate regions (Figure 1, Table S1). Together, these
spanned 8.97 Mb. All analyses in
this study were restricted to these regions, or subsets of
them.
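The exclusion step described above amounts to subtracting a set of intervals from the male-specific span and summing what remains. A minimal Python sketch of that operation follows; the coordinates are toy values, NOT the real GRCh37 segment boundaries, and the function names are hypothetical.

```python
def subtract_intervals(span, excluded):
    """Return the parts of `span` (start, end), half-open, that are not
    covered by any interval in `excluded`. Overlapping exclusions are fine."""
    start, end = span
    kept = []
    cursor = start
    for ex_start, ex_end in sorted(excluded):
        if ex_end <= cursor:
            continue            # excluded block lies entirely before the cursor
        if ex_start >= end:
            break               # excluded block lies beyond the span
        if ex_start > cursor:
            kept.append((cursor, ex_start))
        cursor = max(cursor, ex_end)
    if cursor < end:
        kept.append((cursor, end))
    return kept


def total_length(intervals):
    """Summed length of half-open intervals."""
    return sum(e - s for s, e in intervals)


# Toy example with hypothetical coordinates (two of the exclusions overlap).
toy_chromosome = (0, 1000)
toy_excluded = [(0, 100), (300, 450), (400, 500), (900, 1000)]
toy_unique = subtract_intervals(toy_chromosome, toy_excluded)
print(toy_unique, total_length(toy_unique))
```

Applied to the real segment annotation, the same logic would be expected to return the nine regions whose summed length is 8.97 Mb.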
Y-chromosomal sequence data sources
High-coverage Y-chromosomal sequence data from 35 males were
downloaded from the
Complete Genomics database
(ftp://ftp2.completegenomics.com/vcf_files/Build37_2.0.0/).
Coverage of the Y unique regions ranged from 19x to 35x (Table
S4). These males did not
include any individual belonging to the most basal haplogroup,
haplogroup A. We therefore
generated additional sequence data from a haplogroup A
individual, NA21313. This
chromosome was derived for the markers M32, M190, M220, M144,
M202, M305 and
M219, and ancestral for P97, placing it in haplogroup A3b2*.
Sequencing included both low
coverage (mean 5x) whole-genome data, and high coverage data
from long-PCR products
between chrY:13,798,579-19,720,738 (GRCh37).
NA21313 Low coverage sequencing
Low coverage reads were mapped and variants called in
combination with 1000 Genomes
samples using samtools and bcftools (Li et al. 2009), then
filtered (StrandBias 1e-5;
EndDistBias 1e-7; MaxDP 10000; MinDP 2; Qual 3; SnpCluster 5,10;
MinAltBases 2; MinMQ
10; SnpGap 3).
NA21313 High coverage sequencing
High coverage sequence information was generated by amplifying
5-6 kb overlapping
fragments by long-PCR. Approximately equimolar amounts of PCR
fragments were pooled
and used for library preparation and paired end sequencing (54
bp) on an Illumina GAII
Genome Analyzer to obtain 475x median coverage (ENA Sample
Accession Number:
ERS006694; www.ebi.ac.uk/ena/). MAQ (Li et al. 2008) was used
for mapping and SNP
calling in the high coverage data. Read depth coverage (>1/2
mean depth) and mapping
(consensus >30, Mapping > 63) filters reduced the raw
variant calls from a total of 5,684 to
2,233. Subsequent filters included removal of heterozygous calls
and sites that were within
4 bp of each other to create a high confidence filtered list of
615 SNPs, none of which were
discordant with the established Y phylogeny.
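The later filtering steps (depth above half the mean, removal of heterozygous calls, and removal of sites close to one another) can be sketched in Python as below. This is not the MAQ-based pipeline itself: the record keys are hypothetical, "within 4 bp" is interpreted as a distance of at most 4, and applying the proximity filter after the other two is an assumption.

```python
def drop_clustered_sites(positions, min_gap=4):
    """Remove every site that has a neighbour within `min_gap` bp."""
    ordered = sorted(positions)
    keep = []
    for i, pos in enumerate(ordered):
        near_prev = i > 0 and pos - ordered[i - 1] <= min_gap
        near_next = i + 1 < len(ordered) and ordered[i + 1] - pos <= min_gap
        if not (near_prev or near_next):
            keep.append(pos)
    return keep


def passes_depth(depth, mean_depth):
    """Depth filter from the text: more than half the mean depth."""
    return depth > mean_depth / 2


def filter_calls(calls, mean_depth, min_gap=4):
    """`calls` is a list of dicts with hypothetical keys 'pos', 'depth', 'het'."""
    kept = [c for c in calls
            if passes_depth(c["depth"], mean_depth) and not c["het"]]
    allowed = set(drop_clustered_sites([c["pos"] for c in kept], min_gap))
    return [c for c in kept if c["pos"] in allowed]


# Toy records (hypothetical values).
toy_calls = [
    {"pos": 10, "depth": 500, "het": False},
    {"pos": 12, "depth": 100, "het": False},   # fails the depth filter
    {"pos": 50, "depth": 480, "het": True},    # heterozygous, removed
    {"pos": 90, "depth": 450, "het": False},
]
toy_kept = [c["pos"] for c in filter_calls(toy_calls, mean_depth=475)]
print(toy_kept)
```

Removing both members of a close pair, rather than keeping one, is the conservative choice, because clustered calls often share a single underlying mapping artefact.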
Ancestral states
We extracted the ancestral allele for each position that was
variable in humans (assumed to
be the allele present in chimpanzee) using the Ensembl-Compara
pipeline (Vilella et al.
2009) release 66 and obtained calls for 6,271 of the total
number of 6,662 variable sites
(Figure 1, Table S2).
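Assigning ancestral and derived alleles from the chimpanzee base can be sketched as follows. The sketch is a simplification and not the Ensembl-Compara pipeline: the function name is hypothetical, and leaving a site unresolved when the chimpanzee base matches neither human allele is an assumption made here for illustration.

```python
def polarize(ref, alt, chimp):
    """Return (ancestral, derived) for a biallelic site, or None when the
    chimpanzee base is missing or matches neither human allele."""
    if chimp is None or chimp not in (ref, alt):
        return None
    derived = alt if chimp == ref else ref
    return chimp, derived


# Toy sites as (reference allele, alternative allele, chimpanzee allele).
toy_sites = [("A", "G", "A"), ("A", "G", "G"), ("C", "T", None), ("C", "T", "G")]
toy_states = [polarize(*site) for site in toy_sites]
print(toy_states)

# Fraction of variable sites with an ancestral call, from the counts in the text.
resolved_pct = round(100 * 6271 / 6662, 1)
print(resolved_pct)
```

The second toy site shows the case where the reference sequence itself carries the derived allele, which is why variants are defined relative to the reference but polarized relative to the outgroup.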
Data QC and filtering
We extracted the variants from the Y-specific unique regions
present in the 35 male
individuals from the database of Complete Genomics, including
both SNPs and indels. The
number of differences from the reference sequence ranged from
210 to 1,257. Examination
of the distribution of variants along the chromosome revealed an
excess in one sample,
GS19649, in the region 28,670,244-28,735,914 compared with the
other individuals; this
was associated with low coverage of this region, so we
hypothesised that the excess arose
from a combination of deletion of the region in this individual
and mismapping of reads
originating from other parts of the chromosome. We excluded from
further consideration all
variable sites from all individuals that fell into this
region.
We identified 631 variable sites mapping within the 8.97 Mb region that had been described
region that had been described
in the literature and placed on the 2008 Y Chromosome Consortium
phylogenetic tree
(Karafet et al. 2008) and its subsequent partial updates
(Chiaroni et al. 2009, Cruciani et al.
2011, Cruciani et al. 2008, Cruciani et al. 2010, Debnath et al.
2011, Jota et al. 2011, Mendez
et al. 2011, Sims et al. 2009, Trombetta et al. 2011, Trombetta
et al. 2010). These were used
for assigning a standard haplogroup to each sample and in
validation.
From all 36 individuals, we identified a total of 6,662 variable
sites. Missing calls ranged
from 31 to 165 per individual; call rates were on average 98.9%.
We imputed the missing
sites based on the tree structure and the assumption of
parsimony; the only errors
introduced by this procedure will be missed reversions or
recurrent mutations, both very
rare.
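One way to impute a missing call from the tree under parsimony is sketched below in Python. The exact procedure used is not described in the text, so this rule is an assumption: a sample with a missing call is scored as derived if it falls inside the smallest clade containing all observed derived carriers, and as ancestral otherwise. The tree representation and names are hypothetical.

```python
def leaves(tree):
    """All sample names under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def smallest_clade(tree, targets):
    """Leaf set of the smallest clade of `tree` that contains `targets`."""
    if not isinstance(tree, str):
        for child in tree:
            if targets <= leaves(child):
                return smallest_clade(child, targets)
    return leaves(tree)


def impute(tree, calls):
    """`calls` maps sample -> 1 (derived), 0 (ancestral) or None (missing).
    Missing calls inside the derived clade become 1, all others become 0."""
    derived = {s for s, c in calls.items() if c == 1}
    clade = smallest_clade(tree, derived) if derived else set()
    return {s: (c if c is not None else int(s in clade))
            for s, c in calls.items()}


# Toy tree of four samples written as nested tuples (hypothetical).
toy_tree = ((("a", "b"), "c"), "d")
print(impute(toy_tree, {"a": 1, "b": None, "c": 1, "d": 0}))
print(impute(toy_tree, {"a": 1, "b": 1, "c": None, "d": 0}))
```

As the text notes, a rule of this kind can only go wrong at sites with reversions or recurrent mutations, because it assumes a single mutation event per site.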
Validation
SNP validation
We performed in silico validation of the SNP calls using three
approaches. First, calls from
101 SNPs based on the intersection of the Affymetrix Human SNP
Array 6.0 and the
Illumina Human 1M beadchip were available for 11 individuals
from the HapMap3
genotyping of the same samples (Altshuler et al. 2010). We
observed 100% concordance at
these 1,122 positions (Table S5). Second, high-coverage Illumina
GA2 sequence data were
available for two individuals, NA19239 and NA12891, the trio
fathers in the 1000 Genomes
Pilot Project (The 1000 Genomes Project Consortium 2010). There
was 85.1% (553/650) and
96.5% (694/719) concordance between the two datasets. In the
light of the other validation
results, we ascribe the relatively low concordance in this test
to the early sequencing
technologies used in this part of the 1000 Genomes Pilot
project, much of it single-ended
and particularly susceptible to mapping problems. Third, using
the literature SNP (see Data
QC and filtering above) with its derived allele furthest from
the root in each individual to
assign a haplogroup, we could predict the allelic states
expected for the remaining 630
literature sites. Here, after correcting some typographical
errors in site positions (Table S6)
there was 99.8% concordance between prediction and observation
(22,680 genotypes);
discrepancies are listed at the top of Table S6 and several may
even represent mistakes in
the locations of published markers on the phylogeny or recurrent
mutations not previously
recorded, rather than genotyping errors in the present dataset.
Validation rates in the two
reliable tests are thus very high.
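The concordance figures reported above are simple ratios of matching to compared genotypes. The Python sketch below shows one way to compute such a ratio and rechecks the arithmetic from the counts in the text; the function and the toy genotype dictionaries are hypothetical and are not the validation scripts used in the study.

```python
def concordance(calls_a, calls_b):
    """Compare two call sets keyed by site; return (matches, compared, rate).
    Only sites present in both sets are compared."""
    shared = [site for site in calls_a if site in calls_b]
    matches = sum(calls_a[s] == calls_b[s] for s in shared)
    rate = matches / len(shared) if shared else float("nan")
    return matches, len(shared), rate


# Toy call sets (hypothetical positions and alleles).
toy_a = {1: "A", 2: "G", 3: "T"}
toy_b = {1: "A", 2: "C", 4: "T"}
toy_result = concordance(toy_a, toy_b)
print(toy_result)

# Arithmetic check on figures quoted in the text.
reported = [(553, 650), (694, 719)]          # matches / compared, 1000 Genomes test
percentages = [round(100 * m / n, 1) for m, n in reported]
literature_genotypes = 630 * 36              # literature sites x individuals
print(percentages, literature_genotypes)
```

Restricting the comparison to sites called in both sets keeps missing data from being counted as discordance.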
Indel validation
We used two approaches to assess the quality of the indel calls.
First, using a logic similar to
the third SNP validation method above, we identified 40 reported
indel sites within the
unique regions of the Y chromosome (Karafet et al. 2008).
Genotype calls were made at 28
of these in our dataset, and we observed 99.7% concordance
between the observed calls
and those expected from the phylogeny (974/977). The false
negative call rate was
therefore low. Second, most indel mutations occur in simple
sequences such as
mononucleotide runs or short tandem repeats. We therefore
examined the sequences
flanking the calls, and observed a preponderance of these
motifs: 471 poly A or poly T, 17
poly C or poly G, 87 short tandem repeats, 166 other (Table S7).
This shows that, although
these variants have not been experimentally tested, at least
575/741 of the indels (78%) are
highly plausible candidates. In all, we can have reasonable
confidence in this indel callset.
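The flanking-sequence check can be illustrated with a small classifier. In the Python sketch below the minimum run length and repeat count are assumed thresholds, since the text does not state the criteria used, and the function name and category labels are hypothetical.

```python
import re


def classify_flank(seq, min_run=5, min_repeats=3):
    """Label a flanking sequence as 'polyA/T', 'polyC/G', 'STR' or 'other'.
    `min_run` and `min_repeats` are assumed thresholds."""
    seq = seq.upper()
    if re.search(r"A{%d,}|T{%d,}" % (min_run, min_run), seq):
        return "polyA/T"
    if re.search(r"C{%d,}|G{%d,}" % (min_run, min_run), seq):
        return "polyC/G"
    # A 2-6 bp unit repeated in tandem at least `min_repeats` times.
    if re.search(r"([ACGT]{2,6})\1{%d,}" % (min_repeats - 1), seq):
        return "STR"
    return "other"


toy_flanks = ["GCTAAAAAAAGT", "ATGGGGGGCA", "TTCACACACAGG", "ACGTTGCA"]
toy_labels = [classify_flank(s) for s in toy_flanks]
print(toy_labels)

# Arithmetic check on the category counts given in the text.
category_counts = {"polyA/T": 471, "polyC/G": 17, "STR": 87, "other": 166}
plausible = sum(v for k, v in category_counts.items() if k != "other")
plausible_pct = round(100 * plausible / sum(category_counts.values()))
print(plausible, plausible_pct)
```

Homopolymer runs are tested before tandem repeats so that a long run of a single base is not reported as a repeat of a two-base unit.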
Constructing the Y-chromosomal haplogroup tree
FASTA formatted sequence files were used to generate haplogroup
trees. Sequence alignments
were built using CLUSTALW2 (http://www.clustal.org/), and a
maximum parsimony (MP)
phylogenetic tree was created using the PHYLIP software
(http://evolution.gs.washington.edu/phylip.html). We generated
three trees. First, we used
all the 6,662 sites in 36 individuals to construct a haplogroup
tree, which was rooted using
the chimpanzee Y sequence. The resulting tree (Tree 1) is shown
in Supplemental Figure 1.
Second, we excluded all indels and MNPs to generate a second
haplogroup tree based only
on SNPs (Tree 2, Supplemental Figure 2). Third, we removed all
sites that were recurrent in
this set of 36 males or lacked ancestral information, and
restricted the region considered to
the 3.2 Mb with high coverage in the haplogroup A individual,
NA21313 (Tree 3, Figure 2).
Stringent sites of this kind are required by GENETREE (Bahlo and
Griffiths 2000), one of the
approaches used to estimate times on the tree.
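To make the notion of a recurrent site concrete, the minimum number of changes a site requires on a fixed tree can be counted with the Fitch small-parsimony algorithm. The Python sketch below is an illustration only and is not the PHYLIP procedure used to build the trees; it assumes a strictly bifurcating tree written as nested tuples, and all names are hypothetical.

```python
def fitch(tree, states):
    """Fitch small parsimony for one site on a bifurcating tree.

    tree   : nested 2-tuples of sample names (leaves are strings)
    states : dict mapping sample name -> observed allele
    Returns (candidate state set at this node, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    changes = left_changes + right_changes
    common = left_set & right_set
    if common:
        return common, changes
    return left_set | right_set, changes + 1


toy_tree = ((("a", "b"), "c"), "d")

# A site compatible with the tree needs a single change.
unique_site = {"a": "G", "b": "G", "c": "A", "d": "A"}
# A site needing two changes would be flagged as recurrent.
recurrent_site = {"a": "G", "b": "A", "c": "G", "d": "A"}

print(fitch(toy_tree, unique_site)[1], fitch(toy_tree, recurrent_site)[1])
```

Under this view, the stringent site set used for Tree 3 consists of sites whose minimum change count on the tree is exactly one.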
Estimating the TMRCA and ages of nodes of the haplogroup tree
To estimate the TMRCA of the complete tree and the ages of nodes
of particular interest,
DR, FR and R1b, we applied two broad approaches that provided
five individual estimates
for each timepoint. The first was to use coalescent modelling
implemented either in
GENETREE (Bahlo and Griffiths 2000) or BEAST (Drummond et al.
2012) using the sites in
Tree 3. To avoid the influence of related individuals, we
removed seven out of the eight
individuals who were part of the same three-generation pedigree.
This resulted in 2,004
SNPs from 29 individuals.
GENETREE
We were unable to successfully run GENETREE using all 2,004
sites. We therefore divided
the data into non-overlapping sets of 90-99 SNPs according to
their position on the
chromosome, resulting in 21 sets and thus 21 GENETREE runs.
GENETREE version 8.3 was
run in two ways. The first used a demographic model of a
constant-sized but subdivided
population with fixed migration rates. Run details: an initial
theta value was calculated using
the formula $\theta = N_e \times \mu \times$ (length of region spanning 90-99 SNPs), with $N_e = 2{,}000$ and $\mu = 3 \times 10^{-8}$
mutations/nucleotide/generation; 1,000,000 coalescent
simulations and 101 surface points
were used to find the optimal value. The migration rate between
Africa and Europe was set
at $3.2 \times 10^{-5}$/generation and between Africa and Asia at $0.8 \times 10^{-5}$/generation (Schaffner et
al. 2005) to generate the migration file. From this calculation,
a best estimate of theta was
obtained and used to calculate the TMRCA and other times
(GENETREE-1, Table 1, Figure
S3.1-3.21). The second used a demographic model of exponential
growth and a subdivided
population, with growth and migration rates derived from the
data. Run details: an initial
theta value was calculated as before and used to generate the
migration file; initial
exponential growth rates for Europe (0.20), Asia (0.41) and
Africa (0.74) were used (Shi et al.
2010). 1,000,000 coalescent simulations and 101 surface points
were used to find the
optimal value for theta, the migration rate and the growth rate.
From these, a best estimate
of theta, migration rate and exponential growth rate were
obtained and used to calculate
the TMRCA and other times (GENETREE-2, Table 1).
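The partitioning of the 2,004 SNPs and the initial theta calculation can be sketched in Python as follows. The even split is an assumption, since the exact set boundaries used are not given in the text, and the region length in the theta example is a hypothetical value; theta is computed as Ne x mu x region length, the formula stated in the run details.

```python
def split_evenly(n_items, n_sets):
    """Sizes of `n_sets` consecutive chunks whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_sets)
    return [base + 1 if i < extra else base for i in range(n_sets)]


def initial_theta(ne, mu_per_generation, region_length_bp):
    """Initial theta = Ne x mu x length of the region spanned by one SNP set."""
    return ne * mu_per_generation * region_length_bp


# 2,004 position-ordered SNPs divided into 21 non-overlapping sets.
sizes = split_evenly(2004, 21)
print(min(sizes), max(sizes), sum(sizes))

# Example theta for a hypothetical set spanning 150 kb.
toy_theta = initial_theta(ne=2000, mu_per_generation=3e-8, region_length_bp=150_000)
print(toy_theta)
```

An even split gives sets of 95 or 96 SNPs, which falls inside the 90-99 range reported; because the sets span regions of different physical length, each one has its own initial theta.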
BEAST
We also used BEAST v1.7.2 (http://beast.bio.ed.ac.uk) to
estimate the TMRCA and
divergence time for each branch of interest. The NEXUS formatted
file was used to generate
the BEAST XML input file for the BEAST v1.7.2 program. To obtain
the TMRCA and the time
for the branches, four taxon subsets were set up (AR, DR, FR and
R1b). We chose GTR as the
substitution model, Gamma as the site heterogeneity model,
lognormal relaxed clock as the
clock model, exponential growth as the tree prior, $\mu = 10^{-9}$
mutations/nucleotide/year, and
100,000,000 as the MCMC chain length. The log output files were
obtained by running the
BEAST software. We carried out two independent BEAST runs from
the same XML input file
and combined the two log output files using LogCombiner, as
recommended to increase the
ESS (effective sample size) of the analysis and also allow us to
determine whether or not the
two independent runs were converging on the same distribution
(Table S8). Tracer v1.5
(http://beast.bio.ed.ac.uk/Tracer) was used to analyze the
output file and times (BEAST in
Table 1), and TreeAnnotator v1.7.2 to obtain an estimate of the
phylogenetic tree. Finally,
we viewed the tree in FigTree
(http://beast.bio.ed.ac.uk/FigTree) (Supplemental Figure 4).
Rho
In the second broad approach, we used the phylogeny-based rho
statistic (Forster et al.
1996, Saillard et al. 2000), applied both to the same set of
2,004 SNPs (Tree 3, Rho-1 in
Table 1), and also to the complete set of 5,865 SNPs, including
recurrent SNPs (Tree 2, Rho-2
in Table 1).
In all five estimates (two GENETREE, one BEAST, two rho),
calibration was achieved using
the directly-measured SNP mutation rate of $1.0 \times 10^{-9}$
mutations/nucleotide/year or $3.0 \times 10^{-8}$
mutations/nucleotide/generation (Xue et al. 2009). Where
necessary, we converted
estimates in generations to estimates in years using 30
years/generation.
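The rho estimator and its calibration can be illustrated with a short Python sketch: rho is the mean number of mutations separating a node from its descendant tips, and dividing by the expected number of mutations per year over the region converts it into an age. The mutation counts below are hypothetical toy values, the function names are ours for illustration, and the 3.2 Mb region length is the one stated for Tree 3.

```python
def rho(steps_to_tips):
    """Average number of mutations separating a node from its descendant tips."""
    return sum(steps_to_tips) / len(steps_to_tips)


def rho_age_years(steps_to_tips, region_bp, rate_per_year=1.0e-9):
    """Node age in years: rho divided by the expected mutations per year
    across a region of `region_bp` nucleotides."""
    return rho(steps_to_tips) / (rate_per_year * region_bp)


# Hypothetical counts of mutations from one node to each of three tips.
toy_steps = [10, 12, 14]
toy_rho = rho(toy_steps)
toy_age = rho_age_years(toy_steps, region_bp=3.2e6)
print(toy_rho, round(toy_age))
```

With these toy counts rho is 12, which corresponds to about 3,750 years over 3.2 Mb at the rate used here; an estimate obtained in generations would be multiplied by 30 to give years.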
Annotation
Each variable site was annotated (where relevant) with its type
(SNP, MNP or indel),
location in hg36 and hg37, the reference, ancestral and derived
alleles, dbSNP ID, name in
the existing phylogenetic tree, haplogroup in which it is
variable, presence as part of a CpG
dinucleotide, location in a gene and consequences for protein
structure (Table S2). The
predicted consequences of missense variants for protein function
were taken from the
modified Condel scores (Gonzalez-Perez and Lopez-Bigas 2011) in
Ensembl.
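One of the annotations above, membership of a CpG dinucleotide, can be derived from the reference sequence alone. A minimal Python sketch, assuming a 0-based position into an in-memory reference string (the function name and interface are hypothetical and do not represent the authors' pipeline):

```python
def in_cpg_dinucleotide(reference, position):
    """Return True if the base at 0-based `position` is the C or the G of a CpG."""
    seq = reference.upper()
    base = seq[position]
    if base == "C":
        # A C is in a CpG when the next base is G.
        return position + 1 < len(seq) and seq[position + 1] == "G"
    if base == "G":
        # A G is in a CpG when the previous base is C.
        return position > 0 and seq[position - 1] == "C"
    return False
```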
Downloaded from genome.cshlp.org (http://genome.cshlp.org/) on November 1, 2012. Published by Cold Spring Harbor Laboratory Press (http://www.cshlpress.com).
Data access
NA21313 low coverage: www.ebi.ac.uk/ena/ Accession Number
ERS037274
NA21313 high coverage: www.ebi.ac.uk/ena/ Accession Number
ERS006694
Acknowledgements
We thank Richard Durbin for comments, Christophe Dessimoz and
Kevin Gori for advice
about BEAST, Bob Griffiths for helpful suggestions about running
GENETREE, and the Sanger
library and sequencing teams for generating the haplogroup A
data. This work was
supported by grant number 098051 from The Wellcome Trust and a
China Scholarship
Council (CSC) fellowship to Wei Wei.
References
Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF,
Yu F, Bonnen PE, de Bakker
PI, Deloukas P, Gabriel SB et al. 2010. Integrating common and
rare genetic variation
in diverse human populations. Nature 467: 52-58.
Bahlo M, Griffiths RC. 2000. Inference from gene trees in a
subdivided population. Theor
Popul Biol 57: 79-95.
Balaresque P, Bowden GR, Adams SM, Leung HY, King TE, Rosser ZH,
Goodwin J, Moisan JP,
Richard C, Millward A et al. 2010. A predominantly Neolithic
origin for European
paternal lineages. PLoS Biol 8: e1000285.
Behar DM, Villems R, Soodyall H, Blue-Smith J, Pereira L,
Metspalu E, Scozzari R, Makkan H,
Tzur S, Comas D et al. 2008. The dawn of human matrilineal
diversity. Am J Hum
Genet 82: 1130-1140.
Berniell-Lee G, Calafell F, Bosch E, Heyer E, Sica L,
Mouguiama-Daouda P, van der Veen L,
Hombert JM, Quintana-Murci L, Comas D. 2009. Genetic and
demographic
implications of the Bantu expansion: insights from human
paternal lineages. Mol Biol
Evol 26: 1581-1589.
Busby GB, Brisighelli F, Sanchez-Diz P, Ramos-Luis E,
Martinez-Cadenas C, Thomas MG,
Bradley DG, Gusmao L, Winney B, Bodmer W et al. 2011. The
peopling of Europe and
the cautionary tale of Y chromosome lineage R-M269. Proc Biol
Sci 279: 884-892.
Cann RL, Stoneking M, Wilson AC. 1987. Mitochondrial DNA and
human evolution. Nature
325: 31-36.
Charchar FJ, Bloomer LD, Barnes TA, Cowley MJ, Nelson CP, Wang
Y, Denniff M, Debiec R,
Christofidou P, Nankervis S et al. 2012. Inheritance of coronary
artery disease in
men: an analysis of the role of the Y chromosome. Lancet 379:
915-922.
Chiaroni J, Underhill PA, Cavalli-Sforza LL. 2009. Y chromosome
diversity, human expansion,
drift, and cultural evolution. Proc Natl Acad Sci U S A 106:
20174-20179.
Cruciani F, Trombetta B, Massaia A, Destro-Bisol G, Sellitto D,
Scozzari R. 2011. A revised
root for the human Y chromosomal phylogenetic tree: the origin
of patrilineal
diversity in Africa. Am J Hum Genet 88: 814-818.
Cruciani F, Trombetta B, Novelletto A, Scozzari R. 2008.
Recurrent mutation in SNPs within Y
chromosome E3b (E-M215) haplogroup: a rebuttal. Am J Hum Biol
20: 614-616.
Cruciani F, Trombetta B, Sellitto D, Massaia A, Destro-Bisol G,
Watson E, Beraud Colomb E,
Dugoujon JM, Moral P, Scozzari R. 2010. Human Y chromosome
haplogroup R-V88: a
paternal genetic record of early mid Holocene trans-Saharan
connections and the
spread of Chadic languages. Eur J Hum Genet 18: 800-807.
Debnath M, Palanichamy MG, Mitra B, Jin JQ, Chaudhuri TK, Zhang
YP. 2011. Y-chromosome
haplogroup diversity in the sub-Himalayan Terai and Duars
populations of East India.
J Hum Genet 56: 765-771.
Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani
BG, Carnevali P,
Nazarenko I, Nilsen GB, Yeung G et al. 2010. Human genome
sequencing using
unchained base reads on self-assembling DNA nanoarrays. Science
327: 78-81.
Drummond AJ, Suchard MA, Xie D, Rambaut A. 2012. Bayesian
Phylogenetics with BEAUti
and the BEAST 1.7. Mol Biol Evol 29: 1969-1973.
Forster P, Harding R, Torroni A, Bandelt HJ. 1996. Origin and
evolution of Native American
mtDNA variation: a reappraisal. Am J Hum Genet 59: 935-945.
Foster EA, Jobling MA, Taylor PG, Donnelly P, de Knijff P,
Mieremet R, Zerjal T, Tyler-Smith C.
1998. Jefferson fathered slave's last child. Nature 396:
27-28.
Gonzalez-Perez A, Lopez-Bigas N. 2011. Improving the assessment
of the outcome of
nonsynonymous SNVs with a consensus deleteriousness score,
Condel. Am J Hum
Genet 88: 440-449.
Gunnarsdottir ED, Li M, Bauchet M, Finstermeier K, Stoneking M.
2011. High-throughput
sequencing of complete human mtDNA genomes from the Philippines.
Genome Res
21: 1-11.
Hammer MF. 1995. A recent common ancestry for human Y
chromosomes. Nature 378: 376-378.
Hu M, Ayub Q, Guerra-Assuncao JA, Long Q, Ning Z, Huang N,
Romero IG, Mamanova L,
Akan P, Liu X et al. 2012. Exploration of signals of positive
selection derived from
genotype-based human genome scans using re-sequencing data. Hum
Genet 131:
665-674.
Jobling MA, Tyler-Smith C. 2003. The human Y chromosome: an
evolutionary marker comes
of age. Nat Rev Genet 4: 598-612.
Jota MS, Lacerda DR, Sandoval JR, Vieira PP, Santos-Lopes SS,
Bisso-Machado R, Paixao-Cortes VR,
Revollo S, Paz YMC, Fujita R et al. 2011. A new
subhaplogroup of native
American Y-Chromosomes from the Andes. Am J Phys Anthropol 146:
553-559.
Karafet TM, Mendez FL, Meilerman MB, Underhill PA, Zegura SL,
Hammer MF. 2008. New
binary polymorphisms reshape and increase resolution of the
human Y chromosomal
haplogroup tree. Genome Res 18: 830-838.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth
G, Abecasis G, Durbin R.
2009. The Sequence Alignment/Map format and SAMtools.
Bioinformatics 25: 2078-2079.
Li H, Ruan J, Durbin R. 2008. Mapping short DNA sequencing reads
and calling variants using
mapping quality scores. Genome Res 18: 1851-1858.
Mendez FL, Karafet TM, Krahn T, Ostrer H, Soodyall H, Hammer MF.
2011. Increased
resolution of Y chromosome haplogroup T defines relationships
among populations
of the Near East, Europe, and Africa. Hum Biol 83: 39-53.
Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT,
Rowen L, Pant KP, Goodman
N, Bamshad M et al. 2010. Analysis of genetic inheritance in a
family quartet by
whole-genome sequencing. Science 328: 636-639.
Roberts RG, Jones R, Smith MA. 1990. Thermoluminescence dating
of a 50,000-year-old
human occupation site in northern Australia. Nature 345:
153-156.
Rozen S, Marszalek JD, Alagappan RK, Skaletsky H, Page DC. 2009.
Remarkably little variation
in proteins encoded by the Y chromosome's single-copy genes,
implying effective
purifying selection. Am J Hum Genet 85: 923-928.
Saillard J, Forster P, Lynnerup N, Bandelt HJ, Norby S. 2000.
mtDNA variation among
Greenland Eskimos: the edge of the Beringian expansion. Am J Hum
Genet 67: 718-726.
Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero
J, Hobolth A, Lappalainen T,
Mailund T, Marques-Bonet T et al. 2012. Insights into hominid
evolution from the
gorilla genome sequence. Nature 483: 169-175.
Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D.
2005. Calibrating a coalescent
simulation of human genome sequence variation. Genome Res 15:
1576-1583.
Scozzari R, Cruciani F, Santolamazza P, Malaspina P, Torroni A,
Sellitto D, Arredi B, Destro-Bisol G,
De Stefano G, Rickards O et al. 1999. Combined use of
biallelic and
microsatellite Y-chromosome polymorphisms to infer affinities
among African
populations. Am J Hum Genet 65: 829-846.
Semino O, Passarino G, Oefner PJ, Lin AA, Arbuzova S, Beckman
LE, De Benedictis G,
Francalacci P, Kouvatsi A, Limborska S et al. 2000. The genetic
legacy of Paleolithic
Homo sapiens sapiens in extant Europeans: a Y chromosome
perspective. Science
290: 1155-1159.
Sezgin E, Lind JM, Shrestha S, Hendrickson S, Goedert JJ,
Donfield S, Kirk GD, Phair JP, Troyer
JL, O'Brien SJ et al. 2009. Association of Y chromosome
haplogroup I with HIV
progression, and HAART outcome. Hum Genet 125: 281-294.
Shi W, Ayub Q, Vermeulen M, Shao RG, Zuniga S, van der Gaag K,
de Knijff P, Kayser M, Xue
Y, Tyler-Smith C. 2010. A worldwide survey of human male
demographic history
based on Y-SNP and Y-STR data from the HGDP-CEPH populations.
Mol Biol Evol 27:
385-393.
Sims LM, Garvey D, Ballantyne J. 2009. Improved resolution
haplogroup G phylogeny in the
Y chromosome, revealed by a set of newly characterized SNPs.
PLoS One 4: e5792.
Skaletsky H, Kuroda-Kawaguchi T, Minx PJ, Cordum HS, Hillier L,
Brown LG, Repping S,
Pyntikova T, Ali J, Bieri T et al. 2003. The male-specific
region of the human Y
chromosome is a mosaic of discrete sequence classes. Nature 423:
825-837.
The 1000 Genomes Project Consortium. 2010. A map of human genome
variation from
population-scale sequencing. Nature 467: 1061-1073.
Trombetta B, Cruciani F, Sellitto D, Scozzari R. 2011. A new
topology of the human Y
chromosome haplogroup E1b1 (E-P2) revealed through the use of
newly
characterized binary polymorphisms. PLoS One 6: e16073.
Trombetta B, Cruciani F, Underhill PA, Sellitto D, Scozzari R.
2010. Footprints of X-to-Y gene
conversion in recent human evolution. Mol Biol Evol 27:
714-725.
Tyler-Smith C, Krausz C. 2009. The will-o'-the-wisp of
genetics--hunting for the azoospermia
factor gene. N Engl J Med 360: 925-927.
Underhill PA, Myres NM, Rootsi S, Metspalu M, Zhivotovsky LA,
King RJ, Lin AA, Chow CE,
Semino O, Battaglia V et al. 2010. Separating the post-Glacial
coancestry of European
and Asian Y chromosomes within haplogroup R1a. Eur J Hum Genet
18: 479-484.
Underhill PA, Shen P, Lin AA, Jin L, Passarino G, Yang WH,
Kauffman E, Bonne-Tamir B,
Bertranpetit J, Francalacci P et al. 2000. Y chromosome sequence
variation and the
history of human populations. Nat Genet 26: 358-361.
Vigilant L, Stoneking M, Harpending H, Hawkes K, Wilson AC.
1991. African populations and
the evolution of human mitochondrial DNA. Science 253:
1503-1507.
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney
E. 2009. EnsemblCompara
GeneTrees: Complete, duplication-aware phylogenetic trees in
vertebrates. Genome
Res 19: 327-335.
Xue Y, Wang Q, Long Q, Ng BL, Swerdlow H, Burton J, Skuce C,
Taylor R, Abdellah Z, Zhao Y
et al. 2009. Human Y chromosome base-substitution mutation rate
measured by
direct sequencing in a deep-rooting pedigree. Curr Biol 19:
1453-1457.
Zalloua PA, Platt DE, El Sibai M, Khalife J, Makhoul N, Haber M,
Xue Y, Izaabel H, Bosch E,
Adams SM et al. 2008. Identifying genetic traces of historical
expansions: Phoenician
footprints in the Mediterranean. Am J Hum Genet 83: 633-642.
Zerjal T, Xue Y, Bertorelle G, Wells RS, Bao W, Zhu S, Qamar R,
Ayub Q, Mohyuddin A, Fu S et
al. 2003. The genetic legacy of the Mongols. Am J Hum Genet 72:
717-721.