Page 1
Primate Comparative Genomics
“…man’s position in the animate world is an indispensable
preliminary to the proper understanding of his relations to the
universe – and this again resolves itself, in the long run, into an
inquiry into the nature and the closeness of the ties which connect
him with those singular creatures (the Great Apes) whose history
has been sketched in the preceding pages.”
-Thomas H. Huxley
-Man’s Place in Nature, 1894
Page 2
Humans and Chimps
Homo sapiens 99.9% identical
Homo sapiens and Pan troglodytes 99.0% identical
Page 3
Why sequence chimps?
Two white papers.
http://www.genome.gov/11008056
Page 4
Chimps Are Resistant To Many
Human Diseases
Comparison of disease susceptibility between chimps and humans
Condition                        Human            Chimp
HIV progression to AIDS          common           very rare
Influenza A symptoms             moderate/severe  mild
Hepatitis B/C complications      moderate/severe  mild
Plasmodium falciparum malaria    susceptible      resistant
Menopause                        universal        rare
E. coli K99 gastroenteritis      resistant        sensitive
Alzheimer’s disease pathology    complete         incomplete
Epithelial cancers               common           rare
Source: Olson, M.V. et al. White paper advocating the complete sequencing of the common chimpanzee, Pan troglodytes (2002)
Page 5
Chimp sequence can inform our
unique population history
Kaessmann et al (2001) Nat. Genet. 27: 155-56
Page 6
Chimps can inform our unique
population history
• Fixation of deleterious
alleles during bottlenecks
• Chimp genome might offer
a “fix” to common diseases
[Figure: one population carrying both + and -- alleles (speech, hypertension, obesity, bipedal) next to one carrying only speech+, hypertension+, obesity+, bipedal+]
Page 7
Chimp sequence can help detect
selection
• Important to know the ancestral allele
• Over-representation of the non-ancestral allele can suggest selection
[Figure: A allele fixed in chimps; A and B polymorphic in humans, with B the more common allele]
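The idea on this slide can be sketched in a few lines: treat the allele fixed in chimps as ancestral and ask how common the non-ancestral (derived) allele is in humans. This is only an illustration. The function name and the 16-chromosome sample are hypothetical, and a high derived-allele frequency by itself suggests selection rather than proving it.

```python
from collections import Counter

def derived_allele_frequency(human_alleles, chimp_allele):
    """Frequency of each human allele that differs from the chimp allele.

    Assumption: the allele fixed in chimps is the ancestral state.
    """
    counts = Counter(human_alleles)
    total = sum(counts.values())
    return {allele: n / total
            for allele, n in counts.items()
            if allele != chimp_allele}

# Hypothetical sample of 16 human chromosomes (3 carry A, 13 carry B).
human_sample = list("AABBBBBBBBBBABBB")
daf = derived_allele_frequency(human_sample, chimp_allele="A")
print(daf)
```

Without the chimp sequence there is no way to tell which of A or B is the derived allele, which is why the outgroup matters here.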
Page 8
Only species appropriate for
comparison of fast moving regions
• Pericentric duplications
• Subtelomeric repeats
• Y-chromosome
• 5-7% of the genome is in large segmental
duplications
Page 9
What does the genome tell us?
• (Roughly) same size genome (3.1 Gb)
• (Roughly) same number of genes (~20,500)
• (Roughly) same genes
• Large number of papers reporting specific
differences between human and chimps
• Many papers also claim to detect positive selection
on specific human genes
Not too much yet…
Page 10
Let’s do the math
How many differences do we need to look at?
(3 x 10^9 bp) (1% divergence) (50% in humans) = 15 million bp
In coding DNA?
(15 million bp) (1.5% coding) (75% non-synonymous) = 169,000 bp
or about 8 non-synonymous changes per gene (169,000 / ~20,500 genes)
Non-coding DNA?
(15 million bp) (3.5% under selection) = 525,000 bp
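The arithmetic on this slide can be checked with a short script. All inputs are the round numbers from the slides. The per-gene figure uses the ~20,500 gene count from the previous slide and shifts with that assumption.

```python
# Back-of-the-envelope counts of human-lineage differences.
GENOME_BP = 3e9              # genome size in base pairs
DIVERGENCE = 0.01            # human-chimp divergence
HUMAN_SHARE = 0.5            # share of changes on the human lineage
CODING = 0.015               # fraction of the genome that is coding
NONSYNONYMOUS = 0.75         # fraction of coding changes that alter the protein
NONCODING_SELECTED = 0.035   # fraction of the genome that is non-coding but under selection
N_GENES = 20_500             # gene count from the previous slide

human_changes = GENOME_BP * DIVERGENCE * HUMAN_SHARE
nonsyn_changes = human_changes * CODING * NONSYNONYMOUS
per_gene = nonsyn_changes / N_GENES
noncoding_changes = human_changes * NONCODING_SELECTED

print(f"human-lineage changes:  {human_changes:,.0f} bp")
print(f"non-synonymous changes: {nonsyn_changes:,.0f} bp")
print(f"per gene:               {per_gene:.1f}")
print(f"selected non-coding:    {noncoding_changes:,.0f} bp")
```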
Page 11
What are the possibilities?
• Gene loss
• Gene gain
• Gene mutation (a few or many)
• Gene regulation
• Something else?
Page 12
Inter- versus Intraspecific
Variation
He (man) resembles them (apes) as they
resemble one another – he differs from
them as they differ from one another.
-Thomas Huxley
-Man’s Place in Nature, 1894
Page 13
Gene Loss
Hypothesis: Humans have lost (one or
more) genes compared to chimps, and
it is the loss of those functions that
accounts for our “humanness”
Page 14
Sialic Acid Biology
an example of database mining
Chou et al. (1998) Proc. Natl. Acad. Sci. USA 95, 11751-11756
• Apes have lots of Neu5Gc, humans very little
• Neu5Gc is located on the surface of epithelial cells
• Neu5Gc is present in very low levels in the brain even
in animals that have lots of Neu5Gc
[Figure: hydroxylase gene in human, chimp, gorilla, and mouse, each marked at the ATG]
A 92 bp deletion in the CMP-Neu5Ac
hydroxylase is specific to the
human lineage
Page 15
Indels are ~50% of human-chimp differences
Frazer et al (2003) Genome Res. 13: 341-346
Locke et al. (2003) Genome Res. 13: 347-357
Page 16
Gene Gain
Hypothesis: Humans have gained (one
or more) genes compared to chimps, and
it is the gain of these new functions that
accounts for our “humanness.”
Page 17
Morpheus Gene Family
Johnson et al. (2001) Nature 413:514-519
Page 18
Morpheus Gene Family
Johnson et al. (2001) Nature 413:514-519
• 20 kb duplicated segment on short arm of
chromosome 16
• 98% identity in introns/non-coding DNA,
81% identity in exonic DNA
• Ka/Ks tests indicate (possibility of) extreme
positive selection
• Gene family has no homology to known
genes
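A rough sketch of what a Ka/Ks test measures: the rate of non-synonymous change per non-synonymous site (Ka) divided by the rate of synonymous change per synonymous site (Ks). A ratio well above 1 is the pattern read as positive selection. The counts below are hypothetical and are not taken from Johnson et al. Counting the sites and differences from real codon alignments (for example by the Nei-Gojobori method) is not shown.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Ka/Ks from counts of differences and sites (all inputs assumed given)."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    if ks == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka / ks

# Hypothetical counts for one pair of duplicated gene copies.
ratio = ka_ks(nonsyn_diffs=60, nonsyn_sites=300, syn_diffs=5, syn_sites=100)
print(f"Ka/Ks = {ratio:.2f}")
```

Exons diverging faster than introns, as on this slide, is the kind of signal that would push this ratio above 1.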
Page 19
Morpheus Gene Family
Page 20
Gene Mutation
Hypothesis: Humans acquired (one or
more) substitutions in the coding
regions of their genes that alter the
functions of those proteins so as to
account for our “humanness.”
Page 21
What about organism-specific substitutions?
http://sayer.lab.nig.ac.jp/~silver/
C-C chemokine receptor (nucleotides 1 to 60)
Human_1 ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCAATTATTATACATCGGAGCCCTGC
Human_2 ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCAATTATTATACATCGGAGCCCTGC
Human_3 ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCAATTATTATACATCGGAGCCCTGC
Human_4 ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCAATTATTATACATCGGAGCCCTGC
Chimp_1 ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCGATTATTATACATCGGAGCCCTGC
Chimp_2 ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCGATTATTATACATCGGAGCCCTGC
Chimp_3 ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCGATTATTATACATCGGAGCCCTGC
Goril_1 ATGGATTATCAAGTGTCAAGTCCAACCTATGACATCGATTATTATACATCGGAGCCCTGC
Goril_2 ATGGATTATCAAGTGTCAAGTCCAACCTATGACATCGATTATTATACATCGGAGCCCTGC
Goril_3 ATGGATTATCAAGTGTCAAGTCCAACCTATGACATCGATTATTATACATCGGAGCCCTGC
        ************************* ********** ***********************
Problem: How can we make a conclusion based on one substitution?
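The alignment above can be scanned for lineage-specific substitutions with a short script. It uses one representative sequence per species (the individuals within each species are identical here) and reports columns where one species differs from the other two. The function name is just for illustration.

```python
human   = "ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCAATTATTATACATCGGAGCCCTGC"
chimp   = "ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCGATTATTATACATCGGAGCCCTGC"
gorilla = "ATGGATTATCAAGTGTCAAGTCCAACCTATGACATCGATTATTATACATCGGAGCCCTGC"

def lineage_specific(target, others):
    """Columns where `target` differs from all `others`, which agree with each other.

    Returns (1-based position, target base, shared base in the others).
    """
    hits = []
    for i, base in enumerate(target):
        other_bases = {seq[i] for seq in others}
        if len(other_bases) == 1 and base not in other_bases:
            hits.append((i + 1, base, other_bases.pop()))
    return hits

human_specific = lineage_specific(human, [chimp, gorilla])
gorilla_specific = lineage_specific(gorilla, [human, chimp])
print("human-specific:  ", human_specific)
print("gorilla-specific:", gorilla_specific)
```

With chimp and gorilla agreeing at the human-specific column, the change most likely happened on the human lineage, but a single substitution says nothing about selection.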
Page 22
Detecting Selective Sweeps
• Selective sweeps are (thought to be) accompanied
by a local reduction in diversity
• Test for overabundance of low frequency alleles
(Tajima’s D)
Adapted from Carroll, S. (2003) Nature 422:849-57
[Figure: beneficial mutation arises -> selection drives mutation to fixation -> mutation/recombination]
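A minimal sketch of Tajima's D, using the standard formula from Tajima (1989). It compares the mean number of pairwise differences with the number of segregating sites. An excess of low-frequency alleles gives a negative D. The two toy samples are hypothetical, and no significance test is included.

```python
import math
from itertools import combinations

def tajimas_d(seqs):
    """Tajima's D for a list of equal-length aligned sequences."""
    n = len(seqs)
    length = len(seqs[0])
    # S: number of segregating (polymorphic) sites
    s = sum(1 for i in range(length) if len({seq[i] for seq in seqs}) > 1)
    if s == 0:
        raise ValueError("no segregating sites; D is undefined")
    # pi: mean number of pairwise differences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Hypothetical samples: every variant is a singleton (sweep-like pattern) ...
singletons = ["AAAA", "TAAA", "ATAA", "AATA", "AAAT", "AAAA"]
# ... versus every variant at intermediate frequency.
balanced = ["AAAA", "AAAA", "AAAA", "TTTT", "TTTT", "TTTT"]
d_singletons = tajimas_d(singletons)
d_balanced = tajimas_d(balanced)
print(d_singletons, d_balanced)
```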
Page 23
FOXP2, The Human Speech Gene?
1) Mapped in families with inherited speech
defects (normal IQ)
2) Forkhead transcription factor
[Figure: FOXP2 Nucleotide Substitutions]
Enard et al. (2002) Nature 418, 869-72
Page 24
FOXP2, The Human Speech Gene?
Enard et al. (2002) Nature 418, 869-72
• Sequencing of adjacent non-coding DNA
revealed an excess in the number of low
frequency alleles relative to what would be
expected given neutral DNA in a randomly
mating population of constant size
• Tajima’s D = -2.20 (P<0.01)
Page 25
Gene Expression
Hypothesis: It is not the structural
differences in proteins, but rather their
differences in expression between
humans and chimps that account for
our “humanness.”
Page 26
Differences in Gene Expression in the Brain?
Enard et al (2002) Science 296, 340-343.
[Figure panels: microarrays; 2D gels]
Page 27
Neutral Theory of Gene
Expression?
• Consider how one might construct a neutral
theory of gene expression akin to the neutral
theory of gene mutation
Page 28
1) What is the sequence of the
normal Human Genome?
2) What accounts for the genetic
differences between individuals?
Page 29
Finding Segmental Duplications in the
Human Genome
Bailey et al (2002) Science 297:1003-07
Page 30
Segmental Duplications in the Human Genome
Bailey et al (2002) Science 297:1003-07
Page 31
Polymorphism in Segmental Duplications
Iafrate et al (2004) Nat Genet 36:949-51
Page 32
Polymorphism in Segmental
Duplications
• CGH studies find many copy number polymorphisms in segmental duplications (~12 per individual)
• Rare and common polymorphisms
• Many overlap coding regions
• Critical for the interpretation of amplifications in cancers
• Responsible for phenotypic differences between people?
Page 33
SNPs/HapMap/1000 Genomes
The International HapMap Project is a multi-country effort to
identify and catalog genetic similarities and differences in human
beings. Using the information in the HapMap, researchers will be
able to find genes that affect health, disease, and individual
responses to medications and environmental factors. The Project is
a collaboration among scientists and funding agencies from Japan,
the United Kingdom, Canada, China, Nigeria, and the United
States. All of the information generated by the Project will be
released into the public domain.
Page 34
Questions
1. How many sub-populations best partition the
data?
2. How strong is the evidence for the clusters?
3. Do the inferred clusters correspond to our
notions of race, ethnicity, ancestry, or
geography?
4. Given the inferred clusters, can we accurately
classify new individuals?
5. Can we identify population admixture or
migration events?
Page 35
Attempts to group humans by genotype
Page 36
π and Fst
1. π, average nucleotide diversity
(~1 in 1000 bp)
2. Fst, proportion of genetic variation that can
be ascribed to differences between
populations (~10%)
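Both statistics can be sketched directly from their definitions. Here π is the mean number of pairwise differences per site, and Fst is computed as (H_T - H_S) / H_T for one biallelic locus in equally weighted sub-populations. This is one common formulation among several, and the sequences and allele frequencies are hypothetical.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """pi: mean pairwise differences per site across aligned sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return diffs / (len(pairs) * len(seqs[0]))

def fst(allele_freqs):
    """Fst for one biallelic locus.

    allele_freqs: frequency of one allele in each sub-population.
    Assumption: sub-populations are weighted equally.
    """
    h_s = sum(2 * p * (1 - p) for p in allele_freqs) / len(allele_freqs)
    p_bar = sum(allele_freqs) / len(allele_freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    return (h_t - h_s) / h_t

pi = nucleotide_diversity(["AAAA", "AAAT", "AAAA"])
print(f"pi  = {pi:.3f}")
print(f"Fst = {fst([0.4, 0.6]):.3f}")   # modest frequency difference
print(f"Fst = {fst([0.1, 0.9]):.3f}")   # strong frequency difference
```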
Page 37
Summary of Findings
• π and Fst are small
• Diversity within “African” populations is
highest
• Unsupervised clustering tends to support
either 3 or 4 sub-populations depending on
the number and type of markers and individuals
included in the study, but the composition
of the groups often differs between
studies
Page 38
A contradiction?
• Although they differed on the extent and
composition of sub-populations, so far all
studies have found evidence of significant
sub-structure in human populations
• And yet, all studies agree that Fst is small
(between 3% and 15%)
See review by Jorde and Wooding (2004) Nature Genet. 36: S28-S33
Page 39
Small Fst does not imply lack of structure
[Figure: individuals labeled by allele (A1, A2, B1, B2, C1, C2, D1, D2, E1, E2) scattered across the plot]
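One way to see why a small Fst can still leave detectable structure: each locus carries little information, but many loci add up. The sketch below uses a deliberately simplified, hypothetical model (two populations, haploid genotypes, independent loci, allele frequency 0.6 in one population and 0.4 in the other, so Fst is about 0.04 at every locus) and computes exactly how often an individual is assigned to the correct population by likelihood.

```python
from math import comb

def assignment_accuracy(n_loci, p=0.6):
    """Probability of assigning an individual to its true population.

    Toy model (all assumptions): the true population has allele frequency p
    at every locus, the other has 1 - p, loci are independent, and the
    individual is assigned to whichever population makes its genotype more
    likely. Ties are split evenly.
    """
    correct = 0.0
    for x in range(n_loci + 1):          # x = loci carrying the focal allele
        prob = comb(n_loci, x) * p ** x * (1 - p) ** (n_loci - x)
        if 2 * x > n_loci:
            correct += prob
        elif 2 * x == n_loci:
            correct += 0.5 * prob
    return correct

for n in (1, 11, 101, 501):
    print(n, round(assignment_accuracy(n), 3))
```

With one locus the assignment is barely better than a coin flip. With hundreds of loci it is nearly always right, even though Fst has not changed.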
Page 40
Clustering human populations by
genotype
K-means clustering of gene expression data
1. Pick a number (k) of cluster centers
2. Assign every gene to its nearest cluster center
3. Move each cluster center to the mean of its assigned genes
4. Repeat steps 2-3 until convergence
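The k-means steps above can be sketched in plain Python. The expression profiles and starting centers are hypothetical, and real analyses usually pick the starting centers at random and rerun several times.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centers, max_iter=100):
    """Cluster points (tuples of expression values) around the given starting centers."""
    for _ in range(max_iter):
        # Step 2: assign every gene to its nearest cluster center.
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda c: squared_distance(point, centers[c]))
            clusters[nearest].append(point)
        # Step 3: move each center to the mean of its assigned genes.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[c]          # leave an empty cluster in place
            for c, members in enumerate(clusters)
        ]
        # Step 4: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical expression profiles for four genes in two conditions.
genes = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
centers, clusters = kmeans(genes, centers=[(0.0, 0.0), (6.0, 6.0)])
print(centers)
print(clusters)
```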
EM-based clustering of genotype data
1. Pick a number (k) of sub-populations
2. Assign every individual to a sub-population based on the allele frequencies in the sub-population
3. Recalculate the allele frequencies in each sub-population
4. Repeat steps 2-3 until convergence
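The genotype version follows the same loop, with allele frequencies in place of cluster centers. The sketch below is a hard-assignment simplification: each individual goes wholly to the most likely sub-population. The pseudocount, the haploid 1/2 allele coding, and the six genotypes are assumptions made for the illustration, not part of the slide.

```python
def allele_freqs(members, n_loci, pseudo=0.5):
    """Frequency of allele 1 at each locus.

    The pseudocount (an assumption of this sketch) keeps frequencies away
    from exactly 0 or 1 so that no genotype gets zero likelihood.
    """
    n = len(members)
    return [(sum(g[locus] == 1 for g in members) + pseudo) / (n + 2 * pseudo)
            for locus in range(n_loci)]

def likelihood(genotype, freqs):
    """Probability of a genotype given per-locus frequencies of allele 1."""
    prob = 1.0
    for allele, f in zip(genotype, freqs):
        prob *= f if allele == 1 else 1.0 - f
    return prob

def cluster_genotypes(genotypes, labels, k, max_iter=100):
    n_loci = len(genotypes[0])
    freqs = []
    for _ in range(max_iter):
        # Step 3: recalculate allele frequencies in each sub-population.
        freqs = [allele_freqs([g for g, lab in zip(genotypes, labels) if lab == c],
                              n_loci)
                 for c in range(k)]
        # Step 2: reassign every individual to its most likely sub-population.
        new_labels = [max(range(k), key=lambda c: likelihood(g, freqs[c]))
                      for g in genotypes]
        if new_labels == labels:      # Step 4: converged
            break
        labels = new_labels
    return labels, freqs

# Hypothetical haploid genotypes at three biallelic loci (alleles coded 1 or 2).
genotypes = [(1, 1, 1), (1, 1, 1), (1, 1, 2), (2, 2, 2), (2, 2, 2), (2, 2, 1)]
labels, freqs = cluster_genotypes(genotypes, labels=[0, 1, 0, 1, 0, 1], k=2)
print(labels)
```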
Page 41
An Example
I1= (A1,B1,C2)
I2= (A1,B1,C2)
I3= (A1,B2,C2)
I4= (A2,B2,C1)
I5= (A1,B1,C1)
I6= (A1,B1,C2)
I7= (A1,B1,C2)
I8= (A2,B2,C2)
I9= (A1,B2,C1)
I10= (A2,B1,C2)
I11= (A2,B2,C2)
I12= (A2,B2,C2)
12 individuals genotyped at three
different independent biallelic loci
Page 42
k1              k2              k3
I1= (A1,B1,C2)  I5= (A1,B1,C1)  I9= (A1,B2,C1)
I2= (A1,B1,C2)  I6= (A1,B1,C2)  I10= (A2,B1,C2)
I3= (A1,B2,C2)  I7= (A1,B1,C2)  I11= (A2,B2,C2)
I4= (A2,B2,C1)  I8= (A2,B2,C2)  I12= (A2,B2,C2)
F(A1)k1=0.75    F(A1)k2=0.75    F(A1)k3=0.25
F(B1)k1=0.5     F(B1)k2=0.75    F(B1)k3=0.25
F(C1)k1=0.25    F(C1)k2=0.25    F(C1)k3=0.25
Consider individual I1= (A1,B1,C2)
P(I1 in k1) = (.75)(.5)(.75) = 0.28
P(I1 in k2) = (.75)(.75)(.75) = 0.42
P(I1 in k3) = (.25)(.25)(.75) = 0.047
Therefore reassign I1 to k2
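The reassignment of I1 can be reproduced in a few lines. The grouping k1 = I1-I4, k2 = I5-I8, k3 = I9-I12 is an assumption, chosen because it is the split that yields the allele frequencies listed on this slide.

```python
genotypes = {
    "I1": ("A1", "B1", "C2"), "I2": ("A1", "B1", "C2"),
    "I3": ("A1", "B2", "C2"), "I4": ("A2", "B2", "C1"),
    "I5": ("A1", "B1", "C1"), "I6": ("A1", "B1", "C2"),
    "I7": ("A1", "B1", "C2"), "I8": ("A2", "B2", "C2"),
    "I9": ("A1", "B2", "C1"), "I10": ("A2", "B1", "C2"),
    "I11": ("A2", "B2", "C2"), "I12": ("A2", "B2", "C2"),
}

# Assumed starting partition (consistent with the frequencies on the slide).
clusters = {
    "k1": ["I1", "I2", "I3", "I4"],
    "k2": ["I5", "I6", "I7", "I8"],
    "k3": ["I9", "I10", "I11", "I12"],
}

def allele_frequency(members, locus, allele):
    """Fraction of cluster members carrying `allele` at `locus`."""
    return sum(genotypes[m][locus] == allele for m in members) / len(members)

def likelihood(individual, members):
    """Product over loci of the cluster frequency of the individual's allele."""
    prob = 1.0
    for locus, allele in enumerate(genotypes[individual]):
        prob *= allele_frequency(members, locus, allele)
    return prob

scores = {name: likelihood("I1", members) for name, members in clusters.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"P(I1 in {name}) = {score:.3f}")
print("reassign I1 to", best)
```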
Page 43
An example
Bamshad et al (2003) Am. J. Hum. Genet. 72:578-89
Page 44
But…
Bamshad et al (2003) Am. J. Hum. Genet. 72:578-89
Page 45
Genes mirror geography in Europe
Novembre et al. (2008) Nature 456, 98-101
Page 46
Pharmacogenomics
• Many drugs never reach the market because
of side effects in a small minority of
patients
• Many drugs on the market are efficacious in
only a small fraction of the population
• This variation is (in part) due to genetic
determinants
– Iressa: EGFR mutations
– Codeine: cytochrome P450 alleles
Page 47
Question: Is race, ancestry, ethnicity,
geography or genetic substructure a
reasonable proxy for genotype at
alleles relevant for drug metabolism?
Answer: So far…No. Still looks as if we will have to genotype
the relevant loci before making any guesses
Page 48
Population genetic structure of
variable drug response
Wilson et al (2001) Nat Genet. 29: 265-269
A = African
B = European
C = Asian
[Figure: panels for CYP1A2, GSTM1, CYP2C19, DIA4, NAT2, and CYP2D6, each comparing groups A, B, and C]
Page 49
Evidence for Archaic Asian Ancestry on
the Human X Chromosome
Garrigan et al. (2005) Mol. Biol. Evol. 22:189-192
1) Pseudogene on the X-chromosome
2) 18 substitutions between human-chimp
3) 15 substitutions between two human alleles
4) Assuming a molecular clock the split between
the two human alleles is about 2 million years
5) Both alleles found in southern Asia, only one
allele found in Africa
6) Only human gene tree to “root” in Asia
Page 50
Garrigan et al. (2005) Mol. Biol. Evol. 22:189-192
Page 51
Garrigan et al. (2005) Mol. Biol. Evol. 22:189-192
Page 52
Human evolution in a nutshell
[Figure: tree of chimps, H. ergaster, H. erectus, H. neanderthalensis, and H. sapiens, with divergence times marked at 5-6 mya, 1 mya, 0.5 mya, and 0.2 mya]
Page 53
Human evolution in a nutshell
[Figure: tree of chimps, H. ergaster, H. erectus, H. neanderthalensis, and H. sapiens, with divergence times marked at 5-6 mya, 1 mya, 0.5 mya, and 0.2 mya, plus a question mark]
Page 54
So what happened?
1. Strong selection for the Asian allele in southern Asia
-not likely since this is a pseudogene locus
-fails Tajima’s D test
2. Gene flow between H. sapiens and H. erectus in
southern Asia
-branch lengths are about right for 2 million years of divergence
-H. erectus was in southern Asia until 18,000 years ago
(Morwood et al. and Brown et al. in Nature (2004) vol
431.)
-supporting evidence from genetic analysis of lice and other
human parasites (Reed et al (2004) PLoS Biol. 2:1972-83)