INFERRING HAPLOTYPES OF COPY NUMBER VARIATIONS Mamoru Kato Cold Spring Harbor Laboratory, USA
Jan 30, 2016
Background
• 2003 – The complete sequence of the human genome was released by the International Human Genome Sequencing Consortium.
– Coverage ~99%; accuracy >99.99%
• However, this complete sequence is an “average” sequence (derived from the DNA samples of multiple individuals).
• Little information on variation/polymorphism in the sequence among multiple individuals
Background
• The International HapMap Project started after the Human Genome Project to address variation/polymorphism in human sequences.
– Focused on single nucleotide polymorphisms (SNPs), the simplest polymorphism
– Catalogued SNP genotypes for 270 individuals in three ethnic populations (Asian, African, European) at 3 million SNP loci
– Medical application as well as biological investigation
• 2005 – Phase I was published.
• 2007 – Phase II was published.
Background
• An achievement in the HapMap Project: linkage disequilibrium (LD)
• LD – statistical association between different loci
• Promotes genome-wide disease association studies
[Figure: LD between SNPs]
• Genome-wide association studies
– Find SNPs associated with a disease on the genomic scale
– Many diseases: diabetes, rheumatoid arthritis, myocardial infarction, Crohn's disease, ...
Background
• While the HapMap Project was ongoing, types of genetic variation more complex than SNPs were recognized.
– 2004 Copy number variation (Sebat et al; Iafrate et al)
• By microarrays
• Until then, this variation was believed to be rare in normal individuals
– 2005 Inversion (Stefansson et al)
• These variations are collectively called structural variations.
[Figure: examples on chromosome 10 and chromosome 3]
Structural Variations
• Structural variation – variation of long length
– Variation in which sequence segments >1 kb in size are involved
– Types: copy number variation, inversion, translocation
[Figure: copy number variation, inversion, and translocation (>1 kb) on the homologous chromosomes (one from the father, one from the mother) of Individuals 1 to 3]
Copy Number Variation
• Copy Number Variation (CNV)
– The simplest type of structural variation
– Difference in the number of copies of a segment >1 kb
[Figure: Individuals 1 to 3 carrying different numbers of copies of a CNV segment (>1 kb); the CNV region is represented on a reference genome]
CNV
• Since the studies in 2004, many studies have been performed on the genomic scale for many individuals.
– CNV regions cover 4-6% of the human genome (3,000-6,000 regions)
• cf. Common SNPs: 0.3% in coverage, 10 million in number
– CNV regions often include entire genes and their regulatory regions.
• e.g., CCL3L1 gene – HIV
– CNVs are likely to influence human phenotypes such as disease susceptibility.
• Autoimmunity, autism, psoriasis, schizophrenia, ...
CNV
• The principle of CNV detection
– By microarrays and quantitative PCR (excluding paired-end mapping)
[Figure: signal intensity plotted against genomic location for Individuals 1 to 3]
Problem
• These techniques cannot discern the configurations of genotypes (pairs of alleles) of CNV
– Allele: copy number 1 and Allele: copy number 2 give total number 3
– Allele: copy number 0 and Allele: copy number 3 also give total number 3
Problem
• These techniques cannot discern the configurations of genotypes (pairs of alleles) of CNV
– Allele: G and Allele: A, A give (# A, # G) = (2, 1)
– Allele: A and Allele: G, A also give (# A, # G) = (2, 1)
[Figure: signal intensity of A plotted against signal intensity of G for Individuals 1 and 2]
Problem
• This is problematic – most theories in population genetics are constructed based on alleles (or haplotypes), not on the total numbers.
– Population differentiation
$F_{st} = 1 - H_s / H_t$
$H_s = 1 - \sum_i \frac{N_A p_i^2 + N_B q_i^2}{N_A + N_B}$
$H_t = 1 - \sum_i \left( \frac{N_A p_i + N_B q_i}{N_A + N_B} \right)^2$
– Linkage disequilibrium
$R^2 = \sum_i \sum_j \frac{(h_{ij} - p_i q_j)^2}{p_i q_j}$
– Frequencies of alleles (haplotypes) in a population
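To illustrate why haplotype frequencies are needed, the sketch below computes an LD measure between two loci from a table of haplotype frequencies. It assumes the multi-allelic form R² = Σ_i Σ_j (h_ij − p_i q_j)² / (p_i q_j), where h_ij is the frequency of the haplotype carrying allele i at the first locus and allele j at the second, and p_i and q_j are the marginal allele frequencies. The function name `ld_r2` and the example frequencies are hypothetical.

```python
def ld_r2(hap_freq):
    """LD between two loci from haplotype frequencies.

    hap_freq[i][j] is the frequency of the haplotype with allele i at
    locus 1 and allele j at locus 2 (all entries sum to 1).
    """
    # Marginal allele frequencies at each locus.
    p = [sum(row) for row in hap_freq]
    q = [sum(col) for col in zip(*hap_freq)]
    r2 = 0.0
    for i, row in enumerate(hap_freq):
        for j, h in enumerate(row):
            expected = p[i] * q[j]  # haplotype frequency if the loci were independent
            if expected > 0:
                r2 += (h - expected) ** 2 / expected
    return r2


# Hypothetical examples: no association versus complete association.
independent = [[0.25, 0.25], [0.25, 0.25]]
complete = [[0.5, 0.0], [0.0, 0.5]]
print(ld_r2(independent), ld_r2(complete))
```

The input is a haplotype frequency table, which cannot be read off diploid totals directly; that is the gap haplotype inference is meant to fill.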
Problem
• Some method is required to get information on alleles/haplotypes from observed data
• Haplotype inference for CNVs
• It handles
– >50 individuals
– One locus to the genomic level
Haplotype Inference for CNVs
• Deterministic approach (Redon et al, 2006; McCarroll et al, 2006; Hinds et al, 2006)
– One state from observed data
• Statistical approach (Kato et al, 2008; Kato et al, 2008; Shindo et al, 2009)
– Multiple states from observed data
Deterministic Approach
• A simple approach to infer alleles – clustering the signal intensities of individuals (Redon et al, 2006)
– If signal intensities are grouped into three clusters, they correspond to two homozygotes and one heterozygote
– Assumption: two different alleles
[Figure: signal intensities of Ind. 1 to 4 at a CNV, grouped into three clusters (Homo, Hetero, Homo) of Allele 1 and Allele 2 (Redon et al, 2006)]
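The clustering idea can be sketched with a plain one-dimensional k-means on signal intensities. This is an illustrative assumption, not the clustering procedure used by Redon et al; the intensities, the function name `kmeans_1d`, and the mapping of the lowest cluster to "allele 1" are all hypothetical.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Cluster 1-D values into k groups (k >= 2).

    Returns one cluster index per value; index 0 is the cluster with the
    lowest center.
    """
    lo, hi = min(values), max(values)
    # Start from centers spread evenly over the observed range.
    centers = [lo + (hi - lo) * c / (k - 1) for c in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Move every center to the mean of its members (keep it if empty).
        new_centers = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


# Hypothetical signal intensities for seven individuals.
intensities = [0.11, 0.09, 0.52, 0.48, 0.50, 0.97, 1.02]
genotype_names = ["homozygote (allele 1)", "heterozygote", "homozygote (allele 2)"]
calls = [genotype_names[label] for label in kmeans_1d(intensities)]
print(calls)
```

Each individual receives exactly one genotype call, which is what makes the approach deterministic; it relies on the two-allele assumption stated above.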
Statistical Approach
• Statistical approach (Kato et al, 2008; Kato et al, 2008; Shindo et al, 2009)
– Consider multiple possible states consistent with the total numbers (which I call diploid numbers)
– Condition: diploid numbers are observed.
– Unrelated individuals (as opposed to pedigrees)
• This approach is realized with the expectation-maximization (EM) algorithm
Statistical Approach
• The EM algorithm is a statistical method to estimate parameters, often used when part of the data is unobserved.
– Unobserved data here: diplotypes (pairs of haplotypes), since their configurations are not experimentally determined
1. First, it handles the unobserved data as if they were observed, by utilizing the observed data
– Observed data here: diploid numbers (the total numbers over a genotype)
2. Second, it iterates the E and M steps to improve the estimates.
Principle of the Algorithm
1. List all diplotypes (pairs of haplotypes) that are consistent with diploid numbers (the total numbers)
Sample   Diploid number of copies   Possible diplotypes [x copies / y copies]
Ind. 1   3                          [0 / 3] OR [1 / 2]
Ind. 2   0                          [0 / 0]
Ind. 3   2                          [0 / 2] OR [1 / 1]
Ind. 4   1                          [0 / 1]
“/”: separator symbol between haplotypes
Diploid numbers: quantitative PCR, HMM in microarray
[Figure: CNV copies on the homologous chromosomes of Ind. 1 to 4]
(Kato et al, 2008)
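The listing step can be sketched as a small enumeration. The function name `possible_diplotypes` and the cap `max_copies` on the allelic copy number are assumptions made for illustration; with a cap of 3 the output reproduces the table above.

```python
def possible_diplotypes(diploid_number, max_copies=3):
    """List unordered pairs (x, y) of allelic copy numbers.

    Each pair satisfies x <= y <= max_copies and x + y == diploid_number.
    max_copies is an assumed upper bound on copies per haplotype.
    """
    return [
        (x, diploid_number - x)
        for x in range(diploid_number // 2 + 1)
        if diploid_number - x <= max_copies
    ]


# Diploid numbers from the table above.
diploid_numbers = {"Ind. 1": 3, "Ind. 2": 0, "Ind. 3": 2, "Ind. 4": 1}
for sample, n in diploid_numbers.items():
    listing = " OR ".join(f"[{x} / {y}]" for x, y in possible_diplotypes(n))
    print(sample, n, listing)
```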
Principle of the Algorithm
1. List all diplotypes (pairs of haplotypes) that are consistent with diploid numbers (the total numbers)
Sample   CNV: diploid number of copies   SNP: [allele / allele]   Possible diplotypes [haplotype / haplotype]
Ind. 1   1                               [a / t]                  [0_a / 1_t] OR [0_t / 1_a]
Ind. 2   0                               [a / a]                  ...
Ind. 3   2                               [a / a]                  ...
Ind. 4   3                               [t / t]                  ...
“_”: separator symbol between loci
[Figure: haplotypes composed of a CNV and a SNP for Ind. 1 to 4, with SNP genotypes at, aa, aa, tt]
(Kato et al, 2008)
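When a SNP is phased together with the CNV, every allelic copy number can be paired with either SNP allele. The sketch below enumerates those combinations for one individual; the function name `cnv_snp_diplotypes` and the string encoding `copies_allele` (following the “_” separator of the slide) are assumptions for illustration.

```python
def cnv_snp_diplotypes(copy_pairs, snp_genotype):
    """Combine CNV diplotypes with a SNP genotype.

    copy_pairs: candidate (x, y) allelic copy numbers for the individual.
    snp_genotype: the two SNP alleles, phase unknown.
    Returns sorted unordered pairs of haplotypes written as "copies_allele".
    """
    a1, a2 = snp_genotype
    result = set()
    for x, y in copy_pairs:
        # Try both ways of assigning the SNP alleles to the two haplotypes.
        for s, t in ((a1, a2), (a2, a1)):
            result.add(tuple(sorted((f"{x}_{s}", f"{y}_{t}"))))
    return sorted(result)


# Ind. 1 in the table: diploid number 1, i.e. copies [0 / 1], SNP genotype [a / t].
ind1 = cnv_snp_diplotypes([(0, 1)], ("a", "t"))
print(" OR ".join(f"[{h1} / {h2}]" for h1, h2 in ind1))
```

Storing each pair as a sorted tuple treats [h1 / h2] and [h2 / h1] as the same diplotype, so homozygous SNP genotypes do not produce duplicates.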
Principle of the Algorithm
1. List all diplotypes (pairs of haplotypes) that are consistent with diploid numbers (the total numbers)
Single Nucleotide Variations in CNVs (SNVCs)
Sample   SNVC 1: Dip. # of A   SNVC 1: Dip. # of C   SNVC 2: Dip. # of G   SNVC 2: Dip. # of T   Possible diplotypes [haplotype / haplotype]
Ind. 1   1                     1                     2                     0                     [AG / CG] OR [-- / AG, CG]
Ind. 2   2                     1                     2                     1                     ...
Ind. 3   1                     0                     0                     1                     ...
“,”: separator symbol between copies
“-”: deletion
Diploid numbers: RETINA technique, HMM in microarray
[Figure: copies carrying the bases ..A..T.., ..C..G.., and ..A..G.. at SNVC1 and SNVC2 in Ind. 1 to 3]
(Kato et al, 2008)
Principle of the Algorithm
2. Repeat E- and M-steps to estimate haplotype frequencies
– Using possible diplotypes obtained at the previous step
Iteration
– Start: giving arbitrary values to haplotype frequencies
– E step: Haplotype frequencies → Diplotype frequencies → Update the weights
– M step: Number of haplotypes, considering the weights → Haplotype frequencies
Sample   Possible diplotype          Weight (probability that the sample takes this diplotype)   Diplotype frequency under Hardy-Weinberg equilibrium
Ind. 1   haplotype 1 / haplotype 1   w11                                                         F(h1 / h1) = 1F(h1)F(h1)
         haplotype 1 / haplotype 2   w12                                                         F(h1 / h2) = 2F(h1)F(h2)
         haplotype 2 / haplotype 3   w13                                                         F(h2 / h3) = 2F(h2)F(h3)
         ...                         ...                                                         ...
Ind. 2   haplotype 1 / haplotype 1   w21                                                         F(h1 / h1) = 1F(h1)F(h1)
         ...                         ...                                                         ...
F(x): frequency of x
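The E and M steps can be sketched for the simplest case, where a haplotype is just an allelic copy number. This is a minimal illustration under stated assumptions, not the published implementation: the cap `max_copies`, the uniform starting frequencies, the fixed number of iterations, and both function names are hypothetical choices.

```python
from collections import defaultdict


def possible_diplotypes(n, max_copies):
    """Unordered pairs (x, y), x <= y <= max_copies, with x + y == n."""
    return [(x, n - x) for x in range(n // 2 + 1) if n - x <= max_copies]


def em_haplotype_frequencies(diploid_numbers, max_copies=3, iterations=200):
    """Estimate frequencies of allelic copy numbers from diploid numbers."""
    candidates = [possible_diplotypes(n, max_copies) for n in diploid_numbers]
    if any(not c for c in candidates):
        raise ValueError("a diploid number exceeds 2 * max_copies")
    # Arbitrary starting values: uniform over copy numbers 0..max_copies.
    freq = {c: 1.0 / (max_copies + 1) for c in range(max_copies + 1)}
    for _ in range(iterations):
        counts = defaultdict(float)
        for diplotypes in candidates:
            # E step: diplotype frequencies under Hardy-Weinberg equilibrium,
            # normalised within the individual to give the weights.
            hw = [(1 if x == y else 2) * freq[x] * freq[y] for x, y in diplotypes]
            total = sum(hw)
            for (x, y), f in zip(diplotypes, hw):
                weight = f / total
                counts[x] += weight
                counts[y] += weight
        # M step: weighted haplotype counts -> haplotype frequencies.
        n_haplotypes = 2 * len(candidates)
        freq = {c: counts[c] / n_haplotypes for c in range(max_copies + 1)}
    return freq


# Diploid numbers of Ind. 1 to 4 from the copy-number example.
estimated = em_haplotype_frequencies([3, 0, 2, 1])
print(estimated)
```

The weights of each individual sum to one, so every individual contributes exactly two haplotypes to the counts and the estimated frequencies sum to one after each M step.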
Application
• Population differentiation, Fst
• CNV regions with high Fst (indicating natural selection)
– Microarray data for CEU and YRI populations (90 individuals each)
– Frequencies of allelic copy numbers were estimated.
Chr Start End Fst Overlapping gene
1 149,365,093 149,419,009 0.427 LCE3A/3B/3C/3D/3E, late cornified envelope
2 3,701,609 3,727,783 0.121 None
2 34,576,366 34,662,239 0.469 None
4 10,063,092 10,086,289 0.106 ZNF518B, zinc finger protein 518B
4 34,595,900 34,663,168 0.485 None
4 187,464,847 187,498,012 0.373 CYP4V2, cytochrome P450
6 32,060,463 32,136,004 0.209 CYP21A2, cytochrome P450
8 120,216,553 120,271,645 0.106 Collectin sub-family member 10
14 81,562,743 81,591,452 0.588 None
15 25,588,301 25,606,455 0.116 None
22 17,921,878 18,002,715 0.618 CLDN5, claudin 5 transcript variant 2
(Kato et al, 2009)
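Given estimated frequencies of allelic copy numbers in two populations, Fst can be computed as sketched below. The sketch assumes the form Fst = 1 − Hs/Ht, with Hs and Ht weighted by the sample sizes of the two populations; the function name `fst` and the example frequencies are hypothetical and are not taken from the table above.

```python
def fst(freq_a, freq_b, n_a, n_b):
    """F_st = 1 - H_s / H_t for two populations A and B.

    freq_a[i], freq_b[i]: frequency of allele i (for example allelic copy
    number i) in A and B. n_a, n_b: sample sizes used as weights.
    """
    n = n_a + n_b
    # H_s: heterozygosity averaged within populations.
    h_s = 1.0 - sum((n_a * p * p + n_b * q * q) / n for p, q in zip(freq_a, freq_b))
    # H_t: heterozygosity of the pooled population.
    h_t = 1.0 - sum(((n_a * p + n_b * q) / n) ** 2 for p, q in zip(freq_a, freq_b))
    if h_t == 0.0:
        return 0.0  # both populations fixed for the same allele
    return 1.0 - h_s / h_t


# Hypothetical frequencies of allelic copy numbers 0 and 1 in two populations.
same = fst([0.5, 0.5], [0.5, 0.5], 90, 90)
fixed = fst([1.0, 0.0], [0.0, 1.0], 90, 90)
print(same, fixed)
```

Identical frequencies give a value of 0 and fixed differences give 1, so regions with high values stand out as strongly differentiated between the two populations.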
Application
• LD (association) between CNV and SNP
[Figure: LD (V²) and LD (R²) plotted against distance (kb, 0 to 80) in CEU and YRI, for bi-allelic and tri-allelic (del, dup) CNVs. Series: CNV−SNP; CNV−SNP (permuted); SNP (adjusted)−SNP; CNV−SNP, larger deletion freq; CNV−SNP, relatively larger deletion freq; CNV−SNP, larger duplication freq; CNV−SNP, relatively larger duplication freq]
(Kato et al, 2009)
Application
• RETINA data for CEU and YRI populations (90 individuals each)
– Estimation using only copy numbers
[Figure: allele frequencies of copy numbers in CEU and YRI for CYP2D6 (0, 1, 2 copies) and MRGPRX1 (0 to 5 copies)]
– Estimation using both bases and copy numbers (SNVCs)
[Figure: haplotype frequencies of SNVC haplotypes in CEU and YRI for MRGPRX1 and CYP2D6]
• More information in SNVC than in only copy number
(Kato et al, 2008)
Future Issues
• CNV association studies
– Find CNV regions associated with a disease
• Currently, they are based on diploid numbers of copies (or categorized numbers like “2 copies”).
– It wouldn't be necessary to infer haplotypes, as long as only copy numbers are examined.
• However, it would be necessary to infer haplotypes, if SNVCs are associated with a disease
– More complex, hard to analyze without haplotype inference
– SNVCs = SNPs + copy number changes
• Even SNPs alone carry a significant risk for diseases
Future Issues
• Issues in the methodology
– Use of pedigree information
– Methods based on other algorithms
• EM has a limitation.
• Gibbs sampling, coalescent-based sampling, ...
– Assumption of the Hardy-Weinberg equilibrium
– Errors in microarray data
Conclusions
• Human genome to human variations/polymorphisms
– SNP
– CNV
• Experimental technologies for CNV diploid numbers
• CNV haplotype inference
– Deterministic approach
– Statistical approach
• Applications to population genetics
• Future issues
– Applications to CNV disease association studies – SNVC
– Overcoming limitations of the current algorithms
Acknowledgments
• CSHL
– Michael Q. Zhang
– Anthony Leotta
• Univ. of Tokyo
– Hiroyuki Aburatani
– Shumpei Ishikawa
• RIKEN
– Tatsuhiko Tsunoda
– Naoya Hosono
– Takahisa Kawaguchi
– Reiichiro Nakamichi
– Michiaki Kubo
– Naoyuki Kamatani
– Yusuke Nakamura
• Affymetrix
– Keith Jones
– Michael Shapero
Funding:
– National Cancer Institute
– Japan Society for the Promotion of Science
END