INFERRING HAPLOTYPES OF COPY NUMBER VARIATIONS

Mamoru Kato

Cold Spring Harbor Laboratory, USA

Background

• 2003 – The complete sequence of the human genome was released by the International Human Genome Sequencing Consortium.
  – Coverage ~99%; accuracy >99.99%

• However, this complete sequence is an “average” sequence (derived from the DNA samples of multiple individuals).

• Little information on variation/polymorphism in the sequence among multiple individuals

Background

• The International HapMap Project started after the Human Genome Project to address variation/polymorphism in human sequences
  – Focused on single nucleotide polymorphism (SNP), the simplest polymorphism
  – Catalogued SNP genotypes for 270 individuals in three ethnic populations (Asian, African, European) at 3 million SNP loci
  – Medical application as well as biological investigation

• 2005 – Phase I was published.
• 2007 – Phase II was published.

Background

• An achievement of the HapMap Project: linkage disequilibrium (LD)

• LD – statistical association between different loci

• Promotes genome-wide disease association studies

[Figure: LD between neighboring SNPs]

• Genome-wide association studies
  – Find SNPs associated with a disease on the genomic scale
  – Many diseases: diabetes, rheumatoid arthritis, myocardial infarction, Crohn's disease, ...

Background

• While the HapMap Project was ongoing, more complex types of genetic variation than SNPs were recognized.
  – 2004: Copy number variation (Sebat et al; Iafrate et al)
    • By microarrays
    • Until then, this variation was believed to be rare in normal individuals
  – 2005: Inversion (Stefansson et al)

• These variations are collectively called structural variations.

[Figure: examples on chromosome 10 and chromosome 3]

Structural Variations

• Structural variation – variation with a long length
  – Variation in which sequence segments >1 kb in size are involved
  – Copy number variation, inversion, translocation

[Figure: pairs of homologous chromosomes (one from father, one from mother) in Individuals 1 to 3, showing structural variations >1 kb]

Copy Number Variation

• Copy Number Variation (CNV)
  – The simplest type of structural variation
  – Difference in the number of copies of a segment >1 kb

[Figure: CNV in Individuals 1 to 3, represented by a reference genome; labels: CNV region, CNV segment (>1 kb)]

CNV

• Since the studies in 2004, many studies have been performed on the genomic scale for many individuals.
  – CNV regions cover 4-6% of the human genome (3,000-6,000 regions)
    • cf. Common SNPs: 0.3% in coverage, 10 million in number
  – CNV regions often include entire genes and their regulatory regions.
    • e.g., CCL3L1 gene – HIV
  – CNVs are likely to influence human phenotypes such as disease susceptibility.
    • Autoimmunity, autism, psoriasis, schizophrenia, ...

CNV

• The principle of CNV detection
  – By microarrays and quantitative PCR (excl. paired-end mapping)

[Figure: signal intensity versus genomic location for Individuals 1 to 3]

Problem

• These techniques cannot discern the configurations of genotypes (pairs of alleles) of CNV
  – Allele: copy number 1; allele: copy number 2
  – Observed only as total number 3

[Figure: Individuals 1 and 2 with different pairs of alleles, both observed as total number 3; plots of signal intensity for A versus signal intensity for G]

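Problem

• These techniques cannot discern the configurations of genotypes (pairs of alleles) of CNV
  – One individual: alleles [G] and [A, A], observed as (# A, # G) = (2, 1)
  – Another individual: alleles [A] and [G, A], observed as (# A, # G) = (2, 1)

Problem

• This is problematic – most theories in population genetics are constructed based on alleles (or haplotypes), not on the total numbers.

  – Population differentiation

    \[ F_{st} = 1 - H_s / H_t \]

    \[ H_t = 1 - \sum_i \left( \frac{N_A p_i + N_B q_i}{N_A + N_B} \right)^2 \]

    \[ H_s = \frac{N_A \left(1 - \sum_i p_i^2\right) + N_B \left(1 - \sum_i q_i^2\right)}{N_A + N_B} \]

  – Linkage disequilibrium

    \[ R^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(h_{ij} - p_i q_j)^2}{p_i q_j} \]

  – p_i, q_i, q_j, h_{ij}: frequencies of alleles (haplotypes) in a population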
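As a rough illustration of why allele (haplotype) frequencies are needed, the two measures on this slide can be sketched in a few lines. This is only a sketch under assumptions: it takes F_st = 1 - H_s / H_t with H_s the sample-size-weighted mean heterozygosity of two populations A and B and H_t the heterozygosity of the pooled frequencies, and R^2 as the sum over allele pairs of (h_ij - p_i q_j)^2 / (p_i q_j). The function names and the example frequencies are hypothetical, not taken from the talk.

```python
def heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(f * f for f in freqs)


def fst(p, q, n_a, n_b):
    """F_st = 1 - H_s / H_t for two populations.

    p, q: allele frequencies in populations A and B (same allele order).
    n_a, n_b: sample sizes of the two populations (assumed weights).
    """
    total = n_a + n_b
    # H_s: weighted mean of the within-population heterozygosities
    h_s = (n_a * heterozygosity(p) + n_b * heterozygosity(q)) / total
    # H_t: heterozygosity of the pooled allele frequencies
    pooled = [(n_a * pi + n_b * qi) / total for pi, qi in zip(p, q)]
    h_t = heterozygosity(pooled)
    return 1.0 - h_s / h_t


def r_squared(h, p, q):
    """LD between two loci; h[i][j] is the frequency of haplotype (i, j).

    Here p and q are read as allele frequencies at the first and the
    second locus (an assumption about the slide's notation).
    """
    return sum(
        (h[i][j] - p[i] * q[j]) ** 2 / (p[i] * q[j])
        for i in range(len(p))
        for j in range(len(q))
    )


# Hypothetical frequencies of two allelic copy numbers in two populations
print(fst([0.9, 0.1], [0.1, 0.9], 90, 90))
# Hypothetical haplotype table with complete association between two loci
print(r_squared([[0.5, 0.0], [0.0, 0.5]], [0.5, 0.5], [0.5, 0.5]))
```

Both functions take frequencies of alleles or haplotypes as input, which is exactly what total copy numbers alone do not provide.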

Problem

• Some method is required to get information on alleles/haplotypes from observed data

• Haplotype inference for CNVs

• It handles
  – >50 individuals
  – One locus to the genomic level

Haplotype Inference for CNVs

• Deterministic approach (Redon et al, 2006; McCarroll et al, 2006; Hinds et al, 2006)
  – One state from observed data

• Statistical approach (Kato et al, 2008; Kato et al, 2008; Shindo et al, 2009)
  – Multiple states from observed data

Deterministic Approach

• A simple approach to infer alleles – clustering the signal intensities of individuals (Redon et al, 2006)
  – If signal intensities are grouped into three clusters, they correspond to two homozygotes and one heterozygote
  – Assumption: two different alleles

[Figure: CNV signal intensities of Ind. 1 to 4 grouped into three clusters (homo, hetero, homo) for Allele 1 and Allele 2]

(Redon et al, 2006)
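The clustering idea can be sketched as follows. This is not the procedure of Redon et al; it is a minimal one-dimensional k-means stand-in, assuming one signal intensity per individual and exactly two alleles, so that the lowest, middle, and highest clusters are read as homozygote, heterozygote, and homozygote. The function name and the intensity values are hypothetical.

```python
def cluster_intensities(values, k=3, iterations=50):
    """Assign each signal intensity to one of k clusters (1-D k-means).

    Returns one label per value: 0 for the lowest-intensity cluster,
    k - 1 for the highest.
    """
    ordered = sorted(values)
    # Deterministic start: centers spread evenly over the sorted data
    centers = [ordered[(i * (len(ordered) - 1)) // (k - 1)] for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centers[c]))

    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v)].append(v)
        # Keep the old center when a cluster is empty
        updated = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated

    # Relabel clusters so that labels follow increasing intensity
    rank = {c: r for r, c in enumerate(sorted(range(k), key=lambda c: centers[c]))}
    return [rank[nearest(v)] for v in values]


GENOTYPES = ["homozygote (allele 1)", "heterozygote", "homozygote (allele 2)"]

intensities = [0.10, 0.12, 0.50, 0.52, 0.90, 0.95]  # hypothetical values
for value, label in zip(intensities, cluster_intensities(intensities)):
    print(value, GENOTYPES[label])
```

Each individual receives exactly one genotype, which is the sense in which this approach yields one state from the observed data.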

Statistical Approach

• Statistical approach (Kato et al, 2008; Kato et al, 2008; Shindo et al, 2009)
  – Consider multiple possible states consistent with the total numbers (which I call diploid numbers)
  – Condition: diploid numbers are observed.
  – Unrelated individuals (as opposed to pedigrees)

• This approach is realized with the expectation-maximization (EM) algorithm

Statistical Approach

• The EM algorithm is a statistical method to estimate parameters, often used for data with unobserved parameters.
  – Unobserved parameters here: diplotypes (pairs of haplotypes), since their configurations are not experimentally determined

1. First, it handles the unobserved data as if they were observed, by utilizing the observed data on other parameters
  – Observed parameters here: diploid numbers (the total numbers over a genotype)

2. Second, it iterates the E and M steps to increase estimation accuracy.

Principle of the Algorithm

1. List all diplotypes (pairs of haplotypes) that are consistent with diploid numbers (the total numbers)

Sample   Diploid number of copies   Possible diplotypes [x copies / y copies]
Ind. 1   3                          [0 / 3] OR [1 / 2]
Ind. 2   0                          [0 / 0]
Ind. 3   2                          [0 / 2] OR [1 / 1]
Ind. 4   1                          [0 / 1]

  – Diploid numbers: quantitative PCR, HMM in microarray
  – “/”: separator symbol between haplotypes

[Figure: CNV in Ind. 1 to 4]

(Kato et al, 2008)
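The listing step can be sketched as a small enumeration. This is only an illustration: the cap on allelic copy number (`max_allelic_copies`, set to 3 here so that the output matches the table) is an assumption, and the function name is hypothetical.

```python
def possible_diplotypes(diploid_number, max_allelic_copies=3):
    """List unordered pairs (x, y) of allelic copy numbers with x + y equal
    to the observed diploid number and x <= y <= max_allelic_copies."""
    return [
        (x, diploid_number - x)
        for x in range(diploid_number // 2 + 1)
        if diploid_number - x <= max_allelic_copies
    ]


# Diploid numbers of Ind. 1 to 4 from the table
for sample, total in [("Ind. 1", 3), ("Ind. 2", 0), ("Ind. 3", 2), ("Ind. 4", 1)]:
    listing = " OR ".join(f"[{x} / {y}]" for x, y in possible_diplotypes(total))
    print(sample, total, listing)
```

Individuals with diploid number 0 or 1 have a single candidate, while larger totals leave several candidates for the later steps to weigh.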



Haplotype composed of CNV and SNP

         CNV                        SNP                 Possible diplotypes
Sample   Diploid number of copies   [allele / allele]   [haplotype / haplotype]
Ind. 1   1                          [a / t]             [0_a / 1_t] OR [0_t / 1_a]
Ind. 2   0                          [a / a]             ...
Ind. 3   2                          [a / a]             ...
Ind. 4   3                          [t / t]             ...

  – “_”: separator symbol between loci

[Figure: CNV and SNP in Ind. 1 to 4, with SNP genotypes at, aa, aa, tt]

(Kato et al, 2008)
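When a SNP is added, each split of the diploid number can be combined with either phase of the SNP genotype. The sketch below enumerates these candidates using the slide's "copies_allele" notation; the cap of 3 copies per haplotype and the function name are assumptions made for illustration.

```python
def cnv_snp_diplotypes(diploid_number, snp_genotype, max_allelic_copies=3):
    """Enumerate unordered haplotype pairs consistent with a CNV diploid
    number and an unphased SNP genotype such as ("a", "t")."""
    first, second = snp_genotype
    found = set()
    # Trying every ordered split (x, y) covers both phases of the SNP
    for x in range(diploid_number + 1):
        y = diploid_number - x
        if max(x, y) > max_allelic_copies:
            continue
        pair = tuple(sorted((f"{x}_{first}", f"{y}_{second}")))
        found.add(pair)
    return sorted(found)


# Ind. 1 from the table: diploid number 1, SNP genotype [a / t]
for hap1, hap2 in cnv_snp_diplotypes(1, ("a", "t")):
    print(f"[{hap1} / {hap2}]")
```

For Ind. 1 this yields the two candidates listed in the table; homozygous SNP genotypes add no extra phase ambiguity, so only the copy number split varies.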

Single Nucleotide Variations in CNVs (SNVCs)

         SNVC 1                    SNVC 2                    Possible diplotypes
Sample   Dip. # of A   Dip. # of C   Dip. # of G   Dip. # of T   [haplotype / haplotype]
Ind. 1   1             1             2             0             [AG / CG] OR [-- / AG, CG]
Ind. 2   2             1             2             1             ...
Ind. 3   1             0             0             1             ...

  – Diploid numbers of bases: RETINA technique, HMM in microarray
  – “,”: separator symbol between copies
  – “-”: deletion

[Figure: copies carried by Ind. 1 to 3, each copy shown with its bases at SNVC1 and SNVC2, e.g. ..A..G.., ..C..G.., ..A..T..]

(Kato et al, 2008)

2. Repeat E- and M-steps to estimate haplotype frequencies
  – Using possible diplotypes obtained at the previous step

Iteration
  – Start: giving arbitrary values to haplotype frequencies
  – E step: haplotype frequencies → diplotype frequencies → update the weights
  – M step: number of haplotypes, considering the weights → haplotype frequencies

Sample   Possible diplotype          Weight (probability that the sample takes this diplotype)   Diplotype frequency under Hardy-Weinberg equilibrium
Ind. 1   haplotype 1 / haplotype 1   w11                                                          F(h1 / h1) = 1 × F(h1) F(h1)
         ...                         ...                                                          ...
Ind. 2   haplotype 1 / haplotype 1   w21                                                          F(h1 / h1) = 1 × F(h1) F(h1)
         ...                         ...                                                          ...

F(x): frequency of x
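The iteration can be sketched for the simplest case of one CNV locus, where a haplotype is just an allelic copy number. This is a minimal illustration, not the published implementation: it assumes uniform starting frequencies as the arbitrary values, a fixed number of iterations, a cap on allelic copy number (`max_copies`), and Hardy-Weinberg diplotype frequencies with factor 1 for homozygotes and 2 for heterozygotes. All names are hypothetical.

```python
from collections import defaultdict


def possible_diplotypes(diploid_number, max_copies):
    """Unordered pairs (x, y) with x + y == diploid_number, y <= max_copies."""
    return [(x, diploid_number - x)
            for x in range(diploid_number // 2 + 1)
            if diploid_number - x <= max_copies]


def em_haplotype_frequencies(diploid_numbers, max_copies=3, iterations=100):
    """Estimate frequencies of allelic copy numbers from diploid numbers."""
    candidates = [possible_diplotypes(n, max_copies) for n in diploid_numbers]
    if any(not dips for dips in candidates):
        raise ValueError("a diploid number exceeds 2 * max_copies")

    alleles = range(max_copies + 1)
    # Start: arbitrary (here uniform) haplotype frequencies
    freq = {a: 1.0 / (max_copies + 1) for a in alleles}

    for _ in range(iterations):
        counts = defaultdict(float)
        for dips in candidates:
            # E step: diplotype frequencies under Hardy-Weinberg equilibrium,
            # normalized within the sample to give the weights
            raw = [(1 if x == y else 2) * freq[x] * freq[y] for x, y in dips]
            total = sum(raw)
            for (x, y), value in zip(dips, raw):
                weight = value / total
                counts[x] += weight
                counts[y] += weight
        # M step: weighted haplotype counts -> haplotype frequencies
        n_haplotypes = 2 * len(diploid_numbers)
        freq = {a: counts[a] / n_haplotypes for a in alleles}
    return freq


# Diploid numbers of Ind. 1 to 4 from the earlier table
print(em_haplotype_frequencies([3, 0, 2, 1]))
```

With only four individuals the estimates are not meaningful; the example only shows how the weights and frequencies feed into each other.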

Application

• Population differentiation, Fst

• CNV regions with high Fst (indicating natural selection)
  – Microarray data for CEU and YRI populations (90 individuals each)
  – Frequencies of allelic copy numbers were estimated.

Chr   Start         End           Fst     Overlapping gene
1     149,365,093   149,419,009   0.427   LCE3A/3B/3C/3D/3E, late cornified envelope
2     3,701,609     3,727,783     0.121   None
2     34,576,366    34,662,239    0.469   None
4     10,063,092    10,086,289    0.106   ZNF518B, zinc finger protein 518B
4     34,595,900    34,663,168    0.485   None
4     187,464,847   187,498,012   0.373   CYP4V2, cytochrome P450
6     32,060,463    32,136,004    0.209   CYP21A2, cytochrome P450
8     120,216,553   120,271,645   0.106   Collectin sub-family member 10
14    81,562,743    81,591,452    0.588   None
15    25,588,301    25,606,455    0.116   None
22    17,921,878    18,002,715    0.618   CLDN5, claudin 5 transcript variant 2

(Kato et al, 2009)

Application

• LD (association) between a CNV and neighboring SNPs

[Figure: LD versus distance (kb, 0 to 80) in CEU and YRI, for bi-allelic CNVs and tri-allelic (del, dup) CNVs; y-axes R² (LD) and V² (LD). Series: CNV−SNP; CNV−SNP (permuted); SNP (adjusted)−SNP; CNV−SNP, larger deletion freq; CNV−SNP, relatively larger deletion freq; CNV−SNP, larger duplication freq; CNV−SNP, relatively larger duplication freq]

(Kato et al, 2009)

Application

• RETINA data for CEU and YRI populations (90 individuals each)

[Figure: allele frequencies in CEU and YRI for CYP2D6 (0, 1, 2 copies) and MRGPRX1 (0 to 5 copies); estimation using only copy numbers]

[Figure: haplotype frequencies in CEU and YRI for CYP2D6 and MRGPRX1; estimation using both bases and copy numbers (SNVCs)]

• More information in SNVC than in only copy number

(Kato et al, 2008)

Future Issues

• CNV association studies
  – Find CNV regions associated with a disease

• Currently, they are based on diploid numbers of copies (or categorized numbers like “2 copies”).
  – It wouldn't be necessary to infer haplotypes, as long as only copy numbers are examined.

• However, it would be necessary to infer haplotypes if SNVCs are associated with a disease
  – More complex, hard to analyze without haplotype inference
  – SNVCs = SNPs + copy number changes
    • Even SNPs alone carry a significant risk for diseases

Future Issues

• Issues in the methodology
  – Use of pedigree information
  – Methods based on other algorithms
    • EM has a limitation.
    • Gibbs sampling, coalescence-based sampling, ...
  – Assumption of the Hardy-Weinberg equilibrium
  – Errors in microarray data

Conclusions

• Human genome to human variations/polymorphisms
  – SNP
  – CNV

• Experimental technologies for CNV diploid numbers

• CNV haplotype inference
  – Deterministic approach
  – Statistical approach

• Applications to population genetics

• Future issues
  – Applications to CNV disease association studies – SNVC
  – Overcoming limitations of the current algorithms

Acknowledgments

• CSHL
  – Michael Q. Zhang
  – Anthony Leotta

• Univ. of Tokyo
  – Hiroyuki Aburatani
  – Shumpei Ishikawa

• RIKEN
  – Tatsuhiko Tsunoda
  – Naoya Hosono
  – Takahisa Kawaguchi
  – Reiichiro Nakamichi
  – Michiaki Kubo
  – Naoyuki Kamatani
  – Yusuke Nakamura

• Affymetrix
  – Keith Jones
  – Michael Shapero

• Funding
  – National Cancer Institute
  – Japan Society for Promotion of Science

END
