Single Tumor-Normal Pair Parent-Specific Copy Number …helper.ipam.ucla.edu/publications/genws2/genws2_10184.pdfIllumina’s “luster Regression” ... TumorBoost: Normalization

Single Tumor-Normal Pair Parent-Specific Copy Number Analysis

Henrik Bengtsson

Department of Epidemiology & Biostatistics, UCSF

with: Pierre Neuvial, Berkeley/CNRS

Adam Olshen, UCSF

Richard Olshen, Stanford

Venkatraman Seshan, MSKCC

Terry Speed, Berkeley/WEHI

Paul Spellman, LBNL/OHSU

[Figure: example regions: NORMAL REGION, GAIN, COPY-NEUTRAL LOH]

“This presentation has been modified from its original version...” The content of the slides was formatted to fit the upper 3/4 of the screen at IPAM, so that the audience in the back could also see all of it.

Paired PSCBS

-- H Bengtsson, P Neuvial, TP Speed, TumorBoost: Normalization of allele-specific tumor copy numbers from one single tumor-normal pair of genotyping microarrays, BMC Bioinformatics 2010.
-- AB Olshen, H Bengtsson, P Neuvial, PT Spellman, RA Olshen, VE Seshan, Parent-specific copy number in paired tumor-normal studies using circular binary segmentation, Bioinformatics 2011.

Parent-specific copy numbers from a single tumor-normal pair of SNP arrays

1. Tumor-normal pair
2. Genotype normal
3. Normalize tumor using normal
4. Segment tumor CNs in two steps
5. Estimate PSCNs within segments
6. Call segments

Genotypes are observed at single loci

Single nucleotide polymorphism (SNP): 10-20 million known SNPs

[Figure: paternal (♂) and maternal (♀) chromosomes at four SNP loci, with genotypes AB, BB, AB, AA]

Genotypes and total copy numbers reflect the parent-specific copy numbers

[Figure: parental chromosomes at four SNP loci in three scenarios]

Matched Normal (diploid): genotypes AB, BB, AB, AA

Tumor with gain: genotypes AAB, BBB, AB, AA; (C1,C2): (1,2), (1,1)

Tumor with deletion & copy-neutral LOH: (C1,C2): (0,2), (0,1), (1,1)

* Occam's razor: the minimal number of events has occurred.

SNP microarrays quantify total and allele-specific copy numbers

Chip design: probes hybridize to the sample DNA; at a T/C SNP there is one probe per allele.

[Figure: probe and sample DNA duplexes, e.g. CGTGTAATTGAACC paired with GCACATTAACTTGG, and CGTGTAGTTGAACC paired with GCACATCAACTTGG]

Together the SNPs of a region indicate the parent-specific copy numbers

(1 individual, many SNPs, 2 different regions)

[Figure: (CA, CB) scatter plots. NORMAL (1,1): clusters AA, AB, BB. GAIN (1,2): clusters AAA, AAB, ABB, BBB.]

Total CN: C = CA + CB

Total CNs and allele B fractions are easier to work with than ASCNs

Total CN: C = CA + CB
BAF: β = CB / C

NORMAL (1,1): genotypes AA, AB, BB have BAFs 0, 1/2, 1
GAIN (1,2): genotypes AAA, AAB, ABB, BBB have BAFs 0, 1/3, 2/3, 1

[Figure: total CN (C) versus BAF for each region]
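The genotype-to-BAF values above follow directly from the two definitions. As a minimal sketch (the function name `total_cn_and_baf` is made up for this illustration and is not part of any aroma package), they can be reproduced with exact fractions:

```python
from fractions import Fraction

def total_cn_and_baf(genotype):
    """Return (C, beta) for a genotype string such as "AAB"."""
    c_a = genotype.count("A")   # copies of allele A
    c_b = genotype.count("B")   # copies of allele B
    c = c_a + c_b               # total CN: C = CA + CB
    if c == 0:
        raise ValueError("BAF is undefined when C = 0")
    return c, Fraction(c_b, c)  # BAF: beta = CB / C

# NORMAL (1,1) genotypes followed by GAIN (1,2) genotypes
for g in ["AA", "AB", "BB", "AAA", "AAB", "ABB", "BBB"]:
    c, beta = total_cn_and_baf(g)
    print(g, c, beta)
```

Real array data give noisy, continuous (CA, CB) estimates rather than integer counts; the sketch only shows the idealized levels.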

Total CNs and BAFs reflect the underlying parent-specific CNs

Total CN: C = CA + CB
Allele B Fraction: β = CB / C

[Figure: total CN and BAF tracks across NORMAL (1,1), GAIN (1,2), and COPY-NEUTRAL LOH (0,2) regions, with reference levels CN=2 and CN=3, and 0%, 50%, and 100% B:s]

Matched tumor-normals

- With a matched normal it is easier, because we can genotype the normal and find the heterozygous SNPs.
- Also, much greater SNRs.

Heterozygous SNPs (not homozygous) are informative for PSCNs

1. Genotypes (AA, AB, BB) from BAFs of a matched normal
2a. Total CNs: C = CA + CB (all loci)
2b. Tumor BAFs: β = CB / C (SNPs only)
3. Decrease in Heterozygosity: ρ = 2*|β - 1/2| (hets only)

[Figure: normal BAF track with genotype bands AA, AB, BB, followed by tumor total CN, BAF, and DH tracks]

Total CNs & DHs segmentation gives us PSCN regions and estimates

Total CNs: C = CA + CB
Decrease in Heterozygosity: ρ = 2*|β - 1/2| (hets only)

(i) Find change points
(ii) Estimate mean levels: C as avg(all loci), ρ as avg(hets only)

Per-segment PSCNs (C1,C2):
C1 = 1/2 * (1 - ρ) * C
C2 = C - C1

[Figure: total CN and DH tracks across NORMAL (1,1), GAIN (1,2), and CN-LOH (0,2) regions]
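The per-segment estimate can be written out in a few lines. This is a minimal sketch, assuming the change points are already known and that each segment comes with its total CNs (all loci) and its tumor BAFs at SNPs that are heterozygous in the normal; the function names are hypothetical and are not the PSCBS package API.

```python
from statistics import mean

def decrease_in_heterozygosity(beta):
    """DH for one heterozygous SNP: rho = 2 * |beta - 1/2|."""
    return 2.0 * abs(beta - 0.5)

def segment_pscn(total_cns, het_tumor_bafs):
    """Estimate (C1, C2) for one segment.

    total_cns: total CN estimates at all loci in the segment.
    het_tumor_bafs: tumor BAFs at SNPs heterozygous in the normal.
    """
    c = mean(total_cns)  # avg(all loci)
    rho = mean([decrease_in_heterozygosity(b) for b in het_tumor_bafs])  # avg(hets only)
    c1 = 0.5 * (1.0 - rho) * c  # minor parental copy number
    c2 = c - c1                 # major parental copy number
    return c1, c2

# Noise-free GAIN segment: C = 3 everywhere, het BAFs at 1/3 or 2/3
print(segment_pscn([3, 3, 3, 3], [1 / 3, 2 / 3, 1 / 3]))
```

For the noise-free segments on the slide, this gives approximately (1, 1) for NORMAL, (1, 2) for GAIN, and (0, 2) for CN-LOH.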

It is hard to infer PSCNs reliably when signals are noisy

Actual data:

[Figure: noisy total CN and DH tracks]

Segmentation may fail…

Let’s improve this...

CalMaTe

M Ortiz-Estevez, A. Aramburu, H. Bengtsson, P. Neuvial, & A. Rubio. A calibration method to improve allele-specific copy number estimates from SNP microarrays (submitted).

Better allele-specific copy numbers in tumors without matched normals by borrowing across many samples

Features:
• Multiple (> 30) samples.
• Any SNP microarray platform.
• Bounded memory usage (< 1GB of RAM)

More: http://www.aroma-project.org/

The noise is due to SNP-specific effects that we can estimate and remove

Example: (CA,CB) for 310 samples per SNP. Systematic effects… are SNP specific!

[Figure: (CA,CB) scatter plots for SNP #1072 and SNP #1053]

Allele B fractions (BAFs): The bias is greater than the noise

Example: (CA,CB) for 310 samples per SNP. TCN: between 2 arrays. BAF: within array.

[Figure: SNP #1053]

CalMaTe

Multi-sample model (one per SNP): fit affine transform across samples.

Multi-sample method for each SNP separately:
• Non-negative Matrix Factorization (NMF).
• Robustified against outliers (e.g. tumors).
• Special cases: Only one or two genotype groups.

Related ideas: Illumina’s “Cluster Regression”, CRLMM CNs (*RLMM, …), …

Improved SNR of BAFs (and total CNs) when removing SNP-specific variation

Estimate & backtransform. Repeat for all 1,000,000 SNPs.

[Figure: chromosomal plot, before and after, for one of the 310 samples]

TumorBoost

H. Bengtsson, P. Neuvial, T.P. Speed. TumorBoost: Normalization of allele-specific tumor copy numbers from one single tumor-normal pair of genotyping microarrays, BMC Bioinformatics, 2010.

Better allele-specific copy numbers in tumors with matched normals

Requirements:
• Matched tumor-normal pairs.
• A single pair is enough.
• Any SNP microarray platform.
• Bounded memory usage (< 1GB of RAM)

More: http://www.aroma-project.org/

The tumor “should be” close to its normal

One SNP, a tumor-normal pair

When we have only a single tumor-normal pair:
(i) Normal should be at e.g. (1,1) …so let's move it there!
(ii) Adjust the tumor in a “similar” direction.

[Figure: (CA,CB) for one SNP before and after TumorBoost; tracks of C, βT, and βN in a NORMAL REGION]

The tumor “should be” close to the normal, and the data strongly agree!

For each genotype: Cor(βT, βN) ≈ 1

[Figure: βT versus βN, by genotype (AA, AB, BB)]

A shared SNP effect: systematic variation

[Figure: βT versus βN; the SNP effect δ in the normal and the correction δ* in the tumor move (βN, βT) to (βN,TRUE, βT,TBN)]

The SNP effect can be estimated & removed for each SNP independently!

Observed allele B fractions: βN ∈ [0,1], βT ∈ [0,1]
Genotype calls (AA,AB,BB): βN,TRUE ∈ {0, 0.5, 1}
Estimate from normal: SNP effect δ = βN - βN,TRUE
Remove from tumor: βT,TBN = βT - δ*

1. Estimate SNP effect in the normal and its genotypes
2. Remove SNP effect from the tumor
3. Repeat for all SNPs.

[Figure: βT versus βN by genotype (AA, AB, BB), showing δ, δ*, and βT,TBN]
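The three steps above can be sketched in code. This is a simplified illustration, not the aroma.light or aroma.cn implementation. It assumes δ* = δ, whereas the slides only say the tumor is adjusted in a “similar” direction, so the actual δ* may be a rescaled version of δ. The genotype caller with thresholds at 1/3 and 2/3 is also an assumption made for this example.

```python
def call_genotype(beta_n, lower=1 / 3, upper=2 / 3):
    """Naive genotype call from a normal BAF; returns beta_N,TRUE.

    The thresholds are hypothetical and chosen only for illustration.
    """
    if beta_n < lower:
        return 0.0  # AA
    if beta_n > upper:
        return 1.0  # BB
    return 0.5      # AB

def tumorboost_simple(beta_t, beta_n):
    """Remove the SNP effect estimated in the normal from the tumor BAFs."""
    normalized = []
    for b_t, b_n in zip(beta_t, beta_n):
        delta = b_n - call_genotype(b_n)  # step 1: SNP effect in the normal
        normalized.append(b_t - delta)    # step 2: assumes delta* = delta
    return normalized                     # step 3: done for every SNP

# Three SNPs called AA, AB, and BB in the normal
beta_n = [0.10, 0.60, 0.90]
beta_t = [0.12, 0.75, 0.88]
print(tumorboost_simple(beta_t, beta_n))
```

Each SNP is handled independently, and only the tumor BAFs are changed; the normal BAFs are used solely to estimate δ.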

TumorBoost removes the SNP effects from the tumor (only)

[Figure: tumor BAFs before and after TumorBoost]


Even with a single tumor-normal pair, we can greatly improve the SNR

Estimate & backtransform. Repeat for all 1,000,000 SNPs.

[Figure: before and after]

TumorBoost => more distinct (CA,CB) - key for PSCN segmentation

TumorBoost:
- single-pair
- tumor-normals
- normal is not corrected

CalMaTe:
- multi-sample

[Figure: (CA,CB) for NORMAL (1,1), GAIN (1,2), and CN-LOH (0,2) regions: Original, TumorBoost, and CalMaTe]

TumorBoost and CalMaTe significantly improve power to detect change points

Assessment: 1 sample, 1 change point

[Figure: DH for Original, TumorBoost (single pair), and CalMaTe (multi-sample)]

Paired PSCBS

Parent-specific copy numbers from a single tumor-normal pair of SNP arrays

1. Tumor-normal pair
2. Genotype normal
3. Normalize tumor using normal
4. CBS segment tumor: (a) TCN, then (b) DH
5. Estimate PSCNs within segments
6. Call segments


Calling allelic balance and LOH

Calling allelic balance:
• Null: C1 = C2 (equivalent to DH = 0)
• The DH estimate is biased when the true DH is near 0, so we need an offset ΔAB in the test.
• Reject null if the α-th percentile of the bootstrap-estimated DH, minus ΔAB, is > 0.
• How do we choose ΔAB?

Calling LOH:
• Null: C1 > 0 (“not in LOH”)
• C1 is estimated with bias due to background (e.g. normal contamination), so we need an offset ΔLOH in the test.
• Reject null if the (1-α)-th percentile of the bootstrap-estimated C1, minus ΔLOH, is < 0.
• How do we choose ΔLOH?
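The two decision rules can be sketched as follows. This is a minimal illustration under stated assumptions: the bootstrap resamples loci within a segment and takes the mean, the percentile is a simple nearest-rank estimate, and all function names are hypothetical rather than taken from the PSCBS package. The offsets ΔAB and ΔLOH are left as arguments because the slides leave their choice open.

```python
import math
import random
from statistics import mean

def bootstrap_means(values, n_boot=1000, seed=0):
    """Bootstrap a segment mean by resampling its loci with replacement."""
    rng = random.Random(seed)
    n = len(values)
    return [mean(rng.choices(values, k=n)) for _ in range(n_boot)]

def percentile(estimates, q):
    """Nearest-rank q-th percentile (0 < q <= 1) of a list of estimates."""
    ordered = sorted(estimates)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def call_allelic_imbalance(dh_boot, delta_ab, alpha=0.05):
    """Reject 'C1 = C2' if the alpha-th percentile of DH minus delta_ab > 0."""
    return percentile(dh_boot, alpha) - delta_ab > 0

def call_loh(c1_boot, delta_loh, alpha=0.05):
    """Reject 'C1 > 0' if the (1-alpha)-th percentile of C1 minus delta_loh < 0."""
    return percentile(c1_boot, 1 - alpha) - delta_loh < 0

# Hypothetical per-locus DH values for one segment, and a made-up offset
dh_boot = bootstrap_means([0.30, 0.32, 0.28, 0.31, 0.29] * 10)
print(call_allelic_imbalance(dh_boot, delta_ab=0.1))
```

Using a low percentile for DH and a high percentile for C1 makes both tests conservative: a segment is called only when nearly all bootstrap estimates clear the offset.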

Results

PSCBS works with any SNP array - similar results on Affymetrix and Illumina

[Figure: PSCBS results on Affymetrix GenomeWideSNP_6 and Illumina HumanHap550]

Other methods exist, e.g. Paired BAF segmentation

Paired BAF (Staaf et al., 2008) is a paired method.

Algorithm:
1. Genotype normal sample
2. Drop homozygous SNPs
3. Segment “mirrored BAF” (like DH)
4. Estimate parent-specific copy numbers

Paired PSCBS performs very well compared to other PSCN methods

Assessment of calls:
- Staaf simulated data set.
- Known regions.
- Different amounts of normal contamination.
- Keep FP rates at 0.0%.
- TP rate of calls.

Methods are available (www.aroma-project.org)

Preprocessing:
• Affymetrix: ASCRMAv2 (single-array) [aroma.affymetrix]
• Illumina: <elsewhere>

Normalization of ASCNs:
• Single tumor-normal pair: TumorBoost [aroma.light, aroma.cn]
• Multiple samples: CalMaTe [CalMaTe]

PSCN segmentation:
• Single tumor-normal pair: Paired PSCBS [PSCBS]
• No matched normals: <we’re working on it>

Everything is bounded in memory (< 1GB of RAM)

Conclusions

Paired PSCBS w/ TumorBoost:
• High quality tumor PSCNs
• Single tumor-normal pair
• No external references needed
• Any SNP microarray technology
• Algorithm is fast and bounded in memory

Future:
• Non-paired PSCBS
• Calibration of PSCN states (e.g. clonality & ploidy)…

Next: We need to calibrate (C1,C2) before calling! (ongoing work with Pierre Neuvial)

[Figure: segment (C1,C2) estimates before calibration (states unclear) and after calibration: (1,2), (0,2), (2,2), (2,2); annotation: clonality!]

Extra slides

The power to detect a change point varies with type of change!

[Figure: change point from NORMAL REGION to GAIN, shown in total CN (TCN) and in Decrease in Heterozygosity (DH), Original versus TumorBoost]

Illumina and Affymetrix have similar noise levels after CalMaTe

Illumina is “better” because it does this calibration; Affymetrix does not.

[Figure: Illumina (Human1M-Duo) and Affymetrix (GenomeWideSNP_6)]

PSCNs can be estimated at each SNP if we know which SNPs are heterozygous

1. Genotypes (AA, AB, BB) from BAFs of a matched normal
2a. Total CNs: C = CA + CB (all loci)
2b. Tumor BAFs: β = CB / C (SNPs only)
3. Decrease in Heterozygosity: ρ = 2*|β - 1/2| (hets only)
4. SNP-specific (C1, C2): C1 = 1/2 * (1 - ρ) * C, C2 = C - C1 (hets only)

[Figure: normal BAF track with genotype bands AA, AB, BB; tumor tracks across regions (1,1), (0,2), (1,2)]
