Pooling Data Across Microarray Experiments Using Different Versions of Affymetrix Oligonucleotide Arrays Jeffrey S. Morris Li Zhang, Chunlei Wu, Keith.

Pooling Data Across Microarray Experiments Using Different Versions of Affymetrix Oligonucleotide Arrays

Jeffrey S. MorrisLi Zhang, Chunlei Wu, Keith

Baggerly and Kevin CoombesUT MD Anderson Cancer

CenterHouston, TX, USA

Combining Information across Microarray Studies

Many publicly available microarray data sets Can combine information across studies to:1. Validate results from individual studies

Find intersection of differentially expressed genes Build model using one study, validate using another

2. Discover new biological insights by analyses pooling data across studies.

Potential for increased statistical power Important since many individual studies are

underpowered.

Pooling Data across Studies

Challenge: In general, microarray data from different studies not comparable

Clinical differences Different study populations

Technical differences Laboratory differences: sample collection

and storage, microarray protocol Different platforms: cDNA/oligo, different

versions of same technology (e.g. Affy chips)


Approaches in Existing Literature:1. Include study effects in model

Gene-specific study effects SVD, Distance-weighted DiscriminationDrawback: First-order corrections not enough

2. Model unitless summary measures standardized log fold-change t-statistics probabilities of +/0/- expressionDrawback: Implicit assumptions about

comparability of clinical populations across studies


Sequence-related reasons for incomparability of raw expression levels across platforms:

Cross-hybridization RNA degradation (near 5’ end) Probe validity – map to RefSeq? Alternative splicing

It may be possible that, by taking these into account, we can obtain more comparable raw expression levels to use in pooled analyses

Our focus: combining information across different versions of Affymetrix genechips

Overview of Affymetrix GeneChips

Probes: 25-base sequences from gene of interest

Probesets: set of probes corresponding to same gene.

Obtained from current sequence information in GenBank, Unigene, RefSeq

Generations of human chips: HuGeneFL: 5600 genes, 20 probes/gene U95Av2: 10,000 genes, 16 probes/gene U133A: 14,500 genes, 11 probes/gene

Example: CAMDA Lung Cancer Data

CAMDA: “Critical Assessment of Microarray Data Analysis”: annual conference at Duke University

CAMDA 2003: Two studies relating gene expression data to survival in lung cancer patients

1. Harvard (Bhattacharjee, et al. 2001) 124 lung adenocarcinoma samples

2. Michigan (Beer, et al. 2002) 86 lung adenocarcinoma samples

GOAL: Pool data across studies to identify prognostic genes for lung cancer.

Pooling Information across Studies

Harvard patients – worse prognosis

Pooling Information across Chip Types

Michigan: HuGeneFl Chip 6,633 probe sets -- 20 probe pairs each

Harvard: HG_U95Av2 Chip 12,453 probe sets – 16 probe pairs each

Problems: Different genes Incomparable expression levels

“Partial Probeset” Method

HuGeneFL :HG_U95Av2:

Matching Probes

“Partial Probesets”1. Identify “matching probes”2. Recombine into new probesets based on UNIGENE

clusters, which we refer to as “partial probesets” 3. Eliminate any probesets containing just one or two

probes Note: Any quantification method can subsequently

be used (MAS, dChip, RMA, PDNN)

……

Pooling Information across Chip Types

Quantification of Expression Levels

Gene expressions quantified by applying Li’s PDNN model to our partial probesets Uses probe sequence info to predict patterns of

specific and nonspecific hybridization intensities Allows borrowing of strength across probe sets Model is not overparameterized – O(N probesets)

See Zhang, et al. (2003) Nature Biotech for further details on method and comparison

Detecting Outliers

L54 L88 L89 L90 Other outliers: 6 from Michigan, 2 Harvard

Other preprocessing (remove low expr./normalize) Matching clinical/microarray data for 200 patients

(124 H, 76 M)

Log-scale plots to detect outliers Large spot detected on 4 Michigan chips

Assessing Our Method for Combining Information Across Chip Types

“Partial Probeset” method appears to give comparable expression levels across chip types.

Assessing our Method for Combining Information across Chip Types

Median “partial probeset” size is 7, vs. 16 or 20Loss of precision?

No evidence of significant precision loss

Assessing “Partial Probeset” Method

Agreement in relative quantifications across samples



Less variable genes worse




Eliminate genes with sd<0.20 or r<0.90




Eliminate genes with sd<0.20 or r<0.90

1,036 genes

Identifying Prognostic Genes

1. Preprocess raw microarray data Outlier Detection, Normalization, Quantification,

Remove Some Genes? Left with “n-by-p” matrix of expression levels

for p genes on n microarrays.

2. Identify which genes are correlated to outcome of interest

Perform standard statistical test for each gene – obtain (permutation) p-values

Find “cutpoint” on p-values to declare significance that accounts for multiplicities.


After preprocessing: 1036 genes, 200 samples

Identify genes related to survival After adjusting for known clinical

predictors Provide prognostic information on

survival above and beyond clinical predictors

Identifying Prognostic Genes: Cox Regression

Modeling Hazard : (t) ~ Prob(X<t +t | X>t ) Cox Model: i(t) = 0(t) exp(Xi )

Xi = Vector of covariates for subject i = Vector of regression coefficients

Key Assumption: Proportional Hazards Hazard ratio between subjects with different

covariates does not vary over time. i(t )/k(t ) = exp{ (Xi-Xk) } Exp() =Change in hazard per unit change in X

Identifying Prognostic Genes: Cox Regression

Modeling Best Clinical Model:

Factor Exp()

Z p

Study Michigan = 0 Harvard = 1

0.67

1.95 2.73

0.0062

Age 0.03

1.03 2.60

0.0094

Stage Early (1-2) = 0 Late (3-4) = 1

1.53

4.61 6.61

<0.000000001


Series of 1036 multivariable Cox models fit to identify prognostic genes. Each model contained: Study (Michigan=-1, Harvard=1). Age (continuous factor). Stage (early=0/late=1). Probeset (log intensity value as continuous factor).

Exact p-values for each probeset computed using permutation approach

By using multivariate modeling, we search for genes offering prognostic information beyond clinical predictors

Identifying Prognostic Genes: BUM Method

No prognostic genes pvals Uniform Prognostic genes smaller pvals Fit Beta-Uniform mixture to histogram of

p-values – “BUM” method (Pounds and Morris, 2003 Bioinformatics)

Method can be used to identify prognostic genes while controlling FDR

Results Histogram suggests

there are some significant probesets

FDR=0.20 corresponds pval cutoff of 0.0024 (BUM, Pounds and Morris 2003)

26 probesets flagged as significant

Selected Flagged GenesRank Gene p Function

1 FCGRT -2.07 <0.00001

Induced by IF- in treating SCLC

2 ENO2 1.46 0.00001 Marker of NSCLC

4 RRM1 1.81 0.00002 Linked to survival in NSCLC

8 CHKL -1.43 0.00010 Marker of NSCLC

11 CPE 0.72 0.00031 Marker of SCLC

12 ADRBK1 -2.20 0.00044 Co-expressed with Cox-2 in lung ADC

16 CLU -0.52 0.00109 Marker of SCLC

20 SEPW1 -1.29 0.00145 H202 cytotox. in NSCLC cell lines

21 FSCN1 0.66 0.00150 Marker of invasiveness in Stg 1 NSCLC

25 BTG2 -0.75 0.00232 Induced by p53 in SCLC cell lines

Selected Flagged Genes

Rank Gene p Function

1 FCGRT -2.07 <0.00001












Rank Gene p Function

1 FCGRT -2.07 <0.00001 Induced by IF- in treating SCLC










Selected Flagged GenesRan

kGene p pStage Function

1 FCGRT -2.07 <0.00001

0.154


2 ENO2 1.46 0.00001 0.282

Marker of NSCLC

4 RRM1 1.81 0.00002 0.321

Linked to survival in NSCLC

8 CHKL -1.43 0.00010 0.979

Marker of NSCLC

11 CPE 0.72 0.00031 0.088

Marker of SCLC

12 ADRBK1

-2.20 0.00044 0.484

Co-expressed with Cox-2 in PUC

16 CLU -0.52 0.00109 0.014

Marker of SCLC

20 SEPW1 -1.29 0.00145 0.028

H202 cytotox. in NSCLC cell lines

21 FSCN1 0.66 0.00150 0.082

Marker of invasiveness in Stage 1 NSCLC



1 FCGRT -2.07 <0.00001

0.154


2 ENO2 1.46 0.00001 0.282

Marker of NSCLC

4 RRM1 1.81 0.00002 0.321


8 CHKL -1.43 0.00010 0.979

Marker of NSCLC

11 CPE 0.72 0.00031 0.088

Marker of SCLC

12 ADRBK1

-2.20 0.00044 0.484


16 CLU -0.52 0.00109 0.014

Marker of SCLC

20 SEPW1 -1.29 0.00145 0.028


21 FSCN1 0.66 0.00150 0.082




1 FCGRT -2.07 <0.00001

0.154


2 ENO2 1.46 0.00001 0.282

Marker of NSCLC

4 RRM1 1.81 0.00002 0.321


8 CHKL -1.43 0.00010 0.979

Marker of NSCLC

11 CPE 0.72 0.00031 0.088

Marker of SCLC

12 ADRBK1

-2.20 0.00044 0.484

Co-expr with Cox-2 in lung adenocarc

16 CLU -0.52 0.00109 0.014

Marker of SCLC

20 SEPW1 -1.29 0.00145 0.028


21 FSCN1 0.66 0.00150 0.082



Rank

Gene p pStage Function

1 FCGRT -2.07 <0.00001

0.154


2 ENO2 1.46 0.00001 0.282

Marker of NSCLC

4 RRM1 1.81 0.00002 0.321


8 CHKL -1.43 0.00010 0.979

Marker of NSCLC

11 CPE 0.72 0.00031 0.088

Marker of SCLC

12 ADRBK1

-2.20 0.00044 0.484


16 CLU -0.52 0.00109 0.014

Marker of SCLC

20 SEPW1 -1.29 0.00145 0.028


21 FSCN1 0.66 0.00150 0.082



Rank Gene p pStage Function

3 NFRKB

-2.81 0.00001

0.058 Amplified in AML

7 ATIC 1.81 0.00009

0.771 Fusion partner of ALK which defines subtype of ALCL

13 BCL9 -1.64 0.00069

0.057 Over-expressed in ALL

15 TPS1 -0.64 0.00107

0.882 Associated with pulmonary inflammation

25 BTG2 -0.75 0.00232

0.726 Inhibits cell proliferation in primary mouse embryo fibroblasts lacking functional p53



3 NFRKB

-2.81 0.00001


7 ATIC 1.81 0.00009


13 BCL9 -1.64 0.00069


15 TPS1 -0.64 0.00107


25 BTG2 -0.75 0.00232




3 NFRKB

-2.81 0.00001


7 ATIC 1.81 0.00009


13 BCL9 -1.64 0.00069


15 TPS1 -0.64 0.00107


25 BTG2 -0.75 0.00232


Results Our gene list has almost no overlap with other

publications of these data. Reasons:1. We addressed a different research question

Us: ID Genes offering prognostic info beyond clinical Michigan: Univariate Cox models fit; results used to

construct dichotomous “risk index” Harvard: Cluster analysis done; clusters linked to

survival; found genes driving the clustering

2. Pooling across studies yielded significant gains in statistical power.

Most genes (17/26) in our study are not flagged if we analyze 2 data sets separately (i.e. no pooling)

CAMDA 2003 Results

We Won!

Limitations of Partial Probeset Method

Worked well for combining across HuGeneFL/ U95Av2 ~25% probes from HuGeneFL on U95Av2, with

4,101 probesets Not enough matching probes for use with

U95Av2/U133A ~6% of probes from U95Av2 also on U133A,

with only 628 probesets Requiring matching probes strong

criterion, maybe weaker criterion would suffice?

Alternative Splicing

Diagram of C2GnT I gene organization and different mRNA variants of this gene that are differentially expressed across tissue types. From Falkenberg, et al. (2003) Glycobiology 13(6), 411-418.

Full-Length Transcript Based Probesets

New probeset definition (FLTBP): probes match the same set of full-length mRNA sequences

Procedure 1. Construct comprehensive library of full-length mRNA

transcript sequences from RefSeq and HinvDB 2. For each probe, identify all matching full-length

transcripts using Blast programU95Av2: 15% matched no sequence, 33% matched multiple seq.U133A: 18% matched no sequence, 38% matched multiple seq.

3. Group probes with same matched target lists (FLTBPs)U95Av2: 23,972 probesets, U133A: 14,148 probesets

Full-Length Transcript Based Probesets

Matching across chip types: 9,642 FLTBPs match across U95Av2 and U133A Affymetrix has their own method for mapping their

probesets across arrays – 9,480 pairs of probesets (only about ½ map the same way as FLTBPs)

Example: Lung cancer cell line data 28 cell lines, each hybridized onto both U95Av2 and

U133A arrays. Paired design suggests any differences between

paired measurements due to technical, not biological, sources.

Different quantification methods (PDNN, RMA, MAS, dChip)

Results Density

estimate of chip-to-chip correlations for each gene

Positive shift for FLTBP suggests better correlations

Improvement greatest for PDNN

Correlation still not perfect

-0.5 0.0 0.5 1.0

0.0

0.5

1.0

1.5

2.0

2.5

Gene-gene correlation

Density

A

PDNN(P<0.00001)

-0.5 0.0 0.5 1.00.0

0.5

1.0

1.5

2.0

2.5


Density

B

RMA(P<0.00031)

-0.5 0.0 0.5 1.0

0.0

0.5

1.0

1.5

2.0

2.5


Density

C

MAS5(P<0.00575)

-0.5 0.0 0.5 1.0

0.0

0.5

1.0

1.5

2.0

2.5


Density

D

dChip(P<0.00005)

Example: Sample Gene 1

4

5

6

7

8

1 6 11 16 21 26Sample

Ln

(pro

be

sig

nal

)

4

5

6

7

8

1 6 11 16 21 26

Sample

Ln

(pro

be

sig

nal

)

10.5

11.0

11.5

6.0 6.5 7.0U133A

U95

Av2

R2=0.84R2=0.33

Plot of probe signals for two chip types (Red=FLTBP) Scatterplot of log-expression values for each sample across

the two chip types (Black=all probes, Red=FLTBP) Correlation across chips significantly improved with FLTBP

Example: Sample Gene 2

4

5

6

7

8

9

10

1 6 11 16 21 26

Sample

Ln(p

robe

sig

nal)

4

5

6

7

8

9

10

1 6 11 16 21 26

Sample

Ln(p

robe

sig

nal)

11

12

13

14

15

7 8 9 10 11

U95Av2

U13

3A

R2=0.87R2=0.25

Again, significantly higher correlation using FLTBP than using Affymetrix’ definition

Results Boxplot of

chip-to-chip correlations (over genes) for each sample

PDNN resulted in higher correlations

PDNN RMA MAS5 dChip

FL

TB

P

Aff

yP

S

FL

TB

P

Aff

yP

S

FL

TB

P

Aff

yP

S

FL

TB

P

Aff

yP

S

0.6

0.7

0.8

0.9

1.0

Sa

mp

le c

orr

ela

tio

ns

Conclusions New method for pooling info across studies

using different versions of Affymetrix chips. Recombine matched probes into new

probesets using Unigene clusters. Method appears to obtain comparable

expression levels across chips without sacrificing much precision or significantly altering the relative ordering of the samples.

Worked well combining information across HuGeneFL/U95Av2, but not U95Av2/U133A

Conclusions

Discussed new probeset definition based on full-length transcript sequences. Removes effect of known alternative splicing Yields stronger between-chip correlations

than Affymetrix standard definitions Pooling information across studies is

difficult – there is still more work to be done – but worth the effort.

ReferencesMorris JS, Yin G, Baggerly KA, Wu C, and Zhang L (2005). Pooling Information Across Different Studies and Oligonucleotide Microarray Chip Types to Identify Prognostic Genes for Lung Cancer. Methods of Microarray Data Analysis IV, eds. JS Shoemaker and SM Lin, pp. 51-66, New York: Springer-Verlag.

Wu C, Morris JS, Baggerly KA, Coombes KR, Minna JD, and Zhang L (2005). A probe-to-transcripts mapping method for cross-platform comparisons of microarray data taking into account the effects of alternative splicing. Under review.

Morris JS, Wu C, Coombes KR, Baggerly KA, Wang J, and Zhang L (2006). Alternative Probeset Definitions for Combining Microarray Data Across Studies Using Different Versions of Affymetrix Oligonucleotide Arrays. To appear in Meta-Analysis in Genetics, edited by Rudy Guerra and David Allison, Chapman-Hall.

Pooling Data Across Microarray Experiments Using Different Versions of Affymetrix Oligonucleotide Arrays Jeffrey S. Morris Li Zhang, Chunlei Wu, Keith.

Documents