
A combining approach to statistical methods for p >> n problems

Transcript
Page 1: A combining approach to statistical methods for p >> n problems

Shinto Eguchi

Workshop on Statistical Genetics, Nov 9, 2004 at ISM

Page 2: Microarray data

cDNA microarray

Page 3: Prediction from gene expressions

Feature vector $x = (x_1, \ldots, x_p)$: dimension = number of genes p, components = quantities of gene expression

Class label $y \in \{-1, +1\}$: disease, adverse effect

Classification machine $f : x \mapsto y$, based on the training dataset $\{(x_i, y_i) : 1 \le i \le n\}$

Page 4: Leukemic diseases, Golub et al.

http://www.broad.mit.edu/cgi-bin/cancer/publications/

Page 5: Web microarray data

Dataset     p      n    y = +1   y = -1
ALLAML      7129   72   37       35
Colon       2000   62   40       22
Estrogen    7129   49   25       24

p >> n

http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/

Page 6: Genomic data

Data           SNPs            Proteome    Microarray
               (Genome)        (Protein)   (mRNA)
dimension p    1,000~100,000   function    5,000~20,000
data size n    100~1000        5~20        20~100

Page 7: Problem: p >> n

A fundamental issue in bioinformatics:

p is the dimension of the biomarker
(SNPs, proteome, microarray, …)

n is the number of individuals
(informed consent, institutional protocols, … bioethics)

Page 8: Current paradigm

Biomarker space $\mathcal{B} \subset \mathbb{R}^p$

SNPs: Haplotype block (Fujisawa)

Microarray: Model-based clustering

Proteome: Peak data reduction (Miyata)

GroupBoost (Takenouchi)

but $n \ll \dim(\mathcal{B}) \ll p$

Network gene model

Haplotype & adverse effects (Matsuura)

Page 9: An approach by combining

Let $\mathcal{B}$ be a biomarker space.

Rapid expansion of genomic data.

Let $I_1, \ldots, I_K$ be K experimental facilities, the k-th holding a dataset $D_k = \{z_i^{(k)} : i = 1, \ldots, n_k\}$ with $p \gg n_k$ $(k = 1, \ldots, K)$.

The combined sample size $\sum_{k=1}^{K} n_k$ becomes larger as K grows.

Page 10: Bridge Study?

CAMDA (Critical Assessment of Microarray Data Analysis)
DDBJ (DNA Data Bank of Japan, NIG)

[Diagram: the datasets $D_1, D_2, \ldots, D_K$ are first analyzed separately, giving $f(D_1), f(D_2), \ldots, f(D_K)$; bridging then re-analyzes each $D_k$ given the other datasets, and the re-analyses are combined into the final result.]

Page 11: CAMDA 2003

4 datasets for lung cancer

Harvard     PNAS, 2001         Affymetrix
Michigan    Nature Med, 2002   Affymetrix
Stanford    PNAS, 2001         cDNA
Ontario     Cancer Res, 2001   cDNA

http://www.camda.duke.edu/camda03/datasets/

Page 12: Some problems

1. Heterogeneity in feature space
   cDNA, Affymetrix
   Differences in covariates and medical diagnosis
   Uncertainty in microarray experiments

2. Heterogeneous class-labeling

3. Heterogeneous generalization powers

4. Publication bias
   A vast number of unpublished studies

Page 13: Machine learning

Learnability: boosting weak learners?

AdaBoost: Freund & Schapire (1997)

Weak classifiers $\{f_1(x), \ldots, f_p(x)\}$

A strong classifier $\alpha_1 f_{(1)}(x) + \cdots + \alpha_t f_{(t)}(x)$, built stagewise starting from $\alpha_1 f_{(1)}(x)$

Page 14: AdaBoost

1. Initial settings: $w_1(i) = 1/n \ (i = 1, \ldots, n)$, $F_0(x) = 0$

2. For $t = 1, \ldots, T$:

(a) $f_{(t)} = \arg\min_{f} \epsilon_t(f)$, where $\epsilon_t(f) = \dfrac{\sum_{i=1}^{n} w_t(i)\, \mathrm{I}(y_i \ne f(x_i))}{\sum_{i=1}^{n} w_t(i)}$

(b) $\alpha_t = \dfrac{1}{2} \log \dfrac{1 - \epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$

(c) $w_{t+1}(i) = w_t(i) \exp\{-\alpha_t\, f_{(t)}(x_i)\, y_i\}$

3. Output $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T} \alpha_t f_{(t)}(x)$
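A minimal Python sketch of this loop, assuming a helper `fit_weak(X, y, w)` that returns a weak classifier (approximately) minimizing the weighted error; one possible weak learner is the one-gene classifier of the next slide. The names `adaboost`, `predict`, `fit_weak` are illustrative, not from the slides.

```python
import numpy as np

def adaboost(X, y, fit_weak, T=50):
    """Sketch of the AdaBoost loop above.
    fit_weak(X, y, w) returns a callable h with h(X) in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # 1. w_1(i) = 1/n, F_0 = 0
    machine = []                                  # list of (alpha_t, f_(t))
    for t in range(T):
        h = fit_weak(X, y, w)                     # 2(a) weak classifier f_(t)
        pred = h(X)
        err = np.sum(w * (pred != y)) / np.sum(w) # weighted error eps_t(f_(t))
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # 2(b) coefficient alpha_t
        w = w * np.exp(-alpha * y * pred)         # 2(c) reweighting
        machine.append((alpha, h))
    return machine

def predict(machine, X):
    """3. sign(F_T(x)), with F_T(x) = sum_t alpha_t f_(t)(x)."""
    F = sum(alpha * h(X) for alpha, h in machine)
    return np.sign(F)
```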

Page 15: One-gene classifier

Let $x_{1j}, \ldots, x_{nj}$ be the expressions of the j-th gene.

One-gene classifier: $f_j(x) = \begin{cases} +1 & \text{if } x_j > b_j \\ -1 & \text{if } x_j \le b_j \end{cases}$

Threshold chosen to minimize the error number: $\hat{b}_j = \arg\min_{b} \sum_{i} \mathrm{I}\bigl(y_i \ne \mathrm{sgn}(x_{ij} - b)\bigr)$

[Figure: error numbers for candidate thresholds $b_j$ along the $x_j$ axis.]
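A sketch of the threshold search for the one-gene classifier, written so that it can serve as the `fit_weak` helper in the AdaBoost sketch above; with all weights equal to one it reproduces the unweighted error count of this slide, and with the boosting weights it gives the weighted version of the next slide. The names are illustrative.

```python
import numpy as np

def fit_one_gene(X, y, w):
    """Pick the gene j and threshold b_j minimizing the (weighted) number
    of errors of f_j(x) = +1 if x_j > b_j, -1 otherwise."""
    n, p = X.shape
    best_err, best_j, best_b = np.inf, 0, 0.0
    for j in range(p):
        for b in np.unique(X[:, j]):                 # candidate thresholds
            pred = np.where(X[:, j] > b, 1, -1)
            err = np.sum(w * (pred != y))
            if err < best_err:
                best_err, best_j, best_b = err, j, b
    # return the selected one-gene classifier as a callable
    return lambda X, j=best_j, b=best_b: np.where(X[:, j] > b, 1, -1)
```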

Page 16: The second training

Update the weights: misclassified examples are weighted up to 2 $(= e^{\alpha_1})$, correctly classified examples are weighted down to 0.5 $(= e^{-\alpha_1})$, where

$\alpha_1 = \dfrac{1}{2} \log \dfrac{\text{nb. of correct ans.}}{\text{nb. of false ans.}} = \dfrac{1}{2} \log \dfrac{16}{4} = \log 2$

The second classifier minimizes the weighted error number: $\hat{b}_j = \arg\min_{b} \sum_{i} w(i)\, \mathrm{I}\bigl(y_i \ne \mathrm{sgn}(x_{ij} - b)\bigr)$

[Figure: weighted error numbers for candidate thresholds $b_j$ along the $x_j$ axis.]
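A quick numeric check of this update, under the reading that the first classifier answers 16 points correctly and 4 falsely; the counts are taken from this slide's worked example.

```python
import numpy as np

n_correct, n_false = 16, 4                     # counts read from the slide
alpha_1 = 0.5 * np.log(n_correct / n_false)    # = log 2 ~ 0.693
print(np.exp(alpha_1))                         # 2.0  -> misclassified weights go up to 2
print(np.exp(-alpha_1))                        # 0.5  -> correct weights go down to 0.5
```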

Page 17: Learning algorithm

[Diagram: the dataset D with weight sequences $w_1(1), \ldots, w_1(n)$; $w_2(1), \ldots, w_2(n)$; ...; $w_T(1), \ldots, w_T(n)$ yields classifiers $f_{(1)}(x), f_{(2)}(x), \ldots, f_{(T)}(x)$ with coefficients $\alpha_1, \alpha_2, \ldots, \alpha_T$.]

Final machine: $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T} \alpha_t f_{(t)}(x)$

Page 18: Exponential loss

Exponential loss: $L(F) = \dfrac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(x_i)\}$

$\alpha_t = \arg\min_{\alpha} L(F_{t-1} + \alpha f_{(t)}) = \dfrac{1}{2} \log \dfrac{1 - \epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$

Minimizing the weighted error, $\epsilon_t(f_{(t)}) = \min_{f} \epsilon_t(f)$, corresponds to the stagewise minimization of $L(F_{t-1} + \alpha f)$ (for $\epsilon_t(f) < 1/2$).

Update: $\{w_t(i)\} \to \{w_{t+1}(i)\}$
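A short numeric sketch checking that the closed-form $\alpha_t$ minimizes the exponential loss along the chosen direction; the sample, the weak classifier `f_vals`, and the search grid are all illustrative.

```python
import numpy as np

def exp_loss(y, F_vals):
    """L(F) = (1/n) sum_i exp(-y_i F(x_i)), with F given by its values on the sample."""
    return np.mean(np.exp(-y * F_vals))

rng = np.random.default_rng(0)
n = 200
y = rng.choice([-1, 1], size=n)
F_prev = rng.normal(size=n)                      # values of F_{t-1}(x_i)
f_vals = np.where(rng.random(n) < 0.7, y, -y)    # a weak classifier, ~70% correct

w = np.exp(-y * F_prev)                          # boosting weights w_t(i)
eps = np.sum(w * (f_vals != y)) / np.sum(w)      # weighted error eps_t(f)
alpha_closed = 0.5 * np.log((1 - eps) / eps)     # closed-form minimizer

grid = np.linspace(-2.0, 2.0, 4001)              # brute-force check on a grid
losses = [exp_loss(y, F_prev + a * f_vals) for a in grid]
alpha_grid = grid[np.argmin(losses)]
print(alpha_closed, alpha_grid)                  # the two should (nearly) coincide
```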

Page 19: Different datasets

$D = \bigcup_{k=1}^{K} D_k$, where $D_k = \{(x_i^{(k)}, y_i^{(k)}) : 1 \le i \le n_k\}$

$x_i^{(k)} \in \mathbb{R}^p$: expression vector of the same genes
$y_i^{(k)}$: label of the same clinical item

Normalization: $x_i^{(k)} \in \mathbb{R}^p \to [0, 1]^p$

$\tilde{x}_{ij}^{(k)} = \dfrac{x_{ij}^{(k)} - \min_{i} x_{ij}^{(k)}}{\max_{i} x_{ij}^{(k)} - \min_{i} x_{ij}^{(k)}} \quad (j = 1, \ldots, p)$
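A sketch of this per-dataset, per-gene min-max normalization; the list `datasets` of $(X_k, y_k)$ pairs is an assumed data layout, not something defined on the slides.

```python
import numpy as np

def normalize_dataset(X):
    """Map each gene j of one dataset into [0, 1]:
    (x_ij - min_i x_ij) / (max_i x_ij - min_i x_ij)."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)       # guard against constant genes
    return (X - lo) / span

# Each study D_k is normalized separately before the datasets are combined:
# datasets = [(X_1, y_1), ..., (X_K, y_K)]
# datasets = [(normalize_dataset(X_k), y_k) for X_k, y_k in datasets]
```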

Page 20: Weighted errors

The k-th weighted error:

$\epsilon_t^{(k)}(f) = \dfrac{\sum_{i=1}^{n_k} w_t^{(k)}(i)\, \mathrm{I}\bigl(y_i^{(k)} \ne f(x_i^{(k)})\bigr)}{\sum_{i=1}^{n_k} w_t^{(k)}(i)}$

The combined weighted error:

$\epsilon_t(f) = \sum_{k=1}^{K} \pi_t^{(k)}\, \epsilon_t^{(k)}(f)$, where $\pi_t^{(k)} = \dfrac{\sum_{i=1}^{n_k} w_t^{(k)}(i)}{\sum_{h=1}^{K} \sum_{i=1}^{n_h} w_t^{(h)}(i)}$
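A sketch of the two error definitions above; `weights`, `ys`, and `preds` are assumed to be lists holding $w_t^{(k)}(i)$, $y_i^{(k)}$, and $f(x_i^{(k)})$ for each dataset k, and the combining factor is written as `pi` because the original symbol is not recoverable from the slide.

```python
import numpy as np

def weighted_error(w_k, y_k, pred_k):
    """The k-th weighted error eps_t^(k)(f)."""
    return np.sum(w_k * (pred_k != y_k)) / np.sum(w_k)

def combined_weighted_error(weights, ys, preds):
    """Combined weighted error: per-dataset errors averaged with each
    dataset's share of the total weight mass."""
    total = sum(np.sum(w_k) for w_k in weights)
    pi = [np.sum(w_k) / total for w_k in weights]
    errs = [weighted_error(w_k, y_k, p_k)
            for w_k, y_k, p_k in zip(weights, ys, preds)]
    return sum(p * e for p, e in zip(pi, errs))
```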

Page 21: BridgeBoost

(a) $f_t^{(k)} = \arg\min_{f} \epsilon_t^{(k)}(f)$

(b) $\alpha_t^{(k)} = \dfrac{1}{2} \log \dfrac{1 - \epsilon_t^{(k)}(f_t^{(k)})}{\epsilon_t^{(k)}(f_t^{(k)})}$

(c) $\bar{f}_t(x) = \dfrac{1}{K} \sum_{k=1}^{K} \alpha_t^{(k)} f_t^{(k)}(x)$

(d) $w_{t+1}^{(k)}(i) = w_t^{(k)}(i) \exp\{-\alpha_t^{(k)} f_t^{(k)}(x_i^{(k)})\, y_i^{(k)}\}$

with weight sets $\{w_t^{(k)}(i) : 1 \le i \le n_k\}$, $k = 1, \ldots, K$.
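A sketch of one BridgeBoost stage (a)-(d) as read from this slide, reusing the `fit_one_gene` helper sketched after slide 15 as `fit_weak`; the data layout (`Xs`, `ys`, `ws` as lists over the K datasets) is illustrative, and the exact form of the weight update is a best-effort reading of the garbled formula.

```python
import numpy as np

def bridgeboost_stage(Xs, ys, ws, fit_weak):
    """One stage t of BridgeBoost over K datasets.
    Xs, ys, ws: lists of feature matrices, labels, and current weights w_t^(k)."""
    K = len(Xs)
    fs, alphas, preds = [], [], []
    for X_k, y_k, w_k in zip(Xs, ys, ws):
        h = fit_weak(X_k, y_k, w_k)                        # (a) f_t^(k)
        pred = h(X_k)
        eps = np.sum(w_k * (pred != y_k)) / np.sum(w_k)    # eps_t^(k)(f_t^(k))
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)              # (b) alpha_t^(k)
        fs.append(h)
        alphas.append(alpha)
        preds.append(pred)

    def f_bar(X):                                          # (c) combined update
        return sum(a * h(X) for a, h in zip(alphas, fs)) / K

    new_ws = [w_k * np.exp(-a * y_k * p)                   # (d) reweighting
              for w_k, y_k, a, p in zip(ws, ys, alphas, preds)]
    return f_bar, new_ws
```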

Page 22: Learning

Stage t: $\bar{f}_t(x) = \dfrac{1}{K}\bigl(\alpha_t^{(1)} f_t^{(1)}(x) + \cdots + \alpha_t^{(K)} f_t^{(K)}(x)\bigr)$

[Diagram: each dataset $D_k$ with weights $\{w_t^{(k)}(i)\}$ produces $f_t^{(k)}(x)$ and $\alpha_t^{(k)}$; these are combined into $\bar{f}_t(x)$, and the updated weights $\{w_{t+1}^{(k)}(i)\}$ are passed on to stage t+1.]

Page 23: Mean exponential loss

Exponential loss: $L_k(F) = \dfrac{1}{n_k} \sum_{i=1}^{n_k} \exp\{-y_i^{(k)} F(x_i^{(k)})\}$

$f_t^{(k)} = \arg\min_{f} L_k(F_{t-1} + \alpha f)$, $\qquad \alpha_t^{(k)} = \dfrac{1}{2} \log \dfrac{1 - \epsilon_t^{(k)}(f_t^{(k)})}{\epsilon_t^{(k)}(f_t^{(k)})}$

Mean exponential loss: $\bar{L}(F) = \dfrac{1}{K} \sum_{k=1}^{K} L_k(F)$

$\bar{L}\Bigl(F_{t-1} + \dfrac{1}{K} \sum_{k=1}^{K} \alpha_t^{(k)} f_t^{(k)}\Bigr) \le \dfrac{1}{K} \sum_{k=1}^{K} \bar{L}\bigl(F_{t-1} + \alpha_t^{(k)} f_t^{(k)}\bigr)$

Note: convexity of the exponential loss
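The inequality is Jensen's inequality applied pointwise; the step the slide leaves implicit is

$$\exp\Bigl\{-y\Bigl(F_{t-1}(x) + \tfrac{1}{K}\textstyle\sum_{k=1}^{K} \alpha_t^{(k)} f_t^{(k)}(x)\Bigr)\Bigr\} \;\le\; \frac{1}{K}\sum_{k=1}^{K} \exp\Bigl\{-y\bigl(F_{t-1}(x) + \alpha_t^{(k)} f_t^{(k)}(x)\bigr)\Bigr\},$$

since the left-hand exponent is the average of the K right-hand exponents and $\exp$ is convex; averaging over the observations of each dataset and then over k gives the displayed inequality.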

Page 24: Meta-learning

$L_h(f_t^{(k)} \mid F_{t-1})$ with $h \ne k$ is cross-validatory: $L_h(\cdot)$ is evaluated on $D_h$, while $f_t^{(k)}$ is learned on $D_k$.

$\dfrac{1}{K} \sum_{h=1}^{K} L_h(f_t^{(k)} \mid F_{t-1}) = \underbrace{\dfrac{1}{K}\, L_k(f_t^{(k)} \mid F_{t-1})}_{\text{Separate learning}} + \underbrace{\dfrac{1}{K} \sum_{h \ne k} L_h(f_t^{(k)} \mid F_{t-1})}_{\text{Meta-learning}}$
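A sketch of this cross-validatory mean loss: a candidate classifier learned on $D_k$ is scored by the exponential loss on every dataset $D_h$; all names and the data layout are illustrative.

```python
import numpy as np

def exp_loss(F_vals, y):
    """L_h(F) = (1/n_h) sum_i exp(-y_i F(x_i)) on one dataset."""
    return np.mean(np.exp(-y * F_vals))

def meta_loss(f, alpha, F_prev, Xs, ys):
    """Mean of L_h(f | F_{t-1}) over h = 1,...,K, where the candidate f was
    learned on a single dataset D_k.  F_prev[h] holds F_{t-1}(x_i^{(h)})."""
    K = len(Xs)
    return np.mean([exp_loss(F_prev[h] + alpha * f(Xs[h]), ys[h])
                    for h in range(K)])

# Separate learning would keep only the h = k term; meta-learning also charges
# f for its fit on the other datasets (h != k), which is cross-validatory.
```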

Page 25: Simulation

3 datasets $\{D_1, D_2, D_3\}$ with $p = 100$, $n_1 = n_2 = n_3 = 50$

$D_1, D_2$ (data 1, data 2): ideal test error 0

$D_3$ (data 3): ideal test error 0.5

[Figure: training error and test error curves on the collapsed dataset.]
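The slide does not spell out the generating mechanism, so the following is only one plausible way to reproduce the setting (p = 100, n_1 = n_2 = n_3 = 50, with D_1 and D_2 carrying signal and D_3 pure noise); every detail of the data generation below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 100, 50

def informative(n, p, shift=0.5, n_signal=10):
    """Assumed model for data 1 and data 2: the first n_signal genes are
    shifted by +/- shift according to the class label, the rest are noise."""
    y = rng.choice([-1, 1], size=n)
    X = rng.normal(size=(n, p))
    X[:, :n_signal] += shift * y[:, None]
    return X, y

def pure_noise(n, p):
    """Assumed model for data 3: labels independent of the expressions,
    so the ideal test error is 0.5."""
    return rng.normal(size=(n, p)), rng.choice([-1, 1], size=n)

D1, D2, D3 = informative(n, p), informative(n, p), pure_noise(n, p)
```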

Page 26: Comparison

[Figure: training error and test error curves for Separate AdaBoost and for BridgeBoost.]

Page 27: Test errors

[Figure: test error curves for Collapsed AdaBoost, Separate AdaBoost, and BridgeBoost; the marked minima are 15%, 4%, 43%, 3%, and 4%.]

Page 28: Conclusion

[Diagram: Separate learning analyzes each dataset on its own, giving $f(D_1), f(D_2), \ldots, f(D_K)$; meta-learning re-analyzes each $D_k$ given the other datasets and combines the re-analyses into the final result.]

Page 29: Unsolved problems

1. Which datasets should be joined or deleted in BridgeBoost?

2. Prediction of the class label for a given new x?

3. How to use the information in the unmatched genes when combining datasets?

4. Heterogeneity is OK, but publication bias?

Page 30: Publication bias? (Copas & Shi, 2001)

Passive smokers vs lung cancer: means and s.d.'s of 37 studies, $\{(\hat{\theta}_k, s_k) : k = 1, \ldots, 37\}$

[Figure: funnel plot, showing heterogeneity and publication bias.]

Page 31: References

[1] S. Eguchi and J. Copas. A class of logistic-type discriminant functions. Biometrika 89, 1-22 (2002).
[2] N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation 16, 1437-1481 (2004).
[3] T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16, 767-787 (2004).
[4] T. Takenouchi, M. Ushijima and S. Eguchi. GroupAdaBoost for selecting important genes. In preparation.
[5] J. Copas and S. Eguchi. Local model uncertainty and incomplete data bias. ISM Research Memo 884, July 2003.
[6] J. Copas and S. Eguchi. Local sensitivity approximation for selectivity bias. J. Royal Statistical Society B 63, 871-895 (2001).
[7] J. Copas and J. Q. Shi. Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal 7232, 417-418 (2000).