November 18, 2002, Stanford Statistics
Supervised Learning from
Micro-Array Data:
Datamining with Care
Trevor Hastie
Stanford University
November 18, 2002
joint work with Robert Tibshirani, Balasubramanian
Narasimhan, Gil Chu, Pat Brown and David Botstein
DNA microarrays
• Exciting new technology for measuring gene expression of tens of thousands of genes SIMULTANEOUSLY in a single sample of cells
• first multivariate, quantitative way of measuring gene expression
• a key idea: to find genes, follow around messenger RNA
• also known as “gene chips”; there are a number of different technologies: Affymetrix, Incyte, Brown Lab, ...
• techniques for analysis of microarray data are also applicable to SNP data, protein arrays, etc.
DNA microarray process
The entire Yeast genome on a chip
Statistical challenges
• Typically have ∼ 5,000–40,000 genes measured over ∼ 50–100 samples.
• Goal: understand patterns in the data, and look for genes that explain known features in the samples.
• Biologists don’t want to miss anything (low Type II error). Statisticians have to help them appreciate Type I error, and find ways to get a handle on it.
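To make the Type I error point concrete, here is a small simulated sketch (illustrative only, not an analysis from the talk): with thousands of genes and no real signal at all, a naive per-gene cutoff still flags hundreds of genes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 5000 genes with NO real signal: two groups of 25 samples each.
n_genes, n_per_group = 5000, 25
a = rng.normal(size=(n_genes, n_per_group))
b = rng.normal(size=(n_genes, n_per_group))

# Two-sample t statistic for every gene at once (vectorized).
se = np.sqrt(a.var(ddof=1, axis=1) / n_per_group +
             b.var(ddof=1, axis=1) / n_per_group)
t = (a.mean(axis=1) - b.mean(axis=1)) / se

# Naive |t| > 2 cutoff: roughly 5% of these null genes look "significant"
# by chance alone, i.e. on the order of 250 false positives.
n_hits = int(np.sum(np.abs(t) > 2))
print(n_hits)
```

This is the handle statisticians have to provide: control over how many of the called genes are expected to be false.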
Types of problems
• Preprocessing: Li, Wong (dChip), Speed
• The analysis of expression arrays can be separated into: unsupervised (“class discovery”) and supervised (“class prediction”)
• In unsupervised problems, only the expression data is available. Clustering techniques are popular: hierarchical (Eisen’s TreeView, next slide), K-means, SOMs, block clustering, gene shaving (H&T), plaid models (Owen & Lazzeroni). The SVD is also useful.
• In supervised problems, a response measurement is available for each sample, for example a survival time or cancer class.
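As a rough illustration of the unsupervised route, the sketch below (simulated data, and scipy's generic hierarchical-clustering routines rather than any package from the talk) does two-way clustering with average linkage on correlation distance, in the spirit of Eisen-style displays.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)

# Toy expression matrix: 100 genes x 20 samples, with two sample classes
# that each over-express a different block of 30 genes.
x = rng.normal(size=(100, 20))
x[:30, 10:] += 2.0   # genes 0-29 up in samples 10-19
x[30:60, :10] += 2.0  # genes 30-59 up in samples 0-9

# Cluster the samples (columns) and the genes (rows) separately.
sample_links = linkage(x.T, method="average", metric="correlation")
gene_links = linkage(x, method="average", metric="correlation")

# Cut the sample dendrogram into two clusters; the planted groups
# (samples 0-9 vs 10-19) should come out as the two clusters.
sample_labels = fcluster(sample_links, t=2, criterion="maxclust")
print(sample_labels)
```

The same `gene_links` tree would order the rows of a heatmap, which is all a TreeView-style two-way display is.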
Two-way hierarchical clustering
Molecular portraits of Breast Cancer (Perou et al., Nature 2000)
Some editorial comments
• Most statistical work in this area is being done by non-statisticians.
• Journals are filled with papers of the form “Application of <machine-learning method> to Microarray Data”
• Many are a waste of time. P ≫ N, i.e. many more variables (genes) than samples. Data-mining research has produced exotic enhancements of standard statistical models for the N ≫ P situation (neural networks, boosting, ...). Here we need to restrict the standard models; we cannot even do linear regression.
• Simple is better: a complicated method is only worthwhile if it works significantly better than the simple one.
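A simulated sketch of the P ≫ N point (illustrative only, not from the talk): unrestricted least squares interpolates the training data and predicts poorly, while a crudely restricted model, screening genes and then fitting a small regression, does far better.

```python
import numpy as np

rng = np.random.default_rng(2)

# Many more variables than samples: P = 1000 genes, N = 50 samples,
# with only 2 genes carrying real signal.
n, p = 50, 1000
beta = np.zeros(p)
beta[:2] = 3.0
x = rng.normal(size=(n, p))
y = x @ beta + rng.normal(size=n)
x_te = rng.normal(size=(400, p))
y_te = x_te @ beta + rng.normal(size=400)

# "Standard" linear regression: with P > N this is the minimum-norm
# interpolator; training error is zero but prediction is poor.
b_full = np.linalg.pinv(x) @ y
mse_full = float(np.mean((y_te - x_te @ b_full) ** 2))

# A restricted model: screen for the 10 genes most correlated with y,
# then run ordinary least squares on those alone.
score = np.abs(x.T @ (y - y.mean())) / np.linalg.norm(x, axis=0)
keep = np.argsort(score)[-10:]
b_keep, *_ = np.linalg.lstsq(x[:, keep], y, rcond=None)
mse_keep = float(np.mean((y_te - x_te[:, keep] @ b_keep) ** 2))

print(mse_full, mse_keep)  # the restricted model predicts far better
```

The particular restriction (marginal screening) is just one simple choice; the point is that some restriction is mandatory when P ≫ N.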
• Give scientists good statistical software, with methods they can understand. They know their science better than you. With your help, they can do a better job analyzing their data than you can do alone.
• Software should be easy to install (e.g. R) and easy to use (e.g. SAM is an Excel add-in)
SAM
How to do 7000 t-tests all at once!
Significance Analysis of Microarrays (Tusher, Tibshirani and Chu, 2001).
[Figure: histogram of the t statistic across all genes]
• At what threshold should we call a gene significant?
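A sketch in SAM's spirit (simulated data, and a plain permutation scheme rather than the exact SAM procedure): compute all the t statistics at once, then permute the sample labels to estimate how many genes would exceed a given threshold by chance.

```python
import numpy as np

rng = np.random.default_rng(3)

# 2000 genes, two groups of 20 samples; 100 genes truly differ.
g, n1, n2 = 2000, 20, 20
x = rng.normal(size=(g, n1 + n2))
x[:100, n1:] += 1.5
labels = np.array([0] * n1 + [1] * n2)

def tstats(x, labels):
    """Two-sample t statistic for every gene at once (vectorized)."""
    a, b = x[:, labels == 0], x[:, labels == 1]
    se = np.sqrt(a.var(ddof=1, axis=1) / a.shape[1] +
                 b.var(ddof=1, axis=1) / b.shape[1])
    return (b.mean(axis=1) - a.mean(axis=1)) / se

t_obs = tstats(x, labels)

# Permute the sample labels to see how many genes exceed the threshold
# by chance alone; compare with the observed count.
threshold = 3.0
null_counts = [int(np.sum(np.abs(tstats(x, rng.permutation(labels))) > threshold))
               for _ in range(50)]
called = int(np.sum(np.abs(t_obs) > threshold))
print(called, np.mean(null_counts))  # genes called vs. expected false calls
```

The ratio of expected false calls to genes called is an estimate of the false discovery rate at that threshold, which is how a cutoff can be chosen rationally.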
Golub et al 1999, Science. They use a “voting” procedure for each gene, where votes are based on a t-like statistic.
Method               CV err   Test err
Golub                  3/38       4/34
Nearest Prototype      1/38       2/34
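The nearest-prototype rule in the table is simple enough to sketch in a few lines. Below, simulated data stands in for the leukemia set (the 38/34 sample split mirrors Golub et al., but this is a plain nearest-centroid rule, not their exact procedure).

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated stand-in for a two-class expression study: 500 genes,
# 38 training and 34 test samples, the first 40 genes shifted in class 1.
g = 500
shift = np.r_[np.full(40, 1.0), np.zeros(g - 40)]

def draw(n0, n1):
    x = np.vstack([rng.normal(size=(n0, g)),
                   rng.normal(size=(n1, g)) + shift])
    return x, np.r_[np.zeros(n0, int), np.ones(n1, int)]

x_tr, y_tr = draw(19, 19)
x_te, y_te = draw(17, 17)

# Nearest-prototype rule: each class is summarized by its mean
# expression profile; a new sample goes to the closest prototype.
prototypes = np.stack([x_tr[y_tr == k].mean(axis=0) for k in (0, 1)])
dists = ((x_te[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
n_err = int(np.sum(dists.argmin(axis=1) != y_te))
print(n_err, "errors out of", len(y_te))
```

Nothing here is tuned; the prototype is the only "fitted" quantity, which is exactly why the method is hard to overfit when P ≫ N.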
Breast Cancer classification
Hedenfalk et al 2001, NEJM. They use a “compound predictor” $\sum_j w_j x_j$, where the weights $w_j$ are t-statistics.
Method               BRCA1+  BRCA1-  BRCA2+  BRCA2-
Hedenfalk et al.       3/7     2/15    3/8     1/14
Nearest Prototype      2/7     1/15    2/8     0/14
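The compound predictor is a one-liner once the t-statistic weights are in hand. The sketch below uses simulated data, and the midpoint-of-class-means cutoff is an assumption for illustration; the paper's exact cutoff rule may differ.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy two-class data: 300 genes, the first 30 shifted up in class 1.
g, n = 300, 15
mu = np.r_[np.full(30, 1.0), np.zeros(g - 30)]

def sample(n_per_class):
    x = np.vstack([rng.normal(size=(n_per_class, g)),
                   rng.normal(size=(n_per_class, g)) + mu])
    y = np.r_[np.zeros(n_per_class, int), np.ones(n_per_class, int)]
    return x, y

x_tr, y_tr = sample(n)
x_te, y_te = sample(50)

# Compound predictor: score(x) = sum_j w_j x_j, with w_j the per-gene
# two-sample t statistic computed on the training set.
a, b = x_tr[y_tr == 0], x_tr[y_tr == 1]
se = np.sqrt(a.var(ddof=1, axis=0) / n + b.var(ddof=1, axis=0) / n)
w = (b.mean(axis=0) - a.mean(axis=0)) / se

# Cut the score at the midpoint of the two training-class mean scores
# (an assumed rule for this sketch) and evaluate on fresh test data.
tr_score = x_tr @ w
cut = 0.5 * (tr_score[y_tr == 0].mean() + tr_score[y_tr == 1].mean())
acc = float(np.mean((x_te @ w > cut).astype(int) == y_te))
print(acc)
```

Like the nearest prototype, this is a heavily restricted linear rule: the weights come from univariate statistics, not a joint fit over all P genes.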
Summary
• With large numbers of genes as variables (P ≫ N), we have to learn how to tame even our simplest supervised learning techniques. Even linear models are way too aggressive.
• We need to talk with the biologists to learn their priors; e.g. genes work in groups.
• We need to provide them with easy-to-use software to try out our ideas, and involve them in the design. They are much better at understanding their data than we are.