Data Mining in Genomics: the dawn of personalized medicine Gregory Piatetsky-Shapiro KDnuggets www.KDnuggets.com/gps.html Connecticut College, October 15, 2003

Mar 26, 2015


Seth Meyer
Transcript
Page 1: Data Mining in Genomics: the dawn of personalized medicine

Gregory Piatetsky-Shapiro
KDnuggets
www.KDnuggets.com/gps.html
Connecticut College, October 15, 2003

Page 2: Overview

- Data Mining and Knowledge Discovery
- Genomics and Microarrays
- Microarray Data Mining

© 2003 KDnuggets

Page 3: Trends leading to Data Flood

More data is generated:
- Bank, telecom, other business transactions ...
- Scientific data: astronomy, biology, etc.
- Web, text, and e-commerce

More data is captured:
- Storage technology faster and cheaper
- DBMS capable of handling bigger DB

Page 4: Knowledge Discovery Process

[Figure: Knowledge Discovery Process. Data stages shown: Raw Data, DATA Warehouse, Target Data, Transformed Data, Patterns and Rules, Knowledge. Steps shown: Integration, Selection & Cleaning, Transformation, Data Mining, Interpretation & Evaluation. Also labeled: Understanding.]

Page 5: Major Data Mining Tasks

- Classification: predicting an item class
- Clustering: finding clusters in data
- Associations: e.g. A & B & C occur frequently
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Estimation: predicting a continuous value
- Deviation Detection: finding changes
- Link Analysis: finding relationships

Page 6: Major Application Areas for Data Mining Solutions

- Advertising
- Bioinformatics
- Customer Relationship Management (CRM)
- Database Marketing
- Fraud Detection
- eCommerce
- Health Care
- Investment/Securities
- Manufacturing, Process Control
- Sports and Entertainment
- Telecommunications
- Web

Page 7: Genome, DNA & Gene Expression

- An organism’s genome is the “program” for making the organism, encoded in DNA
  - Human DNA has about 30,000 to 35,000 genes
  - A gene is a segment of DNA that specifies how to make a protein
- Cells are different because of differential gene expression
  - About 40% of human genes are expressed at one time
- Microarray devices measure gene expression

Page 8: Molecular Biology Overview

[Figure: labels shown are Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA), single strand, Protein, and Gene expression. Graphics courtesy of the National Human Genome Research Institute.]

Page 9: Affymetrix Microarrays

[Figure: microarray chip with dimension labels 50 µm and 1.28 cm]

- ~10^7 oligonucleotides: half Perfectly Match mRNA (PM), half have one Mismatch (MM)
- Gene expression computed from PM and MM
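The slide does not give the formula used to combine PM and MM. As a minimal sketch, assuming a simple "average difference" summary (the mean of PM minus MM over a gene's probe pairs), the computation could look like the following. The function name and the intensities are hypothetical, chosen only for illustration.

```python
from statistics import mean


def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one gene.

    This is one simple difference-based summary, assumed here for
    illustration; the slide does not state which formula was used.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for the probe pairs of a single gene.
pm = [520, 610, 480, 700]
mm = [310, 400, 450, 350]
expression = average_difference(pm, mm)
print(expression)
```

Under a summary of this kind, a gene whose mismatch probes are brighter than its perfect-match probes gets a negative value, which would be consistent with negative entries appearing in raw expression data.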

Page 10: Affymetrix Microarray Raw Image

[Figure: scanner; enlarged section of raw image]

Raw data:

Gene             Value
D26528_at        193
D26561_cds1_at   -70
D26561_cds2_at   144
D26561_cds3_at   33
D26579_at        318
D26598_at        1764
D26599_at        1537
D26600_at        1204
D28114_at        707

Page 11: Microarray Potential Applications

- New and better molecular diagnostics
- New molecular targets for therapy
  - few new drugs, large pipeline, …
- Outcome depends on genetic signature
  - best treatment?
- Fundamental Biological Discovery
  - finding and refining biological pathways
- Personalized medicine ?!

Page 12: Microarray Data Mining Challenges

- Avoiding false positives, due to
  - too few records (samples), usually < 100
  - too many columns (genes), usually > 1,000
- Model needs to be robust in presence of noise
- For reliability need large gene sets; for diagnostics or drug targets, need small gene sets
- Estimate class probability
- Model needs to be explainable to biologists
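The false-positive risk from few samples and many genes can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the author's analysis: it draws 1,000 pure-noise "genes" for two classes of 10 samples each and counts how many pass a t-value cutoff of about 2.1 (roughly the 5% two-sided level for this sample size). All sizes and the cutoff are illustrative choices.

```python
import math
import random
from statistics import mean, variance


def t_value(a, b):
    """Welch t-statistic for the difference in means of two samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


rng = random.Random(0)
n_genes, n_per_class = 1000, 10

false_hits = 0
for _ in range(n_genes):
    # Both classes come from the same distribution: no real signal exists.
    class_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    class_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    if abs(t_value(class_a, class_b)) > 2.1:
        false_hits += 1

print(f"{false_hits} of {n_genes} pure-noise genes look significant")
```

With thousands of genes, even a 5% per-gene false-positive rate yields dozens of genes that appear to separate the classes by chance alone.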

Page 13: False Positives in Astronomy

[Cartoon used with permission]

Page 14: CATs: Clementine Application Templates

- CATs - examples of complete data mining processes
- Microarray CAT

[Figure: Microarray CAT with components labeled Preparation, 2-Class, Multi-Class, Clustering]

Page 15: Key Ideas

- Capture the complete process
- X-validation loop w. feature selection inside
- Randomization to select significant genes
- Internal iterative feature selection loop
- For each class, separate selection of optimal gene sets
- Neural nets – robust in presence of noise
- Bagging of neural nets

Page 16: Microarray Classification

[Figure: classification workflow. Labels shown: Data, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation.]

Page 17: Classification: External X-val

[Figure: classification workflow with external cross-validation. Labels shown: Gene Data, Train, Final Test, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation, Final Model, Final Results.]
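The idea of keeping feature selection inside the cross-validation loop can be sketched as follows. This is a hedged illustration, not the Clementine stream from the talk: the data are synthetic, the helper names (`top_genes`, `centroid_predict`, `cv_error`) are hypothetical, and a nearest-centroid rule stands in for the neural net to keep the example dependency-free. The point it shows is that genes are re-selected from the training rows of every fold, so the held-out rows never influence which genes are chosen.

```python
import math
import random
from statistics import mean, variance


def t_value(a, b):
    """Welch t-statistic for two lists of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_genes(X, y, k):
    """Indices of the k genes with the largest |t| between classes 0 and 1."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append((abs(t_value(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]


def centroid_predict(X_train, y_train, genes, sample):
    """Nearest-centroid stand-in for the classifier (not a neural net)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        dist = sum((sample[g] - mean(r[g] for r in rows)) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error(X, y, k_genes, folds=5, seed=0):
    """Cross-validated error with gene selection redone inside every fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(folds):
        test = idx[f::folds]
        train = [i for i in idx if i not in test]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        genes = top_genes(X_tr, y_tr, k_genes)  # training rows only
        errors += sum(centroid_predict(X_tr, y_tr, genes, X[i]) != y[i] for i in test)
    return errors / len(X)


# Synthetic data: 40 samples, 200 genes, only gene 0 carries signal.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(4.0 * label if g == 0 else 0.0, 1.0) for g in range(200)] for label in y]
error = cv_error(X, y, k_genes=5)
print(f"cross-validated error: {error:.2%}")
```

Selecting genes once on the full data set and cross-validating only the classifier would let test rows leak into the selection step and would tend to understate the error.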

Page 18: Measuring false positives with randomization

[Figure: a small Gene/Class table (classes 1, 1, 2, 2) shown next to a copy whose class column is shuffled into RandClass (2, 1, 1, 2)]

- Randomize 500 times
- Bottom 1% T-value = -2.08
- Select potentially interesting genes at 1%
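A minimal sketch of the randomization idea, under stated assumptions: class labels are shuffled 500 times, a t-value is computed for every gene under each shuffle, and the value below which 1% of these randomized t-values fall is used as the cutoff for "potentially interesting" genes. The data are synthetic and the function names are hypothetical; the talk's own cutoff (-2.08) comes from its data and is not reproduced here.

```python
import math
import random
from statistics import mean, variance


def t_value(values, labels):
    """Welch t-statistic of class 1 versus class 2 for one gene."""
    a = [v for v, l in zip(values, labels) if l == 1]
    b = [v for v, l in zip(values, labels) if l == 2]
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def null_t_threshold(genes, labels, fraction=0.01, rounds=500, seed=0):
    """T-value below which only `fraction` of randomized t-values fall."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(rounds):
        rng.shuffle(shuffled)  # break any real link between gene and class
        null.extend(t_value(g, shuffled) for g in genes)
    null.sort()
    return null[int(fraction * len(null))]


# Synthetic data: 50 genes x 20 samples; gene 0 is lowered in class 1.
rng = random.Random(42)
labels = [1] * 10 + [2] * 10
genes = [[rng.gauss(0.0, 1.0) for _ in labels] for _ in range(50)]
genes[0] = [v - 3.0 if l == 1 else v for v, l in zip(genes[0], labels)]

low = null_t_threshold(genes, labels)
selected = [i for i, g in enumerate(genes) if t_value(g, labels) < low]
print(f"bottom 1% t-value under randomization: {low:.2f}; selected genes: {selected}")
```

The same procedure applied to the top 1% of randomized t-values would give the cutoff for genes that are higher in class 1.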

Page 19: Gene Reduction improves Classification

- Most learning algorithms look for non-linear combinations of features
  - can easily find many spurious combinations given small # of records and large # of genes
- Classification accuracy improves if we first reduce # of genes by a linear method, e.g. T-values of mean difference
- Heuristic: select equal # genes from each class
- Then apply a favorite machine learning algorithm
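One reading of the "equal # genes from each class" heuristic for a two-class problem is to take the genes with the most positive t-values (higher in the first class) and the same number with the most negative t-values (higher in the second class). The sketch below assumes that reading; the gene names and t-values are hypothetical.

```python
def select_per_class(t_values, n_per_class):
    """Pick n genes with the highest t and n genes with the lowest t.

    t_values maps gene name -> t-value of (class A mean - class B mean).
    """
    ranked = sorted(t_values, key=t_values.get)  # ascending by t-value
    return {
        "class_a": ranked[::-1][:n_per_class],  # most positive t
        "class_b": ranked[:n_per_class],        # most negative t
    }


# Hypothetical t-values for six genes.
t_values = {"g1": 4.2, "g2": -3.9, "g3": 0.3, "g4": 2.8, "g5": -5.1, "g6": -0.2}
chosen = select_per_class(t_values, n_per_class=2)
print(chosen)
```

Ranking by absolute t-value alone could return genes that all favor one class; splitting the selection keeps markers for both.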

Page 20: Iterative Wrapper approach to selecting the best gene set

- Test models using 1, 2, 3, …, 10, 20, 30, 40, ..., 100 top genes with x-validation
- Heuristic 1: evaluate errors from each class; select the number of genes from each class that minimizes error for that class
- For randomized algorithms, average 10+ Cross-validation runs!
- Select gene set with lowest average error
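The wrapper loop can be sketched as follows, assuming a caller-supplied function that returns the cross-validation error for a given number of top genes and a given random seed. `best_gene_count` and `toy_cv_error` are hypothetical names, and `toy_cv_error` is a placeholder curve with run-to-run noise that is constructed to have its minimum at 10 genes; it is not real data and does not reproduce the talk's results.

```python
import random
from statistics import mean, stdev

# Gene counts from the slide: 1, 2, ..., 10, then 20, 30, ..., 100.
GENE_COUNTS = list(range(1, 11)) + list(range(20, 101, 10))


def best_gene_count(cv_error, counts=GENE_COUNTS, runs=10):
    """Average `runs` cross-validation errors per count; return the lowest."""
    summary = {}
    for k in counts:
        errs = [cv_error(k, seed) for seed in range(runs)]
        summary[k] = (mean(errs), stdev(errs))
    best = min(summary, key=lambda k: summary[k][0])
    return best, summary


def toy_cv_error(k, seed):
    """Placeholder error curve with random noise; NOT real results."""
    rng = random.Random(k * 1000 + seed)
    return abs(k - 10) / 50 + rng.uniform(0.0, 0.02)


best, summary = best_gene_count(toy_cv_error)
avg, sd = summary[best]
print(f"best gene count: {best} (average error {avg:.3f}, st. dev. {sd:.3f})")
```

In practice `cv_error` would wrap a full cross-validation run of the chosen classifier; averaging several runs matters when the learning algorithm itself is randomized, as a single run can rank gene counts differently from the average.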

Page 21: Clementine stream for subset selection by x-validation

[Figure: Clementine stream]

Page 22: Microarrays: ALL/AML Example

- Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
  - 72 examples (38 train, 34 test), about 7,000 genes
  - well-studied (CAMDA-2000), good test example

[Figure: ALL and AML samples. Visually similar, but genetically very different]

Page 23: Gene subset selection: one X-validation

[Chart: Single Cross-Validation run. Y axis: Error Avg for 10-fold X-val (0% to 30%). X axis: Genes per Class (1, 2, 3, 4, 5, 10, 20, 30, 40).]

Page 24: Gene subset selection: multiple cross-validation runs

- For ALL/AML data, 10 genes per class had the lowest error (<1%)
- Point in the center is the average error from 10 cross-validation runs
- Bars indicate 1 st. dev. above and below

Page 25: ALL/AML: Results on the test data

- Genes selected and model trained on Train set ONLY!
- Best Net with 10 top genes per class (20 overall) was applied to the test data (34 samples):
  - 33 correct predictions (97% accuracy)
  - 1 error on sample 66: Actual Class AML, Net prediction: ALL
  - other methods consistently misclassify sample 66 -- misclassified by a pathologist?

Page 26: Pediatric Brain Tumour Data

- 92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from U. of Chicago Children’s Hospital
- Outer cross-validation with gene selection inside the loop
- Ranking by absolute T-test value (selects top positive and negative genes)
- Select best genes by adjusted error for each class
- Bagging of 100 neural nets

Page 27: Selecting Best Gene Set

- Minimizing Combined Error for all classes is not optimal

[Chart: Average, high and low error rate for all classes]

Page 28: Error rates for each class

[Chart: Error rate vs. Genes per Class]

Page 29: Evaluating One Network

Averaged over 100 Networks:

Class   Error rate
MED     2.1%
MGL     17%
RHB     24%
EPD     9%
JPA     19%
*ALL*   8.3%
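Per-class error rates like those in the table can be computed from actual and predicted labels as sketched below. The labels here are hypothetical and chosen only to show how a low overall error can hide a high error for a small class; they are not the talk's data.

```python
from collections import Counter


def per_class_error(actual, predicted):
    """Fraction of samples of each actual class that were misclassified."""
    totals = Counter(actual)
    wrong = Counter(a for a, p in zip(actual, predicted) if a != p)
    return {c: wrong[c] / totals[c] for c in totals}


# Hypothetical labels: one of the two RHB samples is predicted as MED.
actual = ["MED", "MED", "MED", "MED", "RHB", "RHB", "EPD", "EPD"]
predicted = ["MED", "MED", "MED", "MED", "MED", "RHB", "EPD", "EPD"]

rates = per_class_error(actual, predicted)
overall = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print(rates, overall)
```

In this toy case the overall error is 12.5% while the RHB error is 50%, which is why tracking error separately for each class can lead to a different choice of gene set than minimizing the combined error.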

Page 30: Bagging 100 Networks

Class   Individual Error Rate   Bag Error rate   Bag Avg Conf
MED     2.1%                    2% (0)*          98%
MGL     17%                     10%              83%
RHB     24%                     11%              76%
EPD     9%                      0                91%
JPA     19%                     0                81%
*ALL*   8.3%                    3% (2)*          92%

* Note: suspected error on one sample (labeled as MED but consistently classified as RHB)
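Bagging can be sketched as training many models on bootstrap resamples of the training set and taking a majority vote. The example below is a hedged illustration only: a nearest-centroid rule stands in for each neural net, the data are hypothetical two-gene samples, and the vote fraction is used as one possible confidence measure (the slide does not define how "Bag Avg Conf" was computed).

```python
import random
from collections import Counter
from statistics import mean


def train_centroids(X, y):
    """Nearest-centroid stand-in for training one neural net."""
    centroids = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Label of the centroid closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def bagged_predict(X, y, x, n_models=100, seed=0):
    """Majority vote of models trained on bootstrap resamples of (X, y)."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in X]  # sample with replacement
        model = train_centroids([X[i] for i in idx], [y[i] for i in idx])
        votes[predict(model, x)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_models


# Hypothetical two-gene expression values for six training samples.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9], [1.9, 2.3]]
y = ["MED", "MED", "MED", "RHB", "RHB", "RHB"]
label, confidence = bagged_predict(X, y, [0.1, 0.2])
print(label, confidence)
```

Because each model sees a slightly different training set, the vote smooths out the variability of individual models, which is the effect the bag error column reflects relative to the individual error column.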

Page 31: AF1q: New Marker for Medulloblastoma?

- AF1Q ALL1-fused gene from chromosome 1q
- transmembrane protein
- Related to leukemia (3 PUBMED entries) but not to Medulloblastoma

Page 32: Future directions for Microarray Analysis

- Algorithms optimized for small samples
- Integration with other data
  - biological networks
  - medical text
  - protein data
- Cost-sensitive classification algorithms
  - error cost depends on outcome (don’t want to miss treatable cancer), treatment side effects, etc.
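Cost-sensitive classification can be illustrated by choosing the decision with the lowest expected cost given estimated class probabilities. The sketch below uses hypothetical probabilities and a hypothetical cost table in which missing a treatable cancer is ten times as costly as treating a healthy patient; neither comes from the talk.

```python
def min_cost_decision(probs, cost):
    """Return the decision with the lowest expected cost.

    probs maps actual class -> estimated probability.
    cost[decision][actual] is the cost of taking `decision` when the
    true class is `actual`.
    """
    expected = {
        decision: sum(probs[actual] * c for actual, c in row.items())
        for decision, row in cost.items()
    }
    return min(expected, key=expected.get), expected


# Hypothetical class probabilities and costs.
probs = {"cancer": 0.2, "healthy": 0.8}
cost = {
    "treat": {"cancer": 0, "healthy": 1},      # side effects of unneeded treatment
    "no_treat": {"cancer": 10, "healthy": 0},  # missing a treatable cancer
}
decision, expected = min_cost_decision(probs, cost)
print(decision, expected)
```

Here the classifier would recommend treatment even though cancer is the less likely class, because the expected cost of a missed cancer (2.0) exceeds that of unnecessary treatment (0.8). This also shows why estimating class probability, not just the most likely class, matters.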

Page 33: Acknowledgements

Eric Bremer, Children’s Hospital (Chicago) & Northwestern U.

Greg Cooper, U. Pittsburgh

Tom Khabaza, SPSS

Sridhar Ramaswamy, MIT/Whitehead Institute

Pablo Tamayo, MIT/Whitehead Institute

Page 34: Thank you

Further resources on Data Mining: www.KDnuggets.com

Microarrays: www.KDnuggets.com/websites/microarray.html

Contact: Gregory Piatetsky-Shapiro, www.kdnuggets.com/gps.html