Jurisica Slides - Rutgers University

Dec 10, 2021

Page 1: Jurisica Slides - Rutgers University

Avoiding Paralysis of Analysis:

Building an Intellectual Prosthesis

Knowledge-Oriented Analysis of Microarray Data

I. Jurisica

DIMACS'01 I. Jurisica 1

Page 2: Jurisica Slides - Rutgers University

Goals

Parallel analysis of gene expressions
  Improved understanding of tumorigenesis
  Tumor classification

Individualized medicine
  Improved diagnosis, prognostics, treatment planning & adjustment
  Targeted therapy & drug design/use
  Informed patient


Page 3: Jurisica Slides - Rutgers University

Problems

Multi-dimensionality
  many degrees of freedom, few datapoints

Noise
  Imprecision, variation
  Low number of repeats

Non-independence
Non-linearity
DBs change
Integration of results with other DBs & multiple experiments


Page 4: Jurisica Slides - Rutgers University

Intellectual Prosthesis

Finding appropriate model to support reasoning

[Figure: model types Fixed, Parametric, Nonparametric, and Nonparametric with Processing; axis labels "More Knowledge" and "More Data"; annotations "Exceptions" and "Evolution"]


Page 5: Jurisica Slides - Rutgers University

Analysis

Clustering organizes observations into groups by max. intra-cluster and min. inter-cluster similarity
Classification/prediction assigns an observation to a class (finite/infinite)
Comparison describes the item by comparing it to other items
Summarization describes common characteristics of a subset
Discrimination describes minimum features needed to differentiate among classes
Association finds common occurrence of observations


Page 6: Jurisica Slides - Rutgers University

Paralysis

Source
  too slow to search the problem space
  not enough data/processing time available for a system to generate a NP model
  lack of domain knowledge
  too much data (including noise) from HTP (high dimensionality)

A solution
  HTP & computation
  Generate - analyze - reduce - test - validate


Page 7: Jurisica Slides - Rutgers University

HTP

Modified CBR approach
  symbolic similarity
  lazy learning combined with
    clustering & classification
    summarization

Analysis-based research
  DNA microarray analysis
  annotation

Remembering - Retrieving - Reasoning


Page 8: Jurisica Slides - Rutgers University

Model-Building Solutions

Eager approach
  1. analyze data
  2. create a model
  3. use the model

Lazy approach - data-driven model
  1. incrementally accumulate data
  2. incrementally analyze & evolve

Generate - analyze - reduce - test - validate

Exceptions

Evolution
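
The contrast between the eager and lazy approaches on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: the class names `EagerCentroidModel` and `LazyCaseBase`, the nearest-centroid model, the Euclidean distance, and the toy data are all assumptions chosen for brevity.

```python
import math
from collections import Counter


class EagerCentroidModel:
    """Eager approach: analyze all data up front, create a model, then use it.

    The "model" here is one mean vector (centroid) per class; this choice is
    an assumption made for illustration only.
    """

    def __init__(self, samples, labels):
        counts = Counter(labels)
        sums = {}
        for x, y in zip(samples, labels):
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
        self.centroids = {
            y: [total / counts[y] for total in acc] for y, acc in sums.items()
        }

    def predict(self, x):
        # Only the precomputed centroids are consulted at query time.
        return min(self.centroids, key=lambda y: math.dist(x, self.centroids[y]))


class LazyCaseBase:
    """Lazy approach: accumulate cases incrementally, analyze at query time."""

    def __init__(self):
        self.cases = []

    def add(self, x, label):
        # Adding a case is cheap; no model has to be rebuilt.
        self.cases.append((list(x), label))

    def predict(self, x, k=3):
        # All analysis is deferred until a query arrives.
        nearest = sorted(self.cases, key=lambda case: math.dist(x, case[0]))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Toy data (made up for this sketch).
samples = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["a", "a", "b", "b"]

eager = EagerCentroidModel(samples, labels)
lazy = LazyCaseBase()
for x, y in zip(samples, labels):
    lazy.add(x, y)

print(eager.predict([0.2, 0.3]), lazy.predict([5.5, 5.0]))
```

In the eager variant, new data requires rebuilding the model; in the lazy variant, a new case (including an exception) takes effect as soon as it is added.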


Page 9: Jurisica Slides - Rutgers University

Analyzing and Using MA Data: Problems

Knowledge of classes
Providing parameters
Clinical attributes as measures of "meaningfulness"
Scalability
Annotating and explaining results
Quality assurance
Integratability


Page 10: Jurisica Slides - Rutgers University

Discovery Algorithms

http://cmgm.stanford.edu/pbrown/

www.partek.com



Page 12: Jurisica Slides - Rutgers University

Case-Based Reasoning

SOLUTION
  1. Diagnosis
  2. Prognosis
  3. Treatment plan

General Demographics & Medical History
Clinical Presentation & Prognostic Factors
Surgical Details
Pathology Staging
Clinical Staging
Research Protocol
Follow-up
Age
Dates
Hematology
Biochemistry
19.2k expression profiles, ....

Store - Reason - Analyze


Page 13: Jurisica Slides - Rutgers University

Case-Based Reasoning

DSS
  Cases represent experiential knowledge
  Cases are patterns: context, problem, solution
  Symbolic similarity - context-based
  Retrieval - k-NN with context and structure
  Anytime algorithm

KM for evolving domains
  Documenting, analyzing, transferring & sharing experience
  Classification, prediction, guidance in hypothesis discovery
  Clustering, summarization
  Acquire now, process later

Remembering - Retrieving - Reasoning
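
One possible reading of "cases are patterns: context, problem, solution" and of context-based symbolic similarity is sketched below in Python. The slides do not define the similarity measure, so the exact-match fraction used here is an assumption, as are the names `Case`, `context_similarity`, and `retrieve` and the made-up attribute values.

```python
from dataclasses import dataclass


@dataclass
class Case:
    context: dict   # symbolic attributes that describe the situation
    problem: dict   # findings recorded for this case
    solution: str   # e.g. a diagnosis or a treatment plan


def context_similarity(query, case):
    """Fraction of the query's context attributes that the case matches exactly.

    This measure is an assumption for illustration, not the one from the talk.
    """
    if not query:
        return 0.0
    matches = sum(1 for key, value in query.items() if case.context.get(key) == value)
    return matches / len(query)


def retrieve(case_base, query, k=2):
    """Return the k cases whose context is most similar to the query (k-NN)."""
    ranked = sorted(case_base, key=lambda c: context_similarity(query, c), reverse=True)
    return ranked[:k]


# Made-up cases; attribute names and values are hypothetical.
case_base = [
    Case({"tissue": "lung", "stage": "I"}, {}, "plan A"),
    Case({"tissue": "lung", "stage": "III"}, {}, "plan B"),
    Case({"tissue": "colon", "stage": "I"}, {}, "plan C"),
]

for case in retrieve(case_base, {"tissue": "lung", "stage": "III"}):
    print(case.solution, case.context)
```

Because nothing is precomputed, cases can be stored as they arrive and compared only when a query is posed, which matches the "acquire now, process later" point on the slide.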


Page 14: Jurisica Slides - Rutgers University

Patient Information Management

we need detailed disease classification
we need markers to improve diagnosis, prognosis and treatment planning
we need new and systematic methods


Page 15: Jurisica Slides - Rutgers University

CBR for DNA Micro Arrays

Gene expression signature
Find patients with similar signature
  k-NN approach - without prior domain knowledge

Provide diagnosis, prognosis & treatment by analogy
Apply Explain function for marker & cancer subtype summarization
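
Finding patients with a similar signature and reusing their diagnosis can be sketched as follows. This is a hedged Python illustration only: the slides do not say which similarity measure is used, so Pearson correlation is an assumption, and the function names, subtype labels, and four-gene profiles are made up.

```python
import math
from collections import Counter


def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a flat profile carries no shape information
    return cov / (norm_a * norm_b)


def diagnose_by_analogy(patients, signature, k=3):
    """patients: list of (profile, diagnosis) pairs.

    Returns the majority diagnosis among the k most similar patients,
    together with those patients so the answer can be inspected.
    """
    ranked = sorted(patients, key=lambda p: pearson(signature, p[0]), reverse=True)
    neighbours = ranked[:k]
    votes = Counter(diagnosis for _, diagnosis in neighbours)
    return votes.most_common(1)[0][0], neighbours


# Made-up profiles over four genes; labels are hypothetical.
patients = [
    ([1.0, 2.0, 3.0, 4.0], "subtype-1"),
    ([1.1, 2.1, 2.9, 4.2], "subtype-1"),
    ([4.0, 3.0, 2.0, 1.0], "subtype-2"),
    ([3.9, 3.2, 1.8, 1.1], "subtype-2"),
]

diagnosis, neighbours = diagnose_by_analogy(patients, [2.0, 4.0, 6.0, 8.0])
print(diagnosis)
```

No class model is learned in advance, which is what "without prior domain knowledge" suggests; returning the neighbours alongside the vote keeps the analogy visible to the user.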


Page 16: Jurisica Slides - Rutgers University

Advantage of CBR

Supports reasoning, not just analysis
Measure of similarity is based on gene expression profile
Does not require prior knowledge
Supports evolution & is more flexible
Handles inconsistencies
  Inconsistencies get resolved at run-time with contextual information
  CBR can be used to find inconsistencies
Supports discovery & validation

Page 17: Jurisica Slides - Rutgers University

Outliers

Represent change and deviation
  data outside of normal region of input
    unusual but correct
    unusual & incorrect

for numeric attributes
  detect with histogram
    remove with threshold filter
  identify by calculating the mean & stdev
    remove by specifying "window", e.g., 2 standard deviations from the mean
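
The mean-and-standard-deviation window can be sketched as below. This is a minimal Python illustration under stated assumptions: the function name `split_outliers` is hypothetical, the sample standard deviation is used (the slide does not say which variant), and the values are made up.

```python
import statistics


def split_outliers(values, window=2.0):
    """Split values into (kept, flagged) using a mean +/- window * stdev range.

    Needs at least two values, because the sample standard deviation is
    undefined for a single observation.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (assumption)
    low, high = mean - window * stdev, mean + window * stdev
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if not (low <= v <= high)]
    return kept, flagged


# Made-up measurements with one extreme value.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
kept, flagged = split_outliers(values, window=2.0)
print(kept, flagged)
```

The flagged values are returned rather than silently dropped, since the slide distinguishes outliers that are unusual but correct from those that are unusual and incorrect.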


Page 18: Jurisica Slides - Rutgers University

KD and CBR

[Figure; labels: Patients, Genes & clinical attributes, Genes, Patients, Clinical]

Organize genes into groups
Organize attribute values into taxonomies


Page 19: Jurisica Slides - Rutgers University

Context Relaxation


Page 20: Jurisica Slides - Rutgers University

Patient-Patient Similarity



Page 23: Jurisica Slides - Rutgers University

Open Source BIOdb

Automated annotation
Schema integration, info validation
Querying and analysis
Reasons for local source:
  certain tasks are more efficient and effective
  certain tasks become possible


Page 24: Jurisica Slides - Rutgers University

WebOQL

A system for supporting data restructuring operations

  to integrate data from different sources (documents, relational tables, hypertexts)
  to restructure an instance of a given source into an instance of another one

We used WebOQL to write wrappers for UniGene

more generic, dynamic, incremental

http://www.cs.toronto.edu/~weboql


Page 25: Jurisica Slides - Rutgers University

Autoannotations

Information may not be downloadable
Information may not be complete

ID=1
TITLE=Hippocampus,_Stratagene_(cat.__936205)
TISSUE=brain, hippocampus
VECTOR=lambdaZAP-II

Lib.1
Infant, 2 yrs, female
brain, hippocampus
lambdaZAP-II
453 ESTs have been classified, 411 gene sets
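
A library record in the `KEY=value` layout shown on this slide could be split into fields as follows. This is a hedged sketch in Python, not the WebOQL wrappers used by the author. It assumes that every field name is a run of capital letters followed by `=`, which is inferred only from the single example above; the name `parse_record` is hypothetical.

```python
import re

# Assumption: a field name is one or more capital letters directly before "=".
FIELD = re.compile(r"([A-Z]+)=")


def parse_record(text):
    """Split 'KEY=value' text into a dict, whether or not fields are on separate lines."""
    parts = FIELD.split(text)
    # parts[0] is whatever precedes the first key; keys and values then alternate.
    return {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}


record = """ID=1
TITLE=Hippocampus,_Stratagene_(cat.__936205)
TISSUE=brain, hippocampus
VECTOR=lambdaZAP-II"""

fields = parse_record(record)
print(fields["TISSUE"], "|", fields["VECTOR"])
```

A value that itself contained capital letters followed by `=` would be split incorrectly, so a real wrapper would need the list of valid field names.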


Page 26: Jurisica Slides - Rutgers University

Expression Distribution

[Figure: two bar charts over the same tissue categories. First chart: "Distinct", vertical axis 0 to 8 (thousands). Second chart: "One", vertical axis 0 to 300. Tissue categories, some truncated in the source: Adipose, Adrenal gland, Amnion Normal, Aorta, B-Cells, Bladder, Bladder Tumour, Blood, Bone, Bone Marrow, Brain, Breast, Breast Normal, Cervix, CNS, Colon, Colon EST, Colon INS, Connective Tissu, Denis Drash, Ear, Eye, Foreskin, Gall Bladder, Germ Cell, Head Neck, Heart, Kidney, Kidney Tumour, Larynx, Liver, Lung, Lung Normal, Lung Tumour, Lymph, Marrow, Muscle, Muscle (skeletal), Nervous Normal, Nervous Tumour, Nose, Ovary, Peripheral Nervo, Pancreas, Parathyroid, Placenta, Pooled, Prostate, Prostate Normal, Prostate Tumour, Skin, Spleen, Stomach, Synovial Membra, Testis, Testis Normal, Tonsil, Uterus, Whole Embryo]


Page 27: Jurisica Slides - Rutgers University

Lung

Lung                          15,410
Lung-tumor                        67
Lung-tumor & suppressor           26
Lung-tumor & necrosis             20
Lung-tumor & antigen               5
Lung-tumor & susceptibility        3

Hs.241493  M. musculus  PIR:B47328    B47328 natural killer cell tumor-recognition protein - mouse  1511  79 %
Hs.241493  H. sapiens   SP:P30414     NKCR_HUMAN NK-TUMOR RECOGNITION PROTEIN  1461  100 %
Hs.19074   H. sapiens   PID:g7212790  large tumor suppressor 2  1045  100 %
Hs.48499   H. sapiens   PID:g7144644  AF102177 1 tumor antigen SLP-8p  965  100 %
Hs.116875  M. musculus  PID:g7637845  AF172722 1 tumor-rejection antigen SART3  962  87 %
Hs.211600  M. musculus  SP:Q60769     TNP3 MOUSE TUMOR NECROSIS FACTOR, ALPHA-INDUCED PROTEIN 3  789  88 %
Hs.211600  H. sapiens   SP:P21580     TNP3_HUMAN TUMOR NECROSIS FACTOR, ALPHA-INDUCED PROTEIN 3  789  100 %


Page 28: Jurisica Slides - Rutgers University

Conclusions

Management - representation - reasoning - discovery

  moving from hypothesis-driven to exploration-driven research (analysis)
  systematically analyzing the problem space

HTP
  automation, systematicity, reproducibility
  hypothesis search - generation & evaluation


Page 29: Jurisica Slides - Rutgers University

The Future

"Most disease processes and treatments are manifested at the protein level"
"Gene-based expression analysis alone will (in certain cases) be totally inadequate for drug discovery"
"Only 2% of diseases are believed to be monogenic - we need to understand protein-protein interactions"

DDT 4(3):129-133, 1999


Page 30: Jurisica Slides - Rutgers University

Thanks

P. Rogers, M. Sultan
A. Rehaag, G. Quon
D. Wigle, O. Huner
P. Macgregor, M. Albert

J. Glasgow

NSERC, CITO, NIH, IBM, OCI

A. Barta
M. Maziarz
W. Andreopoulos

http://www.cs.utoronto.ca/~juris