Avoiding Paralysis of Analysis:
Building an Intellectual Prosthesis
Knowledge-Oriented Analysis of Microarray Data
I. Jurisica
DIMACS'01 I. Jurisica 1
Goals
Parallel analysis of gene expressions
    Improved understanding of tumorigenesis
    Tumor classification
Individualized medicine
    Improved diagnosis, prognostics, treatment planning & adjustment
    Targeted therapy & drug design/use
    Informed patient
Problems
Multi-dimensionality
    many degrees of freedom, few data points
Noise
    Imprecision, variation
    Low number of repeats
Non-independence
Non-linearity
DBs change
Integration of results with other DBs & multiple experiments
Intellectual Prosthesis
[Figure: model types Fixed, Parametric, Nonparametric, and Nonparametric with Processing, plotted against two axes labeled "More Knowledge" and "More Data"]
Finding appropriate model to support reasoning
Exceptions
Evolution
Analysis
Clustering organizes observations into groups by maximizing intra-cluster and minimizing inter-cluster similarity
Classification/prediction assigns an observation to a class (finite/infinite)
Comparison describes the item by comparing it to other items
Summarization describes common characteristics of a subset
Discrimination describes minimum features needed to differentiate among classes
Association finds common occurrence of observations
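As a rough illustration of the clustering criterion, the sketch below scores a given grouping by its mean intra-cluster and mean inter-cluster similarity. Pearson correlation as the similarity measure, the function names, and the toy profiles are all assumptions made for this example; the slides do not specify them.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


def cluster_similarity(profiles, labels):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity).

    Assumes at least one same-cluster pair and one cross-cluster pair.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(profiles)), 2):
        s = pearson(profiles[i], profiles[j])
        (intra if labels[i] == labels[j] else inter).append(s)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy expression profiles (one row per observation).
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.8, 2.1, 0.9],
]
labels = [0, 0, 1, 1]

intra, inter = cluster_similarity(profiles, labels)
print(f"intra={intra:.2f} inter={inter:.2f}")
```

A good grouping under this criterion has high intra-cluster and low inter-cluster similarity; comparing the two numbers across candidate groupings is one simple way to rank them.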
Paralysis
Source
    too slow to search the problem space
    not enough data/processing time available for a system to generate a NP model
    lack of domain knowledge
    too much data (including noise) from HTP (high dimensionality)
A solution
    HTP & computation
    Generate - analyze - reduce - test - validate
HTP
Modified CBR approach
    symbolic similarity
    lazy learning combined with clustering & classification
    summarization
Analysis-based research
    DNA microarray analysis
    annotation
Remembering - Retrieving - Reasoning
Model-Building Solutions
Eager approach
1. analyze data
2. create a model
3. use the model
Lazy approach - data-driven model
1. incrementally accumulate data
2. incrementally analyze & evolve
Generate - analyze - reduce - test - validate
Exceptions
Evolution
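The contrast between the eager and lazy approaches can be sketched as follows. The eager model analyzes all data once and freezes a model; the lazy store only accumulates cases and defers analysis to query time, so an exception takes effect as soon as it is added. The nearest-centroid and nearest-case rules, the class names, and the data are stand-ins chosen for this illustration, not the method described in the talk.

```python
from math import dist  # Euclidean distance, Python 3.8+


class EagerCentroidModel:
    """Eager: analyze all data up front, create a model, then use it."""

    def __init__(self, samples, labels):
        grouped = {}
        for x, y in zip(samples, labels):
            grouped.setdefault(y, []).append(x)
        # The "model" is one centroid per class, fixed after construction.
        self.centroids = {
            y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in grouped.items()
        }

    def predict(self, x):
        return min(self.centroids, key=lambda y: dist(x, self.centroids[y]))


class LazyCaseStore:
    """Lazy: accumulate data incrementally, analyze only at query time."""

    def __init__(self):
        self.cases = []

    def add(self, x, y):
        self.cases.append((x, y))  # no model is built here

    def predict(self, x):
        # The nearest stored case decides the answer.
        return min(self.cases, key=lambda case: dist(x, case[0]))[1]


samples = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
labels = ["A", "A", "B", "B"]

eager = EagerCentroidModel(samples, labels)
lazy = LazyCaseStore()
for x, y in zip(samples, labels):
    lazy.add(x, y)

query = [0.6, 0.4]
eager_before = eager.predict(query)
lazy_before = lazy.predict(query)

# An exception arrives: the lazy store absorbs it without rebuilding anything.
lazy.add([0.6, 0.4], "C")
lazy_after = lazy.predict(query)
print(eager_before, lazy_before, lazy_after)
```

The eager model would have to be rebuilt to reflect the new case, which is the cost the lazy, data-driven approach avoids in an evolving domain.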
Analyzing and Using MA Data
Problems
Knowledge of classes
Providing parameters
Clinical attributes as measures of "meaningfulness"
Scalability
Annotating and explaining results
Quality assurance
Integratability
Case-Based Reasoning
SOLUTION
1. Diagnosis
2. Prognosis
3. Treatment plan
General Demographics & Medical History
Clinical Presentation & Prognostic Factors
Surgical Details
Pathology Staging
Clinical Staging
Research Protocol
Follow-up
Age
Dates
Hematology
Biochemistry
19.2k expression profiles, ....
Store - Reason - Analyze
Case-Based Reasoning
DSS
Cases represent experiential knowledge
Cases are patterns: context, problem, solution
Symbolic similarity - context-based
Retrieval - k-NN with context and structure
Anytime algorithm
KM for evolving domains
Documenting, analyzing, transferring & sharing experience
Classification, prediction, guidance in hypothesis discovery
Clustering, summarization
Acquire now, process later
Remembering - Retrieving - Reasoning
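A minimal sketch of a case as a context/problem/solution pattern, with k-NN retrieval restricted by context, might look like this. The `Case` fields, the exact-match rule for context, the Euclidean distance, and the example values are assumptions for illustration; the actual system's similarity and retrieval logic is not given in the slides.

```python
import heapq
from dataclasses import dataclass
from math import dist


@dataclass
class Case:
    context: dict   # symbolic attributes, e.g. {"site": "lung"}
    problem: list   # numeric description, e.g. an expression profile
    solution: str   # e.g. a diagnosis


def matches(case, context):
    """Symbolic, context-based match: every constrained attribute must agree."""
    return all(case.context.get(key) == value for key, value in context.items())


def retrieve(case_base, context, problem, k=3):
    """k-NN retrieval over the cases that satisfy the given context."""
    candidates = [c for c in case_base if matches(c, context)]
    return heapq.nsmallest(k, candidates, key=lambda c: dist(c.problem, problem))


# Hypothetical case base.
case_base = [
    Case({"site": "lung", "stage": "I"}, [0.9, 0.1, 0.3], "subtype-1"),
    Case({"site": "lung", "stage": "II"}, [0.8, 0.2, 0.4], "subtype-1"),
    Case({"site": "lung", "stage": "I"}, [0.1, 0.9, 0.7], "subtype-2"),
    Case({"site": "colon", "stage": "I"}, [0.9, 0.1, 0.3], "subtype-3"),
]

nearest = retrieve(case_base, {"site": "lung"}, [0.85, 0.15, 0.35], k=2)
solutions = [c.solution for c in nearest]
print(solutions)
```

Because new cases are simply appended to `case_base`, this fits the "acquire now, process later" idea: nothing is recomputed until a query arrives.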
Patient Information Management
we need detailed disease classification
we need markers to improve diagnosis, prognosis and treatment planning
we need new and systematic methods
CBR for DNA Micro Arrays
Gene expression signature
Find patients with similar signature
    k-NN approach - without prior domain knowledge
Provide diagnosis, prognosis & treatment by analogy
Apply Explain function for marker & cancer subtype summarization
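Diagnosis by analogy over expression signatures can be sketched as a majority vote among the k most similar patients. Cosine similarity, the record layout, and the patient data and diagnosis labels below are hypothetical; the slides only state that a k-NN approach is used without prior domain knowledge.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two non-zero signatures."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def diagnose_by_analogy(patients, signature, k=3):
    """Return (majority diagnosis, the k most similar patients)."""
    ranked = sorted(
        patients,
        key=lambda p: cosine(p["signature"], signature),
        reverse=True,
    )
    neighbours = ranked[:k]
    votes = Counter(p["diagnosis"] for p in neighbours)
    return votes.most_common(1)[0][0], neighbours


# Hypothetical patients with four-gene signatures.
patients = [
    {"id": "P1", "signature": [2.0, 0.1, 1.5, 0.2], "diagnosis": "adenocarcinoma"},
    {"id": "P2", "signature": [1.8, 0.3, 1.6, 0.1], "diagnosis": "adenocarcinoma"},
    {"id": "P3", "signature": [0.2, 2.1, 0.1, 1.9], "diagnosis": "squamous"},
    {"id": "P4", "signature": [0.1, 1.9, 0.3, 2.2], "diagnosis": "squamous"},
    {"id": "P5", "signature": [1.9, 0.2, 1.4, 0.4], "diagnosis": "adenocarcinoma"},
]

new_signature = [2.1, 0.2, 1.7, 0.3]
diagnosis, neighbours = diagnose_by_analogy(patients, new_signature, k=3)
print(diagnosis, [p["id"] for p in neighbours])
```

Returning the neighbours alongside the vote keeps the retrieved patients available for inspection, which is what makes the answer an analogy rather than a bare label.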
Advantage of CBR
Supports reasoning, not just analysis
Measure of similarity is based on gene expression profile
Does not require prior knowledge
Supports evolution & is more flexible
Handles inconsistencies
    Inconsistencies get resolved at run-time with contextual information
    CBR can be used to find inconsistencies
Supports discovery & validation
Outliers
Represent change and deviation
data outside of normal region of input
    unusual but correct
    unusual & incorrect
for numeric attributes
    detect with histogram
        remove with threshold filter
    identify by calculating the mean & stdev
        remove by specifying "window", e.g., 2 standard deviations from the mean
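The mean-and-stdev "window" rule for numeric attributes can be sketched in a few lines. The function name and the sample values are assumptions for this example; the default window of 2 standard deviations follows the slide.

```python
from statistics import mean, stdev


def split_outliers(values, window=2.0):
    """Split values into (kept, outliers) using mean +/- window * stdev."""
    mu, sigma = mean(values), stdev(values)  # sample standard deviation
    low, high = mu - window * sigma, mu + window * sigma
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if not (low <= v <= high)]
    return kept, outliers


# Hypothetical measurements with one suspicious value.
values = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 12.0]
kept, outliers = split_outliers(values, window=2.0)
print(kept, outliers)
```

Returning the outliers instead of discarding them leaves room to review values that are unusual but correct. Note that the mean and standard deviation are themselves pulled toward extreme values, so a single pass may miss milder outliers.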
KD and CBR
[Figure: data matrices with axes labeled "Patients", "Genes & clinical attributes", "Genes", and "Clinical"]
Organize genes into groups
Organize attribute values into taxonomies
Open Source BIOdb
Automated annotation
Schema integration, info validation
Querying and analysis
Reasons for local source:
    certain tasks are more efficient and effective
    certain tasks become possible
WebOQL
A system for supporting data restructuring operations
    to integrate data from different sources (documents, relational tables, hypertexts)
    to restructure an instance of a given source into an instance of another one
We used WebOQL to write wrappers for UniGene
more generic, dynamic, incremental
http://www.cs.toronto.edu/~weboql
Autoannotations
Information may not be downloadable
Information may not be complete
ID=1
TITLE=Hippocampus,_Stratagene_(cat.__936205)
TISSUE=brain, hippocampus
VECTOR=lambdaZAP-II
Lib.1
Infant, 2 yrs, female
brain, hippocampus
lambdaZAP-II
453 ESTs have been classified, 411 gene sets
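One way to act on incomplete records is to parse whatever fields are present and flag the rest for annotation from another source. The sketch assumes the record is available with one KEY=value pair per line; the helper names and the wanted fields AGE and SEX are hypothetical and used only for illustration.

```python
def parse_record(text):
    """Parse KEY=value lines into a dict, ignoring lines without '='."""
    record = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            record[key.strip()] = value.strip()
    return record


def missing_fields(record, wanted):
    """List the wanted fields that the record does not provide."""
    return [field for field in wanted if not record.get(field)]


raw = """ID=1
TITLE=Hippocampus,_Stratagene_(cat.__936205)
TISSUE=brain, hippocampus
VECTOR=lambdaZAP-II"""

record = parse_record(raw)
# AGE and SEX are hypothetical field names, not part of the record shown.
todo = missing_fields(record, ["TISSUE", "VECTOR", "AGE", "SEX"])
print(record["TISSUE"], todo)
```

The fields in `todo` are the ones an automated annotation step would need to fill in from elsewhere.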
Expression Distribution
[Figure: two bar charts over tissue categories, from Adipose to Whole Embryo. Panel "Distinct": vertical axis 0 to 8, in thousands. Panel "One": vertical axis 0 to 300.]
Lung
Lung                          15,410
Lung-tumor                    67
Lung-tumor & suppressor       26
Lung-tumor & necrosis         20
Lung-tumor & antigen          5
Lung-tumor & susceptibility   3

Hs.241493  M. musculus  PIR:B47328  B47328 natural killer cell tumor-recognition protein - mouse  1511  79 %
Hs.241493  H. sapiens  SP:P30414  NKCR_HUMAN NK-TUMOR RECOGNITION PROTEIN  1461  100 %
Hs.19074  H. sapiens  PID:g7212790  large tumor suppressor 2  1045  100 %
Hs.48499  H. sapiens  PID:g7144644  AF102177 1 tumor antigen SLP-8p  965  100 %
Hs.116875  M. musculus  PID:g7637845  AF172722 1 tumor-rejection antigen SART3  962  87 %
Hs.211600  M. musculus  SP:Q60769  TNP3_MOUSE TUMOR NECROSIS FACTOR, ALPHA-INDUCED PROTEIN 3  789  88 %
Hs.211600  H. sapiens  SP:P21580  TNP3_HUMAN TUMOR NECROSIS FACTOR, ALPHA-INDUCED PROTEIN 3  789  100 %
Conclusions
Management - representation - reasoning - discovery
    moving from hypothesis-driven to exploration-driven research (analysis)
    systematically analyzing the problem space
HTP
    automation, systematicity, reproducibility
    hypothesis search - generation & evaluation
The Future
"Most disease processes and treatments are manifested at the protein level"
"Gene-based expression analysis alone will (in certain cases) be totally inadequate for drug discovery"
"Only 2% of diseases are believed to be monogenic - we need to understand protein-protein interactions"
DDT 4(3):129-133, 1999