Help Conquer Cancer
Update May 2008
Thank you for your continuing support of the Help Conquer Cancer project. We are grateful for all the computing power you donate to this and other exciting and useful research at WCG. We benefit from it greatly, and we also participate in WCG as an Integrative Discovery Team. It is a TEAM effort (Together Everyone Accomplishes More) that will help us to solve these complex problems.
Since the launch of the Help Conquer Cancer project in November 2007, WCG members have contributed almost 12,000 years of run time, averaging about 54 years a day.
Reminder about the complexity of protein crystallization
Crystallization is a multi‐parametric process with three classical steps: nucleation, growth and cessation of growth. Technical difficulties in protein crystallization are mainly due to two reasons:
1. A large number of parameters affect the crystallization outcome, including purity of proteins, super‐saturation, temperature, pH, time, ionic strength and purity of chemicals, volume and geometry of samples;
2. We only partially understand correlations between the variation of a parameter and the propensity for a given macromolecule to crystallize.
Conceptually, protein crystal growth can be divided into two phases: search and optimization. The search phase determines a subset of all possible crystallization conditions that yield promising crystallization outcomes. These conditions are varied during the optimization phase to produce diffraction‐quality crystals. Neither of the two phases is trivial to execute. If we consider only 20 possible conditions, each having 20 possible values, the result would be 20^20, or about 1.04858E+26, possible experiments, which is impossible to test exhaustively. Even a broad search phase may not produce any promising conditions, and many of the promising leads may elude optimization strategies.
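The size of this search space follows directly from the numbers quoted above: 20 conditions with 20 possible values each. The short check below reproduces the figure. The throughput of one million experiments per day is a purely hypothetical number chosen for illustration, not a figure from the project.

```python
# Size of the exhaustive search space: 20 conditions, 20 values each.
conditions = 20
values_per_condition = 20
total_experiments = values_per_condition ** conditions

print(f"{total_experiments:.5e}")  # 1.04858e+26

# Hypothetical screening rate, assumed only to put the number in perspective.
experiments_per_day = 1_000_000
years_needed = total_experiments / experiments_per_day / 365
print(f"about {years_needed:.1e} years at {experiments_per_day:,} experiments per day")
```

Even at such an optimistic rate the exhaustive approach is out of reach, which is why the search phase has to sample the space rather than enumerate it.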
High‐throughput screening (HTS) can speed up the search phase, and has the potential to increase process quality. Automated image analysis and classification achieves two important goals: it improves throughput and generates consistent and objective results. Objective image classification is a necessary input to data mining and reasoning, which are essential to extract knowledge from a large number of successful and failed crystallization experiments. These results will help us understand protein chemistry and lead to our overall goal: to improve the number and quality of protein structures determined. We hypothesize that (1) comprehensive and probabilistic image classification will increase both the specificity and the sensitivity of the process, and (2) systematic image analysis combined with data mining and reasoning will lead to an improved understanding of the chemistry of protein crystallization, and thus will also increase the number of solved structures from the HTS pipeline.
The challenge is the wide diversity of crystals [i], as shown in Figure 1. To cope with this diversity, we must use multiple algorithms to identify crystals reliably, i.e., with high sensitivity and specificity.
Figure 1 Diverse crystal forms.
Image classification challenge
Individual images first have to be analyzed to determine their morphological features; combinations of these features are then used to classify the images into a predefined set of categories, as shown in Figure 2.
Figure 2 Image classification process.
Phase 1
During the first phase of the project, we processed a well‐characterized set of images. We are in the process of optimizing features across a wide range of individual image categories.
• Truth data set: 165,416 images, classified by 3 experts into 10 categories;
• Image analysis: 12,375 features on 90 million images.
Using the WCG computing capacity, all feature extraction for hand‐scored image data was completed by January 2008. While we are determining the best subset of features to use, we continue feature extraction for all unscored data. Our preliminary results show that this comprehensive approach is useful and necessary, since the relationship among features, their parameters, and image classes is not linear. Since it is not practical to wait until 2013 to compute features for all 86 million images, the only sensible option is to determine the best feature subset on a well‐characterized set of hand‐scored images, which covers a large number of diverse image classes. This computation was finished in January 2008, enabling us to perform the following feature selection:
• Assess the value of all computed features;
• Compute information content of single features (mutual information between feature & image class);
• Calculate information content of feature pairs (mutual information between feature‐feature pair & image class);
• Compute information content per CPU‐second (feature utility);
• Evaluate information density across feature‐parameter space;
• Select "optimal" subset of features for image classification, i.e., the most informative features, with preference for less computationally intensive features when possible without decreasing accuracy.
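The feature selection steps listed above rest on two quantities: the mutual information between a feature (or feature pair) and the image class, and that information divided by the CPU time the feature costs. The following is a minimal sketch of both, assuming feature values have already been discretized into bins and using a simple count-based estimate. The estimator, the function names, the feature names and the toy data are all illustrative assumptions; the update does not describe how the project computes these values.

```python
import math
from collections import Counter

def mutual_information(feature_bins, labels):
    """Count-based estimate of I(feature; class) in bits for discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature_bins, labels))
    f_counts = Counter(feature_bins)
    c_counts = Counter(labels)
    mi = 0.0
    for (f, c), n_fc in joint.items():
        # p(f,c) * log2(p(f,c) / (p(f) * p(c))), expressed with raw counts
        mi += (n_fc / n) * math.log2(n_fc * n / (f_counts[f] * c_counts[c]))
    return mi

def rank_by_utility(features, labels, cpu_seconds):
    """Sort features by information per CPU-second, highest first."""
    scored = []
    for name, bins in features.items():
        mi = mutual_information(bins, labels)
        scored.append((name, mi, mi / cpu_seconds[name]))
    return sorted(scored, key=lambda item: item[2], reverse=True)

# Toy data invented for illustration: six images, two discretized features.
labels = ["clear", "clear", "crystal", "crystal", "precipitate", "precipitate"]
features = {
    "edge_count": [0, 0, 1, 1, 2, 2],  # separates all three classes
    "mean_gray": [0, 1, 0, 1, 0, 1],   # independent of the class
}
cpu_seconds = {"edge_count": 4.0, "mean_gray": 0.5}  # hypothetical costs

for name, mi, utility in rank_by_utility(features, labels, cpu_seconds):
    print(f"{name}: {mi:.3f} bits, {utility:.3f} bits per CPU-second")

# A feature pair is scored by treating the pair of bins as one joint symbol.
pair = list(zip(features["edge_count"], features["mean_gray"]))
print(f"pair: {mutual_information(pair, labels):.3f} bits")
```

In this sketch a cheap feature can outrank a more informative but expensive one, which matches the stated preference for less computationally intensive features when accuracy does not suffer.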
Computed features will be used to identify essential combinations of features that will lead to accurate image classification (such as the 3‐class classification in Figure 3), i.e., classification that achieves both high sensitivity and specificity.
Figure 3 Image classes across the truth data set.
As shown in the examples below, we can optimize which features and which parameters are useful for predicting image class (see Figures 4 and 5). But the process is challenging, as features and parameters highly depend on the image class. Thus, we need to consider feature and parameter optimization per class as well, as shown in Figure 6.
Figure 4 Correlation of individual features across all image categories.
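One way to picture per-class optimization is to score each parameter setting of a feature against a "this class versus everything else" label and keep the best setting for each class. The sketch below does this with a count-based mutual information estimate. The parameter names (`radius=1` and so on), the binned feature values and the helper names are hypothetical and serve only to illustrate the idea; they are not taken from the project.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Count-based estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((k / n) * math.log2(k * n / (px[x] * py[y]))
               for (x, y), k in joint.items())

def best_parameter_per_class(feature_by_param, labels):
    """For each class, pick the parameter setting whose feature values are
    most informative about membership in that class (one-vs-rest)."""
    best = {}
    for cls in sorted(set(labels)):
        one_vs_rest = [label == cls for label in labels]
        scores = {param: mutual_information(values, one_vs_rest)
                  for param, values in feature_by_param.items()}
        best[cls] = max(scores, key=scores.get)
    return best

# Toy data invented for illustration: one feature family, three settings.
labels = ["clear", "clear", "precipitate", "precipitate", "crystal", "crystal"]
feature_by_param = {
    "radius=1": [0, 0, 1, 1, 1, 1],  # responds to anything that is not clear
    "radius=2": [0, 0, 1, 1, 0, 0],  # responds to precipitate only
    "radius=4": [0, 0, 0, 0, 1, 1],  # responds to crystal only
}

best = best_parameter_per_class(feature_by_param, labels)
for cls, param in best.items():
    print(f"{cls}: {param}")
```

Here each class ends up with a different best setting, which is the situation the text describes: no single parameter choice is optimal across all image classes.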
Figure 5 Optimizing parameters for individual features across all image categories. Heat maps indicate mutual information (measured in bits) between features (plotted in parameter space).
Figure 6 Effect of parameter changes on the information content of image features. Heat maps indicating mutual information (measured in bits) between features (plotted in parameter space) and specific crystallization outcomes (clear, precipitate, crystal) are shown. Note how different regions of each feature family's parameter space are sensitive to different crystallization outcomes. Peaks in these plots indicate candidate features for HCC Phase II [ii].
Preliminary image classifiers
We have used a set of 74 handpicked features from peaks in the clear, precipitate and other mutual information plots to build two preliminary classifiers, using a Naïve Bayes model:
• three-way: clear, non-crystal precipitate, other;
• ten-way: clear, phase separation, phase + precipitate, skin, phase + crystal, precipitate, precipitate + skin, precipitate + crystal, crystal, garbage.
Using the training set of images and leave‐one‐out cross‐validation, we have measured how accurate each classifier is in identifying images from individual categories, i.e., the sensitivity and specificity of each classifier.
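The evaluation procedure described above can be sketched in a few lines: train a Naïve Bayes classifier on all images but one, predict the held-out image, repeat for every image, and then compute per-class sensitivity and specificity. The sketch below assumes Gaussian feature likelihoods, which the update does not specify, and uses two made-up features on nine made-up images instead of the 74 real features. All function names are illustrative.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Per-class prior, feature means and variances (Gaussian assumption)."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

def leave_one_out(X, y):
    """Predict each sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(X)):
        model = fit_gaussian_nb(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        predictions.append(predict(model, X[i]))
    return predictions

def sensitivity_specificity(y_true, y_pred, label):
    """One-vs-rest sensitivity and specificity for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Toy three-way problem invented for illustration.
X = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15],
     [1.00, 1.10], [1.10, 1.00], [1.05, 1.05],
     [2.00, 0.10], [2.10, 0.20], [2.05, 0.15]]
y = ["clear"] * 3 + ["precipitate"] * 3 + ["other"] * 3

predictions = leave_one_out(X, y)
for label in ["clear", "precipitate", "other"]:
    sens, spec = sensitivity_specificity(y, predictions, label)
    print(f"{label}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reporting both numbers per class matters in a multi-class setting, because a classifier can reach high overall accuracy while still missing most images of a rare class such as crystal.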
Figure 7 Naïve Bayes classifiers for 3 and 10 classes.
Future directions
• Improve image analysis to achieve high specificity and sensitivity in multi‐class experiment categorization, and improve scalability to near real time.
• Protein crystallization principles derived from the crystallization database by data mining.
• Identify potentially successful conditions for proteins that were not yet crystallized.
• Crystallization optimization plans derived by combining case‐based reasoning system and data mining.
As a result, more structures will be determined for a larger number of important cancer proteins.
Thank you,
C. A. Cumbaa and I. Jurisica
[i] Jurisica, I., and D. A. Wigle. Knowledge Discovery in Proteomics, Mathematical & Computational Biology Series, Volume 8, Chapman & Hall/CRC Press, 2006.
[ii] Cumbaa, C. A., and I. Jurisica. Crystallization image analysis on the World Community Grid. NIH PSI Bottlenecks Meeting, Bethesda, MD, March 2008.