Help Conquer Cancer

Update May 2008

Thank you for your continuing support of the Help Conquer Cancer project. We are grateful for all the computing power you donate to this and other exciting and useful research at WCG. We benefit from it greatly, but we also participate in WCG as an Integrative Discovery Team. It is a TEAM effort (Together Everyone Accomplishes More) that will help us to solve these complex problems.

Since the launch of the Help Conquer Cancer project in November 2007, WCG members have contributed almost 12,000 years of run time, averaging about 54 years a day.

Reminder about the complexity of protein crystallization

Crystallization is a multi‐parametric process with three classical steps: nucleation, growth and cessation of growth. Technical difficulties in protein crystallization are mainly due to two reasons:

1. A large number of parameters affect the crystallization outcome, including purity of proteins, supersaturation, temperature, pH, time, ionic strength and purity of chemicals, and volume and geometry of samples;

2. We only partially understand correlations between the variation of a parameter and the propensity for a given macromolecule to crystallize.

Conceptually, protein crystal growth can be divided into two phases: search and optimization. The search phase determines a subset of all possible crystallization conditions that yield promising crystallization outcomes. These conditions are varied during the optimization phase to produce diffraction‐quality crystals. Neither of the two phases is trivial to execute. If we consider only 20 possible conditions, each having 20 possible values, the result would be 1.04858E+26 possible experiments, which is impossible to test exhaustively. Even a broad search phase may not produce any promising conditions, and many of the promising leads may elude optimization strategies.
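The size of that search space follows from varying every condition independently. The short sketch below reproduces the figure quoted above; the screening rate `experiments_per_day` is a hypothetical number chosen only to illustrate the scale and does not come from this update.

```python
# Exhaustive screen: each of the 20 conditions takes one of 20 values independently.
n_conditions = 20
values_per_condition = 20

total_experiments = values_per_condition ** n_conditions
print(f"{total_experiments:.5E}")  # prints 1.04858E+26

# Hypothetical throughput, assumed purely for illustration.
experiments_per_day = 1_000_000
years_needed = total_experiments / experiments_per_day / 365
print(f"about {years_needed:.1E} years at {experiments_per_day:,} experiments per day")
```

Even at an unrealistically high screening rate, an exhaustive test stays out of reach, which is why the search phase has to sample the space selectively.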

High‐throughput screening (HTS) can speed up the search phase, and has the potential to increase process quality. Automated image analysis and classification achieves two important goals: it improves throughput and generates consistent and objective results. Objective image classification is a necessary input to data mining and reasoning, which are essential to elucidate knowledge from the large number of successful and failed crystallization experiments. These results will help us understand protein chemistry and achieve our overall goal: to improve the number and quality of protein structures determined. We hypothesize that (1) comprehensive and probabilistic image classification will increase both the specificity and the sensitivity of the process, and (2) systematic image analysis combined with data mining and reasoning will lead to an improved understanding of the chemistry of protein crystallization, and thus will also increase the number of solved structures from the HTS pipeline.


The challenge is the wide diversity of crystals [i], as shown in Figure 1. To cope with this diversity, we must use multiple algorithms to identify crystals reliably, i.e., with high sensitivity and specificity.

   

Figure 1 Diverse crystal forms. 

Image classification challenge

Individual images first have to be analyzed to determine their morphologic features; a combination of these features is then used to classify each image into one of a predefined set of categories, as shown in Figure 2.


          

Figure 2 Image classification process.

Phase 1

During the first phase of the project, we processed a well‐characterized set of images. We are in the process of optimizing features across wide range of individual image categories.

• Truth data set: 165,416 images, classified by 3 experts into 10 categories;

• Image analysis: 12,375 features on 90 million images.

Using the WCG computing capacity, all feature extraction for hand‐scored image data has been completed by January 2008. While we are determining the best subset of features to use, we continue feature extraction for all unscored data. Our preliminary results show that this comprehensive approach is useful and necessary since the relationship among features, their parameters, and image classes is not linear. Although it is not practical to wait till 2013 to compute features for all 86 million images, the only sensible option is to determine the best feature subset on a well‐characterized set of hand‐scored images, which covers a large number of diverse image classes. This computation has been finished in January 2008, enabling us to perform the following feature selection:

• Assess the value of all computed features;

• Compute information content of single features (mutual information between feature & image class);

• Calculate information content of feature pairs (mutual information between feature‐feature pair & image class);

• Compute information content per CPU‐second (feature utility);

• Evaluate information density across feature‐parameter space;

• Select “optimal” subset of features for image classification, i.e., the most informative features, with preference for less computationally intensive features when possible without decreasing accuracy.
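The feature-selection steps above rest on the mutual information between a feature (or a feature pair) and the image class, and on that information divided by compute cost. The following is a minimal sketch of those quantities, not the project's actual code. It assumes feature values have already been discretized into bins; the function names `mutual_information` and `feature_utility`, the toy labels, and the CPU-second costs are all illustrative assumptions.

```python
from collections import Counter
from math import log2

def mutual_information(feature_values, class_labels):
    """Mutual information I(F; C) in bits between a discretized feature and the class."""
    n = len(feature_values)
    joint = Counter(zip(feature_values, class_labels))
    feature_counts = Counter(feature_values)
    class_counts = Counter(class_labels)
    mi = 0.0
    for (f, c), count in joint.items():
        p_fc = count / n
        p_f = feature_counts[f] / n
        p_c = class_counts[c] / n
        mi += p_fc * log2(p_fc / (p_f * p_c))
    return mi

def feature_utility(mi_bits, cpu_seconds):
    """Information content per CPU-second; cpu_seconds is an assumed cost measure."""
    return mi_bits / cpu_seconds

# Toy hand-scored set (hypothetical): six images, three classes.
labels = ["clear", "clear", "precipitate", "precipitate", "crystal", "crystal"]
informative = [0, 0, 1, 1, 2, 2]    # bin index tracks the class exactly
uninformative = [0, 1, 0, 1, 0, 1]  # bin index is independent of the class

mi_good = mutual_information(informative, labels)
mi_bad = mutual_information(uninformative, labels)

# A feature pair is treated as one joint variable by zipping the two features.
mi_pair = mutual_information(list(zip(informative, uninformative)), labels)

print(f"informative: {mi_good:.3f} bits, uninformative: {mi_bad:.3f} bits")
print(f"pair: {mi_pair:.3f} bits")
print(f"utility at an assumed 4 CPU-seconds: {feature_utility(mi_good, 4.0):.3f} bits/s")
```

Ranking features by utility rather than by raw mutual information is one way to express the stated preference for less computationally intensive features when accuracy is not reduced.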


Computed features will be used to identify essential combination of features that will lead to accurate image classification (such as a 3‐class classification in Figure 3), i.e., classification that achieves both high sensitivity and specificity.

Figure 3 Image classes across the truth data set.

As shown in the examples below, we can optimize which features and which parameters are useful for predicting image class (see Figure 4, 5). But the process is challenging as features and parameters highly depend on the image class. Thus, we need to consider feature and parameter optimization per class as well, as shown in Figure 6.

Figure 4 Correlation of individual features across all image categories


          

Figure 5 Optimizing parameters for individual features across all image categories. Heat maps indicate mutual information (measured in bits) between features (plotted in parameter space).


          

Figure 6 Effect of parameter changes on the information content of image features. Heat maps indicate mutual information (measured in bits) between features (plotted in parameter space) and specific crystallization outcomes (clear, precipitate, crystal). Note how different regions of each feature family’s parameter space are sensitive to different crystallization outcomes. Peaks in these plots indicate candidate features for HCC Phase II [ii].

Preliminary image classifiers

We have used a set of 74 handpicked features from peaks in the clear, precipitate and other mutual information plots to build two preliminary classifiers, using a Naïve Bayes model:

• three-way: clear, non-crystal precipitate, other;

• ten-way: clear, phase separation, phase + precipitate, skin, phase + crystal, precipitate, precipitate + skin, precipitate + crystal, crystal, garbage.

Using the training set of images and leave‐one‐out cross‐validation, we have measured how accurately each classifier identifies images from individual categories, i.e., the sensitivity and specificity of each classifier.
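The evaluation described above can be sketched as follows. This is an illustrative example under stated assumptions, not the classifiers used in the project: the update says only that a Naïve Bayes model was used, so the Gaussian likelihood, the two toy features, the nine toy images, and every function name here are assumptions made for the sake of a runnable example.

```python
from collections import defaultdict
from math import log, pi

def train_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps variances strictly positive (a choice of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log-posterior, treating features as independent."""
    def log_posterior(params):
        prior, means, variances = params
        return log(prior) + sum(
            -0.5 * log(2 * pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances))
    return max(model, key=lambda label: log_posterior(model[label]))

def leave_one_out(rows, labels):
    """Hold out each image in turn, train on the rest, and predict the held-out image."""
    predictions = []
    for i in range(len(rows)):
        model = train_gaussian_nb(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        predictions.append(predict(model, rows[i]))
    return predictions

def sensitivity_specificity(truth, predicted, positive):
    """One-vs-rest sensitivity and specificity for the class named by `positive`."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical two-feature data for the three-way problem.
features = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15],
            [1.00, 1.10], [1.10, 0.90], [0.90, 1.00],
            [2.00, 0.10], [2.10, 0.20], [1.90, 0.15]]
labels = ["clear"] * 3 + ["precipitate"] * 3 + ["other"] * 3

predictions = leave_one_out(features, labels)
for name in ("clear", "precipitate", "other"):
    sens, spec = sensitivity_specificity(labels, predictions, name)
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reporting sensitivity and specificity per class, rather than a single accuracy figure, shows which categories a classifier confuses, which matters when some outcomes (such as crystals) are rare.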


          

Figure 7 Naïve Bayes classifiers for 3 and 10 classes.

Future directions

• Improve image analysis to achieve high specificity and sensitivity in multi‐class experiment categorization, and improve scalability to near real time.

• Protein crystallization principles derived from the crystallization database by data mining.

• Identify potentially successful conditions for proteins that were not yet crystallized.

• Crystallization optimization plans derived by combining a case‐based reasoning system and data mining.

As a result, more structures will be determined for a larger number of important cancer proteins.

Thank you,

C. A. Cumbaa and I. Jurisica

[i] Jurisica, I., and D. A. Wigle. Knowledge Discovery in Proteomics, Mathematical & Computational Biology Series, Volume 8, Chapman & Hall/CRC Press, 2006.


[ii] Cumbaa, C. A., and I. Jurisica. Crystallization image analysis on the World Community Grid. NIH PSI Bottlenecks Meeting, Bethesda, MD, March 2008.