Feature selection and transduction for prediction of molecular bioactivity for drug design
Reporter: Yu Lun Kuo (D95922037)
E-mail: [email protected]
Date: April 17, 2008
Bioinformatics Vol. 19 no. 6 2003 (Pages 764-771)
Feature selection and transduction for prediction of molecular bioactivity
• Challenge
– Highly imbalanced, high-dimensional data whose training and test sets follow different distributions
• Approach
– Bayesian network predictive model
– Data PreProcessor system
– BN PowerPredictor system
– BN PowerConstructor system
Data Set (1/3)
• Provided by DuPont Pharmaceuticals
– The task is to predict whether a drug binds to a target site on thrombin, a key receptor in blood clotting
• Each example has a fixed-length vector of 139,351 binary features in {0, 1}
– The features describe three-dimensional properties of the molecule
Data Set (2/3)
• Positive examples are labeled +1
• Negative examples are labeled -1
• In the training set
– 1909 examples, only 42 of which bind (rather unbalanced: positives make up just 2.2%)
• In the test set
– 634 additional compounds
Data Set (3/3)
• An important characteristic of the data
– Very few of the feature entries are non-zero (0.68% of the 1,909 × 139,351 training matrix)
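Given that sparsity, the matrix is best held in a sparse format. A minimal sketch (not from the paper) using scipy.sparse, with hypothetical random one-entry positions standing in for the real data:

```python
import numpy as np
from scipy import sparse

n_examples, n_features = 1909, 139_351

# Hypothetical (row, col) positions of the one-entries; in practice these
# would come from the DuPont data. ~1.8 million ones gives ~0.68% density.
rng = np.random.default_rng(0)
rows = rng.integers(0, n_examples, size=1_800_000)
cols = rng.integers(0, n_features, size=1_800_000)

# Build a compressed sparse row matrix; duplicate positions are summed,
# so clip the stored values back to 1 to keep the matrix binary.
X = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)),
                      shape=(n_examples, n_features))
X.data[:] = 1

print(f"density: {X.nnz / (n_examples * n_features):.4%}")
```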
System Assessment
• Performance is evaluated according to a weighted accuracy criterion
– The score of an estimate y' of the labels y is defined below
– Complete success is a score of 1
• Multiplying this score by 100 gives the percentage weighted success rate
$\mathrm{lbal}(y, y') = \frac{1}{2}\left(\frac{\#\{i : y'_i = 1 \wedge y_i = 1\}}{\#\{i : y_i = 1\}} + \frac{\#\{i : y'_i = -1 \wedge y_i = -1\}}{\#\{i : y_i = -1\}}\right)$
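A minimal sketch of this criterion in code (the ±1 label convention follows the slides; the function and argument names are my own):

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """lbal(y, y'): the mean of the accuracy on the positive class and the
    accuracy on the negative class. Complete success scores 1.0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc_pos = np.mean(y_pred[y_true == 1] == 1)    # recall on +1 examples
    acc_neg = np.mean(y_pred[y_true == -1] == -1)  # recall on -1 examples
    return 0.5 * (acc_pos + acc_neg)

# e.g. weighted_accuracy([1, 1, -1, -1], [1, -1, -1, -1]) -> 0.75
```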
Methodology
• Predict the labels on the test set by using a machine learning algorithm
• The positively and negatively labeled training examples are split randomly into n groups
– For n-fold cross validation, such that as close to 1/n of the positively labeled examples as possible is present in each group
• This is called balanced cross validation
– Necessary because there are so few positive examples
Methodology
• The method is
– Trained on n−1 of the groups
– Tested on the remaining group
– Repeated n times (with a different group for testing each time)
– Final score: the mean of the n scores
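A minimal sketch of such a balanced split (the helper is my own, assuming labels in {+1, −1}); this is essentially a stratified k-fold split:

```python
import numpy as np

def balanced_folds(y, n_folds, seed=0):
    """Split indices into n folds so each fold holds as close to 1/n of the
    positive examples as possible. Returns a list of index arrays."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    folds = [[] for _ in range(n_folds)]
    for label in (1, -1):  # distribute each class separately
        idx = rng.permutation(np.flatnonzero(y == label))
        for k, chunk in enumerate(np.array_split(idx, n_folds)):
            folds[k].extend(chunk)
    return [np.array(f, dtype=int) for f in folds]

# Usage: train on n-1 folds, test on the held-out one, repeat n times,
# and report the mean of the n weighted-accuracy scores.
```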
Feature Selection (1/2)
• The criterion, called the unbalanced correlation score, is given below
– fj: the score of feature j
– X: the training data as a matrix whose columns are features and whose rows are examples
• Take λ very large in order to select only features which have non-zero entries for positive examples (λ ≥ 3)
$f_j = \sum_{i:\, y_i = 1} X_{ij} \;-\; \lambda \sum_{i:\, y_i = -1} X_{ij}$
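A minimal sketch of this score (names are my own; X is a dense binary array here for simplicity):

```python
import numpy as np

def unbalanced_correlation_scores(X, y, lam=3.0):
    """f_j = sum over positives of X_ij minus lam times the sum over
    negatives, computed for every feature j at once."""
    X, y = np.asarray(X), np.asarray(y)
    pos_counts = X[y == 1].sum(axis=0)   # one-entries on positive examples
    neg_counts = X[y == -1].sum(axis=0)  # one-entries on negative examples
    return pos_counts - lam * neg_counts

# Select the d top-ranked features:
# top_d = np.argsort(unbalanced_correlation_scores(X, y))[::-1][:d]
```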
Feature Selection (2/2)
• This score is an attempt to encode the prior information that
– The data are unbalanced
– There is a large number of features
– Only positive correlations are likely to be useful
Justification
• We justify the unbalanced correlation score using methods of information theory
– Entropy (below): the higher the entropy, the less regular the feature
• pi: the probability of appearance of event i
$-\sum_i p_i \ln(p_i)$
Entropy
• The probability of random appearance of a feature with an unbalanced score of N = Np − Nn
– Np = number of one-entries associated with the +1 labels
– Nn = number of one-entries associated with the −1 labels
– Tp = total number of positive labels in the training set
– Tn = total number of negative labels in the training set
$P_1(T_p, T_n, N_p, N_n) = \binom{N_p + N_n}{N_n} \frac{\prod_{i=0}^{N_p - 1}(T_p - i)\, \prod_{i=0}^{N_n - 1}(T_n - i)}{\prod_{i=0}^{N_p + N_n - 1}(T_p + T_n - i)}$
Entropy
• Need to compute the probability that a certain N might occur randomly
• Finally, compute the entropy for each feature
$P_2(T_p, T_n, N) = \sum_{i=0}^{\min(T_p - N,\, T_n)} P_1(T_p, T_n, \max(0, N) + i, \max(0, -N) + i)$

$-P_2 \log_2(P_2) - (1 - P_2) \log_2(1 - P_2)$
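A minimal sketch of these quantities as reconstructed above; since the slide formulas are partly garbled, the exact summation limit and the use of binary entropy are my assumptions:

```python
from math import comb, log2

def p1(tp, tn, np_, nn):
    """P1: probability that a random feature with np_+nn one-entries places
    np_ of them on the Tp positives and nn on the Tn negatives
    (the hypergeometric probability, equal to the product form above)."""
    return comb(tp, np_) * comb(tn, nn) / comb(tp + tn, np_ + nn)

def p2(tp, tn, n):
    """P2: probability that an unbalanced score N = Np - Nn occurs randomly,
    summing P1 over all (Np, Nn) pairs consistent with N."""
    upper = min(tp - max(0, n), tn - max(0, -n))
    return sum(p1(tp, tn, max(0, n) + i, max(0, -n) + i)
               for i in range(upper + 1))

def feature_entropy(tp, tn, n):
    """Binary entropy of P2, used to rank each feature."""
    p = p2(tp, tn, n)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)
```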
Entropy and unbalanced score
• The entropy and the unbalanced score will not in general select the same features
– Because the unbalanced correlation score does not reward negative correlations
• In this particular problem, however
– They reach a similar ranking of the features
• Due to the unbalanced nature of the data
Entropy and unbalanced score
• Of the first 6 features for both scores
– 5 out of 6 are the same ones
– For 16 features, 12 coincide
• The unbalanced correlation score pays more attention to positive correlations
Multivariate unbalanced correlation
• The feature selection algorithm described so far is univariate
– This reduces the chance of overfitting
– But when the relationships between the inputs and targets are too complex, this assumption may be too restrictive
• We extend our criterion to assign a rank to a subset of features
– Rather than just a single feature
Multivariate unbalanced correlation
• We compute the logical OR of the subset of features S (as the features are binary)
$X_i(S) = 1 - \prod_{j \in S} (1 - X_{ij})$
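A minimal sketch of this OR-ed feature (the helper name is my own); the resulting column can then be ranked with the unbalanced correlation score, so a whole subset S receives a single rank:

```python
import numpy as np

def or_feature(X, subset):
    """X_i(S) = 1 - prod over j in S of (1 - X_ij): the logical OR of the
    binary feature columns indexed by `subset`, one value per example."""
    X = np.asarray(X)
    return 1 - np.prod(1 - X[:, list(subset)], axis=1)
```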
Fisher Score
– μj(+): the mean of feature j's values over the positive examples
– μj(−): the mean of feature j's values over the negative examples
– σj(+): the corresponding standard deviation over the positive examples
– σj(−): the corresponding standard deviation over the negative examples
$f_j = \frac{\left(\mu_j^{(+)} - \mu_j^{(-)}\right)^2}{\left(\sigma_j^{(+)}\right)^2 + \left(\sigma_j^{(-)}\right)^2}$
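A minimal sketch of the Fisher score (the small eps is my own guard against constant features):

```python
import numpy as np

def fisher_scores(X, y, eps=1e-12):
    """f_j = (mu_j(+) - mu_j(-))^2 / (sigma_j(+)^2 + sigma_j(-)^2),
    computed for every feature j at once."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    Xp, Xn = X[y == 1], X[y == -1]
    num = (Xp.mean(axis=0) - Xn.mean(axis=0)) ** 2
    den = Xp.var(axis=0) + Xn.var(axis=0) + eps
    return num / den
```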
• In each case, the algorithms are evaluated for different numbers of features d
– Over the range d = 1, …, 40
• We choose a small number of features in order to keep the decision function interpretable
• It is anticipated that a large number of features are noisy and should not be selected
Classification algorithms (Inductive)
• The task may not simply be to identify relevant characteristics via feature selection
– But also to provide a prediction system
• The simplest of classifiers predicts +1 whenever any of the d selected features is present (defined below)
– We call this a logical OR classifier
$f(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{d} x_i > 0 \\ -1 & \text{otherwise} \end{cases}$
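A minimal sketch of the logical OR classifier (names are my own; X_selected holds only the d selected binary feature columns):

```python
import numpy as np

def or_classifier(X_selected):
    """Predict +1 if any of the d selected binary features fires for an
    example, and -1 otherwise."""
    X_selected = np.asarray(X_selected)
    return np.where(X_selected.sum(axis=1) > 0, 1, -1)
```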
Comparison Techniques
• We compared a number of rather more sophisticated classification techniques
– Support vector machines (SVM)
– SVM*
• SVM* makes a search over all possible values of the threshold parameter in the linear model after training
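A minimal sketch of one plausible reading of SVM*: train a linear SVM, then sweep the decision threshold and keep the value maximizing the weighted accuracy on the training set. The library, parameters, and the training-set selection criterion are my assumptions, not necessarily the paper's exact setup:

```python
import numpy as np
from sklearn.svm import LinearSVC

def weighted_accuracy(y, yp):
    y, yp = np.asarray(y), np.asarray(yp)
    return 0.5 * (np.mean(yp[y == 1] == 1) + np.mean(yp[y == -1] == -1))

def svm_star_predict(X_train, y_train, X_test, C=1.0):
    svm = LinearSVC(C=C).fit(X_train, y_train)
    scores = svm.decision_function(X_train)
    # Candidate thresholds: midpoints between consecutive sorted scores.
    s = np.sort(scores)
    candidates = (s[:-1] + s[1:]) / 2
    best_b = max(candidates,
                 key=lambda b: weighted_accuracy(
                     y_train, np.where(scores > b, 1, -1)))
    return np.where(svm.decision_function(X_test) > best_b, 1, -1)
```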