Predicting gene functions using hierarchical multi-label decision tree ensembles
Celine Vens, Leander Schietgat, Jan Struyf, Hendrik Blockeel, Dragi Kocev, Sašo Džeroski
K.U.Leuven, Department of Computer Science
Hierarchical Multi-Label Classification (HMC) for Gene Function Prediction
• Classification: a common machine learning task
• Hierarchy constraint: if a gene is labeled with function X, then it is also labeled with all parents of X
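The hierarchy constraint can be illustrated with a small Python sketch. It assumes a tree-shaped hierarchy whose class identifiers encode the path with "/" separators (as in the class columns 5, 5/1, 40, 40/3 used elsewhere in these slides). The function names are hypothetical and not part of any Clus tool. A DAG-shaped hierarchy such as GO would need an explicit parent map instead.

```python
def ancestors(label):
    """Return all proper ancestors of a '/'-separated class id, root first."""
    parts = label.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


def close_under_hierarchy(labels):
    """Add every ancestor of every label so the hierarchy constraint holds."""
    closed = set(labels)
    for label in labels:
        closed.update(ancestors(label))
    return closed


def satisfies_constraint(labels):
    """True if, for each label, all of its ancestors are also in the set."""
    labels = set(labels)
    return all(a in labels for l in labels for a in ancestors(l))
```

For example, a gene annotated only with 40/3 violates the constraint until 40 is added, which is what `close_under_hierarchy({"40/3"})` does.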
Predictions in Functional Genomics
• S. cerevisiae (13 datasets) and A. thaliana (12 datasets)
– two of biology’s model organisms
– most genes are annotated, ideal for testing purposes
– method can be applied to other organisms
• Data
– based on sequence statistics, phenotype, secondary structure, homology, microarray data, …
Predictive Clustering Trees
• Our focus is on decision trees
• Advantages: fast to build, noise-resistant, fast to apply, accurate predictions, easy to interpret, …
• General framework: predictive clustering trees (PCTs)
[Figure: Input, Algorithm, Output. Input: genes with features and known functions, as a table with one row per gene (G1, G2, …), feature columns A1, A2, …, An, and one column per function (1, …, 5, 5/1, …, 40, 40/3, 40/16, …) marked x where the gene has that function. Algorithm (PCT-algo): top-down induction of PCTs. Output: a PCT.]
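Top-down induction of a tree that predicts all classes at once can be sketched as follows. This is a minimal illustration, not the Clus implementation. It assumes numeric features, 0/1 label vectors, and plain (unweighted) variance of the label vectors as the split heuristic, since the slides do not specify the heuristic. All function names are hypothetical.

```python
def variance(rows):
    """Mean squared Euclidean distance of label vectors to their mean vector."""
    if not rows:
        return 0.0
    n, k = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(k)]
    return sum(sum((r[j] - mean[j]) ** 2 for j in range(k)) for r in rows) / n


def best_split(features, labels):
    """Pick the (score, feature, threshold) minimising size-weighted child variance."""
    best = None
    for a in range(len(features[0])):
        # The largest value is skipped so that both children are non-empty.
        for t in sorted({x[a] for x in features})[:-1]:
            left = [y for x, y in zip(features, labels) if x[a] <= t]
            right = [y for x, y in zip(features, labels) if x[a] > t]
            score = len(left) * variance(left) + len(right) * variance(right)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best


def induce(features, labels, min_size=2):
    """Split recursively; each leaf stores the mean label vector of its genes."""
    split = best_split(features, labels) if len(labels) >= min_size else None
    if split is None or variance(labels) == 0.0:
        n = len(labels)
        return {"leaf": [sum(y[j] for y in labels) / n for j in range(len(labels[0]))]}
    _, a, t = split
    left = [(x, y) for x, y in zip(features, labels) if x[a] <= t]
    right = [(x, y) for x, y in zip(features, labels) if x[a] > t]
    return {
        "feature": a,
        "threshold": t,
        "left": induce([x for x, _ in left], [y for _, y in left], min_size),
        "right": induce([x for x, _ in right], [y for _, y in right], min_size),
    }


def predict(tree, x):
    """Walk from the root to a leaf and return its per-class proportions."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]
```

A leaf returns one proportion per class, so a threshold on these values turns the output into a set of predicted functions.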
Decision Trees for HMC: Different Approaches
• Clus-SC: standard approach, learns one tree per class
• Clus-HSC: special-purpose approach, learns one tree per class + hierarchy constraint
• Clus-HMC: our approach, learns one single tree for all classes
• Comparison criteria: hierarchy constraint, identifies global features, predictive performance, model size, efficiency
[Table: per-criterion ratings for each approach were shown graphically and are not recoverable from the transcript.]
Predictive Clustering Forests
• Ensembles
– Less interpretability
– Better performance
• Algorithm: Clus-HMC-Ens
[Figure: the training set is resampled into 50 bootstrap replicates (1, 2, 3, …, n); Clus-HMC learns one PCT per replicate, giving 50 PCTs; applying them to the test set yields 50 predictions (L1, L2, L3, …, Ln), which are merged into the combined prediction L.]
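The bootstrap-and-combine scheme in the figure can be sketched in Python as follows. The base learner here is a deliberately trivial, hypothetical stand-in for Clus-HMC, and averaging the per-class outputs is an assumed combination rule, since the slide only says the 50 predictions are combined.

```python
import random


def bootstrap(data, rng):
    """Sample len(data) training examples with replacement."""
    return [rng.choice(data) for _ in data]


def learn_class_frequencies(data):
    """Hypothetical stand-in for Clus-HMC: predicts the class frequencies it saw."""
    n, k = len(data), len(data[0][1])
    freq = [sum(y[j] for _, y in data) / n for j in range(k)]
    return lambda x: freq


def bagged_ensemble(data, learn, n_models=50, seed=0):
    """Learn one model per bootstrap replicate of the training set."""
    rng = random.Random(seed)
    return [learn(bootstrap(data, rng)) for _ in range(n_models)]


def combined_prediction(models, x):
    """Average the per-class outputs of all models (assumed combination rule)."""
    outputs = [m(x) for m in models]
    return [sum(o[j] for o in outputs) / len(outputs) for j in range(len(outputs[0]))]
```

Here `data` is a list of (feature vector, 0/1 label vector) pairs. Any function that maps such a list to a predictor can be passed as `learn`.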
Decision Trees for HMC: Different Approaches
• Clus-SC: standard approach, learns one tree per class
• Clus-HSC: special-purpose approach, learns one tree per class + hierarchy constraint
• Clus-HMC: our approach, learns one single tree for all classes
• Clus-HMC-Ens: variant of our approach, learns a forest
• Comparison criteria: hierarchy constraint, identifies global features, predictive performance, model size, efficiency
[Table: per-criterion ratings for each approach were shown graphically and are not recoverable from the transcript.]
Evaluation
• Evaluation: precision-recall
– precision: percentage of predicted functions that are correct, TP/(TP+FP)
– recall: percentage of actual functions predicted by the algorithm, TP/(TP+FN)
• Average PR curve
– Consider (instance, class) couples
– A couple is true if the instance has the class; it is predicted true if the instance is predicted to have the class
[Figure: confusion matrix over couples, with cells TP, FN (first row) and FP, TN (second row).]
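Counting over (instance, class) couples can be sketched in Python as below. The sketch assumes each model outputs one score per class and that a couple is predicted true when its score reaches a threshold. The function names are hypothetical, and returning precision 1.0 when nothing is predicted is a convention chosen here, not taken from the slides.

```python
def pr_point(scores, truth, threshold):
    """Precision and recall over all (instance, class) couples at one threshold.

    scores[i][c] is the predicted score for instance i and class c;
    truth[i][c] is 1 if instance i actually has class c, else 0.
    """
    tp = fp = fn = 0
    for score_row, truth_row in zip(scores, truth):
        for s, t in zip(score_row, truth_row):
            predicted = s >= threshold
            if predicted and t:
                tp += 1
            elif predicted:
                fp += 1
            elif t:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(scores, truth, thresholds):
    """One (precision, recall) point per threshold, strictest threshold first."""
    return [pr_point(scores, truth, t) for t in sorted(thresholds, reverse=True)]
```

Sweeping the threshold from high to low traces the curve from high precision and low recall toward low precision and high recall.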
[Figure panels: S. cerevisiae-FunCat (hom), A. thaliana-GO (seq), S. cerevisiae-FunCat (expr), A. thaliana-GO (interpro)]
• Clus-HMC-Ens better than Clus-HMC (average AUC improvement of 7%)
• Clus-HMC better than C4.5H, a state-of-the-art system for HMC (at the same recall as C4.5H, average precision improvement of 20.9%)
• Comparison with SVMs (Barutcuoglu et al.)
– Learn one SVM per class
– Correct for hierarchy constraint (HC) violations with a Bayesian model
• Clus-HMC outperforms (or is comparable to) state-of-the-art methods on functional genomics tasks
• Ensembles of Clus-HMC boost performance further, if the user is willing to give up interpretability