Late fusion methods and performance metrics for the effective prioritization of drug candidates Author: Gábor Csizmadia Supervisor: Péter Antal
Jan 16, 2016
Abstract
There are many different ways to assess similarities between compounds, such as:
• Structure-based
• Chemical property-based
• Biological effect-based
• Literature-based
Combining several of these methods should yield more accurate results than any single one: this is data fusion.
Implemented software: rank and score fusion methods and performance metrics.
Overview
1. Drug prioritization
2. Data fusion approaches
3. Rank/score fusion
4. Performance metrics
5. Implemented software
6. Future plans
Drug prioritization
Given a list of compounds known to be active against a specific condition, predict which compounds in the unknown set are active by assessing their similarities to the known actives.
Data fusion approaches
1. Early: data vectors are concatenated
2. Intermediate: similarity matrices are combined
3. Late: rankings or scorings are combined (rank and score fusion)
Rank and score fusion
Learning to rank
Rank fusion methods:
• Borda fusion
• Rank vote
• Pareto ranking
• Parallel selection
Score fusion methods:
• Sum score
Borda fusion
1. Each ranking assigns points to each compound based on its rank: with n compounds, the top compound receives n − 1 points and the last receives 0
2. The points are then summed to get the score of each compound
Ranking 1 Ranking 2 Ranking 3 Borda count
Compound 1 4 3 3 10
Compound 2 3 2 4 9
Compound 3 2 4 2 8
Compound 4 1 0 1 2
Compound 5 0 1 0 1
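The two steps above can be sketched in Java. This is a minimal standalone sketch; the class and method names are illustrative, not the deck's actual Fuser implementation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BordaFusion {
    /** With n compounds, rank 1 earns n-1 points and the last rank earns 0;
     *  points from all rankings are summed per compound. */
    public static Map<String, Integer> fuse(List<List<String>> rankings) {
        Map<String, Integer> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int pos = 0; pos < n; pos++) {
                scores.merge(ranking.get(pos), n - 1 - pos, Integer::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // The three rankings from the table above, best compound first
        List<List<String>> rankings = List.of(
                List.of("C1", "C2", "C3", "C4", "C5"),
                List.of("C3", "C1", "C2", "C5", "C4"),
                List.of("C2", "C1", "C3", "C4", "C5"));
        System.out.println(fuse(rankings).get("C1")); // prints 10
    }
}
```

Running it on the table's rankings reproduces the Borda counts 10, 9, 8, 2, 1.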
Rank vote
1. Each ranking votes for its top n compounds
2. The ranking is based on how many votes a compound received
(n=2) Ranking 1 Ranking 2 Ranking 3 Votes
Compound 1 1 1 1 3
Compound 2 1 0 1 2
Compound 3 0 1 0 1
Compound 4 0 0 0 0
Compound 5 0 0 0 0
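The voting scheme can be sketched the same way (illustrative names, not the deck's implementation):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankVote {
    /** Each ranking casts one vote for each of its top n compounds;
     *  compounds are then re-ranked by total votes received. */
    public static Map<String, Integer> fuse(List<List<String>> rankings, int n) {
        Map<String, Integer> votes = new HashMap<>();
        for (List<String> ranking : rankings) {
            // only the first n entries of each ranking get a vote
            for (String compound : ranking.subList(0, Math.min(n, ranking.size()))) {
                votes.merge(compound, 1, Integer::sum);
            }
        }
        return votes;
    }
}
```

With n = 2 and the rankings from the Borda table, this yields the vote counts 3, 2, 1, 0, 0 shown above.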
Pareto ranking
Each compound is scored by the number of compounds that rank better than it in all rankings (the count of compounds that dominate it); fewer dominators means a better fused rank
Ranking 1 Ranking 2 Ranking 3 better in all
Compound 1 1 2 2 0
Compound 2 2 3 1 0
Compound 3 3 1 3 0
Compound 4 4 5 4 3
Compound 5 5 4 5 3
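A sketch of the dominator count (illustrative names; positions are 1-based as in the table above):

```java
import java.util.HashMap;
import java.util.Map;

public class ParetoRanking {
    /** positions.get(c)[k] is the position of compound c in ranking k (1 = best).
     *  A compound's score is the number of compounds that beat it in every
     *  ranking; a lower score means a better fused rank. */
    public static Map<String, Integer> fuse(Map<String, int[]> positions) {
        Map<String, Integer> dominators = new HashMap<>();
        for (Map.Entry<String, int[]> a : positions.entrySet()) {
            int count = 0;
            for (Map.Entry<String, int[]> b : positions.entrySet()) {
                if (a.getKey().equals(b.getKey())) continue;
                boolean betterInAll = true;
                for (int k = 0; k < a.getValue().length; k++) {
                    if (b.getValue()[k] >= a.getValue()[k]) {
                        betterInAll = false;
                        break;
                    }
                }
                if (betterInAll) count++;
            }
            dominators.put(a.getKey(), count);
        }
        return dominators;
    }
}
```

On the table above, Compounds 1–3 are each dominated by nobody (score 0), while Compounds 4 and 5 are each dominated by three compounds.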
Parallel selection
• Compounds are selected from each ranking in turn
• If a compound that would be selected has already been selected before, the next compound from that ranking is selected instead
Ranking 1 Ranking 2 Ranking 3 Fused ranking
1 Compound 1 Compound 3 Compound 2 Compound 1
2 Compound 2 Compound 1 Compound 1 Compound 3
3 Compound 3 Compound 2 Compound 3 Compound 2
4 Compound 4 Compound 5 Compound 4 Compound 4
5 Compound 5 Compound 4 Compound 5 Compound 5
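The round-robin selection with duplicate skipping can be sketched as follows (illustrative names; assumes every ranking covers the same compound set):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ParallelSelection {
    /** Take the best not-yet-selected compound from each ranking in turn
     *  until every compound has been selected once. */
    public static List<String> fuse(List<List<String>> rankings) {
        Set<String> all = new HashSet<>();
        for (List<String> r : rankings) all.addAll(r);
        LinkedHashSet<String> fused = new LinkedHashSet<>(); // keeps selection order
        int[] next = new int[rankings.size()]; // per-ranking cursor
        while (fused.size() < all.size()) {
            for (int i = 0; i < rankings.size() && fused.size() < all.size(); i++) {
                List<String> r = rankings.get(i);
                // skip compounds this ranking offers that were already selected
                while (next[i] < r.size() && fused.contains(r.get(next[i]))) next[i]++;
                if (next[i] < r.size()) fused.add(r.get(next[i]++));
            }
        }
        return new ArrayList<>(fused);
    }
}
```

On the table's three rankings this selects C1 (from Ranking 1), C3 (from Ranking 2), C2 (from Ranking 3), then C4 and C5.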
Sum score
The normalized scores of each ranking are summed to get the fused score of a compound
Ranking 1 Ranking 2 Ranking 3 Sum score
Compound 1 1 0.9 0.7 2.6
Compound 2 0.8 0.5 1 2.3
Compound 3 0.7 1 0.5 2.2
Compound 4 0.2 0 0.1 0.3
Compound 5 0 0.3 0 0.3
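A sketch of score summation. The slides only say "normalized scores"; min-max normalization to [0, 1] is assumed here as one common choice, and the names are illustrative:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SumScore {
    /** Min-max normalize each source's scores to [0, 1] (assumed
     *  normalization), then sum the normalized scores per compound. */
    public static Map<String, Double> fuse(List<Map<String, Double>> scorings) {
        Map<String, Double> fused = new HashMap<>();
        for (Map<String, Double> scores : scorings) {
            double min = Collections.min(scores.values());
            double max = Collections.max(scores.values());
            for (Map.Entry<String, Double> e : scores.entrySet()) {
                double norm = (max == min) ? 0.0 : (e.getValue() - min) / (max - min);
                fused.merge(e.getKey(), norm, Double::sum);
            }
        }
        return fused;
    }
}
```

The table's scores are already in [0, 1] with min 0 and max 1, so normalization leaves them unchanged and the sums match the Sum score column.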
Performance metrics 1.
The performance of a ranking (how early it ranks actives) can be measured in various ways:
Area under curve (AUC) values for the following:
• AC (Accumulation Curve): plots the true positive rate as a function of the fraction of data classified as positive
• ROC (Receiver Operating Characteristic): plots the true positive rate as a function of the false positive rate
• CAC (Centralized AC): the AC with its x-axis transformed (e.g. exponentially) to magnify early retrieval [3]
• CROC (Centralized ROC): the ROC with the same early-retrieval transform applied [3]
[Figure: ROC curve. Source: Wikipedia]
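The ROC AUC can be computed without plotting the curve, using the equivalence between AUC and the probability that a randomly chosen active is ranked above a randomly chosen inactive. A standalone sketch with illustrative names:

```java
import java.util.List;
import java.util.Set;

public class RocAuc {
    /** AUC equals the fraction of (active, inactive) pairs in which the
     *  active is ranked above the inactive (ranking given best-first). */
    public static double auc(List<String> ranking, Set<String> actives) {
        long wins = 0, inactivesBelow = 0;
        // walk from worst to best: each active outranks every inactive seen so far
        for (int i = ranking.size() - 1; i >= 0; i--) {
            if (actives.contains(ranking.get(i))) wins += inactivesBelow;
            else inactivesBelow++;
        }
        long nActives = ranking.size() - inactivesBelow;
        long pairs = nActives * inactivesBelow;
        return pairs == 0 ? Double.NaN : (double) wins / pairs;
    }
}
```

A perfect ranking (all actives first) gives AUC 1.0; a reversed one gives 0.0; random ordering gives about 0.5.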
Performance metrics 2.
Implemented software
• written in Java
• command-line interface
• two modules: fuser (12 classes) and performance tester (13 classes + 2 interfaces)
• dedicated class for scored rankings: Ranking
• common interface for all fusion methods: Fuser
• common interface for all metrics: Metric
• java fusiontester.Main [type] [r1path] [r1ms] [r2path] [r2ms] ...
• java performancetester.Main [type] [rankingpath] [activespath]
Future plans
• better handling of incomplete data
• testing effects of noise
• consider statistical significance of sources
• ...
• (TDK)
References
1. Bence Márton Bolgár. Kernel fúziós módszerek alkalmazása a genomikai kísérlettervezésben és adatelemzésben (Application of kernel fusion methods in genomic experiment design and data analysis). 2012.
2. Fredrik Svensson, Anders Karlén, and Christian Sköld. Virtual Screening Data Fusion Using Both Structure- and Ligand-based Methods. J. Chem. Inf. Model. 2012, 52, 225−232.
3. S. Joshua Swamidass, Chloé-Agathe Azencott, Kenny Daily, and Pierre Baldi. A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval. Bioinformatics, 2010.
4. Jean-François Truchon and Christopher I. Bayly. Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. J. Chem. Inf. Model. 2007, 47, 488-508.