Late fusion methods and performance metrics for the effective prioritization of drug candidates Author: Gábor Csizmadia Supervisor: Péter Antal
Jan 16, 2016
Abstract
There are many different ways to assess similarities between compounds, such as:
• Structure-based
• Chemical property-based
• Biological effect-based
• Literature-based
Combining several of these methods should yield more accurate results than any single one: this is data fusion.
Implemented software: rank and score fusion methods and performance metrics.
Overview
1. Drug prioritization
2. Data fusion approaches
3. Rank/score fusion
4. Performance metrics
5. Implemented software
6. Future plans
Drug prioritization
Given a list of compounds known to be active against a specific condition, predict which compounds in the unknown set are active by assessing their similarities to the known actives.
Data fusion approaches
1. Early: data vectors are concatenated
2. Intermediate: similarity matrices are combined
3. Late: rankings or scorings are combined (rank and score fusion)
Rank and score fusion
Learning to rank
Rank fusion methods:
• Borda fusion
• Rank vote
• Pareto ranking
• Parallel selection
Score fusion methods:
• Sum score
Borda fusion
1. Each ranking assigns points to each compound based on its rank: with n compounds, the top compound receives n − 1 points and the last receives 0
2. The points are then summed to get the score of each compound
Ranking 1 Ranking 2 Ranking 3 Borda count
Compound 1 4 3 3 10
Compound 2 3 2 4 9
Compound 3 2 4 2 8
Compound 4 1 0 1 2
Compound 5 0 1 0 1
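The two steps above can be sketched in Java. This is a minimal standalone sketch; the class and method names are illustrative, not the deck's actual Fuser implementation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BordaFusion {
    /** With n compounds, rank 1 earns n-1 points and the last rank earns 0;
     *  points from all rankings are summed per compound. */
    public static Map<String, Integer> fuse(List<List<String>> rankings) {
        Map<String, Integer> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int pos = 0; pos < n; pos++) {
                scores.merge(ranking.get(pos), n - 1 - pos, Integer::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // The three rankings from the table above, best compound first
        List<List<String>> rankings = List.of(
                List.of("C1", "C2", "C3", "C4", "C5"),
                List.of("C3", "C1", "C2", "C5", "C4"),
                List.of("C2", "C1", "C3", "C4", "C5"));
        System.out.println(fuse(rankings).get("C1")); // prints 10
    }
}
```

Running it on the table's rankings reproduces the Borda counts 10, 9, 8, 2, 1.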
Rank vote
1. Each ranking votes for its top n compounds
2. The ranking is based on how many votes a compound received
(n=2) Ranking 1 Ranking 2 Ranking 3 Votes
Compound 1 1 1 1 3
Compound 2 1 0 1 2
Compound 3 0 1 0 1
Compound 4 0 0 0 0
Compound 5 0 0 0 0
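The voting scheme can be sketched the same way (illustrative names, not the deck's implementation):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankVote {
    /** Each ranking casts one vote for each of its top n compounds;
     *  compounds are then re-ranked by total votes received. */
    public static Map<String, Integer> fuse(List<List<String>> rankings, int n) {
        Map<String, Integer> votes = new HashMap<>();
        for (List<String> ranking : rankings) {
            // only the first n entries of each ranking get a vote
            for (String compound : ranking.subList(0, Math.min(n, ranking.size()))) {
                votes.merge(compound, 1, Integer::sum);
            }
        }
        return votes;
    }
}
```

With n = 2 and the rankings from the Borda table, this yields the vote counts 3, 2, 1, 0, 0 shown above.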
Pareto ranking
Each compound is scored by the number of compounds that rank better than it in all rankings (the count of compounds that dominate it); fewer dominators means a better fused rank
Ranking 1 Ranking 2 Ranking 3 better in all
Compound 1 1 2 2 0
Compound 2 2 3 1 0
Compound 3 3 1 3 0
Compound 4 4 5 4 3
Compound 5 5 4 5 3
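A sketch of the dominator count (illustrative names; positions are 1-based as in the table above):

```java
import java.util.HashMap;
import java.util.Map;

public class ParetoRanking {
    /** positions.get(c)[k] is the position of compound c in ranking k (1 = best).
     *  A compound's score is the number of compounds that beat it in every
     *  ranking; a lower score means a better fused rank. */
    public static Map<String, Integer> fuse(Map<String, int[]> positions) {
        Map<String, Integer> dominators = new HashMap<>();
        for (Map.Entry<String, int[]> a : positions.entrySet()) {
            int count = 0;
            for (Map.Entry<String, int[]> b : positions.entrySet()) {
                if (a.getKey().equals(b.getKey())) continue;
                boolean betterInAll = true;
                for (int k = 0; k < a.getValue().length; k++) {
                    if (b.getValue()[k] >= a.getValue()[k]) {
                        betterInAll = false;
                        break;
                    }
                }
                if (betterInAll) count++;
            }
            dominators.put(a.getKey(), count);
        }
        return dominators;
    }
}
```

On the table above, Compounds 1–3 are each dominated by nobody (score 0), while Compounds 4 and 5 are each dominated by three compounds.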
Parallel selection
• Compounds are selected from each ranking in turn
• If a compound that would be selected has already been selected before, the next compound from that ranking is selected instead
Ranking 1 Ranking 2 Ranking 3 Fused ranking
1 Compound 1 Compound 3 Compound 2 Compound 1
2 Compound 2 Compound 1 Compound 1 Compound 3
3 Compound 3 Compound 2 Compound 3 Compound 2
4 Compound 4 Compound 5 Compound 4 Compound 4
5 Compound 5 Compound 4 Compound 5 Compound 5
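The round-robin selection with duplicate skipping can be sketched as follows (illustrative names; assumes every ranking covers the same compound set):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ParallelSelection {
    /** Take the best not-yet-selected compound from each ranking in turn
     *  until every compound has been selected once. */
    public static List<String> fuse(List<List<String>> rankings) {
        Set<String> all = new HashSet<>();
        for (List<String> r : rankings) all.addAll(r);
        LinkedHashSet<String> fused = new LinkedHashSet<>(); // keeps selection order
        int[] next = new int[rankings.size()]; // per-ranking cursor
        while (fused.size() < all.size()) {
            for (int i = 0; i < rankings.size() && fused.size() < all.size(); i++) {
                List<String> r = rankings.get(i);
                // skip compounds this ranking offers that were already selected
                while (next[i] < r.size() && fused.contains(r.get(next[i]))) next[i]++;
                if (next[i] < r.size()) fused.add(r.get(next[i]++));
            }
        }
        return new ArrayList<>(fused);
    }
}
```

On the table's three rankings this selects C1 (from Ranking 1), C3 (from Ranking 2), C2 (from Ranking 3), then C4 and C5.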
Sum score
The normalized scores of each ranking are summed to get the fused score of a compound
Ranking 1 Ranking 2 Ranking 3 Sum score
Compound 1 1 0.9 0.7 2.6
Compound 2 0.8 0.5 1 2.3
Compound 3 0.7 1 0.5 2.2
Compound 4 0.2 0 0.1 0.3
Compound 5 0 0.3 0 0.3
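A sketch of score summation. The slides only say "normalized scores"; min-max normalization to [0, 1] is assumed here as one common choice, and the names are illustrative:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SumScore {
    /** Min-max normalize each source's scores to [0, 1] (assumed
     *  normalization), then sum the normalized scores per compound. */
    public static Map<String, Double> fuse(List<Map<String, Double>> scorings) {
        Map<String, Double> fused = new HashMap<>();
        for (Map<String, Double> scores : scorings) {
            double min = Collections.min(scores.values());
            double max = Collections.max(scores.values());
            for (Map.Entry<String, Double> e : scores.entrySet()) {
                double norm = (max == min) ? 0.0 : (e.getValue() - min) / (max - min);
                fused.merge(e.getKey(), norm, Double::sum);
            }
        }
        return fused;
    }
}
```

The table's scores are already in [0, 1] with min 0 and max 1, so normalization leaves them unchanged and the sums match the Sum score column.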
Performance metrics 1.
The performance of a ranking (how early it ranks actives) can be measured in various ways:
Area under curve (AUC) values for the following:
• AC (Accumulation Curve): plots the true positive rate as a function of the fraction of data classified as positive
• ROC (Receiver Operating Characteristic): plots the true positive rate as a function of the false positive rate
• CAC (Centralized AC): the AC with its x-axis transformed (e.g. exponentially) to magnify early retrieval [3]
• CROC (Centralized ROC): the ROC with the same early-retrieval transform applied [3]
[Figure: ROC curve. Source: Wikipedia]
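The ROC AUC can be computed without plotting the curve, using the equivalence between AUC and the probability that a randomly chosen active is ranked above a randomly chosen inactive. A standalone sketch with illustrative names:

```java
import java.util.List;
import java.util.Set;

public class RocAuc {
    /** AUC equals the fraction of (active, inactive) pairs in which the
     *  active is ranked above the inactive (ranking given best-first). */
    public static double auc(List<String> ranking, Set<String> actives) {
        long wins = 0, inactivesBelow = 0;
        // walk from worst to best: each active outranks every inactive seen so far
        for (int i = ranking.size() - 1; i >= 0; i--) {
            if (actives.contains(ranking.get(i))) wins += inactivesBelow;
            else inactivesBelow++;
        }
        long nActives = ranking.size() - inactivesBelow;
        long pairs = nActives * inactivesBelow;
        return pairs == 0 ? Double.NaN : (double) wins / pairs;
    }
}
```

A perfect ranking (all actives first) gives AUC 1.0; a reversed one gives 0.0; random ordering gives about 0.5.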
Performance metrics 2.
Implemented software
• written in Java
• command-line interface
• two modules: fuser (12 classes) and performance tester (13 classes + 2 interfaces)
• dedicated class for scored rankings: Ranking
• common interface for all fusion methods: Fuser
• common interface for all metrics: Metric
• java fusiontester.Main [type] [r1path] [r1ms] [r2path] [r2ms] ...
• java performancetester.Main [type] [rankingpath] [activespath]
Future plans
• better handling of incomplete data
• testing effects of noise
• consider statistical significance of sources
• ...
• (TDK)
References
1. Bence Márton Bolgár. Kernel fúziós módszerek alkalmazása a genomikai kísérlettervezésben és adatelemzésben (Application of kernel fusion methods in genomic experiment design and data analysis). 2012.
2. Fredrik Svensson, Anders Karlén, and Christian Sköld. Virtual Screening Data Fusion Using Both Structure- and Ligand-based Methods. J. Chem. Inf. Model. 2012, 52, 225−232.
3. S. Joshua Swamidass, Chloé-Agathe Azencott, Kenny Daily, and Pierre Baldi. A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval. Bioinformatics, 2010.
4. Jean-François Truchon and Christopher I. Bayly. Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. J. Chem. Inf. Model. 2007, 47, 488-508.