
Data-integration for Endpoints, Chemoinformatics and Omics

Danyel Jennen

BRAZIL 2014


Department of Toxicogenomics 2

DECO2: Moving from DECO towards OECD (Cefic LRI-AIMT4, June 2014 – May 2016)

DECO: Data-integration for Endpoints, Chemoinformatics and Omics (Cefic LRI-AIMT3, March 2012 – February 2014)


The opportunity

•  There is increasing pressure, through REACH and other initiatives, to generate more complete safety information sets for every chemical in commerce.
•  Current toxicity testing methods are resource- and animal-intensive.
•  A full battery of tests is generally available only for compounds that may be biologically active (e.g., drugs and pesticides), are produced in high volume, or have potential for widespread exposure.
•  The methods of traditional toxicology testing are insufficient to meet this information need.
•  Other compounds have been assessed through a variety of procedures, in some instances relying on structural similarity to well-tested compounds.

→  There is an opportunity to use state-of-the-art biotechnology and informatics methodologies to fill this gap.


Background

•  Grouping / categorization based on similar properties, combined with read-across, is an important concept for safety assessment.
•  Chemical analogues identified by chemoinformatic tools may share similar toxicological properties.
•  However, subtle changes in chemical structure can result in relevant changes in toxicity (e.g. 2-AAF and 4-AAF, and APAP and AMAP).

→  Chemoinformatics alone is insufficient to predict the toxicological profile of a chemical.

•  Any additional biological indication could help to increase the likelihood of a correct prediction, such as:
   –  ‘omics’ data from either short-term in vivo or in vitro tests
   –  high-throughput in vitro screening assays (HTS assays) such as EPA’s ToxCast/Tox21
•  These data can also provide more detailed and mechanistic insights.


DECO’s general Hypothesis

Evaluation strategies based on integrating chemoinformatics approaches with mechanistic data from ‘omics’ and HTS technologies will strengthen confidence that the compounds within a class behave similarly in terms of their toxicological profile.

This would allow for improved identification of toxicological hazards leading to better classification and labelling of chemicals.


Objectives

•  To develop a transparent framework that improves the prediction of the repeated dose toxicity of new chemicals by integrating chemoinformatic data with biological information from ‘omics’ and HTS technologies.
•  To generate a high-quality database that contains, for a large number of chemicals, their chemical properties, in vitro omics, in vivo omics, HTS data, and (in vivo) toxicity test results.
•  To develop and evaluate expert rules that are useful for industry as well as regulators.

→  To build a demonstration project where the advancement of integrating different domains, i.e. chemoinformatics, ‘omics’, and HTS, is clearly explained to regulators and industry.


Liver as target

•  The liver is by far the most important organ for determining critical effects in repeated dose toxicity studies.
•  Sufficient high-quality data are expected to be available, especially in the ‘omics’ field, both in vivo (mainly rat) and in various in vitro models.
•  Well-described pathologies, such as hepatitis, steatosis, cirrhosis, inflammation, phospholipidosis, necrosis and cholestasis.
•  Many different documented toxicity mechanisms.
•  Several studies have already demonstrated that specific liver pathologies can be predicted by ‘omics’ approaches.
•  Chemical-induced hepatotoxicity is also very relevant for humans.

→  The liver is an ideal organ for demonstrating the proof-of-concept for integrative approaches.


Project Organization


Work package 1: Data collection and compound selection

•  Selection of compounds with high-quality data
•  Data selection from public and in-house databases
•  Chemoinformatics tool selection
•  Selection of a validation set of compounds

Work package 2: Knowledge base and toolbox infrastructure

•  Setup of hardware infrastructure
•  Setup of software infrastructure
•  Development of database management tools

Work package 3: Data integration and evaluation of predictive approaches

•  Toxicity prediction approaches based on similarity scores
•  Explorative toxicity prediction of chemical analogues
•  Toxicity prediction by classification approaches


Data sources and data types

Data Source   Data Type            Organ/Tissue/Model        Compounds
PubChem       Chemical structure   2D fingerprints           ~1400
TG-GATEs      Transcriptomics      rat liver                 143
                                   rat prim. hepatocytes     130
                                   human prim. hepatocytes   119
              Clinical chemistry   rat blood                 143
              Histopathology       rat liver                 127
DrugMatrix    Transcriptomics      rat liver                 201
              Clinical chemistry   rat blood                 601
              Histopathology       rat liver                 568
UM            Transcriptomics      HepG2                     122
BASF          Metabonomics         rat plasma                371
ToxCast       HTS data             in vitro assays           320
              Histopathology       rat liver                 220


Overlaps of compounds between data sources


Data processing

Example transcriptomics


Data types

•  Transcriptomics, metabolomics, clinical chemistry
   –  Log2ratios
•  Histopathology
   –  Severity score
•  Structure similarity, 2D fingerprint
   –  Tanimoto score
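As a minimal sketch of two of the measures named on this slide: a log2 ratio compares a treated measurement with its control, and a Tanimoto score is the number of shared "on" bits divided by the number of bits set in either fingerprint. The function names, intensities and bit positions below are hypothetical and only serve as illustration; they are not taken from the DECO pipeline.

```python
import math

def log2_ratio(treated, control):
    """Log2 ratio of a treated measurement over its matched control."""
    return math.log2(treated / control)

def tanimoto(bits_a, bits_b):
    """Tanimoto score of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints are identical
    return len(a & b) / len(a | b)

# Hypothetical values, for illustration only.
print(log2_ratio(400.0, 100.0))  # 2.0, a fourfold increase
print(log2_ratio(50.0, 100.0))   # -1.0, a halving

fp_x = {1, 4, 7, 9}
fp_y = {1, 4, 8, 9, 12}
print(tanimoto(fp_x, fp_y))      # 3 shared bits / 6 bits in the union = 0.5
```

In practice the 2D fingerprints would come from a chemoinformatics toolkit; plain sets are used here to keep the sketch self-contained.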


Liver phenotypic endpoints

•  Liver toxicity
•  Enzyme induction
•  Peroxisome proliferation
•  Oxidative stress
•  Necrosis
•  Cholestasis

. . .


DECO Hardware Infrastructure


Data integration and evaluation of predictive approaches

1.  Similarity scores & matrices

2.  Integrating transcriptomics datasets

3. Unsupervised clustering → iClusterPlus

4.  Supervised clustering/classification → PAM


Pearson Correlation or Euclidian Distance?

[Figure: three scatter plots comparing pairs of compound profiles]

Comparison   Pearson correlation   Euclidian distance
1 vs 2       high √                low √
3 vs 4       low X                 low √
1 vs 3       very low √            high √

•  Correlation coefficients do not (always) correctly reflect the similarities between compounds
•  Euclidian distances better quantify the (dis)similarity between compounds, as shown in a PCA
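The "3 vs 4" case can be reproduced with a small numerical sketch: two profiles that both stay close to zero are effectively alike, yet their Pearson correlation can be strongly negative, while their Euclidian distance stays small. The profiles and function names below are hypothetical and chosen only to show the effect.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def euclidean(x, y):
    """Euclidian distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical log2 ratios over five variables.
strong = [4.0, -3.0, 5.0, -4.0, 2.0]   # compound with a clear response
weak_a = [0.1, -0.1, 0.0, 0.1, -0.1]   # compound with almost no response
weak_b = [-0.1, 0.1, 0.1, -0.1, 0.0]   # another compound with almost no response

r_weak = pearson(weak_a, weak_b)
d_weak = euclidean(weak_a, weak_b)
d_strong = euclidean(strong, weak_a)

print(round(r_weak, 2))    # -0.75: the correlation suggests opposite behaviour
print(round(d_weak, 2))    # 0.37: the distance shows the profiles are close
print(round(d_strong, 2))  # 8.36: a responding compound is far from both
```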


Calculation of Euclidian distances

•  The Euclidian distance does not depend on the number of samples, but it does depend on the number of variables (genes, metabolites, etc.).
•  The Euclidian distance ranges from 0 to … depending on the data type; it is therefore rescaled to the range 0-1 (divided by the maximum).
•  To compare chemical structure with the other data types, structure similarity has to be recalculated as a distance: 1 − Tanimoto score.
•  All available variables have been used to calculate the Euclidian distance.
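The rescaling and the similarity-to-distance conversion can be sketched as follows. This is a minimal illustration under stated assumptions: the compound names and profiles are hypothetical, and the maximum used for rescaling is taken over all pairs in the matrix, which is one plausible reading of "divide by maximum".

```python
import math

def euclidean(x, y):
    """Euclidian distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def scaled_distance_matrix(profiles):
    """Pairwise Euclidian distances divided by the largest distance (range 0-1)."""
    names = list(profiles)
    raw = {(a, b): euclidean(profiles[a], profiles[b]) for a in names for b in names}
    largest = max(raw.values())
    if largest == 0:
        return raw  # all profiles identical, nothing to rescale
    return {pair: d / largest for pair, d in raw.items()}

def structure_distance(tanimoto_score):
    """Turn a Tanimoto similarity (0-1) into a distance on the same 0-1 scale."""
    return 1.0 - tanimoto_score

# Hypothetical two-variable profiles.
profiles = {"cmpd_A": [0.0, 0.0], "cmpd_B": [3.0, 4.0], "cmpd_C": [6.0, 8.0]}
dist = scaled_distance_matrix(profiles)
print(dist[("cmpd_A", "cmpd_B")])  # 0.5 (raw distance 5 of a maximum of 10)
print(dist[("cmpd_A", "cmpd_C")])  # 1.0 (the maximum itself)

# Tanimoto score quoted in the deck for WY-14,643 and fenofibrate.
print(round(structure_distance(0.3329), 4))  # 0.6671
```

Because the maximum depends on which compounds are in the matrix, rescaled distances from different compound sets are not directly comparable.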


Data types

•  Transcriptomics, metabolomics, clinical chemistry
   –  Log2ratios
   –  Euclidian distances
•  Histopathology
   –  Severity score
   –  Euclidian distances
•  Structure similarity, 2D fingerprint
   –  Tanimoto score


Integrating transcriptomics data from TG-GATEs & DrugMatrix

WY-14,643 vs fenofibrate: Tanimoto = 0.3329


iClusterPlus

http://www.mskcc.org/research/epidemiology-biostatistics/biostatistics/iclusterplus

•  Predict cancer subtypes
•  Integrate discrete and continuous variables

R package: iClusterPlus


Latent variables or hidden causes


iClusterPlus on structurally similar compounds

Data types:

•  Structure similarity, 2D fingerprint (Tanimoto score > 0.7)
•  Transcriptomics DrugMatrix; in vivo rat liver data, repeated high dose, 3 days (Euclidian distances)
•  Clinical chemistry DrugMatrix; in vivo rat liver data, repeated high dose, 3 days (Euclidian distances)

In total 9 compounds


iClusterPlus on structurally similar compounds from DrugMatrix (Tanimoto score > 0.7)


iClusterPlus on DrugMatrix data: Transcriptomics (144 features) & Clinical chemistry (613 features)

Compound key:
1 = betamethasone
2 = dexamethasone
3 = hydrocortisone
4 = 17-methyltestosterone
5 = ethisterone
6 = norethindrone acetate
7 = progesterone
8 = cortisone
9 = testosterone


Conclusions iClusterPlus

•  Grouping compounds based on any single data type can wrongly cluster together compounds that have a different toxicity or mode of action (MOA)

•  Conversely, compounds with the same toxicity or MOA may not cluster together

•  Integrating different data types improves the clustering of analogues.

•  Expert knowledge is needed


PAM classification (1)

•  Tool: PAM = Prediction Analysis of Microarrays
•  Nearest Shrunken Centroids: the average gene expression for each gene in each class, divided by the within-class standard deviation for that gene
•  An unknown sample is compared to the class centroids
•  The class whose centroid it is closest to, in squared distance, is the predicted class for that new sample
•  Increasing the threshold shrinks the centroids (and decreases the number of features)
•  Advantages:
   –  removing noisy genes
   –  automatic gene selection
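The nearest shrunken centroid idea can be sketched in a few lines. This is a simplified illustration, not the R script used in the project: it assumes equal class priors, at least two classes, non-constant features, and uses the median within-class standard deviation as the offset. All names and data are hypothetical.

```python
import math
from statistics import median

def fit_shrunken_centroids(X, y, threshold):
    """X: list of samples (lists of feature values); y: class label per sample."""
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    overall = [sum(row[i] for row in X) / n for i in range(p)]
    members = {k: [row for row, lab in zip(X, y) if lab == k] for k in classes}
    centroid = {k: [sum(r[i] for r in rows) / len(rows) for i in range(p)]
                for k, rows in members.items()}
    # Pooled within-class standard deviation per feature.
    s = []
    for i in range(p):
        ss = sum((row[i] - centroid[lab][i]) ** 2 for row, lab in zip(X, y))
        s.append(math.sqrt(ss / (n - len(classes))))
    s0 = median(s)                      # offset that guards against tiny deviations
    scale = [si + s0 for si in s]
    shrunken = {}
    for k in classes:
        m_k = math.sqrt(1.0 / len(members[k]) - 1.0 / n)
        shrunken[k] = []
        for i in range(p):
            # Standardised difference between class centroid and overall centroid.
            d = (centroid[k][i] - overall[i]) / (m_k * scale[i])
            # Soft thresholding: move d towards zero by `threshold`.
            d_shrunk = math.copysign(max(abs(d) - threshold, 0.0), d)
            shrunken[k].append(overall[i] + m_k * scale[i] * d_shrunk)
    return shrunken, scale

def predict(model, sample):
    """Return the class whose shrunken centroid is closest in scaled squared distance."""
    shrunken, scale = model
    def score(k):
        return sum(((x - c) / sc) ** 2 for x, c, sc in zip(sample, shrunken[k], scale))
    return min(shrunken, key=score)

# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[2.0, 0.1], [2.2, -0.2], [1.8, 0.0],
     [0.0, 0.2], [0.2, -0.1], [-0.2, 0.1]]
y = ["toxic"] * 3 + ["non-toxic"] * 3

model = fit_shrunken_centroids(X, y, threshold=1.0)
print(predict(model, [1.9, 0.3]))   # toxic
print(predict(model, [0.1, -0.3]))  # non-toxic
```

With this threshold the noise feature ends up with the same value in both shrunken centroids, so it no longer influences the prediction; that is the automatic gene selection named above.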


PAM classification (2)


PAM classification (3)

1.  Adjust PAM R-script
2.  Perform classification

Workflow (flowchart; the "5x" marker indicates a loop repeated five times):
1.  Load data sets
2.  Select training & test set
3.  T-test using train samples
4.  PAM training
5.  PAM cross validation
6.  PAM test
7.  Average results of 5 sets

Output:
1.  Sensitivity, specificity, accuracy
2.  List of variables: genes and metabolites
3.  Graphs
4.  Per label distribution of toxic and non-toxic samples in train and test sets
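The three performance measures reported by the workflow, and their averaging over repeated train/test sets, can be sketched as below. The function names and the example labels are hypothetical; two test sets are shown for brevity where the slides use five, and each test set is assumed to contain both classes.

```python
def classification_metrics(true_labels, predicted_labels, positive="toxic"):
    """Return sensitivity, specificity and accuracy as percentages."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return {
        "sensitivity": 100.0 * tp / (tp + fn),   # share of toxic compounds found
        "specificity": 100.0 * tn / (tn + fp),   # share of non-toxic compounds found
        "accuracy": 100.0 * (tp + tn) / (tp + fn + tn + fp),
    }

def average_metrics(per_set):
    """Average each measure over the repeated train/test sets."""
    return {key: sum(m[key] for m in per_set) / len(per_set) for key in per_set[0]}

# Hypothetical test-set results.
truth = ["toxic"] * 3 + ["non-toxic"] * 7
pred_1 = ["toxic", "toxic", "non-toxic"] + ["non-toxic"] * 6 + ["toxic"]
pred_2 = list(truth)  # a perfectly predicted set

results = [classification_metrics(truth, pred_1),
           classification_metrics(truth, pred_2)]
averaged = average_metrics(results)
print({key: round(value, 1) for key, value in averaged.items()})
```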


PAM classification

[Diagram: data matrix of COMPOUNDS by VARIABLES (genes, metabolites, compounds), with the compounds split into TRAIN (2/3) and TEST (1/3)]

•  5 different train and test sets
•  Same fraction of toxic compounds in train and test set
•  10-fold cross validation
•  Adjust threshold in order to optimize sensitivity, specificity and accuracy

LABELS: toxic, enzyme induction, cholestasis

LABELS: non-toxic, no enzyme induction, no cholestasis
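A split that keeps the same fraction of toxic compounds in the training (2/3) and test (1/3) sets can be sketched by splitting each class separately. The compound names, class sizes and seeds below are hypothetical; the project's own R script is not shown in the slides.

```python
import random

def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Split compound names into train/test lists, class by class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels.values())):
        names = sorted(name for name, lab in labels.items() if lab == cls)
        rng.shuffle(names)
        cut = round(len(names) * train_fraction)
        train += names[:cut]
        test += names[cut:]
    return train, test

# Hypothetical label set: 9 toxic and 21 non-toxic compounds.
labels = {f"cmpd_{i:02d}": ("toxic" if i < 9 else "non-toxic") for i in range(30)}

# Five different train/test sets, one per seed.
splits = [stratified_split(labels, seed=s) for s in range(5)]

for train, test in splits:
    toxic_train = sum(labels[name] == "toxic" for name in train)
    toxic_test = sum(labels[name] == "toxic" for name in test)
    print(len(train), len(test), toxic_train / len(train), toxic_test / len(test))
```

Each of the five lines printed reports 20 training and 10 test compounds with a toxic fraction of 0.3 in both.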


Combination of data types (log2ratios)

•  TGX gives best results
•  Combining data types does not improve PAM prediction for all tested endpoints
•  Combinations with TGX perform best

[Bar chart: Log2ratios, oxidative stress. Sensitivity, specificity and accuracy (0 to 100) for tgxall; mbxall; clinall; tgxall + mbxall; tgxall + clinall; mbxall + clinall; tgxall + mbxall + clinall]


Combination of data types (Eucl. Dist.)

[Bar chart: Euclidian Dist. & Tanimoto score, Enzyme induction. Sensitivity, specificity and accuracy (0 to 100) for str; str + tgxall; str + tgxall + mbxall; str + tgxall + mbxall + clinall]

•  Low sensitivity, high specificity
•  No effect of adding data types

[Bar chart: sensitivity, specificity and accuracy (0 to 100) for tgxall; tgxall + mbxall; tgxall + clinall; mbxall + clinall; tgxall + mbxall + clinall]

•  Removing structural information improves PAM prediction
•  Combinations with TGX provide the best PAM prediction


PAM predictions using transcriptomics data at different exposure times

[Bar charts: sensitivity, specificity and accuracy (0 to 100) at exposure times 3d, 7d, 14d and 28d; one panel each for Enzyme Induction, Oxidative Stress and Peroxisome Proliferation]

•  Predictions at earlier time points equally good or even better
•  Best predictions at 7 or 14 days
•  Predictions for peroxisome proliferators already good at day 3


TG-GATEs and DrugMatrix combined results for peroxisome proliferation at day 3 & 7

I)    Training: 35 comp from TG-GATEs; Test: 17 comp from TG-GATEs
II)   Training: 60 comp from TG-GATEs; Test: 44 comp from DrugMatrix
III)  Training: 44 comp from DrugMatrix; Test: 60 comp from TG-GATEs

Values in %      day 3                  day 7
                 I      II     III      I      II     III
sensitivity      56.7   90     40       56.7   90.9   40
specificity      98.7   97.1   100      100    85.2   100
accuracy         91.8   95.5   90       92.9   86.8   90

•  TG-GATEs as training and DrugMatrix as test set improves overall toxicity prediction
•  Increasing size of training set improves prediction


Class distribution of 56 compounds at high dose and 28 days of exposure

[Bar chart: number of non-toxic and toxic compounds per endpoint]

Endpoint                    non toxic   toxic
toxicity                    15          41
necrosis                    30          26
oxidative stress            33          23
cholestasis                 37          19
enzyme induction            34          22
peroxisomal proliferation   47          9

•  For some endpoints the data are imbalanced, e.g. ‘toxicity’ and ‘peroxisome proliferation’
•  This may result in serious underrepresentation in some training sets
•  Apply oversampling of the minority class or under-sampling of the majority class
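Random oversampling of the minority class can be sketched as follows: minority samples are redrawn with replacement until every class is as large as the largest one. The function name and compound names are hypothetical; the class sizes (47 and 9) mirror the most skewed pair of bars in the chart. Under-sampling would instead draw a random subset of the majority class.

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equally large."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical training data with a 47 versus 9 class imbalance.
samples = [f"cmpd_{i:02d}" for i in range(56)]
labels = ["majority"] * 47 + ["minority"] * 9

balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(balanced_labels.count("majority"), balanced_labels.count("minority"))  # 47 47
```

Resampling is normally applied to the training set only, so that duplicated compounds cannot appear in both the training and the test set.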


Conclusions PAM

•  Transcriptomics provides best classification results
•  Combinations of data types do not improve toxicity prediction
•  Combinations with transcriptomics perform best
•  Combinations with structural information are worst
•  Log2ratios perform better than Euclidian distances
•  Predictive models for oxidative stress and peroxisome proliferation perform best
•  Increasing size of training set improves prediction
•  Combined use of TG-GATEs and DrugMatrix improves toxicity predictions


Summary

•  The use of omics data adds value to the prediction of liver toxicity induced by chemical compounds

•  The availability of good-quality data, including data on non-(liver) toxicants, is essential

•  The DECO project has resulted in new and promising approaches for the prediction of liver toxicity by integrating chemoinformatic and omics data


Recommendations (1)

•  Further optimization and integration of the different approaches for liver toxicity prediction;
•  Include non-(liver) toxicants in order to improve accuracy of the predictions;
•  Dose response relationships from the various in vivo assays have to be taken into account to include potency into the analysis;
•  Include other endpoints, like kidney toxicity, (non-genotoxic) carcinogenicity or developmental toxicity.


Recommendations (2)

•  Apply in vitro transcriptomics data to predict in vivo liver toxicity;
•  All activities need to be translated into simple, interpretable tools for data availability overview, data visualization, and for making an assessment of the prediction(s);
•  A link to the Adverse Outcome Pathway concept and its development has to be established;
•  Clear guidance is needed on how to generate transcriptomics and metabolomics data in experimental in vivo and in vitro studies for the prediction approaches.


Acknowledgement

Joost van Delft Jan Polman Jos Kleinjans

Dinant Kroese Rob Stierum Eugene van Someren

Hennicke Kamp Gina Montoya-Para

Grace Patlewicz