Data-integration for Endpoints, Chemoinformatics and Omics

Danyel Jennen

BRAZIL 2014

Department of Toxicogenomics

DECO2: Moving from DECO towards OECD
Cefic LRI-AIMT4, June 2014 – May 2016

DECO: Data-integration for Endpoints, Chemoinformatics and Omics
Cefic LRI-AIMT3, March 2012 – February 2014


The opportunity

•  There is increasing pressure, through REACH and other initiatives, to generate more complete information sets on safety for every chemical in commerce.

•  Current toxicity testing methods are resource- and animal-intensive.

•  A full battery of tests is generally only available for compounds that may be biologically active (e.g., drugs and pesticides), are produced in high volume, or have potential for widespread exposure.

•  The methods of traditional toxicology testing are insufficient to meet this information need.

•  Assessment of other compounds has been through a variety of procedures, in some instances relying on structural similarity to well-tested compounds.

There is an opportunity to use state-of-the-art biotechnology and informatics methodologies to fill this gap.


Background

•  Grouping / categorization based on similar properties combined with read-across is an important concept for safety assessment.

•  Chemical analogues identified by chemoinformatic tools may share similar toxicological properties.

•  However, subtle changes in chemical structure can result in relevant changes in toxicity (e.g. 2-AAF and 4-AAF, and APAP and AMAP).

Chemoinformatics alone is insufficient to predict the toxicological profile of a chemical.

•  Any additional biological indication could help to increase the likelihood of a correct prediction, such as:

–  ‘omics’ data from either short-term in vivo or in vitro tests

–  high-throughput in vitro screening assays (HTS assays) such as EPA’s ToxCast/Tox-21

•  They can also provide more detailed and mechanistic insights.


DECO’s general Hypothesis

Evaluation strategies based on integrating chemoinformatics approaches with mechanistic data from ‘omics’ and HTS technologies will strengthen confidence that the respective compounds within a class behave similarly in terms of their toxicological profile.

This would allow for improved identification of toxicological hazards leading to better classification and labelling of chemicals.


Objectives

•  To develop a transparent framework that improves the prediction of the repeated dose toxicity of new chemicals by integrating chemoinformatic data with biological information from ‘omics’ and HTS technologies.

•  To generate a high-quality database that contains, for a large number of chemicals, their chemical properties, in vitro omics, in vivo omics, HTS data, and (in vivo) toxicity test results.

•  To develop and evaluate expert rules, useful for industry as well as regulators.

To build a demonstration project where the advancement of integrating different domains, i.e. chemoinformatics, ‘omics’ and HTS, is clearly explained to regulators and industry.


Liver as target

•  The liver is by far the most important organ for determining critical effects in repeated dose toxicity studies.

•  Expected availability of sufficient high quality data, especially in the ‘omics’ field, both in vivo (mainly rat) and in various in vitro models.

•  Well described pathologies, such as hepatitis, steatosis, cirrhosis, inflammation, phospholipidosis, necrosis and cholestasis.

•  The many different documented toxicity mechanisms.

•  Several studies already demonstrated that specific liver pathologies can be predicted by ‘omics’ approaches.

•  Chemical-induced hepatotoxicity is also very relevant for humans.

The liver is an ideal organ for demonstrating the proof-of-concept for integrative approaches.


Project Organization


Work package 1: Data Collection and Compound selection

•  Selection of compounds with high quality data
•  Data selection from public and in house databases
•  Chemoinformatics tool selection
•  Selection of a validation set of compounds

Work package 2: Knowledge base and Toolbox infrastructure

•  Setup of hardware infrastructure
•  Setup of software infrastructure
•  Development of database management tools

Work package 3: Data integration and evaluation of predictive approaches

•  Toxicity prediction approaches based on similarity scores
•  Explorative toxicity prediction of chemical analogues
•  Toxicity prediction by classification approaches


Data sources and data types

Data Source   Data Type            Organ/Tissue/Model        Compounds
PubChem       Chemical Structure   2D fingerprints           ~1400
TG-GATEs      Transcriptomics      rat liver                 143
TG-GATEs      Transcriptomics      rat prim. hepatocytes     130
TG-GATEs      Transcriptomics      human prim. hepatocytes   119
TG-GATEs      Clinical Chemistry   rat blood                 143
TG-GATEs      Histopathology       rat liver                 127
DrugMatrix    Transcriptomics      rat liver                 201
DrugMatrix    Clinical Chemistry   rat blood                 601
DrugMatrix    Histopathology       rat liver                 568
UM            Transcriptomics      HepG2                     122
BASF          Metabonomics         rat plasma                371
ToxCast       HTS data             in vitro assays           320
ToxCast       Histopathology       rat liver                 220


Overlaps of compounds between data sources


Data processing

Example transcriptomics


Data types

•  Transcriptomics, metabolomics, clinical chemistry

–  Log2ratios

•  Histopathology

–  Severity score

•  Structure similarity, 2D fingerprint

–  Tanimoto score
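As a small illustration of the first data type, a log2ratio expresses a treated measurement relative to its control on a log2 scale. The sketch below assumes strictly positive raw intensities; the function name and the numbers are hypothetical and not taken from the DECO pipeline.

```python
import math

def log2_ratio(treated, control):
    """Log2 ratio of a treated measurement relative to its control."""
    if treated <= 0 or control <= 0:
        raise ValueError("log2 ratios need strictly positive values")
    return math.log2(treated / control)

# Hypothetical intensities for one gene (not DECO data).
ratio_up = log2_ratio(8.0, 2.0)    # four-fold increase
ratio_down = log2_ratio(1.0, 4.0)  # four-fold decrease
```

A value of 0 means no change; positive and negative values are symmetric around it, which is why fold changes are usually compared on this scale.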


Liver phenotypic endpoints

•  Liver toxicity

•  Enzyme induction

•  Peroxisome Proliferation

•  Oxidative stress

•  Necrosis

•  Cholestasis

. . .


DECO Hardware Infrastructure



Data integration and evaluation of predictive approaches

1.  Similarity scores & matrices

2.  Integrating transcriptomics datasets

3. Unsupervised clustering → iClusterPlus

4.  Supervised clustering/classification → PAM


Pearson Correlation or Euclidean Distance?

[Figure: three plots comparing the compound pairs 1 vs 2, 3 vs 4 and 1 vs 3]

Pair      Pearson Correlation   Euclidean Distance
1 vs 2    high √                low √
3 vs 4    low X                 low √
1 vs 3    very low √            high √

Correlation coefficients do not (always) correctly reflect the similarities between compounds

Euclidean distances better quantify the (dis)similarity between compounds as shown in a PCA
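The contrast between the two measures can be reproduced with a minimal sketch. The four profiles below are hypothetical log2ratio vectors, not DECO data: one pair responds strongly and similarly, the other pair barely responds at all. Both pairs are equally close in Euclidean terms, yet only the first pair has a high correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles: two strong responders and two near-silent ones.
strong_a = [3.0, -3.0, 2.0, -2.0]
strong_b = [2.8, -3.1, 2.2, -1.9]
weak_c = [0.1, -0.1, 0.05, -0.05]
weak_d = [-0.1, 0.1, -0.05, 0.05]

r_strong = pearson(strong_a, strong_b)    # close to +1
r_weak = pearson(weak_c, weak_d)          # close to -1, driven by tiny values
d_strong = math.dist(strong_a, strong_b)  # small
d_weak = math.dist(weak_c, weak_d)        # equally small
```

The near-silent pair is the kind of case where a low correlation would wrongly suggest dissimilar compounds.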


Calculation of Euclidean Distances

•  Euclidean distance does not depend on the number of samples, but it does depend on the number of variables (genes, metabolites etc.)

•  Euclidean distance ranges from 0 upwards, with the upper limit depending on the data type; therefore rescale to the range 0-1 (divide by the maximum)

•  In order to compare chemical structure with the other data types, structure similarity has to be recalculated as 1 - Tanimoto score

•  All available variables have been used to calculate the Euclidean distance
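The rescaling steps above can be sketched as follows. This is a minimal illustration assuming each compound is described by one numeric profile per data type; the compound names, profile values and Tanimoto scores are hypothetical.

```python
import math

def euclidean_matrix(profiles):
    """Pairwise Euclidean distances between named profiles."""
    names = list(profiles)
    return {(a, b): math.dist(profiles[a], profiles[b])
            for a in names for b in names}

def rescale_to_unit(distances):
    """Divide all distances by the maximum so they fall in the range 0-1."""
    largest = max(distances.values())
    if largest == 0:
        return dict(distances)  # all profiles identical, nothing to rescale
    return {pair: d / largest for pair, d in distances.items()}

def tanimoto_to_distance(similarities):
    """Turn Tanimoto similarities into dissimilarities (1 - Tanimoto)."""
    return {pair: 1.0 - s for pair, s in similarities.items()}

# Hypothetical log2ratio profiles and structure similarities.
profiles = {
    "cmpA": [1.0, 0.0, -1.0],
    "cmpB": [1.0, 0.5, -1.0],
    "cmpC": [-2.0, 3.0, 1.0],
}
tanimoto = {("cmpA", "cmpB"): 0.8, ("cmpA", "cmpC"): 0.25}

omics_distance = rescale_to_unit(euclidean_matrix(profiles))
structure_distance = tanimoto_to_distance(tanimoto)
```

After these two steps every data type is expressed as a dissimilarity between 0 (identical) and 1 (most different), so the matrices can be compared side by side.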


Data types

•  Transcriptomics, metabolomics, clinical chemistry

–  Log2ratios

–  Euclidean distances

•  Histopathology

–  Severity score

–  Euclidean distances

•  Structure similarity, 2D fingerprint

–  Tanimoto score
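For reference, the Tanimoto score of two 2D fingerprints is the number of shared on-bits divided by the number of bits set in either fingerprint. The sketch below represents a fingerprint as a set of on-bit positions; the bit positions are hypothetical, and scoring two empty fingerprints as 0 is an assumption of this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto score of two fingerprints given as collections of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty fingerprints are scored 0
    return len(a & b) / len(union)

# Hypothetical fingerprints: 2 shared bits out of 6 distinct bits.
score = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})
identical = tanimoto({7, 8}, {7, 8})
```

The score runs from 0 (no bits in common) to 1 (identical fingerprints).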


Integrating transcriptomics data from TG-GATEs & DrugMatrix

WY-14,643 and fenofibrate: Tanimoto = 0.3329


iClusterPlus

http://www.mskcc.org/research/epidemiology-biostatistics/biostatistics/iclusterplus

•  Predict cancer subtypes

•  Integrate discrete and continuous variables

R package: iClusterPlus

Latent variables or hidden causes


iClusterPlus on structurally similar compounds

Data types:

•  Structure similarity, 2D fingerprint (Tanimoto score > 0.7)

•  Transcriptomics DrugMatrix; in vivo rat liver data, repeated high dose, 3 days (Euclidean distances)

•  Clinical chemistry DrugMatrix; in vivo rat liver data, repeated high dose, 3 days (Euclidean distances)

In total 9 compounds


iClusterPlus on structurally similar compounds from DrugMatrix Tanimoto score > 0.7


iClusterPlus on DrugMatrix data, Transcriptomics (144 features) & Clinical chemistry (613 features)

Compounds: 1 = betamethasone, 2 = dexamethasone, 3 = hydrocortisone, 4 = 17-methyltestosterone, 5 = ethisterone, 6 = norethindrone acetate, 7 = progesterone, 8 = cortisone, 9 = testosterone


Conclusions iClusterPlus

•  Grouping compounds based on any single data type can result in wrongly clustered compounds that have different toxicity or MOA

•  Or compounds with the same toxicity or MOA don’t cluster together

•  Integrating different data types improves the clustering of analogues.

•  Expert knowledge is needed


PAM classification (1)

•  Tool: PAM = Prediction Analysis of Microarrays

•  Nearest Shrunken Centroids: average gene expression for each gene in each class divided by the within-class standard deviation for that gene

•  Unknown sample compared to class centroids

•  The class whose centroid is closest to the new sample, in squared distance, is the predicted class for that sample

•  By increasing the threshold (which means decreasing the number of features), centroids are shrunken

•  Advantages:

–  removing noisy genes

–  automatic gene selection
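The idea behind nearest shrunken centroids can be sketched in a few lines. This is a simplified illustration, not the PAM R package itself: it uses the textbook form of the method (pooled within-class standard deviation, its median as a stabilising constant, soft-thresholding of the standardised class differences) and assumes at least two classes and non-constant features. The data, function names and threshold are hypothetical.

```python
import math
from statistics import median

def fit_shrunken_centroids(X, y, threshold):
    """X: list of samples (lists of feature values); y: class label per sample."""
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    members = {k: [row for row, lab in zip(X, y) if lab == k] for k in classes}
    overall = [sum(row[i] for row in X) / n for i in range(p)]
    centroids = {k: [sum(r[i] for r in rows) / len(rows) for i in range(p)]
                 for k, rows in members.items()}
    # Pooled within-class standard deviation per feature.
    s = []
    for i in range(p):
        ss = sum((r[i] - centroids[k][i]) ** 2
                 for k in classes for r in members[k])
        s.append(math.sqrt(ss / (n - len(classes))))
    s0 = median(s)  # stabilising constant added to every feature's s
    shrunken = {}
    for k in classes:
        m_k = math.sqrt(1 / len(members[k]) - 1 / n)
        shrunken[k] = []
        for i in range(p):
            scale = m_k * (s[i] + s0)
            d = (centroids[k][i] - overall[i]) / scale
            # Soft-threshold: small differences become exactly zero.
            d_shrunk = math.copysign(max(abs(d) - threshold, 0.0), d)
            shrunken[k].append(overall[i] + scale * d_shrunk)
    priors = {k: len(members[k]) / n for k in classes}
    return {"centroids": shrunken, "scale": [si + s0 for si in s],
            "priors": priors}

def predict(model, x):
    """Return the class whose shrunken centroid is nearest in squared distance."""
    def score(k):
        dist = sum(((xi - ci) / sc) ** 2 for xi, ci, sc
                   in zip(x, model["centroids"][k], model["scale"]))
        return dist - 2 * math.log(model["priors"][k])
    return min(model["centroids"], key=score)

# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[2.0, 0.1], [2.2, -0.1], [1.8, 0.0],
     [-2.0, 0.1], [-1.9, -0.2], [-2.1, 0.05]]
y = ["toxic"] * 3 + ["non-toxic"] * 3
model = fit_shrunken_centroids(X, y, threshold=1.0)
label = predict(model, [1.5, 0.3])
```

With this threshold the noise feature is shrunk to the overall mean in both classes, so it no longer contributes to the prediction, which is the automatic gene selection mentioned above.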




1.  Adjust PAM R-script

2.  Perform classification

Workflow (repeated 5x, results averaged):

Load data sets → Select training & test set → T-test using train samples → PAM training → PAM cross validation → PAM test → Average results of 5 sets

Output:

1.  Sensitivity, specificity, accuracy

2.  List of variables: genes and metabolites

3.  Graphs

4.  Per label distribution of toxic and non-toxic samples in train and test sets
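The three performance measures in the output can be computed from true and predicted labels as in the sketch below. The labels are hypothetical and "toxic" is assumed to be the positive class; averaging over the 5 train/test sets would simply take the mean of each measure.

```python
def classification_metrics(true_labels, predicted_labels, positive="toxic"):
    """Sensitivity, specificity and accuracy in percent."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive:
            if guess == positive:
                tp += 1
            else:
                fn += 1
        else:
            if guess == positive:
                fp += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "sensitivity": 100 * tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": 100 * tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": 100 * (tp + tn) / total if total else float("nan"),
    }

# Hypothetical test set: 3 toxic and 5 non-toxic compounds.
truth = ["toxic"] * 3 + ["non-toxic"] * 5
guess = ["toxic", "toxic", "non-toxic",
         "non-toxic", "non-toxic", "non-toxic", "non-toxic", "toxic"]
metrics = classification_metrics(truth, guess)
```

Sensitivity is the share of toxic compounds that are recognised as toxic, specificity the share of non-toxic compounds recognised as non-toxic.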



PAM classification

[Figure: data matrix of COMPOUNDS by VARIABLES (genes, metabolites, compounds), with the compounds split into TRAIN (2/3) and TEST (1/3)]

•  5 different train and test sets

•  Same fraction of toxic compounds in train and test set

•  10-fold cross validation

•  Adjust threshold in order to optimize sensitivity, specificity and accuracy

LABELS: toxic, enzyme induction, cholestasis

LABELS: non-toxic, no enzyme induction, no cholestasis
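A split that keeps the same fraction of toxic compounds in the train and test sets can be sketched as below. This is an illustration under assumptions: the label counts are hypothetical, the split is done per class with a seeded shuffle, and five different seeds stand in for the 5 different train and test sets.

```python
import random

def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Split sample indices so each class keeps its share in train and test."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)

# Hypothetical label set: 12 toxic and 18 non-toxic compounds.
labels = ["toxic"] * 12 + ["non-toxic"] * 18
splits = [stratified_split(labels, seed=s) for s in range(5)]
```

Each of the five splits puts 2/3 of every class into the training set and 1/3 into the test set, so the toxic fraction is identical on both sides.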


Combination of data types (log2ratios)

•  TGX gives best results

•  Combining data types does not improve PAM prediction for all tested endpoints

•  Combinations with TGX perform best

[Figure: Log2ratios, oxidative stress. Sensitivity, specificity and accuracy (scale 0 to 100) for the data-type combinations tgxall; mbxall; clinall; tgxall + mbxall; tgxall + clinall; mbxall + clinall; tgxall + mbxall + clinall]


Combination of data types (Eucl. Dist.)

[Figure: Euclidean Dist. & Tanimoto score, Enzyme induction. Scale 0 to 100 for the combinations str; str + tgxall; str + tgxall + mbxall; str + tgxall + mbxall + clinall]

•  Low sensitivity, high specificity

•  No effect of adding data types

[Figure: sensitivity, specificity and accuracy (scale 0 to 100) for the combinations tgxall; tgxall + mbxall; tgxall + clinall; mbxall + clinall; tgxall + mbxall + clinall]

•  Removing structural information improves PAM prediction

•  Combinations with TGX provide best PAM prediction


PAM predictions using transcriptomics data at different exposure times

[Figure: three charts (scale 0 to 100) of PAM predictions at 3d, 7d, 14d and 28d of exposure, one panel each for Enzyme Induction, Oxidative Stress and Peroxisome Proliferation]

•  Predictions at earlier time points equally good or even better

•  Best predictions at 7 or 14 days

•  Predictions for peroxisome proliferators already good at day 3


TG-GATEs and DrugMatrix combined results for peroxisome proliferation at day 3 & 7

I)  Training: 35 comp from TG-GATEs; Test: 17 comp from TG-GATEs

II)  Training: 60 comp from TG-GATEs; Test: 44 comp from DrugMatrix

III)  Training: 44 comp from DrugMatrix; Test: 60 comp from TG-GATEs

              day 3                  day 7
              I      II     III      I      II     III
sensitivity   56.7   90     40       56.7   90.9   40
specificity   98.7   97.1   100      100    85.2   100
accuracy      91.8   95.5   90       92.9   86.8   90

•  TG-GATEs as training and DrugMatrix as test set improves overall toxicity prediction

•  Increasing size of training set improves prediction


Class distribution of 56 compounds at high dose and 28 days of exposure

[Figure: bar chart of the number of compounds per class (non toxic / toxic) for each endpoint]

Endpoint                    non toxic   toxic
toxicity                    15          41
necrosis                    30          26
oxidative stress            33          23
cholestasis                 37          19
enzyme induction            34          22
peroxisomal proliferation   47          9

•  For some endpoints data is imbalanced, e.g. ‘toxicity’ and ‘peroxisome proliferation’

•  This may result in serious underrepresentation in some training sets

•  Apply oversampling of minority class or under-sampling of majority class
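Oversampling of the minority class can be sketched as random resampling with replacement until every class is as large as the largest one. This is one simple variant, shown with hypothetical compound identifiers and class sizes; it should be applied to the training set only, so that duplicated compounds never end up in the test set.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, lab in zip(samples, labels) if lab == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical imbalanced training set: 6 positives versus 30 negatives.
train_samples = [f"cmp{i}" for i in range(36)]
train_labels = ["toxic"] * 6 + ["non-toxic"] * 30
balanced_samples, balanced_labels = oversample_minority(train_samples, train_labels)
balanced_counts = Counter(balanced_labels)
```

Under-sampling works the other way round, by randomly dropping majority-class samples, at the cost of discarding data.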


Conclusions PAM

•  Transcriptomics provides best classification results

•  Combinations of data types do not improve toxicity prediction

•  Combinations with transcriptomics perform best

•  Combinations with structural information are worst

•  Log2ratios perform better than Euclidean distances

•  Predictive models for oxidative stress and peroxisome proliferation perform best

•  Increasing size of training set improves prediction

•  Combined use of TG-GATEs and DrugMatrix improves toxicity predictions


Summary

•  The use of omics data adds value to the prediction of liver toxicity induced by chemical compounds

•  The availability of good quality data including non-(liver) toxicants is essential

•  The DECO project has resulted in new and promising approaches for the prediction of liver toxicity by integrating chemoinformatic and omics data


Recommendations (1)

•  Further optimization and integration of the different approaches for liver toxicity prediction;

•  Include non-(liver) toxicants in order to improve accuracy of the predictions;

•  Dose response relationships from the various in vivo assays have to be taken into account to include potency into the analysis;

•  Include other endpoints, like kidney toxicity, (non-genotoxic) carcinogenicity or developmental toxicity.


Recommendations (2)

•  Apply in vitro transcriptomics data to predict in vivo liver toxicity;

•  All activities need to be projected into simple interpretable tools for data availability overview, data visualization, and for making an assessment on the prediction(s);

•  A link to the Adverse Outcome Pathway concept and its development has to be established;

•  Clear guidance is needed on how to generate transcriptomics and metabolomics data in experimental in vivo and in vitro studies for the prediction approaches.


Acknowledgement

Joost van Delft Jan Polman Jos Kleinjans

Dinant Kroese Rob Stierum Eugene van Someren

Hennicke Kamp Gina Montoya-Para

Grace Patlewicz
