Data-integration for Endpoints, Chemoinformatics and Omics Danyel Jennen BRAZIL 2014
Department of Toxicogenomics
DECO2: Moving from DECO towards OECD (Cefic LRI-AIMT4, June 2014 – May 2016)
DECO: Data-integration for Endpoints, Cheminformatics and Omics (Cefic LRI-AIMT3, March 2012 – February 2014)
The opportunity
• There is increasing pressure, through REACH and other initiatives, to generate more complete information sets on safety for every chemical in commerce.
• Current toxicity testing methods are resource- and animal-intensive.
• A full battery of tests is generally only available for compounds that may be biologically active (e.g., drugs and pesticides), are produced in high volumes, or have potential for widespread exposure.
• The methods of traditional toxicology testing are insufficient to meet this information need.
• Assessment of other compounds has been through a variety of procedures, in some instances relying on structural similarity to well-tested compounds.
There is an opportunity to use state-of-the-art biotechnology and informatics methodologies to fill this gap.
Background
• Grouping / categorization based on similar properties, combined with read-across, is an important concept for safety assessment.
• Chemical analogues identified by chemoinformatic tools may share similar toxicological properties.
• However, subtle changes in chemical structure can result in relevant changes in toxicity (e.g. 2-AAF and 4-AAF, and APAP and AMAP). Chemoinformatics alone is insufficient to predict the toxicological profile of a chemical.
• Any additional biological indication could help to increase the likelihood of a correct prediction, such as:
– ‘omics’ data from either short-term in vivo or in vitro tests
– high-throughput in vitro screening assays (HTS assays) such as EPA’s ToxCast/Tox21
• They can also provide more detailed and mechanistic insights.
DECO’s general hypothesis
Evaluation strategies based on integrating chemoinformatics approaches with mechanistic data from ‘omics’ and HTS technologies will strengthen confidence that the respective compounds within a class behave similarly in terms of their toxicological profile.
This would allow for improved identification of toxicological hazards leading to better classification and labelling of chemicals.
Objectives
• To develop a transparent framework that improves the prediction of the repeated dose toxicity of new chemicals by integrating chemoinformatic data with biological information from ‘omics’ and HTS technologies.
• To generate a high-quality database that contains, for a large number of chemicals, their chemical properties, in vitro omics, in vivo omics, HTS data, and (in vivo) toxicity test results.
• To develop and evaluate expert rules, useful for industry as well as regulators.
• To build a demonstration project where the advancement of integrating different domains, i.e. chemoinformatics, ‘omics’, and HTS, is clearly explained to regulators and industry.
Liver as target
• The liver is by far the most important organ for determining critical effects in repeated dose toxicity studies.
• Sufficient high-quality data are expected to be available, especially in the ‘omics’ field, both in vivo (mainly rat) and in various in vitro models.
• Well-described pathologies, such as hepatitis, steatosis, cirrhosis, inflammation, phospholipidosis, necrosis and cholestasis.
• Many different documented toxicity mechanisms.
• Several studies have already demonstrated that specific liver pathologies can be predicted by ‘omics’ approaches.
• Chemical-induced hepatotoxicity is also very relevant for humans.
The liver is an ideal organ for demonstrating the proof-of-concept for integrative approaches.
Project Organization
Work package 1: Data Collection and Compound selection
• Selection of compounds with high quality data
• Data selection from public and in house databases
• Chemoinformatics tool selection
• Selection of a validation set of compounds
Work package 2: Knowledge base and Toolbox infrastructure
• Setup of hardware infrastructure
• Setup of software infrastructure
• Development of database management tools
Work package 3: Data integration and evaluation of predictive approaches
• Toxicity prediction approaches based on similarity scores
• Explorative toxicity prediction of chemical analogues
• Toxicity prediction by classification approaches
Data sources and data types
Data Source   Data Type            Organ/Tissue/Model        Compounds
PubChem       Chemical Structure   2D fingerprints           ~1400
TG-GATEs      Transcriptomics      rat liver                 143
TG-GATEs      Transcriptomics      rat prim. hepatocytes     130
TG-GATEs      Transcriptomics      human prim. hepatocytes   119
TG-GATEs      Clinical Chemistry   rat blood                 143
TG-GATEs      Histopathology       rat liver                 127
DrugMatrix    Transcriptomics      rat liver                 201
DrugMatrix    Clinical Chemistry   rat blood                 601
DrugMatrix    Histopathology       rat liver                 568
UM            Transcriptomics      HepG2                     122
BASF          Metabonomics         rat plasma                371
ToxCast       HTS data             in vitro assays           320
ToxCast       Histopathology       rat liver                 220
Overlaps of compounds between data sources
Data processing
Example transcriptomics
Data types
• Transcriptomics, metabolomics, clinical chemistry – Log2ratios
• Histopathology – Severity score
• Structure similarity, 2D fingerprint – Tanimoto score
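The scores listed above can be illustrated with a minimal sketch. It assumes that measurements are plain positive numbers and that a 2D fingerprint is represented as a set of "on" bit positions; the function names and example values are hypothetical and are not taken from the DECO tool set.

```python
import math


def log2_ratio(treated, control):
    """Log2 ratio of a treated measurement over its control."""
    return math.log2(treated / control)


def tanimoto(bits_a, bits_b):
    """Tanimoto score of two fingerprints given as sets of 'on' bits."""
    shared = len(bits_a & bits_b)
    total = len(bits_a | bits_b)
    # Convention assumed here: two empty fingerprints count as identical
    return shared / total if total else 1.0


# Hypothetical example values
up = log2_ratio(400.0, 100.0)    # four-fold increase
down = log2_ratio(50.0, 100.0)   # two-fold decrease
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8, 9, 12}
score = tanimoto(fp_a, fp_b)     # 3 shared bits out of 6 bits set in either
```

A log2 ratio is symmetric around zero, so a two-fold increase and a two-fold decrease have the same magnitude, which is convenient when profiles are compared by distance.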
Liver phenotypic endpoints
• Liver toxicity
• Enzyme induction
• Peroxisome Proliferation
• Oxidative stress
• Necrosis
• Cholestasis
. . .
DECO Hardware Infrastructure
Data integration and evaluation of predictive approaches
1. Similarity scores & matrices
2. Integrating transcriptomics datasets
3. Unsupervised clustering → iClusterPlus
4. Supervised clustering/classification → PAM
Pearson Correlation or Euclidean Distance?
[Figure: three plots comparing pairs of compound profiles (1 vs 2, 3 vs 4, 1 vs 3); √ marks a measure that reflects the similarity correctly, X one that does not]
• 1 vs 2: Pearson correlation high √; Euclidean distance low √
• 3 vs 4: Pearson correlation low X; Euclidean distance low √
• 1 vs 3: Pearson correlation very low √; Euclidean distance high √
Correlation coefficients do not (always) correctly reflect the similarities between compounds
Euclidean distances better quantify the (dis)similarity between compounds as shown in a PCA
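The contrast between the two measures can be reproduced with a small sketch. The profiles below are invented log2-ratio values over five genes, chosen only to illustrate the point; they are not data from the project.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def euclidean(x, y):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical log2-ratio profiles over five genes
strong = [4.0, -3.0, 2.0, -4.0, 3.0]       # strong response
strong_like = [3.5, -3.0, 2.5, -4.0, 2.5]  # similar strong response
weak_a = [0.1, -0.1, 0.1, 0.0, -0.1]       # almost no response
weak_b = [-0.1, 0.1, 0.0, 0.1, 0.1]        # almost no response, different noise

r_strong = pearson(strong, strong_like)    # high: both measures agree
d_strong = euclidean(strong, strong_like)  # small
r_weak = pearson(weak_a, weak_b)           # negative, although both barely respond
d_weak = euclidean(weak_a, weak_b)         # small: the two are in fact similar
d_mixed = euclidean(strong, weak_a)        # large: responder vs non-responder
```

For the two weak profiles the correlation is driven by noise and suggests dissimilarity, whereas the Euclidean distance stays small, which matches the "3 vs 4" case on the slide.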
Calculation of Euclidean Distances
• Euclidean distance does not depend on the number of samples, but does depend on the number of variables (genes, metabolites, etc.)
• Euclidean distance ranges from 0 to … depending on data type; therefore rescale to the range 0-1 (divide by the maximum)
• In order to compare chemical structure with the other data types, structure similarity has to be recalculated as 1 − Tanimoto score
• All available variables have been used to calculate the Euclidean distance
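The rescaling described above can be sketched as follows. The sketch assumes that "maximum" means the largest pairwise distance within one data type; the compound names, profiles and function names are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def scaled_distance_matrix(profiles):
    """Pairwise Euclidean distances divided by the largest one (range 0-1)."""
    names = list(profiles)
    raw = {(a, b): euclidean(profiles[a], profiles[b])
           for a in names for b in names}
    largest = max(raw.values())
    if largest == 0:
        return raw  # all profiles identical, nothing to rescale
    return {pair: d / largest for pair, d in raw.items()}


def structure_distance(tanimoto_score):
    """Turn a Tanimoto similarity (0-1) into a distance (0-1)."""
    return 1.0 - tanimoto_score


# Hypothetical profiles of three compounds over three variables
profiles = {
    "A": [0.0, 0.0, 0.0],
    "B": [3.0, 4.0, 0.0],
    "C": [6.0, 8.0, 0.0],
}
dist = scaled_distance_matrix(profiles)
```

After rescaling, every data type lives on the same 0-1 scale as 1 − Tanimoto, so structural and biological distances can be compared or combined.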
Data types
• Transcriptomics, metabolomics, clinical chemistry – Log2ratios – Euclidean distances
• Histopathology – Severity score – Euclidean distances
• Structure similarity, 2D fingerprint – Tanimoto score
Integrating transcriptomics data from TG-GATEs & DrugMatrix
WY-14,643 and fenofibrate: Tanimoto = 0.3329
iClusterPlus
http://www.mskcc.org/research/epidemiology-biostatistics/biostatistics/iclusterplus
• Predict cancer subtypes
• Integrate discrete and continuous variables
R package: iClusterPlus
Latent variables or hidden causes
iClusterPlus on structurally similar compounds
Data types:
• Structure similarity, 2D fingerprint (Tanimoto score > 0.7)
• Transcriptomics DrugMatrix; in vivo rat liver data, repeated high dose, 3 days (Euclidean distances)
• Clinical chemistry DrugMatrix; in vivo rat liver data, repeated high dose, 3 days (Euclidean distances)
In total 9 compounds
iClusterPlus on structurally similar compounds from DrugMatrix (Tanimoto score > 0.7)
iClusterPlus on DrugMatrix data, Transcriptomics (144 features) & Clinical chemistry (613 features)
1=betamethasone 2=dexamethasone 3=hydrocortisone 4=17-methyltestosterone 5=ethisterone 6=norethindrone acetate 7=progesterone 8=cortisone 9=testosterone
Conclusions iClusterPlus
• Grouping compounds based on any single data type can wrongly cluster together compounds that differ in toxicity or mode of action (MOA)
• It can also fail to cluster together compounds that share the same toxicity or MOA
• Integrating different data types improves the clustering of analogues.
• Expert knowledge is needed
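The effect of integrating data types can be illustrated with a deliberately simple sketch that averages rescaled distance matrices. This is not what iClusterPlus does (it fits a joint latent variable model); the averaging, the compound names "X", "Y", "Z" and all distances below are assumptions made purely for illustration.

```python
def integrate(distance_matrices):
    """Average several distance matrices (dicts keyed by compound pair)."""
    pairs = distance_matrices[0].keys()
    n = len(distance_matrices)
    return {p: sum(m[p] for m in distance_matrices) / n for p in pairs}


def nearest(matrix, compound):
    """Return the compound with the smallest distance to `compound`."""
    candidates = {b: d for (a, b), d in matrix.items() if a == compound}
    return min(candidates, key=candidates.get)


# Hypothetical distances (0-1) from compound "X" to two analogues
structure = {("X", "Y"): 0.25, ("X", "Z"): 0.5}        # 1 - Tanimoto
transcriptomics = {("X", "Y"): 1.0, ("X", "Z"): 0.25}  # scaled Euclidean
clinical_chem = {("X", "Y"): 0.75, ("X", "Z"): 0.25}   # scaled Euclidean

by_structure = nearest(structure, "X")
combined = integrate([structure, transcriptomics, clinical_chem])
by_all = nearest(combined, "X")
```

In this invented case structure alone groups "X" with "Y", while the combined distances group "X" with "Z", the analogue that behaves similarly in the biological data.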
PAM classification (1)
• Tool: PAM = Prediction Analysis of Microarrays
• Nearest Shrunken Centroids: the class centroid is the average gene expression for each gene in each class, divided by the within-class standard deviation for that gene
• An unknown sample is compared to the class centroids
• The class whose centroid it is closest to, in squared distance, is the predicted class for that new sample
• Increasing the threshold shrinks the centroids, which decreases the number of features
• Advantages:
- removing noisy genes
- automatic gene selection
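The idea behind nearest shrunken centroids can be sketched in a few lines. This is a simplified illustration, not the PAM R package: it omits the class-size factor, the offset added to the standard deviation and the class priors of the published method, and it assumes every gene has non-zero within-class variance. All data and names are hypothetical.

```python
import math


def soft_threshold(value, delta):
    """Shrink a value towards zero by delta; values within delta become 0."""
    return math.copysign(max(abs(value) - delta, 0.0), value)


def train(samples, labels, delta):
    """Return per-gene standard deviations and shrunken class centroids."""
    n_genes = len(samples[0])
    classes = sorted(set(labels))
    overall = [sum(s[g] for s in samples) / len(samples) for g in range(n_genes)]
    means = {}
    for c in classes:
        rows = [s for s, lab in zip(samples, labels) if lab == c]
        means[c] = [sum(r[g] for r in rows) / len(rows) for g in range(n_genes)]
    # Pooled within-class standard deviation per gene
    dof = len(samples) - len(classes)
    sds = []
    for g in range(n_genes):
        ss = sum((s[g] - means[lab][g]) ** 2 for s, lab in zip(samples, labels))
        sds.append(math.sqrt(ss / dof))
    centroids = {}
    for c in classes:
        centroids[c] = []
        for g in range(n_genes):
            d = (means[c][g] - overall[g]) / sds[g]  # standardized difference
            centroids[c].append(overall[g] + sds[g] * soft_threshold(d, delta))
    return sds, centroids


def predict(sample, sds, centroids):
    """Assign the class whose shrunken centroid is closest in squared distance."""
    def score(c):
        return sum(((x - m) / s) ** 2
                   for x, m, s in zip(sample, centroids[c], sds))
    return min(centroids, key=score)


# Hypothetical data: gene 0 separates the classes, gene 1 is noise
samples = [[2.0, 0.2], [3.0, -0.2], [-2.0, 0.3], [-3.0, -0.1]]
labels = ["toxic", "toxic", "non-toxic", "non-toxic"]
sds, centroids = train(samples, labels, delta=1.0)
prediction = predict([1.5, 0.4], sds, centroids)
```

With this threshold the noisy gene is shrunk to the overall mean in both classes, so it no longer contributes to the class decision, which is the automatic gene selection mentioned in the bullets.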
PAM classification (2)
PAM classification (3)
1. Adjust PAM R-script
2. Perform classification
[Flowchart, with a 5x loop: Load data sets → Select training & test set → T-test using train samples → PAM training → PAM cross validation → PAM test → Average results of 5 sets]
Output:
1. Sensitivity, specificity, accuracy
2. List of variables: genes and metabolites
3. Graphs
4. Per label distribution of toxic and non-toxic samples in train and test sets
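The first output of the workflow (sensitivity, specificity and accuracy, averaged over the repeated sets) can be computed as in the sketch below. It assumes that "toxic" is the positive class and that both classes occur in each test set; the function names and example labels are hypothetical.

```python
def classification_metrics(actual, predicted, positive="toxic"):
    """Sensitivity, specificity and accuracy in percent."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "sensitivity": 100.0 * tp / (tp + fn),  # toxic compounds found
        "specificity": 100.0 * tn / (tn + fp),  # non-toxic compounds found
        "accuracy": 100.0 * (tp + tn) / len(pairs),
    }


def average_metrics(runs):
    """Average each metric over several train/test repetitions."""
    return {key: sum(r[key] for r in runs) / len(runs) for key in runs[0]}


# Hypothetical test set: 4 toxic and 6 non-toxic compounds
actual = ["toxic"] * 4 + ["non-toxic"] * 6
predicted = (["toxic"] * 3 + ["non-toxic"]
             + ["toxic"] + ["non-toxic"] * 5)
metrics = classification_metrics(actual, predicted)
```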
PAM classification
[Figure: data matrix of COMPOUNDS by VARIABLES (genes, metabolites, compounds), split into TRAIN (2/3) and TEST (1/3)]
• 5 different train and test sets
• Same fraction of toxic compounds in train and test set
• 10-fold cross validation
• Adjust threshold in order to optimize sensitivity, specificity and accuracy
LABELS: toxic / non-toxic; enzyme induction / no enzyme induction; cholestasis / no cholestasis
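A train/test split that keeps the same fraction of toxic compounds on both sides can be sketched as below. The seed handling, function name and class sizes are assumptions for illustration; the actual R script used in the project is not shown in the slides.

```python
import random


def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Split sample indices so each label keeps its share in train and test."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(labels)):
        indices = [i for i, lab in enumerate(labels) if lab == label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical data set: 18 toxic and 36 non-toxic compounds
labels = ["toxic"] * 18 + ["non-toxic"] * 36
# Five different train and test sets, one per seed
splits = [stratified_split(labels, seed=s) for s in range(5)]
```

Each of the five splits puts two thirds of every class into the training set, so the toxic fraction is identical in train and test.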
Combination of data types (log2ratios)
• TGX gives best results
• Combining data types does not improve PAM prediction for all tested endpoints
• Combinations with TGX perform best
[Figure: bar chart "Log2ratios, oxidative stress" showing sensitivity, specificity and accuracy (0 to 100) for tgxall; mbxall; clinall; tgxall + mbxall; tgxall + clinall; mbxall + clinall; tgxall + mbxall + clinall]
Combination of data types (Eucl. Dist.)
[Figure: bar chart "Euclidean Dist. & Tanimoto score, Enzyme induction" showing sensitivity, specificity and accuracy (0 to 100) for str; str + tgxall; str + tgxall + mbxall; str + tgxall + mbxall + clinall]
• Low sensitivity, high specificity
• No effect of adding data types
[Figure: second bar chart showing sensitivity, specificity and accuracy (0 to 100) for tgxall; tgxall + mbxall; tgxall + clinall; mbxall + clinall; tgxall + mbxall + clinall]
• Removing structural information improves PAM prediction
• Combinations with TGX provide best PAM prediction
PAM predictions using transcriptomics data at different exposure times
[Figure: three bar charts (Enzyme Induction, Oxidative Stress, Peroxisome Proliferation) showing sensitivity, specificity and accuracy (0 to 100) at 3d, 7d, 14d and 28d]
• Predictions at earlier time points equally good or even better
• Best predictions at 7 or 14 days
• Predictions for peroxisome proliferators already good at day 3
TG-GATEs and DrugMatrix combined results for peroxisome proliferation at day 3 & 7
I) Training: 35 compounds from TG-GATEs; Test: 17 compounds from TG-GATEs
II) Training: 60 compounds from TG-GATEs; Test: 44 compounds from DrugMatrix
III) Training: 44 compounds from DrugMatrix; Test: 60 compounds from TG-GATEs

              day 3                 day 7
              I      II     III     I      II     III
sensitivity   56.7   90     40      56.7   90.9   40
specificity   98.7   97.1   100     100    85.2   100
accuracy      91.8   95.5   90      92.9   86.8   90

• TG-GATEs as training and DrugMatrix as test set improves overall toxicity prediction
• Increasing size of training set improves prediction
Class distribution of 56 compounds at high dose and 28 days of exposure
[Figure: bar chart of the number of non toxic and toxic compounds per endpoint]

Endpoint                    non toxic   toxic
toxicity                    15          41
necrosis                    30          26
oxidative stress            33          23
cholestasis                 37          19
enzyme induction            34          22
peroxisomal proliferation   47          9

• For some endpoints data is imbalanced, e.g. ‘toxicity’ and ‘peroxisome proliferation’
• This may result in serious underrepresentation in some training sets
• Apply oversampling of minority class or under-sampling of majority class
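Random oversampling of the minority class, one of the two remedies named above, can be sketched as follows. The class sizes, compound identifiers and function name are hypothetical; applying it to the training set only is an assumption that follows the slide's concern about underrepresentation in training sets.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples until all classes are equally large."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(rows) for rows in by_class.values())
    out_samples, out_labels = [], []
    for label, rows in by_class.items():
        # Draw extra copies with replacement from the smaller class
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        out_samples.extend(rows + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical training set: 4 toxic and 12 non-toxic compounds
samples = [f"tox{i}" for i in range(4)] + [f"non{i}" for i in range(12)]
labels = ["toxic"] * 4 + ["non-toxic"] * 12
balanced_samples, balanced_labels = oversample_minority(samples, labels)
```

Under-sampling the majority class would work the same way in reverse, by drawing a random subset of the larger class down to the size of the smaller one.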
Conclusions PAM
• Transcriptomics provides best classification results
• Combinations of data types do not improve toxicity prediction
• Combinations with transcriptomics perform best
• Combinations with structural information perform worst
• Log2ratios perform better than Euclidean distances
• Predictive models for oxidative stress and peroxisome proliferation perform best
• Increasing size of training set improves prediction
• Combined use of TG-GATEs and DrugMatrix improves toxicity predictions
Summary
• The use of omics data adds value to the prediction of liver toxicity induced by chemical compounds.
• The availability of good quality data, including non-(liver) toxicants, is essential.
• The DECO project has resulted in new and promising approaches for the prediction of liver toxicity by integrating chemoinformatic and omics data.
Recommendations (1)
• Further optimization and integration of the different approaches for liver toxicity prediction;
• Include non-(liver) toxicants in order to improve accuracy of the predictions;
• Dose response relationships from the various in vivo assays have to be taken into account to include potency into the analysis;
• Include other endpoints, like kidney toxicity, (non-genotoxic) carcinogenicity or developmental toxicity.
Recommendations (2)
• Apply in vitro transcriptomics data to predict in vivo liver toxicity;
• All activities need to be projected into simple interpretable tools for data availability overview, data visualization, and for making an assessment on the prediction(s);
• A link to the Adverse Outcome Pathway concept and its development has to be established;
• Clear guidance for how to generate transcriptomics and metabolomics data in experimental in vivo and in vitro studies for the prediction approaches is needed.
Acknowledgement
Joost van Delft Jan Polman Jos Kleinjans
Dinant Kroese Rob Stierum Eugene van Someren
Hennicke Kamp Gina Montoya-Para
Grace Patlewicz