Mining Big Datasets to Create and Validate Machine Learning Models
Alex M. Clark 1* and Sean Ekins 2,3,4*
1 Molecular Materials Informatics, 1900 St. Jacques #302, Montreal H3J 2S1, Quebec, Canada
2 Collaborations Pharmaceuticals, Inc., 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA
3 Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010, USA
4 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA
• Work with a collaborator to get experimental data
• Go out and mine literature for data
  • Curation needed; issues with intra-lab variability and data quality
• Mine databases
  • Issues with chemistry quality, errors in data
Just a matter of scale?
Drug Discovery’s definition of Big data
Everyone else’s definition of Big data
• Data Sources
• PubChem
• ChEMBL
• ToxCast: over 1,800 molecules tested against over 800 endpoints
Where can we get the datasets?
Mining for gold
Melting point and solubility
• But no structures!
Open source – but much smaller
400 diverse, drug-like molecules active against neglected diseases
400 compounds selected from around 20,000 hits generated by a screening campaign of ~four million compounds from the libraries of St. Jude Children's Research Hospital, TN, USA, Novartis and GSK.
Many screens completed
Bigger datasets and model collections
• Profiling “big datasets” is going to be the norm.
• A recent study mined PubChem datasets for compounds that have rat in vivo acute toxicity data.
• The same approach could be applied to other big data initiatives such as ToxCast (>1,000 compounds × 800 assays) and Tox21.
• Kinase screening data (1000s of molecules × 100s of assays)
• GPCR datasets etc. (1000s of molecules × 100s of assays)
Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal Toxicants by Automatically Mining Public Bioassay Data: A Big Data Approach for Computational Toxicology. PLoS ONE 9(6): e99863. doi:10.1371/journal.pone.0099863
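A compound-by-assay profile of the kind listed above can be sketched as a sparse table. This is only an illustration under stated assumptions: the records, compound names, and assay names are hypothetical placeholders, not ToxCast or PubChem data, and `build_profile` and `coverage` are names invented for this sketch.

```python
from collections import defaultdict

# Hypothetical (compound, assay, active?) records; placeholders, not real data.
records = [
    ("cpd-1", "assay-A", True),
    ("cpd-1", "assay-B", False),
    ("cpd-2", "assay-A", False),
    ("cpd-3", "assay-B", True),
    ("cpd-3", "assay-C", True),
]


def build_profile(records):
    """Group outcomes into {compound: {assay: bool}}; untested pairs stay absent."""
    profile = defaultdict(dict)
    for compound, assay, active in records:
        profile[compound][assay] = active
    return dict(profile)


def coverage(profile):
    """Fraction of the compound x assay grid that actually has a measurement."""
    assays = {assay for row in profile.values() for assay in row}
    filled = sum(len(row) for row in profile.values())
    return filled / (len(profile) * len(assays))


profile = build_profile(records)
print(f"{len(profile)} compounds, coverage {coverage(profile):.2f}")
```

Keeping untested pairs absent, rather than recording them as inactive, avoids treating missing measurements as negative results when the grid is only partly filled.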
• Bit folding – plateau at 4096, can use 1024 with little degradation
• Cut off – works well
• Evaluated balanced training:test splits and “diabolical” splits, where test and training sets are structurally different
• Easy ROC 0.83 ± 0.11; Hard ROC 0.39 ± 0.23
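Bit folding can be illustrated with a minimal sketch, assuming the fingerprint is available as a list of integer hash codes and that folding is done by taking each code modulo the bit length. The hash values below are made up, and `fold_fingerprint` is a hypothetical helper, not the authors' implementation.

```python
def fold_fingerprint(hashes, n_bits=1024):
    """Fold integer hash codes into a set of bit positions in [0, n_bits)."""
    if n_bits <= 0:
        raise ValueError("n_bits must be positive")
    # Python's % returns a non-negative result for a positive modulus,
    # so signed hash codes are handled as well.
    return {h % n_bits for h in hashes}


# Made-up hash codes standing in for circular fingerprint features.
hashes = [5, 1029, 4101, 70000, 123456789]

folded_1024 = fold_fingerprint(hashes, 1024)
folded_4096 = fold_fingerprint(hashes, 4096)

# Fewer bits means more collisions: distinct features share a position.
print(len(hashes), len(folded_4096), len(folded_1024))
```

In this toy case the five features occupy four positions at 4096 bits and three at 1024 bits, which shows the kind of information loss that shorter folding introduces.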
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
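For reading the ROC values above, a minimal sketch of the area under the ROC curve may help. It uses the pairwise ranking definition (the fraction of active/inactive pairs in which the active scores higher, ties counting half); the scores and labels are invented, and `roc_auc` is a hypothetical helper rather than the code used in the paper.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons of actives and inactives."""
    actives = [s for s, y in zip(scores, labels) if y]
    inactives = [s for s, y in zip(scores, labels) if not y]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5  # ties count as half a win
    return wins / (len(actives) * len(inactives))


# Invented model scores and true activity labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [True, False, True, False, False]

print(round(roc_auc(scores, labels), 3))
```

A value of 0.5 corresponds to random ranking, so a mean ROC below 0.5 means the model ranked inactives above actives more often than not.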
Models in mobile app
• Added atom coloring using ECFP6 fingerprints
• Red and green indicate high and low probability of activity, respectively
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
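One way such atom coloring could work is sketched below. It assumes each fingerprint feature carries a signed model contribution and a list of the atoms it covers; the contributions are summed per atom and mapped to a red or green shade. The feature data, `atom_scores`, and `to_rgb` are all hypothetical, and the app's actual scheme may differ.

```python
def atom_scores(n_atoms, features):
    """Sum each feature's contribution onto every atom that feature covers."""
    totals = [0.0] * n_atoms
    for contribution, atoms in features:
        for atom in atoms:
            totals[atom] += contribution
    return totals


def to_rgb(score, max_abs):
    """Positive scores shade toward red, negative toward green, zero is white."""
    if max_abs == 0:
        return (255, 255, 255)
    strength = min(abs(score) / max_abs, 1.0)
    fade = round(255 * (1 - strength))
    return (255, fade, fade) if score > 0 else (fade, 255, fade)


# Hypothetical features: (model contribution, indices of atoms covered).
features = [(1.0, [0, 1]), (-0.5, [1, 2]), (-0.5, [2, 3])]

scores = atom_scores(4, features)
max_abs = max(abs(s) for s in scores)
colors = [to_rgb(s, max_abs) for s in scores]
print(colors)
```

Scaling by the largest absolute score keeps the strongest atom fully saturated, so colors are comparable within one molecule but not across molecules.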
ToxCast data
• Few studies use the ToxCast data for machine learning
• Recent reviews: Sipes et al., Chem Res Toxicol. 2013 Jun 17;26(6):878–895
• Liu et al., Chem Res Toxicol. 2015 Apr 20;28(4):738-51
• A set of 677 chemicals was represented by 711 in vitro bioactivity descriptors (from ToxCast assays), 4,376 chemical structure descriptors (from QikProp, OpenBabel, PaDEL, and PubChem), and three hepatotoxicity categories (from animal studies)
• Six machine learning algorithms: linear discriminant analysis (LDA), Naïve Bayes (NB), support vector machines (SVM), classification and regression trees (CART), k-nearest neighbors (KNN), and an ensemble of these classifiers (ENSMB)
• Nuclear receptor activation and mitochondrial functions were frequently found in highly predictive classifiers of hepatotoxicity
• CART, ENSMB, and SVM classifiers performed the best
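The slides do not say how the ensemble (ENSMB) combined its member classifiers. A simple majority vote is one common option and is sketched here purely as an assumption; the class labels and the per-classifier predictions are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties are broken alphabetically (arbitrary choice)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    return sorted(label for label, count in counts.items() if count == top)[0]


# Invented predictions for one chemical from five member classifiers
# (for example LDA, NB, SVM, CART, KNN); labels are placeholders.
member_predictions = ["toxic", "nontoxic", "toxic", "toxic", "nontoxic"]

print(majority_vote(member_predictions))
```

Other combination rules, such as averaging predicted probabilities or weighting members by validation performance, would fit the same description equally well.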
CDD Models for human P450s (NVS data) from ToxCast (n = 1787), <1 µM cutoff
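Applying an activity cutoff like the one above can be sketched as follows, assuming each compound has a potency value in µM (the slide does not state which potency measure was used) and that compounds with no measurable potency are treated as inactive. The compound names and values are invented, and `label_active` is a hypothetical helper.

```python
CUTOFF_UM = 1.0  # activity threshold in micromolar, taken from the slide


def label_active(potency_um, cutoff_um=CUTOFF_UM):
    """True when a potency was measured and is strictly below the cutoff."""
    return potency_um is not None and potency_um < cutoff_um


# Invented potencies in µM; None means no measurable activity was reported.
measurements = {"cpd-1": 0.2, "cpd-2": 1.0, "cpd-3": 35.0, "cpd-4": None}

labels = {name: label_active(value) for name, value in measurements.items()}
n_active = sum(labels.values())
print(f"{n_active} of {len(labels)} compounds labelled active")
```

A strict cutoff this low typically yields few actives relative to inactives, which is worth checking before training a binary classifier on the resulting labels.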