Text-mining to produce large chemistry datasets for community access
Valery Tkachenko1, Aileen Day1, Daniel Lowe2, Igor Tetko3, Carlos Coba4, Antony Williams5
1 Royal Society of Chemistry, UK
2 NextMove Software, UK
3 HelmholtzZentrum München, Germany
4 Mestrelab Research, Santiago de Compostela, Spain
5 EPA, US
ACS Fall 2015
Boston, MA
August 17th 2015
ChemSpider
Refs - we live in a linked world
Properties
ChemSpider spectra
Knowledge systems
Datastore
Raw data
“Data in” process
“Data out” process
UI, API, Services, etc
RSC Archive – since 1841
Prospecting RSC articles
Further work – properties and spectra mining
Text mining of chemical documents
Term               Examples of text matched
FromLiterature     “lit.”
MeltingPoint       “mpt”, “melting point”, “m.p.”
Qualifier          “>”; “approximately”
Value              “75° C”, “200° F”, “one hundred degrees Celsius”
Range              “184-186° C”, “191.5 to 192.4° C”
MeasurementError   “50±° C”
OutcomeQualifier   “decomp.”, “with decomposition”, “subl.”

FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
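
The grammar above could be approximated with a regular expression. The following is only a minimal sketch, not the production text-mining system: the token patterns are hypothetical, cover only the numeric examples listed (spelled-out values such as “one hundred degrees Celsius” are not handled), and all names are illustrative assumptions.

```python
import re

# Hypothetical token patterns; a real lexicon would be far larger.
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"°\s?[CF]"
FROM_LIT = r"(?P<from_lit>lit\.)"
MP = r"(?P<mp>melting point|m\.p\.|mpt)"
QUALIFIER = r"(?P<qualifier>>|approximately)"
RANGE = rf"(?P<low>{NUMBER})\s?(?:-|to)\s?(?P<high>{NUMBER})\s?{UNIT}"
ERROR = rf"(?P<mean>{NUMBER})\s?±\s?(?P<err>{NUMBER})?\s?{UNIT}"
VALUE = rf"(?P<value>{NUMBER})\s?{UNIT}"
OUTCOME = r"(?P<outcome>decomp\.|with decomposition|subl\.)"

# FromLiterature? MeltingPoint Qualifier? (Range | MeasurementError | Value) OutcomeQualifier?
# Range and MeasurementError are tried before Value so the longer forms win.
PATTERN = re.compile(
    rf"(?:{FROM_LIT}\s*)?{MP}[\s:]*(?:{QUALIFIER}\s*)?"
    rf"(?:{RANGE}|{ERROR}|{VALUE})(?:\s*\(?{OUTCOME}\)?)?",
    re.IGNORECASE,
)


def parse_melting_point(text):
    """Return the matched grammar terms as a dict, or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {name: val for name, val in match.groupdict().items() if val is not None}


print(parse_melting_point("lit. m.p. > 300° C (decomp.)"))
print(parse_melting_point("melting point 191.5 to 192.4° C"))
```

Named groups keep each grammar term addressable in the result, which makes downstream checks (for example on `low` and `high`) straightforward.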
Why MP?
Used for water solubility prediction
Yalkowsky equation:
logS = 0.5 – 0.01(MP-25) – log Kow
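
As a worked illustration, the equation can be evaluated directly. This sketch transcribes the formula exactly as written on the slide; the function and parameter names are assumptions, and MP is taken to be in °C.

```python
def log_solubility(melting_point_c, log_kow):
    """Yalkowsky equation as given above: logS = 0.5 - 0.01(MP - 25) - log Kow."""
    return 0.5 - 0.01 * (melting_point_c - 25.0) - log_kow


# Example: MP = 125 °C and log Kow = 2.0 give 0.5 - 1.0 - 2.0
print(log_solubility(125.0, 2.0))
```

Each 100 °C of melting point above 25 °C lowers the predicted logS by one unit, which is why accurate melting points matter for solubility prediction.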
Detecting suspicious melting points
• Value was greater than 500° C
• Value was a range wider than 50° C
• Value was a range where the second temperature was lower than the first temperature
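
The three checks above can be sketched as a single predicate. The function name and signature are illustrative assumptions, and applying the 500° C limit to either end of a range is an assumption of this sketch rather than something stated on the slide.

```python
def is_suspicious_melting_point(low_c, high_c=None):
    """Flag a melting point (single value or range, in °C) using the rules above.

    low_c is the single value, or the first temperature of a range;
    high_c is the second temperature of a range, or None for a single value.
    """
    if high_c is None:
        return low_c > 500
    if high_c < low_c:          # second temperature lower than the first
        return True
    if high_c - low_c > 50:     # range wider than 50 °C
        return True
    # Assumption: the 500 °C limit applies to both ends of a range.
    return low_c > 500 or high_c > 500


print(is_suspicious_melting_point(184, 186))
print(is_suspicious_melting_point(186, 184))
```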
300k Melting Point Datasets
Bergström 277
Bradley 2886
OCHEM 22404
Enamine 21883
Patents 228079
Tetko et al., J. Cheminformatics, in preparation
Melting point model: data distribution
Some modeling highlights
• LibSVM grid search was used to select parameters (ca. 1.5 years of CPU time for optimization)
• Largest model: 668k descriptors (MolPrint), ~0.2 trillion entries
• Biggest model: 618 Mb (Dragon descriptors)
• Most accurate model: consensus, average of 5 models
• RMSE < 32 °C for the drug-like region, MP [50, 250] °C
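
A consensus prediction and a region-restricted RMSE can be computed as follows. This is a minimal sketch with made-up numbers, not the actual modeling pipeline; the function names are assumptions, and selecting the region by the observed melting point is also an assumption.

```python
import math


def consensus(predictions_per_model):
    """Average per-compound predictions across models (one list per model)."""
    n_models = len(predictions_per_model)
    return [sum(column) / n_models for column in zip(*predictions_per_model)]


def rmse_in_region(predicted, observed, low=50.0, high=250.0):
    """RMSE over compounds whose observed MP lies in [low, high] °C."""
    pairs = [(p, o) for p, o in zip(predicted, observed) if low <= o <= high]
    if not pairs:
        raise ValueError("no compounds in the requested region")
    return math.sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))


# Two toy models, three compounds; the third lies outside [50, 250] °C.
predicted = consensus([[100.0, 200.0, 390.0], [110.0, 190.0, 410.0]])
print(rmse_in_region(predicted, [100.0, 200.0, 300.0]))
```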
Prediction error
NMR data
• Extract from 1976-2014 USPTO applications
H 975543
C 56536
unknown* 44306
F 9429
P 3241
B 91
Si 62
Sn 22
Se 11
N 8
*unknown – starts off with NMR: peak list (no nucleus)
NMR text mining
• We can find and index text spectra:
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
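
A text spectrum of this shape could be indexed along the following lines. This is a hedged sketch only: the regular expressions are hypothetical, assume the header order "(solvent, frequency MHz)" used in this example, and do not handle nested parentheses such as those found in many 1H annotations.

```python
import re

# Hypothetical patterns for a header like "13C NMR (CDCl3, 100 MHz)".
HEADER = re.compile(
    r"(?P<nucleus>\d{1,3}[A-Z][a-z]?)[\s-]*NMR\s*"
    r"\((?P<solvent>[^,]+),\s*(?P<freq>\d+(?:\.\d+)?)\s*MHz\)"
)
# A chemical shift with an optional parenthesised annotation.
SHIFT = re.compile(r"(?<![\w.])(\d+\.\d+)(?:\s*\(([^)]*)\))?")


def parse_text_spectrum(text):
    """Return nucleus, solvent, frequency and (shift, annotation) peaks, or None."""
    header = HEADER.search(text)
    if header is None:
        return None
    body = text[header.end():]
    peaks = [(float(shift), note or None) for shift, note in SHIFT.findall(body)]
    return {
        "nucleus": header.group("nucleus"),
        "solvent": header.group("solvent").strip(),
        "frequency_mhz": float(header.group("freq")),
        "peaks": peaks,
    }


sample = ("13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), "
          "30.11 (CH, benzylic methane), 117.72, 130.53 (ArCH)")
print(parse_text_spectrum(sample))
```

Keeping the annotation (or its absence) alongside each shift is what later quality checks on the peak list rely on.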
NMR extracted by year of publication
[Figure: cumulative distinct NMR extracted (y-axis, 0 to 2,500,000) by year of publication (x-axis, 1976 to 2014); series: USPTO grants, USPTO applications]
NMR solvents
CDCl3 48.5%
DMSO-d6 38.3%
CD3OD 8.7%
D2O 1.1%
Acetone-d6 1.0%
MeOD 1.0%
Others 1.4%
Others: CD2Cl2, CD3CN-d3, C6D6, Pyridine-d5, THF-d8, CD3Cl, dimethylformamide-d7, d1-trifluoroacetic acid, methanol-d3, acetic acid-d4, toluene-d8, sulfuric acid-d2, 1,1,2,2-tetrachloroethane-d2, CD3OCD3, dioxane-d8, 1,2-dichloroethane-d4
1H-NMR frequency over time
[Figure: 1H-NMR frequency (y-axis, 0 to 450 MHz) by year of patent filing (x-axis, 1976 to 2014)]
Mestrelab Mnova NMR
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
Detecting suspicious NMR spectra
• Last peak of the NMR spectrum is unannotated and:
– All other peaks are annotated
– Spectrum has 1 peak and is proton or unknown NMR
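
One possible reading of this rule is sketched below. The slide does not say how the two sub-conditions combine, so treating them as alternatives (either one is enough, given an unannotated last peak) is an assumption, as are the function name and the peak representation.

```python
def is_suspicious_spectrum(peaks, nucleus=None):
    """Flag a parsed text spectrum as possibly truncated.

    peaks is a list of (shift, annotation) tuples with annotation None when
    absent; nucleus is e.g. "1H" or "13C", or None when unknown.
    """
    if not peaks or peaks[-1][1] is not None:
        return False  # last peak is annotated: nothing to flag
    others = peaks[:-1]
    if others:
        # All other peaks carry annotations, so the bare last one stands out.
        return all(note is not None for _, note in others)
    # Single-peak spectrum: flagged only for proton or unknown NMR.
    return nucleus in ("1H", None)


print(is_suspicious_spectrum([(11.3, "brs, 1H"), (7.78, None)], "1H"))
print(is_suspicious_spectrum([(14.12, "CH3"), (117.72, None), (118.19, None)], "13C"))
```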
[Structure image; atom labels: O, O, OH, Br]
> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78
Comments: Only the labile proton is reported in the spectrum. The other aromatic and aliphatic protons are completely missing in the spectrum.
[Structure image; atom labels: H2N, NH2, O, O]
> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, CDCl3): 6.85 (1H, d, J=7.8 Hz), 6.10 (1H, dd, J=7.8 and 2.2 Hz), 6.06 (1H, d, J=2.2 Hz), 4.66 (1H, m), 3.75 (4H, br s), 3.40 (2H, s), 1.97
Comments: There are only 11 protons reported in the spectrum whilst the molecule contains more than 50 protons.
Knowledge systems
Datastore
Raw data
“Data in” process
“Data out” process
UI, API, Services, etc
Synthetic chemistry article
• Compounds
• Reaction
• Analytical Data
• Text and References
RSC Databases
• RSC Compounds
• RSC Reactions
• RSC Spectra
• RSC Crystals
• RSC Polymers
• RSC Materials
• RSC Assays
• RSC Algorithms
• RSC Models
• …and on…
Input pipeline
Deposition Gateway
Staging databases
Compounds Reactions Spectra Crystals
Materials
Compounds Module
Spectra Module
Reactions Module
Materials Module
Text-mining Module
… Module
Web UI for unified depositions
DropBox, Google Drive, SkyDrive, etc
ELNs, templated data input
Documents
API, FTP, etc
Raw data
Validated data
Staging databases
All databases are sliced by data source/data collection and have a simple security model in which each data slice/source is private, public or embargoed
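
Such a per-slice security model might be represented as below. This is purely an illustrative sketch: the class, field and function names are hypothetical, and the embargo semantics (readable by everyone once a release date has passed) are an assumption not stated on the slide.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass
class DataSlice:
    """One data source/collection within a database (hypothetical model)."""
    name: str
    owner: str
    visibility: Visibility
    embargo_until: Optional[date] = None


def can_read(data_slice, user, today):
    """Owners always read; others read public or expired-embargo slices."""
    if user == data_slice.owner:
        return True
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if data_slice.visibility is Visibility.EMBARGOED:
        # Assumption: an embargo lifts on its release date.
        return data_slice.embargo_until is not None and today >= data_slice.embargo_until
    return False


patents = DataSlice("patent spectra", "alice", Visibility.EMBARGOED, date(2016, 1, 1))
print(can_read(patents, "bob", date(2015, 8, 17)))
```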
Etc
Experiments
Research
Output pipeline
Compounds Reactions Spectra Crystals Documents
Compounds API
Reactions API
Spectra API
Crystals API
Documents API
Compounds Widgets
Reactions Widgets
Spectra Widgets
Crystals Widgets
Documents Widgets
Data layer
Data access layer
User interface widgets layer
Analytical Laboratory application
User interface layer (examples)
Electronic Laboratory Notebook
Paid 3rd party integrations (various platforms – SharePoint, Google, etc)
Chemical Inventory application
ChemSpider 2.0
Cross-database links
Compounds domain
Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system
Chemistry Validation and Standardization Platform
Reactions domain
Analytical data domain
Crystallography domain
3D printable structures
New Repository Architecture
doi: 10.1007/s10822-014-9784-5