Text-mining to produce large chemistry datasets for community access
Valery Tkachenko 1, Aileen Day 1, Daniel Lowe 2, Igor Tetko 3, Carlos Coba 4, Antony Williams 5
1 Royal Society of Chemistry, UK
2 NextMove Software, UK
3 HelmholtzZentrum München, Germany
4 Mestrelab Research, Santiago de Compostela, Spain
5 EPA, US
ACS Fall 2015, Boston, MA, August 17th 2015
ChemSpider
Refs - we live in a linked world
Properties
ChemSpider spectra
Knowledge systems
Datastore
Raw data
"Data in" process
"Data out" process
UI, API, Services, etc.
RSC Archive – since 1841
Prospecting RSC articles
Further work – properties and spectra mining
Text mining of the chemical documents
Term: FromLiterature
Examples of text matched: “lit.”
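A minimal sketch of how a term such as FromLiterature might be matched against text, assuming a plain regular expression. The pattern, the function name `is_from_literature` and the sample strings are illustrative assumptions; the slide does not show the matching rules actually used.

```python
import re

# Hypothetical pattern: "lit." as a standalone abbreviation, case-insensitive.
# The lookbehind stops it from firing inside longer words such as "split.".
LIT_PATTERN = re.compile(r"(?<![A-Za-z])lit\.", re.IGNORECASE)


def is_from_literature(text: str) -> bool:
    """Return True if the text contains a "lit." marker."""
    return LIT_PATTERN.search(text) is not None
```

A melting point written as "mp 120-122 °C (lit. 121 °C)" would be tagged under this sketch, while a sentence that merely ends in "split." would not.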
• Last peak of NMR spectrum is unannotated and:
  – All other peaks are annotated
  – Spectrum has 1 peak and is proton or unknown NMR
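The rule above can be sketched roughly as follows. This is one possible reading, not the project's implementation: it assumes peaks are separated by top-level commas, that an annotated peak carries a parenthesised note such as "(brs, 1H)", and that the single-peak case is flagged only for proton or unknown spectra. The names `split_peaks`, `is_annotated` and `is_suspicious` are hypothetical.

```python
def split_peaks(peak_text: str) -> list[str]:
    """Split a peak list on commas that are not inside parentheses."""
    peaks, current, depth = [], [], 0
    for ch in peak_text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            peaks.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    peaks.append("".join(current).strip())
    return [p for p in peaks if p]


def is_annotated(peak: str) -> bool:
    """Assumption: an annotated peak has a parenthesised note."""
    return "(" in peak


def is_suspicious(peak_text: str, nucleus: str = "unknown") -> bool:
    """Flag a peak list whose last peak is unannotated (likely truncated)."""
    peaks = split_peaks(peak_text)
    if not peaks or is_annotated(peaks[-1]):
        return False
    if len(peaks) == 1:
        # Single bare peak: only flagged for proton or unknown NMR.
        return nucleus in ("1H", "unknown")
    return all(is_annotated(p) for p in peaks[:-1])
```

Under these assumptions a list such as "11.8-10.8 (brs, 1H), 7.78" is flagged, because every peak but the trailing one carries an annotation.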
[Chemical structure image; atom labels: O, O, OH, Br]
> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78
Comments: Only the labile proton is reported in the spectrum. The other aromatic and aliphatic protons are completely missing in the spectrum.
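The comment above amounts to comparing the protons reported in the peak list with the number the structure would lead one to expect. A rough sketch of that comparison, assuming integrals are written as "1H", "2H" and so on inside parentheses; the function names and regular expressions are illustrative, and the expected proton count is supplied by the caller rather than derived from the structure.

```python
import re

# Parenthesised annotations such as "(brs, 1H)" or "(400 MHz, d6-Acetone)".
ANNOTATION = re.compile(r"\(([^()]*)\)")
# An integral like "1H" or "4H"; not preceded by a letter, digit or dot,
# so "7.8 Hz" and "d6" do not match.
INTEGRAL = re.compile(r"(?<![\w.])(\d+)\s*H\b")


def reported_protons(peak_text: str) -> int:
    """Sum the proton integrals found inside peak annotations."""
    total = 0
    for annotation in ANNOTATION.findall(peak_text):
        for count in INTEGRAL.findall(annotation):
            total += int(count)
    return total


def missing_protons(peak_text: str, expected_protons: int) -> int:
    """Protons expected from the structure but absent from the peak list."""
    return max(expected_protons - reported_protons(peak_text), 0)
```

For the value shown above, `reported_protons` finds a single proton, which matches the remark that only the labile proton is reported.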
[Chemical structure image; atom labels: H2N, NH2, O, O]
> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, CDCl3): 6.85 (1H, d, J=7.8 Hz), 6.10 (1H, dd, J=7.8 and 2.2 Hz), 6.06 (1H, d, J=2.2 Hz), 4.66 (1H, m), 3.75 (4H, br s), 3.40 (2H, s), 1.97
Comments: There are only 11 protons reported in the spectrum whilst the molecule contains more than 50 protons.
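The `> <SuspiciousValue>` and `> <Value>` tags in these records look like SD-file data items, where a header line `> <Name>` is followed by its value lines and closed by a blank line. The slides do not state the file format, so the sketch below simply assumes that layout; `parse_data_items` is a hypothetical helper, not part of any toolkit mentioned in the talk.

```python
import re

# Header line of a data item, e.g. "> <Value>".
HEADER = re.compile(r"^>\s*<([^>]+)>")


def parse_data_items(block: str) -> dict[str, str]:
    """Collect the data items of one record into a name -> value mapping."""
    items: dict[str, str] = {}
    name = None
    for line in block.splitlines():
        header = HEADER.match(line)
        if header:
            name = header.group(1)
            items[name] = ""
        elif line.strip() == "" or line.startswith("$$$$"):
            # A blank line ends the current item; "$$$$" ends the record.
            name = None
        elif name is not None:
            # Multi-line values are joined with newlines.
            items[name] = f"{items[name]}\n{line}" if items[name] else line
    return items
```

A record parsed this way could then be filtered on `items.get("SuspiciousValue") == "true"` before the value is passed on for review.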
Knowledge systems
Datastore
Raw data
"Data in" process
"Data out" process
UI, API, Services, etc.
Synthetic chemistry article: Compounds, Reaction, Analytical Data, Text and References