From Open text mining solutions to Open Data resources
Post on 27-Aug-2014
Jean-Claude Bradley Memorial Symposium, University of Cambridge, 14th July 2014
From Open text mining solutions to Open Data resources
Daniel Lowe
NextMove Software
Cambridge, UK
The idea
Accessible text (e.g. US patents) → Reaction Extraction System → Open Reaction Data resource
Building on existing projects
What is chemical name to structure?
(2S)-2-aminobutan-1-ol
(2S)-: stereochemistry
2-: locant
amino: substituent
but: alk
an: unsaturation
-1-: locant
ol: suffix
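The decomposition on this slide can be illustrated with a toy tokenizer. The sketch below reads the example name as (2S)-2-aminobutan-1-ol and splits it by trying a tiny hand-written lexicon at each position. The class name, the lexicon and the output format are all illustrative assumptions; this is not how OPSIN itself is implemented, and it only copes with names built from these few tokens.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy name tokenizer: NOT OPSIN's grammar, just a hand-written lexicon for one example.
class NameTokenSketch {
    // Hypothetical lexicon: token pattern -> label used on the slide. Order matters:
    // earlier entries are tried first at each position.
    private static final Map<Pattern, String> LEXICON = new LinkedHashMap<>();
    static {
        LEXICON.put(Pattern.compile("\\(\\d+[RS]\\)-"), "stereochemistry");
        LEXICON.put(Pattern.compile("-?\\d+-"), "locant");
        LEXICON.put(Pattern.compile("amino"), "substituent");
        LEXICON.put(Pattern.compile("but"), "alk");
        LEXICON.put(Pattern.compile("an"), "unsaturation");
        LEXICON.put(Pattern.compile("ol"), "suffix");
    }

    // Walks the name left to right, consuming one lexicon token at a time.
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < name.length()) {
            boolean matched = false;
            for (Map.Entry<Pattern, String> entry : LEXICON.entrySet()) {
                Matcher m = entry.getKey().matcher(name);
                m.region(pos, name.length());
                if (m.lookingAt()) {
                    tokens.add(m.group() + " = " + entry.getValue());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("Unrecognised text at position " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("(2S)-2-aminobutan-1-ol").forEach(System.out::println);
    }
}
```

A real parser also has to resolve ambiguity (for example "an" inside a longer stem), which is why a grammar-driven approach is needed beyond toy cases like this one.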
Supported chain nomenclature
Alkanes
dodectetractkiliane
Heteroatom hydrides
pentaphosphane
Heterogeneous heteroatom hydrides
disilazane
Trivial acids
butyric acid
Supported ring nomenclature
Monocyclic spiro
dispiro[4.2.4.2]tetradecane
Hantzsch-Widman
1,3,5-triazine
Fused ring
furo[3,2-b]thieno[2,3-e]pyridine
Ring assembly
2,2':6',2''-terpyridyl
Von Baeyer
tricyclo[2.2.1.1²,⁵]octane
Polycyclic spiro
spiro[piperidine-4,9'-xanthene]
Structural assembly nomenclature
Conjunctive nomenclature
benzeneethanol
Substitutive nomenclature
2,4,6-trinitrotoluene
Additive nomenclature
methylsulfonyl
Multiplicative nomenclature
4,4'-methylenedioxydibenzoic acid
Functional class nomenclature
ethyl alcohol
Structural modifications
Heteroatom replacement
1-thia-4-aza-2,6-disilacyclohexane
Unsaturation
hexa-1,3-dien-5-yne
Hydro, dehydro, indicated hydrogen and added hydrogen
2,7-dihydro-1H-azepine
Functional replacement
1-chloro-2,4-diimidotricarbonic acid
Suffixes including infixed suffixes
methanedithioic acid
Lambda convention
2λ⁶-trisulfane
Bridges and stereochemistry
Bridges
4a,8a-propanoquinoline
E/Z stereochemistry
(Z)-2-chloro-but-2-ene
Relative cis/trans stereochemistry
trans-2,6-dimethyl-2,6-dihydronaphthalene
R/S stereochemistry
(1R,3S)-3-amino-3-methylcyclohexanol
Miscellaneous nomenclature
Groups with indeterminately positioned structural features
1,3-xylene
Charge and oxidation numbers
methylmercury(1+) or methylmercury(II)
“per-nomenclature”
perhydroanthracene
perchlorobenzene
Subtractive nomenclature
2-deoxy-ᴅ-ribose
Polymer nomenclature
Structure-based polymer nomenclature
poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]
Domain specific nomenclature
Steroid nomenclature
17β-Hydroxy-8α,9β,10α-androst-4-en-3-one
Amino acid
ʟ-leucinamide
Oligopeptide
ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline
Cyclic peptide
cyclo(ᴅ-alanyl-ʟ-phenylalanyl)
Nucleotide nomenclature
guanylyl(3'-5')uridine 3'-monophosphate
Carbohydrates
ʟ-ribo-ᴅ-manno-nonose
2,7-anhydro-ᴅ-glycero-β-ᴅ-galacto-oct-2-ulopyranosonic acid
β-ᴅ-Fructofuranosyl α-ᴅ-glucopyranoside
β-ᴅ-glucopyranosyl-(1→3)-β-ᴅ-glucopyranosyl-(1→3)-ᴅ-glucopyranose
Usage
Batch conversion on the command line
java -jar opsin-1.6.0-jar-with-dependencies.jar -osmi input.txt output.smi
RESTful web service (opsin.ch.cam.ac.uk)
Java API
NameToStructure nts = NameToStructure.getInstance();
String chemicalName = "acetonitrile";
String smiles = nts.parseToSmiles(chemicalName);
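For the web service route, a client mainly needs to build a correctly escaped request URL. The sketch below does only that, using the standard library. The "/opsin/<name>.smi" path layout and the class and method names are assumptions made for illustration; the slide gives only the host name, so check the service's own documentation before relying on this.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds request URLs for the OPSIN web service. The "/opsin/<name>.smi" path
// layout is an ASSUMPTION; the slide only names the host opsin.ch.cam.ac.uk.
class OpsinUrlSketch {
    private static final String BASE = "https://opsin.ch.cam.ac.uk/opsin/"; // assumed base path

    static String smilesUrl(String chemicalName) {
        // URLEncoder produces form encoding, so spaces become '+'. Convert them to %20
        // for use in a URL path. A literal '+' in the name has already become %2B.
        String encoded = URLEncoder.encode(chemicalName, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return BASE + encoded + ".smi";
    }

    public static void main(String[] args) {
        System.out.println(smilesUrl("acetonitrile"));
        System.out.println(smilesUrl("ethyl alcohol"));
    }
}
```

Escaping matters here because chemical names routinely contain spaces, brackets, primes and plus signs.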
Who is using OPSIN?
Commercial software
Cinfony (interface to Python)
Many text mining efforts
Workflows
Web services
Steps involved
• Identifying experimental sections
• Identifying chemical entities
• Chemical name to structure conversion (including anaphora resolution)
• Associating chemical entities with quantities
• Assigning chemical roles
• Atom-atom mapping
Example
Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate
To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%).
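The step of associating chemical entities with quantities can be sketched on this paragraph with two regular expressions: one finds a bracketed group that follows a word, the other pulls value and unit pairs out of it. This is a deliberately naive illustration, not the extraction system's method. The class name, the unit list and the choice of "the word directly before the bracket" as the key are assumptions, and the last one shows why proper entity recognition is needed, since it sees only the final word of a multi-word name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive quantity association: a sketch only, not the real extraction pipeline.
class QuantitySketch {
    // A word, a space, then a bracketed group containing a digit and no nested brackets.
    private static final Pattern GROUP = Pattern.compile("(\\S+) \\(([^()]*\\d[^()]*)\\)");
    // A number followed by one of the units seen in the example paragraph (assumed list).
    private static final Pattern QUANTITY =
            Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(mmol|mg|ml|eq|%)");

    // Returns preceding word -> list of "value unit" strings, in order of appearance.
    static Map<String, List<String>> extract(String text) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher group = GROUP.matcher(text);
        while (group.find()) {
            List<String> quantities = new ArrayList<>();
            Matcher q = QUANTITY.matcher(group.group(2));
            while (q.find()) {
                quantities.add(q.group(1) + " " + q.group(2));
            }
            if (!quantities.isEmpty()) {
                // Only the word directly before the bracket is used as the key.
                result.computeIfAbsent(group.group(1), k -> new ArrayList<>())
                        .addAll(quantities);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) "
                + "in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq)";
        extract(text).forEach((entity, quantities) ->
                System.out.println(entity + " -> " + quantities));
    }
}
```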
Graphical Output
CML output
<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
<dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
<productList>
<product role="product">
<molecule id="m0">
<name dictRef="nameDict:unknown">title compound</name>
</molecule>
<amount units="unit:mmol">1.8</amount>
<amount units="unit:mg">690</amount>
<amount units="unit:percentYield">85.0</amount>
<identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
<identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
<dl:entityType>definiteReference</dl:entityType>
<dl:state>solid</dl:state>
</product>
</productList>
<reactantList>
<reactant role="reactant" count="1">
<molecule id="m1">
<name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
</molecule>
<amount units="unit:mmol">2.1</amount>
<amount units="unit:mg">606</amount>
<amount units="unit:eq">1.0</amount>
<identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
Quantities including yield are extracted
Entity is classified as an exact compound, definite reference, chemical class or fragment
Reaction SMILES
SMILES and InChIs for every structure resolvable reagent/product
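Because the output is XML, the extracted quantities can be read back with any standard parser. The sketch below uses the JDK's DOM parser on a trimmed copy of the reactant shown above. The fragment has been cut down and closed by hand so that it is well formed (the slide's listing is truncated), and the class and method names are illustrative assumptions rather than part of any CML toolkit.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Reads <amount units="...">value</amount> elements from a CML fragment.
class CmlAmountSketch {
    // Returns units attribute -> numeric value, in document order.
    static Map<String, Double> amounts(String cml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(cml.getBytes(StandardCharsets.UTF_8)));
            Map<String, Double> result = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("amount");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element amount = (Element) nodes.item(i);
                result.put(amount.getAttribute("units"),
                        Double.parseDouble(amount.getTextContent().trim()));
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse CML fragment", e);
        }
    }

    public static void main(String[] args) {
        // Hand-closed fragment based on the reactant in the slide's listing.
        String cml = "<reactant xmlns=\"http://www.xml-cml.org/schema\" role=\"reactant\" count=\"1\">"
                + "<molecule id=\"m1\"><name dictRef=\"nameDict:unknown\">"
                + "methyl 4-(chlorosulfonyl)benzoate</name></molecule>"
                + "<amount units=\"unit:mmol\">2.1</amount>"
                + "<amount units=\"unit:mg\">606</amount>"
                + "<amount units=\"unit:eq\">1.0</amount>"
                + "</reactant>";
        System.out.println(amounts(cml));
    }
}
```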
Current status
• ~1 million reactions from US patent applications (2001-2013)
• ~1 million reactions from US patent grants (1976-2013)
• More than a million constitutionally distinct reactions
Current status
https://bitbucket.org/dan2097/patent-reaction-extraction/downloads
Identify Synthetic Routes
Chart: occurrences (y-axis) by number of steps (x-axis), for Intermediates and Terminal Products
Number of steps   Intermediates   Terminal Products
1                 197702          385149
2                 103114          149445
3                 56611           81837
4                 31403           47579
5                 17268           27670
6                 9230            16619
7                 5057            9320
8                 2701            5263
9                 1256            2511
10                639             1330
11                301             678
12                136             373
13                58              111
14                15              63
15                5               8
16                2               6
17                -               5
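One plausible way to arrive at counts like these is to link reactions whose product reappears as a reactant elsewhere, then measure how many steps lead to each product. The sketch below does this on made-up compound labels. The slide does not define "number of steps", so the definition used here (longest chain of extracted reactions ending at the product), the one-reaction-per-product simplification and all names are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of route linking: product -> reactants, one reaction per product, no cycles.
class RouteDepthSketch {
    // Steps needed to reach a compound: 0 for a starting material (no reaction
    // produces it), otherwise 1 + the largest step count among its reactants.
    static int steps(String compound, Map<String, List<String>> reactions,
                     Map<String, Integer> memo) {
        if (!reactions.containsKey(compound)) {
            return 0;
        }
        Integer cached = memo.get(compound);
        if (cached != null) {
            return cached;
        }
        int longest = 0;
        for (String reactant : reactions.get(compound)) {
            longest = Math.max(longest, steps(reactant, reactions, memo));
        }
        memo.put(compound, longest + 1);
        return longest + 1;
    }

    // A product is an intermediate if some other reaction consumes it.
    static boolean isIntermediate(String compound, Map<String, List<String>> reactions) {
        for (List<String> reactants : reactions.values()) {
            if (reactants.contains(compound)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up labels: A, X, Y are starting materials; B and C feed later steps.
        Map<String, List<String>> reactions = new HashMap<>();
        reactions.put("B", List.of("A", "X"));
        reactions.put("C", List.of("B", "Y"));
        reactions.put("D", List.of("C"));
        Map<String, Integer> memo = new HashMap<>();
        for (String product : List.of("B", "C", "D")) {
            String kind = isIntermediate(product, reactions) ? "intermediate" : "terminal product";
            System.out.println(product + ": step " + steps(product, reactions, memo) + ", " + kind);
        }
    }
}
```

In practice compounds would be matched by a canonical identifier such as the InChI recorded for each structure, rather than by label.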
Trends in Reaction Types
Chart: Suzuki couplings as a percentage of reactions in a year (y-axis, 0.0% to 8.0%) by year (x-axis, 1976 to 2013)
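A trend like this reduces to a per-year ratio: reactions of the chosen type divided by all reactions in that year. The sketch below computes it for a handful of invented records. The Reaction class, its fields and the type labels are hypothetical, and the demo data are made up purely to exercise the code; they are not taken from the extracted dataset.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Per-year share of one reaction type. Records and labels here are hypothetical.
class ReactionTrendSketch {
    static final class Reaction {
        final int year;
        final String type;

        Reaction(int year, String type) {
            this.year = year;
            this.type = type;
        }
    }

    // Returns year -> percentage of that year's reactions whose type matches.
    static Map<Integer, Double> percentageByYear(List<Reaction> reactions, String type) {
        Map<Integer, int[]> counts = new TreeMap<>(); // year -> {matching, total}
        for (Reaction r : reactions) {
            int[] c = counts.computeIfAbsent(r.year, y -> new int[2]);
            if (r.type.equals(type)) {
                c[0]++;
            }
            c[1]++;
        }
        Map<Integer, Double> result = new TreeMap<>();
        for (Map.Entry<Integer, int[]> e : counts.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue()[0] / e.getValue()[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented demo data, not real counts.
        List<Reaction> demo = List.of(
                new Reaction(2012, "Suzuki coupling"), new Reaction(2012, "Amide formation"),
                new Reaction(2013, "Suzuki coupling"), new Reaction(2013, "Amide formation"),
                new Reaction(2013, "Amide formation"), new Reaction(2013, "Amide formation"));
        System.out.println(percentageByYear(demo, "Suzuki coupling"));
    }
}
```

Normalising by the yearly total matters because the number of extracted reactions differs widely from year to year.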
Trends In Solvent Use
Chart: percentage of reactions in that year (y-axis, 0.0% to 20.0%) by year (x-axis, 1976 to 2013), one series per solvent:
Tetrahydrofuran, Dichloromethane, Water, Dimethylformamide, Methanol, Ethyl acetate, Ethanol, 1,4-Dioxane, Toluene, Acetonitrile, Acetic acid, Chloroform, Acetone, Benzene
Are solvents getting greener?
1976                       2013
Water (21%)                Tetrahydrofuran (15%)
Ethanol (11%)              Dichloromethane (14%)
Benzene (8%)               Water (13%)
Methanol (7%)              Dimethylformamide (10%)
Tetrahydrofuran (5%)       Methanol (8%)
Dichloromethane (4%)       Ethyl acetate (7%)
Dimethylformamide (4%)     Ethanol (5%)
Acetic acid (4%)           1,4-Dioxane (4%)
Chloroform (3%)            Toluene (3%)
Acetone (3%)               Acetonitrile (3%)
Total for top 10: 71%      Total for top 10: 82%
Conclusions
Open Source tools facilitate reuse and remixing of code
Open Data allows reuse in an infinite number of potential applications and analyses
Acknowledgements
• Albina Asadulina
• Peter Corbett
• Robert Glen
• David Jessop
• Lezan Hawizy
• Peter Murray-Rust
• Roger Sayle
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com