Automated Extraction of Reactions from the Patent Literature
Daniel Lowe
Unilever Centre for Molecular Science Informatics
University of Cambridge
May 10, 2015
Chemistry patent applications
[Figure: chemistry patent applications per year, 2000 to 2009 (y-axis: 0 to 400,000)]
World Intellectual Property Indicators, 2011 edition
• Hundreds of thousands of applications each year
The idea
XML patents → Reaction Extraction System → Extracted Reactions
Steps involved
• Identifying experimental sections
• Identifying chemical entities
• Chemical name to structure conversion
• Associating chemical entities with quantities
• Assigning chemical roles
• Atom-atom mapping
Archetypal experimental section
[Figure: annotated experimental section, with labels: paragraph number, section heading, section target compound, step identifier, step target compound, synthesis, workup, characterisation]
Jessop, D. M.; Adams, S. E.; Murray-Rust, P. Mining Chemical Information from Open Patents. Journal of Cheminformatics 2011, 3, 40.
ChemicalTagger
• Tags words of text
• Parses tags to identify phrases
• Generates XML parse tree
  – http://chemicaltagger.ch.cam.ac.uk/
  – Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic text-mining in chemistry. Journal of Cheminformatics 2011, 3, 17.
Tagging
• Regex tagger: tags keywords, e.g. “yield”, “mL”
• OSCAR4 tagger: finds names OSCAR4 believes to be chemical, e.g. “2-methylpyridine”
• OpenNLP: tags parts of speech
Additional taggers:
• OPSIN tagger: finds names OPSIN can parse
• Trivial chemical name tagger: tags a few chemicals missed by the other taggers, and cases that are only partially matched by the regex tagger, e.g. Dess-Martin reagent
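A keyword tagger of this kind can be sketched in a few lines of Python. This is only an illustration, not ChemicalTagger's rule set: the pattern table is hypothetical, the labels `CD`, `NN-MASS`, `NN-AMOUNT` and `NN-EQ` are borrowed from the sample output in this deck, and `NN-VOL` and `NN-YIELD` are assumed names.

```python
import re

# Hypothetical keyword patterns; a real tagger carries far more rules.
TOKEN_PATTERNS = [
    ("CD", re.compile(r"^\d+(\.\d+)?$")),               # cardinal number
    ("NN-MASS", re.compile(r"^(mg|g|kg)$", re.I)),      # mass units
    ("NN-AMOUNT", re.compile(r"^(mmol|mol)$", re.I)),   # amount units
    ("NN-VOL", re.compile(r"^(ml|l)$", re.I)),          # volume units (assumed label)
    ("NN-EQ", re.compile(r"^(eq|equiv)\.?$", re.I)),    # equivalents
    ("NN-YIELD", re.compile(r"^yield(s|ed)?$", re.I)),  # yield keyword (assumed label)
]


def regex_tag(tokens):
    """Return (token, tag) pairs; tag is None when no pattern matches."""
    tagged = []
    for token in tokens:
        tag = next((name for name, pat in TOKEN_PATTERNS if pat.match(token)), None)
        tagged.append((token, tag))
    return tagged


tokens = "DCM ( 35 ml ) 606 mg 2.1 mmol 1 eq".split()
result = regex_tag(tokens)
```

Tokens left untagged here (such as "DCM") are the ones the chemical name taggers and the part-of-speech tagger would be expected to pick up.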
Sample ChemicalTagger Output
<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS>
      <CD>606</CD>
      <NN-MASS>mg</NN-MASS>
    </MASS>
    <COMMA>,</COMMA>
    <AMOUNT>
      <CD>2.1</CD>
      <NN-AMOUNT>mmol</NN-AMOUNT>
    </AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT>
      <CD>1</CD>
      <NN-EQ>eq</NN-EQ>
    </EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>
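A parse tree like this one can be turned into a name plus its quantities with the Python standard library. The sketch below is an assumption about how a consumer might read the output, not the extraction system's own code; the sample is trimmed (bracket and comma elements omitted) and `read_molecule` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample output: bracket and comma tokens left out.
SAMPLE = """<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
    <AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
    <EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
  </QUANTITY>
</MOLECULE>"""


def read_molecule(xml_text):
    """Return (name, {quantity kind: (value, unit)}) for one MOLECULE element."""
    root = ET.fromstring(xml_text)
    # The chemical name is spread over several OSCAR-CM tokens.
    name = " ".join(token.text for token in root.iter("OSCAR-CM"))
    quantities = {}
    for kind in ("MASS", "AMOUNT", "EQUIVALENT"):
        node = root.find(f"QUANTITY/{kind}")
        if node is not None:
            value = float(node.findtext("CD"))
            unit = node[1].text  # second child holds the unit token
            quantities[kind] = (value, unit)
    return name, quantities


name, quantities = read_molecule(SAMPLE)
```

Because the quantity phrase sits inside the same MOLECULE element as the name, associating a chemical entity with its quantities reduces to walking one subtree.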
Pyridine, pyridines and pyridine rings
Entity                                  Type
Pyridine                                Exact
The pyridine / Pyridine from step 1     DefiniteReference
Pyridines / A pyridine                  ChemicalClass
Pyridine ring / Pyridyl                 Fragment
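The distinction between the four types can be illustrated with a toy classifier. The rules below are assumptions chosen only to reproduce the pyridine examples on this slide; they are not the rules used by the extraction system, and simple suffix checks like these would misfire on many real chemical names.

```python
def classify_entity(mention):
    """Toy heuristic (hypothetical rules) assigning one of the four entity types."""
    text = mention.strip().lower()
    # Substituent-style names and explicit "ring" mentions describe fragments.
    if text.endswith(" ring") or text.endswith("yl"):
        return "Fragment"
    # A definite article or a back-reference points at a specific earlier compound.
    if text.startswith("the ") or " from step " in text:
        return "DefiniteReference"
    # Indefinite articles and plurals describe a class of compounds.
    if text.startswith(("a ", "an ")) or text.endswith("s"):
        return "ChemicalClass"
    return "Exact"


examples = [
    "Pyridine", "The pyridine", "Pyridine from step 1",
    "Pyridines", "A pyridine", "Pyridine ring", "Pyridyl",
]
labels = {mention: classify_entity(mention) for mention in examples}
```

Only mentions classified as exact compounds can be sent to name-to-structure conversion with the expectation of a single, fully specified structure.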
Section/Step Parsing
Workup phrase types: Concentrate, Degass, Dry, Extract, Filter, Partition, Precipitate, Purify, Recover, Remove, Wash, Quench
Example
Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate
To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%).
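The quantities in this paragraph can be cross-checked with simple arithmetic. The sketch below assumes that the reagent marked 1 eq (2.1 mmol) is the limiting reagent; the helper names are hypothetical.

```python
def percent_yield(product_mmol, limiting_mmol):
    """Yield as a percentage of the limiting reagent's molar amount."""
    return 100.0 * product_mmol / limiting_mmol


def equivalents(reagent_mmol, limiting_mmol):
    """Molar equivalents relative to the limiting reagent."""
    return reagent_mmol / limiting_mmol


limiting = 2.1  # mmol of methyl 4-(chlorosulfonyl)benzoate, stated as 1 eq

yield_pct = percent_yield(1.8, limiting)  # title compound: 1.8 mmol
phenol_eq = equivalents(2.2, limiting)    # pentafluorophenol: 2.2 mmol
base_eq = equivalents(5.4, limiting)      # Et3N: 5.4 mmol
```

The computed values (about 85.7%, 1.05 eq and 2.57 eq) are close to the 85%, 1.1 eq and 2.5 eq given in the text, which is the sort of consistency check extracted quantities make possible.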
CML output
<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
  <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
  <productList>
    <product role="product">
      <molecule id="m0">
        <name dictRef="nameDict:unknown">title compound</name>
      </molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
      <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
      <dl:entityType>definiteReference</dl:entityType>
      <dl:state>solid</dl:state>
    </product>
  </productList>
  <reactantList>
    <reactant role="reactant" count="1">
      <molecule id="m1">
        <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
      </molecule>
      <amount units="unit:mmol">2.1</amount>
      <amount units="unit:mg">606</amount>
      <amount units="unit:eq">1.0</amount>
      <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
Quantities including yield are extracted
Entity is classified as an exact compound, definite reference, chemical class or polymer
Reaction SMILES
SMILES and InChIs for every structure resolvable reagent/product
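Reading such output back is straightforward with a namespace-aware XML parser. The sketch below works on a hand-trimmed fragment of the product entry shown on this slide (the truncated attributes and the `dl:` elements are left out), and `read_products` is a hypothetical helper rather than part of the released code.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-trimmed fragment based on the slide; not the complete output.
FRAGMENT = """<reaction xmlns="http://www.xml-cml.org/schema">
  <productList>
    <product role="product">
      <molecule id="m0"><name>title compound</name></molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
    </product>
  </productList>
</reaction>"""


def read_products(cml_text):
    """Collect name, amounts keyed by unit, and SMILES for each product."""
    root = ET.fromstring(cml_text)
    products = []
    for product in root.findall("cml:productList/cml:product", CML_NS):
        amounts = {
            amount.get("units").split(":", 1)[1]: float(amount.text)
            for amount in product.findall("cml:amount", CML_NS)
        }
        smiles = next(
            (ident.get("value")
             for ident in product.findall("cml:identifier", CML_NS)
             if ident.get("dictRef") == "cml:smiles"),
            None,
        )
        name = product.findtext("cml:molecule/cml:name", namespaces=CML_NS)
        products.append({"name": name, "amounts": amounts, "smiles": smiles})
    return products


products = read_products(FRAGMENT)
```

The same traversal applied to `reactantList` would give the reagents with their masses, molar amounts and equivalents.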
Evaluation
• 2008-2011 USPTO patent applications classified as containing organic chemistry: 65,034 documents
• 484,259 atom-mapped reactions extracted
• Additionally requiring that all identified product molecules were resolvable to structures, and that all reagents were believed to describe exact compounds, left 424,621 reactions
• 100 of these were selected for manual evaluation of quality
Reactions found
[Figure: histogram of patents by number of extracted reactions; x-axis: number of extracted reactions (0 to 1000); y-axis: patents with given number of reactions (log scale, 1 to 100,000)]
Results
• In 96% of cases the primary starting material and product were correctly identified, without misidentifying reagents that could be confused with the starting material
• Against the 495 expected chemical entities there were 61 false positives and 16 false negatives
• Only 4 of the 321 reagents with quantities did not have those quantities recognised and associated with the reagent
• Association of quantities/yields with products was less successful: 48 of the 74 cases where such data were present were handled
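These counts can be restated as rates. The sketch below assumes that "expected" means the gold-standard entity count, so that true positives equal expected minus false negatives; the resulting precision, recall and F1 figures are derived here and are not stated on the slide.

```python
def precision_recall(expected, false_positives, false_negatives):
    """Derive precision, recall and F1 from gold-standard and error counts."""
    true_positives = expected - false_negatives  # assumption: expected = gold standard
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / expected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Chemical entity recognition: 495 expected, 61 false positives, 16 false negatives.
precision, recall, f1 = precision_recall(495, 61, 16)

# Quantity association rates taken directly from the stated counts.
reagent_quantity_rate = (321 - 4) / 321
product_quantity_rate = 48 / 74
```

Under that assumption precision is roughly 0.89 and recall roughly 0.97, while quantity association succeeds for about 99% of reagents but only about 65% of products.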
Use Cases
• Reaction searching
• Analysing trends in reactions over time
• Reaction outcome prediction
Example of reaction searching
C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)
6 reactions found in 5 patents
[Figure: drawing of the query reaction, an alkene (mapped atoms 1 and 2) plus diiodomethane giving a cyclopropane]
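The structure of the query string can be inspected without a cheminformatics toolkit. The sketch below only splits a reaction SMILES into its reactant, agent and product parts and lists the atom-map numbers; it does not perform the substructure matching that the search itself requires, and both helper names are hypothetical.

```python
import re


def split_reaction_smiles(rxn):
    """Split 'reactants>agents>products' into three lists of molecule strings."""
    reactants, agents, products = rxn.split(">")

    def molecules(part):
        return [smiles for smiles in part.split(".") if smiles]

    return molecules(reactants), molecules(agents), molecules(products)


def atom_maps(smiles):
    """Return the set of atom-map numbers, e.g. the 1 in [CH:1]."""
    return {int(number) for number in re.findall(r":(\d+)\]", smiles)}


query = "C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)"
reactants, agents, products = split_reaction_smiles(query)

# Atoms carrying the same map number on both sides are the ones being tracked.
tracked = atom_maps(".".join(reactants)) & atom_maps(".".join(products))
```

Here the two mapped alkene carbons reappear in the three-membered ring of the product, which is what restricts the hits to cyclopropanations of that double bond.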
Name I20110224.tar\US20110046406A1-20110224.ZIP\0066
Text from US 2011/0046406 A1
Most lexical variants
1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride
EDCI hydrochloride
1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride
N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride
N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl
N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride
N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl
1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride
1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride
N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride
1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride
And 127 more!
675 chemicals had over 10 lexical variants!
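Counting lexical variants amounts to grouping the names found in text by the structure they resolve to. The sketch below assumes each name has already been converted to some canonical structure key (an InChI would be a natural choice); the keys used here are hypothetical placeholders, not real identifiers.

```python
from collections import defaultdict


def group_variants(resolved_names):
    """Group names by structure key; resolved_names is an iterable of (name, key)."""
    groups = defaultdict(set)
    for name, structure_key in resolved_names:
        groups[structure_key].add(name)
    return groups


# Names taken from the slide; the structure keys are placeholders for illustration.
resolved = [
    ("EDCI hydrochloride", "KEY_EDCI_HCL"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl", "KEY_EDCI_HCL"),
    ("N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride", "KEY_EDCI_HCL"),
    ("pentafluorophenol", "KEY_PFP"),
]

groups = group_variants(resolved)
variant_counts = {key: len(names) for key, names in groups.items()}
```

Sorting `variant_counts` by value would surface the chemicals with the most spellings, as on this slide.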
Known Limitations
• The first workup reagent is often erroneously classified as a reactant
• Atom mapping produces mappings that are not necessarily representative of reaction mechanism and occasionally involve clearly incorrect atoms
• Conditions from analogous reactions are not resolved
• Reaction temperature and time are not captured
Conclusions
• 424,621 exact atom-mapped reactions were extracted from 4 years of USPTO patent applications
• Evaluation indicates that the reactions are of generally good quality, especially if the misidentification of workup reagents as reactants is not considered important
• All the code to extract reactions is open source: https://bitbucket.org/dan2097/patent-reaction-extraction
Acknowledgements
Unilever Centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson
Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov
Boehringer Ingelheim for funding
SMARTS searching: Roger Sayle