1 Automated Extraction of Reactions from the Patent Literature Daniel Lowe Unilever Centre for Molecular Science Informatics University of Cambridge
Automated Extraction of Reactions from the Patent Literature

May 10, 2015

dan2097

We have created a pipeline of recently enhanced open source components for extracting chemical reactions from full-text chemical literature. OSCAR4 is used to recognise chemical entities and resolve them to structures where appropriate. OPSIN is used to resolve systematic chemical names to structures. ChemicalTagger performs part-of-speech tagging, allowing the interpretation of phrases in chemical syntheses. The final output is a semantic representation (chemical components and their roles, reaction conditions, actions including workup, yield and properties of the product). We then attempt to map all atoms in the product(s) to reactants. If successful, we also attempt to calculate the stoichiometry of the reaction. The system has been deployed on over 56,000 USPTO patents published since 2008. The level of recall is useful and most extracted reactions make chemical sense. The pipeline is generally applicable to reactions in chemical literature including journals and theses.
Transcript
Page 1: Automated Extraction of Reactions from the Patent Literature

Automated Extraction of Reactions from the Patent Literature

Daniel Lowe

Unilever Centre for Molecular Science Informatics

University of Cambridge

Page 2: Automated Extraction of Reactions from the Patent Literature

Chemistry patent applications

[Figure: chemistry patent applications per year, 2000 to 2009; y-axis from 0 to 400,000]

World Intellectual Property Indicators, 2011 edition

• Hundreds of thousands of applications each year

Page 4: Automated Extraction of Reactions from the Patent Literature

The idea

XML patents → Reaction Extraction System → Extracted Reactions

Page 5: Automated Extraction of Reactions from the Patent Literature

Steps involved

• Identifying experimental sections
• Identifying chemical entities
• Chemical name to structure conversion
• Associating chemical entities with quantities
• Assigning chemical roles
• Atom-atom mapping

Page 6: Automated Extraction of Reactions from the Patent Literature

Building on existing projects

Page 7: Automated Extraction of Reactions from the Patent Literature

Archetypal experimental section

[Figure: annotated experimental section. Labels: paragraph number, section heading, section target compound, step target compound, synthesis, characterisation, workup, step identifier]

Page 8: Automated Extraction of Reactions from the Patent Literature

Jessop, D. M.; Adams, S. E.; Murray-Rust, P. Mining Chemical Information from Open Patents. Journal of Cheminformatics 2011, 3, 40.

Page 9: Automated Extraction of Reactions from the Patent Literature

ChemicalTagger

• Tags words of text
• Parses tags to identify phrases
• Generates an XML parse tree
  – http://chemicaltagger.ch.cam.ac.uk/
  – Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic text-mining in chemistry. Journal of Cheminformatics 2011, 3, 17.

Page 10: Automated Extraction of Reactions from the Patent Literature

Tagging

• Regex tagger: tags keywords, e.g. “yield”, “mL”
• OSCAR4 tagger: finds names OSCAR4 believes to be chemical, e.g. “2-methylpyridine”
• OpenNLP: tags parts of speech

Additional taggers:
• OPSIN tagger: finds names OPSIN can parse
• Trivial chemical name tagger: tags a few chemicals missed by the other taggers and cases that are partially matched by the regex tagger, e.g. Dess-Martin reagent
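The keyword tagging described above can be illustrated with a minimal sketch. This is not ChemicalTagger's implementation: the tag names CD, NN-MASS, NN-AMOUNT and NN-EQ are taken from the sample output in this talk, while NN-VOL, NN-YIELD, the tokeniser and every pattern are assumptions made for illustration.

```python
import re

# CD, NN-MASS, NN-AMOUNT and NN-EQ appear in the sample tagger output;
# NN-VOL, NN-YIELD and all of the patterns below are assumptions.
TAG_PATTERNS = [
    ("CD", re.compile(r"\d+(?:\.\d+)?")),
    ("NN-MASS", re.compile(r"mg|g|kg")),
    ("NN-VOL", re.compile(r"ml|mL|l|L")),
    ("NN-AMOUNT", re.compile(r"mmol|mol")),
    ("NN-EQ", re.compile(r"eq|equiv")),
    ("NN-YIELD", re.compile(r"yields?", re.IGNORECASE)),
]

# Naive tokeniser: numbers, runs of letters, or any single other character.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+|\S")


def tag_tokens(text):
    """Return (token, tag) pairs; tag is None when no keyword pattern matches."""
    tagged = []
    for token in TOKEN.findall(text):
        tag = next(
            (name for name, pattern in TAG_PATTERNS if pattern.fullmatch(token)),
            None,
        )
        tagged.append((token, tag))
    return tagged


for token, tag in tag_tokens("(606 mg, 2.1 mmol, 1 eq)"):
    print(token, tag)
```

Tokens left untagged here (brackets, commas, ordinary words) would fall through to the other taggers in the real pipeline.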

Page 11: Automated Extraction of Reactions from the Patent Literature

Sample ChemicalTagger Output

<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS>
      <CD>606</CD>
      <NN-MASS>mg</NN-MASS>
    </MASS>
    <COMMA>,</COMMA>
    <AMOUNT>
      <CD>2.1</CD>
      <NN-AMOUNT>mmol</NN-AMOUNT>
    </AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT>
      <CD>1</CD>
      <NN-EQ>eq</NN-EQ>
    </EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>
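As a rough illustration of how a parse tree like this one could be consumed, the sketch below reads the chemical name and its quantities with Python's standard `xml.etree.ElementTree`. The function name `read_molecule` and the returned data shape are hypothetical, and the code assumes every quantity element holds one `CD` value and one `NN-*` unit, as in the sample.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
    <COMMA>,</COMMA>
    <AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>"""


def read_molecule(xml_text):
    """Return (name, {quantity type: (value, unit)}) for one MOLECULE element."""
    root = ET.fromstring(xml_text)
    # The chemical name is spread over several OSCAR-CM tokens.
    name = " ".join(token.text for token in root.iter("OSCAR-CM"))
    quantities = {}
    for child in root.find("QUANTITY"):
        number = child.find("CD")
        if number is None:
            continue  # brackets and commas carry no value
        unit = next(c.text for c in child if c.tag.startswith("NN-"))
        quantities[child.tag] = (float(number.text), unit)
    return name, quantities


name, quantities = read_molecule(SAMPLE)
print(name)
print(quantities)
```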

Page 15: Automated Extraction of Reactions from the Patent Literature

Pyridine, pyridines and pyridine rings

Entity                               Type
Pyridine                             Exact
The pyridine / Pyridine from step 1  DefiniteReference
Pyridines / A pyridine               ChemicalClass
Pyridine ring / Pyridyl              Fragment
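To make the four categories concrete, here is a toy classifier that reproduces the table using surface cues only. The rules (determiners, plural endings, "ring"/"-yl" suffixes) are assumptions chosen to fit these seven examples; they are not the rules the extraction system uses and would misfire on many real names.

```python
def classify_mention(mention):
    """Toy heuristic mapping an entity mention to one of the four types."""
    words = mention.strip().lower().split()
    if not words:
        raise ValueError("empty mention")
    first, last = words[0], words[-1]
    text = " ".join(words)
    # Substituent-style endings or an explicit "ring" suggest a fragment.
    if last in {"ring", "group", "moiety"} or last.endswith("yl"):
        return "Fragment"
    # A definite article or a back-reference to an earlier step.
    if first == "the" or " from step " in text:
        return "DefiniteReference"
    # An indefinite article or a plural suggests a class of compounds.
    if first in {"a", "an"} or last.endswith("s"):
        return "ChemicalClass"
    return "Exact"


for mention in ["Pyridine", "The pyridine", "Pyridine from step 1",
                "Pyridines", "A pyridine", "Pyridine ring", "Pyridyl"]:
    print(mention, "->", classify_mention(mention))
```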

Page 16: Automated Extraction of Reactions from the Patent Literature

Section/Step Parsing

Workup phrase types: Concentrate, Degass, Dry, Extract, Filter, Partition, Precipitate, Purify, Recover, Remove, Wash, Quench
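The phrase types listed above could be spotted in text with something as simple as verb-stem matching, sketched below. The real system derives phrase types from the ChemicalTagger parse tree; the stems here are assumptions for illustration, and only the twelve type names come from the slide.

```python
import re

# Type names are from the slide; the stems are illustrative assumptions.
WORKUP_STEMS = {
    "Concentrate": r"concentrat",
    "Degass": r"degas",
    "Dry": r"dr(?:y|ied)",
    "Extract": r"extract",
    "Filter": r"filter",
    "Partition": r"partition",
    "Precipitate": r"precipitat",
    "Purify": r"purif",
    "Recover": r"recover",
    "Remove": r"remov",
    "Wash": r"wash",
    "Quench": r"quench",
}
WORKUP_PATTERNS = {label: re.compile(stem) for label, stem in WORKUP_STEMS.items()}


def workup_types(sentence):
    """Return (word, phrase type) pairs for words that start with a known stem."""
    hits = []
    for word in re.findall(r"[A-Za-z]+", sentence.lower()):
        for label, pattern in WORKUP_PATTERNS.items():
            if pattern.match(word):
                hits.append((word, label))
                break
    return hits


print(workup_types(
    "washed with water (10 ml), dried over sodium sulphate, filtered and evaporated"
))
```

Note that "evaporated" is left unlabelled by this sketch, since no stem for it is assumed.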

Page 18: Automated Extraction of Reactions from the Patent Literature

Example

Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate

To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%).
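The quantities in this paragraph can be cross-checked against one another, which is one way extracted amounts might be sanity-tested. The sketch below assumes a 1:1 reactant-to-product stoichiometry and treats methyl 4-(chlorosulfonyl)benzoate (2.1 mmol, 1 eq) as the limiting reagent; the function names and the 6% tolerance are hypothetical choices, not part of the described system.

```python
import math


def equivalents(amount_mmol, limiting_mmol):
    """Molar equivalents relative to the limiting reagent."""
    return amount_mmol / limiting_mmol


def percent_yield(product_mmol, limiting_mmol):
    """Percentage yield, assuming one mole of product per mole of limiting reagent."""
    return 100.0 * product_mmol / limiting_mmol


def roughly_matches(reported, computed, rel_tol=0.06):
    """Allow for rounding in the reported figures (tolerance is an assumption)."""
    return math.isclose(reported, computed, rel_tol=rel_tol)


LIMITING_MMOL = 2.1  # methyl 4-(chlorosulfonyl)benzoate, 1 eq

checks = {
    "pentafluorophenol eq": (1.1, equivalents(2.2, LIMITING_MMOL)),
    "Et3N eq": (2.5, equivalents(5.4, LIMITING_MMOL)),
    "yield %": (85.0, percent_yield(1.8, LIMITING_MMOL)),
}
for label, (reported, computed) in checks.items():
    print(label, reported, round(computed, 2), roughly_matches(reported, computed))
```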

Page 20: Automated Extraction of Reactions from the Patent Literature

CML output

<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
  <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
  <productList>
    <product role="product">
      <molecule id="m0">
        <name dictRef="nameDict:unknown">title compound</name>
      </molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
      <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
      <dl:entityType>definiteReference</dl:entityType>
      <dl:state>solid</dl:state>
    </product>
  </productList>
  <reactantList>
    <reactant role="reactant" count="1">
      <molecule id="m1">
        <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
      </molecule>
      <amount units="unit:mmol">2.1</amount>
      <amount units="unit:mg">606</amount>
      <amount units="unit:eq">1.0</amount>
      <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>

Quantities including yield are extracted

Entity is classified as an exact compound, definite reference, chemical class or polymer

Reaction SMILES

SMILES and InChIs for every structure-resolvable reagent/product
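A consumer of this output might read product amounts and SMILES as in the following sketch, which uses Python's standard `xml.etree.ElementTree`. The snippet is a trimmed, well-formed fragment built from the slide: the `dl:` elements are omitted because their namespace URI is not visible in the truncated output, and `read_products` with its return shape is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Default CML namespace as shown on the slide.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

SNIPPET = """<reaction xmlns="http://www.xml-cml.org/schema">
  <productList>
    <product role="product">
      <molecule id="m0">
        <name dictRef="nameDict:unknown">title compound</name>
      </molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles"
                  value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
    </product>
  </productList>
</reaction>"""


def read_products(cml_text):
    """Return one dict per product with its name, amounts by unit, and SMILES."""
    root = ET.fromstring(cml_text)
    products = []
    for product in root.iterfind("cml:productList/cml:product", CML_NS):
        amounts = {
            amount.get("units").split(":", 1)[1]: float(amount.text)
            for amount in product.iterfind("cml:amount", CML_NS)
        }
        smiles = None
        for identifier in product.iterfind("cml:identifier", CML_NS):
            if identifier.get("dictRef") == "cml:smiles":
                smiles = identifier.get("value")
        name = product.findtext("cml:molecule/cml:name", namespaces=CML_NS)
        products.append({"name": name, "amounts": amounts, "smiles": smiles})
    return products


for product in read_products(SNIPPET):
    print(product["name"], product["amounts"], product["smiles"])
```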

Page 21: Automated Extraction of Reactions from the Patent Literature

Evaluation

• 2008-2011 USPTO patent applications classified as containing organic chemistry: 65,034 documents

• 484,259 atom-mapped reactions extracted

• Adding the additional requirements that all the identified product molecules were resolvable to structures and that all reagents were believed to describe exact compounds leaves 424,621 reactions

• 100 of these were selected for manual evaluation of quality

Page 22: Automated Extraction of Reactions from the Patent Literature

Reactions found

[Figure: histogram of patents by number of extracted reactions. x-axis: number of extracted reactions (0 to 1000); y-axis: patents with given number of reactions (log scale, 1 to 100,000)]

Page 23: Automated Extraction of Reactions from the Patent Literature

Results

• 96% correctly identified the primary starting material and product whilst not misidentifying reagents that could be confused with the starting material

• Compared to the 495 expected chemical entities, there were 61 false positives and 16 false negatives

• Only 4 of the 321 reagents (with quantities) did not have these quantities recognised and associated with the reagent

• Association of quantities/yields with products was less successful: 48 of the 74 cases where such data was present were handled
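The entity counts above can be turned into precision and recall, as in this short sketch. It rests on one assumption the slide does not state outright: that the 16 false negatives are a subset of the 495 expected entities, so the number of true positives is 495 - 16. The function name is hypothetical.

```python
def precision_recall_f1(expected, false_positives, false_negatives):
    """Precision, recall and F1, assuming true positives = expected - false negatives."""
    true_positives = expected - false_negatives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / expected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(495, 61, 16)
print(f"entity precision {precision:.3f}, recall {recall:.3f}, F1 {f1:.3f}")

# Quantity association rates quoted on the slide, as simple fractions.
reagent_rate = (321 - 4) / 321
product_rate = 48 / 74
print(f"reagent quantities {reagent_rate:.3f}, product quantities {product_rate:.3f}")
```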

Page 24: Automated Extraction of Reactions from the Patent Literature

Use Cases

• Reaction searching

• Analysing trends in reactions over time

• Reaction outcome prediction

Page 25: Automated Extraction of Reactions from the Patent Literature

Example of reaction searching

C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)

6 reactions found in 5 patents

[Figure: reaction scheme for the query, an alkene (mapped atoms 1 and 2) plus diiodomethane giving a cyclopropane]
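The query above is an atom-mapped reaction SMILES of the form reactants>agents>products. The sketch below splits such a string with plain string handling and collects the atom-map numbers, which is enough to check that every mapped product atom has a mapped source atom. It does no chemistry: real substructure searching needs a cheminformatics toolkit, and all function names here are hypothetical.

```python
import re

# Matches a bracket atom carrying an atom-map number, e.g. [CH:1] -> 1.
ATOM_MAP = re.compile(r"\[[^\]]*?:(\d+)\]")

QUERY = "C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)"


def split_reaction(rxn):
    """Split 'reactants>agents>products' into three lists of SMILES strings."""
    reactants, agents, products = rxn.split(">")

    def parts(section):
        return section.split(".") if section else []

    return parts(reactants), parts(agents), parts(products)


def atom_maps(smiles):
    """Set of atom-map numbers present in one SMILES string."""
    return {int(number) for number in ATOM_MAP.findall(smiles)}


def mapped_product_atoms_have_source(rxn):
    """True if every map number on a product also occurs on some reactant."""
    reactants, _, products = split_reaction(rxn)
    reactant_maps = set().union(*(atom_maps(s) for s in reactants))
    return all(atom_maps(p) <= reactant_maps for p in products)


reactants, agents, products = split_reaction(QUERY)
print(reactants, agents, products)
print(mapped_product_atoms_have_source(QUERY))
```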

Page 26: Automated Extraction of Reactions from the Patent Literature

Name I20110224.tar\US20110046406A1-20110224.ZIP\0066

Text from US 2011/0046406 A1

Page 27: Automated Extraction of Reactions from the Patent Literature

Most lexical variants

1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride
EDCI hydrochloride
1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride
N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride
N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl
N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride
N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl
1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride
1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride
N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride
1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride

And 127 more!

675 chemicals had over 10 lexical variants!
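Counts like these presumably come from grouping distinct name strings by the structure they resolve to. A minimal sketch of that grouping follows; it assumes the names have already been resolved to some canonical structure key (for example an InChI), and the key `"EDCI.HCl"` used here is a placeholder label, not a real identifier. Both function names are hypothetical.

```python
from collections import defaultdict


def variants_by_structure(resolved_names):
    """Group distinct name strings by the structure key they resolved to."""
    groups = defaultdict(set)
    for name, structure_key in resolved_names:
        groups[structure_key].add(name)
    return groups


def heavily_varied(groups, threshold=10):
    """Structures written in more than `threshold` different ways."""
    return {key: names for key, names in groups.items() if len(names) > threshold}


# Placeholder structure key; a real run would use a canonical identifier.
resolved = [
    ("EDCI hydrochloride", "EDCI.HCl"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl", "EDCI.HCl"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl", "EDCI.HCl"),
    ("EDCI hydrochloride", "EDCI.HCl"),  # repeated string, counted once
]
groups = variants_by_structure(resolved)
print({key: len(names) for key, names in groups.items()})
print(heavily_varied(groups, threshold=2))
```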

Page 29: Automated Extraction of Reactions from the Patent Literature

Known Limitations

• The first workup reagent is often erroneously classified as a reactant

• Atom mapping produces mappings that are not necessarily representative of reaction mechanism and occasionally involve clearly incorrect atoms

• Conditions from analogous reactions are not resolved

• Reaction temperature and time are not captured

Page 30: Automated Extraction of Reactions from the Patent Literature

Conclusions

• 424,621 exact atom-mapped reactions were extracted from 4 years of USPTO patent applications

• Evaluation indicates the reactions to be of generally good quality, especially if the misidentification of workup reagents as reactants is not considered important

• All the code to extract reactions is open source: https://bitbucket.org/dan2097/patent-reaction-extraction

Page 31: Automated Extraction of Reactions from the Patent Literature

Acknowledgements

Unilever Centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson

Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov

Boehringer Ingelheim for funding

SMARTS searching: Roger Sayle