Automated Extraction of Reactions from the Patent Literature

Post on 10-May-2015

DESCRIPTION

We have created a pipeline of recently enhanced open source components for extracting chemical reactions from full-text chemical literature. OSCAR4 is used to recognise chemical entities and resolve them to structures where appropriate. OPSIN is used to resolve systematic chemical names to structures. ChemicalTagger performs part-of-speech tagging, allowing the interpretation of phrases in chemical syntheses. The final output is a semantic representation (chemical components and their roles, reaction conditions, actions including workup, yield and properties of the product). We then attempt to map all atoms in the product(s) to reactants. If successful, we also attempt to calculate the stoichiometry of the reaction. The system has been deployed on over 56,000 USPTO patents published since 2008. The level of recall is useful and most extracted reactions make chemical sense. The pipeline is generally applicable to reactions in chemical literature including journals and theses.

Transcript

1

Automated Extraction of Reactions from the Patent Literature

Daniel Lowe

Unilever Centre for Molecular Science Informatics

University of Cambridge

2

Chemistry patent applications

[Figure: chemistry patent applications per year, 2000 to 2009; y-axis from 0 to 400,000]

World Intellectual Property Indicators, 2011 edition

• 100,000s of applications each year

4

The idea

XML patents → Reaction Extraction System → Extracted Reactions

5

Steps involved

• Identifying experimental sections

• Identifying chemical entities

• Chemical name to structure conversion

• Associating chemical entities with quantities

• Assigning chemical roles

• Atom-atom mapping

7

Archetypal experimental section

Paragraph number

Section heading

Section target compound

Step target compound

Synthesis

Characterisation

Workup

Step identifier

8

Jessop, D. M.; Adams, S. E.; Murray-Rust, P. Mining Chemical Information from Open Patents. Journal of Cheminformatics 2011, 3, 40.

9

ChemicalTagger

• Tags words of text

• Parses tags to identify phrases

• Generates XML parse tree

– http://chemicaltagger.ch.cam.ac.uk/

– Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic text-mining in chemistry. J Cheminf 2011, 3, 17.

10

Tagging

Additional taggers:

• OPSIN tagger: Finds names OPSIN can parse

• Trivial chemical name tagger: Tags a few chemicals missed by the other taggers and cases that are partially matched by the regex tagger e.g. Dess-Martin reagent

• Regex tagger: Tags keywords e.g. “yield”, “mL”

• OSCAR4 tagger: Finds names OSCAR4 believes to be chemical e.g. “2-methylpyridine”

• OpenNLP: Tags parts of speech

11

Sample ChemicalTagger Output

<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS> <CD>606</CD> <NN-MASS>mg</NN-MASS> </MASS>
    <COMMA>,</COMMA>
    <AMOUNT> <CD>2.1</CD> <NN-AMOUNT>mmol</NN-AMOUNT> </AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT> <CD>1</CD> <NN-EQ>eq</NN-EQ> </EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>
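As an illustration of how a parse tree like the sample above can be consumed, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `read_molecule` and the returned dictionary layout are assumptions made for this example; they are not part of ChemicalTagger or of the extraction pipeline.

```python
import xml.etree.ElementTree as ET

# The sample ChemicalTagger output from the slide, as one string.
SAMPLE = (
    "<MOLECULE><OSCARCM><OSCAR-CM>methyl</OSCAR-CM>"
    "<OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM></OSCARCM>"
    "<QUANTITY><_-LRB->(</_-LRB->"
    "<MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS><COMMA>,</COMMA>"
    "<AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT><COMMA>,</COMMA>"
    "<EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>"
    "<_-RRB->)</_-RRB-></QUANTITY></MOLECULE>"
)


def read_molecule(xml_text):
    """Return (name, quantities) for one MOLECULE element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # The chemical name is split over several OSCAR-CM tokens; rejoin with spaces.
    name = " ".join(token.text for token in root.iter("OSCAR-CM"))
    quantities = {}
    for kind in ("MASS", "AMOUNT", "EQUIVALENT"):
        node = root.find(f"QUANTITY/{kind}")
        if node is None:
            continue
        value = float(node.find("CD").text)
        # The unit is the child whose tag starts with "NN-" (NN-MASS, NN-AMOUNT, NN-EQ).
        unit = next(child.text for child in node if child.tag.startswith("NN-"))
        quantities[kind.lower()] = (value, unit)
    return name, quantities


name, quantities = read_molecule(SAMPLE)
print(name)
print(quantities)
```

Because the quantities sit inside the same MOLECULE element as the name, associating a chemical entity with its amounts reduces to reading siblings in the tree.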

15

Pyridine, pyridines and pyridine rings

Entity                                 Type
Pyridine                               Exact
The pyridine / Pyridine from step 1    DefiniteReference
Pyridines / A pyridine                 ChemicalClass
Pyridine ring / Pyridyl                Fragment

16

Section/Step Parsing

Workup phrase types: Concentrate, Degass, Dry, Extract, Filter, Partition, Precipitate, Purify, Recover, Remove, Wash, Quench

18

Example

Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate

To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%).
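The quantities in this paragraph can be cross-checked against each other, which is the kind of arithmetic a stoichiometry step relies on. The sketch below is not the pipeline's method (the pipeline uses ChemicalTagger's parse tree): it pulls the "(mg, mmol, eq)" groups out with a simple regular expression and compares the stated equivalents with the mmol ratios. It assumes the reagent stated as 1 eq is the reference, and the 0.1 tolerance for rounding is an arbitrary choice.

```python
import re

# First sentence of the example paragraph, truncated after the last reagent.
SENTENCE = (
    "To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) "
    "in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and "
    "Et3N (540 mg, 5.4 mmol, 2.5 eq)"
)

NUMBER = r"(\d+(?:\.\d+)?)"
TRIPLE = re.compile(rf"\({NUMBER} mg, {NUMBER} mmol, {NUMBER} eq\)")


def reagent_quantities(text):
    """Return (mg, mmol, eq) tuples for every '(x mg, y mmol, z eq)' group."""
    return [tuple(float(v) for v in m.groups()) for m in TRIPLE.finditer(text)]


def check_equivalents(quantities, tolerance=0.1):
    """Compare stated equivalents with mmol ratios relative to the 1 eq reagent."""
    reference_mmol = next(mmol for _, mmol, eq in quantities if eq == 1.0)
    report = []
    for _, mmol, stated_eq in quantities:
        computed = mmol / reference_mmol
        consistent = abs(computed - stated_eq) <= tolerance
        report.append((round(computed, 2), stated_eq, consistent))
    return report


quantities = reagent_quantities(SENTENCE)
report = check_equivalents(quantities)
# Yield from the product amount (1.8 mmol) over the 1 eq reagent (2.1 mmol).
percent_yield = round(100 * 1.8 / 2.1, 1)
print(report)
print(percent_yield)
```

The computed ratios agree with the stated equivalents only to within rounding, and the computed yield is close to, not identical with, the reported 85%, so any automated consistency check on patent text needs a tolerance.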

20

CML output

<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
  <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
  <productList>
    <product role="product">
      <molecule id="m0">
        <name dictRef="nameDict:unknown">title compound</name>
      </molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
      <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
      <dl:entityType>definiteReference</dl:entityType>
      <dl:state>solid</dl:state>
    </product>
  </productList>
  <reactantList>
    <reactant role="reactant" count="1">
      <molecule id="m1">
        <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
      </molecule>
      <amount units="unit:mmol">2.1</amount>
      <amount units="unit:mg">606</amount>
      <amount units="unit:eq">1.0</amount>
      <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>

Quantities including yield are extracted

Entity is classified as an exact compound, definite reference, chemical class or polymer

Reaction SMILES

SMILES and InChIs for every structure-resolvable reagent/product

21

Evaluation

• 2008-2011 USPTO patent applications classified as containing organic chemistry: 65,034 documents

• 484,259 atom-mapped reactions extracted

• Adding the requirements that all identified product molecules were resolvable to structures and that all reagents were believed to describe exact compounds leaves 424,621 reactions

• 100 of these were selected for manual evaluation of quality

22

Reactions found

[Figure: number of patents with a given number of extracted reactions; x-axis: number of extracted reactions (0 to 1,000); y-axis: patents with given number of reactions (log scale, 1 to 100,000)]

23

Results

• 96% correctly identified the primary starting material and product whilst not misidentifying reagents that could be confused with the starting material

• As compared to the 495 expected chemical entities there were 61 false positives and 16 false negatives

• Only 4 of the 321 reagents (with quantities) did not have these quantities recognised and associated with the reagent

• Association of quantities/yields with products was less successful: 48 of the 74 cases where such data was present were handled
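The counts on this slide can be turned into the usual precision, recall and F1 figures. These derived numbers are not stated in the talk. The sketch assumes that the 495 "expected" entities are the gold standard and that the 16 false negatives are expected entities that were missed, so true positives = 495 - 16.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard precision, recall and F1 from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Chemical entity recognition, using the counts from the slide.
expected, false_pos, false_neg = 495, 61, 16
true_pos = expected - false_neg  # assumption: misses are counted against the 495
precision, recall, f1 = precision_recall_f1(true_pos, false_pos, false_neg)

# Quantity association rates, also from the slide.
reagent_rate = (321 - 4) / 321  # reagents whose quantities were associated
product_rate = 48 / 74          # products whose quantities/yields were associated

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"reagent quantities={reagent_rate:.3f} product quantities={product_rate:.3f}")
```

Under that reading, recall for entity recognition is higher than precision, and product quantity association is clearly the weakest of the measured steps.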

24

Use Cases

• Reaction searching

• Analysing trends in reactions over time

• Reaction outcome prediction

25

Example of reaction searching

C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)

6 reactions found in 5 patents

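[Figure: drawn reaction scheme for the query above: an alkene with mapped atoms 1 and 2 plus CH2I2, giving a cyclopropane that retains atoms 1 and 2]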
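To show how the atom maps in the query above can be read, here is a small sketch that splits a reaction SMILES into reactants, agents and products and collects the atom-map numbers on each side. It uses only string handling and a regular expression; it does not parse SMILES properly and is not the SMARTS search used in the talk. The helper names are made up for this example.

```python
import re

QUERY = "C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)"


def split_reaction(reaction_smiles):
    """Split 'reactants>agents>products' into three lists of molecule strings."""
    reactants, agents, products = reaction_smiles.split(">")

    def molecules(part):
        # Naive split on '.', which is enough for this simple query.
        return part.split(".") if part else []

    return molecules(reactants), molecules(agents), molecules(products)


def atom_maps(smiles):
    """Return the set of atom-map numbers, i.e. the N in bracket atoms like [CH:N]."""
    return {int(n) for n in re.findall(r":(\d+)\]", smiles)}


reactants, agents, products = split_reaction(QUERY)
reactant_maps = set().union(*(atom_maps(s) for s in reactants))
product_maps = set().union(*(atom_maps(s) for s in products))

print(reactants, agents, products)
# Every mapped product atom should also appear on the reactant side.
print(product_maps <= reactant_maps)
```

In this query only the two alkene carbons carry map numbers; the third ring carbon in the product is left unmapped, so the query does not require it to come from any particular reactant atom.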

26

Name I20110224.tar\US20110046406A1-20110224.ZIP\0066

Text from US 2011/0046406 A1

27

Most lexical variants

1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride
EDCI hydrochloride
1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride
N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride
N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl
N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride
N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl
1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride
1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride
N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride
1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride

And 127 more!

675 chemicals had over 10 lexical variants!
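Counting lexical variants amounts to grouping names by the structure they resolve to. The sketch below assumes the name-to-structure step has already produced (name, structure key) pairs. The names are taken from the slides, but the keys `STRUCTURE_A` and `STRUCTURE_B` are made-up placeholders standing in for a real identifier such as an InChI, and the helper names are hypothetical.

```python
from collections import defaultdict

# (name, structure key) pairs; the keys are placeholders, not real identifiers.
RESOLVED = [
    ("EDCI hydrochloride", "STRUCTURE_A"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl", "STRUCTURE_A"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl", "STRUCTURE_A"),
    ("N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride", "STRUCTURE_A"),
    ("pentafluorophenol", "STRUCTURE_B"),
]


def variants_by_structure(pairs):
    """Group distinct names under the structure key they resolve to."""
    groups = defaultdict(set)
    for name, key in pairs:
        groups[key].add(name)
    return groups


def with_many_variants(groups, minimum):
    """Return {key: variant count} for structures with more than `minimum` names."""
    return {key: len(names) for key, names in groups.items() if len(names) > minimum}


groups = variants_by_structure(RESOLVED)
# The slide uses a threshold of 10; 3 is used here only because the toy data is small.
frequent = with_many_variants(groups, 3)
print(frequent)
```

Grouping on a structure key rather than on the name string is what lets spelling, punctuation and locant differences collapse into one compound.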

29

Known Limitations

• The first workup reagent is often erroneously classified as a reactant

• Atom mapping produces mappings that are not necessarily representative of reaction mechanism and occasionally involve clearly incorrect atoms

• Conditions from analogous reactions are not resolved

• Temperature/time for reactions to occur not captured

30

Conclusions

• 424,621 exact atom-mapped reactions were extracted from 4 years of USPTO patent applications

• Evaluation indicates the reactions to be of generally good quality especially if the misidentification of workup reagents as reactants is not considered important

• All the code to extract reactions is open source: https://bitbucket.org/dan2097/patent-reaction-extraction

31

Acknowledgements

Unilever centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson

Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov

Boehringer Ingelheim for funding SMARTS searching: Roger Sayle

32

Any Questions?

Email: daniel@nextmovesoftware.com
