www.guidetopharmacology.org Looking at the gift horse: pros and cons of patent-extracted structures in PubChem. Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh. ICIC Heidelberg, Monday 23rd Oct 2017

Pros and cons of patent-extracted structures in PubChem

Jan 22, 2018


Chris Southan
Transcript
Page 1: Pros and cons of patent-extracted structures in PubChem

www.guidetopharmacology.org

Looking at the gift horse: pros and cons of patent-extracted structures in PubChem

Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh. ICIC Heidelberg, Monday 23rd Oct 2017

22 million

Page 2: Pros and cons of patent-extracted structures in PubChem

Abstract (will be skipped for the presentation)


As of August 2017, the major automated patent chemistry extractions (in ascending size: NextMove, SCRIPDB, IBM and SureChEMBL) are submitters for 21.5 million CIDs from the PubChem total of 93.8 million. The following aspects will be expanded in this presentation, starting with advantages: a) while the relative coverage between open and commercial sources is difficult to determine (PMID 26457120), it is clear that the majority of patent-exemplified structures of medicinal chemistry interest (i.e. from C07 plus A61) are now in PubChem, b) this allows most first-filings of lead series and clinical candidates to be tracked, c) the PubChem tool box has query, analysis, clustering and linking features difficult to match in commercial sources, d) many structures can be associated with bioactivity data, e) connections between manually curated papers and patents can be made via the 0.48 million CID intersects with ChEMBL.

However, looking more closely also indicates disadvantages: a) extraction coverage is compromised by dense image tables and poor OCR quality of WO documents, b) SureChEMBL is the only major open pipeline continuously running in situ but has a PubChem updating lag, c) automated extraction generates structural “noise” that degrades chemistry quality, d) PubChem patent document metadata indexing is patchy (although better for SureChEMBL in situ), e) nothing in the records indicates IP status, f) continual re-extraction of common chemistry results in over-mapping (e.g. 126,949 patents for aspirin and 14,294 for atorvastatin), g) authentic compounds are contaminated with spurious mixtures and never-made virtuals, including 1000s of deuterated drugs, h) linking between assay data and targets is still a manual exercise.

However, all things considered, the PubChem patent “big bang” presents users with the best of both worlds (PMID 26194581). Academics or smaller enterprises who cannot afford commercial solutions can now patent mine extensively. Even for those with commercial subscriptions, PubChem has become an essential adjunct/complementary source for the analysis of patent chemistry and associated bio entities such as diseases and drug targets.

Page 3: Pros and cons of patent-extracted structures in PubChem

Outline

• History of patent chemistry feeds to PubChem

• Relative source contributions

• Caveats with automated extraction

• Source intersects

• Fragmentation

• Source extraction comparisons

• Circularity for virtuals

• Mixtures

• Lag times

• Conclusions

• References

• Workshop alert


Page 4: Pros and cons of patent-extracted structures in PubChem

Chemical Named Entity Recognition (CNER)

• Automated process of documents in > structures out

• SureChEMBL pipeline shown above, other sources similar

• Name-to-struc (n2s) by look-up and/or IUPAC translation, image-to-struc (i2s) and mol files from USPTO Complex Work Units (CWUs)

• Indexing usually added e.g. abstract, descriptions, claims

• As well as patents, IBM run PubMed abstracts and PMC


Page 5: Pros and cons of patent-extracted structures in PubChem

History of patent chemistry feeds into PubChem

• 2006 Thomson (now Clarivate) Pharma, manual extractions from patents and papers, 4.3 million (but ceased Jan 2016)

• 2011 IBM phase 1 Chemical Named Entity Recognition (CNER), 2.5 million

• SLING Consortium EPO extraction, 0.1 million

• 2012 SCRIPDB, CNER + Complex Work Units (CWU), 4.0 million

• 2013 SureChem, CNER + image, 9.0 million

• 2014 BindingDB manual activity curation, 0.13 million

• 2015 (CNER + images + CWU):

• SureChEMBL, 13.0 million

• IBM phase 2, 7.0 million

• NextMove Software, 1.4 million, synthesis mapping

• 2016 SureChEMBL, 15.8 million

• 2017 IBM phase 3, 6.0 million

Page 6: Pros and cons of patent-extracted structures in PubChem

2011 “fizzle” > 2015 “big bang”


Page 7: Pros and cons of patent-extracted structures in PubChem

October 2017, from 93.89 mill PubChem CIDs


Page 8: Pros and cons of patent-extracted structures in PubChem

Pro: PubChem indexes IPC splits

Con: document indexing is USPTO dominated (i.e. early WOs missed)

Con: Entrez can't handle the joins

Page 9: Pros and cons of patent-extracted structures in PubChem

Cons: Mw plots reveal CNER fragmentation

ChEMBL + Thomson Pharma = 5.6 million (manual extraction)

Patent CNER = 21.8 million
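The fragmentation signal in a Mw plot can be approximated by binning molecular weights and measuring the share of very light structures in each source. The following is a minimal sketch under stated assumptions: the weights are invented for illustration, and the 150 Da cut-off and 50 Da bin width are arbitrary choices, not values taken from the slides.

```python
from collections import Counter


def mw_histogram(weights, bin_width=50):
    """Count molecular weights per bin, keyed by the lower bin edge."""
    return Counter(int(w // bin_width) * bin_width for w in weights)


def low_mw_fraction(weights, cutoff=150.0):
    """Fraction of structures lighter than the cut-off (a crude fragment signal)."""
    if not weights:
        return 0.0
    return sum(1 for w in weights if w < cutoff) / len(weights)


# Hypothetical molecular weights in Da, invented for illustration only.
curated = [320.4, 410.5, 385.2, 450.9, 298.3]
cner = [46.1, 88.1, 120.2, 320.4, 410.5, 95.0, 385.2, 60.1]

curated_hist = mw_histogram(curated)
curated_frac = low_mw_fraction(curated)
cner_frac = low_mw_fraction(cner)

print(f"curated low-Mw fraction: {curated_frac:.3f}")
print(f"CNER low-Mw fraction:    {cner_frac:.3f}")
```

A markedly higher low-Mw fraction in the automated set than in the manually extracted set would be consistent with reagents, solvents and name fragments being picked up alongside the exemplified compounds.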

Page 10: Pros and cons of patent-extracted structures in PubChem

Con: those “Chessbordanes” still hanging around……


Page 11: Pros and cons of patent-extracted structures in PubChem

Pros & cons arising from intersects and filters


Page 12: Pros and cons of patent-extracted structures in PubChem

Con: circular extraction of virtual enumerations

1511 codeine records, mainly 563 deuterations from Auspex US7872013 > 3-source multiplexing

652 InChI key inner layer records via 266 stereos of vorapaxar via Schering US20080085923 > 4-source multiplexing in UniChem
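Counting records that share an InChIKey inner layer amounts to grouping on the first 14-character block of the key, which hashes the connectivity; stereo and isotopic variants (such as deuterations) differ only in the second block. A minimal sketch of that grouping follows; the record IDs and keys are placeholders invented for illustration, not real compounds.

```python
from collections import defaultdict


def inner_layer(inchikey):
    """Return the first (14-character) connectivity block of a standard InChIKey."""
    first_block = inchikey.split("-")[0]
    if len(first_block) != 14:
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return first_block


def group_by_inner_layer(records):
    """Map each connectivity block to the record IDs that share it."""
    groups = defaultdict(list)
    for record_id, inchikey in records:
        groups[inner_layer(inchikey)].append(record_id)
    return dict(groups)


# Hypothetical record IDs with placeholder keys (not real compounds).
records = [
    ("rec1", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    ("rec2", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),  # same skeleton, different second block
    ("rec3", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"),
]

groups = group_by_inner_layer(records)
for block, ids in groups.items():
    print(block, len(ids), ids)
```

Unusually large groups are a cue to check whether a patent has enumerated stereoisomers or deuterated forms that were never made.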

Page 13: Pros and cons of patent-extracted structures in PubChem

Pro: comparative analysis

• Compared SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial)

• Concluded: “50–66 % of the relevant content from the latter was also found in the former”

• Equivalent comparisons in the latest PubChem would record a higher overlap

• Probability of completely missing a recently exemplified series is getting lower

Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents, Senger, et al. J. Cheminf. 2015, 7:49, doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL) http://www.ncbi.nlm.nih.gov/pubmed/26457120

Page 14: Pros and cons of patent-extracted structures in PubChem

Examining extraction selectivity for same patent

Page 15: Pros and cons of patent-extracted structures in PubChem

Con and pro: comparative coverage from US9181236

• 173 BindingDB CIDs curated from PubChem via US9181236

• 405 substances as SDF from SciFinder > OpenBabel > 391 InChIKeys (IK) > 362 CIDs

• 1657 rows > 834 SureChEMBL IDs > 664 CIDs

• 3-way Venn of CIDs

• Pro: convergence

• Con: divergence
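Once each source has been reduced to a set of CIDs, the 3-way Venn is plain set arithmetic. The sketch below uses small toy CID sets invented for illustration (the real sets for US9181236 are not reproduced in this transcript); the function and variable names are hypothetical.

```python
def three_way_venn(a, b, c):
    """Return the sizes of the seven regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


# Toy CID sets standing in for the three sources (values are invented).
bindingdb = {1, 2, 3, 4}
scifinder = {3, 4, 5, 6, 7}
surechembl = {4, 5, 8, 9}

regions = three_way_venn(bindingdb, scifinder, surechembl)
for name, size in regions.items():
    print(f"{name}: {size}")
```

The "all_three" region corresponds to the convergence noted above, and the three "only" regions to the divergence.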

Page 16: Pros and cons of patent-extracted structures in PubChem

Con: the common chemistry problem

Spurious patent < > cpd indexing: aspirin = 131,410, atorvastatin = 14,968, ethanol = 72,027
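One pragmatic way to cope with over-mapped common chemistry is to set aside any compound linked to more patents than a chosen ceiling before analysing a result set. A minimal sketch using the three counts above; the 1,000-patent ceiling and the "hypothetical_lead" entry are illustrative assumptions, not recommendations from the talk.

```python
# Patent counts from the slide, plus one invented entry for contrast.
patent_counts = {
    "aspirin": 131_410,
    "atorvastatin": 14_968,
    "ethanol": 72_027,
    "hypothetical_lead": 3,  # invented example of a narrowly claimed compound
}


def split_by_patent_count(counts, max_patents=1000):
    """Separate compounds into (over_mapped, retained) by patent-link count."""
    over_mapped = sorted(name for name, n in counts.items() if n > max_patents)
    retained = sorted(name for name, n in counts.items() if n <= max_patents)
    return over_mapped, retained


over_mapped, retained = split_by_patent_count(patent_counts)
print("over-mapped:", over_mapped)
print("retained:   ", retained)
```

Any fixed ceiling is a blunt instrument, since a genuinely important compound can also accumulate many patent links over time.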

Page 17: Pros and cons of patent-extracted structures in PubChem

Con: the mixtures problem
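Multi-component records can be screened crudely from SMILES alone, because a dot separates disconnected components. The sketch below only counts dot-separated parts; it is an assumption-laden filter that will also flag salts and hydrates, and it does no chemical validation of the strings.

```python
def component_count(smiles):
    """Number of dot-separated components in a SMILES string."""
    return len([part for part in smiles.split(".") if part])


def is_multicomponent(smiles):
    """True when the SMILES describes more than one disconnected component."""
    return component_count(smiles) > 1


examples = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "ethanol_with_water": "CCO.O",
    "sodium_chloride": "[Na+].[Cl-]",  # a salt, flagged the same way as a mixture
}

flags = {name: is_multicomponent(smi) for name, smi in examples.items()}
for name, flagged in flags.items():
    print(name, flagged)
```

Distinguishing a spurious extracted mixture from a legitimate salt form needs an extra step, for example comparing components against a list of common counter-ions.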

17

Page 18: Pros and cons of patent-extracted structures in PubChem

Pro: entity mark-up via SciBite’s Termite in SureChEMBL

Con: not working 18 Oct 2017 :(


Page 19: Pros and cons of patent-extracted structures in PubChem

Con: no open automated SAR extraction

Pro: DIY manual extraction doable

Pro: ~2K patents have target-mapped BindingDB curated SAR

• SAR table from WO2016096979, Janssen BACE1 inhibitors

• Left to right, page from the PDF, SureChEMBL mark-up and Excel paste-across

Page 20: Pros and cons of patent-extracted structures in PubChem

Con: Lag in PubChem synch times

Pro: SureChEMBL in situ speed

• Internal UniChem load at EBI, 10 Oct = 18691416

• PubChem submission, 07 Oct = 17687607

• Latest in situ entries below for 12 Oct
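The size of the lag can be estimated from the two counts above. This is rough arithmetic only: the figures come from different dates and different systems, so the difference is an indication rather than an exact measure of missing records.

```python
unichem_load = 18_691_416        # internal UniChem load at EBI, 10 Oct
pubchem_submission = 17_687_607  # PubChem submission, 07 Oct

# Records present in the in situ load but not yet in the PubChem submission.
lag = unichem_load - pubchem_submission
lag_pct = 100 * lag / unichem_load

print(f"difference: {lag:,} records ({lag_pct:.1f}% of the UniChem load)")
```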


Page 21: Pros and cons of patent-extracted structures in PubChem

Con: IBM CNER > 80% of all PubChem < > PMID links

• IBM extracts PubMed abstracts as well as patents

• PubChem links these structures to PMIDs

• Automated associations swamp out expert-curated assignments

• Specificity/accuracy is equivocal

Page 22: Pros and cons of patent-extracted structures in PubChem

Conclusions

• For the PubChem patent chemistry “Big Bang” the pros massively outweigh the cons (i.e. it’s not a bad horse …)

• Contributors are to be congratulated, as is PubChem for wrangling them

• However, it is important to look closely at the gift horse…..

• Users need to understand CNER quirks, pitfalls and confounding artefacts

• PubChem slicing and filtering can partially ameliorate these

• Activity-to-target mapping for SAR extraction is still a pinch point

• Open extraction is a crucial comparator for commercial efforts

• Those without commercial sources are well enabled for patent mining

• Those with commercial sources can synergise with open searching

Page 23: Pros and cons of patent-extracted structures in PubChem

Info


http://cdsouthan.blogspot.com/ many posts have the tag “patents”

http://www.ncbi.nlm.nih.gov/pubmed/26194581

http://www.guidetopharmacology.org/

http://www.sciencedirect.com/science/article/pii/B9780124095472138144

Page 24: Pros and cons of patent-extracted structures in PubChem

Questions? (but wait …. there’s more, a Tuesday tutorial)
