Dr. Matthias Negri Scientific Information Center Boehringer Ingelheim Pharma GmbH & Co. KG Chemistry-Enriched Patent Curation semi-automatic analysis and elaboration of patents II-SDV Nice, 21 April 2015 Árpád Figyelmesi ChemAxon
Page 1: II-SDV 2015, 20 - 21 April, in Nice

Dr. Matthias Negri, Scientific Information Center

Boehringer Ingelheim Pharma GmbH & Co. KG

Chemistry-Enriched Patent Curation: semi-automatic analysis and elaboration of patents

II-SDV Nice, 21 April 2015

Árpád Figyelmesi, ChemAxon

Page 2: II-SDV 2015, 20 - 21 April, in Nice

Content

1. Chemistry in patents

2. Why do we need a patent curation workflow?

3. Semi-automatic Patent Curation Workflow - Overview

4. Linked tools/technologies

5. ChemCurator (ChemCC)

6. Semi-automatic Patent Curation Workflow – Step by Step

7. Lessons learned, weak-points, limitations

8. Outlook

Negri Matthias, II-SDV 2015 2

Page 3: II-SDV 2015, 20 - 21 April, in Nice

Chemistry in patents

Chemistry appears in diverse forms in patents:

1. TEXT - IUPAC names, common names, etc.

2. IMAGES - embedded within or attached to the document

3. ATTACHMENTS (MOL/CDX)

4. TABLES

– as ONE-image file (tables with chemistry and bioactivity data)

– as chemistry-only image files embedded within table tags

5. Markush Structures/Formulas with R-groups

---------------------------------------------------------------------------------------

Currently NO commercial solution covers all these cases

Most of the cases are considered in the patent curation workflow

(Markush/R-group Formulas recognized and stored separately)

Page 4: II-SDV 2015, 20 - 21 April, in Nice

Why do we need a patent curation workflow?

Motivations:

1. Linked chemistry-retrieval from patents (+ chemistry as images)

2. IUPAC-enriched XML patent files as NEW source for text-mining

3. extraction of bioactivity data/targets/diseases/… in relation to chemistry

4. Similarity/Substructure frequency in compound sets of patents

5. …

Page 5: II-SDV 2015, 20 - 21 April, in Nice

Semi-automatic Patent Curation Workflow

Overview – current state

2 parallel branches from the same INPUT:

KNIME branch: SLOWER & memory intensive, BUT higher quality, more control & IUPAC-enriched XML

ChemCC branch: FASTER, BUT less informative/flexible - ChemCC as the (near) future perspective

I2E API via KNIME – batch indexing, text-mining and (relational) data retrieval

Page 6: II-SDV 2015, 20 - 21 April, in Nice

Linked tools/technologies

1. KNIME/XPATH

2. ChemAxon ChemCurator (ChemCC)

3. Other ChemAxon tools in KNIME nodes (document2structure/d2s, Naming, Molconverter, Structure checker, Standardizer, …)

4. Text/data-mining – Linguamatics I2E (+I2E Chemistry)

5. Optical Structure Recognition – Keymodule CLiDE Batch

Page 8: II-SDV 2015, 20 - 21 April, in Nice

Computer-aided chemical data extraction

English, Chinese and Japanese N2S

Markush Editor

Structure Checker

Hit visualization

Third party OSR technologies

ChemCurator (ChemCC)

Árpád Figyelmesi, II-SDV 2015 8

Page 9: II-SDV 2015, 20 - 21 April, in Nice

ChemCurator (ChemCC)

Name to Structure

Support for many nomenclatures (common, drug names, …)

IUPAC names

Custom dictionaries

English (2008)

Chinese (2013)

Japanese (2014)

Page 10: II-SDV 2015, 20 - 21 April, in Nice

Compound Extraction View

Compound list

Project explorer

Annotated document

Selected structures

ChemCurator (ChemCC)

Page 11: II-SDV 2015, 20 - 21 April, in Nice

Markush Extraction View

Markush editor

Example structures

Annotated document

Project explorer

Selected structures

Structure checker

ChemCurator (ChemCC)

Page 12: II-SDV 2015, 20 - 21 April, in Nice

General Document Curation

Extract Markush Structures from patents

Extract specific structures

Journal articles

Company reports

Patent examples

Structure extraction wizards

Exclude fragments, chemical elements, etc.

ChemCurator (ChemCC)

Page 13: II-SDV 2015, 20 - 21 April, in Nice

ChemCurator (ChemCC)

Integration & Information Sharing

Other ChemAxon products:

Direct IJC schema connection

Project sharing function

Accessible from Plexus, IJC, etc.

Third party tools:

Standard file formats

Export functions

Easily processable projects

Page 15: II-SDV 2015, 20 - 21 April, in Nice

Semi-automatic Patent Curation Workflow a) input sources and b) bibliographic data

a) Input sources

files with patent-IDs list

XML collection

b) Retrieval of bibliographic information and attachment data

family ID, patent references, expiration date, etc.

Attachment files MOL/CDX (US-patents only), TIF files

….

Page 16: II-SDV 2015, 20 - 21 April, in Nice

Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering

1. ChemCurator branch

data retrieval (XML, attachments) from IFI Claims Direct BI-server

ChemCurator project creation/sharing/annotation → html output

Chemistry extraction name2structure/document2structure → sdf output

Generation of pre-annotated patent set stored as ChemCC projects

Faster, but lower quality within the chemistry extraction process

Page 17: II-SDV 2015, 20 - 21 April, in Nice

2. KNIME branch

- OCR-errors CLEAN-UP in KNIME → improved chemistry recognition

- MOL/CDX/TIF - standardizer, structure checker → filter formulas, solvents, R-groups

Higher quality and more control in chemistry extraction process

Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering
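The OCR clean-up step on this slide can be pictured with a small dictionary-based sketch. It is not the KNIME workflow itself: the correction table `OCR_FIXES` and the function `clean_name` are hypothetical names introduced here for illustration, and the listed misreadings are only typical examples of OCR confusion in chemical names.

```python
import re

# Hypothetical correction table (OCR misreading -> intended fragment).
# A real dictionary would be built from errors observed in the patent set.
OCR_FIXES = {
    "rnethyl": "methyl",  # "rn" misread for "m"
    "pheny1": "phenyl",   # digit 1 misread for letter l
    "ch1oro": "chloro",
    "arnino": "amino",
}


def clean_name(name: str) -> str:
    """Apply the dictionary fixes, then drop stray spaces around '-' and ','."""
    for wrong, right in OCR_FIXES.items():
        name = name.replace(wrong, right)
    # OCR often inserts spaces next to locant punctuation, e.g. "2, 4-di".
    return re.sub(r"\s*([-,])\s*", r"\1", name)


print(clean_name("2, 4-dich1oro- pheny1"))  # prints 2,4-dichloro-phenyl
```

The cleaned string would then be passed to a name-to-structure converter; a name that still fails conversion is a candidate for manual review or for a new dictionary entry.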

Page 18: II-SDV 2015, 20 - 21 April, in Nice

2. KNIME branch

MOL → IUPAC

CDX → IUPAC

TIFF (via CLiDE) → IUPAC

Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering

Page 19: II-SDV 2015, 20 - 21 April, in Nice

Merging and comparison of the converted chemistry output of MOL/CDX/TIF – 2 “quality” checks:

– IUPAC name

– string length (different output order of chemicals in multiple-molecule image/multiMOL files)

OCR-correction (“dictionary” based)

2. KNIME - Chemistry “Normalization”

(within KNIME) set up a relation between each TIFF/attachment file

1. to (one or more) IUPAC name(s)

2. to a position/section in the text/document

Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering

Merge IUPAC

Clean-Up IUPAC

If NO IUPAC → IMG-name is set

“Normalize” IUPAC names
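The merge and comparison of the three conversions (MOL, CDX, TIF) can be sketched as follows. This is only an illustration under stated assumptions: `normalize`, `merge_names`, and the status labels are hypothetical, and the real workflow runs as KNIME nodes rather than as this function. The sketch applies the two checks named on the slide: name identity first, then string length as a hint that the same components were emitted in a different order.

```python
from collections import Counter
from typing import Dict, Optional


def normalize(name: str) -> str:
    """Case- and whitespace-insensitive form, used only for comparison."""
    return "".join(name.lower().split())


def merge_names(candidates: Dict[str, Optional[str]]) -> dict:
    """Pick a consensus name from per-source conversions (e.g. MOL/CDX/TIF)."""
    found = {src: n for src, n in candidates.items() if n}
    if not found:
        # No source produced a name; the caller can fall back to the image name.
        return {"name": None, "status": "no_iupac", "sources": []}

    counts = Counter(normalize(n) for n in found.values())
    best, votes = counts.most_common(1)[0]
    agreeing = [src for src, n in found.items() if normalize(n) == best]

    # Check 1: identical names. Check 2: equal string length, which may mean
    # the same chemicals were written out in a different order.
    lengths = {len(normalize(n)) for n in found.values()}
    if votes == len(found):
        status = "agree"
    elif len(lengths) == 1:
        status = "same_length_check_order"
    else:
        status = "conflict"
    return {"name": found[agreeing[0]], "status": status, "sources": agreeing}
```

A call such as `merge_names({"MOL": "Benzoic acid", "CDX": "benzoic acid", "TIF": None})` reports `agree` with the MOL and CDX sources, while disagreeing names of different length are flagged as `conflict` for manual inspection.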

Page 20: II-SDV 2015, 20 - 21 April, in Nice

Semi-automatic Patent Curation Workflow d) TIF/attachment replacement with IUPAC names

Chemistry present as text is recognized and extracted either via

- Text-mining (I2E Chemistry – d2s is working in the background) or

- Within KNIME/ChemCC using annotate/molconvert

Replacement: <chemistry> vs IUPAC

IUPAC-enriched XML

Page 21: II-SDV 2015, 20 - 21 April, in Nice

OCR-errors in chemical names

Semi-automatic Patent Curation Workflow d) TIF/attachment replacement with IUPAC names

TIF

CDX

MOL

Replacement with the derived IUPAC name
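A minimal sketch of the replacement step, assuming the patent XML wraps each structure image in a `<chemistry>` element with an `<img file="…"/>` child. That element layout, the function name `enrich`, and the `source-file` attribute are assumptions made for illustration; the actual tag structure depends on the patent source.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def enrich(xml_text: str, names: Dict[str, str]) -> str:
    """Replace image references inside <chemistry> with derived IUPAC names."""
    root = ET.fromstring(xml_text)
    # Materialize the list first: the tree is modified inside the loop.
    for chem in list(root.iter("chemistry")):
        img = chem.find("img")
        file_name = img.get("file") if img is not None else None
        iupac = names.get(file_name) if file_name else None
        if iupac is None:
            continue  # no name derived: leave the image reference untouched
        for child in list(chem):
            chem.remove(child)
        chem.text = iupac
        # Hypothetical attribute that keeps the link back to the attachment.
        chem.set("source-file", file_name)
    return ET.tostring(root, encoding="unicode")


sample = '<p>Compound <chemistry id="c1"><img file="C00001.TIF"/></chemistry> was made.</p>'
print(enrich(sample, {"C00001.TIF": "benzoic acid"}))
```

Keeping the `<chemistry>` element and only swapping its content preserves the position of the structure in the text, which is what makes the IUPAC-enriched XML usable as a text-mining source.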

Page 22: II-SDV 2015, 20 - 21 April, in Nice

XPATH/XML parsing and extraction of:

Tables

Rows - XML tags & strings

Entries - XML tags & strings

Semi-automatic Patent Curation Workflow e) Bioactivity/tabular data extraction with KNIME/XPATH
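The table, row, and entry extraction can be illustrated outside KNIME with Python's standard library, which supports a subset of XPath. The sketch assumes CALS-style tags (`table`, `row`, `entry`); the function name `extract_tables` and the sample values are made up for illustration and are not taken from a real patent.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_tables(xml_text: str) -> List[List[List[str]]]:
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.findall(".//table"):
        rows = []
        for row in table.findall(".//row"):
            cells = []
            for entry in row.findall("entry"):
                # Join nested text (sub/superscripts etc.) and collapse spaces.
                cells.append(" ".join("".join(entry.itertext()).split()))
            rows.append(cells)
        tables.append(rows)
    return tables


sample = """<doc><table><tgroup><tbody>
<row><entry>Example</entry><entry>IC50 (nM)</entry></row>
<row><entry>1</entry><entry>12</entry></row>
</tbody></tgroup></table></doc>"""
print(extract_tables(sample))
```

Each row keeps its cell order, so a later step can pair the example number in one column with the bioactivity value in another.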

Page 23: II-SDV 2015, 20 - 21 April, in Nice

IUPAC-enriched XML as source for I2E API/textmining

indexing

pre-defined queries

results retrieval

saved as SDF files (KNIME)

Semi-automatic Patent Curation Workflow f) Text-/datamining with Linguamatics I2E via KNIME

Text-mining retrieved (chemistry-related) information

Example Nr.

Bioactivity data from tables

Claims, regions where chemistry appears in patents

Genes, diseases

Page 24: II-SDV 2015, 20 - 21 April, in Nice

Semi-automatic Patent Curation Workflow f) Bioactivity Data using I2E multi-queries – 2 steps

Source: (IUPAC-enriched) XML

1. Example Nr. – IUPAC

2. Example Nr. – Bioactivity data

[Figure: table and image from the XML source, linked by Example Nr., IUPAC and Bioactivity; for comparison, the same chemistry as it appears in the PDF]

Page 25: II-SDV 2015, 20 - 21 April, in Nice

Semi-automatic Patent Curation Workflow g) Visualize data-/textmining results in ChemCC

SDF file loaded into ChemCC project + automatic mapping to existing chemistry

Page 26: II-SDV 2015, 20 - 21 April, in Nice

Lessons learned, weak-points, limitations

1. Advantages of KNIME Full-Mode (MOL/CDX/TIF) vs ChemCC branch

chemistry check/normalization – 3 input sources → improved quality

improved chemistry recall - ALL images (incl. tables and drawings)

More filtering options in KNIME workflow vs ChemCurator only

IUPAC-enriched XML as new source for I2E

….

Page 27: II-SDV 2015, 20 - 21 April, in Nice

Lessons learned, weak-points, limitations

2. No full automation of the workflow due to lack of homogeneity in patent data (US vs WO, EP, etc.)

Missing attachment files

No tables present in XML

Error rate in chemistry recognition (OPSIN vs n2s/d2s)

NEEDS: different workflows/branches, patent-files clean-up (OCR)

3. Time- and computational-resource-consuming process

Page 28: II-SDV 2015, 20 - 21 April, in Nice

Outlook

1. KNIME Workflow

Add new data fields to Chemicals: BI-internal codes, genes, targets, etc.

Usage of ChemCC html output as source for textmining

Ontology mapping

Expand workflow by including other sources (internal PDF, literature full-text)

Use KNIME to interconnect with BI-internal workflows, DBs, etc.

chemistry-linked information in a patent-DB → improved (semantic) search

Page 29: II-SDV 2015, 20 - 21 April, in Nice

Outlook

2. ChemCurator

Improved n2s

New command-line functions

Complex-phrase requests from IFI server

Improved SDF import

Preprocessing wizards

Page 30: II-SDV 2015, 20 - 21 April, in Nice

Thank you!
