Dr. Matthias Negri, Scientific Information Center
Boehringer Ingelheim Pharma GmbH & Co. KG
Chemistry-Enriched Patent Curation: semi-automatic analysis and elaboration of patents
II-SDV Nice, 21 April 2015
Árpád Figyelmesi, ChemAxon
Jul 16, 2015
Content
1. Chemistry in patents
2. Why do we need a patent curation workflow?
3. Semi-automatic Patent Curation Workflow - Overview
4. Linked tools/technologies
5. ChemCurator (ChemCC)
6. Semi-automatic Patent Curation Workflow – Step by Step
7. Lessons learned, weak-points, limitations
8. Outlook
Negri Matthias, II-SDV 2015 2
Chemistry in patents
Chemistry appears in diverse forms in patents:
1. TEXT - IUPAC names, common names, etc.
2. IMAGES - embedded within or attached to the document
3. ATTACHMENTS (MOL/CDX)
4. TABLES
– as ONE-image file (tables with chemistry and bioactivity data)
– as chemistry-only image files embedded within table tags
5. Markush Structures/Formulas with R-groups
---------------------------------------------------------------------------------------
Currently NO commercial solution covers all these cases
Most of the cases are considered in the patent curation workflow
(Markush/R-group Formulas recognized and stored separately)
Why do we need a patent curation workflow?
Motivations:
1. Linked chemistry-retrieval from patents (+ chemistry as images)
2. IUPAC-enriched XML patent files as NEW source for text-mining
3. extraction of bioactivity data/targets/diseases/… in relation to chemistry
4. Similarity/Substructure frequency in compound sets of patents
5. …
Semi-automatic Patent Curation Workflow
Overview – current state
2 parallel branches
I2E API in KNIME – batch indexing, text-mining and (relational) data retrieval
KNIME branch: SLOWER & memory intensive, BUT higher quality, more control & IUPAC-enriched XML
ChemCC branch: FASTER, BUT less informative/flexible - ChemCC as the (near) future perspective
Linked tools/technologies
1. KNIME/XPATH
2. ChemAxon ChemCurator (ChemCC)
3. Other ChemAxon tools in KNIME nodes (document2structure/d2s, Naming,
Molconverter, Structure checker, Standardizer, …)
4. Text/data-mining – Linguamatics I2E (+I2E Chemistry)
5. Optical Structure Recognition – Keymodule CLiDE Batch
ChemCurator (ChemCC)
Computer-aided chemical data extraction
English, Chinese and Japanese N2S
Markush Editor
Structure Checker
Hit visualization
Third party OSR technologies
Árpád Figyelmesi, II-SDV 2015 8
ChemCurator (ChemCC)
Name to Structure
Support for many nomenclatures (common, drug names, …)
IUPAC names
Custom dictionaries
English (2008)
Chinese (2013)
Japanese (2014)
ChemCurator (ChemCC)
Compound Extraction View
Compound list
Project explorer
Annotated document
Selected structures
ChemCurator (ChemCC)
Markush Extraction View
Markush editor
Example structures
Annotated document
Project explorer
Selected structures
Structure checker
ChemCurator (ChemCC)
General Document Curation
Extract Markush Structures from patents
Extract specific structures
Journal articles
Company reports
Patent examples
Structure extraction wizards
Exclude fragments, chemical elements, etc.
ChemCurator (ChemCC)
Integration & Information Sharing
Other ChemAxon products:
Direct IJC schema connection
Project sharing function
Accessible from Plexus, IJC, etc.
Third party tools:
Standard file formats
Export functions
Easily processable projects
Semi-automatic Patent Curation Workflow a) input sources and b) bibliographic data
a) Input sources
files with patent-IDs list
XML collection
…
b) Retrieval of bibliographic information and attachment data
family ID, patent references, expiration date, etc.
Attachment files: MOL/CDX (US patents only), TIF files
…
Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering
1. ChemCurator branch
Data retrieval (XML, attachments) from the IFI Claims Direct BI-server
ChemCurator project creation/sharing/annotation → HTML output
Chemistry extraction (name2structure/document2structure) → SDF output
Generation of a pre-annotated patent set stored as ChemCC projects
Faster, but lower quality in the chemistry extraction process
2. KNIME branch
- Clean-up of OCR errors in KNIME → improved chemistry recognition
- MOL/CDX/TIF: standardizer, structure checker → filter formulas, solvents, R-groups
Higher quality and more control in chemistry extraction process
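The OCR clean-up step could look roughly like the sketch below. It is only an illustration of a dictionary-based correction: the slides do not show the actual KNIME nodes, and the dictionary entries and the function name clean_ocr_name are assumptions made for this example.

```python
import re

# Hypothetical correction dictionary (not taken from the slides):
# OCR-damaged fragment -> intended fragment of a chemical name.
OCR_FIXES = {
    "rnethyl": "methyl",  # "rn" misread for "m"
    "pheny1": "phenyl",   # digit 1 misread for letter l
    "ch1oro": "chloro",
    "f1uoro": "fluoro",
}

# One alternation over all damaged fragments, longest first, so that a longer
# entry wins over a shorter entry that is its prefix.
_FIX_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(OCR_FIXES, key=len, reverse=True))
)


def clean_ocr_name(name: str) -> str:
    """Collapse whitespace runs and apply the dictionary replacements."""
    collapsed = " ".join(name.split())
    return _FIX_PATTERN.sub(lambda match: OCR_FIXES[match.group(0)], collapsed)


print(clean_ocr_name("  4-ch1oro-N-rnethylaniline "))
```

A fixed dictionary only repairs known damage patterns. It is meant as a pre-filter before name-to-structure conversion, not as a replacement for checking the converted structures.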
2. KNIME branch
MOL → IUPAC
CDX → IUPAC
TIFF (via CLiDE) → IUPAC
Merging and comparison of the converted chemistry output of MOL/CDX/TIF – 2 “quality” checks
IUPAC
string length (different output order of chemicals in multi-molecule image/multiMOL files)
OCR-correction (“dictionary” based)
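The two checks can be sketched as follows, assuming the names derived from each source (MOL, CDX, TIF) for the same attachment are already available as lists. The function names and the result layout are hypothetical. The comparison ignores output order, because multi-molecule files may list the same chemicals in a different order per source.

```python
from collections import Counter
from typing import Dict, List


def normalize(name: str) -> str:
    """Remove all whitespace and lower-case the name before comparing."""
    return "".join(name.split()).lower()


def compare_sources(names_by_source: Dict[str, List[str]]) -> Dict[str, object]:
    """Compare the names derived from each source for one attachment.

    Check 1: the same multiset of normalized names, regardless of order.
    Check 2: the same total string length, a cheap order-independent test.
    """
    counts = {
        source: Counter(normalize(name) for name in names)
        for source, names in names_by_source.items()
    }
    lengths = {
        source: sum(len(name) * n for name, n in counter.items())
        for source, counter in counts.items()
    }
    if not counts:
        return {"same_names": True, "same_length": True, "lengths": lengths}
    reference = next(iter(counts.values()))
    return {
        "same_names": all(counter == reference for counter in counts.values()),
        "same_length": len(set(lengths.values())) == 1,
        "lengths": lengths,
    }


result = compare_sources({
    "MOL": ["Benzene", "Toluene"],
    "TIF": ["toluene", "benzene"],  # same chemicals, different output order
})
print(result["same_names"], result["same_length"])
```

The length check alone cannot detect a single substituted character, so in this sketch it is only a fast first filter next to the name comparison.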
2. KNIME - Chemistry “Normalization”
(within KNIME) set up a relation between each TIFF/attachment file
1. to (one or more) IUPAC name(s)
2. to a position/section in the text/document
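One possible way to hold this relation in memory is sketched below. The record layout, field names and file names are assumptions for illustration only; in the workflow itself the relation is set up inside KNIME tables.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AttachmentLink:
    """Links one TIFF/attachment file to its names and its place in the text."""
    file_name: str                 # hypothetical attachment file name
    section: str                   # e.g. "description" or "claims"
    position: int                  # order of the reference within the document
    iupac_names: List[str] = field(default_factory=list)  # one or more names


def index_by_file(links: List[AttachmentLink]) -> Dict[str, AttachmentLink]:
    """Look-up table from attachment file name to its link record."""
    return {link.file_name: link for link in links}


def files_in_section(links: List[AttachmentLink], section: str) -> List[str]:
    """Attachment files referenced in one section, in document order."""
    hits = [link for link in links if link.section == section]
    return [link.file_name for link in sorted(hits, key=lambda l: l.position)]


links = [
    AttachmentLink("C00002.TIF", "claims", 7, ["toluene"]),
    AttachmentLink("C00001.TIF", "description", 3, ["benzene", "phenol"]),
    AttachmentLink("C00003.TIF", "claims", 5, []),  # no name derived yet
]
print(files_in_section(links, "claims"))
print(index_by_file(links)["C00001.TIF"].iupac_names)
```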
Semi-automatic Patent Curation Workflow d) TIF/attachment replacement with IUPAC names
Merge IUPAC, Clean-Up IUPAC
If NO IUPAC: IMG-name is set
“Normalize” IUPAC names
Chemistry present as text is recognized and extracted either via
- Text-mining (I2E Chemistry, with d2s working behind the scenes) or
- Within KNIME/ChemCC using annotate/molconvert
Replacement: <chemistry> vs IUPAC
IUPAC-enriched XML
OCR-errors in chemical names
TIF, CDX, MOL → replacement with the derived IUPAC name
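A minimal sketch of the replacement step, using Python's standard xml.etree.ElementTree instead of the KNIME nodes used in the workflow. It assumes that each image reference is an img element with a file attribute inside a chemistry element. These element and attribute names, the file names and the function name are assumptions for this example, and real patent XML may differ.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def enrich_with_iupac(xml_text: str, names_by_file: Dict[str, str]) -> str:
    """Replace image references inside <chemistry> with the derived name."""
    root = ET.fromstring(xml_text)
    # Materialize the matches first: the tree is modified inside the loop.
    for chem in list(root.iter("chemistry")):
        img = chem.find("img")
        if img is None:
            continue
        name = names_by_file.get(img.get("file", ""))
        if name is None:
            continue  # no derived name: keep the image reference as it is
        for child in list(chem):
            chem.remove(child)
        chem.text = name  # attributes and trailing text of <chemistry> stay
    return ET.tostring(root, encoding="unicode")


source = (
    '<p>Compound <chemistry id="c1"><img file="a.TIF"/></chemistry>'
    " was prepared.</p>"
)
print(enrich_with_iupac(source, {"a.TIF": "benzene"}))
```

Keeping the chemistry element and replacing only its content leaves the position of the chemistry in the text intact, which is what later text-mining on the enriched XML relies on.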
Semi-automatic Patent Curation Workflow e) Bioactivity/tabular data extraction with KNIME/XPATH
XPATH/XML parsing and extraction of:
Tables
Rows - XML tags & strings
Entries - XML tags & strings
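The table extraction can be illustrated outside KNIME with the limited XPath support of Python's xml.etree.ElementTree. The sketch assumes table, row and entry elements. These tag names and the sample content are assumptions, and nested tables are not handled.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_tables(xml_text: str) -> List[List[List[str]]]:
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("table"):
        rows = []
        for row in table.iterfind(".//row"):
            # itertext() joins the text of nested tags (bold, subscripts, ...)
            cells = ["".join(entry.itertext()).strip() for entry in row.findall("entry")]
            rows.append(cells)
        tables.append(rows)
    return tables


sample = (
    "<doc><table><tgroup><tbody>"
    "<row><entry>Example 1</entry><entry><b>IC50</b> 12 nM</entry></row>"
    "<row><entry>Example 2</entry><entry>45 nM</entry></row>"
    "</tbody></tgroup></table></doc>"
)
for row in extract_tables(sample)[0]:
    print(row)
```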
Semi-automatic Patent Curation Workflow f) Text-/datamining with Linguamatics I2E via KNIME
IUPAC-enriched XML as source for I2E API/textmining
indexing
pre-defined queries
results retrieval
saved as SDF files (KNIME)
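For orientation, an SDF record is a molfile block followed by named data items and a "$$$$" terminator. The sketch below shows how retrieved results could be attached to a structure as data fields. It assumes the molfile block is produced elsewhere (for example by a structure conversion step), and the function name and field names are hypothetical.

```python
from typing import Dict


def sdf_record(molblock: str, fields: Dict[str, str]) -> str:
    """Build one SDF record: molfile block, data items, '$$$$' terminator."""
    lines = [molblock.rstrip("\n")]
    for key, value in fields.items():
        text = str(value)
        # A blank line ends a data item, so values must not contain one.
        if any(not part.strip() for part in text.split("\n")):
            raise ValueError(f"field {key!r} has an empty or blank line")
        lines.append(f"> <{key}>")
        lines.append(text)
        lines.append("")  # blank line closes the data item
    lines.append("$$$$")
    return "\n".join(lines) + "\n"
```

A call such as sdf_record(molblock, {"IUPAC": "benzene", "Example": "1"}) yields one record. Concatenating records gives a file that can carry the text-mining fields together with the chemistry.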
Text-mining retrieved (chemistry-related) information
Example Nr.
Bioactivity data from tables
Claims, regions where chemistry appears in patents
Genes, diseases
Semi-automatic Patent Curation Workflow f) Bioactivity Data using I2E multi-queries – 2 steps
Source: (IUPAC-enriched) XML
1. Example Nr. – IUPAC
2. Example Nr. – Bioactivity data
[Figure: table and image; for comparison, the chemistry as shown in the PDF]
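The two query steps share the example number, so their results can be combined on that key. The sketch below assumes each query returns (example number, value) pairs. This input shape, the function name and the sample values are assumptions for illustration and not the I2E result format.

```python
from typing import Dict, List, Tuple


def join_on_example(
    name_hits: List[Tuple[str, str]],
    activity_hits: List[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Combine step 1 (example -> IUPAC) with step 2 (example -> bioactivity)."""
    names = dict(name_hits)
    joined = []
    for example_nr, activity in activity_hits:
        if example_nr in names:  # skip activities without a linked name
            joined.append({
                "example": example_nr,
                "iupac": names[example_nr],
                "bioactivity": activity,
            })
    return joined


step1 = [("1", "benzene"), ("2", "toluene")]              # hypothetical hits
step2 = [("1", "IC50 = 12 nM"), ("3", "IC50 = 45 nM")]    # hypothetical hits
print(join_on_example(step1, step2))
```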
Semi-automatic Patent Curation Workflow g) Visualize data-/textmining results in ChemCC
SDF file loaded into ChemCC project + automatic mapping to existing chemistry
Fields shown: IUPAC, Bioactivity, Example Nr.
Lessons learned, weak-points, limitations
1. Advantages of KNIME Full-Mode (MOL/CDX/TIF) vs ChemCC branch
chemistry check/normalization – 3 input sources → improved quality
improved chemistry recall - ALL images (incl. tables and drawings)
more filtering options in the KNIME workflow than with ChemCurator only
IUPAC-enriched XML as new source for I2E
….
2. No full automation of the workflow due to lack of homogeneity in patent data (US vs WO, EP, etc.)
Missing attachment files
No tables present in XML
Error rate in chemistry recognition (OPSIN vs n2s/d2s)
…
NEEDS: different workflows/branches, patent-files clean-up (OCR)
3. Time- and computational-resource-consuming process
Outlook
1. KNIME Workflow
Add new data fields to chemicals: BI-internal codes, genes, targets, etc.
Use ChemCC HTML output as a source for text-mining
Ontology mapping
Expand the workflow by including other sources (internal PDFs, literature full-text)
Use KNIME to interconnect with BI-internal workflows, DBs, etc.
Chemistry-linked information in a patent DB → improved (semantic) search
2. ChemCurator
Improved n2s
New command-line functions
Complex-phrase requests from IFI server
Improved SDF import
Preprocessing wizards
Thank you!