Leveraging IP data for its scientific content

2018/04/03

• The future of IP in the era of machine learning / cognitive computing / AI

• Computer curation & finding dark data

Stephen Boyer

To be accomplished in the next hour

• the utility of IP data for advancing science

• emerging technologies [ machine learning ]

• the potential to be realized

• the challenges & opportunities

An appreciation of :

“There’s gold in ‘them’ documents”

What’s in them ?

What is it good for ?

What are people doing to mine the information ?

How are they using it ?

A bit of history

How we got to now !

What technologies do we have to work with ?

Evolving technologies relevant to patents

The Past (1990 - 2005) | Recent Past (2006 - 2009) | Present / Future (2010 - 2018)

• Easy Web Access
• Text Searching
• Keyword
• Boolean
• BRS (open text)
• Verity
• Lucene / Solr
• Image Downloads
• Tiff --> PDF
• Text Analytics
• IBM UIMA
• Natural Language Processing (NLP)
• Entity Identification
• Co-occurrence Analysis
• Visualization Tools
• Citation Mapping
• W3C Standards
• Federated Search
• Unique Entity IDs
• InChI
• GeneIDs
• other
• Data Availability
• Integration of open source
• Google Patents
• Contextual analysis
• Semantic search
• Network graphs
• Relationship detection
• Advanced grammar analysis
• Machine Learning
• Google Patents
• Neural Networks
• Image Analysis
• OSRA / Clide
• Automated analysis
• Machine translation

Evolving Analytics, Visualization & Knowledge

Availability of bulk machine-readable data

Understanding the content of the documents

Why bother ?

Making documents “machine-readable”
• Sections
• Tables
• Citations
• Data types
• Etc.

Understanding the format of the documents

3.1 million patent applications worldwide in 2016

Source = Francis Gurry, WIPO, Ambassador Briefing 2018

How many patent documents are there ?

Distribution of Global Patenting has Shifted in Recent Decades

Source = Francis Gurry, WIPO, Ambassador Briefing 2018

Machine learning to analyze & interpret the components of a document

Example: Work done by Peter Starr, IBM Zurich labs

Step 1: Making documents “machine-readable”

PDF parser PDF interpretation Semantic representation

PDF is the pervasive language of the enterprise

Step 1: Making documents “machine-readable”

Cross-mapping citation data between publications and patents

PatentsO-References

Publications
• PubMed ID #
• Author ORCIDs
• Impact factor
• Medical subject codes, ontologies, etc.

journal articles cited in patents

~6,000,000 / 40,000,000 patent citations map to journal articles other than patents

Citation mapping

In ~10,500 unique journals

~175,000,000 patent citations map to other patents
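The split between citations that point to other patents and citations that point elsewhere can be illustrated with a minimal Python sketch. The regular expression, the country-code list, and the function names are assumptions made for illustration only; they are not the method used to produce the counts on this slide, and real citation strings need far more robust parsing.

```python
import re

# Hypothetical pattern for a patent publication number, e.g.
# "US20150038506", "US 2009/0312310A1" or "US 7,123,456 B2".
PATENT_RE = re.compile(r"^\s*(US|EP|WO|JP|KR|CN)\s?[\d,/ ]{5,}\s?[A-Z]?\d?\s*$")


def classify_citation(citation: str) -> str:
    """Label a raw citation string as 'patent' or 'non-patent'."""
    return "patent" if PATENT_RE.match(citation) else "non-patent"


def tally(citations):
    """Count how many citations fall into each class."""
    counts = {"patent": 0, "non-patent": 0}
    for citation in citations:
        counts[classify_citation(citation)] += 1
    return counts


if __name__ == "__main__":
    examples = [
        "US20150038506",
        "US 2009/0312310A1",
        "Smith J. et al., J. Med. Chem. 2010, 53, 1234",  # made-up journal reference
    ]
    print(tally(examples))
```

A non-patent citation would still have to be resolved to a journal article (for example to a PubMed ID) before it could be linked to authors, impact factors, or subject codes.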




Funding
• NIH
• NSF
• EU

Using computers to understand

what’s in the documents

Annotating the documents: NLP, entity identification

Visualizing the content

A brief review of work done at IBM & by a host of others

Step 2 : Computer Curation of Content


Availability of bulk machine-readable data

Understanding the content of the documents

Why bother ?

Making documents machine-readable
• Sections
• Tables
• Citations
• Data types
• Etc.

Understanding the format of the documents

Analysis of the content
• NLP
• Entity identification
• Contextual analysis
• Table data
• Relationships
• Normalization

Early Text Mining Technologies

entity identification

a) (2R/4S)-4-[4-Amino-5-(4-benzyloxy-phenyl)pyrrolo[2,3-d]pyrimidin-7-yl]-2-hydroxymethyl-pyrrolidine-1-carboxylic acid tert-butyl ester prepared analogously to Example 18 starting from (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester 2-ethyl ester (Example 20a). 1 H-NMR (CDCl3, ppm): 8.52 (s, 1H), 7.52-7.32 (m, 7H), 7.1 (d, 2H), 6.95 (d, 1H), 5.50 (m, 1H), 5.13 (s, 2H), 4.62-4.42 (m, 2H), 4.28 (m, 2H), 4.10 (m, 1H), 3.95-3.70 (m, 1H), 2.75 (m, 1H), 2.50 (m, 1H), 1.49 (s, 9H).

b) (2R/4S)-{4-[4-Amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidin-2-yl}-methanol: 0.100 g of (2R/4S)4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester is dissolved in 4 ml of tetrahydrofuran; 10 ml of 4M hydrogen chloride in diethyl ether are added, and stirring is carried out for 1 hour at room temperature. The product is filtered off and dried under a high vacuum. The dihydrochloride of the title compound is obtained. 1 H-NMR (CD3 OD, ppm): 8.4 (s, 1H); 7.60 (s, 1H), 7.5-7.10 (m, 9H), 5.65 (m, 1H), 5.18 (s, 2H), 4.32 (m, 1H), 4.00-3.65 (m, 4H), 2.60 (m, 2H).

EXAMPLE 24

(2R/4S)-4-(4-Amino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester 0.130 g of (2R/4S)-4-(4-benzyloxycarbonylamino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester is dissolved in 8 ml of methanol, and the solution is hydrogenated over 0.030 g of palladium-on-carbon (10%) for 1 hour at normal pressure. The catalyst is removed by filtration, the filtrate is concentrated by

What is this compound ??

[Figure: 2D chemical structure drawing]

Paper Words → Chemical Names

Dictionary of the English language, minus the dictionary of desired entities
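One plausible reading of the dictionary idea above, sketched in Python: build a stop list from an English dictionary with the desired entities removed, then keep every word of the paper that is not on that stop list. The tiny word sets, the tokenizer, and the function names are hypothetical stand-ins; this is not necessarily how the original annotators were implemented.

```python
import re


def build_stop_list(english_words, desired_entities):
    """English dictionary minus the dictionary of desired entities."""
    return set(english_words) - set(desired_entities)


def candidate_entities(text, stop_list):
    """Return the words of a text that survive the stop list, in order, without repeats."""
    # Simplistic tokenizer (assumption): real chemical names also contain
    # brackets, commas and digits, which this pattern does not handle.
    tokens = re.findall(r"[a-z][a-z0-9\-]*", text.lower())
    seen, candidates = set(), []
    for token in tokens:
        if token not in stop_list and token not in seen:
            seen.add(token)
            candidates.append(token)
    return candidates


if __name__ == "__main__":
    english = {"the", "is", "dissolved", "in", "and", "was", "added", "lead"}
    desired = {"lead", "toluene", "tetrahydrofuran"}
    stop = build_stop_list(english, desired)
    text = "The lead is dissolved in tetrahydrofuran and toluene was added"
    print(candidate_entities(text, stop))
```

Removing the desired entities from the English dictionary first is what keeps ambiguous words such as "lead" from being discarded as ordinary English.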

Name = Structure
• Names: toluene, methyl benzene
• SMILES String: CC1=CC=CC=C1
• 2D Structure: [Figure: benzene ring with a CH3 substituent]
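The "Name = Structure" idea can be sketched as a lookup in which different names for the same compound resolve to one SMILES string. The two names and the SMILES string come from the slide; the dictionary, the normalization rule, and the function names are illustrative assumptions. Production systems rely on dedicated name-to-structure software rather than a hand-built table.

```python
# Synonym table taken from the slide: both names denote the same structure.
NAME_TO_SMILES = {
    "toluene": "CC1=CC=CC=C1",
    "methyl benzene": "CC1=CC=CC=C1",
}


def normalize(name: str) -> str:
    """Lower-case a name and collapse runs of whitespace (assumed normalization)."""
    return " ".join(name.lower().split())


def name_to_smiles(name: str):
    """Return the SMILES string for a known name, or None if the name is unknown."""
    return NAME_TO_SMILES.get(normalize(name))


if __name__ == "__main__":
    for name in ("Toluene", "methyl   benzene", "unobtainium"):
        print(name, "->", name_to_smiles(name))
```

Once names are mapped to a structure identifier, mentions of "toluene" and "methyl benzene" in different documents can be counted as the same entity.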

Computational Resources

Blue Gene – enabled

Summary of overall text analysis operations for chemistry HMM, CRF, CFG

3D structure compute

300 properties per molecule

Patent document’s chemical report molecular timeline & chemical name-to-structure mouse-over

• Chemical yields

• Quantities

• Physical attributes: melting points, boiling points

• Solvents and Temperatures

• Spectral Data

• NMR data
• IR data
• Mass Spec data
• Assay data

Text-mining technologies identify in-document properties

Source courtesy of Dr. Roger Sayles

Example of text mining from patent & scientific literature

Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK

> 175K Compound-value associations

Signals / Triggers for identifying specific entities

Step 7: To a solution of 4-(1-(4-cyanophenyl)-1H-benzo[d]imidazol-6-yl)benzoic acid 6 (300 mg, 0.88 mmol) in DMF (5 mL) was added NMM (0.178 mL, 1.76 mmol) followed by HATU (501 mg, 1.32 mmol) at rt and the solution was stirred for 30 min. Morpholine (0.84 mL, 0.968 mmol) was added to the reaction mixture and stirring was continued for 16 h. The reaction mixture was diluted with EtOAc and washed with water and brine solution. The organic layer was dried over anhydrous Na2SO4 and concentrated under reduced pressure to obtain crude product. The crude product was purified by preparative HPLC to afford 4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile (100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; 1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), 7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), 3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+.

Example of extracting NMR & MS data from US patents
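A minimal sketch of how the NMR peak list and the MS value in the passage above could be pulled out with regular expressions. The patterns, field names, and function names are assumptions written for this one example; they are not the extraction pipeline behind the slide, and real patent text varies far more than these patterns allow.

```python
import re

# Hypothetical pattern for one 1H NMR peak such as "7.72 (d, J=6.6 Hz, 3H)".
PEAK_RE = re.compile(
    r"(?P<shift>\d+\.\d+(?:-\d+\.\d+)?)\s*"        # shift or shift range in ppm
    r"\((?P<mult>[a-z]+),\s*"                       # multiplicity: s, d, t, m, dd ...
    r"(?:J\s*=\s*(?P<J>\d+(?:\.\d+)?)\s*Hz,\s*)?"   # optional coupling constant
    r"(?P<nH>\d+)\s*H\)"                            # number of protons
)
MS_RE = re.compile(r"m/z\s+(?P<mz>\d+(?:\.\d+)?)")

# Tail of the Step 7 paragraph quoted on the slide.
SAMPLE = (
    "1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), "
    "7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), "
    "3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+."
)


def extract_nmr_peaks(text):
    """Return one dict per NMR peak found in the text."""
    peaks = []
    for m in PEAK_RE.finditer(text):
        peaks.append({
            "shift_ppm": m.group("shift"),
            "multiplicity": m.group("mult"),
            "J_Hz": float(m.group("J")) if m.group("J") else None,
            "protons": int(m.group("nH")),
        })
    return peaks


def extract_ms(text):
    """Return the first m/z value in the text, or None."""
    m = MS_RE.search(text)
    return float(m.group("mz")) if m else None


if __name__ == "__main__":
    for peak in extract_nmr_peaks(SAMPLE):
        print(peak)
    print("m/z:", extract_ms(SAMPLE))
```

A useful sanity check on such output: the proton counts of the six extracted peaks add up to 20, which agrees with the H20 in the reported formula C25H20N4O2.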

What about BP?


1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)

Re-creating spectral data from text data

text input

spectral output
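A rough sketch of the text-to-spectrum idea: each reported peak becomes a Lorentzian line centered at its chemical shift and scaled by its proton count, and the lines are summed on a ppm grid. The line shape, line width, grid, and function names are assumptions; multiplet splitting and coupling constants are ignored, and the broad 7.18 to 7.94 ppm aromatic multiplet from the text is left out because a single center would misrepresent it. This is not the tool used to produce the slide.

```python
# (shift in ppm, number of protons) for the well-defined peaks quoted above.
PEAKS = [(2.57, 4), (4.24, 1), (4.35, 1), (4.47, 2), (4.57, 1), (6.95, 1)]


def lorentzian(x, center, width):
    """Lorentzian line with unit height at its center; width is full width at half maximum."""
    half = width / 2.0
    return half ** 2 / ((x - center) ** 2 + half ** 2)


def simulate_spectrum(peaks, ppm_min=0.0, ppm_max=10.0, step=0.01, width=0.02):
    """Return (grid, intensity) lists for a sum of Lorentzian lines."""
    n_points = int(round((ppm_max - ppm_min) / step)) + 1
    grid = [ppm_min + i * step for i in range(n_points)]
    intensity = [
        sum(n_h * lorentzian(x, shift, width) for shift, n_h in peaks)
        for x in grid
    ]
    return grid, intensity


if __name__ == "__main__":
    grid, intensity = simulate_spectrum(PEAKS)
    tallest = grid[intensity.index(max(intensity))]
    print(f"tallest line near {tallest:.2f} ppm")
```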


NMR data extracted by year of publication

[Figure: cumulative distinct NMR extracted (0 to 2,500,000) vs. year of publication (1976 to 2014); series: USPTO grants, USPTO applications]

Documenting the increase in data with time: USPTO NMR data from 1976-2014

Nucleus  Count
H  975543
C  56536
unknown  44306
F  9429
P  3241
B  91
Si  62
Sn  22
Se  11
N  8


Tracking technology improvement with time: example of NMR

[Figure: 1H-NMR frequency (0 MHz to 450 MHz) vs. year of patent filing (1976 to 2014)]

Extracting chemical reactions from text

Results from Drs. Roger Sayles & Daniel Lowe, NextMove


US20150038506

Reaction Extraction System


Making dark data useful. Example: extracting chemical reactions from text



US20150038506 Step 7: To a solution of 4-(1-(4-cyanophenyl)-1H-benzo[d]imidazol-6-yl)benzoic acid 6 (300 mg, 0.88 mmol) in DMF (5 mL) was added NMM (0.178 mL, 1.76 mmol) followed by HATU (501 mg, 1.32 mmol) at rt and the solution was stirred for 30 min. Morpholine (0.84 mL, 0.968 mmol) was added to the reaction mixture and stirring was continued for 16 h. The reaction mixture was diluted with EtOAc and washed with water and brine solution. The organic layer was dried over anhydrous Na2SO4 and concentrated under reduced pressure to obtain crude product. The crude product was purified by preparative HPLC to afford 4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile (100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; 1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), 7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), 3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+.


Making dark data useful

The Growing Number of Chemical Reactions Derived from the Patent Literature

https://bitbucket.org/dan2097/patent-reaction-extraction/downloads/

Millions of reaction SMILES made publicly available

thanks to Daniel Lowe & Roger Sayles
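For readers unfamiliar with reaction SMILES, a minimal sketch of how one string splits into its parts. A reaction SMILES has the form reactants>agents>products, with individual molecules separated by dots. The example reaction (acid-catalyzed esterification of acetic acid with ethanol) is a textbook illustration chosen here, not a record from the released dataset, and the function name is hypothetical; the layout of the downloadable files is not described on the slide and is not assumed here.

```python
def parse_reaction_smiles(reaction: str):
    """Split 'reactants>agents>products' into three lists of molecule SMILES."""
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators in a reaction SMILES")
    reactants, agents, products = (
        [smiles for smiles in part.split(".") if smiles] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


if __name__ == "__main__":
    # Illustrative only: acetic acid + ethanol, sulfuric acid as agent, ethyl acetate as product.
    example = "CC(=O)O.CCO>OS(=O)(=O)O>CCOC(C)=O"
    print(parse_reaction_smiles(example))
```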

[Figure: y-axis: # of chemical reactions]


Source = Roger Sayles & Daniel Lowe, Next Move

Computer curation: classifying patents from their technical content

What does this enable that could not be done before ?

Categorization of chemical reactions from patents


10 most frequent reactions

Classifying patents via their scientific content

https://bitbucket.org/dan2097/patent-reaction-extraction/downloads/

What does this enable that could not be done before ? Analyze scale-vs-yield

[Figure: % yield (axis ticks 20% to 80%) vs. mass of product [grams]]

Reactions of greatest interest for manufacturing: high yield, large scale

16,355 Suzuki coupling reactions extracted from 2001 - 2013 US applications
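A scale-vs-yield analysis needs two numbers per reaction: the mass of product and the percent yield. The sketch below pulls both from a phrase like the "(100 mg, 27%, ...)" in the Step 7 text shown earlier, then flags reactions that are both high-yield and large-scale. The regular expression, the unit table, the thresholds (100 g and 80%), and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for "(<mass> <unit>, <yield>%" as written after a product name.
PRODUCT_RE = re.compile(
    r"\((?P<mass>\d+(?:\.\d+)?)\s*(?P<unit>mg|kg|g),\s*(?P<pct>\d+(?:\.\d+)?)\s*%"
)
UNIT_TO_GRAMS = {"mg": 1e-3, "g": 1.0, "kg": 1e3}


def extract_scale_and_yield(text):
    """Return (mass in grams, yield in percent) or None if no match is found."""
    m = PRODUCT_RE.search(text)
    if m is None:
        return None
    mass_g = float(m.group("mass")) * UNIT_TO_GRAMS[m.group("unit")]
    return mass_g, float(m.group("pct"))


def is_manufacturing_candidate(mass_g, yield_pct, min_mass_g=100.0, min_yield_pct=80.0):
    """High yield AND large scale; both thresholds are arbitrary examples."""
    return mass_g >= min_mass_g and yield_pct >= min_yield_pct


if __name__ == "__main__":
    snippet = (
        "purified by preparative HPLC to afford 4-(6-(4-(morpholine-4-carbonyl)phenyl)"
        "-1H-benzo[d]imidazol-1-yl)benzonitrile (100 mg, 27%, AUC HPLC 95.9%) "
        "as an off-white solid"
    )
    mass_g, yield_pct = extract_scale_and_yield(snippet)
    print(mass_g, yield_pct, is_manufacturing_candidate(mass_g, yield_pct))
```

Applied to the Step 7 example, the reaction is small-scale (0.1 g) and low-yield (27%), so it would sit in the lower-left corner of a scale-vs-yield plot.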


What does this enable that could not be done before ? Analyzing frequency-vs-time

[Figure: Suzuki couplings as %age of reactions / year]


Relationships

Entity types identified that were associated with structures derived from patents

Source = Roger Sayles – NextMove

The Number of Biological Activities Derived from Patents vs the Scientific Literature

Source = Roger Sayles


Availability of bulk machine-readable data

Understanding the Documents

Understanding what’s in the documents

Why bother ?

The format of the document
• Sections
• Tables
• Citations
• Data types
• Etc…

Analysis of the content
• NLP
• Entity identification
• Contextual analysis
• Table data
• Relationships
• Normalization

Integration with other data
• Development of feature spaces
• Seeing the unobvious
• Learning
• Predicting

Patent data alone is insufficient

[Figure: linked data sources]
• PubChem: CID, SID, AID, InChIKey, CAS, Synonyms, PubMed, Patents
• NLM MeSH: Chemical (Synonyms, MeSH DUI); Disease (MeSH DUI)
• FDA SRS: Drug (FDA SPL, FDA NDC); Ingredient (UNII, InChIKey)
• NCBI: Protein, Gene, CDD, Taxonomy, PubMed, BioSystems
• NLM HSDB: Pharmacology, Toxicity, Metabolism, Properties, Manufacture
• VA NDF-RT
• NLM RxNorm
• FDA/NLM DailyMed
• NCI Metathesaurus
• Disease Ontology
• Protein Ontology
• Gene Ontology
• DrugBank: Drug (PubChem, ATC); Target (Uniprot, GeneCard)
• KEGG: Drug (PubChem, ATC); Target (Gene); Disease (OMIM)
• ChEMBL: Drug (ATC, ChEBI, TradeName); Compound (Pharmacology)
• ChEBI: Source (IntEnz, KEGG, PDBeChem, ChEMBL)
• IUPHAR-DB: Drug (Classification); Target (Nomenclature, Pharmacology)
• IBM: Patent, PubMed

Legend: Terminology/Ontology; Public Database; Database + Terminology

Integration with Open Source Data

Drs. Evan Bolton & Gang Fu, NIH

NIH PubChem RDF – Triple & Entity Counts

https://pubchem.ncbi.nlm.nih.gov/rdf/ Drs. Evan Bolton & Gang Fu, NIH


What are “Cognitive Technologies” ?

“Big Data, (Machine Learning, Neural Networks, Cognitive Computing, AI) is like teenage sex:

Everyone talks about it, nobody really knows how to do it,

Everyone thinks everyone else is doing it.

So everyone claims they are doing it….”

Source: Dan Ariely , Duke University

Machine Learning, Neural Networks, Cognitive Computing, AI

Google “A mostly complete chart of Neural Networks”

Making IP Data Accessible and Useful: What Google is doing

Accessible Information: Usefulness starts with access to the information.

Transformation: Apply business logic, human curation, and/or machine learning

Useful Information: Solving user problems

The critical first step in making patent information useful is open access to machine-readable bulk data

Slide courtesy of Ian Wetherbee, Google


• Machine Classification
• Document Similarity
• Machine Translation
• ….

http://media.epo.org/play/gsgoogle2017


Google’s machine translation

[Figure: translation quality (scale 0 to 6, where 6 = perfect translation) by translation model: human, neural (GNMT), phrase-based (PBMT); language pairs: English>Spanish, English>French, English>Chinese, Spanish>English, French>English, Chinese>English]

Slide courtesy of Ian Wetherbee, Google

47,710,923 patents full-text translated

Accessible Information: Usefulness starts with access to the information.

The advantages of making patents accessible & useful

Slide courtesy of Ian Wetherbee, Google

Enables the private sector to transform and improve information, benefitting the patent system

Improves the transparency into patent quality and the patent system

Improves transparency into legal rights

Empowers the public to obtain the full benefits of the disclosure

“Open machine-readable data is the critical first step in making patent information useful” *

An Example

Finding compounds that might fight cancer

What are people doing with this data ?

Pharma asks

1. What genes regulate xyz condition ?
2. What compounds regulate those xyz genes ?

An approach to answering these questions : chemical ontologies

Other approaches include
• Computational chemical modeling
• Similarity Ensemble Approach (SEA)
• Literature-based discovery
• Experimental high-throughput screening

Chemical Ontologies

But first some chemistry

Work done in collaboration with:

University of Alberta: Prof David Wishart & Yannick Feunang

Ontochem: Prof Lutz Weber

Physical • Examples: Molecular Weight, Melting point, Boiling Point

Molecular• Examples: Steroid, Prostaglandin, Amino Acid, Alkene, Imidazole

Functional • Examples: Anti-Inflammatory, Explosive, Refrigerant, Pesticide

Legal attributes • Patented for a purpose

Molecules have different types of attributes

Example of a chemical ontology

Consider this molecule

Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

• Carboxylic acid
• Benzoic acid
• Phenol
• Hydroxy group
• Azo
• Pyridine
• Sulfone
• Sulfonamide
• Azobenzene
• Benzene

Molecular attributes (labels)
• Is a Benzoic acid
• Is a Carboxylic acid
• Is a Carbonyl cpd
• Is a Phenol
• Is an Azobenzene
• Is an Azo compound
• Is a Sulfone
• Is a Sulfonamide
• Is a Pyridine
• Is a Benzene
• Is a Hydroxy

Functional attributes
• Is used for the treatment of Crohn's disease
• Is used for the treatment of rheumatoid arthritis
• Is used for the treatment of ulcerative colitis

[H][C@@]12C[C@@]3([H])[C@]4([H])C[C@]([H])(F)C5=CC(=O)C=C[C@]5(C)[C@@]4(F)[C@@H](O)C[C@]3(C)[C@@]1(OC(C)(C)O2)C(=O)COC(C)=O

SMILES String

ClassyFire OntoChem

ClassyFire: Halogenated steroids (6); Fluorohydrins (7); Halohydrins (7); 1,3-dioxolanes (9); 11-beta-hydroxysteroids (9); Dioxolanes (9); 3-oxo delta-1,4-steroids (10); Alpha-acyloxy ketones (10); Delta-1,4-steroids (10); 11-hydroxysteroids (12); Gluco/mineralocorticoids, progestogins and derivatives (13); Pregnane steroids (13); 20-oxosteroids (15); Acetate salts (22); 3-oxosteroids (26); Oxosteroids (27); Carboxylic acid salts (30); Hydroxysteroids (32); Cyclic ketones (45); Alpha amino acid amides (73); Pyrrolidines (80); D-alpha-amino acids (85); Cyclic ketones (45); Acetals (50); Steroids and steroid derivatives (51); Alkyl fluorides (53); Alkyl halides (67); Cyclic alcohols and derivatives (86); Ketones (101); Organofluorides (128); Carboxylic acid esters (139); Secondary alcohols (187); Oxacyclic compounds (192); Lipids and lipid-like molecules (209); Organohalogen compounds (272); Ethers (393); Alcohols and polyols (395); Carboxylic acid derivatives (423); Carboxylic acids and derivatives (548); Carbonyl compounds (598); Organic acids and derivatives (633); Organoheterocyclic compounds (651); Organooxygen compounds (856); Organic compounds (978); Chemical entities (989); Hydrocarbon derivatives (995);

OntoChem: 17-deoxy-prednisolones (6); halohydrins (6); prednisolones (6); ethanoic acid esters (20); methyl esters (20); acetals (37); alkyl fluorides (56); cyclic ketones (61); natural product derivatives (92); fluorine compounds (126); alkene derivatives (172); polycyclic compounds (184); oxacyclic compounds (190); secondary alcohols (202); carboxylic acids (249); formic acid derivatives (559); lipophilic molecules (642); lipinski molecules (785); bioavailable molecules (867); oxygen compounds (891); small molecules (949); carbon compounds (974); hetero compounds (978);

Generating molecular attributes via SMILES

Obtain a database of chemical compounds & their SAR

1) ChEMBL dB of 1.4 M compounds AND their bioactivity towards targets (this database was provided by EBI)

2) SMILES strings are passed to the UOA ClassyFire (CF) SW and the OntoChem (OC) SW, producing CF labels and OC labels (this processing was provided by U of Alberta and by Ontochem)

3) ChEMBL dB of 1.4 M compounds AND their bioactivity towards targets, including CF + OC chemical labels (this processing was provided by IBM Research)

We call this the CHEMBL ontology dB

MDM2 Raw output: out of 1.4 M molecules, ~558 had activity towards MDM2 but only 27 had activity less than 30 nM

Scoring of molecular labels for MDM2-produced training set of 27 compounds [label cutoff = 20, activity cutoff = 30, corpus count cutoff = 200K]

Score = (observed count - expected count)^2 / expected count
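The scoring formula is the familiar chi-square term, and a small Python sketch makes it concrete. The corpus size (1.4 M) and training-set size (27) come from the slides; the way the expected count is derived (training-set size times the label's frequency in the whole corpus) is an assumption, as are the example label counts and the function names.

```python
def expected_count(label_corpus_count, corpus_size, training_size):
    """Expected number of training compounds carrying a label if the label were
    distributed as in the whole corpus (assumed definition)."""
    return training_size * label_corpus_count / corpus_size


def label_score(observed, expected):
    """Score = (observed count - expected count)^2 / expected count."""
    return (observed - expected) ** 2 / expected


if __name__ == "__main__":
    corpus_size = 1_400_000      # compounds in the ChEMBL-derived database (from the slide)
    training_size = 27           # MDM2 actives below the activity cutoff (from the slide)
    label_corpus_count = 140_000 # hypothetical: compounds in the corpus carrying some label
    observed = 20                # hypothetical: training compounds carrying that label

    expected = expected_count(label_corpus_count, corpus_size, training_size)
    print(f"expected = {expected:.2f}, score = {label_score(observed, expected):.1f}")
```

A label that appears in the training set about as often as chance predicts scores near zero, while a label that is strongly enriched among the actives scores high.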

MDM2 Raw output

Classyfire (CF) OntoChem (OC)

Comparison of the 100 compounds identified by CF with the 100 compounds identified by OC for MDM2 with label cut off = 10 labels & assay minimum = 30 & corpus count cut off = 300K

57 of the predicted compounds are in common

Overlap based on ChEMBL IDs of predicted compounds

A sample from the 57 compounds identified to have potential MDM2 Activity by both CF & OC

US 2009/0312310A1: this [240 page] patent application had 26 compounds with reported assay data for MDM2

IC50 value for Mdm2/P53 binding was calculated by sigmoid fitting using Prism (GraphPad Software). The results are shown below.

Example 18, Example 39, Example 93, Example 97, Example 111, Example 155, Example 126, Example 180, Example 220

Compound Attributes

[Figure: feature-vector table. Rows: compound 1, compound 2, compound 3, … Columns: attributes A, B, C, D, E, …, Y, Z. Each row of counts (e.g. 1 2 0 0 4 5 2 0 7 3 0 1 1 2) is a Feature Vector]
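A minimal sketch of how label lists could be turned into the kind of feature vectors pictured above: one column per label in a shared vocabulary, one row of counts per compound. The three compounds and their label assignments are made up for illustration (the label names are borrowed from the ontology example earlier in the deck), and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical label assignments; a repeated label stands for a repeated substructure.
COMPOUNDS = {
    "compound 1": ["Benzoic acid", "Phenol", "Azo compound", "Sulfonamide", "Pyridine"],
    "compound 2": ["Phenol", "Benzene", "Benzene"],
    "compound 3": ["Pyridine", "Sulfone"],
}


def build_vocabulary(labelled_compounds):
    """Sorted list of every label seen in any compound (the columns A, B, C ...)."""
    return sorted({label for labels in labelled_compounds.values() for label in labels})


def feature_vector(labels, vocabulary):
    """Count of each vocabulary label for one compound (one row of the table)."""
    counts = Counter(labels)
    return [counts[label] for label in vocabulary]


if __name__ == "__main__":
    vocabulary = build_vocabulary(COMPOUNDS)
    print(vocabulary)
    for name, labels in COMPOUNDS.items():
        print(name, feature_vector(labels, vocabulary))
```

The same construction applies to targets: replace compound labels with target attributes and each target gets its own vector.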

[Figure: per-compound attribute table for compound 1, compound 2, compound 3. Column groups: Structure (Mol File / SMILES); Physical Related Attributes (Pka, Log P, …); Functional Attributes as Target-Assay Pairs (EC50, Ki, …) across Primary Assays and Secondary Assays for targets such as MDM2, JAK3, SGLT2; Anti-target-Assay Pair; Other Attributes (LD50)]

Target Attributes

[Figure: feature-vector table. Rows: Target 1, Target 2, Target 3, … Columns: attributes A, B, C, D, E, …, Y, Z. Each row of counts (e.g. 1 2 0 0 4 5 2 0 7 3 0 1 1 2) is a Feature Vector]

Goal oriented learning

Cost / reward

Act

Predict an action which will reduce cost and/or increase reward

BIG ISSUES

1) Obfuscation

2) Access to & integration of worldwide data
• Open access to bulk machine-readable data

3) Incentives & quotas

4) Algorithms and Bias

WHAT IS THIS?

= a soccer ball

= a spherical recreational device

BIG ISSUES OBFUSCATION

Source = Dr. George Papadatos, EMBL-EBI

European Molecular Biology Laboratory EMBL & EBI

Markush structures are daunting and the situation is getting worse

BIG ISSUES

Access to and integration of WW Data

• Chinese, Japanese and Korean (CJK) patents now account for over half of all national patent filings and hence are of increasing importance to patent informatics.

To demonstrate the importance of this …

• 1,740,040 distinct compounds were extracted from ~63,000 Korean patent applications - spanning from 1990 to March 2015

• Of these, ~230,770 compounds were novel to Korean patents when compared to compounds derived from US data (spanning from 1976 to March 2015)

• In the period 2006-2014, 46% of compounds appeared in a KIPO filing before a USPTO filing.
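The figures above can be put in proportion with a little arithmetic, and the "novel to Korean patents" comparison is at heart a set difference between two collections of compound identifiers. In the sketch below the three totals come from the slide; the identifier strings and the function name are hypothetical placeholders.

```python
KOREAN_COMPOUNDS = 1_740_040   # distinct compounds from Korean applications (from the slide)
KOREAN_APPLICATIONS = 63_000   # approximate number of Korean applications (from the slide)
NOVEL_TO_KOREAN = 230_770      # compounds not found in the US-derived set (from the slide)


def novel_compounds(korean_ids, us_ids):
    """Identifiers present in the Korean set but absent from the US set."""
    return set(korean_ids) - set(us_ids)


novel_share_pct = round(100 * NOVEL_TO_KOREAN / KOREAN_COMPOUNDS, 1)
compounds_per_application = round(KOREAN_COMPOUNDS / KOREAN_APPLICATIONS, 1)

if __name__ == "__main__":
    # Placeholder identifiers; a real comparison would use a structure key such as InChIKey.
    korean = {"KEY-A", "KEY-B", "KEY-C"}
    us = {"KEY-B", "KEY-D"}
    print(sorted(novel_compounds(korean, us)))
    print(f"{novel_share_pct}% of the Korean-derived compounds were novel")
    print(f"about {compounds_per_application} distinct compounds per application")
```

So roughly one compound in eight extracted from the Korean applications was absent from the US-derived data.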

The Importance of Foreign Patent Filings

Notes from Drs. D. Lowe & R. Sayles

An Example of extracting chemical entities from CJK patents

Notes from Drs. D. Lowe & R. Sayles

Chemicals from Chinese Patents: attempts to process Chinese patent documents

Extracting chemical structures from Chinese patents…

Work done in collaboration with Dr R Sayles

Final Thoughts

Thanks to

• everyone in this room

• the scientific community

• especially those whose data was presented

• society in general

for providing us with these important

“Adjacent Possibilities”

Final thoughts

Source – J Kreulen

IBM Almaden Research Center, San Jose, California