Leveraging IP data for its scientific content
• The future of IP in the era of machine learning / cognitive computing / AI
• Computer curation & finding dark data
Stephen Boyer, Ph.D. [email protected] [email protected]
To be accomplished in the next hour
• the utility of IP data for advancing science
• emerging technologies [ machine learning ]
• the potential to be realized
• the challenges & opportunities
An appreciation of :
“There’s gold in ‘them’ documents”
What’s in them ?
What is it good for ?
What are people doing to mine the information ?
How are they using it ?
A bit of history
How we got to now !
What technologies do we have to work with ?
Evolving technologies relevant to patents
The Past: 1990 - 2005
Recent Past: 2006 - 2009
Present / Future: 2010 - 2018
• Easy Web Access
• Text Searching
• Keyword
• Boolean
• BRS (open text)
• Verity
• Lucene / Solr
• Image Downloads
• Tiff --> PDF
• Text Analytics
• IBM UIMA
• Natural Language Processing
• NLP
• Entity Identification
• Co-occurrence Analysis
• Visualization Tools
• Citation Mapping
• W3C Standards
• Federated Search
• Unique Entity ID’s
• InChI
• GeneID’s
• other
• Data Availability
• Integration of open source
• Google Patents
• Contextual analysis
• Semantic search
• Network graphs
• Relationship detection
• Advanced grammar analysis
• Machine Learning
• Google Patents
• Neural Networks
• Image Analysis
• OSRA / Clide
• Automated analysis
• Machine translation
Evolving Analytics, Visualization & Knowledge
Availability of bulk machine-readable data
Understanding the content of the documents
Why bother ?
Making documents “machine-readable”
Understanding the format of the documents
• Sections
• Tables
• Citations
• Data types
• Etc.
3.1 million patent applications worldwide in 2016
Source = Francis Gurry, WIPO, Ambassador Briefing 2018
How many patent documents are there ?
Distribution of Global Patenting has Shifted in Recent Decades
Source = Francis Gurry, WIPO, Ambassador Briefing 2018
Machine learning to analyze & interpret the components of a document
Example: Work done by Peter Staar, IBM Zurich labs
Step 1: Making documents “machine-readable”
PDF parser --> PDF interpretation --> Semantic representation
PDF is the pervasive language of the enterprise
Step 1: Making documents “machine-readable”
Cross-mapping citation data between publications and patents
Patents
• References
Publications
• PubMed ID #
• Author ORCIDs
• Impact factor
• Medical subject codes, ontologies, etc.
Citation mapping
• Journal articles cited in patents: ~6,000,000 / 40,000,000 patent citations map to journal articles other than patents
• In ~10,500 unique journals
• ~175,000,000 patent citations map to other patents
Cross-mapping citation data between publications and patents
Funding
• NIH
• NSF
• EU
Using computers to understand
what’s in the documents
Annotating the documents – NLP
Entity identification
Visualizing the content
A brief review of work done at IBM & by a host of others
Step 2 : Computer Curation of Content
Evolving Analytics, Visualization & Knowledge
Availability of bulk machine-readable data
Understanding the content of the documents
Why bother ?
Making documents machine-readable
Understanding the format of the documents
• Sections
• Tables
• Citations
• Data types
• Etc.
Analysis of the content
• NLP
• Entity identification
• Contextual Analysis
• Table data
• Relationships
• Normalization
Early Text Mining Technologies
entity identification
a) (2P/4S)-4-[4-Amino-5-(4-benzyloxy-phenyl)pyrrolo[2,3-d]pyrimidin-7-yl]-2-hydroxymethyl-pyrrolidine-1-carboxylic acid tert-butyl ester prepared analogously to Example 18 starting from (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester 2-ethyl ester (Example 20a). 1 H-NMR (CDCl3, ppm): 8.52 (s, 1H), 7.52-7.32 (m, 7H), 7.1 (d, 2H), 6.95 (d,1 H), 5.50 (m, 1H), 5.13 (s, 2H), 4.62-4.42 (m, 2H), 4.28 (m, 2H), 4.10 (m, 1H), 3.95-3.70 (m, 1H), 2.75 (m, 1H), 2.50 (m, 1H), 1.49 (s, 9H).
b) (2R/4S)-{4-[4-Amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidin-2-yl}-methanol: 0.100 g of (2R/4S)4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester is dissolved in 4 ml of tetrahydrofuran; 10 ml of 4M hydrogen chloride in diethyl ether are added, and stirring is carried out for 1 hour at room temperature. The product is filtered off and dried under a high vacuum. The dihydrochloride of the title compound is obtained. 1 H-NMR (CD3 OD, ppm): 8.4 (s, 1H); 7.60 (s, 1H), 7.5-7.10 (m, 9H), 5.65 (m, 1H), 5.18 (s, 2H), 4.32 (m, 1H), 4.00-3.65 (m, 4H), 2.60 (m, 2H).
EXAMPLE 24
(2R/4S)-4-(4-Amino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester 0.130 g of (2R/4S)-4-(4-benzyloxycarbonylamino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester is dissolved in 8 ml of methanol, and the solution is hydrogenated over 0.030 g of palladium-on-carbon (10%) for 1 hour at normal pressure. The catalyst is removed by filtration, the filtrate is concentrated by
What is this compound ?
[Figure: 2D chemical structure drawing of the compound]
Paper words --> chemical names
Dictionary of the English language – minus – the dictionary of desired entities
Name = Structure
• Name: toluene (methyl benzene)
• SMILES string: CC1=CC=CC=C1
• 2D structure: [Figure: benzene ring carrying a CH3 group]
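One reading of the "dictionary minus dictionary" idea can be sketched in a few lines. This is a minimal sketch under stated assumptions: the word list, the entity list and the name-to-SMILES table below are tiny hand-made stand-ins (hypothetical), not the resources used in the original work. Only the toluene SMILES comes from the slide.

```python
import re

# Stand-in for "the dictionary of the English language" (assumption).
ENGLISH_WORDS = {
    "the", "product", "was", "dissolved", "in", "and", "washed",
    "with", "water", "toluene",
}

# Stand-in for "the dictionary of desired entities" (assumption).
DESIRED_ENTITIES = {"toluene", "water"}

# English minus desired entities: the words that may be safely discarded.
DISCARD = ENGLISH_WORDS - DESIRED_ENTITIES

# Stand-in for a name=structure resource (assumption).
NAME_TO_SMILES = {
    "toluene": "CC1=CC=CC=C1",
    "methyl benzene": "CC1=CC=CC=C1",
    "water": "O",
}

def candidate_entities(text):
    """Return the words left over once the discard list is subtracted."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9\-]*", text)
    return [t for t in tokens if t.lower() not in DISCARD]

def resolve(name):
    """Map a name to a SMILES string, or None if the name is unknown."""
    return NAME_TO_SMILES.get(name.lower())

sentence = "The product was dissolved in toluene and washed with water"
candidates = candidate_entities(sentence)
structures = {name: resolve(name) for name in candidates}
```

Subtracting the desired entities from the English dictionary first is what keeps everyday words that are also chemicals, such as "water", from being thrown away.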
Computational Resources
Blue Gene – enabled
Summary of overall text analysis operations for chemistry: HMM, CRF, CFG
3D structure compute
300 properties per molecule
Patent document’s chemical report molecular timeline & chemical name-to-structure mouse-over
Text-mining technologies identify in-document properties
• Chemical yields
• Quantities
• Physical attributes: melting points, boiling points
• Solvents and temperatures
• Spectral data
• NMR data
• IR data
• Mass spec data
• Assay data
Source courtesy of Dr. Roger Sayles
Example of text mining from patent & scientific literature
Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
> 175K Compound-value associations
Signals / Triggers for identifying specific entities
Step 7: To a solution of 4-(1-(4-cyanophenyl)-1H-benzo[d]imidazol-6-yl)benzoic acid 6 (300 mg, 0.88 mmol) in DMF (5 mL) was added NMM (0.178 mL, 1.76 mmol) followed by HATU (501 mg, 1.32 mmol) at rt and the solution was stirred for 30 min. Morpholine (0.84 mL, 0.968 mmol) was added to the reaction mixture and stirring was continued for 16 h. The reaction mixture was diluted with EtOAc and washed with water and brine solution. The organic layer was dried over anhydrous Na2SO4 and concentrated under reduced pressure to obtain crude product. The crude product was purified by preparative HPLC to afford 4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile (100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; 1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), 7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), 3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+.
Example of extracting NMR & MS data from US patents
What about BP?
Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
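The "signals / triggers" idea can be illustrated with plain regular expressions run over the tail of the Step 7 paragraph above. This is a minimal sketch, not the extraction system used by the groups credited on these slides: the pattern names and the patterns themselves are assumptions written for this one paragraph.

```python
import re

# Tail of the Step 7 paragraph from US20150038506, as quoted on the slide.
TEXT = (
    "The crude product was purified by preparative HPLC to afford "
    "4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile "
    "(100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; "
    "1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), "
    "7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), "
    "3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+."
)

# Each pattern hangs off a textual trigger: "m.p.", "NMR", "m/z", "mg, NN%".
PATTERNS = {
    "melting_point_c": r"m\.p\.\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*°\s*C",
    "nmr_conditions": r"1H NMR \((\d+)\s*MHz,\s*([A-Za-z0-9]+)\)",
    "ms_mz": r"m/z\s*(\d+(?:\.\d+)?)",
    "mass_mg_and_yield_pct": r"\((\d+(?:\.\d+)?)\s*mg,\s*(\d+(?:\.\d+)?)%",
}

def extract_properties(text):
    """Return the captured groups of the first match for each trigger."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            found[name] = match.groups()
    return found

properties = extract_properties(TEXT)
```

A boiling point ("What about BP?") would need its own trigger; the paragraph quoted here does not report one, so the sketch has nothing to find.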
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
Re-creating spectral data from text data
text input --> spectral output
Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
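Going from the text line above to a drawable spectrum can be sketched as follows. This is a rough illustration under loud assumptions: each reported signal becomes a single Lorentzian line at its chemical shift (a range is placed at its midpoint), scaled by the number of protons, and J-coupling splitting is ignored. The function names and the line width are hypothetical and are not taken from the work shown on the slide.

```python
import re

NMR_TEXT = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)

# shift or shift range, then "(multiplicity, nH"
PEAK = re.compile(
    r"(\d+\.\d+)(?:\s*[–-]\s*(\d+\.\d+))?\s*\(\s*([a-z]+)\s*,\s*(\d+)H"
)

def parse_peaks(text):
    """Turn the reported signals into dicts of shift, multiplicity, protons."""
    peaks = []
    for low, high, mult, n_h in PEAK.findall(text):
        centre = (float(low) + float(high)) / 2 if high else float(low)
        peaks.append({"ppm": centre, "multiplicity": mult, "protons": int(n_h)})
    return peaks

def simulate(peaks, start=0.0, stop=10.0, step=0.01, width=0.02):
    """Sum one Lorentzian per signal on a regular ppm axis."""
    n_points = int(round((stop - start) / step)) + 1
    axis = [start + i * step for i in range(n_points)]
    signal = [
        sum(p["protons"] * width**2 / ((x - p["ppm"]) ** 2 + width**2)
            for p in peaks)
        for x in axis
    ]
    return axis, signal

peaks = parse_peaks(NMR_TEXT)
axis, signal = simulate(peaks)
```

The `axis` and `signal` lists can be handed to any plotting tool; by convention the ppm axis of an NMR spectrum is drawn decreasing from left to right.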
NMR data extracted by year of publication
[Figure: cumulative distinct NMR extracted (0 to 2,500,000) vs. year of publication (1976 to 2014); series: USPTO grants, USPTO applications]
Documenting the increase in data with time
From 1976-2014 USPTO NMR data
Nucleus    Count
H          975543
C          56536
unknown    44306
F          9429
P          3241
B          91
Si         62
Sn         22
Se         11
N          8
Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
Tracking technology improvement with time: example of NMR
[Figure: 1H-NMR frequency (0 MHz to 450 MHz) vs. year of patent filing (1976 to 2014)]
Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
Extracting chemical reactions from text
Results from Drs. Roger Sayles & Daniel Lowe, NextMove
Step 7: To a solution of 4-(1-(4-cyanophenyl)-1H-benzo[d]imidazol-6-yl)benzoic acid 6 (300 mg, 0.88 mmol) in DMF (5 mL) was added NMM (0.178 mL, 1.76 mmol) followed by HATU (501 mg, 1.32 mmol) at rt and the solution was stirred for 30 min. Morpholine (0.84 mL, 0.968 mmol) was added to the reaction mixture and stirring was continued for 16 h. The reaction mixture was diluted with EtOAc and washed with water and brine solution. The organic layer was dried over anhydrous Na2SO4 and concentrated under reduced pressure to obtain crude product. The crude product was purified by preparative HPLC to afford 4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile (100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; 1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), 7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), 3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+.
US20150038506
Reaction Extraction System
Making dark data useful. Example: extracting chemical reactions from text
Results from Drs. Roger Sayles & Daniel Lowe, NextMove
The Growing Number of Chemical Reactions Derived from the Patent Literature
https://bitbucket.org/dan2097/patent-reaction-extraction/downloads/
Millions of reaction SMILES made publicly available
thanks to Daniel Lowe & Roger Sayles
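For readers who have not met reaction SMILES: the general convention is three fields separated by ">" (reactants, agents, products), with "." separating the molecules inside a field. A minimal sketch of splitting one such string follows. The example reaction is a textbook esterification chosen for illustration, not a record from the dataset, and real records in the download may carry extra columns or annotations that this sketch ignores.

```python
def parse_reaction_smiles(rxn):
    """Split 'reactants>agents>products' into three lists of SMILES."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    reactants, agents, products = (
        [smiles for smiles in part.split(".") if smiles] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}

# Illustrative only: acetic acid + ethanol, sulfuric acid as agent,
# giving ethyl acetate.
example = "CC(=O)O.CCO>OS(=O)(=O)O>CC(=O)OCC"
parsed = parse_reaction_smiles(example)

# The agents field may be empty.
no_agents = parse_reaction_smiles("CC(=O)O.CCO>>CC(=O)OCC")
```

Splitting on "." treats the ions of a salt as separate components, which is one reason a cheminformatics toolkit is the better choice for real work.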
[Figure: # of chemical reactions vs. year of patent filing]
Source = Roger Sayles & Daniel Lowe, Next Move
Computer curation: classifying patents from their technical content
What does this enable that could not be done before ?
Categorization of chemical reactions from patents
Results from Drs. Roger Sayles & Daniel Lowe, NextMove
10 most frequent reactions
Classifying patents via their scientific content
https://bitbucket.org/dan2097/patent-reaction-extraction/downloads/
What does this enable that could not be done before ? Analyze scale-vs-yield
16,355 Suzuki coupling reactions extracted from 2001 – 2013 US Applications
[Figure: % yield (ticks at 20%, 40%, 60%, 80%) vs. mass of product [grams]]
Reactions of greatest interest for manufacturing: high yield, large scale
Results from Drs. Roger Sayles & Daniel Lowe, NextMove
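Once yield and product mass have been extracted for each reaction, picking out the "high yield, large scale" corner of the plot is a simple filter. A minimal sketch follows; the record type, the thresholds and the four records are all made up for illustration and are not data from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractedReaction:
    source: str            # document identifier (placeholder values below)
    product_mass_g: float  # the "Mass of Product [grams]" axis
    yield_pct: float       # the "% Yield" axis

def manufacturing_candidates(reactions, min_yield_pct=80.0, min_mass_g=100.0):
    """Keep reactions that are both high yield and large scale."""
    return [
        r for r in reactions
        if r.yield_pct >= min_yield_pct and r.product_mass_g >= min_mass_g
    ]

# Made-up records for illustration only.
reactions = [
    ExtractedReaction("doc-A", 0.10, 27.0),
    ExtractedReaction("doc-B", 250.0, 91.0),
    ExtractedReaction("doc-C", 500.0, 35.0),
    ExtractedReaction("doc-D", 0.05, 95.0),
]
selected = manufacturing_candidates(reactions)
```

Both cutoffs are arbitrary defaults; in practice they would be chosen by looking at the distribution on the plot.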
What does this enable that could not be done before ? Analyzing frequency-vs-time
16,355 Suzuki coupling reactions extracted from 2001 – 2013 US Applications
[Figure: Suzuki couplings as %age of reactions / year vs. year of patent filing]
Results from Drs. Roger Sayles & Daniel Lowe, NextMove
Relationships
Entity types identified that were associated with structures derived from patents
Source = Roger Sayles – NextMove
The Number of Biological Activities Derived from Patents vs the Scientific Literature
Source = Roger Sayles
Evolving Analytics, Visualization & Knowledge
Availability of bulk machine-readable data
Understanding the Documents
Understanding what’s in the documents
Why bother ?
The format of the document
• Sections
• Tables
• Citations
• Data types
• Etc…
Analysis of the content
• NLP
• Entity identification
• Contextual Analysis
• Table data
• Relationships
• Normalization
Integration with Other Data
• Development of feature spaces
• Seeing the unobvious
• Learning
• Predicting
Patent data alone is insufficient
[Figure: network of linked data sources. Legend: Terminology/Ontology; Public Database; Database + Terminology]
• PubChem: CID, SID, AID, InChIKey, CAS, Synonyms, PubMed, Patents
• NLM MeSH: Chemical (Synonyms, MeSH DUI); Disease (MeSH DUI)
• FDA SRS: Drug (FDA SPL, FDA NDC); Ingredient (UNII, InChIKey)
• NCBI: Protein, Gene, CDD, Taxonomy, PubMed, BioSystems
• NLM HSDB: Pharmacology, Toxicity, Metabolism, Properties, Manufacture
• VA NDF-RT
• NLM RxNorm
• FDA/NLM DailyMed
• NCI Metathesaurus
• Disease Ontology
• Protein Ontology
• Gene Ontology
• DrugBank: Drug (PubChem, ATC); Target (Uniprot, GeneCard)
• KEGG: Drug (PubChem, ATC); Target (Gene); Disease (OMIM)
• ChEMBL: Drug (ATC, ChEBI, TradeName); Compound (Pharmacology)
• ChEBI: Source (IntEnz, KEGG, PDBeChem, ChEMBL)
• IUPHAR-DB: Drug (Classification); Target (Nomenclature, Pharmacology)
• IBM: Patent, PubMed
Integration with Open Source Data
Drs. Evan Bolton & Gang Fu, NIH
NIH PubChem RDF – Triple & Entity Counts
https://pubchem.ncbi.nlm.nih.gov/rdf/ Drs. Evan Bolton & Gang Fu, NIH
Integration with Open Source Data
What are “Cognitive Technologies” ?
“Big Data, (Machine Learning, Neural Networks, Cognitive Computing, AI) is like teenage sex:
Everyone talks about it, nobody really knows how to do it,
Everyone thinks everyone else is doing it.
So everyone claims they are doing it….”
Source: Dan Ariely , Duke University
Machine Learning, Neural Networks, Cognitive Computing, AI
Google “A mostly complete chart of Neural Networks”
[Figure: A mostly complete chart of Neural Networks]
Making IP Data Accessible and Useful: What Google is doing
• Accessible Information: Usefulness starts with access to the information.
• Transformation: Apply business logic, human curation, and/or machine learning
• Useful Information: Solving user problems
The critical first step in making patent information useful is open access to machine-readable bulk data
Slide courtesy of Ian Wetherbee, Google
• Machine Classification
• Document Similarity
• Machine Translation
• ….
http://media.epo.org/play/gsgoogle2017
Google’s machine translation
[Figure: translation quality (scale 0 to 6, 6 = perfect translation) by translation model (phrase-based (PBMT), neural (GNMT), human) for English>Spanish, English>French, English>Chinese, Spanish>English, French>English, Chinese>English]
Slide courtesy of Ian Wetherbee, Google
47,710,923 patents full-text translated
The advantages of making patents accessible & useful
Accessible Information: Usefulness starts with access to the information.
• Enables the private sector to transform and improve information, benefitting the patent system
• Improves the transparency into patent quality and the patent system
• Improves transparency into legal rights
• Empowers the public to obtain the full benefits of the disclosure
“Open machine-readable data is the critical first step in making patent information useful” *
Slide courtesy of Ian Wetherbee, Google
An Example
Finding compounds that might fight cancer
What are people doing with this data ?
Pharma asks
1. What genes regulate xyz condition ?
2. What compounds regulate those xyz genes ?
An approach to answering these questions : chemical ontologies
Other approaches include
• Computational chemical modeling
• Similarity Ensemble Approach (SEA)
• Literature-based discovery
• Experimental high-throughput screening
Chemical Ontologies
But first some chemistry
Work done in collaboration with:
University of Alberta: Prof David Wishart & Yannick Feunang
Ontochem: Prof Lutz Weber
Physical • Examples: Molecular Weight, Melting point, Boiling Point
Molecular• Examples: Steroid, Prostaglandin, Amino Acid, Alkene, Imidazole
Functional • Examples: Anti-Inflammatory, Explosive, Refrigerant, Pesticide
Legal attributes • Patented for a purpose
Molecules have different types of attributes
Example of a chemical ontology
Consider this molecule
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
Substructures highlighted in the molecule
• Carboxylic acid
• Benzoic acid
• Phenol
• Hydroxy group
• Azo
• Pyridine
• Sulfone
• Sulfonamide
• Azobenzene
• Benzene
Molecular attributes (labels)
• Is a Benzoic acid
• Is a Carboxylic acid
• Is a Carbonyl cpd
• Is a Phenol
• Is an Azobenzene
• Is an Azo compound
• Is a Sulfone
• Is a Sulfonamide
• Is a Pyridine
• Is a Benzene
• Is a Hydroxy
Functional attributes
• Is used for the treatment of Crohn's disease
• Is used for the treatment of rheumatoid arthritis
• Is used for the treatment of ulcerative colitis
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
[H][C@@]12C[C@@]3([H])[C@]4([H])C[C@]([H])(F)C5=CC(=O)C=C[C@]5(C)[C@@]4(F)[C@@H](O)C[C@]3(C)[C@@]1(OC(C)(C)O2)C(=O)COC(C)=O
SMILES String
ClassyFire OntoChem
ClassyFire: Halogenated steroids (6); Fluorohydrins (7); Halohydrins (7); 1,3-dioxolanes (9); 11-beta-hydroxysteroids (9); Dioxolanes (9); 3-oxo delta-1,4-steroids (10); Alpha-acyloxy ketones (10); Delta-1,4-steroids (10); 11-hydroxysteroids (12); Gluco/mineralocorticoids, progestogins and derivatives (13); Pregnane steroids (13); 20-oxosteroids (15); Acetate salts (22); 3-oxosteroids (26); Oxosteroids (27); Carboxylic acid salts (30); Hydroxysteroids (32); Cyclic ketones (45); Alpha amino acid amides (73); Pyrrolidines (80); D-alpha-amino acids (85); Acetals (50); Steroids and steroid derivatives (51); Alkyl fluorides (53); Alkyl halides (67); Cyclic alcohols and derivatives (86); Ketones (101); Organofluorides (128); Carboxylic acid esters (139); Secondary alcohols (187); Oxacyclic compounds (192); Lipids and lipid-like molecules (209); Organohalogen compounds (272); Ethers (393); Alcohols and polyols (395); Carboxylic acid derivatives (423); Carboxylic acids and derivatives (548); Carbonyl compounds (598); Organic acids and derivatives (633); Organoheterocyclic compounds (651); Organooxygen compounds (856); Organic compounds (978); Chemical entities (989); Hydrocarbon derivatives (995);
OntoChem: 17-deoxy-prednisolones (6); halohydrins (6); prednisolones (6); ethanoic acid esters (20); methyl esters (20); acetals (37); alkyl fluorides (56); cyclic ketones (61); natural product derivatives (92); fluorine compounds (126); alkene derivatives (172); polycyclic compounds (184); oxacyclic compounds (190); secondary alcohols (202); carboxylic acids (249); formic acid derivatives (559); lipophilic molecules (642); lipinski molecules (785); bioavailable molecules (867); oxygen compounds (891); small molecules (949); carbon compounds (974); hetero compounds (978);
Generating molecular attributes via SMILES
Obtain a database of chemical compounds & their SAR
1. ChEMBL database of 1.4 M compounds and their bioactivity towards targets (this database was provided by EBI)
2. SMILES strings
3. UOA ClassyFire (CF) software produces CF labels (this processing was provided by U of Alberta); OntoChem (OC) software produces OC labels (this processing was provided by Ontochem)
4. ChEMBL database of 1.4 M compounds and their bioactivity towards targets, including CF + OC chemical labels (this processing was provided by IBM Research)
We call this the ChEMBL ontology database
MDM2 Raw output
Out of 1.4 M molecules, ~558 had activity towards MDM2, but only 27 had activity less than 30 nM
Scoring of molecular labels for MDM2-produced training set of 27 compounds
[ label cutoff = 20, activity cutoff = 30, corpus count cutoff = 200K ]
Score = (observed count - expected count)^2 / expected count
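The score on this slide has the form of a single chi-square term, and a small sketch makes the arithmetic concrete. Two points below are assumptions, since the slide does not spell them out: the expected count is taken to be what a label would contribute to the training set if it occurred there at its corpus-wide rate, and the corpus count cutoff is read as "skip labels that occur more often than this in the corpus". The label names and counts are placeholders, not values from the talk.

```python
def expected_count(label_corpus_count, corpus_size, training_size):
    """Count expected in the training set at the corpus-wide rate (assumption)."""
    return training_size * label_corpus_count / corpus_size

def label_score(observed, expected):
    """Score = (observed - expected)^2 / expected, as on the slide."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) ** 2 / expected

def rank_labels(training_counts, corpus_counts, corpus_size, training_size,
                corpus_count_cutoff=200_000):
    """Rank labels by score, skipping labels above the corpus count cutoff."""
    scored = []
    for label, observed in training_counts.items():
        in_corpus = corpus_counts[label]
        if in_corpus > corpus_count_cutoff:
            continue  # too generic to be informative (assumed reading)
        expected = expected_count(in_corpus, corpus_size, training_size)
        scored.append((label, label_score(observed, expected)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder counts for illustration only.
training_counts = {"label_a": 20, "label_b": 25, "label_c": 27}
corpus_counts = {"label_a": 5_000, "label_b": 150_000, "label_c": 1_300_000}
ranking = rank_labels(training_counts, corpus_counts,
                      corpus_size=1_400_000, training_size=27)
```

Because the difference is squared, a label that is strongly under-represented in the training set also scores high, so a real pipeline would likely check the sign of (observed - expected) as well.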
MDM2 Raw output
Classyfire (CF) OntoChem (OC)
Comparison of the 100 compounds identified by CF with the 100 compounds identified by OC for MDM2, with label cutoff = 10 labels, assay minimum = 30 and corpus count cutoff = 300K
57 of the predicted compounds are in common
Overlap based on ChEMBL IDs of predicted compounds
A sample from the 57 compounds identified to have potential MDM2 Activity by both CF & OC
US 2009/0312310A1: this [240 page] patent application had 26 compounds with reported assay data for MDM2
“IC50 value for Mdm2/P53 binding was calculated by sigmoid fitting using Prism (GraphPad Software). The results are shown below.”
[Figure: compounds from the patent: Example 18, Example 39, Example 93, Example 97, Example 111, Example 155, Example 126, Example 180, Example 220]
Compound Attributes
[Figure: table with rows compound 1, compound 2, compound 3, … and attribute columns A B C D E … Y Z; each row is a feature vector, e.g. 1 2 0 0 4 5 2 0 7 3 0 1 1 2]
• Physical related attributes: Lc, structure, pKa, log P, …
• Structure: Mol file / SMILES
• Functional attributes: EC50, Ki, target-assay pair (primary assays, secondary assays; e.g. MDM2, JAK3, SGLT2)
• Other attributes: LD50, anti-target-assay pair
Target Attributes
[Figure: table with rows Target 1, Target 2, Target 3, … and attribute columns A B C D E … Y Z; each row is a feature vector, e.g. 1 2 0 0 4 5 2 0 7 3 0 1 1 2]
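Turning per-compound (or per-target) attributes into fixed-length feature vectors can be sketched as below. This is a minimal illustration: the attribute names A, B, C, … mirror the column letters on the slide, while the counts and function names are placeholders made up for the example.

```python
def attribute_columns(attribute_counts):
    """One fixed, sorted column order shared by every row."""
    return sorted({name for counts in attribute_counts.values() for name in counts})

def feature_vector(counts, columns):
    """One row of the matrix; attributes the entity lacks become 0."""
    return [counts.get(name, 0) for name in columns]

# Placeholder attribute counts for illustration only.
compound_attributes = {
    "compound 1": {"A": 1, "B": 2, "E": 4},
    "compound 2": {"B": 5, "C": 2, "Z": 7},
    "compound 3": {"A": 3, "Y": 1, "Z": 1},
}

columns = attribute_columns(compound_attributes)
matrix = {
    name: feature_vector(counts, columns)
    for name, counts in compound_attributes.items()
}
```

The same two functions apply unchanged to the target table, which is what lets compounds and targets share one feature space.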
Goal oriented learning
Cost / reward
Act
Predict an action which will reduce cost and/or increase reward
BIG ISSUES
1) Obfuscation
2) Access to & integration of worldwide data
   • Open access to bulk machine-readable data
3) Incentives & quotas
4) Algorithms and Bias
WHAT IS THIS?
= a soccer ball
= a spherical recreational device
BIG ISSUES: OBFUSCATION
Source = Dr. George Papadatos, EMBL-EBI
European Molecular Biology Laboratory (EMBL) & European Bioinformatics Institute (EBI)
Markush structures are daunting and the situation is getting worse
BIG ISSUES
Access to and integration of WW Data
• Chinese, Japanese and Korean (CJK) patents now account for over half of all national patent filings and hence are of increasing importance to patent informatics.
To demonstrate the importance of this …
• 1,740,040 distinct compounds were extracted from ~63,000 Korean patent applications, spanning from 1990 to March 2015
• Of these, ~230,770 compounds were novel to Korean patents when compared to compounds derived from US data (spanning from 1976 to March 2015)
• In the period 2006-2014, 46% of compounds appeared in a KIPO filing before a USPTO filing.
The Importance of Foreign Patent Filings
Notes from Drs. D. Lowe & R. Sayles
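The two comparisons on this slide (compounds novel to one office, and compounds seen first in one office) reduce to set operations over compound identifiers with first-filing dates. A minimal sketch follows. The identifiers and dates are placeholders, and the choice to compute the "seen first" share over compounds present in both collections is an assumption, since the slide does not state the denominator. The last line only redoes the slide's own arithmetic: ~230,770 of 1,740,040 is roughly 13%.

```python
from datetime import date

def novel_to(source_first_seen, reference_first_seen):
    """Compounds present in the source collection but absent from the reference."""
    return set(source_first_seen) - set(reference_first_seen)

def share_first_in(source_first_seen, reference_first_seen):
    """Among compounds found in both, the fraction seen earlier in the source."""
    shared = set(source_first_seen) & set(reference_first_seen)
    if not shared:
        return 0.0
    earlier = sum(
        1 for c in shared if source_first_seen[c] < reference_first_seen[c]
    )
    return earlier / len(shared)

# Placeholder identifiers and first-filing dates for illustration only.
kipo = {
    "cmpd-1": date(2007, 3, 1),
    "cmpd-2": date(2010, 6, 15),
    "cmpd-3": date(2012, 1, 9),
}
uspto = {
    "cmpd-1": date(2008, 1, 20),
    "cmpd-2": date(2009, 11, 2),
}

novel = novel_to(kipo, uspto)
first_share = share_first_in(kipo, uspto)

# Arithmetic from the slide: share of Korean-derived compounds not seen in US data.
novel_share_on_slide = 230_770 / 1_740_040
```

In practice the identifier would need to be a normalized structure key such as an InChIKey, so that the same compound named differently in two offices still compares equal.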
An Example of extracting chemical entities from CJK patents
Notes from Drs. D. Lowe & R. Sayles
Chemicals from Chinese patents: attempts to process Chinese patent documents
Extracting chemical structures from Chinese patents…
Work done in collaboration with Dr R Sayles
Final Thoughts
Thanks to
• everyone in this room
• the scientific community
• especially those whose data was presented
• society in general
for providing us with these important
“Adjacent Possibilities”
Final thoughts
Source – J Kreulen
IBM Almaden Research Center, San Jose, California