IBEnt: Chemical Entity Mentions in Patents using ChEBI. Andre Lamurias, Luis F. Campos, and Francisco M. Couto. LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal. BioCreative V.5 Workshop, April 26-27, 2017

Jan 24, 2018

Francisco Couto
Page 1: IBEnt: Chemical Entity Mentions in Patents using ChEBI

IBEnt: Chemical Entity Mentions in Patents using ChEBI

Andre Lamurias, Luis F. Campos, and Francisco M. Couto
LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal

BioCreative V.5 Workshop, April 26-27, 2017

Page 2: IBEnt: Chemical Entity Mentions in Patents using ChEBI

CEMP Pipeline

Page 3: IBEnt: Chemical Entity Mentions in Patents using ChEBI

IBEnt

- IBEnt: in‐house tool to identify entities and relations in text

- Machine learning and ontologies

- Integrates Stanford CoreNLP, Genia Sentence splitter, CRFsuite

- Used to participate in text mining competitions

https://github.com/lasigeBioTM/IBEnt

Page 4: IBEnt: Chemical Entity Mentions in Patents using ChEBI

CRF Classifier

Generic features:

- Prefix, suffix, HasNumber, WordCase, Lemma, POS, Word shape

- Computed for the current token and for a window of -1/+1 (the previous and the next token)

Cells exposed to α‐MeDA showed an increase in intracellular glutathione (GSH) levels
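
The generic features listed above can be illustrated on the example sentence. The following is a minimal sketch, not IBEnt's actual feature extractor: the function names, the feature keys, the 3-character prefix/suffix length and the whitespace tokenization are all assumptions made for illustration. Lemma and POS are left out because they come from an external tagger (Stanford CoreNLP in IBEnt).

```python
import re


def word_shape(token):
    """Map uppercase to 'A', lowercase to 'a', digits to '0'; keep other characters."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "0", shape)


def word_case(token):
    """Coarse casing class of a token."""
    if token.isupper():
        return "upper"
    if token.islower():
        return "lower"
    if token.istitle():
        return "title"
    return "mixed"


def token_features(token):
    """Generic features of a single token (Lemma and POS omitted in this sketch)."""
    return {
        "lower": token.lower(),
        "prefix3": token[:3],   # assumed prefix length
        "suffix3": token[-3:],  # assumed suffix length
        "has_number": any(ch.isdigit() for ch in token),
        "word_case": word_case(token),
        "shape": word_shape(token),
    }


def sentence_features(tokens, window=1):
    """One feature dict per token, covering the token and its -window/+window neighbours."""
    rows = []
    for i in range(len(tokens)):
        row = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                label = "0" if offset == 0 else f"{offset:+d}"
                for name, value in token_features(tokens[j]).items():
                    row[f"{label}:{name}"] = value
        row["BOS"] = i == 0                # beginning of sentence
        row["EOS"] = i == len(tokens) - 1  # end of sentence
        rows.append(row)
    return rows


# Hand-tokenized version of the example sentence (tokenization is an assumption).
tokens = ("Cells exposed to α-MeDA showed an increase in "
          "intracellular glutathione ( GSH ) levels").split()
rows = sentence_features(tokens)
print(rows[3]["0:shape"], rows[3]["-1:lower"], rows[3]["+1:suffix3"])
```

Each row is a flat dictionary, which is the kind of per-token attribute set a linear-chain CRF trainer typically consumes.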

Page 5: IBEnt: Chemical Entity Mentions in Patents using ChEBI

CRF Classifier

Chemical-specific features:

• Periodic table: token is a periodic table element
  • “oxygen”, “gold”, “carbon”

• Amino acid: token consists of an amino acid abbreviation
  • “ala”, “arg”, “asn”

• Greek letter: token contains at least one Greek letter
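
As a rough illustration of the three chemical-specific features, the sketch below checks a token against two small word lists and the Unicode Greek block. The element list is a deliberately tiny sample, and the function and key names are hypothetical; IBEnt's own lists and implementation may differ.

```python
# Small sample of element names; a real list would cover the whole periodic table.
PERIODIC_TABLE = {"oxygen", "gold", "carbon", "hydrogen", "nitrogen"}

# Three-letter abbreviations of the 20 standard amino acids.
AMINO_ACIDS = {
    "ala", "arg", "asn", "asp", "cys", "gln", "glu", "gly", "his", "ile",
    "leu", "lys", "met", "phe", "pro", "ser", "thr", "trp", "tyr", "val",
}


def has_greek_letter(token):
    """True if any character falls in the Unicode 'Greek and Coptic' block."""
    return any("\u0370" <= ch <= "\u03ff" for ch in token)


def chemical_features(token):
    """Boolean chemical-specific features for one token."""
    lowered = token.lower()
    return {
        "periodic_table": lowered in PERIODIC_TABLE,
        "amino_acid": lowered in AMINO_ACIDS,
        "greek_letter": has_greek_letter(token),
    }


for tok in ["gold", "Arg", "α-MeDA", "levels"]:
    print(tok, chemical_features(tok))
```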

Page 6: IBEnt: Chemical Entity Mentions in Patents using ChEBI

FiGO – Finding GO terms

• Based on the Information Content (IC) of the words
• Choose the term with the highest IC
• The term t = “punt binding”
• With the synonym “punt activity”

• #punt=1, #binding=4, #activity=8, max=16 (logarithms are base 2)
• IC(“punt”) = -log2(1/16) = 4
• IC(“binding”) = -log2(4/16) = 2
• IC(“activity”) = -log2(8/16) = 1
• IC(“punt binding”) = 4+2 = 6
• IC(“punt activity”) = 4+1 = 5
• IC(t) = max{6, 5} = 6
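
The arithmetic on this slide can be reproduced with a few lines of Python. This is a sketch of the calculation as shown, not FiGO's code: the function names are hypothetical, and base-2 logarithms are assumed because they are what make the slide's numbers (4, 2, 1) come out.

```python
import math


def word_ic(word, counts, max_count):
    """Information content of a word: -log2(frequency / max frequency)."""
    return -math.log2(counts[word] / max_count)


def name_ic(name, counts, max_count):
    """IC of a multi-word name: sum of the ICs of its words."""
    return sum(word_ic(word, counts, max_count) for word in name.split())


def term_ic(names, counts, max_count):
    """IC of a term: maximum IC over its name and synonyms."""
    return max(name_ic(name, counts, max_count) for name in names)


# Word counts taken from the slide.
counts = {"punt": 1, "binding": 4, "activity": 8}
max_count = 16

print(name_ic("punt binding", counts, max_count))   # 6.0
print(name_ic("punt activity", counts, max_count))  # 5.0
print(term_ic(["punt binding", "punt activity"], counts, max_count))  # 6.0
```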

Page 7: IBEnt: Chemical Entity Mentions in Patents using ChEBI

Semantic Similarity

• “Despite a lack of data regarding their efficacy, both caffeine and doxapram have been recommended for treatment of hypercapnia in equine neonates with central nervous system damage.” (PMID: 18371030)

• Caffeine and doxapram are semantically similar: both are central nervous system stimulants

• This similarity is evidence that both entities were correctly identified
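
The idea of using co-occurring entities as mutual evidence can be sketched as follows. Everything here is an assumption made for illustration: the toy hierarchy is not real ChEBI data, "gold" is a hypothetical extra candidate that does not appear in the sentence, the 0.5 threshold is arbitrary, and the Jaccard overlap of ancestor sets is a simple stand-in for the h-index based measure used in the actual work.

```python
# Toy is-a hierarchy (child -> parents); NOT real ChEBI content.
PARENTS = {
    "caffeine": {"central nervous system stimulant"},
    "doxapram": {"central nervous system stimulant"},
    "central nervous system stimulant": {"drug"},
    "drug": {"root"},
    "gold": {"metal"},
    "metal": {"root"},
}


def ancestors(term):
    """The term itself plus everything reachable through its parents."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def similarity(a, b):
    """Jaccard overlap of ancestor sets (a stand-in similarity measure)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def support_scores(entities):
    """For each candidate, its highest similarity to any other candidate."""
    return {
        e: max((similarity(e, other) for other in entities if other != e), default=0.0)
        for e in entities
    }


candidates = ["caffeine", "doxapram", "gold"]
scores = support_scores(candidates)
kept = [e for e in candidates if scores[e] >= 0.5]  # assumed threshold
print(scores)
print(kept)
```

In this toy setting caffeine and doxapram support each other, while the unrelated candidate has no similar neighbour and falls below the threshold.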

Page 8: IBEnt: Chemical Entity Mentions in Patents using ChEBI

Improving chemical entity recognition through h‐index based semantic similarity

A. Lamurias, J. Ferreira, and F. Couto, Improving chemical entity recognition through h‐index based semantic similarity, Journal of Cheminformatics, vol. 7, no. Suppl 1, pp. S13, #20, 2015

Page 9: IBEnt: Chemical Entity Mentions in Patents using ChEBI

MER - Minimal Named-Entity Recognizer

- Lexicon-based matching
- Developed for performance
- No machine learning, just grep and awk
- Minimal dependencies, easily portable
- Lexicons: ChEBI, ChEMBL, DrugBank, HMDB

https://github.com/lasigeBioTM/MER
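
To make the notion of lexicon-based matching concrete, here is a small Python sketch. MER itself is built on grep and awk, so this is not its implementation. The function names, the case-insensitive matching and the longest-match-first rule are assumptions chosen for illustration.

```python
import re


def build_matcher(lexicon):
    """Compile one regex over all lexicon entries, longest entries first."""
    terms = sorted(lexicon, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in terms)
    # Lookarounds stop matches from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def annotate(text, matcher):
    """Return (start, end, matched text) for every lexicon hit in the text."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


# Tiny hypothetical lexicon; real lexicons would be exported from ChEBI etc.
lexicon = {"caffeine", "doxapram", "glutathione", "glutathione disulfide"}
matcher = build_matcher(lexicon)

print(annotate("Caffeine and doxapram", matcher))
print(annotate("glutathione disulfide levels", matcher))
```

Sorting the entries by length means that when one entry is a prefix of another, the longer mention wins.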

Page 10: IBEnt: Chemical Entity Mentions in Patents using ChEBI

Runs

Page 11: IBEnt: Chemical Entity Mentions in Patents using ChEBI

2017 Results:

2015 Results:

Page 12: IBEnt: Chemical Entity Mentions in Patents using ChEBI

Issues identified post‐submission

• Runs using MER (2, 3 and 4) were affected by bugs identified only after submitting results

• Run 5 contained a bug that affected the results more severely:

• Entities identified by more than one lexicon were excluded!

• High impact on recall, which should have been higher than in Run 4

• We still have to determine what caused Run 2 to have worse results than Run 1
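
The Run 5 symptom described above (entities found by more than one lexicon being dropped) can be illustrated with a small sketch. The slides do not say how the bug arose in the code, so both functions below are hypothetical: one shows the intended merge, the other only mimics the reported effect. The span offsets are made-up example data.

```python
from collections import defaultdict


def merge_matches(matches_by_lexicon):
    """Map each matched span to the set of lexicons that produced it."""
    sources = defaultdict(set)
    for lexicon, spans in matches_by_lexicon.items():
        for span in spans:
            sources[span].add(lexicon)
    return dict(sources)


def keep_all(merged):
    """Intended behaviour: every span is kept once, however many lexicons found it."""
    return sorted(merged)


def drop_shared(merged):
    """Mimics the reported symptom: spans found by several lexicons disappear."""
    return sorted(span for span, lexicons in merged.items() if len(lexicons) == 1)


# Hypothetical (start, end) spans from two lexicons over the same text.
matches = {"ChEBI": [(0, 8), (13, 21)], "DrugBank": [(0, 8)]}
merged = merge_matches(matches)

print(keep_all(merged))     # [(0, 8), (13, 21)]
print(drop_shared(merged))  # [(13, 21)]
```

Spans confirmed by several lexicons are often the most reliable ones, which is consistent with the slide's remark that the bug had a high impact on recall.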

Page 13: IBEnt: Chemical Entity Mentions in Patents using ChEBI

Closing remarks

- Trade-off between machine learning and rule-based systems
- Best run (F-score: 0.8531) used machine learning and semantic similarity
- 43% recall in 2015 using semantic similarity
- This edition: 83% recall and increased precision
- Problems with MER
  - Was still in development at the time of submission
  - It was developed for performance, not accuracy

Page 14: IBEnt: Chemical Entity Mentions in Patents using ChEBI

Future Work

- Lexicon matching could be improved: add abbreviations, synonyms and variations to the lexicons
- Use more data to train the CRF classifier than the Random Forests classifier
- Links:
  - https://github.com/lasigeBioTM/IBEnt
  - https://github.com/lasigeBioTM/MER
  - http://labs.fc.ul.pt/mer/