
Mar 26, 2015

Chemical named entity recognition and literature mark-up

Colin Batchelor
Informatics Department
Royal Society of Chemistry
batchelorc@rsc.org


Overview

Project Prospect: what we find and how we find it.

RDF: How should we be disseminating it?

Next steps: Basics for a chemical ontology.


Project Prospect: What do we find?

Chemical compounds

Chemical terms from the IUPAC Gold Book

Gene products: function, process, location

Nucleotide and polypeptide sequence terms

Cell types


Project Prospect: How do we find it?

For compound names:

~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)

~20% PubChem

~20% ChemDraw

For compound numbers:

~70% author ChemDraw

~30% editors


RDF in an RSS reader


RDF: how we do it now

Content module from RSS 1.0

http://web.resource.org/rss/1.0/modules/content

In what sense does an article “contain” pyridine or base pairs?

We would much rather have proper RDF predicates, e.g. “is_about”, “mentions”.
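A minimal sketch of what such predicates might look like, written as plain subject-predicate-object triples. The `EX` namespace, the predicate URIs, the placeholder article and entity URIs, and the choice of which entity gets which predicate are all assumptions made for illustration; none of them is an existing vocabulary.

```python
# Hypothetical namespace for the two predicates; not an existing vocabulary.
EX = "http://example.org/prospect#"


def triple(subject, predicate, obj):
    """Return one N-Triples line linking two URIs with an EX predicate."""
    return f"<{subject}> <{EX}{predicate}> <{obj}> ."


# Placeholder URIs, invented for this example.
article = "http://example.org/article/1"
pyridine = "http://example.org/entity/pyridine"
base_pair = "http://example.org/entity/base_pair"

triples = [
    triple(article, "is_about", pyridine),   # assumed: the article's topic
    triple(article, "mentions", base_pair),  # assumed: referred to in passing
]
print("\n".join(triples))
```

Unlike membership in a content bag, each triple states what kind of link holds between the article and the entity.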


RDF: what it looks like now

<item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
  <title> [… title] </title>
  <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
  <description> [… blah] </description>
  <content:encoded> [… human-readable stuff …] </content:encoded>
  [… dublin core stuff …]
  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>
</item>
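A minimal sketch of how a consumer might pull the entity URIs out of an item like this, using only the Python standard library. The `FEED` string is a cut-down stand-in built from the slide, the namespace URIs are assumed to be the usual RDF, RSS 1.0 and content-module ones, and `entity_uris` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (RDF, RSS 1.0, RSS 1.0 content module).
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Cut-down stand-in for one feed item; only the content:items part is kept.
FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
    <content:items>
      <rdf:Bag>
        <rdf:li><content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/></rdf:li>
        <rdf:li><content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/></rdf:li>
      </rdf:Bag>
    </content:items>
  </item>
</rdf:RDF>"""


def entity_uris(feed_xml):
    """Map each item's rdf:about URI to the URIs of its content:item children."""
    root = ET.fromstring(feed_xml)
    about = f"{{{NS['rdf']}}}about"  # ElementTree spells qualified names as {uri}local
    result = {}
    for item in root.findall("rss:item", NS):
        found = item.findall(".//content:item", NS)
        result[item.get(about)] = [ci.get(about) for ci in found]
    return result


for article, uris in entity_uris(FEED).items():
    print(article)
    for uri in uris:
        print("  contains", uri)
```

The parser can recover which entities are attached to the article, but nothing in the markup says how they relate to it.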


Basics for a chemical ontology

1. Unambiguous representation of objects of chemical discourse

2. Proper parthood relations


Basics for a chemical ontology: 1. Objects of chemical discourse

Must be able to represent and clearly distinguish:

Compounds

Classes of compound

Parts of molecules

Mixtures

Would be nice to have:

Disambiguation cues for the first three


Imidazole


An imidazole


The imidazole side-chain/group/ring


Can ChEBI handle this?

Imidazoles (!) (CHEBI:24780)

Imidazole (CHEBI:16069)

Imidazole ring: not yet

Imidazolyl group: not yet (but methyl, benzyl, etc.)

… and there are no disambiguation cues


Disambiguation

One Sense per Discourse (Gale et al. 1992)

… this doesn’t hold at all

One Sense per Collocation (Yarowsky 1993)

… matches our intuitions


Disambiguation: What a one sense per collocation feature set might look like

CLASS:

w(-1) = a, an, the, this

w(0) plural (bit of a cheat, as not a collocation)

PART:

w(-1) = bridging, terminal

w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)

w(+1)w(+2) = “building block”, “protecting group”, “side chain”
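A minimal sketch of how these cues could drive a rule-based tagger. Only the word lists come from the slide. The tokenizer, the precedence of PART over CLASS, the default COMPOUND sense, the crude plural test, and the names `sense` and `sense_in` are assumptions made for illustration.

```python
import re

# Cue lists taken from the feature set above.
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"),
                     ("side", "chain")}


def tokenize(text):
    """Lowercase and split on non-alphanumerics, so 'side-chain' gives two tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sense(tokens, i):
    """Guess the sense of the chemical name at tokens[i]."""
    prev = tokens[i - 1] if i > 0 else ""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    nxt2 = tokens[i + 2] if i + 2 < len(tokens) else ""
    # Assumption: PART cues win, so "the imidazole side chain" is a part.
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # Assumption: a trailing "s" stands in for a real plural test.
    if prev in CLASS_PREV or tokens[i].endswith("s"):
        return "CLASS"
    # Assumption: with no cue, fall back to the specific compound.
    return "COMPOUND"


def sense_in(text, name):
    """Tag the first token in text that starts with name."""
    tokens = tokenize(text)
    i = next(k for k, tok in enumerate(tokens) if tok.startswith(name))
    return sense(tokens, i)


for phrase in ["imidazole was added", "an imidazole",
               "the imidazole side-chain", "a terminal imidazole"]:
    print(phrase, "->", sense_in(phrase, "imidazole"))
```

A trained classifier over the same features would replace the fixed precedence with learned weights.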


Basics for a chemical ontology: 2. Parthood relations

Parthood in ChEBI means at least three things:

Is necessarily chemically part of:

carbonyl group part_of carbonyl compounds


Basics for a chemical ontology: 2. Parthood relations

Is possibly chemically part of:

Lead(2+) part_of lead diacetate

(most lead(2+) isn’t)

Electron part_of muonium (!)


Basics for a chemical ontology: 2. Parthood relations

Is part of a mixture:

Kanamycin A part_of kanamycin


Basics for a chemical ontology: 2. Parthood relations

Solution 1: define relationships according to the pattern “all instances of X have a relationship with some Y” (Smith et al., “Relations in biomedical ontologies”, 2005).

Carbonyl compound has_part carbonyl group

Lead diacetate has_part lead(2+) (?!)

Muonium has_part electron

Kanamycin has_part kanamycin A (?!)
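A minimal sketch of the all-some pattern as a check over a toy set of instances. The instance names and the single part_of link are invented for illustration, and `all_some` is a hypothetical helper, not code from any ontology toolkit.

```python
def all_some(instances_of_x, instances_of_y, related):
    """True if every instance of X is related to at least one instance of Y."""
    return all(
        any((x, y) in related for y in instances_of_y)
        for x in instances_of_x
    )


# Toy world (invented): two lead(2+) ions, only one of them inside a
# sample of lead diacetate.
lead_ions = {"pb_ion_1", "pb_ion_2"}
lead_diacetate = {"diacetate_1"}
part_of = {("pb_ion_1", "diacetate_1")}
has_part = {(whole, part) for (part, whole) in part_of}  # inverse relation

# "Lead(2+) part_of lead diacetate" fails: pb_ion_2 is part of no diacetate.
print(all_some(lead_ions, lead_diacetate, part_of))

# "Lead diacetate has_part lead(2+)" holds: every diacetate has such a part.
print(all_some(lead_diacetate, lead_ions, has_part))
```

The check tests only the logical pattern. It says nothing about whether has_part is the right relation for an ion in a salt or a component of a mixture.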


Basics for a chemical ontology: 2. Parthood relations

Solution 2 (for discussion): Distinguish molecular-level relationships from sample-level relationships

Carbonyl compound molecule has_part carbonyl substituent

Muonium atom has_part electron

Kanamycin has_component kanamycin A

Lead diacetate has_component lead(2+) (?!)
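A minimal sketch of how the two levels could be kept apart in data. The four assertions are the ones listed above. The `RELATION_LEVEL` table and the `by_level` helper are assumptions made for illustration.

```python
# Assumed mapping from relation name to the level it applies at.
RELATION_LEVEL = {"has_part": "molecular", "has_component": "sample"}

# The four assertions from the slide, as (subject, relation, object).
ASSERTIONS = [
    ("carbonyl compound molecule", "has_part", "carbonyl substituent"),
    ("muonium atom", "has_part", "electron"),
    ("kanamycin", "has_component", "kanamycin A"),
    ("lead diacetate", "has_component", "lead(2+)"),
]


def by_level(assertions):
    """Group (subject, object) pairs by the level of their relation."""
    grouped = {"molecular": [], "sample": []}
    for subject, relation, obj in assertions:
        grouped[RELATION_LEVEL[relation]].append((subject, obj))
    return grouped


for level, pairs in by_level(ASSERTIONS).items():
    for subject, obj in pairs:
        print(f"{level}: {subject} -> {obj}")
```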


Open questions

How do we represent the relationship between named entities and documents?

How do we integrate ontologies and word-sense disambiguation?

What is the best way of distinguishing molecules and samples?


Acknowledgements

University of Cambridge: Peter Corbett

OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo)

www.projectprospect.org

