NII Internship Program Report Project: PHENOMINER
(Phenotype mining in genetic texts)
Presenter: Hoang-Quynh Le
Mai-Vu Tran
Hanoi, 6th October 2012
Main contents
• An Overview: Entity classes and PHENOMINER pipeline
• A hybrid approach to finding Phenotype candidates in genetic texts
• Genetic document classification using LPU with graph-based learning
Overview
Definitions: Phenotype, Symptom and Bodily Feature (BF)
• A BF mention is a bodily quality (normal or abnormal) in a human or mouse
• Scheuermann et al. [1] provide definitions for PHENOTYPE and SYMPTOM:
  – A PHENOTYPE is a BF or combination of BFs of an organism determined by the interaction of the organism's genes with its environment.
  – A SYMPTOM is a BF of a patient that is observed by the patient or a clinician and is suspected of being caused by a disease.
• BIG ambiguity!
[1] Scheuermann, R. H., Ceusters, W. and Smith, B. (2009), “Toward an ontological treatment of disease and diagnosis”, AMIA Summit on Translational Bioinformatics, pp. 116–120
Overview
An informal overview of entity classes and their roles
For more information: Nigel Collier, Mai-Vu Tran and Hoang-Quynh Le. PHENOMINER project working report on named entity annotation. Unpublished Technical Report – For Circulation to Authors and Collaborators, Draft version 1.10 (July 2012)
Overview
• PHENOMINER Pipeline
• Resources: – PubMed, PMC, UKPMC
– HPO, MPO
– PATO, FMA, CTO, etc.
• Pipeline stages: Journal articles → Source and clean text → Domain text classification → In-domain POS tagging → Domain parsing → Map to PAS structures → Named entity tagging → Merge → Zone identification → Novelty detection
Overview
Golden corpus
• 112 PMC abstracts were chosen based on a selection of 19 clinically significant auto-immune diseases from OMIM*
• Annotated according to the annotation guidelines
• Annotated by Yo Shidahara (an annotator of the GENIA corpus)
(*) Online Mendelian Inheritance in Man: http://www.ncbi.nlm.nih.gov/omim
Overview
A HYBRID APPROACH TO FINDING PHENOTYPE CANDIDATES IN GENETIC TEXTS
Presenter: Hoang-Quynh Le
Working report on Named Entity Recognition
Introduction
• Biomedical named entity recognition (NER) is a computational technique used to identify and classify strings of text (mentions) that designate important concepts in biomedicine.
• NER has been extensively studied for the names of genes, gene products, cells, chemical compounds and diseases, but there are few proposed solutions for phenotypes.
Introduction
• The challenges of phenotypes:
  – Until recently there has been little effort to provide data integration standards for phenotypes
    o Lack of standard nomenclatures
  – Complex structure:
    o Can include quantifications that are either specific or relative
    o Can include negation
    o Can be a noun, a noun phrase or even a clause
  – Traits and phenotypes can apply at all levels of anatomical granularity, from chemical structures to cells and organs
  – Normal vs. abnormal
  – Ambiguities between entities
  – Context sensitive
  – Etc.
• We employed two types of entity in this study: BF and GGP
Khordad et al. baseline model
• KMR corpus: 120 sentences from 4 full texts, with 110 annotated phenotype mentions
• Annotation was conducted with reference to the HPO: a term was tagged as a phenotype if it was in the HPO, or if it was not in the HPO but its definition showed that it was caused by a genotype
[2] Maryam Khordad, Robert E. Mercer, and Peter Rogan, “Improving Phenotype Name Recognition”, Canadian AI 2011, LNAI 6657, pp. 246–257, 2011
Khordad et al. baseline model
[2] Maryam Khordad, Robert E. Mercer, and Peter Rogan, “Improving Phenotype Name Recognition”, Canadian AI 2011, LNAI 6657, pp. 246–257, 2011
System                   Precision  Recall  F1
Khordad et al.'s system  97.58      88.32   92.71
Our rebuilt system       90.74      88.44   89.58
Proposed Model
Proposed Model
• Machine learning (ML) labelers
  – HMM vs. CRF
  – A BF ML labeler and a GGP ML labeler
  – Class labels for tokens follow the standard BIO scheme
1 Part-of-speech tags are assigned by training the GENIA tagger
2 MetaMap is a program developed by NLM to map biomedical text to the UMLS Metathesaurus: http://metamap.nlm.nih.gov
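The BIO scheme mentioned above marks each token as beginning (B-), inside (I-) or outside (O) an entity mention. A minimal sketch, with a hypothetical helper name and span representation:

```python
def bio_tags(tokens, entities):
    """Assign BIO labels to tokens given (start, end, type) entity spans.

    `entities` is a list of (start_index, end_index_exclusive, type).
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # remaining tokens of the mention
    return tags

# "hypoplasia of the corpus callosum" annotated as a single BF mention
tokens = ["hypoplasia", "of", "the", "corpus", "callosum", "was", "observed"]
print(bio_tags(tokens, [(0, 5, "BF")]))
# -> ['B-BF', 'I-BF', 'I-BF', 'I-BF', 'I-BF', 'O', 'O']
```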
Proposed Model
• Knowledge-based (KB) labelers
– The rule-matching labeler is an implementation of Khordad et al.'s approach, using MetaMap, the HPO and five staged heuristics to identify phenotypes
– The dictionary-matching labeler uses a longest-string-matching approach to recognise entities from the HPO (9,500 terms describing human phenotypes), the MPO1 (9,162 terms describing mouse phenotypes) and GGP entities from the NCBI gene list2 (9 million gene names)
1 Mammalian Phenotype Ontology (MPO): ftp://ftp.informatics.jax.org/pub/reports/index.html [downloaded July 10th 2012]
2 National Center for Biotechnology Information (NCBI) gene list: http://www.ncbi.nlm.nih.gov/gene [downloaded July 10th 2012]
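Longest-string matching can be sketched as a greedy scan that, at each token position, tries the longest dictionary term first. The function name and dictionary layout below are illustrative assumptions, not the actual implementation:

```python
def longest_match(tokens, dictionary):
    """Greedy longest-match lookup.

    `dictionary` maps lower-cased term token-tuples to entity types;
    returns (start, end_exclusive, type) spans.
    """
    max_len = max((len(term) for term in dictionary), default=0)
    spans, i = [], 0
    while i < len(tokens):
        # try the longest window first, shrinking until a term matches
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = tuple(w.lower() for w in tokens[i:i + n])
            if cand in dictionary:
                spans.append((i, i + n, dictionary[cand]))
                i += n
                break
        else:
            i += 1          # no term starts here; advance one token
    return spans

terms = {("corpus", "callosum"): "BF", ("hypoplasia",): "BF"}
print(longest_match(["Hypoplasia", "of", "corpus", "callosum"], terms))
# -> [(0, 1, 'BF'), (2, 4, 'BF')]
```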
Proposed Model
• The merge step assigns the final entity label to each token in the corpus by applying the following rules to each source module's output:
  – Following Jimeno et al. [3], we combine the putative entity labels by collecting any entity-specific result proposed by at least one method.
  – Based on our ontological analysis of BF and GGP, a GGP can often form a fully embedded part of a BF mention → we apply a longest-span rule and give priority to BF over GGP.
  – If there is a boundary conflict, we merge neighbouring entity mentions that share parts of their token sequence.
    e.g. [AB]GGP and [BC]BF → [ABC]BF
[3] Jimeno, A., Jimenez-Ruiz, E., Lee, V., Gaudan, S., Berlanga, R., and Rebholz-Schuhmann, D. (2008). Assessment of disease named entity recognition on a corpus of annotated sentences. BMC Bioinformatics, 9(Suppl 3):S3
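The merge rules above (union of proposals, longest covering span on boundary conflicts, BF priority over GGP) can be sketched roughly as follows; `merge_spans` and the (start, end, type) span representation are assumptions for illustration:

```python
def merge_spans(spans):
    """Merge entity spans proposed by several labelers.

    Spans are (start, end_exclusive, type). Overlapping spans are merged
    into one covering span (longest-span rule), and BF takes priority
    over GGP when the overlapping spans disagree on type.
    """
    merged = []
    for s, e, t in sorted(spans):
        if merged and s < merged[-1][1]:        # boundary conflict
            ps, pe, pt = merged[-1]
            new_type = "BF" if "BF" in (pt, t) else t
            merged[-1] = (ps, max(pe, e), new_type)
        else:
            merged.append((s, e, t))
    return merged

# [AB]GGP and [BC]BF -> [ABC]BF  (tokens A=0, B=1, C=2)
print(merge_spans([(0, 2, "GGP"), (1, 3, "BF")]))
# -> [(0, 3, 'BF')]
```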
Evaluation
• Precision, recall and F1
• Matching is calculated using partial matching, i.e. a correct match is recorded when the span of text manually annotated in the gold-standard corpus and the span of text output as an entity by the NER tagger partially overlap (Kabiljo et al. [4])
• Compare the results on the Phenominer corpus and the KMR corpus
[4] Kabiljo, R., Clegg, A., and Shepherd, A. (2009). A realistic assessment of methods for extracting gene/protein interactions from free text. BMC Bioinformatics, 10(1):233
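Under partial matching, a predicted span counts as correct if it overlaps any gold span. A minimal sketch of the resulting precision/recall/F1 computation (the span representation and function name are illustrative):

```python
def prf_partial(gold, predicted):
    """Precision, recall and F1 under partial (overlap) matching.

    Spans are (start, end_exclusive). A predicted span is a true positive
    if it overlaps some gold span, and a gold span is recovered if some
    predicted span overlaps it.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 5), (10, 12)]
pred = [(2, 6), (20, 22)]        # first overlaps a gold span, second does not
print(prf_partial(gold, pred))   # -> (0.5, 0.5, 0.5)
```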
Evaluation
• KMR corpus
• Phenominer corpus
Discussion
• The results on the Phenominer corpus for Hybrid (F1: 74.26) on BF are very encouraging and, as we hoped, demonstrate the strength of combining a mildly context-sensitive ML approach with knowledge-base lookup.
• Current NE methods based on a state-of-the-art learning approach such as CRF seem well suited to non-complex NE types such as GGP, but may be less effective for complex entities such as BF.
• KMR corpus: the average phenotype mention length is 1.72 tokens; the longest term was 5 tokens: [hypoplasia of the corpus callosum]
• Phenominer corpus: the average BF is 2.89 tokens, with the longest at 16 tokens: [susceptibility to psoriasis (PS) and psoriatic arthritis (PSA), inflammatory diseases of the skin and joints]. The longest GGP in Phenominer also had 16 tokens: [chromosomes 1 (D1S235), 4 (D4S1647), 12 (D12S373), 16 (D16S403), and 17 (D17S1301)]
Work in progress
• The ML labeler should be simplified
• Add ANATOMY, ORGANISM, CHED and DISEASE
• More and more ambiguities:
  – ORGANISM vs BF
  – DISEASE vs BF
  – Etc.
• Priority list? DISEASE > BF > GGP > CHED > ANATOMY > ORGANISM
• Complex overlap between entities, e.g. [germ-free (GF) MYD88-negative NOD mice]
• Generic terms
• Golden corpus: small and contains overly complex phenotypes
GENETIC DOCUMENT CLASSIFICATION USING LPU WITH GRAPH-BASED METHOD
Presenter: Mai-Vu Tran
Working report on document classification
Introduction
• Text classification is the process of assigning predefined category labels to new documents, based on a classifier learnt from training examples
• A document may be labelled 'phenotype' if a phenotype appears many times, or if many phenotypes appear, in the document
Introduction
• Challenges:
  – How to choose 'good' positive data for training
    • An OMIM* record's type is 'Phenotype', 'Gene', 'Phenotype/Gene' or 'moved/removed'
    • But 2,094 of 7,295 OMIM 'phenotype' records do not have a clinical synopsis field → the OMIM record type is not completely reliable
  – Lack of negative data
  – Our goal is to process Medline data for human and mouse phenotypes. OMIM data is very different from Medline data → training the classifier on OMIM makes it hard to get good results when applying it to Medline data
(*) Online Mendelian Inheritance in Man: http://www.ncbi.nlm.nih.gov/omim
Data Preparation
• Build a 'good' and sufficiently large positive data set for training and testing
• Build a 'good' and as-large-as-possible negative data set for testing
• Take advantage of OMIM data (for human)
• Take advantage of the HPO/MP ontologies
Data Preparation
• Initial positive data set:
  – (1) Collect Medline/PMC abstracts containing human phenotypes
    • Extract PMIDs from the reference fields of 'phenotype' OMIM entries
    • Use the PMIDs to crawl abstracts from the PMC dataset and the Medline sample
    • Choose abstracts that contain at least one phenotype term from the HPO or from the *CS* field of the OMIM record referring to that abstract
  – (2) Collect Medline/PMC abstracts that contain at least one phenotype from the Mammalian Phenotype Ontology
  – Combine (1) and (2) into the initial positive dataset
• Good positive data set: re-check the initial positive data set
  – Use a supervised classifier trained on OMIM data to classify the initial positive data as 'phenotype' or 'non-phenotype'
Result: 103,966 abstracts
Data preparation
Data Preparation
• Build an as-large-as-possible negative data set
  – Use MetaMap to tag the Medline sample abstracts
  – Following Cohen et al. [6] and Khordad et al. [2], there are three semantic groups strongly related to phenotype: Disorder, Anatomy and Physiology
  – Choose Medline abstracts that do not contain terms belonging to these groups
Result: 47,043 abstracts
[6] Raphael Cohen, Avitan Gefen, Michael Elhadad, and Ohad S. Birk, “CSI-OMIM – Clinical Synopsis Search in OMIM”, BMC Bioinformatics, 2011; 12:65
Learning from Positive and Unlabeled data (LPU)
• Problem: Given a set of documents of a particular class P, and a large set U of mixed documents that contains documents from class P and other types of documents (N), identify the documents from class P in U. The key feature of this problem is that there is no labeled non-P document, which makes traditional machine learning techniques inapplicable, as they all need labeled documents of both classes [5]
• Some related works: S-EM (Liu et al., 2002), PEBL (Yu et al., 2002), Roc-SVM (Li & Liu, 2003)
[5] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, Philip S. Yu (2003), “Building Text Classifiers Using Positive and Unlabeled Examples”, ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining, p. 179
Learning from Positive and Unlabeled data (LPU)
• Step 1: Identify a set of reliable negative documents from the unlabeled set.
• Step 2: Build a set of classifiers by iteratively applying a classification algorithm, then select a good classifier from the set.
Step 1 finds some initial reliable negative examples from the unlabeled set; step 2 identifies more and more negative examples iteratively.
These two steps together can be seen as an iterative method of increasing the number of unlabeled examples classified as negative while keeping the positive examples correctly classified.
Learning from Positive and Unlabeled data (LPU)
• To examine the disadvantages of the original LPU, we constructed an LPU experiment with the Spy-EM/SVM algorithm
  – 5 unlabeled data sets: 20,000; 40,000; 60,000; 80,000 and 100,000 abstracts
  – 2 positive sets: 31,200 and 52,000 abstracts
Learning from Positive and Unlabeled data (LPU)
If we increase the amount of unlabeled data, the result does not improve; it can even degrade.
Disadvantage of the original LPU: within each iteration, LPU uses the positive and reliable-negative data to classify the unlabeled data, then adds the newly classified negatives to the reliable-negative set. Thus, by the time LPU converges, the negative set may have become too big.
Because of this disadvantage, increasing the amount of unlabeled data can widen the imbalance between positive and negative data, leading to lower results.
We need a learning method that keeps the training set balanced between positive and negative data. Our solution is to use label propagation, a graph-based learning method.
Proposed method
• Based on the LPU idea, with a 2-step strategy
• Step 1: Build context graph
• Step 2: Graph based method using Iterative propagation
Proposed method
• We propose a context-graph-based algorithm to identify the reliable negative set (step 1)
1.  Layer = Null;
2.  for each document di ∈ U
3.      for each document dj ∈ P
4.          if cosine(di, dj) < t then
5.              Layer = Layer ∪ {di};
6.              break;
7.  U = U − Layer;
8.  Loop
9.      count = 0;
10.     for each document di ∈ U
11.         for each document dj ∈ Layer
12.             if cosine(di, dj) < t then
13.                 Layer = Layer ∪ {di};
14.                 count = count + 1;
15.                 break;
16.     U = U − Layer;
17.     if U = {} or count = 0 then exit-loop;
t ∈ [0,1]; the ratio of reliable negatives to training positives is about 10–15%
Layer = the reliable negative set
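The first pass of the pseudocode above (lines 1–7) can be sketched in Python over bag-of-words vectors; the Counter representation, function names and threshold value are illustrative assumptions:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def initial_layer(P, U, t):
    """First pass of step 1: an unlabeled document joins the layer as soon
    as its cosine similarity to some positive document falls below t."""
    return [d for d in U if any(cosine(d, p) < t for p in P)]

P = [Counter({"phenotype": 2, "gene": 1})]
U = [Counter({"phenotype": 1, "gene": 1}),   # similar to P -> stays in U
     Counter({"stock": 1, "market": 1})]     # dissimilar -> joins the layer
print(initial_layer(P, U, 0.5))
# -> [Counter({'stock': 1, 'market': 1})]
```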
Proposed method
• Step 2 uses an iterative propagation algorithm
• This algorithm is similar to the iterative SVM algorithm proposed by Bing Liu et al. [5]
• Instead of SVM, we use label propagation
1.  Every document in P is assigned the class label 1;
2.  Every document in RN is assigned the class label −1;
3.  i = 1;
4.  Loop
5.      Use P and RN to train a SVM classifier Si;
6.      Classify Q using Si;
7.      Let the set of documents in Q that are classified as negative be W;
8.      if W = {} then exit-loop
9.      else Q = Q − W;
10.         RN = RN ∪ W;
11.         i = i + 1;
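The iterative expansion loop can be sketched with a pluggable classifier; below, a toy nearest-centroid classifier over one-dimensional "documents" stands in for SVM or label propagation (all names and data are illustrative, not the actual system):

```python
def iterative_expand(P, RN, Q, train, classify):
    """Step-2 loop: repeatedly train on P vs RN, move documents of Q that
    are classified negative into RN, and stop when none move.
    `train` and `classify` are pluggable (the slides swap SVM for label
    propagation)."""
    while True:
        model = train(P, RN)
        W = [d for d in Q if classify(model, d) < 0]
        if not W:                       # no new negatives found -> converged
            return RN
        Q = [d for d in Q if d not in W]
        RN = RN + W

# Toy nearest-centroid classifier: documents are plain numbers.
def train(P, RN):
    return (sum(P) / len(P), sum(RN) / len(RN))

def classify(model, d):
    cp, cn = model
    return 1 if abs(d - cp) <= abs(d - cn) else -1

# Positives cluster near 9-10, reliable negatives near 0-1; the unlabeled
# documents 0.4 and 2.0 are absorbed as negatives, 8.5 is left alone.
print(iterative_expand([9.0, 10.0], [0.0, 1.0], [0.4, 8.5, 2.0], train, classify))
# -> [0.0, 1.0, 0.4, 2.0]
```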
Experiments and evaluations
• Data
  – Demo data set*
    • Positive set: 665 documents
    • Unlabeled set: 8,367 documents
    • Test set: 3,871 documents
  – Our PMC data set (phenotype and non-phenotype)
    • Positive set: 5,200 documents
    • 5 unlabeled sets: 20,000; 40,000; 60,000; 80,000 and 100,000 documents (randomly chosen from PMC)
    • Test set: 145,660 documents
  – Reuters-21578 corpus**
    • Consists of 21,578 news stories that appeared on the Reuters newswire in 1987
    • 12,902 documents manually assigned to 135 categories
(*) Supplied by Bing Liu: http://www.cs.uic.edu/~liub/LPU/LPU-download.html
(**) This corpus was used for evaluating the LPU algorithm by Bing Liu et al. in [5]
Experiments and evaluations
• Demo data set
• PMC Phenotype data set
     NB/SVM  ROC/SVM  SPY/SVM  LPGraph
F1   83.78   90.12    86.13    91.995
Experiments and evaluations
• Reuters-21578 data set
  – Bing Liu et al. use only the 10 most populous classes
  – For our experiments, we use the same selection method: the 10 largest classes
                               NB/SVM  ROC/SVM  SPY/SVM  LPGraph
F1 (Bing Liu et al.'s paper)   86.5    86.7     86.5     -
F1 (our rebuilt system)        84.74   86.73    85.92    86.26
Discussion
• Using LPU resolves the lack of negative data in phenotype document classification
• LPU with a graph-based method brings better results than the original LPU
• A paper is being written for publication
References
[1] Scheuermann, R. H., Ceusters, W. and Smith, B. (2009). “Toward an ontological treatment of disease and diagnosis”. AMIA Summit on Translational Bioinformatics, pp. 116–120.
[2] Khordad, M., Mercer, R. E. and Rogan, P. (2011). “Improving Phenotype Name Recognition”. Canadian AI 2011, LNAI 6657, pp. 246–257.
[3] Jimeno, A., Jimenez-Ruiz, E., Lee, V., Gaudan, S., Berlanga, R. and Rebholz-Schuhmann, D. (2008). “Assessment of disease named entity recognition on a corpus of annotated sentences”. BMC Bioinformatics, 9(Suppl 3):S3.
[4] Kabiljo, R., Clegg, A. and Shepherd, A. (2009). “A realistic assessment of methods for extracting gene/protein interactions from free text”. BMC Bioinformatics, 10(1):233.
[5] Liu, B., Dai, Y., Li, X., Lee, W. S. and Yu, P. S. (2003). “Building Text Classifiers Using Positive and Unlabeled Examples”. ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining, p. 179.
[6] Cohen, R., Gefen, A., Elhadad, M. and Birk, O. S. (2011). “CSI-OMIM – Clinical Synopsis Search in OMIM”. BMC Bioinformatics, 12:65.
…
Thank you for your attention!