NII Internship Program Report Project: PHENOMINER
(Phenotype mining in genetic texts)
Presenter: Hoang-Quynh Le
Mai-Vu Tran
Hanoi, 6th October 2012
Main contents
• An Overview: Entity classes and PHENOMINER pipeline
• A hybrid approach to finding Phenotype candidates in genetic texts
• Genetic document classification using LPU with graph-based learning
Overview
Definitions: Phenotype, Symptom and Bodily Feature (BF)
• A BF mention is a bodily quality (normal or abnormal) in a human or mouse
• Scheuermann et al. [1] provide definitions for PHENOTYPE and SYMPTOM:
  – A PHENOTYPE is a BF or combination of BFs of an organism determined by the interaction of the organism's genes with its environment.
  – A SYMPTOM is a BF of a patient that is observed by the patient or a clinician and is suspected of being caused by a disease.
• BIG ambiguity!
[1] Scheuermann, R. H., Ceusters, W. and Smith, B. (2009), “Toward an ontological treatment of disease and diagnosis”, AMIA Summit on Translational Bioinformatics, pp. 116–120
Overview
An informal overview of entity classes and their roles
For more information: Nigel Collier, Mai-Vu Tran and Hoang-Quynh Le. PHENOMINER project working report on named entity annotation. Unpublished Technical Report – For Circulation to Authors and Collaborators, Draft version 1.10 (July 2012)
Overview
• PHENOMINER Pipeline
• Resources: – PubMed, PMC, UKPMC
– HPO, MPO
– PATO, FMA, CTO, etc.
• Pipeline stages: Journal articles → Source and clean text → Domain text classification → In-domain POS tagging → Domain parsing → Map to PAS structures → Named entity tagging → Merge → Zone identification → Novelty detection
Overview
Golden corpus
• 112 PMC abstracts were chosen based on a selection of 19 clinically significant auto-immune diseases from OMIM*
• Annotated according to the annotation guidelines
• Annotated by Yo Shidahara (an annotator of the GENIA corpus)
(*) Online Mendelian Inheritance in Man: http://www.ncbi.nlm.nih.gov/omim
Overview
A HYBRID APPROACH TO FINDING PHENOTYPE CANDIDATES IN GENETIC TEXTS
Presenter: Hoang-Quynh Le
Working report on Named Entity Recognition
Introduction
• Biomedical named entity recognition (NER) is a computational technique used to identify and classify strings of text (mentions) that designate important concepts in biomedicine.
• NER has been extensively studied for the names of genes, gene products, cells, chemical compounds and diseases, but there are few proposed solutions for phenotypes.
Introduction
• The challenges of phenotypes:
  – Until recently there has been little effort to provide data integration standards for phenotypes
    o Lack of standard nomenclatures
  – Complex structure:
    o Can include quantifications that are either specific or relative
    o Can include negation
    o Can be a noun, a noun phrase or even a clause
  – Traits and phenotypes can apply at all levels of anatomical granularity, from chemical structures to cells and organs
  – Normal vs. abnormal
  – Ambiguities between entities
  – Context sensitive
  – Etc.
• We employed two types of entity in this study: BF and GGP
Khordad et al. baseline model
• KMR corpus: 120 sentences from 4 full texts, with 110 annotated phenotype mentions
• Annotation was conducted with reference to the HPO: a term was tagged as a phenotype if it was in the HPO, or if it was not in the HPO but its definition showed that it was caused by a genotype
[2] Maryam Khordad, Robert E. Mercer, and Peter Rogan, “Improving Phenotype Name Recognition”, Canadian AI 2011, LNAI 6657, pp. 246–257, 2011
Khordad et al. baseline model
[2] Maryam Khordad, Robert E. Mercer, and Peter Rogan, “Improving Phenotype Name Recognition”, Canadian AI 2011, LNAI 6657, pp. 246–257, 2011
System                   Precision  Recall  F1
Khordad et al.'s system  97.58      88.32   92.71
Our rebuilt system       90.74      88.44   89.58
Proposed Model
Proposed Model
• Machine learning (ML) labelers
  – HMM vs. CRF
  – A BF ML labeler and a GGP ML labeler
  – Class labels for tokens follow the standard BIO scheme
1 Part-of-speech tags are assigned by training the GENIA tagger
2 MetaMap is a program developed by NLM to map biomedical text to the UMLS Metathesaurus: http://metamap.nlm.nih.gov
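The BIO scheme mentioned above marks each token as beginning (B-), inside (I-) or outside (O) an entity mention. A minimal sketch, with a hypothetical helper name and span representation:

```python
def bio_tags(tokens, entities):
    """Assign BIO labels to tokens given (start, end, type) entity spans.

    `entities` is a list of (start_index, end_index_exclusive, type).
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # remaining tokens of the mention
    return tags

# "hypoplasia of the corpus callosum" annotated as a single BF mention
tokens = ["hypoplasia", "of", "the", "corpus", "callosum", "was", "observed"]
print(bio_tags(tokens, [(0, 5, "BF")]))
# -> ['B-BF', 'I-BF', 'I-BF', 'I-BF', 'I-BF', 'O', 'O']
```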
Proposed Model
• Knowledge-based (KB) labelers
– The rule-matching labeler is an implementation of Khordad et al.'s approach, using MetaMap, the HPO and five staged heuristics to identify phenotypes
– The dictionary-matching labeler uses a longest-string-matching approach to recognise entities from the HPO (9,500 terms describing human phenotypes), the MPO1 (9,162 terms describing mouse phenotypes) and GGP entities from the NCBI gene list2 (9 million gene names)
1 Mammalian Phenotype Ontology (MPO): ftp://ftp.informatics.jax.org/pub/reports/index.html [downloaded July 10th 2012]
2 National Center for Biotechnology Information (NCBI) gene list: http://www.ncbi.nlm.nih.gov/gene [downloaded July 10th 2012]
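Longest-string matching can be sketched as a greedy scan that, at each token position, tries the longest dictionary term first. The function name and dictionary layout below are illustrative assumptions, not the actual implementation:

```python
def longest_match(tokens, dictionary):
    """Greedy longest-match lookup.

    `dictionary` maps lower-cased term token-tuples to entity types;
    returns (start, end_exclusive, type) spans.
    """
    max_len = max((len(term) for term in dictionary), default=0)
    spans, i = [], 0
    while i < len(tokens):
        # try the longest window first, shrinking until a term matches
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = tuple(w.lower() for w in tokens[i:i + n])
            if cand in dictionary:
                spans.append((i, i + n, dictionary[cand]))
                i += n
                break
        else:
            i += 1          # no term starts here; advance one token
    return spans

terms = {("corpus", "callosum"): "BF", ("hypoplasia",): "BF"}
print(longest_match(["Hypoplasia", "of", "corpus", "callosum"], terms))
# -> [(0, 1, 'BF'), (2, 4, 'BF')]
```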
Proposed Model
• The merge step assigns the final entity label to each token in the corpus by applying the following rules to each source module's output:
  – Following Jimeno et al. [3], we combine the putative entity labels by collecting any entity-specific result proposed by at least one method.
  – Based on our ontological analysis of BF and GGP, a GGP can often form a fully embedded part of a BF mention → we apply a longest-span rule and give priority to BF over GGP.
  – If there is a boundary conflict, we merge neighbouring entity mentions that share parts of their token sequence.
    e.g. [AB]GGP and [BC]BF → [ABC]BF
[3] Jimeno, A., Jimenez-Ruiz, E., Lee, V., Gaudan, S., Berlanga, R., and Rebholz-Schuhmann, D. (2008). Assessment of disease named entity recognition on a corpus of annotated sentences. BMC Bioinformatics, 9(Suppl 3):S3
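The merge rules above (union of proposals, longest covering span on boundary conflicts, BF priority over GGP) can be sketched roughly as follows; `merge_spans` and the (start, end, type) span representation are assumptions for illustration:

```python
def merge_spans(spans):
    """Merge entity spans proposed by several labelers.

    Spans are (start, end_exclusive, type). Overlapping spans are merged
    into one covering span (longest-span rule), and BF takes priority
    over GGP when the overlapping spans disagree on type.
    """
    merged = []
    for s, e, t in sorted(spans):
        if merged and s < merged[-1][1]:        # boundary conflict
            ps, pe, pt = merged[-1]
            new_type = "BF" if "BF" in (pt, t) else t
            merged[-1] = (ps, max(pe, e), new_type)
        else:
            merged.append((s, e, t))
    return merged

# [AB]GGP and [BC]BF -> [ABC]BF  (tokens A=0, B=1, C=2)
print(merge_spans([(0, 2, "GGP"), (1, 3, "BF")]))
# -> [(0, 3, 'BF')]
```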
Evaluation
• Precision, recall and F1
• Matching is calculated using partial matching, i.e. a correct match is recorded when the span of text manually annotated in the gold-standard corpus and the span of text output as an entity by the NER tagger partially overlap (Kabiljo et al. [4])
• Compare the results on the Phenominer corpus and the KMR corpus
[4] Kabiljo, R., Clegg, A., and Shepherd, A. (2009). A realistic assessment of methods for extracting gene/protein interactions from free text. BMC Bioinformatics, 10(1):233
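Under partial matching, a predicted span counts as correct if it overlaps any gold span. A minimal sketch of the resulting precision/recall/F1 computation (the span representation and function name are illustrative):

```python
def prf_partial(gold, predicted):
    """Precision, recall and F1 under partial (overlap) matching.

    Spans are (start, end_exclusive). A predicted span is a true positive
    if it overlaps some gold span, and a gold span is recovered if some
    predicted span overlaps it.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 5), (10, 12)]
pred = [(2, 6), (20, 22)]        # first overlaps a gold span, second does not
print(prf_partial(gold, pred))   # -> (0.5, 0.5, 0.5)
```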
Evaluation
• KMR corpus
• Phenominer corpus
Discussion
• The results on the Phenominer corpus for Hybrid (F1: 74.26) on BF are very encouraging and, as we hoped, demonstrate the strength of combining a mildly context-sensitive ML approach with knowledge-base lookup.
• Current NE methods based on a state-of-the-art learning approach such as CRF seem well suited to non-complex NE types such as GGP, but may be less effective for complex entities such as BF.
• KMR corpus: the average phenotype mention length is 1.72 tokens; the longest term was 5 tokens: [hypoplasia of the corpus callosum]
• Phenominer corpus: the average BF is 2.89 tokens, with the longest at 16 tokens: [susceptibility to psoriasis (PS) and psoriatic arthritis (PSA), inflammatory diseases of the skin and joints]. The longest GGP in Phenominer also had 16 tokens: [chromosomes 1 (D1S235), 4 (D4S1647), 12 (D12S373), 16 (D16S403), and 17 (D17S1301)]
Work in progress
• The ML labeler should be simplified
• Add ANATOMY, ORGANISM, CHED and DISEASE
• More and more ambiguities:
  – ORGANISM vs BF
  – DISEASE vs BF
  – Etc.
• Priority list? DISEASE > BF > GGP > CHED > ANATOMY > ORGANISM
• Complex overlap between entities, e.g. [germ-free (GF) MYD88-negative NOD mice]
• Generic terms
• Golden corpus: small and contains overly complex phenotypes
GENETIC DOCUMENT CLASSIFICATION USING LPU WITH GRAPH-BASED METHOD
Presenter: Mai-Vu Tran
Working report on document classification
Introduction
• Text classification is the process of assigning predefined category labels to new documents, based on a classifier learnt from training examples
• A document may be labelled 'phenotype' if a phenotype appears many times, or if many phenotypes appear, in the document
Introduction
• Challenges:
  – How to choose 'good' positive data for training
    • An OMIM* record's type is 'Phenotype', 'Gene', 'Phenotype/Gene' or 'moved/removed'
    • But 2,094 of 7,295 OMIM 'phenotype' records do not have a clinical synopsis field → the OMIM record type is not completely reliable
  – Lack of negative data
  – Our goal is to process Medline data for human and mouse phenotypes. OMIM data is very different from Medline data → training the classifier on OMIM makes it hard to get good results when applying it to Medline data
(*) Online Mendelian Inheritance in Man: http://www.ncbi.nlm.nih.gov/omim
Data Preparation
• Build a 'good' and sufficiently large positive data set for training and testing
• Build a 'good' and as-large-as-possible negative data set for testing
• Take advantage of OMIM data (for human)
• Take advantage of the HPO/MP ontologies
Data Preparation
• Initial positive data set:
  – (1) Collect Medline/PMC abstracts containing human phenotypes
    • Extract PMIDs from the reference fields of 'phenotype' OMIM entries
    • Use the PMIDs to crawl abstracts from the PMC dataset and the Medline sample
    • Choose abstracts that contain at least one phenotype term from the HPO or from the *CS* field of the OMIM record referring to that abstract
  – (2) Collect Medline/PMC abstracts that contain at least one phenotype from the Mammalian Phenotype Ontology
  – Combine (1) and (2) into the initial positive dataset
• Good positive data set: re-check the initial positive data set
  – Use a supervised classifier trained on OMIM data to classify the initial positive data as 'phenotype' or 'non-phenotype'
Result: 103,966 abstracts
Data preparation
Data Preparation
• Build an as-large-as-possible negative data set
  – Use MetaMap to tag the Medline sample abstracts
  – Following Cohen et al. [6] and Khordad et al. [2], there are three semantic groups strongly related to phenotype: Disorder, Anatomy and Physiology
  – Choose Medline abstracts that do not contain terms belonging to these groups
Result: 47,043 abstracts
[6] Raphael Cohen, Avitan Gefen, Michael Elhadad, and Ohad S. Birk, “CSI-OMIM – Clinical Synopsis Search in OMIM”, BMC Bioinformatics, 2011; 12:65
Learning from Positive and Unlabeled data (LPU)
• Problem: Given a set of documents of a particular class P, and a large set U of mixed documents that contains documents from class P and other types of documents (N), identify the documents from class P in U. The key feature of this problem is that there is no labeled non-P document, which makes traditional machine learning techniques inapplicable, as they all need labeled documents of both classes [5]
• Some related works: S-EM (Liu et al., 2002), PEBL (Yu et al., 2002), Roc-SVM (Li & Liu, 2003)
[5] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, Philip S. Yu (2003), “Building Text Classifiers Using Positive and Unlabeled Examples”, ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining, p. 179
Learning from Positive and Unlabeled data (LPU)
• Step 1: Identify a set of reliable negative documents from the unlabeled set.
• Step 2: Build a set of classifiers by iteratively applying a classification algorithm, then select a good classifier from the set.
Step 1 finds some initial reliable negative examples from the unlabeled set; step 2 identifies more and more negative examples iteratively.
These two steps together can be seen as an iterative method of increasing the number of unlabeled examples classified as negative while keeping the positive examples correctly classified.
Learning from Positive and Unlabeled data (LPU)
• To examine the disadvantages of the original LPU, we constructed an LPU experiment with the Spy-EM/SVM algorithm
  – 5 unlabeled data sets: 20,000; 40,000; 60,000; 80,000 and 100,000 abstracts
  – 2 positive sets: 31,200 and 52,000 abstracts
Learning from Positive and Unlabeled data (LPU)
If we increase the amount of unlabeled data, the result does not improve; it can even degrade.
Disadvantage of the original LPU: within each iteration, LPU uses the positive and reliable-negative data to classify the unlabeled data, then adds the newly classified negatives to the reliable-negative set. Thus, by the time LPU converges, the negative set may have become too big.
Because of this disadvantage, increasing the amount of unlabeled data can widen the imbalance between positive and negative data, leading to lower results.
We need a learning method that keeps the training set balanced between positive and negative data. Our solution is to use label propagation, a graph-based learning method.
Proposed method
• Based on the LPU idea, with a 2-step strategy
• Step 1: Build context graph
• Step 2: Graph based method using Iterative propagation
Proposed method
• We propose a context-graph-based algorithm to identify the reliable negative set (step 1)
1.  Layer = Null;
2.  for each document di ∈ U
3.      for each document dj ∈ P
4.          if cosine(di, dj) < t then
5.              Layer = Layer ∪ {di};
6.              break;
7.  U = U − Layer;
8.  Loop
9.      count = 0;
10.     for each document di ∈ U
11.         for each document dj ∈ Layer
12.             if cosine(di, dj) < t then
13.                 Layer = Layer ∪ {di};
14.                 count = count + 1;
15.                 break;
16.     U = U − Layer;
17.     if U = {} or count = 0 then exit-loop;
t ∈ [0,1]; the ratio of reliable negatives to training positives is about 10–15%
Layer = the reliable negative set
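The first pass of the pseudocode above (lines 1–7) can be sketched in Python over bag-of-words vectors; the Counter representation, function names and threshold value are illustrative assumptions:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def initial_layer(P, U, t):
    """First pass of step 1: an unlabeled document joins the layer as soon
    as its cosine similarity to some positive document falls below t."""
    return [d for d in U if any(cosine(d, p) < t for p in P)]

P = [Counter({"phenotype": 2, "gene": 1})]
U = [Counter({"phenotype": 1, "gene": 1}),   # similar to P -> stays in U
     Counter({"stock": 1, "market": 1})]     # dissimilar -> joins the layer
print(initial_layer(P, U, 0.5))
# -> [Counter({'stock': 1, 'market': 1})]
```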
Proposed method
• Step 2 uses an iterative propagation algorithm
• This algorithm is similar to the iterative SVM algorithm proposed by Bing Liu et al. [5]
• Instead of SVM, we use label propagation
1.  Every document in P is assigned the class label 1;
2.  Every document in RN is assigned the class label −1;
3.  i = 1;
4.  Loop
5.      Use P and RN to train a SVM classifier Si;
6.      Classify Q using Si;
7.      Let the set of documents in Q that are classified as negative be W;
8.      if W = {} then exit-loop
9.      else Q = Q − W;
10.         RN = RN ∪ W;
11.         i = i + 1;
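The iterative expansion loop can be sketched with a pluggable classifier; below, a toy nearest-centroid classifier over one-dimensional "documents" stands in for SVM or label propagation (all names and data are illustrative, not the actual system):

```python
def iterative_expand(P, RN, Q, train, classify):
    """Step-2 loop: repeatedly train on P vs RN, move documents of Q that
    are classified negative into RN, and stop when none move.
    `train` and `classify` are pluggable (the slides swap SVM for label
    propagation)."""
    while True:
        model = train(P, RN)
        W = [d for d in Q if classify(model, d) < 0]
        if not W:                       # no new negatives found -> converged
            return RN
        Q = [d for d in Q if d not in W]
        RN = RN + W

# Toy nearest-centroid classifier: documents are plain numbers.
def train(P, RN):
    return (sum(P) / len(P), sum(RN) / len(RN))

def classify(model, d):
    cp, cn = model
    return 1 if abs(d - cp) <= abs(d - cn) else -1

# Positives cluster near 9-10, reliable negatives near 0-1; the unlabeled
# documents 0.4 and 2.0 are absorbed as negatives, 8.5 is left alone.
print(iterative_expand([9.0, 10.0], [0.0, 1.0], [0.4, 8.5, 2.0], train, classify))
# -> [0.0, 1.0, 0.4, 2.0]
```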
Experiments and evaluations
• Data
  – Demo data set*
    • Positive set: 665 documents
    • Unlabeled set: 8,367 documents
    • Test set: 3,871 documents
  – Our PMC data set (phenotype and non-phenotype)
    • Positive set: 5,200 documents
    • 5 unlabeled sets: 20,000; 40,000; 60,000; 80,000 and 100,000 documents (randomly chosen from PMC)
    • Test set: 145,660 documents
  – Reuters-21578 corpus**
    • Consists of 21,578 news stories that appeared on the Reuters newswire in 1987
    • 12,902 documents manually assigned to 135 categories
(*) Supplied by Bing Liu: http://www.cs.uic.edu/~liub/LPU/LPU-download.html
(**) This corpus was used for evaluating the LPU algorithm by Bing Liu et al. in [5]
Experiments and evaluations
• Demo data set
• PMC Phenotype data set
     NB/SVM  ROC/SVM  SPY/SVM  LPGraph
F1   83.78   90.12    86.13    91.995
Experiments and evaluations
• Reuters-21578 data set
  – Bing Liu et al. use only the 10 most populous classes
  – For our experiments, we use the same selection method: the 10 largest classes
                               NB/SVM  ROC/SVM  SPY/SVM  LPGraph
F1 (Bing Liu et al.'s paper)   86.5    86.7     86.5     -
F1 (our rebuilt system)        84.74   86.73    85.92    86.26
Discussion
• Using LPU resolves the lack of negative data in phenotype document classification
• LPU with a graph-based method brings better results than the original LPU
• A paper is being written for publication
References
[1] Scheuermann, R. H., Ceusters, W. and Smith, B. (2009). “Toward an ontological treatment of disease and diagnosis”. AMIA Summit on Translational Bioinformatics, pp. 116–120.
[2] Khordad, M., Mercer, R. E. and Rogan, P. (2011). “Improving Phenotype Name Recognition”. Canadian AI 2011, LNAI 6657, pp. 246–257.
[3] Jimeno, A., Jimenez-Ruiz, E., Lee, V., Gaudan, S., Berlanga, R. and Rebholz-Schuhmann, D. (2008). “Assessment of disease named entity recognition on a corpus of annotated sentences”. BMC Bioinformatics, 9(Suppl 3):S3.
[4] Kabiljo, R., Clegg, A. and Shepherd, A. (2009). “A realistic assessment of methods for extracting gene/protein interactions from free text”. BMC Bioinformatics, 10(1):233.
[5] Liu, B., Dai, Y., Li, X., Lee, W. S. and Yu, P. S. (2003). “Building Text Classifiers Using Positive and Unlabeled Examples”. ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining, p. 179.
[6] Cohen, R., Gefen, A., Elhadad, M. and Birk, O. S. (2011). “CSI-OMIM – Clinical Synopsis Search in OMIM”. BMC Bioinformatics, 12:65.
…
Thank you for your attention!