Kevin Bretonnel Cohen, Ph.D.
Instructor, Department of Pharmacology, University of Colorado School of Medicine
Adjunct Assistant Professor, Department of Linguistics, University of Colorado at Boulder
[email protected]
http://compbio.ucdenver.edu/Hunter_lab/Coh
Biomedical natural language processing and text mining
Evaluation of NLP systems
•F-measure: “harmonic mean” of precision and recall
•Formal definition:
$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$
•Typical definition: β = 1, so…
Evaluation of NLP systems
•Typical definition:
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
•…or just F: β is usually assumed to be 1
Evaluation of NLP systems
•β allows you to weight precision and recall differently
–Increasing β weights recall more highly
–Decreasing β weights precision more highly
•Rarely used, but designated by value of β, e.g. $F_{0.5}$ or $F_2$
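The F-measure definitions above can be sketched in a few lines of Python. The function name `f_beta` and the zero-denominator convention are assumptions made for this illustration, and the precision/recall values in the loop are arbitrary.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    denominator = (beta ** 2) * precision + recall
    if denominator == 0:
        # Assumed convention: return 0 when precision and recall are both 0
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Arbitrary illustrative values: high precision, low recall
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(0.9, 0.3, beta):.3f}")
```

Because recall is the weaker number in this toy case, the score drops as β grows (recall weighted more) and rises as β shrinks (precision weighted more).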
Chang et al.’s improvement on PSI-BLAST (2001)
Ng (2006)
Significant improvement in precision
                     P     R
Standard PSI-BLAST   .84   .33
Chang et al.         .95   .32
Goal: Predict subcellular localization to understand function
•Signal peptides and other sequences are indicative of localization
•Machine learning based predictors are moderately accurate
•Try adding text…
Subcellular localization (Stapley et al. 2002, Eskin and Agichtein 2004)
Single SVM
Build specialized amino acid and text kernels, then build combined kernel
Ng (2006)
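A minimal sketch of the "combined kernel" idea, assuming a simple k-mer spectrum kernel for sequences and a bag-of-words kernel for text. These are stand-ins chosen for illustration, not the kernels used in the cited work; the function names, the `weight` parameter, and the toy proteins are all hypothetical. A non-negative weighted sum of two valid kernels is itself a valid kernel, so the result could be handed to a single SVM as a precomputed kernel.

```python
from collections import Counter


def spectrum_kernel(seq_a, seq_b, k=2):
    """Sequence kernel: dot product of k-mer count vectors."""
    kmers_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    kmers_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    return sum(count * kmers_b[kmer] for kmer, count in kmers_a.items())


def text_kernel(doc_a, doc_b):
    """Text kernel: dot product of word count vectors."""
    words_a = Counter(doc_a.lower().split())
    words_b = Counter(doc_b.lower().split())
    return sum(count * words_b[word] for word, count in words_a.items())


def combined_kernel(protein_a, protein_b, weight=0.5):
    """Weighted sum of the sequence kernel and the text kernel."""
    seq_k = spectrum_kernel(protein_a["sequence"], protein_b["sequence"])
    txt_k = text_kernel(protein_a["text"], protein_b["text"])
    return weight * seq_k + (1 - weight) * txt_k


# Hypothetical toy proteins: a sequence plus associated literature text
a = {"sequence": "MKKL", "text": "nuclear protein"}
b = {"sequence": "MKKA", "text": "nuclear membrane"}
print(combined_kernel(a, b))
```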
Text improves clustering of gene expression profiles, too
•Create per-gene distance matrices based on expression data
•Create per-gene distance matrices based on literature data
•Combine using Fisher’s omnibus
•…then cluster
Matrix merging (Glenisson et al. 2003)
Ng (2006)
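The combination step can be sketched with Fisher's omnibus statistic, which merges k independent p-values via -2 Σ ln p and compares the result to a chi-square distribution with 2k degrees of freedom. This sketch assumes each gene pair's expression-based and literature-based distances have already been converted to p-value-like scores in (0, 1]; how the cited work does that conversion is not shown here, and the function name is hypothetical.

```python
import math


def fisher_omnibus(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's method.

    The chi-square survival function has a closed form for even degrees
    of freedom (2k), so no statistics library is needed.
    """
    k = len(p_values)
    half_statistic = -sum(math.log(p) for p in p_values)  # statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_statistic / i
        total += term
    return math.exp(-half_statistic) * total


# Hypothetical scores for one gene pair: expression-based and literature-based
print(fisher_omnibus([0.05, 0.20]))
```

The combined scores for all gene pairs would then form the merged matrix that is passed to the clustering step.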
More sophisticated text analysis can improve these results
See the YouTube Hanalyzer demo for a better sense of the process
Leach et al. (2009)
APPLICATIONS
TextPresso
Chilibot (www.chilibot.net)
Chen and Sharp (2004)
iHop (http://www.ihop-net.org/UniPub/iHOP)
Reflect (www.reflect.ws)
•Firefox plug-in
•Recognises proteins and small molecules mentioned in a web page, and links them to information-rich summaries.
Karin Verspoor
Doms, A. et al. Nucl. Acids Res. 2005 33:W783-W786; doi:10.1093/nar/gki470
GoPubMed
BIOMEDICAL LANGUAGE PROCESSING
Surely Shuy jests...
“There is little reason for the data on which a linguist works to have the right to name that work.”
Tokenization is different
•Commas
– 2,6-diaminohexanoic acid
– tricyclo(3.3.1.1^{3,7})decanone
•Hyphens
– “Syntactic” (Calcium-dependent, Hsp-60)
– Knocked-out gene: lush-- flies
– Negation: -fever
– Electric charge: Cl-
•PMID: 10516078
– B-cell-CD4(+)-T-cell interactions
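To make the tokenization problem concrete, here is a small sketch comparing a naive punctuation-splitting tokenizer with a more conservative one that splits only on whitespace and peels off trailing sentence punctuation. Both functions are hypothetical illustrations, not a recommended biomedical tokenizer.

```python
import re


def naive_tokenize(text):
    """Treat every non-alphanumeric character as a separate token."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def conservative_tokenize(text):
    """Split on whitespace; detach only trailing . , ; : from each chunk."""
    tokens = []
    for chunk in text.split():
        core = chunk.rstrip(".,;:")
        if core:
            tokens.append(core)
        tokens.extend(chunk[len(core):])  # each trailing mark becomes a token
    return tokens


for example in ["2,6-diaminohexanoic acid", "B-cell-CD4(+)-T-cell interactions."]:
    print(naive_tokenize(example))
    print(conservative_tokenize(example))
```

The naive version shatters the chemical name into six tokens, while the conservative version keeps `2,6-diaminohexanoic` intact. Neither handles cases such as negation (`-fever`) versus charge (`Cl-`), which need domain knowledge.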
Named Entity Recognition is different
•Genes have names?
to, the, there, a, I, …
lot white
maggie
scott of the antarctic ring
always early -> british rail
asp -> cleopatra
tudor -> vasa -> gustavus
nanos -> smaug
pray for elves

sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A [SEMA5A]
Breast cancer 1 (BRCA1)
Ribosomal protein S27
p53
Heat shock protein 110
Mitogen activated protein kinase 15
Mitogen activated protein kinase kinase kinase 5
Karin Verspoor
It really is different on every level
•Corpus construction
•Semantic representation…
Ultimately, we need specific knowledge of the domain to do a good job with the language.
Linguistic Levels of Analysis
From Hunter & Cohen, Biomedical Language Processing: What’s Beyond PubMed?, Molecular Cell 21, 589–594, 2006 DOI 10.1016/j.molcel.2006.02.012
SUBTASKS AND TOOLS
Information Retrieval
•Retrieving from a collection of indexed documents
– Indices based on
•Calculate distance to all other documents in a collection (various metrics)
Karin Verspoor
Named entity recognition
HSP60
Hsp-60
heat shock protein 60
Cerberus
wingless
Ken and Barbie
the
Entity normalization
Entity normalization: find concepts in text and map them to unique identifiers
A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to 3-56.7 on the standard genetic map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.
•Perfect named entity recognition finds 5 mentions; they correspond to just 2 genes:
–FBgn0000592 (esterase 6)
–FBgn0026412 (leucine aminopeptidase)
Entity normalization
•Partial list of synonyms for FBgn0000592:
–Esterase 6
–Carboxyl ester hydrolase
–CG6917
–Est6
–Est-D
–Est-5
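A minimal sketch of dictionary-based entity normalization, using the two identifiers and the partial synonym list from these slides. The matching strategy (case-insensitive, tolerant of an optional space or hyphen between name parts) is an assumption made for illustration; real normalizers also have to deal with ambiguity and species, which this does not attempt. The text is a condensed version of the example abstract.

```python
import re

# Identifiers and synonyms as listed on the slides (partial list)
SYNONYMS = {
    "FBgn0000592": ["Esterase 6", "Carboxyl ester hydrolase", "CG6917",
                    "Est6", "Est-D", "Est-5"],
    "FBgn0026412": ["leucine aminopeptidase"],
}


def _pattern_for(name):
    """Regex for a name, allowing an optional space/hyphen between parts."""
    parts = re.findall(r"[A-Za-z]+|\d+", name)
    return r"\b" + r"[\s\-]?".join(re.escape(p) for p in parts) + r"\b"


def normalize_entities(text, synonyms=SYNONYMS):
    """Return (mention, identifier) pairs in order of appearance."""
    hits = []
    for gene_id, names in synonyms.items():
        for name in names:
            for m in re.finditer(_pattern_for(name), text, flags=re.IGNORECASE):
                hits.append((m.start(), m.group(0), gene_id))
    return [(mention, gene_id) for _, mention, gene_id in sorted(hits)]


ABSTRACT = (
    "A locus has been found, an allele of which causes a modification of some "
    "allozymes of the enzyme esterase 6 in Drosophila melanogaster. "
    "The locus responsible has been mapped to 3-56.7 on the standard genetic "
    "map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only "
    "leucine aminopeptidase is affected by the modifier locus. Neuraminidase "
    "incubations of homogenates altered the electrophoretic mobility of "
    "esterase 6 allozymes, but the mobility differences found are not large "
    "enough to conclude that esterase 6 is sialylated."
)

for mention, gene_id in normalize_entities(ABSTRACT):
    print(mention, "->", gene_id)
```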
Biological Nomenclature: “V-SNARE”
SNAP Receptor
Vesicle SNARE
V-SNARE
N-Ethylmaleimide-Sensitive Fusion Protein
Soluble NSF Attachment Protein
Maleic acid N-ethylimide
Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
(Alex Morgan, MITRE)
Information/relation extraction
Information extraction: relationships between things
BINDING_EVENT
Binder:
Bound:
Information/relation extraction
Met28 binds to DNA.
BINDING_EVENT
Binder: Met28
Bound: DNA
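Filling the BINDING_EVENT template can be sketched with a single surface pattern. The regular expression and function name below are hypothetical and cover only the literal phrasing "X binds to Y"; real extraction systems rely on entity recognition and syntactic analysis rather than one regex.

```python
import re

# Toy pattern: <token> binds to <token>, with optional trailing punctuation
BINDS_TO = re.compile(r"(\S+)\s+binds\s+to\s+(\S+?)[.,;]?(?=\s|$)")


def extract_binding_events(sentence):
    """Return one filled BINDING_EVENT frame per pattern match."""
    return [
        {"type": "BINDING_EVENT", "binder": m.group(1), "bound": m.group(2)}
        for m in BINDS_TO.finditer(sentence)
    ]


print(extract_binding_events("Met28 binds to DNA."))
```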
Document clustering
•For browsing large numbers of relevant documents
– In biomedicine, unlike most Google searches, the goal is not one relevant document, but many
•Statistical measures of document distance
– Cosine distance over term (or stem) vectors
– PubMed document neighbors (TF*IDF clustering)
– Latent Semantic Analysis (LSA)
•Knowledge-based approaches:
– Mapping documents to a predefined set of types
– Use information extraction as basis for clustering
Karin Verspoor
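The statistical distance measures above can be sketched as cosine distance over TF*IDF-weighted term vectors. Whitespace tokenization and the smoothed IDF formula are assumptions made for this illustration, and the three documents are invented; this is not PubMed's actual related-articles algorithm.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumption) keeps every weight positive
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        })
    return vectors


def cosine_distance(vec_a, vec_b):
    """1 - cosine similarity; 1.0 if either vector is empty."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


docs = [
    "p53 binds DNA",
    "p53 regulates apoptosis",
    "esterase 6 electrophoretic mobility",
]
vectors = tfidf_vectors(docs)
print(cosine_distance(vectors[0], vectors[1]))
print(cosine_distance(vectors[0], vectors[2]))
```

Documents that share terms end up closer than documents with no terms in common, which is the basis for grouping them into clusters.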
Automated summarization
•Useful for browsing retrieved documents
•Multidocument summarization can characterize document clusters
•Select the “best” sentence/passage
–Based on appearance of query terms (a la
Syntax helps
• 125I-labeled C3b was covalently deposited on CR2, when hemolytically active 125I-labeled C3 was added to Raji cells preincubated with iC3, factor B, properdin, and factor D, thus proving functionality of CR2-bound C3 convertase.
<cr2> BINDS <c3 convertase>
• CD8alpha(alpha) binds one HLA-A2/peptide molecule, interfacing with the alpha2 and alpha3 domains of HLA-A2 and also contacting beta2-microglobulin.
<cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule>
• The binding of 109Cd to metallothionein and the thiol density of the protein were determined after incubation of a purified Zn/Cd-metallothionein preparation with either hydrogen peroxide alone, or with a number of free radical generating systems.
<109cd> BINDS <metallothionein>
• Although these shifts in alpha3 may provide a synergistic modulation of affinity, the binding of CD8 to MHC is clearly consistent with an avidity-based contribution from CD8 to TCR- peptide-MHC interactions.
<Cd8> BINDS <major histocompatibility complex>
Larry Hunter
Coordination is particularly hard
In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA.
Purified recombinant NC1, like authentic NC1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex.
<authentic nc1> BINDS <laminin 5 / 6 complex>
<authentic nc1> BINDS <collagen type I>
<authentic nc1> BINDS <fibronectin>
<purified recombinant nc1> BINDS <laminin 5 / 6 complex>
<purified recombinant nc1> BINDS <collagen type I>
<purified recombinant nc1> BINDS <fibronectin>
The nonvisual arrestins, beta-arrestin and arrestin3, but not visual arrestin, bind specifically to a glutathione S-transferase-clathrin terminal domain fusion protein.
*<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
<beta arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
<nonvisual arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
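Once a parser has identified the coordinated subjects and objects, each combination yields its own relation. A minimal sketch of that expansion follows, assuming the conjuncts have already been found; finding them is the hard part and is not shown. The function name is hypothetical.

```python
from itertools import product


def expand_coordination(subjects, predicate, objects):
    """Distribute a predicate over every subject/object combination."""
    return [(s, predicate, o) for s, o in product(subjects, objects)]


relations = expand_coordination(
    ["purified recombinant nc1", "authentic nc1"],
    "BINDS",
    ["fibronectin", "collagen type I", "laminin 5 / 6 complex"],
)
for subject, predicate, obj in relations:
    print(f"<{subject}> {predicate} <{obj}>")
```

Two subjects and three objects give the six relations listed for the NC1 sentence. The arrestin sentence shows why this is not enough on its own: "but not visual arrestin" must be excluded, and "nonvisual arrestins" is a class label rather than a third binder.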
Documents as evidence of function or other relationships
• Co-occurrence statistics. How often are two or more genes (or other entities) mentioned in the same document?
– PubGene is a large database of co-occurrence statistics http://www.pubgene.org
• Functional coherence measure (Altman, et al)
– For each article mentioning a gene from a putatively functional group, score the article's relevance based on whether similar articles also mention genes in the group
– Compare the number of high scoring articles that a group generates to an expected number from random genes.
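Document-level co-occurrence counting can be sketched as follows, assuming gene mentions have already been recognized and normalized per document. The function name and the toy document list are hypothetical, and this is not how PubGene itself is implemented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(genes_per_document):
    """Count, for each gene pair, how many documents mention both genes."""
    counts = Counter()
    for genes in genes_per_document:
        # Deduplicate within a document and sort so each pair has one key
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


# Hypothetical gene mentions found in three documents
documents = [
    ["BRCA1", "p53"],
    ["p53", "BRCA1", "SEMA5A", "p53"],
    ["SEMA5A"],
]
for pair, n in cooccurrence_counts(documents).most_common():
    print(pair, n)
```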
Literature-based groupings combined with other data
• Using literature-based assessments of groupings or coherence can improve quality of other clustering tasks
– Chang, et al, use literature similarity measures to improve quality of PSI-BLAST searches for distant homologs
– Blaschke's GEISHA system associates clusters of genes from expression array experiments with MEDLINE abstracts, extracting keywords to annotate the gene clusters.
– Masys, et al, use UMLS to score subtrees of various hierarchical medical ontologies, based on how frequently genes in an expression array cluster are tied to them.
– Reasoning: infer additional knowledge and relate the knowledge to data
– Reporting: provide information that helps biologists explain the phenomena in their data and generate new hypotheses
More sophisticated text analysis can improve these results
See the YouTube Hanalyzer demo for a better sense of the process
Leach et al. (2009)
More projects than people
• Ongoing:
– Coreference resolution
– Software engineering perspectives on natural language processing
– Odd problems of full text
– Tuberculosis and translational medicine
– Discourse analysis annotation
– OpenDMAP
• In need of fresh blood:
– Metagenomics/Microbiome studies
– Translational medicine from the clinical side
– Summarization
– Negation
– Question-answering: Why?
– Nominalizations
– Metamorphic testing for natural language processing