BeeSpace Informatics Research ChengXiang (“Cheng”) Zhai Department of Computer Science Institute for Genomic Biology Statistics Graduate School of Library & Information Science University of Illinois at Urbana-Champaign BeeSpace Workshop, May 22, 2009 1
BeeSpace Informatics Research
ChengXiang (“Cheng”) Zhai
Department of Computer Science
Institute for Genomic Biology
Statistics
Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
BeeSpace Workshop, May 22, 2009 1
Goal of Informatics Research
• Develop general and scalable computational methods to enable
– Semantic integration of data and information
– Effective information access and exploration
– Knowledge discovery
– Hypothesis formulation and testing
• Reinforcement of research in biology and computer science
– CS research to automate manual tasks of biologists
– Biology research to raise new challenges for CS
2
Overview of BeeSpace Technology
[Architecture diagram. Components shown: Literature Text; Natural Language Understanding (Words/Phrases, Entities, Relations); Search Engine; Text Miner; Function Annotator; Gene Summarizer; Space/Region Manager, Navigation Support; Relational Database; Meta Data; Users. Other labels: Content Analysis; Information Access & Exploration; Knowledge Discovery & Hypothesis Testing; Question Answering.]
3
Informatics Research Accomplishments
[Same architecture diagram as on the “Overview of BeeSpace Technology” slide, annotated with the accomplishments below.]
Biomedical information retrieval [Jiang & Zhai 07], [Lu et al. 08]
Topic discovery and interpretation [Mei et al. 06a], [Mei et al. 07a], [Mei et al. 07b], [Chee & Schatz 08]
Entity/Gene Summarization [Ling et al. 06], [Ling et al. 07], [Ling et al. 08]
Automatic Function Annotation [He et al. 09/10]
4
Overview of BeeSpace Technology
[Same architecture diagram as on the earlier “Overview of BeeSpace Technology” slide, labeled with the four parts of the talk:]
Part 1. Information Extraction
Part 2. Navigation Support
Part 3. Entity Summarization
Part 4. Function Analysis
5
Part 1. Information Extraction
6
Natural Language Understanding
…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …
[Figure: the sentence is annotated with phrase chunks (NP, VP) and Gene entity labels.]
7
Entity & Relation Extraction
Genetic Interaction
Gene X    Gene Y
Bcd       hb
…         …

Expression Location
Gene X    Anatomy Y
Bcd       embryo
Hb        egg
…         …

…
8
Lopes FJ et al., 2005 J. Theor. Biol.
General Approach: Machine Learning
• Computers learn from labeled examples to compute a function to predict labels of new examples
• Examples of predictions
– Given a phrase, predict whether it is a gene name
– Given a sentence with two gene names mentioned, predict whether there is a genetic interaction relation
• Many learning methods are available, but training data isn’t always available
9
Extraction Example 1: Gene Name Recognition
… expression of terminal gap genes is mediated by the local activation of the Torso receptor tyrosine kinase (Tor). At the anterior, terminal gap genes are also activated by the Tor pathway but Bcd contributes to their activation.
• Contextual clues:
– Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
– Global: same noun phrase occurs several times in the same article
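The two kinds of contextual clues can be sketched as a small feature extractor. This is only an illustration: the function name `context_features`, the clue-word set, the window size, and the feature names are assumptions, not taken from the BeeSpace system.

```python
from collections import Counter

# Clue words taken from the slide's examples; the set itself is illustrative.
LOCAL_CLUE_WORDS = {"gene", "genes", "encoding", "regulation", "expressed"}


def context_features(tokens, index, article_tokens, window=2):
    """Return local and global clue features for the candidate tokens[index]."""
    candidate = tokens[index]
    features = {}
    # Local clues: clue words within `window` positions of the candidate.
    start = max(0, index - window)
    end = min(len(tokens), index + window + 1)
    for pos in range(start, end):
        word = tokens[pos].lower()
        if pos != index and word in LOCAL_CLUE_WORDS:
            features[f"w{pos - index:+d}={word}"] = 1
    # Global clue: how often the same candidate occurs in the whole article.
    counts = Counter(t.lower() for t in article_tokens)
    features["article_count"] = counts[candidate.lower()]
    return features


sentence = "terminal gap genes are also activated by the Tor pathway but Bcd contributes".split()
article = sentence + "Bcd regulates hunchback and Bcd is expressed anteriorly".split()
feats = context_features(sentence, sentence.index("Bcd"), article)
```

Here "Bcd" has no clue word within two tokens, so only the global count feature is produced; a phrase such as "the Bcd gene" would additionally yield a local feature like `w+1=gene`.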
11
Maximum Entropy Model for Gene Tagging
• Given an observation (a token or a noun phrase), together with its context, denoted as $x$
• Predict $y \in \{\text{gene}, \text{non-gene}\}$
• Maximum entropy model:
$P(y|x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$
• Typical $f$:
– y = gene & candidate phrase starts with a capital letter
– y = gene & candidate phrase contains digits
• Estimate $\lambda_i$ with training data
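A minimal sketch of the model above, assuming the two features listed on the slide plus a bias feature for the gene label (the bias is an addition for illustration). The weights are estimated by plain gradient ascent on the conditional log-likelihood, which is one common way to fit such a model and not necessarily the procedure used in BeeSpace. The toy labeled phrases are invented for the example.

```python
import math

LABELS = ("gene", "non-gene")


def feature_vector(phrase, label):
    """Binary features f_i(x, y); each fires only when y == 'gene'."""
    is_gene = label == "gene"
    return [
        1.0 if is_gene and phrase[:1].isupper() else 0.0,              # starts with a capital
        1.0 if is_gene and any(c.isdigit() for c in phrase) else 0.0,  # contains digits
        1.0 if is_gene else 0.0,                                       # bias (assumed extra feature)
    ]


def predict_proba(weights, phrase):
    """P(y|x) = K * exp(sum_i lambda_i * f_i(x, y)), with K normalizing over labels."""
    scores = {
        y: math.exp(sum(w * f for w, f in zip(weights, feature_vector(phrase, y))))
        for y in LABELS
    }
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


def train(examples, steps=500, rate=0.5):
    """Gradient ascent: observed feature counts minus expected feature counts."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for phrase, label in examples:
            probs = predict_proba(weights, phrase)
            observed = feature_vector(phrase, label)
            for i in range(len(weights)):
                expected = sum(probs[y] * feature_vector(phrase, y)[i] for y in LABELS)
                grad[i] += observed[i] - expected
        weights = [w + rate * g / len(examples) for w, g in zip(weights, grad)]
    return weights


# Toy training data, invented for illustration only.
TOY_DATA = [
    ("Bcd", "gene"), ("CD38", "gene"), ("PABPC5", "gene"), ("Tor", "gene"),
    ("pathway", "non-gene"), ("embryo", "non-gene"),
    ("expression", "non-gene"), ("anterior", "non-gene"),
]
weights = train(TOY_DATA)
p_gene = predict_proba(weights, "CD38")["gene"]
```

With these toy examples the capital-letter and digit weights become positive and the bias negative, so capitalized or digit-bearing candidates receive a higher gene probability than lower-case words.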
12
Special Challenges
• Gene name disambiguation
• Domain adaptation
13
Gene Name Disambiguation
• Gene names can be common English words: for (foraging), in (inturned), similar (sima), yellow (y), black (b)…
• Solution:
– Disambiguate by looking at the context of the candidate word
– Train a classifier
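One simple way to realize "train a classifier on the context" is a naive Bayes style log-odds score over neighboring words. The sketch below uses that technique as a stand-in; the slides do not say which classifier was used, and the function names and toy contexts are assumptions made for illustration. Class priors are taken to be equal.

```python
import math
from collections import Counter


def train_log_odds(labeled_contexts, smoothing=1.0):
    """Per-word log-odds: log P(w|gene) - log P(w|non-gene), with additive smoothing."""
    counts = {"gene": Counter(), "non-gene": Counter()}
    for words, label in labeled_contexts:
        counts[label].update(w.lower() for w in words)
    vocab = set(counts["gene"]) | set(counts["non-gene"])
    totals = {y: sum(c.values()) + smoothing * len(vocab) for y, c in counts.items()}
    return {
        w: math.log((counts["gene"][w] + smoothing) / totals["gene"])
        - math.log((counts["non-gene"][w] + smoothing) / totals["non-gene"])
        for w in vocab
    }


def score(log_odds, context_words):
    """Positive: context looks like a gene mention. Unseen words contribute 0."""
    return sum(log_odds.get(w.lower(), 0.0) for w in context_words)


# Toy labeled contexts (neighboring words of an ambiguous candidate), invented here.
TOY_CONTEXTS = [
    (["the", "gene", "encodes"], "gene"),
    (["mutants", "of", "the", "gene"], "gene"),
    (["function", "the", "a"], "non-gene"),
    (["locomotion", "and", "behaviors"], "non-gene"),
]
model = train_log_odds(TOY_CONTEXTS)
gene_like = score(model, ["gene", "encodes"])
plain_use = score(model, ["locomotion", "and"])
```

Words with large positive or negative log-odds play the role of discriminative neighbor words: they push the score for a candidate such as "for" or "black" toward the gene or the ordinary-English reading.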
14
Discriminative Neighbor Words
15
Sample Disambiguation Results
16
Target words “foraging”, “for”:
... affect complex behaviors such as locomotion and foraging. The foraging (for) gene encodes a pkg in drosophila melanogaster here we demonstrate a function for the for gene in sensory responsiveness and …
Scores shown on the slide, in order: -1.468, +3.359, +5.497, -0.582, +5.980

Target word “black”:
the cuticular melanization phenotype of black flies is rescued by beta-alanine but beta-alanine production by aspartate decarboxylation was reported to be normal in assays of black mutants and although …
Scores shown on the slide, in order: -2.780, +9.759
Nov 27, 2007 17
Problem of Domain Overfitting
[Figure: gene name recognizer performance. Ideal setting: 54.1%; realistic setting: 28.1%. Domain label: fly. Example gene names: wingless, daughterless, eyeless, apexless, …]
Solution: Learn Generalizable Features
…decapentaplegic and wingless are expressed in analogous patterns in each primordium of…
…that CD38 is expressed by both neurons and glial cells…
…that PABPC5 is expressed in fetal brain and in a range of adult tissues.
18
Generalizable Feature: “w+2 = expressed”
Generalizability-Based Feature Ranking
[Figure: the training data is shown as four separate rankings of features, each with positions 1 to 8; the example features “expressed” and “-less” appear at different positions in each ranking. A final row shows the values 0.125 for “expressed” and 0.167 for “-less”.]
19
20
Effectiveness of Domain Adaptation
Fly + Mouse → Yeast, standard learning: gene name recognizer 63.3%
Fly + Mouse → Yeast, domain adaptive learning: gene name recognizer 75.9%
More Results on Domain Adaptation
[Table with columns Exp, Method, Precision, Recall, F1; the rows are not recoverable from the transcript.]
• Text data from BioCreAtIvE (Medline)
• 3 organisms (Fly, Mouse, Yeast)
21
Extraction Example 2: Genetic Interaction Relation
22
Bcd regulates the expression of the maternal and zygotic gene hunchback (hb) that shows a step-like-function expression pattern, in the anterior half of the egg.
[Two mentions in the sentence are labeled Gene.]
Is there a genetic interaction relation here?
Challenges
• No/little training data
• What features to use?
23
Solution: Pseudo Training Data
24
Gene:
Bcd +
These results uncovered an antagonism between hunchback and bicoid at the anterior pole, whereas the two genes are known to act in concert for most anterior segmented development.
• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle
Word Distribution (language model)
Example documents
Meaningful labels
30
MAP: Topic/Region → Space
• MAP: Use the topic/region description as a query to search a given space
• Retrieval algorithm:
– Query word distribution: p(w|Q)
• Task: Given any entity and k aspects to summarize, generate a semi-structured summary
• Assumption: Training sentences available for each aspect
• Method:
– Train a recognizer for each aspect
– Given an entity, retrieve sentences relevant to the entity
– Classify each sentence into one of the k aspects
– Choose the best sentences in each category
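The four method steps can be sketched end to end as follows. This is a toy illustration under stated assumptions: each "recognizer" is just a bag-of-words profile of the aspect's training sentences, retrieval is exact token match on the entity name, and the aspect names, sentences, and function names are all invented for the example rather than taken from the BeeSpace gene summarizer.

```python
from collections import Counter


def train_aspect_recognizers(training_sentences):
    """Step 1: build one bag-of-words profile per aspect."""
    return {
        aspect: Counter(w for s in sentences for w in s.lower().split())
        for aspect, sentences in training_sentences.items()
    }


def aspect_score(profile, sentence):
    """Share of the aspect profile's word mass matched by the sentence's tokens."""
    total = sum(profile.values())
    return sum(profile[w] for w in sentence.lower().split()) / total


def summarize(entity, corpus, recognizers, per_aspect=1):
    # Step 2: retrieve sentences that mention the entity.
    relevant = [s for s in corpus if entity.lower() in s.lower().split()]
    # Step 3: classify each sentence into its best-scoring aspect.
    buckets = {aspect: [] for aspect in recognizers}
    for sentence in relevant:
        scores = {a: aspect_score(p, sentence) for a, p in recognizers.items()}
        best = max(scores, key=scores.get)
        buckets[best].append((scores[best], sentence))
    # Step 4: keep the top-scoring sentences in each category.
    return {
        a: [s for _, s in sorted(items, reverse=True)[:per_aspect]]
        for a, items in buckets.items()
    }


# Hypothetical aspects and sentences, for illustration only.
TRAINING = {
    "expression": ["the gene is expressed in the embryo", "expression is detected in the egg"],
    "interaction": ["the protein interacts with a kinase", "the gene regulates its target"],
}
CORPUS = [
    "bcd is expressed in the anterior embryo",
    "bcd regulates hunchback",
    "hb is expressed in the egg",
]
recognizers = train_aspect_recognizers(TRAINING)
summary = summarize("bcd", CORPUS, recognizers)
```

A realistic recognizer would remove stop words and use a trained classifier; the point here is only the shape of the pipeline: train per aspect, retrieve, classify, select.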
40
Further Generalizations
• Task: Given any entity and k pre-specified aspects to summarize, generate a semi-structured summary
• Assumption: Training sentences available for each aspect
• Method:
– Train a recognizer for each aspect
– Given an entity, retrieve sentences relevant to the entity
– Classify each sentence into one of the k aspects
– Choose the best sentences in each category
41
New method based on mixture model and regularized optimization
Part 4. Function Analysis
42
Annotating Gene Lists: GO Terms vs. Literature Mining
Limitations of GO annotations:
- Labor-intensive
- Limited coverage
Literature Mining:
- Automatic
- Flexible exploration in the entire literature space
Overview of Gene List Annotator
[Diagram. Components shown: Gene group (Bcd, Cad, …, Tll); Entrez Gene; Document sets (Bcd, Cad, Tll); Enriched concepts; Interactive analysis. Step labels: “For any gene: retrieve its relevant documents”; “For any term: test its significance”.]
Enriched concepts (with scores): Segmentation 56.0; Pattern 34.2; Cell_cycle 25.6; Development 22.1; Regulation 20.4; …
Intuition for Literature-based Annotation
Gene TPI1 GPM1 PGK1 TDH3 TDH2
protein_kinase 0 0 2 0 0
decarboxylase 10 0 10 7 6
protein 39 26 65 44 33
stationary_phase 2 7 3 4 2
energy_metabolism 4 5 5 8 0
oscillation 0 0 0 0 1
Likelihood Ratio Test with 2-Poisson Mixture Model
Dataset distribution: Poisson(λ;d)
Reference distribution: Poisson(λ0;d)
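A simplified sketch of the significance test for a single term. It assumes a plain Poisson model in which a term's count in a document set of length d is Poisson(λ·d), and compares the maximum-likelihood rate for the gene list against a fixed reference rate λ0. It does not implement the 2-Poisson mixture named on the slide; the reading of d as document length, the function name, the document lengths, and the reference rate are all assumptions made for illustration.

```python
import math


def poisson_lrt(term_counts, doc_lengths, reference_rate):
    """Likelihood ratio statistic 2 * (log L(mle_rate) - log L(reference_rate)).

    term_counts[i] is modeled as Poisson(rate * doc_lengths[i]).
    Returns (statistic, mle_rate).
    """
    total_count = sum(term_counts)
    total_length = sum(doc_lengths)
    mle_rate = total_count / total_length
    # Log-likelihood difference; the log-factorial terms cancel.
    log_ratio = -(mle_rate - reference_rate) * total_length
    if total_count > 0:
        log_ratio += total_count * math.log(mle_rate / reference_rate)
    return 2.0 * log_ratio, mle_rate


# Counts for "decarboxylase" over TPI1, GPM1, PGK1, TDH3, TDH2 from the table above.
counts = [10, 0, 10, 7, 6]
# Hypothetical document-set lengths and reference rate (not from the slides).
lengths = [1000, 1000, 1000, 1000, 1000]
statistic, rate = poisson_lrt(counts, lengths, reference_rate=0.001)
```

Under the null hypothesis the statistic is approximately chi-square distributed with one degree of freedom, so values well above 3.84 (the 95% quantile) suggest the term occurs at a different rate in the gene list's documents than in the reference.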
Agreement with GO-based Method
• Gene List: 93 genes up-regulated by the manganese treatment