BeeSpace Informatics Research ChengXiang (“Cheng”) Zhai Department of Computer Science Institute for Genomic Biology Statistics Graduate School of Library & Information Science University of Illinois at Urbana-Champaign BeeSpace Workshop, May 22, 2009 1
BeeSpace Informatics Research
ChengXiang (“Cheng”) Zhai
Department of Computer Science
Institute for Genomic Biology
Statistics
Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
BeeSpace Workshop, May 22, 2009
1
Goal of Informatics Research
• Develop general and scalable computational methods to enable
– Semantic integration of data and information
– Effective information access and exploration
– Knowledge discovery
– Hypothesis formulation and testing
• Reinforcement of research in biology and computer science
– CS research to automate manual tasks of biologists
– Biology research to raise new challenges for CS
2
Overview of BeeSpace Technology
[Diagram. Components: Literature Text; Natural Language Understanding; Words/Phrases, Entities, Relations; Search Engine; Text Miner; Relational Database; Meta Data; Function Annotator; Gene Summarizer; Space/Region Manager, Navigation Support; Users. Function labels: Content Analysis; Information Access & Exploration; Knowledge Discovery & Hypothesis Testing; Question Answering.]
3
Informatics Research Accomplishments
[Same diagram as “Overview of BeeSpace Technology”, shown with the accomplishments listed below.]
• Biomedical information retrieval: [Jiang & Zhai 07], [Lu et al. 08]
• Topic discovery and interpretation: [Mei et al. 06a], [Mei et al. 07a], [Mei et al. 07b], [Chee & Schatz 08]
• Entity/Gene Summarization: [Ling et al. 06], [Ling et al. 07], [Ling et al. 08]
• Automatic Function Annotation: [He et al. 09/10]
4
Overview of BeeSpace Technology
[Same diagram as “Overview of BeeSpace Technology”, labeled with the four parts of the talk.]
Part 1. Information Extraction
Part 2. Navigation Support
Part 3. Entity Summarization
Part 4. Function Analysis
5
Part 1. Information Extraction
6
Natural Language Understanding
…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …
[Figure: the sentence is annotated with noun-phrase (NP) and verb-phrase (VP) chunks and with Gene entity tags.]
7
Entity & Relation Extraction
Genetic Interaction
Gene X    Gene Y
Bcd       hb
…         …
Expression Location
Gene X    Anatomy Y
Bcd       embryo
Hb        egg
…         …
…
8
Lopes FJ et al., 2005 J. Theor. Biol.
General Approach: Machine Learning
• Computers learn from labeled examples to compute a function to predict labels of new examples
• Examples of predictions
– Given a phrase, predict whether it is a gene name
– Given a sentence with two gene names mentioned, predict whether there is a genetic interaction relation
• Many learning methods are available, but training data isn’t always available
9
Extraction Example 1: Gene Name Recognition
… expression of terminal gap genes is mediated by the local activation of the Torso receptor tyrosine kinase (Tor). At the anterior, terminal gap genes are also activated by the Tor pathway but Bcd contributes to their activation.
10
Gene?
Gene? Gene?
Features for Recognizing Genes
• Syntactic clues:
– Capitalization (especially acronyms)
– Numbers (gene families)
– Punctuation: -, /, :, etc.
• Contextual clues:
– Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
– Global: same noun phrase occurs several times in the same article
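The clues listed above can be turned into binary features for a candidate token. The following is a minimal sketch under stated assumptions: the function name `gene_features`, the cue-word set, and the window size are illustrative choices, not part of the BeeSpace system.

```python
import re

# Hypothetical cue words, taken from the examples on the slide.
CONTEXT_CUES = {"gene", "encoding", "regulation", "expressed"}

def gene_features(tokens, i, window=2):
    """Return binary features for the candidate token at position i."""
    tok = tokens[i]
    feats = {
        # Syntactic clues
        "starts_capital": tok[:1].isupper(),
        "all_caps": tok.isupper() and len(tok) > 1,
        "has_digit": any(c.isdigit() for c in tok),
        "has_punct": bool(re.search(r"[-/:]", tok)),
    }
    # Local contextual clue: a cue word within `window` tokens of the candidate.
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neighbors = {tokens[j].lower() for j in range(lo, hi) if j != i}
    feats["near_cue_word"] = bool(neighbors & CONTEXT_CUES)
    # Global contextual clue: the same token occurs more than once in the text.
    lowered = [t.lower() for t in tokens]
    feats["repeated_in_doc"] = lowered.count(tok.lower()) > 1
    return feats

tokens = "a cDNA encoding AMUSP and the AMUSP gene".split()
print(gene_features(tokens, 3))
```

A real recognizer would work on noun phrases as well as single tokens; the sketch only handles single tokens to keep the idea visible.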
11
Maximum Entropy Model for Gene Tagging
• Given an observation (a token or a noun phrase), together with its context, denoted as x
• Predict $y \in \{\text{gene}, \text{non-gene}\}$
• Maximum entropy model:
$P(y \mid x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, where K normalizes the probabilities over y
• Typical f:
– y = gene & candidate phrase starts with a capital letter
– y = gene & candidate phrase contains digits
• Estimate $\lambda_i$ with training data
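As a rough illustration of the model on this slide, the sketch below computes P(y|x) as a normalized exponential of weighted feature functions, using the two "typical f" examples. The weights are hypothetical placeholders rather than trained values, and the names `maxent_prob`, `f_capital`, and `f_digit` are assumptions made for this example.

```python
import math

LABELS = ("gene", "non-gene")

# Feature functions f_i(x, y): each fires only for the label "gene".
def f_capital(x, y):
    return 1.0 if y == "gene" and x[:1].isupper() else 0.0

def f_digit(x, y):
    return 1.0 if y == "gene" and any(c.isdigit() for c in x) else 0.0

FEATURES = (f_capital, f_digit)
WEIGHTS = (1.2, 0.8)  # hypothetical weights, one per feature; not trained

def maxent_prob(x, weights=WEIGHTS):
    """Return P(y|x) for each label as exp(weighted feature sum) / normalizer."""
    scores = {
        y: math.exp(sum(w * f(x, y) for w, f in zip(weights, FEATURES)))
        for y in LABELS
    }
    z = sum(scores.values())  # 1/z plays the role of the constant K
    return {y: s / z for y, s in scores.items()}

print(maxent_prob("CD38"))       # both features fire for y = gene
print(maxent_prob("expressed"))  # no feature fires, so the labels tie
```

In practice the weights would be estimated from labeled data, for example by maximizing the conditional likelihood of the training labels.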
12
Special Challenges
• Gene name disambiguation
• Domain adaptation
13
Gene Name Disambiguation
• Gene names can be common English words:
for (foraging), in (inturned), similar (sima), yellow (y), black (b)…
• Solution:
– Disambiguate by looking at the context of the candidate word
– Train a classifier
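One simple way to picture context-based disambiguation is a linear scorer over neighboring words. This is a sketch only: the weights below are invented for illustration (a trained classifier would learn them from labeled occurrences), and `gene_score` and `is_gene_name` are hypothetical names.

```python
# Hypothetical neighbor-word weights. Positive values favor the gene-name
# reading, negative values favor the ordinary English reading.
NEIGHBOR_WEIGHTS = {
    "gene": 2.0, "mutants": 1.5, "encodes": 1.0,
    "flies": -1.0, "behaviors": -1.0, "locomotion": -0.5,
}

def gene_score(tokens, i, window=3):
    """Sum the weights of words within `window` tokens of position i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return sum(
        NEIGHBOR_WEIGHTS.get(tokens[j].lower(), 0.0)
        for j in range(lo, hi)
        if j != i
    )

def is_gene_name(tokens, i, threshold=0.0):
    """Classify the candidate at position i as a gene name if its score is high."""
    return gene_score(tokens, i) > threshold

s1 = "the for gene in sensory responsiveness".split()
s2 = "phenotype of black flies is rescued".split()
print(is_gene_name(s1, 1), is_gene_name(s2, 2))
```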
14
Discriminative Neighbor Words
15
Sample Disambiguation Results
16
Classifier scores in brackets follow each candidate occurrence; positive scores indicate a gene-name reading.
“foraging”, “for”:
... affect complex behaviors such as locomotion and foraging [-1.468]. The foraging [+3.359] (for [+5.497]) gene encodes a pkg in drosophila melanogaster here we demonstrate a function for [-0.582] the for [+5.980] gene in sensory responsiveness and …
“black”:
the cuticular melanization phenotype of black [-2.780] flies is rescued by beta-alanine but beta-alanine production by aspartate decarboxylation was reported to be normal in assays of black [+9.759] mutants and although …
Nov 27, 2007 17
Problem of Domain Overfitting
• Ideal setting: gene name recognizer 54.1%
• Realistic setting: gene name recognizer 28.1%
• Fly gene names: wingless, daughterless, eyeless, apexless, …
Solution: Learn Generalizable Features
• …decapentaplegic and wingless are expressed in analogous patterns in each primordium of…
• …that CD38 is expressed by both neurons and glial cells…
• …that PABPC5 is expressed in fetal brain and in a range of adult tissues.
18
Generalizable Feature: “w+2 = expressed” (the word two positions after the candidate is “expressed”)
Generalizability-Based Feature Ranking
[Figure: features such as “expressed” and “-less” are ranked in positions 1–8 for each set of training data, then merged into a combined ranking in which “expressed” precedes “-less”, with the values 0.125 and 0.167 shown beneath it.]
19
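The slide does not state the scoring formula. The values 0.125 and 0.167 equal 1/8 and 1/6, which suggests reciprocal ranks, so the sketch below assumes (purely for illustration) that each feature is ranked within every training domain and keeps its worst reciprocal rank. Under that assumption a feature scores high only if it ranks well in all domains. The function name and the toy rankings are hypothetical.

```python
def generalizability_scores(domain_rankings):
    """Rank features by their worst reciprocal rank across domains.

    domain_rankings: list of feature lists, one per training domain,
    each ordered from most to least useful within that domain.
    """
    features = set().union(*domain_rankings)
    scores = {}
    for f in features:
        recip = []
        for ranking in domain_rankings:
            # A feature absent from a domain gets 0: no evidence it transfers.
            recip.append(1.0 / (ranking.index(f) + 1) if f in ranking else 0.0)
        scores[f] = min(recip)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy rankings: "-less" is strong in one domain only.
fly = ["-less", "expressed", "gene"]
mouse = ["expressed", "gene", "-less"]
print(generalizability_scores([fly, mouse]))
```

With these toy rankings, "expressed" comes out on top because it is ranked near the top in both domains, while "-less" is penalized for its low rank in the second domain.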
20
Effectiveness of Domain Adaptation
• Standard learning: Fly + Mouse → Yeast, gene name recognizer 63.3%
• Domain adaptive learning: Fly + Mouse → Yeast, gene name recognizer 75.9%
More Results on Domain Adaptation
Exp      Method     Precision  Recall   F1
F+M→Y    Baseline   0.557      0.466    0.508
         Domain     0.575      0.516    0.544
         % Imprv.   +3.2%      +10.7%   +7.1%
F+Y→M    Baseline   0.571      0.335    0.422
         Domain     0.582      0.381    0.461
         % Imprv.   +1.9%      +13.7%   +9.2%
M+Y→F    Baseline   0.583      0.097    0.166
         Domain     0.591      0.139    0.225
         % Imprv.   +1.4%      +43.3%   +35.5%
• Text data from BioCreAtIvE (Medline)
• 3 organisms (Fly, Mouse, Yeast)
21
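The “% Imprv.” rows are consistent with relative gains over the baseline. As a small worked check, the sketch below recomputes a few of them and shows how F1 follows from precision and recall. The function names are illustrative, and because the table's precision and recall are already rounded, a recomputed F1 can differ in the third decimal.

```python
def relative_improvement(baseline, adapted):
    """Percentage gain of the adapted score over the baseline score."""
    return 100.0 * (adapted - baseline) / baseline

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# M+Y→F row: recall rises from 0.097 to 0.139.
print(round(relative_improvement(0.097, 0.139), 1))
# F+M→Y row: F1 rises from 0.508 to 0.544.
print(round(relative_improvement(0.508, 0.544), 1))
# F1 recomputed from the rounded precision and recall of the M+Y→F Domain row.
print(round(f1(0.591, 0.139), 3))
```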
Extraction Example 2: Genetic Interaction Relation
22
Bcd regulates the expression of the maternal and zygotic gene hunchback (hb) that shows a step-like-function expression pattern, in the anterior half of the egg.
[Two mentions in the sentence are tagged “Gene”.]
Is there a genetic interaction relation here?
Challenges
• No/little training data
• What features to use?
23
Solution: Pseudo Training Data
24
Gene:
Bcd +
These results uncovered an antagonism between hunchback and bicoid at the anterior pole, whereas the two genes are
known to act in concert for most anterior segmented development.
• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle
Word Distribution (language model)
Example documents
Meaningful labels
30
MAP: Topic/Region → Space
• MAP: Use the topic/region description as a query to search a given space