

BeeSpace Informatics Research

ChengXiang (“Cheng”) Zhai

Department of Computer Science

Institute for Genomic Biology

Statistics

Graduate School of Library & Information Science

University of Illinois at Urbana-Champaign

BeeSpace Workshop, May 22, 2009

1

Goal of Informatics Research

• Develop general and scalable computational methods to enable
  – Semantic integration of data and information
  – Effective information access and exploration
  – Knowledge discovery
  – Hypothesis formulation and testing

• Reinforcement of research in biology and computer science
  – CS research to automate manual tasks of biologists
  – Biology research to raise new challenges for CS

2

Overview of BeeSpace Technology

[Figure: architecture diagram with the following labels: Literature Text; Search Engine; Words/Phrases, Entities, Relations; Natural Language Understanding; Users; Function Annotator; Space/Region Manager, Navigation Support; Gene Summarizer; Relational Database; Text Miner; Meta Data; Knowledge Discovery & Hypothesis Testing; Information Access & Exploration; Content Analysis; Question Answering.]

3

Informatics Research Accomplishments

[Figure: the BeeSpace technology architecture diagram again, annotated with the accomplishments listed below.]

Biomedical information retrieval [Jiang & Zhai 07], [Lu et al. 08]

Entity/Relation extraction [Jiang & Zhai 06], [Jiang & Zhai 07a], [Jiang & Zhai 07b]

Topic discovery and interpretation [Mei et al. 06a], [Mei et al. 07a], [Mei et al. 07b], [Chee & Schatz 08]

Entity/Gene Summarization [Ling et al. 06], [Ling et al. 07], [Ling et al. 08]

Automatic Function Annotation [He et al. 09/10]

4

Overview of BeeSpace Technology

[Figure: the BeeSpace technology architecture diagram again, overlaid with the four parts of the talk.]

Part 1. Information Extraction

Part 2. Navigation Support

Part 3. Entity Summarization

Part 4. Function Analysis

5

Part 1. Information Extraction

6

Natural Language Understanding

…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …

[Figure: the sentence annotated with phrase labels (NP, VP) and two spans tagged Gene.]

7

Entity & Relation Extraction

Genetic Interaction

Gene X    Gene Y
Bcd       hb
…         …

Expression Location

Gene X    Anatomy Y
Bcd       embryo
Hb        egg
…         …

…

8

Lopes FJ et al., 2005 J. Theor. Biol.

General Approach: Machine Learning

• Computers learn from labeled examples to compute a function to predict labels of new examples

• Examples of predictions
  – Given a phrase, predict whether it is a gene name
  – Given a sentence with two gene names mentioned, predict whether there is a genetic interaction relation

• Many learning methods are available, but training data isn’t always available

9

Extraction Example 1: Gene Name Recognition

… expression of terminal gap genes is mediated by the local activation of the Torso receptor tyrosine kinase (Tor). At the anterior, terminal gap genes are also activated by the Tor pathway but Bcd contributes to their activation.

10

[Figure callouts: three candidate phrases in the passage above are each marked “Gene?”]

Features for Recognizing Genes

• Syntactic clues:
  – Capitalization (especially acronyms)
  – Numbers (gene families)
  – Punctuation: -, /, :, etc.

• Contextual clues:
  – Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
  – Global: same noun phrase occurs several times in the same article
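The clues listed above could be turned into features roughly as follows. This is a minimal sketch, not the talk's implementation: the function name `candidate_features`, the cue-word set, the two-word window, and the use of a plain token count for the "global" clue are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical cue-word list, taken from the examples on the slide.
CONTEXT_CUES = {"gene", "encoding", "regulation", "expressed"}

def candidate_features(tokens, index, window=2):
    """Return a dict of simple features for the candidate token tokens[index]."""
    word = tokens[index]
    counts = Counter(t.lower() for t in tokens)
    lo, hi = max(0, index - window), min(len(tokens), index + window + 1)
    neighbors = [t.lower() for i, t in enumerate(tokens[lo:hi], start=lo) if i != index]
    return {
        # Syntactic clues
        "starts_upper": word[:1].isupper(),
        "all_caps": word.isupper() and len(word) > 1,
        "has_digit": any(c.isdigit() for c in word),
        "has_punct": bool(re.search(r"[-/:]", word)),
        # Contextual clues: local cue words, and how often the token recurs
        "cue_nearby": any(n in CONTEXT_CUES for n in neighbors),
        "doc_frequency": counts[word.lower()],
    }

tokens = "terminal gap genes are also activated by the Tor pathway but Bcd contributes".split()
features = candidate_features(tokens, tokens.index("Bcd"))
print(features)
```

A real recognizer would work on noun phrases rather than single tokens and would count repeated phrases across the whole article.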

11

Maximum Entropy Model for Gene Tagging

• Given an observation (a token or a noun phrase), together with its context, denoted as x

• Predict y ∈ {gene, non-gene}

• Maximum entropy model:

$$P(y \mid x) = K \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$

where K is a normalizing constant

• Typical f:
  – y = gene & candidate phrase starts with a capital letter
  – y = gene & candidate phrase contains digits

• Estimate λ_i with training data
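To make the model concrete, here is a small sketch that evaluates P(y|x) for the two example features on the slide. The weights are hypothetical placeholders, not values estimated from data, and the names `feature_vector` and `maxent_probability` are illustrative only.

```python
import math

LABELS = ("gene", "non-gene")

def feature_vector(x, y):
    """f_i(x, y): indicator features that fire only when y == 'gene'."""
    is_gene = y == "gene"
    return [
        1.0 if is_gene and x[:1].isupper() else 0.0,               # starts with a capital letter
        1.0 if is_gene and any(c.isdigit() for c in x) else 0.0,   # contains digits
    ]

def maxent_probability(x, weights):
    """Return {y: P(y | x)} with P(y | x) = K * exp(sum_i weight_i * f_i(x, y))."""
    scores = {
        y: math.exp(sum(w * f for w, f in zip(weights, feature_vector(x, y))))
        for y in LABELS
    }
    normalizer = sum(scores.values())  # this is 1 / K
    return {y: s / normalizer for y, s in scores.items()}

weights = [1.2, 0.8]  # hypothetical weights, not estimated from training data
probs = maxent_probability("CD38", weights)
print(probs)
```

In practice the weights would be fitted by maximizing the conditional likelihood of labeled examples; that training step is omitted here.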

12

Special Challenges

• Gene name disambiguation

• Domain adaptation

13

Gene Name Disambiguation

• Gene names can be common English words: for (foraging), in (inturned), similar (sima), yellow (y), black (b)…

• Solution:
  – Disambiguate by looking at the context of the candidate word
  – Train a classifier
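One way to picture context-based disambiguation is a scorer that sums per-word weights over the neighbors of the candidate. The sketch below uses smoothed Naive-Bayes-style log-odds as a stand-in; the talk does not say which classifier was trained, and the training sentences here are invented toy examples.

```python
import math
from collections import Counter

def context_words(tokens, index, window=2):
    lo, hi = max(0, index - window), min(len(tokens), index + window + 1)
    return [t.lower() for i, t in enumerate(tokens[lo:hi], start=lo) if i != index]

def train_log_odds(examples, window=2):
    """examples: list of (tokens, index, is_gene). Returns {neighbor word: log-odds}."""
    gene_counts, other_counts = Counter(), Counter()
    for tokens, index, is_gene in examples:
        target = gene_counts if is_gene else other_counts
        target.update(context_words(tokens, index, window))
    vocab = set(gene_counts) | set(other_counts)
    gene_total = sum(gene_counts.values()) + len(vocab)    # add-one smoothing
    other_total = sum(other_counts.values()) + len(vocab)
    return {
        w: math.log((gene_counts[w] + 1) / gene_total)
        - math.log((other_counts[w] + 1) / other_total)
        for w in vocab
    }

def score(tokens, index, weights, window=2):
    """Positive: the context looks like a gene mention; negative: an ordinary word."""
    return sum(weights.get(w, 0.0) for w in context_words(tokens, index, window))

# Tiny invented training set, for illustration only.
train = [
    ("the for gene encodes a kinase".split(), 1, True),
    ("mutants of the black gene show".split(), 3, True),
    ("a function for sensory responsiveness".split(), 2, False),
    ("phenotype of black flies is".split(), 2, False),
]
weights = train_log_odds(train)
gene_score = score("the yellow gene encodes pigment".split(), 1, weights)
word_score = score("a function for black flies".split(), 2, weights)
print(gene_score, word_score)
```

Neighbor words with large positive or negative weights play the role of discriminative context words.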

14

Discriminative Neighbor Words

15

Sample Disambiguation Results

16

Candidates: “foraging”, “for” (score in brackets after each occurrence)

... affect complex behaviors such as locomotion and foraging [-1.468]. The foraging [+3.359] (for [+5.497]) gene encodes a pkg in drosophila melanogaster here we demonstrate a function for [-0.582] the for [+5.980] gene in sensory responsiveness and …

Candidate: “black” (score in brackets after each occurrence)

the cuticular melanization phenotype of black [-2.780] flies is rescued by beta-alanine but beta-alanine production by aspartate decarboxylation was reported to be normal in assays of black [+9.759] mutants and although …

Nov 27, 2007 17

Problem of Domain Overfitting

Ideal setting: gene name recognizer 54.1%

Realistic setting: gene name recognizer 28.1%

Fly gene names: wingless, daughterless, eyeless, apexless, …

Solution: Learn Generalizable Features

…decapentaplegic and wingless are expressed in analogous patterns in each primordium of…

…that CD38 is expressed by both neurons and glial cells…

…that PABPC5 is expressed in fetal brain and in a range of adult tissues.

18

Generalizable Feature: “w+2 = expressed”

Generalizability-Based Feature Ranking

[Figure: several ranked feature lists derived from the training data (ranks 1 to 8), each containing the features “expressed” and “-less” at different positions, combined into one final ranking in which the scores 0.125 and 0.167 appear.]

19
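The slide does not spell out how the per-domain rankings are combined. One reading that fits the two scores shown (0.125 = 1/8 and 0.167 = 1/6) is to score each feature by its worst reciprocal rank across domains, so that a feature must rank well everywhere to score well. The sketch below assumes that rule; the per-domain ranks are hypothetical values chosen only to reproduce those two numbers.

```python
def generalizability_score(ranks_by_domain):
    """Score a feature by its worst reciprocal rank across domains (assumed rule)."""
    return min(1.0 / rank for rank in ranks_by_domain)

# Hypothetical per-domain ranks (1 = most useful feature in that domain).
feature_ranks = {
    "w+2=expressed": [6, 4, 4, 5],
    "suffix=-less": [3, 8, 7, 8],
}

scores = {name: generalizability_score(ranks) for name, ranks in feature_ranks.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

Under this rule a domain-specific cue such as the “-less” suffix is pushed down because it ranks poorly in at least one domain, while a cue such as “expressed” stays near the top.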

20

Effectiveness of Domain Adaptation

Standard learning: Fly + Mouse → Yeast, gene name recognizer 63.3%

Domain adaptive learning: Fly + Mouse → Yeast, gene name recognizer 75.9%

More Results on Domain Adaptation

Exp      Method     Precision   Recall    F1
F+M→Y    Baseline   0.557       0.466     0.508
         Domain     0.575       0.516     0.544
         % Imprv.   +3.2%       +10.7%    +7.1%
F+Y→M    Baseline   0.571       0.335     0.422
         Domain     0.582       0.381     0.461
         % Imprv.   +1.9%       +13.7%    +9.2%
M+Y→F    Baseline   0.583       0.097     0.166
         Domain     0.591       0.139     0.225
         % Imprv.   +1.4%       +43.3%    +35.5%

• Text data from BioCreAtIvE (Medline)
• 3 organisms (Fly, Mouse, Yeast)

21
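As a reading aid, the F1 and improvement figures can be rechecked from the reported precision and recall. The sketch assumes F1 is the usual harmonic mean and that "% Imprv." is relative improvement over the baseline; because the inputs are already rounded to three decimals, recomputed F1 values agree with the reported ones only to within about 0.001.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def relative_improvement(baseline, adapted):
    """Improvement of `adapted` over `baseline`, in percent."""
    return 100.0 * (adapted - baseline) / baseline

# (experiment, baseline P, baseline R, reported F1, domain P, domain R, reported F1)
rows = [
    ("F+M->Y", 0.557, 0.466, 0.508, 0.575, 0.516, 0.544),
    ("F+Y->M", 0.571, 0.335, 0.422, 0.582, 0.381, 0.461),
    ("M+Y->F", 0.583, 0.097, 0.166, 0.591, 0.139, 0.225),
]

for name, bp, br, bf, dp, dr, df in rows:
    print(
        f"{name}: baseline F1 {f1(bp, br):.3f} (reported {bf}), "
        f"domain F1 {f1(dp, dr):.3f} (reported {df}), "
        f"recall improvement {relative_improvement(br, dr):+.1f}%"
    )
```

The largest relative gains come from recall, most visibly in the M+Y→F row where baseline recall is very low.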

Extraction Example 2: Genetic Interaction Relation

22

Bcd regulates the expression of the maternal and zygotic gene hunchback (hb) that shows a step-like-function expression pattern, in the anterior half of the egg.

[Two mentions in the sentence are tagged “Gene”.]

Is there a genetic interaction relation here?

Challenges

• No/little training data

• What features to use?

23

Solution: Pseudo Training Data

24

Gene:

Bcd +

These results uncovered an antagonism between hunchback and bicoid at the anterior pole, whereas the two genes are known to act in concert for most anterior segmented development.

Pseudo Training Data Works Reasonably Well

25

[Figure: precision-recall plot comparing feature sets.]

Using all features works the best

Large-Scale Entity/Relation Extraction

• Entity annotation

Entity Type   Resource            Method
Gene          NCBI, FlyBase, …    Dictionary string search + machine learning
Anatomy       FlyBase             Dictionary string search
Chemical      MeSH, Biosis, …     Dictionary string search
Behavior                          “x x behavior” pattern search

• Relation extraction

Relation Type    Method
Regulatory       Pre-defined pattern + machine learning
Expressed In     Co-occurrence + relevant keywords
Gene Behavior    Co-occurrence
Gene Chemical    Co-occurrence
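The two simplest methods named above, dictionary string search and sentence-level co-occurrence, can be sketched in a few lines. The mini-dictionaries and function names below are hypothetical; the talk's system draws its dictionaries from resources such as NCBI, FlyBase and MeSH, and this sketch only matches single-word terms.

```python
import re
from itertools import product

# Hypothetical mini-dictionaries, for illustration only.
DICTIONARIES = {
    "Gene": {"bcd", "hunchback", "hb"},
    "Anatomy": {"egg", "embryo"},
}

def annotate(sentence):
    """Dictionary string search: return {entity_type: set of matched terms}."""
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return {etype: terms & words for etype, terms in DICTIONARIES.items()}

def cooccurrence_pairs(sentence, type_a, type_b):
    """Pair every type_a mention with every type_b mention in the same sentence."""
    found = annotate(sentence)
    return {(a, b) for a, b in product(found[type_a], found[type_b]) if a != b}

# Shortened from the example sentence used earlier in the talk.
sentence = ("Bcd regulates the expression of the maternal and zygotic gene "
            "hunchback (hb) in the anterior half of the egg.")
pairs = cooccurrence_pairs(sentence, "Gene", "Anatomy")
print(sorted(pairs))
```

Co-occurrence alone over-generates, which is presumably why the table adds keywords or learned patterns for the Expressed In and Regulatory relations.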

53

Part 2: Semantic Navigation

27

Space-Region Navigation

[Figure: navigation between literature spaces and topic regions.]
• Literature Spaces: Bee, Fly, Bird, Behavior, …
• Topic Regions: Bee Forager, Bird Singing, Fly Rover, …
• Operations: MAP, EXTRACT, SWITCHING; Intersection, Union, …
• User collections: My Spaces, My Regions/Topics

28

General Approach: Language Models

• Topic = word distribution

• Modeling text in a space with mixture models of multinomial distributions

• Text Mining = Parameter Estimation + Inferences

• Matching = Compute similarity between word distributions

• Users can “control” a model by specifying topic preferences

29

A Sample Topic & Corresponding Space

Word distribution (language model):

filaments 0.0410238, muscle 0.0327107, actin 0.0287701, z 0.0221623, filament 0.0169888, myosin 0.0153909, thick 0.00968766, thin 0.00926895, sections 0.00924286, er 0.00890264, band 0.00802833, muscles 0.00789018, antibodies 0.00736094, myofibrils 0.00688588, flight 0.00670859, images 0.00649626

Meaningful labels:

actin filaments; flight muscle; flight muscles

Example documents:

• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle

30

MAP: Topic/Region → Space

• MAP: Use the topic/region description as a query to search a given space

• Retrieval algorithm:
  – Query word distribution: p(w|Q)
  – Document word distribution: p(w|D)
  – Score a document based on similarity of Q and D

$$\mathrm{score}(Q, D) = -D(Q \,\|\, D) = -\sum_{w \in \text{Vocabulary}} p(w \mid Q) \log \frac{p(w \mid Q)}{p(w \mid D)}$$

• Leverage existing retrieval toolkits: Lemur/Indri
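A toy version of this scoring, written without any retrieval toolkit, might look like the sketch below. It assumes the score is the negative KL divergence between the query and document word distributions, and it smooths each document model by interpolating with a collection-wide background model so that no query word has zero probability; the smoothing scheme and the weight `mu` are assumptions, not details from the talk.

```python
import math
from collections import Counter

def language_model(text):
    """Maximum-likelihood unigram distribution over whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def smoothed(doc_model, background, mu=0.5):
    """Interpolate a document model with a background model (assumed smoothing)."""
    return lambda w: (1 - mu) * doc_model.get(w, 0.0) + mu * background.get(w, 0.0)

def kl_score(query_model, doc_prob):
    """score(Q, D) = -sum_w p(w|Q) * log(p(w|Q) / p(w|D)); higher is more similar."""
    return -sum(p * math.log(p / doc_prob(w)) for w, p in query_model.items() if p > 0)

docs = {
    "d1": "actin filaments in flight muscle",
    "d2": "foraging behavior of honey bee workers",
}
background = language_model(" ".join(docs.values()))
query = language_model("flight muscle filaments")
scores = {
    name: kl_score(query, smoothed(language_model(text), background))
    for name, text in docs.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

Here the query words must occur somewhere in the collection; a production system would handle unseen words and would rely on an indexed toolkit rather than scanning every document.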

31

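EXTRACT: Space → Topic/Region

• Assume k topics, each being represented by a word distribution

• Use a k-component mixture model to fit the documents in a given space (EM algorithm)

• The estimated k component word distributions are taken as k topic regions

Likelihood:

$$\log p(C \mid \Lambda) = \sum_{D \in C} \sum_{i=1}^{|D|} \log\Big[\lambda\, p(D_i \mid \theta_B) + (1 - \lambda) \sum_{j=1}^{k} \pi_j\, p(D_i \mid \theta_j)\Big]$$

where C is the collection of documents in the space, D_i is the i-th word of document D, θ_B is a background word distribution, θ_j is the word distribution of topic j, π_j is the mixing weight of topic j, λ is the background weight, and Λ denotes all parameters.

Maximum likelihood estimator: $\Lambda^* = \arg\max_{\Lambda} p(C \mid \Lambda)$

Bayesian estimator: $\Lambda^* = \arg\max_{\Lambda} p(\Lambda \mid C) = \arg\max_{\Lambda} p(C \mid \Lambda)\, p(\Lambda)$

32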
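The EM fitting step can be illustrated with a small pure-Python sketch of a k-topic mixture plus a fixed background model. Several choices here are assumptions rather than details from the talk: the background is the collection's own word distribution, its weight `lam` is fixed, topic mixing weights are allowed to vary per document (as in PLSA-style models), and initialization is random.

```python
import math
import random
from collections import Counter

def normalize(weights):
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}

def fit_topics(docs, k, lam=0.5, iterations=30, seed=0):
    """EM for a k-topic mixture with a fixed background model.

    docs: list of token lists. Returns (topics, doc_weights, log_likelihoods).
    """
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for c in counts for w in c})
    total = sum(sum(c.values()) for c in counts)
    background = {w: sum(c[w] for c in counts) / total for w in vocab}
    topics = [normalize({w: rng.random() + 0.1 for w in vocab}) for _ in range(k)]
    doc_weights = [[1.0 / k] * k for _ in docs]
    history = []
    for _ in range(iterations):
        new_topics = [Counter() for _ in range(k)]
        log_likelihood = 0.0
        for d, c in enumerate(counts):
            new_pi = [0.0] * k
            for w, n in c.items():
                topic_part = [doc_weights[d][j] * topics[j][w] for j in range(k)]
                mix = lam * background[w] + (1 - lam) * sum(topic_part)
                log_likelihood += n * math.log(mix)
                # E-step: expected number of occurrences of w in d drawn from topic j
                for j in range(k):
                    expected = n * (1 - lam) * topic_part[j] / mix
                    new_pi[j] += expected
                    new_topics[j][w] += expected
            # M-step for this document's mixing weights
            doc_weights[d] = [v / sum(new_pi) for v in new_pi]
        # M-step for the topic word distributions (tiny constant avoids zeros)
        topics = [normalize({w: t[w] + 1e-12 for w in vocab}) for t in new_topics]
        history.append(log_likelihood)
    return topics, doc_weights, history

docs = [
    "foraging nectar pollen foraging flower".split(),
    "nectar flower foraging pollen".split(),
    "muscle actin myosin filament muscle".split(),
    "actin filament muscle myosin".split(),
]
topics, doc_weights, history = fit_topics(docs, k=2)
for topic in topics:
    print(sorted(topic, key=topic.get, reverse=True)[:3])
```

The recorded log-likelihood should not decrease from one iteration to the next, which is a quick sanity check on any EM implementation.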

User-Controlled Exploration: Sample Topic 1

age 0.0672687, division 0.0551497, labor 0.052136, colony 0.038305, foraging 0.0357817, foragers 0.0236658, workers 0.0191248, task 0.0190672, behavioral 0.0189017, behavior 0.0168805, older 0.0143466, tasks 0.013823, old 0.011839, individual 0.0114329, ages 0.0102134, young 0.00985875, genotypic 0.00963096, social 0.00883439

Prior: labor 0.2, division 0.2

33

behavioral 0.110674, age 0.0789419, maturation 0.057956, task 0.0318285, division 0.0312101, labor 0.0293371, workers 0.0222682, colony 0.0199028, social 0.0188699, behavior 0.0171008, performance 0.0117176, foragers 0.0110682, genotypic 0.0106029, differences 0.0103761, polyethism 0.00904816, older 0.00808171, plasticity 0.00804363, changes 0.00794045

Prior: behavioral 0.2, maturation 0.2

34

User-Controlled Exploration: Sample Topic 2

foraging 0.290076, nectar 0.114508, food 0.106655, forage 0.0734919, colony 0.0660329, pollen 0.0427706, flower 0.0400582, sucrose 0.0334728, source 0.0319787, behavior 0.0283774, individual 0.028029, rate 0.0242806, recruitment 0.0200597, time 0.0197362, reward 0.0196271, task 0.0182461, sitter 0.00604067, rover 0.00582791, rovers 0.00306051

foraging 0.142473, foragers 0.0582921, forage 0.0557498, food 0.0393453, nectar 0.03217, colony 0.019416, source 0.0153349, hive 0.0151726, dance 0.013336, forager 0.0127668, information 0.0117961, feeder 0.010944, rate 0.0104752, recruitment 0.00870751, individual 0.0086414, reward 0.00810706, flower 0.00800705, dancing 0.00794827, behavior 0.00789228

Exploit Prior for Concept Switching
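A user-specified prior can steer a topic by acting as pseudo-counts when the word distribution is re-estimated. The sketch below shows that idea for a single update; the expected counts and the prior strength are hypothetical, and treating the prior weights as pseudo-counts is an assumption about how the prior enters the estimate.

```python
def map_estimate(expected_counts, prior, strength):
    """Word distribution proportional to expected_count(w) + strength * prior(w)."""
    vocab = set(expected_counts) | set(prior)
    pseudo = {
        w: expected_counts.get(w, 0.0) + strength * prior.get(w, 0.0) for w in vocab
    }
    total = sum(pseudo.values())
    return {w: v / total for w, v in pseudo.items()}

# Hypothetical expected word counts for one topic (e.g. from an EM E-step).
expected_counts = {"age": 6.0, "colony": 4.0, "labor": 3.0, "division": 2.0}
prior = {"labor": 0.2, "division": 0.2}  # prior weights as written on the slide

without_prior = map_estimate(expected_counts, {}, strength=0.0)
with_prior = map_estimate(expected_counts, prior, strength=50.0)
print(without_prior["labor"], with_prior["labor"])
```

Raising `strength` pulls the topic further toward the user's preferred words; setting it to zero recovers the plain maximum likelihood estimate.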

35

Part 3: Entity Summarization

36

Gene product

Expression

Sequence

Interactions

Mutations

General Functions

Multi-Aspect Gene Summary

Automated Gene Summarization?

A Two-Stage Approach

Text Summary of Gene Abl

General Entity Summarizer

• Task: Given any entity and k aspects to summarize, generate a semi-structured summary

• Assumption: Training sentences available for each aspect

• Method:
  – Train a recognizer for each aspect
  – Given an entity, retrieve sentences relevant to the entity
  – Classify each sentence into one of the k aspects
  – Choose the best sentences in each category
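The four method steps can be sketched as a small pipeline. Keyword sets stand in for the trained per-aspect recognizers, which is a deliberate simplification; the aspect names come from the multi-aspect summary slide, while the keyword lists, the entity names geneX and geneY, and the sentences are invented for illustration.

```python
import re

# Hypothetical keyword lists standing in for trained per-aspect recognizers.
ASPECT_KEYWORDS = {
    "Expression": {"expressed", "expression"},
    "Interactions": {"interacts", "binds", "regulates"},
    "Mutations": {"mutant", "mutation", "allele"},
}

def words(sentence):
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def summarize(entity, sentences, per_aspect=1):
    """Retrieve sentences about `entity`, assign each to one aspect, keep the best."""
    relevant = [s for s in sentences if entity.lower() in words(s)]
    buckets = {aspect: [] for aspect in ASPECT_KEYWORDS}
    for sentence in relevant:
        scores = {a: len(words(sentence) & kws) for a, kws in ASPECT_KEYWORDS.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            buckets[best].append((scores[best], sentence))
    return {
        aspect: [s for _, s in sorted(items, key=lambda item: -item[0])[:per_aspect]]
        for aspect, items in buckets.items()
    }

# Invented toy sentences, not taken from the literature.
sentences = [
    "geneX is expressed in the developing nervous system.",
    "geneX interacts with geneY and regulates axon guidance.",
    "A mutation in geneX disrupts axon outgrowth.",
    "geneY is expressed in the wing disc.",
]
summary = summarize("geneX", sentences)
for aspect, chosen in summary.items():
    print(aspect, chosen)
```

The output is semi-structured in the sense used on the slide: one short list of sentences per aspect.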

40

Further Generalizations

• Task: Given any entity and k pre-specified aspects to summarize, generate a semi-structured summary

• Assumption: Training sentences available for each aspect

• Method:
  – Train a recognizer for each aspect
  – Given an entity, retrieve sentences relevant to the entity
  – Classify each sentence into one of the k aspects
  – Choose the best sentences in each category

41

New method based on mixture model and regularized optimization

Part 4. Function Analysis

42

Annotating Gene Lists: GO Terms vs. Literature Mining

Limitations of GO annotations:
- Labor-intensive
- Limited coverage

Literature Mining:
- Automatic
- Flexible exploration in the entire literature space

Overview of Gene List Annotator

[Figure: annotation pipeline.]
• Gene group: Bcd, Cad, …, Tll
• For any gene: retrieve its relevant documents (source label on the figure: Entrez Gene)
• Document sets: one per gene (Bcd, Cad, …, Tll)
• For any term: test its significance
• Enriched concepts: Segmentation 56.0, Pattern 34.2, Cell_cycle 25.6, Development 22.1, Regulation 20.4, …
• Interactive analysis

Intuition for Literature-based Annotation

Gene TPI1 GPM1 PGK1 TDH3 TDH2

protein_kinase 0 0 2 0 0

decarboxylase 10 0 10 7 6

protein 39 26 65 44 33

stationary_phase 2 7 3 4 2

energy_metabolism 4 5 5 8 0

oscillation 0 0 0 0 1

Likelihood Ratio Test with 2-Poisson Mixture Model

Dataset distribution: Poisson(λ;d)

Reference distribution: Poisson(λ0;d)
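To give a feel for the significance test, the sketch below computes a likelihood ratio statistic for one term under a single Poisson model: the term's count in the gene-list documents is compared against the count expected from a reference rate. This is a simplification and not the 2-Poisson mixture named on the slide; the function names, the example counts and the one-sided enrichment check are all assumptions.

```python
import math

def poisson_log_likelihood(count, rate, length):
    """log P(count) under Poisson with mean rate * length."""
    mean = rate * length
    return count * math.log(mean) - mean - math.lgamma(count + 1)

def likelihood_ratio_statistic(count, length, reference_rate):
    """2 * [log L(rate fitted to the data) - log L(reference rate)] for one term."""
    if count == 0:
        # With a fitted rate of 0 the data have log-likelihood 0.
        return 2 * reference_rate * length
    fitted_rate = count / length
    return 2 * (
        poisson_log_likelihood(count, fitted_rate, length)
        - poisson_log_likelihood(count, reference_rate, length)
    )

def enrichment_score(count, length, reference_rate):
    """Report the statistic only when the term is over-represented."""
    if count / length <= reference_rate:
        return 0.0
    return likelihood_ratio_statistic(count, length, reference_rate)

# Hypothetical numbers: a term seen 10 times in 1000 words of gene-list documents,
# against a reference rate of 2 occurrences per 1000 words.
stat = enrichment_score(count=10, length=1000, reference_rate=0.002)
print(round(stat, 3))
```

Ranking terms by such a statistic would yield a list of enriched concepts with scores, similar in shape to the one in the annotator overview.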

Agreement with GO-based Method

• Gene List: 93 genes up-regulated by the manganese treatment

GO Theme                 Related Annotator terms
neurogenesis             axon guidance, growth cone, commissural axon, proneural gene
synaptic transmission    synaptic vesicle, neurotransmitter release, synaptic transmission, sodium channel
cytoskeletal protein     alpha tubulin, actin filament
cell communication       tight junction, heparan sulfate proteoglycan

47

Discovering Novel Themes

• Gene List: 69 genes up-regulated by the methoprene treatment

Theme                    Annotator terms
muscle                   flight muscle, muscle myosin, nonmuscle myosin, light chain, myosin ii, thick filament, thin filament, striated muscle
synaptic transmission    neurotransmitter release, synaptic transmission, synaptic vesicle
signaling pathway        notch signal

48

Summary

[Figure: the BeeSpace technology architecture diagram again, overlaid with the four parts of the talk.]

Part 1. Information Extraction

Part 2. Navigation Support

Part 3. Entity Summarization

Part 4. Function Analysis

49

Machine Learning + Language Models + Minimum Human Effort

General and scalable, but there’s room for deeper semantics

Looking Ahead…

• Knowledge integration, inferences

• Support for hypothesis formulation and testing

50

51

Exploring Knowledge Space

[Figure: knowledge graph. Gene nodes: A1, A1’, A2, A3, A4, A4’, A5. Behavior nodes: B1, B2, B3, B4. Edge labels: isa, Co-occur-fly, Co-occur-mos, Co-occur-bee, Orth-mos, orth, Reg.]

1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}

P = PathBetween(Z, B4, {co-occur, reg, isa})
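The NeighborOf steps can be mimicked with a small typed graph. In the sketch below, the class `KnowledgeGraph`, the method `neighbor_of`, and the edge list are all hypothetical: the edges were chosen so that the four steps return the same sets as on the slide, and they are not a transcription of the figure. Edges are treated as undirected, and PathBetween is not implemented.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Undirected graph with typed nodes and labeled edges."""

    def __init__(self):
        self.node_type = {}
        self.adjacent = defaultdict(set)  # node -> {(relation, neighbor)}

    def add_node(self, name, node_type):
        self.node_type[name] = node_type

    def add_edge(self, a, relation, b):
        self.adjacent[a].add((relation, b))
        self.adjacent[b].add((relation, a))

    def neighbor_of(self, sources, node_type, relations):
        """One-hop neighbors of `sources` with the given type, via the given relations."""
        if isinstance(sources, str):
            sources = {sources}
        return {
            other
            for node in sources
            for relation, other in self.adjacent[node]
            if relation in relations and self.node_type[other] == node_type
        }

graph = KnowledgeGraph()
for gene in ["A1", "A1'", "A2", "A3", "A4", "A4'", "A5", "A6"]:
    graph.add_node(gene, "Gene")
for behavior in ["B1", "B2", "B3", "B4"]:
    graph.add_node(behavior, "Behavior")

# Hypothetical edges, chosen to reproduce the sets on the slide.
for a, relation, b in [
    ("B4", "isa", "B3"), ("B4", "isa", "B2"), ("B4", "co-occur", "B1"),
    ("B1", "co-occur", "A1"), ("B1", "co-occur", "A1'"), ("A1", "orth", "A1'"),
    ("B2", "co-occur", "A2"), ("B3", "co-occur", "A3"),
    ("A3", "reg", "A4"), ("A1'", "reg", "A4'"), ("A5", "reg", "A4"),
]:
    graph.add_edge(a, relation, b)

x = graph.neighbor_of("B4", "Behavior", {"co-occur", "isa"})
y = graph.neighbor_of(x, "Gene", {"co-occur", "orth"})
y = y | {"A5", "A6"}  # the user adds genes from another source
z = graph.neighbor_of(y, "Gene", {"reg"})
print(sorted(x), sorted(y), sorted(z))
```

Each step is a set-valued query, so intermediate results such as Y can be edited by the user before the next step is run.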

52

Full-Fledged BeeSpace V5

[Figure: system diagram with the following labels.]
• Biomedical Literature
• Entities: Gene, Behavior, Anatomy, Chemical
• Relations: Orthology, Regulatory interaction, …
• Experiment Data
• Analysis
• Additional entities and relations
• Expert knowledge
• Inferences
• Hypothesis Formulation & Testing

Thanks to

Xin He (UIUC)
Jing Jiang (SMU)
Yanen Li (UIUC)
Xu Ling (UIUC)
Yue Lu (UIUC)
Qiaozhu Mei (UIUC/Michigan)
& Bruce Schatz (PI, BeeSpace)

53

Thank You!

54
