BeeSpace Informatics Research ChengXiang (“Cheng”) Zhai Department of Computer Science Institute for Genomic Biology Statistics Graduate School of Library & Information Science University of Illinois at Urbana-Champaign BeeSpace Workshop, May 22, 2009 1
Page 1: BeeSpace Informatics Research

BeeSpace Informatics Research

ChengXiang (“Cheng”) Zhai

Department of Computer Science

Institute for Genomic Biology

Statistics

Graduate School of Library & Information Science

University of Illinois at Urbana-Champaign

BeeSpace Workshop, May 22, 2009 1

Page 2: BeeSpace Informatics Research

Goal of Informatics Research

• Develop general and scalable computational methods to enable

– Semantic integration of data and information

– Effective information access and exploration

– Knowledge discovery

– Hypothesis formulation and testing

• Reinforcement of research in biology and computer science

– CS research to automate manual tasks of biologists

– Biology research to raise new challenges for CS

2

Page 3: BeeSpace Informatics Research

Overview of BeeSpace Technology

[Architecture diagram. Components: Literature Text; Search Engine; Natural Language Understanding; Words/Phrases, Entities, Relations; Text Miner; Relational Database; Meta Data; Gene Summarizer; Function Annotator; Space/Region Manager, Navigation Support; Users. Layers: Content Analysis; Information Access & Exploration; Question Answering; Knowledge Discovery & Hypothesis Testing.]

3

Page 4: BeeSpace Informatics Research

Informatics Research Accomplishments

[Architecture diagram repeated from the “Overview of BeeSpace Technology” slide, annotated with the following publications:]

Biomedical information retrieval [Jiang & Zhai 07], [Lu et al. 08]

Entity/Relation extraction [Jiang & Zhai 06], [Jiang & Zhai 07a], [Jiang & Zhai 07b]

Topic discovery and interpretation [Mei et al. 06a], [Mei et al. 07a], [Mei et al. 07b], [Chee & Schatz 08]

Entity/Gene Summarization [Ling et al. 06], [Ling et al. 07], [Ling et al. 08]

Automatic Function Annotation [He et al. 09/10]

4

Page 5: BeeSpace Informatics Research

Overview of BeeSpace Technology

[Architecture diagram repeated from page 3, divided into four parts:]

Part 1. Information Extraction

Part 2. Navigation Support

Part 3. Entity Summarization

Part 4. Function Analysis

5

Page 6: BeeSpace Informatics Research

Part 1. Information Extraction

6

Page 7: BeeSpace Informatics Research

Natural Language Understanding

…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …

[Parse overlay on the sentence: spans bracketed as noun phrases (NP) and verb phrases (VP), with two spans tagged Gene]

7

Page 8: BeeSpace Informatics Research

Entity & Relation Extraction

Genetic Interaction
Gene X    Gene Y
Bcd       hb
…         …

Expression Location
Gene X    Anatomy Y
Bcd       embryo
Hb        egg
…         …

8

Lopes FJ et al., 2005 J. Theor. Biol.

Page 9: BeeSpace Informatics Research

General Approach: Machine Learning

• Computers learn from labeled examples to compute a function to predict labels of new examples

• Examples of predictions

– Given a phrase, predict whether it is a gene name

– Given a sentence with two gene names mentioned, predict whether there is a genetic interaction relation

• Many learning methods are available, but training data isn’t always available

9

Page 10: BeeSpace Informatics Research

Extraction Example 1: Gene Name Recognition

… expression of terminal gap genes is mediated by the local activation of the Torso receptor tyrosine kinase (Tor). At the anterior, terminal gap genes are also activated by the Tor pathway but Bcd contributes to their activation.

10

Gene?

Gene? Gene?

Page 11: BeeSpace Informatics Research

Features for Recognizing Genes

• Syntactic clues:

– Capitalization (especially acronyms)

– Numbers (gene families)

– Punctuation: -, /, :, etc.

• Contextual clues:

– Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.

– Global: same noun phrase occurs several times in the same article
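The clues listed above can be turned into binary features for a candidate phrase. The following is a minimal sketch, not the BeeSpace implementation: the function name `gene_features`, the cue-word set, and the two-token context window are all assumptions made for illustration.

```python
import re

# Hypothetical cue words, taken from the slide's "local context" examples.
CONTEXT_CUES = {"gene", "encoding", "regulation", "expressed"}


def gene_features(tokens, start, end, window=2):
    """Return binary features for the candidate phrase tokens[start:end]."""
    candidate = tokens[start:end]
    phrase = " ".join(candidate)
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    context = {t.lower().strip(".,;:()") for t in left + right}
    # Global clue: count how often the same phrase occurs in the token list.
    n = end - start
    occurrences = sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == candidate
    )
    return {
        "starts_with_capital": phrase[:1].isupper(),
        "is_acronym": bool(re.fullmatch(r"[A-Z][A-Z0-9]+", phrase)),
        "contains_digit": any(ch.isdigit() for ch in phrase),
        "contains_punct": bool(re.search(r"[-/:]", phrase)),
        "cue_word_nearby": bool(context & CONTEXT_CUES),
        "repeated_in_article": occurrences > 1,
    }


tokens = "that CD38 is expressed by both neurons and glial cells".split()
feats = gene_features(tokens, 1, 2)  # features for the candidate "CD38"
```

Here `tokens` stands in for a whole article; a real system would compute the global clue over the full document rather than one sentence.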

11

Page 12: BeeSpace Informatics Research

Maximum Entropy Model for Gene Tagging

• Given an observation (a token or a noun phrase), together with its context, denoted as x

• Predict y ∈ {gene, non-gene}

• Maximum entropy model:

$P(y \mid x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, where K is a normalizing constant

• Typical f:

– y = gene & candidate phrase starts with a capital letter

– y = gene & candidate phrase contains digits

• Estimate $\lambda_i$ with training data
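As a rough illustration of the model on this slide, the sketch below computes P(y|x) from two feature functions of the kind listed under "Typical f" and estimates the weights by plain gradient ascent. The training examples, learning rate, and function names are made up for the example; real taggers use many more features and a proper optimizer.

```python
import math

LABELS = ("gene", "non-gene")


def features(x, y):
    """Binary feature functions f_i(x, y) in the style of the slide."""
    is_gene = y == "gene"
    return [
        1.0 if is_gene and x[:1].isupper() else 0.0,             # starts with a capital
        1.0 if is_gene and any(c.isdigit() for c in x) else 0.0,  # contains digits
    ]


def maxent_prob(x, weights):
    """P(y|x) = K * exp(sum_i lambda_i * f_i(x, y)); K normalizes over labels."""
    scores = {
        y: math.exp(sum(w * f for w, f in zip(weights, features(x, y))))
        for y in LABELS
    }
    z = sum(scores.values())  # z equals 1/K
    return {y: s / z for y, s in scores.items()}


def train(examples, n_features=2, lr=0.5, epochs=200):
    """Estimate the weights by gradient ascent on the conditional log-likelihood."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for x, y in examples:
            probs = maxent_prob(x, weights)
            observed = features(x, y)
            for i in range(n_features):
                expected = sum(probs[lab] * features(x, lab)[i] for lab in LABELS)
                grad[i] += observed[i] - expected  # observed minus expected count
        weights = [w + lr * g / len(examples) for w, g in zip(weights, grad)]
    return weights


# Toy, made-up training set (not from the slides).
examples = [("Bcd", "gene"), ("CD38", "gene"), ("PABPC5", "gene"),
            ("pathway", "non-gene"), ("anterior", "non-gene"), ("The", "non-gene")]
weights = train(examples)
probs = maxent_prob("Tor", weights)
```

The gradient is the familiar "observed feature count minus expected feature count", which is what makes maximum entropy training equivalent to maximum likelihood for this model family.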

12

Page 13: BeeSpace Informatics Research

Special Challenges

• Gene name disambiguation

• Domain adaptation

13

Page 14: BeeSpace Informatics Research

Gene Name Disambiguation

• Gene names can be common English words:

for (foraging), in (inturned), similar (sima), yellow (y), black (b)…

• Solution:

– Disambiguate by looking at the context of the candidate word

– Train a classifier

14

Page 15: BeeSpace Informatics Research

Discriminative Neighbor Words

15

Page 16: BeeSpace Informatics Research

Sample Disambiguation Results

16

Candidate words “foraging”, “for”:

... affect complex behaviors such as locomotion and foraging [-1.468]. The foraging [+3.359] (for [+5.497]) gene encodes a pkg in drosophila melanogaster here we demonstrate a function for [-0.582] the for [+5.980] gene in sensory responsiveness and …

Candidate word “black”:

the cuticular melanization phenotype of black [-2.780] flies is rescued by beta-alanine but beta-alanine production by aspartate decarboxylation was reported to be normal in assays of black [+9.759] mutants and although …

(Bracketed values are the scores shown beside each candidate occurrence.)

Page 17: BeeSpace Informatics Research

Nov 27, 2007 17

Problem of Domain Overfitting

Ideal setting: gene name recognizer, 54.1%

Realistic setting: gene name recognizer, 28.1%

Example fly gene names: wingless, daughterless, eyeless, apexless

Page 18: BeeSpace Informatics Research

Solution: Learn Generalizable Features

…decapentaplegic and wingless are expressed in analogous patterns in each primordium of…

…that CD38 is expressed by both neurons and glial cells…

…that PABPC5 is expressed in fetal brain and in a range of adult tissues.

Generalizable Feature: “w+2 = expressed”

18

Page 19: BeeSpace Informatics Research

Generalizability-Based Feature Ranking

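[Figure: the training data is split by domain and features are ranked 1 to 8 within each domain. “expressed” and “-less” occupy different positions in each per-domain ranking; in the combined ranking “expressed” is placed ahead of “-less”. Scores shown in the figure: 0.125, 0.167.]

19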

Page 20: BeeSpace Informatics Research

20

Effectiveness of Domain Adaptation

Standard learning: Fly + Mouse → Yeast, gene name recognizer 63.3%

Domain adaptive learning: Fly + Mouse → Yeast, gene name recognizer 75.9%

Page 21: BeeSpace Informatics Research

More Results on Domain Adaptation

Exp      Method     Precision  Recall   F1
F+M→Y    Baseline   0.557      0.466    0.508
         Domain     0.575      0.516    0.544
         % Imprv.   +3.2%      +10.7%   +7.1%
F+Y→M    Baseline   0.571      0.335    0.422
         Domain     0.582      0.381    0.461
         % Imprv.   +1.9%      +13.7%   +9.2%
M+Y→F    Baseline   0.583      0.097    0.166
         Domain     0.591      0.139    0.225
         % Imprv.   +1.4%      +43.3%   +35.5%

• Text data from BioCreAtIvE (Medline)

• 3 organisms (Fly, Mouse, Yeast)

21
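The slide does not say how F1 and “% Imprv.” are computed. Assuming the standard definitions (F1 as the harmonic mean of precision and recall, improvement as relative change over the baseline), the figures in the table can be checked with a few lines. Because the inputs are already rounded to three decimals, recomputed values may differ from the table in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def pct_improvement(baseline, adapted):
    """Relative improvement of the adapted score over the baseline, in percent."""
    return 100.0 * (adapted - baseline) / baseline


# (precision, recall) pairs copied from the table above.
results = {
    "F+M→Y": {"Baseline": (0.557, 0.466), "Domain": (0.575, 0.516)},
    "F+Y→M": {"Baseline": (0.571, 0.335), "Domain": (0.582, 0.381)},
    "M+Y→F": {"Baseline": (0.583, 0.097), "Domain": (0.591, 0.139)},
}

for exp, rows in results.items():
    (bp, br), (dp, dr) = rows["Baseline"], rows["Domain"]
    print(f"{exp}: F1 {f1(bp, br):.3f} -> {f1(dp, dr):.3f}, "
          f"recall {pct_improvement(br, dr):+.1f}%")
```

In all three settings most of the F1 gain comes from recall rather than precision, which is consistent with the adapted model recognizing gene names it would otherwise miss in the unseen organism.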

Page 22: BeeSpace Informatics Research

Extraction Example 2: Genetic Interaction Relation

22

Gene

Gene

Is there a genetic interaction relation here?

Bcd regulates the expression of the maternal and zygotic gene hunchback (hb) that shows a step-like-function expression pattern, in the anterior half of the egg.

Page 23: BeeSpace Informatics Research

Challenges

• No/little training data

• What features to use?

23

Page 24: BeeSpace Informatics Research

Solution: Pseudo Training Data

24

Gene:

Bcd +

These results uncovered an antagonism between hunchback and bicoid at the anterior pole, whereas the two genes are known to act in concert for most anterior segmented development.

Page 25: BeeSpace Informatics Research

Pseudo Training Data Works Reasonably Well

25

[Figure: plot of precision against recall]

Using all features works the best

Page 26: BeeSpace Informatics Research

Large-Scale Entity/Relation Extraction

• Entity annotation

Entity Type   Resource            Method
Gene          NCBI, FlyBase, …    Dictionary string search + machine learning
Anatomy       FlyBase             Dictionary string search
Chemical      MeSH, Biosis, …     Dictionary string search
Behavior                          “x x behavior” pattern search

• Relation extraction

Relation Type    Method
Regulatory       Pre-defined pattern + machine learning
Expressed In     Co-occurrence + relevant keywords
Gene Behavior    Co-occurrence
Gene Chemical    Co-occurrence
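Two of the simpler methods in these tables, dictionary string search for entities and "co-occurrence + relevant keywords" for the Expressed In relation, can be sketched as follows. The mini-dictionaries and keyword set are hypothetical stand-ins for the much larger resources named above, and the function names are illustrative only.

```python
import re
from itertools import product

# Hypothetical mini-dictionaries; the real resources (NCBI, FlyBase, MeSH, ...)
# are far larger.
DICTIONARIES = {
    "Gene": {"Bcd", "hunchback", "hb"},
    "Anatomy": {"egg", "embryo"},
}
EXPRESSION_KEYWORDS = {"expression", "expressed"}


def annotate(sentence):
    """Dictionary string search: return {entity_type: [matched names]}."""
    found = {}
    for etype, names in DICTIONARIES.items():
        hits = [n for n in sorted(names)
                if re.search(rf"\b{re.escape(n)}\b", sentence)]
        if hits:
            found[etype] = hits
    return found


def expressed_in(sentence):
    """Co-occurrence + keywords: pair every gene with every anatomy term."""
    words = set(re.findall(r"\w+", sentence.lower()))
    if not words & EXPRESSION_KEYWORDS:
        return []  # no expression keyword, so no relation is proposed
    ents = annotate(sentence)
    return list(product(ents.get("Gene", []), ents.get("Anatomy", [])))


sentence = ("Bcd regulates the expression of the maternal and zygotic gene "
            "hunchback (hb) that shows a step-like-function expression pattern, "
            "in the anterior half of the egg.")
pairs = expressed_in(sentence)
```

Pure co-occurrence over-generates: on this sentence it pairs Bcd with egg even though the expression pattern described belongs to hunchback, which is one reason the Regulatory relation in the table combines patterns with machine learning.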

Page 27: BeeSpace Informatics Research

Part 2: Semantic Navigation

27

Page 28: BeeSpace Informatics Research

Space-Region Navigation

[Diagram: Literature Spaces (Bee, Fly, Behavior, Bird, …) and Topic Regions (Bee Forager, Bird Singing, Fly Rover, …) are linked by the operators MAP, EXTRACT, and SWITCHING. Spaces and regions can each be combined by Intersection, Union, …, and saved as My Spaces and My Regions/Topics.]

28

Page 29: BeeSpace Informatics Research

General Approach: Language Models

• Topic = word distribution

• Modeling text in a space with mixture models of multinomial distributions

• Text Mining = Parameter Estimation + Inferences

• Matching = Compute similarity between word distributions

• Users can “control” a model by specifying topic preferences

29

Page 30: BeeSpace Informatics Research

A Sample Topic & Corresponding Space

Word Distribution (language model)

filaments 0.0410238, muscle 0.0327107, actin 0.0287701, z 0.0221623,
filament 0.0169888, myosin 0.0153909, thick 0.00968766, thin 0.00926895,
sections 0.00924286, er 0.00890264, band 0.00802833, muscles 0.00789018,
antibodies 0.00736094, myofibrils 0.00688588, flight 0.00670859, images 0.00649626

Meaningful labels

actin filaments, flight muscle, flight muscles

Example documents

• actin filaments in honeybee-flight muscle move collectively

• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections

• identification of a connecting filament protein in insect fibrillar flight muscle

• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles

• structure of thick filaments from insect flight muscle

30

Page 31: BeeSpace Informatics Research

MAP: Topic/Region → Space

• MAP: Use the topic/region description as a query to search a given space

• Retrieval algorithm:

– Query word distribution: p(w|Q)

– Document word distribution: p(w|D)

– Score a document based on similarity of Q and D

• Leverage existing retrieval toolkits: Lemur/Indri

$\mathrm{score}(Q, D) = -D(Q \,\|\, D) = -\sum_{w \in \text{Vocabulary}} p(w \mid Q) \log \frac{p(w \mid Q)}{p(w \mid D)}$
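The retrieval step can be sketched directly from the query and document word distributions: score each document by the negative KL divergence from the query distribution. The sketch below is an illustration only; it does not use Lemur/Indri, and the interpolation smoothing with `mu=0.5` is an assumption added so that every query word has non-zero probability under each document model.

```python
import math
from collections import Counter


def word_dist(text):
    """Maximum-likelihood unigram distribution p(w | text)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smooth(p_doc, p_bg, mu=0.5):
    """Interpolate with a background model so p(w|D) > 0 for all words.

    The smoothing scheme and mu=0.5 are assumptions, not from the slides."""
    return {w: (1 - mu) * p_doc.get(w, 0.0) + mu * p_bg[w] for w in p_bg}


def score(p_q, p_d):
    """score(Q, D) = -KL(Q || D) = -sum_w p(w|Q) * log(p(w|Q) / p(w|D))."""
    return -sum(pq * math.log(pq / p_d[w]) for w, pq in p_q.items() if pq > 0)


docs = {
    "d1": "actin filaments in honeybee flight muscle",
    "d2": "division of labor among colony workers",
}
query = "flight muscle filaments"
background = word_dist(" ".join(docs.values()) + " " + query)
p_q = word_dist(query)
doc_scores = {
    name: score(p_q, smooth(word_dist(text), background))
    for name, text in docs.items()
}
ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
```

Scores are always at most zero, and a document whose smoothed distribution matches the query distribution exactly would score zero.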

31

Page 32: BeeSpace Informatics Research

EXTRACT: Space → Topic/Region

• Assume k topics, each being represented by a word distribution

• Use a k-component mixture model to fit the documents in a given space (EM algorithm)

• The estimated k component word distributions are taken as k topic regions

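Likelihood:

$\log p(C \mid \Lambda) = \sum_{D \in C} \sum_{i=1}^{|D|} \log\left[\lambda\, p(D_i \mid \theta_B) + (1 - \lambda) \sum_{j=1}^{k} \pi_j\, p(D_i \mid \theta_j)\right]$

where $D_i$ is the i-th word of document D, $\theta_B$ is a background word distribution, $\theta_j$ is the word distribution of topic j, and $\Lambda$ denotes all model parameters.

Maximum likelihood estimator: $\Lambda^* = \arg\max_{\Lambda} p(C \mid \Lambda)$

Bayesian estimator: $\Lambda^* = \arg\max_{\Lambda} p(\Lambda \mid C) = \arg\max_{\Lambda} p(C \mid \Lambda)\, p(\Lambda)$

32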
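A compact EM loop for a k-topic mixture with a fixed background model might look like the sketch below. It makes several assumptions beyond what the slide states: the background weight `lam` is fixed rather than estimated, each document gets its own topic weights (as in PLSA-style models), and initialization is random. It implements only the maximum likelihood estimator, not the Bayesian one.

```python
import random
from collections import Counter


def em_topics(docs, k, lam=0.5, iters=50, seed=0):
    """Fit k topic word distributions plus a fixed background model with EM.

    docs: list of token lists. lam: fixed weight of the background model.
    Returns (topics, weights): topics[j][w] = p(w | theta_j) and
    weights[d][j] = weight of topic j in document d.
    """
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for d in docs for w in d})
    total = sum(len(d) for d in docs)
    bg = {w: sum(c[w] for c in counts) / total for w in vocab}  # p(w | theta_B)

    # Random initial topic distributions, uniform per-document topic weights.
    topics = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        z = sum(raw.values())
        topics.append({w: v / z for w, v in raw.items()})
    weights = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        new_topics = [dict.fromkeys(vocab, 1e-12) for _ in range(k)]
        new_weights = [[1e-12] * k for _ in docs]
        for d, c in enumerate(counts):
            for w, n in c.items():
                # E-step: probability that w came from a topic, not background,
                # then split that mass across topics.
                mix = [weights[d][j] * topics[j][w] for j in range(k)]
                mix_sum = sum(mix)
                topic_mass = (1 - lam) * mix_sum
                p_topic = topic_mass / (lam * bg[w] + topic_mass)
                for j in range(k):
                    resp = n * p_topic * mix[j] / mix_sum
                    new_topics[j][w] += resp
                    new_weights[d][j] += resp
        # M-step: renormalize the expected counts.
        topics = []
        for t in new_topics:
            z = sum(t.values())
            topics.append({w: v / z for w, v in t.items()})
        weights = [[v / sum(row) for v in row] for row in new_weights]
    return topics, weights


docs = [
    "actin filaments flight muscle myosin".split(),
    "flight muscle thick filaments".split(),
    "division labor colony foraging workers".split(),
    "age division labor foragers".split(),
]
topics, weights = em_topics(docs, k=2)
```

Each returned topic is a word distribution of the kind shown on the "Sample Topic" slide; sorting a topic's words by probability gives the top-word list for that region.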

Page 33: BeeSpace Informatics Research

User-Controlled Exploration: Sample Topic 1

age 0.0672687, division 0.0551497, labor 0.052136, colony 0.038305,
foraging 0.0357817, foragers 0.0236658, workers 0.0191248, task 0.0190672,
behavioral 0.0189017, behavior 0.0168805, older 0.0143466, tasks 0.013823,
old 0.011839, individual 0.0114329, ages 0.0102134, young 0.00985875,
genotypic 0.00963096, social 0.00883439

Prior: labor 0.2, division 0.2

33

Page 34: BeeSpace Informatics Research

User-Controlled Exploration: Sample Topic 2

behavioral 0.110674, age 0.0789419, maturation 0.057956, task 0.0318285,
division 0.0312101, labor 0.0293371, workers 0.0222682, colony 0.0199028,
social 0.0188699, behavior 0.0171008, performance 0.0117176, foragers 0.0110682,
genotypic 0.0106029, differences 0.0103761, polyethism 0.00904816, older 0.00808171,
plasticity 0.00804363, changes 0.00794045

Prior: behavioral 0.2, maturation 0.2

34

Page 35: BeeSpace Informatics Research

Exploit Prior for Concept Switching

foraging 0.290076, nectar 0.114508, food 0.106655, forage 0.0734919,
colony 0.0660329, pollen 0.0427706, flower 0.0400582, sucrose 0.0334728,
source 0.0319787, behavior 0.0283774, individual 0.028029, rate 0.0242806,
recruitment 0.0200597, time 0.0197362, reward 0.0196271, task 0.0182461,
sitter 0.00604067, rover 0.00582791, rovers 0.00306051

foraging 0.142473, foragers 0.0582921, forage 0.0557498, food 0.0393453,
nectar 0.03217, colony 0.019416, source 0.0153349, hive 0.0151726,
dance 0.013336, forager 0.0127668, information 0.0117961, feeder 0.010944,
rate 0.0104752, recruitment 0.00870751, individual 0.0086414, reward 0.00810706,
flower 0.00800705, dancing 0.00794827, behavior 0.00789228

35

Page 36: BeeSpace Informatics Research

Part 3: Entity Summarization

36

Page 37: BeeSpace Informatics Research

Automated Gene Summarization?

Multi-Aspect Gene Summary: Gene product, Expression, Sequence, Interactions, Mutations, General Functions

Page 38: BeeSpace Informatics Research

A Two-Stage Approach

Page 39: BeeSpace Informatics Research

Text Summary of Gene Abl

Page 40: BeeSpace Informatics Research

General Entity Summarizer

• Task: Given any entity and k aspects to summarize, generate a semi-structured summary

• Assumption: Training sentences available for each aspect

• Method:

– Train a recognizer for each aspect

– Given an entity, retrieve sentences relevant to the entity

– Classify each sentence into one of the k aspects

– Choose the best sentences in each category
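The four method steps can be sketched end to end as follows. This is a toy stand-in, not the summarizer described in the talk: the "recognizer" is just a bag of word frequencies built from made-up training sentences, retrieval is exact token matching on the entity name, and all names are hypothetical.

```python
from collections import defaultdict

# Hypothetical training sentences per aspect (made up for illustration).
TRAINING = {
    "Expression": ["the gene is expressed in the embryo",
                   "expression is detected in brain tissue"],
    "Interactions": ["the protein interacts with and binds its partner",
                     "genetic interaction with another gene"],
}


def train_recognizers(training):
    """Step 1: one recognizer per aspect (relative word frequencies)."""
    models = {}
    for aspect, sentences in training.items():
        words = " ".join(sentences).split()
        models[aspect] = {w: words.count(w) / len(words) for w in set(words)}
    return models


def aspect_score(sentence, model):
    """Average model weight of the sentence's words (0 for unseen words)."""
    words = sentence.lower().split()
    return sum(model.get(w, 0.0) for w in words) / len(words)


def summarize(entity, corpus, models, per_aspect=1):
    """Steps 2-4: retrieve, classify into the best aspect, keep top sentences."""
    by_aspect = defaultdict(list)
    for sentence in corpus:
        if entity.lower() not in sentence.lower().split():
            continue  # retrieval: keep only sentences that mention the entity
        best = max(models, key=lambda a: aspect_score(sentence, models[a]))
        by_aspect[best].append((aspect_score(sentence, models[best]), sentence))
    return {a: [s for _, s in sorted(items, reverse=True)[:per_aspect]]
            for a, items in by_aspect.items()}


corpus = [
    "Abl is expressed in the embryo",
    "Abl interacts with its partner",
    "Bcd is expressed in the egg",
]
summary = summarize("Abl", corpus, train_recognizers(TRAINING))
```

The output is the semi-structured summary the slide describes: a mapping from aspect to the selected sentences for the entity.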

40

Page 41: BeeSpace Informatics Research

Further Generalizations

• Task: Given any entity and k pre-specified aspects to summarize, generate a semi-structured summary

• Assumption: Training sentences available for each aspect

• Method:

– Train a recognizer for each aspect

– Given an entity, retrieve sentences relevant to the entity

– Classify each sentence into one of the k aspects

– Choose the best sentences in each category

41

New method based on mixture model and regularized optimization

Page 42: BeeSpace Informatics Research

Part 4. Function Analysis

42

Page 43: BeeSpace Informatics Research

Annotating Gene Lists: GO Terms vs. Literature Mining

Limitations of GO annotations:
- Labor-intensive
- Limited Coverage

Literature Mining:
- Automatic
- Flexible exploration in the entire literature space

Page 44: BeeSpace Informatics Research

Overview of Gene List Annotator

[Pipeline diagram:]

Gene group: Bcd, Cad, …, Tll

Entrez Gene

For any gene: retrieve its relevant documents

Document sets: Bcd, Cad, Tll

For any term: test its significance

Enriched concepts: Segmentation 56.0, Pattern 34.2, Cell_cycle 25.6, Development 22.1, Regulation 20.4, …

Interactive analysis

Page 45: BeeSpace Informatics Research

Intuition for Literature-based Annotation

Gene TPI1 GPM1 PGK1 TDH3 TDH2

protein_kinase 0 0 2 0 0

decarboxylase 10 0 10 7 6

protein 39 26 65 44 33

stationary_phase 2 7 3 4 2

energy_metabolism 4 5 5 8 0

oscillation 0 0 0 0 1

Page 46: BeeSpace Informatics Research

Likelihood Ratio Test with 2-Poisson Mixture Model

Dataset distribution: Poisson(λ;d)

Reference distribution: Poisson(λ0;d)
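The slide gives only the two distributions, so the following is a simplified sketch of a likelihood ratio test for one term, not the 2-Poisson mixture model named in the title. It assumes the term's count in a document of length d is Poisson with mean rate × d, compares the maximum-likelihood rate on the gene-list documents against a reference rate, and treats only over-representation as enrichment. Document lengths and reference rates in the usage lines are invented.

```python
import math


def poisson_loglik(counts, lengths, rate):
    """Log-likelihood of term counts under Poisson(rate * length) per document."""
    ll = 0.0
    for c, d in zip(counts, lengths):
        mean = rate * d
        ll += c * math.log(mean) - mean - math.lgamma(c + 1)
    return ll


def likelihood_ratio(counts, lengths, ref_rate):
    """2 * [log L(lambda_hat) - log L(lambda_0)] for one term.

    counts[i]: occurrences of the term in document set i of the gene list.
    lengths[i]: size of document set i. ref_rate: lambda_0 from a reference corpus.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    mle_rate = total / sum(lengths)  # maximum-likelihood estimate of lambda
    if mle_rate <= ref_rate:
        return 0.0  # one-sided: only over-representation counts as enrichment
    return 2.0 * (poisson_loglik(counts, lengths, mle_rate)
                  - poisson_loglik(counts, lengths, ref_rate))


# Counts from the "decarboxylase" and "protein" rows of the intuition table;
# the lengths (100 each) and the reference rates are made up.
lengths = [100] * 5
enriched = likelihood_ratio([10, 0, 10, 7, 6], lengths, ref_rate=0.01)
common = likelihood_ratio([39, 26, 65, 44, 33], lengths, ref_rate=0.5)
```

Under these invented rates, "decarboxylase" gets a positive statistic while the frequent but unspecific "protein" gets zero, which matches the intuition that a term is informative when it is more common in the gene list's literature than in the reference.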

Page 47: BeeSpace Informatics Research

Agreement with GO-based Method

• Gene List: 93 genes up-regulated by the manganese treatment

GO Theme                 Related Annotator terms
neurogenesis             axon guidance, growth cone, commissural axon, proneural gene
synaptic transmission    synaptic vesicle, neurotransmitter release, synaptic transmission, sodium channel
cytoskeletal protein     alpha tubulin, actin filament
cell communication       tight junction, heparan sulfate proteoglycan

47

Page 48: BeeSpace Informatics Research

Discovering Novel Themes

• Gene List: 69 genes up-regulated by the methoprene treatment

Theme                    Annotator terms
muscle                   flight muscle, muscle myosin, nonmuscle myosin, light chain, myosin ii, thick filament, thin filament, striated muscle
synaptic transmission    neurotransmitter release, synaptic transmission, synaptic vesicle
signaling pathway        notch signal

48

Page 49: BeeSpace Informatics Research

Summary

[Architecture diagram repeated from page 3, divided into four parts:]

Part 1. Information Extraction

Part 2. Navigation Support

Part 3. Entity Summarization

Part 4. Function Analysis

49

Machine Learning + Language Models + Minimum Human Effort

General and scalable, but there’s room for deeper semantics

Page 50: BeeSpace Informatics Research

Looking Ahead…

• Knowledge integration, inferences

• Support for hypothesis formulation and testing

50

Page 51: BeeSpace Informatics Research

51

Exploring Knowledge Space

[Graph figure: gene nodes A1, A1’, A2, A3, A4, A4’, A5 and behavior nodes B1, B2, B3, B4, connected by edges labeled isa, Co-occur-fly, Co-occur-mos, Co-occur-bee, Orth-mos, orth, and Reg]

1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}

2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}

3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}

4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}

P = PathBetween(Z, B4, {co-occur, reg, isa})
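A NeighborOf-style operator is straightforward to sketch over a typed, labeled graph. The toy graph below is hypothetical (its edges are invented, not read off the figure), edges are treated as undirected, and `neighbor_of` is an illustrative name; PathBetween is not implemented here.

```python
# Hypothetical toy graph; the edges are made up, not read off the figure.
NODE_TYPE = {
    "B1": "Behavior", "B2": "Behavior", "B4": "Behavior",
    "A1": "Gene", "A2": "Gene", "A4": "Gene",
}
EDGES = [  # (node, relation, node), treated as undirected
    ("B4", "isa", "B1"),
    ("B4", "co-occur", "B2"),
    ("B1", "co-occur", "A1"),
    ("B2", "co-occur", "A2"),
    ("A1", "reg", "A4"),
]


def neighbor_of(nodes, node_type, relations):
    """Nodes of the given type linked to any input node by an allowed relation."""
    nodes = {nodes} if isinstance(nodes, str) else set(nodes)
    result = set()
    for a, rel, b in EDGES:
        if rel not in relations:
            continue
        for src, dst in ((a, b), (b, a)):
            if src in nodes and dst not in nodes and NODE_TYPE[dst] == node_type:
                result.add(dst)
    return result


x = neighbor_of("B4", "Behavior", {"co-occur", "isa"})  # behaviors related to B4
y = neighbor_of(x, "Gene", {"co-occur"})                # genes co-occurring with them
y = y | {"A2"}                                          # user adds genes by hand
z = neighbor_of(y, "Gene", {"reg"})                     # genes linked by regulation
```

Because each call takes a node set and returns a node set, the operators compose, which is what lets a user chain them into an exploration like the numbered steps on the slide.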

Page 52: BeeSpace Informatics Research

52

Full-Fledged BeeSpace V5

Biomedical Literature

Entities
- Gene
- Behavior
- Anatomy
- Chemical

Relations
- Orthology
- Regulatory interaction
- …

Experiment Data

Analysis

Additional entities and relations

Expert knowledge

Inferences

Hypothesis Formulation & Testing

Page 53: BeeSpace Informatics Research

Thanks to

Xin He (UIUC)

Jing Jiang (SMU)

Yanen Li (UIUC)

Xu Ling (UIUC)

Yue Lu (UIUC)

Qiaozhu Mei (UIUC/Michigan)

& Bruce Schatz (PI, BeeSpace)

53

Page 54: BeeSpace Informatics Research

Thank You!

54