Topic Extraction from Biology Literature: Prior, Labeling, and Switching Qiaozhu Mei

Dec 31, 2015

Page 1: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Qiaozhu Mei

Page 2: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

A Sample Topic

Word Distribution (language model)

filaments 0.0410238
muscle 0.0327107
actin 0.0287701
z 0.0221623
filament 0.0169888
myosin 0.0153909
thick 0.00968766
thin 0.00926895
sections 0.00924286
er 0.00890264
band 0.00802833
muscles 0.00789018
antibodies 0.00736094
myofibrils 0.00688588
flight 0.00670859
images 0.00649626

Meaningful labels

actin filaments
flight muscle
flight muscles

Example documents

• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle

Page 3: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Topic/Theme Extraction

• A theme/topic is represented with a multinomial distribution over words

• Unigram language models
  – Easier to interpret
  – Easy to add prior
  – Easy for retrieval

• Assumption:
  – K themes in a collection
  – A document covers multiple themes
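
The assumption above can be read as a mixture model: the word distribution of a document is a weighted combination of the K theme distributions. A minimal Python sketch of that reading follows. The function name `document_word_probs` and all toy probabilities are illustrative assumptions, not taken from the slides.

```python
def document_word_probs(theme_weights, themes):
    """Mix K unigram theme models into one document-level word distribution.

    theme_weights: K mixing weights P(theme k | document), summing to 1.
    themes: K dicts mapping word -> P(word | theme k).
    """
    vocab = sorted({word for theme in themes for word in theme})
    return {
        word: sum(weight * theme.get(word, 0.0)
                  for weight, theme in zip(theme_weights, themes))
        for word in vocab
    }


# Toy themes (hypothetical probabilities, chosen only for illustration).
muscle = {"filaments": 0.5, "muscle": 0.3, "actin": 0.2}
foraging = {"foraging": 0.6, "nectar": 0.3, "muscle": 0.1}

# A document that covers both themes: 70% muscle, 30% foraging.
probs = document_word_probs([0.7, 0.3], [muscle, foraging])
```

Because each theme is a proper distribution and the weights sum to one, the mixed distribution also sums to one.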

Page 4: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Topic Extraction vs. Clustering

• Topic extraction:
  – Effective at revealing the latent topics and finding the documents most relevant to a topic
  – Better interpretation, worse accuracy
  – Effective to add priors (control the topics)

• Clustering algorithms:
  – Effective at assigning documents to non-overlapping clusters
  – Better accuracy, worse interpretation
  – Hard to control

Page 5: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Topic Extraction (Results)

corpora 0.0438967
allata 0.0315774
hormone 0.0249687
juvenile 0.0184049
insulin 0.0174549
embryos 0.0165997
neurosecretory 0.0127734
embryo 0.0124167
biosynthesis 0.0118067
cardiaca 0.00969471
sexta 0.0088941
medium 0.00865245
iran 0.00703376
mannose 0.00668768
volume 0.00661038
synapse 0.00652483
injected 0.00636151

Related documents

44 biosis:199598006316
44 biosis:200000292072
44 biosis:199293065558
44 biosis:199799595920
44 biosis:199395062782

stimulatory effect of octopamine on juvenile hormone biosynthesis in honey bees (apis mellifera): physiological and immunocytochemical evidence

• May want a more general topic

• How to tell the algorithm to find a more general topic, like “behavioral maturation”?

Page 6: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Topic Extraction (Results cont.)

pollen 0.467911
foraging 0.0373205
foragers 0.0365857
collected 0.0318249
grains 0.0314324
loads 0.025104
collection 0.0208903
nectar 0.0185726
sources 0.0113751
collecting 0.00999529
types 0.00978636
pellets 0.00942175
germination 0.00733012
load 0.00646375
stored 0.00599516
amount 0.00481306
trips 0.00478013

Related documents

13 biosis:200200039990
13 biosis:199900297835
13 biosis:200100318017
13 biosis:199497516580
13 biosis:200000045397

the response of the stingless bee melipona beecheii to experimental pollen stress, worker loss and different levels of information input

• Biased towards “Pollen”

• Not precisely covering “foraging”

• How to tell the algorithm to focus on “foraging”?

Page 7: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Topic Extraction (Full Results)

• 100 topics from biosis-bee: http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-100-basic.html

• 5 themes for query “food” in biosis-bee; 500 documents: http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-food-5-basic.html

Page 8: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Incorporating Topic Priors

• Either topic extraction or clustering:
  – Cannot guarantee that the extracted themes are the expected ones
  – User exploration: the user usually has a preference
  – E.g., want one topic/cluster to be about foraging behavior

• Use a prior to guide the theme extraction
  – Prior as a simple language model
  – E.g., forage 0.2; foraging 0.3; food 0.05; etc.

Page 9: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Incorporating Topic Priors

Original EM: (update equation not preserved in this transcript)

EM with Prior: (update equation not preserved in this transcript)

Prior: language model; interpreted as pseudo counts
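
The update equations are not reproduced in this transcript. One common way to realize "prior as pseudo counts" is a MAP-style M-step: for a theme with a prior, add mu times the prior probability of each word to that word's expected count before normalizing. The Python sketch below implements a PLSA-style EM under that assumption. It may differ from the exact formulas on the slide, and the names `plsa_em`, `priors`, and `mu` are hypothetical.

```python
import random


def plsa_em(docs, k, priors=None, mu=0.0, iters=50, seed=0):
    """Fit k unigram themes to docs (a list of {word: count}) with EM.

    priors maps a theme index to a {word: weight} language model; each of its
    words receives mu * weight pseudo counts in that theme's M-step.
    Prior words outside the corpus vocabulary are ignored.
    """
    priors = priors or {}
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})

    # Near-uniform start; the small noise breaks symmetry between themes.
    theta = []
    for _ in range(k):
        raw = {w: 1.0 + 0.1 * rng.random() for w in vocab}
        total = sum(raw.values())
        theta.append({w: v / total for w, v in raw.items()})
    pi = [[1.0 / k] * k for _ in docs]  # P(theme | document)

    for _ in range(iters):
        word_counts = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        theme_counts = [[0.0] * k for _ in docs]

        # E-step: split each word count across themes by P(theme | doc, word).
        for d, doc in enumerate(docs):
            for w, count in doc.items():
                post = [pi[d][j] * theta[j][w] for j in range(k)]
                norm = sum(post)
                if norm == 0.0:
                    continue
                for j in range(k):
                    share = count * post[j] / norm
                    word_counts[j][w] += share
                    theme_counts[d][j] += share

        # M-step: re-estimate themes, adding the prior as pseudo counts.
        for j in range(k):
            pseudo = {w: mu * priors.get(j, {}).get(w, 0.0) for w in vocab}
            total = sum(word_counts[j].values()) + sum(pseudo.values())
            if total > 0.0:
                theta[j] = {w: (word_counts[j][w] + pseudo[w]) / total
                            for w in vocab}
        for d in range(len(docs)):
            total = sum(theme_counts[d])
            if total > 0.0:
                pi[d] = [c / total for c in theme_counts[d]]

    return theta, pi
```

With `mu = 0` or no priors this reduces to plain maximum-likelihood EM. A larger `mu` pulls the chosen theme harder toward the prior words, which is how a call such as `plsa_em(docs, k=30, priors={0: {"forage": 0.1, "foraging": 0.1, "food": 0.1, "source": 0.1}}, mu=100.0)` would steer theme 0 toward foraging. The value of `mu` here is an arbitrary illustration.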

Page 10: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Incorporating Topic Priors (results)

foraging 0.0498044
food 0.0472535
foragers 0.0310718
dance 0.0266078
source 0.0254369
nectar 0.0162739
distance 0.0141869
forage 0.0141503
information 0.0129047
dances 0.012684
hive 0.0124987
landmarks 0.0119087
dancing 0.0109375
waggle 0.0101672
feeder 0.0101266
rate 0.0085641
sources 0.00825884
recruitment 0.00813717
forager 0.00796914

Prior:

forage 0.1
foraging 0.1
food 0.1
source 0.1

Page 11: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Incorporating Topic Priors (results: cont.)

age 0.0672687
division 0.0551497
labor 0.052136
colony 0.038305
foraging 0.0357817
foragers 0.0236658
workers 0.0191248
task 0.0190672
behavioral 0.0189017
behavior 0.0168805
older 0.0143466
tasks 0.013823
old 0.011839
individual 0.0114329
ages 0.0102134
young 0.00985875
genotypic 0.00963096
social 0.00883439

Prior:

labor 0.2
division 0.2

Page 12: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Incorporating Topic Priors (results: cont.)

gene 0.0648303
expression 0.0486273
sequence 0.0407999
sequences 0.0311126
brain 0.0233977
drosophila 0.020891
cdna 0.0186153
predict 0.0166939
expressed 0.0166521
amino 0.0126359
dna 0.010655
genome 0.0101629
conserved 0.0098135
bp 0.00908649
nucleotide 0.00906794
phylogenetic 0.00887771
encoding 0.00866418
melanogaster 0.00798409

Prior:

brain 0.1
predict 0.1
gene 0.1
expression 0.1

Page 13: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Incorporating Topic Priors (results: cont.)

behavioral 0.110674
age 0.0789419
maturation 0.057956
task 0.0318285
division 0.0312101
labor 0.0293371
workers 0.0222682
colony 0.0199028
social 0.0188699
behavior 0.0171008
performance 0.0117176
foragers 0.0110682
genotypic 0.0106029
differences 0.0103761
polyethism 0.00904816
older 0.00808171
plasticity 0.00804363
changes 0.00794045

Prior:

behavioral 0.2
maturation 0.2

Page 14: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Incorporating Topic Priors (Full results)

• 30 topics from biosis-bee (first 7 topics w/ prior): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-30-prior.html

• 30 topics from biosis-bee (first 2 topics w/ prior): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-30-prior3.html

Page 15: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Labeling a Topic

• Themes (topic models) can be hard to interpret.
• Giving meaningful labels to a topic is hard.

Page 16: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

What is a Good Label?

• Suggesting the theme (relevance)

• Understandable – phrases?

• High coverage inside topic
  – A theme is often a mixture of concepts

• Discriminative across topics
  – A theme is usually in the context of k topics

• …

Page 17: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Our Method

• Guarantee understandability with a pre-processing step
  – Use phrases as candidate topic labels
  – Other possible choices: entities

• Satisfy relevance, coverage, and discriminability with a probabilistic framework

Good labels = Understandable + Relevant + High Coverage + Discriminative

Page 18: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Labeling a Topic: Candidate Labels

• Phrase generation:
  – Statistically significant 2-grams
  – Hypothesis testing
  – T-test used; ranked by t-score

• Other choices?
  – Entities?
  – Behavior ontology?
  – GO: hard to use, because its terms are not real phrases from the literature.
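
The phrase-generation step can be sketched with the textbook t-test for collocations: compare the observed bigram frequency with the frequency expected if the two words were independent, and approximate the sample variance by the observed frequency. This Python sketch is one standard formulation and not necessarily the exact procedure behind the slides. The function name `bigram_t_scores` and the `min_count` cutoff are assumptions, and a real pipeline would avoid forming bigrams across sentence or document boundaries.

```python
import math
from collections import Counter


def bigram_t_scores(tokens, min_count=2):
    """Rank adjacent word pairs by t-score, highest first.

    t = (observed - expected) / sqrt(observed / N), where observed is the
    relative bigram frequency, expected is the product of the two relative
    unigram frequencies, and N is the number of tokens.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare for the test to be meaningful
        observed = count / n
        expected = (unigrams[w1] / n) * (unigrams[w2] / n)
        scores[(w1, w2)] = (observed - expected) / math.sqrt(observed / n)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy token stream (hypothetical text, for illustration only).
text = ("flight muscle of the bee flight muscle and the thick "
        "filament of the flight muscle")
ranked = bigram_t_scores(text.split())
```

On this toy input the content phrase "flight muscle" outranks the function-word pair "of the", which is the behavior one wants from a candidate-label generator.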

Page 19: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Labeling a Topic: Semantic Relevance

• Zero-order: use phrases that cover the top words well:

Figure: a latent topic and two candidate labels. Words shown: clustering, dimensional, algorithm, birch, shape, body.

Good label: “clustering algorithm”

Bad label: “body shape”
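
One simple way to make the zero-order idea concrete is to score a candidate phrase by the log-probability of its words under the topic, so that a phrase built from high-probability words wins. The Python sketch below rests on that assumption. The function name `zero_order_score`, the `floor` value for unseen words, and the toy probabilities are all hypothetical and do not come from the slides.

```python
import math


def zero_order_score(label, topic, floor=1e-6):
    """Sum of log P(word | topic) over the words of a candidate label.

    Words missing from the topic get a small floor probability so that the
    logarithm stays finite.
    """
    return sum(math.log(topic.get(word, floor)) for word in label.split())


# Toy topic using words from the slide, with made-up probabilities.
topic = {
    "clustering": 0.4,
    "dimensional": 0.3,
    "algorithm": 0.2,
    "birch": 0.05,
    "shape": 0.05,
}

good = zero_order_score("clustering algorithm", topic)
bad = zero_order_score("body shape", topic)
```

Note that a raw sum of logs favors shorter phrases; since the candidates here are all 2-grams, that bias does not affect the ranking.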

Page 20: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Labeling a Topic: Semantic Relevance (cont.)

• First-order: use phrases with similar context:

Figure: the topic's word distribution P(w|θ) is compared with the word distribution P(w|l) of each candidate label l in its context (SIGMOD Proceedings), using the divergence D(θ||l). Words shown: clustering, dimension, partition, algorithm, hash, join.

Good label: “clustering algorithm”

Bad label: “hash join”
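
The first-order comparison can be sketched as a KL divergence between the topic distribution and the distribution of words that occur around a candidate label: the smaller the divergence, the better the label. The Python sketch below assumes the context distributions have already been estimated from a reference collection. The function names, the `floor` smoothing, and every number are illustrative assumptions.

```python
import math


def kl_divergence(topic, label_context, floor=1e-6):
    """D(topic || label_context), with a floor for words the context lacks."""
    return sum(
        p * math.log(p / max(label_context.get(word, 0.0), floor))
        for word, p in topic.items()
        if p > 0.0
    )


def first_order_score(topic, label_context):
    """Higher is better: the negative divergence from topic to label context."""
    return -kl_divergence(topic, label_context)


# Toy distributions over words from the slide (hypothetical probabilities).
topic = {"clustering": 0.4, "algorithm": 0.3, "dimension": 0.15,
         "partition": 0.1, "hash": 0.05}
context_clustering_algorithm = {"clustering": 0.35, "algorithm": 0.3,
                                "dimension": 0.2, "partition": 0.1,
                                "hash": 0.05}
context_hash_join = {"hash": 0.4, "join": 0.4, "algorithm": 0.1,
                     "clustering": 0.05, "dimension": 0.05}

good = first_order_score(topic, context_clustering_algorithm)
bad = first_order_score(topic, context_hash_join)
```

Unlike the zero-order score, this one can reward a label whose own words are rare in the topic, as long as the label tends to appear alongside the topic's words.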

Page 21: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Labeling a Topic (results)

female 0.0892427
females 0.0856834
male 0.0854142
males 0.0812643
sex 0.0577668
reproductive 0.0214618
ratio 0.0142873
alleles 0.0133912
diploid 0.0125172
offspring 0.0120271
sexes 0.0116374
investment 0.0115359
mating 0.00902159
number 0.00823397
success 0.00785498
sexual 0.00751456
determination 0.00663546
size 0.00633002

Labels:

sex ratio 2.49468 32
male female 2.29508 51
sex determination 2.16534 21
female flowers 1.83686 23
sex alleles 1.79415 16
multiple mating 1.72684 19

Page 22: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Labeling a Topic (results cont.)

Labels:

juvenile hormone 2.44992 117
hormone jh 1.58432 49
larval instar 1.53676 20
worker larvae 1.52398 51
corpora allata 1.50391 34

hormone 0.0536175
jh 0.0518038
juvenile 0.0466941
development 0.0387031
larval 0.0276814
hemolymph 0.0216493
pupal 0.0189934
stage 0.0188286
glands 0.0173832
larvae 0.0169996
adult 0.0154695
instar 0.0149492
haemolymph 0.0140053
vitellogenin 0.0131076
caste 0.0124822
protein 0.0116558
glucose 0.0112673
corpora 0.0105111

Page 23: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Labeling a Topic (results)

Labels

food source -6.72378 107
nectar foraging -7.11784 28
nectar foragers -7.58965 47
nectar source -7.78975 16
food sources -7.8487 72
waggle dance -8.21514 31

foraging 0.0498044
food 0.0472535
foragers 0.0310718
dance 0.0266078
source 0.0254369
nectar 0.0162739
distance 0.0141869
forage 0.0141503
information 0.0129047
dances 0.012684
hive 0.0124987
landmarks 0.0119087
dancing 0.0109375
waggle 0.0101672
feeder 0.0101266
rate 0.0085641
recruitment 0.00813717
forager 0.00796914

Prior

0 forage 0.1
0 foraging 0.1
0 food 0.1
0 source 0.1

Page 24: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Labeling a Topic (full results)

• 100 topics from biosis-bee (w/ labels): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-100-basic-l.html

• 100 topics from biosis-fly-genetics (w/ labels): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/fly-100-l.html

Page 25: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Context Switching

• Utilize topic extraction for concept switching (two possible ways)
  – Label the same topic model with phrases in another context
  – Use the topic model from context A as prior to extract topics from context B
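
For the second option, a topic extracted in context A has to be turned into a prior language model before it can guide extraction in context B. The slides do not say how this conversion is done. The Python sketch below makes one plausible assumption: keep only the top-ranked words and renormalize them. The function name `topic_as_prior` and the `top_n` cutoff are hypothetical.

```python
def topic_as_prior(topic, top_n=10):
    """Keep the top_n words of a topic and renormalize them into a prior."""
    top = sorted(topic.items(), key=lambda item: item[1], reverse=True)[:top_n]
    total = sum(prob for _, prob in top)
    return {word: prob / total for word, prob in top}


# First few words of the bee foraging topic shown in these slides.
bee_foraging = {
    "foraging": 0.142473,
    "foragers": 0.0582921,
    "forage": 0.0557498,
    "food": 0.0393453,
    "nectar": 0.03217,
}

prior_for_fly = topic_as_prior(bee_foraging, top_n=3)
```

The resulting dictionary has the same word-to-weight shape as the hand-written priors used earlier in the deck (for example, forage 0.1; foraging 0.1), so it could be supplied to a prior-aware extraction run on the second collection.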

Page 26: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

foraging 0.142473
foragers 0.0582921
forage 0.0557498
food 0.0393453
nectar 0.03217
colony 0.019416
source 0.0153349
hive 0.0151726
dance 0.013336
forager 0.0127668
information 0.0117961
feeder 0.010944
rate 0.0104752
recruitment 0.00870751
individual 0.0086414
reward 0.00810706
flower 0.00800705
dancing 0.00794827
behavior 0.00789228

Labels with bee context

foraging trip 2.31174 21
nectar foragers 2.23428 47
tremble dance 2.21407 10
returning foragers 2.18954 16
food sources 2.14453 72
food source 2.13647 107
foraging strategy 2.101 14
individual foraging 2.08334 16
waggle dance 2.07836 31

Labels with fly context

foraging behavior 2.45263 27
age related 2.29676 20
drosophila larvae 2.15361 67
feeding rate 1.99218 17
apis mellifera 1.9847 23
diptera drosophilidae 1.9 25

Page 27: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

foraging 0.290076
nectar 0.114508
food 0.106655
forage 0.0734919
colony 0.0660329
pollen 0.0427706
flower 0.0400582
sucrose 0.0334728
source 0.0319787
behavior 0.0283774
individual 0.028029
rate 0.0242806
recruitment 0.0200597
time 0.0197362
reward 0.0196271
task 0.0182461
sitter 0.00604067
rover 0.00582791
rovers 0.00306051

foraging 0.142473
foragers 0.0582921
forage 0.0557498
food 0.0393453
nectar 0.03217
colony 0.019416
source 0.0153349
hive 0.0151726
dance 0.013336
forager 0.0127668
information 0.0117961
feeder 0.010944
rate 0.0104752
recruitment 0.00870751
individual 0.0086414
reward 0.00810706
flower 0.00800705
dancing 0.00794827
behavior 0.00789228

Page 28: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Speed of topic extraction

# documents    # themes    Running time
500            5           8.3 s
500            10          10.6 s
1000           5           17.6 s
10k            30          350 s
16k            150         4000 s

Page 29: Topic Extraction from Biology Literature: Prior, Labeling, and Switching

Questions?

Thanks!