Analyzing Federal Funding, Scientific Publications and Email with Probabilistic Topic Models
Mark Steyvers, UC Irvine
Padhraic Smyth, UC Irvine
Dave Newman, UC Irvine
Tom Griffiths, Brown University
Analyzing Content/ Managing Information
JOURNALS
NEWSPAPERS
The Problem
Many information retrieval systems assess similarity of documents on the raw word counts
DOCUMENT: CAR, CHEAP, PRICE, ...
DOCUMENT: AUTOMOBILE, AFFORDABLE, AMOUNT, ...
no word overlap but high similarity
DOCUMENT: BANK, MONEY, ...
DOCUMENT: BANK, RIVER, ...
high word overlap but low similarity
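The mismatch between word overlap and similarity can be reproduced with a minimal sketch. It assumes cosine similarity over bag-of-words counts as the raw-count measure (the slides do not name a specific measure), and the helper `cosine` and the variable names are illustrative only.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The four toy documents from the slide
car_doc = Counter("car cheap price".split())
auto_doc = Counter("automobile affordable amount".split())
money_doc = Counter("bank money".split())
river_doc = Counter("bank river".split())

# Same meaning, no shared words: raw counts see no similarity at all
no_overlap = cosine(car_doc, auto_doc)
# Different meaning, shared word "bank": raw counts see substantial similarity
high_overlap = cosine(money_doc, river_doc)
print(no_overlap, high_overlap)
```

The raw-count score is lowest for the pair a reader would call most similar, which is the problem the topic-based comparison is meant to address.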
One solution: compare documents on a latent set of factors (topics)
DOCUMENT: CAR, CHEAP, PRICE, ... (topic 1, topic 2)
DOCUMENT: AUTOMOBILE, AFFORDABLE, AMOUNT, ... (topic 1, topic 2)
high topical overlap
DOCUMENT: BANK, MONEY, ... (topic 3)
DOCUMENT: BANK, RIVER, ... (topic 4)
no topical overlap
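A small sketch of comparing documents in topic space rather than word space. The topic mixtures below are hypothetical values chosen to mirror the slide, and the Hellinger distance is only one reasonable choice of measure; the slides do not specify one.

```python
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return sqrt(0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)))

# Hypothetical mixtures over topics 1..4 (illustrative values, not from the slides)
doc_car = [0.9, 0.1, 0.0, 0.0]
doc_automobile = [0.8, 0.2, 0.0, 0.0]
doc_bank_money = [0.0, 0.0, 1.0, 0.0]
doc_bank_river = [0.0, 0.0, 0.0, 1.0]

close = hellinger(doc_car, doc_automobile)       # high topical overlap
far = hellinger(doc_bank_money, doc_bank_river)  # no topical overlap
print(close, far)
```

Under these assumed mixtures the two car documents are close even though they share no words, and the two "bank" documents are maximally far apart even though they share one.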
2nd generation systems
Go beyond the raw word information
Extract content in terms of topics
Deal with large sets of documents
Minimal supervision
Probabilistic Topic Models
Originated in the domain of statistics & machine learning
Perform unsupervised extraction of topics from large text collections
Text documents: scientific articles, book chapters, newspaper articles, ... any set of words in a verbal context
Overview
I Probabilistic Topic Models
II Analyzing Scientific Topics: PNAS
III Analyzing Topics of Federal Funding
IV Analyzing Enron Email
V Extensions of the model
VI Conclusion
Probabilistic Topic Models
Each document is a probability distribution over topics
Each topic is a probability distribution over words
We do not observe these distributions but we can infer them statistically
The Generative Model
View document generation as a probabilistic process
[Figure: TOPICS MIXTURE, then TOPIC ... TOPIC, then WORD ... WORD]
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the topic
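The three steps can be sketched in a few lines. This is a minimal illustration, not the authors' code: the five-word vocabulary, the two topic distributions, the symmetric Dirichlet parameter `alpha`, and the function name `generate_document` are all assumptions made for the example.

```python
import numpy as np

def generate_document(topic_word, vocab, n_words, alpha, rng):
    """Sample one document by following the three generative steps."""
    n_topics = topic_word.shape[0]
    # 1. choose a mixture of topics for the document
    theta = rng.dirichlet(np.full(n_topics, alpha))
    # 2. for every word slot, sample a topic from the mixture
    z = rng.choice(n_topics, size=n_words, p=theta)
    # 3. sample a word from the chosen topic
    words = [vocab[rng.choice(len(vocab), p=topic_word[t])] for t in z]
    return theta, z, words

# Assumed toy vocabulary and topics (each row is a distribution over words)
vocab = ["money", "loan", "bank", "river", "stream"]
topic_word = np.array([
    [1 / 3, 1 / 3, 1 / 3, 0.0, 0.0],  # a "finance" topic
    [0.0, 0.0, 1 / 3, 1 / 3, 1 / 3],  # a "river" topic
])

rng = np.random.default_rng(0)
theta, z, words = generate_document(topic_word, vocab, n_words=16, alpha=1.0, rng=rng)
print(theta, list(zip(words, z)))
```

Note that "bank" can be produced by either topic, so the same word may carry different topic labels within one document.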
Example
Mixture components
TOPIC 1: money, loan, bank
TOPIC 2: river, stream, bank
Mixture weights (figure labels): .8, .2, .3, .7
DOCUMENT 1: money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2
DOCUMENT 2: river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 river2 loan1 bank2 stream2 bank2 money1 loan1 river2 stream2 bank2 stream2 bank2 money1 river2 stream2 loan1 bank2 river2 bank2 money1 bank1 stream2 river2 bank2 stream2 bank2 money1
(the number after each word is the topic it was sampled from)
Bayesian approach: use priors
Mixture weights ~ Dirichlet( )
Mixture components ~ Dirichlet( )
INVERTING THE GENERATIVE PROCESS
TOPIC 1
TOPIC 2
? ? ? (unknown quantities in the figure)
DOCUMENT 1: A Play is written to be performed on a stage before a live audience or before motion picture or television cameras ( for later viewing by large audiences ). A Play is written because playwrights have something ...
DOCUMENT 2: He was listening to music coming from a passing riverboat. The music had already captured his heart as well as his ear. It was jazz. Bix Beiderbecke had already had music lessons. He wanted to play the cornet. And he wanted to play jazz ...
We estimate the assignments of topics to words
INVERTING THE GENERATIVE PROCESS
TOPIC 1
TOPIC 2
DOCUMENT 1: A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 ( for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ...
DOCUMENT 2: He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077 ...
We estimate the assignments of topics to words
Choosing number of topics
Subjective interpretability
Bayesian model selection
Generalization tests
Models that grow with size of data
Input/Output
INPUT: word-document counts
OUTPUT:
topic assignments to each word: z
likely words in each topic: P( w | z )
likely topics for a document ("gist"): P( z | w )
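As a hedged illustration of how the outputs relate to one another, the sketch below turns topic-assignment counts into smoothed estimates of the likely words per topic and likely topics per document. The count tables, the smoothing constants `alpha` and `beta`, and the names `estimate_distributions` and `p_z_given_d` are assumptions for the example.

```python
import numpy as np

def estimate_distributions(nkw, ndk, alpha=0.1, beta=0.01):
    """Turn assignment counts into smoothed probability estimates."""
    # likely words in each topic: each row is a distribution over words
    p_w_given_z = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    # likely topics for each document: each row is a distribution over topics
    p_z_given_d = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    return p_w_given_z, p_z_given_d

vocab = ["money", "loan", "bank", "river", "stream"]
# Hypothetical counts: 2 topics x 5 words, and 2 documents x 2 topics
nkw = np.array([[6, 5, 7, 0, 0], [0, 0, 4, 6, 5]], dtype=float)
ndk = np.array([[14, 4], [4, 11]], dtype=float)

p_w_given_z, p_z_given_d = estimate_distributions(nkw, ndk)
# three most probable words per topic
top_words = [[vocab[i] for i in np.argsort(-row)[:3]] for row in p_w_given_z]
print(top_words)
print(p_z_given_d)
```

Ranking words by their probability under a topic is what produces word lists of the kind shown on the topic slides.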
Example: topics from an educational corpus (TASA)
• 37K docs, 26K words
• 1700 topics, e.g.:
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
Polysemy
The same word can appear in several topics (here: PLAY, CHARACTERS, COURT, EVIDENCE, TEST, TRY)
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
Three documents with the word “play”
(numbers & colors indicate topic assignments)
A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 ( for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ...
He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077 ...
Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166 ...
PNAS Topics
Applied model to PNAS abstracts (Proceedings of the National Academy of Sciences)
A selection of topics (out of 300)
FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
PNAS Topics and classes
PNAS authors provide class designations
major: Biological, Physical, Social Sciences
minor: 33 separate disciplines
Find topics diagnostic of classes
validate “reality” of classes
show how disciplines overlap topically
TOPIC 210: SYNAPTIC NEURONS POSTSYNAPTIC HIPPOCAMPAL SYNAPSES LTP PRESYNAPTIC TRANSMISSION POTENTIATION PLASTICITY EXCITATORY RELEASE DENDRITIC PYRAMIDAL HIPPOCAMPUS
Classes shown with Topic 210: Neurobiology
TOPIC 280: SPECIES SELECTION EVOLUTION GENETIC POPULATIONS POPULATION VARIATION NATURAL EVOLUTIONARY FITNESS ADAPTIVE RATES THEORY TRAITS DIVERSITY
Classes shown with Topic 280: Evolution, Population biology
TOPIC 39: THEORY TIME SPACE GIVEN PROBLEM SHAPE SIMPLE DIMENSIONAL PAPER NUMBER CASE LOCAL TERMS SYMMETRY RANDOM
Classes shown with Topic 39: Mathematics, Applied Mathematics
Topic Dynamics
We have the distribution over topics for PNAS abstracts from 1991 to 2001
Analysis of dynamics: perform linear trend analysis for each topic
“hot topics” go up, “cold topics” go down
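A minimal sketch of the trend analysis, assuming an ordinary least-squares line fitted to each topic's yearly mean probability. The yearly values below are synthetic and only illustrate the mechanics; `topic_trends` is a hypothetical helper name.

```python
import numpy as np

def topic_trends(years, topic_by_year):
    """Least-squares slope of P(topic) against year, one slope per topic (column)."""
    years = np.asarray(years, dtype=float)
    # polyfit with a 2-D y fits every column independently; row 0 holds the slopes
    return np.polyfit(years - years.mean(), topic_by_year, deg=1)[0]

years = np.arange(1991, 2002)  # 1991 to 2001 inclusive
# Synthetic yearly mean topic probabilities (illustrative, not PNAS data)
rising = 0.002 + 0.0005 * (years - 1991)
falling = 0.010 - 0.0006 * (years - 1991)
flat = np.full(len(years), 0.004)
topic_by_year = np.column_stack([rising, falling, flat])

slopes = topic_trends(years, topic_by_year)
hot = int(np.argmax(slopes))   # steepest upward trend
cold = int(np.argmin(slopes))  # steepest downward trend
print(slopes, hot, cold)
```

In practice one would also check whether each slope is statistically distinguishable from zero before labeling a topic hot or cold.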
[Figure: two panels plotting P(topic) against year, 1990 to 2002. Cold topics panel: topics 289, 37, 75 (y-axis 2 to 14 x 10^-3). Hot topics panel: topics 179, 2, 134 (y-axis 0 to 0.01).]
Hot topics
Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
Cold topics
Topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
Topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
Topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING
Figure annotations: NOBEL 1987, NOBEL 2002
Analyzing Topics of Funding
Get a large-scale overview of funding for social sciences
How similar are different funding programs?
How is funding distributed over topics?
Dataset
22,189 abstracts from grants active in 2003
NIH
  NIMH (National Institute of Mental Health)
  NCI (National Cancer Institute)
NSF
  SBE (Social, Behavioral and Economic Sciences)
  BIO (Biological Sciences)
Extracted topics (1..20)
Topic  Interpretation          Likely words
1      training program        research students program training faculty
2      mental health services  health intervention care mental services
3      protein binding         protein proteins binding domain domains
4      pathway signaling       signaling kinase pathways signal activation
5      neural activity         neurons brain neuronal synaptic activity
6      collaborative projects  university award research project collaboration
7      gene expression         expression gene transcription regulation genes
8      immunology              cells tumor immune cell antigen
9      children/family         children child family school adolescents
10     archaeology             sites site archaeological region data
11     genetics                genes gene genome sequence genetic
12     ecosystem/climate       forest soil climate ecosystem carbon
13     tumors                  tumor tumors human expression mammary
14     gene mutation           mice mutations gene mutant genes
15     cell differentiation    cell cells cycle growth differentiation
16     psychiatric disorders   depression disorders disorder symptoms psychiatric
17     research training       research training development career candidate
18     clinical trials         patients clinical treatment therapy trials
19     cancer                  cancer breast lung cancers ovarian
20     research center         research center program core support
80 interpretable topics (out of 100)
training program        environmental issues  software/databases      rna
mental health services  patient treatment     metabolism              sexual behavior
protein binding         dna repair            equipment/facilities    molecular mechanisms
pathway signaling       drugs/agents          sample analysis         modeling
neural activity         structural biology    gene development        imaging techniques
collaborative projects  public policy         ecology                 family genetics
gene expression         protein physiology    information technology  specimen collection
immunology              apoptosis             markets                 stress response
children/family         evolution             cell receptors          viral infections
archaeology             economic forces       genetic variation       ethnic minorities
genetics                cancer treatment      research development    hypothesis testing
ecosystem/climate       tumor growth          theory                  bacteria
tumors                  conference/meetings   commercial development  systems research
gene mutation           group sociology       marine environment      calcium channels
cell differentiation    memory & cognition    population screening    individual differences
psychiatric disorders   risk factors          plant growth            hormones
research training       data analysis         circadian rhythms       smoking
clinical trials         tumor therapy         leukemia                research training biology
cancer                  functional imaging    science/technology      language
research center         imaging systems       decision making         energy transfer
Likely topics for NIH
Interpretation          %NIH  %SBE  %BIO
cancer                  94    3     3
tumors                  92    3     5
clinical trials         91    4     5
psychiatric disorders   90    5     5
tumor therapy           89    4     7
cancer treatment        89    5     6
leukemia                89    4     7
patient treatment       88    6     6
immunology              87    3     9
mental health services  87    8     5
Likely topics for NSF-SBE
Interpretation          %NIH  %SBE  %BIO
collaborative projects  9     81    9
public policy           14    79    7
archaeology             12    76    12
economic forces         18    73    10
markets                 18    70    12
environmental issues    16    65    19
science/technology      21    63    17
systems research        19    62    19
language                29    59    12
decision making         33    51    16
Likely topics for NSF-BIO
Interpretation          %NIH  %SBE  %BIO
plant growth            10    9     81
ecology                 9     14    77
evolution               12    13    76
bacteria                17    9     74
genetic variation       19    13    69
ecosystem/climate       8     24    69
gene development        30    6     63
marine environment      13    27    60
specimen collection     17    24    59
protein physiology      37    6     58
Program similarity using topics
[Figure: program similarity plot with axis ticks 1 to 57, one per Level 4 program listed below]
Funding hierarchy, Level 1 to Level 4 (counts in parentheses as given on the slide)
Dept of Health and Human Services (11609)
  National Institutes of Health (11609)
    National Cancer Institute (7574)
      1 Cancer biology, detection and diagnosis
      2 AIDS Research
      3 Cancer Research Centers
      4 Cancer causation
      5 Cancer prevention and control
      6 Cancer treatment
      7 NCI research manpower development
    National Institute of Mental Health (4035)
      8 AIDS Research
      9 Extramural research
      10 Intramural research
National Science Foundation (10580)
  Social, Behavioral, and Economic Sciences (SBE) (4584)
    Behavioral and cognitive sciences (BCS) (1469)
      11 Archaeology, archeometry, and ...
      12 Behavioral and cognitive sciences - Other
      13 Child learning and development
      14 Cultural anthropology
      15 Environmental social and behavioral science
      16 Geography and regional science
      17 Human cognition and perception
      18 Instrumentation
      19 Linguistics
      20 Physical anthropology
      21 Social psychology
    International science and engineering (INT) (formerly International cooperative scientific activities) (1299)
      22 Africa, Near East, and South Asia
      23 Americas
      24 Central and Eastern Europe
      25 East Asia and Pacific
      26 International activities - Other
      27 Japan and Korea
      28 Western Europe
    Social and economic sciences (SES) (1816)
      29 Decision, risk, and management science
      30 Methodology, measures, and statistics
      31 Economics
      32 Ethics and values studies
      33 Innovation and organizational change
      34 Law and social science
      35 Political science
      36 Research on science and technology
      37 Science and technology studies
      38 Social and economic sciences - Other
      39 Sociology
      40 Transformations to quality organizations
  Biological Sciences (BIO) (5996)
    Biological infrastructure (BIR/DBI) (1061)
      41 Biological infrastructure - Other
      42 Human resources
      43 Instrumentation
      44 Research resources
    Environmental biology (DEB) (1609)
      45 Ecological studies
      46 Environmental biology - Other
      47 Systematic & population biology
    Integrative biology and neuroscience (IBN/BNS) (1673)
      48 Developmental mechanisms
      49 Integrative biology and neuroscience - Other
      50 Neuroscience
      51 Physiology and ethology
    Molecular and cellular biosciences (MCB/DCB) (1534)
      52 Biochemical and biomolecular processes
      53 Biomolecular structure & function
      54 Cell biology
      55 Genetics
      56 Molecular and cellular biosciences - Other
    Plant genome research (119)
      57 Plant genome research project
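One way to compare programs by topic is sketched below. It assumes each program is summarized by the mean topic distribution of its grants and that pairs of programs are compared with the Jensen-Shannon divergence; the slides do not state which measure was used. The grant mixtures and the program labels "A", "B", "C" are hypothetical.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two topic distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def program_profiles(theta, program_ids):
    """Mean topic distribution over the grants belonging to each program."""
    profiles = {}
    for pid in sorted(set(program_ids)):
        rows = [i for i, p in enumerate(program_ids) if p == pid]
        profiles[pid] = theta[rows].mean(axis=0)
    return profiles

# Hypothetical topic mixtures for six grants over three topics
theta = np.array([
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1],   # program A
    [0.6, 0.3, 0.1], [0.7, 0.2, 0.1],   # program B
    [0.1, 0.1, 0.8], [0.1, 0.2, 0.7],   # program C
])
program_ids = ["A", "A", "B", "B", "C", "C"]

profiles = program_profiles(theta, program_ids)
d_ab = js_divergence(profiles["A"], profiles["B"])
d_ac = js_divergence(profiles["A"], profiles["C"])
print(d_ab, d_ac)
```

A full pairwise table of such divergences could then feed a similarity plot or a low-dimensional layout of programs.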
[Figure: 2D visualization of funding programs – nearby programs support similar topics. Each point is one of the 57 funding programs, labeled with its division (NCI, NIMH, BCS, INT, SES, BIR, DEB, IBN, MCB, PGR) and program name. Three regions are marked: NIH, NSF – BIO, NSF – SBE.]
Funding Amounts per Topic
We have $ funding per grant
We have % of topics for each grant
We can solve for the $ amount per topic
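The slides do not spell out how the per-topic amounts are solved for. One simple reading, sketched here as an assumption, is proportional allocation: each grant's dollars are split across topics according to its topic percentages and then summed over grants. The grant amounts and mixtures are made up for illustration.

```python
import numpy as np

def funding_per_topic(theta, dollars):
    """Split each grant's dollars across topics in proportion to its topic mixture."""
    theta = np.asarray(theta, dtype=float)      # grants x topics, rows sum to 1
    dollars = np.asarray(dollars, dtype=float)  # one amount per grant
    topic_dollars = theta.T @ dollars           # dollars attributed to each topic
    topic_percent = 100 * topic_dollars / dollars.sum()
    return topic_dollars, topic_percent

# Hypothetical data: three grants over three topics
theta = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.0, 0.8],
    [0.0, 1.0, 0.0],
])
dollars = np.array([100_000.0, 50_000.0, 250_000.0])

topic_dollars, topic_percent = funding_per_topic(theta, dollars)
print(topic_dollars, topic_percent)
```

Because every grant's mixture sums to one, the per-topic amounts add up to the total funding, so the percentages can be ranked directly to find expensive and inexpensive topics.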
What are expensive topics?
High $$$ topics
Funding %  Interpretation
3.47       research center
2.87       cancer control
2.26       mental health services
2.01       clinical treatment
1.87       cancer
1.73       gene sequencing
1.61       risk factors
1.56       children/parents
1.51       tumors
1.48       training program
1.47       immunology
1.43       disorders
1.40       patient treatment
Low $$$ topics
Funding %  Interpretation
0.60       conference/meetings
0.56       theory
0.55       public policy
0.55       collaborative projects
0.55       marine environment
0.55       decision making
0.55       ecological diversity
0.53       sexual behavior
0.52       markets
0.51       science/technology
0.49       computer systems
0.45       language
0.44       archaeology
Enron email data
500,000 emails
5000 authors
1999-2002
Enron topics
[Figure: topics placed along a timeline, 2000 to 2003]
PERSON1 PERSON2
TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
TIMELINE
May 22, 2000: Start of California energy crisis
Pennsylvania Gazette
1728-1800
80,000 articles
(courtesy of David Newman & Sharon Block, History Department, UC Irvine)
Historical Trends in Pen. Gazette
[Figure: topic proportion (%) against year, 1730 to 1800, y-axis 0 to 10, for the two topics below]
STATE GOVERNMENT CONSTITUTION LAW UNITED POWER CITIZEN PEOPLE PUBLIC CONGRESS
SILK COTTON DITTO WHITE BLACK LINEN CLOTH WOMEN BLUE WORSTED
(courtesy of David Newman & Sharon Block, UC Irvine)
Learning Topic Hierarchies
In the regular topic model, there are no relations between topics
Alternative: hierarchical topic organization
topic 1
topic 2 topic 3
topic 4 topic 5 topic 6 topic 7
Apply to Psych Review abstracts
[Figure: learned topic hierarchy. The tree layout is not recoverable from the extraction; the node word lists follow, five words per node.]
theory model data information proposed
model theory models word response
reading text readers meaning comprehension
bias associative matrices matrix al
memory list item items recognition
distributed grams associate associations paired
strength familiarity retroactive deviation likelihood
response instrumental responses conditioning behavior
choice delays alternatives fixed reward
memory model models information social
knowledge skill reading access specific
model effects learning theory systems
memory retrieval serial storage working
preference reinforcement choice punishment contingent
model theory information effects account
images perception according lightness objects
visual imagery representations mental subsystems
movement eye position speed target
orientation erotic bem sexual ebe
situational consistency cross temporal behavior
object based neglect attention space
stimuli visual component contour forward
attribute stochastic choice difference transitivity
masking metacontrast type inhibition mask
serial function latency position items
reasoning bayesian similarities statements gain
similarity geometric objects density distance
ce conditioning principles reinforcement reward
model memory processes models learning
image components bound nearest neighbor
memory reasoning interference process background
theory sentence james fit emotion
model memory decision response theory
theory achievement emotion motivation failure
model cs avoidance ucs conditioning
model memory problems items theoretical
goodness approach representation holographic pictorial
letters model words letter memory
function psychometric correlations individuals performance
stress system immune arousal fight
sex affects biological differences handedness
cognitive gigerenzer heuristics reasoning biases
child children development field risk
bayesian inference algorithms authors frequency
speech auditory acoustic perceptual sound
action control intention goal intentions
personality behavior trait consistency idiographic
surface representations surfaces occluding contour
psychological psychology review american association
events interpersonal event impersonal equilibrium
categories category metaphor object metaphors
motion contrast path visual contour
left cerebral handedness speech human
social perception impression research approach
sleep imagery dreaming rem eye
reinforcement behavior extinction matching partial
binocular rivalry stereopsis monocular visual
structure relations scale dimensional keys
risk conjunction decision probabilities risky
distance retinal disparity image perceived
perception visual direction rule adaptation
part thinking kind scientific activities
behavior development evolutionary genes comparative
group intelligence intellectual iq connections
behavior food drinking hypothalamus physiological
task resource performance processing anaphors
developmental social ethnic processes development
fear anxiety pain amygdala automatic
neural visual neurons behavioral masking
strategies problems term confirmation limitations
language semantic linguistic thought correlations
learning maps map barrier parallel
statistical heuristics knowledge intuitive heuristic
face recognition faces damaged semantic
Integrating Topics and Syntax
Syntactic dependencies: short-range dependencies
Semantic dependencies: long-range
[Figure: graphical model with, for each of four word positions, a topic variable z, a word w, and a syntactic state s]
Semantic state: generate words from topic model
Syntactic states: generate words from HMM
(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
...
IN BY WITH ON AS FROM TO FOR
THE A AN THIS THEIR ITS EACH ONE
IS ARE BE HAS HAVE WAS WERE AS
BASED PRESENTED DISCUSSED PROPOSED DESCRIBED SUCH USED DERIVED
THEORY MODEL PROCESSES MODELS SYSTEM PROCESS EFFECTS INFORMATION
ATTENTION SEARCH VISUAL PROCESSING TASK PERFORMANCE INFORMATION ATTENTIONAL
MEMORY TERM LONG SHORT RETRIEVAL STORAGE MEMORIES AMNESIA
IQ BEHAVIOR EVOLUTIONARY ENVIRONMENT GENES HERITABILITY GENETIC SELECTION
DRUG AROUSAL NEURAL BRAIN HABITUATION BIOLOGICAL TOLERANCE BEHAVIORAL
SOCIAL SELF ATTITUDE IMPLICIT ATTITUDES PERSONALITY JUDGMENT PERCEPTION
(S) THE SEARCH IN LONG TERM MEMORY ……
(S) A MODEL OF VISUAL ATTENTION ……
Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL
Conclusion
Unsupervised extraction of content from large text collections
Topics provide quick overview of content
Topic models: text-mining / information retrieval, psychology / memory
Connection?
Good semantic memory models for finding semantically relevant information might also be good information retrieval models
Psych Review abstracts
All 1281 abstracts since 1967
50 topics – examples:
SIMILARITY CATEGORY CATEGORIES RELATIONS DIMENSIONS FEATURES STRUCTURE SIMILAR REPRESENTATION OBJECTS
STIMULUS CONDITIONING LEARNING RESPONSE STIMULI RESPONSES AVOIDANCE REINFORCEMENT CLASSICAL DISCRIMINATION
MEMORY RETRIEVAL RECALL ITEMS INFORMATION TERM RECOGNITION ITEMS LIST ASSOCIATIVE
GROUP INDIVIDUAL GROUPS OUTCOMES INDIVIDUALS GROUPS OUTCOMES INDIVIDUALS DIFFERENCES INTERACTION
EMOTIONAL EMOTION BASIC EMOTIONS AFFECT STATES EXPERIENCES AFFECTIVE AFFECTS RESEARCH
...