Mining the 20th Century’s History from
Folgert Karsdorp, Mike Kestemont, Antal van den Bosch, Walter Daelemans & Dan Roth
Guest Lecture, AI Course, ULB, 9 May 2014
Corpus statistics
• 1923-2006
• M. Davies
• TIME U.S.
• ~194M words
• ~873K unique
• ~274K documents
• Stanford NLP
[Figure: scatter plot, x-axis "Year" (1920-2000), y-axis "Documents" (ticks at 2000, 3000, 4000)]
[Figure: scatter plot, x-axis "Year" (1920-2000), y-axis "Words" (ticks at 500, 750, 1000, 1250)]
Culturomics
• Reflection of cultural phenomena in language
• Michel et al. 2011
• Google Books
• Typically:
• Diachronic aspect
• Large corpora
• Also Leetaru 2011
Parsimonious Language Models
• Hiemstra et al. 2004
• IR context: term relevance
• Compact language model per document
• Low probabilities for
• stopwords
• rare terms
• Probabilistic TF-IDF
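A minimal sketch of how such a compact model can be estimated, assuming the EM formulation usually associated with parsimonious language models: the E-step computes how much of each term count is explained by the document model rather than the corpus model, and the M-step renormalises. The function name, the mixing weight `lam=0.1`, and the fixed iteration count are illustrative assumptions; pruning of low-probability terms is left out.

```python
from collections import Counter

def parsimonious_lm(doc_tokens, corpus_counts, lam=0.1, iterations=50):
    """Estimate P(w|D) for one document against a background corpus model.

    Assumes every document term also occurs in corpus_counts (the document
    is part of the corpus), so that P(w|C) > 0 for all document terms.
    """
    tf = Counter(doc_tokens)
    corpus_total = sum(corpus_counts.values())
    doc_total = sum(tf.values())
    # Background model P(w|C) and maximum-likelihood start for P(w|D).
    p_c = {w: corpus_counts[w] / corpus_total for w in tf}
    p_d = {w: tf[w] / doc_total for w in tf}
    for _ in range(iterations):
        # E-step: expected count of w attributed to the document model.
        e = {
            w: tf[w] * lam * p_d[w] / ((1 - lam) * p_c[w] + lam * p_d[w])
            for w in tf
        }
        # M-step: renormalise the expected counts into a distribution.
        norm = sum(e.values())
        p_d = {w: e[w] / norm for w in tf}
    return p_d

# Toy example: "the" is frequent everywhere, "war" is specific to the document.
corpus = Counter({"the": 1000, "war": 50, "radio": 20, "year": 300})
doc = ["the"] * 10 + ["war"] * 5 + ["radio"] * 2 + ["year"] * 3
model = parsimonious_lm(doc, corpus)
```

In the toy example, probability mass moves away from "the" and towards "war", which is the behaviour the bullets describe for stopwords.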
Control for specificity
Model brings an IR perspective to historical analysis: “If somebody would search for a document from the forties, which terms would (s)he use?”
Setup
• Build composite documents:
• per year
• per decade (e.g. sixties)
• Run PLM, only on nouns (NN, NNS)
• Extract characteristic vocabulary: P(w|D)
• Automated characterization of historical periods
• Zeitgeist analysis
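The composite-document step can be sketched as follows, assuming each article is available as a year plus a list of (token, POS tag) pairs. The input layout, the lowercasing, and the helper names are assumptions made for illustration; only the NN/NNS filter and the per-year/per-decade grouping come from the setup above.

```python
from collections import Counter, defaultdict

NOUN_TAGS = {"NN", "NNS"}  # Penn Treebank tags for common nouns

def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1963 -> 1960."""
    return (year // 10) * 10

def composite_documents(articles, period=decade_of):
    """Merge all nouns of all articles in the same period into one bag of words.

    articles: iterable of (year, [(token, pos_tag), ...]).
    period:   function mapping a year to a period key (decade by default).
    """
    composites = defaultdict(Counter)
    for year, tagged_tokens in articles:
        for token, tag in tagged_tokens:
            if tag in NOUN_TAGS:
                composites[period(year)][token.lower()] += 1
    return composites

articles = [
    (1943, [("war", "NN"), ("rages", "VBZ")]),
    (1947, [("bombs", "NNS"), ("war", "NN")]),
    (1961, [("rocket", "NN")]),
]
per_decade = composite_documents(articles)
per_year = composite_documents(articles, period=lambda year: year)
```

Each resulting counter can then be treated as one document when estimating P(w|D).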
Evaluation?
• Quantitative evaluation not straightforward
• Bring in historian
• Evidence = self-referential, self-explanatory
• … because you already know the corpus well
• Better suited for new, unknown corpora
Breaking points?
• No more “manually” setting periods
• Extract 5000 top-characteristic words from each year
• Variability-Based Nearest Neighbour Clustering
• Identification of temporal stages in diachronic data
• Gries & Hilpert 2008
• Cluster tree only merges adjacent nodes (e.g. 1943-1944)
• Ward linkage, cosine distance
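The adjacency constraint can be illustrated with a small sketch. Note the simplification: instead of Ward linkage it compares the summed vectors of neighbouring clusters by cosine distance, so it shows the "only merge adjacent nodes" idea rather than the exact procedure named above. All names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def adjacent_clustering(labels, vectors):
    """Agglomerative clustering that only ever merges temporal neighbours.

    labels:  period labels in chronological order (e.g. years).
    vectors: one feature vector per period, in the same order.
    Returns the merge history as (left_members, right_members, distance).
    """
    # Each cluster is (member labels, summed vector); cosine is scale
    # invariant, so the sum behaves like the cluster mean here.
    clusters = [([label], list(vec)) for label, vec in zip(labels, vectors)]
    history = []
    while len(clusters) > 1:
        dists = [
            cosine_distance(clusters[i][1], clusters[i + 1][1])
            for i in range(len(clusters) - 1)
        ]
        i = min(range(len(dists)), key=dists.__getitem__)
        left, right = clusters[i], clusters[i + 1]
        history.append((left[0], right[0], dists[i]))
        merged_vec = [a + b for a, b in zip(left[1], right[1])]
        clusters[i:i + 2] = [(left[0] + right[0], merged_vec)]
    return history

years = [1941, 1942, 1943, 1944]
toy_vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.3, 1.0]]
merges = adjacent_clustering(years, toy_vectors)
```

In the toy data the largest distance is between 1942 and 1943, so that boundary is merged last and shows up as the breaking point.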
TIME 100
• 20th century's 100 most influential people (1999, now yearly)
• Five categories:
• Leaders & Revolutionaries
• Builders & Titans
• Artists & Entertainers
• Scientists & Thinkers
• Heroes & Icons
Wikifier
• Problems:
• Which Clinton?
• Anne + Frank...
• Cross-document coreference resolution
• “Wikification”
<a href="http://en.wikipedia.org/wiki/Ronald_Reagan">Ronald Reagan</a> launched his campaign to make <a href="http://en.wikipedia.org/wiki/United_States">America</a> great again in <a href="http://en.wikipedia.org/wiki/Detroit">Detroit</a> in <a href="http://en.wikipedia.org/wiki/1980">1980</a>.
Let's go back to the <a href="http://en.wikipedia.org/wiki/Detroit">Motor City</a> and hold our <a href="http://en.wikipedia.org/wiki/2016">2016</a> national <a href="http://en.wikipedia.org/wiki/United_States_presidential_nominating_convention">nomination convention</a> in <a href="http://en.wikipedia.org/wiki/Detroit">Detroit</a>.
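Once text is wikified in this anchor-tag form, mentions can be mapped to entities with a few lines of standard-library code. The sketch below assumes the output format shown above (plain `<a href=".../wiki/Title">` links); the class and function names are hypothetical and not part of the Wikifier itself.

```python
from html.parser import HTMLParser

class WikiLinkParser(HTMLParser):
    """Collect (mention, entity) pairs from wikified text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._entity = None   # Wikipedia title of the link being read
        self._text = []       # text fragments inside the current link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Keep only the page title after the last "/wiki/".
            self._entity = href.rsplit("/wiki/", 1)[-1]
            self._text = []

    def handle_data(self, data):
        if self._entity is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._entity is not None:
            self.links.append(("".join(self._text), self._entity))
            self._entity = None

def extract_entities(wikified):
    parser = WikiLinkParser()
    parser.feed(wikified)
    parser.close()
    return parser.links

sample = (
    'Back to the <a href="http://en.wikipedia.org/wiki/Detroit">Motor City</a>, '
    'that is, <a href="http://en.wikipedia.org/wiki/Detroit">Detroit</a>.'
)
pairs = extract_entities(sample)
```

Both surface forms resolve to the same entity identifier, which is what makes counting persons across documents possible.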
adolf_hitler
akio_morita
alan_turing
albert_einstein
alexander_fleming
amadeo_giannini
andrei_sakharov
anne_frank
aretha_franklin
bart_simpson
bill_gates
bill_w.
billy_graham
bob_dylan
bruce_lee
charles_e._merrill
charles_lindbergh
charlie_chaplin
che_guevara
coco_chanel
david_ben-gurion
david_sarnoff
diana,_princess_of_wales
edmund_hillary
edwin_hubble
eleanor_roosevelt
emmeline_pankhurst
enrico_fermi
francis_crick
frank_sinatra
franklin_d._roosevelt
g.i._(military)
harvey_milk
helen_keller
henry_ford
ho_chi_minh
igor_stravinsky
jackie_robinson
james_joyce
jean_piaget
jim_henson
john_maynard_keynes
jonas_salk
juan_trippe
kennedy_family
kurt_gödel
le_corbusier
lech_wałęsa
leo_baekeland
leo_burnett
louis_armstrong
louis_b._mayer
louis_leakey
lucille_ball
lucky_luciano
ludwig_wittgenstein
mao_zedong
margaret_sanger
margaret_thatcher
marilyn_monroe
marlon_brando
martha_graham
martin_luther_king,_jr.
mikhail_gorbachev
mother_teresa
muhammad_ali
nelson_mandela
oprah_winfrey
pablo_picasso
pelé
pete_rozelle
philo_farnsworth
pope_john_paul_ii
rachel_carson
robert_h._goddard
rodgers_and_hammerstein
ronald_reagan
rosa_parks
ruhollah_khomeini
sam_walton
sigmund_freud
steven_spielberg
t._s._eliot
tank_man
tenzing_norgay
the_beatles
theodore_roosevelt
thomas_watson,_jr.
tim_berners-lee
vladimir_lenin
walt_disney
walter_reuther
william_levitt
william_shockley
willis_carrier
winston_churchill
wright_brothers
[Figure legend, CATEGORY: ArtistsAndEntertainers, BuildersAndTitans, HeroesAndIcons, LeadersAndRevolutionaries, ScientistsAndThinkers]
Simple measure
Criterion: Longest continuous span of trimesters of TIME issues in which a person is mentioned...
Shared 1st place with Joyce, Eliot, Roosevelt, ...
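The criterion amounts to finding the longest run of consecutive periods. A minimal sketch, assuming each trimester has already been mapped to a running integer index (how dates map to trimesters is not specified here, so that step is left out):

```python
def longest_continuous_span(periods):
    """Length of the longest run of consecutive period indices.

    periods: iterable of integer trimester indices in which the person
             is mentioned at least once (duplicates are fine).
    """
    present = set(periods)
    best = 0
    for p in present:
        if p - 1 not in present:          # p is the start of a run
            length = 1
            while p + length in present:
                length += 1
            best = max(best, length)
    return best

span = longest_continuous_span([3, 4, 5, 9, 10])
```

A person mentioned in every trimester of the corpus reaches the maximum possible span, which is one way several people can end up sharing first place.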
Person of the Year
• 11/12/13...
• Since 1927 (Lindbergh)
• Most influence
• For better or worse
• Sometimes deviant (“You”)
• Predict this
Methods
• Cast task as ranking problem: POY@1
• TIME publishes shortlist
• More insightful: inspect top ranking
• Evaluation:
• Mean Reciprocal Rank (cutoff @20)
• Accuracy @10, @5, @1
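The two evaluation measures can be sketched as below, with one ranking and one actual Person of the Year per election year. Function names are illustrative; a winner ranked below the cutoff contributes a reciprocal rank of 0.

```python
def reciprocal_rank(ranking, winner, cutoff=20):
    """1/rank of the winner within the top `cutoff`, else 0."""
    top = ranking[:cutoff]
    return 1.0 / (top.index(winner) + 1) if winner in top else 0.0

def mean_reciprocal_rank(rankings, winners, cutoff=20):
    """Average reciprocal rank over all years."""
    scores = [reciprocal_rank(r, w, cutoff) for r, w in zip(rankings, winners)]
    return sum(scores) / len(scores)

def accuracy_at(rankings, winners, k):
    """Fraction of years in which the winner is ranked in the top k."""
    hits = [w in r[:k] for r, w in zip(rankings, winners)]
    return sum(hits) / len(hits)

rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
winners = ["a", "a", "x"]   # "x" was never retrieved in the third year
mrr = mean_reciprocal_rank(rankings, winners)
acc1 = accuracy_at(rankings, winners, 1)
acc2 = accuracy_at(rankings, winners, 2)
```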
Learning to Rank
• Steps:
1. Retrieve 100 candidates via DF (baseline)
2. Extract features for each candidate
3. Rerank 100 via Learning to Rank
• Ranklib (LambdaMART)
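Steps 1 and 2 can be sketched as follows, assuming each document has been reduced to the set of persons it mentions. The training lines follow the `label qid:<id> <feature>:<value> ...` layout that RankLib reads; using the election year as the query id and the function names are assumptions made for illustration.

```python
from collections import Counter

def df_candidates(documents, n=100):
    """Rank persons by document frequency and keep the top n.

    documents: iterable of collections of person identifiers,
               one collection per document.
    """
    df = Counter()
    for persons in documents:
        df.update(set(persons))   # count each person once per document
    return [person for person, _ in df.most_common(n)]

def letor_line(label, query_id, features, comment=""):
    """Format one candidate as a line in LETOR-style training format."""
    feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    line = f"{label} qid:{query_id} {feats}"
    return f"{line} # {comment}" if comment else line

docs = [{"obama", "putin"}, {"obama"}, {"obama", "putin", "cyrus"}]
candidates = df_candidates(docs, n=2)
example_line = letor_line(1, 2013, [0.5, 3], comment="obama")
```

The reranking itself (step 3) is then done by the learning-to-rank toolkit on a file of such lines.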
Topical features
• Topic metadata (science, politics, music, …)
• # of topics person appears in
• Topical variance: coefficient of variation over topics
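A small sketch of the two topical features, assuming per-person mention counts per topic are available as a dictionary. Using the population standard deviation for the coefficient of variation is an assumption; the slides do not say which variant was used.

```python
import statistics

def topical_features(topic_counts):
    """Return (number of topics with mentions, coefficient of variation).

    topic_counts: dict mapping topic name -> number of mentions of the
                  person in documents with that topic label.
    """
    counts = list(topic_counts.values())
    n_topics = sum(1 for c in counts if c > 0)
    mean = statistics.fmean(counts)
    # Coefficient of variation: standard deviation relative to the mean.
    cv = statistics.pstdev(counts) / mean if mean else 0.0
    return n_topics, cv

spread_evenly = topical_features({"politics": 10, "science": 10, "music": 10})
one_topic_only = topical_features({"politics": 4, "science": 0})
```

A person mentioned evenly across topics gets a coefficient of 0, while a person confined to one topic gets a high value.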
Similarity features
• Distributional properties of persons:
• Similarity to previous POY(s)
• Similarity to “year” of election
• Word2vec (Mikolov et al. 2013)
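Given embedding vectors for persons (for example from a word2vec model trained on the corpus), a similarity feature can be sketched as a cosine between the candidate and a reference vector. Averaging the previous winners into one centroid is an assumption for illustration; the same function applies if the reference is the vector of the election "year" token.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def similarity_to_previous_winners(candidate_vec, winner_vecs):
    """Cosine between a candidate and the centroid of earlier winners."""
    return cosine(candidate_vec, mean_vector(winner_vecs))

# Toy 2-dimensional vectors standing in for real embeddings.
sim = similarity_to_previous_winners([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```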
Overall results
For the ~7000 persons mentioned each year
              Baseline   LTR
MRR@20        .28        .43
Top-10        .53        .64
Top-5         .46        .57
Top-1         .16        .31
Top-100       .91 (upper bound)
Ablation experiments
                          MRR@20
All features              .43
- frequency features      .37
- topical features        .35
- temporal features       .37
- network features        .31
- similarity features     .34
Adding features from previous years
                  MRR@20
Election year     .33
+1 year back      .43
+2 years back     .41
+3 years back     .41
+4 years back     .40
+5 years back     .37
2013 elections…!
• Fewer documents (ca. 1000)
• Worse metadata after 2006
• Wikifier 2011
• Give it a try anyway…
Our top-10
1. B. Obama
2. Vl. Putin
3. M. Cyrus
4. M. Zuckerberg
5. A. Schwarzenegger
6. B. Bernanke
7. S. Jobs
8. Pope Francis
9. L. Grossman
10. A. Jolie
Shortlist
• B. Assad
• J. Bezos
• T. Cruz
• M. Cyrus (our #3)
• Pope Francis (our #8)
• B. Obama (our #1)
• H. Rouhani
• K. Sebelius
• E. Snowden
• E. Windsor
Discussion (3)
• Newcomers are difficult to predict
• Bias towards slowly emerging candidates:
• Model has a very smooth view of history
• Conservative choice: similarity!
• Our system seems more “objective”:
• more women
• more bad guys