From keyword searching to discourse mining Pim Huijnen, Juliette Lonij DH2016, Kraków 15 July 2016
From: The oasis, 13 April 1912, p.9. Chronicling America: Historic American Newspapers. Lib. of Congress.
Tangherlini, T. R. and Leonard, P. (2013). Trawling in the Sea of the Great Unread: Sub-corpus
topic modeling and Humanities research, Poetics, 41: 725-749.
Van den Hoven, M., Van den Bosch, A. and Zervanou, K. (2010). Beyond Reported History:
Strikes That Never Happened. Proceedings of the First International AMICUS Workshop on
Automated Motif Discovery in Cultural Heritage and Scientific Communication Texts,
Vienna: 20-28.
Wiedemann, G. and Niekler, A. (2014). Document Retrieval for Large Scale Content Analysis
using Contextualized Dictionaries. Terminology and Knowledge Engineering, Berlin, June
2014: https://hal.archives-ouvertes.fr/hal-01005879.
Goals researcher-in-residence project
Using extensive and context-specific word lists (‘dictionaries’) to replace the contingency of single keywords
Developing a script to extract dictionaries from literature based on topic modeling
Experimenting with tools to visualise results of dictionary searching in kranten.delpher.nl
KB researcher-in-residence project
Flexibility (evaluation based on human expertise)
Transparency (avoiding black-boxing)
Practicality (available for the wider public)
Script to extract dictionaries
[Diagram; surviving labels: A, B, BC, Topic modeling, TF-IDF]
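As an illustration of the TF-IDF step named on this slide, the following is a minimal sketch and not the project's actual script. The function names (`tokenize`, `tfidf_keywords`) and the split into "target" and "background" documents are assumptions made for the example: words are ranked higher when they are frequent in the target texts and rare across all texts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def tfidf_keywords(target_docs, background_docs, top_n=10):
    """Rank words in target_docs by TF-IDF (hypothetical helper).

    Term frequency is counted over the target documents combined;
    document frequency is counted over target and background documents.
    """
    target_tokens = [tokenize(d) for d in target_docs]
    all_tokens = target_tokens + [tokenize(d) for d in background_docs]
    n_docs = len(all_tokens)

    # Number of documents in which each term occurs at least once.
    doc_freq = Counter()
    for tokens in all_tokens:
        doc_freq.update(set(tokens))

    term_freq = Counter(t for tokens in target_tokens for t in tokens)
    total = sum(term_freq.values())

    scores = {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Sort by descending score, then alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]
```

A word such as "de" that occurs in every document gets a score of zero and drops to the bottom of the ranking, which is the behaviour that makes TF-IDF useful for building context-specific word lists.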
Visualising results of dictionary searches in Delpher
Use OR-query to search KB’s newspaper corpus
Visualise results on the basis of Solr’s relevancy-score (min. no. of words)
(arbeid* OR bedrij* OR beheer OR controle* OR factor* OR functie* OR kost* OR leiding* OR loon* OR maatregel* OR management OR methode* OR model* OR norm* OR organisatie* OR plannen OR prijs OR productie OR rationeel OR rendement OR reorganisatie OR statistiek OR taylor OR tijd OR werkbesparing OR werkverdeeling)
kbresearch.nl/dictionary
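A query of the form shown on this slide can be assembled from a word list with a few lines of code. This is a minimal sketch; `build_or_query` is a hypothetical helper, not part of Delpher or of the project's tool, and it only builds the query string without sending it anywhere.

```python
def build_or_query(terms):
    """Join dictionary terms into one parenthesised OR-query string."""
    # Drop empty entries and surrounding whitespace before joining.
    cleaned = [t.strip() for t in terms if t.strip()]
    if not cleaned:
        raise ValueError("at least one term is required")
    return "(" + " OR ".join(cleaned) + ")"


# Example with the first three terms of the slide's dictionary.
query = build_or_query(["arbeid*", "bedrij*", "beheer"])
```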
Challenges
Running an OR-query of 25+ (or, preferably, more) words on a dataset of 90,000,000+ documents
Accounting for particularities of the corpus:
* number of newspaper titles per year
* changes in newspaper titles over the years
* changes in article length over the years
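One common way to account for a corpus that grows and shrinks over time is to report hits relative to the total number of articles per year instead of as raw counts. The sketch below assumes that per-year totals are available; `relative_frequency` and the numbers in the example are invented for illustration and are not results from the project.

```python
def relative_frequency(hits_per_year, articles_per_year):
    """Return hits divided by total articles, per year.

    Years without a known, positive article total are skipped.
    """
    result = {}
    for year, hits in hits_per_year.items():
        total = articles_per_year.get(year, 0)
        if total > 0:
            result[year] = hits / total
    return result


# Made-up numbers: equal raw hits, but a four times larger corpus in 1930.
shares = relative_frequency({1920: 50, 1930: 50}, {1920: 1000, 1930: 4000})
```

With these invented figures the raw counts are identical, while the relative share in 1930 is a quarter of that in 1920. Changes in article length would need a further correction, for example counting per word instead of per article.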
Getting an idea of the exact combination of words in the visualised results
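To see which combination of dictionary words a single article actually contains, the terms can be checked against the article text after retrieval. This is a rough sketch under the assumption that a trailing `*` means a simple prefix match; the search engine's own matching may differ, and `matched_terms` is a hypothetical helper.

```python
import re


def matched_terms(text, terms):
    """Return the dictionary terms that occur in the text, in input order.

    A trailing '*' is treated as a prefix wildcard (an assumption).
    """
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    found = []
    for term in terms:
        if term.endswith("*"):
            prefix = term[:-1].lower()
            hit = any(tok.startswith(prefix) for tok in tokens)
        else:
            hit = term.lower() in tokens
        if hit:
            found.append(term)
    return found
```

The length of the returned list gives the number of distinct dictionary words in an article, which can be compared against a minimum number of words.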
Thank you!
https://github.com/jlonij/keyword_generator
http://blog.kbresearch.nl/
http://www.pimhuijnen.com