From keyword searching to discourse mining

Post on 12-Apr-2017

Pim Huijnen, Juliette Lonij

DH2016, Kraków 15 July 2016

From: The oasis, 13 April 1912, p.9. Chronicling America: Historic American Newspapers. Lib. of Congress.

Tangherlini, T. R. and Leonard, P. (2013). Trawling in the Sea of the Great Unread: Sub-corpus topic modeling and Humanities research, Poetics, 41: 725-749.

Van den Hoven, M., Van den Bosch, A. and Zervanou, K. (2010). Beyond Reported History: Strikes That Never Happened. Proceedings of the First International AMICUS Workshop on Automated Motif Discovery in Cultural Heritage and Scientific Communication Texts, Vienna: 20-28.

Wiedemann, G. and Niekler, A. (2014). Document Retrieval for Large Scale Content Analysis using Contextualized Dictionaries. Terminology and Knowledge Engineering, Berlin, June 2014: https://hal.archives-ouvertes.fr/hal-01005879.

Using extensive and context-specific word lists (‘dictionaries’) to replace the contingency of single keywords

Developing a script to extract dictionaries from literature based on topic modeling

Experimenting with tools to visualise results of dictionary searching in kranten.delpher.nl

Goals of the researcher-in-residence project

Flexibility (evaluation based on human expertise)

Transparency (avoiding black-boxing)

Practicality (available for the wider public)

KB researcher-in-residence project

Script to extract dictionaries

[Diagram; surviving labels: Topic modeling, TF-IDF]
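As a rough illustration of the TF-IDF route named on this slide, the sketch below ranks the words of one target text against a small background collection and keeps the top-scoring ones as a candidate dictionary. It is not the keyword_generator script itself: the function names `tokenize` and `tfidf_keywords`, the simple tokeniser, the smoothed IDF formula and the sample texts are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def tfidf_keywords(target, background, top_n=10):
    """Rank the words of `target` by TF-IDF against `background` documents."""
    docs = [tokenize(target)] + [tokenize(doc) for doc in background]
    n_docs = len(docs)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    tf = Counter(docs[0])
    scores = {}
    for word, count in tf.items():
        # Smoothed IDF (an assumption): words present in every document score lowest.
        idf = math.log((1 + n_docs) / (1 + df[word])) + 1
        scores[word] = (count / len(docs[0])) * idf

    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


# Invented sample texts, for illustration only.
literature = "organisatie en rendement: rationeel management van de organisatie"
background = ["het weer van vandaag en morgen", "de uitslag van de wedstrijd"]
print(tfidf_keywords(literature, background, top_n=4))
```

Words that also occur in the background texts (here function words such as "van") are pushed down the ranking, which is the property that makes TF-IDF usable for building context-specific word lists. A human expert would still review the resulting list, in line with the flexibility goal.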

Visualising results of dictionary searches in Delpher

Use an OR query to search the KB’s newspaper corpus

Visualise results on the basis of Solr’s relevancy score (min. no. of words)

(arbeid* OR bedrij* OR beheer OR controle* OR factor* OR functie* OR kost* OR leiding* OR loon* OR maatregel* OR management OR methode* OR model* OR norm* OR organisatie* OR plannen OR prijs OR productie OR rationeel OR rendement OR reorganisatie OR statistiek OR taylor OR tijd OR werkbesparing OR werkverdeeling)

kbresearch.nl/dictionary
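The OR query on this slide can be assembled mechanically from a word list. The sketch below does that and also approximates the "min. no. of words" idea locally by counting how many distinct dictionary terms occur in a text. It does not call Solr and does not reproduce Solr's relevancy score; the function names, the prefix handling of the trailing `*` and the sample sentence are assumptions made for the example.

```python
import re


def build_or_query(terms):
    """Join dictionary terms into a single parenthesised OR query."""
    return "(" + " OR ".join(terms) + ")"


def matched_terms(text, terms):
    """Return the dictionary terms found in `text`.

    A trailing `*` is treated as a prefix wildcard, as in the slide's query.
    """
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    found = []
    for term in terms:
        if term.endswith("*"):
            prefix = term[:-1].lower()
            hit = any(token.startswith(prefix) for token in tokens)
        else:
            hit = term.lower() in tokens
        if hit:
            found.append(term)
    return found


def passes_threshold(text, terms, min_words=3):
    """Keep a text only if it contains at least `min_words` distinct terms."""
    return len(matched_terms(text, terms)) >= min_words


# A shortened version of the slide's dictionary and an invented sentence.
terms = ["arbeid*", "bedrij*", "kost*", "management", "taylor"]
article = "Het bedrijf verlaagt de kosten door beter management."
print(build_or_query(terms))
print(matched_terms(article, terms))
print(passes_threshold(article, terms, min_words=3))
```

Listing the matched terms per text, rather than only a score, also shows which combination of dictionary words produced a hit.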

Challenges

Running an OR query of 25+ (or, preferably, more) words on a dataset of 90,000,000+ documents

Accounting for particularities of the corpus:
* number of newspaper titles per year
* changes in newspaper titles over the years
* changes in article length over the years

Getting an idea of the exact combination of words in the visualised results
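One common way to account for a corpus that grows or shrinks over the years is to report relative rather than absolute frequencies. The sketch below divides the number of matching articles per year by the total number of articles in that year. The slides do not say whether the project normalised its results this way; the function name `relative_frequency` and all figures are invented for illustration.

```python
def relative_frequency(hits_per_year, totals_per_year):
    """Return matching articles per 1,000 articles for each year."""
    relative = {}
    for year, hits in sorted(hits_per_year.items()):
        total = totals_per_year.get(year, 0)
        if total == 0:
            continue  # skip years for which no corpus size is known
        relative[year] = 1000 * hits / total
    return relative


# Invented numbers: absolute hits rise, but the corpus grows faster.
hits = {1910: 120, 1920: 300, 1930: 450}
totals = {1910: 40_000, 1920: 150_000, 1930: 150_000}
print(relative_frequency(hits, totals))
```

In this made-up example the absolute number of hits more than doubles between 1910 and 1920, while the share of matching articles actually falls. Changes in article length would need a similar correction, for instance by normalising per word instead of per article.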

Thank you!

https://github.com/jlonij/keyword_generator

http://blog.kbresearch.nl/

http://www.pimhuijnen.com
