Search & Data Mining SKILLS SEMINAR Master of European History, University of Luxembourg, 11 December 2014 Gerben Zaagsma Lichtenberg-Kolleg,

Introduction for skills seminar on Search and Data Mining, Master of European History, University of Luxembourg, 11 December 2014

Jul 19, 2015


Gerben Zaagsma

Search & Data Mining

SKILLS SEMINAR Master of European History, University of Luxembourg, 11 December 2014

Gerben Zaagsma Lichtenberg-Kolleg,


Overview

1. Introduction to search & data mining

2. Tools

3. Practical exercises


Code yourself… …or use existing tools


Why historians should be interested:

Old                  New
Analogue resources   Digital resources
Small data           Big data
Close reading        Distant reading

CHANGE, SCALE, TECHNOLOGY


the Big Data revolution?

Big data and claims about a paradigm change in the humanities


culturomics and Google ngrams


the Big Data revolution?

• Big data and claims about a paradigm change in the humanities
• Data-driven history
• Patterns and structures: a new essentialism?
• Based upon changes of scale & method: humanities supposedly becoming more ‘scientific’ > results can be checked and replicated, but can they? Interpretation.
• Politics: funding & valorisation


“One of the problems confronting data enthusiasts in the humanities is that we feel a need to convince our more old-fashioned colleagues about what can be done. But our role as advocates of data shouldn't mean that we lose our critical sense as scholars.

[...] there is a risk that we look more carefully at the technical components of the datasets than the historical context of the information that they represent.”

Andrew Prescott, ‘The Deceptions of Data’, Digital Riffs (13 January 2013).


Frédéric Clavert, ‘Lecture des sources historiennes à l’ère numérique’ (14 November 2012)

Integrate approaches & methods / hybridity


1. SEARCH


Google / Bing / Yahoo

there is much more ...


searching the Internet in general:

Google

there is much more than Google

filter bubble? take a look at: http://dontbubble.us


http://www.langreiter.com/exec/yahoo-vs-google.html


http://yometa.com


filter bubble?

http://www.thefilterbubble.com


Web search round-up

differences between search engines

filter bubble

deep web versus visible web


Searching digital libraries & archives…


composition of resources, selection…


example of Compact Memory: a great resource on German-Jewish history


but be aware of selection: focus on elites and organisations that highlight German Jewry’s process of emancipation:

• classical vision in historiography on German Jewry?
• reinforcement of existing master narratives?

The collection comprises the 110 most important Jewish newspapers and journals of the German-speaking world from the years 1806-1938. The periodicals represent the entire religious, political, social, literary and scholarly range of the Jewish community.


mind the context…


Processing and searching data on your own computer…
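A minimal sketch of what searching text on your own computer can look like, assuming the sources are already available as plain text. The function name `keyword_in_context` and the sample sentences are invented for illustration; real projects would typically read files from disk and may need more careful tokenisation.

```python
import re


def keyword_in_context(text, keyword, width=30):
    """Return every whole-word, case-insensitive hit for `keyword`,
    together with up to `width` characters of context on each side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Flatten line breaks so each hit prints on a single line.
        hits.append(text[start:end].replace("\n", " "))
    return hits


# Invented sample text, standing in for a locally stored source.
sample_text = (
    "The archive holds letters. Many letters were lost.\n"
    "Letters from 1900 survive."
)

hits = keyword_in_context(sample_text, "letters", width=10)
for hit in hits:
    print(hit)
```

Showing each hit with its surrounding words keeps some context visible, which a bare hit count would hide.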


1. DATA MINING


data?

data = computer-processable information


Example of structured data


Many digital libraries/archives: un-/semi-structured data


Digital editions: bridging the gap with XML
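To illustrate how XML markup can turn running text into something closer to structured data, here is a small sketch using Python's standard library. The markup is invented for this example: the element names are only loosely modelled on TEI conventions, and the person, place and date are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, TEI-like markup invented for this example.
sample = """
<letter>
  <p>On <date when="1901-05-02">2 May 1901</date>,
     <persName>Anna Schmidt</persName> arrived in
     <placeName>Luxembourg</placeName>.</p>
</letter>
"""

root = ET.fromstring(sample)

# Unstructured reading: strip the markup and keep only the words.
plain_text = " ".join("".join(root.itertext()).split())

# Structured reading: query the same text by element name.
people = [el.text for el in root.iter("persName")]
places = [el.text for el in root.iter("placeName")]
dates = [el.get("when") for el in root.iter("date")]

print(plain_text)
print(people, places, dates)
```

The same sentence can be read as prose or queried for names, places and normalised dates, which is one sense in which a digital edition sits between unstructured and structured data.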


• Google / Bing / Yahoo

• there is much more ...

• results differ per search engine

• and there is a filter bubble

• --> in short: knowing what you are looking for and having a search strategy are crucial

http://eculture.cs.vu.nl/europeana/session/search

Semantic web and linking data


Some definitions of data mining:


At its simplest, data mining is the process of extracting new knowledge (usually in terms of previously unknown patterns) from sets of data already in existence.

Jonathan Hagood


Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Wikipedia


Examples of projects and techniques


an n-gram is a contiguous sequence of n items from a given sequence of text or speech
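The definition can be made concrete with a short sketch. The helper name `ngrams` and the sample sentence are assumptions made for illustration; this is not how any particular tool (such as the Google Ngram Viewer) is implemented.

```python
def ngrams(items, n):
    """Return all contiguous runs of n items from a sequence, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Illustrative sentence, split into words on whitespace.
words = "the history of the book".split()

bigrams = ngrams(words, 2)   # runs of two neighbouring words
trigrams = ngrams(words, 3)  # runs of three neighbouring words

print(bigrams)
print(trigrams)
```

The "items" can be words, as here, or characters or syllables; passing a string instead of a word list yields character n-grams.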


Topic Modeling Martha Ballard’s Diary


data?

data & data mining ≠ neutral


“What is too often forgotten, though, is that our digital helpers are full of ‘theory’ and ‘judgement’ already. As with any methodology, they rely on sets of assumptions, models, and strategies. Theory is already at work on the most basic level when it comes to defining units of analysis, algorithms, and visualisation procedures.”

Bernhard Rieder and Theo Röhle, ‘Digital Methods: Five Challenges’ in: David M Berry ed., Understanding Digital Humanities (Houndmills: Palgrave Macmillan, 2012) 67-85, 70.


2. TOOLS


3. Practical exercises


Overview of exercises

http://goo.gl/72fCn7


Tools & workflows

• Voyant Tools
• Voyant Tools Documentation
• Programming Historian
• DIRT: Digital Research Tools
• Turkel, William J., Kevin Kee, and Spencer Roberts, ‘A Method for Navigating the Infinite Archive’ in: Toni Weller ed., History in the Digital Age (London; New York: Routledge, 2013).
• William J. Turkel: How To


Further reading

Special issue on Digital History, BMGN - Low Countries Historical Review, 128/4 (2013).
Haber, Peter, Digital Past: Geschichtswissenschaft im digitalen Zeitalter (München: Oldenbourg Verlag, 2011).
Boonstra, Onno, Leen Breure, and Peter Doorn, Past, Present and Future of Historical Information Science (Amsterdam: NIWI-KNAW, 2004).
Ciravegna, Fabio, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie McLaughlin, and Ravish Bhagdev, ‘Finding Needles in Haystacks: Data-Mining in Distributed Historical Datasets’ in: Mark Greengrass and Lorna M Hughes eds., The Virtual Representation of the Past (Ashgate, 2008).
Cohen, D, F Gibbs, T Hitchcock, G Rockwell, J Sander, R Shoemaker, S Sinclair, S Takats, W J Turkel, and C Briquet, ‘Data Mining with Criminal Intent’, final white paper (2011).
Hagood, Jonathan, ‘A Brief Introduction to Data Mining Projects in the Humanities’, Bulletin of the American Society for Information Science and Technology 38/4 (2012).
Hitchcock, Tim, ‘Big Data for Dead People: Digital Readings and the Conundrums of Positivism’ (9 December 2013).
Leonard, Peter, ‘Mining Large Datasets for the Humanities’, IFLA WLIC 2014.


Dr. Gerben Zaagsma

http://gerbenzaagsma.org
de.linkedin.com/in/gerbenzaagsma/
https://twitter.com/gerbenzaagsma
https://uni-goettingen.academia.edu/GerbenZaagsma
https://www.researchgate.net/profile/Gerben_Zaagsma
https://www.slideshare.net/gerbenzaagsma


Image credits

The Field Museum Library, Hall 37 Geology overview. URL: https://www.flickr.com/photos/field_museum_library/3333920156/in/set-72157614881700424.
The U.S. National Archives, Photograph of Card Catalog in Central Search Room, 1942. URL: http://www.flickr.com/photos/usnationalarchives/3873932255/.
Witch computer 1951: Wolverhampton and Staffordshire College of Technology in 1961, The National Computing Museum and Computer Conservation Society/UKAEA/Wolverhampton Express and Star, via: http://www.wired.com/2009/09/britan-oldest-computer/.
Code: https://www.flickr.com/photos/lord_james/4696338852/.
Tools: Flickr Commons
The droids we're googling for: https://www.flickr.com/photos/st3f4n/3951143570/.
Jaws (Steven Spielberg) original movie poster: https://en.wikipedia.org/wiki/File:JAWS_Movie_poster.jpg
Structured/unstructured data: http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
Macbook Data Mining: http://www.flickr.com/photos/17208993@N00/442531562/.
Topic Modeling Martha Ballard’s Diary: http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/.
Boolean operators: http://uksourcers.co.uk/2012/capital-letters-the-key-to-boolean-success/
Miami University students in laboratory classroom 1908: https://www.flickr.com/photos/muohio_digital_collections/3199691495/