Introduction for skills seminar on Search and Data Mining, Master of European History, University of Luxembourg, 11 December 2014

Post on 19-Jul-2015

Search & Data Mining

SKILLS SEMINAR Master of European History, University of Luxembourg, 11 December 2014

Gerben Zaagsma Lichtenberg-Kolleg,

Overview

1. Introduction search & data mining

2. Tools

3. Practical exercises

Code yourself… …or use existing tools

Why historians should be interested:

Old > New

Analogue resources > Digital resources

Small data > Big data

Close reading > Distant reading

Key words: change, scale, technology

the Big Data revolution?

• Big data and claims about a paradigm change in the humanities

• culturomics and Google ngrams

• Data driven history

• Patterns and structures: a new essentialism?

• Based upon changes of scale & method: humanities supposedly becoming more ‘scientific’ > results can be checked and replicated, but can they? Interpretation.

• Politics: funding & valorisation

“One of the problems confronting data enthusiasts in the humanities is that we feel a need to convince our more old-fashioned colleagues about what can be done. But our role as advocates of data shouldn't mean that we lose our critical sense as scholars. [...] there is a risk that we look more carefully at the technical components of the datasets than the historical context of the information that they represent.”

Andrew Prescott, ‘The Deceptions of Data’, Digital Riffs (13 January 2013).

Frédéric Clavert, ‘Lecture des sources historiennes à l’ère numérique’ (14 November 2012)

Integrate approaches & methods / hybridity

1. SEARCH

Google/ Bing/ Yahoo

there is much more ...

Searching the Internet in general:

Google

there is much more than Google

filter bubble? take a look at: http://dontbubble.us

http://www.langreiter.com/exec/yahoo-vs-google.html

http://yometa.com

http://www.thefilterbubble.com

Web search round-up

differences between search engines

filter bubble

deep web versus visible web
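One rough way to make the differences between search engines visible is to run the same query in two engines and measure how much the result lists overlap. The sketch below does this with a Jaccard overlap; the result lists and URLs are hypothetical placeholders, not real search output, and in practice they would be copied by hand from the result pages.

```python
def jaccard_overlap(results_a, results_b):
    """Share of distinct URLs that two result lists have in common (0.0 to 1.0)."""
    set_a, set_b = set(results_a), set(results_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical top results for the same query in two engines (illustration only)
engine_one = ["example.org/a", "example.org/b", "example.org/c", "example.org/d"]
engine_two = ["example.org/b", "example.org/e", "example.org/a", "example.org/f"]

overlap = jaccard_overlap(engine_one, engine_two)
only_one = sorted(set(engine_one) - set(engine_two))

print(f"Overlap: {overlap:.2f}")
print("Only in engine one:", only_one)
```

A low overlap does not say which engine is better; it only shows that the choice of engine shapes what is found.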

Searching digital libraries & archives…

composition of resources, selection…

example of Compactmemory: a great resource on German-Jewish history

but be aware of selection: focus on elites and organisations that highlight German Jewry’s process of emancipation:

• classical vision in historiography on German Jewry?

• reinforcement of existing master narratives?

The collection comprises the 110 most important Jewish newspapers and periodicals of the German-speaking world from the years 1806-1938. The periodicals represent the entire religious, political, social, literary or scholarly spectrum of the Jewish community.

mind the context…

Processing and searching data on your own computer…
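As a minimal sketch of searching texts on your own computer, the example below lists every occurrence of a keyword together with some surrounding context (a simple keyword-in-context view). The document names and texts are invented for illustration; in practice the texts would be read from files on disk.

```python
import re


def keyword_in_context(text, keyword, width=30):
    """Return each occurrence of keyword with `width` characters of context on both sides."""
    hits = []
    # Whole-word, case-insensitive match
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        hits.append(text[start:end].replace("\n", " "))
    return hits


# Hypothetical documents held in memory (stand-ins for files on disk)
sample_documents = {
    "letter_1.txt": "The committee met in Berlin. The Berlin press reported on the meeting.",
    "letter_2.txt": "No meeting was held in Hamburg that year.",
}

results = {
    name: keyword_in_context(text, "berlin", width=15)
    for name, text in sample_documents.items()
}

for name, hits in results.items():
    print(name, hits)
```

Seeing the hits in context is a reminder that a match is only a starting point: each occurrence still has to be read and interpreted.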

1. DATA MINING

data?

data = computer-processable information

Example of structured data

Many digital libraries/archives: un-/semi-structured data

Digital editions: bridging the gap with XML
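To illustrate how XML markup can bridge the gap, the sketch below takes one sentence as plain, unstructured text and the same sentence with markup, then reads the marked-up version into a structured record. The sentence and the element names (`letter`, `date`, `person`, `place`) are made up for this example and do not follow any particular encoding standard.

```python
import xml.etree.ElementTree as ET

# Unstructured: a computer sees only a string of characters
unstructured = "On 3 May 1912 Anna Meyer wrote from Berlin to the editor."

# Semi-structured: the same sentence with hypothetical markup added
marked_up = (
    "<letter>"
    "On <date when='1912-05-03'>3 May 1912</date> "
    "<person>Anna Meyer</person> wrote from "
    "<place>Berlin</place> to the editor."
    "</letter>"
)

root = ET.fromstring(marked_up)

# Structured: fields that can be sorted, counted and queried
record = {
    "date": root.find("date").get("when"),
    "person": root.find("person").text,
    "place": root.find("place").text,
}

# The running text is still recoverable from the markup
plain_text = "".join(root.itertext())

print(record)
print(plain_text)
```

The markup itself is an act of interpretation: someone decided that "Anna Meyer" is a person and "Berlin" a place before the computer could treat them as such.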

• Google/ Bing/ Yahoo

• there is much more ...

• results differ per search engine

• and there is a filter bubble

• --> in short: knowing what you are looking for and having a search strategy are crucial

Semantic web and linking data

http://eculture.cs.vu.nl/europeana/session/search

Some definitions of data mining:

At its simplest, data mining is the process of extracting new knowledge (usually in terms of previously unknown patterns) from sets of data already in existence.

Jonathan Hagood

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Wikipedia

Examples of projects and techniques

an n-gram is a contiguous sequence of n items from a given sequence of text or speech
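The definition above can be sketched in a few lines: slide a window of n items over a sequence and collect what falls inside it. The example sentence and the function name `ngrams` are chosen for illustration only; splitting on whitespace is a deliberately naive way of tokenising.

```python
from collections import Counter


def ngrams(items, n):
    """Return all contiguous sequences of n items, in order of appearance."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


sentence = "to be or not to be"
tokens = sentence.split()  # naive tokenisation on whitespace

bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)

print(bigrams)
print(bigram_counts.most_common(1))
```

Counting such n-grams per year across a large corpus is the basic idea behind n-gram viewers; choices such as tokenisation and the value of n already shape the result.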

Topic Modeling Martha Ballard’s Diary

data?

data & data mining ≠ neutral

“What is too often forgotten, though, is that our digital helpers are full of ‘theory’ and ‘judgement’ already. As with any methodology, they rely on sets of assumptions, models, and strategies. Theory is already at work on the most basic level when it comes to defining units of analysis, algorithms, and visualisation procedures.”

Bernhard Rieder and Theo Röhle, ‘Digital Methods: Five Challenges’ in: David M Berry ed., Understanding Digital Humanities (Houndmills: Palgrave Macmillan, 2012) 67-85, 70.

2. TOOLS

3. Practical exercises

Overview of exercises

http://goo.gl/72fCn7

Tools & workflows

Voyant Tools

Voyant Tools Documentation

Programming Historian

DIRT: Digital Research Tools

Turkel, William J., Kevin Kee, and Spencer Roberts, ‘A Method for Navigating the Infinite Archive’ in: Toni Weller ed., History in the Digital Age (London; New York: Routledge, 2013).

William J. Turkel: How To

Further reading

Special issue on Digital History, BMGN - Low Countries Historical Review, 128/4 (2013).

Haber, Peter, Digital Past: Geschichtswissenschaft im digitalen Zeitalter (München: Oldenbourg Verlag, 2011).

Boonstra, Onno, Leen Breure, and Peter Doorn, Past, Present and Future of Historical Information Science (Amsterdam: NIWI-KNAW, 2004).

Ciravegna, Fabio, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie McLaughlin, and Ravish Bhagdev, ‘Finding Needles in Haystacks: Data-Mining in Distributed Historical Datasets’ in: Mark Greengrass and Lorna M Hughes eds., The Virtual Representation of the Past (Ashgate, 2008).

Cohen, D, F Gibbs, T Hitchcock, G Rockwell, J Sander, R Shoemaker, S Sinclair, S Takats, W J Turkel, and C Briquet, ‘Data Mining with Criminal Intent’, Final white paper (2011).

Hagood, Jonathan, ‘A Brief Introduction to Data Mining Projects in the Humanities’, Bulletin of the American Society for Information Science and Technology 38/4 (2012).

Hitchcock, Tim, ‘Big Data for Dead People: Digital Readings and the Conundrums of Positivism’ (9 December 2013).

Leonard, Peter, ‘Mining Large Datasets for the Humanities’, IFLA WLIC 2014.

Dr. Gerben Zaagsma

http://gerbenzaagsma.org

de.linkedin.com/in/gerbenzaagsma/

https://twitter.com/gerbenzaagsma

https://uni-goettingen.academia.edu/GerbenZaagsma

https://www.researchgate.net/profile/Gerben_Zaagsma

https://www.slideshare.net/gerbenzaagsma

Image credits

The Field Museum Library, Hall 37 Geology overview. URL: https://www.flickr.com/photos/field_museum_library/3333920156/in/set-72157614881700424.

The U.S. National Archives, Photograph of Card Catalog in Central Search Room, 1942. URL: http://www.flickr.com/photos/usnationalarchives/3873932255/.

Witch computer 1951: Wolverhampton and Staffordshire College of Technology in 1961, The National Computing Museum and Computer Conservation Society/UKAEA/Wolverhampton Express and Star, via: http://www.wired.com/2009/09/britan-oldest-computer/.

Code: https://www.flickr.com/photos/lord_james/4696338852/.

Tools: Flickr Commons

The droids we're googling for: https://www.flickr.com/photos/st3f4n/3951143570/.

Jaws (Steven Spielberg) original movie poster: https://en.wikipedia.org/wiki/File:JAWS_Movie_poster.jpg

Structured/unstructured data: http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm

Macbook Data Mining: http://www.flickr.com/photos/17208993@N00/442531562/.

Topic Modeling Martha Ballard’s Diary: http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/.

Boolean operators: http://uksourcers.co.uk/2012/capital-letters-the-key-to-boolean-success/

Miami University students in laboratory classroom 1908: https://www.flickr.com/photos/muohio_digital_collections/3199691495/
