Digital Text and Data Processing Week 7

Digital Text and Data Processing Week 7. □ POS: total counts: normalise by token count □ Unicode support □ Synchronic and diachronic variation (dialects.

Dec 25, 2015


Mavis Allison
Transcript
Page 1

Digital Text and Data Processing

Week 7

Page 2: Challenges

□ POS: total counts: normalise by token count

□ Unicode support

□ Synchronic and diachronic variation (dialects and historical changes)

□ Not knowing beforehand what is possible / relevant
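The first challenge, normalising POS counts by token count, can be illustrated with a minimal sketch. It assumes the text has already been tagged and is available as (token, tag) pairs; the toy texts, the tag names and the function name `relative_pos_frequencies` are hypothetical and not taken from the slides.

```python
from collections import Counter

def relative_pos_frequencies(tagged_tokens):
    """Divide each POS count by the total number of tokens in the text."""
    total = len(tagged_tokens)
    if total == 0:
        return {}
    counts = Counter(tag for _, tag in tagged_tokens)
    return {tag: n / total for tag, n in counts.items()}

# Hypothetical, hand-tagged toy texts of different lengths.
short_text = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
long_text = short_text * 2 + [
    ("a", "DET"), ("dog", "NOUN"), ("barks", "VERB"), ("loudly", "ADV"),
]

rel_short = relative_pos_frequencies(short_text)
rel_long = relative_pos_frequencies(long_text)

# Raw counts differ (1 noun vs 3 nouns), but the shares are comparable.
print(rel_short["NOUN"], rel_long["NOUN"])
```

Comparing shares rather than raw totals keeps a long text from appearing richer in a given part of speech simply because it contains more tokens.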

Page 3

□ Digital humanities methodology often demands experimentation

□ The method is mostly inductive (cf. the deductive approach advocated by Stanley Fish)

□ When experiments are not motivated, there is a risk that the research simply exposes "a correlation between a formal feature the computer program just happened to uncover and a significance that has simply been declared, not argued for".

□ Also see Chris Anderson, The End of Theory

Page 4

□ The DH methodology is partly inductive and partly deductive

□ Computational analyses often lead to unexpected results

□ Techniques can help scholars to generate hypotheses

Page 5: Phases

□ Data acquisition

□ Clean-up and enrichment (removal of stopwords, POS tagging, lemmatisation)

□ Quantification

□ Data analysis
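The clean-up and quantification phases can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stopword list, the lemma lookup table and the sample sentence are hypothetical, and a real project would normally rely on a proper tagger and lemmatiser rather than a hand-made dictionary.

```python
import re
from collections import Counter

# Hypothetical resources, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in"}
LEMMAS = {"books": "book", "reads": "read", "reading": "read"}

def tokenise(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def clean(tokens):
    """Drop stopwords, then map each remaining token to its lemma if known."""
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]

def quantify(tokens):
    """Count how often each (lemmatised) token occurs."""
    return Counter(tokens)

sample = "The reader reads the books and the reading of books"
frequencies = quantify(clean(tokenise(sample)))

for word, count in frequencies.most_common():
    print(word, count)
```

Keeping each phase in its own function makes it easy to swap in a different stopword list or lemmatiser before the counts are passed on to the analysis phase.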

Page 6: Data acquisition

□ Page images and machine-readable text (removal of typography and of paratext)

□ Low quality of OCR; see, e.g., Laura Mandell, How to Read a Literary Visualisation

□ Motivation of the choice of a specific edition

Page 7
Page 8: TM on recent scientific articles

□ Text2Genome

□ OSCAR

□ NeuroElectro

□ Peter Murray-Rust’s work on Chemical Compounds

Page 9: Licences

□ The right to read does not imply the right to mine

□ Study commissioned by the EC, led by Prof. Ian Hargreaves

Page 10: Google Books Settlement

Article 7.2 of Settlement:

□ Creation of a “Research Corpus”;

□ Solely for “non-consumptive” reading, or research “in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book”

Page 11
Page 12
Page 13: Database and Narrative

□ Lev Manovich, The Language of New Media

□ Textual narrative: linearity and reliance on typography

□ Database: random access, non-linear, no form

Page 14
Page 15: The Semantic Web

□ Envisaged by Tim Berners-Lee as “a web of data that can be processed directly and indirectly by machines”

□ RDF triples

□ Example:

Subject: “Book-URI”
Predicate: “hasISBN”
Object: “978-0-252-07829-0”
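The subject-predicate-object structure of the triple above can be sketched in plain Python. This is not an RDF library, only a minimal in-memory model: the `Triple` type, the `match` helper and the second triple (`hasTitle`, "Example Title") are hypothetical additions for illustration. Leaving a position as `None` makes it a wildcard, loosely in the way a variable works in a SPARQL triple pattern.

```python
from typing import List, NamedTuple, Optional

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

triples = [
    # Triple taken from the slide.
    Triple("Book-URI", "hasISBN", "978-0-252-07829-0"),
    # Hypothetical second statement about the same subject.
    Triple("Book-URI", "hasTitle", "Example Title"),
]

def match(store: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return all triples that agree with every position that is not None."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]

# "Which ISBNs are recorded?" Subject and object are left open.
isbns = [t.obj for t in match(triples, predicate="hasISBN")]
print(isbns)
```

Because every statement has the same three-part shape, statements from different sources about the same subject can be placed in one store and queried together.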

Page 16: DBpedia

Page 17
Page 18
Page 19: Nano-Publications

Page 20: Semantic Publishing

Page 21: STCN SPARQL Endpoint