Page 1: Towards greater transparency in digital literary analysis

Towards greater transparency in digital literary analysis

John Lavagnino, King's College London

8 May 2014

http://www.slideshare.net/jlavagnino/tgt

Page 2: Towards greater transparency in digital literary analysis

The plan

1 General reasons for doing digital analysis, and some present-day trends

2 A recent study that went badly wrong

3 Open and closed techniques

4 Open and closed data

Page 3: Towards greater transparency in digital literary analysis

Things not in the plan

Lots of things that aren't analysis are valuable:

1 publication and rediscovery (as by the Women Writers Project, Northeastern University)

2 discussion, argument, interaction

3 studies of digital culture

4 …

Page 4: Towards greater transparency in digital literary analysis

Why people do this

Above all, because you can: a byproduct of the web and the widespread use of computers is a wealth of textual data. Without books in transcribed form much less would happen.

Yes, you can always transcribe some new stuff yourself, but then you immediately need time and money before doing anything at all.

You can also work with small amounts of text, but it tends to get less notice.

Page 5: Towards greater transparency in digital literary analysis

What's harder to do

Texts not in English are less widely available in digital form and so get analyzed less.

Texts much later than the nineteenth century are in copyright.

Texts before the nineteenth century pose OCR problems and have more variable spelling.

It's not an accident that there are so many digital studies of nineteenth-century novels.

Page 6: Towards greater transparency in digital literary analysis

Why it's worth doing

When there's too much to read

When a different kind of attention is valuable (more systematic? or just very different from normal reading?)

When it can locate or arrange material as the basis for more traditional approaches

Page 7: Towards greater transparency in digital literary analysis

A recent study that went badly wrong

Page 8: Towards greater transparency in digital literary analysis

The study

Matjaž Perc, “Evolution of the most common English words and phrases over the centuries”, Journal of the Royal Society Interface, 7 December 2012: see:

http://goo.gl/7S0RT

Based on Google ngram data: see www.culturomics.org
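As a concrete illustration (not Perc's own code), here is a minimal sketch of building per-year top-n-gram tables from a file in Google's published ngram format, whose tab-separated fields begin with the n-gram, the year, and the match count; the file name is hypothetical, and the real data ships as many large shards.

```python
from collections import defaultdict
import heapq

def top_ngrams_per_year(path, k=100):
    """Collect the k most frequent n-grams for each year from a
    Google Books ngram file (tab-separated: ngram, year,
    match_count, ...), one ngram-year pair per line."""
    counts = defaultdict(dict)  # year -> {ngram: match_count}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
            counts[year][ngram] = match_count
    return {year: heapq.nlargest(k, grams.items(), key=lambda kv: kv[1])
            for year, grams in counts.items()}

# Hypothetical shard name:
# tops = top_ngrams_per_year("eng-all-3gram-shard.txt")
```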

Page 9: Towards greater transparency in digital literary analysis

A surprising claim about English

Perc, in his abstract: “We find that the most common words and phrases in any given year had a much shorter popularity lifespan in the sixteenth century than they had in the twentieth century.”
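"Popularity lifespan" can be operationalized in more than one way; the sketch below (an assumption, not necessarily Perc's exact measure) takes the longest consecutive run of years an n-gram spends in the yearly top list, using tables like those built above.

```python
def longest_runs(tops_by_year):
    """tops_by_year maps year -> iterable of that year's top
    n-grams (assumed to cover consecutive years); returns each
    n-gram's longest unbroken run of years in the top list."""
    runs, current = {}, {}
    for year in sorted(tops_by_year):
        top = set(tops_by_year[year])
        # n-grams absent last year restart at 1; others extend
        current = {g: current.get(g, 0) + 1 for g in top}
        for g, length in current.items():
            runs[g] = max(runs.get(g, 0), length)
    return runs
```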

Page 10: Towards greater transparency in digital literary analysis

Top 3-grams, 2007 and 2008

See: http://goo.gl/iUS3e

Page 11: Towards greater transparency in digital literary analysis

Top 3-grams, early 1520s

See: http://goo.gl/r4eyh

(Note that the 3-grams are case-sensitive.)

Page 12: Towards greater transparency in digital literary analysis

From 1541's top 3-grams

See: http://goo.gl/r4eyh

Birthdate of Sir Thomas Bodley: 2 March 1545

Page 13: Towards greater transparency in digital literary analysis

Top trigram frequencies, 1800-2000

Page 14: Towards greater transparency in digital literary analysis

Top trigram frequencies, 1520-1800

Page 15: Towards greater transparency in digital literary analysis

Evolution of popularity of the top 100 n-grams over the past five centuries.

Perc M, J. R. Soc. Interface, doi:10.1098/rsif.2012.0491

See: http://goo.gl/2URVT

©2012 by The Royal Society

Page 16: Towards greater transparency in digital literary analysis

Some alternative conclusions about this research

The world's best mass OCR is bad for books before 1800

You should read what the providers of your data say about it: Steven Levitt does

Interdisciplinary journals need to have reviewers from many fields

Page 17: Towards greater transparency in digital literary analysis

Real 1520 trigrams

Perc's data set contains no true 1520 imprints: his 1520 book is An Open Letter to the Christian Nobility of the German Nation, an early-twentieth-century translation of a book by Martin Luther published in German in 1520.

Page 18: Towards greater transparency in digital literary analysis

Another conclusion

Perc's publication of his data and an interface for exploring it is praiseworthy: this study is very transparent. It's not just that the Google data is readily available: Perc constructed his own tables of the top ngrams year by year and published them online.

Page 19: Towards greater transparency in digital literary analysis

Some very rough numbers for 1520

STC titles published in 1520: 114

In English: 47

(And figures for both 1519 and 1521 are considerably smaller, because 1520 includes many items dated c.1520.)

Page 20: Towards greater transparency in digital literary analysis

Limitations of knowledge

The kind of naïve statistical study Perc performed assumes an entirely reliable and consistent data set. The Google ngram data isn't like that; and while such a data set could be made far better, one of that kind for early-sixteenth-century English is not even possible.

Page 21: Towards greater transparency in digital literary analysis

Open and closed techniques

Page 22: Towards greater transparency in digital literary analysis

When is language unusual?

A man fires an arrow at a Neanderthal in William Golding's novel The Inheritors:

A stick rose upright and there was a lump of bone in the middle. Lok peered at the stick and the lump of bone and the small eyes in the bone things over the face. Suddenly Lok understood that the man was holding the stick out to him but neither he nor Lok could reach across the river. He would have laughed if it were not for the echo of the screaming in his head. The stick began to grow shorter at both ends. Then it shot out to full length again.

Page 23: Towards greater transparency in digital literary analysis

An obvious but useful method

David Hoover, “The End of the Irrelevant Text: Electronic Texts, Linguistics, and Literary Theory”, Digital Humanities Quarterly 1:2 (2007), used Google to find other instances of the oxymoronic phrase “grew shorter”.

When referring to physical objects (and not lectures, distances, patience, …) it's not about sticks, it's about fuses, candles, cigarettes… (in use), and articles of clothing, hair… (over time).
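Hoover searched the web with Google; against a local corpus, the same kind of lookup is a keyword-in-context concordance. A minimal sketch, with an illustrative pattern that also catches inflected forms (choosing which forms to search for is itself a judgment, as the next slide notes) and a hypothetical corpus file:

```python
import re

def kwic(text, pattern=r"\bgr[eo]w(?:s|ing)?\s+shorter\b", width=40):
    """Print each match of the pattern with `width` characters
    of context on either side."""
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        print(f"...{left}[{m.group(0)}]{right}...")

# kwic(open("corpus.txt", encoding="utf-8").read())
```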

Page 24: Towards greater transparency in digital literary analysis

Literary significance

Hoover: “Part of the power of ‘the stick began to grow shorter at both ends’ is in the shape of Lok's incomprehension. For Lok, the whole world is alive, so that a stick that changes length is perfectly comprehensible.”

Page 25: Towards greater transparency in digital literary analysis

Problems of technique

What forms do you look for? Hoover's investigation looked both at the words Golding used and at the concept of objects growing shorter.

Searches can give very different results with slight differences in query.

Page 26: Towards greater transparency in digital literary analysis

It really is true

Geoffrey Pullum, “The sparseness of linguistic data”, Language Log, 7 April 2014: “it really is true that the probability for most grammatical sequences of words actually having turned up on the web really is approximately zero, so grammaticality cannot possibly be reduced to probability of having actually occurred.”

Page 27: Towards greater transparency in digital literary analysis

Complex techniques: PCA

Larry L. Stewart, “Charles Brockden Brown: Quantitative Analysis and Literary Interpretation”, Literary and Linguistic Computing, June 2003: among other things, a study of Brown's novels Wieland and Carwin, and the distinctiveness of the narrating voices of Clara and Carwin.

Page 28: Towards greater transparency in digital literary analysis

Clara and Wieland as narrators

Page 29: Towards greater transparency in digital literary analysis

What is that graph based on?

PCA, or Principal Component Analysis, takes as input numerous textual features you choose, and tries to create “components” that capture as much of the variation in the texts as possible: reducing the dozens of dimensions needed to show all these things down to two that roll together a lot of what's going on (about half of it, in this case).

Page 30: Towards greater transparency in digital literary analysis

Principal components

This reduction is automatic, and is not really a statistical analysis, only a rearrangement of the data. But it does show us groupings of the chapters based on part of the actual data, with Clara's narration in Wieland having more exclamation points and dashes and fewer instances of “our”; combining these into one feature makes it easier to see.
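Not Stewart's actual pipeline, which used dozens of features, but a minimal sketch of the technique with scikit-learn; the three features and the per-1000-word normalization are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

FEATURES = ["exclamation points", "dashes", "'our'"]

def feature_vector(chapter):
    """Per-1000-word rates of the three illustrative features."""
    words = chapter.split()
    n = max(len(words), 1)
    return [chapter.count("!") * 1000 / n,
            chapter.count("--") * 1000 / n,
            sum(w.strip('.,;:!?"').lower() == "our" for w in words) * 1000 / n]

def reduce_chapters(chapters):
    """Standardize the feature matrix, then keep the two
    components that capture the most variation."""
    X = StandardScaler().fit_transform(
        np.array([feature_vector(c) for c in chapters]))
    pca = PCA(n_components=2)
    coords = pca.fit_transform(X)  # one (x, y) point per chapter
    print("variation captured:", pca.explained_variance_ratio_.sum())
    # The loadings show what each component rolls together:
    print("PC1 loadings:", dict(zip(FEATURES, pca.components_[0])))
    return coords
```

Printing the loadings is the nearest the numbers alone can take you back toward the text, which is the point of the next slides.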

Page 31: Towards greater transparency in digital literary analysis

What is that graph based on?

Page 32: Towards greater transparency in digital literary analysis

Can we get back to the text?

Yes, in that Stewart tells us what goes into the first principal component (though not the others).

No, in that he doesn't show any passages and analyze them in these terms.

And no, in that a component is a complex weighted combination of parts of features.

Page 33: Towards greater transparency in digital literary analysis

Graphs need analysis

It is still common to treat graphs and other visualizations as results, not as texts that themselves need interpretation. Yet they're only of interest if they support substantial discussion and analysis, and that ought to appear in the article. Stewart has a literary-critical discussion of the novels in light of this analysis: but why not a few pages first on the graph?

Page 34: Towards greater transparency in digital literary analysis

Graphs need interaction

You publish one or two or six graphs in an article, not two hundred, because they take up a lot of space. But if a graph's worth doing at all it's worth doing differently, and the best way to explore this kind of study is to try out variations yourself.

For all its flaws, this is one thing the Google ngrams resource got right.

Page 35: Towards greater transparency in digital literary analysis

Open and closed data

Page 36: Towards greater transparency in digital literary analysis

Big uncurated data

Ted Underwood, Michael L. Black, Loretta Auvil, and Boris Capitanu, “Mapping Mutable Genres in Structurally Complex Volumes” (2013), at http://arxiv.org/abs/1309.3323: the study analyzes “a collection of 469,200 volumes drawn from HathiTrust Digital Library”. That's an open data collection provided by libraries involved in Google Books.

Page 37: Towards greater transparency in digital literary analysis

How do you read 469,200 books?

You start by figuring out how to find the text in them, by skipping things like bookplates and tables of contents. (The bookplates are a reason why Google Books and Google ngrams studies of the word “library” run into problems.) Without doing that first you can't go on to study (as they do) the percentage of first-person novels over time.
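Underwood et al. do this with trained classifiers over page-level features; the naïve keyword heuristic below is only an illustration of why the step is needed at all, and the markers and page representation are assumptions.

```python
# NOT the method of Underwood et al., who classify volume parts
# with trained models; a crude keyword heuristic for contrast.
FRONT_MATTER_MARKERS = ("contents", "ex libris", "list of illustrations",
                        "bookplate", "preface")

def body_pages(pages):
    """Yield the pages of a volume (given as strings) that don't
    look like front matter, judging only by their opening text."""
    for page in pages:
        head = page.strip().lower()[:200]
        if any(marker in head for marker in FRONT_MATTER_MARKERS):
            continue
        yield page
```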

Page 38: Towards greater transparency in digital literary analysis

But it‟s not really transparent now

If you need to do that much to the books before you can analyze them, others either need to duplicate all of that preliminary work or get the results of your preliminary work.

Much work on big data elsewhere is based on data that is simpler in form than books are, or has been prepared for use first (at someone's expense).

Page 39: Towards greater transparency in digital literary analysis

Curated rather than raw texts

These exist in the humanities, but not necessarily where you want to work or in the numbers you desire. Another C19-novel study by Matthew Wilkens used texts fixed up at Indiana University, with fewer textual errors and clearly defined structure; but that meant he also had a lot fewer of them.

Page 40: Towards greater transparency in digital literary analysis

Specially prepared data

Once it was more common for digital-humanities work to involve creation of new data for analysis: not just basic texts, but also analysis or extraction of features by hand as a basis for analysis.

For example, Brad Pasanek and D. Sculley, “Mining millions of metaphors”, Literary and Linguistic Computing, September 2008.

Page 41: Towards greater transparency in digital literary analysis

Pasanek‟s collection

See http://metaphors.lib.virginia.edu/ for his Mind is a Metaphor collection, assembled to support a study of C18 thinking on the subject; a collection based in the first instance on doing lots of searches, extended over the course of many years by several hands.

Page 42: Towards greater transparency in digital literary analysis

A little on how it‟s done

Pasanek: “At present I still spend a fair amount of time conducting proximity searches for two character strings. I search one term from a set list ("mind," "heart," "soul," "thought," "idea," "imagination," "fancy," "reason," "passion," "head," "breast," "bosom," or "brain") against another word that I hope will prove metaphorical. For example, I search for "mind" within one hundred characters of "mint" and find the following couplet in William Cowper's poetry:

“The mind and conduct mutually imprint
And stamp their image in each other's mint.””
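A minimal sketch of that proximity search in Python: the hundred-character window and the term list follow Pasanek's description, while the function itself and the corpus file are illustrative.

```python
import re

TERMS = ("mind", "heart", "soul", "thought", "idea", "imagination",
         "fancy", "reason", "passion", "head", "breast", "bosom", "brain")

def proximity_hits(text, candidate, window=100):
    """Find each listed term within `window` characters of the
    candidate word (e.g. "mint")."""
    hits = []
    for m in re.finditer(r"\b%s\b" % re.escape(candidate), text, re.IGNORECASE):
        context = text[max(0, m.start() - window):m.end() + window]
        for term in TERMS:
            if re.search(r"\b%s\b" % term, context, re.IGNORECASE):
                hits.append((term, context))
    return hits

# proximity_hits(open("cowper.txt", encoding="utf-8").read(), "mint")
```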

Page 43: Towards greater transparency in digital literary analysis

Creating data as a scholarly activity

The collection itself is a major effort (and not everyone would have made it public in this way prior to publishing their monograph). Creation of this kind of resource is not yet widely recognized as valuable scholarship: the usual focus is on “uninterpreted” transcription.

And some data comes from sources that cannot be made generally available (copyright again).

Page 44: Towards greater transparency in digital literary analysis

Are we satisfied?

Over half the metaphors come from searching Chadwyck-Healey collections of texts; about a third from reading.

There's transparency in that Pasanek explains in detail how he assembled his collection; but it would be a challenge to assemble a rival corpus to compare with this one. Such an effort shouldn't really be an individual one, but usually will be.

Page 45: Towards greater transparency in digital literary analysis

Conclusions

There's a potential for openness in new approaches, but also some challenges: new forms of publication appropriate for new kinds of work, balancing openness and scholarly recognition, and copyright.

We need to find out interesting things to motivate the changes greater transparency requires.

Page 46: Towards greater transparency in digital literary analysis

Thank you!

Please contact me at [email protected]

Slides: at http://www.slideshare.net/jlavagnino/tgt
