BOUTIQUE BIG DATA Reintegrating Close and Distant Reading of 19th-Century Newspapers M. H. Beals (ORCID: 0000-0002-2907-3313) Loughborough University @MHBEALS

Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

Apr 12, 2017


M. H. Beals
Transcript
Page 1: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

BOUTIQUE BIG DATA
Reintegrating Close and Distant Reading of 19th-Century Newspapers
M. H. Beals (ORCID: 0000-0002-2907-3313)
Loughborough University
@MHBEALS

Page 2: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

THE HISTORICAL PROBLEM

Image Courtesy of Mike Licht (CC BY) at https://www.flickr.com/photos/notionscapital/2313507405

• Culture of Reprinting in 18th and 19th Centuries

• Inconsistent Attribution

• Inconsistent Survival of Network Components

• Limited Historiographical Resources

Page 3: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers
Page 4: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

SEARCH AND TRANSCRIBE

Left Image Courtesy of Dan Tantrum (CC BY NC ND) at https://www.flickr.com/photos/tantrum_dan/2344581860

Page 5: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

COPYFIND REPRINT DETECTION

• Freeware Programme Developed by Lou Bloomfield

http://plagiarism.bloomfieldmedia.com/z-wordpress/software/copyfind/

• Highly Customisable Search Parameters and Open Source

• Measures Left, Right and Overall Matches

• Displays Left-Right Comparisons of Text

Image Courtesy of Lou Bloomfield at http://rabi.phys.virginia.edu/lab3e/
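As a rough illustration of what left, right and overall match figures measure, the sketch below counts the words that fall inside word sequences shared by two texts and reports them as a share of each text. It is a simplified stand-in and not Copyfind's actual algorithm; the function name `shared_word_coverage` and the phrase length `n` are assumptions made for this example.

```python
def shared_word_coverage(left, right, n=3):
    """Count words covered by word n-grams that occur in both texts.

    Hypothetical helper, not Copyfind's algorithm: a word counts as
    'matched' if it sits inside any run of n consecutive words that
    appears in both texts.
    """
    left_words = left.lower().split()
    right_words = right.lower().split()

    def ngrams(words):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    shared = ngrams(left_words) & ngrams(right_words)

    def matched_count(words):
        hit = [False] * len(words)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) in shared:
                for j in range(i, i + n):
                    hit[j] = True
        return sum(hit)

    def percent(part, whole):
        return 100.0 * part / whole if whole else 0.0

    left_hits = matched_count(left_words)
    right_hits = matched_count(right_words)
    return {
        "left_words": left_hits,
        "right_words": right_hits,
        "left_pct": percent(left_hits, len(left_words)),
        "right_pct": percent(right_hits, len(right_words)),
    }
```

A short text reprinted whole inside a longer one would score 100% on its own side and a lower share on the other, which is why the two sides are reported separately.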

Page 6: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers
Page 7: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

COPYFIND IN OCR CORPORA

• Freeware Programme Developed by Lou Bloomfield (University of Virginia)

• Highly Customisable Search Parameters

• Measures Left, Right and Overall Matches

• Displays Left-Right Comparisons of Text

• Extremely Effective at Discovering OCR-Transcribed Matches

Page 8: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers
Page 9: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers
Page 10: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

[Figure labels: 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817]

Page 11: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

ESTABLISHING LIKELY CANDIDATES

• Single Year (1810) Contained over 200,000 Possible Matches

• Removed Internal (Same Title) Reprints

• Restricted Match Size (90 Right, 90 Left or 160 Overall)

• Restricted Date Separation (200 Days)
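The restrictions listed above can be sketched as a simple filter. The slide does not state the units of the match-size thresholds or whether they are floors, so this example assumes they are minimum percentages of the left and right texts, with "overall" read as the sum of the two. The `Match` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Match:
    # Hypothetical record for one candidate pair from the matching step.
    left_title: str
    right_title: str
    left_date: date
    right_date: date
    left_pct: float   # share of the left text that matched (assumed unit)
    right_pct: float  # share of the right text that matched (assumed unit)


def is_likely_candidate(m, min_side=90.0, min_overall=160.0, max_days=200):
    """Apply the three restrictions from the slide under the stated assumptions."""
    # Remove internal (same title) reprints.
    if m.left_title == m.right_title:
        return False
    # Restrict date separation.
    if abs((m.left_date - m.right_date).days) > max_days:
        return False
    # Restrict match size: either side, or both sides combined.
    overall = m.left_pct + m.right_pct
    return (
        m.left_pct >= min_side
        or m.right_pct >= min_side
        or overall >= min_overall
    )
```

Keeping the thresholds as keyword arguments makes it easy to try a different reading of the slide's figures without touching the logic.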

Page 12: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

COMPARING DATABASES

• Historical Networks Bear Little Resemblance to Digitised Corpora

• Undigitised Collections Require Manual Discovery and Transcription

• Paywalled Collections (Currently) Require Search-and-Transcribe Inclusion

• Overcoming Political and Linguistic Divisions

Page 13: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

ADVERTISEMENTS

• Reprinted in Same Title

• Reprinted in Other Titles

• Reprinted with Minor Variations

• Reprinted after Long Periods

• Similar Wording in Different Adverts

• Own Networks Ripe for Analysis!

Page 14: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers
Page 15: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

DIRECTIONALITY

• Reprint Maps are Non-Linear, Similar to Phylogenetic Trees

• Paths of Specific Branches Dictated by Date, Content, Errors

• Similar Method to Meme-Tracking (Adamic et al., 2014)

• Attributions Are Often Red Herrings

Page 16: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

ANOTHER DREADFUL MASSACRE…

Times Courier

Star

St. James Chronicle

Sydney Gazette

Morning Chronicle

Caledonian Mercury

Aberdeen Journal

Page 17: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

AN AFFECTING INSTANCE OF SELF-MURDER

Coroner’s Inquest.–At half past two o’clock yesterday, an Inquest was held at the Nag’s Head, Orange-court, Leicester-fields, before Anthony Gell, Esq. Coroner for Westminster, on the body of Madamoiselle Ann Paris, then lying dead at No. 4, St. Martin’s-street, Leicester-fields.

Morning Chronicle (London, England, United Kingdom), 06 January 1810, p. 3, available at the Scissors and Paste Database, http://www.scissorsandpaste.net/381.
Trewman’s Exeter Flying Post (Exeter, England, United Kingdom), 11 January 1810, p. 2, available at the Scissors and Paste Database, http://www.scissorsandpaste.net/379.
Examiner (London, England, United Kingdom), 17 January 1810, p. 15, 16, available at the Scissors and Paste Database, http://www.scissorsandpaste.net/380.

Page 18: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

DIRECTIONALITY

Perfect Match | Overall Match | Copy | Original | Reprint ID
559 (82% L, 31% R) | 559 (82%) L; 559 (31%) R | 1810-01-11_Trewman's Exeter Flying Post_379.txt | 1810-01-06_Morning Chronicle_381.txt | 381379
992 (96% L, 56% R) | 992 (96%) L; 992 (56%) R | 1810-01-17_Examiner_380.txt | 1810-01-06_Morning Chronicle_381.txt | 381380

Perfect Match | Overall Match | View Both Files | File L | File R
559 (82% L, 31% R) | 559 (82%) L; 559 (31%) R | Side-by-Side | 1810-01-11_Trewman's Exeter Flying Post_379.txt | 1810-01-06_Morning Chronicle_381.txt
992 (96% L, 56% R) | 992 (96%) L; 992 (56%) R | Side-by-Side | 1810-01-17_Examiner_380.txt | 1810-01-06_Morning Chronicle_381.txt
533 (51% L, 78% R) | 533 (51%) L; 533 (78%) R | Side-by-Side | 1810-01-17_Examiner_380.txt | 1810-01-11_Trewman's Exeter Flying Post_379.txt

559 (82% L, 31% R) | 559 (82%) L; 559 (31%) R | 1810-01-06_Morning Chronicle_381 | 1810-01-11_Trewman's Exeter Flying Post_379 | 381379

Characters: 9923 | 3854

Type | Subtype | Text Copy | Text Original | Characters Removed | Characters Added | % Original | % Copy
Style | Capitalisation | CORONER'S INQUEST | Coroner's Inquest | 17 | 17 | 0.34% | 0.44%
Truncation | Text | | At half past two o'clock yesterday | 6919 | 0 | 69.73% | 0.02%
Addition | Text | An inquest was held yesterday evening | | 0 | 853 | 8.60% | 22.14%
Style | Punctuation | . | .-- | 3 | 1 | 0.04% | 0.03%
Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | , | | 0 | 1 | 0.01% | 0.03%
Style | Spelling | te | ea | 2 | 2 | 0.04% | 0.05%
Style | Punctuation | | , | 1 | 0 | 0.01% | 0.00%
Style | Punctuation | | , | 1 | 0 | 0.01% | 0.00%
Style | Punctuation | , | : | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | | , | 1 | 0 | 0.01% | 0.00%
Style | Punctuation | , | : | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | , | | 0 | 1 | 0.01% | 0.03%
Total | | | | | | 78.86% | 22.80%

Page 19: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

DIRECTIONALITY

992 (96% L, 56% R) | 992 (96%) L; 992 (56%) R | 1810-01-17_Examiner_380.txt | 1810-01-06_Morning Chronicle_381.txt | 381380

Characters: 5749 | 9923

Type | Subtype | Text Original | Text Copy | Characters Removed | Characters Added | % Original | % Copy
Truncation | Text | Coroner's Inquest.--At half past two | | 55 | 0 | 0.55% | 0.96%
Addition | Text | | On Friday, | 0 | 10 | 0.10% | 0.17%
Truncation | Text | Orange-court, | | 13 | 0 | 0.13% | 0.23%
Truncation | Text | before Anthony Gell, Esq | | 50 | 0 | 0.50% | 0.87%
Style | Punctuation | ; | . | 1 | 1 | 0.02% | 0.03%
Truncation | Text | that the deceased had lodged | | 174 | 0 | 1.75% | 3.03%
Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02%
Truncation | Text | She was also extremely incoherent | | 432 | 0 | 4.35% | 7.51%
Truncation | Text | told the witness that some one had | | 69 | 0 | 0.70% | 1.20%
Style | Capitalisation | M | m | 1 | 1 | 0.02% | 0.03%
Truncation | Text | At other times, the poor young lady | | 466 | 0 | 4.70% | 8.11%
Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02%
Truncation | Text | Immediately on the unfortunate | | 54 | 0 | 0.54% | 0.94%
Style | Capitalisation | m | M | 1 | 1 | 0.02% | 0.03%
Addition | Text | | | 0 | 1 | 0.01% | 0.02%
Truncation | Text | Mr. Emanuel Gristock, of Wardour-street | | 2840 | 0 | 28.62% | 49.40%
Truncation | Text | without a moment's hesitation, | | 31 | 0 | 0.31% | 0.54%
Editorial | Vocabulary | their | the | 5 | 3 | 0.08% | 0.14%
Style | Capitalisation | - | | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02%
Style | Spelling | at | te | 2 | 2 | 0.04% | 0.07%
Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02%
Style | Capitalisation | m | M | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02%
Truncation | Text | completely | | 10 | 0 | 0.10% | 0.17%
Style | Punctuation | , | | 1 | 0 | 0.01% | 0.02%
Style | Capitalisation | P | p | 1 | 1 | 0.02% | 0.03%
Style | Capitalisation | a disordered intellect | A DISORDERED INTELLECT | 22 | 22 | 0.44% | 0.77%
Style | Punctuation | . | ! | 1 | 1 | 0.02% | 0.03%
Total | | | | | | 43.20% | 74.57%
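The removed and added character counts in the tables above could be tallied automatically with a character-level diff. The sketch below uses Python's standard `difflib` for this; it is an assumption about how such counts might be produced, not a description of how the slides' figures were compiled. The function name and the two percentage definitions (removed characters over the original's length, added characters over the copy's length) are likewise choices made for this example.

```python
from difflib import SequenceMatcher


def tally_changes(original, copy):
    """Count characters removed from the original and added in the copy.

    Hypothetical helper: a 'replace' step counts on both sides, a
    'delete' only as removed, an 'insert' only as added.
    """
    removed = added = 0
    matcher = SequenceMatcher(None, original, copy, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed += i2 - i1
        if tag in ("insert", "replace"):
            added += j2 - j1
    return {
        "removed": removed,
        "added": added,
        # Assumed definitions; the slides may calculate these differently.
        "pct_original_removed": 100.0 * removed / len(original) if original else 0.0,
        "pct_copy_added": 100.0 * added / len(copy) if copy else 0.0,
    }
```

Classifying each change as truncation, addition, style or editorial, as the tables do, would still need a human reading of the passage; the diff only supplies the counts.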

Page 20: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

DIRECTIONALITY

Examiner

Trewman's Exeter Flying Post

Morning Chronicle

CUT FOR SPACE

EDITORIALISED

!

Page 21: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

A MATTER OF SCALE

• Case-Study Search and Transcribe Limited by:

  • Time

  • Access to Relevant Collections

  • Creative Search Methods and Hidden Biases

• OCR-Reprint Matching Limited by:

  • OCR Quality

  • Reprint Matching Resolution (Article, Page, nGram)

Page 22: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

RE-ANALYSING THE DATABASE

• Manual-to-OCR Matches Much More Accurate

• Finds a Small but Sometimes Crucial Set of New Matches

• Can Remap the Entire Reprint Network

Page 23: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

BOUTIQUE BIG DATA?

• Shared Transcription Standards

• Collegial Sharing of Data and Results

• Reuse in New and Unexpected Ways

• Case Study Discoveries Refining Big Data Search Parameters

Image Courtesy of Mike Licht (CC BY) at https://www.flickr.com/photos/notionscapital/14032020799/

Page 24: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

WWW.SCISSORSANDPASTE.NET

Page 25: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

WWW.GITHUB.COM/MHBEALS/SCISSORSANDPASTE

Page 26: Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers

THANK YOU
M. H. Beals (ORCID: 0000-0002-2907-3313)
Loughborough University
@MHBEALS