BOUTIQUE BIG DATA Reintegrating Close and Distant Reading of 19th-Century Newspapers M. H. Beals (ORCID: 0000-0002-2907-3313) Loughborough University @MHBEALS
THE HISTORICAL PROBLEM
Image Courtesy of Mike Licht (CC BY) at https://www.flickr.com/photos/notionscapital/2313507405
• Culture of Reprinting in 18th and 19th Centuries
• Inconsistent Attribution
• Inconsistent Survival of Network Components
• Limited Historiographical Resources
SEARCH AND TRANSCRIBE
Left Image Courtesy of Dan Tantrum (CC BY NC ND) at https://www.flickr.com/photos/tantrum_dan/2344581860
COPYFIND REPRINT DETECTION
• Freeware Programme Developed by Lou Bloomfield
http://plagiarism.bloomfieldmedia.com/z-wordpress/software/copyfind/
• Highly Customisable Search and Open Source
• Measures Left, Right and Overall Matches
• Displays Left-Right Comparisons of Text
Image Courtesy of Lou Bloomfield at http://rabi.phys.virginia.edu/lab3e/
COPYFIND IN OCR CORPORA
• Freeware Programme Developed by Lou Bloomfield (University of Virginia)
• Highly Customisable Search Parameters
• Measures Left, Right and Overall Matches
• Displays Left-Right Comparisons of Text
• Extremely Effective at Discovering OCR-Transcribed Matches
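The left, right and overall figures above can be illustrated with a small sketch. This is not Copyfind's own algorithm; it is a simplified stand-in that assumes matches are found through shared runs of n words, and the names `words`, `shingles` and `match_percentages` are hypothetical.

```python
import re


def words(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def shingles(tokens, n):
    """Return the set of n-word sequences found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def match_percentages(left_text, right_text, n=3):
    """Count the words on each side that sit inside a shared n-word run.

    Assumption: a word counts as matched if any n-word run containing it
    also occurs in the other text. Copyfind's real matching rules differ.
    """
    left, right = words(left_text), words(right_text)
    shared = shingles(left, n) & shingles(right, n)

    def covered(tokens):
        hit = [False] * len(tokens)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) in shared:
                for j in range(i, i + n):
                    hit[j] = True
        return sum(hit)

    def pct(count, total):
        return round(100 * count / total) if total else 0

    matched_left, matched_right = covered(left), covered(right)
    return {
        "matched_left": matched_left,
        "matched_right": matched_right,
        "pct_left": pct(matched_left, len(left)),
        "pct_right": pct(matched_right, len(right)),
    }


result = match_percentages(
    "An inquest was held at the Nag's Head",
    "Yesterday an inquest was held at the Nag's Head before the Coroner",
)
print(result)
```

A short text fully contained in a longer one scores high on its own side and lower on the other, which is the asymmetry the left and right percentages capture.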
[Figure labels: 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817]
ESTABLISHING LIKELY CANDIDATES
• Single Year (1810) Contained over 200,000 Possible Matches
• Removed Internal (Same Title) Reprints
• Restricted Match Size (90 Right, 90 Left or 160 Overall)
• Restricted Date Separation (200 Days)
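The three restrictions above can be sketched as a filter over candidate pairs. The slide does not state units or directions, so this sketch assumes the match-size figures are minimum scores (a pair is kept if any one is reached) and that 200 days is a maximum separation. The `Candidate` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    """One possible reprint pair (hypothetical record layout)."""
    left_title: str
    right_title: str
    left_date: date
    right_date: date
    left_match: int
    right_match: int
    overall_match: int


def is_likely_reprint(c, min_side=90, min_overall=160, max_days=200):
    """Apply the three restrictions; thresholds are assumed to be minimums."""
    # Remove internal (same-title) reprints.
    if c.left_title == c.right_title:
        return False
    # Restrict date separation (assumed to be a maximum, in days).
    if abs((c.right_date - c.left_date).days) > max_days:
        return False
    # Restrict match size: any one of the three scores may qualify the pair.
    return (
        c.left_match >= min_side
        or c.right_match >= min_side
        or c.overall_match >= min_overall
    )


candidates = [
    Candidate("Examiner", "Morning Chronicle", date(1810, 1, 17), date(1810, 1, 6), 96, 56, 152),
    Candidate("Star", "Star", date(1810, 2, 1), date(1810, 2, 3), 95, 95, 190),
    Candidate("Times", "Courier", date(1810, 1, 1), date(1810, 12, 1), 95, 95, 190),
    Candidate("Times", "Courier", date(1810, 3, 1), date(1810, 3, 2), 40, 50, 90),
]
kept = [c for c in candidates if is_likely_reprint(c)]
print(len(kept))
```

The illustrative scores in `candidates` are made up for the example, apart from the 96 and 56 taken from the Examiner and Morning Chronicle comparison.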
COMPARING DATABASES
• Historical Networks Bear Little Resemblance to Digitised Corpora
• Undigitised Collections Require Manual Discovery and Transcription
• Paywalled Collections (Currently) Require Search-and-Transcribe Inclusion
• Overcoming Political and Linguistic Divisions
ADVERTISEMENTS
• Reprinted in Same Title
• Reprinted in Other Titles
• Reprinted with Minor Variations
• Reprinted after Long Periods
• Similar Wording in Different Adverts
• Own Networks Ripe for Analysis!
DIRECTIONALITY
• Reprint Maps are Non-Linear, Similar to Phylogenetic Trees
• Paths of Specific Branches Dictated by Date, Content, Errors
• Similar Method to Meme-Tracking (Adamic et al., 2014)
• Attributions Are Often Red Herrings
ANOTHER DREADFUL MASSACRE…
Times Courier
Star
St. James Chronicle
Sydney Gazette
Morning Chronicle
Caledonian Mercury
Aberdeen Journal
AN AFFECTING INSTANCE OF SELF-MURDER
Coroner’s Inquest.–At half past two o’clock yesterday, an Inquest was held at the Nag’s Head, Orange-court, Leicester-fields, before Anthony Gell, Esq. Coroner for Westminster, on the body of Madamoiselle Ann Paris, then lying dead at No. 4, St. Martin’s-street, Leicester-fields.
Morning Chronicle (London, England, United Kingdom), 06 January 1810, p. 3, available at the Scissors and Paste Database, http://www.scissorsandpaste.net/381.
Trewman’s Exeter Flying Post (Exeter, England, United Kingdom), 11 January 1810, p. 2, available at the Scissors and Paste Database, http://www.scissorsandpaste.net/379.
Examiner (London, England, United Kingdom), 17 January 1810, p. 15, 16, available at the Scissors and Paste Database, http://www.scissorsandpaste.net/380.
DIRECTIONALITY
Perfect Match | Overall Match | Copy | Original | Reprint ID
559 (82% L, 31% R) | 559 (82%) L; 559 (31%) R | 1810-01-11_Trewman's Exeter Flying Post_379.txt | 1810-01-06_Morning Chronicle_381.txt | 381379
992 (96% L, 56% R) | 992 (96%) L; 992 (56%) R | 1810-01-17_Examiner_380.txt | 1810-01-06_Morning Chronicle_381.txt | 381380
Perfect Match | Overall Match | View Both Files | File L | File R
559 (82% L, 31% R) | 559 (82%) L; 559 (31%) R | Side-by-Side | 1810-01-11_Trewman's Exeter Flying Post_379.txt | 1810-01-06_Morning Chronicle_381.txt
992 (96% L, 56% R) | 992 (96%) L; 992 (56%) R | Side-by-Side | 1810-01-17_Examiner_380.txt | 1810-01-06_Morning Chronicle_381.txt
533 (51% L, 78% R) | 533 (51%) L; 533 (78%) R | Side-by-Side | 1810-01-17_Examiner_380.txt | 1810-01-11_Trewman's Exeter Flying Post_379.txt
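The file names in these listings follow a date_title_ID pattern, and the Reprint ID appears to join the original's ID to the copy's ID (381 and 379 give 381379). A minimal sketch of reading that pattern and proposing a direction from dates alone follows; `parse_filename` and `order_pair` are hypothetical helpers, not part of Copyfind or the Scissors and Paste database.

```python
import re
from datetime import date

# Pattern assumed from the file names shown above: YYYY-MM-DD_Title_ID(.txt)
FILENAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})_(.+)_(\d+)(?:\.txt)?$")


def parse_filename(name):
    """Split a file name into publication date, newspaper title and item ID."""
    m = FILENAME.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name!r}")
    year, month, day, title, item_id = m.groups()
    return {"date": date(int(year), int(month), int(day)), "title": title, "id": item_id}


def order_pair(file_l, file_r):
    """Treat the earlier item as the candidate original (date order only)."""
    original, copy = sorted(
        (parse_filename(file_l), parse_filename(file_r)),
        key=lambda rec: rec["date"],
    )
    return {
        "original": original["title"],
        "copy": copy["title"],
        # Assumed ID scheme: original ID followed by copy ID.
        "reprint_id": original["id"] + copy["id"],
    }


pair = order_pair(
    "1810-01-11_Trewman's Exeter Flying Post_379.txt",
    "1810-01-06_Morning Chronicle_381.txt",
)
print(pair)
```

Date order yields only a candidate direction. The third pairing above (Examiner against Trewman's Exeter Flying Post) would be ordered the same way, even though both items also match the earlier Morning Chronicle text, so two copies of a shared source can look like a direct reprint.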
559 (82% L, 31% R) | 559 (82%) L; 559 (31%) R | 1810-01-06_Morning Chronicle_381 | 1810-01-11_Trewman's Exeter Flying Post_379 | 381379
9923 3854
Type | Subtype | Text Copy | Text Original | Characters Removed | Characters Added | % Original | % Copy
Style | Capitalisation | CORONER'S INQUEST | Coroner's Inquest | 17 | 17 | 0.34% | 0.44%
Truncation | Text | | At half past two o'clock yesterday | 6919 | 0 | 69.73% | 0.02%
Addition | Text | An inquest was held yesterday evening | | 0 | 853 | 8.60% | 22.14%
Style | Punctuation | . | .-- | 3 | 1 | 0.04% | 0.03%
Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | , | | 0 | 1 | 0.01% | 0.03%
Style | Spelling | te | ea | 2 | 2 | 0.04% | 0.05%
Style | Punctuation | | , | 1 | 0 | 0.01% | 0.00%
Style | Punctuation | | , | 1 | 0 | 0.01% | 0.00%
Style | Punctuation | , | : | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | | , | 1 | 0 | 0.01% | 0.00%
Style | Punctuation | , | : | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | , | | 0 | 1 | 0.01% | 0.03%
Total | | | | | | 78.86% | 22.80%
DIRECTIONALITY
992 (96% L, 56% R) | 992 (96%) L; 992 (56%) R | 1810-01-17_Examiner_380.txt | 1810-01-06_Morning Chronicle_381.txt | 381380
5749 9923
Type | Subtype | Text Original | Text Copy | Characters Removed | Characters Added | % Original | % Copy
Truncation | Text | Coroner's Inquest.--At half past two | | 55 | 0 | 0.55% | 0.96%
Addition | Text | | On Friday, | 0 | 10 | 0.10% | 0.17%
Truncation | Text | Orange-court, | | 13 | 0 | 0.13% | 0.23%
Truncation | Text | before Anthony Gell, Esq | | 50 | 0 | 0.50% | 0.87%
Style | Punctuation | ; | . | 1 | 1 | 0.02% | 0.03%
Truncation | Text | that the deceased had lodged | | 174 | 0 | 1.75% | 3.03%
Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02%
Truncation | Text | She was also extremely incoherent | | 432 | 0 | 4.35% | 7.51%
Truncation | Text | told the witness that some one had | | 69 | 0 | 0.70% | 1.20%
Style | Capitalisation | M | m | 1 | 1 | 0.02% | 0.03%
Truncation | Text | At other times, the poor young lady | | 466 | 0 | 4.70% | 8.11%
Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02%
Truncation | Text | Immediately on the unfortunate | | 54 | 0 | 0.54% | 0.94%
Style | Capitalisation | m | M | 1 | 1 | 0.02% | 0.03%
Addition | Text | | | 0 | 1 | 0.01% | 0.02%
Truncation | Text | Mr. Emanuel Gristock, of Wardour-street | | 2840 | 0 | 28.62% | 49.40%
Truncation | Text | without a moment's hesitation, | | 31 | 0 | 0.31% | 0.54%
Editorial | Vocabulary | their | the | 5 | 3 | 0.08% | 0.14%
Style | Capitalisation | - | | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02%
Style | Spelling | at | te | 2 | 2 | 0.04% | 0.07%
Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02%
Style | Capitalisation | m | M | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03%
Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02%
Truncation | Text | completely | | 10 | 0 | 0.10% | 0.17%
Style | Punctuation | , | | 1 | 0 | 0.01% | 0.02%
Style | Capitalisation | P | p | 1 | 1 | 0.02% | 0.03%
Style | Capitalisation | a disordered intellect | A DISORDERED INTELLECT | 22 | 22 | 0.44% | 0.77%
Style | Punctuation | . | ! | 1 | 1 | 0.02% | 0.03%
Total | | | | | | 43.20% | 74.57%
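Several percentages in the Examiner comparison can be reproduced with simple arithmetic. The sketch below assumes the bare figures 9923 and 5749 are the character lengths of the Morning Chronicle and Examiner texts, and that each percentage is characters removed plus characters added, divided by the relevant length. That reading fits the rows checked here but is not confirmed by the slides, and `change_share` is a hypothetical name.

```python
ORIGINAL_LEN = 9923  # assumed: characters in the Morning Chronicle text
COPY_LEN = 5749      # assumed: characters in the Examiner text


def change_share(removed, added, original_len=ORIGINAL_LEN, copy_len=COPY_LEN):
    """Return (% of original, % of copy) affected by one change, to two decimals."""
    changed = removed + added
    return (
        round(100 * changed / original_len, 2),
        round(100 * changed / copy_len, 2),
    )


# (characters removed, characters added) for four rows of the comparison
rows = {
    "Coroner's Inquest.--At half past two": (55, 0),
    "On Friday,": (0, 10),
    "Mr. Emanuel Gristock, of Wardour-street": (2840, 0),
    "their -> the": (5, 3),
}
shares = {label: change_share(*counts) for label, counts in rows.items()}
for label, (pct_original, pct_copy) in shares.items():
    print(f"{label}: {pct_original:.2f}% of original, {pct_copy:.2f}% of copy")
```

On this reading, the single Gristock truncation accounts for 28.62% of the original text, which shows how one cut for space can dominate the overall difference between two versions.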
DIRECTIONALITY
Examiner Trewman's Exeter Flying Post
Morning Chronicle
CUT FOR SPACE
EDITORIALISED
A MATTER OF SCALE
• Case-Study Search and Transcribe Limited by:
  • Time
  • Access to Relevant Collections
  • Creative Search Methods and Hidden Biases
• OCR-Reprint Matching Limited by:
  • OCR Quality
  • Reprint Matching Resolution (Article, Page, nGram)
RE-ANALYSING THE DATABASE
• Manual-to-OCR Matches Much More Accurate
• Finds a Small but Sometimes Crucial Set of New Matches
• Can Remap the Entire Reprint Network
BOUTIQUE BIG DATA?
• Shared Transcription Standards
• Collegial Sharing of Data and Results
• Reuse in New and Unexpected Ways
• Case Study Discoveries Refining Big Data Search Parameters
Image Courtesy of Mike Licht (CC BY) at https://www.flickr.com/photos/notionscapital/14032020799/
WWW.SCISSORSANDPASTE.NET
WWW.GITHUB.COM/MHBEALS/SCISSORSANDPASTE
THANK YOU
M. H. Beals (ORCID: 0000-0002-2907-3313) Loughborough University @MHBEALS