Many Hands Make Light Work, the American Version. Experiences with User-Text-Correction at California Digital Newspaper Collection (CDNC): How crowd-sourcing OCR text correction impacts a historic newspaper collection

Many hands make light work, the american version [charleston library conference 201111]

Jul 15, 2015

Page 1: Many hands make light work, the american version [charleston library conference 201111]

Many Hands Make Light Work, the American Version

Experiences with User-Text-Correction at California Digital Newspaper Collection (CDNC):

How crowd-sourcing OCR text correction impacts a historic newspaper collection

Page 2

About the Collection

The California Digital Newspaper Collection contains over 490,000 pages of significant California newspapers published from 1846 to 1922.

The newspapers were digitized to both page and article level METS/ALTO data as part of the National Digital Newspaper Program.

The collection is displayed using Veridian digital library software.

Site statistics between Nov. 2010 and Aug. 2011:

visits per month

minutes per visit

pages per visit

Page 3

Poor OCR reduces search recall to low levels.

OCR quality ranges from 50% to 90% word-level accuracy.
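As a rough illustration of how word-level accuracy translates into search recall, the Python sketch below measures word accuracy against a hand-corrected reference and estimates the chance that a multi-word query still matches the OCR text. The function names, the sample sentence, and the assumption that OCR errors hit each word independently are illustrative only; none of them come from the presentation.

```python
from difflib import SequenceMatcher


def word_accuracy(ocr_text: str, truth_text: str) -> float:
    """Fraction of reference words that the OCR output reproduces, in order."""
    ocr_words = ocr_text.lower().split()
    truth_words = truth_text.lower().split()
    if not truth_words:
        return 1.0  # nothing to get wrong
    matcher = SequenceMatcher(a=truth_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


def phrase_hit_probability(accuracy: float, query_words: int) -> float:
    """Chance that every word of a query survived OCR.

    Assumption (not from the slides): each word is recognized correctly
    with the same probability, independently of the others.
    """
    return accuracy ** query_words


# Hypothetical sample line: two of six words are misrecognized.
truth = "the gold rush began in california"
ocr = "tbe gold rush hegan in california"
print(f"word accuracy: {word_accuracy(ocr, truth):.2f}")

# Three-word query at the two ends of the quoted accuracy range.
for accuracy in (0.5, 0.9):
    print(f"{accuracy:.0%} accuracy: {phrase_hit_probability(accuracy, 3):.3f}")
```

Under these assumptions a three-word query matches only about one time in eight at 50% word accuracy, which is one way to read the "low levels" of recall noted on the slide.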

Page 4

Post-OCR text correction is expensive

≈ $0.50 per 1,000 characters, or $5.00 to $10.00 per newspaper page
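To put these rates in perspective, the sketch below works through the arithmetic using only figures quoted in the slides: the per-character rate, the per-page range, and the collection size of 490,000 pages. The constant and function names are illustrative, and the result is a back-of-the-envelope estimate rather than a figure from the presentation.

```python
PAGES = 490_000               # collection size quoted in the slides
COST_PER_1000_CHARS = 0.50    # dollars, quoted rate
COST_PER_PAGE_LOW = 5.00      # dollars, low end of the quoted range
COST_PER_PAGE_HIGH = 10.00    # dollars, high end of the quoted range


def implied_chars_per_page(cost_per_page: float,
                           cost_per_1000_chars: float = COST_PER_1000_CHARS) -> float:
    """Characters per page implied by the per-page and per-character rates."""
    return cost_per_page / cost_per_1000_chars * 1000


def collection_cost(pages: int, cost_per_page: float) -> float:
    """Cost of paid correction for the whole collection, in dollars."""
    return pages * cost_per_page


for cost in (COST_PER_PAGE_LOW, COST_PER_PAGE_HIGH):
    print(f"${cost:.2f}/page: "
          f"{implied_chars_per_page(cost):,.0f} characters per page, "
          f"${collection_cost(PAGES, cost):,.0f} for {PAGES:,} pages")
```

At the quoted rates a page holds roughly 10,000 to 20,000 characters, and paid correction of the full collection would cost between $2.45 million and $4.9 million.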

Daily Alta California, 2 January 1850

Page 5

Like the users of many digital newspaper collections, patrons of the CDNC visit the site for personal reasons, consider themselves genealogists or family historians, and return to the site frequently.

The Average CDNC User

users above 40 years old

users who consider themselves genealogists

users who visit the site at least weekly

Page 6

Wikipedia on Crowdsourcing:

“distributed problem-solving and production model”

“sourcing tasks traditionally performed by specific individuals to an undefined large group of people or community (crowd) through an open call”

Page 7

Crowd-Sourcing Projects

•  Project Gutenberg

•  Family Search

•  The National Library of Australia

•  The National Library of Finland

•  FreeBMD.org

Page 8

Site Statistics Since User Text Correction

visits per month

minutes per visit

pages per visit

Page 9

lines per month corrected by the top corrector: 30,000

total lines corrected since 2008: 49 Million

total number of text correctors: 30,000

lines corrected per month in 2011: 2,000,000 +

‘Engaging with users and building virtual communities is just as important to the users as providing the data itself. They want to be part of a community.’

Rose Holley, The National Library of Australia

Page 10

User Text Correction added to CDNC

Page 11

Results August 22 - October 22

Users who have corrected text

Lines corrected by top corrector

Total number of lines corrected

Lines Corrected Per Month

Page 12

Goals

•  Improve OCR text at low cost

•  Improve search precision / recall

•  Build user community

Page 13

Risks?

•  User text correction of newspapers is (relatively) new

•  Users won’t know what to do, interface is confusing

•  Users don’t understand errors in OCR text

•  Vandalism of text

Page 14

Benefits

•  Text quality improved

•  Cost effective

•  Community involvement

•  Users empowered


Page 15

User Reaction

“Great feature (I tested it during the beta) for a great site, which I have used extensively.  I plan to use the edit feature when I get back to research in the Los Angeles Herald and the Daily Alta California.”

~Lawrence B.

“STUNNINGLY FANTASTIC!!!! is what I think!”

~A fifth generation Californian of multiple Forty-niner families

“I have used the new system and like it. The user correction is great idea.”

~Pat

“Exactly what the system needed!!! Pulled up a couple articles in the beta system and made some text corrections. Went back and tried the old system using the words I corrected and it worked!! Outstanding enhancement!”

~Mary B.

Page 16

“The addition of user text correction (UTC) to the California Digital Newspaper Collection has dramatically improved the quality of the computer-generated text and enlivened our relationship with our users. Within a couple of weeks of implementing UTC, and with little publicity, a handful of users had already corrected thousands of lines of text. Many of those users emailed us directly with questions about or praise for the UTC, building direct, personal connections between our staff and users that hadn’t existed before.”

~Brian Geiger, Center for Bibliographic Research, UC Riverside

Page 17

?

Brian Geiger, Director
Center for Bibliographic Studies and Research
University of California Riverside

Frederick Zarndt, Chair
IFLA Newspapers Section