Crowdsourcing, Family History, and Long Tails for Libraries
http://slidesha.re/1qzB8vv
Frederick Zarndt [email protected]
Secretary, IFLA Newspapers Section
Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.
20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]
In all of its many flavors, crowdsourcing works. It works for cultural heritage organizations too. During this presentation we look at various aspects of crowdsourced OCR text correction, commenting, and tagging for digitized historical newspapers at the National Library of Australia’s Trove, the California Digital Newspaper Collection (CDNC), and the Cambridge Public Library in Cambridge, Massachusetts, as well as the astounding number of historical birth, death, marriage, census, and other records transcribed by “crowd” volunteers at FamilySearch. Aspects covered include demographics, experiences, motivation, quality, preferred data, economics, and marketing. You will see that crowdsourcing is not only feasible but also practical and desirable. You will wonder why your own cultural heritage organization hasn't begun its own crowdsourcing project!
Transcript
Crowdsourcing, Family History, and Long Tails for Libraries
The term “crowdsourcing” was coined by Jeff Howe in “The rise of crowdsourcing”, published in Wired magazine, June 2006.
web trends for “crowdsourcing”
Jan-2006 to Jun-2014
• On the date of publication of Jeff Howe’s Wired magazine article, 1-Jun-2006, Wikipedia did not have an entry (list) of crowdsourcing projects*.
• On 25-Jan-2010 Wikipedia’s list of crowdsourcing projects had 35 entries*.
• On 17-Mar-2013 Wikipedia’s list of crowdsourcing projects had 158 entries+.
* From the Internet Archive’s Wayback Machine.
+ Wikipedia contributors, "List of crowdsourcing projects," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/List_of_crowdsourcing_projects (accessed March 17, 2013).
Amazon Mechanical Turk was launched Nov 2005
Alexa global / country rank of Amazon Mechanical Turk (June 2014): 6,465 / 2,046
crowdsourcing
crowdsourcing
Each day 200,000,000 reCAPTCHAs are solved by humans around the world
Galaxy Zoo was first launched July 2007
Alexa global / country traffic rank of Galaxy Zoo (June 2014): 606,971 / 100,298
citizen science
Kickstarter was first launched in 2009
Alexa global / country traffic rank of Kickstarter (June 2014): 782 / 326
60,000+ projects successfully funded with more than USD $1,000,000,000
crowd funding
crowd collaboration
FamilySearch Indexing was first launched (beta) 2004
Alexa global / country traffic rank of FamilySearch (June 2014): 4,385 / 1,321
Project Gutenberg was first launched Dec 1971
Alexa global / country traffic rank of Project Gutenberg (June 2014): 6,615 / 4,066
Alexa global / country traffic rank of National Library of Finland: 2,535,854 (31-Oct-2012) / 199 (2-Apr-2012)
so what? why should a library care about
crowdsourcing?
Time Life Pictures
Getty Images
“user engagement refers to the quality of the user experience that emphasizes the
positive aspects of the interaction with a web application, and in particular the phenomena
associated with wanting to use that web application longer and frequently”
Elad Yom-Tov, Mounia Lalmas, Georges Dupret, Ricardo Baeza-Yates, Pinard Donmez, and Janette Lehmann. 2012. The effect of links on networked user engagement. In Proceedings of the 21st international conference companion on World Wide Web (WWW '12 Companion). ACM, New York, NY, USA, 641-642. DOI=10.1145/2187980.2188167 http://doi.acm.org/10.1145/2187980.2188167
“in addition to increasing search accuracy or lowering the costs of document transcription,
crowdsourcing is the single greatest advancement in getting people using and interacting with library
collections”
Paraphrased from Trevor Owens’s blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).
“While [the National Library of Australia’s] Trove offers a range of user engagement features, and use of each of these features continues to grow, it is Trove’s newspaper text correction features that have attracted the highest level of user engagement.”
Marie-Louise Ayres. 2013. ‘Singing for their supper’: Trove, Australian newspapers, and the crowd. Paper presented at IFLA WLIC 2013, Singapore. Accessed June 2014 IFLA Library http://library.ifla.org/id/eprint/245.
Deaths. lln»rieff, Esq. of <c .. Qn. Sunday, the till. greatly Drandrellt, of Orms4\irJi.- ~ ; ;✓ ' • * On ijfr r inn l j j j i l F i i j ' 1 1 f H a v o d i v y d , Carnarvonshire, S ; **" *- ' « ' March Oxford, F. Tfovmeud, Uerald. » • V . •On Tncsdav last , Mr. Charles. IWilinson, this 8 ; had vf thesis#,, a week ago, which tcrminate<i'iu his death. . / ' ■ O'i Sunday, dJst nit. at. A s b t C n v H a l l , m a r L a n c a s t e r , Mr.,Geo. Worn ick, many years house'steward hit late Once The Hamilton and Brandon. He locked himself h»oWn'r«wte<: soon. twelve o'clock" that dny, and fii»-d a loaded pistol "through Ins bead, 1 which instantaneously killed him. Coronet's Verdict, shot himself in a temporary fit of Friday week,
raw OCR text
Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.
• 72% visit UDN for genealogical research
• 20% visit for various other types of historical research
• 87% find obituaries useful
• Over 60% find the other genealogical article types (birth and wedding announcements) useful
• Only 7% do not find genealogical articles useful
• Many are writing family histories and consequently also look for general background information
• Older content is much more highly valued than more recent content (see more detailed explanation that follows)
• 44% find smaller, rural papers more useful, while only 15% find larger, metropolitan papers more useful
Motivation 2012 user survey
John Herbert and Randy Olsen. Small town papers: still delivering the news. WLIC 2012, Helsinki Finland. http://conference.ifla.org/past-wlic/2012/119-herbert-en.pdf
• CDNC and Cambridge Public Library published a user survey in Mar 2013
• 604 / 32 responses (CDNC / Cambridge Public Library)
• Surveys are (mostly) identical except for organization name
Motivation 2013 user survey
User demographic Genealogists and family historians
User demographic No spring chickens
User demographic Reasons for use
User demographic Types of information
• “I enjoy the correction - it’s a great way to learn more about past history and things of interest whilst doing a ‘service to the community’ by correcting text for the benefit of others.”
• “I have recently retired from IT and thought that I could be of some assistance to the project. It benefits me and other people. It helps with family research.”
Rose Holley. March 2009. Many Hands Make Light Work. National Library of Australia. Accessed June 2014 http://www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf.
“The ‘typical’ Trove user is a very well educated, highly paid, English speaking employed woman aged fifty or over, with a significant or primary interest in family or local history, who visits the Trove website very frequently. Users of Trove newspapers are older than the average Trove user; only 13% of newspaper users are under 40 years of age.”
Marie-Louise Ayres. ‘Singing for their supper’: Trove, Australian newspapers, and the crowd. WLIC 2013, Singapore. http://library.ifla.org/245/1/153-ayres-en.pdf.
“Many of Trove’s user engagement features are very popular. More than 100,000 users have
registered to date, and more than 2 million tags and nearly 60,000 comments had been added…
[Trove] text correction, however, stands head and shoulders above any other user engagement
features.”
Motivation Engaged users: What do they do?
Marie-Louise Ayres. ‘Singing for their supper’: Trove, Australian newspapers, and the crowd. WLIC 2013, Singapore. http://library.ifla.org/245/1/153-ayres-en.pdf.
“when someone transcribes a document, they are actually better fulfilling the mission of a cultural
heritage organization than someone who simply stops by to flip through the pages”
Paraphrased from Trevor Owens’s blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).
“I am interested in all kinds of history. I have pursued genealogy as a hobby for many years. I correct text at CDNC because I see it as a constructive way to contribute to a worthwhile project.
Because I am interested in history, I enjoy it.” Wesley, California
Personal communications with CDNC text correctors.
Motivation CDNC users’ report
“I only correct the text on articles of local interest - nothing at state, national or international level, no advertisements, etc. The objective is to be able to help researchers to locate local people, places, organizations and events using the on-line search at CDNC. I correct local news & gossip, personal items, real estate transactions, superior court proceedings, county and local board of supervisors meetings, obituaries, birth notices, marriages, yachting news, etc.”
Ann, California
Personal communications with CDNC text correctors.
Motivation CDNC users’ report
“I am correcting text for the Coronado Tent City Program for 1903. It is important to correct any problems with personal names and other information so that researchers will be able
to search by keyword and be assured of retrieving desired results. ... type fonts cause a great deal of difficulty in
digitizing the text and can cause problems for searchers. Also, many of the guests' names at Tent City and Hotel Del
Coronado were taken from the registration books and reported in the Program. This led to many problems in spelling of last names and the editors were not careful to be consistent in the
spellings. This Program is an important resource since it provides an excellent picture of daily life in Tent City and
captures much of the history of Coronado itself.” Gene, California
Personal communications with CDNC text correctors.
Motivation CDNC users’ report
“I have always been interested in history, especially the development of the American West, and nothing brings it alive
better than newspapers of the time. I believe them to be an invaluable source of knowledge for us and future generations.”
David, United Kingdom
Personal communications with CDNC text correctors.
Motivation CDNC users’ report
CDNC is an excellent source of information matching my personal interest in such topics as sea history, development
of shipbuilding, clippers and other ships etc. ... Unfortunately, the quality of text ... is rather poor I’m
afraid. This is why I started to do all corrections necessary for myself ... and to leave the corrected text for use of
others. .... I am not doing this very regularly as this is just my hobby and pleasure.
Jerzey, Poland
Personal communications with CDNC text correctors.
Motivation CDNC users’ report
As an amateur historical researcher my time for research is very limited. Making time to travel to archives, libraries, and historical societies does not happen as often as I would like. The Cambridge
Public Library’s online newspaper collection has been an invaluable resource and it is fun. I am very grateful for all the help I have received
over the years from so many research organizations. Correcting text has several benefits. It makes it much more likely that I will find a story if I decide to search for it in the future. It is a way of saying
‘thank you’ to the Cambridge Library for having such a great resource available and maybe I can make the next person’s research a little
easier. It is my own little historical preservation project. Cambridge Historical Newspapers Text Corrector
Personal communications with CDNC text correctors.
Motivation Cambridge users’ report
so old, boring, easily entertained people correct text. convince me there are
real benefits.
Economic benefits
Public domain photo courtesy of US Navy
Economics
Financial value of outsourced OCR text correction for newspapers?
The Assumptions
• 25 to 50 characters per line in a newspaper column: Assume 40 characters per line (CDNC sample average)
• Outsourced text transcription or correction costs USD $0.35 to $1.20 per 1000 characters: Assume $0.50 per 1000 characters
2,656,497 lines x 40 characters per line x 1/1000 x $0.50 = $53,130
129,046,297 lines x 40 characters per line x 1/1000 x $0.50 = $2,580,926
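The arithmetic above can be reproduced with a short Python sketch. The function name `outsourced_cost` and its parameter names are illustrative and not part of the presentation; the line counts, the 40 characters per line, and the $0.50 per 1000 characters are the figures assumed on the slide.

```python
def outsourced_cost(lines, chars_per_line=40, usd_per_1000_chars=0.50):
    """Estimate the USD cost of paying a vendor to correct `lines` lines of text."""
    characters = lines * chars_per_line
    return characters / 1000 * usd_per_1000_chars


# Line counts taken from the slide above.
for lines in (2_656_497, 129_046_297):
    print(f"{lines:,} lines -> ${outsourced_cost(lines):,.0f}")
```

Changing `usd_per_1000_chars` to either end of the quoted $0.35 to $1.20 range shows how sensitive the estimate is to the vendor price.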
Economics
Financial value of in-house OCR text correction?
The Assumptions
• Correction takes 15 seconds per line
• Cost is hourly wage plus benefits of lowest level employee: $10 for CDNC, $41.88* for Australia
* AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.
Economics
2,656,497 lines x 15 seconds per line x 1/3600 hrs per second x $10.00 per hr = $110,687
129,046,297 lines x 15 seconds per line x 1/3600 hrs per second x $41.88 per hr = $22,518,579
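The in-house estimate follows the same pattern, with time per line and an hourly rate in place of a per-character price. A minimal sketch, assuming the slide's 15 seconds per line; `in_house_cost` is a hypothetical helper name:

```python
def in_house_cost(lines, usd_per_hour, seconds_per_line=15):
    """Estimate the USD cost of staff correcting `lines` lines of OCR text."""
    hours = lines * seconds_per_line / 3600  # 3600 seconds per hour
    return hours * usd_per_hour


# (line count, hourly rate) pairs taken from the slide above.
for lines, rate in ((2_656_497, 10.00), (129_046_297, 41.88)):
    print(f"{lines:,} lines at ${rate:.2f}/hr -> ${in_house_cost(lines, rate):,.0f}")
```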
Economics
Accuracy
“His Accuracy Depends on Ours!" Office for Emergency Management. Office of War Information. Domestic Operations Branch. Bureau of Special Services. [Photo held at US National Archives and Records Administration]
Accuracy
• Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw OCR character accuracies of 68% for early 20th century newspapers
• Rose Holley (National Library of Australia) reports raw OCR character accuracy varied from 71% to 98% on a sample of Trove digitized newspapers
Rose Holley. How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Magazine. Mar/Apr 2009. Accessed June 2014 http://www.dlib.org/dlib/march09/holley/03holley.html.
Edwin Klijn. The current state-of-art in newspaper digitization. D-Lib Magazine. Jan/Feb 2008. Accessed June 2014 http://www.dlib.org/dlib/january08/klijn/01klijn.html.
Public domain graphic courtesy of Wikimedia Commons.
Accuracy
MAPPING TEXTS* assesses digitization quality of digital newspapers by comparing the number of words recognized to the total number of words scanned
* Mapping texts is a collaboration between the University of North Texas and Stanford University aimed at experimenting with new methods for finding and analyzing meaningful patterns embedded in massive collections of digital newspapers.
How does low text accuracy affect search recall?
The Facts
• Average uncorrected OCR character accuracy of the CDNC sample data is ~89%
• Average length of an English word is 5 characters
• Average word accuracy is 89% x 89% x 89% x 89% x 89% = 55.8% - round up to 60% or 6 out of 10 words correct
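Multiplying the character accuracy by itself once per character treats OCR errors on each character as independent; that is the implicit assumption behind the figure above. A minimal Python sketch of the calculation (the name `word_accuracy` is illustrative):

```python
def word_accuracy(char_accuracy, word_length=5):
    """Probability that every character of a word is correct,
    assuming character errors are independent of one another."""
    return char_accuracy ** word_length


print(f"{word_accuracy(0.89):.1%}")  # 5-character word at 89% character accuracy
```

Passing a different `word_length` shows how quickly accuracy falls for longer words and names.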
Accuracy
[Figure: Search recall with no text correction. Ten instances of “ARNDT”, each marked as found or not found.]
Accuracy
The Facts
• Average corrected character accuracy of the CDNC sample data is ~99.4%
• Average word accuracy of CDNC corrected text is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%
[Figure: Search recall with text correction. Ten instances of “ARNDT”, each marked as found or not found.]
A search for “Arndt” at Chronicling America gives 10,267 results*
• If Chronicling America text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,133 instances of “Arndt” were not found
• If text accuracy is 97.0%, then 317 instances of “Arndt” were not found
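The estimates above appear to follow from treating the search results as only the correctly recognized share of all true occurrences. As a worked equation, with F the number of results found, a the word accuracy, T the estimated true total, and M the estimated number missed (these symbols are introduced here for illustration):

```latex
\[
T = \frac{F}{a}, \qquad M = T - F = F \cdot \frac{1 - a}{a}
\]
\[
a = 0.558:\quad M = 10{,}267 \times \frac{0.442}{0.558} \approx 8{,}133
\]
\[
a = 0.970:\quad M = 10{,}267 \times \frac{0.030}{0.970} \approx 317.5
\]
```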
Accuracy
* Search performed 31 Oct 2012
Accuracy
Suppose the word/name is longer than 5 characters?
The Facts
• Assume that average uncorrected / corrected OCR character accuracy is ~89% / ~99%, same as CDNC.

Name         Name length   Raw text accuracy   Corrected text accuracy
Eklund       6             49.7%               94.2%
Kennedy      7             44.2%               93.25%
Espinosa     8             39.4%               92.3%
Bonaparte    9             35%                 91.4%
Chatterjee   10            31.2%               90.4%
Accuracy
Name         Number of search results   Missing results with raw text accuracy   Missing results with corrected text accuracy
Eklund       2,951                      2,987                                    182
Kennedy      360,723                    455,392                                  26,111
Espinosa     1,918                      2,950                                    160
Bonaparte    44,664                     82,947                                   4,203
Chatterjee   19                         42                                       2
Chronicling America searches done 19-Mar-2013 (6,025,474 pages from 1836 to 1922).
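The missing-result columns can be reproduced from the search counts and the accuracy percentages listed for each name. The sketch below assumes the search found only the correctly recognized share of all occurrences; `missing_results` is a hypothetical helper, and the input numbers are copied from the two tables above.

```python
def missing_results(found, accuracy_pct):
    """Estimate occurrences a search missed, given the number found and
    the word accuracy (in percent) of the underlying text."""
    estimated_total = found / (accuracy_pct / 100)
    return estimated_total - found


# name: (search results, raw accuracy %, corrected accuracy %), from the tables above
table = {
    "Eklund": (2_951, 49.7, 94.2),
    "Kennedy": (360_723, 44.2, 93.25),
    "Espinosa": (1_918, 39.4, 92.3),
    "Bonaparte": (44_664, 35.0, 91.4),
    "Chatterjee": (19, 31.2, 90.4),
}

for name, (found, raw_pct, corrected_pct) in table.items():
    raw_missing = round(missing_results(found, raw_pct))
    corrected_missing = round(missing_results(found, corrected_pct))
    print(f"{name:<11}{found:>9,}{raw_missing:>10,}{corrected_missing:>9,}")
```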
but you left out long
tails…
Public domain illustration from "On The Genesis of
Species" by St. George Mivart
the long tail* of crowdsourced OCR text correction
a probability distribution has a long tail if a larger share of the population rests within its tail than it would under a normal distribution

the most productive users are a small fraction of the total user population yet account for ~50% of total production; said a different way, the many users who are individually less productive are, taken together, as important as the most productive users
* The phrase “long tail” was popularized by Chris Anderson in the October 2004 Wired magazine article The Long Tail and by Clay Shirky’s February 2003 essay “Power laws, web logs, and inequality”.
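One way to check whether a correction project shows this pattern is to measure what share of all corrections comes from the most productive users. The sketch below is illustrative only: `share_of_top` is a hypothetical helper, and the per-user counts are made-up numbers, not CDNC or Trove data.

```python
def share_of_top(counts, top_fraction):
    """Share of total production contributed by the most productive
    `top_fraction` of users (at least one user is always counted)."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Made-up lines-corrected-per-user counts, for illustration only.
corrections = [500, 150, 100, 60, 50, 40, 40, 30, 20, 10]

top = share_of_top(corrections, 0.10)
print(f"top 10% of users: {top:.0%} of corrections; everyone else: {1 - top:.0%}")
```

With real per-user correction counts in place of the invented list, a result near 50% for a small top fraction would match the pattern described above.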