Google Print™, Million Book Project, and Google Scholar™ Digital Libraries Colloquium January 27, 2005 Gloriana St. Clair Dean of University Libraries
Dec 28, 2015
“This is the day the world changes.”
John Wilkin, University of Michigan [2]
“Commercialize the great research libraries with a handshake, suddenly and epochally.”
Rory Litwin, in Library Juice [1]
Thesis
Google’s new projects are exciting and, of course, commercial
This talk will compare Google Print™ with the NSF-funded Million Book Project, and then touch briefly on Google Scholar™
Main Points
● Why / Genesis - Leaders, Partners
● Realities - Collections, Logistics
● Worries - Duplication, Copyright, Copyright, Copyright, Printing . . .
Sources For This Talk
News / web / talks / interviews, with help:
● Jean Alexander, Head, and the Hunt Library Reference Department
● Denise Troll Covey, Special Projects Librarian
● Missy Harvey, Computer Science Librarian
● Penn State Reference Department
● David Seaman, Digital Library Federation
● Anthony Tomasic, E-XMLMedia
● Michael Lesk, Rutgers University
Google Print™ Leaders/Partners
● Google, Inc.
● U. Michigan
● Stanford University
● Harvard University
● U. Oxford
● New York Public Library
Million Book Project Leaders/Partners in India
● Indian Institute of Science
● International Institute of Information Technology
● Indian Institute of Information Technology
● Anna University
● Mysore University
● University of Pune
● Goa University
● Tirumala Tirupati Devasthanams
● Shanmugha Arts, Science, Technology & Research Academy
● Arulmigu Kalasalingam College of Engineering
● Maharashtra Industrial Development Corporation
Million Book Project Leaders/Partners in China
● Chinese Academy of Science
● Chinese Ministry of Education
● Fudan University
● Nanjing University
● Peking University
● Tsinghua University
● Zhejiang University
Google Print™ Collections
● Stanford - entire collection
● Harvard - 40,000-volume pilot from a 15-million-volume collection
● U. Michigan - virtually the entire collection; add seven million to search engine; Michigan to “receive and own a high quality digital copy” [3] and provide access
● New York Public Library - a subset of a 20-million-volume collection; selection criteria = in public domain (published before 1923), interesting, not too fragile
Million Book Project Targeted Subcollections
● Books for College Libraries (best books)
● University presses / scholarly societies (copyright permissions work)
● U.N.’s Food and Agriculture Organization content
Google Print™ Handling the Copyright Issue
● Displays “a snippet of text” [4] online for books in copyright
● A ‘snippet’ is defined as three lines
● A search returns three snippets per book, and lists the number of times your search terms appear in the book
● BUY button
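The snippet rules above (three lines per snippet, three snippets per book, plus a count of term occurrences) can be illustrated with a small sketch. This is not Google's implementation, which the sources for this talk do not describe; the function name `search_book`, the line-based book representation, and the choice to start each snippet one line before the match are all assumptions made for illustration.

```python
import re

SNIPPET_LINES = 3   # "a snippet is defined as three lines"
MAX_SNIPPETS = 3    # "a search returns three snippets per book"

def search_book(lines, term):
    """Return (total_hits, snippets) for one book given as a list of text lines.

    Hypothetical helper: counts every case-insensitive occurrence of `term`
    and returns at most MAX_SNIPPETS non-overlapping windows of SNIPPET_LINES lines.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    # The hit count covers the whole book, not just the displayed snippets.
    total_hits = sum(len(pattern.findall(line)) for line in lines)

    snippets = []
    next_free = 0  # first line index not already shown in a snippet
    for i, line in enumerate(lines):
        if len(snippets) == MAX_SNIPPETS:
            break
        if i < next_free or not pattern.search(line):
            continue
        # Start one line before the match when possible, without overlapping
        # the previous snippet; the window may be shorter at the end of the book.
        start = max(next_free, i - 1)
        end = min(len(lines), start + SNIPPET_LINES)
        snippets.append(lines[start:end])
        next_free = end
    return total_hits, snippets
```

The point of the sketch is the asymmetry the slide describes: the reader sees only a few short windows of an in-copyright book, while the hit count still reports how often the term appears in the full text.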
Million Book Project Handling the Copyright Issue
After extensive work, we are experiencing growing success in efforts to gain permission from university presses / scholarly societies to digitize books in searchable full text
Million Book Project Research Initiatives
● Machine translation
● Massive distributed database
● Storage formats
● Use of digital libraries
● Distribution and sustainability
● Security
● Search engines
● Image processing
● Optical Character Recognition (OCR)
● Language processing
● Copyright laws
Google™ began as a research project at Stanford in 1995.
Google Print™ Logistics
● “Google will be doing all the digitizing with their own staff at Google headquarters and supposedly at Harvard and Michigan.” [5]
● Six-year time frame
● 2.25 books per minute
● Onsite
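As a rough check on how the six-year time frame and the 2.25 books-per-minute rate fit together, the arithmetic can be sketched as follows. The duty cycle is an assumption: the sources do not say whether the rate applies around the clock, so the function takes it as a parameter (`duty_cycle`, a hypothetical name) with continuous operation as the default.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # ignoring leap days

def books_scanned(books_per_minute, years, duty_cycle=1.0):
    """Total books scanned at a steady rate.

    duty_cycle is the fraction of all minutes spent scanning
    (1.0 = around the clock, every day); this is an assumption,
    not a figure from the talk's sources.
    """
    return books_per_minute * MINUTES_PER_YEAR * years * duty_cycle

around_the_clock = books_scanned(2.25, 6)            # about 7.1 million books
one_shift_in_three = books_scanned(2.25, 6, 1 / 3)   # about 2.4 million books
```

Under the continuous-operation assumption the total comes to roughly 7.1 million books, which is close to the seven million volumes cited for Michigan on the Collections slide. With a single eight-hour shift the same rate would cover only about a third of that.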
Million Book Project Logistics
● With scanning time @ one page per second:
● 20,000 pages per day shift x 200 working days per year
● 100 years to scan 1 million books ÷ (number of operators/machines)
● Several mega scanning centers are set up in India and China
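The scanning estimate above can be reproduced with a few lines of arithmetic. The slide does not state a book length; 400 pages per book is the value implied by the other three figures (20,000 pages per shift, 200 working days, 100 years for one million books) and is labeled as an assumption below. The function name `years_to_scan` is illustrative.

```python
PAGES_PER_DAY_SHIFT = 20_000      # from the slide
WORKING_DAYS_PER_YEAR = 200       # from the slide
PAGES_PER_BOOK = 400              # assumption: implied by the 100-year figure, not stated

def years_to_scan(books, machines=1, pages_per_book=PAGES_PER_BOOK):
    """Years needed to scan `books` volumes with `machines` operators/machines."""
    pages_per_year = PAGES_PER_DAY_SHIFT * WORKING_DAYS_PER_YEAR * machines
    return books * pages_per_book / pages_per_year

single_machine = years_to_scan(1_000_000)             # 100.0 years
twenty_five_machines = years_to_scan(1_000_000, 25)   # 4.0 years
```

The division by the number of operators/machines is what makes the project feasible: one machine per shift handles 4 million pages a year, so the timeline shrinks in direct proportion to the number of scanners running in the India and China centers.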
Million Book Project Finances
India - $25M annually to support a large set of language translation research projects
China - $8.46M from Ministry of Education over 3 yrs (2006)
United States - $3.63M from NSF over 4 yrs (2005); and equipment, staff and money from the Internet Archive
Google Print™ has funding of $???, but estimates costs at $10 per book.
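Since Google's funding is unknown, the $10-per-book estimate is the only basis for sizing the cost. A minimal sketch, using volume counts taken from the Collections slide (how many volumes Google will actually scan is not stated, so the totals are illustrative only):

```python
COST_PER_BOOK_USD = 10  # Google's per-book estimate, as quoted above

def digitization_cost(volumes, cost_per_book=COST_PER_BOOK_USD):
    """Estimated cost in US dollars to digitize `volumes` books."""
    return volumes * cost_per_book

# Volume counts from the Collections slide; used here only as examples.
michigan_cost = digitization_cost(7_000_000)    # 70,000,000 dollars
harvard_pilot_cost = digitization_cost(40_000)  # 400,000 dollars
```

At that rate, seven million volumes would cost about $70M, well above the NSF and Chinese Ministry of Education funding listed for the Million Book Project.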
Worries
Duplication
● “De-duplication is NOT part of the [Google Print™] process. NOTE Stanford is interested in having multiple copies of the same materials across various partners.” [6]
Million Book Project will use OCLC’s Digital Registry as soon as batch loading is available.
Worries
Copyright
● “Google will be responsible for determining what’s in copyright.” [7]
● “A team is working on copyright issues but, in the meantime, Google is treating [copyright] conservatively.” [8]
Printing
● “Google will disable printing for out-of-copyright books.” [9]
More Worries Google Print™
Rory Litwin, “On Google’s Monetization of Libraries” [10]
1. Privacy [cookies]
2. Introduction of commercial bias
3. Questions about democratization and equity of access
4. Disintermediation issues
5. Decontextualization of knowledge
6. Closing of the information commons
More Worries Million Book Project
1. Getting it done
2. Sustainability
3. Cohesion of content
4. Usefulness
Google Scholar™ Beta
Reviewed by Péter’s Digital Reference Shelf [11], Anthony Tomasic, and reference librarians at Carnegie Mellon and Penn State:
● Not as good as Citebase, Research Index, RePEc/LogEc (Péter’s)
● Not as good as CiteSeer (Tomasic)
● Not as suitable as CiteSeer (Lesk)
● Not as good as Google press releases indicate (St. Clair)
Google Scholar™ Beta
What: [12]
● Offers free access to bibliographic records and some abstracts
● May lead to full text if the university library subscribes or if free-to-read
● May lead to a document delivery company
● Does not penetrate the invisible Web
● Has significantly enlarged the scope by crawling additional publishers, preprint and reprint servers
● Competes with other aggregators, such as SFX
Google Scholar™ Beta
What:
● Meets the needs of students looking for a different kind of material, and targets advertising to them
● It is easy for a human to identify a scholarly article, but it is a challenge for a machine (Tomasic)
Additional Challenges for a Better Scholarly Search Engine [13]
● Exploit highly structured and tagged web pages with rich metadata from scholarly publishers
● Create field-specific indexes for many distinct data elements
● Offer advanced navigation with pull-down menus for limited search by document type, publisher, publication year, journal
● Consolidate cited references
● Collect information from all relevant materials
● Develop utilities to help libraries find all materials subscribed to, not just one path
Thank you
Gloriana St. Clair
Dean of University Libraries
Carnegie Mellon
[email protected] or 412-268-2447
If you would like an electronic copy of this talk, contact Cindy Carroll, [email protected]
Endnotes
1. Litwin, Rory. “On Google’s Monetization of Libraries.” Library Juice 7,26 (December 17, 2004). Available: http://www.libr.org/Juice/issues/vol7/LJ_7.26.html#3.
2. Wilkin, John. Quoted in “Google to Scan Books from Major Libraries.” MSNBC Tech News & Reviews. Available: http://www.msnbc.msn.com/id/6709342.
3. University of Michigan (Nancy Connell). “Google/U-M Project Opens the Way to Universal Access to Information.” University of Michigan News Service (December 14, 2004). Available: http://www.umich.edu/news/?Releases/2004/Dec04/library/index.
4. University of Michigan. “Google/U-M Project Questions and Answers.” The University Record Online (January 7, 2005). Available: http://www.umich.edu/~urecord/0405/Dec13_04/lib_qa.shtml.
5. Misseli. “The Google Deal (Down on the Farm).” Message posted by a Stanford staff member to Confessions of a Mad Librarian. Available: http://edwards.orcas.net/~misseli/blog/archives/000222.html.
6. Ibid.
7. Ibid.
8. Adam Smith, Senior Business and Product Manager for Google Print and Google Scholar, speaking informally with the ALA Electronic Text Centers Discussion Group. American Library Association Mid-Winter Conference (January 15, 2005).
9. Price, Gary. “Google Partners with Oxford, Harvard & Others to Digitize Libraries.” Search Engine Watch (December 14, 2004). Available: http://searchenginewatch.com/searchday/article.php/3447411.
10. Litwin.
11. Péter’s Digital Reference Shelf. “Google Scholar Beta.” (December 2004). Available: http://www.galegroup.com/servlet/HTMLFileServlet?imprint=9999&region=7&fileName=reference/archive/200412/googlescholar.html.
12. Ibid.
13. Ibid.