Digital Libraries: History, Technology, R&D
Edward A. Fox
Professor, Computer Science, Virginia Tech
Blacksburg, VA 24061 USA
[email protected] http://fox.cs.vt.edu
6 Jan. 2014
Acknowledgements
Dr. Mazen Hasna, VP and Chief Academic Officer, Qatar University
Dr. Rashid Alammari, Dean, College of Engineering, Qatar University
Dr. Moumen Hasnah, Director of Academic Research, Qatar University
Dr. Claudia Lux, Qatar National Library Director
Dr. Imad Bachir, Qatar University Library Director
Dr. Munir Tag, Acting Director Technical, ICT Program Manager (QNRF)
Ms. Krishna Roy Chowdhury, Associate Director for Library IT, Qatar National Library
Prof. Sebti Foufou, Head of Department of Computer Science and Engineering, Qatar University
Additional Thanks
Qscience (providing collection):
o Christopher J. Leonard, Editorial Director
o Paul Coyne, CTO
US National Science Foundation (recent and current grants to Fox):
o IIS-1319578
o IIS-0916733
o DUE-0840719
o OCI-1032677
o plus those to PSU, TAMU
Outline
Acknowledgments
Introduction
History
Technology
Research
Development
Summary and Discussion
Introduction
Reasons to be here
o Interested
o Find what to do with your content
o Find how to help your user community
help satisfy info needs of users (societies)
provide info services (scenarios)
organize info in usable ways (structures)
present info in usable ways (spaces)
communicate info with users (streams)
Metadata (as in library catalogs) as well as content
Sets of collections, rather than the Web as a whole
o Provided by a curator (e.g., publisher, museum)
o Provided by user submissions
o Or collected by focused ‘crawling’
Tailored services, rather than the same for everyone
o Browsing using categories, preserving, adding value
o Based on studying user requirements, e.g., chemists
Working with entities, rather than just words
o Citations, tables, figures, names, chemical formulae
o Using knowledge bases, machine learning, artificial intelligence
History Overview
1991, esp. from Information Retrieval
Connecting computer, library, and information science communities
NSF DL Initiative 1 in 1994 included funding for Stanford, where Google was prototyped
International conferences in the Americas (JCDL), as well as Europe (TPDL, by DELOS), Asia (ICADL)
Publishers: ACM, …
DOIs, (Institutional) Repositories
Spinoffs: content & courseware management systems
Recently including (linked) data
www.nsdl.org
Institutional Repositories
“Institutional repositories are digital collections that capture and preserve the intellectual output of a single university or a multiple institution community of colleges and universities.”
Crow, R. “Institutional repository checklist and resource guide”, SPARC, Washington, D.C., USA
www.arl.org/sparc/IR/IR_Guide_v1.pdf
NDLTD: www.ndltd.org
Networked Digital Library of Theses and Dissertations (NDLTD)
Vision: Every thesis and dissertation in the world is:
o Devised to take advantage of the most helpful electronic publishing methods
o Shared globally and easily found
o Supported by a suite of digital library services to aid authors, researchers, learners, universities
o Preserved and migrated permanently
Human tragedies that result from man-made and natural events affect people and communities significantly.
During and after a tragic event, there is a series of needs that have to be addressed.
o Compounded by communication failures and a confusing plethora of data and information
CTRnet (Crisis, Tragedy & Recovery Net)
Word Clouds of Japan Earthquake and Libya Revolution (using tweets)
[Figure: word clouds for the Libya Revolution and the Japan Earthquake, Tsunami Disaster; updated every 10 minutes]
CTR stakeholders
CINET: Network Science Middleware
Netviz: a course project that aims to develop a visualization component for CINET, which contains large network graphs. The visualization service will get networks from CINET, convert them from Galib to GEXF format, then visualize the graphs using Gephi.
CINET network displayed using Gephi
Web Archiving
Introduction: Web archiving is the process of gathering up data recorded on the World Wide Web, storing it, ensuring the data is preserved in an archive, and making the collected data available for future research.
The Internet Archive and several national libraries initiated Web archiving practices in 1996.
A Web crawler starts with a list of URLs to visit, called the seeds.
On those pages, it identifies all the hyperlinks, adds them to the list of URLs to visit, and recursively visits the pages pointed to, according to a set of policies.
It prioritizes its downloads, since some pages change often.
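The crawl loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code of any production crawler: the in-memory `WEB` dictionary stands in for real HTTP fetching and HTML parsing, and the names `crawl`, `max_pages`, and `allowed_prefix` are hypothetical, introduced only for this example. The single policy shown is a URL-prefix restriction.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> hyperlinks found on that page.
# A real crawler would fetch each page over HTTP and parse its HTML instead.
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/", "http://a.example/2"],
    "http://a.example/2": [],
    "http://b.example/": ["http://b.example/x"],
    "http://b.example/x": [],
}


def crawl(seeds, max_pages=100, allowed_prefix=""):
    """Breadth-first crawl from the seeds; returns URLs in visit order."""
    frontier = deque(seeds)   # the list of URLs still to visit
    seen = set(seeds)         # never queue the same URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in WEB.get(url, []):
            # Policy: stay within the allowed prefix and skip known URLs.
            if link.startswith(allowed_prefix) and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    print(crawl(["http://a.example/"]))
    print(crawl(["http://a.example/"], allowed_prefix="http://a.example/"))
```

Replacing the first-in, first-out `deque` with a priority queue keyed on, say, how often a page has changed would be one way to express the prioritization mentioned above.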
Focused Crawlers
For a particular topic or event, to build a Web collection focused in that area
Start with URLs of interest, viewed as seeds to grow from
Expand in a ‘smart’ way to get all and only what is relevant
Use information retrieval / artificial intelligence / machine learning
o Require ‘knowledge bases’ and/or human training examples
Nevertheless, there is a tradeoff between the resulting
o Recall (i.e., coverage of what is out there)
o Precision (i.e., freedom from noise in what is collected)
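The recall/precision tradeoff can be made concrete with a toy example. The sketch below assumes a crude relevance model (the fraction of topic terms present in a page) and a hand-labelled set of relevant pages; the page texts, `TOPIC`, `RELEVANT`, and the function names are all hypothetical and chosen only for illustration. A real focused crawler would use a trained classifier or knowledge base instead.

```python
def precision_recall(collected, relevant):
    """Return (precision, recall) of a collected set against the relevant set."""
    collected, relevant = set(collected), set(relevant)
    hits = len(collected & relevant)
    precision = hits / len(collected) if collected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def topical_score(text, topic_terms):
    """Fraction of topic terms that occur in the page text."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)


# Hypothetical page texts and hand-labelled relevant pages.
PAGES = {
    "p1": "earthquake tsunami japan relief",
    "p2": "earthquake damage report",
    "p3": "football results japan",
    "p4": "tsunami warning issued",
    "p5": "celebrity news",
}
RELEVANT = {"p1", "p2", "p4"}
TOPIC = ["earthquake", "tsunami", "japan"]


def collect(threshold):
    """Keep every page whose topical score reaches the threshold."""
    return {p for p, text in PAGES.items() if topical_score(text, TOPIC) >= threshold}


if __name__ == "__main__":
    for threshold in (0.3, 0.5):
        print(threshold, precision_recall(collect(threshold), RELEVANT))
```

With the lenient threshold (0.3) every relevant page is collected but one off-topic page slips in, so recall is 1.0 and precision is 0.75. With the strict threshold (0.5) only one page is kept, so precision is 1.0 but recall drops to one third.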
SeerSuite Instantiations
CiteSeerx http://citeseerx.ist.psu.edu A scientific literature digital library and search engine
ChemXSeer http://chemxseer.ist.psu.edu Portal for researchers in environmental chemistry integrating the scientific literature with experimental, analytical, and simulation results and tools
ArchSeer http://archseer.ist.psu.edu/ Archeology literature
TableSeer ANY fields with tables
CiteSeerX
http://citeseerx.ist.psu.edu
o 3 M documents
o millions of files
o 60 M citations
o 3 to 6 M authors
o 2 to 4 M hits per day
o 100K documents added monthly
o 800K individual users
o several Tbytes
CiteSeerX crawls researcher homepages on the web for scholarly papers, formerly in computer science
o Converts PDF to text
o Automatically extracts OAI metadata and other data
o Automatic citation indexing, links to cited documents, creation of document page, author disambiguation
o Software is open source and can be used to build other such tools
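As an illustration of what citation indexing produces, the sketch below inverts "cites" links into a "cited by" index and counts citations. It assumes the reference lists have already been extracted and resolved to document identifiers; the `REFERENCES` data and function names are hypothetical, and this is not CiteSeerX's actual implementation.

```python
from collections import defaultdict

# Hypothetical parsed records: document id -> ids of the documents it cites.
REFERENCES = {
    "doc_a": ["doc_c", "doc_d"],
    "doc_b": ["doc_c"],
    "doc_c": ["doc_d"],
    "doc_d": [],
}


def build_citation_index(references):
    """Invert 'cites' links into a 'cited by' index."""
    cited_by = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            cited_by[cited].add(citing)
    return cited_by


def citation_counts(references):
    """Number of distinct citing documents for every known document."""
    index = build_citation_index(references)
    return {doc: len(index.get(doc, ())) for doc in references}


if __name__ == "__main__":
    print(citation_counts(REFERENCES))
```

The hard parts in practice are the steps this sketch takes for granted: parsing reference strings out of PDF text and deciding which references point to the same document.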
SeerSuite
Toolkit used to build search engines and digital libraries
o CiteSeerX, MyCiteSeerX, ChemXSeer, ArchSeer, AlgoSeer, AckSeer, BizSeer, CSSeer, CollabSeer, RefSeer, GrantSeer, SeerSeer, YouSeer, etc.
Built on commercial grade open source tools (Solr/Lucene)
Penn State expertise: automated specialized metadata extraction
Supports research in
o Indexing and search
o Data mining & structures
o Information and knowledge extraction
o Social networks: Name/entity disambiguation
o Scientometrics/infometrics
o Systems engineering
o User interface design (HCI = human-computer interaction)
o Software engineering and management
ChemXSeer Highlights
Portal for academic researchers in chemistry which integrates the scientific literature with experimental, analytical and simulation results and tools
Provides unique metadata extraction, indexing and searching pertinent to the chemical literature by using heuristics combined with machine learning
o Chemical formulae and names
o Tables
o Figures
o Publication functions as in CiteSeerX
o Expert and expertise search
After extraction, data is stored and is accessible to users as XML through an API
Hybrid repository: serves as a federated information interoperational system
o Scientific papers crawled and indexed from the web
o User submitted papers and datasets (e.g. Excel worksheets, Gaussian and CHARMM toolkit outputs)
o Scientific documents and metadata from publishers, web or archives
Access control for proprietary provided content and user-submitted experiment data
Takes advantage of in-house open source projects such as CiteSeerX/SeerSuite.
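To give a feel for the heuristic side of chemical formula extraction, here is a minimal sketch based on a regular expression plus an element-symbol check. It is an assumption-laden toy, not ChemXSeer's method: `ELEMENTS` is a deliberately small subset of the periodic table, and the names `find_formulae` and `parse_formula` are hypothetical.

```python
import re

# A small subset of element symbols, for illustration only;
# a real system would use the full periodic table.
ELEMENTS = {"H", "C", "N", "O", "Na", "Cl", "S", "Ca", "Fe"}

# Candidate token: two or more "symbol + optional count" parts, e.g. H2O, NaCl.
TOKEN = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")
PART = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(token):
    """Return {element: count} if the token uses only known elements, else None."""
    counts = {}
    for symbol, digits in PART.findall(token):
        if symbol not in ELEMENTS:
            return None
        counts[symbol] = counts.get(symbol, 0) + int(digits or 1)
    return counts


def find_formulae(text):
    """Return (token, composition) pairs for candidate formulae in the text."""
    found = []
    for match in TOKEN.finditer(text):
        composition = parse_formula(match.group())
        if composition is not None:
            found.append((match.group(), composition))
    return found


if __name__ == "__main__":
    print(find_formulae("Dissolve NaCl in H2O, then add CH3COOH."))
```

A pattern like this still accepts ordinary capitalized tokens such as "NO" or "CO" when they are not meant as formulae, which is one reason surface heuristics alone are not enough and context-aware learning is used alongside them.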
Example Formula Search
Computers, software, launching infrastructure at:
o QU: powerful server, now crawling + ready to help any group interested in curating a collection
o VT, QNL (postdoc), QCRI (Prof. Mitra), …
Adapt to disciplines, interesting parts of documents
Adapt to each collection
o Develop knowledge base and heuristics for the collection
o Change document parser
o Change database to match what occurs
o Change extractors: document -> database
Arabic -- VT
Handle Arabic text documents
o Obtain a suitable category/classification system
o Have people provide ‘training set’
o Use machine learning to automatically classify future Arabic text documents
Support cross-language information retrieval
o Arabic question against English documents
o English question against Arabic documents
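The classification steps listed above (category system, human-provided training set, automatic classification of new documents) can be sketched with a small multinomial Naive Bayes classifier. This is one possible technique chosen for illustration, not necessarily the one used in the project. The training texts, the labels `sports` and `economy`, and the function names are hypothetical, and tokens are split on whitespace only, with none of the normalization or stemming that real Arabic text would need.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns the counts needed to classify."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of training documents
    vocabulary = set()
    for text, label in examples:
        tokens = text.split()            # whitespace tokens only
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Most probable label under multinomial Naive Bayes with add-one smoothing."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in text.split():
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labelled training set.
TRAINING = [
    ("كرة القدم مباراة فريق", "sports"),
    ("فريق لاعب مباراة", "sports"),
    ("اقتصاد سوق بنك", "economy"),
    ("بنك استثمار سوق", "economy"),
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(classify("مباراة فريق", model))
    print(classify("سوق بنك", model))
```

In practice the training set would contain many documents per category, and the quality of the category system and of Arabic-specific preprocessing would matter more than the choice of classifier.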
Arabic Handwriting -- QU
Images of historic documents
Arabic text extracted
Mapping from a part of the text to the corresponding part of the image
Special tools for
o Those processing the original documents
o Those doing research with the collection
Will allow work on non-textual collections too, e.g., museum images, set of photos for teaching architecture
Accessible Collections in Qatar -- QNL
What collections have the highest priority?
What special handling is needed for each class, for each subclass of collection type?
How do DLs best fit into the activities of the National Library?
Can .qa be fully archived for Wayback Machine use?
o Librarians and libraries in Qatar
o Researchers and academics
o Government organizations
o Non-Governmental organizations (such as http://www.fsd.org.qa/)
Secondary:
o University / School Students
o Teachers / Faculty
o Managers
o Qatari citizens
o Other stakeholders
http://elisq.qu.edu.qa/
Project Objectives/Aims
A. Research and prototype digital library systems and infrastructure for Qatar, focusing initially on Qatari information related to government and scholarly activities.
Leverage the crawling engine from Penn State's SeerSuite software infrastructure, and extend it beyond its current focus on English to support Arabic-English collections, and to cover a broad range of scholarly disciplines, and all types of government information.
ELISQ Project (1 of 2)
Project Objectives/Aims (continued)
B. Research and build the digital library community in Qatar, supporting digital library use, services, collection development, tailored systems, and advancing toward a Knowledge Society.
Study scholarly activities, and engage in community building in Qatar, so DLs can be tailored to specific domains and to the unique needs of Qatar. Through workshops, a consulting center at the proposed Institute, and collaborative efforts with libraries and museums in Qatar, we will identify particular needs and uses, and tailor collections, systems, and services, to lead toward the Qatari Knowledge Society.
The need to preserve cultural and historical heritage =>
o Collections of fragile and precious artifacts =>
o Libraries, museums, and archives developing digital collections =>
o Users from all over the world accessing and studying
A one stop search of:
o Information about Qatar
o Information to preserve the culture of Qatar
Deep indexing, analysis, and retrieval of:
o Resources, reports, statistics, and other types of information
o Information in the Arabic language as well as in English
ELISQ Content
Metadata, data, and many types of documents (including full text)
Qatari resources that first appeared in digital form (‘born’ digital)
At a later stage the project will include:
o Digital versions of material already existing in print
o Multimedia (image, audio, video) forms
Free and open as well as content with limited access
ELISQ Focus
Community in Qatar
o Identify interested stakeholders, to tailor to needs
o Train next generation of digital librarians, archivists, and curators
o Partners helping with additional collection development
Advanced Technology for Enhanced Access
o “Low hanging fruit” by crawling Qatar-related Web
o Improved analysis (citations, tables, chemicals, …)
o Support for both Arabic and English