20140106 qu seminar

Digital Libraries: History, Technology, R&D

Edward A. Fox Professor, Computer Science, Virginia Tech

Blacksburg, VA 24061 USA [email protected] http://fox.cs.vt.edu

6 Jan. 2014 1

Outline
• Acknowledgments
• Introduction
• History
• Technology
• Research
• Development
• Summary and Discussion

HTTP://WWW.QU.EDU.QA/

HTTP://WWW.TAMU.EDU/ HTTP://WWW.PSU.EDU/ HTTP://WWW.VT.EDU/

Funding provided through the ELISQ project: Electronic Library Institute - SeerQ

Sponsored by Qatar University & Qatar National Library

HTTP://qnl.qa

ELISQ Project Team

Qatar University, Qatar: Mohammed Samaka (Ph.D., Co-Lead PI), Sumaya Ali S A Al-Maadeed (Ph.D., PI), Myrna Tabet, Asad Nafees, Tahseena Moideen

Virginia Tech, USA: Edward Fox (Ph.D., Lead-PI), Tarek Kanan

Penn. State University, USA: C. Lee Giles (Ph.D., PI), Sagnik Ray Choudhury

Texas A&M, USA: Richard Furuta (Ph.D., PI), Hamed Alhoori

Qatar National Library, Qatar: Claudia Lux (PI), Krishna Roy Chowdhury, Postdoc - TBA

Consultants: John Impagliazzo (Ph.D., Key Investigator), Susan Lukesh (Ph.D.), Carole Thompson

This project was made possible by NPRP Grant # 4-029-1-007 from the Qatar National Research Fund (a member of Qatar Foundation).

Acknowledgements
• Dr. Mazen Hasna, VP and Chief Academic Officer, Qatar University
• Dr. Rashid Alammari, Dean, College of Engineering, Qatar University
• Dr. Moumen Hasnah, Director of Academic Research, Qatar University
• Dr. Claudia Lux, Qatar National Library Director
• Dr. Imad Bachir, Qatar University Library Director
• Dr. Munir Tag, Acting Director Technical, ICT Program Manager (QNRF)
• Ms. Krishna Roy Chowdhury, Associate Director for Library IT, Qatar National Library
• Prof. Sebti Foufou, Head of Department of Computer Science and Engineering, Qatar University

Additional Thanks

Qscience – providing collection: Christopher J. Leonard, Editorial Director; Paul Coyne, CTO

US National Science Foundation (recent and current grants to Fox):
• IIS-1319578
• IIS-0916733
• DUE-0840719
• OCI-1032677
• plus those to PSU, TAMU

Introduction
• Reasons to be here
  o Interested
  o Find what to do with your content
  o Find how to help your user community
• http://www.morganclaypool.com/toc/icr/1/1
  1. DL Introduction, 5S framework (2012)
  2. DL Quality, Integration (2013)
  3. DL Technologies (in press)
  4. DL Applications (in press)

DLs Shorten the Chain to

[Figure labels: Author, Reader, Digital Library, Editor, Reviewer, Teacher, Learner, Librarian]

Digital Library Content

[Figure: content types]
• Text documents: articles, reports, books
• Video, audio: speech, music
• Geographic information: (aerial) photos
• Software, programs: models, simulations
• Bio information: genome (human, animal, plant)
• Images and graphics: 2D, 3D, VR, CAT

Content Based Information Retrieval


Digital Library Reference Model 1.0 p. 30 of 234

Informal 5S DL Definitions

DLs are complex systems that:
• help satisfy info needs of users (societies)
• provide info services (scenarios)
• organize info in usable ways (structures)
• present info in usable ways (spaces)
• communicate info with users (streams)

Information Life Cycle

[Figure: information life cycle. Stage labels: Authoring, Modifying; Organizing, Indexing; Storing, Retrieving; Distributing, Networking; Retention / Mining; Accessing, Filtering; Using, Creating]

Infrastructure Services
• Repository-Building, Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
• Repository-Building, Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
• Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)

Information Satisfaction Services
• Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing

SeerSuite is Not Google
• Metadata (as in library catalogs) as well as content
• Sets of collections, rather than the Web as a whole
  o Provided by a curator (e.g., publisher, museum)
  o Provided by user submissions
  o Or collected by focused ‘crawling’
• Tailored services, rather than the same for everyone
  o Browsing using categories, preserving, adding value
  o Based on studying user requirements, e.g., chemists
• Working with entities, rather than just words
  o Citations, tables, figures, names, chemical formulas
  o Using knowledge bases, machine learning, artificial intelligence

History Overview
• Since 1991, esp. from Information Retrieval
• Connecting computer, library, and information science communities
• NSF DL Initiative 1 in 1994 included funding for Stanford, where Google was prototyped
• International conferences in the Americas (JCDL), as well as Europe (TPDL, by DELOS), Asia (ICADL)
• Publishers: ACM, …
• DOIs, (Institutional) Repositories
• Spinoffs: content & courseware management systems
• Recently including (linked) data

www.nsdl.org


Institutional Repositories
• “Institutional repositories are digital collections that capture and preserve the intellectual output of a single university or a multiple institution community of colleges and universities.”
  o Crow, R. “Institutional repository checklist and resource guide”, SPARC, Washington, D.C., USA
  o www.arl.org/sparc/IR/IR_Guide_v1.pdf

NDLTD: www.ndltd.org
• Networked Digital Library of Theses and Dissertations (NDLTD)
• Vision: Every thesis and dissertation in the world is:
  o Devised to take advantage of the most helpful electronic publishing methods
  o Shared globally and easily found
  o Supported by a suite of digital library services to aid authors, researchers, learners, universities
  o Preserved and migrated permanently

Crisis, Tragedy, and Recovery (CTR) Network / Integrated Digital Event Archive & Library (IDEAL)
• Human tragedies that result from man-made and natural events affect humans and communities significantly.
• During and after a tragic event, a series of needs has to be addressed.
  o Compounded by communication failures and a confusing plethora of data and information

• CTRnet (Crisis, Tragedy & Recovery Net)
  o Disaster Locations

• CTRnet (Crisis, Tragedy & Recovery Net)
  o Word Clouds of Japan Earthquake and Libya Revolution (using tweets)
[Figure: word clouds for "Libya Revolution" and "Japan Earthquake, Tsunami Disaster", updated every 10 minutes]

CTR stakeholders

• CINET: Network Science Middleware
• Netviz: a course project to develop a visualization component for CINET, which contains large network graphs. The visualization service will get networks from CINET, convert them from Galib to GEXF format, then visualize the graphs using Gephi.
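
The conversion step in the Netviz pipeline can be sketched as follows. This is only an illustration under stated assumptions: the Galib input format is not described in these slides, so the sketch assumes the network has already been read into a plain list of (source, target) pairs, and the function name `edges_to_gexf` is hypothetical. It writes a minimal undirected GEXF 1.2 document of the kind Gephi can open.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def edges_to_gexf(edges):
    """Build a minimal undirected GEXF 1.2 document from (source, target) pairs.

    Assumption: the edge list stands in for whatever CINET/Galib exports.
    """
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        root, "graph", {"mode": "static", "defaultedgetype": "undirected"}
    )
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    # Emit each node once, in order of first appearance.
    seen = set()
    for source, target in edges:
        for name in (source, target):
            if name not in seen:
                seen.add(name)
                ET.SubElement(nodes_el, "node", {"id": name, "label": name})

    # Emit one edge element per input pair.
    for i, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el, "edge", {"id": str(i), "source": source, "target": target}
        )
    return ET.tostring(root, encoding="unicode")


sample_edges = [("a", "b"), ("b", "c"), ("a", "c")]
gexf_text = edges_to_gexf(sample_edges)
```

Saving `gexf_text` to a file with a `.gexf` extension should give something Gephi can load; a real service would also carry over node and edge attributes.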

• CINET: Network Science Middleware
[Figure: CINET network displayed using Gephi]

Web Archiving
• Introduction: Web archiving is the process of
  o gathering up data recorded on the World Wide Web,
  o storing it,
  o ensuring the data is preserved in an archive, and
  o making the collected data available for future research.
• The Internet Archive and several national libraries initiated Web archiving practices in 1996.

Crawler (Heritrix) (for search engines & Web archives)
• A Web crawler starts with a list of URLs to visit, called the seeds.
• On those pages, it identifies all the hyperlinks,
  o adds them to the list of URLs to visit, and
  o recursively visits the pages pointed to,
  o according to a set of policies.
• It prioritizes its downloads, since some pages change often.
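
The crawl loop described above can be sketched in a few lines. This is not Heritrix code; it is a toy illustration in which an in-memory dictionary of hypothetical URLs replaces real HTTP fetching, the only policy is a page limit, and a made-up `CHANGE_RATE` table stands in for "some pages change often".

```python
import heapq


def crawl(seeds, fetch_links, priority, max_pages=100):
    """Visit pages starting from the seeds, highest priority first."""
    # heapq is a min-heap, so priorities are negated to pop the largest first.
    frontier = [(-priority(url), url) for url in seeds]
    heapq.heapify(frontier)
    queued = set(seeds)   # URLs already placed on the frontier
    visited = []          # URLs in the order they were downloaded
    while frontier and len(visited) < max_pages:   # policy: page budget
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):              # hyperlinks on the page
            if link not in queued:
                queued.add(link)
                heapq.heappush(frontier, (-priority(link), link))
    return visited


# Toy in-memory "Web" (hypothetical URLs) standing in for real fetches.
TOY_WEB = {
    "http://example.org/": ["http://example.org/news", "http://example.org/about"],
    "http://example.org/news": ["http://example.org/", "http://example.org/news/1"],
    "http://example.org/about": [],
    "http://example.org/news/1": [],
}
# Assumed change rates: pages that change more often are fetched sooner.
CHANGE_RATE = {"http://example.org/news": 5, "http://example.org/news/1": 1}

order = crawl(
    seeds=["http://example.org/"],
    fetch_links=lambda url: TOY_WEB.get(url, []),
    priority=lambda url: CHANGE_RATE.get(url, 0),
)
```

A production crawler adds politeness delays, robots.txt handling, URL canonicalization, and revisit scheduling, all of which are left out here.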


Focused Crawlers
• For a particular topic or event
  o to build a Web collection focused in that area
• Start with URLs of interest, viewed as seeds to grow from
  o Expand in a ‘smart’ way to get all and only what is relevant
• Use information retrieval / artificial intelligence / machine learning
  o Require ‘knowledge bases’ and/or human training examples
• Nevertheless, there is a tradeoff between the resulting
  o Recall (i.e., coverage of what is out there)
  o Precision (i.e., freedom from noise in what is collected)
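
The recall/precision tradeoff can be made concrete with a small calculation. The page identifiers below are invented for illustration: a cautious crawl collects little noise but misses relevant pages, while a broad crawl finds everything relevant at the cost of noise.

```python
def precision_recall(collected, relevant):
    """Return (precision, recall) for a collected set against the relevant set."""
    collected, relevant = set(collected), set(relevant)
    hits = collected & relevant
    precision = len(hits) / len(collected) if collected else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page ids: p* are relevant to the topic, n* are noise.
relevant_pages = {"p1", "p2", "p3", "p4"}
narrow_crawl = {"p1", "p2"}
broad_crawl = {"p1", "p2", "p3", "p4", "n1", "n2", "n3", "n4"}

narrow_scores = precision_recall(narrow_crawl, relevant_pages)  # (1.0, 0.5)
broad_scores = precision_recall(broad_crawl, relevant_pages)    # (0.5, 1.0)
```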


SeerSuite Instantiations
• CiteSeerX
  o http://citeseerx.ist.psu.edu
  o A scientific literature digital library and search engine
• ChemXSeer
  o http://chemxseer.ist.psu.edu
  o Portal for researchers in environmental chemistry integrating the scientific literature with experimental, analytical, and simulation results and tools
• ArchSeer
  o http://archseer.ist.psu.edu/
  o Archeology literature
• TableSeer
  o ANY fields with tables

CiteSeerX: http://citeseerx.ist.psu.edu
• 3 M documents
• Millions of files
• 60 M citations
• 3 to 6 M authors
• 2 to 4 M hits per day
• 100K documents added monthly
• 800K individual users
• Several Tbytes
• CiteSeerX crawls researcher homepages on the web for scholarly papers, formerly in computer science
  o Converts PDF to text
  o Automatically extracts OAI metadata and other data
  o Automatic citation indexing, links to cited documents, creation of document page, author disambiguation
  o Software is open source, so it can be used to build other such tools
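
The idea behind citation indexing can be sketched as inverting a "cites" relation into a "cited by" relation. This is not CiteSeerX code; it assumes the hard parts (PDF-to-text conversion, reference parsing, and matching references to documents) have already produced a mapping from each document id to the ids it cites, and all ids here are made up.

```python
from collections import defaultdict


def build_citation_index(references):
    """Invert {citing_doc: [cited_doc, ...]} into {cited_doc: {citing_doc, ...}}."""
    cited_by = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            cited_by[cited].add(citing)
    return cited_by


# Hypothetical, already-resolved reference lists.
references = {
    "paperA": ["paperC"],
    "paperB": ["paperA", "paperC"],
    "paperC": [],
}
index = build_citation_index(references)
# Citation count per document, usable for ranking or for a document page.
citation_counts = {doc: len(index[doc]) for doc in references}
```

With such an index, a document page can list "cited by" links, and the counts feed simple ranking or scientometric measures.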


SeerSuite
• Tool kit used to build search engines and digital libraries
  o CiteSeerX, MyCiteSeerX, ChemXSeer, ArchSeer, AlgoSeer, AckSeer, BizSeer, CSSeer, CollabSeer, RefSeer, GrantSeer, SeerSeer, YouSeer, etc.
  o Built on commercial grade open source tools (Solr/Lucene)
  o Penn State expertise – automated specialized metadata extraction
• Supports research in
  o Indexing and search
  o Data mining & structures
  o Information and knowledge extraction
  o Social networks: Name/entity disambiguation
  o Scientometrics/informetrics
  o Systems engineering
  o User interface design (HCI = human-computer interaction)
  o Software engineering and management

ChemXSeer Highlights
• Portal for academic researchers in chemistry which integrates the scientific literature with experimental, analytical and simulation results and tools
• Provides unique metadata extraction, indexing and searching pertinent to the chemical literature by using heuristics combined with machine learning
  o Chemical formulae and names
  o Tables
  o Figures
  o Publication functions as in CiteSeerX
  o Expert and expertise search
• After extraction, data is stored as API-accessible XML for users.
• Hybrid repository: serves as a federated information interoperational system
  o Scientific papers crawled and indexed from the web
  o User-submitted papers and datasets (e.g., Excel worksheets, Gaussian and CHARMM toolkit outputs)
  o Scientific documents and metadata from publishers, web or archives
• Access control for proprietary provided content and user-submitted experiment data
• Takes advantage of in-house open source projects such as CiteSeerX/SeerSuite.

Example Formula Search



Users -- TAMU
• Requirements (content, services)
• Practices (scholarly, information seeking)
• Social framework (collaboration, recommendation)
• Interviews, surveys
• Evaluations: usability, benefits

Infrastructure -- PSU
• Computers, software, launching infrastructure at:
  o QU: powerful server, now crawling
  o + ready to help any group interested in curating a collection
  o VT, QNL (postdoc), QCRI (Prof. Mitra), …
• Adapt to disciplines, interesting parts of documents
• Adapt to each collection
  o Develop knowledge base and heuristics for the collection
  o Change document parser
  o Change database to match what occurs
  o Change extractors: document -> database

Arabic -- VT
• Handle Arabic text documents
• Obtain a suitable category/classification system
• Have people provide ‘training set’
• Use machine learning to automatically classify future Arabic text documents
• Support cross-language information retrieval
  o Arabic question against English documents
  o English question against Arabic documents
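
The "training set, then automatic classification" step can be illustrated with a tiny multinomial Naive Bayes classifier. This is a sketch under assumptions, not the project's actual method: the categories, the four training snippets, and the whitespace tokenization are all invented for the example, and real Arabic text would normally need normalization and stemming first.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns per-label word and document counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the most probable label, using add-one (Laplace) smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # class prior
        for word in text.split():
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical human-labeled training set (sports vs. economy).
training_set = [
    ("مباراة كرة القدم", "sports"),
    ("فريق كرة السلة", "sports"),
    ("سعر النفط", "economy"),
    ("سوق الأسهم", "economy"),
]
word_counts, label_counts = train(training_set)
sports_guess = classify("كرة القدم اليوم", word_counts, label_counts)
economy_guess = classify("سعر النفط", word_counts, label_counts)
```

Unseen words such as "اليوم" are handled by the smoothing term rather than causing a zero probability.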


Arabic Handwriting -- QU
• Images of historic documents
• Arabic text extracted
• Mapping from a part of the text to the corresponding part of the image
• Special tools for
  o Those processing the original documents
  o Those doing research with the collection
• Will allow work on non-textual collections too, e.g., museum images, set of photos for teaching architecture

Accessible Collections in Qatar -- QNL
• What collections have the highest priority?
• What special handling is needed for each class, for each subclass of collection type?
• How do DLs best fit into the activities of the National Library?
• Can .qa be fully archived for Wayback Machine use?

DL Curriculum Framework

[Figure: two-semester curriculum framework]

COURSE STRUCTURE
• Semester 1: DL collections: development/creation
• Semester 2: DL services and sustainability

CORE DL TOPICS
• Digitization, Storage, Interchange
• Digital objects, Composites, Packages
• Metadata, Cataloging, Author submission
• Naming, Repositories, Archives
• Spaces (conceptual, geographic, 2/3D, VR)
• Architectures (agents, buses, wrappers/mediators), Interoperability
• Services (searching, linking, browsing, etc.)
• Intellectual property rights mgmt., Privacy, Protection (watermarking)
• Archiving and preservation, Integrity

RELATED TOPICS
• Documents, E-publishing, Markup
• Info. Needs, Relevance, Evaluation, Effectiveness
• Thesauri, Ontologies, Classification, Categorization
• Bibliographic information, Bibliometrics, Citations
• Routing, Filtering, Community filtering
• Search & search strategy, Info seeking behavior, User modeling, Feedback
• Info summarization, Visualization
• Multimedia streams/structures, Capture/representation, Compression/coding
• Content-based analysis, Multimedia indexing
• Multimedia presentation, rendering

Modules
• http://en.wikiversity.org/wiki/Curriculum_on_Digital_Libraries
  o Table 1: Core DL Modules
  o Table 2: Information Retrieval Packages
  o Table 3: Big Data
  o Table 4: Multimedia Software
• Like lesson plans, for a training session or lecture
• Can be used for self-study, refreshers

http://curric.dlib.vt.edu/modDev/modDev.html

ELISQ Audience (Users)
• Primary:
  o Librarians and libraries in Qatar
  o Researchers and academics
  o Government organizations
  o Non-Governmental organizations (such as http://www.fsd.org.qa/)
• Secondary:
  o University / School Students
  o Teachers / Faculty
  o Managers
  o Qatari citizens
  o Other stakeholders

http://elisq.qu.edu.qa/

ELISQ Project (1 of 2)

Project Objectives/Aims

A. Research and prototype digital library systems and infrastructure for Qatar, focusing initially on Qatari information related to government and scholarly activities.

Leverage the crawling engine from Penn State’s SeerSuite software infrastructure, and extend it beyond its current focus on English to support Arabic-English collections, and to cover a broad range of scholarly disciplines, and all types of government information.

ELISQ Project (2 of 2)

Project Objectives/Aims (continued)

B. Research and build the digital library community in Qatar, supporting digital library use, services, collection development, tailored systems, and advancing toward a Knowledge Society.

Study scholarly activities, and engage in community building in Qatar, so DLs can be tailored to specific domains and to the unique needs of Qatar. Through workshops, a consulting center at the proposed Institute, and collaborative efforts with libraries and museums in Qatar, we will identify particular needs and uses, and tailor collections, systems, and services, to lead toward the Qatari Knowledge Society.

Significance to Librarians, Corporations, and Governmental Agencies
• The need to preserve cultural and historical heritage =>
  o Collections of fragile and precious artifacts =>
  o Libraries, museums, and archives developing digital collections =>
  o Users from all over the world accessing and studying
• A one stop search of:
  o Information about Qatar
  o Information to preserve the culture of Qatar
• Deep indexing, analysis, and retrieval of:
  o Resources, reports, statistics, and other types of information
  o Information in the Arabic language as well as in English

ELISQ Content
• Metadata, data, and many types of documents (including full text)
• Qatari resources that first appeared in digital form - ‘born’ digital
• At a later stage the project will include:
  o Digital versions of material already existing in print
  o Multimedia (image, audio, video) forms
• Free and open as well as content with limited access

ELISQ Focus

Community in Qatar
• Identify interested stakeholders, to tailor to needs
• Train next generation of digital librarians, archivists, and curators
• Partners helping with additional collection development

Advanced Technology for Enhanced Access
• “Low hanging fruit” by crawling Qatar-related Web
• Improved analysis (citations, tables, chemicals, …)
• Support for both Arabic and English

Summary (some highlights)
• Introduction to digital libraries: 5S, any content
• History: since 1991, Google, repositories
• Technology: SeerSuite, Heritrix, Solr, HCI
  o Initial collections: Qscience, news, …
• Research: extend SeerSuite; Arabic
  o Adapt other tools for handwriting collection, non-text collections
• Development: consulting center (addressing needs)

Questions for You
• What communities should be served?
• What collections should be made accessible?
• What services are required?
• What are the priorities in the above?
• Can you help us find suitable partners, content owners, curators, user groups?

Questions for Us?
• http://elisq.qu.edu.qa/
• [email protected]
• http://fox.cs.vt.edu