Contemporary Search Technologies - also for Libraries? Clemens Neudecker, KB – 20/04/2011
Dec 05, 2014
Table of contents
Retrieval: Status Quo
New ways of searching
Prototypes & Outlook
Lossau (dlib, 2004)
How to position the library as an information provider in the 21st century?
Search services are critical!
http://www.dlib.org/dlib/june04/lossau/06lossau.html
Library as a “depot”
Collect
Preserve
Library as a “gateway”
New ways of searching and/or browsing
Service infrastructure
User-Generated content
Competition: Internet Search Engines
Simple Search
• By keyword
• Boolean operators
Advanced Search
Facets
Views
Phrases
Meta-Search
Basics
• Crawling
• Indexing
• Searching
• Ranking results
http://nlp.stanford.edu/IR-book/
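The indexing, searching and ranking steps above can be sketched with a toy inverted index. This is a minimal illustration, not how Lucene/Solr works internally: the whitespace tokenizer, the Boolean AND semantics and the term-frequency score are simplifying assumptions, and the three sample documents are invented.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (real engines use proper analyzers)."""
    return text.lower().split()


def build_index(docs):
    """Indexing: map each term to a Counter of {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Searching: Boolean AND over all query terms.
    Ranking: summed term frequency, ties broken by document id."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, Counter()) for term in terms]
    hits = set(postings[0])
    for posting in postings[1:]:
        hits &= set(posting)
    scored = [(sum(p[doc_id] for p in postings), doc_id) for doc_id in hits]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Invented sample documents standing in for crawled content.
docs = {
    "d1": "library search services",
    "d2": "search engines index the web",
    "d3": "library catalogue search and library metadata",
}
index = build_index(docs)
results = search(index, "library search")
```

Here d3 ranks above d1 because "library" occurs twice in it; d2 is excluded because it does not contain both terms.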
Technology
Apache Lucene/Solr (KB: Migration Verity)
http://lucene.apache.org/
http://lucene.apache.org/solr/
SRU = Search/Retrieve via URL
http://www.loc.gov/standards/sru/
CQL = Contextual Query Language
http://www.loc.gov/standards/sru/specs/cql.html
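As a rough sketch of how SRU and CQL fit together: an SRU searchRetrieve request is an ordinary URL whose `query` parameter carries a CQL expression. The base URL below is a placeholder, and the `dc.title` / `dc.creator` index names are assumptions; which indexes a given server supports has to be checked against that server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sru_search_url(base_url, cql_query, start_record=1, max_records=10):
    """Build an SRU 1.2 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": max_records,
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint and assumed index names, for illustration only.
cql = 'dc.title = "library" and dc.creator = "lossau"'
url = sru_search_url("https://sru.example.org/sru", cql)

# Decoding the URL again shows the CQL string survives URL encoding.
decoded = parse_qs(urlsplit(url).query)
```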
Retrieval: Status Quo
Catalogue
Metadata
Catalogue Search
Metadata
Dublin Core (DCMI)
http://dublincore.org/
Z39.50
http://www.loc.gov/z3950/agency/
Metadata Harvesting
Open Archives Initiative: OAI-PMH
http://www.openarchives.org/
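Metadata harvesting via OAI-PMH also works through plain HTTP requests. The sketch below only builds the request URL for the `ListRecords` verb with Dublin Core (`oai_dc`) as the metadata format; the repository address and set name are placeholders, and fetching and parsing the XML response is left out.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def list_records_url(base_url, metadata_prefix="oai_dc",
                     set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    A resumptionToken continues a previous partial response and is sent
    as the only argument besides the verb.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder repository and set name, for illustration only.
first = list_records_url("https://repo.example.org/oai", set_spec="newspapers")
follow_up = list_records_url("https://repo.example.org/oai",
                             resumption_token="token-123")
```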
Linked Data
Authority Data
Named Entities
(Persons, Places, Institutions)
http://viaf.org/
Gazetteers
http://www.world-gazetteer.com/
Other Examples:
LocAuth, PND, NaCo
Persistent Identifier
URN = Uniform Resource Name
NBN = National Bibliography Number
Resolver = Translation into web address
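The resolver idea can be sketched as a lookup from a persistent name to the current web address. The identifier and target URL below are invented, and a real resolver would be a web service backed by a database rather than an in-memory dictionary.

```python
def resolve(urn, table):
    """Translate a URN:NBN into its current web address, or None if unknown."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or parts[1].lower() != "nbn":
        raise ValueError("not a URN:NBN identifier: " + urn)
    # The "urn" and namespace parts are case-insensitive, so normalise them.
    key = "urn:nbn:" + parts[2]
    return table.get(key)


# Hypothetical identifier and location. If the object moves, only this
# table entry changes; the URN cited by users stays the same.
table = {"urn:nbn:xx:demo-1": "https://example.org/objects/1"}

address = resolve("URN:NBN:xx:demo-1", table)
```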
Problems
Correctness of data
Coverage
Formats
Alignment
Multilingualism
What happened since
Google Books
The European Library
Europeana
Wolfram/Watson
What’s next?
The web
The web is not limited to the www!
Data deluge
“Deep web” – not indexed (dynamic) parts
Web of users – currently ~2 billion
Web archiving
The web as a resource
Knowledge Extraction (not the actual data!)
→ Semantic Web
(web of knowledge,
rather than data)
Semantic Web
RDF http://www.w3.org/RDF/
OWL http://www.w3.org/2004/OWL/
SPARQL http://www.w3.org/TR/rdf-sparql-query/
SKOS http://www.w3.org/2004/02/skos/
Ontologies
Ontology = “Model of the World”
Classes
Instances
Properties
Semantic Graphs
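Classes, instances and properties can be written down as subject-predicate-object triples, which together form a semantic graph. The sketch below uses plain Python tuples instead of an RDF library; the `ex:` names are made up for illustration, and `match` only imitates a single SPARQL triple pattern.

```python
# A tiny graph: one class, two instances, one property linking them.
triples = {
    ("ex:Library", "rdf:type", "rdfs:Class"),
    ("ex:Place", "rdf:type", "rdfs:Class"),
    ("ex:KB", "rdf:type", "ex:Library"),          # instance of a class
    ("ex:TheHague", "rdf:type", "ex:Place"),      # instance of a class
    ("ex:KB", "ex:locatedIn", "ex:TheHague"),     # property between instances
}


def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# "Which resources are libraries?"
libraries = match(triples, p="rdf:type", o="ex:Library")
# "What do we know about ex:KB?"
about_kb = match(triples, s="ex:KB")
```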
New resources
Digital libraries (Images + OCR)
Born-digital material
The web
→ Interoperability (STITCH, CATCH)
Full text (OCR)
"... tte->e°n.m.66-..ie k>okke cire-5^ea. ver.è. 6.or ^ ^ ^ °
kiesrellj-oe-ikei^, v-in eeo ^elj-escdapeo ^UOI^, 7
^n>5«--'-/-r. veel8-Iiec-jc ttui5vroll^ v,a 'z » ^ v e . X. «. ^ ^ I» 2 L t. L ^-i ? > " Z Z^
l»v«e».ic. sx ^ ^ , 6en 2 l8c«. Leb. ^ L I L I tZ.
6eo zc> ^pr>!, >«(ZS. 8 O II 0 v ? L W. . L^-L"
. . ^ ... ,. , ^,a «ore Vrienilea ea Lekenaêll zeven dy aeeea ^^
^ LLQ d2i« 4 urea, 18 myoe ttuisvi-ouiv, van Kenoi5, Sis asr 0v?e darlelvk >zetief6e Vscier', ?. L08, op L
«eea vel.^esckspLa ^5^()I>Z verlof. Ke6ed w»cj6zZ reo l2urev, as eev Verval vsn ^evev^drscdceo, ^ ^ ^ "A.
Oevki>i7L«., K0>.^^Q8N()VL^, secZerr z ''Vckev öeclle^ri^ te , jv6evou6er6oru " ^
<Zen Zv ^pri!, 1806. ^x>0lè:ecsr. vsv dyQ!l 92 ^sr^n, ker ^clelvke vzet det Leu^visie vervvzilelc! 'O L ^ ^ ^ '-
".' «eckea mi6ck»z ruim êên uur verlatte ovvorfpieck-z. i>«kl. ^0-6 k»rskter verdeaxSe »Ue ryve iiinöeren en L--»S « > I L^Z
OCR Lexica
Word matching (fuzzy words)
Frequency
Morphology
Historic forms
Inflected forms
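Fuzzy word matching against a lexicon can be sketched with Python's standard `difflib` module. The three-word lexicon and the 0.8 similarity cutoff are assumptions chosen for the example; a production OCR lexicon would also use frequency, morphology and historic spelling variants.

```python
import difflib


def correct(token, lexicon, cutoff=0.8):
    """Return the closest lexicon word for an OCR token, or None if no
    word is similar enough."""
    candidates = difflib.get_close_matches(token.lower(), lexicon,
                                           n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


# Tiny illustrative lexicon; "ttuisvrouw" imitates a typical OCR
# misreading of the initial "H" in "Huisvrouw".
lexicon = ["huisvrouw", "vader", "kinderen"]
suggestion = correct("ttuisvrouw", lexicon)
no_match = correct("xyz", lexicon)
```

A stricter cutoff reduces false corrections at the cost of leaving more damaged words untouched.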
Visibility
“Hidden” - only indexed
Highlighting in image
Full text behind image (PDF)
Parallel/switched mode
User Correction/Annotation
Hidden in index
Image highlighting
Parallel/Switched
Crowdsourcing
Crowdsourcing examples
UIBK Catalogue
NLA Newspapers
http://trove.nla.gov.au/newspaper
Digitalkoot
http://www.digitalkoot.fi/en/splash
Concert
TranscriBentham
http://www.transcribe-bentham.da.ulcc.ac.uk/td/Transcribe_Bentham
UIBK Catalogue
Trove I
Trove II
Digitalkoot
Concert
TranscriBentham
Prototypes
Prototype: FEP
Prototype: Assets
http://virserv.isti.cnr.it:8080/assetsIRService/index
Prototype: Semantic Search
http://eculture.cs.vu.nl/europeana/session/search
Prototype: Waisda
http://waisda.q42.net/, http://blog.waisda.nl/
Prototype: Geospatial Search
Prototype: Image Annotation
http://dme.arcs.ac.at/annotation/
Problem: No Flash in Europeana (A/V content)
Prototype: GeoEuropeana
http://amercader.net/dev/geoeuropeana/
Prototype: Random Image Explorer
http://europeana.fe2.nl/ (Willem Jan Faber, KB)
Solution: Common API
API = Application Programming Interface
Set of descriptions defining how to access an electronic resource/application through a common interface
API
Documented Interface Definition
Machine readable
Public/shared
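One way to picture a common interface: several services expose the same documented `search` method, so a client can combine them without knowing how each one works internally. All class and function names below are hypothetical and do not describe any existing library API.

```python
from typing import List, Protocol


class SearchAPI(Protocol):
    """The shared, documented interface every backend agrees to offer."""

    def search(self, query: str) -> List[str]:
        ...


class CatalogueBackend:
    """Hypothetical backend matching the query against record titles."""

    def __init__(self, titles: List[str]) -> None:
        self.titles = titles

    def search(self, query: str) -> List[str]:
        q = query.lower()
        return [t for t in self.titles if q in t.lower()]


class FullTextBackend:
    """Hypothetical backend matching the query against page text."""

    def __init__(self, pages: dict) -> None:
        self.pages = pages  # {title: full text}

    def search(self, query: str) -> List[str]:
        q = query.lower()
        return [title for title, text in self.pages.items() if q in text.lower()]


def combined_search(backends: List[SearchAPI], query: str) -> List[str]:
    """Query every backend through the common interface and merge the
    hits, keeping first-seen order and dropping duplicates."""
    merged: List[str] = []
    for backend in backends:
        for hit in backend.search(query):
            if hit not in merged:
                merged.append(hit)
    return merged


catalogue = CatalogueBackend(["Newspaper archive", "Map collection"])
fulltext = FullTextBackend({"Map collection": "old newspaper clippings",
                            "Letters": "private correspondence"})
hits = combined_search([catalogue, fulltext], "newspaper")
```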
API Benefits
Data/functionality available through documented, public interfaces
Anybody can use it
Can be integrated in other services/tools
Can be compared, combined, linked
Libraries need not be the actual host