Page 1
Union Catalog and Knowledge Engineering for TELDAP
Keh-Jiann Chen
Principal Investigator, Core Platforms for Digital Contents Project, TELDAP
Research Fellow, Research Center for Information Technology Innovation & Institute of Information Science, Academia Sinica
2012.04.20
Page 2
Outline
• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective
Page 3
Introduction
The integration and management of digital content has become an important issue as the amount of digital content produced by different projects and institutions increases rapidly.
The goal of our project is to optimize the preservation, retrieval, and presentation of digital collections.
Page 4
Outline
• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective
Page 5
What is the union catalog?
• It is a catalog and portal for all digital collections of TELDAP.
• It is an integrated platform for browsing and searching the entire digital contents of TELDAP.
• Metadata provides core descriptions and licensing information for each digital collection.
Page 6
Home Page of Union Catalog
Browsing by topics
Search by keywords
Page 7
Some improved functions for IR
• Keyword suggestion
• Keyword extension
• Recommendation of related collections
Page 8
• Keyword suggestion
Page 9
• Keyword extension
Page 10
Digital Image
Recommendation of related collections
Hyperlink to database
Metadata
Citation
Social networking service
Licensing Information
Page 11
Outline
• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective
Page 12
Metadata models for different types of objects
Archived digital items
• Union catalog metadata model: Dublin Core+
Web sites
• DCCAP (Dublin Core Collections Application Profile)
• Fields for internal use only: Unique Identifier, Format, Evaluation, Cataloging History
Documents
• Document metadata: Dublin Core
Page 13
Metadata for digital items
• Over 4 million digital items and still increasing

Element                 Definition
Title                   A name given to the resource
Creator                 An entity primarily responsible for making the content of the resource
Subject and Keywords    The topic of the content of the resource
Description             An account of the content of the resource
Publisher               An entity responsible for making the resource available
Contributor             An entity responsible for making contributions to the content of the resource
Date                    A date associated with an event in the life cycle of the resource
Resource Type           The nature or genre of the content of the resource
Format                  The physical or digital manifestation of the resource
Resource Identifier     An unambiguous reference to the resource within a given context
Source                  A reference to a resource from which the present resource is derived
Language                A language of the intellectual content of the resource
Relation                A reference to a related resource
Coverage                The extent or scope of the content of the resource
Rights Management       Information about rights held in and over the resource
Page 15
Metadata for websites
• Over 690 websites and still increasing
• Metadata
– DCCAP (Dublin Core Collections Application Profile)
– Combines the standard with our requirements: 19 data fields
Page 16
The Website Homepage Picture
URL, Project Information
Type, Name, Author, Subject, Description, Language, Item Type, Target
Archived Information: URL, time, authorization
Copyright, Purpose, Other Information
Figure: http://digitalarchives.tw
Social networking service
Page 17
Uses of Metadata
• Search collections by matching keywords and features
• Provide basic information about each collection
• Dynamic categorization
• Provide information to compute the similarity or relatedness of two collections
• Extract keywords
Page 18
(1) Chinese Keyword Search
Data at each stage:
• Keyword + (Features)
• Synonyms, hyponyms
• Matched Collections
• Collections + Weights
• Display Results
Processing steps, in order: Keyword Extension, Keyword Matching, Ranking, Filtering
Resources: AAT-Taiwan & TELDAP Thesaurus, Keyword Dictionary
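The flow above can be sketched roughly as follows. This is a minimal Python illustration and not the TELDAP implementation: the toy thesaurus and keyword dictionary, the function names, the 0.5 discount for extended terms, and the filtering threshold are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical toy resources; the real thesaurus and dictionary are far larger.
THESAURUS = {"pottery": ["ceramics", "earthenware"]}  # keyword -> synonyms/hyponyms
KEYWORD_DICTIONARY = {  # keyword -> {collection id: association strength}
    "pottery": {"c1": 0.9, "c2": 0.4},
    "ceramics": {"c2": 0.8, "c3": 0.6},
    "earthenware": {"c3": 0.2},
}


def extend_keyword(keyword):
    """Keyword extension: the keyword itself plus its synonyms and hyponyms."""
    return [keyword] + THESAURUS.get(keyword, [])


def match_collections(terms, original):
    """Keyword matching: accumulate one weight per collection over all terms."""
    weights = defaultdict(float)
    for term in terms:
        factor = 1.0 if term == original else 0.5  # assumed discount for extended terms
        for collection, strength in KEYWORD_DICTIONARY.get(term, {}).items():
            weights[collection] += factor * strength
    return weights


def search(keyword, threshold=0.2):
    """Extension, matching, ranking, then filtering by an assumed threshold."""
    terms = extend_keyword(keyword)
    weights = match_collections(terms, keyword)
    ranked = sorted(weights.items(), key=lambda pair: pair[1], reverse=True)
    return [(collection, weight) for collection, weight in ranked if weight >= threshold]


results = search("pottery")
```

In this sketch the extended terms contribute less than the original keyword, so a collection matched only through a synonym ranks below one matched directly; the slides do not say how the real system weighs extended terms.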
Page 19
English Keyword Search
Data at each stage:
• English Keyword + (Features)
• Translations, Synonyms, Hyponyms
• Matched Collections
• Collections + Weights
• Display Results
Processing steps, in order: Keyword Translation & Extension, Keyword Matching, Ranking, Filtering
Resources: AAT-Taiwan & TELDAP Thesaurus, Keyword Dictionary
Page 20
Ranking Algorithm
Rank Value(item) = W1 * Association(Keyword, item) + W2 * Quality(item)
– Association(Keyword, item) = W1 * Topical Similarity(Topic(keyword), Topic(item)) + W2 * Importance of relation(Keyword, item)
– Quality(item) = W1 * Image quality(item) + W2 * Qualification of provider(item) + W3 * Metadata(item)
• Topical Similarity(Topic(keyword), Topic(item)) = Ontology Distance(Topic(keyword), Topic(item))
• Importance of relation(Keyword, item) = W1 * Keyword-from Value + W2 * Mutual Information(keyword, Topic(item))
• Keyword-from Value = 1 if keyword is contained in title(item); 0.5 if keyword is contained in description(item)
• Mutual Information(keyword, Topic(item)) = P(Keyword, Topic(item)) / {P(Keyword) * P(Topic(item))}
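The ranking formulas can be put together in a short Python sketch. Everything numeric here is an assumption: the slide reuses the names W1 and W2 at every level without giving values, so the weights below are hypothetical and kept separate per formula. The mapping from ontology distance to similarity (1 / (1 + distance)) and the value 0 when the keyword appears in neither title nor description are also assumptions, since the slide does not specify them.

```python
# Hypothetical weights; the slide gives no values and reuses W1, W2 at each level.
WEIGHTS = {
    "rank": (0.7, 0.3),
    "association": (0.5, 0.5),
    "relation": (0.6, 0.4),
    "quality": (0.4, 0.3, 0.3),
}


def keyword_from_value(keyword, title, description):
    """1 if the keyword is in the title, 0.5 if it is in the description."""
    if keyword in title:
        return 1.0
    if keyword in description:
        return 0.5
    return 0.0  # assumed; the slide does not cover this case


def mutual_information(p_joint, p_keyword, p_topic):
    """P(keyword, topic) / (P(keyword) * P(topic)), as written on the slide."""
    return p_joint / (p_keyword * p_topic)


def topical_similarity(ontology_distance):
    """Assumed conversion: a smaller ontology distance gives a higher similarity."""
    return 1.0 / (1.0 + ontology_distance)


def importance_of_relation(kf_value, mi):
    w1, w2 = WEIGHTS["relation"]
    return w1 * kf_value + w2 * mi


def association(topic_sim, relation):
    w1, w2 = WEIGHTS["association"]
    return w1 * topic_sim + w2 * relation


def quality(image_quality, provider_qualification, metadata_score):
    w1, w2, w3 = WEIGHTS["quality"]
    return w1 * image_quality + w2 * provider_qualification + w3 * metadata_score


def rank_value(assoc, qual):
    w1, w2 = WEIGHTS["rank"]
    return w1 * assoc + w2 * qual


# Worked example with made-up inputs.
kf = keyword_from_value("jade", "jade cabbage", "a carving in jadeite")
mi = mutual_information(p_joint=0.02, p_keyword=0.1, p_topic=0.1)
assoc = association(topical_similarity(1), importance_of_relation(kf, mi))
score = rank_value(assoc, quality(0.8, 1.0, 0.5))
```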
Page 21
Algorithm for Recommending Related Collections
i-th Item Vector = {Topic, Institute, Keyword1, Keyword2, …}
Similar(i-th item, j-th item) = W1 * Topic Similar(i-th item, j-th item) + W2 * Institute Similar(i-th item, j-th item) + Weight(Keyword1) * Delta(Keyword1) + Weight(Keyword2) * Delta(Keyword2) + …
where Delta(Keyword1) = 1 if Keyword1 of the i-th item is also a keyword of the j-th item; otherwise 0.
Recommendation = Similar(i-th item, j-th item) + Evaluation(j-th item)
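A minimal Python sketch of the recommendation score follows. The item records, weights, and evaluation scores are invented for illustration, and Topic Similar and Institute Similar are assumed here to be simple exact-match indicators (1 or 0); the slides do not define them.

```python
def similar(item_i, item_j, w_topic, w_institute, keyword_weights):
    """Similar(i, j) from the slide, with exact-match topic/institute terms (assumed)."""
    topic_similar = 1.0 if item_i["topic"] == item_j["topic"] else 0.0
    institute_similar = 1.0 if item_i["institute"] == item_j["institute"] else 0.0
    # Delta(keyword) is 1 only for keywords of item i that item j also has.
    shared = set(item_i["keywords"]) & set(item_j["keywords"])
    keyword_score = sum(keyword_weights.get(keyword, 0.0) for keyword in shared)
    return w_topic * topic_similar + w_institute * institute_similar + keyword_score


def recommend(item, candidates, evaluation, w_topic, w_institute, keyword_weights, top_n=3):
    """Rank other items by Similar(i, j) + Evaluation(j), highest first."""
    scored = []
    for candidate in candidates:
        if candidate["id"] == item["id"]:
            continue
        score = similar(item, candidate, w_topic, w_institute, keyword_weights)
        score += evaluation.get(candidate["id"], 0.0)
        scored.append((score, candidate["id"]))
    scored.sort(reverse=True)
    return [candidate_id for _, candidate_id in scored[:top_n]]


# Made-up items, weights, and evaluation scores.
ITEMS = [
    {"id": "a", "topic": "painting", "institute": "NPM", "keywords": ["landscape", "ink"]},
    {"id": "b", "topic": "painting", "institute": "NPM", "keywords": ["ink"]},
    {"id": "c", "topic": "plant", "institute": "AS", "keywords": ["landscape"]},
]
KEYWORD_WEIGHTS = {"landscape": 0.3, "ink": 0.5}
EVALUATION = {"b": 0.1, "c": 0.2}

recommended = recommend(ITEMS[0], ITEMS, EVALUATION, 1.0, 0.5, KEYWORD_WEIGHTS)
```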
Page 22
(2) Dynamic categorization
User-oriented categorization
• General, elementary school students, high school students, researchers, etc.
Topic-based categorization
• Archaeology, painting, animal, plant, document, etc.
Function-based categorization
• Research, education, business, technology, …
Categorization based on institutions
• Academia Sinica, Taiwan U., Palace Museum, …
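One way to picture dynamic categorization is as grouping the same collection records by whichever metadata facet is requested. The sketch below is only an illustration under that assumption; the record layout and the facet names (`audience`, `topic`, `purpose`, `institution`) are hypothetical and not taken from the TELDAP metadata model.

```python
from collections import defaultdict

# Hypothetical collection records; the field names are illustrative only.
COLLECTIONS = [
    {"id": "c1", "audience": ["researchers"], "topic": ["archaeology"],
     "purpose": ["research"], "institution": ["Academia Sinica"]},
    {"id": "c2", "audience": ["general", "elementary school students"],
     "topic": ["painting"], "purpose": ["education"], "institution": ["Palace Museum"]},
    {"id": "c3", "audience": ["general"], "topic": ["painting", "plant"],
     "purpose": ["education", "business"], "institution": ["Academia Sinica"]},
]


def categorize(collections, facet):
    """Group collection ids under every value they carry for the chosen facet."""
    groups = defaultdict(list)
    for record in collections:
        for value in record.get(facet, []):
            groups[value].append(record["id"])
    return dict(groups)


by_topic = categorize(COLLECTIONS, "topic")
by_institution = categorize(COLLECTIONS, "institution")
```

Because the grouping is computed at request time from metadata, the same records can be shown under user-oriented, topic-based, function-based, or institution-based categories without being stored in separate hierarchies.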
Page 23
(3) Multiple Uses of the Core IR System and Databases
• TELDAP
– Whole collections
– Searched by institutes, domains, and media types (documents, images, videos, and web sites)
– Monolingual
• Digital Shop
– Whole collections or only fine arts
– General search and search by licensing types
– Relies on multilingual thesaurus
• Taiwan Academy
– Fine arts
– Searched by institutes and domains
– Multilingual
– Relies on multilingual thesaurus
Page 24
Figure: http://digitalarchives.tw
Digitalarchives.tw
Page 25
Purpose: Education
Target: Elementary school student, Junior high school student, Teacher…
Purpose: Creative applications
Purpose: Academic research
Subject: Animal, Archaeology, Anthropology…
Digitalarchives.tw
Page 26
Figure: http://taiwanacademy.tw
Taiwan Academy
Page 27
Categorization based on institutions
Topic-based categorization
Taiwan Academy
Page 28
Outline
• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective
Page 29
Plans for building knowledge structures for TELDAP
• Construct metadata models for different objects.
• Establish hyperlinks between contents and objects.
– Develop keyword extraction tools.
– Design automatic tagging tools.
• Construct TELDAP ontology and thesaurus.
– Art & Architecture Thesaurus by Getty
– Chinese WordNet
Page 30
(1) Metadata models for different objects
• Digital collections
– Union catalog metadata model: Dublin Core+
• Web sites
– DCCAP (Dublin Core Collections Application Profile)
– Public fields
– Private fields: Unique Identifier, Format, Evaluation, Cataloging History
• Documents
– Document metadata: Dublin Core
Page 31
(2) Create keyword dictionary
• Extract from metadata
• Collect from Google search terms
• Collect by social tagging
• Collect manually while tagging hyperlinks
Page 32
Lexical Entry of Keyword Dictionary
• Keyword id
• Keyword
• Synset id
• Hypernym id
• Hyponym id
• Features
• Related Collections + Association Strengths
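The lexical entry could be modeled as a small record type. The following Python dataclass is a sketch only: the field types, the choice to allow several hyponyms per entry, and the example values are assumptions, since the slide lists only the field names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LexicalEntry:
    """One keyword dictionary entry; the types are assumed, not from the slides."""
    keyword_id: int
    keyword: str
    synset_id: Optional[int] = None
    hypernym_id: Optional[int] = None
    hyponym_ids: List[int] = field(default_factory=list)  # assumed: possibly several
    features: List[str] = field(default_factory=list)
    # Related collections with their association strengths: collection id -> strength
    related_collections: Dict[str, float] = field(default_factory=dict)


# Made-up example entry.
entry = LexicalEntry(
    keyword_id=1,
    keyword="pottery",
    synset_id=10,
    hypernym_id=2,
    hyponym_ids=[3, 4],
    features=["object type"],
    related_collections={"c1": 0.9, "c2": 0.4},
)

# Collections sorted by association strength, strongest first.
strongest = sorted(entry.related_collections, key=entry.related_collections.get, reverse=True)
```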
Page 33
(2) Establish hyperlinks between contents and objects
• Identify keywords in contents.
• Tag keywords with related object hyperlinks.
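The two steps above can be illustrated with a greedy longest-match tagger. This is a simplified stand-in, not the prototype described in these slides: it matches dictionary keywords directly against the character stream rather than using a word segmentation system, and the keyword-to-URL table with its `example.org` addresses is hypothetical.

```python
import html


def tag_hyperlinks(text, keyword_urls):
    """Wrap each dictionary keyword found in text in an HTML link (longest match first)."""
    max_len = max(map(len, keyword_urls), default=0)
    out = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so "jade cabbage" wins over "jade".
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in keyword_urls:
                match = candidate
                break
        if match:
            url = html.escape(keyword_urls[match], quote=True)
            out.append('<a href="{}">{}</a>'.format(url, html.escape(match)))
            i += len(match)
        else:
            out.append(html.escape(text[i]))
            i += 1
    return "".join(out)


# Hypothetical keyword-to-URL table.
KEYWORD_URLS = {
    "jade": "http://example.org/jade",
    "jade cabbage": "http://example.org/jade-cabbage",
}

tagged = tag_hyperlinks("The jade cabbage is famous.", KEYWORD_URLS)
```

Matching on raw characters also fires inside longer words (for example "jade" inside "jaded"), which is one reason a real tagger would first resolve word boundaries with a segmentation tool.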
Page 34
Develop hyperlink tagging tools
• Word segmentation tools
– Resolve word segmentation ambiguities and identify keywords.
– CKIP word segmentation system: http://ckipsvr.iis.sinica.edu.tw/
Page 35
Develop hyperlink tagging tools
• TELDAP keyword dictionary
– Extract keywords from metadata and establish object-keyword relations.
  Extract text from the XML data for each object.
  The text is classified by topic, title, description, author, location, era, etc.
  From each class of text file, extract keywords by automatic word segmentation, keyword extraction, and manual post-editing.
– The current dictionary contains more than 120,000 keywords.
Page 36
Prototype system for hyperlink tagger
• Identify and select keywords from the input text
Page 37
Prototype system for hyperlink tagger
• Produce text with hyperlinks
Page 38
Prototype system for hyperlink tagger
• Hyperlinks point to the related digital collections
Page 39
(3) Construct TELDAP ontology and thesaurus
• Establish association links between Chinese keywords and Getty AAT.
• Merge TELDAP keywords with Chinese AAT.
Page 40
AAT Browsing trees of Taiwan Academy
Page 41
AAT subject search of Taiwan Academy
Page 42
Recommendation of related items
Page 43
Outline
• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective
Page 44
Future Perspective
• Technology development
– Construct multilingual thesauri: Getty AAT.
– Maintain the TELDAP keyword-and-object relation database.
– Construct name authority files, gazetteers, and universal calendars.
– Design hyperlink taggers and keyword extension tools.
– Design an authoring tool that automatically provides hyperlinks to keyword-related digital contents.
– Design a knowledge-based content retrieval system.
Page 45
Future Perspectives
• Content enrichment
– Within TELDAP:
  Standardize the object metadata model and data format.
  Provide object metadata in controlled vocabulary.
  Write scripts and stories for different topics with a Wiki-like knowledge structure.
  Enrich the digital collections.
  Establish hyperlinks between textbooks and TELDAP collections.
– Extend the knowledge sources, e.g. Wikipedia