Union Catalog and Knowledge Engineering for TELDAP
Keh-Jiann Chen
Principal Investigator, Core Platforms for Digital Contents Project, TELDAP
Research Fellow, Research Center for Information Technology Innovation & Institute of Information Science, Academia Sinica
2012.04.20

Dec 18, 2015

Page 1:

Union Catalog and Knowledge Engineering for TELDAP

Keh-Jiann Chen
Principal Investigator, Core Platforms for Digital Contents Project, TELDAP
Research Fellow, Research Center for Information Technology Innovation & Institute of Information Science, Academia Sinica

2012.04.20

Page 2:

Outline

• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective

Page 3:

Introduction

The integration and management of digital contents has become an important issue as the amount of digital contents produced from different projects and institutions increases rapidly.

The goal of our project is to achieve optimized preservation, retrieval, and presentation of digital collections.

Page 4:

Outline

• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective

Page 5:

What is the union catalog?

• It is a catalog and portal for all digital collections of TELDAP.
• It is an integrated platform for browsing and searching entire digital contents of TELDAP.
• Metadata provides core descriptions and licensing information of each digital collection.

Page 7:

Some improved functions for IR

• Keyword suggestion

• Keyword extension

• Recommendation of related collections

Page 8:

• Keyword suggestion

Page 9:

• Keyword extension

Page 10:

[Figure: screenshot with the following labelled parts]
• Digital Image
• Recommendation of related collections
• Hyperlink to database
• Metadata
• Citation
• Social networking service
• Licensing Information

Page 11:

Outline

• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective

Page 12:

Metadata models for different types of objects

Archived digital items
• Union catalog metadata model: Dublin Core+

Web sites
• DCCAP (Dublin Core Collections Application Profile)
• Fields for internal use only: Unique Identifier, Format, Evaluation, Cataloging History

Documents
• Document metadata: Dublin Core

Page 13:

Metadata for digital items:

• Over 4 million digital items and still increasing

Element | Definition
Title | A name given to the resource
Creator | An entity primarily responsible for making the content of the resource
Subject and Keywords | The topic of the content of the resource
Description | An account of the content of the resource
Publisher | An entity responsible for making the resource available
Contributor | An entity responsible for making contributions to the content of the resource
Date | A date associated with an event in the life cycle of the resource
Resource Type | The nature or genre of the content of the resource
Format | The physical or digital manifestation of the resource
Resource Identifier | An unambiguous reference to the resource within a given context
Source | A reference to a resource from which the present resource is derived
Language | A language of the intellectual content of the resource
Relation | A reference to a related resource
Coverage | The extent or scope of the content of the resource
Rights Management | Information about rights held in and over the resource

Page 14:

Page 15:

Metadata for websites

• Over 690 websites and still increasing
• Metadata
– DCCAP (Dublin Core Collections Application Profile)
– Combine the standard with our requirements: 19 data fields

Page 16:

The Website Homepage Picture

URL, Project Information

Type, Name, Author, Subject, Description, Language, Item Type, Target

Archived Information: URL, time, authorization

Copyright, Purpose, Other Information

Figure: http://digitalarchives.tw

Social networking service

Page 17:

Uses of Metadata

• Search collections by matching keyword and features
• Provide basic information of each collection
• Dynamic categorization
• Provide information to compute similarity or relatedness of two collections
• Extract keywords

Page 18:

(1) Chinese Keyword Search

[Flow diagram]
Data: Keyword + (Features) → Synonyms, hyponyms → Matched Collections → Collections + Weights → Display Results
Processing steps: Keyword Extension → Keyword Matching → Ranking → Filtering
Resources: AAT-Taiwan & TELDAP Thesaurus; Keyword Dictionary

Page 19:

English Keyword Search

[Flow diagram]
Data: English Keyword + (Features) → Translations, Synonyms, Hyponyms → Matched Collections → Collections + Weights → Display Results
Processing steps: Keyword Translation & Extension → Keyword Matching → Ranking → Filtering
Resources: AAT-Taiwan & TELDAP Thesaurus; Keyword Dictionary
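The search flow on the two slides above (extension, matching, ranking, filtering) can be sketched as a small pipeline. This is only an illustration: the thesaurus and dictionary contents, the function names, and the `min_weight` threshold are hypothetical, and ranking is reduced to summing stored association strengths rather than the full ranking formula.

```python
# Hypothetical toy data; the real AAT-Taiwan/TELDAP thesaurus and keyword
# dictionary are far larger and are not reproduced here.
THESAURUS = {"vessel": ["container", "bowl"]}   # keyword -> synonyms/hyponyms
DICTIONARY = {                                   # keyword -> {collection: strength}
    "vessel": {"c1": 0.9},
    "bowl": {"c1": 0.4, "c2": 0.7},
}


def extend(keyword):
    """Keyword extension: the keyword plus its synonyms and hyponyms."""
    return [keyword] + THESAURUS.get(keyword, [])


def match(keywords):
    """Keyword matching: accumulate association strengths per collection."""
    weights = {}
    for kw in keywords:
        for coll, strength in DICTIONARY.get(kw, {}).items():
            weights[coll] = weights.get(coll, 0.0) + strength
    return weights


def search(keyword, min_weight=0.5):
    """Extension -> matching -> ranking -> filtering."""
    weights = match(extend(keyword))
    ranked = sorted(weights.items(), key=lambda cw: cw[1], reverse=True)
    # Filtering: drop collections whose weight falls below the threshold.
    return [(coll, w) for coll, w in ranked if w >= min_weight]
```

For the English search, a translation lookup would sit in front of `extend`; it is omitted here because the slides do not describe how translation is done.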

Page 20:

Ranking Algorithm

Rank Value(item) = W1 * Association(Keyword, item) + W2 * Quality(item)

– Association(Keyword, item) = W1 * Topical Similarity(Topic(keyword), Topic(item)) + W2 * Importance of relation(Keyword, item)
– Quality(item) = W1 * Image quality(item) + W2 * Qualification of provider(item) + W3 * Metadata(item)

• Topical Similarity(Topic(keyword), Topic(item)) = Ontology Distance(Topic(keyword), Topic(item))
• Importance of relation(Keyword, item) = W1 * Keyword-from Value + W2 * Mutual Information(keyword, Topic(item))
• Keyword-from Value = 1 if keyword is contained in title(item); 0.5 if keyword is contained in description(item)
• Mutual Information(keyword, Topic(item)) = P(Keyword, Topic(item)) / {P(Keyword) * P(Topic(item))}
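The ranking formulas above compose as follows. This is a minimal sketch under stated assumptions: every default weight is a placeholder (the slide gives no values, and the weights are assumed to be set separately for each formula), the 0.0 returned when a keyword appears in neither title nor description is an assumption, and topical similarity is taken as a precomputed score rather than derived from an ontology.

```python
def keyword_from_value(keyword, title, description):
    """1 if the keyword is in the title, 0.5 if it is only in the description."""
    if keyword in title:
        return 1.0
    if keyword in description:
        return 0.5
    return 0.0  # assumption: the slide does not give a value for this case


def mutual_information(p_joint, p_keyword, p_topic):
    """P(keyword, topic) / (P(keyword) * P(topic)), as written on the slide."""
    return p_joint / (p_keyword * p_topic)


def importance_of_relation(kf_value, mi, w1=0.5, w2=0.5):
    return w1 * kf_value + w2 * mi


def association(topical_similarity, importance, w1=0.5, w2=0.5):
    return w1 * topical_similarity + w2 * importance


def quality(image_quality, provider_qualification, metadata_score,
            w1=0.4, w2=0.3, w3=0.3):
    return w1 * image_quality + w2 * provider_qualification + w3 * metadata_score


def rank_value(assoc, qual, w1=0.7, w2=0.3):
    """Rank Value(item) = W1 * Association + W2 * Quality."""
    return w1 * assoc + w2 * qual
```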

Page 21:

Algorithm for Recommending Related Collections

i-th Item Vector = {Topic, Institute, Keyword1, Keyword2, …}

Similar(i-th item, j-th item) = W1 * Topic Similar(i-th item, j-th item) + W2 * Institute Similar(i-th item, j-th item) + Weight(Keyword1) * Delta(Keyword1) + Weight(Keyword2) * Delta(Keyword2) + …

where Delta(Keyword1) = 1 if Keyword1 of the i-th item is also a keyword of the j-th item; otherwise 0.

Recommendation = Similar(i-th item, j-th item) + Evaluation(j-th item)
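The recommendation score above can be sketched as below. The item layout (a dict with `topic`, `institute`, and `keywords`), the default weights, and the use of exact equality for topic and institute similarity are all simplifying assumptions; the slide does not say how those two similarities are measured.

```python
def similar(item_i, item_j, keyword_weights, w1=0.5, w2=0.2):
    """Weighted similarity between two item vectors."""
    # Assumption: topic and institute similarity are 1 on exact match, else 0.
    topic_sim = 1.0 if item_i["topic"] == item_j["topic"] else 0.0
    institute_sim = 1.0 if item_i["institute"] == item_j["institute"] else 0.0
    # Delta(keyword) = 1 exactly for keywords of item i that item j also has.
    shared = set(item_i["keywords"]) & set(item_j["keywords"])
    keyword_score = sum(keyword_weights.get(kw, 0.0) for kw in shared)
    return w1 * topic_sim + w2 * institute_sim + keyword_score


def recommendation(item_i, item_j, keyword_weights, evaluation_j):
    """Recommendation = Similar(i, j) + Evaluation(j)."""
    return similar(item_i, item_j, keyword_weights) + evaluation_j
```

Candidates for a given item would then be sorted by `recommendation` in descending order.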

Page 22:

(2) Dynamic categorization

User-oriented categorization
• General, elementary school students, high school students, researchers, … etc.

Topical-based categorization
• Archaeology, painting, animal, plant, document, … etc.

Functional-based categorization
• Research, education, business, technology, …

Categorization based on institutions
• Academia Sinica, Taiwan U., Palace museum, …
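One way to read "dynamic" categorization is that the same collection records are regrouped on demand by whichever metadata facet the user picks. The sketch below assumes each record stores its facet values as lists under hypothetical keys such as `target`, `topic`, `purpose`, and `institution`; the slides do not specify the actual field names or storage.

```python
from collections import defaultdict


def categorize(collections, facet):
    """Group collection ids by the values of one metadata facet."""
    groups = defaultdict(list)
    for coll in collections:
        # A collection may carry several values for a facet, or none.
        for value in coll.get(facet, []):
            groups[value].append(coll["id"])
    return dict(groups)
```

Calling `categorize(records, "topic")` and `categorize(records, "institution")` on the same records yields the topical and institutional views without storing either hierarchy in advance.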

Page 23:

(3) Multi-purpose Core IR System and Databases

• TELDAP
– Whole collections
– Searched by institutes, domains, and media types (documents, images, videos, and web sites)
– Monolingual

• Digital Shop
– Whole collections or only fine arts
– General search and search by licensing types
– Relies on multilingual thesaurus

• Taiwan Academy
– Fine arts
– Searched by institutes and domains
– Multilingual
– Relies on multilingual thesaurus

Page 24:

Figure: http://digitalarchives.tw

Digitalarchives.tw

Page 25:

Purpose: Education
Target: Elementary school student, Junior high school student, Teacher…

Purpose: Creative applications

Purpose: Academic research
Subject: Animal, Archaeology, Anthropology…

Digitalarchives.tw

Page 27:

Categorization based on institutions
Topical-based categorization

Taiwan Academy

Page 28:

Outline

• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective

Page 29:

Plans of making knowledge structures for TELDAP

• Construct metadata models for different objects.
• Establish hyperlinks between contents and objects.
– Develop keyword extraction tools.
– Design automatic tagging tools.
• Construct TELDAP ontology and thesaurus.
– Art & Architecture Thesaurus by Getty
– Chinese WordNet

Page 30:

(1) Metadata models for different objects

• Digital collections
– Union catalog metadata model: Dublin Core+

• Web sites
– DCCAP (Dublin Core Collections Application Profile)
– Public fields
– Private fields: Unique Identifier, Format, Evaluation, Cataloging History

• Documents
– Document metadata: Dublin Core

Page 31:

(2) Create keyword dictionary

• Extract from metadata
• Collect from Google search terms
• By social tagging
• Manually collect while tagging hyperlinks

Page 32:

Lexical Entry of Keyword Dictionary

• Keyword id
• Keyword
• Synset id
• Hypernym id
• Hyponym id
• Features
• Related Collections + Association Strengths
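The lexical entry fields listed above map naturally onto a record type. The sketch below is an assumed layout, not the project's schema: the field types, the choice to hold hyponyms as a list of ids, and the representation of related collections as a mapping from collection id to association strength are all illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LexicalEntry:
    keyword_id: int
    keyword: str
    synset_id: Optional[int] = None      # group of synonymous keywords
    hypernym_id: Optional[int] = None    # broader term
    # Assumption: a keyword may have several narrower terms.
    hyponym_ids: List[int] = field(default_factory=list)
    features: List[str] = field(default_factory=list)
    # Assumption: collection id -> association strength.
    related_collections: Dict[str, float] = field(default_factory=dict)
```

An entry such as `LexicalEntry(keyword_id=1, keyword="青花瓷", related_collections={"c1": 0.8})` then carries what keyword extension and keyword matching need in one place.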

Page 33:

(2) Establish hyperlinks between contents and objects

• Identify keywords in contents.
• Tag keywords with related object hyperlinks.
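The two steps above (identify keywords, then tag them with hyperlinks) can be illustrated with a greedy longest-match scan over a keyword-to-URL table. This is a stand-in technique chosen for brevity: real Chinese text would normally go through a word segmenter to resolve ambiguities, and the `keyword_urls` table and its example URLs are hypothetical.

```python
import html


def tag_hyperlinks(text, keyword_urls):
    """Wrap each dictionary keyword found in text with an HTML link."""
    # Try longer keywords first so "青花瓷" wins over its prefix "青花".
    keywords = sorted(keyword_urls, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for kw in keywords:
            if kw and text.startswith(kw, i):
                url = html.escape(keyword_urls[kw], quote=True)
                out.append(f'<a href="{url}">{html.escape(kw)}</a>')
                i += len(kw)
                break
        else:
            # No keyword starts here: copy one character and move on.
            out.append(html.escape(text[i]))
            i += 1
    return "".join(out)
```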

Page 34:

Develop hyperlink tagging tools

• Word segmentation tools
– Resolve word segmentation ambiguities and identify keywords.
– CKIP word segmentation system: http://ckipsvr.iis.sinica.edu.tw/

Page 35:

Develop hyperlink tagging tools

• TELDAP keyword dictionary
– Extract keywords from metadata and establish object-keyword relations.
  • Extract text from XML data for each object.
  • The text is classified by topics, titles, descriptions, authors, locations, eras, etc.
  • From each class of text file, extract keywords by automatic word segmentation, keyword extraction, and manual post-editing.
– The current dictionary contains more than 120,000 keywords.
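The first step, pulling text out of each object's XML and sorting it by field class, might look like the sketch below. The element names in `FIELDS` are hypothetical, since the slides do not show the actual XML schema, and the later segmentation and manual editing steps are not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the actual TELDAP XML schema is not shown.
FIELDS = ("title", "description", "author", "location", "era")


def extract_text_by_class(xml_text, fields=FIELDS):
    """Return {field: [text, ...]} for each metadata field in one object's XML."""
    root = ET.fromstring(xml_text)
    classified = {name: [] for name in fields}
    for name in fields:
        for element in root.iter(name):
            text = (element.text or "").strip()
            if text:
                classified[name].append(text)
    return classified
```

Each list in the result would then be passed to word segmentation and keyword extraction, with the field name kept so that title keywords and description keywords can be weighted differently.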

Page 36:

Prototype system for hyperlink tagger

• Identify and select keywords from the input text

Page 37:

Prototype system for hyperlink tagger

• Produce text with hyperlinks

Page 38:

Prototype system for hyperlink tagger

• Hyperlinks point to the related digital collections

Page 39:

(3) Construct TELDAP ontology and thesaurus

• Establish association links between Chinese keywords and Getty AAT.
• Merge TELDAP keywords with Chinese AAT.

Page 40:

AAT Browsing trees of Taiwan Academy

Page 41:

AAT subject search of Taiwan Academy

Page 42:

Recommendation of related items

Page 43:

Outline

• Introduction
• Union catalog
• Databases and metadata for digital contents and websites
• Knowledge engineering
• Future perspective

Page 44:

Future Perspective

• Technology development
– Construct multilingual thesauri (Getty AAT).
– Maintain the TELDAP keyword-and-object relation database.
– Construct name authority files, gazetteers, and universal calendars.
– Design hyperlink taggers and keyword extension tools.
– Design an authoring tool which automatically provides hyperlinks to keyword-related digital contents.
– Design a knowledge-based content retrieval system.

Page 45:

Future Perspectives

• Content enrichment
– Within TELDAP:
  • Standardize object metadata model and data format.
  • Provide object metadata in controlled vocabulary.
  • Write scripts and stories for different topics with Wiki-like knowledge structure.
  • Enrich the digital collections.
  • Establish hyperlinks between textbooks and TELDAP collections.
– Extend the knowledge sources, e.g. Wikipedia

Page 46: