Metadata for the Web From Discovery to Description

Post on 23-Jan-2016

Cornell CS 502

Metadata for the Web: From Discovery to Description

CS 502 – 20020224
Carl Lagoze – Cornell University

The fifteen Dublin Core Elements

Creator Title Subject

Contributor Date Description

Publisher Type Format

Coverage Rights Relation

Source Language Identifier

http://dublincore.org/usage/terms/dc/current-elements/

http://dublincore.org

A Pidgin for Digital Tourists

• Metadata is language
• Dublin Core is a small and simple language -- a pidgin -- for finding resources across domains.
• Speakers of different languages naturally "pidginize" to communicate
– E.g., tourists using simple phrases to order beer ("zwei Bier bitte" "dva pivo" "biru o san bai"...)
• We are all "tourists" on the global Internet.

What is the Dublin Core (1)?

• A simple set of properties to support resource discovery on the web (fuzzy search buckets)?

[Figure label: domain-independent view]

What is the Dublin Core (2)?

• An extensible ontology for resource description?

[Figure label: Greater Functionality & Cost]

What is the Dublin Core (3)?

• A cross-domain switchboard for interoperable metadata?

[Figure: a switchboard linking Dublin Core, MARC, INDECS, and IMS]

Dublin Core Qualifiers

• From fuzzy buckets to more specific description

• Model of "graceful degradation"
– Support both simplicity and specificity
– Intra-domain and inter-domain semantics

Varieties of qualifiers: Element Refinements

• Make the meaning of an element narrower or more specific.

• Narrowing implies an "is a" relationship
– a "date created" is a "date"
– an "is part of" relation is a "relation"

• If your software does not understand the qualifier, you can safely ignore it.

Varieties of Qualifiers: Value Encoding Schemes

• Says that the value is
– a term from a controlled vocabulary (e.g., Library of Congress Subject Headings)
– a string formatted in a standard way (e.g., "2001-05-02" means May 2, not February 5)

• Even if a scheme is not known by software, the value should be "appropriate" and usable for resource discovery.
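
A minimal sketch of how software might treat an encoding scheme, assuming Python. The helper `decode_value` and the set `DATE_SCHEMES` are hypothetical names for illustration, not part of any Dublin Core tooling. A scheme the software knows yields a typed value; an unknown scheme leaves the string usable as it is.

```python
from datetime import date

# Schemes this sketch knows how to interpret (an illustrative choice).
DATE_SCHEMES = {"ISO8601", "W3CDTF"}

def decode_value(value, scheme=None):
    """Return a typed value when the scheme is understood, else the raw string."""
    if scheme in DATE_SCHEMES:
        try:
            # Full dates are written year-month-day.
            return date.fromisoformat(value)
        except ValueError:
            return value  # not a full date: keep it as text
    # Unknown or absent scheme: the value is still a searchable string.
    return value

d = decode_value("2001-05-02", "ISO8601")
print(d.month, d.day)
print(decode_value("Languages -- Grammar", "LCSH"))
```

The fallback branch is the point of the design: discovery still works on the plain string when the scheme means nothing to the software.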

A Grammar of Dublin Core

• http://www.dlib.org/dlib/october00/baker/10baker.html

• By design not as subtle as mother tongues, but easy to learn and extremely useful in practice

• Pidgins: small vocabularies (Dublin Core: fifteen special nouns and lots of optional adjectives)

• Simple grammars: sentences (statements) follow a simple fixed pattern...

Example Dublin Core statements

• Resource has Title 'Grammar of Dublin Core'.
• Resource has Creator 'Tom Baker'.
• Resource has Subject 'Metadata'.
• Resource has Relation http://foo.org/file.htm.
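
The fixed sentence pattern of these statements can be sketched as a small data type. This is an illustration in Python under stated assumptions: the class `Statement` and the set `DC_ELEMENTS` are hypothetical names, and the sketch quotes every value, including the URL.

```python
from dataclasses import dataclass

# The fifteen elements, used here as the only allowed "nouns".
DC_ELEMENTS = {
    "Creator", "Title", "Subject", "Contributor", "Date",
    "Description", "Publisher", "Type", "Format", "Coverage",
    "Rights", "Relation", "Source", "Language", "Identifier",
}

@dataclass(frozen=True)
class Statement:
    prop: str   # one of the fifteen elements
    value: str  # an appropriate literal (or a URL)

    def __post_init__(self):
        if self.prop not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {self.prop}")

    def sentence(self):
        # Subject ("Resource") and verb ("has") are implied, so they are fixed.
        return f"Resource has {self.prop} '{self.value}'."

record = [
    Statement("Title", "Grammar of Dublin Core"),
    Statement("Creator", "Tom Baker"),
    Statement("Subject", "Metadata"),
    Statement("Relation", "http://foo.org/file.htm"),
]
for s in record:
    print(s.sentence())
```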

Resource has property X

[Diagram: "Resource" is the implied subject; "has" is the implied verb; "property" is one of 15 properties (DC:Creator, DC:Title, DC:Subject, DC:Date, ...); X is the property value (an appropriate literal)]

Resource has property X, with optional qualifiers

[Diagram: implied subject "Resource"; implied verb "has"; one of 15 properties (DC:Creator, DC:Title, DC:Subject, DC:Date, ...); property value X (an appropriate literal); two [optional qualifier] slots labeled "qualifiers (adjectives)"]

• Resource has Date "2000-06-13" (qualifiers: Revised, ISO8601)
• Resource has Subject "Languages -- Grammar" (qualifier: LCSH)

Dumb-Down Principle for Qualifiers

• The fifteen elements should be usable and understandable with or without the qualifiers

• Qualifiers refine meaning (but may be harder to understand)

• Nouns can stand on their own without adjectives

• If your software encounters an unfamiliar qualifier, look it up -- or just ignore it!

• "has a" relations break the model
– E.g., a creator has a hair color
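
A minimal sketch of dumbing down, assuming Python. The names `QualifiedStatement` and `dumb_down` are hypothetical, and treating "Revised" as an element refinement and "ISO8601" as an encoding scheme is this sketch's reading of the qualifier examples in these slides.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class QualifiedStatement:
    element: str                      # e.g. "Date"
    value: str                        # e.g. "2000-06-13"
    refinement: Optional[str] = None  # element refinement, e.g. "Revised"
    scheme: Optional[str] = None      # encoding scheme, e.g. "ISO8601"

def dumb_down(stmt):
    """Drop both qualifiers; the bare element and value must still be usable."""
    return QualifiedStatement(stmt.element, stmt.value)

revised = QualifiedStatement("Date", "2000-06-13",
                             refinement="Revised", scheme="ISO8601")
print(dumb_down(revised))
```

A "date revised" is still a date, so the dumbed-down statement stays correct. A "has a" qualifier such as hair color would not survive this step, because the value would no longer describe the creator element itself.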

• Resource has Date "2000-06-13" (qualifiers: Revised, ISO8601)
• Resource has Subject "Languages -- Grammar" (qualifier: LCSH)

Test for "good" qualifiers: cover the qualifier and ask:
-- Does the statement still make sense?
-- Is it still correct?

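"Incorrect" Qualification

• Resource has subject "pre-schoolers" (qualifier: audience)
• Resource has creator "Cornell University" (qualifier: affiliation)
– With the qualifier covered, these claim that the subject is "pre-schoolers" and the creator is "Cornell University", which is no longer correct.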

Open questions in this model

• Are uncontrolled and unconstrained values really useful for discovery?

• Is it possible for an organization (DCMI) to control the evolution of a language?

• How can "simple discovery metadata" be combined with complex descriptions? Is there a notion of graceful degradation?

• Can DC serve as a lingua franca (mapping template) among more complex models?

Models for Deploying Metadata

• Embedded in the resource
– Low deployment threshold
– Limited flexibility, limited model
• Linked to from resource
– Using XLink
– Is there only one source of metadata?
• Independent resource referencing resource
– Model of accessing the object through its surrogate
– Resource doesn't 'have' metadata; metadata is just one resource annotating another

Syntax Alternatives: HTML

• Advantages:
– Simple mechanism: META tags embedded in content
– Widely deployed tools and knowledge
• Disadvantages:
– Limited structural richness (won't support hierarchical, tree-structured data or entity distinctions).

Dublin Core in HTML

• http://www.dublincore.org/documents/2000/08/15/dcq-html/

• HTML constructs
– <link> to establish pseudo-namespace
– <meta> for metadata statements

• name attribute for DC element (DC.element.ER)

• content attribute for element value

• scheme attribute for encoding scheme or controlled vocabulary

• lang attribute for language of element value
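
The attribute conventions listed above can be sketched as a small generator, assuming Python. The function `meta_tag` and its parameter names are hypothetical; only the `name`, `content`, `scheme`, and `lang` attributes come from the slide.

```python
from html import escape

def meta_tag(element, content, refinement=None, scheme=None, lang=None):
    """Render one Dublin Core statement as an HTML <meta> element."""
    # The name is "DC.<element>", plus ".<refinement>" when one is given.
    name = f"DC.{element}" + (f".{refinement}" if refinement else "")
    attrs = [("name", name)]
    if scheme:
        attrs.append(("scheme", scheme))
    if lang:
        attrs.append(("lang", lang))
    attrs.append(("content", content))
    # Escape attribute values so quotes and ampersands stay valid HTML.
    rendered = " ".join(f'{k}="{escape(v, quote=True)}"' for k, v in attrs)
    return f"<meta {rendered}>"

print(meta_tag("Title", "Business Unusual"))
print(meta_tag("Title", "negocio inusual", lang="es"))
print(meta_tag("Date", "2000-10-23", refinement="Created", scheme="W3CDTF"))
```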

Dublin Core in HTML example

<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
<meta name="DC.Title" content="Business Unusual">
<meta name="DC.Title" lang="es" content="negocio inusual">
<meta name="DC.Creator" content="Carl Lagoze">
<meta name="DC.Subject" content="bibliographic control web cataloging">
<meta name="DC.Date.Created" scheme="W3CDTF" content="2000-10-23">
<meta name="DC.Format" content="text/html">
<meta name="DC.Identifier" content="http://lcweb.loc.gov/lagoze_paper.html">
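
Going the other way, a harvester might read such tags back into statements. The following is a sketch using Python's standard `html.parser`; the class `DCMetaParser` and the dictionary keys are hypothetical choices for illustration.

```python
from html.parser import HTMLParser

class DCMetaParser(HTMLParser):
    """Collect <meta name="DC...."> tags as Dublin Core statements."""

    def __init__(self):
        super().__init__()
        self.statements = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name = a.get("name") or ""
        if not name.startswith("DC."):
            return  # not a Dublin Core statement
        parts = name.split(".")  # ["DC", element, optional refinement]
        self.statements.append({
            "element": parts[1],
            "refinement": parts[2] if len(parts) > 2 else None,
            "scheme": a.get("scheme"),
            "lang": a.get("lang"),
            "content": a.get("content"),
        })

html_doc = """
<meta name="DC.Title" content="Business Unusual">
<meta name="DC.Title" lang="es" content="negocio inusual">
<meta name="DC.Date.Created" scheme="W3CDTF" content="2000-10-23">
"""
parser = DCMetaParser()
parser.feed(html_doc)
for s in parser.statements:
    print(s)
```

Software that does not understand a refinement or scheme can simply ignore those two keys and still use `element` and `content`.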

Unqualified Dublin Core in XML

http://dublincore.org/documents/2002/09/09/dc-xml-guidelines/
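
A rough sketch of unqualified Dublin Core serialized as XML, using Python's standard `xml.etree.ElementTree`. It assumes lower-case element names in the Dublin Core 1.1 namespace; the container element `metadata` and the function `to_simple_dc_xml` are hypothetical choices for this illustration, so consult the guidelines above for the actual recommendations.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialize with the "dc:" prefix

def to_simple_dc_xml(statements):
    """Serialize (element, value) pairs as child elements of one container."""
    root = ET.Element("metadata")  # hypothetical container element
    for element, value in statements:
        child = ET.SubElement(root, f"{{{DC_NS}}}{element.lower()}")
        child.text = value
    return ET.tostring(root, encoding="unicode")

xml_text = to_simple_dc_xml([
    ("Title", "Business Unusual"),
    ("Creator", "Carl Lagoze"),
    ("Format", "text/html"),
])
print(xml_text)
```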

Multi-entity nature of object description

[Figure labels: Photographer, Camera type, Software, Computer artist]

Attribute/Value approaches to metadata…

Hamlet has a creator Shakespeare

[Labels: "Hamlet" = subject; "has a" = implied verb; "creator" = metadata noun; "Shakespeare" = literal; "playwright" = metadata adjective]

The playwright of Hamlet was Shakespeare

[Graph: R1 --dc:title--> "Hamlet"; R1 --dc:creator.playwright--> "Shakespeare"]

…run into problems for richer descriptions…

Hamlet has a creator Stratford

[Label on "creator": birthplace]

The playwright of Hamlet was Shakespeare, who was born in Stratford

[Graph: R1 --dc:creator.playwright--> "Shakespeare"; R1 --dc:creator.birthplace--> "Stratford"]

…because of their failure to model entity distinctions

[Graph: R1 --title--> "Hamlet"; R1 --creator--> R2; R2 --name--> "Shakespeare"; R2 --birthplace--> "Stratford"]
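
The contrast between the flat record and the entity-aware graph can be sketched in Python. The names `Entity`, `flat`, and `dumb_down_flat` are hypothetical and serve only to illustrate the problem described in these slides.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    ident: str
    props: dict = field(default_factory=dict)

# Flat attribute/value record: every property hangs off the work itself.
flat = {
    "dc:title": "Hamlet",
    "dc:creator.playwright": "Shakespeare",
    "dc:creator.birthplace": "Stratford",
}

def dumb_down_flat(record):
    """Strip qualifiers (everything after the first dot) from each key."""
    out = {}
    for key, value in record.items():
        out.setdefault(key.split(".")[0], []).append(value)
    return out

# Entity-aware record: the creator is its own node (R2) with its own properties.
r2 = Entity("R2", {"name": "Shakespeare", "birthplace": "Stratford"})
r1 = Entity("R1", {"title": "Hamlet", "creator": r2})

# Dumbing down the flat record wrongly lists Stratford as a creator.
print(dumb_down_flat(flat)["dc:creator"])
# The entity model keeps the birthplace attached to the person.
print(r1.props["creator"].props["birthplace"])
```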

Applying a Model-Centric Approach

• Formally define common entities and relationships underlying multiple metadata vocabularies

• Describe them (and their inter-relationships) in a simple logical model

• Provide the framework for extending these common semantics to domain and application-specific metadata vocabularies.

Events are key to understanding resource complexity?

• Events are implicit in most metadata formats (e.g., ‘date published’, ‘translator’)

• Modeling implied events as first-class objects provides attachment points for common entities – e.g., agents, contexts (times & places), roles.

• Clarifying attachment points facilitates understanding and querying “who was responsible for what when”.

ABC/Harmony Event-aware metadata ontology

• Recognizing inherent lifecycle aspects of description (esp. of digital content)
• Modeling incorporates time (events and situations) as first-class objects
– Supplies clear attachment points for agents, roles, existential properties
• Resource description as a "story-telling" activity

Resource-centric Metadata

Title: Anna Karenina
Author: Leo Tolstoy
Illustrator: Orest Vereisky
Translator: Margaret Wettlin
Date Created: 1877
Date Translated: 1978
Description: Adultery & Depression
Birthplace: Moscow
Birthdate: 1828
?

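[Figure: graph of the same record with events as nodes. Labels: "Anna Karenina"; "Tragic adultery and the search for meaningful love"; events "creation" ("1877") and "translation" ("1978"); roles "author", "translator", "illustrator"; agents "Leo Tolstoy" ("Moscow", "1828"), "Margaret Wettlin", "Orest Vereisky"; languages "Russian", "English"]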
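
A minimal sketch of events as first-class objects, assuming Python. The classes `Event` and `Resource` and the function `responsibilities` are hypothetical names, not ABC/Harmony vocabulary, and the sketch covers only the author and translator from the record above.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str                                   # e.g. "creation"
    when: str                                   # e.g. "1877"
    agents: dict = field(default_factory=dict)  # role -> agent name

@dataclass
class Resource:
    title: str
    events: list = field(default_factory=list)

def responsibilities(resource):
    """Yield (agent, role, event kind, date): who was responsible for what, when."""
    for ev in resource.events:
        for role, agent in ev.agents.items():
            yield (agent, role, ev.kind, ev.when)

book = Resource("Anna Karenina", [
    Event("creation", "1877", {"author": "Leo Tolstoy"}),
    Event("translation", "1978", {"translator": "Margaret Wettlin"}),
])
for row in responsibilities(book):
    print(row)
```

Because dates and roles attach to an event rather than to the resource, "Date Translated" and "Translator" no longer need separate, unrelated fields.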

Breaking the metadata bottleneck: Human vs. machine generation

• Simple text scraping
– HTML tags as hint
– Other structural methods
• Natural language methods and machine learning
• Contextual methods
– Google (text and image search)
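
As an illustration of "HTML tags as hint", the sketch below collects text from `<title>` and `<h1>` as candidate Title values, using Python's standard `html.parser`. The class `TitleScraper` and the choice of hint tags are assumptions of this sketch, not a method taken from the slides.

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collect text inside <title> and <h1> as candidate Title values."""
    HINT_TAGS = {"title", "h1"}

    def __init__(self):
        super().__init__()
        self._open = None      # hint tag currently being read, if any
        self.candidates = []   # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.HINT_TAGS:
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        text = data.strip()
        if self._open and text:
            self.candidates.append((self._open, text))

page = ("<html><head><title>Business Unusual</title></head>"
        "<body><h1>Business Unusual</h1><p>Body text.</p></body></html>")
scraper = TitleScraper()
scraper.feed(page)
print(scraper.candidates)
```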

Putting metadata in its place

Query engine architecture space

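[Figure: 2x2 matrix of Queries (Structured, Unstructured) against Data (Structured, Unstructured), with quadrants numbered 1 to 4 and two labeled examples: "(Relational) Database Systems" and "Information Retrieval Systems"]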
