Introduction to Digital Libraries Meta Data (1)
Metadata
“data about data” is about as good as the definition gets...
Metadata…definition
• Data that helps in the design, creation, description, preservation, and use of information systems and resources
• Metadata can play a role in the development of effective, authoritative, interoperable, scalable, and preservable information and record-keeping systems.
Types of Metadata
• Descriptive
  • Discovery / description of objects
  • Title, author, abstract, etc.
• Structural
  • Storage & presentation of objects
  • 1 pdf file, 1 ppt file, 1 LaTeX file, etc.
• Administrative
  • Management and preservation of objects
  • Access control lists, terms and conditions, format descriptions, “meta-metadata”
Functions of Metadata
• Discover resources
• Manage documents
• Control IP rights
• Identify versions
• Certify authenticity
• Indicate status
• Mark content structure
• Situate geospatially
• Describe processes
DL Metadata Issues
• Who provides metadata?
  • author? “publisher”? professional cataloger? extracted from content?
• Is metadata “integrated” with data?
  • related question: is metadata a first class object?
• Formats!
  • which ones?
  • Extensible?
Metadata Analysis: Status
• Static vs. dynamic
  • Static metadata will not be updated, augmented, etc.; it is essentially “dead”
  • Dynamic metadata is “living,” maintained by someone, updated when needed, perhaps regularly supplemented
• This distinction will have an impact on conversion strategies and workflows
Metadata Analysis: File Format
• File, or data exchange, formats:
  • SGML / HTML
  • XML / XHTML
  • MARC
  • “Delimited” plain-text file
  • Binary (not plain-text) formats, either open or proprietary
Metadata Formats
• MARC is very rich
  • good candidate for an “archival” metadata format, from which simpler formats can be derived
• Dublin Core designed to be simple enough for the average author to generate by hand
  • only 15 core fields defined
• Other formats defined for specific purposes:
  • BibTeX: TeX/LaTeX publishing
  • refer: troff/nroff
  • RFC-1807: email exchange
Interesting Formats
• Library science
  • Machine Readable Catalogue (MARC): huge, extensive, all purpose, one size fits all format
• Computer science
  • application-specific formats: refer, BibTeX, RFC-1807, etc.
Metadata Overview
Result: Every Community Creates Their Own Metadata
Archives: EAD (Encoded Archival Description)
Government: GILS (Government or Global Information Locator System)
IMS: Instructional Metadata System
TEI: Text Encoding Initiative - books and humanities; TEIH (TEI Header) used for metadata description
Dublin Core “Flavors”:
  EdNA http://www.edna.edu.au/edna/owa/info.getpage?sp=auto&pagecode=5210
  CIMI Guide to Best Practice: Dublin Core. Available as PDF from http://www.cimi.org/
Metadata Overview
MARC: Machine-readable cataloging; most library catalogs worldwide.
MPEG-7: Digital audio, video, and still image files. (In development. Committee draft due October 2000)
The Importance of Standards
“In moving from dispersed digital collections to interoperable digital libraries, the most important activity we need to focus on is standards… most important is the wide variety of metadata standards [including] descriptive metadata… administrative metadata…, structural metadata, and terms and conditions metadata…”
MARC
Leader: 01663ngm 22002771 4500
005: 19950927090218.0
007: vducgaiuu
008: 950927s1993 mau--- d vlfre d
Ctrl Numb 001 200312310
Cntl Iden 003 OBgNWOET
ISBN Numb 020 -- a 0300056958
Catl Orig 040 __ a OBgNWOET
     Tran        c OBgNWOET
Lang Summ 041 -- b fre
Titl Main 245 00 a A la recontre de Philippe
     GMD         h [videorecording] /
     Resp        c Massachusetts Institute of Technology ; written by Gilberte Furstenburg ; directed by Janet H. Murray ; software programmed by Stuart A. Malone.
Pubn City 260 __ a Cambridge, MA :
     Publ        b dist. by Annenberg/CPB.,
     Date        c 1993.
Desc Extn 300 __ a 1 laserdisc (CAV) :
     Othr        b sd., col. :
     Dimn        c 12 in. +
     Accm        e Teacher guide + 3 computer disks.
Note Genl 500 __ a Issued as videodisc.
Note Genl 500 __ a Title from cover.
Note Summ 520 __ a Provides an engaging way to sharpen comprehension skills. Students navigate through Paris neighborhoods and shops, dealing with friends, tradespeople, telephones and answering machines with the goal of finding an apartment for the hapless Philippe. Includes many helpful tools, such as self-testing exercises and an electronic glossary, visual and audio resources, including maps, telephones and newspapers which help students function within the story. Teachers can customize the program according to their students levels and abilities.
Note Targ 521 2_ a Senior high and college.
Note Targ 521 2_ a 09-adult.
Note Tech 538 -- a Macintosh computer ; system 6.0 or later ; 2 MB of RAM ; 3.5 MB of hard disk space ; videodisc player ; video monitor.
Subj Topc 650 _0 a Languages, Modern.
Subj Topc 650 _0 a Language and languages.
Subj Topc 658 _7 a Foreign languages, French.
     Srce        2 nwoet
Locn Coll 852 1_ a OBgNWOET
     SubA        b Northwest Ohio Media Center
     Clas        h 200312310
     BarC        p 200312310
MARC
005: 19950927090218.0
020 -- a 0300056958
040 __ a OBgNWOET
041 -- b fre
245 00 a A la recontre de Philippe
260 __ a Cambridge, MA :
300 __ a 1 laserdisc (CAV) :
       b sd., col. :
       c 12 in. +
       e Teacher guide + 3 computer disks.
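The earlier point that simpler formats can be derived from MARC can be sketched as a crosswalk table applied to the fields in the excerpt above. This is only an illustration: the tag-to-element table below is an assumption chosen for a handful of fields, not a complete or official mapping, and `marc_to_dc` is a hypothetical helper.

```python
# Illustrative crosswalk from a few MARC (tag, subfield) pairs to Dublin Core
# element names. The table is an assumption made for this sketch, not a
# complete or official mapping.
CROSSWALK = {
    ("020", "a"): "identifier",
    ("245", "a"): "title",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
    ("520", "a"): "description",
    ("650", "a"): "subject",
}


def marc_to_dc(fields):
    """Convert (tag, subfield, value) triples into a dict of DC element lists.

    Fields with no crosswalk entry are silently dropped, which is why the
    richer format is the better candidate for the archival copy.
    """
    record = {}
    for tag, subfield, value in fields:
        element = CROSSWALK.get((tag, subfield))
        if element is not None:
            record.setdefault(element, []).append(value)
    return record


marc_fields = [
    ("020", "a", "0300056958"),
    ("245", "a", "A la recontre de Philippe"),
    ("260", "a", "Cambridge, MA :"),  # no crosswalk entry, so it is lost
    ("260", "c", "1993."),
    ("650", "a", "Languages, Modern."),
    ("650", "a", "Language and languages."),
]
dc_record = marc_to_dc(marc_fields)
print(dc_record)
```

The conversion is one-way: the place of publication in 260 $a has no target in this table, so it cannot be recovered from the derived record.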
BibTeX
@InProceedings{dha96:pods,
  author    = {Chanda Dharap and C. Mic Bowman},
  title     = {Typed Structured Documents for Information Retrieval},
  booktitle = {Third International Workshop on Principles of Document Processing},
  year      = 1996,
  month     = sep,
  address   = {Palo Alto, California}
}
RFC-1807
BIB-VERSION:: CS-TR-v2.1
ID:: OUKS//CS-TR-91-123
ENTRY:: January 15, 1992
ORGANIZATION:: Oceanview University, Kansas, Computer Science
TYPE:: Technical Report
REVISION:: January 5, 1995; FTP access information added
TITLE:: Scientific Communication must be timely
AUTHOR:: Finnegan, James A.
CONTACT:: Prof. J. A. Finnegan, CS Dept, Oceanview Univ, Oceanview, KS 54321 Tel: 913-456-7890 <[email protected]>
AUTHOR:: Pooh, Winnie The
CONTACT:: 100 Aker Wood
DATE:: December 1991
PAGES:: 48
COPYRIGHT:: Copyright for the report (c) 1991, by J. A. Finnegan. All rights reserved. Permission is granted for any academic use of the report.
HANDLE:: hdl:oceanview.electr/CS-TR-91-123
OTHER_ACCESS:: url:http://electr.oceanview.edu/CS-TR-91-123
OTHER_ACCESS:: url:ftp://electr.oceanview.edu/CS-TR-91-123
RETRIEVAL:: send email to [email protected] with fax number
KEYWORD:: Scientific Communication
CR-CATEGORY:: D.0
CR-CATEGORY:: C.2.2 Computer Sys Org, Communication nets, Net Protocols
SERIES:: Communication
FUNDING:: FAS
CONTRACT:: FAS-91-C-1234
MONITORING:: FNBO
LANGUAGE:: English
NOTES:: This report is the full version of the paper with the same title in IEEE Trans ASSP Dec 1976
ABSTRACT::
Many alchemists in the country work on important fusion problems. All of them cooperate and interact with each other through the scientific literature. This scientific communication methodology has many advantages. Timeliness is not one of them.
END:: OUKS//CS-TR-91-123
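Because every field in the record above is introduced by an upper-case name followed by `::`, the format is easy to read by machine. A minimal sketch of a reader follows; `parse_rfc1807` is a hypothetical helper, and treating any line that does not start a field as a continuation of the previous one is a simplifying assumption rather than the full rules of the RFC.

```python
def parse_rfc1807(text):
    """Parse 'FIELD:: value' lines into an ordered list of (field, value) pairs.

    A list is used instead of a dict because fields such as AUTHOR repeat.
    Lines that do not start a new field are appended to the previous field,
    which is how the multi-line ABSTRACT is handled in this sketch.
    """
    pairs = []
    for line in text.splitlines():
        head, sep, rest = line.partition("::")
        # A field name is a non-empty run of upper-case letters, digits, '-' or '_'.
        is_field = bool(sep) and head != "" and all(
            c.isupper() or c.isdigit() or c in "-_" for c in head
        )
        if is_field:
            pairs.append([head, rest.strip()])
        elif pairs and line.strip():
            pairs[-1][1] = (pairs[-1][1] + " " + line.strip()).strip()
    return [(field, value) for field, value in pairs]


sample = """BIB-VERSION:: CS-TR-v2.1
ID:: OUKS//CS-TR-91-123
AUTHOR:: Finnegan, James A.
AUTHOR:: Pooh, Winnie The
ABSTRACT::
Many alchemists in the country work on important fusion problems.
All of them cooperate and interact with each other through the
scientific literature.
END:: OUKS//CS-TR-91-123"""

record = parse_rfc1807(sample)
authors = [value for field, value in record if field == "AUTHOR"]
print(authors)
```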
RFC-1807
BIB-VERSION:: CS-TR-v2.1
ID:: OUKS//CS-TR-91-123
ENTRY:: January 15, 1992
ORGANIZATION:: Oceanview University, Kansas, Computer Science
TYPE:: Technical Report
REVISION:: January 5, 1995; FTP access information added
TITLE:: Scientific Communication must be timely
AUTHOR:: Finnegan, James A.
CONTACT:: Prof. J. A. Finnegan, CS Dept, Oceanview Univ, Oceanview, KS 54321 Tel: 913-456-7890 <[email protected]>
AUTHOR:: Pooh, Winnie The
CONTACT:: 100 Aker Wood
DATE:: December 1991
PAGES:: 48
Dublin Core
• see: http://dublincore.org/documents/dces/
• came out of a 1995 joint OCLC/NCSA workshop in Dublin, Ohio
• goal:
  • create a metadata format that is easier to produce than traditional bibliographic formats (e.g., MARC) and that facilitates resource discovery
  • surely the world would be a better place if we all could just agree on a few simple tags to describe our resources…
• everything optional
• everything repeatable
Dublin Core, pre-XML
<META NAME="DC.title" CONTENT="Metadata: Enabling the Internet">
<META NAME="DC.subject" CONTENT="(SCHEME=keyword) Metadata, Dublin Core, PICS, Resource Discovery">
<META NAME="DC.author" CONTENT="(TYPE=name) Renato Iannella">
<META NAME="DC.author" CONTENT="(TYPE=email) [email protected]">
<META NAME="DC.author" CONTENT="(TYPE=affiliation) DSTC Pty Ltd">
<META NAME="DC.author" CONTENT="(TYPE=name) Andrew Waugh">
<META NAME="DC.author" CONTENT="(TYPE=email) [email protected]">
<META NAME="DC.author" CONTENT="(TYPE=affiliation) CSIRO">
<META NAME="DC.publisher" CONTENT="(TYPE=name) DSTC Pty Ltd">
<META NAME="DC.date" CONTENT="(TYPE=creation) (SCHEME=ISO31) 1997-01-20">
<META NAME="DC.date" CONTENT="(TYPE=current) (SCHEME=ISO31) 1997-01-20">
<META NAME="DC.form" CONTENT="(SCHEME=imt) text/html">
<META NAME="DC.identifier" CONTENT="(TYPE=url) <http://www.dstc.edu.au/RDU/reports/CAUSE97/>">
<META NAME="DC.language" CONTENT="(SCHEME=iso639) en">
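Tags in the style above are regular enough to be generated from a simple record. The following is a minimal sketch, assuming the record is held as a dict mapping element names to lists of values; `dc_meta_tags` is a hypothetical helper, and the sample values are copied from the tags above.

```python
from html import escape


def dc_meta_tags(record):
    """Render a dict of element -> list of values as one META tag per value."""
    lines = []
    for element, values in record.items():
        for value in values:
            # Escape quotes and angle brackets so the attribute stays valid HTML.
            lines.append(
                '<META NAME="DC.{}" CONTENT="{}">'.format(
                    element, escape(value, quote=True)
                )
            )
    return "\n".join(lines)


tags = dc_meta_tags({
    "title": ["Metadata: Enabling the Internet"],
    "author": ["(TYPE=name) Renato Iannella", "(TYPE=name) Andrew Waugh"],
    "language": ["(SCHEME=iso639) en"],
})
print(tags)
```

Repeated elements simply become repeated tags, which matches the way the two authors are encoded above.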
Dublin Core, XML-encoded
<?xml version="1.0"?>
<!DOCTYPE rdf:RDF SYSTEM "http://purl.org/dc/schemas/dcmes-xml-20000714.dtd">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description about="http://foo.edu/dl/report-1">
    <dc:title>Perpetual Motion Machine</dc:title>
    <dc:description>This report redefines physics.</dc:description>
    <dc:date>1998-10-10</dc:date>
    <dc:format>text/html</dc:format>
    <dc:language>en</dc:language>
    <dc:contributor>Kant, B. Reproduced</dc:contributor>
  </rdf:Description>
</rdf:RDF>
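One benefit of the XML encoding is that a generic XML parser can read it. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; `dc_elements` is a hypothetical helper, and the sample is a shortened copy of the record above with the DOCTYPE line left out so that the example does not depend on how a parser treats the external DTD.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

xml_text = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description about="http://foo.edu/dl/report-1">
    <dc:title>Perpetual Motion Machine</dc:title>
    <dc:date>1998-10-10</dc:date>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>"""


def dc_elements(xml_string):
    """Return a list of (resource URI, {element: [values]}) pairs."""
    root = ET.fromstring(xml_string)
    results = []
    for desc in root.findall("{%s}Description" % RDF_NS):
        elements = {}
        for child in desc:
            # Keep only children in the Dublin Core namespace.
            if child.tag.startswith("{%s}" % DC_NS):
                name = child.tag.split("}", 1)[1]
                elements.setdefault(name, []).append(child.text or "")
        results.append((desc.get("about"), elements))
    return results


records = dc_elements(xml_text)
print(records)
```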
Dublin Core: Then & Now
Then (13 elements):
1. subject
2. title
3. author
4. publisher
5. otheragent
6. date
7. objecttype
8. form
9. identifier
10. relation
11. source
12. language
13. coverage
Now (15 elements):
1. subject
2. title
3. creator
4. description
5. publisher
6. contributor
7. date
8. type
9. format
10. identifier
11. source
12. language
13. relation
14. coverage
15. rights
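The difference between the two lists is easier to see when it is computed. The sketch below compares the lists purely by name; it does not try to decide which old name corresponds to which new one, so reading the result as a set of renames is left to the reader.

```python
# The 13-element list and the 15-element list, in the order shown above.
THEN = ["subject", "title", "author", "publisher", "otheragent", "date",
        "objecttype", "form", "identifier", "relation", "source",
        "language", "coverage"]
NOW = ["subject", "title", "creator", "description", "publisher",
       "contributor", "date", "type", "format", "identifier", "source",
       "language", "relation", "coverage", "rights"]

unchanged = [name for name in THEN if name in NOW]
only_then = [name for name in THEN if name not in NOW]
only_now = [name for name in NOW if name not in THEN]

print("unchanged:", unchanged)
print("only in the old list:", only_then)
print("only in the new list:", only_now)
```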
The Dublin Core Metadata Element Set
• Title
• Subject
• Description
• Creator
• Publisher
• Contributor
• Date
• Type
• Format
• Identifier
• Source
• Language
• Relation
• Coverage
• Rights
Dublin Core: Subject
• What the work is about: possibly keywords, or terms from a classification scheme if available.
<META name = "DC.Subject"
content = "Middle Atlantic States - Maps - Early works to 1800 - Facsimiles"
scheme = "LCSH">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
LCSH = Library of Congress Subject Headings
The Core Elements
• Label: Format
• The physical or digital manifestation of the resource. Typically, Format may include the media-type or dimensions of the resource. Examples of dimensions include size and duration.
dublin core: coverage
• “The extent or scope of the content of the resource. Coverage will typically include spatial location (a place name or geographic co-ordinates), temporal period (a period label, date, or date range) or jurisdiction (such as a named administrative entity). Recommended best practice is to select a value from a controlled vocabulary.”
The Core Elements
• Label: Description
• An account of the content of the resource. Description may include but is not limited to: an abstract, table of contents.
Dublin Core: Source
• Is this object derived from another? Is this map a part of a larger map? Is this text a variation or revision of another piece of text?
<META name = "DC.Source"
content = "G3715 1685 .V5 1969"
scheme = "LCCN">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
LCCN = Library of Congress Call Number (note that the same abbreviation more commonly stands for Library of Congress Control Number)
dublin core: source
• “A Reference to a resource from which the present resource is derived. The present resource may be derived from the Source resource in whole or part. Recommended best practice is to reference the resource by means of a string or number conforming to a formal identification system”… “include in this area information about a resource that is related intellectually to the described resource but does not fit easily into a Relation element.”
Dublin Core: Language
• Language of the content of the resource
• For the map, there is no language content
<META
name = "DC.Language"
content = "nl"
>
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Relation
• To what other object(s) or collection is this object related? Does it also exist in another collection? Is it derived from another document or image? How is it related?
<META name = "DC.Relation"
content = "isPartOf
>
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Creator
• Person or organization responsible for the Intellectual Content of this object
<META
name = "DC.Creator"
content = "Nicolaum Visscher"
>
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
dublin core: creator
• “An entity primarily responsible for making the content of the resource. Examples of a Creator include a person, an organization, or a service. Typically the name of the Creator should be used to indicate the entity.”
• “Creators should be listed separately, preferably in the same order that they appear in the publication.”
Dublin Core: Publisher
• Entity responsible for making the resource available in its present form
• Not shown in the example, but should be something like this:
<META name = "DC.Publisher"
content = "Library of Congress American Memory Project"
>
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Contributor
• Any entity making a contribution to this object.
• Example: someone who added some information to the original document or image
• No entry for this map.
dublin core: rights
• “Information about rights held in and over the resource. Typically a Rights element will contain a rights management statement for the resource, or reference a service providing such information.”
• “Rights information often encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. If the rights element is absent, no assumptions can be made about the status of these and other rights with respect to the resource.”
Dublin Core: Date
• Date on which this object was made available in its present form, possibly the date it was entered into this digital collection.
<META
name = "DC.DATE"
content = "1996-04-17"
scheme = "ISO 8601"
>
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Specify the date format so that others can interpret it correctly
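Declaring the scheme is what lets software interpret the value without guessing. A minimal sketch follows, assuming the only scheme handled is ISO 8601 in its YYYY-MM-DD form; `parse_dc_date` is a hypothetical helper built on the standard `datetime` module.

```python
from datetime import date


def parse_dc_date(value, scheme):
    """Parse a DC.Date value whose declared scheme is ISO 8601 (YYYY-MM-DD)."""
    # Accept "ISO 8601", "ISO8601" and "iso 8601" as the same scheme label.
    if scheme.replace(" ", "").upper() != "ISO8601":
        raise ValueError("unsupported date scheme: " + scheme)
    return date.fromisoformat(value)


d = parse_dc_date("1996-04-17", "ISO 8601")
print(d.year, d.month, d.day)
```

A value such as "04/17/1996" would be rejected with a `ValueError` rather than silently read as either April 17 or an invalid month 17.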
Dublin Core: Type or Category
• What sort of thing is this? Some examples: home page, novel, poem, working paper, technical report, essay, dictionary, …
• Type should be selected from a controlled list. For example, see the DCMI Type Vocabulary:
Why is this recommended as a controlled vocabulary field?
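One practical answer is that values drawn from a controlled list can be checked and normalized mechanically, which free-text values such as "working paper" cannot. The sketch below assumes the twelve terms of the DCMI Type Vocabulary as currently published by DCMI; `normalize_type` is a hypothetical helper.

```python
# Terms of the DCMI Type Vocabulary (assumed here to be the current set of twelve).
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}

# Lookup table so that matching ignores case and surrounding whitespace.
_BY_LOWER = {term.lower(): term for term in DCMI_TYPES}


def normalize_type(value):
    """Return the canonical vocabulary term for value, or None if it is not in the list."""
    return _BY_LOWER.get(value.strip().lower())


print(normalize_type("image"))
print(normalize_type("technical report"))
```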
Dublin Core: Unique ID
• The key for this object in the collection.
• I cannot find one for the map we are looking at, but the ID for the map of which it is a part is g3715 ct000001
• The Metadata specification for that would be
<META name = "DC.Id"
content = "g3715 ct000001">
dublin core: title
• “The name given to the resource. Typically, a Title will be a name by which the resource is formally known.”
• “If in doubt about what constitutes the title, repeat the Title element.”
Example
• Title="The Sound of Music"
• Creator="Melendez Santiago, Maria Luz"
• Subject="Dogs"
• Description="Illustrated guide to airport markings and lighting signals, with particular reference to SMGCS (Surface Movement Guidance and Control System) for airports with low visibility conditions"
• Publisher="University of Miami. Dept. of Economics"
• Date="1998-02-16"
• Type="image"
• Format="image/gif 4kB"
Example
• Identifier="0385424728" [ISBN]
• Source="RC607.A26W574 1996" [where "RC607.A26W574 1996" is the call number of the print version of the resource, from which the present version was scanned]
• Language=en
• Language=fr
• Relation="IsPartOf Two Lives" [collection of two novellas, one of which is "Reading Turgenev"]
• Coverage=1995-1996
• Rights="Access limited to members."
Characteristics of the Dublin Core
• All elements optional
• All elements repeatable
• Elements may be displayed in any order
• Extensible
• International in scope
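The first two characteristics suggest a natural data structure: a mapping from element name to a possibly empty list of values. A minimal sketch under that assumption follows; the `DublinCoreRecord` class is hypothetical and is not taken from any DCMI specification or library.

```python
from collections import defaultdict

DC_ELEMENTS = (
    "title", "subject", "description", "creator", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


class DublinCoreRecord:
    """A record in which every element is optional and repeatable."""

    def __init__(self):
        self._values = defaultdict(list)

    def add(self, element, value):
        """Append one more value; calling this again repeats the element."""
        if element not in DC_ELEMENTS:
            raise KeyError(element)
        self._values[element].append(value)

    def get(self, element):
        """Return all values, or an empty list if the element was never set."""
        return list(self._values.get(element, []))


rec = DublinCoreRecord()
rec.add("language", "en")
rec.add("language", "fr")
print(rec.get("language"))
print(rec.get("rights"))
```

Nothing has to be filled in before the record is usable, and asking for an absent element returns an empty list instead of failing.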
Derivations
Dublin Core Principles
• Dumb-Down
• One-to-One
• Appropriate Values
Dumb-Down
• The fifteen core elements are usable with or without qualifiers
• Qualifiers make elements more specific:
  • Element Refinements narrow meanings, never extend them
  • Encoding Schemes give context to element values
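Because refinements only narrow an element's meaning, a client that does not understand a qualifier can ignore it and still treat the value as belonging to the bare element. A minimal sketch of that step follows, assuming qualified names are written as `element.qualifier`; `dumb_down` is a hypothetical helper.

```python
def dumb_down(qualified):
    """Collapse names such as 'title.alternative' to the bare element 'title'."""
    simple = {}
    for name, values in qualified.items():
        element = name.split(".", 1)[0]  # drop everything after the first dot
        simple.setdefault(element, []).extend(values)
    return simple


qualified = {
    "title": ["A Thousand Wheels are Set in Motion"],
    "title.alternative": [
        "The Building of Georgia Tech at the Turn of the 20th Century, 1888-1908"
    ],
}
simple = dumb_down(qualified)
print(simple)
```

The result is less precise, since the alternative title is now just another title, but it is still correct, which is the point of the principle.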
Appropriate Values
“Best practice for a particular element or qualifier may vary by context, but in general an implementor cannot always predict that the interpreter of the metadata will always be a machine. This may impose certain constraints on how metadata is constructed, but the requirement of usefulness for discovery should be kept in mind.”
-- from “Using Dublin Core”
Dumbing Down
The One-to-One Principle
• Describe one manifestation of a resource with one record
  • Ex.: a digital image of the Mona Lisa is not described as if it were the same as the original painting
• Separate descriptions of resources from descriptions of the agents responsible for those resources
  • Ex.: email addresses and affiliations of creators are attributes of the creator, not the resource
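The Mona Lisa example can be sketched as two separate records joined by a link rather than one merged record. The identifiers, the media type, and the `resolve_source` helper below are all hypothetical and serve only to illustrate the principle.

```python
# One record per manifestation: the painting and its digital image.
painting = {
    "identifier": "urn:example:painting:mona-lisa",    # hypothetical identifier
    "title": "Mona Lisa",
    "type": "PhysicalObject",
}
digital_image = {
    "identifier": "urn:example:image:mona-lisa-scan",  # hypothetical identifier
    "title": "Mona Lisa (digital image)",
    "type": "StillImage",
    "format": "image/jpeg",                            # assumed media type
    "source": painting["identifier"],                  # a link, not a merge
}

catalog = {rec["identifier"]: rec for rec in (painting, digital_image)}


def resolve_source(record, catalog):
    """Follow a record's 'source' link to the record it was derived from."""
    return catalog.get(record.get("source"))


original = resolve_source(digital_image, catalog)
print(original["title"])
```

Properties of the file, such as its media type, stay on the image record, and properties of the painting stay on the painting record.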
References/Further reading
• Dublin Core Metadata Initiative: http://dublincore.org/
• Dempsey, Lorcan. “Scientific, Industrial, and Cultural Heritage: a shared approach” http://www.ariadne.ac.uk/issue22/dempsey/
• Cathro, Warwick. “Smashing the silos: towards convergence….” http://www.nla.gov.au/nla/staffpaper/2001/cathro2.html
• Baker, Tom. “A Grammar of Dublin Core” http://www.dlib.org/dlib/october00/baker/10baker.html
• Heery, Rachel & Manjula Patel. “Application Profiles” http://www.ariadne.ac.uk/issue25/app-profiles/
• Dekkers, Makx & Stuart Weibel. “DCMI Progress Report & Workplan for 2002” http://www.dlib.org/dlib/february02/weibel/02weibel.html
Metadata Overview
RDF / Dublin Core in XML
<?xml:namespace href="http://www.w3c.org/RDF/" as="RDF"?>
<?xml:namespace href="http://purl.org/RDF/DC" as="DC"?>
<?xml:namespace href="http://loc.gov/LCNAF" as="LCNAF"?>
<?xml:namespace href="http://loc.gov/LCSH" as="LCSH"?>
<RDF:RDF>
<RDF:Description RDF:HREF="http://purl.org/metadata/dublin_core_elements">
  <DC:Title>A Thousand Wheels are Set in Motion</DC:Title>
  <DC:Title.Alternative>The Building of Georgia Tech at the Turn of the 20th Century, 1888-1908</DC:Title.Alternative>
  <DC:Creator.CorporateName>
    <RDF:Description>
      <LCNAF:CorporateName>Georgia Tech Library and Information Center</LCNAF:CorporateName>
    </RDF:Description>
  </DC:Creator.CorporateName>
Metadata Overview
  <DC:Subject>
    <RDF:Description>
      <LCSH:CorporateName>Georgia Institute of Technology--Buildings</LCSH:CorporateName>
    </RDF:Description>
  </DC:Subject>
  <DC:Description>This Web site provides photographs, engravings and sketches of the first buildings on the Georgia Tech Campus, from 1888-1908. As of 9/20/1999, 88 images are provided but more will be added. Cataloged in EAD Single Item Metadata (SIM) format.</DC:Description>
  <RDF:Seq>
    <RDF:Description>
      <RDF:LI><LCSH:PersonalName>Chritton, Heather</LCSH:PersonalName></RDF:LI>
      <RDF:LI><LCSH:PersonalName>Crafts, Laurel</LCSH:PersonalName></RDF:LI>
    </RDF:Description>
  </RDF:Seq>