Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where, When and Who?

Dec 22, 2015

Page 1: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Prof. Ray R. Larson

University of California, Berkeley, School of Information

Developing a Metadata Infrastructure for Information Access: What, Where, When and Who?

Page 2: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Overview

• Metadata as Infrastructure
– What, Where, When and Who?

• What are Entry Vocabulary Indexes?
– Notion of an EVI
– How are EVIs Built

• Time Period Directories
– Mining Metadata for new metadata

• 4W Demo

• New Project: Bringing Lives to Light

Page 3: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Metadata as Infrastructure

The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?

Page 4: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Metadata as Infrastructure

The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.

The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.

Page 5: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

What?

Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents.

Two kinds of mapping in every search:

• Documents are assigned to topic categories, e.g. Dewey

• Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers.

Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.

Page 6: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Texts

‘What’ searches involve mapping to controlled vocabularies

Thesaurus/Ontology

Page 7: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Start with a collection of documents.

Building a Search Term Recommender

Page 8: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Classify and index with controlled vocabulary.

Or use a pre-indexed collection.

Index

Page 9: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Problem: Controlled vocabularies can be difficult for people to use.

“pass mtr veh spark ign eng”

Index

Use “Economic Policy” in Library of Congress subject headings for “Wirtschaftspolitik”

Page 10: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Solution: Entry Level Vocabulary Indexes.

Index

EVI: “pass mtr veh spark ign eng” = “Automobile”

Page 11: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

“What” and Entry Vocabulary Indexes

EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…

Page 12: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Building and Searching EVIs

Searching

• User has question but is unfamiliar with the domain he wants to search.
• User selects a subject domain of interest. Domains to select from: Engineering, Medicine, Biology, Social science, etc.
• Has an Entry Vocabulary Module been built?
– YES: Use an existing EVI.
– NO: Build an Entry Vocabulary Module (EVI).
• Map user’s query to ranked list of controlled vocabulary terms.
• User selects search terms from the ranked list of terms returned by the EVI.

Building an Entry Vocabulary Module (EVI)

• Internet DB indexed with a controlled vocabulary.
• Download a set of training data.
• Part of speech tagging (for noun phrases).
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.

Page 13: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Technical Details

Building an Entry Vocabulary Module (EVI)

• Internet DB indexed with a controlled vocabulary.
• Download a set of training data.
• Part of speech tagging (for noun phrases).
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.

Page 14: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Association Measure

        C     ¬C
t       a     b
¬t      c     d

Where t is the occurrence of a term and C is the occurrence of a class in the training set
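The four cells can be counted directly from a training set. The following is a minimal sketch, assuming each training record has already been reduced to a set of extracted terms and a set of assigned controlled-vocabulary classes; the function names and the record layout are illustrative assumptions, not part of the original project code. Here a counts records with both the term and the class, b records with the term but not the class, c records with the class but not the term, and d records with neither.

```python
from collections import Counter


def build_counts(records):
    """Count document frequencies from (terms, classes) training records.

    `records` is a hypothetical layout: one (terms, classes) pair per document.
    """
    n_docs = 0
    term_df = Counter()   # number of records containing each term
    class_df = Counter()  # number of records assigned each class
    pair_df = Counter()   # number of records with both term and class
    for terms, classes in records:
        n_docs += 1
        terms, classes = set(terms), set(classes)
        term_df.update(terms)
        class_df.update(classes)
        pair_df.update((t, c) for t in terms for c in classes)
    return n_docs, term_df, class_df, pair_df


def contingency(term, cls, n_docs, term_df, class_df, pair_df):
    """Return the cells (a, b, c, d) of the 2x2 table for one term/class pair."""
    a = pair_df[(term, cls)]     # term and class together
    b = term_df[term] - a        # term without the class
    c = class_df[cls] - a        # class without the term
    d = n_docs - a - b - c       # neither
    return a, b, c, d
```

For example, with three records of which two carry both "automobile" and the class "pass mtr veh spark ign eng", `contingency` returns a = 2 and d = 1 for that pair.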

Page 15: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Association Measure

Maximum Likelihood ratio

\[
W(C,t) = 2\left[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)\right]
\]

where

\[
\log L(p, k, n) = k \log p + (n - k)\log(1 - p)
\]

and

\[
p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}
\]

Vis. Dunning
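The measure is straightforward to compute from the four cell counts. The sketch below assumes the usual convention that 0·log(0) is taken as 0 and that both rows of the table are non-empty (a+b > 0 and c+d > 0); the function names are illustrative, not taken from the original system.

```python
import math


def log_l(p, k, n):
    """k*log(p) + (n - k)*log(1 - p), treating 0*log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def association(a, b, c, d):
    """Likelihood-ratio association W(C, t) for a 2x2 term/class table.

    Assumes a + b > 0 and c + d > 0 (the term occurs in some records
    and is absent from some others).
    """
    p1 = a / (a + b)               # share of term records that carry the class
    p2 = c / (c + d)               # share of non-term records that carry the class
    p = (a + c) / (a + b + c + d)  # overall share of records with the class
    return 2.0 * (
        log_l(p1, a, a + b) + log_l(p2, c, c + d)
        - log_l(p, a, a + b) - log_l(p, c, c + d)
    )
```

When term and class are independent (for example a = b = c = d) the value is 0; the stronger the association between term and class, the larger the value, so classes can be ranked for a given term by W.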

Page 16: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Alternatively

Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion
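A rough illustration of this alternative: each class's evidence terms form a pseudo-document, and a query is ranked against those pseudo-documents with an ordinary retrieval model. The sketch uses TF-IDF weights with cosine similarity as one possible choice; the slides do not say which IR model was used, and the function name and input layout are assumptions.

```python
import math
from collections import Counter


def rank_classes(query_terms, class_evidence):
    """Rank classes by cosine similarity between the query and each class's
    evidence terms, using TF-IDF weights (an assumed, generic IR model).

    `class_evidence` maps a class label to the list of terms seen with it.
    """
    n = len(class_evidence)
    df = Counter()
    for terms in class_evidence.values():
        df.update(set(terms))

    def idf(term):
        # Smoothed inverse document frequency, always positive.
        return math.log((1 + n) / (1 + df[term])) + 1.0

    def vector(terms):
        tf = Counter(terms)
        return {t: tf[t] * idf(t) for t in tf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = vector(query_terms)
    scored = [(cosine(q, vector(terms)), label)
              for label, terms in class_evidence.items()]
    # Best match first; classes sharing no term with the query are dropped.
    return [(label, score) for score, label in sorted(scored, reverse=True)
            if score > 0]
```

The top-ranked labels can then be offered to the user as candidate classes, or added to the query for expansion.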

Page 17: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Find Plutonium

In Arabic Chinese Greek Japanese Korean Russian Tamil

Statistical association

W(c,t) = 2[logL(p1,a,a+b) ...]

Digital library resources

Page 18: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

EVI example

User Query: “Automobile”

EVI 1
Index term: “pass mtr veh spark ign eng”

EVI 2
Index term: “automobiles” OR “internal combustion engines”

Page 19: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

But why stop there?

Index

EVI

Page 20: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

“Which EVI do I use?”

[Figure: several indexes, each paired with its own EVI]

Page 21: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

EVI to EVIs

[Figure: several indexes, each paired with its own EVI, with a further EVI (EVI2) over the EVIs]

Page 22: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Find Plutonium

In Arabic Chinese Greek Japanese Korean Russian Tamil

Why not treat language the same way?

Page 23: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Support for the Learner with a Query

Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages

Any catalog: Archives, Libraries, Museums, TV, Publishers

Facet   Vocabulary                        Displays
WHAT    Thesaurus, e.g. LCSH              Cross-references
WHERE   Gazetteer                         Map
WHEN    Period directory                  Timeline
WHO     Biograph. dict., e.g. Who’s Who   Personal relations

Page 24: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Texts

Numeric datasets

It is also difficult to move between different media forms

Thesaurus/Ontology

EVI

Page 25: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Searching across data types

Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results

Page 26: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Texts

Numeric datasets

But texts associated with numeric data can be mapped as well…

Thesaurus/Ontology

captions

EVI

EVI

Page 27: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Texts

Numeric datasets

But there are also geographic dependencies…

Thesaurus/Ontology

captions

Maps/Geo Data

EVI

EVI

Page 28: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

WHERE: Place names are problematic…

• Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . .
• Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.
• Name changes: Bombay → Mumbai.
• Homographs: Vienna, VA, and Vienna, Austria;
– 50 Springfields.
• Anachronisms: No Germany before 1870
• Vague, e.g. Midwest, Silicon Valley
• Unstable boundaries: 19th century Poland; Balkans; USSR

Use a gazetteer!
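As a small illustration of what a gazetteer lookup has to handle, the sketch below indexes every known name form of a place, so that variant forms and changed names resolve to one entry while homographs return several candidates. The `Place` record, the function names and the coordinates (approximate values for illustration) are assumptions, not the gazetteer used in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str             # preferred name
    kind: str             # feature type
    lat: float            # approximate latitude
    lon: float            # approximate longitude
    variants: tuple = ()  # other names and spellings


# Tiny illustrative gazetteer built from the examples on the slide.
GAZETTEER = [
    Place("Saint Petersburg", "city", 59.94, 30.31,
          ("St. Petersburg", "Санкт Петербург", "Saint-Pétersbourg")),
    Place("Cluj", "city", 46.77, 23.59, ("Klausenburg", "Kolozsvar")),
    Place("Mumbai", "city", 19.08, 72.88, ("Bombay",)),
    Place("Vienna, VA", "town", 38.90, -77.27, ("Vienna",)),
    Place("Vienna, Austria", "city", 48.21, 16.37, ("Vienna",)),
]


def build_index(gazetteer):
    """Map every name form (case-insensitively) to the places that use it."""
    index = {}
    for place in gazetteer:
        for name in (place.name, *place.variants):
            index.setdefault(name.casefold(), []).append(place)
    return index


def lookup(index, name):
    """Return all candidate places for a name; homographs give several."""
    return index.get(name.casefold(), [])
```

Looking up "Bombay" resolves to the Mumbai entry, while "Vienna" returns two candidates that the user (or a map display) must disambiguate.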

Page 29: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map.

Timebar

Page 30: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Zoom on map. Click on place for a list of records. Click on record to display text.

Page 31: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Texts

Numeric datasets

So geographic search becomes part of the infrastructure

Thesaurus/Ontology

Gazetteers

captions

Maps/Geo Data

EVI

Page 32: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

WHEN: Search by time is also weakly supported…

• Calendars are the standard for time
• But people use the names of events to refer to time periods
• Named time periods resemble place names in being:
– Unstable: European War, Great War, First World War
– Multiple: Second World War, Great Patriotic War
– Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc.
• Places have temporal aspects & periods have geographical aspects: when the Stone Age was varies by region

Page 33: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Vocabularies are the key!

Want: Kung-fu movies?
Use LCSH: Hand-to-hand fighting, oriental, in motion pictures.

Linking vocabularies WHAT, WHERE, WHEN

Library subject headings: Topic – Geographic subdivision – Chronological subdivision

Place name gazetteer: Place name – Type – Spatial markers (Lat & long) – When

Time Period Directory: Period name – Type – Time markers (Calendar) – Where
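A time period directory entry mirrors a gazetteer entry: a name, a type, calendar markers, and a place. The sketch below shows how the Where and When fields can disambiguate a named period such as "Civil war"; the `Period` record and `find_periods` function are hypothetical names for illustration, and the three entries are well-known examples rather than data from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    name: str   # period name
    kind: str   # type of period
    start: int  # first calendar year
    end: int    # last calendar year
    where: str  # place the period applies to


# Illustrative entries for the ambiguous name "Civil war".
DIRECTORY = [
    Period("English Civil War", "war", 1642, 1651, "England"),
    Period("American Civil War", "war", 1861, 1865, "USA"),
    Period("Spanish Civil War", "war", 1936, 1939, "Spain"),
]


def find_periods(directory, name_part, where=None, year=None):
    """Find periods whose name contains `name_part`, optionally narrowed
    by place (WHERE) and by a calendar year falling inside the period (WHEN)."""
    hits = []
    for period in directory:
        if name_part.lower() not in period.name.lower():
            continue
        if where is not None and period.where != where:
            continue
        if year is not None and not (period.start <= year <= period.end):
            continue
        hits.append(period)
    return hits
```

A bare search for "civil war" returns all three entries; adding a place or a year narrows it to one, which is the link between the gazetteer and the period directory that the slide describes.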

Page 34: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Texts

Numeric datasets

Time period directories link via the place (or time)

Thesaurus/Ontology

Gazetteers

captions

Maps/Geo Data

EVI

Time Period Directory Time lines, Chronologies

Page 35: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

WHEN: Time Period Directory Timeline

Link to Catalog

Link to Wikipedia

Page 36: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

WHO: Biographical Dictionary Complex relationships

Life events metadata

WHAT: Actions – prisoner

WHERE: Places – Holstein

WHEN: Times – 1261-1262

WHO: People – Margaret Sambiria

Need external links

Page 37: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Any document, object, or performance

Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages

Any catalog: Archives, Libraries, Museums, TV, Publishers

Connect it with its context – and other resources.

Facet   Vocabulary                        Displays
WHAT    Thesaurus, e.g. LCSH              Cross-references
WHERE   Gazetteer                         Map
WHEN    Period directory                  Timeline
WHO     Biograph. dict., e.g. Who’s Who   Personal relations

Page 38: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Demo of search interface

Page 39: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Entry Vocabulary Index suggests correct LCSH with different spelling

Page 40: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Related places

Page 41: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Potentially related people

Page 42: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Potentially related periods

Page 43: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Mostly in India 16th-18th century

Page 44: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Find out more about this area.

Page 45: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Different Browsing Options!

Page 46: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Zooming in to South Asia

Restricting time frame

Select

Page 47: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

More information about the country of India…

Page 48: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

More information about the country of India…

Wikipedia

CIA Factbook

BBC

Ethnologue

Berkeley Natural History Museums

Page 49: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Historical events – linked to Library catalog & Wikipedia : none avail. for this time period

Page 50: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

ECAI Cultural Atlases: presenting history in its geographical & chronological contexts

Page 51: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Mongol Empire Video

Page 52: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Demo Interface

http://ecai.berkeley.edu/imls2004/imls4w/

Page 53: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

New Project: Bringing Lives to Light:

Biography in Context

Ray R. Larson, Michael Buckland, Fredric Gey

University of California, Berkeley

Page 54: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Overview

Focussing on the Who in Who, What, Where and When

Types of Biographical Markup

Page 55: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

WHEN, WHERE and WHO

Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

Page 56: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Place and time are broadly important across numerous tools and genres including, e.g., Language atlases, Library catalogs, Biographical dictionaries, Bibliographies, Archival finding aids, Museum records, etc.

Biographical dictionaries are also heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.

Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.
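The episode view can be made concrete with a simple record per life event. The sketch below encodes the Emanuel Goldberg entry above as WHAT/WHERE/WHEN/WHO tuples; the `LifeEvent` structure and helper function are illustrative assumptions, not the markup the project eventually adopted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifeEvent:
    what: str        # activity
    where: str       # place
    start: int       # first year (WHEN)
    end: int         # last year (WHEN)
    who: tuple = ()  # other people involved


# The Goldberg entry from the slide, as a series of episodes.
GOLDBERG = [
    LifeEvent("Born", "Moscow", 1881, 1881),
    LifeEvent("PhD, Univ. of Leipzig", "Leipzig", 1906, 1906, ("Wilhelm Ostwald",)),
    LifeEvent("Director, Zeiss Ikon", "Dresden", 1926, 1933),
    LifeEvent("Moved", "Palestine", 1937, 1937),
    LifeEvent("Died", "Tel Aviv", 1970, 1970),
]


def events_during(events, year):
    """Return the episodes that include the given calendar year."""
    return [e for e in events if e.start <= year <= e.end]
```

Each field is a hook into one of the other vocabularies: `where` into a gazetteer, `start` and `end` into a time period directory, and `who` into other biographical entries.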

Page 57: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Texts

Numeric datasets

A new form of biographical dictionary would link to all

Thesaurus/Ontology

Gazetteers

captions

Maps/Geo Data

EVI

Time Period Directory Time lines, Chronologies

Biographical Dictionary

Page 58: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Projected Work

Develop XML markup for Biographical Events

Most likely to be adaptation and extension of existing biographical event markup
– Example: EAC/EAD

Harvest biographical resources – Wikipedia, etc.

Integrate as next generation of current interface

Page 59: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

EAC/EAD

<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1915</date>
      <event>A.B., <corpname>Yale University, </corpname>New Haven, Conn.</event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname> </event>
    </chronitem>
    <chronitem>
      <date>1917-1919</date>
      <event>Served in <corpname>United States Army</corpname></event>
    </chronitem>
  </chronlist>
</bioghist>
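Markup of this kind can be turned into WHEN/WHAT/WHERE/WHO records with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on two of the items above; the output dictionary layout and the function name are assumptions for illustration, not the project's harvesting code.

```python
import xml.etree.ElementTree as ET

# Two chronitem elements taken from the example above.
SAMPLE = """
<bioghist>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname> </event>
    </chronitem>
  </chronlist>
</bioghist>
"""


def chron_events(xml_text):
    """Extract one record per <chronitem>: date, event text, places, people."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("chronitem"):
        date = item.findtext("date", default="").strip()
        event = item.find("event")
        # Flatten the mixed content of <event> and normalize whitespace.
        text = " ".join("".join(event.itertext()).split()) if event is not None else ""
        places = [g.text.strip() for g in item.iter("geogname") if g.text]
        people = [p.text.strip() for p in item.iter("persname") if p.text]
        rows.append({"when": date, "what": text, "where": places, "who": people})
    return rows
```

Calling `chron_events(SAMPLE)` yields one dictionary per chronology item, with the tagged place and person names pulled out so they can be linked to a gazetteer or to other biographical entries.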

Page 60: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Wikipedia data

Life events metadata

WHAT: Actions – prisoner

WHERE: Places – Holstein

WHEN: Times – 1261-1262

WHO: People – Margaret Sambiria

Need external links

Page 61: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,
Page 62: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

A Metadata Infrastructure

CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers

RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages

INTERMEDIA INFRASTRUCTURE

Facet   Authority Control         Special Display Tools
WHO     Biographical Dictionary
WHEN    Time Period Directory     Timelines
WHERE   Gazetteer                 Maps
WHAT    Thesaurus                 Syndetic Structure

Learners

Dossiers

Page 63: Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,

Acknowledgements

Electronic Cultural Atlas Initiative project

This work is being supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries

Contact: [email protected]