Mixed content, mixed metadata: Information discovery in the NSDL

Dec 20, 2015


Experience from American Memory and NSDL

Caroline R. Arms and William Y. Arms

Mixed content, mixed metadata: information discovery in a messy world

In Metadata in Practice, Editors: Diane Hillmann and Elaine Westbrooks, ALA Editions (forthcoming)

The Integration Task is to provide a coherent set of collections and services across great diversity (all digital collections relevant to science education).

The National Science Digital Library

http://nsdl.org/

Mixed Content

Examples: NSDL-funded collections at Cornell

Atlas. Data sets of earthquakes, volcanoes, etc.

Reuleaux. Digitized kinematics models from the nineteenth century

Laboratory of Ornithology. Sound recording, images, videos of birds and other animals.

Nuprl. Logic-based tools to support programming and to implement formal computational mathematics.

Effective Information Discovery Before Digital Information

Searching

(a) Resources separated into categories of related materials. Each category organized, indexed and searched separately.

(b) Catalogs and indexes built on tightly controlled metadata standards, e.g., MARC, MeSH headings, etc.

(c) Search engines used Boolean operators and fielded searching.

(d) Query languages and search interfaces assumed a trained user.

(e) Resources were physical items.
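
As an illustration of points (b) and (c), the following minimal sketch shows Boolean retrieval over fielded records. The records, field names, and the `build_fielded_index` and `lookup` helpers are hypothetical and are not taken from any particular catalog system.

```python
from collections import defaultdict

# Hypothetical catalog records with controlled fields (illustrative data only).
RECORDS = {
    1: {"title": "volcanoes of the pacific", "subject": "geology"},
    2: {"title": "bird song recordings", "subject": "ornithology"},
    3: {"title": "earthquakes and volcanoes", "subject": "geology"},
}


def build_fielded_index(records):
    """Map each (field, term) pair to the set of record ids containing it."""
    index = defaultdict(set)
    for rec_id, fields in records.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(rec_id)
    return index


def lookup(index, field, term):
    """Return the ids of records whose given field contains the term."""
    return set(index.get((field, term.lower()), set()))


INDEX = build_fielded_index(RECORDS)

# Query: subject:geology AND title:volcanoes AND NOT title:earthquakes
hits = (
    lookup(INDEX, "subject", "geology") & lookup(INDEX, "title", "volcanoes")
) - lookup(INDEX, "title", "earthquakes")
print(sorted(hits))
```

Set intersection and difference stand in for AND and NOT. A record either matches or it does not, so the user must know the fields and the vocabulary to build a useful query.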

Effective Information Discovery With Homogeneous Digital Information

Comprehensive metadata with Boolean retrieval. Can be excellent for well-understood categories of material, but requires standardized metadata and relatively homogeneous content (e.g., MARC catalog).

Full text indexing with ranked retrieval. Can be excellent, but methods were developed and validated for relatively homogeneous textual material (e.g., TREC ad hoc track).
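
For contrast with Boolean matching, here is a minimal sketch of ranked retrieval using TF-IDF weights and cosine similarity. This is one common textbook formulation, chosen for illustration; the documents and function names are hypothetical, and the slides do not say which ranking model any given system used.

```python
import math
from collections import Counter

# Hypothetical full-text documents (illustrative data only).
DOCS = {
    "d1": "earthquake data sets and volcano data",
    "d2": "bird sound recordings and bird images",
    "d3": "volcano images",
}


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Return per-document TF-IDF vectors and the IDF table."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(t for toks in tokenized.values() for t in set(toks))
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for doc_id, toks in tokenized.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return ids of documents with a nonzero score, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


ranked = rank("volcano images", DOCS)
print(ranked)
```

Every document is scored on text similarity alone, which is the "all documents are assumed equal" assumption that works for homogeneous text collections.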

Mixed Metadata: the Chimera of Standardization

Technical reasons

(a) Characteristics of formats and genres

(b) Differing user needs

Social and cultural reasons

(a) Economic factors

(b) Installed base

Cross-Domain Metadata

Dublin Core

"... indexes [such as Lycos] are most useful in small collections within a given domain. As the scope of their coverage expands, indexes succumb to problems of large retrieval sets and problems of cross-disciplinary semantic drift. Richer records, created by content experts, are necessary to improve search and retrieval."

[Weibel 1995]

Information Discovery in a Messy World

Web search engines have adapted to a very large scale. Other techniques, such as cross-domain metadata and federated searching, have failed to scale up.

• What new concepts and techniques have enabled this adaptation?

• What can we learn that is applicable to other information discovery tasks?

• How is NSDL making use of this understanding?

Information Discovery in a Messy World

Building blocks

Brute force computation

The expertise of users -- human in the loop

Methods

(a) Better understanding of how and why users seek information

(b) Relationships and context information

(c) Multi-modal information discovery

(d) User interfaces for exploring information

Understanding How and Why Users Seek Information

Homogeneous content

All documents are assumed equal

Criterion is relevance (binary measure)

Goal is to find all relevant documents (high recall)

Hits ranked in order of similarity to query

Mixed content

Some documents are more important than others

Goal is to find most useful documents on a topic and then browse

Hits ranked in order that combines importance and similarity to query
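
One simple way to rank by a combination of importance and similarity is a weighted linear blend, sketched below. The blend, the `weight` parameter, and the example scores are all assumptions made for illustration; the slides do not specify how the two signals are combined.

```python
def combined_score(similarity, importance, weight=0.3):
    """Blend query similarity with query-independent importance.

    `weight` is a hypothetical tuning parameter in [0, 1]: 0 ranks on
    similarity alone, 1 ranks on importance alone.
    """
    return (1.0 - weight) * similarity + weight * importance


# (name, similarity to query, importance), with invented scores in [0, 1].
candidates = [
    ("obscure page", 0.9, 0.1),
    ("well-known reference", 0.7, 0.9),
    ("off-topic portal", 0.1, 0.95),
]

ranked = sorted(
    candidates,
    key=lambda c: combined_score(c[1], c[2]),
    reverse=True,
)
print([name for name, _, _ in ranked])
```

With these numbers the well-known reference outranks a slightly closer but obscure match, while the important but off-topic portal still falls to the bottom.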

Relationship and Contextual Information

Methods for capturing context

Analysis of citations and links (e.g., PageRank)

Mining usage logs (e.g., customers who buy the same product)

Reviews (e.g., reputation management)

Structural relationships (e.g., domain names)
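
To make the link-analysis example concrete, here is a minimal power-iteration sketch of PageRank in its usual textbook form. The link graph, the damping factor of 0.85, and the even redistribution of rank from pages with no outgoing links are conventional assumptions, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages point at "c", and "c" points back at "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
print(max(ranks, key=ranks.get))
```

The score depends only on the link structure, not on the query, so it can serve as an importance signal for content that has little or no descriptive metadata.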

Multi-Modal Information Discovery

With mixed content and mixed metadata, the amount of information about the various resources varies greatly, but clues from many different sources can be combined.

"The fundamental premise of the research was that the integration of these technologies, all of which are imperfect and incomplete, would overcome the limitations of each, and improve the overall performance in the information retrieval task."

[Wactlar, 2000]
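
A minimal sketch of combining imperfect and incomplete clues: each resource carries whatever evidence happens to be available, and the combined score is a weighted mean over the signals that are present. The signal names, the weights, and the `fuse` helper are hypothetical; this is not the method used in the cited research.

```python
def fuse(evidence, weights):
    """Weighted mean over the signals that are present.

    A missing key or a value of None means "no evidence from this source",
    which is different from a score of zero.
    """
    total = 0.0
    weight_sum = 0.0
    for name, weight in weights.items():
        value = evidence.get(name)
        if value is None:
            continue
        total += weight * value
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Hypothetical relative trust in each source of evidence.
WEIGHTS = {"metadata": 1.0, "full_text": 2.0, "links": 1.0}

# A data set with a metadata record and inbound links but no indexable text.
score = fuse({"metadata": 0.8, "full_text": None, "links": 0.4}, WEIGHTS)
print(round(score, 3))
```

Skipping absent signals, rather than counting them as zero, keeps a resource from being penalized only because it has no full text or no metadata record.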

User Interfaces for Exploring Information

Search index

Return hits

Browse content

Return objects

NSDL: The Spectrum of Interoperability

Level        Agreements                              Example

Federation   Strict use of standards                 AACR, MARC, Z39.50
             (syntax, semantic, and business)

Harvesting   Digital libraries expose metadata;      Open Archives metadata harvesting
             simple protocol and registry

Gathering    Digital libraries do not cooperate;     Web crawlers and search engines
             services must seek out information

The NSDL Repository

[Diagram labels: Users; Services; NSDL Repository; Collections]

The repository is a resource for service providers.

It holds information about every collection and item known to the NSDL, including contextual information.

NSDL Search Service: First Phase

[Diagram labels: Portal (three instances); Search and Discovery Service; Collections; NSDL Repository; SDLIP; harvest; crawl; Inquery -> Lucene]

NSDL Search Service: First Phase

Approach

(a) Collections map metadata to Dublin Core, provide via Open Archives protocol.

(b) Search service augments Dublin Core metadata with indexing of full-text where available.

(c) User interface returns snippets derived from the metadata, links to full content and to metadata.
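
Step (a) can be sketched as a crosswalk from a collection's native fields to Dublin Core elements, together with the kind of request a harvester sends. The `ListRecords` verb and the `oai_dc` metadata prefix come from the Open Archives protocol (OAI-PMH); the base URL, the native field names, and the crosswalk itself are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical crosswalk from one collection's native fields to Dublin Core.
CROSSWALK = {
    "name": "title",
    "maker": "creator",
    "topic": "subject",
    "summary": "description",
}


def to_dublin_core(native_record):
    """Map native fields to Dublin Core elements, dropping unmapped fields."""
    dc = {}
    for field, value in native_record.items():
        element = CROSSWALK.get(field)
        if element is not None:
            # Dublin Core elements are repeatable, so keep values in lists.
            dc.setdefault(element, []).append(value)
    return dc


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


record = to_dublin_core(
    {"name": "Kinematic model", "topic": "mechanisms", "shelf": "B2"}
)
url = list_records_url("https://example.org/oai")  # placeholder base URL
print(record)
print(url)
```

Fields with no Dublin Core equivalent (here, `shelf`) are lost in the mapping. That loss is one concrete reason a Dublin Core record carries less information than the collection's native metadata.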

NSDL Search Service: First Phase

Weaknesses

(a) Ranking by similarity to query not sufficient.

(b) Snippets do not indicate why item was returned (e.g., terms in full text but not in metadata).

(c) Dublin Core records provide limited information.

(d) Browsing environment limited.

(e) Most users begin their search with a Web search engine (e.g., Google)

NSDL Search Service: Second Phase Developments

Metadata

(a) Accept any metadata that is available in a range of formats

(b) System for reviews and annotations, with reputation management

Search system

(a) Multimodal retrieval and ranking

(b) Dynamic generation of snippets by search engine
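
Dynamic snippet generation can be sketched as extracting a window of text around the first query term found in the indexed content, so that the snippet shows why the item matched. The `make_snippet` function, its `width` parameter, and the sample text are hypothetical; production search engines use more elaborate passage selection.

```python
def make_snippet(text, query, width=5):
    """Return a window of words around the first query term found in text.

    Falls back to the opening words when no query term occurs.
    """
    words = text.split()
    terms = {term.lower() for term in query.split()}
    for i, word in enumerate(words):
        # Ignore surrounding punctuation when comparing against query terms.
        if word.lower().strip(".,;:()") in terms:
            start = max(0, i - width)
            end = min(len(words), i + width + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return " ".join(words[: 2 * width + 1])


TEXT = (
    "The collection holds digitized kinematic models built in the "
    "nineteenth century for teaching."
)
snippet = make_snippet(TEXT, "kinematic", width=2)
print(snippet)
```

Because the snippet is drawn from the text that actually matched, it also covers the case where the query terms occur in the full text but not in the metadata record.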

NSDL Search Service: Second Phase Developments (cont.)

Usability and human factors

(a) Wider range of browsing tools (e.g., collection visualization)

(b) Filters by education level and education quality, where known

Web compatibility

(a) Expose records for Web crawlers to index

(b) Browser bookmarklet to add NSDL information to Web pages
