National Geospatial Digital Archive Greg Janée University of California at Santa Barbara

Jan 03, 2016

National Geospatial Digital Archive

Greg Janée
University of California at Santa Barbara

Greg Janée • DCC seminar • 2005-09-27 2

A misadventure in preservation

• 1976
  – Viking probes go to Mars
  – soil data is analyzed for evidence of life

• 1999
  – USC neurobiologist Joseph Miller asks for data
  – NASA has data on tape!

• But...
  – tapes coded “in a format so old that the programmers who knew it had died”

Paradox of preservation

• Is the data valuable?
  – yes: had to travel to another planet to get it

• Is the data being used?
  – no
  – perhaps never again

• How much am I willing to pay for its preservation?
  – as close to zero as possible

Is it worth preserving?

• Keith’s equation*:
  – (current value) = (intrinsic value) - (cost to use)

• Greg’s equation:
  – item is worth preserving for time duration T if:
    • (intrinsic value) * Prob_T(usage) > T(preservation costs) + (cost to use)

*apologies to Keith Johnson, Stanford libraries
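
The two equations above can be tried out with a minimal Python sketch. It rests on two assumptions about the slide's notation: that Prob_T(usage) is the probability that the item is used at some point during T, and that T(preservation costs) is the total cost of preserving the item over T. The function names and the numbers are hypothetical, chosen only for illustration.

```python
def current_value(intrinsic_value, cost_to_use):
    """Keith's equation: value today is intrinsic value minus the cost to use."""
    return intrinsic_value - cost_to_use


def worth_preserving(intrinsic_value, prob_usage, preservation_cost, cost_to_use):
    """Greg's equation, under the assumptions stated in the lead-in.

    prob_usage: assumed probability of any use during the period T.
    preservation_cost: assumed total cost of preservation over T.
    """
    return intrinsic_value * prob_usage > preservation_cost + cost_to_use


# Hypothetical numbers: a valuable item that is unlikely to be used.
likely_used = worth_preserving(1000, 0.10, 50, 20)    # 100 > 70
rarely_used = worth_preserving(1000, 0.01, 50, 20)    # 10 > 70 fails
```

With a fixed intrinsic value, the only way to keep a rarely used item on the right side of the inequality is to push preservation cost toward zero, which matches the paradox on the previous slide.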

Project genesis

• NDIIPP
  – Library of Congress, 2000
  – $100M
  – http://www.digitalpreservation.gov/

• NGDA
  – UCSB (MIL) & Stanford (Branner Library)
  – $2.6M, 3 years
  – geospatial data
  – http://www.ngda.org/

Project goal

• “How can we preserve geospatial data on a national scale and make it available to future generations?”

• No focus on a particular collection

• Geospatial data
  – discrete chunks
  – relatively highly-structured, well-defined
  – but 90% of our work is generic

Idea #1

• Archival has to be cheap & easy
  – must be distributed
  – little incentive, no funding
  – not sexy

NGDA approach

• Compromise: define cheap archive
  – fundamental approach: preservation by co-archival of object semantics
  – ingest: one step up from crawling
  – web access
  – notable for what’s missing: discovery, usability

• Foundation for additional functionality
  – e.g., migration
  – prototype archives will offer ADL, OAI access

Idea #2

• Archival systems must be designed with their own demise in mind
  – archival objects will long outlive any system that manages them
  – system-level migrations will occur
  – at inopportune times

Typical repository architecture

[Diagram; labels: system, storage, database, handle resolver; marked “fragile”]

NGDA architecture

[Diagram; labels: storage subsystem (standard, public data model); archival system (databases, caches, etc.); ingest (bulk loader); access (ADL, OAI, Web)]

Post-NGDA architecture

[Diagram; labels: storage subsystem (standard, public data model); Web]

Storage system requirements

• Req’s:
  – associate UUIDs/RIDs with bitstreams
  – retrieve global/local bitstream by UUID/RID
  – determine (parent) UUID of any bitstream
  – list all UUIDs

• Satisfied by:
  – any filesystem
  – any kind of UUIDs
    • tag:library.ucsb.edu,2005:identifier
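
To make the claim that any filesystem can meet these requirements concrete, here is a minimal Python sketch. It is not the NGDA implementation: the `FileStore` class, its method names, and the one-directory-per-UUID layout are all hypothetical, and it assumes an RID is an identifier local to its parent object. Identifiers are percent-encoded so that a tag URI can serve as a directory name.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote, unquote


class FileStore:
    """Hypothetical store: one directory per UUID, one file per RID."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, uuid, rid):
        # Percent-encode both identifiers so any string is a legal file name.
        return self.root / quote(uuid, safe="") / quote(rid, safe="")

    def put(self, uuid, rid, data):
        """Associate a bitstream with a (UUID, RID) pair."""
        path = self._path(uuid, rid)
        path.parent.mkdir(exist_ok=True)
        path.write_bytes(data)
        return path

    def get(self, uuid, rid):
        """Retrieve a bitstream by UUID and RID."""
        return self._path(uuid, rid).read_bytes()

    def parent_uuid(self, path):
        """Determine the parent UUID of a stored bitstream from its path."""
        return unquote(Path(path).parent.name)

    def list_uuids(self):
        """List all UUIDs known to the store."""
        return sorted(unquote(p.name) for p in self.root.iterdir() if p.is_dir())


# Usage with the identifier from the slide; the component name is made up.
store = FileStore(tempfile.mkdtemp())
uuid = "tag:library.ucsb.edu,2005:identifier"
path = store.put(uuid, "manifest.xml", b"<manifest/>")
```

Because the layout is just directories and files, the stored objects remain readable with ordinary tools even if the code that wrote them is gone.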

Archival objects

[Diagram; labels: manifest (UUID), component (RID), UUID]

Archival object representation

• Components are files
• Manifest is an XML document

• Other approaches
  – OAIS: archival information packages (AIPs)
  – XMLtape
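
As a rough illustration of "components are files, manifest is an XML document", the sketch below writes and reads a tiny manifest with Python's standard `xml.etree.ElementTree`. The slides do not show the manifest schema, so the element names (`manifest`, `component`), the attribute names (`uuid`, `rid`), and the component file names are all hypothetical.

```python
import xml.etree.ElementTree as ET


def build_manifest(object_uuid, component_rids):
    """Serialize a manifest that names the object and lists its components."""
    root = ET.Element("manifest", {"uuid": object_uuid})
    for rid in component_rids:
        # Each component is a separate file, referenced here by its RID.
        ET.SubElement(root, "component", {"rid": rid})
    return ET.tostring(root, encoding="unicode")


def read_manifest(manifest_xml):
    """Parse a manifest back into (object UUID, list of component RIDs)."""
    root = ET.fromstring(manifest_xml)
    return root.get("uuid"), [c.get("rid") for c in root.findall("component")]


# Hypothetical object with two component files.
xml_text = build_manifest(
    "tag:library.ucsb.edu,2005:identifier",
    ["data.tif", "metadata.xml"],
)
parsed = read_manifest(xml_text)
```

Keeping the manifest as a plain XML file next to its components means the object can be interpreted without the archival system that created it.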

Ingest

• Ingest template defines
  – common structure of objects to be ingested
  – necessary validations
  – associations to other objects
    • assumes pre-loading of semantic definitions
  – policies, rights, etc.

• Represents choke point
  – requires human evaluation

Format registry

• We’re developing one
  – who isn’t?

• Serves as archive of format specifications

• How broadly to interpret “format”?
  – traditional file format
  – product
  – series, collection, arbitrary set

Format dependencies

• Consider dependency graph induced by format specifications

• Def: a format is recoverable if the format of its specification is recoverable

• Axioms: plain text, HTML are recoverable

[Diagram of the dependency graph; node labels: PDF, HTML, GIF, GeoTIFF, CSS, plain text, TIFF, “desiccated” version]
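
The recursive definition of recoverability can be sketched in a few lines of Python. This is one possible reading, not the project's code: it treats each format as having a single specification format, and it treats a cycle or a missing specification as not recoverable. The function name and the edges in `spec_format` are illustrative assumptions and are not taken from the slide's diagram.

```python
# Axioms from the slide: these formats are taken to be recoverable as-is.
AXIOMS = frozenset({"plain text", "HTML"})


def is_recoverable(fmt, spec_format, axioms=AXIOMS, _seen=None):
    """Return True if fmt is an axiom or its specification's format is recoverable.

    spec_format maps a format name to the format its specification is written in.
    """
    seen = set() if _seen is None else _seen
    if fmt in axioms:
        return True
    if fmt in seen or fmt not in spec_format:
        # A cycle (e.g. a spec written in its own format) or an unarchived
        # specification gives no path back to an axiom.
        return False
    seen.add(fmt)
    return is_recoverable(spec_format[fmt], spec_format, axioms, seen)


# Hypothetical edges, for illustration only.
spec_format = {
    "GeoTIFF": "HTML",   # assumed: specification archived as HTML
    "TIFF": "PDF",       # assumed: specification archived as PDF
    "PDF": "PDF",        # assumed: specification archived in its own format
}

geotiff_ok = is_recoverable("GeoTIFF", spec_format)
tiff_ok = is_recoverable("TIFF", spec_format)
```

Under these made-up edges the TIFF chain ends in a self-referential PDF node, so it only becomes recoverable if a specification in an axiom format (plain text or HTML) is archived somewhere along the chain.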

Challenges

• Making ingest easy, easier, easier-er, ...

• GIS formats
  – very complex: topology, layer, coverage, project
  – proprietary

• MODIS
  – multiple petabytes
  – format (HDF) is not well-defined
  – moving to on-demand computation of products
  – lineage important
  – copious additional semantics

Misadventure, redux

• What if there had been an NGDA-like solution?
  – format specification would have been archived

• Limitations
  – data not necessarily immediately usable
  – format specification itself not necessarily viewable

• But limitations can be addressed according to usage, available resources

Questions for you

• Archival systems
  – definition? functionality?

• Storage systems
  – definition? functionality?

• Archival object representation
  – discrete files vs. AIPs?

• GIS formats
  – “desiccated” form?