Cooperative Project with Library of Congress on Preservation of Digital Geospatial Data Steve Morris Head of Digital Library Initiatives NCSU Libraries
Jan 12, 2016
Note: Percentages based on the actual number of respondents to each question 2
NC Geospatial Data Archiving Project(NCGDAP)
Partnership between NCSU Libraries and NC Center for Geographic Information & Analysis
$520,000 funding over 3 years
Focus on state and local geospatial content in North Carolina (state demonstration)
Address NC OneMap objective: “Historic and temporal data will be maintained and available.”
One of eight projects in the first NDIIPP funding round: “Building a Network of Partners”
NDIIPP Overview
National Digital Information Infrastructure and Preservation Program
Congress appropriated $100 million for this effort and instructed the Library to spend an initial $25 million to develop and execute a congressionally approved strategic plan
Eight initial projects, 2004-2007: web pages, cultural heritage, numeric data, video, business records, mixed content, geospatial (2)
Developing partnerships and identifying issues
Extensive interaction among NDIIPP projects
Targeted Content
Resource Types
GIS “vector” (point/line/polygon) data
Digital orthophotography
Digital maps
Tabular data (e.g. assessment data)
Content Producers
Mostly state, local, regional agencies
Some university, not-for-profit, commercial
Selected local federal projects
Risks to Digital Geospatial Data
.shp
.mif
.gml
.e00
.dwg
.dgn
.bsb
.bil
.sid
Risks to Digital Geospatial Data
Focus on current data: archiving data does not guarantee “permanent access”
Future support of data formats in question: need to migrate formats or allow for emulation
Data failure: “bit rot”, media failure
Preservation metadata requirements: descriptive, administrative, technical, DRM
Shift to “streaming data” for access
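The “bit rot” risk above is commonly addressed with fixity checks: record a checksum when a file enters the archive and recompute it later. The following is a minimal sketch of that idea, not the project's actual workflow; the function names and the sample file name are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_digest):
    """True if the file still matches the digest recorded at ingest."""
    return sha256_of(path) == recorded_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name; the bytes merely stand in for real data.
        sample = Path(tmp) / "parcels_2004.shp"
        sample.write_bytes(b"example bytes standing in for a shapefile")
        recorded = sha256_of(sample)          # stored at ingest time
        print(fixity_ok(sample, recorded))    # file unchanged
        sample.write_bytes(b"example bytes standing in for a shapefilE")
        print(fixity_ok(sample, recorded))    # file altered after ingest
```

A check like this detects silent corruption but does not repair it; repair still depends on keeping more than one copy.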
Time series – vector data
Parcel Boundary Changes 2001-2004, North Raleigh, NC
Time series – Ortho imagery
Vicinity of Raleigh-Durham International Airport 1993-2002
Today’s geospatial data as tomorrow’s cultural heritage
Earlier NCSU Acquisition Efforts
NCSU University Extension project 2000-2001
Target: County/city data in eastern NC
“Digital rescue” not “digital preservation”
Project learning outcomes
Confirmed concerns about long-term access
Need for efficient inventory/acquisition
Wide range in rights/licensing
Need to work within statewide infrastructure
Acquired experience; unanticipated collaboration
One Earlier Project Outcome: Directory of County and City Services
Among top 15 most used resources on library web site
99.5% of directory users from outside ncsu.edu
NDIIPP Project Phases
Content Identification and Selection
Content Acquisition
Partnership Building
Content Retention and Transfer
All 8 NDIIPP cooperative projects adhere to this structure
Content Identification and Selection
Work from NC OneMap Data Inventory
Combine with inventory information from various state agencies and from previous NCSU efforts
Develop methodology for selecting from among “early,” “middle,” and “late” stage products
Develop criteria for time series development
Investigate use of emerging Open Geospatial Consortium technologies in data identification
Content Acquisition
Work from NC OneMap Data Sharing Agreements as a starting point (the “blanket”)
Secure individual agreements (the “quilt”)
Investigate use of OGC technologies in capture
Use METS (Metadata Encoding and Transmission Standard) as a metadata wrapper
Bundle data files, metadata, ancillary documentation
Supplement FGDC metadata with additional administrative, technical, and descriptive metadata
Encode rights (Digital Rights Management, DRM)
Links to services
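To make the “metadata wrapper” idea concrete, here is a rough sketch of a METS document that points to an FGDC metadata record and lists the bundled data files. It uses only a small subset of METS elements (`dmdSec`, `mdRef`, `fileSec`, `structMap`), has not been validated against the METS schema, and is not the project's actual profile; the object identifier and file names are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets(object_id, fgdc_href, data_files):
    """Build a skeletal METS wrapper (illustrative subset only)."""
    href = f"{{{XLINK_NS}}}href"
    root = ET.Element(m("mets"), {"OBJID": object_id})

    # Descriptive metadata: reference the external FGDC record.
    dmd = ET.SubElement(root, m("dmdSec"), {"ID": "DMD1"})
    ET.SubElement(dmd, m("mdRef"),
                  {"LOCTYPE": "URL", "MDTYPE": "FGDC", href: fgdc_href})

    # File inventory and a flat structural map tying files to the metadata.
    file_sec = ET.SubElement(root, m("fileSec"))
    group = ET.SubElement(file_sec, m("fileGrp"), {"USE": "data"})
    struct = ET.SubElement(root, m("structMap"))
    div = ET.SubElement(struct, m("div"), {"DMDID": "DMD1"})

    for index, location in enumerate(data_files, start=1):
        file_id = f"FILE{index}"
        entry = ET.SubElement(group, m("file"), {"ID": file_id})
        ET.SubElement(entry, m("FLocat"), {"LOCTYPE": "URL", href: location})
        ET.SubElement(div, m("fptr"), {"FILEID": file_id})
    return root


if __name__ == "__main__":
    # Hypothetical identifier and file names, for illustration only.
    wrapper = build_mets("example-parcels-2004", "parcels_2004.xml",
                         ["parcels_2004.shp", "parcels_2004.dbf"])
    print(ET.tostring(wrapper, encoding="unicode"))
```

Rights information and the additional administrative and technical metadata mentioned above would go in an `amdSec`, which this sketch omits.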
Partnership Building
Work within context of the NC OneMap initiative
Explore state, local, federal partnerships
Defined characteristic: “Historic and temporal data will be maintained and available”
Advisory Committee drawn from the NC Geographic Information Coordinating Council subcommittees
Seek external partners
National States Geographic Information Council
FGDC Historical Data Committee
… more
Content Retention and Transfer
Ingest into DSpace open source digital repository software
Look more generically at the issue of putting geospatial content into digital repositories
Investigate re-ingest into a second platform
Start to define format migration paths
Special problem: geodatabases
Pursue long-term solution
Roles of data producing agencies, state agencies; NC OneMap; NCSU
Big Geoarchiving Challenges
Format migration paths
Management of data versions over time
Preservation metadata
Preserving cartographic representation
Keeping content repository-agnostic
Preserving geodatabases
Harnessing geospatial web services
More …
Vector Data Format Issues
Vector data much more complicated than image data
‘Preservation’ vs. ‘Permanent access’
An ‘open’ pile of XML might make an archive, but if using it requires a team of programmers to do digital archaeology then it does not provide permanent access
Piles of XML need to be widely understood piles
GML: need widely accepted application schemas (like OSMM?)
The Geodatabase conundrum
Export feature classes, and lose topology, annotation, relationships, etc.
… or use the Geodatabase as the primary archival platform (some are now thinking this way)
Geography Markup Language Issues
GML still more useful as a transfer format than an archival format; support limited even for transfer
FGDC Historical Data Working Group investigations into GML for use in archiving
Plans for environmental scan of existing GML profiles and application schemas, recording for each:
schema name (e.g. OSMM, top10NL, ESRI GML, LandGML)
responsible agency; schema has official government status?
GML version; known unsupported GML components
schema history; known interoperation with other schemas
vendor support; translator support
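One way to picture the environmental scan is as one record per schema, with a field for each item listed above. The sketch below is only an illustration of such a record; the class name, field names, and the helper are assumptions, and no actual scan results are filled in.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class GmlSchemaRecord:
    """One row of a hypothetical GML schema scan (fields mirror the list above)."""
    schema_name: str
    responsible_agency: Optional[str] = None
    official_government_status: Optional[bool] = None
    gml_version: Optional[str] = None
    unsupported_gml_components: List[str] = field(default_factory=list)
    schema_history: Optional[str] = None
    interoperates_with: List[str] = field(default_factory=list)
    vendor_support: List[str] = field(default_factory=list)
    translator_support: List[str] = field(default_factory=list)


def missing_fields(record):
    """Names of fields not yet filled in (None or empty list)."""
    return [name for name, value in asdict(record).items()
            if value is None or value == []]


if __name__ == "__main__":
    # Only the schema name comes from the slide; everything else is left open.
    record = GmlSchemaRecord(schema_name="LandGML")
    print(missing_fields(record))
```

Keeping unanswered items explicit makes it easy to see which schemas still need investigation before any archival recommendation is made.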
Managing Time-versioned Content
Many local agency data layers continuously updated
Older versions not generally available
Individual versioned datasets will wander off from the archive
How do users “get current metadata/DRM/object” from a versioned dataset found “in the wild”?
How do we certify concurrency and agreement between the metadata and the data?
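One partial answer to the last question is to record, for each dataset version, digests of both the data and its metadata in a single binding record. The sketch below assumes that approach; the record layout, function names, and identifiers are hypothetical. It can show only that a data/metadata pair is unchanged since the record was made, not that the metadata accurately describes the data.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def make_binding(dataset_id, version_date, data_bytes, metadata_bytes):
    """Tie one version of a dataset to the metadata issued with it."""
    return {
        "dataset_id": dataset_id,
        "version_date": version_date,
        "data_sha256": digest(data_bytes),
        "metadata_sha256": digest(metadata_bytes),
    }


def binding_matches(binding, data_bytes, metadata_bytes):
    """True if both parts are byte-identical to what was bound."""
    return (binding["data_sha256"] == digest(data_bytes)
            and binding["metadata_sha256"] == digest(metadata_bytes))


if __name__ == "__main__":
    # Hypothetical identifier and contents, for illustration only.
    data = b"parcel geometry for one snapshot"
    metadata = b"<metadata>snapshot description</metadata>"
    binding = make_binding("example-county-parcels", "2004-01-01",
                           data, metadata)
    print(json.dumps(binding, indent=2))
    print(binding_matches(binding, data, metadata))
    print(binding_matches(binding, data, b"<metadata>edited</metadata>"))
```

A copy found “in the wild” could then be checked against the archive's binding records to find which version it is and whether its metadata is the one originally paired with it.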
Preservation Metadata Issues
FGDC Metadata
Many flavors, incoming metadata needs processing
Other standards: PREMIS, MODS
Metadata wrapper
METS (Metadata Encoding and Transmission Standard) vs. other industry solutions
Need a geospatial industry solution for the ‘METS-like problem’
GeoDRM a likely trigger: wrapper to enforce licensing (MPEG 21 references in OGC Web Services 3)
Preserving Cartographic Representation
The true counterpart of the old map is not the GIS dataset, but rather the cartographic representation that builds on that data:
Intellectual choices about symbolization, layer combinations
Data models, analysis, annotations
Cartographic representation typically encoded in proprietary files (.avl, .lyr, .apr, .mxd) that do not lend themselves well to migration
Symbologies have meaning to particular communities at particular points in time; preserving information about symbol sets and their meaning is a different problem
Preserving Cartographic Representation
Image-based approaches (“desiccated data”)
Generate images using Map Book or similar tools
Harvest existing atlas images
Capture atlases from WMS servers
Export ‘layouts’ or ‘maps’ to image
Vector-based approaches
Store explicitly in the data format (e.g. Feature Class Representation in ArcGIS 9.2)
Archive and upward-migrate existing files .avl, .apr, .lyr, .mxd, etc.
SVG, VML or other XML approaches
Other?
Preserving Geodatabases
Not just data layers and attributes—also topology, annotation, relationships, behaviors
ESRI Geodatabase archival issues: XML Export, Geodatabase History, File Geodatabase, Geodatabase Replication
Growing use of geodatabases by municipal, county agencies
Some looking to Geodatabase as archival platform (in addition to feature class export)
Geodatabase Availability
According to the 2003 Local Government GIS Data Inventory, 10.0% of all county framework data and 32.7% of all municipal framework data were managed in Geodatabase format.
[Charts: street centerline formats reported by cities and by counties, each broken down into Geodatabase, Shapefile, Coverage, and Other]
Evolving Geodatabase Handling Approaches
Project Stage: Planned Approach
Original Proposal (Nov. 2003): Export feature classes as shapefiles; archive Geodatabases less than 2 GB in size
Finalized Work Plan (Dec. 2004): Also export content as Geodatabase XML
Possible Future Work Plan Changes: Explore maintenance of some archival content in Geodatabase form; explore Geodatabase replication as an archive development approach; archive Geodatabases of unlimited size
Harnessing Geospatial Web Services
Automated content identification: ‘capabilities files,’ registries, catalog services
WMS (Web Map Service) for batch extraction of image atlases
last ditch capture option
preserve cartographic representation
retain records of decision-making process
… feature services (WFS) later.
Rights issues in the web services space are ambiguous
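Batch extraction of an image atlas from a WMS server amounts to issuing one GetMap request per tile of a grid. The sketch below only builds the request URLs, using standard WMS 1.1.1 GetMap parameters, and does not contact any server. The server address, layer name, and grid layout are hypothetical, and rights clearance is assumed to have been handled separately.

```python
from urllib.parse import urlencode


def getmap_urls(base_url, layer, bbox, nx, ny, tile_width_px=1024,
                srs="EPSG:4326", image_format="image/png"):
    """Build WMS 1.1.1 GetMap URLs covering bbox with an nx-by-ny tile grid.

    bbox is (minx, miny, maxx, maxy) in the units of the given SRS.
    """
    minx, miny, maxx, maxy = bbox
    dx = (maxx - minx) / nx
    dy = (maxy - miny) / ny
    # Keep the image aspect ratio equal to the tile's ground aspect ratio.
    tile_height_px = max(1, round(tile_width_px * dy / dx))

    urls = []
    for row in range(ny):
        for col in range(nx):
            x0 = minx + col * dx
            y0 = miny + row * dy
            params = {
                "SERVICE": "WMS",
                "VERSION": "1.1.1",
                "REQUEST": "GetMap",
                "LAYERS": layer,
                "STYLES": "",  # empty value requests the default style
                "SRS": srs,
                "BBOX": f"{x0},{y0},{x0 + dx},{y0 + dy}",
                "WIDTH": tile_width_px,
                "HEIGHT": tile_height_px,
                "FORMAT": image_format,
            }
            urls.append(f"{base_url}?{urlencode(params, safe=',/:')}")
    return urls


if __name__ == "__main__":
    # Hypothetical server and layer name, for illustration only.
    for url in getmap_urls("https://example.org/wms", "parcels",
                           (-79.0, 35.5, -78.0, 36.0), nx=2, ny=2):
        print(url)
```

Saving the request URL alongside each captured image keeps a record of how the atlas was produced, which supports the “retain records of decision-making process” goal.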
Partnerships
ESRI: discussing software requirements; meetings with development teams April 2005
Open Geospatial Consortium (OGC): meet with Architecture Working Group Nov. 2005
National Archives and Records Administration: investigations into GML for archiving; planned presentation to NARA technology team
FGDC Historical Data Working Group: general geospatial data preservation issues
Partnerships
EDINA (University of Edinburgh, UK): NCSU is Associate Partner on UK project for geospatial institutional repositories
UC Santa Barbara & Stanford University: other NDIIPP geospatial project
EROS Data Center: planned site visit
Project visits to regional GIS groups: Albemarle Regional GIS meeting Nov. 3
More planned …
Progress to Date
Completion of project agreements
Hiring staff
Acquisition and deployment of storage system (12.4 TB capacity – two 16.8 TB systems)
Testing and deployment of repository software
Development of metadata workflow
Development of ingest workflow
Pilot project with NC Geologic Survey data
… Initial focus on developing the “plumbing”
Questions for You?
What are your current practices for:
Archiving data and managing time versions
Managing geodatabase versions
Transfer mechanisms for data
• to regional entities?
• to off-site storage for disaster recovery?
Archiving project files and finished products
What rights issues exist with regard to putting county and city data into an archive?
What would you like this project to do?
Ways to Participate in NCGDAP
Identifying data for inclusion in the repository
Discussing data format strategies
Sharing ideas about archiving approaches and architectures
Sharing and identifying concerns about rights issues, liability, etc.
Hosting project visits to regional GIS groups
Using the Local Government GIS listserv to discuss preservation issues?
Questions?
Contact:
Steve Morris
Head, Digital Library Initiatives
NCSU Libraries
[email protected]
http://www.lib.ncsu.edu/ncgdap