Page 1: Tthornton code4lib

EAD without XSLT: a practical approach to archival finding aids

Trevor Thornton

Senior Applications Developer, NYPL Labs

The New York Public Library

Page 2: Tthornton code4lib

Project goals

• Enable multiple presentations of the same data

• Support dynamic web applications

• Cross-collection search with component-level specificity in results, and faceting on common access points

Page 3: Tthornton code4lib

System overview

Ruby on Rails

+ MySQL

+ Solr

Key functionality:

Data Import

Search index

API

Page 4: Tthornton code4lib

Core models

Page 5: Tthornton code4lib

Collection model

Each collection:

• must have one description

• may have one or more components

• may be associated with one or more access terms

Page 6: Tthornton code4lib

Component model

Each component:

• must belong to one collection

• must have one description

• may have one parent component

• may have one or more child components

• may be associated with one or more access terms

Page 7: Tthornton code4lib

Component hierarchy attributes

• collection_id (id of root collection)

• parent_id (id of parent component)

• sib_seq (sibling sequence)

• level_num (numeric level within hierarchy)

• level_text (series, sub-series, file, etc.)

• has_children

• max_levels

• top_component_id

Computed after initial data import; provided as a convenience for finding aid UIs and to streamline formulation of API responses
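
A minimal sketch of how such attributes could be computed once import has finished. It is not taken from the NYPL code: the `Component` struct and the `compute_hierarchy` method are hypothetical, `sib_seq` is assumed to already reflect document order, and a top-level component is assumed to be its own `top_component_id`.

```ruby
# Hypothetical in-memory stand-in for component rows; in the Rails app these
# would presumably be ActiveRecord objects.
Component = Struct.new(:id, :collection_id, :parent_id, :sib_seq,
                       :level_num, :has_children, :top_component_id,
                       keyword_init: true)

# Walks each tree from its top-level component down, filling in level_num,
# has_children and top_component_id. Returns the deepest level found, which
# could be stored as max_levels.
def compute_hierarchy(components)
  children = components.group_by(&:parent_id)
  max_levels = 0
  visit = lambda do |node, level, top_id|
    node.level_num = level
    node.top_component_id = top_id
    kids = (children[node.id] || []).sort_by(&:sib_seq)
    node.has_children = !kids.empty?
    max_levels = level if level > max_levels
    kids.each { |kid| visit.call(kid, level + 1, top_id) }
  end
  (children[nil] || []).sort_by(&:sib_seq).each { |top| visit.call(top, 1, top.id) }
  max_levels
end

# Made-up sample data: two top-level components, one of them three levels deep.
HIERARCHY_SAMPLE = [
  Component.new(id: 1, collection_id: 10, parent_id: nil, sib_seq: 1),
  Component.new(id: 2, collection_id: 10, parent_id: 1, sib_seq: 1),
  Component.new(id: 3, collection_id: 10, parent_id: 2, sib_seq: 1),
  Component.new(id: 4, collection_id: 10, parent_id: nil, sib_seq: 2)
].freeze
MAX_LEVELS = compute_hierarchy(HIERARCHY_SAMPLE)
```

Storing these values, rather than deriving them per request, means a UI can indent or collapse a component without walking the tree again.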

Page 8: Tthornton code4lib

Description model

Elements of description organized (roughly) based on ISAD(G):

• Descriptive identity: ISAD(G) 3.1

• Context: ISAD(G) 3.2.1 - 3.2.3

• Acquisition & processing: ISAD(G) 3.2.4, 3.3.2 - 3.3.3

• Content and structure: ISAD(G) 3.3.1, 3.3.4

• Access and use: ISAD(G) 3.4

• Related material: ISAD(G) 3.5

• Notes: ISAD(G) 3.6

Page 9: Tthornton code4lib

Description model: basic EAD mapping

Page 10: Tthornton code4lib

Description model: JSON format

{
  "unitid": [
    { "value": "3283", "type": "local_mss" }
  ],
  "unittitle": [
    { "value": "David Ames Wells papers" }
  ],
  "unitdate": [
    { "type": "inclusive", "normal": "1847/1895", "value": "1847-1895" }
  ],
  "physdesc_extent": [
    { "value": ".5 linear feet", "unit": "linear feet" },
    { "value": "2 boxes", "unit": "containers" }
  ],
  "abstract": [
    { "value": "David Ames Wells was an engineer, economist, textbook author, and advocate for lower tariff rates. This collection contains correspondence with Gordon L. Ford, Worthington C. Ford, and others; clippings; a manuscript draft of Protection: The Poor Man's Friend; and a lecture Wells delivered on free trade in 1882" }
  ],
  "prefercite": [
    { "value": "<p>David Ames Wells papers, Manuscripts and Archives Division, The New York Public Library</p>" }
  ]
}
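
Because every element is stored as an array of objects with a `value` key, reading a description back is uniform across elements. The sketch below uses only Ruby's standard `json` library on a shortened copy of the example; the helper name `element_values` is hypothetical and not part of the NYPL code.

```ruby
require 'json'

# Shortened copy of the description shown on this slide.
DESCRIPTION_JSON = <<~TEXT
  {
    "unittitle": [ { "value": "David Ames Wells papers" } ],
    "unitdate": [ { "type": "inclusive", "normal": "1847/1895", "value": "1847-1895" } ],
    "physdesc_extent": [
      { "value": ".5 linear feet", "unit": "linear feet" },
      { "value": "2 boxes", "unit": "containers" }
    ]
  }
TEXT

# Returns the "value" of every entry stored under an element name,
# or an empty array when the element is absent.
def element_values(description, element)
  description.fetch(element, []).map { |entry| entry["value"] }
end

DESCRIPTION = JSON.parse(DESCRIPTION_JSON)
TITLE = element_values(DESCRIPTION, "unittitle").first
EXTENTS = element_values(DESCRIPTION, "physdesc_extent")
```

Repeatable elements such as `physdesc_extent` and single-valued ones such as `unittitle` go through the same code path, which is one plausible reason for wrapping every element in an array.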

Page 11: Tthornton code4lib

EAD as a guide for data storage

• EAD elements that allow only CDATA are stored as plain strings

• EAD elements that require content to be structured in <p> or other block elements are stored as HTML

• Rules established for converting EAD to HTML when necessary

• HTML conversion designed to support re-conversion back to EAD

Page 12: Tthornton code4lib

Special handling for dates

• Dates are hard

o Inclusive dates and bulk dates

o Multiple date formats

o Ranges, lists and both

• Special data structure for dates:

o date_statement (original text)

o inclusive_start / inclusive_end

o bulk_start / bulk_end

o keydate (for ordering query response – earliest inclusive date or earliest bulk date when present)

o index_dates (for search faceting – every year included in range/list)
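
A rough sketch of the date structure, assuming years are stored as integers. `DateInfo` and `parse_normal` are hypothetical names. The `keydate` rule here is one reading of the slide (use the earliest bulk date when there is one, otherwise the earliest inclusive date), and `index_dates` covers only a single inclusive range, not lists of dates.

```ruby
# Hypothetical value object mirroring the date fields listed above.
DateInfo = Struct.new(:date_statement, :inclusive_start, :inclusive_end,
                      :bulk_start, :bulk_end, keyword_init: true) do
  # Sort key: earliest bulk date when present, otherwise earliest inclusive date.
  def keydate
    bulk_start || inclusive_start
  end

  # Every year covered by the inclusive range, for search faceting.
  def index_dates
    return [] unless inclusive_start
    (inclusive_start..(inclusive_end || inclusive_start)).to_a
  end
end

# Splits an EAD "normal" attribute such as "1847/1895" into start and end
# years. A single date such as "1882" yields the same year twice.
def parse_normal(normal)
  first, last = normal.split("/").map { |part| part[0, 4].to_i }
  [first, last || first]
end

START_YEAR, END_YEAR = parse_normal("1847/1895")
WELLS_DATES = DateInfo.new(date_statement: "1847-1895",
                           inclusive_start: START_YEAR,
                           inclusive_end: END_YEAR)
```

Indexing one value per year lets a facet or filter on a single year match any description whose range includes it, at the cost of a larger index.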

Page 13: Tthornton code4lib

Access Term model

Page 14: Tthornton code4lib

Refinement of Access Term/Access Term Association models

Page 15: Tthornton code4lib

Data import

• It’s messy business

• Bulk of work has focused on EAD; Nokogiri used extensively for parsing XML

• Basic process for EAD import:

1. Create collection record

2. Extract collection-level data, create/save description

3. Extract access terms, and for each:

a. Save if it doesn’t already exist

b. Save collection/term association

4. Extract top-level components, and for each:

a. Create component record

b. Extract component-level data, create/save description

c. Extract/save access terms & associations

d. Extract child components and repeat for each
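
The import steps can be sketched as a short recursive routine. To stay self-contained, this version assumes the EAD has already been parsed into nested hashes (in the real system that data would come from Nokogiri) and keeps records in arrays, not MySQL tables. The `Importer` class, its method names, and the sample titles and terms are all hypothetical.

```ruby
# In-memory stand-in for the import process; not the NYPL implementation.
class Importer
  attr_reader :collections, :components, :descriptions, :terms, :associations

  def initialize
    @collections = []
    @components = []
    @descriptions = []
    @terms = {}         # term name => term id; each term is saved only once
    @associations = []  # [owner_type, owner_id, term_id]
  end

  def import(ead)
    collection_id = @collections.length + 1
    @collections << { id: collection_id }                    # step 1
    save_description(:collection, collection_id, ead)        # step 2
    save_terms(:collection, collection_id, ead)              # step 3
    import_components(ead[:components], collection_id, nil)  # step 4
    collection_id
  end

  private

  def import_components(nodes, collection_id, parent_id)
    (nodes || []).each_with_index do |node, index|
      id = @components.length + 1
      @components << { id: id, collection_id: collection_id,
                       parent_id: parent_id, sib_seq: index + 1 }  # step 4a
      save_description(:component, id, node)                       # step 4b
      save_terms(:component, id, node)                             # step 4c
      import_components(node[:components], collection_id, id)      # step 4d
    end
  end

  def save_description(owner_type, owner_id, node)
    @descriptions << { owner_type: owner_type, owner_id: owner_id, title: node[:title] }
  end

  def save_terms(owner_type, owner_id, node)
    (node[:terms] || []).each do |name|
      term_id = (@terms[name] ||= @terms.length + 1)    # step 3a
      @associations << [owner_type, owner_id, term_id]  # step 3b
    end
  end
end

# Made-up sample input standing in for a parsed EAD file.
SAMPLE_EAD = {
  title: "David Ames Wells papers", terms: ["Free trade"],
  components: [
    { title: "Correspondence", terms: ["Free trade", "Tariff"],
      components: [{ title: "Ford, Gordon L." }] },
    { title: "Clippings" }
  ]
}.freeze
IMPORTER = Importer.new
IMPORTER.import(SAMPLE_EAD)
```

Because step 4d calls the same routine for each child, arbitrarily deep hierarchies are handled without special cases per level.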

Page 16: Tthornton code4lib

Integration with NYPL digital repository

• Fedora repository

+ custom metadata creation/digitization workflow system

+ API to query repository data

• All records in repository identified with UUID

• UUID of digital object associated with a given component is stored locally in archives data system

• Best case scenario: common identifiers appear in archival description and in Fedora

Page 17: Tthornton code4lib

Apache Solr

• Inter- and intra-collection search

• Collocation via faceting and filter queries

• Using RSolr to facilitate interaction with Solr (for both search and index)
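
To illustrate how facets and filter queries combine, here is a sketch that builds Solr request parameters using only Ruby's standard library. The parameter names (`q`, `fq`, `facet`, `facet.field`, `wt`) are standard Solr parameters, but the method `solr_params` and the field names `collection_id`, `subject` and `index_dates` are assumptions, not the project's actual schema.

```ruby
require 'uri'

# Builds Solr request parameters for a keyword search with faceting.
# Passing collection_id restricts the search to one collection
# (intra-collection); omitting it searches across collections.
def solr_params(query, collection_id: nil, filters: {})
  fq = filters.map { |field, value| %(#{field}:"#{value}") }
  fq << "collection_id:#{collection_id}" if collection_id
  params = {
    "q" => query,
    "facet" => "true",
    "facet.field" => %w[subject index_dates],
    "wt" => "json"
  }
  params["fq"] = fq unless fq.empty?
  params
end

SOLR_PARAMS = solr_params("tariff", collection_id: 1,
                          filters: { "subject" => "Free trade" })
SOLR_QUERY_STRING = URI.encode_www_form(SOLR_PARAMS)
```

With RSolr, a hash like this would typically be passed as the `params` option of a `select` request in place of a hand-encoded query string.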

Page 18: Tthornton code4lib

API

• API development is proceeding in step with finding aid development – available requests are added as needed

• Basic requests:

o Collection-level data

o Components of a collection, or sub-components of a component

    o Includes all component-level descriptive data

    o Max. depth can be specified

o Digital assets associated with a component
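
As an illustration of the component request with a depth limit, the sketch below nests flat component rows into a tree and stops after a given number of levels. The method `component_tree`, the response shape, and the sample rows are hypothetical; the actual API may structure its responses differently.

```ruby
require 'json'

# Flat component rows as they might come from the database (made-up data).
API_COMPONENTS = [
  { id: 1, parent_id: nil, sib_seq: 1, title: "Correspondence" },
  { id: 2, parent_id: 1,   sib_seq: 1, title: "Incoming" },
  { id: 3, parent_id: 2,   sib_seq: 1, title: "1882" },
  { id: 4, parent_id: nil, sib_seq: 2, title: "Clippings" }
].freeze

# Nests the children of parent_id in sibling order, descending at most
# max_depth levels. A nil max_depth means no limit; a nil parent_id
# selects the top-level components of the collection.
def component_tree(rows, parent_id: nil, max_depth: nil)
  return [] if max_depth && max_depth < 1
  next_depth = max_depth && max_depth - 1
  rows.select { |row| row[:parent_id] == parent_id }
      .sort_by { |row| row[:sib_seq] }
      .map do |row|
        nested = component_tree(rows, parent_id: row[:id], max_depth: next_depth)
        row.merge(components: nested)
      end
end

API_TREE = component_tree(API_COMPONENTS, max_depth: 2)
API_RESPONSE = JSON.generate({ components: API_TREE })
```

Calling the same method with `parent_id: 1` would return the sub-components of a single component, so both requests on the slide can share one code path.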

Page 19: Tthornton code4lib

Finding aid prototype

Page 20: Tthornton code4lib

Finding aid prototype

Page 21: Tthornton code4lib

Front-end system overview

Page 22: Tthornton code4lib

Considerations for future development

• Separate API from data management?

o Data management app to handle all create/update/destroy operations, while API (Sinatra?) is read-only

o Open API to public? Security/load considerations…

• ArchivesSpace

o NYPL is considering it as a possible replacement for our existing ‘home-grown’ system

o How would this system integrate with ArchivesSpace API?

• Upcoming EAD revision

Page 23: Tthornton code4lib

some code to look at and/or borrow from:

github.com/nypl/archives_data_public

finding aid prototype:

archives.nypl.org

me:

[email protected]

NYPL Labs:

nypl.org/labs