CPDN Project in the RDC-CC-CARL Federated Pilot

Contents

Executive Summary
Project Background
Project Objectives
Project Design and Operations
Project Outcomes and Metrics
Recommendations
    Recommendations for content improvement
    Recommendations for technical improvement
Appendix 1: Pilot Technical Descriptions
    Data Source
    Archivematica Ingestion
    Globus Publication Ingestion
    Script Information
    Configuration
Appendix 2: RDM Python Library
    Overview
    Installation
    Example
Appendix 3: Files in pilot datasets by file type

Executive Summary

As their digital datasets grow, researchers across all fields of inquiry are struggling to manage those datasets. For an individual researcher, there is a strong motivation to be able to find, access, and analyze their own data once it is produced. There may also be a desire to share datasets with colleagues, to preserve datasets for later re-use, or to combine their dataset with others from within or beyond their disciplinary area. For the research community, the reproducibility of a scientific result drives a need for open, managed, accessible datasets that allow results to be independently validated. For policy makers, the data output of government-funded research is seen as a valuable asset to be preserved and shared for the general good. Across all of these groups, there is a recognized need to provide tools that enable researchers to manage and preserve their data efficiently.

This recognition of the need to manage and preserve research data has led to a rapid growth in the development and deployment of research data repositories around the globe. While this is a positive development, and has led to tremendous advances in repository tools, it has also led to a kind of fragmentation. Individual toolsets have been developed for specific disciplines, for specific use-cases, and for specific data scales, with no guarantee of interoperability or even discoverability of datasets in other repositories. Further, advances in digital preservation of research data have lagged behind the technologies developed to disseminate the data. Finding sound ways of producing archival copies of complex datasets is key to enabling long-term access.

This pilot started with existing individual data management tools, each designed for a specific function, and assembled them into a research data management solution. That solution was designed to address all critical components of the process, from initial data movement through ingestion, metadata production, access control, replication, discovery, and dissemination, to preservation. The solution is designed to be scalable and capable of handling diverse, complex datasets. As a pilot, a key consideration for the project was to identify gaps which need to be addressed in order to build a production service.

This report provides details on the outcomes and lessons learned from this pilot project, which evaluated core components of a national research data management infrastructure service. A software stack comprising Archivematica, Globus Publication, and custom code was used to pass datasets from an established domain repository through an archival processing pipeline, and to establish discovery and access layers to the data and metadata. Archivematica provided a standards-compliant, open-source solution to preservation. Globus Publication provided cloud-based search and discovery across repositories, backed by the tested scalability of the Globus File Transfer service. Globus allowed the tested solution to take advantage of federated storage (three geographically distinct sites) and provided a natural, scalable solution that fits the Canadian geographic, political, and funding landscape.

This pilot successfully provided important insights into the requirements for implementing a production service based on the functions of this model. First, it demonstrated that automated processes could generate archival digital objects for research datasets and that these objects could be deposited with an access platform (Globus Publication in this instance) and archived in preservation storage. Second, it demonstrated that, once ingested into a discovery and access platform, datasets were discoverable and retrievable under appropriate controlled access conditions. Third, it identified a need for upfront preparation of metadata by a metadata expert and for the intervention of a data curator to start and monitor the processing cycle. Fourth, it identified several improvements that will be necessary to assemble a production system based on this pilot's basic design. These improvements include:

- Better integration of the transmission of metadata to Globus Publication with the Archivematica pipeline;
- Developments in computational processing that enhance scalability when pushing large digital objects through the Archivematica pipeline;
- Better management by Archivematica of the processing of dataset-level metadata for discovery applications outside of Archivematica;
- Increased automation of the normalization processing of diverse file formats.


All of these suggested improvements are incremental in nature and achievable through a next-step development process. The primary recommendation is to proceed with the implementation of a production service that improves upon the model tested in this project. This pilot provided the experience needed to develop a production service. Furthermore, lessons were learned about how a successful national preservation, discovery, and access platform for research data should perform. We recommend that CARL, Compute Canada, RDC, CANARIE, and other interested parties pursue this production service.

Project Background

As an initial investigation into this challenge in preserving research data, Research Data Canada (RDC)[1] established a Federated Data Management Pilot Project[2] to build core components of a national research data management infrastructure service. A joint proposal was made to the RDC Federated Pilot on behalf of Compute Canada and the Canadian Polar Data Network (CPDN) to test a software stack consisting of Archivematica and Globus Publication, using data curated from Canadian-funded research in the International Polar Year (IPY) and housed by the CPDN, a data repository for Northern research data. Compute Canada implemented the software configuration and conducted the processing, while the CPDN provided a copy of the IPY data collection and prepared the appropriate metadata.[3] The project's objective was to evaluate this specific configuration to better understand the requirements for a national preservation, discovery, and access platform.

Between 2007 and 2012, the Federal Government funded a variety of research projects as part of Canada's contribution to the IPY. Starting in 2010, five centres worked collaboratively to ingest the data from these projects and to provide access and long-term preservation. In 2012, these five centres formalized their partnership under the CPDN Charter. The Canadian Cryospheric Information Centre at the University of Waterloo hosts the Canadian Polar Data Catalogue, which serves as a discovery portal for this collection. Scholars Portal and the University of Alberta Libraries established a preservation backbone for the data.

The RDC Federated Pilot was established to investigate the components of a national research data management infrastructure service. Separate projects with Simon Fraser University (SFU) and the Ontario Council of University Libraries / Scholars Portal (OCUL/SP) looked at specific software implementations using library-based staging repositories (viz., Islandora and Dataverse, respectively). The CPDN project was collections-based and involved working with an established domain data repository, transferring IPY datasets to an archival processing pipeline and then establishing discovery and access layers from the archival output. Globus Publication was used as the discovery/access platform in this project. The implementation challenge with Globus Publication was to find a flexible batch process to ingest metadata and data files from an aggregation of projects rather than from individual research projects. This required entering metadata in batch rather than inputting metadata manually, and ingesting data in bulk instead of submitting data through individual projects. The transfer and integration of existing metadata into Globus Publication's metadata framework was also examined. Aspects of this project built upon the experiences of the SFU project by extending the use of Globus Publication for discovery and access services.

[1] http://www.rdc-drc.ca/
[2] http://www.rdc-drc.ca/activities/federated-pilot/
[3] Provenance of the IPY collection remains with the CPDN repository.

Project Objectives

This project had seven objectives.

1. Ingest IPY data files and metadata into the Compute Canada service stack for this project.
2. Trigger ingest directly through the Archivematica Transfer step into an Archivematica pipeline.
3. Generate Archival Information Packages (AIPs) for the IPY datasets.
4. Use Dissemination Information Packages (DIPs) with the Globus Publication service for indexing and discovery.
5. Use existing IPY metadata for Globus Publication metadata intake.
6. Use Globus Publication's search tool to find datasets within the IPY collection.
7. Properly manage access rights, making sure that deposit agreements match Globus Publication access, e.g., managing dark items and embargo periods correctly. While access to the data may be restricted, metadata should always be open.

Project Design and Operations

Three teams representing this project's partner organizations (CPDN, Compute Canada, and Globus) held regular conference calls to coordinate the implementation of the project design. Figure 1, the CPDN RDC Pilot Design (see below), is a representation of the workflow developed to fulfill the above objectives. Globus FTP was used to transfer data from OCUL/SP to an Archivematica Transfer directory, from which the processing of AIPs and DIPs was initiated. The version of Globus Publication used in this project captured metadata through form-entered templates based on the DataCite Extended Dublin Core (DC) standard. All of the IPY dataset-level metadata was prepared in North American Profile (NAP) ISO 19115. A separate workflow outside of the Archivematica pipeline was developed to transform ISO 19115 metadata records to DataCite Extended DC and then to transmit these records to Globus Publication in the JSON-LD format. This entailed mapping elements between NAP and DC and flattening the NAP hierarchical element structure. Globus Publication subsequently added datasets from the DIP objects containing IPY files. Once the metadata was indexed in Globus Publication, a Globus user within a specific privileged group could search for and locate IPY datasets. The Archivematica pipeline also produced a set of AIP digital objects that represent the preservation packages for long-term storage.
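As an illustration of this out-of-band metadata workflow, the following is a minimal sketch (not the pilot's actual conversion script) of crosswalking an ISO 19115 record to flattened Dublin Core terms serialized as JSON-LD. The element paths and the two-term mapping are illustrative only; the real crosswalk covered the full NAP element set.

import json
import xml.etree.ElementTree as ET

# Standard ISO 19115 namespaces (gmd/gco).
GMD = '{http://www.isotc211.org/2005/gmd}'
GCO = '{http://www.isotc211.org/2005/gco}'

def iso_to_dc_jsonld(iso_file):
    # Map a couple of NAP/ISO 19115 elements onto flat Dublin Core terms.
    root = ET.parse(iso_file).getroot()
    title = root.find('.//%stitle/%sCharacterString' % (GMD, GCO))
    abstract = root.find('.//%sabstract/%sCharacterString' % (GMD, GCO))
    return {
        '@context': {'dc': 'http://purl.org/dc/terms/'},
        'dc:title': title.text if title is not None else '',
        'dc:description': abstract.text if abstract is not None else '',
    }

print json.dumps(iso_to_dc_jsonld('dataset_metadata.xml'), indent=2)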


Figure 1: CPDN RDC Pilot Design

Project Outcomes and Metrics

Using the two workflows depicted in Figure 1, the project succeeded in accomplishing most of its primary objectives. The following metrics were used to assess the project's outcomes. These indicators were identified in the project charter to characterize the suitability of this particular technical solution for a national-scale service and to determine areas for development and improvement in the technical pipeline. The metrics employed were:

1. The number of file formats that are transferred, identified, and processed by Archivematica.
2. The number of cases that require manual intervention.
3. The number of metadata files correctly processed by Globus Publication.
4. The discoverability of preservation objects (e.g., DIPs) from within the Globus Publication service (pass/fail).
5. The ability of Globus Publication to manage controlled access conditions (pass/fail).


The objective to transfer dataset access rights to Globus Publication through the CPDN metadata records could not be implemented at this time, because further development is required for Globus Publication to act on this metadata content.

1. The number of file formats that are transferred, identified, and processed by Archivematica

118 datasets comprising 16948 files, with an aggregate size of ~2 GB, were processed through the OCUL/SP → Archivematica → Globus Publication pipeline. An additional large dataset, comprising 9015 files, was also tested for ingestion through Archivematica, but this ~665 GB dataset was not successfully ingested, due to storage service timeouts. A number of configuration changes were made to see if Archivematica could process the large dataset, but this was a lengthy process (hours to days per attempt) and was ultimately not successful. Being able to handle large datasets like this will be important for future repositories.

File extensions, while not formally indicative of the file content type, are likely a good predictor of file format, particularly coming from an existing collection such as the PDC, whose datasets we expect were curated and quality checked before they were formally published. Even with file extensions, a few file types could not be identified with confidence. This is one of the challenges associated with processing large amounts of researcher data: identifying the exact format of files by hand, without researcher metadata describing each individual file, is time consuming and prone to error. The development of reliable and automatable processes for determining and preserving file types is important. Table 1 shows the file extensions identified in the original datasets, along with the file format each extension was assumed to represent:


Extension        Assumed Filetype                               File Count
DS_Store         OS X information (likely not research data)            42
DT1              digital terrain elevation data/DTED                   149
GPS              GPS data                                              130
HD               unknown                                               149
HDR              raster image                                            1
csv              comma-separated values file                           981
dbf              dBase II data                                          37
doc              Microsoft Word document                                 4
docx             Microsoft Word document                                 5
hdf              hierarchical data format                               43
int              unknown                                              2438
jpg              JPEG image                                          11038
lnk              unknown                                                 2
pdf              Portable Document Format (PDF)                        371
prj              project                                                 3
rar              RAR archive format                                      1
rtf              Rich Text Format                                        1
sbn              GIS shape                                               6
sbx              GIS shape                                               6
shp              GIS shape                                               6
shx              GIS shape                                               6
tfw              GIS TIFF                                              162
tif              GIS TIFF                                              162
txt              text document, various line end formats               414
xls              Microsoft Excel spreadsheet                           111
xlsx             Microsoft Excel spreadsheet                            13
(no extension)   unknown                                               667

Total                                                               16948

Table 1: Counts of file extensions and assumed file types
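Counts like those in Table 1 are straightforward to reproduce. The following is a minimal sketch (not part of the pilot's scripts) that tallies extensions over a directory tree, using the ipy_data path from Appendix 1 as an illustrative input:

import collections
import os

counts = collections.Counter()
for root, dirs, files in os.walk('/wg_global/proc/ipy_data'):
    for name in files:
        # Treat everything after the last dot as the extension.
        ext = os.path.splitext(name)[1].lstrip('.') or '(no extension)'
        counts[ext] += 1

for ext, n in sorted(counts.items()):
    print '%s\t%d' % (ext, n)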


The 16948 files contained in the datasets (large dataset excluded) were also processed with the Unix "file" utility, yielding the following counts of data types.

Data Type (aggregated)   File Count
ASCII text                     4502
Dbase 3                          37
ESRI/GIS                         24
HDF                              43
ISO text                        354
JPEG/image                    11038
JVT/NAL                          40
MS Office                       116
Non-ISO text                     75
PDF                             371
RTF                               1
TIFF                            162
"data"                          149
Other                            36

Total                         16948

Table 2: Aggregated file counts as categorized by the Unix "file" tool

A variety of files in the same category (e.g., ASCII text) were found with slightly different properties; these have been aggregated into representative groups. We suspect that the "file" command does not differentiate well between plain lists of text and comma-separated values files (i.e., files that are compatible with open-format spreadsheet data).

2. The number of cases that require manual intervention

Processing within Archivematica version 1.3 required manual intervention for every dataset. Known bugs prevented full automation of the Archivematica ingestion process. While ingestion could be triggered using the REST API and Archivematica's "watched" directory, the normalization and DIP storage steps needed to be triggered by hand from within the GUI. Near the end of the pilot, Archivematica version 1.4 was released. Testing with the upgraded version revealed that some, but not all, of the manual GUI steps were removed from the process.


The full pipeline from OCUL/SP → USask storage → Archivematica → Globus Publication had human intervention at each of the arrows for the sake of convenience and expedience during the pilot. However, fully automated scripts were written to:

1) pull data from any Globus endpoint (i.e., one on OCUL/SP) to hosting storage;
2) bag the data into SIPs and ingest the data into Archivematica for transformation into DIPs and AIPs;
3) ingest DIPs into Globus Publication;
4) replicate the AIPs to multiple Compute Canada sites.

For full automation:

- It would be a simple matter, after setting up a shared endpoint on the source repository, to integrate the scripting from 1) and 2) to pull data automatically from a repository and then initiate ingestion into Archivematica.
- The metadata processing and conversion was done out-of-band with respect to the data download. The scripting for metadata processing should be integrated with the data download; this should be technically feasible.
- Archivematica needs to signal when its processing is complete and successful; this signal could be used to trigger automatic ingestion of the DIPs into Globus Publication, and of the AIPs into replicated storage.
- A file format policy needs to be defined and implemented in Archivematica with a list of recommended research file formats for preservation and dissemination. The file format transformation choices should be preconfigured using this policy, by placing a 'processingMCP.xml' file into the root directory of a SIP/transfer (a skeleton is sketched below).
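For reference, a skeleton of such a 'processingMCP.xml' file is shown below. The element names follow Archivematica's preconfigured-choices format, but the UUIDs are placeholders that must be taken from the decision points and chains of the target Archivematica instance; treat this as an illustration rather than a working policy file.

<processingMCP>
  <preconfiguredChoices>
    <!-- Pin one workflow decision (e.g., "Normalize for preservation")
         to a preset choice; repeat for each decision to automate.
         UUIDs below are placeholders. -->
    <preconfiguredChoice>
      <appliesTo>[decision-point UUID]</appliesTo>
      <goToChain>[chain UUID]</goToChain>
    </preconfiguredChoice>
  </preconfiguredChoices>
</processingMCP>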

Globus’s script to ingest DIPs in place is currently invoked manually, once the datasets and metadata are placed in the correct directories on storage accessible by Globus Publication. While this script could be automated, a full Globus Publication API including “ingest” would be a superior solution.

Replicated storage of the AIPs is automated on a regular basis through scheduled events, and does not require explicit triggers. Replication frequency and automation can also be increased by purchasing/developing hardware and/or software solutions that automatically manage multiple copies of data at the PB scale.

3. The number of metadata files correctly processed by Globus Publication

During the pilot, it became obvious that automated quality assurance for processed datasets is difficult to achieve on the Archivematica platform. To overcome this limitation, it was decided to perform manual quality assurance testing on 10% of the datasets for correctness of the transformed DIPs and AIPs relative to the original datasets. From the 118 datasets, 12 datasets were selected at random for quality evaluation, using a random number generator application. The selected datasets were analyzed at various stages in the Archivematica pipeline. The following observations were recorded while comparing the SIPs, DIPs, and AIPs of these selected datasets.

- In all cases, data files that were part of a particular SIP were successfully transferred to the corresponding AIP and DIP.
- File format normalization was selected as a preservation strategy; however, a majority of the data files were not normalized for preservation during the transformation from SIP to AIP. None of the files were normalized for dissemination during the transformation from SIP to DIP. See Table 3 for details about the file format normalization results.
- The AIP for study number 1841 was not generated.

Normalization results:

Data Set #   Data Files in SIP   Normalized for Preservation   Normalized for Dissemination   Normalization Type   Notes
721          2                   1                             0                              PDF → PDF/A
750          2                   1                             0                              PDF → PDF/A
755          1                   1                             0                              PDF → PDF/A
1500         368                 1                             0
1643         578                 0                             0
1734         3                   1                             0                              PDF → PDF/A
1841         3787                0                             -                                                   AIP not available [4]
1842         2                   0                             0
1866         2                   1                             1                              PDF → PDF/A
10550        3                   0                             0
11580        1                   0                             0
11635        7                   0                             0

Table 3: Normalization results for the twelve datasets selected for quality assurance testing

[4] Archivematica did not complete verification on the AIP for 1841 and continued showing that it was executing the task, though the actual process was long dead.

During the quality assurance process, it became apparent that the metadata associated with each dataset was not being processed for dissemination and preservation. It was being transferred directly to Globus without any processing through the Archivematica pipeline. A separate test was conducted using one study to see how Archivematica behaves when metadata is pushed along with the data files through an Archivematica pipeline. Metadata was passed through the pipeline in four different ways (directory-tree sketches follow the observations below):

1. metadata and data files both in the /objects folder;
2. metadata in the /metadata folder and data in the /objects folder;
3. metadata in the /metadata/submissionDocumentation folder and data in the /objects folder;
4. metadata in the /metadata/submissionDocumentation folder, a metadata.csv file in the /metadata folder, and data in the /objects folder.

The following observations were noted when the metadata files were passed through the pipeline:

- Archivematica successfully processed the SIPs and produced a DIP and an AIP in each case.
- Only the DIPs in test scenario 1 contained metadata files. In the other three cases (2, 3, 4), metadata files were not part of their respective DIPs.
- The METS files from all cases did not contain any descriptive metadata. A structMap was present in each case, which provided details about the directory structure. In each test scenario, the METS file was the same in the DIP and the AIP of a particular SIP.
- In the DIPs for scenarios 2, 3, and 4, the structMaps contained information about a metadata folder and metadata files even though these files and folders did not exist in those DIPs.
- Descriptive metadata was not transferred to the DIPs except in scenario 1. For scenarios 2, 3, and 4, descriptive metadata was not captured in the METS file of each DIP, and the accompanying metadata files were not transferred to these DIPs either.
- In scenario 4, a metadata.csv file was provided, but information from that file did not end up in the resultant METS file.
- The resultant AIPs contained metadata files, data files, and any migrated files, along with other preservation-related details, e.g., the METS file and the manifest-md5.txt checksum manifest.
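As noted above, the four transfer layouts tested can be sketched as directory trees. Scenario 4, the most complete layout, is reconstructed below from the descriptions in this section (a schematic, not a verbatim listing of a pilot transfer):

transfer/
    objects/
        [data files]
    metadata/
        metadata.csv
        submissionDocumentation/
            [original metadata files]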

4. The discoverability of preservation objects (e.g., DIPs) from within the Globus Publication service: PASS

5. The ability of Globus Publication to manage controlled access conditions: PASS

The discoverable IPY data ingested into Globus Publication (GP) had Globus-level permissions set for discovery and download. A Globus group was created ("CC Globus Publication Users") and the Globus accounts of the pilot project team members were added to this group. Permissions were set at the Globus level to allow discovery of the metadata, and subsequent download, only if the Globus user was in this Publication Users group. Globus set these permissions directly on the collection. The access conditions for both discovery and download of the data were each tested under three conditions:

i. discover/download with no Globus login/account;
ii. discover/download with a Globus login/account not in the privileged group;
iii. discover/download with a Globus login/account in the privileged group.

1. No Globus account

With no Globus user account (i.e., completely anonymous browsing from the world at large), the Compute Canada Community is visible, as is the Canadian International Polar Year Data Collection. When this collection is accessed, no metadata or entries are visible, and a search of the collection through GP using each of issue date, author, title, and subject shows the message "There are no entries in the index for Collection 'Canadian International Polar Year Data Collection'". A search of the collection using text/metadata known to be indexed in the privileged area (example: "fox") also returns the same no-entry message. A direct attempt to visit the Globus Publication site for an existing dataset in the IPY collection, both through the Globus URL and through the URI, without logging into Globus led to a page requiring sign-in to Globus. No data or metadata was accessible. It was noted that cookie data from previous Globus logins persisted, so care might be required in the use of Globus on shared or insecure computers. These tests demonstrated that access control to hidden/embargoed data works correctly using Globus access control, and that anonymous users with no Globus access cannot discover datasets or metadata, or download datasets, which have been flagged for discovery by "group-only."

2. Globus account not in privileged group

With a valid Globus user account (i.e., authenticated to Globus, but not part of the privileged group for CPDN data), the Compute Canada Community is visible, as is the Canadian International Polar Year Data Collection landing page. When this collection is accessed, no metadata or entries are visible, and a search of the collection through GP using each of issue date, author, title, and subject shows the message "There are no entries in the index for Collection 'Canadian International Polar Year Data Collection'". A search of the collection using text/metadata known to be indexed in the privileged area (example: "fox") also returns the same no-entry message. A direct attempt to visit the Globus Publication site for an existing dataset in the IPY collection as an authenticated but unprivileged Globus user, both through the URL and through the URI, resulted in the GP message "Authorization Required. You do not have the permission to perform the action you just attempted." Even though the files could not be discovered through the GP interface by an unprivileged user, it is conceivable that possessing the direct download link could allow download of the data. A direct download of the files in the collection as an unprivileged user was therefore tested; going directly to the download page brings up a Globus transfer window with the message "Authentication Failed. Your credentials do not provide sufficient access to this endpoint's file system." These tests demonstrated that access control to hidden/embargoed data works correctly using Globus access control, and that authenticated Globus users with insufficient privilege cannot discover datasets or metadata, or download datasets, which have been flagged for discovery by "group-only", even if they have previous knowledge of URIs or URLs for download.

3. Globus account in privileged group

While logged in to Globus with an account in the CC Globus Publication Users group, the Compute Canada Community is visible, as is the Canadian International Polar Year Data Collection. When the collection is accessed, a list of the first 20 (out of 118, the total number successfully ingested into GP) IPY datasets is displayed. Discovery by author, subject, metadata keyword, and title are all functional. At least one keyword ("environment") happens to be present in the metadata of all 118 ingested datasets. Accessing any individual dataset discovered through a search brings the authenticated, privileged user to a landing page showing a selection of metadata (title, keywords, dates, authors, publisher, URI, plus a toggle switch to display all metadata). This dataset page has a URI, implemented for the purposes of the pilot as a bit.ly link, as these data already have functional DOIs. Following the URI while logged in with a privileged account leads to the correct landing page for the dataset within Globus Publication. Following the link for downloading datasets leads to an authenticated area for Globus File Transfer; download of the dataset was successful. A representative, not exhaustive, sampling of datasets was tested, all with expected functionality.

Recommendations

Recommendations for content improvement

The lessons learned from this pilot provide important insights into the requirements for implementing a production service based on the functions of this test. First, the basic workflows demonstrated that automated processes could generate digital objects for research datasets and that these objects could be ingested into an access platform (Globus Publication in this instance) and archived in preservation storage. Second, once ingested into a discovery/access platform, datasets were discoverable and retrievable under the appropriate controlled access conditions. Third, several improvements are needed to assemble a small-scale production system based on this pilot's basic design. All of these improvements are incremental in nature and achievable through a next-step development process.


The primary recommendation is to proceed with the development of a production service that improves upon the test model in this project. Four general areas have been identified for further work.

1. Workflow improvements

The current project required a separate workflow to transmit metadata to Globus Publication. For a production service, the metadata transfer needs to be integrated into the Archivematica pipeline. An experiment was conducted at the conclusion of the current project to identify how this metadata could be processed and queued for predictable use by Globus Publication (see the discussion under the Archivematica processing evaluation results above). This included the possible use of the METS file as a container for basic Dublin Core metadata elements that could subsequently be extracted by Globus Publication. Another strategy was to have a metadata file containing this information referenced within the METS file. The tests conducted failed to generate a METS file with the information required for a predictable outcome. Further exploration with Artefactual is needed to determine both the best strategy for transferring this metadata and the production process for identifying the location of metadata. One clear outcome of this overall workflow is the need for upfront preparation of metadata by a metadata librarian and for the intervention of a data curator to start and monitor the processing cycle. This is discussed further under Human resourcing below.

2. Product improvements

Archivematica:

a. Scalable processing strategies are needed for pushing large digital objects through the pipeline.
b. There is a clear need for Archivematica to support the processing of dataset-level metadata in DC-E for discovery applications outside Archivematica. The strategies for providing this metadata were described under Workflow improvements above and may involve the use of the METS file and references to files in the AIP and DIP.
c. The DIPs produced by Archivematica did not include the metadata objects declared in the Transfer metadata folder. This was discussed above under the Archivematica processing evaluation results.
d. Extend the use of a Format Policy Registry to support better normalization processing of the diverse file formats encountered in research datasets.

Globus Publication:

a. Globus Publication allows users to set controls over access to files in its system. There is a need to determine access rights through metadata.
b. Improvements to the batch submission of datasets are needed.

Curator toolkit:

a. As noted under Workflow improvements, a need exists for data curators to prepare dataset-level metadata in at least the DC-E standard. A toolkit that allows a data curator to crosswalk dataset-level metadata from one standard to another, or to enter descriptive information in the DC-E standard, would support the initial steps of this overall process.
b. Another data curator tool is required to structure files for transfer into the Archivematica pipeline. This tool should be integrated with Globus FTP to capitalize on high-speed file transfer.
c. A report manager that retrieves and displays all logs generated throughout the process would help the data curator detect and diagnose problems in the processing pipeline.

3. Human resourcing

Metadata librarian:

a. A person is needed with the skills to develop mappings across dataset standards and to devise metadata exchange strategies among systems.
b. This person should also provide advice on metadata exchange formats compatible with various discovery systems.

Data curator:

a. A person is needed with the skills to organize data and metadata in a submission package to start the processing cycle.
b. This person should also monitor all processing logs to diagnose problems that arise in the pipeline. When a problem is identified and a cause determined, this person should be able to provide a solution.
c. This person should also coordinate arrangements for archival storage and its accompanying replication plans.

Software developers:

a. A programmer with skills in parallel processing and the optimization of computational tasks is needed to increase the efficiency of the processing demands throughout the chain of microservices utilized in the production of Archival Information Packages and Dissemination Information Packages.
b. A programmer is needed with the skills to chain microservices efficiently through standard input/output workflows and to optimize output activities to increase throughput performance.
c. A programmer is needed with skills in APIs and interfaces to liaise with researchers in their development of customized workflows leveraging the developed repository.


Systems administrator:

a. The services of systems administrator(s) are needed to configure both development and production virtual machines and to establish a secure environment for production.

Project manager:

a. There is a need for a project manager to oversee the development of a production platform. This person needs to understand the design of the system, to organize and manage work tasks, to supervise developers and staff, and to be a good communicator with all contributing partners.

4. Technical resourcing

Archival storage and replication:

a. Systems support is needed to identify and secure the computational power for generating AIPs and DIPs.
b. Systems support is needed to identify and secure a storage allocation ample enough to support the processing pipeline and archival storage.
c. Systems support is needed to establish the appropriate replication processing of the archival holdings.

Recommendations for technical improvement

1. Archivematica

Archivematica is a combination of open-source software tools, or microservices, that allow users to create Archival Information Packages (AIPs) and Dissemination Information Packages (DIPs) suitable for long-term access. This underlying microservice architecture provides flexibility in extending the suite of tools that support emerging archival practices. Archivematica processes are in compliance with the ISO OAIS functional model, and the platform follows industry standards and best practices to generate authentic, trustworthy, reliable, and software-independent AIPs. Artefactual has also developed a Format Policy Registry (FPR) to support community-preferred archival formats. Normalization processing can be driven through an FPR, allowing communities to specify local or global format choices for preservation.

Archivematica has gained a lot of traction in the library and archival community in recent years. Initially a software solution for digital cultural archives, Archivematica has undergone significant product improvements, including a few related to processing research datasets.[5] With advancements in the stewardship of research data increasing the demand for data management and curation activities, Archivematica has the potential to become a leader in this field. A number of other groups, including Jisc in the UK, are also exploring the use of Archivematica for preserving research data.

Archivematica was tested over a large variety of datasets and file types. Although only a few file types were normalized, the Archivematica processing step was still a bottleneck in the data ingestion pipeline. The time required to process relatively small (MB) datasets was relatively long: 5-30 minutes per dataset, processed serially. A single dedicated server (of modest capability) was not able to rapidly process the IPY datasets; the total Archivematica processing time was approximately 12 hours. Many of the normalization steps (except for video processing) appear to be I/O-bound rather than CPU-bound. Ingestion requires moving data into a "watched directory" area: a number of copy/move primitives are required, which can be time- and I/O-intensive. A single Archivematica server, or even several static servers, will likely be unable to scale to a regional or national level. Archivematica's storage service, which indexes and stores all AIPs produced by that instance of Archivematica, is valuable at a small-collection scale. However, at the national level, it is unlikely to scale to store all AIPs that an Archivematica server has ever produced, and it is unlikely that there would be a single, canonical Archivematica server.

[5] The University of Alberta Libraries, UBC Library, and SFU Library, working as a consortium, collectively invested in enhancements specifically to support the processing of research data.

a. The development of a non-interactive, stateless version of Archivematica ("Archivematica-lite") is desirable. This proposed version would serve as a "black box" in the pipeline that takes in data+metadata and generates AIPs and DIPs automatically, subject to a rules engine, and does not attempt to store its outputs in a database. Multiple copies of this data processing tool could be started simultaneously, running on virtual or physical hardware, to process multiple data streams in parallel. Features should include:
   i. no (or an optional) GUI;
   ii. fully automatable operation;
   iii. the same preservation characteristics/outputs as Archivematica, generated through a rules engine;
   iv. no (or an optional) storage service;
   v. minimized file transfer and I/O;
   vi. externalized audit control for AIPs.

b. Archivematica should be tested (before, or concurrent with, extensive Archivematica development) on cloud compute, with virtual machines (VMs) running pre-configured Archivematica loadouts to dynamically handle larger, bursty dataset ingests. The Compute Canada Cloud could be leveraged for this testing. Before automated testing is possible, a fully automated Archivematica that requires no manual intervention would be required.


c. Archivematica processing speeds and workflows should be optimized for throughput, and barriers to automation (i.e., mandatory manual intervention at the GUI) should be overcome. This will require development effort.
   i. Specialized hardware may be required for normalization that involves extensive video processing or other processing-intensive tasks.
   ii. Metascheduling may be required to identify datasets that will require hardware acceleration.
d. Archivematica should generate a signal that processing, including the production of DIPs and AIPs, is complete, for automation purposes (a polling workaround is sketched below).
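Until such a completion signal exists, automation has to poll. Below is a rough sketch of that workaround against the transfer-status endpoint of the Archivematica 1.x REST API; the endpoint path and response fields follow the public API documentation of that era and may differ between versions, and the host and credentials simply echo the configuration in Appendix 1.

import time
import requests

def wait_for_completion(transfer_uuid):
    # Poll the status endpoint until Archivematica reports a terminal state.
    url = 'http://bailer.westgrid.ca/api/transfer/status/%s/' % transfer_uuid
    headers = {'Authorization': 'ApiKey smc748:[REDACTED]'}
    while True:
        status = requests.get(url, headers=headers).json().get('status')
        if status in ('COMPLETE', 'FAILED'):
            return status
        time.sleep(30)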

2. Globus Publication

There are several attractive features of Globus Publication which bring it into consideration for building a national-scale research data repository. The service is built on a "bring-your-own-storage" model, which allows Canadian organizations and institutions to federate data discovery and access without depending on external or foreign parties to host the data. The discovery engine can span multiple collections and repositories across the country. Globus Publication can handle almost arbitrarily large datasets, in both total number of files and per-file sizes. The storage and transfer components are mature, high-performing, highly scalable, and well understood as deployed on the Compute Canada hardware and software platform. Replication of data to other locations via Globus File Transfer is fast and easy on Compute Canada infrastructure.

There are also a number of limitations in the current (June 2015) implementation of Globus Publication. The user interface is not as fully featured as some repository tools; there is no ability to view videos or pictures from within Globus Publication, with only file transfer being offered. Configuration or customization of the front end is presently done only through Globus. Metadata was only loosely transferable from an existing collection to Globus Publication. The indexing service for Globus is hosted in the United States (only the data is stored in Canada), which raises potential concerns around Canadian and provincial privacy legislation. And because Globus is Software-as-a-Service, Canadian stakeholders have less influence over the direction Globus Publication takes and less ability to mandate the development of features. Multiple-language support (e.g., French and English) is not handled well under the existing Globus model.

The following are recommendations for development and improvements in the Globus Publication service with an eye to the development of a national-scale repository in Canada:

a. An API for the Globus Publication service should be developed in the immediate term if the GP product is to be leveraged by many researchers. No single organization (e.g., Globus, CC, CARL, …) can individually support the development needs of the entire long-tail research community; rather, a robust, scalable back end is required, and an API to access it, to enable the development of domain-specific enhancements and customized platforms/portals.
   i. Minimum features for an API interacting with GP need to be determined (i.e., ingest, download);
   ii. A default, lightly customizable, Globus-provided set of services and environment must also be available for research groups who don't need or want to develop custom platforms.

b. Globus Publication should support ingestion from existing collections, and that ingestion process should be fully automatable, likely through the extended API recommended in (a).

c. Ingestion of data+metadata "in place" (i.e., where the data currently resides, with little or no file transfer required) into the Globus federated discovery and dissemination system should be fully developed and integrated with the Globus Publication API.

d. The metadata model in Globus Publication should be developed to support the richest and most detail-preserving metadata standards currently in use. This pilot showed that it is possible, with some lossiness, to migrate the existing PDC data to Globus Publication. In particular, it is critical for the metadata model in Globus Publication to support unlimited nesting of metadata.

e. Anonymous, unauthenticated access for discovery and download of open datasets (no Globus or any other credentials) should be developed by Globus in Globus Publication. Currently, Globus accounts are required, and the Globus file transfer protocol must be used for download of datasets. For small datasets in the long tail (e.g., a single KB-MB scale picture), the Globus transfer protocol can be more than is needed. Many institutions, projects, and journals have a hard requirement for completely open access to discovered datasets, with no accounts required for download.

f. Alternate viewing and download options, including the HTTP protocol, should be developed in the Globus Publication web services to extend functionality for users. For example, it can currently be inconvenient to download multiple datasets: for a collection with 1000 datasets, a user would have to discover and download the datasets individually, rather than in bulk.

g. Globus should develop, support, or make available the ability for common file types to be rendered in the web services rather than only downloaded. For example:
   i. image files displayed not as a list of names, but with thumbnails;
   ii. video files displayed with a video thumbnail, clicked to play in a window in the browser.

h. Globus Publication should extend the functionality of the self-service form configurator. Globus staff intervention for the creation of custom forms should be minimized or eliminated, to scale repository support to the national level.


i. Integration of Globus Publication with Archivematica for preservation processes should be scoped and potentially developed.

j. Globus Publication should allow explicit permissions and ACL settings on data in the collection configurator.

k. The API for the Globus Transfer service should be extended to allow the full set of file transfer options (e.g., mirroring directories) that is presently available through the Globus Transfer GUI.

l. A "franchise" model should be strongly considered for the long-term sustainability of Canadian use of Globus Transfer and Globus Publication. One model can be envisioned where Globus does feature and software development, which can be leveraged and deployed individually in different countries. The Canadian franchise should be:
   i. customizable for multiple-language support;
   ii. entirely hosted in Canada, to address metadata privacy issues;
   iii. sustainable and developable by Compute Canada and/or other stakeholders, should Globus cease development/support of the Globus file transfer or Publication product;
   iv. supported by the Globus support network as necessary, but also supportable internally;
   v. compatible with "vanilla" Globus and other franchised deployments of Globus.

m. Development of Globus Publication on mobile platforms should be considered, and placed into the Globus Publication roadmap if development will occur.

Report prepared by Jason Hlady, Dugan O’Neil, Umar Qasim, and Chuck Humphrey

Contributors to the CPDN pilot by organization:

Canadian Polar Data Network: Leanne Trimble, Umar Qasim, John Huck, Chuck Humphrey
Compute Canada: Jason Hlady, Alex Garnett, Sean Cavanaugh, Jason Knabl, Dugan O’Neil
Globus Publication: Kyle Chard, Rachana Ananthakrishnan, Jim Pruyne

August 2015


Appendix 1: Pilot Technical Descriptions

Data Source

Data were provided by Scholars Portal and transferred to Compute Canada resources (silo.westgrid.ca) via Globus. The raw datasets have been replicated to Silo, and are available via the Globus endpoint smc748#ipy_data, or at /wg_global/proc/ipy_data via a Silo login.

Archivematica Ingestion

The data are currently copied from the above location to /wg_global/proc/ingest. This directory is exported via NFS to the Archivematica server (bailer.westgrid.ca).

A Python script on bailer then prepares each dataset for ingestion by using the Python bagit library to create a bag, which is copied to the Archivematica watched directories for ingestion.
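For illustration, the core of that preparation step might look like the following minimal sketch using the bagit-python library's make_bag call; the dataset path and bag-info values are placeholders, not the pilot's actual values.

import bagit

# Turn a dataset directory into a BagIt bag in place; bag-info fields
# are illustrative.
bag = bagit.make_bag('/wg_global/proc/ingest/dataset_0721',
                     {'Source-Organization': 'Canadian Polar Data Network'})
bag.validate()  # verify checksums before handing off to Archivematica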

The script initiates ingestion via the Archivematica REST API. There are currently some known bugs in Archivematica that prevent the ingestion process from being fully automated: at present, the normalization step and the DIP storage steps must be completed manually. Archivematica's developers are aware of the issues.
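As a sketch of what initiating ingestion via the REST API can look like in the 1.x series, the request below approves a transfer that has landed in a watched directory. The endpoint and field names follow the Archivematica REST API documentation of that era and may differ between versions; the transfer name is a placeholder, and the host and key echo the configuration below.

import requests

resp = requests.post(
    'http://bailer.westgrid.ca/api/transfer/approve/',
    headers={'Authorization': 'ApiKey smc748:[REDACTED]'},
    data={'type': 'standard', 'directory': 'test2'})
print resp.status_code, resp.json()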

Globus Publication Ingestion

Once a DIP has been created, it is copied from the watched directories to /wg_global/proc/result. There, the main metadata is added to the DIP: the path for this, relative to the base directory of the DIP, is .globus_publication/globus_metadata.json, which is used by Globus.
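An illustrative globus_metadata.json is shown below. The report does not reproduce the real file, so this is a guess at its shape, assuming the flattened DataCite Dublin Core terms in JSON-LD produced by the metadata workflow described in the report body; all field names and values are placeholders.

{
  "@context": {"dc": "http://purl.org/dc/terms/"},
  "dc:title": "Example IPY dataset title",
  "dc:creator": "Researcher, Example",
  "dc:subject": ["environment", "Arctic"],
  "dc:issued": "2012-03-01"
}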

Once the DIP is verified to be complete, it is copied to the Globus Publish directory on Silo, /wg_global/globuspublish, and the owner is set to gplocal:gplocal, the Globus Publish local user on Silo. Once the data are all there, Globus will import the data into Globus Publish on their end. The data will then be in the IPY repository on Globus Publish.

Script Information

There are currently two Python scripts associated with the pipeline. The first, ingestion_script, does the work of preparing a SIP, initiating the transfer to Archivematica, then processing the DIPs for ingestion into Globus Publication and placing the AIPs in the storage location.

The second script, mirror_script, copies the contents of the AIP storage pool to a set of shared endpoints via Globus. It also transfers any files that have been added or updated. This is intended to be run from a cron job to replicate the contents of the AIP storage pool to each replication endpoint for backup purposes.

Configuration

The scripts use a Python config file for much of their data. Below is a sample containing all options currently supported by the scripts. The "Archivematica" and "Globus" sections are used by the cc_rdm library, the "Mirroring" section is used by the mirroring script, and the "Ingestion" section is used by the ingestion/processing script.

[Archivematica]
host = http://bailer.westgrid.ca
user = smc748
key = [REDACTED]

[Globus]
user = smc748
pass = [REDACTED]

[Mirroring]
source = smc748#usask_aip_storage
dest = smc748#sfu_aip_storage, computecanada#scinet_aip_storage

[Ingestion]
input_dir = /process/ingest
output_dir = /process/result
metadata_dir = /process/metadata
bag_dir = /var/archivematica/sharedDirectory/watchedDirectories/activeTransfers/baggitDirectory
dip_dir = /var/archivematica/sharedDirectory/watchedDirectories/uploadedDIPs
globus_meta_dir = .globus_publication
globus_meta_file = globus_metadata.json
aip_storage_dir = /var/archivematica/sharedDirectory/www/AIPsStore
globus_aip_dir = /process/aips
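Reading this file from the scripts is standard-library work; a minimal sketch using Python 2's ConfigParser follows (the filename cc_rdm.cfg is illustrative):

import ConfigParser

config = ConfigParser.ConfigParser()
config.read('cc_rdm.cfg')

host = config.get('Archivematica', 'host')
# The Mirroring "dest" option holds a comma-separated endpoint list.
dests = [d.strip() for d in config.get('Mirroring', 'dest').split(',')]
print host, dests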


Appendix 2: RDM Python Library

The RDM Python Library is a set of Python classes that can be used to transfer files via Globus, bag them up, and ingest them into Archivematica. The package is available on GitHub at https://github.com/smc748/cc_rdm

Overview

The library contains three main classes. These are TransferOperation, BagOperation, and IngestOperation. There is also a Configuration object that contains the information needed to access the various services. Documentation for the module is available through pydoc after installing.

Installation

Download the package from GitHub. Then run 'python setup.py bdist' in the main directory. This will build a package under ./dist that can be unpacked in the root directory to install the package.

Example

A bare-bones example script:

import cc_rdm, time, sys, bagit, shutil

if __name__ == "__main__":
    # Service credentials and endpoints for the pilot configuration.
    config = cc_rdm.Configuration()
    config.globus_user = "[account]"
    config.globus_pass = "[pass]"
    config.a_api_host = "http://bailer.westgrid.ca"
    config.a_user = "[account]"
    config.a_api_key = "[key]"

    bagdir = 'test'
    file_list = ['test.iso']

    # Transfer the source files between Globus endpoints, polling until done.
    t = cc_rdm.TransferOperation(config=config, file_list=file_list,
                                 dest_dir='/proc/test')
    t.start_transfer('smc748#silo', 'smc748#cc_rdm')
    while t.transfer_status() == 'ACTIVE':
        print '.',
        sys.stdout.flush()
        time.sleep(5)

    status = t.transfer_status()
    if status != 'SUCCEEDED':
        sys.exit(1)

    # Bag the transferred data (the second argument is a bag-info dict).
    bag_op = cc_rdm.BagOperation('/process/test', {'Test1': 'Test 1 data'})
    bag = bag_op.make_bag()

    # Copy the bag into the Archivematica watched directory and trigger ingest.
    shutil.copytree('/process/test',
                    '/var/archivematica/sharedDirectory/watchedDirectories/activeTransfers/baggitDirectory/test2')
    ingestOp = cc_rdm.IngestOperation(config)
    ingestOp.ingest('test2')


Appendix 3: Files in pilot datasets by file type

Data Type  File Count

ASCII C++ program text, with CRLF line terminators  1
ASCII C++ program text, with very long lines, with CRLF line terminators  98
ASCII English text, with CRLF line terminators  18
ASCII English text, with CRLF, CR, LF line terminators  1
ASCII English text, with very long lines, with CR line terminators  4
ASCII English text, with very long lines, with CRLF line terminators  43
ASCII text  640
ASCII text, with CR line terminators  41
ASCII text, with CRLF line terminators  3482
ASCII text, with CRLF, CR line terminators  149
ASCII text, with very long lines  11
ASCII text, with very long lines, with CRLF line terminators  11
ASCII text, with very long lines, with no line terminators  3
AppleDouble encoded Macintosh file  4
Bio-Rad .PIC Image File 12594 x 14644, 13621 images in file  1
Bio-Rad .PIC Image File 24929 x 25657, 13111 images in file  1
Bio-Rad .PIC Image File 24931 x 13616, 25655 images in file  1
DBase 3 data file (1 records)  2
DBase 3 data file (118 records)  1
DBase 3 data file (128 records)  1
DBase 3 data file (130 records)  2
DBase 3 data file (136 records)  1
DBase 3 data file (148 records)  1
DBase 3 data file (149 records)  1
DBase 3 data file (159 records)  1
DBase 3 data file (161 records)  1
DBase 3 data file (163 records)  1
DBase 3 data file (173 records)  1
DBase 3 data file (174 records)  1
DBase 3 data file (177 records)  1
DBase 3 data file (178 records)  1
DBase 3 data file (181 records)  1
DBase 3 data file (182 records)  1
DBase 3 data file (185 records)  1
DBase 3 data file (186 records)  1
DBase 3 data file (192 records)  1
DBase 3 data file (193 records)  1
DBase 3 data file (199 records)  1
DBase 3 data file (2 records)  1
DBase 3 data file (200 records)  1
DBase 3 data file (202 records)  1
DBase 3 data file (204 records)  1
DBase 3 data file (207 records)  1
DBase 3 data file (210 records)  1
DBase 3 data file (235 records)  2
DBase 3 data file (236 records)  1
DBase 3 data file (247 records)  1
DBase 3 data file (4 records)  2
DBase 3 data file (5 records)  1
DBase 3 data file (94 records)  1
ESRI Shapefile version 1000 length 120 type Point  1
ESRI Shapefile version 1000 length 2034 type PolyLine  1
ESRI Shapefile version 1000 length 334 type PolyLine  1
ESRI Shapefile version 1000 length 54 type PolyLine  2
ESRI Shapefile version 1000 length 58 type PolyLine  1
ESRI Shapefile version 1000 length 66 type PolyLine  2
ESRI Shapefile version 1000 length 70 type Point  1
ESRI Shapefile version 1000 length 786 type PolyLine  2
ESRI Shapefile version 1000 length 926 type PolyLine  1
ESRI Shapefile version 16777216 length 58  2
ESRI Shapefile version 16777216 length 66  2
ESRI Shapefile version 33554432 length 58  1
ESRI Shapefile version 33554432 length 70  1
ESRI Shapefile version 67108864 length 66  2
ESRI Shapefile version 67108864 length 94  2
ESRI Shapefile version 83886080 length 62  1
ESRI Shapefile version 83886080 length 94  1
Hierarchical Data Format (version 4) data  43
ISO-8859 English text, with CR line terminators  131
ISO-8859 English text, with CRLF line terminators  23
ISO-8859 English text, with CRLF, CR, LF line terminators  8
ISO-8859 English text, with very long lines, with CR line terminators  7
ISO-8859 English text, with very long lines, with CRLF line terminators  11
ISO-8859 English text, with very long lines, with CRLF, CR, LF line terminators  13
ISO-8859 text  2
ISO-8859 text, with CR line terminators  5
ISO-8859 text, with CRLF line terminators  154
JPEG image data, JFIF standard 1.01  8
JPEG image data, JFIF standard 1.01, comment  11030
JVT NAL sequence  40
MS Windows shortcut  2
Microsoft Office Document  116
Non-ISO extended-ASCII English text, with CRLF line terminators  7
Non-ISO extended-ASCII English text, with CRLF, CR, LF line terminators  4
Non-ISO extended-ASCII English text, with CRLF, LF line terminators  21
Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators  31
Non-ISO extended-ASCII English text, with very long lines, with CRLF, CR, LF line terminators  3
Non-ISO extended-ASCII English text, with very long lines, with CRLF, LF line terminators  5
Non-ISO extended-ASCII English text, with very long lines, with CRLF, NEL line terminators  2
Non-ISO extended-ASCII text, with CR line terminators  2
PDF document, version 1.3  91
PDF document, version 1.4  49
PDF document, version 1.5  229
PDF document, version 1.6  2
RAR archive data, v1d, os  1
Rich Text Format data, version 1, ANSI  1
TIFF image data, little-endian  162
UTF-8 Unicode C++ program text, with CRLF, LF line terminators  1
UTF-8 Unicode English text, with very long lines, with CRLF, CR, LF line terminators  2
UTF-8 Unicode English text, with very long lines, with CRLF, LF line terminators  2
Zip archive data, at least v2.0 to extract  18
data  149
empty  3

Total files  16948