‘Archiving and managing a million or more data files on BiG Grid’ Peter Doorn, Data Archiving and Networked Services (DANS) With Jan Just Keijser (NIKHEF) BiG Grid & Beyond, Amsterdam, 26/9/2012

Archiving and managing a million or more data files on BiG Grid

Nov 17, 2014


Technology

pkdoorn

 
Transcript
Page 1: Archiving and managing a million or more data files on BiG Grid

‘Archiving and managing a million or more data files on BiG Grid’

Peter Doorn, Data Archiving and Networked Services (DANS)
With Jan Just Keijser (NIKHEF)
BiG Grid & Beyond, Amsterdam, 26/9/2012

Page 2: Archiving and managing a million or more data files on BiG Grid

Contents

• Promises and ideas at the kick-off of BiG Grid in 2007: what became of them?
  • In NL
  • SSH in UK, DE, ESFRI
• Two sub-projects of BiG Grid with DANS
  • Analyzing and visualizing big humanities data (briefly)
  • Archiving and managing a million or so humanities files
• Beyond BiG Grid: next requirements and challenges for the future of SSH research and infrastructure
  • An example of analysis of Big Social Science Data (GPS traces) from Italy
  • Challenges for data infrastructure

Page 3: Archiving and managing a million or more data files on BiG Grid

From the original Big Grid proposal:

“BIG GRID is crucial to the success and continuity of many Dutch research communities, covering important areas such as life sciences, astronomy, particle physics, meteorology, and climate research, water management, to name just a few.

However, the very nature of the new infrastructure, a multidimensional collaboration enabler and accelerator, allows for direct participation of also social sciences, humanities, and even addresses communities in administrative domains, like digital academic repositories.”

Promises… Promises…

Page 4: Archiving and managing a million or more data files on BiG Grid

ESFRI projects in SSH about grid

• CESSDA: grid technologies for facilitating the merging of distributed data sources
• DARIAH: grid services for an open semantic architecture facilitating arts and humanities research
  • need for ‘easy’ interfaces for humanities scholars
  • services need to be usable without the complexities of the grid infrastructure
• CLARIN: grid technology for
  • access to guidance and advice through distributed knowledge centres
  • access to repositories of data with standardized descriptions, processing tools ready to operate on standardized data

Promises… Promises…

Page 5: Archiving and managing a million or more data files on BiG Grid

Tools for processing, analysing, annotating, editing and publishing text data

• Grid-enabled workbench to process, analyse, annotate, edit and publish XML-encoded textual data for academic research

• Connect to the D-Grid Integration Platform (DGI) via TextGrid-specific middleware components

• Demonstrate the efficiency of the grid-enabled tools in the areas publishing, processing, retrieval, and linking

• Semantic TextGrid: semantic methods for processing text assets, and for interweaving texts and dictionaries

Promises… Promises…

Page 6: Archiving and managing a million or more data files on BiG Grid

Germany: Textgrid

But also results

Page 7: Archiving and managing a million or more data files on BiG Grid

TextGrid VRE: Repository + Lab

But also results

Page 8: Archiving and managing a million or more data files on BiG Grid

UK: e-Social Science

“The National Centre for e-Social Science (NCeSS) investigates how innovative and powerful computer-based infrastructure and tools, developed under the UK e-Science programme, can benefit the social science research community”

Examples of grid-projects:
• Mixed Media Grid (MiMeG): generate tools and techniques for social scientists to analyse audio-visual qualitative data and related materials collaboratively
• SABRE software has been specifically designed for the statistical analysis of multi-process random effect response data, using parallel processing

Promises… Promises…

Page 9: Archiving and managing a million or more data files on BiG Grid

UK e-Social Science discontinued…

Is no more…

Page 10: Archiving and managing a million or more data files on BiG Grid

Dutch example from humanities

Subject: organization of knowledge
• Comparison of designed classification system (UDC) with a socially grown knowledge system (Wikipedia)
• Multidisciplinary research group, including DANS researcher Andrea Scharnhorst
• Big data set (dump of Wikipedia: 2.8 TB)
• Mine the data to extract the page and category link changes over time
• Create complex visualizations
• Computational support by BiG Grid team: Tom Visser, Coen Schrijvers and Ammar Benabadelkader

Page 11: Archiving and managing a million or more data files on BiG Grid
Page 12: Archiving and managing a million or more data files on BiG Grid

Archiving experiments since 2007

Grid middleware not very suitable for our archiving purposes

Use case: How can you be sure that what you store on the grid is valid?

Giving proof of data integrity is a requirement of ISO standard 16363 for trusted digital archives

Advantages of grid storage:
• Fast access to grid worker node
• Hierarchical storage manager: e.g. efficient automated backup procedures
• Shared facility is efficient and economically attractive

Page 13: Archiving and managing a million or more data files on BiG Grid

Large numbers of datasets and files

• More than 23,000 data sets in DANS archives
• Every data set consists of 1+ data files, sometimes 1000+
• Most data sets are small (98% < 1 GB)
• For example, the entire population census of 1960 (> 11 million records) fits on one CD-ROM (< 700 MB)
• Total number of files: > 1 million
• Total storage volume: ca. 70 TB
• Long processing times with large numbers of datasets and files
• Management operations on the whole archive are slow and problematic on normal servers:
  • Mass conversions (e.g. thumbnails of images)
  • Data integrity control (checksums)
  • Compressing the data
• Copying of the whole archive to the grid is not trivial

Page 14: Archiving and managing a million or more data files on BiG Grid

Datasets in DANS EASY (Sept. 2012)

• 1.8% of datasets > 2 GB
• 2.8% of datasets > 1 GB
• 23,560 datasets
• 1,693,413 files

Page 15: Archiving and managing a million or more data files on BiG Grid

The experiment

• Experiment with five digital archives (not in EASY), containing a total of 290,341 files, grouped into 1695 'tar' files of 5 GB each (c. 8.5 TB in total)
• Carried out by Jan Just Keijser (Nikhef)
• Three-phase workflow

Page 16: Archiving and managing a million or more data files on BiG Grid

DANS Workflow phase 1:
• Create checksums
• Create tarballs (.tar files)
• Upload tarballs to the grid

Diagram (DANS → grid storage): 1) md5sum, 2) tar, 3) Upload
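
The first two steps of phase 1 (checksums, then tarballs) could look roughly like the sketch below. It is only an illustration under stated assumptions: the slides name `md5sum` and `tar` but do not show the actual scripts, so the function names `md5_of_file`, `build_manifest` and `create_tarball` are hypothetical, and the upload step is left out because the slides do not say which grid storage client was used.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root, POSIX style) to its MD5 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def create_tarball(root, tar_path):
    """Pack every file under root into an uncompressed .tar archive.

    Assumption: tar_path lies outside root, so the archive does not include itself.
    """
    root = Path(root)
    with tarfile.open(tar_path, "w") as archive:
        for p in sorted(root.rglob("*")):
            if p.is_file():
                archive.add(p, arcname=p.relative_to(root).as_posix())
    return tar_path
```

Keeping the manifest keyed by the same relative names that go into the tarball means it can later be compared directly with checksums computed from the unpacked archive.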

Page 17: Archiving and managing a million or more data files on BiG Grid

DANS Workflow phase 2:
• Download .tar file
• Compress it to a .tar.gz file
• Upload compressed tarball

Diagram (grid storage ↔ worker node): 1) Download, 2) Compress, 3) Upload
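
The compression step that runs on the worker node might be sketched as follows. This is an assumption-laden illustration, not the script used in the experiment: `compress_tarball` is a hypothetical helper, and the download and upload steps around it are omitted because the slides do not name the grid transfer tools.

```python
import gzip
import shutil


def compress_tarball(tar_path, gz_path=None):
    """Gzip an existing .tar file and return the path of the .tar.gz result.

    The data is streamed, so a 5 GB tarball is never held in memory at once.
    """
    gz_path = gz_path or f"{tar_path}.gz"
    with open(tar_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```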

Page 18: Archiving and managing a million or more data files on BiG Grid

DANS Workflow phase 3:
• Download .tar.gz file
• Unpack it
• Calculate checksums
• Send checksums back and compare

Diagram (grid storage → worker node → DANS): 1) Download, 2) Unpack, 3) md5sum, 4) Compare
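
A minimal sketch of the verification in phase 3 is given below. It is not the tool from the experiment: the names `checksums_from_archive` and `compare_manifests` are hypothetical, and as a simplification the code reads each member straight from the .tar.gz instead of unpacking it to disk first, which yields the same MD5 digests.

```python
import hashlib
import tarfile


def checksums_from_archive(gz_path, chunk_size=1 << 20):
    """Compute MD5 digests of the regular files inside a .tar.gz archive."""
    digests = {}
    with tarfile.open(gz_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            digest = hashlib.md5()
            stream = archive.extractfile(member)
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                digest.update(chunk)
            digests[member.name] = digest.hexdigest()
    return digests


def compare_manifests(expected, actual):
    """Report files that are missing, unexpected, or whose digests differ."""
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "mismatched": sorted(n for n in shared if expected[n] != actual[n]),
    }
```

An empty report for all three keys means the stored copy matches the checksums taken before upload; any entry under `mismatched` points at a file that changed somewhere between the original archive and the worker node.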

Page 19: Archiving and managing a million or more data files on BiG Grid

Results

• The tool works
• One checksum mismatch detected: disk failure on grid worker node!

Page 20: Archiving and managing a million or more data files on BiG Grid

SSH: big data challenges

• Data generated by people tend to be small
• Data generated by social processes (Twitter, Facebook), transactions (financial), administrations and by devices (GSM, GPS) tend to be big
• More analytical projects of big data in SSH (but few in NL):
  • Millions of digitized books (“Culturomics”)
  • Sentiment analysis of Twitter feeds to predict markets and economic trends
  • Traffic flows using GPS

Page 21: Archiving and managing a million or more data files on BiG Grid

An example from Italy

GPS traces
• 17K private cars
• one week of ordinary mobility
• 200K trips (trajectories)
• Milan, Italy

From a presentation by Dino Pedreschi, Pisa

Data donated by OCTO Telematics

Page 22: Archiving and managing a million or more data files on BiG Grid

Where is traffic concentrated between midnight and 2 a.m.? (red = most intense)

Page 23: Archiving and managing a million or more data files on BiG Grid

Where is traffic concentrated between 6 p.m. and 8 p.m.?

Page 24: Archiving and managing a million or more data files on BiG Grid

Select only trips that start in the city centre (orange) and move to North-West

Page 25: Archiving and managing a million or more data files on BiG Grid

Where are people between 6 p.m. and 8 p.m. on Wednesday, April 4th?

Page 26: Archiving and managing a million or more data files on BiG Grid

Where are people between 8 p.m. and 10 p.m. on Wednesday, April 4th? (A high-density spot has appeared.)

Page 27: Archiving and managing a million or more data files on BiG Grid

Where are people between 10 p.m. and midnight on Wednesday, April 4th? (The dense spot has disappeared. What happened?)

Page 28: Archiving and managing a million or more data files on BiG Grid

Focus on the high-density spot: it is centred on the parking lots of the stadium, where a football match took place.
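
The slides do not say how these density maps were computed. One simple way to look for such a hotspot, offered purely as an assumption, is to count GPS points per grid cell within a time window and compare consecutive windows. The function name `density_by_cell`, the `(timestamp, latitude, longitude)` tuple layout and the 0.01-degree cell size are all hypothetical choices for this sketch.

```python
import math
from collections import Counter


def density_by_cell(points, start, end, cell_deg=0.01):
    """Count GPS points per grid cell for timestamps in the window [start, end).

    points: iterable of (timestamp, latitude, longitude) tuples.
    Returns a Counter keyed by (row, column) cell indices.
    """
    counts = Counter()
    for timestamp, lat, lon in points:
        if start <= timestamp < end:
            cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
            counts[cell] += 1
    return counts
```

Calling it once per two-hour window and inspecting `most_common()` for each result would show a cell whose count rises sharply in one window and falls again in the next, which is the pattern the three maps above illustrate.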

Page 29: Archiving and managing a million or more data files on BiG Grid

SSH Research beyond Big Grid

• Acceptance of grid technology by SSH community is low and slow: “my laptop has enough processing power”
• Grid is still perceived as “complicated”
• Researchers are not aware of:
  • data management issues
  • the research potential of “Big SSH Data”
• Demonstrator projects are still needed:
  • Social scientists need to focus more on the analytical potential of “Big Social Data”
  • “Culturomics” in humanities
• DANS can help to make that accessible, although we are not only driven by data, but also by… demand!

Page 30: Archiving and managing a million or more data files on BiG Grid

Archiving beyond BiG Grid

• Storage capacity: joining forces with other parties: 3TU Data Centre, National Coalition for Digital Preservation (NCDD with Royal Library, National Archives, Institute for Sound and Vision, museum sector), Roadmap projects
• Archiving is more than storage: archival management requires repeated operations on masses of files, many small, but also big (e.g. audio/visual)
• Set of procedures to support archival management
• Continuity of grid infrastructure is prerequisite
• Is cloud the answer?
  • Public cloud is not without risk
  • Costs are not yet attractive enough
  • Private community cloud is attractive

Page 31: Archiving and managing a million or more data files on BiG Grid
Page 32: Archiving and managing a million or more data files on BiG Grid

Thank you for your attention

[email protected]@nikhef.nl

www.dans.knaw.nl