Designing a long term very large digital library Stephen Green Head of Digital Library Infrastructure [email protected]

Aug 01, 2018


Designing a long term very large digital library

Stephen Green

Head of Digital Library Infrastructure

[email protected]

Page 2

The design of a very large digital library presents many challenges

[Diagram: map of design topics for a long term very large digital library]
• Low cost of ownership
• Large object-store
• Tape is not scalable
• Geographically diverse
• Self monitoring
• Assurance of authenticity
• Use robust digital signing
• Long term retention
• Availability of ingest / access
• Persistent identifier
• Usage and random access
• Geographically resilient
• Changing storage market
• Rolling procurement & replacement
• Heterogeneous store design
• Significant failure rate
• Local self healing
• Remote self healing
• Separate metadata store
• Re-signing over time
• Change vendors over time
• Trust separated from vendor
• Service priorities
• Provision of storage
• File format migration

Page 3

Service priorities - we need to determine the service priorities so we can focus on the critical topics

Events, listed from most to least severe (the slide labels the scale “Disaster” and “Incident”, with an “Increasing severity” arrow):
• Loss of all objects in the store
• Loss of some objects
• Long term interruption to access
• Long term interruption to ingest
• Short term interruption to access
• Short term interruption to ingest

• An interruption to ingest is less visible
• Hence a longer interruption can be tolerated

• An interruption to access is externally visible
• A long term interruption to access is not tenable
• This must be mitigated in the design of the system

• Retention is more important than high availability
• The store must be resilient

Page 4

Geographically resilient - to tolerate the loss of a single computer room / facility, a multi-site repository is needed to deliver service without interruption

• One cannot obtain commercial DR for multi-100 TB systems
• DR must thus be in the system design
• A single site, with a common-mode disaster, cannot sustain availability, and so is not acceptable
• Hence need a multi-site solution
• Full service can be delivered remotely, albeit more slowly than locally
• Remote delivery implies that each site does not need “maximum resilience” within that site; such systems are expensive

DR - disaster recovery

[Diagram: Long term very large digital library; Large object-store; Geographically diverse; Long term retention; Availability of ingest / access; Geographically resilient; Service priorities]

Page 5

Anticipated usage patterns lead to the conclusion that tape is not sufficiently scalable

• Usage patterns: expect peaks with a “very long tail”
• Usage metrics are likely to be in terms of:
  • Ingest in objects/sec & capacity/sec, growing over time
  • Access in relation to the size of the store, hence measured in objects/TB/sec and volume delivered/TB/sec
• Access will need to be scalable & it will be random access
• Tape storage is efficient when “restoring large file systems” but is not efficient for restoring individual items
• Typically one tape robot can retrieve an item in ~40 secs and the maximum number of robots is typically ~10
• The throughput of a tape library is thus limited by the number of robots, so tape does not scale for random access to a large library
• Future costs of tape vs online storage are unclear, and tape also needs routine self checks
• Exclusively using online storage is a “clean”, keep-it-simple approach and there appears little to be lost by adopting it
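The robot figures above imply a hard ceiling on random retrievals. The following is a minimal sketch of that arithmetic, assuming each robot serves one request at a time and ignoring queueing and drive contention; the function name is illustrative and not from the slides.

```python
def tape_random_access_ceiling(robots: int, seconds_per_retrieval: float) -> float:
    """Upper bound on items retrieved per second when each robot
    handles one retrieval at a time."""
    return robots / seconds_per_retrieval


ROBOTS = 10              # "~10" robots, taken from the slide
SECONDS_PER_ITEM = 40.0  # "~40 secs" per retrieval, taken from the slide

per_second = tape_random_access_ceiling(ROBOTS, SECONDS_PER_ITEM)
per_day = per_second * 24 * 60 * 60
print(f"{per_second:.2f} items/sec, about {per_day:,.0f} items/day")
# prints "0.25 items/sec, about 21,600 items/day"
```

The ceiling does not grow with the size of the store, whereas the access metrics above are expressed per TB held, which is the sense in which tape does not scale here.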

HSM - hierarchical storage management - mixed on/offline

[Diagram: Large object-store; Tape is not scalable; Usage and random access]

Page 6

Large scale storage is not intrinsically reliable and so monitoring and self healing are required

• HP reviewed the long term reliability of content held by the Internet Archive
• ~1 in 1000 files changed during a three year period
• Extrapolating to a collection of ~150 million digital items, ~150,000 items would change over three years, so ~4000 items would suffer corruption per month
• While the reliability of storage is likely to improve, some “bit rot” is inevitable, and so we need to plan for it
• We do not worry that Internet bearers are unreliable since we add detection / recovery (e.g. TCP over IP)
• We need a similar approach for large scale storage: self monitoring, detection and recovery have to be automatic
• Ideally this should be both at a local level (e.g. RAID 5/6) and also across storage sites
• See http://arxiv.org/abs/cs.DL/0508130
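As an illustration of automatic detection and recovery across sites, here is a minimal sketch that checks each copy of an object against a digest recorded at ingest and overwrites damaged copies from an intact one. It is not the design described in the talk: the in-memory dictionaries standing in for sites, the choice of SHA-256, and the name `scrub_and_heal` are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Optional

Replica = Dict[str, bytes]  # object id -> bytes held at one site (stand-in for real storage)


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def scrub_and_heal(object_id: str, expected: str, replicas: List[Replica]) -> List[int]:
    """Check every replica of one object and repair bad copies from a good one.

    Returns the indexes of the replicas that were repaired.
    Raises RuntimeError if no replica matches the expected digest.
    """
    good: Optional[bytes] = None
    bad: List[int] = []
    for index, replica in enumerate(replicas):
        data = replica.get(object_id)
        if data is not None and digest(data) == expected:
            good = data
        else:
            bad.append(index)  # missing or corrupted copy
    if good is None:
        raise RuntimeError(f"no intact copy of {object_id} at any site")
    for index in bad:
        replicas[index][object_id] = good
    return bad


original = b"scanned page 1"
expected = digest(original)            # digest recorded at ingest
site_a = {"obj-1": original}
site_b = {"obj-1": b"scanned pagE 1"}  # simulated bit rot at the second site
repaired = scrub_and_heal("obj-1", expected, [site_a, site_b])
print(repaired)  # [1]
```

A real deployment would run such a check continuously over the whole collection; local mechanisms such as RAID would sit beneath it and are not modelled here.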

[Diagram: Large object-store; Self monitoring; Significant failure rate; Local self healing; Remote self healing]

Page 7

The changing landscape in the storage system market implies accommodating rolling procurement and heterogeneous storage systems

• The market for storage systems is changing rapidly
• Cost of storage is reducing by 30-40% per year
• New innovative players emerge and some fade
• These imply that supplier “lock-in” is not sensible
• Not realistic to assume there will be one supplier for the lifetime of the library, hence need the flexibility to change supplier over time
• Falling costs favour rolling procurement just ahead of demand
• Cost effective to replace on a rolling basis on expiry of warranty
• Rolling replacement/procurement programmes imply the need to be able to support heterogeneous storage systems
• The design of the logical architecture thus needs to support storage sourced from multiple storage vendors
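The case for buying just ahead of demand can be made concrete with a small calculation. This is a rough sketch under stated assumptions: the demand figures and starting price are hypothetical, the 35% annual decline is simply a value inside the 30-40% range quoted above, and replacement at warranty expiry, power and staffing are ignored.

```python
from typing import Sequence


def upfront_cost(demand_tb: Sequence[float], price_per_tb: float) -> float:
    """Buy the peak capacity in year 0 at the year-0 price."""
    return max(demand_tb) * price_per_tb


def rolling_cost(demand_tb: Sequence[float], price_per_tb: float,
                 annual_decline: float) -> float:
    """Buy only each year's extra capacity at that year's (lower) price."""
    total = 0.0
    owned = 0.0
    for year, needed in enumerate(demand_tb):
        price = price_per_tb * (1 - annual_decline) ** year
        total += max(0.0, needed - owned) * price
        owned = max(owned, needed)
    return total


demand = [100, 200, 300, 400]  # hypothetical TB required in years 0..3
start_price = 1000.0           # hypothetical currency units per TB in year 0

print(upfront_cost(demand, start_price))
print(round(rolling_cost(demand, start_price, 0.35), 2))
```

With these illustrative inputs the rolling purchase costs a little under 60% of the up-front purchase, and each year's purchase may come from a different supplier, which is why the store has to tolerate heterogeneous hardware.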

[Diagram: Low cost of ownership; Changing storage market; Rolling procurement & replacement; Heterogeneous store design; Change vendors over time; Provision of storage]

Page 8

To have a meaningful archive continuous assurance of authenticity is required from the time of ingest (part 1)

Assurance of authenticity:
• With a physical item this can be based partly on examination of the item as well as its content
• With a digital item, there is no physical item to examine
• A long term library will also migrate through storage products and vendors over time
• Stored digests and/or protection within a single system do not address handover and so are not sufficient - for example:

[Diagram: timeline from ingest to now across three successive storage systems]
1st storage system
• May have measures to “protect” (no unauthorised changes)
• May store digests
Handover between systems: not protected
2nd storage system
• May have measures to “protect”
• May store digests
Handover between systems: not protected
3rd storage system
• May have measures to “protect”
• May store digests

[Diagram: Assurance of authenticity; Use robust digital signing; Long term retention; Trust separated from vendor]

Page 9

To have a meaningful archive continuous assurance of authenticity is required from the time of ingest (part 2)

• Only a chain of digital signatures of increasing strength will do
  • an object’s signature is re-evidenced periodically over time
  • the signature chain is “transferred” when systems are refreshed
  • “perpetuity” is provided by the signature chain, not by a system
• The assurance (trust) is thus separate from the capabilities of any one vendor or storage system
• The assurance relies on keeping the private signing key private
  • It is not sufficient to rely on “software” signing
  • The only trusted way to keep a private key private is to use an HSM*
• Leads to the conclusion that if you do not use an HSM then you “do not have a meaningful long term archive”
• But no current storage products use them

Standards noted on the slide:
• RFC 3161 - ideally time stamped
• FIPS 186-2 - Digital Signature Standard
• FIPS 140-2 - Security Requirements for Cryptographic Modules

* HSM - hardware security module, not hierarchical storage management
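The idea of re-evidencing an object over time can be sketched as a chain in which each new link covers both the object and the previous link, using a stronger hash than before. This is only an illustration of the chaining, not the scheme used by the library: the `sign` function is a software HMAC stand-in for a signature produced inside an HSM (exactly the “software” signing the slide says is insufficient), RFC 3161 time stamping is omitted, and all names are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Evidence:
    algorithm: str  # hash algorithm used for this link
    covered: str    # hex digest over the object plus the previous link
    signature: str  # stand-in for a signature produced inside an HSM


def sign(key: bytes, message: bytes) -> str:
    # Placeholder only: a real system would call out to an HSM here.
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def link_digest(algorithm: str, obj: bytes, chain: List[Evidence]) -> str:
    """Digest over the object and, if present, the most recent link."""
    h = hashlib.new(algorithm)
    h.update(obj)
    if chain:
        h.update(chain[-1].covered.encode())
        h.update(chain[-1].signature.encode())
    return h.hexdigest()


def add_evidence(chain: List[Evidence], obj: bytes, key: bytes, algorithm: str) -> None:
    covered = link_digest(algorithm, obj, chain)
    chain.append(Evidence(algorithm, covered, sign(key, covered.encode())))


def verify(chain: List[Evidence], obj: bytes, key: bytes) -> bool:
    """Recompute every link in order; any mismatch breaks the chain."""
    for position, link in enumerate(chain):
        if link_digest(link.algorithm, obj, chain[:position]) != link.covered:
            return False
        if not hmac.compare_digest(sign(key, link.covered.encode()), link.signature):
            return False
    return bool(chain)


key = b"demo key, not a real secret"
obj = b"ingested bit stream"
chain: List[Evidence] = []
add_evidence(chain, obj, key, "sha256")  # evidence created at ingest
add_evidence(chain, obj, key, "sha512")  # later re-evidencing with a stronger hash
print(verify(chain, obj, key))           # True
print(verify(chain, obj + b"!", key))    # False
```

Because the chain is plain data, it can be copied to a successor storage system together with the object, which is how the assurance is kept separate from any one vendor.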

Page 10

Time/date stamping is a well established library process

Page 11

The use of robust digital signing requires a separate metadata store

• A storage system can thus provide long term assurance of authenticity over a succession of vendors
• Able to “prove” a bit stream is identical to that ingested
• Based on holding objects in an “invariant” store
• But one store cannot support both “no changes” & also “changes”
• However, metadata can & may need to change over time to support versioning and successor objects
• Live metadata that changes thus cannot be held in the same store as the invariant objects

[Diagram: two-tier architecture]
Metadata management
• Upper layer: versions and collections
• Lower layer: METS for each stored object
• Of conventional size (few TB)
• Hence conventional backup regimes can be applied
• Also export new and updated metadata for additional resilience
Invariant resilient storage system providing assurance of authenticity
• Diagram labels: External persistent identifier; V; O1; O2; Replacement by successor version

[Diagram: Use robust digital signing; Separate metadata store]
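The separation between an invariant object store and a changeable metadata store can be sketched as two small classes. This is a minimal illustration, not the architecture of the actual system: keying objects by their SHA-256 digest, the class and method names, and the identifier `pid:example-1` are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


class InvariantStore:
    """Write-once object store: an object is never changed after it is stored."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(key, data)  # never overwrites an existing object
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


class MetadataStore:
    """Mutable store: maps a persistent identifier to its list of versions."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}

    def add_version(self, persistent_id: str, object_key: str) -> None:
        self._versions.setdefault(persistent_id, []).append(object_key)

    def current(self, persistent_id: str) -> str:
        return self._versions[persistent_id][-1]

    def history(self, persistent_id: str) -> List[str]:
        return list(self._versions[persistent_id])


objects = InvariantStore()
metadata = MetadataStore()

o1 = objects.put(b"first edition")
metadata.add_version("pid:example-1", o1)
o2 = objects.put(b"second edition")        # successor version, stored as a new object
metadata.add_version("pid:example-1", o2)

print(objects.get(metadata.current("pid:example-1")))  # b'second edition'
print(objects.get(o1))                                 # b'first edition'
```

Only the metadata store changes when a successor version arrives; the earlier object stays byte-for-byte as ingested, so any evidence of its authenticity remains valid.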

Page 12

Summary - the design of a very large digital library presents many challenges

[Diagram: the map of design topics from page 2, repeated]

Page 13

Annex - The needs for large scale archival are not well met by the storage market

High end market profile
• Immense scale with corresponding price
• Maximised performance with price premium
• Single frame maximised resilience / availability with price premium
• Proprietary software management

Low end market profile
• Low cost without scalability

Large scale archival needs
• Immense scale
• Low total cost of ownership
• Long term resilience - in practice requires distribution
• Self healing
• Strategy for future migration
• Assurance of authenticity

Does not need
• Maximised performance
• Maximised resilience within single frame
• 100% availability

We received briefings from ~35 storage vendors

There were, and still are, two main clusters of storage systems: scalable but not affordable, or affordable but not scalable

Needs for large scale archival are not well met