National Digital Repository ® Preserving the imperfect: reflections from NDAD and elsewhere Kevin Ashley Head of Digital Archives Group ULCC
National Digital Repository®

Preserving the imperfect: reflections from NDAD and elsewhere

Kevin Ashley

Head of Digital Archives Group

ULCC

2007-03-23 Presdb07 - Edinburgh 2

Overview

• Issues that arise when databases are records
• Informing (expensive, important) decisions
• Tensions between ideal formats and non-ideal data
• Representation mechanisms for access control and absent data
• Concentrating on R&D issues

What is NDAD?

• A service for UK government records which exist as ‘structured information’
• Contains data + contextual information
• Established in 1997 - service in March 1998
• First service by a national archive to provide online public access to preserved material
• Selection undertaken by National Archives and government departments
• Everything else at ULCC: under contract to TNA

Preservation

• Data transformed to canonical form - originals kept
• Paper documentation digitised
• Technical metadata produced or transformed
• Consistency checks applied:
    - For transformation process
    - Against original system
    - Against published information
    - Internal cross-checks
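The slide does not describe how the checks are implemented. As a minimal sketch of one of them, a check against published information might recompute a total from the preserved rows and compare it with a figure taken from a published report. The function name, the row layout and the figures below are all hypothetical.

```python
from typing import Dict, List


def check_against_published(rows: List[Dict[str, int]],
                            column: str,
                            published_total: int) -> Dict[str, object]:
    """Compare a total computed from preserved rows with a published figure.

    Hypothetical helper: NDAD's actual checks are not described in the slides.
    """
    computed = sum(row[column] for row in rows)
    return {
        "computed": computed,
        "published": published_total,
        "matches": computed == published_total,
    }


# Illustrative data only.
rows = [{"pupils": 120}, {"pupils": 80}, {"pupils": 45}]
result = check_against_published(rows, "pupils", 250)
```

A mismatch does not say which side is wrong; it only flags the dataset for closer examination.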

Consequences

• Preservation far removed from creation
• Unlike actively curated systems: preservation and use can take place simultaneously
• Multiple use scenarios - more than views

Where are the problems?

Management

Perfect Preservation Formats?

• DDI: XML-based
    - Good for survey/social science data
    - Not so good for complex relational stuff
    - Likes clean data
• XML representations
    - More flexible
    - Not so good when data is unclean
• As SQL
    - Much metadata or needs another scheme
    - Useless for unclean data

How bad is bad?

• Data out of range is a quality problem, not a preservation problem (e.g. ‘Age’ of 230)
• But…
    - Age = -20?
    - Age = B0 ?
    - Age = Thursday?
• All present problems if ‘Age’ is a positive integer in our preservation schema
• Date = ‘31 Feb 2007’ is syntactically but not semantically valid
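The distinction between syntactic and semantic validity in the last bullet can be illustrated with a small sketch. It assumes dates are stored as `D Mon YYYY` with English three-letter month names, which is a guess based on the slide's example; `classify_date` is a hypothetical name.

```python
import re
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Syntax only: one or two digits, a month abbreviation, a four-digit year.
DATE_PATTERN = re.compile(r"(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})")


def classify_date(raw: str) -> str:
    """Return 'valid', 'semantic-error' or 'syntax-error' for a raw date string."""
    match = DATE_PATTERN.fullmatch(raw)
    if match is None:
        return "syntax-error"
    day = int(match.group(1))
    month = MONTHS.index(match.group(2)) + 1
    year = int(match.group(3))
    try:
        date(year, month, day)  # raises ValueError for impossible calendar dates
    except ValueError:
        return "semantic-error"
    return "valid"
```

Here `classify_date("31 Feb 2007")` yields `"semantic-error"`, while a value such as `"Thursday"` fails at the syntax stage. In both cases the raw string is untouched, so it can still be preserved as found.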

More bad stuff

• Absent key fields or mandatory fields
• Encoded data that uses bad codes
    - If days of week are 1 - 7, what is day 9? Day X ?
• ‘Encoded’ data which is stored translated
• 1 - 1 mappings that aren’t
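Two of these problems lend themselves to a short sketch: decoding that keeps a bad code rather than discarding it, and a test of whether a supposedly 1 - 1 mapping really is one. The code table (including which day is numbered 1), the function names and the sample pairs are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical code table: days of the week stored as 1 - 7.
DAY_NAMES = {"1": "Monday", "2": "Tuesday", "3": "Wednesday", "4": "Thursday",
             "5": "Friday", "6": "Saturday", "7": "Sunday"}


def decode(raw: str, table: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return (raw code, label); the label is None when the code is unknown."""
    return raw, table.get(raw)


def one_to_one_violations(pairs: Iterable[Tuple[str, str]]):
    """Find codes with several labels and labels with several codes."""
    by_code: Dict[str, set] = defaultdict(set)
    by_label: Dict[str, set] = defaultdict(set)
    for code, label in pairs:
        by_code[code].add(label)
        by_label[label].add(code)
    codes: Dict[str, List[str]] = {
        c: sorted(ls) for c, ls in by_code.items() if len(ls) > 1}
    labels: Dict[str, List[str]] = {
        l: sorted(cs) for l, cs in by_label.items() if len(cs) > 1}
    return codes, labels
```

With this table, `decode("9", DAY_NAMES)` and `decode("X", DAY_NAMES)` both keep the raw code and report no label, so the error survives into the preserved form.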

What’s the problem?

• Must preserve errors - their nature is informative
• Would like to understand original system behaviour with these errors
• Don’t want to use tools that force all fields to be text
• Want a datatype like ‘almost always integer’ or ‘often a date’ - and intelligent behaviour when it isn’t.
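One possible reading of an ‘almost always integer’ datatype is a value that always keeps the raw text and carries a parsed integer only when parsing succeeds. The sketch below is an illustration under that assumption, not a description of NDAD's software; `MostlyInt` and `mean_of_valid` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MostlyInt:
    """An 'almost always integer' cell: raw text is kept, the integer is optional."""
    raw: str
    value: Optional[int]

    @classmethod
    def parse(cls, raw: str) -> "MostlyInt":
        try:
            return cls(raw, int(raw.strip()))
        except ValueError:
            # Not an integer: keep the original text and record no value.
            return cls(raw, None)

    @property
    def is_valid(self) -> bool:
        return self.value is not None


def mean_of_valid(cells: Iterable[MostlyInt]) -> Optional[float]:
    """Average the cells that parsed; skip, but do not discard, the rest."""
    values = [c.value for c in cells if c.value is not None]
    return sum(values) / len(values) if values else None


ages = [MostlyInt.parse(s) for s in ["42", "-20", "B0", "Thursday"]]
```

Numeric operations see only the values that parsed, while `raw` still holds ‘B0’ and ‘Thursday’ exactly as found. Range problems such as -20 remain visible as integers and can be handled as a quality question.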

How does it get that way?

• Data validation often in application, not database
    - Isn’t always well-implemented
• People hack around the application
• Past migrations were poor

Missing and absent values

• Common occurrence in survey and experimental data
• Different types of ‘missing’:
    - No information
    - Known to be unreadable
    - Refused to answer
    - Subject didn’t know
• All mechanisms for representation ad-hoc
• Knowledge in application, not database
• Query engines don’t understand concept
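As an illustration of making the kinds of ‘missing’ explicit instead of ad-hoc, the four types listed above could be modelled as an enumeration that sits alongside real values. This is a sketch only; the names `Missing` and `summarise` and the sample answers are assumptions.

```python
from collections import Counter
from enum import Enum
from typing import List, Tuple, Union


class Missing(Enum):
    """The four kinds of 'missing' named on the slide."""
    NO_INFORMATION = "no information"
    UNREADABLE = "known to be unreadable"
    REFUSED = "refused to answer"
    DID_NOT_KNOW = "subject didn't know"


Answer = Union[int, Missing]


def summarise(answers: List[Answer]) -> Tuple[List[int], Counter]:
    """Separate present values from missing ones, counting each missing type."""
    present = [a for a in answers if not isinstance(a, Missing)]
    missing = Counter(a for a in answers if isinstance(a, Missing))
    return present, missing


# Illustrative survey column.
answers: List[Answer] = [34, Missing.REFUSED, 51, Missing.DID_NOT_KNOW, Missing.REFUSED]
present, missing = summarise(answers)
```

A single SQL-style NULL would collapse all four cases into one. Keeping them distinct lets a query report how many respondents refused and how many did not know.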

Access: restricted viewing

People
ID     Name    Fname   Office
10246  Ashley  Kevin   179
10579  Mouse   Mickey  188

Trips
ID     REG      Date       From        To
10246  X111ABC  1 Oct 98   Clapham     Deptford
10579  H179JKL  1 Apr 99   Land’s End  John O’Groats
56999  A217HGB  23 Dec 97  Poole       Sandy

Vehicles
REG      Make     Year  Colour
X111ABC  Yugo     1999  Grey
H179JKL  Trabant  1957  Brown

Not available until 2050
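The transcript does not show which part of the example is closed until 2050. Purely for illustration, the sketch below assumes that the Name and Fname columns of People are closed, and shows one way a viewer might mask closed columns while leaving the rest of the row usable. `public_view` and its parameters are hypothetical.

```python
from typing import Any, Dict, List


def public_view(rows: List[Dict[str, Any]],
                closed_until: Dict[str, int],
                current_year: int,
                placeholder: str = "[closed]") -> List[Dict[str, Any]]:
    """Return copies of rows with still-closed columns replaced by a placeholder."""
    view = []
    for row in rows:
        shown = {}
        for column, value in row.items():
            release_year = closed_until.get(column)
            if release_year is not None and current_year < release_year:
                shown[column] = placeholder  # closed: hide the value, keep the column
            else:
                shown[column] = value
        view.append(shown)
    return view


people = [
    {"ID": 10246, "Name": "Ashley", "Fname": "Kevin", "Office": 179},
    {"ID": 10579, "Name": "Mouse", "Fname": "Mickey", "Office": 188},
]
# Assumption: which columns are closed is not stated on the slide.
closure = {"Name": 2050, "Fname": 2050}
view_2007 = public_view(people, closure, 2007)
```

Because ID is left open in this sketch, joins to Trips and Vehicles still work in the restricted view, and the placeholder tells the user that a value exists but is withheld.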

Access - goal

• Duplicate original system
• Advanced analysis tools
• Simple viewing via a generic tool
• Multimedia datatypes
• Extensible via object-like design
• Traditional database systems not up to task without significant additional effort
• Hence much software home-grown

New issues from temporal GIS

• Temporal GIS allows one system to represent changing features and knowledge
• Queries like:
    - Which features are newer than feature X?
    - What did area Y look like 10 years ago?
    - What present-day names correspond to ‘Hetfelle’?
• In a preserved temporal GIS:
    - What would the answer to question 2 have been if I asked it 5 years ago?
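The last question involves two time axes: when a state of the world held, and when the system learned about it. A minimal sketch of that idea follows, assuming each assertion records both; the `Assertion` fields, the `state_at` function and the sample data are invented for illustration and do not describe any particular GIS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Assertion:
    feature: str
    land_use: str
    valid_from: int           # year the described state began
    valid_to: Optional[int]   # year it ended (exclusive); None means still current
    recorded: int             # year this assertion entered the system


def state_at(assertions: List[Assertion], feature: str,
             valid_year: int, known_year: int) -> Optional[str]:
    """What the system believed in known_year about the feature in valid_year."""
    candidates = [
        a for a in assertions
        if a.feature == feature
        and a.recorded <= known_year
        and a.valid_from <= valid_year
        and (a.valid_to is None or valid_year < a.valid_to)
    ]
    if not candidates:
        return None
    # Simplification: the most recently recorded assertion wins.
    return max(candidates, key=lambda a: a.recorded).land_use


# Hypothetical history for area Y.
history = [
    Assertion("Y", "farmland", 1900, None, recorded=1995),
    Assertion("Y", "woodland", 1990, None, recorded=2004),  # later correction
]
asked_now = state_at(history, "Y", valid_year=1997, known_year=2007)
asked_5_years_ago = state_at(history, "Y", valid_year=1992, known_year=2002)
```

Asked in 2007, the system answers with the corrected assertion; asked as of 2002, it answers with what was known then. Answering the second form requires that superseded assertions are preserved rather than overwritten.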

Inconsistencies and errors

• Schools census - 4 datasets per year for different school types
• But 1976 only has 3 - no nursery schools
• Further examination shows files have been merged
• Confirmation came from completed census forms held by schools - not by government department
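An internal cross-check of this kind can be sketched as a comparison of the dataset types held for each year against the expected set. Only ‘nursery’ is named on the slide; the other three type names, the function name and the holdings below are assumptions for illustration.

```python
from typing import Dict, Iterable, List

# Assumption: only 'nursery' appears in the source; the other names are invented.
EXPECTED_TYPES = frozenset({"nursery", "primary", "secondary", "special"})


def missing_datasets(holdings: Dict[int, Iterable[str]],
                     expected: frozenset = EXPECTED_TYPES) -> Dict[int, List[str]]:
    """Map each year to the expected dataset types that are not held for it."""
    gaps = {}
    for year, types in sorted(holdings.items()):
        absent = sorted(expected - set(types))
        if absent:
            gaps[year] = absent
    return gaps


holdings = {
    1975: ["nursery", "primary", "secondary", "special"],
    1976: ["primary", "secondary", "special"],
}
gaps = missing_datasets(holdings)
```

The check only shows that a dataset is absent. As the slide notes, it took further examination to establish that the files had been merged rather than lost.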

Cornell’s DP model