National Digital Repository®
Preserving the imperfect: reflections from NDAD and elsewhere
Kevin Ashley
Head of Digital Archives Group
ULCC
2007-03-23 Presdb07 - Edinburgh 2
Overview
• Issues that arise when databases are records
• Informing (expensive, important) decisions
• Tensions between ideal formats and non-ideal data
• Representation mechanisms for access control and absent data
• Concentrating on R&D issues
What is NDAD?
• A service for UK government records which exist as ‘structured information’
• Contains data + contextual information
• Established in 1997 - service in March 1998
• First service by a national archive to provide online public access to preserved material
• Selection undertaken by National Archives and government departments
• Everything else at ULCC: under contract to TNA
Preservation
• Data transformed to canonical form - originals kept
• Paper documentation digitised
• Technical metadata produced or transformed
• Consistency checks applied:
    - For transformation process
    - Against original system
    - Against published information
    - Internal cross-checks
Consequences
• Preservation far removed from creation
• Unlike actively curated systems: preservation and use can take place simultaneously
• Multiple use scenarios - more than views
Perfect Preservation Formats?
• DDI: XML-based
    - Good for survey/social science data
    - Not so good for complex relational stuff
    - Likes clean data
• XML representations
    - More flexible
    - Not so good when data is unclean
• As SQL
    - Much metadata or needs another scheme
    - Useless for unclean data
How bad is bad?
• Data out of range is a quality problem, not a preservation problem (e.g. ‘Age’ of 230)
• But…
    - Age = -20?
    - Age = B0 ?
    - Age = Thursday?
• All present problems if ‘Age’ is a positive integer in our preservation schema
• Date = ‘31 Feb 2007’ is syntactically but not semantically valid
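The distinction drawn on this slide can be sketched in code. The following is a minimal illustration, not NDAD's actual tooling: the names `Verdict`, `classify_age` and `classify_date`, the plausibility ceiling of 130 and the `DD Mon YYYY` date pattern are all assumptions made for the example.

```python
import re
from datetime import date
from enum import Enum

# Assumed English three-letter month abbreviations, as in '31 Feb 2007'.
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


class Verdict(Enum):
    OK = "fits the schema and is plausible"
    QUALITY = "fits the schema but is implausible"   # quality problem only
    SCHEMA = "cannot be stored under the schema"     # preservation problem


def classify_age(raw: str, plausible_max: int = 130) -> Verdict:
    """Classify a raw 'Age' value against a positive-integer schema.

    plausible_max is a hypothetical threshold chosen for illustration.
    """
    try:
        value = int(raw.strip())
    except ValueError:
        return Verdict.SCHEMA        # e.g. 'B0', 'Thursday'
    if value <= 0:
        return Verdict.SCHEMA        # -20 is an integer, but not a positive one
    if value > plausible_max:
        return Verdict.QUALITY       # 230 can be stored, just not believed
    return Verdict.OK


def classify_date(raw: str) -> str:
    """Separate syntactic validity (shape) from semantic validity (a real day)."""
    match = re.fullmatch(r"(\d{1,2}) ([A-Za-z]{3}) (\d{4})", raw.strip(),
                         flags=re.ASCII)
    if match is None or match.group(2) not in MONTHS:
        return "syntactically invalid"
    day, month, year = int(match.group(1)), MONTHS[match.group(2)], int(match.group(3))
    try:
        date(year, month, day)       # raises ValueError for impossible dates
    except ValueError:
        return "syntactically valid, semantically invalid"
    return "valid"
```

Under these assumptions, `classify_age("230")` reports a quality problem, while `"-20"`, `"B0"` and `"Thursday"` all fail the schema itself, and `classify_date("31 Feb 2007")` is well-formed but names no real day.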
More bad stuff
• Absent key fields or mandatory fields
• Encoded data that uses bad codes
    - If days of week are 1 - 7, what is day 9? Day X ?
• ‘Encoded’ data which is stored translated
• 1 - 1 mappings that aren’t
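Two of these problems are easy to detect mechanically, even if deciding what to do about them is not. The sketch below is illustrative only: the codebook contents and the function names `undecodable` and `mapping_violations` are assumptions, not part of any described system.

```python
from collections import defaultdict

# Hypothetical codebook: days of the week encoded as '1'..'7'.
DAY_CODES = {"1": "Monday", "2": "Tuesday", "3": "Wednesday", "4": "Thursday",
             "5": "Friday", "6": "Saturday", "7": "Sunday"}


def undecodable(values, codebook):
    """Return the distinct stored values that the codebook cannot decode."""
    return sorted({v for v in values if v not in codebook})


def mapping_violations(pairs):
    """Check whether observed (code, label) pairs really form a 1 - 1 mapping.

    Returns two dicts: codes seen with more than one label, and labels
    seen with more than one code. Both are empty for a true 1 - 1 mapping.
    """
    labels_by_code = defaultdict(set)
    codes_by_label = defaultdict(set)
    for code, label in pairs:
        labels_by_code[code].add(label)
        codes_by_label[label].add(code)
    many_labels = {c: sorted(ls) for c, ls in labels_by_code.items() if len(ls) > 1}
    many_codes = {l: sorted(cs) for l, cs in codes_by_label.items() if len(cs) > 1}
    return many_labels, many_codes
```

For example, `undecodable(["1", "9", "X", "7"], DAY_CODES)` flags `"9"` and `"X"`. The values are reported rather than dropped, so the errors themselves remain available for preservation.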
What’s the problem?
• Must preserve errors - their nature is informative
• Would like to understand original system behaviour with these errors
• Don’t want to use tools that force all fields to be text
• Want a datatype like ‘almost always integer’ or ‘often a date’ - and intelligent behaviour when it isn’t.
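One possible reading of an ‘almost always integer’ datatype is a cell that always keeps the original text but exposes an integer whenever one can be read from it. The sketch below is an assumption about how such a type might look; `MostlyInt` and `summarise` are hypothetical names, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MostlyInt:
    """A cell that is usually an integer but never discards the raw text."""
    raw: str

    @property
    def value(self) -> Optional[int]:
        """The integer reading of the cell, or None when it is not an integer."""
        try:
            return int(self.raw.strip())
        except ValueError:
            return None

    @property
    def conforms(self) -> bool:
        return self.value is not None


def summarise(column: Iterable[MostlyInt]) -> dict:
    """Aggregate over conforming cells and report the exceptions verbatim."""
    cells = list(column)
    numbers = [c.value for c in cells if c.conforms]
    return {
        "count": len(cells),
        "conforming": len(numbers),
        "mean": sum(numbers) / len(numbers) if numbers else None,
        "exceptions": [c.raw for c in cells if not c.conforms],
    }
```

With a column built from `["34", "B0", "50", "Thursday"]`, the summary averages the two integers and lists `"B0"` and `"Thursday"` as exceptions. Numeric behaviour is kept where it applies, and no field has to be forced to text.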
How does it get that way?
• Data validation often in application, not database
    - Isn’t always well-implemented
• People hack around the application
• Past migrations were poor
Missing and absent values
• Common occurrence in survey and experimental data
• Different types of ‘missing’:
    - No information
    - Known to be unreadable
    - Refused to answer
    - Subject didn’t know
• All mechanisms for representation ad-hoc
• Knowledge in application, not database
• Query engines don’t understand concept
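To make the point concrete, here is a small sketch in which the reason for absence is a first-class value rather than a magic number such as -9 or a bare null. The `Missing` enum and the helper functions are hypothetical names introduced for illustration; they mirror the four kinds of ‘missing’ listed above.

```python
from collections import Counter
from enum import Enum
from typing import List, Union


class Missing(Enum):
    """The kinds of 'missing' listed on the slide, kept distinct."""
    NO_INFORMATION = "no information"
    UNREADABLE = "known to be unreadable"
    REFUSED = "refused to answer"
    DID_NOT_KNOW = "subject didn't know"


Answer = Union[int, Missing]


def known(values: List[Answer]) -> List[int]:
    """Only the values that were actually supplied."""
    return [v for v in values if not isinstance(v, Missing)]


def missing_breakdown(values: List[Answer]) -> Counter:
    """How many values are absent, and for which reason."""
    return Counter(v for v in values if isinstance(v, Missing))
```

A query layer built on such a representation could answer both ‘what is the mean of the known answers?’ and ‘how many respondents refused?’ without the knowledge living only in application code.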
Access: restricted viewing

People
ID     Name    Fname   Office
10246  Ashley  Kevin   179
10579  Mouse   Mickey  188

Trips
ID     REG      Date       From        To
10246  X111ABC  1 Oct 98   Clapham     Deptford
10579  H179JKL  1 Apr 99   Land’s End  John O’Groats
56999  A217HGB  23 Dec 97  Poole       Sandy

Vehicles
REG      Make     Year  Colour
X111ABC  Yugo     1999  Grey
H179JKL  Trabant  1957  Brown

Not available until 2050
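A restricted view over related tables can be sketched as a per-column closure rule applied at read time. The slide does not say which part of the data is closed until 2050, so the choice of the People name columns below is purely an assumption, as are the names `CLOSED_UNTIL` and `public_view`.

```python
# Rows taken from the People table on the slide.
PEOPLE = [
    {"ID": 10246, "Name": "Ashley", "Fname": "Kevin", "Office": 179},
    {"ID": 10579, "Name": "Mouse", "Fname": "Mickey", "Office": 188},
]

# Hypothetical closure rules: (table, column) -> first year of public access.
# Which columns are closed is an assumption; the slide only gives the year.
CLOSED_UNTIL = {
    ("People", "Name"): 2050,
    ("People", "Fname"): 2050,
}


def public_view(table, rows, year):
    """Return copies of the rows with closed columns replaced by a marker."""
    view = []
    for row in rows:
        shown = {}
        for column, value in row.items():
            release = CLOSED_UNTIL.get((table, column))
            if release is not None and year < release:
                shown[column] = f"[closed until {release}]"
            else:
                shown[column] = value
        view.append(shown)
    return view
```

Because the preserved rows are never altered, the same data yields a masked view in 2007 and a full view from 2050 onwards. Keys such as `ID` stay visible in this sketch so that joins to Trips and Vehicles still work, which is itself a design decision an archive would need to review.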
Access - goal
• Duplicate original system
• Advanced analysis tools
• Simple viewing via a generic tool
• Multimedia datatypes
• Extensible via object-like design
• Traditional database systems not up to task without significant additional effort
• Hence much software home-grown
New issues from temporal GIS
• Temporal GIS allows one system to represent changing features and knowledge
• Queries like:
    - Which features are newer than feature X?
    - What did area Y look like 10 years ago?
    - What present-day names correspond to ‘Hetfelle’?
• In a preserved temporal GIS:
    - What would the answer to question 2 have been if I asked it 5 years ago?
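The last question implies two time axes: when something was true in the world, and when the system knew it. A minimal sketch of that idea follows. The `Fact` record, the `answer` function and the sample history of area Y are all invented for illustration and do not describe any particular GIS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    feature: str
    description: str
    valid_from: int                    # year this state of the world began
    valid_to: Optional[int]            # year it ended (None = still true)
    recorded: int                      # year the system learned this
    superseded: Optional[int] = None   # year the system replaced this record


def answer(facts: List[Fact], feature: str, world_year: int, asked_in: int) -> List[str]:
    """What did `feature` look like in `world_year`, as known in `asked_in`?"""
    return [
        f.description
        for f in facts
        if f.feature == feature
        and f.valid_from <= world_year
        and (f.valid_to is None or world_year < f.valid_to)
        and f.recorded <= asked_in
        and (f.superseded is None or asked_in < f.superseded)
    ]


# Invented history: until a 2004 resurvey the system believed area Y was
# still farmland; the resurvey recorded that housing replaced it in 1995.
FACTS = [
    Fact("Y", "farmland", 1950, None, recorded=1990, superseded=2004),
    Fact("Y", "farmland", 1950, 1995, recorded=2004),
    Fact("Y", "housing estate", 1995, None, recorded=2004),
]
```

Under this invented history, asking in 2007 about 1997 gives the housing estate, while the same question posed against the system's knowledge in 2002 gives farmland. A preserved temporal GIS has to retain superseded records to reproduce the second answer.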
Inconsistencies and errors
• Schools census - 4 datasets per year for different school types
• But 1976 only has 3 - no nursery schools
• Further examination shows files have been merged
• Confirmation came from completed census forms held by schools - not by government department
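The first symptom in this example, a year with fewer datasets than expected, is the kind of thing an internal cross-check can flag automatically. In the sketch below only ‘nursery’ comes from the slide; the other three school-type names, the holdings and the function name `missing_types` are assumptions for illustration.

```python
# Hypothetical set of school types; only 'nursery' is named in the source.
EXPECTED_TYPES = {"nursery", "primary", "secondary", "special"}


def missing_types(holdings):
    """Map each year to the expected school types it has no dataset for.

    `holdings` maps a year to the school types actually held for that year.
    Years with a complete set are omitted from the result.
    """
    gaps = {}
    for year, types in holdings.items():
        absent = EXPECTED_TYPES - set(types)
        if absent:
            gaps[year] = sorted(absent)
    return gaps
```

A check like this only raises the question. As the slide notes, establishing that the files had been merged, rather than that the data was never collected, took further examination and evidence from outside the department.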