Auditing PLNs: Preliminary Results and Next Steps
Prepared for PLN 2012
UNC, Chapel Hill, October 2012
Micah Altman, Director of Research, MIT Libraries
Non-Resident Senior Fellow, The Brookings Institution
Jonathan Crabtree, Assistant Director of Computing and Archival Research
HW Odum Institute for Research in Social Science, UNC
Collaborators*
• Nancy McGovern
• Tom Lipkis & the LOCKSS Team
Research Support
Thanks to the Library of Congress, the National Science Foundation, IMLS, the Sloan Foundation, the Harvard University Library, the Institute for Quantitative Social Science, and the Massachusetts Institute of Technology.
Auditing PLNs
* And co-conspirators
Update
Related Work
Reprints available from: micahaltman.com
• M. Altman, J. Crabtree, “Using the SafeArchive System: TRAC-Based Auditing of LOCKSS”, Proceedings of Archiving 2011, Society for Imaging Science and Technology.
• Altman, M., Beecher, B., & Crabtree, J. (2009). A Prototype Platform for Policy-Based Archival Replication. Against the Grain, 21(2), 44-47.
– Round 0: Setting up the Data-PASS PLN
– Round 1: Self-Audit
– Round 2: Compliance (almost)
– Round 3: Auditing Other Networks
• What’s next?
Why audit?
Short Answer: Why the heck not?
“Don’t believe in anything you hear, and only half of what you see”
- Lou Reed
“Trust, but verify.”
- Ronald Reagan
Slightly Long Answer: Things Go Wrong
• Insider & external attacks
• Physical & hardware failures
• Software failures
• Media failures
• Curatorial error
• Organizational failure
Full Answer: It’s our responsibility
OAIS Model Responsibilities
• Accept appropriate information from Information Producers.
• Obtain sufficient control of the information to ensure long-term preservation.
• Determine which groups should become the Designated Community (DC) able to understand the information.
• Ensure that the preserved information is independently understandable to the DC.
• Ensure that the information can be preserved against all reasonable contingencies.
• Ensure that the information can be disseminated as authenticated copies of the original, or as traceable back to the original.
• Make the preserved data available to the DC.
OAIS Basic Implied Trust Model
• Organization is axiomatically trusted to identify designated communities
• Organization is engineered with the goals of:
– Collecting appropriate, authentic documents
– Reliably delivering authentic documents, in understandable form, at a future time
• Success depends upon:
– Reliability of storage systems: e.g., the LOCKSS network, Amazon Glacier
– Reliability of organizations: e.g., MetaArchive, Data-PASS, Digital Preservation Network
– Document contents and properties: formats, metadata, semantics, provenance, authenticity
Reflections on OAIS Trust Model
• A specific bundle of trusted properties
• Complete neither instrumentally nor ultimately
Trust Engineering Approaches
• Incentive-based approaches: rewards, penalties, incentive-compatible mechanisms
• Modeling and analysis: statistical quality control & reliability estimation, threat modeling and vulnerability assessment
• Portfolio theory
• Integrating information required:
– Heuristics for lagged information
– Heuristics for incomplete information
– Heuristics for aggregated information
• Comparing the map to policy required: a “mere matter of implementation”
• Adding a replica: uh-oh, most policies failed, and adding replicas wasn’t going to resolve most issues
Theory
[Flowchart: gather information from each replica → integrate information into a map of network state → compare the current state map to policy → if state == policy, success; otherwise add a replica and repeat]
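The loop in the flowchart above can be sketched in a few lines of Python. All of the names here (`get_status`, the report fields, `min_replicas`) are illustrative assumptions, not the actual SafeArchive or LOCKSS API:

```python
# Hypothetical sketch of the audit loop in the flowchart above.
# Field and function names are invented for illustration.

def audit_network(replicas, policy):
    """One audit pass: gather per-replica state, build a map of the
    network, and compare it to the replication policy."""
    # Gather information from each replica (e.g., a box's status report)
    reports = [r.get_status() for r in replicas]

    # Integrate information -> map of network state: for each archival
    # unit (AU), which boxes hold a copy?
    state = {}
    for report in reports:
        for au in report["aus"]:
            state.setdefault(au, set()).add(report["box"])

    # Compare the current state map to policy (minimum replicas per AU)
    return {au for au, boxes in state.items()
            if len(boxes) < policy["min_replicas"]}

def audit_until_compliant(replicas, policy, add_replica):
    """Repeat the audit pass, adding a replica whenever state != policy."""
    while True:
        failing = audit_network(replicas, policy)
        if not failing:
            return "success"
        replicas.append(add_replica(failing))
```

As the later slides show, the "gather" and "integrate" steps are where the practical difficulty (lagged, incomplete, and aggregated information) actually lives.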
Theory vs. Practice Round 2: Compliance (almost)
“How do you spell ‘backup’?”
“R-E-C-O-V-E-R-Y”
Practice (and adjustment) makes perfekt?
• Timings (e.g., crawls, polls)
– Understand
– Tune
– Parameterize heuristics, reporting
– Track trends over time
• Collections
– Change partitioning to AUs at source
– Extend mapping to AUs in plugin
– Extend reporting/policy framework to group AUs
• Diagnostics
– When things go wrong, information to inform adjustment
Theory vs. Practice
Round 3: Auditing Other PLNs
“In theory, theory and practice are the same – in practice, they differ.”
Theory
[Flowchart: gather information from each replica → integrate information into a map of network state → compare the current network to policy → if state == policy, success; otherwise, if AU sizes and polling intervals have not yet been adjusted, adjust them; if they have, add a replica; then repeat]
Practice (Year 3)
• 100% of what?
• Diagnostic inference
Theory
[Flowchart repeated: gather information from each replica → integrate information into a map of network state → compare the current network to policy → if state == policy, success; otherwise adjust AU sizes and polling intervals, or add a replica, and repeat]
100% of what?
• No: of LOCKSS boxes?
• No: of AUs?
• Almost: of policy overall
• Yes: of policy for a specific collection
• Maybe: of files?
• Maybe: of bits in a file?
What you see
Boxes X, Y, and Z all agree on AU A.
What you can conclude:
• Boxes X, Y, and Z have the same content
• The content is good
Assumption: failures on file harvest are independent, and the number of harvested files is large.
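The independence assumption is what gives agreement its evidential force. A back-of-the-envelope sketch, with an illustrative failure rate not taken from the talk:

```python
# Why agreement across boxes supports "content is good" *if* harvest
# failures are independent: for agreement on bad content, every
# replica must have failed, and failed identically.

def p_all_bad(p_fail, n_boxes):
    """Probability that all n replicas hold a bad copy of a given file,
    assuming each replica fails independently with probability p_fail."""
    return p_fail ** n_boxes

# With a hypothetical 1% per-replica failure rate, three agreeing boxes
# can all be wrong with probability on the order of 10**-6 at most --
# and even then the failures would have to produce identical bytes.
small = p_all_bad(0.01, 3)
```

Correlated failures (a shared bad plugin, a common harvest bug) break this arithmetic, which is one reason disagreement diagnosis matters.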
What you see
Boxes X, Y, and Z don’t agree.
What can you conclude?
Hypothesis 1: Disagreement is real, but doesn’t really matter.
Non-substantive AU differences, arising from dynamic elements in AUs that have no bearing on the substantive content:
1.1 Individual URLs/files that are dynamic and non-substantive (e.g., logo images, plugins, Twitter feeds) cause content changes (this is common in the GLN).
1.2 Dynamic content embedded in substantive content (e.g., a customized per-client header page embedded in the PDF of a journal article).
Hypothesis 2: Disagreement is real, but doesn’t really matter in the longer run (even if the disagreement persists over the long run!)
2.1 Temporary AU differences: versions of objects temporarily out of sync (e.g., if harvest frequency << source update frequency, but harvest times across boxes vary significantly).
2.2 Objects temporarily missing (e.g., recently added objects are picked up by some replicas but not by others).
Hypothesis 3: Disagreement is real, and it matters.
Substantive AU differences:
3.1 Content corruption (e.g., from corruption in storage, or during transmission/harvesting).
3.2 Objects persistently missing from some replicas (e.g., because of a permissions issue at the provider, technical failures during harvest, or plugin problems).
3.3 Versions of objects persistently missing or out of sync on some replicas (e.g., harvest frequency > source update frequency, leading to different boxes harvesting different versions of the content).
Note that later “agreement” signifies that a particular version was verified, not that all versions have been replicated and verified.
Hypothesis 4: AUs really do agree, but we think they don’t.
4.1 Appearance of disagreement caused by incomplete diagnostic information: poll data are missing as a result of system reboots, daemon updates, or other causes.
4.2 Poll data are lagging, i.e., from different periods: polls fail but contain information about agreement that is ignored.
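Taken together, the four hypotheses suggest a triage order: rule out diagnostic artifacts first, then known-dynamic content, then recent churn, and only then treat a disagreement as substantive. A sketch, with entirely hypothetical poll-record fields (real LOCKSS poll results are structured differently):

```python
# Hypothetical triage of an AU disagreement into the four hypothesis
# families above. The poll-record fields are invented for illustration.

def triage_disagreement(poll):
    """Classify one disagreement report; assumes disagreeing_urls is
    non-empty when hypotheses 1-3 are reached."""
    # Hypothesis 4: apparent disagreement from missing/lagging poll data
    if poll.get("data_missing") or poll.get("stale"):
        return "H4: diagnostic artifact"
    # Hypothesis 1: only known-dynamic, non-substantive URLs differ
    if all(u in poll["known_dynamic_urls"] for u in poll["disagreeing_urls"]):
        return "H1: non-substantive difference"
    # Hypothesis 2: differences confined to recently changed/added objects
    if all(u in poll["recently_changed_urls"] for u in poll["disagreeing_urls"]):
        return "H2: temporary difference"
    # Hypothesis 3: everything else is treated as substantive
    return "H3: substantive difference (corruption or persistent loss)"
```

The point of the design challenge below is that the network must be instrumented so the inputs to a procedure like this are actually observable.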
Design Challenge
• Create more sophisticated auditing algorithms, and
• Instrument PLN data collection,
such that observed behavior allows us to distinguish between hypotheses 1–4.
Approaches to Design Challenge
[Tom Lipkis’s Talk]
What’s Next?
“It’s tough to make predictions, especially about the future.”
– Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disraeli, Freeman Dyson, Cecil B. DeMille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, and others
Short Term
• Complete round 3 data collection
• Refinements of current auditing algorithms
– More tunable parameters (yeah?!)
– Better documentation
– Simple health metrics
• Reports and dissemination
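One way the “simple health metrics” item above might look: a single number in [0, 1] pinned to the worst-off AU, so a report can surface the network’s weakest point at a glance. This is our illustration of the idea, not SafeArchive’s actual metric:

```python
# A hypothetical "simple health metric" for a replication network:
# for each AU, the ratio of verified replicas to the replicas the
# policy requires, floored at the worst-off AU.

def network_health(verified_replicas, required):
    """verified_replicas: dict mapping AU -> number of boxes holding a
    verified copy. required: replicas the policy demands for every AU.
    Returns a score in [0, 1]; 1.0 means fully policy-compliant."""
    if not verified_replicas:
        return 1.0  # vacuously healthy: nothing to protect yet
    return min(min(n / required, 1.0)
               for n in verified_replicas.values())
```

A min-based score is deliberately pessimistic: one under-replicated AU drags the whole number down, which is the behavior you want from an alarm.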
Longer Term
• Health metrics, diagnostics, decision support
• Additional audit standards
• Support additional replication networks
• Audit other policy sets
Bibliography (Selected)
• B. Schneier, 2012. Liars and Outliers. John Wiley & Sons.
• H.M. Gladney, J.L. Bennett, 2003. “What Do We Mean by Authentic?”, D-Lib Magazine 9(7/8).
• K. Thompson, 1984. “Reflections on Trusting Trust”, Communications of the ACM 27(8), pp. 761-763.
• D.S.H. Rosenthal, T.S. Robertson, T. Lipkis, V. Reich, S. Morabito, 2005. “Requirements for Digital Preservation Systems: A Bottom-Up Approach”, D-Lib Magazine 11(11).
• CCSDS, 2002. Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1, Blue Book.