ECHO DEPository Technical Architecture Phase 1 Final Report Report of project activities from Fall 2004 through 2007 University of Illinois at Urbana Champaign in partnership with OCLC Contributors: Matt Cordial, David Dubin, Janet Eke, Joseph Futrelle, Thomas Habing, Leah Houser, Patricia Hswe, William Ingram, Joanne Kaczmarek, Robert Manaster, Joel Plutchak, Beth Sandore, John Unsworth December 2008 Revised July 2009

ECHO DEPository Technical Architecture Phase 1 Final Report
Report of project activities from Fall 2004 through 2007

University of Illinois at Urbana Champaign in partnership with OCLC

Contributors: Matt Cordial, David Dubin, Janet Eke, Joseph Futrelle, Thomas Habing, Leah Houser, Patricia Hswe, William Ingram, Joanne Kaczmarek, Robert Manaster, Joel Plutchak, Beth Sandore, John Unsworth

December 2008
Revised July 2009
ReDraft 2.3 - not finalized

ECHO DEPository Technical Architecture Phase 1 Final Report
Narrative Report
National Digital Information Infrastructure Preservation Program
University of Illinois at Urbana-Champaign with OCLC

Table of Contents

1. Preface
  1.1. About the ECHO DEPository Project (Phase 1)
  1.2. About This Document
  1.3. Review of Project Objectives and Deliverables
    1.3.1. Models and Tools to Support Web Archiving
    1.3.2. Repository Evaluation and Interoperability
    1.3.3. Long-term Semantic Preservation Research

2. Archiving the Web: the Web Archives Workbench
  2.1. Overview
  2.2. The Web Archiving Problem
    2.2.1. The Ubiquitous Web
    2.2.2. Volume and Selection of Web Content
    2.2.3. The Importance of Context
  2.3. The Arizona Model: An Archival Approach to Web Archiving
    2.3.1. Background
    2.3.2. An Archival Approach
    2.3.3. Arizona Model Summary
  2.4. The Web Archives Workbench: Implementing the Arizona Model
    2.4.1. Development Considerations
    2.4.2. Overview of the Web Archives Workbench
    2.4.3. A Tour of the Web Archives Workbench
    2.4.4. Web Archives Workbench Tools Summary
    2.4.5. Behind the Scenes: OCLC’s Technical Implementation of the Web Archives Workbench
  2.5. Findings - User Feedback
    2.5.1. Limited Resources and Limited Time
    2.5.2. Complexity of the Tools
    2.5.3. Web Content Delivery
  2.6. Conclusions and Next Steps

3. Repository Evaluation and Interoperability
  3.1. Repository Evaluation
    3.1.1. Building an Evaluation Framework: Applying the Trusted Digital Repository Checklist to Repository Evaluation
    3.1.2. Repository Testing: Ingest and Export Tests On Four Key Open-source Repositories
    3.1.3. Testing Approach and Methodology
    3.1.4. Repository Testing Findings: Narrative Reports, and Annotated Audit Checklist Commentary
    3.1.5. Conclusion and Next Steps
  3.2. Hub and Spoke Architecture (HandS): Supporting Repository Interoperability and Emerging Preservation Standards
    3.2.1. HandS Overview
    3.2.2. The Need for Interoperability and Preservation Support
    3.2.3. Hub and Spoke Key Principles
    3.2.4. METS Profile
    3.2.5. HandS Workflow Cycle
    3.2.6. HandS Technical Implementation
    3.2.7. Lessons Learned
    3.2.8. Next Steps: the Hub and Spoke
    3.2.9. Conclusion

4. Preserving Meaning, Not Just Objects: Semantics and Digital Preservation
  4.1. Introduction: The Need for a Semantics of Preservation Approach
    4.1.1. The Preservation Semantics Problem
    4.1.2. Our Goal
  4.2. The Problems: Understanding Semantic Preservation
    4.2.1. Problems Posed by Descriptive Practice and Structures
    4.2.2. Understanding the Semantic Preservation Problem: Summary
  4.3. Toward More Capable Archives and Repositories
    4.3.1. Recap: The need for automated inference capability
    4.3.2. BECHAMEL and Building a Metadata Ontology
    4.3.3. Overcoming Semantic Problems in Metadata Encoding: A Resource and Description Vocabulary
    4.3.4. Resolving Semantic Ambiguity: an Inference Example
    4.3.5. Automated Inference as a Preservation Service
  4.4. System Architecture
    4.4.1. Architecture: Overview
  4.5. Lessons Learned and Next Steps
  4.6. Conclusion

5. A Note from the PIs

6. References
  6.1. Archiving the Web: the Web Archives Workbench
    6.1.1. Resources
  6.2. Repository Evaluation and Interoperability
    6.2.1. Repository Evaluation
    6.2.2. HandS Tools Suite
  6.3. Preserving Meaning, Not Just Objects: Semantics and Digital Preservation

7. Appendices
  7.1. Web Archives User Guide
  7.2. Web Archives Workbench Implementation Guide
  7.3. Annotated Trusted Digital Repository Checklist
  7.4. Using the Audit Checklist for the Certification of a Trusted Digital Repository as a Framework for Evaluating Repository Software Applications (DLib article)
  7.5. Repository Testing Findings: Narrative
  7.6. Repository Findings Commentary Using the Annotated Trusted Digital Repository Checklist
  7.7. Resource Description Vocabulary: An Ontology of Metadata Descriptions
  7.8. Sustained Access to Ejournals: Context Value, and Future Prospectus


1. Preface

1.1. About the ECHO DEPository Project (Phase 1)

The ECHO DEPository (Phase 1) is an NDIIPP-partner research and development project at the University of Illinois at Urbana-Champaign (UIUC) in partnership with OCLC; the National Center for Supercomputing Applications (NCSA); the Michigan State University Library; and an alliance of state libraries from Arizona, Connecticut, Illinois, North Carolina, and Wisconsin. Our aim is to support the digital preservation efforts of the Library of Congress by addressing issues of how we collect, manage, preserve, and make useful the enormous amount of digital information our culture is now producing. Phase 1 project activities (Fall 2004 through 2007) included developing web archiving tools, evaluating existing repository software, developing an architecture to enhance existing repositories’ interoperability and preservation features, and modeling next-generation repositories for supporting long-term preservation.

1.2. About This Document

This narrative report provides a detailed overview of each of the areas of work described below. The attached appendices provide specific additional project deliverables. A collected archive of all project deliverables, including posters, presentations, and publications, is forthcoming. Several sections include material contributed by the authors of articles on ECHO DEP projects written for an issue of Library Trends, guest-edited by Patricia Cruse and Beth Sandore, to appear as the Winter 2009 issue (Library Trends, Volume 57, Number 3).

1.3. Review of Project Objectives and Deliverables

1.3.1. Models and Tools to Support Web Archiving

Goals:

• Articulate a methodology for selecting digital materials at the aggregate level based on archival principles, and use provenance, functional analysis, and context analysis to facilitate meta-tagging for retrieval.

• Build a suite of open source software tools that support identification, capture, and description of websites.

Deliverables:


• The Arizona Model, an archival approach to web archiving (developed by the Arizona State Library and Archives)

• The Web Archives Workbench suite of tools (developed by OCLC)

Overview

Traditionally, web archiving methods have focused on either manual or automated capture approaches, both problematic. Manual item-level selection fails due to the enormous number of resources on the web, while fully automated web-capture approaches risk burying substantive materials under a mountain of irrelevant information. To address this fundamental problem, OCLC built a suite of open-source web archiving tools that bridge the gap between manual selection and automated capture. Based on the Arizona Model, which provides for integration of both human and machine processes, the Web Archives Workbench (WAW) comprises four tools to help archivists identify, describe, select, and harvest web-based content for storage in any repository.

Details

See Section 2 of this report, and Appendix items 7.1 and 7.2.

1.3.2. Repository Evaluation and Interoperability

Goals:

• Install, configure, test, and evaluate existing open-source digital repository systems, with particular regard to support for interoperability and emerging preservation standards.

Deliverables:

• A repository evaluation framework based on the 2005 RLG/NARA Audit Checklist for the Certification of a Trusted Digital Repository, Draft for Public Comment, with mapping to current version

• Repository testing findings

• The Hub and Spoke (HandS) tools suite supporting repository interoperability and preservation metadata

• PREMIS-based METS profiles

Overview

To help understand how well existing repository systems support today’s digital preservation efforts, we evaluated four existing open-source repository systems (DSpace, ePrints, Fedora and Greenstone). Evaluation activities included the ingestion and manipulation of half a terabyte of heterogeneous content in each repository system, and the development of a preservation-focused repository evaluation framework based on emerging standards for trusted digital repositories.

Early evaluation findings led to the development of the Hub and Spoke (HandS) Architecture, a proof-of-concept suite of tools to enhance the interoperability and preservation features of the systems tested. The HandS suite supports libraries’ efforts to manage content in multiple repository systems and to retain valuable preservation data. It includes a common standards-based method (a PREMIS-based METS profile) for packaging content, which allows digital objects to be moved into and out of repositories more easily while supporting the collection of technical and provenance information crucial for long-term preservation. This model has potential wide applicability, and is already in use in several real-world archiving projects.

Details

See Section 3 of this report, and Appendix items 7.3, 7.4, 7.5, and 7.6.

1.3.3. Long-term Semantic Preservation Research

Goals:

• Research techniques to migrate the semantic content of documents and document structures across generations of encoding schemes.

Deliverables:

• Articulation of semantic preservation problems posed by current metadata practice

• Development of a formal metadata description ontology

• Demonstration of automated inference using reasoning software

Overview

Current first-generation repository systems preserve the structure of information, not its meaning or semantics. When we move content from one system to another, this structure may be subtly or unsubtly transformed. To meaningfully preserve our digital content over time, we therefore have to infer meaning or semantics from structures that change over time. Because of the volume of information to be preserved, we need to be able to do this with automated tools.

The UIUC Graduate School of Library and Information Science (GSLIS) collaborated with the National Center for Supercomputing Applications (NCSA) to contribute to the development of next-generation archives with semantic analysis capabilities to reduce long-term preservation risks. Using repository technology developed at NCSA and automated reasoning tools developed at GSLIS, we model how semantic inference capability may help next-generation archives head off long-term preservation risks. This work includes articulating semantic preservation problems posed by current practice, and analyzing real-world data migration examples to develop a formal understanding of how descriptive information about archived digital resources is structured. This understanding is presented in a formal metadata description ontology.

Details

See Section 4 of this report, and Appendix item 7.7.


2. Archiving the Web: the Web Archives Workbench

2.1. Overview

A core deliverable of the ECHO DEPository Project's first phase was OCLC's development of the Web Archives Workbench (WAW), an open-source suite of Web archiving tools for identifying, describing, and harvesting Web-based content for ingest into an external digital repository. Released in October 2007, the suite is designed to bridge the gap between manual selection and automated capture based on the "Arizona Model," which applies a traditional aggregate-based archival approach to Web archiving. (By “aggregate-based archiving,” we mean archiving items by group or in series, rather than individually.) Core functionality of the suite includes the ability to identify Web content of potential interest through crawls of "seed" URLs and the domains they link to; tools for creating and managing metadata for association with harvested objects; website structural analysis and visualization to aid human content selection decisions; and packaging using a PREMIS-based METS profile developed by the ECHO DEPository to support easier ingestion into multiple repositories.

The next sections provide an overview of the Web archiving problem; background on the Arizona Model; an overview of how the tools work and their technical implementation; and a brief summary of user feedback from testing and implementing the tools. Appendix items include the Web Archives Workbench User Guide (7.1), which provides detailed screen-by-screen documentation of the tool suite’s functionality. The WAW Implementation Guide is provided in Appendix 7.2.
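The idea of crawling "seed" URLs to discover the domains they link to can be illustrated with a minimal sketch. It is not the Workbench's actual implementation: the `LinkCollector` class, the `linked_domains` function, and the `example.gov` host names are hypothetical, and the page content is supplied inline rather than fetched over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag in one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def linked_domains(seed_url, html):
    """Return the sorted host names that a seed page links to."""
    collector = LinkCollector()
    collector.feed(html)
    domains = set()
    for href in collector.hrefs:
        # Resolve relative links against the seed URL before extracting the host.
        host = urlparse(urljoin(seed_url, href)).hostname
        if host:  # links without a host (e.g. mailto:) are skipped
            domains.add(host)
    return sorted(domains)


# Hypothetical seed page; a real crawl would download this content.
SAMPLE_PAGE = """
<a href="/reports/2006/annual.pdf">Annual report</a>
<a href="http://agency-b.example.gov/index.html">Partner agency</a>
<a href="mailto:info@example.gov">Contact</a>
"""

domains = linked_domains("http://agency-a.example.gov/", SAMPLE_PAGE)
print(domains)
```

In a workflow of this kind, a list such as `domains` would be presented to an archivist, who decides which of the discovered domains are in scope.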

2.2. The Web Archiving Problem

2.2.1. The Ubiquitous Web

For a broad range of organizations, Web sites are now the delivery mechanism of choice for nearly any type of information content. Much of this content is created and disseminated in electronic formats only, with printed (hard) copies considered just a courtesy or convenience. The electronic format environment, while expedient for current access purposes, presents challenges for anyone charged with preserving the information over time. These challenges include the sheer volume of Web-published information, traditional issues of selection and description, as well as the technical challenges associated with long-term preservation of digital objects.


2.2.2. Volume and Selection of Web Content

An immediate challenge of Web archiving is assuring that all content of long-term relevance delivered through the Web is identified and collected (i.e., harvested). Difficulties arise first from the task of selecting pertinent content for preservation from the enormous volume of information streaming from Web servers at any given point in time. Selection decisions will be influenced by the charge of the individual responsible for capturing specific content types (such as a librarian or archivist) based on appraisal, or collection development; on policies created in concert with the mission of the institution or organization; and on the audience, or user community, being served.

The sheer volume of content published on the Web makes a fully manual perusal of online resources infeasible. Volume is still a factor even when Web crawlers (as explained below) are engaged. The dynamic nature of the Web also creates problems for selection and harvesting of content. URLs can change overnight; resources can be taken off-line with little or no notice; and new, related content can be added in new or different directories than those visited previously by a Web crawler harvesting an organization's website. Although Web crawling automates archiving of a website, it is quite possible for Web crawlers simply to miss content because of a “robots exclusion protocol” (activated by the site creator to make parts of a site “uncrawlable”) or because of the impenetrable character of the Deep Web (where content, such as a results page to a Web form, is inaccessible to a Web crawler or Web spider).¹ In addition, the vast measure of the Web renders scalable Web crawling an almost intractable technical challenge. Knowing where to find all content eligible for harvesting according to collection development and appraisal policies becomes nearly impossible without intentional coordination or without Web crawling tools and resources that are designed for, and take account of, the fluid nature of website content and the massive scale of the Web.
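To make the robots exclusion point concrete, the following sketch shows how a well-behaved crawler might consult a site's exclusion rules before fetching a page, using Python's standard `urllib.robotparser` module. The rules, the `example-harvester` agent name, and the URLs are illustrative assumptions; the rules are parsed from an inline string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content published by a site owner.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, agent="example-harvester"):
    """Return True if the exclusion rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


allowed = may_fetch("http://agency.example.gov/reports/annual.html")
blocked = may_fetch("http://agency.example.gov/private/draft.html")
print(allowed, blocked)
```

Any content under a disallowed path is silently absent from a compliant crawl, which is one way a harvest can miss material of long-term value.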

2.2.3. The Importance of Context

Context is about understanding relationships between different and discrete pieces of information. It is about understanding why the information was created, by which individual or organization, and at what point in time. Contextual information can help define the boundaries and the scope of harvested content. As with analog objects, much of the usefulness of the digital objects that make up our cultural record depends on our having descriptive and contextual information about them. Once content is identified and harvested, it is necessary to provide access to the digital object. Such content access means that attention should be paid to capturing accurate metadata along with the content itself. This contextual metadata will help describe the origin (or "provenance") of the resource, as well as why and when it was created. (For example, is the discovered resource one in a series of annual reports from a particular state agency? Is it a single publication summarizing research findings? Or does it encompass results from a specific survey taken as part of a larger effort to revamp community services?) In the case of a digital object, metadata not only supports human interpretation of content; it is also needed to provide crucial technical information for maintaining long-term viability of the object itself.

¹ Web archiving. (2009, March 2). In Wikipedia, The Free Encyclopedia. Retrieved 21:56, March 2, 2009, from http://en.wikipedia.org/w/index.php?title=Web_archiving&oldid=274363238

2.3. The Arizona Model: An Archival Approach to Web Archiving

The Web Archives Workbench tool suite is premised on the principles of the “Arizona Model,” an aggregate-based approach to Web archiving designed to bridge the gap between human selection and automated capture. “Aggregate-based” means that items are organized and archived in groups (series, or aggregates) rather than singly. The Arizona Model was developed in 2003 by Richard Pearce-Moses of the Arizona State Library and Archives.

2.3.1. Background

Most state libraries and archives have mandates to collect state agency publications and make them available to the public. To this end, there are well-established depository systems that have worked with paper publications for many years. In a Web environment, the nuances of determining what a publication is, or who is responsible for selection and collection of particular information resources, become less clear. Nonetheless, to meet these mandates librarians and archivists must still identify, select, acquire, describe, and provide access to state agency information "published" on websites.

In early attempts to develop a collection of state agency electronic publications, two approaches came about. According to Pearce-Moses, Cobb, and Surface (2005), the first approach has its premise in “traditional library processes of selecting documents one by one, identifying appropriate documents for acquisition; electronically downloading the document to a server or printing it to paper; then cataloging, processing, and distributing it like any other paper publication.” (175) While this approach ensures that valuable documents will be gathered, its dependence on manual selection limits archiving to only a very few items. Scaling this process in accordance with the vastness of Web-based documents would necessitate an expansion in personnel that few state libraries have the funding to address. (Pearce-Moses, Cobb & Surface, 2005)

Alternatively, in the other approach, software tools that automate regularly occurring Web crawls are engaged. As Pearce-Moses, Cobb, and Surface assert, this model “trades human selection of significant documents for the hope that full-text indexing and search engines will be able to find documents of lasting value among the clutter of other, ephemeral Web content captured in the process.” (176) Yet, while this model relieves librarians and archivists of the upfront onus of selection and organization, at the same time it may unduly burden future searchers, if full-text indexing and search capabilities do not evolve as anticipated. The Arizona Model, explained in detail below, constitutes a third approach to Web archiving, incorporating both human assessment and automated tools.

2.3.2. An Archival Approach

The Arizona Model applies an archival perspective to curating collections of Web publications. It exploits certain telling parallels between websites and archives: namely, the concept of provenance (i.e., documents classed together stem from the same source) and the organizational structure inherent in both these kinds of collections (directories and subdirectories for websites, and series and subseries for archives). (Pearce-Moses, Cobb & Surface, 2005) In theory, if websites organize Web publications using common file directory structures, information about individual documents within sub-directories could be inherited from parent directories.

In the Arizona Model, which draws on basic archival practice, websites are handled as hierarchical aggregates rather than as individual items, and the original order of the documents (the order in which the creating agency oversaw them) is maintained. Provenance and original order are considered important contextual pieces of information. Retaining documents in the order in which they were originally managed and keeping them clustered together based on the originating agency enhance one’s knowledge of the creation and original use of the documents. Provenance and original order also allow for "inheritance" of higher-level metadata meant to describe the home agency from which the documents came and the way the documents were originally arranged. Finally, an archival approach to curating a collection of Web documents, focusing first on aggregates (collections and series) rather than on individual documents, trims the number of items that need to be appraised by a human down to a more manageable number.
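The notion of metadata "inheritance" from parent directories can be sketched as follows. This is an illustration under stated assumptions, not the Arizona Model's or the Workbench's actual data model: the `SERIES_METADATA` table, the `inherited_metadata` function, the field names, and the example host are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical series-level metadata, keyed by (host, directory path).
# An archivist describes each aggregate once, at the directory level.
SERIES_METADATA = {
    ("agency.example.gov", "/"): {"creator": "Example State Agency"},
    ("agency.example.gov", "/reports/"): {"series": "Annual Reports"},
}


def inherited_metadata(url):
    """Merge metadata from the site root down to the document's directory."""
    parts = urlparse(url)
    # Directory segments only; the final path component is the document itself.
    segments = [s for s in parts.path.split("/")[:-1] if s]
    path = "/"
    merged = dict(SERIES_METADATA.get((parts.hostname, path), {}))
    for segment in segments:
        path += segment + "/"
        # Deeper directories add to, or override, what parents supplied.
        merged.update(SERIES_METADATA.get((parts.hostname, path), {}))
    return merged


record = inherited_metadata("http://agency.example.gov/reports/2006/annual.pdf")
print(record)
```

Here the document receives both the agency-level and the series-level description without anyone cataloging it individually, which is the labor saving the aggregate approach aims for.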

2.3.3. Arizona Model Summary

The Arizona methodology is based on an archival approach to the Web that incorporates both human selection and automated capture. In this approach, Web materials are managed in a way similar to the organization of materials in paper-based archives: as a hierarchy of aggregates rather than as individual items. This approach reduces the sheer volume problem of preserving Web materials to a more practical size, while maintaining a scalable degree of human involvement. It is the guiding model for OCLC’s Web Archives Workbench.


2.4. The Web Archives Workbench: Implementing the Arizona Model

The Arizona Model is particularly instructive in its evocation of where, in the practice of archival management, automation can be considered most useful. That is, while technology may be applied for information processing activities such as data searching and tracking, and list construction and classification, tasks for distinguishing whether content is in-scope or is valuable are best reserved for humans. The overall goal behind developing a suite of tools based on the Arizona Model is to achieve a productive complement between automated processing and human decision-making, all the while adhering to established archival principles.

The software that OCLC created, the Web Archives Workbench, comprises five tools to identify, select, describe, and harvest Web-based materials, as well as to keep track of, or log, these activities and to generate reports about them. In doing so, the tools serve as a conduit between human involvement (via manual selection) and computerized capture of Web content: they convert the archivist's policies for collecting content created on the Web to software-centered rules and configurations. They also assist information professionals by providing the means to add metadata to harvested objects as aggregates. In addition, the tools implement the PREMIS-based METS profiles developed by ECHO DEP (at the University of Illinois) for packaging content; by design these profiles facilitate ingestion into multiple external repositories and support long-term preservation.² Packaging is the last step in the WAW workflow, after which the objects are ready for ingest into an external digital repository.
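As a rough illustration of what a METS package carrying PREMIS metadata looks like, the sketch below builds a skeletal wrapper for one harvested file. It is an assumption-laden outline only: it uses the public METS and PREMIS 2 namespaces, but it does not follow the specific requirements of the ECHO DEP profiles, it is not schema-valid as written, and `build_package`, the identifiers, and the file URL are hypothetical.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
PREMIS = "info:lc/xmlns/premis-v2"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("mets", METS), ("premis", PREMIS), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def build_package(object_id, file_href):
    """Build a skeletal METS document with one PREMIS object and one file."""
    mets = ET.Element(f"{{{METS}}}mets", {"OBJID": object_id})

    # Administrative metadata: a PREMIS object wrapped inside techMD.
    amd = ET.SubElement(mets, f"{{{METS}}}amdSec")
    tech = ET.SubElement(amd, f"{{{METS}}}techMD", {"ID": "TMD1"})
    wrap = ET.SubElement(tech, f"{{{METS}}}mdWrap", {"MDTYPE": "PREMIS"})
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    obj = ET.SubElement(xml_data, f"{{{PREMIS}}}object")
    ident = ET.SubElement(obj, f"{{{PREMIS}}}objectIdentifier")
    ET.SubElement(ident, f"{{{PREMIS}}}objectIdentifierType").text = "local"
    ET.SubElement(ident, f"{{{PREMIS}}}objectIdentifierValue").text = object_id

    # File inventory: one file that points back to its technical metadata.
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp")
    entry = ET.SubElement(group, f"{{{METS}}}file", {"ID": "FILE1", "ADMID": "TMD1"})
    ET.SubElement(entry, f"{{{METS}}}FLocat",
                  {"LOCTYPE": "URL", f"{{{XLINK}}}href": file_href})

    # Structural map: METS requires at least one, even for a single file.
    struct_map = ET.SubElement(mets, f"{{{METS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS}}}div")
    ET.SubElement(div, f"{{{METS}}}fptr", {"FILEID": "FILE1"})
    return mets


package = build_package("obj-001", "http://agency.example.gov/reports/annual.pdf")
xml_text = ET.tostring(package, encoding="unicode")
print(xml_text)
```

Because the descriptive wrapper is repository-neutral XML, the same package could in principle be handed to different repository systems, which is the interoperability benefit the profiles are designed to provide.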

2.4.1. DevelopmentConsiderationsOCLCledthedevelopmentofthetoolsuite.Priortotooldesignanddevelopment,OCLCcarefullyconsideredtheusercommunity,whichitidentifiedasablendoflibrariansandarchivists.Significanttoitsconsiderationwastheissueofterminology:howshouldtoolsandfeaturesintheWebArchivesWorkbenchbenamed,orcalled,ifamixedcommunityoflibrariansandarchivistswastoserveasitsuserbase?The word “series,” for example, might invoke semantics and usage for an archivist that is different, even unfamiliar, for a librarian. Thus,inexploringtheusercommunity,OCLChadarchivistslookatnewtypesofmetadataandaskedlibrarianstothinkaboutprinciplesofarchiving,suchasarchivalseriesandthecurationof

2TwoMETSprofilesdevelopedbyECHODEPareatworkhere:theECHODepGenericMETSProfileforPreservationandDigitalRepositoryInteroperability(accessibleathttp://www.loc.gov/standards/mets/profiles/00000015.html)andtheECHODepMETSProfileforWebSiteCaptures(accessibleathttp://www.loc.gov/standards/mets/profiles/00000016.html).Theformeristhe'toplevel'format‐genericprofile,whichfocusesonimplementingPREMIS.Thelatter,awebcaptureprofile,isanexampleofa'sub‐profile,’whichisusedwiththefirstonetoprovideastructureformoreformat‐specificinformation.


documents in aggregate rather than as individual items. Eventually, OCLC elected not to devise new terminology for the concepts at issue; not only did the team conclude that terminology was, in essence, a training matter, it also saw that the work of librarians and archivists often overlaps, i.e., each is frequently engaged in the milieu of the other.

In doing high-level analysis for the user interface, OCLC arrived at several working assumptions that had some bearing on the design of the tool suite. One assumption was that because the tools in the Web Archives Workbench might change over time, they needed to be “aware” of each other and enable the sharing of data, but, just as important, the user should have the ability to opt not to use a tool in the Workbench. Through interviews with librarians and archivists, OCLC also learned that harvesting responsibilities often were shared among individuals; as a consequence, data generated by a tool had to be rendered shareable by multiple users, and simultaneously so. This feature would allow a user to view the work of another. In addition, rather than trying to integrate the Workbench into an institution’s many authentication schemes, OCLC incorporated a simple scheme, allowing the Workbench to run with just basic administration.

In terms of harvesting, OCLC designed more than one harvesting workflow, so that a user could select the appropriate level of analysis and sophistication for a task. For instance, the Quick Harvest feature is a single-screen launch point that runs a harvest immediately. The Analysis tool, which is part of an extended harvesting workflow, requires more set-up, but it results in a bigger “pay-off” in terms of the website change observations it handles automatically for the user (this is explained below more formally in the description of the Analysis tool).

Finally, where the deposit of harvested information is concerned, OCLC knew that ingest to a variety of repositories, including its own Digital Archive as well as DSpace repositories, would need to be accommodated. A clean, simple interface was created between the point where the Workbench ends and a repository software application would begin; that is, the Workbench generates harvested packages of content in a file system that the repository then picks up and processes. (This is the point in the workflow at which the above-mentioned PREMIS-based METS profiles developed by ECHO DEP are implemented.)
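The file-system handoff between the Workbench and a repository can be illustrated with a minimal sketch. It assumes a hypothetical drop directory in which each package is a subdirectory containing a `mets.xml` file; the directory layout, the file name, and the function names are illustrative assumptions, not the Workbench's actual conventions.

```python
import tempfile
from pathlib import Path


def write_package(drop_dir: Path, package_id: str, files: dict[str, bytes]) -> Path:
    """Producer side (hypothetical): write one harvested package to the drop directory."""
    # Build the package under a hidden temporary name and rename it afterwards,
    # so that a consumer never sees a half-written package.
    staging = drop_dir / f".{package_id}.partial"
    staging.mkdir(parents=True)
    for name, payload in files.items():
        (staging / name).write_bytes(payload)
    final = drop_dir / package_id
    staging.rename(final)
    return final


def ready_packages(drop_dir: Path) -> list[Path]:
    """Consumer side (hypothetical): list complete packages awaiting ingest."""
    return sorted(
        p for p in drop_dir.iterdir()
        if p.is_dir() and not p.name.startswith(".") and (p / "mets.xml").exists()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        drop = Path(tmp)
        write_package(drop, "harvest-0001", {"mets.xml": b"<mets/>", "index.html": b"<html/>"})
        print([p.name for p in ready_packages(drop)])
```

The appeal of this kind of interface is that neither side needs to know anything about the other beyond the agreed directory and package layout, which is what allows several different repository applications to consume the same output.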


2.4.2. Overview of the Web Archives Workbench

The Web Archives Workbench is a suite of web archiving tools for identifying, selecting, describing and harvesting web-based content based on library and archival practice. It bridges the gap between manual selection and automated capture by transforming collection policies into software-based rules and configurations. It accommodates a variety of web harvesting approaches, including mass harvesting, selective harvesting, and individual document harvesting. Content is packaged using the ECHO DEP METS profile, which is designed to support the collection of PREMIS preservation metadata, and to facilitate ingestion into a variety of external repositories. The five tools in the Workbench are the Discovery, Properties, Analysis, Harvest, and System tools. Below (Fig. 1) is an overview of the Workbench Workflow, followed by a more detailed tour of the functionality of each tool.

Figure 1: Overview of WAW Workflow


2.4.3. A Tour of the Web Archives Workbench

The screenshot in Figure 2 below displays the main WAW tools screen after the user has logged on. The five tools are exemplified by the topmost row of tabs. (Though the Alerts tab sits in this row, it is less a tool than a feature of the Workbench. It enables users to access a collection of reports and alerts for the Discovery, Properties, Analysis, and Harvest Tools.) In the interface for the WAW tools, a tab is colored in to signify which tool is open, or active, at that particular moment. In Figure 2, for example, the Discovery tab is shaded, because the Discovery tool is currently active. Similarly, the Entry Points tab is shaded, because it is active as a component of the Discovery tool.

Figure 2: Screenshot of WAW interface home screen

A key advantage to the Workbench tools is that harvesting of Web content may be scheduled so that it occurs on a regular basis. However, the Workbench tools also offer users the alternative of running a one-time harvest. This is known as the Quick Harvest, accessible via the Harvest tab. Quick Harvest is addressed briefly in the discussion below of the Harvest tool.

2.4.3.1. The Discovery Tool: Finding Web Content of Interest

The first step in constructing an archive of Web-based resources is to determine which parts of the Web hold desirable, and thus collection-worthy, content. This step lies at the crux of the Discovery Tool. The Discovery Tool aids in identifying potentially relevant websites by crawling relevant "seed" entry points to generate a list of domains to which the "seed" sites link. (Note: An entry point is a specific website URL where the Discovery Tool will begin to search for domains or collect Web content. A domain is a server on the Internet that may contain Web content and is identified by a high-level address. For example,


http://www.illinois.gov/news/ is a website, and its domain is "Illinois.gov". Domains do NOT include “http://”.)³

In an approach that effectively borrows from citation analysis, the Discovery Tool is designed on the idea that on-topic sites likely point to other sites addressing a similar topic. The domains in the generated list are then manually evaluated as in-scope or out-of-scope, based on subject interest and collecting policies. (See Figure 3, which shows a list of domains returned after entry points have been crawled, as well as radio buttons that note the scope for each domain.) This process results in a list of domains defining a sub-set of the Web that is relevant for the user's archiving purposes. Domains marked as in-scope can be associated with an Entity (i.e., creator, or agency, or organization responsible for the Web content). Later, in the Properties and Analysis Tools, metadata associated with entities (creators such as agencies or organizations) can be inherited by content harvested from a particular website.

Figure 3: Screenshot of the interface for the Domains feature of the Discovery Tool

In sum, the Discovery Tool is used to:

• Generate a list of potentially relevant domains by crawling seed sites.
• Assign domains as in-scope or out-of-scope.
• Add domains manually to the Domains list.
• Associate domains with entities (creating agencies or organizations).
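The discovery step can be sketched in a few lines: parse a seed page, resolve its links against the entry point, and reduce each link to a domain for later manual review. This is only an illustration under stated assumptions. The function names are hypothetical, the page is passed in as a string rather than fetched, and reducing a host to a domain by dropping a leading "www." is a simplification that the Workbench may not share.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags in one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain_of(url: str) -> str:
    """Return the host without the scheme; dropping a leading 'www.' is an assumption."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def linked_domains(entry_point: str, html: str) -> set[str]:
    """Domains that a seed page links to (relative links resolve to the seed's own domain)."""
    collector = LinkCollector()
    collector.feed(html)
    found = {domain_of(urljoin(entry_point, href)) for href in collector.hrefs}
    found.discard("")  # links without a host, such as mailto: addresses
    return found


def in_scope(domains: set[str], decisions: dict[str, str]) -> list[str]:
    """Keep only the domains that a human reviewer has marked 'in-scope'."""
    return sorted(d for d in domains if decisions.get(d) == "in-scope")
```

Note that the scope decision stays a plain lookup supplied by a person; the code only gathers candidates, which mirrors the division of labor the Arizona Model describes.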

³ A note about capitalization in this section that provides a tour of the software: here, entry points and domains are not capitalized, because we are speaking of them in the general use of the Discovery Tool. However, they are also features of the Discovery Tool. When we discuss them as such, they will be capitalized.


2.4.3.2. The Properties Tool: Entering Metadata to Describe Content Creators (Entities)

Another premise of the Arizona Model is that, as much as possible, metadata should be entered only once and be inherited by associated harvested objects. After the Entry Points and Domain features of the Discovery Tool are run, and entities (i.e., content creators) have been associated with domains, metadata about the resulting entities may be entered via the Properties Tool. Besides enabling the management of information about entities, the Properties Tool also allows the user to describe the relationships (e.g., parent/child) of entities with one another, as well as enter other information such as contact information. Importantly, the Properties Tool can also be easily engaged to create analyses and series from entities' websites. The purpose of enabling analysis of a website is to examine its structure, i.e., the directories that make up the website. (For more on the Analysis Tool, see below.)

In sum, the Properties Tool is used to:

• Create and manage a list of content creators (entities).
• Assign metadata and other properties to entities.
• Specify websites that entities are responsible for, and create analyses and series (explained below) based on those websites.
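The "enter once, inherit everywhere" premise can be made concrete with a small sketch. The `Entity` class, its field names, and the rule that a child entity's values override its parent's are assumptions made for illustration; the report does not specify how the Workbench resolves inheritance across parent/child entities.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """A content creator (hypothetical structure), optionally with a parent entity."""
    name: str
    metadata: dict[str, str] = field(default_factory=dict)
    parent: Optional["Entity"] = None

    def inherited_metadata(self) -> dict[str, str]:
        """Merge metadata from the root ancestor down; nearer values override."""
        merged = self.parent.inherited_metadata() if self.parent else {}
        merged.update(self.metadata)
        return merged


def describe_harvest(entity: Entity, url: str,
                     extra: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Metadata for one harvested object: entity values first, then object-level values."""
    record = entity.inherited_metadata()
    record["creator"] = entity.name
    record["source_url"] = url
    record.update(extra or {})
    return record
```

With this shape, a value such as a jurisdiction is typed once on the parent entity and appears on every object harvested from any of its child agencies' websites.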

2.4.3.3. The Analysis Tool: Visualizing the Structure of a Website

Through the Analysis Tool it is possible to discern whether there is valuable content in the directories that make up a website and, if so, to identify those chunks of content. "Series" refers to flexible aggregates of content that are analogous to archival series. A series may be a whole website or a portion of it (e.g., only PDFs of annual reports), or even one individual page or document from a website. Loosely defined, a series is any collection of Web material that a user chooses to collect in one "bucket." In addition, series are used to drive the Workbench harvest operations. While series may be established within the Properties Tool, they can also be established and managed using the Analysis Tool, then harvested and packaged in the Harvest Tool. The Analysis Tool has two functional areas:

• the Analysis screen, which provides visualization tools to aid in content selection decision-making and in series structure decisions. Here, too, a baseline analysis can be created against which to measure future website analyses;

• the Series screen, where series are created, edited, and managed; Series objects are kept; and Series harvests are regulated.

The Analysis Tool is used to:


• Analyze the structure of a website.
• Enter associated entities.
• Set a baseline analysis for comparison with future analyses.
• Adjust settings, such as spider settings and change notification threshold settings.
• Define a "series" for harvesting (e.g., harvest as an individual object), with option to associate it with an entity.
• Hold series objects prior to harvest.
• Schedule harvests of series.

In addition, operations for holding series objects and harvesting them may be accessed via the Properties Tool.
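The idea of a baseline analysis with a change notification threshold can be sketched as a comparison of two snapshots of a site's directory paths. The measure used here (paths added or removed, as a fraction of all paths seen in either snapshot) and the default threshold of 0.2 are assumptions chosen for illustration; the report does not state how the Workbench quantifies change.

```python
def change_ratio(baseline: set[str], current: set[str]) -> float:
    """Fraction of paths added or removed, relative to all paths seen in either snapshot."""
    union = baseline | current
    if not union:
        return 0.0
    return len(baseline ^ current) / len(union)


def new_paths(baseline: set[str], current: set[str]) -> list[str]:
    """Paths present now that were absent from the baseline (e.g., a new folder)."""
    return sorted(current - baseline)


def exceeds_threshold(baseline: set[str], current: set[str], threshold: float = 0.2) -> bool:
    """True when the site has changed enough to warrant notifying the user."""
    return change_ratio(baseline, current) >= threshold
```

A later analysis of the same site would be compared against the stored baseline, and only sites whose ratio reaches the threshold would need a person's attention.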

2.4.3.4. The Harvest Tool: Reviewing, Packaging, and Ingesting Harvested Content

All the harvests in the Workbench, including series harvests (via the Analysis Tool) and quick harvests, are listed in the Harvest Tool. The Harvest Tool is used to monitor the status of harvests and to provide an opportunity to review and modify the harvest before packaging it up and ingesting it into a repository. There may be single-object harvests or multiple-object harvests, depending on whether the option to harvest content as individual objects was selected in the Series details screen of an Analysis-based Series (i.e., in the Analysis Tool). The Quick Harvest feature schedules one-time harvests of content based on a URL entered directly into the Harvest Tool.

After harvests are complete they may be reviewed, at which time additional metadata may be assigned. The user can render, or display, the harvested content within the WAW tool, from the Harvest Results page. The user can actually "step into" the harvested content at both the harvest starting point and at any other point in the website (via the website file structure display), and the software will render the website appropriately. The purpose of the display feature in the Web Archives Workbench is to allow the user to verify the correctness of what was harvested, “correctness” meaning that all the information expected to have been collected is collected. Once the harvested content is confirmed as correct, it then can be ingested into the user's local repository.

In sum, the Harvest Tool is used to:

• Monitor the status of harvests scheduled in the Analysis Tool.
• Delete completed harvests.
• Review completed harvest content, whether single-object or multi-object, prior to ingest.
• Review completed harvests; if desired, edit metadata and/or include/exclude content.


• Ingest harvested content into a repository.
• Launch a one-time quick harvest using the Quick Harvest Tool.

2.4.3.5. The Alerts Tab: Workbench Notifications

As mentioned above, the “Alerts” tab is not a tool but, rather, a feature for notifying the user of a variety of systems information. This information includes notification about errors, incomplete processes, completed processes, and new information (such as the discovery of a new domain, or a new folder encountered during analysis). In short, the Alerts Tab is used to review reports and alerts about Workbench functions.

2.4.3.6. The System Tools: Monitoring and Managing Workbench Activities

The System Tools tab contains a number of behind-the-scenes functions that affect and report on activities of the five main tools of the Workbench. The System Tools are divided into four functional areas:

• the Audit Log page, which displays recent Workbench activities and events;
• the Spider Settings page, where the user can configure default Domain, Analysis, and Harvest spider settings, as well as create additional Domain, Analysis, and Harvest spiders with custom settings. Specifically, types of spider settings include, but are not limited to, depth (how deeply a website should be crawled, or spidered) and parameters of time (when, how frequently, and for how long);

• the Import/Export page, through which the user can import or export a variety of metadata commonly used in the Workbench. These include entities, domains, and subject headings;

• the Reports page, which generates printable reports on activities of the main five Workbench tools. It offers a view of in-development entity and series reports.
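The spider settings described above (default settings per spider type, plus custom spiders derived from them) can be sketched as a small configuration structure. The field names, the default values, and the `custom_spider` helper are all hypothetical; only the kinds of parameters (depth, frequency, duration) come from the description.

```python
from dataclasses import dataclass, replace
from datetime import timedelta


@dataclass(frozen=True)
class SpiderSettings:
    """Hypothetical crawl parameters: how deep, how often, and for how long."""
    max_depth: int = 3
    interval: timedelta = timedelta(days=7)       # how frequently to run
    max_duration: timedelta = timedelta(hours=2)  # how long one run may last

    def allows(self, depth: int, elapsed: timedelta) -> bool:
        """Whether a crawl may continue at this link depth and elapsed time."""
        return depth <= self.max_depth and elapsed < self.max_duration


# One default per spider type named in the text; the values are invented.
DEFAULTS = {
    "domain": SpiderSettings(max_depth=1),
    "analysis": SpiderSettings(max_depth=5),
    "harvest": SpiderSettings(),
}


def custom_spider(kind: str, **overrides) -> SpiderSettings:
    """Derive a custom spider from a default without modifying the default."""
    return replace(DEFAULTS[kind], **overrides)
```

Keeping the defaults immutable and deriving custom spiders from them means a one-off deep harvest cannot silently change the behavior of routine scheduled crawls.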


2.4.4. Web Archives Workbench Tools Summary

The Web Archives Workbench implements an archival approach to the selection and preservation of digital (Web-based) content. The Workbench automates much of the methodology embraced by the Arizona Model, particularly beyond the initial selection decisions made by the archivist (e.g., deciding at the start of the archiving process which website, or which part of a website, to capture and preserve). After selection parameters are set, the Workbench facilitates the capture and management of the digital materials in hierarchical aggregates, not unlike the archiving of print-based materials.

OVERVIEW OF WEB ARCHIVES WORKBENCH TOOLS

Discovery Tool

• discover domains

• group and prioritize domains

Comprising the Entry Points and Domains tabs, the Discovery Tool helps to identify potentially relevant web sites by crawling relevant “seed” Entry Points to generate a list of domains that they link to. At the end of this process the users have a list of domains that defines the sub-set of the web relevant for their archiving purposes. From here, the Properties and Analysis Tools are used to manage creator information about domains, and associate this information with harvests of content.

Properties Tool

• organize collection space

• create metadata

Comprising the Entities tab, the Properties Tool is used to maintain information about content creators or ‘Entities’ (e.g., government agencies), and associate them with the domains and web sites they are responsible for. The Properties Tool also allows users to describe the relationships (e.g., parent/child) of Entities with one another, as well as enter high-level metadata about them that may be inherited by content harvested from their web sites. Importantly, the Properties Tool can also be used to create and associate Series with Entities’ web sites. Series and harvests are then further managed using the Analysis and Harvest/Package Tools.

Analysis Tool

• visualize site structure

• associate metadata

• schedule harvests

Comprising the Analysis tab and the Series tab, the Analysis Tool provides website structure visualization tools to aid content selection decisions, and allows users to define archival Series, associate metadata with these series, and schedule recurring harvests of Web content. Harvesting activities are then monitored and managed in the Harvest Tool.

Harvest Tool

• review content

• package for ingest in external repository

Comprising the Harvester and Quick Harvest tabs, the Harvest Tool lists all harvests within the Workbench, including Series harvests scheduled using the Analysis Tool as well as Quick Harvests. It is used to monitor their status and to initiate the final harvesting and ingest steps for completed harvests, including reviewing harvest contents and metadata before ingest. This is the final step in the Web Archives Workbench workflow. The tool also offers a separate Quick Harvest feature.

System Tools

• reports and settings

The System Tools manage and monitor Workbench activities, reporting on operations undertaken in the four other tools. They have four functional sections: an Audit Log page (shows recent Workbench activities); a Spider Settings page (parameters for spidering may be set here); an Import/Export page (for moving metadata); and a Reports page (for producing printable reports about activities performed by the other tools).


2.4.5. Behind the Scenes: OCLC’s Technical Implementation of the Web Archives Workbench

An ISO 9001 company, OCLC has an externally audited Quality System based on the requirements of ISO 9001 as an aid for ensuring that products meet user expectations and specified requirements. OCLC's project development lifecycle is a process that specifies how OCLC services are marketed and developed. This process includes lifecycle documents such as project plans, requirements, design, test plans, operations support plans and post-project reviews. The Web Archives Workbench program followed this lifecycle.

The WAW program was divided into three main projects and many smaller releases in order to reduce risk and to create a feedback loop allowing refinement of the requirements based on previous releases. There were three major software releases, plus approximately 20 additional releases over the course of the three-year program. The three main development projects were based on the main areas of functionality of the tool suite: (1) Domain and Entity, (2) Analysis and Packager, and (3) Site Analysis and Change Management. Though the Domain and Entity features in WAW were somewhat functionally simple, the Domain and Entity project carried a significant amount of risk because it built the technical foundation on which the rest of the project would rest. The Site Analysis and Change Management tools were risky due to the usability issues involved in clearly representing to the user the process of harvesting and evaluating changes to websites.

Throughout the project one of our main concerns was how to represent the Arizona Model in a clear and usable way in software. (This concern is addressed in the section “The Web Archives Workbench Workflow.”) Based on early discussions, the system began to be seen as a “workbench,” into which components and systems would be incorporated and dropped over time, perhaps because users would prefer to apply some of their local tools or perhaps because they would have multiple tools for a given task. Additionally, each component would grow its data quality over time, therefore forcing the rest of the system to adapt easily to evolving specifications and data versions. Therefore, the architecture is designed for location, interface, and data-exchange transparencies, which means that changes in those three main areas are expected to drive all other system characteristics.

The high level technical architecture of the system was specified using the Reference Model-Open Distributed Processing (RM-ODP).⁴ This framework uses various

⁴ For the specification, see “The ISO Reference Model for Open Distributed Processing – An Introduction,” at http://www.enterprise-architecture.info/Images/Documents/RM-ODP2.pdf.


views of a system, including a domain model view, an information view, an application view, and a technology and deployment view.⁵ Using this framework, OCLC created the following early domain model of the system. (See Fig. 4.) Some of the boxes in this domain model were later removed from the requirements, as our understanding of the system to be built changed over time.

The architecture consisted of several layers: client, integration, service, and persistence. The client layer consisted of a user interface implemented using the Struts framework, whose model-view-controller pattern gave structure to the code. The second layer is a Web services layer that provides the hooks for a client to talk to the application (although the code was not used in this way). This layer also provides integration between tools and translation between the internal and external representations of the data. Each developing WAW tool (Entity, Analysis, Domain, etc.) implemented a consistent Helper API to allow the user interface layer to Add/Update/Delete/Search single or multiple objects. The Oracle database provided a persistence layer.

Once the high level design was produced, a detailed design was produced for each tool. OCLC created use cases for all main activities in each of the tools.
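The shape of a consistent Add/Update/Delete/Search helper can be sketched as follows. This is not OCLC's Helper API: the actual system was built on Struts with an Oracle database, and its method names and signatures are not given in this report. The class below is a hypothetical in-memory stand-in, written in Python only to show how one uniform interface can serve every tool's objects.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class InMemoryHelper(Generic[T]):
    """Hypothetical helper exposing add/update/delete/search for one tool's objects."""

    def __init__(self) -> None:
        self._items: dict[int, T] = {}
        self._next_id = 1

    def add(self, *objects: T) -> list[int]:
        """Store one or more objects and return their newly assigned ids."""
        ids = []
        for obj in objects:
            self._items[self._next_id] = obj
            ids.append(self._next_id)
            self._next_id += 1
        return ids

    def update(self, object_id: int, obj: T) -> None:
        """Replace an existing object; unknown ids raise KeyError."""
        if object_id not in self._items:
            raise KeyError(object_id)
        self._items[object_id] = obj

    def delete(self, *object_ids: int) -> int:
        """Remove the given ids and return how many actually existed."""
        removed = 0
        for object_id in object_ids:
            if object_id in self._items:
                del self._items[object_id]
                removed += 1
        return removed

    def search(self, predicate: Callable[[T], bool]) -> list[T]:
        """Return every stored object for which the predicate is true."""
        return [obj for obj in self._items.values() if predicate(obj)]
```

Because every tool would expose the same four operations, a user interface layer written against this interface would not need tool-specific code for basic object management.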

⁵ In RM-ODP the architecture of a system is described by 5 views (essentially 5 different points of view) reflecting the separation of responsibilities between business sponsors, developers, and support staff. Those views are:

1. Enterprise: community, enterprise objects (domain model), objectives (requirements/use cases), roles
2. Information: schemas, object attributes, data boundaries, constraints, semantics
3. Computational: components, interfaces, interactions, contracts
4. Engineering: transparencies (location, access, failure, persistence), nodes, channels
5. Technology: technologies & products (the only dependence on specific products and implementation packages)


Figure 4: Diagram showing OCLC's early domain model of the system that eventually developed into the WAW suite of tools.

Each developer worked in his own “sandbox,” where a WAW interface was set up for his exclusive use. The work of multiple developers was integrated into a


development test environment called “Baseline.” This way, product managers and test analysts could review work in progress in Baseline. When Baseline was ready it was migrated into a Quality Assurance environment, where formalized testing was done against a test plan. For major installs, Baseline was also installed at UIUC for additional testing. The final step of the development process was to deploy the software into a production environment.

The Web Archives Workbench was released as an open-source package on SourceForge in October 2007. Release documentation includes detailed installation instructions and a detailed User Guide for understanding and using the tools.

• WAW Release home page: https://sourceforge.net/projects/webarchivwkbnch/
• Administration Guide: https://sourceforge.net/project/showfiles.php?group_id=205495 (also provided as an Appendix item in this report)
• User Guide: https://sourceforge.net/project/showfiles.php?group_id (also provided as an Appendix item in this report)
• WAW software package: http://webarchivwkbnch.cvs.sourceforge.net/webarchivwkbnch/webarchivwkbnch/

The Administration Guide has runtime environment requirements for WAW. It also has a list of all third-party software used by WAW in the Incorporated Code section of the document. The third-party software is included in the WAW distribution (refer to link for WAW software package). An OCLC subscription is not required to use WAW or to use this third-party software. Please see the HOWTO-build-install-locally.txt file in the WAW deployment for additional information.

The WAW tools, as developed by this project, will continue to be made publicly available indefinitely through SourceForge. In addition, in 2008 OCLC released a new array of services incorporating components of the WAW tools into a workflow with CONTENTdm, WorldCat, and the OCLC Digital Archive.

2.5. Findings - User Feedback

Testing of the WAW tools was undertaken in varying degrees by the original project content partners, as well as by several volunteer organizations. Feedback about their experiences working with the tools was gathered during large-group project meetings at OCLC, as well as through phone conversations and e-mail exchanges. The overall response indicates that the Web archiving approach of the WAW tools was “elegant” and worth consideration, but in practice content partners generally did not implement the full functionality of the tools. Thus, the potential benefits of


applying an archival approach to the Web were not realized completely. Reasons for this partial implementation have to do with inadequate resources and time for training in the use of the tools, which also points up their complexity (explained in further detail below). The Web Archives Workbench is powerful and extensive in terms of web harvesting and content, or series, analysis, but, according to the feedback from our content partners, at a cost of heuristics and usability. Not surprisingly, the Quick Harvest functionality (which, because series analysis is not an option in it, involved fewer steps and thus less management than the regular Harvest tool) was engaged most often; for some, the Quick Harvest feature became a much-valued component of their daily workflows. Changes in content delivery approaches, such as from static Web pages to database-driven pages, constituted another reason for not applying the full functionality of the tools.

2.5.1. Limited Resources and Limited Time

During their participation in the ECHO DEPository project, state library and archives partners remained under continual operational pressures to respond to the need for capturing content from agency websites. Some partners tested the WAW tools while continuing to use other Web content capture approaches in order to meet their immediate obligations, leaving fewer resources to focus on the WAW tools. Because the tools were still under development, testing of the various phased releases may also have been difficult to incorporate into daily workflows. Support from the project (in the form of interns) had been planned but was geared to the early releases of the Workbench, before the full functionality of the tools was implemented. In hindsight, putting project resources toward direct work with content partners, as originally intended, might have resulted in more use of the full functionality of the tools, especially if timed more specifically to coincide with later, more fully functional, software releases.

2.5.2. Complexity of the Tools

According to user feedback, the Quick Harvest and Discovery tools were easiest to use, because they could be set up quickly and incorporated into existing workflows without increasing the need for new resources. The full functionality of the tools involves understanding a process with a greater level of complexity than that presented by the Quick Harvest option; partners reported that it was easier to use the Quick Harvest and Discovery tools than to expend time and resources on learning, or testing, the tool suite as a whole. Further, some content partners reported that the complicated interface of the tools was a barrier to using them to their fullest potential.

2.5.3. Web Content Delivery

The assumption proposed by the archival model, that a website and its directories are similar to an archival record collection and set of record series, does not apply


today as easily as it did when the model was first proposed in 2003. An increasing amount of content is now delivered through database-driven websites rather than through static Web pages. The relationships between content items that may have been obvious when stored in a file directory are not always apparent when stored in a database. Therefore, crawling domains to find potential content to harvest and applying inherited metadata according to a directory structure are now less useful approaches than they were just a few years ago. Nonetheless, despite this shift in how information is delivered via websites, the concept of content inheriting metadata from previously harvested content, and then associating that content with an existing aggregate collection, continues to be useful for making automated harvest processes more effective.

2.6. Conclusions and Next Steps

State librarians and archivists continue to search for the best methods for capturing Web content based on their specific mandates and the resources they have available to them. Recent developments in Web archiving services and tools provide new opportunities for partnering with others and for exploring new workflows.

The Web Archives Workbench tools are one option among many. They automate the methodology prescribed by the Arizona Model, which is premised on key archival practices, such as observation of provenance and adherence to original order. The four main tools (Discovery, Properties, Analysis, and Harvest) enable the identification, selection, description, and packaging of digital content. In addition, the WAW suite includes functionalities for error notification, as well as System tools for overseeing and reviewing Workbench activities (in the form of audit logs, spider settings, metadata import/export options, and reports on the activities launched by other tools in the suite).

The lessons learned from developing the Workbench, and the underlying archival model used to direct its development, underscore the merging roles and responsibilities of archivists and librarians in the digital environment and the need to re-evaluate and re-envision workflows. Moreover, the continuing mission and significance of this work have been affirmed in the second phase of NDIIPP. For example, the University of Illinois, OCLC, and the University of Maryland have partnered to develop a stand-alone, open-source metadata extraction tool intended to provide access to archived content, a kind of next step for the Web Archives Workbench. In addition, in the State Initiatives component of NDIIPP, a selection of state libraries across the nation are collaborating to develop tools and service models for the management and preservation of state government digital materials. These projects address digital preservation in a variety of contexts, including disaster readiness and the recovery of data. Through the State Initiatives work, NDIIPP is addressing the fundamental issue of keeping at-risk state government resources viable as part of our national heritage and record.


3. Repository Evaluation and Interoperability

Another component of the ECHO DEPository project is the evaluation of various open source repository software applications, with a focus on how these applications support activities of an institution or organization interested in providing services associated with a trustworthy digital repository. This section describes the development of an evaluation framework based on the first draft of the 2005 RLG/NARA Audit Checklist for the Certification of a Trusted Digital Repository, Draft for Public Comment (Audit Checklist), our repository testing and findings, and how these activities led to the development of a tool suite (the Hub and Spoke) for supporting repository interoperability and the collection of metadata important for preservation.

3.1. Repository Evaluation

3.1.1. Building an Evaluation Framework: Applying the Trusted Digital Repository Checklist to Repository Evaluation

Our goal is to provide an evaluation framework that reflects current thinking on digital preservation standards to help curators of digital collections, librarians, and archivists assess digital repository systems, with a focus on their ability to support long-term preservation. The 2005 RLG/NARA Audit Checklist for the Certification of a Trusted Digital Repository, Draft for Public Comment provided a vital starting point. The Audit Checklist was developed by a joint task force from RLG and the National Archives and Records Administration (NARA). It provides a means by which an institution can perform a self-evaluation to determine how well it is positioned at an organizational level to provide an expected level of trustworthiness as a digital repository. We considered it to be a state-of-the-art articulation of what it means, at an organizational level, to be a successful curator of digital resources. We therefore decided to use this document as a starting point, and provide support for using those portions that are relevant to software as a ‘lens’ for considering repository software systems.

Our project team reviewed each Audit Checklist item with the question in mind, “How might a repository software system support an organization in meeting this criterion?” Some Checklist items are applicable to software applications; others are not. We isolated the relevant items, and, through much discussion, testing and review, developed a system of annotations to describe how each particular relevant Checklist item might be applied to assessment of repository software systems. This adapted Checklist is our Annotated Audit Checklist.

Our Annotated Audit Checklist evaluation framework is provided in this report in Appendix 6.3. Because the version we used, the 2005 RLG/NARA Audit Checklist for the Certification of a Trusted Digital Repository, Draft for Public Comment, has since been succeeded by the Trustworthy Repositories Audit and Certification: Criteria and Checklist (TRAC), we have included a mapping to the equivalent sections of the new version. Additional details and examples of annotations are provided in Appendix 6.4. The following overview is extracted from this source.

3.1.1.1. Findings: the Annotated Audit Checklist As a Repository Evaluation Framework (from Kaczmarek et al., 2006)

We found the process of adapting the Audit Checklist as a framework for our repository software application evaluation to be a useful learning experience in itself. Situating our evaluation within the original Audit Checklist provided a framework to discuss repository software applications without losing sight of the larger organizational context. As we used it to document our repository installation and experimentation experiences, we found the annotated Audit Checklist provided a good framework for looking at repository software applications within the context of digital preservation. However, information about other aspects of software not directly related to preservation (e.g., ease of installation, ease of maintenance, programming language used) does not fit well into this framework. Importantly, we have also found that the process of annotating the original Audit Checklist provided a forum for the project team members to begin discussions that have opened up opportunities to explore our individual assumptions about various checklist items and our interpretations of terminology. Through these discussions we established directions in which to take our evaluation activities further.

3.1.2. Repository Testing: Ingest and Export Tests On Four Key Open-source Repositories

The four open-source repository software applications that were tested were DSpace, EPrints, Fedora, and Greenstone. The collection items used as test data are described in detail below. Our testing approach and methodology are explained in Section 3.1.3, also below.

3.1.2.1. Test Data: a Heterogeneous Canonical Set

In order to test the repositories, a number of heterogeneous collections of digital items were identified. These collections varied in structure, format, and existing metadata. An overview of each collection follows below.

3.1.2.1.1. Aerial Photos

This is a relatively small collection of scanned aerial photographs from a couple of Illinois counties. It consists of 1,021 distinct scanned photographs and their accompanying metadata, which accounts for 2,042 distinct files for a total of 235 megabytes. The metadata are based on the Federal Geographic Data Committee (FGDC) Geospatial Metadata Standard. The scanned images were JPEGs of screen resolution quality; the archival quality images were not available for this project.

3.1.2.1.2. DLI (Digital Library Initiative) Journal Articles

The DLI (Digital Library Initiative) collection consists of 85,650 distinct journal articles on the subjects of science, technology, and engineering from five different publishers. This collection was created as part of the Grainger Library's earlier NSF- and CNRI-funded Digital Library testbed. Measured by number of files, this was by far the largest collection, with 2,247,455 files for a total of 76,148 megabytes. Each journal article consisted of several instantiations, typically including XML and SGML conforming to one of several different DTDs plus a PDF version, but in some cases also PostScript or TeX versions, plus all of the associated files such as images and metadata, which also occurred in several different formats.

3.1.2.1.3. WILL Public Radio Broadcasts

WILL is the local public broadcasting station, and this collection consists of the audio recordings for a selection of its Focus 580 daily talk radio programs. A total of 310 programs are included. Each program has a WAV audio file plus two XML metadata files for a total of 930 files and 82,456 megabytes. The metadata was originally in a Microsoft Access database.

3.1.2.1.4. Vincent Voice Audio Collection

This collection is a selection of audio recordings from the Vincent Voice Library at Michigan State University. It consists of 209 recordings, many of which are composed of several audio WAV files. There is an Encoded Archival Description (EAD) file associated with each recording for a total of 3,515 files and 110,186 megabytes.

3.1.2.1.5. DOQ (Digital Orthophoto Quadrangles) Data

This is a collection of 1,073 high-resolution Digital Orthophoto Quadrangles of the Chicago area. Each DOQ consists of six files: the image TIFF file, the ‘world file’ used for geo-referencing the TIFF, an FGDC XML metadata file, a text-only version of the metadata, the DTD for the metadata file, and an XSLT stylesheet for the metadata.

3.1.2.1.6. “The Canonical Set”

The first repository tested was DSpace. For each collection, specialized scripts and XSLTs were written to arrange the items and metadata in such a way that they could be ingested into DSpace using the ItemImport utility. This process included moving the files associated with each “item” into a single directory, creating descriptive metadata in DSpace's idiosyncratic Qualified Dublin Core format, and creating a content manifest. The DSpace bulk ingest package format was accepted as the baseline configuration from which all other processing would occur. The collection of all the digital packages in this format became our “canonical set.” See the figures below for various breakdowns of the files in this set:

Canonical Test Set: Number of Packages by Collection

Canonical Test Set: Total Megabytes by Collection

Canonical Test Set: Number of Files by Collection

Canonical Test Set: Number of Files by Filename Extension
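The per-item layout used for the canonical set (one directory per item, a content manifest, and a Qualified Dublin Core file) can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual scripts, which were written per collection with XSLT and custom programs. The file names `contents` and `dublin_core.xml` and the `dcvalue` element follow the DSpace bulk-ingest convention as we understand it; the function name, item name, and sample values are hypothetical.

```python
import tempfile
from pathlib import Path
from xml.etree import ElementTree as ET


def write_item_package(item_dir, title, files):
    """Write one item directory: content files, a manifest, and a DC record.

    `files` maps a file name to its bytes. Only a title is recorded here;
    a real package would carry many more descriptive fields.
    """
    item_dir = Path(item_dir)
    item_dir.mkdir(parents=True, exist_ok=True)
    for name, data in files.items():
        (item_dir / name).write_bytes(data)

    # Content manifest: one content file name per line.
    manifest = "\n".join(sorted(files)) + "\n"
    (item_dir / "contents").write_text(manifest, encoding="utf-8")

    # Descriptive metadata in the element/qualifier style of Dublin Core.
    root = ET.Element("dublin_core")
    value = ET.SubElement(root, "dcvalue", element="title", qualifier="none")
    value.text = title
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )
    return item_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        item = write_item_package(
            Path(tmp) / "item_0001",
            "Aerial photograph (hypothetical example)",
            {"photo.jpg": b"\xff\xd8", "photo.xml": b"<metadata/>"},
        )
        print(sorted(p.name for p in item.iterdir()))
```

Keeping every item in the same simple directory layout is what allowed one baseline format to serve as the starting point for all later repository-specific processing.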

3.1.3. Testing Approach and Methodology

In a nutshell, our process was to install a particular repository, do whatever was necessary to ingest our canonical data set into the repository, do whatever was necessary to export our canonical data set back out of the repository, and record our findings in narrative form, especially in the context of digital preservation and of our Annotated Trusted Digital Repository Checklist. Note that the Annotated Trusted Digital Repository Checklist and our concept of what was actually required for long-term digital preservation were being continually revised in parallel with the repository evaluation process. More details of the testing approach and methodology are provided below.

First, different project staff members, consisting primarily of graduate research assistants, were assigned to each of the different repositories to be evaluated. Cooperation between evaluators was encouraged, especially when different skill sets might be needed in performing an evaluation. Some of the evaluators had library backgrounds while others had technical computer backgrounds.

The approach was fairly freeform and the evaluators had some leeway in the details, but in general they followed this rough outline:

• The first step was to become familiar with the repository to be evaluated. This could involve review of any previous evaluations or other written material about the repository, including the documentation provided by the repository itself. This step culminated in the installation of the repository on our test server.

• The next step was to develop a process for ingesting our canonical data set into the repository. This required the evaluator to gain a good understanding of the details of the repository's supported metadata formats and supported file structures, as well as the repository's programming interfaces or batch processing tools that could be used to facilitate the ingest. The evaluator would also need to become familiar with our canonical data set at this point, if they were not already familiar with it. The ingest process generally consisted of these steps:

o Develop mappings between the various metadata formats represented in the canonical data set and the metadata schemas required by the repository. These mappings could be implemented using XSLT transformations or in some cases by writing customized computer programs.

o Package the metadata and files in a format that is digestible by the repository. This could be as simple as creating a text file manifest, or naming files according to some standard and putting them all in a certain directory structure, or as complex as creating METS or FOXML XML packages. As with the metadata mappings, these packages were usually created using some combination of XSLT and customized computer programs.

o Finally, the actual ingest needed to occur. Once again, this could be as simple as running or activating the repository's native ingest tool, or as complex as writing a custom ingest program that uses the repository's low-level programming interfaces. If the repository supported a native batch ingest mechanism, every attempt was made to use it as is before resorting to the creation of any customized ingest programs.

• Often development of the ingest process was iterative, consisting of developing and implementing a process, testing it, and refining it until the entire canonical set could be reliably ingested.

• After the canonical data set had been ingested, the process was reversed, and the data was exported or disseminated back out of the repository. Similar to ingest, the dissemination could be as simple as invoking native repository functions, or as complex as writing a custom program, although we preferred to use native batch export capabilities if the repository had support for any. This process was also often iterative.

• Evaluators were encouraged to record their processes and findings about the repositories throughout this process.

• Finally, reuse of transformations and computer programs between different repositories was encouraged.
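The metadata-mapping step in the outline above can be illustrated with a small crosswalk. This is a sketch only: the project implemented its mappings mostly as XSLT, and the source field names and target element/qualifier pairs below are hypothetical, chosen for illustration rather than taken from any of the test collections.

```python
from xml.etree import ElementTree as ET

# Hypothetical crosswalk: source field name -> (target element, qualifier).
CROSSWALK = {
    "title": ("title", "none"),
    "abstract": ("description", "abstract"),
    "pubdate": ("date", "issued"),
}


def map_record(source_xml):
    """Map a flat source record to (element, qualifier, value) triples.

    Returns the mapped triples plus the names of source fields that have
    no target, so that lossy mappings are visible rather than silent.
    """
    source = ET.fromstring(source_xml)
    mapped, unmapped = [], []
    for child in source:
        target = CROSSWALK.get(child.tag)
        text = (child.text or "").strip()
        if target is None:
            unmapped.append(child.tag)
        elif text:
            mapped.append((target[0], target[1], text))
    return mapped, unmapped


if __name__ == "__main__":
    record = (
        "<record><title>Champaign County</title>"
        "<pubdate>1940</pubdate><scale>1:20000</scale></record>"
    )
    print(map_record(record))
```

Reporting the unmapped fields is a deliberate choice in this sketch: a field with no counterpart in the target schema is exactly the kind of loss an evaluator would want to record in the narrative findings.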

In parallel with the repository evaluation processes described above, team members also participated in the review of the Trusted Digital Repository Checklist, so that each of these tasks informed the other in an iterative fashion. The simultaneous Trusted Digital Repository Checklist review and the repository evaluations culminated in the evaluators being asked to apply our Annotated Trusted Digital Repository Checklist to their repository evaluation findings, which are provided in the appendices of this report.

Unfortunately, one of the pitfalls of employing graduate assistants on a long-term project like this is that they eventually graduate or move on to other assistantships as their educational goals progress or change. While we strongly encouraged our graduate assistants to document their work, we found in some cases that their notes were not detailed enough for us to accurately reflect their findings as we applied their evaluations to our Annotated Trusted Digital Repository Checklist. This sometimes required us to revisit, or recreate, a test for a given repository in order to address one or more of the checklist items.

Another outcome of the repository evaluations is that as we progressed with ingest and export testing of the different repositories, one of our goals became to be able to reliably move a collection of digital objects between any two of the repositories that were being evaluated and back again (round-tripping). This was the genesis of our Hub and Spoke repository interoperability architecture, which is currently under development.

3.1.4. Repository Testing Findings: Narrative Reports and Annotated Audit Checklist Commentary

The following open source repositories were evaluated:

• DSpace: Version 1.2.2 with later upgrade to 1.3.1
• EPrints: Version 2.3.13
• Fedora: Version 2.0, with later upgrades to 2.1, 2.1.1 and 2.2
• Greenstone: Version 2.6, with upgrade to 7.7

Overview reports were produced for each repository and are provided in Appendix 6.5, Repository Testing Findings: Narrative. Each report contains the following sections:

• Repository Overview
• Testing and Technical Environment
• Methodology
• Findings
• Conclusion

We also documented our experiences using our Annotated Checklist evaluation framework. This documentation is provided in Appendix 6.6.

3.1.5. Conclusion and Next Steps

Two key overall findings were typically low out-of-the-box support for interoperability and low support for emerging preservation standards. During the development of our testbed we found ourselves developing a number of different though similar customized scripts and programs for exporting digital packages from one repository system and importing those digital packages into another repository system. The repository systems themselves had very little in common that would facilitate this task. They typically supported different descriptive metadata formats, had no support for provenance metadata, offered little or no support for technical metadata, and employed different means of identifying the files constituting a package. The development of an in-house tool to facilitate data interoperability between multiple repositories without the need to develop customized mechanisms for each repository combination therefore soon emerged as a key task to support our repository evaluation activities.

At the same time, we were also coming to a more structured understanding of emerging digital preservation standards, specifically early drafts of An Audit Checklist for the Certification of Trusted Digital Repositories (RLG, 2005; Kaczmarek, Hswe, Eke, & Habing, 2006; Kaczmarek, Habing, & Eke, 2006) and the PREMIS Data Dictionary for Preservation Metadata (PREMIS Working Group, 2005). We began to see that a formally developed interoperability architecture designed with a focus on providing additional support for retention of provenance and technical metadata could be a valuable and practical project deliverable, and one with immediate application in our own libraries and in other institutions that commonly implement multiple repository systems to manage and preserve digital collections. These findings led to our proposing the Hub and Spoke tool suite as an additional project deliverable. This work is described in detail in the next section (Section 3.2).

3.2. Hub and Spoke Architecture (HandS): Supporting Repository Interoperability and Emerging Preservation Standards

This section describes the development of the Hub and Spoke (HandS) tool suite, built to help curators of digital objects manage content in multiple repository systems while preserving valuable preservation metadata. Implementing METS and PREMIS, HandS provides a standards-based method for packaging content that allows digital objects to be moved between repositories more easily while supporting the collection of technical and provenance information crucial for long-term preservation. (Note that related project work, investigating the more fundamental semantic issues underlying the preservation of the meaning of digital objects over time, is profiled in Section 4.)

3.2.1. HandS Overview

The HandS is a suite of tools built to support moving content between repositories while generating and maintaining PREMIS-based technical and preservation metadata. It emerged out of project activities to evaluate open-source repositories (see Section 3.1), in which we found typically low out-of-the-box support for interoperability and low support for emerging preservation standards. The next section describes the impetus and rationale behind the HandS development in more detail.

3.2.2. The Need for Interoperability and Preservation Support

3.2.2.1. Institutions Commonly Rely on Multiple Repositories

There are currently many different digital repositories in widespread use, including DSpace, Greenstone, Fedora, EPrints, and CONTENTdm, along with digital archive services like those from OCLC and CDL. There are also many different sources of input into these systems, such as web crawlers like Heritrix or packaged content from OCLC's Web Archives Workbench, as well as numerous digitization and scanning services. It is also not uncommon for several of these systems to be in use within a single institution. If curators wish to share data internally, or with other institutions or consortia, then multiple repository systems very likely will come into play. Repository interoperability issues also emerge as institutions update or replace their repository systems, and must migrate content from an existing repository system to its replacement.

3.2.2.2. Out-of-the-box Repository Interoperability is Low

Our repository evaluation experiments and our experiences with repositories in production at our own institutions show that the native ability for repositories to interoperate is typically very basic. Almost none of the systems we tested were able to operate with one another beyond a rudimentary level, usually restricted to the OAI Protocol for Metadata Harvesting (OAI-PMH) for Dublin Core. If any OAIS concepts are implemented (and few are), such as the use of submission or dissemination information packages (SIPs and DIPs), these implementations vary greatly. In an ideal OAIS-compliant world, a DIP from one repository should be a SIP to another. However, in reality, a dissemination package produced by DSpace cannot be used for submission into EPrints. Because of these inconsistencies, achieving any real interoperability between repository systems usually entails some level of custom software development. Further, any time a new repository is added to the mix, new software will need to be developed in order to accommodate the added repository.

3.2.2.3. Support for Emerging Preservation Standards is Low

Few of the current repositories have any explicit support for preservation, such as for collecting preservation metadata as articulated by PREMIS, or for activities to support preservation such as format migrations or checksum validations as outlined in the Trusted Digital Repository Checklist. For an institution that deploys several repository systems, a task as simple as performing consistent backups to off-line storage can become complicated by the fact that the systems store their underlying data differently. There may be data stored in relational databases, XML databases, RDF triple stores, and various file systems, all of which must be backed up, and which may require different backup techniques.

In summary, the general lack of repository support for interoperability and for emerging preservation standards, at a time when libraries and other institutions commonly rely on multiple repository systems to manage, share and preserve content, is the fundamental impetus behind the development of the HandS tool suite. The key principles of interoperability and preservation, and the approaches implemented in the HandS to support them, are examined more closely in the next section, followed by a functional and technical overview of the HandS tools.

3.2.3. Hub and Spoke Key Principles

The Hub and Spoke approach is based on the two key principles of interoperability and preservation, with the understanding that interoperability is not only an end unto itself, but is also critical for preservation.

3.2.3.1. Interoperability

To reduce the complexity of interoperability, the Hub and Spoke uses a common packaging format for the interchange of digital resources between different repositories. Digital packages coming from a repository are transformed into this common format before any further processing, and digital packages are transformed from this common format into the native repository format when being placed into a repository. The idea is to reduce an N² problem to a 2N problem, as shown in Figure 5 below.

Figure 5: Interoperability Standards: a Simple Idea
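The arithmetic behind the N² versus 2N claim can be made concrete with a short sketch. It assumes that each direction of transfer between two repositories needs its own converter, so the "N²" figure is counted here as the exact number of ordered pairs, N(N-1); the function names are ours, for illustration only.

```python
def pairwise_converters(n):
    """Converters needed when every repository talks directly to every other.

    Each ordered pair (source, target) needs its own export/import path.
    """
    return n * (n - 1)


def hub_converters(n):
    """Converters needed with a common format: one to-hub and one from-hub each."""
    return 2 * n


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 10):
        print(n, pairwise_converters(n), hub_converters(n))
```

Under this counting the two approaches break even at three repositories, and the common format pulls ahead from four onward (12 versus 8 converters for the four repositories tested here). Adding a further repository then costs two new converters rather than two for every repository already in place.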

3.2.3.2. Preservation

The second key principle is that the common packaging format, as well as the processes that act on that packaging format, should not only support interoperability, but should also promote preservation. This principle treats the common packaging format as an archival information package (AIP) in the OAIS model. The assumption is that one reason packages are being moved between repositories is for preservation. There are several features of the Hub and Spoke architecture that promote preservation.

One key preservation feature is the reliance on current best practices regarding preservation metadata, primarily informed by PREMIS. The Hub and Spoke is especially concerned with technical metadata about the files and bitstreams which comprise a digital package, and also with provenance metadata about the events that occur during the package's lifetime, including events pertaining not only to the files and bitstreams, but also to the metadata itself. The technical metadata are used to validate the files and bitstreams throughout a digital object's lifetime and are updated as required, for example when a format transformation occurs. The provenance metadata are also updated throughout an object's lifetime. The tools that implement the HandS architecture perform these actions automatically as required during processing, but the data are always present in the packages so that other systems can also perform these actions as needed.

Another key preservation feature is the treatment of the packages themselves. For purposes of repository interoperability and also to support preservation, the Hub and Spoke framework treats the instantiations of the packages as first class digital objects. This means that when a HandS package is transformed for ingestion into a specific digital repository, not only are the metadata, files, and bitstreams that comprise the package decomposed as appropriate for the repository and uploaded, but the package itself (in our case a suite of METS files) is also treated as a digital object to be uploaded to the repository. Later, when the digital package needs to be disseminated from the repository, not only are the metadata, files, and bitstreams available for download, but also the original HandS package. This allows the HandS system to compare the package as it was originally ingested to how it now appears as disseminated from the repository. This process, we feel, is critical for preservation in an environment of heterogeneous and changing repositories.

Another aspect of this treatment of the Hub and Spoke packages as first class digital objects is that we can create snapshots of individual packages at points in time and also record preservation metadata about the package snapshots. The HandS tool suite currently implements this concept as a master package which references time-stamped snapshots of the main package. The master package also records preservation metadata about the snapshots. This approach is explained in more detail in the next section, which describes the concrete implementation of the HandS packages using METS.
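The use of technical metadata to validate files over time can be sketched as a simple fixity check. This is an illustrative sketch rather than the HandS implementation: the choice of SHA-256, the in-memory representation of files, and the function names are all assumptions made for the example.

```python
import hashlib


def checksum(data):
    """Return a hex digest for a file's bytes (SHA-256 assumed for illustration)."""
    return hashlib.sha256(data).hexdigest()


def record_fixity(files):
    """Record a checksum per file name, as technical metadata would at ingest."""
    return {name: checksum(data) for name, data in files.items()}


def verify_fixity(files, recorded):
    """Return the sorted names of files that are missing or no longer match."""
    failures = []
    for name, expected in recorded.items():
        data = files.get(name)
        if data is None or checksum(data) != expected:
            failures.append(name)
    return sorted(failures)


if __name__ == "__main__":
    ingested = {"page1.tif": b"image-bytes", "record.xml": b"<mods/>"}
    fixity = record_fixity(ingested)
    # Simulate a dissemination in which one file has been altered.
    disseminated = {"page1.tif": b"image-bytes", "record.xml": b"<mods></mods>"}
    print(verify_fixity(disseminated, fixity))
```

Because the recorded checksums travel inside the package, any system that later receives it can repeat the same check without access to the tool that produced them.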

3.2.4. METS Profile

To realize the above principles, we wanted to utilize the prevailing digital library standards as much as possible. To that end, we adopted METS as the packaging standard, PREMIS as the preservation metadata standard, and MODS as the descriptive metadata standard. We also optionally utilize several format-specific technical metadata standards such as MIX and textMD for image and text objects respectively, among others. Our METS profile, the ECHO DEP Generic METS Profile for Preservation and Digital Repository Interoperability (Habing, 2005), is currently registered with the Library of Congress.

As already described, the primary focus of the HandS METS profile is to enable repository interoperability and to support preservation of repository content. Because of the strong focus on preservation rather than access, the HandS profile is relatively noncommittal regarding file formats or structures; instead, special attention is given to administrative and technical metadata, particularly to integrating the PREMIS data model and schema into METS. We anticipate that our file format-agnostic HandS profile may be overlaid on top of, or inherited by, other profiles that better define a particular file format or structure, providing them with added support for preservation or interoperability. An example of this arrangement, where a format-specific METS profile is implemented as a subclass of the PREMIS-focused HandS profile, is the ECHO DEP METS Profile for Web Site Captures (Habing, 2006), also registered with the Library of Congress.

Thus the ECHO DEP Generic METS Profile for Preservation and Digital Repository Interoperability is generally not concerned with rendering or making accessible any particular representation of an object, but it is concerned with preserving the object and its representations, including the history of how those have changed over periods of time. In this context, preservation refers to short-term interoperability, preserving the representations and metadata as a digital package is moved between two different repositories. It also refers to the long-term preservation of the package and its history as it exists in various repositories for long periods of time and undergoes various "preservation actions" such as fixity checks, normalizations, or format migrations.

Note that though the profile is generally agnostic about almost all aspects of a digital object's representation, such as structure or file formats, we have made some pragmatic concessions, such as mandating at least MODS for the primary descriptive metadata (dmdSec) while at the same time allowing multiple alternative descriptive metadata sections. The alternative descriptive metadata sections are used as a means to record various versions of these metadata as they have existed in different repositories or at different points in time. A potential usage scenario can be illustrated in the following migration example using our METS profile:

1. We start with a digital object whose original source descriptive metadata is in the MARCXML format. Because our profile requires MODS as the primary descriptive metadata, the MARCXML will be transformed into MODS, and the MODS will be stored in the METS document along with a provenance statement with some details about the transformation, especially identifying the source metadata format. However, because descriptive metadata are considered to be a significant part of the representation of an entity and because transformations between metadata formats are often imperfect, the original MARCXML format is also stored in the METS document as an alternate metadata format.

2. Now suppose that the digital object is to be ingested into DSpace. DSpace, however, does not have native support for the MODS or MARCXML metadata formats; therefore, as part of the ingest process, the MODS must be transformed into the idiosyncratic Dublin Core (DC) metadata format that is supported by DSpace. This metadata format is also added as another alternate descriptive metadata format to the METS document, along with a provenance statement describing how this new DC format was derived from the primary MODS format.

3. Next imagine that the object exists in DSpace for some period of time during which the descriptive metadata undergoes some revision, such as the addition of new subject terms or the addition of an abstract. Now the object is to be disseminated from DSpace for ingest into some new repository. This could trigger the addition of another alternate descriptive metadata section to the METS document. This alternate format would conform to the idiosyncratic DSpace Dublin Core format, but the provenance statement would specify that this DC format represents a newer version of the descriptive metadata than was originally ingested into DSpace.

The above scenario would produce a chain of descriptive metadata formats, such as MARCXML (original) → MODS (primary) → DC (version 1) → DC (version 2), with provenance PREMIS event statements adequate to determine the sequence of events that led to this chain. As part of this profile we also envision future processes that might reconcile later metadata revisions and merge those revisions back into a new primary MODS descriptive metadata section. The preservation of semantics during these types of migrations is one of the concerns of semantic preservation described in the final section of this report (Section 4).

Because we feel that administrative metadata are important for preservation, this profile is fairly prescriptive when it comes to the administrative metadata, which can be associated with almost all of the sections that make up a representation: structures, files and bitstreams, and descriptive metadata. Particular attention is paid to the technical and provenance metadata associated with these METS sections.
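The chain of descriptive metadata sections in the scenario above can be modelled as a small data structure in which each section points back to the section it was derived from. This is a simplified sketch: the class, field names, and identifiers are hypothetical stand-ins, not METS or PREMIS element names, and a real package would express the same links through provenance event statements.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MetadataSection:
    """One descriptive metadata section plus a pointer to its source section."""
    section_id: str
    label: str
    role: str                           # "primary" or "alternate"
    derived_from: Optional[str] = None  # id of the source section, if any
    event: Optional[str] = None         # free-text stand-in for a provenance event


def derivation_chain(sections: Dict[str, MetadataSection], section_id: str) -> List[str]:
    """Follow derived_from links back to the original and return labels in order."""
    chain = []
    current: Optional[str] = section_id
    seen = set()
    while current is not None:
        if current in seen:
            raise ValueError("cycle in provenance links")
        seen.add(current)
        section = sections[current]
        chain.append(section.label)
        current = section.derived_from
    return list(reversed(chain))


def example_sections() -> Dict[str, MetadataSection]:
    """Build the four sections from the migration scenario."""
    items = [
        MetadataSection("dmd1", "MARCXML (original)", "alternate"),
        MetadataSection("dmd2", "MODS (primary)", "primary", "dmd1", "transformation"),
        MetadataSection("dmd3", "DC (version 1)", "alternate", "dmd2", "transformation for ingest"),
        MetadataSection("dmd4", "DC (version 2)", "alternate", "dmd3", "revision in repository"),
    ]
    return {item.section_id: item for item in items}


if __name__ == "__main__":
    print(" -> ".join(derivation_chain(example_sections(), "dmd4")))
```

Only one section carries the primary role at any time, while the alternates accumulate; a future reconciliation process of the kind envisioned above would add a new primary section derived from the latest revision.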

3.2.4.1. Master METS Profile

Another key idea behind our METS profile is the idea of a Master METS document. Each package in the HandS architecture consists of a single Master METS document, one or more METS Snapshot documents, plus all the files and bitstreams that are referenced from the METS Snapshots, as shown below in Figure 6.

Figure 6: Master METS Showing Multiple Snapshots and Associated Files

Each Snapshot represents a version of the digital package at a point in time, usually when the package is either retrieved from or placed into a given repository. Nearly any aspect of a digital object's representation may change with time, including descriptive metadata and structure; as illustrated above, even the files referenced from a package may change over time, perhaps as format migrations occur. These changes are recorded as provenance statements in the METS Snapshot in which the change is manifest. For example, METS Snapshot 2 in the above diagram would have a PREMIS event describing that File 1 was deleted from the package and File 4 was added to the package.

In most cases, the HandS system can automatically detect when these changes occur and will automatically add the appropriate provenance statements or embellish the technical metadata as required. However, it may not be able to determine why the changes occurred without some sort of intelligent intervention. HandS is able to detect the changes because it has access to the previous Snapshots and can compare the Snapshot of the package as it went into a repository to the package that is retrieved from the repository. This is one of the primary reasons that the METS documents themselves are also placed into a repository along with the other files that are actually part of the package.

The METS profile implementations described above are an integral piece of the HandS architecture, used as a framework for generating and maintaining PREMIS-based metadata over time to support long-term preservation. The next section looks in more detail at other mechanisms of the HandS tool suite, and illustrates its overall workflow cycle.
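The comparison between consecutive Snapshots can be sketched as a diff over the files each Snapshot references. This is an illustration under simplifying assumptions: a Snapshot is reduced to a timestamp and a mapping from file name to checksum, the event labels are hypothetical rather than PREMIS vocabulary, and the class and function names are ours.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Snapshot:
    """A package version at a point in time: file name -> checksum."""
    timestamp: str
    files: Dict[str, str]


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[Tuple[str, str]]:
    """Return (event, file name) pairs describing how `new` differs from `old`."""
    events = []
    for name in sorted(old.files.keys() - new.files.keys()):
        events.append(("deletion", name))
    for name in sorted(new.files.keys() - old.files.keys()):
        events.append(("addition", name))
    for name in sorted(old.files.keys() & new.files.keys()):
        if old.files[name] != new.files[name]:
            events.append(("modification", name))
    return events


@dataclass
class MasterPackage:
    """Stand-in for the master document: an ordered list of snapshots."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def add_snapshot(self, snapshot: Snapshot) -> List[Tuple[str, str]]:
        """Append a snapshot and return the events relative to the previous one."""
        events = diff_snapshots(self.snapshots[-1], snapshot) if self.snapshots else []
        self.snapshots.append(snapshot)
        return events


if __name__ == "__main__":
    master = MasterPackage()
    master.add_snapshot(Snapshot("t1", {"File1": "aa", "File2": "bb", "File3": "cc"}))
    print(master.add_snapshot(Snapshot("t2", {"File2": "bb", "File3": "cc", "File4": "dd"})))
```

A diff of this kind can say what changed between two points in time but not why, which matches the observation above that explaining a change may still need human input.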

3.2.5. HandS Workflow Cycle

Figure 7: HandS Workflow Cycle

As described in the preceding sections, the HandS Tool Suite provides a framework for sustaining and enriching preservation metadata for digital objects as they are moved into, out of, and between digital repository systems. The terms digital object and preservation package typically refer to a set of files that represents a single intellectual entity, including metadata about the entity or about the files themselves. In the Hub and Spoke workflow cycle (see Figure 7), digital objects are retrieved, converted to a common profile, validated, enriched with metadata, transformed into a repository-compatible form, and ingested into a digital repository.

3.2.5.1. Workflow Overview: GET, PROCESS, PUT

Preservation packages may enter the HandS workflow in various ways: some may come from third-party applications like the OCLC Web Archives Workbench, others may be disseminated from a digital repository like DSpace or EPrints, and some will originate simply as directories of files on a computer file system. In any case, the set of files that make up the preservation package must first be gathered and organized for processing.

Objects entering the workflow from a digital repository system must first be fetched from the repository by interacting with its native dissemination routine, which will vary from repository to repository. This interaction with the repository system is facilitated by our "Lightweight Repository Create, Retrieve, Update, and Delete" Service, affectionately named LRCRUD. LRCRUD is made up of two modules: the LRCRUD Client, which runs on the same machine as the other HandS tools, and the LRCRUD Service, which runs alongside a digital repository system. To retrieve a package from the repository, the LRCRUD Client makes a request to the LRCRUD Service. The LRCRUD Service, in turn, communicates directly with the repository system and retrieves the package via the repository's native dissemination routine. The LRCRUD Service zips up the package and sends it over the network to the Client. Once the package has been received by the LRCRUD Client and verified, its contents are unzipped onto the local file system.

From there, the To-Hub Packager tool converts the digital object into what we call a Hub Package. A Hub Package is made up of the content files that constitute the digital object; METS documents containing descriptive, administrative, and structural metadata about the object at various points in time; and a single Master METS document that comprises chronological and structural information about the other METS documents. The Master METS file will contain a pointer to at least one, but potentially several other ECHO DEP METS documents, each of which serves as a snapshot of the Hub Package at some point in its lifecycle.

The ECHO DEP METS document is the heart of a Hub Package; it holds together all the files and various metadata that make up the package. When a Hub Package is created, a new ECHO DEP METS document is generated for the package. If the package already contains an older ECHO DEP METS document (generated prior to ingestion into the repository), the new METS document is compared to the older one to discover any changes or damage to the package that might have occurred while in the custody of the repository.

The ECHO DEP METS document is then enriched with technical metadata and validated against the ECHO DEP METS Profile registered with the Library of Congress (Habing, 2005). The TechMD Augmentor tool enriches the METS document with format-specific technical metadata found by analyzing each of the package's content files, and converting the result into PREMIS Object metadata. Once the package has been analyzed and enriched, the Profile Validator closely inspects the constituent files that make up the package, both data and metadata, and verifies there are no errors or inconsistencies.

At this point, the Hub Package is ready to be sent on to another repository. But first, it has to be converted into a form compatible with ingestion into the target repository, which again will vary from repository to repository. This final conversion is carried out by a From-Hub Packager tool, built specifically for the target repository. From there, the package is handed off from the LRCRUD Client to the LRCRUD Service for the target repository and ingested.
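The Hub Package structure described in this overview (content files, a series of ECHO DEP METS snapshots, and a Master METS document that points to them) can be pictured with a small data-structure sketch. The class and field names below are hypothetical and chosen only for illustration; the real package is a set of METS XML files, and the HandS tools are written in Java.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetsSnapshot:
    """One ECHO DEP METS document: the package state at a point in time."""
    path: str
    created: str  # ISO 8601 UTC timestamp; same-format strings sort chronologically


@dataclass
class HubPackage:
    """Hypothetical in-memory view of a Hub Package."""
    content_files: List[str]
    # Plays the role of the Master METS document: an ordered set of pointers
    # to the snapshot METS documents.
    snapshots: List[MetsSnapshot] = field(default_factory=list)

    def add_snapshot(self, snapshot: MetsSnapshot) -> None:
        # Snapshots are only ever appended, never merged or rewritten.
        self.snapshots.append(snapshot)

    def current_snapshot(self) -> Optional[MetsSnapshot]:
        """Return the most recent snapshot, or None for a brand-new package."""
        if not self.snapshots:
            return None
        return max(self.snapshots, key=lambda s: s.created)
```

Keeping every snapshot and designating one as current reflects the design choice, discussed later under Lessons Learned, of avoiding any merge of metadata versions into a single file.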


3.2.5.2. Workflow Example

The workflow cycle might be more easily understood by following an example preservation package as it makes its way through the process. For this example, we will use three small files that make up a single web page: an HTML file, a Cascading Style Sheet (CSS), and a JPEG image. These three files compose a single preservation package, or item, which has been submitted to a DSpace digital repository. Using the HandS tool suite, we will transfer the item from DSpace to an EPrints repository, while generating preservation and technical metadata along the way.

1. Retrieve Repository X dissemination package via LRCRUD

In our example, suppose we have an LRCRUD Service running alongside a DSpace repository on a remote server. The LRCRUD Client application sends a request to the LRCRUD Service to retrieve (GET) a package from the repository. The LRCRUD Service relays the retrieval request to DSpace using the repository's native dissemination method. The output of a repository's dissemination will typically be made up of any number of metadata streams and other supporting artifacts in addition to the item's content files. In the case of DSpace, the package will include a DSpace METS file that encompasses MODS descriptive metadata about the package and PREMIS technical metadata pertaining to each of the constituent bitstreams. In our example, the package returned by DSpace now contains four files: the HTML, CSS, and JPEG files we began with and a DSpace METS file.

The LRCRUD Service receives the DSpace dissemination and packages its contents into a zip archive, which will be transmitted over HTTP to the LRCRUD Client. The LRCRUD Service also calculates file size and a checksum value for the zip file before sending it, and transmits these values as Content-MD5 and Content-Length HTTP header fields along with the package zip file. As the LRCRUD Client receives the package zip file, it too calculates file size and checksum values, which are validated against the HTTP header fields to ensure the package was unharmed during the file transfer. Assuming the values agree, the package is unzipped and saved to disk.

2. Create Hub Package from repository dissemination files

To create a Hub Package from the repository dissemination package, the To-Hub packager needs to produce a new ECHO DEP METS document for the package. The packager begins by searching the retrieved files for any metadata included by the repository. In our example, the packager locates the DSpace METS document and retrieves its MODS descriptive metadata stream. This DSpace MODS metadata will be transformed into Aquifer MODS and inserted into the new ECHO DEP METS document's descriptive metadata section. Other repositories export metadata in different formats (e.g., Dublin Core), but in all cases the package metadata are ultimately transformed to Aquifer MODS by the To-Hub Packager.


The To-Hub Packager then creates an entry in the file section of the ECHO DEP METS document for each of the package's constituent files. In our example, the new ECHO DEP METS document will contain a file element for each of our three content files. The To-Hub Packager will also create in the new ECHO DEP METS document three PREMIS technical metadata objects to correspond to the three file elements. Each will contain basic technical metadata about one of the files, including checksum values, file size, and MIME type. Finally, any leftover descriptive and technical metadata elements from the DSpace METS document are inserted into the ECHO DEP METS document as alternate metadata so that they are never lost.

If the package contains older ECHO DEP METS documents (because it had been packaged by HandS before entering the repository), the most recent ECHO DEP METS document is compared to the just-generated ECHO DEP METS document to expose any changes the package may have undergone since it was last analyzed. These data are recorded in the new ECHO DEP METS document's provenance metadata as PREMIS events. If the package contains a Master METS document, a pointer to the new METS document is created and designated as the most-current ECHO DEP METS document for the package. If no Master METS document can be found, the To-Hub Packager creates one from scratch.

Once the new METS document has been created and the Master METS document is updated, Hub Package creation is complete. Our example Hub Package now consists of a Master METS document, which points to a single ECHO DEP METS document. This ECHO DEP METS document contains descriptive metadata about the package; technical metadata about each of the three content files, along with pointers to those files and the technical metadata left by DSpace; and provenance metadata documenting the package's export from the repository and its Hub Package transformation.

3. Generate technical metadata; augment Hub Package METS Document

Using tools from the JSTOR/Harvard Object Validation Environment (JHOVE), the HandS TechMD Augmenter module analyzes each of the Hub Package's content files and generates format-specific technical metadata for each. The JHOVE-generated metadata is transformed using format-specific XSLT stylesheets, and inserted into the technical metadata section of the ECHO DEP METS document. Any inconsistencies between the technical metadata currently held in the METS document and those generated by JHOVE are recorded in the provenance section of the ECHO DEP METS document as PREMIS validation events.

The technical metadata stored in the ECHO DEP METS document is formatted in compliance with the following metadata preservation standards: AudioMD for audio files; TextMD for text, XML, and HTML; and MIX for images. In our example the HTML, CSS, and JPEG files will each be analyzed by JHOVE. The JHOVE output for both the HTML and CSS files will be formatted as TextMD, and the output for the JPEG image will be formatted as MIX. Each will be inserted into the ECHO DEP METS document in a technical metadata element corresponding to the appropriate file element. The JHOVE analysis itself is also documented and recorded in the ECHO DEP METS document as a validation event.

One of the limitations of JHOVE is its small number of supported media types. In its current release JHOVE offers no support for closed formats such as Microsoft Office files. Another drawback of using JHOVE is that it only reports the MIME type correctly for HTML or XML files if they are well formed; otherwise it reports them as plain text, causing discrepancies within the ECHO DEP METS document and validation warnings. Nevertheless, we found JHOVE to be a useful tool for analyzing files and generating technical metadata. For more on JHOVE visit http://hul.harvard.edu/jhove/.

4. Validate Hub Package METS Document against METS Profile

The Profile Validator examines the current ECHO DEP METS document for the Hub Package against the requirements of our METS profiles currently registered with the Library of Congress (Habing 2005, 2006). Key validation points include checking to make sure that the primary descriptive metadata element contains a MODS object that conforms to the Aquifer MODS profile; that every file referenced by the file section has associated technical metadata PREMIS objects; and that all provenance metadata associated with a file contain valid PREMIS event elements. The Profile Validator also checks that the package content files referenced by the ECHO DEP METS document are accounted for, and that their checksum, file-size, and MIME-type values are correct.

Our example ECHO DEP METS document passes validation for the following reasons: it contains valid Aquifer MODS in its primary descriptive metadata element; each of its file elements references technical metadata elements containing valid and complete PREMIS object metadata; and it conforms structurally to our METS profile requirements. Once the validation has completed, the validation event itself is documented and recorded in the ECHO DEP METS document as a PREMIS validation event.

5. Create Repository Package from Hub Package

Before a repository can accept a package for submission, it must first receive a description of the package's contents. The From-Hub Packager module uses descriptive metadata extracted from the Hub Package ECHO DEP METS document to generate the repository-specific metadata needed for package submission. This process usually involves transforming the Aquifer MODS metadata found in the ECHO DEP METS document into a metadata format required for repository submission, and will vary from repository to repository. In our example, we are sending the package to an EPrints repository, which means the packager will generate an EPrints-specific metadata file from the Aquifer MODS stream. The transformation event is recorded in the ECHO DEP METS document as a PREMIS metadata-transformation event, and the newly generated metadata is added to the METS document as alternate descriptive metadata. A Repository Package zip file is then created, consisting of the Master METS document, all the subordinate METS snapshot documents and the content data files, as well as any repository-specific metadata files.

6. Send Repository Package to Repository Y via LRCRUD

At this last step in our example, we have an EPrints-specific LRCRUD Service running on a remote server with an EPrints repository. The LRCRUD Client sends a request to the LRCRUD Service to create (POST) a new package. The LRCRUD Service relays the create request to the repository and, using the repository's native methods, creates an empty record. The LRCRUD Service receives a new location identifier, or handle, corresponding to the newly created location in the repository, which it sends back to the LRCRUD Client. This location identifier is inserted into the package's ECHO DEP METS document as the primary ID for the METS document.

The LRCRUD Client then sends a request to the LRCRUD Service to update (PUT) the new package at that location. The LRCRUD Client calculates file size and checksum values for the package zip file before sending it to the Service, and it transmits these values as Content-MD5 and Content-Length HTTP header fields along with the package. As the LRCRUD Service receives the package zip file from the Client, it calculates its own file size and checksum values and validates them against the HTTP header fields to ensure the package was unharmed during the file transfer.

Once the LRCRUD Service has validated the file transfer, it unzips the package and ingests each of its contents, including the package METS files, into the repository using the repository's native submission routine. The repository-specific descriptive metadata that was generated in Step 5 above is submitted to the repository as well. Once the package has been fully ingested, the LRCRUD Service returns an update response message to the LRCRUD Client confirming the successful submission, or an error if the submission failed.

Some repositories allow for certain bitstreams to be given privileged status. In such cases the Master METS and ECHO DEP METS files may receive special status; but in all cases the METS files are preserved along with the other package content files and are treated as first-class objects with regard to the repository. That way, when the package is retrieved from the repository, none of the metadata pertaining to the state of the package before it was submitted to the repository is lost.
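The transfer check used in steps 1 and 6 (the sender attaches Content-MD5 and Content-Length values, and the receiver recomputes and compares them) can be sketched as follows. This is a minimal illustration, not the LRCRUD code: the function names are hypothetical, and the base64 encoding of the MD5 digest is an assumption based on the usual definition of the Content-MD5 header (RFC 1864), since the report does not state the encoding.

```python
import base64
import hashlib


def transfer_headers(payload: bytes) -> dict:
    """Headers the sending side attaches to a package zip file."""
    digest = hashlib.md5(payload).digest()
    return {
        # Assumption: base64 of the raw 128-bit digest, as in RFC 1864.
        "Content-MD5": base64.b64encode(digest).decode("ascii"),
        "Content-Length": str(len(payload)),
    }


def verify_transfer(payload: bytes, headers: dict) -> bool:
    """Recompute size and checksum on the receiving side and compare."""
    expected = transfer_headers(payload)
    return (
        headers.get("Content-MD5") == expected["Content-MD5"]
        and headers.get("Content-Length") == expected["Content-Length"]
    )


# Illustrative bytes standing in for a package zip file.
payload = b"example zip bytes"
headers = transfer_headers(payload)
```

The same pair of functions serves both directions, because the Client and the Service swap the sender and receiver roles between dissemination and submission.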

3.2.5.3. Workflow Recap

Through the workflow process described above, HandS provides tools to facilitate moving digital objects between multiple repositories while generating and maintaining important PREMIS-based technical and provenance preservation metadata. Digital objects are retrieved, converted to a common profile, validated, enriched with metadata, transformed into a repository-compatible form, and ingested into the target repository.
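One detail of the enrichment stage, the choice of format-specific metadata standard described in step 3 of the example, can be sketched as a simple lookup. The function name is hypothetical, and the specific XML MIME-type strings are assumptions; in HandS itself this step is driven by JHOVE output and format-specific XSLT stylesheets rather than a lookup like this one.

```python
from typing import Optional


def techmd_schema_for(mime_type: str) -> Optional[str]:
    """Return the format-specific metadata standard for a MIME type, or None."""
    normalized = mime_type.strip().lower()
    major = normalized.split("/", 1)[0]
    if major == "image":
        return "MIX"
    if major == "audio":
        return "AudioMD"
    # Text, HTML, CSS, and XML are all described with TextMD.
    # The application/* XML types listed here are illustrative assumptions.
    if major == "text" or normalized in ("application/xml", "application/xhtml+xml"):
        return "TextMD"
    # Unsupported formats (for example, closed office formats) get no
    # format-specific technical metadata.
    return None
```

For the three files in the workflow example, this gives TextMD for the HTML and CSS files and MIX for the JPEG image.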

3.2.6. HandS Technical Implementation

The key technical components of the HandS implementation are the Hub and Spoke METS profile Java classes, providing an extensible Java API to our METS profile with Apache XMLBeans; the To- and From-Hub Packager modules, facilitating interoperability through pluggable interfaces; and the Lightweight Repository CRUD Service (LRCRUD), supporting the dissemination and submission of objects by defining a protocol for transmitting digital objects to and from repository systems over HTTP.

3.2.6.1. Hub and Spoke METS Profile API

The core of the HandS Tool Suite is our METS Profile API, a Java code representation of a METS XML document compiled from our METS profile. The bulk of our METS classes were created with Apache XMLBeans (http://xmlbeans.apache.org/), a tool for generating Java classes from XML schema files (XSD files). With XMLBeans, we are able to compile XML schema documents to produce a Java code structure, allowing us to work with XML data through our own Java classes and methods. To create our METS profile API, we combine methods from XMLBeans-generated classes from the METS, MODS, and PREMIS schemas, along with format-specific preservation metadata schemas like MIX, TextMD, and AudioMD. We also layer custom-built convenience methods on top of the XMLBeans-generated methods to facilitate additional manipulation of the METS document in a fashion unique to our METS profile.

A new HandS Profile Java object can be created from scratch given a set of content files and accompanying metadata, or by instantiating an existing XML METS document that conforms to our profile. Once instantiated, the underlying METS document object can be operated upon programmatically through API calls. In working with the API, we are assured that any manipulation of the METS document will always be consistent with our METS profile. For instance, to add a new file to the preservation package, a call is made to the addFile() method, which in turn triggers calls to other methods that ensure the METS object remains consistent with our profile, such as adding a new PREMIS Object techMD section associated with the new file, and generating checksum, MIME-type, and file size values. At any time the METS object can be validated against our profile, or re-serialized as XML and saved to the file system.
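The idea behind a convenience method such as addFile(), where one call updates several parts of the document so that it stays consistent with the profile, can be illustrated with a toy model. The sketch below is in Python purely for illustration; the real API is Java built on XMLBeans, and every name here (`ProfileDocument`, `add_file`, `is_consistent`) is hypothetical rather than part of that API.

```python
import hashlib
import mimetypes
from dataclasses import dataclass, field


@dataclass
class ProfileDocument:
    """Toy stand-in for the profile object; not the real HandS class."""
    files: dict = field(default_factory=dict)    # file ID -> file name
    tech_md: dict = field(default_factory=dict)  # file ID -> basic technical metadata

    def add_file(self, name: str, data: bytes) -> str:
        """Register a file and create its technical metadata in one step."""
        file_id = f"FILE{len(self.files) + 1}"
        mime, _ = mimetypes.guess_type(name)
        self.files[file_id] = name
        # Every file entry gets a matching technical metadata record, so
        # callers cannot leave the document in a half-updated state.
        self.tech_md[file_id] = {
            "checksum_md5": hashlib.md5(data).hexdigest(),
            "size": len(data),
            "mime_type": mime or "application/octet-stream",
        }
        return file_id

    def is_consistent(self) -> bool:
        """Check the invariant: one technical metadata record per file."""
        return self.files.keys() == self.tech_md.keys()
```

The design point is that callers never write technical metadata directly; they go through the convenience method, which is what lets the API guarantee consistency with the profile.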

3.2.6.2. To- and From-Hub Packagers

To facilitate repository interoperability, the HandS Tool Suite includes a set of packager classes for transforming a collection of preservation items into a Hub Package, and for transforming a Hub Package into a form required for submission into a given repository. For items entering the Hub and Spoke from a digital repository, the repository-specific To-Hub packager takes the native repository dissemination, unpacks it, and instantiates a new HandS METS Profile object from its contents. Going the other way, a repository-specific From-Hub packager prepares a Hub Package for submission into the particular repository.

Currently, we have To-Hub packagers for processing items coming from DSpace, EPrints, OCLC's Web Archives Workbench, or from a directory of files. Our current list of From-Hub packager targets includes DSpace, EPrints, and the Library of Congress archive standard BagIt. To- and From-Hub packager modules for the Fedora repository are currently in development.

We have employed a pluggable architecture for creating packager modules. Base To- and From-Hub classes are implemented in Java as abstract classes with the intention that they will be overridden and extended by other programmers needing to tailor the HandS Tools to their specific repository or archiving standard. This modular architecture allows other developers to create packager plug-ins for their own repository systems without having to recompile or refactor the existing HandS code base.
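The pluggable design described above, an abstract base class that holds the shared logic while repository-specific subclasses supply the varying step, can be sketched as follows. This is an illustration of the general pattern in Python, not the HandS Java classes; the class names, method names, and dictionary layout are all hypothetical.

```python
from abc import ABC, abstractmethod


class ToHubPackager(ABC):
    """Hypothetical base class; each repository supplies its own subclass."""

    @abstractmethod
    def extract_descriptive_metadata(self, dissemination: dict) -> dict:
        """Repository-specific step: pull descriptive metadata from the export."""

    def to_hub_package(self, dissemination: dict) -> dict:
        # Shared logic lives in the base class; only the extraction step varies.
        return {
            "content_files": sorted(dissemination["files"]),
            "descriptive_metadata": self.extract_descriptive_metadata(dissemination),
        }


class DirectoryToHubPackager(ToHubPackager):
    """Hypothetical plug-in for a plain directory of files."""

    def extract_descriptive_metadata(self, dissemination: dict) -> dict:
        # A bare directory carries no metadata, so fall back to its name.
        return {"title": dissemination.get("directory_name", "untitled")}
```

A new repository is supported by adding another subclass; the base class and the existing plug-ins are left untouched.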

3.2.6.3. Lightweight Repository CRUD Service (LRCRUD)

The Lightweight Repository CRUD service specification defines dissemination and submission web-service interfaces to digital repository systems for use with the ECHO DEP Hub and Spoke Tool Suite. The LRCRUD specification defines a protocol for transmitting digital objects to and from repository systems over HTTP. It enables users to obtain objects in a format expected by the HandS processing scripts and supplies digital objects to repositories in a format expected by their native ingestion mechanisms. The specification is implementation-agnostic: it simply defines the parameters and responses required to enable a service implementation to communicate with the LRCRUD client application. This allows LRCRUD implementers to choose the most appropriate environment and programming languages for interacting with their chosen repository. The HandS Tool Suite currently has LRCRUD implementations for DSpace, EPrints, and Fedora.

The LRCRUD Service follows Representational State Transfer (REST) conventions. It exposes CRUD actions on repository content over the HTTP protocol. As mentioned above, CRUD is an acronym for Create, Retrieve, Update, and Delete: the basic operations that applications should implement when acting upon persistent storage like relational database management systems, file systems, and the like. The LRCRUD client communicates with the LRCRUD service via HTTP methods, status codes, and headers. The list below shows how the CRUD actions are mapped to the HTTP methods:

• Create == POST
• Retrieve == GET
• Update == PUT
• Delete == DELETE
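The mapping in the list above is small enough to express directly as an enumeration. The sketch below is illustrative only; `CrudAction` and `action_for_method` are hypothetical names, not part of the LRCRUD specification.

```python
from enum import Enum


class CrudAction(Enum):
    """CRUD actions, each valued with the HTTP method that carries it."""
    CREATE = "POST"
    RETRIEVE = "GET"
    UPDATE = "PUT"
    DELETE = "DELETE"


def action_for_method(http_method: str) -> CrudAction:
    """Look up the CRUD action for an HTTP method name (case-insensitive).

    Raises ValueError for methods outside the mapping, such as HEAD.
    """
    return CrudAction(http_method.upper())
```

A service implementation could use a lookup like this to route an incoming request to the matching repository operation.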

In most cases the LRCRUD service will reside on the same host as the repository it serves so that it has access to the repository's API. LRCRUD is essentially a "dumb" packager; it is simply a way to supply files to the remote repository in any format/configuration that it can natively ingest. In this it is similar to protocols like the Simple Web-service Offering Repository Deposit, or SWORD (Allinson, François, & Lewis, 2008), which are being adopted by repositories, and which may make the submission function of LRCRUD ultimately unnecessary.

It may be beneficial to present some descriptive step-by-step examples in order to clarify the functions of the LRCRUD components within the Hub and Spoke Tool Suite. These examples describe in detail the interactions between the client and the service.

3.2.6.4. LRCRUD Functions -- Examples

3.2.6.4.1. Dissemination

Figure 8: LRCRUD Dissemination


Dissemination (see Figure 8) is the act of retrieving an item from a repository, whereby the item is defined as an intellectual entity comprising any number of content streams, metadata streams, and other supporting artifacts. Items disseminated from a repository using the LRCRUD service are most likely bound for processing and transformation by the HandS Tool Suite To-Hub packager. The packager creates METS files conformant to the HandS profile, extracts and augments technical metadata, and records provenance information. Described below are the four major steps in negotiating dissemination:

1. The LRCRUD client submits an HTTP GET request to the LRCRUD service. The GET request provides the ID of the item desired via the LRCRUD service URL syntax.

2. The service calls the repository's native dissemination routine for the ID indicated.

3. The service receives the output from the dissemination and adds the entire content into a zip file.

4. The service returns the zip file containing the "package" to the client.
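The service-side part of these steps (steps 2 through 4) can be sketched as a function that calls the repository's dissemination routine and zips whatever it returns. This is a minimal illustration under stated assumptions: the repository routine is modeled as a callable returning a mapping of file names to bytes, and `disseminate` and `fake_repository` are hypothetical names, not part of any LRCRUD implementation.

```python
import io
import zipfile


def disseminate(item_id: str, native_disseminate) -> bytes:
    """Run the repository's routine for item_id and zip its entire output."""
    # Assumption: the native routine returns {file name: file bytes}.
    files = native_disseminate(item_id)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in sorted(files.items()):
            archive.writestr(name, data)
    # The zip bytes become the body of the HTTP response to the client.
    return buffer.getvalue()


def fake_repository(item_id: str) -> dict:
    """Stand-in for a repository's native dissemination routine."""
    return {
        "index.html": b"<html></html>",
        "style.css": b"body {}",
        "mets.xml": b"<mets/>",
    }


package = disseminate("123", fake_repository)
```

Because the service only wraps the repository's own output, it needs no knowledge of METS or of the Hub Package format; that work is left to the To-Hub packager on the client side.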

3.2.6.4.2. Submission

Figure 9: LRCRUD Submission


Submission (see Figure 9) is the act of either 1) adding an item to a repository for the first time; or 2) updating an item already in the repository. Described below is the process of adding an item to a repository for the first time. This is a two-stage process; the first stage reserves an identifier in the system, while the second actually places content in the repository.

Stage 1 – Create stub record to reserve an identifier: It is critical to note that the package itself is not uploaded as part of the POST request; rather, the POST request creates only a stub or placeholder record. The reason that the actual package is not uploaded as part of the POST is that the identifier assigned to the package by the repository needs to be embedded in the METS file which is part of the package. The typical sequence of operations to ingest a new package is to use POST to create a new placeholder record and get the identifier for that record. That identifier is then used to update provenance and other metadata that is part of the package, and then the placeholder record is updated or overwritten with the actual package using the PUT action. The major steps in this process are:

1. The LRCRUD client issues a POST request to the LRCRUD service specifying the ID of "where" to create the record (e.g. in a specific collection) if needed.

2. The service calls the repository's native item or ID creation routine.

3. The repository supplies the service with the ID for the newly created record.

4. The service responds to the client with an HTTP 201 "Created" message and returns the ID in the Location: header.

Stage 2 – Upload and ingest the item: In this stage, the item is uploaded and placed in the repository. This is exactly the same process used for updating an existing item:

1. The LRCRUD client issues a PUT request to the LRCRUD service to replace the package identified by the supplied URI. The entity body of the request will contain a zip file containing the "package" to be ingested.

2. The service unpacks the files and calls the repository's native ingestion routine.

3. The service responds to the client with an HTTP 204 "No Content" message indicating that the request was successful.
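The two-stage submission can be traced end to end with an in-memory stand-in for the service. The sketch below makes no network calls and is not an LRCRUD implementation: `FakeLrcrudService`, `submit_new_package`, the location format, and the 404 response for an unknown location are all assumptions introduced for illustration. Only the 201 response with a Location header and the 204 response come from the steps above.

```python
class FakeLrcrudService:
    """In-memory stand-in for an LRCRUD service (hypothetical, no network)."""

    def __init__(self):
        self.records = {}
        self._next_id = 1

    def post(self, collection: str):
        """Stage 1: reserve an identifier by creating an empty stub record."""
        location = f"/{collection}/{self._next_id}"  # illustrative ID format
        self._next_id += 1
        self.records[location] = None
        return 201, {"Location": location}

    def put(self, location: str, package: bytes):
        """Stage 2: replace the stub (or an existing item) with the package."""
        if location not in self.records:
            return 404, {}  # assumption: unknown locations are rejected
        self.records[location] = package
        return 204, {}


def submit_new_package(service, collection: str, build_package) -> str:
    """Create a stub, embed its identifier in the package, then upload."""
    status, headers = service.post(collection)
    if status != 201:
        raise RuntimeError(f"create failed with status {status}")
    location = headers["Location"]
    # The identifier must be written into the package metadata before upload,
    # which is why the package cannot travel with the POST itself.
    package = build_package(location)
    status, _ = service.put(location, package)
    if status != 204:
        raise RuntimeError(f"update failed with status {status}")
    return location
```

Updating an existing item skips the first stage: the client already holds the location and issues only the PUT.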


3.2.7. Lessons Learned

Below, in no particular order, are several key lessons learned during the course of developing the HandS architecture.

3.2.7.1. Merging metadata is potentially risky.

After expending much effort exploring how we might merge different versions of metadata files so as to maintain a single master metadata file, we reached the conclusion that this was a very difficult problem, and potentially dangerous in terms of data loss. This realization led us to our current architecture, which skirts the issue of merging metadata into a single file by maintaining multiple Snapshot METS files all referenced from a common Master METS file.

3.2.7.2. METS supports multiple metadata formats well.

Combining PREMIS, MODS, and other XML-based technical metadata formats into a single METS document worked well for this particular project. The general structure of METS seemed to lend itself to constructing preservation packages. Our conceptual model, which was directly influenced by the METS and PREMIS structures, consisted at a high level of the intellectual entity having one or more representations. These representations and all their component parts were the primary focus of the preservation efforts. The METS file itself is treated as the abstract parent representation of the intellectual entity. However, there are also one or more concrete representations consisting of each structMap within the METS file. These representations consist of the relationships embodied in the structMap (and possibly the related structLink sections); the files and bitstreams referenced from the structMap; and the associated descriptive metadata (dmdSec), which could be referenced via the structMap or via individual files or bitstreams. All remaining parts of the METS document, primarily the header (metsHdr) and administrative metadata (amdSec) sections, are not considered part of the intellectual entity's representations but are, instead, metadata about these representations, mostly concerned with preservation and thus having a strong focus on technical and provenance metadata. There were pragmatic challenges in getting these disparate metadata standards to work together, however, and the next subsection conveys one such example.

3.2.7.3. Implementing PREMIS in METS requires high-level structural decisions.

Embedding PREMIS metadata within a METS package was not an intuitive undertaking. There were several reasons for this. Among these were various overlaps in the metadata fields supported by each standard. When faced with these overlaps our general approach was to provide the metadata in both places. Although this approach introduced duplication and the opportunity for inconsistencies into the metadata, we feel the added flexibility in processing compensated for these shortcomings. Moreover, the HandS tool suite validation steps ensured that these types of inconsistencies were not present.

Decisions were also required as to where to embed the PREMIS entities within the METS file. While these entities are clearly all administrative metadata, they do not always fit neatly within one of the four subgroups, techMD, digiprovMD, sourceMD, or rightsMD, provided by METS. Refer to the ECHO DEP METS profiles (Habing, 2005) for details. Project staff also participated in a working group chaired by Rebecca Guenther at the Library of Congress to address this issue. The working group produced a report, Guidelines for using PREMIS with METS for exchange (Guenther, 2008).

3.2.8. Next Steps: the Hub and Spoke

Development of the Hub and Spoke tool suite is ongoing. The latest versions of the source code can be downloaded from the project's SourceForge website: http://sourceforge.net/projects/echodep/. Recent developments include the addition of a From-Spoke for the BagIt specification (Boyko, Kunze, Littman, & Madden, 2008) and modifications to support version 1.5 of DSpace. Work is continuing apace on both From- and To-Spokes for the Fedora repository with particular attention being paid to how our METS profile might be accurately mapped to a Fedora content model, reducing the need for potentially lossy mappings as have thus far been required for other repository systems. The project is also looking at other potential repositories, such as LOCKSS or CONTENTdm, for Spoke development. In addition to developing new Spokes, we are also monitoring developments with the next version of JHOVE, as well as with the Global Digital Format Registry (GDFR), to explore how these tools might be used to enhance the format-specific technical metadata we are currently generating for different file types.

3.2.8.1. Supporting Preservation Now and in the Future

The Hub and Spoke (HandS) framework enhances the interoperability and preservation features of existing open-source repository systems. It provides a suite of tools to facilitate moving digital objects between repositories more easily while supporting the collection of technical and provenance information crucial for long-term preservation. It is intended to support curators' efforts today to manage content in multiple repository systems and to preserve valuable preservation data in accordance with emerging digital preservation standards.

In the long term, however, we see the need for the next generation of digital repositories to do more in order to support our ability to preserve the meaning of the digital objects maintained in repositories. Current repository systems preserve the structures of digital objects, from which meaning or semantics must be inferred. Learning from real-world data migration examples from the HandS efforts, GSLIS and NCSA researchers are working to model how semantic inference capability may help next-generation archives preserve the meaning (not just the structures) of digital objects and head off longer-term preservation risks. Specifically, we are developing automated reasoning techniques targeted at identifying, and eventually correcting, problematic metadata descriptions. This work is profiled separately in the next section (Section 4).

3.2.9. Conclusion

With digital preservation still in its infancy, many changes to emerging standards, strategies, and methodologies can be expected in the coming years. The Hub and Spoke framework provides a model that attempts to incorporate current technologies and best practices from the field to support digital preservation in current repository environments. It implements METS and PREMIS to provide a standards-based method for packaging content that allows digital objects to be moved between repositories more easily while supporting the collection of technical and provenance information crucial for long-term preservation. HandS is intended to help curators of digital objects today by providing improved support for preservation and interoperability to existing repository systems. Ultimately, in order to meaningfully preserve our digital content over time, we will need the next generation of preservation tools to support automatic inference of meaning, or semantics, from changed (and thus potentially ambiguous) information structures.


4. Preserving Meaning, Not Just Objects: Semantics and Digital Preservation

A key goal of the ECHO DEPository project is to investigate both practical solutions for supporting digital preservation activities today, and the more fundamental research questions underlying the development of the next generation of digital preservation systems. Earlier in this report, we reviewed two areas of activity that aim to support on-the-ground preservation efforts in existing technical and organizational environments: the Web Archives Workbench, a suite of tools to help curators collect and manage web-based digital resources; and the HandS tool suite, which aims to enhance existing repositories' support for interoperability and emerging preservation standards.

In the longer term, however, we recognize that successful digital preservation activities will require a more precise and complete account of the meaning of relationships within and among digital objects. This section describes project efforts to identify the core underlying semantic issues affecting long-term digital preservation, and to model how semantic inference may help next-generation archives head off long-term preservation risks.

4.1. Introduction: The Need for a Semantics of Preservation Approach

4.1.1. The Preservation Semantics Problem

Like any information management activity, digital preservation efforts are guided by human understanding. Decisions about documenting a file format, emulating an environment, or migrating from one system to another are made with an understanding of how levels of digital expression cascade and interrelate: voltage, bit, octet, pointer, integer, grapheme, pixel, polygon, color, pitch, text string, tree, image, tuple, file, and so on. The complexity of these relationships poses few serious problems for human beings; in fact, the problems lie precisely in the ease with which our minds interpret those relationships. Long-term preservation is distributed not only over time but also across the responsibilities of many different people. It is directed at collections much too large to allow thoughtful attention to individual resources. We must therefore build into our tools a more careful and precise encoding of the knowledge that guides our effortless mental deductions. The preservation hazards that result from current descriptive practice and our experiments with automated tools to ameliorate those risks are described in the sections that follow.


4.1.2. Our Goal

Our goal is to understand better the semantic problems arising in digital preservation, and how we might apply that understanding to the development of resources and tools. Specifically, we are experimenting with automated inferences about entities, their properties, the relationships within and between them, and how these facts are expressed in metadata descriptions. Enriching that metadata with new deduced assertions is one step in heading off digital preservation risks. We are working toward a deductive system for reasoning about anomalous or incomplete metadata. The aim is not to automatically deduce all missing information or to correct malformed records, but to call human attention to descriptions that are problematic or suspicious.

Our work begins with an analysis of the kinds of semantic problems posed by current descriptive practice and metadata schemas, informed by analyses of real-world data migration examples. We have applied the understanding gained in this analysis to the development of a draft metadata ontology (discussed in Section 4.3.3), which moves us toward a more formal understanding of how descriptive information about archived digital resources is structured. This metadata ontology is key to a proof-of-concept experimental system composed of the RDF repository Tupelo and the BECHAMEL reasoning software.

4.2. The Problems: Understanding Semantic Preservation

4.2.1. Problems Posed by Descriptive Practice and Structures

In many preservation efforts metadata description may seem straightforward, but crucial information, including facts that seem obvious at first glance, is left unstated and must be inferred by human readers. (An example is provided in Section 4.2.1.1.) As discussed above, this situation may not be risky when people are available to reason about individual records, but a manual, human-based approach does not scale over large collection sizes or over time. The sheer volume of digital information means we increasingly rely on automated machine processing of records. But software tools execute transactions using only knowledge that has been explicitly represented for them. Our aim therefore is to make those unstated facts available in a form that software can use. This work begins with an investigation of the kinds of semantic problems posed by current information structures and implementations. These problems break down into three basic categories:

• Semantic problems relating to descriptive practice
• Semantic problems relating to encoding standards
• Semantic problems relating to metadata schema design

4.2.1.1. Semantic problems relating to descriptive practice.

Some of the problems we face are a result of how resources are described using metadata, while other problems arise by way of how those descriptions are expressed, and what happens to them over time as they are migrated from one system to another. One semantic problem of particular interest to us is what Renear et al. (2002) describe as "ontological variation in reference." Essentially, metadata can fail to make critical distinctions in what, precisely, it is describing. The problem is illustrated in the metadata example below, which shows properties asserted at a number of different levels of abstraction.

Figure 10: Example of Multiple Levels of Abstraction in Metadata Description

We see in this example properties of the image itself (like its title and subject matter in lines 8 and 23) described alongside properties of the file which encodes the image (its MIME classification in lines 2 and 12), properties of the metadata description (its creation date in line 28), and properties of the repository software object that expresses the metadata description (e.g., that it disseminates resources, and has particular datastreams associated with it; lines 1, 3, 7, 11, 18, and 22).

The main preservation risk proceeding from this mixing of levels is the inability to distinguish the level at which a particular property applies, because the semantic information needed to do so is absent from the description. For example, what is it exactly that has a MIME classification image/jpeg? Is it the Fedora record or is it one or both of the datastreams? A human reader can easily resolve that kind of ambiguity without conscious effort, but preservation transactions (such as migration), which are typically executed through software, cannot. The preservation aim here is presumably to preserve access to the image. That aim may or may not depend on preserving the JPEG file expressing the image, and preserving the Fedora object that expresses the metadata is almost certainly not a requirement.

This example therefore illustrates the problem of mixed levels of description. We need to clarify and enrich metadata descriptions by linking their assertions explicitly to the appropriate entities, or else draw the attention of human analysts to records that cannot be disambiguated automatically.
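The regrouping called for above can be pictured with a small sketch. The Python fragment below is illustrative only: the property names, the sample values, and the LEVEL_OF lookup table are assumptions made for this example, and none of them is taken from Fedora or from the record in Figure 10. It takes a flat description in which every property hangs off one repository object, regroups the properties by the entity each one plausibly describes, and sets aside any property it cannot place so that a human analyst can review it.

```python
# Hypothetical flat description: every property is attached to the single
# repository object, as in the mixed-level record discussed above.
# All values here are invented for illustration.
FLAT_RECORD = {
    "subject": "info:fedora/changeme:97",
    "properties": [
        ("dc:title", "Aerial photograph"),
        ("mimeType", "image/jpeg"),
        ("dc:date", "Scanned and processed: 2005-03-01"),
        ("disseminates", "info:fedora/changeme:97/DS1"),
        ("dc:rights", "unknown"),
    ],
}

# Assumed lookup table stating which level of abstraction each property
# applies to. In practice this is exactly the knowledge that records leave
# implicit; here it is written down by hand.
LEVEL_OF = {
    "dc:title": "image",
    "mimeType": "file",
    "dc:date": "metadata description",
    "disseminates": "repository object",
}


def split_by_level(record):
    """Group properties by level; collect the ones that cannot be placed."""
    levels = {}
    unresolved = []
    for prop, value in record["properties"]:
        level = LEVEL_OF.get(prop)
        if level is None:
            # No rule applies: leave the decision to a human analyst.
            unresolved.append((prop, value))
        else:
            levels.setdefault(level, []).append((prop, value))
    return levels, unresolved
```

Calling split_by_level(FLAT_RECORD) returns one group per level plus a list holding the dc:rights pair, which the table does not cover. The point of the sketch is the second return value: properties that cannot be assigned are reported rather than guessed at.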

4.2.1.2. Semantic problems relating to encoding standards.

In addition to problems of descriptive practice, we face semantic problems stemming from limitations of the encoding technologies in which metadata descriptions are expressed. These problems generally fall into one of the following two categories:

4.2.1.2.1. Syntactic overloading in conventional markup.

Familiar encoding technologies for metadata descriptions, such as those based on XML, work well most of the time, but they have certain fundamental problems, such as those stemming from the use of multiple competing semantic relationships and of unstructured data. Specifically:

• Competing semantic relationships: Preservation metadata formats typically overload a simple syntax with multiple competing semantic interpretations. Typical examples include XML applications where a small number of syntactic relationships (e.g., the parent/child relationship between elements) represent any number of semantic relationships (whole/part, property name/value, etc.) that are context dependent. Often a precise interpretation of these semantics can be found only in the execution of application software that consumes the file and, presumably, in the mind of the programmer who wrote the application.

• Unstructured data: The information in resource descriptions may be only incompletely available for machine processing and verification. Crucial contextual data may exist only as natural language annotations or as unstructured information in the content of metadata fields.
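The first of these problems can be made concrete with a short sketch. In the Python fragment below, the XML document and the SEMANTICS table are both invented for illustration. The parser sees three parent/child pairs that are syntactically identical; which of them means "whole/part" and which means "property name/value" is known only to the hand-written table, which stands in for the application code (and the programmer) mentioned above.

```python
import xml.etree.ElementTree as ET

# Invented example document: <title> and <chapter> are both children of
# <book>, but one is a property of the book and the other is a part of it.
DOC = """
<record>
  <book>
    <title>Moby-Dick</title>
    <chapter>Loomings</chapter>
  </book>
</record>
"""

# Assumed, application-specific interpretation of each parent/child pair.
# Nothing in the XML itself carries this information.
SEMANTICS = {
    ("book", "title"): "property name/value",
    ("book", "chapter"): "whole/part",
}


def parent_child_pairs(root):
    """Yield (parent tag, child tag) for every nesting in document order."""
    for parent in root.iter():
        for child in parent:
            yield parent.tag, child.tag


def interpret(xml_text):
    """Label each syntactic nesting with its semantic relationship, if known."""
    root = ET.fromstring(xml_text.strip())
    return [
        (parent, child, SEMANTICS.get((parent, child), "unknown"))
        for parent, child in parent_child_pairs(root)
    ]
```

The nesting of book inside record comes back labeled "unknown", because the table says nothing about it. A migration tool working from the markup alone is in the same position for every pair.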

The metadata example presented in Figure 10 does not exhibit problems of syntactic overloading, because it conforms to a standard serialization of the RDF abstract model in which properties and relationships are explicitly identified. But the second problem is evident in how much information in this description is expressed in natural language text and annotations. (Note, for example, the dc:date element (line 28), in which the documented event (scanning and processing) is prepended to the date string.)
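To show what recovering structure from such a field might involve, the following Python sketch separates a free-text event label from an embedded date. It assumes, purely for illustration, that the date is written as YYYY-MM-DD; the function name and the sample strings used with it are hypothetical and are not taken from Figure 10. When no valid date can be found, the function returns the text unchanged with no date, so that the value can be flagged rather than silently misread.

```python
import re
from datetime import date

# Assumed date notation for this sketch: four-digit year, month, day.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def split_event_and_date(value):
    """Return (event text, date) for a free-text date field.

    The date is None when no valid YYYY-MM-DD date is present.
    """
    match = DATE_PATTERN.search(value)
    if match is None:
        return value.strip(), None
    year, month, day = (int(group) for group in match.groups())
    try:
        parsed = date(year, month, day)
    except ValueError:
        # Looks like a date but is not one (e.g. month 13).
        return value.strip(), None
    # Whatever surrounds the date is treated as the event description.
    remainder = value[:match.start()] + value[match.end():]
    return remainder.strip(" :;,-"), parsed
```

For an input such as "Scanned and processed: 2005-03-01" the function separates the event text from the date. It does not interpret the event itself; doing so would require exactly the kind of explicit vocabulary that the free-text field lacks.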

4.2.1.2.2. Problems with object models.

Other potential semantic problems stemming from limitations of metadata encoding technologies concern the object models of repository systems themselves. Modeling decisions in repository design can create descriptive artifacts that leave their mark even after record migration. For example, a repository may mingle information about repository objects with the information that the repository objects are meant to preserve, creating problems when those records are further processed and contextual information is no longer available to help interpret the records and make further preservation decisions.

A good illustration of this issue can be seen by revisiting the metadata description in Figure 10, which was serialized from triples that were extracted from the RDF database backing a Fedora repository installation. In Figure 10, notice that in RDF terms this entire metadata description is "about" an object identified as info:fedora/changeme:97 (line 1). This repository software object is the only resource identified by an rdf:type arc, and is therefore the only entity with an object class identification. Barring any explicit type identification in a resource description, Fedora objects seem to be the only kind of thing that the Fedora repository knows about. With records expressed in that form, we cannot preserve any information except Fedora records, and those records assert no explicit preservation targets. A system like Fedora can preserve objects within the context of its own transactions, but the implicit knowledge directing such operations depends on the interpretation of programmers, with all the problems discussed so far.

On the other hand, it is not a design flaw of Fedora that its metadata record is centered internally on the Fedora digital object. Preservation ontology is properly a matter of descriptive practice, not software engineering. In fairness, our metadata example comes from a migration scenario in which RDF triples are extracted from Fedora's RDF store directly, rather than through a conventional export process. But this example serves to remind us that object modeling in a system such as Fedora plays the same role to the same ends as with other kinds of software: efficient source code management by and for the system developers. Object modeling decisions are not intended and cannot be expected to address the weaknesses of resource analysis and description.

For long-term preservation, therefore, it is important to reduce ambiguous or implicit semantics in repository object models. That can mean either modifying those models or, as we have attempted, providing tools and techniques for migrating from repository object models to models that include better representations of preservation targets.
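The observation that only the repository object carries an rdf:type arc suggests a simple diagnostic, sketched below in Python. Triples are represented as plain tuples; apart from the identifier info:fedora/changeme:97 and the MIME type image/jpeg, which appear in the discussion above, every name and value (the ex: prefix, the datastream identifier, the title) is an invented placeholder. The function lists subjects that are described but never typed, which is one rough signal that a record has no explicit preservation target.

```python
RDF_TYPE = "rdf:type"

# Invented triples in the spirit of the Fedora example: only the
# repository object is given a class.
TRIPLES = [
    ("info:fedora/changeme:97", RDF_TYPE, "ex:FedoraObject"),
    ("info:fedora/changeme:97", "dc:title", "Aerial photograph"),
    ("info:fedora/changeme:97/DS1", "ex:mimeType", "image/jpeg"),
]


def untyped_subjects(triples):
    """Return, in sorted order, subjects that have no rdf:type statement."""
    typed = {s for s, p, _ in triples if p == RDF_TYPE}
    subjects = {s for s, _, _ in triples}
    return sorted(subjects - typed)
```

A check of this kind cannot say what the untyped resources are. It only identifies where class information is missing, which is the information a curator or a later inference step would need to supply.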

4.2.1.3. Semantic problems with published metadata schemas.

Finally, in addition to semantic problems relating to descriptive practice and to encoding standards, we see semantic problems stemming from limitations of published metadata schemas themselves. Published schemas formalize element sets on which the property/value ascriptions are based. Each of these metadata schemes not only expresses its unique view of the universe but is itself grounded in basic ontological assumptions. A variety of ambiguities can still arise, as illustrated below, drawing again on our running example from Figure 10.

We need to begin by understanding the logical parts of the metadata record and their relationships to one another. A metadata record describes some entity: an instance of a class like the class of books, images, or audio recordings. Metadata descriptions list properties of that entity, each of which has a value. For instance, the "author" property of a book might take as its value the name of the author. Membership in a class requires that the instance respect defined class constraints (movies, for example, have running times, but books do not).

Consider the metadata statement <dc:type>image</dc:type> (line 9) from our original example in Figure 10. We easily recognize that the word "image" points us to an instance of a class, just as the name of an author points us to a particular person. A human reader would never conclude that a book was authored by a name or by a string expressing a name. Similarly, the word "image" licenses our inference that the property value for dc:type is a class of entities in the world rather than, for example, a quantity (such as 14 centimeters) or a quality (like monochrome). In this case we are cued to the existence not of just any entity, but of the very target of our preservation efforts: something much more important to us in the long run than the digital file or the bit sequence that only expresses this image contingently.

Computer software cannot make those kinds of meaningful distinctions without help. One kind of help would be a constraint on the range of allowable property values, but the Dublin Core element schema enforces no such constraint: dc:type can take any value that indicates the "nature or genre of the resource" (DCMI Namespace for the Dublin Core Metadata Element Set, Version 1.1, 2008). A second kind of help would be a value string that, through its machine-readable structure or notation, identifies a class. In an RDF expression this would be a URI linked by an rdf:type property to some class declaration. The DCMI Type Vocabulary has this structure (http://dublincore.org/2008/01/14/dctype.rdf), and interestingly, the scope note for dc:type in the DC Elements RDF schema recommends the use of that vocabulary (http://dublincore.org/2008/01/14/dcelements.rdf).

Had the author of our metadata description used the URI dcmitype:Image instead of the word "image," we would be one step closer to identifying the abstract image as an entity. The word "image," although it contains the same sequence of letters, is not linked in a standardized way to the declaration of a class. Assigning a DCMI type resource to the dc:type element simplifies the inference that an image exists, and that one or more of the metadata statements in that description are ascribing properties of an image; semantically, this is a significant step. But as our running example stands, the schema's flexibility invites ambiguity, and additional information is necessary to connect the literal value "image" with a formalized class such as dcmitype:Image.
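The difference between the literal "image" and the class URI can be illustrated with a small check, sketched here in Python. The namespace and the twelve class names are those of the DCMI Type Vocabulary; the function name and the three result labels are our own assumptions for this example. A value that is already a vocabulary URI is accepted as a class. A bare word that merely matches a class name is reported as ambiguous, together with the URI it might have been meant to stand for, and anything else is left unrecognized.

```python
DCMITYPE_NS = "http://purl.org/dc/dcmitype/"

# Class names defined by the DCMI Type Vocabulary.
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}

# Case-insensitive index used only to suggest a candidate class.
_BY_LOWERCASE = {name.lower(): name for name in DCMI_TYPES}


def classify_dc_type(value):
    """Classify a dc:type value as a class URI, an ambiguous literal, or neither.

    Returns (label, uri), where uri is the matching or suggested DCMI type
    URI, or None when nothing matches.
    """
    if value.startswith(DCMITYPE_NS) and value[len(DCMITYPE_NS):] in DCMI_TYPES:
        return ("class", value)
    candidate = _BY_LOWERCASE.get(value.strip().lower())
    if candidate is not None:
        # Same letters as a class name, but no standardized link to it.
        return ("ambiguous-literal", DCMITYPE_NS + candidate)
    return ("unrecognized", None)
```

The sketch deliberately stops at a suggestion. Replacing "image" with the class URI is a judgment about what the record's author meant, and in keeping with the aim stated in Section 4.1.2 that judgment is left to a person.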

4.2.2. Understanding the Semantic Preservation Problem: Summary

We have seen in this section that descriptive practice, encoding standards, and published specifications may all complicate digital object preservation. Imprecise resource descriptions can make it impossible to determine the level at which a particular property applies. The flexibility offered by encoding standards brings risks as well as benefits. We have also seen how object modeling decisions and semantically underspecified metadata schemas can lead to incorrect or ambiguous usage. In the next section, we move from understanding the core semantic problems associated with descriptive practice and structures to looking at the resources and tools being developed by the ECHO DEPository project to identify semantic ambiguity in real-world metadata descriptions and to highlight potential preservation risks.

4.3. Toward More Capable Archives and Repositories

4.3.1. Recap: The need for automated inference capability

Digital resource preservation efforts are distributed not only over time but typically across the responsibilities of people who may never consult with one another. Transactions like migration between systems are executed over large collections where close attention to individual records is too expensive, but where correct treatment of a resource often depends on knowledge that is incompletely or imprecisely represented in preservation metadata. Such ambiguities present few problems for human beings: our flexible minds make correct inferences without conscious effort. But the data to support those inferences are not expressed in a form that can guide the execution of our programs and utilities. We therefore need tools and methods that support the discovery and correction of preservation risks.


The next section describes our experiments in developing these methods and tools. Specifically, we look at our development of an ontology of metadata descriptions, the Resource and Description Vocabulary, and how this may be applied to identify semantic ambiguity in metadata descriptions using the reasoning tool BECHAMEL.

4.3.2. BECHAMEL and Building a Metadata Ontology

BECHAMEL is a tool for expressing and testing semantic models of digital resources (Dubin et al., 2003). It has been developed by researchers at the University of Illinois, the World Wide Web Consortium, and the University of Bergen. A BECHAMEL application can, for example, translate the bibliographic metadata for a journal article from one standard format into another by constructing a model of the author's affiliation with an institution (Renear & Dubin, 2003). In our recent and current experiments the inputs to BECHAMEL are metadata descriptions retrieved from an RDF repository (Tupelo), together with schemas defined in the OWL Web Ontology Language (OWL, 2004). New facts deduced from those inputs are added back to the repository as annotations to the description. A technical overview of this approach is presented in the following sections.

4.3.3. Overcoming Semantic Problems in Metadata Encoding: A Resource and Description Vocabulary

Our aim is to enrich metadata with new assertions inferred from existing resource descriptions. Toward that aim we have identified classes, properties, and relationships for overcoming encoding problems, and we have expressed these in a schema. This vocabulary does not represent classes or properties for specific types of resources. Instead, it offers an ontology of metadata descriptions themselves. Simply stated, the vocabulary includes terms that can be used to describe records, metadata descriptions, and relationships between them and preservation targets. (The Resource and Description Vocabulary is provided in Appendix 6.7.) More specifically, the vocabulary is divided into the following sections:

• W3C standard classes and properties
  o These include classes and properties such as rdfs:Resource, rdf:Statement, and owl:ObjectProperty.

• Alternate reification classes
  o Conventional use of the RDF reification vocabulary is based on an understanding that triples stand in a type/instance relationship with "tokens" appearing in RDF documents (RDF Semantics, 2004). But this interpretation, intended to support provenance documentation, presents puzzles for understanding how a serialized expression can stand in direct relationships with resources referred to by an abstract triple. (For those analysts who may be concerned with abusing the official account of RDF reification, the vocabulary includes separate classes for generalized statements, RDF statements, and abstract triples.)

• Indication relationships
  o This section includes a group of hierarchically organized relationships, based on recommendations in Piotr Kaminski's 2002 thesis. The relationships include indication, representation, denotation, identification, description, depiction, ascription, expression, and encoding.

• Classes based on the DCMI Abstract Model
  o In the Dublin Core Abstract Model, the term "metadata element" is used synonymously with the term "property." But our classes, though based on that model, represent metadata elements as specialized names of properties, rather than as properties themselves. Classes in this section include metadata element, metadata element set, metadata statement, and metadata description.

• Markup structures
  o A third alternative to reifying RDF statements under the official W3C interpretation, or through use of alternate classes, is to reify the notation expressing the RDF. This section of the vocabulary includes classes for XML elements, XML documents, XML schemas, XML attributes, and URIs.

In summary, the Resource and Description Vocabulary is an ontology of metadata descriptions themselves. Its aim is to provide a semantically sound framework for overcoming the encoding problems described in Section 4.2 of this report. The next section walks through a demonstration of how this ontology, as used by BECHAMEL, can help to highlight potential preservation risks.

4.3.4. Resolving Semantic Ambiguity: an Inference Example

In Section 4.2 of this report we discussed problems of descriptive practice, encoding standards, and schema design. Now we present an illustration of how our inferencing software responds to those problems. In the example below, an ambiguous metadata statement from the record shown in Figure 10 is identified and associated with the implied preservation target it describes. Figure 11 below shows one RDF statement extracted from Figure 10, our running example:

Figure 11: A fragment of the record shown in Figure 10.


This RDF statement view shows the string value "image" assigned to the Dublin Core type property for the Fedora object identified as info:fedora/changeme:97, as we discussed earlier when viewing the original markup record (Figure 10). The main issue is one of clearly identifying the target of our preservation efforts: an image in this case. Summarizing the concerns discussed earlier:

o The Fedora object is an amorphous resource, which seems to share properties of the image itself, the image content, and the bitstream encoding the image. The Fedora object cannot, therefore, be our preservation target.

o According to the formal schema definition, the Dublin Core Type property indicates "the nature or genre of a resource," but need not identify the existence of any particular concrete object or abstract entity. As already seen, this vagueness in the formal schema opens the door to the use of values (such as the literal string "image") that are clear to human readers but which pose problems for machine processing.

o Although the word "image" invites a human reader to infer that our preservation target is an image, that information is not explicit enough to support automated processing. The inference depends not only on word meaning but also on the tacit background knowledge that the property value must in this case be a class (rather than, for example, a quality, quantity, or name).

To recap, then, this image (Figure 11) illustrates the relationships between the Fedora object, the DC element "type", and the value "image" ambiguously expressed in the original record (Figure 10). In the next step, we begin to clarify these relationships. Figure 12 below shows the first inference stage:

Figure 12: BECHAMEL has identified the fragment as a metadata statement.


This RDF statement shows that the original dc:type arc has been identified as a metadata statement, and a new TAG URI has been generated to denote that statement. Although this stage of the processing began with conventional RDF reification, our assignment of rdf:subject, rdf:predicate, and rdf:object properties to our new Metadata_Statement instance is a departure from orthodox RDF semantics. This first stage of inference processing has identified the metadata statement. In the next stage we take this a step further to identify the preservation target.

Figure 13: BECHAMEL has identified the metadata statement as a description of an image.

Figure 13 shows the identification of the preservation target. The system infers that this metadata statement must be describing an abstract image that has both a class identity and an object identity distinct from the JPEG file, the bitstream encoding that file, the geography depicted in the image, and the Fedora object that serves as the locus for property attributions at all those levels. In addition, the metadata statement is identified as belonging to a metadata description. Identifying the preservation target should simplify the validation of later preservation transactions, making it easier to verify that essential properties persist across migrations and through translations from one format to another.
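The two inference stages shown in Figures 12 and 13 can be approximated in a few lines of Python. This is a sketch of the general idea only and is not BECHAMEL code. Triples are plain tuples, the rdv: prefix and the property rdv:describes are invented stand-ins for terms in the Resource and Description Vocabulary, and the tag URIs are minted under the placeholder authority example.org. The first function reifies each dc:type arc as a metadata statement; the second infers that a statement whose object is the word "image" describes a distinct abstract image.

```python
from itertools import count

DC_TYPE = "dc:type"

# The statement from the running example (Figure 11).
RECORD = [("info:fedora/changeme:97", DC_TYPE, "image")]


def tag_minter(prefix):
    """Generate fresh identifiers; the authority and date are placeholders."""
    return (f"tag:example.org,2008:{prefix}-{n}" for n in count(1))


def reify_metadata_statements(triples, minter):
    """Stage 1: denote each dc:type arc by a new Metadata_Statement node."""
    inferred = []
    for subject, predicate, obj in triples:
        if predicate == DC_TYPE:
            stmt = next(minter)
            inferred += [
                (stmt, "rdf:type", "rdv:Metadata_Statement"),
                (stmt, "rdf:subject", subject),
                (stmt, "rdf:predicate", predicate),
                (stmt, "rdf:object", obj),
            ]
    return inferred


def infer_targets(statement_triples, minter):
    """Stage 2: posit an abstract image for statements whose value is 'image'."""
    inferred = []
    values = {s: o for s, p, o in statement_triples if p == "rdf:object"}
    for stmt, value in values.items():
        if value.strip().lower() == "image":
            target = next(minter)
            inferred += [
                (target, "rdf:type", "dcmitype:Image"),
                (stmt, "rdv:describes", target),
            ]
    return inferred
```

Running reify_metadata_statements(RECORD, tag_minter("stmt")) and passing its result to infer_targets with a second minter yields a new image resource that is typed as dcmitype:Image and is distinct from the Fedora object. The original triple in RECORD is never modified; both stages only add statements.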

4.3.5. Automated Inference as a Preservation Service

The ontology and the inferences that it supports allow us, even in cases where metadata records are terse and incomplete, to recover important distinctions, such as the distinction between a person and the metadata record describing that person. This knowledge is expressed in a portable syntax (RDF/OWL) with explicitly defined semantics, so it can be maintained without having to modify the original record or transform it into another syntax (either of which could introduce further preservation risks). Indeed, BECHAMEL's ability to read from and write to RDF databases (using Tupelo) means that it can read metadata records, apply rules and infer new assertions, and write those assertions back to the RDF database without altering the original records in any way.

The "open world" of RDF/OWL means that automated inference can become a part of the preservation process without requiring that we redesign and reimplement institutional repositories to accommodate it. Instead, inference is a kind of service that can be used alongside those tools to head off preservation risks and fill gaps in representation. The next section looks more closely at the architecture and proof-of-concept implementation of an archive that augments an institutional repository with inference capabilities and services.
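The read, infer, and write-back cycle described above can be sketched as a small forward-chaining loop. The Python below is an assumption-laden illustration, not a description of how BECHAMEL or Tupelo is implemented: rules are ordinary functions over a set of triples, the ex: property names are invented, and the "database" is a frozen set. What the sketch is meant to show is the separation the text describes, in which the original statements are only read and everything inferred is returned as a separate set of annotations.

```python
def run_inference_service(originals, rules):
    """Apply rules until no new triples appear; return only the additions."""
    known = set(originals)      # working copy; the originals are never changed
    annotations = set()
    changed = True
    while changed:
        changed = False
        snapshot = frozenset(known)
        for rule in rules:
            for triple in rule(snapshot):
                if triple not in known:
                    known.add(triple)
                    annotations.add(triple)
                    changed = True
    return annotations


def flag_literal_dc_type(triples):
    """Hypothetical rule: a dc:type value that is not a URI needs review."""
    for subject, predicate, obj in triples:
        if predicate == "dc:type" and not obj.startswith("http://"):
            yield (subject, "ex:needsReview", "dc:type value is a plain literal")


def route_flagged_records(triples):
    """Hypothetical rule: anything flagged for review goes to a curator queue."""
    for subject, predicate, _ in triples:
        if predicate == "ex:needsReview":
            yield (subject, "ex:reviewQueue", "curator")


ORIGINALS = frozenset({("info:fedora/changeme:97", "dc:type", "image")})
```

Calling run_inference_service(ORIGINALS, [route_flagged_records, flag_literal_dc_type]) returns two new triples, the second of which depends on the first, and leaves ORIGINALS as it was. Because the additions are kept apart from the source statements, they can be stored as annotations, reviewed, or discarded without touching the record that was ingested.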

4.4. System Architecture

We respond to the practice, standardization, and technical problems previously outlined in two ways:

o First, we design our systems for a world where metadata will vary greatly in their completeness, expressivity, and consistency. Preservation risks will arise, and we build tools with the aim of ameliorating those problems.

o Second, we propose an architecture for repositories that we hope will support more effective resource description and encoding: one that includes capabilities and services that will be needed in the next generation of digital content management systems.

4.4.1. Architecture: Overview

The proposed architecture augments typical institutional repository architectures with two new capabilities:

o The ability to manage not just bitstreams and associated metadata, but also associated semantics, expressed in standard RDF and OWL syntax.

o Automated services for detecting and/or correcting semantic ambiguity in metadata descriptions.

4.4.1.1. Architecture: the Tupelo model

Tupelo is a middleware component providing semantic content management for distributed, heterogeneous applications. By middleware, we mean that Tupelo provides abstractions (known as contexts) that encapsulate different storage and retrieval technologies for data and metadata, including file systems, web services, relational databases, and RDF stores. By way of these contexts, applications can exchange RDF statements and access raw octet streams associated with them. Tupelo can therefore serve the same role as a content management system (CMS) or institutional repository. But Tupelo differs from these systems in making only minimal assumptions about the structure of the information it manages, allowing applications to encode that structure as explicit RDF statements. RDF's open-world assumption and use of Uniform Resource Identifiers mean that Tupelo can assemble descriptions from multiple, independent sources, even if those sources are not otherwise coordinated.

Tupelo was originally designed to support science applications where data is produced, processed, and transformed by multiple people and software components. Such applications require preservation of workflow traces and the tracking of relationships between raw input and output results across distributed systems. These same challenges arise in digital preservation, where critical transformations may occur outside of the control of a repository system, or within metadata whose semantics are known at one stage of the process and unknown at another. Such transformations are distributed, heterogeneous processes, and tying a digital artifact to the process in which it participated requires portable, globally scoped identifiers that can be managed independently of the process itself. RDF usage enforces the global scope of identifiers by using URIs to identify nodes.
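The idea of a context can be illustrated with a minimal sketch. The Python classes below are hypothetical and do not reproduce Tupelo's actual interfaces; every class and method name is invented for this example. An abstract Context hides how triples and octet streams are stored, an in-memory implementation stands in for any one storage technology, and a UnionContext shows how descriptions held by independent sources can be assembled into one view because the sources share globally scoped identifiers.

```python
from abc import ABC, abstractmethod


class Context(ABC):
    """Hypothetical storage abstraction; not Tupelo's real interface."""

    @abstractmethod
    def add_triple(self, subject, predicate, obj):
        """Store one RDF-style statement."""

    @abstractmethod
    def match(self, subject=None, predicate=None, obj=None):
        """Return the set of triples matching the pattern (None = wildcard)."""

    @abstractmethod
    def read_octets(self, uri):
        """Return the raw bytes associated with a URI, or None."""


class MemoryContext(Context):
    """Stand-in for one storage technology (file system, database, ...)."""

    def __init__(self):
        self._triples = set()
        self._octets = {}

    def add_triple(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def write_octets(self, uri, data):
        self._octets[uri] = bytes(data)

    def match(self, subject=None, predicate=None, obj=None):
        pattern = (subject, predicate, obj)
        return {
            triple for triple in self._triples
            if all(want is None or want == got
                   for want, got in zip(pattern, triple))
        }

    def read_octets(self, uri):
        return self._octets.get(uri)


class UnionContext(Context):
    """View over several independent contexts that share identifiers."""

    def __init__(self, *contexts):
        self._contexts = contexts

    def add_triple(self, subject, predicate, obj):
        # Writes go to the first context; an arbitrary choice for the sketch.
        self._contexts[0].add_triple(subject, predicate, obj)

    def match(self, subject=None, predicate=None, obj=None):
        found = set()
        for context in self._contexts:
            found |= context.match(subject, predicate, obj)
        return found

    def read_octets(self, uri):
        for context in self._contexts:
            data = context.read_octets(uri)
            if data is not None:
                return data
        return None
```

In use, two MemoryContext instances populated separately can be wrapped in a UnionContext, and a single match on a shared subject URI returns statements from both. The merge needs no coordination between the sources beyond their agreement on the identifier, which is the property of RDF that the text above relies on.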

4.4.1.2. Connecting BECHAMEL to Tupelo

Our BECHAMEL client application retrieves an XML-serialized subgraph of the repository contents from Tupelo via Tupelo's HTTP-based client/server protocol, which extends Nokia's proposed URIQA protocol (http://sw.nokia.com/uriqa/URIQA.html). The subgraph is submitted to BECHAMEL, together with supporting OWL ontologies and standardized RDFS vocabularies (e.g., Dublin Core). New RDF statements and annotations emerging from BECHAMEL's execution (see the inference example in Figures 11-13) are then delivered back to the Tupelo server.
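The first step of this exchange, turning a serialized subgraph into statements that a reasoner can work on, can be sketched with the Python standard library. The fragment below is not a general RDF/XML parser and makes no claim about the serialization Tupelo actually returns; it handles only the simplest pattern, rdf:Description elements with an rdf:about attribute and literal-valued child elements, and the sample document is constructed by hand from the running example. The HTTP transport and the URIQA details are omitted.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hand-built sample standing in for a subgraph retrieved from the repository.
SUBGRAPH = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="info:fedora/changeme:97">
    <dc:type>image</dc:type>
  </rdf:Description>
</rdf:RDF>"""


def parse_flat_rdfxml(text):
    """Extract (subject, predicate URI, literal) triples from flat RDF/XML.

    Handles only rdf:Description elements whose children carry literal
    text; nested resources, rdf:resource attributes, and typed nodes are
    outside the scope of this sketch.
    """
    root = ET.fromstring(text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree writes tags as "{namespace}local"; dropping the
            # braces yields the property URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples
```

Applied to SUBGRAPH, the function yields the single statement from Figure 11 with the Dublin Core property expanded to its full URI. Statements in this form are what rule functions like those sketched earlier in this section would consume, and anything they infer would be serialized and sent back in a second request.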

4.4.1.3. Observations on Implementation

We predict that inferential capabilities (such as those illustrated earlier in this section), like the characteristics of the Tupelo architecture, will be basic services provided by and for future digital repositories. But the functional components of those repositories will be loosely coupled and distributed. Interpretive services are, furthermore, needed right away for systems based on current content management system technologies, and to aid in reforming descriptive practice as it stands today. For all these reasons, we have sought in our implementation to make the interpretation component a structurally distinct layer, communicating with the Tupelo middleware via general-purpose client/server protocols such as HTTP. While we assume the resource descriptions and inferred knowledge will conform to the RDF abstract model, we have chosen to deliver them in conventional serialized forms, such as RDF/XML.

As with any similar project, a variety of engineering challenges require further experimentation and improvement. For example, at the BECHAMEL application layer, all RDF statements are expressed as if part of a single global graph, whether retrieved from Tupelo, parsed from an RDFS vocabulary, inferred by BECHAMEL itself, or drawn from any other source. But only a finite amount of input knowledge can be efficiently shared over the network between client and server. Our interpretive rules are themselves separate from the strategies for selecting, retrieving, and storing RDF statements, but pragmatically they cannot be totally independent of each other.

4.5. Lessons Learned and Next Steps

Our research contribution can be seen from one perspective as the technical groundwork for a future generation of improved automated digital preservation systems and methods. But one can also understand our findings as opportunities to apply human intelligence more effectively with existing tools and standards. It might never occur to a digital librarian that his preservation methods are being executed without clearly identifiable targets, or that a simple change (such as dcmitype:Image instead of "image") could dramatically reduce the work required to correct that problem. The exercise of encoding semantic knowledge with enough clarity and precision for a computer reveals complexities that our remarkable human minds would otherwise allow us to ignore. With the aid of that insight, much progress could be made in reforming the practices that prompt our development and research.

4.6. Conclusion

Institutional repositories and other current efforts for preserving digital artifacts face challenges resulting from underspecified metadata schemas, ambiguous usage, and metadata models that relate more to repository implementation than to issues of meaning. These entail very real risks to the integrity and usefulness of preserved digital artifacts as they are stored, managed, and retrieved. Descriptive practices that seem correct may introduce inconsistencies that are undetectable without manual inspection of each record, an unreasonable requirement for collections of even moderate size.

Improved metadata standards and repository metadata models are part of the solution to these problems, but we also see a role for automation in detecting and mitigating preservation risks. Our experimental archiving technologies, BECHAMEL and Tupelo, demonstrate that we can locate and correct ambiguous metadata expressions in the context of transactions such as import and export. As best practices evolve for digital preservation, we see reasoning capabilities like those demonstrated by BECHAMEL becoming an integral component of digital preservation systems, allowing curators to transform large collections with greater confidence that records will faithfully represent the information they are intended to preserve. We believe the techniques described here, which complement interoperability models like ECHO DEPository's Hub and Spoke tool suite, point to a new generation of preservation tools and reveal ways to use existing tools with more success.


5. A Note from the PIs

When the University of Illinois Library and the Graduate School of Library and Information Science submitted the proposal for the ECHO DEP project to the Library of Congress in 2003, the digital preservation landscape was radically different. The number of web archiving tools was still small; far fewer institutions than now had instances of repository software applications in their libraries; the problem space of interoperability between repository platforms was just gaining ground; and techniques for migrating the semantic content of documents over time and through various encoding schemes were still on the horizon.

The accomplishments of ECHO DEP Phase 1 projects, in the form of technical frameworks and software applications, as well as of published research and enduring partnerships, have contributed to the redesign of this landscape into one that is richer and more sustainable. For example, our efforts at enabling repository interoperability have resulted in the registration of the ECHO DEP Generic METS Profile for Preservation and Digital Repository Interoperability with the Library of Congress. Because of our work in this area, institutions such as Harvard University, the Arizona State Library, and the Georgia Institute of Technology have contacted us to learn more about the technical architecture issues involved in our framework. These contacts bespeak knowledge sharing and community building toward a public good, interactions that are integral to the development of a networked approach to digital content stewardship.

Another beneficial outcome has been the partnerships themselves, initially established during Phase 1, such as with OCLC; Illinois and OCLC are collaborating again in ECHO DEP Phase 2, this time on a named-entity extraction and recognition tool development project that seeks to automate creation and extraction of metadata for preservation purposes and contexts. Indeed, the work of starting and sustaining cross-organizational collaboration for a project's period of performance should not be overlooked. As our teams have learned in the process, effective collaboration entails, but is not limited to, laying a foundation for a communication infrastructure that draws on an array of tools, such as wikis and virtual meeting applications; nurturing a healthy balance between encouragement of new directions in research and development and meeting the deliverables to which the project is committed; and understanding from the start that the outcome of our efforts will only be as meaningful and successful as the collaborations themselves are rich and productive.

The University of Illinois is grateful to the Library of Congress for funding its digital preservation research activities under NDIIPP. The work achieved during Phase 1 has afforded us a greater understanding of the challenges surrounding preservation strategies, which we hope the NDIIPP community at large will continue to learn from and draw upon in future stewardship endeavors.

6. References

6.1. Archiving the Web: the Web Archives Workbench

6.1.1. Resources

Cobb, J., Pearce-Moses, R. & Surface, T. (2005). ECHO DEPository Project. In Archiving 2005: final program and proceedings, April 26, 2005, Washington, D.C., (175-178). Springfield, VA: The Society for Imaging Science and Technology, 2005. Retrieved July 5, 2008, from http://www.ndiipp.uiuc.edu/pdfs/IST2005paper_final.pdf/.

ECHO Dep Generic METS Profile for Preservation and Digital Repository Interoperability. (2005). Retrieved August 27, 2008, from http://www.loc.gov/standards/mets/profiles/00000015.html.

ECHO Dep METS Profile for Web Site Captures. (2006). Retrieved August 27, 2008, from http://www.loc.gov/standards/mets/profiles/00000016.html.

The ECHO DEPository: An NDIIPP-Partner Project of the University of Illinois at Urbana-Champaign with OCLC and the Library of Congress. (n.d.). Retrieved July 5, 2008, from http://www.ndiipp.uiuc.edu/.

“The ISO Reference Model for Open Distributed Processing – An Introduction.” (1996). Retrieved August 27, 2008, from http://www.enterprise-architecture.info/Images/Documents/RM-ODP2.pdf.

OCLC Digital Management Services. (2008). Retrieved July 5, 2008, from http://www.oclc.org/us/en/services/collection/default.htm.

The National Digital Information Infrastructure and Preservation Program. (n.d.). Retrieved July 5, 2008, from http://www.digitalpreservation.gov/.

Pearce-Moses, R. & Kaczmarek, J. (2005). An Arizona Model for Preservation and Access of Web Documents. DttP: Documents to the People. 33(1), 17-24. Retrieved July 5, 2008, from http://www.ndiipp.uiuc.edu/pdfs/azmodel.pdf/.

Rani, S., Goodkind, J., Cobb, J., Habing, T., Eke, J., Urban, R. & Pearce-Moses, R. (2006). Technical architecture overview: tools for acquisition, packaging, and ingest of web objects into multiple repositories (poster). Opening information horizons: 6th ACM/IEEE-CS Joint Conference on Digital Libraries: June 11-15, 2006, Chapel Hill, NC, USA: JCDL 2006 / sponsored by ACM SIG on Information Retrieval, ACM SIG on Hypertext, Hypermedia and the Web, IEEE Technical Committee for Digital Libraries, (360-360). New York: ACM, 2006.

Web Archives Workbench. (2008). Retrieved July 5, 2008, from http://sourceforge.net/projects/webarchivwkbnch/.

Web archiving. (2008, August 21). In Wikipedia, the free encyclopedia. Retrieved August 27, 2008, from http://en.wikipedia.org/wiki/Web_archiving.

6.2. Repository Evaluation and Interoperability

6.2.1. Repository Evaluation

DLI Full-Text Journal Collection. Retrieved April 7, 2009, from http://forseti.grainger.uiuc.edu/pubs/tocdli.asp.

Historical Aerial Photo Image Database. Retrieved April 7, 2009, from http://images.library.uiuc.edu/projects/aerial_photos/.

Illinois Digital Orthophoto Quarter Quadrangle Data. Retrieved April 7, 2009, from http://www.isgs.illinois.edu/nsdihome/webdocs/doq05/.

RLG. (2005). An audit checklist for the certification of trusted digital repositories. Mountain View, CA: RLG. Retrieved September 10, 2008, from http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file2416.pdf.

Vincent Voice Audio Library at Michigan State University Libraries. Retrieved April 7, 2009, from http://vvl.lib.msu.edu/showfindingaid.cfm?findaidid=CoolidgeC.

6.2.2. HandS Tools Suite

Allinson, J., François, S., & Lewis, S. (January, 2008). SWORD: Simple Web-service Offering Repository Deposit. Ariadne, (54). Retrieved September 11, 2008, from http://www.ariadne.ac.uk/issue54/allinson-et-al/.

Boyko, A., Kunze, J., Littman, J., & Madden, L. (2008). The BagIt File Package Format (V0.95). Retrieved September 15, 2008, from http://www.cdlib.org/inside/diglib/bagit/bagitspec.html.

Consultative Committee for Space Data Standards. (2002). Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1. Blue Book. Retrieved September 12, 2008, from http://public.ccsds.org/publications/archive/650x0b1.pdf.

DLF Aquifer Metadata Working Group. (2006). Digital Library Federation/Aquifer Implementation Guidelines for Shareable MODS Records. Retrieved September 10, 2008, from http://wiki.dlib.indiana.edu/confluence/download/attachments/24288/DLFMODS_ImplementationGuidelines_Version1-2.pdf?version=1.

Global Digital Format Registry (GDFR) Information Site. (n.d.). Retrieved September 15, 2008, from http://www.gdfr.info/.

Guenther, R. (2008). Guidelines for using PREMIS with METS for exchange. Retrieved September 11, 2008, from http://www.loc.gov/standards/premis/guidelines-premismets.pdf.

Guenther, R. (2008). Battle of the Buzzwords: Flexibility vs. Interoperability When Implementing PREMIS in METS. D-Lib Magazine, 14(7/8). Retrieved September 12, 2008, from http://www.dlib.org/dlib/july08/guenther/07guenther.html.

Habing, T. G. (2005). ECHO Dep generic METS profile for preservation and digital repository interoperability. Retrieved September 10, 2008, from http://www.loc.gov/standards/mets/profiles/00000015.xml.

Habing, T. G. (2006). ECHO Dep METS profile for web site captures. Retrieved September 10, 2008, from http://www.loc.gov/standards/mets/profiles/00000016.xml.

Habing, T. G. (2007). Lightweight repository CRUD Service (LRCRUDS). Retrieved September 10, 2008, from http://dli.grainger.uiuc.edu/echodep/hands/LRCRUDS.htm.

JHOVE: JSTOR/Harvard Object Validation Environment. (2007). Retrieved September 11, 2008, from http://hul.harvard.edu/jhove/.

Kaczmarek, J., Habing, T. G., & Eke, J. (2006). Repository software evaluation using the audit checklist for certification of trusted digital repositories. In Proceedings of the 6th ACM/IEEE-CS joint conference on digital libraries 2006, Chapel Hill, NC, USA June 11-15, 2006. New York: Association for Computing Machinery. Retrieved September 10, 2008, from http://doi.acm.org/10.1145/1141753.1141774.

Kaczmarek, J., Hswe, P., Eke, J., & Habing, T. G. (2006). Using the ‘Audit checklist for the certification of a trusted digital repository’ as a framework for evaluating repository software applications. D-Lib Magazine, 12(12). Retrieved September 10, 2008, from http://www.dlib.org/dlib/december06/kaczmarek/12kaczmarek.html.

METS: Metadata encoding & transmission standard, official web site. (2008). Retrieved September 10, 2008, from http://www.loc.gov/standards/mets/.

MIX: NISO metadata for images in XML schema, technical metadata for digital still images standard, official web site. (2008). Retrieved September 10, 2008, from http://www.loc.gov/standards/mix/.

MODS: Metadata object description schema, official web site. (2008). Retrieved September 10, 2008, from http://www.loc.gov/standards/mods/.

OAI-PMH: Open Archives Initiative – Protocol for Metadata Harvesting. (2008). Retrieved September 12, 2008, from http://www.openarchives.org/pmh/.

PREMIS: Preservation metadata maintenance activity, official web site. (2008). Retrieved September 10, 2008, from http://www.loc.gov/standards/premis/.

PREMIS Working Group. (2005). Data dictionary for preservation metadata. Dublin, OH: OCLC and RLG. Retrieved September 10, 2005, from http://www.oclc.org/research/projects/pmwg/premis-final.pdf.

RLG. (2005). An audit checklist for the certification of trusted digital repositories. Mountain View, CA: RLG. Retrieved September 10, 2008, from http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file2416.pdf.

JISC. (2008). SWORD. Retrieved September 11, 2008, from http://www.ukoln.ac.uk/repositories/digirep/index/SWORD.

textMD: Technical Metadata for Text, Official Web Site. (2008). Retrieved September 10, 2008, from http://www.loc.gov/standards/textMD/.

The ECHO DEPository project. (n.d.). Retrieved September 9, 2008, from http://ndiipp.uiuc.edu/.

The Apache Software Foundation. (2008). Welcome to XMLBeans. Retrieved September 11, 2008, from http://xmlbeans.apache.org/.

UIUC Echodep Hub and Spoke Framework Tool Suite. (n.d.). Retrieved September 12, 2008, from http://dli.grainger.uiuc.edu/echodep/hands/.

6.3. Preserving Meaning, Not Just Objects: Semantics and Digital Preservation

DCMI Namespace for the Dublin Core Metadata Element Set, Version 1.1. (2008). Retrieved October 2, 2008, from http://dublincore.org/2008/01/14/dcelements.rdf.

DCMI Type Schema. (2008). Retrieved October 3, 2008, from http://dublincore.org/2008/01/14/dctype.rdf.

Dubin, D., Sperberg-McQueen, C. M., Renear, A., and Huitfeldt, C. (2003). A logic programming environment for document semantics and inference. Literary and Linguistic Computing, 18(2): 225–233.

Habing, T., Ingram, W., Cordial, M., Manaster, R. and Eke, J. (2008). Developments in digital preservation at the University of Illinois: the Hub and Spoke architecture for supporting repository interoperability and emerging preservation standards. Library Trends, 57(4), [page nos.].

Kaczmarek, J., Hswe, P., Hauser, L., and Eke, J. (2008). The Web Archives Workbench: taking an archival approach to the preservation of Web content. Library Trends, 57(4), [page nos.].

Kaminski, P. (2002). Integrating information on the semantic web using partially ordered multi hypersets. Unpublished master’s thesis. University of Waterloo. Retrieved September 12, 2008, from http://www.ideanest.com/braque/Thesis-web.pdf.

OWL Web Ontology Language. (2004). Retrieved October 2, 2008, from http://www.w3.org/TR/owl-features/.

RDF Semantics. (2004). Retrieved October 2, 2008, from http://www.w3.org/TR/rdf-mt/.

Renear, A., Dubin, D., Sperberg-McQueen, C. M., and Huitfeldt, C. (2002). Towards a semantics for XML markup. In E. Munson, R. Furuta, and J. I. Maletic (eds.) Proceedings of the 2002 ACM Symposium on Document Engineering (119-126). New York: ACM.

Renear, A. and Dubin, D. (2003). Towards identity conditions for digital documents. In S. Sutton, editor, Proceedings of the 2003 Dublin Core Conference. University of Washington, Seattle, WA.

Tupelo. (2008). Retrieved October 2, 2008, from http://dlt.ncsa.uiuc.edu/wiki/index.php/Main_Page.

URIQA. The URI Query Agent Model: A Semantic Web Enabler. (2003-2008). Retrieved October 2, 2008, from http://sw.nokia.com/uriqa/URIQA.html.

7. Appendices

7.1. Web Archives User Guide

7.2. Web Archives Workbench Implementation Guide

7.3. Annotated Trusted Digital Repository Checklist

7.4. Using the Audit Checklist for the Certification of a Trusted Digital Repository as a Framework for Evaluating Repository Software Applications (D-Lib article)

7.5. Repository Testing Findings: Narrative

7.6. Repository Findings Commentary Using the Annotated Trusted Digital Repository Checklist

7.7. Resource Description Vocabulary: An Ontology of Metadata Descriptions

7.8. Sustained Access to Ejournals: Context, Value, and Future Prospectus