F2 kepa rodriguez_ehri_integration_retrieva_minerva_2016

EVA/Minerva 2016

Integration and Retrieval of Heterogeneous Archival Metadata

CONNECTING COLLECTIONS

Kepa J. Rodriguez – Archives Yad Vashem

09/11/2016

Outline

● Data integration in the first phase of the project

● Our current integration approach

● Retrieval of data using controlled vocabularies

● Development of the EHRI controlled vocabularies

Data integration in the first phase of the project

● Holding institutions delivered data in very different formats:

● XML, text files, CSV, JSON, etc...

● Ingestion into the portal was done case by case

● We interpreted each institution's data model and mapped it to our own model

● Sometimes without help of the institution

● Lots of data introduced by hand

● The process is not sustainable: it cannot be repeated

● No automatic updates are possible

● If an institution updates content, data has to be updated by hand

● Other problems: infrastructure, persistent identifiers, etc.

Proposal for the second phase of the project

● Data conversion

● Data publication and synchronization

● Data ingestion

Data conversion

● Conversion tool: different data formats into EAD:

● XML, JSON, CSV...

● Generic transformation

● Useful for a significant number of institutions

● Reusable functions, such as mappings from specific fields of an institution's export format to EAD

● Utilities to configure specific transformations

● Validation of the output:

● Machine validation: XML validation with Schematron and RNG (RELAX NG)

● Human validation: HTML preview including mark-up for validation errors
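The conversion step described above can be illustrated with a minimal sketch. It assumes a CSV export with three columns; the column names and the FIELD_MAP table are hypothetical and stand in for the institution-specific mappings mentioned on the slide, not for the actual EHRI conversion tool. Only the Python standard library is used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping from one institution's CSV columns to EAD <did> elements.
FIELD_MAP = {
    "ref_code": "unitid",
    "title": "unittitle",
    "extent": "physdesc",
}


def row_to_ead(row, level="subgrp"):
    """Build a minimal <archdesc> element from one CSV row."""
    archdesc = ET.Element("archdesc", {"level": level})
    did = ET.SubElement(archdesc, "did")
    for column, ead_tag in FIELD_MAP.items():
        value = (row.get(column) or "").strip()
        if value:  # skip empty cells instead of emitting empty elements
            ET.SubElement(did, ead_tag).text = value
    return archdesc


def convert(csv_text):
    """Convert every row of a CSV export into a serialized EAD fragment."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [ET.tostring(row_to_ead(row), encoding="unicode") for row in reader]


# Illustrative input row; the title is shortened for the example.
sample = (
    "ref_code,title,extent\n"
    "M.49.E,Testimonies of Holocaust Survivors,6845 files\n"
)
fragments = convert(sample)
print(fragments[0])
```

Keeping the mapping in a table rather than in code is one way to make the same transformation reusable across institutions: a new export format would only need a new table. Schema validation with Schematron or RNG is not shown here because it requires a third-party XML library.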

EAD File sample (1)

<archdesc level="subgrp">
  <did>
    <unitid>M.49.E</unitid>
    <unittitle encodinganalog="3.1.2">Testimonies of Holocaust Survivors collected by the Central Jewish Historical Commission in Poland, 1944-1947</unittitle>
    <physdesc encodinganalog="3.1.5">6845 files</physdesc>
    <langmaterial>
      <language langcode="deu" encodinganalog="3.4.3">German</language>
      <language langcode="pol" encodinganalog="3.4.3">Polish</language>
      <language langcode="yid" encodinganalog="3.4.3">Yiddish</language>
    </langmaterial>
    <repository>
      <corpname>Yad Vashem Archives / ארכיון יד ושם</corpname>
    </repository>
  </did>
  <scopecontent encodinganalog="3.3.1">
    <p>The collection consists of approximately 7,200 testimonies collected by the Centralna Żydowska Komisja Historyczna (Central Jewish Historical Committee) in Poland during its active years, 1944-1947.
    ….. as well as testimonies from survivors who fought in partisan units and survivors who were in hiding.</p>
  </scopecontent>

…....

EAD File sample (2)

…...
  <originalsloc encodinganalog="3.5.1">
    <p>ZYDOWSKI INSTYTUT HISTORYCZNY - ZIH, WARSZAWA, POLAND</p>
  </originalsloc>

…...
  <controlaccess>
    <geogname>Poland</geogname>
    <geogname>Warsaw</geogname>
  </controlaccess>
  <controlaccess>
    <subject>Persecution of Jews</subject>
    <subject>Testimonies, Biographies</subject>
    <subject>Holocaust survivors</subject>
  </controlaccess>
  <controlaccess>
    <corpname>Centralna Żydowska Komisja Historyczna</corpname>
  </controlaccess>
</archdesc>

Data publication and synchronization

● We plan to use two data publication protocols:

● OAI-PMH: one of the first protocols for publication of data

● Publication of data in different formats: Dublin Core (default), EAD, etc.

● PMH servers are not easy to implement and maintain for small archives

● But we want to implement a client for institutions that already use it

● ResourceSync: a newer protocol

● Based on Sitemaps

● Data can be published on the web page of the institution

● Higher security

● Use sitemaps to expose changes and updates

● Only modified and new data will be transferred to the portal

● Both are standard protocols of the Open Archives Initiative
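How a harvester could use a ResourceSync change list to transfer only new and modified data can be sketched as follows. This is a minimal illustration, not the EHRI implementation: the change list is embedded as a string, and the example.org URLs are hypothetical. The two XML namespaces are those of the Sitemap protocol and the ResourceSync terms.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical change list as an institution might publish it on its website.
CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2016-11-01T00:00:00Z"/>
  <url>
    <loc>http://archive.example.org/ead/new-fonds.xml</loc>
    <rs:md change="created"/>
  </url>
  <url>
    <loc>http://archive.example.org/ead/M.49.E.xml</loc>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://archive.example.org/ead/withdrawn.xml</loc>
    <rs:md change="deleted"/>
  </url>
</urlset>"""


def plan_sync(change_list_xml):
    """Split a change list into resources to download and resources to remove."""
    root = ET.fromstring(change_list_xml)
    to_fetch, to_delete = [], []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        if change in ("created", "updated"):
            to_fetch.append(loc)
        elif change == "deleted":
            to_delete.append(loc)
    return to_fetch, to_delete


to_fetch, to_delete = plan_sync(CHANGE_LIST)
print("fetch:", to_fetch)
print("delete:", to_delete)
```

Unchanged resources never appear in the change list, so the portal does not need to download them again.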

Data ingestion

● After data is ingested into the portal, it will receive a permanent URL:

● Formal protocol is in progress

● Necessary to publish our data in the Linked Open Data cloud

● Updates: data will be overwritten

● But the portal keeps the user-generated data

● But... is it enough for the user just to have all information in a single infrastructure?

Data retrieval

● The user needs to be able to retrieve information related to selected topics, places, people, organizations, creators...

● Regardless of which institution holds it

● Regardless of the language in which the metadata is written

EHRI controlled vocabularies

● EHRI Thesaurus

● Concepts: hierarchy of concepts formalized in SKOS

● A first set translated into 10 languages

● Made by historians and content specialists

● Authority lists:

● Named entities or instances of the concepts

● Proposed by historians and specialists: not really useful for indexing and retrieval of data

● During import, many were added by hand to address the needs of the real data

● Domain-specific authorities: Ghettos, Camps, Administrative Districts

● Vocabularies created for applications in the portal:

● Two research guides

● Linked to the EHRI Thesaurus
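Why a multilingual concept hierarchy supports retrieval across institutions and languages can be shown with a small sketch. The Concept class loosely mirrors SKOS preferred labels and broader relations. The concept identifiers, the hierarchy and the description records are hypothetical and are not taken from the EHRI Thesaurus.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Concept:
    concept_id: str
    pref_labels: Dict[str, str]      # language code -> preferred label
    broader: Optional[str] = None    # id of the parent concept, if any


# Hypothetical concepts with labels in several languages.
CONCEPTS = [
    Concept("c1", {"en": "Persecution", "de": "Verfolgung"}),
    Concept("c2", {"en": "Forced labour", "de": "Zwangsarbeit",
                   "fr": "Travail forcé"}, broader="c1"),
]

# Every label in every language points to the same concept id.
LABEL_INDEX = {
    label.casefold(): concept.concept_id
    for concept in CONCEPTS
    for label in concept.pref_labels.values()
}

# Hypothetical descriptions from two institutions, indexed by concept id.
DESCRIPTIONS = [
    {"id": "inst-a/001", "concepts": ["c2"]},
    {"id": "inst-b/042", "concepts": ["c2", "c1"]},
    {"id": "inst-b/043", "concepts": ["c1"]},
]


def search(term: str) -> List[str]:
    """Return description ids linked to the concept that `term` labels."""
    concept_id = LABEL_INDEX.get(term.casefold())
    if concept_id is None:
        return []
    return [d["id"] for d in DESCRIPTIONS if concept_id in d["concepts"]]


results_de = search("Zwangsarbeit")
results_fr = search("travail forcé")
print(results_de, results_fr)
```

Because descriptions are linked to concept identifiers rather than to label strings, a query in German and a query in French reach the same records, whichever institution holds them.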

Problems of the first approach of the project

● A vocabulary built with knowledge about the Shoah can be helpful to represent the history, but not necessarily the documentation:

● The compilation of an encyclopedia and the implementation of an engine for cataloguing and retrieval are two very different things and require different strategies and kinds of expertise.

● The vocabularies should be able to retrieve the data that actually exists:

● Vocabularies should be able to describe the data, not only the content, e.g. types of documents, physical format of the data...

● A strategy to extend the datasets when new data brings new needs has to be implemented.

The reality of the data

● Different institutions use different systems to assign keywords (or no system)

● Keywords can have different relevance in different systems

● In a National Archive “holocaust” can be a relevant keyword, but it is not relevant for the EHRI portal.

● The same keyword can have different meanings in different knowledge bases

● e.g. “labor” in one set of imported data corresponds to “forced labor”, in another set to “trade unions”

● Relevant information is often given as free text:

● Natural Language Processing is needed to extract this information, but within the project we can only do this at an experimental level.
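The “labor” example suggests that keywords have to be interpreted per source. A minimal sketch of such a source-specific mapping follows; the dataset names and the mapping table are hypothetical and only illustrate the idea.

```python
# Hypothetical per-source mapping: the same keyword resolves to a different
# portal term depending on the dataset it was imported from.
SOURCE_KEYWORD_MAP = {
    ("dataset-a", "labor"): "forced labor",
    ("dataset-b", "labor"): "trade unions",
}


def map_keyword(source, keyword):
    """Resolve a keyword in the context of its source dataset.

    Keywords without a source-specific rule are returned unchanged
    (apart from surrounding whitespace).
    """
    key = (source, keyword.strip().casefold())
    return SOURCE_KEYWORD_MAP.get(key, keyword.strip())


mapped_a = map_keyword("dataset-a", "Labor")
mapped_b = map_keyword("dataset-b", "labor")
unmapped = map_keyword("dataset-b", " Warsaw ")
print(mapped_a, "|", mapped_b, "|", unmapped)
```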

EHRI's data driven approach (1)

● Extraction of access points of the EAD files during import

<controlaccess>
  <geogname>Poland</geogname>
  <geogname>Warsaw</geogname>
</controlaccess>
<controlaccess>
  <subject>Persecution of Jews</subject>
  <subject>Testimonies, Biographies</subject>
  <subject>Holocaust survivors</subject>
</controlaccess>
<controlaccess>
  <corpname>Centralna Żydowska Komisja Historyczna</corpname>
</controlaccess>
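Extracting the access points from an EAD file can be sketched with the Python standard library. The sketch assumes an EAD file without an XML namespace, as in the sample; files that declare a namespace would need namespace-aware element names. The function name and the list of tags are illustrative choices.

```python
import xml.etree.ElementTree as ET

EAD_FRAGMENT = """
<archdesc level="subgrp">
  <controlaccess>
    <geogname>Poland</geogname>
    <geogname>Warsaw</geogname>
  </controlaccess>
  <controlaccess>
    <subject>Persecution of Jews</subject>
    <subject>Testimonies, Biographies</subject>
    <subject>Holocaust survivors</subject>
  </controlaccess>
  <controlaccess>
    <corpname>Centralna Żydowska Komisja Historyczna</corpname>
  </controlaccess>
</archdesc>
"""

# Access point elements collected by this sketch.
ACCESS_POINT_TAGS = ("geogname", "subject", "corpname", "persname")


def extract_access_points(ead_xml):
    """Group the text of all <controlaccess> children by element name."""
    root = ET.fromstring(ead_xml.strip())
    points = {tag: [] for tag in ACCESS_POINT_TAGS}
    for block in root.iter("controlaccess"):
        for child in block:
            if child.tag in points and child.text:
                points[child.tag].append(child.text.strip())
    return points


access_points = extract_access_points(EAD_FRAGMENT)
for tag, values in access_points.items():
    print(tag, values)
```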

EHRI's data driven approach (2)

● Persons, corporate bodies:

● Check whether we have corresponding authority files

● If we have: link the description unit with the corresponding authority file

● If we don't have: create a new authority file

● Priority of EHRI: creators of archival collections

● Places:

● Link the places with the geographical database GeoNames

● Problematic for historical places; some of them will be added as an extra vocabulary.
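The link-or-create logic for persons and corporate bodies can be sketched as follows. The AuthorityRegistry class, the identifier format and the matching rule are hypothetical: here two names match when they are equal after case folding, whitespace collapsing and removal of diacritics, which is a deliberate simplification of real authority matching.

```python
import unicodedata


def normalize_name(name):
    """Case-fold, strip diacritics and collapse whitespace for matching."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.casefold().split())


class AuthorityRegistry:
    """Hypothetical in-memory store of authority files."""

    def __init__(self):
        self._by_key = {}
        self._next_id = 1

    def link_or_create(self, name, kind):
        """Return (authority_id, created) for an access point."""
        key = (kind, normalize_name(name))
        if key in self._by_key:
            return self._by_key[key], False   # existing authority: link
        authority_id = f"auth-{self._next_id:06d}"
        self._next_id += 1
        self._by_key[key] = authority_id      # no match: create a new one
        return authority_id, True


registry = AuthorityRegistry()
first = registry.link_or_create(
    "Centralna Żydowska Komisja Historyczna", "corpname")
second = registry.link_or_create(
    "centralna zydowska  komisja historyczna", "corpname")
print(first, second)
```

The second call links to the authority file created by the first one instead of creating a duplicate.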

EHRI's data driven approach (3)

● Concepts/terms: the most complicated case

● Archives used very different strategies for concepts:

● Some institutions compose terms using different rules (or no rules)

● Subject: “Jews--Persecution--France” (data of USHMM)

● EHRI has an atomic approach

● Subject: “Persecution of Jews”

● Place: “France”

● Steps to process concepts/terms:

● Terms are normalized and de-duplicated

● If there are equivalent terms in the thesaurus, we establish a link

● If there are no equivalent terms, the concept goes to further analysis

● If necessary, a board of experts will consider accommodating a new concept in our concept hierarchy.
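The processing steps for concepts can be sketched under simple assumptions. Composed subjects are split on "--", place components are recognized against a small hypothetical place list (in practice this could be a GeoNames lookup), and the remaining components are mapped to an atomic term through a hypothetical rule table. The thesaurus identifiers are invented for the example.

```python
KNOWN_PLACES = {"france", "poland"}                    # hypothetical place list
FACET_RULES = {("jews", "persecution"): "persecution of jews"}  # hypothetical
THESAURUS = {                                          # hypothetical term ids
    "persecution of jews": "term-001",
    "holocaust survivors": "term-002",
}


def normalize(term):
    """Case-fold and collapse whitespace."""
    return " ".join(term.casefold().split())


def process_subjects(raw_subjects):
    """Normalize, de-duplicate and link subjects.

    Returns (linked, review, places): terms linked to the thesaurus,
    terms that need further analysis, and extracted place names.
    """
    linked, review, places, seen = {}, [], set(), set()
    for raw in raw_subjects:
        parts = [normalize(p) for p in raw.split("--") if p.strip()]
        places.update(p for p in parts if p in KNOWN_PLACES)
        facets = tuple(p for p in parts if p not in KNOWN_PLACES)
        if not facets:
            continue
        term = FACET_RULES.get(facets, " ".join(facets))
        if term in seen:          # de-duplicate equivalent terms
            continue
        seen.add(term)
        if term in THESAURUS:
            linked[term] = THESAURUS[term]
        else:
            review.append(term)   # goes to further analysis
    return linked, review, sorted(places)


linked, review, places = process_subjects([
    "Jews--Persecution--France",
    "Persecution of Jews",
    "  persecution of  jews ",
    "Trade unions",
])
print(linked, review, places)
```

The composed subject and the two spelling variants collapse into one linked term, the place is separated out, and the term without an equivalent is queued for review.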

Ghettos and Concentration Camps

● We are evaluating whether to start a Wikidata project for ghettos and concentration camps

● Strategy:

● Extract information from the current thesaurus and alternative sources

● Encyclopedic knowledge

● Data from project partners

● Integration of all this data in the Wikidata platform

● Enrichment with help of the community

● Multilingual labels and non-controversial information

● Finally the data in Wikidata and in the portal should be synchronized

NIOD Institute for War, Holocaust and Genocide Studies (NL)

CEGESOMA Centre for Historical Research and Documentation on War and Contemporary Society (BE)

Jewish Museum in Prague (CZ)

Center for Holocaust Studies at the Institute for Contemporary History in Munich (DE)

YAD VASHEM The Holocaust Martyrs’ and Heroes’ Remembrance Authority (IL)

United States Holocaust Memorial Museum (USA)

Bundesarchiv (DE)

The Wiener Library Institute for the Study of the Holocaust & Genocide (UK)

Holocaust Documentation Centre (SK)

Polish Center for Holocaust Research (PL)

The Jewish Museum of Greece (GR)

Jewish Historical Institute (PL)

King’s College London (UK)

Ontotext AD (BG)

Elie Wiesel National Institute for the Study of Holocaust in Romania (RO)

DANS Data Archiving and Networked Services (NL)

Shoah Memorial, Museum, Center for Contemporary Jewish Documentation (FR)

ITS International Tracing Service (DE)

Hungarian Jewish Archives (HU)

INRIA Institute for Research in Computer Science and Automation (FR)

Vilna Gaon State Jewish Museum (LT)

VWI Vienna Wiesenthal Institute for Holocaust Studies (AT)

Foundation Jewish Contemporary Documentation Center (IT)

CONNECTING KNOWLEDGE
