CLARIN-D tutorial Nijmegen, 7 September 2011 Speech & Language Data Repository (SLDR) Bernard Bel [email protected]
Table of contents
• A demo to start with!
• Background
• Item ‘packaging’
• Our OAIS model
• Interoperability
• Processing queries on SLDR
• Access rights management
• Following up resource usage
• Work in progress
A demo
http://sldr.org/sldr000714/toc
We need to modify the private/public status of some files on this item:
Background
• Laboratoire parole et langage (LPL), a speech research laboratory of the French Centre National de la Recherche Scientifique (CNRS), is in charge of an archive submission site named Speech & Language Data Repository (SLDR, pronounced ‘splandar’): http://sldr.org
• The aim of SLDR is to preserve data eligible for speech/language research and to facilitate non-commercial access to it.
• Resource pooling is built on an interoperable system (OAIS, Open Archival Information System) currently involving two major computing centres (CINES and CC-IN2P3) in a joint project initiated by TGE-Adonis:
  • CINES: Centre Informatique National de l’Enseignement Supérieur, Montpellier, France
  • CC-IN2P3: Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules, Lyon, France
  • TGE-Adonis: Très Grand Équipement Adonis, Paris, France
A broad range of disciplines…
from experimental linguistics…
… to field linguistics
… including didactics, processing written forms, understanding of speech information, the assessment and retraining of voice, speech and language dysfunctioning etc.
History
• June 2005: a project proposal for the Council of Federation Institut de Linguistique Française
• September 2005: a call for projects from Direction de l'Information Scientifique (CNRS) on « Centres de ressources numériques »
• Bringing together two projects of Centre de ressources pour la description de l’oral (CRDO) submitted by LACITO and LPL: http://sldr.org/docs/admin/lettreDeMissionCRDO.pdf
• September 2006: creating the steering committee of CRDO (Lyon): http://sldr.org/wiki/ComiteDePilotage
• 2006-2007 (contract with Ministry of Research, LPL funding): developing the architecture of SLDR (3 engineers)
• The pilot project of TGE-Adonis/SLDR/CRDO-Paris/CINES/IN2P3:
  • Framing the project with Direction des archives de France (July 2008)
  • Setting up the pilot project (November 2008)
  • A report on the pilot project in March 2009 (Claude Huc): http://sldr.org/docs/admin/ArchivageMutualise-document-synthese-v1.3.pdf
  • Evaluation for the steering committee of TGE-Adonis by Y. Marcoux, EBSI, Montréal (June 2009): http://sldr.org/docs/admin/Marcoux-resume-operation.pdf
  • An agreement is signed between CNRS, CINES and Archives de France to set up the legal framework of long-term preservation (June 2010).
  • Long-term preservation is initiated by CRDO-Paris (22 June 2010) and CRDO-Aix (16 July 2010).
  • January 2011: CRDO-Aix in production 66 Gb/40952 files, in test 186 Gb/7108 files.
• August 2011: The CRDO label is abandoned. CRDO-Aix is renamed Speech & Language Data Repository (SLDR, http://sldr.org)
Detail: http://sldr.org/wiki/CrdoHistorique
Item ‘packaging’
Sharing/preserving what and why?
• No specific structure is imposed on SLDR items. We deal with generic items.
• The following item types are defined in metadata:
  • Primary data (corpora): all signals associated with oral production (audio/video/articulatory measurements) and documents created or collected during an experiment or a field enquiry;
  • Secondary data (resources): material derived from primary data (transcription, translation, annotation, analysis…) and their associated resources: lexica, grammars, frequency tables, knowledge bases etc.
  • Tools: software and hardware descriptions of equipment used for data analysis and annotation, e.g.:
    • Morpho-syntactic taggers
    • Syntactic analyzers
    • Translation assistance
    • Prosodic analysis/modelling
  • Collections of items of all types (including collections)
Items of SLDR include speech/singing corpora, their annotations, lexica and other knowledge bases as well as tools associated with data processing. Corpora may comprise audio/video recordings and measurements of physiological activity.
Several access options
[Screenshot of item listings. Labels shown: embargo -> 2047; public; non-commercial licence; in progress; Persistent URL + OAI + ARK (Archival Resource Key); Display source]
Our OAIS model
The OAIS model in TGE-Adonis pilot project
• Open Archival Information System (OAIS) is the ISO 14721 standard.
• For the pilot project we took advantage of prior experience in the long-term preservation of data from astrophysics.
  L’archivage numérique à long terme : les débuts de la maturité ? F. Banat-Berger, L. Duplouy, C. Huc. Direction des Archives de France, 2009.
• The actual implementation took into account specific features of oral resources, notably:
  • The diversity of file formats: sound/video and all signals associated with speech/singing
  • The fact that corpora and their annotations are subject to editorial modifications after being stored and shared
  • A multilingual approach to descriptive metadata, international scripts, transliteration/annotation standards (IPA etc.)
Preserving in the (very) long term: why and how?
Why should we preserve documents?
• Digital medium-term or long-term archiving is not a mere reliable backup.
• Aim 1: preserving data
• Aim 2: making it accessible and eligible for reuse after an unspecified period of time. This is the challenge of long-term preservation (archivage pérenne).
• Long-term digital preservation is not the ultimate step of data storage before it becomes untraceable or lost!
• Three major issues: 1) preserving a document, 2) making it accessible, 3) preserving its intelligibility.
(Source: CINES)
How shall we proceed?
• These issues deal with the very long term, meaning more than 30 years.
• For this reason data should be handed over to an institutional archive rather than a consortium of computer centres.
Implications of long-term preservation
• Preserve the intelligibility of a document:
  • The archive site accepts a restricted number of persistent formats whose specifications are freely accessible.
  • The archive site is committed to migrating formats once they have become obsolete. This is the job of the archive curator, not the producer’s!
• Preserve the meaning of its content:
  1. Descriptive metadata;
  2. Archival metadata;
  3. A formal description of conventions used in the archive. PPDI = Project Preservation Description Information. http://sldr.org/ppdi
The pilot project involved 3 actors:
• Submission sites: CRDO-Aix (SLDR) and CRDO-Paris
• The archive site: Centre informatique national de l’Enseignement supérieur (CINES, Montpellier)
• The dissemination site: Centre de calcul de l’Institut national de physique nucléaire et de physique des particules (CC-IN2P3, Lyon)
The OAIS model in TGE-Adonis pilot project
A functional diagram of the OAIS setup (you may skip this slide as simple explanations follow!)
[Diagram, boxes and flows:]
• Producers: provide data to CRDO.
• CRDO: organizes data collection, formatting and metadata; sends Submission Information Packages (SIP) to CINES.
• CINES: transfer management/SIP verification; creating AIPs (Archival Information Packages) and storing them; transferring DIPs to CC-IN2P3; returns receipts, warnings and archival certificates.
• CC-IN2P3: assessing transfers; structuring items for distribution; retrieving Dublin Core metadata/general cataloguing; receives Dissemination Information Packages (DIP).
• CRDO, the ‘domain’ application: graphic interface, handling OLAC metadata, query tools…; items distributed to scientific users.
• Generic infrastructure: public users.
• TGE ADONIS: management and funding.
Source: projet pilote pour la mutualisation de l’archivage pérenne des données orales
The life of an item on SLDR
1. Items submitted to SLDR are protected by regular back-up procedures (current data);
2. After proper packaging, each item is transferred to the test platform of the archive site (CINES);
3. After assessing the submission information package (SIP), CINES forwards a dissemination information package (DIP) to CC-IN2P3;
4. Several versions of the same item may be submitted to take into account editorial changes during the phase of medium-term preservation;
5. Once the item has become stable, it is transferred to the production platform of the archive site and assigned a persistent Archival Resource Key (ARK) for its long-term preservation. A DIP is again transferred from CINES to CC-IN2P3 for its distribution;
6. Submitting new versions is still possible but it should be justified since all versions are preserved in the long-term archive;
7. Nonetheless it remains possible to modify metadata, descriptive files and access rights without submitting a new version.
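The seven steps above can be read as a small state machine. The sketch below is only an illustration of that reading, not SLDR's actual software: the class, method and field names, the item identifier and the ARK string are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    CURRENT = "current data at SLDR (regular back-ups)"
    MEDIUM_TERM = "test platform at CINES (medium-term preservation)"
    LONG_TERM = "production platform at CINES (long-term preservation)"


@dataclass
class Item:
    identifier: str                      # hypothetical identifier
    stage: Stage = Stage.CURRENT
    version: int = 0                     # 0 = nothing submitted to CINES yet
    ark: Optional[str] = None            # assigned at step 5 only
    description: str = ""
    archived_versions: List[int] = field(default_factory=list)

    def submit_version(self, reason: str = "") -> int:
        """Steps 2-4 and 6: package the item and submit a (new) version."""
        if self.stage is Stage.LONG_TERM:
            # Step 6: every long-term version is kept, so a reason is required.
            if not reason:
                raise ValueError("a new long-term version must be justified")
            self.version += 1
            self.archived_versions.append(self.version)
        else:
            self.version += 1
            self.stage = Stage.MEDIUM_TERM
        return self.version

    def promote(self, ark: str) -> None:
        """Step 5: the stable item moves to production and receives an ARK."""
        if self.stage is not Stage.MEDIUM_TERM:
            raise ValueError("only a medium-term item can be promoted")
        self.stage = Stage.LONG_TERM
        self.ark = ark
        self.archived_versions.append(self.version)

    def update_metadata(self, description: str) -> None:
        """Step 7: metadata changes never create a new version."""
        self.description = description


item = Item("sldr999999")                # hypothetical item
item.submit_version()                    # version 1, medium-term
item.submit_version()                    # version 2, editorial changes
item.promote("ark:/99999/example")       # hypothetical ARK
item.update_metadata("revised abstract")
item.submit_version(reason="corrected annotations")
```

In this reading, only versions that reach the production platform are recorded as permanently archived, which is why a justification is enforced after promotion but not before.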
The OAIS solution worked out by SLDR
A multi-tier architecture
[Diagram labels: lab producer and individual producer (submission); SLDR (submission site); transfer; archive site, CINES (Montpellier), holding version 1 and version 2; dissemination site, CC-IN2P3 (Lyon), holding version 1 and version 2; note: "(No storage in medium-term preservation)"]
User queries interpreted by SLDR may be forwarded to the lab producer or the dissemination site.
[Diagram labels: user; lab producer; SLDR (submission site); archive site, CINES (Montpellier); dissemination site, CC-IN2P3 (Lyon); version 1 and version 2 at each site]
Open access: users may receive data/metadata from SLDR, lab producers and/or the dissemination site. Note that the archive site is not involved in this process.
[Diagram labels: users; lab producers; SLDR (submission site); archive site, CINES (Montpellier); dissemination site, CC-IN2P3 (Lyon); version 1 and version 2 at each site]
Metadata harvesting: portals harvest metadata found in repositories available at different levels of the system.
[Diagram labels: user; portal; lab producers; SLDR (submission site); archive site, CINES (Montpellier); dissemination site, CC-IN2P3 (Lyon); version 1 and version 2]
Creating the submission information package (SIP)
[Screenshot: item stored at SLDR]
Submitting an item to the archive site: the submission information package (SIP)
(See: http://sldr.org/wiki/Packaging-en)
Dealing with the SIP in CINES and forwarding a DIP to CC-IN2P3
[Diagram labels: CINES (archive site), long-term storage (AIP); Dissemination Information Package (DIP); CC-IN2P3 (dissemination site); Fedora Commons; iRods; Arcsys]
Accessing data on SLDR
1. Selection
2. SLDR to CC-IN2P3 query
3. Downloading under SLDR licence
Controlled downloading from CC-IN2P3 is achieved via a ‘channelling’ of data through SLDR.
[Diagram labels: user; SLDR; CC-IN2P3 (dissemination); open access]
Deleting data from the submission site
Data may be deleted from SLDR as it is entirely distributed from CC-IN2P3.
[Diagram labels: user; SLDR; CC-IN2P3 (dissemination); open access; controlled access]
Restoring data on the submission site
When necessary for preparing a new version of an item, its entire data set may be retrieved from the distribution site.
Retrieving an item from CC-IN2P3 to SLDR
Datastreams stored at CC-IN2P3 are downloaded to SLDR and the whole structure of the item is restored along with original file names and dates of modification.
Archival metadata: the ‘sip.xml’ file
This file contains archival metadata: a brief description of the item, relations to other items, versioning, storage instructions and specifications of all files submitted for long-term preservation (in the DEPOT folder).
[Screenshot of a ‘sip.xml’ file, with the callout "10,000 years!"]
File description in the SIP
[Screenshot]
Interoperability
OAI-PMH: Dublin Core metadata
• Multilingual keywords
• Multilingual description
Both oai_dc and olac metadata formats are produced.
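As an illustration of how a harvester could ask for one record in either format, here is a minimal sketch that builds a standard OAI-PMH `GetRecord` request. The verb and its `identifier` and `metadataPrefix` arguments come from the OAI-PMH protocol, and the `oai:sldr.org:…` identifier pattern is the one used on these slides. The base URL is a placeholder, since the address of the SLDR OAI-PMH endpoint is not given here.

```python
from urllib.parse import urlencode

# The two metadata formats named on the slide.
SUPPORTED_PREFIXES = ("oai_dc", "olac")


def oai_get_record_url(base_url: str, item_id: str,
                       metadata_prefix: str = "olac") -> str:
    """Build an OAI-PMH GetRecord request URL for one SLDR item.

    base_url is an assumption: replace it with the real endpoint.
    """
    if metadata_prefix not in SUPPORTED_PREFIXES:
        raise ValueError(f"unsupported metadata format: {metadata_prefix}")
    params = {
        "verb": "GetRecord",
        "identifier": f"oai:sldr.org:{item_id}",
        "metadataPrefix": metadata_prefix,
    }
    # urlencode percent-encodes the colons of the identifier.
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, for illustration only.
url = oai_get_record_url("http://example.org/oai", "sldr000745")
```

The same function with `metadata_prefix="oai_dc"` would request the Dublin Core rendering of the record.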
Descriptive metadata
OLAC: Open Language Archives Community
Descriptive metadata should contain all details relevant to reusing a linguistic resource. A simple example registered in the OLAC archive: http://www.language-archives.org/item/oai:sldr.org:sldr000745
A more elaborate example is found at: oai:sldr.org:sldr000764
Processing queries on SLDR
SLDR persistent links and open access
SLDR currently meets requirements for creating persistent links to open-access or controlled-access items and the files that they contain. Below are examples.
• Link http://sldr.org/sldr000014 displays item sldr000014.
• Link http://sldr.org/sldr000014/get/olac picks up OLAC descriptive metadata for item sldr000014.
• Link http://sldr.org/sldr000027/source launches a ‘source’ query on the server of the lab producing sldr000027.
• Link http://sldr.org/sldr000723/download starts a download of the latest version of sldr000723 (under Creative Commons licence) irrespective of its distribution site.
• Link http://sldr.org/sldr000014/download starts a download of the latest version of sldr000014 (under CRDO licence) irrespective of its distribution site.
• Link http://sldr.org/sldr000036_v3/download starts a download of version 3 of item sldr000036 irrespective of its distribution site.
• Link http://sldr.org/sldr000014/toc displays a detailed table of contents from which no file can be downloaded because the entire item is in restricted access (under CRDO licence).
• Link http://sldr.org/sldr000525/toc displays a detailed table of contents from which a few open-access files can be downloaded.
• Link http://sldr.org/sldr000525/map displays a map of files from which a few open-access files can be downloaded.
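The link patterns listed above are regular enough to be generated programmatically. The following sketch reproduces them; the function name and the action keywords are hypothetical, and whether the `_vN` version suffix can be combined with actions other than `download` is an assumption, since the slide only shows that one combination.

```python
from typing import Optional

BASE = "http://sldr.org"

# Action keyword (hypothetical) -> path suffix taken from the examples.
ACTIONS = {
    "olac": "get/olac",      # OLAC descriptive metadata
    "source": "source",      # query on the producing lab's server
    "download": "download",  # download of a version of the item
    "toc": "toc",            # detailed table of contents
    "map": "map",            # map of files
}


def item_url(number: int, action: Optional[str] = None,
             version: Optional[int] = None) -> str:
    """Build a persistent SLDR link for an item number such as 14."""
    item = f"sldr{number:06d}"           # identifiers are zero-padded to 6 digits
    if version is not None:
        item += f"_v{version}"           # e.g. sldr000036_v3
    if action is None:
        return f"{BASE}/{item}"          # plain link: display the item
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return f"{BASE}/{item}/{ACTIONS[action]}"


links = [
    item_url(14),
    item_url(14, "olac"),
    item_url(36, "download", version=3),
    item_url(525, "map"),
]
```

Because none of these URLs mentions a storage site or an internal object identifier, they can stay valid when an item moves between the lab producer, SLDR and the dissemination site.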
SLDR (non-)persistent links for open access
http://sldr.org/sldr000033/toc
This URL: http://fedora.tge-adonis.fr:8090/fedora/get/CRDO-Aix:126690/DEPOT_525.pdf depends on the dissemination service and the current version of the item because of its references to identifier ‘126690’ and index ‘525’.
SLDR persistent links for open access
http://sldr.org/sldr000525/toc
This URL: http://sldr.org/sldr000525/get/stream/CG5_22k-tc.txt is independent of the dissemination service, the current version of the item, and whether it is stored in a medium-term or long-term archive.
(To this effect, file [72] was packed in a ‘stream’ directory.)
This URL will remain accessible after uploading a new version to the archive.
What if its access right needs to be modified?
Solution: modifying access rights to any document should be accomplished by a simple update of metadata, i.e. no versioning of the item concerned.
Access rights management
SLDR data is submitted to an institutional archive (CINES) for its long-term preservation. Access rights should therefore be compliant with the French Code du patrimoine (Heritage code) with respect to public archives.
Access options
Access to | Owner of item | Privileged user (team) | Privileged individual | Member of authorised group | … | Anybody
Open-access files in table of contents | yes | yes | yes | yes | yes | yes
Any file in current version | yes | yes | yes | yes | … | no
Any file in previous versions | yes | yes | no | no | … | no
Any file at the source (next version) | yes | yes | no | no | … | no
File in a SECRET folder | yes | yes | no | no | … | no
Updating descriptive files and metadata | yes | yes | no | no | … | no
Editing/viewing confidential metadata | yes | no | no | no | no | no
More options to come!
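The access options can be encoded as a lookup table. This is a minimal sketch of that encoding, not SLDR's implementation: the role and operation names are hypothetical labels for the columns and rows of the slide, and the ‘…’ column (user categories left unspecified) is omitted.

```python
# Columns of the access-options slide, from most to least privileged.
ROLES = (
    "owner",
    "privileged_team",
    "privileged_individual",
    "authorised_group_member",
    "anybody",
)

# One row per operation; each tuple follows the order of ROLES.
ACCESS = {
    "open_access_file_in_toc":     (True, True, True, True, True),
    "file_in_current_version":     (True, True, True, True, False),
    "file_in_previous_version":    (True, True, False, False, False),
    "file_at_source":              (True, True, False, False, False),
    "file_in_secret_folder":       (True, True, False, False, False),
    "update_descriptive_metadata": (True, True, False, False, False),
    "confidential_metadata":       (True, False, False, False, False),
}


def is_allowed(role: str, operation: str) -> bool:
    """Return True if the given role may perform the given operation."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    if operation not in ACCESS:
        raise ValueError(f"unknown operation: {operation}")
    return ACCESS[operation][ROLES.index(role)]


# Every operation is open to the owner; only one is open to anybody.
owner_rights = [op for op in ACCESS if is_allowed("owner", op)]
public_rights = [op for op in ACCESS if is_allowed("anybody", op)]
```

Keeping the rights in a data table rather than in branching code fits the announcement that more options are to come: a new user category or operation becomes a new column or row.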
Code du patrimoine (the Heritage Code) in France
Excerpts from Code du patrimoine (the Heritage Code), law of 15 July 2008:
L211-1: Archives cover all documents, regardless of their date, place of storage, shape and physical support, produced or received by any person or entity and any department or public or private agency during the exercise of their business.
L211-2: The preservation of archives is organized in the public interest, both for the sake of dealing with and assessing the rights of individuals or legal entities, public or private, and for documenting research with historical material.
L211-4: Public archives are: (a) Documents produced by the activity of State, local governments, public institutions and other legal persons under public or private law who are in charge of a public service, as part of their public service remit. (...)
L213-1: Public archives are in open access if not subject to restrictions as per Article L. 213-2.
L213-2: Notwithstanding the provisions of Article L. 213-1 (...) public archives are automatically granted open access after a delay of ... (see details below). Among derogations (code AR048): 50 years for documents whose disclosure undermines the protection of privacy, documents that contain an assessment or value judgment about a person who is named or easily identifiable, or documents that reveal the behaviour of a person under circumstances which might cause him/her harm.
L213-5: Any Administration holding public or private archives is required to give reasons for objecting to a request for access to archival documents.
Access rights management on SLDR
[Screenshot with callouts numbered 1 to 4, referred to in the text below]
• Access to items disseminated by SLDR must be compliant with the French Code du patrimoine, which classifies every scientific production as a public archive.
• By default, a public archive shall be in open access (Article L213-1, Act of 15 July 2008). However, derogation clauses for restricting access are applicable in cases listed in Art. L213-2.
• Denied access shall be explicitly justified (Art. L213-5) (1).
• In the latter case, right holders may sign a free consent form to share the document before the end of the restriction period (2).
• Permissions may be granted for a limited period of time (3) under special conditions (4).
• Different access rights may be specified for single files in a given item.
• Permission to download a file is granted by the distribution site on the basis of conditions declared by producers, as shown above.
Article L213-2, Code du patrimoine, Act of 15 July 2008
Modified by Ordonnance Nr. 2009-483 of 29 April 2009 - art. 13
Notwithstanding the provisions of Article L. 213-1:
I.- Public archives are automatically granted open access after a delay of:
1. Twenty-five years from the date of the document or the most recent document included in the file:
a) For documents whose disclosure violates the confidentiality of the deliberations of the Government and authorities of the executive branch to conduct foreign relations, currency and credit, public, business and industrial secrecy, inquiries by the relevant departments on tax and customs offenses (AR039) or secrecy in statistics except when relevant data are collected through questionnaires relating to facts and private behavior mentioned in 4 and 5 (AR041);
b) For documents mentioned in 1 of I of Article 6 of Law No. 78-753 of 17 July 1978, with the exception of documents produced under a contract for services performed on behalf of one or more specific persons when those documents, because of their content, fall in the scope of points 3 or 4 of the present act (AR040, AR042, AR053);
2. Twenty-five years from the date of death of the person concerned, for documents whose disclosure violates medical confidentiality (AR043). If the date of death is unknown, the time is one hundred and twenty years from the date of birth of the person in question (AR061);
3. Fifty years after the date of the document or the most recent document included in the file, for documents whose disclosure violates the secrecy of national defense, fundamental interests of the State in the conduct of external affairs, public safety, security of persons (AR049) or for the protection of privacy (AR048), except documents mentioned in 4 and 5.
The same deadline applies to documents that contain an assessment or value judgment about an individual, named or readily identifiable, or which reveal the behavior of a person under circumstances that might cause him/her prejudice (AR048).
The same deadline applies to documents relating to the construction, equipping and operation of structures, buildings or parts of buildings used for detention of persons or usually receiving detainees (AR050). This period is counted from the end of the allocation to these uses of structures, buildings or parts of buildings in question;
4. Seventy-five years from the date of the document or the most recent document in the folder, or within twenty-five years from the date of death of the person if the latter period is shorter:
a) For documents whose disclosure violates the secrecy of statistics when data collected through questionnaires relate to facts and behavior of private life (AR041, AR052);
b) For documents relating to investigations conducted by the staff of the Judicial Police (AR046, AR056);
c) For documents relating to cases before the courts, subject to special provisions relating to judgments, and enforcement of judgments (AR047, AR051, AR057);
d) For the minutes and registers of public or ministerial officers (AR045, AR055);
e) For records of birth and marriage certificates of civil status, after their completion (AR044, AR054);
5. One hundred years from the date of the document or the most recent document in the folder (AR058), or within twenty-five years from the date of death of the person if the latter period is more, with respect to documents referred to in point 4 dealing with persons under 18 (AR041, AR046, AR047, AR045).
The same limits apply to documents covered or having been covered by the national defense secrets whose disclosure is likely to endanger the safety of persons named or readily identifiable (AR059). It is the same for documents relating to investigations conducted by police services, judicial matters brought before the Courts (AR058), subject to special provisions relating to judgments, and enforcement of judgments the communication of which affects the intimacy of people’s sexual life (AR060).
II.- Access is prohibited to public archives whose disclosure would lead to the dissemination of information to design, manufacture, use or location of nuclear, biological, chemical or other weapons that have a direct or indirect destruction of a similar level (AR062).
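The delays of points 1 to 3 amount to simple date arithmetic, which the sketch below works through at year granularity. It is an illustration of the excerpt above and nothing more: the function names are hypothetical, only three of the derogation codes are included, and the alternative periods of points 4 and 5 are deliberately left out.

```python
from typing import Optional

# Delay in years, counted from the date of the document, for a few of the
# derogation codes quoted in the excerpt (points 1 and 3 of Art. L213-2).
DELAY_FROM_DOCUMENT_DATE = {
    "AR039": 25,  # point 1 a): tax and customs inquiries
    "AR048": 50,  # point 3: protection of privacy
    "AR049": 50,  # point 3: defence, public safety, security of persons
}


def open_access_year(document_year: int, derogation_code: str) -> int:
    """Year in which a document restricted under the given code opens."""
    if derogation_code not in DELAY_FROM_DOCUMENT_DATE:
        raise ValueError(f"code not covered by this sketch: {derogation_code}")
    return document_year + DELAY_FROM_DOCUMENT_DATE[derogation_code]


def medical_open_access_year(death_year: Optional[int] = None,
                             birth_year: Optional[int] = None) -> int:
    """Point 2: 25 years after death (AR043), or 120 years after birth
    when the date of death is unknown (AR061)."""
    if death_year is not None:
        return death_year + 25
    if birth_year is not None:
        return birth_year + 120
    raise ValueError("either death_year or birth_year is required")


# Hypothetical recording made in 1997 and restricted for privacy (AR048).
privacy_embargo_ends = open_access_year(1997, "AR048")
```

For the hypothetical 1997 recording, the privacy restriction runs until 1997 + 50 = 2047.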
(With p.o.) Victorine Dumas, 23 April 2011. Photograph: Médéric Gasquet-Cyrus
Informed consent and complimentary user’s licence
[Diagram labels: User; Informant]
Specifying private/public status in the package
Some folders bear special names indicating a peculiar way of sharing their contents.
Setting up access rights attributes is explained in full detail on this page: http://sldr.org/wiki/accessRightsSettings_en
Following up resource usage
SLDR user licence
[Screenshots: signing the CRDO user licence; downloading]
Notification of downloading (under SLDR licence)
Notifications are posted in the language chosen by the user. They remind him/her of the nature of downloaded items, their reference publications and the terms of the SLDR licence.
1) Users’ community
This page displays the downloads of an item and the academic profiles of its users.
Access to this list is restricted to persons who have produced items on SLDR or who have downloaded this particular item.
2) Publications
3) Teams and research projects
Three projects:
Project 1: Corpus of Interactional Data (2003)
Project 2: OTIM (Outils de Traitement de l'Information Multimodale) (ANR 2009)
Project 3: La production des sons en parole conversationnelle (2007)
Following up persons, productions, teams and projects associated with SLDR resources is of great relevance to:
• facilitating collaborative work beyond institutional barriers (international projects etc.)
• stressing the usefulness of oral data for the research community, the diversity of their uses, and consequently the benefit of sharing them on a non-commercial basis.
Work in progress
• SLDR team at LPL (part time):
  • 3 administrators
  • technical coordination: 2 research engineers
  • scientific coordination: 2 research scientists
• In the framework of the ORTOLANG project (Open Resources and Tools for Language), a consortium of French linguistic research labs is planning to set up a network of CLARIN centres relying on SLDR and CNRTL respectively for the preservation and sharing of oral and text resources. This project will notably require:
  • implementing persistent identifiers based on EPIC (European Persistent Identifier Consortium);
  • enriching descriptive metadata formats to facilitate interoperability;
  • implementing Shibboleth or OpenSSO cross-site authentication.
• Complementary distribution of linguistic resources in collaboration with ELRA and the Linguistic Data Consortium (LDC), e.g. sldr000034 and sldr000770.
• We are eager to work in association with research projects likely to prompt the implementation of new features!
Journal of development: http://sldr.org/wiki/Developpement/Journal