Application of International GeoSample Number (IGSN) to Sample Collections Sri Vinay Geoinformatics for Geochemistry (GfG) Program Lamont Campus of Columbia University 2007 September 25
Jan 17, 2016
Presentation Outline
Unique identifiers and their application to sample and data management
System for Earth SAmple Registration (SESAR) and International GeoSample Number (IGSN)
Current Status and Activities of SESAR
IGSN Implementation Strategies
Discussion
Unique Identifiers
An identifier is an unambiguous label which specifies an entity.
Unique identifiers are widely used to designate physical objects, assisting in trading (e.g., the Universal Product Code bar code system), and the extension of similar principles to digital and abstract entities is a prerequisite for digital commerce of rights and intellectual content.
Although the design of unique identification schemes is a technical problem, it is also a business issue with implications for what is identified and how identified items are made available.
tel:+1-816-555-1212
URN:NBN:fi-fe976238
DOI:10.1000/ISSN1047-935X
“In a dynamic and distributed information environment, the effective management of both metadata records and the resources they describe requires a systematic way of generating and assigning unique identifiers.” (N. Friesen 2002: Recommendations for Globally Unique, Location-Independent, Persistent Identifiers)
Life Sciences - Bioinformatics
LSID = Life Science Identifier
“The World-Wide Web provides a globally distributed communication framework that is essential for almost all scientific collaboration, including bioinformatics. However, several limits and inadequacies have become apparent, one of which is the inability to programmatically identify locally named objects that may be widely distributed over the network. This shortcoming limits our ability to integrate multiple knowledgebases, each of which gives partial information of a shared domain, as is commonly seen in Bioinformatics” (Clark, T., Martin S., Liefeld T., 2004: Globally distributed object identification for biological knowledgebases. Briefings in Bioinformatics. Vol. 5 (1), 59-70.)
URN:LSID:ncbi.nlm.nih.gov:GenBank.accession:NT_001063:2
Geosciences - Geoinformatics
Kai Lin (SDSC): “Ontology Based Resource Registration and Integration in GEON”, Lecture July 2005
Examples from the PetDB Database
Name   Location          Publication     Cruise
D3-1   SEIR              ANDERSON, 1980  VM3301 (Vema)
D3-1   North Fiji Basin  EISSEN 1994     Starmer 1 (Nadir)
D3-1   Shimada Smt       GRAHAM 1988     S1-79 (Sea Sounder)
D3-1   Gorda Ridge       CLAGUE 1984     KK2-83NP (Kana Keoki)
3-1    Lamont Smts       BATIZA 1982     RISE III (New Horizon)
Sample names are duplicated.
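The ambiguity in the PetDB examples above can be made concrete with a short sketch that groups the records by sample name. The rows are copied from the slide; the tuple layout and the function name `ambiguous_names` are illustrative assumptions, not part of PetDB or SESAR.

```python
from collections import defaultdict

# (name, location, publication, cruise) rows copied from the PetDB examples
RECORDS = [
    ("D3-1", "SEIR", "ANDERSON, 1980", "VM3301 (Vema)"),
    ("D3-1", "North Fiji Basin", "EISSEN 1994", "Starmer 1 (Nadir)"),
    ("D3-1", "Shimada Smt", "GRAHAM 1988", "S1-79 (Sea Sounder)"),
    ("D3-1", "Gorda Ridge", "CLAGUE 1984", "KK2-83NP (Kana Keoki)"),
    ("3-1", "Lamont Smts", "BATIZA 1982", "RISE III (New Horizon)"),
]


def ambiguous_names(records):
    """Return the names that were used on more than one distinct cruise."""
    cruises_by_name = defaultdict(set)
    for name, _location, _publication, cruise in records:
        cruises_by_name[name].add(cruise)
    return {
        name: sorted(cruises)
        for name, cruises in cruises_by_name.items()
        if len(cruises) > 1
    }


DUPLICATES = ambiguous_names(RECORDS)
```

Here the name alone cannot tell a data system which of four different samples is meant; only the extra metadata (cruise, location) separates them.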
Sample names are modified or changed.
Dredge sample 3, Amphitrite Cruise 1963/4
Name         Publication
D3           Engel 1964
D-3          Scheidegger 1981, Schilling 1971
PD3          Tatsumoto 1965, 1966
PD-3         Hedge 1970, Muehlenbach 1972
PV D-3       Engel 1965
AMPH3D       Pineau 1976
AMPH-D3      MacDougall 1986
AMPH D-3     Sun 1980, Schilling 1975
AMPH 3-PD-3  Hart 1971
S-10         Subbarao 1972
Name                         Publication
46396B 22 3,28-38            Dungan 1978
396B 22 3,28-38              Muehlenbach 1979
249                          Dungan 1978
DSDP046-0396B-022-003/28-38  PetDB
DSDP Leg 46, Hole 396B, Section 22, Sample 3, 28-38 cm
Sample Naming in the Geosciences
Geosciences - Geoinformatics
Integration of data in a distributed system requires unique identification of samples.
Currently, naming of samples is ambiguous.
  Different samples have identical names.
  Samples are renamed.
  Metadata that allow unique identification are often missing for terrestrial samples.
  Institutions have their own naming protocols, with no assurance that names are unique on a global scale.
Access to information about the samples is needed
  to ensure proper evaluation and facilitate interpretation of sample-based data.
Links to physical specimens are needed
  to make observations & measurements, and the science derived from them, reproducible.
  to allow discovery & re-use of samples for improved use of existing collections.
Urgency to Act
Growing number of data systems with sample-based data
Growing demand for ‘fine-grained’ access to data at the level of individual samples
New technologies for linking and integrating data (interoperability)
Increasing need to share samples
Generating Unique IDs: Options
“Registration-based schemes”
  Require a central clearinghouse
  Register personal or institutional names
  Register prefix or namespace (e.g. URN)
  Register metadata that allow the central clearinghouse to generate identifiers
Schemes without registration
  Use a computational process (naming protocol) to produce an ID based on metadata
  No central authority
No-Registration Scheme
Risk of incorrect application of naming protocol
Risk of name duplication
Identifier might grow to impracticable length to ensure uniqueness
Metadata missing for legacy samples
Easy implementation
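A minimal sketch of a no-registration scheme may help show where the risks listed above come from. The slides do not specify any naming protocol; the choice of a truncated SHA-256 hash, the metadata fields, and the sample values below are all assumptions made for illustration.

```python
import hashlib


def derive_id(cruise: str, station: str, sample: str, length: int = 9) -> str:
    """Hypothetical naming protocol: hash normalized metadata into a short ID."""
    key = "|".join(part.strip().upper() for part in (cruise, station, sample))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:length].upper()


# Same metadata, consistently normalized: same ID, no clearinghouse needed.
SAME = derive_id("VM3301", "D3", "1") == derive_id(" vm3301 ", "d3", "1")

# A legacy record that writes the station as "D-3" instead of "D3" yields a
# different ID for what may be the same physical sample.
DIFFERENT = derive_id("VM3301", "D3", "1") != derive_id("VM3301", "D-3", "1")
```

The protocol is easy to implement, but it only works if every party applies exactly the same normalization and has the full metadata at hand, which is often not the case for legacy samples.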
SESAR - A Centralized Approach
Response to urgent need for unique ID
Easier to prevent duplicate registrations
Easier to ensure links between parent and child samples
Provide a central access point for Peer2Peer registration
Facilitate international collaboration
Build a Global Sample Catalog
SESAR – A Centralized Approach
Proposed to NSF in July 2004
SGER (EAR) award received for September 2004 - August 2005
First presented to community at Marine Curators’ Meeting at LDEO, September 2004
Supplement received in Sept 2005 until May 2006
Workshop at SDSC January 2005
Proposal to NSF August 2005
Three year grant awarded in April 2006 (NSF-OCE).
International GeoSample Number:
A Global Unique Identifier for Earth Samples
IGSN:SIO001324 (unique user code followed by a string of random characters)
Managed at central clearinghouse (SESAR)
Strict Syntax (9 characters: letters [A-Z] & numbers [0-9])
  Fits sample labels
  Fits data tables in publications
  Allows 2,176,782,336 sample identifiers per registrant
Generated by SESAR or by users
Does not replace personal or institutional names
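The syntax rule and the capacity figure above can be checked with a small sketch. The split into a 3-character user code and a 6-character suffix is an assumption inferred from the example IGSN:SIO001324 and from the stated capacity (36^6 = 2,176,782,336); the function name `parse_igsn` is hypothetical and not a SESAR API.

```python
import re

IGSN_PATTERN = re.compile(r"[A-Z0-9]{9}")
USER_CODE_LENGTH = 3  # assumption: inferred from the example and the capacity


def parse_igsn(text: str):
    """Split 'IGSN:SIO001324' or 'SIO001324' into (user_code, suffix)."""
    value = text.strip()
    if value.upper().startswith("IGSN:"):
        value = value[5:]
    if not IGSN_PATTERN.fullmatch(value):
        raise ValueError(f"not a valid 9-character IGSN: {text!r}")
    return value[:USER_CODE_LENGTH], value[USER_CODE_LENGTH:]


# Number of distinct suffixes available to one user code: 36 symbols, 6 places.
CAPACITY = 36 ** (9 - USER_CODE_LENGTH)
```

For example, `parse_igsn("IGSN:SIO001324")` returns `("SIO", "001324")`, and `CAPACITY` equals the 2,176,782,336 identifiers per registrant quoted on the slide.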
Benefits of the IGSN & SESAR
Ability to unambiguously identify samples
  allows linking & integrating data for a single sample.
  advances interoperability among digital data management systems & the development of Geoinformatics.
  helps build more comprehensive data sets for samples.
  fosters new cross-disciplinary approaches in science.
  aids preservation and curation; orphaned samples can be identified.
  ensures proper linking of data from samples and sub-samples.
  facilitates sharing of samples.
SESAR: Status
Basic version of system functional since Fall 2004
Nearly 3.6 Million GeoObjects registered
  All DSDP/ODP GeoObjects (holes, cores, core sections, core samples)
  Dredge and core collections from Scripps, WHOI, Lamont, Antarctic Research Facility (ARF)
  >40,000 mineral specimens from Harvard Museum
  Rocks & minerals from the US Polar Rock Repository
IGSN implemented in Geoscience data systems (e.g. EarthChem, MetPetDB, PaleoStrat, CoreWall)
Revised & extended version to be released in phases by end of 2007
SESAR: Sample Registration
Obtain account via website
  Set up login/password
  Get a unique user code
Submit sample information
  Via Batch Registration Forms (.xls workbooks)
  Via web site (currently off-line for upgrade)
  Via web services (under development)
Registration via Spreadsheet Forms
Available Batch Registration Forms
1. Coring GeoObjects
2. Dredges/trawl/grabs
3. Individual samples
4. Sections, Suites, & Sequences
Registration via Web Site:
Currently off-line for upgrades
Registration via Web Services:
Under Development
Registration of objects via collaborating data systems
  Automatically register samples when sample metadata are entered into collaborating data systems (e.g. IODP, MGDS)
  Eliminates redundant metadata submission
Systems communicate via web services
  Starting with REST based services. Could support SOAP in future.
Authentication
  Investigating different technologies including GEON/GAMA
Metadata exchange and validation
  XML schema
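As a rough sketch of what XML-based metadata exchange and validation between collaborating systems could look like, the example below serializes one sample record and checks it for required fields. The slides say only that an XML schema is planned; the element names (`sample`, `igsn`, `name`, `parent_igsn`), the list of required fields, and the sample name are hypothetical, and no actual SESAR endpoint or schema is used.

```python
import xml.etree.ElementTree as ET


def build_registration_xml(igsn: str, name: str, parent_igsn: str = "") -> str:
    """Serialize one sample record; element names are hypothetical."""
    sample = ET.Element("sample")
    ET.SubElement(sample, "igsn").text = igsn
    ET.SubElement(sample, "name").text = name
    if parent_igsn:
        ET.SubElement(sample, "parent_igsn").text = parent_igsn
    return ET.tostring(sample, encoding="unicode")


def validate_registration_xml(payload: str) -> list:
    """Return a list of problems; an empty list means the record passed."""
    root = ET.fromstring(payload)
    problems = []
    for required in ("igsn", "name"):
        node = root.find(required)
        if node is None or not (node.text or "").strip():
            problems.append(f"missing <{required}>")
    return problems


PAYLOAD = build_registration_xml("SIO001324", "EXAMPLE-1")
```

A collaborating data system would send such a payload in the body of a REST request; a real deployment would validate against the published schema rather than this hand-written check.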
SESAR Service “MyGeoSamples”
Current Services:
  Long-term preservation of information about samples
  Lists of personal sample collections
  Store images, field notes, etc.
Assist investigators to manage their samples.
SESAR Service “MyGeoSamples”
Services “Under Construction”
  Search & sort personal sample collections
  Create maps of sample locations
  Establish links to data (publications, data systems)
  Download tabular sample information to spreadsheets
Antarctic Research Facility, FSU: ca. 7,000 cores
SESAR Service “MyGeoSamples”
Potential Services:
  Modules to manage administrative metadata (customizable)
  Modules for creating & operating web interfaces to collections
Advantages
  No IT infrastructure required (except a computer and an internet connection)
  No maintenance and risk & contingency management
  Access from anywhere by authorized individuals
  Platform independent
Extended Services for Sample Curation?
The SESAR Global Sample Catalog
SESAR integrates the World’s sample collections
Allows users to find/discover existing samples
Provides access to “sample profiles”
  View sample information in SESAR as provided
  Link to the specimen’s ‘home’ (archive)
  Link to data (publications, databases)
The Challenges
Diversity of collections
  Repositories
  Museums
  Individual Investigators
  Structured science & field programs
Metadata requirements
  Sample types & relations
  Vocabularies
Global Scope
Data Generated by International Collaborations
  IODP
  ICDP
  InterMARGINS, InterRidge
Data are shared globally
  Scientific literature
  Web-based repositories
Samples are shared globally
Multiple systems and catalogs
  Data Management Systems for Science Programs
    Ridge2000 - MGDS
    MARGINS - MGDS
    IODP
  Domain Specific Catalogs
    NGDC – IMLGS
  National Catalogs
    Canadian National Sample Management System
  SESAR
Issues
  Redundancies
  Unacceptable demands on investigators
  Inconsistencies
  Fragmentation
  Competition rather than collaboration
Adoption
  Sample curation
  Data publication
IGSN Implementation Strategies
Work with investigators, curators and repositories to define & integrate registration process and IGSN into existing sample and data management workflows
  Joint Workshop of SESAR & NGDC, February 26 & 27, 2007, Boulder, CO
  Registration of repository and museum collections ongoing
Advance adoption of IGSN
  Work with editors to make IGSN a requirement for data publication (e.g. Editors’ Round Table, Societies)
  Work with funding agencies, large science programs (e.g. IODP, MARGINS, ANDRILL), CI projects (e.g. GEON, CHRONOS), and repositories on sample and data archiving policies
Work with CI Partners on system design & interoperability
  Interoperability Workshop, January 2005 at SDSC
  Working with GEON on authentication scheme
  Working with IODP and KU/EarthChem on web services
Editors’ Breakout*
- Reporting Data:
  - Published paper is point of record. All data should be reported. No “representative data”, no “data can be obtained from author”, no data available at personal websites
  - Submission to databases should be strongly encouraged
- Unique sample identifier (IGSN)
  - This may solve the problem of poor sample metadata
  - This system is being implemented.
  - Essential component of successful database: contains sample metadata, allows samples to be followed through their analytical history.
  - Tracks samples and subsamples.
  - We should start using it now.
*at the GERM Meeting, May 2006, recommendations of Editors’ Breakout presented by Steve Goldstein
Support by Funding Agencies
“We have also funded an effort (SESAR) to uniquely identify all samples so that various analyses on the same samples can be cross referenced and listed. I would also like you to indicate in your dissemination plan that your suite of samples will be registered with SESAR.”
Letter of NSF Program Manager (OCE/MG&G) to a PI, processing paperwork for a grant (January 2007)
Kerstin Lehnert: The Digital Specimen
“Government, educational, and private sector organizations, individually as well as collectively, are encouraged to aggressively address the following Geoscience data-preservation challenges:
  identifying, organizing, documenting, and cataloging existing data collections, preferably in a digital format;
  constructing logical linkages and search engines that facilitate access to organizations and their geoscience sample and data collections;
  dedicating adequate space (physical and digital) for storing and efficient accessing of existing and future samples and data sets;”
Joint Workshop of SESAR & NGDC IMLGS
Boulder, CO, February 26 & 27, 2007
Define procedures & best-practices for
  Creating & assigning IGSNs
  Submitting metadata for GeoObjects to SESAR
Work towards an integrated system of sample catalogs
  Recommend ways to define & implement standards for metadata and vocabularies
  Identify possibilities for streamlining procedures for submission of sample metadata to catalogs
Workshop Recommendations
Streamlined Registration Process
  Registration process should be simple
  Options to integrate easily into existing sample and data management workflows
  Ability to adopt required metadata from existing forms in use to avoid redundant metadata submission to multiple systems
  Support automated registration from other systems via web services to avoid manual/redundant metadata submission
Best Practices
  Objects should receive an IGSN at the time of labeling
  Objects should have an IGSN before being distributed among multiple investigators and users
  Parent objects should be registered before child objects
  Metadata should include geospatial info (coordinates preferred)
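The best practice that parent objects are registered before child objects amounts to a topological ordering of the sample hierarchy. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the function name and the hole / core / core section / core sample chain are illustrative assumptions, not a description of how SESAR processes registrations.

```python
from graphlib import TopologicalSorter


def registration_order(parent_of: dict) -> list:
    """Order object names so that every parent precedes its children.

    parent_of maps each object to its parent, or to None for top-level objects.
    """
    sorter = TopologicalSorter()
    for child, parent in parent_of.items():
        if parent is None:
            sorter.add(child)
        else:
            sorter.add(child, parent)  # parent is a predecessor of child
    return list(sorter.static_order())


ORDER = registration_order({
    "core sample": "core section",
    "core section": "core",
    "core": "hole",
    "hole": None,
})
```

For this chain the result is hole, core, core section, core sample. A circular parent-child relation would raise `graphlib.CycleError`, which is a useful sanity check on submitted metadata.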
Workshop Recommendations
Batch Registration Forms
  It is preferred that forms for the MGDS, IMLGS, and SESAR have the same column headers, with the metadata listed under each header clearly defined. The order of the headers can vary.
  An XML schema for sample metadata should be developed to which the metadata in any spreadsheet can be exported.
  SESAR Batch Registration Forms should be customizable, e.g. buttons beneath the header should allow hiding unnecessary columns. Columns for metadata that are identified as ‘recommended’ should always be visible.
  SESAR should develop a manual for filling out the forms. The manual should include instructions regarding definition of parent – child relations. It needs to be decided if a site should get an IGSN. It is possible to link multiple stations taken at one site by including the site name as metadata.
Vocabularies and Classification Schemes
  Adopt from existing standards as much as possible and work with repositories and other systems to use common schemes
  It is preferable for different systems (MGDS, IMLGS, SESAR) to allow multiple vocabularies
  List allowed vocabularies on the Marine Metadata Initiative (MMI) web site.
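The recommendation that batch forms from different systems share the same column headers, in any order, can be checked mechanically. The following sketch compares two header rows as sets; the header names are invented for illustration and are not the actual SESAR, MGDS, or IMLGS form columns.

```python
def header_mismatches(reference: list, candidate: list) -> dict:
    """Compare two header rows, ignoring column order, case, and outer spaces."""
    def normalize(headers):
        return {h.strip().lower() for h in headers}

    ref, cand = normalize(reference), normalize(candidate)
    return {"missing": sorted(ref - cand), "extra": sorted(cand - ref)}


# Hypothetical header rows from two systems' batch forms
SESAR_HEADERS = ["IGSN", "Sample Name", "Latitude", "Longitude"]
OTHER_HEADERS = ["sample name", "Longitude", "Latitude", "IGSN", "Collector"]
RESULT = header_mismatches(SESAR_HEADERS, OTHER_HEADERS)
```

In this example nothing is missing and only the additional "collector" column is reported, so the second form could be ingested without asking the investigator to re-enter metadata.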
Registration Procedures to Support Integration with Existing Workflows: Under Implementation
Trusted Agents
  A registrant can apply to become a Trusted Agent. Trusted Agents are authorized to generate unique IGSNs within their registered name space (user code). They can use tools, e.g. Excel, on the ship or in the field, to generate IGSNs within their given name space, have the samples labeled with IGSN, and submit the IGSN along with metadata via web services within a short time frame. Trusted Agents must sign an MOU outlining policy and procedures for handling IGSNs.
Example IODP: Name Space “DR0”, “DR1”,…
Trusted Agent Operation
1. Ship/Field: Generate label with IGSN
2. Data System: Ingest IGSN & metadata
3. Submit metadata & IGSN to SESAR (web services)
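A Trusted Agent generating IGSNs locally within a registered name space could work along the lines of the sketch below. It assumes a 9-character IGSN made of the name space plus a fixed-width base-36 counter; the slides describe the suffix as a string of random characters, so the sequential counter, the function names, and the starting value are assumptions chosen to keep the example deterministic.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase  # 0-9 then A-Z, 36 symbols


def to_base36(number: int, width: int) -> str:
    """Encode a non-negative integer as a fixed-width base-36 string."""
    if number < 0 or number >= 36 ** width:
        raise ValueError("number does not fit in the requested width")
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def generate_igsns(namespace: str, start: int, count: int) -> list:
    """Issue `count` consecutive identifiers inside one registered name space."""
    width = 9 - len(namespace)
    return [namespace + to_base36(n, width) for n in range(start, start + count)]


# "DR0" is the example name space from the slide; start=34 is arbitrary.
LABELS = generate_igsns("DR0", start=34, count=3)
```

`LABELS` is `["DR000000Y", "DR000000Z", "DR0000010"]`. Because each agent only ever issues identifiers under its own name space, two agents cannot collide even when they work offline on a ship.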
Registration Procedures to Support Integration with Existing Workflows: Under Implementation
Pre-Assigned IGSNs
  Upon request, SESAR provides forms (spreadsheets) with pre-assigned IGSNs to chief scientists/investigators/repositories to take on ship/field. Forms filled with metadata should be submitted to SESAR post-collection. E.g.: SCRIPPS.
  Other systems or repositories pre-populate their existing forms with IGSNs obtained from SESAR and provide them to chief scientists. E.g.: MGDS provides forms with IGSNs to PIs in advance of R2K and MARGINS cruises. Post-cruise, MGDS will submit the sample metadata to SESAR.
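Pre-assigned IGSNs via SESAR forms
1. Get forms with IGSN (from SESAR)
2. Ship/Field: Enter metadata with IGSN
3. Submit forms with metadata and IGSN (to SESAR)
Pre-assigned IGSNs via a data system
1. Data System: Get IGSN (from SESAR)
2. Forms with IGSN (to Ship/Field)
3. Ship/Field: Enter metadata with IGSN
4. Forms with metadata and IGSN (back to Data System)
5. Submit Metadata & IGSN to SESAR (Web Services)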
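A form with pre-assigned IGSNs is essentially a spreadsheet whose identifier column is already filled in and whose metadata columns are left blank for the field party. The sketch below writes such a form as CSV text; the placeholder identifiers, the column names, and the use of CSV instead of the .xls workbooks mentioned earlier are assumptions for illustration.

```python
import csv
import io


def build_form(igsns: list, columns: list) -> str:
    """Return CSV text with the IGSN column filled in and other columns blank."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["IGSN"] + columns)
    for igsn in igsns:
        writer.writerow([igsn] + [""] * len(columns))
    return buffer.getvalue()


# Placeholder identifiers; real ones would be obtained from SESAR beforehand.
FORM = build_form(
    ["XXX000001", "XXX000002"],
    ["Sample Name", "Latitude", "Longitude"],
)
```

The field party fills in the blank cells during collection and returns the completed form, so the identifier on the physical label and the identifier in the submitted metadata are the same from the start.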
Collaboration with Repositories & Systems: Ongoing
IODP
  Registered DSDP/ODP holes, cores, core sections, core samples
  “Trusted Agent” arrangement in progress
MGDS
  Registered existing dredges, cores, and core samples
  Incorporating IGSN into existing MGDS forms
LDEO (Lamont)
  Registered existing dredge and core collections
WHOI
  Registering existing dredge and core collections
  Future arrangements like “Trusted Agent” to be discussed
SIO (SCRIPPS)
  Used SESAR forms with pre-assigned IGSNs on cruise for dredge collections
  Metadata need to be updated
Collaboration with Repositories & Systems: Ongoing
Antarctic Research Facility (ARF)
  Registering existing dredge and core collections
US Polar Rock Repository
  Registered existing rocks and minerals
  Need pre-assigned IGSNs and web service registration
Harvard Museum
  Registered existing mineral specimens
  Project for adding simple sample curation module in progress
OSU
  Start with IGSN for historic samples
  Then become trusted agent and issue IGSNs to new samples including those given to PIs
NGDC
  May register some orphaned historical samples
  Work with curators/repository and SESAR to streamline and standardize metadata fields and entry forms
Collaboration with Repositories & Systems: Ongoing
Canadian National Marine Geoscience Collections
  Likely to register existing collections
  May become “Trusted Agent” in future
Limnological Research Center (LRC/LacCore)
  Likely to register via batch registration forms
  May use pre-assigned IGSNs or become “Trusted Agent” in future
USGS
  Discussions are on-going with USGS to make them aware of SESAR effort
  Plan to contact state geological surveys
Other Repositories
  Efforts are under way to reach out and propose suitable process
  OSU model may be most applicable (first register legacy samples and then become trusted agent or use pre-assigned IGSNs)
  Could offer sample curation module for small operations