BiOnym: A flexible workflow approach to taxon name matching

Edward Vanden Berghe (VUB), Nicolas Bailly (WorldFish), Caselyn Aldemita (FIN), Fabio Fiorellato (FAO), Gianpaolo Coro (CNR), Anton Ellenbroek (FAO), Pasquale Pagano (CNR)

TDWG Annual Conference 2013, Firenze, Italy 31st October 2013

Improving the current matchers

• Offer several Taxonomic Authority Files as references to match against

• Make control of the matching workflow flexible and customizable (e.g., selecting the sequence of matching methods)

• Give advanced users full control, while keeping a set of default/standard workflows for basic users


BiOnym approach

• There is no one size that fits all!
• Some applications are ‘fault intolerant’
  – E.g. compilation of authority lists
  – Have to minimise ‘false positives’, at the expense of less automation
• Others are less sensitive to mistakes
  – E.g. synonymy expansion in a biogeographic query: find distribution records of a single species under different names or spelling variations

• Will require different choices


A flexible workflow for taxon name matching: BiOnym

[Workflow diagram]

Raw input string, e.g. Gadis morua Lineus 1759
→ Parsing and pre-processing
→ Taxon Matcher 1 → Taxon Matcher 2 → … → Taxon Matcher n
→ Post-processing
→ Matching name, e.g. Gadus morhua (Linnaeus, 1758)

Reference sources: ASFIS, FishBase, WoRMS, any in DwC-A

Matchers:
• GSAy (new)
• Lexical distances
  – Levenshtein
  – Soundex
  – Trigrams

Workflows:
• BiOnym (new): user control
• Emulation of Taxamatch
• YASMEEN (new)


Developed in iMarine infrastructure

• iMarine (D4Science): e-infrastructure
• VREs: Virtual Research Environments exploiting data and tools in the infrastructure …

[Diagram labels: Infrastructure, VREs, gCube]


… and outside: iMarine Data Bonanza

[Diagram labels]
• iMarine
• OBIS, WoRMS, WoRDS, GBIF, CoL, ITIS, IRMNG, NCBI, MyOcean, WOA, EuroStat, Data.FAO
• iMarine Registries
• Validation, Enriching, Processing, Sharing
• Private Cloud, Commercial Cloud


iMarine: Storage and Computing as Service

Storage services: Virtual Workspace, Relational Databases, Large and Active data storage, Spatial Database

• Scalability and high availability
• Across sites
• ISO 19115/19139 Metadata
• Catalogue
• Open source RDBMS
• Up to 1 TB data
• Secure
• Fault-tolerant
• Replication

• 45 TB currently used
• 330 CPU cores currently allocated


Statistical Manager: Resources and Sharing


BiOnym: Outline

• Components
  – Taxonomic Authority Files
  – Matchers
  – Pre- and post-processing: parsers, synonym expansion, taxon resolution, performance statistics
• Development frameworks
  – For Matchers
  – For Workflows (= sequence of Matchers)
• Experiments
  – Results
• Conclusions


Available Taxonomic Authority Files

• CoL: Catalogue of Life

• NCBI: National Center for Biotechnology Information

• IRMNG: Interim Register of Marine and Non-marine Genera

• ITIS: Integrated Taxonomic Information System

• WoRMS: World Register of Marine Species

• ASFIS: List of Species for Fishery Statistics Purposes; for commercial aquatic species

• FishBase (+info from CofF: Catalog of Fishes): for finfishes


Pre-Processing: name format standard

• Split names into atomic components (genus, species, authority, author, year) where necessary (Dima Mozzherin’s parser)

• Align variations in complementary words: var./v., aff., conf./cf., comma in authority, etc.

• Customize character/string substitutions
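The alignment and substitution steps above can be sketched as follows. This is a minimal illustration, not the BiOnym pre-processor: the function name `normalise_name` and the substitution table are hypothetical, and splitting a name into its atomic components is left to the parser mentioned on the slide.

```python
import re

# Hypothetical substitution table: variant spellings mapped to one form.
QUALIFIER_VARIANTS = {
    "v.": "var.",
    "conf.": "cf.",
}

def normalise_name(raw, substitutions=QUALIFIER_VARIANTS):
    """Collapse whitespace, align qualifier variants, drop the comma before the year."""
    tokens = raw.split()  # also collapses repeated spaces
    tokens = [substitutions.get(tok.lower(), tok) for tok in tokens]
    text = " ".join(tokens)
    # "Linnaeus, 1758" and "Linnaeus 1758" become the same string.
    return re.sub(r",\s*(\d{4})", r" \1", text)

print(normalise_name("Gadus  morhua (Linnaeus, 1758)"))
print(normalise_name("Gadus conf. morhua"))
```

Passing a different `substitutions` dictionary is the sketch's stand-in for customized character/string substitutions.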


Matchers: principle

• Input:
  – Standard formatted file of input names
  – Customized parameters (e.g., thresholds for distances)
• Character substitutions
  – E.g. dropping gender suffix
  – E.g. fuzzy matching of Tony Rees
• A unique algorithm (e.g., one lexical distance):
  – Using the customized parameters
• Output: A set of names with matching rate
  – One subset being considered as matched
  – One subset considered as non-matching

• The output of a matcher can be used as the input of another one
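The matcher principle above (names in, a threshold as parameter, a matched and a non-matching subset out) can be sketched as below. The function names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is only a stand-in for whichever lexical distance a real matcher would use.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1]; any lexical distance could be plugged in."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def run_matcher(names, reference, threshold=0.8, score=similarity):
    """Split `names` into (matched, non_matching) against `reference`.

    matched maps each input name to (best reference name, score);
    non_matching lists the names whose best score is below the threshold.
    """
    matched, non_matching = {}, []
    for name in names:
        best = max(reference, key=lambda ref: score(name, ref))
        best_score = score(name, best)
        if best_score >= threshold:
            matched[name] = (best, best_score)
        else:
            non_matching.append(name)
    return matched, non_matching

reference = ["Gadus morhua", "Auxis rochei"]
matched, rest = run_matcher(["Gadus morua", "Bagrus bajad"], reference)
print(matched)
print(rest)
```

The `rest` list is what would be handed to the next matcher.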


Matchers: the built-in matchers

• Lexical dist.: the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other

• Soundex-like dist.: an algorithm relying on an encoding of how phonemes are pronounced in English. Our variant does not compress phonetic information

• Trigrams / N-grams dist.: a similarity measure between sequences of letter triplets (a trigram representation) extracted from the input strings

• One domain-knowledge based matcher (GSAy) … to be applied first in the context of Systematics
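Two of the lexical measures listed above can be sketched in a few lines. The first function is the standard Levenshtein distance. The second compares trigram sets with a Jaccard overlap; the padding scheme and the choice of Jaccard are assumptions, since the slides do not say which trigram measure BiOnym uses.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def trigrams(text):
    """Set of letter triplets of the padded, lower-cased string (padding is assumed)."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

print(levenshtein("Gadis morua", "Gadus morhua"))
print(trigram_similarity("Gadis morua", "Gadus morhua"))
```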


Matchers: GSAy process (1)

Step    Score   Issue
GSAy    100     Complete match
GSAY    97      Parentheses issue
GSrAy   94      Gender agreement issues
GSrAY   91      Gender agreement and parentheses
GSA     88      Year issues
GSrA    85      Year and gender agreement issues


Matchers: GSAy process (2)

Step    Rate    Issue
GSY     82      Author issues, misspelling or wrong
GSrY    79      Author issues, misspelling or wrong
GS      76      Author and year issues, homonyms
GSr     73      Author and year issues, homonyms
SAy     70      Genus issues, other combinations
SrAy    67      Genus issues, other combinations


Matchers: GSAy process (3)

Step    Rate    Issue
SrAy    64      Genus issues, other combinations
SrAY    61      Genus issues, other combinations
GAy     35      Species misspellings
GAY     32      Species misspellings

Gap in the rates: 61 > … > 35

Species misspellings … but also … species described in same genus by same author in same paper

[Diagram labels: Matched names; Non-matching names → other matchers]


Matcher: GSAy examples

Genus   Species   Authority           Step    Rate
Gadus   morhua    Linnaeus, 1758      GSAy    100
Gadus   morhua    (Linnaeus, 1758)    GSAY    97
Gadus   morhuus   Linnaeus, 1758      GSrAy   94
Gadus   morhuus   (Linnaeus, 1758)    GSrAY   91
Gadus   morhua    Linnaeus, 1759      GSA     88
Gadus   morhuus   (Linnaeus, 1759)    GSrA    85
Gadus   morhua    Lineus, 1758        GSY     82
Gadus   morhuus   Lineus, 1758        GSrY    79
Gadus   morhua    Lineus, 1759        GS      76
Gadus   morhuus   Lineus, 1759        GSr     73
Gadis   morhua    Linnaeus, 1758      SAy     70
Gadis   morhuus   (Linnaeus, 1758)    SrAy    67
Gadis   morhua    Linnaeus, 1758      SAY     64
Gadis   morhuus   (Linnaeus, 1758)    SrAY    61
Gadus   morthua   Linnaeus, 1758      GAy     35
Gadus   morthua   (Linnaeus, 1758)    GAY     32
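One possible reading of the step codes in the examples table is sketched below. It is an interpretation, not the GSAy implementation: it assumes G = genus identical, S = species identical, Sr = species identical only after dropping a gender ending, A = author identical, y = year and parentheses identical, Y = year identical but parentheses differing or not compared. The rates are copied from the examples table; `ParsedName`, `species_root` and `gsay_step` are hypothetical names, and the sketch is not guaranteed to reproduce every row of the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParsedName:
    genus: str
    species: str
    author: str
    year: str
    parentheses: bool = False

# Rates as listed in the examples table.
RATES = {
    "GSAy": 100, "GSAY": 97, "GSrAy": 94, "GSrAY": 91, "GSA": 88, "GSrA": 85,
    "GSY": 82, "GSrY": 79, "GS": 76, "GSr": 73,
    "SAy": 70, "SrAy": 67, "SAY": 64, "SrAY": 61, "GAy": 35, "GAY": 32,
}

def species_root(epithet):
    """Hypothetical stemmer: drop a few Latin gender endings."""
    for suffix in ("us", "um", "a"):
        if epithet.endswith(suffix) and len(epithet) > len(suffix) + 2:
            return epithet[: -len(suffix)]
    return epithet

def gsay_step(query, ref):
    """Return (step code, rate); rate is None when the code is not in the table."""
    code = ""
    if query.genus == ref.genus:
        code += "G"
    if query.species == ref.species:
        code += "S"
    elif species_root(query.species) == species_root(ref.species):
        code += "Sr"
    author_ok = query.author == ref.author
    if author_ok:
        code += "A"
    if query.year == ref.year:
        same_parentheses = query.parentheses == ref.parentheses
        code += "y" if author_ok and same_parentheses else "Y"
    return code, RATES.get(code)

ref = ParsedName("Gadus", "morhua", "Linnaeus", "1758")
print(gsay_step(ParsedName("Gadus", "morhuus", "Linnaeus", "1758"), ref))
print(gsay_step(ParsedName("Gadus", "morhua", "Lineus", "1759"), ref))
```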


Workflow development framework

• Builds flexible workflows for name matching

• A Java framework based on the gCube system (http://www.gcube-system.org/)

• Can exploit cloud computing facilities

• Provides Java interfaces for building string pre-processing, parsing and post-processing steps

• Allows character substitutions to be defined

• Allows new Matchers to be added as plug-ins


Workflow development framework

• A series of operators acting as switches:
  – First apply ‘transformation’ (e.g. character substitution)
  – Then calculate distance between all possible pairs of names
• Each switch decides whether a pair of names should be considered as ‘matches’, and splits the input list into:
  – ‘matched’ names
  – ‘non-matching’ names

• Parameters in each switch are customizable


Matcher framework: YASMEEN (FAO)

• Yet Another Species Matching Execution ENvironment

• Based on COMET – COncept Matching Engine and Tools

• YASMEEN: a set of data models, formats and tools to perform species matching

• Multiple matchlets, each dealing with a specific attribute of the species data model (genus, species, author etc.)

• New matchlets can be designed (just a few lines of code) and plugged in

• Reference data in DwC-A format

• Full support for distributed computation (split input and reference data / join results)


Workflow builder: YASMEEN (FAO)

• Used as a matcher in the BiOnym workflow
• But can also work as a standalone workflow
• When used standalone:
  – Lexical matchlets' scores can be computed with a combination of different strategies (Levenshtein distance, Soundex similarity, N-grams similarity)
  – Overall matching score for an input / reference data pair is a weighted combination of the triggered matchlets' scores
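The weighted combination of matchlet scores can be illustrated as follows. This is a sketch under assumptions, not YASMEEN code: it assumes a weighted mean normalised by the weights of the triggered matchlets only, and the function name `combine_scores` is hypothetical. The example weights (genus 75, species 100, author 50, year 25) are the ones quoted in the YASMEEN experiment results.

```python
def combine_scores(matchlet_scores, weights):
    """Weighted mean of the scores of the matchlets that were triggered.

    matchlet_scores: {matchlet name: score in [0, 1]}, triggered matchlets only.
    weights: {matchlet name: non-negative weight}.
    """
    total_weight = sum(weights[name] for name in matchlet_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[name] * score for name, score in matchlet_scores.items())
    return weighted / total_weight

weights = {"genus": 75, "species": 100, "author": 50, "year": 25}
print(combine_scores({"genus": 1.0, "species": 0.5}, weights))
```

Normalising by the triggered weights keeps the overall score in [0, 1] even when some attributes (for example the year) are missing from the input.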


Workflow: BiOnym

• Workflow consists of three parts
  – Preprocessing (including possibly parsing)
  – Chain of matchers; output from one is input to the next
  – Postprocessing
    • E.g. present ‘ambiguous’ matches to end user
    • E.g. calculate performance statistics
• Chain of matchers
  – Most restrictive first
  – Those based on domain knowledge first
  – Test names matched in one step are not passed on to the next matcher
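The chaining rule can be sketched as below: matchers run from most to least restrictive, and only the names left unmatched by one step reach the next. All names in the sketch are hypothetical, and the two toy matchers stand in for real ones such as GSAy or a lexical distance.

```python
def run_workflow(names, reference, matchers, preprocess=str.strip):
    """Run matchers in order; names matched at one step skip the later steps.

    Each matcher is a callable (names, reference) -> (matched dict, non_matching list).
    """
    remaining = [preprocess(name) for name in names]
    results = {}
    for matcher in matchers:
        if not remaining:
            break
        matched, remaining = matcher(remaining, reference)
        results.update(matched)
    return results, remaining

def exact_matcher(names, reference):
    """Most restrictive step: identical strings only."""
    known = set(reference)
    matched = {n: n for n in names if n in known}
    return matched, [n for n in names if n not in known]

def case_insensitive_matcher(names, reference):
    """Looser step: ignore letter case."""
    by_lower = {ref.lower(): ref for ref in reference}
    matched = {n: by_lower[n.lower()] for n in names if n.lower() in by_lower}
    return matched, [n for n in names if n.lower() not in by_lower]

reference = ["Gadus morhua", "Auxis rochei"]
names = [" Gadus morhua", "auxis ROCHEI", "Bagrus bajad"]
print(run_workflow(names, reference, [exact_matcher, case_insensitive_matcher]))
```

The list returned alongside the results is what post-processing would present to the user as unresolved.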


FAO use case

Scientific names
• Austroglossus spp
• Austropotamobius pallipes
• Auxis rochei
• Auxis thazard
• Auxis thazard, A. rochei
• Bagrus spp
• Bothidae
• Corallium sp. nov.
• Ex Mollusca
• Ex Pinctada spp

Common names
• Abalones nei
• Aesop shrimp
• African forktail snapper
• African lungfishes
• African moonfish
• African sicklefish
• Aka coral
• Akiami paste shrimp
• Alaska plaice
• Alaska pollock(=Walleye poll.)


Workflow: Emulation of Tony Rees’ Taxamatch

• Normalization of the species name to its root, disregarding gender issues in the taxon name

• Modified Damerau-Levenshtein Distance algorithm (MDLD): the number of character replacements, deletions or insertions needed to make the two strings the same

• Phonetic algorithm (e.g. Soundex)

• Authority matching: detects similarity in substrings
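For orientation, the plain Damerau-Levenshtein idea (edit distance that also counts a swap of two adjacent characters as one edit) can be sketched as below. This is the textbook "optimal string alignment" variant, swapped in for illustration; it is not Tony Rees's MDLD, whose modifications are not described on the slide.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(osa_distance("Linneaus", "Linnaeus"))
print(osa_distance("Rhincodon typu", "Rhincodon typus"))
```

Counting a transposition as one edit is what lets a swapped letter pair such as "Linneaus" score closer to "Linnaeus" than under plain Levenshtein.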


Post-processing

• Modalities governing how the results of the matching process are used/presented to the end-user
• Will depend on the needs of the end user
• Examples:
  – Synonymy expansion of queries in a biogeographical system
  – Reconciliation of check lists from different sources, for same area and taxon
  – Presenting end-user with ambiguous matches


Experiments: an R implementation

• Experimental system implemented in R and PostgreSQL
  – R is a thin wrapper around PostgreSQL statements
  – SQL used for the heavy lifting
    • Makes use of trigram indexes, for example
• Tool for communication and prototyping
• Developing tools to analyse performance
  – Generate confusion matrix…
  – For identical test sets, different workflows
    • Quantitatively compare sets of options and/or matchers


Effectiveness

[Figure: example graph comparing performance of different settings (generated with R). Categories shown: true hits, false hits, non hits]
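The three categories in the graph can be tallied per workflow run along the following lines. The experimental system described in the talk uses R and PostgreSQL; this Python sketch only illustrates the bookkeeping, and the function name, the dictionary layout and the use of a hand-checked gold standard are assumptions.

```python
from collections import Counter

def tally_hits(proposed, gold):
    """Count true hits, false hits and non hits for one workflow run.

    proposed: {input name: matched reference name, or None when nothing was returned}
    gold:     {input name: correct reference name}
    """
    counts = Counter({"true_hits": 0, "false_hits": 0, "non_hits": 0})
    for name, expected in gold.items():
        answer = proposed.get(name)
        if answer is None:
            counts["non_hits"] += 1
        elif answer == expected:
            counts["true_hits"] += 1
        else:
            counts["false_hits"] += 1
    return dict(counts)

gold = {"Gadis morua": "Gadus morhua", "Auxis rochi": "Auxis rochei", "Bagrus sp": "Bagrus"}
proposed = {"Gadis morua": "Gadus morhua", "Auxis rochi": "Auxis thazard"}
print(tally_hits(proposed, gold))
```

Running the same test set through workflows with different settings and comparing the tallies gives the kind of quantitative comparison the slide describes.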


Results of experiments (YASMEEN)

Genus / species misspellings:
IN: Lacheepenseer Perthseecous → REF: Acipenser persicus (Borodin, 1897)
Scientific name matchlet, using Levenshtein similarity → 61.5%

No separation between genus / species, relevant misspelling:
IN: acipnesreppeerseekoos → REF: Acipenser persicus (Borodin, 1897)
Scientific name matchlet, using Levenshtein similarity → 47.6%

Inverted genus / species:
IN: Platorhynchus Scaphirhincus → REF: Scaphirhincus platorhynchus (Rafinesque, 1820)
Scientific name matchlet, using n-grams similarity → 100.0%

Relevant misspellings (resolved with support from authorities data):
IN: Casphinhi Platynchurs (Rafinesk, 1820) → REF: Scaphirhynchus platorynchus (Rafinesque, 1820)
Genus matchlet (wgt: 75), Species matchlet (wgt: 100), Author name matchlet (wgt: 50), Year matchlet (wgt: 25), using Levenshtein similarity (wgt: 100) and Soundex similarity (wgt: 50) → 58.2%


Experiments: a simple interface


Efficiency

• Application of the first version of the BiOnym workflow (1000 species names)

• First run: only one Worker node (~ 1 CPU)

• Second run: Cloud Computing facilities assigned by iMarine e-Infrastructure (computation distributed over 19 Worker nodes)

• Result: time reduction of 76.7%

• This means that the workflow can also be used in interactive systems (no need for batch processing)
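As a worked reading of the reported figure, assume that "time reduction" means the relative drop in elapsed time between the one-node run (T_1) and the 19-node run (T_19). The slide does not define it, so this is an assumption. The implied speedup S and parallel efficiency E are then:

```latex
\begin{aligned}
r &= \frac{T_1 - T_{19}}{T_1} = 0.767 \\
S &= \frac{T_1}{T_{19}} = \frac{1}{1 - r} = \frac{1}{0.233} \approx 4.29 \\
E &= \frac{S}{19} \approx 0.226
\end{aligned}
```

Under that reading the distributed run is about 4.3 times faster, with each of the 19 nodes contributing roughly 23% of ideal linear scaling.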


Results of experiments: Matchers

• Search term:
  – Rhincodon typu Linneaus, 1758

• Output using GSAy:
  – Rhincodon typus Smith, 1828 → score is 73%

• Output using Taxamatch:
  – Rhineodon typus Smith, 1828
  – Rhiniodon typus (Smith, 1828)
  – Rhinodon typicus Müller & Henle, 1839
  – Rhinodon typicus Smith, 1845


Conclusions: Results

• Workflows
  – Building of pre-set (default)
  – On the fly setting
  – Integrating taxonomic/nomenclature knowledge (GSAy)
• Making the best from previous matchers (Taxamatch and subsequent various implementations) and other technologies (uBio/GNA/GBIF parser)
• Effectiveness and efficiency increased in the iMarine e-infrastructure


Conclusions

• Plans: interface (Nov.), tests (Dec.), open (Jan.)
• Other Taxonomic Authority Files
  – FADA (BioFresh) / PESI / …
• Name reconciliation
• Beyond scientific names
  – Common names / Vessels / …
• Integration of new matchers as matching methods are developed


Future added value? Storage of knowledge

• Make available the matches between raw and published names (and current valid names)

• Self-learning system

• Build a community of practice (CoP), not alone …
  – GNA, BioVel

Collaborative development


Special Thanks

• Tony Rees (CSIRO)
• Dmitry (Dima) Mozzherin (GNA project)


Authors

• Edward Vanden Berghe, Vrije Universiteit Brussel (VUB), Brussels, Belgium

• Nicolas Bailly, WorldFish, and FishBase Information and Research Group (FIN), Los Baños, Philippines

• Caselyn Aldemita, FishBase Information and Research Group (FIN), Los Baños, Philippines

• Fabio Fiorellato & Anton Ellenbroek, Fisheries Statistics and Information (FIPS), FAO, Rome, Italy

• Gianpaolo Coro & Pasquale Pagano, Istituto di Scienza e Tecnologie dell'Informazione A. Faedo (ISTI), CNR, Pisa, Italy