The complexity of biodiversity knowledge

Andrew C. Jones, Cardiff University
Andrew.C.Jones@cs.cardiff.ac.uk

Malcolm Scoble, The Natural History Museum
M.Scoble@nhm.ac.uk

Jones & Scoble, Semantic Interop., Imperial Coll, 30/03/06

Purpose of talk

• Malcolm & Andrew are both investigators in BiodiversityWorld (BDW)

• There are many problems BDW doesn’t solve yet …

• … and the funding runs out tomorrow!

• We’ll present

  – BiodiversityWorld as a framework to support biodiversity research

  – Other projects in which biodiversity informatics problems have been addressed individually

• Major challenge: draw these disparate efforts together

Part 1 (Andrew Jones)


Why Biodiversity Informatics is hard

• Need to integrate data & tools of different kinds for interesting “in silico” analyses

• Various computer science issues, e.g.
  – Human-Computer Interaction
    • Design of environments to support scientific research
  – Interoperability
  – Complexity & heterogeneity of data
    • Differences of scientific opinion
    • Data quality problems


The BiodiversityWorld project

• 3-year e-Science project funded by BBSRC
• Partners: The University of Reading, Cardiff University, The Natural History Museum, Southampton University
• Aim:
  – Build a Biodiversity Grid (Problem Solving Environment to support Biodiversity research)
  – Support discovery & use of arbitrary tools & data sources for interesting in silico experiments
  – Provide environment to get beyond the ‘cutting and pasting into Word documents’ approach to data integration and analysis


Example problems for BiodiversityWorld

• How should conservation efforts be concentrated?
  – (example of Biodiversity Richness & Conservation Evaluation)
• Where might a species be expected to occur, under present or predicted climatic conditions?
  – (example of Bioclimatic & Ecological Niche Modelling)
• How can geographical information assist in selection among possible phylogenetic trees?
  – (example of Phylogenetic Analysis & Palaeoclimate Modelling)


BiodiversityWorld architecture

[Figure: architecture diagram. Labelled components: User interface; Presentation; BGI API; BiodiversityWorld-GRID Interface (BGI); The GRID; Workflow enactment engine; Metadata repository; Wrapped resources; Native BiodiversityWorld Resources.]


Some problems not fully solved in BDW

• Flexible data access
  – BGI designed to make BDW maintainable, but currently assumes each resource has a predefined set of operations
  – BioDA project investigated use of OGSA-DAI in BDW
• HCI issues
  – A much more exploratory approach to workflow construction might be appropriate?
• Semantic interoperability & data quality
  – Metadata repository: basic information only
  – Only basic solution to species naming problems (SPICE)
  – Other problems of descriptive terms, differences of expert opinion, etc., remain to be addressed


Complexity of biodiversity data: a multi-dimensional problem

• Same specimen might be described with differences of:
  – Terminology
  – Opinion about identification
  – Opinion about whether a particular feature is present
  – Accuracy
• Experts may differ as to:
  – Circumscription associated with a given scientific name
    • (So may not be describing the same concept)
  – Terminology used to describe a given taxon
  – Accepted name for a species in a taxonomic checklist
• There may be errors!
• ...


SPICE for Species 2000

• BBSRC/EPSRC- and EU-funded
• SPecies 2000 Interoperability Co-ordination Environment
• Aims:
  – build scalable, federated scientific name catalogue organised by taxon (species, etc.)
  – provide ‘synonymy server’, enriching information retrieval
• Issue: how to build an architecture to integrate specialist, heterogeneous databases, providing a consistent federated view of broader scope?
• Common Data Model sufficed …
  – data requirements of federation identical for each database
  – small set of ‘canned queries’ adequate for the catalogue


SPICE internal architecture

[Figure: architecture diagram. Labelled components: Users (Web browsers); Common Access System (CAS), containing the User Server module (HTTP), the ‘Query’ co-ordinator and the CAS knowledge repository (taxonomic hierarchy, annual checklist, genus and other caches, ...); internal wrapper (CORBA) and external wrapper (XML/CGI); GSD wrappers (e.g. JDBC; e.g. CGI/XML + ODBC), with an (in some cases, generic) CORBA ‘wrapper’ element of the GSD wrapper; GSDs.]


LITCHI

• BBSRC/EPSRC- and EU-funded
• Logic-based Integration of Taxonomic Conflicts in Heterogeneous Information systems
• Aim: detect conflicts between species checklists and either
  – Assist in producing a consistent checklist, or
  – Generate correspondences between checklists (‘cross-map’)
• Addressing problems of species classification & naming variations when accessing species-related data
• More general, semantic interoperability issue:
  – detecting conflicts between different expert views of same subject matter;
  – supporting data access based on these views


LITCHI example

Checklist 1
  – Caragana arborescens Lam. (accepted name)
  – Caragana sibirica Medikus (synonym)

Checklist 2
  – Caragana sibirica Medikus (accepted name)
  – Caragana arborescens Lam. (synonym)

(“Lam.” = “Lamarck”)

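“A full name which is not a pro-parte name may not appear as both an accepted name and a synonym in the same checklist”

\[
\forall\, n, a, l, c_1, c_2, t_1, t_2 :\;
\bigl(\mathit{accepted\_name}(n, a, c_1, l, t_1) \land \mathit{synonym}(n, a, c_2, l, t_2)\bigr)
\Rightarrow
\bigl(\mathit{pro\_parte}(c_1) \land \mathit{pro\_parte}(c_2)\bigr)
\]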
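The quoted rule can be sketched in Prolog, the language used for the prototype later in this talk. This is only an illustration, not LITCHI's actual implementation: the predicate names `name_entry/5` and `conflict_in_merge/2`, the `full`/`pro_parte` codes and the idea of testing the rule against a naive merge of the two checklists are all assumptions. The rule is read here as "an overlap is tolerated only if both entries are pro parte".

```prolog
% Hypothetical fact layout: name_entry(Checklist, Status, Name, Author, Code).
% Status is accepted or synonym; Code is full or pro_parte (assumed codes).
name_entry(checklist1, accepted, 'Caragana arborescens', 'Lam.',    full).
name_entry(checklist1, synonym,  'Caragana sibirica',    'Medikus', full).
name_entry(checklist2, accepted, 'Caragana sibirica',    'Medikus', full).
name_entry(checklist2, synonym,  'Caragana arborescens', 'Lam.',    full).

% conflict_in_merge(-Name, -Author): if the checklists were merged into one,
% Name would be both an accepted name and a synonym, and the two entries
% are not both pro parte, so the quoted rule would be violated.
conflict_in_merge(Name, Author) :-
    name_entry(_, accepted, Name, Author, C1),
    name_entry(_, synonym,  Name, Author, C2),
    \+ ( C1 == pro_parte, C2 == pro_parte ).
```

Querying `?- conflict_in_merge(Name, Author).` enumerates both Caragana names, which is the kind of conflict a curator would then resolve or record as a cross-map.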


Name relationships (LITCHI 2)


myViews

• Not funded yet – limited proof-of-concept prototype only

• Addresses problem that an expert may wish to generate taxon descriptions which are:
  – Coherent;
  – Mapped explicitly to other taxon descriptions, and
  – Based directly on existing documentation (monographs, etc), rather than completely re-coded in some restrictive formalism with a new vocabulary


Example: describing the same things?

• Description A:
  – Sarothamnus scoparius (L.) Wimm. ex Koch.
  – Broom
  – ... a bush which is 50-200 cm high ...
• Description B:
  – Cytisus scoparius
  – Yellow broom
  – ... a small shrub up to 6ft or more ... native in its yellow form ...
• Description C:
  – Cytisus scoparius (L.) Link.
  – Broom
  – ... a deciduous shrub growing to 2.4m by 1m at a fast rate ... scented flowers ...
• Description D:
  – Common Broom
  – Cytisus scoparius
  – ... covered in profuse golden-yellow flowers ... shrub about 1-3m tall ...
• Description E:
  – Broom
  – Cytisus scoparius
  – ... Like a spineless edition of gorse ... with larger scentless flowers ...
• Similar problems apply to individual specimen descriptions


Things we might want to do

• In a system where
  – data is held in as ‘raw’ a form as possible, to avoid information loss, but
  – we can impose various views and hypotheses
  we might wish to …
• Create our own ‘view’ of the data
  – For a given piece of knowledge, we could
    • accept it unaltered
    • accept but re-express in our terms (e.g. different scientific name; different units; ...)
    • state it is equivalent to another piece of knowledge (e.g. minor differences in measurements)
    • flag it as ‘wrong’
    • ...
  – In relation to another’s view, we might
    • include or ignore it
    • declare some ‘mapping’ applicable to a group of items (e.g. every species of ‘Sarothamnus’ is mapped to ‘Cytisus’)
    • ...
• Reason with differing levels of precision simultaneously (e.g. binary/continuous characters derived from same features)
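As an illustration of "re-express in our terms (different units)" and "state it is equivalent", here is a minimal Prolog sketch. It is not taken from the prototype: `to_cm/3`, `in_cm/2`, `range_cm/3` and `compatible/2` are hypothetical names, and treating two size statements as equivalent whenever their ranges overlap is just one possible policy a user might adopt.

```prolog
% to_cm(+Value, +Unit, -Centimetres): assumed conversion table.
to_cm(V, cm, V).
to_cm(V, m,  Cm) :- Cm is V * 100.
to_cm(V, in, Cm) :- Cm is V * 2.54.
to_cm(V, ft, Cm) :- Cm is V * 30.48.

% in_cm(+SizeProperty, -Range): re-express a size statement as a closed
% range in centimetres, keeping the plant part it refers to.
in_cm(size_range(Lo, Hi, Unit, Part), range_cm(LoCm, HiCm, Part)) :-
    to_cm(Lo, Unit, LoCm),
    to_cm(Hi, Unit, HiCm).
in_cm(size(V, Unit, Part), range_cm(Cm, Cm, Part)) :-
    to_cm(V, Unit, Cm).

% compatible(+RangeA, +RangeB): same part, and the two ranges overlap.
compatible(range_cm(Lo1, Hi1, Part), range_cm(Lo2, Hi2, Part)) :-
    Lo1 =< Hi2,
    Lo2 =< Hi1.
```

Under this overlap policy, "1-3 m" and "50-200 cm" for the whole plant count as compatible, because 100-300 cm and 50-200 cm share the interval 100-200 cm. A stricter view that demands identical ranges would instead report them as conflicting.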


An experimental prototype

• Proof of concept ...
  – arbitrary, small data set from various sources: Cytisus & Genista species
  – No real ‘front end’ or ‘back end’ yet!
• Implemented in Prolog (a logic programming language)
• Formalisms to record complex assertions & their sources
• Ontological knowledge not currently separated out explicitly; rules perform inference
• User makes his/her own assertions about (for example)
  – synonymy;
  – which assertions of others to accept;
  – ...
• ... both very specific and more general rules
• Main purpose: illustrate handling multiple opinions/hypotheses

Sample knowledge base extracts

assertion(1, association(2, 3, absent(scent(flowers)))).
assertion(1, property(2, yellow(flowers))).
assertion(1, label(2, common('Broom'))).
assertion(1, label(2, species('Cytisus', 'scoparius'))).

assertion(4, property(5, shrublet(whole))).
assertion(4, property(5, deciduous(whole))).
assertion(4, property(5, size(6, in, whole))).
assertion(4, property(5, deep_yellow(flowers))).
assertion(4, property(5, small(leaves))).
assertion(4, label(5, species('Cytisus', 'ardoinii'))).

assertion(4, property(7, size(6, ft, whole))).
assertion(4, label(7, species('Cytisus', 'scoparius'))).

assertion(12, label(13, common('Broom'))).
assertion(12, label(13, common('Scotch Broom'))).
assertion(12, property(13, compound('sparteine'))).
assertion(12, property(13, compound('tyramine'))).
assertion(12, label(13, species('Sarothamnus', 'scoparius'))).

assertion(14, label(15, species('Sarothamnus', 'scoparius'))).
assertion(14, property(15, size_range(50, 200, cm, whole))).
assertion(14, property(15, bright_yellow(flowers))).

assertion(16, label(17, species('Cytisus', 'scoparius'))).
assertion(16, property(17, max_height(2.4, m, whole))).
assertion(16, property(17, max_width(1, m, whole))).
assertion(16, property(17, present(scent(flowers)))).

assertion(8, property(9, golden_yellow(flowers))).
assertion(8, property(9, size_range(1, 3, m, whole))).
assertion(8, label(9, species('Cytisus', 'scoparius'))).

(Source 12 asserts that item 13’s label is common name ‘Scotch Broom’)


Deducing from the knowledge base

?- display_accepted_props('Cytisus', 'ardoinii').
shrublet(whole)
deciduous(whole)
size(6, in, whole)
deep_yellow(flowers)
small(leaves)

Yes

?- display_accepted_props('Cytisus', 'scoparius').
yellow(flowers)
size(6, ft, whole)
golden_yellow(flowers)
size_range(1, 3, m, whole)
max_height(2.4, m, whole)
max_width(1, m, whole)
present(scent(flowers))
absent(spines)
absent(scent(flowers))

Yes

?- display_contradictions_for('Cytisus', 'scoparius').
[present(scent(flowers)), absent(scent(flowers))]

Yes
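The slides do not show how `display_accepted_props/2` and `display_contradictions_for/2` are defined. The following is a minimal sketch of how such queries could work over `assertion/2` facts. The predicates `accepted/1`, `accepted_prop/3` and `contradiction/3` are hypothetical, and the scentless-flowers statement is simplified to a plain `property/2` fact rather than the `association/3` form used in the extracts.

```prolog
% Small extract in the style of the knowledge base: assertion(Source, Statement).
assertion(1,  property(2, yellow(flowers))).
assertion(1,  property(2, absent(scent(flowers)))).   % simplified (assumption)
assertion(1,  label(2, species('Cytisus', 'scoparius'))).
assertion(16, label(17, species('Cytisus', 'scoparius'))).
assertion(16, property(17, present(scent(flowers)))).
assertion(16, property(17, max_height(2.4, m, whole))).

% accepted(Source): sources the user's view chooses to trust (assumed mechanism).
accepted(1).
accepted(16).

% accepted_prop(+Genus, +Epithet, -Prop): Prop is stated by an accepted source
% about an item that an accepted source labels with this species name.
accepted_prop(Genus, Epithet, Prop) :-
    assertion(S1, label(Item, species(Genus, Epithet))),
    accepted(S1),
    assertion(S2, property(Item, Prop)),
    accepted(S2).

% contradiction(+Genus, +Epithet, -Pair): a feature is asserted to be both
% present and absent for the same species.
contradiction(Genus, Epithet, [present(F), absent(F)]) :-
    accepted_prop(Genus, Epithet, present(F)),
    accepted_prop(Genus, Epithet, absent(F)).
```

With these facts, `?- contradiction('Cytisus', 'scoparius', Pair).` yields the scented/scentless pair. Removing `accepted(1).` from the view makes the contradiction disappear, which is the point of keeping sources attached to every assertion.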


Adding synonymy (1)

• User regards any statement about a Sarothamnus species as being a statement about a Cytisus species with the same epithet:

• assertion(20, synonym(species('Cytisus', Epithet), _, species('Sarothamnus', Epithet), _)).

• (Could be more restrictive, e.g. apply to only particular information sources)


Adding synonymy (2)

?- display_accepted_props('Cytisus', 'scoparius').
yellow(flowers)
size(6, ft, whole)
golden_yellow(flowers)
size_range(1, 3, m, whole)
max_height(2.4, m, whole)
max_width(1, m, whole)
present(scent(flowers))
compound(sparteine)
compound(tyramine)
size_range(50, 200, cm, whole)
bright_yellow(flowers)
absent(spines)
absent(scent(flowers))

Yes

?- display_contradictions_for('Cytisus', 'scoparius').
[size_range(1, 3, m, whole), size_range(50, 200, cm, whole)]
[present(scent(flowers)), absent(scent(flowers))]

Yes
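One way the synonymy assertion could pull the Sarothamnus statements into the Cytisus result is sketched below. The facts are copied from the extracts. `same_taxon/2` and `props_for/3` are hypothetical helper names, not the prototype's own, and the sketch ignores which sources the user accepts.

```prolog
assertion(12, label(13, species('Sarothamnus', 'scoparius'))).
assertion(12, property(13, compound('sparteine'))).
assertion(14, label(15, species('Sarothamnus', 'scoparius'))).
assertion(14, property(15, size_range(50, 200, cm, whole))).
assertion(8,  label(9, species('Cytisus', 'scoparius'))).
assertion(8,  property(9, size_range(1, 3, m, whole))).
% The user's synonymy rule (source 20); the two anonymous arguments are
% left unconstrained, as on the slide.
assertion(20, synonym(species('Cytisus', Epithet), _,
                      species('Sarothamnus', Epithet), _)).

% same_taxon(+Wanted, -Used): Used is the wanted name itself, or a name
% that some synonymy assertion maps onto it.
same_taxon(Name, Name).
same_taxon(Wanted, Other) :-
    assertion(_, synonym(Wanted, _, Other, _)).

% props_for(+Genus, +Epithet, -Prop): properties of every item labelled
% with the species name or with one of its declared synonyms.
props_for(Genus, Epithet, Prop) :-
    same_taxon(species(Genus, Epithet), Used),
    assertion(_, label(Item, Used)),
    assertion(_, property(Item, Prop)).
```

Querying `?- props_for('Cytisus', 'scoparius', P).` now also returns the sparteine and 50-200 cm statements originally made about Sarothamnus scoparius, mirroring the longer result list above. A query about Sarothamnus scoparius is unaffected, since the mapping only runs towards Cytisus.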


Some important issues for future work

• Complexity, e.g.
  – Trade-off: effective resource discovery v. computational expense of traversing rich ontology
  – Scalability of taxonomic conflict detection
    • May find large data sets need clever techniques such as Rete network
  – Scalability of inference in myViews; caching inferred information
• Managing & ranking large result sets
  – How to rank resources discovered
  – How to rank conflicts
  to present users with matches they are likely to want
• Joining all these fragmentary projects up together

Part 2 (Malcolm Scoble)


The complexity of taxonomic/biodiversity data

[Figure: diagram of the kinds of data involved. Labels: Specimen (unit) data; Collection-level; Observations; Locality; Date of specimen collection; Time of specimen collection; Name of collector; Species/taxon concept; Type specimen; Homonyms; Author of taxon; Date of description; Genus name (for binomial); Images; Species name; DNA barcodes; Synonyms; Species concepts.]


Taxonomy: from a ‘fragmented’ to a ‘distributed’ resource

Where we are now

• Fragmented results
• Fragmented effort
• Largely a paper medium (restricted access)

Where we want to be

• Less fragmented; single site or distributed access
• Easier to update
• Coordinated effort
• Electronic (or dual) medium
• Free access to data
• Taxonomy easier to use


Projects to integrate biodiversity data

• BioCISE (collection-level)

• ENHSIN (specimen (unit)-level)

• BioCASE (unit- & collection-level)

• Species 2000 (species nomenclature)

• SYNTHESYS (taxonomic infrastructure)

• ENBI (network of biodiversity information)

• EDIT (distributed approach to taxonomy)

• PBIs (inventorying the planet’s biodiversity)

• CATE: Creating a Taxonomic e-Science


BioCASE National Node Network

BioCASE National Node

CORM

• 31 National Nodes

• Core Meta Database is updated every night

Collection-level


A Biological Collections Service for Europe

[Figure: BioCASE architecture diagram. Recoverable labels: National Nodes (NN); Special Interest Networks; Collection-level Meta-Data; BioCASE Core; WWW Interface; Core Data Items (BioCASE Profile); Enhanced Meta-Data; Thesaurus (keywords, keyword relations); Metadata Index; Unit access; Unit Information DBs; Unit-Data (ABCD); All levels.]


Creating a taxonomic e-science (CATE)

• Literature scattered over 250 years of paper publications.

• Data inaccessible other than to specialist users

• Aim to transfer in toto the taxonomy of two groups of organisms to the web (Hawkmoths and Aroids).

• Broad aim: to encourage migration of taxonomy to the web.

• Provide data for those studying biodiversity.

• Encourage quality control, peer-review and the development of “consensus” taxonomies in the web environment.

• Develop means of citation for web-based revisions

Arisaema candidissimum. Photo: RBG Kew

The Hawkmoth Sphinx caligineus sinicus from Beijing, China. Photo: Tony Pittaway
