Text Mining and Environmental Metadata Suggestion

Jul 31, 2015

Page 1: Text Mining and Environmental Metadata Suggestion

ENA – 1st Dec 2014 – EBI, UK

Evangelos Pafilis

Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC)

Hellenic Centre for Marine Research (HCMR), Heraklio Crete, Greece

[email protected], http://epafilis.info

Text Mining and Environmental

Metadata Suggestion

Page 2: Text Mining and Environmental Metadata Suggestion

Species – Environments

Page 3: Text Mining and Environmental Metadata Suggestion

Comparative Analysis
•  Location
•  Environment
•  Time Period

Image from http://theresilientearth.com/

Coral Reefs

?

Page 4: Text Mining and Environmental Metadata Suggestion

Not Trivial

Page 5: Text Mining and Environmental Metadata Suggestion

Slide by Dr. P. Yilmaz, http://www.arb-silva.de/projects/contextual-data/

Page 6: Text Mining and Environmental Metadata Suggestion

Metadata

Meta- = Μετά (“after”)

=> data “after” data

=> data describing data

Essential Context Information

Page 7: Text Mining and Environmental Metadata Suggestion

a clear definition, that can be interpreted

in many, sometimes conflicting, ways

Page 8: Text Mining and Environmental Metadata Suggestion

a clear definition, that can be interpreted

in many, sometimes conflicting, ways

Essential Context Information

Page 9: Text Mining and Environmental Metadata Suggestion

Community Standards

•  Standards (such as MIxS, MIMARKS)
    see http://gensc.org/gc_wiki/index.php/GSC_Publications for a comprehensive list of publications
•  Capture genomic/metagenomic and other types of sequence contextual information
•  Include detailed guidelines on how to annotate a sample
    (e.g. Yilmaz P et al. (2011) The ISME Journal 5: 1565–1567)

http://gensc.org/

Page 10: Text Mining and Environmental Metadata Suggestion

P. Yilmaz et al., Nat Biotech 29, 415–420 (2011)

Page 11: Text Mining and Environmental Metadata Suggestion

source: http://wiki.gensc.org/index.php?title=MIMARKS

Page 12: Text Mining and Environmental Metadata Suggestion

http://www.tomorrowstarted.com/2013/01/how-a-key-works/.html

Page 13: Text Mining and Environmental Metadata Suggestion

•  Project descriptions

•  Scientific-content web pages

•  Full text scientific articles

•  Literature abstracts

•  In-house documents

Page 14: Text Mining and Environmental Metadata Suggestion

Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific.

Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)

Page 15: Text Mining and Environmental Metadata Suggestion

Looking up terms:

Intensive, learning curve

Page 16: Text Mining and Environmental Metadata Suggestion

Literature Mining

Page 17: Text Mining and Environmental Metadata Suggestion

processing text

to extract facts of interest

Page 18: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS

Page 19: Text Mining and Environmental Metadata Suggestion

terrestrial, aquatic, marine, lagoon, coral reef, sediment, freshwater, soil

ENVIRONMENTS: ENVO term identification in text

Page 20: Text Mining and Environmental Metadata Suggestion

Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific.

Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)

ENVIRONMENTS: ENVO term identification in text

Page 21: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS: ENVO term identification in text

ID: ENVO:00000150 Name: coral reef

Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific.

Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)
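
The identification step shown on this slide can be sketched as a plain dictionary lookup. This is a minimal illustration, not the ENVIRONMENTS tagger itself: the `DICTIONARY` contents, the `tag` function, and the inclusive end coordinate are assumptions made for the example, and only the two ENVO identifiers quoted in these slides are used.

```python
import re

# Toy dictionary: lower-cased name or synonym -> ENVO identifier.
# Only the two identifiers quoted in these slides are used; the plural
# forms are hand-added synonyms for this illustration.
DICTIONARY = {
    "coral reef": "ENVO:00000150",
    "coral reefs": "ENVO:00000150",
    "lagoon": "ENVO:00000038",
    "lagoons": "ENVO:00000038",
}


def tag(text):
    """Return (start, end, matched text, ENVO ID) tuples; end is inclusive."""
    hits = []
    for name, envo_id in DICTIONARY.items():
        # \b keeps "coral reef" from matching inside "coral reefs".
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end() - 1, m.group(0), envo_id))
    return sorted(hits)


DESCRIPTION = (
    "Microbes are key players in both healthy and degraded coral reefs. "
    "A combination of metagenomics, microscopy, culturing, and water chemistry "
    "were used to characterize microbial communities on four coral atolls "
    "in the Northern Line Islands, central Pacific."
)

for hit in tag(DESCRIPTION):
    print(hit)
```

Because "coral atolls" has no entry in this toy dictionary, only the "coral reefs" mention is reported; a fuller dictionary would be needed to cover other environment terms in the description.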

Page 23: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS
http://environments.hcmr.gr
http://environments-eol.blogspot.gr/

●  Dictionary based
●  Open source
●  Environment Ontology
●  Fast performance
    ●  4000 PubMed abstracts / second *
●  Based on the SPECIES name recognition tagger (Pafilis et al., PLoS ONE)
●  E600 gold standard: ENVO-based corpus of EOL Species pages
●  Recognition accuracy, mention level:
    ●  F1: 82.0%
    ●  87.1% of the TPs: exact ID among the predicted ones
●  Submitted preprint: http://biorxiv.org/content/early/2014/11/13/011403

Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390

*: based on a single-thread run on an Intel 2.27 GHz machine with 24 GB RAM, processing a set of 536,052 abstracts
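
For readers unfamiliar with the mention-level F1 measure quoted above, here is a minimal sketch of how such a score can be computed. It assumes exact matching of mention spans, which may differ from the protocol used for the E600 evaluation; the function name and the toy spans are hypothetical and are not taken from the corpus.

```python
def mention_level_scores(gold, predicted):
    """Compute precision, recall and F1 over sets of (start, end) spans."""
    tp = len(gold & predicted)   # mentions found with the exact same span
    fp = len(predicted - gold)   # predicted mentions absent from the gold set
    fn = len(gold - predicted)   # gold mentions that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans, for illustration only.
gold = {(0, 4), (10, 15), (20, 25), (30, 33)}
predicted = {(0, 4), (10, 15), (20, 25), (40, 44)}
print(mention_level_scores(gold, predicted))
```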

Page 24: Text Mining and Environmental Metadata Suggestion

biome

environmental feature

environmental material

environmental condition

habitat … … … … …

Based on slides by Dr. Pier Luigi Buttigieg, AWI, Bremerhaven, Germany

http://environmentontology.org ~1600 terms, June 2013

ENVO: source of environment descriptor names and synonyms

Page 25: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS – Improving Accuracy

●  Increasing matches in text
    ●  Orthographic variation supported, e.g. freshwater, fresh water, and fresh-water
    ●  Case-insensitive matching
    ●  Synonym generation to reflect the way environment descriptive terms are mentioned in text (both generic and ENVO specific)
●  Preventing overmatching (i.e. avoiding increased FP)
    ●  "Stopword list" (e.g. spring, well, range)

Action                                                             Example
Add a variant in which non-informative words have been removed    epipelagic zone → epipelagic; estuarine biome → estuarine
Plural form addition                                               sediment → sediments
Adjective form addition                                            lagoon → lagoonal
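
The variant-generation actions listed above can be sketched as a few string rules. This is an illustration under stated assumptions, not the tagger's actual rule set: the `NON_INFORMATIVE` and `STOPWORDS` sets contain only the examples from the slide, the plural rule is deliberately naive, and adjective forms such as "lagoonal" are left out because they would need a curated lookup.

```python
# Assumed word lists, limited to the examples given on the slide.
NON_INFORMATIVE = {"zone", "biome"}
STOPWORDS = {"spring", "well", "range"}


def variants(name):
    """Return lower-cased surface forms for one dictionary name."""
    base = name.lower()  # case-insensitive matching: store lower case
    if base in STOPWORDS:
        return set()  # ambiguous names are blocked to avoid false positives
    forms = {base}
    # Drop non-informative words: "epipelagic zone" -> "epipelagic".
    kept = [w for w in base.split() if w not in NON_INFORMATIVE]
    if kept:
        forms.add(" ".join(kept))
    # Orthographic variation: "fresh water" -> "fresh-water".
    for form in list(forms):
        if " " in form:
            forms.add(form.replace(" ", "-"))
    # Naive plural: "sediment" -> "sediments".
    for form in list(forms):
        if not form.endswith("s"):
            forms.add(form + "s")
    return forms


print(sorted(variants("epipelagic zone")))
print(sorted(variants("sediment")))
```

Joined forms such as "freshwater" are not derived here; in this sketch they would have to be supplied as explicit synonyms.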

Page 26: Text Mining and Environmental Metadata Suggestion

Scope
●  ENVO parts not included: species, tissues, foods

Limitations – Known Issues
●  Negation not supported
●  Conflicts with anatomy terms (e.g. mouth, blowhole)

Page 27: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS – Sample Output

Update to EOLTAGS 346289845

File Name                        Start coord  End coord  Match text  ENVO ID
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000192
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00002297
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000043
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000000
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000012
eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:01000001
eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:00010483
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000180
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000191
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00002297
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000176
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000000
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000477

Tags corresponding to “Habitat” text data object: http://eol.org/data_objects/31415353 of EOL Taxon Phoenicopterus ruber (Greater Flamingo): http://eol.org/pages/913221
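
Output of this shape can be post-processed with a few lines of code. The sketch below assumes whitespace-separated columns in the order file name, start, end, match text, ENVO ID, and a file name without spaces; the real tagger output may use a different delimiter. The helper names are hypothetical, and the rows are copied from the sample above.

```python
from collections import defaultdict

SAMPLE = """\
eol_documents_ascii_nonHTML.txt 346289845 346289853 mud flats ENVO:00000192
eol_documents_ascii_nonHTML.txt 346289845 346289853 mud flats ENVO:00002297
eol_documents_ascii_nonHTML.txt 346289871 346289873 mud ENVO:01000001
eol_documents_ascii_nonHTML.txt 346289871 346289873 mud ENVO:00010483
eol_documents_ascii_nonHTML.txt 346289905 346289910 mounds ENVO:00000180
"""


def parse_line(line):
    """Split one output row into its five fields."""
    parts = line.split()
    # The match text may contain spaces, so take fixed fields from both ends.
    file_name, start, end = parts[0], int(parts[1]), int(parts[2])
    envo_id = parts[-1]
    match_text = " ".join(parts[3:-1])
    return file_name, start, end, match_text, envo_id


def group_by_mention(output):
    """Collect all ENVO IDs reported for each (file, start, end, text) mention."""
    mentions = defaultdict(list)
    for line in output.splitlines():
        if line.strip():
            file_name, start, end, match_text, envo_id = parse_line(line)
            mentions[(file_name, start, end, match_text)].append(envo_id)
    return dict(mentions)


for mention, ids in group_by_mention(SAMPLE).items():
    print(mention, ids)
```

In the sample rows the end coordinate is inclusive: the span length `end - start + 1` equals the length of the match text.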

Page 28: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS – Sample Output

Traversing all IS_A, PART_OF relationships in ENVO
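
One reading of this annotation is that every matched term is reported together with the terms reachable from it through IS_A and PART_OF links, which would explain why a single mention such as "mud flats" carries several ENVO IDs in the sample output. The traversal can be sketched as follows; the `TOY:` terms and their links are invented for illustration and are not taken from ENVO.

```python
# Toy parent links: term -> list of (relationship, parent) pairs.
# Term names and links are invented; they are NOT real ENVO content.
PARENTS = {
    "TOY:mud_flat": [("is_a", "TOY:flat"), ("part_of", "TOY:intertidal_zone")],
    "TOY:flat": [("is_a", "TOY:landform")],
    "TOY:intertidal_zone": [("is_a", "TOY:zone")],
    "TOY:landform": [("is_a", "TOY:root")],
    "TOY:zone": [("is_a", "TOY:root")],
}


def ancestors(term, parents=PARENTS, relations=("is_a", "part_of")):
    """Return every term reachable from `term` via the given relationships."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in parents.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("TOY:mud_flat")))
print(sorted(ancestors("TOY:mud_flat", relations=("is_a",))))
```

Restricting `relations` to `("is_a",)` shows how the reported set shrinks when PART_OF links are ignored.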

Page 29: Text Mining and Environmental Metadata Suggestion

Download

ENVIRONMENTS

•  Home Page: http://environments.hcmr.gr/
•  Tagger Software: http://download.jensenlab.org/environments_tagger.tar.gz

Page 30: Text Mining and Environmental Metadata Suggestion

other forms of access

Page 31: Text Mining and Environmental Metadata Suggestion

http://eol.org/info/discover_what

Page 32: Text Mining and Environmental Metadata Suggestion

ENVIRONMENTS

ID: ENVO:00000150 Name: coral reef

Interactive Curation

http://www.ncbi.nlm.nih.gov/pubmed/18301735

Page 33: Text Mining and Environmental Metadata Suggestion

http://www.ncbi.nlm.nih.gov/pubmed/18301735

Interactive Curation

Page 34: Text Mining and Environmental Metadata Suggestion

http://www.ncbi.nlm.nih.gov/pubmed/18301735

Interactive Curation

Page 35: Text Mining and Environmental Metadata Suggestion

http://www.ncbi.nlm.nih.gov/pubmed/18301735

Interactive Curation

Page 36: Text Mining and Environmental Metadata Suggestion

http://www.ncbi.nlm.nih.gov/pubmed/18301735

Interactive Curation

Page 37: Text Mining and Environmental Metadata Suggestion

Not only ENVO terms

Page 38: Text Mining and Environmental Metadata Suggestion

http://www.ncbi.nlm.nih.gov/pubmed/18301735

Page 39: Text Mining and Environmental Metadata Suggestion

What else is being identified?

Ready for you to discover!

Page 40: Text Mining and Environmental Metadata Suggestion

Page 41: Text Mining and Environmental Metadata Suggestion

Summary

•  Importance of standardized metadata and annotations
•  ENVO: standardized, hierarchically organized descriptions of environment types
•  Literature, project and other scientific-content web pages may describe the environmental context of a metagenomics sample
•  ENVIRONMENTS:
    •  Dictionary-based identification of environment descriptive terms
    •  Ontological community standards, e.g. ENVO, as the name source
    •  Command line application
•  Browser extensions, a user-friendly interface
    •  Highly interactive
    •  Can be used while browsing the web
    •  Extract ENVO terms from a selected part of a web page
    •  Extended for organism, disease, and tissue mention identification

Page 42: Text Mining and Environmental Metadata Suggestion

Digging-out Information

http://hartpurylrc.files.wordpress.com Photo by Dr Chatzinikolaou E

Page 43: Text Mining and Environmental Metadata Suggestion

Critical Assessment of Information Extraction in Biology

BioCreative: Metagenomics Track

•  Preparing a Metagenomics Track as part of the BioCreative 2015 challenge
•  Aim: improve the environmental-context annotation of sequences in major metagenomics repositories
•  Track coordinator: Dr. L. Hirschman, MITRE
•  BioCreative (www.biocreative.org)

Page 44: Text Mining and Environmental Metadata Suggestion

ACTION ES1103

Biodiversity – Genomics

ENVIRONMENTS-EOL
http://environments-eol.blogspot.com/
Encyclopedia of Life (EOL) http://www.eol.org
•  process EOL taxon pages
•  extract environmental context (ENVO terms)
•  EOL Taxon Page: Quick Facts, Data tab
•  integrated in TraitBank
•  large scale biological questions
Rubenstein Fellowship 2013
In collab: Jennifer Hammock, Patrick Leary, Katja Schulz, Cyndy Parr

SEQenv
http://environments.hcmr.gr/seqenv.html
•  annotate microbial sequences with ENVO terms
•  sequence analysis, literature mining, visualization
•  GenBank isolation source, PubMed abstracts
•  sample comparison, temporal/spatial pattern analysis
•  extension: proteins, protein families, 3D visualization

Reused: Analysis of America bird habitats, http://blog.eol.org/ (NoPlaceLikeHome, in collab: Rob Stevenson, Carl Nordman)

Hexanchus griseus EOL page, http://eol.org/pages/212027

Page 45: Text Mining and Environmental Metadata Suggestion

http://jensenlab.org/

Santos A et al. (under review), preprint: http://biorxiv.org/content/early/2014/11/10/010975

Frankild S et al. (under review), preprint: http://biorxiv.org/content/early/2014/08/25/008425

Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390

Page 46: Text Mining and Environmental Metadata Suggestion

Acknowledgements

HCMR-IMBG: Christos Arvanitidis, Christina Pavloudi, Katerina Vasileiadou, Lucia Fanini, Sarah Faulwetter, Anastasis Oulas
NNF CPR: Lars Juhl Jensen, Sune Frankild
U Mass: Rob Stevenson
Uni Glasgow: Christopher Quince, Umer Ijaz
EOL: Cynthia Parr, Jennifer Hammock, Patrick Leary, Katja Schulz
MM-MPI: J. Schnetzer
AWI: Dr P. Buttigieg
HITS: Dr. S. Berger
and more

Funding: EOL Rubenstein Fellowship, LifeWatch Greece, MARBIGEN, NNF-CPR, EOL-BHL NESCent Research Sprint 2014, “SEQenv” Hackathons (COST ES1103)

Thank You!

Amvrakikos Lagoons, May 2011

ACTION ES1103

Page 47: Text Mining and Environmental Metadata Suggestion

Acknowledgements

Thank You!

Amvrakikos Lagoons, May 2011

ACTION ES1103

id: ENVO:00000038 name: lagoon

Page 48: Text Mining and Environmental Metadata Suggestion

Tutorial

•  Start Firefox
•  Install the “megx-seqenv-bar.xpi”
    •  Drag and drop
    •  “Install Now” and “Restart”
•  Visit a couple of PubMed abstracts or article web pages of your preference
    •  Annotate the complete abstract
    •  Annotate selected sentences only