iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service
Matthew Collins (iDigBio), Jorrit Poelen (independent), Alexander Thompson (iDigBio), Jennifer Hammock (EOL)
What We’re Interested In
Computation with biodiversity data
• Research at scale
• Lowering barriers to access
• Reproducibility
Matthew Collins, Technical Operations Manager - iDigBio
Jorrit Poelen, Independent
Alexander Thompson, Software Products Lead - iDigBio
Jennifer Hammock, Marine Theme Coordinator - EOL
Quick Review of Ways That We Work With Datasets
Focus here is on using large aggregated datasets to answer research questions
Good: low barrier to entry, expert-built, documentation, peers
Less good: limited scope, limited ability to change
Working With Data - APIs & Libraries
Good: direct access to data, some simple analysis
Less good: programming barrier, performance limits
Working With Data - Download & Code
Good: ultimate flexibility, combine & merge
Less good: data management barrier, you’re the sysadmin
Working With Data - GUODA
Global Unified Open Data Access
(If SPNHC can be “Spinach”, GUODA can be “Gouda”)
An informal collaboration between technologists from organizations like EOL, ePANDDA, and iDigBio as well as independent biodiversity informaticists. We share data use cases, best practices, infrastructure, code, and ideas around the science that can be done by analyzing large open-access biodiversity datasets.
Working With Data - GUODA Continued
Goals
• Have technologists discuss the technical challenges and solution approaches in the biodiversity informatics domain
• Provide on-ramp for those who might not think of themselves as “technologists”
• Fast parallel computation infrastructure and practices (currently using Apache Spark)
• Local copies of entire datasets already formatted, ready for computation at scale on provided infrastructure
• Hosting for services that rely on above
What Questions Does GUODA Make Approachable?
Can we create structured data from the unstructured text in iDigBio records?
GUODA provides a platform to quickly start working on this problem.
1. No data download
2. Jupyter Notebooks
3. Parallel processing of entire dataset
Data Characterization
Looking at the Darwin Core terms fieldNotes, occurrenceRemarks, and eventRemarks to see how many characters are in which fields
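The Code to Produce That Figure

    # sqlContext is assumed to be provided by the Spark-enabled notebook session.
    idbdf = sqlContext.read.parquet("../data/idigbio/occurrence.txt.parquet")
    # Register the DataFrame so the SQL below can refer to it as idbtable.
    idbdf.registerTempTable("idbtable")
    # Keep records with text in at least one of the three remark fields and
    # join those fields into a single document per record.
    notes = sqlContext.sql("""
        SELECT
            `http://portal.idigbio.org/terms/uuid` AS uuid,
            TRIM(CONCAT(`http://rs.tdwg.org/dwc/terms/occurrenceRemarks`, ' ',
                        `http://rs.tdwg.org/dwc/terms/eventRemarks`, ' ',
                        `http://rs.tdwg.org/dwc/terms/fieldNotes`)) AS document
        FROM idbtable
        WHERE `http://rs.tdwg.org/dwc/terms/fieldNotes` != ''
           OR `http://rs.tdwg.org/dwc/terms/occurrenceRemarks` != ''
           OR `http://rs.tdwg.org/dwc/terms/eventRemarks` != ''
    """)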
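The characterization itself (how many characters sit in each of fieldNotes, occurrenceRemarks, and eventRemarks) can be sketched in plain Python for readers without a Spark session at hand. This is a minimal sketch under stated assumptions, not the code behind the figure: the helper name `chars_per_field`, the use of short term names as dictionary keys, and the sample records are all hypothetical; only the three Darwin Core term names come from the slides.

```python
from collections import Counter

# The three Darwin Core free-text terms examined in the slides.
TEXT_FIELDS = ("fieldNotes", "occurrenceRemarks", "eventRemarks")


def chars_per_field(records):
    """Total the number of characters found in each free-text field.

    `records` is an iterable of dicts keyed by short term name (an
    assumption for this sketch); missing or None values count as zero.
    """
    totals = Counter({field: 0 for field in TEXT_FIELDS})
    for record in records:
        for field in TEXT_FIELDS:
            totals[field] += len(record.get(field) or "")
    return dict(totals)


# Hypothetical records, invented for illustration only.
sample = [
    {"fieldNotes": "under log", "occurrenceRemarks": "", "eventRemarks": None},
    {"occurrenceRemarks": "juvenile", "eventRemarks": "night trap"},
]
print(chars_per_field(sample))
```

On the full dataset the same per-field totals would be computed in parallel on the cluster rather than in a single Python loop; the loop here only illustrates what is being counted.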
Remember “collaboration” and “infrastructure” to lower barriers
• Twice-monthly Google Hangouts
• Hadoop HDFS data store with datasets: GBIF, iDigBio, BHL, TraitBank so far
• Apache Spark cluster for computation
• Backs Effechecka http://effechecka.org/
• Backs Fresh Data https://github.com/gimmefreshdata/
• ePANDDA (we’re sharing ideas)
• iDigBio data quality workflows
Why is GUODA Important?
Perform research at a faster pace by “outsourcing” some of the harder parts
Collect entire large datasets together in one place for cross-dataset exploration without data management barrier
Provides a foundation, both community and infrastructure, upon which to build purpose-built applications and APIs bigger and faster than before
How You Can Fit With GUODA
• Make your data available
• Data standards to make it relatable to other datasets
• Making data available doesn’t end with handoff to the aggregator - where is your data used?
• Support workforce development
• Support next-wave things like ePANDDA
• Collaborate with GUODA when starting your own research