The Barcode of Life Data Portal (http://bol.uvm.edu)
Dr. David E Schindel, Executive Secretary
Michael Trizna, Database Specialist
Consortium for the Barcode of Life (CBOL)
Smithsonian Institution, Washington, DC
www.barcodeoflife.org; [email protected] and [email protected]
Dr David Schindel and Mike Trizna - BOL Data Portal
Contents of Presentation
Crowd-sourced open source software
How does Data Portal complement BOLD and GenBank?
Data Portal capabilities
Case Study: Smithsonian frozen bird tissue project
An Experiment in Museum Tissue Mining and Fast Data Release
Tissue sampling winter/spring
Sequencing completed in September
Sequence quality control in October
Taxonomic checking in early November
– Obvious errors removed
– Minor discrepancies remain
Data released for Adelaide Conference
– Crowd-sourced annotation by community
– Will data be mis-used?
Unique Data Portal Capabilities
Creating customized datasets from public and/or your private data
Online library of standard datasets
Support sharing within project teams using Connect IDs, easy link to Working Groups
Running different identification analyses based on different methodologies:
– Standard sequence input using FASTA format
– Use standard or customized datasets
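The FASTA input mentioned above is a plain-text format: each record starts with a ">" header line, followed by one or more lines of sequence. As a minimal sketch of how query sequences could be read before submission (the function name, sample headers and short sequences are hypothetical placeholders, not Data Portal code or real barcodes):

```python
def parse_fasta(text):
    """Return {header: sequence} parsed from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical two-record input; real COI barcodes are roughly 650 bp long.
example = """>sample_001|Myiarchus ferox
ACGTACGT
ACGT
>sample_002|Conopias parvus
TTGACCA
"""

queries = parse_fasta(example)
for name, seq in queries.items():
    print(name, len(seq))
```

Wrapped sequence lines are joined into one string per record, so the same parser handles both single-line and multi-line FASTA files.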
Barcode Aggregator
727,170 public records
Summary Statistics per Family
Creating Customized Datasets
Existing Data Analysis Packages
List of packages
– BLOG
– BRONX
– Kernel
– CAOS
– USEARCH
– BLAST
Output of identification routines as probabilities of assignment
Data Analysis Methods Session
New packages presented Friday afternoon:
– Damon Little: Automatic Plants Barcode pipeline (from raw traces to trimmed/edited sequences)
– Ka Hou Chu: Composite Vector Method (profile trees for faster alignment and tree-based analysis)
– Alain Franc: Matching Next Generation results to Sanger-based reference records
Sample output
CONNECT for Data Portal Collaboration
The USNM Bird Project
USNM Division of Birds frozen tissue collection:
– 21,104 specimens, 2,512 species
Which new ones to sample/barcode?
Public records for birds
– All public bird COI records: 10,967
– All BARCODE records in GenBank: 8,419
– BARCODE with taxonomic names: 7,965
– BARCODE, name and 2 traces: 2,388
Moving Data Among BOLD, GenBank, Data Portal
USNM Excel Spreadsheet (KE-Emu Source)
Local database that holds all fields from the original spreadsheet
Data Portal Aggregator database
BOLD
Split into projects that consist of 2-4 plates
Creating a ‘Pick List’
Spreadsheet of tissue samples compared with:
– ITIS taxonomy
– Clements species list in BOLD
– Counts of GenBank and/or public BOLD records
– Geographic information
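The comparison behind a pick list can be sketched as a join between the tissue inventory and counts of existing public records, so that species with no public barcodes sort to the top. This is only an illustration under stated assumptions: the function name, field names and all counts below are invented for the example and do not come from the USNM spreadsheet.

```python
from collections import Counter


def build_pick_list(tissues, public_records):
    """Rank species for sampling: fewest existing public records first.

    tissues: {species name: number of tissue samples held}
    public_records: iterable of species names, one per public record
    """
    counts = Counter(public_records)
    rows = [
        {
            "species": species,
            "tissues": n_tissues,
            "public_records": counts.get(species, 0),
        }
        for species, n_tissues in tissues.items()
    ]
    # Species with no public barcode come first; ties sort alphabetically.
    rows.sort(key=lambda r: (r["public_records"], r["species"]))
    return rows


# Hypothetical inventory and record list (counts are made up).
tissues = {"Myiarchus ferox": 4, "Conopias parvus": 2, "Phaethornis longuemareus": 3}
public_records = ["Myiarchus ferox", "Myiarchus ferox", "Phaethornis longuemareus"]

pick_list = build_pick_list(tissues, public_records)
for row in pick_list:
    print(row["species"], row["tissues"], row["public_records"])
```

A real pick list would also need the taxonomic reconciliation step (ITIS versus the BOLD species list) before names can be matched reliably; the sketch assumes names already agree.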
Screenshot of USNM list side-by-side with BOLD records
Identifying Samples to be Subsampled
Side-by-Side Lists
USNM Bird Dataset
3,150 tissues sampled
168 failed sequences
94 problematic sequences
166 clustered badly
2,761 ‘BARCODE-ready’ samples
1,147 ‘first-BARCODE’ species
91% increase over 1,259 barcoded species
(3,892 listed in BOLD includes BINs, others)
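The headline percentages on this slide follow directly from the counts above. A short check, using only the figures given in the slide (variable names are illustrative):

```python
# Counts taken from the slide.
tissues_sampled = 3150
barcode_ready = 2761
first_barcode_species = 1147
previously_barcoded_species = 1259

# Share of sampled tissues that yielded a BARCODE-ready record.
ready_share = barcode_ready / tissues_sampled

# New species relative to the species that already had barcodes.
species_increase = first_barcode_species / previously_barcoded_species

print(f"BARCODE-ready share: {ready_share:.1%}")            # 87.7%
print(f"Increase in barcoded species: {species_increase:.1%}")  # 91.1%
```

The second figure reproduces the 91% increase quoted on the slide: 1,147 newly barcoded species against a baseline of 1,259.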
Two problematic clades, USNM data
Flycatchers: Family Tyrannidae
– Sublegatus arenarum, S. modestus, S. obscurior, S. sp.
– Conopias parvus, C. albovittatus
– Myiarchus ferox, M. swainsoni, M. sp.
Hummingbirds: Family Trochilidae
– Phaethornis longuemareus
Inconsistencies within USNM dataset
Incompatibilities with public, other data
Resolving Mis-identified Specimens
What testing dataset to use?
ID trees and analytical routines could use:
– All public bird COI records: 10,967
– All BARCODE records in GenBank: 8,419
– BARCODE with taxonomic names: 7,965
– BARCODE, name and 2 traces: 2,388
Which ones have reliable taxonomic IDs?
Preparing a Data Release Paper
Summary statistics from Data Portal