Dr David Schindel and Mike Trizna - BOL Data Portal

Post on 25-May-2015

Category:

Education

DESCRIPTION

Using the BOL Data Portal in conjunction with BOLD, the portal's capabilities, and a case study: the Smithsonian frozen bird tissue project

Transcript

The Barcode of Life Data Portal

(http://bol.uvm.edu)

Dr. David E Schindel, Executive Secretary

Michael Trizna, Database Specialist

Consortium for the Barcode of Life (CBOL)

Smithsonian Institution

Washington, DC

www.barcodeoflife.org

SchindelD@si.edu and TriznaM@si.edu

Contents of Presentation

Crowd-sourced open source software

How does Data Portal complement BOLD and GenBank?

Data Portal capabilities

Case Study: Smithsonian frozen bird tissue project

An Experiment in Museum Tissue Mining and Fast Data Release

Tissue sampling winter/spring

Sequencing completed in September

Sequence quality control in October

Taxonomic checking in early November
– Obvious errors removed
– Minor discrepancies remain

Data released for Adelaide Conference
– Crowd-sourced annotation by community
– Will data be misused?

Unique Data Portal Capabilities

Creating customized datasets from public and/or your private data

Online library of standard datasets

Support sharing within project teams using Connect IDs, easy link to Working Groups

Running different identification analyses based on different methodologies:
– Standard sequence input using FASTA format
– Use standard or customized datasets
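The FASTA input mentioned above can be illustrated with a minimal parser sketch. This is not the Data Portal's own code; the function name `parse_fasta` and the example records are hypothetical, and the sequences are made-up placeholders rather than real barcodes.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping.

    Hypothetical helper for illustration only; it assumes each header
    starts with '>' and that the record ID is the first token after it.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("FASTA header without an identifier")
            current_id = parts[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may wrap; concatenate them in upper case.
            records[current_id] += line.upper()
    return records


# Made-up example input, not real barcode data.
example = """>sample_1 hypothetical COI fragment
ACGTACGT
acgt
>sample_2
TTGACC
"""
queries = parse_fasta(example.splitlines())
```

A mapping like `queries` is the kind of input an identification routine would then compare against a standard or customized reference dataset.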

Barcode Aggregator

727,170 public records

Summary Statistics per Family

Creating Customized Datasets

Existing Data Analysis Packages

List of packages
– BLOG
– BRONX
– Kernel
– CAOS
– USEARCH
– BLAST

Output of identification routines as probabilities of assignment

Data Analysis Methods Session

New packages presented Friday afternoon:
– Damon Little: Automatic Plants Barcode pipeline (from raw traces to trimmed/edited sequences)
– Ka Hou Chu: Composite Vector Method (profile trees for faster alignment and tree-based analysis)
– Alain Franc: Matching Next Generation results to Sanger-based reference records

Sample output

CONNECT for Data Portal Collaboration

The USNM Bird Project

USNM Division of Birds frozen tissue collection:
– 21,104 specimens, 2,512 species

Which new ones to sample/barcode?

Public records for birds
– All public bird COI records: 10,967
– All BARCODE records in GenBank: 8,419
– BARCODE with taxonomic names: 7,965
– BARCODE, name and 2 traces: 2,388
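To put the four record counts above in proportion, a short sketch can express each one as a share of all public bird COI records. The counts are taken from the slide; the variable names are illustrative only.

```python
# Counts as listed on the slide; keys are shortened labels.
public_bird_records = {
    "All public bird COI records": 10967,
    "All BARCODE records in GenBank": 8419,
    "BARCODE with taxonomic names": 7965,
    "BARCODE, name and 2 traces": 2388,
}

total = public_bird_records["All public bird COI records"]

# Share of each category relative to all public bird COI records.
shares = {
    label: 100.0 * count / total
    for label, count in public_bird_records.items()
}

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}% of public COI records")
```

The steep drop at the last step shows how few public records carry the BARCODE keyword, a taxonomic name, and two trace files together.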

Moving Data Among BOLD, GenBank, Data Portal

USNM Excel Spreadsheet (KE-Emu Source)

Local database that holds all fields from the original spreadsheet

Data Portal Aggregator database

BOLD: split into projects that consist of 2-4 plates

Creating a ‘Pick List’

Spreadsheet of tissue samples compared with:
– ITIS taxonomy
– Clements species list in BOLD
– Counts of GenBank and/or public BOLD records
– Geographic information
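One way to picture the pick-list comparison is to rank tissue samples so that species with the fewest existing public records come first. This is a sketch under stated assumptions, not the project's actual workflow: the `TissueSample` fields, the `rank_for_sampling` function, and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TissueSample:
    # Hypothetical fields; the real spreadsheet columns are not shown in the slides.
    catalog_id: str
    species: str
    country: str


def rank_for_sampling(
    samples: List[TissueSample], public_counts: Dict[str, int]
) -> List[TissueSample]:
    """Order samples so species with the fewest public records come first.

    Species missing from public_counts are treated as having zero records.
    Ties are broken by species name, then catalog ID, for a stable order.
    """
    return sorted(
        samples,
        key=lambda s: (public_counts.get(s.species, 0), s.species, s.catalog_id),
    )


# Made-up placeholder data for illustration.
samples = [
    TissueSample("T-001", "Species A", "Country X"),
    TissueSample("T-002", "Species B", "Country Y"),
    TissueSample("T-003", "Species C", "Country X"),
]
public_counts = {"Species A": 12, "Species C": 3}  # Species B has no public records

pick_list = rank_for_sampling(samples, public_counts)
```

Geographic information could be added as a further sort key, for example to prefer samples from countries not yet represented in the public records.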

Screenshot of USNM list side-by-side with BOLD records

Identifying Samples to be Subsampled

Side-by-Side Lists

USNM Bird Dataset

3,150 tissues sampled

168 failed sequences

94 problematic sequences

166 clustered badly

2,761 ‘BARCODE-ready’ samples

1,147 ‘first-BARCODE’ species

91% increase over 1,259 barcoded species

(3,892 listed in BOLD includes BINs, others)
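The 91% figure follows from dividing the number of ‘first-BARCODE’ species by the number of previously barcoded species. A minimal check, with illustrative variable names:

```python
def percent_increase(added: int, baseline: int) -> float:
    """Return the increase contributed by `added` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * added / baseline


first_barcode_species = 1147   # species barcoded for the first time (from the slide)
previously_barcoded = 1259     # species already barcoded (from the slide)

increase = percent_increase(first_barcode_species, previously_barcoded)
new_total = first_barcode_species + previously_barcoded

print(f"Increase: {increase:.1f}%, new total: {new_total} barcoded species")
```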

Two problematic clades, USNM data

Flycatchers: Family Tyrannidae
– Sublegatus arenarum, S. modestus, S. obscurior, S. sp.
– Conopias parvus, C. albovittatus
– Myiarchus ferox, M. swainsoni, M. sp.

Hummingbirds: Family Trochilidae
– Phaethornis longuemareus

Inconsistencies within USNM dataset

Incompatibilities with public, other data

Resolving Mis-identified Specimens

What testing dataset to use?

ID trees and analytical routines could use:
– All public bird COI records: 10,967
– All BARCODE records in GenBank: 8,419
– BARCODE with taxonomic names: 7,965
– BARCODE, name and 2 traces: 2,388

Which ones have reliable taxonomic IDs?

Preparing a Data Release Paper

Summary statistics from Data Portal

Figures from BOLD
