The data flood: We need a bigger boat James A. Foster The Initiative for Bioinformatics and Evolutionary Studies (IBEST) Biological Sciences, Bioinformatics and Computational Biology University of Idaho

Dec 18, 2015

Transcript
Page 1

The data flood: We need a bigger boat

James A. Foster

The Initiative for Bioinformatics and Evolutionary Studies (IBEST)

Biological Sciences, Bioinformatics and Computational Biology

University of Idaho

Page 2

JAF INBRE Data Flood 8/4/09

Outline

✦ Where is this flood of data coming from?

✦ What kind of tool is appropriate for this amount of data?

✦ What kind of a tool is “bioinformatics”?

✦ How about an example?

Page 3

DNA sequencing data flood

Year   bp/day          Technology
1977   7.35            Gels
1986   50-ish          ABI 370
1995   19,000          ABI 377
1998   400,000         ABI 3700
2008   1,600,000,000   454
2009   3,200,000,000   454/FLX
2012   ??              ???
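To make the growth in the table concrete, here is a minimal sketch that computes the fold increase between consecutive entries and the average doubling time over each interval. It assumes steady exponential growth between data points and takes the "50-ish" entry as exactly 50; both are simplifications for illustration, not claims from the slides.

```python
import math

# (year, bp/day) as listed on the slide; "50-ish" is taken as 50 (assumption).
throughput = [
    (1977, 7.35),
    (1986, 50),
    (1995, 19_000),
    (1998, 400_000),
    (2008, 1_600_000_000),
    (2009, 3_200_000_000),
]


def doubling_time_years(y0, r0, y1, r1):
    """Average doubling time, assuming exponential growth between two points."""
    return (y1 - y0) * math.log(2) / math.log(r1 / r0)


# Fold increase and doubling time for each consecutive pair of rows.
for (y0, r0), (y1, r1) in zip(throughput, throughput[1:]):
    fold = r1 / r0
    dt = doubling_time_years(y0, r0, y1, r1)
    print(f"{y0}-{y1}: {fold:,.1f}x, doubling roughly every {dt:.2f} years")

# Overall increase from the first row to the last.
overall_fold = throughput[-1][1] / throughput[0][1]
print(f"1977-2009 overall: {overall_fold:,.0f}x")
```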

Page 4

The data flood: DNA example

Year   bp/day          Notes           Water
1977   7.35            Manual: φx174   1L
1986   50-ish          Gel: ABI 370    Barrel (176 gallons)
1995   19,000          Gel: ABI 377    Big pool (2x6x12m)
1998   400,000         Cap: ABI 3700   football field, 20m deep
2008   1,600,000,000   454             Lakes Michigan/Huron
2009   3,200,000,000   454/FLX         all Great Lakes (nearly)
2012   ??              ??              ocean?

Page 5

Bioinformatics tools

Year   Data volume      Technology
1977   1L               spoon
1986   barrel           hose
1995   big pool         pfd
1998   football field   Kayak
2008   Lake Michigan    Orca?
2009   Great Lakes      bigger boat?
2012   ocean?           Glomar?

Page 6

Bioinformatics: bigger boat?

[Diagram labels: You, Your hypothesis, Data, The Computer (bioinformatics), Your thesis]

Page 7

Reflection on the metaphor

✦ At some point, you can use fundamentally different techniques: spoons versus boats

✦ At some point, you can test fundamentally new hypotheses: not “we need a smaller shark”

✦ Sometimes the old technology is still good: the kayak was appropriate in this picture

✦ The new technology may be for a different purpose: fishing versus deep sea exploration

Page 8

Technology quiz!

Page 9

What does this do?

Page 10

What does this do?

Not that!

THIS!

A Bigger Boat

Whatever you tell it to do!

Page 11

What is Bioinformatics?

Bioinformatics is what you tell the computer to do with your data

Page 12

Of Boats and Bioinformatics

Bioinformatics is what you do with the boat you are in during the data flood

You might be able to do more with a bigger boat

Page 13

Sampling emergent diversity

✦ Get ALL DNA along an age-variant transect

• 10 samples per site

• time since exposure: 5y, 19y, 40y, 63y, 100y, and 150y

• “chronoclines” sample ecosystems by age

✦ Who’s there?

✦ How does the ecosystem change over time?

Page 14

Bioinformatics problems

✦Estimate α diversity: number of “species” in each sample and age group

✦Estimate β diversity: amount of variation in “species” between age groups

✦Determine which species (no quotes) are present in each sample (not part of this talk)

Biological questions: How do soil bacteria respond to retreating glaciers? How do microbial soil communities change?
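As an illustration of the two diversity estimates, here is a minimal sketch on toy data. The slides do not say which estimators were used, so the choices here are assumptions: observed richness and the bias-corrected Chao1 estimator stand in for α diversity, and Jaccard dissimilarity stands in for β diversity. The cluster labels (`otu1`, ...) and the two toy age groups are hypothetical.

```python
from collections import Counter


def observed_richness(labels):
    """Alpha diversity, simplest form: number of distinct 'species' seen."""
    return len(set(labels))


def chao1(labels):
    """Bias-corrected Chao1 richness: S_obs + F1*(F1-1) / (2*(F2+1)),
    where F1 and F2 count 'species' seen exactly once and exactly twice."""
    counts = Counter(labels)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    return len(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


def jaccard_dissimilarity(a, b):
    """Beta diversity between two groups: 1 - |shared| / |union| of 'species'."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1 - len(sa & sb) / len(union)


# Hypothetical cluster assignments, one label per sequence.
young = ["otu1", "otu1", "otu2", "otu3"]
old = ["otu2", "otu3", "otu3", "otu4", "otu5", "otu5"]

print("alpha (observed):", observed_richness(young), observed_richness(old))
print("alpha (Chao1):   ", chao1(young), round(chao1(old), 2))
print("beta (Jaccard):  ", jaccard_dissimilarity(young, old))
```

Chao1 adjusts the observed count upward when many "species" are seen only once, which matters when sampling is shallow relative to the true diversity.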

Page 15

Lots of data (post QC)

Age     Samples   Sequences   DNA (Mbp)
5y      9         35,092      8.77
19y     10        41,494      10.37
40y     8         33,665      8.42
63y     9         41,767      10.44
100y    8         41,178      10.29
150y    8         40,210      10.05
Total   52        233,406     58.35

Note: this is a SMALL run; the maximum is 37GB per 8hr run, 1.6 Bbp/day
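A quick sanity check on the table, sketched in Python: it re-adds the per-age rows and derives the mean read length implied by each row. The figures are copied from the table; the derived per-read length is a back-of-the-envelope number, not a value stated on the slide.

```python
# age -> (samples, sequences, DNA in Mbp), copied from the table above.
rows = {
    "5y": (9, 35_092, 8.77),
    "19y": (10, 41_494, 10.37),
    "40y": (8, 33_665, 8.42),
    "63y": (9, 41_767, 10.44),
    "100y": (8, 41_178, 10.29),
    "150y": (8, 40_210, 10.05),
}

total_samples = sum(s for s, _, _ in rows.values())
total_sequences = sum(n for _, n, _ in rows.values())

# Mean read length implied by each row: base pairs divided by sequence count.
mean_read_bp = {age: mbp * 1e6 / n for age, (_, n, mbp) in rows.items()}

print("samples:", total_samples, "sequences:", total_sequences)
for age, bp in mean_read_bp.items():
    print(f"{age}: about {bp:.0f} bp per sequence")
```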

Page 16

Bioinformatics objectives

determine species

cluster by species

cluster by age

Explain data in terms of biological processes and age (tell a story)

Too much data: 233K sequences!

Page 17

Trick: Turn it upside down

Cluster each of 52 samples (approx. 6k each), choose a proxy sequence

Cluster proxies by age (approx. 40k each)

Cluster combined sequences to get species (quantify richness)

Build +/- matrix

++ + ++

++ - ++

+ - -

+ - +

- +++ +
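The bottom-up strategy above can be sketched on toy data as follows. The slides do not name the clustering algorithm, so this uses a simple greedy scheme as a stand-in: each sequence joins the first cluster whose proxy it matches above a threshold. The `identity` measure (position-by-position matching with no alignment), the 0.8 threshold, and the sample names are all assumptions for illustration.

```python
def identity(a, b):
    """Toy similarity: fraction of matching positions (no alignment)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def proxies(seqs, threshold=0.8):
    """Greedy clustering; returns one proxy (the founding sequence) per cluster."""
    found = []
    for s in seqs:
        if not any(identity(s, p) >= threshold for p in found):
            found.append(s)
    return found


# Hypothetical input: (age group, sample name) -> sequences.
samples = {
    ("5y", "s1"): ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"],
    ("5y", "s2"): ["AAAAAAAAAA"],
    ("150y", "s3"): ["CCCCCCCCCC", "GGGGGGGGGG", "GGGGGGGGGA"],
}

# Step 1: cluster each sample and keep one proxy per cluster.
sample_proxies = {key: proxies(seqs) for key, seqs in samples.items()}

# Step 2: pool the sample proxies by age group and cluster again.
by_age = {}
for (age, _), ps in sample_proxies.items():
    by_age.setdefault(age, []).extend(ps)
age_proxies = {age: proxies(ps) for age, ps in by_age.items()}

# Step 3: cluster the combined proxies to define "species".
species = proxies([p for ps in age_proxies.values() for p in ps])

# Step 4: build the +/- (presence/absence) matrix, one row per age group.
matrix = {
    age: ["+" if any(identity(p, sp) >= 0.8 for p in ps) else "-" for sp in species]
    for age, ps in age_proxies.items()
}

for age, row in matrix.items():
    print(f"{age:>5}", " ".join(row))
```

Each level only ever clusters proxies from the level below, so no single step has to compare all 233K sequences against each other.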

Page 18

Bioinformatics challenges

✦ Move data between computers (IGS, laptop, IBEST Core)

✦ File the data in a retrievable way

✦ Associate metadata with data

✦ Cluster sequences within/between samples

✦ Associate clusters with species

✦ Compute diversity statistics

✦ Prepare publications and talks

✦ (much more)

Page 19

Conclusions

✦ Biology

• There are thousands of species of bacteria in arctic soil

• Number of bacterial species increases as time of post-glacial exposure increases

✦ Algorithmics (want a job?)

• “Quantity has a quality all its own” (V.I.Lenin)

• Need new algorithms to use new hardware

• Database/dataset management is crucial

Page 20

Thanks!

Ursel Schüette

Zaid Abdo

Jacob Pierson

Larry Forney

Rob Lyon

The Forney-Top lab

John Bunge, Cornell

The Relational Database project, MSU

to INBRE for the excuse

to IBEST for the science

to NIH, NSF, and UI for the money (P20RR16448, P20RR016454, EPS080935)

Page 21

Discussion?

Page 22

Extra stuff

Intentionally blank

Page 23

Roche 454: a genome a day

Page 24

Metagenomics

✦ Harvest approximately the first 300bp of every 16S rRNA molecule, all samples

• Ribosome: required for translation (conserved)

• Common marker for microbial species

✦Cluster by evolutionary relationships (“species”)

✦Analyze by chronocline

Page 25

Future work: same tune, new lyrics

✦ Data from human microbiome

• How do microbial communities vary between healthy and sick people?

✦ Data from polluted soil (Yangtze river, PRC)

• How do microbial communities vary as pollution increases?

✦ Data from longitudinal transects

• How does microbial diversity change with latitude?