High performance computational analysis of DNA sequences from different environments Rob Edwards Computer Science Biology edwards.sdsu.edu www.theseed.org
Dec 20, 2015

Page 1:

High performance computational analysis of DNA sequences from different environments

Rob Edwards

Computer Science, Biology

edwards.sdsu.edu www.theseed.org

Page 2:

Outline

• There is a lot of sequence

• Tools for analysis

• More computers

• Can we speed analysis

Page 3:

How much has been sequenced?

[Figure: number of known sequences (y-axis) by year (x-axis). Annotated points: first bacterial genome; 100 bacterial genomes; 1,000 bacterial genomes; environmental sequencing.]

Page 4:

How much will be sequenced?

[Figure labels: everybody in San Diego; everybody in USA; all cultured bacteria; 100 people; one genome from every species; most major microbial environments.]

Page 5:

Metagenomics (Just sequence it)

200 liters water, 5-500 g fresh fecal matter, or 50 g soil

Sequence

Epifluorescent Microscopy

Concentrate and purify bacteria, viruses, etc

Extract nucleic acids

Publish papers

Page 6:

The SEED Family

Page 7:

The metagenomics RAST server

Page 8:

Automated Processing

Page 9:

Summary View

Page 10:

Metagenomics Tools: Annotation & Subsystems

Page 11:

Metagenomics Tools: Phylogenetic Reconstruction

Page 12:

Metagenomics Tools: Comparative Tools

Page 13:

Outline

• There is a lot of sequence

• Tools for analysis

• More computers

Page 14:

How much data so far

986 metagenomes

79,417,238 sequences

17,306,834,870 bp (17 Gbp)

Average: ~15-20 Mbp per metagenome

~300 GS20, ~300 FLX, ~300 Sanger
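The average can be checked directly against the totals on this slide. A minimal sketch of that arithmetic in Python; the variable names are illustrative and not from the talk:

```python
# Totals quoted on the slide.
metagenomes = 986
sequences = 79_417_238
base_pairs = 17_306_834_870

# Mean size of one metagenome, in megabase pairs.
mean_mbp_per_metagenome = base_pairs / metagenomes / 1e6

# Mean length of one sequence (read), in base pairs.
mean_read_length_bp = base_pairs / sequences

print(f"{mean_mbp_per_metagenome:.1f} Mbp per metagenome")
print(f"{mean_read_length_bp:.0f} bp per sequence")
```

The per-metagenome mean lands inside the ~15-20 Mbp range quoted above, and the mean sequence length comes out at a little over 200 bp.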

Page 15:

Computes

Page 16:

Linear compute complexity

Page 17:

Just waiting …

Page 18:

[Figure: hours of compute time (y-axis) against input size in MB (x-axis).]

Overall compute time: ~19 hours of compute per input megabyte

Page 19:

How much so far

986 metagenomes

79,417,238 sequences

17,306,834,870 bp (17 Gbp)

Average: ~15-20 Mbp per metagenome

Compute time (on a single CPU):

328,814 hours = 13,700 days = 38 years

~300 GS20, ~300 FLX, ~300 Sanger
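The 328,814-hour figure appears to be the total input size multiplied by the ~19 hours per megabyte rate from the earlier slide. The sketch below reproduces it under two assumptions that are not stated in the talk: one base pair occupies about one byte, and the input is counted in whole megabytes of 10^6 bytes. The `wall_clock_days` helper is hypothetical and assumes the work divides perfectly across CPUs, which real clusters rarely achieve.

```python
HOURS_PER_MB = 19           # rate quoted on the "Overall compute time" slide
TOTAL_BP = 17_306_834_870   # total base pairs quoted above


def single_cpu_hours(total_bp, hours_per_mb=HOURS_PER_MB):
    """Estimate single-CPU hours, assuming 1 bp is roughly 1 byte of input."""
    whole_megabytes = total_bp // 1_000_000  # assumption: 1 MB = 10**6 bytes, truncated
    return whole_megabytes * hours_per_mb


def wall_clock_days(cpu_hours, n_cpus):
    """Idealized wall-clock days if the work splits evenly across n_cpus."""
    return cpu_hours / n_cpus / 24


hours = single_cpu_hours(TOTAL_BP)
print(hours)                         # 328814
print(hours / 24)                    # days on one CPU
print(hours / 24 / 365)              # years on one CPU
print(wall_clock_days(hours, 1000))  # days on 1,000 CPUs, ideal scaling
```

Because the cost is linear in input size, the same function gives a first estimate for any new data set: multiply its size in megabytes by 19.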

Page 20:

Outline

• There is a lot of sequence

• Tools for analysis

• More computers

• Can we speed analysis

Page 21:

Shannon’s Uncertainty

• Shannon’s Uncertainty – Peter’s surprisal

p(x_i) is the probability of the occurrence of each base or string
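Shannon's uncertainty is conventionally written H = -sum_i p(x_i) log2 p(x_i). The sketch below computes it for the bases (k = 1) or strings (k > 1) of a sequence. The function name, the use of overlapping k-mers, and the choice of log base 2 (so the result is in bits) are assumptions for illustration, not details taken from the talk.

```python
import math
from collections import Counter


def shannon_uncertainty(sequence, k=1):
    """Shannon uncertainty, in bits, over the overlapping k-mers of a sequence."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    counts = Counter(kmers)
    total = len(kmers)
    # p(x_i) is estimated as the observed frequency of each k-mer.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


print(shannon_uncertainty("AAAAAAAA"))           # a single repeated base
print(shannon_uncertainty("ACGTACGTACGT"))       # four equally frequent bases
print(shannon_uncertainty("ACGTACGTACGT", k=3))  # the same sequence as 3-mers
```

A sequence made of one repeated base has no uncertainty, while four equally frequent bases give the maximum of 2 bits per base.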

Page 22:

Surprisal in Sequences

Page 23:

Uncertainty Correlates With Similarity

Page 24:

But it’s not just randomness…

Page 25:

Which has more surprisal: coding regions or non-coding regions?

Uncertainty in complete genomes

[Figure panels: coding regions; non-coding regions.]

Page 26:

More extreme differences with 6-mers

[Figure panels: coding regions; non-coding regions.]

Page 27:

Can we predict proteins?

• Short sequences of 100 bp

• Translate into 30-35 amino acids

• Can we predict which are real and could be doing something?

• Test with bacterial proteins
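To make the 100 bp to 30-35 amino acid step concrete, here is a minimal sketch that translates a read in all six reading frames with the standard genetic code. Six-frame translation, the helper names, and the example read are assumptions for illustration; the talk does not say how the reads were translated.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate three frames on each strand; '*' marks a stop codon."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand, frame) for strand in strands for frame in range(3)]


read = "ATGGCC" * 16 + "ATGG"  # a made-up 100 bp read
for peptide in six_frame_translations(read):
    print(len(peptide), peptide)
```

A 100 bp read yields 32 or 33 codons depending on the frame, which is consistent with the 30-35 amino acids quoted above.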

Page 28:

Kullback-Leibler Divergence

Difference between two probability distributions

Difference between amino acid composition and average amino acid composition

Calculate KLD for 372 bacterial genomes
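The Kullback-Leibler divergence of a distribution P from a reference Q is sum_i p_i log(p_i / q_i). The sketch below applies it to amino acid composition, with P taken from one genome's proteins and Q from an average composition. The function names, the log base 2, the pseudocount, and the toy protein strings are all assumptions for illustration, not the procedure used for the 372 genomes.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(proteins, pseudocount=1.0):
    """Amino acid frequencies across a list of protein strings.

    The pseudocount (an assumption) keeps every frequency above zero,
    so the logarithm in kld() is always defined.
    """
    counts = Counter()
    for protein in proteins:
        counts.update(aa for aa in protein if aa in AMINO_ACIDS)
    total = sum(counts[aa] + pseudocount for aa in AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def kld(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


# Toy data: a lysine- and isoleucine-rich "genome" against a flat "average".
genome = composition(["MKKLLIAAKK", "MNNKKIIYYK"])
average = composition(["MAAGGLLVVSSTTPPEEDDRRKKQQNNHHFFYYWWCCII"])

print(kld(genome, average))  # positive: the compositions differ
print(kld(genome, genome))   # zero: a distribution does not diverge from itself
```

Note that the divergence is not symmetric: `kld(p, q)` and `kld(q, p)` generally differ, so the choice of which composition is the reference matters.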

Page 29:

KLD varies by bacteria

Colored by taxonomy of the bacteria

Page 30:

KLD varies by bacteria

Page 31:

Most divergent genomes

• Borrelia garinii – Spirochaetes

• Mycoplasma mycoides – Mollicutes

• Ureaplasma parvum – Mollicutes

• Buchnera aphidicola – Gammaproteobacteria

• Wigglesworthia glossinidia – Gammaproteobacteria

Page 32:

Divergence and metabolism

[Figure labels: Bifidobacterium; Bacillus; Nostoc; Salmonella; Chlamydophila; mean of all bacteria.]

Page 33:

Divergence and amino acids

[Figure labels: Ureaplasma; Wigglesworthia; Borrelia; Buchnera; Mycoplasma; bacteria mean; archaea mean; eukaryotic mean.]

Page 34:

Predicting KLD

y = 2x^2 - 2x + 0.5

[Figure: KLD per genome (y-axis) against percent G+C (x-axis).]
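The fitted curve can be rewritten as y = 2(x - 0.5)^2, so the predicted KLD is zero at balanced base composition and rises symmetrically on either side. The sketch below evaluates it, assuming x is G+C expressed as a fraction between 0 and 1 rather than a percentage (the axis label says percent, but the coefficients only give plausible values for a fraction). The function name is illustrative.

```python
def predicted_kld(gc_fraction):
    """Evaluate the fitted curve y = 2x^2 - 2x + 0.5 for G+C as a fraction."""
    return 2 * gc_fraction ** 2 - 2 * gc_fraction + 0.5


for gc in (0.25, 0.35, 0.50, 0.65, 0.75):
    print(f"G+C = {gc:.2f}  predicted KLD = {predicted_kld(gc):.3f}")
```

Under this reading, genomes with extreme G+C content in either direction are predicted to be the most divergent.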

Page 35:

Summary

• Shannon’s uncertainty could predict useful sequences

• KLD varies too much to be useful and is driven by %G+C content

Page 36:

New solutions for old problems?

Page 37:

Xen and the art of imagery

Page 38:

The cell phone problem

Page 39:

Searching the seed by SMS

[Diagram: a phone keypad sending the text message "seed search histidine coli". Other labels: AUTOSEEDSEARCHES, GMAIL.COM@, edwards.sdsu.edu, SEED databases. Reply shown: "22 proteins in E. coli". Location labels: Anywhere, Idaho, GMCS429, Argonne.]

Page 40:

Challenges

• Too much data

• Not easy to prioritize

• New models for HPC needed

• New interfaces to look at data

Page 41:

Acknowledgements

• Sajia Akhter

• Rob Schmieder

• Nick Celms

• Sheridan Wright

• Ramy Aziz

• FIG

• The mg-RAST team

• Rick Stevens

• Peter Salamon

• Barb Bailey

• Forest Rohwer

• Anca Segall