Hadoop World, NYC Hadoop for Bioinformatics Deepak Singh Amazon Web Services
Hw09 Hadoop For Bioinformatics

Jul 14, 2015

Cloudera, Inc.
Transcript
Page 1: Hw09   Hadoop For Bioinfomatics

Hadoop World, NYC

Hadoop for Bioinformatics

Deepak Singh

Amazon Web Services

Page 4: Hw09   Hadoop For Bioinfomatics

By ~Prescott under a CC-BY-NC license

Page 5: Hw09   Hadoop For Bioinfomatics

data sets

Page 6: Hw09   Hadoop For Bioinfomatics

many data sets

Page 7: Hw09   Hadoop For Bioinfomatics

PFAM

GENBANK ENSEMBL

PDB

Many Others

Page 8: Hw09   Hadoop For Bioinfomatics

manageable

Page 9: Hw09   Hadoop For Bioinfomatics

Image: Matt Wood

Page 10: Hw09   Hadoop For Bioinfomatics

Human genome

Image: Matt Wood

Page 13: Hw09   Hadoop For Bioinfomatics

Image: Matt Wood

Page 14: Hw09   Hadoop For Bioinfomatics

~100 TB/Week

Image: Matt Wood

Page 15: Hw09   Hadoop For Bioinfomatics

~100 TB/Week

>2 PB/Year

Image: Matt Wood

Page 17: Hw09   Hadoop For Bioinfomatics

years

Page 18: Hw09   Hadoop For Bioinfomatics

days

Page 19: Hw09   Hadoop For Bioinfomatics

hours

Page 20: Hw09   Hadoop For Bioinfomatics

gigabytes

Page 21: Hw09   Hadoop For Bioinfomatics

terabytes

Page 22: Hw09   Hadoop For Bioinfomatics

petabytes

Page 23: Hw09   Hadoop For Bioinfomatics

really fast

Page 25: Hw09   Hadoop For Bioinfomatics

typical informatics workflow

Page 31: Hw09   Hadoop For Bioinfomatics

Via Argonne National Labs under a CC-BY-SA license

Page 32: Hw09   Hadoop For Bioinfomatics

Via Argonne National Labs under a CC-BY-SA license

killer app

Page 36: Hw09   Hadoop For Bioinfomatics

Image: Chris Dagdigian

Page 38: Hw09   Hadoop For Bioinfomatics

rethink algorithms

Page 39: Hw09   Hadoop For Bioinfomatics

rethink computing

Page 40: Hw09   Hadoop For Bioinfomatics

rethink data management

Page 41: Hw09   Hadoop For Bioinfomatics

rethink data sharing

Page 42: Hw09   Hadoop For Bioinfomatics

operational mindset

Page 43: Hw09   Hadoop For Bioinfomatics

scalability

Page 44: Hw09   Hadoop For Bioinfomatics

we are data geeks not data center geeks

Page 45: Hw09   Hadoop For Bioinfomatics

two key trends

Page 49: Hw09   Hadoop For Bioinfomatics

develop applications

Page 50: Hw09   Hadoop For Bioinfomatics

distribute applications

Page 51: Hw09   Hadoop For Bioinfomatics

use applications

Page 52: Hw09   Hadoop For Bioinfomatics

some work

Page 53: Hw09   Hadoop For Bioinfomatics

some work

filters

^

Page 54: Hw09   Hadoop For Bioinfomatics

High Throughput Sequence Analysis

Mike Schatz, University of Maryland

Page 55: Hw09   Hadoop For Bioinfomatics

• Read Mapping

• Mapping & SNP Discovery

• De novo Genome Assembly

Page 56: Hw09   Hadoop For Bioinfomatics

Short Read Mapping

Page 57: Hw09   Hadoop For Bioinfomatics

Asian Individual Genome: 3.3 billion 35 bp reads, 104 GB (Wang et al., 2008)

African Individual Genome: 4.0 billion 35 bp reads, 144 GB (Bentley et al., 2008)

Page 58: Hw09   Hadoop For Bioinfomatics

Alignment > 10000 CPU hrs

Page 59: Hw09   Hadoop For Bioinfomatics

Seed & Extend

Good alignments must have significant exact alignment

Minimal exact alignment length = l/(k+1)
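
The bound on this slide follows from the pigeonhole principle: if a read of length l aligns with at most k differences, cutting it into k+1 non-overlapping pieces leaves at least one piece untouched. Below is a minimal Python sketch of that reasoning, limited to mismatches (no indels). The function names and the two sequences are invented for illustration and are not taken from any tool.

```python
def min_seed_length(read_len: int, k: int) -> int:
    """Length of the exact seed guaranteed by the pigeonhole bound l/(k+1)."""
    if read_len <= 0 or k < 0:
        raise ValueError("read_len must be positive and k non-negative")
    return read_len // (k + 1)


def exact_pieces(read: str, target: str, k: int) -> list:
    """Return indices of the k+1 pieces of `read` that match `target` exactly.

    Both strings must have equal length (mismatch-only alignment).
    """
    if len(read) != len(target):
        raise ValueError("sequences must have equal length")
    seed = min_seed_length(len(read), k)
    hits = []
    for i in range(k + 1):
        start = i * seed
        # The last piece absorbs any remainder when len(read) % (k+1) != 0.
        end = (i + 1) * seed if i < k else len(read)
        if read[start:end] == target[start:end]:
            hits.append(i)
    return hits


if __name__ == "__main__":
    read = "ACGTACGTACGT"    # hypothetical 12 bp read
    target = "ACGTTCGTACGA"  # same locus with 2 mismatches
    print(min_seed_length(len(read), 2))
    print(exact_pieces(read, target, 2))
```

With two mismatches spread over three pieces, at least one piece always survives intact, which is why an exact-match index on short seeds is enough to find every candidate location.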

Page 60: Hw09   Hadoop For Bioinfomatics

Expensive to scale

Page 62: Hw09   Hadoop For Bioinfomatics

Need parallelization framework

Page 63: Hw09   Hadoop For Bioinfomatics

CloudBurst

Catalog k-mers

Collect seeds

End-to-end alignment
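
The three phases on this slide correspond to the map, shuffle and reduce steps of a MapReduce job. The sketch below imitates that flow in memory, in plain Python, for mismatch-only alignment. It is not CloudBurst's code: the function names, the toy sequences and the choice of seed length are assumptions made for illustration.

```python
from collections import defaultdict


def map_kmers(name, seq, k, is_ref):
    """Map: catalog k-mers. Emit (k-mer, (is_ref, name, offset)) pairs.

    The reference emits every k-mer; a read emits only its non-overlapping
    k-mers, which is enough to guarantee a seed hit by the pigeonhole bound.
    """
    step = 1 if is_ref else k
    for off in range(0, len(seq) - k + 1, step):
        yield seq[off:off + k], (is_ref, name, off)


def shuffle(pairs):
    """Shuffle: collect seeds by grouping all records that share a k-mer."""
    groups = defaultdict(list)
    for kmer, rec in pairs:
        groups[kmer].append(rec)
    return groups


def reduce_extend(groups, ref, reads, max_mismatches):
    """Reduce: extend each shared seed into an end-to-end alignment."""
    hits = set()
    for recs in groups.values():
        ref_offs = [off for is_ref, _, off in recs if is_ref]
        read_recs = [(name, off) for is_ref, name, off in recs if not is_ref]
        for r_off in ref_offs:
            for name, q_off in read_recs:
                start = r_off - q_off  # where the whole read would begin
                read = reads[name]
                if start < 0 or start + len(read) > len(ref):
                    continue
                window = ref[start:start + len(read)]
                mism = sum(a != b for a, b in zip(read, window))
                if mism <= max_mismatches:
                    hits.add((name, start, mism))
    return sorted(hits)


def align(ref, reads, max_mismatches):
    """Run the three phases; returns sorted (read, start, mismatches) tuples."""
    k = min(len(r) for r in reads.values()) // (max_mismatches + 1)
    pairs = list(map_kmers("ref", ref, k, True))
    for name, seq in reads.items():
        pairs.extend(map_kmers(name, seq, k, False))
    return reduce_extend(shuffle(pairs), ref, reads, max_mismatches)


if __name__ == "__main__":
    ref = "TTGACCGTAAGCTAGGCT"                    # hypothetical reference
    reads = {"r1": "CCGTAAGC", "r2": "AAGGTAGG"}  # hypothetical reads
    print(align(ref, reads, 1))
```

In a real cluster job each phase would run across many machines and the shuffle would be done by the framework; the grouping by k-mer is what lets the expensive extension step run independently per seed.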

Page 64: Hw09   Hadoop For Bioinfomatics

http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369

Page 66: Hw09   Hadoop For Bioinfomatics

CloudBurst efficiently reports every k-difference alignment of every read

Page 67: Hw09   Hadoop For Bioinfomatics

many applications only need the best alignment

Page 68: Hw09   Hadoop For Bioinfomatics

Bowtie: Ultrafast short read aligner

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009; 10(3): R25.

Page 69: Hw09   Hadoop For Bioinfomatics

SOAPsnp: Consensus alignment and SNP calling

Page 70: Hw09   Hadoop For Bioinfomatics

Crossbow: Rapid whole genome SNP analysis

Ben Langmead

Page 72: Hw09   Hadoop For Bioinfomatics

Preprocessed reads

Page 73: Hw09   Hadoop For Bioinfomatics

Map: Bowtie

Page 74: Hw09   Hadoop For Bioinfomatics

Sort: Bin and partition

Page 75: Hw09   Hadoop For Bioinfomatics

Reduce: SOAPsnp
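
The stages above follow the usual map, sort, reduce shape. Below is a toy, in-memory Python sketch of that shape only. `toy_align` and `toy_call_snps` are hypothetical stand-ins for the aligner and the SNP caller (brute-force ungapped matching and a simple majority vote), and the bin size and sequences are invented for the example.

```python
from collections import Counter, defaultdict

BIN_SIZE = 8  # hypothetical partition width in reference bases


def toy_align(read, ref, max_mismatches=1):
    """Stand-in for the aligner: best ungapped placement, or None."""
    best = None
    for start in range(len(ref) - len(read) + 1):
        mism = sum(a != b for a, b in zip(read, ref[start:start + len(read)]))
        if mism <= max_mismatches and (best is None or mism < best[1]):
            best = (start, mism)
    return None if best is None else best[0]


def map_phase(reads, ref):
    """Map: align each read, emit one (bin, position, base) record per base."""
    for read in reads:
        start = toy_align(read, ref)
        if start is None:
            continue
        for i, base in enumerate(read):
            pos = start + i
            yield pos // BIN_SIZE, pos, base


def sort_phase(records):
    """Sort: bin records by partition and order them by position."""
    bins = defaultdict(list)
    for bin_id, pos, base in records:
        bins[bin_id].append((pos, base))
    return {b: sorted(v) for b, v in sorted(bins.items())}


def toy_call_snps(pileup, ref, min_depth=2):
    """Stand-in for the SNP caller: majority base that differs from ref."""
    by_pos = defaultdict(Counter)
    for pos, base in pileup:
        by_pos[pos][base] += 1
    snps = []
    for pos in sorted(by_pos):
        base, count = by_pos[pos].most_common(1)[0]
        if count >= min_depth and base != ref[pos]:
            snps.append((pos, ref[pos], base))
    return snps


def reduce_phase(bins, ref):
    """Reduce: call SNPs independently within each partition."""
    snps = []
    for bin_id in bins:
        snps.extend(toy_call_snps(bins[bin_id], ref))
    return snps


if __name__ == "__main__":
    ref = "ACGTACGGTCAATGCC"                       # hypothetical reference
    reads = ["CGGTGAAT", "GGTGAATG", "ACGTACGG"]   # hypothetical reads
    print(reduce_phase(sort_phase(map_phase(reads, ref)), ref))
```

Because each partition is reduced on its own, the SNP-calling step parallelizes across genomic bins in the same way the alignment step parallelizes across reads.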

Page 76: Hw09   Hadoop For Bioinfomatics

Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster.

Page 77: Hw09   Hadoop For Bioinfomatics

Comparing Genomes

Page 78: Hw09   Hadoop For Bioinfomatics

Estimating relative evolutionary rates from sequence comparisons:
Identification of probable orthologs

[Figure: species tree and gene tree relating genes A, B, C, D and E in S. cerevisiae and C. elegans]

Admissible comparisons: A or B vs. D; C vs. E

Inadmissible comparisons: A or B vs. E; C vs. D

Page 79: Hw09   Hadoop For Bioinfomatics

Estimating relative evolutionary rates from sequence comparisons:

[Figure: species tree and gene tree relating genes A, B, C, D and E in S. cerevisiae and C. elegans]

1. Orthologs found using the reciprocal smallest distance algorithm

2. Build alignment between two orthologs

>Sequence C
MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…

>Sequence E
MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…

3. Estimate distance given a substitution matrix

[Matrix: pairwise substitution rates between amino acids (Phe, Ala, Pro, Leu, Thr), each entry of the form µπ]
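
Step 3 estimates a distance under a substitution matrix, which requires a fitted evolutionary model. As a much simpler illustration of turning an alignment into a distance, the Python sketch below computes the fraction of differing sites (p-distance) and its Poisson correction d = -ln(1 - p). This is a swapped-in textbook approximation, not the estimator used on the slide, and the two short aligned fragments are invented.

```python
import math


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Fraction of aligned, gap-free columns where the residues differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = diffs = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return diffs / compared


def poisson_distance(a: str, b: str) -> float:
    """Poisson-corrected distance d = -ln(1 - p), in substitutions per site."""
    p = p_distance(a, b)
    if p >= 1.0:
        return math.inf  # saturated: every compared site differs
    return -math.log(1.0 - p)


if __name__ == "__main__":
    seq_c = "MSGRT-LAST"  # hypothetical aligned fragment
    seq_e = "MSGKTILASK"  # hypothetical aligned fragment
    print(round(p_distance(seq_c, seq_e), 3))
    print(round(poisson_distance(seq_c, seq_e), 3))
```

The correction matters because repeated substitutions at the same site hide each other, so the raw fraction of differences underestimates the true divergence as sequences drift apart.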

Page 80: Hw09   Hadoop For Bioinfomatics

RSD algorithm summary

[Figure: sequences a, b and c in Genome I and Genome J are aligned pairwise and distances are calculated in both directions (D = 0.2, 0.3, 0.1, 1.2, 0.1, 0.9)]

Orthologs: Ib - Jc, D = 0.1
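
The reciprocal step in the RSD summary can be sketched as follows. This is a simplified Python illustration, not the Roundup implementation: distances come from a hand-written lookup table instead of alignment, and the assignment of the figure's D values to specific sequence pairs is an assumption.

```python
def closest(query, candidates, dist):
    """Return (candidate, distance) with the smallest distance to `query`."""
    best = min(candidates, key=lambda c: dist(query, c))
    return best, dist(query, best)


def rsd_ortholog(query, genome_i, genome_j, dist):
    """Reciprocal smallest distance check for one gene of genome I.

    Forward: find the gene in genome J closest to `query`.
    Reverse: find the gene in genome I closest to that hit.
    The pair is reported only if the reverse search returns `query`.
    """
    hit, d_forward = closest(query, genome_j, dist)
    back, _ = closest(hit, genome_i, dist)
    return (query, hit, d_forward) if back == query else None


# Hypothetical distances, keyed by (gene in I, gene in J).
TABLE = {
    ("Ia", "Jc"): 1.2, ("Ib", "Ja"): 0.2, ("Ib", "Jb"): 0.3,
    ("Ib", "Jc"): 0.1, ("Ic", "Jc"): 0.9,
}


def table_dist(x, y):
    """Symmetric lookup; pairs missing from the table count as very distant."""
    return TABLE.get((x, y), TABLE.get((y, x), float("inf")))


if __name__ == "__main__":
    genome_i = ["Ia", "Ib", "Ic"]
    genome_j = ["Ja", "Jb", "Jc"]
    print(rsd_ortholog("Ib", genome_i, genome_j, table_dist))
    print(rsd_ortholog("Ia", genome_i, genome_j, table_dist))
```

Under these assumed distances, Ib and Jc pick each other in both directions and are reported as orthologs, while Ia is rejected because its closest hit, Jc, prefers Ib.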

Page 81: Hw09   Hadoop For Bioinfomatics

Prof. Dennis Wall

Harvard Medical School

Page 82: Hw09   Hadoop For Bioinfomatics

Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here.

Good luck, researchers!

Page 83: Hw09   Hadoop For Bioinfomatics

massive computational demand

Page 84: Hw09   Hadoop For Bioinfomatics

1000 genomes = 5,994,000 processes = 23,976,000 hours

Page 85: Hw09   Hadoop For Bioinfomatics

2737 years
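
The figures on these two slides are mutually consistent, as the short Python check below shows. Only the three headline numbers come from the slides; the 12 runs per genome pair and the 4 hours per run are inferred by division and should be read as assumptions.

```python
from math import comb

GENOMES = 1000
PROCESSES = 5_994_000      # from the slide
TOTAL_HOURS = 23_976_000   # from the slide

pairs = comb(GENOMES, 2)                   # all-against-all genome pairs
runs_per_pair = PROCESSES / pairs          # inferred, not stated on the slide
hours_per_run = TOTAL_HOURS / PROCESSES    # inferred, not stated on the slide
years = TOTAL_HOURS / (24 * 365)           # serial time on a single CPU

print(pairs, runs_per_pair, hours_per_run, round(years))
```

The pair count grows quadratically with the number of genomes, which is why adding genomes to the database quickly outruns a single machine.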

Page 86: Hw09   Hadoop For Bioinfomatics

periodic task

Page 87: Hw09   Hadoop For Bioinfomatics

must scale up

Page 88: Hw09   Hadoop For Bioinfomatics

not scalability gurus

Page 89: Hw09   Hadoop For Bioinfomatics

hadoop streaming
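
Hadoop Streaming lets any executable act as a mapper or reducer, as long as it reads records from standard input and writes tab-separated key/value lines to standard output, so existing analysis code does not have to be rewritten in Java. Below is a minimal Python mapper sketch under assumed conventions: each input line names one genome pair, and `compare_genomes` is a hypothetical placeholder for the real comparison.

```python
import io
import sys


def compare_genomes(genome_a: str, genome_b: str) -> str:
    """Hypothetical placeholder for the real, expensive pairwise analysis."""
    return f"compared:{genome_a}:{genome_b}"


def run_mapper(lines, out=sys.stdout):
    """Read 'genomeA<TAB>genomeB' lines and emit 'pair<TAB>result' lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed records
        genome_a, genome_b = fields
        key = f"{genome_a}|{genome_b}"
        out.write(f"{key}\t{compare_genomes(genome_a, genome_b)}\n")


# Local demonstration with an in-memory buffer instead of a cluster.
demo = io.StringIO()
run_mapper(["yeast\tworm\n", "malformed line\n"], out=demo)
print(demo.getvalue(), end="")
```

When used as an actual streaming mapper, the script would call `run_mapper(sys.stdin)` instead of the in-memory demonstration; keeping the logic in a function makes it easy to test without a cluster.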

Page 91: Hw09   Hadoop For Bioinfomatics

compared 50+ genomes

Page 92: Hw09   Hadoop For Bioinfomatics

what’s next?

Page 93: Hw09   Hadoop For Bioinfomatics

de novo assembly

Page 94: Hw09   Hadoop For Bioinfomatics

machine learning and statistics

Page 95: Hw09   Hadoop For Bioinfomatics

protein structure prediction

Page 96: Hw09   Hadoop For Bioinfomatics

docking

Page 97: Hw09   Hadoop For Bioinfomatics

trajectory analysis

Page 98: Hw09   Hadoop For Bioinfomatics

key driving factors?

Page 99: Hw09   Hadoop For Bioinfomatics

the ecosystem

Page 100: Hw09   Hadoop For Bioinfomatics

Pig

Page 101: Hw09   Hadoop For Bioinfomatics

Cascading

Page 102: Hw09   Hadoop For Bioinfomatics

Hive

Page 103: Hw09   Hadoop For Bioinfomatics

RHIPE

Page 104: Hw09   Hadoop For Bioinfomatics

domain specific libraries and tools

Page 107: Hw09   Hadoop For Bioinfomatics

http://aws.amazon.com/education/

Page 109: Hw09   Hadoop For Bioinfomatics

[email protected]; Twitter: @mndoci

Presentation ideas from @mza, @simon and @lessig

Thank you!