Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland Wednesday, June 10, 2009 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Dec 19, 2015

Page 1:

Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science

Jimmy Lin, Michael Schatz, and Ben Langmead
University of Maryland

Wednesday, June 10, 2009

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Page 2:

Cloud Computing @ Maryland

Teaching
Cloud computing course (version 1.0): Spring 2008. Part of the Google/IBM Academic Cloud Computing Initiative
Cloud computing course (version 2.0): Fall 2008. Sponsored by Amazon Web Services through a teaching grant

Research
Web-scale text processing
Statistical machine translation
Bioinformatics

Page 3:

Learning Translation Models

[Figure: word-level alignment between the Spanish words (Maria, no dio, una, bofetada, bruja, verde) and the English words (Mary, did not, a, slap, witch, green), yielding the phrase pair "green witch" / "bruja verde"]

Prodi ha erigido hoy un verdadero muro contra esas acciones, espero que el Sr. Moscovici lo haya comprendido bien, y realmente también espero que esta tendencia se rompa en los Consejos de Biarritz y de Niza, y se rectifique.

Mr Prodi has put an emphatic stop to this kind of action, which has hopefully resonated with Mr Moscovici, and I truly hope that this trend can be broken and reversed at the Councils in Nice and Biarritz.

Esas negociaciones sabemos que son muy difíciles y hacen temer un fracaso o un acuerdo de mínimos en Niza, lo que sería aún más grave y usted ya lo ha dicho, señor Ministro.

These are, as we know, very tricky negotiations and raise fears of a setback or a watered-down agreement in Nice which, as you have already acknowledged, Mr Moscovici, would be even more serious.

We built systems for “learning” translation models in Hadoop… sort of like the word count example, but with more math

Page 4:

Translation as a “Tiling” Problem

Maria no dio una bofetada a la bruja verde

[Figure: candidate English phrases tiled over spans of the Spanish sentence, such as "did not", "did not give", "slap", "a slap", "to the", "green witch", and "the witch"]

Example from Koehn (2006)

Page 5:

From Text to DNA Sequences

Text processing: [0-9A-Za-z]+

DNA sequence processing: [ATCG]+

Easier, right?

(Nope, not really)

Michael Schatz (Ph.D. student, Computer Science; Spring 2008)
Ben Langmead (M.S. student, Computer Science; Fall 2008)

Page 6:

Analogy (And two disclaimers)

Page 7:

Strangely-Formatted Manuscript

Dickens: A Tale of Two Cities

Text written on a long spool

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

Page 8:

… With Duplicates

Dickens: A Tale of Two Cities

“Backup” on four more copies

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

Page 9:

Shredded Book Reconstruction

Dickens accidentally shreds the manuscript

How can he reconstruct the text?
5 copies x 138,656 words / 5 words per fragment = 138k fragments
The short fragments from every copy are mixed together
Some fragments are identical

[Figure: the five copies of the opening sentence, each cut into five-word fragments at a different offset, for example "It was the best of", "of times, it was the", "times, it was the worst", "age of wisdom, it was", "the age of foolishness, …"; fragments that are identical across copies, such as "of times, it was the" and "it was the age of", are highlighted]
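The fragment count and the "some fragments are identical" observation can be checked with a small sketch. It assumes, purely for illustration, that copy number i is cut every five words starting after i words; the function names `expected_fragments` and `shred` are hypothetical and not from the talk.

```python
from collections import Counter

WORDS_IN_BOOK = 138_656      # word count quoted on the slide
COPIES = 5
WORDS_PER_FRAGMENT = 5

OPENING = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness,")


def expected_fragments(copies=COPIES, words=WORDS_IN_BOOK, size=WORDS_PER_FRAGMENT):
    """Slide arithmetic: copies x words / words per fragment."""
    return copies * words // size


def shred(text, size=WORDS_PER_FRAGMENT, offset=0):
    """Cut text into fragments of `size` words; the first cut falls after `offset` words."""
    words = text.split()
    pieces = [words[:offset]] if offset else []
    pieces += [words[i:i + size] for i in range(offset, len(words), size)]
    return [" ".join(piece) for piece in pieces]


# Assumption: each copy is shredded at a different, regular offset.
pile = [frag for offset in range(COPIES) for frag in shred(OPENING, offset=offset)]
duplicates = sorted(frag for frag, count in Counter(pile).items() if count > 1)

print(expected_fragments())   # about 138k fragments for the whole book
print(duplicates)             # fragments that occur in more than one copy
```

A real shredder would not cut at regular offsets, so this only illustrates why mixing copies produces both overlapping and identical fragments.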

Page 10:

Overlaps

Generally prefer longer overlaps to shorter overlaps

In the presence of error, we might allow the overlapping fragments to differ by a small amount

[Figure: fragment pile]
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
it was the worst of
was the worst of times,
of times, it was the
times, it was the age
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
it was the age of
was the age of foolishness,
the worst of times, it

It was the best of / was the best of times, (4 word overlap)
It was the best of / of times, it was the (1 word overlap)
It was the best of / of wisdom, it was the (1 word overlap)
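The overlap counts on this slide can be reproduced with a short suffix-prefix comparison. This is a minimal sketch, not the authors' code; the function name `overlap` and the `max_mismatches` parameter (standing in for "differ by a small amount") are assumptions made for illustration.

```python
def overlap(left, right, max_mismatches=0):
    """Length in words of the longest suffix of `left` that matches a prefix
    of `right`, tolerating up to `max_mismatches` differing words."""
    a, b = left.split(), right.split()
    for n in range(min(len(a), len(b)), 0, -1):
        # Compare the last n words of `left` with the first n words of `right`.
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches <= max_mismatches:
            return n
    return 0


print(overlap("It was the best of", "was the best of times,"))
print(overlap("It was the best of", "of times, it was the"))
print(overlap("It was the best of", "of wisdom, it was the"))
```

Trying the longest candidate first is what makes the function prefer longer overlaps to shorter ones, as the slide recommends.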

Page 11:

Greedy Assembly

The repeated sequence makes the correct reconstruction ambiguous

[Figure: fragments in the greedy assembly]
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
of times, it was the
times, it was the age

[Figure: fragment pile]
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
it was the worst of
was the worst of times,
of times, it was the
times, it was the age
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
it was the age of
was the age of foolishness,
the worst of times, it
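The ambiguity can be made concrete by asking which fragments a greedy assembler would pick next. The sketch below assumes a "take the longest overlap" rule and uses hypothetical helper names (`overlap`, `best_successors`); it is an illustration, not the method used in the talk.

```python
FRAGMENTS = [
    "It was the best of", "of times, it was the", "best of times, it was",
    "times, it was the worst", "was the best of times,", "the best of times, it",
    "it was the worst of", "was the worst of times,", "times, it was the age",
    "it was the age of", "was the age of wisdom,", "the age of wisdom, it",
    "age of wisdom, it was", "of wisdom, it was the",
    "was the age of foolishness,", "the worst of times, it",
]


def overlap(left, right):
    """Words in the longest suffix of `left` that is also a prefix of `right`."""
    a, b = left.split(), right.split()
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def best_successors(fragment, pile):
    """Return the best overlap length and every fragment that achieves it."""
    scores = {other: overlap(fragment, other) for other in pile if other != fragment}
    best = max(scores.values())
    return best, sorted(f for f, score in scores.items() if score == best)


# One clear winner: greedy can extend without hesitation.
print(best_successors("It was the best of", FRAGMENTS))
# Two equally good winners: the repeat makes the next step a coin toss.
print(best_successors("of times, it was the", FRAGMENTS))
```

When two candidates tie at the maximum overlap, a greedy assembler has no local information to choose between "worst" and "age", which is the ambiguity the slide points out.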

Page 12:

The Real Problem (The easier version)

Page 13:

[Figure: an unknown subject genome (?) is fed to a sequencer, which outputs short reads such as GATGCTTACTATGCGGGCCCC, CGGTCTAATGCTTACTATGC, AATGCTTACTATGCGGGCCCCTT, CGGTCTAGATGCTTACTATGC, CGGTCTAATGCTTAGCTATGC, and ATGCTTACTATGCGGGCCCCTT]

Page 14:

DNA Sequencing

ATCTGATAAGTCCCAGGACTTCAGT

GCAAGGCAAACCCGAGCCCAGTTT

TCCAGTTCTAGAGTTTCACATGATC

GGAGTTAGTAAAAGTCCACATTGAG

Genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: A, T, C, G

Bacteria: ~5 million bp
Humans: ~3 billion bp

Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300bp)

Shorter reads, but much higher throughput
Per-base error rate estimated at 1-2% (Simpson, et al., 2009)

Recent studies of entire human genomes have used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et al., 2008) billion 36bp reads

~144 GB of compressed sequence data
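A quick back-of-the-envelope calculation puts these figures side by side. The read counts, read length, genome size, and throughput come from the slide; the function names and the single-machine assumption in `sequencing_days` are illustrative only.

```python
READ_LENGTH_BP = 36
HUMAN_GENOME_BP = 3_000_000_000          # "~3 billion bp" from the slide
READS_PER_STUDY = {
    "Wang et al., 2008": 3_300_000_000,
    "Bentley et al., 2008": 4_000_000_000,
}


def total_bases(num_reads, read_length=READ_LENGTH_BP):
    """Total sequenced bases across all reads."""
    return num_reads * read_length


def coverage(num_reads, genome_size=HUMAN_GENOME_BP, read_length=READ_LENGTH_BP):
    """Average number of times each genome position is covered."""
    return total_bases(num_reads, read_length) / genome_size


def sequencing_days(num_reads, gbp_per_day):
    """Days needed on ONE machine at the given throughput (an assumption)."""
    return total_bases(num_reads) / (gbp_per_day * 1_000_000_000)


for study, reads in READS_PER_STUDY.items():
    print(f"{study}: {total_bases(reads) / 1e9:.1f} Gbp, "
          f"about {coverage(reads):.1f}x coverage, "
          f"{sequencing_days(reads, 2):.0f} to {sequencing_days(reads, 1):.0f} machine-days")
```

The 4.0 billion read study works out to 144 billion bases, which lines up numerically with the ~144 GB figure, although the slide does not say how that number was derived.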

Page 15:

How do we put Humpty Dumpty back together?

Page 16:

Human Genome

11 years, cost $3 billion… your tax dollars at work!

A complete human DNA sequence was published in 2003, marking the end of the Human Genome Project

Page 17:

Alignment

Reference sequence
CGGTCTAGATGCTTAGCTATGCGGGCCCCTT

Subject reads
GCTTA T CTAT
TTA T CTATGC
A T CTATGCGG
A T CTATGCGG
GCTTA T CTAT
TCTAGATGCT
CTATGCGGGC
CTAGATGCTT
A T CTATGCGG
CTATGCGGGC
A T CTATGCGG

Page 18:

Reference sequence
CGGTCTAGATGCTTATCTATGCGGGCCCCTT

[Figure: the subject reads laid out beneath the reference sequence at the positions where they align]

Page 19:

Reference: ATGAACCACGAACACTTTTTTGGCAACGATTTAT…
Query: ATGAACAAAGAACACTTTTTTGGCCACGATTTAT…

Insertion
Deletion
Mutation

Page 20:

CloudBurst

1. Map: Catalog K-mers
• Emit every k-mer in the genome and non-overlapping k-mers in the reads
• Non-overlapping k-mers sufficient to guarantee an alignment will be found

2. Shuffle: Coalesce Seeds
• Hadoop internal shuffle groups together k-mers shared by the reads and the reference
• Conceptually build a hash table of k-mers and their occurrences

3. Reduce: End-to-end alignment
• Locally extend alignment beyond seeds by computing “match distance”
• If read aligns end-to-end, record the alignment

[Figure: Human chromosome 1, Read 1, and Read 2 flow through the map, shuffle, and reduce stages, producing:]

Read 1, Chromosome 1, 12345-12365

Read 2, Chromosome 1, 12350-12370
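The three stages above can be imitated in memory with plain Python. This is a toy sketch under stated assumptions, not CloudBurst itself: it counts mismatches only (no insertions or deletions), picks the seed length as read length divided by (mismatches + 1) so that at least one seed must match exactly, and reports 0-based, end-exclusive coordinates. All function names are hypothetical.

```python
from collections import defaultdict


def map_reference(name, seq, k):
    """Map: emit every k-mer of the reference with its offset."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], ("ref", name, i)


def map_read(name, seq, k):
    """Map: emit only the non-overlapping k-mers of a read."""
    for i in range(0, len(seq) - k + 1, k):
        yield seq[i:i + k], ("read", name, i)


def shuffle(pairs):
    """Shuffle: group values by k-mer, like a hash table of occurrences."""
    groups = defaultdict(list)
    for kmer, value in pairs:
        groups[kmer].append(value)
    return groups


def reduce_seed(values, reference, reads, max_mismatches):
    """Reduce: extend each shared seed to an end-to-end alignment."""
    ref_hits = [v for v in values if v[0] == "ref"]
    read_hits = [v for v in values if v[0] == "read"]
    for _, ref_name, ref_pos in ref_hits:
        for _, read_name, read_pos in read_hits:
            read, ref = reads[read_name], reference[ref_name]
            start = ref_pos - read_pos
            if start < 0 or start + len(read) > len(ref):
                continue  # read would hang off the end of the reference
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                yield read_name, ref_name, start, start + len(read), mismatches


def cloudburst_sketch(reference, reads, max_mismatches):
    # Pigeonhole: with at most m mismatches, one of m + 1 disjoint seeds is exact.
    k = min(len(r) for r in reads.values()) // (max_mismatches + 1)
    pairs = []
    for name, seq in reference.items():
        pairs.extend(map_reference(name, seq, k))
    for name, seq in reads.items():
        pairs.extend(map_read(name, seq, k))
    alignments = set()  # the same alignment can be found from several seeds
    for values in shuffle(pairs).values():
        alignments.update(reduce_seed(values, reference, reads, max_mismatches))
    return sorted(alignments)


reference = {"ref": "CGGTCTAGATGCTTAGCTATGCGGGCCCCTT"}
reads = {"read1": "TCTAGATGCT", "read2": "GCTTATCTAT"}
print(cloudburst_sketch(reference, reads, max_mismatches=1))
```

In Hadoop the grouping is done by the framework's shuffle and each reducer sees one k-mer group at a time; here a dictionary plays that role.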

Page 21:

[Figure: two plots, "Running Time vs Number of Reads on Chr 1" and "Running Time vs Number of Reads on Chr 22"; x-axis: Millions of Reads; y-axis: Runtime (s); series labeled 0, 1, 2, 3, 4]

Results from a small, 24-core cluster, with different number of mismatches

Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.

Page 22:

[Figure: "Running Time on EC2 High-CPU Medium Instance Cluster"; x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s)]

CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on EC2

Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.

Page 23:

What’s Next? (Michael Schatz’s Ph.D. dissertation)

Page 24:

Wait, no reference?

Page 25:

de Bruijn Graph Construction

Dk = (V,E)
V = All length-k subfragments (k < l)
E = Directed edges between consecutive subfragments
Nodes overlap by k-1 words

Locally constructed graph reveals the global sequence structure
Overlaps implicitly computed

Original Fragment: It was the best of
Directed Edge: It was the best → was the best of

de Bruijn, 1946
Idury and Waterman, 1995
Pevzner, Tang, Waterman, 2001
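The construction can be sketched in a few lines using words in place of nucleotides, as the slide does. The function name `build_de_bruijn` and the choice of k = 4 words are assumptions for illustration.

```python
def build_de_bruijn(fragments, k):
    """Return {node: set of successor nodes}, where each node is a run of k
    consecutive words and edges join consecutive runs within a fragment."""
    graph = {}
    for fragment in fragments:
        words = fragment.split()
        kmers = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
        for node in kmers:
            graph.setdefault(node, set())   # make sure every node is present
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)          # the two nodes share k-1 words
    return graph


# The slide's example: one five-word fragment yields two nodes and one edge.
print(build_de_bruijn(["It was the best of"], k=4))

# Fragments that share a node are linked without any pairwise comparison.
graph = build_de_bruijn(["times, it was the worst", "times, it was the age"], k=4)
print(graph["times, it was the"])
```

No fragment is ever compared with another fragment: two fragments become connected simply because they produce the same node, which is what "overlaps implicitly computed" refers to.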

Page 26:

de Bruijn Graph Assembly

[Figure: de Bruijn graph with the following nodes]
the age of foolishness
It was the best
best of times, it
was the best of
the best of times,
of times, it was
times, it was the
it was the worst
was the worst of
worst of times, it
the worst of times,
it was the age
was the age of
the age of wisdom,
age of wisdom, it
of wisdom, it was
wisdom, it was the

Page 27:

Compressed de Bruijn Graph

Unambiguous non-branching paths replaced by single nodes

An Eulerian traversal of the graph spells a compatible reconstruction of the original text
There may be many traversals of the graph

Different sequences can have the same string graph
It was the best of times, it was the worst of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

[Figure: compressed graph with the following nodes]
It was the best of times, it
of times, it was the
it was the worst of times, it
it was the age of
the age of wisdom, it was the
the age of foolishness
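The compression step can be sketched as follows, again with words standing in for nucleotides and k = 4. The function names are hypothetical, and the sketch assumes the graph has no isolated cycle made up only of non-branching nodes (such a cycle would be skipped).

```python
TEXT = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness,")


def build_de_bruijn(text, k):
    """Nodes are runs of k consecutive words; edges join consecutive runs."""
    words = text.split()
    kmers = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    graph = {node: set() for node in kmers}
    for left, right in zip(kmers, kmers[1:]):
        graph[left].add(right)
    return graph


def compress(graph):
    """Merge every unambiguous non-branching path into a single node label."""
    preds = {node: set() for node in graph}
    for node, succs in graph.items():
        for succ in succs:
            preds[succ].add(node)

    def continues_a_path(node):
        # True when the node has exactly one predecessor and that
        # predecessor has no other way to go.
        if len(preds[node]) != 1:
            return False
        (pred,) = preds[node]
        return len(graph[pred]) == 1

    merged = []
    for start in graph:
        if continues_a_path(start):
            continue                          # absorbed into an earlier path
        label, node = start.split(), start
        while len(graph[node]) == 1:
            (nxt,) = graph[node]
            if not continues_a_path(nxt):
                break                         # nxt can be reached another way
            label.append(nxt.split()[-1])     # neighbors share k-1 words
            node = nxt
        merged.append(" ".join(label))
    return merged


for node in compress(build_de_bruijn(TEXT, k=4)):
    print(node)
```

On the opening sentence this collapses the 17 four-word nodes into six, matching the node labels listed for the compressed graph on this slide.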

Page 28:

Hadoopification… (Stay tuned!)

Page 29:

Cloud worthy?

Page 30:

How much data?

Page 31:

Bottom Line: Bioinformatics

Great use case of Hadoop

Interesting computer science problems

Help unravel life’s mysteries?

Page 32:

Questions? Comments?

Thanks to the organizations who support our work: