Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science
Jimmy Lin, Michael Schatz, and Ben Langmead
University of Maryland
Wednesday, June 10, 2009
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Cloud Computing @ Maryland
Teaching
Cloud computing course (version 1.0): Spring 2008. Part of the Google/IBM Academic Cloud Computing Initiative
Cloud computing course (version 2.0): Fall 2008. Sponsored by Amazon Web Services through a teaching grant
Research
Web-scale text processing
Statistical machine translation
Bioinformatics
[Figure: word alignment between Spanish and English. Maria ↔ Mary; no dio ↔ did not; una ↔ a; bofetada ↔ slap; bruja ↔ witch; verde ↔ green. Phrase pair: bruja verde ↔ green witch]
Learning Translation Models
Prodi ha erigido hoy un verdadero muro contra esas acciones, espero que el Sr. Moscovici lo haya comprendido bien, y realmente también espero que esta tendencia se rompa en los Consejos de Biarritz y de Niza, y se rectifique.
Mr Prodi has put an emphatic stop to this kind of action, which has hopefully resonated with Mr Moscovici, and I truly hope that this trend can be broken and reversed at the Councils in Nice and Biarritz.
Esas negociaciones sabemos que son muy difíciles y hacen temer un fracaso o un acuerdo de mínimos en Niza, lo que sería aún más grave y usted ya lo ha dicho, señor Ministro.
These are, as we know, very tricky negotiations and raise fears of a setback or a watered-down agreement in Nice which, as you have already acknowledged, Mr Moscovici, would be even more serious.
We built systems for “learning” translation models in Hadoop… sort of like the word count example, but with more math
Maria no dio una bofetada a la bruja verde
[Figure: candidate phrase translations laid out under the source sentence: Mary | not | did not | no | did not give | give a slap to the witch green | slap | a slap | to the | to | the | green witch | the witch | by]
Example from Koehn (2006)
Translation as a “Tiling” Problem
From Text to DNA Sequences
Text processing: [0-9A-Za-z]+
DNA sequence processing: [ATCG]+
Easier, right?
(Nope, not really)
Michael Schatz (Ph.D. student, Computer Science; Spring 2008)
Ben Langmead (M.S. student, Computer Science; Fall 2008)
Analogy (And two disclaimers)
Strangely-Formatted Manuscript
Dickens: A Tale of Two Cities
Text written on a long spool
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
… With Duplicates
Dickens: A Tale of Two Cities
“Backup” on four more copies
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Shredded Book Reconstruction
Dickens accidentally shreds the manuscript
How can he reconstruct the text?
5 copies x 138,656 words / 5 words per fragment = 138k fragments
The short fragments from every copy are mixed together
Some fragments are identical
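The fragment count above can be checked with a few lines of Python. This is only an illustrative sketch: the `shred` helper and its `offset` parameter are hypothetical names, and cutting each copy at a different offset is an assumption about how the copies end up misaligned.

```python
def shred(text, words_per_fragment=5, offset=0):
    """Cut text into fragments of words_per_fragment words.

    offset is the length of the first (short) fragment, so that
    different copies can be cut at different positions.
    """
    words = text.split()
    pieces = [words[:offset]] if offset else []
    pieces += [words[i:i + words_per_fragment]
               for i in range(offset, len(words), words_per_fragment)]
    return [" ".join(p) for p in pieces if p]


# Arithmetic from the slide: 5 copies x 138,656 words / 5 words per fragment.
copies, words_in_book, words_per_fragment = 5, 138_656, 5
total_fragments = copies * words_in_book // words_per_fragment

sample = "It was the best of times, it was the worst of times,"
copy_a = shred(sample)            # cut from the first word
copy_b = shred(sample, offset=2)  # same text, cut two words later
```

With five copies and five words per fragment the two factors cancel, so the total equals the word count of a single copy, which is where the "138k" comes from.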
[Figure: the five copies of the opening sentence, each cut into five-word fragments at different positions. Some fragments, such as "of times, it was the" and "it was the age of", appear more than once.]
Overlaps
Generally prefer longer overlaps to shorter overlaps
In the presence of error, we might allow the overlapping fragments to differ by a small amount
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
it was the worst of
was the worst of times,
of times, it was the
times, it was the age
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
it was the age of
was the age of foolishness,
the worst of times, it
It was the best of
was the best of times,
4 word overlap

It was the best of
of times, it was the
1 word overlap

It was the best of
of wisdom, it was the
1 word overlap
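The overlap counts in these three examples can be reproduced with a short function. This is a minimal sketch, assuming an overlap is the longest run of words that ends one fragment and begins the next, and that words must match exactly. The name `word_overlap` is hypothetical.

```python
def word_overlap(left, right):
    """Return the number of words in the longest suffix of `left`
    that is also a prefix of `right` (0 if there is none)."""
    a, b = left.split(), right.split()
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


first = "It was the best of"
four = word_overlap(first, "was the best of times,")
one_times = word_overlap(first, "of times, it was the")
one_wisdom = word_overlap(first, "of wisdom, it was the")
```

Tolerating errors, as suggested earlier, would mean replacing the exact comparison `a[-k:] == b[:k]` with a check that allows a small number of differing words.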
Greedy Assembly
The repeated sequence makes the correct reconstruction ambiguous
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
of times, it was the
times, it was the age
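Greedy assembly can be sketched as repeatedly merging the two fragments with the largest overlap. The code below is a toy illustration under stated assumptions, not the method of any particular assembler: the function names are hypothetical, exact duplicate fragments are dropped up front, ties are broken by input order, and words must match exactly.

```python
def overlap(left, right):
    """Words in the longest suffix of `left` that is a prefix of `right`."""
    a, b = left.split(), right.split()
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(fragments):
    """Merge the best-overlapping pair until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # strict '>' keeps the first pair on ties
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = " ".join(frags[best_i].split() + frags[best_j].split()[best_k:])
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


assembled = greedy_assemble([
    "It was the best of",
    "was the best of times,",
    "the best of times, it",
    "best of times, it was",
])
```

On these four fragments every merge is unambiguous. Once a repeated fragment such as "of times, it was the" is included, it can be extended equally well by "times, it was the worst" or "times, it was the age", so the result depends on how ties are broken, which is the ambiguity noted above.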
The Real Problem (The easier version)
GATGCTTACTATGCGGGCCCC
CGGTCTAATGCTTACTATGC
GCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
TAATGCTTACTATGC
AATGCTTAGCTATGCGGGC
AATGCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
CGGTCTAGATGCTTACTATGC
AATGCTTACTATGCGGGCCCCTT
CGGTCTAATGCTTAGCTATGC
ATGCTTACTATGCGGGCCCCTT
[Figure: Subject genome (?) → Sequencer → Reads]
DNA Sequencing
ATCTGATAAGTCCCAGGACTTCAGT
GCAAGGCAAACCCGAGCCCAGTTT
TCCAGTTCTAGAGTTTCACATGATC
GGAGTTAGTAAAAGTCCACATTGAG
The genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: A, T, C, G
Bacteria: ~5 million bp
Humans: ~3 billion bp
Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300 bp)
Shorter reads, but much higher throughput
Per-base error rate estimated at 1-2% (Simpson et al., 2009)
Recent studies of entire human genomes have used 3.3 billion (Wang et al., 2008) and 4.0 billion (Bentley et al., 2008) 36 bp reads
~144 GB of compressed sequence data
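A quick back-of-the-envelope check of these numbers, using only figures quoted on the slides (read counts, 36 bp read length, ~3 billion bp human genome). The helper names are hypothetical, and "coverage" here simply means total sequenced bases divided by genome length.

```python
HUMAN_GENOME_BP = 3_000_000_000  # "~3 billion bp" from the earlier slide
READ_LENGTH_BP = 36


def total_bases(n_reads, read_length=READ_LENGTH_BP):
    """Total number of sequenced bases across all reads."""
    return n_reads * read_length


def coverage(n_reads, genome_bp=HUMAN_GENOME_BP, read_length=READ_LENGTH_BP):
    """Average number of times each genome position is sequenced."""
    return total_bases(n_reads, read_length) / genome_bp


wang_reads = 3_300_000_000     # Wang et al., 2008
bentley_reads = 4_000_000_000  # Bentley et al., 2008

for label, n in [("Wang et al.", wang_reads), ("Bentley et al.", bentley_reads)]:
    print(f"{label}: {total_bases(n):,} bases, about {coverage(n):.1f}x coverage")
```

The 4.0 billion reads at 36 bp come to 144 billion bases, which lines up with the ~144 figure quoted above. How many bytes that occupies depends on the encoding and compression used.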
How do we put Humpty Dumpty back together?
Human Genome
11 years, cost $3 billion… your tax dollars at work!
A complete human DNA sequence was published in 2003, marking the end of the Human Genome Project
CloudBurst
[Figure: Human chromosome 1, Read 1, and Read 2 feed into Map, followed by shuffle]
1. Map: Catalog K-mers
• Emit every k-mer in the genome and non-overlapping k-mers in the reads
• Non-overlapping k-mers are sufficient to guarantee an alignment will be found
2. Shuffle: Coalesce Seeds
• Hadoop's internal shuffle groups together k-mers shared by the reads and the reference
• Conceptually, build a hash table of k-mers and their occurrences
3. Reduce: End-to-end alignment
• Locally extend alignment beyond seeds by computing “match distance”
• If read aligns end-to-end, record the alignment
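The three steps can be imitated in a single Python process to show how the pieces fit together. This is a toy sketch, not CloudBurst's actual code: all function names are hypothetical, a plain dictionary stands in for Hadoop's shuffle, the reducer looks reads up in memory, and "match distance" is approximated by counting mismatches only (no insertions or deletions). The seed length is assumed to be the read length divided by (allowed mismatches + 1), so that at least one non-overlapping seed must match exactly.

```python
from collections import defaultdict


def map_reference(ref, k):
    """Map: emit every k-mer in the reference with its position."""
    for pos in range(len(ref) - k + 1):
        yield ref[pos:pos + k], ("ref", pos)


def map_read(read_id, read, k):
    """Map: emit only the non-overlapping k-mers of a read."""
    for off in range(0, len(read) - k + 1, k):
        yield read[off:off + k], ("read", read_id, off)


def shuffle(pairs):
    """Shuffle: group values by k-mer, like a hash table of occurrences."""
    groups = defaultdict(list)
    for kmer, value in pairs:
        groups[kmer].append(value)
    return groups


def reduce_seeds(groups, ref, reads, max_mismatches):
    """Reduce: extend each shared seed to a full end-to-end comparison."""
    alignments = set()
    for values in groups.values():
        ref_hits = [v[1] for v in values if v[0] == "ref"]
        read_hits = [(v[1], v[2]) for v in values if v[0] == "read"]
        for pos in ref_hits:
            for read_id, off in read_hits:
                read, start = reads[read_id], pos - off
                if start < 0 or start + len(read) > len(ref):
                    continue  # read would hang off the reference
                window = ref[start:start + len(read)]
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= max_mismatches:
                    alignments.add((read_id, start, mismatches))
    return sorted(alignments)


def align(ref, reads, max_mismatches):
    k = min(len(r) for r in reads.values()) // (max_mismatches + 1)
    if k == 0:
        raise ValueError("reads too short for this many mismatches")
    pairs = list(map_reference(ref, k))
    for read_id, read in reads.items():
        pairs.extend(map_read(read_id, read, k))
    return reduce_seeds(shuffle(pairs), ref, reads, max_mismatches)


reference = "GATGCTTACTATGCGGGCCCC"
reads = {"r1": "CTTACTAT", "r2": "TATGCGAG"}  # r2 carries one mismatch
hits = align(reference, reads, max_mismatches=1)
```

Each entry in `hits` is a `(read_id, start, mismatches)` tuple. In a real Hadoop job the map and reduce functions would run on different machines, so the reducer could not consult `reads` and `ref` directly and would need the relevant sequence shipped along with each seed.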