The Longest Common Substring Problem a.k.a Long Repeat by Donnie Demuth
Dec 24, 2015

The Longest Common Substring Problem
a.k.a. Long Repeat

by Donnie Demuth

Sections

1. MapReduce and Hadoop
2. Map and Reduce
3. Mappers and Reducers
4. Using Tools (Amazon)
5. Conclusions

1. MapReduce and Hadoop

• What is it?
• And how do I get it?

Google MapReduce

• Circa 2003
• Based on Map and Reduce (go figure)
– and Functional Programming!

• Proprietary

Apache Hadoop

• Circa 2006, released 2009
• Named after a toy elephant
• Seconds, maybe a minute, to install

Installing Hadoop on OSX

• Single-node cluster setup is a piece of cake
• Download the archive (tar.gz)
• Modify conf/hadoop-env.sh:
– Before: # export JAVA_HOME=/usr/lib/j2sdk1.6-sun
– After: export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/
• Modify bin/hadoop:
– Before: JAVA=$JAVA_HOME/bin/java
– After: JAVA=$JAVA_HOME/Commands/java
• Just run bin/hadoop with arguments

STOP!

• Actually, installing Hadoop wasn’t necessary
• We can write parallel code without it

2. Map and Reduce

• What is it?
– Quick Primer to Functional Programming
  • Higher-Order Functions
  • Alonzo Church (Lambda Calculus)
  • Haskell Curry (Spicy Food)
• How do I use it?

(x ↦ (y ↦ x*x + y*y))(5)(2)
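The expression above is a function of x that returns a function of y, applied first to 5 and then to 2. A minimal sketch in Python 3 (the name sum_of_squares is only for illustration and does not appear on the slides):

```python
# A curried function: takes x and returns a new function that takes y.
sum_of_squares = lambda x: (lambda y: x * x + y * y)

# Apply one argument at a time, as in (x ↦ (y ↦ x*x + y*y))(5)(2).
result = sum_of_squares(5)(2)
print(result)  # 5*5 + 2*2 = 29
```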

Code w/ Side-Effects

>>> thing = {'name': 'Donald'}
>>> def change_name(object):
...     object['name'] = 'Donnie'
...
>>> change_name(thing)
>>> thing
{'name': 'Donnie'}

Pure Code, Side-effect Free

>>> thing = {'name': 'Donald'}
>>> def change_name(object):
...     new_obj = {'name': 'Donnie'}
...     # copy any other values
...     return new_obj
...
>>> thing = change_name(thing)
>>> thing
{'name': 'Donnie'}
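The "copy any other values" step can be filled in with a shallow copy. This is one possible sketch in Python 3, not the author's code; the extra 'role' key and the new_name parameter are made up for the example:

```python
def change_name(obj, new_name):
    # Copy every existing key into a new dict, then override 'name'.
    # The caller's dict is never modified.
    new_obj = dict(obj)
    new_obj['name'] = new_name
    return new_obj

thing = {'name': 'Donald', 'role': 'speaker'}   # 'role' is a hypothetical extra field
renamed = change_name(thing, 'Donnie')
print(thing)    # the original still says 'Donald'
print(renamed)  # the copy says 'Donnie' and keeps 'role'
```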

Benefits of Pure Code / FP

• easy to understand
– Local vars = easy
– Global vars + side-effects = hard

• it’s easy to parallelize
– We only care about what we know RIGHT NOW
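Because a pure function depends only on its argument, the calls can be handed to workers in any order and the results still match a serial run. A small sketch in Python 3; it uses a thread pool only to keep the example short (a process pool or a cluster would be the choice for real CPU-bound work), and the square function is an assumption for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Pure: no globals read or written, the result depends only on x.
    return x * x

numbers = list(range(1, 11))

# Serial evaluation.
serial = [square(n) for n in numbers]

# The same work spread over four workers; pool.map keeps input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, numbers))

print(serial == parallel)
```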

Map

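[Figure: f(x) is applied to each element of the input list 1, 2, 3, producing the output list 1, 4, 6 (values as shown on the slide)]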

Map in Python

• Use the map(<function>, <list>) built-in

>>> map(lambda x: x*x, range(1,100))
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]
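The session above is Python 2, where map returns a list. In Python 3 map returns a lazy iterator, so a sketch of the same call needs an explicit list():

```python
# In Python 3, map() is lazy; nothing is computed until it is consumed.
squares = map(lambda x: x * x, range(1, 100))

# Materialise the results as a list to get the Python 2 behaviour.
squares = list(squares)
print(squares[:5], len(squares))
```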

Reduce

[Figure: starting from the unit value 0, f(x, y) combines the running result with 1, then 2, then 3; the final f(x, y) = 6]

Reduce in Python

• Use the reduce(<function>, <list>, <unit>) built-in

>>> reduce(lambda x, y: x+y, [1,2,3], 0)
6

>>> reduce(lambda x, y: x+y, map(lambda x: x*x, range(1,100)), 0)
328350
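In Python 3 reduce is no longer a built-in; it lives in functools. A sketch of the same two calls under that assumption:

```python
from functools import reduce

# Fold the list with +, starting from the unit value 0.
total = reduce(lambda x, y: x + y, [1, 2, 3], 0)

# Map then reduce: sum of the squares of 1..99.
sum_of_squares = reduce(lambda x, y: x + y,
                        map(lambda x: x * x, range(1, 100)),
                        0)
print(total, sum_of_squares)
```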

3. Mappers and Reducers

• How do I write them?
– Word Count (Hello World for Distrib. Comp.)
– Longest Repeat

• Show me how to pipe them

Mappers

• Pseudo-Code
– Take some input
– Process it
– And emit a Key-Value pair

Word Count Mapper

• For some input:
– Donald Demuth Donald Draper

• The output should be:
– Donald 1
– Demuth 1
– Donald 1
– Draper 1

Word Count Mapper Code

• wordcount/mapper.py

#!/usr/bin/env python
import sys, re

# A word is any run of ASCII letters
word_re = re.compile('[a-zA-Z]+')
for line in sys.stdin:
    line = line.strip().lower()
    for word in word_re.findall(line):
        # Emit "word<TAB>1" for every occurrence
        print '%s\t%s' % (word, 1)
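The same mapper logic can be wrapped in a function so it can be tried without stdin. This is a Python 3 sketch; the function name map_words is an assumption, not part of the original script. Note that, like the script above, it lower-cases the input, so the keys come out as "donald" rather than "Donald":

```python
import re

WORD_RE = re.compile('[a-zA-Z]+')

def map_words(lines):
    """Yield a (word, 1) pair for every word in an iterable of lines."""
    for line in lines:
        for word in WORD_RE.findall(line.strip().lower()):
            yield word, 1

pairs = list(map_words(['Donald Demuth Donald Draper']))
for word, count in pairs:
    # Same tab-separated format the streaming mapper prints.
    print('%s\t%s' % (word, count))
```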

Reducers

• Dependent on the Mapper’s emissions
• Pseudo-Code for word count
– Read an emission from the mapper
– Find the key and the value
– Store the key in a dictionary with its value
  • But if the key already exists, add the value to the pre-existing value!
– Emit the dictionary

Word Count Reducer Code

• wordcount/reducer.py

#!/usr/bin/env python
import sys

counts = {}
for line in sys.stdin:
    line = line.strip()
    # Each mapper emission is "word<TAB>count"
    word, count = line.split('\t', 1)
    count = int(count)
    # Add to the running total, starting from 0 for a new word
    counts[word] = counts.get(word, 0) + count

for word, count in counts.items():
    print '%s\t%s' % (word, count)
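The dictionary reducer above holds every word in memory. Hadoop Streaming sorts the mapper output by key before the reducer sees it, so a reducer can instead sum one run of equal keys at a time. The following is an alternative Python 3 sketch, not the author's reducer; reduce_sorted is a made-up name and the input is assumed to be sorted already:

```python
from itertools import groupby

def reduce_sorted(lines):
    """Sum counts per word; assumes 'word<TAB>count' lines sorted by word."""
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines)
    # groupby only merges adjacent equal keys, hence the sorted-input assumption.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)

sorted_lines = ['demuth\t1\n', 'donald\t1\n', 'donald\t1\n', 'draper\t1\n']
totals = list(reduce_sorted(sorted_lines))
for word, count in totals:
    print('%s\t%s' % (word, count))
```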

Unix Pipes

• Does this really work??

$ cat books/*.txt | wordcount/mapper.py | wordcount/reducer.py | sort | head
a 10526
ab 3
aback 1
abaft 2
abaht 1
abandon 2
abandoned 10
abandonment 1
abasement 1
abash 1

Longest Repeat (LCS)

• Many problems can be solved with a series of Maps and Reduces

• However, Hadoop Streaming is a single Map and Reduce step

• After much trial and error my solution involves a pre-processing step

Pre-processing

• fasta_to_line.py
– ecoli.fasta → ecoli.fasta.line

• gen_suffixes.py
– ecoli.fasta.line → ecoli.fasta.line.0, ecoli.fasta.line.100000, ecoli.fasta.line.200000

[Figure: file sizes shown next to the files on the slide: 4.6 megs, 4.5 megs, 4.4 megs, 4.3 megs]
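The two scripts themselves are not shown on the slides. The sketch below is only a guess at what they might do, based on the file names: join the FASTA sequence lines into one string, then cut one suffix per fixed step. The function bodies, the step parameter and the tiny demo sequence are all assumptions (Python 3):

```python
def fasta_to_line(fasta_text):
    """Drop '>' header lines and join the sequence lines into one string."""
    return ''.join(line.strip() for line in fasta_text.splitlines()
                   if line and not line.startswith('>'))

def gen_suffixes(sequence, step):
    """Yield (offset, suffix) for offsets 0, step, 2*step, ..."""
    for offset in range(0, len(sequence), step):
        yield offset, sequence[offset:]

# Hypothetical miniature input; the real run used ecoli.fasta and,
# judging by the file names, a step of 100000.
fasta = ">demo sequence\nACGTAC\nGTACGT\n"
line = fasta_to_line(fasta)
suffixes = list(gen_suffixes(line, step=5))
for offset, suffix in suffixes:
    print(offset, suffix)
```

Each later suffix is shorter than the one before it, which would be consistent with the shrinking file sizes on the slide.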

LCS Mapper

• Pseudo-code
– Read a line from a suffix file
– Determine the index (first chars)
– Cycle through the first 100,000 positions
  • Cycle through possible lengths (10 → 3000)
– Emit the Length (Key) and the Position (Val)

• Emit (-1) and (-1) to STAY ALIVE
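The mapper source is not included in the slides, so the following Python 3 sketch is only one way to read the pseudo-code above. The name lcs_map, its parameters and the test string are assumptions, and the (-1, -1) keep-alive emission is left out:

```python
def lcs_map(offset, suffix, window, min_len, max_len):
    """Yield (length, position) for repeats that start in the first
    `window` positions of `suffix` and occur again later in it."""
    for p in range(min(window, len(suffix))):
        best = 0
        for length in range(min_len, max_len + 1):
            if p + length > len(suffix):
                break
            # Look for a second copy starting after p.
            if suffix.find(suffix[p:p + length], p + 1) == -1:
                break  # if this length does not repeat, no longer one will
            best = length
        if best:
            # Report the position relative to the whole text.
            yield best, offset + p

text = "xxDance the cha chayyDance the cha chazz"
emitted = list(lcs_map(0, text, window=len(text), min_len=10, max_len=3000))
longest = max(emitted)
print('length(%d) pos(%d)' % longest)
```

On this toy string the longest emission is the 17-character phrase starting at position 2. The nested loops make clear why the real job was expensive: every position is checked against up to a few thousand lengths.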

LCS Reducer

• Pseudo-Code
– Simple
– Find the largest KEY emitted by any mapper
– Display it
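A sketch of that reducer in Python 3, assuming (this is not stated on the slides) that each emission arrives as a "length<TAB>position" line and that the (-1, -1) keep-alive pairs arrive in the same format:

```python
def lcs_reduce(lines):
    """Return the (length, position) pair with the largest length."""
    best = (-1, -1)
    for line in lines:
        length, position = (int(field) for field in line.split('\t'))
        # Keep-alive pairs have length -1, so they never beat a real emission.
        if length > best[0]:
            best = (length, position)
    return best

# Hypothetical emissions, including two keep-alive pairs.
emissions = ['-1\t-1', '17\t2', '16\t3', '63\t128', '-1\t-1']
length, position = lcs_reduce(emissions)
print('length(%d) pos(%d)' % (length, position))
```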

LCS w/ Murmur.txt

$ cat murmur.txt.line.0 | lcs/mapper.py | lcs/reducer.py
length(63) pos(128)

$ python
>>> text = open('murmur.txt.line').read()
>>> text[128:128+63]
'Dance the cha chaOr the can canShake your pom pomTo Duran Duran'

>>> seq = text[128:128+63]
>>> text.index(seq)
128
>>> text[129:].index(seq) + 129
1777
>>> text[128:128+63] == text[1777:1777+63]
True
>>> text[1777:1777+63]
'Dance the cha chaOr the can canShake your pom pomTo Duran Duran'
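The manual check in the session above can be folded into a small helper. This is a Python 3 sketch with a made-up name and test string; it uses find rather than index so that a missing repeat gives -1 instead of raising an exception:

```python
def second_occurrence(text, pos, length):
    """Return where text[pos:pos+length] next occurs after pos, or -1."""
    return text.find(text[pos:pos + length], pos + 1)

text = "abcXYZabcQ"
other = second_occurrence(text, 0, 3)
print(other, text[0:3] == text[other:other + 3])
```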

4. Using Tools, Amazon

• Harness the power of many machines at once
– Easy to use 20

• Need to sign up for:
– Amazon Elastic MapReduce (EMR)
– Amazon Elastic Compute Cloud (EC2)
– Amazon Simple Storage Service (S3)
– Amazon SimpleDB

Deploying Data/Code

• First you’ll need to upload it to S3
– Create a new bucket (or global folder) named ecoli-lcs
– Create a new path named input, ecoli-lcs/input
– Upload all of the generated suffixes to the input folder
– Upload mapper.py and reducer.py to ecoli-lcs

Creating a Job (Flow)

Creating a Job Flow (…)

RESULTS!

• Need to download the output

$ cd output
$ cat * | sort
(...)
length(2815) pos(4166641)

$ python
>>> text = open('ecoli.fasta.line').read()
>>> seq = text[4166641:4166641+2815]
>>> text.index(seq)
4166641
>>> text[4166642:].index(seq) + 4166642
4208043
>>> text[4166641:4166641+2815] == text[4208043:4208043+2815]

5. Conclusions

• Costs
– It’s about 3 cents an hour for a “medium” VM
– One run took 840 instance hours (20+ actual)
  • Approx. $25
– Used about 2000 instance hours in total

• Hadoop Streaming is EASY
– Though it requires many (easy) tools
– But costly if you have “bugs”
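The cost figures above can be checked with a few lines of arithmetic. This Python 3 sketch assumes a flat 3 cents per instance hour for every run, which the slides only state approximately; the total for 2000 hours is derived here and is not a number from the slides:

```python
RATE_CENTS_PER_INSTANCE_HOUR = 3   # "about 3 cents an hour"

# Work in whole cents to avoid floating-point rounding.
one_run_cents = 840 * RATE_CENTS_PER_INSTANCE_HOUR
all_runs_cents = 2000 * RATE_CENTS_PER_INSTANCE_HOUR

print('one run:  $%.2f' % (one_run_cents / 100))
print('all runs: $%.2f' % (all_runs_cents / 100))
```

The single run comes to $25.20, in line with the "Approx. $25" on the slide.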

A Better Solution?

• Jeff Parker’s program used the following approach:
– Cycle through the sequence and find all repeats of a given size
– Emit the location
– Increase the size and use the previously known locations to find larger matches

• Looks good for MapReduce (Core)
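Jeff Parker's program is not shown, so the following Python 3 sketch is only an interpretation of the three steps listed above: find every position whose substring of a given size repeats, then grow the size and re-check only those known positions. All names and the test string are assumptions:

```python
from collections import defaultdict

def repeated_positions(text, size, candidates):
    """Return the candidate positions whose substring of `size`
    occurs at two or more candidate positions."""
    groups = defaultdict(list)
    for p in candidates:
        if p + size <= len(text):
            groups[text[p:p + size]].append(p)
    return [p for positions in groups.values() if len(positions) > 1
            for p in positions]

def longest_repeat(text, start_size=1):
    """Return (length, first position) of the longest repeated substring."""
    candidates = repeated_positions(text, start_size, range(len(text)))
    best = (0, -1)
    size = start_size
    while candidates:
        best = (size, min(candidates))
        size += 1
        # A repeat of size+1 must start where a repeat of size starts,
        # so only the surviving positions need to be checked again.
        candidates = repeated_positions(text, size, candidates)
    return best

print(longest_repeat("xxDance the cha chayyDance the cha chazz"))
```

Each pass is a group-by-key followed by a filter, which is why the approach looks like a natural fit for a chain of MapReduce steps.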