Spark: A Quick Ignition
Matthew Kemp
Jul 27, 2015
Provides distributed processing
Main unit of abstraction is the RDD
Can be used with cluster managers like Mesos or YARN
Supports Java, Python and Scala
https://spark.apache.org/
What is Spark?
Can be created from…
  Files or HDFS
  In memory iterable
  Cassandra or SQL tables
Transformations
  Lazily create a new RDD from an existing one
Actions
  Usually return a value, force computation of RDD
Resilient Distributed Dataset
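The difference between lazy transformations and actions can be sketched in plain Python with a generator. This is only an analogy and not the Spark API: `lazy_map`, `square` and the `calls` list are hypothetical names, used here to make the deferred computation visible.

```python
# Plain-Python analogy for lazy transformations vs. actions.
# Not Spark code: lazy_map, square and calls are illustrative names only.

calls = []  # records each time the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


def lazy_map(fn, items):
    """Like a transformation: describes the work but does not run it yet."""
    return (fn(item) for item in items)


data = [1, 2, 3, 4]

squared = lazy_map(square, data)   # nothing has been computed so far
calls_before_action = list(calls)  # still empty

total = sum(squared)               # like an action: forces the computation
calls_after_action = list(calls)

print(calls_before_action)  # []
print(total)                # 30
print(calls_after_action)   # [1, 2, 3, 4]
```

In Spark the same idea applies across a cluster: a chain of transformations only records how to build the RDD, and the work happens when an action asks for a result.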
Sample Text
Spark Example
Spark Shell
Shell Example
Gists
#!/usr/bin/env python
import re
import string

# Matches any single ASCII punctuation character.
regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count(sc, in_file_name, out_file_name):
    """Count words in in_file_name and save 'word,count' lines."""
    sc.textFile(in_file_name) \
        .map(lambda line: regex.sub(' ', line).strip().lower()) \
        .flatMap(lambda line: [(word, 1) for word in line.split()]) \
        .reduceByKey(lambda a, b: a + b) \
        .map(lambda pair: '%s,%s' % pair) \
        .saveAsTextFile(out_file_name)
Example: Word Count
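The `reduceByKey` step is the heart of the word count, so here is a minimal plain-Python sketch of what it does to a list of `(word, 1)` pairs. `reduce_by_key` is a hypothetical helper written for this illustration, not part of PySpark, and the sample sentence is made up.

```python
# Plain-Python sketch of what reduceByKey does to (key, value) pairs.
# reduce_by_key is a hypothetical helper, not part of PySpark.


def reduce_by_key(pairs, fn):
    """Merge the values of each key with fn, a two-argument function."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = fn(merged[key], value)
        else:
            merged[key] = value
    return merged


line = 'the cat and the hat'
pairs = [(word, 1) for word in line.split()]
counts = reduce_by_key(pairs, lambda a, b: a + b)

print(sorted(counts.items()))
# [('and', 1), ('cat', 1), ('hat', 1), ('the', 2)]
```

Spark merges values within each partition and then across partitions, so the function passed to `reduceByKey` should be associative and commutative, as addition is here.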
#!/usr/bin/env python
import re
import string

# Matches any single ASCII punctuation character.
regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count(sc, in_file_name, out_file_name):
    """Same word count, with one small transformation per step."""
    sc.textFile(in_file_name) \
        .map(lambda line: regex.sub(' ', line)) \
        .map(lambda line: line.strip()) \
        .map(lambda line: line.lower()) \
        .flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .map(lambda pair: '%s,%s' % pair) \
        .saveAsTextFile(out_file_name)
Example: Alternate Word Count
$ pyspark
...
Using Python version 2.7.2 (default)
SparkContext available as sc.
>>> from word_count import word_count
>>> word_count(sc, 'text.txt', 'text_counts')
Running the Example
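Note that `saveAsTextFile` writes a directory of `part-*` files rather than a single file, so `text_counts` above is a directory. A minimal sketch for reading such output back into a dictionary, assuming Python 3 and the `word,count` line format produced by the job; `read_counts` is a hypothetical helper, and the demo directory is built by hand as a stand-in for real Spark output.

```python
import glob
import os
import tempfile


def read_counts(out_dir):
    """Read 'word,count' lines from every part-* file in out_dir."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(out_dir, 'part-*'))):
        with open(path) as part_file:
            for line in part_file:
                line = line.strip()
                if not line:
                    continue
                # Split on the last comma: the count is always the final field.
                word, count = line.rsplit(',', 1)
                counts[word] = int(count)
    return counts


# Stand-in for a Spark output directory, built by hand for illustration.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'part-00000'), 'w') as f:
    f.write('a,23\nable,1\n')
with open(os.path.join(demo_dir, 'part-00001'), 'w') as f:
    f.write('about,6\n')

counts = read_counts(demo_dir)
print(counts['a'], counts['able'], counts['about'])  # 23 1 6
```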
a,23
able,1
about,6
above,1
accept,1
accuse,1
ago,2
alarm,2
all,7
although,1
always,2
an,1
and,26
anger,1
another,1
any,2
anyone,1
arches,1
are,1
arm,1
armour,1
as,7
assistant,2
...
The Results From Spark
#!/bin/bash
text=$(cat ${1} | tr "[:punct:]" " " | \
    tr "[:upper:]" "[:lower:]")
parsed=(${text})
for w in ${parsed[@]}; do echo ${w}; done | sort | uniq -c
A (Bad) Shell Version
     23 a
      1 able
      6 about
      1 above
      1 accept
      1 accuse
      2 ago
      2 alarm
      7 all
      1 although
      2 always
      1 an
     26 and
      1 anger
      1 another
      2 any
      1 anyone
      1 arches
      1 are
      1 arm
      1 armour
      7 as
      2 assistant
...
The Results From the Shell
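The two versions report the same counts in different layouts: `word,count` from Spark and `count word` from `uniq -c`. A small sketch of how the two could be compared, using the first three rows shown in the results; `parse_spark_lines` and `parse_uniq_lines` are hypothetical helper names.

```python
def parse_spark_lines(lines):
    """Parse 'word,count' lines as written by the Spark job."""
    counts = {}
    for line in lines:
        word, count = line.strip().rsplit(',', 1)
        counts[word] = int(count)
    return counts


def parse_uniq_lines(lines):
    """Parse 'count word' lines as printed by uniq -c."""
    counts = {}
    for line in lines:
        count, word = line.split()
        counts[word] = int(count)
    return counts


# First three rows of each result listing.
spark_lines = ['a,23', 'able,1', 'about,6']
shell_lines = ['     23 a', '      1 able', '      6 about']

same = parse_spark_lines(spark_lines) == parse_uniq_lines(shell_lines)
print(same)  # True
```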
Our Use Case
[Diagram: data flow with two inputs labeled "3rd party" and one labeled "1st party"; steps labeled distinct(), join(), union() and foreach()]
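The set-style operations named in this use case can be sketched in plain Python to show what each one does. This is not the pipeline from the diagram and not PySpark code: the helper functions mimic the behavior of the RDD methods of the same name, and the sample IDs and segments are made up for illustration.

```python
# Plain-Python sketches of three RDD operations named on the slide.
# The helper names and sample data are hypothetical, for illustration only.


def distinct(items):
    """Drop duplicates. (Spark does not guarantee the resulting order.)"""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def union(left, right):
    """Concatenate two datasets; duplicates are kept, as in RDD.union."""
    return list(left) + list(right)


def join(left, right):
    """Inner join of (key, value) pairs, yielding (key, (lvalue, rvalue))."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lvalue, rvalue))
            for key, lvalue in left
            for rvalue in right_by_key.get(key, [])]


first_party = ['u1', 'u2']
more_first_party = ['u2', 'u3']
users = distinct(union(first_party, more_first_party))
print(users)  # ['u1', 'u2', 'u3']

third_party = distinct([('u1', 'segment-a'), ('u1', 'segment-a'),
                        ('u3', 'segment-b')])
profiles = [('u1', 'x'), ('u3', 'y')]
joined = join(profiles, third_party)
print(joined)
# [('u1', ('x', 'segment-a')), ('u3', ('y', 'segment-b'))]
```

Because `union` keeps duplicates, following it with `distinct` is the usual way to get a de-duplicated combined dataset.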