Spark: A Quick Ignition Matthew Kemp

CloudCamp Chicago lightning talk "Spark: A Quick Ignition" - Matthew Kemp, Architect of Things at Signal

Jul 27, 2015

Transcript
Page 1: Spark: A Quick Ignition

Matthew Kemp

Page 2: What is Spark?

Provides distributed processing

Main unit of abstraction is the RDD

Can be used with cluster managers like Mesos or YARN

Supports Java, Python and Scala

https://spark.apache.org/

Page 3: Resilient Distributed Dataset

Can be created from:
    Files or HDFS
    In-memory iterable
    Cassandra or SQL tables

Transformations
    Lazily create a new RDD from an existing one

Actions
    Usually return a value; force computation of the RDD
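The difference between lazy transformations and actions can be sketched without a cluster. The block below is a plain-Python stand-in, not the Spark API: the class name LazyDataset and the helper noisy_double are hypothetical and exist only for this illustration. It records when the mapped function actually runs, to show that nothing executes until an action is called.

```python
class LazyDataset(object):
    """Plain-Python stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, make_iter):
        # make_iter: zero-argument function returning a fresh iterator
        self._make_iter = make_iter

    @classmethod
    def from_iterable(cls, items):
        items = list(items)
        return cls(lambda: iter(items))

    def map(self, fn):
        # Transformation: describes the work and returns a new dataset
        return LazyDataset(lambda: (fn(x) for x in self._make_iter()))

    def filter(self, pred):
        # Transformation: also lazy
        return LazyDataset(lambda: (x for x in self._make_iter() if pred(x)))

    def collect(self):
        # Action: forces the whole chain to run and returns the values
        return list(self._make_iter())


calls = []

def noisy_double(x):
    calls.append(x)  # record when the function actually runs
    return 2 * x

numbers = LazyDataset.from_iterable([1, 2, 3, 4])
doubled = numbers.map(noisy_double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # nothing has run yet
result = doubled.collect()         # the action triggers the work
```

In this sketch calls_before_action stays empty, and noisy_double runs only once collect() is invoked. Real RDDs add partitioning and fault tolerance on top of the same idea.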

Page 4: Transformations

Some examples: filter, map, flatMap, distinct, union, intersection, join, reduceByKey
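To make three of these concrete, here is a rough local model of what map, flatMap and reduceByKey compute. It is a plain-Python approximation on in-memory lists, not Spark code; the local_* helper names are made up for this sketch, and the real operations run lazily across partitions.

```python
from itertools import chain

def local_map(fn, items):
    # One output element per input element
    return [fn(x) for x in items]

def local_flat_map(fn, items):
    # fn returns a sequence per element; the sequences are flattened
    return list(chain.from_iterable(fn(x) for x in items))

def local_reduce_by_key(fn, pairs):
    # Combine all values that share a key using fn
    totals = {}
    for key, value in pairs:
        totals[key] = fn(totals[key], value) if key in totals else value
    return totals

lines = ['to be or', 'not to be']
split_with_map = local_map(lambda line: line.split(), lines)
split_with_flat_map = local_flat_map(lambda line: line.split(), lines)
pairs = local_map(lambda word: (word, 1), split_with_flat_map)
counts = local_reduce_by_key(lambda a, b: a + b, pairs)
```

Here split_with_map keeps one list per line, while split_with_flat_map is a single flat list of words, which is why word count uses flatMap before pairing each word with 1.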

Page 5: Actions

Some examples: reduce, collect, take, count, foreach, saveAsTextFile
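As a rough analogy only, several of these actions have close counterparts on a local Python list. The snippet below uses the standard library rather than Spark, and the variable names are illustrative. It is meant to show that actions hand back a concrete value, or perform a side effect, instead of producing another dataset.

```python
from functools import reduce
from itertools import islice

data = [5, 3, 8, 1]

total = reduce(lambda a, b: a + b, data)  # analogous to reduce
first_two = list(islice(data, 2))         # analogous to take(2)
size = len(data)                          # analogous to count

seen = []
for item in data:                         # analogous to foreach:
    seen.append(item)                     # side effect only, no return value
```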

Page 7: Example: Word Count

input -> map() -> flatMap() -> reduceByKey() -> map() -> output

Page 8: Example: Word Count

    #!/bin/python
    import re
    import string

    # Matches any single punctuation character
    regex = re.compile('[%s]' % re.escape(string.punctuation))

    def word_count(sc, in_file_name, out_file_name):
        # Clean each line, emit (word, 1) pairs, sum per word, write "word,count"
        # The final map formats the pair directly, so it runs on Python 2 and 3
        sc.textFile(in_file_name) \
            .map(lambda line: regex.sub(' ', line).strip().lower()) \
            .flatMap(lambda line: [(word, 1) for word in line.split()]) \
            .reduceByKey(lambda a, b: a + b) \
            .map(lambda pair: '%s,%s' % pair) \
            .saveAsTextFile(out_file_name)

Page 9: Example: Alternate Word Count

    #!/bin/python
    import re
    import string

    # Matches any single punctuation character
    regex = re.compile('[%s]' % re.escape(string.punctuation))

    def word_count(sc, in_file_name, out_file_name):
        # Same pipeline as before, with one small transformation per step
        # The final map formats the pair directly, so it runs on Python 2 and 3
        sc.textFile(in_file_name) \
            .map(lambda line: regex.sub(' ', line)) \
            .map(lambda line: line.strip()) \
            .map(lambda line: line.lower()) \
            .flatMap(lambda line: line.split()) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .map(lambda pair: '%s,%s' % pair) \
            .saveAsTextFile(out_file_name)
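For checking the Spark job on a small input, it can help to have a single-machine reference that applies the same cleaning steps. The following is a sketch under that assumption, using collections.Counter instead of Spark. The function name local_word_count and the sample lines are hypothetical, and the output is sorted here purely to make comparison easy.

```python
import re
import string
from collections import Counter

# Same punctuation pattern as the slides
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))

def local_word_count(lines):
    counts = Counter()
    for line in lines:
        # Replace punctuation with spaces, trim, lowercase, then split
        cleaned = PUNCTUATION.sub(' ', line).strip().lower()
        counts.update(cleaned.split())
    # Format as "word,count", matching the final map() in the Spark version
    return ['%s,%s' % (word, n) for word, n in sorted(counts.items())]

sample = ['The cat, the hat.', 'A cat!']
output = local_word_count(sample)
```

This keeps everything in one process and in memory, so it only serves as a cross-check for inputs that fit on one machine.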

Page 10: Running the Example

    $ pyspark
    ...
    Using Python version 2.7.2 (default)
    SparkContext available as sc.
    >>> from word_count import word_count
    >>> word_count(sc, 'text.txt', 'text_counts')

Page 11: The Results From Spark

    a,23
    able,1
    about,6
    above,1
    accept,1
    accuse,1
    ago,2
    alarm,2
    all,7
    although,1
    always,2
    an,1
    and,26
    anger,1
    another,1
    any,2
    anyone,1
    arches,1
    are,1
    arm,1
    armour,1
    as,7
    assistant,2
    ...

Page 12: A (Bad) Shell Version

    #!/bin/bash
    text=$(cat ${1} | tr "[:punct:]" " " | \
           tr "[:upper:]" "[:lower:]")
    parsed=(${text})
    for w in ${parsed[@]}; do echo ${w}; done | sort | uniq -c

Page 13: The Results From the Shell

     23 a
      1 able
      6 about
      1 above
      1 accept
      1 accuse
      2 ago
      2 alarm
      7 all
      1 although
      2 always
      1 an
     26 and
      1 anger
      1 another
      2 any
      1 anyone
      1 arches
      1 are
      1 arm
      1 armour
      7 as
      2 assistant
    ...

Page 14: Our Use Case

[Diagram. Labels as extracted: two "3rd party" inputs, each followed by distinct(); two join() steps; one "1st party" input; then union(), distinct(), foreach().]
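The slide does not spell out how these stages connect, so the following is only one plausible reading: each 3rd party source is de-duplicated, joined against the 1st party data by key, and the two joined results are then merged, de-duplicated again and processed record by record. It is a plain-Python sketch, not Spark code. The helper functions, the keys and all sample records are hypothetical.

```python
def distinct(items):
    # Drop repeats; this sketch keeps first-seen order, which Spark
    # does not guarantee
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

def join(left, right):
    # Inner join on key: (k, a) and (k, b) become (k, (a, b))
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]

# Hypothetical sample data
first_party = [('u1', 'alice'), ('u2', 'bob')]
third_party_a = [('u1', 'sports'), ('u1', 'sports'), ('u3', 'news')]
third_party_b = [('u2', 'music'), ('u1', 'sports')]

joined_a = join(distinct(third_party_a), first_party)
joined_b = join(distinct(third_party_b), first_party)

# A union keeps duplicates, so distinct() is applied again afterwards
merged = distinct(joined_a + joined_b)

delivered = []
for record in merged:  # stands in for foreach()
    delivered.append(record)
```

In this reading the second distinct() matters because the same joined record can arrive from both 3rd party sources, as ('u1', ('sports', 'alice')) does here.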

Page 15: Questions?

Page 16: Contact

[email protected]

@mattkemp

/in/matthewkemp