Spark and Cassandra
Artem Aliev, Software Developer, DataStax
Solving classical data analytic tasks by using modern distributed databases
Agenda: Lambda Architecture
http://lambda-architecture.net
Apache Cassandra
• Masterless architecture with read/write-anywhere design
• Continuous availability with no single point of failure
• Multi-data-center and cloud availability zone support
• Linear scale performance with online capacity expansion
• CQL – SQL-like query language
[Diagram: linear scalability — doubling the node count doubles throughput: 100,000 txns/sec → 200,000 txns/sec → 400,000 txns/sec]
Apache Cassandra™ is a massively scalable NoSQL OLTP database.
• Cassandra was designed with the understanding that system/hardware failures can and do occur
• Peer-to-peer, distributed system
• All nodes are the same
• Multi-data-center support out of the box
• Configurable replication factor
• Configurable data consistency per request
• Active-active replication architecture
Cassandra Architecture Overview
[Diagram: replica placement — two data centers (DC: USA, DC: EU) of five nodes each; the 1st, 2nd, and 3rd copies of each row are placed on different nodes and replicated across both data centers]
Cassandra Query Language

CREATE TABLE sporty_league (
    team_name varchar,
    player_name varchar,
    jersey int,
    PRIMARY KEY (team_name, player_name)
);

SELECT * FROM sporty_league WHERE team_name = 'Mighty Mutts' AND player_name = 'Lucky';

INSERT INTO sporty_league (team_name, player_name, jersey) VALUES ('Mighty Mutts', 'Felix', 90);
Apache Spark
• Distributed computing framework
• Created at UC Berkeley AMPLab in 2009
• Open source since 2010, Apache project since 2013
• Solves problems Hadoop is bad at:
  • Iterative algorithms
  • Interactive machine learning
• More general purpose than MapReduce
• Streaming!
• In memory!
Fast
[Chart: Logistic Regression performance — running time (s) vs. number of iterations. Hadoop: ~110 sec per iteration; Spark: 80 sec for the first iteration, ~1 sec for further iterations]
Simple

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
The same word count in Spark (Python):

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
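Since the rest of this deck's examples are in Scala, here is the same word count as a minimal Scala sketch; it assumes an existing SparkContext sc, and the elided HDFS paths stay elided as in the Python version:

// Word count in Scala against an existing SparkContext `sc`
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")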
A Quick Comparison to Hadoop
[Diagram: Hadoop pipeline — HDFS → map() → reduce() → map() → reduce(), writing to HDFS between stages. Spark pipeline — Data Source 1 and Data Source 2 feed map()/transform(), join(), cache(), and further transform() steps without intermediate HDFS writes]
Hadoop API: map, reduce

Spark API: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save ...
API
• Resilient Distributed Datasets (RDD)
  • Collections of objects spread across a cluster, stored in RAM or on disk
  • Built through parallel transformations
  • Automatically rebuilt on failure
• Operations
  • Transformations (e.g. map, filter, groupBy) — lazy; see the sketch after this list
  • Actions (e.g. count, collect, save)
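To make the transformation/action split concrete, a minimal sketch assuming an existing SparkContext sc: the transformations only build the operator graph, and the single action at the end triggers the actual computation.

val numbers = sc.parallelize(1 to 1000)    // RDD from a local collection
val evens   = numbers.filter(_ % 2 == 0)   // transformation: lazy, nothing runs yet
val doubled = evens.map(_ * 2)             // transformation: still lazy
doubled.cache()                            // mark partitions for in-memory reuse
println(doubled.count())                   // action: triggers the computation (prints 500)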
Operator Graph: Optimization and Fault Tolerance
[Diagram: operator DAG — RDDs A–F linked by map, join, filter, and groupBy; the scheduler groups them into Stages 1–3; cached partitions are marked so lost ones can be recomputed from lineage]
Why Spark on Cassandra?
• Data-model-independent queries
• Cross-table operations (JOIN, UNION, etc.; see the sketch after this list)
• Complex analytics (e.g. machine learning)
• Data transformation, aggregation, etc.
• Stream processing
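A sketch of a cross-table join, an operation CQL alone does not offer. The test.users and test.orders tables and their columns are hypothetical; the connector import and SparkContext setup are shown on the following slides.

import org.apache.spark.SparkContext._   // pair-RDD implicits (join, etc.)
import com.datastax.driver.spark._       // cassandraTable on SparkContext

// Hypothetical tables: test.users(id, name) and test.orders(user_id, amount)
val users  = sc.cassandraTable("test", "users")
               .map(row => (row.getString("id"), row.getString("name")))
val orders = sc.cassandraTable("test", "orders")
               .map(row => (row.getString("user_id"), row.getDouble("amount")))

// A standard RDD join keyed on the user id, across two Cassandra tables
val joined = users.join(orders)   // RDD[(String, (String, Double))]
joined.take(10).foreach(println)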
How to Spark on Cassandra?
• DataStax Cassandra Spark driver
  • Open source: https://github.com/datastax/cassandra-driver-spark
• Compatible with:
  • Spark 0.9+
  • Cassandra 2.0+
  • DataStax Enterprise 4.5+
Cassandra Spark Driver
• Cassandra tables exposed as Spark RDDs
• Read from and write to Cassandra
• Mapping of C* tables and rows to Scala objects
• All Cassandra types supported and converted to Scala types
• Server-side data selection
• Spark Streaming support
• Scala and Java support
Connecting to Cassandra

// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact point
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)
Accessing Data

CREATE TABLE test.words (word text PRIMARY KEY, count int);
INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

Accessing the table above as an RDD:

// Use the table as an RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames  // Stream(word, count)
rdd.size         // 2

val firstRow = rdd.first
// firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
firstRow.getInt("count")  // Int = 30
Saving Data

val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", Seq("word", "count"))

The RDD above, saved to Cassandra:

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

(4 rows)
Type Mapping

CQL Type         Scala Type
ascii            String
bigint           Long
boolean          Boolean
counter          Long
decimal          BigDecimal, java.math.BigDecimal
double           Double
float            Float
inet             java.net.InetAddress
int              Int
list             Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map              Map, TreeMap, java.util.HashMap
set              Set, TreeSet, java.util.HashSet
text, varchar    String
timestamp        Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid         java.util.UUID
uuid             java.util.UUID
varint           BigInt, java.math.BigInteger
nullable values  Option
Mapping Rows to Objects

CREATE TABLE test.cars (
    id text PRIMARY KEY,
    model text,
    fuel_type text,
    year int
);

case class Vehicle(
    id: String,
    model: String,
    fuelType: String,
    year: Int
)

sc.cassandraTable[Vehicle]("test", "cars").toArray
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

• Mapping rows to Scala case classes
• CQL underscore_case columns mapped to Scala camelCase properties
• Custom mapping functions (see the sketch below and the docs)
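A hedged sketch of a custom mapping function: early versions of the connector document an as method that feeds selected columns into an arbitrary function instead of a case class; the exact signature and column ordering may differ between connector versions.

// Map columns straight into a function; columns are passed
// in the order they appear in the table/selection
val report = sc.cassandraTable("test", "cars")
  .as((id: String, model: String, fuelType: String, year: Int) =>
    s"$model ($year, $fuelType)")
report.toArray.foreach(println)
// e.g. Ford Mondeo (2009, Petrol)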
Server-Side Data Selection
• Reduce the amount of data transferred
• Selecting columns
• Selecting rows (by clustering columns and/or secondary indexes)

sc.cassandraTable("test", "users").select("username").toArray.foreach(println)
// CassandraRow{username: john}
// CassandraRow{username: tom}

sc.cassandraTable("test", "cars").select("model").where("color = ?", "black").toArray.foreach(println)
// CassandraRow{model: Ford Mondeo}
// CassandraRow{model: Hyundai x35}
Spark SQL
[Diagram: Spark stack — Spark SQL, Streaming, ML, and Graph libraries on top of Spark (general execution engine), running over compatible Cassandra/HDFS storage]
Spark SQL
• SQL query engine on top of Spark
• Hive compatible (JDBC, UDFs, types, metadata, etc.)
• Support for in-memory processing
• Pushdown of predicates to Cassandra when possible
Spark SQL Example

import com.datastax.spark.connector._

// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)

// Create a Cassandra SQL context
val cc = new CassandraSQLContext(sc)

// Execute an SQL query
val df = cc.sql("SELECT * FROM keyspace.table WHERE ...")
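A possible follow-up, assuming the query result behaves like an ordinary Spark RDD of rows (which is how the early connector exposes SQL results):

// Standard Spark actions apply to the query result
df.collect().foreach(println)   // materialize and print matching rows
println(df.count())             // number of matching rows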
Spark Streaming
[Diagram: Spark stack with Streaming highlighted, over Cassandra/HDFS]
Spark Streaming
• Micro-batching
• Each batch represented as an RDD
• Fault tolerant
• Exactly-once processing
• Unified stream and batch processing framework

[Diagram: DStream — a continuous data stream split into a sequence of RDD micro-batches]
Streaming Example

import com.datastax.spark.connector.streaming._

// Spark connection options
val conf = new SparkConf(true)...

// Streaming with a 1-second batch window
val ssc = new StreamingContext(conf, Seconds(1))

// Stream input
val lines = ssc.socketTextStream(serverIP, serverPort)

// Count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// Stream output
wordCounts.saveToCassandra("test", "words")

// Start processing
ssc.start()
ssc.awaitTermination()
Spark MLlib
[Diagram: Spark stack with ML highlighted, over Cassandra/HDFS]
Spark MLlib
• Classification (SVM, LogisticRegression, NaiveBayes, RandomForest, ...)
• Clustering (KMeans, LDA, PIC)
• Linear regression
• Collaborative filtering
• Dimensionality reduction (SVD, PCA)
• Frequent pattern mining
• Word2Vec
Spark MLlib Example

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes

// Spark connection options
val conf = new SparkConf(true)...

// Read data (Iris is a case class mapped to the test.iris table;
// class2id converts a species name to a numeric label)
val data = sc.cassandraTable[Iris]("test", "iris")

// Convert data to LabeledPoint
val parsedData = data.map { i =>
  LabeledPoint(class2id(i.species), Array(i.petal_l, i.petal_w, i.sepal_l, i.sepal_w))
}

// Train the model
val model = NaiveBayes.train(parsedData)

// Predict the species for a new measurement
model.predict(Array(5, 1.5, 6.4, 3.2))
http://www.datastax.com/dev/blog/interactive-advanced-analytic-with-dse-and-spark-mllib
APIs

Language  Notes
Scala     Native
Python    Popular
Java      Ugly
R         No closures
Questions?