Data Analytics with Apache Spark and Cassandra

Gerard Maas
Page 1: Data Analytics with Apache Spark and Cassandra

Data Analytics with Apache Spark and Cassandra

BigData.be Meetup, 8/Sep/2015

Gerard Maas (@maasg), Data Processing Team Lead

Page 2: Data Analytics with Apache Spark and Cassandra


Page 3: Data Analytics with Apache Spark and Cassandra

Tweet a few keywords about your interests and experience. Use the hashtag "#bigdatabe".

Page 4: Data Analytics with Apache Spark and Cassandra


Agenda

• Motivation
• Sparkling Refreshment
• Quick Cassandra Overview
• Connecting the Dots...
• Examples
• Resources

Page 5: Data Analytics with Apache Spark and Cassandra


Scalability

Page 6: Data Analytics with Apache Spark and Cassandra


Availability

Page 7: Data Analytics with Apache Spark and Cassandra


Resilience

Page 8: Data Analytics with Apache Spark and Cassandra


Page 9: Data Analytics with Apache Spark and Cassandra

Memory, CPUs, Network

Page 10: Data Analytics with Apache Spark and Cassandra


What is Apache Spark?

Spark is a fast and general engine for large-scale distributed data processing.

Fast. Functional.

// Classic word count with the RDD API ("spark" here is the SparkContext)
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))  // split each line into words
                 .map(word => (word, 1))            // pair each word with a count of 1
                 .reduceByKey(_ + _)                // sum the counts per word
counts.saveAsTextFile("hdfs://...")

Growing Ecosystem

Page 11: Data Analytics with Apache Spark and Cassandra


The Big Idea: express computations in terms of transformations and actions on a distributed data set.

Spark Core Concept: RDD => Resilient Distributed Dataset

Think of an RDD as an immutable, distributed collection of objects.

• Resilient => Can be reconstructed in case of failure
• Distributed => Transformations are parallelizable operations
• Dataset => Data loaded and partitioned across cluster nodes (executors)

RDDs are memory-intensive. Caching behavior is controllable.
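To make the transformation/action distinction concrete, here is a minimal sketch (assuming an existing SparkContext named sc; the HDFS paths are placeholders): transformations like flatMap and map only describe the computation, nothing runs until an action such as count or saveAsTextFile is invoked, and cache() controls whether the result is kept in memory for reuse.

// Minimal sketch: transformations are lazy, actions trigger execution.
// Assumes an existing SparkContext `sc`; the HDFS paths are placeholders.
val lines  = sc.textFile("hdfs://.../input.txt")   // RDD[String], nothing computed yet
val words  = lines.flatMap(_.split(" "))           // transformation: lazy
val pairs  = words.map(word => (word, 1))          // transformation: lazy
val counts = pairs.reduceByKey(_ + _).cache()      // mark the result for in-memory caching

val numDistinctWords = counts.count()              // action: the whole pipeline executes here
counts.saveAsTextFile("hdfs://.../counts")         // action: reuses the cached RDD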

Page 12: Data Analytics with Apache Spark and Cassandra

[Figure: an RDD is a collection of partitions distributed across the cluster]

Page 13: Data Analytics with Apache Spark and Cassandra

[Figure: word count pipeline, step 1: .textFile("...") loads the data into an RDD of partitions, then .flatMap(l => l.split(" ")) splits each line into words]

Page 14: Data Analytics with Apache Spark and Cassandra

[Figure: step 2: .map(w => (w, 1)) pairs every word with the count 1, partition by partition]

Page 15: Data Analytics with Apache Spark and Cassandra

[Figure: step 3: .reduceByKey(_ + _) first combines the counts per word inside each partition]

Page 16: Data Analytics with Apache Spark and Cassandra

[Figure: step 4: the partial counts are shuffled and summed into per-word totals]

Page 17: Data Analytics with Apache Spark and Cassandra

[Figure: step 5: the final per-word counts of the word count example]

Page 18: Data Analytics with Apache Spark and Cassandra

RDD Lineage

Each RDD keeps track of its parent. This is the basis for DAG scheduling and fault recovery.

val file = spark.textFile("hdfs://...")
val wordsRDD = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
val scoreRDD = wordsRDD.map { case (k, v) => (v, k) }

Lineage: HadoopRDD -> MappedRDD -> FlatMappedRDD -> MappedRDD -> MapPartitionsRDD -> ShuffleRDD -> MapPartitionsRDD (wordsRDD) -> MappedRDD (scoreRDD)

rdd.toDebugString is your friend.
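A quick way to see that lineage from the driver (a small sketch, reusing the wordsRDD and scoreRDD defined above):

// toDebugString prints the chain of parent RDDs: the DAG Spark uses for
// scheduling and for recomputing lost partitions.
println(wordsRDD.toDebugString)
println(scoreRDD.toDebugString)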

Page 19: Data Analytics with Apache Spark and Cassandra


What is Apache Cassandra?

Cassandra is a distributed, high-performance, scalable and fault-tolerant column-oriented "NoSQL" database.

Bigtable: Data Model
• wide rows, sparse arrays
• high write throughput

DynamoDB: Infrastructure
• P2P gossip
• "kv" store
• tunable consistency

Page 20: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

• Nodes use gossip to communicate ring state
• Data is distributed over the cluster
• Each node is responsible for a segment of tokens
• Data is replicated to n (configurable) nodes

Page 21: Data Analytics with Apache Spark and Cassandra


CREATE TABLE meetup.tweets (
  handle TEXT,
  ts     TIMESTAMP,
  txt    TEXT,
  PRIMARY KEY (handle, ts)
);

INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('maasg', 1441709070, 'working on my presentation');

maasg | 1441709070 | working on my presentation

Page 22: Data Analytics with Apache Spark and Cassandra


INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('maasg', 1441709070, 'working on my presentation');

INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('peter_v', 1441719070, 'meetup tonight!!!');

maasg   | 1441709070 | working on my presentation
peter_v | 1441721070 | meetup tonight!!!

Page 23: Data Analytics with Apache Spark and Cassandra


INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('maasg', 1441709070, 'working on my presentation');

INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('peter_v', 1441719070, 'meetup tonight!!!');

INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('maasg', 1441719110, 'almost ready');

maasg   | 1441709070 | working on my presentation
        | 1441719110 | almost ready
peter_v | 1441721070 | meetup tonight!!!

Page 24: Data Analytics with Apache Spark and Cassandra


handle (Partition Key) | ts (Clustering Key) | txt
-----------------------+---------------------+----------------------------
maasg                  | 1441709070          | working on my presentation
                       | 1441719110          | almost ready
peter_v                | 1441721070          | meetup tonight!!!
...

Page 25: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

[Figure: token ring with node tokens at 000, 200, 400, 600, 800, 1000]

maasg | 1441709070 | working on my presentation

Murmur3Hash("maasg") = 451
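As a toy sketch of that mapping (this is not Cassandra's real Murmur3 partitioner; the 0..1000 ring and node tokens mirror the simplified figure):

// Toy illustration only: a stand-in hash over a tiny 0..1000 ring.
// Real Cassandra hashes the partition key with Murmur3 into a 64-bit token space.
val ringSize   = 1000
val nodeTokens = Seq(0, 200, 400, 600, 800)        // each node owns a segment of the ring

def toyToken(partitionKey: String): Int =
  math.abs(partitionKey.hashCode) % ringSize       // stand-in for Murmur3Hash(partitionKey)

// This sketch assigns a token to the node with the largest start token <= the value.
def ownerOf(token: Int): Int =
  nodeTokens.filter(_ <= token).max

// In the slide's example, "maasg" hashes to 451, which falls in the segment starting at 400.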

Page 26: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

[Figure: the row (maasg, 1441709070, "working on my presentation") is stored on the node that owns token 451]

Page 27: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

[Figure: the row (peter_v, 1441721070, "meetup tonight!!!") is added to the ring]

Murmur3Hash("peter_v") = 42, so the peter_v row lands on the node that owns that token range.

Page 28: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

[Figure: the maasg and peter_v rows are stored on the nodes owning their respective tokens]

Page 29: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

[Figure: the new row (maasg, 1441719110, "almost ready") goes to the same partition, on the same node, as the other maasg row]

Page 30: Data Analytics with Apache Spark and Cassandra


Spark + Cassandra

Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector

Page 31: Data Analytics with Apache Spark and Cassandra


“This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications.”
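A minimal sketch of what that looks like in practice, assuming the meetup.tweets table from the earlier slides, a SparkContext `sc` configured with spark.cassandra.connection.host, and a hypothetical target table meetup.word_counts(word, count):

import com.datastax.spark.connector._              // adds cassandraTable / saveToCassandra

// Read the table as an RDD of CassandraRow
val tweets = sc.cassandraTable("meetup", "tweets")

// Word count over the tweet text
val counts = tweets.map(row => row.getString("txt"))
                   .flatMap(_.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)

// Write the result back to Cassandra (meetup.word_counts is a hypothetical table)
counts.saveToCassandra("meetup", "word_counts", SomeColumns("word", "count"))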

Page 32: Data Analytics with Apache Spark and Cassandra


Developer earnings vs tech skills

Page 33: Data Analytics with Apache Spark and Cassandra


[Figure: Spark RDD partitions mapped onto the Cassandra token ring (000..1000), aligning partitions with the token ranges each node owns]

Page 34: Data Analytics with Apache Spark and Cassandra


[Figure: the connector builds Spark partitions that follow Cassandra token ranges, preserving data locality]

cassandraTable, joinWithCassandraTable

repartitionByCassandraReplica
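A small sketch of those locality-aware operations, assuming the meetup.tweets table and the connector import from the previous example (the handles RDD is hypothetical):

import com.datastax.spark.connector._

// A hypothetical RDD of partition keys (handles) we want to look up in meetup.tweets
val handles = sc.parallelize(Seq("maasg", "peter_v")).map(Tuple1(_))

// Move each key to a Spark partition co-located with the Cassandra replica that owns it,
// then join every key against the matching rows of meetup.tweets
val tweetsByHandle = handles
  .repartitionByCassandraReplica("meetup", "tweets")
  .joinWithCassandraTable("meetup", "tweets")

// A table scan can also push a partition-key restriction down to Cassandra
val maasgTweets = sc.cassandraTable("meetup", "tweets").where("handle = ?", "maasg")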

Page 35: Data Analytics with Apache Spark and Cassandra


Examples

Page 36: Data Analytics with Apache Spark and Cassandra


Spark Notebook Software: https://github.com/andypetrella/spark-notebook

Meetup Notebooks: https://github.com/maasg/spark-notebooks

Page 37: Data Analytics with Apache Spark and Cassandra


Resources

• Project website: http://spark.apache.org/
• Spark presentations: http://spark-summit.org/2015
• Starting Questions: http://stackoverflow.com/questions/tagged/apache-spark
• More Advanced Questions: [email protected]
• Spark Code: https://github.com/apache/spark
• Getting involved: http://spark.apache.org/community.html

Page 38: Data Analytics with Apache Spark and Cassandra


Resources

• Project website: http://cassandra.apache.org/
• Community Site: www.planetcassandra.org
• Questions: http://stackoverflow.com/questions/tagged/cassandra
• Training: https://academy.datastax.com/
• Spark Cassandra Connector: https://github.com/datastax/spark-cassandra-connector
• Excellent deep dive into the data locality implementation: http://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1

Page 39: Data Analytics with Apache Spark and Cassandra


Resources

Spark-Notebook: https://github.com/andypetrella/spark-notebook

Meetup code: https://github.com/maasg/spark-notebooks

Slides (soon): http://www.virdata.com/category/tech/

Page 40: Data Analytics with Apache Spark and Cassandra


Acknowledgments

Page 41: Data Analytics with Apache Spark and Cassandra


Want to work with this exciting tech? We are hiring!