Data Analytics with Apache Spark and Cassandra

Gerard Maas
Page 1: Data Analytics with Apache Spark and Cassandra

Data Analytics with Apache Spark and Cassandra

BigData.be Meetup, 8/Sep/2015

Gerard Maas (@maasg), Data Processing Team Lead

Page 2: Data Analytics with Apache Spark and Cassandra


Page 3: Data Analytics with Apache Spark and Cassandra

Tweet a few keywords about your interests and experience. Use the hashtag "#bigdatabe".

Page 4: Data Analytics with Apache Spark and Cassandra


Agenda

• Motivation
• Sparkling Refreshment
• Quick Cassandra Overview
• Connecting the Dots...
• Examples
• Resources

Page 5: Data Analytics with Apache Spark and Cassandra


Scalability

Page 6: Data Analytics with Apache Spark and Cassandra


Availability

Page 7: Data Analytics with Apache Spark and Cassandra


Resilience

Page 8: Data Analytics with Apache Spark and Cassandra


Page 9: Data Analytics with Apache Spark and Cassandra

Memory, CPUs, Network

Page 10: Data Analytics with Apache Spark and Cassandra


What is Apache Spark?

Spark is a fast and general engine for large-scale distributed data processing.

Fast. Functional.

// Classic word count with the RDD API ("spark" here is the SparkContext)
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))  // split each line into words
                 .map(word => (word, 1))            // pair each word with a count of 1
                 .reduceByKey(_ + _)                // sum the counts per word
counts.saveAsTextFile("hdfs://...")

Growing Ecosystem

Page 11: Data Analytics with Apache Spark and Cassandra


The Big Idea: express computations in terms of transformations and actions on a distributed data set.

Spark Core Concept: RDD => Resilient Distributed Dataset

Think of an RDD as an immutable, distributed collection of objects.

• Resilient => Can be reconstructed in case of failure
• Distributed => Transformations are parallelizable operations
• Dataset => Data loaded and partitioned across cluster nodes (executors)

RDDs are memory-intensive. Caching behavior is controllable.
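To make the transformation/action distinction concrete, here is a minimal sketch (assuming an existing SparkContext named sc; the HDFS paths are placeholders): transformations like flatMap and map only describe the computation, nothing runs until an action such as count or saveAsTextFile is invoked, and cache() controls whether the result is kept in memory for reuse.

// Minimal sketch: transformations are lazy, actions trigger execution.
// Assumes an existing SparkContext `sc`; the HDFS paths are placeholders.
val lines  = sc.textFile("hdfs://.../input.txt")   // RDD[String], nothing computed yet
val words  = lines.flatMap(_.split(" "))           // transformation: lazy
val pairs  = words.map(word => (word, 1))          // transformation: lazy
val counts = pairs.reduceByKey(_ + _).cache()      // mark the result for in-memory caching

val numDistinctWords = counts.count()              // action: the whole pipeline executes here
counts.saveAsTextFile("hdfs://.../counts")         // action: reuses the cached RDD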

Page 12: Data Analytics with Apache Spark and Cassandra

[Figure: an RDD is a collection of partitions distributed across the cluster]

Page 13: Data Analytics with Apache Spark and Cassandra

[Figure: word count pipeline, step 1: .textFile("...") loads the data into an RDD of partitions, then .flatMap(l => l.split(" ")) splits each line into words]

Page 14: Data Analytics with Apache Spark and Cassandra

[Figure: step 2: .map(w => (w, 1)) pairs every word with the count 1, partition by partition]

Page 15: Data Analytics with Apache Spark and Cassandra

[Figure: step 3: .reduceByKey(_ + _) first combines the counts per word inside each partition]

Page 16: Data Analytics with Apache Spark and Cassandra

[Figure: step 4: the partial counts are shuffled and summed into per-word totals]

Page 17: Data Analytics with Apache Spark and Cassandra

[Figure: step 5: the final per-word counts of the word count example]

Page 18: Data Analytics with Apache Spark and Cassandra

RDD Lineage

Each RDD keeps track of its parent. This is the basis for DAG scheduling and fault recovery.

val file = spark.textFile("hdfs://...")
val wordsRDD = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
val scoreRDD = wordsRDD.map { case (k, v) => (v, k) }

Lineage: HadoopRDD -> MappedRDD -> FlatMappedRDD -> MappedRDD -> MapPartitionsRDD -> ShuffleRDD -> MapPartitionsRDD (wordsRDD) -> MappedRDD (scoreRDD)

rdd.toDebugString is your friend.
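A quick way to see that lineage from the driver (a small sketch, reusing the wordsRDD and scoreRDD defined above):

// toDebugString prints the chain of parent RDDs: the DAG Spark uses for
// scheduling and for recomputing lost partitions.
println(wordsRDD.toDebugString)
println(scoreRDD.toDebugString)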

Page 19: Data Analytics with Apache Spark and Cassandra


What is Apache Cassandra?

Cassandra is a distributed, high-performance, scalable and fault-tolerant column-oriented "NoSQL" database.

Bigtable: Data Model
• wide rows, sparse arrays
• high write throughput

DynamoDB: Infrastructure
• P2P gossip
• "kv" store
• tunable consistency

Page 20: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

• Nodes use gossip to communicate ring state
• Data is distributed over the cluster
• Each node is responsible for a segment of tokens
• Data is replicated to n (configurable) nodes

Page 21: Data Analytics with Apache Spark and Cassandra


CREATE TABLE meetup.tweets (
  handle TEXT,
  ts     TIMESTAMP,
  txt    TEXT,
  PRIMARY KEY (handle, ts)
);

INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('maasg', 1441709070, 'working on my presentation');

maasg | 1441709070 | working on my presentation

Page 22: Data Analytics with Apache Spark and Cassandra


INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('maasg', 1441709070, 'working on my presentation');

INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('peter_v', 1441719070, 'meetup tonight!!!');

maasg   | 1441709070 | working on my presentation
peter_v | 1441721070 | meetup tonight!!!

Page 23: Data Analytics with Apache Spark and Cassandra


INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('maasg', 1441709070, 'working on my presentation');

INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('peter_v', 1441719070, 'meetup tonight!!!');

INSERT INTO meetup.tweets (handle, ts, txt)
VALUES ('maasg', 1441719110, 'almost ready');

maasg   | 1441709070 | working on my presentation
        | 1441719110 | almost ready
peter_v | 1441721070 | meetup tonight!!!

Page 24: Data Analytics with Apache Spark and Cassandra


handle (Partition Key) | ts (Clustering Key) | txt
-----------------------+---------------------+----------------------------
maasg                  | 1441709070          | working on my presentation
                       | 1441719110          | almost ready
peter_v                | 1441721070          | meetup tonight!!!
...

Page 25: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

[Figure: token ring with node tokens at 000, 200, 400, 600, 800, 1000]

maasg | 1441709070 | working on my presentation

Murmur3Hash("maasg") = 451
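As a toy sketch of that mapping (this is not Cassandra's real Murmur3 partitioner; the 0..1000 ring and node tokens mirror the simplified figure):

// Toy illustration only: a stand-in hash over a tiny 0..1000 ring.
// Real Cassandra hashes the partition key with Murmur3 into a 64-bit token space.
val ringSize   = 1000
val nodeTokens = Seq(0, 200, 400, 600, 800)        // each node owns a segment of the ring

def toyToken(partitionKey: String): Int =
  math.abs(partitionKey.hashCode) % ringSize       // stand-in for Murmur3Hash(partitionKey)

// This sketch assigns a token to the node with the largest start token <= the value.
def ownerOf(token: Int): Int =
  nodeTokens.filter(_ <= token).max

// In the slide's example, "maasg" hashes to 451, which falls in the segment starting at 400.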

Page 26: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

[Figure: the row (maasg, 1441709070, "working on my presentation") is stored on the node that owns token 451]

Page 27: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

[Figure: the row (peter_v, 1441721070, "meetup tonight!!!") is added to the ring]

Murmur3Hash("peter_v") = 42, so the peter_v row lands on the node that owns that token range.

Page 28: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

[Figure: the maasg and peter_v rows are stored on the nodes owning their respective tokens]

Page 29: Data Analytics with Apache Spark and Cassandra


Cassandra Architecture

[Figure: the new row (maasg, 1441719110, "almost ready") goes to the same partition, on the same node, as the other maasg row]

Page 30: Data Analytics with Apache Spark and Cassandra


Spark + Cassandra

Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector

Page 31: Data Analytics with Apache Spark and Cassandra


“This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications.”
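A minimal sketch of what that looks like in practice, assuming the meetup.tweets table from the earlier slides, a SparkContext `sc` configured with spark.cassandra.connection.host, and a hypothetical target table meetup.word_counts(word, count):

import com.datastax.spark.connector._              // adds cassandraTable / saveToCassandra

// Read the table as an RDD of CassandraRow
val tweets = sc.cassandraTable("meetup", "tweets")

// Word count over the tweet text
val counts = tweets.map(row => row.getString("txt"))
                   .flatMap(_.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)

// Write the result back to Cassandra (meetup.word_counts is a hypothetical table)
counts.saveToCassandra("meetup", "word_counts", SomeColumns("word", "count"))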

Page 32: Data Analytics with Apache Spark and Cassandra


Developer earnings vs tech skills

Page 33: Data Analytics with Apache Spark and Cassandra


[Figure: Spark RDD partitions mapped onto the Cassandra token ring (000..1000), aligning partitions with the token ranges each node owns]

Page 34: Data Analytics with Apache Spark and Cassandra


[Figure: the connector builds Spark partitions that follow Cassandra token ranges, preserving data locality]

cassandraTable, joinWithCassandraTable

repartitionByCassandraReplica
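A small sketch of those locality-aware operations, assuming the meetup.tweets table and the connector import from the previous example (the handles RDD is hypothetical):

import com.datastax.spark.connector._

// A hypothetical RDD of partition keys (handles) we want to look up in meetup.tweets
val handles = sc.parallelize(Seq("maasg", "peter_v")).map(Tuple1(_))

// Move each key to a Spark partition co-located with the Cassandra replica that owns it,
// then join every key against the matching rows of meetup.tweets
val tweetsByHandle = handles
  .repartitionByCassandraReplica("meetup", "tweets")
  .joinWithCassandraTable("meetup", "tweets")

// A table scan can also push a partition-key restriction down to Cassandra
val maasgTweets = sc.cassandraTable("meetup", "tweets").where("handle = ?", "maasg")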

Page 35: Data Analytics with Apache Spark and Cassandra


Examples

Page 36: Data Analytics with Apache Spark and Cassandra


Spark Notebook Software: https://github.com/andypetrella/spark-notebook

Meetup Notebooks: https://github.com/maasg/spark-notebooks

Page 37: Data Analytics with Apache Spark and Cassandra


Resources

• Project website: http://spark.apache.org/
• Spark presentations: http://spark-summit.org/2015
• Starting Questions: http://stackoverflow.com/questions/tagged/apache-spark
• More Advanced Questions: [email protected]
• Spark Code: https://github.com/apache/spark
• Getting involved: http://spark.apache.org/community.html

Page 38: Data Analytics with Apache Spark and Cassandra


Resources

• Project website: http://cassandra.apache.org/
• Community Site: www.planetcassandra.org
• Questions: http://stackoverflow.com/questions/tagged/cassandra
• Training: https://academy.datastax.com/
• Spark Cassandra Connector: https://github.com/datastax/spark-cassandra-connector
• Excellent deep dive into the data locality implementation: http://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1

Page 39: Data Analytics with Apache Spark and Cassandra


Resources

Spark-Notebook: https://github.com/andypetrella/spark-notebook

Meetup code: https://github.com/maasg/spark-notebooks

Slides (soon): http://www.virdata.com/category/tech/

Page 40: Data Analytics with Apache Spark and Cassandra


Acknowledgments

Page 41: Data Analytics with Apache Spark and Cassandra


Want to work with this exciting tech? We are hiring!