Training Day | December 3rd
Beginner Track
• Introduction to Cassandra
• Introduction to Spark, Shark, Scala and Cassandra
Advanced Track
• Data Modeling
• Performance Tuning
Conference Day | December 4th
Cassandra Summit Europe 2014 will be the single largest gathering of Cassandra users in Europe. Learn how the world's most successful companies are transforming their businesses and growing faster than ever using Apache Cassandra.
http://bit.ly/cassandrasummit2014
Johnny Miller – Cassandra + Spark = Awesome – NoSQL matters Barcelona 2014
Cassandra + Spark = Awesome
Johnny Miller, Solutions Architect @CyanMiller www.linkedin.com/in/johnnymiller
• Unlimited, free use of the software in DataStax Enterprise.
• No limit on number of nodes or other hidden restrictions.
• If you’re a startup, it’s free!
www.datastax.com/startups
What is Apache Cassandra?
Apache Cassandra™ is a massively scalable NoSQL OLTP database. Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance.
Cassandra is:
• A highly distributed database
• Low latency – very near real-time
• 100% availability – No SPOF
• Highly scalable – Linear Scalability
• Wide Column Store
• Disk Optimised
What is Apache Cassandra?
• Masterless architecture with read/write anywhere design.
• Continuous availability with no single point of failure.
• Multi-data center and cloud availability zone support.
• Linear scale performance with online capacity expansion.
• CQL – SQL-like language.
[Diagram: clusters of increasing size, with throughput rising from 100,000 txns/sec to 200,000 txns/sec to 400,000 txns/sec as nodes are added.]
“In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput.”
Solving Big Data Challenges for Enterprise Application Performance Management, Tilmann Rabl, et al., August 2013, p. 10. Benchmark paper presented at the Very Large Database Conference, 2013.
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2013.pdf
Netflix Cloud Benchmark…
End Point Independent NoSQL Benchmark
Highest in throughput…
Lowest in latency…
Cassandra: A Leader in Performance
• Cassandra was designed with the understanding that system/hardware failures can and do occur
• Peer-to-peer, distributed system
• All nodes the same
• Data partitioned among all nodes in the cluster
• Custom data replication to ensure fault tolerance
Cassandra Architecture Overview
[Diagram: a ring of five peer nodes, Node 1 to Node 5.]
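The bullet "Data partitioned among all nodes in the cluster" can be pictured with a small token ring. What follows is only a sketch under stated assumptions: `RingSketch`, `token`, `buildRing` and `ownerOf` are hypothetical names, MD5 is used purely as a convenient stand-in hash (it is not claimed to be the partitioner a real cluster uses), and each node gets a single position on the ring.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import scala.collection.immutable.TreeMap

object RingSketch {
  // Hash a string to a 64-bit token. MD5 is an illustrative stand-in,
  // not the partitioner of a real Cassandra cluster.
  def token(key: String): Long = {
    val digest = MessageDigest.getInstance("MD5")
      .digest(key.getBytes(StandardCharsets.UTF_8))
    ByteBuffer.wrap(digest).getLong // first 8 bytes of the digest
  }

  // Place each node on the ring at the token of its own name (assumption:
  // one position per node).
  def buildRing(nodes: Seq[String]): TreeMap[Long, String] =
    nodes.foldLeft(TreeMap.empty[Long, String])((ring, n) => ring + (token(n) -> n))

  // The owner is the first node at or after the key's token, wrapping
  // around to the lowest token when the key hashes past the last node.
  def ownerOf(ring: TreeMap[Long, String], partitionKey: String): String = {
    val it = ring.iteratorFrom(token(partitionKey))
    if (it.hasNext) it.next()._2 else ring.head._2
  }
}
```

Because every node can compute the same mapping, any node can work out where a given partition key lives, which is the property a masterless, peer-to-peer design relies on.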
• Multi data centre support out of the box
• Configurable replication factor
• Configurable data consistency per request
• Active-Active replication architecture
Cassandra Architecture Overview
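[Diagram: two rings of five nodes (Node 1 to Node 5). In one ring, Node 1 holds the 1st copy and Node 2 the 2nd copy; in the other, Node 1 holds the 1st copy, Node 2 the 2nd copy and Node 3 the 3rd copy. Data centres shown: DC: USA and DC: EU.]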
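The two configurable knobs above, replication factor and per-request consistency, interact through simple arithmetic. The sketch below is illustrative only: `ReplicationSketch` and its functions are hypothetical names, and placing copies on consecutive nodes is an assumption taken from the 1st/2nd/3rd copy labels in the diagram, not a description of how replicas are placed across data centres.

```scala
object ReplicationSketch {
  // Replicas: the owning node plus the next (rf - 1) nodes around the ring.
  // Assumption: simple consecutive placement within one ring.
  def replicas(ring: Vector[String], ownerIndex: Int, rf: Int): Vector[String] =
    Vector.tabulate(math.min(rf, ring.size))(i => ring((ownerIndex + i) % ring.size))

  // A quorum is a majority of the replicas: floor(rf / 2) + 1.
  def quorum(rf: Int): Int = rf / 2 + 1

  // A read is guaranteed to touch at least one replica that acknowledged the
  // latest write when the read and write replica counts together exceed rf.
  def readSeesLatestWrite(readReplicas: Int, writeReplicas: Int, rf: Int): Boolean =
    readReplicas + writeReplicas > rf
}
```

With a replication factor of 3, a quorum is 2 replicas, so quorum writes paired with quorum reads overlap (2 + 2 > 3), while writing to one replica and reading from one replica does not (1 + 1 is not greater than 3). That is the trade-off a per-request consistency setting lets each query choose.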
Cassandra Query Language
CREATE TABLE sporty_league (
  team_name varchar,
  player_name varchar,
  jersey int,
  PRIMARY KEY (team_name, player_name)
);
SELECT * FROM sporty_league WHERE team_name = 'Mighty Mutts' AND player_name = 'Lucky';
INSERT INTO sporty_league (team_name, player_name, jersey) VALUES ('Mighty Mutts', 'Felix', 90);
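In PRIMARY KEY (team_name, player_name), the first column is the partition key and the second is a clustering column that orders rows inside the partition. A toy in-memory model may make that layout easier to picture. Everything here is hypothetical and for illustration only (`SportyLeagueSketch`, `Player`, `insert`, `select`, `selectTeam`); it imitates the shape of the data, not how Cassandra stores it.

```scala
import scala.collection.immutable.TreeMap

object SportyLeagueSketch {
  final case class Player(teamName: String, playerName: String, jersey: Int)

  // Partition key (team_name) -> rows kept sorted by clustering column (player_name).
  type Table = Map[String, TreeMap[String, Player]]

  // Like a CQL INSERT, this is an upsert: a row with the same primary key
  // replaces the existing one.
  def insert(table: Table, p: Player): Table = {
    val partition = table.getOrElse(p.teamName, TreeMap.empty[String, Player])
    table.updated(p.teamName, partition.updated(p.playerName, p))
  }

  // Lookup by full primary key, as in the SELECT statement.
  def select(table: Table, teamName: String, playerName: String): Option[Player] =
    table.get(teamName).flatMap(_.get(playerName))

  // Lookup by partition key only: every row for a team, ordered by player_name.
  def selectTeam(table: Table, teamName: String): List[Player] =
    table.get(teamName).map(_.values.toList).getOrElse(Nil)
}
```

The design point the model tries to show is that all rows for one team sit together in one partition, so a query that names the team can be answered from one place.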
Adoption
http://db-engines.com/en/ranking November 2014
Performance & Scale
DataStax works for small to huge deployments.
• DataStax Enterprise footprint @ Netflix
  • 80+ Clusters
  • 2500+ nodes
  • 4 Data Centres (Amazon Regions)
  • > 1 Trillion transactions per day
See: http://www.datastax.com/resources/casestudies/netflix
Cassandra Use Cases
• Playlists/Collections
• Personalisation/Recommendation
• Messaging
• Fraud Detection
• Internet of Things/Sensor Data
• Time Series
Spark Streaming Example
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair operations such as reduceByKey (Spark 1.x)
import com.datastax.spark.connector.streaming._

// Spark connection options
val conf = new SparkConf(true)...

// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))

// stream input
val lines = ssc.socketTextStream(serverIP, serverPort)

// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// stream output
wordCounts.saveToCassandra("test", "words")

// start processing
ssc.start()
ssc.awaitTermination()
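To see what the "count words" step computes for a single batch, the same split, pair and sum pipeline can be run on an ordinary Scala collection. This is a sketch on plain collections, not Spark code: `WordCountSketch` and `countWords` are hypothetical names, and `groupBy` followed by a per-key `reduce` stands in for `reduceByKey`.

```scala
object WordCountSketch {
  // One batch of input lines in, one (word -> count) map out.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                // split every line into words
      .map(word => (word, 1))               // pair each word with a count of 1
      .groupBy(_._1)                        // gather the pairs for each word
      .map { case (word, pairs) =>          // sum the counts per word
        word -> pairs.map(_._2).reduce(_ + _)
      }
}
```

In the streaming job this computation is applied to each 1 second batch, and each resulting (word, count) pair becomes a row written to the test.words table.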
Spark SQL
• SQL-92 and HiveQL compatible query engine
• Currently only SELECT and INSERT queries
• Support for in-memory computation
• Pushdown of predicates to Cassandra when possible
Spark SQL and HQL Example
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext
import com.datastax.spark.connector._

// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)

// Create Cassandra SQL context
val cc = new CassandraSQLContext(sc)

// Execute SQL query
val rdd = cc.sql("INSERT INTO ks.t1 SELECT c1,c2 FROM ks.t2")

// Execute HQL query (named differently so it does not redeclare rdd)
val hqlRdd = cc.hql("SELECT * FROM keyspace.table JOIN ... WHERE ...")
Spark
• The next big thing!
• Simple to use
• Works great with Cassandra
• Fast distributed processing – faster than MapReduce
• Streaming
• Machine Learning