Transcript

Apache Spark

History and market overview

Martin Zapletal Cake Solutions

Apache Spark and Big Data

1) History and market overview

2) Installation

3) MLlib and machine learning on Spark

4) Porting R code to Scala and Spark

5) Concepts - Core, SQL, GraphX, Streaming

6) Spark’s distributed programming model

7) Deployment

Table of contents

● Motivation - why distributed data processing

● Market overview

● Brief history

● Hadoop MapReduce

● Apache Spark

● Other competitors

● Q & A

Motivation

● production of data in 2002 was around 5 exabytes, roughly 800 megabytes per person, with even more flowing through TV, radio and phone

● roughly double the 1999 volume

● importance of data for business and society

● must be stored, processed and analysed to get the value

● 3Vs of data

o Volume

o Velocity

o Variety

Distributed computing

● from supercomputers to cloud

o economical reasons

o gradual upgrades

o fault tolerance

o scalability

o versatility

o development speed

o ecosystem and tooling

o geographical distribution

o various models and technologies

Distributed computing

● the largest Yahoo Hadoop cluster has 4,500 nodes; 40,000 nodes in total, storing 455 petabytes

● Facebook Hadoop: 2,000 nodes, each with 12 TB storage, 32 GB RAM and 8-16 cores

● Kafka: 20 gigabytes/second at Yahoo; 460,000 writes/sec and 2,300,000 reads/sec at LinkedIn

● MongoDB 100 nodes, 20-30TB

Distributed computing

● need for new tools, approaches, philosophy, languages, theory

● the 8 fallacies of distributed computing

o e.g. the network is reliable, latency is zero, the network is secure

● complexity

o packet loss, ordering, acknowledgement, time, synchronization, reliable delivery

o many possible states and interleavings

o ubiquitous failures and the impact of distribution

● deployment

● theory

Big Data technologies

● distributed computing frameworks

o batch

o stream

● machine learning and data mining

● support tools

● message queues

● databases

● distributed computing primitives

● cluster operating systems, schedulers

● deployment tools

Distributing computation

● efficient use of resources

● ensuring the computation completes

● ensuring correct result

● different levels of abstraction

o GPU

o processes

o threads

o actors

o actor clusters and virtualized actors

o frameworks on top of actors

o distributed computing frameworks

● different computing models

o share nothing

o shared memory

o actors

o MapReduce (sketched below with plain collections)
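
The MapReduce model itself is simple enough to sketch with plain Scala collections. This is a single-machine illustration only; real frameworks distribute the same three phases across a cluster:

object LocalMapReduce {
  def main(args: Array[String]): Unit = {
    val lines = Seq("the quick brown fox", "the lazy dog")

    // Map phase: emit a (word, 1) pair for every token.
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle phase: group the pairs by key.
    val grouped = mapped.groupBy(_._1)

    // Reduce phase: sum the counts for each key.
    val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    counts.foreach(println)
  }
}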

Distributing computation

[Diagrams omitted: timelines (t1-t6 and t1-t3) contrasting how the data, network and computation phases overlap under different ways of distributing the computation]

Brief history

● Google File System 2003

● MapReduce 2004

● BigTable 2006

● Dremel 2010

● Colossus 2011

● Spanner 2012

● Amazon Dynamo 2007

Brief history

● Apache Hadoop

o HDFS file system

o HBase database

o MapReduce

o Apache Mahout

o Apache Hive

o Apache Pig

o Apache Drill

o YARN resource management, etc.

Hadoop MapReduce

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit a (word, 1) pair for every token in the input line.
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum all counts emitted for a given word.
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // the combiner pre-aggregates on the map side
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Apache Spark

● developed at UC Berkeley, now open source

● written in Scala, uses Akka

● compatible with existing Hadoop infrastructure

● APIs for Java, Scala and Python

● simple, expressive, functional and high level programming model

● speed

● in memory caching, query optimizations

● suitable for iterative and ad-hoc queries (ideal for ML)

● used in production at Yahoo, Amazon, ...

● Databricks raised ~$47M in the last year

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Apache Spark

● RDD (Resilient Distributed Dataset) - the core abstraction; see the sketch below

● deployment, installation, the programming model and what actually happens in the background will be covered in the next talks
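
A minimal sketch of working with an RDD, assuming an existing SparkContext named sc (the HDFS path is elided, as in the slides). Caching keeps a dataset in memory, which is what makes iterative and ad-hoc queries fast:

val logs = sc.textFile("hdfs://...")
val errors = logs.filter(_.contains("ERROR")).cache() // keep matching lines in memory

// Both actions reuse the cached data instead of re-reading from HDFS.
val total = errors.count()
val samples = errors.take(10)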

Competition

● a non-exhaustive list

● Akka cluster/remoting

o lower level abstraction

o more work for the developer

o more freedom (see the sketch below)
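
To illustrate that lower-level style, here is a minimal sketch using classic Akka actors (Akka 2.4+ API; all names are hypothetical). The developer defines the message protocol and lifecycle explicitly, concerns a framework like Spark handles automatically:

import akka.actor.{Actor, ActorSystem, Props}

// An actor that counts the words it receives.
class WordCounter extends Actor {
  var counts = Map.empty[String, Int].withDefaultValue(0)
  def receive = {
    case word: String =>
      counts += word -> (counts(word) + 1)
      println(s"$word -> ${counts(word)}")
  }
}

object ActorDemo extends App {
  val system = ActorSystem("demo")
  val counter = system.actorOf(Props[WordCounter], "wordCounter")
  Seq("spark", "akka", "spark").foreach(counter ! _)
  Thread.sleep(1000) // crude: let the actor drain its mailbox before shutdown
  system.terminate()
}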

Competition

● Intel GearPump

o built on top of Akka

o scalable, fault-tolerant and expressive solution

o a distributed stream processing solution competing with, for example, Storm

Competition

● Apache Flink

o written in Java, started in 2008 at the Technical University of Berlin, the Humboldt University of Berlin, and the Hasso Plattner Institute

o ASF Top-Level Project since early 2015

o fast

o a cost-based query optimizer that generalizes relational database query optimization to the distributed environment

o streaming

o api similar to Spark

Competition

● Apache Tez

o developed by Hortonworks, an ASF Top-Level Project since July 2014

o generalizes MapReduce to a more powerful framework based on expressing computations as dataflow graphs

o much richer api

o lower level than Spark or Flink, allowing some extra optimizations

Competition

● Apache Samza

o developed at LinkedIn, joined ASF in September 2013

o distributed stream processing framework

o uses Kafka (also developed at LinkedIn) and other data sources

● Apache Storm

o distributed unbounded stream processing framework

o a programming API to define graph topologies using Spouts (sources) and Bolts (processing nodes)

o used at Yahoo, Twitter, Yelp, Spotify, ...

Conclusion

● why distributed computing frameworks

● why Spark?

o concepts based on theory

o young and progressive, written in Scala

o already mature and production proven

o distributed computing, Big Data, data analysis increasingly important

o potential to replace the market-leading MapReduce in the Hadoop ecosystem

● why not?

o many competitors

o Spark may not always be the best fit

Questions
