Stratosphere: System Overview Robert Metzger [email protected] Twitter: @rmetzger_ Big Data Beers Meetup, Nov. 19th, 2013

Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

May 11, 2015

Robert Metzger

Stratosphere is the next generation big data processing engine.

These slides introduce the most important features of Stratosphere by comparing it with Apache Hadoop.

For more information, visit stratosphere.eu

Based on university research, it is now a completely open-source, community-driven project with a focus on stability and usability.
Transcript
Page 1: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Stratosphere: System Overview

Robert Metzger [email protected]

Twitter: @rmetzger_

Big Data Beers Meetup, Nov. 19th, 2013

Page 2: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Stratosphere

… is a distributed data processing engine

… automatically handles parallelization

… brings database technology to the world of big data

Page 3: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Overview

● Extends MapReduce with more operators
● Support for advanced data flow graphs
● Compiler/Optimizer, Java/Scala Interface, YARN

[Figure: operators known from Hadoop (map, reduce) and new in Stratosphere (cross, join, cogroup); example data flows built from map (M), reduce (R), and join (J) operators]

Page 4: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Stratosphere System Stack

[Figure: layers of the stack, top to bottom]

APIs: Java API, Scala API, Meteor, ...
Stratosphere Optimizer
Stratosphere Runtime
Cluster Manager: YARN, Direct, EC2
Storage: Local Files, HDFS, S3, ...

Shown alongside in the figure: Hadoop MR, Hive

Page 5: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Stratosphere in a Cluster

● Operators are executed over the whole cluster
● Side by side with Hadoop
● Scales by adding more nodes
● Support for YARN is in development
● We have a LocalExecutor

[Figure: a master node (JobManager with job submission, resource management, compiler, and web interface) and 4 worker nodes, each running a Stratosphere TaskManager next to a Hadoop DataNode]

Page 6: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

1. Data Flows
2. Optimizer
3. Iterations
4. Scala Interface

Page 7: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

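Data Flows: Execution Models

[Figure: a data flow graph of map (M), reduce (R), and join (J) operators] One of many possible data flows in Stratosphere

[Figure: a single map (M) followed by a reduce (R)] Apache Hadoop MR is limited to one data flow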

Page 8: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Complex Data Flows in Hadoop

[Figure: a data flow with filtering, joining, and two grouping steps, built from several chained MapReduce (M R) jobs]

Page 9: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Data Flows: Lessons Learned

1. Most tasks do not fit the MapReduce model
2. Very expensive
  ○ Always go to disk and HDFS
3. Tedious to implement
  ○ Custom data types and file formats between jobs

That’s why higher level abstractions for MR exist.

Page 10: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Advanced Data Flows in Stratosphere

● Data flow graphs are supported natively
● Stratosphere only writes to disk if necessary, otherwise in-memory

[Figure: a data flow graph of map (M), reduce (R), and join (J) operators]

Page 11: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Skeleton of a Stratosphere Program

● Input: text file, JDBC source, CSV, etc.
● Transformations
  ○ map, reduce, join, iterate etc.
● Output: to file etc.
● Data Types
  ○ PactRecord: Tuples with n fields.
  ○ custom data types for vectors, images, audio (we only expect serialization and compare)

Page 12: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Data Flows: Code Example

    // Two text-file inputs
    FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
    FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);

    // Filter mapper (M)
    MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
        .input(orders).build();

    // Grouping reducer (R); keyField defines the group key
    ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
        .input(customers)
        .keyField(PactInteger.class, 0).build();

    // Join (J) on field 0 of both inputs
    MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
        .input1(ordersFiltered)
        .input2(groupedCustomers).build();

    // Final grouping reducer (R)
    ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
        .input(joined)
        .keyField(PactInteger.class, 0).build();

    FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy);

Page 13: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Map Stub and PactRecord by Example

    public class FilterOrders extends MapStub {
        @Override
        public void map(PactRecord order, Collector<PactRecord> out)
                throws Exception {
            PactString date = order.getField(Orders.DATE_IDX, PactString.class);
            if (date.getValue().equals("11.20.2013")) {
                out.collect(order);
            }
        }
    }

    MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
        .input(orders).build();

Page 14: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

1. Data Flows
2. Optimizer
3. Iterations
4. Scala Interface

Page 15: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Joins in Hadoop

[Figure: Map (Broadcast) Join and Reduce (Repartition) Join. Source: Sebastian Schelter, TU Berlin]

● Which strategy to choose?
● How to configure it?

Lessons Learned:

● Joins do not naturally fit MapReduce
● Very time consuming to implement
● Hand optimization necessary
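
The two join strategies named on this slide can be sketched in plain Java on in-memory lists. This is only a single-process illustration, not Hadoop or Stratosphere code: the class `JoinStrategies`, its method names, and the record layout (`int[]{key, value}`) are assumptions made for the example. A broadcast join gives every worker a full copy of the small input and hashes it; a repartition join routes both inputs by key so that matching keys meet in the same partition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; records are int[]{key, value}.
final class JoinStrategies {

    /** Map-side (broadcast) join: hash the small input, stream the large one past it. */
    static List<int[]> broadcastJoin(List<int[]> small, List<int[]> large) {
        Map<Integer, List<int[]>> table = new HashMap<>();
        for (int[] s : small) {
            table.computeIfAbsent(s[0], k -> new ArrayList<>()).add(s);
        }
        List<int[]> out = new ArrayList<>();
        for (int[] l : large) {
            for (int[] s : table.getOrDefault(l[0], Collections.emptyList())) {
                out.add(new int[] {l[0], s[1], l[1]}); // {key, small value, large value}
            }
        }
        return out;
    }

    /** Reduce-side (repartition) join: split both inputs by key, join each partition locally. */
    static List<int[]> repartitionJoin(List<int[]> left, List<int[]> right, int partitions) {
        List<List<int[]>> leftParts = split(left, partitions);
        List<List<int[]>> rightParts = split(right, partitions);
        List<int[]> out = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            // Inside one partition any local join works; the hash join is reused here.
            out.addAll(broadcastJoin(leftParts.get(p), rightParts.get(p)));
        }
        return out;
    }

    /** Assigns every record to partition (key mod partitions). */
    private static List<List<int[]>> split(List<int[]> records, int partitions) {
        List<List<int[]>> parts = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            parts.add(new ArrayList<>());
        }
        for (int[] rec : records) {
            parts.get(Math.floorMod(rec[0], partitions)).add(rec);
        }
        return parts;
    }
}
```

Both methods return the same set of joined rows; they differ in how much data has to move. In a real cluster the broadcast variant ships the small input to every node, while the repartition variant ships both inputs once.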

Page 16: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Joins with Stratosphere

● Natively implemented into the system
● Optimizer decides join strategy:
  ○ Sort-merge-join
  ○ Hybrid Hash Join
  ○ Data Shipping Strategy
● Hybrid Hash Join starts in-memory and gracefully degrades to disk
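
As a rough illustration of the first strategy in the list, here is a minimal in-memory sort-merge join in plain Java. It is a sketch under stated assumptions, not Stratosphere's implementation: the class name `SortMergeJoin` and the record layout (`int[]{key, value}`) are made up for the example, and nothing here spills to disk.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration class; records are int[]{key, value}.
final class SortMergeJoin {

    /** Sorts both inputs by key, then merges them with two cursors. */
    static List<int[]> join(List<int[]> left, List<int[]> right) {
        List<int[]> l = new ArrayList<>(left);
        List<int[]> r = new ArrayList<>(right);
        Comparator<int[]> byKey = Comparator.comparingInt((int[] rec) -> rec[0]);
        l.sort(byKey);
        r.sort(byKey);

        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < l.size() && j < r.size()) {
            int leftKey = l.get(i)[0];
            int rightKey = r.get(j)[0];
            if (leftKey < rightKey) {
                i++;
            } else if (leftKey > rightKey) {
                j++;
            } else {
                // Find the end of the run of equal keys on the right side.
                int runEnd = j;
                while (runEnd < r.size() && r.get(runEnd)[0] == leftKey) {
                    runEnd++;
                }
                // Pair every left record of this key with the whole right run.
                while (i < l.size() && l.get(i)[0] == leftKey) {
                    for (int k = j; k < runEnd; k++) {
                        out.add(new int[] {leftKey, l.get(i)[1], r.get(k)[1]});
                    }
                    i++;
                }
                j = runEnd;
            }
        }
        return out; // rows come out ordered by key
    }
}
```

The joined rows come out ordered by key, so a following step that needs its input grouped on the same key can consume them without sorting again.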

Page 17: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Optimizer Magic

Recap example job:

[Figure: the example data flow: M (filtering) and R (grouping) feed J (joining), followed by R (grouping)]

● We require a grouped input for the reducer (sorting or hashing)
● Optimizer chooses Sort-Merge-Join → no sorting for reduce

Page 18: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Stratosphere Optimizer

● Cost-based optimizer
  ○ Enumerate different execution plans
  ○ Choose the cheapest one
● Optimizer collects statistics
  ○ Size of input and output
● Operators (Map, Reduce, Join) tell how they modify fields
● In-memory chaining of operators
● Memory Distribution

⇒ Focus on your application logic rather than parallel execution.
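
The idea of "enumerate plans, choose the cheapest" can be shown with a toy example in plain Java. Everything here is an assumption made for illustration: the class `ShippingChoice`, the three candidate strategies, and the cost formulas, which count only bytes sent over the network in a common textbook simplification. This is not the cost model of the Stratosphere optimizer.

```java
// Hypothetical illustration of cost-based plan selection for one join.
final class ShippingChoice {

    enum Strategy { BROADCAST_FIRST, BROADCAST_SECOND, REPARTITION_BOTH }

    /** Toy cost: estimated bytes sent over the network for a strategy. */
    static long cost(Strategy strategy, long firstBytes, long secondBytes, int workers) {
        switch (strategy) {
            case BROADCAST_FIRST:
                return firstBytes * workers;   // every worker gets a full copy
            case BROADCAST_SECOND:
                return secondBytes * workers;
            default:
                return firstBytes + secondBytes; // each record is shipped once
        }
    }

    /** Enumerates all candidate strategies and returns the cheapest one. */
    static Strategy cheapest(long firstBytes, long secondBytes, int workers) {
        Strategy best = null;
        long bestCost = Long.MAX_VALUE;
        for (Strategy candidate : Strategy.values()) {
            long c = cost(candidate, firstBytes, secondBytes, workers);
            if (c < bestCost) {
                bestCost = c;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A tiny first input joined with a large second input on 10 workers.
        System.out.println(cheapest(1_000L, 1_000_000_000L, 10));
        // Two equally large inputs on 10 workers.
        System.out.println(cheapest(1_000_000_000L, 1_000_000_000L, 10));
    }
}
```

With these toy formulas a very small input is cheapest to broadcast, while two large inputs are cheapest to repartition. This is the kind of decision that input-size statistics make possible.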

Page 19: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

1. Data Flows
2. Optimizer
3. Iterations
4. Scala Interface

Page 20: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Algorithms that need iterations

● K-Means
● Gradient descent
● Page-Rank
● Logistic Regression
● Path algorithms on graphs
● Graph communities / dense sub-components
● Inference (belief propagation)

Page 21: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Why Iterations?

● Many algorithms loop over the data
  ○ Machine learning: iteratively refine the model
  ○ Graph processing: propagate information hop by hop

Example: Connected Components

[Figure: a graph of seven vertices; the labels shown on the vertices change with each iteration]

Initial Input:  1 2 3 4 5 6 7
1st Iteration:  1 1 2 2 5 5 5
2nd Iteration:  1 1 1 1 5 5 5
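
The connected components example can be reproduced with a small label-propagation loop in plain Java. This is a sketch, not Stratosphere code: the class `ConnectedComponents` is hypothetical, and the edge list used below is an assumption, since the transcript does not preserve the edges of the figure. It was chosen because it yields the same labels per iteration as the slide.

```java
import java.util.Arrays;

// Hypothetical illustration of connected components by minimum-label propagation.
final class ConnectedComponents {

    /** One synchronous step: each vertex takes the smallest label among itself and its neighbours. */
    static int[] step(int[] labels, int[][] edges) {
        int[] next = labels.clone();
        for (int[] edge : edges) {
            int a = edge[0];
            int b = edge[1];
            // Always read from the old labels so that information moves one hop per step.
            next[a] = Math.min(next[a], labels[b]);
            next[b] = Math.min(next[b], labels[a]);
        }
        return next;
    }

    /** Repeats the step until no label changes any more. */
    static int[] run(int[] initial, int[][] edges) {
        int[] labels = initial;
        while (true) {
            int[] next = step(labels, edges);
            if (Arrays.equals(next, labels)) {
                return labels;
            }
            labels = next;
        }
    }

    public static void main(String[] args) {
        // Vertices 1..7; index 0 is unused. Each vertex starts with its own id as label.
        int[] labels = {0, 1, 2, 3, 4, 5, 6, 7};
        // Assumed edges (not recoverable from the transcript).
        int[][] edges = {{1, 2}, {2, 3}, {2, 4}, {5, 6}, {5, 7}};
        System.out.println(Arrays.toString(run(labels, edges)));
    }
}
```

With the assumed edges, the first step gives the labels 1 1 2 2 5 5 5 and the second gives 1 1 1 1 5 5 5 for vertices 1 to 7, after which nothing changes. The loop stops on its own once a step leaves all labels untouched, which is the termination check an iterative job needs.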

Page 22: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Iterations in Hadoop

● Loop is outside the system
  ○ Hard to program
  ○ Very poor performance

[Figure: a driver program spawns the 1st iteration, the 2nd iteration, ..., the n-th iteration, each as its own MapReduce (M R) job]

Usually each iteration is more than a single map and reduce!

Page 23: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Iterations in Stratosphere

● Loop is inside the system
  ○ Easy to program
  ○ Huge performance gains

[Figure: a single data flow of operators (M, C, R) in which part of the flow is enclosed in an Iterate operator]

Page 24: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

1. Data Flows
2. Optimizer
3. Iterations
4. Scala Interface

Page 25: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

● Functional, object-oriented programming language
● ScaLa = Scalable Language
● Very productive (few LOC)
● Feels like a scripting language
● No more UDFs
● Easy to integrate
● Runs in the JVM, is compatible with regular Java classes
● Basis for developing embedded domain specific languages (DSL)

Page 26: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Do more, write less!

Scala:

    class Person(val firstName: String, val lastName: String)

Java:

    public class Person {
        private final String firstName;
        private final String lastName;

        public Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }

        public String getFirstName() {
            return firstName;
        }

        public String getLastName() {
            return lastName;
        }
    }

Page 27: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Let the code speak

    val input = TextFile(textInput)
    val words = input.flatMap { line => line.split(" ") }
    val counts = words
      .groupBy { word => word }
      .count()
    val output = counts.write(wordsOutput, CsvOutputFormat())
    val plan = new ScalaPlan(Seq(output))
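
For comparison with the Java side of the deck, the same three steps (split lines into words, group by word, count) can be written on a local collection with Java streams. This sketch uses only the standard library and has nothing to do with the Stratosphere API; the class name `LocalWordCount` is made up for the example, and it runs in a single JVM rather than on a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical local word count, mirroring flatMap / groupBy / count.
final class LocalWordCount {

    /** Splits every line on single spaces and counts how often each word occurs. */
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))   // flatMap: line -> words
                .collect(Collectors.groupingBy(
                        Function.identity(),                        // groupBy: the word itself
                        TreeMap::new,                               // sorted keys for stable output
                        Collectors.counting()));                    // count per group
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("to be or", "not to be")));
    }
}
```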

Page 28: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Example in Scala

Java:

    FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
    FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);

    MapContract ordersFiltered = MapContract.builder(FilterOrders.class).input(orders).build();

    ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
        .input(customers)
        .keyField(PactInteger.class, 0)
        .build();

    MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
        .input1(ordersFiltered).input2(groupedCustomers).build();

    ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
        .input(joined)
        .keyField(PactInteger.class, 0)
        .build();

    FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy, "output: word counts");

Scala:

    val customers = DataSource(customersPath, CsvInputFormat[Customer])
    val orders = DataSource(ordersPath, CsvInputFormat[Order])

    val ordersFiltered = orders filter { order => order.date.equals("11.20.2013") }

    val groupedCustomers = customers groupBy { cust => cust.zip } reduceGroup { grp =>
      (grp.buffered.head.zip, grp.maxBy { _.total })
    }

    val joined = ordersFiltered.join(groupedCustomers)
      .where { ord => ord.c_id }
      .isEqualTo { cust => cust._1 }
      .map { (orders, cust) => cust }

    val max = joined groupBy { cust => cust.category_id } reduceGroup { _.maxBy { _.sum } }

    val output = max.write(outputPath, DelimitedOutputFormat(formatOutput.tupled))
    val plan = new ScalaPlan(Seq(output), "BDB Example")

Page 29: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Stratosphere: Database inspired Big Data Analytics

Summary: Feature Matrix

                 Map Reduce            Stratosphere

Operators        ● Map                 ● Map
                 ● Reduce              ● Reduce (multiple sort keys)
                                       ● Cross
                                       ● Join
                                       ● CoGroup
                                       ● Union
                                       ● Iterate, Iterate Delta

Composition      Only MapReduce        Arbitrary Data flows

Data Exchange    Batch through disk    Pipelined, in-memory (automatic spilling to disk)

Page 30: Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Get In Touch

Stratosphere is the next-generation open source Big Data Analytics Platform.

Quickstart: http://stratosphere.eu/quickstart

Website: http://stratosphere.eu

GitHub: https://github.com/stratosphere

Mailing List: https://groups.google.com/d/forum/stratosphere-dev

Twitter: @stratosphere_eu