DATA-CENTRIC METAPROGRAMMING Vlad Ureche
Data-centric Metaprogramming by Vlad Ureche

Feb 21, 2017

Data & Analytics

Spark Summit
Transcript
Page 1: Data-centric Metaprogramming by Vlad Ureche

DATA-CENTRIC METAPROGRAMMING

Vlad Ureche

Vlad Ureche
PhD in the Scala Team @ EPFL. Soon to graduate ;)

● Working on program transformations focusing on data representation
● Author of miniboxing, which improves generics performance by up to 20x
● Contributed to the Scala compiler and to the scaladoc tool

@VladUreche

Research ahead*

* This may not make it into a product. But you can play with it nevertheless.

STOP

Please ask if things are not clear!

Motivation

Transformation

Applications

Challenges

Conclusion

Spark

Motivation

Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data and used with permission.

Performance gap between RDDs and DataFrames

Motivation

RDD
● strongly typed
● slower

DataFrame
● dynamically typed
● faster

Dataset
● strongly typed
● faster, but only mid-way

Why just mid-way?
What can we do to speed them up?

Object Composition

class Vector[T] { … }
The Vector collection in the Scala library

class Employee(...)
ID NAME SALARY
Corresponds to a table row

Vector[Employee]
ID NAME SALARY
ID NAME SALARY

Traversal requires dereferencing a pointer for each employee.

A Better Representation

Vector[Employee]
ID NAME SALARY
ID NAME SALARY

EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...

● more efficient heap usage
● faster iteration
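The two layouts can be sketched in plain Scala. This is only an illustration: the field types of Employee and the shape of EmployeeVector are assumptions, since the slides name just the columns ID, NAME and SALARY.

```scala
object LayoutSketch {
  // Hypothetical field types; the slides only name the columns.
  final case class Employee(id: Int, name: String, salary: Float)

  // Column-oriented layout: one array per field instead of one object per row.
  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float]) {
    def length: Int = ids.length
  }

  // Row-oriented traversal: every element is a separate heap object,
  // so reading a salary follows a pointer to the Employee first.
  def totalSalaryRows(employees: Vector[Employee]): Float = {
    var sum = 0.0f
    for (e <- employees) sum += e.salary
    sum
  }

  // Column-oriented traversal: the salaries sit next to each other in one array.
  def totalSalaryColumns(employees: EmployeeVector): Float = {
    var sum = 0.0f
    var i = 0
    while (i < employees.length) {
      sum += employees.salaries(i)
      i += 1
    }
    sum
  }
}
```

Both methods compute the same total; only the memory layout they walk over differs.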

The Problem

● Vector[T] is unaware of Employee
  – Which makes Vector[Employee] suboptimal
● Not limited to Vector, other classes also affected
  – Spark pain point: Functions/closures
  – We'd like a "structured" representation throughout

Challenge: No means of communicating this to the compiler

Choice: Safe or Fast

This is where my work comes in...

Data-Centric Metaprogramming

● Compiler plug-in that allows tuning data representation
● Website: scala-ildl.org

Transformation

Definition (programmer)
● can't be automated
● based on experience
● based on speculation
● one-time effort

Application (compiler, automated)
● repetitive and complex
● affects code readability
● is verbose
● is error-prone

Data-Centric Metaprogramming

object VectorOfEmployeeOpt extends Transformation {

  // What to transform? What to transform to?
  type Target = Vector[Employee]
  type Result = EmployeeVector

  // How to transform?
  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...

  // How to run methods on the updated representation?
  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}
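To make the three parts of a transformation description concrete, here is a self-contained sketch that compiles without the plug-in. The Transformation trait below is a hypothetical stand-in, not the plug-in's actual base type; the Employee fields are assumed; and the bypass methods take the receiver as an explicit first parameter, which is an assumption that differs from the elided signatures on the slide.

```scala
object TransformationSketch {
  // Hypothetical field types; the slides only name ID, NAME, SALARY.
  final case class Employee(id: Int, name: String, salary: Float)

  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float])

  // Stand-in for the plug-in's base type (assumption, for illustration only).
  trait Transformation {
    type Target
    type Result
    def toResult(t: Target): Result
    def toTarget(r: Result): Target
  }

  object VectorOfEmployeeOpt extends Transformation {
    type Target = Vector[Employee]
    type Result = EmployeeVector

    // Rows to columns.
    def toResult(t: Target): Result =
      new EmployeeVector(
        t.map(_.id).toArray,
        t.map(_.name).toArray,
        t.map(_.salary).toArray)

    // Columns back to rows.
    def toTarget(r: Result): Target =
      Vector.tabulate(r.ids.length)(i => Employee(r.ids(i), r.names(i), r.salaries(i)))

    // Methods that run directly on the optimized representation.
    def bypass_length(r: Result): Int = r.ids.length
    def bypass_apply(r: Result, i: Int): Employee =
      Employee(r.ids(i), r.names(i), r.salaries(i))
  }
}
```

Converting a vector with toResult and back with toTarget yields an equal vector, which is the property the two conversions are expected to satisfy.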

http://infoscience.epfl.ch/record/207050?ln=en

Motivation

Transformation

Applications

Challenges

Conclusion

Spark

Open World

Best Representation?

Composition

Scenario

class Employee(...)
ID NAME SALARY

class Vector[T] { … }

Vector[Employee]
ID NAME SALARY
ID NAME SALARY

EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...

class NewEmployee(...) extends Employee(...)
ID NAME SALARY DEPT

Oooops...

Open World Assumption

● Globally anything can happen
● Locally you have full control:
  – Make class Employee final or
  – Limit the transformation to code that uses Employee

How? Using Scopes!

Scopes

transform(VectorOfEmployeeOpt) {

  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee <- employees)
      yield employee.copy(salary = (1 + by) * employee.salary)

}

Now the method operates on the EmployeeVector representation.
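As a rough picture of what "operates on the EmployeeVector representation" means, the sketch below puts the method as written next to a hand-written columnar version. The second method is a guess at the general shape of the result, not the plug-in's actual output; the Employee fields and EmployeeVector layout are assumptions.

```scala
object ScopeSketch {
  // Hypothetical field types; the slides only name ID, NAME, SALARY.
  final case class Employee(id: Int, name: String, salary: Float)

  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float])

  // What the programmer writes inside the scope.
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee <- employees)
      yield employee.copy(salary = (1 + by) * employee.salary)

  // Hand-written guess at the transformed shape: same logic,
  // but reading and writing the column arrays directly.
  def indexSalaryColumns(employees: EmployeeVector, by: Float): EmployeeVector =
    new EmployeeVector(
      employees.ids.clone(),
      employees.names.clone(),
      employees.salaries.map(s => (1 + by) * s))
}
```

The columnar version never allocates an Employee object; only the salary column is recomputed.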

Scopes

● Can wrap statements, methods, even entire classes
  – Inlined immediately after the parser
  – Definitions are visible outside the "scope"
● Mark locally closed parts of the code
  – Incoming/outgoing values go through conversions
  – You can reject unexpected values

Best Representation?

It depends.

Vector[Employee]
ID NAME SALARY
ID NAME SALARY

EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...

Tungsten repr.
<compressed binary blob>

EmployeeJSON
{ id: 123, name: "John Doe", salary: 100 }

Scopes allow mixing data representations

transform(VectorOfEmployeeOpt) {

  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee <- employees)
      yield employee.copy(salary = (1 + by) * employee.salary)

}

The same method can be wrapped in different scopes:
● transform(VectorOfEmployeeOpt): operating on the EmployeeVector representation.
● transform(VectorOfEmployeeCompact): operating on the compact binary representation.
● transform(VectorOfEmployeeJSON): operating on the JSON-based representation.

Composition

● Code can be
  – Left untransformed (using the original representation)
  – Transformed using different representations

Calling:
● Original code calling original code
  – Easy one. Do nothing.
● Original code calling transformed code, or transformed code calling original code
  – Automatically introduce conversions between values in the two representations, e.g. EmployeeVector → Vector[Employee] or back
● Transformed code calling transformed code, same transformation
  – Hard one. Do not introduce any conversions. Even across separate compilation.
● Transformed code calling transformed code, different transformation
  – Hard one. Automatically introduce double conversions (and warn the programmer), e.g. EmployeeVector → Vector[Employee] → CompactEmpVector

The same combinations arise for overriding.

Scopes

trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

class EmployeePrinter extends Printer[Employee] {
  def print(employee: Vector[Employee]) = ...
}

Method print in the class implements method print in the trait.

transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] {
    def print(employee: Vector[Employee]) = ...
  }
}

The signature of method print changes according to the transformation → it no longer implements the trait.

Taken care of by the compiler for you!
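One way to picture how the override can be preserved is a bridge method that keeps the trait's signature and forwards to the transformed method. The sketch below writes such a bridge by hand; whether the plug-in generates exactly this shape is an assumption, and the Employee fields, the toResult helper and the printed buffer are all hypothetical.

```scala
object BridgeSketch {
  import scala.collection.mutable.ArrayBuffer

  // Hypothetical field types; the slides only name ID, NAME, SALARY.
  final case class Employee(id: Int, name: String, salary: Float)

  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float])

  // Hypothetical conversion from rows to columns.
  def toResult(v: Vector[Employee]): EmployeeVector =
    new EmployeeVector(v.map(_.id).toArray, v.map(_.name).toArray, v.map(_.salary).toArray)

  trait Printer[T] {
    def print(elements: Vector[T]): Unit
  }

  class EmployeePrinter extends Printer[Employee] {
    // Collects output so the example stays observable without a console.
    val printed = ArrayBuffer.empty[String]

    // Transformed method: takes the optimized representation.
    def print(employees: EmployeeVector): Unit = {
      var i = 0
      while (i < employees.names.length) {
        printed += employees.names(i)
        i += 1
      }
    }

    // Bridge: keeps the trait's signature, converts, then forwards.
    def print(elements: Vector[Employee]): Unit = print(toResult(elements))
  }
}
```

Callers that only know Printer[Employee] still reach the transformed method through the bridge.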

Column-oriented Storage

Vector[Employee]
ID NAME SALARY
ID NAME SALARY

EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...

iteration is 5x faster

Retrofitting value class status

Tuples in Scala are specialized but are still objects (not value classes)
= not as optimized as they could be

(3,5) as an object: reference → [ Header | 3 | 5 ]
(3,5) as a long: (3L << 32) + 5

14x faster, lower heap requirements
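The long encoding of a pair of integers can be written out as follows. This is a minimal sketch of the bit layout only (first component in the upper 32 bits, second in the lower 32 bits); the object and method names are hypothetical.

```scala
object PackedPair {
  // First Int goes into the upper 32 bits, second Int into the lower 32 bits.
  // The mask keeps a negative second component from overwriting the upper half.
  def pack(first: Int, second: Int): Long =
    (first.toLong << 32) | (second.toLong & 0xFFFFFFFFL)

  def first(packed: Long): Int = (packed >>> 32).toInt
  def second(packed: Long): Int = packed.toInt
}
```

A packed pair is a primitive long, so it needs no object header and no reference.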

Deforestation

List(1,2,3).map(_ + 1).map(_ * 2).sum

Intermediate results: List(2,3,4), then List(4,6,8), then 18

transform(ListDeforestation) {
  List(1,2,3).map(_ + 1).map(_ * 2).sum
}

Inside the scope: accumulate function, accumulate function, compute: 18

6x faster
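The idea of accumulating the functions and computing once can be sketched by hand in plain Scala. This is not the ListDeforestation transformation itself, only an illustration of the fused computation it aims for; the object and method names are hypothetical.

```scala
object DeforestationSketch {
  // Eager pipeline: each map allocates an intermediate List.
  def eager(xs: List[Int]): Int = xs.map(_ + 1).map(_ * 2).sum

  // Fused by hand: the two functions are composed first,
  // then applied in a single pass with no intermediate List.
  def fused(xs: List[Int]): Int = {
    val inc: Int => Int = _ + 1
    val dbl: Int => Int = _ * 2
    val step: Int => Int = inc.andThen(dbl)
    var sum = 0
    for (x <- xs) sum += step(x)
    sum
  }
}
```

Both versions return 18 for List(1,2,3).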

Research ahead*

* This may not make it into a product. But you can play with it nevertheless.

Spark

● Optimizations
  – DataFrames do deforestation
  – DataFrames do predicate push-down
  – DataFrames do code generation
    ● Code is specialized for the data representation
    ● Functions are specialized for the data representation
● RDDs do none of these. This is what makes them slower.
● Datasets do deforestation, predicate push-down and code generation

User Functions

serialized data → encoded data → decode → X → user function f → Y → encode → encoded data

● decode allocates an object (X), encode allocates an object (Y)

Modified user function (automatically derived by the compiler):

serialized data → encoded data → modified user function → encoded data

Nowhere near as simple as it looks

Challenge: Transformation not possible

● Example: Calling outside (untransformed) method
● Solution: Issue compiler warnings
  – Explain why it's not possible: due to the method call
  – Suggest how to fix it: enclose the method in a scope
● Reuse the machinery in miniboxing (scala-miniboxing.org)

Challenge: Internal API changes

● Spark internals rely on Iterator[T]
  – Requires materializing values
  – Needs to be replaced throughout the code base
  – By rather complex buffers
● Solution: Extensive refactoring/rewrite

Challenge: Automation

● Existing code should run out of the box
● Solution:
  – Adapt data-centric metaprogramming to Spark
  – Trade generality for simplicity
  – Do the right thing for most of the cases

Where are we now?

Prototype Hack

● Modified version of Spark core
  – RDD data representation is configurable
● It's very limited:
  – Custom data repr. only in map, filter and flatMap
  – Otherwise we revert to costly objects
  – Large parts of the automation still need to be done

Prototype Hack

sc.parallelize(/* 1 million */ records).
  map(x => ...).
  filter(x => ...).
  collect()

Not yet 2x faster, but 1.45x faster

Conclusion

● Object-oriented composition → inefficient representation
● Solution: data-centric metaprogramming
  – Opaque data → Structured data
  – Is it possible? Yes.
  – Is it easy? Not really.
  – Is it worth it? You tell me!

Thank you!

Check out scala-ildl.org.

Deforestation and Language Semantics

● Notice that we changed language semantics:
  – Before: collections were eager
  – After: collections are lazy
  – This can lead to effects reordering
● Such transformations are only acceptable with programmer consent
  – JIT compilers/staged DSLs can't change semantics
  – metaprogramming (macros) can, but it should be documented/opt-in
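Effects reordering is easy to observe with side-effecting functions. The sketch below uses a standard-library Iterator as a stand-in for a lazy collection (an assumption for illustration; it is not the deforestation transformation itself) and records the order in which the two functions run.

```scala
object EffectOrderSketch {
  import scala.collection.mutable.ListBuffer

  // Eager: the first map runs over the whole list before the second starts.
  def eagerOrder(xs: List[Int]): List[String] = {
    val log = ListBuffer.empty[String]
    xs.map { x => log += s"inc($x)"; x + 1 }
      .map { x => log += s"dbl($x)"; x * 2 }
    log.toList
  }

  // Lazy: each element flows through both functions before the next one starts.
  def lazyOrder(xs: List[Int]): List[String] = {
    val log = ListBuffer.empty[String]
    xs.iterator
      .map { x => log += s"inc($x)"; x + 1 }
      .map { x => log += s"dbl($x)"; x * 2 }
      .foreach(_ => ()) // force the pipeline
    log.toList
  }
}
```

For List(1, 2) the eager version logs inc(1), inc(2), dbl(2), dbl(3), while the lazy version logs inc(1), dbl(2), inc(2), dbl(3). The final values are the same; only the order of the effects changes.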

Code Generation

● Also known as
  – Deep Embedding
  – Multi-Stage Programming
● Awesome speedups, but restricted to small DSLs
● SparkSQL uses code gen to improve performance
  – By 2-4x over Spark

Low-level Optimizers

● Java JIT Compiler
  – Access to the low-level code
  – Can assume a (local) closed world
  – Can speculate based on profiles
● Best optimizations break semantics
  – You can't do this in the JIT compiler!
  – Only the programmer can decide to break semantics

Scala Macros

● Many optimizations can be done with macros
  – :) Lots of power
  – :( Lots of responsibility
    ● Scala compiler invariants
    ● Object-oriented model
    ● Modularity
● Can we restrict macros so they're safer?
  – Data-centric metaprogramming