Large-Scale Statistics with MonetDB and R Hannes Mühleisen DAMDID 2015, 2015-10-13
About Me
• Postdoc in the CWI Database Architectures group since 2012
• Amsterdam is nice. We have open positions.
• Special interest in data management for statistical analysis
• Various research & software projects in this space
Outline
• Column Store / MonetDB Introduction
• Connecting R and MonetDB
• Advanced Topics
• “R as a Query Language”
• “Capturing The Laws of Data Nature”
Column Stores / MonetDB Introduction
Postgres, Oracle, DB2, etc. (row stores):

Conceptual:
class         speed  flux
NX            1      3
Constitution  1      8
Galaxy        1      3
Defiant       1      6
Intrepid      1      1

Physical (on disk), one row after another:
NX 1 3 | Constitution 1 8 | Galaxy 1 3 | Defiant 1 6 | Intrepid 1 1
Column Store:

Physical (on disk), one column after another:
class: NX Constitution Galaxy Defiant Intrepid
speed: 1 1 1 1 1
flux:  3 8 3 6 1

Compression! (uniform values within a column compress very well)
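Why the columnar layout compresses so well can be sketched in a few lines (a language-neutral illustration in Python, not MonetDB code): run-length encoding collapses the long runs of identical values that a single column exhibits, while the interleaved values of a row store break those runs.

```python
# Illustration only (not MonetDB code): columnar layout creates long
# runs of identical values that run-length encoding (RLE) can exploit.

def rle(values):
    """Run-length encode a sequence into (value, count) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

rows = [("NX", 1, 3), ("Constitution", 1, 8), ("Galaxy", 1, 3),
        ("Defiant", 1, 6), ("Intrepid", 1, 1)]

# Column store: one contiguous array per attribute.
speed_col = [r[1] for r in rows]
print(rle(speed_col))                     # [(1, 5)]: one run, five values

# Row store: attributes interleave, so long runs rarely form.
interleaved = [v for r in rows for v in r]
print(len(rle(interleaved)))
```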
What is MonetDB?
• Strict columnar architecture OLAP RDBMS (SQL)
• Started by Martin Kersten and Peter Boncz ~1994
• Free & Open Open source, active development ongoing
• www.monetdb.org
Peter A. Boncz, Martin L. Kersten, and Stefan Manegold. 2008. Breaking the memory wall in MonetDB. Communications of the ACM 51, 12 (December 2008), 77-85. DOI=10.1145/1409360.1409380
MonetDB today
• Expanded C code
• MAL “DB assembly” & optimisers
• SQL to MAL compiler
• Memory-Mapped files
• Automatic indexing
Some MAL

EXPLAIN SELECT * FROM mtcars;

X_2 := sql.mvc();
X_3:bat[:oid,:oid] := sql.tid(X_2,"sys","mtcars");
X_6:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","mpg",0);
(X_9,r1_9) := sql.bind(X_2,"sys","mtcars","mpg",2);
X_12:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","mpg",1);
X_14 := sql.delta(X_6,X_9,r1_9,X_12);
X_15 := algebra.leftfetchjoin(X_3,X_14);
X_16:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","cyl",0);
(X_18,r1_18) := sql.bind(X_2,"sys","mtcars","cyl",2);
X_20:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","cyl",1);
X_21 := sql.delta(X_16,X_18,r1_18,X_20);
X_22 := algebra.leftfetchjoin(X_3,X_21);
“Invisible JOIN”
• Optimisers run on MAL code
• Efficient Column-at-a-time implementations
Performance…

[Figure: TPC-H SF-100, hot runs; average query time (s, log10 scale) per query, MonetDB vs. Postgres]
But statistics with SQL?
Integrate, not Reinvent

[Diagram: flexibility vs. efficiency; statistical toolkits are flexible, data management systems are efficient. Can we have both?]
Collect data → Load data → Filter, transform & aggregate data → Analyze & Plot → Publish paper
Collect data → Load data → Filter, transform & aggregate data → Analyze & Plot → Publish paper
Data keeps growing, and databases offer not really any analysis features.
Collect data → Load data → [Data Management System: Filter, transform & aggregate data] → [Statistical Toolkit: Analyze & Plot] → Publish paper
[Data Management System: Filter, transform & aggregate data] ↔ [Statistical Toolkit: Analyze & Plot]
Bridge the Gap
• Native operators, lazy evaluation
• Cheap data transfer
Previous Work
• MonetDB.R connector: on CRAN since 2013
• Embedded R in MonetDB: part of MonetDB since 2014
• MonetDBLite for R: preview release available

Also…
• Embedded Python/NumPy: in the next MonetDB release
MonetDB.R connector
Hannes Mühleisen and Thomas Lumley: Best of Both Worlds – Relational Databases and Statistics 25th International Conference on Scientific and Statistical Database Management (SSDBM2013)
DBI
• DBI is for R what JDBC is for Java
• Low-level interface to talk to SQL databases
• Drivers available for most relational databases
• Typically socket connection between R and DB
df <- dbGetQuery(con, "SELECT * FROM table")
DBI
• Works, but (generally)
• Serialising/deserialising large datasets is slow
• Data ingest is slow
• SQL knowledge required
dplyr
• Data reorganisation package in the “Hadleyverse”
• Works with data.frame, data.table, SQL DBs
• Maps relational operations (selection, projection, join, grouping etc.) to native R operators
• Lazy evaluation, call chaining
• MonetDB.R includes a dplyr compatibility layer
dplyr

In R:
ni <- select(n, first_name, last_name, race_desc, sex, birth_age)
ow <- filter(ni, as.integer(birth_age) > 66, sex == "MALE", race_desc == "WHITE")
print(ow)

Generated SQL:
SELECT "first_name" AS "first_name", "last_name" AS "last_name", "race_desc" AS "race_desc", "sex" AS "sex", "birth_age" AS "birth_age" FROM "ncvoter" WHERE CAST("birth_age" AS INTEGER) > 66.0 AND "sex" = 'MALE' AND "race_desc" = 'WHITE' LIMIT 10
dplyr
• Better, but
• Most (all) R packages cannot work with dplyr tables, so at some point data needs to be transferred.
• What if this dataset is large?
Embedded R in MonetDB
Relationally Integrated

[Diagram: relational query plan (⨝, σ, π) with R snippets as additional operators]

Statistical analysis as operators in relational queries
Table-producing

CREATE FUNCTION rapi01(i INTEGER) RETURNS TABLE (i INTEGER, d DOUBLE)
LANGUAGE R { data.frame(i=seq(1,i), d=42.0) };

SELECT i, d FROM rapi01(42) AS r WHERE i > 40;
Transformations (π)

CREATE FUNCTION rapi02(i INTEGER, j INTEGER, z INTEGER) RETURNS INTEGER
LANGUAGE R { i*sum(j)*z };

SELECT rapi02(i, j, 2) AS r02 FROM rval;
Filtering (σ)

CREATE FUNCTION rapi03(i INTEGER, z INTEGER) RETURNS BOOLEAN
LANGUAGE R { i > z };

SELECT * FROM rval WHERE rapi03(i, 2);
Aggregation

CREATE AGGREGATE kmeans(data FLOAT, ncluster INTEGER) RETURNS INTEGER
LANGUAGE R { kmeans(data, ncluster)$cluster };

SELECT cluster FROM (
  SELECT MIN(x) AS minx, MAX(x) AS maxx, kmeans(x, 5) AS cluster
  FROM xdata GROUP BY cluster
) AS cdata ORDER BY cluster;
Performance…

[Figure: execution time (s) vs. rows (1 K to 100 M, log scale) for PL/R-naive, PL/R-tuned, MonetDB, R-full, R-col and RInt]
Code Shipping
MonetDB.R 1.0.0, soon

> rf.fit <- randomForest(income~., data=training, mtry=2, ntree=10)
> predictions <- mdbapply(con, "t1", function(d) {
    p <- predict(rf.fit, type="prob", newdata=d)[,2]
    p[p > .9]
  })
MonetDBLite
MonetDBLite
• Socket serialisation/deserialisation for the client/server protocol is slow for large result sets.
• Too slow for many machine learning problems!
• Running a database server is cumbersome and overkill for a single R client
• Solution: run the entire database inside the R process
• Only copy ingest data / query results around in memory, which is fast
• Same interface as MonetDB.R: DBI/dplyr
https://goo.gl/jelaOy
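The in-process idea can be illustrated with SQLite, another embedded database that runs inside the host process (an analogy only, not MonetDBLite itself): there is no server to start, and results never cross a socket.

```python
# Analogy only: SQLite, like MonetDBLite, runs embedded in the host
# process, so query results are copied through memory, not a socket.
import sqlite3

con = sqlite3.connect(":memory:")  # the whole database lives in-process
con.execute("CREATE TABLE lineitem (id INTEGER, qty INTEGER)")
con.executemany("INSERT INTO lineitem VALUES (?, ?)",
                [(i, i % 50) for i in range(10_000)])

total, = con.execute("SELECT SUM(qty) FROM lineitem").fetchone()
print(total)   # 245000
```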
Quick Benchmark

SELECT * FROM lineitem (lineitem table with 10M rows):
Old (MAPI socket): 17.2 s
MonetDBLite: 0.4 s
Zero-Copy
Jonathan Lajus and Hannes Mühleisen: Efficient Data Management and Statistics with Zero-Copy Integration 26th International Conference on Scientific and Statistical Database Management (SSDBM2014)
[Diagram: a MonetDB BAT (BAT descriptor, column descriptors, head and tail arrays) next to an R SEXP (header plus array). The tail array is referenced by both: MonetDB's column descriptor and R's SEXP header point at the same memory.]

Dress-up + garbage collection fun
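The zero-copy hand-over can be sketched with Python's buffer protocol (an illustration of the idea, not MonetDB's implementation): a second descriptor is "dressed up" over the same memory, so no bytes are copied.

```python
# Illustration of zero-copy sharing: two descriptors, one buffer.
# The memoryview plays the role of the SEXP header dressed up over
# MonetDB's tail array; a write through one is visible through both.
from array import array

column = array("d", [1.0, 2.0, 3.0])  # the "database" column storage
vector = memoryview(column)           # second descriptor, same memory

vector[0] = 42.0                      # no copy was ever made...
print(column[0])                      # 42.0: ...the column sees the write
```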
Advanced Topics
R as a Query Language
Hannes Mühleisen, Alex Bertram and Maarten-Jan Kallen: Relational Optimizations for Statistical Analysis, Journal of Statistical Software (under review)
What is Renjin?
• R on the JVM
• Compatibility is paramount, not just academic exercise (e.g. automatic Fortran/C translations)
• R anywhere on any data format (e.g. Cloud environments)
• Increased performance through lazy evaluation, parallel execution, …
• Easy to plug any Java code into an R analysis, and easy to plug Renjin into Java projects
Abstraction in Renjin

> a <- 1:10^9
> a[1000000] <- NA  # harr harr

GNU R:
> system.time(print(anyNA(a)))[[3]]
[1] TRUE
[1] 0.001
> system.time(print(any(is.na(a))))[[3]]
[1] TRUE
[1] 2.23

Renjin:
> system.time(print(any(is.na(a))))[[3]]
[1] TRUE
[1] 0.05
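Renjin can rewrite any(is.na(a)) so the scan stops at the first missing value instead of materialising a full boolean vector first. The difference can be sketched in Python (illustration only):

```python
# Illustration: materialising an intermediate vs. short-circuiting.
import math

a = [float(i) for i in range(1_000_000)]
a[10] = float("nan")

# GNU-R-style: build all one million booleans, then scan them.
found_eager = any([math.isnan(v) for v in a])

# Renjin-style rewrite: a lazy scan that stops at element 10.
found_lazy = any(math.isnan(v) for v in a)

print(found_eager, found_lazy)   # True True
```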
“R as a query language”
• Observation 1: lots of data wrangling happening in R scripts
• subset(), merge(), aggregate(), dplyr::, data.table::, [, $
“R as a query language”
• Observation 2: things get slow quickly as vectors get longer
• Lots of optimisation opportunities, but how?
• State of the art: tactical optimisations and band-aids
“R as a query language”
• Proposal: treat R scripts as a declaration of intent (not as a procedural contract written in blood)
• Then we can optimise strategically!
Rule-based query optimisation
Optimisations
• Selection pushdown
• Data-parallel scheduling
• Function specialisation/vectorisation
• Common expression elimination/caching
• Redundant computation elimination
Static analysis?
Deferred Evaluation

a <- 1:1000
b <- a + 42
c <- b[1:10]
d <- min(c) / max(c)
print(d)

[Expression graph: / at the root over min and max, both over a subset ([) of (a + 42); nothing is evaluated until print(d) forces the result]
Pushdown

b <- factorial(a)
c <- b[1:10]
print(c)

[Before: subset (n=10) of factorial(a) (n=1000). After pushdown: factorial (n=10) of a subset of a (n=10), so only 10 factorials are computed]
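The rewrite can be sketched with a tiny deferred vector (hypothetical names, Python for illustration, not Renjin's actual machinery): because the elementwise call is recorded rather than executed, the subset can be pushed below it, and the function runs only on the elements that are actually needed.

```python
# Sketch of subset pushdown over deferred evaluation (illustrative).
import math

calls = 0
def counted_factorial(n):
    global calls
    calls += 1
    return math.factorial(n)

class Deferred:
    """Records an elementwise function application without running it."""
    def __init__(self, f, data):
        self.f, self.data = f, data

    def take(self, idx):
        # Pushdown rule: (map f data)[idx] == map f (data[idx])
        return [self.f(self.data[i]) for i in idx]

a = list(range(1, 1001))
b = Deferred(counted_factorial, a)  # nothing computed yet
c = b.take(range(10))               # the function runs only 10 times
print(c[:3], calls)                 # [1, 2, 6] 10
```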
Pushdown

[Figure: execution time (s) vs. dataset size (10^6 to 10^8 elements, log scale), GNU R vs. Renjin]
Recycling

for (i in 1:100) print((a[i] - min(a)) / (max(a) - min(a)))

[Expression graphs: the naive plan recomputes min(a), max(a) and their difference in every iteration; with recycling, those subexpressions are cached and only a[i] changes]
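Recycling amounts to caching loop-invariant subexpressions so they are computed once rather than once per iteration. A minimal sketch (illustrative, not Renjin internals):

```python
# Sketch of recycling: cache pure subexpressions keyed by their
# expression text, so loop-invariant work (min, range) runs once.
cache = {}
evals = 0

def recycled(key, thunk):
    global evals
    if key not in cache:
        evals += 1
        cache[key] = thunk()
    return cache[key]

a = [5.0, 1.0, 9.0, 3.0]
scaled = []
for i in range(len(a)):
    lo  = recycled("min(a)", lambda: min(a))
    rng = recycled("max(a)-min(a)", lambda: max(a) - min(a))
    scaled.append((a[i] - lo) / rng)

print(scaled)  # [0.5, 0.0, 1.0, 0.25]
print(evals)   # 2: both invariants computed once, not once per element
```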
Recycling

[Figure: execution time (s) vs. dataset size (10^6 to 10^8 elements, log scale): GNU R, Renjin, Renjin + Recycling]
svymean

agep <- svymean(~agep, svydsgn, se=TRUE)

for (i in 1:ncol(wts)) {
  repmeans[i,] <- t(colSums(wts[,i]*x*pw) / sum(pw*wts[,i]))
}
[…]
v <- crossprod(sweep(thetas, 2, meantheta, "-") * sqrt(rscales)) * scale
svymean

[Expression graph, 47512 elements: each replicate weight column wts[,i] spawns its own *, colSums, sum and / nodes feeding the final crossprod]
svymean

[Same expression graph with recycling: the repeated *, colSums and / subtrees are cached and reused across the replicate weight columns]
svymean

[Figure: execution time (s) vs. dataset size (47512, 1060060, 9093077 elements, log scale): GNU R, Renjin −opt, Renjin, Renjin 1t]
Capturing the Laws of Data Nature
Hannes Mühleisen, Martin Kersten and Stefan Manegold: Capturing the Laws of (Data) Nature, 7th Biennial Conference on Innovative Data Systems Research (CIDR), Jan. 2015
Statistical Models?
• Everyone has models; they encode our understanding of the world
• Everyone has data to train/fit and validate a model
• So far, the data management community has ignored these models
• But they hold precious domain knowledge!
Configuration + Measurement → Model!
Grouped by-source operation; convergence hints

Configuration + Fitted parameters → Measurement
[Figure: intensity (Jy) vs. frequency (GHz, 0.10–0.20) for source=17562, with fitted power law; alpha=-0.692, p=0.81266]
Model to function conversion (automatic)
Move to DB (automatic)
Approximate answer with zero IO*
Integrate & Intercept
• Integrate model fitting infrastructure into the data management system.
• Also: huge performance benefits for analysts!
• Intercept model fitting and validation operations by the user and store the model for later use.
• Storage format: model code + parameters
(1) Hypothesis from the data S, ν, I: does I ≈ p · ν^α hold?
(2) Validate the fit: R² = 0.92 ✓
(3) Store the model: I ≈ p · ν^α with fitted p, α per source S
(4) Query: S = 42, ν = 0.14, I = ?
(5) Answer from the model: I = 3.0 ± 0.05 ✓
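Steps (1) to (5) can be sketched end-to-end (synthetic, noise-free data; illustrative only): fit I ≈ p · ν^α by least squares in log-log space, store only p and α, and answer intensity queries from those two numbers without scanning the measurements.

```python
# Sketch: fit I = p * nu**alpha in log-log space, keep only the two
# fitted parameters, and answer queries from the model alone
# (synthetic noise-free data, illustrative only).
import math

p_true, alpha_true = 3.0, -0.7
nus = [0.10 + 0.01 * i for i in range(11)]        # 0.10 .. 0.20 GHz
intensities = [p_true * nu ** alpha_true for nu in nus]

# Least squares on log(I) = log(p) + alpha * log(nu).
xs = [math.log(nu) for nu in nus]
ys = [math.log(i) for i in intensities]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
alpha = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
p = math.exp(ybar - alpha * xbar)

def predict(nu):
    # "Zero IO" answer: only p and alpha are needed, no table scan.
    return p * nu ** alpha

print(round(alpha, 3), round(p, 3))
```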
But…
• What do we do if model parameters are not specified in the query?
• Sample data?
• Given multiple parameters, it is far from certain that all combinations of values are allowed in the model.
• Construct a filter?
Data & Model Changes
• What should we do if the user gives us a better model?
• Recompressing could be very expensive. Threshold for improvement?
• Changes in the data affect the model quality, too
• Switch models? Constant monitoring?
Multiple, partial or grouped
• There could be many models for a table, with overlapping parameters. Which one to pick?
• Models do not have to cover the entire table/column. “Patching”?
• Models could be fitted on aggregation results. Keep group counts?
Thank You Questions?
http://hannes.muehleisen.org
@hfmuehleisen