Large-Scale Statistics with MonetDB and R Hannes Mühleisen DAMDID 2015, 2015-10-13
About Me
• Postdoc in the CWI Database Architectures group since 2012
• Amsterdam is nice. We have open positions.
• Special interest in data management for statistical analysis
• Various research & software projects in this space
Outline
• Column Store / MonetDB Introduction
• Connecting R and MonetDB
• Advanced Topics
• “R as a Query Language”
• “Capturing The Laws of Data Nature”
Column Stores / MonetDB Introduction
Postgres, Oracle, DB2, etc. (row stores):

Conceptual:
class         speed  flux
NX            1      3
Constitution  1      8
Galaxy        1      3
Defiant       1      6
Intrepid      1      1

Physical (on disk), one row after another:
NX 1 3 | Constitution 1 8 | Galaxy 1 3 | Defiant 1 6 | Intrepid 1 1
Column Store:

Physical (on disk), one column after another:
class: NX Constitution Galaxy Defiant Intrepid
speed: 1 1 1 1 1
flux:  3 8 3 6 1

Compression! (uniform values within a column compress very well)
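Why the columnar layout compresses so well can be sketched in a few lines (a language-neutral illustration in Python, not MonetDB code): run-length encoding collapses the long runs of identical values that a single column exhibits, while the interleaved values of a row store break those runs.

```python
# Illustration only (not MonetDB code): columnar layout creates long
# runs of identical values that run-length encoding (RLE) can exploit.

def rle(values):
    """Run-length encode a sequence into (value, count) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

rows = [("NX", 1, 3), ("Constitution", 1, 8), ("Galaxy", 1, 3),
        ("Defiant", 1, 6), ("Intrepid", 1, 1)]

# Column store: one contiguous array per attribute.
speed_col = [r[1] for r in rows]
print(rle(speed_col))                     # [(1, 5)]: one run, five values

# Row store: attributes interleave, so long runs rarely form.
interleaved = [v for r in rows for v in r]
print(len(rle(interleaved)))
```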
What is MonetDB?
• Strict columnar architecture OLAP RDBMS (SQL)
• Started by Martin Kersten and Peter Boncz ~1994
• Free & Open Open source, active development ongoing
• www.monetdb.org
Peter A. Boncz, Martin L. Kersten, and Stefan Manegold. 2008. Breaking the memory wall in MonetDB. Communications of the ACM 51, 12 (December 2008), 77-85. DOI=10.1145/1409360.1409380
MonetDB today
• Expanded C code
• MAL “DB assembly” & optimisers
• SQL to MAL compiler
• Memory-Mapped files
• Automatic indexing
Some MAL

EXPLAIN SELECT * FROM mtcars;

X_2 := sql.mvc();
X_3:bat[:oid,:oid] := sql.tid(X_2,"sys","mtcars");
X_6:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","mpg",0);
(X_9,r1_9) := sql.bind(X_2,"sys","mtcars","mpg",2);
X_12:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","mpg",1);
X_14 := sql.delta(X_6,X_9,r1_9,X_12);
X_15 := algebra.leftfetchjoin(X_3,X_14);
X_16:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","cyl",0);
(X_18,r1_18) := sql.bind(X_2,"sys","mtcars","cyl",2);
X_20:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","cyl",1);
X_21 := sql.delta(X_16,X_18,r1_18,X_20);
X_22 := algebra.leftfetchjoin(X_3,X_21);
“Invisible JOIN”
• Optimisers run on MAL code
• Efficient Column-at-a-time implementations
Performance…

[Figure: TPC-H SF-100, hot runs; average query time (s, log10 scale) per query, MonetDB vs. Postgres]
But statistics with SQL?
Integrate, not Reinvent

[Diagram: flexibility vs. efficiency; statistical toolkits are flexible, data management systems are efficient. Can we have both?]
Collect data → Load data → Filter, transform & aggregate data → Analyze & Plot → Publish paper
Collect data → Load data → Filter, transform & aggregate data → Analyze & Plot → Publish paper
Data keeps growing, and databases offer not really any analysis features.
Collect data → Load data → [Data Management System: Filter, transform & aggregate data] → [Statistical Toolkit: Analyze & Plot] → Publish paper
[Data Management System: Filter, transform & aggregate data] ↔ [Statistical Toolkit: Analyze & Plot]
Bridge the Gap
• Native operators, lazy evaluation
• Cheap data transfer
Previous Work
• MonetDB.R connector: on CRAN since 2013
• Embedded R in MonetDB: part of MonetDB since 2014
• MonetDBLite for R: preview release available

Also…
• Embedded Python/NumPy: in the next MonetDB release
MonetDB.R connector
Hannes Mühleisen and Thomas Lumley: Best of Both Worlds – Relational Databases and Statistics 25th International Conference on Scientific and Statistical Database Management (SSDBM2013)
DBI
• DBI is for R what JDBC is for Java
• Low-level interface to talk to SQL databases
• Drivers available for most relational databases
• Typically socket connection between R and DB
df <- dbGetQuery(con, "SELECT * FROM table")
DBI
• Works, but (generally)
• Serialising/deserialising large datasets is slow
• Data ingest is slow
• SQL knowledge required
dplyr
• Data reorganisation package in the “Hadleyverse”
• Works with data.frame, data.table, SQL DBs
• Maps relational operations (selection, projection, join, grouping etc.) to native R operators
• Lazy evaluation, call chaining
• MonetDB.R includes a dplyr compatibility layer
dplyr

In R:
ni <- select(n, first_name, last_name, race_desc, sex, birth_age)
ow <- filter(ni, as.integer(birth_age) > 66, sex == "MALE", race_desc == "WHITE")
print(ow)

Generated SQL:
SELECT "first_name" AS "first_name", "last_name" AS "last_name", "race_desc" AS "race_desc", "sex" AS "sex", "birth_age" AS "birth_age" FROM "ncvoter" WHERE CAST("birth_age" AS INTEGER) > 66.0 AND "sex" = 'MALE' AND "race_desc" = 'WHITE' LIMIT 10
dplyr
• Better, but
• Most (all) R packages cannot work with dplyr tables, so at some point data needs to be transferred.
• What if this dataset is large?
Embedded R in MonetDB
Relationally Integrated

[Diagram: relational query plan (⨝, σ, π) with R snippets as additional operators]

Statistical analysis as operators in relational queries
Table-producing

CREATE FUNCTION rapi01(i INTEGER) RETURNS TABLE (i INTEGER, d DOUBLE)
LANGUAGE R { data.frame(i=seq(1,i), d=42.0) };

SELECT i, d FROM rapi01(42) AS r WHERE i > 40;
Transformations (π)

CREATE FUNCTION rapi02(i INTEGER, j INTEGER, z INTEGER) RETURNS INTEGER
LANGUAGE R { i*sum(j)*z };

SELECT rapi02(i, j, 2) AS r02 FROM rval;
Filtering (σ)

CREATE FUNCTION rapi03(i INTEGER, z INTEGER) RETURNS BOOLEAN
LANGUAGE R { i > z };

SELECT * FROM rval WHERE rapi03(i, 2);
Aggregation

CREATE AGGREGATE kmeans(data FLOAT, ncluster INTEGER) RETURNS INTEGER
LANGUAGE R { kmeans(data, ncluster)$cluster };

SELECT cluster FROM (
  SELECT MIN(x) AS minx, MAX(x) AS maxx, kmeans(x, 5) AS cluster
  FROM xdata GROUP BY cluster
) AS cdata ORDER BY cluster;
Performance…

[Figure: execution time (s) vs. rows (1 K to 100 M, log scale) for PL/R-naive, PL/R-tuned, MonetDB, R-full, R-col and RInt]
Code Shipping
MonetDB.R 1.0.0, soon

> rf.fit <- randomForest(income~., data=training, mtry=2, ntree=10)
> predictions <- mdbapply(con, "t1", function(d) {
    p <- predict(rf.fit, type="prob", newdata=d)[,2]
    p[p > .9]
  })
MonetDBLite
MonetDBLite
• Socket serialisation/deserialisation for the client/server protocol is slow for large result sets.
• Too slow for many machine learning problems!
• Running a database server is cumbersome and overkill for a single R client
• Solution: run the entire database inside the R process
• Only copy ingest data / query results around in memory, which is fast
• Same interface as MonetDB.R: DBI/dplyr
https://goo.gl/jelaOy
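The in-process idea can be illustrated with SQLite, another embedded database that runs inside the host process (an analogy only, not MonetDBLite itself): there is no server to start, and results never cross a socket.

```python
# Analogy only: SQLite, like MonetDBLite, runs embedded in the host
# process, so query results are copied through memory, not a socket.
import sqlite3

con = sqlite3.connect(":memory:")  # the whole database lives in-process
con.execute("CREATE TABLE lineitem (id INTEGER, qty INTEGER)")
con.executemany("INSERT INTO lineitem VALUES (?, ?)",
                [(i, i % 50) for i in range(10_000)])

total, = con.execute("SELECT SUM(qty) FROM lineitem").fetchone()
print(total)   # 245000
```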
Quick Benchmark

SELECT * FROM lineitem (lineitem table with 10M rows):
Old (MAPI socket): 17.2 s
MonetDBLite: 0.4 s
Zero-Copy
Jonathan Lajus and Hannes Mühleisen: Efficient Data Management and Statistics with Zero-Copy Integration 26th International Conference on Scientific and Statistical Database Management (SSDBM2014)
[Diagram: a MonetDB BAT (BAT descriptor, column descriptors, head and tail arrays) next to an R SEXP (header plus array). The tail array is referenced by both: MonetDB's column descriptor and R's SEXP header point at the same memory.]

Dress-up + garbage collection fun
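The zero-copy hand-over can be sketched with Python's buffer protocol (an illustration of the idea, not MonetDB's implementation): a second descriptor is "dressed up" over the same memory, so no bytes are copied.

```python
# Illustration of zero-copy sharing: two descriptors, one buffer.
# The memoryview plays the role of the SEXP header dressed up over
# MonetDB's tail array; a write through one is visible through both.
from array import array

column = array("d", [1.0, 2.0, 3.0])  # the "database" column storage
vector = memoryview(column)           # second descriptor, same memory

vector[0] = 42.0                      # no copy was ever made...
print(column[0])                      # 42.0: ...the column sees the write
```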
Advanced Topics
R as a Query Language
Hannes Mühleisen, Alex Bertram and Maarten-Jan Kallen: Relational Optimizations for Statistical Analysis, Journal of Statistical Software (under review)
What is Renjin?
• R on the JVM
• Compatibility is paramount, not just academic exercise (e.g. automatic Fortran/C translations)
• R anywhere on any data format (e.g. Cloud environments)
• Increased performance through lazy evaluation, parallel execution, …
• Easy to plug any Java code into an R analysis, and easy to plug Renjin into Java projects
Abstraction in Renjin

> a <- 1:10^9
> a[1000000] <- NA  # harr harr

GNU R:
> system.time(print(anyNA(a)))[[3]]
[1] TRUE
[1] 0.001
> system.time(print(any(is.na(a))))[[3]]
[1] TRUE
[1] 2.23

Renjin:
> system.time(print(any(is.na(a))))[[3]]
[1] TRUE
[1] 0.05
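Renjin can rewrite any(is.na(a)) so the scan stops at the first missing value instead of materialising a full boolean vector first. The difference can be sketched in Python (illustration only):

```python
# Illustration: materialising an intermediate vs. short-circuiting.
import math

a = [float(i) for i in range(1_000_000)]
a[10] = float("nan")

# GNU-R-style: build all one million booleans, then scan them.
found_eager = any([math.isnan(v) for v in a])

# Renjin-style rewrite: a lazy scan that stops at element 10.
found_lazy = any(math.isnan(v) for v in a)

print(found_eager, found_lazy)   # True True
```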
“R as a query language”
• Observation 1: lots of data wrangling happening in R scripts
• subset(), merge(), aggregate(), dplyr::, data.table::, [, $
“R as a query language”
• Observation 2: things get slow quickly as vectors get longer
• Lots of optimisation opportunities, but how?
• State of the art: tactical optimisations and band-aids
“R as a query language”
• Proposal: treat R scripts as a declaration of intent (not as a procedural contract written in blood)
• Then we can optimise strategically!
Rule-based query optimisation
Optimisations
• Selection pushdown
• Data-parallel scheduling
• Function specialisation/vectorisation
• Common expression elimination/caching
• Redundant computation elimination
Static analysis?
Deferred Evaluation

a <- 1:1000
b <- a + 42
c <- b[1:10]
d <- min(c) / max(c)
print(d)

[Expression graph: / at the root over min and max, both over a subset ([) of (a + 42); nothing is evaluated until print(d) forces the result]
Pushdown

b <- factorial(a)
c <- b[1:10]
print(c)

[Before: subset (n=10) of factorial(a) (n=1000). After pushdown: factorial (n=10) of a subset of a (n=10), so only 10 factorials are computed]
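The rewrite can be sketched with a tiny deferred vector (hypothetical names, Python for illustration, not Renjin's actual machinery): because the elementwise call is recorded rather than executed, the subset can be pushed below it, and the function runs only on the elements that are actually needed.

```python
# Sketch of subset pushdown over deferred evaluation (illustrative).
import math

calls = 0
def counted_factorial(n):
    global calls
    calls += 1
    return math.factorial(n)

class Deferred:
    """Records an elementwise function application without running it."""
    def __init__(self, f, data):
        self.f, self.data = f, data

    def take(self, idx):
        # Pushdown rule: (map f data)[idx] == map f (data[idx])
        return [self.f(self.data[i]) for i in idx]

a = list(range(1, 1001))
b = Deferred(counted_factorial, a)  # nothing computed yet
c = b.take(range(10))               # the function runs only 10 times
print(c[:3], calls)                 # [1, 2, 6] 10
```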
Pushdown

[Figure: execution time (s) vs. dataset size (10^6 to 10^8 elements, log scale), GNU R vs. Renjin]
Recycling

for (i in 1:100) print((a[i] - min(a)) / (max(a) - min(a)))

[Expression graphs: the naive plan recomputes min(a), max(a) and their difference in every iteration; with recycling, those subexpressions are cached and only a[i] changes]
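Recycling amounts to caching loop-invariant subexpressions so they are computed once rather than once per iteration. A minimal sketch (illustrative, not Renjin internals):

```python
# Sketch of recycling: cache pure subexpressions keyed by their
# expression text, so loop-invariant work (min, range) runs once.
cache = {}
evals = 0

def recycled(key, thunk):
    global evals
    if key not in cache:
        evals += 1
        cache[key] = thunk()
    return cache[key]

a = [5.0, 1.0, 9.0, 3.0]
scaled = []
for i in range(len(a)):
    lo  = recycled("min(a)", lambda: min(a))
    rng = recycled("max(a)-min(a)", lambda: max(a) - min(a))
    scaled.append((a[i] - lo) / rng)

print(scaled)  # [0.5, 0.0, 1.0, 0.25]
print(evals)   # 2: both invariants computed once, not once per element
```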
Recycling

[Figure: execution time (s) vs. dataset size (10^6 to 10^8 elements, log scale): GNU R, Renjin, Renjin + Recycling]
svymean

agep <- svymean(~agep, svydsgn, se=TRUE)

for (i in 1:ncol(wts)) {
  repmeans[i,] <- t(colSums(wts[,i]*x*pw) / sum(pw*wts[,i]))
}
[…]
v <- crossprod(sweep(thetas, 2, meantheta, "-") * sqrt(rscales)) * scale
svymean

[Expression graph, 47512 elements: each replicate weight column wts[,i] spawns its own *, colSums, sum and / nodes feeding the final crossprod]
svymean

[Same expression graph with recycling: the repeated *, colSums and / subtrees are cached and reused across the replicate weight columns]
svymean

[Figure: execution time (s) vs. dataset size (47512, 1060060, 9093077 elements, log scale): GNU R, Renjin −opt, Renjin, Renjin 1t]
Capturing the Laws of Data Nature
Hannes Mühleisen, Martin Kersten and Stefan Manegold: Capturing the Laws of (Data) Nature, 7th Biennial Conference on Innovative Data Systems Research (CIDR), Jan. 2015
Statistical Models?
• Everyone has models; they encode our understanding of the world
• Everyone has data to train/fit and validate a model
• So far, the data management community has ignored these models
• But they hold precious domain knowledge!
Configuration + Measurement → Model!
Grouped by-source operation; convergence hints

Configuration + Fitted parameters → Measurement
[Figure: intensity (Jy) vs. frequency (GHz, 0.10–0.20) for source=17562, with fitted power law; alpha=-0.692, p=0.81266]
Model to function conversion (automatic)
Move to DB (automatic)
Approximate answer with zero IO*
Integrate & Intercept
• Integrate model fitting infrastructure into the data management system.
• Also: huge performance benefits for analysts!
• Intercept model fitting and validation operations by the user and store the model for later use.
• Storage format: model code + parameters
(1) Hypothesis from the data S, ν, I: does I ≈ p · ν^α hold?
(2) Validate the fit: R² = 0.92 ✓
(3) Store the model: I ≈ p · ν^α with fitted p, α per source S
(4) Query: S = 42, ν = 0.14, I = ?
(5) Answer from the model: I = 3.0 ± 0.05 ✓
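Steps (1) to (5) can be sketched end-to-end (synthetic, noise-free data; illustrative only): fit I ≈ p · ν^α by least squares in log-log space, store only p and α, and answer intensity queries from those two numbers without scanning the measurements.

```python
# Sketch: fit I = p * nu**alpha in log-log space, keep only the two
# fitted parameters, and answer queries from the model alone
# (synthetic noise-free data, illustrative only).
import math

p_true, alpha_true = 3.0, -0.7
nus = [0.10 + 0.01 * i for i in range(11)]        # 0.10 .. 0.20 GHz
intensities = [p_true * nu ** alpha_true for nu in nus]

# Least squares on log(I) = log(p) + alpha * log(nu).
xs = [math.log(nu) for nu in nus]
ys = [math.log(i) for i in intensities]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
alpha = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
p = math.exp(ybar - alpha * xbar)

def predict(nu):
    # "Zero IO" answer: only p and alpha are needed, no table scan.
    return p * nu ** alpha

print(round(alpha, 3), round(p, 3))
```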
But…
• What do we do if model parameters are not specified in the query?
• Sample data?
• Given multiple parameters, it is far from certain that all combinations of values are allowed in the model.
• Construct a filter?
Data & Model Changes
• What should we do if the user gives us a better model?
• Recompressing could be very expensive. Threshold for improvement?
• Changes in the data affect the model quality, too
• Switch models? Constant monitoring?
Multiple, partial or grouped
• There could be many models for a table, with overlapping parameters. Which one to pick?
• Models do not have to cover the entire table/column. “Patching”?
• Models could be fitted on aggregation results. Keep group counts?
Thank You Questions?
http://hannes.muehleisen.org
@hfmuehleisen