Computing near the data: let someone else do the heavy lifting for you

Konrad Banachewicz
Amsterdam, June 20th 2011

Outline: Introduction · Computing in databases · Conclusion
"We're drowning in data and starving for information"
Data coming in from the market:
1 liquid instrument (front-month DAX future), 1 day, 1 exchange → 400 MB in pure ASCII
different parameters → "clones" of the same instrument
{ exchanges } × { instruments } × { days } ... = A LOT
Problems:
memory
bandwidth
Case studies: Model 1: regression · Model 2: correlation · Model 3: VaR
Typical approach:
read the data into memory
analyze there
save the results
But is it really necessary?
In many cases what we really need is aggregate info. Example: linear regression
classic estimator: β̂ = (XᵀX)⁻¹Xᵀy
come to think of it, what we really need are sums, sums of squares and cross-products
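To make the point concrete: XᵀX and Xᵀy are nothing but accumulated sums, sums of squares and cross-products, so a single streaming pass over the data recovers the exact least-squares fit without ever holding the full dataset in memory. A minimal sketch (the talk works in R; this illustration uses Python with NumPy, and the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.1, size=n)

# Streaming pass: only five scalar accumulators are kept.
s_x = s_y = s_xx = s_xy = 0.0
for xi, yi in zip(x, y):
    s_x += xi
    s_y += yi
    s_xx += xi * xi
    s_xy += xi * yi

# Closed-form solution of (X^T X) beta = X^T y for an intercept + slope model.
det = n * s_xx - s_x * s_x
beta2 = (n * s_xy - s_x * s_y) / det  # slope
beta1 = (s_y - beta2 * s_x) / n       # intercept

# Reference: the usual full in-memory least squares on the design matrix.
X = np.column_stack([np.ones(n), x])
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The two routes agree to floating-point accuracy; the streaming version needs O(1) memory regardless of the number of rows.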
Two possible approaches:
1. Ripley and Chen: extra interface, pure R
2. R + SQL
Ripley and Chen
R (user) ↔ CORBA ↔ R (servant) ↔ DB
Alternative
R (user) ↔ DB

Two scenarios:
1. pure R processing
2. computations partially in DB
base model: Yₜ = β₁ + β₂Xₜ + εₜ
estimator: β̂ = (XᵀX)⁻¹XᵀY
in the DB: arithmetic operations on a limited set of columns
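In the partially-in-DB scenario, the five aggregates are computed by the database engine and only scalars travel to the client. A hedged sketch using Python's built-in sqlite3 in place of R + ODBC, purely for illustration; the table `ticks` and its columns are hypothetical:

```python
import random
import sqlite3

# Hypothetical table of (x, y) observations held in the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (x REAL, y REAL)")
random.seed(0)
rows = [(x := random.gauss(0, 1), 1.5 + 2.0 * x + random.gauss(0, 0.1))
        for _ in range(5000)]
conn.executemany("INSERT INTO ticks VALUES (?, ?)", rows)

# The DB computes the aggregates; only five numbers cross the wire.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) FROM ticks"
).fetchone()

# Closed-form simple-regression solution assembled on the client.
det = n * sxx - sx * sx
beta2 = (n * sxy - sx * sy) / det  # slope
beta1 = (sy - beta2 * sx) / n      # intercept
```

The same SELECT works over ODBC against any of the engines benchmarked in the talk; the arithmetic on the client side is independent of the row count.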
Pure R processing

[Figure: Case study 1, method 1. Execution time (seconds, 0 to 30) vs dataset size (number of rows, 200,000 to 1,000,000), for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
Computations partially in DB

[Figure: Case study 1, method 2. Execution time (seconds, 0 to 30) vs dataset size (number of rows, 200,000 to 1,000,000), for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
base model: Cov(X, Y) = E[XY] − E[X]E[Y]
estimator: Ĉov(X, Y) = (1/n) Σᵢ₌₁ⁿ XᵢYᵢ − ((1/n) Σᵢ₌₁ⁿ Xᵢ)((1/n) Σᵢ₌₁ⁿ Yᵢ)
in the DB: large queries
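The estimator above maps directly onto SQL aggregates: Cov(X, Y) = E[XY] − E[X]E[Y] is a single query over AVG. A sketch with Python's sqlite3 standing in for R + ODBC; the table `pair` and the synthetic data are illustrative:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pair (x REAL, y REAL)")
random.seed(1)
# y = 0.8 x + noise, so the population covariance is 0.8.
rows = [(x := random.gauss(0, 1), 0.8 * x + random.gauss(0, 0.6))
        for _ in range(20_000)]
conn.executemany("INSERT INTO pair VALUES (?, ?)", rows)

# One aggregate query implements E[XY] - E[X]E[Y]; a single scalar returns.
cov, = conn.execute(
    "SELECT AVG(x*y) - AVG(x)*AVG(y) FROM pair"
).fetchone()
```

For a full covariance matrix the same pattern expands to a query over all column pairs, which is where the "large queries" remark comes from.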
Pure R processing

[Figure: Case study 1, method 1. Execution time (seconds, 0 to 60) vs dataset size (columns, 15 to 35), for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
Computations partially in DB

[Figure: Case study 1, method 1. Execution time (seconds, 0 to 60) vs dataset size (columns, 15 to 35), for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
calculate a quantile of the portfolio PnL
Vₚ = inf { u : F(u) ≥ 1 − p }
estimator: V̂ₚ = X([n(1−p)]+1), i.e. the ([n(1−p)]+1)-th order statistic
in the DB: sorting
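Sorting, the expensive part of the order-statistic estimator, can be delegated to the database with ORDER BY plus LIMIT/OFFSET, so only a single value is shipped back. A sketch in Python with sqlite3 (the table `pnl` and the N(0, 1) PnL sample are illustrative, not from the talk):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pnl (v REAL)")
random.seed(2)
n = 100_000
conn.executemany("INSERT INTO pnl VALUES (?)",
                 [(random.gauss(0, 1),) for _ in range(n)])

p = 0.01  # tail probability, i.e. the 99% quantile
# Skip [n(1-p)] rows in sorted order; the next row is the
# ([n(1-p)]+1)-th order statistic, matching the estimator above.
offset = int(n * (1 - p))
var_p, = conn.execute(
    "SELECT v FROM pnl ORDER BY v LIMIT 1 OFFSET ?", (offset,)
).fetchone()
```

For a standard normal sample the value lands near the theoretical 99% quantile of about 2.33; the client never sees the other 99,999 rows.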
Pure R processing

[Figure: Case study 3, method 1. Execution time (seconds, 0 to 100) vs dataset size (number of rows, 2,000,000 to 10,000,000), for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
Computations partially in DB

[Figure: Case study 3, method 2. Execution time (seconds, 0 to 100) vs dataset size (number of rows, 200,000 to 1,000,000), for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
Conclusion
1. with minimal effort, significant speedups are possible
2. ODBC as a minimal requirement
3. extensions: parallel computing ...