7/28/2019 The Scidb Package - An R Interface to SciDB_Lewis_2013_Slides
Bryan Lewis, Paradigm4
http://goo.gl/wPRSO
Free and open-source array database
Sparse/dense, multi-dimensional arrays
Distributed storage, parallel processing
Excels at parallel sparse/dense linear algebra
ACID, data replication, versioned data
SciDB Arrays
Each cell in a SciDB array consists of a fixed number of typed values.
Here is an example cell:

x         y             z
3.141593  "When human"  2
Cells are ordered along coordinate axes. A 1-D array looks like an R data frame.

i  x          y                    z
1  0.577215   "funny things"       3
2  1.41421    "big data interact"  2
3  2.718282   "judgement and"      1
4  3.1412654  "When human"         2
5  0.207879   "happen."            4

Dimension: i. Variables: x, y, z.
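The data-frame analogy can be sketched in base R. This is an illustration only and does not connect to SciDB; the name `cells` is made up here, and the rows are assumed to sit at coordinates 1 to 5 in the order they appear on the slide.

```r
# Base R analogy only; nothing here talks to a SciDB server.
cells <- data.frame(
  x = c(0.577215, 1.41421, 2.718282, 3.1412654, 0.207879),
  y = c("funny things", "big data interact", "judgement and",
        "When human", "happen."),
  z = c(3L, 2L, 1L, 2L, 4L),
  stringsAsFactors = FALSE
)
# Row names play the role of the coordinate along dimension i (assumed 1..5).
rownames(cells) <- 1:5

# One cell: a fixed number of typed values at coordinate i = 4.
cell <- cells["4", ]

# Each variable has a single type shared by all cells.
print(sapply(cells, class))
```

Each column corresponds to one typed variable (x, y, z), and each row to one cell along dimension i.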
SciDB arrays can be multi-dimensional

[Figure: a 2-D array with dimension i (coordinates 1 to 5) and dimension j (coordinates 1 to 3); every cell holds the variables x, y, z.]
Arrays can be sparse and values may be explicitly marked missing in several ways.

[Figure: a sparse 1-D array over dimension i (coordinates 1 to 5) with variables x, y, z; some cells are empty and some values are marked missing, for example Missing(1), Missing(7), and NA.]
Arrays can be joined along common dimensions (like R's merge):

[Figure: an array with variables x and y (five cells) joined with an array holding a Boolean variable w (four cells); the result has variables x, y, and w in four cells. Axis labels in the figure: i, z.]
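The comparison with R's merge can be made concrete with two small data frames. This is a base R analogy, not SciDB code: the column z stands in for the shared dimension, and the coordinates and their pairing with values are assumptions chosen for illustration, since the figure does not pin them down.

```r
# Base R analogy only: data frames stand in for SciDB arrays,
# and the column z stands in for the shared dimension.
left <- data.frame(
  z = 0:4,
  x = c(0.577215, 3.141593, 2.718282, 1.618034, 0.207879)
)
right <- data.frame(
  z = 1:4,
  w = c(TRUE, TRUE, FALSE, TRUE)
)

# merge() defaults to an inner join: only coordinates present in
# both inputs survive, and each result row carries x and w together.
joined <- merge(left, right, by = "z")
print(joined)
```

The coordinate z = 0 exists only in `left`, so it is absent from `joined`.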
SciDB array partitioning and overlap
Array chunks are distributed
Regular chunk distribution across arrays = fast n-dimensional join/merge
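One way to see why regular chunking helps joins: with a fixed chunk length, the chunk that holds a coordinate is a simple function of that coordinate, so two arrays chunked the same way keep matching cells in matching chunks. The sketch below is a toy model in base R. The helper `chunk_index`, the start coordinate 0, and the chunk length 1000 are all assumptions for illustration, and how SciDB actually assigns chunks to instances is not shown.

```r
# Hypothetical helper (not part of the scidb package): index of the
# regular chunk that contains a coordinate along one dimension.
chunk_index <- function(coord, start = 0, chunk_len = 1000) {
  (coord - start) %/% chunk_len
}

# Row coordinates taken from the subarray example A[c(0,49000,171), 5:8].
rows <- c(0, 171, 49000)
idx <- chunk_index(rows)
print(idx)

# Two arrays with the same start and chunk length map equal
# coordinates to equal chunk indices, so a join can pair up chunks.
same_chunk <- chunk_index(171) == chunk_index(171, start = 0, chunk_len = 1000)
```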
Values can be aggregated along dimensions, optionally over windows
Functions can be applied over values in arrays
Arrays can be sparse
Linear algebra operations and matrix decompositions are available for matrices and vectors.
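The operations listed above have familiar in-memory counterparts. The following base R sketch shows one of each on a small random matrix; it illustrates the kinds of operation only and says nothing about the SciDB operators or their names.

```r
set.seed(1)
M <- matrix(rnorm(20), nrow = 5, ncol = 4)

# Aggregate along a dimension: one sum per column.
col_sums <- apply(M, 2, sum)

# Windowed aggregate: centered moving average of width 3 down column 1.
# The first and last positions have no full window and come back as NA.
window_mean <- as.numeric(stats::filter(M[, 1], rep(1 / 3, 3), sides = 2))

# Apply a function over every value.
squared <- M^2

# Linear algebra: a matrix product and a decomposition.
G <- crossprod(M)   # same as t(M) %*% M
qrM <- qr(M)        # QR decomposition of M
```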
The scidb package for R
List/Dataframe-like
RObjectTables, g.data, filehash, ff, DBI and many database interfaces (RPgSQL, RMySQL, ROracle, ...), Vertica/R, Netezza/R, rredis, scidb, RBerkeley, RCassandra, LaF, lazy.frames
Hadoop
rmr, HadoopStreaming, RHIPE
Array-like
ff, bigmemory, pbdR, scidb, Netezza
Other
rdsm, forthcoming from Simon, flexmem
The package defines two main ways to interact with SciDB:
1. Iterable data frame interface using SciDB query language directly
2. N-dimensional sparse/dense array class for R backed by SciDB arrays
library("scidb")
scidbconnect(host="localhost")
# An example reference to a SciDB matrix:
A
Subarrays return new SciDB array objects
A[c(0,49000,171), 5:8]
Reference to a 3x4 SciDB array
Use [] to materialize data to R
A[c(0,49000,171), 5:8][]
           [,1]       [,2]       [,3]       [,4]
[1,]  0.9820799 -0.4563357 -1.2947495 -0.8085465
[2,] -1.5090126  0.1547963 -0.2435732 -0.1836875
[3,]  1.3296710 -1.5006536 -0.5980172  0.3752186
Arithmetic
X
Mixed SciDB and R object arithmetic
Z
SVD and principal components
S
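The slide's own code did not survive extraction. As a reminder of how the two topics in the title relate, here is a base R sketch on an in-memory matrix: principal components obtained from the SVD of the centered data and checked against `prcomp`. It is not the scidb package code, and the matrix `X` is random data made up for the example.

```r
set.seed(42)
X <- matrix(rnorm(200), nrow = 50, ncol = 4)

# Center each column, then take the SVD of the centered matrix.
Xc <- scale(X, center = TRUE, scale = FALSE)
S <- svd(Xc)

# Principal component scores and their standard deviations.
scores <- S$u %*% diag(S$d)
sdev <- S$d / sqrt(nrow(Xc) - 1)

# Reference result from base R for comparison.
p <- prcomp(X)
```

The columns of `S$v` are the principal directions; individual components may differ from `p$rotation` by a sign.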
It is sometimes possible to use SciDB arrays in R packages with little modification.
library("biclust")
library("s4vd")
data(lung)
A
Virtual machines and EC2 images ready to roll (including RStudio) available from:
www.scidb.org
R package on CRAN and development version at:
github.com/Paradigm4