The Scidb Package - An R Interface to SciDB_Lewis_2013_Slides

Apr 03, 2018

  • 7/28/2019 The Scidb Package - An R Interface to SciDB_Lewis_2013_Slides

    1/25

    Bryan Lewis, Paradigm4

    [email protected]

    http://goo.gl/wPRSO

  • 2/25

    Free and open-source array database

    Sparse/dense, multi-dimensional arrays

    Distributed storage, parallel processing

    Excels at parallel sparse/dense linear algebra

    ACID, data replication, versioned data

  • 3/25

  • 4/25

    SciDB Arrays

  • 5/25

    Each cell in a SciDB array consists of a fixed number of typed values.

    Here is an example cell:

    x         y             z
    3.141593  "When human"  2

  • 6/25

    Cells are ordered along coordinate axes. A 1-D array looks like an R data frame.

    i  x          y                    z
    1  0.577215   "funny things"       3
    2  1.41421    "big data interact"  2
    3  2.718282   "judgement and"      1
    4  3.1412654  "When human"         2
    5  0.207879   "happen."            4

    Dimension: i
    Variables: x, y, z
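
    The data frame comparison can be made concrete with a small base R sketch. It uses an ordinary data.frame as a stand-in, not a SciDB array, and assumes the cells appear in the order listed on the slide.

```r
# A 1-D array sketched as an R data frame: one row per cell, one column per variable.
# Values are copied from the slide; the row order is an assumption.
cells <- data.frame(
  i = 1:5,
  x = c(0.577215, 1.41421, 2.718282, 3.1412654, 0.207879),
  y = c("funny things", "big data interact", "judgement and",
        "When human", "happen."),
  z = c(3L, 2L, 1L, 2L, 4L),
  stringsAsFactors = FALSE
)

# Every cell carries the same fixed set of typed values.
col_types <- sapply(cells[c("x", "y", "z")], class)
print(col_types)

# Look up the cell at coordinate i = 4.
cell4 <- cells[cells$i == 4, c("x", "y", "z")]
print(cell4)
```

    Here the column i plays the role of the dimension, while x, y, and z are the typed values stored in each cell.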

  • 7/25

    SciDB arrays can be multi-dimensional

    [Figure: a 2-D array with dimensions i (1 to 5) and j (1 to 3). Each cell holds the variables x, y, z. The three slices along j share x and y and differ in z.]

    i  x       y                    z in each of the three j slices
    1  0.5772  "funny things"       3    -5    39
    2  3.1415  "big data interact"  2    -3
    3  2.7182  "judgement and"      1    -2    19
    4  1.6180  "When human"         0    -1    25
    5  0.2078  "happen."            4    -6    46

    213

  • 8/25

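    Arrays can be sparse, and values may be explicitly marked missing in several ways.

    [Figure: a 1-D array over dimension i (1 to 5) with variables x, y, z. The numeric entries shown are 0.577215 / 3, Missing(1) / Missing(7), NA / 0, and 0.207879 / 4; the strings shown are "funny things", "When human", "happen.", and "big data interact".]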

  • 9/25

    Arrays can be joined along common dimensions (like R's merge):

    [Figure: an array with variables x and y over dimension i, joined with an array with variables z and w, gives an array with variables x, y, z, w. The first input has five cells (x = 0.577215, 3.141593, 2.718282, 1.618034, 0.207879). The result has four cells, omitting the one with x = 1.618034 and y = "When human".]

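    Since the slide compares the join to R's merge, a base R sketch may help. It uses plain data frames rather than SciDB arrays. The x and y values follow the slide, while the z and w values and the choice of which coordinate is absent from the second input are illustrative assumptions.

```r
# First input: variables x and y over dimension i (values from the slide).
left <- data.frame(
  i = 1:5,
  x = c(0.577215, 3.141593, 2.718282, 1.618034, 0.207879),
  y = c("funny things", "big data interact", "judgement and",
        "When human", "happen."),
  stringsAsFactors = FALSE
)

# Second input: variables z and w over the same dimension.
# Which coordinates are present, and the z values, are assumptions.
right <- data.frame(
  i = c(1L, 2L, 3L, 5L),
  z = c(1, 2, 3, 4),
  w = c(TRUE, TRUE, FALSE, TRUE)
)

# merge() defaults to an inner join: only coordinates found in both inputs survive.
joined <- merge(left, right, by = "i")
print(joined)
```

    The cell at i = 4 has no partner in the second input, so it does not appear in the result.
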
  • 10/25

    SciDB array partitioning and overlap

  • 11/25

    Array chunks are distributed

  • 12/25

    Regular chunk distribution across arrays = fast n-dimensional join/merge
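
    One way to picture regular chunking is as integer division of a coordinate by a chunk size. The sketch below is illustrative only: chunk_index, origin, and chunk_size are hypothetical names, not part of the scidb package, and the slides do not say how chunks are assigned to instances.

```r
# Hypothetical helper: map a coordinate to a chunk index under regular chunking.
chunk_index <- function(coord, origin = 0, chunk_size = 1000) {
  (coord - origin) %/% chunk_size
}

# Coordinates present in two arrays that use the same chunking along dimension i.
coords_a <- c(0, 171, 49000)
coords_b <- c(171, 999, 49000)

chunks_a <- chunk_index(coords_a)
chunks_b <- chunk_index(coords_b)

# A coordinate shared by both arrays maps to the same chunk index in each,
# so matching cells can be paired chunk by chunk.
shared <- intersect(coords_a, coords_b)
shared_chunks <- chunk_index(shared)
print(shared_chunks)
```

    Under this assumption, a join only needs to compare chunks with equal indices rather than search the whole of both arrays.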

  • 13/25

    Values can be aggregated along dimensions, optionally over windows

    Functions can be applied over values in arrays

    Arrays can be sparse

    Linear algebra operations and matrix decompositions are available for matrices and vectors.
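
    For readers who know R better than SciDB, the operations listed above have rough base R analogues. The sketch below runs on an ordinary in-memory matrix with illustrative values. It does not call SciDB and makes no claim about how SciDB implements these operations.

```r
# A small dense matrix standing in for a 2-D array (illustrative values).
A <- matrix(1:12, nrow = 3, ncol = 4)

# Aggregate along a dimension: sum across the second dimension for each row.
row_sums <- apply(A, 1, sum)

# Windowed aggregate: moving sum over a window of width 2 along the second dimension.
window_sums <- A[, 1:3] + A[, 2:4]

# Apply a function over every value.
squared <- A^2

# Linear algebra: a matrix product and a matrix decomposition.
gram <- A %*% t(A)
decomp <- svd(A)

print(row_sums)
print(window_sums)
```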

  • 14/25

    The scidb package for R

  • 15/25

    List/Dataframe-like

    RObjectTables, g.data, filehash, ff, DBI and many database interfaces (RPgSQL, RMySQL, ROracle, ...), Vertica/R, Netezza/R, rredis, scidb, RBerkeley, RCassandra, LaF, lazy.frames

    Hadoop

    rmr, HadoopStreaming, RHIPE

    Array-like

    ff, bigmemory, pbdR, scidb, Netezza

    Other

    rdsm, forthcoming from Simon, flexmem

  • 16/25

    The package defines two main ways to interact with SciDB:

    1. Iterable data frame interface using SciDB query language directly

    2. N-dimensional sparse/dense array class for R backed by SciDB arrays

  • 17/25

    library("scidb")
    scidbconnect(host="localhost")

    # An example reference to a SciDB matrix:
    A

  • 18/25

    Subarrays return new SciDB array objects

    A[c(0,49000,171), 5:8]

    Reference to a 3x4 SciDB array

  • 19/25

    Use [] to materialize data to R

    A[c(0,49000,171), 5:8][]

               [,1]       [,2]       [,3]       [,4]
    [1,]  0.9820799 -0.4563357 -1.2947495 -0.8085465
    [2,] -1.5090126  0.1547963 -0.2435732 -0.1836875
    [3,]  1.3296710 -1.5006536 -0.5980172  0.3752186

  • 20/25

    Arithmetic

    X

  • 21/25

    Mixed SciDB and R object arithmetic

    Z

  • 22/25

  • 23/25

    SVD and principal components

    S
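
    The slide's own code did not survive extraction, so the following is only a generic base R sketch of how principal components can be obtained from an SVD. It uses an in-memory matrix of random data, not a SciDB array, and the variable names are illustrative.

```r
# Illustrative data: 50 observations of 4 variables.
set.seed(1)
X <- matrix(rnorm(200), nrow = 50, ncol = 4)

# Center the columns, then decompose: Xc = U D V^T.
Xc <- scale(X, center = TRUE, scale = FALSE)
dec <- svd(Xc)

# Principal component scores, loadings, and standard deviations.
scores <- dec$u %*% diag(dec$d)
loadings <- dec$v
sdev <- dec$d / sqrt(nrow(X) - 1)

# Cross-check against R's built-in PCA.
reference <- prcomp(X)
print(all.equal(sdev, reference$sdev))
```

    Component signs are arbitrary in general, so any comparison of scores or loadings against another PCA routine should be made up to sign.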

  • 24/25

    It is sometimes possible to use SciDB arrays in R packages with little modification.

    library("biclust")
    library("s4vd")
    data(lung)
    A

  • 25/25

    Virtual machines and EC2 images ready to roll (including RStudio) available from:

    www.scidb.org

    R package on CRAN and development version at:

    github.com/Paradigm4