Big Analytics Without Big Hassles

Bryan Lewis, Chief Data Scientist

Alex Poliakov, Solutions Architect

© Paradigm4 Inc. 2

Paradigm4’s SciDB

SciDB is an open-source, scalable array database with native complex math analytics, integrated with R and Python.

SciDB helps data scientists, bioinformaticians, quants, analysts, and scientists tackle their toughest “Big Data” management and complex analytics challenges.

Webinar Replay

These slides are from a Paradigm4 webinar held on 11/12/13

You can find this webinar, and additional webinars, at:

http://www.paradigm4.com/video/

www.paradigm4.com

Agenda

1. Brief Introduction to SciDB

2. Demos

3. Q & A

Open-source high-performance database

Data organized in multi-dimensional sparse arrays

Horizontally scalable

Excels at parallel linear algebra

ACID, data replication, versioned data

Developed by Paradigm4

About Paradigm4

Paradigm4 develops & supports SciDB

CTO is MIT database researcher Mike Stonebraker
• Force behind many major advances in commercial database products (Postgres, Illustra, Streambase, Vertica, VoltDB, …)

Commercial applications
• Computational Genomics
• E-commerce
• Imaging
• Industrial Analytics
• Quantitative Finance
• Internet of Things

Community edition
• Open Source
• Unrestricted
• Fully scalable

Enterprise edition
• More math
• Fault tolerance
• System management tools

Developed by Paradigm4

SciDB Powers NIH NCBI’s 1000 Genomes Project

Running 24 x 7 since Fall 2012

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/

SciDB Builds ARCA NBBO Book

• 186 million quotes for one day

• 80 seconds on a 32-instance cluster

• Runs in about half the time on a cluster twice as large

SciDB Powers Recommendation Engines

• Fast truncated SVD

• Sparse 50M x 50M matrix with 4 billion nonzero values

• Minutes per singular value on a four node Linux cluster
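As a rough illustration of what a truncated SVD computes, the sketch below estimates only the largest singular value of a tiny sparse matrix by power iteration on AᵀA, touching nothing but the nonzero entries. It is plain Python with hypothetical names (`top_singular_value`, a dictionary of coordinates as storage) and is not SciDB's algorithm; it is one textbook way to obtain leading singular values without forming a dense matrix.

```python
import math
import random

def matvec(entries, x, n_rows):
    """y = A x for a sparse matrix stored as {(row, col): value}."""
    y = [0.0] * n_rows
    for (i, j), v in entries.items():
        y[i] += v * x[j]
    return y

def rmatvec(entries, y, n_cols):
    """x = A^T y for the same sparse storage."""
    x = [0.0] * n_cols
    for (i, j), v in entries.items():
        x[j] += v * y[i]
    return x

def top_singular_value(entries, n_rows, n_cols, iters=200, seed=0):
    """Power iteration on A^T A; only nonzero entries are visited."""
    rng = random.Random(seed)
    x = [rng.random() + 0.1 for _ in range(n_cols)]
    for _ in range(iters):
        x = rmatvec(entries, matvec(entries, x, n_rows), n_cols)
        norm = math.sqrt(sum(c * c for c in x))
        x = [c / norm for c in x]
    # With x normalized, ||A x|| approximates the largest singular value.
    y = matvec(entries, x, n_rows)
    return math.sqrt(sum(c * c for c in y))

# Hypothetical 3 x 3 sparse matrix with three nonzero values.
A = {(0, 0): 3.0, (1, 1): 2.0, (2, 2): 1.0}
sigma = top_singular_value(A, 3, 3)
```

Each iteration costs time proportional to the number of nonzero values, which is why sparsity matters at the 4-billion-nonzero scale quoted above.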

SciDB System Architecture

“Shared Nothing” cluster of commodity hardware nodes

Interconnected with standard Ethernet and TCP/IP

SciDB Arrays

Each cell in a SciDB array consists of a fixed number of typed attributes (variables).

Here is an example cell with four attributes

Price    Volume   Symbol   usec
450.61   150      “AAPL”   36013008713

SciDB Arrays

A 1-D array looks like a spreadsheet

This picture shows five cells, each with four attributes

Dimension: i. Attributes: Price, Volume, Symbol, usec.

i   Price    Volume   Symbol   usec
1   450.61   150      “AAPL”   36013008713
2   450.73   200      “AAPL”   36013008915
3   450.84   10       “AAPL”   36013208113
4   36.57    75       “MSFT”   36019008713
5   36.20    100      “MSFT”   36003200113

SciDB Arrays

The same data “redimensioned” into a 2D array

Dimensions: usec (rows) and Symbol (columns). Attributes: Price, Volume.

               “AAPL”             “MSFT”
usec           Price    Volume    Price    Volume
36003200113                       36.20    100
36013008713    450.61   150
36013008915    450.73   200
36013208113    450.84   10
36019008713                       36.57    75
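To make the idea of redimensioning concrete, here is a minimal Python sketch that promotes the usec and Symbol attributes of the five example cells to dimensions. The `Tick` type and the `redimension` function are hypothetical stand-ins written for this illustration, not SciDB's operator, and raising an error on duplicate coordinates is a simplification of this sketch.

```python
from typing import NamedTuple

class Tick(NamedTuple):
    price: float
    volume: int
    symbol: str
    usec: int

# 1-D array: dimension i -> cell with four attributes (values from the slide).
ticks = {
    1: Tick(450.61, 150, "AAPL", 36013008713),
    2: Tick(450.73, 200, "AAPL", 36013008915),
    3: Tick(450.84, 10, "AAPL", 36013208113),
    4: Tick(36.57, 75, "MSFT", 36019008713),
    5: Tick(36.20, 100, "MSFT", 36003200113),
}

def redimension(cells):
    """Promote the usec and symbol attributes to dimensions.

    Returns a sparse 2-D array: {(usec, symbol): (price, volume)}.
    """
    out = {}
    for cell in cells.values():
        key = (cell.usec, cell.symbol)
        if key in out:
            raise ValueError(f"cell collision at {key}")
        out[key] = (cell.price, cell.volume)
    return out

by_time_and_symbol = redimension(ticks)
# Direct lookup by coordinates; coordinates with no data are simply absent.
aapl_first = by_time_and_symbol[(36013008713, "AAPL")]
```

Only the five occupied cells are stored, which is the sense in which the 2-D array is sparse.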

Access multi-dimensional subsets in constant time

High Performance Windowing

Simple running median outlier filter

Fast, one-pass, running stats over arbitrary time or data windows

Even when time intervals cross over internal storage shards
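A running median outlier filter can be sketched in a few lines of Python. The function names, the window size, the tolerance, and the price series (including the 499.00 spike) are all made up for illustration; this version makes a single pass over the data but recomputes the median for each window, so it does not claim to match SciDB's windowing performance.

```python
from collections import deque
from statistics import median

def running_median(values, window):
    """Median of the trailing `window` values at each position."""
    buf = deque(maxlen=window)  # old values fall out automatically
    out = []
    for v in values:
        buf.append(v)
        out.append(median(buf))
    return out

def flag_outliers(values, window=3, tol=1.0):
    """True where a value differs from its running median by more than tol."""
    meds = running_median(values, window)
    return [abs(v - m) > tol for v, m in zip(values, meds)]

# Hypothetical price series with one bad tick.
prices = [450.61, 450.73, 450.84, 499.00, 450.90, 450.95]
flags = flag_outliers(prices, window=3, tol=5.0)
# flags -> [False, False, False, True, False, False]
```

The median is used rather than the mean because a single spike barely moves it, so the ticks after the spike are not flagged.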

SciDB Arrays

Arrays can be joined along dimensions or subsets of dimensions

Values can be aggregated along dimensions and over windows

Functions can be applied over values in arrays

Arrays can be sparse

Linear algebra operations, matrix decompositions, and other interesting operations are defined for matrices and vectors

• Work in familiar IDE
• Data persisted in SciDB
• Offload large computations to cluster

Demos

Quantitative Finance example
• Regularized correlation
• Relevance network graph

Remote Sensing application
• NASA MODIS satellite images
• Regrid with spatial interpolation
• Visualize (multiple resolutions)

Survival Analysis on Healthcare Data
• Estimate Cox proportional hazards model with the big data bootstrap

Live demos

Two modes for using R & Python

SciDB-R/Py (global): Program SciDB naturally from R or Python

R/Py-exec (local): Invoke R or Python from within SciDB queries

Rationale

Provide a simple, robust way to run R or Python from inside SciDB queries, in parallel

Extend SciDB's powerful native analysis capabilities

Really simple example

avg(
  r_exec(
    build(<z:double>[i=1:1000,1,0],0),
    'expr=x<-runif(1000);y<-runif(1000);list(sum(x^2+y^2<1)/250)'))

{i} x_avg
{0} 3.14119

Instance-parallel Monte Carlo estimate of π
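The same computation can be sketched outside SciDB. In the query, each of the 1000 single-cell chunks draws 1000 random points, returns 4 times the fraction that fall inside the unit quarter circle, and `avg` combines the per-chunk results. The Python below mirrors that structure sequentially; `instance_estimate` and `estimate_pi` are hypothetical names, and the seeds are fixed only to make the run repeatable.

```python
import random

def instance_estimate(seed, n=1000):
    """One chunk's work: 4 * (fraction of n random points with x^2 + y^2 < 1)."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n) if rng.random() ** 2 + rng.random() ** 2 < 1
    )
    return 4.0 * inside / n

def estimate_pi(n_instances=1000, n=1000):
    """Average the independent per-chunk estimates, like avg() in the query."""
    estimates = [instance_estimate(seed, n) for seed in range(n_instances)]
    return sum(estimates) / len(estimates)

pi_hat = estimate_pi()
```

Because the per-chunk estimates are independent, the list comprehension could be replaced by a parallel map without changing the result.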

Big data bootstrap example

Consider a matrix named "events" with 8 columns:

ID (numeric)              Race (categorical)
SES (numeric)             Age (numeric)
Days_to_event (numeric)   Group (categorical)
Event (binary)            Gender (categorical)

Apply the bag of little bootstraps to estimate confidence intervals for coefficients of a Cox proportional hazards survival model.


Randomly partition rows of the events matrix into blocks of at most 1000 rows (the "bag" part of the BLB method).

store(
  redimension(
    cross_join(
      events as A,
      redimension(
        apply(
          project(
            sort(apply(build(<v:int64>[k=0:9999,1000,0],random()),p,k)),
            p),
          m,n),
        <p:int64>[m=0:*,1000,0]) as B,
      A.i, B.m),
    <val:double>[p=0:9999,1000,0,j=0:7,8,0]),
  P)
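In plain Python, the partitioning step amounts to shuffling the row indices and slicing them into blocks. This is a sketch of the idea under the assumption that the query's random sort acts as a permutation of the 10,000 rows; `random_partition` is a hypothetical helper, not part of SciDB or its client packages.

```python
import random

def random_partition(n_rows, block_size=1000, seed=0):
    """Shuffle row indices and cut them into blocks of at most block_size."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)
    return [order[s:s + block_size] for s in range(0, n_rows, block_size)]

# 10,000 rows in blocks of 1000, matching the array bounds in the query.
blocks = random_partition(10_000, block_size=1000)
```

Every row lands in exactly one block, and each block is a uniformly random subset of the rows.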

store(
  redimension(
    apply(
      r_exec(P, "expr=
        require(survival);
        D <- as.data.frame(matrix(val,ncol=8,byrow=TRUE));
        names(D) <- c('ID','Race','SES','Age','Days','Event','Group','Gender');
        D[,'Race'] <- factor(D[,'Race'], levels=1:13);
        D[,'Group'] <- factor(D[,'Group'], levels=1:2);
        D[,'Gender'] <- factor(D[,'Gender'], levels=1:2);
        ans <- sapply(1:500, function(x) {
          M <- coxph(Surv(Days, Event) ~ Age + Race + Group + Gender + SES + cluster(ID),
                     data=D[sample(nrow(D),nrow(D),replace=TRUE),]);
          c(coef(M), sqrt(diag(M[['var']])))});
        list(apply(ans, 1, mean));"),
      m, n%32),
    <ans:double null>[m=0:31,32,0],
    avg(val) as ans),
  coefs)
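The shape of this computation is: within each block, draw many resamples with replacement, fit the model on each, average the fitted values and their standard errors over the resamples, then average those per-block results across blocks. The Python sketch below follows that shape but swaps the Cox model for a sample mean and its standard error so that it runs with the standard library alone; the data, block sizes, and function names are hypothetical.

```python
import math
import random

def mean_and_se(xs):
    """Estimate and its standard error; stands in for the Cox model fit."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, math.sqrt(var / n)

def bootstrap_block(block, n_resamples, rng):
    """Average (estimate, se) over with-replacement resamples of one block."""
    fits = [
        mean_and_se(rng.choices(block, k=len(block)))
        for _ in range(n_resamples)
    ]
    est = sum(f[0] for f in fits) / n_resamples
    se = sum(f[1] for f in fits) / n_resamples
    return est, se

def bootstrap_blocks(blocks, n_resamples=500, seed=0):
    """Run the per-block bootstrap, then average the results across blocks."""
    rng = random.Random(seed)
    per_block = [bootstrap_block(b, n_resamples, rng) for b in blocks]
    est = sum(p[0] for p in per_block) / len(per_block)
    se = sum(p[1] for p in per_block) / len(per_block)
    return est, se

# Hypothetical data: 4 blocks of 200 draws from a normal(10, 2) distribution.
data_rng = random.Random(42)
blocks = [[data_rng.gauss(10.0, 2.0) for _ in range(200)] for _ in range(4)]
est, se = bootstrap_blocks(blocks, n_resamples=200)
```

Each block is processed independently of the others, which is what lets the real query run the per-block work in parallel across SciDB instances.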


Big data bootstrap result

Group 2 exhibits significantly lower relative risk of an event than Group 1 in this example.

library("scidb")
cf = scidb("coefs")[c(0,13:15)][]
se = scidb("coefs")[c(16,29:31)][]

plot(exp(cf))

lapply(1:4, function(j) {
  lines(c(j,j), c(exp(cf[j]-1.96*se[j]), exp(cf[j]+1.96*se[j])))
})
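The plotted intervals are exp(coefficient ± 1.96 × standard error), an approximate 95% interval for the relative risk. A minimal Python version of that arithmetic follows; the coefficient and standard error used here are hypothetical numbers chosen for illustration, not results from the webinar.

```python
import math

def hazard_ratio_ci(coef, se, z=1.96):
    """Relative risk exp(coef) with an approximate 95% interval."""
    return (
        math.exp(coef),
        math.exp(coef - z * se),
        math.exp(coef + z * se),
    )

# Hypothetical coefficient and standard error, not values from the webinar.
hr, lo, hi = hazard_ratio_ci(-0.5, 0.1)

# The effect is significant at this level when the interval excludes 1.
significant = hi < 1.0 or lo > 1.0
```

An interval lying entirely below 1 corresponds to the kind of statement made above about Group 2 having lower relative risk than Group 1.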

Take Away

• In-database, scalable, complex math
• Less coding, more analysis
• Transparent scale-up & speed-up
• Interactive exploratory analytics
• Seamless R and Python integration

Questions?

Tell us about your application
• [email protected]

Try our Quick Start
• scidb.org/forum
• Download a VM or EC2 AMI

www.paradigm4.com

Thanks for your interest!
