Big Analytics Without Big Hassles

Dec 16, 2014

Paradigm4

Complex analytics should work as nimbly on extremely large data sets as on small ones. You don’t want to think about whether your data fits in-memory, about parallelism, or formatting data for math packages. You’d like to use your favorite analytical language and have it transparently scale up to Big Data volumes.

Paradigm4 presents a webinar about SciDB—the massively scalable, open source, array database with native complex analytics, integrated with R and Python.

Details:

Presenter: Bryan Lewis, Chief Data Scientist, Paradigm4
Day/Time: Tuesday November 12th, 2013 at 1pm EST


Learn how SciDB enables you to:

-Explore rich data sets interactively
-Do complex math in-database, without being constrained by memory limitations
-Perform multi-dimensional windowing, filtering, and aggregation
-Offload large computations to a commodity hardware cluster—on-premise or in a cloud
-Use R and Python to analyze SciDB arrays as if they were R or Python objects.
-Share data among users, with multi-user data integrity guarantees and version control
Webinar Agenda:

-Introduction to SciDB
-Demo
-Live Q&A
Transcript
Page 1: Big Analytics Without Big Hassles

Big Analytics without Big Hassles

Bryan Lewis Chief Data Scientist

Alex Poliakov Solutions Architect

Page 2: Big Analytics Without Big Hassles

© Paradigm4 Inc. 2

Paradigm4’s SciDB

SciDB is an open source, scalable array database with native complex math analytics, integrated with R and Python.

Page 3: Big Analytics Without Big Hassles

Paradigm4’s SciDB

SciDB helps data scientists, bioinformaticians, quants, analysts, and scientists tackle their toughest “Big Data” management and complex analytics challenges.

Page 4: Big Analytics Without Big Hassles

Webinar Replay

These slides are from a Paradigm4 webinar held on November 12, 2013.

You can find this webinar, and additional webinars, at:

http://www.paradigm4.com/video/

www.paradigm4.com

Page 5: Big Analytics Without Big Hassles

Agenda

1. Brief Introduction to SciDB

2. Demos

3. Q & A

Page 6: Big Analytics Without Big Hassles

Open-source high-performance database

Data organized in multi-dimensional sparse arrays

Horizontally scalable

Excels at parallel linear algebra

ACID, data replication, versioned data

Developed by Paradigm4

Page 7: Big Analytics Without Big Hassles

About Paradigm4

Paradigm4 develops & supports SciDB

CTO is MIT database researcher Mike Stonebraker
• Force behind many major advances in commercial database products (Postgres, Illustra, Streambase, Vertica, VoltDB, …)

Commercial applications
• Computational Genomics
• E-commerce
• Imaging
• Industrial Analytics
• Quantitative Finance
• Internet of Things

Page 8: Big Analytics Without Big Hassles

Community edition
• Open source
• Unrestricted
• Fully scalable

Enterprise edition
• More math
• Fault tolerance
• System management tools

Developed by Paradigm4

Page 9: Big Analytics Without Big Hassles

SciDB Powers NIH NCBI’s 1000 Genomes Project

Running 24 x 7 since Fall 2012

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/

Page 10: Big Analytics Without Big Hassles

SciDB Builds ARCA NBBO Book

• 186 million quotes for one day

• 80 seconds on a 32-instance cluster

• Runs in about half the time on a cluster twice as large

Page 11: Big Analytics Without Big Hassles

SciDB Powers Recommendation Engines

• Fast truncated SVD

• Sparse 50M x 50M matrix with 4 billion nonzero values

• Minutes per singular value on a four-node Linux cluster

Page 12: Big Analytics Without Big Hassles

SciDB System Architecture

“Shared Nothing” cluster of commodity hardware nodes

Interconnected with standard Ethernet and TCP/IP

Page 13: Big Analytics Without Big Hassles

SciDB Arrays

Each cell in a SciDB array consists of a fixed number of typed attributes (variables).

Here is an example cell with four attributes:

Price   Volume  Symbol  usec
450.61  150     “AAPL”  36013008713

Page 14: Big Analytics Without Big Hassles

SciDB Arrays

A 1-D array looks like a spreadsheet

This picture shows five cells, each with four attributes

Dimension i, followed by the attributes Price, Volume, Symbol, and usec:

i  Price   Volume  Symbol  usec
1  450.61  150     “AAPL”  36013008713
2  450.73  200     “AAPL”  36013008915
3  450.84  10      “AAPL”  36013208113
4  36.57   75      “MSFT”  36019008713
5  36.20   100     “MSFT”  36003200113

Page 15: Big Analytics Without Big Hassles

SciDB Arrays

The same data “redimensioned” into a 2D array

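Rows follow the dimension usec and columns follow the dimension Symbol. Each cell holds the attributes Price and Volume; "-" marks an empty cell.

usec         “AAPL” Price  “AAPL” Volume  “MSFT” Price  “MSFT” Volume
36003200113  -             -              36.20         100
36013008713  450.61        150            -             -
36013008915  450.73        200            -             -
36013208113  450.84        10             -             -
36019008713  -             -              36.57         75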
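
The redimensioning shown on this slide can be sketched in plain Python. This is an illustration of the idea only, not SciDB code: the dictionary layout and the function name `redimension` are assumptions made for the example, and the values are the five cells from the slides.

```python
# Illustrative sketch only: plain Python dictionaries, not the SciDB API.
# The five cells of the 1-D array, keyed by the dimension i.
cells = {
    1: {"price": 450.61, "volume": 150, "symbol": "AAPL", "usec": 36013008713},
    2: {"price": 450.73, "volume": 200, "symbol": "AAPL", "usec": 36013008915},
    3: {"price": 450.84, "volume": 10, "symbol": "AAPL", "usec": 36013208113},
    4: {"price": 36.57, "volume": 75, "symbol": "MSFT", "usec": 36019008713},
    5: {"price": 36.20, "volume": 100, "symbol": "MSFT", "usec": 36003200113},
}


def redimension(cells_1d):
    """Re-key each cell by (usec, symbol); price and volume remain attributes."""
    out = {}
    for cell in cells_1d.values():
        key = (cell["usec"], cell["symbol"])
        out[key] = {"price": cell["price"], "volume": cell["volume"]}
    return out


array_2d = redimension(cells)

# Cells that were never written simply do not exist (the array is sparse).
print(array_2d.get((36003200113, "MSFT")))
print(array_2d.get((36003200113, "AAPL")))
```

Two former attributes (usec and Symbol) become coordinates, so a lookup by time and symbol no longer needs a scan over the 1-D cells.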

Page 16: Big Analytics Without Big Hassles

Access multi-dimensional subsets in constant time
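
One way to picture why subset access need not scan the data: with regular chunking, the chunk that holds a cell follows from arithmetic on its coordinates. The sketch below is an assumption about the general idea, not a description of SciDB internals; `chunk_of` is a hypothetical helper, and the chunk sizes are borrowed from the schema `[p=0:9999,1000,0,j=0:7,8,0]` used later in these slides.

```python
def chunk_of(coords, lows, chunk_sizes):
    """Return the chunk index along each dimension for one cell coordinate.

    Assumes regular chunking: each dimension starts at `low` and is cut
    into pieces of `chunk_size` cells (hypothetical helper, not SciDB API).
    """
    return tuple((c - lo) // size for c, lo, size in zip(coords, lows, chunk_sizes))


# Cell (p=2500, j=3) in a 10000 x 8 array chunked as 1000 x 8.
chunk = chunk_of((2500, 3), lows=(0, 0), chunk_sizes=(1000, 8))
print(chunk)
```

The cost of the calculation does not depend on how many cells the array holds.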

Page 17: Big Analytics Without Big Hassles

High Performance Windowing

 

Simple running median outlier filter

Fast, one-pass, running stats over arbitrary time or data windows

Even when time intervals cross over internal storage shards
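
A running median outlier filter can be sketched in a few lines of Python. This is a naive version that recomputes the median for every window, so it is not the one-pass implementation the slide refers to; the function name, window size, threshold, and sample prices are all assumptions chosen for illustration.

```python
from statistics import median


def running_median_filter(values, window=5, threshold=5.0):
    """Flag each value whose distance from the median of a centered
    window exceeds `threshold`. Windows are truncated at the edges."""
    half = window // 2
    flags = []
    for i, v in enumerate(values):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        flags.append(abs(v - median(values[lo:hi])) > threshold)
    return flags


# Made-up price series with one obviously bad tick in the middle.
prices = [450.61, 450.73, 450.84, 36.57, 450.90, 451.02, 450.95]
flags = running_median_filter(prices, window=5, threshold=5.0)
print(flags)
```

The median is used instead of the mean because a single bad tick barely moves it, so the bad tick stands out against its neighbors.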

Page 18: Big Analytics Without Big Hassles

SciDB Arrays

Arrays can be joined along dimensions or subsets of dimensions

Values can be aggregated along dimensions and over windows

Functions can be applied over values in arrays

Arrays can be sparse

Linear algebra operations, matrix decompositions, and other interesting operations are defined for matrices and vectors

Page 19: Big Analytics Without Big Hassles

• Work in familiar IDE
• Data persisted in SciDB
• Offload large computations to cluster

Page 20: Big Analytics Without Big Hassles

Demos

Quantitative Finance example
• Regularized correlation
• Relevance network graph

Remote Sensing application
• NASA MODIS satellite images
• Regrid with spatial interpolation
• Visualize (multiple resolutions)

Survival Analysis on Healthcare Data
• Estimate Cox proportional hazards model with the big data bootstrap

Page 21: Big Analytics Without Big Hassles

Live demos

Page 22: Big Analytics Without Big Hassles

Two modes for using R & Python

SciDB-R/Py (global): Program SciDB naturally from R or Python

R/Py-exec (local): Invoke R or Python from within SciDB queries

Page 23: Big Analytics Without Big Hassles

Rationale

Provide a simple, robust way to run R or Python from inside SciDB queries, in parallel

Extend SciDB's powerful native analysis capabilities

Page 24: Big Analytics Without Big Hassles

Really simple example

avg(
  r_exec(
    build(<z:double>[i=1:1000,1,0],0),
    'expr=x<-runif(1000);y<-runif(1000);list(sum(x^2+y^2<1)/250)'))

{i} x_avg
{0} 3.14119

Instance-parallel Monte Carlo estimate of π
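
The same estimate can be sketched in plain Python to show the arithmetic. Each chunk draws 1000 points in the unit square and returns 4 × hits / 1000, which is what `sum(x^2+y^2<1)/250` computes; the results are then averaged. This sketch runs the chunks sequentially rather than in parallel across instances, and the function names and seed are assumptions.

```python
import random


def estimate_pi_chunk(rng, samples=1000):
    """One chunk: fraction of random points inside the quarter circle, times 4."""
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 < 1)
    return 4.0 * hits / samples  # equals hits / 250 when samples == 1000


def estimate_pi(chunks=1000, samples=1000, seed=0):
    """Average the per-chunk estimates, as avg() does over the r_exec output."""
    rng = random.Random(seed)
    estimates = [estimate_pi_chunk(rng, samples) for _ in range(chunks)]
    return sum(estimates) / len(estimates)


pi_hat = estimate_pi()
print(pi_hat)
```

Because the chunks are independent, they can run anywhere in any order, which is what makes the SciDB version instance-parallel.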

Page 25: Big Analytics Without Big Hassles

Big data bootstrap example

Consider a matrix named "events" with 8 columns:

ID (numeric) Race (categorical)

SES (numeric) Age (numeric)

Days_to_event (numeric) Group (categorical)

Event (binary) Gender (categorical)

Apply the bag of little bootstraps to estimate confidence intervals for coefficients of a Cox proportional hazards survival model.
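
To make the bag of little bootstraps concrete, here is a minimal Python sketch for a much simpler statistic, the sample mean, instead of Cox model coefficients. Everything in it is an assumption for illustration (function name, subset size, repetition count, synthetic data). Each small subset is resampled up to the full data size n by counting how often each of its rows is drawn, and the per-subset standard errors are averaged.

```python
import random
from collections import Counter
from statistics import mean, stdev


def blb_standard_error(data, subset_size, reps=30, seed=0):
    """Bag of little bootstraps estimate of the standard error of the mean."""
    rng = random.Random(seed)
    n = len(data)
    rows = list(data)
    rng.shuffle(rows)  # random partition into disjoint subsets
    subsets = [rows[s:s + subset_size] for s in range(0, n, subset_size)]
    subset_errors = []
    for subset in subsets:
        estimates = []
        for _ in range(reps):
            # Draw n rows with replacement from the small subset and keep
            # only the counts, so a resample never needs n rows of storage.
            counts = Counter(rng.choices(range(len(subset)), k=n))
            estimates.append(sum(subset[i] * c for i, c in counts.items()) / n)
        subset_errors.append(stdev(estimates))
    return mean(subset_errors)


# Synthetic data: 2000 standard normal values, so the true
# standard error of the mean is about 1 / sqrt(2000) = 0.022.
data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(2000)]
se = blb_standard_error(data, subset_size=100)
print(se)
```

Each subset is processed independently, so the outer loop is the part that can be spread across a cluster.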

Page 26: Big Analytics Without Big Hassles

Big data bootstrap example

Randomly partition rows of the events matrix into blocks of at most 1000 rows (the "bag" part of the BLB method).

store(
  redimension(
    cross_join(
      events as A,
      redimension(
        apply(
          project(
            sort(
              apply(build(<v:int64>[k=0:9999,1000,0], random()), p, k)),
            p),
          m, n),
        <p:int64>[m=0:*,1000,0]) as B,
      A.i, B.m),
    <val:double>[p=0:9999,1000,0, j=0:7,8,0]),
  P)
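
The query above sorts row indices by a random key to obtain a random permutation, then relies on the chunk size of 1000 to cut the permuted rows into blocks. The same idea in plain Python, as a sketch only (the function name and seed are assumptions):

```python
import random


def random_blocks(n_rows, block_size=1000, seed=0):
    """Shuffle the row indices, then slice them into blocks of at most block_size."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return [order[s:s + block_size] for s in range(0, n_rows, block_size)]


blocks = random_blocks(10000, block_size=1000)
print(len(blocks), len(blocks[0]))
```

Every row lands in exactly one block, and block membership is independent of the original row order.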

Page 27: Big Analytics Without Big Hassles

Big data bootstrap example

store(
  redimension(
    apply(
      r_exec(P,
        'expr=require(survival);
         D <- as.data.frame(matrix(val, ncol=8, byrow=TRUE));
         names(D) <- c("ID","Race","SES","Age","Days","Event","Group","Gender");
         D[,"Race"] <- factor(D[,"Race"], levels=1:13);
         D[,"Group"] <- factor(D[,"Group"], levels=1:2);
         D[,"Gender"] <- factor(D[,"Gender"], levels=1:2);
         ans <- sapply(1:500, function(x) {
           M <- coxph(Surv(Days, Event) ~ Age + Race + Group + Gender + SES + cluster(ID),
                      data=D[sample(nrow(D), nrow(D), replace=TRUE),]);
           c(coef(M), sqrt(diag(M[["var"]])))});
         list(apply(ans, 1, mean));'),
      m, n%32),
    <ans:double null>[m=0:31,32,0], avg(val) as ans),
  coefs)

Page 28: Big Analytics Without Big Hassles

Big data bootstrap result

Group 2 exhibits significantly lower relative risk of an event than Group 1 in this example.

library("scidb")
cf = scidb("coefs")[c(0,13:15)][]
se = scidb("coefs")[c(16,29:31)][]

plot(exp(cf))

lapply(1:4, function(j) {lines(c(j,j), c(exp(cf[j]-1.96*se[j]), exp(cf[j]+1.96*se[j])))})
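
The plot draws, for each coefficient, the relative risk exp(cf) with a bar from exp(cf - 1.96·se) to exp(cf + 1.96·se), which is an approximate 95% interval. The arithmetic can be sketched in Python; the coefficient and standard error below are made-up numbers, not results from the webinar.

```python
import math


def relative_risk_interval(coef, se, z=1.96):
    """Return (lower, relative_risk, upper) on the exponentiated scale."""
    return (math.exp(coef - z * se), math.exp(coef), math.exp(coef + z * se))


# Hypothetical coefficient and standard error, for illustration only.
lower, rr, upper = relative_risk_interval(-0.5, 0.1)
print(lower, rr, upper)
```

An interval that lies entirely below 1 is what supports a statement such as "significantly lower relative risk" for one group against the baseline.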

Page 29: Big Analytics Without Big Hassles

Take Away

• In-database, scalable, complex math
• Less coding, more analysis
• Transparent scale-up & speed-up
• Interactive exploratory analytics
• Seamless R and Python integration

www.paradigm4.com

Page 30: Big Analytics Without Big Hassles

Questions?

Tell us about your application
• [email protected]

Try our Quick Start
• scidb.org/forum
• Download a VM or EC2 AMI

www.paradigm4.com

Thanks for your interest!