Sv big datascience_cliffclick_5_2_2013

Big Data for Big Questions

Cliff Click, CTO
http://0xdata.com
http://cliffc.org/blog

● Motivation: What & Why Big Math?
● Better Mousetrap
● Demo
● Fork: Deep Dive into Math Hacking ...or... K/V Store

Source: https://github.com/0xdata/h2o


0xdata.com 3

42!

What was the question again?

Oh yeah, it was:
● How do I place ads based on a clickstream?
● Detect fraud in a credit-card swipe stream?
● Detect cancer from sensor data?
● Predict equipment failure ahead of time?
● Find people (un)like me?
● ... or ... or ... or ... ????

How do I figure it all out?

● Well... what are my tools?
● Domain Knowledge
  ● (me! The Expert)
● Math & Science! Data Science, and
● Data – lots and lots and lots of it
  ● Old logs, new logs, databases, historical records, click-streams, CSV files, dumps
  ● Often TBs, sometimes PBs of it

Data: The Main Player

● Data: I got lots of it
● But it's a messy mixed-up lot
  ● Stored in HDFS, S3, DB2 or scattered about
  ● Incompatible formats, older & newer bits
  ● Missing stuff, or "known broken" fields
● And it's Big
  ● Too big for my laptop, or even one server

Data: Cleaning it Up

● Just the parts I want:
  ● SQL, Hive, HBase, grep
  ● Data is Big, so this is slow
● Wrong format:
  ● Awk, shell scripts, files, disk-to-disk
● Inspection (do I got it right yet?)
  ● Grep/awk, histograms, plots/prints
  ● Visualization tools

From Facts to Knowledge

● Data cleaned up: lots of neat rows of facts
  ● Lots of rows: millions and billions ...
● But facts are not knowledge
  ● Too much to "get it" by looking
● Time for a mathematical Model!
● Here again, Big limits my tools
  ● Either can't deal, or deal very very slowly

Modeling: math(data)

● Modeling gives a simpler view
  ● A way to understand
  ● And predict in real time
● Modeling is Math!
  ● Generalized Linear Modeling
    – Oldest, most well known & used
  ● Random Forest
  ● K-Means Clustering

Big Data vs Modeling

● Model: a concise description of my data
  ● A more accurate model predicts better
● Generally More Data builds a better Model
  ● But only if the tool can handle it
  ● (some datasets are not helped, but it rarely hurts)
● Tools can't handle Big: so down sample, and use a better (more complex) algorithm

Big Data vs Better Algorithm

● Don't want to choose Big vs Better
  ● Down sampling loses information
● Want a way to manipulate Big Data like it's small: interactive & fast. Subtle when I need it and brute force when I don't
● Build the Better Algorithm and use Big Data
  ● Seeing 10x more data yield prediction increases, e.g. from 75% to 85%

Building The Better Big Data Mousetrap

● Want fast: means dram instead of disk
  ● Fall back to disk, if data >>> dram
● Want fast: use all cpus
  ● Problems are mostly data-parallel anyways
● Want ease-of-programming:
  ● “parallelism without effort”
  ● Well understood programming model

● Want ease-of-use:
  ● python, json, REST/HTML interfaces
  ● Full R semantics (via fastr project)
● Data ingest:
  ● where: HDFS, S3, NFS, URL, URI, browser
  ● what: csv, hive, rdata


● Want ease-of-admin:
  ● e.g. java -jar h2o.jar
  ● auto-cluster (no config at all) or hadoop Job
● Want ease-of-upgrade: adding more servers gives
  ● More CPU (faster exec)
  ● More DRAM (larger data in dram)
  ● More network/disk bandwidth (faster ingest)

H2O: An Engine for Big Math

● Built in layers – pick your abstraction level
● Analysts, starters: REST, browser
  – "clicky clicky" load data, build model, score
● Scientists: R, JSON, python to drive engine
  – Complex math
● Math hackers: building new algos
  – Full (distributed) Java Memory Model
  – "codes like Java, runs distributed"
● Core Engineering: call us, we're hiring

Core Engineering: K/V Store

● Classic distributed Key/Value store
  ● get/put/atomic-transaction
  ● Full JMM semantics, exact consistency
  ● Full caching as-needed
    – Cached keys "get" in 150 nanos
    – Misses limited by network speed
● Hardware-like cache coherency protocol
● Distributed fork/join (thanks Doug Lea)

Core Engineering: D/F/J

● Distributed fork/join (jsr 166y)
  ● Recursive-descent for data-parallel
  ● Distribution handled by the core
    – Log-tree scatter/gather across cluster
● Supports map/reduce-style directly
  ● But also "do this on all nodes" style
  ● Or random graph hacking
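The recursive-descent style can be sketched on a single JVM with the standard java.util.concurrent fork/join classes (the successor to the jsr166y preview package). This is only a local illustration: the class name, the plain double[] input and the LEAF_SIZE cutoff are assumptions made for the example, and H2O's cross-node scatter/gather is not shown.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSum {
    // Hypothetical cutoff: at or below this many elements, run the plain loop.
    static final int LEAF_SIZE = 1024;

    static class SumTask extends RecursiveTask<Double> {
        final double[] data;
        final int lo, hi; // half-open range [lo, hi)

        SumTask(double[] data, int lo, int hi) {
            this.data = data;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Double compute() {
            if (hi - lo <= LEAF_SIZE) {
                // Leaf: the obvious for-loop over this slice.
                double sum = 0;
                for (int i = lo; i < hi; i++) sum += data[i];
                return sum;
            }
            // Recursive descent: split in half, run both halves in parallel.
            int mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(data, lo, mid);
            SumTask right = new SumTask(data, mid, hi);
            left.fork();                       // schedule the left half
            double rightSum = right.compute(); // do the right half in this thread
            return left.join() + rightSum;     // gather: combine the partial sums
        }
    }

    public static double sum(double[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        java.util.Arrays.fill(data, 1.0);
        System.out.println(sum(data));
    }
}
```

The splitting forms a binary tree over the data, so the partial results are combined in a logarithmic number of levels, which is the same shape as the log-tree gather named on the slide.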

Math Hacking

● “Tastes like (distributed) java”
  (actual inner loop, auto-parallel, auto-distributed)
● Big “vector math” is easy
  ● The obvious for-loop "just works"

    for( int i=0; i<rows; i++ ) {
      double X = ary.datad(bits,i,A);
      double Y = ary.datad(bits,i,B);
      _sumX += X;
      _sumY += Y;
      _sumX2+= X*X;
    }

Math Hacking

● Dense-vector algorithms are easy
  ● Generalized Linear Modeling: 2 weeks
  ● K-means: 2 days
  ● Histogram: 2 hours
● Random Forest: not dense vectors
  ● Still makes good use of D/F/J
  ● All-CPUs, all-nodes still light up
    – Very fast tree building

Science: dancing with the data

● Like the belle of the ball, the main algos (GLM, k-means, RF) only arrive when the data is properly dressed

● Munging data: dropping junk columns, replacing missing bits, adding features

● H2O provides a tool-kit
  ● Big vector calculator: "d := a+b*c"
  ● dram speeds: "msec per Gbyte"
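To make the "d := a+b*c" expression concrete, here is a minimal single-JVM sketch of an element-wise vector calculation using the standard Arrays.parallelSetAll. The class and method names are hypothetical and this is not H2O's calculator; it only shows the kind of data-parallel element-wise work the expression stands for.

```java
import java.util.Arrays;

public class VectorCalc {
    // Element-wise d[i] = a[i] + b[i] * c[i], spread across available cores.
    public static double[] addMul(double[] a, double[] b, double[] c) {
        if (a.length != b.length || a.length != c.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double[] d = new double[a.length];
        Arrays.parallelSetAll(d, i -> a[i] + b[i] * c[i]);
        return d;
    }

    public static void main(String[] args) {
        double[] d = addMul(new double[] {1, 2},
                            new double[] {3, 4},
                            new double[] {5, 6});
        System.out.println(Arrays.toString(d));
    }
}
```

Each output element depends only on the inputs at the same index, so the work splits cleanly across CPUs with no coordination between elements.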

Science: APIs

● Need to script, automate repetitive tasks
● R via fastr and bigmemory package
  ● Full R semantics, 5x R speed single-thread
  ● But your vectors can be very very big...
  ● https://github.com/allr/fastr
● REST / URL / JSON
  ● Drive from e.g. python, scripts, curl, wget
    – e.g. h2o testing harness is all python

Demos & Quick Starts

● Full browser interface
● Tutorials
● Handful of clicks to run e.g. RF or GLM on gigabytes of data
● Auto-cluster in seconds
  ● On EC2 (or your laptops right now)
● Good enough for serious work
  ● (and we have customers using this interface!)

Demo Time!


H2O: An Engine for Big Math

● Focus on Big Math
  ● Easy to extend via M/R or K/V programming
● Auto-cluster
● Data-parallel exec across all CPUs
● dram caching across all servers
● Parallel ingest across all servers
● Open source: https://github.com/0xdata/h2o

Math Hacking: The M/R API

● Make a 'golden object'
  ● Will be endlessly replicated across cluster
  ● Set 'input' fields:
    – Auto-serialized, distributed
    – Shallow-copy on nodes: e.g. arrays share state
● golden.map(key_1mb)
  ● map() called on clone for each 1mb
  ● Set 'output' fields now

Math Hacking: The M/R API

● gold.reduce(gold)
  ● Combine pairs of 'golden' objects
  ● Both locally and remotely (distributed)
  ● Log-tree roll-up
● 'output' fields will be shipped over the wire
  ● null-out 'input' fields
  ● transient marker available

Math Hacking: Example

    CalcSumsTask cst = new CalcSumsTask();
    cst._arykey = ary._key;   // BigData Table key
    cst._colA = colA;         // integer indices to columns
    cst._colB = colB;
    cst.invoke(ary._key);     // Do It!

    // Results returned directly in 'cst' object
    ...cst._sumX...           // use results

    public static class CalcSumsTask extends MRTask {
      Key _arykey;                 // BigData Table key
      int _colA, _colB;            // Column indices to work on
      double _sumX,_sumY,_sumX2;   // Sum of X's, Y's, X^2's

      // map called for every 1Mb of data, or so
      public void map( Key key1Mb ) {
        … boiler plate...          // lots of unimportant details
        // Standard for-loop over the data
        for( int i=0; i<rows; i++ ) {
          double X = ary.datad(bits,i,A);
          double Y = ary.datad(bits,i,B);
          _sumX += X;
          _sumY += Y;
          _sumX2+= X*X;
        }
      }

      // reduce called between pairs of golden objects
      // always reduce right-side into 'this' object
      public void reduce( DRemoteTask rt ) {
        CalcSumsTask cst = (CalcSumsTask)rt;
        _sumX += cst._sumX ;
        _sumY += cst._sumY ;
        _sumX2+= cst._sumX2;
      }
    }
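For readers without an H2O cluster at hand, the golden-object pattern can be imitated in plain Java on one JVM. The sketch below is an analogy, not the MRTask API: the GoldenSums class, the in-memory double[][] chunks standing in for 1Mb keys, and the static invoke helper are all assumptions made for illustration. It clones the golden object once per chunk, calls map on each clone, then rolls the clones up pairwise in a log-tree.

```java
import java.util.ArrayList;
import java.util.List;

public class GoldenSums implements Cloneable {
    // 'input' fields: set once on the golden object, copied into every clone.
    public int colA, colB;
    // 'output' fields: filled in by map(), combined by reduce().
    public double sumX, sumY, sumX2;

    // Stand-in for "map called for every 1Mb of data": one chunk of rows.
    public void map(double[][] chunk) {
        for (double[] row : chunk) {
            double x = row[colA];
            double y = row[colB];
            sumX += x;
            sumY += y;
            sumX2 += x * x;
        }
    }

    // Always reduce the right-side object into 'this'.
    public void reduce(GoldenSums that) {
        sumX += that.sumX;
        sumY += that.sumY;
        sumX2 += that.sumX2;
    }

    @Override
    public GoldenSums clone() {
        try {
            return (GoldenSums) super.clone(); // shallow copy, as on the slide
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }

    // Hypothetical driver: results end up in the golden object itself.
    public static void invoke(GoldenSums golden, List<double[][]> chunks) {
        List<GoldenSums> parts = new ArrayList<>();
        for (double[][] chunk : chunks) {
            GoldenSums g = golden.clone();
            g.map(chunk);
            parts.add(g);
        }
        if (parts.isEmpty()) return;
        // Log-tree roll-up: combine neighbours at stride 1, 2, 4, ...
        for (int stride = 1; stride < parts.size(); stride *= 2) {
            for (int i = 0; i + stride < parts.size(); i += 2 * stride) {
                parts.get(i).reduce(parts.get(i + stride));
            }
        }
        golden.reduce(parts.get(0));
    }

    public static void main(String[] args) {
        GoldenSums golden = new GoldenSums();
        golden.colA = 0;
        golden.colB = 1;
        List<double[][]> chunks = new ArrayList<>();
        chunks.add(new double[][] {{1, 2}, {3, 4}});
        chunks.add(new double[][] {{5, 6}});
        invoke(golden, chunks);
        System.out.println(golden.sumX + " " + golden.sumY + " " + golden.sumX2);
    }
}
```

The roll-up runs sequentially here for simplicity; the pairs combined at each stride are independent of one another, which is what would let a real engine run them in parallel, locally or across nodes.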

A Fast K/V Store

● Distributed in-memory K/V Store
● Peer-to-peer, no master
● Full JMM semantics, get/put/atomic/remove
● Hardware-style cache-coherency protocol
● Fast: 150 nanos for cache-hitting 'get'
● Fast: 50 micros for cache-missing 'put'
● No persistence (see above for 'fast')
● No locks: use 'atomic' instead

K/V Design Goals

● JMM semantics on all get/put
● Cache-hitting 'gets' as fast as possible
  ● Local hashtable lookup + few tests
● 'puts' as lazy as possible (still JMM)
  ● Typically do not block for remote put
● Arbitrary transactions on single Keys
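The idea of a lock-free transaction on a single key can be sketched with a compare-and-swap retry loop over the standard ConcurrentHashMap. This is a local, single-JVM analogy under stated assumptions: the AtomicKV class and its atomic method are hypothetical names, and nothing here reflects how H2O implements its distributed 'atomic'.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

public class AtomicKV<K, V> {
    private final ConcurrentHashMap<K, V> store = new ConcurrentHashMap<>();

    public V get(K key) {
        return store.get(key);
    }

    public void put(K key, V val) {
        store.put(key, val);
    }

    // Lock-free transaction on one key: read, compute, then try to swap.
    // 'update' receives null for a missing key, must return a non-null value,
    // and may be called more than once if another writer wins the race.
    public V atomic(K key, UnaryOperator<V> update) {
        while (true) {
            V old = store.get(key);
            V neu = update.apply(old);
            boolean won = (old == null)
                    ? store.putIfAbsent(key, neu) == null
                    : store.replace(key, old, neu);
            if (won) return neu; // our update went in atomically
            // Lost the race: somebody changed the value, so retry.
        }
    }

    public static void main(String[] args) {
        AtomicKV<String, Integer> kv = new AtomicKV<>();
        kv.atomic("hits", v -> v == null ? 1 : v + 1);
        kv.atomic("hits", v -> v == null ? 1 : v + 1);
        System.out.println(kv.get("hits"));
    }
}
```

Because the update function can run several times, it should be free of side effects; that is the usual price of replacing a lock with a retry loop.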

K/V Coherency Protocol

● Many are possible
  ● Picked a {fast-enough,easy} one
  ● Faster is possible
● Every Key has 1 master node
  ● And everybody knows it from Key hash
● Master orders racing writes
  ● Winner of NBHM (NonBlockingHashMap) insert


● Master tracks replicas
  ● Single CAS update
● Invalidate replicas on update
  ● Single CAS required, plus the invalidates
  ● Cache miss on replica will reload
● Interlocking get/put races solved with finite state machine
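Two pieces of the protocol, finding a key's master from its hash and invalidating replicas on update, can be illustrated with a toy single-JVM model. Everything below is an assumption made for the example: the class and method names are hypothetical, node ids are plain ints, and a synchronized method stands in for the CAS and finite state machine of the real protocol.

```java
import java.util.HashSet;
import java.util.Set;

public class CoherencySketch {
    // Every node can compute a key's master from the key hash alone,
    // so no directory lookup is needed.
    public static int homeNode(Object key, int clusterSize) {
        return Math.floorMod(key.hashCode(), clusterSize);
    }

    // Master-side record for one key: the value plus the nodes caching it.
    public static class MasterEntry {
        private Object value;
        private final Set<Integer> replicas = new HashSet<>();

        public MasterEntry(Object initial) {
            this.value = initial;
        }

        // A remote 'get' that missed its cache: remember the replica,
        // hand back the current value.
        public synchronized Object fetch(int nodeId) {
            replicas.add(nodeId);
            return value;
        }

        // A 'put': install the new value and return the replicas that must
        // be invalidated. Their next 'get' will miss and reload via fetch().
        public synchronized Set<Integer> update(Object newValue) {
            value = newValue;
            Set<Integer> stale = new HashSet<>(replicas);
            replicas.clear();
            return stale;
        }
    }

    public static void main(String[] args) {
        System.out.println("master of 'k' in a 4-node cluster: " + homeNode("k", 4));
        MasterEntry e = new MasterEntry("v1");
        e.fetch(1);
        e.fetch(2);
        System.out.println("invalidate: " + e.update("v2"));
    }
}
```

In this toy model the master serializes writers with a monitor; the slides instead describe a single CAS per update, which avoids blocking but needs the state machine to resolve interleaved gets and puts.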



Backup Slides


The Expert

● Domain Expert:
  ● What data is useful, which is trash
  ● What needs help to become useful
  ● Missing elements? Toss outliers?
  ● Build new features from old?

● All through this process Big Data is, well, Big, hence Slow to cp / awk / grep

● And Big limits my tools
