Smart database for next-generation applications LOGICBLOX - SIMPLIFYING YOUR DATA STACK MLConf NY, 2014.04.11
MLconf NYC Shan Shan Huang

Nov 01, 2014

Page 1: MLconf NYC Shan Shan Huang

Smart database for next-generation applications

LOGICBLOX - SIMPLIFYING YOUR DATA STACK

MLConf NY, 2014.04.11

Page 2: MLconf NYC Shan Shan Huang

AREN’T THERE ENOUGH DATABASES?

©2014. LogicBlox. All Rights Reserved.

Page 4: MLconf NYC Shan Shan Huang

IN 2007 THE SMARTPHONE UNIFIED CONSUMER DEVICES

Is a similar revolution coming in databases?

Page 5: MLconf NYC Shan Shan Huang

OUR MISSION

▪ Be the iPhone of databases
  ▪ “Hybrid Transaction Analytical Processing”, Gartner, Jan. 2014
▪ One database to replace many specialized databases
  ▪ Transactional (e.g. Oracle, VoltDB, NuoDB)
  ▪ Analytical (e.g. Teradata, Redshift, Hadoop)
  ▪ Graphs
  ▪ Documents
  ▪ ...

Footnote: for a certain class of applications

Page 7: MLconf NYC Shan Shan Huang

SHOW ME

Page 8: MLconf NYC Shan Shan Huang

FIRST THINGS FIRST

▪ Declarative query language
  ▪ Based on Datalog
▪ ACID transactions
  ▪ In fact… full serializability
▪ Built from scratch, not by stitching together different databases under the hood

Page 9: MLconf NYC Shan Shan Huang

CLIQUES IN LOGIQL

3-clique (triangle query):

3cliques(a, b, c) <-
    edge(a, b),
    edge(a, c),
    edge(b, c).

4-clique:

4cliques(a, b, c, d) <-
    edge(a, b),
    edge(a, c),
    edge(a, d),
    edge(b, c),
    edge(b, d),
    edge(c, d).
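
For readers who do not know Datalog, the two rules above can be read as nested lookups over an edge relation. The following is a minimal Python sketch of that reading, not of how LogicBlox evaluates the rules; the helper names (`successors`, `three_cliques`, `four_cliques`) are made up for illustration, and edges are assumed to be directed pairs.

```python
def successors(edges):
    """Index the edge relation: node -> set of nodes it points to."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    return succ


def three_cliques(edges):
    """All (a, b, c) with edge(a, b), edge(a, c), edge(b, c)."""
    succ = successors(edges)
    return {
        (a, b, c)
        for a, outs in succ.items()
        for b in outs
        for c in outs & succ.get(b, set())
    }


def four_cliques(edges):
    """All (a, b, c, d) satisfying the six edge atoms of the 4-clique rule."""
    succ = successors(edges)
    return {
        (a, b, c, d)
        for a, outs in succ.items()
        for b in outs
        for c in outs & succ.get(b, set())
        for d in outs & succ.get(b, set()) & succ.get(c, set())
    }


# Hypothetical data: the complete graph on 4 nodes, edges from smaller to larger id.
k4 = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
triangles_in_k4 = three_cliques(k4)
quads_in_k4 = four_cliques(k4)
```

Each variable of the rule becomes one loop, and each extra `edge(...)` atom becomes one more set in the intersection.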

Page 10: MLconf NYC Shan Shan Huang

3 CLIQUE in LOGIQL vs. SQL

SQL:

SELECT DISTINCT
    v1.x AS x, v2.x AS y, v3.x AS w
FROM edge AS v1, edge AS v2, edge AS v3
WHERE
    v1.y = v2.x
    AND v2.y = v3.x
    AND EXISTS (
        SELECT 1 FROM edge AS vv1
        WHERE vv1.x = v1.x AND vv1.y = v3.x);

LogiQL:

3cliques(a, b, c) <-
    edge(a, b),
    edge(a, c),
    edge(b, c).

Page 11: MLconf NYC Shan Shan Huang

3 CLIQUE in LOGIQL vs SPARQL

SPARQL:

sparql PREFIX g: <http://logicblox.com/graph>
SELECT DISTINCT ?av ?bv ?cv FROM <$database>
WHERE {
    ?a g:edge ?b .
    ?a g:edge ?c .
    ?b g:edge ?c .
    ?a g:value ?av .
    ?b g:value ?bv .
    ?c g:value ?cv .
    FILTER (xsd:int(?av) < xsd:int(?bv) and
            xsd:int(?bv) < xsd:int(?cv))
};

LogiQL:

3cliques(a, b, c) <-
    edge(a, b),
    edge(a, c),
    edge(b, c).

Page 12: MLconf NYC Shan Shan Huang

3-CLIQUE IN LOGIQL vs. GRAPHLAB

GraphLab - C++:

class triangle_count
    : public graphlab::ivertex_program<graph_type, set_union_gather> {
public:
  bool do_not_scatter;

  // Gather on all edges
  edge_dir_type gather_edges(icontext_type& context,
                             const vertex_type& vertex) const {
    return graphlab::ALL_EDGES;
  }

  // For each edge, report the neighbor on the other end (subject to the
  // degree-based tie-break below)
  gather_type gather(icontext_type& context, const vertex_type& vertex,
                     edge_type& edge) const {
    set_union_gather gather;
    graphlab::vertex_id_type otherid =
        edge.target().id() == vertex.id() ? edge.source().id()
                                          : edge.target().id();
    size_t other_nbrs =
        (edge.target().id() == vertex.id())
            ? (edge.source().num_in_edges() + edge.source().num_out_edges())
            : (edge.target().num_in_edges() + edge.target().num_out_edges());
    size_t my_nbrs = vertex.num_in_edges() + vertex.num_out_edges();
    if (PER_VERTEX_COUNT || (other_nbrs > my_nbrs) ||
        (other_nbrs == my_nbrs && otherid > vertex.id())) {
      gather.v = otherid;
    }
    return gather;
  }

  // Store the gathered neighbor set on the vertex
  void apply(icontext_type& context, vertex_type& vertex,
             const gather_type& neighborhood) {
    do_not_scatter = false;
    if (neighborhood.vid_vec.size() == 0) {
      vertex.data().vid_set.clear();
      if (neighborhood.v != (graphlab::vertex_id_type(-1)))
        vertex.data().vid_set.vid_vec.push_back(neighborhood.v);
    } else {
      vertex.data().vid_set.assign(neighborhood.vid_vec);
    }
    do_not_scatter = vertex.data().vid_set.size() == 0;
  }

  edge_dir_type scatter_edges(icontext_type& context,
                              const vertex_type& vertex) const {
    if (do_not_scatter) return graphlab::NO_EDGES;
    else return graphlab::OUT_EDGES;
  }

  // Count the neighbors shared by the two endpoints of each edge
  void scatter(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    const vertex_data_type& srclist = edge.source().data();
    const vertex_data_type& targetlist = edge.target().data();
    if (targetlist.vid_set.size() < srclist.vid_set.size())
      edge.data() += count_set_intersect(targetlist.vid_set, srclist.vid_set);
    else
      edge.data() += count_set_intersect(srclist.vid_set, targetlist.vid_set);
  }
};

LogiQL:

3cliques(a, b, c) <-
    edge(a, b),
    edge(a, c),
    edge(b, c).

Page 13: MLconf NYC Shan Shan Huang

4 CLIQUE - SYNTHETIC DATA

Page 14: MLconf NYC Shan Shan Huang

4 CLIQUE - REAL DATA

Page 15: MLconf NYC Shan Shan Huang

SEMANTIC WEB - LUBM

Page 16: MLconf NYC Shan Shan Huang

DATA WAREHOUSE - TPC-H

Page 17: MLconf NYC Shan Shan Huang

A NON-TRIVIAL EXAMPLE: PAGERANK IN LOGIQL

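d[] = 0.85f.          // damping factor
tolerance[] = 0.01f.  // stop once the change in pr is this small

pr[p] = 1.0f / node_count[] <- node(p), !pr[p] = _.  // initial pr

// update pr while the new value differs from the old by more than the tolerance
pr[p] = (1.0f - d[]) + (d[] * sum[p]) <-
    r = (1.0f - d[]) + (d[] * sum[p]),
    abs[r - pr[p]] > tolerance[].

// otherwise keep the current value
pr[p] = pr[p] <-
    r = (1.0f - d[]) + (d[] * sum[p]),
    !(abs[r - pr[p]] > tolerance[]).

// nodes with no incoming contribution keep their value
pr[p] = pr[p] <- !sum[p] = _.

// sum of pr over incoming edges, each split by the source's out-degree
sum[n] = t <-
    agg<< t = total(r) >>
    edge(p, n),
    r = pr[p] / out_count[p].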
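
As a rough cross-check of what these rules compute, here is a minimal Python sketch of the same update. It assumes all scores are updated together in rounds and adds a `max_rounds` cap; both are choices of this sketch, not something the slide states, and the function name `pagerank` is hypothetical.

```python
def pagerank(nodes, edges, d=0.85, tolerance=0.01, max_rounds=100):
    """Apply the slide's update until no score changes by more than tolerance."""
    out_count = {n: 0 for n in nodes}
    incoming = {n: [] for n in nodes}
    for src, dst in edges:
        out_count[src] += 1
        incoming[dst].append(src)

    pr = {n: 1.0 / len(nodes) for n in nodes}  # initial pr
    for _ in range(max_rounds):
        nxt = {}
        for n in nodes:
            if not incoming[n]:
                nxt[n] = pr[n]  # no sum[n]: keep the current value
                continue
            total = sum(pr[p] / out_count[p] for p in incoming[n])
            r = (1.0 - d) + d * total
            # Move to r only if the change exceeds the tolerance.
            nxt[n] = r if abs(r - pr[n]) > tolerance else pr[n]
        if nxt == pr:
            break  # nothing moved: fixed point under the tolerance
        pr = nxt
    return pr


# Hypothetical graphs for illustration.
cycle = pagerank(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])
chain = pagerank(["a", "b"], [("a", "b")])
```

Note that the formula, as on the slide, uses `(1 - d)` rather than `(1 - d) / N`, so the scores are not normalized to sum to 1.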

Page 18: MLconf NYC Shan Shan Huang

HOW DOES IT WORK

Page 19: MLconf NYC Shan Shan Huang

ALGORITHMS FIRST

Computer Science @CompSciFact Sep 28

“Computer science is now about systems. It hasn’t been about algorithms since the 1960’s.” -- Alan Kay #hlf13

Page 20: MLconf NYC Shan Shan Huang

PHILOSOPHY: BRAINS BEFORE BRAWN

▪ Algorithmic scalability
  ▪ New worst-case optimal join algorithm
  ▪ Incremental maintenance proportional to trace edit distance
  ▪ Adaptive domain decomposition for parallelization
▪ Data structures
  ▪ Compression close to info-theoretic limit in some cases
  ▪ I/O minimization, cache consciousness
  ▪ Persistent data structures: full serializability, branch & merge, auditability, scalable distribution
▪ Unified declarative programming model
  ▪ Optimizations through aggressive analysis
▪ Brute force
  ▪ In-memory when data fits
  ▪ Distribution across thousands of cores, and GPUs

Page 21: MLconf NYC Shan Shan Huang

A SMART JOIN ALGORITHM - LFTJ

▪ “Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm”, T. Veldhuizen, ICDT 2014
  ▪ Best Newcomer Award

Page 22: MLconf NYC Shan Shan Huang

LFTJ INTUITION: CONSIDER MORE THAN PAIRS

▪ Widely adopted technique: pair-wise joins
▪ Suppose A, B, and C each have 1 million records distributed over 3 months
  ▪ Pair-wise join: best-case scenario, 0.5 million records as intermediate results
  ▪ LFTJ: no records materialized

[Figure: records of A(x), B(x), and C(x) laid out across Jan, Feb, and Mar]
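
The idea of looking at all relations at once can be sketched for the simplest case, a join of unary relations on one variable. The Python below is a simplified, round-based illustration in the spirit of a leapfrog join, not the LFTJ implementation: it assumes each relation is a sorted, duplicate-free list, and the function name and the toy "month" data are made up.

```python
from bisect import bisect_left


def leapfrog_intersect(*relations):
    """Intersect sorted, duplicate-free lists without building pairwise results."""
    if not relations or any(len(r) == 0 for r in relations):
        return []
    pos = [0] * len(relations)  # one cursor per relation
    out = []
    while True:
        # No key below the largest current key can be in every relation.
        target = max(r[p] for r, p in zip(relations, pos))
        for i, rel in enumerate(relations):
            # Seek each cursor to the first key >= target (binary search).
            pos[i] = bisect_left(rel, target, pos[i])
            if pos[i] == len(rel):
                return out  # one relation is exhausted: done
        if all(r[p] == target for r, p in zip(relations, pos)):
            out.append(target)
            pos[0] += 1  # step past the match and continue
            if pos[0] == len(relations[0]):
                return out


# Hypothetical layout: keys 0-9 = "Jan", 10-19 = "Feb", 20-29 = "Mar".
A = list(range(0, 20))                          # Jan and Feb
B = list(range(10, 30))                         # Feb and Mar
C = list(range(0, 10)) + list(range(20, 30))    # Jan and Mar
three_way = leapfrog_intersect(A, B, C)
pair_ab = leapfrog_intersect(A, B)
```

In this toy data every pair of relations overlaps, so a pair-wise plan materializes an intermediate result, while the three-way intersection is empty and the cursors simply skip to the end.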

Page 23: MLconf NYC Shan Shan Huang

SMARTER INCREMENTAL VIEW MAINTENANCE

▪ “Incremental Maintenance for Leapfrog Triejoin”, T. Veldhuizen, 2013
  ▪ http://arxiv.org/abs/1303.5313
▪ Replaced our implementation of Count and DRed algorithms [Gupta+ 93]
▪ Guarantees that work is done proportional to the trace edit distance between the before and after
  ▪ Critical for caching analytical views for performance while still incorporating real-time updates

Page 24: MLconf NYC Shan Shan Huang

INCREMENTALIZING 3 CLIQUE VIEW

DRed - Syntactic:

+3cliques(a, b, c) <-
    +edge(a, b), edge(a, c), edge(b, c).

+3cliques(a, b, c) <-
    edge(a, b), +edge(a, c), edge(b, c).

+3cliques(a, b, c) <-
    edge(a, b), edge(a, c), +edge(b, c).

LogicBlox - Algebraic:

3cliques(a, b, c) <-
    edge(a, b), edge(a, c), edge(b, c).

[Figure: edge(a, b), edge(a, c), edge(b, c)]
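
To make the three `+` rules concrete, here is a minimal Python sketch of the syntactic (delta-rule) approach for a single inserted edge. It illustrates the rewrite only, not the LogicBlox algebraic algorithm, and it scans the whole edge set for brevity where a real system would use an index. The names `triangles` and `insert_edge` are hypothetical.

```python
def triangles(edges):
    """Recompute the 3-clique view from scratch (for comparison)."""
    return {
        (a, b, c)
        for (a, b) in edges
        for (a2, c) in edges
        if a2 == a and (b, c) in edges
    }


def insert_edge(edges, view, new_edge):
    """Add new_edge and extend view with only the triangles it creates."""
    edges.add(new_edge)
    x, y = new_edge
    delta = set()
    # Rule 1, +edge(a, b): a = x, b = y; look for c with edge(a, c), edge(b, c).
    for (a, c) in edges:
        if a == x and (y, c) in edges:
            delta.add((x, y, c))
    # Rule 2, +edge(a, c): a = x, c = y; look for b with edge(a, b), edge(b, c).
    for (a, b) in edges:
        if a == x and (b, y) in edges:
            delta.add((x, b, y))
    # Rule 3, +edge(b, c): b = x, c = y; look for a with edge(a, b), edge(a, c).
    for (a, b) in edges:
        if b == x and (a, y) in edges:
            delta.add((a, x, y))
    view |= delta
    return delta


# Hypothetical data: inserting (2, 3) closes the triangle (1, 2, 3).
edges = {(1, 2), (1, 3)}
view = triangles(edges)
added = insert_edge(edges, view, (2, 3))
```

Each rule fixes the new edge in one position of the body and joins it against the existing edges, so one insertion triggers three small queries instead of a full recomputation.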

Page 25: MLconf NYC Shan Shan Huang

INCREMENTAL MAINTENANCE OF 4-CLIQUE

Page 26: MLconf NYC Shan Shan Huang

A PARTICULAR USE CASE OF LB FOR GRAPHS

Page 27: MLconf NYC Shan Shan Huang

SCREAMING FAST PROGRAM ANALYSIS

▪ An order of magnitude faster than prior art
▪ Program analysis is graph analysis
  ▪ “Strictly Declarative Specification of Sophisticated Points-to Analyses” (OOPSLA ’09)
  ▪ “Exception Analysis and Points-to Analysis - Better Together” (ISSTA ’09)
  ▪ “Pick Your Context Well - Understanding Object-Sensitivity” (POPL ’11)
  ▪ “Efficient and Effective Handling of Exceptions in Java Points-to Analysis” (CC ’13)
  ▪ “Hybrid Context Sensitivity for Points-to Analysis” (PLDI ’13)
  ▪ “Set-based Pre-processing for Points-to Analysis” (OOPSLA ’13)

Page 28: MLconf NYC Shan Shan Huang

PROGRAM ANALYSIS IS ALL ABOUT GRAPH ANALYSIS

Page 29: MLconf NYC Shan Shan Huang

COMPARED TO PRIOR ART: >10x

Page 30: MLconf NYC Shan Shan Huang

...AND THAT WAS ON PRIOR ART LOGICBLOX

Page 31: MLconf NYC Shan Shan Huang

RECAP

▪ LogicBlox: the iPhone of databases
  ▪ But perhaps the $10k camera of graph queries?
▪ Holy Grails
  ▪ Declarative query language: LogiQL
  ▪ ACID transactions
▪ Guiding Principle: Brains before Brawn
  ▪ Innovate on algorithms: LFTJ, incremental view maintenance, etc.
  ▪ Innovate on data structures
  ▪ Declarative language allows aggressive optimizations
  ▪ Brute force when necessary

Page 32: MLconf NYC Shan Shan Huang

THANK YOU
