Page 1:

Trecul – Data Flow Processing using Hadoop and LLVM
David Blair

Page 2:

Agenda

Problem Statement

Trecul User Level Overview

Trecul Architecture and Design

Page 3:

Advertising Decision Solutions

The Akamai Advertising Network

Full Funnel Approach: Awareness, Prospecting, Remarketing

Data Coop
• 500+ sites' browse & buy data
• 300M monthly active cookies

Data Collection
• 600 million site events per day
• 50 million impressions per day

Page 4:

Making Data into Decisions

[Pipeline diagram: Ad Serving and Data Collection → Data Coop → Modeling & Scoring → Attribution & Billing]

Page 5:

Problem Statement

Had a working system, but with much pain

Commercial parallel RDBMS, MySQL, and Perl

Functional Requirements

Natural partitioning key = User/Cookie

Most processing aligns with that key

Structured data only (e.g., no text analysis)

Non-Functional Requirements

Fault Tolerance

Performance/Cost

Must be deployable in Akamai network

Reach Goals

Ease of use

Ad-hoc queries

Page 6:

Hadoop to the Rescue (Almost)

HDFS

Good enough performance

Hooks to customize data placement

Handles most single node failures

Map Reduce

Cluster and resource management

Partition parallel computing model

Shuffles for cases when we need it

Handles most single node failures

Mystery guest

Ad-hoc Java – anti-pattern

Hive or Pig – too slow for our needs

or …

Page 7:

Anatomy of a Trecul Program

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Page 8:

Anatomy of a Trecul Program: Operators

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Page 9:

Anatomy of a Trecul Program: Arguments

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Page 10:

Anatomy of a Trecul Program: Arrows

g = generate[output="'Hello World!' AS greeting", numRecords=1];
p = print[limit=10];
g -> p;
d = devNull[];
p -> d;

Page 11:

Anatomy of a Trecul Program

$ ads-df --file - << EOF
> g = generate[output="'Hello World!' AS greeting",
>              numRecords=1];
> p = print[limit=10];
> g -> p;
> d = devNull[];
> p -> d;
> EOF
Hello World!

Streaming pipes & filters without threads or processes
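Since - here presumably tells ads-df to read the program from standard input, the same flow saved to a file (a hypothetical name) should run identically:

$ ads-df --file hello.tql
Hello World!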

Page 12:

Basic Trecul Map Reduce Program

map.tql:

m = map[format="cre_date DATETIME, event_id INTEGER, greeting VARCHAR"];
e = emit[key="greeting"];
m -> e;

reduce.tql:

r = reduce[];
gb = group_by[sortKey="greeting", output="greeting, SUM(1) AS greeting_cnt"];
r -> gb;
w = write[file="hdfs://default:0/users/dblair/demo_mr"];
gb -> w;

$ ads-df --map map.tql --reduce reduce.tql --input /my/test/data
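The map script parses each input record and emit partitions on greeting; the shuffle then delivers each reduce task its keys in sorted order, so the sort-based group_by can count them and write one (greeting, greeting_cnt) row per distinct greeting to the HDFS output path.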

Page 13:

Example with branching and merging

r = read[file="hdfs://default:0/foo",
         format="akid CHAR(22), cre_date DATETIME, coop_id INTEGER"];
c = copy[output="akid", output="input.*"];
r -> c;
g = group_by[sortKey="akid", output="akid, SUM(1) AS activity"];
c -> g;
j = merge_join[leftKey="akid", rightKey="akid",
               where="activity > 5", output="l.*"];
c -> j;
g -> j;

[Dataflow graph: read → copy; copy → group_by → join; copy → join]

Page 14:

Scope of Trecul

File IO
• Read, write
• Simple parser/printer
• Local filesystem and HDFS
• Bucketed mode

Merging
• Hash Join & Merge Join
  – Inner, outer, semi, anti-semi
• Union All
• Sorted Union
• Switch

Transformation
• Generate, Copy
• Filter
• Sort
  – External sort
  – Supports presorted keys

Aggregation
• Hash Group By
• Sort Group By
• Hybrid Group By
• Sort Running Total

Map Reduce Integration
• Emit
• Map
• Reduce

MySQL Integration
• Select
• Insert
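As a small illustration, here is a hypothetical flow combining file IO, a filter, and a hash aggregate; every operator and argument name appears elsewhere in this deck, while the path and field values are made up:

r = read[file="hdfs://default:0/foo",
         format="akid CHAR(22), cre_date DATETIME, coop_id INTEGER"];
f = filter[where="coop_id > 100"];
r -> f;
g = group_by[hashKey="akid", output="akid, SUM(1) AS events"];
f -> g;
w = write[file="hdfs://default:0/users/dblair/coop_counts"];
g -> w;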

Page 15:

Limits of Trecul

Relational Data
• Primitive types: INTEGER, BIGINT, DOUBLE PRECISION, DECIMAL, DATE, DATETIME, CHAR(N), VARCHAR
• No container types: list, set, bag, map
• No Unicode, no code page support
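Every column carries one of these primitive types; a minimal sketch (the deck shows CAST to DATE, and the BIGINT cast is assumed to follow the same syntax):

t = generate[output="CAST('2012-01-01' AS DATE) AS d, CAST(1 AS BIGINT) AS n, 'abc' AS s", numRecords=1];
p = print[limit=10];
t -> p;
x = devNull[];
p -> x;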

No metadata management
• ADS has special operators that encapsulate specific datasets
• Formats may be stored in files

No optimizer
• We write very complex queries
• No barrier to construction of optimal plans
• Predictable performance in production
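Hand-planning is what makes those optimal plans possible. For example, because the datasets are bucketed and cookie-sorted (see the performance-testing slide), a query author can pick the sort-based aggregate and avoid a sort entirely; a hypothetical sketch with a made-up path and fields:

r = read[file="hdfs://default:0/site_events",
         format="akid CHAR(22), cre_date DATETIME, coop_id INTEGER"];
g = group_by[sortKey="akid", output="akid, SUM(1) AS events"];
r -> g;
p = print[limit=10];
g -> p;
x = devNull[];
p -> x;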

Page 16:

Trecul Architecture

C++, Linux

Expression Language
• Parser
• Semantic Analysis
• Codegen

Operator Library

Dataflow Runtime
• OS Services
• Graph Semantic Analysis
• Operator Scheduler
• Harness Integration
  – Single Machine
  – Hadoop
  – MPI (experimental)

Page 17:

Trecul and LLVM

LLVM

Open source compiler and toolchain project

Used extensively by Apple and Google

Supports static and JIT compilation

http://www.llvm.org

Trecul Expressions

Transforms, predicates, aggregates

Expressions & data structures compiled using LLVM

Operators are parameterized with expressions

Most operator code (and the scheduler, etc.) is statically compiled

Page 18:

Trecul Expression Compilation

R = read[file="/home/dblair/example",
         format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;

struct _R { int32_t a; int64_t b; date c; };

Note: The use of pseudo-C is for illustration only; we transform Trecul directly to LLVM IR and then to machine code.

Page 19:

Trecul Expression Compilation

R = read[file="/home/dblair/example",
         format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;

struct _R { int32_t a; int64_t b; date c; };

bool _F(_R * rec) { return rec->c >= date(1,1,2012); }

Page 20:

Trecul Expression Compilation

R = read[file="/home/dblair/example",
         format="a INTEGER, b BIGINT, c DATE"];
F = filter[where="c >= CAST('2012-01-01' AS DATE)"];
R -> F;
G = group_by[hashKey="c", output="c, SUM(a*b) AS s"];
F -> G;

struct _R { int32_t a; int64_t b; date c; };

bool _F(_R * rec) { return rec->c >= date(1,1,2012); }

struct _G {date c; int64_t s; };

void _G_init(_G * out, _R * in) { out->c = in->c; out->s = 0LL; }

void _G_upd(_G * out, _R * in) { out->s += in->a*in->b; }
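To show how the generated functions plug into the statically compiled operator code, here is a sketch of a hash group-by consume step in the same illustrative pseudo-C as above; hash_table, hash_find, and hash_insert are made-up helpers:

/* Statically compiled operator body (illustrative): the
   JIT-compiled _G_init/_G_upd run once per input record. */
void group_by_consume(hash_table * ht, _R * in) {
    _G * agg = hash_find(ht, in->c);   /* find the group for key c */
    if (agg == NULL) {
        agg = hash_insert(ht, in->c);  /* first record for this date */
        _G_init(agg, in);              /* copy key, zero the sum */
    }
    _G_upd(agg, in);                   /* s += a * b */
}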

Page 21:

Integration with Hadoop

[Diagram: the compiler turns a Trecul query into a Hadoop Pipes job; each task JVM hosts a Trecul executor, and the executors read from and write to HDFS]
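Hadoop Pipes is Hadoop's C++ interface to Map Reduce: each task JVM launches the native executor as a child process, which is how the compiled dataflow runs outside the JVM.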

Page 22:

Performance Testing

All tests performed on an 82-node Hadoop cluster; per node:

16 GB memory

1 x 4-core SMT Xeon

2 x 2TB 7200 RPM SATA disks

Two datasets in use

Site Events : cookie sorted; 2048 buckets; 640 GB; 100B rows

Impressions : cookie sorted; 2048 buckets; 700 GB; 17B rows

Buckets gzip compressed
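(That works out to roughly 640 GB / 2048 ≈ 320 MB per site-events bucket and 700 GB / 2048 ≈ 350 MB per impressions bucket, compressed.)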

Running Hadoop 0.21 and Hive 0.9.0

Had to implement a shim layer to get Hive to run on 0.21

Page 23:

Performance on Simple Queries

[Bar chart: Trecul vs. Hive runtimes for two simple queries, "Count distinct cookies" and "Join purchases & impressions"; y-axis from 0 to 20]

Page 24:

Performance on Complex Queries

[Bar chart: Trecul vs. Hive runtimes for three complex queries, "Join + Aggregation", "Modeling Feature Generation", and "Query #3"; y-axis from 0 to 200]

Page 25:

Thanks!

https://github.com/akamai-tech/trecul