Making Pig Fly: Optimizing Data Processing on Hadoop
Daniel Dai (@daijy), Thejas Nair (@thejasn)
© Hortonworks Inc. 2011
Posted Feb 24, 2016 · 36 slides

Transcript
Page 1: Making Pig Fly - Optimizing Data Processing on Hadoop

Page 2: What is Apache Pig?

Pig Latin, a high-level data processing language.

An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Page 3: Pig Latin example

• Query: Get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;
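The query asks for the pages themselves, so an illustrative final step (not on the slide; the relation name and output path are assumptions) would project the url and store the result:

```pig
PAGES_u20s = foreach PVs_u20s generate PVs::url;
store PAGES_u20s into 'pages_visited_by_20s';
```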

Page 4: Why Pig?

• Faster development
  – Fewer lines of code
  – Don't re-invent the wheel
• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Page 5: Pig optimizations

• Ideally, the user should not have to bother
• Reality
  – Pig is still young and immature
  – Pig does not have the whole picture
    – Cluster configuration
    – Data histogram
  – Pig philosophy: Pig is docile

Page 6: Pig optimizations

• What Pig does for you
  – Safe transformations of the query to optimize it
  – Optimized operations (join, sort)
• What you do
  – Organize input in an optimal way
  – Optimize the Pig Latin query
  – Tell Pig which join/group algorithm to use

Page 7: Rule-based optimizer

• Column pruner
• Push up filter
• Push down flatten
• Push up limit
• Partition pruning
• Global optimizer

Page 8: Column pruner

• Pig will do column pruning automatically

A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';

Pig will prune a2 automatically.

• Cases where Pig will not do column pruning automatically
  – No schema specified in the load statement

A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';

DIY:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';

Page 9: Column pruner

• Another case where Pig does not do column pruning
  – Pig does not keep track of unused columns after grouping

A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';

DIY:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate $0, $1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';

Page 10: Push up filter

• Pig splits the filter condition before pushing it up

Original query:         A, B → Join → Filter (a0>0 && b0>10)
Split filter condition: A, B → Join → Filter (a0>0) → Filter (b0>10)
Push up filter:         A → Filter (a0>0) and B → Filter (b0>10), then Join
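Written out in Pig Latin, the rewrite means a filter placed after the join (relation and field names here are illustrative) is executed on each input before the join:

```pig
A = load 'a' as (a0, a1);
B = load 'b' as (b0, b1);
C = join A by a0, B by b0;
-- Pig splits this condition and pushes each half above the join,
-- so A is filtered by a0 > 0 and B by b0 > 10 before joining
D = filter C by A::a0 > 0 and B::b0 > 10;
```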

Page 11: Other push up/down

• Push down flatten
  Load → Flatten → Order  becomes  Load → Order → Flatten

A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';

• Push up limit
  Load → Foreach → Limit  becomes  Load (limited) → Foreach
  Load → Order → Limit    becomes  Load → Order (limited)

Page 12: Partition pruning

• Prune unnecessary partitions entirely
  – HCatLoader

Without pruning: HCatLoader reads partitions 2010, 2011, 2012, then applies Filter (year>=2011)
With pruning:    HCatLoader (year>=2011) reads only partitions 2011 and 2012
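As a sketch (the loader's package name varies across HCatalog versions, and the table name is illustrative):

```pig
A = load 'weblogs' using org.apache.hcatalog.pig.HCatLoader();
-- the filter on the partition column is pushed into the loader,
-- so only the year >= 2011 partitions are ever read
B = filter A by year >= 2011;
```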

Page 13: Intermediate file compression

Pig script → [map 1 | reduce 1] → [map 2 | reduce 2] → Pig temp file → [map 3 | reduce 3] → Pig temp file

• Intermediate file between map and reduce
  – Snappy
• Temp file between MapReduce jobs
  – No compression by default

Page 14: Enable temp file compression

• Pig temp files are not compressed by default
  – Issues with Snappy (HADOOP-7990)
  – LZO: not Apache license
• Enable LZO compression
  – Install LZO for Hadoop
  – In conf/pig.properties:

pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo

  – With LZO, up to >90% disk saving and a 4x query speedup

Page 15: Multiquery

• Combine two or more map/reduce jobs into one
  – Happens automatically
  – Cases where we want to control multiquery: it combines too many

One Load feeds three pipelines in a single job: Group by $0 → Foreach → Store; Group by $1 → Foreach → Store; Group by $2 → Foreach → Store

Page 16: Control multiquery

• Disable multiquery
  – Command line option: -M
• Use "exec" to mark the boundary:

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';

Page 17: Implement the right UDF

• Algebraic UDF
  – Initial
  – Intermediate
  – Final

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';

Map → Initial
Combiner → Intermediate
Reduce → Final
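The three phases can be sketched in plain Java (the class name and List-based signatures are illustrative; a real Pig algebraic UDF implements the Algebraic interface with Initial/Intermed/Final inner classes that consume tuples):

```java
import java.util.List;

// Sketch of the three phases of an algebraic SUM. Because addition is
// associative, partial sums computed in the map and combiner phases can
// be merged in any order without changing the final result.
class AlgebraicSumSketch {
    // Initial: runs in the map task on a chunk of the input bag.
    static long initial(List<Long> values) {
        long sum = 0;
        for (long v : values) sum += v;
        return sum;
    }

    // Intermediate: runs in the combiner, merging partial sums.
    static long intermediate(List<Long> partials) {
        return initial(partials);
    }

    // Final: runs in the reduce task, merging the last partial sums.
    static long finalPhase(List<Long> partials) {
        return initial(partials);
    }
}
```

For example, two map tasks producing partial sums 3 and 7 let the reducer compute the total from just those two numbers instead of the full bag.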

Page 18: Implement the right UDF

• Accumulator UDF
  – Reduce-side UDF
  – Normally takes a bag
• Benefit
  – Big bags are passed in batches
  – Avoids using too much memory
  – Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';

public class my_accum extends EvalFunc<Long> implements Accumulator<Long> {
  public void accumulate(Tuple b) throws IOException {
    // take one chunk of the bag
  }
  public Long getValue() {
    // called after all bag chunks are processed
  }
  public void cleanup() {}
}
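The batching contract can be modeled in plain Java (class and method names are illustrative; real Pig passes each batch as a bag wrapped in a Tuple):

```java
import java.util.List;

// Model of the Accumulator contract: Pig calls accumulate() once per
// batch of pig.accumulative.batchsize tuples, then getValue() once,
// so the whole bag never has to sit in memory at the same time.
class AccumulatorSketch {
    private long sum = 0;

    // Called once per chunk of the grouped bag.
    void accumulate(List<Long> batch) {
        for (long v : batch) sum += v;
    }

    // Called after all chunks have been processed.
    long getValue() {
        return sum;
    }
}
```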

Page 19: Memory optimization

• Control bag size on the reduce side
  – If a bag's size exceeds the threshold, it spills to disk
  – Control the bag size to fit the bag in memory if possible
  – pig.cachedbag.memusage=0.2

MapReduce: reduce(Text key, Iterator<Writable> values, ……)
The single reduce-side iterator is materialized as one bag per input (bag of input 1, bag of input 2, bag of input 3).

Page 20: Optimization starts before Pig

• Input format
• Serialization format
• Compression

Page 21: Input format - test query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, ….)

Page 22: Input formats

[Chart: run time (sec) for PigStorage, LzoPigStorage, PigStorage with types, and AvroStorage (has types)]

Page 23: Columnar format

• RCFile
• Columnar format for a group of rows
• More efficient if you query a subset of columns

Page 24: Tests with RCFile

• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns

Page 25: RCFile test results

[Chart: run time (sec) for Plain Text vs RCFile, projecting 1 column and projecting all 5 columns]

Page 26: Cost based optimizations

• Optimization decisions based on your query/data
• Often an iterative process: run query → measure → tune

Page 27: Cost based optimization - Aggregation

• Hash Based Agg (HBA)
  – Use pig.exec.mapPartAgg=true to enable

Map task: Map (logic) → HBA → map output → Reduce task

Page 28: Cost based optimization - Hash Agg.

• Auto-off feature
  – Switches off HBA if the output reduction is not good enough
• Configuring Hash Agg
  – Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
  – Configure memory used: pig.cachedbag.memusage
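In conf/pig.properties the knobs look like this (the minReduction value is an illustrative threshold, not a recommendation):

```properties
pig.exec.mapPartAgg=true
pig.exec.mapPartAgg.minReduction=10
pig.cachedbag.memusage=0.2
```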

Page 29: Cost based optimization - Join

• Use the appropriate join algorithm
  – Skew on the join key: skew join
  – One side fits in memory: FR (fragment-replicate) join
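In Pig Latin the algorithm is chosen with the `using` clause (relation and field names are illustrative):

```pig
-- FR (fragment-replicate) join: 'small' must fit in memory
J1 = join big by key, small by key using 'replicated';

-- skew join: handles heavy skew on the join key
J2 = join big by key, other by key using 'skewed';
```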

Page 30: Cost based optimization - MR tuning

• Tune MR parameters to reduce IO
  – Control spills using map sort params
  – Reduce shuffle/sort-merge params

Page 31: Parallelism of reduce tasks

[Chart: runtime for 4, 6, 8, 24, 48, and 256 reduce tasks]

• Number of reduce slots = 6
• Factors affecting runtime
  – Cores simultaneously used / skew
  – Cost of having additional reduce tasks

Page 32: Cost based optimization - keep data sorted

• Frequent join operations on the same keys
  – Keep data sorted on the keys
• Use merge join
• Optimized group on sorted keys
• Works with few load functions: needs an additional interface implementation
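With both inputs already sorted on the join key, merge join avoids the shuffle entirely (relation names are illustrative):

```pig
-- A and B must be sorted on 'key', and the load function
-- must support merge join
C = join A by key, B by key using 'merge';
```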

Page 33: Optimizations for sorted data

[Chart: run time (sec) for sort+sort+join+join vs join+join, broken down into Sort1, Sort2, Join 1, Join 2]

Page 34: Future Directions

• Optimize using stats
• Using historical stats with HCatalog
• Sampling

Page 35: Questions

?
