Making Pig Fly: Optimizing Data Processing on Hadoop
Daniel Dai (@daijy), Thejas Nair (@thejasn)
© Hortonworks Inc. 2011
Posted Feb 24, 2016 · 36 slides

Transcript
Page 1: Making Pig Fly - Optimizing Data Processing on Hadoop

Page 2: What is Apache Pig?

Pig Latin, a high-level data processing language.

An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Page 3: Pig Latin example

• Query: Get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;
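The query asks for the pages themselves, so an illustrative final step (not on the slide; the relation name and output path are assumptions) would project the url and store the result:

```pig
PAGES_u20s = foreach PVs_u20s generate PVs::url;
store PAGES_u20s into 'pages_visited_by_20s';
```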

Page 4: Why Pig?

• Faster development
  – Fewer lines of code
  – Don't re-invent the wheel
• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Page 5: Pig optimizations

• Ideally, the user should not have to bother
• Reality
  – Pig is still young and immature
  – Pig does not have the whole picture
    – Cluster configuration
    – Data histogram
  – Pig philosophy: Pig is docile

Page 6: Pig optimizations

• What Pig does for you
  – Safe transformations of the query to optimize it
  – Optimized operations (join, sort)
• What you do
  – Organize input in an optimal way
  – Optimize the Pig Latin query
  – Tell Pig which join/group algorithm to use

Page 7: Rule-based optimizer

• Column pruner
• Push up filter
• Push down flatten
• Push up limit
• Partition pruning
• Global optimizer

Page 8: Column pruner

• Pig will do column pruning automatically

A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';

Pig will prune a2 automatically.

• Cases where Pig will not do column pruning automatically
  – No schema specified in the load statement

A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';

DIY:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';

Page 9: Column pruner

• Another case where Pig does not do column pruning
  – Pig does not keep track of unused columns after grouping

A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';

DIY:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate $0, $1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';

Page 10: Push up filter

• Pig splits the filter condition before pushing it up

Original query:         A, B → Join → Filter (a0>0 && b0>10)
Split filter condition: A, B → Join → Filter (a0>0) → Filter (b0>10)
Push up filter:         A → Filter (a0>0) and B → Filter (b0>10), then Join
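Written out in Pig Latin, the rewrite means a filter placed after the join (relation and field names here are illustrative) is executed on each input before the join:

```pig
A = load 'a' as (a0, a1);
B = load 'b' as (b0, b1);
C = join A by a0, B by b0;
-- Pig splits this condition and pushes each half above the join,
-- so A is filtered by a0 > 0 and B by b0 > 10 before joining
D = filter C by A::a0 > 0 and B::b0 > 10;
```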

Page 11: Other push up/down

• Push down flatten
  Load → Flatten → Order  becomes  Load → Order → Flatten

A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';

• Push up limit
  Load → Foreach → Limit  becomes  Load (limited) → Foreach
  Load → Order → Limit    becomes  Load → Order (limited)

Page 12: Partition pruning

• Prune unnecessary partitions entirely
  – HCatLoader

Without pruning: HCatLoader reads partitions 2010, 2011, 2012, then applies Filter (year>=2011)
With pruning:    HCatLoader (year>=2011) reads only partitions 2011 and 2012
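As a sketch (the loader's package name varies across HCatalog versions, and the table name is illustrative):

```pig
A = load 'weblogs' using org.apache.hcatalog.pig.HCatLoader();
-- the filter on the partition column is pushed into the loader,
-- so only the year >= 2011 partitions are ever read
B = filter A by year >= 2011;
```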

Page 13: Intermediate file compression

Pig script → [map 1 | reduce 1] → [map 2 | reduce 2] → Pig temp file → [map 3 | reduce 3] → Pig temp file

• Intermediate file between map and reduce
  – Snappy
• Temp file between MapReduce jobs
  – No compression by default

Page 14: Enable temp file compression

• Pig temp files are not compressed by default
  – Issues with Snappy (HADOOP-7990)
  – LZO: not Apache license
• Enable LZO compression
  – Install LZO for Hadoop
  – In conf/pig.properties:

pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo

  – With LZO, up to >90% disk saving and a 4x query speedup

Page 15: Multiquery

• Combine two or more map/reduce jobs into one
  – Happens automatically
  – Cases where we want to control multiquery: it combines too many

One Load feeds three pipelines in a single job: Group by $0 → Foreach → Store; Group by $1 → Foreach → Store; Group by $2 → Foreach → Store

Page 16: Control multiquery

• Disable multiquery
  – Command line option: -M
• Use "exec" to mark the boundary:

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';

Page 17: Implement the right UDF

• Algebraic UDF
  – Initial
  – Intermediate
  – Final

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';

Map → Initial
Combiner → Intermediate
Reduce → Final
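The three phases can be sketched in plain Java (the class name and List-based signatures are illustrative; a real Pig algebraic UDF implements the Algebraic interface with Initial/Intermed/Final inner classes that consume tuples):

```java
import java.util.List;

// Sketch of the three phases of an algebraic SUM. Because addition is
// associative, partial sums computed in the map and combiner phases can
// be merged in any order without changing the final result.
class AlgebraicSumSketch {
    // Initial: runs in the map task on a chunk of the input bag.
    static long initial(List<Long> values) {
        long sum = 0;
        for (long v : values) sum += v;
        return sum;
    }

    // Intermediate: runs in the combiner, merging partial sums.
    static long intermediate(List<Long> partials) {
        return initial(partials);
    }

    // Final: runs in the reduce task, merging the last partial sums.
    static long finalPhase(List<Long> partials) {
        return initial(partials);
    }
}
```

For example, two map tasks producing partial sums 3 and 7 let the reducer compute the total from just those two numbers instead of the full bag.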

Page 18: Implement the right UDF

• Accumulator UDF
  – Reduce-side UDF
  – Normally takes a bag
• Benefit
  – Big bags are passed in batches
  – Avoids using too much memory
  – Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';

public class my_accum extends EvalFunc<Long> implements Accumulator<Long> {
  public void accumulate(Tuple b) throws IOException {
    // take one chunk of the bag
  }
  public Long getValue() {
    // called after all bag chunks are processed
  }
  public void cleanup() {}
}
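The batching contract can be modeled in plain Java (class and method names are illustrative; real Pig passes each batch as a bag wrapped in a Tuple):

```java
import java.util.List;

// Model of the Accumulator contract: Pig calls accumulate() once per
// batch of pig.accumulative.batchsize tuples, then getValue() once,
// so the whole bag never has to sit in memory at the same time.
class AccumulatorSketch {
    private long sum = 0;

    // Called once per chunk of the grouped bag.
    void accumulate(List<Long> batch) {
        for (long v : batch) sum += v;
    }

    // Called after all chunks have been processed.
    long getValue() {
        return sum;
    }
}
```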

Page 19: Memory optimization

• Control bag size on the reduce side
  – If a bag's size exceeds the threshold, it spills to disk
  – Control the bag size to fit the bag in memory if possible
  – pig.cachedbag.memusage=0.2

MapReduce: reduce(Text key, Iterator<Writable> values, ……)
The single reduce-side iterator is materialized as one bag per input (bag of input 1, bag of input 2, bag of input 3).

Page 20: Optimization starts before Pig

• Input format
• Serialization format
• Compression

Page 21: Input format - test query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, ….)

Page 22: Input formats

[Chart: run time (sec) for PigStorage, LzoPigStorage, PigStorage with types, and AvroStorage (has types)]

Page 23: Columnar format

• RCFile
• Columnar format for a group of rows
• More efficient if you query a subset of columns

Page 24: Tests with RCFile

• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns

Page 25: RCFile test results

[Chart: run time (sec) for Plain Text vs RCFile, projecting 1 column and projecting all 5 columns]

Page 26: Cost based optimizations

• Optimization decisions based on your query/data
• Often an iterative process: run query → measure → tune

Page 27: Cost based optimization - Aggregation

• Hash Based Agg (HBA)
  – Use pig.exec.mapPartAgg=true to enable

Map task: Map (logic) → HBA → map output → Reduce task

Page 28: Cost based optimization - Hash Agg.

• Auto-off feature
  – Switches off HBA if the output reduction is not good enough
• Configuring Hash Agg
  – Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
  – Configure memory used: pig.cachedbag.memusage
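In conf/pig.properties the knobs look like this (the minReduction value is an illustrative threshold, not a recommendation):

```properties
pig.exec.mapPartAgg=true
pig.exec.mapPartAgg.minReduction=10
pig.cachedbag.memusage=0.2
```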

Page 29: Cost based optimization - Join

• Use the appropriate join algorithm
  – Skew on the join key: skew join
  – One side fits in memory: FR (fragment-replicate) join
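In Pig Latin the algorithm is chosen with the `using` clause (relation and field names are illustrative):

```pig
-- FR (fragment-replicate) join: 'small' must fit in memory
J1 = join big by key, small by key using 'replicated';

-- skew join: handles heavy skew on the join key
J2 = join big by key, other by key using 'skewed';
```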

Page 30: Cost based optimization - MR tuning

• Tune MR parameters to reduce IO
  – Control spills using map sort params
  – Reduce shuffle/sort-merge params

Page 31: Parallelism of reduce tasks

[Chart: runtime for 4, 6, 8, 24, 48, and 256 reduce tasks]

• Number of reduce slots = 6
• Factors affecting runtime
  – Cores simultaneously used / skew
  – Cost of having additional reduce tasks

Page 32: Cost based optimization - keep data sorted

• Frequent join operations on the same keys
  – Keep data sorted on the keys
• Use merge join
• Optimized group on sorted keys
• Works with few load functions: needs an additional interface implementation
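With both inputs already sorted on the join key, merge join avoids the shuffle entirely (relation names are illustrative):

```pig
-- A and B must be sorted on 'key', and the load function
-- must support merge join
C = join A by key, B by key using 'merge';
```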

Page 33: Optimizations for sorted data

[Chart: run time (sec) for sort+sort+join+join vs join+join, broken down into Sort1, Sort2, Join 1, Join 2]

Page 34: Future Directions

• Optimize using stats
• Using historical stats with HCatalog
• Sampling

Page 35: Questions

?
