
Apache PIG - User Defined Functions

May 26, 2015


Christoph Bauer

Extending Pig to solve complex tasks
Page 1: Apache PIG - User Defined Functions

Apache Pig UDFs

Extending Pig to solve complex tasks

UDF = User Defined Functions

Page 2: Apache PIG - User Defined Functions

Your speaker today:

Christoph Bauer

Java developer for 10+ years

one of the founders

Helping our clients to use and understand their (big) data

working in "BigData" since 2010

Page 3: Apache PIG - User Defined Functions

Why use PIG

● ad-hoc way for creating and executing map/reduce jobs

● simple, high-level language

● more natural for analysts than map/reduce

Page 4: Apache PIG - User Defined Functions

Done.

http://leesfishandphotos.blogspot.de

Page 5: Apache PIG - User Defined Functions

Oh, wait...

Page 6: Apache PIG - User Defined Functions

UDFs to the rescue

Writing user defined functions (UDF)

+ easy to use
+ easy to code
+ keep the power of PIG
+ you can write them in Java, Python, ...

Page 7: Apache PIG - User Defined Functions

Do whatever you want

● image feature extraction

● geo computations

● data cleaning

● retrieve web pages

● natural language processing

...

● much more...

Page 8: Apache PIG - User Defined Functions

User Defined Functions

● EvalFunc<T>
    public T exec(Tuple input)

● FilterFunc
    public Boolean exec(Tuple input)

● Aggregate Functions
    public interface Algebraic {
        public String getInitial();
        public String getIntermed();
        public String getFinal();
    }

● Load/Store Functions
    public Tuple getNext()
    public void putNext(Tuple input)
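The three methods of the Algebraic interface each name a class for one aggregation phase. As a rough illustration of why that split is useful, here is a Pig-free sketch in plain Java that computes an average from mergeable partial results. The class `AlgebraicSketch`, its `Partial` type and the method names are hypothetical stand-ins and not part of Pig's API.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of the initial / intermediate / final split. Not Pig API. */
final class AlgebraicSketch {

    /** Partial result: a running sum and the number of values behind it. */
    static final class Partial {
        final long sum;
        final long count;

        Partial(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    /** Initial phase: turn one raw value into a partial result. */
    static Partial initial(long value) {
        return new Partial(value, 1);
    }

    /** Intermediate phase: merge partial results; safe to apply repeatedly. */
    static Partial intermediate(List<Partial> partials) {
        long sum = 0;
        long count = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new Partial(sum, count);
    }

    /** Final phase: merge the remaining partials and produce the answer. */
    static double finalPhase(List<Partial> partials) {
        Partial total = intermediate(partials);
        if (total.count == 0) {
            throw new IllegalArgumentException("no values");
        }
        return (double) total.sum / total.count;
    }

    public static void main(String[] args) {
        // Two groups of values are pre-aggregated separately, then combined.
        Partial a = intermediate(Arrays.asList(initial(1), initial(2)));
        Partial b = intermediate(Arrays.asList(initial(6)));
        System.out.println(finalPhase(Arrays.asList(a, b)));
    }
}
```

Only the small `Partial` objects travel between phases, which is the property that lets an aggregate be pre-combined before the final step.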

Page 9: Apache PIG - User Defined Functions

What? Why?

[Figure: a companyName with companyAddress versions and a series of Net Worth values over time]

2010 | companyName | current Address | historical Net Worth

2011 | companyName | current Address | historical Net Worth

2012 | companyName | current Address | historical Net Worth

Page 10: Apache PIG - User Defined Functions

Example

r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }

...apply UDF

r1, t1, q1:"v1", q2:"v4"

r1, t3, q1:"v1", q2:"v4"

r1, t5, q1:"v2", q2:"v4"

SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST (q2)
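The example can be reproduced without Pig to make the semantics concrete. The following is a minimal sketch, assuming that timestamps t1..t7 are modelled as the numbers 1..7, that a snapshot at time t takes the newest value written at or before t, and that LATEST ignores the snapshot time. The class `SnapshotExample` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java walk-through of the SNAPSHOTS / LATEST example. Not the Pig UDFs themselves. */
final class SnapshotExample {

    /** Value valid at ts: the entry with the greatest timestamp <= ts, or null. */
    static String valueAt(TreeMap<Long, String> versions, long ts) {
        Map.Entry<Long, String> entry = versions.floorEntry(ts);
        return entry == null ? null : entry.getValue();
    }

    /** Newest value regardless of snapshot time, or null if there are no versions. */
    static String latest(TreeMap<Long, String> versions) {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    /** One output row per snapshot time in [start, end), stepping by step. */
    static List<String> rows(String rowKey, TreeMap<Long, String> q1,
                             TreeMap<Long, String> q2, long start, long end, long step) {
        List<String> out = new ArrayList<>();
        for (long ts = start; ts < end; ts += step) {
            out.add(rowKey + ", t" + ts + ", q1:" + valueAt(q1, ts) + ", q2:" + latest(q2));
        }
        return out;
    }

    /** Builds the input row from the example and applies t1 <= t < t6 with step 2. */
    static List<String> example() {
        TreeMap<Long, String> q1 = new TreeMap<>();
        q1.put(1L, "v1");
        q1.put(4L, "v2");
        TreeMap<Long, String> q2 = new TreeMap<>();
        q2.put(2L, "v3");
        q2.put(7L, "v4");
        return rows("r1", q1, q2, 1, 6, 2);
    }

    public static void main(String[] args) {
        for (String row : example()) {
            System.out.println(row);
        }
    }
}
```

Under these assumptions the sketch yields snapshots at t1, t3 and t5 with q1 values v1, v1 and v2, and q2 fixed at v4, matching the three result rows of the example.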

Page 11: Apache PIG - User Defined Functions

LATEST

public class LATEST extends EvalFunc<Tuple> {

    public Tuple exec(Tuple input) throws IOException {
    }
}

Page 12: Apache PIG - User Defined Functions

LATEST (contd.)

public Tuple exec(Tuple input) throws IOException {
    int numTuples = input.size();
    Tuple result = tupleFactory.newTuple(numTuples);
    for (int i = 0; i < numTuples; i++) {
        switch (input.getType(i)) {
        case DataType.BAG:
            // each bag holds the (timestamp, value) versions of one column
            DataBag bag = (DataBag) input.get(i);
            Object val = extractLatestValueFromBag(bag);
            if (val != null) {
                result.set(i, val);
            }
            break;
        case DataType.MAP:
            // ... MAPs need different handling
        default:
            // warn ...
        }
    }
    return result;
}

r1, { q1:[(t1, "v1") , (t4, "v2")], q2:[(t2, "v3"),(t7, "v4")] }

Page 13: Apache PIG - User Defined Functions

SNAPSHOT

public class SNAPSHOTS extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        List<Tuple> listOfTuples = new ArrayList<Tuple>();

        DateTime dtCur = new DateTime(start);
        // add 1 ms so that a snapshot falling exactly on 'end' is included
        DateTime dtEnd = new DateTime(end).plus(1L);
        while (dtCur.isBefore(dtEnd)) {
            // snapshot() takes the timestamp as a long
            listOfTuples.add(snapshot(input, dtCur.getMillis()));

            dtCur = dtCur.plus(increment);
        }
        DataBag bag = factory.newDefaultBag(listOfTuples);
        return bag;
    }
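The loop above walks a cursor from the start to the end of the range. As a Pig-free sketch of the same stepping idea, the following uses `java.time` from the JDK in place of Joda-Time (a deliberate swap so the example needs no external library) and steps by a calendar period rather than by milliseconds. The class `SnapshotGrid` and the method `snapshotDates` are hypothetical names.

```java
import java.time.LocalDate;
import java.time.Period;
import java.util.ArrayList;
import java.util.List;

/** Generates the points in time at which snapshots would be taken. */
final class SnapshotGrid {

    /** All dates from start to endInclusive, advancing by step each time. */
    static List<LocalDate> snapshotDates(LocalDate start, LocalDate endInclusive, Period step) {
        // A zero or negative step would never reach the end of the range.
        if (step.isZero() || step.isNegative()) {
            throw new IllegalArgumentException("step must be positive");
        }
        List<LocalDate> dates = new ArrayList<>();
        for (LocalDate cur = start; !cur.isAfter(endInclusive); cur = cur.plus(step)) {
            dates.add(cur);
        }
        return dates;
    }

    public static void main(String[] args) {
        List<LocalDate> dates = snapshotDates(
                LocalDate.of(2010, 1, 1), LocalDate.of(2012, 1, 1), Period.ofYears(1));
        for (LocalDate d : dates) {
            System.out.println(d);
        }
    }
}
```

With a yearly step from 2010 to 2012 this produces one snapshot date per year, in line with the 2010, 2011 and 2012 rows shown earlier.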

Page 14: Apache PIG - User Defined Functions

SNAPSHOT (contd.)

protected Tuple snapshot(Tuple input, long ts) throws... {
    int numTuples = input.size();
    Tuple result = tupleFactory.newTuple(numTuples + 1);
    result.set(0, ts);

    for (int i = 0; i < numTuples; i++) {
        switch (input.getType(i)) {
        case DataType.BAG:
            DataBag bag = (DataBag) input.get(i);
            Object val = extractTSValueFromBag(bag, ts);
            result.set(i + 1, val);
            break;
        case DataType.MAP:
            // handle MAPs
        default:
        }
    }
    return result;
}

r1, { q1:[(t1, "v1") , (t4, "v2")], q2:[(t2, "v3"),(t7, "v4")] }

Page 15: Apache PIG - User Defined Functions

PigLatin

REGISTER 'my-udf.jar';
DEFINE LATEST myudf.Latest();
DEFINE SNAPSHOT myudf.Snapshot ('2000-01-01 2013-01-01 1y');
A = LOAD 'inputTable' AS (id, q1, q2);
B = FOREACH A GENERATE id, SNAPSHOT(q1) AS SN, LATEST(q2) AS CUR;
C = FOREACH B GENERATE id, FLATTEN(SN), FLATTEN(CUR);
STORE C INTO 'output.csv';

r1, { q1:[(t1, "v1") , (t4, "v2")], q2:[(t2, "v3"),(t7, "v4")] }

Page 16: Apache PIG - User Defined Functions

Passing parameters to UDFs

DEFINE SNAPSHOT cool.udf.Snapshot ('2000-01-01 2013-01-01 1y');

...

public SNAPSHOTS(String start, String end, String increment) {
    super();
    // arguments given in DEFINE always arrive as Strings
    this.start = Long.parseLong(start);
    this.end = Long.parseLong(end);
    this.increment = Long.parseLong(increment);
}
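The DEFINE statement passes a single quoted string, while the constructor shown takes three numeric strings. One possible way to bridge the two is to parse the single string inside the UDF. The sketch below assumes the format `<start> <end> <step>` with ISO dates and a step such as `1y`, and it uses `java.time` from the JDK. The class `SnapshotSpec` is hypothetical and only illustrates the parsing, not the actual production code.

```java
import java.time.LocalDate;
import java.time.Period;
import java.util.Locale;

/** Parsed form of a parameter string such as "2000-01-01 2013-01-01 1y". */
final class SnapshotSpec {
    final LocalDate start;
    final LocalDate end;
    final Period step;

    SnapshotSpec(LocalDate start, LocalDate end, Period step) {
        this.start = start;
        this.end = end;
        this.step = step;
    }

    /** Splits on whitespace and parses the three parts; assumed format, see lead-in. */
    static SnapshotSpec parse(String spec) {
        String[] parts = spec.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "expected '<start> <end> <step>' but got: " + spec);
        }
        LocalDate start = LocalDate.parse(parts[0]);
        LocalDate end = LocalDate.parse(parts[1]);
        // "1y" becomes the ISO-8601 period "P1Y"
        Period step = Period.parse("P" + parts[2].toUpperCase(Locale.ROOT));
        return new SnapshotSpec(start, end, step);
    }

    public static void main(String[] args) {
        SnapshotSpec spec = parse("2000-01-01 2013-01-01 1y");
        System.out.println(spec.start + " to " + spec.end + " every " + spec.step);
    }
}
```

Failing fast on a malformed string is a design choice here: the constructor runs when the script is set up, so a bad parameter surfaces before any data is processed.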

Page 17: Apache PIG - User Defined Functions

I didn't talk about

● UDFs run as a single instance in every mapper, reducer, ... use instance variables for locally shared objects

● Watch your heap when using Lucene Indexes, or implementing the Algebraic interface

● do implement
    public Schema outputSchema(Schema input)

● report progress when doing time consuming stuff

● Performance?
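The first point, keeping locally shared objects in instance variables, can be sketched without Pig. The class below is a hypothetical stand-in for a UDF: its `exec` method mimics the per-record call, and the lookup table plays the role of an expensive shared object that should be built once per instance rather than once per record. A real UDF would extend EvalFunc and receive a Tuple instead of a String.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for a UDF that builds a costly helper object only once per instance. */
final class LazyLookupSketch {

    /** Shared across all exec() calls on this instance; null until first use. */
    private Map<String, String> lookup;

    /** Counts how often the lookup table was built, for illustration only. */
    int loads = 0;

    private Map<String, String> lookup() {
        if (lookup == null) {
            // Imagine an expensive load here, e.g. reading a dictionary file.
            lookup = new HashMap<>();
            lookup.put("companyAdress", "companyAddress");
            loads++;
        }
        return lookup;
    }

    /** Called once per record: returns the corrected name, or the input unchanged. */
    String exec(String input) {
        if (input == null) {
            return null;
        }
        return lookup().getOrDefault(input, input);
    }

    public static void main(String[] args) {
        LazyLookupSketch udf = new LazyLookupSketch();
        System.out.println(udf.exec("companyAdress"));
        System.out.println(udf.exec("companyName"));
        System.out.println("table built " + udf.loads + " time(s)");
    }
}
```

The same pattern is also where the heap warning applies: whatever the instance holds on to stays in memory for as long as the task runs.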

Page 18: Apache PIG - User Defined Functions

SNAPSHOT (contd.)

@Override
public Schema outputSchema(Schema input) {
    List<Schema.FieldSchema> out = new ArrayList<Schema.FieldSchema>();
    // first field: the snapshot timestamp
    out.add(new FieldSchema("snapshot", DataType.LONG));

    // then one field per input field, keeping alias and type
    for (FieldSchema fieldSchema : input.getFields()) {
        String alias = fieldSchema.alias;
        byte type = fieldSchema.type;
        out.add(new FieldSchema(alias, type));
    }
    Schema bagSchema = new Schema(out);
    try {
        return new Schema(new FieldSchema(
                getSchemaName("snapshots", input), bagSchema, DataType.BAG));
    } catch (FrontendException e) {
        // ignored: fall through and return null
    }
    return null;
}

Page 19: Apache PIG - User Defined Functions

Reality check

● These UDFs are in production

● Producing reports with up to 60GB

● Data is stored in HBase

Page 20: Apache PIG - User Defined Functions

Wrapping it up

We at Oberbaum Concept developed a bunch of PIG Functions handling versioned data in HBase.

● Rewrote HBaseStorage

● UDFs for Snapshots, Latest

Right now we are trying to push our changes back into PIG.

Page 21: Apache PIG - User Defined Functions

Questions?

Page 22: Apache PIG - User Defined Functions

Thank you!

Christoph Bauer

www.xing.com/profile/Christoph_Bauer62