Hadoop-Pig
Processing of large data
Yevgen Smertenko
Engineering Team Lead. BI Developer.
How it works
[Diagram labels: data, Pig, BI engineer, clear result]
Hadoop - Software Framework
Provides massively parallel processing (MPP) of data
MapReduce program
• Input read
• Map
• Partition / Combine
• Copy / Compare / Merge
• Reduce
• Output write
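The stages listed above can be sketched in plain Python as a word count. This is only an in-memory illustration, not Hadoop code: the function names (`map_phase`, `partition`, `shuffle_sort`, `reduce_phase`, `run_job`) and the sample input are assumptions made for this example, and the optional combine step is left out.

```python
import zlib
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def partition(pairs, num_reducers):
    """Partition: route each key to one reducer bucket by a stable hash."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[index].append((key, value))
    return buckets


def shuffle_sort(bucket):
    """Copy / compare / merge: group values by key and sort by key."""
    grouped = defaultdict(list)
    for key, value in bucket:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def run_job(lines, num_reducers=2):
    """Run every stage in order and return the final key -> count mapping."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)  # input read + map
    output = {}
    for bucket in partition(mapped, num_reducers):
        for key, values in shuffle_sort(bucket):
            word, total = reduce_phase(key, values)
            output[word] = total  # output write
    return output


if __name__ == "__main__":
    print(run_job(["pig hadoop pig", "hadoop pig"]))
```

In a real cluster each bucket would be processed by a separate reduce task on a different node; here the buckets are simply looped over in one process.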
MapReduce Data Flow
MapReduce functionality
The Hadoop Ecosystem
PIG
• Data types
• Relational operators
• UDF (user-defined functions)
Pig Latin: a language for describing data flows
Pig. Data Types
Simple Types
• int
• long
• float
• double
• chararray
• bytearray
• boolean
• datetime
Complex Types
• tuple (.., ..)
• map [key#value]
• bag {(), .., ()}
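As a rough analogy, the three complex types can be modelled with Python built-ins. The field values and variable names below are made up for illustration, and the mapping is only approximate: a Pig bag is unordered, so a Python list is a convenience, not an exact equivalent.

```python
# tuple: an ordered set of fields, written in Pig as (john, 25)
person = ("john", 25)

# map: chararray keys mapped to values, written in Pig as [city#Kyiv]
properties = {"city": "Kyiv"}

# bag: a collection of tuples, written in Pig as {(john, 25), (mary, 30)}
people = [("john", 25), ("mary", 30)]

# Complex types can nest: one row holding a simple field, a bag and a map
record = ("sales", people, properties)

# Positional access, comparable to $0, $1 in Pig Latin
department = record[0]
first_age = record[1][0][1]
city = record[2]["city"]
```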
Pig. Relational operators
• SPLIT
• UNION
• FILTER
• DISTINCT
• SAMPLE
• FOREACH
• STREAM

• JOIN
• GROUP / COGROUP
• CROSS
• ORDER

• LOAD
• STORE
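To show how several of these operators chain into one data flow, here is a small Python imitation of a LOAD, FILTER, GROUP, FOREACH, ORDER pipeline. The relation names, field names, sample rows and the `run_pipeline` function are hypothetical; the Pig Latin statement each step imitates is given in a comment.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: name, department, salary
RAW = "ann,bi,1200\nbob,bi,900\ncid,dev,2000\ndan,dev,1500\n"


def run_pipeline(text):
    # A = LOAD 'staff.csv' USING PigStorage(',')
    #         AS (name:chararray, dept:chararray, salary:int);
    a = [(name, dept, int(salary))
         for name, dept, salary in csv.reader(io.StringIO(text))]

    # B = FILTER A BY salary > 1000;
    b = [row for row in a if row[2] > 1000]

    # C = GROUP B BY dept;
    c = defaultdict(list)
    for row in b:
        c[row[1]].append(row)

    # D = FOREACH C GENERATE group, SUM(B.salary);
    d = [(dept, sum(row[2] for row in rows)) for dept, rows in c.items()]

    # E = ORDER D BY $1 DESC;
    e = sorted(d, key=lambda pair: pair[1], reverse=True)

    # STORE E INTO 'out';  (here the result is simply returned)
    return e


if __name__ == "__main__":
    print(run_pipeline(RAW))
```

Each intermediate variable plays the role of a Pig relation: every statement takes a relation as input and produces a new one.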
PIG. UDF
Eval Functions (EvalFunc)
• Filter Functions
• Aggregate Functions
• Algebraic Interface
• Accumulator Interface
Load/Store Functions (StoreFunc)
Piggy Bank (shared repository of user-contributed UDFs)
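Besides Java classes such as EvalFunc, Pig can also run UDFs written in Python. The sketch below assumes that route and shows one eval, one filter and one aggregate function. The function names and schemas are invented for this example, and the fallback `outputSchema` decorator is a stand-in so the file also runs outside Pig. Load/Store functions and the Algebraic and Accumulator interfaces are Java-side and are not shown.

```python
try:
    # Assumption: available when Pig runs the script as a Python UDF
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        """Stand-in decorator so the sketch runs without Pig installed."""
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap


@outputSchema("upper:chararray")
def to_upper(value):
    """Eval function: transform one field; nulls arrive as None."""
    return None if value is None else value.upper()


@outputSchema("is_long:boolean")
def is_long(value, min_length=5):
    """Filter-style function: return a boolean usable in FILTER ... BY."""
    return value is not None and len(value) >= min_length


@outputSchema("total:long")
def total(bag):
    """Aggregate-style function: the bag arrives as a list of tuples."""
    return sum(item[0] for item in bag if item[0] is not None)


if __name__ == "__main__":
    print(to_upper("pig"), is_long("hadoop"), total([(1,), (2,), (3,)]))
```

Inside a Pig script such a file would typically be registered with a statement like `REGISTER 'udfs.py' USING jython AS myfuncs;` and then called as `myfuncs.to_upper(name)`; the file name and namespace here are placeholders.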
THANKS FOR YOUR ATTENTION!