Pig and Python to Process Big Data

Shawn Hermans
Omaha Dynamic Languages User Group
April 8th, 2013

Posted: Jan 26, 2015
Transcript
Page 1: Pig and Python to Process Big Data

Big Data with Pig and Python

Shawn Hermans
Omaha Dynamic Languages User Group

April 8th, 2013

Tuesday, April 9, 13

Page 2: Pig and Python to Process Big Data

About Me

• Mathematician/Physicist turned Consultant

• Graduate Student in CS at UNO

• Current Software Engineer at Sojern

Page 3: Pig and Python to Process Big Data

Working with Big Data

Page 4: Pig and Python to Process Big Data

What is Big Data?

Data Source               Size
Wikipedia Database Dump   9GB
Open Street Map           19GB
Common Crawl              81TB
1000 Genomes              200TB
Large Hadron Collider     15PB annually

• Gigabytes - Normal size for relational databases

• Terabytes - Relational databases may start to experience scaling issues

• Petabytes - Relational databases struggle to scale without a lot of fine tuning

Page 5: Pig and Python to Process Big Data

Working With Data: Expectation vs. Reality

• Different File Formats

• Missing Values

• Inconsistent Schema

• Loosely Structured

• Lots of it
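The "reality" bullets above are concrete engineering problems. A minimal Python sketch (with hypothetical rows and field names) of the defensive parsing this kind of data forces on you:

```python
def parse_row(line, expected_fields=3):
    """Split a TSV row; pad missing trailing fields and tolerate bad values."""
    fields = line.rstrip('\n').split('\t')
    fields += [None] * (expected_fields - len(fields))  # pad short rows
    name, population, area = fields[:expected_fields]
    try:
        population = int(population)   # values may be missing or malformed
    except (TypeError, ValueError):
        population = None
    return (name, population, area)

rows = [
    "Omaha\t408958\t130.6",
    "Lincoln\tN/A\t74.6",    # inconsistent value in a numeric field
    "Bellevue\t50137",       # missing trailing field
]
parsed = [parse_row(r) for r in rows]
```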

Page 6: Pig and Python to Process Big Data

MapReduce

Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview

• Map - Emit key/value pairs from data

• Reduce - Collect data with common keys

• Tries to minimize moving data between nodes
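As a toy illustration of those three bullets, here is the classic word count written as single-process Python (no Hadoop involved): map emits (word, 1) pairs, a shuffle step groups values by key, and reduce sums each group.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (key, value) pair for every word.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: collect values under their common key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each group of values to a single count.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["pig eats anything", "pig and python"])))
```

On a real cluster the shuffle is the expensive part, which is why the framework tries to minimize moving data between nodes.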

Page 7: Pig and Python to Process Big Data

MapReduce Issues

• Very low-level abstraction

• Cumbersome Java API

• Unfamiliar to data analysts

• Rudimentary support for data pipelines

Page 8: Pig and Python to Process Big Data

Pig

• Eats anything

• SQL-like, procedural data flow language

• Extensible with Java, Jython, Groovy, Ruby or JavaScript

• Provides opportunities to optimize workflows

Page 9: Pig and Python to Process Big Data

Alternatives

• Java MapReduce API

• Hadoop Streaming

• Hive

• Spark

• Cascading

• Cascalog

Page 10: Pig and Python to Process Big Data

Python

• Data analysis - pandas, numpy, networkx

• Machine learning - scikits.learn, milk

• Scientific - scipy, pyephem, astropysics

• Visualization - matplotlib, d3py, ggplot

Page 11: Pig and Python to Process Big Data

Pig Features

Page 12: Pig and Python to Process Big Data

Input/Output

• HBase

• JDBC Database

• JSON

• CSV/TSV

• Avro

• Protobuf

• Sequence File

• Hive Columnar

• XML

• Apache Log

• Thrift

• Regex

Page 13: Pig and Python to Process Big Data

Relational Operators

LIMIT GROUP FILTER CROSS

COGROUP JOIN STORE DISTINCT

FOREACH LOAD ORDER UNION

Page 14: Pig and Python to Process Big Data

Built In Functions

COS SIN AVG SUM

COUNT RANDOM LOWER UPPER

CONCAT MAX MIN TOKENIZE

Page 15: Pig and Python to Process Big Data

User Defined Functions

• Easy way to add arbitrary code to Pig

• Eval - Filter, aggregate, or evaluate

• Storage - Load/Store data

• Full support for Java and Jython

• Experimental support for Groovy, Ruby and JavaScript

Page 16: Pig and Python to Process Big Data

Census Example

Page 17: Pig and Python to Process Big Data

Getting Data

Page 18: Pig and Python to Process Big Data

Convert to TSV

ogr2ogr -f "CSV" CSA_2010Census_DP1.csv CSA_2010Census_DP1.shp -lco "GEOMETRY=AS_WKT" -lco "SEPARATOR=TAB"

• Uses Geospatial Data Abstraction Library (GDAL) to convert to TSV

• TSV > CSV (the WKT geometry column contains commas)
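Why tabs win here: the exported WKT geometry strings are full of commas, so a naive comma split shreds the geometry while a tab split does not. A small sketch with a hypothetical row (coordinates made up):

```python
# The same one-polygon record in both encodings.
row_csv = '"POLYGON ((-96.1 41.2,-96.0 41.3))",31,Omaha'
row_tsv = 'POLYGON ((-96.1 41.2,-96.0 41.3))\t31\tOmaha'

naive_csv_fields = row_csv.split(',')  # geometry shredded across fields
tsv_fields = row_tsv.split('\t')       # geometry survives intact
```

A full CSV parser would handle the quoting, but Pig's default PigStorage loader does a plain split on the delimiter, so tabs are the safer choice.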

Page 19: Pig and Python to Process Big Data

Inspect Headers

f = open('CSA_2010Census_DP1.tsv')
header = f.readline()
headers = header.strip('\n').split('\t')
list(enumerate(headers))

[(0, 'WKT'), (1, 'GEOID10'), (2, 'NAMELSAD10'), (3, 'ALAND10'), (4, 'AWATER10'), (5, 'INTPTLAT10'), (6, 'INTPTLON10'), (7, 'DP0010001'), . . .

Page 20: Pig and Python to Process Big Data

Pig Quick Start

pig -x local
grunt> ls
file:/data/CSA_2010Census_DP1.dbf<r 1>  841818
file:/data/CSA_2010Census_DP1.prj<r 1>  167
file:/data/CSA_2010Census_DP1.shp<r 1>  76180308
file:/data/CSA_2010Census_DP1.shx<r 1>  3596
file:/data/CSA_2010Census_DP1.tsv<r 1>  111224058

http://pig.apache.org/releases.html

https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads

• Download Pig Distribution

• Untar package

• Start Pig in local mode

Page 21: Pig and Python to Process Big Data

Loading Data

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();

Page 22: Pig and Python to Process Big Data

Extracting Data

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted_no_types = FOREACH csas GENERATE $2 AS name, $7 AS population;
grunt> describe extracted_no_types;
extracted_no_types: {name: bytearray,population: bytearray}

Page 23: Pig and Python to Process Big Data

Adding Schema

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted = FOREACH csas GENERATE $2 AS name:chararray, $7 AS population:int;
grunt> describe extracted;
extracted: {name: chararray,population: int}

Page 24: Pig and Python to Process Big Data

Ordering

grunt> ordered = ORDER extracted BY population DESC;
grunt> dump ordered;

("New York-Newark-Bridgeport, NY-NJ-CT-PA CSA",22085649)
("Los Angeles-Long Beach-Riverside, CA CSA",17877006)
("Chicago-Naperville-Michigan City, IL-IN-WI CSA",9686021)
("Washington-Baltimore-Northern Virginia, DC-MD-VA-WV CSA",8572971)
("Boston-Worcester-Manchester, MA-RI-NH CSA",7559060)
("San Jose-San Francisco-Oakland, CA CSA",7468390)
("Dallas-Fort Worth, TX CSA",6731317)
("Philadelphia-Camden-Vineland, PA-NJ-DE-MD CSA",6533683)

Page 25: Pig and Python to Process Big Data

Storing Data

grunt> STORE extracted INTO 'extracted_data' USING PigStorage('\t', '-schema');

ls -a
.part-m-00035.crc .part-m-00115.crc .pig_header  part-m-00077 part-m-00157
.part-m-00036.crc .part-m-00116.crc .pig_schema  part-m-00078 part-m-00158
.part-m-00037.crc .part-m-00117.crc _SUCCESS     part-m-00079 part-m-00159
.part-m-00038.crc .part-m-00118.crc part-m-00000 part-m-00080 part-m-00160

Page 26: Pig and Python to Process Big Data

Space Catalog Example

Page 27: Pig and Python to Process Big Data

Space Catalog

• 14,000+ objects in public catalog

• Use Two Line Element sets to propagate out positions and velocities

• Can generate over 100 million positions & velocities per day
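The 100-million figure checks out with back-of-the-envelope arithmetic, assuming (hypothetically) one propagated state vector per object every 10 seconds:

```python
objects = 14_000                             # objects in the public catalog
seconds_per_day = 24 * 60 * 60               # 86,400
states_per_object = seconds_per_day // 10    # one state every 10 seconds (assumed)
total_states = objects * states_per_object   # ~121 million per day
```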

Page 28: Pig and Python to Process Big Data

Two Line Elements

ISS (ZARYA)
1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927
2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537

• Use Python script to convert to Pig friendly TSV

• Create Python UDF to parse TLE into parameters

• Use Python UDF with Java libraries to propagate out positions
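The first bullet's conversion step can be sketched in a few lines: TLE files arrive as three-line records (name, line 1, line 2), and Pig wants one tab-separated record per line. A minimal version (the actual script is not shown in the slides, so this is an illustrative reconstruction):

```python
def tle_to_tsv(lines):
    """Group a TLE file's three-line records into one TSV row per satellite."""
    stripped = [l.rstrip('\n') for l in lines if l.strip()]
    rows = []
    for i in range(0, len(stripped), 3):   # each satellite spans three lines
        name, line1, line2 = stripped[i:i + 3]
        rows.append('\t'.join((name.strip(), line1, line2)))
    return rows

sample = [
    "ISS (ZARYA)\n",
    "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927\n",
    "2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537\n",
]
tsv_rows = tle_to_tsv(sample)
```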

Page 29: Pig and Python to Process Big Data

Python UDFs

• Easy way to extend Pig with new functions

• Uses Jython which is at Python 2.5

• Cannot take advantage of libraries with C dependencies (e.g. numpy, scikits, etc...)

• Can use Java classes

Page 30: Pig and Python to Process Big Data

TLE parsing

def parse_tle_number(tle_number_string):
    split_string = tle_number_string.split('-')
    if len(split_string) == 3:
        new_number = '-' + str(split_string[1]) + 'e-' + str(int(split_string[2])+1)
    elif len(split_string) == 2:
        new_number = str(split_string[0]) + 'e-' + str(int(split_string[1])+1)
    elif len(split_string) == 1:
        new_number = '0.' + str(split_string[0])
    else:
        raise TypeError('Input is not in the TLE float format')
    return float(new_number)

Columns 54-61 of line 1: BSTAR drag term (decimal assumed), e.g. -11606-4

Full parser at https://gist.github.com/shawnhermans/4569360

Page 31: Pig and Python to Process Big Data

Simple UDF

import tleparser

@outputSchema("params:map[]")
def parseTle(name, line1, line2):
    params = tleparser.parse_tle(name, line1, line2)
    return params

Page 32: Pig and Python to Process Big Data

Extract Parameters

grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);

grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);

([bstar#,arg_of_perigee#333.0924,mean_motion#2.00559335,element_number#72,epoch_year#2013,inclination#54.9673,mean_anomaly#26.8787,rev_at_epoch#210,mean_motion_ddot#0.0,eccentricity#5.354E-4,two_digit_year#13,international_designator#12053A,classification#U,epoch_day#17.78040066,satellite_number#38833,name#GPS BIIF-3 (PRN 24),mean_motion_dot#-1.8E-6,ra_of_asc_node#344.5315])

Page 33: Pig and Python to Process Big Data

Storing Results

grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
grunt> STORE parsed INTO 'propagated-csv' USING PigStorage(',','-schema');

Page 34: Pig and Python to Process Big Data

UDF with Java Import

from jsattrak.objects import SatelliteTleSGP4

@outputSchema("propagated:bag{positions:tuple(time:double, x:double, y:double, z:double)}")
def propagateTleECEF(name, line1, line2, start_time, end_time, number_of_points):
    satellite = SatelliteTleSGP4(name, line1, line2)
    ecef_positions = []
    increment = (float(end_time) - float(start_time)) / float(number_of_points)
    current_time = start_time

    while current_time <= end_time:
        positions = [current_time]
        positions.extend(list(satellite.calculateJ2KPositionFromUT(current_time)))
        ecef_positions.append(tuple(positions))
        current_time += increment

    return ecef_positions

Page 35: Pig and Python to Process Big Data

Propagate Positions

grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);
grunt> propagated = FOREACH gps GENERATE myfuncs.parseTle(name, line1, line2), myfuncs.propagateTleECEF(name, line1, line2, 2454992.0, 2454993.0, 100);
grunt> DESCRIBE propagated;
propagated: {params: map[],propagated: {positions: (time: double,x: double,y: double,z: double)}}
grunt> flattened = FOREACH propagated GENERATE params#'satellite_number', FLATTEN(propagated);
grunt> DESCRIBE flattened;
flattened: {bytearray,propagated::time: double,propagated::x: double,propagated::y: double,propagated::z: double}

Page 36: Pig and Python to Process Big Data

Result

(38833,2454992.9599999785,2.278136816721697E7,7970303.195970464,-1.1066153998664627E7)
(38833,2454992.9699999783,2.2929498370345607E7,1.0245812732430315E7,-8617450.742994161)
(38833,2454992.979999978,2.2713614118860725E7,1.2358665040019082E7,-6031915.392826946)
(38833,2454992.989999978,2.213715624812226E7,1.4275325605036272E7,-3350605.7983842064)
(38833,2454992.9999999776,2.1209296863515433E7,1.5965381866069315E7,-616098.4598421039)

Page 37: Pig and Python to Process Big Data

Pig on Amazon EMR

Pages 38-42 contained only screenshots (images not captured in this transcript).

Page 43: Pig and Python to Process Big Data

Pig with EMR

Page 44: Pig and Python to Process Big Data

Pig with EMR

• SSH in to box to run interactive Pig session

• Load data to/from S3

• Run standalone Pig scripts on demand

Page 45: Pig and Python to Process Big Data

Conclusion

Page 46: Pig and Python to Process Big Data

Other Useful Tools

• Python-dateutil : Super-duper date parser

• Oozie : Hadoop workflow engine

• Piggybank and Elephant Bird : 3rd party Pig libraries

• Chardet: Character detection library for Python

Page 47: Pig and Python to Process Big Data

Parting Thoughts

• Great ETL tool/language

• Flexible enough to write general purpose MapReduce jobs

• Limited, but emerging 3rd party libraries

• Jython for UDFs is extremely limiting (Spark?)

Twitter: @shawnhermans
Email: [email protected]
