Pig and Python to Process Big Data

Shawn Hermans
Omaha Dynamic Languages User Group
April 8th, 2013

Posted: Jan 26, 2015
Transcript
Page 1: Pig and Python to Process Big Data

Big Data with Pig and Python

Shawn Hermans
Omaha Dynamic Languages User Group

April 8th, 2013

Tuesday, April 9, 13

Page 2: Pig and Python to Process Big Data

About Me

• Mathematician/Physicist turned Consultant

• Graduate Student in CS at UNO

• Current Software Engineer at Sojern

Page 3: Pig and Python to Process Big Data

Working with Big Data

Page 4: Pig and Python to Process Big Data

What is Big Data?

Data Source               Size
Wikipedia Database Dump   9GB
Open Street Map           19GB
Common Crawl              81TB
1000 Genomes              200TB
Large Hadron Collider     15PB annually

• Gigabytes - Normal size for relational databases

• Terabytes - Relational databases may start to experience scaling issues

• Petabytes - Relational databases struggle to scale without a lot of fine tuning

Page 5: Pig and Python to Process Big Data

Working With Data: Expectation vs. Reality

• Different File Formats

• Missing Values

• Inconsistent Schema

• Loosely Structured

• Lots of it
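The "reality" bullets above are concrete engineering problems. A minimal Python sketch (with hypothetical rows and field names) of the defensive parsing this kind of data forces on you:

```python
def parse_row(line, expected_fields=3):
    """Split a TSV row; pad missing trailing fields and tolerate bad values."""
    fields = line.rstrip('\n').split('\t')
    fields += [None] * (expected_fields - len(fields))  # pad short rows
    name, population, area = fields[:expected_fields]
    try:
        population = int(population)   # values may be missing or malformed
    except (TypeError, ValueError):
        population = None
    return (name, population, area)

rows = [
    "Omaha\t408958\t130.6",
    "Lincoln\tN/A\t74.6",    # inconsistent value in a numeric field
    "Bellevue\t50137",       # missing trailing field
]
parsed = [parse_row(r) for r in rows]
```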

Page 6: Pig and Python to Process Big Data

MapReduce

Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview

• Map - Emit key/value pairs from data

• Reduce - Collect data with common keys

• Tries to minimize moving data between nodes
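As a toy illustration of those three bullets, here is the classic word count written as single-process Python (no Hadoop involved): map emits (word, 1) pairs, a shuffle step groups values by key, and reduce sums each group.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (key, value) pair for every word.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: collect values under their common key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each group of values to a single count.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["pig eats anything", "pig and python"])))
```

On a real cluster the shuffle is the expensive part, which is why the framework tries to minimize moving data between nodes.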

Page 7: Pig and Python to Process Big Data

MapReduce Issues

• Very low-level abstraction

• Cumbersome Java API

• Unfamiliar to data analysts

• Rudimentary support for data pipelines

Page 8: Pig and Python to Process Big Data

Pig

• Eats anything

• SQL-like, procedural data flow language

• Extensible with Java, Jython, Groovy, Ruby or JavaScript

• Provides opportunities to optimize workflows

Page 9: Pig and Python to Process Big Data

Alternatives

• Java MapReduce API

• Hadoop Streaming

• Hive

• Spark

• Cascading

• Cascalog

Page 10: Pig and Python to Process Big Data

Python

• Data analysis - pandas, numpy, networkx

• Machine learning - scikits.learn, milk

• Scientific - scipy, pyephem, astropysics

• Visualization - matplotlib, d3py, ggplot

Page 11: Pig and Python to Process Big Data

Pig Features

Page 12: Pig and Python to Process Big Data

Input/Output

• HBase

• JDBC Database

• JSON

• CSV/TSV

• Avro

• Protobuf

• Sequence File

• Hive Columnar

• XML

• Apache Log

• Thrift

• Regex

Page 13: Pig and Python to Process Big Data

Relational Operators

LIMIT GROUP FILTER CROSS

COGROUP JOIN STORE DISTINCT

FOREACH LOAD ORDER UNION

Page 14: Pig and Python to Process Big Data

Built In Functions

COS SIN AVG SUM

COUNT RANDOM LOWER UPPER

CONCAT MAX MIN TOKENIZE

Page 15: Pig and Python to Process Big Data

User Defined Functions

• Easy way to add arbitrary code to Pig

• Eval - Filter, aggregate, or evaluate

• Storage - Load/Store data

• Full support for Java and Jython

• Experimental support for Groovy, Ruby and JavaScript

Page 16: Pig and Python to Process Big Data

Census Example

Page 17: Pig and Python to Process Big Data

Getting Data

Page 18: Pig and Python to Process Big Data

Convert to TSV

ogr2ogr -f "CSV" CSA_2010Census_DP1.csv CSA_2010Census_DP1.shp -lco "GEOMETRY=AS_WKT" -lco "SEPARATOR=TAB"

• Uses Geospatial Data Abstraction Library (GDAL) to convert to TSV

• TSV > CSV (the WKT geometry column contains commas)
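Why tabs win here: the exported WKT geometry strings are full of commas, so a naive comma split shreds the geometry while a tab split does not. A small sketch with a hypothetical row (coordinates made up):

```python
# The same one-polygon record in both encodings.
row_csv = '"POLYGON ((-96.1 41.2,-96.0 41.3))",31,Omaha'
row_tsv = 'POLYGON ((-96.1 41.2,-96.0 41.3))\t31\tOmaha'

naive_csv_fields = row_csv.split(',')  # geometry shredded across fields
tsv_fields = row_tsv.split('\t')       # geometry survives intact
```

A full CSV parser would handle the quoting, but Pig's default PigStorage loader does a plain split on the delimiter, so tabs are the safer choice.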

Page 19: Pig and Python to Process Big Data

Inspect Headers

f = open('CSA_2010Census_DP1.tsv')
header = f.readline()
headers = header.strip('\n').split('\t')
list(enumerate(headers))

[(0, 'WKT'), (1, 'GEOID10'), (2, 'NAMELSAD10'), (3, 'ALAND10'), (4, 'AWATER10'), (5, 'INTPTLAT10'), (6, 'INTPTLON10'), (7, 'DP0010001'), . . .

Page 20: Pig and Python to Process Big Data

Pig Quick Start

pig -x local
grunt> ls
file:/data/CSA_2010Census_DP1.dbf<r 1>  841818
file:/data/CSA_2010Census_DP1.prj<r 1>  167
file:/data/CSA_2010Census_DP1.shp<r 1>  76180308
file:/data/CSA_2010Census_DP1.shx<r 1>  3596
file:/data/CSA_2010Census_DP1.tsv<r 1>  111224058

http://pig.apache.org/releases.html

https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads

• Download Pig Distribution

• Untar package

• Start Pig in local mode

Page 21: Pig and Python to Process Big Data

Loading Data

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();

Page 22: Pig and Python to Process Big Data

Extracting Data

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted_no_types = FOREACH csas GENERATE $2 AS name, $7 AS population;
grunt> describe extracted_no_types;
extracted_no_types: {name: bytearray,population: bytearray}

Page 23: Pig and Python to Process Big Data

Adding Schema

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted = FOREACH csas GENERATE $2 AS name:chararray, $7 AS population:int;
grunt> describe extracted;
extracted: {name: chararray,population: int}

Page 24: Pig and Python to Process Big Data

Ordering

grunt> ordered = ORDER extracted BY population DESC;
grunt> dump ordered;

("New York-Newark-Bridgeport, NY-NJ-CT-PA CSA",22085649)
("Los Angeles-Long Beach-Riverside, CA CSA",17877006)
("Chicago-Naperville-Michigan City, IL-IN-WI CSA",9686021)
("Washington-Baltimore-Northern Virginia, DC-MD-VA-WV CSA",8572971)
("Boston-Worcester-Manchester, MA-RI-NH CSA",7559060)
("San Jose-San Francisco-Oakland, CA CSA",7468390)
("Dallas-Fort Worth, TX CSA",6731317)
("Philadelphia-Camden-Vineland, PA-NJ-DE-MD CSA",6533683)

Page 25: Pig and Python to Process Big Data

Storing Data

grunt> STORE extracted INTO 'extracted_data' USING PigStorage('\t', '-schema');

ls -a
.part-m-00035.crc .part-m-00115.crc .pig_header  part-m-00077 part-m-00157
.part-m-00036.crc .part-m-00116.crc .pig_schema  part-m-00078 part-m-00158
.part-m-00037.crc .part-m-00117.crc _SUCCESS     part-m-00079 part-m-00159
.part-m-00038.crc .part-m-00118.crc part-m-00000 part-m-00080 part-m-00160

Page 26: Pig and Python to Process Big Data

Space Catalog Example

Page 27: Pig and Python to Process Big Data

Space Catalog

• 14,000+ objects in public catalog

• Use Two Line Element sets to propagate out positions and velocities

• Can generate over 100 million positions & velocities per day
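The 100-million figure checks out with back-of-the-envelope arithmetic, assuming (hypothetically) one propagated state vector per object every 10 seconds:

```python
objects = 14_000                             # objects in the public catalog
seconds_per_day = 24 * 60 * 60               # 86,400
states_per_object = seconds_per_day // 10    # one state every 10 seconds (assumed)
total_states = objects * states_per_object   # ~121 million per day
```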

Page 28: Pig and Python to Process Big Data

Two Line Elements

ISS (ZARYA)
1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927
2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537

• Use Python script to convert to Pig friendly TSV

• Create Python UDF to parse TLE into parameters

• Use Python UDF with Java libraries to propagate out positions
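The first bullet's conversion step can be sketched in a few lines: TLE files arrive as three-line records (name, line 1, line 2), and Pig wants one tab-separated record per line. A minimal version (the actual script is not shown in the slides, so this is an illustrative reconstruction):

```python
def tle_to_tsv(lines):
    """Group a TLE file's three-line records into one TSV row per satellite."""
    stripped = [l.rstrip('\n') for l in lines if l.strip()]
    rows = []
    for i in range(0, len(stripped), 3):   # each satellite spans three lines
        name, line1, line2 = stripped[i:i + 3]
        rows.append('\t'.join((name.strip(), line1, line2)))
    return rows

sample = [
    "ISS (ZARYA)\n",
    "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927\n",
    "2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537\n",
]
tsv_rows = tle_to_tsv(sample)
```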

Page 29: Pig and Python to Process Big Data

Python UDFs

• Easy way to extend Pig with new functions

• Uses Jython which is at Python 2.5

• Cannot take advantage of libraries with C dependencies (e.g. numpy, scikits, etc...)

• Can use Java classes

Page 30: Pig and Python to Process Big Data

TLE parsing

def parse_tle_number(tle_number_string):
    split_string = tle_number_string.split('-')
    if len(split_string) == 3:
        new_number = '-' + str(split_string[1]) + 'e-' + str(int(split_string[2])+1)
    elif len(split_string) == 2:
        new_number = str(split_string[0]) + 'e-' + str(int(split_string[1])+1)
    elif len(split_string) == 1:
        new_number = '0.' + str(split_string[0])
    else:
        raise TypeError('Input is not in the TLE float format')
    return float(new_number)

Columns 54-61 of line 1: BSTAR drag term (decimal assumed), e.g. -11606-4

Full parser at https://gist.github.com/shawnhermans/4569360

Page 31: Pig and Python to Process Big Data

Simple UDF

import tleparser

@outputSchema("params:map[]")
def parseTle(name, line1, line2):
    params = tleparser.parse_tle(name, line1, line2)
    return params

Page 32: Pig and Python to Process Big Data

Extract Parameters

grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);

grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);

([bstar#,arg_of_perigee#333.0924,mean_motion#2.00559335,element_number#72,epoch_year#2013,inclination#54.9673,mean_anomaly#26.8787,rev_at_epoch#210,mean_motion_ddot#0.0,eccentricity#5.354E-4,two_digit_year#13,international_designator#12053A,classification#U,epoch_day#17.78040066,satellite_number#38833,name#GPS BIIF-3 (PRN 24),mean_motion_dot#-1.8E-6,ra_of_asc_node#344.5315])

Page 33: Pig and Python to Process Big Data

Storing Results

grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
grunt> STORE parsed INTO 'propagated-csv' USING PigStorage(',','-schema');

Page 34: Pig and Python to Process Big Data

UDF with Java Import

from jsattrak.objects import SatelliteTleSGP4

@outputSchema("propagated:bag{positions:tuple(time:double, x:double, y:double, z:double)}")
def propagateTleECEF(name, line1, line2, start_time, end_time, number_of_points):
    satellite = SatelliteTleSGP4(name, line1, line2)
    ecef_positions = []
    increment = (float(end_time) - float(start_time)) / float(number_of_points)
    current_time = start_time

    while current_time <= end_time:
        positions = [current_time]
        positions.extend(list(satellite.calculateJ2KPositionFromUT(current_time)))
        ecef_positions.append(tuple(positions))
        current_time += increment

    return ecef_positions

Page 35: Pig and Python to Process Big Data

Propagate Positions

grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);
grunt> propagated = FOREACH gps GENERATE myfuncs.parseTle(name, line1, line2), myfuncs.propagateTleECEF(name, line1, line2, 2454992.0, 2454993.0, 100);
grunt> DESCRIBE propagated;
propagated: {params: map[],propagated: {positions: (time: double,x: double,y: double,z: double)}}
grunt> flattened = FOREACH propagated GENERATE params#'satellite_number', FLATTEN(propagated);
grunt> DESCRIBE flattened;
flattened: {bytearray,propagated::time: double,propagated::x: double,propagated::y: double,propagated::z: double}

Page 36: Pig and Python to Process Big Data

Result

(38833,2454992.9599999785,2.278136816721697E7,7970303.195970464,-1.1066153998664627E7)
(38833,2454992.9699999783,2.2929498370345607E7,1.0245812732430315E7,-8617450.742994161)
(38833,2454992.979999978,2.2713614118860725E7,1.2358665040019082E7,-6031915.392826946)
(38833,2454992.989999978,2.213715624812226E7,1.4275325605036272E7,-3350605.7983842064)
(38833,2454992.9999999776,2.1209296863515433E7,1.5965381866069315E7,-616098.4598421039)

Page 37: Pig and Python to Process Big Data

Pig on Amazon EMR

Pages 38-42 contained only screenshots (images not captured in this transcript).

Page 43: Pig and Python to Process Big Data

Pig with EMR

Page 44: Pig and Python to Process Big Data

Pig with EMR

• SSH in to box to run interactive Pig session

• Load data to/from S3

• Run standalone Pig scripts on demand

Page 45: Pig and Python to Process Big Data

Conclusion

Page 46: Pig and Python to Process Big Data

Other Useful Tools

• Python-dateutil : Super-duper date parser

• Oozie : Hadoop workflow engine

• Piggybank and Elephant Bird : 3rd party Pig libraries

• Chardet: Character detection library for Python

Page 47: Pig and Python to Process Big Data

Parting Thoughts

• Great ETL tool/language

• Flexible enough to write general purpose MapReduce jobs

• Limited, but emerging 3rd party libraries

• Jython for UDFs is extremely limiting (Spark?)

Twitter: @shawnhermans
Email: [email protected]
