Pig and Python to Process Big Data


April 8th, 2013 Presentation to Omaha Dynamic Languages User Group

Transcript

Big Data with Pig and Python

Shawn Hermans
Omaha Dynamic Languages User Group

April 8th, 2013

Tuesday, April 9, 13

About Me

• Mathematician/Physicist turned Consultant

• Graduate Student in CS at UNO

• Current Software Engineer at Sojern


Working with Big Data


What is Big Data?

Data Source               Size
Wikipedia Database Dump   9 GB
Open Street Map           19 GB
Common Crawl              81 TB
1000 Genomes              200 TB
Large Hadron Collider     15 PB annually

• Gigabytes - Normal size for relational databases

• Terabytes - Relational databases may start to experience scaling issues

• Petabytes - Relational databases struggle to scale without a lot of fine tuning

Working With Data

Expectation vs. Reality

• Different File Formats

• Missing Values

• Inconsistent Schema

• Loosely Structured

• Lots of it


MapReduce

Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview

• Map - Emit key/value pairs from data

• Reduce - Collect data with common keys

• Tries to minimize moving data between nodes
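The two phases can be sketched in plain Python (a toy word count to illustrate the idea, not Hadoop's actual API; the function names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + Reduce: collect pairs sharing a key and sum their values
    shuffled = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(shuffled, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data with pig", "pig eats anything"]))
print(counts['pig'])  # 2
```

In a real cluster the shuffle step is what moves data between nodes, which is why the framework tries to keep it small.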


MapReduce Issues

• Very low-level abstraction

• Cumbersome Java API

• Unfamiliar to data analysts

• Rudimentary support for data pipelines


Pig

• Eats anything

• SQL-like, procedural data flow language

• Extensible with Java, Jython, Groovy, Ruby or JavaScript

• Provides opportunities to optimize workflows


Alternatives

• Java MapReduce API

• Hadoop Streaming

• Hive

• Spark

• Cascading

• Cascalog


Python

• Data analysis - pandas, numpy, networkx

• Machine learning - scikits.learn, milk

• Scientific - scipy, pyephem, astropysics

• Visualization - matplotlib, d3py, ggplot


Pig Features


Input/Output

• HBase

• JDBC Database

• JSON

• CSV/TSV

• Avro

• Protobuf

• Sequence File

• Hive Columnar

• XML

• Apache Log

• Thrift

• Regex


Relational Operators

LIMIT  GROUP  FILTER  CROSS
COGROUP  JOIN  STORE  DISTINCT
FOREACH  LOAD  ORDER  UNION


Built In Functions

COS  SIN  AVG  SUM
COUNT  RANDOM  LOWER  UPPER
CONCAT  MAX  MIN  TOKENIZE


User Defined Functions

• Easy way to add arbitrary code to Pig

• Eval - Filter, aggregate, or evaluate

• Storage - Load/Store data

• Full support for Java and Jython

• Experimental support for Groovy, Ruby and JavaScript


Census Example


Getting Data


Convert to TSV

ogr2ogr -f "CSV" CSA_2010Census_DP1.csv CSA_2010Census_DP1.shp -lco "GEOMETRY=AS_WKT" -lco "SEPARATOR=TAB"

• Uses Geospatial Data Abstraction Library (GDAL) to convert to TSV

• TSV > CSV (WKT geometry fields contain commas)


Inspect Headers

f = open('CSA_2010Census_DP1.tsv')
header = f.readline()
headers = header.strip('\n').split('\t')
list(enumerate(headers))

[(0, 'WKT'), (1, 'GEOID10'), (2, 'NAMELSAD10'), (3, 'ALAND10'), (4, 'AWATER10'), (5, 'INTPTLAT10'), (6, 'INTPTLON10'), (7, 'DP0010001'), . . .
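These header positions are what the later Pig scripts reference as $2 and $7. A quick check (the header list is reproduced from the output above; identifying DP0010001 as the Census DP1 total-population column is my reading, not stated in the deck):

```python
# First eight headers from the enumerate() output above
headers = ['WKT', 'GEOID10', 'NAMELSAD10', 'ALAND10', 'AWATER10',
           'INTPTLAT10', 'INTPTLON10', 'DP0010001']
print(headers[2])  # NAMELSAD10 -- the CSA name, $2 in Pig
print(headers[7])  # DP0010001  -- total population, $7 in Pig
```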


Pig Quick Start

pig -x local
grunt> ls
file:/data/CSA_2010Census_DP1.dbf<r 1>  841818
file:/data/CSA_2010Census_DP1.prj<r 1>  167
file:/data/CSA_2010Census_DP1.shp<r 1>  76180308
file:/data/CSA_2010Census_DP1.shx<r 1>  3596
file:/data/CSA_2010Census_DP1.tsv<r 1>  111224058

http://pig.apache.org/releases.html

https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads

• Download Pig Distribution

• Untar package

• Start Pig in local mode


Loading Data

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();


Extracting Data

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted_no_types = FOREACH csas GENERATE $2 AS name, $7 AS population;
grunt> DESCRIBE extracted_no_types;
extracted_no_types: {name: bytearray,population: bytearray}


Adding Schema

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted = FOREACH csas GENERATE $2 AS name:chararray, $7 AS population:int;
grunt> DESCRIBE extracted;
extracted: {name: chararray,population: int}


Ordering

grunt> ordered = ORDER extracted BY population DESC;
grunt> dump ordered;

("New York-Newark-Bridgeport, NY-NJ-CT-PA CSA",22085649)
("Los Angeles-Long Beach-Riverside, CA CSA",17877006)
("Chicago-Naperville-Michigan City, IL-IN-WI CSA",9686021)
("Washington-Baltimore-Northern Virginia, DC-MD-VA-WV CSA",8572971)
("Boston-Worcester-Manchester, MA-RI-NH CSA",7559060)
("San Jose-San Francisco-Oakland, CA CSA",7468390)
("Dallas-Fort Worth, TX CSA",6731317)
("Philadelphia-Camden-Vineland, PA-NJ-DE-MD CSA",6533683)


Storing Data

grunt> STORE extracted INTO 'extracted_data' USING PigStorage('\t', '-schema');

ls -a
.part-m-00035.crc .part-m-00115.crc .pig_header  part-m-00077 part-m-00157
.part-m-00036.crc .part-m-00116.crc .pig_schema  part-m-00078 part-m-00158
.part-m-00037.crc .part-m-00117.crc _SUCCESS     part-m-00079 part-m-00159
.part-m-00038.crc .part-m-00118.crc part-m-00000 part-m-00080 part-m-00160


Space Catalog Example


Space Catalog

• 14,000+ objects in public catalog

• Use Two Line Element sets to propagate out positions and velocities

• Can generate over 100 million positions & velocities per day


Two Line Elements

ISS (ZARYA)
1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927
2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537

• Use Python script to convert to Pig friendly TSV

• Create Python UDF to parse TLE into parameters

• Use Python UDF with Java libraries to propagate out positions
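The deck doesn't show the conversion script itself; a minimal sketch, assuming the raw catalog is the usual three-lines-per-object text format (the function name and sample data here are illustrative):

```python
def tle_to_tsv(lines):
    # Group the catalog into (name, line1, line2) triples and join each
    # triple with tabs so Pig's default PigStorage() loader can split it
    stripped = [line.rstrip('\n') for line in lines if line.strip()]
    rows = []
    for i in range(0, len(stripped) - 2, 3):
        name, line1, line2 = stripped[i:i + 3]
        rows.append('\t'.join([name.strip(), line1, line2]))
    return '\n'.join(rows)

raw = [
    "ISS (ZARYA)\n",
    "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927\n",
    "2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537\n",
]
print(tle_to_tsv(raw))
```

One record per line with tab-separated fields is exactly the shape the later `LOAD 'gps-ops.tsv' USING PigStorage()` statement expects.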


Python UDFs

• Easy way to extend Pig with new functions

• Uses Jython which is at Python 2.5

• Cannot take advantage of libraries with C dependencies (e.g. numpy, scikits, etc...)

• Can use Java classes


TLE parsing

def parse_tle_number(tle_number_string):
    # TLE floats have an assumed leading decimal point,
    # e.g. '-11606-4' means -0.11606e-4
    split_string = tle_number_string.split('-')
    if len(split_string) == 3:    # negative mantissa, negative exponent
        new_number = '-0.' + str(split_string[1]) + 'e-' + str(split_string[2])
    elif len(split_string) == 2:  # positive mantissa, negative exponent
        new_number = '0.' + str(split_string[0]) + 'e-' + str(split_string[1])
    elif len(split_string) == 1:  # no exponent
        new_number = '0.' + str(split_string[0])
    else:
        raise TypeError('Input is not in the TLE float format')
    return float(new_number)

54-61 BSTAR Drag (Decimal Assumed)

-11606-4

Full parser at https://gist.github.com/shawnhermans/4569360
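As a sanity check of the assumed-decimal convention, the BSTAR field above can be reconstructed by hand (this is independent of the parser; it just decodes the notation directly):

```python
# '-11606-4' in a TLE means -0.11606 * 10**-4: both the leading decimal
# point and the 'e' of the exponent are assumed by the format
field = '-11606-4'
sign = -1.0 if field.startswith('-') else 1.0
mantissa, exponent = field.lstrip('+-').split('-')
value = sign * float('0.' + mantissa) * 10.0 ** -int(exponent)
print(value)
```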


Simple UDF

import tleparser

@outputSchema("params:map[]")
def parseTle(name, line1, line2):
    params = tleparser.parse_tle(name, line1, line2)
    return params


Extract Parameters

grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);

grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);

([bstar#,arg_of_perigee#333.0924,mean_motion#2.00559335,element_number#72,epoch_year#2013,inclination#54.9673,mean_anomaly#26.8787,rev_at_epoch#210,mean_motion_ddot#0.0,eccentricity#5.354E-4,two_digit_year#13,international_designator#12053A,classification#U,epoch_day#17.78040066,satellite_number#38833,name#GPS BIIF-3 (PRN 24),mean_motion_dot#-1.8E-6,ra_of_asc_node#344.5315])


Storing Results

grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
grunt> STORE parsed INTO 'propagated-csv' USING PigStorage(',','-schema');


UDF with Java Import

from jsattrak.objects import SatelliteTleSGP4

@outputSchema("propagated:bag{positions:tuple(time:double, x:double, y:double, z:double)}")
def propagateTleECEF(name, line1, line2, start_time, end_time, number_of_points):
    satellite = SatelliteTleSGP4(name, line1, line2)
    ecef_positions = []
    increment = (float(end_time) - float(start_time)) / float(number_of_points)
    current_time = start_time
    while current_time <= end_time:
        positions = [current_time]
        positions.extend(list(satellite.calculateJ2KPositionFromUT(current_time)))
        ecef_positions.append(tuple(positions))
        current_time += increment
    return ecef_positions


Propagate Positions

grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);
grunt> propagated = FOREACH gps GENERATE myfuncs.parseTle(name, line1, line2), myfuncs.propagateTleECEF(name, line1, line2, 2454992.0, 2454993.0, 100);
grunt> DESCRIBE propagated;
propagated: {params: map[],propagated: {positions: (time: double,x: double,y: double,z: double)}}
grunt> flattened = FOREACH propagated GENERATE params#'satellite_number', FLATTEN(propagated);
grunt> DESCRIBE flattened;
flattened: {bytearray,propagated::time: double,propagated::x: double,propagated::y: double,propagated::z: double}


Result

(38833,2454992.9599999785,2.278136816721697E7,7970303.195970464,-1.1066153998664627E7)
(38833,2454992.9699999783,2.2929498370345607E7,1.0245812732430315E7,-8617450.742994161)
(38833,2454992.979999978,2.2713614118860725E7,1.2358665040019082E7,-6031915.392826946)
(38833,2454992.989999978,2.213715624812226E7,1.4275325605036272E7,-3350605.7983842064)
(38833,2454992.9999999776,2.1209296863515433E7,1.5965381866069315E7,-616098.4598421039)


Pig on Amazon EMR



Pig with EMR

• SSH into the box to run an interactive Pig session

• Load data to/from S3

• Run standalone Pig scripts on demand


Conclusion


Other Useful Tools

• Python-dateutil : Super-duper date parser

• Oozie : Hadoop workflow engine

• Piggybank and Elephant Bird : 3rd party Pig libraries

• Chardet: Character detection library for Python


Parting Thoughts

• Great ETL tool/language

• Flexible enough to write general purpose MapReduce jobs

• Limited, but emerging 3rd party libraries

• Jython for UDFs is extremely limiting (Spark?)

Twitter: @shawnhermans
Email: shawnhermans@gmail.com

