Pig and Python to Process Big Data


April 8th, 2013 Presentation to Omaha Dynamic Languages User Group

Transcript

Big Data with Pig and Python

Shawn Hermans
Omaha Dynamic Languages User Group

April 8th, 2013

Tuesday, April 9, 13

About Me

• Mathematician/Physicist turned Consultant

• Graduate Student in CS at UNO

• Current Software Engineer at Sojern


Working with Big Data


What is Big Data?

Data Source               Size
Wikipedia Database Dump   9 GB
Open Street Map           19 GB
Common Crawl              81 TB
1000 Genomes              200 TB
Large Hadron Collider     15 PB annually

• Gigabytes - Normal size for relational databases

• Terabytes - Relational databases may start to experience scaling issues

• Petabytes - Relational databases struggle to scale without a lot of fine tuning

Working With Data

Expectation vs. Reality

• Different File Formats

• Missing Values

• Inconsistent Schema

• Loosely Structured

• Lots of it


MapReduce

Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview

• Map - Emit key/value pairs from data

• Reduce - Collect data with common keys

• Tries to minimize moving data between nodes
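The two phases can be sketched in plain Python (a toy word count to illustrate the idea, not Hadoop's actual API; the function names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + Reduce: collect pairs sharing a key and sum their values
    shuffled = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(shuffled, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data with pig", "pig eats anything"]))
print(counts['pig'])  # 2
```

In a real cluster the shuffle step is what moves data between nodes, which is why the framework tries to keep it small.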


MapReduce Issues

• Very low-level abstraction

• Cumbersome Java API

• Unfamiliar to data analysts

• Rudimentary support for data pipelines


Pig

• Eats anything

• SQL-like, procedural data flow language

• Extensible with Java, Jython, Groovy, Ruby or JavaScript

• Provides opportunities to optimize workflows


Alternatives

• Java MapReduce API

• Hadoop Streaming

• Hive

• Spark

• Cascading

• Cascalog


Python

• Data analysis - pandas, numpy, networkx

• Machine learning - scikits.learn, milk

• Scientific - scipy, pyephem, astropysics

• Visualization - matplotlib, d3py, ggplot


Pig Features


Input/Output

• HBase

• JDBC Database

• JSON

• CSV/TSV

• Avro

• Protobuf

• Sequence File

• Hive Columnar

• XML

• Apache Log

• Thrift

• Regex


Relational Operators

LIMIT  GROUP  FILTER  CROSS
COGROUP  JOIN  STORE  DISTINCT
FOREACH  LOAD  ORDER  UNION


Built In Functions

COS  SIN  AVG  SUM
COUNT  RANDOM  LOWER  UPPER
CONCAT  MAX  MIN  TOKENIZE


User Defined Functions

• Easy way to add arbitrary code to Pig

• Eval - Filter, aggregate, or evaluate

• Storage - Load/Store data

• Full support for Java and Jython

• Experimental support for Groovy, Ruby and JavaScript


Census Example


Getting Data


Convert to TSV

ogr2ogr -f "CSV" CSA_2010Census_DP1.csv CSA_2010Census_DP1.shp -lco "GEOMETRY=AS_WKT" -lco "SEPARATOR=TAB"

• Uses Geospatial Data Abstraction Library (GDAL) to convert to TSV

• TSV > CSV (WKT geometry fields contain commas)


Inspect Headers

f = open('CSA_2010Census_DP1.tsv')
header = f.readline()
headers = header.strip('\n').split('\t')
list(enumerate(headers))

[(0, 'WKT'), (1, 'GEOID10'), (2, 'NAMELSAD10'), (3, 'ALAND10'), (4, 'AWATER10'), (5, 'INTPTLAT10'), (6, 'INTPTLON10'), (7, 'DP0010001'), . . .
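These header positions are what the later Pig scripts reference as $2 and $7. A quick check (the header list is reproduced from the output above; identifying DP0010001 as the Census DP1 total-population column is my reading, not stated in the deck):

```python
# First eight headers from the enumerate() output above
headers = ['WKT', 'GEOID10', 'NAMELSAD10', 'ALAND10', 'AWATER10',
           'INTPTLAT10', 'INTPTLON10', 'DP0010001']
print(headers[2])  # NAMELSAD10 -- the CSA name, $2 in Pig
print(headers[7])  # DP0010001  -- total population, $7 in Pig
```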


Pig Quick Start

pig -x local
grunt> ls
file:/data/CSA_2010Census_DP1.dbf<r 1>  841818
file:/data/CSA_2010Census_DP1.prj<r 1>  167
file:/data/CSA_2010Census_DP1.shp<r 1>  76180308
file:/data/CSA_2010Census_DP1.shx<r 1>  3596
file:/data/CSA_2010Census_DP1.tsv<r 1>  111224058

http://pig.apache.org/releases.html

https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads

• Download Pig Distribution

• Untar package

• Start Pig in local mode


Loading Data

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();


Extracting Data

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted_no_types = FOREACH csas GENERATE $2 AS name, $7 AS population;
grunt> DESCRIBE extracted_no_types;
extracted_no_types: {name: bytearray,population: bytearray}


Adding Schema

grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted = FOREACH csas GENERATE $2 AS name:chararray, $7 AS population:int;
grunt> DESCRIBE extracted;
extracted: {name: chararray,population: int}


Ordering

grunt> ordered = ORDER extracted BY population DESC;
grunt> dump ordered;

("New York-Newark-Bridgeport, NY-NJ-CT-PA CSA",22085649)
("Los Angeles-Long Beach-Riverside, CA CSA",17877006)
("Chicago-Naperville-Michigan City, IL-IN-WI CSA",9686021)
("Washington-Baltimore-Northern Virginia, DC-MD-VA-WV CSA",8572971)
("Boston-Worcester-Manchester, MA-RI-NH CSA",7559060)
("San Jose-San Francisco-Oakland, CA CSA",7468390)
("Dallas-Fort Worth, TX CSA",6731317)
("Philadelphia-Camden-Vineland, PA-NJ-DE-MD CSA",6533683)


Storing Data

grunt> STORE extracted INTO 'extracted_data' USING PigStorage('\t', '-schema');

ls -a
.part-m-00035.crc .part-m-00115.crc .pig_header  part-m-00077 part-m-00157
.part-m-00036.crc .part-m-00116.crc .pig_schema  part-m-00078 part-m-00158
.part-m-00037.crc .part-m-00117.crc _SUCCESS     part-m-00079 part-m-00159
.part-m-00038.crc .part-m-00118.crc part-m-00000 part-m-00080 part-m-00160


Space Catalog Example


Space Catalog

• 14,000+ objects in public catalog

• Use Two Line Element sets to propagate out positions and velocities

• Can generate over 100 million positions & velocities per day


Two Line Elements

ISS (ZARYA)
1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927
2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537

• Use Python script to convert to Pig friendly TSV

• Create Python UDF to parse TLE into parameters

• Use Python UDF with Java libraries to propagate out positions
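The deck doesn't show the conversion script itself; a minimal sketch, assuming the raw catalog is the usual three-lines-per-object text format (the function name and sample data here are illustrative):

```python
def tle_to_tsv(lines):
    # Group the catalog into (name, line1, line2) triples and join each
    # triple with tabs so Pig's default PigStorage() loader can split it
    stripped = [line.rstrip('\n') for line in lines if line.strip()]
    rows = []
    for i in range(0, len(stripped) - 2, 3):
        name, line1, line2 = stripped[i:i + 3]
        rows.append('\t'.join([name.strip(), line1, line2]))
    return '\n'.join(rows)

raw = [
    "ISS (ZARYA)\n",
    "1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927\n",
    "2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537\n",
]
print(tle_to_tsv(raw))
```

One record per line with tab-separated fields is exactly the shape the later `LOAD 'gps-ops.tsv' USING PigStorage()` statement expects.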


Python UDFs

• Easy way to extend Pig with new functions

• Uses Jython which is at Python 2.5

• Cannot take advantage of libraries with C dependencies (e.g. numpy, scikits, etc...)

• Can use Java classes


TLE parsing

def parse_tle_number(tle_number_string):
    # TLE floats have an assumed leading decimal point,
    # e.g. '-11606-4' means -0.11606e-4
    split_string = tle_number_string.split('-')
    if len(split_string) == 3:    # negative mantissa, negative exponent
        new_number = '-0.' + str(split_string[1]) + 'e-' + str(split_string[2])
    elif len(split_string) == 2:  # positive mantissa, negative exponent
        new_number = '0.' + str(split_string[0]) + 'e-' + str(split_string[1])
    elif len(split_string) == 1:  # no exponent
        new_number = '0.' + str(split_string[0])
    else:
        raise TypeError('Input is not in the TLE float format')
    return float(new_number)

54-61 BSTAR Drag (Decimal Assumed)

-11606-4

Full parser at https://gist.github.com/shawnhermans/4569360
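As a sanity check of the assumed-decimal convention, the BSTAR field above can be reconstructed by hand (this is independent of the parser; it just decodes the notation directly):

```python
# '-11606-4' in a TLE means -0.11606 * 10**-4: both the leading decimal
# point and the 'e' of the exponent are assumed by the format
field = '-11606-4'
sign = -1.0 if field.startswith('-') else 1.0
mantissa, exponent = field.lstrip('+-').split('-')
value = sign * float('0.' + mantissa) * 10.0 ** -int(exponent)
print(value)
```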


Simple UDF

import tleparser

@outputSchema("params:map[]")
def parseTle(name, line1, line2):
    params = tleparser.parse_tle(name, line1, line2)
    return params


Extract Parameters

grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);

grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);

([bstar#,arg_of_perigee#333.0924,mean_motion#2.00559335,element_number#72,epoch_year#2013,inclination#54.9673,mean_anomaly#26.8787,rev_at_epoch#210,mean_motion_ddot#0.0,eccentricity#5.354E-4,two_digit_year#13,international_designator#12053A,classification#U,epoch_day#17.78040066,satellite_number#38833,name#GPS BIIF-3 (PRN 24),mean_motion_dot#-1.8E-6,ra_of_asc_node#344.5315])


Storing Results

grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
grunt> STORE parsed INTO 'propagated-csv' USING PigStorage(',','-schema');


UDF with Java Import

from jsattrak.objects import SatelliteTleSGP4

@outputSchema("propagated:bag{positions:tuple(time:double, x:double, y:double, z:double)}")
def propagateTleECEF(name, line1, line2, start_time, end_time, number_of_points):
    satellite = SatelliteTleSGP4(name, line1, line2)
    ecef_positions = []
    increment = (float(end_time) - float(start_time)) / float(number_of_points)
    current_time = start_time
    while current_time <= end_time:
        positions = [current_time]
        positions.extend(list(satellite.calculateJ2KPositionFromUT(current_time)))
        ecef_positions.append(tuple(positions))
        current_time += increment
    return ecef_positions


Propagate Positions

grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);
grunt> propagated = FOREACH gps GENERATE myfuncs.parseTle(name, line1, line2), myfuncs.propagateTleECEF(name, line1, line2, 2454992.0, 2454993.0, 100);
grunt> DESCRIBE propagated;
propagated: {params: map[],propagated: {positions: (time: double,x: double,y: double,z: double)}}
grunt> flattened = FOREACH propagated GENERATE params#'satellite_number', FLATTEN(propagated);
grunt> DESCRIBE flattened;
flattened: {bytearray,propagated::time: double,propagated::x: double,propagated::y: double,propagated::z: double}


Result

(38833,2454992.9599999785,2.278136816721697E7,7970303.195970464,-1.1066153998664627E7)
(38833,2454992.9699999783,2.2929498370345607E7,1.0245812732430315E7,-8617450.742994161)
(38833,2454992.979999978,2.2713614118860725E7,1.2358665040019082E7,-6031915.392826946)
(38833,2454992.989999978,2.213715624812226E7,1.4275325605036272E7,-3350605.7983842064)
(38833,2454992.9999999776,2.1209296863515433E7,1.5965381866069315E7,-616098.4598421039)


Pig on Amazon EMR



Pig with EMR

• SSH into the box to run an interactive Pig session

• Load data to/from S3

• Run standalone Pig scripts on demand


Conclusion


Other Useful Tools

• Python-dateutil : Super-duper date parser

• Oozie : Hadoop workflow engine

• Piggybank and Elephant Bird : 3rd party Pig libraries

• Chardet: Character detection library for Python


Parting Thoughts

• Great ETL tool/language

• Flexible enough to write general purpose MapReduce jobs

• Limited, but emerging 3rd party libraries

• Jython for UDFs is extremely limiting (Spark?)

Twitter: @shawnhermans
Email: shawnhermans@gmail.com

