Frustration-Reduced PySpark: Data engineering with DataFrames


Frustration-Reduced PySpark

Data engineering with DataFrames
Ilya Ganelin

Why are we here?
- Spark for quick and easy batch ETL (no streaming)
- Actually using data frames: creation, modification, access, transformation
- Lab!
- Performance tuning and operationalization

What does it take to solve a data science problem?

- Data Prep: ingest, cleanup, error-handling & missing values, data munging
- Transformation: formatting, splitting
- Modeling: feature extraction, algorithm selection, data creation
- Train/Test/Validate: model building, model scoring

Why Spark?
- Batch/micro-batch processing of large datasets
- Easy to use, easy to iterate, wealth of common industry-standard ML algorithms
- Super fast if properly configured
- Bridges the gap between the old (SQL, single machine analytics) and the new (declarative/functional distributed programming)

Why not Spark?
- Breaks easily with poor usage or improperly specified configs
- Scaling up to larger datasets (500 GB -> TB scale) requires deep understanding of internal configurations, garbage collection tuning, and Spark mechanisms
- While there are lots of ML algorithms, many simply don't work, don't work at scale, or have poorly defined interfaces/documentation

Scala
Yes, I recommend Scala.
- The Python API is underdeveloped, especially for MLlib
- Java (until Java 8) is a second-class citizen as far as convenience vs. Scala
- Spark is written in Scala; understanding Scala helps you navigate the source
- Can leverage the spark-shell to rapidly prototype new code and constructs
- http://www.scala-lang.org/docu/files/ScalaByExample.pdf

Why DataFrames?
- Iterate on datasets MUCH faster
- Column access is easier
- Data inspection is easier
- groupBy and join are faster due to under-the-hood optimizations
- Some chunks of MLlib are now optimized to use data frames

Why not DataFrames?
- The RDD API is still much better developed
- Getting data into DataFrames can be clunky
- Transforming data inside DataFrames can be clunky
- Many of the algorithms in MLlib still depend on RDDs

Creation
Read in a file with an embedded header:

http://stackoverflow.com/questions/24718697/pyspark-drop-rows
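A minimal sketch of one way to drop the embedded header, assuming a SparkContext named sc and a hypothetical comma-delimited file data.csv:

lines = sc.textFile("data.csv")
header = lines.first()                                # the embedded header row
rows = (lines.filter(lambda line: line != header)     # drop the header
             .map(lambda line: line.split(",")))      # split each record into fields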

Create a DF
Option A – Inferred types from a Rows RDD
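A sketch of Option A, assuming the rows RDD from the previous step and a Spark 1.6-style SQLContext named sqlContext (column names are hypothetical):

from pyspark.sql import Row

# Wrap each record in a Row; field types are inferred from the Python values
people = rows.map(lambda r: Row(name=r[0], age=int(r[1])))
df = sqlContext.createDataFrame(people)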

Option B – Specify schema as strings
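A sketch of Option B, passing the schema as a list of column-name strings and letting Spark infer the types by sampling the data (same hypothetical names):

# Column names as strings; types are inferred from the data itself
df = sqlContext.createDataFrame(rows, ["name", "age"])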

DataFrame Creation

Option C – Define the schema explicitly

Check your work with df.show()
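A sketch of Option C with an explicit StructType, followed by the df.show() check (names and types are hypothetical):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),    # nullable string column
    StructField("age", IntegerType(), True),    # nullable integer column
])
typed = rows.map(lambda r: (r[0], int(r[1])))   # coerce fields to match the schema
df = sqlContext.createDataFrame(typed, schema)
df.show()                                       # prints the first rows as a table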


Column Manipulation
Selection
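A couple of selection forms, as a sketch with hypothetical column names:

df.select("name", "age")                             # select by column name
df.select(df["age"], (df["age"] + 1).alias("age1"))  # select Column expressions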

GroupBy
- Confusing! You get a GroupedData object, not an RDD or DataFrame
- Use agg or built-ins to get back to a DataFrame
- Can convert to an RDD with dataFrame.rdd
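A sketch of getting from GroupedData back to a DataFrame (and down to an RDD), with hypothetical columns:

from pyspark.sql import functions as F

grouped = df.groupBy("name")                          # GroupedData, not a DataFrame
counts = grouped.count()                              # built-in: DataFrame with a count column
stats = grouped.agg(F.avg("age").alias("mean_age"))   # agg also returns a DataFrame
rows_rdd = stats.rdd                                  # convert back to an RDD of Row objects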

Custom Column Functions
Add a column with a custom function:

http://stackoverflow.com/questions/33287886/replace-empty-strings-with-none-null-values-in-dataframe
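A sketch in the spirit of the linked answer: a UDF that maps empty strings to null, applied with withColumn (column name hypothetical):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Wrap a plain Python function as a column-level UDF
blank_as_null = F.udf(lambda s: None if s == "" else s, StringType())

# withColumn adds (or replaces) a column computed from existing ones
df2 = df.withColumn("name", blank_as_null(df["name"]))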

Row Manipulation
Filter

Range:

Equality:
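A sketch of both filter styles (hypothetical columns):

adults = df.filter((df["age"] >= 18) & (df["age"] < 65))   # range
alice = df.filter(df["name"] == "Alice")                   # equality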

Column functions: https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.Column
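A few Column methods from that page, shown as a sketch:

df.filter(df["age"].between(18, 65))        # range expressed as a Column function
df.filter(df["name"].isNull())              # null test
df.select(df["name"].alias("full_name"))    # rename within a projection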

Joins
Option A (inner join)

Option B (explicit); both options are sketched below

- Join types: inner, outer, left_outer, right_outer, leftsemi
- DataFrame joins benefit from Tungsten optimizations
- Note: PySpark will not drop columns for outer joins
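A sketch of both options, assuming two hypothetical DataFrames df1 and df2 that share an id column:

# Option A: join on a shared column name (inner join by default)
joined = df1.join(df2, "id")

# Option B: explicit join condition plus an explicit join type
joined = df1.join(df2, df1["id"] == df2["id"], "left_outer")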

Null Handling
- Built-in support for handling nulls/NA in data frames
- Drop, fill, replace
- https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions
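A sketch of the DataFrameNaFunctions calls (columns and fill values are hypothetical):

clean = df.na.drop()                                   # drop rows containing nulls
filled = df.na.fill({"age": 0, "name": "unknown"})     # fill nulls per column
swapped = df.na.replace(["N/A"], ["unknown"], "name")  # replace specific values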

What does it take to solve a data science problem?

- Data Prep: ingest, cleanup, error-handling & missing values, data munging
- Transformation: formatting, splitting
- Modeling: feature extraction, algorithm selection, data creation
- Train/Test/Validate: model building, model scoring

Lab Rules
- Ask Google and StackOverflow before you ask me
- You do not have to use my code
- Use DataFrames until you can't
- Keep track of what breaks!
- There are no stupid questions.

Lab
- Ingest data
- Remove invalid entries or fill missing entries
- Split into test, train, validate
- Reformat a single column, e.g. map IDs or change the format
- Add a custom metric or feature based on other columns
- Run a classification algorithm on this data to figure out who will survive!
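One hedged way to do the split step, assuming the ingested DataFrame is named df:

train, test, validate = df.randomSplit([0.6, 0.2, 0.2], seed=42)  # 60/20/20 split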

What problems did you encounter?

What are you still confused about?

Spark Architecture

Partitions, Caching, and Serialization

Partitions
- How data is split on disk
- Affects memory usage, shuffle size
- Count ~ speed, Count ~ 1/memory

Caching
- Persist RDDs in distributed memory
- Major speedup for repeated operations

Serialization
- Efficient movement of data
- Java vs. Kryo
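A sketch of the three knobs, assuming a DataFrame df and a SparkConf assembled before the context is created (values are illustrative, not recommendations):

from pyspark import SparkConf, StorageLevel

# Serialization: Kryo is typically faster and more compact than Java serialization
conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

# Partitions: more partitions means more parallelism but less data (and memory) per task
df200 = df.repartition(200)

# Caching: keep a reused dataset in distributed memory, spilling to disk if needed
df200.persist(StorageLevel.MEMORY_AND_DISK)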

Shuffle!
- All-to-all operations: reduceByKey, groupByKey
- Data movement: serialization, Akka
- Memory overhead: dumps to disk when OOM, garbage collection
- EXPENSIVE!
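A hedged illustration of why the operation choice matters: reduceByKey combines values map-side before the shuffle, while groupByKey ships every individual value across the network (the pair RDD is hypothetical):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

sums = pairs.reduceByKey(lambda x, y: x + y)                 # partial sums before the shuffle
sums_slow = pairs.groupByKey().mapValues(lambda v: sum(v))   # shuffles every value, then sums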

Map Reduce

What else?
- Save your work => write completed datasets to file
- Work on small data first, then go to big data
- Create test data to capture edge cases
- LMGTFY

By popular demand:

screen pyspark \
  --driver-memory 100g \
  --num-executors 60 \
  --executor-cores 5 \
  --master yarn-client \
  --conf "spark.executor.memory=20g" \
  --conf "spark.io.compression.codec=lz4" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.shuffle.manager=tungsten-sort" \
  --conf "spark.akka.frameSize=1028" \
  --conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -XX:+UseCompressedOops"

Any Spark on YARN
- E.g. deploy Spark 1.6 on CDH 5.4
- Download your Spark binary to the cluster and untar it
- In $SPARK_HOME/conf/spark-env.sh:

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/conf
This tells Spark where Hadoop is deployed; it also gives it the link it needs to run on YARN.

export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop classpath)
This defines the location of the Hadoop binaries used at run time.
