Page 1: Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016

Gianmario Spacagna, 14th September, 2016 - Alluxio Meetup @ San Francisco, CA

Page 2

Takeaways

- What a logical data warehouse is
- How to handle governance issues
- An Agile workflow made of iterative exploratory analysis and production-quality development
- A fully in-memory stack for fast computation on top of Spark and Alluxio
- How to successfully do data science if your data resides in an RDBMS and you don't have a data lake

Page 3

About me

- Engineering background in Distributed Systems (University of Cassino, Polytechnic of Turin, KTH Stockholm)
- Data-relevant experience:
  - Predictive Marketing (AgilOne, StreamSend)
  - Cyber Security (Cisco)
  - Financial Services (Barclays)
  - Automotive (Pirelli)

Page 4

Areas of interest

- Functional Programming, Scala and Apache Spark
- Contributor to the Professional Data Science Manifesto
- Founder of the Data Science Milan Meetup community (datasciencemilan.org)
- Co-authoring the Python Deep Learning book, coming soon...

Building production-ready and scalable machine learning systems

(continues with a list of principles...)

Page 5

Data Science Agile cycle: Get access to data → Explore → Transform → Train → Evaluate → Analyze results

Even dozens of iterations per day!

Page 6

Successful development of new data products requires proper infrastructure and tools

Page 7

Start by building a toy model with a small snapshot of data that can fit in your laptop's memory, and eventually ask your organization for cluster resources

Page 8

- You can't solve problems with data science if data is not readily available
- Data processing should be fast and reactive to allow quick iterations
- The core team cannot depend on IT folks

Start by building a toy model with a small snapshot of data that can fit in your laptop's memory, and eventually ask your organization for cluster resources

Page 9

Data Lake in a legacy enterprise environment

Page 10

Technical issues

- Engineering effort
  - Dedicated infrastructure team (expensive)
- Synchronization with new data from source
  - Report which portion of the data has been exported and which has not
- Consistency / Data Versioning / Duplication
  - ETL logic and requirements change very often
  - Memory is cheap, but hundreds of sparse copies of the same data are confusing
- I/O cost
  - Reading/writing is expensive for iterative and exploratory jobs (machine learning)

Page 11

Logical Data Warehouse

- View and access cleaned versions of data
- Always show the latest version by default
- Apply transformations on the fly (discovery-oriented analytics)
- Abstract the data representation from the rigid structures of the DB's persistence store
- Simply add new data sources using virtualization
- Flexible, fast time-to-market, lower costs

Page 12

What about governance issues?

- Large corporations can't move data before a governance plan is approved
- Data can only be stored in a safe environment administered by a few authorized people who don't necessarily understand data scientists' needs
- Data leakage paranoia, cloud-phobia!
- As a result, data cannot be easily or quickly pulled from the central data warehouse and stored in an external infrastructure

Page 13

Setting up a new project takes a long time and a large investment

That's not Agile!

Page 14

Wait a moment, analysts don’t seem to have this problem…

Page 15

From disk to volatile memory

Distribute and make data temporarily available in-memory in an ad-hoc development cluster

Page 16

- In-memory engine for distributed data processing
- JDBC drivers to connect to relational databases
- Structured data represented using the DataFrame API
- Fully-functional data manipulation via the RDD API
- Machine learning libraries (ML/MLlib)
- Interaction and visualization through Spark Notebook or Zeppelin
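The JDBC entry point mentioned above can be sketched in Spark 1.x-era Scala; the driver, connection URL, credentials, table and column names below are hypothetical placeholders, not taken from the deck:

```scala
import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object JdbcLoadSketch {
  // Hypothetical connection details: replace with your own RDBMS settings.
  def jdbcUrl(host: String, db: String): String = s"jdbc:postgresql://$host/$db"

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-load"))
    val sqlContext = new SQLContext(sc)

    val props = new Properties()
    props.setProperty("user", "analyst")
    props.setProperty("password", "secret")

    // Partitioned read: Spark issues one range query per partition
    // over the numeric column, so the load is parallelized.
    val customers: DataFrame = sqlContext.read.jdbc(
      jdbcUrl("dbhost", "warehouse"), "customers",
      columnName = "customer_id", lowerBound = 0L, upperBound = 10000000L,
      numPartitions = 16, connectionProperties = props)

    customers.printSchema()
    sc.stop()
  }
}
```

Choosing a sensible partitioning column and bounds is what keeps the single source database from being hammered by one giant query.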

Page 17

In-memory workflow

Page 18

Just Spark cache is not enough

- Data is dropped from memory at each context restart, caused by:
  - Updating a dependency jar (common for mixed IDE development / notebook analysis)
  - Re-submitting the job execution
  - Kerberos ticket expiry
- Fetching 600M rows can take ~1 hour on a 5-node cluster

Dozens of iterations per day => most of the time is spent waiting for data to reload at each iteration!

Page 19

Distribute and make data persistently (not just temporarily) available in-memory in the development cluster, shared among multiple concurrent applications

From volatile memory to persistent memory storage

Page 20

- Formerly known as Tachyon
- In-memory distributed storage system
- Long-term caching of raw data and intermediate results
- Spark can read/write to Alluxio seamlessly instead of using HDFS
- A 1-tier configuration safely leaves no traces on disk
- Data is loaded once and stays available to multiple applications for the whole development period

Page 21

Alluxio as the Key Enabling Technology

Page 22

Page 23

1-tier configuration

- ALLUXIO_RAM_FOLDER=/dev/shm/ramdisk
- alluxio.worker.memory.size=24GB
- alluxio.worker.tieredstore
  - levels=1
  - level0.alias=MEM
  - level0.dirs.path=${ALLUXIO_RAM_FOLDER}
  - level0.dirs.quota=24G
- Leave the under-FS configuration empty
- Deploy without mount (no root access required):
  - ./bin/alluxio-start.sh all NoMount
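Collected into a single file, the settings above would look like the following alluxio-site.properties fragment (the 24 GB sizing is just the deck's example):

```properties
# conf/alluxio-site.properties -- memory-only (1-tier) worker storage
alluxio.worker.memory.size=24GB
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/dev/shm/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=24G
# Under-FS deliberately left unset: nothing is ever persisted to disk
```

With no under-FS and no mount, Alluxio can then be started without root: ./bin/alluxio-start.sh all NoMount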

Page 24

Spark read/write APIs

- DataFrame
  - dataframe.write.save("alluxio://master_ip:port/mydata/mydataframe.parquet")
  - val dataframe: DataFrame = sqlContext.read.load("alluxio://master_ip:port/mydata/mydataframe.parquet")
- RDD
  - rdd.saveAsObjectFile("alluxio://master_ip:port/mydata/myrdd.object")
  - val rdd: RDD[MyCaseClass] = sc.objectFile[MyCaseClass]("alluxio://master_ip:port/mydata/myrdd.object")
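Putting the two APIs together, a load-once workflow might look like the sketch below; the master address, paths, case class, and the 19998 default Alluxio master port are illustrative, not from the deck:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.rdd.RDD

// Illustrative record type: register it with Kryo for compact object files.
case class MyCaseClass(id: Long, score: Double)

object AlluxioRoundTrip {
  // Pure helper for building Alluxio URIs (the naming is ours, not the deck's).
  def alluxioPath(master: String, port: Int, path: String): String =
    s"alluxio://$master:$port/$path"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("alluxio-roundtrip")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load once (e.g. from JDBC or a local snapshot), then park in Alluxio.
    val df: DataFrame = sqlContext.read.load("/tmp/source.parquet")
    val target = alluxioPath("master_ip", 19998, "mydata/mydataframe.parquet")
    df.write.save(target)

    // Any later Spark context (or concurrent application) re-reads in seconds.
    val cached: DataFrame = sqlContext.read.load(target)

    // Same idea for RDDs, via Kryo-serialized object files.
    val rdd: RDD[MyCaseClass] = sc.parallelize(Seq(MyCaseClass(1L, 0.5)))
    rdd.saveAsObjectFile(alluxioPath("master_ip", 19998, "mydata/myrdd.object"))
    sc.stop()
  }
}
```

Because the Parquet/object files live in Alluxio rather than in Spark's cache, they survive context restarts, jar updates, and Kerberos ticket expiry.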

Page 25

Making the impossible possible

- An Agile workflow combining Spark, Scala, DataFrame, JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from the source and develop production-quality machine learning pipelines
- Data available since day 1 and at every iteration
  - Alluxio decreased loading time from hours to seconds
- Avoids complicated and time-consuming data-plumbing operations

Page 26

Further developments

1. Memory size limitation
   - Add external in-memory tiers?
2. Set-up overhead
   - JDBC drivers, partitioning strategy and DataFrame from/to case class conversion (Spark 2 aims to solve this)
3. Shared memory resources between Spark and Alluxio
   - Set Alluxio as OFF_HEAP memory as well and divide memory into storage and cache
4. In-memory replication for read availability
   - If an Alluxio node fails, data is lost due to the absence of an underlying file system
5. It would be nice if Alluxio could handle this and mount a relational table/view in the form of data files (csv, parquet...)

Page 27

Follow-up links

- Original article on DZone:
  - dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
- Professional Data Science Manifesto:
  - datasciencemanifesto.org
- Vademecum of Practical Data Science:
  - datasciencevademecum.wordpress.com
- Sparkz:
  - github.com/gm-spacagna/sparkz