Distributed Systems from Scratch - Part 2
Handling third party libraries
https://github.com/phatak-dev/distributedsystems
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults on Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Idea
● Motivation
● Architecture of existing big data system
● Function abstraction
● Third party libraries
● Implementing third party libraries
● MySQL task
● Code example
Idea
“What does it take to build a distributed
processing system like Spark?”
Motivation
● The first version of Spark had only 1600 lines of Scala code
● It had all the basic pieces of RDD and the ability to run a distributed system using Mesos
● Recreating the same code with step by step understanding
● Ample time in hand
Distributed systems from 30,000 ft (bottom to top)
● Distributed Storage (HDFS/S3)
● Distributed Cluster Management (YARN/Mesos)
● Distributed Processing Systems (Spark/MapReduce)
● Data Applications
Our distributed system
● Mesos for cluster management
● Scala function based abstraction
● Scala functions to express logic
Function abstraction
● The whole Spark API can be summarized as a Scala function, which can be represented as () => T
● This Scala function can be parallelized and sent over the network to run on multiple machines using Mesos
● The function is represented as a task inside the framework
● FunctionTask.scala
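To make the abstraction concrete, here is a minimal sketch of what such a function task could look like, together with the serialization round trip the framework performs before shipping it over the network. All names here are illustrative; the actual FunctionTask.scala in the repo may differ.

```scala
import java.io._

// Minimal sketch of the core abstraction: any unit of work is () => T.
class FunctionTask[T](val task: () => T) extends Serializable {
  def run(): T = task()
}

// Serialize a task to bytes and back, as the framework would do before
// shipping it to an executor over the network.
def roundTrip[T](t: FunctionTask[T]): FunctionTask[T] = {
  val out = new ByteArrayOutputStream()
  new ObjectOutputStream(out).writeObject(t)
  val in = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
  in.readObject().asInstanceOf[FunctionTask[T]]
}

val task = new FunctionTask(() => 1 + 1)
val copy = roundTrip(task)
```

Scala function literals are serializable, which is what makes this shipping of logic possible.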
Spark API as distributed function
● The initial API of Spark revolved around the Scala function abstraction for processing, with RDD as the data abstraction
● Every API like map, flatMap is represented as a function task which takes one parameter and returns one value
● The distribution of the functions was initially done by Mesos, and later ported to other cluster managers
● This shows how Spark started with functional programming
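As a sketch of the idea above, an API call like map reduces to the same zero-argument task shape once it is bound to a partition of data. Here `partition` stands in for the data local to one executor; these names are illustrative, not from the repo.

```scala
// Sketch: how a map API call reduces to the () => T task abstraction.
// `partition` stands in for the data held by a single executor.
val partition = Seq(1, 2, 3)
val double: Int => Int = _ * 2

// The framework would wrap this zero-argument function as a task and
// ship it to the executor that holds the partition.
val mapAsTask: () => Seq[Int] = () => partition.map(double)
```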
Till now
● Discussion about Mesos and its abstraction
● Hello world code on Mesos
● Defining the Function interface
● Implementing
  ○ Scheduler to run Scala code
  ○ Custom executor for Scala
  ○ Serializing and deserializing Scala functions
● https://www.youtube.com/watch?v=Oy9ToN4O63c
What can a local function do?
● Access local data. Even in Spark, functions normally access the HDFS data local to the node
● Access classes provided by the framework
● Run any logic which can be serialized
What can it not do?
● Access classes from outside the framework
● Access the results of other functions (shuffle)
● Access lookup data (broadcast)
Need for third party libraries
● The ability to add third party libraries is important in a distributed system framework
● Third party libraries allow us to
  ○ Connect to third party sources
  ○ Use libraries to implement custom logic, like matrix manipulation, inside the function abstraction
  ○ Extend the base framework using a set of libraries, e.g. spark-sql
  ○ Optimize for specific hardware
Approaches to third party libraries
● There are two different approaches to distributing third party jars
● UberJar - build all the dependencies and your application code into a single jar
● The second approach is to distribute the libraries separately and add them to the classpath of the executors
● The UberJar approach suffers from jar size and versioning issues
● So we are going to follow the second approach, which is similar to the one followed in Spark
Design for distributing jars
[Diagram: the Scheduler/Driver runs the scheduler code alongside a jar-serving HTTP server; Executor 1 and Executor 2 download the jars over HTTP.]
Distributing jars
● Third party jars are distributed across the cluster over the HTTP protocol
● Whenever the scheduler/driver comes up, it starts an HTTP server to serve the jars passed to it by the user
● Whenever executors are created, the scheduler passes them the URI of the HTTP server to connect to
● Executors connect to the jar server and download the jars to their respective machines, then add them to their classpath
Code for implementing
● We need multiple changes to our existing code base to support third party jars
● The different steps are
  ○ Implementation of an embedded HTTP server
  ○ Change to the scheduler to start the HTTP server
  ○ Change to the executor to download jars and add them to the classpath
  ○ A function which uses a third party library
Http Server
● We implement an embedded HTTP server using Jetty
● Jetty is a popular HTTP server and J2EE servlet container from the Eclipse Foundation
● One of the strengths of Jetty is that it can be embedded inside another program to provide an HTTP interface to some functionality
● Initial versions of Spark used Jetty for jar distribution; newer versions use Netty
● https://eclipse.org/jetty/
● HttpServer.scala
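The repo's HttpServer.scala wraps Jetty; as a dependency-free sketch of the same idea, here is a jar-serving server built on the JDK's built-in com.sun.net.httpserver instead of Jetty. All names are illustrative.

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import java.nio.file.{Files, Paths}

// Sketch of an embedded jar server: serves files out of baseDir over HTTP.
// The JDK's HttpServer stands in for Jetty so the example is self-contained.
class JarServer(baseDir: String) {
  // Port 0 lets the OS pick a free port; create() also binds the socket.
  private val server = HttpServer.create(new InetSocketAddress(0), 0)
  server.createContext("/", new HttpHandler {
    override def handle(exchange: HttpExchange): Unit = {
      val name = exchange.getRequestURI.getPath.stripPrefix("/")
      val path = Paths.get(baseDir).resolve(name)
      if (Files.isRegularFile(path)) {
        val bytes = Files.readAllBytes(path)
        exchange.sendResponseHeaders(200, bytes.length.toLong)
        exchange.getResponseBody.write(bytes)
      } else {
        exchange.sendResponseHeaders(404, -1L)
      }
      exchange.close()
    }
  })

  def start(): Unit = server.start()
  def stop(): Unit = server.stop(0)
  // The uri the scheduler would hand to every executor.
  def uri: String = s"http://localhost:${server.getAddress.getPort}"
}
```

With Jetty the handler above is replaced by a ResourceHandler whose resource base is the jar directory.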
Scheduler change
● Once we have the HTTP server, we need to start it when we start our scheduler
● We will use the registered callback for creating our jar server
● As part of starting the jar server, we copy all the jars provided by the user to a location which becomes the base directory for the server
● Once we have the server running, we pass the server URI on to all the executors
● TaskScheduler.scala
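The jar-staging step described above can be sketched as follows. The function name and layout are illustrative, not taken from the repo's TaskScheduler.scala.

```scala
import java.nio.file.{Files, Path, Paths, StandardCopyOption}

// Sketch: inside the scheduler's registered callback, copy the jars the
// user supplied into a fresh directory that becomes the jar server's
// base directory.
def stageJars(userJars: Seq[String]): Path = {
  val baseDir = Files.createTempDirectory("jar-server")
  userJars.foreach { jar =>
    val src = Paths.get(jar)
    Files.copy(src, baseDir.resolve(src.getFileName),
      StandardCopyOption.REPLACE_EXISTING)
  }
  // The embedded HTTP server is then started with baseDir as its base
  // directory, and its uri is passed on to every executor.
  baseDir
}
```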
Executor side
● In the executor, we download the jars by making calls to the jar server running on the master
● Once we have downloaded the jars, we add them to the classpath using a URLClassLoader
● We use this classloader to run our functions so that they have access to all the jars
● We plug this code into the registered callback of the executor so it runs only once
● TaskExecutor.scala
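The executor-side download and classloading can be sketched as below. Names and signatures are illustrative; the repo's TaskExecutor.scala may differ.

```scala
import java.net.{URL, URLClassLoader}
import java.nio.file.{Files, Path, StandardCopyOption}

// Sketch of the executor side: fetch each jar from the scheduler's jar
// server and build a URLClassLoader over the local copies.
def downloadAndLoad(serverUri: String, jarNames: Seq[String],
                    destDir: Path): URLClassLoader = {
  val localUrls = jarNames.map { name =>
    val dest = destDir.resolve(name)
    val in = new URL(s"$serverUri/$name").openStream()
    try Files.copy(in, dest, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    dest.toUri.toURL
  }
  // Tasks are later deserialized and run with this loader (for example as
  // the thread's context classloader) so they can see the new classes.
  new URLClassLoader(localUrls.toArray, getClass.getClassLoader)
}
```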
MySQL function
● This example is a function which accesses the MySQL driver classes to run JDBC queries against a MySQL instance
● We ship the MySQL jar using our jar distribution framework, so it is not part of our application jar
● There is no change in our function API, as it is a normal function like the other examples
● MySQLTask.scala
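A sketch of what such a function could look like using plain JDBC. The url, table and credentials are placeholders, and the repo's MySQLTask.scala may differ; invoking it requires a running MySQL instance and the driver jar shipped via the jar server.

```scala
import java.sql.DriverManager

// Sketch: a normal () => T function whose body happens to use the MySQL
// JDBC driver. The driver class comes from the jar distributed by the
// jar server, not from the application jar.
// url/table/credentials below are placeholders.
val mysqlTask: () => Int = () => {
  val conn = DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/test", "user", "password")
  try {
    val rs = conn.createStatement().executeQuery("select count(*) from users")
    rs.next()
    rs.getInt(1)
  } finally conn.close()
}
```

From the framework's point of view this is just another () => T task; only its classpath requirements differ.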
References
● http://blog.madhukaraphatak.com/mesos-single-node-setup-ubuntu/
● http://blog.madhukaraphatak.com/mesos-helloworld-scala/
● http://blog.madhukaraphatak.com/custom-mesos-executor-scala/
● http://blog.madhukaraphatak.com/distributing-third-party-libraries-in-mesos/