6.824 Project Report:
Optimizing Big Data Frameworks for Multi-core Systems Yunming Zhang, Mo Zhou
Introduction
Most big data systems use cores separately for different tasks, causing inefficient
utilization of CPU, memory, and I/O resources on each multi-core node. In this project,
we explored the benefits of using multiple cores together to execute a single task.
We show that a modified version of KMeans can reduce average heap memory
utilization on a 6-core node by 2.5x compared to the unmodified KMeans in MLlib (a
machine learning library built on top of Spark). A modified version of PageRank is 15%
faster than the PageRank implementation in GraphX (a graph library built on top of
Spark) running on a 4-core node. We expect the speedup to increase further in a cluster
setting for PageRank on larger graphs. We believe the same optimizations can
be applied to other platforms such as Hadoop MapReduce.
Background
Current big data systems, such as Hadoop MapReduce and Spark, treat every core as
an independent machine. The runtime uses multi-core systems by decomposing a job
into smaller tasks. For example, in Hadoop MapReduce, a MapReduce job is
decomposed into a series of Map and Reduce tasks, where each task operates on an
"input split". In Spark, RDDs are split into a number of partitions, and a separate
Spark task is created to process each partition. Each task performs operations on its own
input sequentially without communicating with other tasks. Big data systems
schedule multiple tasks on each multi-core node to exploit the CPU resources. For
example, Hadoop and Spark would assign 8 or more tasks to a node with 8 cores to
fully utilize it.
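This one-task-per-core model can be sketched as follows. This is a minimal illustration in Java (the systems discussed run Scala/Java task code on the JVM); the partition split, the `processPartition` function, and the 4-core count are all hypothetical stand-ins for the runtime's own machinery:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerCoreTasks {
    // Each "task" processes one partition sequentially, in isolation --
    // mirroring how the runtimes treat each core as an independent machine.
    static long processPartition(List<Integer> partition) {
        long sum = 0;
        for (int v : partition) sum += v; // works alone on its own split
        return sum;
    }

    public static void main(String[] args) throws Exception {
        int cores = 4; // assume a 4-core node
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 100; i++) input.add(i);

        // Split the input into one partition per core.
        List<List<Integer>> partitions = new ArrayList<>();
        int chunk = input.size() / cores;
        for (int c = 0; c < cores; c++)
            partitions.add(input.subList(c * chunk, (c + 1) * chunk));

        // Schedule one independent task per core.
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        List<Future<Long>> partials = new ArrayList<>();
        for (List<Integer> p : partitions)
            partials.add(pool.submit(() -> processPartition(p)));

        long total = 0;
        for (Future<Long> f : partials) total += f.get(); // "reduce" over partial results
        pool.shutdown();
        System.out.println(total); // 5050
    }
}
```

Note that the final merge loop must combine one partial result per task, which is the cost the rest of this report targets.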
This model of parallelism hurts the memory efficiency of many popular data analytics
applications, including KMeans and K-Nearest Neighbors, that retain a large
in-memory accumulator data structure, which stores partial results. These accumulators
alleviate network delay by allowing the system to send far fewer
messages. For example, KMeans stores the newly computed cluster centroids in
memory during the execution of the tasks.
In Hadoop MapReduce, the memory inefficiency is made worse by the fact that individual
tasks run inside separate JVMs, preventing the sharing of large in-memory
read-only data structures. The problem is alleviated in Spark by running multiple tasks
in the same executor JVM and by providing built-in broadcast variables that reduce
memory usage for read-only data.
Additionally, running a large number of tasks on each multi-core node generates a
large number of partial results, degrading the performance of
communication-heavy applications such as PageRank. For example, if we run 8
tasks on an 8-core machine for PageRank, we end up with a reduce stage that
needs to merge 8 partial results. On the other hand, if we run only a single task, the
reduce stage is much faster; however, the current model of parallelism would then
utilize only 1 of the 8 cores on the node.
Motivating Applications
KMeans
KMeans is a clustering algorithm used in many applications. The algorithm partitions a
set of n sample objects into k clusters. For example, a popular application of KMeans
involves finding topics in news articles: such an application picks n news articles as
sample objects and clusters them into k topics.
The algorithm first chooses k objects randomly as the centroids. It then assigns every
sample object to the cluster whose centroid it is closest to. After assigning all objects to
their clusters, KMeans recalculates the location of the centroid of each cluster. This process
repeats until the centroid locations stabilize, or until a fixed iteration limit is reached.
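The iteration just described can be sketched as follows. This is a minimal one-dimensional illustration in Java (the report's actual code operates on high-dimensional vectors through MLlib's Scala API); the point values and initial centroids are made up for the example:

```java
import java.util.Arrays;

public class KMeans1D {
    // Repeats assign/recompute until the centroids stop moving or the
    // iteration limit is reached.
    static double[] kmeans(double[] points, double[] centroids, int maxIter) {
        for (int iter = 0; iter < maxIter; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) { // assign each point to the closest centroid
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best]))
                        best = c;
                sum[best] += p;
                count[best]++;
            }
            double[] next = new double[centroids.length];
            for (int c = 0; c < centroids.length; c++) // recompute centroid locations
                next[c] = count[c] == 0 ? centroids[c] : sum[c] / count[c];
            if (Arrays.equals(next, centroids)) break; // centroids stabilized
            centroids = next;
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        double[] result = kmeans(points, new double[]{0.0, 10.0}, 100);
        System.out.println(Arrays.toString(result)); // [1.0, 9.0]
    }
}
```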
Since the sample objects are independent of one another, they can be processed in
parallel. Therefore, a common technique in implementing KMeans is to split the sample
objects into subgroups (called "slices" in our code) and process the slices in parallel.
Given this description, it is natural to implement each iteration of
KMeans in a MapReduce fashion. As shown in Figure 1, the map phase takes a slice as
input, computes the similarity between each sample object (represented as a vector) and
each centroid, and assigns the object to the closest cluster. Finding the
closest cluster is the most computationally intensive part of the algorithm, so the
map phase dominates the running time of each iteration. The map phase yields a partial sum
of sample vectors along with the number of vectors in each cluster. The reduce
phase adds up the partial sums for each cluster, divides them by the number of vectors in
the cluster, and generates new locations for the centroids.
KMeans is a memory-intensive application, because each map task needs to store a
large data structure containing information about the cluster centroids. The total size
of these data structures is proportional to both the number of clusters and the number of
concurrent map tasks. To reduce the memory footprint, we aimed to reduce the
number of map tasks while making each task utilize multiple cores, so that the overall
performance does not degrade (Figure 2).
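The proportionality argument can be made concrete with some back-of-the-envelope arithmetic. The numbers below are illustrative only (they are not measurements from the report): assume k clusters of d-dimensional double-precision centroids, with one accumulator copy per concurrent map task:

```java
public class AccumulatorFootprint {
    public static void main(String[] args) {
        // Hypothetical workload: 1000 clusters, 100-dimensional centroids.
        int k = 1000, d = 100, bytesPerDouble = 8;

        // One accumulator copy per map task.
        long perTask = (long) k * d * bytesPerDouble;

        // One single-threaded task per core on a 6-core node
        // versus one multi-threaded task sharing a single copy.
        long sixTasks = 6 * perTask;
        long oneSharedTask = perTask;

        System.out.println(perTask);                  // bytes per accumulator copy
        System.out.println(sixTasks / oneSharedTask); // footprint reduction factor
    }
}
```

Under these assumed numbers, a single multi-threaded task holds one 0.8 MB accumulator instead of six, a 6x reduction in accumulator memory on that node.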
Figure 1: KMeans running on Spark with single-threaded tasks
Figure 2: KMeans running on Spark with multi-threaded tasks
PageRank
PageRank is an algorithm that was first used by Google to compute the relative
importance of web pages. It takes as input a graph consisting of vertices and edges
(G = (V, E)) and outputs a rank for each vertex in the graph. PageRank is an
iterative improvement algorithm that refines the rank of each vertex during
each iteration.
There are two variants of the algorithm. The first runs for a fixed number of iterations
and stops (PageRank.run(graph, numIter) in GraphX). The second runs until all
the rank updates are smaller than a tolerance (PageRank.runUntilConvergence(graph, tol)
in GraphX). We focused on the latter, as it is more useful: it is hard to
determine beforehand how many iterations the algorithm should run, and the
amount of computation shrinks significantly in the later iterations, when most edges
are no longer active.
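A minimal sequential sketch of the tolerance-based variant is shown below, in Java rather than GraphX's Scala. The adjacency list is a made-up toy graph, and the 0.15/0.85 constants follow the standard PageRank formulation (GraphX's default resetProb is 0.15); this is an illustration of the convergence test, not GraphX's distributed implementation:

```java
import java.util.Arrays;

public class PageRankUntilConvergence {
    // Iterates until every rank update falls below tol (or maxIter is hit).
    static double[] run(int[][] outEdges, double tol, int maxIter) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0);
        for (int iter = 0; iter < maxIter; iter++) {
            // Each vertex sends its rank, split evenly, along its out-edges.
            double[] contrib = new double[n];
            for (int u = 0; u < n; u++)
                for (int v : outEdges[u])
                    contrib[v] += rank[u] / outEdges[u].length;
            // Apply updates and track the largest change this iteration.
            double maxDelta = 0.0;
            for (int v = 0; v < n; v++) {
                double next = 0.15 + 0.85 * contrib[v];
                maxDelta = Math.max(maxDelta, Math.abs(next - rank[v]));
                rank[v] = next;
            }
            if (maxDelta < tol) break; // all updates below tolerance: converged
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny 3-vertex cycle: by symmetry every vertex keeps rank 1.0.
        int[][] g = {{1}, {2}, {0}};
        double[] r = run(g, 1e-9, 1000);
        System.out.println(Math.round(r[0] * 1000) / 1000.0); // 1.0
    }
}
```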
The algorithm first partitions the graph's edges into a number of partitions based on
random vertex partitioning. In each iteration, it updates the active vertices by joining the
vertices with the rank updates that are above the tolerance. It then starts a MapReduce
job that performs two steps. The first step is a per-task scan of the edges to calculate the
rank updates and pre-aggregate them within the task. The second, reduce step
aggregates all the partial rank updates to compute the final rank updates. High-level
pseudocode is shown below.
while (activeMessages > 0) {
  val newVerts = g.vertices.innerJoin(messages)
  g = g.outerJoinVertices(newVerts)
  messages = g.mapReduceTriplets(sendMsg, mergeMsg)
}
Design and Implementation
KMeans
We started by modifying the KMeans implementation provided by the machine learning
library (MLlib) in Spark, which follows the MapReduce pattern described above. In
particular, the implementation uses the mapPartitions() API, a
coarse-grained RDD operation that applies a user-defined function to each partition in
parallel. Each partition represents a slice of sample objects that needs processing. For
each partition, we specify a map task that computes the similarities and assigns each
object to its closest centroid.
Note that the original loop that iterates over the points is single-threaded. Since
calculating the similarity for one point is independent of calculating it for another, and the
order does not matter, we modified this particular loop to make use of multiple cores.
We changed the original point array to a Scala parallel collection, and created a
ForkJoinPool of a configurable size to house the concurrent worker threads. ForkJoinPool is
a Java concurrency library class that implements work-stealing, where idle workers
reduce idling by executing subtasks spawned by other workers.
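The same pattern can be shown directly in Java, where ForkJoinPool lives. This sketch uses a parallel stream submitted to a sized ForkJoinPool, a common idiom for confining parallel-stream work to a specific pool; the point data and the squaring function are hypothetical stand-ins for the per-point similarity computation:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class ParallelPointLoop {
    public static void main(String[] args) throws Exception {
        double[] points = new double[1000];
        for (int i = 0; i < points.length; i++) points[i] = i * 0.001;

        // A sized ForkJoinPool houses the worker threads; a parallel stream
        // submitted to the pool runs on those workers, which steal subtasks
        // from one another when idle.
        ForkJoinPool pool = new ForkJoinPool(4);
        double total = pool.submit(() ->
            IntStream.range(0, points.length).parallel()
                     .mapToDouble(i -> points[i] * points[i]) // stand-in for the per-point work
                     .sum()
        ).get();
        pool.shutdown();
        System.out.println(total > 0); // true
    }
}
```

Unlike this lock-free reduction, our actual modification accumulates into a shared data structure, so it guards the updates with an explicit lock, as described next.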
To ensure correctness, we used locks to synchronize accesses and updates to the
partial sums and the per-cluster point counts. We think using locks instead
of an accumulator is the better approach: since KMeans is compute-intensive,
contention is low and the locked code path accounts for a very small fraction of the
execution time. We compared the computed final cost against that of the
single-threaded version and obtained the same result across multiple runs.
The following code snippet outlines the structure of the modified code. Note that the
outer foreach() call is executed in parallel within a ForkJoinPool. The findClosest() call is
compute-intensive and takes up most of the execution time.
pointsArray.foreach { point => // executed on multiple cores in parallel
  (0 until runs).foreach { i =>
    val (bestCenter, cost) = KMeans.findClosest(thisActiveCenters(i), point) // compute-intensive
    writeLock.lock()
    // update sum and count
    writeLock.unlock()
  }
}
PageRank
(1) Design
Currently, GraphX utilizes multi-core systems by partitioning the graph into multiple
partitions and processing each partition in parallel, as shown in the code below from
GraphImpl.scala.