Terasort Using SAGA-MapReduce
Given by: Sharath Maddineni
CCT: Center for Computation & Technology
Why Terasort?
• Sorting large datasets is a common step in scientific computations.
• Google processes around 20 petabytes of data per day using MapReduce.
• Google may sort its huge datasets of web pages to make searching and retrieval faster.
Introduction
• Sort Benchmark (http://sortbenchmark.org/)
• Google won the 2010 competition; Yahoo's Hadoop won in 2009.
• However, Google's sorting is limited to the Google File System (GFS), and Yahoo's is tied to the Hadoop Distributed File System (HDFS).
• SAGA-MapReduce is infrastructure independent.
SAGA MapReduce Execution Overview
1. Start the master with an executable linked to SAGA-MapReduce; the master creates an advert directory.
2. The master uses the InputFormat specified in the JobDescription to chunk the input data.
3. The master spawns workers on the host machines specified in the configuration file, using the SAGA Job API.
4. Each worker writes its status information into the advert directory and communicates with the master through this advert service.
5. Workers process the chunks assigned by the master using Map() and partition the data according to the partition function.
6. When all chunks have been mapped, the master moves to the reduce phase.
7. In the reduce phase, the master assigns sets of partitions to idle workers to be reduced.
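The flow above can be sketched in a few lines of Python. This is only an in-process illustration of the chunk, map, partition, and reduce steps; it does not use the SAGA Job API or the advert service, and all function names (`chunk_input`, `map_chunk`, `reduce_partition`, `run_job`) and the word-count example are hypothetical, not taken from SAGA-MapReduce.

```python
from collections import defaultdict


def chunk_input(records, chunk_size):
    """Master step: split the input into fixed-size chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk, map_fn, partition_fn, num_partitions):
    """Worker step: apply map_fn to each record and bucket the output by partition."""
    buckets = defaultdict(list)
    for record in chunk:
        for key, value in map_fn(record):
            buckets[partition_fn(key, num_partitions)].append((key, value))
    return buckets


def reduce_partition(pairs, reduce_fn):
    """Worker step: group the values of one partition by key and reduce each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


def run_job(records, map_fn, reduce_fn, partition_fn, chunk_size=2, num_partitions=2):
    # Map phase: all chunks are mapped before the reduce phase starts.
    partitions = defaultdict(list)
    for chunk in chunk_input(records, chunk_size):
        mapped = map_chunk(chunk, map_fn, partition_fn, num_partitions)
        for partition_id, pairs in mapped.items():
            partitions[partition_id].extend(pairs)
    # Reduce phase: each partition is reduced on its own (here, sequentially).
    result = {}
    for partition_id in sorted(partitions):
        result.update(reduce_partition(partitions[partition_id], reduce_fn))
    return result


# Hypothetical word-count job used only to exercise the flow.
def word_count_map(line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(key, values):
    return sum(values)


def first_letter_partition(key, num_partitions):
    return ord(key[0]) % num_partitions


counts = run_job(["a b a", "c b", "a c c"],
                 word_count_map, word_count_reduce, first_letter_partition)
```

In the real system the chunks and partitions would be handed to separate worker processes; the sequential loops here only stand in for that scheduling.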
Terasort
• The Sort Benchmark provides a “Gensort” program to generate data records.
• Data format:
– Each record is 100 bytes of ASCII values: a 10-byte random key, and the rest is the value.
– 10^10 100-byte records make up a terabyte of data.
• All the records are sorted according to this 10-byte key.
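As a small illustration of the record layout, the sketch below splits a byte buffer into 100-byte records and sorts them by the leading 10-byte key. The helper names (`make_record`, `read_records`, `sort_records`) and the three hand-made records are hypothetical; they are not produced by Gensort.

```python
RECORD_SIZE = 100   # bytes per record, as stated on the slide
KEY_SIZE = 10       # leading bytes that form the sort key


def make_record(key: bytes, payload: bytes) -> bytes:
    """Build one 100-byte record (illustrative helper, not Gensort)."""
    value_size = RECORD_SIZE - KEY_SIZE
    return key.ljust(KEY_SIZE, b" ")[:KEY_SIZE] + payload.ljust(value_size, b".")[:value_size]


def read_records(data: bytes):
    """Split a buffer into consecutive 100-byte records."""
    if len(data) % RECORD_SIZE != 0:
        raise ValueError("input length is not a multiple of the record size")
    return [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]


def sort_records(records):
    """Sort records by their 10-byte key only."""
    return sorted(records, key=lambda record: record[:KEY_SIZE])


# Size check for the terabyte case: 10^10 records of 100 bytes each.
total_bytes = 10**10 * RECORD_SIZE

data = b"".join([
    make_record(b"ZZZZZZZZZZ", b"last"),
    make_record(b"AAAAAAAAAA", b"first"),
    make_record(b"MMMMMMMMMM", b"middle"),
])
sorted_keys = [record[:KEY_SIZE] for record in sort_records(read_records(data))]
```

Here `total_bytes` equals 10^12, which matches the one-terabyte figure on the slide.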
Terasort SAGA Map-Reduce
• Similar to SAGA-MapReduce, except that the partition list is generated before launching the master.
• The generated partition list ensures that each key emitted in the map phase goes into the partition covering its range.
• This spreads the keys evenly across all the partitions.
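One common way to build such a partition list is to sample the keys and pick evenly spaced split points; the slides do not say how the list is actually generated, so the sampling step below is an assumption. The function names (`build_partition_list`, `partition_for`) and the randomly generated keys are hypothetical.

```python
import bisect
import random


def build_partition_list(sample_keys, num_partitions):
    """Pick num_partitions - 1 split points from a sorted sample of keys."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Return the index of the range partition that the key falls into."""
    return bisect.bisect_right(split_points, key)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
keys = [bytes(rng.randrange(32, 127) for _ in range(10)) for _ in range(1000)]

num_partitions = 5
split_points = build_partition_list(rng.sample(keys, 100), num_partitions)

# Map side: route every key to the partition covering its range.
partitions = [[] for _ in range(num_partitions)]
for key in keys:
    partitions[partition_for(key, split_points)].append(key)

# Reduce side: sort each partition independently, then concatenate in order.
globally_sorted = [key for part in partitions for key in sorted(part)]
```

Because every key in partition i is smaller than every key in partition i + 1, concatenating the individually sorted partitions gives a globally sorted output without a final merge.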
Distributed Workers for Terasort
• Cyder and Cyd01 machines are used as workers
• Prerequisites:
– Passwordless SSH login from the master machine to the worker machines.
– FUSE mount of the input and output data locations on each machine.
Results
• X-axis: data set size in MB
• Y-axis: time to solution in seconds
• Increasing input data size
• Constant number of workers (3), with both master and workers on Cyd01
Operating System: Red Hat 5.5
Architecture: x86_64
Memory: 8 GB
CPU Type: Dual-Core AMD Opteron
Compiler Version: gcc 4.4.3
Boost Version: 1.40
Results cont…
• Constant input file size (400 MB, 6 chunks, 5 partitions)
• Increasing number of workers
• X-axis: number of workers
• Y-axis: time to solution in seconds
Operating System: Ubuntu 10.04
Architecture: x86_64 AMD
Memory: 63 GB
CPU Type: 6-Core AMD Opteron
Compiler Version: gcc 4.4.3
Boost Version: 1.40
Results cont…
• Distributed workers (2 workers, 1 chunk (10 MB), 5 partitions)
• Cyd01 and Cyder are used
Case 1: Master, worker, and data on the same machine
Case 2: Remote master; data and workers on the same machine
Case 3: Remote master; remote data for one worker and local data for one worker
Case 4: Remote master; remote data for all workers
• X-axis: cases
• Y-axis: time to solution in seconds
SAGA Map-Reduce Usability
• Usable by users who have some familiarity with C++ and SAGA and prior knowledge of MapReduce.
• Sufficiently documented. However, some important details about mounting the input and output locations for distributed computing were missing.
• Tested on:
– RHEL 4, 5 and Ubuntu 10.04
– SAGA 1.4.1 and 1.5
– Boost version 1.40
Future Work
• Currently, SAGA-MapReduce only supports launching workers by forking on localhost or through SSH.
• SAGA-BigJob can be used to launch the workers instead:
– Helps in running MapReduce distributed over LONI machines.
– But mounting directories is a problem over LONI.
Thank You