Map Reduce with Bash - An Example of the Unix Philosophy in Action

Dec 01, 2014

Jumping Bean

How www.JumpingBean.co.za used simple bash and Linux commands to build a map/reduce solution for a University's physics department that could be maintained by lecturers and students using existing skills.

A stunning example of how the architectural approach of Unix enables users to design simple solutions to their complex processes.

The presentation discusses some of the command-line utilities used, such as GNU parallel and xargs, as well as how cgroups, namespaces and Linux capabilities can be combined to create a bash-based map/reduce framework.
Transcript
Page 1: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Jumping Bean

Map Reduce With Bash (the power of the Unix philosophy)

Page 2: Map Reduce with Bash - An Example of the Unix Philosophy in Action

About Me

● Solutions integrator at Jumping Bean
– Developer & Trainer
– Technologies
  ● Java
  ● PHP
  ● HTML5/JavaScript
  ● Linux
– What I am planning to do:
  ● The Internet of Things

Page 3: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Map/Reduce with Bash

● Purpose of this presentation is:
– to demonstrate the power and flexibility of the Unix philosophy,
– to show what awesome solutions can be created using simple bash scripts and userland tools,
– to introduce some cool utilities and tools
● The purpose is not:
– to suggest that Map/Reduce is best done with bash
– bash was the best fit given the constraints (see the business problem)

Page 4: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Unix Philosophy

“is a set of cultural norms and philosophical approaches to developing small yet capable software” - Wikipedia

Page 5: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Unix Philosophy

“Early Unix developers were important in bringing the concepts of modularity and reusability into software engineering practice, spawning a 'software tools' movement” - Wikipedia

Page 6: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Business Problem

● Nuclear Engineering department needs to run Monte Carlo methods on data to calculate something to do with the core temperature of nuclear reactors :),
● Post-grad students need to run the analysis as part of their course work,
● Analysis can take days or weeks to run,
● University has invested in a 900-node cluster,
● Cluster is used for research when not used by students,
● Tool used for analysis is
– written in Fortran,
– single-threaded
● No money for a fancy-pants solution

Page 7: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Business Problem

● As-Is system
– Professor uses laptop and desktop,
– Manually starts application with simple script,
– Starts script x number of times, where x = number of cores,
– Waits for days,
– Manually checks progress,
– Not scalable to 900 nodes!
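
The as-is approach of starting one copy of the tool per core could look roughly like the sketch below. The analyse function is a hypothetical stand-in for the Fortran application, which the slides do not name; only the launch-and-wait pattern is the point.

```shell
#!/usr/bin/env bash
# analyse is a hypothetical stand-in for the single-threaded Fortran tool.
analyse() {
    echo "run $1 finished"
}

# Start one background copy per core, then block until every copy has exited.
start_per_core() {
    cores=${1:-$(nproc)}   # default: number of cores reported by nproc
    i=1
    while [ "$i" -le "$cores" ]; do
        analyse "$i" &
        i=$((i + 1))
    done
    wait
}

start_per_core 4
```

This works on one machine, but every start, progress check and result collection is still manual, which is what stops it scaling to 900 nodes.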

Page 8: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Business Problem

● Unknowns
– How is the 900-node cluster set up, i.e. is it using any cluster software or virtualisation?
  ● OpenStack?
  ● OpenNebula?
  ● KVM?
– Tools available to IT department, i.e. how they do deploys, monitoring, user management etc.
● Requirements
– Independence from IT department or experts for help,
– Student & lecturer IT skills are limited to Fortran & some bash scripting,
– Due to security concerns, prevent IT staff from gaining access to research
● Keep it simple – proof of concept

Page 9: Map Reduce with Bash - An Example of the Unix Philosophy in Action

What is Map/Reduce?

● Programming model for
– Processing and generating large datasets,
– Using a parallel, distributed algorithm,
– On a cluster or set of distributed nodes
● Popularised by Google and the advent of cloud computing
● Apache Hadoop – full-blown map/reduce framework. Used to analyse your social media data, “understand the customer” and by numerous agencies with 3-letter acronyms.
– “Really, we're only trying to help you know yourself better”

Page 10: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Map/Reduce Steps

● Map – Master node takes large dataset and distributes it to compute nodes to perform analysis on. The compute nodes return a result,

● Reduce – Gather the results of the compute nodes and aggregate results into final answer
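
The two steps can be tried on a single machine before any cluster is involved. The sketch below is illustrative only: the "analysis" is just a line count per chunk, and the chunk size and job count are arbitrary choices, not values from the project.

```shell
#!/usr/bin/env bash
# Map: count the lines of each chunk (a stand-in for the real analysis).
# Reduce: add the per-chunk counts into a single total.
mapreduce_demo() {
    local workdir
    workdir=$(mktemp -d)
    seq 1 1000 > "$workdir/input.txt"

    # Split the dataset into chunks of 250 lines: chunk_aa, chunk_ab, ...
    split -l 250 "$workdir/input.txt" "$workdir/chunk_"

    # Map: process up to 4 chunks at a time, one result line per chunk.
    # Reduce: sum the partial results.
    printf '%s\n' "$workdir"/chunk_* |
        xargs -n 1 -P 4 sh -c 'wc -l < "$1"' _ |
        awk '{ total += $1 } END { print total }'

    rm -rf "$workdir"
}

mapreduce_demo   # prints 1000
```

On the cluster the same shape applies, with the map step running on remote compute nodes instead of local processes.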

Page 11: Map Reduce with Bash - An Example of the Unix Philosophy in Action

What we need

● Controller node functions
– distribute data to nodes,
– execute calculation functions,
– collect results
● Management node functions
– distribute application and scripts to compute nodes
● Compute node functions
– scripts to run the single-threaded application in parallel on multi-core processors
● Security requirements
– prevent system administrators from gaining access to core application, scripts or data

Page 12: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Controller Functions

● How to distribute files to a node (map), execute calculations & gather results (reduce)?
– Use split to split input files,
– Use ssh to distribute files and execute processes
● How to do this on multiple (900) nodes?
– Use parallel ssh (pssh) and parallel scp
● Issues:
– Copying public key to 900 machines?
– Give each student their own account?
● Solution
– Set up LDAP authentication (password based), or
– Include controller node's root public key in compute node image and distribute secondary keys via scripts using pssh
– Fancy pants – Chef, Ansible
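
A controller script along these lines might look like the sketch below. It is not the project's actual script: the hosts file, the student account, the file names input.dat and run_analysis.sh, and the run/DRY_RUN helper are all assumptions made for illustration. By default it only prints the remote commands, so it can be tried without a cluster. The split -n option needs GNU split, and on Debian-based systems pssh is installed as parallel-ssh.

```shell
#!/usr/bin/env bash
# Assumptions (not from the slides): hosts file with one node per line,
# a "student" account on every node, and a compute-node script called
# run_analysis.sh. With DRY_RUN=1 (the default) commands are only printed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

controller() {
    local input=$1 hosts=$2 workdir=$3
    local nodes i=0 host chunk
    nodes=$(wc -l < "$hosts")

    # Map, step 1: one chunk per node, never splitting a line (GNU split).
    split -n "l/$nodes" -d -a 4 "$input" "$workdir/chunk_"

    # Map, step 2: send chunk i to node i. The host list is read on fd 3
    # so that ssh/scp cannot swallow it from stdin.
    while read -r host <&3; do
        chunk=$(printf '%s/chunk_%04d' "$workdir" "$i")
        run scp "$chunk" "student@$host:input.dat"
        i=$((i + 1))
    done 3< "$hosts"

    # Map, step 3: run the analysis everywhere; -t 0 disables the timeout
    # and -o stores each node's stdout in results/<host>.
    run pssh -h "$hosts" -l student -t 0 -o "$workdir/results" \
        './run_analysis.sh input.dat'

    # Reduce: combine the per-node outputs once they exist.
    if [ -d "$workdir/results" ]; then
        cat "$workdir"/results/* > "$workdir/combined.out"
    fi
}

# Usage: controller big_input.dat hosts.txt /tmp/work
```

Sending a different chunk to each node needs a per-host copy, which is why the sketch loops with scp; parallel scp suits files that are identical on every node, such as the scripts themselves.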

Page 13: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Management Node Functions

● Use parallel ssh to distribute scripts from management node to compute nodes,

● Using Ansible or Chef could be a next evolutionary step to automate system maintenance

Page 14: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Compute Node Functions

● Basically a bash script – how to parallelise a single-threaded application to use multiple cores on modern CPUs?
● xargs
– pass through list of input files,
– -n sets the number of arguments per invocation (1 = one input file per run)
– -P sets the number of processes to run in parallel
– Script waits for completion of processing & checks output
● GNU parallel
– Can run commands in parallel on 1 or more hosts
– More options for input placement ({}) and string replacement
– Can pass output as input to another process
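
A compute-node script built on xargs could be sketched as follows. TOOL is a hypothetical stand-in for the Fortran binary (here it simply upper-cases its input), and the .dat/.out naming is an assumption for the example.

```shell
#!/usr/bin/env bash
# TOOL is a hypothetical stand-in for the single-threaded Fortran binary:
# it reads one input file and writes <input>.out next to it.
TOOL='tr a-z A-Z < "$1" > "$1.out"'

run_all() {
    local dir=$1 jobs=${2:-$(nproc)}
    # -print0 / -0 keep unusual file names intact, -n 1 hands each process
    # exactly one input file, -P caps how many processes run at once.
    find "$dir" -name '*.dat' -print0 |
        xargs -0 -n 1 -P "$jobs" sh -c "$TOOL" _
    # xargs returns only after every process has exited; GNU xargs exits
    # with status 123 if any invocation failed.
}

# Report any input file that did not produce a non-empty output file.
check_outputs() {
    local dir=$1 f missing=0
    for f in "$dir"/*.dat; do
        [ -s "$f.out" ] || { echo "missing output for $f" >&2; missing=1; }
    done
    return "$missing"
}

# With GNU parallel the map step would be roughly (./analyse is hypothetical):
#   parallel -j "$(nproc)" ./analyse {} ::: inputs/*.dat
```

Typical use would be run_all inputs followed by check_outputs inputs, with the job count defaulting to the number of cores.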

Page 15: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Compute/Controller Node

● At end of compute node process, either the compute node pings the controller node, or
● Controller node waits for pssh to return before carrying out the next step, i.e. the reduce process, or starting the next script with the output of the 1st step as input to the 2nd,
● Check for errors and reschedule failed computes
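
Rescheduling failed computes can be as simple as a retry loop on the controller. In the sketch below, run_job is a hypothetical placeholder that fails once for one job to simulate a flaky node; on the real cluster it would wrap an ssh or pssh call and return its exit status. Job names are assumed to contain no whitespace.

```shell
#!/usr/bin/env bash
STATE_DIR=$(mktemp -d)

# Hypothetical job runner: chunk_b fails on its first attempt only.
run_job() {
    if [ "$1" = chunk_b ] && [ ! -e "$STATE_DIR/$1.tried" ]; then
        : > "$STATE_DIR/$1.tried"
        return 1
    fi
    echo "done $1"
}

# Run every job; collect the failures and try them again, up to max rounds.
run_with_retries() {
    local max=$1 attempt=1 pending failed job
    shift
    pending="$*"
    while [ -n "$pending" ] && [ "$attempt" -le "$max" ]; do
        failed=""
        for job in $pending; do
            run_job "$job" || failed="$failed $job"
        done
        pending=${failed# }
        attempt=$((attempt + 1))
    done
    [ -z "$pending" ]   # succeed only if nothing is left over
}

run_with_retries 3 chunk_a chunk_b chunk_c
```

The function returns non-zero if any job is still failing after the last round, so the controller can stop before the reduce step.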

Page 16: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Security

● Each student should have a separate account
– Linux multi-user system. User home directory for storing files and results
● Each user should be limited in resource usage
– Simple
  ● ulimit
  ● psacct
– Advanced
  ● cgroups
  ● namespaces
● Students can execute but not read the bash script file (special permissions)
– Use sudo, or
– Linux capabilities
  ● setcap – e.g. setcap "cap_kill=+ep" script.sh
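
For the simple option, ulimit can cap a single command from inside a wrapper script. The sketch below limits CPU time; run_limited is a name made up for this example, and the 1-second cap is only there to keep the demonstration short.

```shell
#!/usr/bin/env bash
# Run a command with a CPU-time cap. The limit is set inside a subshell,
# so it applies to that command and its children but not to the caller.
run_limited() {
    local cpu_seconds=$1
    shift
    (
        ulimit -t "$cpu_seconds"
        "$@"
    )
}

# A busy loop that would never end is killed after about 1 CPU second,
# and the wrapper reports a non-zero exit status.
run_limited 1 sh -c 'while :; do :; done'
echo "exit status: $?"
```

Limits set this way bind a single process tree; system-wide per-user limits, or the cgroups option from the slide, are the usual next step when students must not be able to bypass the wrapper.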

Page 17: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Security

● Limit the root user
– Linux capabilities
  ● setcap, capsh, pscap
  ● Disable root account – grant CAP_SYS_ADMIN as needed,
  ● /etc/security/capability.conf

Page 18: Map Reduce with Bash - An Example of the Unix Philosophy in Action

Resources

● Parallel SSH,
● Xargs,
● GNU parallel,
● cgroups,
● namespaces,
● Linux capabilities

● Twitter - @mxc4
● Gplus – Mark Clarke
● Jumping Bean
● Cyber Connect
● Jozi Linux User Group
● Jozi Java User Group
● Maker Labs