Distributed Computing Case Study

Introduction to Distributed Computing

Feb 13, 2017

phamkhue
Page 1: Introduction to Distributed Computing

Distributed Computing – Case Study

Outline

• What is distributed computing

• Case study

– Hadoop – HDFS and map reduce

– Gluster File System

What is Distributed Computing/System?

• Distributed computing

– A field of computer science that studies distributed systems.

– The use of distributed systems to solve computational problems.

• Distributed system

– Wikipedia

• There are several autonomous computational entities, each of which has its own local memory.

• The entities communicate with each other by message passing.

– Operating System Concepts

• The processors communicate with one another through various communication lines, such as high-speed buses or telephone lines.

• Each processor has its own local memory.
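
Both definitions rest on the same idea: every entity keeps its own local memory and cooperates only by exchanging messages. A minimal single-machine sketch of that idea follows. The `Node` class, its `inbox` queue and the `("set", key, value)` message shape are illustrative assumptions, and threads stand in for separate machines.

```python
import queue
import threading


class Node:
    """Simulated node: private local memory plus an inbox for messages."""

    def __init__(self, name):
        self.name = name
        self.local_memory = {}      # only this node reads or writes it
        self.inbox = queue.Queue()  # other nodes reach us only through here

    def send(self, target, message):
        # Message passing: hand the message to the target's inbox.
        target.inbox.put((self.name, message))

    def run(self):
        # Handle messages until the "stop" message arrives.
        while True:
            sender, message = self.inbox.get()
            if message == "stop":
                break
            op, key, value = message
            if op == "set":
                self.local_memory[key] = (value, sender)


def demo():
    a, b = Node("a"), Node("b")
    worker = threading.Thread(target=b.run)
    worker.start()
    a.send(b, ("set", "x", 42))
    a.send(b, "stop")
    worker.join()
    return a.local_memory, b.local_memory
```

Calling `demo()` leaves node `a`'s memory empty while node `b` records the value together with its sender, so the only coupling between the two nodes is the message itself.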

What is Distributed Computing/System?

• Distributed program

– A computer program that runs in a distributed system

• Distributed programming

– The process of writing distributed programs

What is Distributed Computing/System?

• Common properties

– Fault tolerance

• When one or more nodes fail, the whole system keeps working, although performance may degrade.

• The status of each node needs to be checked.

– Each node plays a partial role

• Each computer has only a limited, incomplete view of the system and may know only one part of the input.

– Resource sharing

• Users share the computing power and storage resources of the system with one another.

– Load sharing

• Dispatching tasks across the nodes spreads the load over the whole system.

– Easy to expand

• Adding nodes should take little time, ideally none.
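
Fault tolerance, load sharing and easy expansion can be illustrated together with a toy dispatcher that skips failed nodes and sends each task to the least-loaded live node. This is only a sketch: the `Cluster` class and its method names are hypothetical, and a real system would detect failures through status checks rather than an explicit `mark_failed` call.

```python
class Cluster:
    """Toy dispatcher: tracks which nodes are alive and how loaded they are."""

    def __init__(self, node_names):
        self.alive = {name: True for name in node_names}
        self.load = {name: 0 for name in node_names}

    def mark_failed(self, name):
        # In a real system this would follow from a failed status check.
        self.alive[name] = False

    def add_node(self, name):
        # Easy to expand: a new node only needs to be registered.
        self.alive[name] = True
        self.load[name] = 0

    def dispatch(self, task_cost=1):
        # Load sharing: choose the live node with the smallest load
        # (ties are broken by node name to keep the result deterministic).
        candidates = [n for n in self.load if self.alive[n]]
        if not candidates:
            raise RuntimeError("no live nodes left")
        target = min(candidates, key=lambda n: (self.load[n], n))
        self.load[target] += task_cost
        return target
```

After `mark_failed("n2")` the dispatcher keeps working with the remaining nodes, only with less total capacity, which matches the "still works, except for performance" property.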

CASE STUDY - HADOOP

Quick overview

• Features

• HDFS

• Map-Reduce Framework

Features

• Large files

– Gigabytes, Terabytes

• Write once, read many

• Commodity Hardware

HDFS

• NameNode

– Manages the file system namespace and regulates access to files by clients.

– Determines the mapping of blocks to DataNodes.

– Maintains the fsImage and the editLog.

• DataNode

– Manages the storage attached to the node it runs on.

– Stores CRC checksums.

– Sends heartbeats to the NameNode.

– Each file is split into chunks, and each chunk is stored on some DataNodes.
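
The division of labour above can be sketched in a few lines: a file is cut into chunks, each chunk gets a CRC checksum, and a NameNode-style table records which DataNodes hold each chunk. Everything here is an assumption made for illustration: the function names, the tiny block size, the replication factor of 2 and the round-robin placement. It does not model the placement policy HDFS actually uses.

```python
import zlib


def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, datanodes, replication=2):
    """Return a NameNode-style map: block index -> list of DataNode names.

    Assumes replication <= len(datanodes), so replicas land on distinct nodes.
    """
    block_map = {}
    for index in range(len(blocks)):
        # Round-robin placement, purely for illustration.
        block_map[index] = [
            datanodes[(index + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map


def checksums(blocks):
    """CRC32 per block, so a reader can detect a corrupted block."""
    return [zlib.crc32(block) for block in blocks]
```

For example, `split_into_blocks(b"hello distributed world", 8)` yields three blocks, and `place_blocks` over three DataNodes puts block 0 on the first two nodes, block 1 on the last two, and block 2 on the third and first.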

HDFS

• Secondary NameNode

– Responsible for merging the fsImage and the editLog

– Not a NameNode, despite its name

HDFS architecture

Secondary namenode

• Edit log

– Transaction log

• The transaction log is updated before the content in memory is updated.

• The file is updated every time a request that changes the file system reaches the NameNode.

• fsImage

– Persistent checkpoint

• Secondary NameNode

– Responsible for merging the editLog and the fsImage.
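
The interplay of edit log, fsImage and checkpointing can be sketched as follows. This is a toy model under stated assumptions: `ToyNameNode`, `replay` and `checkpoint` are hypothetical names, and plain Python objects stand in for the on-disk files.

```python
class ToyNameNode:
    """Keeps a namespace in memory, backed by a checkpoint and an edit log."""

    def __init__(self):
        self.fs_image = {}   # persistent checkpoint (a dict stands in for a file)
        self.edit_log = []   # transactions recorded since the checkpoint
        self.namespace = {}  # in-memory state served to clients

    def create(self, path, blocks):
        # Write-ahead: record the transaction first, then update memory.
        self.edit_log.append(("create", path, blocks))
        self.namespace[path] = blocks

    def delete(self, path):
        self.edit_log.append(("delete", path, None))
        self.namespace.pop(path, None)


def replay(image, edit_log):
    """Apply logged transactions to a copy of the checkpoint."""
    state = dict(image)
    for op, path, blocks in edit_log:
        if op == "create":
            state[path] = blocks
        elif op == "delete":
            state.pop(path, None)
    return state


def checkpoint(namenode):
    """The merge step: fold the edit log into a new fsImage, then clear it."""
    namenode.fs_image = replay(namenode.fs_image, namenode.edit_log)
    namenode.edit_log = []
```

At any moment `replay(fs_image, edit_log)` reproduces the in-memory namespace, which is why the pair is enough to recover after a restart. Checkpointing only keeps the log short.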

Secondary namenode

From Hadoop - The Definitive Guide

Map-Reduce Framework

• JobTracker

– Responsible for dispatching jobs to the TaskTrackers.

– Handles job management, such as removing and scheduling jobs.

• TaskTracker

– Responsible for executing jobs. A TaskTracker usually launches another JVM to execute the job.

Map-Reduce Framework

From Hadoop - The Definitive Guide

Summary - Hadoop

• Hadoop provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.

• Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
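
The Map/Reduce paradigm is easiest to see on the classic word-count example. The sketch below runs in a single Python process and does not use the Hadoop API. The function names are illustrative, and the "fragments" are simply strings in a list.

```python
from collections import defaultdict


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Group all values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment is independent, so a real cluster could run (or re-run)
    # its map task on any node; here they simply run one after another.
    pairs = []
    for fragment in fragments:
        pairs.extend(map_phase(fragment))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because `map_phase` depends only on its own fragment, re-executing it after a node failure gives the same pairs, which is what makes re-execution on any node safe.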

CASE STUDY – GLUSTER FILESYSTEM

Quick overview

• Introduction

• Gluster File system design

• Example: 4-node GlusterFS

GlusterFS Cluster File System

Introduction

• GlusterFS is an open source clustered file system that runs on industry-standard hardware from any vendor and delivers multiple times the scalability and performance of conventional storage at a fraction of the cost.

[Figure: storage bricks added together (+) yield N x Performance & Capacity]

GlusterFS Overview

From GlusterFS Datasheet

GlusterFS Design

[Figure: GlusterFS clustered filesystem on the x86-64 platform. A cluster of storage clients (supercomputer, data center), each running a GlusterFS client with a clustered volume manager and a clustered I/O scheduler, connects over InfiniBand RDMA or TCP/IP (GigE) to Storage Bricks 1 to 4, each running GLFSD and exporting a GlusterFS volume. Storage gateways run a GlusterFS client and re-export the volume through NFS/Samba over TCP/IP for compatibility with MS Windows and other Unices.]

From http://www.gluster.org/

Key Design Considerations

• Capacity Scaling

– Scalable beyond Petabytes

• I/O Throughput Scaling

– Pluggable Clustered I/O Schedulers

– Advantage of RDMA transport

• Reliability

– Non-Stop Storage

• Ease of Manageability

– Self Heal

– NFS-like Disk Layout

• Elegance in Design

– Stackable Modules

– Not tied to I/O Profiles or Hardware or OS

Translators

• Performance translators

1. Read Ahead

2. Write Behind

3. Threaded I/O

4. IO-Cache

• Clustering translators

1. Automatic File Replication (AFR)

2. Stripe

3. Unify

• Scheduling translators

1. Adaptive Least Usage (ALU)

2. Non-uniform filesystem architecture (NUFA)

3. Random

4. Round-Robin
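
A scheduling translator decides which brick receives a new file. Two of the policies listed above can be sketched as follows. This is not GlusterFS code: the function names are hypothetical, and the least-usage rule is reduced to "pick the brick with the least used space", which is a simplification chosen for this sketch rather than the actual ALU algorithm.

```python
import itertools


def round_robin_scheduler(bricks):
    """Return a function that hands out bricks in rotating order."""
    cycle = itertools.cycle(bricks)
    return lambda: next(cycle)


def least_usage_scheduler(used_bytes):
    """Return a function that picks the brick with the least used space.

    used_bytes maps brick name -> bytes used; the caller updates it as
    files are written. Ties are broken by brick name.
    """
    def pick():
        return min(used_bytes, key=lambda brick: (used_bytes[brick], brick))
    return pick
```

Round-robin ignores how full the bricks are, while the least-usage rule reacts to it: once a brick fills up, the next call to `pick()` moves on to a different brick.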

FUSE

• What’s FUSE?

• Stands for “File system in USErspace”

• Makes it easy to write new filesystems

1. without knowing how the kernel works

2. without breaking unrelated things

3. more quickly/easily than traditional file systems built as a kernel module

FUSE structure

From http://fuse.sourceforge.net/

How FUSE Works

• An application makes a file-related syscall

• The kernel figures out that the file is in a mounted FUSE filesystem

• The FUSE kernel module forwards the request to your userspace FUSE app

• Your app tells FUSE how to reply
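
The four steps above can be imitated with a toy model in plain Python. It does not use the real FUSE API: `ToyKernel`, `HelloFS`, `mount`, `syscall` and `handle` are hypothetical names that only mirror the flow of a request from the application, through the kernel, to the userspace app and back.

```python
class HelloFS:
    """Stand-in for a userspace FUSE application serving one read-only file."""

    files = {"/hello.txt": b"hello from userspace\n"}

    def handle(self, op, path):
        # Step 4: the app decides how the request is answered.
        if path not in self.files:
            return ("error", "ENOENT")
        if op == "read":
            return ("ok", self.files[path])
        if op == "getattr":
            return ("ok", {"size": len(self.files[path])})
        return ("error", "ENOSYS")


class ToyKernel:
    """Stand-in for the kernel side: a mount table and a forwarder."""

    def __init__(self):
        self.mounts = {}

    def mount(self, mount_point, app):
        self.mounts[mount_point.rstrip("/")] = app

    def syscall(self, op, path):
        # Step 1 is the caller invoking this method.
        # Step 2: find out whether the path is inside a mounted filesystem.
        for mount_point, app in self.mounts.items():
            if path.startswith(mount_point + "/"):
                # Step 3: forward the request, with the path made relative.
                return app.handle(op, path[len(mount_point):])
        return ("error", "not a toy-FUSE path")
```

After `kernel.mount("/mnt/hello", HelloFS())`, a call such as `kernel.syscall("read", "/mnt/hello/hello.txt")` is routed to `HelloFS.handle`, while paths outside the mount point never reach the app.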

Example: 4-node GlusterFS

[Figure: storage virtualization with GlusterFS (AFR + Unify), about 1.8 TB in total. Four GlusterFS servers, connected over TCP/IP on GigE, each export a POSIX brick on a local file system: vlab01 (Ext4), vlab02 (Ext3), vlab03 (XFS) and vlab04 (Ext3). Together they present a global namespace at /mnt/glusterfs, which is used by virtual machines (XEN + KVM), a web application, MySQL and users.]

The view of GlusterFS client

• $ df -h

Filesystem                    Size  Used Avail Use% Mounted on
/dev/sda1                     901G  115G  740G  14% /
tmpfs                         4.0G     0  4.0G   0% /dev/shm
/etc/glusterfs/glusterfs.vol  1.8T  243G  1.6T  13% /mnt/glusterfs

The view of GlusterFS server

[Figure: file layout on three bricks (BRICK1, BRICK2, BRICK3) for each volume type]

• Unify Volume: the files are spread over the bricks (benchmark.pdf, test.ogg, initcore.c, mylogo.xcf, driver.c, ether.c, test.m4a, work.ods, corporate.odp).

• Mirror Volume: every brick holds accounts-2007.db, backup.db.zip and accounts-2006.db.

• Stripe Volume: north-pole-map, dvd-1.iso and xen-image each appear on every brick.
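
The three layouts in this figure differ only in how files are mapped to bricks, which a short sketch can make concrete. The functions below are illustrative assumptions, not GlusterFS code: `unify` places files round-robin in name order, and `stripe` uses a tiny `stripe_size` so the effect is visible on short byte strings.

```python
def unify(files, bricks):
    """Unify: every file lives whole on exactly one brick."""
    layout = {brick: {} for brick in bricks}
    for i, name in enumerate(sorted(files)):
        layout[bricks[i % len(bricks)]][name] = files[name]  # round-robin
    return layout


def mirror(files, bricks):
    """Mirror (AFR): every brick holds a full copy of every file."""
    return {brick: dict(files) for brick in bricks}


def stripe(files, bricks, stripe_size=4):
    """Stripe: every file is cut into pieces dealt out across the bricks."""
    layout = {brick: {} for brick in bricks}
    for name, data in files.items():
        for offset in range(0, len(data), stripe_size):
            brick = bricks[(offset // stripe_size) % len(bricks)]
            piece = data[offset:offset + stripe_size]
            layout[brick].setdefault(name, []).append(piece)
    return layout
```

With three bricks, `unify` gives each brick a different subset of files, `mirror` gives each brick the same complete set, and `stripe` gives each brick a part of every file that is large enough to span all bricks.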

Summary - GlusterFS

• GlusterFS clusters together storage building blocks, aggregating disk and memory resources and managing your data in a single global namespace.

• GlusterFS is based on a stackable architecture that can be optimized for specific application profiles with simple plug-in modules, optimizing performance for a wide range of workloads.

References

• http://en.wikipedia.org/wiki/Message_passing

• http://en.wikipedia.org/wiki/Distributed_computing

• http://en.wikipedia.org/wiki/Filesystem_in_Userspace

• http://en.wikipedia.org/wiki/Distributed_file_system

• http://hadoop.apache.org/

• Tom White, Hadoop: The Definitive Guide

• Silberschatz and Galvin, Operating System Concepts

• http://www.gluster.org/

• http://www.zresearch.com/

• http://fuse.sourceforge.net/