Page 1

Cloud Computing Lecture #1

Parallel and Distributed Computing

Jimmy Lin, The iSchool, University of Maryland

Monday, January 28, 2008

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under a Creative Commons Attribution 3.0 License)

Page 2

Today’s Topics

Course overview

Introduction to parallel and distributed processing

Page 3

What’s the course about?

Integration of research and teaching

Team leaders get help on a tough problem
Team members gain valuable experience

Criteria for success (at the end of the course):
Each team will have a publishable result
Each team will have a paper suitable for submission to an appropriate conference/journal

Along the way:
Build a community of hadoopers at Maryland
Generate lots of publicity
Have lots of fun!

Page 4

Hadoop Zen

Don’t get frustrated (take a deep breath)…

This is bleeding edge technology

Be patient… This is the first time I’ve taught this course

Be flexible… Lots of uncertainty along the way

Be constructive… Tell me how I can make everyone’s experience better

Page 5

Things to go over…

Course schedule

Course objectives

Assignments and deliverables

Evaluation

Page 6

My Role

To hack alongside everyone

To substantively contribute ideas where appropriate

To serve as a facilitator and a resource

To make sure everything runs smoothly

Page 7

Outline

Web-Scale Problems

Parallel vs. Distributed Computing

Flynn's Taxonomy

Programming Patterns

Page 8

Web-Scale Problems

Characteristics:

Lots of data
Lots of crunching (not necessarily complex itself)

Examples:
Obviously, problems involving the Web
Empirical and data-driven research (e.g., in HLT)
“Post-genomics era” in life sciences
High-quality animation
The serious hobbyist

Page 9

It all boils down to…

Divide-and-conquer

Throwing more hardware at the problem

Simple to understand… a lifetime to master…

Page 10

Parallel vs. Distributed

Parallel computing generally means:

Vector processing of data
Multiple CPUs in a single computer

Distributed computing generally means:
Multiple CPUs across many computers

Page 11

Flynn’s Taxonomy

Classifies architectures by instruction streams (single vs. multiple) and data streams (single vs. multiple):

SISD (Single Instruction, Single Data): single-threaded process
SIMD (Single Instruction, Multiple Data): vector processing
MISD (Multiple Instruction, Single Data): pipeline architecture
MIMD (Multiple Instruction, Multiple Data): multi-threaded programming

Page 12

SISD

[Diagram: one processor executing a single instruction stream over a single stream of data items]

Page 13

SIMD

[Diagram: one processor applying a single instruction stream to many data elements (D0 through Dn) in lockstep]
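A minimal Java sketch of the SIMD idea (my own example, not from the original slides; the class and array names are made up): the same operation applied uniformly to every data element, which is the kind of loop a SIMD unit or vectorizing compiler can execute across many elements (D0..Dn) at once.

// Sketch only: an element-wise loop with one operation applied to every element.
public class SimdSketch {
    public static void main(String[] args) {
        float[] a = new float[1024], b = new float[1024], c = new float[1024];
        // A SIMD unit can apply this single add instruction to many
        // data elements (D0..Dn) per step.
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
        System.out.println("c[0] = " + c[0]);
    }
}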

Page 14

MIMD

[Diagram: multiple processors, each executing its own instruction stream over its own stream of data items]
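A minimal sketch (my own example, not from the slides; names and data are illustrative): two threads, each running a different instruction stream over its own data, which is MIMD in miniature.

// Sketch only: two independent instruction streams over independent data.
public class MimdSketch {
    public static void main(String[] args) throws InterruptedException {
        int[] evens = {2, 4, 6};
        int[] odds = {1, 3, 5};

        Thread summer = new Thread(() -> {
            int sum = 0;
            for (int x : evens) sum += x;        // one instruction stream: summing
            System.out.println("sum of evens = " + sum);
        });
        Thread multiplier = new Thread(() -> {
            int product = 1;
            for (int x : odds) product *= x;     // a different instruction stream: multiplying
            System.out.println("product of odds = " + product);
        });

        summer.start(); multiplier.start();
        summer.join(); multiplier.join();
    }
}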

Page 15

Parallel vs. Distributed

Parallel: multiple CPUs within a shared-memory machine

Distributed: multiple machines, each with its own memory, connected over a network

[Diagram: parallel processors sharing one memory, vs. distributed processors with separate memories using a network connection for data transfer]

Page 16

Divide and Conquer

[Diagram: the “Work” is partitioned into units w1, w2, w3; each unit goes to a “worker”; the partial results r1, r2, r3 are combined into the final “Result”]
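A minimal Java sketch of the partition/combine cycle above (my own example, not from the slides; the pool size, data, and names are illustrative): the “work” is summing an array, which is partitioned into chunks handed to workers, whose partial results are then combined.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class DivideAndConquerSketch {
    public static void main(String[] args) throws Exception {
        long[] data = new long[900];
        for (int i = 0; i < data.length; i++) data[i] = i;

        int workers = 3;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> partials = new ArrayList<>();

        int chunk = data.length / workers;
        for (int w = 0; w < workers; w++) {                 // Partition the work
            final int lo = w * chunk;
            final int hi = (w == workers - 1) ? data.length : lo + chunk;
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (int i = lo; i < hi; i++) sum += data[i];
                return sum;                                 // partial result from one worker
            }));
        }

        long result = 0;
        for (Future<Long> r : partials) result += r.get();  // Combine partial results
        pool.shutdown();
        System.out.println("Result: " + result);
    }
}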

Page 17

Different Workers

Different threads in the same core

Different cores in the same CPU

Different CPUs in a multi-processor system

Different machines in a distributed system

Page 18

Parallelization Problems

How do we assign work units to workers?

What if we have more work units than workers?

What if workers need to share partial results?

How do we aggregate partial results?

How do we know all the workers have finished?

What if workers die?

What is the common theme of all of these problems?

Page 19

General Theme?

Parallelization problems arise from:

Communication between workers
Access to shared resources (e.g., data)

Thus, we need a synchronization system!

This is tricky:
Finding bugs is hard
Solving bugs is even harder
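To make the shared-resource problem concrete, here is a minimal Java sketch (my own example, not from the slides): two workers increment one shared counter. Without the synchronized block the increments can interleave and be lost; with it, the result is always 2,000,000.

public class SharedCounterSketch {
    private static long counter = 0;
    private static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                synchronized (lock) {   // without this lock, increments can be lost
                    counter++;
                }
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("counter = " + counter);
    }
}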

Page 20

Multi-Threaded Programming

Difficult because:

Don’t know the order in which threads run
Don’t know when threads interrupt each other

Thus, we need:
Semaphores (lock, unlock)
Condition variables (wait, notify, broadcast)
Barriers

Still, lots of problems:
Deadlock, livelock
Race conditions
...

Moral of the story: be careful!
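A minimal Java sketch (my own example, not from the slides) of two of the primitives listed above, using java.util.concurrent: a Semaphore used as a lock/unlock mechanism and a CyclicBarrier that makes all workers wait for one another before continuing.

import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.Semaphore;

public class PrimitivesSketch {
    public static void main(String[] args) {
        int workers = 3;
        Semaphore lock = new Semaphore(1);                 // acquire = lock, release = unlock
        CyclicBarrier barrier = new CyclicBarrier(workers,
                () -> System.out.println("all workers reached the barrier"));

        for (int id = 0; id < workers; id++) {
            final int worker = id;
            new Thread(() -> {
                try {
                    lock.acquire();                        // critical section begins
                    System.out.println("worker " + worker + " holds the lock");
                    lock.release();                        // critical section ends
                    barrier.await();                       // wait for the other workers
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}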

Page 21

Patterns for Parallelism

Several programming methodologies exist to build parallelism into programs

Here are some…

Page 22

Master/Workers

The master initially owns all data

The master creates workers and assigns tasks

The master waits for workers to report back

[Diagram: one master assigning tasks to a set of workers]
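A minimal master/workers sketch in Java (my own example, not from the slides; the items and pool size are illustrative): the master owns the data, hands one task per item to a pool of workers, and waits for every worker to report back.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class MasterWorkersSketch {
    public static void main(String[] args) throws Exception {
        String[] items = {"alpha", "beta", "gamma", "delta"};   // master owns all data

        ExecutorService workers = Executors.newFixedThreadPool(2);
        List<Future<Integer>> reports = new ArrayList<>();
        for (String item : items) {
            reports.add(workers.submit(() -> item.length()));   // master assigns a task
        }

        int total = 0;
        for (Future<Integer> report : reports) {
            total += report.get();                              // master waits for workers to report back
        }
        workers.shutdown();
        System.out.println("total length = " + total);
    }
}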

Page 23

Producer/Consumer Flow

Producers create work items

Consumers process them

Can be daisy-chained

[Diagram: producers (P) feeding consumers (C), with two such stages daisy-chained together]
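A minimal Java sketch of the flow (my own example, not from the slides; the sentinel value and names are illustrative): the producer creates work items on a BlockingQueue and the consumer processes them. Daisy-chaining would simply have each consumer put its output on a second queue feeding the next stage.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerSketch {
    public static void main(String[] args) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(10);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) queue.put(i);   // create work items
                queue.put(-1);                              // sentinel: no more work
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int item = queue.take(); item != -1; item = queue.take()) {
                    System.out.println("processed item " + item);   // process the work item
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start();
        consumer.start();
    }
}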

Page 24

Work Queues

Any available consumer should be able to process data from any producer

Work queues decouple producers from consumers, removing the 1:1 pairing between them

[Diagram: several producers (P) placing items on a shared queue, from which any available consumer (C) takes work]
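A minimal Java sketch of a shared work queue (my own example, not from the slides; counts and names are illustrative): several producers put items on one queue, and whichever consumer is free takes the next item, with no fixed pairing between producers and consumers.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class WorkQueueSketch {
    public static void main(String[] args) {
        BlockingQueue<String> sharedQueue = new LinkedBlockingQueue<>();

        for (int p = 0; p < 3; p++) {                       // three producers
            final int producer = p;
            new Thread(() -> {
                for (int i = 0; i < 4; i++) {
                    sharedQueue.add("item " + i + " from producer " + producer);
                }
            }).start();
        }

        for (int c = 0; c < 2; c++) {                       // two consumers share the queue
            final int consumer = c;
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        String item = sharedQueue.take();   // whichever consumer is free takes it
                        System.out.println("consumer " + consumer + " handled " + item);
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            worker.setDaemon(true);                         // let this demo exit when main ends
            worker.start();
        }

        try { Thread.sleep(500); } catch (InterruptedException ignored) {}  // give consumers time to drain
    }
}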

Page 25

And finally…

The above solutions represent general patterns

In reality:
Lots of one-off solutions, custom code
Burden on the programmer to manage everything

Can we push the complexity onto the system?

MapReduce… for next time