Public Computing - Challenges and Solutions
Yi Pan
Professor and Chair of CS, Professor of CIS
Georgia State University, Atlanta, Georgia, USA
AINA 2007, May 21, 2007
Outline
What is Grid Computing?
Virtual Organizations
Types of Grids
Grid Components
Applications
Grid Issues
Conclusions
Public Computing and the BOINC Architecture
Motivation for New Scheduling Strategies
Scheduling Algorithms
Testing Environment and Experiments
MD4 Password Hash Search
Avalanche Photodiode Gain and Impulse Response
Gene Sequence Alignment
Peer to Peer Model and Experiments
Conclusion and Future Research
What is Grid Computing?
Analogy is to the power grid
Heterogeneous and geographically dispersed
Standards allow for transportation of power
Standards define the interface with the grid
Non-trivial overhead of managing movement and storage of power
Economies of scale compensate for this overhead, allowing for cheap, accessible power
[Diagram: hydroelectric, gas, and coal plants delivering power over the grid to a customer]
A Computational “Power Grid”
Goal is to make computation a utility
Computational power, data services, and peripherals (graphics accelerators, particle colliders) are provided in a heterogeneous, geographically dispersed way
Standards allow for transportation of these services
Standards define the interface with the grid
Architecture provides for management of resources and controlling access
Large amounts of computing power should be accessible from anywhere in the grid
[Diagram: a supercomputer, cluster, and workstations delivering services over the Internet to a customer]
Virtual Organizations
Independent organizations come together to pool grid resources
Component organizations could be different research institutions, departments within a company, individuals donating computing time, or anything with resources
Formation of the VO should define participation levels, resources provided, expectations of resource use, accountability, and economic issues such as charges for resources
Goal is to allow users to exploit resources throughout the VO transparently and efficiently
Types of Grids
Computational Grid Data Grid Scavenging Grid Peer-to-Peer Public Computing
Computational Grids
Traditionally used to connect high performance computers between organizations
Increases utilization of geographically dispersed computational resources
Provides more parallel computational power to individual applications than is feasible for a single organization
Most traditional grid projects concentrate on these types of grids
Globus and OGSA
Data Grids
Distributed data sources
Queries of distributed data
Sharing of storage and data management resources
The D0 Particle Physics Data Grid allows access to both compute and data resources for huge amounts of physics data
Scavenging Grids
Harness idle cycles on systems, especially user workstations
Parallel applications must be quite granular to take advantage of large amounts of weak computing power
Grid system must support terminating and restarting work when systems cease idling
Condor system from University of Wisconsin
Peer-to-Peer
Converging technology with traditional grids
Contrasts with grids by having little infrastructure and high fault tolerance
Highly scalable for participation, but difficult to locate and monitor resources
Current P2P systems like Gnutella, Freenet, and FastTrack concentrate on data services
Public Computing
Also converging with grid computing
Often communicates through a central server, in contrast with peer-to-peer technologies
Again, scalable with participation
Adds even greater impact of multiple administrative domains, as participants are often untrusted and unaccountable
Public Computing Examples
SETI@Home (http://setiathome.ssl.berkeley.edu/) – Search for Extraterrestrial Intelligence in radio telescope data (UC Berkeley); a distributed network computation searching for extraterrestrial civilizations
Has more than 5 million participants
“The most powerful computer, IBM's ASCI White, is rated at 12 TeraFLOPS and costs $110 million. SETI@home currently gets about 15 TeraFLOPs and has cost $500K so far.”
More Public Computing Examples
Folding@Home project (http://folding.stanford.edu) for molecular simulation aimed at new drug discovery
Distributed.net (http://distributed.net) for cracking RC5 64-bit encryption algorithm – used more than 300,000 nodes over 1757 days
Grid Components
Authentication and Authorization Resource Information Service Monitoring Scheduler Fault Tolerance Communication Infrastructure
Authentication and Authorization
Important for allowing users to cross the administrative boundaries in a virtual organization
System security for jobs outside the administrative domain is currently rudimentary
Work being done on sandboxing, better job control, development environments
[Diagram: a user authenticating through A&A servers to reach an HPC system and a cluster]
Resource Information Service
Used in resource discovery
Leverages existing technologies such as LDAP, UDDI
Information service must be able to report very current availability and load data
Balanced with overhead of updating data
[Diagram: a user querying Grid Information Service (GIS) servers for an HPC system and a cluster]
Monitoring
Raw performance characteristics are not the only measurement of resource performance
Current and expected loads can have a tremendous impact
Balance between accurate performance data and additional overhead of monitoring systems and tracking that data
Scheduler
Owners of systems interested in maximizing throughput
Users interested in maximizing runtime performance
Both offer challenges with crossing administrative boundaries
Unique issues such as co-allocation and co-location
Interesting work being done in scheduling, such as market-based scheduling
Fault Tolerance
More work exploring fault tolerance in grid systems leveraging peer-to-peer and public computing research
Multiple administrative domains in VO challenge the reliability of resources
Faults can refer not only to resource failure but violation of service level agreements (SLA)
Impact on fault tolerance if there is no accountability for failure
[Cartoon, four panels: an expensive HPC system sits idle while a scavenging cluster cheaply computes a molecular modeling job; half the cluster's workstations stop idling and the job stalls; the user's deadline and budget are blown; in the end the job would have been cheaper on the supercomputer, which "wins again"]
Communication Infrastructure
Currently most grids have robust communication infrastructure
As more grids are deployed and used, more concentration must be placed on network QoS and reservation
Most large applications are currently data rich
P2P and public computing have experience in communication-poor environments
Applications
Embarrassingly parallel, data poor applications in the case of pooling large amounts of weak computing power
Huge data-intensive, data rich applications that can take advantage of multiple, parallel supercomputers
Application specific grids like Cactus and Nimrod
Grid Issues
Site autonomy Heterogeneous resources Co-allocation Metrics for resource allocation Language for utilizing grids Reliability
Site autonomy
Each component of the grid could be administered by an individual organization participating in the VO
Each administrative domain has its own policies and procedures surrounding their resources
Most scheduling and resource management work must be distributed to support this
Heterogeneous resources
Grid resources will have not only heterogeneous platforms but heterogeneous workloads
Applications truly exploiting grid resources will need to scale from idle cycles on workstations, huge vector based HPCs, to clusters
Not only computation power, also storage, peripherals, reservability, availability, network connectivity
Co-allocation
Unique challenges of reserving multiple resources across administrative domains
Capabilities of resource management may be different for each component of a composite resource
Failure of allocating components must be handled in a transaction-like manner
Acceptable substitute components may assist in co-allocating a composite resource
Metrics for resource allocation
Different scheduling approaches measure performance differently
Historical performance
Throughput
Storage
Network connectivity
Cost
Application-specific performance
Service level
Language for utilizing grids
Much of the work in grids is protocol or language work
Expressive languages needed for negotiating service level, reporting performance or resource capabilities, security, and reserving resources
Protocol work in authentication and authorization, data transfer, and job management
Summary about Grids
Grids offer tremendous computation and data storage resources not available in single systems or single clusters
Application and algorithm design and deployment still either rudimentary or application specific
Universal infrastructure still in development
Unique challenges still unsolved, especially in regard to fault tolerance and multiple administrative domains
Public Computing
Aggregates idle workstations connected to the Internet for performing large scale computations
Initially seen in volunteer projects such as Distributed.net and SETI@home
Volunteer computers periodically download work from a project server and complete the work during idle periods
Currently used in projects that have large workloads on the scale of months or years with trivially parallelizable tasks
BOINC Architecture
Berkeley Open Infrastructure for Network Computing
Developed as a generic public computing framework
Next generation architecture for the SETI@home project
Open source and encourages use in other public computing projects
BOINC lets you donate computing power to the following projects:
Climateprediction.net: study climate change
Einstein@home: search for gravitational signals emitted by pulsars
LHC@home: improve the design of the CERN LHC particle accelerator
Predictor@home: investigate protein-related diseases
SETI@home: look for radio evidence of extraterrestrial life
Cell Computing: biomedical research (Japanese; requires nonstandard client software)
BOINC Architecture
[Diagram: the server complex (database, scheduling server, web interface, and data server linked via SQL) communicates with participant nodes through scheduler RPC calls, user browser interaction, and file upload/download]
Motivation for New Scheduling Strategies
Many projects requiring large-scale computational resources are not of the current public computing scale
Grid and cluster scale projects are very popular in many scientific computing areas
Current public computing scheduling does not scale down to these smaller projects
Motivation for New Scheduling Strategies
Grid-scale scheduling for public computing would make public computers a viable alternative or complementary resource to grid systems
Public computing has the potential to offer a tremendous amount of computing resources from idle systems of organizations or volunteers
Scavenging grid projects such as Condor indicate interest in harnessing these resources in the grid research community
Scheduling Algorithms
Current BOINC scheduling algorithm New scheduling algorithms
First Come, First Serve with target workload of 1 workunit (FCFS-1)
First Come, First Serve with target workload of 5 workunits (FCFS-5)
Ant Colony Scheduling Algorithm
BOINC Scheduling
Originally designed for “unlimited” work
Clients can request as much work as desired, up to a specified limit
Smaller, limited computational jobs are faced with the challenge of more accurate scheduling
Too many workunits assigned to a node leads to either redundant computation by other nodes or exhaustion of available workunits
Too few workunits assigned leads to increased communication overhead
New Scheduling Strategies
New strategies target computational problems on the scale of many hours or days
Four primary goals:
Reduce application execution time
Increase resource utilization
No reliance on client-supplied information
Remain application neutral
First Come First Serve Algorithms
Naïve scheduling algorithms based solely on the frequency of client requests for work
Server-centric approach which does not depend on client supplied information for scheduling
At each request for work, the server compares the number of workunits already assigned to a node and sends work to the node based on a target worklevel
Two algorithms tested targeting either a workload of one workunit (FCFS-1) or five workunits (FCFS-5)
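The FCFS-1 and FCFS-5 behavior described above can be sketched as a small server-side routine. The names and data structures below are illustrative assumptions, not BOINC's actual interfaces:

```python
# Sketch of the FCFS-k policy: on each request, top the node's
# outstanding work back up to a fixed target (k = 1 or 5).
# Names here are illustrative, not BOINC's real interfaces.

def fcfs_assign(assigned, node, pool, target):
    """Send enough workunits to bring `node` up to `target` outstanding."""
    need = max(0, target - len(assigned.get(node, [])))
    sent = [pool.pop() for _ in range(min(need, len(pool)))]
    assigned.setdefault(node, []).extend(sent)
    return sent

pool = list(range(10))        # workunit ids waiting on the server
assigned = {}                 # node -> outstanding workunits
fcfs_assign(assigned, "node-a", pool, target=5)   # FCFS-5 behavior
fcfs_assign(assigned, "node-b", pool, target=1)   # FCFS-1 behavior
```

Note that the server never consults client-reported benchmarks; it reacts only to the frequency of requests, which is exactly why too large a target can exhaust the pool while other nodes sit idle.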
Ant Colony Algorithms
Meta-heuristic modeling the behavior of ants searching for food
Ants make decisions based on pheromone levels
Decisions affect pheromone levels to influence future decisions
Ant Colony Algorithms
Initial decisions are made at random
Ants leave trail of pheromones along their path
Next ants use pheromone levels to decide
Still random since initial trails were random
Ant Colony Algorithms
Shorter paths will complete quicker leading to feedback from the pheromone trail
Ant at destination now bases return decision on pheromone level
Decisions begin to become ordered
Ant Colony Algorithms
Repeated reinforcement of shortest path leads to greater pheromone buildup
Pheromone trails degrade over time
Ant Colony Algorithms
At this point the route discovery has converged
Probabilistic model of route choice allows for random searching of potentially better routes
Allows escape from local minima or adaptation to changes in environment
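The feedback loop described in these slides can be illustrated with a tiny mean-field simulation: instead of individual random ants, each trail receives its expected share of reinforcement per step. All constants here are illustrative assumptions:

```python
# Mean-field sketch of the pheromone dynamics described above: two paths
# to the food, trails evaporate each step, and each trail is reinforced
# in proportion to the fraction of ants choosing it, with the shorter
# path reinforced more per trip (more round trips per unit time).

lengths = {"short": 1.0, "long": 3.0}
pheromone = {"short": 1.0, "long": 1.0}
EVAPORATION = 0.05

for step in range(200):
    total = sum(pheromone.values())
    shares = {p: pheromone[p] / total for p in pheromone}   # choice probabilities
    for p in pheromone:
        # evaporation degrades the trail; traversals reinforce it
        pheromone[p] = pheromone[p] * (1 - EVAPORATION) + shares[p] / lengths[p]
```

Starting from equal trails, the positive feedback drives nearly all pheromone onto the shorter path, while evaporation keeps the longer path's trail from vanishing instantly, which is what permits escape from local minima in the probabilistic version.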
Ant Colony Scheduling
In the context of scheduling, the scheduler attempts to find optimal distribution of workunits to processing nodes
To carry out the analogy, workunits are the “ants”, computational power is the “food”, and the mapping is the “path”
Scheduler begins by randomly choosing mappings of workunits to nodes
As workunits are completed and returned, more powerful nodes are reinforced more often than weaker nodes
More workunits are sent to more powerful nodes
Ant Colony Scheduling in BOINC
To take advantage of more workunits on each node, distributions are chosen on batches of workunits
A percentage of a target batch is sent based on pheromone level
Due to batching of workunits, server to client communication is consolidated and reduced
Using pheromone heuristic ensures nodes get a share of workunits proportional to their computing power
Ant Colony Scheduling in BOINC
Pheromone levels based on actual performance of completed workunits not on reported benchmarks of nodes
Attempts to improve on CPU benchmarks:
Incorporates communication overhead
Fluctuations in performance
Dynamic removal and addition of nodes
Level can be calculated completely by the server and not on untrusted nodes
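A minimal sketch of the server-side bookkeeping described here, with illustrative class and method names (these are assumptions, not BOINC internals): pheromone is derived from observed turnaround on completed workunits, and each node's batch share follows its pheromone level.

```python
# Sketch of server-side pheromone scheduling: a node's pheromone is
# derived from measured throughput on completed workunits (wall-clock
# turnaround including communication), never from client benchmarks.

class AntColonyScheduler:
    def __init__(self, batch_size=10, decay=0.9):
        self.pheromone = {}          # node -> pheromone level
        self.batch_size = batch_size # target batch of workunits per round
        self.decay = decay           # old observations fade over time

    def report_completion(self, node, workunits_done, elapsed_seconds):
        """Reinforce a node using observed throughput (workunits/second)."""
        throughput = workunits_done / elapsed_seconds
        old = self.pheromone.get(node, 0.0)
        self.pheromone[node] = self.decay * old + throughput

    def batch_for(self, node):
        """Share of the target batch proportional to the node's pheromone."""
        total = sum(self.pheromone.values()) or 1.0
        share = self.pheromone.get(node, 0.0) / total
        return max(1, round(share * self.batch_size))

sched = AntColonyScheduler()
sched.report_completion("fast", workunits_done=8, elapsed_seconds=100)  # 0.08 wu/s
sched.report_completion("slow", workunits_done=2, elapsed_seconds=100)  # 0.02 wu/s
```

Because every quantity is measured on the server, a malicious or misconfigured client cannot inflate its share by reporting false benchmarks, and decay handles nodes that slow down or disappear.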
Testing Environment and Experiments
Testing of new scheduling strategies implemented on a working BOINC system
Scheduling metrics and data
Strategies used to schedule three experiments:
MD4 Password Hash Search
Avalanche Photodiode Gain and Impulse Response Calculations
Gene Sequence Alignment
Testing Environment
[Diagram: BOINC server (Athlon XP 2.08 GHz) on a local network behind a cable modem; a campus network with 23 Pentium 4 2.8 GHz workstations and a quad Xeon 1.9 GHz behind a gateway; 5 Pentium 4 2.66 GHz workstations on another local network behind a gateway; all connected through the Internet]
Scheduling Metrics and Data
All three experiments are measured with the same metrics
Application runtime of each scheduling algorithm and the sequential runtime
Speedup versus sequential runtime for each scheduling algorithm
Workunit Distribution of each algorithm
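The speedup metric reduces to a simple ratio of the extrapolated sequential runtime to the measured parallel runtime at the same workunit count. The numbers below are illustrative, not measured values from the experiments:

```python
# Speedup as used throughout these experiments: extrapolated sequential
# runtime divided by measured parallel runtime for the same workunits.

def speedup(sequential_seconds, parallel_seconds):
    return sequential_seconds / parallel_seconds

# illustrative values only, in the rough range of the MD4 charts
print(speedup(162_500, 6_500))   # 25.0
```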
MD4 Password Hash Search
MD4 is a cryptographic hash used in password security in systems such as Microsoft Windows and the open source Samba
Passwords are stored by computing the MD4 hash of the password and storing the hashed result
Ensures clear-text passwords are not stored on a system
When password verification is needed, the supplied password is hashed and compared to the stored hash
Cryptographic security of MD4 ensures the password cannot be derived from the hash
Recovering a password is possible through brute-force exhaustion of all possible passwords and searching for a matching hash value
MD4 Search Problem Formulation
MD4 search experiment searches through all possible 6-character passwords
A standard keyboard allows 94 possible characters in a password
For 6-character passwords, there are 94^6 (about 6.9 × 10^11) possible passwords
MD4 Search Problem Formulation
BOINC implementation divides the entire password space into 2,209 workunits of 4 × 94^4 (about 312 million) possible passwords each
All passwords in the workunit are hashed and compared to a target hash
Results are sent back to the central server for processing
All workunits are processed regardless of finding a match
MD4 Search Problem Formulation
Problem is ideally suited to the public computing architecture:
Computationally intensive
Independent tasks
Low communication requirements
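The workunit structure can be illustrated with a toy version of the search. MD4 is often absent from modern OpenSSL-backed hashlib builds, so MD5 and a 3-letter alphabet stand in here; the partitioning mirrors the slide's scheme only in spirit, with tiny numbers:

```python
import hashlib
from itertools import product

# Toy hash-search workunit: exhaust a slice of the password space and
# compare each candidate's hash to a target. The real experiment used
# MD4 over 94 printable characters; MD5 and a 3-letter alphabet stand
# in here so the example runs everywhere.

ALPHABET = "abc"
LENGTH = 4                      # 3**4 = 81 candidate passwords in total

def workunit(candidates, target_hash):
    """One workunit: hash every candidate, report any match."""
    for chars in candidates:
        pw = "".join(chars)
        if hashlib.md5(pw.encode()).hexdigest() == target_hash:
            return pw
    return None

target = hashlib.md5(b"cabb").hexdigest()
space = list(product(ALPHABET, repeat=LENGTH))
# split the space into equal workunits, as the server would
chunks = [space[i:i + 27] for i in range(0, len(space), 27)]
hits = [workunit(chunk, target) for chunk in chunks]
found = next(h for h in hits if h)
```

As on the slides, every chunk is processed regardless of whether a match has already been found elsewhere, since workunits carry no cross-node dependencies.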
MD4 Search Results
Parallel runtimes are measured versus an extrapolated sequential runtime based on the time needed for computing passwords of one workunit
Parallel implementation takes on the additional load of scheduling and communication costs
MD4 Search Runtime: Ant Colony v. Sequential Runtime
[Chart: runtime in seconds (0 to 180,000) against workunits completed (0 to 2,200) for the sequential baseline and Ant Colony]
MD4 Search Runtime: FCFS-5 v. Sequential Runtime
[Chart: runtime in seconds (0 to 180,000) against workunits completed (0 to 2,200) for the sequential baseline and FCFS-5]
MD4 Search Runtime: FCFS-1 v. Sequential Runtime
[Chart: runtime in seconds (0 to 180,000) against workunits completed (0 to 2,200) for the sequential baseline and FCFS-1]
MD4 Search Parallel Runtime Comparison
[Chart: runtime (0 to 12,000) against workunits completed (0 to 2,200) for Ant Colony, FCFS-1, and FCFS-5]
MD4 Search Runtime
All three show runtimes significantly lower than sequential
Ant Colony and FCFS-5 show similar runtimes lower than FCFS-1
FCFS-5 shows erratic runtime due to processing and reporting five workunits at a time
MD4 Search Speedup Comparison
[Chart: speedup (0 to 30) against workunits completed (0 to 2,200) for FCFS-1, FCFS-5, and Ant Colony]
MD4 Search Speedup
FCFS-1 quickly approaches and maintains a lower peak speedup level due to communication overhead and delay from scheduling requests
FCFS-1 also suffers from reduced parallelism due to inability to exploit local parallelism on the quad processor system
FCFS-5 erratically approaches a higher speedup level
Ant colony approaches its peak speedup with a pattern similar to FCFS-1 and a level similar to FCFS-5
MD4 Search Workunit Distribution
[Chart: workunits (0 to 140) assigned per host under Ant Colony, FCFS-1, and FCFS-5; hosts are the Pentium 4 2.66 GHz and 2.80 GHz workstations, the Athlon XP 2.08 GHz, and the quad Xeon 1.90 GHz]
MD4 Search Workunit Distribution
Quad processor system underutilized with FCFS-1 algorithm
Remaining systems evenly distributed for all three scheduling algorithms
Lower speed workstations receive proportionally smaller workloads
MD4 Search Conclusion
MD4 search is ideally suited to the public computing architecture
Calculation benefits from larger workloads assigned to nodes to reduce communication overhead
Ant Colony and FCFS-5 perform similarly with FCFS-1 performing poorly
Avalanche Photodiode Gain and Impulse Response
Avalanche Photodiodes (APDs) are used as photodetectors in long-haul fiber-optic systems
The gain and impulse response of APDs is a stochastic process with a random shape and duration
This experiment calculates the joint probability distribution function (PDF) of APD gain and impulse response
APD Problem Formulation
The joint PDF of APD gain and impulse response is based on the position of an input carrier
This input carrier causes ionization on the APD leading to additional carriers within a multiplication region
This avalanche effect leads to a gain in carriers over time
Due to this avalanche effect, the joint PDF can be calculated iteratively based on the probability of a carrier ionizing and in turn causing additional impacts and ionizations, creating new carriers
APD Problem Formulation
BOINC implementation parallelizes calculation of the PDF for any carrier in 360° of the unit circle
360 workunits are created corresponding to each of these positions using identical parameters
The result of each workunit is a matrix of results with all values for all positions of a carrier and the impulse response for all times
APD Runtime
Sequential runtime is based on extrapolating total runtime from the average CPU time of a single workunit
All three parallel schedules show runtimes significantly lower than sequential
Ant Colony and FCFS-5 show similar runtimes lower than FCFS-1
FCFS-5 shows erratic runtime due to processing and reporting five workunits at a time
APD Runtime: Ant Colony v. Sequential Runtime
[Chart: runtime in seconds (0 to 60,000) against workunits completed (0 to 360) for the sequential baseline and Ant Colony]
APD Runtime: FCFS-5 v. Sequential Runtime
[Chart: runtime in seconds (0 to 60,000) against workunits completed (0 to 360) for the sequential baseline and FCFS-5]
APD Runtime: FCFS-1 v. Sequential Runtime
[Chart: runtime in seconds (0 to 60,000) against workunits completed (0 to 360) for the sequential baseline and FCFS-1]
APD Parallel Runtime Comparison
[Chart: runtime (0 to 3,500) against workunits completed (0 to 360) for Ant Colony, FCFS-5, and FCFS-1]
APD Runtime
Ant Colony has lowest runtime followed by FCFS-1 and finally by FCFS-5
Note the spike in runtime for FCFS-5 at the end of the calculation
Long runtime of individual workunits accounts for this spike at the end of the calculation for FCFS-5 when pool of workunits is exhausted
APD Speedup Comparison
[Chart: speedup (0 to 25) against workunits completed (0 to 360) for Ant Colony, FCFS-5, and FCFS-1]
APD Speedup
Large fluctuations at the beginning of the calculation likely due to constrained bandwidth for output data
Bandwidth constraint leaves all nodes performing similarly except for the single local node
APD Workunit Distribution
[Chart: workunits (0 to 35) assigned per host under FCFS-5, FCFS-1, and Ant Colony; hosts are the Athlon XP 2.08 GHz, the quad Xeon 1.90 GHz, and the Pentium 4 2.80 GHz and 2.66 GHz workstations]
APD Workunit Distribution
The local workstation is the highest performer of the nodes
The quad processor is the weakest performer
It shares outbound bandwidth with most other nodes
Constrained bandwidth of its single network interface dominates any benefit from local parallelism
Workunits on other nodes are randomly distributed due to contention for the communication medium
Ant colony allocates the fewest workunits to the quad processor and the most workunits to the local node
APD Conclusion
APD experiment focuses on the impact of communication overhead due to output data on scheduling strategy
All three offer significant speedup over sequential with FCFS-5 performing the worst
Ant colony outperforms both naïve algorithms by an increased allocation of work to the best performing node
Ant colony benefits from reserving more workunits in the work pool for higher performing nodes at the end of the calculation
Gene Sequence Alignment
Problem from bioinformatics: find the best alignment of two gene sequences based on matching bases, with penalties for inserting gaps in either sequence
Alignments of two sequences are scored to determine the best alignment
Different alignments can offer different scores
Gene Sequence Alignment
Given two sequences, a bonus is given for a match in the sequences and a penalty is applied for a mismatch:
Sequence 1: A C G T T A G A
Sequence 2: A G T T A G G A
Scores:    +1 -1 -1 +1 -1 -1 +1 +1 = 0
Gene Sequence Alignment
Sequences can be realigned by inserting gaps; gaps are penalized
Resulting scores will differ depending on where gaps are inserted:
Sequence 1: A C G T T A G - A
Sequence 2: A - G T T A G G A
Scores:    +1 -2 +1 +1 +1 +1 +1 -2 +1 = 3
Sequence Alignment Problem Formulation
Finding the best possible alignment is based on a dynamic programming algorithm
A scoring matrix is calculated to simultaneously calculate all possible alignments
Calculating the scoring matrix steps through each position and determines the score for all combinations of gaps
Once calculated, the best score can be found and backtracked to determine the alignment
Scoring matrix (Sequence 1 = A C G T T A G A across, Sequence 2 = A G T T A G G A down):
    A  C  G  T  T  A  G  A
A   1  1  0  0  0  0  1  0
G   1  2  0  0  0  0  0  0
T   0  0  1  1  0  0  0  1
T   0  0  0  0  2  1  0  0
A   0  0  0  0  1  3  1  0
G   1  1  0  0  0  1  4  2
G   0  0  0  1  0  0  2  5
A   0  0  0  1  0  0  0  3
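The recurrence behind such a scoring matrix can be sketched with a standard Smith-Waterman-style local alignment. The bonus and penalty values below are illustrative assumptions and may differ from the exact scheme used to produce the slide's matrix:

```python
# Sketch of the dynamic-programming recurrence behind the scoring
# matrix: each cell takes the best of a (mis)match on the diagonal,
# a gap from above, a gap from the left, or a reset to zero.

MATCH, MISMATCH, GAP = 1, -1, -2

def score_matrix(seq1, seq2):
    n, m = len(seq1), len(seq2)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # each cell depends on its upper,
        for j in range(1, n + 1):        # left, and upper-left neighbors
            diag = H[i-1][j-1] + (MATCH if seq1[j-1] == seq2[i-1] else MISMATCH)
            H[i][j] = max(0, diag, H[i-1][j] + GAP, H[i][j-1] + GAP)
    return H

H = score_matrix("ACGTTAGA", "AGTTAGGA")
best = max(max(row) for row in H)   # backtracking from here gives the alignment
```

With these parameters the best local score comes from the shared run G T T A G, which scores 5; the backtracking step, as the slides note, starts from that maximum cell.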
Sequence Alignment Problem Formulation
Each entry in the scoring matrix depends on adjacent neighbors from the position before
These dependencies create a pattern depicted in the diagram
Sequence Alignment Problem Formulation
The dependencies of the scoring matrix make parallelization difficult
Nodes cannot compute scores until previous dependencies are satisfied
Maximum parallelism can be achieved by calculating the scores in a diagonal-major fashion
[Diagram: row-major, column-major, and diagonal-major traversal orders of the scoring matrix]
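The diagonal-major order can be sketched as follows: cells on the same anti-diagonal are mutually independent, so each diagonal forms a batch that can be computed in parallel once the previous diagonal is done.

```python
# Sketch of diagonal-major ("wavefront") traversal: every cell on an
# anti-diagonal depends only on cells of earlier diagonals, so each
# diagonal is a batch of independent workunits.

def antidiagonals(rows, cols):
    """Yield each anti-diagonal as a list of (row, col) cells."""
    for d in range(rows + cols - 1):
        yield [(i, d - i) for i in range(rows) if 0 <= d - i < cols]

diags = list(antidiagonals(3, 3))
# available parallelism ramps up to the longest diagonal, then shrinks:
widths = [len(d) for d in diags]   # [1, 2, 3, 2, 1]
```

This ramp-up and ramp-down of diagonal widths is exactly the wave seen later in the per-workunit runtime curves.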
Sequence Alignment Problem Formulation
BOINC implementation only measures calculation and storage of the solution matrix
Does not include finding the maximum score and backtracing through the alignment
The solution matrix is left on the client and not transferred to the central server
The problem calculates the solution matrix for aligning two generated sequences, each of length 100,000
Sequence Alignment Runtime: Ant Colony v. Sequential Runtime
[Chart: time in seconds (0 to 6,000) against workunits completed (0 to 2,500) for the sequential baseline and Ant Colony]
Sequence Alignment Runtime
The runtime curve shows a slight wave, beginning with a decrease in per-unit runtime and later increasing again
Due to the wavefront completion of tasks in the diagonal major computation leading to increasing parallelism up to the longest diagonal
After this midpoint, parallelism decreases
Diagonal Major Execution
[Diagram: the wavefront of diagonal-major execution across the scoring matrix, contrasted with row-major and column-major orders]
Sequence Alignment Runtime: FCFS-5 v. Sequential Runtime
[Chart: time in seconds (0 to 6,000) against workunits completed (0 to 2,500) for the sequential baseline and FCFS-5]
Sequence Alignment Runtime: FCFS-1 v. Sequential Runtime
[Chart: time in seconds (0 to 6,000) against workunits completed (0 to 2,500) for the sequential baseline and FCFS-1]
Sequence Alignment Parallel Runtime Comparison
[Chart: time in seconds (0 to 5,000) against workunits completed (0 to 2,500) for Ant Colony, FCFS-5, and FCFS-1]
Sequence Alignment Speedup Comparison
[Chart: speedup (0 to 1.6) against workunits completed (0 to 2,500) for Ant Colony, FCFS-5, and FCFS-1]
Sequence Alignment Speedup
FCFS-1 shows a steady curve reflecting gradual increase in parallelism due to available tasks and a steady decrease in parallelism as the wavefront passes the largest diagonal of the calculation
FCFS-5 shows a more gradual incline at the beginning of the calculation and steeper decline toward the end
Ant colony shows a steeper incline and gradual decline
Sequence Alignment Speedup
FCFS-5 enjoys less parallelism at the beginning of the calculation due to allocating many workunits to nodes requesting work with such a small pool to draw from
As more workunits become available, FCFS-5's aggressive scheduling works to its advantage until workunits begin to become exhausted again
FCFS-1 is conservative in scheduling throughout
Ant Colony begins conservatively but occasionally sends multiple workunits to a node
This leads to a quicker buildup of generated workunits early in the calculation
Later in the calculation, Ant Colony schedules more aggressively and eventually exhausts the workunit pool similarly to FCFS-5
Sequence Alignment Workunit Distribution
[Chart: workunits (0 to 160) assigned per host under Ant Colony, FCFS-5, and FCFS-1; hosts are the Athlon XP 2.08 GHz, the quad Xeon 1.90 GHz, and the Pentium 4 2.80 GHz and 2.66 GHz workstations]
Sequence Alignment Workunit Distribution
Mostly random distribution due to task dependency dominating the calculation
All three scheduling techniques show little preference for any node based on communication or processing resources
Sequence Alignment Conclusion
Ant colony provides an interesting mix of attributes of both FCFS-1 and FCFS-5 when scheduling this computation
It shares the conservative scheduling of FCFS-1 and later aggressive scheduling similar to FCFS-5
All three parallel computations offer only a slight benefit to the problem due to the task dependency structure
It should be noted that the sequential algorithm would require a machine with 37.3 GB of memory if no memory reduction techniques are used in storing the solution matrix
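The 37.3 GB figure is consistent with storing a full score matrix for sequences of roughly 100,000 bases at 4 bytes per cell; both the per-cell size and the matrix dimensions here are assumptions, sketched only as a plausibility check:

```python
# Back-of-the-envelope check of the 37.3 GB memory figure.
# Assumptions (not stated on the slide): 4 bytes per matrix cell,
# and two sequences of roughly 100,000 bases each.
n = 100_000          # bases per sequence (assumed)
bytes_per_cell = 4   # e.g. a 32-bit alignment score (assumed)

total_bytes = n * n * bytes_per_cell
gib = total_bytes / 2**30
print(f"{gib:.1f} GiB")  # 37.3 GiB
```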
Performance Summary
Ant colony scheduling offers top performance in all three experiments
FCFS-1 and FCFS-5 offer varying performance levels depending on the attributes of the target application
Ant colony adapts to match or better the best of the competing algorithms
All three offer acceptable schedules for the parallel applications without relying on client supplied information
Problems with BOINC
Client-server model suffers from a traffic congestion problem: the server becomes too busy to handle requests
Solution: peer-to-peer model
Peer to Peer Platform
Written in Java
Uses the JXTA toolkit to provide communication services and a peer-to-peer overlay network
The platform provides basic object and messaging primitives to facilitate peer-to-peer application development
Messages between objects can be transparently passed to objects on other peers or to local objects
Each object in an application runs in its own thread
Distributed Scheduling
Applications provide their own scheduling and decomposition of work
Typical pattern of application design:
A factory object generates work units from decomposed job data
Worker objects perform computation
Result objects consolidate and report results
The work factory handles distributing work to cooperating peers and consolidating results
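The factory/worker/result pattern described above can be sketched in a few lines; the real platform is Java on JXTA with each object in its own thread, so the class names (WorkFactory, Worker, ResultCollector) and the single-threaded wiring here are illustrative assumptions, not the platform's actual API:

```python
# Illustrative sketch of the factory/worker/result pattern.
# In the real platform these objects run in their own threads and
# messages can transparently cross to other peers.

class WorkFactory:
    """Decomposes a job into work units."""
    def __init__(self, job_data, unit_size):
        self.units = [job_data[i:i + unit_size]
                      for i in range(0, len(job_data), unit_size)]

class Worker:
    """Performs the computation on one work unit."""
    def process(self, unit):
        return sum(unit)  # stand-in for the real computation

class ResultCollector:
    """Consolidates and reports results."""
    def __init__(self):
        self.results = []
    def report(self, result):
        self.results.append(result)

factory = WorkFactory(list(range(100)), unit_size=25)  # 4 work units
worker, collector = Worker(), ResultCollector()
for unit in factory.units:
    collector.report(worker.process(unit))
print(sum(collector.results))  # 4950, the consolidated result
```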
Distributed Sequence Alignment
Sequence Alignment job begins as a comparison of two complete sequences
Work factory breaks the complete comparison into work units of a minimum size
A work unit can begin processing as soon as its dependencies are available
Initially, only the upper-left corner work unit of the result matrix has all dependencies satisfied
When a work unit is completed, its adjacent work units become eligible for processing
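The dependency rule above (a work unit becomes eligible once its upper and left neighbors are complete) can be sketched as a small eligibility check over the work-unit grid; the grid size and helper names are illustrative:

```python
# Sketch of wavefront eligibility over a grid of work units.
# Unit (r, c) depends on (r-1, c) and (r, c-1); the upper-left
# corner (0, 0) is the only unit eligible at the start.

def eligible(r, c, done):
    """A unit is eligible if both dependencies are complete (or absent)."""
    up_ok   = r == 0 or (r - 1, c) in done
    left_ok = c == 0 or (r, c - 1) in done
    return (r, c) not in done and up_ok and left_ok

N = 4                      # 4x4 grid of work units (illustrative)
done = set()
waves = []
while len(done) < N * N:
    wave = [(r, c) for r in range(N) for c in range(N)
            if eligible(r, c, done)]
    waves.append(wave)
    done.update(wave)      # complete the whole wave, then re-scan

print([len(w) for w in waves])  # [1, 2, 3, 4, 3, 2, 1] -- anti-diagonals
```

The wave sizes grow then shrink along the anti-diagonals, which matches the speedup curves reported earlier: parallelism builds until the wavefront passes the largest diagonal, then tapers off.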
Distributed Sequence Alignment
The distributed application attempts to complete work units in squares of four adjacent work units
As more work units are completed, larger and larger squares of work units become eligible for processing
[Diagram: grid of work units labeled Complete, Eligible, Processing, and Not Ready]
Distributed Sequence Alignment
Peers other than the peer initially starting the job will begin with no work to complete
The initial peer will broadcast a signal signifying availability of eligible work units
Peers will attempt to contact a peer advertising work in order to request work
Distributed Sequence Alignment
A peer with available work will distribute the largest amount of work eligible and mark the work unit as remotely processed
When a peer completes all work in a work unit, it will report to the peer that initially assigned the work
Only the adjacent edges of results necessary for computing new work units are reported, to reduce communication
Complete results are stored at the peer performing the computation
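Reporting only the edges rather than the full block keeps messages small, since downstream units depend only on the boundary values; a sketch of what a completed block would send back, with the block layout and function name as assumptions:

```python
# Sketch: a peer computes a full block of scores locally but reports
# only the edges that downstream work units depend on.

def edges_to_report(block):
    """block is a list of rows (the score matrix for one work unit)."""
    last_row = block[-1][:]                 # needed by the unit below
    last_col = [row[-1] for row in block]   # needed by the unit to the right
    return last_row, last_col

block = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
row, col = edges_to_report(block)
print(row, col)  # [7, 8, 9] [3, 6, 9]
# For a b x b block this sends 2b values instead of b*b;
# the full block stays stored at the computing peer.
```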
Distributed Sequence Alignment
After reporting the completion of a work unit, a peer will seek new work from all peers
Once the initial peer completes all work, the job is done
Peers could then be queried to report maximum scores and alignments
Experiment
Peer to peer algorithm was tested aligning two sequences of 160,000 bases each
Alignments used a typical scoring matrix and gap penalty
The minimum work unit size for decomposition of the total job was 10,000 bases for each sequence resulting in 256 work units
(160,000 / 10,000)² = 256
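The 256 figure is the squared ratio of sequence length to minimum work-unit size, since the result matrix is tiled by every pair of segments:

```python
# Work-unit count for the peer-to-peer experiment:
# each 160,000-base sequence is cut into 10,000-base segments,
# and the result matrix is tiled by every pair of segments.
sequence_length = 160_000
unit_size = 10_000

segments = sequence_length // unit_size   # 16 segments per sequence
work_units = segments ** 2
print(work_units)  # 256
```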
Experiment
The distributed system was executed on a local area network with 2, 4, and 6 fairly similar peers
The results were compared to a sequential implementation of the algorithm run on one of the peers
The sequential implementation does a straightforward computation of the complete matrix
Disk is used to periodically save sections of the matrix, since the complete matrix would exhaust available memory
Runtime
[Chart: runtime (0:00:00–1:12:00) vs. number of nodes (1–6)]
Runtime is reduced from 1 hour and 9 minutes to 28 minutes
This is about 2.4 times faster than sequential
The most dramatic drop is at 2 nodes: 1.75 times faster, at 39 minutes
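The speedup figures follow directly from the reported wall-clock times (1:09 sequential, 0:39 with two nodes, 0:28 with six); the exact ratios below come out slightly above the rounded numbers on the slides:

```python
# Speedup from the reported wall-clock times (in minutes).
sequential = 69   # 1 hour 9 minutes
two_nodes  = 39
six_nodes  = 28

print(round(sequential / two_nodes, 2))  # 1.77, close to the reported 1.75x
print(round(sequential / six_nodes, 2))  # 2.46, reported as about 2.4x
```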
Node Efficiency
[Chart: percent of time working (0–100%) per node for 2-, 4-, and 6-node runs]
The first peer generally has the highest efficiency
Average efficiency drops as more nodes are added
Analysis
Findings are in line with the structure of the sequential problem
Due to dependencies of tasks on previous tasks, many nodes must initially remain idle as some nodes complete dependent tasks
As more work becomes eligible for processing, more nodes can work simultaneously
Later in the computation, fewer work units become eligible as the wavefront passes the largest diagonal and fewer dependent tasks remain
Comparison With CS Model
Previous work performed the same computation with a client-server platform (BOINC)
Aligned two sequences of 100,000 bases each
Work unit size was 2,000 bases for each sequence (2,500 work units)
Performed on 30 nodes
Comparison With CS Model
Direct comparison is difficult:
The previous job was smaller, but its finer granularity increased the number of work units
The previous sequential portions of the alignment appear inefficient compared to the current implementation, based on overall sequential job completion time
Comparison With CS Model
Best comparison is the overall runtime reduction factor
The runtime reduction factor compares distributed completion time with similar sequential completion time
It factors out differing performance of the sequential aspects of the computation
The BOINC implementation achieved a reduction factor of only about 1.2 times sequential
Peer to peer achieved 2.4 with only 1/5 the nodes
Comparison With CS Model
BOINC implementation was impeded by a central server which was using a slower link to most nodes compared to the peer to peer configuration
Peer to peer implementation also shows signs of diminishing returns as more nodes are added
Peer to peer utilizes all participating systems, whereas client-server normally uses a server that does not participate in the computation beyond scheduling
Conclusions
Our scheduling strategy is effective on BOINC
P2P is more effective than the BOINC client-server model
Public computing can solve problems with large computing power requirements and huge memory demands
It can potentially replace supercomputing for certain applications (large grains)
Future Work
Measure the impact of fault tolerance on these scheduling algorithms
Measure the impact of work redundancy and work validation
Continue to benchmark the P2P implementation with more nodes
Implement Ant Colony Scheduling on P2P model
Future Work
Currently a peer only seeks more work when it has completed all of its own work
It does not seek work while waiting for reports from peers to which it has distributed work
Allowing a peer to seek work while it is waiting for other peers may increase utilization
Create a more direct comparison with the client-server model
Additional applications and improvements to the base platform
Thank You
Questions?