Public Computing - Challenges and Solutions
Yi Pan
Professor and Chair of CS, Professor of CIS
Georgia State University, Atlanta, Georgia, USA
AINA 2007, May 21, 2007
Outline
What is Grid Computing?
Virtual Organizations
Types of Grids
Grid Components
Applications
Grid Issues
Conclusions
Public Computing and the BOINC Architecture
Motivation for New Scheduling Strategies
Scheduling Algorithms
Testing Environment and Experiments
MD4 Password Hash Search
Avalanche Photodiode Gain and Impulse Response
Gene Sequence Alignment
Peer to Peer Model and Experiments
Conclusion and Future Research
What is Grid Computing?
Analogy is to the power grid
Heterogeneous and geographically dispersed
Standards allow for transportation of power
Standards define the interface with the grid
Non-trivial overhead of managing movement and storage of power
Economies of scale compensate for this overhead, allowing for cheap, accessible power
[Diagram: hydroelectric, gas, and coal plants delivering power over the grid to a customer]
A Computational “Power Grid”
Goal is to make computation a utility
Computational power, data services, and peripherals (graphics accelerators, particle colliders) are provided in a heterogeneous, geographically dispersed way
Standards allow for transportation of these services
Standards define the interface with the grid
Architecture provides for management of resources and controlling access
Large amounts of computing power should be accessible from anywhere in the grid
[Diagram: a supercomputer, cluster, and workstations delivering services over the Internet to a customer]
Virtual Organizations
Independent organizations come together to pool grid resources
Component organizations could be different research institutions, departments within a company, individuals donating computing time, or anything with resources
Formation of the VO should define participation levels, resources provided, expectations of resource use, accountability, and economic issues such as charges for resources
Goal is to allow users to exploit resources throughout the VO transparently and efficiently
Types of Grids
Computational Grid Data Grid Scavenging Grid Peer-to-Peer Public Computing
Computational Grids
Traditionally used to connect high performance computers between organizations
Increases utilization of geographically dispersed computational resources
Provides more parallel computational power to individual applications than is feasible for a single organization
Most traditional grid projects concentrate on these types of grids
Globus and OGSA
Data Grids
Distributed data sources
Queries of distributed data
Sharing of storage and data management resources
The D0 Particle Physics Data Grid allows access to both compute and data resources for huge amounts of physics data
Scavenging Grids
Harness idle cycles on systems, especially user workstations
Parallel applications must be quite granular to take advantage of large amounts of weak computing power
Grid system must support terminating and restarting work when systems cease idling
Condor system from University of Wisconsin
Peer-to-Peer
Converging technology with traditional grids
Contrasts with grids by having little infrastructure and high fault tolerance
Highly scalable for participation, but difficult to locate and monitor resources
Current P2P systems like Gnutella, Freenet, and FastTrack concentrate on data services
Public Computing
Also converging with grid computing
Often communicates through a central server, in contrast with peer-to-peer technologies
Again, scalable with participation
Adds even greater impact of multiple administrative domains, as participants are often untrusted and unaccountable
Public Computing Examples
SETI@Home (http://setiathome.ssl.berkeley.edu/) – Search for Extraterrestrial Intelligence in radio telescope data (UC Berkeley); a distributed network computation searching for extraterrestrial civilizations
Has more than 5 million participants
“The most powerful computer, IBM's ASCI White, is rated at 12 TeraFLOPS and costs $110 million. SETI@home currently gets about 15 TeraFLOPs and has cost $500K so far.”
More Public Computing Examples
Folding@Home project (http://folding.stanford.edu) for molecular simulation aimed at new drug discovery
Distributed.net (http://distributed.net) for cracking RC5 64-bit encryption algorithm – used more than 300,000 nodes over 1757 days
Grid Components
Authentication and Authorization Resource Information Service Monitoring Scheduler Fault Tolerance Communication Infrastructure
Authentication and Authorization
Important for allowing users to cross the administrative boundaries in a virtual organization
System security for jobs outside the administrative domain is currently rudimentary
Work being done on sandboxing, better job control, development environments
[Diagram: a user authenticating through A&A servers to reach an HPC system and a cluster]
Resource Information Service
Used in resource discovery
Leverages existing technologies such as LDAP, UDDI
Information service must be able to report very current availability and load data
Balanced with overhead of updating data
[Diagram: a user querying Grid Information Service (GIS) servers for an HPC system and a cluster]
Monitoring
Raw performance characteristics are not the only measurement of resource performance
Current and expected loads can have a tremendous impact
Balance between accurate performance data and additional overhead of monitoring systems and tracking that data
Scheduler
Owners of systems interested in maximizing throughput
Users interested in maximizing runtime performance
Both offer challenges with crossing administrative boundaries
Unique issues such as co-allocation and co-location
Interesting work being done in scheduling, such as market-based scheduling
Fault Tolerance
More work exploring fault tolerance in grid systems leveraging peer-to-peer and public computing research
Multiple administrative domains in VO challenge the reliability of resources
Faults can refer not only to resource failure but violation of service level agreements (SLA)
Impact on fault tolerance if there is no accountability for failure
[Cartoon, four panels: an expensive HPC system sits idle while a scavenging cluster cheaply computes a molecular modeling job; half the cluster's workstations stop idling and the job stalls; the user's deadline and budget are blown; in the end the job would have been cheaper on the supercomputer, which "wins again"]
Communication Infrastructure
Currently most grids have robust communication infrastructure
As more grids are deployed and used, more concentration must be placed on network QoS and reservation
Most large applications are currently data rich
P2P and public computing have experience in communication-poor environments
Applications
Embarrassingly parallel, data poor applications in the case of pooling large amounts of weak computing power
Huge data-intensive, data rich applications that can take advantage of multiple, parallel supercomputers
Application specific grids like Cactus and Nimrod
Grid Issues
Site autonomy Heterogeneous resources Co-allocation Metrics for resource allocation Language for utilizing grids Reliability
Site autonomy
Each component of the grid could be administered by an individual organization participating in the VO
Each administrative domain has its own policies and procedures surrounding their resources
Most scheduling and resource management work must be distributed to support this
Heterogeneous resources
Grid resources will have not only heterogeneous platforms but heterogeneous workloads
Applications truly exploiting grid resources will need to scale from idle cycles on workstations, huge vector based HPCs, to clusters
Not only computation power, also storage, peripherals, reservability, availability, network connectivity
Co-allocation
Unique challenges of reserving multiple resources across administrative domains
Capabilities of resource management may be different for each component of a composite resource
Failure of allocating components must be handled in a transaction-like manner
Acceptable substitute components may assist in co-allocating a composite resource
Metrics for resource allocation
Different scheduling approaches measure performance differently
Historical performance
Throughput
Storage
Network connectivity
Cost
Application-specific performance
Service level
Language for utilizing grids
Much of the work in grids is protocol or language work
Expressive languages needed for negotiating service level, reporting performance or resource capabilities, security, and reserving resources
Protocol work in authentication and authorization, data transfer, and job management
Summary about Grids
Grids offer tremendous computation and data storage resources not available in single systems or single clusters
Application and algorithm design and deployment still either rudimentary or application specific
Universal infrastructure still in development
Unique challenges still unsolved, especially in regard to fault tolerance and multiple administrative domains
Public Computing
Aggregates idle workstations connected to the Internet for performing large scale computations
Initially seen in volunteer projects such as Distributed.net and SETI@home
Volunteer computers periodically download work from a project server and complete the work during idle periods
Currently used in projects that have large workloads on the scale of months or years with trivially parallelizable tasks
BOINC Architecture
Berkeley Open Infrastructure for Network Computing
Developed as a generic public computing framework
Next generation architecture for the SETI@home project
Open source and encourages use in other public computing projects
BOINC lets you donate computing power to the following projects:
Climateprediction.net: study climate change
Einstein@home: search for gravitational signals emitted by pulsars
LHC@home: improve the design of the CERN LHC particle accelerator
Predictor@home: investigate protein-related diseases
SETI@home: look for radio evidence of extraterrestrial life
Cell Computing: biomedical research (Japanese; requires nonstandard client software)
BOINC Architecture
[Diagram: the server complex (database, scheduling server, web interface, and data server linked via SQL) communicates with participant nodes through scheduler RPC calls, user browser interaction, and file upload/download]
Motivation for New Scheduling Strategies
Many projects requiring large-scale computational resources are not of the current public computing scale
Grid and cluster scale projects are very popular in many scientific computing areas
Current public computing scheduling does not scale down to these smaller projects
Motivation for New Scheduling Strategies
Grid-scale scheduling for public computing would make public computers a viable alternative or complementary resource to grid systems
Public computing has the potential to offer a tremendous amount of computing resources from idle systems of organizations or volunteers
Scavenging grid projects such as Condor indicate interest in harnessing these resources in the grid research community
Scheduling Algorithms
Current BOINC scheduling algorithm New scheduling algorithms
First Come, First Serve with target workload of 1 workunit (FCFS-1)
First Come, First Serve with target workload of 5 workunits (FCFS-5)
Ant Colony Scheduling Algorithm
BOINC Scheduling
Originally designed for “unlimited” work
Clients can request as much work as desired, up to a specified limit
Smaller, limited computational jobs are faced with the challenge of more accurate scheduling
Too many workunits assigned to a node leads to either redundant computation by other nodes or exhaustion of available workunits
Too few workunits assigned leads to increased communication overhead
New Scheduling Strategies
New strategies target computational problems on the scale of many hours or days
Four primary goals:
Reduce application execution time
Increase resource utilization
No reliance on client-supplied information
Remain application neutral
First Come First Serve Algorithms
Naïve scheduling algorithms based solely on the frequency of client requests for work
Server-centric approach which does not depend on client supplied information for scheduling
At each request for work, the server compares the number of workunits already assigned to a node and sends work to the node based on a target worklevel
Two algorithms tested targeting either a workload of one workunit (FCFS-1) or five workunits (FCFS-5)
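The FCFS-1 and FCFS-5 behavior described above can be sketched as a small server-side routine. The names and data structures below are illustrative assumptions, not BOINC's actual interfaces:

```python
# Sketch of the FCFS-k policy: on each request, top the node's
# outstanding work back up to a fixed target (k = 1 or 5).
# Names here are illustrative, not BOINC's real interfaces.

def fcfs_assign(assigned, node, pool, target):
    """Send enough workunits to bring `node` up to `target` outstanding."""
    need = max(0, target - len(assigned.get(node, [])))
    sent = [pool.pop() for _ in range(min(need, len(pool)))]
    assigned.setdefault(node, []).extend(sent)
    return sent

pool = list(range(10))        # workunit ids waiting on the server
assigned = {}                 # node -> outstanding workunits
fcfs_assign(assigned, "node-a", pool, target=5)   # FCFS-5 behavior
fcfs_assign(assigned, "node-b", pool, target=1)   # FCFS-1 behavior
```

Note that the server never consults client-reported benchmarks; it reacts only to the frequency of requests, which is exactly why too large a target can exhaust the pool while other nodes sit idle.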
Ant Colony Algorithms
Meta-heuristic modeling the behavior of ants searching for food
Ants make decisions based on pheromone levels
Decisions affect pheromone levels to influence future decisions
Ant Colony Algorithms
Initial decisions are made at random
Ants leave trail of pheromones along their path
Next ants use pheromone levels to decide
Still random since initial trails were random
Ant Colony Algorithms
Shorter paths will complete quicker leading to feedback from the pheromone trail
Ant at destination now bases return decision on pheromone level
Decisions begin to become ordered
Ant Colony Algorithms
Repeated reinforcement of shortest path leads to greater pheromone buildup
Pheromone trails degrade over time
Ant Colony Algorithms
At this point the route discovery has converged
Probabilistic model of route choice allows for random searching of potentially better routes
Allows escape from local minima or adaptation to changes in environment
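The feedback loop described in these slides can be illustrated with a tiny mean-field simulation: instead of individual random ants, each trail receives its expected share of reinforcement per step. All constants here are illustrative assumptions:

```python
# Mean-field sketch of the pheromone dynamics described above: two paths
# to the food, trails evaporate each step, and each trail is reinforced
# in proportion to the fraction of ants choosing it, with the shorter
# path reinforced more per trip (more round trips per unit time).

lengths = {"short": 1.0, "long": 3.0}
pheromone = {"short": 1.0, "long": 1.0}
EVAPORATION = 0.05

for step in range(200):
    total = sum(pheromone.values())
    shares = {p: pheromone[p] / total for p in pheromone}   # choice probabilities
    for p in pheromone:
        # evaporation degrades the trail; traversals reinforce it
        pheromone[p] = pheromone[p] * (1 - EVAPORATION) + shares[p] / lengths[p]
```

Starting from equal trails, the positive feedback drives nearly all pheromone onto the shorter path, while evaporation keeps the longer path's trail from vanishing instantly, which is what permits escape from local minima in the probabilistic version.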
Ant Colony Scheduling
In the context of scheduling, the scheduler attempts to find optimal distribution of workunits to processing nodes
To carry out the analogy, workunits are the “ants”, computational power is the “food”, and the mapping is the “path”
Scheduler begins by randomly choosing mappings of workunits to nodes
As workunits are completed and returned, more powerful nodes are reinforced more often than weaker nodes
More workunits are sent to more powerful nodes
Ant Colony Scheduling in BOINC
To take advantage of more workunits on each node, distributions are chosen on batches of workunits
A percentage of a target batch is sent based on pheromone level
Due to batching of workunits, server to client communication is consolidated and reduced
Using pheromone heuristic ensures nodes get a share of workunits proportional to their computing power
Ant Colony Scheduling in BOINC
Pheromone levels based on actual performance of completed workunits not on reported benchmarks of nodes
Attempts to improve on CPU benchmarks:
Incorporates communication overhead
Fluctuations in performance
Dynamic removal and addition of nodes
Level can be calculated completely by the server and not on untrusted nodes
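A minimal sketch of the server-side bookkeeping described here, with illustrative class and method names (these are assumptions, not BOINC internals): pheromone is derived from observed turnaround on completed workunits, and each node's batch share follows its pheromone level.

```python
# Sketch of server-side pheromone scheduling: a node's pheromone is
# derived from measured throughput on completed workunits (wall-clock
# turnaround including communication), never from client benchmarks.

class AntColonyScheduler:
    def __init__(self, batch_size=10, decay=0.9):
        self.pheromone = {}          # node -> pheromone level
        self.batch_size = batch_size # target batch of workunits per round
        self.decay = decay           # old observations fade over time

    def report_completion(self, node, workunits_done, elapsed_seconds):
        """Reinforce a node using observed throughput (workunits/second)."""
        throughput = workunits_done / elapsed_seconds
        old = self.pheromone.get(node, 0.0)
        self.pheromone[node] = self.decay * old + throughput

    def batch_for(self, node):
        """Share of the target batch proportional to the node's pheromone."""
        total = sum(self.pheromone.values()) or 1.0
        share = self.pheromone.get(node, 0.0) / total
        return max(1, round(share * self.batch_size))

sched = AntColonyScheduler()
sched.report_completion("fast", workunits_done=8, elapsed_seconds=100)  # 0.08 wu/s
sched.report_completion("slow", workunits_done=2, elapsed_seconds=100)  # 0.02 wu/s
```

Because every quantity is measured on the server, a malicious or misconfigured client cannot inflate its share by reporting false benchmarks, and decay handles nodes that slow down or disappear.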
Testing Environment and Experiments
Testing of new scheduling strategies implemented on a working BOINC system
Scheduling metrics and data
Strategies used to schedule three experiments:
MD4 Password Hash Search
Avalanche Photodiode Gain and Impulse Response Calculations
Gene Sequence Alignment
Testing Environment
[Diagram: BOINC server (Athlon XP 2.08 GHz) on a local network behind a cable modem; a campus network with 23 Pentium 4 2.8 GHz workstations and a quad Xeon 1.9 GHz behind a gateway; 5 Pentium 4 2.66 GHz workstations on another local network behind a gateway; all connected through the Internet]
Scheduling Metrics and Data
All three experiments are measured with the same metrics
Application runtime of each scheduling algorithm and the sequential runtime
Speedup versus sequential runtime for each scheduling algorithm
Workunit Distribution of each algorithm
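The speedup metric reduces to a simple ratio of the extrapolated sequential runtime to the measured parallel runtime at the same workunit count. The numbers below are illustrative, not measured values from the experiments:

```python
# Speedup as used throughout these experiments: extrapolated sequential
# runtime divided by measured parallel runtime for the same workunits.

def speedup(sequential_seconds, parallel_seconds):
    return sequential_seconds / parallel_seconds

# illustrative values only, in the rough range of the MD4 charts
print(speedup(162_500, 6_500))   # 25.0
```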
MD4 Password Hash Search
MD4 is a cryptographic hash used in password security in systems such as Microsoft Windows and the open source Samba
Passwords are stored by computing the MD4 hash of the password and storing the hashed result
Ensures clear-text passwords are not stored on a system
When password verification is needed, the supplied password is hashed and compared to the stored hash
Cryptographic security of MD4 ensures the password cannot be derived from the hash
Recovering a password is possible through brute-force exhaustion of all possible passwords and searching for a matching hash value
MD4 Search Problem Formulation
MD4 search experiment searches through all possible 6-character passwords
A standard keyboard allows 94 possible characters in a password
For 6-character passwords, there are 94^6 (about 6.9 × 10^11) possible passwords
MD4 Search Problem Formulation
BOINC implementation divides the entire password space into 2,209 workunits of 4 × 94^4 (about 312 million) possible passwords each
All passwords in the workunit are hashed and compared to a target hash
Results are sent back to the central server for processing
All workunits are processed regardless of finding a match
MD4 Search Problem Formulation
Problem is ideally suited to the public computing architecture:
Computationally intensive
Independent tasks
Low communication requirements
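The workunit structure can be illustrated with a toy version of the search. MD4 is often absent from modern OpenSSL-backed hashlib builds, so MD5 and a 3-letter alphabet stand in here; the partitioning mirrors the slide's scheme only in spirit, with tiny numbers:

```python
import hashlib
from itertools import product

# Toy hash-search workunit: exhaust a slice of the password space and
# compare each candidate's hash to a target. The real experiment used
# MD4 over 94 printable characters; MD5 and a 3-letter alphabet stand
# in here so the example runs everywhere.

ALPHABET = "abc"
LENGTH = 4                      # 3**4 = 81 candidate passwords in total

def workunit(candidates, target_hash):
    """One workunit: hash every candidate, report any match."""
    for chars in candidates:
        pw = "".join(chars)
        if hashlib.md5(pw.encode()).hexdigest() == target_hash:
            return pw
    return None

target = hashlib.md5(b"cabb").hexdigest()
space = list(product(ALPHABET, repeat=LENGTH))
# split the space into equal workunits, as the server would
chunks = [space[i:i + 27] for i in range(0, len(space), 27)]
hits = [workunit(chunk, target) for chunk in chunks]
found = next(h for h in hits if h)
```

As on the slides, every chunk is processed regardless of whether a match has already been found elsewhere, since workunits carry no cross-node dependencies.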
MD4 Search Results
Parallel runtimes are measured versus an extrapolated sequential runtime based on the time needed for computing passwords of one workunit
Parallel implementation takes on the additional load of scheduling and communication costs
MD4 Search Runtime: Ant Colony v. Sequential Runtime
[Chart: runtime in seconds (0 to 180,000) against workunits completed (0 to 2,200) for the sequential baseline and Ant Colony]
MD4 Search Runtime: FCFS-5 v. Sequential Runtime
[Chart: runtime in seconds (0 to 180,000) against workunits completed (0 to 2,200) for the sequential baseline and FCFS-5]
MD4 Search Runtime: FCFS-1 v. Sequential Runtime
[Chart: runtime in seconds (0 to 180,000) against workunits completed (0 to 2,200) for the sequential baseline and FCFS-1]
MD4 Search Parallel Runtime Comparison
[Chart: runtime (0 to 12,000) against workunits completed (0 to 2,200) for Ant Colony, FCFS-1, and FCFS-5]
MD4 Search Runtime
All three show runtimes significantly lower than sequential
Ant Colony and FCFS-5 show similar runtimes lower than FCFS-1
FCFS-5 shows erratic runtime due to processing and reporting five workunits at a time
MD4 Search Speedup Comparison
[Chart: speedup (0 to 30) against workunits completed (0 to 2,200) for FCFS-1, FCFS-5, and Ant Colony]
MD4 Search Speedup
FCFS-1 quickly approaches and maintains a lower peak speedup level due to communication overhead and delay from scheduling requests
FCFS-1 also suffers from reduced parallelism due to inability to exploit local parallelism on the quad processor system
FCFS-5 erratically approaches a higher speedup level
Ant colony approaches its peak speedup with a pattern similar to FCFS-1 and a level similar to FCFS-5
MD4 Search Workunit Distribution
[Chart: workunits (0 to 140) assigned per host under Ant Colony, FCFS-1, and FCFS-5; hosts are the Pentium 4 2.66 GHz and 2.80 GHz workstations, the Athlon XP 2.08 GHz, and the quad Xeon 1.90 GHz]
MD4 Search Workunit Distribution
Quad processor system underutilized with FCFS-1 algorithm
Remaining systems evenly distributed for all three scheduling algorithms
Lower speed workstations receive proportionally smaller workloads
MD4 Search Conclusion
MD4 search is ideally suited to the public computing architecture
Calculation benefits from larger workloads assigned to nodes to reduce communication overhead
Ant Colony and FCFS-5 perform similarly with FCFS-1 performing poorly
Avalanche Photodiode Gain and Impulse Response
Avalanche Photodiodes (APDs) are used as photodetectors in long-haul fiber-optic systems
The gain and impulse response of APDs is a stochastic process with a random shape and duration
This experiment calculates the joint probability distribution function (PDF) of APD gain and impulse response
APD Problem Formulation
The joint PDF of APD gain and impulse response is based on the position of an input carrier
This input carrier causes ionization on the APD leading to additional carriers within a multiplication region
This avalanche effect leads to a gain in carriers over time
Due to this avalanche effect, the joint PDF can be calculated iteratively based on the probability of a carrier ionizing and in turn causing additional impacts and ionizations, creating new carriers
APD Problem Formulation
BOINC implementation parallelizes calculation of the PDF for any carrier in 360° of the unit circle
360 workunits are created corresponding to each of these positions using identical parameters
The result of each workunit is a matrix of results with all values for all positions of a carrier and the impulse response for all times
APD Runtime
Sequential runtime is based on extrapolating total runtime from the average CPU time of a single workunit
All three parallel schedules show runtimes significantly lower than sequential
Ant Colony and FCFS-5 show similar runtimes lower than FCFS-1
FCFS-5 shows erratic runtime due to processing and reporting five workunits at a time
APD Runtime: Ant Colony v. Sequential Runtime
[Chart: runtime in seconds (0 to 60,000) against workunits completed (0 to 360) for the sequential baseline and Ant Colony]
APD Runtime: FCFS-5 v. Sequential Runtime
[Chart: runtime in seconds (0 to 60,000) against workunits completed (0 to 360) for the sequential baseline and FCFS-5]
APD Runtime: FCFS-1 v. Sequential Runtime
[Chart: runtime in seconds (0 to 60,000) against workunits completed (0 to 360) for the sequential baseline and FCFS-1]
APD Parallel Runtime Comparison
[Chart: runtime (0 to 3,500) against workunits completed (0 to 360) for Ant Colony, FCFS-5, and FCFS-1]
APD Runtime
Ant Colony has lowest runtime followed by FCFS-1 and finally by FCFS-5
Note the spike in runtime for FCFS-5 at the end of the calculation
Long runtime of individual workunits accounts for this spike at the end of the calculation for FCFS-5 when pool of workunits is exhausted
APD Speedup Comparison
[Chart: speedup (0 to 25) against workunits completed (0 to 360) for Ant Colony, FCFS-5, and FCFS-1]
APD Speedup
Large fluctuations at the beginning of the calculation likely due to constrained bandwidth for output data
Bandwidth constraint leaves all nodes performing similarly except for the single local node
APD Workunit Distribution
[Chart: workunits (0 to 35) assigned per host under FCFS-5, FCFS-1, and Ant Colony; hosts are the Athlon XP 2.08 GHz, the quad Xeon 1.90 GHz, and the Pentium 4 2.80 GHz and 2.66 GHz workstations]
APD Workunit Distribution
The local workstation is the highest performer of the nodes
The quad processor is the weakest performer
It shares outbound bandwidth with most other nodes
Constrained bandwidth of its single network interface dominates any benefit from local parallelism
Workunits on other nodes are randomly distributed due to contention for the communication medium
Ant colony allocates the fewest workunits to the quad processor and the most workunits to the local node
APD Conclusion
APD experiment focuses on the impact of communication overhead due to output data on scheduling strategy
All three offer significant speedup over sequential with FCFS-5 performing the worst
Ant colony outperforms both naïve algorithms by an increased allocation of work to the best performing node
Ant colony benefits from reserving more workunits in the work pool for higher performing nodes at the end of the calculation
Gene Sequence Alignment
Problem from bioinformatics: find the best alignment of two gene sequences based on matching bases, with penalties for inserting gaps in either sequence
Alignments of two sequences are scored to determine the best alignment
Different alignments can offer different scores
Gene Sequence Alignment
Given two sequences, a bonus is given for a match in the sequences and a penalty is applied for a mismatch:
Sequence 1: A C G T T A G A
Sequence 2: A G T T A G G A
Scores:    +1 -1 -1 +1 -1 -1 +1 +1 = 0
Gene Sequence Alignment
Sequences can be realigned by inserting gaps; gaps are penalized
Resulting scores will differ depending on where gaps are inserted:
Sequence 1: A C G T T A G - A
Sequence 2: A - G T T A G G A
Scores:    +1 -2 +1 +1 +1 +1 +1 -2 +1 = 3
Sequence Alignment Problem Formulation
Finding the best possible alignment is based on a dynamic programming algorithm
A scoring matrix is calculated to simultaneously calculate all possible alignments
Calculating the scoring matrix steps through each position and determines the score for all combinations of gaps
Once calculated, the best score can be found and backtracked to determine the alignment
Scoring matrix (Sequence 1 = A C G T T A G A across, Sequence 2 = A G T T A G G A down):
    A  C  G  T  T  A  G  A
A   1  1  0  0  0  0  1  0
G   1  2  0  0  0  0  0  0
T   0  0  1  1  0  0  0  1
T   0  0  0  0  2  1  0  0
A   0  0  0  0  1  3  1  0
G   1  1  0  0  0  1  4  2
G   0  0  0  1  0  0  2  5
A   0  0  0  1  0  0  0  3
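The recurrence behind such a scoring matrix can be sketched with a standard Smith-Waterman-style local alignment. The bonus and penalty values below are illustrative assumptions and may differ from the exact scheme used to produce the slide's matrix:

```python
# Sketch of the dynamic-programming recurrence behind the scoring
# matrix: each cell takes the best of a (mis)match on the diagonal,
# a gap from above, a gap from the left, or a reset to zero.

MATCH, MISMATCH, GAP = 1, -1, -2

def score_matrix(seq1, seq2):
    n, m = len(seq1), len(seq2)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # each cell depends on its upper,
        for j in range(1, n + 1):        # left, and upper-left neighbors
            diag = H[i-1][j-1] + (MATCH if seq1[j-1] == seq2[i-1] else MISMATCH)
            H[i][j] = max(0, diag, H[i-1][j] + GAP, H[i][j-1] + GAP)
    return H

H = score_matrix("ACGTTAGA", "AGTTAGGA")
best = max(max(row) for row in H)   # backtracking from here gives the alignment
```

With these parameters the best local score comes from the shared run G T T A G, which scores 5; the backtracking step, as the slides note, starts from that maximum cell.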
Sequence Alignment Problem Formulation
Each entry in the scoring matrix depends on adjacent neighbors from the position before
These dependencies create a pattern depicted in the diagram
Sequence Alignment Problem Formulation
The dependencies of the scoring matrix make parallelization difficult
Nodes cannot compute scores until previous dependencies are satisfied
Maximum parallelism can be achieved by calculating the scores in a diagonal-major fashion
[Diagram: row-major, column-major, and diagonal-major traversal orders of the scoring matrix]
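The diagonal-major order can be sketched as follows: cells on the same anti-diagonal are mutually independent, so each diagonal forms a batch that can be computed in parallel once the previous diagonal is done.

```python
# Sketch of diagonal-major ("wavefront") traversal: every cell on an
# anti-diagonal depends only on cells of earlier diagonals, so each
# diagonal is a batch of independent workunits.

def antidiagonals(rows, cols):
    """Yield each anti-diagonal as a list of (row, col) cells."""
    for d in range(rows + cols - 1):
        yield [(i, d - i) for i in range(rows) if 0 <= d - i < cols]

diags = list(antidiagonals(3, 3))
# available parallelism ramps up to the longest diagonal, then shrinks:
widths = [len(d) for d in diags]   # [1, 2, 3, 2, 1]
```

This ramp-up and ramp-down of diagonal widths is exactly the wave seen later in the per-workunit runtime curves.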
Sequence Alignment Problem Formulation
BOINC implementation only measures calculation and storage of the solution matrix
Does not include finding the maximum score and backtracing through the alignment
The solution matrix is left on the client and not transferred to the central server
The problem calculates the solution matrix for aligning two generated sequences, each of length 100,000
Sequence Alignment Runtime: Ant Colony v. Sequential Runtime
[Chart: time in seconds (0 to 6,000) against workunits completed (0 to 2,500) for the sequential baseline and Ant Colony]
Sequence Alignment Runtime
The runtime curve shows a slight wave, beginning with a decrease in per-unit runtime and later increasing again
Due to the wavefront completion of tasks in the diagonal major computation leading to increasing parallelism up to the longest diagonal
After this midpoint, parallelism decreases
Diagonal Major Execution
[Diagram: the wavefront of diagonal-major execution across the scoring matrix, contrasted with row-major and column-major orders]
Sequence Alignment Runtime: FCFS-5 v. Sequential Runtime
[Chart: time in seconds (0 to 6,000) against workunits completed (0 to 2,500) for the sequential baseline and FCFS-5]
Sequence Alignment Runtime: FCFS-1 v. Sequential Runtime
[Chart: time in seconds (0 to 6,000) against workunits completed (0 to 2,500) for the sequential baseline and FCFS-1]
Sequence Alignment Parallel Runtime Comparison
[Chart: time in seconds (0 to 5,000) against workunits completed (0 to 2,500) for Ant Colony, FCFS-5, and FCFS-1]
Sequence Alignment Speedup Comparison
[Chart: speedup (0 to 1.6) against workunits completed (0 to 2,500) for Ant Colony, FCFS-5, and FCFS-1]
Sequence Alignment Speedup
FCFS-1 shows a steady curve reflecting gradual increase in parallelism due to available tasks and a steady decrease in parallelism as the wavefront passes the largest diagonal of the calculation
FCFS-5 shows a more gradual incline at the beginning of the calculation and steeper decline toward the end
Ant colony shows a steeper incline and gradual decline
Sequence Alignment Speedup
FCFS-5 enjoys less parallelism at the beginning of the calculation due to allocating many workunits to nodes requesting work with such a small pool to draw from
As more workunits become available, FCFS-5's aggressive scheduling works to its advantage until workunits begin to become exhausted again
FCFS-1 is conservative in scheduling throughout
Ant Colony begins conservatively but occasionally sends multiple workunits to a node
This leads to a quicker buildup of generated workunits early in the calculation
Later in the calculation, Ant Colony schedules more aggressively and eventually exhausts the workunit pool similarly to FCFS-5
Sequence Alignment Workunit Distribution
[Chart: workunits (0 to 160) assigned per host under Ant Colony, FCFS-5, and FCFS-1; hosts are the Athlon XP 2.08 GHz, the quad Xeon 1.90 GHz, and the Pentium 4 2.80 GHz and 2.66 GHz workstations]
Sequence Alignment Workunit Distribution
Mostly random distribution due to task dependency dominating the calculation
All three scheduling techniques show little preference for any node based on communication or processing resources
Sequence Alignment Conclusion
Ant colony provides an interesting mix of attributes of both FCFS-1 and FCFS-5 when scheduling this computation
It shares the conservative scheduling of FCFS-1 and later aggressive scheduling similar to FCFS-5
All three parallel computations offer only a slight benefit to the problem due to the task dependency structure
It should be noted that the sequential algorithm would require a machine with 37.3 GB of memory if no memory reduction techniques are used in storing the solution matrix
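The 37.3 GB figure is consistent with storing a full score matrix for sequences of roughly 100,000 bases at 4 bytes per cell; both the per-cell size and the matrix dimensions here are assumptions, sketched only as a plausibility check:

```python
# Back-of-the-envelope check of the 37.3 GB memory figure.
# Assumptions (not stated on the slide): 4 bytes per matrix cell,
# and two sequences of roughly 100,000 bases each.
n = 100_000          # bases per sequence (assumed)
bytes_per_cell = 4   # e.g. a 32-bit alignment score (assumed)

total_bytes = n * n * bytes_per_cell
gib = total_bytes / 2**30
print(f"{gib:.1f} GiB")  # 37.3 GiB
```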
Performance Summary
Ant colony scheduling offers top performance in all three experiments
FCFS-1 and FCFS-5 offer varying performance levels depending on the attributes of the target application
Ant colony adapts to match or better the best of the competing algorithms
All three offer acceptable schedules for the parallel applications without relying on client supplied information
Problems with BOINC
Client-server model suffers from a traffic congestion problem: the server becomes too busy to handle requests
Solution: peer-to-peer model
Peer to Peer Platform
Written in Java
Uses the JXTA toolkit to provide communication services and a peer-to-peer overlay network
The platform provides basic object and messaging primitives to facilitate peer-to-peer application development
Messages between objects can be transparently passed to objects on other peers or to local objects
Each object in an application runs in its own thread
Distributed Scheduling
Applications provide their own scheduling and decomposition of work
Typical pattern of application design:
A factory object generates work units from decomposed job data
Worker objects perform computation
Result objects consolidate and report results
The work factory handles distributing work to cooperating peers and consolidating results
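The factory/worker/result pattern described above can be sketched in a few lines; the real platform is Java on JXTA with each object in its own thread, so the class names (WorkFactory, Worker, ResultCollector) and the single-threaded wiring here are illustrative assumptions, not the platform's actual API:

```python
# Illustrative sketch of the factory/worker/result pattern.
# In the real platform these objects run in their own threads and
# messages can transparently cross to other peers.

class WorkFactory:
    """Decomposes a job into work units."""
    def __init__(self, job_data, unit_size):
        self.units = [job_data[i:i + unit_size]
                      for i in range(0, len(job_data), unit_size)]

class Worker:
    """Performs the computation on one work unit."""
    def process(self, unit):
        return sum(unit)  # stand-in for the real computation

class ResultCollector:
    """Consolidates and reports results."""
    def __init__(self):
        self.results = []
    def report(self, result):
        self.results.append(result)

factory = WorkFactory(list(range(100)), unit_size=25)  # 4 work units
worker, collector = Worker(), ResultCollector()
for unit in factory.units:
    collector.report(worker.process(unit))
print(sum(collector.results))  # 4950, the consolidated result
```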
Distributed Sequence Alignment
Sequence Alignment job begins as a comparison of two complete sequences
Work factory breaks the complete comparison into work units of a minimum size
A work unit can begin processing as soon as its dependencies are available
Initially, only the upper-left corner work unit of the result matrix has all dependencies satisfied
When a work unit is completed, its adjacent work units become eligible for processing
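The dependency rule above (a work unit becomes eligible once its upper and left neighbors are complete) can be sketched as a small eligibility check over the work-unit grid; the grid size and helper names are illustrative:

```python
# Sketch of wavefront eligibility over a grid of work units.
# Unit (r, c) depends on (r-1, c) and (r, c-1); the upper-left
# corner (0, 0) is the only unit eligible at the start.

def eligible(r, c, done):
    """A unit is eligible if both dependencies are complete (or absent)."""
    up_ok   = r == 0 or (r - 1, c) in done
    left_ok = c == 0 or (r, c - 1) in done
    return (r, c) not in done and up_ok and left_ok

N = 4                      # 4x4 grid of work units (illustrative)
done = set()
waves = []
while len(done) < N * N:
    wave = [(r, c) for r in range(N) for c in range(N)
            if eligible(r, c, done)]
    waves.append(wave)
    done.update(wave)      # complete the whole wave, then re-scan

print([len(w) for w in waves])  # [1, 2, 3, 4, 3, 2, 1] -- anti-diagonals
```

The wave sizes grow then shrink along the anti-diagonals, which matches the speedup curves reported earlier: parallelism builds until the wavefront passes the largest diagonal, then tapers off.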
Distributed Sequence Alignment
The distributed application attempts to complete work units in squares of four adjacent work units
As more work units are completed, larger and larger squares of work units become eligible for processing
[Diagram: grid of work units labeled Complete, Eligible, Processing, and Not Ready]
Distributed Sequence Alignment
Peers other than the peer initially starting the job will begin with no work to complete
The initial peer will broadcast a signal signifying availability of eligible work units
Peers will attempt to contact a peer advertising work in order to request work
Distributed Sequence Alignment
A peer with available work will distribute the largest amount of work eligible and mark the work unit as remotely processed
When a peer completes all work in a work unit, it will report to the peer that initially assigned the work
Only the adjacent edges of results necessary for computing new work units are reported, to reduce communication
Complete results are stored at the peer performing the computation
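Reporting only the edges rather than the full block keeps messages small, since downstream units depend only on the boundary values; a sketch of what a completed block would send back, with the block layout and function name as assumptions:

```python
# Sketch: a peer computes a full block of scores locally but reports
# only the edges that downstream work units depend on.

def edges_to_report(block):
    """block is a list of rows (the score matrix for one work unit)."""
    last_row = block[-1][:]                 # needed by the unit below
    last_col = [row[-1] for row in block]   # needed by the unit to the right
    return last_row, last_col

block = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
row, col = edges_to_report(block)
print(row, col)  # [7, 8, 9] [3, 6, 9]
# For a b x b block this sends 2b values instead of b*b;
# the full block stays stored at the computing peer.
```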
Distributed Sequence Alignment
After reporting the completion of a work unit, a peer will seek new work from all peers
Once the initial peer completes all work, the job is done
Peers could then be queried to report maximum scores and alignments
Experiment
Peer to peer algorithm was tested aligning two sequences of 160,000 bases each
Alignments used a typical scoring matrix and gap penalty
The minimum work unit size for decomposition of the total job was 10,000 bases for each sequence resulting in 256 work units
(160,000 / 10,000)² = 256
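The 256 figure is the squared ratio of sequence length to minimum work-unit size, since the result matrix is tiled by every pair of segments:

```python
# Work-unit count for the peer-to-peer experiment:
# each 160,000-base sequence is cut into 10,000-base segments,
# and the result matrix is tiled by every pair of segments.
sequence_length = 160_000
unit_size = 10_000

segments = sequence_length // unit_size   # 16 segments per sequence
work_units = segments ** 2
print(work_units)  # 256
```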
Experiment
The distributed system was executed on a local area network with 2, 4, and 6 fairly similar peers
The results were compared to a sequential implementation of the algorithm run on one of the peers
The sequential implementation does a straightforward computation of the complete matrix
Disk is used to periodically save sections of the matrix, since the complete matrix would exhaust available memory
Runtime
[Chart: runtime (0:00:00–1:12:00) vs. number of nodes (1–6)]
Runtime is reduced from 1 hour and 9 minutes to 28 minutes
This is about 2.4 times faster than sequential
The most dramatic drop is at 2 nodes: 1.75 times faster, at 39 minutes
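The speedup figures follow directly from the reported wall-clock times (1:09 sequential, 0:39 with two nodes, 0:28 with six); the exact ratios below come out slightly above the rounded numbers on the slides:

```python
# Speedup from the reported wall-clock times (in minutes).
sequential = 69   # 1 hour 9 minutes
two_nodes  = 39
six_nodes  = 28

print(round(sequential / two_nodes, 2))  # 1.77, close to the reported 1.75x
print(round(sequential / six_nodes, 2))  # 2.46, reported as about 2.4x
```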
Node Efficiency
[Chart: percent of time working (0–100%) per node for 2-, 4-, and 6-node runs]
The first peer generally has the highest efficiency
Average efficiency drops as more nodes are added
Analysis
Findings are in line with the structure of the sequential problem
Due to dependencies of tasks on previous tasks, many nodes must initially remain idle as some nodes complete dependent tasks
As more work becomes eligible for processing, more nodes can work simultaneously
Later in the computation, fewer work units become eligible as the wavefront passes the largest diagonal and fewer dependent tasks remain
Comparison With CS Model
Previous work performed the same computation with a client-server platform (BOINC)
Aligned two sequences of 100,000 bases each
Work unit size was 2,000 bases for each sequence (2,500 work units)
Performed on 30 nodes
Comparison With CS Model
Direct comparison is difficult:
The previous job was smaller, but its finer granularity increased the number of work units
The previous sequential portions of the alignment appear inefficient compared to the current implementation, based on overall sequential job completion time
Comparison With CS Model
Best comparison is the overall runtime reduction factor
The runtime reduction factor compares distributed completion time with similar sequential completion time
It factors out differing performance of the sequential aspects of the computation
The BOINC implementation achieved a reduction factor of only about 1.2 times sequential
Peer to peer achieved 2.4 with only 1/5 the nodes
Comparison With CS Model
BOINC implementation was impeded by a central server which was using a slower link to most nodes compared to the peer to peer configuration
Peer to peer implementation also shows signs of diminishing returns as more nodes are added
Peer to peer utilizes all participating systems, whereas client-server normally uses a server that does not participate in the computation beyond scheduling
Conclusions
Our scheduling strategy is effective on BOINC
P2P is more effective than the BOINC client-server model
Public computing can solve problems with large computing power requirements and huge memory demands
It can potentially replace supercomputing for certain applications (large grains)
Future Work
Measure the impact of fault tolerance on these scheduling algorithms
Measure the impact of work redundancy and work validation
Continue to benchmark the P2P implementation with more nodes
Implement Ant Colony Scheduling on P2P model
Future Work
Currently a peer only seeks more work when it has completed all of its own work
It does not seek work while waiting for reports from peers to which it has distributed work
Allowing a peer to seek work while it is waiting for other peers may increase utilization
Create a more direct comparison with the client-server model
Additional applications and improvements to the base platform
Thank You
Questions?