UAB Dynamic Tuning of Master/Worker Applications Anna Morajko, Paola Caymes Scutari, Tomàs Margalef, Eduardo Cesar, Joan Sorribes and Emilio Luque Universitat Autònoma de Barcelona Paradyn/Condor Week 2005 March 2005
Outline: Introduction, MATE, Number of workers, Data distribution, Conclusions

Introduction: Application performance
The main goal of parallel/distributed applications is to solve a given problem as fast as possible
Performance is one of the most important issues
Developers must optimize application performance to provide efficient and useful applications

Introduction (II)
Finding bottlenecks and determining their solutions is difficult for parallel/distributed applications
• Many tasks cooperate with each other
• Application behavior may change with the input data or the environment
• The task is especially difficult for non-expert users

MATE: Monitoring, Analysis and Tuning Environment
Dynamic automatic tuning of parallel/distributed applications
[Figure: the MATE loop around a running application. Labels: User, Application development, Source, Application, Execution, Instrumentation, Events, Monitoring, Performance data, Performance analysis, Problem/Solution, Tuning, Modifications, Tool, DynInst.]

MATE (II)
[Figure: MATE architecture on three machines. Labels: Machine 1, Machine 2, Machine 3, pvmd, AC, Analyzer, Task1, Task2, Task3 (each with DMLib), and arrows marked instr., events and modif.]
Components: Application Controller (AC), Dynamic Monitoring Library (DMLib), Analyzer

Analyzer
• Carries out the application performance analysis
• Detects problems “on the fly” and requests changes

Application Controller (AC)
• Controls the execution of the application
• Has a Monitor module to manage instrumentation via DynInst and gather execution information
• Has a Tuner module to perform tuning via DynInst

Dynamic Monitoring Library (DMLib)
• Facilitates the instrumentation and data collection
• Responsible for registration of events

MATE (III)
Automatic performance analysis on the fly
• Find bottlenecks among events by applying a performance model
• Find solutions that overcome bottlenecks
The Analyzer is provided with application knowledge about performance problems
• Information related to one problem is called a tuning technique
• A tuning technique describes a complete performance optimization scenario

MATE (IV)
Each tuning technique is implemented in MATE as a “tunlet”
A tunlet is a C/C++ library dynamically loaded into the Analyzer process. It defines:
• measure points: what events are needed
• performance model: how to determine bottlenecks and solutions
• tuning actions/points/synchronization: what to change, where, and when
[Figure: the Analyzer hosting a Tunlet made of measure points, a performance model, and tuning point/action/sync.]
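
The three parts of a tunlet can be pictured with a small sketch. Real tunlets are C/C++ libraries loaded by the Analyzer; the Python below only illustrates the structure, and every name in it (`Tunlet`, `TuningAction`, `idle_master_model`, the event names, and the "more than half an iteration idle" rule) is a hypothetical placeholder rather than part of MATE.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TuningAction:
    point: str    # what to change, e.g. a variable in the application
    value: float  # the new value to apply
    sync: str     # when it is safe to apply the change (hypothetical label)


@dataclass
class Tunlet:
    measure_points: List[str]  # names of the events the model needs
    performance_model: Callable[[Dict[str, float]], Optional[TuningAction]]

    def evaluate(self, measurements: Dict[str, float]) -> Optional[TuningAction]:
        # Run the model only once every measure point has reported a value.
        if any(name not in measurements for name in self.measure_points):
            return None
        return self.performance_model(measurements)


def idle_master_model(m: Dict[str, float]) -> Optional[TuningAction]:
    # Toy rule (an assumption, not MATE's model): add one worker when the
    # master idles for more than half of an iteration.
    if m["master_idle_time"] > 0.5 * m["iteration_time"]:
        return TuningAction("numworkers", m["numworkers"] + 1, "iteration_start")
    return None


tunlet = Tunlet(
    measure_points=["master_idle_time", "iteration_time", "numworkers"],
    performance_model=idle_master_model,
)
action = tunlet.evaluate(
    {"master_idle_time": 6.0, "iteration_time": 10.0, "numworkers": 4}
)
print(action)
```

Keeping the model as a plain function of the measurements mirrors the separation on the slide: the measure points say what to collect, the model decides, and the action says what to change and when.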

MATE (V)
[Figure: internal structure of the Analyzer. Inputs: events (from DMLibs) via TCP/IP to the Event Collector thread; metadata (from ACs) via TCP/IP. Components: Controller, Event Repository, Application model, DTAPI, Tunlets, AC Proxy. Outputs: instrumentation requests (to the monitor) via TCP/IP; tuning requests (to the tuner) via TCP/IP.]

Number of Workers
Master/Worker paradigm
• An easy concept to understand, but with some bottlenecks
• Example: an inadequate number of workers
  • Too few workers: the master is idle
  • Too many workers: more communication
[Figure: one Master connected to four Workers.]

Number of Workers (II)
[Figure: execution trace of a homogeneous Master/Worker application (homogeneous message size and worker execution time), with Master and Workers timelines.]
Sending tasks from the master to the workers:
if $t_l > \lambda v_i$ then $n\,t_l + \lambda v_i$, else $t_l + \lambda \sum_{i=0}^{n-1} v_i$
where
• $t_l$ = latency
• $\lambda$ = inverse bandwidth
• $v_i$ = size of the tasks sent to worker $i$, in bytes
• $n$ = current number of workers in the application
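
A minimal sketch of the send phase, assuming the expression on this slide reads "if $t_l > \lambda v_i$ then $n\,t_l + \lambda v_i$, else $t_l + \lambda \sum v_i$" and that all tasks have the same size `v` (the homogeneous case in the trace). The function name and the sample values are illustrative only.

```python
def master_send_time(tl, lam, v, n):
    """Time for the master to send one task of v bytes to each of n workers.

    tl  : latency per message
    lam : inverse bandwidth (time per byte)
    v   : task size in bytes (assumed equal for all workers)
    n   : number of workers
    """
    if tl > lam * v:
        # Latency dominates: sends are spaced by tl, plus one last transfer.
        return n * tl + lam * v
    # Transfer time dominates: one latency, then n back-to-back transfers.
    return tl + lam * n * v


print(master_send_time(tl=1.0, lam=0.25, v=2, n=4))   # latency-bound case
print(master_send_time(tl=0.5, lam=0.25, v=4, n=4))   # bandwidth-bound case
```

The two branches make the trade-off visible: with small messages each extra worker costs one more latency, while with large messages each extra worker costs one more transfer.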
Worker computation: $tc_i$, where
• $tc_i$ = time that worker $i$ spends processing a task
Returning results to the master: $t_l + \lambda v_m$, where
• $t_l$ = latency
• $\lambda$ = inverse bandwidth
• $v_m$ = size of the results sent back to the master
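
Number of Workers (III)
[Equations: the total iteration time $T_t$ for $n$ workers, given in two cases selected by a condition on $t_l$, $p$, $V$ and $n$. Only one case is legible in the extracted text:]
$$T_t = n\,t_l + \frac{T_c}{n} + t_l + \frac{\lambda V}{n}$$
Optimal number of workers:
$$N_{opt} = \sqrt{\frac{\lambda V + T_c}{t_l}}$$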
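
As a quick numerical check, the sketch below assumes the iteration time in the latency-bound case is $T_t = (n+1)\,t_l + (\lambda V + T_c)/n$ and that the optimum is $N_{opt} = \sqrt{(\lambda V + T_c)/t_l}$. It reads $V$ as the total data volume exchanged per iteration and $T_c$ as the total computation time of the workers; both readings are inferred from the tunlet's measure points, not stated on this slide. The sample values are arbitrary.

```python
import math


def iteration_time(n, tl, lam, volume, tc):
    # Assumed latency-bound model: (n + 1) latencies plus an equal
    # share of the communication volume and computation per worker.
    return (n + 1) * tl + (lam * volume + tc) / n


def optimal_workers(tl, lam, volume, tc):
    # Setting d(Tt)/dn = tl - (lam*V + Tc)/n^2 to zero gives this n.
    return math.sqrt((lam * volume + tc) / tl)


params = dict(tl=0.5, lam=0.25, volume=8, tc=48)
n_opt = optimal_workers(**params)
print(n_opt)
for n in (9, 10, 11):
    print(n, iteration_time(n, **params))
```

With these values the model's minimum falls at 10 workers, and the times for 9 and 11 workers are both slightly higher, which is the behavior the closed form predicts.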
Number of Workers (IV)

Number of Workers: Tunlet
Measure points:
• The amount of data sent to the workers and received by the master
• The total computational time of workers
• The network overhead and bandwidth
[Figure: event timelines for Machine A (master) and Machine B (worker), marking the entry and exit of the send and receive calls on each machine.]

Number of Workers: Tunlet (II)
Performance function:
• Calculation of the optimal number of workers: $N_{opt} = \sqrt{(\lambda V + T_c)/t_l}$
Tuning actions:
• Change the value of “numworkers” to add or remove as many workers as needed
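
The tuning action can be sketched as a difference between the current and the target number of workers. This assumes $N_{opt} = \sqrt{(\lambda V + T_c)/t_l}$; the function name and the `max_workers` cap (the number of machines available) are hypothetical additions for illustration.

```python
import math


def workers_to_change(current, tl, lam, volume, tc, max_workers):
    """Return how many workers to add (positive) or remove (negative)."""
    n_opt = math.sqrt((lam * volume + tc) / tl)
    # Round to a whole worker count and keep it within [1, max_workers].
    target = min(max(1, round(n_opt)), max_workers)
    return target - current


model = dict(tl=0.5, lam=0.25, volume=8, tc=48)
print(workers_to_change(4, max_workers=32, **model))   # too few workers
print(workers_to_change(12, max_workers=32, **model))  # too many workers
print(workers_to_change(4, max_workers=8, **model))    # capped by the cluster
```

The returned delta is what a tuner would translate into a new value of “numworkers”.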

Experimentation
Example application
• Forest Fire Propagation simulator – Xfire
• Compute-intensive Master/Worker application
• Simulation of the fireline propagation
• Calculates the next position of the fireline from the current fireline position and weather factors, vegetation, etc.
Platform
• Cluster of Pentium 4, 1.8 GHz, SuSE Linux 8.0, connected by a 100 Mb/s network

Experimentation (II)
Load in the system
• We designed different external load patterns
• They simulate the system’s time-sharing
• They allow us to reproduce experiments
Case studies
• Xfire executed with different fixed numbers of workers, without any tuning, introducing external loads
• Xfire executed under MATE, introducing external loads

Experimentation (III)
[Figure: execution time (sec., axis from 0 to 1400) for each case study: Xfire with a fixed number of workers (1, 2, 4, 6, ..., 26) and Xfire under MATE (Xf+MATE), which starts with 1 worker and adapts the number.]
Note that...
• Execution time of Xfire under MATE is close to the best execution times obtained.
• Resources devoted to the application under MATE are used only when they are really needed.

Experimentation (IV)
Statically, the model fits
Dynamically, there are some problems
• Nopt could be extremely high
• The computation power added or removed may not be significant compared with the previous computational power
Solution
• Find a “reasonable” number of workers that defines a trade-off between resource utilization and execution time

Data Distribution
Imbalance problem:
• Heterogeneous computing and communication powers
• Varying amount of distributed work
[Figure: Master and Workers timelines comparing an unbalanced iteration with a balanced iteration.]

Data Distribution (II)
Goal:
• Minimize the idle time by balancing the work among the processes, considering the efficiency of the machines
Performance model
• Factoring scheduling method
• Work is divided into different-size tuples according to the factor

Work size (N) | Number of workers (P) | Factor (f) | Tuples
1000 | 2 | 1 | 500, 500
1000 | 2 | 0.5 | 250, 250, 125, 125, 63, 63, 32, 32, 16, 16, 8, 8, 4, 4, 2, 2, 1, 1
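
A minimal sketch of factoring, assuming the common formulation in which each batch hands every worker a tuple of size ceil(remaining × f / P). The function name is illustrative. Because this version tracks the remaining work exactly, its tuples always sum to N, so the tail can differ slightly from the rounded values listed in the table (it produces 31, 31 where the table shows 32, 32).

```python
import math


def factoring_chunks(total, workers, factor):
    """Split `total` work units into tuples using factoring scheduling."""
    chunks = []
    remaining = total
    while remaining > 0:
        # Every worker gets the same tuple size within one batch.
        size = max(1, math.ceil(remaining * factor / workers))
        for _ in range(workers):
            if remaining == 0:
                break
            chunk = min(size, remaining)  # never hand out more than is left
            chunks.append(chunk)
            remaining -= chunk
    return chunks


print(factoring_chunks(1000, 2, 1.0))
print(factoring_chunks(1000, 2, 0.5))
```

A factor of 1 gives one large tuple per worker (least overhead, least flexibility); smaller factors leave progressively smaller tuples for the end of the iteration, which lets faster workers absorb the imbalance.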

Data Distribution: Tunlet
Measure points:
• The work unit processing time
• The latency and bandwidth
Performance function:
• Calculation of the factor
• The Analyzer simulates the execution considering different factors and then selects the best one
• We are currently working on an analytical model to determine the factor
Tuning actions:
• Change the value of “TheFactorF”
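
The idea of simulating the execution for several factors and keeping the best one can be sketched as follows. This is not the Analyzer's simulator: the greedy dispatch rule (the next tuple goes to the first idle worker), the fixed per-tuple `overhead` standing in for latency and transfer time, and all function names are assumptions made for illustration.

```python
import heapq
import math


def factoring_chunks(total, workers, factor):
    # Each batch gives every worker ceil(remaining * factor / workers) units.
    chunks, remaining = [], total
    while remaining > 0:
        size = max(1, math.ceil(remaining * factor / workers))
        for _ in range(workers):
            if remaining == 0:
                break
            chunk = min(size, remaining)
            chunks.append(chunk)
            remaining -= chunk
    return chunks


def simulate_makespan(chunks, unit_times, overhead):
    """Finish time when each tuple goes to the earliest idle worker.

    unit_times[w] is worker w's processing time per work unit.
    """
    idle = [(0.0, w) for w in range(len(unit_times))]
    heapq.heapify(idle)
    finish = 0.0
    for size in chunks:
        free_at, w = heapq.heappop(idle)
        done = free_at + overhead + size * unit_times[w]
        finish = max(finish, done)
        heapq.heappush(idle, (done, w))
    return finish


def best_factor(total, unit_times, overhead, candidates):
    # Try every candidate factor and keep the one with the shortest run.
    def cost(f):
        chunks = factoring_chunks(total, len(unit_times), f)
        return simulate_makespan(chunks, unit_times, overhead)
    return min(candidates, key=cost)


# One worker three times slower than the other: small tuples pay off.
print(best_factor(1000, [1.0, 3.0], 0.0, [1.0, 0.5]))
# Equal workers and costly messages: one big tuple per worker wins.
print(best_factor(1000, [1.0, 1.0], 10.0, [1.0, 0.5]))
```

The two calls illustrate why the factor is worth tuning at run time: the best value depends on both the heterogeneity of the machines and the communication cost, and both can change while the application runs.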

Experimentation
Example application
• Forest Fire Propagation simulator – Xfire
Platform
• Cluster of Pentium 4, 1.8 GHz, SuSE Linux 8.0, connected by a 100 Mb/s network

Experimentation (II)
Load in the system
• We designed different external load patterns
• They simulate the system’s time-sharing
• They permit us to reproduce experiments
Case studies
• Xfire executed without any tuning
• Xfire, introducing controlled variable external loads
• Xfire executed under MATE, introducing variable external loads

Experimentation (III)
[Figure: execution time (sec., axis from 0 to 18000) versus number of workers (1, 2, 4, 8, 16, 30) for three series: Xfire, Xfire+Load, Xfire+Load+MATE.]
Note that…
• Introducing an extra load increases the execution time.
• Execution with MATE corrects the factor value to improve the execution time.

Conclusions and open lines
Conclusions
• The prototype environment, MATE, automatically monitors, analyzes and tunes running applications
• Practical experiments conducted with MATE and parallel/distributed applications prove that it automatically adapts application behavior to existing conditions at run time
• In particular, MATE is able to tune Master/Worker applications and overcome two possible bottlenecks: the number of workers and data distribution
• Dynamic tuning works and is applicable, effective and useful under certain conditions
Open lines
• Determining the “reasonable” number of workers
• Considering interaction between different tunlets
• Providing the system with other tuning techniques
Thank you…