DAS3 Workshop, Delft, 6/6/2007
Grid’5000: Toward an International Computer Science Grid
Franck Cappello, INRIA, Director of Grid’5000
Email: [email protected], http://www.lri.fr/~fci
Agenda
• Motivations
• Grid’5000
• DAS3
• International Computer Science Grid
Large Scale Distributed Systems raise many research challenges
Example: computational steering of a multi-physics application (code coupling, multiple heterogeneous sites, heterogeneous networks, volatility, interactive use, etc.)
LSDS (Grid, P2P, etc.) are complex systems: large-scale, dynamic distributed systems with a deep stack of complicated software.
Main Large Scale Distributed Systems research issues:
security, performance, fault tolerance, scalability, load balancing, coordination, deployment, accounting, data storage, programming, communication protocols, etc.
Research in Large Scale Distributed Systems raises methodological challenges
How can designers of applications, application runtimes, middleware, systems, networking protocols, etc. test and compare
• fault tolerance protocols
• security mechanisms
• networking protocols
• etc.
following a strict scientific procedure, knowing that all these components run simultaneously in a complex and dynamic environment?
Tools for Distributed System Studies
To investigate distributed-system issues, we need:
1) Tools (models, simulators, emulators, experimental platforms)
2) Strong interaction and validation among these research tools
[Figure: a spectrum from abstraction to realism. Math: models of systems, apps, platforms and conditions. Simulation: key system mechanisms, algorithms and application kernels, virtual platforms, synthetic conditions. Emulation: real systems and applications, “in-lab” platforms, synthetic conditions. Live systems: real systems, real applications, real platforms, real conditions.]
Existing Grid Research Tools
• SimGrid3 (France): discrete-event simulation with trace injection; originally dedicated to scheduling studies; single user, multiple servers
• GridSim (Australia): dedicated to scheduling (with deadline, budget); discrete-event simulation in Java; multi-clients, multi-brokers, multi-servers; support for SLAs, Data Grids, service model
• Titech Bricks (Japan): discrete-event simulation for scheduling and replication studies
• GangSim (USA): scheduling inside and between VOs
• MicroGrid (USA): emulator dedicated to Globus; virtualizes resources and time; network emulation (MaSSF)
--> Simulators and emulators are quite slow and do not scale well
What about production platforms used for experimental purposes?
Not reconfigurable:
• Many projects require experiments on OS and networks
• Some projects require the installation of specific hardware
No reproducible experimental conditions:
• Scientific studies require reproducible experimental conditions
Not designed for experiments:
• Many researchers run short, highly parallel & distributed algorithms
• Preparation and execution of experiments are highly interactive
Not optimized for experiments:
• Experimental platforms should exhibit a low utilization rate, to allow researchers to execute large collocated experiments
--> Nowhere to test networking/OS/middleware ideas or to measure real application performance
We need experimental tools
In 2002 (DAS2) and 2003 (Grid’5000), the design and development of an experimental platform for Grid researchers was initiated.
[Figure: research tools placed on axes of log(realism) versus log(cost & coordination). Math: model, protocol proof (reasonable). Simulation: SimGrid, MicroGrid, GridSim, Bricks, NS, etc. (challenging). Emulation: Data Grid eXplorer, WANinLab, Emulab (major challenge). Live systems: Grid’5000, DAS3, PlanetLab, GENI (this talk). Compare RAMP, Dave Patterson’s project on a multicore multi-processor emulator.]
Computer Science Grids
Grid’5000 and DAS:
- Designed by computer scientists, for computer scientists
- Not production platforms for physics or biology: production platforms for computer science
- More than testbeds: researchers share experiences, results and skills, and access an environment with supporting engineers
Is this really original, new?
In parallel computing and HPC --> not really
• Most evaluations of new methods, algorithms and optimizations in parallel computing and HPC are conducted on REAL computers, NOT simulators
• Why: because parallel computers and HPC machines are easy to access, it is easy to build and run a test, and users need trustable results!
In Grid and large-scale distributed systems --> YES
• It is difficult to get access to a Grid or a large-scale distributed system
• Simulators are easy to build
• Results are rarely confronted with reality
Computer Science Grids and experimental facilities for large-scale distributed systems should change the situation!
Agenda
• Motivations
• Grid’5000: design, status, results
• DAS3
• An International CSG
Grid’5000*
www.grid5000.fr: one of the 40+ ACI Grid projects
ACI GRID projects
• Peer-to-peer
– CGP2P (F. Cappello, LRI/CNRS)
• Application Service Provider
– ASP (F. Desprez, ENS Lyon/INRIA)
• Algorithms
– TAG (S. Genaud, LSIIT)
– ANCG (N. Emad, PRISM)
– DOC-G (V-D. Cung, UVSQ)
• Support for dissemination
– ARGE (A. Schaff, LORIA)
– GRID2 (J-L. Pazat, IRISA/INSA)
– DataGRAAL (Y. Denneulin, IMAG)
Thierry Priol
The Grid’5000 Project
1) Building a nationwide experimental platform for large-scale Grid & P2P experiments:
• 9 geographically distributed sites
• Every site hosts a cluster (from 256 to 1K CPUs)
• All sites are connected by RENATER (the French research and education network)
• RENATER hosts probes to trace network load conditions
• Design and develop a system/middleware environment for safely testing and repeating experiments
2) Use the platform for Grid experiments in real-life conditions:
• Port and test applications, develop new algorithms
• Address critical issues of Grid system/middleware: high-performance transport protocols, QoS
• Investigate novel mechanisms: P2P resource discovery, Desktop Grids
Roadmap
[Timeline: June 2003, project funded, discussions and prototypes; 2004, installation of clusters & network; 2005, preparation and calibration, first experiments (1250 then 2000 CPUs); 2006, experiments (2300 CPUs); 2007 (today), ~3000 CPUs, international collaborations (CoreGrid); 2008, 5000 CPUs.]
Grid’5000 foundations: collection of experiments to be done
• Applications
– Multi-parametric applications (climate modeling / functional genomics)
– Large-scale experimentation of distributed applications (electromagnetism, multi-material fluid mechanics, parallel optimization algorithms, CFD, astrophysics)
– Medical images, collaborative tools in virtual 3D environments
• Programming
– Component programming for the Grid (Java, Corba)
– GRID-RPC
– GRID-MPI
– Code coupling
• Middleware / OS
– Scheduling / data distribution in Grids
– Fault tolerance in Grids
– Resource management
– Grid SSI OS and Grid I/O
– Desktop Grid / P2P systems
• Networking
– End-host communication layer (interference with local communications)
– High-performance long-distance protocols (improved TCP)
– High-speed network emulation
• Injection at the network edges
• Stress: high number of clients, servers, tasks, data transfers
• Perturbation: artificial faults (crash, intermittent failure, memory corruptions, Byzantine), rapid platform reduction/increase, slowdowns, etc.
Allow users to run their preferred measurement tools and experimental-condition injectors.
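The perturbation idea above can be sketched in a few lines: a toy injector that makes each task attempt crash with a given probability and retries it, the way a crash/restart condition injector would stress a run. All names here (`run_with_faults`, `crash_prob`) are illustrative, not part of any Grid’5000 tool.

```python
import random

def run_with_faults(tasks, crash_prob, max_retries=10, rng=None):
    """Run task functions under injected crash faults.

    Each attempt fails with probability crash_prob; failed attempts
    are retried, mimicking a crash/restart condition injector.
    Returns the task results and the total number of attempts made.
    """
    rng = rng or random.Random(42)   # fixed seed: reproducible conditions
    results, attempts = [], 0
    for task in tasks:
        for _ in range(max_retries):
            attempts += 1
            if rng.random() < crash_prob:
                continue             # injected crash: retry this task
            results.append(task())
            break
        else:
            raise RuntimeError("task never completed")
    return results, attempts

# 100 trivial tasks under a 20% injected crash rate
results, attempts = run_with_faults(
    [lambda i=i: i * i for i in range(100)], crash_prob=0.2)
```

The fixed seed matters: reproducible experimental conditions are exactly what the slide argues production platforms lack.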
Grid’5000 principle: a highly reconfigurable experimental platform
Let users create, deploy and run their software stack, including the software to test and evaluate, using measurement tools + experimental-condition injectors.
[Figure: the user-replaceable stack (Application, Programming Environments, Application Runtime, Grid or P2P Middleware, Operating System, Networking), flanked by an experimental-conditions injector and measurement tools.]
Experiment workflow
1. Log into Grid’5000; import data/codes
2. Build an environment? If yes: reserve 1 node, reboot it in an existing environment*, adapt the environment, reboot, and repeat until the environment is OK
3. Reserve the nodes corresponding to the experiment
4. Reboot the nodes in the user experimental environment (optional)
5. Transfer params + run the experiment
6. Collect experiment results
7. Exit Grid’5000
*Available on all sites: Fedora4all, Ubuntu4all, Debian4all
Grid’5000 Team
At the national level: 1 Director (D), 1 Technical Director (TD), 1 steering committee, 1 technical committee; plus the ACI Grid Director and the ACI Grid Scientific Committee President.
Per site: 1 Principal Investigator (PI), who is the site's scientific, administrative and financial manager; 1 technical advisor; and about 1 engineer (E).
[Figure: map of the sites, each labeled PI+E; one site also hosts D+TD.]
Grid’5000 as an Instrument
4 main features:
• High security for Grid’5000 and the Internet, despite the deep reconfiguration feature --> Grid’5000 is confined: communications between sites are isolated from the Internet and vice versa (dedicated lambda).
• A software infrastructure allowing users to access Grid’5000 from any Grid’5000 site and have a simple view of the system --> a user has a single account on Grid’5000, and Grid’5000 is seen as a cluster of clusters.
• Reservation/scheduling tools allowing users to select nodes and schedule experiments --> a reservation engine + batch scheduler (1 per site) + OARGrid (a co-reservation scheduling system).
• A user toolkit to reconfigure the nodes --> software image deployment and node reconfiguration tool.
Experiment: Geophysics, seismic ray tracing in a 3D mesh of the Earth
Building a seismic tomography model of the Earth's geology using seismic wave propagation characteristics in the Earth. Seismic waves are modeled from events detected by sensors. Ray tracing algorithm: waves are reconstructed from rays traced between the epicenter and one sensor.
An MPI parallel program composed of 3 steps: 1) master-worker: ray tracing and mesh update by each process, with blocks of rays successively fetched from the master process; 2) all-to-all communications to exchange submesh information between the processes; 3) merging of the cell information of the submesh associated with each process.
Reference: 32 CPUs
Stéphane Genaud, Marc Grunberg, and Catherine Mongenet, IPGS: “Institut de Physique du Globe de Strasbourg”
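A minimal, MPI-free sketch of the three-step structure described above (the real application runs MPI across many CPUs; here the master-worker distribution, the all-to-all exchange and the merge are simulated sequentially, and `trace` is a toy stand-in for the ray-tracing kernel):

```python
from collections import defaultdict

def trace(ray):
    """Toy stand-in for tracing one ray: returns {cell_id: contribution}."""
    return {ray % 10: 1}

def master_worker_rays(rays, n_workers, block=4):
    """Sequential sketch of the application's 3-step structure.

    1) master-worker: workers fetch successive blocks of rays and
       update a local submesh;
    2) all-to-all: every worker's cell updates are exchanged;
    3) merge: the per-cell information is combined.
    """
    # Step 1: blocks of rays handed out to workers (round robin here;
    # the real master serves whichever worker asks next).
    local = [defaultdict(int) for _ in range(n_workers)]
    for b, start in enumerate(range(0, len(rays), block)):
        w = b % n_workers
        for ray in rays[start:start + block]:
            for cell, contribution in trace(ray).items():
                local[w][cell] += contribution
    # Steps 2 and 3: exchange submesh info, then merge per cell.
    merged = defaultdict(int)
    for submesh in local:
        for cell, contribution in submesh.items():
            merged[cell] += contribution
    return dict(merged)

cells = master_worker_rays(list(range(100)), n_workers=4)
```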
Solving the Flow-Shop Scheduling Problem
“One of the hardest challenge problems in combinatorial optimization”
Solving large instances of combinatorial optimization problems using a parallel branch-and-bound algorithm.
Flow-shop:
• Schedule a set of jobs on a set of machines, minimizing the makespan.
• Exhaustive enumeration of all combinations would take several years.
• The challenge is thus to reduce the number of explored solutions.
New Grid exact method based on the branch-and-bound algorithm (Talbi, Melab, et al.), combining new approaches to combinatorial algorithmics, grid computing, load balancing and fault tolerance.
Problem: 50 jobs on 20 machines, optimally solved for the first time, with 1245 CPUs (peak). The optimal solution required a wall-clock time of 25 days.
E. Talbi, N. Melab, 2006
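The branch-and-bound principle behind this result can be illustrated on a toy permutation flow-shop instance. This sketch uses a simple lower bound (partial makespan plus remaining work on the last machine) and plain recursion; the actual Grid method adds grid-level work distribution, load balancing and fault tolerance, none of which is shown here:

```python
import itertools

def makespan(perm, p):
    """Flow-shop completion time of the last job on the last machine."""
    m = len(p[0])
    finish = [0] * m
    for j in perm:
        for k in range(m):
            finish[k] = max(finish[k], finish[k - 1] if k else 0) + p[j][k]
    return finish[-1]

def branch_and_bound(p):
    """Exact permutation flow-shop solver with a simple lower bound:
    partial makespan plus the remaining work on the last machine."""
    n, m = len(p), len(p[0])
    best = {"perm": None, "span": float("inf")}

    def explore(perm, finish, remaining):
        if not remaining:
            if finish[-1] < best["span"]:
                best.update(perm=perm, span=finish[-1])
            return
        for j in remaining:
            f = list(finish)
            for k in range(m):
                f[k] = max(f[k], f[k - 1] if k else 0) + p[j][k]
            lb = f[-1] + sum(p[i][m - 1] for i in remaining if i != j)
            if lb < best["span"]:      # prune branches that cannot win
                explore(perm + [j], f, remaining - {j})

    explore([], [0] * m, frozenset(range(n)))
    return best["perm"], best["span"]

# 3 jobs x 2 machines: p[job][machine] processing times
p = [[3, 2], [1, 4], [2, 1]]
perm, span = branch_and_bound(p)
```

Pruning is what makes the 50x20 instance reachable at all: any branch whose lower bound already meets the incumbent makespan is cut without enumerating its permutations.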
JXTA DHT scalability
• Goals: study of a JXTA “DHT”
– “Rendezvous” peers form the JXTA DHT
– Performance of this DHT?
– Scalability of this DHT?
• Organization of a JXTA overlay (peerview protocol)
– Each rendezvous peer has a local view of other rendezvous peers
– Loosely-consistent DHT between rendezvous peers
– Mechanism for ensuring convergence of local views
• Benchmark: time for local views to converge
• Up to 580 nodes on 6 sites
[Figure: edge peers and rendezvous peers; “rendezvous” peers known by one of the “rendezvous” peers; X axis: time; Y axis: “rendezvous” peer ID.]
G. Antoniu, M. Jan, 2006
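A generic gossip sketch of the convergence benchmark (this is not the JXTA peerview protocol itself, just a toy model of local views merging until every peer knows every other peer; peers start on a ring so the overlay is connected):

```python
import random

def rounds_to_converge(n_peers, seed=1):
    """Toy model of the convergence benchmark: each rendezvous peer
    starts knowing itself and its ring successor; every round, each
    peer merges local views with one peer it already knows.  Returns
    the number of rounds until every local view is complete.
    """
    rng = random.Random(seed)
    views = [{i, (i + 1) % n_peers} for i in range(n_peers)]
    rounds = 0
    while any(len(v) < n_peers for v in views):
        rounds += 1
        for i in range(n_peers):
            j = rng.choice(sorted(views[i]))  # a peer from the local view
            merged = views[i] | views[j]      # symmetric exchange of views
            views[i], views[j] = merged, merged
        if rounds > 1000:                     # safety guard for the sketch
            break
    return rounds

r = rounds_to_converge(32)
```

Even this toy model shows the benchmarked quantity: convergence time grows with the number of rendezvous peers, which is what the up-to-580-node runs measured at scale.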
TCP limits on 10Gb/s links
• In Grids, TCP is used by most applications & libraries (MPI, GridFTP, …)
• Long-distance high-speed networks are challenging for TCP & transport protocols
• Designing new schemes is not straightforward (fairness, stability, friendliness)
• New variants are proposed/implemented in Linux (BIC) & Windows (Compound)
• Experimental investigations in real high-speed networks are highly required
19 flows forward / 19 flows reverse: affects flow performance & global throughput by 20%, global instability.
Without reverse traffic: efficiency, fairness & equilibrium; around 470 Mb/s per flow, 9200 Mb/s global.
With reverse traffic: less efficiency, more instability; around 400 Mb/s per flow, 8200 Mb/s global.
– How to provide this flexibility (across domains)?
– How to integrate optical networks with applications?
[Figure: application, photonic network, management plane and control plane, with the questions: how to get a topology that suits the need? how to communicate with the management plane / control plane? how to drive the changes in the network?]
Projects Similar to DAS3
Optical networks for grids:
– G-lambda (Japan)
– Enlightened (USA): views the network as a grid resource, at the same level as compute and storage resources; a Grid framework for 1) dynamic application requests (computing, storage, high-bandwidth, secure networks) and 2) software tools and protocols (fast network reconfiguration, on-demand or in-advance provisioning of lightpaths); determines how to abstract the network resources and how to distribute the network intelligence among the network control plane, the management plane and the grid middleware
– Phosphorus (EU)
A Prototype of a European Computer Science Grid
DAS3 (1500 CPUs) and Grid’5000 (3000 CPUs) in 2007 (July 27), connected by Renater-Geant-Surfnet (a dedicated lambda at 10G).
Recent long-range connection
Connection of Grid’5000 to the NAREGI R&D lab @ Tokyo: OS reconfiguration on both sides (first tests on transport protocols for large data sets). July 7th, 2007, 1Gbps.
[Map: France, The Netherlands, Japan.]
Conclusion
• Large-scale and highly reconfigurable Grid experimental platforms
• Used by Master's and Ph.D. students, postdocs and researchers (and results are presented in their reports, theses, papers, etc.)
• Grid’5000 and DAS-3 offer together in 2007:
– 19 clusters distributed over 13 sites in France and The Netherlands
– about 10 Gigabit/s (directional) of bandwidth
– the capability for all users to reconfigure the platform (G5K) [protocols/OS/middleware/runtime/application]
– the capability for all users to reconfigure the network topology (DAS-3)
• Grid’5000 and DAS results in 2007:
– 280 users + 200 users
– 12 Ph.D. + 30 Ph.D.
– 300 publications + 11 in ACM/IEEE journals/transactions and 1 in Nature
– 340 planned experiments
– tens of developed and experimented software packages
– participation in tens of research grants (ACI, ANR, STREP, etc.)
• Connection of the two platforms has already started
• Towards an international “Computer Science Grid”!
• More than a testbed: a “Computer Science Large-Scale Instrument”
Questions?
Acknowledgements
Grid’5000: Michel Cosnard, Thierry Priol and Brigitte Plateau. Steering committee: Michel Dayde, Frederic Desprez, Emmanuel Jeannot, Yvon Jegou, Stéphane Lanteri, Nouredine Melab, Raymond Namyst, Olivier Richard, Pascale Vicat-Blanc Primet, Dany Vandromme and Pierre Neyron.
DAS3: Andy Tanenbaum, Bob Hertzberger and Henk Sips. Steering group members: Lex Wolters, Dick Epema, Cees de Laat and Frank Seinstra. Kees Verstoep.
www.grid5000.fr
Grid’5000 versus PlanetLab
• Cluster of clusters: Grid’5000 yes, PlanetLab no
• Distributed PCs: PlanetLab yes
• Capability to reproduce experimental conditions: Grid’5000 yes, PlanetLab no
• Capability for dedicated usage for precise measurement
The objective of this event was to bring together ProActive users, to present and discuss current and future features of the ProActive Grid platform, and to test the deployment and interoperability of ProActive Grid applications on various Grids.
The N-Queens contest (4 teams): the aim was to find the number of solutions to the N-queens problem, with N as big as possible, in a limited amount of time.
The Flowshop contest (3 teams).
2005: 1600 CPUs in total: 1200 provided by Grid’5000 + 50 by the other Grids (EGEE, DEISA, NorduGrid) + 350 CPUs on clusters.
DAS-3: Cluster configurations

              LU            TUD           UvA-VLe       UvA-MN        VU            TOTALS
Head          1             1             1             1             1
 * storage    10TB          5TB           2TB           2TB           10TB          29TB
 * CPU        2x2.4GHz DC   2x2.4GHz DC   2x2.2GHz DC   2x2.2GHz DC   2x2.4GHz DC
 * memory     16GB          16GB          8GB           16GB          8GB           64GB
 * Myri 10G   1             1             1
 * 10GE       1             1             1             1
Compute       32            68            40 (1)        46            85 (11)       271
 * storage    400GB         250GB         250GB         2x250GB       250GB         84TB
 * CPU        2x2.6GHz      2x2.4GHz      2x2.2GHz DC   2x2.4GHz      2x2.4GHz DC   1.9 THz
 * memory     4GB           4GB           4GB           4GB           4GB           1048GB
 * Myri 10G   1             1             1
Myrinet
 * 10G ports  33 (7)        41            47                          86 (2)        320 Gb/s
 * 10GE ports 8             8             8             8
Nortel
 * 1GE ports  32 (16)       136 (8)       40 (8)        46 (2)                      339 Gb/s
 * 10GE ports 1 (1)         9 (3)         2             2             1 (1)
DAS Results
• 200 users in total
• Used for over 32 Ph.D. theses
• Used for many publications, including 11 in ACM/IEEE journals/transactions and 1 in Nature
• Used to solve Awari, a 3500-year-old game
OS Reconfiguration techniques: reboot or virtual machines?
Currently we use reboot, but Xen will be used in the default environment. Let users select their experimental environment: fully dedicated, or shared within a virtual machine.
Reboot:
• Remote control with IPMI, RSA, etc.
• Disc repartitioning, if necessary
• Reboot or kernel switch (Kexec)
Virtual machine:
• No need for reboot
• Virtual machine technology: selection not so easy
• Xen has some limitations:
– Xen 3 in “initial support” status for Intel VT-x
– Xen 2 does not support x86/64
– Many patches not supported
– High overhead on high-speed networks
TCP limits over 10Gb/s links
• Highlighting TCP stream interaction issues on very high-bandwidth links (congestion collapse) and poor bandwidth fairness
• Evaluation of Grid’5000 10Gb/s connections
• Evaluation of TCP variants over Grid’5000 10Gb/s links (BIC TCP, H-TCP, Westwood, …)
Aggregated bandwidth of 9.3 Gb/s over a time interval of a few minutes, then a very sharp drop of the bandwidth on one of the connections.
[Figure: interaction of 10 1Gb/s TCP streams over the 10Gb/s Rennes-Nancy link, during 1 hour.]
TCP limits on 10Gb/s links
[Figure: iperf flows between 1 to N (=40) nodes with 1 or 10GbE interfaces on each side of the Grid5000 10GbE backbone.]
19 flows forward / 19 flows reverse: affects flow performance by 20% & global throughput, instability.
Without reverse traffic: efficiency, fairness & equilibrium; around 470 Mb/s per flow, 9200 Mb/s global.
With reverse traffic: less efficiency, fairness & more instability; around 400 Mb/s per flow, 8200 Mb/s global.
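Post-processing results like these is a small computation. A sketch using Jain's fairness index, with the per-flow rates taken uniformly at the slide's approximate values (the uniformity and the 10 Gb/s capacity figure are assumptions of this sketch):

```python
def fairness_and_efficiency(throughputs_mbps, capacity_mbps=10_000):
    """Jain's fairness index and link efficiency for parallel flows."""
    n = len(throughputs_mbps)
    total = sum(throughputs_mbps)
    jain = total ** 2 / (n * sum(x * x for x in throughputs_mbps))
    return jain, total / capacity_mbps

# 19 flows at the slide's approximate per-flow rates
jain_fwd, eff_fwd = fairness_and_efficiency([470] * 19)  # no reverse traffic
jain_rev, eff_rev = fairness_and_efficiency([400] * 19)  # with reverse traffic
```

With uniform rates the Jain index is 1.0 by construction; on real iperf traces the per-flow rates differ, and the index drops below 1 exactly when fairness degrades.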
• Motivation: evaluation of a fully distributed resource-allocation service (batch scheduler)
• Vigne: unstructured network, flooding (random walk optimized for scheduling)
• Experiment: a bag of 944 homogeneous tasks / 944 CPUs
– Synthetic sequential code (Monte Carlo application)
– Measure of the mean execution time for a task (computation time depends on the resource)
– Measure of the overhead compared with an ideal execution (central coordinator)
– Objective: 1 task per CPU
• Tested configuration: 7 Grid’5000 sites (A to G); for example, the Bordeaux site used 82 CPUs with an execution time of 1740s
• Result:
Submission interval:      5s      10s     20s
Average execution time:   2057s   2042s   2039s
Overhead:                 4.80%   4.00%   3.90%
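The overhead row can be reproduced from the average execution times. Working backwards from the slide's own percentages, the ideal (central-coordinator) execution time is about 1963 s; that value is inferred here, not stated on the slide:

```python
def overhead_pct(measured_s, ideal_s):
    """Relative overhead of the distributed scheduler versus an
    ideal central coordinator, in percent."""
    return 100.0 * (measured_s - ideal_s) / ideal_s

ideal = 1963.0  # inferred from the slide's percentages (assumption)
mean_time = {5: 2057, 10: 2042, 20: 2039}  # submission interval -> seconds
overheads = {k: round(overhead_pct(v, ideal), 1)
             for k, v in mean_time.items()}
```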
Fault-tolerant MPI for the Grid
• MPICH-V: fault-tolerant MPI implementation
• Research context: large-scale fault tolerance
• Research issue: blocking or non-blocking coordinated checkpointing?