Co-Scheduling CPU and Storage using Condor and SRMs

Alex Romosan, Derek Wright, Ekow Otoo, Doron Rotem, Arie Shoshani
(Guidance: Doug Olson)
Lawrence Berkeley National Laboratory

Presenter: Arie Shoshani
2
Problem: Running jobs on the Grid

• Grid architecture needs to include components for dynamic reservation & scheduling of:
  • Compute resources – Condor (startd)
  • Storage resources – Storage Resource Managers (SRMs)
  • Network resources – Quality of Service in routers
• Also need to coordinate:
  • The co-scheduling of resources
    • Compute and storage resources only
  • The execution of the co-scheduled resources
    • Need to get DATA (files) into the execution nodes
    • Start the jobs running on nodes that have the right data on them
    • Recover from failures
    • Balance use of nodes
    • Overall optimization – replicate "hot" files
3
General Analysis Scenario

[Diagram: a client at the client's site issues a logical query to a Request Interpreter, which consults a metadata catalog and produces a set of logical files. A request-planning step, using a replica catalog and a Network Weather Service, produces an execution plan with site-specific files. A Request Executer sends requests for data placement and remote computation (an execution DAG) over the network to Sites 1..N; each site runs a Compute Resource Manager (compute engines with disk caches) and a Storage Resource Manager (disk cache, backed by MSS). Result files are returned to the client's site.]
4
Simpler problem: run jobs on multi-node uniform clusters

• Optimize parallel analysis jobs on the cluster
  • Jobs are partitioned into tasks: Job_i: [C_i, {F_ij}, O_i] → {C_i, F_ij, O_ij}
  • Currently using LSF
  • Currently files are NFS mounted – a bottleneck
• Want to run tasks independently on each node
• Want to send tasks to where the files are
• Very important problem for HENP applications
[Diagram: a master node dispatches tasks to worker nodes, each with local disk; files are staged from HPSS.]
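The job-to-task decomposition above can be sketched in a few lines. This is illustrative only; the names (`Task`, `partition`) are made up and not part of the actual system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    """One unit of work: code C_i applied to a single file F_ij, producing O_ij."""
    code: str      # C_i  - the analysis executable (same for every task of a job)
    in_file: str   # F_ij - one input file out of the job's file set
    out_file: str  # O_ij - the per-file output

def partition(job_id: str, code: str, files: list[str]) -> list[Task]:
    """Split Job_i = [C_i, {F_ij}, O_i] into independent per-file tasks."""
    return [Task(code, f, f"{job_id}.out.{k}") for k, f in enumerate(files)]

tasks = partition("job1", "analyze", ["f1.root", "f2.root", "f3.root"])
assert len(tasks) == 3 and tasks[0].in_file == "f1.root"
```

Because each task touches exactly one file, tasks can run independently, which is what lets the scheduler send each task to the node where its file already sits.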
5
SRM is a Service

• SRM functionality
  • Manage space
    • Negotiate and assign space to users
    • Manage "lifetime" of spaces
  • Manage files on behalf of a user
    • Pin files in storage till they are released
    • Manage "lifetime" of files
    • Manage action when pins expire (depends on file types)
  • Manage file sharing
    • Policies on what should reside on a storage resource at any one time
    • Policies on what to evict when space is needed
  • Get files from remote locations when necessary
    • Purpose: to simplify the client's task
  • Manage multi-file requests
    • A brokering function: queue file requests, pre-stage when possible
  • Provide grid access to/from mass storage systems
    • HPSS (LBNL, ORNL, BNL), Enstore (Fermi), JasMINE (JLab), Castor (CERN), MSS (NCAR), …
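As a rough illustration of pinning, lifetimes, and eviction, here is a toy disk cache in Python. It sketches the policy only ("pinned files cannot be evicted; the coldest unpinned file goes first"); it is not the SRM interface, and all names are invented:

```python
class DiskCache:
    """Toy DRM-style disk cache: files are pinned while in use, and only
    unpinned ("cold") files may be evicted when space is needed."""

    def __init__(self, capacity):
        self.capacity = capacity   # abstract size units
        self.used = 0
        self.clock = 0             # logical time, for coldness ordering
        self.files = {}            # name -> {"size", "pins", "last_used"}

    def put(self, name, size):
        while self.used + size > self.capacity:
            if not self._evict_coldest():
                raise RuntimeError("no evictable space")  # everything pinned
        self.clock += 1
        self.files[name] = {"size": size, "pins": 0, "last_used": self.clock}
        self.used += size

    def pin(self, name):
        self.clock += 1
        f = self.files[name]
        f["pins"] += 1
        f["last_used"] = self.clock

    def release(self, name):
        self.files[name]["pins"] -= 1  # evictable again once pins reach 0

    def _evict_coldest(self):
        unpinned = [n for n, f in self.files.items() if f["pins"] == 0]
        if not unpinned:
            return False
        victim = min(unpinned, key=lambda n: self.files[n]["last_used"])
        self.used -= self.files[victim]["size"]
        del self.files[victim]
        return True

cache = DiskCache(capacity=10)
cache.put("a", 6)
cache.put("b", 4)
cache.pin("b")        # "b" is in use and cannot be evicted
cache.put("c", 5)     # forces eviction of the cold, unpinned file "a"
assert "a" not in cache.files and "b" in cache.files and "c" in cache.files
```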
6
Types of SRMs

• Types of storage resource managers
  • Disk Resource Manager (DRM)
    • Manages one or more disk resources
  • Tape Resource Manager (TRM)
    • Manages access to a tertiary storage system (e.g. HPSS)
  • Hierarchical Resource Manager (HRM = TRM + DRM)
    • An SRM that stages files from tertiary storage into its disk cache
• SRMs and file transfers
  • SRMs DO NOT perform file transfers
  • SRMs DO invoke a file transfer service if needed (GridFTP, FTP, HTTP, …)
  • SRMs DO monitor transfers and recover from failures
    • TRM: from/to MSS
    • DRM: from/to network
7
Uniformity of Interface / Compatibility of SRMs

[Diagram: users/applications and Grid middleware clients all talk to the same SRM interface; behind each SRM sits a different storage system – Enstore, JASMine, dCache, CASTOR, or a plain disk cache.]
8
SRMs used in STAR for Robust Multi-file Replication

[Diagram: an HRM client command-line interface (running anywhere) issues HRM-COPY for thousands of files between BNL and LBNL. It gets the list of files, then issues SRM-GET one file at a time. One HRM performs reads, staging files from its mass storage; the other performs writes, archiving the files at the destination. The network transfer between the two disk caches uses GridFTP GET in pull mode. The system recovers from staging failures, file transfer failures, and archiving failures.]
9
File movement functionality: srmGet, srmPut, srmReplicate

Simpler problem: run jobs on multi-node uniform clusters
• Optimize parallel analysis on the cluster
  • Minimize movement of files between cluster nodes
  • Use nodes in the cluster as evenly as possible
  • Automatic replication of "hot" files
  • Automatic management of disk space
  • Automatic removal of cold files (automatic garbage collection)
• Use
  • DRMs for disk management on each node
    • Space & content (files)
  • HRM for access from HPSS
  • Condor for job scheduling on each node
    • startd to run jobs and monitor progress
  • Condor for matchmaking of slots and files
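A greatly simplified picture of matchmaking slots against files; the real system expresses this with Condor class-ads and the Negotiator, whereas the greedy loop and names below are invented for illustration:

```python
def matchmake(tasks, slots):
    """Greedy sketch of matchmaking: a task needing one input file matches
    a slot only if that node already holds the file and has a free CPU."""
    free = {node: s["cpus"] for node, s in slots.items()}
    assignment = {}
    for task, needed_file in tasks.items():
        for node, s in slots.items():
            if needed_file in s["files"] and free[node] > 0:
                assignment[task] = node
                free[node] -= 1
                break  # unmatched tasks simply stay queued
    return assignment

slots = {"n1": {"cpus": 1, "files": {"f1", "f2"}},
         "n2": {"cpus": 2, "files": {"f2"}}}
tasks = {"t1": "f1", "t2": "f2", "t3": "f2"}
assert matchmake(tasks, slots) == {"t1": "n1", "t2": "n2", "t3": "n2"}
```

Note how t2 spills over to n2 once n1's single CPU is taken: file residency constrains the match, but free slots decide among the nodes that qualify.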
12
ArchitectureArchitecture
JDD
DRM startd DRM startd DRM startd
schedd
Collector
Negotiator
FSD
JDD – Job Decomposition Daemon
FSD – File Scheduling Daemon
DRM startd
HRM
HPSS
13
Detail actions (JDD)

• JDD partitions jobs into tasks
  • Job_i: [C_i, {F_ij}, O_i] → {C_i, F_ij, O_ij}
  • JDD constructs 2 files:
    • S(j) – set of tasks (jobs in Condor-speak)
    • S(f) – set of files requested
    • (Also keeps reference counts to files)
• JDD probes all DRMs
  • For files they have
  • For missing files it can schedule requests to HRM
• JDD schedules all missing files
  • Simple algorithm: schedule round-robin to nodes
  • Simply send a request to each DRM
  • DRM removes files if needed and gets the file from HRM
• JDD sends each startd the list of files it needs
  • startd checks with its DRM which of the needed files it has, and constructs a class-ad that lists only the relevant files
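The JDD steps above – building S(j) and S(f) with reference counts, then round-robin scheduling of files no node holds yet – can be sketched as follows (illustrative names only, not the actual JDD code):

```python
from collections import Counter
from itertools import cycle

def decompose(jobs):
    """Build S(j) (all tasks) and S(f) (requested files with reference
    counts) from jobs of the form job_id -> (code, [files])."""
    tasks, refcounts = [], Counter()
    for job_id, (code, files) in jobs.items():
        for f in files:
            tasks.append((job_id, code, f))
            refcounts[f] += 1
    return tasks, refcounts

def schedule_missing(requested, resident, nodes):
    """Round-robin the files that no node holds yet across the nodes;
    each node's DRM would then fetch its assigned files from the HRM."""
    missing = [f for f in requested
               if not any(f in held for held in resident.values())]
    plan = {n: [] for n in nodes}
    for f, n in zip(missing, cycle(nodes)):
        plan[n].append(f)
    return plan

jobs = {"j1": ("analyze", ["f1", "f2"]), "j2": ("analyze", ["f2", "f3"])}
tasks, refs = decompose(jobs)
assert refs["f2"] == 2                      # two tasks want f2

resident = {"n1": {"f1"}, "n2": set()}
plan = schedule_missing(sorted(refs), resident, ["n1", "n2"])
assert plan == {"n1": ["f2"], "n2": ["f3"]}  # f1 already resident
```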
14
Detail actions (FSD)

• FSD queues all tasks with Condor
• FSD checks with Condor periodically on the status of tasks
  • If a task is stuck it may choose to replicate the file (this is where a smart algorithm is needed)
  • File replication can be made from a neighbor node or from HRM
• When startd runs a task, it requests the DRM to pin the file, runs the task, and releases the file
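A minimal sketch of the FSD's stuck-task policy (the slot where the "smart algorithm" would go); the wait threshold, data shapes, and names are all invented for illustration:

```python
def plan_replication(task_waits, task_file, file_nodes, nodes, threshold):
    """For each task that has waited longer than `threshold`, replicate its
    input file to some node that does not hold it yet.  Whether the copy
    comes from a neighbor node or from the HRM is out of scope here."""
    actions = []
    for task in sorted(task_waits):
        if task_waits[task] <= threshold:
            continue                          # task not stuck yet
        f = task_file[task]
        targets = [n for n in nodes if n not in file_nodes[f]]
        if targets:
            actions.append((f, targets[0]))
            file_nodes[f].add(targets[0])     # record the new replica
    return actions

waits = {"t1": 5, "t2": 30}                  # t2 has been stuck a while
files = {"t1": "f1", "t2": "f2"}
replicas = {"f1": {"n1"}, "f2": {"n1"}}
acts = plan_replication(waits, files, replicas, ["n1", "n2"], threshold=10)
assert acts == [("f2", "n2")]
```

Once "f2" exists on n2 as well, the matchmaker has two candidate slots for the stuck task instead of one.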
15
Architecture

[The same diagram as on the previous Architecture slide, annotated: the JDD generates the list of missing files.]
16
Need to develop

• Mechanism for startd to communicate with DRM
  • Recently added to startd
• Mechanism to check the status of tasks
• Mechanism to check that a task is finished, and notify JDD
• Mechanism to check that a job is done, and notify the client
• Develop JDD
• Develop FSD
17
Open Questions (1)

• What if a file was removed by a DRM?
  • In this case, if the DRM does not find the file on its disk, the task gets rescheduled
  • Note: usually, only "cold" files are removed
  • Should DRMs notify the JDD when they remove a file?
• How do you deal with output and the merging of outputs?
  • Need DRMs to be able to schedule durable space
  • Moving files out of the compute node is the responsibility of the user (code)
  • Maybe moving files to their final destination should be a service of this system
18
Open Questions (2)

• Is it best to process as many files on a single system as possible?
  • E.g., one system has 4 files, but the files are also on 4 different systems. Which is better?
  • Conjecture: if the overhead for splitting a job is small, then splitting is optimized by matchmaking
• What if file bundles are needed?
  • A file bundle is a set of files that are needed together
  • Needs more sophisticated class-ads
  • How to replicate bundles?
19
Detail activities

• Development work
  • Design of the JDD and FSD modules
  • Development of software components
  • Use of a real experimental cluster (8 + 1 nodes)
  • Install Condor and SRMs
• Development of an optimization algorithm
  • Represented as a bipartite graph
  • Using network flow analysis techniques
20
Optimizing file replication on the cluster (D. Rotem)*

• Jobs can be assigned to servers subject to the following constraints:
  1. Availability of computation slots on the server; usually these correspond to CPUs
  2. File(s) needed by the job must be resident on the server's disk
  3. Sufficient disk space for storing the job's output
  4. Sufficient RAM
• Goal: maximize the number of jobs assigned to servers while minimizing file replication costs
• In the bipartite graph:
  • An arc between an f-node and an s-node exists if the file is stored on that server
  • The number in an f-node represents the number of jobs that want to process that file
  • The number in an s-node represents the number of available slots on that server
22
File replication converted to a network flow problem

1) The total maximum number of jobs that can be assigned to the servers corresponds to the maximum flow in this network.
2) By the well-known max-flow min-cut theorem, this is also equal to the capacity of a minimum cut (shown in bold edges), where a cut is a set of edges that disconnects the source from the sink.

[Figure: max flow is 11 in this example; the minimum cut is shown in bold. The number on each arc is the MIN between the values of the 2 nodes it connects.]
23
Improving Flow by adding an edge

[Figure: maximum flow improved to 13; the additional edge represents a file replication.]

Problem: find a subset of edges of total minimum cost that maximizes the flow between the source and the sink.
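The effect of a replication edge on the maximum flow can be reproduced with a standard Edmonds-Karp computation on a small network. The graph below is a made-up example (not the one from the slides): the source feeds file nodes (capacity = jobs wanting each file), file-to-server edges encode residency, and server-to-sink edges encode CPU slots:

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp maximum flow on a capacity map {u: {v: capacity}}."""
    flow = 0
    while True:
        # breadth-first search for a shortest augmenting path
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow                  # no augmenting path left
        # walk back along the path, find the bottleneck, update residuals
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= b
            cap.setdefault(v, {}).setdefault(u, 0)
            cap[v][u] += b
        flow += b

def cluster_graph(replication_edge=None):
    # S -> file nodes (jobs wanting each file) -> server nodes -> T (slots)
    cap = {"S":  {"f1": 3, "f2": 2},
           "f1": {"s1": 10},
           "f2": {"s1": 10},
           "s1": {"T": 2},
           "s2": {"T": 3}}
    if replication_edge:
        u, v = replication_edge
        cap.setdefault(u, {})[v] = 10
    return cap

assert max_flow(cluster_graph(), "S", "T") == 2
# replicating f2 onto server s2 adds an edge and raises the maximum flow:
assert max_flow(cluster_graph(("f2", "s2")), "S", "T") == 4
```

Each unit of flow is one job assigned to a slot, so the added edge (one file replication) buys two extra job assignments in this example; the optimization problem on the slide is choosing such edges at minimum total cost.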
24
Solution

• Problem: finding a set of edges of minimum cost that maximizes flow (MaxFlowFixedCost)
• The problem is (strongly) NP-complete
• We use an approximation algorithm that finds a suboptimal solution in polynomial time, called Continuous Maximum Flow Improvement (C-MaxFlowImp), using linear programming techniques
• It can be shown that the solution is bounded relative to the optimal
• This will be implemented as part of the FSD
25
Conclusions

• Combining compute and file resources in class-ads is a useful concept
  • Can take advantage of the matchmaker
• Using DRMs to manage space and the content of space provides:
  • Information for class-ads
  • Automatic garbage collection
  • Automatic staging of missing files from HPSS through HRM
• Minimizing the number of files in class-ads is the key to efficiency
  • Get only the needed files from the DRM
• Optimization can be done externally to Condor by file replication algorithms
  • The network flow analogy provides a good theoretical foundation
• Interaction between Condor and SRMs is through existing APIs
  • Small enhancements were needed in startd and DRMs
• We believe the results can be extended to the Grid, but the cost of replication will vary greatly – need to extend the algorithms