Co-Scheduling CPU and Storage using Condor and SRMs

Alex Romosan, Derek Wright, Ekow Otoo, Doron Rotem, Arie Shoshani
(Guidance: Doug Olson)
Lawrence Berkeley National Laboratory

Presenter: Arie Shoshani
2
Problem: Running jobs on the Grid

• Grid architecture needs to include components for dynamic reservation & scheduling of:
  • Compute resources – Condor (startd)
  • Storage resources – Storage Resource Managers (SRMs)
  • Network resources – Quality of Service in routers
• Also need to coordinate:
  • The co-scheduling of resources
    • Compute and storage resources only
  • The execution of the co-scheduled resources
    • Need to get DATA (files) into the execution nodes
    • Start the jobs running on nodes that have the right data on them
    • Recover from failures
    • Balance use of nodes
    • Overall optimization – replicate "hot" files
3
General Analysis Scenario

[Diagram: a client at the client's site issues a logical query to a Request Interpreter, which consults a metadata catalog and produces a set of logical files. A request-planning step, using a replica catalog and a Network Weather Service, produces an execution plan with site-specific files. A Request Executer sends requests for data placement and remote computation (an execution DAG) over the network to Sites 1..N; each site runs a Compute Resource Manager (compute engines with disk caches) and a Storage Resource Manager (disk cache, backed by MSS). Result files are returned to the client's site.]
4
Simpler problem: run jobs on multi-node uniform clusters

• Optimize parallel analysis jobs on the cluster
  • Jobs are partitioned into tasks: Job_i: [C_i, {F_ij}, O_i] → {C_i, F_ij, O_ij}
  • Currently using LSF
  • Currently files are NFS mounted – a bottleneck
• Want to run tasks independently on each node
• Want to send tasks to where the files are
• Very important problem for HENP applications
[Diagram: a master node dispatches tasks to worker nodes, each with local disk; files are staged from HPSS.]
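The job-to-task decomposition above can be sketched in a few lines. This is illustrative only; the names (`Task`, `partition`) are made up and not part of the actual system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    """One unit of work: code C_i applied to a single file F_ij, producing O_ij."""
    code: str      # C_i  - the analysis executable (same for every task of a job)
    in_file: str   # F_ij - one input file out of the job's file set
    out_file: str  # O_ij - the per-file output

def partition(job_id: str, code: str, files: list[str]) -> list[Task]:
    """Split Job_i = [C_i, {F_ij}, O_i] into independent per-file tasks."""
    return [Task(code, f, f"{job_id}.out.{k}") for k, f in enumerate(files)]

tasks = partition("job1", "analyze", ["f1.root", "f2.root", "f3.root"])
assert len(tasks) == 3 and tasks[0].in_file == "f1.root"
```

Because each task touches exactly one file, tasks can run independently, which is what lets the scheduler send each task to the node where its file already sits.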
5
SRM is a Service

• SRM functionality
  • Manage space
    • Negotiate and assign space to users
    • Manage "lifetime" of spaces
  • Manage files on behalf of a user
    • Pin files in storage till they are released
    • Manage "lifetime" of files
    • Manage action when pins expire (depends on file types)
  • Manage file sharing
    • Policies on what should reside on a storage resource at any one time
    • Policies on what to evict when space is needed
  • Get files from remote locations when necessary
    • Purpose: to simplify the client's task
  • Manage multi-file requests
    • A brokering function: queue file requests, pre-stage when possible
  • Provide grid access to/from mass storage systems
    • HPSS (LBNL, ORNL, BNL), Enstore (Fermi), JasMINE (JLab), Castor (CERN), MSS (NCAR), …
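As a rough illustration of pinning, lifetimes, and eviction, here is a toy disk cache in Python. It sketches the policy only ("pinned files cannot be evicted; the coldest unpinned file goes first"); it is not the SRM interface, and all names are invented:

```python
class DiskCache:
    """Toy DRM-style disk cache: files are pinned while in use, and only
    unpinned ("cold") files may be evicted when space is needed."""

    def __init__(self, capacity):
        self.capacity = capacity   # abstract size units
        self.used = 0
        self.clock = 0             # logical time, for coldness ordering
        self.files = {}            # name -> {"size", "pins", "last_used"}

    def put(self, name, size):
        while self.used + size > self.capacity:
            if not self._evict_coldest():
                raise RuntimeError("no evictable space")  # everything pinned
        self.clock += 1
        self.files[name] = {"size": size, "pins": 0, "last_used": self.clock}
        self.used += size

    def pin(self, name):
        self.clock += 1
        f = self.files[name]
        f["pins"] += 1
        f["last_used"] = self.clock

    def release(self, name):
        self.files[name]["pins"] -= 1  # evictable again once pins reach 0

    def _evict_coldest(self):
        unpinned = [n for n, f in self.files.items() if f["pins"] == 0]
        if not unpinned:
            return False
        victim = min(unpinned, key=lambda n: self.files[n]["last_used"])
        self.used -= self.files[victim]["size"]
        del self.files[victim]
        return True

cache = DiskCache(capacity=10)
cache.put("a", 6)
cache.put("b", 4)
cache.pin("b")        # "b" is in use and cannot be evicted
cache.put("c", 5)     # forces eviction of the cold, unpinned file "a"
assert "a" not in cache.files and "b" in cache.files and "c" in cache.files
```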
6
Types of SRMs

• Types of storage resource managers
  • Disk Resource Manager (DRM)
    • Manages one or more disk resources
  • Tape Resource Manager (TRM)
    • Manages access to a tertiary storage system (e.g. HPSS)
  • Hierarchical Resource Manager (HRM = TRM + DRM)
    • An SRM that stages files from tertiary storage into its disk cache
• SRMs and file transfers
  • SRMs DO NOT perform file transfers
  • SRMs DO invoke a file transfer service if needed (GridFTP, FTP, HTTP, …)
  • SRMs DO monitor transfers and recover from failures
    • TRM: from/to MSS
    • DRM: from/to network
7
Uniformity of Interface / Compatibility of SRMs

[Diagram: users/applications and Grid middleware clients all talk to the same SRM interface; behind each SRM sits a different storage system – Enstore, JASMine, dCache, CASTOR, or a plain disk cache.]
8
SRMs used in STAR for Robust Multi-file Replication

[Diagram: an HRM client command-line interface (running anywhere) issues HRM-COPY for thousands of files between BNL and LBNL. It gets the list of files, then issues SRM-GET one file at a time. One HRM performs reads, staging files from its mass storage; the other performs writes, archiving the files at the destination. The network transfer between the two disk caches uses GridFTP GET in pull mode. The system recovers from staging failures, file transfer failures, and archiving failures.]
9
File movement functionality: srmGet, srmPut, srmReplicate

Simpler problem: run jobs on multi-node uniform clusters
• Optimize parallel analysis on the cluster
  • Minimize movement of files between cluster nodes
  • Use nodes in the cluster as evenly as possible
  • Automatic replication of "hot" files
  • Automatic management of disk space
  • Automatic removal of cold files (automatic garbage collection)
• Use
  • DRMs for disk management on each node
    • Space & content (files)
  • HRM for access from HPSS
  • Condor for job scheduling on each node
    • startd to run jobs and monitor progress
  • Condor for matchmaking of slots and files
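A greatly simplified picture of matchmaking slots against files; the real system expresses this with Condor class-ads and the Negotiator, whereas the greedy loop and names below are invented for illustration:

```python
def matchmake(tasks, slots):
    """Greedy sketch of matchmaking: a task needing one input file matches
    a slot only if that node already holds the file and has a free CPU."""
    free = {node: s["cpus"] for node, s in slots.items()}
    assignment = {}
    for task, needed_file in tasks.items():
        for node, s in slots.items():
            if needed_file in s["files"] and free[node] > 0:
                assignment[task] = node
                free[node] -= 1
                break  # unmatched tasks simply stay queued
    return assignment

slots = {"n1": {"cpus": 1, "files": {"f1", "f2"}},
         "n2": {"cpus": 2, "files": {"f2"}}}
tasks = {"t1": "f1", "t2": "f2", "t3": "f2"}
assert matchmake(tasks, slots) == {"t1": "n1", "t2": "n2", "t3": "n2"}
```

Note how t2 spills over to n2 once n1's single CPU is taken: file residency constrains the match, but free slots decide among the nodes that qualify.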
12
ArchitectureArchitecture
JDD
DRM startd DRM startd DRM startd
schedd
Collector
Negotiator
FSD
JDD – Job Decomposition Daemon
FSD – File Scheduling Daemon
DRM startd
HRM
HPSS
13
Detail actions (JDD)

• JDD partitions jobs into tasks
  • Job_i: [C_i, {F_ij}, O_i] → {C_i, F_ij, O_ij}
  • JDD constructs 2 files:
    • S(j) – set of tasks (jobs in Condor-speak)
    • S(f) – set of files requested
    • (Also keeps reference counts to files)
• JDD probes all DRMs
  • For files they have
  • For missing files it can schedule requests to HRM
• JDD schedules all missing files
  • Simple algorithm: schedule round-robin to nodes
  • Simply send a request to each DRM
  • DRM removes files if needed and gets the file from HRM
• JDD sends each startd the list of files it needs
  • startd checks with its DRM which of the needed files it has, and constructs a class-ad that lists only the relevant files
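The JDD steps above – building S(j) and S(f) with reference counts, then round-robin scheduling of files no node holds yet – can be sketched as follows (illustrative names only, not the actual JDD code):

```python
from collections import Counter
from itertools import cycle

def decompose(jobs):
    """Build S(j) (all tasks) and S(f) (requested files with reference
    counts) from jobs of the form job_id -> (code, [files])."""
    tasks, refcounts = [], Counter()
    for job_id, (code, files) in jobs.items():
        for f in files:
            tasks.append((job_id, code, f))
            refcounts[f] += 1
    return tasks, refcounts

def schedule_missing(requested, resident, nodes):
    """Round-robin the files that no node holds yet across the nodes;
    each node's DRM would then fetch its assigned files from the HRM."""
    missing = [f for f in requested
               if not any(f in held for held in resident.values())]
    plan = {n: [] for n in nodes}
    for f, n in zip(missing, cycle(nodes)):
        plan[n].append(f)
    return plan

jobs = {"j1": ("analyze", ["f1", "f2"]), "j2": ("analyze", ["f2", "f3"])}
tasks, refs = decompose(jobs)
assert refs["f2"] == 2                      # two tasks want f2

resident = {"n1": {"f1"}, "n2": set()}
plan = schedule_missing(sorted(refs), resident, ["n1", "n2"])
assert plan == {"n1": ["f2"], "n2": ["f3"]}  # f1 already resident
```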
14
Detail actions (FSD)

• FSD queues all tasks with Condor
• FSD checks with Condor periodically on the status of tasks
  • If a task is stuck it may choose to replicate the file (this is where a smart algorithm is needed)
  • File replication can be made from a neighbor node or from HRM
• When startd runs a task, it requests the DRM to pin the file, runs the task, and releases the file
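A minimal sketch of the FSD's stuck-task policy (the slot where the "smart algorithm" would go); the wait threshold, data shapes, and names are all invented for illustration:

```python
def plan_replication(task_waits, task_file, file_nodes, nodes, threshold):
    """For each task that has waited longer than `threshold`, replicate its
    input file to some node that does not hold it yet.  Whether the copy
    comes from a neighbor node or from the HRM is out of scope here."""
    actions = []
    for task in sorted(task_waits):
        if task_waits[task] <= threshold:
            continue                          # task not stuck yet
        f = task_file[task]
        targets = [n for n in nodes if n not in file_nodes[f]]
        if targets:
            actions.append((f, targets[0]))
            file_nodes[f].add(targets[0])     # record the new replica
    return actions

waits = {"t1": 5, "t2": 30}                  # t2 has been stuck a while
files = {"t1": "f1", "t2": "f2"}
replicas = {"f1": {"n1"}, "f2": {"n1"}}
acts = plan_replication(waits, files, replicas, ["n1", "n2"], threshold=10)
assert acts == [("f2", "n2")]
```

Once "f2" exists on n2 as well, the matchmaker has two candidate slots for the stuck task instead of one.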
15
Architecture

[The same diagram as on the previous Architecture slide, annotated: the JDD generates the list of missing files.]
16
Need to develop

• Mechanism for startd to communicate with DRM
  • Recently added to startd
• Mechanism to check the status of tasks
• Mechanism to check that a task is finished, and notify JDD
• Mechanism to check that a job is done, and notify the client
• Develop JDD
• Develop FSD
17
Open Questions (1)

• What if a file was removed by a DRM?
  • In this case, if the DRM does not find the file on its disk, the task gets rescheduled
  • Note: usually, only "cold" files are removed
  • Should DRMs notify the JDD when they remove a file?
• How do you deal with output and the merging of outputs?
  • Need DRMs to be able to schedule durable space
  • Moving files out of the compute node is the responsibility of the user (code)
  • Maybe moving files to their final destination should be a service of this system
18
Open Questions (2)

• Is it best to process as many files on a single system as possible?
  • E.g., one system has 4 files, but the files are also on 4 different systems. Which is better?
  • Conjecture: if the overhead for splitting a job is small, then splitting is optimized by matchmaking
• What if file bundles are needed?
  • A file bundle is a set of files that are needed together
  • Needs more sophisticated class-ads
  • How to replicate bundles?
19
Detail activities

• Development work
  • Design of the JDD and FSD modules
  • Development of software components
  • Use of a real experimental cluster (8 + 1 nodes)
  • Install Condor and SRMs
• Development of an optimization algorithm
  • Represented as a bipartite graph
  • Using network flow analysis techniques
20
Optimizing file replication on the cluster (D. Rotem)*

• Jobs can be assigned to servers subject to the following constraints:
  1. Availability of computation slots on the server; usually these correspond to CPUs
  2. File(s) needed by the job must be resident on the server's disk
  3. Sufficient disk space for storing the job's output
  4. Sufficient RAM
• Goal: maximize the number of jobs assigned to servers while minimizing file replication costs
• In the bipartite graph:
  • An arc between an f-node and an s-node exists if the file is stored on that server
  • The number in an f-node represents the number of jobs that want to process that file
  • The number in an s-node represents the number of available slots on that server
22
File replication converted to a network flow problem

1) The total maximum number of jobs that can be assigned to the servers corresponds to the maximum flow in this network.
2) By the well-known max-flow min-cut theorem, this is also equal to the capacity of a minimum cut (shown in bold edges), where a cut is a set of edges that disconnects the source from the sink.

[Figure: max flow is 11 in this example; the minimum cut is shown in bold. The number on each arc is the MIN between the values of the 2 nodes it connects.]
23
Improving Flow by adding an edge

[Figure: maximum flow improved to 13; the additional edge represents a file replication.]

Problem: find a subset of edges of total minimum cost that maximizes the flow between the source and the sink.
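The effect of a replication edge on the maximum flow can be reproduced with a standard Edmonds-Karp computation on a small network. The graph below is a made-up example (not the one from the slides): the source feeds file nodes (capacity = jobs wanting each file), file-to-server edges encode residency, and server-to-sink edges encode CPU slots:

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp maximum flow on a capacity map {u: {v: capacity}}."""
    flow = 0
    while True:
        # breadth-first search for a shortest augmenting path
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow                  # no augmenting path left
        # walk back along the path, find the bottleneck, update residuals
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= b
            cap.setdefault(v, {}).setdefault(u, 0)
            cap[v][u] += b
        flow += b

def cluster_graph(replication_edge=None):
    # S -> file nodes (jobs wanting each file) -> server nodes -> T (slots)
    cap = {"S":  {"f1": 3, "f2": 2},
           "f1": {"s1": 10},
           "f2": {"s1": 10},
           "s1": {"T": 2},
           "s2": {"T": 3}}
    if replication_edge:
        u, v = replication_edge
        cap.setdefault(u, {})[v] = 10
    return cap

assert max_flow(cluster_graph(), "S", "T") == 2
# replicating f2 onto server s2 adds an edge and raises the maximum flow:
assert max_flow(cluster_graph(("f2", "s2")), "S", "T") == 4
```

Each unit of flow is one job assigned to a slot, so the added edge (one file replication) buys two extra job assignments in this example; the optimization problem on the slide is choosing such edges at minimum total cost.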
24
Solution

• Problem: finding a set of edges of minimum cost that maximizes flow (MaxFlowFixedCost)
• The problem is (strongly) NP-complete
• We use an approximation algorithm that finds a suboptimal solution in polynomial time, called Continuous Maximum Flow Improvement (C-MaxFlowImp), using linear programming techniques
• It can be shown that the solution is bounded relative to the optimal
• This will be implemented as part of the FSD
25
Conclusions

• Combining compute and file resources in class-ads is a useful concept
  • Can take advantage of the matchmaker
• Using DRMs to manage space and the content of space provides:
  • Information for class-ads
  • Automatic garbage collection
  • Automatic staging of missing files from HPSS through HRM
• Minimizing the number of files in class-ads is the key to efficiency
  • Get only the needed files from the DRM
• Optimization can be done externally to Condor by file replication algorithms
  • The network flow analogy provides a good theoretical foundation
• Interaction between Condor and SRMs is through existing APIs
  • Small enhancements were needed in startd and DRMs
• We believe the results can be extended to the Grid, but the cost of replication will vary greatly – need to extend the algorithms