A Data Driven Analysis of Clusters using UPPAAL

Jón Friðrik Jónatansson
Master of Science
June 2017
School of Computer Science
Reykjavík University

M.Sc. Thesis


A Data Driven Analysis of Clusters using UPPAAL

by

Jón Friðrik Jónatansson

Thesis of 60 ECTS credits submitted to the School of Computer Science
at Reykjavík University in partial fulfillment
of the requirements for the degree of
Master of Science (M.Sc.) in Computer Science

June 2017

Supervisors:

Anna Ingólfsdóttir, Supervisor
Professor, Reykjavík University, Iceland

Luca Aceto, Supervisor
Professor, Reykjavík University, Iceland

Examiners:

Brian Nielsen, Examiner
Associate Professor in Computer Science, Aalborg University, Denmark

Mohammad Hamdaqa, Examiner
Assistant Professor, Reykjavík University, Iceland


Copyright
Jón Friðrik Jónatansson

June 2017


A Data Driven Analysis of Clusters using UPPAAL

Jón Friðrik Jónatansson

June 2017

Abstract

With the high demand for computation power, the popularity of clusters of computers has risen throughout the years. Clusters are reactive systems where, using some resource management system, the resources of the cluster are allocated to user-submitted jobs. Model checking has proven to be an effective way of analysing reactive systems. This thesis aims to analyse the performance of deCODE Genetics's cluster using UPPAAL, a model checking tool, and to create a framework to compare the efficiency of different scheduling algorithms on differently sized clusters using SLURM as their resource management system. The framework is parameterized so that other scheduling algorithms and cluster sizes can be added, and the settings of SLURM can be changed. Although others have created such frameworks for analysing schedulability, they do not provide sufficient control over the parameters concerning the resource management systems. The framework presented in this thesis offers a more specialized solution for analysing clusters using SLURM as their resource management system.

Five scheduling algorithms are analysed using the framework proposed in the thesis: First-Come-First-Served (FCFS), Shortest-Job-First (SJF), Round-Robin (RR), Shortest-Remaining-Time-First (SRTF) and the SLURM Backfilling algorithm. The scheduling algorithms were modelled and verified using synthetic analysis in UPPAAL. After verifying the behaviour of the models, dynamic spawning of jobs was added to the models, where resource requirements and job durations were set using data collected from former jobs in deCODE's cluster. Two analyses were then performed using these models. The first analysis compared the different algorithms, with focus on resource utilization, throughput and waiting time, on three different sizes of clusters under three different job loads. The results of the analysis confirm the effectiveness of SLURM's backfilling algorithm. In the second analysis a job trace was collected from deCODE's SLURM database and compared to the same job trace run in the models. The results of the second analysis show that certain abstractions made in the models do not conform with the behaviour of SLURM. These abstractions concern the time SLURM uses to perform scheduling attempts and other computations, and must be added in future work.

A Data Driven Analysis of Clusters using UPPAAL

Jón Friðrik Jónatansson

June 2017

Útdráttur

With the increased demand for computation power in recent years and decades, the popularity of computer clusters has grown. When clusters are used, jobs are submitted to a resource management system, which ensures that the cluster executes them: it decides the order in which the jobs are processed and allocates the cluster's resources to them. One way to evaluate how efficient such management systems are is to build a model that assesses their effectiveness at managing jobs on a computer cluster. This project uses the methodology of model checking to build and verify such models. The goal of the project is to evaluate the performance of the computer cluster of deCODE Genetics (Íslensk Erfðagreining) and to present a framework for comparing the performance of different scheduling algorithms on differently sized clusters that use SLURM as their management system. The framework is designed so that it is easy to change the settings of the scheduling algorithms, the size of the cluster and the configuration of SLURM. Attempts have been made to build comparable frameworks for evaluating the efficiency of management systems, but these have not given users sufficient influence over the parameters of the management system. The framework put forward in this project has the advantage of offering specialized support for analysing clusters that use the SLURM management system.

Five scheduling algorithms are analysed: First-Come-First-Served (FCFS), Shortest-Job-First (SJF), Round-Robin (RR), Shortest-Remaining-Time-First (SRTF) and, finally, the SLURM Backfilling algorithm. The algorithms were modelled in UPPAAL's mathematically precise modelling language, and the behaviour of the models was then verified with UPPAAL's query language using synthetic analysis. Once the behaviour of the models had been verified, dynamic spawning of jobs was added, with the resource needs and durations of the jobs set according to the data collected from deCODE's cluster. Two analyses were then performed on the models. The first compared the different algorithms, in terms of resource utilization, throughput and waiting time, on three cluster sizes under three different loads; its results demonstrate the effectiveness of SLURM's backfilling algorithm. In the second, a job trace was collected from deCODE's SLURM database and compared with equivalent runs analysed in the models; its results show that, because certain behaviour was abstracted away in the models, they do not follow the behaviour of SLURM. This abstraction concerns the time SLURM takes to manage the job queue and must be added to the models in the future.


A Data Driven Analysis of Clusters using UPPAAL

Jón Friðrik Jónatansson

Thesis of 60 ECTS credits submitted to the School of Computer Science
at Reykjavík University in partial fulfillment of
the requirements for the degree of
Master of Science (M.Sc.) in Computer Science

June 2017

Student:

Jón Friðrik Jónatansson

Supervisors:

Anna Ingólfsdóttir

Luca Aceto

Examiners:

Brian Nielsen

Mohammad Hamdaqa


Contents

Contents
List of Figures
List of Tables
Listings
1 Introduction
2 Related Work
3 Background Knowledge
   3.1 General definitions
      3.1.1 Clusters
      3.1.2 Modelling
   3.2 UPPAAL
      3.2.1 UPPAAL SMC
   3.3 SLURM
      3.3.1 Resource Allocation
      3.3.2 Slurm Simulator
   3.4 Scheduling and Scheduling Algorithms
      3.4.1 A Process
      3.4.2 Scheduling Criteria
      3.4.3 Scheduling algorithms
4 Methods
   4.1 Abstractions
   4.2 Data Structures and Model Behaviour
      4.2.1 Structures
      4.2.2 Cluster behaviour
   4.3 Models
      4.3.1 FCFS
      4.3.2 SJF
      4.3.3 RR
      4.3.4 SRTF
      4.3.5 Backfilling
5 Experimental Design
   5.1 Criteria
   5.2 Analysis One
      5.2.1 The Environment Template
      5.2.2 Model Settings
   5.3 Analysis Two
      5.3.1 Model Settings
6 Results
   6.1 Results of Analysis one
      6.1.1 Throughput
      6.1.2 Average Waiting Time
      6.1.3 Average Residence Time
      6.1.4 Resource Utilization
      6.1.5 Conclusion of Analysis One
   6.2 Results of Analysis two
      6.2.1 Throughput and Average Waiting Time
      6.2.2 Conclusion of Analysis two
7 Conclusion
Bibliography
A Code
   A.1 UPPAAL Functions
   A.2 Cluster Sizes in Analysis 1
      A.2.1 C10
      A.2.2 C50
      A.2.3 C100

List of Figures

3.1 A typical cluster architecture.
3.2 UPPAAL example.
3.3 State space satisfying invariantly ϕ.
3.4 State space satisfying eventually ϕ.
3.5 State space satisfying leads to property.
3.6 Slurm system overview.
3.7 The states of a process.
3.8 Example of a cluster using FCFS algorithm with 3 processes.
3.9 Example of a cluster using SJF algorithm with 4 processes.
3.10 Cluster using Backfilling algorithm on a cluster with 2 available CPUs.
3.11 Cluster using Backfilling algorithm on a cluster with 3 available CPUs.
3.12 Example of a cluster using RR algorithm with 3 processes.
3.13 Example of a cluster using SRTF algorithm with 4 processes.
4.1 A cluster template for FCFS and SJF clusters.
4.2 The job template for FCFS and SJF algorithm.
4.3 The cluster template for RR and SRTF algorithms.
4.4 Job template with preemption for RR and SRTF algorithm.
4.5 Cluster template for SLURM backfilling algorithm.
4.6 Backfilling template.
5.1 An example of a simulation.
5.2 A Gantt chart for simulations.
5.3 An example of the resource utilization of a model.
5.4 The Environment Process.
5.5 Length of jobs in deCODE's cluster.
5.6 Length of jobs in deCODE's cluster, zoomed in.
5.7 Spawn rate of jobs in deCODE's cluster.
5.8 UPPAAL model with spawn rate λ = 2.5469.
5.9 CPU requirements in deCODE's cluster.
5.10 RAM requirements in deCODE's cluster.
5.11 The Environment Process in Analysis two.
5.12 A non-preemptive Job Process that can have a runtime larger than 32767.
6.1 A Gantt chart showing the simulation of job runs in SLURM Simulator.
6.2 A Gantt chart showing the run of FCFS in analysis two.
6.3 A Gantt chart showing the run of SJF in analysis two.
6.4 A Gantt chart showing the run of Backfilling in analysis two.

List of Tables

4.1 SJF starvation example
5.1 The different cluster sizes
5.2 Model Settings
5.3 Node sizes for Analysis two
5.4 Analysis two model settings
6.1 Throughput for FCFS on different sized clusters.
6.2 Throughput for SJF on different sized clusters.
6.3 Throughput for RR on different sized clusters.
6.4 Throughput for SRTF on different sized clusters.
6.5 Throughput for Backfilling on different sized clusters.
6.6 Average Waiting Time for FCFS on different sized clusters.
6.7 Average Waiting Time for SJF on different sized clusters.
6.8 Average Waiting Time for RR on different sized clusters.
6.9 Average Waiting Time for SRTF on different sized clusters.
6.10 Average Waiting Time for Backfilling on different sized clusters.
6.11 Average Residence Time for FCFS on different sized clusters.
6.12 Average Residence Time for SJF on different sized clusters.
6.13 Average Residence Time for RR on different sized clusters.
6.14 Average Residence Time for SRTF on different sized clusters.
6.15 Average Residence Time for Backfilling on different sized clusters.
6.16 Resource Utilization for FCFS on different sized clusters.
6.17 Resource Utilization for SJF on different sized clusters.
6.18 Resource Utilization for RR on different sized clusters.
6.19 Resource Utilization for SRTF on different sized clusters.
6.20 Resource Utilization for Backfilling on different sized clusters.
6.21 The resource utilization difference relation between FCFS and Backfilling
6.22 The throughput and average waiting time of the models in analysis two
6.23 Average waiting time in SLURM and the Backfilling model.
6.24 The probabilities of getting a similar average waiting time as in SLURM.

Listings

4.1 The JobInfo structure.
4.2 The nodes structure.
4.3 The ShadowTime structure.
4.4 The enqueue function for the cluster template.
4.5 The dequeue function of the cluster.
4.6 The checkResources() function in the cluster.
4.7 The allocation function of the cluster.
4.8 The deallocation function of the cluster.
4.9 The schedule function for FCFS.
4.10 The schedule function of SJF.
4.11 The isShortestJob function.
4.12 The ascendingOrder function.
4.13 The schedule function for clusters using SRTF.
4.14 The getNextJob function for Backfilling.
4.15 The setMinTime function for Backfilling algorithm.
5.1 Query to measure Throughput of model.
5.2 Query to measure average waiting and turnaround time.
5.3 Query to measure the resource utilization of a model.
A.1 Uppaal code with column alignment.
A.2 Function that decides what action the cluster should take next and when in Backfilling.
A.3 The resources available on C10 cluster.
A.4 The resources available on C50 cluster.
A.5 The resources available on C100 cluster.


Chapter 1

Introduction

The demand for high-performance computing is constantly growing in today's society. Due to cost efficiency and high performance, many companies and research institutions are using clusters. Clusters are collections of computers that are connected through high-speed networks and can be interacted with as a single computer through some middleware. One such company is deCODE Genetics, a biopharmaceutical company that focuses on researching and analysing the human genome. Their operations include using genealogical resources to identify genetic risk factors with the intention of developing new ways of diagnosing, preventing and treating diseases [1]. DeCODE's cluster is utilized every day by their employees to aid in their research, and deCODE is interested in gaining knowledge about the use of its cluster in order to optimize its performance. This thesis presents a data-driven analysis of the performance and resource utilization of deCODE's cluster, using data gathered from former job allocations on the cluster, where different algorithms are compared to the existing algorithm utilized at deCODE. Clusters are large and complicated parallel systems that can behave in various ways. The research aims at providing information that can be of value for deCODE in determining ways to optimize the performance and resource utilization of their clusters, thereby potentially improving efficiency and decreasing waiting times for all users of the cluster.

Clusters are reactive systems in the sense that they react to outside stimuli from the environment [2]. The environment, in this case, consists of the users of the system, who submit jobs to the cluster. Model checking is a powerful tool for verifying the correctness of reactive systems. The thesis shows how model checking can be used to analyse the efficiency of clusters employing five different scheduling algorithms using UPPAAL [3]. The scheduling algorithms analysed in the thesis are First-Come-First-Served (FCFS), Shortest-Job-First (SJF), Round-Robin (RR), Shortest-Remaining-Time-First (SRTF) and SLURM Backfilling.

UPPAAL is a modelling and verification tool suite developed in collaboration between Aalborg University and Uppsala University. In UPPAAL, systems are modelled using Timed Automata, a formal, mathematically precise language, and verified using a subset of TCTL logic [4]. In this thesis, each algorithm was modelled separately and the behaviour of the models was then verified using UPPAAL's verification tool. After the desired behaviour of the algorithms had been verified, dynamic spawning of jobs was added to the models. The average spawn rate and resource usage of the spawned jobs were gathered from the data at deCODE and added to an environment process in the models that spawned the jobs according to these values. Two analyses were then performed on the models using the statistical model checker UPPAAL SMC. The first analysis simulated the models on differently sized clusters and compared the performance of each algorithm with regard to throughput, average waiting time and resource utilization. Analysis two assessed the correctness of the models with respect to the data gathered on deCODE's cluster.
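The dynamic job spawning described above can be illustrated with a small sketch. This is plain Python, not the thesis's UPPAAL environment process; the assumption of exponentially distributed inter-arrival gaps (as UPPAAL SMC's rate annotations suggest) and the function name `spawn_times` are illustrative, though the rate 2.5469 is the spawn rate the thesis reports for deCODE's data.

```python
import random

def spawn_times(rate, n, seed=0):
    """Return n arrival timestamps with Exp(rate) inter-arrival gaps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)  # time until the next job is spawned
        times.append(t)
    return times

# Sample five job arrivals at the spawn rate measured on deCODE's cluster.
arrivals = spawn_times(2.5469, 5)
```

In the actual models the environment process additionally draws CPU and RAM requirements and a duration for each spawned job from the collected data.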

The results of analysis one show that clusters using the Backfilling algorithm have a significantly higher throughput than clusters using the other algorithms. Backfilling also offers a lower average waiting time than FCFS and RR. However, SJF and SRTF have lower average waiting times than Backfilling. This is because SJF and SRTF are not fair algorithms, as they always schedule the shortest job in the queue, which can result in starvation for larger jobs. The results of analysis two show that, due to certain abstractions, some of the quantitative results we obtain from the analysis of the models are not in agreement with the data gathered on deCODE's cluster. These abstractions are presented in section 4.1 and must be relaxed in future work if this framework is to be used for further quantitative analysis. The results of analyses one and two are discussed in greater detail in chapter 6.
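The waiting-time trade-off between FCFS and SJF can be seen on a toy example. This is a minimal single-resource sketch, not the thesis's UPPAAL models; the job lengths are made up for illustration.

```python
def avg_wait(durations, order):
    """Average waiting time on a single resource; all jobs arrive at t=0."""
    t, waits = 0, []
    for i in order:
        waits.append(t)     # job i waits until all jobs scheduled before it finish
        t += durations[i]
    return sum(waits) / len(waits)

durations = [10, 1, 2, 1]                        # toy job lengths
fcfs = list(range(len(durations)))               # submission order
sjf = sorted(fcfs, key=lambda i: durations[i])   # shortest job first

avg_wait(durations, fcfs)  # FCFS: (0 + 10 + 11 + 13) / 4 = 8.5
avg_wait(durations, sjf)   # SJF:  (0 + 1 + 2 + 4) / 4 = 1.75
```

SJF cuts the average waiting time sharply, but the long job is always scheduled last; if short jobs keep arriving, it may never run, which is the starvation tendency noted above.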

The structure of the thesis is as follows. First, some related work is discussed in chapter 2. Next, some essential background knowledge about clusters, UPPAAL and the chosen scheduling algorithms, as well as the criteria for the analysis, is presented in chapter 3. Thereafter, in chapter 4, this knowledge is applied: the model design is shown for all the scheduling algorithms, as well as their verification. This is followed by the experimental design, in chapter 5, where the setup for the two analyses is explained in detail. The results of the two analyses are then presented in chapter 6. Finally, the content of the thesis is summarized in the conclusion, where we discuss our findings and point out directions for future research.


Chapter 2

Related Work

Analysing the efficiency of scheduling algorithms and clusters has been a focus point of researchers for years. However, few have used model checking to create and verify the systems before performing their analysis. This chapter presents former work related to analysing and modelling clusters, starting with two papers presenting analyses performed on clusters without the use of modelling. This is followed by work where frameworks were created to analyse the schedulability of systems using UPPAAL and UPPAAL SMC.

Much research has been done on the analysis of clusters. Qureshi, Shah and Manuel present, in their paper "Empirical performance evaluation of schedulers for cluster of workstations" [5], an analysis of popular Resource Management Systems (RMS); SLURM is one of the RMSs chosen for their analysis. The criteria chosen for their evaluation include CPU utilization, memory utilization, throughput, turnaround time and the average residence time of jobs in the RMS. These are the criteria that will be analysed in this thesis. Another tool used to analyse data clusters using SLURM as the RMS is the SLURM simulator, presented in the paper "Simulation of batch scheduling using real production-ready software tools" [6] and in [7]. The SLURM simulator is a tool for simulating job runs on clusters of any size with various scheduling algorithms, using job traces created either from former runs or from random values. The tool is used in this thesis to compare the runs of the models to real-life runs. It emulates the duration of former jobs on a cluster without performing the jobs' computations, and users can add their own scheduling algorithms to SLURM and simulate jobs using them. This thesis presents a more accurate comparison of scheduling algorithms by modelling the systems and verifying that their behaviour is as intended before performing the analysis.

Some frameworks have already been built using UPPAAL and UPPAAL SMC to analyse the schedulability of systems. The following papers present two such frameworks.

In the paper "Schedulability analysis using Uppaal: Herschel-Planck case study" [8], a framework for schedulability analysis is presented. The framework assumes a single-CPU system and was modelled and verified using symbolic analysis in UPPAAL. The work includes a case study of the schedulability of the Herschel-Planck satellite system, where a priority scheduling algorithm is analysed with respect to resource utilization and response time, and to check whether deadlines are met. This thesis also presents a framework intended to analyse scheduling, but for clusters using SLURM as the RMS, with multiple nodes with varying available resources. Symbolic analysis is used to verify the behaviour of the models, but due to the size of the models, statistical model checking in UPPAAL SMC is used for simulation and the spawning of jobs. Boudjadar et al. [9] present a framework for analysing the schedulability of hierarchical scheduling systems, modelled in UPPAAL. The framework parametrization used in that paper offers a way for specific scheduling applications and scheduling policies to be implemented in the framework. The framework is then verified using UPPAAL and simulated using UPPAAL SMC. In the analysis given in [9], the required resources for each component are supplied using non-determinism. The paper also presents a use case of an aviation system, where the framework is used to analyse the schedulability of the system. This thesis presents an alternative framework, where the focus is on analysing the utilization of resources on clusters, as well as on measuring the average waiting time and throughput, using different scheduling algorithms and differently sized clusters.

The paper "Checking & Distributing Statistical Model Checking" proposes a framework for distributed statistical model checking of networks of priced timed automata [10]. To achieve a high confidence level that certain properties hold in statistical model checking, multiple simulations must be performed on the models; for large models this may be extremely time consuming. The paper proposes a framework in which large models can be verified, using statistical model checking, by running multiple simulations in parallel on PC clusters or grid computers, where a single master process collects simulation results from one or more slave processes that run the simulations. This approach is not utilized in the analysis of the models in this thesis, since the simulations are not used to obtain a probability of some property holding in the models. Instead, static models were created and verified to behave correctly before being converted to dynamically spawn jobs with resource needs and durations similar to those on deCODE's cluster, and the simulations were then conducted on these models. However, the approach presented in [10] could be useful in future work for further analysing certain behaviours of the models created in this thesis.
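The master/slave division of simulation work described in [10] can be sketched in miniature. This is an illustrative stand-in, not the tool from the paper: the "model" is just a sampled job duration checked against a deadline, and all names (`run_batch`, `estimate`) and parameter values are invented for the sketch.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_batch(seed, n_runs, deadline=5.0, rate=0.5):
    """One slave: count simulation runs in which the property holds.

    The stand-in property is 'an Exp(rate)-distributed job duration
    finishes before the deadline'."""
    rng = random.Random(seed)
    return sum(rng.expovariate(rate) <= deadline for _ in range(n_runs))

def estimate(n_workers=4, runs_per_worker=1000):
    """The master: farm batches out to workers and pool the counts."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        counts = list(pool.map(lambda s: run_batch(s, runs_per_worker),
                               range(n_workers)))
    return sum(counts) / (n_workers * runs_per_worker)

p = estimate()  # should land near 1 - exp(-0.5 * 5) ≈ 0.918
```

The point of the distribution is only throughput: each worker's runs are independent, so the master can pool counts from any number of workers without changing the estimator.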


Chapter 3

Background Knowledge

This chapter gives a brief description of the tools and methods used in the following chapters, starting with some general points and definitions regarding clusters and modelling. Thereafter, UPPAAL, the modelling and verification tool used to implement and verify the models, is discussed, as well as UPPAAL SMC, a version of UPPAAL that allows for statistical model checking. Next, SLURM, the resource management tool used on deCODE's cluster, is introduced, as well as the SLURM simulator, a tool that allows for easier and faster analysis of former job submissions on a SLURM cluster. Finally, the algorithms that were modelled and analysed are defined.

3.1 General definitions

As mentioned in the introduction, an analysis of deCODE's cluster is performed using models constructed in UPPAAL. Before introducing what UPPAAL is and how it is used to perform this analysis, it is important to present some key definitions concerning clusters, as well as how they work. A short section about modelling is also presented.

3.1.1 Clusters

According to Mark Baker and Rajkumar Buyya in Cluster Computing at a Glance [11], a cluster is a collection of computers which work in parallel and are interconnected through some network technology. A cluster is, in other words, a collection of computers, also known as nodes, each of which consists of either a single-processor or multiprocessor system with its own memory, operating system and I/O facilities. Each node has a network interface and communication software, which is responsible for communication with other nodes through the network. The nodes are controlled by some middleware, such as SLURM [12], discussed in more detail in section 3.3, which manages the resources of the cluster and allocates jobs to nodes. Figure 3.1 shows a typical architecture of a cluster [11].


Figure 3.1: A typical cluster architecture.

Clusters offer high performance and high throughput at a relatively low cost.

3.1.2 Modelling

Reactive systems, as described in the book Reactive Systems [13], are systems that react to stimuli from their environment by exchanging information with it. Reactive systems are essentially parallel processes that interact with their environment. A cluster is therefore a reactive system where the users take the role of the environment by submitting jobs to the cluster and by interacting with the middleware of the cluster, which in turn interacts with all the nodes of the cluster by allocating the jobs.

Model checking is a two-step process to formally develop and verify the behaviour of a reactive system: a formal, mathematically precise modelling language is used to model the system, followed by algorithmic analysis to verify the correctness of the behaviour of the model. One such modelling and verification tool suite is UPPAAL.

3.2 UPPAAL

UPPAAL [3] is an efficient and powerful tool suite for modelling, simulating and verifying real-time systems, developed jointly by researchers at Aalborg University in Denmark and Uppsala University in Sweden. In UPPAAL, real-time systems are modelled as networks of timed automata [14], which are finite state machines with clocks, extended with integer data variables and structures. The behaviour of the system is modelled using constraints on both the locations and the edges of the automaton.

The available constraints on locations are the following:

• Invariants,

• Urgent locations, and

• Committed locations.

Invariants are Boolean expressions used to express upper bounds on the amount of time that an automaton can spend in a given location. They ensure that the process can only stay in the location while the expression is satisfied. States which violate the invariant are undefined, meaning the system cannot be in such a state. Urgent locations are locations which freeze time in the model. In other words, an urgent location is equivalent to a location with an added clock x and the invariant x ≤ 0. The behaviour of the model is restricted so that time cannot progress and only transitions that happen at this time can be taken. Committed locations are locations that freeze time like urgent locations and additionally ensure that the next move must involve an outgoing edge from a committed location.

The constraints on edges of timed automata in UPPAAL are the following:

• Selections,

• Guards,

• Synchronizations,

• Updates, and

• Weights.

Selections are used to non-deterministically bind a value to a variable of some type. A transition through an edge labelled with a guard can only be performed if the condition of the expression on the edge holds. Synchronization channels are used in a model to communicate through channels using broadcast communication or handshakes. When two processes synchronize, both transitions have matching synchronization labels and are performed simultaneously. Updates on edges are used to assign values to clocks and variables. Weights are used to control the probabilistic likelihood of an edge being fired; this is calculated as the ratio of the transition's weight over the sum of the weights of all transitions from a location [15], [16].
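As a toy illustration of the weight rule just described, the probability of each outgoing edge can be computed as its weight over the sum of all outgoing weights. The following Python sketch (function name and structure are my own, not part of UPPAAL) mirrors that calculation:

```python
# Toy sketch of the edge-weight rule described above: the probability of an
# edge firing is its weight divided by the sum of the weights of all
# outgoing edges of the location.
def edge_probabilities(weights):
    total = sum(weights)
    return [w / total for w in weights]

# An edge of weight 3 competing with an edge of weight 1 fires
# three times out of four on average.
print(edge_probabilities([1, 3]))  # [0.25, 0.75]
```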

The following example shows a simple model in UPPAAL: the model in the example has one clock x and two synchronization channels s1 and s2.

Figure 3.2: UPPAAL example.

This model consists of 3 processes, t1, t2 and t3, and uses the clock x, see figure 3.2. Process t1 has 3 locations. Location names are coloured dark purple in UPPAAL; note that locations can also be unnamed. Location 'A' is the initial location of automaton t1. It has a transition to an unnamed location, which in turn has a transition to the third location, called Done. Location A is bound by the invariant x <= 2; as mentioned before, an invariant is a condition that must be satisfied for the automaton to remain in that location. In the example, the invariant guarantees that the system can only stay in location A for at most 2 time units. The availability of the transition from 'A' to the next, unnamed location is determined by a guard (green code in the figure), which ensures that this transition cannot be taken until the clock x has the value 2. Combined with the invariant, this means that the transition must be made when x has the value 2. When the transition is made, a synchronous handshake on s1 (light blue code in the figure) is performed in the system; this means that the matching transition in process t2 is carried out at the same time. Finally, an update statement on the transition (dark blue code) resets the value of x to 0. The next transition in t1, from the unnamed location to the Done location, can only be performed while the value of x is no larger than 2. Note, however, that process t1 is allowed to delay any amount of time in the unnamed location and can thereby disable the synchronization on channel s2 forever.

The model's behaviour can be verified using temporal logic, as well as simulated using the simulator tool in UPPAAL. UPPAAL uses symbolic analysis to verify models. Queries, or formulas, in UPPAAL are written in a subset of the logic TCTL [4]. In symbolic analysis, a state space is constructed containing all possible sequences of the model and is then fully analysed to check whether invariants and reachability properties hold [17]. The following are the temporal properties used in the verification of the models in this thesis:

• Invariantly - A[],

• Eventually - A<>, and

• Leads to - '-->'

Invariantly, or A[] ϕ, where ϕ is a state property, holds true in a model if all sequences of states S0 → S1 → S2 → ... → Sn in the generated state space, where each Si is a state and S0 is the initial state, satisfy ϕ. The following figure shows an example of a state space, borrowed from [4]:

Figure 3.3: State space satisfying invariantly ϕ

Eventually, or A<> ϕ, holds true if all possible sequences in the generated state space eventually reach a state satisfying ϕ.


Figure 3.4: State space satisfying eventually ϕ

The leads-to property, or ψ → ϕ, means that in all possible sequences of the state space, if ψ holds then eventually ϕ must hold.

Figure 3.5: State space satisfying leads to property

In large systems the state space can become very large; this prevents the efficient use of symbolic analysis. Therefore, a different approach is necessary. The statistical model checker UPPAAL SMC uses statistical analysis in the analysis of models, which allows for verification of much larger models [18].

3.2.1 UPPAAL SMC

UPPAAL SMC is a model checker that allows for behavioural analysis of networks of timed automata with a stochastic semantics [18]. As mentioned before, symbolic analysis is limited by the size of the state space the models produce. UPPAAL SMC offers an alternative, using statistical analysis, which avoids an exhaustive exploration of the state space. This is done by monitoring simulations of the system and evaluating whether the system satisfies some property with some confidence level. This allows for verification of much larger state spaces and is far less time and memory intensive [18].
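The idea can be illustrated with a toy Monte Carlo sketch; this is my own construction and not UPPAAL SMC's actual machinery, but it shows the principle of estimating a probability from sampled runs instead of an exhaustive state-space search:

```python
import random

# Toy illustration of statistical model checking: estimate the probability
# that a property holds by sampling random simulations of the system.
# All names (simulate_run, holds) are illustrative assumptions.
def smc_estimate(simulate_run, holds, runs=1000, seed=0):
    rng = random.Random(seed)
    hits = sum(1 for _ in range(runs) if holds(simulate_run(rng)))
    return hits / runs

# Estimate the probability that a fair six-sided die shows a six.
print(smc_estimate(lambda rng: rng.randint(1, 6), lambda v: v == 6))
```

In UPPAAL SMC the number of runs is not fixed in advance but derived from the requested confidence level; the sketch fixes it for simplicity.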

3.3 SLURM

Simple Linux Utility for Resource Management (SLURM) is an open-source, fault-tolerant cluster management and job scheduling system [12]. SLURM acts as the middleware of deCODE's cluster. SLURM runs three daemons on the system to control allocation and resource management: slurmctld, slurmd and slurmdbd. The architecture of the system is shown in the following image.

Figure 3.6: Slurm system overview.

Slurmctld is the centralized manager which monitors resources and the other daemons on the system and accepts jobs from users, see figure 3.6. This is achieved using programs such as scontrol, sinfo, srun, squeue, scancel and sacct. Each node on the system runs a slurmd daemon. These daemons accept work allocated to them by slurmctld and inform slurmctld when the work is done or errors have occurred. Optionally, if the accounting plug-in is enabled, a third daemon, slurmdbd, records accounting information to a database [19].

3.3.1 Resource Allocation

SLURM comes with two plug-ins for node allocation. Its default node allocation tactic allocates jobs to nodes in exclusive mode; that is, when a job is allocated to a node and the node has remaining resources, these resources will not be available to other jobs. This can result in poor resource utilization of the system. SLURM also offers a different tactic that manages resources on a more fine-grained basis. This tactic is utilized in deCODE's cluster. The available consumable resources in the system are the following: the CPUs of the node, the number of boards on the node, the number of sockets, the number of cores of the node and the amount of memory the node has to offer [20]. The tactic chosen for the cluster's resource allocation is set in the slurmctld config file.

SLURM performs two types of scheduling attempts. The first is a quick scheduling attempt at frequent intervals, performed each time a job is submitted, a job is allocated or a configuration change occurs. The quick scheduling attempt only considers a small part of the queue; the default value is 100 and is set with the SchedulerParameters=default_queue_depth=# parameter in the configuration file, where # is the number of jobs for which scheduling is attempted. Quick scheduling attempts stop once the job at the head of the queue does not fit. The second type of scheduling attempt is performed once each minute and ignores the default_queue_depth parameter mentioned before [21].
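The quick pass can be sketched as follows; this is my own simplified rendering of the behaviour described above (consider at most default_queue_depth jobs in queue order and stop at the first job that does not fit), not SLURM source code:

```python
# Simplified sketch of the quick scheduling pass: walk at most `depth` jobs
# (default_queue_depth) in queue order, start every job that fits, and stop
# the pass at the first job that does not. `fits` is a caller-supplied
# predicate; returns the list of started jobs.
def quick_schedule(queue, fits, depth=100):
    started = []
    for job in queue[:depth]:
        if not fits(job):
            break  # the head of the remaining queue does not fit: stop
        started.append(job)
    return started
```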


3.3.2 Slurm Simulator

As mentioned before, SLURM comes with an accounting plug-in that stores every job execution of the system [22]. The data is stored in a MySQL database. In order to save time when analysing the cluster's performance, the SLURM simulator, a modification of SLURM's workload simulator created by the Swiss National Supercomputing Centre, was set up ([7], [23]). This tool simulates a job trace on an emulated cluster and allows for time to be sped up. Using the data collected by the SLURM accounting plug-in, job traces can be constructed and used to simulate deCODE's cluster on a single desktop computer. The simulation does not run the jobs themselves, but schedules them on the emulated cluster and logs information such as when a job got scheduled and how long it 'ran'. The SLURM simulator allowed for a good comparison of different scheduling algorithms on real data.

3.4 Scheduling and Scheduling Algorithms

To boost productivity on a cluster it is essential to pick a suitable algorithm to schedule jobs from the queue. This section clarifies what a job is and how different scheduling algorithms work, provides some detail on known pros and cons of these algorithms, and introduces criteria for evaluating scheduling algorithms.

3.4.1 A Process

According to Silberschatz in [24], a process, also known as a job, is a program in execution, or more specifically an executable program loaded into memory. A process contains the current activity of a program, such as the value of the program counter and other data such as its stack and heap. During its execution, a process can be in different states, as shown in figure 3.7.

Figure 3.7: The states of a process.

When a process is created it enters the 'new' state. This state means the process has been created and is ready to be added to the ready queue, from which it will be allocated resources on the computer and its instructions start running. A process enters the 'waiting' state if an interrupt occurs and the process has to wait for some event, such as I/O completion or a signal from another process. A process is in the 'ready' state when it is ready to be run, that is, ready to be allocated the CPU. When a process finishes its execution it goes to the 'terminated' state.
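The life cycle of figure 3.7 can be written down as an explicit transition relation; the following sketch (state names follow the figure, the helper is my own) makes the allowed moves checkable:

```python
# The process life cycle of figure 3.7 as a transition relation.
ALLOWED = {
    ("new", "ready"),
    ("ready", "running"),
    ("running", "waiting"),     # interrupt: wait for I/O or a signal
    ("waiting", "ready"),       # the awaited event has occurred
    ("running", "ready"),       # preemption
    ("running", "terminated"),
}

# Check that a sequence of states only uses allowed transitions.
def is_valid_trace(states):
    return all(step in ALLOWED for step in zip(states, states[1:]))

print(is_valid_trace(["new", "ready", "running", "terminated"]))  # True
```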


3.4.2 Scheduling Criteria

In order to analyse and compare algorithms, some criteria for scheduling algorithms are necessary. The following is a list of criteria, suggested by Silberschatz [24], used to evaluate the algorithms analysed in this thesis:

• Resource Utilization,

• Throughput,

• Average Residence Time, and

• Average Waiting Time.

Resource utilization is a measure of the utilization of the resources in the system. Since clusters are expensive systems, it is preferable to have a high utilization of the resources. Throughput is the measure of how many jobs are finished per time unit. Turnaround time, or residence time, is the amount of time it takes a job to terminate, from submission to termination, including all time spent in the waiting state as well as waiting in the queue. Waiting time is the time a job waits in the ready queue. Finally, response time is the time the process spends in the waiting queue plus the time it spends processing on the CPU.
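For a finished job trace, the time-based criteria above reduce to simple averages. A sketch, where each job record is assumed to be a tuple (submit_time, start_time, finish_time):

```python
# Hypothetical helpers for the criteria above; the record layout
# (submit_time, start_time, finish_time) is an assumption.
def average_waiting_time(jobs):
    # Average time spent waiting in the queue before allocation.
    return sum(start - submit for submit, start, _ in jobs) / len(jobs)

def average_turnaround_time(jobs):
    # Average time from submission to termination.
    return sum(finish - submit for submit, _, finish in jobs) / len(jobs)

trace = [(0, 0, 24), (0, 24, 27), (0, 27, 30)]  # an illustrative trace
print(average_waiting_time(trace), average_turnaround_time(trace))  # 17.0 27.0
```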

3.4.3 Scheduling algorithms

As mentioned above, scheduling algorithms decide in which order the jobs in the ready queue are allocated to the CPU. Scheduling algorithms can be either preemptive or non-preemptive. In non-preemptive algorithms, scheduling decisions only switch the state of a process from the ready state to running. In preemptive algorithms, scheduling can also occur when processes switch from running to the ready state, for example when an interrupt occurs, and when a process switches from the waiting state to the ready state. Many scheduling algorithms exist, but in this thesis the algorithms described in the following sections were modelled and analysed. The first three algorithms that will be discussed, that is FCFS, SJF and SLURM's backfilling algorithm, are non-preemptive; the latter two, RR and SRTF, are preemptive.

First Come, First Served

First-Come, First-Served scheduling (FCFS) is one of the simplest scheduling algorithms. FCFS is a non-preemptive scheduling algorithm, meaning, as mentioned before, that jobs cannot be interrupted once allocated. In FCFS, when a job is submitted to the cluster it enters the back of the queue and is not allocated until all the jobs that entered the queue before it have been allocated. In other words, when enough resources are available on the cluster for the job at the head of the queue, the job is allocated. FCFS is a fair algorithm, meaning starvation of jobs cannot occur, and it is easy to implement, but it also has limitations, as the following example shows.

In the example, the system has one CPU to allocate; the jobs shown in the figure all require 1 CPU to run, and all jobs arrive at time 0 [24].


Figure 3.8: Example of a cluster using FCFS algorithm with 3 processes.

The waiting time of the system depends on the order of arrival of the jobs. As can be seen, when the jobs arrive in the order P1, P2, P3, the average waiting time of the jobs is (0 + 24 + 27)/3 = 17 time units. If the jobs arrive in the order P2, P3, P1, the average waiting time drops to 3 time units. Thus, the average waiting time varies substantially depending on the order in which the jobs arrive in the queue. Another flaw of the algorithm shows when a large job at the head of the queue needs more resources than are available on the system, while smaller jobs further down the queue would fit. The large job at the head of the queue blocks the smaller waiting jobs; this results in lower resource utilization.
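The arithmetic above can be reproduced with a small single-CPU FCFS sketch (my own code, not the thesis's UPPAAL model):

```python
# Single-CPU FCFS: each job waits until all jobs ahead of it have finished.
# All jobs arrive at time 0; bursts are given in queue order.
def fcfs_waiting_times(bursts):
    waits, clock = [], 0
    for burst in bursts:
        waits.append(clock)
        clock += burst
    return waits

w1 = fcfs_waiting_times([24, 3, 3])  # arrival order P1, P2, P3
w2 = fcfs_waiting_times([3, 3, 24])  # arrival order P2, P3, P1
print(sum(w1) / 3, sum(w2) / 3)  # 17.0 3.0
```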

Shortest Job First

Shortest Job First (SJF) is a non-preemptive scheduling algorithm where the job with the shortest next CPU burst is chosen ahead of other jobs. If two or more jobs have an equal burst length, FCFS scheduling is used. The following example shows how the jobs are allocated on a system with one CPU. The jobs have the burst times shown in figure 3.9 and are all submitted to the queue at time 0 [24].

Figure 3.9: Example of a cluster using SJF algorithm with 4 processes.

As can be seen, the jobs with the lowest CPU burst time are allocated first on the system. The average waiting time of the SJF algorithm is considerably lower than that of FCFS, and the algorithm motivates users of batch systems to estimate their job length accurately, since giving jobs a lower CPU burst time makes them likely to be allocated sooner. A negative aspect is that it can result in starvation of large jobs if multiple smaller jobs enter the queue repeatedly [24].
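When all jobs are present at time 0, single-CPU SJF is simply FCFS over the burst-sorted queue. A sketch (the burst values in the example call are my own illustration, not necessarily those of figure 3.9):

```python
# SJF on a single CPU with all jobs arriving at time 0: serve jobs in order
# of burst length. Python's sort is stable, so equal bursts keep their
# FCFS (arrival) order, matching the tie-breaking rule described above.
def sjf_average_wait(bursts):
    waits, clock = [], 0
    for burst in sorted(bursts):
        waits.append(clock)
        clock += burst
    return sum(waits) / len(waits)

print(sjf_average_wait([6, 8, 7, 3]))  # 7.0
```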

SLURM Backfilling

SLURM's backfilling algorithm is a non-preemptive algorithm that comes with newer versions of SLURM. The algorithm is an optimization of the FCFS algorithm [25] with the added behaviour that, when the job at the head of the queue cannot be allocated because of its resource needs, the algorithm checks whether the jobs in a 'window' of the queue, by default the next 100 jobs, can be allocated. This is called backfilling; jobs are only allocated if they do not affect the time at which the job at the head of the queue is allocated.


The following figures, 3.10 and 3.11, show examples of job allocation in clusters using backfilling as their scheduling algorithm. In the examples the order of job arrival is P1, P2, P3, P4 and all jobs arrive at time 0.

Figure 3.10: Cluster using Backfilling algorithm on a cluster with 2 available CPUs.

In the first example the cluster has two available CPUs. Since P1 arrives first, and is at the head of the queue, it gets allocated to CPU1. P2 cannot be allocated until P1 has finished, since it requires two CPUs. If SLURM performs a backfilling attempt at this time, it checks whether the next jobs in the queue can be allocated without affecting the allocation time of P2. P3 is allocated, since its burst time is only 3 time units and it will not postpone the allocation of P2. Thereafter, if a second backfilling attempt is performed, the cluster sees that P4 can also be allocated, since it only has a burst time of 2 time units.

Figure 3.11: Cluster using Backfilling algorithm on a cluster with 3 available CPUs.

The second backfilling example shows a schedule of jobs where a job is backfilled despite having a longer burst time than the job at the head of the queue, and will therefore not finish before the allocation of the job at the head of the queue. Here the cluster has 3 CPUs available for allocation. P1 is allocated to CPU1 initially, as is P2 to CPU2. P3 cannot be allocated, since the cluster only has one available CPU while P3 requires two, and the CPUs will be busy until time unit 5. A backfilling attempt is performed, in which SLURM sees that even though P4 requires 10 time units, and will therefore still be running when P1 and P2 are done, enough resources will be available at time unit 5 for P3 to be scheduled. Therefore, P4 is backfilled to CPU3.

Research has shown that the backfilling algorithm can increase the density of resource utilization of the system by 20% and decrease the average waiting time of the system [26], [27]. Backfilling is, however, a computationally expensive operation. The size of the aforementioned window governs how many jobs are included in the backfilling process; the size of the window therefore governs how expensive the operation is, and has to be limited. This will be discussed in greater detail in chapter 5 and shown in the results in chapter 6 [26].
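The core feasibility test behind backfilling can be sketched as follows. This is my own simplification with CPUs as the only resource: a candidate may start now if it either finishes before the reserved start time of the job at the head of the queue (the shadow time), or only uses CPUs that the head job will not need. All parameter names are assumptions.

```python
# Simplified backfilling test with CPUs as the only resource.
# shadow_time: reserved start time of the job at the head of the queue.
# extra_cpus: CPUs that stay free even after the head job has started.
def can_backfill(runtime, cpus, now, shadow_time, free_cpus, extra_cpus):
    if cpus > free_cpus:
        return False             # does not fit right now
    if now + runtime <= shadow_time:
        return True              # finishes before the head job must start
    return cpus <= extra_cpus    # runs past the shadow time on spare CPUs

print(can_backfill(3, 1, 0, 5, 1, 0))   # True: finishes before the shadow time
print(can_backfill(10, 1, 0, 5, 1, 1))  # True: uses only a spare CPU
print(can_backfill(10, 1, 0, 5, 1, 0))  # False: would delay the head job
```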

Round Robin

Round Robin (RR) is similar to FCFS scheduling except that preemption is added, which allows the system to switch CPU bursts between jobs. This means that a job is allocated on the system for a predefined time, called a timeslice, after which it is preempted and returned to the ready queue until it is allocated again. If the remaining CPU burst of a job is smaller than the timeslice, the job terminates and the next job is allocated. The following example shows how three jobs are scheduled on a single-CPU system where the timeslice is four time units [24].

Figure 3.12: Example of a cluster using RR algorithm with 3 processes.

The example shows that P1 runs for four time units before it is interrupted and placed at the back of the queue. Next, P2 runs for 3 time units before it finishes and terminates. Then, P3 is allocated and runs for three time units before terminating. Thereafter, P1 runs until it finishes, but is interrupted every 4 time units. RR scheduling is considered a fair algorithm, since all jobs get equal time on the system. It generally has a high average waiting time, but that may vary with the size of the timeslice chosen for the preemption.
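The schedule above can be replayed with a small round-robin sketch (my own code); assuming the same bursts as the FCFS example, 24, 3 and 3, and a timeslice of 4, it produces the behaviour just described:

```python
from collections import deque

# Single-CPU round robin: a job runs for at most one timeslice before it is
# preempted and sent to the back of the queue. All jobs arrive at time 0;
# returns the finish time of each job.
def round_robin_finish_times(bursts, timeslice):
    queue = deque(enumerate(bursts))  # (job index, remaining burst)
    clock, finish = 0, [0] * len(bursts)
    while queue:
        i, remaining = queue.popleft()
        run = min(timeslice, remaining)
        clock += run
        if remaining > run:
            queue.append((i, remaining - run))  # preempted and requeued
        else:
            finish[i] = clock                   # job terminates
    return finish

print(round_robin_finish_times([24, 3, 3], 4))  # [30, 7, 10]
```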

Shortest Remaining Time First

The preemptive version of the SJF algorithm, Shortest Remaining Time First (SRTF), selects the job with the shortest remaining CPU burst time and preempts after some selected interval. The same pros and cons apply to SRTF as to SJF. The following example shows a single-CPU system running four processes arriving at different times. The cluster's timeslice is set to 4.


Figure 3.13: Example of a cluster using SRTF algorithm with 4 processes.

The example shows that at time 0, P1 is the only job in the queue, so the system allocates resources to it. P2 arrives at time 1. A scheduling attempt is made once P1's timeslice has elapsed. P1 has 5 time units left, so P2 has the shortest burst time of the processes in the queue and is allocated to the CPU. P2 finishes at time 8 and P4 is allocated, since it has the shortest required burst time and is closer to the head of the queue; it runs for its timeslice before being preempted and re-allocated for one more time unit. Next, P1 is allocated, since it only requires 5 time units. P1 runs for 5 time units, although it is preempted once and re-allocated. Thereafter, P3 is allocated.
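The variant described here, where the scheduler re-evaluates at timeslice boundaries and picks the ready job with the least remaining burst, can be sketched as follows (my own simplification; ties are broken by queue position):

```python
# SRTF with timeslice preemption: at every scheduling point, run the
# arrived job with the least remaining burst for at most one timeslice.
# jobs is a list of (arrival_time, burst); returns per-job finish times.
def srtf_finish_times(jobs, timeslice):
    remaining = [burst for _, burst in jobs]
    finish, clock = [0] * len(jobs), 0
    while any(r > 0 for r in remaining):
        ready = [i for i, r in enumerate(remaining)
                 if r > 0 and jobs[i][0] <= clock]
        if not ready:
            # CPU idle: advance the clock to the next arrival.
            clock = min(jobs[i][0] for i, r in enumerate(remaining) if r > 0)
            continue
        i = min(ready, key=lambda j: remaining[j])  # shortest remaining first
        run = min(timeslice, remaining[i])
        clock += run
        remaining[i] -= run
        if remaining[i] == 0:
            finish[i] = clock
    return finish

print(srtf_finish_times([(0, 8), (0, 4)], 4))  # [12, 4]
```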

The following chapter shows how the aforementioned scheduling algorithms were implemented in UPPAAL, on clusters using SLURM as their resource management system. Their behaviour is specified and verified using symbolic analysis in UPPAAL.


Chapter 4

Methods

To analyse the performance of the cluster under various scheduling policies with regard to criteria such as throughput and average waiting time, which will be discussed in further detail in chapter 5, several models were implemented in UPPAAL and UPPAAL SMC. The following sections describe how the aforementioned models were implemented and define the specifications corresponding to the behaviour of each scheduling policy. Their behaviour is then verified using symbolic analysis in UPPAAL.

In order to verify the models using UPPAAL's query language, various transitions, locations and variables have to be added to the models to account for behaviour, compare values and verify that the original model behaves as intended. Adding locations and variables adds computations to the simulations of the model. Therefore, to keep the state spaces as small as possible and minimize memory consumption when performing the simulations later in the analyses, separate models, based on the original models, were created for verification.

SLURM is a complex system that offers a multitude of parameters to specify its behaviour. Some abstractions had to be made to the functionality of SLURM due to time constraints and complexity. These abstractions are presented in the first section of this chapter. Next, the data structures used in all the models and the scheduling functions of the cluster template are presented, followed by sections that present the models for each scheduling algorithm.

4.1 Abstractions

As mentioned, SLURM is a highly customisable resource management tool. The following are some of the behaviours of SLURM that were abstracted away from the models:

• Partitioning of nodes,

• The time scheduling decisions take, and

• Best-fit algorithm.

SLURM allows the administrators of the system to partition the nodes of the cluster. Partitions are groups of, possibly overlapping, sets of nodes, where each partition can be seen as a separate job queue. Partitions can have multiple constraints for jobs, such as maximum memory allocation and maximum time constraints [28]. This behaviour was not considered in the templates of this thesis.


Scheduling decisions in SLURM can be computationally heavy and take time. They are also computed in parallel with running jobs on the cluster, so new job submissions might occur while a decision is still being computed. A scheduling attempt is made, by default, every 60 seconds, while a backfilling attempt is made every 30 seconds. Backfilling is a time-consuming operation that may delay new job submissions [29]. In the templates, time does not progress while these operations are performed, since it is difficult to know how long the operations take.

When allocating resources to jobs, SLURM uses a best-fit algorithm to choose the nodes for allocation [30]. This behaviour is not reproduced in the models; instead, the next node that the job fits on is selected for allocation.
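The difference between the two policies can be illustrated as follows (my own sketch, with free CPUs per node as the only resource):

```python
# First-fit (the rule used in the models) versus best-fit (SLURM's rule,
# abstracted away here), with free CPUs per node as the only resource.
def first_fit(free_cpus, needed):
    for i, free in enumerate(free_cpus):
        if free >= needed:
            return i                       # first node that fits
    return -1

def best_fit(free_cpus, needed):
    best, best_slack = -1, None
    for i, free in enumerate(free_cpus):
        slack = free - needed
        if slack >= 0 and (best_slack is None or slack < best_slack):
            best, best_slack = i, slack    # fitting node with least slack
    return best

nodes = [8, 4, 6]
print(first_fit(nodes, 4), best_fit(nodes, 4))  # 0 1
```

Best-fit packs jobs onto the fullest fitting node, keeping larger blocks of resources free; first-fit is cheaper but can fragment the cluster.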

4.2 Data Structures and Model Behaviour

To provide an extensive analysis of each model, all algorithms were first implemented as static models, where the list of jobs and their time and resource needs are predefined, before being implemented as dynamic models where each job is dynamically spawned and its time and resource needs are chosen non-deterministically. This was done because the static models' behaviour was easier to test, using the simulator environment provided by UPPAAL, and to verify using the UPPAAL verification tool. The following sections present the static models. The dynamic models are discussed in chapter 5.

4.2.1 Structures

The following data structures were used to keep track of the resource needs and time of jobs, as well as the resources available at each time on the cluster.

When a job enters the queue, a JobInfo structure is created and placed into an array representing the queue in the model. This structure contains five variables, as seen in listing 4.1. The variable id is of the type id_t, a typedef which ensures that the value of the variable can only range from 0 to MAX_JOBS, a constant representing the maximum number of permitted jobs in the model. runTime is an integer value that represents the remaining time the job needs on the cluster to complete, and the integers cpu, space and ram represent the resources needed to perform the computation.

typedef struct {
    id_t id;
    int runTime;
    int cpu;
    int space;
    int ram;
} JobInfo;

Listing 4.1: The JobInfo structure.

Next is the node structure, which represents the available resources of each node of the cluster. As seen in listing 4.2, the node structure consists of six variables, where id is an integer variable representing the node's id. The fields tmpDisk, cpus and realMemory correspond to the resources available on the specified node. The boolean variable nodeBusy is used when the configuration of the cluster's allocation tactic is set to select/regular, meaning that when a job has been allocated to the specified node, the remaining resources cannot be used for other jobs. weight is used to give nodes a higher priority when allocating jobs.


typedef struct {
    int id;
    int tmpDisk;
    int cpus;
    bool nodeBusy;
    int realMemory;
    int weight;
} node;

Listing 4.2: The nodes structure.

shadowTime is a structure that is only used in the backfilling algorithm, which was discussed in chapter 3.4.3. The structure consists of four integer variables. The first is time, which corresponds to the time until the job at the head of the queue can be allocated. The others are cpus, space and ram, which correspond to the extra CPUs, disk space and RAM that will be available when the job at the head of the queue has been allocated.

typedef struct {
    int time;
    int cpus;
    int space;
    int ram;
} shadowTime;

Listing 4.3: The ShadowTime structure.

4.2.2 Cluster behaviour

The models use a single array of the JobInfo structures described above, which corresponds to the queue of jobs in the system. The array is declared as JobInfo queue[MAX_JOBS], where MAX_JOBS is a constant integer value corresponding to the maximum number of jobs in the system. UPPAAL does not allow dynamically sized arrays, so an integer variable (len in the listings) is used to indicate how large the queue is at any time. When jobs enter the system they are queued using the function in listing 4.4.

void enqueue(int myid, int myRunTime, int myCpu, int mySpace, int myRam)
{
    JobInfo tmp = {myid, myRunTime, myCpu, mySpace, myRam};
    queue[len++] = tmp;
}

Listing 4.4: The enqueue function for the cluster template

The enqueue() function creates a JobInfo object from the information given in the parameters and places it at the back of the queue. When a job is allocated, it is removed from the queue using the following function.


void dequeue(int id)
{
    int i = 0;
    int start = -1;
    while (i < len) {
        if (queue[i].id == id) {
            start = i;
        }
        i++;
    }
    i = start;
    len -= 1;
    while (i < len) {
        queue[i] = queue[i + 1];
        i++;
    }
}

Listing 4.5: The dequeue function of the cluster.

The dequeue() function finds the job, using its id, removes it from the queue, decrements the queue-length variable and shifts all remaining jobs in the queue forward by one position.

When a job allocation is attempted, the cluster first checks whether the job will fit on the cluster in its current state before allocating resources to it. This is done using the checkResources() function. Here the system checks whether there exists a node that offers enough resources (that is, cpus, space and ram) for the job. The function returns a boolean value describing whether a node with sufficient resources for the job has been found.

bool checkResources()
{
    for (i : int[0, NODE_NR - 1]) {
        if (nodes[i].tmpDisk >= nextJob.space && nodes[i].cpus >= nextJob.cpu
            && nodes[i].realMemory >= nextJob.ram) {
            return true;
        }
    }
    return false;
}

Listing 4.6: The checkResources() function in the cluster.

If checkResources() returns true, the job is allocated on the cluster using the allocate(cpu, space, ram) function, see listing 4.7. This function finds the first available node that has enough resources for the job, subtracts the resources needed by the job and returns the node's id.


// Allocate job on first node that is clear and return the id of the node.
int allocate(int cpu, int space, int ram)
{
    for (i : int[0, NODE_NR - 1]) {
        if (nodes[i].tmpDisk >= space && nodes[i].cpus >= cpu
            && nodes[i].realMemory >= ram) {
            nodes[i].tmpDisk -= space;
            nodes[i].cpus -= cpu;
            nodes[i].realMemory -= ram;
            return nodes[i].id;
        }
    }
    return -1;
}

Listing 4.7: The allocation function of the cluster.

When a job terminates, it calls the deAllocate(nodeID, cpu, space, ram) function. The parameters of the function tell the cluster which node the job was running on and the amount of resources that had been reserved for it. The function returns the resources the job was using to the node.

void deAllocate(int nodeID, int cpu, int space, int ram)
{
    for (i : int[0, NODE_NR - 1]) {
        if (nodes[i].id == nodeID) {
            nodes[i].tmpDisk += space;
            nodes[i].cpus += cpu;
            nodes[i].realMemory += ram;
            return;
        }
    }
}

Listing 4.8: The deallocation function of the cluster.

Using the aforementioned data structures and functions, the behaviour of each model was modelled as described in the following section.

4.3 Models

This section discusses the implementation of the aforementioned scheduling algorithms. Each sub-section starts by discussing the specifications for each model, or in other words specifies the model's behaviour corresponding to the scheduling algorithm being modelled. Thereafter, the model's implementation in UPPAAL is shown. Each sub-section concludes with verification and simulation of the behaviour of the model, using the UPPAAL verification tool.

As mentioned in chapter 3.1.1, the cluster interacts with its environment, that is, users submitting jobs to the cluster, and allocates resources to jobs on the queue when possible. SLURM schedules jobs at a certain time interval, set by the sched_interval parameter in the SLURM configuration file, and makes a quicker scheduling attempt when a job is either submitted to the queue or terminated. As mentioned in chapter 3.3, the quick scheduling attempt only considers a small part of the queue, set with the default_queue_depth parameter in the SLURM configuration file. This behaviour is implemented in all the following models.

4.3.1 FCFS

As discussed in chapter 3.4.3, First Come First Serve is one of the simplest scheduling algorithms for clusters. The jobs are served in the order they arrive on the queue and are only allocated when the cluster has sufficient resources for the job. This means that a large job at the front of the queue can block smaller jobs which would otherwise fit on the cluster. The algorithm is generally considered poor because of this flaw; it was nevertheless implemented for comparison with the other algorithms.

Specifications

The specifications for a FCFS algorithm are the following:

• Processes get CPU allocated in the order of their arrival.

• Running processes are not interrupted.

• Fair, all processes eventually get CPU allocation.

The first specification ensures that jobs are allocated in the correct order corresponding to the FCFS algorithm, that is, in the order of their arrival. Since this is a non-preemptive algorithm, the second specification ensures that once a job starts running it is never interrupted by the cluster. The last specification is essentially a check that all jobs eventually get allocated and therefore do not suffer from starvation. The following sub-chapter discusses the implementation of the model; thereafter these specifications are checked in the verification sub-chapter using the UPPAAL verification tool.

Implementation

To ensure the aforementioned behaviour of SLURM's scheduling, that is, scheduling jobs at certain time intervals as well as attempting a quick schedule when a job enters the system or is terminated, this behaviour was modelled using the cluster template, which can be seen in the following figure.


Figure 4.1: A cluster template for FCFS and SJF clusters.

The scheduling attempt, which is performed at certain time intervals and considers all jobs on the queue, is modelled into the template by setting an invariant at the initial location, forcing a scheduling attempt at an interval controlled by the variable sched_interval. The quick scheduling attempt is performed when a broadcast message is received from the job process, detailed later in the chapter, and forces an attempt to allocate jobs when a job is submitted or terminated. The quick scheduling attempt can be seen in the bottom transition in figure 4.1, from the initial location to the ‘QuickSched’ location, while the regular scheduling attempt is the top transition, from the initial location to the ‘Sched’ location. In both scheduling attempts the nextJob variable is set using the schedule() function. In the FCFS algorithm this function is trivial: it simply sets the nextJob variable to the id of the job at the head of the queue. The function can be seen in the following listing.

void schedule()
{
    nextJob = queue[0];
}

Listing 4.9: The schedule function for FCFS.

The quick schedule transitions will only attempt to schedule a fixed number of jobs, determined by the default_queue_depth variable, before returning to the initial state. This is achieved by counting the attempted schedules; if their number exceeds default_queue_depth, the quick scheduling attempt halts and returns to the initial location. Next the cluster checks whether enough resources are available for the job using the checkResources(nextJob) function, see Listing 4.6 above. This function returns a boolean value corresponding to whether the cluster has enough resources available to allocate the job. If the job can be allocated, the corresponding Job process receives a synchronous handshake and starts to run, and the cluster attempts to schedule the next job in the queue. If there are not enough resources, the cluster template returns to the initial location until it later attempts another allocation.
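The bounded quick-schedule loop can be sketched outside UPPAAL as follows. This is an illustrative Python simplification: the fits() callback stands in for checkResources() together with the resource bookkeeping, and default_queue_depth plays the same role as the SLURM parameter.

```python
def quick_schedule(queue, fits, default_queue_depth):
    """FCFS quick-schedule: examine at most default_queue_depth jobs from
    the head of the queue and stop at the first one that does not fit
    (FCFS never skips a blocked job)."""
    started = []
    for job in queue[:default_queue_depth]:
        if not fits(job):
            break                      # head-of-line blocking
        started.append(job)
    for job in started:
        queue.remove(job)
    return started

free_cpus = 10

def fits(job):
    """Toy stand-in for checkResources(): reserve CPUs if available."""
    global free_cpus
    if job["cpus"] <= free_cpus:
        free_cpus -= job["cpus"]
        return True
    return False

queue = [{"id": 0, "cpus": 4}, {"id": 1, "cpus": 8}, {"id": 2, "cpus": 2}]
started = quick_schedule(queue, fits, default_queue_depth=2)
# job 0 starts; job 1 blocks the head of the queue, so job 2 is not tried
```

The break on the first non-fitting job is exactly the head-of-line blocking the FCFS discussion above describes.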


The job template for the FCFS algorithm is shown in figure 4.2.

Figure 4.2: The job template for FCFS and SJF algorithm.

To ensure that each job's values are set before the cluster allocates, the initial location is declared as a committed location. The next location has two transitions leading from it: one which lets jobs that should arrive later go to a location where they stay until the system time equals their arrivalTime, and another for jobs that should arrive immediately. Both of these locations have a transition that creates a JobInfo object, see Listing 4.1, which is placed on the queue, calls a quick scheduling attempt from the cluster, using the Schedule synchronization channel, and leads to the ‘Waiting’ location in the template. When a process is in the ‘Waiting’ location it remains there until the cluster allocates resources to it. When the cluster allocates resources to the job, the job receives a handshake request from the Run[id] synchronous channel. This allows the job to move from the ‘Waiting’ location to the ‘Running’ location. In this transition the job gets allocated a node on the cluster using the allocate() function, where the resources are reserved from the node, and the job is removed from the queue using dequeue(). The job remains in the ‘Running’ location until its run time is finished and its resources are deallocated, using the deAllocate() function, before it enters the ‘Done’ location and synchronizes with the cluster to make a quick scheduling attempt.

Verification

The first specification for the FCFS algorithm states that processes get CPU allocated in the order of their arrival. First, the following query was constructed to verify that the model always eventually reaches its ‘Done’ location.

A<> Cluster.Done


The query returns true, and since there are no transitions from the ‘Done’ location we have verified that the FCFS model always reaches its ‘Done’ location. Next, the following query verifies that the order of job allocations is the same as the order of job submission. In order to verify this specification two arrays were added to the model: arrival[MAX_JOBS] and allocated[MAX_JOBS]. When the job process transitions from the initial location to the ‘Waiting’ location its id is stored in the arrival array, and when the job process transitions from the ‘Waiting’ location to the ‘Running’ location its id is stored in the allocated array. The following query can then be used in UPPAAL to verify that the aforementioned specification always holds.

A[] Cluster.Done imply (arrival == allocated)

Informally this query states that whenever the cluster process reaches the ‘Done’ location the arrays arrival and allocated are equal; since the ‘Done’ location has no outgoing transitions, the system is then in a deadlock, meaning no other transition is available. This query holds for the FCFS model, meaning there is no path through the generated state-space where the cluster process reaches its ‘Done’ location while the arrival and allocated arrays are not equal.

The second specification states that running processes cannot be interrupted, or preempted, in the FCFS model. In terms of the model, this means that a job process cannot transition from the ‘Running’ location back to the ‘Waiting’ location once allocated. To verify this behaviour two integer variables, waiting and running, were added to the model. These variables are incremented each time a transition to the ‘Waiting’ or ‘Running’ location is made. The property was then verified using the following query:

A[] forall(i : id_t) Cluster.Done imply (Job(i).waiting == 1 && Job(i).running == 1)

This query checks that invariantly, when the cluster process is in the ‘Done’ location, all jobs have visited the ‘Waiting’ and ‘Running’ locations exactly once. This property is satisfied in the model, meaning that a job process is never preempted.

The final specification for the FCFS model states that the algorithm is fair, or in other words, all processes eventually get CPU allocation. To verify this specification a transition was added to the job template: once a job has finished running it transitions back to the ‘Waiting’ location and re-enters the queue, which mimics the behaviour of multiple jobs entering the system. The number of times a job re-enters the queue is controlled by a variable run, which specifies how many times the model should loop before the job process enters the ‘Done’ location. A similar query as before was then used to verify that the order in which jobs arrive in the queue and are allocated on the cluster never changes.

A[] Cluster.Done imply (arrival == allocated)

This query shows that the order of arrival and allocation is invariantly the same in the model, proving that no job can starve.

4.3.2 SJF

As mentioned in chapter 3.4.3, SJF is a non-preemptive algorithm which allocates resources to the job on the queue that requires the least amount of runtime on the cluster. A cluster using the SJF algorithm is modelled in a similar fashion to FCFS; in fact, it uses the same templates as the FCFS model. The difference lies in how the next job is chosen.

Specifications

The specifications for a SJF algorithm are the following:

• Processes with the shortest service time get scheduled first.

• Running processes are not interrupted.

• Unfair, starvation can occur.

The first specification verifies that the shortest job will always be serviced before longer jobs. For jobs with an equal amount of required service time, the job that is further ahead in the queue is serviced first. The next specification, as in FCFS, ensures that when a job is in its ‘Running’ state it cannot be preempted. The last specification states that starvation can occur using the SJF algorithm.

Implementation

As mentioned before, the key difference between the SJF and FCFS algorithms is the order in which jobs are allocated to the cluster. This is decided with the schedule() function in the cluster process, see figure 4.1. In the SJF version of the model the nextJob variable is set to the job with the shortest run time using the following function.

void schedule()
{
    JobInfo tmp = list[0];
    int i;
    for( i = 0; i < len; i++){
        if(list[i].runTime < tmp.runTime && i < len){
            tmp = list[i];
        }
    }
    nextJob = tmp;
}

Listing 4.10: The schedule function of SJF

This function selects the job with the least required runtime by starting with the first job on the queue, looping through all other jobs and keeping the job with the shortest runtime. In case of a tie, the job closest to the head of the queue, including the job at the head itself, is chosen.
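The tie-breaking rule follows directly from the strict `<` comparison: only a strictly shorter job replaces the current candidate, so ties resolve toward the head of the queue. A hypothetical Python rendering of the same selection:

```python
def schedule(queue):
    """Pick the job with the shortest runTime.  Because only a strictly
    smaller runTime replaces the candidate, ties go to the job nearest
    the head of the queue."""
    next_job = queue[0]
    for job in queue[1:]:
        if job["runTime"] < next_job["runTime"]:
            next_job = job
    return next_job

queue = [{"id": 0, "runTime": 35},
         {"id": 1, "runTime": 10},   # shortest, earliest in the queue
         {"id": 2, "runTime": 10}]   # equal runtime, but further back
```

Here schedule(queue) selects job 1, not job 2, even though both have the same runtime.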

Verification

To verify that processes with the shortest service time get scheduled first, a committed intermediate location ‘preRunning’ was added between the ‘Waiting’ and ‘Running’ locations of the job template. The following queries verify this behaviour. The location is committed and therefore does not alter the behaviour of the processes.


A[] forall(i : id_t) Job(i).preRunning imply Job(i).isShortestJob() == true

The first query checks that when each job reaches its ‘preRunning’ location, that job is the shortest job on the queue. To verify that the job is the shortest, the function isShortestJob(), see Listing 4.11, is used.

bool isShortestJob(){
    int i;
    for( i = 0; i < len; i++){
        if(queue[i].runTime < runTime){
            return false;
        }
    }
    return true;
}

Listing 4.11: The isShortestJob function.

The function isShortestJob() returns a boolean value describing whether the job that runs the function is the shortest job on the queue. An alternative way to test this specification is presented with the following queries:

A<> Cluster.Done

A[] Cluster.Done imply Cluster.ascendingOrder() == true

Here, the first query verifies that the cluster always reaches its ‘Done’ location, and the second query states that once the cluster process is in its ‘Done’ location, when all jobs have finished running, all jobs in the allocated array are in ascending order, using the ascendingOrder() function shown in Listing 4.12.

bool ascendingOrder() {
    int i;
    for ( i = 1; i < MAX_JOBS; i++){
        if( allocated[i-1].runTime > allocated[i].runTime){
            return false;
        }
    }
    return true;
}

Listing 4.12: The ascendingOrder function.

This function iterates through the allocated array, which stores the order of allocation of the jobs, and checks whether the jobs were allocated in ascending order with regard to their runtime.

To verify that running processes cannot be interrupted, the same query as in FCFS is used.

A[] forall(i : id_t) Cluster.Done imply (Job(i).waiting == 1 && Job(i).running == 1)

As in FCFS, the state-space is searched to verify that in all computational paths the cluster process reaches its ‘Done’ location having visited the ‘Waiting’ and ‘Running’ locations only once. The query returns true, so it is proven that the jobs are never interrupted.


As mentioned in the Background chapter, see chapter 3, starvation can occur when multiple short jobs enter the system and block longer jobs from being allocated to the cluster. To verify that starvation may occur in the SJF algorithm, an integer variable jobsRun was added to the model which counts all jobs that were able to transition to their ‘Running’ location. The following query was then used to verify that starvation can occur in the model.

A[] jobsRun < MAX_JOBS

In the example set up for the verification, 4 jobs were allowed to queue and then re-enter the queue once they finished running. The resource needs and expected runtime for each job can be seen in table 4.1. The cluster consists of only one node with 10 available CPUs. All jobs arrive at time 0.

Job ID  | 0  | 1   | 2  | 3
Runtime | 10 | 100 | 35 | 10
cpus    | 5  | 1   | 2  | 20

Table 4.1: SJF starvation example

The query checks that globally in the state-space jobsRun < MAX_JOBS. The query is satisfied, meaning at least one job is never able to run. In this example at least one of the four jobs is never able to transition to its ‘Running’ location. Further verification shows that these are Job(1) and Job(2), which require the longest runtimes of the jobs in the system. Therefore we can conclude that Job(1) and Job(2) are being starved, since Job(0) and Job(3) always re-enter the queue and are re-scheduled because they require less runtime, never leaving enough resources for Job(1) and Job(2) to be allocated. This was verified using the following four queries.
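The same starvation effect can be reproduced with a toy event-driven simulation outside UPPAAL. The sketch below uses illustrative job sizes rather than the exact values of Table 4.1: two short jobs that re-enter the queue on completion repeatedly occupy the node, so a long job never fits.

```python
def simulate_sjf(jobs, total_cpus, horizon):
    """Non-preemptive SJF where finished jobs immediately re-enter the
    queue; returns how many times each job got to run before `horizon`."""
    queue = list(jobs)              # all jobs arrive at time 0
    running = []                    # list of (finish_time, job)
    free = total_cpus
    runs = {j["id"]: 0 for j in jobs}
    t = 0
    while t < horizon:
        # as in the model: pick the shortest job; stop when it does not fit
        while queue:
            job = min(queue, key=lambda j: j["runTime"])  # head wins ties
            if job["cpus"] > free:
                break
            queue.remove(job)
            free -= job["cpus"]
            running.append((t + job["runTime"], job))
            runs[job["id"]] += 1
        if not running:
            break
        # advance to the next completion; finished jobs re-enter the queue
        t = min(f for f, _ in running)
        finished = [j for f, j in running if f == t]
        running = [(f, j) for f, j in running if f > t]
        for j in finished:
            free += j["cpus"]
            queue.append(j)
    return runs

jobs = [{"id": "short_a", "runTime": 10, "cpus": 6},
        {"id": "short_b", "runTime": 10, "cpus": 6},
        {"id": "long",    "runTime": 100, "cpus": 6}]
runs = simulate_sjf(jobs, total_cpus=10, horizon=200)
```

On a 10-CPU node, one short job is always running, so the scheduler always finds a shorter job than "long" whenever resources free up, and the long job starves, mirroring the verified behaviour of the model.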

Job(3).Waiting --> Job(3).Running

Job(2).Waiting --> Job(2).Running

Job(1).Waiting --> Job(1).Running

Job(0).Waiting --> Job(0).Running

These leads-to queries check that whenever a job is in its ‘Waiting’ location, some transitions will eventually lead it to its ‘Running’ location. The first and last queries are satisfied, while queries 2 and 3 are not, meaning Job(1) and Job(2) are being starved.

4.3.3 RR

Round Robin (RR) is an algorithm similar to FCFS, except that it preempts running jobs after a certain time. This time is called a timeslice, and is set in the model using a constant integer value called TIMESLICE. When a job is preempted it re-enters the queue.

Specifications

The specifications for a RR algorithm are the following:

• Service time is divided into timeslices.


• Running processes can be interrupted, or preempted.

• Fair, all processes eventually get CPU allocation.

These are the specifications for the RR algorithm. They state that a job process can only stay in the running state for at most the time specified by the timeslice before it is interrupted. The second specification states that running processes can be interrupted, or preempted. Finally, starvation cannot occur when using the RR algorithm.

Implementation

The cluster template for the RR algorithm is the same template as used in FCFS and SJF, except that the added constraint lastTask != nextJob.id is checked when scheduling and quick scheduling. This constraint avoids a looping behaviour in the model when only one job is left in the queue: it forces the process to go back to the initial location to allow time to pass in the system, and therefore allows the job process to finish its execution without interruption.

Figure 4.3: The cluster template for RR and SRTF algorithms.

Since RR is a preemptive algorithm, the job template needs a way to go from the ‘Running’ location back to the ‘Waiting’ location. This behaviour can be seen in figure 4.4. From the ‘Running’ location, the job can take three transitions:


Figure 4.4: Job template with preemption for RR and SRTF algorithm.

1. The first is a transition where the job is preempted when the timeslice has passed. This transition performs a synchronization message through the Preempt channel followed by a synchronization message through the Schedule channel, so the cluster performs a quick schedule attempt. The transition is taken when the job template's clock variable, called jobTime, becomes equal to the timeslice. The clock jobTime is reset every time the process enters its ‘Running’ location.

2. The next transition is only taken when a synchronization message is received from the Preempt channel. This makes all other running jobs go to a location where three checks are performed. First, it is checked whether their timeslice has expired. If their timeslice has expired and their runtime has not, they are deallocated and placed at the back of the queue. If neither their timeslice nor their runtime has expired, they return to the ‘Running’ location. Finally, if their remaining runtime is equal to the timeslice, they transition to the ‘Done’ location.

3. The final possible transition from the ‘Running’ location is the same as in the other job templates: when the job has finished its runtime it transitions to a location where it immediately performs a Schedule synchronization message, so the cluster makes a quick schedule attempt, before finally transitioning to the ‘Done’ location.

The remaining behaviour of the job template is the same as in the job templates of FCFS and SJF.
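The timeslice rotation can be illustrated with a much-simplified Python sketch that collapses the cluster to a single processor; the UPPAAL model preempts per-node allocations, but the queue discipline is the same.

```python
from collections import deque

def round_robin(jobs, timeslice):
    """Run jobs round-robin: each gets at most one timeslice, then is
    preempted and re-enters the back of the queue until its runtime
    is exhausted.  Returns the (id, end-of-slice time) trace."""
    queue = deque(dict(j) for j in jobs)   # copy so callers keep their jobs
    trace = []
    t = 0
    while queue:
        job = queue.popleft()
        run = min(timeslice, job["runTime"])   # last slice may be shorter
        t += run
        job["runTime"] -= run
        trace.append((job["id"], t))
        if job["runTime"] > 0:
            queue.append(job)                  # preempted, back of the queue
    return trace

trace = round_robin([{"id": 0, "runTime": 5},
                     {"id": 1, "runTime": 3}], timeslice=2)
```

With a timeslice of 2, the two jobs alternate (0, 1, 0, 1, 0) until both runtimes are exhausted, so every job keeps making progress, which is the fairness property verified below.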

Verification

As mentioned above, the first specification for RR is that service time is divided into timeslices. When a job finishes its timeslice and has not finished running, it is preempted and returns to its ‘Waiting’ location. This behaviour is verified in the model using the following query:

A[] forall(i : id_t) Job(i).Running imply Job(i).jobTime <= TIMESLICE


Here, the state-space is explored to see whether a computational path exists where a job in its ‘Running’ location can exceed the TIMESLICE. The query checks that globally, if a job is in its ‘Running’ location, the process's jobTime clock is always less than or equal to TIMESLICE. The query is satisfied, since the ‘Running’ location in the job process of the model is bound by the invariant jobTime <= TIMESLICE, meaning that states that do not fulfil this condition are not part of the state-space.

The second specification states that running processes can be interrupted, or preempted. To check whether a job can be interrupted, variables were added to the model that count how many times the process enters the ‘Running’ and ‘Waiting’ locations. The query checks that, when all jobs are in the ‘Done’ location, all jobs have entered the ‘Waiting’ and ‘Running’ locations more than once. For this to be possible, each job in the model must have a runtime of at least runTime > 2 * TIMESLICE. This is verified using the following query.

A[] Cluster.Done imply forall(i : id_t) Job(i).runningCount > 1 && Job(i).waitingCount > 1

The query checks whether invariantly, when the cluster process is in the ‘Done’ location, all job processes have visited their ‘Running’ and ‘Waiting’ locations more than once. This query is satisfied, and therefore it is verified that the model's behaviour satisfies this property.

The final specification for RR is that the algorithm is fair, that is, all processes eventually get CPU allocation. To verify this, a transition was added from the ‘Done’ location in the job template back to the ‘Waiting’ location, re-entering the job into the queue. The same strategy as in SJF was then used to check whether all jobs eventually reach their ‘Running’ location. The following query was used to verify this behaviour.

A<> jobsRun == MAX_JOBS

The query is satisfied, meaning that all job processes always reach the ‘Running’ location. Therefore, it is verified that RR is a fair algorithm and starvation cannot occur.

4.3.4 SRTF

SRTF is the preemptive version of SJF, just as RR is the preemptive version of FCFS. It therefore uses the same model as RR and only changes the schedule() function to set the nextJob variable to the job on the queue with the shortest runtime.

Specifications

The requirements for a SRTF algorithm are the following:

• Service time is divided into timeslices.

• Processes with the shortest service time get scheduled first.

• Running processes may be interrupted.

• Unfair, starvation can occur.

The specifications for SRTF state that, as in RR, a job process only gets service time equal to the specified timeslice before being preempted and re-entering the queue. Starvation may occur when using the SRTF algorithm.


Implementation

As stated before, SRTF uses the same model as RR. The difference lies in which job is chosen to be scheduled. The job is chosen using the following function.

void schedule() {
    JobInfo tmp = list[0];
    int i;
    for( i = 0; i < len; i++){
        if(list[i].runTime < tmp.runTime){
            tmp = list[i];
        }
    }
    nextJob = tmp;
}

Listing 4.13: The schedule function for clusters using SRTF.

Here, as in SJF, the job with the shortest runtime is returned from this function; this ensures that the shortest job closest to the head of the queue will always be scheduled first.

Verification

As in RR, the specification that service time is divided into timeslices is checked using the following query:

A[] forall(i : id_t) Job(i).Running imply Job(i).jobTime <= TIMESLICE

The results show that the model behaves as expected; that is, the job processes never stay in the ‘Running’ location longer than the timeslice allows.

The next requirement for SRTF is that processes with the shortest service time get scheduled first. Using the same functions and queries as in the SJF model, we can verify that, each time a job transitions from its ‘Waiting’ location to its ‘Running’ location, it is the job that requires the shortest runtime on the cluster. The function isShortestJob() can be seen above in Listing 4.11. The queries are as follows:

A[] forall(i : id_t) Job(i).preRunning imply Job(i).isShortestJob() == true

A[] Cluster.Done imply Cluster.ascendingOrder() == true

As in SJF, these queries are satisfied, meaning that each time a job is allocated it is indeed the shortest job on the queue.

Since SRTF is a preemptive algorithm like RR, the specification that running processes can be interrupted, or preempted, was tested. As in RR, to verify whether the behaviour of the model meets this requirement, two variables are added to the model to count how many times the process visits the ‘Waiting’ and ‘Running’ locations. The following query was then used to verify this behaviour.

A[] Cluster.Done imply forall(i : id_t) Job(i).runningCount > 1 && Job(i).waitingCount > 1


As in RR, the query checks whether invariantly, when the cluster reaches its ‘Done’ location, the ‘Waiting’ and ‘Running’ locations in all job processes have been visited more than once. The query is satisfied in the SRTF model.

The final specification for SRTF was that starvation can occur. Starvation was verified in the same way as in SJF. Four jobs with the same resource needs and runtime requirements as in the SJF example, see table 4.1, were run on a cluster with one node with 10 available CPUs. The short jobs were allowed to re-enter the queue once done, via a transition from the ‘Done’ location back to the ‘Waiting’ location; therefore the longer jobs never managed to be allocated and starved. The following query was used to verify the specification:

A[] jobsRun < MAX_JOBS

As in the verification for SJF, Job(1) and Job(2) are starved, since they are never able to be allocated on the cluster.

4.3.5 Backfilling

The Backfilling algorithm, as detailed in chapter 3.4.3, is an algorithm built into SLURM and is similar to the FCFS algorithm except that it attempts a backfill at a given time interval. As mentioned in the Background chapter 3, when the cluster attempts a backfill it goes through a ‘window’ of jobs on the queue and attempts to allocate them if they do not postpone the allocation of jobs at the head of the queue.

Specifications

The specifications for a Backfilling algorithm are the following:

• Processes get CPU allocated in the order of their arrival, except if backfilled.

• A backfilled job will not postpone the allocation of a job at the head of the queue.

• Running processes are not interrupted.

The specifications for the Backfilling algorithm state that the cluster allocates jobs in FCFS order and attempts a backfill at a certain time interval. A backfilled job cannot postpone the allocation of a job at the head of the queue. Furthermore, running jobs cannot be preempted.

Implementation

The implementation of the Backfilling model is more complex than that of the aforementioned models. To model the Backfilling algorithm's behaviour efficiently, the cluster template had to be changed and an additional process for Backfilling was added to the model. As mentioned earlier, see 4.1, backfilling is a time consuming process and SLURM may sometimes have to either cancel the calculation if it is too time consuming or disregard new jobs. This behaviour was not implemented in the model; however, in order to open the possibility of adding it later, the Backfilling process was modelled as a separate template, which allows delays and other behaviour to be added to the template later. The job process for Backfilling is the same as in the non-preemptive models. The cluster template for the Backfilling scheduling algorithm can be seen in the following Figure 4.5.


Figure 4.5: Cluster template for SLURM backfilling algorithm.

Initially, when the model starts running, a scheduling attempt is performed. Here the cluster continues to allocate resources to jobs on the queue until a job is found that does not fit on the cluster. This is done by setting the nextJob variable, as in the other models, to the next job on the queue using a similar scheduling function as in FCFS, except that in the Backfilling model the function returns a JobInfo object.

JobInfo getNextJob()
{
    return queue[0];
}

Listing 4.14: The getNextJob function for Backfilling.

When a job does not fit on the cluster, the model transitions to an urgent location where it calculates what type of transition is performed next. This is done using a transition from this intermediate location back to the initial location that calls the setMinTime() and setTime() functions. The setMinTime() function checks all running jobs, which are stored in an array of JobInfo objects called onCluster, and finds the job with the least amount of runTime left, see Listing 4.15.

void setMinTime(){
    minTime = onCluster[0].runTime;
    for(x : int[0, MAX_JOBS-1]) {
        if(x >= onClusterlen){
            return;
        }
        if(onCluster[x].runTime < minTime){
            minTime = onCluster[x].runTime;
        }
    }
}

Listing 4.15: The setMinTime function for the Backfilling algorithm.

The setTime() function, see Listing A.2 in the Appendix, decides what type of transition should be taken next, depending on the current time of the system. Once the initial transitions are finished and the time to wait for the next transition has been calculated, the cluster returns to the initial location. There are four possible transitions from this location:

1. A transition that leads to a scheduling attempt.

2. A transition that leads to a quick scheduling attempt.

3. A transition that attempts a backfill of jobs.

4. A transition that is taken when a job, or jobs, finish running.

As mentioned before, a backfill attempt is made at an interval decided by BCFILL_INTERVAL, a scheduling attempt is made at an interval decided by SCHED_INTERVAL, and a quick scheduling attempt is made each time a job enters the queue or a job terminates. Depending on the system time, setTime() determines which transition should be performed next and sets the variable waitTime accordingly. When the cluster re-enters the initial state it delays for the time set by waitTime and then fires the appropriate transition, be it a backfill attempt, a scheduling attempt, a quick schedule attempt or the action to terminate all jobs that have finished, which is followed by a quick scheduling attempt.
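In essence, setTime() picks the nearest of the upcoming events. A hypothetical sketch of that choice (names and the event bookkeeping are illustrative; the actual function is Listing A.2 in the Appendix):

```python
def next_transition(now, last_sched, last_backfill, min_finish,
                    sched_interval, backfill_interval):
    """Pick the nearest upcoming event the cluster must wake up for:
    a regular scheduling attempt, a backfill attempt, or the earliest
    job completion (minTime in the model).  Returns (event, wait)."""
    due = {
        "sched": last_sched + sched_interval,
        "backfill": last_backfill + backfill_interval,
        "finish": min_finish,
    }
    event = min(due, key=due.get)
    return event, due[event] - now

event, wait = next_transition(now=10, last_sched=10, last_backfill=0,
                              min_finish=17,
                              sched_interval=60, backfill_interval=30)
```

With these example numbers the earliest job completion at time 17 precedes both the next backfill (time 30) and the next full scheduling pass (time 70), so the cluster waits 7 time units and then fires the termination transition.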

If setTime() sets the next transition to be a scheduling attempt, the cluster repeats theprocess described above in the description of the initial run, that is, it schedules all thepossible jobs in the order they are on the queue until the next job on the queue does not fiton the cluster. At that point it returns to the aforementioned intermediate location to calculatewhat transition should be taken next and how much time it should wait. If a quick scheduletransition is made, when a job is submitted, the cluster attempts to schedule a maximumof default_queue_depth jobs on the queue. If a job does not fit, the cluster returns to theinitial state and waits until the next transition is taken. Note that, when a quick scheduleattempt is performed when a job is submitted on the queue, the cluster does not calculatewhat transition to take next; it simple returns to the initial state and performs the task it wasmeant to take before the quick schedule transition was taken. When the setTime() functionsees that the next transition should be taken to allow a job or jobs to terminate, the clustertransitions to a committed location which checks whether all jobs that are running on thecluster if they have finished their runtime and sends them a message through a synchronouschannel to terminate. When this is completed, it attempts a quick schedule attempt beforereturning to the intermediate state that calculates what transition to take next. If a backfillingis attempted, the cluster process sends a message to the Backfill process. The Backfillingprocess can be seen on the following image 4.6.When a backfill is in progress the Backfill process performs a series of calculations seenon the transition from the initial state. First it resets the shadowTime structure, discussedbefore in this chapter 4.2.1. Thereafter, it copies the state of the cluster to a new array ofNode structures; this allows subsequent functions to attempt backfilling without affecting thestate of the nodes. 
Next it sorts the jobs on the cluster in ascending order of their expected termination time. With the sorted array, it is possible to loop over the array of jobs and collect nodes until the number of available nodes is sufficient for the first job in the queue; the shadowTime object variables are updated to record the time until that job can be allocated and the extra resources available after it is allocated. Finally, the jobs that can be backfilled are found using the setBackFillJobs() function. Here the queue is scanned and all jobs that fulfil the backfilling requirement are found; these are jobs that require no more than the currently free nodes and will terminate before the shadow time is reached, or


Figure 4.6: Backfilling template.

jobs that require no more than the minimum of the currently free nodes and the extra nodes. In the next location, the backfill jobs are allocated to the cluster and thereafter the process returns to the initial location, ready to receive another synchronization message when the cluster performs the next backfill. The cluster process then returns to the intermediate location and sets the waitTime variable again before returning to the initial location.
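The shadow-time computation just described can be illustrated with a small, simplified sketch. This is my own illustration, not code from the model: it measures all times relative to the current instant and, for brevity, does not deduct nodes as jobs are picked for backfilling.

```python
def backfill(free_nodes, running, queue):
    """EASY-style backfilling sketch.

    free_nodes: number of currently idle nodes
    running:    list of (nodes_used, expected_end_time) for running jobs
    queue:      list of (job_id, nodes_needed, runtime), head first
    Returns ids of jobs that may be backfilled without delaying
    the job at the head of the queue.
    """
    if not queue:
        return []
    head_nodes = queue[0][1]

    # Collect nodes from running jobs in ascending order of their
    # termination time until the head job fits; the time at which it
    # fits is the "shadow time".
    available = free_nodes
    shadow_time = 0
    for nodes_used, end_time in sorted(running, key=lambda j: j[1]):
        if available >= head_nodes:
            break
        available += nodes_used
        shadow_time = end_time
    # Nodes left over once the head job has been allocated.
    extra_nodes = available - head_nodes

    backfilled = []
    for job_id, nodes_needed, runtime in queue[1:]:
        # A job may be backfilled if it fits in the free nodes and ends
        # before the shadow time, or fits in min(free, extra) nodes.
        if (nodes_needed <= free_nodes and runtime <= shadow_time) or \
           nodes_needed <= min(free_nodes, extra_nodes):
            backfilled.append(job_id)
    return backfilled
```

For example, with 2 free nodes, a 4-node job ending at time 10 and a 5-node job at the head of the queue, a 2-node job of length 5 fits before the shadow time, and a 1-node job fits in the single extra node.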

As mentioned before, the Job process for the backfilling model is the same as in the other non-preemptive models.

Verification

To verify the specification that processes get CPUs allocated in the order of their arrival, except if backfilled, the same strategy as in FCFS and RR was used for Backfilling. Two arrays, arrival[MAX_JOBS] and allocated[MAX_JOBS], were added to the model and the order of the jobs was recorded using these arrays. Additionally, a boolean variable was added to block backfilling attempts and the setTime() function was changed in order to block a transition to a backfilling attempt. The following query was then used to check that, if no backfilling occurs, the processes get allocated in the order of their arrival.

A[] Cluster.Done imply arrival == allocated

The query verifies that when the cluster process reaches its ‘Done’ location the arrays arrival and allocated are equal, that is, the order of process ids in the arrays is the same. This query holds for the Backfilling model, meaning there is no path through the generated state-space in which the cluster process reaches its ‘Done’ location while the arrival and allocated arrays are not equal. This confirms that when backfilling does not occur the algorithm behaves like the FCFS algorithm.

The next specification is as follows: backfilled jobs will not postpone the allocation of a job at the head of the queue. To verify the specification, a transition had to be added to the job template where, when jobs were being backfilled, the job at the head of the queue would transition from the ‘Waiting’ location to a new location ‘ScheduledWait’ bounded with the invariant schedClock <= schedAlloTime. The variable schedClock is a new clock variable in the job template which counts how long the process stays in this location, and schedAlloTime is the sTime variable, calculated in the Backfilling process, at which the job at the head of the queue would otherwise be allocated. A job process in this new location waits until it should be allocated before transitioning to an urgent location, where time cannot proceed, and gets a synchronization from the cluster to allocate. Since states that violate the invariant schedClock <= schedAlloTime cannot be reached in UPPAAL, and time cannot proceed in the urgent location where the job process receives the synchronous broadcast from the cluster to run, the system cannot fail to behave as the specification requires. The following queries verify this behaviour:

A[] forall(i : id_t) Job(i).ScheduledWait imply Job(i).schedClock <= Job(i).schedAlloTime

The query states that, invariantly, for all jobs in the ‘ScheduledWait’ location the job's schedClock variable is always less than or equal to the scheduled allocation time, schedAlloTime. This query is satisfied, which proves the specification that backfilled jobs do not postpone the allocation of a job at the head of the queue.

The final behavioural specification for Backfilling is that running processes are not interrupted. This is verified in the same way as for the other models above. The query is the following:

A[] Cluster.Done imply forall(i : id_t) Job(i).runningCount <= 1 &&
Job(i).waitingCount <= 1

The query is satisfied, meaning no job process ever returns to its ‘Waiting’ location after entering its ‘Running’ location.

Using the aforementioned models an analysis was performed to see how clusters usingthese scheduling algorithms would perform.


Chapter 5

Experimental Design

In this chapter the setup for the models and the analysis of deCODE's cluster will be discussed, starting with the criteria for the experiments, briefly mentioned in chapter 3.4.2, which will be discussed in more detail than before, as well as the queries used to gather the data from the simulated runs of the models. Thereafter, the setups for the two analyses, that is Analysis one and Analysis two, that were performed on the models are described. Analysis one compares the models with respect to the chosen criteria on three different sizes of clusters and with three different job loads. In Analysis two, a job trace gathered from deCODE's SLURM database is simulated on an emulated cluster, with the same size as deCODE's cluster, using the SLURM simulator tool. The result of the run is then compared with the same job trace run through each model. The following sub-chapters detail how these analyses were performed and how the different parameters of the models were set.

5.1 Criteria

As mentioned before, in Chapter 3.4.2, the models were evaluated and compared with respect to the following criteria:

• Throughput,

• Average Residence Time,

• Average Waiting Time, and

• Resource Utilization.

To obtain this information from the models, multiple simulations were run on the models using various queries. The queries were not simulated in the UPPAAL SMC's GUI application because of the size of the simulations and the number of them that had to be run. Instead they were run with UPPAAL's command line verification tool verifyta. Using verifyta, the simulations were run on deCODE's cluster to allow for multiple simultaneous simulations and to be able to use more powerful cores that can perform the computations in the simulation quicker than a regular desktop computer. The simulations also require a large amount of memory; therefore the nodes on the cluster had to be used since they have much more available RAM than a regular desktop computer. When run with a simulation query, verifyta returns a file containing the x and y coordinates for a line graph that shows how the expressions in the query behave. The output was then analysed by reading the output files into a Java program that produces the line graphs and calculates the average waiting time, the average residence time and average resource utilization. The following queries were used to gather this information.

To measure the throughput of a model, a simulation was needed that shows when all jobs had finished running. The following query was used:

simulate 1 [<=10000] {numOf(Job), jobWaiting, jobDone, len}

Listing 5.1: Query to measure the throughput of a model.

This query returns the number of jobs in the system, as well as the number of jobs that are in their ‘Waiting’ location and ‘Done’ location, and the length of the queue. Using this information the throughput can be calculated by getting the time at which all jobs have finished and dividing the number of jobs by that time. This gives us the throughput of the cluster, seen in figure 5.1.

Figure 5.1: An example of a simulation.

The lines in the graph represent the number of jobs in the system, the number of jobs that are in their ‘Waiting’ location, the number of jobs that are in their ‘Done’ location and the length of the queue.
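As a concrete illustration, the throughput computation described above can be sketched as follows. The helper and the (time, jobs done) pairs are my own stand-ins for the coordinate files verifyta writes out, not the thesis' actual tooling:

```python
def throughput(points, total_jobs):
    """points: (time, jobs_done) samples from a simulation trace,
    in increasing time order. The throughput is the number of jobs
    divided by the time at which the last job finished."""
    for time, done in points:
        if done == total_jobs:
            return total_jobs / time
    raise ValueError("not all jobs finished within the simulation")
```

For example, 100 jobs that have all reached their ‘Done’ location at time 171 give a throughput of 100/171, about 0.585 jobs per time unit.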

It is preferable for a cluster and all its users to have the least amount of waiting time. To calculate the average waiting time of the cluster in the models, the following query was used.

simulate 1 [<=10000] {foreach (j : Job) (10*j.id + 2*j.Waiting + 3*j.Running)}

Listing 5.2: Query to measure average waiting and turnaround time.

The query returns a Gantt-like graph of the jobs in the system. To illustrate the output more clearly, more space was put between the lines by multiplying the id of the job process by 10 in the chart. When a job process enters its ‘Waiting’ or ‘Running’ location, the corresponding expression becomes one. This value can then be multiplied by a constant to illustrate on the graph when the process is in the location. For example, when a job process is in its ‘Waiting’ location j.Waiting returns the value 1; by multiplying this value by 2 the y value for the line becomes 10*id + 2. Therefore we can see in the graph when the process enters its ‘Waiting’ location. The same applies for the ‘Running’ location, except then the value is 10*id + 3. The following figure shows a Gantt chart for a run.


Figure 5.2: A Gantt chart for simulations

The average waiting time can then be calculated by reading all the output files for the models, adding up, for each job, the time during which the value is the job's 10*id + 2, and dividing by the number of jobs in the system.

Turnaround time, or residence time, is all the time a job process spends in either its ‘Running’ or its ‘Waiting’ location. The average residence time is calculated just like the average waiting time, as described above, but now adding the time the process is in its ‘Running’ location as well and then calculating the average.
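Both averages can be read off the Gantt output in one pass. The segment representation below is my assumption about the parsed output, not the thesis' Java program:

```python
def waiting_and_residence(segments, num_jobs):
    """segments: (job_id, start, end, y) pieces read from the Gantt
    output, where y == 10*id + 2 marks Waiting and y == 10*id + 3
    marks Running. Returns (avg waiting time, avg residence time)."""
    waiting = running = 0.0
    for job_id, start, end, y in segments:
        if y == 10 * job_id + 2:      # time spent queued
            waiting += end - start
        elif y == 10 * job_id + 3:    # time spent executing
            running += end - start
    # Residence (turnaround) time is waiting plus running time.
    return waiting / num_jobs, (waiting + running) / num_jobs
```

For instance, a job that waits 5 units and runs 10, alongside one that waits 3 and runs 7, gives an average waiting time of 4 and an average residence time of 12.5.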

Resource utilization is an important criterion for the administrators of a cluster. It shows how well the resources are utilized in the system, and can therefore indicate what resources, if any, should be added to the system. The resource utilization is gathered using the following query.

simulate 1 [<=10000] {sum(i : id_n) nodes[i].cpus, sum(i : id_n) nodes[i].tmpDisk, sum(i : id_n) nodes[i].realMemory}

Listing 5.3: Query to measure the resource utilization of a model.

This gives a line chart of the currently available resources for each time unit. To obtain the average utilization, the amount of available resources is subtracted from the maximum available resources and the average is taken of that. The following image shows an example of such a graph:


Figure 5.3: An example of the resource utilization of a model.

The blue line in the figure shows the utilization of the disk space in the cluster, the red line corresponds to the RAM usage and the green line is the CPU utilization.
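The averaging step for one resource can be sketched as follows (the names are illustrative, not from the thesis' tooling):

```python
def avg_utilization(samples, capacity):
    """samples: (time, available) pairs for one resource, taken at
    regular intervals; capacity: the maximum amount available.
    Average utilization = mean of (capacity - available) / capacity."""
    used = [capacity - available for _, available in samples]
    return sum(used) / (len(used) * capacity)
```

For example, a resource with capacity 100 that is sampled as fully free, then 60 free, then 40 free has an average utilization of one third.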

Using these queries, two analyses were performed on the models. The remaining sections of this chapter will present the experimental designs for those two analyses, while chapter 6 will present the results of the experiments.

5.2 Analysis One

As mentioned before, Analysis one compares the different models with regard to the aforementioned criteria. The models were tested on three differently sized clusters with three different job loads. This was done to show how the different models, modelling the different algorithms, behave with different resources and job loads. The three different cluster sizes are shown in table 5.1, where each cell is the sum of all the available resources at each node of the cluster.

Table 5.1: The different cluster sizes

        C10     C50     C100
CPUs    260     1316    2640
RAM     15035   65035   133855
HDD     38500   191500  391000

C10, C50 and C100 stand for the number of nodes in the cluster: C10 has 10 nodes, C50 has 50 nodes and C100 has 100 nodes. The nodes offer various amounts of resources, shown in Appendix A.2.

The models were all tested with 3 different job loads as well, where the smallest job load was 1000 jobs, the medium load was 5000 jobs and the highest load was 10000 jobs. The jobs were created and spawned in the Environment template, which will be discussed in the following sub-chapter.


5.2.1 The Environment Template

The Environment template, seen in figure 5.4, was created to emulate the average job spawn rate and the resource needs of the jobs spawned over the duration of a whole month on deCODE's cluster. This was done by collecting all the jobs that were submitted on deCODE's cluster during the month of January 2015 and calculating the average spawn rate of the jobs, as well as the distribution of CPU, RAM and runtime requirements for the jobs.

Figure 5.4: The Environment Process.

The gathered runtime data is shown in the following two figures, figure 5.5 and figure 5.6. Figure 5.5 shows all the runtimes of the jobs that ran on deCODE's cluster during January 2015. The large majority of jobs ran for less than 2000 seconds, which can barely be seen in figure 5.5. Figure 5.6 zooms in to show the number of jobs that run for shorter runtimes.


Figure 5.5: Length of jobs in deCODE’s cluster.

Figure 5.6: Length of jobs in deCODE’s cluster zoomed in.

The rate of job spawns was calculated using the gathered data from deCODE's cluster and can be seen in figure 5.7.


Figure 5.7: Spawn rate of jobs in deCODE’s cluster.

The rate of job spawns is set in UPPAAL using the exponential distribution function of the form Pr(t) = 1 − e^(−λt), where λ is the fixed rate and t is the time, as an invariant on a location [15]. Using the collected data of job spawns at deCODE, a suitable value for λ was acquired by calculating the slope of job spawns. The slope of the line seen in figure 5.7 is 2.5469 and therefore λ = 2.5469. To verify this, a UPPAAL model was created that spawns empty jobs for 1000 time units to see if it in fact behaves like the gathered data. Using the calculated slope, 2546 jobs would be spawned on deCODE's cluster in 1000 time units. The UPPAAL model with this rate spawned an average of 2540 jobs in five simulations, see figure 5.8.

Figure 5.8: UPPAAL model with spawn rate λ = 2.5469
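This sanity check — that rate λ = 2.5469 yields on the order of 2547 spawns in 1000 time units — can also be reproduced outside UPPAAL with a tiny Poisson-process simulation. The sketch below is my own, not the thesis model:

```python
import random

def count_spawns(lam, horizon, seed=1):
    """Count arrivals of a Poisson process with rate lam over
    [0, horizon): inter-arrival times are exponentially distributed,
    matching the exponential rate UPPAAL attaches to a location."""
    rng = random.Random(seed)
    t = 0.0
    count = 0
    while True:
        t += rng.expovariate(lam)
        if t >= horizon:
            return count
        count += 1

# The expected count is lam * horizon, about 2547; a single run should
# land within a few standard deviations (sqrt(2547) is about 50).
```

Averaging several seeds approximates the five-simulation average of 2540 reported above.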

UPPAAL's default integer type is bounded to the range [−32768, 32767]; therefore values stored in an integer variable can only reach a maximum value of 32767. This means that, for example, the runtime of a job could not exceed 32767 in UPPAAL, but clearly, as can be seen in figure 5.5, many jobs have a runtime greater than 32767. One solution would be to use a variable to count the overflows so it would be possible to have longer jobs in the system, and this is in fact done in Analysis two. But, as can be seen in figure 5.6, most jobs only run for less than 200 time units. In order to shorten the simulations, the longest jobs were abstracted away and the remaining job runtimes were normalized so that the longest runtime would only be 200 time units in the Environment template.

The Environment template creates a job process by traversing the edges shown in figure 5.4, assigning runtime and resource needs to the job, and finally spawning the process in the system. The transitions create branches where each edge is assigned a weight; for example, from the initial state there is one edge that is further branched into 5 edges, all assigned different weights. The leftmost branch in the figure has a weight of 703 and assigns the variable ServiceTime = 1, while the other edges have weights of 208, 66, 14 and 6. The probability of the model taking the leftmost branch is determined by the ratio of its weight to the sum of all weights emitting from the branch point, in this case 703/(703 + 208 + 66 + 14 + 6) ≈ 0.70. As can be seen in these branches, the maximum runtime that a job can have is 200 time units.
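The branch-weight mechanism corresponds directly to a discrete weighted choice. This small sketch uses the weights quoted above; the branch indices are my own labels:

```python
import random

weights = [703, 208, 66, 14, 6]   # edge weights of the five branches
total = sum(weights)              # 997

# Probability of each branch: its weight over the sum of all weights.
probs = [w / total for w in weights]

# UPPAAL's branch selection behaves like a weighted random choice:
def pick_branch(rng=random):
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Here probs[0] evaluates to roughly 0.705, matching the 703/997 ≈ 0.70 quoted in the text.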

The process then continues and sets a suitable disk space. The HDD usage is not accounted for in SLURM; therefore a random value between 0 and 2000 units is set for each job.

The number of CPUs used on deCODE's cluster can be seen in figure 5.9. As mentioned in the abstraction section in Chapter 4.1, jobs cannot use resources from more than one node. Since no node has more than 28 CPUs in the models, the larger jobs, requiring more than 28 CPUs, were abstracted away when setting the required CPUs in the Environment template. The branches for CPU requirements were made using the remaining data.

Figure 5.9: CPU requirements in deCODE’s cluster.

The RAM requirements for the jobs were handled similarly to the CPU requirements, that is, the jobs that require more RAM than is available on a single node were not considered.


Figure 5.10: RAM requirements in deCODE’s cluster.

The next section discusses the settings that were applied to the models for the experiments.

5.2.2 Model Settings

As described in the Methods chapter, see Chapter 4, all the models have some settings that control the behaviour of the model, such as the interval between attempted schedules. This section will describe the settings in the models used in the two following analyses. The chosen settings are the default values in SLURM where possible, while for the settings for which SLURM does not offer a default, a suitable value was chosen.

The following are the values for each model.

Table 5.2: Model Settings

           sched_interval  queue_depth  timeslice  backfill_interval  max bfq
FCFS       6               100
SJF        6               100
RR         6               100          2
SRTF       6               100          2
Backfill   6               100                     3                  100

The default value for the scheduling interval in SLURM is 60 seconds. To make the simulations shorter and correspond to the abstraction made on the runtimes discussed above, the time units were divided by 10. Therefore the scheduling interval was set to 6 in the models, and the backfilling interval to 3 instead of SLURM's 30 seconds. The quick scheduling queue depth was set to 100, as in SLURM. SLURM does not have a setting for timeslices, so this was set to 2.


5.3 Analysis Two

In Analysis two a simulated run of a job trace was performed on a cluster of the same size as deCODE's cluster using the SLURM simulator tool. This run was then compared to runs performing the same job trace in the models with the same size of cluster. This analysis shows how well the models correspond to real-life runs of the cluster, and how the abstractions from SLURM's cluster affect the outcome. The job trace was extracted from deCODE's SLURM database using the SLURM simulator tool. The tool can read SLURM's mySQL database and extract jobs that were allocated in some given time interval. The tool then creates a job trace for these jobs and their CPU requirements and runtimes. In this analysis a job trace was created for jobs that were allocated to deCODE's cluster on January 29th 2015 between 12:00 and 13:00. During this time 8856 jobs ran on deCODE's cluster. The job trace was then simulated using the simulator tool, where the slurmdbd daemon accounted the arrival, submission and end times for each job in the simulation to an empty database.

Using the SLURM simulator tool, a job trace was created using the command line tool mysql_trace_builder, which is supplied with the tool. The tool communicates with the slurmdbd daemon and creates a file that the tool can later read as a job trace and run on the emulated cluster. In order to use the command, slurmdbd has to be running and able to read the mySQL database. The command for creating the job trace is the following:

$ mysql_trace_builder -s "2015-01-29 12:00:00" -e "2015-01-29 13:00:00" -h localhost -u root -v -t lphc_job_table -f JobTrace.trace

After creating the trace file, the slurmdbd daemon was re-configured to use an empty mySQL database. Thereafter the SLURM simulator was run using the trace file, accelerated to speed up the evaluation. This was done using the following command:

$ sim_mgr 1422470000 -w JobTrace.trace -a 1000

After running the file the data was extracted from the database and compared to the runsof the models.

5.3.1 Model Settings

To obtain a proper comparison of the models, the size of the cluster was set to the same size as the cluster in the simulation. The cluster has 309 nodes and the number of CPUs available on each node can be seen in table 5.3.

Table 5.3: Node sizes for Analysis two

        16 cores   20 cores   24 cores   28 cores
CPUs    7          13         100        189

        620        1250       7570       1250
RAM     7          13         100        189

The Environment template in Analysis two stores the list of jobs collected from the database and, at the beginning of the run, spawns all the jobs with the correct resource needs as well as the time at which each job should arrive. The Environment template for Analysis two can be seen in figure 5.11:


The initial location is set as a committed location, meaning that time cannot proceed in this location and that the next transition in the model must involve an automaton currently in a committed location. From the initial location the transition is taken until all jobs in the array have been read and spawned; then the process transitions to a location indicating that all jobs have been spawned. When a job is spawned, a process is created and the process moves to a location where it can either queue immediately or go to a location where it stays until it should be queued, see figure 5.12.

Figure 5.11: The Environment Process in Analysis two.

Since the Environment process spawns all the jobs at time 0, the Job template has to make sure that each process is queued at the right time. This is done initially in the process as follows: the process transitions from its initial location to a location that checks whether it should be queued now or transition to a state where it waits until it should be queued. As mentioned before, the runtime of a job in this system can be longer than the maximum integer value UPPAAL can hold. Therefore an overflow variable was added to the JobInfo structure to store how many times a job process overflows the value. For example, in the non-preemptive Job template, if a job process should run for 40000 time units, the runtime value is set to 40000 − 32767 = 7233 and the overflow value is set to one, meaning that the process will stay in the ‘Running’ location for 7233 time units before transitioning to a new location that will set the runtime to 32767 again. This behaviour can be seen in figure 5.12.


Figure 5.12: A non-preemptive Job process that can have a runtime larger than 32767.
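The overflow bookkeeping described above amounts to splitting a runtime into full periods of the integer bound plus a remainder. A minimal sketch, with a helper name of my own and assuming the 32767 bound:

```python
INT_MAX = 32767  # UPPAAL's default upper bound for int

def split_runtime(runtime):
    """Split a runtime that may exceed INT_MAX into (remainder,
    overflows): the job first stays in 'Running' for `remainder`
    time units, then repeats `overflows` full INT_MAX periods."""
    overflows, remainder = divmod(runtime, INT_MAX)
    return remainder, overflows
```

For the example in the text, split_runtime(40000) gives (7233, 1): the job runs 7233 units, then one further full period of 32767.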

In Analysis two the parameters for each model are set to the same values as the default SLURM values. The values are presented in the following table 5.4.

Table 5.4: Analysis two model settings

           sched_interval  queue_depth  timeslice  backfill_interval  max bfq
FCFS       60              100
SJF        60              100
RR         60              100          20
SRTF       60              100          20
Backfill   60              100                     30                 100

Here the parameters are set to correspond to the values in SLURM, since the runtimes are the same as in the simulation on deCODE's cluster.

The query used to compare the simulations is the same query as used in Analysis one to create the Gantt chart, see Listing 5.2 above. Using this query, the durations of all jobs, both in their ‘Waiting’ and ‘Running’ locations, can be compared to the run of the simulation in the SLURM simulator tool.

Now that the experimental design and the setting of the model parameters have been clarified, the following chapter will discuss the results of these experiments.


Chapter 6

Results

This chapter will present the results of Analysis one and Analysis two, which were presented in chapter 5.

6.1 Results of Analysis one

As mentioned above, the data was collected by running simulations on deCODE's cluster using the models and queries discussed in chapter 5. It should be noted that the simulations for the preemptive models, that is RR and SRTF, were computationally very heavy. The runtime of their simulations was up to 33 hours and, due to time constraints, they were only simulated 5 times. The non-preemptive models, that is FCFS, SJF and Backfilling, required a shorter time to simulate, up to 16 hours on deCODE's cluster, and were simulated 25 times. The following results are gathered from these simulations.

The acquired data from the simulations was compared according to the criteria discussed in 5.1. The first section will present the throughput of the models on three sizes of clusters, followed by the average waiting time and residence time. Finally, the results related to the resource utilization of the clusters are described, as well as the claim made by Jackson, Snell and Clement in [27], stating that Backfilling can result in up to a 20% increase of resource utilization in clusters.

6.1.1 Throughput

The throughput of a simulation was gathered by obtaining the time at which all jobs had finished running and dividing the number of jobs by the time it took the cluster to run them all. The throughput is therefore the number of jobs a cluster services per time unit.

The service rate of a cluster is controlled by its available resources, and the maximum load a cluster can handle is called the saturation point of the cluster. When the rate of job submissions is higher than the service rate, a cluster becomes unstable, resulting in the queue of the cluster growing infinitely in real systems [31]. Therefore, the throughput, resource utilization, residence and waiting times of a cluster with job submissions higher than its saturation point will not increase much past the saturation point.

FCFS

The throughput for FCFS for each cluster size and job load can be seen in the following table:


Table 6.1: Throughput for FCFS on different sized clusters.

              C10      C50      C100
1000 jobs     0.5849   1.0326   1.7209
5000 jobs     0.59     1.1266   2.3272
10000 jobs    0.5905   1.1327   2.4285

As can be seen in table 6.1, the larger a cluster is, the higher the throughput. This is because there are more resources in the larger cluster for the jobs to be allocated on. The throughput also increases when the job loads on the cluster are higher; this is because of the distribution of the runtimes and requirements of the jobs, since the number of shorter jobs which require fewer resources is higher in higher job loads.

In C10, the saturation point of the cluster has been reached before 1000 jobs and therefore the difference in throughput when comparing the job loads is low. In C50 and C100 the throughput of the clusters is lower when simulated with 1000 jobs compared to higher job loads. However, the clusters are saturated at 5000 and 10000 jobs, resulting in a similar throughput for these job loads. In C100 the throughput rises quite significantly between 1000 and 5000 jobs, but the difference is smaller between 5000 and 10000 jobs, meaning the saturation point lies somewhere between these values.

As mentioned before, jobs in the queue in FCFS can block other jobs later in the queue. Therefore, as is expected, the throughput is lower than in the following models.

SJF

When using the SJF algorithm, the throughput of the cluster is often higher than when using FCFS, as can be seen in the following table:

Table 6.2: Throughput for SJF on different sized clusters.

              C10      C50      C100
1000 jobs     0.6721   1.1262   1.7456
5000 jobs     0.7041   1.2993   2.3189
10000 jobs    0.7171   1.2951   2.4248

Using the SJF algorithm on small clusters results in a higher throughput compared to FCFS. This is because shorter jobs are more likely to spawn in the systems and they are therefore more likely to be allocated sooner, resulting in shorter waiting times for the jobs. As can be seen, the throughput is higher for SJF on clusters of the sizes C10 and C50. With a medium job load, that is 5000 jobs, FCFS has a throughput of 1.1266 while SJF has a throughput of 1.2993; this means that in 1000 time units SJF would complete on average 1299.3 jobs while FCFS would only complete 1126.6 jobs. When a cluster becomes larger, such as in C100, the difference becomes negligible.

In C10 there is some difference in throughput between 1000 jobs and 5000 jobs, indicating that the saturation point is higher for SJF than FCFS. In C50 the throughput increases from 1000 to 5000 jobs, indicating that the cluster has not reached its saturation point when running 1000 jobs; however, the throughput is lower at 10000 jobs than at 5000 jobs, indicating that the saturation point has been reached.


RR

As stated in chapter 3.4.3, RR is the preemptive version of FCFS. When jobs are allocated on the cluster they can run for the duration of a timeslice before being preempted and re-entering the queue. As with FCFS, large jobs can block smaller jobs in the queue, but only for the duration of the timeslice, since resources are freed quickly. This results in more fairness for each job, since all jobs get equal time on the cluster.

The following table shows the throughput results for the preemptive algorithm RR:

Table 6.3: Throughput for RR on different sized clusters.

              C10      C50      C100
1000 jobs     0.7846   1.1837   1.6837
5000 jobs     0.8263   1.2401   2.322
10000 jobs    0.8212   1.2742   2.4601

As shown, the throughput for clusters using RR as their scheduling algorithm is a significant improvement compared to clusters using FCFS and, in most cases, some improvement compared to SJF, especially on smaller clusters.

The throughput is higher at 5000 and 10000 jobs when compared to 1000 jobs in all the clusters using RR, meaning the saturation point has not been reached at 1000 jobs. In C10 the throughput is very similar at 5000 and 10000 jobs. In C50 there is a slight difference, indicating that the saturation point has not been reached at 5000 jobs, but is close to it. In C100 there is a larger difference in throughput, compared to C50, meaning the saturation point is even higher in the larger cluster.

SRTF

SRTF is the preemptive version of SJF, meaning the shortest job on the queue will be allocated to the cluster for a maximum burst of the timeslice. The results for the throughput of SRTF on the differently sized clusters are presented in the following table:

Table 6.4: Throughput for SRTF on different sized clusters.

              C10      C50      C100
1000 jobs     0.7137   1.1729   1.7341
5000 jobs     0.7942   1.1978   2.343
10000 jobs    0.8068   1.3595   2.4361

On the clusters C10 and C50 the throughput of SRTF is higher than the throughput of FCFS and SJF. In C100, where more resources are available, the throughput is about the same. When compared to RR, the throughput of SRTF is lower than or close to the throughput of RR on the smaller clusters. In larger clusters the throughput of SRTF is slightly higher than the throughput of RR. As stated before, the preemptive models were only run 5 times; in order to get a significant result further simulations would have to be run.

The saturation points of the clusters are similar to RR, where there is an increase in throughput from 1000 jobs to 5000 jobs. In C10 the saturation point has been reached at 5000 jobs, and therefore the throughputs at 5000 and 10000 jobs are very similar. In C50 and C100 there is an increase in throughput for the larger job loads.

Page 70: A DATA DRIVEN ANALYSIS OF CLUSTERS USING UPPAAL

56 CHAPTER 6. RESULTS

Backfilling

The throughput for Backfilling is presented in the following table:

Table 6.5: Throughput for Backfilling on different sized clusters.

            C10     C50     C100
1000 jobs   0.7988  1.517   1.7619
5000 jobs   0.8452  1.9853  2.3494
10000 jobs  0.8525  2.1418  2.4388

The results show that clusters using Backfilling as their scheduling algorithm have a higher throughput than with all other algorithms. The difference is significantly higher on small and medium sized clusters, although on larger clusters the difference becomes smaller.

As in the other models, the saturation point of the clusters has not been reached at the smallest job load, where the throughput is lower than in the simulation of C10 with 5000 jobs. However, between 5000 and 10000 jobs in C50 and C100 the difference is quite low, indicating that the clusters have a saturation point somewhere between 5000 and 10000 jobs.

6.1.2 Average Waiting Time

As mentioned in chapter 5, the average waiting time was gathered by calculating the average time each job process stayed in its 'Waiting' location. On unstable clusters, where the job load is higher than the cluster's service rate, the average waiting time increases drastically, since the queue size grows while the throughput of the cluster stays the same.
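The instability effect can be made concrete with a small standalone sketch (plain Python, not the UPPAAL framework; the workload numbers are hypothetical). When jobs arrive faster than the cluster serves them, each additional job inherits the backlog of all jobs before it, so the average waiting time grows with the length of the job trace:

```python
# Illustrative sketch: average waiting time in a FCFS queue where every job
# needs the whole cluster for `service` time units and a new job arrives
# every `gap` time units.

def avg_waiting_fcfs(n_jobs: int, gap: float, service: float) -> float:
    free_at = 0.0        # time at which the cluster next becomes free
    total_wait = 0.0
    for i in range(n_jobs):
        arrival = i * gap
        start = max(arrival, free_at)
        total_wait += start - arrival   # time spent in the 'Waiting' location
        free_at = start + service
    return total_wait / n_jobs

# Stable cluster: service rate exceeds the job load, waiting stays bounded.
print(avg_waiting_fcfs(1000, gap=2.0, service=1.0))   # 0.0
# Unstable cluster: jobs arrive faster than they finish, so the average
# waiting time roughly doubles when the job trace doubles.
print(avg_waiting_fcfs(1000, gap=1.0, service=2.0))
print(avg_waiting_fcfs(2000, gap=1.0, service=2.0))
```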

FCFS

The average waiting times for the clusters using FCFS are presented in the following table:

Table 6.6: Average Waiting Time for FCFS on different sized clusters.

      C10        C50        C100
low   585.9772   291.0031   3.2681
med   3175.6185  1185.2588  5.7213
high  6445.8924  2467.7152  4.8206

The results show that the average waiting time increases the smaller a cluster is and the higher the job load is. This is intuitive: the more resources a cluster has, the more likely it is to have enough resources for the jobs on the queue, resulting in a shorter time that job processes have to wait in the queue.

SJF

The following are the results for the average waiting time for SJF:


Table 6.7: Average Waiting Time for SJF on different sized clusters.

      C10        C50       C100
low   131.8811   57.9991   1.626
med   648.4124   217.2876  2.2522
high  1254.9464  408.5366  1.4929

A cluster using SJF scheduling always chooses the shortest job for allocation; the average waiting time for these jobs is therefore lower. This is due to the fact that the Environment template is more prone to spawn jobs requiring fewer resources and less runtime, see figure 5.4. This results in a lower average waiting time in all cases for SJF compared to FCFS. Even though the average waiting time is considerably lower using SJF, the waiting time for longer jobs can be much higher, meaning that longer jobs can suffer from starvation when multiple shorter jobs are being queued.
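The starvation effect can be illustrated with a small standalone sketch (hypothetical jobs, not the thesis framework): under SJF the shortest queued job always runs next, so the start time of a long job is pushed back by every shorter job ahead of it.

```python
# Illustrative sketch: all jobs queued at time 0 on a cluster that runs one
# job at a time; SJF always picks the job with the shortest runtime.

import heapq

def sjf_start_time(long_job_len: int, short_job_len: int, n_short: int) -> float:
    """Return the time at which the single long job finally starts."""
    queue = [(short_job_len, i) for i in range(n_short)] + [(long_job_len, n_short)]
    heapq.heapify(queue)                 # shortest runtime first
    now = 0.0
    while queue:
        runtime, job_id = heapq.heappop(queue)
        if job_id == n_short:            # the long job reached the front
            return now
        now += runtime                   # a shorter job runs to completion

# The long job's start time grows linearly with the number of short jobs:
print(sjf_start_time(100, 5, 10))   # 50.0
print(sjf_start_time(100, 5, 100))  # 500.0
```

With a steady stream of newly arriving short jobs the long job could wait indefinitely, which is the starvation scenario described above.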

RR

The average waiting time for RR is presented in the following table:

Table 6.8: Average Waiting Time for RR on different sized clusters.

      C10        C50        C100
low   239.3683   101.3892   0.0
med   1186.0357  497.1037   0.3043
high  2304.187   1112.6161  0.4871

Since the algorithm is preemptive, and jobs re-enter the queue when their timeslice is done and have to wait to be re-allocated to the cluster, the average waiting time is in most cases higher than in SJF. There is, however, a turning point when a large enough cluster is used with a low job load, as seen in cluster C100, where there was no waiting time for the jobs, meaning the jobs were allocated as soon as they were queued. RR is a fair algorithm, meaning, just as with FCFS, that jobs do not suffer from starvation when queued. Therefore, RR is a considerable improvement on FCFS while still being fair.
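The timeslice mechanism can be sketched as follows (a single-CPU standalone illustration with hypothetical runtimes, not the UPPAAL model): each job runs for at most one timeslice, then re-enters the queue, so short jobs finish early even when a long job was queued first.

```python
# Illustrative sketch of Round Robin: jobs run in bursts of at most one
# timeslice and re-enter the 'Waiting' queue with their remaining runtime.

from collections import deque

def rr_completion_times(runtimes, timeslice):
    """Return each job's completion time; all jobs are queued at time 0."""
    queue = deque((i, r) for i, r in enumerate(runtimes))
    now = 0.0
    done = {}
    while queue:
        job_id, remaining = queue.popleft()
        burst = min(timeslice, remaining)
        now += burst
        if remaining - burst > 0:
            queue.append((job_id, remaining - burst))  # back to waiting
        else:
            done[job_id] = now
    return [done[i] for i in range(len(runtimes))]

# The short middle job finishes at t=4; under FCFS it would not even start
# before t=10, which is why RR improves on FCFS while staying fair.
print(rr_completion_times([10, 2, 10], timeslice=2))  # [20.0, 4.0, 22.0]
```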

SRTF

The average waiting time for SRTF is shown in the following table:

Table 6.9: Average Waiting Time for SRTF on different sized clusters.

      C10        C50       C100
low   96.4095    35.8212   3.1011
med   510.9254   140.7514  0.977
high  1035.8076  503.6518  0.7277

SRTF has the lowest average waiting time of all the models, although the difference is clearly higher in clusters with fewer resources, as seen when comparing C10 with SJF, where the difference is the greatest. As with SJF, the algorithm is not fair and longer jobs may suffer from starvation. However, since the Environment template is prone to spawning shorter jobs, the average waiting time is low.


Backfilling

The results for the average waiting time for Backfilling are the following:

Table 6.10: Average Waiting Time for Backfilling on different sized clusters.

      C10        C50       C100
low   226.262    62.3555   0.7085
med   1304.8136  142.5837  0.4562
high  2656.2986  207.7489  0.2235

Backfilling has a shorter average waiting time than FCFS, but longer than SJF, RR and SRTF for small job loads and small clusters. For higher job loads on larger clusters, such as C50 and C100, the average waiting time is shorter than SJF and equal to SRTF. This indicates that Backfilling is better utilized on larger systems with higher job loads, which increase the likelihood of a successful backfill. As seen in tables 6.10 and 6.6, the average waiting time is considerably lower on clusters using the Backfilling scheduling algorithm compared to FCFS. This supports Leonenkov and Zhumatiy's claim that the Backfilling algorithm can decrease the average waiting time of clusters [26].
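The core backfilling decision can be sketched as a simple fit test (a standalone illustration of the idea, not SLURM's implementation): a later job may jump ahead of the queue head if it fits into the currently idle CPUs and finishes before the head job's reserved start time, so the head job is never delayed.

```python
# Illustrative sketch of the backfilling condition. All parameter names are
# hypothetical; real schedulers track per-node resources and reservations.

def can_backfill(idle_cpus, now, head_reserved_start, job_cpus, job_runtime):
    fits = job_cpus <= idle_cpus
    no_delay = now + job_runtime <= head_reserved_start
    return fits and no_delay

# 10 CPUs idle; the queue head needs more and is reserved to start at t=50.
print(can_backfill(10, now=0, head_reserved_start=50, job_cpus=8,  job_runtime=40))  # True
print(can_backfill(10, now=0, head_reserved_start=50, job_cpus=8,  job_runtime=60))  # False: would delay the head job
print(can_backfill(10, now=0, head_reserved_start=50, job_cpus=12, job_runtime=10))  # False: does not fit
```

Larger clusters leave larger and more frequent resource holes, which is consistent with the observation above that backfills succeed more often on larger systems with higher job loads.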

6.1.3 Average Residence Time

The residence time for a job is the sum of the time the job is in its 'Waiting' location and its 'Running' location.

FCFS

The following table shows the residence time for FCFS:

Table 6.11: Average Residence Time for FCFS on different sized clusters.

      C10        C50        C100
low   621.8003   326.8132   38.6949
med   3209.4146  1219.0511  39.8332
high  6475.2614  2496.9972  34.2574

As for the average waiting time, the residence time for each job grows when the clusters have fewer resources.

SJF

The results for the average residence time for SJF are the following:

Table 6.12: Average Residence Time for SJF on different sized clusters.

      C10        C50       C100
low   167.875    94.1603   37.3992
med   683.0569   251.4455  36.3563
high  1285.3743  438.345   30.9777


As with the waiting time, the residence time is also lower in all cases for SJF when compared to FCFS.

RR

The following are the results for the average residence time for RR:

Table 6.13: Average Residence Time for RR on different sized clusters.

      C10        C50       C100
low   243.9965   114.6213  35.6647
med   1187.4448  502.696   33.3247
high  2306.1082  1118.093  29.8886

As mentioned before, RR is a fair algorithm, meaning that all jobs get an equal amount of burst time on the CPU; this also means that the residence time for each job is higher compared to SJF. There is an improvement when compared to FCFS, since under FCFS jobs can be blocked by large jobs ahead of them, especially on smaller clusters with multiple jobs.

SRTF

Table 6.14: Average Residence Time for SRTF on different sized clusters.

      C10        C50       C100
low   130.6075   70.6087   38.6736
med   544.0898   173.335   34.0011
high  1065.9049  533.3958  29.6606

The average residence time is also the lowest for SRTF compared to the other models, although more simulations should be run to get a better result.

Backfilling

The average residence time for Backfilling is shown in the following table:

Table 6.15: Average Residence Time for Backfilling on different sized clusters.

      C10        C50       C100
low   251.7662   86.596    26.9013
med   1328.8548  163.6441  24.2291
high  2676.1345  224.4085  20.0816

The results are similar to those we obtained for the average waiting time; this further indicates that Backfilling is better suited for larger clusters with higher job loads.

6.1.4 Resource Utilization

The resource utilization shows the average utilization of the CPUs, RAM and disk space in the simulations.
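The metric can be sketched as used resource-time divided by available capacity-time (a standalone illustration with hypothetical numbers, not the measurement code of the framework):

```python
# Illustrative sketch: average utilization of one resource type over a
# simulation is the resource-time consumed by jobs divided by the total
# capacity-time available (capacity x makespan).

def avg_utilization(jobs, capacity, makespan):
    """jobs: list of (amount_used, runtime) pairs for one resource type."""
    used = sum(amount * runtime for amount, runtime in jobs)
    return used / (capacity * makespan)

# Hypothetical example: 100 CPUs available for 1000 time units, three jobs.
jobs = [(20, 300), (50, 400), (10, 800)]
print(round(avg_utilization(jobs, capacity=100, makespan=1000), 3))  # 0.34
```

The same computation is done per resource type (CPUs, RAM, disk), which is why a cluster can show low CPU utilization while its RAM or disk is much busier.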


FCFS

The resource utilization of the 25 simulations of the FCFS model is presented in the following table:

Table 6.16: Resource Utilization for FCFS on different sized clusters.

            C10     C50     C100
CPU   low   0.1799  0.0643  0.0574
      med   0.1829  0.0667  0.0735
      high  0.1894  0.0709  0.0768
RAM   low   0.5567  0.201   0.1698
      med   0.557   0.2061  0.2236
      high  0.5538  0.2138  0.2323
Disk  low   0.4029  0.8627  0.1445
      med   0.4194  0.8634  0.1927
      high  0.4147  0.864   0.1992

As can be seen in the table above, the model does not utilize much of its CPU power. Most jobs in the system require more RAM and disk space than CPUs, due to how the Environment template, see figure 5.4, is created. This also indicates that the nodes on the cluster do not provide enough RAM and disk space compared to CPUs. Therefore, the CPU utilization is low in most cases.

FCFS in most cases utilizes fewer resources than SJF, especially on clusters with fewer resources. The difference becomes smaller, or even disappears, on larger systems. When compared to the preemptive models, the resource utilization is in most cases higher, except where the difference is close to none.

SJF

The results of the resource utilization for SJF are as follows:

Table 6.17: Resource Utilization for SJF on different sized clusters.

            C10     C50     C100
CPU   low   0.2096  0.0766  0.056
      med   0.2234  0.0826  0.0733
      high  0.229   0.0797  0.0768
RAM   low   0.6416  0.2363  0.1672
      med   0.6791  0.2543  0.2247
      high  0.6919  0.2447  0.233
Disk  low   0.4754  0.8676  0.1444
      med   0.5118  0.8697  0.1917
      high  0.5183  0.8686  0.1992

Since the throughput is higher for SJF compared to FCFS, it is expected that the average resource utilization is also higher. This is true in all cases; that is, SJF has the highest average resource utilization of all the models for all resource types. However, the difference becomes smaller on larger clusters.


RR

The average resource utilization for RR is presented in the following table:

Table 6.18: Resource Utilization for RR on different sized clusters.

            C10     C50     C100
CPU   low   0.152   0.0455  0.0448
      med   0.1592  0.0546  0.0654
      high  0.1824  0.0679  0.0658
RAM   low   0.4452  0.1386  0.1486
      med   0.4257  0.1725  0.1978
      high  0.4513  0.2061  0.2073
Disk  low   0.3217  0.8534  0.1256
      med   0.333   0.8569  0.1691
      high  0.3636  0.8617  0.1718

The utilization of resources per time unit is lower in RR than in both FCFS and SJF. On smaller clusters the utilization is, however, higher than in SRTF.

SRTF

The following table shows the average resource utilization for SRTF:

Table 6.19: Resource Utilization for SRTF on different sized clusters.

            C10     C50     C100
CPU   low   0.1144  0.0509  0.0467
      med   0.1196  0.0495  0.0655
      high  0.1231  0.0445  0.0672
RAM   low   0.3448  0.1462  0.1461
      med   0.3621  0.1515  0.2072
      high  0.3761  0.1378  0.2118
Disk  low   0.2412  0.8543  0.124
      med   0.2621  0.8544  0.1733
      high  0.2776  0.8526  0.1791

As in RR, the resource utilization for SRTF is lower than in the non-preemptive models. SRTF has the lowest resource utilization of the aforementioned models on small and medium sized clusters. However, the difference becomes smaller or nonexistent on larger clusters.


Backfilling

Table 6.20: Resource Utilization for Backfilling on different sized clusters.

            C10     C50     C100
CPU   low   0.1839  0.0661  0.042
      med   0.1987  0.0753  0.0553
      high  0.1968  0.0807  0.0569
RAM   low   0.5531  0.2103  0.128
      med   0.5922  0.2328  0.1675
      high  0.5742  0.2418  0.1738
Disk  low   0.3976  0.8632  0.1071
      med   0.4356  0.8664  0.1431
      high  0.42    0.867   0.1492

The resource utilization of Backfilling on clusters of the size of C10 and C50 is in most cases lower than SJF but greater than FCFS. On larger clusters the utilization is the lowest of all models. As stated before, Jackson, Snell and Clement [27] claim that the Backfilling algorithm can increase a cluster's resource utilization by 20%. Table 6.21 shows the increases and decreases of resource utilization of the differently sized clusters using Backfilling compared to FCFS.

Table 6.21: The resource utilization difference relation between FCFS and Backfilling

            C10   C50   C100
CPU   low    2%    3%   -27%
      med    9%   13%   -25%
      high   4%   14%   -26%
RAM   low   -1%    5%   -25%
      med    6%   13%   -25%
      high   4%   13%   -25%
Disk  low   -1%    0%   -26%
      med    4%    0%   -26%
      high   1%    0%   -25%

According to the collected results from the models, the claim made by Jackson, Snell and Clement is, in most cases, not supported by this analysis. In large clusters the resource utilization decreases by up to 27%. However, the resource utilization is, in most cases, higher with Backfilling on smaller clusters, such as C10 and C50, but the difference is at most 14%.
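The entries of table 6.21 appear to be the relative change in utilization going from FCFS (table 6.16) to Backfilling (table 6.20), rounded to whole percent. A few of them can be reproduced directly:

```python
# Reproducing entries of table 6.21 from the values in tables 6.16 and 6.20:
# relative change = (backfilling - fcfs) / fcfs, rounded to whole percent.

def relative_change(fcfs: float, backfill: float) -> int:
    return round(100 * (backfill - fcfs) / fcfs)

print(relative_change(0.1799, 0.1839))  # CPU, low load, C10  ->   2
print(relative_change(0.0643, 0.0661))  # CPU, low load, C50  ->   3
print(relative_change(0.0574, 0.0420))  # CPU, low load, C100 -> -27
```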

6.1.5 Conclusion of Analysis One

The results above indicate that Backfilling is a suitable algorithm to use on clusters. It has the highest throughput of jobs of all the algorithms, especially on larger systems with a large job load. Backfilling offers a lower waiting time than both FCFS and RR, while still being free of starvation. SJF and SRTF both have a lower waiting time than Backfilling, but these algorithms are unfair towards longer jobs and may lead to starvation. The results further show, in table 6.21, that the claim made by Jackson, Snell and Clement in [27], that Backfilling can increase a cluster's resource utilization by up to 20%, does not hold. However, Leonenkov and Zhumatiy [26] further claim that Backfilling can decrease waiting time on clusters, which the results support.

6.2 Results of Analysis Two

The following section presents the results of analysis two. As discussed in section 5.3, analysis two compares a simulated run on deCODE's cluster, using the SLURM simulator tool, to a simulation of the models running the same job trace. The analysis shows the run in three of the five models, that is, Backfilling, FCFS and SJF. Because of the amount of resources and time it took to simulate the preemptive models, RR and SRTF, they were excluded from this analysis. Due to time limits the models were only simulated once.

6.2.1 Throughput and Average Waiting Time

Results from the SLURM simulator tool

As mentioned in section 5.3, a job trace was created from former job submissions on deCODE's cluster using the mysql_trace_builder tool in the SLURM simulator. The following figure shows a Gantt chart of the jobs submitted on the 29th of January 2015 between 12:00 and 13:00.

Figure 6.1: A Gantt chart showing the simulation of job runs in SLURM Simulator

Due to one very long job in the job trace, the SLURM simulation ran for 117,190 time units until all 8859 jobs had finished. Therefore, the throughput of the cluster is 0.0756. The average waiting time of the run is 714.11.
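The reported throughput follows directly from the definition used in analysis one, jobs finished divided by the makespan:

```python
# Checking the reported throughput of the SLURM simulation:
jobs_finished = 8859
makespan = 117_190   # time units until the last job finished
print(round(jobs_finished / makespan, 4))  # 0.0756
```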

Results from UPPAAL models

The Gantt charts for the models are shown in the following three figures:


Figure 6.2: A Gantt chart showing the run of FCFS in analysis two

Figure 6.3: A Gantt chart showing the run of SJF in analysis two


Figure 6.4: A Gantt chart showing the run of Backfilling in analysis two

The throughput of the models is the same, that is 0.0756, which is also the throughput of the SLURM simulation. The throughput is the same in all models because the job trace includes a job with a very high runtime early in the trace. Since enough resources are available in all models when this job is queued, it gets scheduled very early. The average waiting times of the models can be seen in the following table:

Table 6.22: The throughput and average waiting time of the models in analysis two

                      Backfilling  FCFS    SJF
Throughput            0.0756       0.0756  0.0756
Average Waiting time  224.2        237.8   234.8

Backfilling has the lowest average waiting time of the models, although the difference is not high. Since the cluster has plenty of resources for this job trace, the waiting time for each job is low.

As can be seen in the Gantt charts and in table 6.22, the waiting time for each job is considerably lower in the models than in the SLURM simulation. For example, the average waiting time of the Backfilling model is 224.2 while the average waiting time of the SLURM simulation was 714.11. This is a result of the abstractions made in the models, see section 4.1, where scheduling and backfilling calculations take no time.

Further inspection shows that, because of this abstraction, the difference in waiting time grows when more jobs are in the job load. This indicates that the error propagates the more jobs need to be scheduled. Three smaller instances of the Backfilling model were created in the tool with various job loads. The smallest job load runs 10 jobs on the cluster, the medium job load runs 25 jobs and the largest 50 jobs. The jobs in the job loads require between 20-30 time units of runtime and 12-20 CPUs. The model's resources are set to one node with 28 CPUs. The job loads were also run in the SLURM simulator for comparison.


The following table shows the average waiting time of five runs in both UPPAAL SMC and the SLURM simulator.

Table 6.23: Average waiting time in SLURM and Backfilling model.

          SLURM   Model
10 jobs   114.94  64.4
25 jobs   318.79  180.2
50 jobs   667.01  383.8

Table 6.23 shows that the difference in average waiting time grows with the number of jobs in the job trace. This was also verified using the UPPAAL SMC tool. The following queries calculate the probability of the models having an equal, or higher, average waiting time than SLURM:

Pr[<= 1500] (<> Cluster.Done && allWaitingTime() >= 114)

Pr[<= 1500] (<> Cluster.Done && allWaitingTime() >= 318)

Pr[<= 1500] (<> Cluster.Done && allWaitingTime() >= 667)

The function allWaitingTime() in the queries returns the average waiting time of the model. The queries therefore check, for each job trace size, the probability of allWaitingTime() returning an equal or higher average waiting time when the Cluster process has reached its 'Done' location. All queries return a probability in the interval [0, 0.0974] with a 95% confidence level after 36 runs. The probability of the models having an average waiting time of at least 60% and at least 50% of SLURM's average waiting time was also calculated using the same queries. The results are shown in the following table:

Table 6.24: The probabilities of getting a similar average waiting time as in SLURM.

       10 jobs         25 jobs         50 jobs
100%   [0, 0.0974]     [0, 0.0974]     [0, 0.0974]
60%    [0.065, 0.165]  [0.052, 0.152]  [0.021, 0.121]
50%    [0.0903, 1]     [0.0903, 1]     [0.0903, 1]

As seen in table 6.24, the probability of getting an average waiting time of 60% of SLURM's average waiting time becomes lower the more jobs there are. This suggests that the error in the waiting time of the jobs propagates in the models the more jobs there are in the queue.
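The reported interval [0, 0.0974] for zero satisfying runs out of 36 is consistent with an exact (Clopper-Pearson) binomial confidence bound; whether UPPAAL SMC computes it exactly this way is an assumption of this sketch:

```python
# Sketch: exact 95% upper confidence bound on a probability when 0 out of
# `runs` simulations satisfied the property. For k = 0 successes the
# Clopper-Pearson upper bound solves (1 - p)^runs = alpha / 2.

def upper_bound_zero_successes(runs: int, alpha: float = 0.05) -> float:
    return 1 - (alpha / 2) ** (1 / runs)

print(round(upper_bound_zero_successes(36), 4))  # 0.0974
```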

6.2.2 Conclusion of Analysis Two

As stated in the aforementioned section, the Backfilling model's behaviour does not correspond to the behaviour of SLURM's backfilling algorithm. This is because waiting times on the SLURM cluster are significantly higher, compared to the model, due to the time that scheduling and backfilling attempts take. As shown in the three comparisons of small job loads in SLURM and the models, the waiting time increases with larger job loads on the cluster, showing a propagating increase of waiting time with larger job traces in SLURM.


Chapter 7

Conclusion

Using UPPAAL, a data-driven analysis was performed on deCODE's cluster. The performance and resource utilization of the cluster were measured using simulations of the five models, modelling the scheduling algorithms. An obvious flaw in this method of analysis is that a lot of resources are needed to perform the simulations, since they are computationally very heavy. Using model checking to perform such an analysis, however, also provides a great deal of certainty, since the models' behaviour can be verified. This gives confidence that the results are significant, since the models should not include any unwanted behaviour. Modelling also makes the analysis highly scalable: it is just a matter of constructing other queries for the simulations.

The results of analysis one show that Backfilling is the best choice of the five algorithms, as it provides a higher throughput of jobs than the other algorithms, especially on larger clusters. Although SJF has the lowest waiting time, an obvious flaw with SJF is that it is not fair, and it is therefore not usable on batch systems with multiple users, where long jobs could starve. However, the results did not agree with the claim, made by Jackson, Snell and Clement in [27], that Backfilling can increase the utilization of the resources by 20%. The results showed some increase of utilization in small clusters, while larger clusters showed a drastic decrease. The results of analysis two show a propagating error with regard to the waiting time in the simulations of the models. Further analysis showed that this is due to the abstracted behaviour regarding the time that scheduling and backfilling attempts take in SLURM, discussed in section 4.1.

As analysis two shows, a necessary aspect that should be added in future work, to aid the analysis, is the abstracted behaviour mentioned in section 4.1. In order to use the framework as a suitable analysis tool for deCODE, the behaviour regarding the length of scheduling and backfilling attempts must be implemented in the models, and the analysis redone to show whether the algorithms present different results. Another possibility for future work would be to add different algorithms and to compare their results with those considered in this thesis. A third type of analysis could also be added, if partitioning of the nodes were added to the clusters, to analyse how clusters using multiple partitions of nodes would perform when different algorithms are used on different partitions. This could open up possibilities for some type of pre-scheduling choices, where each job is placed on the partition that suits it best. A different approach to scheduling could also be attempted by using UPPAAL STRATEGO to implement adaptive scheduling. UPPAAL STRATEGO is a tool suite, like UPPAAL and UPPAAL SMC, that is capable of constructing optimal strategies for stochastic priced timed games [32]. Using UPPAAL STRATEGO, optimal strategies for scheduling could be calculated when scheduling attempts are performed in the clusters. This work could then be compared to the results from the models in the framework.


Bibliography

[1] COMPANY | deCODE genetics. [Online]. Available: https://www.decode.com/company/ (visited on 04/29/2017).

[2] R. Buyya, High Performance Cluster Computing: Architectures and Systems, Vol.1. Prentice Hall, 1999, ISBN: 0130137847. [Online]. Available: https://www.amazon.com/High-Performance-Cluster-Computing-Architectures/dp/0130137847?SubscriptionId=0JYN1NVW651KCA56C102&tag=techkie-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=0130137847.

[3] Uppaal web site. [Online]. Available: http://uppaal.org/ (visited on 04/30/2017).

[4] G. Behrmann, A. David, and K. G. Larsen, "A tutorial on uppaal", in Formal Methods for the Design of Real-Time Systems, Springer, 2004, pp. 200–236. [Online]. Available: http://link.springer.com/10.1007%2F978-3-540-30080-9_7 (visited on 08/29/2016).

[5] K. Qureshi, S. M. H. Shah, and P. Manuel, "Empirical performance evaluation of schedulers for cluster of workstations", Cluster Computing, vol. 14, no. 2, pp. 101–113, 2011, ISSN: 1573-7543. DOI: 10.1007/s10586-010-0128-5. [Online]. Available: http://dx.doi.org/10.1007/s10586-010-0128-5.

[6] A. Lucero, "Simulation of batch scheduling using real production-ready software tools", [Online]. Available: https://www.academia.edu/25294817/Simulation_of_batch_scheduling_using_real_production-ready_software_tools.

[7] S. Trofinoff and M. Benini, "Using and modifying the BSC Slurm workload simulator", Slurm User Group Meeting 2015, Sep. 16, 2015. [Online]. Available: https://slurm.schedmd.com/SLUG15/BSC_Slurm_Workload_Simulator_Enhancements.pdf.

[8] M. Mikucionis, K. G. Larsen, J. I. Rasmussen, B. Nielsen, A. Skou, S. U. Palm,J. S. Pedersen, and P. Hougaard, “Schedulability analysis using uppaal: Herschel-planck case study”, in Leveraging Applications of Formal Methods, Verification, andValidation: 4th International Symposium on Leveraging Applications, ISoLA 2010,Heraklion, Crete, Greece, October 18-21, 2010, Proceedings, Part II, T. Margaria andB. Steffen, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 175–190,ISBN: 978-3-642-16561-0. DOI: 10.1007/978-3-642-16561-0_21. [Online].Available: http://dx.doi.org/10.1007/978-3-642-16561-0_21.

Page 84: A DATA DRIVEN ANALYSIS OF CLUSTERS USING UPPAAL

70 BIBLIOGRAPHY

[9] A. Boudjadar, A. David, J. H. Kim, K. G. Larsen, M. Mikucionis, U. Nyman, andA. Skou, “Hierarchical scheduling framework based on compositional analysis usinguppaal”, in Formal Aspects of Component Software: 10th International Symposium,FACS 2013, Nanchang, China, October 27-29, 2013, Revised Selected Papers, J. L.Fiadeiro, Z. Liu, and J. Xue, Eds. Cham: Springer International Publishing, 2014,pp. 61–78, ISBN: 978-3-319-07602-7. DOI: 10.1007/978-3-319-07602-7_6.[Online]. Available: http://dx.doi.org/10.1007/978-3-319-07602-7_6.

[10] P. Bulychev, A. David, K. Guldstrand Larsen, A. Legay, M. Mikucionis, and D. Bøg-sted Poulsen, “Checking and distributing statistical model checking”, in NASA For-mal Methods: 4th International Symposium, NFM 2012, Norfolk, VA, USA, April 3-5,2012. Proceedings, A. E. Goodloe and S. Person, Eds. Berlin, Heidelberg: SpringerBerlin Heidelberg, 2012, pp. 449–463, ISBN: 978-3-642-28891-3. DOI: 10.1007/978-3-642-28891-3_39. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-28891-3_39.

[11] M. Baker and B. Rajkumar, Cluster computing at a glance, in: R.Buyya (Ed.) High-Performance Cluster Computing, Architecture and Systems. Upper Saddle River, NJ,Prentice-Hall, 1999, vol. 1, pp.3-47, chapter 1.

[12] Simple Linux Utility for Resource Management. [Online]. Available: http://slurm.schedmd.com/slurm.html (visited on 08/29/2016).

[13] L. Aceto, A. Ingólfsdóttir, K. G. Larsen, and J. Srba, Reactive Systems: Modelling,Specification and Verification. Cambridge University Press, 2007. [Online]. Avail-able: https://www.amazon.com/Reactive-Systems-Modelling-Specification-Verification-ebook/dp/B001EQ4P0U?SubscriptionId=0JYN1NVW651KCA56C102&tag=techkie-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B001EQ4P0U.

[14] R. Alur and D. L. Dill, “A theory of timed automata”, Theor. Comput. Sci., vol.126, no. 2, pp. 183–235, Apr. 1994, ISSN: 0304-3975. DOI: 10.1016/0304-3975(94)90010-8. [Online]. Available: http://dx.doi.org/10.1016/0304-3975(94)90010-8.

[15] K. G. Larsen, W. Yi, P. Pettersson, A. David, B. Nielsen, A. Skou, J. Hakansson, J. I.Rasmussen, P. Krcal, U. Larsen, M. Mikucionis, and L. Mokrushin, Uppaal manual,External contributors include (not complete): Patricia Bouyer, Martijn Hendriks, andRadek Pelanek, Department of Information Technology at Uppsala University (UPP)in Sweden and the Department of Computer Science at Aalborg University (AAL) inDenmark, Jul. 2014.

[16] A. David et al., "Uppaal 4.0: Small tutorial", 2009.

[17] K. G. Larsen, P. Pettersson, and W. Yi, “Uppaal in a nutshell”, International Journalon Software Tools for Technology Transfer, vol. 1, no. 1, pp. 134–152, 1997, ISSN:1433-2779. DOI: 10.1007/s100090050010. [Online]. Available: http://dx.doi.org/10.1007/s100090050010.

[18] A. David, K. G. Larsen, A. Legay, M. Mikucionis, and D. B. Poulsen, “Uppaal SMCtutorial”, International Journal on Software Tools for Technology Transfer, vol. 17,no. 4, pp. 397–415, Aug. 2015, ISSN: 1433-2779, 1433-2787. DOI: 10.1007/s10009-014-0361-y. [Online]. Available: http://link.springer.com/10.1007/s10009-014-0361-y (visited on 10/03/2016).

Page 85: A DATA DRIVEN ANALYSIS OF CLUSTERS USING UPPAAL

BIBLIOGRAPHY 71

[19] Slurm Workload Manager. [Online]. Available: https://slurm.schedmd.com/overview.html (visited on 04/30/2017).

[20] Slurm Workload Manager. [Online]. Available: https://slurm.schedmd.com/cons_res.html (visited on 04/30/2017).

[21] M. Jette, Tuning Slurm Scheduling for Optimal Responsiveness and Utilization. [On-line]. Available: https://slurm.schedmd.com/SUG14/sched_tutorial.pdf (visited on 01/24/2017).

[22] Slurm Workload Manager. [Online]. Available: https://slurm.schedmd.com/accounting.html (visited on 04/30/2017).

[23] (). Beninim/slurm_simulator, GitHub, [Online]. Available: https://github.com/beninim/slurm_simulator (visited on 08/29/2016).

[24] A. Silberschatz, Operating System Concepts with Java, 8th Edition. Wiley, 2011.[Online]. Available: https://www.amazon.com/Operating-System-Concepts- Java- 8th- ebook/dp/B006R6I9AO?SubscriptionId=0JYN1NVW651KCA56C102&tag=techkie-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B006R6I9AO.

[25] A. W. Mu’alem and D. G. Feitelson, “Utilization, predictability, workloads, and userruntime estimates in scheduling the ibm sp2 with backfilling”, IEEE Trans. ParallelDistrib. Syst., vol. 12, no. 6, pp. 529–543, Jun. 2001, ISSN: 1045-9219. DOI: 10.1109/71.932708. [Online]. Available: http://dx.doi.org/10.1109/71.932708.

[26] S. Leonenkov and S. Zhumatiy, "Introducing new backfill-based scheduler for slurm resource manager", Procedia Computer Science, vol. 66, pp. 661–669, 2015, ISSN: 1877-0509. DOI: http://dx.doi.org/10.1016/j.procs.2015.11.075. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050915034249.

[27] D. B. Jackson, Q. Snell, and M. J. Clement, "Core algorithms of the maui scheduler", in Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing, ser. JSSPP '01, London, UK: Springer-Verlag, 2001, pp. 87–102, ISBN: 3-540-42817-8. [Online]. Available: http://dl.acm.org/citation.cfm?id=646382.689682.

[28] Slurm Workload Manager. [Online]. Available: https://slurm.schedmd.com/quickstart.html (visited on 04/30/2017).

[29] Slurm Workload Manager. [Online]. Available: https://slurm.schedmd.com/sched_config.html (visited on 04/30/2017).

[30] Slurm Workload Manager. [Online]. Available: https://slurm.schedmd.com/topology.html (visited on 04/30/2017).

[31] E. Frachtenberg and D. G. Feitelson, “Pitfalls in parallel job scheduling evaluation”,in Job Scheduling Strategies for Parallel Processing: 11th International Workshop,JSSPP 2005, Cambridge, MA, USA, June 19, 2005, Revised Selected Papers, D. Fei-telson, E. Frachtenberg, L. Rudolph, and U. Schwiegelshohn, Eds. Berlin, Heidel-berg: Springer Berlin Heidelberg, 2005, pp. 257–282, ISBN: 978-3-540-31617-6. DOI:10.1007/11605300_13. [Online]. Available: http://dx.doi.org/10.1007/11605300_13.

Page 86: A DATA DRIVEN ANALYSIS OF CLUSTERS USING UPPAAL

72 BIBLIOGRAPHY

[32] Uppaal web site. [Online]. Available: http://people.cs.aau.dk/~marius/stratego/ (visited on 04/30/2017).

Page 87: A DATA DRIVEN ANALYSIS OF CLUSTERS USING UPPAAL

73

Appendix A

Code

A.1 UPPAAL Functions

typedef struct{
    id_t id;
    int runTime;
    int cpu;
    int space;
    int ram;
} JobInfo;

Listing A.1: The JobInfo struct holding a job's run time and resource requirements

void setTime(){
    jobFin = false;
    scheduleInt = false;
    backfill = false;
    finish = false;

    // Check if firstRun == true - firstRun = false & scheduleInt = true & waitTime = 0
    if(firstRun == true){
        scheduleInt = true;
        firstRun = false;
        waitTime = 0;
        return;
    }

    // Check if all jobs have finished - finish = true & waitTime = 0
    if(jobsDone == MAX_JOBS){
        finish = true;
        waitTime = 0;
        return;
    }

    // Check if a job finishes before schedule or backfill will occur
    if(minTime < Sched_Interval && minTime < BcFill_Interval) {
        jobFin = true;
        waitTime = minTime;

        // Remove waitTime from jobs
        Sched_Interval = Sched_Interval - waitTime;
        BcFill_Interval = BcFill_Interval - waitTime;
    }
    // Check if schedule will occur next or backfill
    else if(Sched_Interval <= BcFill_Interval) {
        scheduleInt = true;
        waitTime = Sched_Interval;

        if(Sched_Interval == BcFill_Interval){
            Sched_Interval = SCHED_INTERVAL;
            BcFill_Interval = BCFILL_INTERVAL;
        } else {
            Sched_Interval = SCHED_INTERVAL;
            BcFill_Interval = BcFill_Interval - waitTime;
        }
    } else if (BcFill_Interval < Sched_Interval){
        backfill = true;
        waitTime = BcFill_Interval;

        if(Sched_Interval == BcFill_Interval){
            Sched_Interval = SCHED_INTERVAL;
            BcFill_Interval = BCFILL_INTERVAL;
        } else {
            Sched_Interval = Sched_Interval - waitTime;
            BcFill_Interval = BCFILL_INTERVAL;
        }
    }
    reduceTime(waitTime);
}

Listing A.2: Function that decides what action the cluster should take next, and when, in Backfilling
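The decision logic of setTime() can also be exercised outside UPPAAL. Below is a hypothetical Python port, written only as a sketch for sanity-checking the timer bookkeeping: the names mirror the UPPAAL globals, the interval constants are invented example values, and the call to reduceTime(), which updates the running jobs, is omitted.

```python
# Hypothetical Python port of the setTime() decision logic from Listing A.2.
# The constants below are invented example values, not the thesis's settings.
SCHED_INTERVAL = 60   # assumed reset value for the scheduling timer
BCFILL_INTERVAL = 30  # assumed reset value for the backfill timer

def set_time(state):
    """Decide the cluster's next action and how long to wait for it.

    `state` is a dict holding the model's globals: the running timers
    Sched_Interval/BcFill_Interval, the earliest job completion minTime,
    the firstRun/jobsDone bookkeeping, and the flags set for the caller.
    """
    state.update(jobFin=False, scheduleInt=False, backfill=False, finish=False)

    if state["firstRun"]:
        # First run: schedule immediately.
        state.update(scheduleInt=True, firstRun=False, waitTime=0)
        return state
    if state["jobsDone"] == state["MAX_JOBS"]:
        # All jobs have finished.
        state.update(finish=True, waitTime=0)
        return state

    s, b, m = state["Sched_Interval"], state["BcFill_Interval"], state["minTime"]
    if m < s and m < b:
        # A job finishes first: advance both timers by the wait.
        state.update(jobFin=True, waitTime=m,
                     Sched_Interval=s - m, BcFill_Interval=b - m)
    elif s <= b:
        # Scheduling fires next (ties go to the scheduler); reset its timer.
        state.update(scheduleInt=True, waitTime=s, Sched_Interval=SCHED_INTERVAL,
                     BcFill_Interval=BCFILL_INTERVAL if s == b else b - s)
    else:
        # Backfill fires next; reset its timer.
        state.update(backfill=True, waitTime=b, BcFill_Interval=BCFILL_INTERVAL,
                     Sched_Interval=s - b)
    return state

if __name__ == "__main__":
    st = {"firstRun": False, "jobsDone": 0, "MAX_JOBS": 5, "minTime": 10,
          "Sched_Interval": 60, "BcFill_Interval": 30}
    print(set_time(st)["waitTime"])  # a job completes before either timer fires
```

Running the example with minTime = 10 against timers of 60 and 30 takes the job-finish branch, so the model waits 10 time units and both timers are reduced accordingly.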

A.2 Cluster Sizes in Analysis 1

A.2.1 C10

// Cluster settings
typedef struct{
    int id;
    int tmpDisk;
    int cpus;
    bool nodeBusy;
    int realMemory;
    int weight;
} node;

const int NODE_NR = 10;
typedef int[0,NODE_NR-1] id_n;
node nodes[NODE_NR] = {
    {101, 3500, 20, false, 1250, 1},
    {102, 3500, 24, false, 1250, 1},
    {103, 3500, 24, false, 1250, 1},
    {104, 4000, 24, false, 1250, 1},
    {105, 4000, 28, false, 1250, 1},
    {106, 4000, 28, false, 1250, 1},
    {107, 4000, 28, false, 1250, 1},
    {108, 4000, 28, false, 1250, 1},
    {109, 4000, 28, false, 1250, 1},
    {110, 4000, 28, false, 3785, 1}
};

Listing A.3: The resources available on the C10 cluster

A.2.2 C50

// Cluster settings
typedef struct{
    int id;
    int tmpDisk;
    int cpus;
    bool nodeBusy;
    int realMemory;
    int weight;
} node;

const int NODE_NR = 50;
typedef int[0,NODE_NR-1] id_n;
node nodes[NODE_NR] = {
    {101, 3500, 16, false, 1250, 1},
    {102, 3500, 20, false, 1250, 1},
    {103, 3500, 24, false, 1250, 1},
    {104, 3500, 24, false, 1250, 1},
    {105, 3500, 24, false, 1250, 1},
    {106, 3500, 24, false, 1250, 1},
    {107, 3500, 24, false, 1250, 1},
    {108, 3500, 24, false, 1250, 1},
    {109, 3500, 24, false, 1250, 1},
    {110, 3500, 24, false, 1250, 1},
    {111, 3500, 24, false, 1250, 1},
    {112, 3500, 24, false, 1250, 1},
    {113, 3500, 24, false, 1250, 1},
    {114, 3500, 24, false, 1250, 1},
    {115, 3500, 24, false, 1250, 1},
    {116, 3500, 24, false, 1250, 1},
    {117, 3500, 24, false, 1250, 1},
    {118, 4000, 24, false, 1250, 1},
    {119, 4000, 28, false, 1250, 1},
    {120, 4000, 28, false, 1250, 1},
    {121, 4000, 28, false, 1250, 1},
    {122, 4000, 28, false, 1250, 1},
    {123, 4000, 28, false, 1250, 1},
    {124, 4000, 28, false, 1250, 1},
    {125, 4000, 28, false, 1250, 1},
    {126, 4000, 28, false, 1250, 1},
    {127, 4000, 28, false, 1250, 1},
    {128, 4000, 28, false, 1250, 1},
    {129, 4000, 28, false, 1250, 1},
    {130, 4000, 28, false, 1250, 1},
    {131, 4000, 28, false, 1250, 1},
    {132, 4000, 28, false, 1250, 1},
    {133, 4000, 28, false, 1250, 1},
    {134, 4000, 28, false, 1250, 1},
    {135, 4000, 28, false, 1250, 1},
    {136, 4000, 28, false, 1250, 1},
    {137, 4000, 28, false, 1250, 1},
    {138, 4000, 28, false, 1250, 1},
    {139, 4000, 28, false, 1250, 1},
    {140, 4000, 28, false, 1250, 1},
    {141, 4000, 28, false, 1250, 1},
    {142, 4000, 28, false, 1250, 1},
    {143, 4000, 28, false, 1250, 1},
    {144, 4000, 28, false, 1250, 1},
    {145, 4000, 28, false, 1250, 1},
    {146, 4000, 28, false, 1250, 1},
    {147, 4000, 28, false, 1250, 1},
    {148, 4000, 28, false, 1250, 1},
    {149, 4000, 28, false, 1250, 1},
    {150, 4000, 28, false, 3785, 1}
};

Listing A.4: The resources available on the C50 cluster

A.2.3 C100

// Cluster settings
typedef struct{
    int id;
    int tmpDisk;
    int cpus;
    bool nodeBusy;
    int realMemory;
    int weight;
} node;

const int NODE_NR = 100;
typedef int[0,NODE_NR-1] id_n;
node nodes[NODE_NR] = {
    {101, 3500, 16, false, 1250, 1},
    {102, 3500, 20, false, 1250, 1},
    {103, 3500, 20, false, 1250, 1},
    {104, 3500, 24, false, 1250, 1},
    {105, 3500, 24, false, 1250, 1},
    {106, 3500, 24, false, 1250, 1},
    {107, 3500, 24, false, 1250, 1},
    {108, 3500, 24, false, 1250, 1},
    {109, 3500, 24, false, 1250, 1},
    {110, 3500, 24, false, 1250, 1},
    {111, 3500, 24, false, 1250, 1},
    {112, 3500, 24, false, 1250, 1},
    {113, 3500, 24, false, 1250, 1},
    {114, 3500, 24, false, 1250, 1},
    {115, 3500, 24, false, 1250, 1},
    {116, 3500, 24, false, 1250, 1},
    {117, 3500, 24, false, 1250, 1},
    {118, 3500, 24, false, 1250, 1},
    {119, 3500, 24, false, 1250, 1},
    {120, 3500, 24, false, 1250, 1},
    {121, 3500, 24, false, 1250, 1},
    {122, 3500, 24, false, 1250, 1},
    {123, 3500, 24, false, 1250, 1},
    {124, 3500, 24, false, 1250, 1},
    {125, 3500, 24, false, 1250, 1},
    {126, 3500, 24, false, 1250, 1},
    {127, 3500, 24, false, 1250, 1},
    {128, 3500, 24, false, 1250, 1},
    {129, 3500, 24, false, 1250, 1},
    {130, 3500, 24, false, 1250, 1},
    {131, 3500, 24, false, 1250, 1},
    {132, 3500, 24, false, 1250, 1},
    {133, 3500, 24, false, 1250, 1},
    {134, 3500, 24, false, 1250, 1},
    {135, 4000, 24, false, 1250, 1},
    {136, 4000, 24, false, 1250, 1},
    {137, 4000, 28, false, 1250, 1},
    {138, 4000, 28, false, 1250, 1},
    {139, 4000, 28, false, 1250, 1},
    {140, 4000, 28, false, 1250, 1},
    {141, 4000, 28, false, 1250, 1},
    {142, 4000, 28, false, 1250, 1},
    {143, 4000, 28, false, 1250, 1},
    {144, 4000, 28, false, 1250, 1},
    {145, 4000, 28, false, 1250, 1},
    {146, 4000, 28, false, 1250, 1},
    {147, 4000, 28, false, 1250, 1},
    {148, 4000, 28, false, 1250, 1},
    {149, 4000, 28, false, 1250, 1},
    {150, 4000, 28, false, 1250, 1},
    {151, 4000, 28, false, 1250, 1},
    {152, 4000, 28, false, 1250, 1},
    {153, 4000, 28, false, 1250, 1},
    {154, 4000, 28, false, 1250, 1},
    {155, 4000, 28, false, 1250, 1},
    {156, 4000, 28, false, 1250, 1},
    {157, 4000, 28, false, 1250, 1},
    {158, 4000, 28, false, 1250, 1},
    {159, 4000, 28, false, 1250, 1},
    {160, 4000, 28, false, 1250, 1},
    {161, 4000, 28, false, 1250, 1},
    {162, 4000, 28, false, 1250, 1},
    {163, 4000, 28, false, 1250, 1},
    {164, 4000, 28, false, 1250, 1},
    {165, 4000, 28, false, 1250, 1},
    {166, 4000, 28, false, 1250, 1},
    {167, 4000, 28, false, 1250, 1},
    {168, 4000, 28, false, 1250, 1},
    {169, 4000, 28, false, 1250, 1},
    {170, 4000, 28, false, 1250, 1},
    {171, 4000, 28, false, 1250, 1},
    {172, 4000, 28, false, 1250, 1},
    {173, 4000, 28, false, 1250, 1},
    {174, 4000, 28, false, 1250, 1},
    {175, 4000, 28, false, 1250, 1},
    {176, 4000, 28, false, 1250, 1},
    {177, 4000, 28, false, 1250, 1},
    {178, 4000, 28, false, 1250, 1},
    {179, 4000, 28, false, 1250, 1},
    {180, 4000, 28, false, 1250, 1},
    {181, 4000, 28, false, 1250, 1},
    {182, 4000, 28, false, 1250, 1},
    {183, 4000, 28, false, 1250, 1},
    {184, 4000, 28, false, 1250, 1},
    {185, 4000, 28, false, 1250, 1},
    {186, 4000, 28, false, 1250, 1},
    {187, 4000, 28, false, 1250, 1},
    {188, 4000, 28, false, 1250, 1},
    {189, 4000, 28, false, 1250, 1},
    {190, 4000, 28, false, 1250, 1},
    {191, 4000, 28, false, 1250, 1},
    {192, 4000, 28, false, 1250, 1},
    {193, 4000, 28, false, 1250, 1},
    {194, 4000, 28, false, 1250, 1},
    {195, 4000, 28, false, 1250, 1},
    {196, 4000, 28, false, 1250, 1},
    {197, 4000, 28, false, 1250, 1},
    {198, 4000, 28, false, 1250, 1},
    {199, 4000, 28, false, 3785, 1},
    {200, 12000, 28, false, 7570, 1}
};

Listing A.5: The resources available on the C100 cluster
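Since the framework is parameterized by cluster size, new node arrays in the format of Listings A.3 to A.5 can be generated rather than written by hand. The helper below is a hypothetical sketch (the function name and the grouped-specification input format are choices made here, not part of the thesis); it emits a declaration matching the node struct's field order: id, tmpDisk, cpus, nodeBusy, realMemory, weight.

```python
# Hypothetical generator for `node nodes[NODE_NR]` declarations in the
# style of Listings A.3-A.5. Field order follows the `node` struct:
# id, tmpDisk, cpus, nodeBusy, realMemory, weight.
def emit_cluster(specs):
    """specs: list of (count, tmpDisk, cpus, realMemory) groups,
    emitted in order; nodeBusy starts false and weight is 1 throughout."""
    rows, nid = [], 101  # node ids start at 101, as in the listings
    for count, tmp_disk, cpus, ram in specs:
        for _ in range(count):
            rows.append(f"    {{{nid}, {tmp_disk}, {cpus}, false, {ram}, 1}}")
            nid += 1
    body = ",\n".join(rows)
    return (f"const int NODE_NR = {len(rows)};\n"
            f"typedef int[0,NODE_NR-1] id_n;\n"
            f"node nodes[NODE_NR] = {{\n{body}\n}};")

# Reproduce the C10 cluster of Listing A.3 from five groups:
print(emit_cluster([(1, 3500, 20, 1250), (2, 3500, 24, 1250),
                    (1, 4000, 24, 1250), (5, 4000, 28, 1250),
                    (1, 4000, 28, 3785)]))
```

The grouped input keeps a 100-node configuration like Listing A.5 down to a handful of lines, since most nodes differ only in id.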


School of Computer Science
Reykjavík University
Menntavegur 1
101 Reykjavík, Iceland
Tel. +354 599 6200
Fax +354 599 6201
www.ru.is