
Advanced Techniques for Power, Energy, and Thermal Management for Clustered Manycores

Santiago Pagani • Jian-Jia Chen • Muhammad Shafique • Jörg Henkel


Santiago Pagani, ARM, Cambridge, UK

Jian-Jia Chen, Technical University of Dortmund, Dortmund, North Rhine-Westphalia, Germany

Muhammad Shafique, Vienna University of Technology, Vienna, Austria

Jörg Henkel, Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany

ISBN 978-3-319-77478-7
ISBN 978-3-319-77479-4 (eBook)
https://doi.org/10.1007/978-3-319-77479-4

Library of Congress Control Number: 2018934885

© Springer International Publishing AG, part of Springer Nature 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer International Publishing AG, part of Springer Nature.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Efficient and effective system-level power, energy, and thermal management are very important issues in modern computing systems, e.g., to reduce the packaging cost, to prolong the battery lifetime of embedded systems, or to prevent the chip from overheating. These are some of the main reasons why computing systems have shifted from single-core to multicore/manycore platforms, mainly to balance power consumption and computation performance. Furthermore, clustered architectures with multiple voltage islands, where the voltage on a cluster can change independently and all cores in a cluster share the same supply voltage at any given time, are an expected compromise between global and per-core Dynamic Voltage and Frequency Scaling (DVFS) for modern manycore systems. In this book, we focus on two of the most relevant problems for such architectures, namely, optimizing performance under power/thermal constraints and minimizing energy under performance constraints.

For performance optimization, we first present a novel thermal-aware power budgeting concept, called Thermal Safe Power (TSP), which is an abstraction that provides safe power and power density constraints as a function of the number of active cores. TSP conceptually changes the typical design that uses a single and constant value as power budget, e.g., the Thermal Design Power (TDP), and can also serve as a fundamental tool for guiding task partitioning and core mapping decisions. Second, we show that runtime decisions normally used to optimize resource usage (e.g., task migration, power gating, DVFS, etc.) can result in transient temperatures much higher than the normally considered steady-state scenarios. In order to be thermally safe, it is important to evaluate the transient peaks before making resource management decisions. To this end, we present a lightweight method for computing these transient peaks, called MatEx, based on analytically solving the system of thermal differential equations by using matrix exponentials and linear algebra, instead of using regular numerical methods. Third, we present an effective and lightweight runtime boosting technique based on transient temperature estimation, called seBoost. Traditional boosting techniques select the boosting levels (for boosted cores) and the throttle-down levels (for non-boosted cores) arbitrarily or through step-wise control approaches, and might result in unnecessary performance losses for the non-boosted cores or may fail to satisfy the required runtime performance surges. In contrast, seBoost relies on MatEx to select the boosting levels, and hence it guarantees meeting the required runtime performance surges, while maximizing the boosting time with minimum performance losses for the non-boosted cores.
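To make the idea of solving the thermal equations analytically more concrete, the following is a minimal sketch and not the MatEx implementation itself. It assumes a compact RC thermal model of the form A·T'(t) + B·T(t) = P + T_amb·G, where A holds thermal capacitances (assumed diagonal here), B is the thermal conductance matrix, and G holds the conductances to ambient; this form, the two-node example, and all numeric values are illustrative assumptions. With C = −A⁻¹B and steady state T_ss = B⁻¹(P + T_amb·G), the temperature at any time t is T(t) = T_ss + e^{Ct}(T_init − T_ss), so no time stepping is needed. The matrix exponential is computed here with a simple scaling-and-squaring Taylor series, chosen only to keep the sketch dependency-free.

```python
import math


def mat_mul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def mat_exp(M, terms=18):
    """Matrix exponential via scaling and squaring with a Taylor series."""
    n = len(M)
    norm = max(sum(abs(v) for v in row) for row in M)
    # Scale M down so the Taylor series converges quickly, then square back up.
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[v / 2 ** squarings for v in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def transient_temperature(cap, B, G, power, t_amb, t_init, t):
    """Temperatures at time t for constant power, with no time stepping.

    cap: diagonal of A (thermal capacitances), B: conductance matrix,
    G: conductances to ambient. All names are assumptions of this sketch.
    """
    n = len(cap)
    t_ss = solve(B, [power[i] + t_amb * G[i] for i in range(n)])
    Ct = [[-B[i][j] / cap[i] * t for j in range(n)] for i in range(n)]
    E = mat_exp(Ct)
    diff = [t_init[i] - t_ss[i] for i in range(n)]
    return [t_ss[i] + sum(E[i][j] * diff[j] for j in range(n))
            for i in range(n)]


if __name__ == "__main__":
    # Hypothetical two-node chip: node 0 is a core, node 1 is the heat sink.
    cap = [0.05, 5.0]                    # J/K
    B = [[2.0, -2.0], [-2.0, 2.5]]       # W/K
    G = [0.0, 0.5]                       # W/K to ambient
    for t in (0.0, 1.0, 10.0, 100.0):
        print(t, transient_temperature(cap, B, G, [10.0, 0.0], 45.0,
                                       [45.0, 45.0], t))
```

Because the temperature at an arbitrary time is available in closed form, a runtime manager could evaluate candidate decisions at a few time points instead of integrating the differential equations step by step.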

With regard to energy minimization, we first focus on a single cluster, and we propose to use the Double Largest Task First (DLTF) strategy for partitioning tasks to cores based on load balancing and idle energy reduction, combined with either the Single Frequency Approximation (SFA) scheme or the Single Voltage Approximation (SVA) scheme for deciding the DVFS levels for execution. Furthermore, we provide a thorough theoretical analysis of both solutions, in terms of energy efficiency and peak power reduction, against the optimal task partitioning and optimal DVFS schedule, particularly for state-of-the-art designs, which have a limited number of cores inside each cluster. In SFA, all the cores in a cluster run at a single voltage and frequency, such that all tasks meet their performance constraints. In SVA, all the cores in a cluster also run at the same single voltage as in SFA; however, the frequency of each core is individually chosen, such that the tasks in each core can meet their performance constraints, but without running at unnecessarily high frequencies. Finally, we extend our analysis to systems with multiple clusters, and present two task-to-core mapping solutions when using SFA on individual clusters, namely, a dynamic programming algorithm that derives optimal solutions for homogeneous manycores, and a lightweight and efficient heuristic for heterogeneous manycores.
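The difference between SFA and SVA can be illustrated with a short sketch. It is only an illustration under stated assumptions: each core's requirement is summarized as a single hypothetical "demand" (the lowest frequency at which its tasks still meet their performance constraints), the cluster offers a discrete list of frequency levels, and the cluster voltage is the one required by the fastest selected frequency. The function names and the example values are not taken from the book.

```python
def lowest_level_at_least(demand, levels):
    """Return the slowest available frequency that still covers the demand."""
    feasible = [f for f in levels if f >= demand]
    if not feasible:
        raise ValueError("demand exceeds the fastest available frequency")
    return min(feasible)


def sfa(core_demands, levels):
    """Single Frequency Approximation: one frequency for the whole cluster.

    Returns (frequency that sets the cluster voltage, per-core frequencies).
    """
    shared = lowest_level_at_least(max(core_demands), levels)
    return shared, [shared] * len(core_demands)


def sva(core_demands, levels):
    """Single Voltage Approximation: same voltage as SFA, per-core frequency.

    Returns (frequency that sets the cluster voltage, per-core frequencies).
    """
    voltage_setting = lowest_level_at_least(max(core_demands), levels)
    per_core = [lowest_level_at_least(d, levels) for d in core_demands]
    return voltage_setting, per_core


if __name__ == "__main__":
    levels_ghz = [0.5, 1.0, 1.5, 2.0]   # hypothetical DVFS levels
    demands_ghz = [0.4, 1.2, 0.9]       # hypothetical per-core demands
    print("SFA:", sfa(demands_ghz, levels_ghz))
    print("SVA:", sva(demands_ghz, levels_ghz))
```

In both schemes the most demanding core determines the shared voltage; under SVA the remaining cores simply run slower at that voltage, whereas under SFA they run at the shared frequency and finish their work earlier.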

Related Publications

This book is based on the results published in several conferences, journals, workshops, and book chapters, particularly:

• [61] Pagani, S., Chen, J.J.: Energy efficiency analysis for the single frequency approximation (SFA) scheme. In: Proceedings of the 19th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), pp. 82–91 (2013). doi: 10.1109/rtcsa.2013.6732206. [Best Paper Award].

• [62] Pagani, S., Chen, J.J.: Energy efficient task partitioning based on the single frequency approximation scheme. In: Proceedings of the 34th IEEE Real-Time Systems Symposium (RTSS), pp. 308–318 (2013). doi: 10.1109/rtss.2013.38.

• [70] Pagani, S., Khdr, H., Munawar, W., Chen, J.J., Shafique, M., Li, M., Henkel, J.: TSP: Thermal Safe Power - Efficient power budgeting for manycore systems in dark silicon. In: Proceedings of the 9th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pp. 10:1–10:10 (2014). doi: 10.1145/2656075.2656103. [Best Paper Award].


• [63] Pagani, S., Chen, J.J.: Energy efficiency analysis for the single frequency approximation (SFA) scheme. ACM Transactions on Embedded Computing Systems (TECS) 13(5s), 158:1–158:25 (2014). doi: 10.1145/2660490.

• [66] Pagani, S., Chen, J.J., Shafique, M., Henkel, J.: MatEx: Efficient transient and peak temperature computation for compact thermal models. In: Proceedings of the 18th Design, Automation and Test in Europe (DATE), pp. 1515–1520 (2015). doi: 10.7873/date.2015.0328.

• [65] Pagani, S., Chen, J.J., Li, M.: Energy efficiency on multi-core architectures with multiple voltage islands. IEEE Transactions on Parallel and Distributed Systems (TPDS) 26(6), 1608–1621 (2015). doi: 10.1109/tpds.2014.2323260.

• [64] Pagani, S., Chen, J.J., Henkel, J.: Energy and peak power efficiency analysis for the single voltage approximation (SVA) scheme. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 34(9), 1415–1428 (2015). doi: 10.1109/tcad.2015.2406862.

• [72] Pagani, S., Shafique, M., Khdr, H., Chen, J.J., Henkel, J.: seBoost: Selective boosting for heterogeneous manycores. In: Proceedings of the 10th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pp. 104–113 (2015). doi: 10.1109/codesisss.2015.7331373.

• [67] Pagani, S., Chen, J.J., Shafique, M., Henkel, J.: Thermal-aware power budgeting for dark silicon chips. In: Proceedings of the 2nd Workshop on Low-Power Dependable Computing (LPDC) at the International Green and Sustainable Computing Conference (IGSC) (2015).

• [69] Pagani, S., Khdr, H., Chen, J.J., Shafique, M., Li, M., Henkel, J.: Thermal Safe Power (TSP): Efficient power budgeting for heterogeneous manycore systems in dark silicon. IEEE Transactions on Computers (TC) 66(1), 147–162 (2017). doi: 10.1109/tc.2016.2564969. [Feature Paper of the Month].

• [71] Pagani, S., Pathania, A., Shafique, M., Chen, J.J., Henkel, J.: Energy efficiency for clustered heterogeneous multicores. IEEE Transactions on Parallel and Distributed Systems (TPDS) 28(5), 1315–1330 (2017). doi: 10.1109/tpds.2016.2623616.

• [68] Pagani, S., Khdr, H., Chen, J.J., Shafique, M., Li, M., Henkel, J.: Thermal Safe Power: Efficient thermal-aware power budgeting for manycore systems in dark silicon. In: A.M. Rahmani, P. Liljeberg, A. Hemani, A. Jantsch, H. Tenhunen (eds.) The Dark Side of Silicon. Springer (2017).

• [60] Pagani, S.: Power, energy, and thermal management for clustered manycores. Ph.D. thesis, Chair for Embedded Systems (CES), Department of Computer Science, Karlsruhe Institute of Technology (KIT), Germany (2016). [Received Summa cum Laude and the ACM SIGBED Paul Caspi Memorial Dissertation Award].


Acknowledgements

This work was partly supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre Invasive Computing [SFB/TR 89], by Baden Württemberg MWK Juniorprofessoren-Programms, and by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project CityU 117913].

Cambridge, UK          Santiago Pagani
Dortmund, Germany      Jian-Jia Chen
Vienna, Austria        Muhammad Shafique
Karlsruhe, Germany     Jörg Henkel

Contents

1 Introduction
  1.1 Optimization Goals and Constraints
    1.1.1 Computational Performance
    1.1.2 Power and Energy Consumption
    1.1.3 Temperature
  1.2 Optimization Knobs
    1.2.1 Core Heterogeneity
    1.2.2 Task-to-Core Assignment/Mapping
    1.2.3 Dynamic Power Management (DPM)
    1.2.4 Dynamic Voltage and Frequency Scaling (DVFS)
  1.3 Performance Optimization Under Power or Thermal Constraints
  1.4 Energy Minimization Under Performance Constraints
  1.5 Summary of the State-of-the-Art, Problems, and Challenges
  1.6 Book Contributions
    1.6.1 Performance Optimization Under Power and Thermal Constraints
    1.6.2 Energy Minimization Under Real-Time/Performance Constraints
  1.7 Book Outline
  1.8 Orientation Within Funding Projects
    1.8.1 Invasive Computing
    1.8.2 Power Management for Multicore Architecture with Voltage Islands
  References

2 Background and Related Work
  2.1 Performance Optimization Under Power or Thermal Constraints
    2.1.1 Techniques Using Per-chip Power Constraints
    2.1.2 Techniques Using Thermal Constraints
    2.1.3 Temperature Estimation
  2.2 Energy Minimization Under Real-Time/Performance Constraints
    2.2.1 Per-core DVFS Techniques
    2.2.2 Global DVFS Techniques
    2.2.3 Clustered Manycores / Multiple Voltage Islands
  References

3 System Model
  3.1 Application Model
  3.2 Hardware Model
  3.3 Power Model
  3.4 Energy Model
  3.5 Thermal Model
  References

4 Experimental Framework
  4.1 Setup
    4.1.1 Detailed Mode Setup
    4.1.2 High-Level Mode Setup
  4.2 Architectures
    4.2.1 Homogeneous Architectures
    4.2.2 Heterogeneous Architectures
  4.3 Benchmarks
  References

5 Thermal Safe Power (TSP)
  5.1 Overview
    5.1.1 Motivational Example
    5.1.2 Problem Definition
  5.2 Thermal Safe Power for Homogeneous Systems
    5.2.1 Given Core Mapping on Homogeneous Systems
    5.2.2 Worst-Case Mappings on Homogeneous Systems
  5.3 Thermal Safe Power for Heterogeneous Systems
    5.3.1 Given Core Mapping on Heterogeneous Systems
    5.3.2 Worst-Case Mappings on Heterogeneous Systems
  5.4 Transient-State Considerations
    5.4.1 Adjusting the Temperature for Computing TSP
    5.4.2 Nominal DVFS Operation for a Given Mapping
  5.5 Experimental Evaluations for Homogeneous Systems
    5.5.1 Power Constraints
    5.5.2 Execution Time of Online TSP Computation
    5.5.3 Dark Silicon Estimations
    5.5.4 Performance Simulations
  5.6 Experimental Evaluations for Heterogeneous Systems
    5.6.1 Power Constraints
    5.6.2 Performance Simulations
  5.7 Summary
  References

6 Transient and Peak Temperature Computation Based on Matrix Exponentials (MatEx)
  6.1 Overview
  6.2 Computing All Transient Temperatures
  6.3 Computing Peaks in Transient Temperatures
  6.4 Experimental Evaluations
    6.4.1 Setup
    6.4.2 Results

7 Selective Boosting for Multicore Systems (seBoost)
  7.1 Overview
  7.2 Given Required Boosting Levels
  7.3 Unknown Required Boosting Levels
  7.4 Unknown Maximum Expected Boosting Time
  7.5 Concurrency and Closed-Loop Control-Based Boosting
  7.6 Experimental Evaluations
    7.6.1 Results
  7.7 Summary
  References

8 Energy and Peak Power Efficiency Analysis for Simple Approximation Schemes
  8.1 Overview
    8.1.1 Problem Definition
    8.1.2 Largest Task First (LTF) Scheme
    8.1.3 Double Largest Task First (DLTF) Scheme
    8.1.4 Single Frequency Approximation (SFA) Scheme
    8.1.5 Single Voltage Approximation (SVA) Scheme
  8.2 Lower Bounds
    8.2.1 Lower Bound for the Energy Consumption
    8.2.2 Lower Bound for the Peak Power Consumption
  8.3 Approximation Factor Analysis: DLTF-SFA
    8.3.1 Energy Minimization Analysis for DLTF-SFA
    8.3.2 Peak Power Reduction Analysis for DLTF-SFA
  8.4 Approximation Factor Analysis: DLTF-SVA
    8.4.1 Energy Minimization Analysis for DLTF-SVA
    8.4.2 Peak Power Reduction Analysis for DLTF-SVA
  8.5 Comparing DLTF-SFA and DLTF-SVA
  8.6 Discrete Voltage and Frequency Pairs
  8.7 Experimental Evaluations
    8.7.1 Setup
    8.7.2 Results

9 Energy-Efficient Task-to-Core Assignment for Homogeneous Clustered Manycores
  9.1 Overview
    9.1.1 Problem Definition
  9.2 Simple Heuristic Algorithms
    9.2.1 Description of Simple Heuristic Algorithms
    9.2.2 Approximation Factor for Simple Heuristics
    9.2.3 Numerical Examples for the Approximation Factor of Simple Heuristics
  9.3 Assignment Properties
    9.3.1 Given Highest Cycle Utilization Task Sets Assigned to Every Cluster
    9.3.2 All Possible Highest Cycle Utilization Task Sets
  9.4 Dynamic Programming Solution
    9.4.1 Description of the DYVIA Algorithm
    9.4.2 Complexity Analysis for the DYVIA Algorithm
    9.4.3 Optimal Task Set Assignment Under SFA Versus Optimal DVFS
  9.5 Experimental Evaluations on SCC
    9.5.1 Setup
    9.5.2 Measurement Errors
    9.5.3 Results
  9.6 Additional Experimental Evaluations
    9.6.1 Setup
    9.6.2 Results

10 Energy-Efficient Task-to-Core Assignment for Heterogeneous Clustered Manycores
  10.1 Overview
  10.2 Assignment of Given Task Sets
    10.2.1 Example
  10.3 Special Case: Tasks Consume Equal Average Power on a Given Core and Voltage/Frequency
    10.3.1 Special Task Partitioning for Fixed Frequencies
    10.3.2 Potential DVFS Configurations for all Clusters in the Special Case
  10.4 General Case: Different Tasks Consume Different Average Power on the Same Core
    10.4.1 General Task Partitioning for Fixed Frequencies
    10.4.2 Potential DVFS Configurations for all Clusters in the General Case
  10.5 Experimental Evaluations
    10.5.1 Setup
    10.5.2 Results

11 Conclusions
  11.1 Book Summary
  11.2 Current Impact of Our Contributions
  11.3 Future Works
  References

Index

About the Authors

Santiago Pagani is currently a Staff Firmware Engineer and Team Lead at ARM Ltd. (Cambridge, UK), where he runs an Agile firmware development team working on key components for the next-generation Mali GPU products. He received his Diploma in Electronics Engineering from the National Technological University (UTN), Argentina, in 2010. He received his Ph.D. in Computer Science from the Karlsruhe Institute of Technology (KIT) with “Summa cum Laude” in 2016. From 2003 until 2012, he worked as a hardware and software developer in the industry sector for several companies in Argentina, including 2 years as a technical group leader. From 2012 until 2017, he worked as a research scientist (doctoral researcher and later postdoc) as part of the research staff at KIT. He received two Best Paper Awards (IEEE RTCSA in 2013 and IEEE/ACM CODES+ISSS in 2014), one Feature Paper of the Month (IEEE Transactions on Computers in 2017), and three HiPEAC Paper Awards. He received the 2017 ACM SIGBED “Paul Caspi Memorial Dissertation Award” in recognition of an outstanding Ph.D. dissertation. His interests include embedded systems, real-time systems, energy-efficient scheduling, temperature-aware scheduling, and power-aware designs.

Jian-Jia Chen is a Professor of Computer Science at TU Dortmund, Germany. Prior to taking up his current position, he was with the Department of Informatics at Karlsruhe Institute of Technology (KIT) in Germany as a Junior Professor for the Institute for Process Control and Robotics (IPR). He received his B.S. degree from the Department of Chemistry at National Taiwan University in 2001. He obtained his Ph.D. degree in June 2006 with the dissertation “Energy-Efficient Scheduling for Real-Time Tasks in Uniprocessor and Homogeneous Multiprocessor Systems.” Between January 2008 and April 2010, he was a Postdoc Researcher at the Computer Engineering and Networks Laboratory (TIK) at the Swiss Federal Institute of Technology (ETH) Zurich, Switzerland. His research interests include real-time systems, embedded systems, energy-efficient and power-aware designs, reliability system designs, and design automation. He has published more than 175 research papers in prestigious journals and conferences in the areas of real-time embedded systems, lower power system designs, and distributed computing. He has served as a TPC member for several international conferences in real-time and embedded systems such as RTSS, RTAS, RTCSA, DATE, ICCAD, etc., and as an associate editor and guest editor for international journals.

Muhammad Shafique (M’11, SM’16) is a Full Professor at the Institute of Computer Engineering, Department of Informatics, Vienna University of Technology (TU Wien), Austria. He is directing the Group on Computer Architecture and Robust, Energy-Efficient Technologies (CARE-Tech). He was a Senior Research Group Leader at Karlsruhe Institute of Technology (KIT), Germany, for more than 5 years. He received his Ph.D. in Computer Science from KIT in January 2011. Before that, he was with Streaming Networks Pvt. Ltd., where he was involved in research and development of advanced video coding systems for several years. His research interests are in computer architecture, power- and energy-efficient systems, robust computing covering various aspects of dependability and fault-tolerance, hardware security, emerging computing trends like neuromorphic and approximate computing, neurosciences, emerging technologies and nanosystems, self-learning and intelligent/cognitive systems, FPGAs, MPSoCs, and embedded systems. His research has a special focus on cross-layer analysis, modeling, design, and optimization of computing and memory systems covering various layers of the hardware and software stacks, as well as their integration in application use cases from Internet-of-Things (IoT), Cyber-Physical Systems (CPS), and ICT for Development (ICT4D) domains.

He received the prestigious 2015 ACM/SIGDA Outstanding New Faculty Award, six gold medals in his educational career, and several best paper awards and nominations at prestigious conferences like DATE, DAC, ICCAD, and CODES+ISSS, a Best Master Thesis Award, and a Best Lecturer Award. He has given several Invited Talks, Tutorials, and Keynotes. He has also organized many special sessions at premier venues (like DAC, ICCAD, DATE, and ESWeek) and served as the Guest Editor for IEEE Design and Test Magazine (D&T) and IEEE Transactions on Sustainable Computing (T-SUSC). He has served as the TPC Co-Chair of ESTIMedia and LPDC, General Chair of ESTIMedia, and Track Chair at DATE and FDL. He has served on the program committees of several IEEE/ACM conferences like ICCAD, ISCA, DATE, CASES, FPL, and ASPDAC. He is a Senior Member of the IEEE and IEEE Signal Processing Society (SPS), and a member of ACM, SIGARCH, SIGDA, SIGBED, and HiPEAC. He holds one US patent and has authored over 180 papers in premier journals and conferences.

Professor Jörg Henkel is with Karlsruhe Institute of Technology (KIT), Germany, where he is directing the Chair for Embedded Systems (CES). Before that, he was a Senior Research Staff Member at NEC Laboratories in Princeton, NJ. He received his Ph.D. from Braunschweig University with “Summa cum Laude”. He has organized and is organizing various embedded systems and low-power ACM/IEEE conferences/symposia as General Chair and Program Chair and was a Guest Editor on these topics in various journals like the IEEE Computer Magazine. He was Program Chair of CODES’01, RSP’02, ISLPED’06, SIPS’08, CASES’09, Estimedia’11, VLSI Design’12, ICCAD’12, PATMOS’13, NOCS’14 and served as General Chair for CODES’02, ISLPED’09, Estimedia’12, ICCAD’13 and ESWeek’16. He is/has been a steering committee member of major conferences in the embedded systems field like ICCAD, ESWeek, ISLPED, Codes+ISSS, CASES, and is/has been an editorial board member of various journals like the IEEE TVLSI, IEEE TCAD, IEEE TMSCS, ACM TCPS, JOLPE, etc. In recent years, he has given more than ten keynotes at various international conferences, primarily with a focus on embedded systems dependability. He has given full/half-day tutorials at leading conferences like DAC, ICCAD, DATE, etc. He received the 2008 DATE Best Paper Award, the 2009 IEEE/ACM William J. McCalla ICCAD Best Paper Award, the Codes+ISSS 2015, 2014, and 2011 Best Paper Awards, and the MaXentric Technologies AHS 2011 Best Paper Award as well as the DATE 2013 Best IP Award and the DAC 2014 Designer Track Best Poster Award. He is the Chairman of the IEEE Computer Society, Germany Section, and was the Editor-in-Chief of the ACM Transactions on Embedded Computing Systems (ACM TECS) for two consecutive terms. He is an initiator and the coordinator of the German Research Foundation’s (DFG) program on “Dependable Embedded Systems” (SPP 1500). He is the site coordinator (Karlsruhe site) of the Three-University Collaborative Research Center on “Invasive Computing” (DFG TR89). He is the Editor-in-Chief of the IEEE Design & Test Magazine. He holds ten US patents and is a Fellow of the IEEE.

Acronyms

ADI: Alternating Direction Implicit
ARMA: Autoregressive Moving Average
blackscholes: Application from the PARSEC benchmark suite
bodytrack: Application from the PARSEC benchmark suite
BUH: Balanced Utilization Heuristic
CBGA: Ceramic Ball Grid Array
CCH: Consecutive Cores Heuristic
DLTF: Double Largest Task First
DLTF-SFA: Double Largest Task First combined with Single Frequency Approximation
DLTF-SVA: Double Largest Task First combined with Single Voltage Approximation
DPM: Dynamic Power Management
DTM: Dynamic Thermal Management
DVFS: Dynamic Voltage and Frequency Scaling
DYVIA: Dynamic Voltage Island Assignment
EDF: Earliest Deadline First
EOH: Extremal Optimization Heuristic
EWFD: Equally-Worst-Fit-Decreasing
EXU: Execution Unit
FEA: Finite Element Analysis
ferret: Application from the PARSEC benchmark suite


FFI-LTF: Fixed Frequency Island-Aware Largest Task First
FFT: Fast Fourier Transform
FIT-LTF: Fixed Frequency Island- and Task-Aware Largest Task First
gem5: gem5 multicore simulator
GIPS: Giga Instructions per Second
GIT: Generalized Integral Transforms
HI-LTF: Heterogeneous Island-Aware Largest Task First
HIT-LTF: Heterogeneous Island- and Task-Aware Largest Task First
HotSpot: HotSpot modeling and temperature computation tool
IFU: Instruction Fetch Unit
ILP: Instruction-Level Parallelism
IPC: Instructions per Cycle
IPS: Instructions per Second
LPF: Low-Power First
LPT: Longest Processing Time
LSU: Load and Store Unit
LTF: Largest Task First
LTR: Left to Right
MatEx: (From Matrix Exponentials) Transient and Peak Temperature Computation Tool for Compact Thermal Models
McPAT: McPAT power consumption simulator
NoC: Network on Chip
Odroid-XU3: Odroid-XU3 mobile platform, with an Exynos 5 Octa (5422) chip based on ARM’s “big.LITTLE” architecture
OOO: Out-of-Order
PARSEC: PARSEC benchmark suite
QoS: Quality of Service
SCC: Single Chip Cloud computer
seBoost: Selective Boosting for Manycore Systems
SFA: Single Frequency Approximation
SVA: Single Voltage Approximation
swaptions: Application from the PARSEC benchmark suite
TDP: Thermal Design Power
TLP: Thread-Level Parallelism
TSP: Thermal Safe Power
VLSI: Very Large-Scale Integration
x264: Application from the PARSEC benchmark suite


Symbols

$A$: Matrix $A = [a_{i,j}]_{N \times N}$ that contains the thermal capacitance values of an RC thermal network (generally a diagonal matrix, since thermal capacitances are modeled to ground)
$a_{i,j}$: Element in row $i$ and column $j$ inside the thermal capacitance matrix $A$, such that $1 \le i \le N$ and $1 \le j \le N$
$AF^{\text{DVFS=SFA}}_{\text{ASG=ANY}}$: Approximation factor for any task partition mapping heuristic that uses $\lceil M/K \rceil$ clusters with non-empty task sets, when using SFA to decide the DVFS levels on individual clusters, against the optimal assignment that also uses SFA in individual clusters
$AF^{\text{energy overheads}}_{\text{DLTF-SFA}}$: Approximation factor (i.e., the worst-case behavior) of DLTF-SFA for energy minimization in relation to the optimal task partitioning and optimal DVFS solution for energy minimization (i.e., the task partitioning and DVFS solution that result in the minimum energy consumption) when we consider negligible overheads for sleeping
$AF^{\text{energy}}_{\text{DLTF-SFA}}$: Approximation factor (i.e., the worst-case behavior) of DLTF-SFA for energy minimization in relation to the optimal task partitioning and optimal DVFS solution for energy minimization (i.e., the task partitioning and DVFS solution that result in the minimum energy consumption)
$AF^{\text{peak power}}_{\text{DLTF-SFA}}$: Approximation factor (i.e., the worst-case behavior) of DLTF-SFA for peak power reduction in relation to the optimal task partitioning and optimal DVFS solution for peak power reduction (i.e., the task partitioning and DVFS solution that result in the minimum peak power consumption)


$AF^{\text{energy}}_{\text{DLTF-SVA}}$: Approximation factor (i.e., the worst-case behavior) of DLTF-SVA for energy minimization in relation to the optimal task partitioning and optimal DVFS solution for energy minimization (i.e., the task partitioning and DVFS solution that result in the minimum energy consumption)
$AF^{\text{peak power}}_{\text{DLTF-SVA}}$: Approximation factor (i.e., the worst-case behavior) of DLTF-SVA for peak power reduction in relation to the optimal task partitioning and optimal DVFS solution for peak power reduction (i.e., the task partitioning and DVFS solution that result in the minimum peak power consumption)
$\alpha$: For the approximated power consumption on a CMOS core, a constant including the effective switching capacitance, the average activity factor of the core, and a scaling factor for the linear relationship between the voltage of the cluster and the highest frequency in the cluster (i.e., $V_{dd} \propto f_{\text{cluster}}$)
$\text{area}^{\text{core}}_{m}$: Area of core $m$ (among all $M$ in the chip)
$\text{area}^{\text{type}}_{q}$: Area of a core of type $q$
$B^{-1}$: Matrix $B^{-1} = [\tilde{b}_{i,j}]_{N \times N}$ is the inverse of matrix $B$
$B$: Matrix $B = [b_{i,j}]_{N \times N}$ that contains the thermal conductance values between vertical and lateral neighboring thermal nodes of an RC thermal network
$\tilde{b}_{i,j}$: Element in row $i$ and column $j$ inside matrix $B^{-1}$, such that $1 \le i \le N$ and $1 \le j \le N$
$b_{i,j}$: Element in row $i$ and column $j$ inside the thermal conductance matrix $B$, such that $1 \le i \le N$ and $1 \le j \le N$
$\beta$: For the approximated power consumption on a CMOS core, $\beta \cdot f_{\text{cluster}} \ge 0$ represents the leakage power consumption on the core
$C$: Matrix of an RC thermal network, such that $C = -A^{-1}B$
$D$: Hyper-period, i.e., the least common multiple among all periods of all $R$ tasks
$d_n$: Period and implicit deadline of task $\tau_n$
$\delta$: Variable of auxiliary function $U(\delta)$, used to choose a value of $f_{\text{dyn}}$ such that $E^{*}_{\downarrow}$ becomes a continuous function, in order to derive an approximation factor without unnecessary pessimism
$\delta^{\max}$: Value of $\delta$ that maximizes auxiliary function $U(\delta)$, used to choose a value of $f_{\text{dyn}}$ such that $E^{*}_{\downarrow}$ becomes a continuous function, in order to derive an approximation factor without unnecessary pessimism


$\text{DYVIA}(i,j)$: Dynamic programming function, where $i$ is the index of the first task set to be considered in this sub-problem, and $j$ is the index of the last task set to be considered in this sub-problem, such that function $\text{DYVIA}(i,j)$ returns the minimum energy consumption for the assignment of task sets $S_i, S_{i+1}, \ldots, S_{j-1}, S_j$ onto cores, using $v = \frac{j-i+1}{K}$ clusters (from Corollary 9.1, $j-i+1$ will always be an integer multiple of $K$)
$\text{DYVIA}_{\text{back-tracking}}(i,j)$: Backtracking table in which entry $\text{DYVIA}_{\text{back-tracking}}(i,j)$ contains the task set indexes $\{\ell_1, \ell_2, \ldots, \ell_K\}$ that resulted in the minimum energy consumption for sub-problem $\text{DYVIA}(i,j)$
$\text{DYVIA}_{\#\text{combinations}}$: Total number of combinations that algorithm DYVIA needs to evaluate when building its dynamic programming table
$E_{\text{core}}(f)$: Approximated energy consumption on a CMOS core for the case in which the core runs at the same frequency which determines the voltage of the cluster, where $f$ is the execution frequency of the core
$E_{\text{core}}(f_{\text{cluster}}, f)$: Approximated energy consumption on a CMOS core for the general case of having voltage scaling at a cluster level and frequency scaling at a core level, where $f_{\text{cluster}}$ is the highest execution frequency among all cores in the cluster (thus setting the voltage of the cluster), and $f$ is the execution frequency of the core
$e^{Ct}$: Matrix exponential $e^{Ct} = [e^{Ct}_{i,j}]_{N \times N}$
$E^{\text{DVFS=optimal}}_{\text{ASG=DYVIA}}$: Total energy consumption when using an optimal DVFS algorithm to decide the DVFS levels on individual clusters, and when using DYVIA for assigning task sets to clusters (i.e., the optimal task set assignment solution under SFA)
$E^{\text{DVFS=optimal}}_{\text{ASG=optimal DVFS}}$: Total energy consumption when using an optimal DVFS algorithm to decide the DVFS levels on individual clusters, and when assigning task sets to clusters by using an algorithm that is optimal when using an optimal DVFS algorithm to decide the DVFS levels on individual clusters
$E^{\text{DVFS=SFA}}_{\text{ASG=DYVIA}}$: Total energy consumption when using SFA to decide the DVFS levels on individual clusters, and when using DYVIA for assigning task sets to clusters (i.e., the optimal task set assignment solution under SFA)


$E^{\text{DVFS=SFA}}_{\text{ASG=optimal DVFS}}$: Total energy consumption when using SFA to decide the DVFS levels on individual clusters, and when assigning task sets to clusters by using an algorithm that is optimal when using an optimal DVFS algorithm to decide the DVFS levels on individual clusters
$E^{\text{DVFS=SFA}}_{j,\,\text{ASG=CCH}}$: Energy consumption of cluster $I_j$ when using the CCH task partition mapping algorithm to map task sets to cores and clusters, and when using SFA to decide the DVFS levels on individual clusters
$E^{\text{DVFS=per-core}}_{j,\,\text{ASG=ANY}}$: Energy consumption of cluster $I_j$ when using any task partition mapping algorithm to map task sets to cores and clusters, and when having per-core DVFS, which will result in the lower bound for the energy consumption since having per-core DVFS is the optimal solution and the task set assignment plays no role for such a case
$E^{\text{DVFS=SFA}}_{j,\,\text{ASG=ANY}}$: Energy consumption of cluster $I_j$ when using any task partition mapping algorithm to map task sets to cores and clusters, and when using SFA to decide the DVFS levels on individual clusters
$E^{\text{DVFS=SFA}}_{j,\,\text{ASG=SFA}}$: Energy consumption of cluster $I_j$ when using SFA to decide the DVFS levels on individual clusters, and when using the optimal task set assignment solution under SFA
$e_{q,n}$: Worst-case execution cycles of task $\tau_n$ when being executed on a core of type $q$
$E(L)$: Energy consumption of the highest DVFS level cluster for each combination, which is similar to Eq. (9.1), but for set $L$ instead of set $L_j$
$E^{*}_{\downarrow}$: Lower bound for the optimal energy consumption for the optimal task partition and any feasible DVFS schedule during a hyper-period $D$
$E^{*}_{\text{OPT}}$: Optimal energy consumption for the optimal task partition and optimal DVFS schedule during a hyper-period $D$
$\tilde{E}^{\tau_n}_{q}(F^{\text{type}}_{q,j})$: Energy consumed during one hyper-period for executing task $\tau_n$ on a core of type $q$ at frequency index $j$ (such that $0 \le j \le \hat{F}^{\text{type}}_{q}$) in case that all tasks consume equivalent power when executing at the same frequency on a core of type $q$, i.e., $P^{\tau_1}_{q}(F^{\text{type}}_{q,j}) = P^{\tau_2}_{q}(F^{\text{type}}_{q,j}) = \cdots = P^{\tau_R}_{q}(F^{\text{type}}_{q,j}) = P_{q}(F^{\text{type}}_{q,j})$


$\tilde{E}^{S_g}_{q}(F^{\text{type}}_{q,j})$: Energy consumed during one hyper-period for executing task set $S_g$ on a core of type $q$ at frequency index $j$ (such that $0 \le j \le \hat{F}^{\text{type}}_{q}$) in case that all tasks consume equivalent power when executing at the same frequency on a core of type $q$, i.e., $P^{\tau_1}_{q}(F^{\text{type}}_{q,j}) = P^{\tau_2}_{q}(F^{\text{type}}_{q,j}) = \cdots = P^{\tau_R}_{q}(F^{\text{type}}_{q,j}) = P_{q}(F^{\text{type}}_{q,j})$
$E^{\tau_n}_{q}(F^{\text{type}}_{q,j})$: Energy consumed during one hyper-period for executing task $\tau_n$ on a core of type $q$ at frequency index $j$ (such that $0 \le j \le \hat{F}^{\text{type}}_{q}$)
$E^{\text{DLTF}}_{\text{SFA}}$: Total energy consumption during a hyper-period $D$ for partitioning tasks with DLTF and selecting the DVFS schedule with SFA
$E^{\text{DLTF}}_{\text{SVA}}$: Total energy consumption during a hyper-period $D$ for partitioning tasks with DLTF and selecting the DVFS schedule with SVA
$\eta$: Amount of power consumed by a cluster for being in the active state (since there is no voltage regulator with 100% efficiency) when at least one core inside the cluster has to execute some workload
$f_{\text{cluster}}$: For the approximated power consumption on a CMOS core, $f_{\text{cluster}}$ is the highest execution frequency among all cores in the cluster, and therefore determines the minimum voltage of the cluster for stable execution
$f^{\tau_n}_{\text{crit},q}$: Critical frequency of task $\tau_n$ running on a core of type $q$ that minimizes the energy consumption for execution when the overhead for entering/leaving a low-power mode can be considered negligible
$f_{\text{crit},q}$: Critical frequency on a core of type $q$ in case that all tasks consume equivalent power when executing at the same frequency on a core of type $q$, i.e., $P^{\tau_1}_{q}(F^{\text{type}}_{q,j}) = P^{\tau_2}_{q}(F^{\text{type}}_{q,j}) = \cdots = P^{\tau_R}_{q}(F^{\text{type}}_{q,j}) = P_{q}(F^{\text{type}}_{q,j})$, such that $f^{\tau_1}_{\text{crit},q} = f^{\tau_2}_{\text{crit},q} = \cdots = f^{\tau_R}_{\text{crit},q} = f_{\text{crit},q}$
$f_{\text{crit}}$: Critical frequency for the energy model in Eq. (3.6) (focusing on homogeneous systems and assuming that all tasks have similar average activity factors) that minimizes the energy consumption for execution when the overhead for entering/leaving a low-power mode can be considered negligible
$f^{\max}_{\text{dyn}}$: Maximum value of the auxiliary frequency used to obtain an analytical expression for the lower bound of the energy consumption which can be used for general cases


$f_{\text{dyn}}$: Auxiliary frequency used to obtain an analytical expression for the lower bound of the energy consumption which can be used for general cases
$F^{\text{core}}_{i,\hat{F}^{\text{core}}_{i}}$: Maximum frequency for core $i$
$\hat{F}^{\text{core}}_{i}$: Number of available frequencies for core $i$
$F_{\max}$: Maximum frequency when considering homogeneous systems
$F_{\min}$: Minimum frequency when considering homogeneous systems
$F^{\text{type}}_{q,\hat{F}^{\text{type}}_{q}}$: Maximum frequency for a core of type $q$
$\hat{F}^{\text{type}}_{q}$: Number of available frequencies for cores of type $q$
$G$: Column vector $G = [g_i]_{N \times 1}$ that contains the values of the thermal conductances between each thermal node and the ambient temperature of an RC thermal network
$\Gamma^{-1}$: Matrix $\Gamma^{-1} = [\tilde{\Gamma}_{i,j}]_{N \times N}$ represents the inverse of matrix $\Gamma$
$\Gamma$: Matrix $\Gamma = [\Gamma_{i,j}]_{N \times N}$ represents a matrix containing the eigenvectors of matrix $C$ for a given thermal model
$\gamma$: For the approximated power consumption on a CMOS core, $\gamma > 1$ is a constant related to the hardware (in CMOS processors, $\gamma$ is generally modeled equal to 3)
$H^{\rho}$: Auxiliary matrix $H^{\rho} = [h^{\rho}_{q,i,j}]_{Q \times Z \times M^{\text{type}}_{q}}$, used to compute the maximum amount of heat that any $m_q$ cores of type $q$ can contribute to the temperature on node $i$, for all core types $q = 1, 2, \ldots, Q$
$H$: Auxiliary matrix $H = [h_{i,j}]_{Z \times M}$, used to compute the maximum amount of heat that any $m$ cores can contribute to the steady-state temperature on thermal node $i$
$I_{\max}$: Maximum chip current constraint for the entire chip that cannot be exceeded (e.g., from the capacity of the power supply or the wire thickness)
$K$: Set $K = \{k_1, k_2, \ldots, k_M\}$ that contains all the indexes of the thermal nodes that correspond to cores (among all cores, ignoring the types of the cores)
$K$: Total number of cores inside every cluster/island, for a system in which all the clusters have equal number of cores per cluster/island
$K^{q}$: Set $K^{q} = \{k^{q}_{1}, k^{q}_{2}, \ldots, k^{q}_{M^{\text{type}}_{q}}\}$ for all core types $q = 1, 2, \ldots, Q$, that contains the indexes of the thermal nodes that correspond to cores of type $q$


$\kappa$: For the approximated power consumption on a CMOS core, $\kappa \ge 0$ represents the independent power consumption attributed to maintaining the core in execution mode (i.e., the voltage- and frequency-independent part of the power consumption)
$L$: Set $L = \{\ell_1, \ell_2, \ldots, \ell_Z\}$ that includes all the indexes of the thermal nodes that correspond to blocks in the floorplan (as opposed to thermal nodes that represent the heat sink, internal nodes of the heat spreader, the thermal interface material, etc.)
$\lambda$: Lagrange multiplier inside the Lagrangian used when applying the Kuhn–Tucker conditions
$L$: Lagrangian used when applying the Kuhn–Tucker conditions
$L$: Set containing the indexes of the task sets assigned to a general cluster (as opposed to set $L_j$, defined for the particular cluster $I_j$), such that $\ell_1 < \ell_2 < \cdots < \ell_K$, with $\ell_0$ auxiliary and less than $\ell_1$
$L_j$: Set containing the indexes of the task sets assigned to cluster $I_j$ such that $\ell_{j,1} < \ell_{j,2} < \cdots < \ell_{j,K}$ with $j = 1, 2, \ldots, V$; it holds that $\ell_{j,i} \in [1, M]$, and $\ell_{j,i}$ is unique for all $j$, $i$
$\Lambda(i,j)$: Set that contains all possible $L$ sets that satisfy Theorem 9.2, i.e., $\Lambda(i,j)$ stores all the potentially optimal combinations, such that $\ell_0 = i - 1$, $\ell_K = j$ and $\ell_h = \ell_{h-1} + 1 + n \cdot K$ for $0 < h < K$ with $\ell_h < j$ and $n \in \mathbb{N}_0$
$\Lambda$: Diagonal matrix $\Lambda = \text{diag}(e^{\lambda_1 \cdot t}, e^{\lambda_2 \cdot t}, \ldots, e^{\lambda_N \cdot t})$, where $\lambda_1, \lambda_2, \ldots, \lambda_N$ are the eigenvalues of matrix $C$ for a given thermal model
$M$: Total number of cores in the system
$M^{\star}$: Total number of task sets in which some task partitioning algorithm partitions the tasks
$M$: Total number of task sets in which we partition the tasks
$m$: Set $m = \{m_1, m_2, \ldots, m_Q\}$ that represents the number of active cores for core types $\{1, 2, \ldots, Q\}$, respectively
$M^{\text{cluster}}_{k}$: Total number of cores inside cluster/island $k$
$M^{\text{cluster}}_{\max}$: Maximum number of cores inside a cluster among all clusters, i.e., $M^{\text{cluster}}_{\max} = \max_{1 \le k \le V} \{M^{\text{cluster}}_{k}\}$
$M_{\neq 0}$: When partitioning tasks using DLTF, the resulting number of cores after regrouping with cycle utilization larger than 0, i.e., the cores that remain active


$M^{\text{type}}_{q}$: Total number of cores of type $q$
$N$: Total number of thermal nodes in the RC thermal network, such that there are at least as many thermal nodes in the RC thermal network as blocks in the floorplan, i.e., $N \ge Z$
$\Omega$: Auxiliary matrix $\Omega = [\Omega_{i,j}]_{N \times N}$, used to speed up the computation of the transient temperatures in MatEx
$\Omega_{k,i}$: Element in row $k$ and column $i$ inside auxiliary matrix $\Omega$, such that $1 \le k \le N$ and $1 \le i \le N$
$P$: Column vector $P = [p_i]_{N \times 1}$ that contains the values of the power consumption on every node of an RC thermal network
$P_{\text{blocks}}$: Column vector $P_{\text{blocks}} = [p^{\text{blocks}}_{i}]_{N \times 1}$ that represents the power consumption on other blocks in the floorplan that do not correspond to cores (e.g., a block of an L2 cache)
$P_{\text{core}}(f)$: Approximated average power consumption on a CMOS core for the case in which the core runs at the same frequency which determines the voltage of the cluster, where $f$ is the execution frequency of the core
$P_{\text{core}}(f_{\text{cluster}}, f)$: Approximated average power consumption on a CMOS core for the general case of having voltage scaling at a cluster level and frequency scaling at a core level, where $f_{\text{cluster}}$ is the highest execution frequency among all cores in the cluster (thus setting the voltage of the cluster), and $f$ is the execution frequency of the core
$P^{\text{core}}_{\text{inact},j}$: Power consumption of core $j$ (among all $M$ in the chip) when the core is inactive (i.e., idle or in a low-power mode)
$P^{\text{core}}_{\text{inact},m}$: Power consumption of core $m$ (among all $M$ in the chip) when the core is inactive (i.e., idle or in a low-power mode)
$P^{\text{core}}_{\text{inact}}$: Power consumption of an inactive core (i.e., idle or in a low-power mode) for the special case of homogeneous manycore systems
$P^{\text{core}\,\rho}_{\max}(m)$: Auxiliary function used to assist in deriving the amount of power density that any $m = \{m_1, m_2, \ldots, m_Q\}$ active cores are allowed to consume, such that the total power consumption precisely reaches the value of $P_{\max}$
$P^{\text{core}\,\rho}_{\max}(X)$: Auxiliary function used to assist in deriving the amount of power density that the active cores in mapping $X$ are allowed to consume, such that the total power consumption precisely reaches the value of $P_{\max}$


$P^{\text{core}}_{\max}(m)$: Auxiliary function used to assist in deriving the amount of power that any $m$ active cores are allowed to consume, such that the total power consumption precisely reaches the value of $P_{\max}$
$P_{\text{cores}}$: Column vector $P_{\text{cores}} = [p^{\text{cores}}_{i}]_{N \times 1}$ that represents the power consumption on the cores
$P^{\rho}_{\text{equal}}$: For a heterogeneous or homogeneous manycore system, power density on all active cores when we assume that all active cores have equal power density at a given point in time
$P_{\text{equal}}$: For a homogeneous manycore system, power consumption of all active cores when we assume that all active cores are consuming equal power at a given point in time
$\hat{P}^{*}_{\downarrow}$: Lower bound for the optimal peak power consumption for the optimal task partition and any feasible DVFS schedule during a hyper-period $D$
$P_{\max}$: Maximum chip power constraint for the entire chip that cannot be exceeded (e.g., from the capacity of the power supply or the wire thickness)
$\hat{P}^{*}_{\text{OPT}}$: Optimal peak power consumption for the optimal task partition and optimal DVFS schedule during a hyper-period $D$
$P_{q}(F^{\text{type}}_{q,j})$: Average power consumption on a core of type $q$ running at frequency index $j$ (such that $0 \le j \le \hat{F}^{\text{type}}_{q}$), in case that all tasks consume equivalent power when executing at the same frequency on a core of type $q$, i.e., in case that $P^{\tau_1}_{q}(F^{\text{type}}_{q,j}) = P^{\tau_2}_{q}(F^{\text{type}}_{q,j}) = \cdots = P^{\tau_R}_{q}(F^{\text{type}}_{q,j}) = P_{q}(F^{\text{type}}_{q,j})$, such that $f^{\tau_1}_{\text{crit},q} = f^{\tau_2}_{\text{crit},q} = \cdots = f^{\tau_R}_{\text{crit},q} = f_{\text{crit},q}$
$P^{\tau_n}_{q}(F^{\text{type}}_{q,j})$: Average power consumption on a core of type $q$ when executing task $\tau_n$ at frequency index $j$ (such that $0 \le j \le \hat{F}^{\text{type}}_{q}$)
$P_{\text{rest}}$: Column vector $P_{\text{rest}} = [p^{\text{rest}}_{i}]_{N \times 1}$ that represents the power consumption of thermal nodes that are not on the floorplan (e.g., internal thermal nodes of the heat sink), for which it holds that $p^{\text{rest}}_{i} = 0$ for all $i$
$\hat{P}^{\text{DLTF}}_{\text{SFA}}$: Total peak power consumption for partitioning tasks with DLTF and selecting the DVFS schedule with SFA

$\hat{P}^{\text{DLTF}}_{\text{SVA}}$: Total peak power consumption for partitioning tasks with DLTF and selecting the DVFS schedule with SVA
$P^{\rho}_{\text{TSP}}(X)$: Per-core power density budget for each active core in the specified core mapping (independent of the type of core) that results in a maximum steady-state temperature among all cores which does not exceed the critical threshold temperature for triggering DTM, for heterogeneous or homogeneous manycore systems
$P_{\text{TSP}}(X)$: Per-core power budget for each active core in the specified core mapping that results in a maximum steady-state temperature among all cores which does not exceed the critical threshold temperature for triggering DTM, for homogeneous manycore systems
$P^{\rho\star}_{\text{TSP}}(X)$: Per-core power density budget for each active core in the specified core mapping (independent of the type of core) while ignoring $P_{\max}$ that results in a maximum steady-state temperature among all cores which does not exceed the critical threshold temperature for triggering DTM, for heterogeneous or homogeneous manycore systems
$P^{\star}_{\text{TSP}}(X)$: Per-core power budget for each active core in the specified core mapping while ignoring $P_{\max}$ that results in a maximum steady-state temperature among all cores which does not exceed the critical threshold temperature for triggering DTM, for homogeneous manycore systems
$P^{\rho\,\text{worst}}_{\text{TSP}}(m)$: Per-core power density budget (independent of the type of core) for each active core in any possible core mapping with $m = \{m_1, m_2, \ldots, m_Q\}$ active cores for core types $\{1, 2, \ldots, Q\}$, respectively, that results in a maximum steady-state temperature among all cores which does not exceed the critical threshold temperature for triggering DTM, for heterogeneous or homogeneous manycore systems
$P^{\text{worst}}_{\text{TSP}}(m)$: Per-core power budget for each active core in any possible core mapping with $m$ simultaneously active cores that results in a maximum steady-state temperature among all cores which does not exceed the critical threshold temperature for triggering DTM, for homogeneous manycore systems


$P^{\rho\star\,\text{worst}}_{\text{TSP}}(m)$: Per-core power density budget (independent of the type of core) for each active core in any possible core mapping with $m = \{m_1, m_2, \ldots, m_Q\}$ active cores for core types $\{1, 2, \ldots, Q\}$, respectively, that results in a maximum steady-state temperature among all cores which does not exceed the critical threshold temperature for triggering DTM while ignoring $P_{\max}$, for heterogeneous or homogeneous manycore systems
$P^{\star\,\text{worst}}_{\text{TSP}}(m)$: Per-core power budget for each active core in any possible core mapping with $m$ simultaneously active cores while ignoring $P_{\max}$ that results in a maximum steady-state temperature among all cores which does not exceed the critical threshold temperature for triggering DTM, for homogeneous manycore systems
$P^{\text{type}}_{\text{inact},q}$: Power consumption of a core of type $q$ when the core is inactive (i.e., idle or in a low-power mode)
$Q$: Number of different types of cores in the system
Qhk: Index identifying the types of cores inside cluster/island $k$
$R$: Total number of tasks to be executed
$\rho^{\text{energy}}_{\text{SFA}}$: Worst-case energy consumption and peak power consumption ratio for SFA when we consider discrete DVFS levels against the continuous cases
$\rho^{\text{peak power}}_{\text{SFA}}$: Worst-case energy consumption and peak power consumption ratio for SFA when we consider discrete DVFS levels against the continuous cases
$\rho_{\text{SVA}}$: Worst-case energy consumption and peak power consumption ratio for SVA when we consider discrete DVFS levels against the continuous cases
$S^{\text{DLTF}}_{i}$: Task set assigned to core $i$ after partitioning the tasks using the DLTF strategy
$S^{\text{LTF}}_{i}$: Task set assigned to core $i$ after partitioning the tasks using the LTF strategy
$S_i$: Task set assigned to core $i$
$T'$: Column vector $T' = [T'_{i}(t)]_{N \times 1}$ that accounts for the first-order derivative of the temperature on each thermal node of an RC thermal network with respect to time
$T$: Column vector $T = [T_{i}(t)]_{N \times 1}$ that represents the temperatures on the thermal nodes of an RC thermal network
$T_{\text{amb}}$: Ambient temperature


$T_{\text{DTM}}$: Critical temperature which serves as thermal threshold for triggering DTM
$T'_{i}(t)$: Element in row $i$ of the column vector $T'$, i.e., the first-order derivative of the temperature on thermal node $i$ with respect to time, such that $1 \le i \le N$
$T'_{k}(t)$: Element in row $k$ of the column vector $T'$, i.e., the first-order derivative of the temperature on thermal node $k$ with respect to time, such that $1 \le k \le N$
$T_{\text{steady},i}$: Element in row $i$ of the column vector $T_{\text{steady}}$, i.e., the steady-state temperature on thermal node $i$, such that $1 \le i \le N$
$T_{\text{steady},k}$: Element in row $k$ of the column vector $T_{\text{steady}}$, i.e., the steady-state temperature on thermal node $k$, such that $1 \le k \le N$
$T_{i}(t)$: Element in row $i$ of the column vector $T$, i.e., the temperature on thermal node $i$, such that $1 \le i \le N$
$T_{k}(t)$: Element in row $k$ of the column vector $T$, i.e., the temperature on thermal node $k$, such that $1 \le k \le N$
$T_{\text{init},k}$: Element in row $k$ of the column vector $T_{\text{init}}$, i.e., the initial temperature on thermal node $k$ at time zero, such that $1 \le k \le N$
$T_{\text{init}}$: Column vector $T_{\text{init}} = [T_{\text{init},k}]_{N \times 1}$ that contains the initial temperatures on all nodes of an RC thermal network at time zero
$T_{k}(t^{\uparrow}_{k})$: Maximum temperature on thermal node $k$, which occurs at time $t^{\uparrow}_{k}$
$T''_{k}(t)$: Second-order derivative of the temperature on thermal node $k$ with respect to time, such that $1 \le k \le N$
$t^{\uparrow}_{k}$: Time point at which the temperature on thermal node $k$ reaches its maximum value
$T_{\text{steady}}$: Column vector $T_{\text{steady}} = [T_{\text{steady},i}]_{N \times 1}$ that represents the steady-state temperatures on the thermal nodes of an RC thermal network
$\tau_n$: Periodic real-time task $n$ with implicit deadline
$\theta_{\text{LTF}}$: Approximation factor of the LTF strategy in terms of task partitioning, due to the approximation factor of the LPT algorithm for the Makespan problem
$U(\delta^{\max})$: Maximum value of auxiliary function $U(\delta)$, used to choose a value of $f_{\text{dyn}}$ such that $E^{*}_{\downarrow}$ becomes a continuous function, in order to derive an approximation factor without unnecessary pessimism


$U(\delta)$: Auxiliary function used to choose a value of $f_{\text{dyn}}$ such that $E^{*}_{\downarrow}$ becomes a continuous function, in order to derive an approximation factor without unnecessary pessimism
$u_{\tau_n}(t)$: Instantaneous activity factor of a core executing task $\tau_n$ at time $t$
$V$: Total number of clusters/islands in the system
$V_{dd}$: Supply voltage of a core
$V_{th}$: Transistor threshold voltage
$W$: Number of iterations used when applying the Newton–Raphson method
$w'_{i}$: Cycle utilization (unit: cycles/second) of task set $S_i$ that results in a lower bound of the optimal energy consumption for the optimal task partition and the optimal DVFS schedule for homogeneous systems
$w^{*}_{i}$: Cycle utilization (unit: cycles/second) of task set $S_i$ that results in the optimal energy consumption for the optimal task partition and the optimal DVFS schedule for homogeneous systems
$w^{\text{DLTF}}_{i}$: Cycle utilization (unit: cycles/second) of task set $S_i$ when partitioning tasks using the DLTF strategy for homogeneous systems
$w^{\text{LTF}}_{i}$: Cycle utilization (unit: cycles/second) of task set $S_i$ when partitioning tasks using the LTF strategy for homogeneous systems
$w_{q,i}$: Cycle utilization (unit: cycles/second) of task set $S_i$ running on a core of type $q$, computed as $\sum_{\tau_n \in S_i} \frac{e_{q,n}}{d_n}$
$w'_{M}$: Maximum cycle utilization (unit: cycles/second) among all task sets $S_i$ that results in a lower bound of the optimal energy consumption for the optimal task partition and the optimal DVFS schedule for homogeneous systems
$w^{*}_{M}$: Maximum cycle utilization (unit: cycles/second) among all task sets $S_i$ that results in the optimal energy consumption for the optimal task partition and the optimal DVFS schedule for homogeneous systems


wDLTFM Maximum cycle utilization unit cycles

second


sets Si when partitioning tasks using the DLTF strategy forhomogeneous systems, 135–140, 143, 148, 149, 153–156,161–163, 170, 173, 174, 177

wLTFM Maximum cycle utilization unit cycles

second


sets Si when partitioning tasks using the LTF strategy forhomogeneous systems, xiii, 137–140, 142, 146, 160, 166,170

wM Maximum cycle utilization unit cyclessecond


sets Si running on a homogeneous system, 184, 187, 191wDLTFmax Auxiliary cycle utilization used as a maximum cycle

utilization for the task regrouping procedure of DLTF,computed as wDLTF

max ¼ max fcrit;wDLTFM

� , 139, 140, 144, 165

$X$ Column vector $X = [x_i]_{M \times 1}$ for a particular mapping of active cores (among all cores, ignoring the types of the cores), where $x_i = 1$ means that core $i$ corresponds to an active core, while $x_i = 0$ means that core $i$ corresponds to an inactive core, xxi, 38, 62–68, 70, 74

$X_q$ Column vector $X_q = [x_i^q]_{M_q^{\mathrm{type}} \times 1}$ for all types of cores $q = 1, 2, \ldots, Q$, which represents a particular mapping of active cores of type $q$, where $x_i^q = 1$ means that core $i$ corresponds to an active core of type $q$, while $x_i^q = 0$ means that core $i$ corresponds to an inactive core of type $q$, 39

$\chi_q$ Constant scaling factor for the worst-case execution cycles of all tasks executing on cores of type $q$ in relation to the worst-case execution cycles of the same tasks executing on an arbitrary reference type of core $y$, such that $e_{q,n} = \alpha_q \cdot e_{y,n}$ for all core types $q = 1, 2, \ldots, Q$ and for all tasks $n = 1, 2, \ldots, R$, 219–222, 225

$Y$ Set $Y = \{y_1, y_2, \ldots, y_V\}$ containing the indexes of the task sets with the highest cycle utilizations mapped to every cluster, such that task set $S_{y_j}$ will have the highest cycle utilization inside cluster $I_j$, 191–197

$Z$ Total number of blocks in the floorplan, such that $Z - M$ is the number of blocks corresponding to components other than cores, e.g., L2 caches and memory controllers, xix–xxii, xxvii, 26, 38, 46, 49, 67, 71–77


List of Figures

Fig. 1.1 Execution cycles, execution time, and speed-up factors (normalized to the execution time of each application running at the lowest frequency on each type of core) of one instance of five applications from the PARSEC benchmark suite [6], executing a single thread with simsmall input, based on simulations in gem5 [7] and McPAT [8] (for 22 nm), and measured on the Exynos 5 Octa (5422) processor [9]

Fig. 1.2 Power consumption of a PARSEC bodytrack application with simsmall input, executing four threads at 2.2 GHz on a quad-core Intel Nehalem cluster, based on simulations using Sniper [10] and McPAT [8]

Fig. 1.3 Average power and average energy consumption for one instance of five applications from the PARSEC benchmark suite [6] executing a single thread with simsmall input, based on simulations in gem5 [7] and McPAT [8] (for 22 nm), and measured on the Exynos 5 Octa (5422) processor [9]

Fig. 1.4 Example of transient temperatures and steady-state temperatures for a change in power at $t = 1$ s

Fig. 1.5 Abstract example of a standard reactive control-based closed-loop DTM technique based on voltage and frequency scaling. When the critical temperature is exceeded in some part of the chip, the voltage and frequency levels of all cores are decreased. After a control period (or hysteresis time), and if the maximum temperature in the chip is below the critical value, the voltage and frequency levels of the cores can be brought back to the nominal operation settings

Fig. 1.6 Exynos 5 Octa (5422) processor based on ARM’s big.LITTLE architecture

Fig. 1.7 Relationship between the supply voltage and the maximum stable frequency, modeled by Eq. (1.1) for the experimental results of a 28 nm x86-64 microprocessor developed in [15]


Fig. 1.8 Abstract example of a safe and efficient power budget

Fig. 1.9 Interactions and relations of the contributions presented in this book

Fig. 1.10 Overview of a distributed dark silicon management system illustrating the interaction of different components across several hardware and software layers

Fig. 2.1 Example of Intel’s Turbo Boost [12]. The bold line represents the maximum temperature among all elements in the chip at a given point in time (left axis). The performance of the applications is measured in Giga Instructions per Second (GIPS) (right axis)

Fig. 2.2 Speed-up factors (with respect to the number of parallel threads, normalized to the execution time of running a single thread) of three applications from the PARSEC benchmark suite [18] running at 2 GHz on an OOO Alpha 21264 core, based on simulations on gem5 [15] (up to eight threads) and Amdahl’s law (for more than eight threads)

Fig. 2.3 Stacked layers in a typical Ceramic Ball Grid Array (CBGA) package (adapted from a figure in [17])

Fig. 2.4 An accelerating schedule for frame-based real-time tasks that satisfies the deep sleeping property

Fig. 3.1 Example of an EDF schedule for three periodic tasks, i.e., $\tau_1$, $\tau_2$, and $\tau_3$, with periods (and deadlines) $d_1 = 2$ ms, $d_2 = 3$ ms, and $d_3 = 5$ ms, hyper-period $D = 30$ ms, and worst-case execution cycles $e_{q,1} = 0.5 \cdot 10^6$ cycles, $e_{q,2} = 1 \cdot 10^6$ cycles, and $e_{q,3} = 2 \cdot 10^6$ cycles, assigned to a core of type $q$ running at 1 GHz. Symbol ↓ represents the arrival time of a new task instance (job) ready for execution. Symbol ↑ represents the period and deadline of a task instance (job). For periodic tasks with implicit deadlines, ↓ and ↑ overlap, i.e., ↕. In this example, when two or more tasks have the same absolute deadline and one of these tasks was already being executed, then we continue to run it such that we can reduce the number of preemptions. Otherwise, we run the task with the lowest index

Fig. 3.2 Floorplan of a homogeneous 64-core system (16 quad-core clusters) based on simulations in gem5 [6] and McPAT [7], where each core is composed of several units: an IFU, an EXU, an LSU, an OOO issue/dispatch, and a private L1 cache

Fig. 3.3 Experimental results for a 22 nm OOO Alpha 21264 core, based on simulations conducted in gem5 [6] and McPAT [7] for an x264 application from the PARSEC benchmark suite [1] running a single thread, and the power model from Eq. (3.3) when $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$

Fig. 3.4 Experimental results from [10] for its 48-core system, and the power model from Eq. (3.3) when $\gamma = 3$, $\alpha = 1.76\,\mathrm{W/GHz^3}$, $\beta = 0\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$

Fig. 3.5 Energy consumption for a single core from [10] executing $10^9$ compute cycles, when modeled using the energy model from Eq. (3.6) with $\gamma = 3$, $\alpha = 1.76\,\mathrm{W/GHz^3}$, $\beta = 0\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$, resulting in a critical frequency of 0.52 GHz
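
The critical frequency quoted for Fig. 3.5 can be checked with a short, hedged Python sketch. Eqs. (3.3) and (3.6) are not reproduced in this list, so the sketch assumes, based only on the units of the parameters, a power model of the form $P(f) = \alpha f^{\gamma} + \beta f + \kappa$ and an energy of $P(f) \cdot \Delta c / f$ for $\Delta c$ cycles. Under that assumption the energy is minimized where $(\gamma - 1)\,\alpha f^{\gamma} = \kappa$. The function names below are hypothetical.

```python
def critical_frequency(alpha, gamma, kappa):
    """Frequency (GHz) minimizing energy per cycle under the assumed model.

    Assumed power model: P(f) = alpha * f**gamma + beta * f + kappa.
    Energy for a fixed number of cycles is proportional to P(f) / f, whose
    derivative vanishes when (gamma - 1) * alpha * f**gamma == kappa.
    """
    return (kappa / ((gamma - 1) * alpha)) ** (1.0 / gamma)


def energy_joules(f_ghz, cycles, alpha, beta, gamma, kappa):
    """Energy (J) to execute `cycles` cycles at a constant frequency `f_ghz`."""
    power_w = alpha * f_ghz ** gamma + beta * f_ghz + kappa
    time_s = cycles / (f_ghz * 1e9)
    return power_w * time_s


# Parameters quoted in the caption of Fig. 3.5.
ALPHA, BETA, GAMMA, KAPPA = 1.76, 0.0, 3, 0.5
CYCLES = 1e9

f_crit = critical_frequency(ALPHA, GAMMA, KAPPA)
print(round(f_crit, 2))

# Running slower or faster than f_crit costs more energy under this model.
for f in (0.4, f_crit, 0.7):
    print(round(f, 3), round(energy_joules(f, CYCLES, ALPHA, BETA, GAMMA, KAPPA), 3))
```

With the caption's parameters the sketch gives a critical frequency of about 0.52 GHz, in agreement with the value stated in the caption.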

Fig. 3.6 Energy consumption examples for a core executing $\Delta c = 10^9$ compute cycles, considering four different cases for the voltage (i.e., different values for $f_{\mathrm{cluster}}$), for the model in Eq. (3.5) when $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$

Fig. 3.7 (From [14]) Simplified RC thermal network example for two cores, where we assume that cores are in immediate contact with the heat sink, and there is no other connection between a core and the ambient temperature. In a more detailed example, we would have several layers between a core and the heat sink (e.g., as seen in Fig. 2.3, the ceramic packaging substrate, the thermal interface material, the heat spreader, etc.), and there would also exist more paths that lead to the ambient temperature (e.g., through the circuit board)

Fig. 4.1 Overview of the simulation framework used throughout this book

Fig. 4.2 (From [8]) Temperature snapshot for the floorplan illustrated in Fig. 3.2, for a specific mapping with 12 active cores in which every active core has a power consumption of 18.75 W (225 W in total), resulting in a highest steady-state temperature of 102.9 °C and a lowest steady-state temperature (among the active cores) of 97.0 °C. Top numbers are the power consumptions of each active core (boxed in black). Bottom numbers are the temperatures in the center of each core. Detailed temperatures are shown according to the color bar

Fig. 4.3 Floorplans of two homogeneous 16-core systems used for the motivational examples in Chaps. 5, 6 and 7, based on simulations conducted in gem5 [1] and McPAT [2]

Fig. 4.4 (From [8]) Floorplan of a heterogeneous 72-core system based on simulations conducted in gem5 [1] and McPAT [2], and an Odroid [4] mobile platform with an Exynos 5 Octa (5422) [3] chip. For a given cluster type, every cluster is identified as a, b, c, and d, from left to right and from top to bottom

Fig. 5.1 Temperature distribution example (with DTM deactivated) when using a 90.2 W per-chip power budget evenly distributed among different numbers of simultaneously active cores. Top numbers are the power consumptions on each active core (boxed in black). Bottom numbers are the temperatures on the center of each core. Detailed temperatures are shown according to the color bar

Fig. 5.2 Maximum steady-state temperature (with DTM deactivated) among all cores as a function of the number of simultaneously active cores, when using different single and constant per-chip and per-core power budgets

Fig. 5.3 Examples resulting in a maximum steady-state temperature of 80 °C. Top numbers are the power consumptions on each active core (boxed in black). Bottom numbers are the temperatures on the center of each core. Detailed temperatures are shown according to the color bar

Fig. 5.4 (From [1]) Example of TSP for two different mappings resulting in a maximum steady-state temperature of 80 °C. Top numbers are the power consumptions on each active core (boxed in black). Bottom numbers are the temperatures on the center of each core. Detailed temperatures are shown according to the color bar

Fig. 5.5 (From [1]) TSP computation example for the given mapping from Fig. 5.4a

Fig. 5.6 Example of TSP results for the worst-case core mappings

Fig. 5.7 Example of different core mappings executing under TSP for the worst-case mappings with four active cores. Top numbers are the power consumptions on each active core (boxed in black). Bottom numbers are the temperatures on the center of each core. Detailed temperatures are shown according to the color bar

Fig. 5.8 (From [1]) TSP computation example for the worst-case mappings with four active cores

Fig. 5.9 Transient example for both TSP and TDP (the red line shows the maximum temperature among all cores). During $t = [0\,\mathrm{s}, 0.5\,\mathrm{s}]$, there are eight active cores according to Fig. 5.3b, each core consuming 11.27 W. During $t = [0.5\,\mathrm{s}, 1\,\mathrm{s}]$, these cores are shut down and we activate four cores according to Fig. 5.3a, each core consuming 14.67 W

Fig. 5.10 Transient example for both TSP and TDP (the red line shows the maximum temperature among all cores). During $t = [0\,\mathrm{s}, 0.5\,\mathrm{s}]$, there are four active cores according to Fig. 5.3a, each core consuming 14.67 W. During $t = [0.5\,\mathrm{s}, 1\,\mathrm{s}]$, these cores are shut down and we activate eight cores according to Fig. 5.3b, each core consuming 11.27 W


Fig. 5.11 Transient example for 16 cores. The number of active cores and their power consumption change every 0.1 s. The adopted mapping and power values are those of TSP for the worst-case mappings, computed for (a) 80 °C and (b) 69.5 °C. The temperature on each core is illustrated using a different color, and the maximum temperature among all cores at any given point in time is highlighted by the bold-red curve

Fig. 5.12 Transient example for 16 active cores when using different DVFS policies for idling. The partial number of active cores changes every 0.1 s, e.g., when threads become idle waiting for other threads to finish. In (a), DVFS is used to match the worst-case TSP values for the partial number of active cores in every time interval. In (b), DVFS levels are maintained constant to match the worst-case TSP values when all cores are active. The temperature on each core is illustrated using a different color, and the maximum temperature among all cores at any given time is highlighted by the bold-red curve

Fig. 5.13 Worst-case and best-case TSP for the 64-core homogeneous architecture illustrated in Fig. 3.2 and described in Sect. 4.2.1, compared to a constant per-core power budget, and estimations of constant per-chip power budgets evenly distributed among the active cores

Fig. 5.14 Constant per-chip power budgets, compared to estimations by multiplying the number of active cores with a constant per-core power budget, and the worst-case and best-case TSP values, for the 64-core homogeneous architecture illustrated in Fig. 3.2 and described in Sect. 4.2.1

Fig. 5.15 Maximum steady-state temperatures throughout the chip for the 64-core homogeneous architecture illustrated in Fig. 3.2 and described in Sect. 4.2.1, with DTM deactivated, when using TSP, a constant per-core power budget, and three evenly distributed constant per-chip power budgets

Fig. 5.16 Execution time for computing TSP for a given mapping, considering several floorplans with different numbers of cores in each floorplan

Fig. 5.17 Example of dark silicon estimations when using TSP for the worst-case mappings and a constant per-chip power budget, for the 64-core homogeneous architecture illustrated in Fig. 3.2 and described in Sect. 4.2.1

Fig. 5.18 Experimental evaluation results for homogeneous systems showing the average total system performance when using different power budgets


Fig. 5.19 Experimental evaluation results for heterogeneous systems showing the average total system performance when using different power budgets

Fig. 6.1 (From [5]) Example of two steady-states for the 16-core chip illustrated in Fig. 4.3b (described in Sect. 4.2.1). Active cores are boxed in black. Top numbers represent the power consumptions on active cores. Bottom numbers represent the temperatures on the center of each core. Detailed temperatures are shown according to the color bar

Fig. 6.2 Transient temperatures example. During $t = [0\,\mathrm{s}, 1\,\mathrm{s}]$, there are twelve active cores according to Fig. 6.1a. During $t = [1\,\mathrm{s}, 3\,\mathrm{s}]$, there are four cores active according to Fig. 6.1b. The highest transient temperature occurs on core C5 at time $t = 1.06$ s

Fig. 6.3 Overview of the steps involved in deriving MatEx to compute all transient temperatures

Fig. 6.4 Physical interpretation of the symbols in Eq. (6.3) for computing the transient temperature of thermal node $k$. In Eq. (6.3), time $t = 0$ corresponds to the moment at which a change in power occurs

Fig. 6.5 Overview of the steps involved in deriving MatEx to compute the peaks in transient temperatures

Fig. 6.6 Example showing that, since $T_k(t)$ results in a different expression for every thermal node, the value of $t_k^{\uparrow}$ is also different for every node $k$

Fig. 6.7 First-order derivative of the temperature on core C5 for the transient simulations in Fig. 6.2. After the change in power, $T'_k(t)$ is zero at $t = 1.06$ s and when $t \to \infty$ (i.e., when the temperature reaches its steady-state)

Fig. 6.8 Conceptual example of the Newton–Raphson method for solving $T'_k(t) = 0$ on core C5 in W iterations (only a few shown for simplicity of presentation), for the transient simulations in Fig. 6.2. After the change in power, the first solution to $T'_k(t) = 0$ is found at time $t = 1.06$ s

Fig. 6.9 Overview of the simulation flow used for evaluating the accuracy and execution time of HotSpot and MatEx

Fig. 6.10 Simulation results for all transient temperatures used to compare the accuracy of HotSpot and MatEx, for the case of 64 cores and a single change in power

Fig. 6.11 Execution time for computing all transient temperatures and the peaks in the transient temperatures when using HotSpot

Fig. 6.12 Execution time for computing all transient temperatures when using MatEx (Algorithms 6.1 and 6.2)


Fig. 6.13 Execution time for computing the peaks in the transient temperatures when using MatEx (Algorithms 6.1 and 6.3)

Fig. 7.1 Abstract example of a performance requirement surge/peak at runtime. At the beginning, all three applications are executing at the required nominal performance. Then, application 3 (a high-priority application) needs to increment its performance during some time in order to meet its timing requirements due to a drastic increase in its workload

Fig. 7.2 Motivational example with four different boosting techniques when the bodytrack application needs to increase its performance at runtime, for a critical temperature of 80 °C (represented by a dotted line). The bold line shows the maximum temperature among all elements in the chip (left axis). The performance of the applications is measured in Giga Instructions per Second (GIPS) (right axis)

Fig. 7.3 seBoost’s system and problem overview

Fig. 7.4 Conceptual problem example for the case with given required boosting levels, when the bodytrack application needs to increase its performance at runtime and the x264 application can be throttled down. The unknown parameter is represented by symbol ?, in this case, the throttle-down levels of the non-boosted cores

Fig. 7.5 Flowchart of our seBoost technique for given boosting requirements

Fig. 7.6 Conceptual example of our seBoost technique for given boosting requirements (algorithm illustrated in Fig. 7.5), when the bodytrack application needs to increase its performance at runtime and the x264 application can be throttled down. The red line shows the maximum temperature among all cores (left axis). The performance of the applications is measured in Giga Instructions per Second (GIPS) (right axis)

Fig. 7.7 Conceptual problem example for the case with unknown required boosting levels, when the bodytrack application needs to increase its performance at runtime and the x264 application can be throttled down. The unknown parameters are represented by symbol ?, in this case, the boosting levels of the boosted cores and the throttle-down levels of the non-boosted cores

Fig. 7.8 Flowchart of our seBoost technique for unknown required boosting levels

Fig. 7.9 Conceptual example of the evaluation of our seBoost technique for unknown required boosting levels (illustrated in Fig. 7.8), when testing if the boosted cores can (left figure) or cannot (right figure) be safely executed at their maximum DVFS levels. The cores requiring boosting are those mapped with the blackscholes or the bodytrack application. The bold line shows the maximum temperature among all elements in the chip at any given point in time (left axis). The performance of the applications is measured in Giga Instructions per Second (GIPS) (right axis)

Fig. 7.10 Conceptual example of our seBoost technique for unknown required boosting levels (illustrated in Fig. 7.8), when all boosted cores of the blackscholes application can be safely executed at their maximum DVFS levels. Therefore, the DVFS levels of the boosted cores are set to their maximum values, and the throttle-down levels of the non-boosted cores are selected through the algorithm for given boosting requirements (illustrated in Fig. 7.5). The bold line shows the maximum temperature among all elements in the chip at any given point in time (left axis). The performance of the applications is measured in Giga Instructions per Second (GIPS) (right axis)

Fig. 7.11 Conceptual example of our seBoost technique for unknown required boosting levels (illustrated in Fig. 7.8), when all boosted cores of the bodytrack application cannot be safely executed at their maximum DVFS levels. Therefore, the boosted and non-boosted cores are exchanged, the DVFS levels of the non-boosted cores are set to their minimum values, and the boosting levels of the boosted cores are selected through the algorithm for given boosting requirements (illustrated in Fig. 7.5). The bold line shows the maximum temperature among all elements in the chip at any given point in time (left axis). The performance of the applications is measured in Giga Instructions per Second (GIPS) (right axis)

Fig. 7.12 Concurrent boosting example. The bodytrack application is boosted to 3.3 GHz from 0.10 s to 0.16 s, and the x264 application is boosted to 3.4 GHz from 0.14 s to 0.35 s. The red line shows the maximum temperature among all cores (left axis). The performance of the applications is measured in Giga Instructions per Second (GIPS)

Fig. 7.13 Timing simulation results for mixed application scenario M6. The red line shows the maximum temperature among all cores (left axis). The added performance of the applications is measured in application instances per second (i.e., throughput)

Fig. 7.14 Evaluation results for individual applications from the PARSEC benchmark suite. We consider multiple instances of the same application, with different numbers of threads per instance, different thread-to-core mappings, different arrival periods for the runtime performance surges, and different maximum expected boosting times. Details can be found in Table 7.1

Fig. 7.15 Evaluation results for mixed applications from the PARSEC benchmark suite (right). We consider different applications, with different numbers of threads per application instance, different thread-to-core mappings, different arrival periods for the runtime performance surges, and different maximum expected boosting times. Details can be found in the table (left), which is very similar to Table 7.1. The main difference is that in this table we detail which application type is executed in each cluster, and the word threads is omitted

Fig. 8.1 Example of the cycle utilization relations of the LTF scheme when $w_M^{\mathrm{LTF}} \le w_M^{*}$

Fig. 8.2 Example of the cycle utilization relations of the LTF scheme when there are at least two tasks in task set $S_M^{\mathrm{LTF}}$ and $w_M^{*} < w_M^{\mathrm{LTF}}$

Fig. 8.3 Brief example comparing an initial task partition obtained by applying LTF and the resulting task partition after the regrouping procedure of DLTF, both cases using SFA as their DVFS schedule. All tasks share a common arrival, deadline, and period (i.e., they are frame-based tasks). In LTF, core 2 and core 3 have to remain idle after they finish the execution of a task instance, since their idle times are shorter than the break-even time. Under DLTF, the idle energy consumption of core 2 and core 3 is reduced, while core 4 and core 5 are always kept in a low-power mode

Fig. 8.4 SFA example for a cluster with four cores. The hyper-period of all tasks is 10 s. In order to meet all deadlines, the frequency demands of the cores are 0.2, 0.4, 0.6, and 0.8 GHz. Therefore, the single frequency of SFA is set to 0.8 GHz. In order to save energy, cores individually enter a low-power mode when there is no workload on their ready queues

Fig. 8.5 SVA example for a cluster with four cores. The hyper-period of all tasks is 10 s. In order to meet all deadlines, the frequency demands of the cores are 0.2, 0.4, 0.6, and 0.8 GHz. The frequency on each core is set according to its demands, and the voltage is set according to 0.8 GHz. All cores are always busy such that they just meet their timing constraints

Fig. 8.6 Example of an accelerating schedule that satisfies the deep sleeping property. The hyper-period is 10 s, and the frequency demands of the cores are 0.2, 0.4, 0.6, and 0.8 GHz. Such a schedule can result in the optimal solution if the DVFS levels are chosen such that the total power consumption is constant and the core with the highest cycle utilization is always busy

Fig. 8.7 Example of the cycle utilizations adjustment from $w_1^{*}, \ldots, w_M^{*}$ to $w'_1, \ldots, w'_M$, when $w_M^{*}$ � $w_M^{\mathrm{DLTF}}$

Fig. 8.8 Example of $E^{*}_{\downarrow}$ as a function of $w'_M$, by using the Newton–Raphson method to solve Eq. (8.16), and of the approximated lower bound from Eq. (8.19), with $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, $\kappa = 0.5\,\mathrm{W}$, $M = 16$, $D = 1$ s, constant $\sum_{i=1}^{M} w'_i = 5 \cdot 10^9$ cycles/second, and $w'_1 = w'_2 = \cdots = w'_{M-1} = \frac{\sum_{i=1}^{M} w'_i - w'_M}{M - 1}$

Fig. 8.9 Example of function $U(\delta)$ with $\gamma = 3$, for different values of $M$, where we also highlight $U(\delta_{\max})$

Fig. 8.10 Example of $\mathrm{AF}^{\text{energy overheads}}_{\text{DLTF-SFA}}(w'_M)$ when $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$ (i.e., for 22 nm OOO Alpha 21264 cores, as detailed in Sect. 3.3), for $M = 16$, with $w'_i = 0.51 \cdot w'_M$ for all $i = 1, \ldots, M - 1$, and $w_M^{\mathrm{DLTF}} = w'_M$, considering different choices for $f_{\mathrm{dyn}}$

Fig. 8.11 Example of function $U(\delta)$ with $\gamma = 3$, for different values of $M$, where we also highlight $U(\delta_{\max})$ and $U\left(\frac{4M+1}{6M}\right)$. From the figure, we can observe that if $\delta \ge \frac{4M+1}{6M}$, then $U(\delta)$ is maximized at $U\left(\frac{4M+1}{6M}\right)$

Fig. 8.12 Example of the approximation factor for DLTF-SFA in terms of energy consumption when $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$ (i.e., for 22 nm OOO Alpha 21264 cores, as detailed in Sect. 3.3)

Fig. 8.13 Example of the approximation factor for DLTF-SFA in terms of peak power consumption, for different values of $M$, when $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$ (i.e., for 22 nm OOO Alpha 21264 cores, as detailed in Sect. 3.3)

Fig. 8.14 Example of $\mathrm{AF}^{\text{energy}}_{\text{DLTF-SVA}}$ and $\mathrm{AF}^{\text{energy overheads}}_{\text{DLTF-SFA}}(w'_M)$, when $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, $\kappa = 0.5\,\mathrm{W}$, and $M = 16$, with $w'_i = w_i^{\mathrm{DLTF}} = 0.51 \cdot w'_M$ for all $i = 1, \ldots, M - 1$, and $w'_M = w_M^{\mathrm{DLTF}}$

Fig. 8.15 Example of the approximation factor for DLTF-SVA in terms of energy consumption when $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$

Fig. 8.16 Example of the approximation factor for DLTF-SVA in terms of peak power consumption, for different values of $M$, when $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$ (i.e., for 22 nm OOO Alpha 21264 cores, as detailed in Sect. 3.3)

Fig. 8.17 Comparison of the approximation factors for DLTF-SFA and DLTF-SVA in terms of energy consumption when $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$ (i.e., for 22 nm OOO Alpha 21264 cores, as detailed in Sect. 3.3)

Fig. 8.18 Comparison of the approximation factors for DLTF-SFA and DLTF-SVA in terms of peak power consumption, for different values of $M$, when $\gamma = 3$, $\alpha = 0.27\,\mathrm{W/GHz^3}$, $\beta = 0.52\,\mathrm{W/GHz}$, and $\kappa = 0.5\,\mathrm{W}$ (i.e., for 22 nm OOO Alpha 21264 cores, as detailed in Sect. 3.3)

Fig. 8.19 Experimental results of energy minimization for DLTF-SFA and DLTF-SVA, against the energy lower bound

Fig. 8.20 Experimental results of peak power reduction for DLTF-SFA and DLTF-SVA, against the peak power lower bound

Fig. 8.21 Experimental results of energy minimization comparing maximum and average ratios for DLTF-SFA against DLTF-SVA, for an island with $M = 16$ cores

Fig. 8.22 Experimental results of peak power reduction comparing maximum and average ratios for DLTF-SFA against DLTF-SVA, for an island with $M = 16$ cores

Fig. 9.1 Example of CCH for $V = 4$ and $K = 3$. The height of the bars represents the cycle utilization of the task sets, and the colors represent their cluster assignment

Fig. 9.2 Example of BUH for $V = 4$ and $K = 3$. The height of the bars represents the cycle utilization of the task sets, and the colors represent their cluster assignment

Fig. 9.3 Approximation factor for any task partition mapping heuristic that uses ⌈M�/K⌉ clusters with non-empty task sets, and SFA to select the DVFS levels of individual clusters, when $P_{\mathrm{core}}(f) = \alpha \cdot f^3$ with $\alpha > 0$

Fig. 9.4 Examples of different sets $Y$, for a chip with $V = 3$ and $K = 3$. The task sets with highest cycle utilizations in each cluster (i.e., the task sets whose indexes are inside set $Y$) are circled and colored. Unfeasible combinations that contradict the definition of $Y$ are crossed out. For example, if $y_1 < 3$, we cannot assign 3 task sets to cluster $I_1$ and still have that $w_{y_1}$ is the highest cycle utilization inside cluster $I_1$. A similar situation occurs for $y_2$ and $y_3$. Cases in which $y_3 < M$, all unfeasible, are omitted from the figure

Fig. 9.5 Examples of possible task set assignments, for a chip with $V = 4$ and $K = 3$. (a) shows the only possible assignment combination when $Y = \{3, 6, 9, 12\}$. (b)–(d) show the three possible assignments when $Y = \{4, 6, 9, 12\}$, for which Theorem 9.1 proves that combination (b) minimizes the energy consumption

Fig. 9.6 Example of the process executed by Algorithm 9.1, for V ¼ 4and K ¼ 3, when Y ¼ 5; 7; 11; 12f g. The task sets withhighest cycle utilizations in each cluster are circled andcolored. Every value of j represents the assignment of tasksets to cluster Ij, i.e., one iteration inside the algorithm . . . . . . . 194

Fig. 9.7 Groups of adjacent remaining task sets, with n_i ∈ N_0

Fig. 9.8 Example of algorithm DYVIA for building DYVIA(1, 9), with V = 3 and K = 3. Each combination corresponds to a set L ∈ K(1, 9) to assign to cluster I_3, for which the algorithm evaluates the resulting energy, returning the minimum among all tested cases. The task sets assigned to I_3 in a combination are boxed in yellow and the resulting sub-problems are colored in gray

Fig. 9.9 Example of algorithm DYVIA for sub-problems DYVIA(1, 6), DYVIA(2, 7), and DYVIA(3, 8), with V = 3 and K = 3. Each combination corresponds to a set L inside K(1, 6), K(2, 7), or K(3, 8) to assign to I_2, for which DYVIA evaluates the resulting energy, returning the minimum among all tested cases. Task sets assigned to I_2 in a combination are boxed in yellow and the resulting sub-problems are colored in gray

Fig. 9.10 Block diagram of Intel’s SCC architecture [2], with two IA (P54C) cores per tile, one router (R) associated with each tile, four memory controllers (MC) for off-die (but onboard) DDR3 memory, and a Voltage Regulator Controller (VRC) to set the voltage of the clusters and the frequency of the cores

Fig. 9.11 Experimental power consumption profile for Intel’s SCC

Fig. 9.12 Measured experimental execution cycles of the benchmarks used for the evaluations on SCC. Every bar represents the average execution cycles for the given frequency (among 10^3 runs). The error bars represent the minimum and maximum (worst-case) measured execution cycles

Fig. 9.13 Abstract example of the integration error intrinsic to digital energy measurements indirectly computed by measuring power and time. The changes in power occur every 10 μs; however, the measurement sample period is 1 ms. In this example, the resulting measured energy is 1.66% higher than the energy actually consumed

Fig. 9.14 Experimental results on SCC (running at 533 MHz) of the average execution time of the evaluated task set mapping algorithms, for 12 hypothetical platforms with different V and K values

Fig. 9.15 Simulation results for energy efficiency for 12 hypothetical platforms with different V and K values, where the range of the energy consumption ratios of a configuration (normalized to the energy consumption of DYVIA) is shown by the vertical line and the bar represents the average energy consumption ratio (among the 100 evaluated cases). All four algorithms partition tasks with LTF by using all the available cores (i.e., M* = M for each configuration)

Fig. 10.1 Examples of four possible task set assignments, for two clusters with three cores per cluster, in which cluster 1 runs at frequency F^{type}_{Q_{h_1}, h_1} and cluster 2 runs at frequency F^{type}_{Q_{h_2}, h_2}. The task sets are ordered increasingly according to their cycle utilizations when running on reference core type y. Furthermore, it holds that w_{1,1} ≤ w_{1,2} ≤ w_{1,3} ≤ w_{1,4} ≤ F^{type}_{Q_{h_1}, h_1} ≤ w_{1,5} ≤ w_{1,6} and w_{2,1} ≤ w_{2,2} ≤ w_{2,3} ≤ w_{2,4} ≤ w_{2,5} ≤ w_{2,6} ≤ F^{type}_{Q_{h_2}, h_2}, such that task sets S_1, S_2, S_3, and S_4 can be mapped to either cluster, but task sets S_5 and S_6 can only be mapped to cluster 2. According to Lemma 10.1, if cluster 1 has a lower energy factor than cluster 2 for the given DVFS levels, then option (c) saves more energy than option (d), option (b) saves more energy than option (c), and option (a) saves more energy than option (b), such that assignment (a) is the most energy-efficient option. Similarly, if cluster 2 has a lower energy factor than cluster 1 for the given DVFS levels, then option (b) saves more energy than option (a), option (c) saves more energy than option (b), and option (d) saves more energy than option (c), such that assignment (d) is the most energy-efficient option

Fig. 10.2 Example of the process of an algorithm based on Theorem 10.1 and Observation 10.1, for a system with four types of cores and three cores per cluster. For the given DVFS levels, the clusters are ordered increasingly according to their energy factors. Every value of j represents the core type j in Theorem 10.1. The task sets assigned to cores of type j are colored in dark gray and crossed out for other core types with larger energy factors

Fig. 10.3 Abstract example of the energy factors of three hypothetical core types running at different DVFS levels. The critical frequencies of Core A and Core B are A2 and B2, respectively; therefore, running at frequencies A1 or B1 is not energy efficient

Fig. 10.4 Average energy factors of five representative applications from the PARSEC benchmark suite, based on simulations conducted in gem5 and McPAT (for both types of Alpha cores), and based on measurements on the Exynos 5 Octa (5422) processor (for the Cortex-A7 and Cortex-A15 cores)

Fig. 10.5 Detailed experimental results for one experiment with 52 random tasks when running on platform (c). The top figures show, as the height of the bars, the cycles per second required by a task to meet its timing constraints when executed on the assigned type of core (a task requires different cycles per second when assigned to different core types); the bottom figures show the associated energy consumptions. The total energy consumption when using LPF and EWFD is 3.81× and 4.64× higher, respectively, than with HIT-LTF

Fig. 10.6 Overall energy consumption results for the four different evaluated platforms. The results are presented using a normalized empirical cumulative distribution representation. Namely, by normalizing the results to the energy consumption of HIT-LTF, each figure shows the percentage of experiments (among the 104 cases) for which the resulting energy consumption ratio for LPF or EWFD is below a specific value. For example, for LPF on platform (d), 80% (20%) of the tested sets of tasks resulted in an energy consumption ratio below (above) 1.82×, compared to HIT-LTF

Fig. 10.7 Summarized overall energy consumption experimental results for the four different evaluated platforms (as detailed in Sect. 10.5.1). The results are presented in a box plot representation (whiskers represent maximum and minimum ratios), normalized with respect to the overall energy consumption of our HIT-LTF algorithm


List of Tables

Table 5.1 Details of the application mapping scenarios for our experiments. Indexes a, b, c, and d represent the cluster ID as explained in Fig. 4.4. Every line corresponds to an application instance executed in the corresponding cluster with the indicated number of threads, where “–” means that a cluster is not executing any application

Table 7.1 Details of the application mapping scenarios for the experiment with individual applications. Indexes a, b, c, and d represent the cluster ID as explained in Fig. 4.4. Every line corresponds to an application instance executed in the corresponding cluster with the indicated number of threads, where “–” means that a cluster is not executing any application. A superscript enclosed in brackets next to the number of threads indicates that the specific application instance will have runtime performance surges, where the target frequency is the number between the brackets (in GHz). The duration of the surges is detailed in the Scenario column (under Boost), while Period details how often such surges arrive

Table 9.1 Experimental measurement results of the total energy consumption on SCC. For each one of the 12 cases of different benchmark utilizations and for every evaluated algorithm, the table shows the average energy consumption among ten consecutive executions lasting 100 s each, as well as the associated expected energy consumption values

Table 9.2 Summary of the experimental energy ratios between the (expected and measured) energy consumption of each heuristic and the energy consumption of DYVIA


Table 9.3 Power consumption profile derived from the measurements presented in [3] for a customized version of SCC with richer DVFS and DPM features. Given that for a constant number of executed cycles, the minimum energy consumption is found at frequency 686.7 MHz, this frequency is the critical frequency for this power profile
