
JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.1, FEBRUARY, 2018 ISSN(Print) 1598-1657 https://doi.org/10.5573/JSTS.2018.18.1.115 ISSN(Online) 2233-4866

Manuscript received Nov. 30, 2017; accepted Dec. 18, 2017 1 Department of Computer and Information Science, The University of Mississippi, University, MS, 38677 USA 2 Department of Electrical & Computer Engineering, Missouri University of Science & Technology, USA 3 Information and Communication Research Center, Daegu University, Gyeongsan, South Korea E-mail : [email protected], [email protected]

Dynamic Temperature Aware Scheduling for CPU-GPU 3D Multicore Processor with Regression Predictor

Hossein Pourmeidani1, Ajay Sharma1, Kyoshin Choo1, Mainul Hassan1, Minsu Choi2,

KyungKi Kim3, and Byunghyun Jang1

Abstract—The 3D stacked integration of CPU, GPU and DRAM dies is a rising horizon in chip fabrication, where dies are vertically interconnected by TSVs (Through-Silicon Vias) to achieve high bandwidth, low latency and low power consumption. However, the thinned substrate, high power density and low thermal conductivity of the inter-layer dielectric material make thermal management a crucial problem. Moreover, the vertically stacked dies are subject to tight thermal correlations. High temperatures, which tend to show high spatial/temporal locality, can negatively impact the IC’s reliability and lifetime. To mitigate such problems on CPU-GPU 3D heterogeneous processors, a novel dynamic temperature-aware task scheduling approach for compute workloads using the OpenCL framework is proposed in this work. The proposed scheduler predicts the future temperature of each core with a regression model based on its current temperature, its neighbors’ temperatures and the execution profile of each workgroup. The scheduler then selects a core to which workgroups from the task queue are assigned based on the predicted temperatures, keeping the 3D chip below a certain threshold temperature. Our experimental results demonstrate that the proposed scheduling technique is a viable solution to address the hotspots

and heat dissipation issues of 3D stacked heterogeneous processors with reasonable performance tradeoffs.

Index Terms—Dynamic thermal management, 3D IC, task scheduling, heterogeneous computing, GPGPU

I. INTRODUCTION

3D integration technology has gained considerable attention recently. The result of this new technology is a notable reduction of interconnect wires among dies in a System on Chip (SoC). Wires are a primary source of latency, area, and power in modern microprocessors. Prior studies have shown that wires can consume more than 30% of total power in traditional 2D chip multiprocessors [1]. In comparison, 3D technology decreases the wire length by a factor of the square root of the number of layers by vertically stacking two or more dies with high-density, high-speed interfaces [2]. This remarkable reduction results in better performance and less power dissipation on the interconnect.

Despite such significant advantages, 3D integration technology faces a new problem that did not exist before. As dies are stacked, the power density grows because of the smaller distance between active devices, which causes the chip temperature to increase significantly. Also, the lower dies are placed farther from the heat sink and have longer heat dissipation paths. Therefore, hotspots formed in the chip become a crucial concern for the reliability of the processor. As an example, previous studies show that the peak


temperature of a two-layer 3D structure can be up to 20 °C higher than that of a 2D structure for an Alpha-like processor [3, 4]. Other studies on multiple non-memory stacking 3D floorplans also show similar thermal behaviors [1, 5, 6]. Prior studies have shown that vertically neighboring dies tend to exhibit significantly higher thermal correlations [7, 8]. As an example, a core in one layer can become hot due to another high-temperature core at the same vertical location in a different layer.

To address this issue, we propose a dynamic temperature aware task scheduling technique for compute workloads on 3D stacked CPU-GPU heterogeneous processors. The proposed technique efficiently suppresses hotspots through temperature prediction modeling and smart task scheduling while minimizing performance degradation.

II. RELATED WORK AND BACKGROUND

Conventional CPU-GPU heterogeneous systems, where the two processors are connected through a PCI-E bus, suffer from considerable data copy overhead between host and device memories. Industry found a solution in single-chip heterogeneous processors, where the CPU and GPU are fabricated on a single die and share a physically unified memory. Even this recent 2D IC suffers from poor parallelism and scalability due to the limited bandwidth, high latency, and energy consumption of off-chip DRAM. To solve these problems, the processor architecture is evolving toward a 3D IC of CPU, GPU and DRAM dies vertically interconnected by TSVs (Through-Silicon Vias). However, a new problem arises: three vertically stacked active dies produce a considerable amount of heat in a three-dimensional fashion, and they suffer from poor heat dissipation, high thermal density, and hotspots because the active layers are separated from each other by dielectric layers. Therefore, thermal management is arguably the biggest challenge in such fabrication technology, as heat can cause faults and reliability issues. Due to the importance and difficulty of thermal-reliability management in 3D ICs, a number of design-time mechanical cooling solutions have been proposed, such as liquid cooling [15], thermal vias [16], heat sinks and fans. Although these approaches will remain the front-line mechanisms for dealing with the thermal wall, they are costly, unwieldy, and do not provide a complete solution to the transient nature of the problem.

The idea of task scheduling for temperature management on homogeneous multicore processors has been well studied in the past. Yin et al. [9] proposed an algorithm that rapidly lowers core temperature by deactivating a core that will reach a critical temperature in the next clock tick and by migrating tasks based on the affinity of each core. The high integration density of 3D ICs makes thermal modeling and management more complicated, as the thermal management techniques developed for 2D ICs cannot be directly applied to 3D ICs. Therefore, new techniques tailored to 3D chip thermal management and modeling are emerging. Several approaches have been proposed to target the thermal modeling of 3D chips. Recently, Zhao et al. [10] proposed a migration approach to decrease temperature in a 3D architecture with stacked DRAM. The main idea of their work is migrating threads between cores based on their temperatures. They propose a thread migration algorithm in which the hottest and coldest threads switch places when the temperature variance is large enough. In [11], a task scheduling technique is proposed to manage both chip temperature and memory access delay. This approach attempts to increase performance by preventing a task from migrating far from its data; however, it places no emphasis on decreasing the temperature of the chip. Coskun et al. [12] combined dynamic thread migration with DVFS for thermal management, and achieved results similar to DVFS alone in thermal optimization but with less performance degradation. Zhou et al. [13] show that there is a strong thermal correlation between vertically adjacent layers in a 3D chip. They treat vertically adjacent cores as super cores and propose an OS-level scheduling algorithm in which the hottest super task, a set of tasks, is allocated to the coolest super core. In a 3D chip multiprocessor, the heat dissipation ability varies from core to core. Liu et al. [14] proposed an algorithm to map and schedule jobs according to the thermal conductivity of the various cores: hotter jobs are allocated to cores closer to the heat sink, and cooler jobs to cores farther from it. While the approaches described above target homogeneous multicore processors, our work aims to address the thermal management problem for


emerging heterogeneous processors where CPU and GPU are vertically stacked.

III. TEMPERATURE AWARE DYNAMIC TASK SCHEDULING FOR 3D CPU-GPU HETEROGENEOUS PROCESSORS

We propose a Dynamic Thermal Management (DTM) technique to solve the hotspot problem of the vertically stacked 3DHP (3D Heterogeneous Processor). As we use OpenCL workloads, our goal is to assign workgroups to cores so that the chip remains under a certain temperature threshold with minimal performance degradation. To that end, we consider the current temperature of the running core, the temperatures of neighboring cores, and the heat to be generated by the workgroup being assigned. We use DTM as a runtime solution to reduce thermal hotspots and temperature gradients with the smallest possible performance impact. Fig. 1 shows the overall system diagram of the DTM system proposed in this paper. The management engine continuously monitors the thermal map of the chip, obtained from temperature sensors in each layer, checks the activities of the CPU/GPU/DRAM, and takes workload profiles of heterogeneous tasks from the dynamic global task scheduler. Based on the gathered information, the engine runs an algorithm to decide which software and hardware techniques to apply, and how, for the best control of thermal hotspots. A range of hardware techniques has been developed, such as fine-grained DVFS and Power Gating (PG). Almost all processors on the market are equipped with some form of these hardware techniques, and they have been proven effective. In a 3DHP, heterogeneous workloads written in heterogeneous programming languages such as OpenCL provide unique additional opportunities to dynamically manage tasks and hardware devices at multiple levels.

In order to model a realistic 3D stacked chip, the three-layer floorplan shown in Fig. 2 was considered. The bottom layer is a CPU with 4 cores, the middle layer is a GPU with 32 CUs, and the top layer is a DRAM. These three active silicon layers generate heat, and their vertically stacked structure makes heat dissipation more difficult than in a 2D structure.

As in existing works, the distance between cores is considered an important factor in modeling the thermal correlation among cores. In this work, three major factors are analyzed to predict the temperature changes of each core: its current temperature, its neighbors’ temperatures and the execution time of the workgroup to be assigned. We classify a temperature greater than a threshold, ThresholdCritical, as critical, less than a threshold, ThresholdHot, as normal, and between these two thresholds as hot. In addition, two cores are considered neighbors when their shortest distance is 1, 2 or 3 based on the distance graph.

Fig. 3 shows an example distance graph for a GPU with 32 cores and a CPU with 4 cores. The neighbor distances based on the GPU distance graph are shown in Table 1. For example, Core 8 has distance 1 to Cores 7 and 16, distance 2 to Cores 6, 15 and 24, and distance 3 to Cores 5, 14, 23 and 32. Table 2 shows the neighbor distances for a CPU with 4 cores based on the CPU distance graph. Also, the distance between each GPU core and its direct

Fig. 1. The system diagram of the proposed Dynamic Thermal Management (DTM) for 3DHP.

Fig. 2. A floorplan of CPU-GPU 3D heterogeneous processor.


underneath CPU core is assumed to be 2.

We then compute the neighbors’ temperature weight from the following formula. We empirically found and used different β values, as neighbors at different distances have different impacts on the target core:

$$\mathrm{NTW} = \frac{\beta_1 \cdot \mathrm{ATND}(1) + \beta_2 \cdot \mathrm{ATND}(2) + \beta_3 \cdot \mathrm{ATND}(3)}{\beta_1 + \beta_2 + \beta_3}$$

where NTW is the NeighborsTemperatureWeight, ATND(i) is the average temperature of the neighbors at distance i, β_i is the weight for the neighbors at distance i, and β_1 > β_2 > β_3.

Along with NTW, another factor that we consider is
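As a concrete illustration, the weight computation can be sketched in Python (function and variable names are ours, not from the paper; the β values 4, 2, 1 are the ones used later in Section IV):

```python
def neighbors_temperature_weight(core, temps, neighbors_by_distance,
                                 betas=(4.0, 2.0, 1.0)):
    """NTW: distance-weighted average of neighbor temperatures.

    temps: {core: current temperature in deg C}
    neighbors_by_distance: {core: {1: [...], 2: [...], 3: [...]}}
    betas: (b1, b2, b3) with b1 > b2 > b3, so nearer neighbors weigh more.
    Assumes each distance class is non-empty (true for the GPU grid);
    an empty class simply contributes 0 here.
    """
    numerator = 0.0
    for dist, beta in zip((1, 2, 3), betas):
        nbrs = neighbors_by_distance[core][dist]
        if nbrs:
            atnd = sum(temps[n] for n in nbrs) / len(nbrs)  # ATND(dist)
            numerator += beta * atnd
    return numerator / sum(betas)
```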

the execution time of each workgroup. This factor directly affects the future temperature because, once a workgroup is assigned to a CU, it is impossible to migrate it to another core under the modern GPU’s thread execution model. Therefore, a core is likely to reach a higher temperature when a workgroup runs on it for a longer time.

We suppose that we have M workgroups extracted from N kernels, defined as WG_{11}, ..., WG_{MN}, and P available cores (both CPU and GPU), defined as C_1, ..., C_P. Note that modern GPUs can run different kernels simultaneously. Each workgroup from each kernel has an execution time on each of the available cores: Time_{i,j,k} denotes the execution time of workgroup WG_{ij} on core C_k. Also, the allocation variable A_{i,j,k} = 1 if workgroup WG_{ij} is assigned to core C_k, and 0 otherwise. A workgroup can be assigned to only one core. Therefore, for each WG_{ij}:

$$\sum_{k=1}^{P} A_{i,j,k} = 1$$

where i = 1, ..., M, j = 1, ..., N and k = 1, ..., P. The execution time of workgroup WG_{ij} is given by:

Fig. 3. A distance graph (a) GPU, (b) CPU.

Table 1. GPU shortest neighbor distance

Core | Distance 1     | Distance 2                | Distance 3
1    | 2, 9           | 3, 10, 17                 | 4, 11, 18, 25
2    | 1, 3, 10       | 4, 9, 11, 18              | 5, 12, 17, 19, 26
3    | 2, 4, 11       | 1, 5, 10, 12, 19          | 6, 9, 13, 18, 20, 27
4    | 3, 5, 12       | 2, 6, 11, 13, 20          | 1, 7, 10, 14, 19, 21, 28
5    | 4, 6, 13       | 3, 7, 12, 14, 21          | 2, 8, 11, 15, 20, 22, 29
6    | 5, 7, 14       | 4, 8, 13, 15, 22          | 3, 12, 16, 21, 23, 30
7    | 6, 8, 15       | 5, 14, 16, 23             | 4, 13, 22, 24, 31
8    | 7, 16          | 6, 15, 24                 | 5, 14, 23, 32
9    | 1, 10, 17      | 2, 11, 18, 25             | 3, 12, 19, 26
10   | 2, 9, 11, 18   | 1, 3, 12, 17, 19, 26      | 4, 13, 20, 25, 27
11   | 3, 10, 12, 19  | 2, 4, 9, 13, 18, 20, 27   | 1, 5, 14, 17, 21, 26, 28
12   | 4, 11, 13, 20  | 3, 5, 10, 14, 19, 21, 28  | 2, 6, 9, 15, 18, 22, 27, 29
13   | 5, 12, 14, 21  | 4, 6, 11, 15, 20, 22, 29  | 3, 7, 10, 16, 19, 23, 28, 30
14   | 6, 13, 15, 22  | 5, 7, 12, 16, 21, 23, 30  | 4, 8, 11, 20, 24, 29, 31
15   | 7, 14, 16, 23  | 6, 8, 13, 22, 24, 31      | 5, 12, 21, 30, 32
16   | 8, 15, 24      | 7, 14, 23, 32             | 6, 13, 22, 31
17   | 9, 18, 25      | 1, 10, 19, 26             | 2, 11, 20, 27
18   | 10, 17, 19, 26 | 2, 9, 11, 20, 25, 27      | 1, 3, 12, 21, 28
19   | 11, 18, 20, 27 | 3, 10, 12, 17, 21, 26, 28 | 2, 4, 9, 13, 22, 25, 29
20   | 12, 19, 21, 28 | 4, 11, 13, 18, 22, 27, 29 | 3, 5, 10, 14, 17, 23, 26, 30
21   | 13, 20, 22, 29 | 5, 12, 14, 19, 23, 28, 30 | 4, 6, 11, 15, 18, 24, 27, 31
22   | 14, 21, 23, 30 | 6, 13, 15, 20, 24, 29, 31 | 5, 7, 12, 16, 19, 28, 32
23   | 15, 22, 24, 31 | 7, 14, 16, 21, 30, 32     | 6, 8, 13, 20, 29
24   | 16, 23, 32     | 8, 15, 22, 31             | 7, 14, 21, 30
25   | 17, 26         | 9, 18, 27                 | 1, 10, 19, 28
26   | 18, 25, 27     | 10, 17, 19, 28            | 2, 9, 11, 20, 29
27   | 19, 26, 28     | 11, 18, 20, 25, 29        | 3, 10, 12, 17, 21, 30
28   | 20, 27, 29     | 12, 19, 21, 26, 30        | 4, 11, 13, 18, 22, 25, 31
29   | 21, 28, 30     | 13, 20, 22, 27, 31        | 5, 12, 14, 19, 23, 26, 32
30   | 22, 29, 31     | 14, 21, 23, 28, 32        | 6, 13, 15, 20, 24, 27
31   | 23, 30, 32     | 15, 22, 24, 29            | 7, 14, 16, 21, 28
32   | 24, 31         | 16, 23, 30                | 8, 15, 22, 29

Table 2. CPU shortest neighbor distance

Core | Distance 1 | Distance 2 | Distance 3
1    | 2          | 3          | 4
2    | 1, 3       | 4          | -
3    | 2, 4       | 1          | -
4    | 3          | 2          | 1
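The neighbor tables above are consistent with shortest-path (Manhattan) distance on the Fig. 3 grids, assuming the 32 GPU cores form a 4×8 row-major grid and the 4 CPU cores a 1×4 line. A sketch that regenerates them (names are illustrative):

```python
def neighbor_table(rows, cols, max_dist=3):
    """Map each 1-indexed core to its neighbors grouped by shortest distance.

    Cores are numbered row-major on a rows x cols grid, so the shortest
    graph distance between two cores is the Manhattan distance between
    their grid positions.
    """
    def pos(core):
        return divmod(core - 1, cols)  # (row, col) of a 1-indexed core

    n = rows * cols
    table = {}
    for c in range(1, n + 1):
        r0, c0 = pos(c)
        table[c] = {d: [] for d in range(1, max_dist + 1)}
        for other in range(1, n + 1):
            if other == c:
                continue
            r1, c1 = pos(other)
            d = abs(r0 - r1) + abs(c0 - c1)
            if d <= max_dist:
                table[c][d].append(other)
    return table
```

For example, `neighbor_table(4, 8)[8]` reproduces the Core 8 row of Table 1, and `neighbor_table(1, 4)` reproduces Table 2.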


$$\mathrm{ExecutionTime}(i,j) = \sum_{k=1}^{P} A_{i,j,k} \cdot \mathrm{Time}_{i,j,k}$$
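A minimal sketch of these two relations, representing A as a per-workgroup 0/1 vector over cores (names and data layout are our assumptions, not the paper's):

```python
def check_single_assignment(A, P):
    """Verify the constraint sum_k A[i][j][k] == 1 for every workgroup (i, j).

    A: dict mapping (i, j) -> list of P zeros and ones.
    """
    for (i, j), row in A.items():
        assert sum(row[k] for k in range(P)) == 1, (i, j)

def execution_time(A, times, i, j, P):
    """ExecutionTime(i, j) = sum_k A[i][j][k] * Time_{i,j,k}.

    times: dict mapping (i, j, k) -> execution time of WG_ij on core C_k.
    """
    return sum(A[(i, j)][k] * times[(i, j, k)] for k in range(P))
```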

Our main objective is to predict the future temperature

of each core based on the three factors above. We use a statistical regression method that finds a relationship between one dependent variable (e.g., Y) and one or more independent variables (e.g., X1, X2, …). The regression method takes multiple variables and creates a mathematical relationship between them in order to predict the dependent variable. The regression can be simple or multiple: a simple linear regression works with one independent variable, while a multiple linear regression uses more than one independent variable to predict the outcome. The general form of multiple linear regression is:

$$Y_i = \alpha_0 + \alpha_1 X_{i1} + \alpha_2 X_{i2} + \cdots + \alpha_n X_{in} + \epsilon_i$$

In the formula above, we have one dependent variable (Y) and n independent variables (X_1, X_2, …, X_n) over several observations i. The coefficients α_0, …, α_n need to be estimated, and ε_i is an error term. For example, Fig. 4 shows how multiple linear regression finds a linear relationship between two independent variables X_1 and X_2 and the dependent variable Y. To predict the future temperature of each core, we treat the future temperature as the dependent variable and the current temperature, NTW and execution time as independent variables.
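As an illustrative sketch (not the paper's implementation), the coefficients can be estimated by ordinary least squares; a plain-Python fit via the normal equations is enough for three independent variables:

```python
def fit_multiple_linear(X, y):
    """Ordinary least squares for Y = a0 + a1*X1 + ... + an*Xn.

    X: list of observations, each a list of independent variables.
    y: list of observed dependent values.
    Returns [a0, a1, ..., an], solving (Z^T Z) a = Z^T y where Z is X
    with a leading intercept column of ones.
    """
    Z = [[1.0] + list(row) for row in X]
    m = len(Z[0])
    # Build the normal equations A a = b.
    A = [[sum(Z[r][i] * Z[r][j] for r in range(len(Z))) for j in range(m)]
         for i in range(m)]
    b = [sum(Z[r][i] * y[r] for r in range(len(Z))) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    a = [0.0] * m
    for r in range(m - 1, -1, -1):
        a[r] = (b[r] - sum(A[r][c] * a[c] for c in range(r + 1, m))) / A[r][r]
    return a

def predict(a, x):
    """Predicted Y for one observation x = [X1, ..., Xn]."""
    return a[0] + sum(ai * xi for ai, xi in zip(a[1:], x))
```

Here the three columns of X would be current temperature, NTW, and execution time; the fitted `predict` plays the role of the future-temperature estimator.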

Finally, the scheduler assigns a workgroup to the core whose predicted temperature is the lowest among those predicted normal. If there is no core with a predicted normal temperature, our scheduler decreases the frequency

in order to reduce the temperature of cores. The proposed algorithm is pseudo-coded as follows:

Algorithm: The proposed task scheduler algorithm for 3DHP
1:  while (there is a core whose current temperature is critical) do
2:    for k = 1 to P do
3:      Compute NTW for C_k
4:    endfor
5:    for k = 1 to P do
6:      Predict future temperature for C_k
7:    endfor
8:    if (there is a core whose predicted temperature is normal) then
9:      Assign next workgroup to the core whose predicted temperature is the lowest and normal
10:   else
11:     Decrement the frequency of cores
12:   endif
13: endwhile
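The pseudo-code above can be condensed into a Python sketch (the predictor, dispatch and DVFS hooks are stubbed as callables; the thresholds follow Section IV; this is our illustration, not the authors' implementation):

```python
THRESHOLD_CRITICAL = 80.0  # deg C (Section IV)
THRESHOLD_HOT = 60.0       # deg C; predicted temps below this are "normal"

def schedule(current_temps, predict_temp, assign, lower_frequency):
    """One scheduling decision, mirroring lines 1-13 of the algorithm.

    current_temps: {core: current temperature}
    predict_temp(core): regression-based future-temperature estimate
    assign(core): dispatch the next workgroup to that core
    lower_frequency(): DVFS fallback when no core is predicted normal
    Returns the chosen core, or None if idle or the frequency was lowered.
    """
    # Line 1: act only while some core is at a critical temperature.
    if not any(t >= THRESHOLD_CRITICAL for t in current_temps.values()):
        return None
    # Lines 2-7: predict the future temperature of every core.
    predicted = {c: predict_temp(c) for c in current_temps}
    # Lines 8-9: pick the coolest core among those predicted normal.
    normal = {c: t for c, t in predicted.items() if t < THRESHOLD_HOT}
    if normal:
        target = min(normal, key=normal.get)
        assign(target)
        return target
    # Line 11: no normal core, so lower the frequency instead.
    lower_frequency()
    return None
```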

IV. EXPERIMENTAL SETUP AND RESULTS

Extensive experiments are carried out on the sample floorplan shown in Fig. 2. ThresholdCritical and ThresholdHot are set to 80 °C and 60 °C, respectively, in all experiments. Also, the values β_1, β_2 and β_3 are set to 4, 2 and 1, respectively. First, we compute the power consumed by each core in the 3DHP. The power consumption of each core is calculated with the McPAT [17] power simulator using cycle-level statistics collected from Multi2Sim [18] architectural simulation. Once power consumption is computed, we use the HotSpot [19] thermal simulator to compute the temperature of each component of the 3DHP. The temperatures of all cores are computed every 925 clock cycles, which matches the GPU clock frequency tested. Several metrics are chosen to evaluate the proposed scheduling algorithm: peak temperature, temperature changes, final temperature and performance. Five well-known benchmark workloads from the AMD OpenCL SDK are considered: MatrixMultiplication, BinarySearch, Reduction, FFT and BitonicSort.

The final temperature is an important metric because it is the initial temperature for the next kernel to be executed. Our experiments show that the final temperatures of all cores under our proposed scheduler are normal for all benchmarks tested. For example, Figs. 5 and 6 show the final thermal maps after the completion of the MatrixMultiplication benchmark. Fig. 5 shows the final

Fig. 4. Multiple linear regression for temperature prediction.


temperature of the GPU cores. We can clearly see that there are 16 critical, 8 hot and 8 normal cores under the default Round-Robin scheduler, while all cores are normal when our proposed scheduler is used. In Fig. 6, all CPU cores end up critical under the Round-Robin scheduler, while all cores are normal under our proposed scheduler. As mentioned earlier, the thermal correlation between CPU and GPU cores is evident. Table 3 shows the

average final temperature of all CPU and GPU cores for the Round-Robin and the proposed scheduler. The proposed scheduler reduces the final temperature by more than 50%.


Fig. 5. Final thermal map for GPU cores (a) Our proposed scheduler, (b) Default round-robin scheduler.

Fig. 6. Final thermal map for CPU cores (a) Our proposed scheduler, (b) Default round-robin scheduler.

Table 3. Average final temperatures (°C) of all cores

Benchmark            | Round-Robin | Proposed Scheduler | Degradation
MatrixMultiplication | 91.46       | 43.79              | 52.1%
Reduction            | 91.78       | 40.82              | 55.5%
BinarySearch         | 89.59       | 43.26              | 51.7%
FFT                  | 92.33       | 40.82              | 55.7%
BitonicSort          | 108.66      | 40.81              | 62.4%

Fig. 7. Temperature changes (a) MatrixMultiplication, (b) BinarySearch, (c) Reduction, (d) FFT, (e) BitonicSort.


The changes in temperature show the ability of a scheduler to keep the temperature below a certain threshold. Fig. 7 shows the temperature changes every 20 intervals for all benchmarks tested. The experimental results show how our proposed scheduler outperforms the default Round-Robin scheduler in maintaining temperature. Also, the average temperatures of the cores are always under the critical temperature when our proposed scheduler is used. Downward slopes indicate that the temperature is being controlled by the scheduler.

The peak temperature represents how well a scheduler can eliminate the worst thermal conditions. Table 4 shows the peak temperature degradation for the benchmarks based on Fig. 7. It demonstrates that the peak temperature decreased by 19% on average across all benchmarks tested.

Performance degradation is measured using the overall execution time. Table 5 shows the performance overhead of our proposed scheduler in comparison with the default Round-Robin scheduler. The results show that the overhead is less than 20% for BinarySearch and slightly over 30% for the other benchmarks. The performance reduction is caused by not utilizing cores at critical temperatures: critical cores are left idle to cool, and only normal cores are used. The overhead of BinarySearch is lower than the others because its temperature changes are smaller, as shown in Fig. 7(b).

We also tested the back-to-back run of multiple kernels. Based on the final and peak temperatures in the default Round-Robin scheduler, we classify the

benchmarks into three categories: hot, warm and cool, as described in Table 6. Benchmarks from different categories are combined into different benchmark mixes to demonstrate the importance of the final temperature and to evaluate the efficiency of the scheduler. Table 7 shows the mixes of benchmarks used in our experiments. The results are shown in Tables 8 and 9. From the tables, it can be observed that the proposed scheduler reduces the peak temperature more for warmer combinations than for cooler ones, at the cost of higher overhead. For example, the temperature degradation for the combination WWW is 11.1% more than for the combination WCW, while the overhead is 7% more.

V. CONCLUSIONS

In this paper, a novel temperature-aware workgroup assignment algorithm for vertically stacked 3D

Table 4. Peak temperature degradation

Benchmark            | Peak Temperature Degradation
MatrixMultiplication | 19.5%
BinarySearch         | 16.3%
Reduction            | 17.8%
FFT                  | 16.5%
BitonicSort          | 24.9%

Table 5. Performance overhead

Benchmark            | Performance Overhead
MatrixMultiplication | 36.5%
BinarySearch         | 18.9%
Reduction            | 37.3%
FFT                  | 36.9%
BitonicSort          | 30.5%

Table 6. Classification of benchmarks

Benchmark            | Thermal Group
BinarySearch         | Cool
MatrixMultiplication | Warm
Reduction            | Warm
FFT                  | Warm
BitonicSort          | Hot

Table 7. Mix of benchmarks

Benchmark Mix        | Classification
Bitonic + MM + FFT   | HWW
Red + FFT + MM       | WWW
Bin + FFT + Bitonic  | CWH
FFT + Bin + MM       | WCW

Table 8. Peak temperature degradation of benchmark mix

Benchmark Mix        | Peak Temperature Degradation
Bitonic + MM + FFT   | 16.7%
Red + FFT + MM       | 18.9%
Bin + FFT + Bitonic  | 13.7%
FFT + Bin + MM       | 7.8%

Table 9. Performance overhead of benchmark mix

Benchmark Mix        | Performance Overhead
Bitonic + MM + FFT   | 34.5%
Red + FFT + MM       | 35.5%
Bin + FFT + Bitonic  | 32.4%
FFT + Bin + MM       | 28.5%


heterogeneous processors has been proposed and validated. Unlike previous approaches to thermal management for homogeneous multicore processors, we target emerging heterogeneous workloads that run on both CPU and GPU. Using well-verified simulators widely used in the field, the efficiency of the proposed temperature-aware scheduler has been demonstrated in terms of improving the thermal conditions of 3D CPU-GPU heterogeneous processors. To reduce hotspots, peak temperature and final temperature, the proposed scheduler predicts the future temperature of each core and assigns the next workgroups to the most desirable cores. The experimental results show that the proposed scheduler reduces the final temperature by more than 50% and the peak temperature by 19% on average, at an average performance overhead of about 32%.

ACKNOWLEDGMENTS

This work was supported by the National Science Foundation (NSF) under grant CCF-1337138.

REFERENCES

[1] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso et al., “Die stacking (3D) microarchitecture,” in 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06). IEEE, 2006, pp. 469–479.

[2] J. W. Joyner, P. Zarkesh-Ha, and J. D. Meindl, “A stochastic global net-length distribution for a three-dimensional system-on-a-chip (3D-SoC),” in ASIC/SOC Conference, 2001. Proceedings. 14th Annual IEEE International. IEEE, 2001, pp. 147–151.

[3] W.-L. Hung, G. M. Link, Y. Xie, N. Vijaykrishnan, and M. J. Irwin, “Interconnect and thermal-aware floorplanning for 3D microprocessors,” in 7th International Symposium on Quality Electronic Design (ISQED’06). IEEE, 2006, 6 pp.

[4] K. Puttaswamy and G. H. Loh, “Thermal analysis of a 3D die-stacked high-performance micro- processor,” in Proceedings of the 16th ACM Great Lakes symposium on VLSI. ACM, 2006, pp. 19–24.

[5] M. Awasthi and R. Balasubramonian, “Exploring the design space for 3D clustered architectures,” in Proceedings of the 3rd IBM Watson Conference on Interaction between Architecture, Circuits, and Compilers, 2006.

[6] K. Puttaswamy and G. H. Loh, “Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3d-integrated processors,” in 2007 IEEE 13th International Symposium on High Performance Computer Architecture. IEEE, 2007, pp. 193–204.

[7] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, “3-D ICs: A novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration,” Proceedings of the IEEE, vol. 89, no. 5, pp. 602–633, 2001.

[8] Y. Xie, G. H. Loh, B. Black, and K. Bernstein, “Design space exploration for 3D architectures,” ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 2, no. 2, pp. 65–103, 2006.

[9] X. Yin, Y. Zhu, L. Xia, J. Ye, T. Huang, Y. Fu, and M. Qiu, “Efficient implementation of thermal-aware scheduler on a quad-core processor,” in 2011 IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, 2011, pp. 1076–1082.

[10] D. Zhao, H. Homayoun, and A. V. Veidenbaum, “Temperature aware thread migration in 3D architecture with stacked DRAM,” in Quality Electronic Design (ISQED), 2013 14th International Symposium on. IEEE, 2013, pp. 80–87.

[11] H. Wang, Y. Fu, T. Liu, and J. Wang, “Thermal management via task scheduling for 3D NoC based multi-processor,” in SoC Design Conference (ISOCC), 2010 International. IEEE, 2010, pp. 440–444.

[12] A. K. Coskun, J. L. Ayala, D. Atienza, T. S. Rosing, and Y. Leblebici, “Dynamic thermal management in 3D multicore architectures,” in 2009 Design, Automation & Test in Europe Conference & Exhibition. IEEE, 2009, pp. 1410–1415.

[13] X. Zhou, J. Yang, Y. Xu, Y. Zhang, and J. Zhao, “Thermal-aware task scheduling for 3D multicore processors,” IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 1, pp. 60–71, 2010.

[14] S. Liu, J. Zhang, Q. Wu, and Q. Qiu, “Thermal-aware job allocation and scheduling for three dimensional chip multiprocessor,” in Quality Electronic Design (ISQED), 2010 11th International Symposium on. IEEE, 2010, pp. 390–398.

[15] A. Sridhar, A. Vincenzi, M. Ruggiero, T. Brunschwiler, and D. Atienza, “3D-ICe: Fast compact transient thermal modeling for 3D ICs with inter-tier liquid cooling,” in Proceedings of the International Conference on Computer-Aided Design. IEEE Press, 2010, pp. 463–470.

[16] B. Goplen and S. Sapatnekar, “Thermal via placement in 3D ICs,” in Proceedings of the 2005 international symposium on Physical design. ACM, 2005, pp. 167–174.

[17] D. M. Tullsen, “Simulation and modeling of a simultaneous multithreading processor,” in 22nd International Conference for the Resource Management & Performance Evaluation of Enterprise Computing Systems (CMG), Part 2 (of 2), 1996, pp. 819–828.

[18] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2Sim: a simulation framework for CPU-GPU computing,” in Parallel Architectures and Compilation Techniques (PACT), 2012 21st International Conference on. IEEE, 2012, pp. 335–344.

[19] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan, “Temperature-aware microarchitecture: Modeling and implementation,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 1, no. 1, pp. 94–125, 2004.

Hossein Pourmeidani received his B.S. and M.S. degrees in Computer Engineering from Islamic Azad University in 2010 and 2012, respectively. He is currently pursuing his Ph.D. degree at the University of Mississippi. His interests include computer architecture and GPU computing.

Ajay Sharma received his M.S. degree in Computer Science from the University of Mississippi in 2016. He is currently working for FedEx, U.S.A. His research includes CPU-GPU heterogeneous computing and high performance computing.

Kyoshin Choo received his B.S. degree from Handong Global University in South Korea, M.S. degree from the University of Michigan, Ann Arbor, and Ph.D. degree in Computer Science from the University of Mississippi in 2016. He is currently working for AMD, U.S.A.

Mainul Hassan received his B.S. degree in Computer Engineering from Bangladesh University of Engineering and Technology and M.S. degree in Computer Science from the University of Mississippi in 2015. He is currently working for IMS Health, U.S.A.

Minsu Choi (M’02–SM’08) received his B.S., M.S. and Ph.D. degrees in Computer Science from Oklahoma State University in 1995, 1998 and 2002, respectively. He is currently an associate professor of Electrical and Computer Engineering at Missouri University of Science & Technology (Missouri S&T). His research mainly focuses on Computer Architecture & VLSI, Crypto-hardware design, Nanoelectronics, Embedded Systems, Fault Tolerance, Testing, Quality Assurance, Reliability Modeling and Analysis, Configurable Computing, Parallel & Distributed Systems and Dependable Instrumentation & Measurement. He has won two outstanding teaching awards at Missouri S&T, in 2008 and 2009. He is a senior member of IEEE and a member of Golden Key National Honor Society and Sigma Xi.


Kyung Ki Kim received his B.S. and M.S. degrees in Electronic Engineering from Yeungnam University, South Korea, in 1995 and 1997, respectively, and his Ph.D. degree in Computer Engineering from Northeastern University, Boston, MA, in 2008. He was a member of technical staff with Sun Microsystems, Santa Clara, CA in 2008 and a senior researcher with Illinois Institute of Technology, Chicago, IL in 2009. Currently, he is an Associate Professor at Daegu University, South Korea. His current research focuses on nanoscale CMOS design, high-speed low-power VLSI design, analog VLSI circuit design, electronic CAD and nano-electronics.

Byunghyun Jang received his B.S. in Bio-Mechatronic Engineering from Sungkyunkwan University, South Korea, M.S. degree in Computer Science from Oklahoma State University, Stillwater, OK, and Ph.D. in Computer Engineering from Northeastern University, Boston, MA. He is currently an Assistant Professor of Computer and Information Science at the University of Mississippi, University, MS, where he directs the Heterogeneous Systems Research (HEROES) Laboratory. Prior to joining academia in 2012, he spent several years at AMD and Samsung. His research focuses on CPU-GPU heterogeneous computing, hardware architecture and compilers for data parallel architectures.