A View from the Facility Operations Side on the Water/Air Cooling System of the K Computer

Jorji Nonaka, Keiji Yamamoto, Akiyoshi Kuroda, Toshiyuki Tsukamoto (RIKEN R-CCS)
Kazuki Koiso, Naohisa Sakamoto (Kobe University)

Abstract
The Operations and Computer Technologies Division at the RIKEN R-CCS is responsible for operating the entire HPC facility, which includes the supercomputer itself and its auxiliary subsystems, such as the power supply and the water/air cooling subsystems. Part of these subsystems will be reused for the next supercomputer, Fugaku, so a better understanding of their operational behavior and of their potential impact, especially on hardware failure and power consumption, would be highly beneficial. In this poster, we present some preliminary impressions of the impact of the water/air cooling system on the K computer, focusing on the potential benefits of the low water and air temperatures used for the CPU (15 °C) and DRAM (17 °C), respectively, which are produced by the chilled water cooling system. We expect that the obtained knowledge will be helpful for decision support and/or operation planning for the next supercomputer, Fugaku.

Contact: Jorji Nonaka <[email protected]>
HPC Usability Development Unit (HUD Unit)
Operations and Computer Technologies Division
RIKEN Center for Computational Science

Acknowledgements
Part of the results was obtained by using the K computer at the RIKEN R-CCS. We are grateful to the colleagues at the RIKEN R-CCS who directly or indirectly collaborated in this work. We especially thank Fumiyoshi Shoji (Director of the Operations and Computer Technologies Division), Atsuya Uno (Unit Leader of the System Operations and Development Unit), and Shun Ito (currently at Fujitsu) for their helpful collaboration during the experiments, and the local Fujitsu staff for their supportive assistance.
CPU cooling water
Chilled water at 10 °C is used to control the CPU cooling water temperature (set to 15 °C). The graph shows the input and output water temperatures and the water flow inside a heat exchanger over one day.

Idle mode
The graph shows the impact of the cooling water temperature on the power consumption of an entire compute rack (T45) during an idle period of the K computer. We observed that the energy consumption increased by around 1.75% at 20 °C and by around 3.5% at 25 °C.

Benchmark applications
We used five benchmark applications with well-known behavior to evaluate the power consumption of an entire compute rack (T45). We observed a power consumption increase of less than 4% when raising the CPU cooling water temperature by 10 °C (to 25 °C).

Conclusions
We observed in practice some of the theoretical benefits (in energy consumption and hardware failure) of using a low cooling water temperature (15±1 °C) when running the K computer. We also observed that, even with the CPU cooling water temperature raised by 10 °C, the hardware may still operate within specification, with limited impact on the energy consumption and the hardware failure rate. We expect that the obtained knowledge will be helpful for decision support and operation planning for the next supercomputer, Fugaku.

Temperature variation inside a compute rack
Variation of the CPU temperature and the cooling air temperature inside a compute rack (T45) during the execution of several benchmark applications: SLEEP (do nothing); PEK99 (CPU intensive); MEM72 (memory intensive); SUB09 (balanced CPU/memory use); and ADVMV (kernel from a production-grade application).

Cooling diagram labels: CPU / ICC, cooling water (around 15 °C); SB / DRAM, cooling air (around 17 °C). Panel titles: Energy consumption; Hardware failure.

CPU and DRAM failures
Spatiotemporal distribution of the compute racks in which CPUs and DRAM were replaced due to hardware failure (from Feb. 2012 to May 2019).
The accumulated number of failures per rack did not exceed three (CPU) and five (DRAM), and the racks with the higher DRAM failure counts were concentrated in the neighborhood of rack T45.

Compute rack diagram labels: Chilled Water (around 10 °C); Compute Rack; System Board; InterConnect Controller (ICC); SPARC64 VIIIfx CPU; DDR3 Memory (DRAM); Water-cooling module.

Evaluations
We used a single compute rack (T45), equipped with a power monitoring and logging device, together with low-priority “Micro” class jobs, to verify the temperature variation behavior and the energy consumption.
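
As a rough illustration of how the percentages in the idle-mode and benchmark panels could be derived from a rack power log, the following minimal Python sketch integrates logged power into energy and compares two runs. The log format (pairs of time in seconds and power in watts), the function names, and the sample values are assumptions made for this example; they do not describe the actual logging device or its data. The last step assumes that the reported idle-mode increases are relative to the 15 °C setting, which the poster does not state explicitly.

```python
from typing import Sequence, Tuple

# Hypothetical log format: (time in seconds, rack power in watts).
Sample = Tuple[float, float]


def energy_joules(samples: Sequence[Sample]) -> float:
    """Integrate power over time with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def percent_increase(baseline: float, value: float) -> float:
    """Relative increase of `value` over `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


def percent_per_degree(base_temp_c: float, temp_c: float,
                       increase_percent: float) -> float:
    """Average increase in percent per degree Celsius above the baseline."""
    return increase_percent / (temp_c - base_temp_c)


if __name__ == "__main__":
    # Placeholder samples for illustration only, NOT K computer measurements.
    log_low = [(0.0, 100.0), (10.0, 100.0), (20.0, 100.0)]
    log_high = [(0.0, 102.0), (10.0, 102.0), (20.0, 102.0)]
    e_low = energy_joules(log_low)
    e_high = energy_joules(log_high)
    print(f"placeholder energy increase: {percent_increase(e_low, e_high):.2f}%")

    # Idle-mode figures reported on the poster; the 15 °C baseline is assumed.
    for temp_c, inc in [(20.0, 1.75), (25.0, 3.5)]:
        rate = percent_per_degree(15.0, temp_c, inc)
        print(f"{temp_c:.0f} °C: {rate:.2f} % per °C")
```

Under that baseline assumption, both reported idle-mode points correspond to about 0.35% of additional energy per degree Celsius, which is consistent with a roughly linear trend over the tested range. Two points are too few to confirm linearity.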