Advances in Parallel Computing
Parallel processing is ubiquitous today, with applications ranging from mobile devices such as
laptops, smart phones and in-car systems to creating Internet of Things (IoT) frameworks and
High Performance and Large Scale Parallel Systems. The increasing expansion of the application
domain of parallel computing, as well as the development and introduction of new technologies
and methodologies are covered in the Advances in Parallel Computing book series. The series
publishes research and development results on all aspects of parallel computing. Topics include
one or more of the following:
• Parallel Computing systems for High Performance Computing (HPC) and High Throughput
Computing (HTC), including Vector and Graphic (GPU) processors, clusters, heterogeneous
systems, Grids, Clouds, Service Oriented Architectures (SOA), Internet of Things (IoT), etc.
• High Performance Networking (HPN)
• Performance Measurement
• Energy Saving (Green Computing) technologies
• System Software and Middleware for parallel systems
• Parallel Software Engineering
• Parallel Software Development Methodologies, Methods and Tools
• Parallel Algorithm design
• Application Software for all application fields, including scientific and engineering
applications, data science, social and medical applications, etc.
• Neuromorphic computing
• Brain Inspired Computing (BIC)
• AI and (Deep) Learning, including Artificial Neural Networks (ANN)
• Quantum Computing
Series Editor:
Professor Dr. Gerhard R. Joubert
Volume 36
Recently published in this series
Vol. 35. F. Xhafa and A.K. Sangaiah (Eds.), Advances in Edge Computing: Massive Parallel
Processing and Applications
Vol. 34. L. Grandinetti, G.R. Joubert, K. Michielsen, S.L. Mirtaheri, M. Taufer and R. Yokota
(Eds.), Future Trends of HPC in a Disruptive Scenario
Vol. 33. L. Grandinetti, S.L. Mirtaheri, R. Shahbazian, T. Sterling and V. Voevodin (Eds.), Big
Data and HPC: Ecosystem and Convergence
Volumes 1–14 published by Elsevier Science.
ISSN 0927-5452 (print)
ISSN 1879-808X (online)
Parallel Computing: Technology Trends
Edited by
Ian Foster Argonne National Laboratory and University of Chicago, Chicago, USA
Gerhard R. Joubert Technical University Clausthal, Clausthal-Zellerfeld, Germany
Luděk Kučera Charles University, Prague, Czech Republic
Wolfgang E. Nagel Technical University Dresden, Dresden, Germany
and
Frans Peters formerly Philips Research, Eindhoven, Netherlands
This book is published online with Open Access and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
ISBN 978-1-64368-070-5 (print) ISBN 978-1-64368-071-2 (online) Library of Congress Control Number: 2020934256 doi: 10.3233/APC36
Four Decades of Cluster Computing
Gerhard JOUBERT a,1 and Anthony MAEDER b
a Clausthal University of Technology, Germany
b Flinders University, Adelaide, Australia
Abstract. During the latter half of the 1970s high performance computers (HPC) were constructed using specially designed and manufactured hardware. The preferred architectures were vector or array processors, as these allowed high-speed processing of a large class of scientific/engineering applications. Due to the high cost of developing and constructing such HPC systems, the number of available installations was limited. Researchers often had to apply for compute time on such systems and wait for weeks before being allowed access. Cheaper and more accessible HPC systems were thus in great demand. The concept of constructing high performance parallel computers with distributed Multiple Instruction Multiple Data (MIMD) architectures using standard off-the-shelf hardware promised affordable supercomputers. Considerable scepticism existed at the time about whether MIMD systems could offer significant increases in processing speed. The reasons were Amdahl's Law, coupled with the overheads resulting from slow communication between nodes and the complex scheduling and synchronisation of parallel tasks. In order to investigate the potential of MIMD systems constructed with existing off-the-shelf hardware, a first simple two-processor system was constructed that finally became operational in 1979. In this paper aspects of this system and some of the results achieved are reviewed.
puters of the day. Examples are the ICL DAP (Distributed Array Processor), ILLIAC, CRAY, etc.

The problem was that the development of such specially designed and built machines was expensive. The use of such supercomputers by researchers as well as software developers was limited due to the high cost of purchasing and running these systems. In addition, the programming of applications software often had to resort to machine-level instructions in order to utilise the particular hardware characteristics of the available machine.
The development of integrated circuits during the early 1970s, which enabled the large-scale production of processors at ever lower cost, opened up the possibility of using such components to construct MIMD parallel computers at low cost. The concept proposed in an unpublished talk in 1976 [3] was that the future of high performance computing at acceptable cost lay in using standard COTS (Components Off The Shelf) hardware to construct low-cost parallel computers. The architecture of such systems could be adapted by using standard as well as special compute nodes, different storage architectures and various interconnection networks.
The concept of developing such systems was, however, deemed unattractive during the late 1970s, mainly due to two aspects. The first was Amdahl's Law [4], according to which the achievable speedup is limited because only part of a program can be parallelised; the second was that the synchronisation and communication requirements would create an overhead that made parallel systems highly inefficient. A further aspect that hampered the acceptance of MIMD systems was Grosch's Law [5], which stated that computer performance increases as the square of the cost, i.e. if a computer costs twice as much one could expect it to be four times more powerful. This does not apply to MIMD systems, as the addition of nodes results in a linear increase in compute power. Moore's Law [6] maintained in 1965 that the number of components per integrated circuit doubled every year; this was revised in 1975 to a doubling every two years. This resulted in an estimated doubling of computer chip performance, due to design improvements, about every 18 months. It was an open question how far these developments could offset the inherent disadvantages of MIMD systems.
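For reference, Amdahl's argument can be stated compactly (the formula below is the standard textbook form, not a quotation from [4]): if a fraction p of a program's work can be parallelised across N processors, the speedup over a single processor is

\[
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}.
\]

Even with unlimited nodes the speedup is therefore bounded by 1/(1 - p); for example, p = 0.9 limits the speedup to 10. This is the basis of the scepticism described above.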
In 1977 Prof. Tsutomu Hoshino and Prof. Kawai started a project in Japan to construct a parallel computer using standard components. Their aim was to develop a parallel system architecture that could be used to solve particular problems. The system was later called the PAX computer [7]. This approach was different from that described in the following sections, where the general applicability of MIMD systems to solving compute-intensive problems was the main objective.
2. A Simple MIMD Parallel Computer
In 1976/77 a project was started at the University of Natal, South Africa, to investigate the possibility of achieving higher compute performance by connecting standard available mini-computers [8]. The final development stage was reached in 1979, when the system was upgraded so that both nodes had identical hardware. The parallel system was later named the CSUN (Computer System of the University of Natal) [8].

The project involved three aspects, viz. hardware and architecture, network and software.
2.1 Hardware and Architecture
The available hardware consisted of two standard HP1000 mini-computers. The processors were identical, but the memory sizes initially differed. The architecture decided on was a master-slave configuration with distributed memories. No commonly accessible memory was available. The HP1000 offered a microprogramming capability, which allowed for special functions to be executed at high speed.

Fig. 1: The cluster system, admired by Chris Handley (later: University of Otago, New Zealand)
2.2 Network
The connection of the two nodes had to offer high communication speeds. This was realised by using a high-speed connection available for HP1000 mini-computers for logging high volumes of data collected by scientific instruments. The cable was adapted by HP to supply a computer interface at both ends, allowing the interconnection of the two nodes via interface cards installed in each machine. These interfaces were user configurable by means of adjustable switch settings for timing or logistic characteristics, allowing a computer-to-computer mode. The maximum transmission speed was one million 16-bit words per second.
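For scale (this arithmetic is ours, not stated in the original): 10^6 words/s x 16 bits/word = 16 Mbit/s, i.e. roughly 2 MB/s of peak node-to-node bandwidth.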
2.3 Software
The Real Time Operating System (RTOS), HP-RTE, available for the HP1000 offered the basic platform for running and managing the nodes. The system had to be enhanced by additional software modules to achieve control of the overall parallel computer system. A monitor was developed to create an interface for users to input and run programs. Programs and data were provided on punched cards or tape.
A critical component was the communication between the two nodes. For this, drivers were developed that also allowed for the synchronisation of tasks. With the master-slave organisation of the system the slave always had to be under the control of the master. In an interrupt-driven environment this is easily accomplished. The communication link available between the two nodes, however, did not allow specific interrupt signals to be transmitted between the two machines. Thus data-controlled transmission, i.e. sending all messages with header information, was used. Both sender and receiver had to wait for an acknowledgement from the counterpart before message transmission could begin. This caused an additional overhead for the synchronisation of tasks.
The master node was responsible for all controlling activities. It prepared tasks for execution by the slave and downloaded these, together with the data needed, to the slave, which then started executing the tasks. The master in the meantime prepared its own tasks and executed these in parallel, exchanging intermediate results with the slave. The master also executed any serial tasks as required. The later upgrade of the system to two equally equipped nodes simplified task scheduling.

Such a setup is of course very sensitive to the volume and frequency of data transmission. This must thus be considered by programmers when selecting an algorithm for solving a particular problem.
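The following is a minimal sketch, in Python with present-day multiprocessing primitives rather than the original HP1000 drivers and monitor, of the header-plus-acknowledgement message scheme and the master-slave task split described above. The header fields and the toy task (a partial sum) are illustrative assumptions, not the original protocol.

# Minimal sketch (not the original 1979 software): every message carries a
# small header and the receiver acknowledges it before the payload is handled,
# loosely mirroring the data-controlled transmission described above.
from multiprocessing import Process, Pipe

def slave(conn):
    """Wait for tasks from the master, acknowledge each header, return results."""
    while True:
        header = conn.recv()              # e.g. {"kind": "task", "size": ...}
        conn.send({"kind": "ack"})        # acknowledge before acting on the message
        if header["kind"] == "stop":
            break
        payload = conn.recv()             # task data follows the header
        result = sum(payload)             # stand-in for the real computation
        conn.send({"kind": "result", "value": result})

def master():
    master_end, slave_end = Pipe()
    p = Process(target=slave, args=(slave_end,))
    p.start()

    data = list(range(1000))
    # Master keeps one half of the work and ships the other half to the slave.
    master_half, slave_half = data[:500], data[500:]

    master_end.send({"kind": "task", "size": len(slave_half)})  # header first
    assert master_end.recv()["kind"] == "ack"                   # wait for acknowledgement
    master_end.send(slave_half)                                 # then the payload

    local = sum(master_half)              # master computes its own share in parallel
    remote = master_end.recv()["value"]   # collect the slave's partial result

    master_end.send({"kind": "stop"})     # header-only shutdown message
    assert master_end.recv()["kind"] == "ack"
    p.join()
    print(local + remote)                 # combined result: sum(range(1000)) = 499500

if __name__ == "__main__":
    master()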
No programming tools for developing parallel software were available at the time. The standard programming language for scientific applications was FORTRAN. A precompiler was developed that processed instructions from programmers to automatically create parallel tasks, which were inserted into the FORTRAN program code. The compiler subsequently created tasks that could be executed in parallel, and this information was used to schedule their parallel execution.
3. Applications
The aim of the project was to show that at least some algorithms could be executed in less time by a cluster constructed with standard components. The two-node cluster was a starting point that could easily be expanded by adding more, not necessarily identical, nodes.

The physical limitations of the available nodes as well as the architecture of the cluster limited the classes of problems that could be efficiently executed. Thus a comparatively low volume of interprocessor data transfers, as well as few synchronisation points relative to the amount of computational work, was an advantage.
Problems implemented on the cluster were, for example:
• Partial Differential Equations: one-dimensional heat equation solved by explicit and implicit difference methods [9] (see the sketch after this list)
• Solution of tridiagonal linear systems [10]
• Numerical integration [11].
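To illustrate why such problems suit the cluster (this reconstruction is ours, not taken from [9]): in the explicit difference method for the one-dimensional heat equation u_t = a u_xx, each grid point is updated only from its two neighbours, so splitting the grid between the two nodes requires exchanging just a single boundary value per node per time step, regardless of the grid size:

\[
u_i^{n+1} = u_i^n + r\,\bigl(u_{i+1}^n - 2u_i^n + u_{i-1}^n\bigr), \qquad r = \frac{a\,\Delta t}{\Delta x^2} \le \tfrac{1}{2},
\]

where r ≤ 1/2 is the usual stability restriction for the explicit scheme. The implicit variant leads to a tridiagonal linear system per time step, which connects it to the second item above.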
4. Gain Factor
Several methods for assessing parallel computer performance are available, such as speedup, cost, etc. These metrics proved insufficient, especially in view of Amdahl's Law [4], for a comparison of the overall time used to solve a problem on a sequential processor and on the MIMD system described above.

The measurement needed was a comparison of the overall sequential compute time, Ts, and the overall parallel compute time, Tp. A further aspect was that the optimal sequential and parallel algorithms may differ substantially. Thus, in the comparisons, the optimal algorithm for each processing mode (sequential or parallel) was used.
A large number of aspects influence the value of Tp, such as the organisation and speed of the processors (these need not be identical, thus potentially resulting in a heterogeneous system), interprocessor communication speed, communications software design, construction of algorithms, etc. In practice, time measurements can be made to obtain values for Ts and Tp for particular algorithms. This gives a Gain Factor:
G = (Ts - Tp)/Ts
If 0 < G ≤ 1, parallel processing offers an advantage over sequential processing. The upper limit, G = 1, is obtained when Tp, the overall time used to solve a problem with the parallel machine, is zero. When G ≤ 0, parallel computation offers no advantage. Note that G applies equally well to the performance measurement of heterogeneous systems; it includes communication and administration overheads and covers the limitations expressed in Amdahl's Law.
Results obtained for a number of test cases using the two-node cluster are [12]:
• Solution of tridiagonal linear systems, 120 x 120: G = 0.42
• One-dimensional diffusion equation, 30,000 time steps: G = 0.481
• Numerical integration, 30,000 steps: G = 0.497.
With a two-node cluster the value of G ≤ 0.5.
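The two-node bound follows directly from the definition of G (this short derivation is ours, added for clarity). Writing the conventional speedup as S = Ts/Tp,

\[
G = \frac{T_s - T_p}{T_s} = 1 - \frac{1}{S}.
\]

With N nodes the ideal case is Tp = Ts/N, i.e. S = N, so G ≤ 1 - 1/N; for N = 2 this gives the stated bound G ≤ 0.5. Conversely, the measured gain factors correspond to speedups S = 1/(1 - G) of roughly 1.72, 1.93 and 1.99 for G = 0.42, 0.481 and 0.497, so the numerical integration case came within about one percent of the ideal two-node speedup.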
These results showed that, at least in some cases, parallel processing using an
MIMD system with distributed memories may offer significant advantages.
5. Conclusions
The results obtained with the simple two-node MIMD parallel system showed that clusters constructed with standard components can be used to speed up the execution of parallel algorithms for solving certain classes of problems. The results prompted further research on the effects of more nodes, different connection networks and suitable algorithms.

This work resulted in the start of the international Parallel Computing (ParCo) conference series, with the first conference held in 1983 in West Berlin. The aim of these events was to stimulate research and development of all types of parallel systems, as it was clear from the outset that no single architecture is suitable for solving all problems.
It took more than a decade for the idea of using standard components to construct HPC systems to be adopted by industry on a comprehensive scale. It was also only gradually realised that the flexibility of cluster systems allowed for the processing of a wide range of compute-intensive and/or large-scale problems. The resulting advent of cheaper parallel systems built with commodity hardware led to many specially designed HPC systems becoming less competitive due to their high price tags and limited application spectrum. The resulting major crisis in the supercomputing industry during the late 1980s and early 1990s led to the demise of many companies supplying specially designed hardware aimed at particular problem classes.
Exascale computing is presently the next step in HPC, and this will require extreme parallelism, employing many thousands or millions of nodes, to achieve its goals. With the end of Moore's Law approaching, new technologies may emerge to enable the future development of HPC beyond exascale.
References
[1] Anderson, D.W., Sparacio, F.J., Tomasulo, R.M.: The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling (1967). See: http://home.eng.iastate.edu/~zzhang/courses/cpre585-f04/reading/ibm67-anderson-360.pdf
[2] Schneck, Paul B.: The IBM 360-91. In: Supercomputer Architecture, The Kluwer International Series in Engineering and Computer Science (Parallel Processing and Fifth Generation Computing), Springer, Boston, MA, Vol. 31, 53-98 (1987)
[3] Joubert, G.: Invited Talk, Helmut Schmidt University, Hamburg, January 1976
[4] Amdahl, Gene M.: Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities, AFIPS Spring Joint Computer Conference Proceedings, Vol. 30, 483-485 (1967)
[8] Proposed by U. Schendel, Free University of Berlin (1979)
[9] Joubert, G.R., Maeder, A.J.: An MIMD Parallel Computer System, Computer Physics Communications, Amsterdam: North Holland Publishing Company, Vol. 26, 253-257 (1982)
[10] Joubert, Gerhard, Maeder, Anthony: Solution of Differential Equations with a Simple Parallel Computer, International Series on Numerical Mathematics (ISNM), Birkhäuser: Basel, Vol. 68, 137-144 (1982)
[11] Joubert, G.R., Cloete, E.: The Solution of Tridiagonal Linear Systems with an MIMD Parallel Computer, ZAMM Zeitschrift für Angewandte Mathematik und Mechanik, Vol. 65, 4, 383-385 (1985)
[12] Joubert, G.R., Maeder, A.J., Cloete, E.: Performance Measurements of Parallel Numerical Algorithms on a Simple MIMD Computer, Proceedings of the Seventh South African Symposium on Numerical Mathematics, Computer Science Department, University of Natal, Durban, ISBN 0 86980 264 X, 25-36 (1981)