CELL B.E

The first-generation Cell Broadband Engine (BE) processor is a multi-core chip comprised of a 64-bit Power Architecture processor core and eight synergistic processor cores, capable of massive floating point processing, optimized for compute-intensive workloads and broadband rich media applications.

Mar 28, 2015

IMPLEMENTING MATRIX MULTIPLICATION ON CELL B.E

Dense matrix multiplication is one of the most common and most important numerical operations.

The Cell B.E excels at compute-intensive workloads such as single-precision matrix multiplication, thanks to its powerful SIMD capabilities.

CONCEPT

Computational micro-kernels are architecture-specific codes. Combining a systematic analysis of the problem with exploitation of the low-level features of the Cell B.E's synergistic processing unit leads to dense matrix multiplication kernels that achieve peak performance.
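
The idea of a micro-kernel can be sketched in plain Python. This is an illustration only, not Cell code: the function name `blocked_matmul` and the tile size `block` are assumptions made for this example, and a real SPU kernel would use SIMD intrinsics on tiles held in the local store. The three innermost loops play the role of the micro-kernel, which updates one tile of the result from one tile of each input.

```python
def blocked_matmul(a, b, block=2):
    """Multiply dense matrices a (n x k) and b (k x m), tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # Micro-kernel: C tile += A tile * B tile.
                for i in range(i0, min(i0 + block, n)):
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ip * b[p][j]
    return c


product = blocked_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
```

Choosing the tile size so that the working set fits in fast memory is the part that is architecture specific.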

IMPLEMENTING MATRIX FACTORIZATIONS ON THE CELL B.E

Highly optimized Cell B.E implementations of two classic dense linear algebra computations are introduced: Cholesky factorization and QR factorization.
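
As a reminder of what the first of these computes, here is a minimal, unoptimized Cholesky factorization in plain Python. It is the textbook column-by-column algorithm, not the tiled Cell implementation the slides refer to, and the function name `cholesky` is an assumption made for this sketch.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove the contribution of the columns already computed.
        diag = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if diag <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[j][j] = math.sqrt(diag)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            partial = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (a[i][j] - partial) / low[j][j]
    return low


factor = cholesky([[4, 12, -16], [12, 37, -43], [-16, -43, 98]])
```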

Work has been done to show that a single silicon chip can provide great performance for compute-intensive scientific workloads by combining short-vector single instruction, multiple data (SIMD) processing with a multicore architecture.

The SPEs allow the implementation of complex synchronization mechanisms and task-level parallelism.

DENSE LINEAR ALGEBRA FOR HYBRID GPU-BASED SYSTEMS

Hybrid GPU-based multicore platforms, which combine homogeneous multicores with GPUs, provide an effective answer to two challenges: the appetite for power and the gap between compute and communication speeds. This is the direction GPUs have taken, and hybrid combinations of GPUs with homogeneous multicores are valued because they can freeze the clock frequency, escalate the number of cores, and provide data parallelism and high bandwidth.

Dense linear algebra algorithms for GPUs are developed as hybrid algorithms: in general, small, non-parallelizable tasks are executed on the CPU and data-parallel tasks are executed on the GPU. The approach uses CUDA to develop the low-level kernels, together with high-level libraries such as LAPACK and BLAS.

BLAS FOR GPUs

An approach to developing a high-performance BLAS for GPUs is presented; such a BLAS is essential to enable GPU-based hybrid approaches in dense linear algebra.

Two important issues in kernel design are discussed: blocking and coalesced memory access.

Three optimization techniques for BLAS implementations are discussed: pointer redirecting, padding, and auto-tuning.
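
Of the three techniques, padding is the easiest to illustrate. One common reading, assumed here, is that a matrix whose dimensions are not multiples of the kernel's block size is embedded in a slightly larger zero-filled matrix, so that a blocked kernel never has to handle partial tiles. The function name `pad_matrix` is hypothetical, and the sketch is plain Python rather than GPU code.

```python
def pad_matrix(a, block):
    """Copy matrix a into a zero-filled matrix whose dimensions are multiples of block."""
    rows, cols = len(a), len(a[0])
    padded_rows = -(-rows // block) * block  # round up to a multiple of block
    padded_cols = -(-cols // block) * block
    padded = [[0.0] * padded_cols for _ in range(padded_rows)]
    for i in range(rows):
        padded[i][:cols] = a[i]
    return padded


padded = pad_matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]], block=2)
```

The extra zeros do not change the product of two padded matrices inside the original index range, which is why the result can simply be cropped afterwards.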

SPARSE MATRIX VECTOR MULTIPLICATION ON MULTICORE AND ACCELERATORS

Sparse matrix-vector multiplication (SpMV) is an interesting computation because it appears in scientific and engineering applications, financial and economic modeling, and information retrieval.

Given the diversity of architectural designs and input matrix characteristics, high performance is achieved only through a complex combination of architecture-specific and matrix-specific techniques.
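
For reference, a baseline SpMV kernel can be sketched as follows. The compressed sparse row (CSR) layout used here is an assumption of this example (the slides do not name a storage format), and the function name `spmv_csr` is hypothetical.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A * x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # The nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# A = [[10, 0, 0], [0, 20, 30], [40, 0, 50]]
y = spmv_csr([10, 20, 30, 40, 50], [0, 1, 2, 0, 2], [0, 1, 3, 5], [1, 2, 3])
```

Each stored nonzero is read once and used in a single multiply-add, so there is little arithmetic to hide the cost of memory traffic.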

PERFORMANCE

Performance is compared on different platforms across a suite of matrices. The optimized implementations clearly deliver better performance, and bandwidth is observed to be the determining performance factor.

HARDWARE-ORIENTED MULTIGRID FINITE ELEMENT SOLVERS ON GPU ACCELERATED CLUSTERS

The accurate simulation of real-world phenomena in computational science is based on mathematical models consisting of sets of partial differential equations, and finite element methods are considered to be among the most promising approaches for their numerical treatment.

Graphics processing units are considered to work well in such cases. Achieving peak performance requires selecting proper data structures and parallelization techniques, especially when combining coarse-grained parallelism at the cluster level with medium- and fine-grained parallelism between CPU cores and within accelerator devices such as GPUs.

MIXED PRECISION GPU-MULTIGRID SOLVERS WITH STRONG SMOOTHERS

Fine-grained parallelization techniques are applied to robust, numerically strong multigrid solvers for sparse, ill-conditioned linear systems of equations, such as those that arise from grid-based discretization techniques like finite differences, finite volumes, and finite elements.

The parallelization techniques are implemented on graphics processors as representatives of throughput-oriented, wide-SIMD many-core architectures, since GPUs offer a tremendous amount of fine-grained parallelism. NVIDIA CUDA is used, which brings in the concepts of memory coalescing, warps, shared memory, and thread blocks.

DESIGNING FAST FOURIER TRANSFORM FOR THE IBM CELL BROADBAND ENGINE

The design of an efficient parallel implementation of the Fast Fourier Transform (FFT) on the Cell/B.E is presented. The FFT is a fundamental kernel in computationally intensive scientific applications such as computer tomography, data filtering, fluid dynamics, spectral analysis of speech, sonar, radar, seismic and vibration detection, digital filtering, signal decomposition, and PDE solvers.

An iterative approach is used to solve the 1D FFT. It divides the work among the SPEs to parallelize the FFT computation efficiently, and it requires synchronization among the SPEs after each stage of the computation. The computation on each SPE is fully vectorized and uses further optimization techniques such as loop unrolling and double buffering.
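
The stage structure is easiest to see in a textbook iterative radix-2 FFT, sketched below in plain Python. This is not the Cell implementation: it is sequential and in-place, and the function name `fft_iterative` is an assumption made for the example. Each pass of the `while` loop is one stage. The butterflies inside a stage are independent of one another, so they are the work that could be divided among cores, and the end of each pass is where a synchronization point would fall.

```python
import cmath


def fft_iterative(x):
    """Iterative radix-2 FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(v) for v in x]
    # Reorder the input into bit-reversed index order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # log2(n) stages; every butterfly within a stage is independent.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
                w *= w_step
        size *= 2
    return a


spectrum = fft_iterative([0, 1, 0, -1])
```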

IMPLEMENTING FFTs ON MULTICORE ARCHITECTURES

The FFT can exploit the typical parallel resources of multicore platforms to achieve near-optimal performance, but designers have to adopt a systematic approach that takes into account the attributes of both the application and the target system.

A successful implementation relies on a deep understanding of data access patterns, computation properties, and available hardware resources. With that understanding, generalized performance planning techniques can be used to produce successful implementations across a wide variety of multicore architectures.

COMBINATORIAL ALGORITHM DESIGN ON THE CELL/B.E PROCESSOR

Combinatorial algorithms play an important role in scientific computing: in the efficient parallelization of linear algebra, computational physics, and numerical optimization computations, in massive data analysis routines, in systems biology, and in the study of natural phenomena involving networks and complex interactions.

A complexity model that simplifies the design of algorithms on the Cell/B.E multicore architecture is presented, along with a systematic procedure for evaluating performance. To estimate the execution time of an algorithm, the model considers its computational complexity, its memory access patterns, and the complexity of its branching instructions.

AUTO TUNING STENCIL COMPUTATIONS ON MULTICORE AND ACCELERATORS

Auto-tuning is applied to the 7-point and 27-point stencils on the widest range of multicore architectures. These chip multiprocessors lie at the extremes of a spectrum of design tradeoffs that ranges from replicating existing core technology to employing large numbers of simple cores and novel memory hierarchies.
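
For readers unfamiliar with the term, a 7-point stencil updates each interior point of a 3D grid from the point itself and its six face neighbours. The sketch below is a naive, untuned version in plain Python. The function name `stencil7` and the coefficients `alpha` and `beta` are assumptions made for illustration.

```python
def stencil7(grid, alpha, beta):
    """Apply one 7-point stencil sweep to the interior of a 3D grid (nested lists)."""
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[row[:] for row in plane] for plane in grid]  # boundary values are copied
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbours = (
                    grid[z - 1][y][x] + grid[z + 1][y][x]
                    + grid[z][y - 1][x] + grid[z][y + 1][x]
                    + grid[z][y][x - 1] + grid[z][y][x + 1]
                )
                out[z][y][x] = alpha * grid[z][y][x] + beta * neighbours
    return out


ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
swept = stencil7(ones, alpha=-6.0, beta=1.0)
```

An auto-tuner would generate many variants of this loop nest (different blockings, unrollings, and traversal orders) and keep the fastest one measured on each machine.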

The important aspects are discovering parallelism, selecting among the various forms of hardware parallelism, and enabling memory hierarchy optimizations. These are made more challenging by the separate address spaces, software-managed local stores, and NUMA features that appear in multicore systems.

MANYCORE STENCIL COMPUTATIONS IN HYPERTHERMIA APPLICATIONS

Multicore, many-core, and heterogeneous microarchitectures are very important in the hardware landscape. Specialized processing units such as commodity graphics processing units have proved to be compute accelerators capable of solving specific scientific problems orders of magnitude faster than conventional CPUs.

Hyperthermia is a relatively new treatment modality that is used as a complementary therapy to radiotherapy or chemotherapy. Here we study the optimization of a computational kernel from a biomedical application, hyperthermia cancer treatment, on NVIDIA graphics processing units.

ENABLING BIOINFORMATICS ALGORITHMS ON THE CELL/B.E PROCESSOR

The implementation and results of two bioinformatics applications are presented, namely FASTA (for its Smith-Waterman kernel) and ClustalW. The results show that the Cell/B.E is an attractive avenue for bioinformatics applications. The Cell/B.E is also considered to be a power-efficient platform, given that its total power consumption is lower than that of a superscalar processor.
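
The Smith-Waterman kernel computes the best local alignment score between two sequences by dynamic programming. A minimal sketch follows, assuming a simple linear gap penalty. The scoring parameters `match`, `mismatch`, and `gap` are illustrative defaults chosen for this example and are not taken from FASTA.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of s and t (linear gap penalty)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


score = smith_waterman_score("ACGT", "TTACGTTT")
```

Only two rows of the table are kept at a time, which matters on a processor with a small local store.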

An implementation of ClustalW on the Cell/B.E that uses software caches inside the SPEs for data movement is also described. Using software caches improves programmer productivity without a major decrease in performance.

PAIRWISE COMPUTATIONS ON THE CELL PROCESSOR WITH APPLICATIONS IN COMPUTATIONAL BIOLOGY

Efficient and scalable strategies for orchestrating all-pairs computations on the Cell architecture are described, based on decomposing both the computations and the input entries. The general case is to schedule the computations on the Cell processor. The strategies are then extended to cases where the number of input entries is large and where individual entries are too large to fit within the memory limits of the SPEs.
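
One simple way to decompose an all-pairs computation is to cut the index space into square tiles, so that each tile needs only two small blocks of the input at a time. The sketch below shows that idea in plain Python. The function names `all_pairs_tiles` and `all_pairs` and the parameter `tile` are hypothetical, and this is not necessarily the scheduling strategy the slides describe.

```python
def all_pairs_tiles(n, tile):
    """Yield (rows, cols) index ranges that together cover every pair i < j once."""
    for r0 in range(0, n, tile):
        for c0 in range(r0, n, tile):
            yield range(r0, min(r0 + tile, n)), range(c0, min(c0 + tile, n))


def all_pairs(items, tile, func):
    """Apply func to every unordered pair of items, one tile at a time."""
    results = {}
    for rows, cols in all_pairs_tiles(len(items), tile):
        # Each tile touches only two blocks of the input and could be
        # handed to a separate worker.
        for i in rows:
            for j in cols:
                if i < j:
                    results[(i, j)] = func(items[i], items[j])
    return results


products = all_pairs([1, 2, 3, 4, 5], tile=2, func=lambda p, q: p * q)
```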

The performance results show that the Cell processor is a good platform for accelerating various kinds of applications that deal with pairwise computations. The all-pairs strategies can be applied to many applications, from a wide range of areas, that require such computations.

DRUG DESIGN ON CELL BE

The main applications of drug design are outlined, and two practical case studies are discussed: FTDock, a docking application, and Moldy, a molecular dynamics application. The advantages of using the Cell B.E in drug design are noted.

For FTDock, a 3x speedup is achieved compared to a parallel version running on a POWER5 multicore system with two 1.5 GHz POWER5 chips and 16 GB of RAM.

Moldy on the Cell BE consumes less power and takes the same time as an MPI parallelization on four Itanium Montecito processors of an SGI Altix 4700.

GPU ALGORITHMS FOR MOLECULAR MODELING

GPUs are parallel computing devices capable of accelerating a wide variety of data-parallel algorithms. Their tremendous computing capabilities help accelerate molecular modeling applications, enabling molecular dynamics simulations and their analyses to run much faster than before and allowing the use of scientific techniques that are impractical on conventional hardware platforms.

The most computationally expensive algorithms used in molecular modeling are presented, together with an explanation of how they may be reformulated as arithmetic-intensive, data-parallel algorithms capable of achieving high performance on GPUs. In the coming years, we expect GPU hardware architecture to continue to evolve rapidly and become increasingly sophisticated.

DATAFLOW FRAMEWORKS FOR EMERGING HETEROGENEOUS ARCHITECTURES AND THEIR APPLICATION TO BIOMEDICINE

Biomedical applications are an important focus for high performance computing (HPC) researchers. The use of accelerators, with their low cost and high performance, is one possible way to provide the performance these applications need.

It is clear that the dataflow programming model and its associated runtime systems can, at multiple application and hardware granularities, ease the implementation of challenging biomedical applications on these types of computational resources. The GPU is designed to deliver maximum performance through its SIMD architecture.

ACCELERATOR SUPPORT IN THE CHARM++ PARALLEL PROGRAMMING MODEL

Extensions to the Charm++ parallel programming model and runtime system that support accelerators, and heterogeneous clusters that include accelerators, are presented. These extensions to the Charm++ programming model include a SIMD instruction abstraction, accelerated entry methods, and accelerated blocks.

Support for CUDA-based GPUs is also presented. All of these extensions continue to be developed and improved as support for heterogeneous clusters in Charm++ increases.

EFFICIENT PARALLEL SCAN ALGORITHMS FOR MANYCORE GPUs

Modern many-core GPUs are massively parallel processors, and the CUDA programming model provides a straightforward way of writing scalable parallel programs to execute on them. Data-parallel techniques provide a convenient way of expressing such parallelism.

The design of efficient scan and segmented scan routines, which are essential primitives in a broad range of data-parallel algorithms, is presented. By tailoring existing algorithms to the natural granularities of the machine and by minimizing synchronization, some of the fastest scan and segmented scan algorithms for the GPU are obtained.
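
To make the two primitives concrete, the sketch below gives a step-synchronous inclusive scan in the classic Hillis-Steele style, plus a sequential reference for the segmented variant. Both are plain Python illustrations with hypothetical names. They show what the primitives compute, not the tuned GPU algorithms themselves.

```python
def inclusive_scan(values):
    """Inclusive prefix sum computed in log2(n) data-parallel steps."""
    data = list(values)
    offset = 1
    while offset < len(data):
        # Every element of the new list depends only on the old list, so
        # each could be computed by its own thread; the end of the step
        # corresponds to a barrier.
        data = [
            data[i] + data[i - offset] if i >= offset else data[i]
            for i in range(len(data))
        ]
        offset *= 2
    return data


def segmented_inclusive_scan(values, flags):
    """Sequential reference: the running sum restarts where flags[i] is set."""
    out = []
    running = 0
    for value, flag in zip(values, flags):
        running = value if flag else running + value
        out.append(running)
    return out


prefix = inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])
segments = segmented_inclusive_scan([1, 2, 3, 4, 5], [1, 0, 1, 0, 0])
```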

HIGH PERFORMANCE TOPOLOGY AWARE COMMUNICATION IN MULTICORE PROCESSORS

The performance of interprocess communication mechanisms on modern multicore CPUs is evaluated. Although streaming instructions are expected to deliver good performance, it is observed that the current implementation generates a high number of resource stalls and hence performs poorly.

It is also found that intra-node communication performance is highly dependent on the memory and cache architecture. The way in which improvements in processor and interconnect technology have affected the balance of computation to communication performance is also presented.