HSDSL, Technion Spring 2014 Preliminary Design Review Matrix Multiplication on FPGA Project No. : 1998 Project B 044169 By: Zaid Abassi Supervisor: Rolf.

HSDSL, Technion Spring 2014

Preliminary Design Review

Matrix Multiplication on FPGA

Project No.: 1998

Project B 044169

By: Zaid Abassi

Supervisor: Rolf Hilgendorf

April 2, 2014

Background and Motivation:

1. Matrix multiplication carried out naively is expensive (O(n³) operations for n×n matrices), so there is a need for research into an efficient, parallel algorithm for matrix multiplication.

2. In application-specific designs (in this case, matrix multiplication), as opposed to broader architectural designs, the order and magnitude of operations are known at design time, which offers the potential to save overhead that would otherwise be incurred.

3. Matrix multiplication is an elementary building block of more advanced Linear Algebra Core operations on matrices such as inverting matrices and linear transformations, so the need for efficient matrix multiplication is ever greater.

4. Over the years, the complexity of matrix multiplication in software has improved with specialized data structures; we aim to research approaches inspired by this work for an FPGA implementation.

Our Goal

To develop a matrix multiplication algorithm specifically for FPGA that maximizes efficiency through a parallel design while reducing power consumption as much as possible.

The System Top Level View

Processing Entity (PE)

PE unit

PE unit

• The controller for each PE is an FSM that regulates the PE's operations: storage, computation and communication (broadcasting).

• The controller must autonomously keep PE operations synchronized, using handshakes and global communication that rely on implicit synchronization between all PEs.
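The controller described above can be sketched as a small state machine. This is only an illustrative model in Python, not the project's HDL: the state names (`IDLE`, `STORE`, `BROADCAST`, `COMPUTE`), the transition order and the `start`, `data_ready` and `done` signals are all assumptions introduced here.

```python
from enum import Enum, auto


class PEState(Enum):
    """Hypothetical controller states, one per PE activity named on the slide."""
    IDLE = auto()
    STORE = auto()       # load matrix entries into local memory
    BROADCAST = auto()   # drive or listen to the row/column buses
    COMPUTE = auto()     # multiply-accumulate on the received operands


def next_state(state, start=False, data_ready=False, done=False):
    """Return the next controller state for an assumed transition table."""
    if state is PEState.IDLE:
        return PEState.STORE if start else PEState.IDLE
    if state is PEState.STORE:
        return PEState.BROADCAST
    if state is PEState.BROADCAST:
        # Handshake: stay here until the buses report valid operands.
        return PEState.COMPUTE if data_ready else PEState.BROADCAST
    if state is PEState.COMPUTE:
        # After the last accumulation the PE goes idle, else it waits for
        # the next broadcast.
        return PEState.IDLE if done else PEState.BROADCAST
    raise ValueError(f"unknown state: {state}")
```

In this sketch the `data_ready` wait in `BROADCAST` stands in for the handshake: every PE advances to `COMPUTE` only once its operands are valid, which is one simple way to obtain the synchronization between PEs.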

PE unit

• Each PE is equipped with its own local memory, which stores its entries of the multiplied matrices at the start of the computation and supplies the values it broadcasts along its row and column.
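To make the role of local memory and row/column broadcasting concrete, here is a minimal software simulation of an n×n grid of PEs. It assumes a rank-1 update scheme, which the slides do not confirm: at step k, the PEs in grid column k broadcast their A entries along their rows, the PEs in grid row k broadcast their B entries along their columns, and every PE accumulates one product. The function name `grid_matmul` and the dictionary layout are hypothetical.

```python
def grid_matmul(A, B):
    """Simulate an n x n PE grid multiplying two n x n matrices."""
    n = len(A)
    # Local memory of PE (i, j): its own entries of A and B plus an accumulator.
    local = [[{"a": A[i][j], "b": B[i][j], "c": 0} for j in range(n)]
             for i in range(n)]
    for k in range(n):
        # PEs in grid column k broadcast their A entry along their row.
        row_bus = [local[i][k]["a"] for i in range(n)]
        # PEs in grid row k broadcast their B entry along their column.
        col_bus = [local[k][j]["b"] for j in range(n)]
        # Every PE multiplies what it hears on its two buses and accumulates.
        for i in range(n):
            for j in range(n):
                local[i][j]["c"] += row_bus[i] * col_bus[j]
    return [[local[i][j]["c"] for j in range(n)] for i in range(n)]
```

In hardware the two inner loops would run in parallel across all PEs, so each of the n steps costs one broadcast and one multiply-accumulate per PE.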

Handling Larger Matrices

• To handle larger matrices, we break the input matrices down into a sequence of smaller updates using a hierarchical blocking of the input matrices. Each update in the hierarchy is called a “loop”.

• There is no loop-carried dependency, so we aim to pipeline the outer loop: the current cycle’s computation overlaps with the previous cycle’s write-back and the next cycle’s prefetching of matrices.
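The two points above can be sketched in software. The code below assumes a single level of blocking in which each update produces one `bs`×`bs` block of the result independently of the others, and a three-stage pipeline (prefetch, compute, write back). The block size `bs`, the function names and the stage names are assumptions made for illustration, not the project's design.

```python
def block_origins(n, bs):
    """Yield the (row, column) origin of every bs x bs block of an n x n result."""
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            yield i0, j0


def blocked_matmul(A, B, bs):
    """Multiply n x n matrices one result block (one update) at a time."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0, j0 in block_origins(n, bs):
        # One update: this block of C depends only on A and B, never on
        # another block of C, so updates carry no dependency between them.
        for i in range(i0, min(i0 + bs, n)):
            for j in range(j0, min(j0 + bs, n)):
                C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return C


def pipeline_schedule(num_updates):
    """List, per time step, which update occupies each pipeline stage."""
    schedule = []
    for t in range(num_updates + 2):
        stages = {}
        if t < num_updates:
            stages["prefetch"] = t          # next update's operands
        if 0 <= t - 1 < num_updates:
            stages["compute"] = t - 1       # current update
        if 0 <= t - 2 < num_updates:
            stages["write_back"] = t - 2    # previous update's result
        schedule.append(stages)
    return schedule
```

For example, `pipeline_schedule(3)` has five time steps, and in the middle step update 2 is being prefetched while update 1 is computed and update 0 is written back. Once the pipeline is full, data movement is hidden behind computation as long as it takes no longer than the compute stage.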

A Problem With Larger Matrices

• Moving data in and out of the computational grid independently for each block of the hierarchy can be expensive, so we need to amortize that cost.
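A rough way to see why amortization matters is to count multiply-accumulates per word moved for one block update. The counting model below is an assumption, not a measurement of the design: a `bs`×`bs` result block of an n×n product needs a `bs`×n panel of A and an n×`bs` panel of B moved in, and the `bs`×`bs` result moved out.

```python
def macs_per_word(n, bs):
    """Multiply-accumulates performed per word moved for one block update."""
    macs = bs * bs * n                    # work done inside the grid
    words_in = 2 * bs * n                 # panel of A plus panel of B
    words_out = bs * bs                   # result block written back
    return macs / (words_in + words_out)


# Larger blocks reuse each moved word more often.
for bs in (4, 8, 16):
    print(bs, round(macs_per_word(64, bs), 2))
```

Under this model the ratio grows with the block size, so larger blocks spread the same transfer cost over more computation, bounded in practice by the local memory available in each PE.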
