
Implementation of a Double-Precision Multiplier Accumulator with Exception Treatment to a Dense Matrix Multiplier Module in FPGA

Abner C. Barros¹, Victor W. C. de Medeiros¹, Viviane L. S. Souza¹, Paulo S. B. Nascimento¹, Ângelo L. Mazer¹, João P. F. Barbosa¹, Bruno P. Neves¹, Ismael H. Santos², Manoel E. de Lima¹

¹ Federal University of Pernambuco, Informatics Center, Recife – PE, Brazil. Phone: +55 81 2126.8430
{acb, vwcm, vlss, psbn, alm, jpfb, bpn, mel}@cin.ufpe.br

² Petrobras Research Center – CENPES/Petrobras, Rio de Janeiro
[email protected]

ABSTRACT

Recently, supercomputer manufacturers have made use of FPGAs to accelerate scientific applications [16][17]. Traditionally, FPGAs were used only in non-scientific applications. The main reasons for this are: the complexity of floating-point computation; FPGA logic cells were not sufficient to implement scientific cores; and the complexity of such cores prevented them from operating at high frequencies.

Nowadays, the growing availability of specialized blocks for complex operations, such as adder and multiplier blocks implemented directly in the FPGA, and the increase in internal RAM blocks (BRAMs) have made possible high-performance systems that use the FPGA as a processing element for scientific computation [2].

These devices are used as co-processors that execute the intensive computation. The emphasis of these architectures is the exploitation of the parallelism present in scientific computation and of data reuse.

In most of these applications, the scientific computation involves operations on large dense floating-point matrices, which are normally carried out by MACs.

In this work, we describe the architecture of a multiplier-accumulator (MAC) for double-precision floating-point, according to the IEEE-754 standard, and we propose the architecture of a matrix multiplier that uses instances of the developed MAC and exploits data reuse through the BRAMs (RAM blocks internal to the FPGA) of a Xilinx Virtex-4 LX200 FPGA. The synthesis results show that the implemented MAC can reach a performance of 4 GFLOPS.

Categories and Subject Descriptors
B.5.1 [Register-Transfer-Level Implementation]: Design – arithmetic and logic units, control design, styles.

General Terms
Performance, Design.

Keywords

Floating-point, FPGA, scientific computing, HPC.

1. INTRODUCTION
Projects that use reconfigurable hardware, more specifically FPGAs, combine the flexibility of software with performance close to that of ASICs (Application-Specific Integrated Circuits) [1]. FPGAs are characterized by a high degree of computational parallelism as well as I/O parallelism, which can be exploited to achieve high data throughput between the FPGA and the other system elements (host, memory, controller).

These characteristics, associated with the increase in FPGA computational power and the demand for performance improvement, have made FPGAs an attractive option to speed up scientific applications [1][3][4].

High-performance reconfigurable systems are being developed that combine FPGAs and general-purpose processors to improve application performance. These systems [5] are similar to distributed systems with multiple computation nodes on a network; here, however, the computation nodes are formed by general-purpose processors and FPGAs.

The objective is for the processor to be responsible for control and the FPGAs for the intensive computation.

The reconfigurable computing system exploits two levels of parallelism: coarse granularity, achieved by the multiple computation nodes, and fine granularity, achieved by execution in the FPGAs. This combination has reached performance superior to that of general-purpose processors alone [6].

Some applications use floating-point multiplication and addition operations on dense matrices, in some cases in double precision.

This work presents the architecture of a MAC (multiplier-accumulator) for double-precision floating-point, developed in a hardware description language. The MAC is a library module that can be integrated with other modules in the development of solutions to scientific problems.


As an application of the developed module, we present the architecture of a dense matrix multiplier that uses instances of the MAC and makes use of the BRAMs to allow data reuse and a virtual increase in the number of execution units (MACs).

This article is organized as follows: Section 2 presents work related to the development of multiplier-adder cores in FPGA. Section 3 presents the architecture of the developed MAC, its characteristics and a description of its modules. Section 4 proposes the architecture of the matrix multiplier. Section 5 contains the experimental results. Finally, Section 6 presents the conclusions.

2. RELATED WORKS
Some related works were found during our research and are briefly discussed below.

In [3], a historical comparative study is presented of the performance of general-purpose processors and FPGAs when used as processing elements for scientific computation.

This study analyzes processing element performance from 1998 to 2004 and makes projections up to 2010. Both approaches are analyzed implementing operations such as dot product, matrix-vector multiplication and matrix-matrix multiplication on dense double-precision floating-point matrices.

The ANSI/IEEE 754-1985 standard for floating-point arithmetic and rounding [9] was adopted in this article.

In [11], an FPU (Floating-Point Unit) is described that is intended to run arithmetic operations at 250 MHz. However, it processes only single-precision (32-bit) floating-point numbers, without treatment of special numbers, i.e., NaN (Not a Number) and infinity. Moreover, there is no rounding implementation; consequently, results are truncated.

Reference [12] provides a high-level overview of the multiplication algorithm with normalized numbers and four rounding modes (to nearest even, toward zero, toward positive infinity, toward negative infinity). Furthermore, it supports NaN and infinity representations.

In [13], a complete MAC design for a single FPGA, a Xilinx Virtex-II Pro, is described. It uses the 444 DSP blocks of the FPGA to implement 39 MACs, reaching a performance of 15.6 GFLOPS.

In our work, we achieved a throughput of 4 GFLOPS with a Xilinx Virtex-4, which provides only 96 DSP blocks, allowing just 10 MACs per FPGA. As in [13], our MAC runs at 200 MHz. Furthermore, we created an FTU (Fail Treatment Unit) to deal with exceptions that can be generated during the whole process and to make the MAC work properly with the four standard number types (infinity, NaN, normal and denormal).

3. MULTIPLIER-ACCUMULATOR
The MAC works with three input vectors, A, B and C. Inputs A and B are multiplied and the result is added to C (i.e., A × B + C).

Unlike [13], the internal structure of our MAC is formed by fourteen pipeline stages in which all the arithmetic operations are performed. This approach was used to improve the clock rate, since for large amounts of data the pipeline latency is only a constant and has no significant impact on the total processing time.
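As a back-of-the-envelope check of this latency argument (our illustration, not an equation from the paper): streaming N operand triples through an L-stage pipeline at frequency f takes

```latex
T = \frac{N + L - 1}{f} \;\approx\; \frac{N}{f} \qquad \text{for } N \gg L = 14,
```

so the 13 extra fill cycles vanish against any realistically large data set.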

Another difference between our work and [13] is the presence of the exception control block, which allows this MAC to work with infinite numbers and NaNs.

The internal structure was designed to exploit parallelism in the execution of the operations, eliminating the redundancies that would arise if the multiplication and the addition were performed separately.

The basic architecture of the device is presented in Figure 1. The MAC is composed of a Multiplication Unit (unit 1), which can be considered the main element of the MAC, receiving the mantissas to be multiplied. It also has the Exception 1 (unit 13), Exception 2 (unit 14) and Exception 3 (unit 15) units, which are responsible for treating the cases where infinite numbers or NaNs are operated. These exceptions can be produced by invalid input data or by internal operations; we chose to treat them so as to produce the standard encodings that indicate the error condition. The Adder (unit 2), Compare (unit 4) and Adjuster (unit 11) elements operate on the exponents of the three input numbers so as to drive the multiplication and the addition. The Adder and Adjuster modules communicate with some of the exception treatment units to resolve the cases where infinite numbers are generated during the operations. The Complement (unit 3), Shifter (unit 5) and Mant_Adder (unit 6) modules are responsible for aligning the mantissas in accordance with the exponents and adding them. The Complement (unit 7), Counter (unit 8), Shifter_Mant (unit 9) and Rounding (unit 10) elements are responsible for making the final adjustments that generate the final mantissa. Among these modules, the Rounding unit deserves attention, since it implements the rounding recommended by the IEEE standard; it is described in detail later in this section.

3.1 The Units Description
The following sections describe the main units of the MAC. The Multiplication, Counter, Exception Treatment and Shifter units are described in detail.

3.1.1 Multiplication Unit
As said previously, the Multiplication Unit is responsible for multiplying the mantissas of inputs A and B. This unit is divided into two modules: the Partial Multiplier and the Addition Tree.

The Partial Multiplier executes the DADA algorithm, which divides the input mantissas into four parts: three of 17 bits and one of 2 bits. Each of the three larger parts of one mantissa is multiplied by each of the three larger parts of the other; one of the two 2-bit parts is multiplied by the whole of the other vector, and the other is multiplied by that vector without its two most significant bits (Figure 2). The results of these operations are nine 34-bit vectors, one 55-bit vector and one 53-bit vector; together they compose the output of the Partial Multiplier.
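To make the decomposition concrete, the following C sketch (ours, not the authors' VHDL; it assumes a GCC/Clang compiler with `unsigned __int128` support) reproduces the split into three 17-bit limbs plus a 2-bit top part and checks that the eleven partial products sum to the full 106-bit product:

```c
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

typedef unsigned __int128 u128;

/* Multiply two 53-bit mantissas via the partial-product scheme above. */
static u128 partial_mul_53(uint64_t a, uint64_t b) /* a, b < 2^53 */
{
    uint64_t al[3] = { a & 0x1FFFF, (a >> 17) & 0x1FFFF, (a >> 34) & 0x1FFFF };
    uint64_t bl[3] = { b & 0x1FFFF, (b >> 17) & 0x1FFFF, (b >> 34) & 0x1FFFF };
    uint64_t a_top = a >> 51, b_top = b >> 51;       /* the 2-bit parts      */
    uint64_t a_lo51 = a & ((1ULL << 51) - 1);        /* a without its top 2  */

    u128 acc = 0;
    /* nine 34-bit partial products: 17-bit limbs of a times limbs of b */
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            acc += (u128)(al[i] * bl[j]) << (17 * (i + j));
    /* 2-bit top of a times the whole of b: the 55-bit partial product */
    acc += (u128)(a_top * b) << 51;
    /* 2-bit top of b times a without its top 2 bits: the 53-bit one   */
    acc += (u128)(b_top * a_lo51) << 51;
    return acc; /* full 106-bit product */
}

int main(void)
{
    uint64_t a = 0x1A2B3C4D5E6F7ULL, b = 0x1FFFFFFFFFFFFFULL; /* < 2^53 */
    assert(partial_mul_53(a, b) == (u128)a * b);
    puts("decomposition matches the full 106-bit product");
    return 0;
}
```

The 17-bit limb width is what lets each limb-by-limb product map onto one of the FPGA's embedded multiplier blocks.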



Figure 1 - Basic architecture of MAC

Figure 2 - A and B mantissas

In the Addition Tree, the vectors produced by the Partial Multiplier are summed using carry-propagate adders (CPAs) arranged as a reduction circuit, to exploit the parallelism available in these operations.

3.1.2 Counter
This device is responsible for locating the position of the first bit with value '1' and producing a result that indicates this position. This result is used in the final mantissa normalization. Our algorithm is based on a multiplexer driven by the result mantissa.
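A behavioral C model of such a counter (our sketch; the real unit is a mux network over the 106-bit mantissa, while this version works on a 64-bit slice) is:

```c
#include <stdint.h>

/* Count leading zeros: each step acts as one mux level that tests the
 * upper half of the remaining window and shifts it away if it is empty. */
static unsigned count_leading_zeros64(uint64_t m)
{
    if (m == 0) return 64;
    unsigned n = 0;
    if ((m >> 32) == 0) { n += 32; m <<= 32; }
    if ((m >> 48) == 0) { n += 16; m <<= 16; }
    if ((m >> 56) == 0) { n += 8;  m <<= 8;  }
    if ((m >> 60) == 0) { n += 4;  m <<= 4;  }
    if ((m >> 62) == 0) { n += 2;  m <<= 2;  }
    if ((m >> 63) == 0) { n += 1; }
    return n;
}
```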

3.1.3 Exception Treatment Units
In these units, the exception conditions are evaluated to determine whether the output number should come from the MAC's internal operations or be a standard number that represents the exception condition. The inputs of the Exception 1 unit are evaluated and classified according to their format. In Exception 2, the result of the previous module is treated, as well as the result of the addition of the A and B exponents. The Exception 2 output is propagated to Exception 3, where the result of the exponent adjustment is evaluated. If some exception is signaled, the multiplexer (component 12 of Figure 1) sends the adequate representation of the exception to the device's output; otherwise, the result comes from the internal operations.
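For illustration, the input classification that Exception 1 performs can be sketched in C as follows (our model; the enum names are hypothetical, while the field layout is the IEEE-754 double format):

```c
#include <stdint.h>
#include <string.h>

typedef enum { FP_CLS_NORMAL, FP_CLS_DENORMAL_OR_ZERO,
               FP_CLS_INFINITY, FP_CLS_NAN } fp_class;

static fp_class classify_double(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);            /* reinterpret, no conversion */
    uint64_t exp  = (bits >> 52) & 0x7FF;      /* 11-bit biased exponent     */
    uint64_t mant = bits & 0xFFFFFFFFFFFFFULL; /* 52-bit stored mantissa     */

    if (exp == 0x7FF) return mant ? FP_CLS_NAN : FP_CLS_INFINITY;
    if (exp == 0)     return FP_CLS_DENORMAL_OR_ZERO;
    return FP_CLS_NORMAL;
}
```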

3.1.4 Shifter
This unit shifts one of the input mantissas according to the result of the Compare module, which is responsible for evaluating the exponents of the numbers to be added. It is important to highlight that the right shift, necessary to align the mantissa, was implemented through a multiplexer. This approach made it possible to increase the unit's clock rate.
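One plausible structure for such a multiplexer-based shifter, sketched in C (ours; the hardware version would also collect the shifted-out bits for the sticky-bit logic, which is omitted here):

```c
#include <stdint.h>

/* Right shifter built from log2(64) = 6 banks of 2:1 multiplexers, each
 * selected by one bit of the shift amount, instead of a sequential shifter. */
static uint64_t mux_shift_right(uint64_t m, unsigned amt)
{
    if (amt & 1)  m >>= 1;
    if (amt & 2)  m >>= 2;
    if (amt & 4)  m >>= 4;
    if (amt & 8)  m >>= 8;
    if (amt & 16) m >>= 16;
    if (amt & 32) m >>= 32;
    return m;   /* amt >= 64 would need an extra flush-to-zero mux */
}
```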

3.2 Description of Pipeline Stages
The main features of the fourteen stages of the MAC's pipeline are described below.

Stages 1 to 5: the A and B mantissas are multiplied in the Multiplication Unit. The A and B exponents are added in the Adder Unit and the addition result is propagated. The input vectors are evaluated in the Exception Treatment Unit, as is the result of the exponent addition.

Stage 6: if necessary, the two's complement of the mantissas is computed. The sum of the A and B exponents is compared with the C exponent in the Compare Unit. The greater of them is propagated until Stage 13 of the pipeline, and the difference between them is used as input to the Shifter module. The result of the exception treatment is propagated until Stage 14. To avoid disrupting the pipeline flow, the pipeline keeps running even if an exception occurred at or before this stage.

Stage 7: the mantissa is shifted to the right to perform the exponent alignment.

Stage 8: the result of the multiplication of the A and B mantissas is added to the C mantissa, generating a new mantissa that, after some adjustments, becomes the result mantissa.

Stage 9: the mantissa is converted from two's-complement notation to sign-and-magnitude notation, to match the IEEE mantissa format.

Stages 10 and 11: the Counter Unit counts the number of zeros before the first bit with value '1'. This information is used by the Shifter_Mant and Adjuster Units to normalize the result.

Stage 12: in the Shifter_Mant Unit, the mantissa is normalized by a left shift, according to the number of zeros counted in the previous stages. Furthermore, in this stage the 106 bits of the mantissa are separated into three groups: 55 bits composing the result mantissa, one bit for the rounding bit, and the remaining bits, which generate the sticky bit. These groups are used in the next stage to perform the rounding.

Stage 13: the mantissa is rounded in the Rounding Unit, based on the results of the previous stage, and the final exponent is adjusted in the Adjuster Unit.

Stage 14: any exception generated in the final exponent adjustment is treated in this stage by the last Exception Treatment Unit, Exception 3. In the end, if an exception occurred, the Exception Treatment Unit result is selected by the multiplexer (unit 12 of Figure 1) as the final result; otherwise, the MAC result is the final result.
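The Stage 12/13 hand-off can be illustrated with the following C sketch of round-to-nearest-even (ours; the real datapath keeps 55 of 106 bits, while this model uses a word of at most 64 bits for readability):

```c
#include <stdint.h>

/* Keep the top `keep` bits of a `width`-bit mantissa (width <= 64 here,
 * width - keep >= 2): the next bit down is the rounding bit and the OR of
 * everything below it is the sticky bit. */
static uint64_t round_nearest_even(uint64_t wide, unsigned width, unsigned keep)
{
    unsigned drop   = width - keep;
    uint64_t kept   = wide >> drop;
    uint64_t rnd    = (wide >> (drop - 1)) & 1;                 /* rounding bit */
    uint64_t sticky = (wide & ((1ULL << (drop - 1)) - 1)) != 0; /* OR of rest   */

    /* Round up when the remainder exceeds one half, or equals one half and
     * the kept LSB is odd (ties to even). A carry out of `kept` corresponds
     * to the renormalization handled by the Adjuster unit. */
    if (rnd && (sticky || (kept & 1)))
        kept += 1;
    return kept;
}
```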


3.3 Reference Model and Test Results
First, we implemented a canonical model based on the double-precision floating-point addition and multiplication operations of the standard C library. Through this library, we generated input and output files for the MAC operation (test vectors).

After this, we built a description of the proposed MAC architecture, which was used to generate the intermediate vectors for debugging and validation of the MAC internal modules (Figure 1).

The test vector files generated in these steps were used to run the MAC testbench (implemented in VHDL) with the support of the ModelSim simulation tool [18].

During the tests, an inconsistency was identified between the values produced by the reference model, which uses the standard library of the C compiler [14], and the values obtained by our algorithm, which uses the "Round to Nearest Even" rounding prescribed by the IEEE standard. To check whether the problem was in our implementation, we repeated the tests using another C library [15] that allows the rounding mode to be defined explicitly. With the rounding mode set to "Round to Nearest Even", we obtained 100% agreement with our results.
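A minimal sketch of such a reference-model generator is shown below. It is our reconstruction, not the authors' code: it uses the C99 <fenv.h> interface to pin the rounding mode rather than the GSL routines of [15], and the file name and vector count are invented for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON    /* we touch the floating-point environment       */
#pragma STDC FP_CONTRACT OFF   /* keep a*b and +c as two separately rounded ops */

int main(void)
{
    /* FE_TONEAREST is round-to-nearest-even, the IEEE-754 default mode */
    if (fesetround(FE_TONEAREST) != 0)
        return 1;

    FILE *f = fopen("mac_vectors.txt", "w");
    if (!f)
        return 1;

    srand(42);  /* fixed seed so the vector set is reproducible */
    for (int i = 0; i < 1000; i++) {
        double a = (double)rand() / RAND_MAX * 1e6;
        double b = (double)rand() / RAND_MAX * 1e6;
        double c = (double)rand() / RAND_MAX * 1e6;
        double r = a * b + c;   /* the operation the MAC implements */
        /* %a (hex float) preserves every bit of operands and result */
        fprintf(f, "%a %a %a %a\n", a, b, c, r);
    }
    fclose(f);
    return 0;
}
```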

4. PROJECT OF A MATRIX MULTIPLIER
Based on the previously discussed MAC, we propose an architecture for the multiplication of N×N square matrices in FPGA.

Matrix multiplication is characterized by its high memory access cost, O(N³) for N×N matrices.

For this reason, the memory access cost often outweighs the computation cost, causing a great impact on the total cost.

If there is no possibility of increasing the memory controller performance, the only way to address this problem is to exploit data reuse. Matrix multiplication offers great potential for data reuse, because each row of the first matrix (matrix A) must operate with all columns of the second (matrix B), and vice versa.

Thus, by reusing the rows of A, the memory access cost drops from 2N³ to N²+N³. If the columns of B are also reused, the cost drops from N²+N³ to 2N², which is the lowest possible cost for this operation.
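These counts follow directly from the access pattern; in LaTeX form (our summary of the paper's numbers):

```latex
\underbrace{N^2 \cdot 2N}_{\text{no reuse}} = 2N^3, \qquad
\underbrace{N^2}_{A\ \text{read once}} + \underbrace{N \cdot N^2}_{B\ \text{re-read per }A\ \text{row}} = N^2 + N^3, \qquad
\underbrace{N^2 + N^2}_{A\ \text{and}\ B\ \text{read once}} = 2N^2 .
```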

A simple way to reach this objective would be to use N² MACs (N = matrix dimension) simultaneously, in a systolic array architecture. This approach would allow 100% reuse of the matrix A rows and matrix B columns. Since all dot-product operations could be done in parallel, it would be possible to reduce the memory access cost to 2N² and the temporal cost of the computation from N³ to N.

As a more economical variation of this solution, we could reuse only the matrix A rows, or only the matrix B columns, using N MACs instead of N². Each MAC would be responsible for computing the dot products of one or more rows of A with the columns of B. This approach would reduce the memory access cost to N³+N² and the temporal computational cost to N².

However, due to the limited number of multiplier blocks available on current FPGAs and the dimensions of the matrices to be computed, both solutions presented above are unfeasible.

As a third solution to this problem, we propose an architecture that uses the internal FPGA memory blocks (BRAMs) as the key element to provide reuse of both the rows of A and the columns of B.

Our approach uses these memory elements to virtually increase the number of available MACs and thus reach the data reuse presented above. When all characteristics of this approach are exploited, it avoids the need for intermediate storage elements for temporary results, and therefore eliminates the need to combine partial results at the end of processing.

For didactic reasons, we first treat the reuse of the A rows and then the reuse of the B columns. In order to provide reuse of the A row elements, we use N BRAM memory elements, organized as FIFO memory blocks of depth K, where K = N/m and m is the number of available MACs, attached to each MAC output, as shown in Figure 3.


Figure 3 - Modified MAC

These FIFOs attached to the MAC outputs allow the number of MACs in the system to be virtually increased from m to N, since through them one MAC is able to compute the dot products between one A row and K B columns.

Figure 4 - Matrix A rows data

Figure 4 presents the schematic diagram of our first approach. In this first configuration, both matrix A and matrix B data must be read sequentially, row by row. The elements read from the rows of A are applied sequentially to the rows' FIFO, while the elements read from matrix B are distributed uniformly among the columns' FIFOs, as shown in the Figure 5 diagram.

Each element loaded from A to be operated by the MACs must remain unchanged until all K elements from B in each MAC have been operated and every FIFO has been filled. An important detail to note is that, although matrix B is read row by row, it is in fact operated column by column: each MAC is responsible for operating one row with K columns.


Figure 5 - Data distribution scheme for matrix A rows data reuse

The data reading from A and B is interleaved: for each A row read, all B rows are read. This approach allows each row of A to operate with all corresponding elements of B.
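A behavioral C sketch of this first, row-reuse configuration is given below (our simulation, not the authors' RTL): m physical MACs, each followed by a FIFO of depth K = N/m modeled as a plain array, compute one row of C per pass. Each A element is held while its K column partners stream by, so A is fetched N² times in total and B is fetched N times, i.e., the N²+N³ access count quoted above. N divisible by m is assumed for brevity.

```c
#include <stdio.h>

#define N 8
#define M 2            /* physical MACs                                */
#define K (N / M)      /* FIFO depth = virtual MACs per physical MAC  */

int main(void)
{
    double A[N][N], B[N][N], C[N][N];
    double fifo[M][K];  /* accumulator FIFO behind each MAC */

    for (int i = 0; i < N; i++)          /* arbitrary test data */
        for (int j = 0; j < N; j++) {
            A[i][j] = i + 0.25 * j;
            B[i][j] = j - 0.5 * i;
        }

    for (int i = 0; i < N; i++) {                /* one A row per pass    */
        for (int q = 0; q < M; q++)
            for (int k = 0; k < K; k++) fifo[q][k] = 0.0;
        for (int t = 0; t < N; t++) {            /* B streamed row by row */
            double a = A[i][t];                  /* held at the MAC input */
            for (int q = 0; q < M; q++)          /* MACs run in parallel  */
                for (int k = 0; k < K; k++)      /* FIFO rotation         */
                    fifo[q][k] += a * B[t][q * K + k];
        }
        for (int q = 0; q < M; q++)              /* drain: row i of C     */
            for (int k = 0; k < K; k++)
                C[i][q * K + k] = fifo[q][k];
    }

    printf("C[0][0]=%g C[%d][%d]=%g\n", C[0][0], N - 1, N - 1, C[N-1][N-1]);
    return 0;
}
```

Swapping which operand is held and which one streams gives the second configuration described next.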

To provide reuse of the elements of both the A rows and the B columns, we need to increase the number of FIFO memories at the MAC outputs from N to N², distributed in blocks of depth k1 = N²/m. This provides N² virtual MACs; thus, each MAC is able to compute the dot products of N²/m rows with N²/m columns.

In this second configuration, in order to provide reuse of both matrix A and matrix B elements, matrix A must be read transposed, column by column, while matrix B keeps being read in the conventional way, row by row. Unlike the previous approach, this time the row elements of matrix B are applied sequentially to the rows' FIFOs, while the column elements of matrix A are distributed uniformly among the columns' FIFOs (Figure 6). Since we want to reuse both the row elements from B and the column elements from A, each rows' FIFO element loaded to be operated in the MACs must remain unchanged at its input until all corresponding elements in the columns' FIFOs have been operated.

Figure 6 - Architecture for rows and columns data reuse

Next, the operated element is replaced by its successor. The elements of the columns' FIFOs being operated by the MACs must remain at the MAC inputs until all of them have been operated with their corresponding elements in the rows' FIFOs, and so on. Figure 7 shows the development of this algorithm.

An important point to be highlighted in both proposed architectures is that, although we reach a significant reduction in the temporal cost of memory access, from 2N³ to N³+N² in the first approach and from 2N³ to 2N² in the second, the temporal multiplication cost remains N³/m, where m is the number of MACs, since there is no real increase in the number of MACs processing the multiplication in parallel.

In our work, we analyzed the implementation of this approach on a Xilinx Virtex-4 LX200 FPGA. It has 96 DSP blocks and 336 BRAM blocks (18 Kbits each), which allowed us to implement 10 real MACs that were transformed, using the technique described above, into N virtual MACs.

Figure 7 - Data distribution scheme for matrix A rows and matrix B columns reuse

This FPGA was chosen because it is the hardware used in the RC-100 from SGI [17], where our project will be prototyped. Figure 9 shows the final architecture of our system.

We also guarantee the execution of 20 floating-point operations (multiplications and additions) per clock cycle, since we are able to instantiate 10 MACs. Considering a MAC execution frequency of 200 MHz, we arrive at a performance of 4 GFLOPS.
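The peak figure is simple arithmetic (our restatement):

```latex
10\ \text{MACs} \times 2\ \tfrac{\text{FLOPs}}{\text{cycle}} \times 200\ \text{MHz} = 4\ \text{GFLOPS}.
```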

Figure 8 - Matrices Multiplier

5. EXPERIMENTAL RESULTS
5.1 Description of Synthesis Results
In the synthesis of our architecture, we used a Xilinx Virtex-4 family device, XC4VLX200, package FF1513, speed grade -10. In this scenario, each MAC reached a frequency of 200 MHz and occupied 3% of the available logic units and nine DSP blocks. These results allowed us to place up to ten MACs per FPGA, since the cited device has ninety-six DSP blocks, which limits the number of MACs per FPGA.


Figure 9 - Architecture of the matrix multiplication

With these results, we found that each MAC operates at 400 MFLOPS and, as we have ten of these units per FPGA, we reach a performance of 4 GFLOPS. These results can be improved by using devices that allow a larger number of MACs to work simultaneously. This demonstrates that we can reach results comparable to, or even better than, those of some conventional platforms, while using a lower clock frequency.

6. CONCLUSIONS
The use of FPGAs as processing elements for high-performance scientific computing is already a reality.

However, the limitations on data access and the small amount of FPGA resources specific to scientific processing are still problems to be addressed, observing the characteristics inherent to each application and each platform.

This article discussed an efficient implementation of a MAC using a 14-stage pipeline. This MAC is able to perform addition and multiplication of double-precision floating-point numbers according to the IEEE-754 standard and operates without restrictions on NaNs and infinite numbers.

We also presented two approaches that attenuate the difficulty of accessing data and allow it to be reused efficiently in matrix multiplication.

Finally, a system based on the described MAC and on the presented data reuse approaches was proposed.

This system is expected to reach a sustained performance of 4 GFLOPS running on the Xilinx Virtex-4 LX200 FPGA present on an RC-100 blade of the SGI Altix 350 [17].

7. ACKNOWLEDGMENT
We would like to thank the Petrobras Research Center (CENPES), FINEP/CNPq and the RPCMod network coordination.

8. REFERENCES

[1] L. Zhuo and V. K. Prasanna. Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on Reconfigurable Computing Systems. IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 18, No. 4, pp. 433-448, April 2007.

[2] K. D. Underwood and K. S. Hemmert. Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance. In Proc. of the 2004 IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2004.

[3] A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt, K. D. Underwood and K. S. Hemmert. Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance. In Proc. of the 2004 IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2004.

[4] L. Zhuo and V. K. Prasanna. Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs. In Proc. of the 18th International Parallel & Distributed Processing Symposium, New Mexico, USA, April 2004.

[5] L. Zhuo and V. K. Prasanna. Scalable Hybrid Designs for Linear Algebra on Reconfigurable Computing Systems. Submitted to IEEE Transactions on Computers.

[6] R. Scrofano and V. K. Prasanna. Computing Lennard-Jones Potentials and Forces with Reconfigurable Hardware. In Proc. of the Int'l Conf. on Engineering of Reconfigurable Systems and Algorithms (ERSA'04), pp. 284-290, June 2004.

[7] K. D. Underwood and K. S. Hemmert. Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance. In Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04), April 2004.

[8] L. Zhuo and V. K. Prasanna. Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs. In Proc. of the 18th Int'l Parallel & Distributed Processing Symposium (IPDPS'04), New Mexico, USA, April 2004.

[9] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985.

[10] C. Babb, J. Blank, I. Castellanos and J. Moskal. Floating Point Multiplier. Final Project, ECE 587.

[11] P. Karlström, A. Ehliar and D. Liu. High Performance, Low Latency FPGA based Floating Point Adder and Multiplier Units in a Virtex 4.

[12] E. Mark. Free Floating-Point Madness.

[13] Y. Dou, S. Vassiliadis, G. K. Kuzmanov and G. N. Gaydadjiev. 64-bit Floating-Point FPGA Matrix Multiplication.

[14] MinGW 5.1.3. <http://www.mingw.org/>

[15] GSL - GNU Scientific Library.

[16] Cray Inc. Cray XD1 FPGA Development. <http://www.cray.com/>

[17] SGI Inc. <http://www.sgi.com/>

[18] ModelSim. <http://www.model.com/>