Comprehensive environment for benchmarking using FPGAs:
ATHENa - Automated Tool for Hardware EvaluatioN
Nov 08, 2014
Modern Benchmarking: Natural Progression of Tools
[Diagram: Software → eBACS (D. Bernstein, T. Lange); ASICs → ?; FPGAs → ?]
ATHENa – Automated Tool for Hardware EvaluatioN
A set of scripts written in Perl, aimed at AUTOMATED generation of OPTIMIZED results for MULTIPLE hardware platforms.
Currently under development at George Mason University.
Version 0.3.1
http://cryptography.gmu.edu/athena
Why Athena?
"The Greek goddess Athena was frequently called upon to settle disputes between the gods or various mortals. Athena Goddess of Wisdom was known for her superb logic and intellect. Her decisions were usually well-considered, highly ethical, and seldom motivated by self-interest."
from "Athena, Greek Goddess of Wisdom and Craftsmanship"
Designers of ATHENa
• Venkata "Vinny", MS CpE student
• Ekawat "Ice", MS CpE student
• Marcin, PhD ECE student
• Rajesh, PhD ECE student
• Xin, PhD ECE student
• Michal, PhD exchange student from Slovakia
Basic Dataflow of ATHENa
[Diagram: (0) interfaces + testbenches are provided to the Designer; (1) the Designer submits HDL + scripts + configuration files to the ATHENa Server; (2, 3) the server performs FPGA synthesis and implementation and returns a result summary + database entries; (4-7) database entries populate a database that a User, working with HDL + FPGA tools, can query to obtain a ranking of designs; (8) users download scripts and configuration files.
Inputs: synthesizable source files, configuration files, testbench, constraint files.
Outputs: result summary (user-friendly), database entries (machine-friendly).]
ATHENa Major Features (1)
• synthesis, implementation, and timing analysis in batch mode
• support for devices and tools of multiple FPGA vendors
• generation of results for multiple families of FPGAs of a given vendor
• automated choice of a best-matching device within a given family
ATHENa Major Features (2)
• automated verification of the design through simulation in batch mode
• exhaustive search for optimum options of the tools, OR
• heuristic adaptive optimization strategies aimed at maximizing selected performance measures (e.g., speed, area, speed/area ratio, power, cost)
Multi-Pass Place-and-Route Analysis: GMU SHA-512, Xilinx Virtex 5
[Histogram: minimum clock period for 100 runs with different placement starting points; the smaller the better; ~20% difference between the best and worst runs]
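To make the ~20% spread concrete, a minimal sketch in Python of how such a multi-pass analysis can be summarized (the period values below are made-up placeholders, not the actual GMU SHA-512 data):

    def spread_summary(clk_periods_ns):
        # Minimum clock periods (ns) reported by place & route runs that
        # differ only in the placement starting point.
        best, worst = min(clk_periods_ns), max(clk_periods_ns)
        spread_pct = (worst - best) / best * 100
        return best, worst, spread_pct

    # Placeholder periods from five hypothetical runs:
    best, worst, spread = spread_summary([6.1, 6.3, 6.5, 7.0, 7.3])
    print(f"best {best} ns, worst {worst} ns, spread {spread:.0f}%")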
Dependence of Results on Requested Clock Frequency
ATHENa Applications
• single_run: one set of options
• placement_search: one set of options; multiple starting points for placement
• exhaustive_search: multiple sets of options; multiple starting points for placement; multiple requested clock frequencies (see the sketch below)
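To illustrate what exhaustive_search enumerates, a conceptual sketch in Python; this is not ATHENa's actual implementation or option syntax, and run_flow is a hypothetical stand-in for one pass through the vendor's synthesis and place-and-route tools:

    from itertools import product

    # Hypothetical search space (not ATHENa's real option names):
    option_sets = [{"opt_mode": "speed"}, {"opt_mode": "area"}]
    placement_seeds = [1, 11, 21]            # placement starting points
    requested_clks_mhz = [100, 150, 200]     # requested clock frequencies

    def run_flow(options, seed, clk_mhz):
        # Would invoke the vendor tools and parse their reports;
        # returns the achieved clock frequency (dummy value here).
        return 0.0

    best = None
    for opts, seed, clk in product(option_sets, placement_seeds,
                                   requested_clks_mhz):
        achieved = run_flow(opts, seed, clk)
        if best is None or achieved > best[0]:
            best = (achieved, opts, seed, clk)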
SHA-1 Results
[Chart: throughput in Mbit/s for several SHA-1 architectures on Spartan 3, Virtex 4, and Virtex 5]
ATHENa Results for SHA-1, SHA-256 & SHA-512
[Chart: throughput in Mb/s (0–2000) per FPGA family (spartan3, virtex4, virtex5, cyclone2, cyclone3, stratix2, stratix3) for sha1, sha256, and sha512]
Ideas (1)
• Select several representative FPGA platforms with significantly different properties, e.g.:
  – vendor: Xilinx vs. Altera
  – process: 90 nm vs. 65 nm
  – LUT size: 4-input vs. 6-input
  – optimization target: low-cost vs. high-performance
• Use ATHENa to characterize all SHA-3 candidates and SHA-2 on these platforms in terms of the target performance metrics (e.g., throughput/area ratio)
Ideas (2)
• Calculate the ratio of SHA-3 candidate performance to SHA-2 performance (at the same security level)
• Calculate the geometric mean of this ratio over multiple platforms (see the sketch below)
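A minimal sketch of this calculation in Python (the platform names are real, but the throughput/area numbers are placeholders, not measured results):

    from math import prod

    # Hypothetical throughput/area results (Mbit/s per slice or LE):
    sha2      = {"spartan3": 0.9, "virtex5": 2.1, "stratix3": 1.8}
    candidate = {"spartan3": 1.2, "virtex5": 2.6, "stratix3": 1.5}

    # Per-platform ratio of candidate vs. SHA-2 at the same security level:
    ratios = [candidate[p] / sha2[p] for p in sha2]

    # Geometric mean over platforms, so no single platform dominates:
    geo_mean = prod(ratios) ** (1 / len(ratios))
    print(f"overall ratio vs. SHA-2: {geo_mean:.2f}")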
Xilinx FPGA Devices

Technology   Low-cost    High-performance
120/150 nm   –           Virtex 2, 2 Pro
90 nm        Spartan 3   Virtex 4
65 nm        –           Virtex 5
45 nm        Spartan 6   –
40 nm        –           Virtex 6
Xilinx FPGA Device Support by Tools

Version               Low-cost                                       High-performance
Xilinx ISE 10.1       All, up to Virtex 5                            All, up to Virtex 5
Xilinx WebPACK 11.1   Smallest, up to Virtex 5                       Smallest, up to Virtex 5
Xilinx WebPACK 11.3   Smallest, up to Virtex 5; smallest Spartan 6   Smallest, up to Virtex 5; smallest Virtex 6
Altera FPGA Devices

Technology   Low-cost      Mid-range   High-performance
130 nm       Cyclone       –           Stratix
90 nm        Cyclone II    –           Stratix II
65 nm        Cyclone III   Arria I     Stratix III
40 nm        Cyclone IV    Arria II    Stratix IV
Altera FPGA Device Support by Tools

Version                     Low-cost                               Mid-range                            High-performance
Quartus 7.1                 Cyclone III all, Cyclone IV none       Arria GX all, Arria II GX none       Stratix II smallest, Stratix III none
Quartus 8.1                 Cyclone III all, Cyclone IV none       Arria GX all, Arria II GX none       Stratix I, II, III smallest
Quartus 9.0 sp2 (Sep. 09)   Cyclone III all, Cyclone IV none       Arria GX all, Arria II GX none       Stratix I, II, III smallest
Quartus 9.1 (Nov. 09)       Cyclone III all, Cyclone IV smallest   Arria GX all, Arria II GX smallest   Stratix I, II, III all; Stratix IV none
FPGA and ASIC Performance Measures

The common ground is vague.
• Hardware performance: cycles per block, cycles per byte, latency (cycles), latency (ns), throughput for long messages, throughput for short messages, throughput at 100 kHz, clock frequency, clock period, critical path delay, modular exponentiations/s, point multiplications/s
• Hardware cost: slices, slices occupied, LUTs, 4-input LUTs, 6-input LUTs, FFs, gate equivalents (GE), size on ASIC, DSP blocks, BRAMs, number of cores, CLBs, MULs, XORs, NOTs, ANDs
• Hardware efficiency: hardware performance / hardware cost
Our Favorite Hardware Performance Metrics:
Mbit/s for Throughput
ns for Latency
These allow easy cross-comparison among implementations in software (microprocessors), FPGAs (various vendors), and ASICs (various libraries).
But how to define and measure throughput and latency for hash functions?

Time to hash N blocks of message:
Htime(N, T_CLK) = Initialization Time(T_CLK) + N · Block Processing Time(T_CLK) + Finalization Time(T_CLK)

Latency = time to hash ONE block of message:
Latency = Htime(1, T_CLK) = Initialization Time + Block Processing Time + Finalization Time

Throughput (for long messages):
Throughput = Block size / (Htime(N+1, T_CLK) − Htime(N, T_CLK)) = Block size / Block Processing Time(T_CLK)
But how to define and measure throughput and latency for hash functions?

Initialization Time(T_CLK) = cycles_I · T_CLK
Block Processing Time(T_CLK) = cycles_P · T_CLK
Finalization Time(T_CLK) = cycles_F · T_CLK

Block size: from the specification.
cycles_I, cycles_P, cycles_F: from analysis of the block diagram and/or functional simulation.
T_CLK: from the place & route report (or experiment).
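Putting these formulas together, a minimal sketch in Python (the cycle counts, block size, and clock period below are placeholders, not numbers for any particular hash core):

    def latency_ns(cycles_i, cycles_p, cycles_f, t_clk_ns):
        # Latency = Htime(1, T_CLK): time to hash ONE message block.
        return (cycles_i + cycles_p + cycles_f) * t_clk_ns

    def throughput_mbps(block_size_bits, cycles_p, t_clk_ns):
        # Long-message throughput = block size / block processing time.
        return block_size_bits / (cycles_p * t_clk_ns) * 1e3  # Mbit/s

    # Placeholders: 512-bit block, 65 cycles per block, 10 ns clock.
    print(latency_ns(4, 65, 8, 10.0))       # ns to hash a one-block message
    print(throughput_mbps(512, 65, 10.0))   # Mbit/s for long messages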
How to compare hardware speed vs. software speed?

eBASH reports (http://bench.cr.yp.to/results-hash.html):
In graphs: Time(n) = time in clock cycles vs. message size in bytes for n-byte messages, with n = 0, 1, 2, 3, …, 2048, 4096
In tables: performance in cycles/byte for n = 8, 64, 576, 1536, 4096, and long messages

Performance for long messages = (Time(4096) − Time(2048)) / 2048
How to compare hardware speed vs. software speed?

Throughput [Gbit/s] = (8 [bits/byte] · clock frequency [GHz]) / performance for long messages [cycles/byte]
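For example, a one-line conversion in Python (the cycles/byte and frequency values are placeholders):

    def throughput_gbps(cycles_per_byte, clock_ghz):
        # Convert eBASH-style cycles/byte into Gbit/s.
        return 8 * clock_ghz / cycles_per_byte

    # Placeholder: a software implementation at 12 cycles/byte on a 3 GHz CPU.
    print(throughput_gbps(12, 3.0))  # -> 2.0 Gbit/s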
How to measure hardware cost in FPGAs?

1. Stand-alone cryptographic core on FPGA:
   cost of the smallest FPGA that can fit the core. Unit: USD [FPGA vendors would need to publish the MSRP (manufacturer's suggested retail price) of their chips, which is not very likely] or the size of the chip in mm² (easy to obtain).
2. Part of an FPGA System-on-Chip:
   a vector of resources: (CLB slices, BRAMs, MULs, DSP units) for Xilinx; (LEs, memory bits, PLLs, MULs, DSP units) for Altera.
3. FPGA prototype of an ASIC implementation:
   force the implementation to use only reconfigurable logic (no DSPs or multipliers; distributed memory instead of BRAM) and use CLB slices as the metric [LEs for Altera].
How to measure hardware cost in ASICs?

1. Stand-alone cryptographic core:
   Cost = f(die area, pin count); tables/formulas available from semiconductor foundries.
2. Part of an ASIC System-on-Chip:
   Cost ~ circuit area. Units: μm² or GE (gate equivalent = the size of a NAND2 cell).
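To make the GE unit concrete, a small sketch (the NAND2 cell area below is a made-up placeholder, not a value from any real standard-cell library):

    def gate_equivalents(core_area_um2, nand2_area_um2):
        # GE = core area divided by the area of a single NAND2 cell.
        return core_area_um2 / nand2_area_um2

    # Placeholder: a 50,000 um^2 core with a 2.5 um^2 NAND2 cell -> 20,000 GE.
    print(gate_equivalents(50_000, 2.5))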
Deliverables (1)
1. Detailed block diagram of the Datapath, with the names of all signals matching the VHDL code [electronic version a bonus]
2. Interface, with the division into the Datapath and the Controller [electronic version]
3. ASM charts of the Controller, and a block diagram of the connections among FSMs (if more than one is used) [electronic version a bonus]
4. RTL VHDL code of the Datapath, the Controller, and the Top-Level Circuit
5. Updated timing and area analysis formulas, with timing confirmed through simulation
Deliverables (2)
6. Report on verification
   – highest-level entity verified for functional correctness:
     • functional simulation
     • post-synthesis simulation
     • timing simulation [bonus]
   – verification of lower-level entities:
     • name of the entity
     • testbench used for verification
     • result of verification, incorrect behavior, possible source of error
Deliverables (3)
7. Results of benchmarking using ATHENa
   – entire core, or the highest-level entity verified for correct functionality
   – Xilinx Spartan 3, Virtex 4, Virtex 5
   – three methods of testing:
     • single_run
     • placement_search [cost_table = 1, 11, 21]
     • exhaustive_search [cost_table = 31, 41, 51; speed or area; two sets of requested frequencies]
   – results generated by ATHENa
   – your own graphs and charts
   – observations and conclusions
Bonus Deliverables (4)
8. Pseudocode [but not C code]
9. Bugs and suspicious behavior of ATHENa
10. Additional results of benchmarking using ATHENa
    – Altera Cyclone II, Stratix II, Cyclone III, Arria I, Stratix III
    – three methods of testing:
      • single_run
      • placement_search [seed = 1, 1000, 2000]
      • exhaustive_search [seed = 3000, 4000, 5000; speed or area; two sets of requested frequencies]
    – results generated by ATHENa
    – your own graphs and charts
    – observations and conclusions
Bonus Deliverables (5)
11. Report from the meeting with students working on the same SHA core
    – summary of major differences
    – advantages and disadvantages of your design
12. Bugs found in the:
    – padding script
    – testbench
    – class examples
    – slides
    – documentation
    – SHA-3 packages
    – etc.
Bonus Deliverables (6)
13. Extending the design to cover all hash function variants
    – hash value sizes: 512 [highest priority], 384, 224
    – other variant/parameter support specific to a given hash function
    – support through generics or constants
14. Padding in hardware (a software padding sketch follows below), assuming that the message size before padding is already a multiple of:
    – the word size
    – the byte size
    – a single bit
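As a software reference for item 14, a sketch of the SHA-256-style padding rule for the byte-aligned case (a Python model for checking the hardware, not the hardware itself):

    def sha256_pad(message: bytes, block_bytes: int = 64) -> bytes:
        # Append the 0x80 byte (a single 1 bit), then zeros, then the
        # 64-bit message length in bits, so that the total length is a
        # multiple of the block size.
        bit_len = 8 * len(message)
        padded = message + b"\x80"
        padded += b"\x00" * ((-len(padded) - 8) % block_bytes)
        return padded + bit_len.to_bytes(8, "big")

    assert len(sha256_pad(b"abc")) % 64 == 0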
Composition of Students
• 14 local students (with 3 former BS CpE graduates)
• 14 international students
• 4 GWU PhD candidates
After Grading
1. Summary of results published on the course web page
2. Selected students invited to develop articles/reports to be posted on the ATHENa web page and the SHA-3 Zoo web page
3. Unification, generalization, and optimization of codes by Ice, myself, and other students
4. Presentation to NIST, conference submissions, presentation at the Second SHA-3 Conference in Santa Barbara in August 2010