Page 1
Implementation of binary floating-point arithmetic on embedded integer processors
Polynomial evaluation-based algorithms and certified code generation
Guillaume Revy
Advisors: Claude-Pierre Jeannerod and Gilles Villard
Arénaire INRIA project-team (LIP, ENS Lyon), Université de Lyon, CNRS
Ph.D. Defense – December 1st, 2009
Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 1/45
Page 2
Motivation

Embedded systems are ubiquitous
• microprocessors dedicated to one or a few specific tasks
• satisfy constraints: area, energy consumption, design cost

Some embedded systems do not have any FPU (floating-point unit)

Highly used in audio and video applications
• demanding on floating-point computations

[Diagram: applications demanding FP computations run on embedded systems with no FPU ⇒ software implementing floating-point arithmetic.]
Page 7
Overview of the ST231 architecture
[Block diagram of the ST231 core: 4 integer units (IU), 2 multipliers (Mul), load/store unit (LSU), register file (64 registers, 8 read and 4 write ports), branch register file, PC and branch unit, I-side memory subsystem (ICache, ITLB, prefetch buffer, instruction buffer), D-side memory subsystem (DCache, DTLB/UTLB, write buffer, store unit), control registers, trap controller, interrupt controller (61 interrupts), 3 x 32-bit timers, debug support unit and debug link, 4 x SDI ports, SCU, CMC, 64-bit STBus interface, peripherals.]
4-issue VLIW 32-bit integer processor → no FPU

Parallel execution units
• 4 integer ALUs
• 2 pipelined 32 × 32 → 32 multipliers

Latencies: ALU → 1 cycle, Mul → 3 cycles

VLIW (Very Long Instruction Word) → instructions grouped into bundles
→ Instruction-Level Parallelism (ILP) explicitly exposed by the compiler
uint32_t R1 = A0 + C;
uint32_t R2 = A3 * X;
uint32_t R3 = A1 * X;
uint32_t R4 = X * X;
Cycle | Issue 1 | Issue 2 | Issue 3 | Issue 4
  0   |   R1    |   R2    |   R3    |
  1   |   R4    |         |         |
Page 10
How to emulate floating-point arithmetic in software?
Design and implementation of efficient software support for IEEE 754 floating-point arithmetic on integer processors

Existing software for IEEE 754 floating-point arithmetic:
• software floating-point support of GCC, Glibc and µClibc; GoFast Floating-Point Library
• SoftFloat (→ STlib)
• FLIP (Floating-point Library for Integer Processors)
  - software support for binary32 floating-point arithmetic on integer processors
  - correctly-rounded addition, subtraction, multiplication, division, square root, reciprocal, ...
  - handling of subnormals and of special inputs
Page 11
Towards the generation of fast and certified codes
Underlying problem: development “by hand”
• long and tedious, error-prone
• new target? new floating-point format?
⇒ need for automation and certification

Current challenge: tools and methodologies for the automatic generation of efficient and certified programs
• optimized for a given format and for the target architecture
Page 14
Towards the generation of fast and certified codes
Arénaire’s developments: hardware (FloPoCo) and software (Sollya, Metalibm)

Spiral project: hardware and software code generation for DSP algorithms
“Can we teach computers to write fast libraries?”

Our tool: CGPE (Code Generation for Polynomial Evaluation)
In the particular case of polynomial evaluation, can we teach computers to write fast and certified codes, for a given target and optimized for a given format?
Page 16
Basic blocks for implementing correctly-rounded operators (X, Y)

[Flowchart, from inputs (X, Y) to result R:
 special input detection → (yes) special output selection;
 (no) floating-point number unpacking → normalization → range reduction →
 result significand approximation → rounding condition decision →
 correct rounding computation → result sign/exponent computation →
 result reconstruction.
 Special input detection and special output selection are function-independent;
 the remaining blocks are function-dependent.]

Objectives
→ low latency, correctly-rounded implementations
→ ILP exposure
Page 18
Basic blocks for implementing correctly-rounded operators (X, Y)

[Same flowchart, with the function-dependent blocks marked for fully automated generation: from the problem (function to be evaluated) and the ST231 features, generate efficient and certified C code (C code + certificate), via the computation of a polynomial approximant.]

Uniform approach for nth roots and their reciprocals → polynomial evaluation
Extension to division
Page 19
Flowchart for generating efficient and certified C codes
[Flowchart: from the problem (function to be evaluated), compute a polynomial approximant, then generate C code and a certificate using the ST231 features; output: efficient and certified C code.]

Constraints
• accuracy of the approximant and of the C code
  → Sollya; interval arithmetic (MPFI), Gappa
• low evaluation latency on the ST231, ILP exposure
  → CGPE
• efficiency of the generation process
Page 26
Outline of the talk

1. Design and implementation of floating-point operators
   - Bivariate polynomial evaluation-based approach
   - Implementation of correct rounding
2. Low latency parenthesization computation
   - Classical evaluation methods
   - Computation of all parenthesizations
   - Towards low evaluation latency
3. Selection of effective evaluation parenthesizations
   - General framework
   - Automatic certification of generated C codes
4. Numerical results
5. Conclusions and perspectives
Page 27
Design and implementation of floating-point operators

Outline of the talk

1. Design and implementation of floating-point operators
   - Bivariate polynomial evaluation-based approach
   - Implementation of correct rounding
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Page 28
Notation and assumptions

Division C code: (x, y) ↦ RN(x/y)

Input (x, y) and output RN(x/y): normal numbers
→ no underflow nor overflow
→ precision p, extremal exponents emin, emax

  x = ±1.mx,1 ... mx,p−1 · 2^ex,  with ex ∈ {emin, ..., emax}

→ RN: round-to-nearest, roundTiesToEven
Page 30
Notation and assumptions

Division C code: (X, Y) ↦ R

Standard binary encoding: the k-bit unsigned integer X encodes the input x:

  | sx (1 bit) | Ex (w = k − p bits) | Tx (p − 1 bits) |

with sign bit sx, biased exponent Ex = ex − emin + 1, and trailing significand Tx = mx,1 ... mx,p−1.

Computation: k-bit unsigned integers
→ integer and fixed-point arithmetic
Page 32
Design and implementation of floating-point operators / Bivariate polynomial evaluation-based approach

Range reduction of division

Express the exact result r = x/y as

  r = ℓ · 2^d  ⇒  RN(x/y) = RN(ℓ) · 2^d,  with ℓ ∈ [1, 2) and d ∈ {emin, ..., emax}

Definition: c = 1 if mx ≥ my, and c = 0 otherwise

Range reduction:

  x/y = (2^(1−c) · mx/my) · 2^d,  with ℓ := 2^(1−c) · mx/my ∈ [1, 2) and d = ex − ey − 1 + c

How to compute the correctly-rounded significand RN(ℓ)?
Page 36
Methods for computing the correctly-rounded significand

Iterative methods: restoring, non-restoring, SRT, ...
• Oberman and Flynn (1997)
• minimal ILP exposure, sequential algorithm

Multiplicative methods: Newton-Raphson, Goldschmidt
• Pineiro and Bruguera (2002); Raina’s Ph.D., FLIP 0.3 (2006)
• exploit available multipliers, more ILP exposure

Polynomial-based methods
• Agarwal, Gustavson and Schmookler (1999) → univariate polynomial evaluation
• our approach → bivariate polynomial evaluation: maximal ILP exposure
Page 39
Correct rounding via truncated one-sided approximation

How to compute RN(ℓ), with ℓ = 2^(1−c) · mx/my ?

Three steps for correct rounding computation
1. compute v = 1.v1 ... v(k−2) such that −2^(−p) ≤ ℓ − v < 0
   → implied by |(ℓ + 2^(−p−1)) − v| < 2^(−p−1)
   → bivariate polynomial evaluation
2. compute u as the truncation of v after p fraction bits
3. determine RN(ℓ) after possibly adding 2^(−p)

How to compute the one-sided approximation v, and then deduce RN(ℓ)?
Page 41
One-sided approximation via bivariate polynomials

1. Consider ℓ + 2^(−p−1) as the exact result of the function
     F(s, t) = s/(1 + t) + 2^(−p−1)
   at the points s* = 2^(1−c) · mx and t* = my − 1

2. Approximate F(s, t) by a bivariate polynomial P(s, t):
     P(s, t) = s · a(t) + 2^(−p−1)
   → a(t): univariate polynomial approximant of 1/(1 + t)
   → approximation error Eapprox

3. Evaluate P(s, t) by a well-chosen efficient evaluation program P:
     v = P(s*, t*)
   → evaluation error Eeval

How to ensure that |(ℓ + 2^(−p−1)) − v| < 2^(−p−1) ?
Page 45
Sufficient error bounds

To ensure |(ℓ + 2^(−p−1)) − v| < 2^(−p−1), it suffices to ensure that

  µ · Eapprox + Eeval < 2^(−p−1),

since

  |(ℓ + 2^(−p−1)) − v| ≤ µ · Eapprox + Eeval,  with µ = 4 − 2^(3−p).

This gives the following sufficient conditions:

  Eapprox ≤ θ with θ < 2^(−p−1)/µ  ⇒  Eeval < η = 2^(−p−1) − µ · θ
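The splitting behind this bound can be made explicit; the first step is the triangle inequality, the second uses P(s, t) = s · a(t) + 2^(−p−1):

```latex
\begin{align*}
\bigl|(\ell + 2^{-p-1}) - v\bigr|
  &\le \bigl|F(s^*,t^*) - P(s^*,t^*)\bigr| + \bigl|P(s^*,t^*) - v\bigr| \\
  &=    s^* \cdot \bigl|1/(1+t^*) - a(t^*)\bigr| + \bigl|P(s^*,t^*) - v\bigr| \\
  &\le \mu \cdot E_{\mathrm{approx}} + E_{\mathrm{eval}},
  \qquad \text{since } 0 < s^* \le \mu .
\end{align*}
```

The bound s* ≤ µ holds because either c = 1 and s* = mx < 2, or c = 0, in which case mx ≤ my − 2^(1−p) and hence s* = 2mx ≤ 4 − 2^(3−p).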
Page 48
Example for the binary32 division

Sufficient conditions with µ = 4 − 2^(−21):

  Eapprox ≤ θ with θ < 2^(−25)/µ  and  Eeval < η = 2^(−25) − µ · θ

Approximation of 1/(1 + t) by a Remez-like polynomial of degree 10

[Plot: absolute approximation error of a(t) for t ∈ [0, 1), oscillating within about ±6 · 10^(−9), against the required bound.]

• Eapprox ≤ θ, with θ = 3 · 2^(−29) ≈ 6 · 10^(−9)
• Eeval < η, with η ≈ 7.4 · 10^(−9)
Page 50
Flowchart for generating efficient and certified C codes

[Flowchart: computation of the polynomial approximant (from F(s,t), with Eapprox ≤ θ) → computation of low latency parenthesizations (ST231 features) → selection of effective parenthesizations (Eeval < η, ST231 features) → C code + certificate. The evaluation program returns v with |(ℓ + 2^(−p−1)) − v| < 2^(−p−1); u is obtained from v by truncation.]
Page 51
Design and implementation of floating-point operators / Implementation of correct rounding

Rounding condition: definition

Approximation u of ℓ, with ℓ = 2^(1−c) · mx/my

The exact value ℓ may have an infinite number of bits
→ the sticky bit cannot always be computed

[Figure: the two possible positions of u relative to ℓ, between consecutive floating-point numbers and midpoints.]

Computing RN(ℓ) requires being able to decide whether u ≥ ℓ
→ ℓ cannot be a midpoint

Rounding condition: u ≥ ℓ ⟺ u · my ≥ 2^(1−c) · mx
Page 54
Design and implementation of floating-point operators Implementation of correct rounding
Rounding condition: implementation in integer arithmetic
Rounding condition: u ·my ≥ 21−c ·mx
Approximation u and my : representable with 32 bits
u
my×
u · my
I u ·my is exactly representable with 64 bits
I 21−c ·mx is representable with 32 bits since c ∈ {0,1}
⇒ one 32×32→ 32-bit multiplication and one comparison
Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 20/45
Page 55
Design and implementation of floating-point operators Implementation of correct rounding
Rounding condition: implementation in integer arithmetic
Rounding condition: u ·my ≥ 21−c ·mx
Approximation u and my : representable with 32 bits
u
my×
u · my
21−c · mx
I u ·my is exactly representable with 64 bitsI 21−c ·mx is representable with 32 bits since c ∈ {0,1}
⇒ one 32×32→ 32-bit multiplication and one comparison
Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 20/45
Page 56
Design and implementation of floating-point operators Implementation of correct rounding

Rounding condition: implementation in integer arithmetic

Rounding condition: u · my ≥ 2^(1−c) · mx

The approximation u and my are both representable with 32 bits.

[Figure: the 64-bit product u · my compared (≥) with 2^(1−c) · mx.]

- u · my is exactly representable with 64 bits
- 2^(1−c) · mx is representable with 32 bits since c ∈ {0,1}

⇒ one 32×32→32-bit multiplication and one comparison
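As an illustration, the test can be sketched in portable C. This is a sketch only: the names are ours, and it uses the full 64-bit product for clarity, whereas the ST231 implementation gets away with a single 32×32→32-bit multiplication.

```c
#include <stdint.h>

/* Sketch of the rounding test  u >= l  <=>  u*my >= 2^(1-c)*mx.
 * Names are hypothetical. u and my fit in 32 bits, so the product
 * u*my is exact in 64 bits; mx << (1-c) fits since c is 0 or 1. */
static inline int rounding_condition(uint32_t u, uint32_t my,
                                     uint32_t mx, int c) {
    uint64_t lhs = (uint64_t)u * (uint64_t)my;  /* exact 64-bit product */
    uint64_t rhs = (uint64_t)mx << (1 - c);     /* 2^(1-c) * mx         */
    return lhs >= rhs;
}
```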
Page 58
Design and implementation of floating-point operators Implementation of correct rounding

Flowchart for generating efficient and certified C codes

[Flowchart: computation of a polynomial approximant a(t) with Eapprox ≤ θ, then computation of low latency parenthesizations (ST231 features), then selection of effective parenthesizations (Eeval < η, ST231 features), producing the C code and its certificate.]
Page 59
Low latency parenthesization computation
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
   Classical evaluation methods
   Computation of all parenthesizations
   Towards low evaluation latency
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Page 60
Low latency parenthesization computation Classical evaluation methods

Objectives

Compute an efficient parenthesization for evaluating P(s, t)
→ reduces the evaluation latency on unbounded parallelism

Evaluation program P = main part of the full software implementation
→ dominates the cost

Two families of algorithms
- algorithms with coefficient adaptation: Knuth and Eve (1960s), Paterson and Stockmeyer (1973), ...
  → ill-suited in the context of fixed-point arithmetic
- algorithms without coefficient adaptation
Page 64
Low latency parenthesization computation Classical evaluation methods

Classical parenthesizations for binary32 division

P(s, t) = 2^(−25) + s · ∑_{0 ≤ i ≤ 10} a_i · t^i

Horner's rule: (3+1)×11 = 44 cycles
→ no ILP exposure

Second-order Horner's rule: 27 cycles
→ evaluation of the odd and even parts independently with Horner, more ILP

Estrin's method: 19 cycles
→ evaluation of the high and low parts in parallel, even more ILP
→ distributing the multiplication by s in the evaluation of a(t): 16 cycles
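The contrast between these shapes can be seen on a small degree-3 example (double precision for readability; the thesis's codes work in 32-bit fixed-point). Both parenthesizations compute the same polynomial, but Estrin's critical path is shallower:

```c
/* Horner: fully sequential chain of multiply-adds. */
static double horner3(double t, const double a[4]) {
    return a[0] + t * (a[1] + t * (a[2] + t * a[3]));
}

/* Estrin: t*t, the low pair and the high pair are independent,
 * so they can be evaluated in parallel. */
static double estrin3(double t, const double a[4]) {
    double t2 = t * t;
    double lo = a[0] + t * a[1];
    double hi = a[2] + t * a[3];
    return lo + t2 * hi;
}
```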
... We can do better.
How to explore the solution space of parenthesizations?
Page 68
Low latency parenthesization computation Computation of all parenthesizations

Algorithm for computing all parenthesizations

a(x, y) = ∑_{0 ≤ i ≤ nx} ∑_{0 ≤ j ≤ ny} a_{i,j} · x^i · y^j, with n = nx + ny and a_{nx,ny} ≠ 0

Example
Let a(x, y) = a0,0 + a1,0 · x + a0,1 · y + a1,1 · x · y. Then
a1,0 + a1,1 · y is a valid expression, while a1,0 · x + a1,1 · x is not.

Exhaustive algorithm: iterative process
→ step k = computation of all the valid expressions of total degree k

3 building rules for computing all parenthesizations
Page 70
Low latency parenthesization computation Computation of all parenthesizations

Rules for building valid expressions

Consider step k of the algorithm:
- E(k): valid expressions of total degree k
- P(k): powers x^i y^j of total degree k = i + j
Rule R1 for building the powers: p = p1 · p2, with
deg(p) = deg(p1) + deg(p2), deg(p1) ≤ ⌊k/2⌋, and ⌈k/2⌉ ≤ deg(p2) < k
Rule R2 for building expressions by multiplication: e = e′ · p, with
deg(e) = deg(e′) + deg(p), deg(e′) < k, and deg(p) ≤ k
Rule R3 for building expressions by addition: e = e1 + e2, with
deg(e) = max(deg(e1), deg(e2)), deg(e1) = k, and deg(e2) ≤ k
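Rule R1's balanced split is what keeps the evaluation depth logarithmic. A minimal sketch (our own illustration, counting only multiplication levels):

```c
/* Depth, in multiplications, of building x^k when every power is
 * split as x^floor(k/2) * x^ceil(k/2), as in rule R1. */
static int power_depth(int k) {
    if (k <= 1)
        return 0;                        /* x itself is an input      */
    int lo = power_depth(k / 2);         /* depth of x^floor(k/2)     */
    int hi = power_depth(k - k / 2);     /* depth of x^ceil(k/2)      */
    return 1 + (lo > hi ? lo : hi);      /* one multiplication on top */
}
```

For instance the depth of x^10 is 4 = ⌈log2 10⌉ multiplication levels.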
Page 74
Low latency parenthesization computation Computation of all parenthesizations

Number of parenthesizations

Number of generated parenthesizations for evaluating a bivariate polynomial:

          nx = 1   nx = 2         nx = 3                 nx = 4            nx = 5     nx = 6
ny = 0    1        7              163                    11602             2334244    1304066578
ny = 1    51       67467          1133220387             207905478247998   ...        ...
ny = 2    67467    106191222651   10139277122276921118   ...               ...        ...

Timings for parenthesization computation:
→ for a univariate polynomial of degree 5: ≈ 1 h on a 2.4 GHz core
→ for a bivariate polynomial of degree (2,1): ≈ 30 s
→ for P(s, t) of degree (3,1): ≈ 7 s (88384 schemes)

Optimization for univariate polynomials and P(s, t):
→ univariate polynomial of degree 5: ≈ 4 min
→ for P(s, t) of degree (3,1): ≈ 2 s (88384 schemes)
Page 75
Low latency parenthesization computation Computation of all parenthesizations
Number of parenthesizations
[Histogram: number of degree-5 parenthesizations (logarithmic scale, 10 to 10^6) versus latency on unbounded parallelism (10 to 20 cycles).]

→ minimal latency for a univariate polynomial of degree 5: 10 cycles (36 schemes)
How to compute only parenthesizations of low latency?
Page 77
Low latency parenthesization computation Towards low evaluation latency

Determination of a target latency

Target latency τ = minimal cost for evaluating a0,0 + anx,ny · x^nx · y^ny
- if no scheme satisfies τ then increase τ and restart

Static target latency τstatic
- as general as evaluating a0,0 + x^(nx+ny+1)
- τstatic = A + M × ⌈log2(nx + ny + 1)⌉

Dynamic target latency τdynamic
- takes the cost of operations on anx,ny and the delays on indeterminates into account
- computed by dynamic programming
Example
Degree-9 bivariate polynomial: nx = 8 and ny = 1
Latencies: A = 1 and M = 3
Delay: y available 9 cycles later than x

τstatic = 1 + 3 × ⌈log2(10)⌉ = 13 cycles        τdynamic = 16 cycles
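The static bound can be checked mechanically (a sketch; A and M denote the add and mul latencies, as above):

```c
/* tau_static = A + M * ceil(log2(nx + ny + 1)) */
static int tau_static(int nx, int ny, int A, int M) {
    int n = nx + ny + 1;
    int lg = 0;
    while ((1 << lg) < n)               /* lg = ceil(log2(n)) for n >= 1 */
        lg++;
    return A + M * lg;
}
```

With nx = 8, ny = 1, A = 1 and M = 3, this gives 1 + 3·⌈log2 10⌉ = 13 cycles, matching the example.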
Page 81
Low latency parenthesization computation Towards low evaluation latency

Optimized search of best parenthesizations

Example
Let a(x, y) be a degree-2 bivariate polynomial:
a(x, y) = a0,0 + a1,0 · x + a0,1 · y + a1,1 · x · y.

⇒ find a best splitting of the polynomial → low latency
Candidate splittings include:
(a0,0 + a1,0 · x + a0,1 · y) + (a1,1 · x · y)
((a0,0 + a1,0 · x) + a0,1 · y) + (a1,1 · x · y)
(a0,0 + (a1,0 · x + a0,1 · y)) + (a1,1 · x · y)
(a0,0 + a1,0 · x) + (a0,1 · y + a1,1 · x · y)
a0,0 + (a1,0 · x + a0,1 · y + a1,1 · x · y)
[Diagram: two-level splitting search. Level 1 splits a(x, y) into a0,0, a′(x, y), and anx,ny · x^nx · y^ny; level 2 splits a′(x, y) into a′1(x, y) and a′2(x, y) when its support exceeds a threshold max, and performs an exhaustive search otherwise; the best candidate a′(x, y) is kept.]
Page 88
Low latency parenthesization computation Towards low evaluation latency
Efficient evaluation parenthesization generation
P(s, t) = 2^(−25) + s · ∑_{0 ≤ i ≤ 10} a_i · t^i

First target latency τ = 13 → no parenthesization found
Second target latency τ = 14 → parenthesization obtained in about 10 s

Classical methods:
- Horner: 44 cycles
- Estrin: 19 cycles
- Estrin with s distributed: 16 cycles
[Evaluation tree: the generated parenthesization of P(s, t), scheduled over 14 cycles on unbounded parallelism, with inputs s, t, v, coefficients a0, ..., a10, and the constant 2^(−25).]
Page 91
Low latency parenthesization computation Towards low evaluation latency

Flowchart for generating efficient and certified C codes

[Flowchart: computation of a polynomial approximant a(t) with Eapprox ≤ θ, then computation of low latency parenthesizations (ST231 features), then selection of effective parenthesizations (Eeval < η, ST231 features), producing the C code and its certificate.]
Page 92
Selection of effective evaluation parenthesizations
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
   General framework
   Automatic certification of generated C codes
4. Numerical results
5. Conclusions and perspectives
Page 93
Selection of effective evaluation parenthesizations General framework

Selection of effective parenthesizations

1. Arithmetic operator choice
   - all intermediate variables are of constant sign
2. Scheduling on a simplified model of the ST231
   - constraints of the architecture: cost of operators, instruction bundling, ...
   - delays on indeterminates
3. Certification of the generated C code
   - straight-line polynomial evaluation program
   - "certified C code": we can bound the evaluation error in integer arithmetic
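To fix ideas, a generated straight-line program might look like the following degree-2 Horner step in unsigned Q0.32 arithmetic. This is our own sketch, not FLIP's actual output; mul_hi models a truncating 32×32→32-bit "multiply high":

```c
#include <stdint.h>

/* Truncated Q0.32 product: high 32 bits of the 64-bit product. */
static inline uint32_t mul_hi(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Straight-line evaluation of a0 + t*(a1 + t*a2): no branches,
 * one fresh variable per operation, so the rounding error of each
 * step can be bounded separately (e.g. by Gappa). */
static uint32_t eval_deg2(uint32_t t, uint32_t a0, uint32_t a1, uint32_t a2) {
    uint32_t r0 = mul_hi(t, a2);
    uint32_t r1 = a1 + r0;
    uint32_t r2 = mul_hi(t, r1);
    uint32_t r3 = a0 + r2;
    return r3;
}
```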
Page 96
Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of the evaluation error for binary32 division

Sufficient conditions, with µ = 4 − 2^(−21):
Eapprox ≤ θ with θ < 2^(−25)/µ, and Eeval < η = 2^(−25) − µ · θ
[Plot: absolute approximation error over t ∈ [0, 1), against the required bound.]

- Eapprox ≤ θ, with θ = 3 · 2^(−29) ≈ 6 · 10^(−9)
- Eeval < η, with η ≈ 7.4 · 10^(−9)
Page 97
Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of the evaluation error for binary32 division

Case 1: mx ≥ my → condition satisfied
Case 2: mx < my → condition not satisfied: Eeval ≥ η at
s* = 3.935581684112548828125 and t* = 0.97490441799163818359375
[Plot: approximation error over t ∈ [0.965, 0.995], with the required bound 2^(−25)/(4 − 2^(−21)) ≈ 8 · 10^(−9) and the approximation error bound θ = 3 · 2^(−29) ≈ 6 · 10^(−9).]
1. determine an interval I around this point
2. compute Eapprox over I
3. determine an evaluation error bound η
4. check whether Eeval < η
Page 100
Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of the evaluation error for binary32 division

Sufficient conditions for each subinterval, with µ = 4 − 2^(−21):
E(i)approx ≤ θ(i) with θ(i) < 2^(−25)/µ, and E(i)eval < η(i) = 2^(−25) − µ · θ(i)
[Plot: absolute approximation error over t ∈ [0, 1), against the required bound, per subinterval.]

- E(i)approx ≤ θ(i)
- E(i)eval < η(i)
Page 102
Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification using a dichotomy-based strategy

Implementation of the splitting by dichotomy: for each subinterval T(i),
1. compute a certified approximation error bound θ(i)    [Sollya]
2. determine an evaluation error bound η(i)    [Sollya]
3. check this bound: E(i)eval < η(i)    [Gappa]
⇒ if this bound is not satisfied, T(i) is split into 2 subintervals

Example of a binary32 implementation:
→ launched on a 64-processor grid
→ 36127 subintervals found in several hours (≈ 5 h)
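The splitting loop itself is simple; the work lies in the per-interval checks. A sketch, where checked() is a stub standing in for the Sollya/Gappa verification of one subinterval (here it merely accepts intervals narrower than a chosen width):

```c
/* Dichotomy: certify [lo, hi) as a whole if possible, otherwise split
 * it in two and recurse; returns the number of certified subintervals. */
static double stub_width;                 /* stub acceptance criterion  */

static int checked(double lo, double hi) {
    return hi - lo <= stub_width;         /* stands in for Sollya/Gappa */
}

static int certify(double lo, double hi) {
    if (checked(lo, hi))
        return 1;                         /* bound holds on [lo, hi)    */
    double mid = 0.5 * (lo + hi);
    return certify(lo, mid) + certify(mid, hi);
}
```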
Page 105
Numerical results
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Page 106
Numerical results
Performances of FLIP on ST231
[Bar charts: latencies (# cycles) of add, sub, mul, div and sqrt for FLIP 1.0, FLIP 0.3 and STlib, and the corresponding speed-ups (%) of FLIP 1.0 vs STlib and vs FLIP 0.3.]

Performances on ST231, in RoundTiesToEven
⇒ speed-up between 20 and 50 %

Implementations of other operators (latency in # cycles on ST231, in RoundTiesToEven):

x^(−1)   x^(−1/2)   x^(1/3)   x^(−1/3)   x^(−1/4)
  25        29         34        40         42
Page 108
Numerical results
Impact of the dynamic target latency

Latency (# cycles) on unbounded parallelism and on the ST231:

                                                x^(1/3)   x^(−1/3)
Degree (nx, ny)                                  (8,1)     (9,1)
Delay on the operand s (# cycles)                  9         9
Static target latency                             13        13
Dynamic target latency                            16        16
Latency on unbounded parallelism and on ST231     16        16

⇒ allows us to conclude on the optimality in terms of polynomial evaluation latency
Page 110
Numerical results
Timings for code generation

                                   x^1/2   x^−1/2   x^1/3   x^−1/3   x^−1
  Degree (nx, ny)                  (8,1)   (9,1)    (8,1)   (9,1)    (10,0)
  Static target latency             13      13       13      13       13
  Dynamic target latency            13      13       16      16       13
  Latency on unbounded parallelism  13      13       16      16       13
  Latency on ST231                  13      14       16      16       13
  Parenthesization generation      172ms   152ms    53s     56s      168ms
  Arithmetic operator choice        6ms     6ms      7ms     11ms     4ms
  Scheduling                        29s    4m21s    32ms    132ms    7s
  Certification (Gappa)             6s      4s      1m38s   1m07s    11s
  Total time (≈)                   35s    4m25s    2m31s   2m03s    18s

Timing of each step of the generation flow
Impact of the target latency on the first step of the generation
What may dominate the cost
→ scheduling algorithm
→ certification using Gappa
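To give a feel for what the Gappa step certifies: it formally proves, for all inputs, a bound on the error of the generated fixed-point evaluation code. The sketch below is only a sampled sanity check of such a bound, not a proof (the Q0.31 format, the function names, and the degree-2 polynomial are illustrative assumptions, not the thesis's actual codes); Gappa establishes the bound exhaustively and formally rather than by sampling.

```c
#include <stdint.h>
#include <math.h>

/* Q0.31 unsigned fixed point: value = v / 2^31; multiplication truncates,
   losing strictly less than 1 ulp (= 2^-31) per product. */
static inline uint32_t mul31(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 31);
}

/* Fixed-point Horner for p(t) = c0 + c1*t + c2*t^2, all values in [0,1). */
uint32_t horner_q31(const uint32_t c[3], uint32_t t) {
    return c[0] + mul31(t, c[1] + mul31(t, c[2]));
}

/* Sampled estimate of max |fixed-point result - exact result|, in ulps.
   Two truncated products, each losing < 1 ulp (the inner one further
   scaled by t < 1), give a provable bound of 2 ulps; sampling can only
   give evidence for it, which is why a tool like Gappa is needed. */
double sampled_error(const uint32_t c[3], int samples) {
    const double scale = 2147483648.0;  /* 2^31 */
    double worst = 0.0;
    for (int i = 0; i < samples; i++) {
        uint32_t t = (uint32_t)(((uint64_t)i << 31) / samples);
        double td = t / scale;
        double exact = c[0]/scale + td*(c[1]/scale + td*(c[2]/scale));
        double err = fabs(horner_q31(c, t)/scale - exact) * scale;
        if (err > worst) worst = err;
    }
    return worst;
}
```

A Gappa script for the real generated codes encodes the same question, the rounding of each fixed-point operation and the target error bound, as a logical statement to be proved.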
Page 113
Conclusions and perspectives
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Page 114
Conclusions and perspectives: Conclusions
Design and implementation of floating-point operators
→ uniform approach for correctly-rounded roots and their reciprocals
→ extension to correctly-rounded division
→ polynomial evaluation-based method, exposing very high ILP
⇒ new, much faster version of FLIP

Code generation for efficient and certified polynomial evaluation
→ methodologies and tools for automating the implementation of polynomial evaluation
→ heuristics and techniques for quickly generating efficient and certified C codes
⇒ CGPE: automatically writes and certifies ≈ 50 % of the codes of FLIP
Page 117
Conclusions and perspectives: Perspectives
Faithful implementation of floating-point operators
→ other floating-point operators: log2(1+x) over [0.5,1), 1/√(1+x²) over [0,0.5), ...
→ roots and their reciprocals: deciding the rounding condition is not automated yet

Extension to other binary floating-point formats
→ square root in binary64: 171 cycles on ST231, versus 396 cycles with STlib

Extension to other architectures, typically FPGAs
→ the polynomial evaluation-based approach already seems to be a good alternative to multiplicative methods on FPGAs
→ the other techniques introduced in this thesis should be investigated further
Page 120
Implementation of binary floating-point arithmetic on embedded integer processors
Polynomial evaluation-based algorithms and certified code generation

Guillaume Revy
Advisors: Claude-Pierre Jeannerod and Gilles Villard

Arénaire INRIA project-team (LIP, ENS Lyon) – Université de Lyon – CNRS
Ph.D. Defense – December 1st, 2009
Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 45/45