Page 1
Implementation of binary floating-point arithmetic on embedded integer processors
Polynomial evaluation-based algorithms and certified code generation
Guillaume Revy
Advisors: Claude-Pierre Jeannerod and Gilles Villard
Arénaire INRIA project-team (LIP, ENS Lyon), Université de Lyon, CNRS
Ph.D. Defense – December 1st, 2009
Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 1/45
Page 2
Motivation

Embedded systems are ubiquitous
• microprocessors dedicated to one or a few specific tasks
• satisfy constraints: area, energy consumption, design cost

Some embedded systems do not have any FPU (floating-point unit)

Highly used in audio and video applications
• demanding on floating-point computations

[Diagram: applications demanding FP computations run on embedded systems with no FPU ⇒ software implementing floating-point arithmetic.]
Page 7
Overview of the ST231 architecture
[Block diagram of the ST231 core: 4 integer units (IU), 2 multipliers (Mul), load/store unit (LSU), register file (64 registers, 8 read and 4 write ports), branch register file, PC and branch unit, I-side memory subsystem (ICache, ITLB, prefetch buffer, instruction buffer), D-side memory subsystem (DCache, DTLB/UTLB, write buffer, store unit), control registers, trap controller, interrupt controller (61 interrupts), 3 x 32-bit timers, debug support unit and debug link, 4 x SDI ports, SCU, CMC, 64-bit STBus interface, peripherals.]
4-issue VLIW 32-bit integer processor → no FPU

Parallel execution units
• 4 integer ALUs
• 2 pipelined 32 × 32 → 32 multipliers

Latencies: ALU → 1 cycle, Mul → 3 cycles

VLIW (Very Long Instruction Word) → instructions grouped into bundles
→ Instruction-Level Parallelism (ILP) explicitly exposed by the compiler
uint32_t R1 = A0 + C;
uint32_t R2 = A3 * X;
uint32_t R3 = A1 * X;
uint32_t R4 = X * X;
Cycle | Issue 1 | Issue 2 | Issue 3 | Issue 4
  0   |   R1    |   R2    |   R3    |
  1   |   R4    |         |         |
Page 10
How to emulate floating-point arithmetic in software?
Design and implementation of efficient software support for IEEE 754 floating-point arithmetic on integer processors

Existing software for IEEE 754 floating-point arithmetic:
• software floating-point support of GCC, Glibc and µClibc; GoFast Floating-Point Library
• SoftFloat (→ STlib)
• FLIP (Floating-point Library for Integer Processors)
  - software support for binary32 floating-point arithmetic on integer processors
  - correctly-rounded addition, subtraction, multiplication, division, square root, reciprocal, ...
  - handling of subnormals and of special inputs
Page 11
Towards the generation of fast and certified codes
Underlying problem: development “by hand”
• long and tedious, error-prone
• new target? new floating-point format?
⇒ need for automation and certification

Current challenge: tools and methodologies for the automatic generation of efficient and certified programs
• optimized for a given format and for the target architecture
Page 14
Towards the generation of fast and certified codes
Arénaire’s developments: hardware (FloPoCo) and software (Sollya, Metalibm)

Spiral project: hardware and software code generation for DSP algorithms
“Can we teach computers to write fast libraries?”

Our tool: CGPE (Code Generation for Polynomial Evaluation)
In the particular case of polynomial evaluation, can we teach computers to write fast and certified codes, for a given target and optimized for a given format?
Page 16
Basic blocks for implementing correctly-rounded operators (X, Y)

[Flowchart, from inputs (X, Y) to result R:
 special input detection → (yes) special output selection;
 (no) floating-point number unpacking → normalization → range reduction →
 result significand approximation → rounding condition decision →
 correct rounding computation → result sign/exponent computation →
 result reconstruction.
 Special input detection and special output selection are function-independent;
 the remaining blocks are function-dependent.]

Objectives
→ low latency, correctly-rounded implementations
→ ILP exposure
Page 18
Basic blocks for implementing correctly-rounded operators (X, Y)

[Same flowchart, with the function-dependent blocks marked for fully automated generation: from the problem (function to be evaluated) and the ST231 features, generate efficient and certified C code (C code + certificate), via the computation of a polynomial approximant.]

Uniform approach for nth roots and their reciprocals → polynomial evaluation
Extension to division
Page 19
Flowchart for generating efficient and certified C codes
[Flowchart: from the problem (function to be evaluated), compute a polynomial approximant, then generate C code and a certificate using the ST231 features; output: efficient and certified C code.]

Constraints
• accuracy of the approximant and of the C code
  → Sollya; interval arithmetic (MPFI), Gappa
• low evaluation latency on the ST231, ILP exposure
  → CGPE
• efficiency of the generation process
Page 26
Outline of the talk

1. Design and implementation of floating-point operators
   - Bivariate polynomial evaluation-based approach
   - Implementation of correct rounding
2. Low latency parenthesization computation
   - Classical evaluation methods
   - Computation of all parenthesizations
   - Towards low evaluation latency
3. Selection of effective evaluation parenthesizations
   - General framework
   - Automatic certification of generated C codes
4. Numerical results
5. Conclusions and perspectives
Page 27
Design and implementation of floating-point operators

Outline of the talk

1. Design and implementation of floating-point operators
   - Bivariate polynomial evaluation-based approach
   - Implementation of correct rounding
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Page 28
Notation and assumptions

Division C code: (x, y) ↦ RN(x/y)

Input (x, y) and output RN(x/y): normal numbers
→ no underflow nor overflow
→ precision p, extremal exponents emin, emax

  x = ±1.mx,1 ... mx,p−1 · 2^ex,  with ex ∈ {emin, ..., emax}

→ RN: round-to-nearest, roundTiesToEven
Page 30
Notation and assumptions

Division C code: (X, Y) ↦ R

Standard binary encoding: the k-bit unsigned integer X encodes the input x:

  | sx (1 bit) | Ex (w = k − p bits) | Tx (p − 1 bits) |

with sign bit sx, biased exponent Ex = ex − emin + 1, and trailing significand Tx = mx,1 ... mx,p−1.

Computation: k-bit unsigned integers
→ integer and fixed-point arithmetic
Page 32
Design and implementation of floating-point operators / Bivariate polynomial evaluation-based approach

Range reduction of division

Express the exact result r = x/y as

  r = ℓ · 2^d  ⇒  RN(x/y) = RN(ℓ) · 2^d,  with ℓ ∈ [1, 2) and d ∈ {emin, ..., emax}

Definition: c = 1 if mx ≥ my, and c = 0 otherwise

Range reduction:

  x/y = (2^(1−c) · mx/my) · 2^d,  with ℓ := 2^(1−c) · mx/my ∈ [1, 2) and d = ex − ey − 1 + c

How to compute the correctly-rounded significand RN(ℓ)?
Page 36
Methods for computing the correctly-rounded significand

Iterative methods: restoring, non-restoring, SRT, ...
• Oberman and Flynn (1997)
• minimal ILP exposure, sequential algorithm

Multiplicative methods: Newton-Raphson, Goldschmidt
• Pineiro and Bruguera (2002); Raina’s Ph.D., FLIP 0.3 (2006)
• exploit available multipliers, more ILP exposure

Polynomial-based methods
• Agarwal, Gustavson and Schmookler (1999) → univariate polynomial evaluation
• our approach → bivariate polynomial evaluation: maximal ILP exposure
Page 39
Correct rounding via truncated one-sided approximation

How to compute RN(ℓ), with ℓ = 2^(1−c) · mx/my ?

Three steps for correct rounding computation
1. compute v = 1.v1 ... v(k−2) such that −2^(−p) ≤ ℓ − v < 0
   → implied by |(ℓ + 2^(−p−1)) − v| < 2^(−p−1)
   → bivariate polynomial evaluation
2. compute u as the truncation of v after p fraction bits
3. determine RN(ℓ) after possibly adding 2^(−p)

How to compute the one-sided approximation v, and then deduce RN(ℓ)?
Page 41
One-sided approximation via bivariate polynomials

1. Consider ℓ + 2^(−p−1) as the exact result of the function
     F(s, t) = s/(1 + t) + 2^(−p−1)
   at the points s* = 2^(1−c) · mx and t* = my − 1

2. Approximate F(s, t) by a bivariate polynomial P(s, t):
     P(s, t) = s · a(t) + 2^(−p−1)
   → a(t): univariate polynomial approximant of 1/(1 + t)
   → approximation error Eapprox

3. Evaluate P(s, t) by a well-chosen efficient evaluation program P:
     v = P(s*, t*)
   → evaluation error Eeval

How to ensure that |(ℓ + 2^(−p−1)) − v| < 2^(−p−1) ?
Page 45
Sufficient error bounds

To ensure |(ℓ + 2^(−p−1)) − v| < 2^(−p−1), it suffices to ensure that

  µ · Eapprox + Eeval < 2^(−p−1),

since

  |(ℓ + 2^(−p−1)) − v| ≤ µ · Eapprox + Eeval,  with µ = 4 − 2^(3−p).

This gives the following sufficient conditions:

  Eapprox ≤ θ with θ < 2^(−p−1)/µ  ⇒  Eeval < η = 2^(−p−1) − µ · θ
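The splitting behind this bound can be made explicit; the first step is the triangle inequality, the second uses P(s, t) = s · a(t) + 2^(−p−1):

```latex
\begin{align*}
\bigl|(\ell + 2^{-p-1}) - v\bigr|
  &\le \bigl|F(s^*,t^*) - P(s^*,t^*)\bigr| + \bigl|P(s^*,t^*) - v\bigr| \\
  &=    s^* \cdot \bigl|1/(1+t^*) - a(t^*)\bigr| + \bigl|P(s^*,t^*) - v\bigr| \\
  &\le \mu \cdot E_{\mathrm{approx}} + E_{\mathrm{eval}},
  \qquad \text{since } 0 < s^* \le \mu .
\end{align*}
```

The bound s* ≤ µ holds because either c = 1 and s* = mx < 2, or c = 0, in which case mx ≤ my − 2^(1−p) and hence s* = 2mx ≤ 4 − 2^(3−p).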
Page 48
Example for the binary32 division

Sufficient conditions with µ = 4 − 2^(−21):

  Eapprox ≤ θ with θ < 2^(−25)/µ  and  Eeval < η = 2^(−25) − µ · θ

Approximation of 1/(1 + t) by a Remez-like polynomial of degree 10

[Plot: absolute approximation error of a(t) for t ∈ [0, 1), oscillating within about ±6 · 10^(−9), against the required bound.]

• Eapprox ≤ θ, with θ = 3 · 2^(−29) ≈ 6 · 10^(−9)
• Eeval < η, with η ≈ 7.4 · 10^(−9)
Page 50
Flowchart for generating efficient and certified C codes

[Flowchart: computation of the polynomial approximant (from F(s,t), with Eapprox ≤ θ) → computation of low latency parenthesizations (ST231 features) → selection of effective parenthesizations (Eeval < η, ST231 features) → C code + certificate. The evaluation program returns v with |(ℓ + 2^(−p−1)) − v| < 2^(−p−1); u is obtained from v by truncation.]
Page 51
Design and implementation of floating-point operators / Implementation of correct rounding

Rounding condition: definition

Approximation u of ℓ, with ℓ = 2^(1−c) · mx/my

The exact value ℓ may have an infinite number of bits
→ the sticky bit cannot always be computed

[Figure: the two possible positions of u relative to ℓ, between consecutive floating-point numbers and midpoints.]

Computing RN(ℓ) requires being able to decide whether u ≥ ℓ
→ ℓ cannot be a midpoint

Rounding condition: u ≥ ℓ ⟺ u · my ≥ 2^(1−c) · mx
Page 54
Design and implementation of floating-point operators Implementation of correct rounding
Rounding condition: implementation in integer arithmetic
Rounding condition: u ·my ≥ 21−c ·mx
Approximation u and my : representable with 32 bits
u
my×
u · my
I u ·my is exactly representable with 64 bits
I 21−c ·mx is representable with 32 bits since c ∈ {0,1}
⇒ one 32×32→ 32-bit multiplication and one comparison
Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 20/45
Page 55
Design and implementation of floating-point operators Implementation of correct rounding
Rounding condition: implementation in integer arithmetic
Rounding condition: u ·my ≥ 21−c ·mx
Approximation u and my : representable with 32 bits
u
my×
u · my
21−c · mx
I u ·my is exactly representable with 64 bitsI 21−c ·mx is representable with 32 bits since c ∈ {0,1}
⇒ one 32×32→ 32-bit multiplication and one comparison
Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 20/45
Page 56
Design and implementation of floating-point operators Implementation of correct rounding

Rounding condition: implementation in integer arithmetic

Rounding condition: u · my ≥ 2^(1−c) · mx

The approximation u and my are both representable with 32 bits.

[Figure: the 64-bit product u · my compared (≥) with 2^(1−c) · mx.]

- u · my is exactly representable with 64 bits
- 2^(1−c) · mx is representable with 32 bits since c ∈ {0,1}

⇒ one 32×32→32-bit multiplication and one comparison
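As an illustration, the test can be sketched in portable C. This is a sketch only: the names are ours, and it uses the full 64-bit product for clarity, whereas the ST231 implementation gets away with a single 32×32→32-bit multiplication.

```c
#include <stdint.h>

/* Sketch of the rounding test  u >= l  <=>  u*my >= 2^(1-c)*mx.
 * Names are hypothetical. u and my fit in 32 bits, so the product
 * u*my is exact in 64 bits; mx << (1-c) fits since c is 0 or 1. */
static inline int rounding_condition(uint32_t u, uint32_t my,
                                     uint32_t mx, int c) {
    uint64_t lhs = (uint64_t)u * (uint64_t)my;  /* exact 64-bit product */
    uint64_t rhs = (uint64_t)mx << (1 - c);     /* 2^(1-c) * mx         */
    return lhs >= rhs;
}
```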
Page 58
Design and implementation of floating-point operators Implementation of correct rounding

Flowchart for generating efficient and certified C codes

[Flowchart: computation of a polynomial approximant a(t) with Eapprox ≤ θ, then computation of low latency parenthesizations (ST231 features), then selection of effective parenthesizations (Eeval < η, ST231 features), producing the C code and its certificate.]
Page 59
Low latency parenthesization computation
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
   Classical evaluation methods
   Computation of all parenthesizations
   Towards low evaluation latency
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Page 60
Low latency parenthesization computation Classical evaluation methods

Objectives

Compute an efficient parenthesization for evaluating P(s, t)
→ reduces the evaluation latency on unbounded parallelism

Evaluation program P = main part of the full software implementation
→ dominates the cost

Two families of algorithms
- algorithms with coefficient adaptation: Knuth and Eve (1960s), Paterson and Stockmeyer (1973), ...
  → ill-suited in the context of fixed-point arithmetic
- algorithms without coefficient adaptation
Page 64
Low latency parenthesization computation Classical evaluation methods

Classical parenthesizations for binary32 division

P(s, t) = 2^(−25) + s · ∑_{0 ≤ i ≤ 10} a_i · t^i

Horner's rule: (3+1)×11 = 44 cycles
→ no ILP exposure

Second-order Horner's rule: 27 cycles
→ evaluation of the odd and even parts independently with Horner, more ILP

Estrin's method: 19 cycles
→ evaluation of the high and low parts in parallel, even more ILP
→ distributing the multiplication by s in the evaluation of a(t): 16 cycles
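The contrast between these shapes can be seen on a small degree-3 example (double precision for readability; the thesis's codes work in 32-bit fixed-point). Both parenthesizations compute the same polynomial, but Estrin's critical path is shallower:

```c
/* Horner: fully sequential chain of multiply-adds. */
static double horner3(double t, const double a[4]) {
    return a[0] + t * (a[1] + t * (a[2] + t * a[3]));
}

/* Estrin: t*t, the low pair and the high pair are independent,
 * so they can be evaluated in parallel. */
static double estrin3(double t, const double a[4]) {
    double t2 = t * t;
    double lo = a[0] + t * a[1];
    double hi = a[2] + t * a[3];
    return lo + t2 * hi;
}
```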
... We can do better.
How to explore the solution space of parenthesizations?
Page 68
Low latency parenthesization computation Computation of all parenthesizations

Algorithm for computing all parenthesizations

a(x, y) = ∑_{0 ≤ i ≤ nx} ∑_{0 ≤ j ≤ ny} a_{i,j} · x^i · y^j, with n = nx + ny and a_{nx,ny} ≠ 0

Example
Let a(x, y) = a0,0 + a1,0 · x + a0,1 · y + a1,1 · x · y. Then
a1,0 + a1,1 · y is a valid expression, while a1,0 · x + a1,1 · x is not.

Exhaustive algorithm: iterative process
→ step k = computation of all the valid expressions of total degree k

3 building rules for computing all parenthesizations
Page 70
Low latency parenthesization computation Computation of all parenthesizations

Rules for building valid expressions

Consider step k of the algorithm:
- E(k): valid expressions of total degree k
- P(k): powers x^i y^j of total degree k = i + j
Rule R1 for building the powers: p = p1 · p2, with
deg(p) = deg(p1) + deg(p2), deg(p1) ≤ ⌊k/2⌋, and ⌈k/2⌉ ≤ deg(p2) < k
Rule R2 for building expressions by multiplication: e = e′ · p, with
deg(e) = deg(e′) + deg(p), deg(e′) < k, and deg(p) ≤ k
Rule R3 for building expressions by addition: e = e1 + e2, with
deg(e) = max(deg(e1), deg(e2)), deg(e1) = k, and deg(e2) ≤ k
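Rule R1's balanced split is what keeps the evaluation depth logarithmic. A minimal sketch (our own illustration, counting only multiplication levels):

```c
/* Depth, in multiplications, of building x^k when every power is
 * split as x^floor(k/2) * x^ceil(k/2), as in rule R1. */
static int power_depth(int k) {
    if (k <= 1)
        return 0;                        /* x itself is an input      */
    int lo = power_depth(k / 2);         /* depth of x^floor(k/2)     */
    int hi = power_depth(k - k / 2);     /* depth of x^ceil(k/2)      */
    return 1 + (lo > hi ? lo : hi);      /* one multiplication on top */
}
```

For instance the depth of x^10 is 4 = ⌈log2 10⌉ multiplication levels.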
Page 74
Low latency parenthesization computation Computation of all parenthesizations

Number of parenthesizations

Number of generated parenthesizations for evaluating a bivariate polynomial:

          nx = 1   nx = 2         nx = 3                 nx = 4            nx = 5     nx = 6
ny = 0    1        7              163                    11602             2334244    1304066578
ny = 1    51       67467          1133220387             207905478247998   ...        ...
ny = 2    67467    106191222651   10139277122276921118   ...               ...        ...

Timings for parenthesization computation:
→ for a univariate polynomial of degree 5: ≈ 1 h on a 2.4 GHz core
→ for a bivariate polynomial of degree (2,1): ≈ 30 s
→ for P(s, t) of degree (3,1): ≈ 7 s (88384 schemes)

Optimization for univariate polynomials and P(s, t):
→ univariate polynomial of degree 5: ≈ 4 min
→ for P(s, t) of degree (3,1): ≈ 2 s (88384 schemes)
Page 75
Low latency parenthesization computation Computation of all parenthesizations
Number of parenthesizations
[Histogram: number of degree-5 parenthesizations (logarithmic scale, 10 to 10^6) versus latency on unbounded parallelism (10 to 20 cycles).]

→ minimal latency for a univariate polynomial of degree 5: 10 cycles (36 schemes)
How to compute only parenthesizations of low latency?
Page 77
Low latency parenthesization computation Towards low evaluation latency

Determination of a target latency

Target latency τ = minimal cost for evaluating a0,0 + anx,ny · x^nx · y^ny
- if no scheme satisfies τ then increase τ and restart

Static target latency τstatic
- as general as evaluating a0,0 + x^(nx+ny+1)
- τstatic = A + M × ⌈log2(nx + ny + 1)⌉

Dynamic target latency τdynamic
- takes the cost of operations on anx,ny and the delays on indeterminates into account
- computed by dynamic programming
Example
Degree-9 bivariate polynomial: nx = 8 and ny = 1
Latencies: A = 1 and M = 3
Delay: y available 9 cycles later than x

τstatic = 1 + 3 × ⌈log2(10)⌉ = 13 cycles        τdynamic = 16 cycles
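The static bound can be checked mechanically (a sketch; A and M denote the add and mul latencies, as above):

```c
/* tau_static = A + M * ceil(log2(nx + ny + 1)) */
static int tau_static(int nx, int ny, int A, int M) {
    int n = nx + ny + 1;
    int lg = 0;
    while ((1 << lg) < n)               /* lg = ceil(log2(n)) for n >= 1 */
        lg++;
    return A + M * lg;
}
```

With nx = 8, ny = 1, A = 1 and M = 3, this gives 1 + 3·⌈log2 10⌉ = 13 cycles, matching the example.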
Page 81
Low latency parenthesization computation Towards low evaluation latency

Optimized search of best parenthesizations

Example
Let a(x, y) be a degree-2 bivariate polynomial:
a(x, y) = a0,0 + a1,0 · x + a0,1 · y + a1,1 · x · y.

⇒ find a best splitting of the polynomial → low latency
Candidate splittings include:
(a0,0 + a1,0 · x + a0,1 · y) + (a1,1 · x · y)
((a0,0 + a1,0 · x) + a0,1 · y) + (a1,1 · x · y)
(a0,0 + (a1,0 · x + a0,1 · y)) + (a1,1 · x · y)
(a0,0 + a1,0 · x) + (a0,1 · y + a1,1 · x · y)
a0,0 + (a1,0 · x + a0,1 · y + a1,1 · x · y)
[Diagram: two-level splitting search. Level 1 splits a(x, y) into a0,0, a′(x, y), and anx,ny · x^nx · y^ny; level 2 splits a′(x, y) into a′1(x, y) and a′2(x, y) when its support exceeds a threshold max, and performs an exhaustive search otherwise; the best candidate a′(x, y) is kept.]
Page 88
Low latency parenthesization computation Towards low evaluation latency
Efficient evaluation parenthesization generation
P(s, t) = 2^(−25) + s · ∑_{0 ≤ i ≤ 10} a_i · t^i

First target latency τ = 13 → no parenthesization found
Second target latency τ = 14 → parenthesization obtained in about 10 s

Classical methods:
- Horner: 44 cycles
- Estrin: 19 cycles
- Estrin with s distributed: 16 cycles
[Evaluation tree: the generated parenthesization of P(s, t), scheduled over 14 cycles on unbounded parallelism, with inputs s, t, v, coefficients a0, ..., a10, and the constant 2^(−25).]
Page 91
Low latency parenthesization computation Towards low evaluation latency

Flowchart for generating efficient and certified C codes

[Flowchart: computation of a polynomial approximant a(t) with Eapprox ≤ θ, then computation of low latency parenthesizations (ST231 features), then selection of effective parenthesizations (Eeval < η, ST231 features), producing the C code and its certificate.]
Page 92
Selection of effective evaluation parenthesizations
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
   General framework
   Automatic certification of generated C codes
4. Numerical results
5. Conclusions and perspectives
Page 93
Selection of effective evaluation parenthesizations General framework

Selection of effective parenthesizations

1. Arithmetic operator choice
   - all intermediate variables are of constant sign
2. Scheduling on a simplified model of the ST231
   - constraints of the architecture: cost of operators, instruction bundling, ...
   - delays on indeterminates
3. Certification of the generated C code
   - straight-line polynomial evaluation program
   - "certified C code": we can bound the evaluation error in integer arithmetic
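To fix ideas, a generated straight-line program might look like the following degree-2 Horner step in unsigned Q0.32 arithmetic. This is our own sketch, not FLIP's actual output; mul_hi models a truncating 32×32→32-bit "multiply high":

```c
#include <stdint.h>

/* Truncated Q0.32 product: high 32 bits of the 64-bit product. */
static inline uint32_t mul_hi(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Straight-line evaluation of a0 + t*(a1 + t*a2): no branches,
 * one fresh variable per operation, so the rounding error of each
 * step can be bounded separately (e.g. by Gappa). */
static uint32_t eval_deg2(uint32_t t, uint32_t a0, uint32_t a1, uint32_t a2) {
    uint32_t r0 = mul_hi(t, a2);
    uint32_t r1 = a1 + r0;
    uint32_t r2 = mul_hi(t, r1);
    uint32_t r3 = a0 + r2;
    return r3;
}
```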
Page 96
Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of the evaluation error for binary32 division

Sufficient conditions, with µ = 4 − 2^(−21):
Eapprox ≤ θ with θ < 2^(−25)/µ, and Eeval < η = 2^(−25) − µ · θ
[Plot: absolute approximation error over t ∈ [0, 1), against the required bound.]

- Eapprox ≤ θ, with θ = 3 · 2^(−29) ≈ 6 · 10^(−9)
- Eeval < η, with η ≈ 7.4 · 10^(−9)
Page 97
Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of the evaluation error for binary32 division

Case 1: mx ≥ my → condition satisfied
Case 2: mx < my → condition not satisfied: Eeval ≥ η at
s* = 3.935581684112548828125 and t* = 0.97490441799163818359375
[Plot: approximation error over t ∈ [0.965, 0.995], with the required bound 2^(−25)/(4 − 2^(−21)) ≈ 8 · 10^(−9) and the approximation error bound θ = 3 · 2^(−29) ≈ 6 · 10^(−9).]
1. determine an interval I around this point
2. compute Eapprox over I
3. determine an evaluation error bound η
4. check whether Eeval < η
Page 100
Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification of the evaluation error for binary32 division

Sufficient conditions for each subinterval, with µ = 4 − 2^(−21):
E(i)approx ≤ θ(i) with θ(i) < 2^(−25)/µ, and E(i)eval < η(i) = 2^(−25) − µ · θ(i)
[Plot: absolute approximation error over t ∈ [0, 1), against the required bound, per subinterval.]

- E(i)approx ≤ θ(i)
- E(i)eval < η(i)
Page 102
Selection of effective evaluation parenthesizations Automatic certification of generated C codes

Certification using a dichotomy-based strategy

Implementation of the splitting by dichotomy: for each subinterval T(i),
1. compute a certified approximation error bound θ(i)    [Sollya]
2. determine an evaluation error bound η(i)    [Sollya]
3. check this bound: E(i)eval < η(i)    [Gappa]
⇒ if this bound is not satisfied, T(i) is split into 2 subintervals

Example of a binary32 implementation:
→ launched on a 64-processor grid
→ 36127 subintervals found in several hours (≈ 5 h)
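The splitting loop itself is simple; the work lies in the per-interval checks. A sketch, where checked() is a stub standing in for the Sollya/Gappa verification of one subinterval (here it merely accepts intervals narrower than a chosen width):

```c
/* Dichotomy: certify [lo, hi) as a whole if possible, otherwise split
 * it in two and recurse; returns the number of certified subintervals. */
static double stub_width;                 /* stub acceptance criterion  */

static int checked(double lo, double hi) {
    return hi - lo <= stub_width;         /* stands in for Sollya/Gappa */
}

static int certify(double lo, double hi) {
    if (checked(lo, hi))
        return 1;                         /* bound holds on [lo, hi)    */
    double mid = 0.5 * (lo + hi);
    return certify(lo, mid) + certify(mid, hi);
}
```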
Page 105
Numerical results
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Page 106
Numerical results
Performances of FLIP on ST231
[Bar charts: latencies (# cycles) of add, sub, mul, div and sqrt for FLIP 1.0, FLIP 0.3 and STlib, and the corresponding speed-ups (%) of FLIP 1.0 vs STlib and vs FLIP 0.3.]

Performances on ST231, in RoundTiesToEven
⇒ speed-up between 20 and 50 %

Implementations of other operators (latency in # cycles on ST231, in RoundTiesToEven):

x^(−1)   x^(−1/2)   x^(1/3)   x^(−1/3)   x^(−1/4)
  25        29         34        40         42
Page 108
Numerical results
Impact of the dynamic target latency

Latency (# cycles) on unbounded parallelism and on the ST231:

                                                x^(1/3)   x^(−1/3)
Degree (nx, ny)                                  (8,1)     (9,1)
Delay on the operand s (# cycles)                  9         9
Static target latency                             13        13
Dynamic target latency                            16        16
Latency on unbounded parallelism and on ST231     16        16

⇒ allows us to conclude on the optimality in terms of polynomial evaluation latency
Page 110
Numerical results
Timings for code generation

                                   x^1/2   x^−1/2   x^1/3   x^−1/3   x^−1
  Degree (nx, ny)                  (8,1)   (9,1)    (8,1)   (9,1)    (10,0)
  Static target latency             13      13       13      13       13
  Dynamic target latency            13      13       16      16       13
  Latency on unbounded parallelism  13      13       16      16       13
  Latency on ST231                  13      14       16      16       13
  Parenthesization generation      172ms   152ms    53s     56s      168ms
  Arithmetic operator choice        6ms     6ms      7ms     11ms     4ms
  Scheduling                        29s    4m21s    32ms    132ms    7s
  Certification (Gappa)             6s      4s      1m38s   1m07s    11s
  Total time (≈)                   35s    4m25s    2m31s   2m03s    18s

Timing of each step of the generation flow
Impact of the target latency on the first step of the generation
What may dominate the cost
→ scheduling algorithm
→ certification using Gappa
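To give a feel for what the Gappa step certifies: it formally proves, for all inputs, a bound on the error of the generated fixed-point evaluation code. The sketch below is only a sampled sanity check of such a bound, not a proof (the Q0.31 format, the function names, and the degree-2 polynomial are illustrative assumptions, not the thesis's actual codes); Gappa establishes the bound exhaustively and formally rather than by sampling.

```c
#include <stdint.h>
#include <math.h>

/* Q0.31 unsigned fixed point: value = v / 2^31; multiplication truncates,
   losing strictly less than 1 ulp (= 2^-31) per product. */
static inline uint32_t mul31(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 31);
}

/* Fixed-point Horner for p(t) = c0 + c1*t + c2*t^2, all values in [0,1). */
uint32_t horner_q31(const uint32_t c[3], uint32_t t) {
    return c[0] + mul31(t, c[1] + mul31(t, c[2]));
}

/* Sampled estimate of max |fixed-point result - exact result|, in ulps.
   Two truncated products, each losing < 1 ulp (the inner one further
   scaled by t < 1), give a provable bound of 2 ulps; sampling can only
   give evidence for it, which is why a tool like Gappa is needed. */
double sampled_error(const uint32_t c[3], int samples) {
    const double scale = 2147483648.0;  /* 2^31 */
    double worst = 0.0;
    for (int i = 0; i < samples; i++) {
        uint32_t t = (uint32_t)(((uint64_t)i << 31) / samples);
        double td = t / scale;
        double exact = c[0]/scale + td*(c[1]/scale + td*(c[2]/scale));
        double err = fabs(horner_q31(c, t)/scale - exact) * scale;
        if (err > worst) worst = err;
    }
    return worst;
}
```

A Gappa script for the real generated codes encodes the same question, the rounding of each fixed-point operation and the target error bound, as a logical statement to be proved.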
Page 113
Conclusions and perspectives
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Page 114
Conclusions and perspectives: Conclusions
Design and implementation of floating-point operators
→ uniform approach for correctly-rounded roots and their reciprocals
→ extension to correctly-rounded division
→ polynomial evaluation-based method, exposing very high ILP
⇒ new, much faster version of FLIP

Code generation for efficient and certified polynomial evaluation
→ methodologies and tools for automating the implementation of polynomial evaluation
→ heuristics and techniques for quickly generating efficient and certified C codes
⇒ CGPE: automatically writes and certifies ≈ 50 % of the codes of FLIP
Page 117
Conclusions and perspectives: Perspectives
Faithful implementation of floating-point operators
→ other floating-point operators: log2(1+x) over [0.5,1), 1/√(1+x²) over [0,0.5), ...
→ roots and their reciprocals: deciding the rounding condition is not automated yet

Extension to other binary floating-point formats
→ square root in binary64: 171 cycles on ST231, versus 396 cycles with STlib

Extension to other architectures, typically FPGAs
→ the polynomial evaluation-based approach already seems to be a good alternative to multiplicative methods on FPGAs
→ the other techniques introduced in this thesis should be investigated further
Page 120
Implementation of binary floating-point arithmetic on embedded integer processors
Polynomial evaluation-based algorithms and certified code generation

Guillaume Revy
Advisors: Claude-Pierre Jeannerod and Gilles Villard

Arénaire INRIA project-team (LIP, ENS Lyon) – Université de Lyon – CNRS
Ph.D. Defense – December 1st, 2009
Guillaume Revy – December 1st, 2009. Implementation of binary floating-point arithmetic on embedded integer processors 45/45