CAPS team Compiler and Architecture for superscalar and embedded processors
CAPS members
2 INRIA researchers: A. Seznec, P. Michaud
2 professors: F. Bodin, J. Lenfant
11 Ph.D. students: R. Amicel, R. Dolbeau, A. Monsifrot, L. Bertaux,
K. Heydemann, L. Morin, G. Pokam, A. Djabelkhir,
A. Fraboulet, O. Rochecouste, E. Toullec
3 engineers: S. Bihan, P. Villalon, J. Simonnet
CAPS themes
Two interacting activities:
  High-performance microprocessor architecture
  Performance-oriented compilation
CAPS Grail
Performance at the best cost
Progress in computer science and applications is driven by performance
CAPS path to the Grail
Defining the tradeoffs between:
  what should be done through hardware
  what can be done by the compiler
for maximum performance,
or for minimum cost,
or for minimum size, power, ...
Need for high-performance processors
Current applications:
  general purpose: scientific, multimedia, databases, …
  embedded systems: cell phones, automotive, set-top boxes, ...
Future applications:
  don’t worry: users have a lot of imagination!
New software engineering techniques are CPU hungry:
  reusability, generality
  portability, extensibility (indirections, virtual machines)
  safety (run-time verifications)
  encryption/decryption
CAPS (ancient) background
« Ancient » background in hardware and software management of ILP:
  decoupled pipeline architectures
  OPAC, a hardware matrix floating-point coprocessor
  software pipeline for LIW
« Supercomputing » background:
  interleaved memories
  Fortran-S
CAPS background in architecture
Solid knowledge of microprocessor architecture:
  technology watch on microprocessors
  A. Seznec worked with the Alpha Development Group in 1999-2000
Research in cache architecture
Research in branch prediction mechanisms
CAPS background in compilers
Software optimizations for cache memories:
  Numerical algorithms on dense structures
  Optimizing data layout
Many prototype environments for parallel compilers:
  CT++ (with CEA): image processing C++ library for a SIMD architecture
  Menhir: a parallel compiler for MatLab
  IPF (with Thomson-LER): Fortran compiler for image processing on Maspar
  Sage (with Indiana): infrastructure for source-level transformation
We build on
SALTO: System for Assembly-Language Transformations and Optimizations
  retargetable assembly source-to-source preprocessor
  Erven Rohou’s Ph.D.
TSF: scripting language for program transformation on top of ForeSys (Simulog)
  Yann Mevel’s Ph.D.
Salto overview
Assembly source-to-source preprocessor
Fine-grain machine description
Independent from compilers
[Figure: SALTO block diagram. Labels: transformation tool (C++), SALTO interface, machine description, assembly language (input), assembly language (output).]
Compiler activities
Code optimizations for embedded applications:
  infrastructures rather than compilers
  optimizing compiler strategies rather than new code optimizations
Global constraints:
  performance / code size / low power (starting)
Focus on interactive tools rather than automatic code tuning:
  case-based reasoning
  assembly code optimizations
Computer aided hand tuning
Automatic optimization has many shortcomings:
  rather, provide the user with a testbed to hand-tune applications
Target applications:
  Fortran codes and embedded C applications
Our approach:
  case-based reasoning
  static code analysis and pattern matching
  profiling
  learning techniques
  the user has ultimate responsibility
CAHT
Prototype built on:
  Foresys: Fortran interactive front-end (from Simulog)
  TSF: scripting language for program transformation
  Sage++: infrastructure for source-level transformation
Analysis and Tuning tool for Low Level Assembly and Source code (with Thomson Multimedia)
ATLLAS objectives:
  Has the compiler done a good job?
  Try to match source and optimized assembly at fine grain
Development/analysis environment:
  Models for both source and assembly
  Global and local analysis (WCET, …) at both levels
  Interactive environment for code visualization and manual/automatic analysis and optimization
Built using Salto and Sage++:
  Retargetable across compilers and architectures
ATLLAS - Analysis and Tuning tool for Low Level Assembly and Source code: Tuning method
[Figure: flowchart of the tuning method. Labels: Source Code, compilation / profiling, Assembly Code, Atllas, Post-Processing, Processing Support (code matching, analysis and evaluations, graphic display of assembly and source code), Half-Automatic or Manual Source Optimisations, Half-Automatic or Manual Assembly Optimisations, decision "Good?" (Yes: End).]
Assembly Level Infrastructure for Software Enhancement (with STMicroelectronics)
ALISE: enhanced SALTO for code optimization:
  • better integration with code generation
    – interface with front-end
    – interface for profiling data
  • targets global optimization
  • based on component software optimization engines
Answer to a real need from industry: a retargetable infrastructure
ALISE
Environment for:
  global assembly code optimization
  providing optimization alternatives
Support for new embedded processors:
  ISAs with ILP support (VLIW, EPIC)
  Predicated instructions
  Functional unit clusters, ...
ALISE
[Figure: ALISE structure. Labels: Architecture Description, D to M, Architecture Model, Text Input, P to IR, Intermediate Representation, Opt 1, Opt 2, ..., Opt n, IR to Ass (Emit), Optimized Program, High Level API, Interfaces, User interface (G.U.I.), Intermediate Code, External Infrastructure.]
Preprocessor for media processors (MEDEA+ Mesa project)
Multimedia instructions exist on embedded and general-purpose processors, but:
  no consensus on multimedia (MMD) instructions among manufacturers:
    • saturated arithmetic or not, different instructions, …
  multimedia instructions are not well handled by compilers:
    • but performance depends heavily on them
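To make the "saturated arithmetic or not" point concrete, here is a minimal Python sketch of the two behaviours for unsigned 8-bit additions. The function names and the pixel example are illustrative assumptions, not taken from any of the tools mentioned in these slides.

```python
def add_wrap_u8(a: int, b: int) -> int:
    """Modular 8-bit addition: an overflowing sum wraps around."""
    return (a + b) & 0xFF


def add_sat_u8(a: int, b: int) -> int:
    """Saturating 8-bit addition: an overflowing sum clamps to 255."""
    return min(a + b, 0xFF)


def brighten(pixels, delta, add):
    """Apply the chosen 8-bit addition to every pixel value."""
    return [add(p, delta) for p in pixels]


pixels = [10, 200, 250]
wrapped = brighten(pixels, 100, add_wrap_u8)    # [110, 44, 94]
saturated = brighten(pixels, 100, add_sat_u8)   # [110, 255, 255]
```

The same source-level loop gives different results depending on which flavour the target instruction implements, which is one reason such instructions are hard for a compiler to select automatically.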
Preprocessor for media processors: our approach
C source-to-source preprocessor:
  user-oriented idiom recognition:
    easy to retarget
    target-dedicated recognition
  exploiting loop parallelism:
    vectorization techniques
    multiprocessor systems
  available soon
Collaboration with STMicroelectronics
Iterative compilation
Embedded systems:
  Compile time is not critical
  Performance/code size/power are critical
  One can often rely on profiling
Classical compiler: local optimizations, but constraints are GLOBAL
Proof of concept for code size (Rohou’s Ph.D.)
  new Ph.D. beginning in September 2000
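The idea of searching over optimization choices under a global constraint can be sketched as follows. This is a toy illustration only: the loop names, the cost model in `evaluate`, and the size budget are hypothetical stand-ins for what would really be a compile-and-profile step, and exhaustive search is just the simplest possible strategy.

```python
from itertools import product

# Hypothetical loops: (trip count, cycles per iteration body, body size in bytes).
LOOPS = {"hot": (1000, 4, 10), "cold": (10, 4, 40)}
LOOP_OVERHEAD = 2            # assumed cycles of branch overhead per loop iteration
UNROLL_FACTORS = (1, 2, 4, 8)


def evaluate(config):
    """Stand-in for 'compile, run, profile': returns (cycles, code size)."""
    cycles = size = 0
    for name, unroll in config.items():
        trips, body_cycles, body_size = LOOPS[name]
        iterations = -(-trips // unroll)          # ceiling division
        cycles += trips * body_cycles + iterations * LOOP_OVERHEAD
        size += body_size * unroll
    return cycles, size


def iterative_search(size_budget):
    """Try every unrolling configuration; keep the fastest one within the budget."""
    best = None
    for factors in product(UNROLL_FACTORS, repeat=len(LOOPS)):
        config = dict(zip(LOOPS, factors))
        cycles, size = evaluate(config)
        if size <= size_budget and (best is None or cycles < best[1]):
            best = (config, cycles, size)
    return best


best_config, best_cycles, best_size = iterative_search(size_budget=120)
```

Unrolling both loops by 8 is the best local choice for each loop, but it violates the global size budget; the search instead spends the whole budget on the hot loop.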
High performance instruction set simulation
Embedded processors: parallel development of silicon, ISA, compiler and applications
Need for flexible instruction set simulation:
  high-performance simulation of large codes
  debugging
  retargetable, to experiment with:
    • new ISAs
    • various microarchitecture options
First results: up to 50x faster than an ad-hoc simulator
ABSCISS: Assembly Based System for Compiled Instruction Set Simulation
[Figure: ABSCISS tool flow. Labels: C Source, tmcc, TriMedia Assembly, tmas, TriMedia Binary, tmsim, Architecture Description, ABSCISS, C/C++ Source, gcc, Compiled simulator.]
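The general principle of compiled instruction set simulation can be sketched on a toy example: instead of decoding each target instruction every time it executes, the program is translated once into host code, which is then run directly. The four-instruction ISA, the tuple encoding, and the function names below are assumptions made for illustration; the sketch generates Python rather than C/C++ and does not reflect how ABSCISS itself is implemented.

```python
def interpret(program, nregs=4):
    """Interpretive simulation: every instruction is decoded each time it runs."""
    regs, pc = [0] * nregs, 0
    while 0 <= pc < len(program):
        op, *a = program[pc]
        if op == "li":                      # load immediate
            regs[a[0]] = a[1]
            pc += 1
        elif op == "add":
            regs[a[0]] = regs[a[1]] + regs[a[2]]
            pc += 1
        elif op == "addi":
            regs[a[0]] = regs[a[1]] + a[2]
            pc += 1
        elif op == "bnez":                  # branch to a[1] if register is non-zero
            pc = a[1] if regs[a[0]] != 0 else pc + 1
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


def compile_program(program, nregs=4):
    """Compiled simulation: decode once, emit host source, return a host function."""
    src = ["def run():",
           f"    r = [0] * {nregs}",
           "    pc = 0",
           "    while True:",
           f"        if not 0 <= pc < {len(program)}:",
           "            return r"]
    for i, (op, *a) in enumerate(program):
        src.append(f"        elif pc == {i}:")
        if op == "li":
            stmts = [f"r[{a[0]}] = {a[1]}", f"pc = {i + 1}"]
        elif op == "add":
            stmts = [f"r[{a[0]}] = r[{a[1]}] + r[{a[2]}]", f"pc = {i + 1}"]
        elif op == "addi":
            stmts = [f"r[{a[0]}] = r[{a[1]}] + {a[2]}", f"pc = {i + 1}"]
        elif op == "bnez":
            stmts = [f"pc = {a[1]} if r[{a[0]}] != 0 else {i + 1}"]
        else:
            raise ValueError(f"unknown opcode: {op}")
        src.extend("            " + s for s in stmts)
    namespace = {}
    exec("\n".join(src), namespace)         # host "compilation" of the generated code
    return namespace["run"]


# Sum the integers 10, 9, ..., 1 into r1.
PROGRAM = [("li", 0, 10), ("li", 1, 0), ("add", 1, 1, 0),
           ("addi", 0, 0, -1), ("bnez", 0, 2)]
simulator = compile_program(PROGRAM)
```

Both `interpret(PROGRAM)` and `simulator()` return the same register state; the compiled version has moved opcode decoding and operand extraction out of the simulation loop.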
Enabling superscalar processor simulation
Complete O-O-O microprocessor simulation:
  10,000-100,000x slower than real hardware
  cannot simulate realistic applications, only slices
  even fast-mode emulation is slow (50-100x):
    • simulation generally limited to slices at the beginning of the application
    • representativeness?
Calvin2 + DICE:
  combines direct execution with simulation
  really fast mode: 1-2x slowdown
  enables simulating slices distributed over the whole application
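Why slices distributed over the whole application matter can be shown with a small, purely illustrative Python model. The two-phase workload, the per-instruction cost function, and the slice parameters are all hypothetical; the sketch only mimics the switching between a fast mode (nothing measured) and a simulation mode (every instruction measured), not the actual Calvin2 + DICE mechanism.

```python
N_INSTRUCTIONS = 10_000


def detailed_cost(i):
    """Stand-in for detailed simulation: cycles spent on instruction i.

    Assumed workload: a cheap start-up phase followed by an expensive main phase.
    """
    return 1 if i < 2_000 else 5


def sampled_cpi(slice_starts, slice_len, n=N_INSTRUCTIONS):
    """Average cycles per instruction measured over the simulated slices only."""
    starts = set(slice_starts)
    remaining = simulated = cycles = 0
    for i in range(n):
        if i in starts:              # checkpoint reached: switch to simulation mode
            remaining = slice_len
        if remaining > 0:            # simulation mode: pay for the detailed model
            cycles += detailed_cost(i)
            simulated += 1
            remaining -= 1
        # otherwise: fast mode, the instruction executes without being measured
    return cycles / simulated


true_cpi = sum(detailed_cost(i) for i in range(N_INSTRUCTIONS)) / N_INSTRUCTIONS
start_only = sampled_cpi([0], 1_000)                        # one slice at the start
distributed = sampled_cpi(range(0, N_INSTRUCTIONS, 1_000), 100)
```

With the same total number of simulated instructions, the single slice at the beginning only sees the start-up phase, while the distributed slices recover the overall average in this toy workload.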
Calvin2 + DICE
[Figure: Calvin2 + DICE operation. Labels: Original code, SPARC V9 assembly code, calvin2 (Static Code Annotation Tool), checkpoints, Switching event, Emulation mode, DICE (Host ISA Emulator), User analysis routines.]
Moving tools to IA64
New 64-bit ISA from Intel/HP:
  Explicitly Parallel Instruction Computing
  Predicated execution
  Advanced loads (i.e. speculative)
  A very interesting platform for research!
Porting SALTO and the Calvin2+DICE approach to IA64
Exploring new trade-offs enabled by instruction sets:
  predicting the predicates?
  advanced loads against predicting dependencies
  ultimate out-of-order execution against compiler
Low power, compilation, architecture, … (just beginning :=)
Power consumption becomes a major issue:
  Embedded and general purpose
Compilation (setting up a collaboration with STMicroelectronics/Stanford/Milan):
  Is it different from performance optimization?
  Global constraint optimization
  Instruction Set Architecture support?
Architecture:
  High-order bits are generally null, …
  registers and memory
  ALUs
Caches and branch predictors
International CAPS visibility in architecture =
  skewed associative cache
  + decoupled sectored cache
  + multiple block ahead branch prediction
  + skewed branch predictor
Continue recurrent work on these topics:
  multiple block ahead
  + complexity/accuracy tradeoffs
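As a reminder of what "skewed associative" means, the following Python sketch compares a conventional 2-way set-associative cache with a 2-way cache in which each way is indexed by a different function of the block address. The XOR-based index function, the cache geometry, and the LRU-among-candidates replacement are simplifying assumptions chosen for illustration, not the functions or policies of the published designs.

```python
SETS_PER_WAY = 16
INDEX_BITS = 4


class TwoWayCache:
    """Tiny cache model: one index function per way, LRU among candidate lines."""

    def __init__(self, index_fns):
        self.index_fns = index_fns
        self.lines = [[None] * SETS_PER_WAY for _ in index_fns]  # (block, last use)
        self.clock = 0

    def access(self, block):
        """Return True on a hit, False on a miss (the block is then installed)."""
        self.clock += 1
        slots = [(way, fn(block) % SETS_PER_WAY)
                 for way, fn in enumerate(self.index_fns)]
        for way, idx in slots:
            entry = self.lines[way][idx]
            if entry is not None and entry[0] == block:
                self.lines[way][idx] = (block, self.clock)
                return True

        def last_use(slot):
            entry = self.lines[slot[0]][slot[1]]
            return -1 if entry is None else entry[1]   # empty lines are used first

        way, idx = min(slots, key=last_use)
        self.lines[way][idx] = (block, self.clock)
        return False


def low_bits(block):
    return block & (SETS_PER_WAY - 1)


def skewed_bits(block):
    # Hypothetical skewing function: XOR the two lowest index-sized bit fields.
    return (block ^ (block >> INDEX_BITS)) & (SETS_PER_WAY - 1)


def count_hits(cache, trace):
    return sum(cache.access(block) for block in trace)


# Three blocks that all map to set 0 of a conventional cache, accessed cyclically.
TRACE = [0, 16, 32] * 10
conventional_hits = count_hits(TwoWayCache([low_bits, low_bits]), TRACE)
skewed_hits = count_hits(TwoWayCache([low_bits, skewed_bits]), TRACE)
```

On this trace the conventional cache misses on every access (0 hits out of 30), whereas the skewed cache hits on all 27 accesses after the three cold misses, because blocks that conflict in one way are spread over different lines in the other.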
Simultaneous Multithreading
Sharing functional units among several processes
Among the first groups working on this topic:
  S. Hily’s Ph.D.
SMT behavior is well understood for independent threads:
  now, focus on parallel threads from a single application
Current research directions:
  speculative multithreading
    • ultimate performance with a single thread through predicting threads
  performance/complexity tradeoffs: SMT/CMP/hybrid
« Enlarging » the instruction window (supported by Intel)
In an O-O-O processor, fireable instructions are chosen in a window of a few tens of RISC-like instructions.
Limitations are:
  size of the window
  number of physical registers
Prescheduling:
  separate data flow scheduling from resource arbitration
  coarser units of work?
Reducing the number of physical registers:
  how to detect when a physical register is dead?
  Per-group validation?
  revisiting the CISC/RISC war?
Unwritten rule on superscalar processor designs
For general purpose registers:
  Any physical register can be the source or the result of any instruction executed on any functional unit
4-cluster WSRS architecture (supported by Intel)
[Figure: four clusters C0 to C3 and register subsets S0 to S3.]
• Half the read ports, one fourth the write ports
• Register file:
    • Silicon area x 1/8
    • Power x 1/2
    • Access time x 0.6
• Gains on:
    • bypass network
    • selection logic
Multiprocessor on a chip
Not just replicating board-level solutions!
A way to manage a large on-chip cache capacity:
  how can a sequential application efficiently use a distributed cache?
  architectural support for distributing a sequential application on several processors?
  how should instructions and data be distributed?
HIPSOR: HIgh Performance SOftware Random number generation
Need for unpredictable random number generation:
  sequences that cannot be reproduced
State of the art:
  < 100 bit/s using the operating system
  75 Kbit/s using the hardware generator on Pentium III
The internal state of a superscalar processor cannot be reproduced:
  use this state to generate unpredictable random numbers
HIPSOR (2)
1000’s of unmonitorable states modified by OS interrupts
Hardware clock counter to indirectly probe these states
Combined with in-line pseudo-random number generation
100 Mbit/s unpredictable random numbers
ARC INRIA with CODES
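The combination of a clock counter probe with in-line pseudo-random generation can be illustrated, very loosely, in Python. This is not the HIPSOR algorithm: `time.perf_counter_ns()` is used here as a stand-in for a hardware clock counter, the xorshift generator and the way timer bits are mixed into its state are assumptions, and the sketch makes no claim about throughput or cryptographic quality.

```python
import time

MASK64 = (1 << 64) - 1


def xorshift64(state: int) -> int:
    """One step of a 64-bit xorshift pseudo-random generator (shifts 13, 7, 17)."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state & MASK64


class ClockMixedGenerator:
    """PRNG whose state is perturbed by low-order timer bits at every step."""

    def __init__(self, seed: int = 0x9E3779B97F4A7C15):
        self.state = (seed & MASK64) or 1       # xorshift needs a non-zero state

    def next_u64(self) -> int:
        # Low-order bits of a high-resolution timer depend on execution timing,
        # which in turn depends on hardware state the program does not control.
        tick = time.perf_counter_ns() & 0xFF
        self.state = (self.state ^ tick) or 1
        self.state = xorshift64(self.state)
        return self.state

    def random_bytes(self, n: int) -> bytes:
        out = bytearray()
        while len(out) < n:
            out += self.next_u64().to_bytes(8, "little")
        return bytes(out[:n])


generator = ClockMixedGenerator()
sample = generator.random_bytes(20)
```

Unlike a plain seeded PRNG, two runs of this sketch with the same seed are not expected to produce the same byte sequence, since the timer samples differ from run to run.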