Compiler Construction Chapter 1: Introduction Slides modified from Louden Book and Dr. Scherger
Terminology
January, 2010 Chapter 1: Introduction 2
Compiler
Interpreter
Translator
Assembler
Linker
Loader
Preprocessor
Editor
Debugger
Profiler
Source Language
Target Language
Target Platform
Relocatable
Macro substitution
IDE
Cross Compiler
Disassembler
Front End
Back End
Compiler Stages
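[Diagram: phases of a compiler, with the data passed between them]
Source Code -> Scanner -> Tokens -> Parser -> Syntax Tree -> Semantic Analyzer -> Annotated Tree -> Source Code Optimizer -> Intermediate Code -> Code Generator -> Target Code -> Target Code Optimizer -> Target Code
Supporting components: Literal Table, Symbol Table, Error Handler
The phases are grouped into two parts: Analysis and Synthesis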
Files Used by Compilers
A source code text file (.c, .cpp, .java, etc. file extensions).
Intermediate code files: transformations of source code
during compilation, usually kept in temporary files rarely
seen by the user.
An assembly code text file containing symbolic machine
code, often produced as the output of a compiler (.asm,
.s file extensions).
Files Used by Compilers (cont.)
One or more binary object code files: machine
instructions, not yet linked or executable (.obj, .o file
extensions)
A binary executable file: linked, independently executable
(well, not always…) code (.exe, .out extensions, or no
extension).
Extended Example
Source code:
a[index] = 4 + 2
Tokens: ID Lbracket ID Rbracket AssignOp Num AddOp Num
Parse tree (syntax tree with all steps of the parser in
gory detail):
Parse Tree
expression
  assign-expression
    expression
      subscript-expression
        expression
          identifier a
        [
        expression
          identifier index
        ]
    =
    expression
      additive-expression
        expression
          number 4
        +
        expression
          number 2
Syntax Tree
a "trimmed" version of the parse tree with only
essential information:
assign-expression
  subscript-expression
    identifier a
    identifier index
  additive-expression
    number 4
    number 2
Annotated Syntax Tree (with attributes)
assign-expression : integer
  subscript-expression : integer
    identifier a : array of integer
    identifier index : integer
  additive-expression : integer
    number 4 : integer
    number 2 : integer
Intermediate Code
Syntax tree very abstract
Machine code too specific
Something in between may make optimization much
easier
One such representation is three-address code
Each instruction refers to at most three addresses (variables):
t = 4 + 2
a[index] = t
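As a rough illustration of how a syntax tree might be flattened into three-address code, here is a minimal Python sketch. The tuple encoding of the tree and the names `make_generator`, `gen`, and `new_temp` are assumptions made for this example and do not come from the text. Temporaries are numbered (`t1`, `t2`, ...) rather than all being called `t`.

```python
import itertools

def make_generator():
    """Return (gen, code): gen walks a tree, code collects the emitted lines."""
    counter = itertools.count(1)
    code = []

    def new_temp():
        # Hypothetical naming scheme for compiler-generated temporaries.
        return f"t{next(counter)}"

    def gen(node):
        """Emit code for node and return the name or text holding its value."""
        kind = node[0]
        if kind in ("num", "id"):
            return str(node[1])
        if kind == "add":
            left, right = gen(node[1]), gen(node[2])
            temp = new_temp()
            code.append(f"{temp} = {left} + {right}")
            return temp
        if kind == "subscript":
            return f"{gen(node[1])}[{gen(node[2])}]"
        if kind == "assign":
            target = gen(node[1])
            value = gen(node[2])
            code.append(f"{target} = {value}")
            return target
        raise ValueError(f"unknown node kind: {kind}")

    return gen, code

# Syntax tree for: a[index] = 4 + 2
tree = ("assign",
        ("subscript", ("id", "a"), ("id", "index")),
        ("add", ("num", 4), ("num", 2)))
gen, code = make_generator()
gen(tree)
print("\n".join(code))
# prints:
# t1 = 4 + 2
# a[index] = t1
```

Each emitted line mentions at most three addresses, which is the property the representation is named for.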
Target Code
(edited & modified for this presentation):
mov eax, 6
mov ecx, DWORD PTR _index$[ebp]
mov DWORD PTR _a$[ebp+ecx*4], eax
(Note source level constant folding optimization.)
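The constant folding noted above (4 + 2 becoming 6 before code is generated) could be sketched as a small tree rewrite. This is only an illustration under assumed conventions: the tuple tree encoding and the function name `fold` are hypothetical, and a real optimizer would handle more operators.

```python
def fold(node):
    """Return a copy of the tree with additions of two constants evaluated."""
    kind = node[0]
    if kind in ("num", "id"):
        return node  # leaves are returned unchanged
    # Fold the children first so nested constant expressions collapse bottom-up.
    children = [fold(child) for child in node[1:]]
    if kind == "add" and all(child[0] == "num" for child in children):
        return ("num", children[0][1] + children[1][1])
    return (kind, *children)

# Syntax tree for: a[index] = 4 + 2
tree = ("assign",
        ("subscript", ("id", "a"), ("id", "index")),
        ("add", ("num", 4), ("num", 2)))
folded = fold(tree)
print(folded)
# prints: ('assign', ('subscript', ('id', 'a'), ('id', 'index')), ('num', 6))
```

After folding, the right-hand side is the single constant 6, which matches the `mov eax, 6` in the target code.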
The Big Picture
The artifact produced at each stage for the running example:
Source Code:
  a[index] = 4 + 2
Tokens:
  ID Lbracket ID Rbracket AssignOp Num AddOp Num
Syntax Tree:
  assign-expression
    subscript-expression
      identifier a
      identifier index
    additive-expression
      number 4
      number 2
Annotated Tree:
  assign-expression : integer
    subscript-expression : integer
      identifier a : array of integer
      identifier index : integer
    additive-expression : integer
      number 4 : integer
      number 2 : integer
Intermediate Code:
  t = 4 + 2
  a[index] = t
Target Code:
  mov eax, 6
  mov ecx, DWORD PTR _index$[ebp]
  mov DWORD PTR _a$[ebp+ecx*4], eax
Supporting components throughout: Literal Table, Symbol Table, Error Handler
Algorithmic Tools
Tokens: defined using regular expressions. (Chapter 2)
Scanner:
an implementation of a finite state machine (deterministic
automaton) that recognizes the token regular expressions
(Chapter 2).
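To make the link between regular expressions and the scanner concrete, here is a small Python sketch that defines each token of the running example by a regular expression and scans `a[index] = 4 + 2`. It relies on Python's `re` module to do the matching rather than building the finite state machine by hand, and the names `TOKEN_SPEC`, `MASTER`, and `scan` are assumptions made for this example.

```python
import re

# One regular expression per token kind (token names follow the slides).
TOKEN_SPEC = [
    ("Num", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("Lbracket", r"\["),
    ("Rbracket", r"\]"),
    ("AssignOp", r"="),
    ("AddOp", r"\+"),
    ("Skip", r"\s+"),  # whitespace is recognized but not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(source):
    """Return a list of (kind, text) pairs for the tokens in source."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "Skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(" ".join(kind for kind, _ in scan("a[index] = 4 + 2")))
# prints: ID Lbracket ID Rbracket AssignOp Num AddOp Num
```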
Algorithmic Tools (cont.)
Parser
A push-down automaton (i.e. uses a stack), based on grammar
rules in a standard format (BNF – Backus-Naur Form).
(Chapters 3, 4, 5)
Semantic Analyzer and Code Generator:
Recursive evaluators based on semantic rules for attributes
(properties of language constructs). (Chapters 6, 7, 8)
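A recursive evaluator for one attribute, the data type, might look like the following Python sketch. It reproduces the annotations of the running example; the tuple tree encoding, the `symbols` dictionary, and the function name `type_of` are assumptions for illustration only.

```python
def type_of(node, symbols):
    """Recursively compute the type attribute of a syntax tree node."""
    kind = node[0]
    if kind == "num":
        return "integer"
    if kind == "id":
        if node[1] not in symbols:
            raise NameError(f"undeclared identifier {node[1]}")
        return symbols[node[1]]
    if kind == "subscript":
        array_type = type_of(node[1], symbols)
        index_type = type_of(node[2], symbols)
        if not (isinstance(array_type, tuple) and array_type[0] == "array of"):
            raise TypeError("subscripted value is not an array")
        if index_type != "integer":
            raise TypeError("array index must be an integer")
        return array_type[1]  # the element type
    if kind == "add":
        if type_of(node[1], symbols) != "integer" or type_of(node[2], symbols) != "integer":
            raise TypeError("operands of + must be integers")
        return "integer"
    if kind == "assign":
        target_type = type_of(node[1], symbols)
        if type_of(node[2], symbols) != target_type:
            raise TypeError("type mismatch in assignment")
        return target_type
    raise ValueError(f"unknown node kind: {kind}")

# Declarations assumed for the example: a is an array of integer, index is an integer.
symbols = {"a": ("array of", "integer"), "index": "integer"}
tree = ("assign",
        ("subscript", ("id", "a"), ("id", "index")),
        ("add", ("num", 4), ("num", 2)))
print(type_of(tree, symbols))
# prints: integer
```

Swapping the two identifiers (`index[a] = 4 + 2`) makes the same evaluator report a semantic error, since `index` is not an array.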
Other Phase Features
Parser and scanner together typically operate as a unit
(parser calls scanner repeatedly to generate tokens).
Front end:
Parser, scanner, semantic analyzer and source code optimizer
depend primarily on source language.
Back end:
code generator and target code optimizer depend primarily on
target language (machine architecture).
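The way a parser and scanner operate as a unit can be sketched in Python with a generator standing in for the scanner: the parser asks for one token at a time instead of receiving the whole token list up front. This is a recursive-descent sketch in which the call stack plays the role of the push-down automaton's stack. The tiny grammar, the class name `Parser`, and the tuple tree encoding are assumptions for this example, not the text's design.

```python
import re

def scanner(source):
    """Yield (kind, text) tokens one at a time, as the parser asks for them."""
    pattern = re.compile(r"\s*(?:(?P<Num>\d+)|(?P<ID>[A-Za-z_]\w*)|(?P<Op>[\[\]=+]))")
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = pattern.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        yield match.lastgroup, match.group(match.lastgroup)
        pos = match.end()
    yield "EOF", ""

class Parser:
    def __init__(self, source):
        self.tokens = scanner(source)
        self.current = next(self.tokens)  # one token of lookahead

    def advance(self):
        """Return the current token and pull the next one from the scanner."""
        token = self.current
        if token[0] != "EOF":
            self.current = next(self.tokens)
        return token

    def expect(self, text):
        if self.current[1] != text:
            raise SyntaxError(f"expected {text!r}, found {self.current[1]!r}")
        self.advance()

    def statement(self):
        # statement -> operand "=" expression
        target = self.operand()
        self.expect("=")
        value = self.expression()
        if self.current[0] != "EOF":
            raise SyntaxError("unexpected trailing input")
        return ("assign", target, value)

    def expression(self):
        # expression -> operand { "+" operand }
        node = self.operand()
        while self.current[1] == "+":
            self.advance()
            node = ("add", node, self.operand())
        return node

    def operand(self):
        # operand -> Num | ID [ "[" expression "]" ]
        kind, text = self.advance()
        if kind == "Num":
            return ("num", int(text))
        if kind == "ID":
            node = ("id", text)
            if self.current[1] == "[":
                self.advance()
                index = self.expression()
                self.expect("]")
                node = ("subscript", node, index)
            return node
        raise SyntaxError(f"unexpected token {text!r}")

print(Parser("a[index] = 4 + 2").statement())
# prints: ('assign', ('subscript', ('id', 'a'), ('id', 'index')), ('add', ('num', 4), ('num', 2)))
```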
Other Classifications
Logical unit: phase
Physical unit: separately compiled code file (see later)
Temporal unit: pass
Passes: trips through the source code (or intermediate code).
A pass is not the same thing as a phase, although one pass may correspond to one phase.
Data Structure Tools
Syntax tree: see previous pictures.
Literal table: "Hello, world!", 3.141592653589793, etc.
If a literal is used more than once (as literals often are in a program), we still want to store it only once.
So we use a table (almost always a hash table or table of hash tables).
Symbol table: all names (variables, functions, classes, typedefs, constants,
namespaces).
Again, a hash table or set of hash tables is the most likely data structure.
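Both tables can be sketched with Python dictionaries, which are hash tables. The literal table below stores each distinct literal once and hands back its slot number; the symbol table uses a stack of dictionaries, which is one possible reading of "set of hash tables" (one per scope). The class and method names are assumptions for this example.

```python
class LiteralTable:
    """Stores each distinct literal once and returns its slot number."""

    def __init__(self):
        self._index = {}   # (type, value) -> slot number
        self.values = []   # slot number -> literal value

    def intern(self, value):
        # Keying on the type as well keeps 1 and 1.0 in separate slots.
        key = (type(value), value)
        if key not in self._index:
            self._index[key] = len(self.values)
            self.values.append(value)
        return self._index[key]


class SymbolTable:
    """A stack of hash tables, one per nested scope."""

    def __init__(self):
        self.scopes = [{}]  # the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, info):
        if name in self.scopes[-1]:
            raise KeyError(f"{name} is already declared in this scope")
        self.scopes[-1][name] = info

    def lookup(self, name):
        # Search from the innermost scope outward.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None


literals = LiteralTable()
first = literals.intern("Hello, world!")
second = literals.intern(3.141592653589793)
again = literals.intern("Hello, world!")
print(first, second, again, len(literals.values))
# prints: 0 1 0 2
```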
Error Handler
One of the more difficult parts of a compiler to design.
Must handle a wide range of errors.
Must handle multiple errors.
Must not get stuck.
Must not get into an infinite loop (typical simple-minded
strategy: count errors, stop if count gets too high).
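The simple-minded strategy of counting errors and giving up past a limit could be sketched as follows. The class names `ErrorHandler` and `TooManyErrors` and the default limit of 10 are assumptions for this example.

```python
class TooManyErrors(Exception):
    """Raised to abandon compilation once the error limit is reached."""


class ErrorHandler:
    def __init__(self, max_errors=10):
        self.max_errors = max_errors
        self.errors = []

    def report(self, line, message):
        """Record an error; stop the compiler if too many have accumulated."""
        self.errors.append(f"line {line}: {message}")
        if len(self.errors) >= self.max_errors:
            raise TooManyErrors(f"stopping after {len(self.errors)} errors")


handler = ErrorHandler(max_errors=3)
handler.report(1, "unknown keyword 'iff'")
handler.report(1, "unexpected '}'")
try:
    handler.report(2, "type mismatch in initialization")
except TooManyErrors as stop:
    print(stop)
# prints: stopping after 3 errors
```

The limit guarantees termination even if error recovery keeps producing new errors, which addresses the "must not get stuck" requirement in the crudest possible way.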
Kinds of Errors
Syntax: iff (x == 0) y + = z + r; }
Semantic: int x = "Hello, world!";
Runtime: int x = 2;
...
double y = 3.14159 / (x - 2);
Errors (cont.)
A compiler must handle syntax and semantic errors, but
not runtime errors (whether a runtime error will occur
is an undecidable question).
Sometimes a compiler is required to generate code to
catch runtime errors and handle them in some graceful
way (either with or without exception handling).
This, too, is often difficult.
Sample Compilers in This Class ("Toys")
TINY: a 4-pass compiler for the TINY language, based on
Pascal (see text, pages 22-26)
C-Minus: a project language given in the text (see text,
pages 26-27 and Appendix A). Based on C.
SIL: Simple Island Language:
TINY Example
read x;
if 0 < x then
  fact := 1;
  repeat
    fact := fact * x;
    x := x - 1
  until x = 0;
  write fact
end
C-Minus Example
int fact( int x )
{ if (x > 1)
    return x * fact(x-1);
  else
    return 1;
}
void main( void )
{ int x;
  x = read();
  if (x > 0) write( fact(x) );
}
Structure of the TINY Compiler
globals.h main.c
util.h util.c
scan.h scan.c
parse.h parse.c
symtab.h symtab.c
analyze.h analyze.c
code.h code.c
cgen.h cgen.c
Conditional Compilation Options
NO_PARSE:
Builds a scanner-only compiler.
NO_ANALYZE:
Builds a compiler that parses and scans only.
NO_CODE:
Builds a compiler that performs semantic analysis, but generates
no code.
Listing Options (built in - not flags)
EchoSource:
Echoes the TINY source program to the listing, together with line
numbers.
TraceScan:
Displays information on each token as the scanner recognizes it.
TraceParse:
Displays the syntax tree in a linearized format.
TraceAnalyze:
Displays summary information on the symbol table and type checking.
TraceCode:
Prints code generation-tracing comments to the code file.