CIS 461 Compiler Design and Construction Fall 2012 Instructor: Hugh McGuire Lecture-Module 2b Phases of a Compiler
Jan 13, 2016
• Must recognize legal (and illegal) programs
• Must generate correct code
• Must manage storage of all variables (and code)
• Must agree with OS and linker on format for object code
High-level View of a Compiler
[Figure: Source code → Compiler → Machine code, with Errors reported by the compiler]
A Higher Level View: How Does the Compiler Fit In?
[Figure: skeletal source program → Preprocessor → source program → Compiler → target assembly program → Assembler → relocatable machine code → Loader/Linker → absolute machine code; library routines and relocatable object files also feed the Loader/Linker]
Preprocessor
• collects the source program that is divided into separate files
• macro expansion
Assembler
• generates machine code from the assembly code
Loader/Linker
• links the library routines and other object modules
• generates absolute addresses
Traditional Two-pass Compiler
• Use an intermediate representation (IR)
• Front end maps legal source code into IR
• Back end maps IR into target machine code
• Admits multiple front ends and multiple passes
– Typically, front end is O(n) or O(n log n), while the back-end problems are NP-complete
• Different phases of compiler also interact through the symbol table
[Figure: Source code → Front End → IR → Back End → Machine code; both ends report Errors and share the Symbol Table]
The Front End
Responsibilities
• Recognize legal programs
• Report errors for the illegal programs in a useful way
• Produce IR and construct the symbol table
• Much of front end construction can be automated
[Figure: Source code → Scanner → tokens → Parser → IR → Type Checker → IR, with Errors reported]
The Front End
Scanner
• Maps character stream into words—the basic unit of syntax
• Produces tokens and stores lexemes when it is necessary
– x = x + y ; becomes <id,x> EQ <id,x> PLUS <id,y> SEMICOLON
– Typical tokens include number, identifier, +, -, while, if
• Scanner eliminates white space and comments
The Front End
Parser
• Uses scanner as a subroutine
• Recognizes context-free syntax and reports errors
• Guides context-sensitive analysis (type checking)
• Builds IR for source program
• Scanning and parsing can be grouped into one pass
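[Figure: Source code → Scanner → Parser → IR → Type Checker → IR, with Errors reported; the parser sends "get next token" requests to the scanner, which returns a token]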
The Front End
Context Sensitive Analysis
• Check if all the variables are declared before they are used
• Type checking
– Check type errors such as adding a procedure and an array
• Add the necessary type conversions
– int-to-float, float-to-double, etc.
The Back End
Responsibilities
• Translate IR into target machine code
• Choose instructions to implement each IR operation
• Decide which values to keep in registers
• Schedule the instructions for the instruction pipeline
Automation has been much less successful in the back end
[Figure: IR → Instruction Selection → IR → Register Allocation → IR → Instruction Scheduling → Machine code, with Errors reported]
The Back End
Instruction Selection
• Produce fast, compact code
• Take advantage of target language features such as addressing modes
• Usually viewed as a pattern matching problem
– ad hoc methods, pattern matching, dynamic programming
This was the problem of the future in the late 1970s, when instruction sets were complex
– RISC architectures simplified this problem
The Back End
Instruction Scheduling
• Avoid hardware stalls (keep pipeline moving)
• Use all functional units productively
• Optimal scheduling is NP-Complete
The Back End
Register allocation
• Have each value in a register when it is used
• Manage a limited set of registers
• Can change instruction choices and insert LOADs and STOREs
• Optimal allocation is NP-Complete
Compilers approximate solutions to NP-Complete problems
Traditional Three-pass Compiler
Code Optimization
• Analyzes IR and transforms IR
• Primary goal is to reduce running time of the compiled code
– May also improve space, power consumption (mobile computing)
• Must preserve “meaning” of the code
[Figure: Source code → Front End → IR → Middle End → IR → Back End → Machine code, with Errors reported]
The Optimizer (or Middle End)
Typical Transformations
• Discover and propagate constant values
• Move a computation to a less frequently executed place
• Discover a redundant computation and remove it
• Remove unreachable code
[Figure: IR → Opt 1 → IR → Opt 2 → IR → Opt 3 → IR → ... → Opt n → IR, with Errors reported]
Modern optimizers are structured as a series of passes
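As a small illustration of one such pass, the sketch below folds constant subexpressions in a toy IR. The IR shape (nested tuples of the form (op, left, right), with ints as constants and strings as variable names) and the name fold are assumptions made for this example, not something the slides define. It covers only the simplest part of discovering constant values.

```python
def fold(expr):
    """Fold constant subexpressions in a toy expression tree.

    expr is an int (constant), a str (variable name), or a tuple
    (op, left, right) with op one of "+", "-", "*".
    """
    if not isinstance(expr, tuple):
        return expr  # constants and variables are already as simple as possible
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        if op == "+":
            return left + right
        if op == "-":
            return left - right
        if op == "*":
            return left * right
    # At least one operand is not a known constant: keep the operation.
    return (op, left, right)


# x + 2 * (3 + 4): the constant part collapses, the variable stays.
folded = fold(("+", "x", ("*", 2, ("+", 3, 4))))
print(folded)
```

The pass preserves the meaning of the expression: it only replaces an operation by its value when both operands are known at compile time.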
First Phase: Lexical Analysis (Scanning)
Scanner
• Maps stream of characters into words
– Basic unit of syntax
• Characters that form a word are its lexeme
• Its syntactic category is called its token
• Scanner discards white space and comments
[Figure: Source code → Scanner → Parser → IR, with Errors reported; the parser sends "get next token" requests to the scanner, which returns a token]
Why Lexical Analysis?
• By separating context free syntax from lexical analysis
– We can develop efficient scanners
– We can automate efficient scanner construction
– We can write simple specifications for tokens
[Figure: specifications (regular expressions) → Scanner Generator → tables or code → Scanner; source code → Scanner → tokens]
What are Tokens?
• Token: Basic unit of syntax
– Keywords
  if, while, ...
– Operators
  +, *, <=, ||, ...
– Identifiers (names of variables, arrays, procedures, classes)
  i, i1, j1, count, sum, ...
– Numbers
  12, 3.14, 7.2E-2, ...
What are Tokens?
• Tokens are terminal symbols for the parser
– Tokens are treated as indivisible units in the grammar defining the source language
1. S → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op → +
7.    | -
number, id, +, - are tokens passed from scanner to parser. They form the terminal symbols of this simple grammar.
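To make the role of terminal symbols concrete, here is a minimal sketch of a parser for the grammar above. It assumes the scanner delivers tokens as (kind, lexeme) pairs with kind one of "number", "id", "+", "-"; that representation and the name parse_expr are choices made for this example. The left-recursive rule expr → expr op term is handled with a loop, which yields the same left-associative tree.

```python
def parse_expr(tokens):
    """Parse a token list for:
         S -> expr
         expr -> expr op term | term
         term -> number | id
         op -> + | -
    Returns a nested tuple (op, left, right) or a leaf token.
    """
    pos = 0

    def parse_term():
        nonlocal pos
        if pos < len(tokens) and tokens[pos][0] in ("number", "id"):
            leaf = tokens[pos]
            pos += 1
            return leaf
        raise SyntaxError(f"expected number or id at token {pos}")

    tree = parse_term()
    # Each further "op term" extends the tree to the left.
    while pos < len(tokens) and tokens[pos][0] in ("+", "-"):
        op = tokens[pos][0]
        pos += 1
        tree = (op, tree, parse_term())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree


# Token stream a scanner might produce for: x + 2 - y
tree = parse_expr([("id", "x"), ("+", "+"), ("number", "2"), ("-", "-"), ("id", "y")])
print(tree)
```

The parser never looks inside a lexeme to decide structure; it only inspects the token kind, which is what treating tokens as indivisible units means in practice.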
Lexical Concepts
• Token: Basic unit of syntax, syntactic output of the scanner
• Pattern: The rule that describes the set of strings that correspond to a token, specification of the token
• Lexeme: A sequence of input characters that matches a pattern and generates the token

Token   Lexeme                    Pattern
WHILE   while                     while
IF      if                        if
ID      i1, length, count, sqrt   letter followed by letters and digits
Tokens can have Attributes
• A problem: consider the scanner output for a small statement
  if (i == j) z = 0; else z = 1;
  becomes
  IF, LPAREN, ID, EQEQ, ID, RPAREN, ID, EQ, NUM, SEMICOLON, ELSE, ID, EQ, NUM, SEMICOLON
• If we send this output to the parser, is it enough? Where are the variable names, procedure names, etc.? All identifiers look the same.
• Tokens can have attributes that they can pass to the parser (using the symbol table)
  IF, LPAREN, <ID, i>, EQEQ, <ID, j>, RPAREN, <ID, z>, EQ, <NUM, 0>, SEMICOLON, ELSE, <ID, z>, EQ, <NUM, 1>, SEMICOLON
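The attributed token stream above can be reproduced with a small hand-written scanner. The following is only a sketch built on Python's re module: the token names follow the slide, but TOKEN_SPEC, MASTER, ATTRIBUTED and tokenize are names invented for this example, and the lexeme is carried directly in the token rather than through a symbol table.

```python
import re

# Hypothetical token specification: (token name, regular expression).
# Order matters: keywords come before ID, and "==" before "=".
TOKEN_SPEC = [
    ("IF", r"if\b"),
    ("ELSE", r"else\b"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("NUM", r"[0-9]+"),
    ("EQEQ", r"=="),
    ("EQ", r"="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),  # white space is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
ATTRIBUTED = {"ID", "NUM"}  # tokens that carry their lexeme as an attribute


def tokenize(source):
    """Return a list of tokens; ID and NUM become (token, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"illegal character {source[pos]!r} at position {pos}")
        kind = match.lastgroup
        if kind != "SKIP":
            tokens.append((kind, match.group()) if kind in ATTRIBUTED else kind)
        pos = match.end()
    return tokens


tokens = tokenize("if (i == j) z = 0; else z = 1;")
print(tokens)
```

Keywords and punctuation need no attribute because the token alone identifies them; identifiers and numbers do, since otherwise i, j and z would be indistinguishable to the parser.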
How do we specify lexical patterns?
Some patterns are easy
• Keywords and operators
– Specified as literal patterns: if, then, else, while, =, +, …
Some patterns are more complex
• Identifiers
– letter followed by letters and digits
• Numbers
– Integer: 0 or a digit between 1 and 9 followed by digits between 0 and 9
– Decimal: An optional sign which can be “+” or “-” followed by digit “0” or a nonzero digit followed by an arbitrary number of digits followed by a decimal point followed by an arbitrary number of digits
GOAL: We want to have concise descriptions of patterns, and we want to automatically construct the scanner from these descriptions
Specifying Lexical Patterns: Regular Expressions
Regular expressions (REs) describe regular languages
Regular Expression (over alphabet Σ)
• ε (empty string) is an RE denoting the set {ε}
• If a is in Σ, then a is an RE denoting {a}
• If x and y are REs denoting languages L(x) and L(y) then
– (x) is an RE denoting L(x)
– x | y is an RE denoting L(x) ∪ L(y)
– xy is an RE denoting L(x)L(y)
– x* is an RE denoting L(x)*
Precedence is closure, then concatenation, then alternation
All left-associative
x | y* z is equivalent to x | ((y*) z)
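The precedence rule can be checked experimentally. The sketch below uses Python's re module, whose operators follow the same precedence for these three operations; the sample strings and variable names are chosen only for this illustration.

```python
import re

implicit = re.compile(r"x|y*z")      # relies on precedence
explicit = re.compile(r"x|((y*)z)")  # fully parenthesized reading
wrong = re.compile(r"(x|y)*z")       # what it would mean if | bound tighter

samples = ["", "x", "y", "z", "yz", "yyyz", "xz", "xyz"]

# The two readings accept exactly the same sample strings.
agree = all(
    bool(implicit.fullmatch(s)) == bool(explicit.fullmatch(s)) for s in samples
)
print(agree)

# "xz" separates the correct reading from the incorrect one.
print(implicit.fullmatch("xz") is None, wrong.fullmatch("xz") is not None)
```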
Operations on Languages
Operation                                 Definition
Union of L and M, written L ∪ M           L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation of L and M, written LM      LM = { st | s ∈ L and t ∈ M }
Exponentiation of L, written L^i          L^i = {ε} if i = 0; L^(i-1) L if i > 0
Kleene closure of L, written L*           L* = union of L^i over all i ≥ 0
Positive closure of L, written L+         L+ = union of L^i over all i ≥ 1
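These operations can be tried out on small finite languages represented as Python sets of strings. This is a sketch only: the function names are made up for the example, and because L* and L+ are infinite in general, the closures are cut off at a chosen bound n.

```python
def union(L, M):
    return L | M


def concat(L, M):
    return {s + t for s in L for t in M}


def power(L, i):
    """L^0 = {empty string}; L^i = L^(i-1) L for i > 0."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_upto(L, n):
    """Finite approximation of L*: union of L^0 .. L^n."""
    result = set()
    for i in range(n + 1):
        result |= power(L, i)
    return result


def positive_upto(L, n):
    """Finite approximation of L+: union of L^1 .. L^n."""
    result = set()
    for i in range(1, n + 1):
        result |= power(L, i)
    return result


L = {"a", "b"}
M = {"c"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(power(L, 2)))
print(sorted(kleene_upto(L, 2)))
```

The empty string belongs to the Kleene closure through L^0, and it is missing from the positive closure unless L itself contains it.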
Examples of Regular Expressions
• All strings of 1s and 0s
( 0 | 1 )*
• All strings of 1s and 0s beginning with a 1
1 ( 0 | 1 )*
• All strings of 0s and 1s containing at least two consecutive 1s
( 0 | 1 )* 1 1 ( 0 | 1 )*
• All strings of alternating 0s and 1s
( ε | 1 ) ( 0 1 )* ( ε | 0 )
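The four expressions above can be tested directly with Python's re module. In this sketch an empty alternative such as (|1) plays the role of ε | 1; the variable names are chosen for the example.

```python
import re

any_bits = re.compile(r"(0|1)*")
starts_with_one = re.compile(r"1(0|1)*")
has_two_ones = re.compile(r"(0|1)*11(0|1)*")
alternating = re.compile(r"(|1)(01)*(|0)")  # empty alternative stands for epsilon


def accepts(pattern, word):
    """True when the whole word belongs to the language of the pattern."""
    return pattern.fullmatch(word) is not None


print(accepts(starts_with_one, "1001"), accepts(starts_with_one, "0110"))
print(accepts(has_two_ones, "0110"), accepts(has_two_ones, "0101"))
print(accepts(alternating, "1010"), accepts(alternating, "0110"))
```

Using fullmatch rather than search matters here: a regular expression describes a set of whole strings, not substrings.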
Extensions to Regular Expressions (a la JLex)
• x+ = x x* denotes L(x)+
• x? = x | ε denotes L(x) ∪ {ε}
• [abc] = a | b | c matches one character in the square brackets
• a-z = a | b | c | ... | z denotes a range (used inside square brackets)
• [0-9a-z] = 0 | 1 | 2 | ... | 9 | a | b | c | ... | z
• [^abc]: ^ means negation; matches any character except a, b or c
• . (dot) matches any character except the newline
• . = [^\n]: \n means newline, so dot is equivalent to [^\n]
• "[" matches left square bracket; metacharacters in double quotes become plain characters
• \[ matches left square bracket; a metacharacter after a backslash becomes a plain character
Regular Definitions
• We can define macros using regular expressions and use them in other regular expressions
Letter → (a|b|c| … |z|A|B|C| … |Z)
Digit → (0|1|2| … |9)
Identifier → Letter ( Letter | Digit )*
• Important: We should be able to order these definitions so that every definition uses only the definitions defined before it (i.e., no recursion)
• Regular definitions can be converted to basic regular expressions with macro expansion
• In JLex enclose definitions using curly braces
Identifier {Letter} ( {Letter} | {Digit} )*
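Macro expansion of regular definitions can be sketched in a few lines. The example below borrows the curly-brace reference style and expands each definition using only the ones before it, then hands the result to Python's re module. It does not reproduce what JLex does internally; expand and the definitions list are names invented for this illustration.

```python
import re

# Ordered definitions: each body may refer only to earlier names.
definitions = [
    ("Letter", "[a-zA-Z]"),
    ("Digit", "[0-9]"),
    ("Identifier", "{Letter}({Letter}|{Digit})*"),
]


def expand(definitions):
    """Replace every {Name} by the already expanded body of Name."""
    expanded = {}
    for name, body in definitions:
        for known, pattern in expanded.items():
            body = body.replace("{" + known + "}", "(" + pattern + ")")
        leftover = re.search(r"\{[A-Za-z]+\}", body)
        if leftover:
            # A reference to a later (or missing) definition would be recursion.
            raise ValueError(f"{name} uses undefined macro {leftover.group()}")
        expanded[name] = body
    return expanded


expanded = expand(definitions)
identifier = re.compile(expanded["Identifier"])
print(expanded["Identifier"])
print(identifier.fullmatch("count") is not None, identifier.fullmatch("1i") is not None)
```

Because every reference is resolved against definitions that are already fully expanded, the process always terminates with a plain regular expression.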
Examples of Regular Expressions
Digit → (0|1|2| … |9)
Integer → (+|-)? (0 | (1|2|3| … |9) Digit*)
Decimal → Integer "." Digit*
Real → ( Integer | Decimal ) E (+|-)? Digit*
Complex → "(" Real , Real ")"
Numbers can get even more complicated.
From Regular Expressions to Scanners
• Regular expressions are useful for specifying patterns that correspond to tokens
• However, we also want to construct programs which recognize these patterns
• How do we do it?
– Use finite automata!
Consider the problem of recognizing register names in an assembler
Register → R (0|1|2| … |9) (0|1|2| … |9)*
• Allows registers of arbitrary number
• Requires at least one digit
RE corresponds to a recognizer (or DFA)
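Example
[Figure: Recognizer for Register. From the initial state s0, the symbol R leads to s1; from s1, a digit (0|1|2| … |9) leads to the accepting state s2; s2 loops on a digit (0|1|2| … |9). A digit in s0, or an R in s1 or s2, leads to the error state se, which loops on (R|0|1|2| … |9).]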
Deterministic Finite Automata (DFA)
• A set of states S
– S = { s0 , s1 , s2 , se }
• A set of input symbols (an alphabet) Σ
– Σ = { R , 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }
• A transition function δ : S × Σ → S
– Maps (state, symbol) pairs to states
– δ = { ( s0 , R ) → s1 , ( s0 , 0-9 ) → se , ( s1 , 0-9 ) → s2 , ( s1 , R ) → se , ( s2 , 0-9 ) → s2 , ( s2 , R ) → se , ( se , R | 0-9 ) → se }
• A start state
– s0
• A set of final (or accepting) states
– Final = { s2 }
A DFA accepts a word x iff there exists a path in the transition graph from start state to a final state such that the edge labels along the path spell out x
DFA simulation
• Start in state s0 and follow transitions on each input character
• DFA accepts a word x iff x leaves it in a final state (s2 )
So,
• “R17” takes it through s0 , s1 , s2 and accepts
• “R” takes it through s0 , s1 and fails
• “A” takes it straight to se
• “R17R” takes it through s0 , s1 , s2 , se and rejects
Simulating a DFA
  state = s0;
  char = get_next_char();
  while (char != EOF) {
      state = δ(state, char);
      char = get_next_char();
  }
  if (state ∈ Final) report acceptance;
  else report failure;

We can store the transition table in a two-dimensional array:

State   R    0,1,2,3,4,5,6,7,8,9   other
s0      s1   se                    se
s1      se   s2                    se
s2      se   s2                    se
se      se   se                    se

Final = { s2 }
We can also store the final states in an array

• The recognizer translates directly into code
• To change DFAs, just change the arrays
• Takes O(|x|) time for input string x
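A runnable version of this table-driven simulation might look as follows. It is a sketch: the integer state encoding, the helper column that maps a character to a table column, and the name accepts are assumptions made for this example.

```python
# States of the Register recognizer, encoded as row indices.
S0, S1, S2, SE = range(4)
FINAL = {S2}

# TABLE[state][column]; columns are: R, digit 0-9, any other character.
TABLE = [
    [S1, SE, SE],  # s0
    [SE, S2, SE],  # s1
    [SE, S2, SE],  # s2
    [SE, SE, SE],  # se
]


def column(ch):
    """Map an input character to its column in TABLE."""
    if ch == "R":
        return 0
    if ch in "0123456789":
        return 1
    return 2


def accepts(word):
    """Follow one transition per character; accept iff we end in a final state."""
    state = S0
    for ch in word:
        state = TABLE[state][column(ch)]
    return state in FINAL


for word in ["R17", "R", "A", "R17R"]:
    print(word, accepts(word))
```

Recognizing a different language only requires swapping TABLE, FINAL and column; the loop in accepts stays the same and does one table lookup per input character.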
Recognizing Longest Accepted Prefix

  accepted = false;
  current_string = ε;   // empty string
  state = s0;           // initial state
  if (state ∈ Final) {
      accepted_string = current_string;
      accepted = true;
  }
  char = get_next_char();
  while (char != EOF) {
      state = δ(state, char);
      current_string = current_string + char;
      if (state ∈ Final) {
          accepted_string = current_string;
          accepted = true;
      }
      char = get_next_char();
  }
  if accepted return accepted_string;
  else report error;

State   R    0,1,2,3,4,5,6,7,8,9   other
s0      s1   se                    se
s1      se   s2                    se
s2      se   s2                    se
se      se   se                    se

Final = { s2 }

Given an input string, this simulation algorithm returns the longest accepted prefix
Given the input "R17R", this simulation algorithm returns "R17"
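The longest-accepted-prefix algorithm can be sketched the same way. The state encoding, the column helper and the function name longest_accepted_prefix are assumptions made for this example; instead of building current_string character by character, the sketch records the length of the last accepted prefix and slices the input at the end.

```python
# States of the Register recognizer, encoded as row indices.
S0, S1, S2, SE = range(4)
FINAL = {S2}

# TABLE[state][column]; columns are: R, digit 0-9, any other character.
TABLE = [
    [S1, SE, SE],  # s0
    [SE, S2, SE],  # s1
    [SE, S2, SE],  # s2
    [SE, SE, SE],  # se
]


def column(ch):
    """Map an input character to its column in TABLE."""
    if ch == "R":
        return 0
    if ch in "0123456789":
        return 1
    return 2


def longest_accepted_prefix(text):
    """Return the longest prefix of text that the DFA accepts."""
    state = S0
    # Length of the longest accepted prefix seen so far (None: nothing accepted).
    accepted_length = 0 if state in FINAL else None
    for i, ch in enumerate(text, start=1):
        state = TABLE[state][column(ch)]
        if state in FINAL:
            accepted_length = i
    if accepted_length is None:
        raise ValueError("no prefix of the input is accepted")
    return text[:accepted_length]


print(longest_accepted_prefix("R17R"))
```

Like the pseudocode, this version reads the whole input. A scanner would typically stop as soon as the error state se is reached, since no longer prefix can be accepted from there.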