CIS 461 Compiler Design and Construction Fall 2012 Instructor: Hugh McGuire Lecture-Module 2b Phases of a Compiler.

Phases of a Compiler

• Must recognize legal (and illegal) programs

• Must generate correct code

• Must manage storage of all variables (and code)

• Must agree with OS and linker on format for object code

High-level View of a Compiler

Source code → Compiler → Machine code (the compiler also reports errors)

A Higher Level View: How Does the Compiler Fit In?

skeletal source program → Preprocessor → source program → Compiler → target assembly program → Assembler → relocatable machine code → Loader/Linker → absolute machine code

• Preprocessor

– collects the source program that is divided into separate files

– macro expansion

• Assembler

– generates machine code from the assembly code

• Loader/Linker

– takes library routines and relocatable object files as additional input

– links the library routines and other object modules

– generates absolute addresses

Traditional Two-pass Compiler

• Use an intermediate representation (IR)

• Front end maps legal source code into IR

• Back end maps IR into target machine code

• Admits multiple front ends and multiple passes

– Typically, the front end runs in O(n) or O(n log n) time, while back-end problems are NP-complete

• Different phases of compiler also interact through the symbol table

Source code → Front End → IR → Back End → Machine code (both ends report errors and use the symbol table)

The Front End

Responsibilities

• Recognize legal programs

• Report errors for the illegal programs in a useful way

• Produce IR and construct the symbol table

• Much of front end construction can be automated

Source code → Scanner → tokens → Parser → IR → Type Checker → IR (each stage can report errors)

The Front End

Scanner

• Maps character stream into words—the basic unit of syntax

• Produces tokens and stores lexemes when it is necessary

– x = x + y ; becomes <id,x> EQ <id,x> PLUS <id,y> SEMICOLON

– Typical tokens include number, identifier, +, -, while, if

• Scanner eliminates white space and comments
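The scanner's job on the example above can be sketched in a few lines of Python. This is only an illustration, not the course's scanner: the token names come from the slide, while `TOKEN_SPEC`, `MASTER`, and `scan` are hypothetical names, and comment handling is left out.

```python
import re

# Hypothetical token specification: (token name, regular expression) pairs.
TOKEN_SPEC = [
    ("NUMBER",    r"[0-9]+"),
    ("ID",        r"[A-Za-z][A-Za-z0-9]*"),
    ("EQ",        r"="),
    ("PLUS",      r"\+"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),   # white space is matched and then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(source):
    """Return the token list; ID and NUMBER tokens also carry their lexeme."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"illegal character {source[pos]!r} at position {pos}")
        kind = match.lastgroup
        if kind in ("ID", "NUMBER"):
            tokens.append((kind, match.group()))   # keep the lexeme
        elif kind != "SKIP":
            tokens.append(kind)                    # the token name is enough
        pos = match.end()
    return tokens


print(scan("x = x + y ;"))
```

Only identifiers and numbers keep their lexemes here, which mirrors "stores lexemes when it is necessary".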


The Front End

Parser

• Uses scanner as a subroutine

• Recognizes context-free syntax and reports errors

• Guides context-sensitive analysis (type checking)

• Builds IR for source program

• Scanning and parsing can be grouped into one pass

Source code → Scanner ⇄ Parser → IR → Type Checker → IR (the parser asks the scanner to "get next token" and receives a token; each stage can report errors)

The Front End

Context Sensitive Analysis

• Check if all the variables are declared before they are used

• Type checking

– Check type errors such as adding a procedure and an array

• Add the necessary type conversions

– int-to-float, float-to-double, etc.
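A minimal sketch of these checks, assuming a toy expression tree with only `int` and `float` variables. `SYMBOLS`, `check`, and the `int2float` node are hypothetical names used for illustration.

```python
# Hypothetical symbol table: declared variable names and their types.
SYMBOLS = {"i": "int", "x": "float"}


def check(expr):
    """Type-check an expression and return (typed_tree, type).

    An expression is either a variable name (str) or a tuple (op, left, right).
    """
    if isinstance(expr, str):
        if expr not in SYMBOLS:
            raise NameError(f"variable {expr!r} used before declaration")
        return expr, SYMBOLS[expr]
    op, left, right = expr
    left_tree, left_type = check(left)
    right_tree, right_type = check(right)
    if left_type == right_type:
        return (op, left_tree, right_tree), left_type
    # Mixed int/float operands: insert an explicit int-to-float conversion.
    if left_type == "int":
        left_tree = ("int2float", left_tree)
    else:
        right_tree = ("int2float", right_tree)
    return (op, left_tree, right_tree), "float"


print(check(("+", "i", "x")))
```

The conversion becomes an explicit node in the tree, so later phases do not have to rediscover it.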


The Back End

Responsibilities

• Translate IR into target machine code

• Choose instructions to implement each IR operation

• Decide which values to keep in registers

• Schedule the instructions for instruction pipeline

Automation has been much less successful in the back end

IR → Instruction Selection → IR → Register Allocation → IR → Instruction Scheduling → Machine code (each stage can report errors)

The Back End

Instruction Selection

• Produce fast, compact code

• Take advantage of target language features such as addressing modes

• Usually viewed as a pattern matching problem

– ad hoc methods, pattern matching, dynamic programming

• This was the problem of the future in the late 1970s, when instruction sets were complex

– RISC architectures simplified this problem


The Back End

Instruction Scheduling

• Avoid hardware stalls (keep pipeline moving)

• Use all functional units productively

• Optimal scheduling is NP-Complete


The Back End

Register allocation

• Have each value in a register when it is used

• Manage a limited set of registers

• Can change instruction choices and insert LOADs and STOREs

• Optimal allocation is NP-Complete

Compilers approximate solutions to NP-Complete problems


Traditional Three-pass Compiler

Code Optimization

• Analyzes IR and transforms IR

• Primary goal is to reduce running time of the compiled code

– May also improve space, power consumption (mobile computing)

• Must preserve “meaning” of the code

Source code → Front End → IR → Middle End → IR → Back End → Machine code (each stage can report errors)

The Optimizer (or Middle End)

Typical Transformations

• Discover and propagate constant values

• Move a computation to a less frequently executed place

• Discover a redundant computation and remove it

• Remove unreachable code

IR → Opt 1 → IR → Opt 2 → IR → Opt 3 → IR → ... → Opt n → IR (each pass can report errors)

Modern optimizers are structured as a series of passes
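As an illustration of one such pass, the sketch below discovers and propagates constant values over straight-line code. The IR format, tuples of `(dest, op, arg1, arg2)`, and the name `propagate_constants` are assumptions made for this example, not the IR used in the course.

```python
def propagate_constants(code):
    """One IR-to-IR pass over straight-line code.

    Each instruction is (dest, op, arg1, arg2) where op is "const"
    (arg1 holds the value), "+" or "*". Operands are variable names
    (str) or integer literals.
    """
    known = {}     # variable name -> constant value discovered so far
    result = []
    for dest, op, a, b in code:
        if op == "const":
            known[dest] = a
            result.append((dest, "const", a, None))
            continue
        # Replace operands whose values are already known constants.
        a = known.get(a, a)
        b = known.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            value = a + b if op == "+" else a * b
            known[dest] = value                 # fold the operation
            result.append((dest, "const", value, None))
        else:
            known.pop(dest, None)               # dest is no longer constant
            result.append((dest, op, a, b))
    return result


program = [
    ("a", "const", 4, None),
    ("b", "+", "a", 2),
    ("c", "*", "b", "n"),
]
print(propagate_constants(program))
```

Because the pass consumes IR and produces IR, it can be chained with other passes in any order.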

First Phase: Lexical Analysis (Scanning)

Scanner

• Maps stream of characters into words

– Basic unit of syntax

• Characters that form a word are its lexeme

• Its syntactic category is called its token

• Scanner discards white space and comments

Source code → Scanner ⇄ Parser → IR (the parser asks the scanner to "get next token" and receives a token; both can report errors)

Why Lexical Analysis?

• By separating context free syntax from lexical analysis

– We can develop efficient scanners

– We can automate efficient scanner construction

– We can write simple specifications for tokens

specifications (regular expressions) → Scanner Generator → tables or code → Scanner

source code → Scanner → tokens

What are Tokens?

• Token: Basic unit of syntax

– Keywords: if, while, ...

– Operators: +, *, <=, ||, ...

– Identifiers (names of variables, arrays, procedures, classes): i, i1, j1, count, sum, ...

– Numbers: 12, 3.14, 7.2E-2, ...

What are Tokens?

• Tokens are terminal symbols for the parser

– Tokens are treated as indivisible units in the grammar defining the source language

1. S → expr

2. expr → expr op term

3. | term

4. term → number

5. | id

6. op → +

7. | -

number, id, +, - are tokens passed from scanner to parser. They form the terminal symbols of this simple grammar.
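To show tokens acting as terminal symbols, here is a small hand-written recognizer for this grammar. It is a sketch under assumptions: the left-recursive rule `expr → expr op term` is handled with a loop, tokens arrive as `(kind, lexeme)` pairs, and the function name `parse` is hypothetical.

```python
def parse(tokens):
    """Parse a list of (kind, lexeme) tokens; kind is number, id, + or -.

    Returns a left-associative tree of the form (op, left, right).
    """
    pos = 0

    def term():
        nonlocal pos
        # term -> number | id
        if pos < len(tokens) and tokens[pos][0] in ("number", "id"):
            leaf = tokens[pos]
            pos += 1
            return leaf
        raise SyntaxError(f"expected number or id at token {pos}")

    # expr -> expr op term | term, written as: term (op term)*
    tree = term()
    while pos < len(tokens) and tokens[pos][0] in ("+", "-"):
        op = tokens[pos][0]
        pos += 1
        tree = (op, tree, term())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree


print(parse([("id", "x"), ("+", "+"), ("number", "2"), ("-", "-"), ("id", "y")]))
```

The parser only inspects the token kind; the lexeme is carried along as an attribute.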

Lexical Concepts

• Token: Basic unit of syntax, syntactic output of the scanner

• Pattern: The rule that describes the set of strings that correspond to a token, specification of the token

• Lexeme: A sequence of input characters that matches a pattern and generates the token

Token    Lexeme                     Pattern
WHILE    while                      while
IF       if                         if
ID       i1, length, count, sqrt    letter followed by letters and digits

Tokens can have Attributes

• A problem

if (i == j) z = 0; else z = 1;

becomes

IF, LPAREN, ID, EQEQ, ID, RPAREN, ID, EQ, NUM, SEMICOLON, ELSE, ID, EQ, NUM, SEMICOLON

• If we send this output to the parser, is it enough? Where are the variable names, procedure names, etc.? All identifiers look the same.

• Tokens can have attributes that they can pass to the parser (using the symbol table)

IF, LPAREN, <ID, i>, EQEQ, <ID, j>, RPAREN, <ID, z>, EQ, <NUM, 0>, SEMICOLON, ELSE, <ID, z>, EQ, <NUM, 1>, SEMICOLON

How do we specify lexical patterns?

Some patterns are easy

• Keywords and operators

– Specified as literal patterns: if, then, else, while, =, +, …

Some patterns are more complex

• Identifiers

– letter followed by letters and digits

• Numbers

– Integer: 0 or a digit between 1 and 9 followed by digits between 0 and 9

– Decimal: An optional sign which can be “+” or “-” followed by digit “0” or a nonzero digit followed by an arbitrary number of digits followed by a decimal point followed by an arbitrary number of digits

GOAL: We want to have concise descriptions of patterns, and we want to automatically construct the scanner from these descriptions

Specifying Lexical Patterns: Regular Expressions

Regular expressions (REs) describe regular languages

Regular Expression (over alphabet Σ)

• ε (empty string) is a RE denoting the set {ε}

• If a is in Σ, then a is a RE denoting {a}

• If x and y are REs denoting languages L(x) and L(y) then

– x is an RE denoting L(x)

– x | y is an RE denoting L(x) ∪ L(y)

– xy is an RE denoting L(x)L(y)

– x* is an RE denoting L(x)*

Precedence is closure, then concatenation, then alternation

All left-associative

x | y* z is equivalent to x | ((y*) z)

Operations on Languages

Operation                                  Definition
Union of L and M, written L ∪ M            L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation of L and M, written LM       LM = { st | s ∈ L and t ∈ M }
Exponentiation of L, written L^i           L^i = {ε} if i = 0; L^i = L^(i-1) L if i > 0
Kleene closure of L, written L*            L* = ∪ (0 ≤ i) L^i
Positive closure of L, written L+          L+ = ∪ (1 ≤ i) L^i
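These operations can be tried out on small finite languages using Python sets of strings. The sketch below is illustrative only: the function names are hypothetical, and because L* and L+ are infinite in general, `closure` only builds the union up to a chosen bound.

```python
def union(L, M):
    """L ∪ M"""
    return L | M


def concat(L, M):
    """LM = { st | s in L and t in M }"""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^0 = {ε}, L^i = L^(i-1) L; ε is the empty Python string."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def closure(L, bound, start=0):
    """Union of L^i for start <= i <= bound.

    start=0 approximates the Kleene closure L*,
    start=1 approximates the positive closure L+.
    """
    result = set()
    for i in range(start, bound + 1):
        result |= power(L, i)
    return result


L = {"a", "b"}
M = {"0", "1"}
print(sorted(concat(L, M)))
print(sorted(closure(L, 2)))
```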

Examples of Regular Expressions

• All strings of 1s and 0s

( 0 | 1 )*

• All strings of 1s and 0s beginning with a 1

1 ( 0 | 1 )*

• All strings of 0s and 1s containing at least two consecutive 1s

( 0 | 1 )* 1 1 ( 0 | 1 )*

• All strings of alternating 0s and 1s

( ε | 1 ) ( 0 1 )* ( ε | 0 )
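The four expressions can be checked against sample strings with Python's `re` module. This is a sketch: the dictionary keys and the helper `matches` are made up for the example, ε is written as an empty alternative, and `re.fullmatch` is used so the whole string must belong to the language.

```python
import re

# The four example REs, transcribed into Python regex syntax.
EXAMPLES = {
    "all strings":        r"(0|1)*",
    "begins with 1":      r"1(0|1)*",
    "two consecutive 1s": r"(0|1)*11(0|1)*",
    "alternating":        r"(|1)(01)*(|0)",   # (|1) stands for (ε | 1)
}


def matches(name, text):
    """True if the entire string is in the language of the named RE."""
    return re.fullmatch(EXAMPLES[name], text) is not None


for text in ["", "0101", "1010", "0110"]:
    print(repr(text), {name: matches(name, text) for name in EXAMPLES})
```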

Extensions to Regular Expressions (a la JLex)

• x+ = x x* denotes L(x)+

• x? = x | ε denotes L(x) ∪ {ε}

• [abc] = a | b | c matches one character in the square brackets

• a-z = a | b | c | ... | z range

• [0-9a-z] = 0 | 1 | 2 | ... | 9 | a | b | c | ... | z

• [^abc] ^ means negation; matches any character except a, b or c

• . (dot) matches any character except the newline

• . = [^\n] \n means newline; dot is equivalent to [^\n]

• "[" matches left square bracket; metacharacters in double quotes become plain characters

• \[ matches left square bracket; a metacharacter after a backslash becomes a plain character

Regular Definitions

• We can define macros using regular expressions and use them in other regular expressions

Letter → (a|b|c| … |z|A|B|C| … |Z)

Digit → (0|1|2| … |9)

Identifier → Letter ( Letter | Digit )*

• Important: We should be able to order these definitions so that every definition uses only the definitions defined before it (i.e., no recursion)

• Regular definitions can be converted to basic regular expressions with macro expansion

• In JLex enclose definitions using curly braces

Identifier → {Letter} ( {Letter} | {Digit} )*
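Macro expansion of regular definitions can be sketched as a single pass over the ordered list. The code below is an assumption-laden illustration, not JLex itself: it borrows the curly-brace reference syntax, writes Letter and Digit as Python character classes, and uses the hypothetical names `DEFINITIONS` and `expand`.

```python
import re

# Ordered regular definitions; {Name} refers to an earlier definition.
DEFINITIONS = [
    ("Letter",     "[a-zA-Z]"),
    ("Digit",      "[0-9]"),
    ("Identifier", "{Letter}({Letter}|{Digit})*"),
]


def expand(definitions):
    """Expand every {Name} reference into a plain regular expression."""
    expanded = {}
    for name, body in definitions:
        def substitute(match):
            # Raises KeyError if the name was not defined earlier,
            # which enforces the "no recursion" ordering rule.
            return "(" + expanded[match.group(1)] + ")"
        expanded[name] = re.sub(r"\{(\w+)\}", substitute, body)
    return expanded


patterns = expand(DEFINITIONS)
print(patterns["Identifier"])
print(re.fullmatch(patterns["Identifier"], "count1") is not None)
```

Each reference is wrapped in parentheses during expansion so that operator precedence in the enclosing expression is preserved.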

Examples of Regular Expressions

Digit → (0|1|2| … |9)

Integer → (+|-)? (0| (1|2|3| … |9)(Digit *))

Decimal → Integer "." Digit *

Real → ( Integer | Decimal ) E (+|-)? Digit *

Complex → "(" Real , Real ")"

Numbers can get even more complicated.
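The Integer, Decimal, and Real definitions above can be transcribed into Python regular expressions to see which strings each one accepts. This is a sketch: the variable names and the `classify` helper are assumptions, and the patterns follow the definitions literally.

```python
import re

DIGIT = "[0-9]"
INTEGER = "[+-]?(0|[1-9]" + DIGIT + "*)"
DECIMAL = INTEGER + r"\." + DIGIT + "*"
REAL = "(" + INTEGER + "|" + DECIMAL + ")E[+-]?" + DIGIT + "*"

PATTERNS = [("Integer", INTEGER), ("Decimal", DECIMAL), ("Real", REAL)]


def classify(text):
    """Return the names of all definitions that match the whole string."""
    return [name for name, pattern in PATTERNS if re.fullmatch(pattern, text)]


for text in ["12", "3.14", "7.2E-2", "012", "1E"]:
    print(text, classify(text))
```

Taken literally, the exponent part `Digit *` may be empty, so a string such as `1E` is accepted as a Real.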

From Regular Expressions to Scanners

• Regular expressions are useful for specifying patterns that correspond to tokens

• However, we also want to construct programs which recognize these patterns

• How do we do it?

– Use finite automata!

Consider the problem of recognizing register names in an assembler

Register → R (0|1|2| … |9) (0|1|2| … |9)*

• Allows registers of arbitrary number

• Requires at least one digit

RE corresponds to a recognizer (or DFA)

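Example

Recognizer for Register: s0 is the initial state and s2 is the accepting state. Transitions: s0 → s1 on R; s1 → s2 on (0|1|2| … |9); s2 → s2 on (0|1|2| … |9)

With the error state se made explicit: s0 → se on (0|1|2| … |9); s1 → se on R; s2 → se on R; se → se on (R|0|1|2| … |9)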

Deterministic Finite Automata (DFA)

• A set of states S

– S = { s0 , s1 , s2 , se }

• A set of input symbols (an alphabet) Σ

– Σ = { R , 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }

• A transition function δ : S × Σ → S

– Maps (state, symbol) pairs to states

– δ = { ( s0 , R ) → s1 , ( s0 , 0-9 ) → se , ( s1 , 0-9 ) → s2 , ( s1 , R ) → se , ( s2 , 0-9 ) → s2 , ( s2 , R ) → se , ( se , R | 0-9 ) → se }

• A start state

– s0

• A set of final (or accepting) states

– Final = { s2 }

A DFA accepts a word x iff there exists a path in the transition graph from start state to a final state such that the edge labels along the path spell out x

DFA simulation

• Start in state s0 and follow transitions on each input character

• DFA accepts a word x iff x leaves it in a final state (s2 )

So,

• “R17” takes it through s0 , s1 , s2 and accepts

• “R” takes it through s0 , s1 and fails

• “A” takes it straight to se

• “R17R” takes it through s0 , s1 , s2 , se and rejects
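The four traces above can be reproduced with a short sketch. The transition function below encodes the Register recognizer; the names `delta`, `trace`, and `accepts` are hypothetical, and any character outside the alphabet is sent to the error state.

```python
DIGITS = "0123456789"
FINAL = {"s2"}


def delta(state, ch):
    """Transition function of the Register recognizer (ch is one character)."""
    if state == "s0" and ch == "R":
        return "s1"
    if state in ("s1", "s2") and ch in DIGITS:
        return "s2"
    return "se"            # every other (state, symbol) pair is an error


def trace(word):
    """Return the list of states visited while reading word, starting at s0."""
    states = ["s0"]
    for ch in word:
        states.append(delta(states[-1], ch))
    return states


def accepts(word):
    """The DFA accepts iff the word leaves it in a final state."""
    return trace(word)[-1] in FINAL


for word in ["R17", "R", "A", "R17R"]:
    print(word, trace(word), accepts(word))
```

The state s2 appears twice in the trace of "R17" because the second digit follows the loop from s2 back to s2.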

Example

Recognizer for Register: s0 is the initial state and s2 is the accepting state. Transitions: s0 → s1 on R; s1 → s2 on (0|1|2| … |9); s2 → s2 on (0|1|2| … |9)

Simulating a DFA

    state = s0;
    char = get_next_char();
    while (char != EOF) {
        state = δ(state, char);
        char = get_next_char();
    }
    if (state ∈ Final) report acceptance;
    else report failure;

We can store the transition table in a two-dimensional array:

δ     R     0,1,2,3,4,5,6,7,8,9    other
s0    s1    se                     se
s1    se    s2                     se
s2    se    s2                     se
se    se    se                     se

Final = { s2 }

We can also store the final states in an array

• The recognizer translates directly into code

• To change DFAs, just change the arrays

• Takes O(|x|) time for input string x
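A table-driven version of this simulation might look as follows in Python. It is a sketch: the state and column encodings (`S0`..`SE`, `R`, `DIGIT`, `OTHER`) and the helper `column` are assumptions chosen so that the table can be an ordinary list of lists.

```python
# States and input classes are small integers so they can index the table.
S0, S1, S2, SE = range(4)
R, DIGIT, OTHER = range(3)

# TABLE[state][column] is the next state (rows follow the table above).
TABLE = [
    # R   digit other
    [S1, SE, SE],   # s0
    [SE, S2, SE],   # s1
    [SE, S2, SE],   # s2
    [SE, SE, SE],   # se
]

# FINAL[state] is True for accepting states; only s2 accepts.
FINAL = [False, False, True, False]


def column(ch):
    """Map an input character to its column in the transition table."""
    if ch == "R":
        return R
    if "0" <= ch <= "9":
        return DIGIT
    return OTHER


def accepts(word):
    """Run the DFA over word; one table lookup per character, O(|x|) total."""
    state = S0
    for ch in word:
        state = TABLE[state][column(ch)]
    return FINAL[state]


print(accepts("R17"), accepts("R"), accepts("R17R"))
```

Recognizing a different language only requires replacing `TABLE`, `FINAL`, and `column`; the loop in `accepts` stays the same.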

Recognizing Longest Accepted Prefix

    accepted = false;
    current_string = ε;   // empty string
    state = s0;           // initial state
    if (state ∈ Final) {
        accepted_string = current_string;
        accepted = true;
    }
    char = get_next_char();
    while (char != EOF) {
        state = δ(state, char);
        current_string = current_string + char;
        if (state ∈ Final) {
            accepted_string = current_string;
            accepted = true;
        }
        char = get_next_char();
    }
    if accepted return accepted_string;
    else report error;

δ     R     0,1,2,3,4,5,6,7,8,9    other
s0    s1    se                     se
s1    se    s2                     se
s2    se    s2                     se
se    se    se                     se

Final = { s2 }

Given an input string, this simulation algorithm returns the longest accepted prefix

Given the input "R17R", this simulation algorithm returns "R17"
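The longest-accepted-prefix algorithm can be sketched in Python with the same transition table. The integer encodings and the names `column` and `longest_accepted_prefix` are assumptions for this example; like the pseudocode, the loop reads the whole input rather than stopping at the error state.

```python
S0, S1, S2, SE = range(4)
R, DIGIT, OTHER = range(3)

TABLE = [
    [S1, SE, SE],   # s0
    [SE, S2, SE],   # s1
    [SE, S2, SE],   # s2
    [SE, SE, SE],   # se
]
FINAL = [False, False, True, False]


def column(ch):
    """Map an input character to its column in the transition table."""
    if ch == "R":
        return R
    if "0" <= ch <= "9":
        return DIGIT
    return OTHER


def longest_accepted_prefix(text):
    """Return the longest prefix of text that the DFA accepts."""
    state = S0
    accepted = "" if FINAL[state] else None   # None: nothing accepted yet
    for i, ch in enumerate(text):
        state = TABLE[state][column(ch)]
        if FINAL[state]:
            accepted = text[: i + 1]          # remember the latest acceptance
    if accepted is None:
        raise ValueError("no prefix of the input is accepted")
    return accepted


print(longest_accepted_prefix("R17R"))
```

A scanner uses this idea to find where one token ends so that scanning can resume at the next character.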
